Matrices and Matrix Operations
- Matrices and Matrix Operations
- Matrix Addition and Multiplication
- Transpose, Inverse, Determinant, Trace, and Rank
- Special Matrices in ML
- Block and Partitioned Matrices
- Final Thoughts
In machine learning, whether you’re manipulating datasets, performing linear transformations in neural networks, or analyzing graph structures, matrices provide a compact and efficient way to represent and process information. In this post, we’ll explore the world of matrices and matrix operations through a problem-driven approach. Each section grounds the math in real ML use cases, walks through the concepts clearly, and wraps up with code and applications.
Matrices and Matrix Operations
Think of a matrix as a grid—a two-dimensional array of numbers. On the surface, it may just look like a neat way to organize data, but in machine learning, it does so much more. Matrices are the building blocks for operations like transforming input features, propagating signals through neural networks, and encoding relationships in graph data.
Suppose you’re representing three data points, each with two features. You might organize this as a matrix:
\[X = \begin{bmatrix} 1.2 & 3.4 \\ 2.1 & 0.7 \\ 4.0 & 1.5 \end{bmatrix}\]Each row is a sample, and each column is a feature. This basic structure appears everywhere from data preprocessing to forward passes in deep learning.
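If you're following along in code, this is just a 3×2 NumPy array (a quick sketch; the numbers are the toy values above):

```python
import numpy as np

# Three samples (rows), two features (columns)
X = np.array([
    [1.2, 3.4],
    [2.1, 0.7],
    [4.0, 1.5],
])

print(X.shape)   # (3, 2) -> (n_samples, n_features)
print(X[0])      # first sample: [1.2 3.4]
print(X[:, 1])   # second feature across all samples: [3.4 0.7 1.5]
```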
Matrix Addition and Multiplication
Let’s start with two toy matrices:
\[A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad B = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}\]Matrix Addition:
Two matrices can be added if they have the same dimensions. Element-wise addition gives:
\[A + B = \begin{bmatrix} a_{11}+b_{11} & a_{12}+b_{12} \\ a_{21}+b_{21} & a_{22}+b_{22} \end{bmatrix} = \begin{bmatrix} 6 & 8 \\ 10 & 12 \end{bmatrix}\]Properties:
- Commutative: \(A + B = B + A\)
- Associative: \((A + B) + C = A + (B + C)\)
- Additive identity: \(A + 0 = A\)
- Additive inverse: \(A + (-A) = 0\)
Matrix Multiplication:
Matrix multiplication is defined when the number of columns in \(A\) matches the number of rows in \(B\):
\[(AB)_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}\]For our example:
\[AB = \begin{bmatrix} (1\cdot5 + 2\cdot7) & (1\cdot6 + 2\cdot8) \\ (3\cdot5 + 4\cdot7) & (3\cdot6 + 4\cdot8) \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}\]Properties:
- Associative: \((AB)C = A(BC)\)
- Distributive: \(A(B + C) = AB + AC\)
- Not necessarily commutative: \(AB \neq BA\) in general
Data Science Applications:
- Linear transformations in neural networks: \(Z = WX + b\)
- Embeddings in NLP: word matrices transformed into contextual vectors
- Recommender Systems: low-rank approximations: \(\hat{R} = U \Sigma V^T\)
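Here's a short NumPy sketch that reproduces the toy example above and illustrates a linear-layer-style transformation; the weight matrix `W` and bias `b` are made-up numbers, not taken from any particular model:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A + B)                          # [[ 6  8] [10 12]]
print(A @ B)                          # [[19 22] [43 50]]
print(np.array_equal(A @ B, B @ A))   # False: multiplication is not commutative

# With samples as rows, a linear layer is often written Z = XW + b
# (the same operation as WX + b, just with transposed conventions).
X = np.array([[1.2, 3.4], [2.1, 0.7], [4.0, 1.5]])   # (3 samples, 2 features)
W = np.array([[0.5, -1.0, 2.0], [1.5, 0.3, -0.7]])   # (2 features, 3 hidden units)
b = np.array([0.1, 0.0, -0.2])                       # bias, broadcast over samples
Z = X @ W + b
print(Z.shape)   # (3, 3)
```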
Transpose, Inverse, Determinant, Trace, and Rank
Let’s dive deeper into the matrix:
\[A = \begin{bmatrix} 4 & 7 \\ 2 & 6 \end{bmatrix}\]Transpose
The transpose of \(A\) flips its rows and columns:
\[A^T = \begin{bmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{bmatrix} = \begin{bmatrix} 4 & 2 \\ 7 & 6 \end{bmatrix}\]Properties:
- \[(A^T)^T = A\]
- \[(A + B)^T = A^T + B^T\]
- \[(AB)^T = B^T A^T\]
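These identities are easy to sanity-check numerically; here's a minimal sketch using the toy matrices from the previous section:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(np.array_equal(A.T.T, A))              # (A^T)^T = A
print(np.array_equal((A + B).T, A.T + B.T))  # (A + B)^T = A^T + B^T
print(np.array_equal((A @ B).T, B.T @ A.T))  # (AB)^T = B^T A^T
```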
Cofactor Matrix
The cofactor matrix of a square matrix is formed by replacing each element with its cofactor. The cofactor of an element \(a_{ij}\) is calculated using:
\[C_{ij} = (-1)^{i+j} \cdot M_{ij}\]where:
- \(M_{ij}\) is the minor of \(a_{ij}\), i.e., the determinant of the submatrix obtained by removing the \(i\)-th row and \(j\)-th column from the original matrix.
- \((-1)^{i+j}\) gives the correct sign based on the position of the element (row \(i\), column \(j\)).
Steps to Find the Cofactor Matrix
- For each element \(a_{ij}\) in the matrix:
- Remove the \(i\)-th row and \(j\)-th column to form a smaller submatrix.
- Calculate the determinant of this submatrix to get the minor \(M_{ij}\).
- Multiply the minor by \((-1)^{i+j}\) to get the cofactor \(C_{ij}\).
- The cofactor matrix is the matrix where each entry is the corresponding cofactor \(C_{ij}\).
Example
Given a \(3 \times 3\) matrix: \(A = \begin{bmatrix} 1 & 0 & -2 \\ 3 & -1 & 2 \\ 4 & 5 & 6 \end{bmatrix}\)
To find the cofactor of the element \(0\) (in the first row, second column):
- Remove the first row and second column: \(\begin{bmatrix} 3 & 2 \\ 4 & 6 \end{bmatrix}\)
- Compute the minor: \(M_{12} = (3 \times 6) - (4 \times 2) = 18 - 8 = 10\)
- Apply the sign: \(C_{12} = (-1)^{1+2} \times 10 = -10\)
Repeat this process for every element to construct the full cofactor matrix.
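The steps above translate directly into a short function. The sketch below recomputes a determinant for every entry, which is fine for small matrices but not the most efficient approach:

```python
import numpy as np

def cofactor_matrix(A: np.ndarray) -> np.ndarray:
    """Return the cofactor matrix of a square matrix A."""
    n = A.shape[0]
    C = np.zeros_like(A, dtype=float)
    for i in range(n):
        for j in range(n):
            # Minor: delete row i and column j, then take the determinant
            minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
            C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return C

A = np.array([[1, 0, -2],
              [3, -1, 2],
              [4, 5, 6]])
C = cofactor_matrix(A)
print(np.round(C[0, 1], 6))   # -10.0, matching the worked example above
```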
Determinant
The determinant of a 2x2 matrix \(A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}\) is:
\[\det(A) = ad - bc\]The general formula (Laplace expansion) for an \(n \times n\) matrix is:
\[\det(A) = \sum_{j=1}^{n} (-1)^{1+j} a_{1j} M_{1j}\]Where \(M_{1j}\) is the determinant of the \((n-1) \times (n-1)\) minor matrix obtained by removing the first row and \(j\)-th column.
Properties:
- \[\det(AB) = \det(A) \cdot \det(B)\]
- \[\det(A^T) = \det(A)\]
- \(\det(cA) = c^n \cdot \det(A)\) for scalar \(c\)
- \[\det(I) = 1\]
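To tie the Laplace expansion to code, here's a small recursive sketch checked against `np.linalg.det` (cofactor expansion is exponential in \(n\), so it's for illustration only):

```python
import numpy as np

def det_laplace(A: np.ndarray) -> float:
    """Determinant via cofactor expansion along the first row (illustrative only)."""
    n = A.shape[0]
    if n == 1:
        return float(A[0, 0])
    total = 0.0
    for j in range(n):
        minor = np.delete(A[1:, :], j, axis=1)   # drop first row and column j
        total += (-1) ** j * A[0, j] * det_laplace(minor)
    return total

A = np.array([[4.0, 7.0], [2.0, 6.0]])
print(det_laplace(A), np.linalg.det(A))   # both ≈ 10.0
```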
Inverse of a Matrix
The inverse of a matrix is the key to “undoing” a linear transformation. For a square matrix \(A\), its inverse \(A^{-1}\) satisfies:
\[A A^{-1} = A^{-1} A = I\]However, not every matrix has an inverse. Only non-singular square matrices—those with \(\det(A) \ne 0\)—can be inverted.
Inverse of a 2×2 Matrix
Let’s begin with the simplest case.
Let
\(A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}\)
Step 1: Check Invertibility
Compute the determinant:
\[\det(A) = ad - bc\]- If \(\det(A) = 0\) → matrix is singular → no inverse exists.
- If \(\det(A) \ne 0\) → proceed to next step.
Step 2: Use the Inverse Formula
\[A^{-1} = \frac{1}{\det(A)} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}\]Example
Given
\(A = \begin{bmatrix} 4 & 7 \\ 2 & 6 \end{bmatrix}\)
Calculate the determinant:
\[\det(A) = 4 \cdot 6 - 7 \cdot 2 = 24 - 14 = 10\]Hence, the inverse is:
\[A^{-1} = \frac{1}{10} \begin{bmatrix} 6 & -7 \\ -2 & 4 \end{bmatrix}\]Inverse of a 3×3 Matrix
The procedure becomes more involved but follows the same principles.
Let \(A\) be a 3×3 matrix.
Step 1: Compute Determinant
Use cofactor (Laplace) expansion, typically along the first row:
\[\begin{aligned} \det(A) &= a_{11} \begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} - a_{12} \begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix} + a_{13} \begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix} \end{aligned}\]Step 2: Compute Cofactor Matrix
Each entry:
\[C_{ij} = (-1)^{i+j} \cdot \det(\text{minor of } a_{ij})\]Step 3: Compute Adjugate Matrix
Transpose the cofactor matrix:
\[\mathrm{adj}(A) = C^\top\]Step 4: Divide by Determinant
The inverse is:
\[A^{-1} = \frac{1}{\det(A)} \cdot \mathrm{adj}(A)\]Complete Example (3×3)
Let
\(A = \begin{bmatrix} 7 & 2 & 1 \\ 0 & 3 & -1 \\ -3 & 4 & -2 \end{bmatrix}\)
1. Determinant
\[\begin{aligned} \det(A) &= 7 \cdot \begin{vmatrix} 3 & -1 \\ 4 & -2 \end{vmatrix} \,-\, 2 \cdot \begin{vmatrix} 0 & -1 \\ -3 & -2 \end{vmatrix} \,+\, 1 \cdot \begin{vmatrix} 0 & 3 \\ -3 & 4 \end{vmatrix} \\ &= 7(-6 + 4) \,-\, 2(0 - (-3)) \,+\, 1(0 + 9) \\ &= 7(-2) \,-\, 2(-3) \,+\, 9 \\ &= -14 + 6 + 9 \\ &= 1 \end{aligned}\]Hence, invertible.
2. Cofactor Matrix
\[C = \begin{bmatrix} -2 & 3 & 9 \\ 8 & -11 & -34 \\ -5 & 7 & 21 \end{bmatrix}\]3. Adjugate
\[\mathrm{adj}(A) = C^\top = \begin{bmatrix} -2 & 8 & -5 \\ 3 & -11 & 7 \\ 9 & -34 & 21 \end{bmatrix}\]4. Final Inverse
\[A^{-1} = \frac{1}{1} \begin{bmatrix} -2 & 8 & -5 \\ 3 & -11 & 7 \\ 9 & -34 & 21 \end{bmatrix}\]Sanity Checks and Verification
- Determinant Test: \(\det(A) = 0\) → matrix is not invertible.
- Multiplication Test: \(A \cdot A^{-1} = I\) must hold. Otherwise, recheck the calculation.
- Inverse Property Tests:
- \[(A^{-1})^{-1} = A\]
- \[(A^T)^{-1} = (A^{-1})^T\]
- \[\det(A^{-1}) = \frac{1}{\det(A)}\]
- Special Cases: For triangular matrices, \(A^{-1}\) will also be triangular. Diagonal entries of the inverse are reciprocals of the original diagonal entries (if all are non-zero).
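All of these checks are one-liners in NumPy. Here's a minimal verification of the 3×3 example above:

```python
import numpy as np

A = np.array([[7, 2, 1],
              [0, 3, -1],
              [-3, 4, -2]])

A_inv = np.linalg.inv(A)
print(np.round(A_inv))                       # matches the adjugate-based result above
print(np.allclose(A @ A_inv, np.eye(3)))     # multiplication test: True
print(np.isclose(np.linalg.det(A_inv), 1 / np.linalg.det(A)))   # det(A^{-1}) = 1/det(A)
```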
Summary of Properties
- A matrix is invertible iff \(\det(A) \ne 0\)
- \[(AB)^{-1} = B^{-1} A^{-1}\]
- \[(A^T)^{-1} = (A^{-1})^T\]
- \[(A^{-1})^{-1} = A\]
Pseudo-Inverse of a Matrix
Not all matrices are square or invertible—but in machine learning, we still need to “solve” equations like \(Ax = b\), even when \(A\) is rectangular or singular.
This is where the Moore–Penrose pseudo-inverse comes in.
Given a matrix \(A \in \mathbb{R}^{m \times n}\), its pseudo-inverse, denoted \(A^+ \in \mathbb{R}^{n \times m}\), is the minimum-norm solution to:
\[A^+ = \arg\min_{X} \|AX - I\|_F\]It is the unique matrix satisfying four key properties:
- \[A A^+ A = A\]
- \[A^+ A A^+ = A^+\]
- \[(A A^+)^T = A A^+\]
- \[(A^+ A)^T = A^+ A\]
How to Compute It
There are multiple ways to compute \(A^+\), but the most robust is via Singular Value Decomposition (SVD):
Let
\(A = U \Sigma V^T\)
Then,
\(A^+ = V \Sigma^+ U^T\)
Where:
- \(\Sigma^+\) is formed by taking the reciprocal of the non-zero singular values in \(\Sigma\) and transposing the result.
We’ll revisit SVD (Singular Value Decomposition) in more depth later when we explore eigenvalues, eigenvectors, and decompositions like PCA and SVD-based dimensionality reduction. For now, just note that it’s a powerful tool that lets us compute pseudo-inverses even for non-square or rank-deficient matrices.
Example: Rectangular Matrix
Suppose:
\[A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}\]We want to solve \(Ax = b\) in the least squares sense.
Because \(A\) has full column rank (its two columns are linearly independent), the pseudo-inverse reduces to:
\[A^+ = (A^T A)^{-1} A^T\]This gives a matrix that maps a target vector \(b\) to the best-fitting \(x\) minimizing:
\[\|Ax - b\|^2\]Application in Machine Learning
The pseudo-inverse appears frequently when:
- Solving the normal equations in linear regression: \(\hat{\beta} = (X^T X)^{-1} X^T y = X^+ y\) (the closed form with \((X^T X)^{-1}\) requires \(X\) to have full column rank; \(X^+ y\) works even when it does not)
- Working with non-square matrices in overdetermined or underdetermined systems
- Implementing closed-form least squares solutions
- Dealing with non-invertible covariance matrices in statistics
- Backpropagation in networks where you need a generalized inverse
Where the true inverse exists only for well-behaved square matrices, the pseudo-inverse is the general-purpose tool that works when nothing else does. It allows us to perform meaningful matrix “division” even in the messy, high-dimensional settings common in ML and data science.
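Here's a short sketch showing three equivalent routes to \(A^+\) for the rectangular example above: the normal-equations formula (valid here because \(A\) has full column rank), the SVD construction, and NumPy's built-in `np.linalg.pinv`:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

# 1) Normal equations (requires A to have full column rank)
A_pinv_ne = np.linalg.inv(A.T @ A) @ A.T

# 2) Via the SVD: A = U Σ V^T  =>  A^+ = V Σ^+ U^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv_svd = Vt.T @ np.diag(1.0 / s) @ U.T

# 3) Built-in pseudo-inverse
A_pinv = np.linalg.pinv(A)

print(np.allclose(A_pinv_ne, A_pinv), np.allclose(A_pinv_svd, A_pinv))   # True True

# Least-squares solution of Ax = b
x = A_pinv @ b
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True
```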
Trace
The trace of a square matrix is the sum of its diagonal elements:
\[\text{tr}(A) = \sum_{i=1}^{n} A_{ii}\]Properties:
- \[\text{tr}(A + B) = \text{tr}(A) + \text{tr}(B)\]
- \[\text{tr}(cA) = c \cdot \text{tr}(A)\]
- \[\text{tr}(AB) = \text{tr}(BA)\]
- \[\text{tr}(A^T) = \text{tr}(A)\]
Rank
The rank is the number of linearly independent rows (equivalently, columns). It can be found by row reduction or from the SVD, where it equals the number of non-zero singular values.
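Both quantities are one-liners in NumPy; here's a quick sketch using the matrix from this section plus a deliberately rank-deficient example:

```python
import numpy as np

A = np.array([[4, 7], [2, 6]])
print(np.trace(A))                # 10
print(np.linalg.matrix_rank(A))   # 2 (full rank)

# A rank-deficient matrix: the second row is twice the first
B = np.array([[1, 2], [2, 4]])
print(np.linalg.matrix_rank(B))   # 1
```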
Data Science Applications:
- Regression: Normal equations: \(\hat{\beta} = (X^TX)^{-1}X^Ty\)
- PCA: Eigen-decomposition of covariance matrices relies on trace and rank
- Multicollinearity: Detected via rank deficiency
- Normalizing Flows: Require computing \(\log\lvert\det(J)\rvert\), where \(J\) is the Jacobian of the transformation
Special Matrices in ML
Identity Matrix
\[I_n = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}\]Properties:
- \(AI = IA = A\) for any matrix \(A\) of compatible size
- \[I^T = I\]
- \[I^{-1} = I\]
Diagonal Matrix
\[D = \text{diag}(\lambda_1, \lambda_2, ..., \lambda_n) = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix}\]Properties:
- \(D^T = D\) (symmetric)
- \(D^{-1} = \text{diag}(1/\lambda_1, ..., 1/\lambda_n)\) if all \(\lambda_i \neq 0\)
- \[\det(D) = \prod_{i=1}^n \lambda_i\]
- \[\text{tr}(D) = \sum_{i=1}^n \lambda_i\]
Symmetric Matrix
\[A = A^T\]Properties:
- All eigenvalues of a real symmetric matrix are real
- Can be orthogonally diagonalized: \(A = Q \Lambda Q^T\) where \(Q\) is orthogonal and \(\Lambda\) is diagonal
Orthogonal Matrix
\[Q^T Q = QQ^T = I\]Properties:
- \[Q^{-1} = Q^T\]
- \[\det(Q) = \pm 1\]
- Preserves vector norms: \(\|Qx\| = \|x\|\)
- Preserves inner products: \(\langle Qx, Qy \rangle = \langle x, y \rangle\)
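Here's a compact numerical check of these special-matrix properties. The orthogonal matrix is obtained from a QR factorization of a random matrix, which is just one convenient way to get a concrete \(Q\):

```python
import numpy as np

rng = np.random.default_rng(0)
I = np.eye(3)

# Diagonal: inverse is the diagonal of reciprocals, determinant is the product
D = np.diag([2.0, 5.0, 0.5])
print(np.allclose(np.linalg.inv(D), np.diag(1 / np.array([2.0, 5.0, 0.5]))))   # True
print(np.isclose(np.linalg.det(D), 2.0 * 5.0 * 0.5))                           # True

# Symmetric: real eigenvalues and orthogonal diagonalization A = Q Λ Q^T
S = rng.standard_normal((3, 3))
S = (S + S.T) / 2
eigvals, Q_s = np.linalg.eigh(S)
print(np.allclose(Q_s @ np.diag(eigvals) @ Q_s.T, S))   # True

# Orthogonal: Q^T Q = I and norms are preserved
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
x = rng.standard_normal(3)
print(np.allclose(Q.T @ Q, I), np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))
```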
Positive Definite and Semi-Definite Matrices
Positive definite and positive semi-definite matrices often hide behind the scenes in ML models, but they’re doing critical work—especially in optimization, regularization, and probabilistic modeling.
Definitions
Let \(A \in \mathbb{R}^{n \times n}\) be a symmetric matrix.
- Positive Definite (PD): \(\forall x \in \mathbb{R}^n, \; x \ne 0 \Rightarrow x^T A x > 0\)
- Positive Semi-Definite (PSD): \(\forall x \in \mathbb{R}^n, \; x^T A x \ge 0\)
That is, a matrix is PD if, for every nonzero vector, the transformed result keeps a strictly positive alignment (projection) with the original vector. If some vectors may be flattened to a zero projection, but none are reversed, the matrix is PSD.
When we say a matrix is positive definite (PD), we mean the following:
For any non-zero vector \(x\), if you transform it using the matrix \(A\) (by computing \(Ax\)) and then measure how aligned the result is with the original \(x\), that alignment is always positive.
Mathematically, this alignment is given by:
\[x^T A x\]If this value is strictly positive for all \(x \ne 0\), then \(A\) is positive definite.
Now, if that value is allowed to be zero (but never negative), the matrix is called positive semi-definite (PSD).
So you can think of it like this:
- Positive Definite (PD) → always pushes every vector in a direction that’s at least somewhat aligned with itself (never flat, never reversed)
- Positive Semi-Definite (PSD) → may flatten some vectors completely (i.e., project them to 0), but still never bends anything backwards
Visualize
We want to understand what it means for a matrix to be:
- Positive Definite (PD): \(x^T A x > 0\) for every non-zero vector \(x\)
- Positive Semi-Definite (PSD): \(x^T A x \ge 0\) for all \(x\), but it might be zero for some
But those formulas can feel abstract. So here’s the core idea:
Pick a bunch of directions (vectors \(x\)), plug them into \(x^T A x\), and ask:
“How much is this matrix ‘amplifying’ or ‘flattening’ that direction?”
What We Plotted
We sampled 200 points that lie evenly around the unit circle—meaning we’re probing every possible direction in 2D with a vector of length 1.
For each direction vector \(x\), we compute:
\[x^T A x\]This gives us a number telling us how the matrix \(A\) stretches or flattens that direction.
Then, we scale each point in that direction by its computed value. So in the plot, a point farther from the origin means \(x^T A x\) was larger—it shows stronger stretching. If it’s close to the origin, the matrix barely affected that direction.
The color of each dot also encodes the magnitude of \(x^T A x\).
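Here's a minimal sketch of how such a plot can be generated with NumPy and Matplotlib; the styling is illustrative, not the exact settings behind the original figures:

```python
import numpy as np
import matplotlib.pyplot as plt

def quadratic_form_plot(A, ax, title):
    """Scale unit-circle directions by x^T A x and color by its magnitude."""
    theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
    X = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # 200 unit vectors
    q = np.einsum("ni,ij,nj->n", X, A, X)                  # x^T A x per direction
    pts = X * q[:, None]                                   # stretch each direction by q
    ax.scatter(pts[:, 0], pts[:, 1], c=q, cmap="viridis", s=15)
    ax.set_title(title)
    ax.set_aspect("equal")

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
quadratic_form_plot(np.array([[2.0, 0.0], [0.0, 1.0]]), axes[0], "Positive definite")
quadratic_form_plot(np.array([[1.0, -1.0], [-1.0, 1.0]]), axes[1], "Positive semi-definite")
plt.show()
```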
Plot: Positive Definite Matrix
Here, the matrix is:
\[A = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}\]This matrix stretches along the \(x\)-axis more than the \(y\)-axis, but it still positively “responds” to every direction.
What you see:
- All the dots stay away from the origin—none collapse
- The pattern is elliptical, showing stronger stretching in some directions (especially along the x-axis)
- Every direction yields a strictly positive \(x^T A x\)
Why this matters:
A positive definite matrix always increases the value of \(x^T A x\) for any non-zero \(x\). You never lose information, never flatten. That’s why the shape in the plot is smooth and full—no gaps, no collapses.
Plot: Positive Semi-Definite Matrix
This time, the matrix is:
\[A = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}\]It’s symmetric, but not invertible, and has rank 1. That means it only responds to vectors lying in a particular direction—and ignores others completely.
What you see:
- Most points are stretched away from the origin, just like before
- But along specific directions (e.g., the diagonal \([1, 1]\)), the dots shrink back to the center—they completely vanish
Interpretation:
In those collapsed directions, \(x^T A x = 0\). The matrix doesn’t “see” those vectors at all—it acts as if they don’t exist. That’s the essence of positive semi-definiteness: some directions may be ignored (zero energy), but none are flipped or made negative.
Properties
- All eigenvalues of a PD matrix are strictly positive
- All eigenvalues of a PSD matrix are non-negative
- A symmetric matrix is:
- PD if and only if all eigenvalues > 0
- PSD if and only if all eigenvalues ≥ 0
- Covariance matrices are always PSD, and PD if the data has full rank
- Every PD matrix is invertible, but not every PSD matrix is
Why It Matters in ML
- Covariance matrices (e.g., in Gaussian models, PCA, multivariate normal) must be PSD
- Kernel matrices in SVMs and Gaussian processes must be PSD (they define valid inner products)
- In convex optimization, if your loss function’s Hessian is PD, the function is strictly convex—and you’re guaranteed a unique minimum
- Ridge regression uses: \((X^T X + \lambda I)^{-1}\) This term is guaranteed to be PD for \(\lambda > 0\), even if \(X^T X\) is only PSD
- In PCA, we diagonalize the covariance matrix—its PSD nature ensures real, non-negative eigenvalues for interpretable principal components
Quick Test (for symmetric matrices)
Let \(A\) be symmetric. Then:
- If all the leading principal minors (the determinants of the top-left \(k \times k\) submatrices for \(k = 1, 2, \dots, n\)) are positive, then \(A\) is positive definite. This is known as Sylvester’s Criterion, and it gives a quick way to test positive definiteness without computing all the eigenvalues.
- If all eigenvalues of \(A\) are ≥ 0 → \(A\) is PSD
- Try computing \(x^T A x\) for random vectors \(x\); if it is always positive, the matrix is likely PD (see the sketch below)
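Here's a small sketch of these checks in code (Sylvester's criterion via leading principal minors, the eigenvalue test, and random probing), using the two matrices from the plots above:

```python
import numpy as np

def is_positive_definite(A: np.ndarray, tol: float = 1e-10) -> bool:
    """Check a symmetric matrix via Sylvester's criterion (leading principal minors)."""
    n = A.shape[0]
    return all(np.linalg.det(A[:k, :k]) > tol for k in range(1, n + 1))

A_pd = np.array([[2.0, 0.0], [0.0, 1.0]])
A_psd = np.array([[1.0, -1.0], [-1.0, 1.0]])

print(is_positive_definite(A_pd), is_positive_definite(A_psd))   # True False

# Eigenvalue test (eigvalsh is the right routine for symmetric matrices)
print(np.all(np.linalg.eigvalsh(A_pd) > 0), np.all(np.linalg.eigvalsh(A_psd) >= -1e-10))

# Random probing of x^T A x
rng = np.random.default_rng(0)
xs = rng.standard_normal((1000, 2))
print((np.einsum("ni,ij,nj->n", xs, A_pd, xs) > 0).all())   # True for the PD matrix
```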
Block and Partitioned Matrices
Sometimes, a matrix is too big or too complex to work with all at once. Instead of treating it as one big chunk, we break it down into smaller, more manageable pieces—called blocks.
Think of it like splitting a large spreadsheet into smaller tables that still fit together. This helps with both understanding and computation.
A Simple Example
Let’s take a \(4 \times 4\) matrix:
\[A = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \\ 13 & 14 & 15 & 16 \end{bmatrix}\]We can divide this into four \(2 \times 2\) blocks like this:
\[A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}\]Where:
- \[A_{11} = \begin{bmatrix} 1 & 2 \\ 5 & 6 \end{bmatrix}\]
- \[A_{12} = \begin{bmatrix} 3 & 4 \\ 7 & 8 \end{bmatrix}\]
- \[A_{21} = \begin{bmatrix} 9 & 10 \\ 13 & 14 \end{bmatrix}\]
- \[A_{22} = \begin{bmatrix} 11 & 12 \\ 15 & 16 \end{bmatrix}\]
Each block is just a smaller matrix, but together they still represent the original structure.
Why Use Blocks?
Block matrices are useful when:
- You’re dealing with structured data (e.g., images, time-series, graphs)
- You want to optimize computations using smaller chunks
- Your matrix represents multiple modalities or features (e.g., text + image)
- You’re distributing operations across multiple processors or nodes
Block Matrix Multiplication
Let’s say you have two block-partitioned matrices:
\[A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \quad B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}\]If the inner dimensions match (i.e., each block multiplication is defined), you can compute the product blockwise:
\[AB = \begin{bmatrix} A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\ A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22} \end{bmatrix}\]So instead of multiplying the full matrices all at once, you’re multiplying and adding smaller blocks—this is especially useful in large-scale computing.
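Here's a short NumPy sketch confirming that blockwise multiplication matches ordinary multiplication for the 4×4 matrix above; the second matrix `B` is just made-up consecutive integers for illustration:

```python
import numpy as np

A = np.arange(1, 17).reshape(4, 4)    # the 4x4 matrix from the example
B = np.arange(17, 33).reshape(4, 4)   # an arbitrary second matrix

# Partition each matrix into 2x2 blocks
A11, A12, A21, A22 = A[:2, :2], A[:2, 2:], A[2:, :2], A[2:, 2:]
B11, B12, B21, B22 = B[:2, :2], B[:2, 2:], B[2:, :2], B[2:, 2:]

# Blockwise product, assembled back into a full matrix
top = np.hstack([A11 @ B11 + A12 @ B21, A11 @ B12 + A12 @ B22])
bottom = np.hstack([A21 @ B11 + A22 @ B21, A21 @ B12 + A22 @ B22])
AB_blocks = np.vstack([top, bottom])

print(np.array_equal(AB_blocks, A @ B))   # True
```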
Applications
Block matrices show up in more places than you might expect:
- Distributed computing: In big-data systems like Spark or Dask, matrices are often partitioned into blocks for parallel processing. Each node handles a piece.
- Mini-batch learning: In neural networks, training often happens on batches. Each mini-batch can be viewed as a block of the full data matrix.
- Multi-modal learning: Say you’re combining image features and text features—each type can be represented as a block in a larger feature matrix.
- Graph learning: Adjacency matrices of graphs can be block-partitioned to reflect communities or clusters. This is helpful in spectral clustering or modularity detection.
Block matrices are one of those “simple once you see it” ideas. They don’t just make computation faster—they also mirror the real-world structure in many ML tasks. If your data or model has parts, blocks are a natural way to think and compute.
Final Thoughts
We started with a humble matrix and ended up unlocking a toolkit that powers nearly every machine learning algorithm today. Whether you’re transforming features, training deep nets, or decomposing graphs—matrices are behind the scenes, doing the heavy lifting.