Matrix Calculus for Data Science
- Scalar-by-Vector Derivatives: \(\frac{\partial f}{\partial \mathbf{x}}\)
- Scalar-by-Matrix Derivatives: \(\frac{\partial L}{\partial W}\)
- Vector-by-Vector Derivatives: \(\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\)
- Matrix-by-Matrix Derivatives: \(\frac{\partial F}{\partial X}\)
- Chain Rule in High Dimensions
- Vectorization Techniques
- Trace and Determinant Derivatives
- Derivatives of Matrix Inverses
- Derivatives of Eigenvalues and Eigenvectors
- Derivatives of Matrix Norms
- Derivatives of Structured Matrices
- Generalized and Pseudo-Inverses
- Matrix Decompositions and Their Derivatives
- Multivariate Distributions and Matrix Calculus
- Application 1: Backpropagation with Matrix Calculus
- Application 2: Closed-form Solutions in Linear Models
- Application 3: Convolutional Neural Networks (CNNs)
- Wrapping Up
If you’ve ever trained a neural network, optimized a loss function, or simply tried to make your model a little less wrong, you’ve likely relied on gradients flowing through vectors, matrices, and tensors. These gradients were computed — carefully, systematically — using the machinery of matrix calculus.
This blog is about that machinery. Not the abstract, hand-wavy kind. But the real thing — the rules, tricks, and identities that power everything from a simple linear regression to backpropagation in massive language models.
In undergrad calculus, we learn to differentiate functions like \(f(x) = x^2 + 3x\). Neat and useful — for scalar variables. But machine learning isn’t scalar land. Models work with:
- Vectors of weights: \(\mathbf{w} \in \mathbb{R}^n\)
- Data matrices: \(X \in \mathbb{R}^{n \times p}\)
- Outputs: sometimes scalars, often vectors, occasionally entire matrices
And when we want to compute how a scalar-valued loss changes with respect to all of these — we need vector derivatives, Jacobian matrices, matrix-by-matrix derivatives, and more.
They’re the basis of:
- Training algorithms (SGD, Adam, Newton’s method)
- Regularization (L2, Frobenius norm)
- Dimensionality reduction (PCA, SVD)
- Deep learning (backprop, chain rule in matrix form)
- Probabilistic models (matrix normal distributions, determinant derivatives)
Even the automatic differentiation tools in PyTorch or TensorFlow rely on these rules under the hood. When things break, or when you’re writing something novel (a custom loss, a new model class), you need to know how the derivatives actually work.
We’re going to build the entire toolbox — step by step.
Scalar-by-Vector Derivatives: \(\frac{\partial f}{\partial \mathbf{x}}\)
Let’s start simple.
Suppose you’ve got a scalar-valued function — say, a loss function — that depends on multiple variables:
\(f: \mathbb{R}^n \rightarrow \mathbb{R}\)
This means: you feed in a vector \(\mathbf{x} \in \mathbb{R}^n\) and out comes a single number. In machine learning, this is incredibly common — almost every cost function is of this type. You feed in a vector of weights, and get a scalar loss.
Our job?
To figure out: how does this scalar change as we wiggle each coordinate of \(\mathbf{x}\)?
That’s where gradients and directional derivatives come in.
From Partial Derivatives to the Gradient
Let’s say: \(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad f(\mathbf{x}) \in \mathbb{R}\)
Then the gradient of \(f\) with respect to \(\mathbf{x}\) is just a vector of partial derivatives:
\[\nabla_{\mathbf{x}} f = \frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \in \mathbb{R}^n\]This vector tells us how much and in what direction the function \(f\) changes as each component of \(\mathbf{x}\) changes.
Think of it like this: if you’re hiking on a scalar-valued elevation map (like a hill), and your coordinates are given by \(\mathbf{x}\), then the gradient vector points uphill. Its direction gives the steepest ascent, and its magnitude gives the steepness.
A Simple Example
Let’s take a warm-up example:
Function:
\[f(\mathbf{x}) = \mathbf{x}^\top \mathbf{x} = \sum_{i=1}^n x_i^2\]That’s just the squared L2 norm, i.e. \(f(\mathbf{x}) = \|\mathbf{x}\|^2\).
Gradient:
Let’s compute: \(\frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial}{\partial x_1}(x_1^2 + \dots + x_n^2) \\ \vdots \\ \frac{\partial}{\partial x_n}(x_1^2 + \dots + x_n^2) \end{bmatrix} = \begin{bmatrix} 2x_1 \\ 2x_2 \\ \vdots \\ 2x_n \end{bmatrix} = 2\mathbf{x}\)
Interpretation: The gradient of the squared norm points directly away from the origin, scaled by 2. That makes sense — the farther you are from the origin, the faster the norm grows.
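To make this concrete, here's a minimal NumPy sketch (function and variable names are my own, not from any library) that compares the analytic gradient \(2\mathbf{x}\) against a central finite-difference approximation:

```python
import numpy as np

def f(x):
    # squared L2 norm: f(x) = x^T x
    return float(x @ x)

def numerical_gradient(f, x, eps=1e-6):
    # central finite differences, one coordinate at a time
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(f, x))  # ~ [ 2. -4.  6.]
print(2 * x)                     # analytic gradient: 2x
```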
Directional Derivative — the Gradient in Disguise
Given a scalar-valued function
\(f : \mathbb{R}^n \rightarrow \mathbb{R}\)
and a direction vector
\(\mathbf{v} \in \mathbb{R}^n,\)
the directional derivative of \(f\) at point \(\mathbf{x}\) in the direction \(\mathbf{v}\) is defined as:
\[D_{\mathbf{v}} f(\mathbf{x}) := \lim_{h \to 0} \frac{f(\mathbf{x} + h\,\mathbf{v}) - f(\mathbf{x})}{h}\]If you prefer unit directions, set
\(\mathbf{u} = \frac{\mathbf{v}}{\lVert\mathbf{v}\rVert}\)
and then
\(D_{\mathbf{u}} f\)
measures the rate of change per unit distance traveled.
Why the Gradient Pops Out in Directional Derivatives
Suppose you’re at a point \(\mathbf{x}\) in \(n\)-dimensional space, and you’re looking at a scalar function \(f(\mathbf{x})\). You want to ask:
“If I take a small step in some arbitrary direction \(\mathbf{v}\), how much will \(f\) change?”
This is not a partial derivative — those only describe change along one coordinate axis. But in real-world problems, especially in ML optimization, we often care about arbitrary directions. So we turn to the directional derivative.
Let’s build up to the formula and see how the gradient quietly reveals itself.
Step 1: Start from the Definition
We want the instantaneous rate of change of \(f\) at \(\mathbf{x}\) in the direction \(\mathbf{v}\):
\[D_{\mathbf{v}} f(\mathbf{x}) := \lim_{h \to 0} \frac{f(\mathbf{x} + h\mathbf{v}) - f(\mathbf{x})}{h}\]Step 2: Use Taylor Expansion
If \(f\) is differentiable, then for small \(h\), we can expand:
\[f(\mathbf{x} + h\mathbf{v}) \approx f(\mathbf{x}) + h\, \nabla f(\mathbf{x})^\top \mathbf{v} + o(h)\]where:
- \(h\mathbf{v}\) is your step,
- \(\nabla f(\mathbf{x})^\top \mathbf{v}\) is the dot product between gradient and direction,
- and \(o(h)\) vanishes faster than \(h\) as \(h \to 0\).
Now plug this into the definition:
\[\begin{aligned} D_{\mathbf{v}} f(\mathbf{x}) &= \lim_{h \to 0} \frac{f(\mathbf{x} + h\mathbf{v}) - f(\mathbf{x})}{h} \\\\ &= \lim_{h \to 0} \frac{f(\mathbf{x}) + h\,\nabla f(\mathbf{x})^\top \mathbf{v} + o(h) - f(\mathbf{x})}{h} \\\\ &= \lim_{h \to 0} \left( \nabla f(\mathbf{x})^\top \mathbf{v} + \frac{o(h)}{h} \right) \\\\ &= \nabla f(\mathbf{x})^\top \mathbf{v} \end{aligned}\]That’s the result:
the directional derivative is just a dot product between the gradient and the direction vector.
Step 3: Intuition — Why a Dot Product?
Recall:
\[\nabla f^\top \mathbf{v} = \sum_{i=1}^n \frac{\partial f}{\partial x_i} \cdot v_i\]So the gradient acts as a weighted sum of sensitivities: how much each coordinate direction contributes to change in \(f\), scaled by how much you’re moving in that direction.
- If \(\mathbf{v}\) points in the same direction as \(\nabla f\) → large positive dot product → steepest ascent.
- If \(\mathbf{v}\) is orthogonal to \(\nabla f\) → dot product is zero → no change.
- If \(\mathbf{v}\) points opposite to \(\nabla f\) → negative dot product → steepest descent.
This is the geometric meaning behind “gradient = steepest ascent direction.”
Step 4: The Gradient as a Local Linear Model
Think of \(f(\mathbf{x})\) as the altitude at position \(\mathbf{x}\) on a hilly surface.
Then:
- The gradient \(\nabla f(\mathbf{x})\) tells you which way is uphill fastest.
- Moving orthogonally to the gradient is like walking around the hill at constant altitude.
- The directional derivative tells you how much altitude changes if you take a small step in any direction \(\mathbf{v}\) — and it’s exactly the projection of \(\mathbf{v}\) onto the gradient.
2D Example
Take the simple quadratic surface:
\[f(x, y) = x^2 + y^2\]At point \((1, 2)\):
- The gradient is \(\nabla f = \begin{bmatrix} 2x \\ 2y \end{bmatrix} = \begin{bmatrix} 2 \\ 4 \end{bmatrix}\)
- Take a direction vector: \(\mathbf{v} = \begin{bmatrix} 3 \\ -4 \end{bmatrix}\)
Then:
\[\begin{aligned} D_{\mathbf{v}} f(1, 2) &= \nabla f^\top \mathbf{v} \\\\ &= \begin{bmatrix} 2 & 4 \end{bmatrix} \cdot \begin{bmatrix} 3 \\ -4 \end{bmatrix} \\\\ &= 2 \cdot 3 + 4 \cdot (-4) = 6 - 16 = -10 \end{aligned}\]Interpretation:
A tiny step \(h\,\mathbf{v}\) along \((3, -4)\) decreases the function at a rate of 10 per unit of \(h\) (about 2 per unit distance, since \(\lVert\mathbf{v}\rVert = 5\)), because that direction leans against the gradient.
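Here's a small sketch (illustrative names only) that checks the dot-product formula against the limit definition for this example:

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + y**2

x0 = np.array([1.0, 2.0])
v = np.array([3.0, -4.0])
grad = np.array([2 * x0[0], 2 * x0[1]])   # analytic gradient [2x, 2y]

h = 1e-6
numerical = (f(x0 + h * v) - f(x0)) / h   # limit definition, small h
analytic = grad @ v                       # dot product with the gradient

print(numerical)  # ~ -10.0
print(analytic)   # -10.0
```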
Key Takeaways
- The gradient vector encodes how a function changes in every possible direction.
- The directional derivative in direction \(\mathbf{v}\) is simply \(\nabla f^\top \mathbf{v}\) — a projection of movement onto the gradient.
- In machine learning, this is the theoretical justification behind gradient descent: to decrease a loss function fast, go in direction \(-\nabla f\).
Matrix Notation
In matrix calculus, we often prefer this notation:
If \(f(\mathbf{x})\) is scalar and \(\mathbf{x} \in \mathbb{R}^{n \times 1}\), then:
- The gradient is typically a column vector \(\frac{\partial f}{\partial \mathbf{x}} \in \mathbb{R}^{n \times 1}\)
- But some references (especially in engineering) write it as a row vector — be aware of this ambiguity.
In this blog series, we’ll default to column vectors for gradients.
Why This Matters in Machine Learning
Any loss function you optimize — mean squared error, cross-entropy, hinge loss — boils down to:
\[\min_{\mathbf{w}} \; f(\mathbf{w})\]And your optimizer (SGD, Adam, etc.) needs \(\nabla_{\mathbf{w}} f\).
Without understanding this gradient — how it’s formed, how it’s directional — you’re just running someone else’s black box. And when you want to build your own (say, with custom loss, or reparameterized weights), you’ll need this math.
Real-world models — from logistic regression to deep neural networks — don’t just optimize over vectors. They learn matrices of weights, like the parameter matrix \(W\) in a linear transformation:
\[Z = XW\]We need to compute how a scalar-valued loss changes with respect to every entry of a matrix.
So the natural next question becomes:
What does the gradient look like when the variable is a matrix?
That’s where we’re headed now.
Let’s step into scalar-by-matrix derivatives, and see how gradients flow through entire affine maps — the backbone of modern ML.
Scalar-by-Matrix Derivatives: \(\frac{\partial L}{\partial W}\)
Most machine learning models funnel a whole batch of data through an affine map:
\[\mathbf{Z} = XW + \mathbf{b}\]where:
- \(X \in \mathbb{R}^{m \times d}\) holds \(m\) examples with \(d\) features each,
- \(W \in \mathbb{R}^{d \times k}\) is the weight matrix we want to learn,
- \(\mathbf{b} \in \mathbb{R}^{1 \times k}\) (or broadcasted) is a bias, and
- \(\mathbf{Z}\) feeds a softmax, sigmoid, or some other non-linearity.
Our task is to figure out how a scalar loss
\[L = \mathcal{L}(\mathbf{Z},\, \mathbf{y})\]changes when we nudge each entry of \(W\).
Warm-Up: Mean-Squared Error with a Single Output
Let’s take plain linear regression with a single output:
\[\hat{y} = X\mathbf{w}, \qquad L = \frac{1}{2m}\;\lVert\hat{y} - y\rVert^2\]Here we write \(W\) as a column vector \(\mathbf{w}\) just to keep notation light.
Derivative with respect to \(\mathbf{w}\):
\[\frac{\partial L}{\partial \mathbf{w}} = \frac{1}{m}\,X^\top (X\mathbf{w} - y)\]That’s the familiar normal equation gradient.
Takeaway — When the loss is quadratic and predictions are linear in weights, the gradient boils down to “design-matrix-transpose times residuals.”
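As a sanity check, here's a short sketch on random synthetic data (all names are illustrative) that compares \(\frac{1}{m}X^\top(X\mathbf{w} - y)\) with a finite-difference gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 20, 3
X = rng.normal(size=(m, d))
y = rng.normal(size=m)
w = rng.normal(size=d)

def loss(w):
    r = X @ w - y
    return 0.5 / m * float(r @ r)

analytic = X.T @ (X @ w - y) / m           # (1/m) X^T (Xw - y)

eps = 1e-6
numerical = np.array([
    (loss(w + eps * np.eye(d)[i]) - loss(w - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
print(np.allclose(analytic, numerical, atol=1e-6))  # True
```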
General Rule: Scalar Loss w.r.t. Matrix
Suppose:
\[L = f(Z(W)), \qquad Z = XW \in \mathbb{R}^{m \times k}\]Using the trace trick, we express the differential of the scalar loss as:
\[dL = \mathrm{tr}\!\left(G^\top\, dZ\right)\]Now use the identity:
\[\boxed{ \frac{\partial\, \mathrm{tr}(A^\top X W)}{\partial W} = X^\top A }\]Let:
- \(G = \frac{\partial L}{\partial Z} \in \mathbb{R}^{m \times k}\) be the “upstream” gradient
- \(Z = XW\), so that \(dZ = X\, dW\)
Then apply the chain rule in matrix form, treating \(G\) as fixed to first order:
\[\begin{aligned} dL &= \mathrm{tr}\!\left(G^\top\, dZ\right) \\ &= \mathrm{tr}\!\left(G^\top X\, dW\right) \\ \Longrightarrow\quad \frac{\partial L}{\partial W} &= X^\top G \end{aligned}\]This formula — \(X^\top G\) — is the workhorse of backpropagation for fully connected layers.
Logistic Regression Example
For a binary classification task, the loss per sample is:
\[\ell_i = -\left[y_i \log \sigma(z_i) + (1 - y_i)\log (1 - \sigma(z_i))\right], \quad z_i = x_i^\top \mathbf{w}\]Then the full batch loss is:
\[L = \frac{1}{m} \sum_{i=1}^m \ell_i\]The derivative of the loss with respect to each logit \(z_i\) is:
\[\frac{\partial L}{\partial z_i} = \frac{1}{m}(\hat{y}_i - y_i)\]Stack this into a matrix form:
\[G = \frac{1}{m}(\hat{y} - y), \qquad G \in \mathbb{R}^{m \times 1}\]Then the gradient with respect to weights is:
\[\boxed{ \frac{\partial L}{\partial \mathbf{w}} = X^\top G = \frac{1}{m} X^\top (\hat{y} - y) }\]Again — the usual rule hidden inside every ML library.
Key Patterns to Remember
Situation | Upstream Gradient | \(\frac{\partial L}{\partial W}\) |
---|---|---|
\(Z = XW\), arbitrary scalar loss | \(G = \frac{\partial L}{\partial Z}\) | \(X^\top G\) |
\(L = \frac{1}{2} \lVert XW - y \rVert^2\) (MSE) | \(G = XW - y\) | \(X^\top (XW - y)\) |
Softmax + Cross-Entropy | \(G = \hat{Y} - Y\) | \(X^\top (\hat{Y} - Y)\) |
Common Pitfalls
- Shape mismatches — \(X^\top G\) must match the shape of \(W\) exactly
- Missing constants — don’t forget the factor \(\frac{1}{m}\) when averaging over batches
- Nonlinearities — gradients must pass through activation functions before reaching \(W\)
So far we’ve looked at scalar-valued losses. But models are built from layers whose outputs are entire vectors, and we need an object that keeps track of how those outputs shift, stretch, or twist when the inputs change. That object is the Jacobian: the key to understanding how vector-valued functions respond to vector inputs, and a building block of backpropagation, reparameterization tricks, and more.
So the natural next question is:
what does the derivative look like when both the input and the output are vectors?
Vector-by-Vector Derivatives: \(\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\)
Imagine a function that takes in a vector and spits out another vector:
\[\mathbf{f} : \mathbb{R}^n \rightarrow \mathbb{R}^m\]This setup is everywhere in machine learning and data science:
- Neural network layers (input → activations)
- Batch transformations
- Coordinate transformations in loss landscapes
So, how do we represent the derivative of a vector-valued function with respect to a vector?
The answer is the Jacobian matrix.
What Is the Jacobian?
If:
\[\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(\mathbf{x}) \\ f_2(\mathbf{x}) \\ \vdots \\ f_m(\mathbf{x}) \end{bmatrix} \quad \text{where } \mathbf{x} \in \mathbb{R}^n,\]then the Jacobian of \(\mathbf{f}\) with respect to \(\mathbf{x}\) is the matrix of all partial derivatives:
\[\boxed{ \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = J_{\mathbf{f}}(\mathbf{x}) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} }\]- Each row corresponds to the gradient of a single component function \(f_i\).
- Each column corresponds to the sensitivity of all outputs to a single input dimension \(x_j\).
If you’re familiar with gradients, you can think of the Jacobian as a stacked collection of gradients, one per output dimension.
Example: Simple Vector Function
Let:
\[\mathbf{f}(\mathbf{x}) = \begin{bmatrix} x_1 + 2x_2 \\ x_1^2 \\ \sin(x_2) \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\]Then the Jacobian is:
\[\frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} \\ \frac{\partial f_3}{\partial x_1} & \frac{\partial f_3}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 1 & 2 \\ 2x_1 & 0 \\ 0 & \cos(x_2) \end{bmatrix}\]Each row is a directional map: how each component of \(\mathbf{f}\) changes when you nudge \(\mathbf{x}\).
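Here's a minimal sketch (the helper numerical_jacobian is my own, hypothetical name) that builds the Jacobian column by column with finite differences and compares it to the analytic matrix above:

```python
import numpy as np

def f(x):
    x1, x2 = x
    return np.array([x1 + 2 * x2, x1**2, np.sin(x2)])

def numerical_jacobian(f, x, eps=1e-6):
    # column j = sensitivity of all outputs to input x_j
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

x = np.array([1.5, 0.7])
analytic = np.array([
    [1.0, 2.0],
    [2 * x[0], 0.0],
    [0.0, np.cos(x[1])],
])
print(np.allclose(numerical_jacobian(f, x), analytic, atol=1e-6))  # True
```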
When Do Jacobians Appear in ML?
- Neural network layers: When an input vector flows through a layer, the weights form a Jacobian.
- Backpropagation: Chain rules over Jacobians allow gradients to be pushed through layers.
- Autoencoders, GANs, normalizing flows: Invertibility and determinant of Jacobians matter.
- Coordinate transforms in variational inference: Changing variables requires multiplying by Jacobian determinants.
The Chain Rule for Jacobians
Let’s say we have two functions:
- \[\mathbf{f} : \mathbb{R}^n \rightarrow \mathbb{R}^m\]
- \[\mathbf{g} : \mathbb{R}^m \rightarrow \mathbb{R}^p\]
and we define a composition:
\[\mathbf{h}(\mathbf{x}) = \mathbf{g}(\mathbf{f}(\mathbf{x}))\]Then the Jacobian of \(\mathbf{h}\) with respect to \(\mathbf{x}\) is given by the matrix chain rule:
\[\boxed{ \frac{\partial \mathbf{h}}{\partial \mathbf{x}} = \frac{\partial \mathbf{g}}{\partial \mathbf{f}} \cdot \frac{\partial \mathbf{f}}{\partial \mathbf{x}} }\]This is just matrix multiplication — as long as the shapes line up!
Important: the order matters. Matrix multiplication is not commutative.
The outer function’s Jacobian sits on the left, the inner function’s Jacobian on the right, mirroring the order in which the functions are composed.
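A quick way to convince yourself of the ordering is to compose two simple maps, compute each Jacobian analytically, and compare their product with a numerical Jacobian of the composition. The example below is a sketch with illustrative choices (a linear map followed by an elementwise tanh):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, -1.0],
              [3.0, 1.0]])          # f(x) = A x : R^2 -> R^3

def f(x): return A @ x
def g(z): return np.tanh(z)          # elementwise : R^3 -> R^3
def h(x): return g(f(x))

x = np.array([0.3, -0.5])
z = f(x)
J_f = A                              # (3 x 2)
J_g = np.diag(1 - np.tanh(z)**2)     # (3 x 3), diagonal for elementwise maps
chain = J_g @ J_f                    # outer Jacobian on the left

eps = 1e-6
numerical = np.column_stack([
    (h(x + eps * np.eye(2)[:, j]) - h(x - eps * np.eye(2)[:, j])) / (2 * eps)
    for j in range(2)
])
print(np.allclose(chain, numerical, atol=1e-6))  # True
```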
Jacobian Shapes: Keep It Straight
Function | Input Dim | Output Dim | Jacobian Shape |
---|---|---|---|
\(\mathbf{f} : \mathbb{R}^n \rightarrow \mathbb{R}\) | \(n\) | \(1\) | \((1 \times n)\) (row vector) |
\(\mathbf{f} : \mathbb{R}^n \rightarrow \mathbb{R}^m\) | \(n\) | \(m\) | \((m \times n)\) (Jacobian matrix) |
Summary: Why the Jacobian Matters
- It generalizes the gradient when your function outputs multiple values.
- It’s essential in backprop, where gradients move backward through layers of composed vector functions.
- It’s the foundation for more complex structures: Hessians, vector-Jacobian products, and even determinants of transformations.
If you’re using PyTorch or JAX, you’ve likely seen torch.autograd.grad
or jacrev()
— they’re computing this exact object under the hood.
Matrix-by-Matrix Derivatives: \(\frac{\partial F}{\partial X}\)
In most real-world ML architectures — especially deep networks — layers don’t just map vectors to scalars or vectors to vectors. Instead:
- Inputs are often mini-batch matrices (e.g., \(X \in \mathbb{R}^{m \times d}\)),
- Outputs of layers are activation matrices (e.g., \(Z \in \mathbb{R}^{m \times k}\)),
- And the transformations themselves involve matrix-valued parameters.
So to compute how one matrix affects another, we need to differentiate matrix-valued functions with respect to matrices. That’s where matrix-by-matrix derivatives come in.
What Are Matrix-by-Matrix Derivatives?
Suppose:
\[F : \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{p \times q}\]and you want to compute:
\[\frac{\partial F}{\partial X}, \quad \text{where } X \in \mathbb{R}^{m \times n}, \; F(X) \in \mathbb{R}^{p \times q}\]Unlike scalar-valued functions, this derivative isn’t a matrix — it’s technically a 4D tensor of shape \(p \times q \times m \times n\). That’s hard to visualize or work with directly.
So instead, we often vectorize both inputs and outputs.
Vectorization: Flattening the Problem
Define:
- \(\operatorname{vec}(X)\): stacks the columns of \(X\) into a column vector
- \(\operatorname{vec}(F(X))\): same for the output
Then we work with the Jacobian matrix of vectorized output with respect to vectorized input:
\[\boxed{ \frac{\partial \operatorname{vec}(F)}{\partial \operatorname{vec}(X)} \in \mathbb{R}^{(pq) \times (mn)} }\]This is how libraries like PyTorch and TensorFlow think of matrix derivatives internally: everything gets flattened into vectors, and the derivative is just a big Jacobian.
Example: Matrix Multiplication as a Function
Let:
\[F(X) = AXB\]where:
- \(A \in \mathbb{R}^{p \times m}\),
- \(B \in \mathbb{R}^{n \times q}\),
- \(X \in \mathbb{R}^{m \times n}\),
- So that \(F(X) \in \mathbb{R}^{p \times q}\)
The vectorized version of the output is:
\[\operatorname{vec}(F(X)) = (B^\top \otimes A) \operatorname{vec}(X)\]Here, \(\otimes\) is the Kronecker product.
So the Jacobian (matrix-by-matrix derivative) is:
\[\boxed{ \frac{\partial \operatorname{vec}(F)}{\partial \operatorname{vec}(X)} = B^\top \otimes A }\]This rule is foundational. In fact, the entire backpropagation process in fully connected layers boils down to a series of matrix multiplications and this derivative.
Kronecker Product
The Kronecker product \(A \otimes B\) between two matrices creates a block matrix, where each entry of \(A\) is replaced by that entry multiplied by the entire matrix \(B\).
Formally, if:
\[A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}, \quad B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}\]then:
\[A \otimes B = \begin{bmatrix} a_{11} B & a_{12} B \\ a_{21} B & a_{22} B \end{bmatrix} = \begin{bmatrix} a_{11}b_{11} & a_{11}b_{12} & a_{12}b_{11} & a_{12}b_{12} \\ a_{11}b_{21} & a_{11}b_{22} & a_{12}b_{21} & a_{12}b_{22} \\ a_{21}b_{11} & a_{21}b_{12} & a_{22}b_{11} & a_{22}b_{12} \\ a_{21}b_{21} & a_{21}b_{22} & a_{22}b_{21} & a_{22}b_{22} \end{bmatrix}\]So it inflates a small matrix into a much larger one — and this inflated object turns out to be the Jacobian of vectorized matrix multiplication.
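You can check the \(\operatorname{vec}(AXB) = (B^\top \otimes A)\operatorname{vec}(X)\) identity numerically with np.kron. One detail to keep in mind: vec here means column stacking, which corresponds to Fortran (column-major) order when flattening in NumPy. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))   # p x m
X = rng.normal(size=(3, 2))   # m x n
B = rng.normal(size=(2, 5))   # n x q

def vec(M):
    # column-stacking: flatten in column-major (Fortran) order
    return M.reshape(-1, order="F")

lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
print(np.allclose(lhs, rhs))  # True
```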
Worked Example: Derivative of a Matrix Product
Let’s say:
\[F(X) = AX + C\]where:
- \[A \in \mathbb{R}^{2 \times 2}\]
- \(X \in \mathbb{R}^{2 \times 2}\) is our variable
- \(C \in \mathbb{R}^{2 \times 2}\) is a constant bias matrix
Suppose:
\[A = \begin{bmatrix} 1 & 2 \\ 0 & -1 \end{bmatrix}, \quad X = \begin{bmatrix} x_1 & x_2 \\ x_3 & x_4 \end{bmatrix}, \quad C = \begin{bmatrix} 5 & 0 \\ 0 & 0 \end{bmatrix}\]So:
\[F(X) = A X + C = \begin{bmatrix} 1 & 2 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} x_1 & x_2 \\ x_3 & x_4 \end{bmatrix} + \begin{bmatrix} 5 & 0 \\ 0 & 0 \end{bmatrix}\]Compute the matrix multiplication:
\[A X = \begin{bmatrix} 1 \cdot x_1 + 2 \cdot x_3 & 1 \cdot x_2 + 2 \cdot x_4 \\ 0 \cdot x_1 + (-1) \cdot x_3 & 0 \cdot x_2 + (-1) \cdot x_4 \end{bmatrix} = \begin{bmatrix} x_1 + 2x_3 & x_2 + 2x_4 \\ - x_3 & -x_4 \end{bmatrix}\]Now add the bias:
\[F(X) = \begin{bmatrix} x_1 + 2x_3 + 5 & x_2 + 2x_4 \\ - x_3 & -x_4 \end{bmatrix}\]Now let’s say we want to compute the derivative of a scalar function of this output. For instance:
\[L = \operatorname{tr}(F(X)) = (x_1 + 2x_3 + 5) + (-x_4) = x_1 + 2x_3 - x_4 + 5\]Then the gradient of this scalar \(L\) with respect to each variable in \(X\) is:
- \[\frac{\partial L}{\partial x_1} = 1\]
- \[\frac{\partial L}{\partial x_2} = 0\]
- \[\frac{\partial L}{\partial x_3} = 2\]
- \[\frac{\partial L}{\partial x_4} = -1\]
So the gradient matrix is:
\[\frac{\partial L}{\partial X} = \begin{bmatrix} 1 & 0 \\ 2 & -1 \end{bmatrix}\]This matches the rule:
If \(L = \operatorname{tr}(A X)\), then \(\frac{\partial L}{\partial X} = A^\top\)
In our case, \(A^\top = \begin{bmatrix} 1 & 0 \\ 2 & -1 \end{bmatrix}\) — exactly the matrix above.
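Here's the same check done numerically, perturbing one entry of \(X\) at a time (a sketch using the matrices from the example above):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, -1.0]])
C = np.array([[5.0, 0.0],
              [0.0, 0.0]])

def L(X):
    return np.trace(A @ X + C)

X = np.array([[0.3, -1.2],
              [0.7, 2.5]])      # value doesn't matter: the gradient is constant

eps = 1e-6
grad = np.zeros_like(X)
for i in range(2):
    for j in range(2):
        E = np.zeros_like(X)
        E[i, j] = eps
        grad[i, j] = (L(X + E) - L(X - E)) / (2 * eps)

print(np.round(grad, 6))  # [[ 1.  0.] [ 2. -1.]]  == A.T
```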
Application: Backpropagation in Deep Learning
Let’s say you have a fully connected layer:
\[Z = XW + b\]with:
- \(X \in \mathbb{R}^{m \times d}\) (input batch),
- \(W \in \mathbb{R}^{d \times k}\) (weights),
- \(Z \in \mathbb{R}^{m \times k}\) (pre-activation outputs),
- \(b \in \mathbb{R}^{1 \times k}\) (broadcasted bias)
Now suppose you receive the upstream gradient:
\[\frac{\partial L}{\partial Z} = G \in \mathbb{R}^{m \times k}\]You want to compute:
- \(\frac{\partial L}{\partial W}\) — how the loss changes with respect to the weights
- \(\frac{\partial L}{\partial X}\) — how the loss changes with respect to the inputs (useful for chaining to previous layers)
The rules are:
\[\boxed{ \frac{\partial L}{\partial W} = X^\top G, \quad \frac{\partial L}{\partial X} = G W^\top }\]These come directly from matrix-by-matrix derivatives of the form:
- \[\frac{\partial \operatorname{tr}(A^\top XW)}{\partial W} = X^\top A\]
- \[\frac{\partial \operatorname{tr}(A^\top XW)}{\partial X} = A W^\top\]
Think of these as matrix calculus chain rules — they let gradients flow through layers.
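Here's a compact sketch of a linear-layer backward pass using these two rules, checked against finite differences on a toy squared-error loss (all data and names are illustrative assumptions, not a reference implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, k = 4, 3, 2
X = rng.normal(size=(m, d))
W = rng.normal(size=(d, k))
b = rng.normal(size=(1, k))
T = rng.normal(size=(m, k))             # arbitrary target, just to get a scalar loss

def loss(X, W):
    Z = X @ W + b
    return 0.5 * np.sum((Z - T) ** 2)

G = (X @ W + b) - T                     # upstream gradient dL/dZ for this loss
dW = X.T @ G                            # dL/dW = X^T G
dX = G @ W.T                            # dL/dX = G W^T

# finite-difference check on one entry of each
eps = 1e-6
E = np.zeros_like(W); E[1, 0] = eps
print(np.isclose(dW[1, 0], (loss(X, W + E) - loss(X, W - E)) / (2 * eps)))  # True
E = np.zeros_like(X); E[2, 1] = eps
print(np.isclose(dX[2, 1], (loss(X + E, W) - loss(X - E, W)) / (2 * eps)))  # True
```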
Recap: When Do You Actually Need This?
You’ll be using matrix-by-matrix derivatives (implicitly or explicitly) whenever you:
- Backprop through matrix-valued functions in deep networks
- Implement custom layers in PyTorch or JAX
- Derive gradients in matrix factorization models (e.g., \(XW^\top\))
- Work with vectorized representations (via torch.nn.Linear, einsum, or reshape)
Most of the time, you’ll rely on automatic differentiation. But understanding the underlying derivatives — and being able to write them down — is what makes debugging and custom design possible.
Chain Rule in High Dimensions
Goal: Efficiently compute derivatives when functions are composed, and inputs/outputs are vectors or matrices
What We Already Know
In scalar calculus, you learned:
\[\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)\]Now what happens if:
- Both \(f\) and \(g\) are vector-valued?
- Or they operate on matrices?
- And what if you have many layers of such functions — like in deep learning?
We need a more general, more structural rule — one that handles these layered compositions.
General Setup
Let:
- \[\mathbf{x} \in \mathbb{R}^n\]
- \[\mathbf{f} : \mathbb{R}^n \rightarrow \mathbb{R}^m\]
- \[\mathbf{g} : \mathbb{R}^m \rightarrow \mathbb{R}^p\]
and define:
\[\mathbf{h}(\mathbf{x}) = \mathbf{g}(\mathbf{f}(\mathbf{x}))\]Then the Jacobian of \(\mathbf{h}\) is given by the matrix chain rule:
\[\boxed{ \frac{\partial \mathbf{h}}{\partial \mathbf{x}} = \frac{\partial \mathbf{g}}{\partial \mathbf{f}} \cdot \frac{\partial \mathbf{f}}{\partial \mathbf{x}} }\]Same idea as scalar calculus — but now we’re multiplying Jacobians instead of scalars.
Shape Check
Let’s break down the dimensions to make sure it all lines up:
Derivative | Shape |
---|---|
\(\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\) | \((m \times n)\) |
\(\frac{\partial \mathbf{g}}{\partial \mathbf{f}}\) | \((p \times m)\) |
Total: \(\frac{\partial \mathbf{h}}{\partial \mathbf{x}}\) | \((p \times n)\) |
So you’re safe to multiply — as long as you’re careful with matrix shapes and multiplication order.
Example: Deep Network as Function Composition
In a feedforward neural network, each layer is a function:
\[\begin{aligned} \mathbf{h}^{(1)} &= \sigma(W^{(1)} \mathbf{x}) \\ \mathbf{h}^{(2)} &= \sigma(W^{(2)} \mathbf{h}^{(1)}) \\ \vdots \\ \mathbf{y}_{\text{pred}} &= W^{(L)} \mathbf{h}^{(L-1)} \end{aligned}\]So the full network is:
\[\mathbf{y}_{\text{pred}} = f^{(L)} \circ f^{(L-1)} \circ \cdots \circ f^{(1)}(\mathbf{x})\]To compute gradients for training, we need:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}_{\text{pred}}} \cdot \frac{\partial \mathbf{y}_{\text{pred}}}{\partial \mathbf{h}^{(L-1)}} \cdot \cdots \cdot \frac{\partial \mathbf{h}^{(1)}}{\partial \mathbf{x}}\]This is just the chain rule, applied over and over, moving from the output layer back to the inputs.
Special Case: Scalar Output
If the final output is a scalar loss \(L\), and intermediate steps are vector/matrix-valued, then the chain rule becomes:
\[\boxed{ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Z} \cdot \frac{\partial Z}{\partial X} }\]- Here \(Z\) could be an intermediate matrix
- \(X\) is a parameter or input
- The gradient “backflows” through the layers
This is exactly what backpropagation automates.
One More Example (Explicit):
Let:
- \(f(\mathbf{x}) = A \mathbf{x}\), where \(A \in \mathbb{R}^{m \times n}\)
- \(g(\mathbf{z}) = \sigma(\mathbf{z})\) elementwise activation
Then \(h(\mathbf{x}) = g(f(\mathbf{x})) = \sigma(A \mathbf{x})\)
You want \(\frac{\partial h}{\partial \mathbf{x}}\). Chain rule says:
\[\frac{\partial h}{\partial \mathbf{x}} = \frac{\partial g}{\partial \mathbf{z}} \cdot A\]Since \(\frac{\partial g}{\partial \mathbf{z}}\) is a diagonal matrix with entries \(\sigma'(z_i)\), this becomes:
\[\boxed{ \frac{\partial h}{\partial \mathbf{x}} = \operatorname{diag}(\sigma'(A\mathbf{x})) \cdot A }\]This kind of expression is everywhere in the derivative logic of neural nets.
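A minimal numerical check of this expression, taking \(\sigma\) to be the logistic sigmoid (an illustrative choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

A = np.array([[1.0, -2.0, 0.5],
              [0.3,  0.0, 1.0]])        # m x n
x = np.array([0.2, -0.4, 1.1])

z = A @ x
s = sigmoid(z)
analytic = np.diag(s * (1 - s)) @ A     # diag(sigma'(Ax)) A

eps = 1e-6
numerical = np.column_stack([
    (sigmoid(A @ (x + eps * np.eye(3)[:, j]))
     - sigmoid(A @ (x - eps * np.eye(3)[:, j]))) / (2 * eps)
    for j in range(3)
])
print(np.allclose(analytic, numerical, atol=1e-6))  # True
```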
Why This Chain Rule Matters
- Mathematical foundation of backpropagation
- It generalizes gradient descent to composite, high-dimensional, multi-layered models
- It allows modular, local gradient computation that’s reused across layers
Whether you’re implementing custom loss functions, writing your own neural network modules, or reading research papers — this chain rule is the connective tissue that makes modern machine learning trainable.
Vectorization Techniques
Kronecker products and the vec operator — flattening matrices, sharpening calculus
“What if we could flatten the chaos?”
That’s the idea behind vectorization.
In machine learning, we’re rarely dealing with just scalars. A single layer of a neural net transforms entire batches of feature matrices. Backpropagation involves matrices flowing through functions of other matrices — and the gradients? They must know how to flow back efficiently through all of that.
That’s where vectorization becomes your secret weapon.
We introduce two essential tools in matrix calculus:
- The vec operator: which flattens matrices to vectors
- The Kronecker product: which helps us do calculus on these flattened forms
The vec Operator: Turning Matrices Into Long Skinny Vectors
Let’s start simple. Suppose you have a matrix:
\[X = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{bmatrix}\]Then:
\[\operatorname{vec}(X) = \begin{bmatrix} x_{11} \\ x_{21} \\ x_{12} \\ x_{22} \end{bmatrix}\]The rule is: stack columns, top to bottom.
More generally, for \(X \in \mathbb{R}^{m \times n}\), the operator
\[\operatorname{vec}: \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{mn}\]produces a column vector with the columns of \(X\) stacked vertically.
As discussed in Matrix-by-Matrix Derivatives, the Kronecker product, denoted \(A \otimes B\), creates a block matrix from two smaller ones. Formally:
If \(A \in \mathbb{R}^{p \times q}\) and \(B \in \mathbb{R}^{r \times s}\), then
\[A \otimes B \in \mathbb{R}^{pr \times qs}\]and:
\[A \otimes B = \begin{bmatrix} a_{11} B & a_{12} B & \cdots & a_{1q} B \\ a_{21} B & a_{22} B & \cdots & a_{2q} B \\ \vdots & \vdots & \ddots & \vdots \\ a_{p1} B & a_{p2} B & \cdots & a_{pq} B \\ \end{bmatrix}\]Vectorization Identities
1. \(\operatorname{vec}(AX) = (I \otimes A)\operatorname{vec}(X)\)
Proof:
Let \(X \in \mathbb{R}^{m \times n}\). Then the \(j^{\text{th}}\) column of \(AX\) is:
\[AX_{[:, j]} = A \cdot \text{(column } j \text{ of } X)\]So:
\[\operatorname{vec}(AX) = \begin{bmatrix} A X_{[:, 1]} \\ A X_{[:, 2]} \\ \vdots \\ A X_{[:, n]} \end{bmatrix} = (I_n \otimes A)\, \operatorname{vec}(X)\]This identity follows by applying \(A\) to each column of \(X\) and stacking the results — which is precisely what the Kronecker product \(I_n \otimes A\) does when multiplying \(\operatorname{vec}(X)\).
2. \(\operatorname{vec}(X B) = (B^\top \otimes I)\operatorname{vec}(X)\)
Proof:
Let \(X \in \mathbb{R}^{m \times n},\ B \in \mathbb{R}^{n \times p}\).
Each column of the matrix product \(XB\) is a linear combination of the columns of \(X\), weighted by the entries of \(B\).
So the \(k^{\text{th}}\) column of \(XB\) is:
\[(XB)_{:,k} = \sum_{j=1}^{n} B_{j,k}\, X_{:,j}\]Stacking these columns is exactly what multiplying \(\operatorname{vec}(X)\) by \(B^\top \otimes I_m\) does: the \(k^{\text{th}}\) block of \((B^\top \otimes I_m)\operatorname{vec}(X)\) is \(\sum_j (B^\top)_{k,j}\, X_{:,j} = \sum_j B_{j,k}\, X_{:,j}\).
Hence:
\[\operatorname{vec}(X B) = (B^\top \otimes I)\, \operatorname{vec}(X)\]3. \(\operatorname{vec}(A X B) = (B^\top \otimes A)\operatorname{vec}(X)\)
This elegant identity lies at the heart of matrix calculus and vectorization. It allows us to reduce a seemingly complex matrix expression into a clean, computable form — and it appears everywhere from backpropagation mechanics to second-order optimization.
Let’s walk through it step by step.
Suppose we have a function defined as:
\[F(X) = A X B\]where:
\[A \in \mathbb{R}^{p \times m}, \quad X \in \mathbb{R}^{m \times n}, \quad B \in \mathbb{R}^{n \times q}\]Then the output is:
\[F(X) \in \mathbb{R}^{p \times q}\]Our goal is to compute the vectorized form of this result — that is, express:
\[\operatorname{vec}(F(X)) = \operatorname{vec}(A X B)\]in terms of \(\operatorname{vec}(X)\).
The identity we aim to derive is:
\[\boxed{ \operatorname{vec}(A X B) = (B^\top \otimes A)\, \operatorname{vec}(X) }\]We’ll build this from two well-established vectorization rules:
- If \(A \in \mathbb{R}^{p \times m}\) and \(X \in \mathbb{R}^{m \times n}\), then \(\operatorname{vec}(AX) = (I_n \otimes A)\,\operatorname{vec}(X)\)
- If \(X \in \mathbb{R}^{m \times n}\) and \(B \in \mathbb{R}^{n \times q}\), then \(\operatorname{vec}(XB) = (B^\top \otimes I_m)\,\operatorname{vec}(X)\)
These identities essentially say: you can translate matrix multiplication into Kronecker product multiplication after flattening.
Now rewrite the expression:
\[A X B = (A X) B\]and apply the vectorization rule for post-multiplication:
\[\begin{aligned} \operatorname{vec}(A X B) &= \operatorname{vec}((A X) B) \\ &= (B^\top \otimes I_p)\, \operatorname{vec}(A X) \end{aligned}\]Then apply the pre-multiplication identity to \(\operatorname{vec}(A X)\):
\[\operatorname{vec}(A X) = (I_n \otimes A)\, \operatorname{vec}(X)\]Putting it all together:
\[\begin{aligned} \operatorname{vec}(A X B) &= (B^\top \otimes I_p)\, (I_n \otimes A)\, \operatorname{vec}(X) \end{aligned}\]The final step is to simplify this Kronecker product composition. A standard identity in Kronecker algebra tells us:
\[(B^\top \otimes I_p)(I_n \otimes A) = B^\top \otimes A\]This follows from the mixed-product property:
\[(A \otimes B)(C \otimes D) = (AC) \otimes (BD)\]when the matrix dimensions are compatible.
So we conclude:
\[\operatorname{vec}(A X B) = (B^\top \otimes A)\, \operatorname{vec}(X)\]Exactly as claimed.
Example: Quadratic Form with Vectorization
Let:
\[L = \operatorname{tr}(A X B X^\top)\]We want to compute:
\[\frac{\partial L}{\partial X}\]This expression arises in:
- Mahalanobis distance: \(\mathbf{x}^\top \Sigma^{-1} \mathbf{x}\)
- Matrix-variate Gaussians
- Multivariate optimization problems
Step 1: Use vec and trace identity
Use the identity:
\[\operatorname{tr}(A X B X^\top) = \operatorname{vec}(X)^\top (B^\top \otimes A) \operatorname{vec}(X)\]Proof sketch:
We know:
\[\operatorname{tr}(A X B X^\top) = \operatorname{tr}(X^\top A X B) = \operatorname{vec}(X)^\top (B^\top \otimes A) \operatorname{vec}(X)\](See Petersen & Pedersen “The Matrix Cookbook” for this identity.)
Step 2: Take derivative of quadratic form
Let:
\[L = \mathbf{x}^\top Q \mathbf{x}\]Then:
\[\frac{dL}{d\mathbf{x}} = (Q + Q^\top)\, \mathbf{x}\]When \(Q\) is symmetric, this becomes:
\[\frac{dL}{d\mathbf{x}} = 2 Q \mathbf{x}\]In our case:
- \[\mathbf{x} = \operatorname{vec}(X)\]
- \(Q = B^\top \otimes A\) (symmetric if \(B = B^\top\) and \(A = A^\top\))
So, when \(A\) and \(B\) are symmetric:
\[\frac{\partial L}{\partial \operatorname{vec}(X)} = 2 (B^\top \otimes A) \operatorname{vec}(X)\]Reshaping back to an \(m \times n\) matrix (using \(\operatorname{vec}(A X B) = (B^\top \otimes A)\operatorname{vec}(X)\)) gives:
\[\frac{\partial L}{\partial X} = 2\, A X B\]In the general, non-symmetric case, the gradient is \(A^\top X B^\top + A X B\).
Summary
- The
vec
operator helps translate matrix calculus into vector calculus - The Kronecker product allows those translations to become clean linear maps
- Useful identities like \(\operatorname{vec}(A X B) = (B^\top \otimes A)\, \operatorname{vec}(X)\)
give us elegant tools to differentiate messy matrix functions
Trace and Determinant Derivatives
Why This Matters
If you’ve spent any time with loss functions in linear models, Gaussian likelihoods, matrix normal distributions, or multivariate optimization, you’ve seen trace and determinant terms pop up.
In this section, we’ll explore:
- The linearity and cyclic rules for traces
- How to differentiate trace expressions with respect to matrices
- What the Jacobi formula tells us about the derivative of a determinant
- And why these results are so useful in machine learning
Let’s begin with the friendliest one of them all: the trace.
The Trace Operator: Definition and Properties
For any square matrix \(A \in \mathbb{R}^{n \times n}\):
\[\operatorname{tr}(A) = \sum_{i=1}^{n} A_{ii}\]That’s it — just the sum of the diagonal entries.
But this innocent-looking function hides two powerful algebraic properties that make it a star in matrix calculus:
-
Linearity:
\(\operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B), \quad \operatorname{tr}(c A) = c \operatorname{tr}(A)\) -
Cyclic Permutation:
\(\operatorname{tr}(A B) = \operatorname{tr}(B A)\)More generally:
\[\operatorname{tr}(A B C) = \operatorname{tr}(C A B) = \operatorname{tr}(B C A)\]Caution: You can cyclically permute but not reorder arbitrarily.
This second rule is especially important. It allows us to rearrange matrix products inside a trace to our advantage when differentiating.
Derivatives Involving the Trace
Let’s build some derivative identities now, starting with the trace of matrix expressions. These are derived using component-wise differentiation and the properties above.
1. Scalar by Matrix:
Let \(A \in \mathbb{R}^{n \times n}\) and \(X \in \mathbb{R}^{n \times n}\).
Then:
\[\frac{\partial}{\partial X} \operatorname{tr}(A^\top X) = A\]This one is straightforward. It’s the matrix analog of the scalar identity \(\frac{d}{dx} (a x) = a\).
More generally, for any expression \(\operatorname{tr}(A X)\) or \(\operatorname{tr}(X A)\):
\[\frac{\partial}{\partial X} \operatorname{tr}(A X) = A^\top\](because differentiation is with respect to each entry of \(X\), not the transpose.)
2. Quadratic Forms:
Let’s say \(X \in \mathbb{R}^{n \times p}\) and \(A \in \mathbb{R}^{p \times p}\).
Then:
\[\frac{\partial}{\partial X} \operatorname{tr}(X A X^\top) = X (A + A^\top)\]And if \(A\) is symmetric (common in ML):
\[\frac{\partial}{\partial X} \operatorname{tr}(X A X^\top) = 2 X A\]This comes up often in Mahalanobis distances and multivariate Gaussians.
3. Frobenius Norm:
The Frobenius norm of a matrix \(X\) is:
\[\|X\|_F^2 = \operatorname{tr}(X^\top X)\]Then:
\[\frac{\partial}{\partial X} \|X\|_F^2 = 2 X\]This result powers the regularization terms in Ridge regression and neural network weight decay.
Jacobi’s Formula: The Determinant Derivative
This one is a bit more subtle but incredibly important.
Let \(X \in \mathbb{R}^{n \times n}\) be an invertible matrix. Then:
\[\frac{d}{dX} \det(X) = \det(X) \cdot (X^{-1})^\top\]So in full derivative form:
\[\frac{\partial}{\partial X} \log \det(X) = (X^{-1})^\top\]This is known as Jacobi’s formula, and it plays a starring role in probabilistic modeling — especially when working with log-likelihoods of Gaussian distributions.
It gives us:
- A way to differentiate log-determinants in closed form
- The building block for matrix-variate distributions (like the matrix normal)
- Clean expressions for entropy and KL divergence in terms of covariance matrices
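As a quick sanity check of Jacobi's formula, here's a sketch that differentiates \(\log\lvert\det X\rvert\) of a random well-conditioned matrix by finite differences and compares against \((X^{-1})^\top\) (the test matrix and names are my own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
X = np.eye(n) + 0.2 * rng.normal(size=(n, n))   # well-conditioned, not symmetric

def logdet(X):
    sign, logabs = np.linalg.slogdet(X)
    return logabs                                # log |det X|

analytic = np.linalg.inv(X).T                    # Jacobi: (X^{-1})^T

eps = 1e-6
numerical = np.zeros_like(X)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(X); E[i, j] = eps
        numerical[i, j] = (logdet(X + E) - logdet(X - E)) / (2 * eps)

print(np.allclose(analytic, numerical, atol=1e-5))  # True
```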
Real-World Machine Learning Examples
-
In multivariate Gaussian log-likelihoods, the term
\(\log \det(\Sigma)\)
appears in the normalization constant. Its gradient? Exactly what Jacobi gives us. -
In information theory, when computing KL divergence between Gaussians, we encounter:
\(\operatorname{tr}(\Sigma_2^{-1} \Sigma_1) - \log \frac{\det(\Sigma_1)}{\det(\Sigma_2)}\)
and the derivatives of trace and log-determinant terms show up in natural gradient methods. -
In regularization, the Frobenius norm penalty translates to
\(\lambda \operatorname{tr}(W^\top W)\)
and the derivative is just \(2 \lambda W\).
Let’s now step into a concept that often feels like black magic the first time you see it: how to differentiate through a matrix inverse.
We’ll walk through it carefully, building intuition and math from the ground up.
Derivatives of Matrix Inverses
Why You Need This
If you’re working with second-order optimization (e.g. Newton’s method), statistical estimation (like linear or multivariate Gaussian models), or algorithms that involve solving matrix equations repeatedly, then derivatives of inverse matrices are everywhere.
And yet… the inverse operation is nonlinear, non-elementwise, and nontrivial to differentiate.
Still, with some matrix calculus magic, we can express this derivative in closed form. And once you know it, you’ll see it pop up all over machine learning theory.
The Goal
Let’s say \(X \in \mathbb{R}^{n \times n}\) is an invertible, differentiable matrix-valued function.
We want to compute:
\[\frac{\partial}{\partial X} (X^{-1})\]That is, how does the inverse of a matrix change as we perturb the original matrix?
The Key Identity
The answer, in differential form, is:
\[\boxed{ d(X^{-1}) = -X^{-1}\, dX\, X^{-1} }\]In vectorized (Jacobian) form, this becomes:
\[\frac{\partial \operatorname{vec}(X^{-1})}{\partial \operatorname{vec}(X)} = -\,(X^{-1})^\top \otimes X^{-1}\]Let’s unpack that.
This means: if you flatten the output matrix \(X^{-1}\) into a vector (via the vec operator), and do the same for the input matrix \(X\), then the Jacobian is a big \(n^2 \times n^2\) matrix given by:
\[\operatorname{vec}(dX^{-1}) = - \left((X^{-1})^\top \otimes X^{-1}\right) \operatorname{vec}(dX)\]This is your tool for automatic differentiation, matrix backprop, and symbolic derivations in second-order methods.
Derivation: From First Principles
We begin with a classic trick:
If \(X\) is invertible, then by definition:
\[X X^{-1} = I\]Differentiate both sides:
\[d(X X^{-1}) = dI = 0\]Now apply the product rule:
\[dX \cdot X^{-1} + X \cdot d(X^{-1}) = 0\]Rearranging:
\[X \cdot d(X^{-1}) = -dX \cdot X^{-1}\]Multiply both sides on the left by \(X^{-1}\):
\[d(X^{-1}) = - X^{-1} dX X^{-1}\]This gives us a compact and powerful result:
\[\boxed{ d(X^{-1}) = - X^{-1} dX X^{-1} }\]This is the differential form. It can be translated into vectorized form via the Kronecker product if needed for Jacobians or autodiff systems.
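The differential form is easy to verify numerically: perturb \(X\) in a random direction \(dX\) and compare the change in \(X^{-1}\) with the formula. A sketch (illustrative matrices):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
X = np.eye(n) + 0.1 * rng.normal(size=(n, n))    # invertible, well-conditioned
dX = rng.normal(size=(n, n))                      # perturbation direction

h = 1e-6
Xinv = np.linalg.inv(X)
finite_diff = (np.linalg.inv(X + h * dX) - Xinv) / h
formula = -Xinv @ dX @ Xinv                       # d(X^{-1}) = -X^{-1} dX X^{-1}

print(np.allclose(finite_diff, formula, atol=1e-4))  # True
```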
Practical Use Case: Newton’s Method
In optimization, Newton’s method uses the update:
\[\mathbf{x}_{k+1} = \mathbf{x}_k - H^{-1} \nabla f\]where \(H\) is the Hessian matrix of the loss function.
To compute derivatives of the update rule itself (for meta-learning, for example), you may need to differentiate through \(H^{-1}\) — and this is exactly where our identity helps.
In Statistics: Covariance Matrix Inverses
In multivariate Gaussian distributions, the likelihood often includes the term:
\[(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\]To perform MLE or Bayesian estimation, we need the gradient with respect to \(\Sigma\), the covariance matrix.
Differentiating through \(\Sigma^{-1}\) uses:
\[\frac{\partial \Sigma^{-1}}{\partial \Sigma} = -\Sigma^{-1} \otimes \Sigma^{-1}\](the transpose from the general formula drops out here because \(\Sigma\), and hence \(\Sigma^{-1}\), is symmetric)
This result appears in:
- Natural gradient descent
- Fisher information matrix
- EM algorithm updates
- Precision matrix estimation in graphical models
Summary Box
Expression | Derivative | Notes |
---|---|---|
\(d(X^{-1})\) | \(-X^{-1} dX X^{-1}\) | Differential form |
\(\operatorname{vec}(dX^{-1})\) | \(- \left((X^{-1})^\top \otimes X^{-1}\right) \operatorname{vec}(dX)\) | Vectorized form |
\(\frac{\partial \operatorname{vec}(X^{-1})}{\partial \operatorname{vec}(X)}\) | \(- (X^{-1})^\top \otimes X^{-1}\) | Jacobian matrix |
Closing Thought
Matrix inverses may feel tricky — and rightly so. But this compact identity:
\[d(X^{-1}) = - X^{-1} dX X^{-1}\]gives you the full derivative in one clean sweep. Whether you’re modeling covariance matrices or optimizing matrix-valued functions, this result will keep showing up. It’s one of the most useful “mental macros” in all of matrix calculus.
Derivatives of Eigenvalues and Eigenvectors
When matrices change, what happens to their spectrum?
The Setup
Let \(A \in \mathbb{R}^{n \times n}\) be a real symmetric matrix. Suppose:
\[A \mathbf{v} = \lambda \mathbf{v}\]with:
- \(\lambda\) being an eigenvalue
- \(\mathbf{v}\) the corresponding (unit-norm) eigenvector: \(\|\mathbf{v}\| = 1\)
Now we perturb \(A\) slightly — how do \(\lambda\) and \(\mathbf{v}\) change?
This is a sensitivity analysis problem: if \(A \rightarrow A + \varepsilon H\), what is:
- \[\frac{d\lambda}{d\varepsilon}\]
- \[\frac{d\mathbf{v}}{d\varepsilon}\]
1. Derivative of the Eigenvalue
Let’s start with the easier one.
Assume \(\mathbf{v}\) is normalized and \(A\) is symmetric.
Then the derivative of the eigenvalue \(\lambda\) with respect to a perturbation \(H\) is:
\[\boxed{ \frac{d\lambda}{d\varepsilon} = \mathbf{v}^\top H \mathbf{v} }\]This means: if you nudge the matrix \(A\) in the direction of some matrix \(H\), the eigenvalue \(\lambda\) changes at a rate given by this simple quadratic form.
Interpretation:
It’s just the Rayleigh quotient of \(H\) with respect to \(\mathbf{v}\) — a beautifully compact expression.
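Here's a small sketch checking \(\frac{d\lambda}{d\varepsilon} = \mathbf{v}^\top H \mathbf{v}\) for the largest eigenvalue of a random symmetric matrix; it assumes that eigenvalue is simple, and the matrices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
M = rng.normal(size=(n, n))
A = (M + M.T) / 2                                 # symmetric matrix
H = rng.normal(size=(n, n)); H = (H + H.T) / 2    # symmetric perturbation direction

eigvals, eigvecs = np.linalg.eigh(A)
lam, v = eigvals[-1], eigvecs[:, -1]              # largest (assumed simple) eigenvalue

eps = 1e-6
lam_plus = np.linalg.eigvalsh(A + eps * H)[-1]
lam_minus = np.linalg.eigvalsh(A - eps * H)[-1]
numerical = (lam_plus - lam_minus) / (2 * eps)
analytic = v @ H @ v                              # Rayleigh quotient of H at v

print(np.isclose(numerical, analytic, atol=1e-5))  # True
```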
2. Derivative of the Eigenvector
This part is trickier, and assumes \(\lambda\) is simple (has multiplicity one).
Let \(A\) be symmetric, and suppose:
\[A \mathbf{v}_i = \lambda_i \mathbf{v}_i\]Let \(H\) be the perturbation direction for \(A\).
Then, the directional derivative of the eigenvector \(\mathbf{v}_i\) is:
\[\boxed{ \frac{d\mathbf{v}_i}{d\varepsilon} = \sum_{j \ne i} \frac{\mathbf{v}_j^\top H \mathbf{v}_i}{\lambda_i - \lambda_j}\, \mathbf{v}_j }\]So the rate of change of \(\mathbf{v}_i\) depends on:
- All the other eigenvectors \(\mathbf{v}_j\)
- The spectral gap \(\lambda_i - \lambda_j\)
- How the perturbation \(H\) interacts with them
Why This Matters for PCA
In Principal Component Analysis (PCA), we compute the eigenvectors of a covariance matrix:
\[C = \frac{1}{n} X^\top X\]The first principal component is the eigenvector \(\mathbf{v}_1\) corresponding to the largest eigenvalue.
If your data \(X\) changes (e.g., online updates or batch re-training), the covariance matrix changes too. These derivative formulas help answer:
- How stable are the principal directions to noise?
- How can we differentiate through PCA for optimization (e.g., in differentiable dimensionality reduction)?
These insights lead to:
- Differentiable subspace learning (e.g. in neural networks that embed PCA steps)
- Understanding convergence of online PCA algorithms
- Regularizing PCA when eigenvalues are close together (small spectral gap → unstable eigenvectors)
Real-World Analogy
Think of a guitar string.
- The tension of the string is like your matrix \(A\)
- The frequencies (eigenvalues) change smoothly as you tune the tension
- But the shape of vibration modes (eigenvectors) can flip wildly when two frequencies get close
This is why eigenvector derivatives are more fragile, and why PCA can become unstable when leading eigenvalues are nearly equal.
Closing Notes
- The eigenvalue derivative is simple and intuitive — it’s just the projection of your perturbation onto the eigendirection.
- The eigenvector derivative is messier — but critical when PCA must be differentiated, stabilized, or learned.
- These ideas generalize to singular value decomposition (SVD), which powers many other dimensionality reduction techniques.
Derivatives of Matrix Norms
Regularization’s best friend: the Frobenius norm and how it learns
What’s a Matrix Norm, and Why Do We Care?
In optimization, we often need to measure the “size” of a matrix. This comes up when:
- We want to penalize large weights in linear regression
- We apply weight decay in neural nets
- We impose smoothness or sparsity in optimization problems
The most common choice? The Frobenius norm, defined for a matrix \(X \in \mathbb{R}^{m \times n}\) as:
\[\|X\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} X_{ij}^2} = \sqrt{\operatorname{tr}(X^\top X)}\]This is just the Euclidean norm of the matrix when viewed as a long vector.
The Squared Frobenius Norm
We usually work with the squared Frobenius norm in loss functions, since it’s differentiable everywhere and easier to manipulate:
\[\|X\|_F^2 = \sum_{i,j} X_{ij}^2 = \operatorname{tr}(X^\top X)\]This term frequently appears in regularization penalties. For instance, in Ridge regression:
\[L(\mathbf{w}) = \frac{1}{2m} \|X\mathbf{w} - \mathbf{y}\|^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2\]That second term is just the squared Frobenius norm applied to a vector — which is the same as the squared \(\ell_2\) norm.
The Derivative of the Frobenius Norm
Let’s derive the gradient step by step.
If \(X\) is a scalar:
Nothing new — this is just:
\[\frac{d}{dX} X^2 = 2X\]If \(X\) is a vector:
Same rule applies:
\[\frac{d}{dX} \|X\|^2 = 2X\]Now for a full matrix:
Let \(X \in \mathbb{R}^{m \times n}\). Then:
\[\|X\|_F^2 = \operatorname{tr}(X^\top X)\]Using the identity:
\[\frac{\partial}{\partial X} \operatorname{tr}(X^\top X) = 2X\]we conclude:
\[\boxed{ \frac{\partial}{\partial X} \|X\|_F^2 = 2X }\]This is remarkably clean: same rule applies, no matter the dimension.
Application: Ridge Regression
Ridge regression adds an \(\ell_2\) penalty on the weights:
\[L(\mathbf{w}) = \frac{1}{2m} \|X\mathbf{w} - \mathbf{y}\|^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2\]Take the derivative of the second term:
\[\frac{d}{d\mathbf{w}} \left( \frac{\lambda}{2} \|\mathbf{w}\|^2 \right) = \lambda \mathbf{w}\]This leads to the closed-form solution:
\[\mathbf{w}^* = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}\](with the \(\frac{1}{m}\) factor absorbed into \(\lambda\)). That \(\lambda I\) term comes directly from differentiating the norm penalty.
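A minimal sketch on random data (illustrative names; the \(\frac{1}{m}\) factor is folded into \(\lambda\) as above) that computes the closed-form solution and confirms the ridge gradient vanishes there:

```python
import numpy as np

rng = np.random.default_rng(6)
m, d = 50, 4
X = rng.normal(size=(m, d))
y = rng.normal(size=m)
lam = 0.5

# closed-form ridge solution for L(w) = 1/2 ||Xw - y||^2 + lam/2 ||w||^2
w_star = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# the gradient X^T(Xw - y) + lam * w should vanish at w_star
grad = X.T @ (X @ w_star - y) + lam * w_star
print(np.allclose(grad, 0, atol=1e-10))  # True
```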
Application: Neural Network Weight Decay
In deep learning, weight decay (also called \(\ell_2\) regularization) penalizes large parameter magnitudes.
If your loss is:
\[L(\theta) = \text{CrossEntropy} + \frac{\lambda}{2} \sum_{l} \|W^{(l)}\|_F^2\]then each gradient step gets:
\[\frac{\partial L}{\partial W^{(l)}} = \text{Backprop gradient} + \lambda W^{(l)}\]Again — the derivative of the squared Frobenius norm is just the weight matrix itself, scaled by \(\lambda\).
Takeaways
- The Frobenius norm is just the Euclidean norm generalized to matrices.
- Its squared version has a clean derivative: \(2X\).
- This simplicity makes it a natural choice for regularization in many machine learning models.
- If you’re implementing Ridge regression or weight decay, you’re already using this identity — whether you knew it or not.
Derivatives of Structured Matrices
Why Structure Matters
Most real-world machine learning problems — especially those involving dimensionality reduction, optimization over manifolds, or latent variable models — don’t deal with arbitrary matrices.
They often involve:
- Symmetric matrices (e.g., covariance matrices, Hessians)
- Orthogonal matrices (e.g., PCA loadings, rotation matrices, SVD components)
These structures aren’t accidental. They encode geometric constraints and ensure the results make physical, statistical, or optimization sense.
The challenge? Their derivatives must also respect the structure.
1. Differentiating Symmetric Matrices
Let’s say \(S \in \mathbb{R}^{n \times n}\) is symmetric:
\(S = S^\top\)
Then if you define a scalar-valued function:
\[f(S) = \operatorname{tr}(S A)\]with \(A\) being arbitrary, the naive derivative \(\frac{\partial f}{\partial S} = A\) doesn’t always preserve symmetry.
To ensure the gradient remains symmetric, we must symmetrize the derivative:
\[\boxed{ \frac{\partial f}{\partial S} = \frac{1}{2}(A + A^\top) }\]Why? Because any variation \(dS\) must satisfy \(dS = dS^\top\) — the space of symmetric matrices is a subspace, and we’re projecting gradients onto that subspace.
Practical Example: Covariance Matrix Estimation
Suppose you estimate a sample covariance matrix:
\[\Sigma = \frac{1}{n} X^\top X\]Now you’re optimizing a likelihood involving \(\Sigma\) — but you must ensure it stays symmetric and positive semi-definite.
So, if the loss contains:
\[L = \log \det \Sigma + \operatorname{tr}(A \Sigma)\]you differentiate and then symmetrize the result:
\[\frac{\partial L}{\partial \Sigma} = \Sigma^{-1} + A \quad \Rightarrow \quad \text{Use } \frac{1}{2}(\Sigma^{-1} + A + \Sigma^{-1} + A)^\top\]2. Differentiating Orthogonal Matrices
Let’s now consider orthogonal matrices:
\[Q^\top Q = I\]Common in:
- PCA (principal axes)
- SVD (U and V matrices)
- Rotation layers in computer vision
The constraint introduces nontrivial geometry: the set of orthogonal matrices lies on a manifold, meaning you can’t perturb \(Q\) arbitrarily and still stay on the manifold.
So, if you’re optimizing a loss:
\[L(Q)\]subject to \(Q^\top Q = I\), then you must restrict derivatives to the tangent space of the orthogonal group.
Tangent Space to Orthogonal Matrices
If \(Q \in \mathbb{R}^{n \times n}\) is orthogonal, its tangent vectors \(dQ\) must satisfy:
\[Q^\top dQ + dQ^\top Q = 0\]This is the skew-symmetry condition. So: the directional derivative \(dQ\) must lie in the space of skew-symmetric matrices.
Therefore, any gradient \(\nabla_Q L\) must be projected onto this tangent space.
Projected Gradient on Orthogonal Group
To get a valid gradient that stays on the manifold:
\[\boxed{ \text{Proj}_{\mathcal{T}_Q}(\nabla_Q L) = \nabla_Q L - Q \nabla_Q L^\top Q }\]This is the Riemannian gradient — it tells you how to move along the manifold of orthogonal matrices.
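Here's a small sketch (illustrative) that applies this projection to an arbitrary Euclidean gradient and confirms the result satisfies the tangent-space condition \(Q^\top dQ + dQ^\top Q = 0\):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4
M = rng.normal(size=(n, n))
Q, _ = np.linalg.qr(M)                 # a random orthogonal matrix

G = rng.normal(size=(n, n))            # pretend Euclidean gradient nabla_Q L
proj = G - Q @ G.T @ Q                 # projection onto the tangent space at Q

constraint = Q.T @ proj + proj.T @ Q   # should be (numerically) zero
print(np.allclose(constraint, 0, atol=1e-12))  # True
```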
You’ll encounter this in optimization over SO(n) or Stiefel manifolds — used in:
- Variational inference
- Geometric deep learning
- ICA and dimensionality reduction
Why It Matters
- Symmetric matrices dominate in second-order methods, covariance estimation, and graphical models
- Orthogonal matrices arise in projection methods, PCA, and rotation-based learning algorithms
- Derivatives that ignore structure may leak off the feasible region, making optimization unstable or incorrect
So: when you know your matrices have structure, you must let your gradients know too.
Generalized and Pseudo-Inverses
Why Pseudoinverses?
In many real-world problems, we encounter non-square matrices — where the inverse doesn’t exist. Or worse, ill-conditioned matrices, where the inverse is unstable or poorly defined.
That’s where the Moore–Penrose pseudoinverse comes in. It gives us a principled way to “invert” rectangular or singular matrices and find least-squares solutions.
Definition of the Moore–Penrose Pseudoinverse
Given any matrix \(A \in \mathbb{R}^{m \times n}\), its Moore–Penrose pseudoinverse is denoted:
\[A^+ \in \mathbb{R}^{n \times m}\]and satisfies four conditions:
- \[A A^+ A = A\]
- \[A^+ A A^+ = A^+\]
- \[(A A^+)^\top = A A^+\]
- \[(A^+ A)^\top = A^+ A\]
This defines a unique generalized inverse.
Computing the Pseudoinverse via SVD
If we compute the singular value decomposition:
\[A = U \Sigma V^\top\]where:
- \[U \in \mathbb{R}^{m \times m}\]
- \[V \in \mathbb{R}^{n \times n}\]
- \(\Sigma \in \mathbb{R}^{m \times n}\) is diagonal with singular values \(\sigma_i\)
Then the pseudoinverse is:
\[A^+ = V \Sigma^+ U^\top\]where:
- \(\Sigma^+\) is obtained by inverting each non-zero singular value \(\sigma_i\) and transposing the matrix.
If \(\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r, 0, \dots, 0)\), then:
\[\Sigma^+ = \operatorname{diag}\left(\frac{1}{\sigma_1}, \dots, \frac{1}{\sigma_r}, 0, \dots, 0\right)^\top\]Application: Least Squares Solution
Suppose you want to solve the system:
\[A \mathbf{x} = \mathbf{b}\]When \(A\) is tall and thin (\(m > n\)) and has full rank, this is overdetermined, and no exact solution may exist.
The best we can do is minimize:
\[\min_{\mathbf{x}} \|A \mathbf{x} - \mathbf{b}\|^2\]The solution is given by:
\[\boxed{ \mathbf{x}^* = A^+ \mathbf{b} }\]And if \(A\) has full column rank, we recover the familiar normal equation:
\[A^+ = (A^\top A)^{-1} A^\top\]This is the expression used in closed-form linear regression.
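A quick sketch comparing the SVD-based pseudoinverse, the normal-equation form, and NumPy's least-squares solver on a random overdetermined system (illustrative data):

```python
import numpy as np

rng = np.random.default_rng(8)
m, n = 10, 3
A = rng.normal(size=(m, n))            # tall, full column rank (generically)
b = rng.normal(size=m)

x_pinv = np.linalg.pinv(A) @ b                          # SVD-based pseudoinverse
x_normal = np.linalg.solve(A.T @ A, A.T @ b)            # (A^T A)^{-1} A^T b
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]          # library least squares

print(np.allclose(x_pinv, x_normal))   # True
print(np.allclose(x_pinv, x_lstsq))    # True
```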
Application: Ridge Regression and Stability
When \(A^\top A\) is ill-conditioned, pseudoinverses help with regularization.
In Ridge regression, we replace:
\[(A^\top A)^{-1} A^\top\]with:
\[(A^\top A + \lambda I)^{-1} A^\top\]This is a regularized pseudoinverse, and it’s more numerically stable — small singular values are not inverted all the way.
Geometric Interpretation
Let’s suppose:
- \(A\) maps from \(\mathbb{R}^n\) to \(\mathbb{R}^m\)
- We want to find a solution \(\mathbf{x}\) such that \(A \mathbf{x} \approx \mathbf{b}\)
Then:
- \(A^+ \mathbf{b}\) is the point in the column space of A that’s closest to \(\mathbf{b}\)
- Among all such solutions, it finds the one with minimum Euclidean norm
This is what makes \(A^+\) the canonical least-norm solution.
Takeaways
- The pseudoinverse is central to least squares, linear models, and underdetermined systems
- It extends matrix inversion to non-square or rank-deficient matrices
- Using SVD to compute \(A^+\) is numerically stable, especially when \(A^\top A\) is near singular
- Pseudoinverses show up in Ridge regression, low-rank approximations, and even deep learning for layer inversion and model analysis
Matrix Decompositions and Their Derivatives
Breaking matrices apart — and tracking how they move when perturbed
Why Matrix Decompositions Matter in ML
Matrix decompositions are how we understand, compress, and solve linear systems — especially when direct inversion or full eigendecompositions are too expensive or unstable.
You’ll see:
- LU decomposition in solvers and backpropagation through linear systems
- QR decomposition in optimization problems with orthogonality constraints
- SVD in PCA, collaborative filtering, and matrix factorization
But beyond just computing them, we often need to differentiate through these decompositions — especially in deep learning or probabilistic models where intermediate matrix steps are part of the learnable pipeline.
1. LU Decomposition and Derivatives
The LU decomposition writes a square matrix as:
\[A = L U\]where:
- \(L\) is lower triangular with ones on the diagonal
- \(U\) is upper triangular
This is primarily used in solving linear systems efficiently, but not all matrices have an LU decomposition without row pivoting.
If:
\[A \in \mathbb{R}^{n \times n}, \quad A = LU\]and we perturb \(A \rightarrow A + \delta A\), then under certain regularity conditions:
\[\delta A = \delta L \cdot U + L \cdot \delta U\]One can derive expressions for \(\delta L\) and \(\delta U\) using triangular projection operators, but LU is not differentiable everywhere, especially where row swaps (pivoting) occur.
Hence: LU derivatives are useful for solvers, but less common in end-to-end differentiable models.
2. QR Decomposition and Derivatives
Every matrix \(A \in \mathbb{R}^{m \times n}\) (\(m \geq n\)) has a QR decomposition:
\[A = Q R\]where:
- \(Q \in \mathbb{R}^{m \times n}\) has orthonormal columns
- \(R \in \mathbb{R}^{n \times n}\) is upper triangular
This is powerful when you want to orthonormalize matrices — used in:
- Least squares solvers
- Constrained optimization on the Stiefel manifold
- Regression with orthogonality constraints
Derivative (Sketch)
If \(A = QR\) and we perturb \(A \rightarrow A + \delta A\), then:
\[\delta A = \delta Q \cdot R + Q \cdot \delta R\]But since \(Q^\top Q = I\), \(\delta Q\) must lie in the tangent space to the Stiefel manifold. Therefore:
\[Q^\top \delta Q + \delta Q^\top Q = 0\]So \(Q^\top \delta Q\) must be skew-symmetric, the same tangent-space condition we saw for orthogonal matrices.
Libraries like PyTorch and JAX now implement differentiable QR for use in deep learning layers.
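For example, the following PyTorch sketch backpropagates through `torch.linalg.qr` (gradient support assumes the input has full column rank); the scalar loss here is purely illustrative:

```python
import torch

torch.manual_seed(0)
A = torch.randn(6, 3, requires_grad=True)

Q, R = torch.linalg.qr(A)     # reduced QR: Q is 6x3 with orthonormal columns
loss = Q.sum()                # any scalar function of Q (illustrative)
loss.backward()               # gradients flow back to A through the factorization

print(A.grad.shape)           # torch.Size([6, 3])
```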
3. SVD and Its Derivatives
This is the most celebrated decomposition:
\[A = U \Sigma V^\top\]where:
- \(U \in \mathbb{R}^{m \times m}\), orthogonal
- \(\Sigma \in \mathbb{R}^{m \times n}\), diagonal (singular values)
- \(V \in \mathbb{R}^{n \times n}\), orthogonal
It underpins:
- PCA (projecting onto directions of maximal variance)
- Low-rank approximations
- Collaborative filtering, topic models, embedding compression
Derivatives of SVD
Suppose we perturb \(A \rightarrow A + \delta A\) and want to track how \(U, \Sigma, V\) change.
This is non-trivial because:
- The singular values may be non-differentiable when repeated
- The singular vectors (columns of \(U\) and \(V\)) may rotate discontinuously
Still, under full-rank, non-degenerate singular values, we have:
Derivative of \(\Sigma\):
Let \(\sigma_i\) be the \(i^{th}\) singular value.
Then:
\[\delta \sigma_i = u_i^\top \, \delta A \, v_i\]That is: the change in the \(i^{th}\) singular value is the projection of \(\delta A\) onto the \(i^{th}\) left and right singular vectors.
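A quick finite-difference check of this identity in NumPy (matrix sizes and the perturbation direction are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 4))
dA = rng.standard_normal((5, 4))      # an arbitrary perturbation direction
eps = 1e-6

U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_pert = np.linalg.svd(A + eps * dA, compute_uv=False)

i = 0                                 # track the leading singular value
predicted = U[:, i] @ dA @ Vt[i, :]   # u_i^T (dA) v_i
measured = (s_pert[i] - s[i]) / eps   # finite-difference estimate
print(predicted, measured)            # should agree to several decimal places
```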
Derivatives of \(U, V\):
These are more involved: they require the singular values to be distinct and introduce correction terms that blow up as the gaps between singular values shrink.
If you’re building optimization algorithms or probabilistic models involving SVD, automatic differentiation tools like Autograd or JAX and references such as The Matrix Cookbook can help.
Application: PCA via SVD
To perform PCA:
- Center the data: \(X \leftarrow X - \bar{X}\)
- Compute SVD: \(X = U \Sigma V^\top\)
- Take top \(k\) singular vectors: \(V_k\)
- Project: \(Z = X V_k\)
This compresses data to \(k\) dimensions.
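A compact NumPy sketch of these four steps (the data and the choice \(k = 3\) are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 10))                   # 200 samples, 10 features
k = 3

Xc = X - X.mean(axis=0)                              # 1. center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)    # 2. SVD of the centered data
Vk = Vt[:k].T                                        # 3. top-k right singular vectors (10 x 3)
Z = Xc @ Vk                                          # 4. project onto k principal directions

print(Z.shape)                                       # (200, 3)
```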
Now, suppose your PCA directions are part of a learnable pipeline (e.g., in unsupervised learning). You need to differentiate through the SVD, which is why automatic SVD gradients are crucial in deep learning libraries.
Summary Table
Decomposition | Form | Used For | Derivative Highlights |
---|---|---|---|
LU | \(A = LU\) | Linear system solvers | Non-differentiable at pivot points |
QR | \(A = QR\) | Orthonormal basis, least squares | Project derivatives onto Stiefel manifold |
SVD | \(A = U \Sigma V^\top\) | PCA, low-rank approximation | Delicate near degenerate singular values |
Final Notes
- Matrix decompositions are not just algebraic tools — they appear in backpropagation, variational inference, and matrix estimation tasks
- Differentiating through decompositions allows end-to-end learning when layers include matrix ops like PCA, whitening, or factorization
- Automatic differentiation libraries now support SVD/QR gradients, but knowing the theory helps when things go wrong
Multivariate Distributions and Matrix Calculus
From Multivariate Gaussians to Matrix Normals
Let’s start with the familiar.
A multivariate normal distribution for a vector \(\mathbf{x} \in \mathbb{R}^d\) looks like:
\[\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)\]Now suppose instead of a single \(\mathbf{x}\), we observe an entire matrix \(X \in \mathbb{R}^{n \times p}\) — say, \(n\) samples and \(p\) variables — and we want to model correlation across both rows and columns.
This leads us to the matrix normal distribution.
The Matrix Normal Distribution
Let \(X \in \mathbb{R}^{n \times p}\). We say:
\[X \sim \mathcal{MN}(M, U, V)\]if:
\[\operatorname{vec}(X) \sim \mathcal{N}\bigl(\operatorname{vec}(M),\, V \otimes U \bigr)\]That is:
- \(M \in \mathbb{R}^{n \times p}\) is the mean matrix
- \(U \in \mathbb{R}^{n \times n}\) is the row covariance
- \(V \in \mathbb{R}^{p \times p}\) is the column covariance
The full density:
\[p(X) = \frac{1}{(2\pi)^{np/2} |U|^{p/2} |V|^{n/2}} \exp\left( -\frac{1}{2} \operatorname{tr}\left( U^{-1} (X - M) V^{-1} (X - M)^\top \right) \right)\]Why Matrix Calculus?
Let’s say we’re doing maximum likelihood estimation. We have a dataset \(\{X^{(i)}\}\), and we want to learn the MLE of \(M, U, V\). Or perhaps we’re doing Bayesian inference, and need the gradient of the log-likelihood for variational optimization.
In either case, we must compute derivatives of the form:
- \[\frac{\partial}{\partial M} \log p(X)\]
- \[\frac{\partial}{\partial U} \log p(X)\]
- \[\frac{\partial}{\partial V} \log p(X)\]
This is where trace derivatives, log-determinant derivatives, and matrix inverse derivatives come together — all the things we’ve built so far.
Gradient w.r.t. the Mean Matrix
Let’s compute the gradient of the log-likelihood w.r.t. the mean matrix \(M\).
From the density above:
\[\log p(X) = \text{const} - \frac{1}{2} \operatorname{tr}\left( U^{-1} (X - M) V^{-1} (X - M)^\top \right)\]Use matrix calculus:
Let \(E = X - M\), so that \(\frac{\partial E}{\partial M} = -I\). Then:
\[\begin{aligned} \frac{\partial}{\partial M} \log p(X) &= -\frac{1}{2} \frac{\partial}{\partial M} \operatorname{tr}\left( U^{-1} E V^{-1} E^\top \right) \\ &= -\frac{1}{2} \cdot \left( -2\, U^{-1} E V^{-1} \right) \\ &= U^{-1} (X - M) V^{-1} \end{aligned}\]
So the gradient is:
\[\boxed{ \frac{\partial \log p(X)}{\partial M} = U^{-1} (X - M) V^{-1} }\]This matches the familiar vector case, where \(\nabla_{\boldsymbol{\mu}} \log \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})\) points from the mean toward the observation.
Gradient w.r.t. Covariance Matrices
Here’s where things get a bit trickier — but powerful.
Let’s differentiate:
\[\log p(X) = \text{const} - \frac{p}{2} \log |U| - \frac{n}{2} \log |V| - \frac{1}{2} \operatorname{tr}\left( U^{-1} E V^{-1} E^\top \right)\]For the row covariance \(U\):
Using:
- \[\frac{\partial \log |U|}{\partial U} = U^{-1}\]
- \[\frac{\partial}{\partial U} \operatorname{tr}(U^{-1} A) = -U^{-1} A U^{-1}\]
We get:
\[\frac{\partial \log p(X)}{\partial U} = - \frac{p}{2} U^{-1} + \frac{1}{2} U^{-1} E V^{-1} E^\top U^{-1}\]Similarly for \(V\):
\[\frac{\partial \log p(X)}{\partial V} = - \frac{n}{2} V^{-1} + \frac{1}{2} V^{-1} E^\top U^{-1} E V^{-1}\]These terms are structurally symmetric — you just flip \(E\).
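To sanity-check these expressions (including the sign on the mean gradient), here is a small PyTorch sketch comparing them against autograd; the sizes and the way \(U, V\) are constructed are illustrative:

```python
import torch

torch.manual_seed(0)
n, p = 4, 3
X = torch.randn(n, p)
M = torch.randn(n, p, requires_grad=True)
Lu, Lv = torch.randn(n, n), torch.randn(p, p)
U = (Lu @ Lu.T + n * torch.eye(n)).requires_grad_(True)   # symmetric positive definite row covariance
V = (Lv @ Lv.T + p * torch.eye(p)).requires_grad_(True)   # symmetric positive definite column covariance

E = X - M
# Log-likelihood up to additive constants (constants do not affect the gradients)
quad = torch.trace(torch.linalg.solve(U, E) @ torch.linalg.solve(V, E.T))
logp = -0.5 * p * torch.logdet(U) - 0.5 * n * torch.logdet(V) - 0.5 * quad
logp.backward()

with torch.no_grad():
    Uinv, Vinv = torch.linalg.inv(U), torch.linalg.inv(V)
    Ed = X - M
    grad_M = Uinv @ Ed @ Vinv                                        # U^{-1}(X - M)V^{-1}
    grad_U = -0.5 * p * Uinv + 0.5 * Uinv @ Ed @ Vinv @ Ed.T @ Uinv  # formula for d log p / dU
    print(torch.allclose(M.grad, grad_M, atol=1e-4))
    print(torch.allclose(U.grad, grad_U, atol=1e-4))
```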
Application: Probabilistic PCA
Probabilistic PCA (Tipping & Bishop, 1999) models data as:
\[X = Z W^\top + \epsilon\]where \(Z\) is a latent matrix and \(\epsilon \sim \mathcal{MN}(0, I, \sigma^2 I)\).
Then:
\[p(X \mid W, \sigma^2) = \mathcal{MN}(0, I, WW^\top + \sigma^2 I)\]Optimizing this log-likelihood (via EM or gradient methods) requires the matrix derivatives above — especially w.r.t. the covariance.
The Matrix t-Distribution
A heavier-tailed cousin of the matrix normal, often used in robust models.
If:
\[X \sim \mathcal{MT}(M, U, V, \nu)\]Then:
\[p(X) \propto |U|^{-p/2} |V|^{-n/2} \left[1 + \frac{1}{\nu} \operatorname{tr}(U^{-1}(X - M)V^{-1}(X - M)^\top)\right]^{-(\nu + np)/2}\]Similar derivatives can be computed using matrix calculus — just with heavier algebra.
Summary Table
Quantity | Derivative | Interpretation |
---|---|---|
Mean matrix \(M\) | \(U^{-1}(X - M)V^{-1}\) | Gradient of log-likelihood |
Row covariance \(U\) | \(-\frac{p}{2} U^{-1} + \frac{1}{2} U^{-1} E V^{-1} E^\top U^{-1}\) | MLE or MAP gradient |
Column covariance \(V\) | \(-\frac{n}{2} V^{-1} + \frac{1}{2} V^{-1} E^\top U^{-1} E V^{-1}\) | Symmetric to \(U\) |
Final Thoughts
The matrix normal distribution is more than a notational trick. It’s a powerful modeling tool that captures structure across rows and columns — essential in multivariate regression, matrix-variate factor models, and Gaussian processes.
And once you understand how to differentiate through it, you unlock a full spectrum of applications in statistical learning, Bayesian inference, and deep probabilistic models.
We’ve spent the past several sections laying down the mathematical bricks — gradients, Jacobians, Kronecker products, trace tricks, pseudoinverses. It’s a beautiful edifice of abstract computation. But machine learning isn’t built on equations alone.
Every time you train a neural network, minimize a loss, or regularize a model — you’re using these derivatives.
Now it’s time to flip the switch. Let’s move from how matrix calculus works to where it works.
This section will walk through applications where matrix calculus powers core algorithms.
Application 1: Backpropagation with Matrix Calculus
Backpropagation is often introduced as a recursive algorithm involving layers, local gradients, and lots of chain rule applications. That’s true, but at its heart, backprop is really just vectorized calculus over a composite matrix function.
And once we understand the right rules — like how to differentiate a scalar loss with respect to a matrix — the entire algorithm becomes cleaner, easier to generalize, and surprisingly elegant.
Let’s break it down from scratch.
The Setting: A 2-Layer Neural Network
Consider a simple neural network with:
- One hidden layer using ReLU activation
- A final output layer with linear activation (for regression)
The forward pass looks like this:
\[\begin{aligned} Z^{(1)} &= X W^{(1)} + \mathbf{b}^{(1)} \\ A^{(1)} &= \phi(Z^{(1)}) \quad \text{(ReLU)} \\ Z^{(2)} &= A^{(1)} W^{(2)} + \mathbf{b}^{(2)} \\ \hat{Y} &= Z^{(2)} \\ L &= \frac{1}{2m} \| \hat{Y} - Y \|^2 \end{aligned}\]Where:
- \(X \in \mathbb{R}^{m \times d}\): Input batch
- \(W^{(1)} \in \mathbb{R}^{d \times h}\): Weights from input to hidden layer
- \(W^{(2)} \in \mathbb{R}^{h \times o}\): Weights from hidden to output
- \(\phi(\cdot)\): ReLU activation
- \(L\): Scalar loss (MSE)
Step-by-Step Derivatives Using Matrix Calculus
We now compute the gradients of the loss \(L\) with respect to the weights and biases.
Step 1: Derivative w.r.t. \(Z^{(2)}\)
Since:
\[L = \frac{1}{2m} \| Z^{(2)} - Y \|^2\]We get:
\[\frac{\partial L}{\partial Z^{(2)}} = \frac{1}{m} (Z^{(2)} - Y) =: \delta^{(2)}\]Step 2: Gradient w.r.t. \(W^{(2)}\)
From:
\[Z^{(2)} = A^{(1)} W^{(2)} + \mathbf{b}^{(2)}\]Using:
\[\frac{\partial L}{\partial W^{(2)}} = (A^{(1)})^\top \delta^{(2)}\]Similarly, for the bias:
\[\frac{\partial L}{\partial \mathbf{b}^{(2)}} = \mathbf{1}^\top \delta^{(2)} \quad \text{(sum of the rows of } \delta^{(2)} \text{, i.e., over the batch)}\]Step 3: Backpropagate to Hidden Layer
We compute the upstream gradient:
\[\delta^{(1)} = \delta^{(2)} (W^{(2)})^\top \odot \phi'(Z^{(1)})\]Where:
- \(\odot\): Hadamard (elementwise) product
- \(\phi'(Z^{(1)})\): Derivative of ReLU, i.e., indicator for positive entries in \(Z^{(1)}\)
Step 4: Gradient w.r.t. \(W^{(1)}\)
\[\frac{\partial L}{\partial W^{(1)}} = X^\top \delta^{(1)}\]And for the bias:
\[\frac{\partial L}{\partial \mathbf{b}^{(1)}} = \mathbf{1}^\top \delta^{(1)} \quad \text{(again summing over the batch)}\]Recap: What Matrix Calculus Let Us Do
Without breaking into individual neuron derivatives, we used these matrix calculus identities:
- \(\frac{\partial L}{\partial W} = X^\top G\), when \(Z = XW\)
- Chain rule for \(\delta = \text{upstream} \times \text{local derivative}\)
- Elementwise application of nonlinear derivatives (like ReLU)
This gave us fully vectorized gradient expressions — scalable, efficient, and readable.
Summary Table
Quantity | Matrix Derivative | Dimensions |
---|---|---|
$$ \frac{\partial L}{\partial W^{(2)}} $$ | $$ (A^{(1)})^\top \delta^{(2)} $$ | $$ h \times o $$ |
$$ \frac{\partial L}{\partial W^{(1)}} $$ | $$ X^\top \delta^{(1)} $$ | $$ d \times h $$ |
$$ \delta^{(1)} $$ | $$ \delta^{(2)} (W^{(2)})^\top \odot \phi'(Z^{(1)}) $$ | $$ m \times h $$ |
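Here is a minimal NumPy sketch of this forward and backward pass for one batch (shapes and initialization are illustrative; no training loop):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, h, o = 32, 5, 8, 2                       # batch, input, hidden, output sizes (illustrative)
X, Y = rng.standard_normal((m, d)), rng.standard_normal((m, o))
W1, b1 = rng.standard_normal((d, h)) * 0.1, np.zeros(h)
W2, b2 = rng.standard_normal((h, o)) * 0.1, np.zeros(o)

# Forward pass
Z1 = X @ W1 + b1
A1 = np.maximum(Z1, 0.0)                       # ReLU
Z2 = A1 @ W2 + b2
L = 0.5 / m * np.sum((Z2 - Y) ** 2)            # MSE loss

# Backward pass: exactly the matrix expressions derived above
delta2 = (Z2 - Y) / m                          # dL/dZ2
dW2 = A1.T @ delta2                            # (A1)^T delta2, shape (h, o)
db2 = delta2.sum(axis=0)                       # sum over the batch dimension
delta1 = (delta2 @ W2.T) * (Z1 > 0)            # upstream gradient, masked by ReLU'
dW1 = X.T @ delta1                             # X^T delta1, shape (d, h)
db1 = delta1.sum(axis=0)

print(dW1.shape, dW2.shape)                    # (5, 8) (8, 2)
```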
Application 2: Closed-form Solutions in Linear Models
Problem Setup: Linear Regression
Suppose we’re given a dataset of \(m\) samples:
- Inputs: \(X \in \mathbb{R}^{m \times d}\) — each row is a feature vector
- Targets: \(y \in \mathbb{R}^{m \times 1}\) — a column vector of real-valued labels
- Weights to learn: \(\mathbf{w} \in \mathbb{R}^{d \times 1}\)
Our prediction is:
\[\hat{y} = X \mathbf{w}\]And the mean squared error loss is:
\[L(\mathbf{w}) = \frac{1}{2m} \left\| X \mathbf{w} - y \right\|^2\]This is a scalar function of a vector variable \(\mathbf{w}\), which makes it ideal for matrix calculus.
Goal
We want to minimize \(L(\mathbf{w})\) by finding the best weight vector \(\mathbf{w}^\star\) that solves:
\[\min_{\mathbf{w}} \; L(\mathbf{w})\]And we want to do this analytically, using calculus.
Step 1: Expand the Loss
Let’s first express the loss more compactly using inner products:
\[L(\mathbf{w}) = \frac{1}{2m} (X \mathbf{w} - y)^\top (X \mathbf{w} - y)\]We’ll now differentiate this with respect to \(\mathbf{w}\).
Step 2: Differentiate Using Matrix Calculus
We use the standard identity:
\[\frac{d}{d\mathbf{w}} \left( \mathbf{w}^\top A \mathbf{w} \right) = (A + A^\top) \mathbf{w}\](If \(A\) is symmetric, this reduces to \(2A \mathbf{w}\))
We also use:
\[\frac{d}{d\mathbf{w}} \left( \mathbf{a}^\top \mathbf{w} \right) = \mathbf{a}\]Let’s differentiate:
\[\begin{aligned} L(\mathbf{w}) &= \frac{1}{2m} \left[ \mathbf{w}^\top X^\top X \mathbf{w} - 2 y^\top X \mathbf{w} + y^\top y \right] \\ \frac{\partial L}{\partial \mathbf{w}} &= \frac{1}{2m} \left( 2 X^\top X \mathbf{w} - 2 X^\top y \right) \\ &= \frac{1}{m} \left( X^\top X \mathbf{w} - X^\top y \right) \end{aligned}\]Step 3: Set Gradient to Zero
To find the minimum, we set the gradient to zero:
\[X^\top X \mathbf{w} = X^\top y\]Assuming \(X^\top X\) is invertible, we get the normal equation:
\[\boxed{ \mathbf{w}^\star = (X^\top X)^{-1} X^\top y }\]That’s our closed-form solution.
When Is This Useful?
- For small to medium datasets, this is extremely fast — no need for iterative solvers.
- It appears in OLS regression, Ridge regression (with a slight tweak), PCA, and many classical models.
Important Notes
- Computational cost: Inverting \(X^\top X\) costs \(\mathcal{O}(d^3)\) (plus \(\mathcal{O}(m d^2)\) to form it), so this is rarely used directly in high-dimensional ML — but it remains foundational for theory and diagnostics.
- Numerical stability: Instead of computing the inverse explicitly, solve the linear system
\[\mathbf{w}^\star = \text{solve}(X^\top X, X^\top y)\]using a QR or Cholesky factorization (see the short sketch after this list).
- Connection to projections: This solution projects \(y\) onto the column space of \(X\), so the residual is orthogonal to that column space (and hence to the fitted predictions):
\[X^\top (\hat{y} - y) = 0\]
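A short NumPy sketch tying these notes together on synthetic data (sizes and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
m, d = 200, 6
X = rng.standard_normal((m, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(m)

w_normal = np.linalg.solve(X.T @ X, X.T @ y)      # normal equation, without an explicit inverse
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # QR/SVD-based solver

print(np.allclose(w_normal, w_lstsq))
residual = X @ w_normal - y
print(np.abs(X.T @ residual).max())               # ~0: residual orthogonal to the columns of X
```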
Summary Table
Expression | Interpretation |
---|---|
$$ L(\mathbf{w}) = \frac{1}{2m} \| X \mathbf{w} - y \|^2 $$ | Quadratic loss function |
$$ \nabla_{\mathbf{w}} L = \frac{1}{m} (X^\top X \mathbf{w} - X^\top y) $$ | Gradient via matrix calculus |
$$ \mathbf{w}^\star = (X^\top X)^{-1} X^\top y $$ | Closed-form solution (normal equation) |
Application 3: Convolutional Neural Networks (CNNs)
Gradient computation through convolutional layers using matrix calculus
Convolutional Neural Networks revolutionized computer vision by mimicking how we process spatial patterns — edges, corners, textures — through local filters. But for these filters to learn, they must be differentiable. And not just scalar-by-scalar differentiable, but differentiable as matrix functions that act over batched inputs, multiple channels, and shared parameters.
This is where matrix calculus flexes its power. Rather than deriving gradients through messy loops or hand-coding derivatives for every layer, we treat the entire convolution as a composed matrix transformation — and derive gradients cleanly, efficiently, and fully vectorized.
1D Convolution
Let’s begin with a minimalist setup to build intuition: a 1D convolution with one input row and one filter.
Step 1: The Forward Pass in Matrix Clothing
Suppose you have an input vector:
\[X = \begin{bmatrix} x_1 & x_2 & x_3 & x_4 & x_5 \end{bmatrix} \in \mathbb{R}^{1 \times 5}\]and a filter:
\[w = \begin{bmatrix} w_1 & w_2 & w_3 \end{bmatrix} \in \mathbb{R}^{1 \times 3}\]With stride 1 and no padding, the filter moves across 3 positions. At each step, it “sees” a window of 3 input elements — called a receptive field.
We extract these into overlapping patches:
\[X_{\text{col}} = \begin{bmatrix} x_1 & x_2 & x_3 \\ x_2 & x_3 & x_4 \\ x_3 & x_4 & x_5 \end{bmatrix} \in \mathbb{R}^{3 \times 3}\]This reshaping — often implemented as im2col
— converts the convolution into a standard matrix product:
Each row of \(Z\) is the dot product between a filter and a sliding window from the input — i.e., a classic 1D convolution output.
Step 2: The Backward Pass — Let the Gradients Flow
Now suppose we compute some scalar loss \(L = \mathcal{L}(Z)\) (like MSE or cross-entropy).
Let \(G = \frac{\partial L}{\partial Z}\) be the gradient flowing back from the loss to the convolution output.
Matrix calculus gives us:
- Gradient w.r.t. the filter:
\[\frac{\partial L}{\partial w} = G^\top X_{\text{col}} \in \mathbb{R}^{1 \times 3}\]
- Gradient w.r.t. the input patches:
\[\frac{\partial L}{\partial X_{\text{col}}} = G \cdot w \in \mathbb{R}^{3 \times 3}\]
This gradient can be mapped back to the original 1D input space using a `col2im` operation.
Sidenote: In PyTorch or TensorFlow, this step is automatically handled by the backward pass of the convolution op, but under the hood, it’s doing exactly this — summing overlapping contributions from each patch.
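Here is a tiny NumPy sketch of this im2col forward/backward pattern (the input, filter, and the all-ones upstream gradient are made up; 1-D arrays stand in for the row vectors above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])           # input, length 5
w = np.array([0.5, -1.0, 2.0])                    # filter, length 3

# im2col: stack the three sliding windows as rows
X_col = np.stack([x[i:i + 3] for i in range(3)])  # shape (3, 3)
Z = X_col @ w                                     # forward pass: shape (3,)

G = np.ones(3)                                    # pretend upstream gradient dL/dZ
dw = G @ X_col                                    # dL/dw = G^T X_col, shape (3,)
dX_col = np.outer(G, w)                           # dL/dX_col = G w, shape (3, 3)

# col2im: scatter-add the overlapping patch gradients back onto the input
dx = np.zeros_like(x)
for i in range(3):
    dx[i:i + 3] += dX_col[i]
print(Z, dw, dx)
```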
2D Convolution
Now let’s scale up to the 2D case: where both inputs and filters are matrices.
Step 1: Matrix View of 2D Convolution
Imagine a \(3 \times 3\) image being filtered by a \(2 \times 2\) kernel:
\[X = \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{bmatrix} \in \mathbb{R}^{3 \times 3} \quad W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix} \in \mathbb{R}^{2 \times 2}\]Stride 1, no padding.
This gives 4 positions for the filter to slide:
- \[x_{11}, x_{12}, x_{21}, x_{22}\]
- \[x_{12}, x_{13}, x_{22}, x_{23}\]
- \[x_{21}, x_{22}, x_{31}, x_{32}\]
- \[x_{22}, x_{23}, x_{32}, x_{33}\]
We collect these into:
\[X_{\text{col}} = \begin{bmatrix} x_{11} & x_{12} & x_{21} & x_{22} \\ x_{12} & x_{13} & x_{22} & x_{23} \\ x_{21} & x_{22} & x_{31} & x_{32} \\ x_{22} & x_{23} & x_{32} & x_{33} \end{bmatrix} \in \mathbb{R}^{4 \times 4}\]Flatten the filter:
\[w = \begin{bmatrix} w_{11} & w_{12} & w_{21} & w_{22} \end{bmatrix} \in \mathbb{R}^{1 \times 4}\]Now compute:
\[Z = X_{\text{col}} \cdot w^\top \in \mathbb{R}^{4 \times 1} \Rightarrow \text{reshape to } 2 \times 2\]Step 2: Gradients via Matrix Calculus
Assume loss:
\[L = \frac{1}{2} \| Z - T \|^2\]Then:
- Output gradient:
\[G = \frac{\partial L}{\partial Z} = Z - T \in \mathbb{R}^{4 \times 1}\]
- Filter gradient:
\[\frac{\partial L}{\partial w} = G^\top X_{\text{col}} \in \mathbb{R}^{1 \times 4}\]
- Input patch gradient:
\[\frac{\partial L}{\partial X_{\text{col}}} = G \cdot w \in \mathbb{R}^{4 \times 4}\]
To reconstruct \(\frac{\partial L}{\partial X}\) (the gradient in image form), we undo the patchification using `col2im`.
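Continuing the same \(3 \times 3\) example in NumPy (the target \(T\) is arbitrary), checking that the im2col product matches a direct sliding-window computation:

```python
import numpy as np

X = np.arange(1.0, 10.0).reshape(3, 3)            # 3x3 "image"
W = np.array([[1.0, 0.0], [0.0, -1.0]])           # 2x2 kernel

# im2col: each row is one 2x2 patch, flattened row-major
patches = [X[i:i + 2, j:j + 2].ravel() for i in range(2) for j in range(2)]
X_col = np.stack(patches)                         # shape (4, 4)
w = W.ravel()                                     # shape (4,)

Z = (X_col @ w).reshape(2, 2)                     # forward pass via matrix product

# Direct sliding-window (cross-correlation) computation for comparison
Z_direct = np.array([[np.sum(X[i:i + 2, j:j + 2] * W) for j in range(2)] for i in range(2)])
print(np.allclose(Z, Z_direct))                   # True

T = np.zeros((2, 2))                              # arbitrary target for the squared loss
G = (Z - T).reshape(4, 1)                         # dL/dZ as a column, shape (4, 1)
dw = (G.T @ X_col).reshape(2, 2)                  # dL/dW, reshaped back to kernel shape
print(dw)
```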
Stride and Padding: What Changes?
Let:
- \(n\): Input size
- \(k\): Kernel size
- \(s\): Stride
- \(p\): Padding
Then:
\[\text{output size} = \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1\]Implications:
- Larger strides reduce overlap → fewer gradients per input location
- More padding preserves input size → more overlap → more spatial coverage
- Gradient shapes depend entirely on this arithmetic and are accounted for automatically during `im2col`/`col2im`
Weight Sharing and Summation of Gradients
Since the same kernel applies across all spatial locations, the total gradient with respect to the filter is the sum of contributions from each patch:
\[\frac{\partial L}{\partial w} = \sum_{i=1}^P g_i \, x_i\]where \(P\) is the number of patches, \(g_i\) is the upstream gradient at output position \(i\), and \(x_i\) is the \(i^{th}\) patch (the \(i^{th}\) row of \(X_{\text{col}}\)). Or vectorized:
\[\frac{\partial L}{\partial w} = G^\top X_{\text{col}}\]This summation is what gives convolutional filters their spatial generalization: a filter that sharpens an edge in one corner can also learn to do so in another.
Wrapping Up
By now, you’ve seen how matrix calculus fits into machine learning — from basic derivatives to the way gradients move through layers during backpropagation. We’ve looked at Jacobians, vectorization tricks, and how concepts like the trace and determinant show up in real models.
The goal wasn’t to cover every identity — it was to build a way of thinking. One where shapes, dimensions, and transformations become familiar, and where equations start to make sense beyond just symbol pushing.
You’re now in a position to read through a model description and understand where the gradients come from. You can trace the flow of derivatives through a network and spot the logic behind the updates.
And from here, it only gets more interesting. Try applying these ideas to your own models. Look at how your framework handles gradient flow. Work through a few derivations by hand.
This foundation will serve you well — in deep learning, in optimization, and anywhere gradients drive learning.