Linear Algebra - Bonus Topics
- Advanced Matrix Factorizations in Machine Learning
- Orthogonalization Techniques in Machine Learning and Deep Learning
- Kernel Methods and Hilbert Spaces in Machine Learning
- Sparse Matrices and Sparsity in Machine Learning
- Numerical Stability and Conditioning in Machine Learning
- Rotation, Reflection, and Markov Matrices in Machine Learning
- Advanced Projections: Random Projections and the Johnson–Lindenstrauss Lemma
Advanced Matrix Factorizations in Machine Learning
Matrix factorizations are a foundational tool in linear algebra and play a critical role in modern machine learning. They simplify matrix computations, enable numerical stability, and reveal latent structures in data. This post explores several advanced matrix factorizations used in machine learning and deep learning, with complete mathematical derivations and practical insights.
This section covers:
- LU Decomposition
- Cholesky Decomposition
- QR Decomposition
- Non-negative Matrix Factorization (NMF)
1. LU Decomposition
Definition
LU decomposition factors a square matrix \(A \in \mathbb{R}^{n \times n}\) into the product of two matrices:
\[A = LU\]Where:
- \(L\) is a lower triangular matrix with ones on its diagonal (\(L_{ii} = 1\)),
- \(U\) is an upper triangular matrix.
If row pivoting is required, the decomposition is written as:
\[PA = LU\]Where \(P\) is a permutation matrix.
Algorithm
Given:
\[A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix},\]we use Gaussian elimination to eliminate entries below the diagonal. The multipliers used are stored in \(L\), and the remaining upper triangular form becomes \(U\).
For each row \(j > i\), compute the multiplier:
\[l_{ji} = \frac{a_{ji}}{a_{ii}}\]Then update row \(j\):
\[a_{j\cdot} = a_{j\cdot} - l_{ji} \cdot a_{i\cdot}\]After completing all iterations, the matrix \(A\) is factorized into \(L\) and \(U\).
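To make the elimination steps concrete, here is a minimal NumPy sketch of Doolittle LU factorization without pivoting (the function name `lu_decompose` and the example matrix are ours, for illustration only; it assumes the pivots \(a_{ii}\) are non-zero):

```python
import numpy as np

def lu_decompose(A):
    """Doolittle LU factorization without pivoting (assumes non-zero pivots)."""
    n = A.shape[0]
    L = np.eye(n)
    U = A.astype(float).copy()
    for i in range(n):
        for j in range(i + 1, n):
            L[j, i] = U[j, i] / U[i, i]       # multiplier l_ji
            U[j, i:] -= L[j, i] * U[i, i:]    # eliminate the entry below the pivot
    return L, U

A = np.array([[4.0, 3.0], [6.0, 3.0]])
L, U = lu_decompose(A)
print(np.allclose(L @ U, A))  # True
```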
Applications in Machine Learning
- Efficiently solving systems of equations \(Ax = b\) by solving \(Ly = b\) then \(Ux = y\),
- Matrix inversion,
- Numerical optimization routines in regression and convex problems.
2. Cholesky Decomposition
Definition
Cholesky decomposition applies to symmetric, positive-definite matrices. For a matrix \(A \in \mathbb{R}^{n \times n}\) satisfying:
- \(A = A^T\) (symmetry),
- \(x^T A x > 0\) for all non-zero \(x \in \mathbb{R}^n\) (positive-definiteness),
there exists a unique lower triangular matrix \(L\) such that:
\[A = LL^T\]
Construction
We compute each entry of \(L\) as follows:
For diagonal entries:
\[L_{ii} = \sqrt{A_{ii} - \sum_{k=1}^{i-1} L_{ik}^2}\]For off-diagonal entries where \(i > j\):
\[L_{ij} = \frac{1}{L_{jj}} \left( A_{ij} - \sum_{k=1}^{j-1} L_{ik} L_{jk} \right)\]All entries above the diagonal are zero.
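A minimal NumPy sketch of this construction, checked against `np.linalg.cholesky` (the helper `cholesky_lower` and the test matrix are ours; it assumes the input is symmetric positive definite):

```python
import numpy as np

def cholesky_lower(A):
    """Compute the lower-triangular L with A = L L^T, entry by entry."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for i in range(n):
        for j in range(i + 1):
            s = A[i, j] - L[i, :j] @ L[j, :j]        # subtract the partial sums
            L[i, j] = np.sqrt(s) if i == j else s / L[j, j]
    return L

A = np.array([[4.0, 2.0], [2.0, 3.0]])               # symmetric positive definite
L = cholesky_lower(A)
print(np.allclose(L @ L.T, A))                        # True
print(np.allclose(L, np.linalg.cholesky(A)))          # matches NumPy's factor
```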
Applications in Machine Learning
- Efficient sampling from multivariate Gaussian distributions,
- Gaussian Processes (for inverting the kernel matrix),
- Kalman filters and Bayesian updates.
3. QR Decomposition
Definition
For any matrix \(A \in \mathbb{R}^{m \times n}\) where \(m \geq n\), the QR decomposition expresses \(A\) as:
\[A = QR\]Where:
- \(Q \in \mathbb{R}^{m \times m}\) is orthogonal (\(Q^T Q = I\)),
- \(R \in \mathbb{R}^{m \times n}\) is upper triangular.
For economy-size decomposition (when only \(n\) orthogonal vectors are needed), we use:
\[A = Q_{\text{red}} R, \quad Q_{\text{red}} \in \mathbb{R}^{m \times n}\]
Gram-Schmidt Orthogonalization
Let \(a_1, a_2, \ldots, a_n\) be the columns of \(A\). We generate orthonormal vectors \(q_1, q_2, \ldots, q_n\) using:
- Initialization: \(q_1 = \frac{a_1}{\|a_1\|}\).
- For \(k = 2, \ldots, n\): subtract the projections onto the previous vectors, \(u_k = a_k - \sum_{j=1}^{k-1} \langle a_k, q_j \rangle q_j\), then normalize, \(q_k = \frac{u_k}{\|u_k\|}\).
Then define:
\[Q = [q_1, q_2, \ldots, q_n], \quad R = Q^T A\]
Least Squares via QR
Given an overdetermined system \(Ax = b\), we solve:
\[A = QR \Rightarrow QRx = b\]Then:
\[Rx = Q^T b\]Since \(R\) is upper triangular, this is solved efficiently via back substitution.
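A short NumPy sketch of least squares via QR on synthetic data (the random matrices here are placeholders, not from any real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))            # overdetermined: m = 100 equations, n = 3 unknowns
b = rng.normal(size=100)

Q, R = np.linalg.qr(A)                   # economy-size QR: Q is 100x3, R is 3x3
x = np.linalg.solve(R, Q.T @ b)          # solve the upper-triangular system R x = Q^T b

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```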
Applications in Machine Learning
- Numerically stable solution for linear regression,
- Eigenvalue computations using the QR algorithm,
- Orthonormal basis construction.
4. Non-negative Matrix Factorization (NMF)
Definition
Given a non-negative matrix \(A \in \mathbb{R}^{m \times n}\) (i.e., \(A_{ij} \geq 0\)), NMF seeks matrices \(W \in \mathbb{R}^{m \times k}\) and \(H \in \mathbb{R}^{k \times n}\) such that:
\[A \approx WH\]subject to:
\[W \geq 0, \quad H \geq 0\]
Optimization Problem
The factorization is found by solving:
\[\min_{W, H \geq 0} \|A - WH\|_F^2\]where \(\|\cdot\|_F\) denotes the Frobenius norm.
Multiplicative Update Rules
A common approach (Lee & Seung, 2001) involves the following update rules:
- Update \(H\): \(H \leftarrow H \odot \dfrac{W^T A}{W^T W H}\),
- Update \(W\): \(W \leftarrow W \odot \dfrac{A H^T}{W H H^T}\),
where \(\odot\) denotes element-wise multiplication and the fractions are element-wise divisions.
These updates are applied iteratively until convergence.
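For illustration, a minimal NumPy implementation of these multiplicative updates (the function `nmf`, the small constant `eps` added to avoid division by zero, and the random test matrix are ours; a production implementation would also add convergence checks):

```python
import numpy as np

def nmf(A, k, n_iter=200, eps=1e-10):
    """Minimal multiplicative-update NMF sketch (Lee & Seung style)."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # update H
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # update W
    return W, H

A = np.abs(np.random.default_rng(1).random((20, 10)))  # non-negative toy data
W, H = nmf(A, k=3)
print(np.linalg.norm(A - W @ H))  # reconstruction error shrinks over the iterations
```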
Applications in Machine Learning
- Topic modeling from document-term matrices,
- Collaborative filtering in recommendation systems,
- Image compression and decomposition,
- Clustering with parts-based representation.
Orthogonalization Techniques in Machine Learning and Deep Learning
Orthogonalization plays a central role in linear algebra and is extensively used in various machine learning and deep learning tasks. Whether it’s constructing orthonormal bases, decorrelating features, or stabilizing neural network training, orthogonal structures are powerful due to their numerical and geometric properties.
Now we shall cover:
- The Gram-Schmidt orthogonalization process for constructing orthonormal bases,
- Orthogonal initialization in neural networks for improved stability and convergence.
1. Orthogonalization and Orthonormal Bases
Motivation
Given a set of linearly independent vectors \(\{v_1, v_2, \ldots, v_n\}\) in \(\mathbb{R}^n\), it is often desirable to construct an orthonormal basis \(\{q_1, q_2, \ldots, q_n\}\) that spans the same subspace, where:
- Vectors are orthogonal: \(\langle q_i, q_j \rangle = 0\) for \(i \ne j\),
- Vectors are normalized: \(\|q_i\| = 1\).
Orthonormal bases are easier to work with:
- Projections are straightforward,
- Matrix representations (e.g., \(Q^T Q = I\)) are numerically stable,
- Useful for dimensionality reduction and decorrelation (e.g., PCA).
2. The Gram-Schmidt Process
Definition
The Gram-Schmidt process transforms a set of linearly independent vectors \(\{v_1, v_2, \ldots, v_n\}\) into an orthonormal set \(\{q_1, q_2, \ldots, q_n\}\) spanning the same subspace.
This is achieved by iteratively subtracting projections onto previously computed vectors and normalizing.
Step-by-step Algorithm
Let the input be vectors \(v_1, v_2, \ldots, v_n \in \mathbb{R}^d\).
- Initialize the first orthonormal vector:
\[q_1 = \frac{v_1}{\|v_1\|}\]
- Iterate for \(k = 2, \ldots, n\):
- Project \(v_k\) onto each previous \(q_j\): \(\langle v_k, q_j \rangle q_j\) for \(j = 1, \ldots, k-1\),
- Subtract the projections: \(u_k = v_k - \sum_{j=1}^{k-1} \langle v_k, q_j \rangle q_j\),
- Normalize to get the next orthonormal vector: \(q_k = \frac{u_k}{\|u_k\|}\).
After the process, the vectors \(\{q_1, \ldots, q_n\}\) form an orthonormal basis of the span of \(\{v_1, \ldots, v_n\}\).
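A minimal NumPy sketch of classical Gram-Schmidt (the function name `gram_schmidt` is ours; it assumes the input columns are linearly independent, and it uses the same two vectors as the worked example below):

```python
import numpy as np

def gram_schmidt(V):
    """Classical Gram-Schmidt: columns of V -> orthonormal columns of Q."""
    d, n = V.shape
    Q = np.zeros((d, n))
    for k in range(n):
        u = V[:, k].copy()
        for j in range(k):
            u -= (Q[:, j] @ V[:, k]) * Q[:, j]   # subtract projection onto q_j
        Q[:, k] = u / np.linalg.norm(u)          # normalize
    return Q

V = np.array([[1.0, 1.0], [1.0, 0.0]])           # v1, v2 as columns
Q = gram_schmidt(V)
print(np.round(Q * np.sqrt(2), 3))               # [[1, 1], [1, -1]], as in the example
print(np.allclose(Q.T @ Q, np.eye(2)))           # True
```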
Example
Let:
\[v_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad v_2 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\]Compute:
\[q_1 = \frac{v_1}{\|v_1\|} = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix}\]Project \(v_2\) onto \(q_1\):
\[\langle v_2, q_1 \rangle = \frac{1}{\sqrt{2}}(1 \cdot 1 + 0 \cdot 1) = \frac{1}{\sqrt{2}}\]Then:
\[u_2 = v_2 - \langle v_2, q_1 \rangle q_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix} - \frac{1}{\sqrt{2}} \cdot \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix}\] \[= \begin{bmatrix} 1 \\ 0 \end{bmatrix} - \frac{1}{2} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{1}{2} \\ -\frac{1}{2} \end{bmatrix}\]Normalize:
\[q_2 = \frac{u_2}{\|u_2\|} = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix}\]So the orthonormal basis is:
\[q_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad q_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix}\]
Applications
- QR Decomposition: Gram-Schmidt is used to compute orthogonal matrix \(Q\) in QR.
- PCA and SVD: Constructs orthonormal eigenvectors or singular vectors.
- Feature Embeddings: Ensures orthogonality between embedding dimensions for decorrelation.
3. Orthogonal Initialization in Neural Networks
Motivation
Deep networks are sensitive to weight initialization. Poorly chosen initializations can lead to:
- Vanishing or exploding gradients,
- Poor convergence,
- Suboptimal generalization.
Orthogonal initialization addresses these issues by ensuring that:
- The weight matrix \(W\) preserves the norm of the signal during forward and backward passes,
- Gradients are not distorted across layers.
Mathematical Principle
Let \(W \in \mathbb{R}^{n \times n}\) be an orthogonal matrix, so that:
\[W^T W = WW^T = I\]If input \(x \in \mathbb{R}^n\) is passed through such a weight matrix:
\[y = Wx \Rightarrow \|y\|_2 = \|x\|_2\]Hence, the norm is preserved, avoiding scaling issues layer by layer.
In backpropagation, the gradient flow also remains stable:
\[\frac{\partial L}{\partial x} = W^T \frac{\partial L}{\partial y}\]If \(W\) is orthogonal, the gradient is rotated but not scaled, avoiding vanishing/exploding gradients.
How to Initialize Orthogonal Matrices
- Generate a random matrix \(A\),
- Perform QR decomposition: \(A = QR\),
- Use \(Q\) as the initialization matrix, optionally scaled by a gain factor \(\sigma\): \(W = \sigma Q\).
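A minimal NumPy sketch of this QR-based construction (the helper `orthogonal_init` and the sign correction, which makes the sample uniformly distributed over orthogonal matrices, are ours, for illustration):

```python
import numpy as np

def orthogonal_init(n, gain=1.0, seed=0):
    """QR-based orthogonal initializer sketch: returns gain * Q with Q^T Q = I."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n, n))
    Q, R = np.linalg.qr(A)
    Q *= np.sign(np.diag(R))        # fix column signs (Haar-uniform orthogonal matrix)
    return gain * Q

W = orthogonal_init(4)
print(np.allclose(W.T @ W, np.eye(4)))  # True
```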
Implementing in Deep Learning Libraries
In PyTorch:
import torch.nn as nn
nn.init.orthogonal_(tensor, gain=1.0)
In TensorFlow:
initializer = tf.keras.initializers.Orthogonal(gain=1.0)
Applications
- Recurrent Neural Networks (RNNs): Orthogonal or unitary weight matrices are crucial for long-term memory retention.
- Deep Fully Connected Networks: Improves training dynamics for very deep MLPs.
- Transformer Layers: Can help in initializing dense layers to preserve signal variance.
Orthogonalization is not only a theoretical concept but also a practical tool that:
- Enhances numerical stability,
- Helps decorrelate features,
- Enables better gradient flow in deep neural networks.
Whether through Gram-Schmidt orthogonalization for structured bases or orthogonal initialization for training deep networks, mastering these tools improves both understanding and implementation of modern ML models.
Kernel Methods and Hilbert Spaces in Machine Learning
Kernel methods are powerful tools that enable learning algorithms to operate in high-dimensional feature spaces without explicitly computing those spaces. They are foundational to algorithms like Support Vector Machines (SVMs), Gaussian Processes (GPs), and Kernel PCA. This is achieved through the kernel trick and formalized via the theory of Hilbert spaces and Mercer’s theorem.
In this section, we will cover:
- Inner products in high-dimensional (possibly infinite-dimensional) spaces,
- The concept of Reproducing Kernel Hilbert Spaces (RKHS),
- The kernel trick and its use in ML algorithms,
- Mercer’s theorem and the mathematical foundation of kernels.
1. Inner Products in High-Dimensional Spaces
Motivation
In many ML algorithms, we compute inner products between feature vectors:
\[\langle x, x' \rangle = x^\top x'\]However, linear models using this inner product are limited to linear decision boundaries.
To handle non-linear patterns, we can map the inputs into a higher-dimensional space:
\[\phi: \mathcal{X} \rightarrow \mathcal{H}, \quad x \mapsto \phi(x)\]where \(\mathcal{H}\) is a Hilbert space (a complete inner product space). Then, inner products become:
\[\langle \phi(x), \phi(x') \rangle\]If \(\phi\) maps into a very high-dimensional or infinite-dimensional space, computation becomes infeasible. However, we can avoid this cost using kernels.
2. Kernel Functions and the Kernel Trick
Definition
A kernel function \(k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}\) is defined as:
\[k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}\]This allows computing the inner product without explicitly computing \(\phi(x)\) or \(\phi(x')\).
This is known as the kernel trick.
Examples of Kernel Functions
- Linear kernel: \(k(x, x') = x^\top x'\),
- Polynomial kernel (degree \(d\)): \(k(x, x') = (x^\top x' + c)^d\),
- Radial Basis Function (RBF) / Gaussian kernel: \(k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)\),
- Sigmoid kernel: \(k(x, x') = \tanh(\alpha \, x^\top x' + c)\).
Why It Works
Many algorithms (like SVMs) rely on computing dot products between data points:
- Training: \(x_i^\top x_j\)
- Prediction: \(x^\top x_i\)
By replacing these with \(k(x_i, x_j)\) and \(k(x, x_i)\), we effectively operate in the high-dimensional space without ever computing it explicitly.
This is computationally and memory-efficient and allows fitting complex, non-linear decision boundaries.
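As a concrete illustration, here is a small NumPy sketch that builds an RBF Gram matrix directly from the kernel formula, without ever forming \(\phi(x)\) (the helper `rbf_kernel` and the random data are ours):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2)) -- inner products in an implicit feature space."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_kernel(X, X)
print(K.shape, np.allclose(K, K.T))             # (5, 5) True -- symmetric
print(np.all(np.linalg.eigvalsh(K) > -1e-10))   # positive semi-definite up to round-off
```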
3. Reproducing Kernel Hilbert Space (RKHS)
Definition
A Reproducing Kernel Hilbert Space is a Hilbert space \(\mathcal{H}\) of functions from \(\mathcal{X} \rightarrow \mathbb{R}\) such that:
- For all \(x \in \mathcal{X}\), the evaluation functional \(f \mapsto f(x)\) is continuous.
- There exists a kernel function \(k(x, \cdot) \in \mathcal{H}\) such that for all \(f \in \mathcal{H}\): \(f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}\).
This is called the reproducing property, and it guarantees that the kernel uniquely defines the Hilbert space.
Intuition
RKHS is the space induced by the kernel. Every function in the RKHS can be written in terms of kernels evaluated at training points. This leads to kernel representer theorems and simplifies optimization.
4. Mercer’s Theorem
Statement
Mercer’s Theorem provides a condition under which a function is a valid kernel (i.e., corresponds to an inner product in some Hilbert space).
Let \(k(x, x')\) be a continuous, symmetric, and positive semi-definite kernel on a compact domain \(\mathcal{X} \subset \mathbb{R}^n\). Then:
- There exists a sequence of orthonormal eigenfunctions \(\{\phi_i\}\) and non-negative eigenvalues \(\{\lambda_i\}\) such that: \(k(x, x') = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(x')\).
This shows that a kernel corresponds to an inner product in a (possibly infinite-dimensional) feature space.
Practical Use
Mercer’s Theorem justifies that functions like the RBF or polynomial kernel are valid kernels and thus induce a real Hilbert space with meaningful inner products.
5. Applications in Machine Learning
5.1 Support Vector Machines (SVMs)
The dual form of the SVM optimization problem involves only dot products. With a kernel function:
\[\text{maximize}_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j)\]Predictions are made using:
\[f(x) = \sum_i \alpha_i y_i k(x, x_i) + b\]This enables non-linear classification by implicitly mapping into a high-dimensional space.
5.2 Gaussian Processes (GPs)
A Gaussian Process is a distribution over functions:
\[f(x) \sim \mathcal{GP}(0, k(x, x'))\]The kernel function \(k\) defines the covariance between function values. This allows modeling smoothness, periodicity, and other function properties.
Prediction involves computing:
\[\mathbb{E}[f(x_*)] = k(x_*, X) [K(X, X) + \sigma^2 I]^{-1} y\]Where:
- \(K(X, X)\) is the Gram matrix over the training data using kernel \(k\),
- \(x_*\) is the test input.
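A minimal NumPy sketch of this posterior-mean formula for a zero-mean GP on 1-D toy data (the helpers `rbf` and `gp_posterior_mean`, the noise level, and the sine data are ours, for illustration; real GP libraries also return predictive variances):

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """k(a, b) = exp(-(a - b)^2 / (2 sigma^2)) for 1-D inputs."""
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * sigma ** 2))

def gp_posterior_mean(X_train, y_train, X_test, noise=1e-2):
    """Zero-mean GP posterior mean: k(x_*, X) [K(X, X) + sigma^2 I]^{-1} y."""
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    return rbf(X_test, X_train) @ np.linalg.solve(K, y_train)

X_train = np.linspace(0, 5, 20)
y_train = np.sin(X_train)
X_test = np.array([1.5, 2.5])
print(gp_posterior_mean(X_train, y_train, X_test))  # close to sin(1.5), sin(2.5)
```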
5.3 Kernel PCA
Standard PCA uses the covariance matrix of the (centered) data:
\[C = \frac{1}{n} \sum_{i=1}^n x_i x_i^T\]Kernel PCA generalizes this using the kernel trick by computing the kernel Gram matrix:
\[K_{ij} = k(x_i, x_j)\]Then perform eigen-decomposition on the centered kernel matrix:
\[K_c = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n\]Where \(\mathbf{1}_n\) is the \(n \times n\) matrix with every entry equal to \(1/n\).
The result is non-linear dimensionality reduction using kernel-defined similarities.
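A minimal NumPy sketch of this procedure (the helper `kernel_pca`, the RBF Gram matrix, and the random data are ours; it embeds only the training points and skips refinements such as discarding near-zero eigenvalues):

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Embed training points using the top eigenvectors of the centered kernel matrix."""
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # double centering
    eigvals, eigvecs = np.linalg.eigh(K_c)                # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:n_components]        # keep the largest ones
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))

X = np.random.default_rng(0).normal(size=(30, 5))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                                     # RBF Gram matrix
Z = kernel_pca(K, n_components=2)
print(Z.shape)                                            # (30, 2)
```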
Summary
Concept | Mathematical Foundation | Role in Machine Learning |
---|---|---|
Feature Mapping | \(\phi: x \mapsto \mathcal{H}\) | Transforms data into high-dimensional space |
Kernel Function | \(k(x, x') = \langle \phi(x), \phi(x') \rangle\) | Computes inner products without explicit mapping |
RKHS | Function space induced by a valid kernel | Guarantees expressiveness and optimization theory |
Mercer’s Theorem | \(k(x, x') = \sum \lambda_i \phi_i(x) \phi_i(x')\) | Validates kernel as inner product in Hilbert space |
Kernel Trick | Replace \(\langle x, x' \rangle\) with \(k(x, x')\) | Enables non-linear learning with linear algorithms |
Kernel methods let us apply powerful linear algorithms in high-dimensional non-linear spaces—without computing those spaces directly. The kernel trick, rooted in Hilbert space theory and Mercer’s theorem, forms the backbone of some of the most elegant and effective machine learning algorithms.
Sparse Matrices and Sparsity in Machine Learning
Modern machine learning applications often involve high-dimensional data, where most entries are zero. In such settings, sparsity becomes a powerful property to exploit—both for computational efficiency and for improving generalization.
This section explores:
- Sparse matrix representation and storage,
- The role of sparsity in large-scale ML problems,
- Compressed sensing and sparse coding,
- Applications in recommender systems, NLP, computer vision, and feature selection.
1. Sparse Representations
Definition
A matrix \(A \in \mathbb{R}^{m \times n}\) is called sparse if most of its entries are zero:
\[\text{Sparsity}(A) = \frac{\text{Number of zero entries}}{mn} \approx 1\]Equivalently, it has only a few non-zero entries relative to its total size.
Storage and Efficiency
Instead of storing all \(mn\) entries, we store only the non-zero elements and their indices. Common sparse matrix formats include:
- Compressed Sparse Row (CSR): Stores non-zero values, column indices, and row pointers.
- Compressed Sparse Column (CSC): Similar to CSR, but column-wise.
- Coordinate (COO): Stores each non-zero entry as a tuple \((i, j, A_{ij})\).
These formats enable:
- Fast access to non-zero elements by row (CSR) or by column (CSC),
- Efficient matrix-vector multiplication in \(O(\text{nnz})\) time,
- Reduced memory usage.
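A small SciPy sketch showing CSR storage and an \(O(\text{nnz})\) matrix-vector product (the toy matrix is ours):

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 0, 5]])
A = csr_matrix(dense)                 # stores only the 3 non-zeros plus their indices
print(A.nnz, A.data, A.indices, A.indptr)

x = np.ones(3)
print(A @ x)                          # sparse matrix-vector product: [3. 4. 5.]
```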
Use Cases
- Recommender Systems: User-item interaction matrices are sparse; most users rate very few items.
- NLP: Bag-of-Words, TF-IDF, and one-hot encoding all result in sparse representations.
- Graph ML: Adjacency matrices of large graphs (social networks, web graphs) are sparse.
2. Compressed Sensing
Motivation
Compressed sensing addresses the question:
Can we recover high-dimensional signals from few linear measurements if the signal is sparse?
The answer is yes—under specific conditions.
Problem Setup
Let:
- \(x \in \mathbb{R}^n\) be the original signal, sparse in some basis.
- \(y \in \mathbb{R}^m\) be the observed measurements, where \(m \ll n\).
- \(A \in \mathbb{R}^{m \times n}\) be a measurement matrix.
We want to solve:
\[y = A x \quad \text{with} \quad x \text{ sparse}\]Since \(m < n\), this is underdetermined. But if \(x\) is sparse (i.e., \(\|x\|_0 \ll n\)), recovery is possible.
Recovery via Optimization
The sparse recovery problem is posed as:
\[\min_x \|x\|_0 \quad \text{subject to} \quad y = Ax\]But this is NP-hard. Instead, we solve:
\[\min_x \|x\|_1 \quad \text{subject to} \quad y = Ax\]This is known as Basis Pursuit, and under the Restricted Isometry Property (RIP), it provably recovers the sparse signal.
Lasso (Relaxed version)
If measurements are noisy:
\[y = Ax + \varepsilon\]Then we solve:
\[\min_x \|y - Ax\|_2^2 + \lambda \|x\|_1\]This is the Lasso (Least Absolute Shrinkage and Selection Operator) formulation.
Applications
- Medical imaging: MRI and CT scan reconstruction from fewer samples.
- Signal processing: Denoising, compression.
- NLP and CV: Learning compact word/image representations.
3. Sparse Coding
Definition
Given an input signal \(x \in \mathbb{R}^d\), sparse coding assumes:
\[x \approx D h\]Where:
- \(D \in \mathbb{R}^{d \times k}\) is a dictionary of basis vectors (atoms),
- \(h \in \mathbb{R}^k\) is a sparse code (i.e., few non-zero entries).
The goal is to learn \(D\) and \(h\) such that \(x\) is well-represented with a sparse \(h\).
Optimization Objective
Given training data \(X = [x^{(1)}, \ldots, x^{(n)}]\), we solve:
\[\min_{D, H} \sum_{i=1}^n \left( \|x^{(i)} - D h^{(i)}\|_2^2 + \lambda \|h^{(i)}\|_1 \right)\]Subject to normalization constraints on \(D\) (e.g., \(\|d_j\|_2 \leq 1\)).
This problem is bi-convex and typically solved via alternating minimization:
- Fix \(D\), update sparse codes \(H\).
- Fix \(H\), update dictionary \(D\).
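For illustration, scikit-learn's `DictionaryLearning` performs exactly this kind of alternating optimization; the sizes and hyperparameters below are placeholders, and note that scikit-learn uses the convention \(X \approx H D\) (codes times dictionary) rather than \(x \approx D h\):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

X = np.random.default_rng(0).normal(size=(100, 16))   # 100 signals of dimension 16
dl = DictionaryLearning(n_components=32, alpha=1.0, max_iter=200, random_state=0)
H = dl.fit_transform(X)        # sparse codes, shape (100, 32)
D = dl.components_             # dictionary atoms, shape (32, 16), overcomplete (32 > 16)
print(H.shape, D.shape, np.mean(H == 0))  # most code entries are exactly zero
```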
Interpretation
- Sparse coding learns overcomplete dictionaries that can represent inputs efficiently.
- The sparse codes \(h\) capture salient features with fewer active components.
Applications
- Image processing: denoising, super-resolution, texture synthesis.
- NLP: sparse embedding representations.
- Feature learning: unsupervised pretraining for deep networks.
- Compression: reduced storage and computation for large models.
Summary
Concept | Mathematical Description | Applications |
---|---|---|
Sparse Matrix | Mostly zero entries, stored in CSR/COO format | Recommender systems, NLP, graph ML |
Compressed Sensing | \(y = Ax,\ \|x\|_0 \ll n\), recover \(x\) from \(y\) | Imaging, signal processing, low-data regimes |
Lasso | \(\min_x \|y - Ax\|_2^2 + \lambda \|x\|_1\) | Feature selection, regularization |
Sparse Coding | \(x \approx D h\) with sparse \(h\) | Feature learning, representation compression |
Sparsity is a crucial structural assumption in many ML settings. Whether it’s handling massive sparse data structures, recovering signals from limited observations, or learning compressed representations, sparsity enables scalable and interpretable learning.
The underlying mathematical principles—\(\ell_0\) and \(\ell_1\) norms, underdetermined systems, and structured optimization—form the basis of compressed sensing and sparse coding.
Numerical Stability and Conditioning in Machine Learning
In large-scale machine learning models and numerical computations, numerical stability plays a vital role in ensuring accuracy, convergence, and generalization. When data is high-dimensional, features are correlated, or optimization problems are ill-posed, small numerical errors can lead to large deviations in results.
This section explores:
- Matrix conditioning and condition numbers,
- Their effect on optimization and linear models,
- Regularization strategies like Tikhonov regularization and Ridge regression,
- Real-world applications after each theoretical concept.
1. Matrix Conditioning and Condition Numbers
Definition
Given a matrix \(A \in \mathbb{R}^{n \times n}\), its condition number in the 2-norm is:
\[\kappa(A) = \|A\|_2 \cdot \|A^{-1}\|_2\]If \(A\) is symmetric positive definite:
\[\kappa(A) = \frac{\lambda_{\max}}{\lambda_{\min}}\]Where \(\lambda_{\max}\) and \(\lambda_{\min}\) are the largest and smallest eigenvalues of \(A\).
This value measures the sensitivity of the solution \(x\) to small perturbations in the system \(Ax = b\).
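A tiny NumPy example of this sensitivity (the nearly singular matrix and the perturbation size are chosen only to make the effect visible):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])             # nearly collinear columns
print(np.linalg.cond(A))                  # ~4e4: ill-conditioned

b = np.array([2.0, 2.0001])
x = np.linalg.solve(A, b)                 # [1, 1]
x_pert = np.linalg.solve(A, b + np.array([0.0, 1e-4]))
print(x, x_pert)                          # a tiny change in b moves the solution a lot
```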
Application: Linear Systems and Inversion Stability
In linear regression or least squares, solving:
\[\hat{x} = (X^T X)^{-1} X^T y\]can be unstable if \(X^T X\) is ill-conditioned (e.g., due to multicollinearity).
Real-World Scenarios:
- High-dimensional regression models,
- Polynomial regression (where powers of features become highly correlated),
- PCA: covariance matrix conditioning affects eigenvalue computation.
2. Instability in Optimization
Gradient Descent Sensitivity
Consider optimizing a quadratic loss:
\[f(x) = \frac{1}{2} x^T A x\]Gradient descent update:
\[x_{k+1} = x_k - \eta A x_k\]If \(A\) is ill-conditioned (i.e., eigenvalues vary widely), the optimization will:
- Converge slowly,
- Zigzag across the cost surface,
- Require small learning rates to remain stable.
Application: Neural Network Training
In deep networks, layers may learn at drastically different rates due to ill-conditioning. Common symptoms include:
- Exploding or vanishing gradients,
- Slow convergence even on simple tasks,
- Difficulty tuning learning rates.
Example: In early training of deep MLPs or RNNs, poor weight scaling leads to ill-conditioned Jacobians and Hessians.
3. Tikhonov Regularization
Theory
To solve an ill-posed least squares problem:
\[\min_x \|Ax - b\|_2^2\]we add a regularization term:
\[\min_x \|Ax - b\|_2^2 + \lambda \|x\|_2^2\]This is Tikhonov regularization. The solution becomes:
\[\hat{x}_\lambda = (A^T A + \lambda I)^{-1} A^T b\]This improves conditioning by ensuring the matrix being inverted is better-behaved.
Application: Ill-posed Inverse Problems
Tikhonov regularization is used in:
- Image deblurring and denoising,
- Medical imaging (MRI, CT),
- Physics-based simulations with uncertain measurements.
In ML, it improves:
- Matrix inversion stability in linear models,
- Numerical robustness in batch/mini-batch computations.
4. Ridge Regression
Theory
Ridge regression is a specific case of Tikhonov regularization applied to linear regression. Given \(X \in \mathbb{R}^{n \times d}\) and \(y \in \mathbb{R}^n\):
\[\min_w \|Xw - y\|_2^2 + \lambda \|w\|_2^2\]Solution:
\[w = (X^T X + \lambda I)^{-1} X^T y\]Benefits:
- Stabilizes matrix inversion,
- Reduces overfitting in high-dimensional settings.
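As a quick sanity check, a sketch comparing the closed-form solution above with scikit-learn's `Ridge` (random placeholder data; `fit_intercept=False` so both solve exactly the same problem):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = rng.normal(size=50)
lam = 1.0

w_closed = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)   # (X^T X + lam I)^{-1} X^T y
w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(w_closed, w_sklearn))   # should agree up to solver tolerance
```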
Application: High-Dimensional Linear Models
Ridge regression is essential when:
- The number of features exceeds the number of samples,
- Features are highly correlated (multicollinearity),
- Predictors are noisy or redundant.
Example:
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
model.fit(X, y)
Common in:
- Genomics (\(p \gg n\)),
- Text regression models with bag-of-words features,
- Financial models with redundant indicators.
5. Conditioning in Deep Learning
Common Problems
In deep neural networks:
- Weight matrices may become poorly conditioned,
- Gradients may vanish or explode during backpropagation,
- Activations may saturate, leading to optimization stalls.
Solutions and Their Mathematical Roles
- Orthogonal Initialization: Weight matrices initialized to be orthogonal preserve the input norm and maintain conditioning.
\[W^T W = I \Rightarrow \|Wx\|_2 = \|x\|_2\]Code:
torch.nn.init.orthogonal_(tensor)
- Weight Decay (L2 regularization): Equivalent to Ridge on the weights; controls weight growth and stabilizes learning.
\[\min_w \mathcal{L}(w) + \lambda \|w\|_2^2\]
- Gradient Clipping: Prevents gradient explosion by rescaling the gradient norm:
\[\nabla \mathcal{L} \leftarrow \frac{\nabla \mathcal{L}}{\max(1, \|\nabla \mathcal{L}\| / \tau)}\]Code:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
Application: Training Stability in Deep Networks
These techniques are used in:
- RNNs: To avoid exploding gradients through time.
- Transformers: LayerNorm and initialization stabilize long-depth training.
- CNNs: Weight decay improves generalization and convergence.
Summary
Concept | Mathematical Description | Role in ML / DL |
---|---|---|
Condition Number | \(\kappa(A) = \|A\| \cdot \|A^{-1}\|\) | Measures sensitivity to noise |
Ill-conditioning | \(\kappa(A) \gg 1\) | Leads to instability in training and optimization |
Tikhonov Regularization | \(\min_x \|Ax - b\|^2 + \lambda \|x\|^2\) | Improves matrix invertibility |
Ridge Regression | \(w = (X^T X + \lambda I)^{-1} X^T y\) | Stabilizes regression with correlated features |
Orthogonal Init. | \(W^T W = I\) | Preserves norm in forward/backward pass |
Weight Decay | \(\min_w \mathcal{L} + \lambda \|w\|^2\) | Regularizes weights and enhances generalization |
Gradient Clipping | \(\nabla \mathcal{L} \leftarrow \frac{\nabla \mathcal{L}}{\max(1, \|\nabla \mathcal{L}\| / \tau)}\) | Prevents exploding gradients during training |
Understanding numerical stability and matrix conditioning helps you:
- Design models that train efficiently,
- Use optimization methods that converge reliably,
- Avoid silent failures due to ill-conditioning.
By incorporating regularization, better initialization, and gradient control, you can ensure your machine learning models are not only performant—but also numerically robust.
Rotation, Reflection, and Markov Matrices in Machine Learning
Geometric and probabilistic transformations form the backbone of many machine learning systems—especially in computer vision, robotics, and probabilistic reasoning. This section explores two foundational categories of matrices:
- Rotation and Reflection matrices (used in spatial transformations),
- Markov (Stochastic) matrices (used in probabilistic models and temporal systems).
Rotation and Reflection Matrices
Geometric transformations such as rotation and reflection are represented by orthogonal matrices in linear algebra. These are crucial for tasks in computer vision, robotics, 3D graphics, and data augmentation.
Rotation Matrices
A rotation matrix in \(\mathbb{R}^n\) rotates a vector about the origin while preserving its norm. A matrix \(R\) is a rotation matrix if:
\[R^T R = RR^T = I \quad \text{and} \quad \det(R) = 1\]
2D Rotation
In two dimensions, rotation by an angle \(\theta\) counterclockwise is given by:
\[R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}\]For any vector \(x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\), the rotated vector is:
\[x' = R(\theta) x\]
3D Rotation and Euler Angles
In 3D, rotations can be performed around each axis. The elementary rotations are:
- Around \(x\)-axis: \(R_x(\theta) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix}\)
- Around \(y\)-axis: \(R_y(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix}\)
- Around \(z\)-axis: \(R_z(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}\)
Any 3D rotation can be represented by a combination of these, often using Euler angles.
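A minimal NumPy sketch composing the elementary rotations and verifying the defining properties \(R^T R = I\) and \(\det(R) = 1\) (the helper names and angles are ours):

```python
import numpy as np

def rot_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

R = rot_z(0.3) @ rot_y(0.2) @ rot_x(0.1)   # compose Euler-angle rotations
print(np.allclose(R.T @ R, np.eye(3)), np.isclose(np.linalg.det(R), 1.0))  # True True
```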
Application: Computer Vision and Robotics
- Image Augmentation: Rotating images during training increases robustness to orientation.
- Pose Estimation: Estimating camera or robot orientation using rotation matrices.
- 3D Reconstruction: Applying transformations to point clouds and mesh data.
- Robotics Control: Planning and executing movement using rotation matrices in kinematics.
Example:
import torchvision.transforms as T
T.RandomRotation(degrees=15)
Reflection Matrices
A reflection matrix flips a vector across a subspace (hyperplane). It is also orthogonal, but unlike a rotation matrix:
\[\det(R) = -1\]
Reflection in 2D
Reflection across the \(x\)-axis:
\[R = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}\]Reflection across an arbitrary line through the origin with unit normal vector \(n\):
\[R = I - 2nn^T\]Where \(n \in \mathbb{R}^d\) and \(\|n\| = 1\).
Application: Data Augmentation in CV
Reflections are used to simulate different perspectives of the same object.
- Horizontal Flip:
T.RandomHorizontalFlip(p=0.5)
- Vertical Flip:
T.RandomVerticalFlip(p=0.5)
- Symmetry-based Learning: Useful in object detection and scene understanding.
Markov Matrices and Stochastic Processes
Markov models describe systems that evolve probabilistically over time, where the future depends only on the current state. Their underlying structure is captured by stochastic (Markov) matrices.
Stochastic Matrices
A Markov matrix (or row-stochastic matrix) is a square matrix \(P \in \mathbb{R}^{n \times n}\) where:
- \(P_{ij} \geq 0\) for all \(i, j\),
- \(\sum_{j=1}^n P_{ij} = 1\) for all \(i\) (each row sums to 1).
This means \(P_{ij}\) is the probability of transitioning from state \(i\) to state \(j\).
Let \(x_t\) be a probability distribution vector at time \(t\). Then:
\[x_{t+1} = x_t P\]This recurrence describes the evolution of a Markov chain over time.
Stationary Distribution
A distribution \(\pi\) is stationary if:
\[\pi = \pi P\]This represents the long-term distribution over states. For a finite, irreducible chain the stationary distribution is unique, and if the chain is also aperiodic, \(x_t\) converges to it from any initial distribution.
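A minimal NumPy sketch showing the evolution \(x_{t+1} = x_t P\) converging to the stationary distribution, and the same \(\pi\) obtained as a left eigenvector (the 2-state transition matrix is a toy example):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])                 # row-stochastic transition matrix
x = np.array([1.0, 0.0])                   # start in state 0
for _ in range(100):
    x = x @ P                              # x_{t+1} = x_t P
print(x)                                   # ~[0.833, 0.167], the stationary distribution

# Stationary distribution directly: left eigenvector of P with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.isclose(eigvals, 1.0))])
print(pi / pi.sum())                       # matches the power-iteration limit
```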
Application: PageRank and Random Walks
PageRank Algorithm:
- Models the web as a Markov chain.
- Pages are states, links are transitions.
- Uses a stochastic matrix with a damping factor \(\alpha\) (typically \(\alpha \approx 0.85\)): \(P' = \alpha P + (1 - \alpha)\frac{1}{n}\mathbf{1}\mathbf{1}^T\).
Stationary distribution \(\pi\) is computed such that:
\[\pi = \pi P'\]Other Applications:
- Language Modeling: Character or word-level Markov chains.
- Hidden Markov Models (HMMs): NLP, speech, time-series.
- Graph algorithms: Random walk-based node ranking and clustering.
Summary
Concept | Mathematical Description | Applications |
---|---|---|
Rotation Matrix | \(R^T R = I, \ \det(R) = 1\) | CV, robotics, 3D vision, pose estimation |
Reflection Matrix | \(R = I - 2nn^T, \ \det(R) = -1\) | Data augmentation, symmetry modeling |
Euler Angles | Composition of axis-wise rotations | Robotics, camera modeling |
Stochastic Matrix | \(P_{ij} \geq 0, \ \sum_j P_{ij} = 1\) | Markov chains, PageRank, probabilistic models |
Stationary Distribution | \(\pi = \pi P\) | Long-term behavior modeling |
Rotation and reflection matrices form the mathematical backbone of spatial transformations in ML applications like vision and robotics. Meanwhile, Markov matrices provide a probabilistic framework to model temporal evolution in tasks such as language modeling, search ranking, and sequence prediction.
Advanced Projections: Random Projections and the Johnson–Lindenstrauss Lemma
In modern machine learning and data science, high-dimensional datasets are common—particularly in fields like natural language processing, image processing, and information retrieval. However, high dimensionality brings computational and storage challenges, as well as the curse of dimensionality.
One elegant solution to this is random projection, a method for dimensionality reduction that is fast, scalable, and surprisingly effective. Its theoretical foundation is the Johnson–Lindenstrauss Lemma, which guarantees that random projections approximately preserve distances between points.
This section explains:
- The theory and math behind random projections,
- The Johnson–Lindenstrauss Lemma and its guarantees,
- Applications in NLP, IR, privacy, and beyond.
1. The Need for Dimensionality Reduction
Let \(X \in \mathbb{R}^{n \times d}\) be a dataset with \(n\) samples in a high-dimensional space \(\mathbb{R}^d\).
Problems with large \(d\):
- Computational inefficiency: Matrix operations are expensive.
- Memory consumption: Storing all features is costly.
- Overfitting: Too many features relative to data points.
- Distance concentration: In high dimensions, pairwise distances become less informative.
Goal: Reduce the dimensionality from \(d\) to \(k \ll d\) such that geometric structure (e.g., pairwise distances) is preserved.
2. Random Projections
Instead of learning an optimal projection (like PCA), random projections use a random linear map:
Let \(R \in \mathbb{R}^{k \times d}\) be a random matrix. Then:
\[z_i = \frac{1}{\sqrt{k}} R x_i \in \mathbb{R}^k\]for each data point \(x_i \in \mathbb{R}^d\).
The random matrix \(R\) typically has entries sampled from:
- Standard Gaussian: \(R_{ij} \sim \mathcal{N}(0, 1)\),
- Sparse sign matrices: \(R_{ij} \in \{-1, 0, +1\}\) with controlled sparsity.
3. Johnson–Lindenstrauss Lemma
The Johnson–Lindenstrauss Lemma states that a small set of points in high-dimensional space can be mapped into a lower-dimensional space such that pairwise distances are approximately preserved.
Theorem (JL Lemma)
For any \(0 < \epsilon < 1\) and integer \(n\), let \(X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d\) be a set of \(n\) points. Then for:
\[k = O\left(\frac{\log n}{\epsilon^2}\right),\]there exists a linear map \(f: \mathbb{R}^d \rightarrow \mathbb{R}^k\) such that for all \(i, j\):
\[(1 - \epsilon)\|x_i - x_j\|_2^2 \leq \|f(x_i) - f(x_j)\|_2^2 \leq (1 + \epsilon)\|x_i - x_j\|_2^2\]This means that random projections preserve pairwise distances up to small distortion with high probability.
Intuition
- The JL Lemma shows that pairwise distances can be preserved up to a factor \(1 \pm \epsilon\) when compressing data from \(\mathbb{R}^d\) to \(\mathbb{R}^k\), as long as \(k = O(\epsilon^{-2}\log n)\).
- This is data-independent: No need to look at the data when designing the projection.
4. Construction of Projection Matrix
Let \(R \in \mathbb{R}^{k \times d}\) be the random matrix. Some common constructions:
4.1 Gaussian Random Projection
Each entry is drawn i.i.d. from:
\[R_{ij} \sim \mathcal{N}(0, 1)\]Then the projection is:
\[f(x) = \frac{1}{\sqrt{k}} R x\]This satisfies the JL lemma with high probability.
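A minimal NumPy sketch of the Gaussian construction and the resulting distance preservation (the dimensions and the pair of points checked are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 5000, 500
X = rng.normal(size=(n, d))                 # n points in a high-dimensional space
R = rng.normal(size=(k, d))                 # Gaussian random projection matrix
Z = X @ R.T / np.sqrt(k)                    # f(x) = (1/sqrt(k)) R x, applied to every row

orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Z[0] - Z[1])
print(orig, proj, abs(proj - orig) / orig)  # relative distortion is only a few percent
```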
4.2 Sparse Random Projection
To improve speed and memory:
\[R_{ij} = \sqrt{s} \cdot \begin{cases} +1 & \text{with probability } \frac{1}{2s}, \\ 0 & \text{with probability } 1 - \frac{1}{s}, \\ -1 & \text{with probability } \frac{1}{2s} \end{cases}\]For example, \(s = 3\) gives ~67% sparsity.
5. Applications
5.1 Natural Language Processing (NLP)
- TF-IDF vectors for documents can have tens of thousands of dimensions.
- Random projections reduce dimensionality for:
- Document classification,
- Similarity search,
- Topic modeling pre-processing.
Example:
from sklearn.random_projection import GaussianRandomProjection
transformer = GaussianRandomProjection(n_components=100)
X_new = transformer.fit_transform(X_tfidf)
5.2 Information Retrieval and ANN Search
- Used to index high-dimensional vectors for approximate nearest neighbors (ANN).
- Efficiently reduce the dimension of image embeddings, word embeddings, etc.
- Compatible with LSH (Locality-Sensitive Hashing).
5.3 Differential Privacy and Data Privacy
- Random projections are used to obscure sensitive dimensions while preserving utility.
- Also appear in private matrix factorization and federated learning pipelines.
5.4 Kernel Approximation
- Random Fourier Features approximate Gaussian/RBF kernels by projecting into a low-dimensional space.
- Scales kernel methods to large datasets: \(k(x, y) \approx \phi(x)^T \phi(y)\)
Where \(\phi(x)\) is obtained via random projections.
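A minimal NumPy sketch of random Fourier features for the RBF kernel (the helper `rff_features`, the bandwidth, and the number of features are ours; the approximation error shrinks roughly like \(1/\sqrt{D}\) in the number of random features \(D\)):

```python
import numpy as np

def rff_features(X, n_features=500, sigma=1.0, seed=0):
    """phi(x) such that phi(x)^T phi(y) approximates exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))   # random frequencies
    b = rng.uniform(0, 2 * np.pi, size=n_features)            # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = np.random.default_rng(1).normal(size=(5, 3))
Phi = rff_features(X)
K_approx = Phi @ Phi.T
K_exact = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)
print(np.abs(K_approx - K_exact).max())   # small approximation error
```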
6. Comparison with PCA
Aspect | PCA | Random Projection |
---|---|---|
Data-dependent | Yes | No |
Computational cost | High (SVD-based) | Low (matrix multiplication) |
Distance preservation | Optimal for top-variance directions | Approximate, but probabilistically guaranteed |
Scalability | Less scalable for large \(d\) | Highly scalable |
Interpretability | High (axes = principal components) | Low |
Summary
Concept | Mathematical Idea | Application Domains |
---|---|---|
Johnson–Lindenstrauss Lemma | \(k = O\left(\frac{\log n}{\epsilon^2}\right)\) | Distance-preserving low-dim embedding |
Gaussian Projection | \(R_{ij} \sim \mathcal{N}(0, 1)\) | NLP, embeddings, privacy |
Sparse Projection | \(R_{ij} \in \{-1, 0, +1\}\) with sparsity | Faster computation |
Random Fourier Features | Approx. kernel projection via random bases | Kernel methods on large datasets |
Distance Preservation | \(\|x_i - x_j\| \approx \|f(x_i) - f(x_j)\|\) | ANN, clustering, manifold learning |
Random projections offer a principled, efficient, and theoretically sound method to compress high-dimensional data while preserving its geometric structure. Thanks to the Johnson–Lindenstrauss Lemma, we can apply these projections without worrying about distortion—making them perfect for large-scale ML systems.
Whether you’re dealing with:
- Large vocabulary document matrices in NLP,
- Embedding vectors in image retrieval,
- Kernel methods in high dimensions,
random projections are a tool you should definitely have in your toolbox.