
Vector Operations, Norms, and Projections


In 1843, William Rowan Hamilton made a discovery while walking along the Royal Canal in Dublin. As the story goes, he was so struck by the insight that he carved the core idea into the stone of Broom Bridge:

\[i^2 = j^2 = k^2 = ijk = -1\]

This marked the formal introduction of quaternions—an early system for handling three-dimensional rotations using algebra. It also helped pave the way for the modern concept of vectors: objects that capture both direction and magnitude, and that allow precise descriptions of motion, force, orientation, and more.

Over time, vector spaces became foundational in physics, computer graphics, and eventually machine learning. Today, whenever a neural network adjusts its weights, a recommendation engine scores similarity, or an algorithm projects data into fewer dimensions, vector operations are doing the work underneath.

This post begins with those fundamentals. Before diving into transformations and matrix mechanics, we’ll look closely at how vectors behave—how they combine, stretch, shrink, align, and relate to each other. These operations form the base for much of data science and machine learning, from gradient descent to PCA to attention mechanisms.

With that in mind, let’s begin.


Vector Addition and Scalar Multiplication

Imagine you’re training a neural network. Every time you update its weights using gradient descent, you’re really doing something quite basic: taking one vector (the weights), subtracting another (the gradient scaled by the learning rate), and replacing the old with the new. This isn’t just an implementation detail—it’s a foundational operation that sits at the heart of every learning loop.

Let’s break it down.

A vector \(\mathbf{v}\) in \(\mathbb{R}^n\) is an ordered list of \(n\) real numbers:

\[\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}\]

Vectors can represent anything: coordinates in space, model weights, gradient directions, or even raw data. They’re the language of modern ML.

To combine two vectors \(\mathbf{u}\) and \(\mathbf{v}\), you just add them component-wise:

\[\mathbf{u} + \mathbf{v} = \begin{bmatrix} u_1 + v_1 \\ u_2 + v_2 \\ \vdots \\ u_n + v_n \end{bmatrix}\]

Visualization: Vector v (red, dashed) starts at the tip of u (blue). The resulting vector u + v (green) connects the origin to the end of the chain. This is vector addition in action.

Now, multiplying a vector by a scalar \(\alpha\) stretches or shrinks it:

\[\alpha \mathbf{u} = \begin{bmatrix} \alpha u_1 \\ \alpha u_2 \\ \vdots \\ \alpha u_n \end{bmatrix}\]

The direction stays the same, but the length changes. If \(\alpha < 0\), the vector flips around the origin.

This is the backbone of gradient descent, where \(L\) is the loss function:

\[\mathbf{w}_{\text{new}} = \mathbf{w}_{\text{old}} - \alpha \nabla_{\mathbf{w}} L\]

You’re scaling the gradient, flipping it (via subtraction), and combining it with the old weights—exactly the operations we’ve just visualized.
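
In code, one update step is nothing more than scalar multiplication and vector subtraction. Here is a minimal sketch (the weights, gradient, and learning rate below are made-up placeholders, not values from any particular model):

import numpy as np

w_old = np.array([0.5, -1.2, 3.0])     # current weights (illustrative)
grad = np.array([0.1, -0.4, 0.25])     # gradient of the loss w.r.t. the weights (illustrative)
learning_rate = 0.01

# Scale the gradient, flip it via subtraction, and combine with the old weights
w_new = w_old - learning_rate * grad
print("Updated weights:", w_new)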

Whether you’re training a deep model or tuning a simple linear regression, vector addition and scalar multiplication are the building blocks. The math may be simple—but the impact is profound.


Linear Combinations, Span, Basis, and Dimensionality

Let’s move one level higher in abstraction. Suppose you have a bunch of vectors—what can you build from them? A lot, it turns out. In fact, much of machine learning is built on the idea that complex things can be constructed by combining simple things in clever ways.

This idea is formalized through linear combinations. Given a set of vectors \(\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_k\) in \(\mathbb{R}^n\), we can build a new vector like so:

\[\mathbf{v} = \alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \dots + \alpha_k \mathbf{v}_k\]

Here, each original vector is scaled by a coefficient, and then all are added together. This is more than a formula—this is how feature engineering works, how word embeddings combine meanings, and how models generalize.

The set of all such vectors you can make using these combinations is called the span. For example, the span of two non-parallel vectors in 2D is the entire plane. The span tells you how expressive your set of vectors is.

But not all sets of vectors are created equal. Some are redundant. That’s where the concept of a basis comes in.

A basis is a minimal set of linearly independent vectors whose span still covers the whole space. Every vector in the space can be written uniquely as a linear combination of the basis vectors.

The number of vectors in that basis? That’s the dimension of the space.

Dimensionality matters deeply in ML. High-dimensional data is everywhere—images, text, genomics—but often, we don’t need all those dimensions. That’s where dimensionality reduction enters.

A classic technique is PCA (Principal Component Analysis). PCA finds new basis vectors—called principal components—which are orthogonal and capture the most variance in the data. You can then drop the less useful components and keep only the top ones, shrinking your data’s dimensionality without losing much information.
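
To make that less abstract, here is a rough sketch of the idea using plain NumPy: center the data, compute the covariance matrix, take its eigenvectors as the new basis, and project onto the top component. (The tiny dataset is invented for illustration; in practice you would typically reach for scikit-learn’s PCA.)

import numpy as np

# Toy 2D dataset (rows are samples); the numbers are illustrative only
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

X_centered = X - X.mean(axis=0)            # center each feature
cov = np.cov(X_centered, rowvar=False)     # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvectors = candidate principal components

top_component = eigvecs[:, np.argmax(eigvals)]   # direction of greatest variance
X_reduced = X_centered @ top_component           # project each 2D point onto a single axis
print("1D representation:", X_reduced)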


Numerical Example

Suppose we are given two vectors:

  • \[\mathbf{v}_1 = [2, 1]\]
  • \[\mathbf{v}_2 = [1, 3]\]

These two vectors are not scalar multiples of each other, so they are linearly independent and span all of \(\mathbb{R}^2\). In other words, any point in 2D space can be written as a linear combination of \(\mathbf{v}_1\) and \(\mathbf{v}_2\).

Let’s choose coefficients:

  • \[\alpha_1 = 2\]
  • \[\alpha_2 = -1\]

Then the linear combination becomes:

\[\mathbf{v} = 2 \cdot [2, 1] + (-1) \cdot [1, 3] = [4, 2] + [-1, -3] = [3, -1]\]

So, the vector \([3, -1]\) lies in the span of \(\mathbf{v}_1\) and \(\mathbf{v}_2\).

Let’s verify it computationally:

import numpy as np

v1 = np.array([2, 1])
v2 = np.array([1, 3])
coeffs = np.array([2, -1])

new_vector = coeffs[0] * v1 + coeffs[1] * v2   # 2*v1 + (-1)*v2
print("New vector (linear combination):", new_vector)

Output:

New vector (linear combination): [ 3 -1]
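
We can also confirm computationally that \(\mathbf{v}_1\) and \(\mathbf{v}_2\) are linearly independent by checking the rank of the matrix that has them as columns (a quick sketch; a rank of 2 means neither vector is redundant):

import numpy as np

v1 = np.array([2, 1])
v2 = np.array([1, 3])

# Stack the vectors as columns; rank 2 means neither is a multiple of the other
A = np.column_stack([v1, v2])
print("Rank:", np.linalg.matrix_rank(A))   # 2, so v1 and v2 form a basis of R^2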

Now let’s understand how this connects to span, basis, and dimension:

  • Linear Combination: By scaling and adding the vectors, we reached a new point in space.
  • Span: All such combinations form the filled-in area (the entire 2D plane in this case) spanned by \(\mathbf{v}_1\) and \(\mathbf{v}_2\).
  • Basis: Since \(\mathbf{v}_1\) and \(\mathbf{v}_2\) are linearly independent and span \(\mathbb{R}^2\), they form a basis.
  • Dimensionality: The number of vectors in the basis is 2, so the space is 2-dimensional.

This example shows that you can express any vector in 2D using a basis of two independent vectors. And this same idea is extended in machine learning when we:

  • Learn latent embeddings with fewer dimensions
  • Use PCA to reduce input feature space
  • Combine attention vectors in transformers
  • Compress images or signals in autoencoders

The same idea scales up: Autoencoders learn compact representations (i.e., a new basis). Sparse coding finds minimal combinations to represent signals. In NLP, transformer models represent words as weighted combinations of basis-like vectors. In genomics and medical imaging, dimensionality reduction helps extract essential patterns from noisy, high-dimensional data.

The key takeaway? Whenever you reduce features, compress signals, or build latent spaces—you’re working with linear combinations and bases, even if you don’t always realize it.


Orthogonality and Projections

Now that we’ve learned how to construct vectors using others, let’s explore how to simplify or extract structure from vectors we already have. This is where projections—and particularly orthogonal projections—come into play. They help us reduce complexity while preserving what matters most.

Imagine you have a vector \(\mathbf{u}\) that represents some data, and you want to understand how much of it aligns with another vector \(\mathbf{v}\)—perhaps a direction that captures maximum variance, or a component of interest in a dataset.

To isolate that portion of \(\mathbf{u}\), we project it onto \(\mathbf{v}\). This projection is essentially \(\mathbf{u}\)’s shadow in the direction of \(\mathbf{v}\), and it’s defined by:

\[\text{proj}_{\mathbf{v}}(\mathbf{u}) = \left( \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{v}\|^2} \right) \mathbf{v}\]

This formula gives us a new vector pointing in the direction of \(\mathbf{v}\), scaled to reflect how much \(\mathbf{u}\) “leans” toward it.

Why is this useful?

Because orthogonal projections let us decompose vectors. For example, in Principal Component Analysis (PCA), we use projections to compress data: projecting high-dimensional data onto a smaller set of orthogonal axes (principal components) that capture the greatest variance.

Let’s make that concrete with a code snippet:

import numpy as np

data_point = np.array([3, 4])
principal_component = np.array([1, 0])

# Projection of data_point onto principal_component
dot = np.dot(data_point, principal_component)
norm_sq = np.dot(principal_component, principal_component)
projection = (dot / norm_sq) * principal_component
print("Projection:", projection)

This returns [3, 0]—a shadow of [3, 4] on the x-axis. You’ve just performed dimensionality reduction: mapping a 2D point to a 1D axis while preserving the meaningful part.


Numerical Example

Let’s unpack the projection formula with another set of numbers to build a strong geometric intuition.

Suppose:

  • \[\mathbf{u} = [2, 3]\]
  • \[\mathbf{v} = [4, 0]\] (aligned with the x-axis)

Let’s compute the projection of \(\mathbf{u}\) onto \(\mathbf{v}\):

  • Dot product:
    \(\mathbf{u} \cdot \mathbf{v} = 2 \times 4 + 3 \times 0 = 8\)

  • Magnitude squared of \(\mathbf{v}\):
    \(\|\mathbf{v}\|^2 = 4^2 + 0^2 = 16\)

So the projection is:

\[\text{proj}_{\mathbf{v}}(\mathbf{u}) = \left( \frac{8}{16} \right) \times [4, 0] = 0.5 \times [4, 0] = [2, 0]\]

This tells us that although \(\mathbf{u}\) points somewhere into the plane, its component along \(\mathbf{v}\) (the x-direction) is just 2 units. The rest—\([0, 3]\)—is orthogonal to \(\mathbf{v}\).

To isolate the orthogonal component:

\[\mathbf{u}_\perp = \mathbf{u} - \text{proj}_{\mathbf{v}}(\mathbf{u}) = [2, 3] - [2, 0] = [0, 3]\]

This kind of decomposition is used frequently in machine learning to separate signal from noise, or to isolate the informative component of a data point.
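
Here is a small sketch that reproduces this decomposition and checks that the two pieces really are perpendicular (their dot product should be zero):

import numpy as np

u = np.array([2, 3])
v = np.array([4, 0])

# Component of u along v
proj = (np.dot(u, v) / np.dot(v, v)) * v
# Component of u orthogonal to v
u_perp = u - proj

print("Projection:", proj)                 # [2. 0.]
print("Orthogonal part:", u_perp)          # [0. 3.]
print("Dot with v:", np.dot(u_perp, v))    # 0.0, so the parts are perpendicular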


You’ll see projections all over the place: in feature extraction for images, orthogonal initialization in deep networks, and latent representations in generative models. Wherever there’s compression or abstraction, projections are probably doing the heavy lifting behind the scenes.


Vector Norms and Model Complexity

You’ve trained a model and it performs well—almost too well. It nails the training data but struggles with new inputs. That’s a red flag: your model might be overfitting. A common way to fix this is regularization, which penalizes large weights. But what does “large” mean for a vector of weights? Enter vector norms.

Vector norms provide a way to measure the size or length of a vector, giving us a handle to control model complexity.

Let’s walk through the three most common norms:

L1 Norm: Manhattan Distance

This is the sum of the absolute values of all components:

\[\|\mathbf{v}\|_1 = \sum_{i=1}^{n} |v_i|\]

It’s called “Manhattan” because it mimics city block distances—walking only along gridlines.

L2 Norm: Euclidean Distance

This is the familiar straight-line distance from the origin:

\[\|\mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{n} v_i^2}\]

L∞ Norm: Maximum Norm

This measures the largest single component in the vector:

\[\|\mathbf{v}\|_\infty = \max_i |v_i|\]

Each norm is useful in different settings:

  • L1 promotes sparsity (many zeroes), useful in Lasso Regression
  • L2 encourages small but non-zero values, used in Ridge Regression
  • L∞ is useful for constraining peak values, such as in adversarial robustness

Let’s plug in some numbers to make this concrete:

Numerical Example

Let’s say:

\[\mathbf{v} = [3, -4, 1]\]

Then:

  • L1 norm:
\[\|\mathbf{v}\|_1 = |3| + |-4| + |1| = 3 + 4 + 1 = 8\]
  • L2 norm:
\[\|\mathbf{v}\|_2 = \sqrt{3^2 + (-4)^2 + 1^2} = \sqrt{9 + 16 + 1} = \sqrt{26} \approx 5.10\]
  • L∞ norm:
\[\|\mathbf{v}\|_\infty = \max(|3|, |-4|, |1|) = 4\]

And here’s the code that calculates these:

import numpy as np

v = np.array([3, -4, 1])

l1 = np.sum(np.abs(v))          # L1: sum of absolute values
l2 = np.sqrt(np.sum(v ** 2))    # L2: square root of the sum of squares
linf = np.max(np.abs(v))        # L∞: largest absolute component

print("L1 norm:", l1)
print("L2 norm:", l2)
print("L∞ norm:", linf)

To see how these norms behave for different vectors, here’s an interactive plot. Use the dropdown to switch between vectors and observe how the L1, L2, and L∞ norms respond based on vector composition.


Each type of norm defines a different concept of “distance.” Here’s a geometric view: the L1 norm forms a diamond, L2 a circle, and L∞ a square. These unit norm balls visually explain why different norms behave differently when regularizing or constraining model weights.


Why Norms Matter in ML

Norms are used to regularize models—penalizing large weight magnitudes to reduce overfitting:

  • L1 regularization adds \(\lambda \|\mathbf{w}\|_1\) to the loss function
  • L2 adds \(\lambda \|\mathbf{w}\|_2^2\)
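
In code, these penalties are just norms of the weight vector scaled by a regularization strength \(\lambda\). A minimal sketch (the weights, \(\lambda\), and base_loss below are placeholders, not outputs of a real model):

import numpy as np

w = np.array([0.8, -1.5, 0.0, 2.1])    # hypothetical model weights
lam = 0.01                             # regularization strength (lambda)
base_loss = 0.42                       # placeholder for the data-fit part of the loss

l1_penalty = lam * np.sum(np.abs(w))   # lambda * ||w||_1  (Lasso-style)
l2_penalty = lam * np.sum(w ** 2)      # lambda * ||w||_2^2 (Ridge-style)

print("Loss with L1 penalty:", base_loss + l1_penalty)
print("Loss with L2 penalty:", base_loss + l2_penalty)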

They’re also used to:

  • Clip gradients during training (to avoid exploding gradients; see the sketch after this list)
  • Control noise in adversarial training (L∞ perturbation bounds)
  • Measure distance in nearest neighbor and anomaly detection models
  • Guide similarity in metric learning (e.g. contrastive loss)
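
For example, gradient clipping rescales the gradient whenever its L2 norm exceeds a threshold, which keeps updates from blowing up. A minimal sketch (the gradient values and the threshold are illustrative):

import numpy as np

grad = np.array([3.0, -4.0, 12.0])     # hypothetical gradient
max_norm = 5.0                         # clipping threshold

norm = np.linalg.norm(grad)            # L2 norm (13.0 here)
if norm > max_norm:
    grad = grad * (max_norm / norm)    # rescale so the norm equals max_norm

print("Clipped gradient:", grad)
print("New norm:", np.linalg.norm(grad))   # ≈ 5.0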

Norms offer a mathematical grip on what it means for a model or input to be “large,” and in doing so, help keep our models efficient, robust, and generalizable.


Inner and Outer Products

Let’s say you’re building a recommendation engine or clustering users based on their behavior. One of the first things you’ll need to do is measure how similar two data points are. But how do you quantify “similarity” in a vector space?

That’s where the inner product comes in.

Given two vectors \(\mathbf{u}\) and \(\mathbf{v}\), the inner product (or dot product) is defined as:

\[\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{n} u_i v_i\]

This value tells us how aligned two vectors are. A large dot product means they’re pointing in the same general direction—like two users with similar preferences. A zero dot product means the vectors are orthogonal, i.e., completely unrelated.

This concept underpins cosine similarity, a widely used metric in NLP for comparing word embeddings or document vectors.
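
Cosine similarity is simply the dot product divided by the product of the two vectors’ lengths, so it measures direction while ignoring magnitude. A quick sketch with made-up vectors:

import numpy as np

a = np.array([3.0, 4.0])   # e.g. an embedding for one document (illustrative)
b = np.array([4.0, 3.0])   # an embedding for another (illustrative)

# Dot product normalized by both lengths
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print("Cosine similarity:", cos_sim)   # 24 / (5 * 5) = 0.96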

Here’s a visual interpretation of the dot product. The projection of u onto v represents how much of u lies in the direction of v. The more aligned they are, the longer the projection—hence a larger dot product.


On the other hand, the outer product builds a full matrix that captures all pairwise interactions between components of two vectors. If \(\mathbf{u} \in \mathbb{R}^m\) and \(\mathbf{v} \in \mathbb{R}^n\), then:

\[\mathbf{u} \otimes \mathbf{v} = \begin{bmatrix} u_1v_1 & u_1v_2 & \cdots & u_1v_n \\ u_2v_1 & u_2v_2 & \cdots & u_2v_n \\ \vdots & \vdots & \ddots & \vdots \\ u_mv_1 & u_mv_2 & \cdots & u_mv_n \end{bmatrix}\]

This matrix can be used to model interactions, correlations, or structure in higher dimensions.

The outer product expands two vectors into a matrix where each entry is the product of one element from u and one from v. This interaction matrix is fundamental to building things like attention maps or covariance matrices.

A Numerical Example

Let’s take:
\(\mathbf{u} = [1, 2] \quad \text{and} \quad \mathbf{v} = [3, 4]\)

  • Inner product:
\[\mathbf{u} \cdot \mathbf{v} = 1 \times 3 + 2 \times 4 = 3 + 8 = 11\]
  • Outer product:
\[\mathbf{u} \otimes \mathbf{v} = \begin{bmatrix} 1 \times 3 & 1 \times 4 \\ 2 \times 3 & 2 \times 4 \end{bmatrix} = \begin{bmatrix} 3 & 4 \\ 6 & 8 \end{bmatrix}\]

And here’s how you’d compute both in Python:

import numpy as np

u = np.array([1, 2])
v = np.array([3, 4])

inner = np.dot(u, v)     # scalar: sum of element-wise products
outer = np.outer(u, v)   # matrix of all pairwise products

print("Inner Product:", inner)
print("Outer Product:\n", outer)

Why These Matter in ML

  • The inner product appears in similarity search, attention mechanisms, and projection-based reasoning.
  • The outer product is essential in building covariance matrices, attention weight matrices, and tensor decomposition for multi-modal learning.

In deep learning, attention mechanisms use scaled dot-product attention, which relies on inner products to determine how much focus to place on different inputs.
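
As a rough sketch of that idea, here is scaled dot-product attention on tiny random matrices: queries are compared to keys via inner products, the scores are scaled and passed through a softmax, and the resulting weights combine the values. (Shapes and numbers are illustrative only; real implementations add learned projections, masking, and multiple heads.)

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))   # 2 query positions, dimension 4
K = rng.standard_normal((3, 4))   # 3 key positions
V = rng.standard_normal((3, 4))   # 3 value vectors

scores = Q @ K.T / np.sqrt(K.shape[-1])   # inner products, scaled by sqrt(d)
weights = softmax(scores)                 # each row sums to 1
output = weights @ V                      # weighted combination of the values

print("Attention weights:\n", weights)
print("Output:\n", output)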

In kernel methods (like SVMs), the inner product is generalized into a kernel function, which lets us work in high-dimensional (or even infinite-dimensional) spaces without computing those spaces explicitly.
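
To make that concrete, here is a small sketch of the degree-2 polynomial kernel \((\mathbf{u} \cdot \mathbf{v} + 1)^2\). It equals an ordinary inner product in a larger feature space, but we never have to build that space explicitly (the vectors here are made up):

import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

# Kernel trick: evaluate the high-dimensional inner product directly
kernel_value = (np.dot(u, v) + 1) ** 2

# Explicit feature map for the degree-2 polynomial kernel, for comparison only
def phi(x):
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

explicit_value = np.dot(phi(u), phi(v))

print("Kernel value:  ", kernel_value)     # (1*3 + 2*4 + 1)^2 = 144
print("Explicit value:", explicit_value)   # ≈ 144, matching the kernel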

Outer products, meanwhile, show up in bilinear models, tensor factorization, and multi-head attention, where interaction between elements of different feature sets needs to be captured.

Together, inner and outer products give us the building blocks for understanding similarity, structure, and interaction in data—and they show up everywhere from NLP to recommender systems to generative models.


We’ve covered a lot—maybe more than it first seemed. Starting with simple vector addition and scaling, we built up to ideas like span and basis, saw how projections carve structure out of noise, and explored how norms and products give us tools to measure and compare. On paper, these are just operations. But together, they shape how machine learning models move through data, learn patterns, and ultimately make sense of the world.

And what’s striking is that none of this feels outdated. These ideas—linear combinations, dot products, orthogonality—are as relevant in the depths of a transformer model as they were in the early days of signal processing or classical statistics.

If you understand this much, you’re not just doing the math behind machine learning. You’re speaking its native language.