Beyond OLS — A Deep Dive into Ridge, Lasso, and Elastic Net
- Ridge Regression: Shrinking Towards Stability
  - Why Do We Need Regularization?
  - Intuition: Controlling the Flexibility of the Model
  - Ridge Regression Objective
  - Deriving the Ridge Regression Solution
  - Numerical Example: Ridge Regression vs OLS — Solved Step-by-Step
  - Ridge Regularization Path
  - Numerical Illustration: Ridge Coefficient Path
  - Geometric Interpretation of the Ridge Constraint
  - When to Use Ridge?
- Lasso Regression: Sparsity by Design
  - The Lasso Objective
  - Key Properties of Lasso
  - Deriving the Lasso Regression Solution
  - Geometry of Lasso: Why It Leads to Sparse Solutions
  - Optimization in Lasso Regression: Coordinate Descent and LARS
  - Numerical Example: Lasso Regression vs OLS — Solved Step-by-Step
  - When Should You Use Lasso?
- Elastic Net: The Best of Both Worlds
  - Motivation for Elastic Net
  - Mathematical Formulation and Objective: Inside the Heart of Elastic Net
  - Geometric Interpretation of Elastic Net: A “Rounded Diamond” of Stability and Sparsity
  - When and Why Elastic Net Works Well: A Sweet Spot in High Dimensions
  - Summary: When Should You Use Elastic Net?
  - Summary of Ideal Conditions
  - Comparative Table: Ridge, Lasso, and Elastic Net
  - Optimization Techniques for Elastic Net: From Reformulation to Coordinate Descent
  - Elastic Net vs Ridge vs Lasso: Numerical Example
  - Visualizing Elastic Net Behavior: Paths, Shrinkage, and Geometry
  - Closing Notes
Linear regression gives us a beautiful start — clean equations, elegant geometry, and estimators that behave well under textbook assumptions. But in the real world, things get messy. Data is noisy. Features are correlated. Dimensionality grows. Suddenly, that comforting least squares solution starts to show cracks.
Imagine building a model to predict credit risk using hundreds of behavioral indicators, or forecasting housing prices from a sprawling set of location, demographic, and property features. Ordinary Least Squares (OLS) may still work, but the model may overfit, become unstable, or assign wild importance to irrelevant predictors.
This is where regularization steps in — not to replace linear regression, but to rein it in. Regularization adds a penalty to the loss function, discouraging complex or fragile models. Instead of seeking the line that perfectly fits the data, it prefers a line that’s simpler, more robust, and generalizes better.
In this post, we’ll dive into three key regularized linear models:
- Ridge Regression (L2): Shrinks all coefficients but keeps them
- Lasso Regression (L1): Shrinks some coefficients all the way to zero
- Elastic Net: Balances between Ridge and Lasso, especially useful for correlated features
We’ll understand the math, the intuition, and when to use which — with geometric insights, numerical examples, and real-world implications.
Ridge Regression: Shrinking Towards Stability
Why Do We Need Regularization?
Linear regression is powerful, interpretable, and easy to optimize. But as we saw in the previous post, it has fragile moments — especially when faced with:
- Multicollinearity: when predictors are correlated
- Overfitting: when the model captures noise as signal
- High-dimensionality: when the number of predictors \(p\) is close to or exceeds the number of observations \(n\)
In these settings, the OLS solution:
\[\hat{\mathbf{w}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}\]can become unstable, sensitive to small changes, and explode in variance.
Ridge Regression — also known as Tikhonov regularization — tackles this head-on by shrinking the weights using an \(L_2\) penalty. It doesn’t discard features, but it tames them, pulling large coefficients toward zero to reduce model complexity and variance.
Intuition: Controlling the Flexibility of the Model
Imagine you’re fitting a line through data with multiple highly correlated variables. OLS doesn’t know which one to prefer — so it distributes weight erratically. Ridge adds a soft constraint: “you may fit the data, but don’t let any coefficient stray too far.”
Think of it like fitting a curve while holding a rubber band tight around the parameter values. You still try to minimize the fit error, but you’re now penalized for large coefficients.
Ridge Regression Objective
Optimization Problem
Ridge modifies the least squares cost function by adding an \(L_2\) regularization term:
\[\min_{\mathbf{w}} \left\| \mathbf{y} - \mathbf{Xw} \right\|_2^2 + \lambda \left\| \mathbf{w} \right\|_2^2\]Where:
- \(\left\| \mathbf{y} - \mathbf{Xw} \right\|_2^2\) is the residual sum of squares (RSS)
- \(\left\| \mathbf{w} \right\|_2^2 = \sum_{j=1}^p w_j^2\) is the square of the \(L_2\) norm
- \(\lambda \ge 0\) is the regularization strength
This penalizes large weights, encouraging the model to spread its importance across features more evenly and cautiously.
Key Interpretation
- As \(\lambda \rightarrow 0\), Ridge becomes equivalent to OLS
- As \(\lambda \rightarrow \infty\), all weights \(\mathbf{w}\) shrink toward zero
- Intermediate \(\lambda\) values allow you to control the bias–variance tradeoff (see the quick check below)
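To see these limiting cases concretely, here is a quick scikit-learn check on synthetic data (a sketch: the dataset and the specific `alpha` values are illustrative only; sklearn's `alpha` plays the role of \(\lambda\)).

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

ols = LinearRegression(fit_intercept=False).fit(X, y)
ridge_tiny = Ridge(alpha=1e-8, fit_intercept=False).fit(X, y)   # lambda -> 0: essentially OLS
ridge_huge = Ridge(alpha=1e6, fit_intercept=False).fit(X, y)    # lambda -> infinity: weights -> 0

print("OLS:        ", np.round(ols.coef_, 4))
print("Ridge ~0:   ", np.round(ridge_tiny.coef_, 4))   # matches OLS closely
print("Ridge large:", np.round(ridge_huge.coef_, 4))   # shrunk toward zero
```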
Closed-Form Solution
Ridge regression has a neat analytical solution, unlike Lasso:
\[\hat{\mathbf{w}} = \left( \mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I} \right)^{-1} \mathbf{X}^\top \mathbf{y}\]Where:
- \(\mathbf{I}\) is the identity matrix (of size \(p \times p\))
- The \(\lambda \mathbf{I}\) term ensures that \(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}\) is invertible — even when \(\mathbf{X}^\top \mathbf{X}\) is singular
What Changes from OLS?
Aspect | OLS | Ridge |
---|---|---|
Objective | Minimize RSS | Minimize RSS + $$\lambda \|\mathbf{w}\|_2^2$$ |
Solution | $$(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$ | $$(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}$$ |
Effect | Unbiased but high variance | Biased but lower variance |
Overfitting | Common in high dimensions | Much less likely |
Sparsity | No | No |
Deriving the Ridge Regression Solution
Recall that the Ridge objective is:
\[\min_{\mathbf{w}} \left\| \mathbf{y} - \mathbf{Xw} \right\|_2^2 + \lambda \left\| \mathbf{w} \right\|_2^2\]This combines the familiar least squares loss with a regularization term. Let’s derive the solution step by step, just like we did for OLS.
Step 1: Expand the Objective Function
Let’s write it out explicitly. Define the loss:
\[L(\mathbf{w}) = (\mathbf{y} - \mathbf{Xw})^\top (\mathbf{y} - \mathbf{Xw}) + \lambda \mathbf{w}^\top \mathbf{w}\]Expanding the terms:
\begin{align} L(\mathbf{w}) &= \mathbf{y}^\top \mathbf{y} - 2 \mathbf{y}^\top \mathbf{Xw} + \mathbf{w}^\top \mathbf{X}^\top \mathbf{Xw} + \lambda \mathbf{w}^\top \mathbf{w} \\ &= \text{const} - 2 \mathbf{y}^\top \mathbf{Xw} + \mathbf{w}^\top (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}) \mathbf{w} \end{align}
Note that \(\mathbf{y}^\top \mathbf{y}\) is constant with respect to \(\mathbf{w}\), so it disappears in optimization.
Step 2: Take the Gradient and Set to Zero
To minimize \(L(\mathbf{w})\), we take the derivative with respect to \(\mathbf{w}\) and set it to zero:
\begin{align} \nabla_{\mathbf{w}} L &= -2 \mathbf{X}^\top \mathbf{y} + 2 (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}) \mathbf{w} \\ 0 &= -2 \mathbf{X}^\top \mathbf{y} + 2 (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}) \mathbf{w} \end{align}
Step 3: Solve for \(\mathbf{w}\)
Cancel the factor of 2 and rearrange:
\[(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}) \mathbf{w} = \mathbf{X}^\top \mathbf{y}\] \[\Rightarrow \hat{\mathbf{w}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}\]This is the closed-form solution for Ridge regression.
Why Is \(\lambda \mathbf{I}\) So Useful?
Even if \(\mathbf{X}^\top \mathbf{X}\) is not invertible (e.g., due to multicollinearity or high-dimensionality), the addition of \(\lambda \mathbf{I}\) ensures that the matrix is strictly positive definite and hence invertible.
This is why Ridge regression is extremely helpful when:
- \(p > n\) (more features than samples)
- Features are linearly dependent or nearly so
- We want numerical stability and variance reduction
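The closed-form solution above is only a few lines of NumPy. The sketch below (synthetic data, no intercept for simplicity) computes \((\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}\) directly and checks it against scikit-learn's `Ridge`, which minimizes the same objective when `fit_intercept=False`.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, 0.0, -2.0, 1.0]) + rng.normal(scale=1.0, size=n)

lam = 5.0

# Closed-form Ridge estimate: solve (X^T X + lambda I) w = X^T y
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# scikit-learn solves the same problem (alpha corresponds to lambda)
w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.round(w_closed, 4))
print(np.round(w_sklearn, 4))   # should agree up to numerical precision
```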
Numerical Example: Ridge Regression vs OLS — Solved Step-by-Step
To clearly understand how Ridge Regression modifies the OLS solution, let’s walk through a concrete numerical example. We’ll use a small, interpretable dataset with two features and five observations. We’ll compute both the OLS and Ridge solutions manually to see the difference in action.
Step 1: Dataset
Suppose we have the following data:
Observation | Hours Studied ($$x_1$$) | Hours Slept ($$x_2$$) | Exam Score ($$y$$) |
---|---|---|---|
1 | 1 | 6 | 50 |
2 | 2 | 5 | 53 |
3 | 3 | 4 | 54 |
4 | 4 | 3 | 58 |
5 | 5 | 2 | 60 |
We now construct the design matrix \(\mathbf{X}\) and response vector \(\mathbf{y}\):
\[\mathbf{X} = \begin{bmatrix} 1 & 1 & 6 \\ 1 & 2 & 5 \\ 1 & 3 & 4 \\ 1 & 4 & 3 \\ 1 & 5 & 2 \\ \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} 50 \\ 53 \\ 54 \\ 58 \\ 60 \\ \end{bmatrix}\]Note: The first column is a bias term (intercept).
OLS Solution
We begin by computing the OLS coefficients using the normal equation:
\[\hat{\mathbf{w}}_{\text{OLS}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}\]Step 2: Compute \(\mathbf{X}^\top \mathbf{X}\)
\[\mathbf{X}^\top \mathbf{X} = \begin{bmatrix} 5 & 15 & 20 \\ 15 & 55 & 40 \\ 20 & 40 & 90 \\ \end{bmatrix}\]Step 3: Compute \(\mathbf{X}^\top \mathbf{y}\)
\[\mathbf{X}^\top \mathbf{y} = \begin{bmatrix} 275 \\ 885 \\ 995 \\ \end{bmatrix}\]Step 4: Solve for \(\hat{\mathbf{w}}\)
Now compute:
\[\hat{\mathbf{w}}_{\text{OLS}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}\]Assume the inverse has been computed (e.g., via hand or calculator):
\[(\mathbf{X}^\top \mathbf{X})^{-1} \approx \begin{bmatrix} 2.06 & -1.00 & -0.20 \\ -1.00 & 0.70 & -0.20 \\ -0.20 & -0.20 & 0.20 \\ \end{bmatrix}\]Then:
\[\hat{\mathbf{w}}_{\text{OLS}} \approx \begin{bmatrix} 2.06 & -1.00 & -0.20 \\ -1.00 & 0.70 & -0.20 \\ -0.20 & -0.20 & 0.20 \\ \end{bmatrix} \cdot \begin{bmatrix} 275 \\ 885 \\ 995 \\ \end{bmatrix}\]Multiplying this gives:
\[\hat{\mathbf{w}}_{\text{OLS}} \approx \begin{bmatrix} 5.0 \\ 9.0 \\ 2.0 \\ \end{bmatrix}\]Interpretation
- Intercept: 5.0
- \(x_1\) (study): coefficient = 9.0
- \(x_2\) (sleep): coefficient = 2.0
This suggests: Each additional hour of study increases score by 9, and each hour of sleep adds 2 points.
But this solution is sensitive to noise and correlated features.
Ridge Solution with \(\lambda = 10\)
Now let’s apply Ridge Regression with regularization strength \(\lambda = 10\).
Step 5: Add Regularization Term
We add \(\lambda \mathbf{I}\) to \(\mathbf{X}^\top \mathbf{X}\):
\[\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I} = \begin{bmatrix} 15 & 15 & 20 \\ 15 & 65 & 40 \\ 20 & 40 & 100 \\ \end{bmatrix}\]Step 6: Solve Ridge Equation
\[\hat{\mathbf{w}}_{\text{Ridge}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}\]Assuming the inverse has been computed:
\[(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \approx \begin{bmatrix} 0.45 & -0.12 & -0.05 \\ -0.12 & 0.05 & 0.01 \\ -0.05 & 0.01 & 0.02 \\ \end{bmatrix}\]Then multiply:
\[\hat{\mathbf{w}}_{\text{Ridge}} \approx \begin{bmatrix} 0.45 & -0.12 & -0.05 \\ -0.12 & 0.05 & 0.01 \\ -0.05 & 0.01 & 0.02 \\ \end{bmatrix} \cdot \begin{bmatrix} 275 \\ 885 \\ 995 \\ \end{bmatrix}\]Which yields:
- \[w_0 \approx -32.2\]
- \[w_1 \approx 21.2\]
- \[w_2 \approx 15.0\]
Final Comparison
Term | OLS Coefficient | Ridge Coefficient ($$ \lambda = 10 $$) |
---|---|---|
Intercept | 5.0 | -32.2 |
Hours Studied ($$x_1$$) | 9.0 | 21.2 |
Hours Slept ($$x_2$$) | 2.0 | 15.0 |
Takeaways
- OLS coefficients are more sensitive and can vary wildly under multicollinearity.
- Ridge coefficients are shrunken toward zero — less variance, more stability.
- Ridge does not produce sparsity (unlike Lasso); it retains all features but penalizes their magnitude.
- The regularization improves generalization by trading a bit of bias for much lower variance.
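These takeaways are easy to reproduce in code. The sketch below uses a synthetic dataset with two nearly collinear predictors (not the exam-score data above); the exact numbers will vary with the random seed, but Ridge's coefficients are consistently tamer than the OLS ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(7)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)        # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 2 * x2 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("OLS coefficients:  ", np.round(ols.coef_, 2))    # often large, erratically split
print("Ridge coefficients:", np.round(ridge.coef_, 2))  # shrunken and shared across x1, x2
```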
In the next section, we’ll explore how Ridge coefficients evolve with \(\lambda\), and visually demonstrate the regularization path.
Ridge Regularization Path
What Happens When We Vary \(\lambda\)?
One of the most insightful ways to understand Ridge Regression is to see how the coefficients change as we increase the regularization parameter \(\lambda\).
As we move from \(\lambda = 0\) (which gives the OLS solution) to larger values like \(\lambda = 1000\):
- The coefficients shrink continuously
- The model becomes simpler and more stable
- Eventually, all coefficients approach zero (but never become exactly zero, unlike Lasso)
This dynamic behavior is called the regularization path, and it tells us how Ridge manages the bias–variance tradeoff through weight shrinkage.
Show Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
# Simulated dataset
np.random.seed(42)
n_samples, n_features = 50, 5
X = np.random.randn(n_samples, n_features)
true_coefs = np.array([5, -3, 0, 2, 1])
y = X @ true_coefs + np.random.normal(0, 2, size=n_samples)
# Standardize X for better regularization path visualization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Generate lambdas on a log scale
lambdas = np.logspace(-2, 3, 100)
coefs = []
# Fit Ridge regression for each lambda and store coefficients
for l in lambdas:
    model = Ridge(alpha=l, fit_intercept=False)
    model.fit(X_scaled, y)
    coefs.append(model.coef_)
coefs = np.array(coefs)
# Plotting
plt.figure(figsize=(8, 5))
for i in range(n_features):
    plt.plot(np.log10(lambdas), coefs[:, i], label=f'Feature {i+1}')
plt.xlabel(r'$\log_{10}(\lambda)$')
plt.ylabel('Coefficient Value')
plt.title('Ridge Regularization Path')
plt.axhline(0, color='black', lw=0.5, linestyle='--')
plt.legend(loc='best', fontsize=9)
plt.grid(True, linestyle='--', linewidth=0.5)
plt.tight_layout()
plt.show()

This visualization shows the Ridge regularization path:
- Each colored line corresponds to the coefficient of a different feature.
- As \(\lambda\) increases (moving right on the log scale), all coefficients shrink gradually toward zero.
- None of the coefficients drop to zero completely — unlike Lasso, Ridge retains all features but reduces their impact.
This illustrates how Ridge applies continuous shrinkage, making the model more stable and less sensitive to noise or collinearity.
Ridge Path Intuition
- At \(\lambda = 0\), Ridge behaves exactly like OLS — the solution only cares about minimizing the squared residuals.
- As \(\lambda\) increases, Ridge begins penalizing large coefficients and pulls them toward zero.
- At very large \(\lambda\), the model is forced to become very flat, essentially ignoring the predictors and returning the mean.
This illustrates how Ridge trades variance for bias — shrinking the coefficients reduces model flexibility, but improves generalization.
Numerical Illustration: Ridge Coefficient Path
Let’s observe how Ridge coefficients shrink across different values of \(\lambda\), using the same dataset as in our earlier example.
$$\lambda$$ | Intercept | Hours Studied ($$w_1$$) | Hours Slept ($$w_2$$) |
---|---|---|---|
0 (OLS) | 5.0 | 9.0 | 2.0 |
1 | 3.6 | 7.8 | 2.1 |
5 | 1.2 | 5.4 | 2.7 |
10 | -1.0 | 4.1 | 3.2 |
50 | -8.5 | 2.2 | 3.8 |
100 | -13.0 | 1.5 | 4.0 |
You can clearly see the gradual shrinking of coefficients. Ridge does not drop any predictor (unlike Lasso), but it dampens the model complexity by reducing the influence of individual variables.
Geometric Interpretation of the Ridge Constraint
To fully grasp Ridge, it’s helpful to understand what the regularization term is doing geometrically.
Ridge regression solves the constrained optimization problem:
\[\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{Xw}\|_2^2 \quad \text{subject to} \quad \|\mathbf{w}\|_2^2 \leq t\]This constraint forces the solution to stay within an L2 ball centered at the origin. The Ridge solution is the point on the boundary of this ball that also lies on the lowest loss contour.
Below is a geometric visualization of this principle:
Show Code
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
# Create grid for w1 and w2
w1 = np.linspace(-5, 5, 500)
w2 = np.linspace(-5, 5, 500)
W1, W2 = np.meshgrid(w1, w2)
# Simulate elliptical loss contours
Z = (3*W1 + 2*W2 - 5)**2 + (W1 + 2*W2 - 4)**2
# Ridge constraint (L2 norm balls)
ridge_levels = [2, 4, 6, 8, 10]
# Plot
fig, ax = plt.subplots(figsize=(7, 7))
contour = ax.contour(W1, W2, Z, levels=25, cmap='Greys', linewidths=1.2)
# Add Ridge L2 balls
for r in ridge_levels:
    circle = patches.Circle((0, 0), radius=np.sqrt(r), fill=False, linestyle='--', edgecolor='blue', linewidth=1.5)
    ax.add_patch(circle)
# Add annotations
ax.annotate("L2 Constraint Region (Ridge Penalty)",
xy=(2.5, 2.5), xytext=(3.2, 4),
arrowprops=dict(arrowstyle="->", lw=1.5),
fontsize=10, color='blue')
ax.annotate("Optimal Ridge Solution",
xy=(1.5, 1.2), xytext=(-4, 4),
arrowprops=dict(arrowstyle="->", lw=1.5),
fontsize=10, color='black')
# Formatting
ax.set_title("Ridge Regression: Geometric View of L2 Constraint", fontsize=13)
ax.set_xlabel(r"$w_1$", fontsize=12)
ax.set_ylabel(r"$w_2$", fontsize=12)
ax.axhline(0, color='gray', linewidth=0.5)
ax.axvline(0, color='gray', linewidth=0.5)
ax.set_aspect('equal', adjustable='box')
ax.grid(True, linestyle='--', linewidth=0.5)
plt.tight_layout()
plt.show()

- The gray ellipses represent contours of the least squares error — all points on an ellipse give the same squared loss.
- The blue dashed circles are Ridge’s L2 constraint regions: \(\|\mathbf{w}\|_2^2 \leq t\).
- The Ridge solution lies at the point of tangency between the lowest possible ellipse and the largest L2 ball.
- As the L2 ball gets smaller (i.e., larger \(\lambda\)), the solution is forced closer to the origin.
This elegant geometry reveals Ridge’s nature: it never eliminates features, but it shrinks all coefficients, taming model complexity while retaining structure.
When to Use Ridge?
Ridge is especially useful when:
- There’s multicollinearity among predictors
- You have more predictors than observations (\(p \gg n\))
- You want stable coefficients without discarding variables
- You care about prediction performance, not feature selection
- Your dataset is noisy or exhibits high variance
Lasso Regression: Sparsity by Design
In the previous section, we saw how Ridge Regression shrinks coefficients continuously toward zero using an L2 penalty. It smooths the model, reduces variance, and handles multicollinearity with grace — but it never outright eliminates features. Every variable, no matter how weak, is retained with some weight.
But what if we want more than just shrinkage? What if we want our model to decide which features matter and discard the rest?
That’s where Lasso Regression steps in.
Lasso, short for Least Absolute Shrinkage and Selection Operator, modifies the loss function by using an L1 penalty instead of L2. This subtle change produces a profound effect: some coefficients are driven exactly to zero. In doing so, Lasso performs automatic feature selection — making it particularly useful when:
- You have many features but suspect that only a few are truly relevant
- You want to build interpretable models that highlight the strongest predictors
- You prefer models that are sparse, elegant, and efficient
The Lasso Objective
The Lasso optimization problem is defined as:
\[\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{Xw}\|^2_2 + \lambda \|\mathbf{w}\|_1\]This is similar to Ridge — with the same squared error term — but the penalty term is now the L1 norm:
\[\|\mathbf{w}\|_1 = \sum_{j=1}^{p} |w_j|\]Key Properties of Lasso
- The L1 norm penalty induces sparsity — setting some weights exactly to zero.
- Unlike Ridge, which has a closed-form solution, Lasso requires iterative solvers like:
- Coordinate Descent (most common)
- LARS (Least Angle Regression), which traces the exact piecewise-linear solution path
- Lasso can be viewed as a form of model selection, not just regularization.
- As \(\lambda\) increases, more weights are pruned, leading to smaller and simpler models.
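As a quick illustration of these properties, the sketch below fits scikit-learn's `Lasso` at a few penalty strengths on synthetic data where only 3 of 20 features matter, and counts how many coefficients survive (the exact counts depend on the noise and the chosen `alpha` values).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:3] = [4.0, -3.0, 2.0]          # only 3 of 20 features matter
y = X @ true_w + rng.normal(scale=1.0, size=n)

for alpha in [0.01, 0.1, 1.0]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    n_nonzero = np.sum(lasso.coef_ != 0)
    print(f"alpha={alpha:<5} non-zero coefficients: {n_nonzero}")
```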
Just like Ridge was constrained to lie within an L2 ball, Lasso constrains the weights within an L1 ball — a diamond-shaped region in coefficient space. This geometric difference is the key to understanding why Lasso drives weights to zero.
We’ll explore this visually next, but first, let’s build intuition from a numerical example and understand how Lasso actually solves this non-differentiable objective.
Deriving the Lasso Regression Solution
As we’ve seen, Lasso solves the following optimization problem:
\[\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{Xw}\|_2^2 + \lambda \|\mathbf{w}\|_1\]Unlike OLS and Ridge, Lasso’s \(\ell_1\) norm penalty makes the loss function non-differentiable at zero. This is the key to its sparsity, but it also means that no closed-form matrix solution exists in general.
Exception: 1D Case (Closed-Form Lasso Solution exists)
Consider a simple regression with:
- Predictor vector: \(\mathbf{x} \in \mathbb{R}^n\)
- Response vector: \(\mathbf{y} \in \mathbb{R}^n\)
- Scalar weight: \(w \in \mathbb{R}\)
The Lasso objective is:
\[J(w) = \|\mathbf{y} - \mathbf{x}w\|^2_2 + \lambda |w|\]Step 1: Expand the Loss Function
First, expand the squared error term:
\[\|\mathbf{y} - \mathbf{x}w\|^2_2 = (\mathbf{y} - \mathbf{x}w)^\top (\mathbf{y} - \mathbf{x}w) = \mathbf{y}^\top\mathbf{y} - 2w\mathbf{x}^\top\mathbf{y} + w^2\mathbf{x}^\top\mathbf{x}\]Substituting into the objective:
\[J(w) = \mathbf{y}^\top\mathbf{y} - 2w\mathbf{x}^\top\mathbf{y} + w^2\mathbf{x}^\top\mathbf{x} + \lambda |w|\]Let:
- \(a = \mathbf{x}^\top\mathbf{x}\) (squared norm of features)
- \(b = \mathbf{x}^\top\mathbf{y}\) (covariance between x and y)
Simplifying (and dropping the constant \(\mathbf{y}^\top\mathbf{y}\)):
\[J(w) = aw^2 - 2bw + \lambda |w|\]Step 2: Subgradient Analysis
The absolute value makes \(J(w)\) non-differentiable at \(w=0\). We analyze three cases:
Case 1: \(w > 0\)
\(\frac{dJ}{dw} = 2aw - 2b + \lambda\) Setting to zero: \(w = \frac{b - \lambda/2}{a}\) Valid only when \(b > \lambda/2\)
Case 2: \(w < 0\)
\(\frac{dJ}{dw} = 2aw - 2b - \lambda\) Setting to zero: \(w = \frac{b + \lambda/2}{a}\) Valid only when \(b < -\lambda/2\)
Case 3: \(w = 0\)
The subdifferential is: \(\left[-2b - \lambda, -2b + \lambda \right]\) Zero is in this interval when \(|b| \leq \lambda/2\)
Step 3: Soft-Thresholding Solution
Combining all cases, the optimal solution is:
\[w^* = \text{sign}(b) \cdot \max\left(0, \frac{|b| - \lambda/2}{a} \right)\]This soft-thresholding operator reveals Lasso’s key properties:
- Sparsity: When \(|b| \leq \lambda/2\), the coefficient is exactly zero.
- Shrinkage: Non-zero coefficients are pulled toward zero; relative to the OLS estimate \(b/a\), their magnitude is reduced by \(\lambda/(2a)\)
- Sign Preservation: The sign matches the correlation between \(\mathbf{x}\) and \(\mathbf{y}\)
This closed-form only exists in the univariate case or when performing coordinate descent on one variable at a time.
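The univariate solution is easy to express as a small function and to sanity-check against a brute-force grid search (a sketch: `a`, `b`, and `lam` follow the definitions above, and the data are synthetic).

```python
import numpy as np

def lasso_1d(x, y, lam):
    """Closed-form univariate Lasso: argmin_w ||y - x*w||^2 + lam*|w|."""
    a = x @ x          # a = x^T x
    b = x @ y          # b = x^T y
    return np.sign(b) * max(0.0, (abs(b) - lam / 2) / a)

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.8 * x + rng.normal(scale=0.5, size=50)
lam = 5.0

w_star = lasso_1d(x, y, lam)

# Brute-force check: evaluate the objective on a fine grid of candidate w values
grid = np.linspace(-2, 2, 20001)
objective = ((y[None, :] - np.outer(grid, x)) ** 2).sum(axis=1) + lam * np.abs(grid)
print(w_star, grid[np.argmin(objective)])   # the two values should nearly coincide
```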
Multivariate Lasso: Coordinate Descent
In higher dimensions, Lasso is solved using coordinate descent, which iteratively updates each weight using the soft-thresholding rule while holding the others fixed.
Here’s a sketch of the update rule for each coordinate \(w_j\):
\[w_j \leftarrow \text{sign}(z_j) \cdot \max\left(0, \frac{|z_j| - \lambda/2}{a_j} \right)\]Where:
- \(z_j = x_j^\top \left(y - \sum_{k \neq j} x_k w_k\right)\) — the residual contribution of feature \(j\)
- \(a_j = x_j^\top x_j\) — squared norm of feature \(j\)
Each update is fast and easy to compute. Coordinate descent loops through all \(w_j\) until convergence.
Geometric and Optimization Implications
- The soft-thresholding function is non-linear and non-differentiable at 0, but it’s convex and efficiently solvable.
- It embodies Lasso’s philosophy: reward sparsity, but allow non-zero coefficients if the signal is strong enough to overcome the penalty.
- This is the core mechanism by which Lasso performs variable selection.
Geometry of Lasso: Why It Leads to Sparse Solutions
By now, we’ve seen Lasso’s numerical behavior — shrinking coefficients and driving some of them to zero. But what’s the underlying reason for this zeroing-out? Why does Lasso, unlike Ridge, actually drop features?
The key lies not in just algebra, but in geometry. Let’s look at how Lasso constrains the solution space, and how that affects where the optimal weights land.
We begin with the optimization problem:
\[\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{Xw}\|_2^2 + \lambda \|\mathbf{w}\|_1\]This is equivalent to a constrained form:
\[\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{Xw}\|_2^2 \quad \text{subject to} \quad \|\mathbf{w}\|_1 \leq t\]for some tuning parameter \(t\) (which is inversely related to \(\lambda\)). This constraint — \(\|\mathbf{w}\|_1 \leq t\) — defines an L1 ball, a diamond-shaped region in 2D, or a polytope in higher dimensions.
Now, when we attempt to minimize the loss subject to this constraint, we’re essentially looking for the first point at which the elliptical contours of the squared loss function “kiss” or intersect the feasible region defined by the L1 ball.
Why the Shape Matters
Let’s compare Lasso with Ridge for a moment.
- In Ridge Regression, the constraint region is a circle (or hypersphere), defined by an \(\ell_2\) norm: \(\|\mathbf{w}\|_2^2 \leq t\).
- In Lasso Regression, the constraint region is a diamond (a cross-polytope in higher dimensions), defined by an \(\ell_1\) norm: \(\|\mathbf{w}\|_1 \leq t\).
Now here’s the crux: as the loss contours expand, they tend to make first contact with the diamond at one of its corners, and a circle has no corners at all. At those corners, some coefficients are exactly zero. So if the loss contours meet the constraint region at a corner, the solution has zeros in it. That’s the geometry of sparsity.
Let’s Visualize
Below is a visualization where:
- The elliptical contours represent levels of constant squared error.
- The blue diamond is the L1 constraint region.
- The red dot shows the OLS solution (unconstrained).
- The green dot marks the Lasso solution, lying on the edge of the constraint.
Show Code
import numpy as np
import matplotlib.pyplot as plt
# Create coefficient grid
w1 = np.linspace(-10, 10, 400)
w2 = np.linspace(-10, 10, 400)
W1, W2 = np.meshgrid(w1, w2)
# Elliptical contours of the squared error centered at OLS solution
Z = (W1 - 9)**2 + (W2 - 2)**2
# Plot setup
fig, ax = plt.subplots(figsize=(7, 7))
# Contour plot
contours = ax.contour(W1, W2, Z, levels=20, cmap='Greys', alpha=0.8)
# L1 constraint diamond
t_budget = 5   # L1 budget small enough that the constraint is active at the solution
diamond = np.array([
    [ t_budget, 0],
    [0, t_budget],
    [-t_budget, 0],
    [0, -t_budget],
    [ t_budget, 0]
])
ax.plot(diamond[:, 0], diamond[:, 1], color='blue', linestyle='--', linewidth=2, label='L1 Constraint (Lasso)')
# Solutions: projecting the OLS point (9, 2) onto this L1 ball lands on the corner (5, 0), where w2 = 0
ax.plot(9, 2, 'ro', label='OLS Solution')
ax.plot(t_budget, 0, 'go', label='Lasso Solution (Sparse)')
# Annotation
ax.annotate("Corner → promotes zero weights", xy=(t_budget, 0), xytext=(2, 7),
            arrowprops=dict(arrowstyle="->", lw=1.5), fontsize=10)
# Aesthetics
ax.set_xlabel(r"$w_1$")
ax.set_ylabel(r"$w_2$")
ax.set_title("Lasso: Geometric Interpretation of L1 Constraint")
ax.axhline(0, color='gray', linewidth=0.5)
ax.axvline(0, color='gray', linewidth=0.5)
ax.legend()
ax.grid(True, linestyle='--', linewidth=0.5)
ax.set_aspect('equal')
plt.tight_layout()
plt.show()

What’s Happening in This Plot?
- The gray ellipses represent constant error contours — the function we want to minimize.
- The blue dashed diamond is the set of points that satisfy the L1 constraint.
- The optimizer searches along the diamond for the point that touches the innermost ellipse. Often, this is at a corner, where either \(w_1 = 0\) or \(w_2 = 0\) (or both).
- Hence, Lasso yields sparse solutions.
This intuition generalizes to higher dimensions too. In 3D, the L1 constraint becomes an octahedron; in 4D and beyond, a high-dimensional polytope — but the corners remain.
Lasso’s Bias–Variance Tradeoff
This behavior reflects how Lasso navigates the bias–variance tradeoff:
- At low \(\lambda\): the model resembles OLS, with low bias and high variance.
- At higher \(\lambda\): Lasso aggressively shrinks and zeroes out coefficients, introducing bias but lowering variance.
This adaptive complexity control makes Lasso incredibly useful for high-dimensional problems where only a subset of predictors matter.
Optimization in Lasso Regression: Coordinate Descent and LARS
Lasso regression introduces a challenge that doesn’t appear in OLS or even Ridge: the L1 penalty makes the loss function non-differentiable at zero. This precludes a simple closed-form solution, and forces us to adopt specialized optimization techniques. Two of the most widely used methods are:
- Coordinate Descent — efficient and scalable for high-dimensional problems.
- LARS (Least Angle Regression) — elegant and interpretable, especially when the number of features is small to moderate.
Let’s dig into both.
1. Coordinate Descent: Intuition and Math
Coordinate Descent is a greedy and iterative optimization technique. It solves the optimization problem by updating one coordinate (i.e., one weight) at a time, keeping all others fixed.
Lasso Objective Function
We aim to minimize the following:
\[\min_{\mathbf{w}} \; \frac{1}{2} \|\mathbf{y} - \mathbf{Xw}\|_2^2 + \lambda \|\mathbf{w}\|_1\]This is a convex but non-differentiable function because of the L1 norm.
Let’s denote:
- \(\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_p]\) → columns of design matrix
- \(w_j\) → j-th coordinate (parameter)
- \(\mathbf{r}^{(j)} = \mathbf{y} - \sum_{k \ne j} x_k w_k\) → partial residual excluding feature \(j\)
We update each \(w_j\) by solving the 1D subproblem:
\[\min_{w_j} \; \frac{1}{2} \|\mathbf{r}^{(j)} - x_j w_j\|_2^2 + \lambda |w_j|\]This is a classic least squares with L1 penalty — and has a known solution:
Soft Thresholding Solution
Define:
\[z_j = \langle x_j, \mathbf{r}^{(j)} \rangle = x_j^\top (\mathbf{y} - \sum_{k \ne j} x_k w_k)\] \[t_j = \|x_j\|_2^2\]Then the optimal update for \(w_j\) is:
\[w_j \leftarrow S\left( \frac{z_j}{t_j}, \frac{\lambda}{t_j} \right)\]Where:
\[S(z, \gamma) = \begin{cases} z - \gamma & \text{if } z > \gamma \\ 0 & \text{if } |z| \le \gamma \\ z + \gamma & \text{if } z < -\gamma \end{cases}\]This is called the soft-thresholding operator — it shrinks the coefficient toward zero and sets it to zero if it’s small enough (thus enabling sparsity).
Algorithm Summary
- Initialize \(\mathbf{w} = 0\) or via Ridge
- Iterate over each coordinate \(w_j\):
- Compute partial residual \(\mathbf{r}^{(j)}\)
- Update \(w_j\) via soft-thresholding
- Repeat until convergence
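Putting these steps together gives a compact coordinate-descent routine. The sketch below implements the update rule above for the objective with the \(\tfrac{1}{2}\) factor (no intercept, synthetic data) and compares against scikit-learn's `Lasso`; since sklearn scales its squared-error term by the number of samples, `alpha = lam / n` makes the two objectives match.

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, gamma):
    """S(z, gamma): shrink z toward zero, clipping to zero inside [-gamma, gamma]."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize 0.5*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_norms = (X ** 2).sum(axis=0)            # t_j = ||x_j||^2
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding feature j
            r_j = y - X @ w + X[:, j] * w[j]
            z_j = X[:, j] @ r_j
            w[j] = soft_threshold(z_j / col_norms[j], lam / col_norms[j])
    return w

rng = np.random.default_rng(3)
n, p = 80, 10
X = rng.normal(size=(n, p))
true_w = np.array([3, 0, 0, -2, 0, 0, 0, 1, 0, 0], dtype=float)
y = X @ true_w + rng.normal(scale=0.5, size=n)

lam = 20.0
w_cd = lasso_coordinate_descent(X, y, lam)

# sklearn's Lasso uses (1/(2n))*RSS + alpha*||w||_1, so alpha = lam / n
w_sk = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y).coef_
print(np.round(w_cd, 3))
print(np.round(w_sk, 3))     # should be very close
```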
Practical Advantages
- Extremely fast and memory-efficient for sparse data
- Can handle millions of variables (used in genomics, NLP)
- Convergence guaranteed for convex problems
2. LARS (Least Angle Regression): Elegant and Interpretable
LARS is an algorithm that, with a small modification, computes the entire Lasso regularization path. Think of it as a generalization of forward stepwise regression — but more refined.
It constructs the solution piecewise linearly as a function of \(\lambda\).
Key Ideas Behind LARS
- Start with all coefficients at zero
- Find the feature most correlated with the residual → this is the first feature to enter
- Move coefficients toward their least-squares solution, but stop when:
- Another feature becomes equally correlated
- Then move in an equiangular direction between the two features
- Continue until:
- All features are in
- Or a predefined sparsity or error level is reached
LARS vs Lasso
- LARS can be modified to perform Lasso regression by dropping variables when their coefficients become zero during the path.
- This makes it capable of computing entire Lasso paths without solving full optimization at each step.
Mathematical Insight
Suppose:
\[\hat{c}_j = x_j^\top r, \quad r = y - \hat{y}\]At each step:
- Select \(j\) with largest \(|\hat{c}_j|\)
- Move in direction of \(x_j\)
- Adjust direction when new feature reaches same correlation
In the Lasso-modified version:
- Check for coefficient sign flip (i.e., shrinkage to 0)
- If flip occurs → drop the feature from active set
This creates a piecewise linear path of solutions, which is computationally efficient.
Visualization: LARS Path
You can think of the LARS path as a plot of \(w_j(\lambda)\) — all the coefficients evolving as \(\lambda\) decreases from a value large enough to zero out every coefficient down toward zero, where the fit approaches OLS.
It is similar to the Ridge path, but with sharp bends and early zeroing — reflecting variable selection.
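scikit-learn exposes this path directly through `lars_path` with `method='lasso'`. The sketch below computes and plots the piecewise-linear coefficient paths on synthetic data; the sharp bends and early zeroing described above show up as kinks in the lines.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lars_path

rng = np.random.default_rng(5)
n, p = 60, 8
X = rng.normal(size=(n, p))
true_w = np.array([4, 0, 0, -3, 0, 2, 0, 0], dtype=float)
y = X @ true_w + rng.normal(scale=1.0, size=n)

# Compute the full Lasso path with the LARS algorithm
alphas, active, coefs = lars_path(X, y, method='lasso')

plt.figure(figsize=(7, 4))
for j in range(p):
    plt.plot(alphas, coefs[j], label=f'Feature {j+1}')
plt.gca().invert_xaxis()                 # read left to right: strong penalty to weak penalty
plt.xlabel(r'$\lambda$ (LARS alphas)')
plt.ylabel('Coefficient value')
plt.title('Lasso path computed by LARS')
plt.legend(fontsize=8)
plt.tight_layout()
plt.show()
```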
When to Use Which?
Method | Pros | Cons |
---|---|---|
Coordinate Descent | Fast, scalable, works for general \(p\) | Not as interpretable step-by-step |
LARS (for Lasso) | Elegant, gives full path | Slower for high-dimensional problems |
Summary
- Coordinate Descent solves Lasso efficiently by leveraging the separability of the objective across coordinates, applying the soft-thresholding operator at each step.
- LARS takes a geometric approach, constructing a path of solutions and adjusting direction based on correlations with residuals.
- Both approaches handle the non-differentiability of the L1 penalty — and highlight the unique nature of Lasso optimization.
Numerical Example: Lasso Regression vs OLS — Solved Step-by-Step
To truly understand how Lasso performs variable selection and weight shrinkage, let’s walk through a concrete, small-scale example. This example will not only help you see how the coefficients are computed but also how optimization is handled in practice, and how Lasso compares against OLS.
Problem Setup
Suppose we have a very small dataset with only 3 observations and 2 predictors. The design matrix \(\mathbf{X}\) and the response vector \(\mathbf{y}\) are:
\[\mathbf{X} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ \end{bmatrix}\]This corresponds to a classic simple linear regression setup — predicting \(y\) from one variable (the second column of \(X\), since the first is the intercept term).
Let’s find both the OLS and Lasso solutions and understand the difference.
Step 1: OLS Solution
Recall that OLS minimizes the squared error:
\[\min_{\mathbf{w}} \| \mathbf{y} - \mathbf{Xw} \|_2^2\]The closed-form solution is:
\[\hat{\mathbf{w}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}\]Let’s compute it:
\[\mathbf{X}^\top \mathbf{X} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \\ \end{bmatrix}\] \[\mathbf{X}^\top \mathbf{y} = \begin{bmatrix} 6 \\ 14 \\ \end{bmatrix}\]Solving:
\[\hat{\mathbf{w}}_{\text{OLS}} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \\ \end{bmatrix}^{-1} \begin{bmatrix} 6 \\ 14 \\ \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ \end{bmatrix}\]So, the OLS solution is:
\[w_0 = 0, \quad w_1 = 1\]Step 2: Lasso Objective
Lasso modifies the OLS objective by adding an \(\ell_1\) penalty:
\[\min_{\mathbf{w}} \| \mathbf{y} - \mathbf{Xw} \|_2^2 + \lambda \|\mathbf{w}\|_1\]Let’s choose \(\lambda = 1.0\) and solve the optimization manually via coordinate descent.
Step 3: Coordinate Descent – Manual Iteration
Coordinate descent solves the problem by cycling through one coefficient at a time while holding others fixed. In each step, we solve:
\[w_j \leftarrow S\left(\frac{1}{n} \sum_{i=1}^n x_{ij} (r_i + x_{ij} w_j), \frac{\lambda}{2n} \right)\]Where:
- \(S(z, \gamma)\) is the soft-thresholding operator defined earlier
Let’s go step by step:
Initialize:
Start with \(w_0 = w_1 = 0\)
Step 1: Update \(w_1\)
Residual: \(\mathbf{r} = \mathbf{y} - \mathbf{Xw} = \mathbf{y}\)
Compute raw correlation with feature 1:
\[z = \frac{1}{3} \cdot \mathbf{x}_1^\top \mathbf{r} = \frac{1}{3} \cdot (1 + 2 + 3) = 2\]Apply soft-thresholding:
\[\gamma = \frac{\lambda}{2n} = \frac{1}{6} \approx 0.1667\]So:
\[w_1 = S(2, 0.1667) = 2 - 0.1667 = 1.8333\]Step 2: Update \(w_0\) (intercept)
Usually, we do not regularize the intercept. So, we solve:
\[w_0 = \frac{1}{3} \sum_i (y_i - w_1 x_{i2}) = \frac{1}{3} \left[(1 - 1.8333\cdot1) + (2 - 1.8333\cdot2) + (3 - 1.8333\cdot3)\right] = \frac{1}{3}(-0.8333 - 1.6666 - 2.4999) = -1.6666\]After one full iteration:
\[w_0 \approx -1.67, \quad w_1 \approx 1.83\]Geometric View
We can now understand why this solution is different from OLS.
- The Lasso solution lies on the boundary of the L1 constraint (diamond-shaped region).
- Because of the sharp corners, the optimization gets “stuck” on axes, creating zero coefficients.
- If the true solution is near a corner, Lasso will set irrelevant weights to zero — making the model sparse.
Side-by-Side Summary
Model | Intercept ($$w_0$$) | Slope ($$w_1$$) | Interpretation |
---|---|---|---|
OLS | 0 | 1 | Fits data exactly, but may overfit with noisy or high-dimensional data |
Lasso ($$\lambda = 1$$) | -1.67 | 1.83 | Pushes weights toward zero, adds bias but improves generalization |
What We Learned
- Lasso introduces bias but controls variance — achieving better generalization.
- The optimization is non-smooth, but tractable via coordinate descent.
- Visually, the constraint region of Lasso explains its feature selection behavior.
This simple example embodies all the key intuitions that make Lasso a powerful tool — especially in high-dimensional, noisy settings where we need to decide what to keep and what to ignore.
When Should You Use Lasso?
Lasso Regression isn’t just a twist on OLS—it’s a fundamentally different way to think about modeling. Its true power shines when we face datasets with many predictors, but we suspect that only a few are truly meaningful. Here’s when Lasso should be your go-to model:
Ideal Conditions for Lasso
Sparse True Signal
Lasso is most effective when the underlying data-generating process is sparse—meaning, most features have no effect, and only a few matter.
Let’s suppose you have a dataset with 100 predictors, but in truth, only 3 influence the outcome. OLS would try to fit all 100, resulting in:
- High variance
- Noisy coefficients
- Poor generalization
Lasso, by contrast, will shrink irrelevant weights to zero, acting like an automatic feature selector.
High-Dimensional Settings (\(p > n\))
In cases where the number of features \(p\) exceeds the number of observations \(n\), OLS is not even uniquely solvable (since \(\mathbf{X}^\top \mathbf{X}\) is not invertible).
But Lasso doesn’t require invertibility. In fact:
- The \(\ell_1\) penalty regularizes the optimization.
- It yields sparse solutions even when \(p \gg n\).
- You get interpretability and stability, both at once.
Feature Selection is a Priority
If you want a model that not only performs well but is easy to interpret, Lasso helps by zeroing out irrelevant features.
This is particularly useful when:
- You’re building explainable ML systems.
- You’re doing variable screening before feeding features into another model.
- You’re dealing with cost-sensitive applications, where collecting too many features is expensive.
When Not to Use Lasso?
Despite its benefits, Lasso comes with caveats. You should avoid using Lasso when:
- All features are important but only mildly so — Lasso may discard useful signals.
- Features are highly correlated — Lasso picks one and discards others arbitrarily.
- You care more about shrinkage than selection — in which case, Ridge Regression may be more stable.
Lasso vs Ridge: A Quick Recap
Aspect | Lasso | Ridge |
---|---|---|
Penalty | $$\ell_1$$ (sum of absolute values) | $$\ell_2$$ (sum of squares) |
Feature Selection | Yes (can set weights to zero) | No (shrinks but retains all weights) |
Stability with Collinearity | Poor (arbitrary selection) | Better (distributes weights) |
Use Case | Sparse signals, high $$p$$ | Multicollinearity, many small but nonzero effects |
Summary
- Use Lasso when you believe that only a subset of features matter.
- It helps in automatic feature selection, especially in high-dimensional and noisy environments.
- But beware: if your features are correlated or all contribute weakly, Ridge (or Elastic Net) may offer better stability.
Elastic Net: The Best of Both Worlds
At this point, we’ve developed a fairly mature understanding of Ridge and Lasso regressions. Each brings something crucial to the table. Ridge (L2 regularization) keeps all features but shrinks their influence — great for stability, especially when predictors are highly correlated. Lasso (L1 regularization), on the other hand, is a sparse modeler’s dream: it performs automatic feature selection by driving some coefficients exactly to zero.
But here’s the catch — and it’s a big one.
Motivation for Elastic Net
Where Lasso Falters: Correlation Among Features
Lasso has a tendency to randomly pick just one feature from a set of highly correlated predictors and ignore the others completely. This is problematic in many real-world datasets where multicollinearity is the norm, not the exception.

Imagine two features that are both strong predictors of the response and are almost collinear. A well-regularized model should ideally consider both. Ridge does this gracefully by distributing the weight across correlated features. Lasso, however, zeroes out all but one. This behavior makes Lasso unstable when faced with correlation. A small perturbation in the data — say, adding or removing one observation — might cause Lasso to pick an entirely different feature from the correlated group.
Let’s make this concrete. Suppose we have predictors \(x_1\) and \(x_2\) such that \(x_1 \approx x_2\). The Lasso solution could yield:
\[\hat{w}_1 = 2.5,\quad \hat{w}_2 = 0.0 \quad \text{(one run)}\] \[\hat{w}_1 = 0.0,\quad \hat{w}_2 = 2.5 \quad \text{(another run)}\]Such behavior is troubling if we want consistent, interpretable models.
Why Not Just Use Ridge?
Ridge is remarkably good at handling multicollinearity — and that’s precisely the problem. It never zeros out coefficients. All features are retained, albeit with shrunk weights. This might be fine for pure predictive modeling, but not if we care about interpretability, parsimony, or computational efficiency.
So we face a tradeoff:
- Ridge → Stability and smoothness, but no sparsity
- Lasso → Sparsity and feature selection, but unstable under correlation
What if we want the best of both?
The Elastic Net Solution
Enter Elastic Net, a regularization strategy that blends the L1 and L2 penalties in one unified objective:
\[\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{Xw}\|^2_2 + \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|^2_2\]This hybrid penalty gives us a dial to control both sparsity and shrinkage:
- The L1 term \(\lambda_1 \|\mathbf{w}\|_1\) encourages sparsity — zeroing out unimportant features.
- The L2 term \(\lambda_2 \|\mathbf{w}\|_2^2\) promotes stability, especially when predictors are correlated.
By tuning both \(\lambda_1\) and \(\lambda_2\), we can interpolate smoothly between Ridge and Lasso:
- Set \(\lambda_1 = 0\), we recover Ridge.
- Set \(\lambda_2 = 0\), we recover Lasso.
- Set both nonzero, we walk the middle path: robustness with parsimony.
Elastic Net doesn’t just solve an algorithmic problem — it reflects a modeling philosophy: the belief that good models balance parsimony and stability. And this balance is particularly critical in modern machine learning pipelines, where datasets are often high-dimensional, noisy, and full of correlated features.
Mathematical Formulation and Objective: Inside the Heart of Elastic Net
After motivating the need for Elastic Net as a hybrid regularization approach, we now step into the formal definition of the Elastic Net regression model. This section will develop its mathematical core — the objective function, interpretation of terms, and its relationship to Ridge and Lasso — with both depth and clarity.
The Elastic Net Objective Function
Let’s recall the standard setup in supervised linear regression:
We are given a response vector \(\mathbf{y} \in \mathbb{R}^{n}\) and a feature matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\). Our goal is to estimate the weight vector \(\mathbf{w} \in \mathbb{R}^{p}\) such that the predicted responses \(\mathbf{Xw}\) are close to the observed responses \(\mathbf{y}\).
The Elastic Net solves the following regularized minimization problem:
\[\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \left\{ \|\mathbf{y} - \mathbf{Xw}\|_2^2 + \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2 \right\}\]This expression consists of three parts:
- Data fidelity term \(\|\mathbf{y} - \mathbf{Xw}\|_2^2\): ensures the model fits the data well by penalizing prediction errors (just like in OLS).
- L1 regularization term \(\lambda_1 \|\mathbf{w}\|_1 = \lambda_1 \sum_{j=1}^{p} |w_j|\): encourages sparsity by shrinking some weights exactly to zero. This is the essence of Lasso.
- L2 regularization term \(\lambda_2 \|\mathbf{w}\|_2^2 = \lambda_2 \sum_{j=1}^{p} w_j^2\): encourages small weights but retains all features. This is the essence of Ridge.
The hyperparameters \(\lambda_1\) and \(\lambda_2\) govern the strength of the L1 and L2 penalties, respectively.
Intuition: Balancing Sparsity and Stability
Why use both penalties?
- The L1 norm introduces non-differentiable kinks at zero — a property that helps it drive coefficients exactly to zero. But when predictors are strongly correlated, Lasso tends to select one feature at random and ignore the rest. This instability harms interpretability and model reliability.
- The L2 norm doesn’t zero out coefficients but stabilizes the solution by distributing weights among correlated features. This is particularly helpful when dealing with multicollinearity or when \(p \gg n\) (more features than samples).
The Elastic Net combines these forces: the sparsity-inducing property of L1 and the grouping and stability behavior of L2. It finds a sweet spot that often performs better when you expect:
- Many correlated predictors
- Some level of sparsity, but not extreme
- A need for both interpretability and robustness
Constraint Form: A Dual Perspective
Just like Lasso and Ridge can be rewritten as constrained problems, so can Elastic Net.
Elastic Net’s objective has an equivalent constrained formulation (for some constants \(t_1, t_2 > 0\)):
\[\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{Xw}\|_2^2 \quad \text{subject to} \quad \|\mathbf{w}\|_1 \leq t_1, \quad \|\mathbf{w}\|_2^2 \leq t_2\]This geometric interpretation helps us visualize the feasible set as an intersection of an \(\ell_1\) ball (a diamond) and an \(\ell_2\) ball (a circle) — a concept we’ll expand on in the next section.
Special Cases: Recovering Ridge and Lasso
Elastic Net generalizes both Ridge and Lasso. In fact:
- If we set \(\lambda_1 = 0\), the Elastic Net reduces to Ridge Regression: \[\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \left\{ \|\mathbf{y} - \mathbf{Xw}\|_2^2 + \lambda_2 \|\mathbf{w}\|_2^2 \right\}\]
- If we set \(\lambda_2 = 0\), the Elastic Net reduces to Lasso Regression: \[\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \left\{ \|\mathbf{y} - \mathbf{Xw}\|_2^2 + \lambda_1 \|\mathbf{w}\|_1 \right\}\]
Thus, Elastic Net can be seen as a bridge between these two extremes — tuning \(\lambda_1\) and \(\lambda_2\) allows the user to control the relative importance of sparsity vs. stability.
Rescaled Parameterization: A Practical Formulation
In many software implementations (like scikit-learn), Elastic Net is written in a slightly different form with a single regularization strength \(\alpha\) and a mixing ratio \(\rho \in [0, 1]\):
\[\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \left\{ \|\mathbf{y} - \mathbf{Xw}\|_2^2 + \alpha \left( \rho \|\mathbf{w}\|_1 + \frac{1 - \rho}{2} \|\mathbf{w}\|_2^2 \right) \right\}\]Where:
- \(\alpha\) sets the overall penalty strength (playing the role of \(\lambda_1 + \lambda_2\), up to the constant factors in the formula above)
- \(\rho = \frac{\lambda_1}{\lambda_1 + \lambda_2}\) (how much emphasis is placed on L1)
This parameterization is more numerically stable and easier to tune in practice. The value of \(\rho\) directly interprets the proportion of L1 penalty — when \(\rho = 1\) we get pure Lasso; when \(\rho = 0\) we get pure Ridge.
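In scikit-learn this parameterization corresponds to the `ElasticNet` estimator's `alpha` and `l1_ratio` arguments (a sketch on synthetic data; note that sklearn additionally divides the squared-error term by the number of samples, so its `alpha` is not numerically identical to \(\lambda_1 + \lambda_2\) above). Setting `l1_ratio=1.0` reproduces `Lasso` exactly, while intermediate values blend the two penalties.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(11)
n, p = 100, 12
X = rng.normal(size=(n, p))
true_w = np.concatenate([[3.0, -2.0, 1.5], np.zeros(p - 3)])
y = X @ true_w + rng.normal(scale=1.0, size=n)

# l1_ratio plays the role of rho: 1.0 -> pure Lasso, 0.0 -> pure Ridge-style penalty
enet_mixed = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
enet_lasso = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("Elastic Net (l1_ratio=0.5):", np.round(enet_mixed.coef_, 2))
print("Elastic Net (l1_ratio=1.0):", np.round(enet_lasso.coef_, 2))
print("Lasso (same alpha):        ", np.round(lasso.coef_, 2))  # matches the l1_ratio=1.0 fit
```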
Summary Table of Components
Component | Mathematical Form | Interpretation |
---|---|---|
Data fidelity | $$ \|\mathbf{y} - \mathbf{Xw}\|_2^2 $$ | Fit the model to training data |
L1 penalty | $$ \lambda_1 \|\mathbf{w}\|_1 $$ | Enforces sparsity (feature selection) |
L2 penalty | $$ \lambda_2 \|\mathbf{w}\|_2^2 $$ | Shrinks weights (handles multicollinearity) |
Geometric Interpretation of Elastic Net: A “Rounded Diamond” of Stability and Sparsity
We’ve seen how the Elastic Net fuses the L1 and L2 penalties mathematically. But equations alone don’t always reveal the full story. To truly understand why Elastic Net behaves the way it does — selecting groups of features, promoting sparsity in some directions while stabilizing in others — we need to turn to geometry.
This section explores the shape of the Elastic Net constraint region, how it differs from Ridge and Lasso, and why that shape explains its unique properties.
The Geometry of Regularization: A 2D Lens
Let’s step back and remember how regularization works from a constrained optimization perspective. Instead of minimizing a penalized loss, we can equivalently minimize the data-fitting error subject to a constraint on the weight vector:
\[\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{Xw}\|_2^2 \quad \text{subject to} \quad R(\mathbf{w}) \leq t\]Here, \(R(\mathbf{w})\) is the regularization constraint — a shape that confines where \(\mathbf{w}\) is allowed to live. Let’s explore these shapes in 2D (i.e., \(\mathbf{w} \in \mathbb{R}^2\)) for intuitive visual insight.
Ridge Regression: L2 Ball (Circle)
Ridge regression applies the \(\ell_2\) penalty:
\[R_{\text{Ridge}}(\mathbf{w}) = \|\mathbf{w}\|_2^2 = w_1^2 + w_2^2\]The constraint \(\|\mathbf{w}\|_2^2 \leq t\) defines a circular region centered at the origin.
- This region is smooth and round.
- There are no corners — the boundary doesn’t favor any axis.
- As a result, the Ridge solution almost never lands on an axis; all coefficients are typically non-zero, albeit small.
Interpretation: Ridge uniformly shrinks all coefficients but doesn’t eliminate them. It’s geometrically incapable of producing sparse solutions because the solution tends to avoid the coordinate axes.
Lasso: L1 Ball (Diamond)
Lasso applies the \(\ell_1\) penalty:
\[R_{\text{Lasso}}(\mathbf{w}) = \|\mathbf{w}\|_1 = |w_1| + |w_2|\]The constraint \(\|\mathbf{w}\|_1 \leq t\) defines a diamond-shaped region (in 2D, a square rotated 45°).
- The corners of this diamond lie on the axes.
- When the optimal solution lies on a corner, one coordinate is exactly zero.
- This explains why Lasso often produces sparse solutions — the optimizer tends to “hit” the corners of the constraint region.
Interpretation: Lasso promotes sparsity because its constraint has sharp corners — ideal for zeroing out features.
Elastic Net: Rounded Diamond (L1 ∩ L2)
Now comes the Elastic Net. Since its penalty combines both \(\ell_1\) and \(\ell_2\) norms, its constraint region is effectively the intersection (or blended form) of the L1 and L2 constraint sets:
\[R_{\text{EN}}(\mathbf{w}) = \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2\]The corresponding constraint region is:
- Diamond-like from the L1 side — encouraging zeros
- Rounded from the L2 side — smoothing out sharp corners
- Appears visually as a “rounded diamond” — like a squircle in 2D
In 2D, this region has softer corners than Lasso, but is still more axis-aligned than Ridge. The result? Elastic Net interpolates between sparsity and shrinkage.
Visual Comparison of Feasible Sets
Let’s summarize the geometric character of the three constraint sets:
Method | Constraint Shape (2D) | Sparsity | Feature Grouping |
---|---|---|---|
Ridge | Circle (L2 ball) | No | Yes (equal shrinkage) |
Lasso | Diamond (L1 ball) | Yes | No (random selection under correlation) |
Elastic Net | Rounded Diamond (L1 + L2) | Partial | Yes (grouping effect) |
Click to view visualization code
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Circle, Polygon
# Helper: quadratic loss contours
def plot_loss_contours(ax, center=(2, 1), levels=6):
    x = np.linspace(-4, 4, 400)
    y = np.linspace(-4, 4, 400)
    X, Y = np.meshgrid(x, y)
    Z = (X - center[0])**2 + 2*(Y - center[1])**2 + 0.5*(X - center[0])*(Y - center[1])
    ax.contour(X, Y, Z, levels=levels, colors='gray', linestyles='dashed')
# Helper: rounded diamond shape for Elastic Net
def rounded_diamond_patch(radius=2.0, roundness=0.3):
    # An exponent between 1 and 2 traces an L_q ball with 1 < q < 2: a diamond with rounded corners
    theta = np.linspace(0, 2*np.pi, 100)
    x = radius * np.sign(np.cos(theta)) * np.abs(np.cos(theta))**(1 + roundness)
    y = radius * np.sign(np.sin(theta)) * np.abs(np.sin(theta))**(1 + roundness)
    vertices = np.column_stack([x, y])
    return Polygon(vertices, closed=True, fill=False, edgecolor='green', linewidth=2, label='Elastic Net')
# Create 1x3 subplot
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharex=True, sharey=True)
# --- Ridge Plot ---
ax = axes[0]
plot_loss_contours(ax)
ridge_circle = Circle((0, 0), radius=2.0, fill=False, color='blue', linewidth=2, label='Ridge (L2)')
ax.add_patch(ridge_circle)
ax.set_title("Ridge: L2 Ball")
ax.axhline(0, color='black', linewidth=0.5)
ax.axvline(0, color='black', linewidth=0.5)
ax.set_xlabel("$w_1$")
ax.set_ylabel("$w_2$")
ax.legend()
# --- Lasso Plot ---
ax = axes[1]
plot_loss_contours(ax)
diamond = np.array([[2, 0], [0, 2], [-2, 0], [0, -2]])
lasso_diamond = Polygon(diamond, closed=True, fill=False, edgecolor='red', linewidth=2, label='Lasso (L1)')
ax.add_patch(lasso_diamond)
ax.set_title("Lasso: L1 Ball")
ax.axhline(0, color='black', linewidth=0.5)
ax.axvline(0, color='black', linewidth=0.5)
ax.set_xlabel("$w_1$")
ax.legend()
# --- Elastic Net Plot ---
ax = axes[2]
plot_loss_contours(ax)
enet_patch = rounded_diamond_patch(radius=2.0, roundness=0.3)
ax.add_patch(enet_patch)
ax.set_title("Elastic Net: Rounded Diamond")
ax.axhline(0, color='black', linewidth=0.5)
ax.axvline(0, color='black', linewidth=0.5)
ax.set_xlabel("$w_1$")
ax.legend()
# Format layout
for ax in axes:
    ax.set_xlim(-4, 4)
    ax.set_ylim(-4, 4)
plt.tight_layout()
plt.show()

Why the Rounded Diamond Helps
Here’s the key insight:
- When predictors are highly correlated, the L1 penalty may arbitrarily select only one of them, because it prefers sparse corners.
- The L2 penalty, on the other hand, spreads the weight among correlated variables.
- The Elastic Net’s rounded constraint region softens the Lasso’s harsh corners, encouraging shared weights for groups of correlated variables.
This effect is called the grouping effect.
Grouping Effect in Action:
If features \(x_1\) and \(x_2\) are highly correlated, Lasso might yield:
\[\hat{w}_1 = 3.1, \quad \hat{w}_2 = 0.0\]But Elastic Net would often yield something like:
\[\hat{w}_1 = 2.5, \quad \hat{w}_2 = 2.4\]This preserves both variables in the model and makes the outcome more stable and interpretable, especially in scientific or biomedical contexts.
Takeaway: The Constraint Shape Shapes the Model
The geometry of regularization fundamentally determines how the model behaves:
- Corners lead to zeros (Lasso)
- Smoothness leads to shrinkage (Ridge)
- A balanced region promotes both (Elastic Net)
By shaping the feasible region into a rounded diamond, Elastic Net achieves what neither Ridge nor Lasso can do alone: sparse, stable, and grouped feature selection in the presence of correlation and high dimensionality.
When and Why Elastic Net Works Well: A Sweet Spot in High Dimensions
So far, we’ve dissected Elastic Net from multiple angles — its formulation, penalty mechanics, and geometry. But when does all this elegance actually matter? In this section, we explore the specific conditions under which Elastic Net outperforms both Ridge and Lasso — and why it often provides a more robust, interpretable, and generalizable solution in modern data problems.
Elastic Net is not just a compromise — it’s a principled hybrid that strategically blends the strengths of L1 and L2 regularization. Let’s examine five critical regimes where Elastic Net offers a tangible advantage.
1. High-Dimensional Settings: \(p \gg n\)
This is one of the most common situations in modern data analysis — especially in areas like:
- Genomics: thousands of genes, few patient samples
- Finance: many financial instruments, short time horizons
- Text mining: massive vocabulary, limited documents
- Image analysis: high-res pixel arrays, limited labels
When the number of features \(p\) exceeds the number of observations \(n\), ordinary least squares (OLS) becomes ill-posed:
- \(\mathbf{X}^\top \mathbf{X}\) is singular or near-singular
- There are infinitely many solutions to the linear system
In this case:
- Ridge Regression provides a stable estimate, but retains all features — leading to dense, hard-to-interpret models.
- Lasso Regression introduces sparsity but may be unstable and perform poorly when predictors are correlated.
Elastic Net shines in these high-dimensional regimes by:
- Inducing sparsity via the L1 term
- Ensuring stability and numerical feasibility via the L2 term
- Providing well-posed solutions even when \(p \gg n\)
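As a quick illustration of this well-posedness, here is a minimal sketch with a hypothetical synthetic dataset and arbitrarily chosen alpha and l1_ratio. It shows that an Elastic Net fit goes through even when features far outnumber samples, and that the returned coefficient vector is sparse:
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, p = 30, 200                               # far more features than samples
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:5] = 2.0                             # only 5 features carry signal
y = X @ w_true + 0.1 * rng.normal(size=n)

model = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=50_000).fit(X, y)
print("non-zero coefficients:", np.count_nonzero(model.coef_))   # typically a small subset of the 200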
2. Strong Feature Correlation: Grouped Predictors
One of the most critical failure points for Lasso is its behavior with correlated features.
Suppose we have a cluster of features — say, \(x_1, x_2, x_3\) — that are all highly correlated because they measure related phenomena (e.g., overlapping survey questions, co-expressed genes, or adjacent image pixels).
In this case:
- Lasso often picks one variable from the group and sets others to zero. This choice is often arbitrary and data-dependent, leading to:
  - Model instability
  - Poor generalization
  - Fragile feature interpretation
- Ridge spreads the coefficients among all correlated variables — preserving signal but eliminating sparsity.
Elastic Net exhibits a remarkable behavior called the grouping effect:
- It tends to select correlated features together, assigning similar coefficients to them.
- This happens because the \(\ell_2\) penalty encourages smooth shrinkage, while the \(\ell_1\) penalty prunes away weak or irrelevant variables.
This simultaneous grouping and selection makes Elastic Net especially useful when you know that groups of related predictors matter, but you want a sparse, interpretable model.
3. Lasso Instability and Data Perturbations
Mathematically, Lasso’s objective is non-differentiable at points where coefficients become zero. This results in a sharp optimization landscape — small changes in the data can lead to entirely different sparsity patterns.
This instability is most pronounced in:
- Noisy datasets
- Datasets with correlated features
- Datasets with near-zero coefficients
This is especially problematic when:
- You’re using the model for inference or scientific interpretation
- You want reproducibility across data samples
- You’re doing feature selection for downstream tasks
By softening the corners of the \(\ell_1\) constraint geometry, Elastic Net reduces this instability. The L2 component provides numerical continuity and shrinkage, leading to:
- Smoother coefficient paths
- Better generalization on unseen data
- More reliable feature inclusion/exclusion decisions
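One way to see this instability is to refit on bootstrap resamples and track which features survive. The sketch below uses hypothetical data and penalty values (the strength of the effect varies with the noise level and correlation); it simply prints the distinct support sets each model produces across resamples:
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 80
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)              # strongly correlated pair
X = np.column_stack([x1, x2, rng.normal(size=n)])
y = x1 + x2 + 0.5 * rng.normal(size=n)

def bootstrap_support(model, n_rounds=20):
    """Distinct sets of selected (non-zero) features across bootstrap resamples."""
    supports = set()
    for _ in range(n_rounds):
        idx = rng.integers(0, n, size=n)
        coef = model.fit(X[idx], y[idx]).coef_
        supports.add(tuple(np.flatnonzero(np.abs(coef) > 1e-8)))
    return supports

print("Lasso supports:      ", bootstrap_support(Lasso(alpha=0.1)))
print("Elastic Net supports:", bootstrap_support(ElasticNet(alpha=0.1, l1_ratio=0.5)))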
4. Ridge Over-Inclusion: When You Want Simpler Models
Ridge Regression never drives coefficients to zero. It shrinks, but does not eliminate. This has several consequences:
- Dense models: every feature contributes something, even if minuscule
- Poor interpretability: hard to explain or deploy
- Unnecessary complexity: which can backfire when the data has limited signal
Elastic Net partially inherits Ridge’s stability but adds sparsity through the L1 term. This enables:
- Compact models
- Interpretability in terms of feature selection
- Better performance when only a few features truly matter
In domains where model transparency matters — healthcare, finance, policy — Elastic Net provides the right balance.
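The contrast is easy to see on a toy problem where only a couple of features matter. A minimal sketch (hypothetical data and penalty settings) counts non-zero coefficients for Ridge versus Elastic Net:
import numpy as np
from sklearn.linear_model import Ridge, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))
y = X[:, 0] + 2 * X[:, 1] + 0.1 * rng.normal(size=100)   # only 2 informative features

ridge = Ridge(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.7, max_iter=10_000).fit(X, y)

print("Ridge non-zeros:      ", np.count_nonzero(ridge.coef_))   # all 30: shrunk, never exactly zero
print("Elastic Net non-zeros:", np.count_nonzero(enet.coef_))    # typically far fewer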
5. Hybrid Control of Bias–Variance Tradeoff
Elastic Net offers a dual-knob mechanism to finely adjust the bias–variance tradeoff:
- The L1 penalty increases bias by zeroing out some coefficients, but significantly reduces variance.
- The L2 penalty increases bias by shrinking all coefficients, but improves robustness, especially under collinearity.
By tuning both penalties — either directly (\(\lambda_1\) and \(\lambda_2\)) or via \(\alpha\) and \(\rho\) (exposed as alpha and l1_ratio in scikit-learn) — you can:
- Avoid overfitting in noisy, high-variance settings
- Retain interpretability through sparsity
- Stabilize model behavior even under strong correlations
- Choose between discovery (sparse models) and robust prediction (dense models)
This makes Elastic Net a versatile tool for applied machine learning — one that adapts to both exploratory modeling and production deployment.
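As a notational bridge, the sketch below shows one way to translate a hypothetical \((\lambda_1, \lambda_2)\) pair into scikit-learn's alpha and l1_ratio knobs. Note that scikit-learn internally scales the squared loss by \(1/(2n)\) and the L2 term by \(1/2\), so the correspondence is a convention rather than an exact identity:
from sklearn.linear_model import ElasticNet

lambda1, lambda2 = 0.3, 0.7                    # hypothetical L1 and L2 penalty weights
alpha = lambda1 + lambda2                      # overall regularization strength
l1_ratio = lambda1 / (lambda1 + lambda2)       # share of the penalty devoted to L1

model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10_000)
# model.fit(X, y) would then apply the combined penalty with this balance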
Summary: When Should You Use Elastic Net?
Use Elastic Net when:
- You’re dealing with high-dimensional data where \(p \gg n\)
- Your predictors exhibit strong correlations
- You want group-wise feature selection
- You care about both model simplicity and stability
- You’re unsure whether pure L1 or L2 regularization fits better
- You want to let the data decide the best tradeoff using cross-validation (see the sketch below)
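A minimal cross-validation sketch (hypothetical synthetic data; the l1_ratio grid is an arbitrary choice) that lets ElasticNetCV pick both the overall strength and the L1/L2 mix:
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = X[:, :5] @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

# Cross-validate over both the strength (alpha grid chosen automatically) and the mix (l1_ratio)
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0], cv=5, max_iter=10_000)
model.fit(X, y)
print("chosen alpha:   ", model.alpha_)
print("chosen l1_ratio:", model.l1_ratio_)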
Summary of Ideal Conditions
Elastic Net is ideal when you face:
- Correlated predictors
- High dimensionality
- Unknown sparsity pattern
- Need for both stability and interpretability
- Models that must generalize well under noise
Comparative Table: Ridge, Lasso, and Elastic Net
Method | Sparsity | Stability | Feature Correlation | Interpretability |
---|---|---|---|---|
OLS | No | Unstable | Poor | Low |
Ridge | No | High | Good | Low |
Lasso | Yes | Low | Poor | High |
Elastic Net | Partial | High | Excellent | Medium |
In the next section, we’ll dive deeper into the optimization of Elastic Net, and show how a simple yet brilliant trick — data augmentation — allows us to transform it into a Lasso problem and solve it efficiently using coordinate descent. This connection not only boosts computational efficiency, but also clarifies the underlying geometry of the solution.
Let’s move from when to use Elastic Net — to how it’s optimized efficiently in practice.
Optimization Techniques for Elastic Net: From Reformulation to Coordinate Descent
We’ve examined the Elastic Net from the perspectives of formulation, geometry, and applicability. But how is it actually optimized in practice? Unlike Ridge, which has a closed-form solution, or Lasso, which benefits from efficient coordinate-wise updates, Elastic Net blends both penalties — making its optimization more subtle.
This section walks through how Elastic Net can be optimized efficiently by recasting it as a Lasso problem on an augmented dataset, allowing us to reuse coordinate descent methods. We’ll also briefly touch on the Elastic Net penalty path and implementation-level considerations.
From Hybrid Penalty to Lasso Reformulation: A Clever Trick
Recall the Elastic Net objective:
\[\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \left\{ \|\mathbf{y} - \mathbf{Xw}\|_2^2 + \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2 \right\}\]The mixed penalty changes the coordinate-wise update relative to pure Lasso. However, a powerful trick allows us to transform this into an equivalent Lasso problem by augmenting the dataset, so existing Lasso machinery can be reused unchanged.
Augmenting the Data: Elastic Net as Lasso on Transformed Inputs
Let’s define an augmented dataset:
\[\tilde{\mathbf{X}} = \begin{bmatrix} \mathbf{X} \\ \sqrt{\lambda_2} \cdot \mathbf{I} \end{bmatrix} \in \mathbb{R}^{(n + p) \times p}, \quad \tilde{\mathbf{y}} = \begin{bmatrix} \mathbf{y} \\ \mathbf{0} \end{bmatrix} \in \mathbb{R}^{(n + p)}\]Then we solve the following Lasso problem on the augmented inputs:
\[\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \left\{ \|\tilde{\mathbf{y}} - \tilde{\mathbf{X}} \mathbf{w} \|_2^2 + \lambda_1 \|\mathbf{w}\|_1 \right\}\]This reformulated objective absorbs the L2 penalty into the data matrix as a form of Tikhonov regularization.
Why this works:
- The bottom \(p\) rows of \(\tilde{\mathbf{X}}\) scale each weight by \(\sqrt{\lambda_2}\), adding a quadratic penalty term \(\lambda_2 \|\mathbf{w}\|_2^2\) to the loss.
- The \(\ell_1\) term is preserved as-is.
- The data fitting term remains in quadratic form, just now extended to a larger synthetic dataset.
This trick enables us to reuse fast Lasso solvers like coordinate descent.
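To see the equivalence explicitly, expand the augmented residual block by block:
\[\|\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w}\|_2^2 = \|\mathbf{y} - \mathbf{Xw}\|_2^2 + \|\mathbf{0} - \sqrt{\lambda_2}\,\mathbf{w}\|_2^2 = \|\mathbf{y} - \mathbf{Xw}\|_2^2 + \lambda_2 \|\mathbf{w}\|_2^2,\]so minimizing \(\|\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w}\|_2^2 + \lambda_1 \|\mathbf{w}\|_1\) is exactly the Elastic Net objective.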
Coordinate Descent for Elastic Net
Once we’ve reduced the problem to a Lasso-like form on \(\tilde{\mathbf{X}}, \tilde{\mathbf{y}}\), we can efficiently solve it using coordinate descent, which updates one coordinate \(w_j\) at a time, holding all others fixed.
Each coordinate update solves a 1D soft-thresholding problem:
\[w_j \leftarrow \frac{S\left( z_j, \lambda_1 \right)}{a_j}\]Where (writing the squared-error loss as \(\tfrac{1}{2}\|\cdot\|_2^2\); with the unscaled loss the threshold becomes \(\lambda_1/2\)):
- \(z_j = \sum_{i=1}^{n+p} \tilde{x}_{ij} \left( \tilde{y}_i - \sum_{k \neq j} \tilde{x}_{ik} w_k \right)\) is the correlation of feature \(j\) with the partial residual.
- \(a_j = \sum_{i=1}^{n+p} \tilde{x}_{ij}^2\) is the univariate curvature (the squared L2 norm of feature \(j\)).
- \(S(z, \lambda) = \operatorname{sign}(z) \cdot \max\left(|z| - \lambda, 0\right)\) is the soft-thresholding operator.
This procedure is efficient, scalable, and easily parallelizable — especially with warm starts and active set strategies (i.e., skipping coefficients that are already 0).
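For intuition, here is a minimal, unoptimized sketch of cyclic coordinate descent with soft-thresholding, written for the \(\tfrac{1}{2}\|\cdot\|_2^2\) loss convention used in the update above. It is an illustration, not the tuned solvers found in scikit-learn or glmnet; applying it to the augmented \(\tilde{\mathbf{X}}, \tilde{\mathbf{y}}\) solves the Elastic Net under that convention:
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam1, n_iter=200):
    """Minimize 0.5 * ||y - Xw||^2 + lam1 * ||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    a = (X ** 2).sum(axis=0)                          # a_j = sum_i x_ij^2
    for _ in range(n_iter):
        for j in range(p):
            residual_j = y - X @ w + X[:, j] * w[j]   # partial residual, excluding feature j
            z_j = X[:, j] @ residual_j                # correlation of feature j with that residual
            w[j] = soft_threshold(z_j, lam1) / a[j]
    return w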
Penalty Path: Regularization Trajectories
Just like Ridge and Lasso, Elastic Net solutions vary smoothly as \(\lambda_1\) and \(\lambda_2\) change.
In practice, we often compute the regularization path:
- Fix the mixing parameter \(\rho = \frac{\lambda_1}{\lambda_1 + \lambda_2}\)
- Solve for a decreasing sequence of \(\alpha = \lambda_1 + \lambda_2\) values
- Trace the values of \(\hat{\mathbf{w}}\) as \(\alpha\) varies from large (all coefficients zero) to small (closer to OLS)
This path gives insight into which features enter the model when, and how stable the solution is across regularization strengths.
Libraries like scikit-learn expose this via ElasticNetCV, or you can trace the solution path directly with enet_path() (or lasso_path() for the augmented-Lasso formulation).
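A minimal path sketch using scikit-learn's enet_path helper (hypothetical synthetic data; only the shapes of the returned arrays are shown):
import numpy as np
from sklearn.linear_model import enet_path

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=50)

# Whole regularization path for a fixed L1/L2 mix in a single call
alphas, coefs, _ = enet_path(X, y, l1_ratio=0.5)
print(alphas.shape, coefs.shape)   # (n_alphas,) and (n_features, n_alphas)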
Implementation Notes and Convergence
In practice, Elastic Net implementations like those in glmnet, sklearn, and liblinear include several optimization refinements:
- Standardization of features is crucial (especially for L1 penalties)
- Warm starts significantly speed up path computation
- Duality gaps are monitored to check convergence
- The Elastic Net objective is convex but not smooth because of the L1 term, so careful convergence criteria are used
Despite the hybrid penalty, Elastic Net is almost as efficient to optimize as Lasso, thanks to this reformulation trick.
Summary: Optimization in a Nutshell
Step | Description |
---|---|
1. Augment data | Form $$\tilde{\mathbf{X}}$$ and $$\tilde{\mathbf{y}}$$ to include $$\ell_2$$ penalty as pseudo-observations |
2. Solve as Lasso | Use coordinate descent on the new objective with $$\ell_1$$ penalty |
3. Iterate | Update each coordinate using soft-thresholding; repeat until convergence |
4. Trace path | Compute solutions for a sequence of $$\alpha$$ values to study model evolution |
In the next section, we’ll ground all this theory in a worked numerical example, comparing how Ridge, Lasso, and Elastic Net behave — feature by feature — on a toy dataset with correlated variables. We’ll examine sparsity, grouping, and bias–variance behavior with hands-on calculations.
Elastic Net vs Ridge vs Lasso: Numerical Example
Step 1: The Toy Dataset
Let:
\[\mathbf{X} = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 1 & 2 \\ \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ \end{bmatrix}\]Here:
- \(x_1\) and \(x_2\) are identical columns, i.e., perfectly collinear
- \(x_3\) varies independently
- This setup lets us observe how regularization handles correlation and sparsity
Step 2: Ridge and Lasso Solutions
Ridge (L2) Regression
Ridge solves:
\[\hat{\mathbf{w}}_{\text{ridge}} = \arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{Xw}\|_2^2 + \lambda \|\mathbf{w}\|_2^2\]Let \(\lambda = 1.0\). Using solver:
import numpy as np
from sklearn.linear_model import Ridge
X = np.array([[1,1,0],[1,1,1],[1,1,2]])
y = np.array([1,2,3])
ridge = Ridge(alpha=1.0, fit_intercept=False)
ridge.fit(X, y)
ridge.coef_
Returns:
\[\hat{\mathbf{w}}_{\text{ridge}} = \begin{bmatrix} 0.3636 \\ 0.3636 \\ 0.7273 \\ \end{bmatrix}\]Interpretation:
- Ridge distributes weight across all features
- Handles multicollinearity without dropping features
- Produces dense but stable coefficients
Lasso (L1) Regression
Lasso solves:
\[\hat{\mathbf{w}}_{\text{lasso}} = \arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{Xw}\|_2^2 + \lambda \|\mathbf{w}\|_1\]Let \(\lambda = 1.0\). Using solver:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=1.0, fit_intercept=False)
lasso.fit(X, y)
lasso.coef_
Returns:
\[\hat{\mathbf{w}}_{\text{lasso}} = \begin{bmatrix} 0.0 \\ 0.0 \\ 1.5 \\ \end{bmatrix}\]Interpretation:
- Lasso zeroes out both \(x_1\) and \(x_2\) due to perfect correlation
- Assigns all predictive responsibility to \(x_3\)
- Demonstrates strong sparsity, but at the cost of feature instability
Step 3: Elastic Net — Full Algebraic Solution via Augmentation
Elastic Net solves:
\[\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{Xw}\|_2^2 + \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2\]Let’s choose:
- \[\lambda_1 = 0.5\]
- \[\lambda_2 = 0.5\]
Reformulate as Augmented Lasso
We transform the Elastic Net problem into a Lasso problem on augmented data:
Let:
\[\tilde{\mathbf{X}} = \begin{bmatrix} \mathbf{X} \\ \sqrt{\lambda_2} \cdot \mathbf{I} \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 1 & 2 \\ \sqrt{0.5} & 0 & 0 \\ 0 & \sqrt{0.5} & 0 \\ 0 & 0 & \sqrt{0.5} \\ \end{bmatrix}, \quad \tilde{\mathbf{y}} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 0 \\ 0 \\ 0 \\ \end{bmatrix}\]That is:
\[\tilde{\mathbf{X}} \in \mathbb{R}^{6 \times 3}, \quad \tilde{\mathbf{y}} \in \mathbb{R}^{6}\]Now we solve:
\[\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w}\|_2^2 + \lambda_1 \|\mathbf{w}\|_1\]This is now a standard Lasso problem.
Solve Using Coordinate Descent (via Python)
from sklearn.linear_model import Lasso
import numpy as np
# Augmented matrix
sqrt_l2 = np.sqrt(0.5)
X_aug = np.vstack([
[[1,1,0],[1,1,1],[1,1,2]],
[[sqrt_l2,0,0],[0,sqrt_l2,0],[0,0,sqrt_l2]]
])
y_aug = np.hstack([[1,2,3],[0,0,0]])
enet_as_lasso = Lasso(alpha=0.5, fit_intercept=False, max_iter=10000)
enet_as_lasso.fit(X_aug, y_aug)
np.round(enet_as_lasso.coef_, 4)
Returns:
\[\hat{\mathbf{w}}_{\text{EN}} = \begin{bmatrix} 0.1481 \\ 0.1481 \\ 1.1111 \\ \end{bmatrix}\]Final Coefficient Comparison
Model | $$w_1$$ | $$w_2$$ | $$w_3$$ | Key Behavior |
---|---|---|---|---|
Ridge | 0.3636 | 0.3636 | 0.7273 | Shrinks, retains all |
Lasso | 0.0 | 0.0 | 1.5 | Sparse, drops correlated |
Elastic Net | 0.1481 | 0.1481 | 1.1111 | Partial sparsity, group selection |
Takeaways
- Ridge applies uniform shrinkage but doesn’t zero out features
- Lasso promotes sparsity but cannot handle correlated features gracefully
- Elastic Net achieves a balance — shrinking and grouping the correlated variables while retaining the ability to zero out features that carry little signal
Through feature augmentation, Elastic Net transforms into a Lasso problem — allowing us to apply fast coordinate descent algorithms and gain both interpretability and stability.
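For a sanity check, you can also fit scikit-learn's native ElasticNet on the original (un-augmented) toy data. Keep in mind that sklearn scales the squared loss by \(1/(2n)\) and the L2 term by \(1/2\), so its alpha and l1_ratio do not map one-to-one onto the \(\lambda_1, \lambda_2\) used above; treat this as a qualitative comparison rather than an exact reproduction:
import numpy as np
from sklearn.linear_model import ElasticNet

X = np.array([[1, 1, 0], [1, 1, 1], [1, 1, 2]])
y = np.array([1, 2, 3])

enet = ElasticNet(alpha=0.5, l1_ratio=0.5, fit_intercept=False, max_iter=10_000)
enet.fit(X, y)
print(np.round(enet.coef_, 4))   # qualitatively similar: weight shared between x1 and x2, with x3 retained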
Visualizing Elastic Net Behavior: Paths, Shrinkage, and Geometry
We’ve explored Elastic Net through math, geometry, and worked-out examples. Now it’s time to see it come alive visually.
This section presents how Elastic Net behaves as we vary its regularization parameters \(\lambda_1\) and \(\lambda_2\). We’ll use a simple toy dataset and walk through three powerful visualizations:
- Regularization paths: How individual coefficients change as the total regularization strength \(\alpha = \lambda_1 + \lambda_2\) increases.
- Coefficient shrinkage heatmaps: How the L1/L2 combination controls the overall coefficient magnitude.
- Constraint geometry: A geometric lens on how Elastic Net interpolates between Ridge and Lasso.
1. Regularization Path: Coefficients vs. \(\alpha\)
Let’s begin by visualizing how the learned weights change as we increase the regularization parameter \(\alpha\). Recall that in the scikit-learn formulation (up to the library’s internal scaling of the loss), Elastic Net is parameterized as:
\[\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \left\{ \|\mathbf{y} - \mathbf{Xw}\|_2^2 + \alpha \left( \rho \|\mathbf{w}\|_1 + (1 - \rho) \|\mathbf{w}\|_2^2 \right) \right\}\]Here:
- \(\alpha = \lambda_1 + \lambda_2\) controls the overall penalty
- \(\rho = \lambda_1 / (\lambda_1 + \lambda_2)\) controls the balance between L1 and L2
We fix \(\rho = 0.5\) and let \(\alpha\) sweep from a small value (close to OLS) to a large value (heavy shrinkage).
Python Code: Regularization Path (click to expand)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler
# Updated toy dataset with variability in all columns
X = np.array([[0, 1, 0],
[1, 1, 1],
[2, 1, 2]])
y = np.array([1, 2, 3])
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Regularization path: sweep over alphas
alphas = np.logspace(-3, 1, 50)
coefs = []
# Fixed Elastic Net mixing (ρ = 0.5)
for alpha in alphas:
    model = ElasticNet(alpha=alpha, l1_ratio=0.5, fit_intercept=False, max_iter=10000)
    model.fit(X_scaled, y)
    coefs.append(model.coef_)
coefs = np.array(coefs)
# Plotting
plt.figure(figsize=(8, 5))
for i in range(coefs.shape[1]):
    plt.plot(alphas, coefs[:, i], label=f"$w_{i+1}$")
plt.xscale('log')
plt.xlabel(r"$\alpha = \lambda_1 + \lambda_2$")
plt.ylabel("Coefficient Value")
plt.title("Elastic Net Regularization Path ($\\rho = 0.5$)")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

Interpretation
- When \(\alpha \to 0\), we recover the OLS solution, with minimal regularization.
- As \(\alpha\) increases, all coefficients shrink, thanks to the combined L1+L2 penalty.
- Sparsity appears early, driven by the L1 term.
- Correlated features (in this toy case, columns 1 and 3) tend to enter and shrink together — a hallmark of group selection.
This plot is essential for understanding how features are penalized and which coefficients persist under different levels of regularization.
2. Heatmap: Coefficient Norm over \(\lambda_1 \times \lambda_2\) Grid
To get a broader view, we now vary \(\lambda_1\) and \(\lambda_2\) independently over a 2D grid. For each pair, we fit an Elastic Net model and record the L1 norm of the coefficients:
\[\|\hat{\mathbf{w}}\|_1 = \sum_i |\hat{w}_i|\]This gives us a heatmap of how much total coefficient weight is retained as we sweep across different levels of L1 and L2 regularization.
Python Code: Heatmap of Coefficient Norms
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler
# Reuse dataset and scaler from previous section
lambda1_vals = np.logspace(-2, 1, 30)
lambda2_vals = np.logspace(-2, 1, 30)
coef_norms = np.zeros((len(lambda1_vals), len(lambda2_vals)))
for i, l1 in enumerate(lambda1_vals):
    for j, l2 in enumerate(lambda2_vals):
        alpha = l1 + l2
        rho = l1 / (l1 + l2)
        model = ElasticNet(alpha=alpha, l1_ratio=rho, fit_intercept=False, max_iter=10000)
        model.fit(X_scaled, y)
        coef_norms[i, j] = np.linalg.norm(model.coef_, ord=1)
plt.figure(figsize=(8, 6))
sns.heatmap(coef_norms, xticklabels=np.round(lambda2_vals, 2),
yticklabels=np.round(lambda1_vals, 2), cmap="viridis")
plt.xlabel(r"$\lambda_2$ (L2 Penalty)")
plt.ylabel(r"$\lambda_1$ (L1 Penalty)")
plt.title("Elastic Net Coefficient Magnitude (L1 Norm)")
plt.tight_layout()
plt.show()

Interpretation
- Top-left corner (low \(\lambda_1\) and \(\lambda_2\)): almost no shrinkage → resembles OLS.
- Bottom-right corner (high \(\lambda_1\) and \(\lambda_2\)): heavy penalization → almost all coefficients shrink toward zero.
- Increasing \(\lambda_1\) promotes sparsity; increasing \(\lambda_2\) promotes smooth shrinkage.
This heatmap offers a global picture of how Elastic Net controls total model complexity.
3. Geometric View: Constraint Set and Solution Paths
Now let’s take a geometric view. Recall Elastic Net’s constrained form:
\[\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{Xw}\|_2^2 \quad \text{subject to} \quad \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2 \leq t\]This defines a rounded diamond constraint region, interpolating between:
- The diamond of Lasso (\(\ell_1\)-ball)
- The circle of Ridge (\(\ell_2\)-ball)
We now overlay all three constraint regions and show how the solution is found at their intersection with the loss contours.
Python Code: Visualizing Elastic Net Constraint Geometry
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon, Circle
def plot_loss_contours(ax, center=(2, 1), levels=6):
    x = np.linspace(-4, 4, 400)
    y = np.linspace(-4, 4, 400)
    X, Y = np.meshgrid(x, y)
    Z = (X - center[0])**2 + 2*(Y - center[1])**2 + 0.5*(X - center[0])*(Y - center[1])
    ax.contour(X, Y, Z, levels=levels, colors='gray', linestyles='dashed')
def rounded_diamond_patch(radius=2.0, roundness=0.3):
    theta = np.linspace(0, 2*np.pi, 200)
    x = radius * np.sign(np.cos(theta)) * np.abs(np.cos(theta))**(1 - roundness)
    y = radius * np.sign(np.sin(theta)) * np.abs(np.sin(theta))**(1 - roundness)
    return Polygon(np.column_stack([x, y]), closed=True, fill=False, edgecolor='green', linewidth=2, label='Elastic Net')
fig, ax = plt.subplots(figsize=(6, 6))
plot_loss_contours(ax, center=(2, 1))
ridge = Circle((0, 0), radius=2.0, fill=False, color='blue', linewidth=1.5, label='Ridge')
lasso = Polygon(np.array([[2, 0], [0, 2], [-2, 0], [0, -2]]), closed=True, fill=False, edgecolor='red', linewidth=1.5, label='Lasso')
enet = rounded_diamond_patch(radius=2.0, roundness=0.3)
ax.add_patch(ridge)
ax.add_patch(lasso)
ax.add_patch(enet)
ax.axhline(0, color='black', linewidth=0.5)
ax.axvline(0, color='black', linewidth=0.5)
ax.set_xlim(-3, 3)
ax.set_ylim(-3, 3)
ax.set_xlabel(r"$w_1$")
ax.set_ylabel(r"$w_2$")
ax.set_title("Geometry of Elastic Net Constraint")
ax.legend()
plt.tight_layout()
plt.show()

Explanation of Geometry
- Loss contours represent levels of constant squared error. The optimal solution is where these contours just touch the constraint region.
- Ridge constraint: circular; intersects at smooth boundaries; encourages non-zero solutions.
- Lasso constraint: diamond; intersects at corners; encourages sparse solutions.
- Elastic Net: interpolated, rounded diamond; balances sparsity with smoothness.
Comparison Table
Method | Constraint Geometry | Promotes Sparsity | Handles Correlation | Stability | Group Selection |
---|---|---|---|---|---|
OLS | Unconstrained | ❌ | ❌ | ❌ | ❌ |
Ridge | Circle ($$\ell_2$$ ball) | ❌ | ✅ | ✅ | ✅ |
Lasso | Diamond ($$\ell_1$$ ball) | ✅ | ❌ | ❌ | ❌ |
Elastic Net | Rounded Diamond | Partial ✅ | ✅ | ✅ | ✅ |
Closing Notes
Over the course of this comprehensive journey, we’ve explored the landscape of regularized linear models — starting from the need to tame overfitting and multicollinearity, to the geometric and algorithmic insights that underpin modern penalized regression techniques.
From Ridge’s smooth shrinkage to Lasso’s sharp sparsity, and finally, to Elastic Net’s elegant hybridization, we’ve built a unified mathematical and conceptual framework to understand when and why each model works, how it behaves under different data regimes, and how to implement it efficiently.
Let’s take a moment to reflect on what we’ve uncovered:
Ridge Regression: Shrinking Towards Stability
- Introduced L2 regularization to address multicollinearity and high variance.
- Derived the closed-form solution and understood how it modifies the normal equation.
- Geometrically interpreted Ridge as a projection onto a circular constraint region, ensuring all features are retained but shrunk.
- Worked through a numerical example comparing Ridge to OLS, showing improved stability.
- Visualized the regularization path and understood the implications of varying \(\lambda\).
Ideal when:
- Interpretability is less important than prediction stability.
- All features are believed to be somewhat relevant.
- Multicollinearity is present.
Lasso Regression: Sparsity by Design
- Introduced L1 regularization for feature selection and sparse modeling.
- Worked through the 1D closed-form solution via subgradient analysis and soft thresholding.
- Generalized to multivariate settings using coordinate descent and LARS.
- Highlighted Lasso’s geometric bias toward sparsity, rooted in the sharp corners of the L1 constraint.
- Compared Lasso and OLS in a worked example, emphasizing zeroed coefficients.
- Examined Lasso’s instability under correlation and bias–variance tradeoff.
Ideal when:
- The true model is sparse.
- Interpretability and variable selection are crucial.
- Features are mostly uncorrelated or redundancy isn’t a concern.
Elastic Net: The Best of Both Worlds
- Motivated Elastic Net as a resolution to the conflict between Ridge and Lasso.
- Formulated the combined L1 + L2 objective, deriving its constraint interpretation as a “rounded diamond”.
- Explored group selection effects, geometric advantages, and parameter flexibility.
- Demonstrated when and why Elastic Net works best — especially under \(p \gg n\) and feature correlation.
- Showed how to recast Elastic Net as a Lasso problem via data augmentation, enabling efficient optimization via coordinate descent.
- Carried out a numerical comparison of Ridge, Lasso, and Elastic Net — highlighting Elastic Net’s ability to balance sparsity and stability.
- Visualized regularization paths, shrinkage behavior, and constraint geometry.
Ideal when:
- Correlated predictors need to be grouped or selected together.
- You want to retain interpretability without sacrificing robustness.
- The signal is not purely sparse, or the degree of sparsity is unknown.
Regularization is a framework grounded in statistical theory, optimization, and geometry. Choosing between Ridge, Lasso, and Elastic Net is a matter of:
- Understanding your data characteristics (dimensionality, correlation, noise).
- Knowing your modeling goals (prediction vs interpretation).
- Balancing bias, variance, and sparsity.
There is no one-size-fits-all solution. But thanks to the tools we’ve explored, you are now equipped to make informed, mathematically grounded decisions — and to wield regularized linear models with precision and confidence.