
Regularization & Generalization in Optimization


Training a model that fits the training data is not the real challenge in machine learning.
The real challenge is building a model that generalizes — one that performs well on data it has never seen.

As models become larger and more flexible, the risk of overfitting grows.
With enough parameters, almost any function can fit almost any dataset, but fitting the noise in the data is rarely the goal.

This brings regularization into focus.

Regularization techniques are not additions to optimization; they are part of its core design.
They shape the optimization landscape, constrain the solutions we accept, and encourage models to find simpler, more stable representations.

In this blog, we will explore how regularization is woven into optimization, from classical methods like L1 and L2 penalties, to techniques like Dropout and Batch Normalization that reshape how models learn at scale.

Each method reflects a slightly different idea of what it means for a solution to be “good” — whether it is sparsity, smoothness, stability, or resilience to perturbations.


Overfitting vs. Underfitting

Before discussing regularization, it is important to be precise about the problem it tries to control.

When a model is trained on a dataset, its goal is to capture underlying patterns that generalize to unseen data.
But depending on how complex the model is and how it learns, two types of failure can occur: overfitting and underfitting.


Underfitting

A model underfits when it is too simple to capture the true structure of the data.

  • It may have too few parameters,
  • It may rely on poor features,
  • It may be optimized inadequately.

Underfitting is often easy to detect:
both training and validation performance are poor.
The model is not even able to explain the examples it was given.

Linear models trained on complex nonlinear data are classic cases of underfitting.
Even if training continues indefinitely, the model cannot bridge the gap between its assumptions and the data’s structure.


Overfitting

Overfitting happens when a model is too flexible relative to the amount of available data.

  • It fits not only the true patterns, but also the random fluctuations and noise in the training set.
  • Training loss becomes very low, but validation loss increases.

An overfit model will perform well on the training data because it memorizes it.
However, it generalizes poorly because the memorized noise does not carry over to unseen data.

Deep neural networks trained without constraints on small datasets are particularly vulnerable to overfitting.
They have enough capacity to model almost anything — including noise.


The Trade-off

There is always a tension between bias and variance:

  • Underfitting corresponds to high bias: rigid assumptions miss important patterns.
  • Overfitting corresponds to high variance: sensitivity to minor variations in the training data.

Good models find a balance, capturing enough complexity to explain the important structures,
while maintaining enough discipline to ignore irrelevant noise.


Visual Intuition

Imagine fitting a curve to a set of points:

  • A straight line (underfitting) misses the shape entirely.
  • A jagged spline through every point (overfitting) captures noise that isn’t part of the true signal.
  • A smooth, slightly flexible curve (good generalization) captures the essence without chasing every fluctuation.

Optimization alone cannot guarantee this balance.
The role of regularization is to tilt the optimization process toward simpler, more generalizable solutions.


The plot below shows a synthetic regression problem where the true underlying function is a smooth sine wave (green dashed curve).
We add random noise to a few sampled points (black markers), and then fit three different models:

  • A straight line (red) that barely bends to fit the data, representing underfitting.
  • A quadratic curve (blue) that captures the main structure without chasing every noise point, representing a good fit.
  • A high-degree polynomial (orange) that twists sharply to pass through almost every training point, representing overfitting.

Underfitting leads to high error on both training and unseen data because the model is too simple.
Overfitting leads to low training error but poor generalization because the model memorizes noise instead of finding patterns.
The good fit strikes the balance: flexible enough to follow the true structure, disciplined enough to ignore irrelevant fluctuations.
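
As a rough sketch of how such a comparison can be generated (the data, degrees, and plotting details below are illustrative assumptions, not the original figure's code), one can fit polynomials of increasing degree to noisy samples of a sine wave:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Noisy samples of a smooth "true" function (half a sine period).
x_train = np.sort(rng.uniform(0, np.pi, 15))
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=x_train.shape)
x_grid = np.linspace(0, np.pi, 200)

# Underfit (degree 1), reasonable fit (degree 2), overfit (degree 10).
for degree, label in [(1, "underfit"), (2, "good fit"), (10, "overfit")]:
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    plt.plot(x_grid, np.polyval(coeffs, x_grid), label=f"degree {degree}: {label}")

plt.plot(x_grid, np.sin(x_grid), "--", label="true function")
plt.scatter(x_train, y_train, color="black", label="noisy samples")
plt.legend()
plt.show()
```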

Regularization techniques, which we will explore next, are key tools for encouraging models toward this middle ground.


L1 and L2 Regularization (Sparsity vs. Smoothness)

Overfitting is not simply a consequence of too much model capacity.
It is also the result of allowing the optimization process to wander freely, fitting noise and minor fluctuations without restraint.

Regularization provides a way to guide the optimizer toward simpler, more stable solutions by explicitly penalizing complexity during training.

Two of the most fundamental regularization techniques are L1 regularization and L2 regularization.
Both introduce penalties into the loss function, but they encourage different types of model behavior.


L2 Regularization: Smoothness and Shrinkage

L2 regularization, also known as ridge regularization, adds a penalty proportional to the squared magnitude of the model parameters:

\[\text{Penalty}_{L2} = \lambda \sum_{j} \theta_j^2\]

where:

  • \(\theta_j\) are the model parameters,
  • \(\lambda > 0\) controls the strength of the penalty.

The updated optimization objective becomes:

\[\text{Loss}_{\text{total}} = \text{Original Loss} + \lambda \sum_{j} \theta_j^2\]

L2 regularization shrinks weights toward zero, but rarely forces them exactly to zero.

This leads to:

  • Smooth parameter landscapes, where many features contribute a little,
  • Smaller model coefficients, reducing sensitivity to noise,
  • Improved generalization by discouraging extremely large weights.

In models like linear regression or neural networks, L2 regularization often stabilizes training and improves out-of-sample performance.
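
As a minimal sketch (not tied to any particular library), the L2 penalty simply adds a \(2\lambda\theta_j\) term to each parameter's gradient, which is why it is often implemented as "weight decay" inside the update rule. The toy gradient-descent example below, on synthetic data, shows this effect:

```python
import numpy as np

def l2_regularized_loss_and_grad(theta, X, y, lam):
    """Mean-squared-error loss plus an L2 penalty, with its gradient."""
    residual = X @ theta - y
    loss = np.mean(residual ** 2) + lam * np.sum(theta ** 2)
    grad = 2 * X.T @ residual / len(y) + 2 * lam * theta  # penalty adds 2*lam*theta
    return loss, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -2.0]) + rng.normal(scale=0.1, size=100)

theta = np.zeros(5)
lr, lam = 0.1, 0.01
for _ in range(500):
    _, grad = l2_regularized_loss_and_grad(theta, X, y, lam)
    theta -= lr * grad  # each step both follows the data and shrinks the weights

print(theta)  # coefficients shrunk toward zero, but generally not exactly zero
```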


L1 Regularization: Sparsity and Feature Selection

L1 regularization, also known as lasso regularization, adds a penalty proportional to the absolute magnitude of the model parameters:

\[\text{Penalty}_{L1} = \lambda \sum_{j} |\theta_j|\]

The updated objective becomes:

\[\text{Loss}_{\text{total}} = \text{Original Loss} + \lambda \sum_{j} |\theta_j|\]

Unlike L2, L1 regularization can drive some weights exactly to zero.

This results in:

  • Sparse models, where many parameters are exactly zero,
  • Automatic feature selection, by ignoring irrelevant features,
  • Simpler, more interpretable models.

In high-dimensional settings, especially when the number of features is much larger than the number of examples, L1 regularization becomes very powerful.
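
One way to see why the L1 penalty produces exact zeros is through the soft-thresholding (proximal) update used by coordinate-descent and proximal-gradient solvers. The short sketch below is an illustration of that operator, not a full Lasso implementation:

```python
import numpy as np

def soft_threshold(theta, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip small values to zero."""
    return np.sign(theta) * np.maximum(np.abs(theta) - threshold, 0.0)

theta = np.array([0.8, -0.05, 0.02, -1.3, 0.15])
print(soft_threshold(theta, threshold=0.1))
# Weights smaller than the threshold become exactly zero; larger ones are shrunk.
```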


Intuitive Difference

  • L2 spreads weight across many features, favoring small but nonzero coefficients (smoothness).
  • L1 aggressively zeros out irrelevant features, favoring a compact subset of active features (sparsity).

Both forms of regularization introduce bias — they deliberately constrain the model —
but they do so in ways that often improve variance control and generalization.

Choosing between L1 and L2 depends on the problem:

  • When simplicity and interpretability are critical, L1 is often preferred.
  • When small, stable contributions from many features are acceptable, L2 is usually more effective.

Regularization directs models toward solutions that are simpler, more stable, and more robust to unseen data.
It helps encode a preference for models that balance fitting the training data with maintaining flexibility for generalization.

L1 and L2 provide two fundamental ways of shaping this balance, each favoring a different style of model simplicity.


Ridge Regression and Lasso

L1 and L2 regularization are often discussed abstractly, but their power becomes most concrete when seen in practical models.
Ridge regression and Lasso are two foundational methods where regularization directly shapes how models learn from data.

Both modify the classical least squares problem, but they do so in different ways, favoring different types of solutions.


Ridge Regression: Stability Through Shrinkage

Ridge regression adds an L2 penalty to the standard linear regression loss.
Instead of minimizing the sum of squared errors alone, ridge regression minimizes:

\[\text{Loss}_{\text{ridge}} = \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \mathbf{w})^2 + \lambda \|\mathbf{w}\|_2^2\]

where:

  • \(\mathbf{x}_i\) are input features,
  • \(y_i\) are targets,
  • \(\mathbf{w}\) are model weights,
  • \(\lambda > 0\) controls the strength of the penalty.

The effect of the \(\lambda \|\mathbf{w}\|_2^2\) term is to shrink the weights toward zero,
discouraging large coefficients that can amplify noise or unstable patterns in the data.

Ridge regression does not set coefficients exactly to zero.
Instead, it distributes weight more evenly across features, improving stability and reducing variance without producing sparse models.

In situations where all features are thought to contribute roughly equally, or where multicollinearity is a concern, ridge regression often outperforms simple least squares.
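
In practice, ridge regression is a one-liner in libraries such as scikit-learn. A minimal sketch on synthetic data (standing in for a real multicollinear problem):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=50)  # two nearly collinear features
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha plays the role of lambda

print("OLS coefficients:  ", np.round(ols.coef_, 2))    # typically unstable on the correlated pair
print("Ridge coefficients:", np.round(ridge.coef_, 2))  # smaller, shared across the correlated pair
```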


Lasso: Sparsity Through Feature Selection

Lasso (Least Absolute Shrinkage and Selection Operator) regression adds an L1 penalty:

\[\text{Loss}_{\text{lasso}} = \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \mathbf{w})^2 + \lambda \|\mathbf{w}\|_1\]

Here, the \(\lambda \|\mathbf{w}\|_1\) penalty encourages many coefficients to become exactly zero.

This leads to:

  • Sparse models with a clear selection of active features,
  • Automatic dimensionality reduction, useful in high-dimensional settings,
  • Easier interpretability, since only a subset of features is retained.

Lasso is particularly powerful when:

  • The number of features is large compared to the number of observations,
  • Only a small subset of features is truly relevant,
  • Interpretability and feature selection are priorities.

However, when features are highly correlated, Lasso may behave unpredictably, selecting one feature over others arbitrarily.
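
A corresponding Lasso sketch on synthetic data makes the sparsity visible: most coefficients come out exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 20))
# Only three of the twenty features actually drive the target.
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.8 * X[:, 7] + rng.normal(scale=0.1, size=80)

lasso = Lasso(alpha=0.1).fit(X, y)  # alpha is the regularization strength (lambda)
print(np.round(lasso.coef_, 2))
print("nonzero features:", np.flatnonzero(lasso.coef_))  # typically just the informative ones
```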


Comparing Ridge and Lasso

| Aspect | Ridge Regression | Lasso |
| --- | --- | --- |
| Penalty | \(\ell_2\) norm (\(\sum_j w_j^2\)) | \(\ell_1\) norm (\(\sum_j \lvert w_j \rvert\)) |
| Coefficient behavior | Shrinks coefficients smoothly | Forces many coefficients exactly to zero |
| Best for | Stabilization with many small contributions | Sparse models, feature selection |
| Weakness | Does not perform variable selection | Can be unstable with correlated features |

Ridge regression and Lasso are two lenses on the same core goal: improving generalization by constraining model complexity.
One softens the model through smooth shrinkage; the other sharpens it through aggressive sparsity.
Both change how optimization unfolds, pushing the solution toward forms that balance fit with restraint.

Choosing between them is not just a technical decision — it is a choice about what kind of simplicity you believe best serves the problem you are trying to solve.


Elastic Net

When we design regularized models, we often face a subtle but critical trade-off:
we want models that are simple enough to generalize,
but not so rigid that they ignore important relationships in the data.

Ridge regression and Lasso represent two classic strategies:

  • Ridge shrinks all weights smoothly but keeps them nonzero.
  • Lasso aggressively zeroes out many weights, selecting a sparse subset of features.

However, each method has limitations:

  • Lasso struggles when features are highly correlated.
  • Ridge cannot perform feature selection; it only shrinks.

Elastic Net was proposed to bridge this gap —
to create a regularization path that stabilizes coefficients while still allowing sparsity.


The Elastic Net Objective

Elastic Net modifies the standard least squares loss by adding both L1 and L2 penalties:

\[\text{Loss}_{\text{elastic net}} = \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \mathbf{w})^2 + \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2\]

where:

  • Regularization term for sparsity:

    \[\|\mathbf{w}\|_1 = \sum_j |w_j|\]
  • Regularization term for smoothness:

    \[\|\mathbf{w}\|_2^2 = \sum_j w_j^2\]

Alternatively, it is often parameterized with a mixing parameter \(\alpha\):

\[\text{Loss}_{\text{elastic net}} = \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \mathbf{w})^2 + \lambda \left( \alpha \|\mathbf{w}\|_1 + (1-\alpha) \|\mathbf{w}\|_2^2 \right)\]

where:

  • \(\alpha \in [0,1]\) balances between Lasso (\(\alpha=1\)) and Ridge (\(\alpha=0\)),
  • \(\lambda\) controls the overall regularization strength.

This formulation makes Elastic Net a continuous bridge between ridge and lasso behaviors.
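
A minimal sketch with scikit-learn's ElasticNet, whose `alpha` and `l1_ratio` parameters correspond (up to constant factors) to the overall strength \(\lambda\) and the mixing parameter \(\alpha\) above; the synthetic correlated features are an illustrative assumption:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=60)  # two almost identical features
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=60)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # l1_ratio mixes the L1 and L2 parts

print("Lasso:      ", np.round(lasso.coef_[:2], 2))  # often keeps one of the pair, drops the other
print("Elastic Net:", np.round(enet.coef_[:2], 2))   # tends to share weight across both
```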


Why Is Elastic Net Needed?

To see why Elastic Net is important,
we need to understand a practical problem with pure Lasso.


Problem: Lasso with Correlated Features

Suppose two features, \(x_1\) and \(x_2\), are almost perfectly correlated.
Both are predictive of the target, but because of their correlation:

  • Lasso will arbitrarily select one and ignore the other.
  • Tiny perturbations in the data can flip which feature is selected.

This behavior is unstable:

  • It makes models sensitive to noise.
  • It reduces reliability across resampling or retraining.
  • It risks missing useful redundancy in the feature space.

Ridge regression, by contrast, would:

  • Distribute the weight between \(x_1\) and \(x_2\),
  • Keep both features active, but shrink their coefficients together.

Neither extreme is always ideal.


How Elastic Net Solves It

Elastic Net combines:

  • The grouping effect of Ridge: it keeps correlated features together,
  • The sparsity drive of Lasso: it zeros out irrelevant features.

Thus:

  • When groups of correlated features are predictive, Elastic Net selects them together rather than arbitrarily dropping them.
  • It retains interpretability without sacrificing stability.

This dual behavior makes Elastic Net particularly powerful in high-dimensional settings
(e.g., genomics, text data, financial modeling)
where correlations are unavoidable and pure sparsity may be too brittle.


Intuitive Geometry of Regularization Constraints

Different regularization methods impose different shapes on the set of allowed parameters,
and these shapes fundamentally influence how optimization behaves.

  • Ridge regression constrains weights inside a sphere (L2 ball).
    It penalizes the overall size of coefficients smoothly, encouraging them to shrink toward zero without forcing exact sparsity.

  • Lasso regression constrains weights inside a diamond (L1 ball).
    Its sharp corners promote sparsity by encouraging many coefficients to become exactly zero, effectively performing feature selection.

  • Elastic Net constrains weights inside a softened diamond.
    It blends the sparsity-inducing behavior of Lasso with the smooth shrinkage of Ridge, allowing correlated features to survive together while still encouraging simpler models.

The figure below visualizes these constraint regions clearly:

  • The blue circle represents Ridge regularization, where weights are encouraged to stay small and distributed.
  • The red diamond represents Lasso regularization, where the sharp edges push many coefficients exactly to zero.
  • The green soft diamond represents Elastic Net, balancing between the two — encouraging sparsity without destabilizing correlated features.

Choosing the right constraint geometry shapes the kind of solutions the optimization process prefers —
whether they emphasize simplicity, robustness, sparsity, or stability.


Practical Impact

In real-world machine learning workflows:

  • Elastic Net often outperforms Lasso when there are groups of related features.
  • It often outperforms Ridge when interpretability and feature selection are important.

It is especially useful when:

  • The number of features exceeds the number of observations,
  • Feature redundancy is common,
  • Sparse models are desired without losing robustness.

Today, Elastic Net is a standard tool in:

  • High-dimensional regression (e.g., in scikit-learn’s ElasticNetCV),
  • Automatic model selection pipelines,
  • Structured data modeling where feature interactions matter.

Dropout: Theoretical Justification

As neural networks grew deeper and larger, a new challenge emerged: even with L1 or L2 regularization, complex models could still overfit. They could memorize training data in fragile ways — relying heavily on specific paths through the network, adapting too finely to patterns that might not generalize.

Dropout was introduced as a way to break this dependency. It injects randomness into the training process, forcing networks to learn more robust, distributed representations.


What is Dropout?

During training, dropout temporarily removes (or “drops out”) a random subset of neurons at each forward pass.

For a given neuron output \(h_i\):

  • With probability \(p\), set \(h_i = 0\) (drop it).
  • With probability \(1-p\), keep \(h_i\) unchanged.

Mathematically, during training:

\[\tilde{h}_i = h_i \cdot z_i\]

where:

  • \(z_i \sim \text{Bernoulli}(1-p)\) is a random binary mask,
  • \(\tilde{h}_i\) is the modified output.

At test time:

  • No dropout is applied.
  • Outputs are scaled by \(1-p\) to match the expected magnitude.
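
A minimal numpy sketch of this train/test behavior, following the scaling convention described above (many frameworks instead use "inverted" dropout, scaling by \(1/(1-p)\) during training so that no scaling is needed at test time):

```python
import numpy as np

def dropout_forward(h, p, training, rng):
    """Standard dropout: drop units with probability p during training,
    scale activations by (1 - p) at test time to match the expected magnitude."""
    if training:
        mask = rng.binomial(1, 1.0 - p, size=h.shape)  # z_i ~ Bernoulli(1 - p)
        return h * mask
    return h * (1.0 - p)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))                              # a batch of hidden activations
h_train = dropout_forward(h, p=0.5, training=True, rng=rng)   # random units zeroed
h_test = dropout_forward(h, p=0.5, training=False, rng=rng)   # deterministic, rescaled
```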

Why Dropout Works: Intuition

Dropout forces redundancy and discourages co-adaptation:

  • A neuron cannot rely on the presence of specific other neurons,
    because any subset could be dropped at any update.
  • Each neuron must learn features that are useful on their own,
    not just as part of a fragile ensemble.

This encourages the network to:

  • Distribute information more broadly,
  • Build multiple, redundant pathways for solving the task,
  • Avoid overfitting to small details in the training data.

In essence, dropout trains an implicit ensemble of subnetworks,
where each mini-network shares weights but sees a different random view of the full model.


Formal View: Dropout as Model Averaging

At a formal level, dropout can be interpreted as performing approximate model averaging over a huge ensemble of sub-networks.

Each dropout mask defines a sub-network:
one specific subset of the full neural network.

In principle:

  • A network with \(n\) neurons has \(2^n\) possible sub-networks (exponentially many).
  • Averaging predictions from all these sub-networks would produce a powerful, robust predictor.

Training all these separately would be infeasible.
Dropout approximates this averaging efficiently:

  • It randomly samples a different sub-network at each training step,
  • Updates the shared parameters based on the sampled sub-network’s behavior.

At test time, scaling by \(1-p\) approximates averaging the ensemble outputs.

Thus, dropout regularizes not just by penalizing large weights, but by penalizing reliance on specific configurations.


Geometric Interpretation of Dropout’s Effect on Optimization

In deep networks, optimization landscapes are complex surfaces in high-dimensional parameter spaces.
The shape of the loss surface around a minimum tells us much about how the model will behave on new data.

Without dropout, optimization often converges to sharp, narrow minima:
small pits where loss is very low, but small perturbations in parameters can cause large increases in loss.

With dropout, optimization tends to favor flatter, wider minima:
broad valleys where loss remains low across a large range of parameter variations.

The 3D plot below visualizes this effect:

  • The red steep surface represents a sharp minimum —
    a deep, narrow well that fits training data closely but is unstable to small shifts.
  • The blue shallow surface represents a flat minimum —
    a broad basin where the model remains robust even if parameters drift slightly.

Flatter minima are strongly correlated with better generalization.
They indicate solutions that are less sensitive to noise, perturbations, or changes in the data distribution —
properties critical for building models that perform reliably outside of the training set.

Dropout, by randomly perturbing the network during training, acts as a geometric regularizer —
steering optimization away from fragile sharp basins, and toward resilient flat valleys.


Practical Impact

Dropout has become a standard regularization tool, especially in:

  • Fully connected layers of deep networks,
  • Early convolutional networks (e.g., AlexNet, VGG).

However:

  • In very large modern architectures (e.g., BERT, GPT),
    dropout is often used carefully (e.g., smaller dropout rates, or after attention layers only),
  • Excessive dropout can slow down convergence.

The typical choice of dropout rate is around \(p = 0.5\) for fully connected layers,
and lower values (e.g., \(p = 0.1\) to \(p = 0.3\)) for convolutional or attention layers.


Batch Normalization: Stability and Regularization

As neural networks grew deeper, another difficulty emerged:
even with careful initialization, the distribution of activations changed as data propagated through the network.

This phenomenon, known as internal covariate shift, made optimization unstable:

  • Small parameter changes caused large changes in activation distributions,
  • Learning rates had to be kept very small to prevent divergence,
  • Deeper networks became extremely hard to train.

Batch Normalization (BatchNorm) was introduced to address these issues.
It normalizes the activations within each mini-batch to have zero mean and unit variance,
allowing networks to train deeper, faster, and more stably.


How Batch Normalization Works

For each neuron activation \(h\) across a mini-batch:

1. Compute batch statistics:

\[\mu_{\text{batch}} = \frac{1}{m} \sum_{i=1}^{m} h_i\] \[\sigma_{\text{batch}}^2 = \frac{1}{m} \sum_{i=1}^{m} (h_i - \mu_{\text{batch}})^2\]

2. Normalize the activations:

\[\hat{h}_i = \frac{h_i - \mu_{\text{batch}}}{\sqrt{\sigma_{\text{batch}}^2 + \epsilon}}\]

3. Scale and shift:

\[h_i' = \gamma \hat{h}_i + \beta\]

where:

  • \(\gamma\) and \(\beta\) are learnable parameters allowing the network to undo normalization if needed,
  • \(\epsilon\) is a small constant for numerical stability.
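
Translating the three steps above into a short numpy sketch (training-time statistics only; real implementations also track running averages of \(\mu\) and \(\sigma^2\) for use at test time):

```python
import numpy as np

def batchnorm_forward(h, gamma, beta, eps=1e-5):
    """Normalize a (batch, features) activation matrix per feature, then scale and shift."""
    mu = h.mean(axis=0)                      # batch mean per feature
    var = h.var(axis=0)                      # batch variance per feature
    h_hat = (h - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * h_hat + beta              # learnable scale and shift

rng = np.random.default_rng(0)
h = rng.normal(loc=3.0, scale=5.0, size=(32, 4))           # poorly scaled activations
out = batchnorm_forward(h, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```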

Intuition: Why It Helps

  • Stabilizes gradients: Normalized activations prevent gradients from exploding or vanishing across layers.
  • Allows higher learning rates: Optimization becomes less sensitive to parameter scale.
  • Acts as a regularizer: The stochasticity from using mini-batch statistics introduces noise, improving generalization (similar to a mild dropout effect).
  • Reduces internal covariate shift: Each layer sees inputs with more consistent statistics, accelerating convergence.

How Batch Normalization Reshapes Optimization Landscapes

One of the most important effects of Batch Normalization is not just stabilizing activation statistics,
but reshaping the loss surface to make optimization more predictable and efficient.

The 3D plot below illustrates this idea:

  • Without BatchNorm (red surface), the loss landscape is chaotic.
    Small changes in parameters can cause sudden steep cliffs and sharp drops, making optimization highly sensitive and unstable.

  • With BatchNorm (blue surface), the loss landscape becomes much smoother.
    Valleys are broader, transitions are more gradual, and optimization paths are less likely to get trapped in sharp irregularities.

By smoothing the effective loss surface, BatchNorm allows:

  • Larger, safer learning rates,
  • Faster convergence,
  • Stronger generalization,
    even as networks become deeper and more complex.

It is not simply about normalizing activations;
BatchNorm fundamentally reshapes how gradients behave and how optimization traverses parameter space.


Applications

Regularization techniques shape the behavior of models during training, constrain what solutions are considered acceptable, and influence how well models generalize beyond the training data.

In this section, we move from core concepts into practical ground:
how L1/L2 penalties, dropout, batch normalization, and related ideas are applied to real modeling problems.

From small dataset regimes, where overfitting is immediate,
to high-dimensional spaces requiring aggressive feature selection,
to deep architectures needing stability across layers —
regularization strategies become central to designing models that are not only accurate but resilient and interpretable.

Each method we explore reflects a different way of answering the same question:
how do we help a model learn just enough to succeed, without letting it learn too much and fail?


Application: Regularizing Neural Networks on Small Datasets

Training deep neural networks typically demands large datasets.
When data is abundant, overparameterized models can generalize well, even without aggressive regularization.
But when datasets are small, the risk of overfitting becomes much sharper.

Without careful design, deep networks can memorize the limited training data perfectly,
failing to capture any real underlying structure and generalizing poorly to unseen examples.

In these scenarios, regularization techniques are not optional.
They are essential tools for shaping the optimization process — for constraining models to search for solutions that are simpler, more stable, and more likely to reflect real patterns rather than noise.


Challenges with Small Datasets

| Challenge | Effect |
| --- | --- |
| Few samples per parameter | High risk of memorization |
| Noise dominates | Easy to overfit to spurious patterns |
| Lack of diversity | Reduces natural regularization from the data itself |
| Difficult to validate | Small validation sets increase variance in model evaluation |

Regularization Strategies for Small Datasets

When working with limited data, the choice and tuning of regularization methods becomes critical:


1. L2 Regularization (Weight Decay)

Adding an L2 penalty discourages large weights,
nudging the model toward simpler, smoother functions.

On small datasets:

  • Larger \(\lambda\) values (stronger regularization) are often necessary,
  • Helps prevent sharp decision boundaries that fit noise instead of signal.

Typically: L2 penalty is tuned carefully by validation set performance.


2. Dropout

Even small networks benefit from dropout on small datasets:

  • Introduces noise during training,
  • Forces the model to be redundant and robust,
  • Acts as implicit model averaging over subnetworks.

Lower dropout rates (e.g., \(p=0.1\)–\(p=0.3\)) often work better for small data compared to heavy dropout.


3. Data Augmentation

While not strictly a regularization term in the loss function,
data augmentation plays a similar role:

  • In vision: rotations, crops, flips artificially expand dataset size.
  • In NLP: backtranslation, synonym replacement enriches training data.

Good augmentation acts like “soft regularization” —
exposing the model to more variation without requiring new labeled examples.

Especially critical when labeled examples are expensive or rare.
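
As a sketch of what an augmentation pipeline might look like in the vision case, assuming torchvision is available (the specific transforms and parameters are illustrative, not a recommended recipe):

```python
from torchvision import transforms

# Each epoch, every training image is seen through a random rotation/crop/flip,
# so the model effectively trains on many more variations than it has labels for.
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

# Validation/test data gets only deterministic preprocessing, never the random perturbations.
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```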


4. Early Stopping

On small datasets, validation loss tends to increase after a few epochs of overfitting.
Monitoring validation loss and stopping training early is a practical, effective regularization strategy.

Early stopping prevents the model from drifting into noise-fitting regimes.
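
A minimal sketch of the monitoring loop on a toy gradient-descent model (the patience value, model, and synthetic data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
w_true = np.zeros(30)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + rng.normal(scale=1.0, size=200)

# Small training split and a held-out validation split.
X_tr, y_tr, X_val, y_val = X[:60], y[:60], X[60:], y[60:]

w = np.zeros(30)
best_val, best_w, patience, bad_epochs = np.inf, w.copy(), 10, 0
for epoch in range(2000):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= 0.01 * grad
    val_loss = np.mean((X_val @ w - y_val) ** 2)
    if val_loss < best_val:                  # still improving on unseen data
        best_val, best_w, bad_epochs = val_loss, w.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:           # validation has stopped improving: stop training
            break

w = best_w  # keep the parameters from the best validation epoch
```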


5. Smaller Model Architectures

Another often overlooked regularization method:
shrinking the model itself.

  • Fewer layers,
  • Smaller hidden dimensions,
  • Tighter parameter budgets.

With less capacity, the model is naturally constrained to simpler hypotheses —
reducing the risk of memorizing noise.


Practical Example

Suppose you are training a convolutional neural network (CNN) on a small medical imaging dataset (e.g., 500 images).

A good regularization strategy might include:

  • Adding L2 weight decay (\(\lambda = 10^{-3}\)),
  • Using dropout (rate 0.2) after fully connected layers,
  • Applying aggressive data augmentation (rotations, shifts),
  • Monitoring validation loss and stopping early,
  • Reducing network depth (e.g., fewer convolutional blocks).

Rather than relying on any single technique,
a stacked regularization approach tends to work best —
each method controlling overfitting in a complementary way.
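
A sketch of how these pieces might be wired together in PyTorch; the architecture, input size, and exact hyperparameters are placeholders for the hypothetical 500-image task, not a validated recipe:

```python
import torch
import torch.nn as nn

# A deliberately small CNN: modest capacity is itself a form of regularization.
# Assumes 1-channel, 64x64 input images.
class SmallCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64), nn.ReLU(),
            nn.Dropout(p=0.2),                       # dropout after the fully connected layer
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
# weight_decay applies the L2 penalty (lambda = 1e-3) inside the optimizer update.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Training loop (augmented data loading and early stopping on validation loss omitted):
# for images, labels in train_loader:
#     optimizer.zero_grad()
#     loss = loss_fn(model(images), labels)
#     loss.backward()
#     optimizer.step()
```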


In small data settings, optimization must be disciplined.
The model must be prevented from chasing every fluctuation in the training set,
and gently guided to focus only on the strongest, most stable patterns.

Regularization is not an afterthought.
It is a central design principle
shaping both the model and the training process so that simplicity, robustness, and generalization emerge naturally from the constraints imposed.


Application: Feature Selection via Lasso

When datasets are high-dimensional, especially when the number of features exceeds the number of observations, feature selection becomes a critical part of building interpretable and stable models. Irrelevant or redundant features not only increase the risk of overfitting but also make models harder to explain and validate.

Lasso regression offers a natural, integrated way to perform feature selection during model training itself. By adding an L1 penalty to the loss function, Lasso encourages many coefficients to shrink exactly to zero, effectively removing the corresponding features from the model.


How Lasso Enables Feature Selection

Recall that Lasso optimizes:

\[\text{Loss}_{\text{lasso}} = \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \mathbf{w})^2 + \lambda \sum_{j=1}^p |w_j|\]

The L1 penalty \(\sum_j |w_j|\) introduces a sharp “corner” in the optimization constraint space. When the optimizer moves along this constraint surface, it naturally “sticks” to the axes, producing many coefficients that are exactly zero.

This behavior differs from Ridge regression, where the \(\ell_2\) penalty only shrinks coefficients but rarely drives them exactly to zero.

In practice:

  • Features with coefficients reduced to zero are effectively eliminated,
  • The model focuses only on the most informative subset of features,
  • Interpretability improves, as the selected features have clear, nonzero contributions.

Advantages of Lasso-Based Feature Selection

  • Integrated selection and modeling: No need for separate feature selection steps; Lasso builds sparse models directly during training.
  • Control over sparsity: By adjusting \(\lambda\), you control how many features are selected. Higher \(\lambda\) values lead to sparser models.
  • Improved generalization: Reducing model complexity by eliminating irrelevant features typically improves out-of-sample performance, especially when sample sizes are small relative to feature dimensions.

Practical Considerations

While powerful, Lasso feature selection also comes with nuances:

  • Correlated features: When features are highly correlated, Lasso tends to select one arbitrarily and ignore the others. This instability can make interpretations less reliable.
  • Scaling matters: Features must be standardized (zero mean, unit variance) before applying Lasso, otherwise coefficients are not comparable across features.
  • Model refitting: Sometimes, after Lasso selects a subset of features, it is beneficial to refit an unregularized model using only the selected features to avoid shrinkage bias.

Elastic Net regularization, which blends L1 and L2 penalties, can sometimes be preferred when feature correlations are strong, preserving group structure better than Lasso alone.


Example Scenario

Imagine building a predictive model for disease risk based on thousands of genetic markers.
Many markers are irrelevant, and only a sparse subset influences the target.

Applying Lasso regression:

  • With a moderate \(\lambda\) value, the model automatically selects a few dozen markers with nonzero coefficients,
  • Ignoring the rest without requiring manual feature engineering,
  • Yielding a sparse, interpretable model where selected markers have meaningful predictive contributions.

This type of automatic sparsity is especially valuable in domains like genomics, text classification, or finance, where raw feature counts can be enormous relative to sample sizes.
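
A sketch of this workflow on simulated "marker" data, using scikit-learn's cross-validated Lasso to choose \(\lambda\) (the dataset here is synthetic, standing in for real genetic markers):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# 200 samples, 2000 features, only 20 of which actually influence the target.
X, y = make_regression(n_samples=200, n_features=2000, n_informative=20,
                       noise=5.0, random_state=0)

X = StandardScaler().fit_transform(X)            # scaling matters for the L1 penalty

lasso = LassoCV(cv=5, random_state=0).fit(X, y)  # picks lambda (alpha) by cross-validation
selected = np.flatnonzero(lasso.coef_)
print(f"selected {len(selected)} of {X.shape[1]} features")
```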


Application: BatchNorm in CNNs and Transformers

Batch Normalization was first introduced in the context of convolutional neural networks (CNNs),
but its usefulness quickly extended across architectures, including transformers for natural language processing.

In both cases, the goal remains the same:
to stabilize and accelerate optimization by controlling the distribution of activations across layers.

However, the way BatchNorm is applied differs slightly between CNNs and Transformers,
reflecting the different structures and statistical properties of these models.


BatchNorm in CNNs

In convolutional networks, BatchNorm is applied channel-wise.
For a mini-batch of feature maps with shape \((N, C, H, W)\) —
where:

  • \(N\) = batch size,
  • \(C\) = number of channels (feature maps),
  • \(H, W\) = spatial dimensions (height, width) —

BatchNorm normalizes each channel separately across the batch and spatial dimensions.

Formally, for each channel \(c\):

\[\mu_c = \frac{1}{N \times H \times W} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw}\] \[\sigma_c^2 = \frac{1}{N \times H \times W} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} (x_{nchw} - \mu_c)^2\]

Each activation is then normalized:

\[\hat{x}_{nchw} = \frac{x_{nchw} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}}\]

followed by learned scaling and shifting:

\[y_{nchw} = \gamma_c \hat{x}_{nchw} + \beta_c\]

where \(\gamma_c\) and \(\beta_c\) are learned per-channel parameters.

In CNNs, BatchNorm provides multiple benefits:

  • It allows networks to train deeper architectures like VGG and ResNet.
  • It reduces sensitivity to initialization and learning rate choice.
  • It provides a mild regularization effect by introducing noise through mini-batch statistics.

BatchNorm in Transformers

In Transformer architectures, the structure is different:

  • Inputs are sequences, not images.
  • Representations are usually shaped \((N, T, D)\) — batch size \(N\), sequence length \(T\), hidden dimension \(D\).

BatchNorm is rarely used directly inside Transformers.
Instead, Transformers apply a related concept: Layer Normalization.

While BatchNorm normalizes across batch and spatial dimensions,
LayerNorm normalizes across the feature dimension for each token individually.

For a given token representation \(x \in \mathbb{R}^D\):

\[\mu = \frac{1}{D} \sum_{i=1}^{D} x_i\] \[\sigma^2 = \frac{1}{D} \sum_{i=1}^{D} (x_i - \mu)^2\]

Normalized output:

\[\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}\]

and finally:

\[y_i = \gamma_i \hat{x}_i + \beta_i\]

where \(\gamma_i\) and \(\beta_i\) are learnable per-feature parameters.
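
The same formulas as a short numpy sketch, applied to a batch of token representations of shape \((N, T, D)\):

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    """Normalize each token's D-dimensional representation independently."""
    mu = x.mean(axis=-1, keepdims=True)      # mean over the feature dimension
    var = x.var(axis=-1, keepdims=True)      # variance over the feature dimension
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta              # learnable per-feature scale and shift

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))               # (batch N, sequence T, hidden D)
out = layernorm(x, gamma=np.ones(8), beta=np.zeros(8))
# Every token now has ~zero mean and ~unit variance across its features,
# regardless of batch size or sequence length.
```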


Why Transformers Prefer LayerNorm

BatchNorm relies on meaningful mini-batch statistics,
which can vary greatly in sequence models where:

  • Sequence lengths differ,
  • Batch sizes are small (for memory reasons),
  • Padding and masking complicate spatial assumptions.

LayerNorm avoids these issues:

  • It operates per token,
  • It makes no assumption about batch size,
  • It fits naturally into autoregressive decoding settings.

Thus, in Transformers like BERT, GPT, and T5,
LayerNorm replaces BatchNorm,
providing stability and faster convergence without relying on large batches.


Conclusion

Regularization defines how models learn and generalize. It introduces structure into optimization, controlling complexity and encouraging models to discover patterns that are stable, simple, and resilient.

Across architectures and domains, techniques like L1/L2 penalties, dropout, and normalization enhance both learning dynamics and final model behavior.

The advancement of deep learning depends on expanding model capacity together with shaping the learning process through principled constraints. Regularization remains essential for building models that extend their success from training data into real-world challenges.