Generalized Linear Models: A Unified Framework for Modern Regression
- Why Go Beyond Linear Regression?
- The GLM Framework: A Unified View of Regression
- The Exponential Family of Distributions
- Choosing the Right Link Function
- Examples of Common GLMs
- Maximum Likelihood Estimation in GLMs
- Model Evaluation and Diagnostics for GLMs
- When to Use Which GLM?
- Closing Thoughts
Why Go Beyond Linear Regression?
Linear regression is one of the most foundational models in statistics and machine learning. It assumes a linear relationship between input features and the output variable, with errors that are normally distributed, independent, and homoscedastic. For many real-world problems, this model serves as a reliable starting point.
But what happens when the assumptions break?
Suppose you’re trying to model:
- Whether a customer will click on an ad (yes or no)
- How many times a patient visits a hospital in a year
- The number of purchases a user makes in a session
- Whether a transaction is fraudulent
These are not problems where the response variable is continuous and normally distributed. Instead:
- A binary outcome like click or no click violates the assumption of continuous and Gaussian residuals.
- A count outcome like number of visits is inherently discrete and non-negative, and its variance typically increases with the mean.
- Outcomes such as claim amounts or duration until an event may be right-skewed or heavy-tailed.
If we apply standard linear regression to such problems, several issues arise:
- Predictions can fall outside the feasible range (e.g., predicting probabilities less than 0 or more than 1)
- The model may assign constant variance to heteroscedastic data
- Estimation and inference become unreliable and misleading
In short, linear regression is not flexible enough to handle response variables with non-Gaussian characteristics. Forcing it onto such data can lead to poor predictive performance and invalid interpretations.
This motivates a broader question:
Can we build a general modeling framework that retains the interpretability of linear models, but adapts to different types of outcome variables?
The answer is yes—and it leads us directly to the framework of Generalized Linear Models (GLMs).
GLMs extend linear regression to a much wider class of problems, including classification, count modeling, and more, by:
- Allowing the response variable to follow any distribution from the exponential family
- Introducing a link function that connects the expected value of the response to a linear predictor
This framework not only preserves the elegant structure of linear models, but also adapts to the stochastic nature of real-world data. In the next section, we’ll build this structure from the ground up.
The GLM Framework: A Unified View of Regression
At its heart, the Generalized Linear Model (GLM) is a principled extension of linear regression. It retains the essential interpretability and structure of linear models while generalizing them to handle a much broader class of response variables.
A GLM is defined by three core components:
1. Random Component: Modeling the Response Variable
In classical linear regression, we assume that the response variable \(y\) follows a normal distribution with constant variance:
\[y \sim \mathcal{N}(\mu, \sigma^2)\]In GLMs, this assumption is replaced by a much more flexible condition: The response variable is drawn from the exponential family of distributions.
The exponential family has a canonical form:
\[p(y; \theta, \phi) = \exp\left( \frac{y \cdot \theta - b(\theta)}{\phi} + c(y, \phi) \right)\]Where:
- \(\theta\) is the canonical (natural) parameter
- \(\phi\) is the dispersion parameter (fixed in many GLMs, e.g., Poisson)
- \(b(\theta)\) is a cumulant function ensuring proper normalization
- \(c(y, \phi)\) adjusts for the specific form of the response variable
Many commonly used distributions are special cases of this form:
Distribution | Support of $$y$$ | Mean $$\mu$$ | Variance | Canonical Parameter $$\theta$$ | Canonical Link |
---|---|---|---|---|---|
Gaussian | $$\mathbb{R}$$ | $$\mu$$ | $$\sigma^2$$ | $$\theta = \mu$$ | Identity |
Bernoulli | $$\{0,1\}$$ | $$\mu = p$$ | $$p(1-p)$$ | $$\theta = \log\left(\frac{p}{1 - p}\right)$$ | Logit |
Poisson | $$\mathbb{N}_0$$ | $$\mu$$ | $$\mu$$ | $$\theta = \log(\mu)$$ | Log |
Gamma | $$\mathbb{R}_+$$ | $$\mu$$ | $$\mu^2 / k$$ | $$\theta = -1/\mu$$ | Inverse |
This generalization allows us to work with a variety of data types—binary, count, skewed continuous—within a single coherent modeling framework.
2. Systematic Component: The Linear Predictor
Just like in linear regression, GLMs assume that the input features affect the outcome via a linear combination. We define the linear predictor as:
\[\eta = \mathbf{Xw}\]Where:
- \(\mathbf{X} \in \mathbb{R}^{n \times p}\) is the design matrix of inputs
- \(\mathbf{w} \in \mathbb{R}^p\) is the coefficient vector
- \(\eta \in \mathbb{R}^n\) is the linear predictor, one per data point
This linear predictor serves as the bridge between input features and the modeled outcome.
3. Link Function: Connecting Mean to Linear Predictor
In most GLMs, the mean of the response variable, \(\mu = \mathbb{E}[y]\), is not equal to the linear predictor \(\eta\). Instead, they are connected via a link function \(g(\cdot)\):
\[g(\mu) = \eta = \mathbf{Xw}\]Equivalently, we write:
\[\mu = g^{-1}(\eta)\]The link function transforms the expected value of the response into the linear scale of predictors. The choice of the link depends on both the distribution and interpretability. The canonical link for each distribution results in convenient mathematical properties, particularly for maximum likelihood estimation.
Visualizing the GLM Generalization
Let’s take a moment to visualize what’s happening conceptually.
- In linear regression, we directly model the mean as a linear combination: \(\mu = \eta = \mathbf{Xw}\). Here, the link function is the identity: \(g(\mu) = \mu\).
- In logistic regression, we model the probability of a binary outcome: \(\mu = \mathbb{E}[y] = \mathbb{P}(y = 1)\). We relate this to the linear predictor via the logit link: \(g(\mu) = \log\left(\frac{\mu}{1 - \mu}\right) = \eta\).
- In Poisson regression, we model count outcomes: \(\mu = \mathbb{E}[y] \in \mathbb{R}_+\). The link is the log function: \(g(\mu) = \log(\mu) = \eta\).
Thus, linear regression is simply a special case of GLMs where:
- The distribution is Gaussian
- The link is identity
GLMs generalize this by allowing flexible response distributions and appropriate transformations through the link.
The Exponential Family of Distributions
The cornerstone of Generalized Linear Models is the assumption that the response variable comes from a distribution within the exponential family. This family is large and expressive—it includes many common distributions used in regression, classification, and count modeling.
But what exactly is the exponential family?
General Form
A probability distribution belongs to the exponential family if it can be written in the following canonical form:
\[p(y; \theta, \phi) = \exp\left( \frac{y \cdot \theta - b(\theta)}{\phi} + c(y, \phi) \right)\]Where:
- \(y\) is the observed response variable
- \(\theta\) is the natural (canonical) parameter
- \(\phi\) is the dispersion parameter (fixed in some distributions)
- \(b(\theta)\) is the log-partition function, which ensures the distribution integrates to one
- \(c(y, \phi)\) is a function of the data and dispersion
This decomposition might look abstract at first, so let’s walk through it.
Intuition Behind Each Term
- Natural parameter \(\theta\): This is the parameter that links directly to the sufficient statistic (here, \(y\)). The form \(y \cdot \theta\) means the log-density is linear in the sufficient statistic.
- Log-partition function \(b(\theta)\): This term normalizes the distribution. It also plays a crucial role in determining the mean and variance.
- Dispersion parameter \(\phi\): It controls the spread or scale of the distribution. In many GLMs (like Poisson or Bernoulli), \(\phi = 1\) by convention.
- Function \(c(y, \phi)\): Ensures the entire function is properly normalized in terms of the observed data.
Why Use the Exponential Family?
The key advantage is tractability. Distributions in this family allow for:
- Convex log-likelihoods
- Analytical derivatives (score functions, Hessians)
- Clean expressions for mean and variance
And most importantly, they yield GLMs that are easy to optimize using maximum likelihood methods.
Computing Mean and Variance
For any distribution in the exponential family:
- The mean is given by: \[\mu = \mathbb{E}[y] = b'(\theta)\]
- The variance is: \[\text{Var}(y) = \phi \cdot b''(\theta)\]
So, once we know \(b(\theta)\), we can derive the full behavior of the distribution in terms of mean and variance.
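As a quick sanity check, the following sketch (purely illustrative; the rate value and sample size are arbitrary) verifies these identities numerically for the Poisson case, where \(b(\theta) = e^\theta\) and \(\phi = 1\): finite differences of \(b\) recover the mean and variance of simulated counts.

```python
import numpy as np

rng = np.random.default_rng(0)

lam = 3.0                  # Poisson rate
theta = np.log(lam)        # canonical parameter for the Poisson
b = lambda t: np.exp(t)    # log-partition function b(theta) = e^theta

# Numerical first and second derivatives of b at theta
eps = 1e-4
b_prime = (b(theta + eps) - b(theta - eps)) / (2 * eps)                      # ≈ e^theta = lam
b_double_prime = (b(theta + eps) - 2 * b(theta) + b(theta - eps)) / eps**2   # ≈ e^theta = lam

# Compare against Monte Carlo estimates of mean and variance
y = rng.poisson(lam, size=200_000)
print(b_prime, y.mean())          # both ≈ 3.0  ->  mu = b'(theta)
print(b_double_prime, y.var())    # both ≈ 3.0  ->  Var(y) = phi * b''(theta), phi = 1
```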
Examples of Exponential Family Members
Let’s see how some familiar distributions fit into this general form.
Distribution | Support | $$\theta$$ (Canonical Parameter) | $$b(\theta)$$ | $$\phi$$ | Mean $$\mu$$ | Variance | Canonical Link |
---|---|---|---|---|---|---|---|
Gaussian | $$\mathbb{R}$$ | $$\theta = \mu$$ | $$b(\theta) = \frac{\theta^2}{2}$$ | $$\sigma^2$$ | $$\mu$$ | $$\sigma^2$$ | Identity |
Bernoulli | $$\{0, 1\}$$ | $$\theta = \log\left(\frac{p}{1-p}\right)$$ | $$b(\theta) = \log(1 + e^\theta)$$ | 1 | $$\mu = p$$ | $$p(1 - p)$$ | Logit |
Poisson | $$\mathbb{N}_0$$ | $$\theta = \log(\lambda)$$ | $$b(\theta) = e^\theta$$ | 1 | $$\mu = \lambda$$ | $$\lambda$$ | Log |
Gamma | $$\mathbb{R}_+$$ | $$\theta = -\frac{1}{\mu}$$ | $$b(\theta) = -\log(-\theta)$$ | $$\phi = \frac{1}{k}$$ | $$\mu$$ | $$\mu^2 / k$$ | Inverse |
Beyond the Big Four: Mentioning Tweedie Models
The Tweedie family is a broader class of exponential dispersion models whose variance is proportional to a power of the mean, \(\text{Var}(y) \propto \mu^p\). Special cases include:
- Normal: \(p = 0\), variance \(\propto \mu^0\)
- Poisson: \(p = 1\), variance \(\propto \mu^1\)
- Gamma: \(p = 2\), variance \(\propto \mu^2\)
Values \(1 < p < 2\) yield compound Poisson-Gamma distributions with a point mass at zero.
While Tweedie models are beyond the scope of basic GLMs, they offer powerful extensions in fields like insurance and energy forecasting, where distributions are often zero-inflated and right-skewed.
Choosing the Right Link Function
The link function is what makes Generalized Linear Models truly flexible and powerful. It acts as a bridge between the expected value of the response variable and the linear predictor. While the linear predictor \(\eta = \mathbf{Xw}\) remains the same across all GLMs, the transformation applied through the link function determines how the model adapts to different types of data.
Mathematically, the link function is defined as:
\[g(\mu) = \eta = \mathbf{Xw}\]Where:
- \(\mu = \mathbb{E}[y]\) is the mean of the response variable
- \(g(\cdot)\) is a monotonic, differentiable function that maps the range of \(\mu\) into the entire real line
Choosing the right link function ensures:
- The output of the model lies within the valid domain of the response
- Interpretability aligns with the nature of the problem
- Estimation becomes tractable
Let’s explore some common link functions and where they are used.
Identity Link
This is the simplest possible link function:
\[g(\mu) = \mu\]So that:
\[\mu = \eta = \mathbf{Xw}\]Used in:
- Linear Regression
This link assumes that the mean of the response can take any real value, making it suitable for continuous outputs that are approximately normally distributed.
It is also the canonical link for the Gaussian distribution.
Logit Link
The logit function is defined as:
\[g(\mu) = \log\left( \frac{\mu}{1 - \mu} \right)\]Which implies:
\[\mu = \frac{1}{1 + e^{-\eta}} = \sigma(\eta)\]Used in:
- Logistic Regression (for binary classification)
This link maps probabilities (which lie in the interval \([0, 1]\)) to the real line, allowing us to model binary outcomes while ensuring predictions remain valid probabilities.
The logit is the canonical link for the Bernoulli distribution.
Probit Link
The probit link uses the inverse CDF of the standard normal distribution:
\[g(\mu) = \Phi^{-1}(\mu)\]So that:
\[\mu = \Phi(\eta)\]Where \(\Phi(\cdot)\) is the CDF of the standard normal distribution.
Used in:
- Probit Regression (alternative to logistic regression)
While both logit and probit provide sigmoid-shaped mappings, the probit link assumes latent Gaussian noise and is often favored in Bayesian settings due to conjugacy with Gaussian priors.
Unlike the logit, the probit link is not canonical for the Bernoulli distribution.
Log Link
The log function is defined as:
\[g(\mu) = \log(\mu)\]Which implies:
\[\mu = \exp(\eta)\]Used in:
- Poisson Regression (for count data)
- Gamma Regression (for skewed positive data)
The log link ensures that the predicted mean \(\mu\) remains strictly positive, which is critical when modeling counts or non-negative continuous quantities.
This is the canonical link for the Poisson distribution.
Canonical vs Non-Canonical Links
Each exponential family distribution has a canonical link function that simplifies the mathematics of estimation, particularly in deriving the score function and Hessian matrix. However, one is not restricted to the canonical link. Non-canonical links can be useful when:
- Interpretation is more natural in another scale (e.g., using identity link for binomial proportions)
- Robustness or regularization properties differ
- Prior distributions align better (in Bayesian modeling)
The choice of link affects:
- Interpretability: Odds ratios (logit), probabilities (identity), hazard rates (log)
- Computational behavior: Canonical links often lead to faster convergence in estimation
- Model fit: Some datasets might prefer a non-canonical link for better predictive performance
To summarize, the link function governs how the model expresses the relationship between the inputs and the expected output. The selection should reflect both the distributional assumptions and the practical meaning of model coefficients.
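For concreteness, here is a small numpy/scipy sketch (illustrative only; the grid of \(\eta\) values is arbitrary) collecting these four links and their inverses. Note how each inverse maps the real-valued predictor \(\eta\) back into the valid range of \(\mu\).

```python
import numpy as np
from scipy.stats import norm

# Each pair is (link g, inverse link g^{-1})
links = {
    "identity": (lambda mu: mu,                     lambda eta: eta),
    "logit":    (lambda mu: np.log(mu / (1 - mu)),  lambda eta: 1 / (1 + np.exp(-eta))),
    "probit":   (norm.ppf,                          norm.cdf),
    "log":      (np.log,                            np.exp),
}

eta = np.linspace(-3, 3, 7)          # linear predictor values
for name, (g, g_inv) in links.items():
    mu = g_inv(eta)                  # map eta into the mean's valid range
    print(name, np.round(mu, 3))     # logit/probit give (0,1); log gives positive values
    assert np.allclose(g(mu), eta)   # round trip: g(g^{-1}(eta)) recovers eta
```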
Examples of Common GLMs
The power of the Generalized Linear Model framework lies in its versatility. By appropriately choosing a distribution from the exponential family and selecting a suitable link function, we can derive a wide range of models tailored to different types of response variables.
In this section, we will revisit several well-known models—some you’ve likely encountered before—as special cases of GLMs. This unified perspective not only clarifies their assumptions and limitations but also illuminates how they are all structurally similar under the GLM umbrella.
We begin with the most familiar of all: Linear Regression.
1. Linear Regression (Revisiting OLS as a GLM)
To appreciate the power and generality of Generalized Linear Models, it helps to start with something familiar: Ordinary Least Squares (OLS) linear regression.
Though often treated as a separate classical method, linear regression fits naturally into the GLM framework—it’s simply the case where:
- The response variable is assumed to follow a Gaussian distribution
- The link function is the identity, i.e., \(g(\mu) = \mu\)
Let’s walk through this explicitly.
The Classical Setup
In standard linear regression, we model the response as:
\[y = \mathbf{x}^\top \mathbf{w} + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)\]For a dataset with \(n\) observations and \(p\) features, the full model becomes:
\[\mathbf{y} = \mathbf{Xw} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})\]We assume the noise terms are:
- Independent
- Identically distributed
- Normally distributed with mean 0 and constant variance \(\sigma^2\)
This leads to the familiar least squares objective:
\[\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{Xw}\|_2^2\]The closed-form solution is given by the normal equation:
\[\hat{\mathbf{w}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}\]
GLM Perspective: Expressing OLS as a Generalized Linear Model
Let’s now reinterpret this through the GLM lens.
- Distribution (Random Component): We assume each \(y_i\) is drawn independently from a normal distribution: \[y_i \sim \mathcal{N}(\mu_i, \sigma^2)\]
- Linear Predictor (Systematic Component): The mean is modeled as: \[\mu_i = \eta_i = \mathbf{x}_i^\top \mathbf{w}\]
- Link Function: We use the identity link, so that: \[g(\mu_i) = \mu_i = \eta_i\]
This satisfies all the conditions of a GLM:
- Response from exponential family (Normal distribution is a member)
- Linear predictor
- Link function
Connection to Maximum Likelihood Estimation
To cement the GLM connection, let’s derive the OLS solution from maximum likelihood.
We assume each response \(y_i\) is distributed as:
\[p(y_i \mid \mu_i, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(y_i - \mu_i)^2}{2\sigma^2} \right)\]Since \(\mu_i = \mathbf{x}_i^\top \mathbf{w}\), the log-likelihood becomes:
\[\log \mathcal{L}(\mathbf{w}) = \sum_{i=1}^n \log p(y_i \mid \mathbf{x}_i, \mathbf{w}) = -\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \mathbf{w})^2\]Maximizing the likelihood is equivalent to minimizing the negative log-likelihood, which simplifies to:
\[\min_{\mathbf{w}} \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \mathbf{w})^2\]This is precisely the least squares objective. So, from a GLM standpoint, linear regression is not just algebraically neat—it is also the maximum likelihood estimator for a Gaussian response with an identity link.
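The equivalence is easy to check numerically. In this minimal sketch (synthetic data; variable names are my own), the normal-equation solution coincides with the fit from statsmodels' GLM using a Gaussian family and identity link:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 200, 3
X = sm.add_constant(rng.normal(size=(n, p)))       # design matrix with intercept
w_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ w_true + rng.normal(scale=0.5, size=n)     # Gaussian noise

# Closed-form OLS via the normal equation
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Same model expressed as a GLM: Gaussian family, identity link (the default)
w_glm = sm.GLM(y, X, family=sm.families.Gaussian()).fit().params

print(np.allclose(w_ols, w_glm))    # True: OLS is the Gaussian / identity GLM
```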
2. Logistic Regression (GLM Perspective)
When the response variable is binary—such as clicked or not, fraud or no fraud, yes or no—a linear model with Gaussian assumptions is inappropriate. Probabilities must lie between 0 and 1, and linear regression can’t guarantee that.
Enter Logistic Regression, a model that naturally arises in the GLM framework when:
- The response variable follows a Bernoulli distribution
- The link function is the logit, mapping probabilities to the real line
Random Component: Bernoulli Distribution
We assume the binary response variable \(y_i \in \{0, 1\}\) is drawn from a Bernoulli distribution with probability \(p_i = \mathbb{P}(y_i = 1)\):
\[y_i \sim \text{Bernoulli}(p_i)\]The Bernoulli distribution can be expressed in exponential family form:
\[p(y_i; \theta_i) = \exp\left( y_i \cdot \theta_i - \log(1 + e^{\theta_i}) \right)\] \[\text{with } \theta_i = \log\left( \frac{p_i}{1 - p_i} \right)\]Here, \(\theta_i\) is the canonical parameter, and this form confirms that Bernoulli is indeed a member of the exponential family.
Link Function: Logit
To ensure \(p_i \in (0, 1)\), we use the logit link:
\[g(p_i) = \log\left( \frac{p_i}{1 - p_i} \right) = \eta_i = \mathbf{x}_i^\top \mathbf{w}\]Taking the inverse gives the sigmoid function:
\[p_i = \frac{1}{1 + e^{-\eta_i}} = \sigma(\eta_i)\]This transformation maps real-valued linear predictors to valid probabilities. The logit is the canonical link for the Bernoulli distribution.
Interpretation: Odds and Log-Odds
One of the appealing features of logistic regression is the interpretability of its coefficients in terms of odds ratios.
- The odds of a positive outcome are: \[\text{Odds}(y_i = 1) = \frac{p_i}{1 - p_i}\]
- The log-odds (logit) are linear in the features: \[\log\left( \frac{p_i}{1 - p_i} \right) = \mathbf{x}_i^\top \mathbf{w}\]
Each coefficient \(w_j\) represents the change in log-odds associated with a one-unit increase in feature \(x_j\), holding all other features constant.
MLE Objective: Cross-Entropy Loss
The likelihood of a Bernoulli response under the model is:
\[\mathcal{L}(\mathbf{w}) = \prod_{i=1}^n p_i^{y_i} (1 - p_i)^{1 - y_i}\]Taking logs gives the log-likelihood:
\[\log \mathcal{L}(\mathbf{w}) = \sum_{i=1}^n \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]\]The negative of this log-likelihood is the cross-entropy loss, which we minimize to estimate \(\mathbf{w}\). Unlike OLS, there’s no closed-form solution; we must use iterative optimization methods like gradient descent or IRLS.
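As a rough sketch of this optimization (plain gradient ascent on the average log-likelihood, with synthetic data and an illustrative learning rate), the following recovers the coefficients of a simulated logistic model:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Maximum likelihood for logistic regression via plain gradient ascent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(X @ w)))   # sigmoid of the linear predictor
        grad = X.T @ (y - p) / len(y)    # gradient of the average log-likelihood
        w += lr * grad
    return w

# Toy data: the true coefficients generate y through the logit link
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
w_true = np.array([-0.5, 1.5, -1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ w_true))))

print(fit_logistic(X, y))   # should land near w_true
```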
While logistic regression is most commonly used for classification, here we’ve framed it purely as a regression model for binary data using the GLM formalism.
We’ll revisit logistic regression in much greater detail—including classification boundaries, ROC curves, decision thresholds, and regularization—in the dedicated classification blog series.
3. Probit Regression
Like logistic regression, Probit Regression is designed for modeling binary response variables. It arises from the same GLM framework, with the key difference being the link function: instead of the logit, probit uses the inverse cumulative distribution function (CDF) of the standard normal distribution.
This subtle change makes Probit regression particularly useful in Bayesian statistics, latent variable modeling, and cases where a Gaussian error structure is assumed for the underlying decision process.
Random Component: Bernoulli Response
Just like in logistic regression, the outcome \(y_i \in \{0, 1\}\) is modeled as:
\[y_i \sim \text{Bernoulli}(p_i)\]But here, \(p_i\) is linked to the input features via a different transformation.
Link Function: Probit (Inverse Gaussian CDF)
Instead of using the logit function, we apply the inverse of the standard normal CDF, denoted by \(\Phi^{-1}(\cdot)\):
\[g(p_i) = \Phi^{-1}(p_i) = \eta_i = \mathbf{x}_i^\top \mathbf{w}\]Taking the inverse of the link gives the probit model:
\[p_i = \Phi(\eta_i) = \int_{-\infty}^{\eta_i} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right) dz\]This maps the linear predictor \(\eta_i\) to the probability space \([0, 1]\) through the standard normal CDF.
Geometric and Statistical Interpretation
You can think of probit regression as arising from a latent Gaussian process. Imagine there’s an unobserved continuous variable \(z_i\) defined as:
\[z_i = \mathbf{x}_i^\top \mathbf{w} + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, 1)\]Then:
- \(y_i = 1\) if \(z_i > 0\)
- \(y_i = 0\) otherwise
This leads to:
\[\mathbb{P}(y_i = 1) = \mathbb{P}(z_i > 0) = \mathbb{P}(\mathbf{x}_i^\top \mathbf{w} + \varepsilon_i > 0) = \Phi(\mathbf{x}_i^\top \mathbf{w})\]This latent variable formulation makes probit regression especially appealing in Bayesian modeling, where Gaussian priors on \(\mathbf{w}\) yield conjugacy with the likelihood.
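A quick simulation (illustrative values throughout) makes the latent-variable story tangible: generating \(z\) from the Gaussian model and thresholding at zero reproduces the probit probability \(\Phi(\mathbf{x}^\top \mathbf{w})\).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
w = np.array([0.5, -1.0])
X = np.column_stack([np.ones(100_000), np.full(100_000, 0.8)])  # hold x fixed at one value

# Latent Gaussian process: z = x'w + eps, and y = 1 whenever z > 0
z = X @ w + rng.normal(size=len(X))
y = (z > 0).astype(int)

print(y.mean())               # empirical P(y = 1)
print(norm.cdf(X[0] @ w))     # Phi(x'w) -- matches the empirical frequency
```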
Comparison to Logistic Regression
Both probit and logit links produce sigmoid-shaped curves, but with slightly different tails:
- Logistic CDF: Has heavier tails, so predicted probabilities approach 0 and 1 more gradually as the linear predictor grows
- Gaussian CDF (Probit): Has lighter tails and is more concentrated near the center, so probabilities approach 0 and 1 more quickly
Despite this, the predictions from both models are often very similar in practice. The choice between them typically hinges on:
- Theoretical considerations (e.g., latent Gaussian assumptions favor probit)
- Bayesian tractability (probit is easier to integrate into Gibbs sampling)
- Software or domain conventions
Practical Considerations
- Interpretability: Coefficients in probit models are interpreted less intuitively than in logit models, since they reflect changes in the latent normal scale.
- Usage: Probit regression is less common in applied machine learning but more frequently used in economics, psychometrics, and Bayesian statistics.
In summary, probit regression provides a statistically elegant and Bayesian-friendly alternative to logistic regression, particularly when Gaussian noise is a natural modeling assumption.
4. Multinomial / Softmax Regression
When the response variable can take on more than two discrete categories—such as predicting which product a customer will buy or which language a user is typing in—we need a model that generalizes logistic regression to the multi-class setting.
This leads us to Multinomial Regression, also known as Softmax Regression in machine learning literature. It is the natural GLM extension of logistic regression when the target is categorical with more than two classes.
Random Component: Multinomial Distribution
Assume the response variable \(y_i\) takes values in the set \(\{1, 2, \dots, K\}\) for \(K > 2\). The conditional probability is modeled using a categorical distribution, which is a special case of the multinomial (with one trial per observation):
\[\mathbb{P}(y_i = k) = p_{ik}, \quad \text{for } k = 1, \dots, K\]Each \(p_{ik}\) is the probability that observation \(i\) belongs to class \(k\), and we require:
\[\sum_{k=1}^K p_{ik} = 1\]
Link Function: Generalized Logit (Softmax)
We define a separate linear predictor for each class \(k\):
\[\eta_{ik} = \mathbf{x}_i^\top \mathbf{w}_k\]Then apply the softmax transformation:
\[p_{ik} = \frac{e^{\eta_{ik}}}{\sum_{j=1}^K e^{\eta_{ij}}}\]This maps each vector of real-valued logits \([\eta_{i1}, \dots, \eta_{iK}]\) into a valid probability distribution over \(K\) classes.
Note: To ensure identifiability, one of the \(\mathbf{w}_k\) (often the last) is typically set to zero.
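Here is a small numerical sketch of the softmax link (the max-subtraction trick and the specific weight values are implementation details for illustration, not part of the formal model):

```python
import numpy as np

def softmax(eta):
    """Map a vector of class logits to a probability distribution."""
    eta = eta - eta.max()            # subtract the max for numerical stability
    exp_eta = np.exp(eta)
    return exp_eta / exp_eta.sum()

# One observation, K = 3 classes; the last class's weight vector is pinned to
# zero for identifiability, so its logit is always 0.
x = np.array([1.0, 2.0])                               # features for one observation
W = np.array([[0.5, -0.2], [0.1, 0.3], [0.0, 0.0]])    # rows: w_1, w_2, w_3 (reference)
logits = W @ x
probs = softmax(logits)
print(probs, probs.sum())   # a valid distribution over the 3 classes (sums to 1)
```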
GLM Interpretation
- The multinomial distribution is a member of the exponential family
- The softmax function serves as the generalized canonical link for multi-class categorical data
- The model can be fit using maximum likelihood, leading to the categorical cross-entropy loss
While multinomial regression is a GLM in every sense, its primary usage is in classification tasks rather than regression. Topics such as:
- One-vs-rest vs full softmax modeling
- Decision boundaries in high-dimensional space
- Regularization (e.g., multinomial logistic with \(\ell_1\) or \(\ell_2\) penalty)
- Evaluation metrics (accuracy, top-k, F1)
will be discussed in detail in the upcoming classification blog series.
5. Poisson Regression
In many real-world problems, the response variable represents counts—how many times an event occurs. Examples include:
- Number of clicks on an ad
- Number of doctor visits in a month
- Number of claims filed by an insurance policyholder
- Number of purchases during a session
These are non-negative integers (\(0, 1, 2, \dots\)) with typically skewed distributions and heteroscedastic variance. Linear regression fails to capture these properties.
Poisson Regression offers a GLM-based solution by modeling the response using the Poisson distribution and linking its mean to a linear predictor through a log link.
Random Component: Poisson Distribution
The response variable \(y_i \in \mathbb{N}_0 = \{0, 1, 2, \dots\}\) is assumed to follow a Poisson distribution with rate parameter \(\lambda_i > 0\):
\[\mathbb{P}(y_i = k) = \frac{e^{-\lambda_i} \lambda_i^k}{k!}, \quad k \in \mathbb{N}_0\]This distribution belongs to the exponential family with:
- Canonical parameter: \(\theta_i = \log(\lambda_i)\)
- Log-partition function: \(b(\theta_i) = e^{\theta_i}\)
- Dispersion: \(\phi = 1\) (fixed in standard Poisson)
Mean and Variance
A key property of the Poisson distribution:
\[\mathbb{E}[y_i] = \lambda_i, \quad \text{Var}(y_i) = \lambda_i\]The variance equals the mean, a restriction that will later lead us to Negative Binomial regression for overdispersed data.
Link Function: Log Link
To ensure \(\lambda_i > 0\), we use the log link function:
\[g(\lambda_i) = \log(\lambda_i) = \eta_i = \mathbf{x}_i^\top \mathbf{w}\]This gives us the modeling equation:
\[\lambda_i = \exp(\mathbf{x}_i^\top \mathbf{w})\]The log link is the canonical link for the Poisson distribution and guarantees positivity of predicted rates.
Likelihood and Estimation
The log-likelihood for the Poisson model is:
\[\log \mathcal{L}(\mathbf{w}) = \sum_{i=1}^n \left[ y_i \log \lambda_i - \lambda_i - \log(y_i!) \right] = \sum_{i=1}^n \left[ y_i \cdot \mathbf{x}_i^\top \mathbf{w} - \exp(\mathbf{x}_i^\top \mathbf{w}) - \log(y_i!) \right]\]This is a convex function, and the MLE \(\hat{\mathbf{w}}\) can be estimated using iteratively reweighted least squares (IRLS) or gradient-based optimization.
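An end-to-end sketch using statsmodels (synthetic data; the Poisson family uses the canonical log link by default) illustrates the fit:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 1000
X = sm.add_constant(rng.normal(size=(n, 2)))
w_true = np.array([0.2, 0.7, -0.4])
lam = np.exp(X @ w_true)              # log link: rates are exp of the linear predictor
y = rng.poisson(lam)                  # count responses

# Poisson GLM with the canonical log link (the default in statsmodels)
result = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(result.params)                  # should be close to w_true
print(result.deviance)                # fit summary, used later for diagnostics
```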
Use Cases
Poisson regression is ideal for scenarios where:
- The target is a non-negative count
- The variance is roughly equal to the mean
- Events occur independently and randomly in time or space
Common applications include:
- Insurance: Number of claims per policyholder
- Marketing: Ad impressions or clicks
- Healthcare: Number of visits or prescriptions
- Web Analytics: Pageviews, session counts
6. Negative Binomial Regression
Poisson regression is elegant, but it makes a strong assumption: that the variance of the response equals the mean:
\[\text{Var}(y_i) = \mathbb{E}[y_i] = \lambda_i\]In practice, however, count data often exhibit overdispersion—the observed variance exceeds the mean:
\[\text{Var}(y_i) > \mathbb{E}[y_i]\]This can arise from unobserved heterogeneity, clustered events, or long-tailed distributions. When overdispersion is present, the Poisson model underestimates variability, leading to biased standard errors and poor model fit.
Negative Binomial Regression extends Poisson regression by explicitly modeling overdispersion while remaining within the GLM framework.
From Poisson to Negative Binomial
One way to derive the Negative Binomial distribution is as a Poisson-Gamma mixture:
- Assume the Poisson rate parameter \(\lambda_i\) is not fixed, but follows a Gamma distribution with shape \(r\) and rate \(p\): \[\lambda_i \sim \text{Gamma}(r, p), \quad y_i \mid \lambda_i \sim \text{Poisson}(\lambda_i)\]
- Marginalizing over the latent Gamma-distributed rate gives the Negative Binomial distribution: \[\mathbb{P}(y_i = k) = \frac{\Gamma(k + r)}{k! \, \Gamma(r)} \left( \frac{p}{1 + p} \right)^r \left( \frac{1}{1 + p} \right)^k\]
This compound formulation introduces extra variance through the Gamma prior on \(\lambda_i\), allowing the model to fit overdispersed data.
Mean and Variance
For Negative Binomial-distributed responses, the mean and variance are:
\[\mathbb{E}[y_i] = \mu_i, \quad \text{Var}(y_i) = \mu_i + \alpha \mu_i^2\]Where:
- \(\mu_i = \exp(\mathbf{x}_i^\top \mathbf{w})\) is the predicted count
- \(\alpha > 0\) is an overdispersion parameter
This quadratic variance structure allows more flexible modeling than Poisson regression.
Link Function: Log Link
As with Poisson regression, we use the log link to ensure non-negative predictions:
\[\log(\mu_i) = \eta_i = \mathbf{x}_i^\top \mathbf{w}\] \[\quad \text{or equivalently} \quad \mu_i = \exp(\mathbf{x}_i^\top \mathbf{w})\]The log link remains interpretable and is often used in count models.
Estimation and Usage
The log-likelihood for the Negative Binomial model includes a shape parameter (usually estimated via maximum likelihood or profile likelihood). Most modern libraries (e.g., Python’s statsmodels, R’s MASS::glm.nb) handle this seamlessly.
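As an illustrative sketch (not a recipe), the snippet below simulates overdispersed counts via the Poisson-Gamma mixture and compares a Poisson fit with a Negative Binomial fit in statsmodels. Note that here the dispersion \(\alpha\) is supplied by hand for simplicity, whereas in practice it is estimated (e.g., via statsmodels' discrete NegativeBinomial model or R's glm.nb).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 2000
X = sm.add_constant(rng.normal(size=(n, 1)))
mu = np.exp(X @ np.array([0.5, 0.8]))

# Overdispersed counts via the Poisson-Gamma mixture: E[lam] = mu, Var(y) = mu + 0.5*mu^2
lam = rng.gamma(shape=2.0, scale=mu / 2.0)
y = rng.poisson(lam)

poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
nb_fit = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=0.5)).fit()
print(poisson_fit.aic, nb_fit.aic)   # the NB fit should show a noticeably lower AIC
```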
Negative Binomial regression is widely used in fields such as:
- Healthcare: modeling hospital visits or lengths of stay
- Ecology: modeling species counts
- Insurance and Finance: modeling claim frequency with variability
- Web analytics: modeling user behavior over variable-length sessions
Summary
By introducing a Gamma-distributed noise term into the Poisson rate, Negative Binomial regression provides a principled way to model count data with extra variance. It retains the GLM structure, uses the same log link, and improves fit when Poisson’s mean–variance constraint is too rigid.
With this, we complete the tour of the most common GLMs. In the next section, we’ll briefly discuss how these models are fit and evaluated in practice, and when to use each one depending on your data and task.
Maximum Likelihood Estimation in GLMs
All models within the Generalized Linear Model framework share a common estimation philosophy: maximum likelihood. Whether you’re modeling a continuous outcome via linear regression or counts via Poisson regression, the goal is to find the parameter vector \(\mathbf{w}\) that maximizes the likelihood of observing the data under the assumed model.
In this section, we’ll explore how parameter estimation works across GLMs, focusing on the score function, the role of the link function, and the iterative algorithms used to solve the optimization problem.
Objective: Maximizing the Likelihood
Suppose we observe \(n\) independent data points \((\mathbf{x}_i, y_i)\). Under a GLM, the likelihood of the full dataset is:
\[\mathcal{L}(\mathbf{w}) = \prod_{i=1}^n p(y_i; \theta_i(\mathbf{x}_i^\top \mathbf{w}), \phi)\]Taking the logarithm, we obtain the log-likelihood function:
\[\ell(\mathbf{w}) = \sum_{i=1}^n \left[ \frac{y_i \cdot \theta_i - b(\theta_i)}{\phi} + c(y_i, \phi) \right]\]Where \(\theta_i\) is the canonical parameter, expressed as a function of the linear predictor:
\[\eta_i = \mathbf{x}_i^\top \mathbf{w}, \quad \mu_i = g^{-1}(\eta_i), \quad \theta_i = (b')^{-1}(\mu_i)\]For the canonical link this chain collapses to \(\theta_i = \eta_i\). The goal is to solve:
\[\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \, \ell(\mathbf{w})\]In general, this problem does not admit a closed-form solution (except for linear regression). We must use numerical optimization.
Deriving the Score Function
To solve for \(\mathbf{w}\), we compute the score function, the gradient of the log-likelihood:
\[\frac{\partial \ell(\mathbf{w})}{\partial \mathbf{w}} = \sum_{i=1}^n \left( \frac{\partial \ell}{\partial \mu_i} \cdot \frac{\partial \mu_i}{\partial \eta_i} \cdot \frac{\partial \eta_i}{\partial \mathbf{w}} \right)\]This can be organized neatly using matrix notation:
\[\nabla_{\mathbf{w}} \ell(\mathbf{w}) = \mathbf{X}^\top \mathbf{D} (\mathbf{y} - \boldsymbol{\mu}), \qquad \mathbf{D} = \operatorname{diag}\!\left( \frac{1}{\phi \, V(\mu_i) \, g'(\mu_i)} \right)\]Where:
- \(\mathbf{X}\) is the design matrix
- \(\boldsymbol{\mu}\) is the vector of predicted means: \(\mu_i = \mathbb{E}[y_i]\)
- \(\mathbf{D}\) is a diagonal matrix of weights involving the variance function \(V(\mu_i)\) and the derivative of the link \(g'(\mu_i)\)
For the canonical link, the weights simplify and the score reduces to \(\nabla_{\mathbf{w}} \ell(\mathbf{w}) = \mathbf{X}^\top (\mathbf{y} - \boldsymbol{\mu}) / \phi\).
This gradient points us in the direction of steepest ascent in log-likelihood, and can be used in iterative algorithms like gradient ascent or Newton-Raphson.
IRLS: Iteratively Reweighted Least Squares
The most commonly used method for solving GLM estimation is IRLS—Iteratively Reweighted Least Squares. It is a second-order method that approximates the log-likelihood surface locally as a quadratic, then solves a weighted least squares problem at each step.
IRLS solves:
\[\mathbf{w}^{(t+1)} = (\mathbf{X}^\top \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{W}^{(t)} \mathbf{z}^{(t)}\]Where:
- \(\mathbf{W}^{(t)}\) is a diagonal matrix of weights, \(W_{ii}^{(t)} = \left[ g'(\mu_i^{(t)}) \right]^{-2} / V(\mu_i^{(t)})\), based on the variance function and the derivative of the link
- \(\mathbf{z}^{(t)}\) is the working response, a linear approximation to the link-transformed outcome: \[z_i^{(t)} = \eta_i^{(t)} + (y_i - \mu_i^{(t)}) \, g'(\mu_i^{(t)})\]
Each iteration is equivalent to solving a weighted least squares problem, where weights depend on the model’s curvature at the current parameter estimate.
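To make the algorithm concrete, here is a minimal IRLS implementation for Poisson regression with the canonical log link (a sketch on synthetic data; for this model the weights and working response take the simple forms noted in the comments):

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-8):
    """IRLS for Poisson regression with the canonical log link.

    Here dmu/deta = mu and V(mu) = mu, so the weights are W = diag(mu)
    and the working response is z = eta + (y - mu) / mu.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ w
        mu = np.exp(eta)
        z = eta + (y - mu) / mu                      # working response
        W = mu                                       # diagonal IRLS weights
        XtWX = X.T @ (W[:, None] * X)
        XtWz = X.T @ (W * z)
        w_new = np.linalg.solve(XtWX, XtWz)          # weighted least squares step
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
w_true = np.array([0.3, 0.6])
y = rng.poisson(np.exp(X @ w_true))
print(irls_poisson(X, y))   # converges to the MLE in a handful of iterations
```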
Convergence Properties
IRLS is efficient and has strong convergence properties under the following conditions:
- The log-likelihood is concave (which holds for canonical links)
- The Fisher Information Matrix \(\mathbf{X}^\top \mathbf{W} \mathbf{X}\) is full-rank
- The response distribution is properly specified
However:
- Poorly scaled data can cause numerical instability
- Overdispersion (e.g., Poisson when variance > mean) can distort curvature
- Non-canonical links may slow convergence or make the Hessian harder to compute
In modern practice, IRLS is often replaced or supplemented by:
- Quasi-Newton methods (e.g., BFGS)
- Stochastic gradient descent (SGD) for large datasets
- Bayesian MCMC for posterior estimation in fully probabilistic settings
Model Evaluation and Diagnostics for GLMs
Fitting a GLM is only the beginning. Once you’ve estimated the model parameters, the next crucial step is to evaluate how well the model fits the data, determine whether the assumptions hold, and compare across candidate models.
Unlike linear regression—where we have well-defined tools like \(R^2\) and residual plots—GLMs require a slightly more general toolkit. In this section, we explore the most common diagnostics and model selection tools for GLMs.
Deviance: The GLM Analog of Residual Sum of Squares
In OLS, we evaluate model fit using the Residual Sum of Squares (RSS):
\[\text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2\]In GLMs, the equivalent measure is the deviance, which quantifies the distance between the fitted model and a saturated model (a hypothetical model that fits each data point exactly).
Formally, the deviance is defined as:
\[D = 2 \left[ \ell_{\text{sat}} - \ell(\hat{\mathbf{w}}) \right]\]Where:
- \(\ell_{\text{sat}}\) is the log-likelihood of the saturated model
- \(\ell(\hat{\mathbf{w}})\) is the log-likelihood under the fitted model
Smaller deviance implies better fit. For comparing two nested GLMs, the difference in deviance follows an approximate \(\chi^2\) distribution and can be used for hypothesis testing.
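As a small illustration of the nested-model deviance test (synthetic data; the feature x2 is irrelevant by construction), the drop in deviance from adding x2 is compared against a \(\chi^2\) reference distribution:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(7)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = rng.poisson(np.exp(0.3 + 0.5 * x1))          # x2 plays no role in the true model

X_small = sm.add_constant(np.column_stack([x1]))
X_full = sm.add_constant(np.column_stack([x1, x2]))

fit_small = sm.GLM(y, X_small, family=sm.families.Poisson()).fit()
fit_full = sm.GLM(y, X_full, family=sm.families.Poisson()).fit()

# Difference in deviance between nested models ~ chi^2 with df = number of extra parameters
delta_D = fit_small.deviance - fit_full.deviance
p_value = chi2.sf(delta_D, df=1)
print(delta_D, p_value)    # large p-value: x2 does not significantly improve the fit
```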
AIC: Penalized Likelihood for Model Selection
The Akaike Information Criterion (AIC) balances goodness-of-fit with model complexity. It is defined as:
\[\text{AIC} = -2 \ell(\hat{\mathbf{w}}) + 2k\]Where:
- \(\ell(\hat{\mathbf{w}})\) is the maximized log-likelihood
- \(k\) is the number of estimated parameters (including the intercept)
AIC is used to compare different models on the same dataset: lower AIC indicates a better model, accounting for overfitting.
For smaller sample sizes, a corrected version—AICc—is often used:
\[\text{AICc} = \text{AIC} + \frac{2k(k+1)}{n - k - 1}\]
Residuals in GLMs
GLMs generalize the concept of residuals. Different types of residuals provide diagnostic insight into various aspects of model fit.
1. Pearson Residuals
These standardize the difference between observed and expected responses:
\[r_i^{(P)} = \frac{y_i - \hat{\mu}_i}{\sqrt{\text{Var}(y_i)}}\]Useful for detecting outliers and non-constant variance patterns.
2. Deviance Residuals
Derived from the contribution of each data point to the model deviance:
\[r_i^{(D)} = \text{sign}(y_i - \hat{\mu}_i) \cdot \sqrt{d_i}\]Where \(d_i\) is the contribution of observation \(i\) to the total deviance.
Deviance residuals are symmetric and often preferred for visual diagnostics like residual vs. fitted plots.
3. Anscombe Residuals
These are designed to normalize skewness in residuals and are especially useful when the response distribution is highly asymmetric (e.g., Gamma or Poisson).
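Tying together items 1 and 2, this sketch (a Poisson example on synthetic data) computes Pearson and deviance residuals by hand and checks them against the ones statsmodels reports:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
X = sm.add_constant(rng.normal(size=(300, 1)))
y = rng.poisson(np.exp(X @ np.array([0.2, 0.6])))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu = fit.fittedvalues

# Pearson residuals: (y - mu) / sqrt(Var(y)); for Poisson, Var(y) = mu
pearson = (y - mu) / np.sqrt(mu)

# Deviance residuals: sign(y - mu) * sqrt(d_i), with d_i the unit deviance contribution
with np.errstate(divide="ignore", invalid="ignore"):
    d_i = 2 * (np.where(y > 0, y * np.log(y / mu), 0.0) - (y - mu))
deviance_resid = np.sign(y - mu) * np.sqrt(d_i)

print(np.allclose(pearson, fit.resid_pearson))          # matches statsmodels
print(np.allclose(deviance_resid, fit.resid_deviance))  # matches statsmodels
```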
Checking for Overdispersion
Overdispersion occurs when the observed variance exceeds the model’s assumed variance. It’s most common in:
- Poisson regression: assumes \(\text{Var}(y) = \mu\)
- Binomial models: assumes \(\text{Var}(y) = n \cdot p(1 - p)\)
To test for overdispersion:
- Compute the Pearson dispersion statistic: \[\hat{\phi} = \frac{1}{n - p} \sum_{i=1}^n \left( \frac{y_i - \hat{\mu}_i}{\sqrt{\text{Var}(y_i)}} \right)^2\]If \(\hat{\phi}\) is substantially greater than 1, there is evidence of overdispersion.
- Refit the model using quasi-likelihood or Negative Binomial variants.
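A minimal check on synthetic, deliberately overdispersed counts (illustrative values throughout) shows the statistic flagging the problem:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
X = sm.add_constant(rng.normal(size=(1000, 1)))
mu = np.exp(X @ np.array([0.4, 0.6]))
y = rng.poisson(rng.gamma(shape=1.0, scale=mu))   # overdispersed counts (Poisson-Gamma)

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Pearson dispersion statistic: sum of squared Pearson residuals / (n - p)
n, p = X.shape
phi_hat = np.sum(fit.resid_pearson ** 2) / (n - p)
print(phi_hat)    # well above 1 here, signalling overdispersion relative to Poisson
```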
Goodness-of-Fit Tests
Some tests assess whether the model adequately describes the data:
- Hosmer-Lemeshow Test (for binary outcomes): Compares observed and expected frequencies across quantile bins.
- Deviance Test: Compares model deviance to degrees of freedom.
- Likelihood Ratio Test (LRT): Compares two nested models.
These are often supplemented by visual diagnostics such as:
- Residual plots
- Q-Q plots of deviance residuals
- Leverage and Cook’s distance for influence analysis
When to Use Which GLM?
Choosing the right Generalized Linear Model hinges on understanding the nature of your response variable and the distributional assumptions appropriate for it. Each GLM is specialized to handle a different kind of outcome behavior, and selecting the correct model is essential for valid inference and predictive performance.
The table below summarizes which GLM to use depending on the structure of your target variable:
Response Type | Distribution | Link Function | Recommended GLM |
---|---|---|---|
Binary (0/1) | Bernoulli | Logit / Probit | Logistic Regression / Probit Regression |
Multi-class (Categorical) | Multinomial | Generalized Logit (Softmax) | Multinomial (Softmax) Regression |
Count (Variance $$\approx$$ Mean) | Poisson | Log | Poisson Regression |
Count (Variance $$>$$ Mean) | Negative Binomial | Log | Negative Binomial Regression |
Continuous (Unbounded) | Gaussian (Normal) | Identity | Linear Regression (OLS) |
A Few Guiding Principles
- Use Logistic Regression when your outcome is binary and you’re interested in interpretability via odds ratios.
- Use Probit Regression if you have latent Gaussian motivations or are building a Bayesian model.
- Use Multinomial Regression for multi-class classification tasks, especially when class probabilities matter.
- Use Poisson Regression for non-negative count data where the variance is close to the mean.
- Use Negative Binomial Regression when you detect overdispersion in count data.
- Use Linear Regression for continuous numeric outcomes with approximately normal residuals.
Always check residuals, variance structure, and dispersion before settling on your model. If in doubt, fit multiple candidate models and compare using AIC, cross-validation, and diagnostic plots.
Closing Thoughts
Generalized Linear Models (GLMs) represent one of the most elegant and powerful frameworks in statistical modeling. By extending linear regression to accommodate a variety of response types—binary, count, categorical, and more—GLMs unify a wide spectrum of models under a common mathematical structure.
At their core, GLMs provide:
- A random component drawn from the exponential family
- A systematic component via linear predictors
- A link function connecting them
This modular design brings both interpretability and extensibility. It allows us to handle real-world data complexities—like skewness, discreteness, and heteroscedasticity—while preserving the conceptual clarity of linear models.
Importantly, GLMs serve as a bridge between classical statistics and modern machine learning:
- They form the foundation of regularized regression, generalized additive models (GAMs), and Bayesian regression
- Their probabilistic nature connects smoothly to likelihood-based inference, uncertainty quantification, and Bayesian priors
- Their optimization structure aligns with gradient-based methods used in deep learning and large-scale modeling
Understanding GLMs equips data scientists with a principled approach to choosing the right model for the task at hand—not just because it “works,” but because it fits the data-generating process in a meaningful way.
As we transition to more complex tasks like classification, structured outputs, or nonlinear modeling, the ideas introduced here—link functions, exponential families, and likelihood-based estimation—will remain foundational.
In future posts, we’ll explore how these ideas extend into the classification domain (Logistic, Softmax, and beyond), and how GLMs inspire the design of flexible nonlinear models in both Bayesian and ML contexts.