Generalized Linear Models (GLMs)

Abhishek
Mar 25, 2023

Using GLMs for regression analysis

Generalized Linear Models (GLMs) are a powerful family of statistical models that can handle a wide range of data types and response variables. GLMs are an extension of the linear regression model and are used when the response variable is not normally distributed or has a non-linear relationship with the predictor variables. In this article, we will discuss the basics of GLMs and provide sample code in R to help you get started with GLMs.

The Basics of GLMs

GLMs are a flexible and robust modeling framework that allows us to model a wide range of response variables. The key idea behind GLMs is to model the mean of the response variable as a function of a linear predictor through a link function. The link function connects the mean of the response to the linear predictor and ensures that the predicted mean stays within its valid range (for example, non-negative for counts or between 0 and 1 for probabilities).

The general form of a GLM can be expressed as:

g(µ) = η, or equivalently µ = g⁻¹(η)

where Y is the response variable with mean µ = E(Y), g is the link function, and η is the linear predictor. GLMs assume that Y has a probability distribution that is a member of the exponential family of distributions. The exponential family includes many common probability distributions, such as the normal, binomial, Poisson, and gamma distributions.

The linear predictor η is a linear combination of the predictor variables X, and is expressed as:

η = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ

where β₀ is the intercept, β₁, β₂, …, βₚ are the coefficients for the predictor variables, and X₁, X₂, …, Xₚ are the predictor variables.

The link function g relates the mean µ of the response variable Y to the linear predictor η, so that g(µ) = η and the fitted mean has the correct distributional properties.

The choice of link function depends on the distribution of the response variable. For example, the logit link function is used for binary response variables, the identity link function is used for continuous response variables, and the log link function is used for count response variables.

We will discuss some common link functions and their use cases; an R sketch after the list shows how each one is specified in glm().

  1. Identity Link Function:
    The identity link function is the simplest link function, which is used for linear regression models. The identity link function is appropriate when the response variable is continuous and normally distributed. The formula for the identity link function is:
    η = μ
    where η is the linear predictor and μ is the mean of the response variable. The identity link function does not transform the response variable, and the estimated coefficients in the GLM can be directly interpreted as the change in the response variable for a one-unit increase in the corresponding predictor variable.
  2. Logit Link Function:
    The logit link function is commonly used in GLMs for binary response variables (e.g., presence or absence of a disease, success or failure of an event). The logit link function transforms the response variable into the log odds of the event occurring. The formula for the logit link function is:
    η = log[μ / (1 − μ)]
    where η is the linear predictor and μ is the probability of the event occurring. The estimated coefficients in the GLM can be interpreted as the change in the log odds of the event occurring for a one-unit increase in the corresponding predictor variable.
  3. Probit Link Function:
    The probit link function is another commonly used link function for binary response variables. The probit link function transforms the response variable into the standard normal distribution. The formula for the probit link function is:
    η = Φ⁻¹(μ)
    where Φ⁻¹ is the inverse of the cumulative distribution function of the standard normal distribution, and μ is the probability of the event occurring. The estimated coefficients in the GLM can be interpreted as the change in the z-score of the standard normal distribution for a one-unit increase in the corresponding predictor variable.
  4. Log Link Function:
    The log link function is used for GLMs with response variables that have a positive skew, such as count data or proportion data. The log link function transforms the response variable into the natural logarithm of the mean. The formula for the log link function is:
    η = log(μ)
    where η is the linear predictor and μ is the mean of the response variable. Exponentiating an estimated coefficient gives the multiplicative change in the mean of the response variable for a one-unit increase in the corresponding predictor variable; for small coefficients this is approximately the percent change.
  5. Inverse Link Function:
    The inverse link function is used for GLMs with response variables that are continuous and have a positive skew. The inverse link function transforms the response variable into the reciprocal of the mean. The formula for the inverse link function is:
    η = 1 / μ
    where η is the linear predictor and μ is the mean of the response variable. The estimated coefficients in the GLM can be interpreted as the change in the reciprocal of the mean of the response variable for a one-unit increase in the corresponding predictor variable.
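
Each of these link functions corresponds to a family and link choice in R's glm() function. Below is a minimal sketch; "dat" and the variables y, x1, x2 are placeholders, not a real dataset.

# Identity link: continuous, approximately normal response (ordinary linear regression)
fit.identity <- glm(y ~ x1 + x2, data = dat, family = gaussian(link = "identity"))

# Logit link: binary response
fit.logit <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "logit"))

# Probit link: binary response
fit.probit <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "probit"))

# Log link: count response (Poisson regression)
fit.log <- glm(y ~ x1 + x2, data = dat, family = poisson(link = "log"))

# Inverse link: positive, right-skewed continuous response (gamma regression)
fit.inverse <- glm(y ~ x1 + x2, data = dat, family = Gamma(link = "inverse"))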

In summary, the choice of link function in GLMs depends on the distribution of the response variable and the research question of interest. Each link function transforms the response variable in a different way, and the estimated coefficients in the GLM can be interpreted differently depending on the link function used. It is important to select the appropriate link function for the research question and to interpret the estimated coefficients in the context of the link function used.

Assumptions of GLM

GLMs, like linear regression models, rest on a set of basic assumptions. Most of these assumptions carry over from linear regression, while others are relaxed or modified.

  • The observations should be independent and random (each response is an independent draw with a distribution from the same family).
  • The response variable y does not need to be normally distributed, but its distribution must come from the exponential family (e.g., binomial, Poisson, multinomial, normal).
  • The response variable need not have a linear relationship with the independent variables, but its transformed mean (through the link function) is linearly related to the independent variables.

Sample Code in R

Now let’s look at some sample code in R to fit GLMs. For this example, we will use the built-in “mtcars” dataset, which contains data on various characteristics of 32 automobiles, such as miles per gallon (mpg), horsepower (hp), and number of cylinders (cyl). Our goal is to fit a GLM to predict the binary variable “am”, which indicates whether the car has an automatic transmission (0) or a manual transmission (1) based on the other variables.

First, we will load the dataset and split it into training and testing sets:

library(dplyr)
library(caret)

# Load the mtcars dataset
data(mtcars)

# Split the data into training and testing sets
set.seed(123)
train.index <- createDataPartition(mtcars$am, p = 0.7, list = FALSE)
train.data <- mtcars[train.index, ]
test.data <- mtcars[-train.index, ]

Next, we will fit a GLM using the “glm” function in R. We will use the logit link function, since our response variable is binary. We will include all of the other variables in the model as predictors:

# Fit a GLM to predict the am variable
model.glm <- glm(am ~ ., data = train.data, family = binomial(link = "logit"))
summary(model.glm)
# Print the coefficients
coef(model.glm)

The output of the “summary” function provides information about the estimated coefficients, their standard errors, and their statistical significance.

The “coef” function returns a vector of the estimated coefficients.
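
Since the data were split earlier, the fitted model can also be checked on the held-out test set. The sketch below is a natural follow-up rather than part of the fit above; the 0.5 cut-off is an arbitrary choice.

# Predicted probabilities of a manual transmission for the held-out cars
pred.prob <- predict(model.glm, newdata = test.data, type = "response")

# Convert probabilities to class labels using a 0.5 cut-off
pred.class <- ifelse(pred.prob > 0.5, 1, 0)

# Compare predictions with the observed transmission type
table(Predicted = pred.class, Actual = test.data$am)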

Interpreting the Coefficients in GLMs

The interpretation of the coefficients in GLMs is closely related to the link function used in the model. The link function specifies the relationship between the linear predictor η (the sum of the products of the predictor variables and their corresponding coefficients) and the expected value of the response variable μ. In our previous example, we used the logit link function, which is defined as:

logit(μ) = η = β₀ + β₁x₁ + … + βₚxₚ

where logit(μ) is the natural logarithm of the odds of the response variable being a success (in our case, having a manual transmission), given the predictor variables. The coefficients β₁, β₂, …, βₚ represent the change in the log odds of the response variable for a one-unit increase in the corresponding predictor variable, while holding all other predictor variables constant.
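
This relationship can be checked directly on the fitted model: the linear predictor η is available from “predict” with type = "link", and applying the inverse logit to it reproduces the fitted probabilities. A small sketch, using the model.glm fitted above:

# Linear predictor eta = beta0 + beta1*x1 + ... for each training observation
eta <- predict(model.glm, type = "link")

# Applying the inverse logit (plogis) to eta recovers the fitted probabilities
prob.from.eta <- plogis(eta)
fitted.prob <- predict(model.glm, type = "response")

all.equal(as.numeric(prob.from.eta), as.numeric(fitted.prob))  # TRUE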

To interpret the coefficients in the GLM, we can exponentiate them to get the odds ratio. For example, if the coefficient for “hp” is 0.035, the odds ratio for “hp” is:

odds ratio for hp = exp(0.035) = 1.036

This means that for a one-unit increase in “hp”, the odds of having a manual transmission (am=1) increase by a factor of 1.036, while holding all other variables constant.

The sign of the coefficient also gives us information about the direction of the relationship between the predictor variable and the response variable. If the coefficient is positive, an increase in the predictor variable is associated with an increase in the log odds (and therefore the odds) of the response variable. If the coefficient is negative, an increase in the predictor variable is associated with a decrease in the log odds (and therefore the odds) of the response variable.

Calculating Coefficients in GLMs

The coefficients in GLMs are estimated using maximum likelihood estimation (MLE), which is a method for finding the values of the coefficients that maximize the likelihood of the observed data, given the model. The likelihood function is a measure of how likely the observed data are, given the values of the coefficients.

The likelihood function for a GLM is:

L(β) = ∏ᵢ f(yᵢ; xᵢ, β)

where β is the vector of coefficients, yᵢ is the observed response for the i-th observation, xᵢ is the vector of predictor variables for the i-th observation, and f(yᵢ; xᵢ, β) is the probability density (or mass) function of the response variable evaluated at yᵢ.

The goal of MLE is to find the values of the coefficients that maximize the likelihood function. This is typically done using numerical optimization algorithms, such as the Newton-Raphson method or the Fisher scoring algorithm. These algorithms iteratively update the values of the coefficients until convergence is achieved.
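
In R, “glm” performs this optimization internally using iteratively reweighted least squares (a form of Fisher scoring). To illustrate the idea, the binomial log-likelihood can also be maximized directly with a general-purpose optimizer. The sketch below uses only two predictors (hp and wt) to keep the example numerically stable; it is an illustration of MLE, not how “glm” is actually implemented.

# Design matrix (with intercept) and response for a small logistic model
X <- model.matrix(~ hp + wt, data = train.data)
y <- train.data$am

# Negative of the Bernoulli log-likelihood as a function of the coefficients
neg.log.lik <- function(beta) {
  eta <- X %*% beta        # linear predictor
  p <- plogis(eta)         # inverse logit gives probabilities
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

# Maximize the likelihood (minimize its negative) starting from zero coefficients
fit.mle <- optim(rep(0, ncol(X)), neg.log.lik, method = "BFGS")
fit.mle$par

# Should closely match glm() with the same predictors
coef(glm(am ~ hp + wt, data = train.data, family = binomial))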

To calculate the odds ratios for each coefficient, we can exponentiate the coefficients using the “exp” function:

# Calculate the odds ratios
exp(coef(model.glm))

This returns a vector of the odds ratios for each coefficient. For example, if the estimated coefficient for “hp” is 0.035, its odds ratio is exp(0.035) ≈ 1.036, which means that for a one-unit increase in “hp”, the odds of having a manual transmission increase by a factor of about 1.036, while holding all other variables constant.

We can also calculate confidence intervals for the coefficients using the “confint” function:

# Calculate the confidence intervals
confint(model.glm)

This returns a matrix of the lower and upper bounds for the 95% confidence intervals for each coefficient. The intervals can give us an idea of how certain we are about the estimated coefficients. If the intervals are narrow, it suggests that we are relatively confident about the estimated coefficients. On the other hand, if the intervals are wide, it suggests that we are less confident about the estimated coefficients.
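
Because the coefficients and their confidence limits are on the log-odds scale, exponentiating the interval endpoints gives confidence intervals for the odds ratios:

# 95% confidence intervals on the odds-ratio scale
exp(confint(model.glm))

# Odds ratios and their confidence intervals side by side
cbind(OR = exp(coef(model.glm)), exp(confint(model.glm)))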

Summary

GLMs are a powerful tool for modeling the relationship between a response variable and one or more predictor variables. The coefficients in GLMs represent the change in the log odds of the response variable for a one-unit increase in the corresponding predictor variable, while holding all other predictor variables constant. These coefficients can be interpreted using odds ratios, which can give us insights into the direction and strength of the relationship between the predictor variables and the response variable. The coefficients are estimated using maximum likelihood estimation, and we can calculate confidence intervals to assess the uncertainty associated with the estimated coefficients.
