# Generalized Linear Models (GLZ)

This page was last updated on 09/27/02

Generalized Linear Models (GLZ) are an extension of the linear modeling process that allows models to be fit to data that follow probability distributions other than the Normal distribution, such as the Poisson, Binomial, and Multinomial. Generalized Linear Models also relax the requirement of equality or constancy of variances that is required for hypothesis tests in traditional linear models.

## The General Linear Univariate Model (GLUM)

Most parametric statistical analyses can be viewed as a process of fitting a linear model to the observed data and testing hypotheses about the fitted model’s parameters. Even the lowly t-test is a form of the General Linear Univariate Model (GLUM). The Analysis of Variance (ANOVA), Regression, Multiple Regression, and the Analysis of Covariance (ANCOVA) are more complicated forms of the GLUM.

The least squares criterion is used to obtain estimates of the parameters of these GLUM models.  Additional assumptions must be met in order to test hypotheses about the model’s parameters. Besides the assumption of independence of the observations, which is required for all statistical analyses, hypothesis tests derived from GLUM’s require normality of the response variable and constancy or homogeneity of variances.
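The least squares fit at the heart of the GLUM can be sketched directly; a minimal illustration with hypothetical data (assumes numpy is available):

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 2 + 3*x,
# so the least squares estimates recover the coefficients exactly.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 5.0, 8.0, 11.0])

# Design matrix: a column of ones for the intercept, then the predictor.
X = np.column_stack([np.ones_like(x), x])

# Least squares estimates minimize sum((y - X @ beta)**2).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [2., 3.]
```

With noisy data the same call gives the usual least squares estimates; the hypothesis tests described above then require the additional normality and equal-variance assumptions.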

## The General Linear Multivariate Model (GLMM)

When attempting to explain variation in more than one response variable simultaneously, the modeling exercise is to fit the General Linear Multivariate Model (GLMM) to the data. Commonly used multivariate statistical procedures such as Multivariate Analysis of Variance (MANOVA), Multivariate Analysis of Covariance (MANCOVA), Discriminant Function Analysis (DFA), Canonical Correlation Analysis (CCA), and Principal Components Analysis (PCA) are all forms of the GLMM. To perform hypothesis tests in the context of the GLMM, one must assume that the response variables are multivariate normal and that the variance-covariance matrices are homogeneous.

When the distribution of the response variable(s) is not normal or multivariate normal, or if the variances or the variance-covariance matrices are not homogeneous, then application of hypothesis tests to GLUM’s or GLMM’s can lead to Type I and Type II error rates that differ from the nominal rates. Traditionally, transformations of the scale of the response variables have been applied to ensure that the assumptions required for hypothesis tests are met. For example, count data are often Poisson distributed and tend to be right skewed. Furthermore, the variance of a Poisson random variable is equal to the mean of the response. Hence, for count data a transformation must both normalize the data and eliminate the inherent variance heterogeneity. Commonly, count data are transformed to a logarithmic or square-root scale; however, such transformations are not always successful in achieving the desired end. In fact, there is no a priori reason to believe that a scale exists that will ensure that data meet the normality and variance homogeneity assumptions.
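The partial success of the square-root transform for counts can be checked by simulation; a sketch (assumes numpy; the means 4 and 64 are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Poisson samples at two different means: the variance tracks the mean.
low = rng.poisson(lam=4.0, size=100_000)
high = rng.poisson(lam=64.0, size=100_000)
print(low.var(), high.var())  # roughly 4 and 64

# After a square-root transform the variances are much closer
# (asymptotically both approach 0.25), but the transformed counts
# are still discrete and not exactly normal.
print(np.sqrt(low).var(), np.sqrt(high).var())
```

The transform largely removes the variance heterogeneity here, yet, as the text notes, no transform is guaranteed to deliver normality and constant variance simultaneously.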

## General-izing the Linear Model

The Generalized Linear Model is an extension of the General Linear Model to include response variables that follow any probability distribution in the exponential family of distributions. The exponential family includes such useful distributions as the Normal, Binomial, Poisson, Multinomial, Gamma, Negative Binomial, and others. Hypothesis tests applied to the Generalized Linear Model do not require normality of the response variable, nor do they require homogeneity of variances. Hence, Generalized Linear Models can be used when response variables follow distributions other than the Normal distribution, and when variances are not constant. For example, count data would be appropriately analyzed as a Poisson random variable within the context of the Generalized Linear Model.

Parameter estimates are obtained using the principle of maximum likelihood; therefore hypothesis tests are based on comparisons of likelihoods or the deviances of nested models.

## What puts the -ized in Generalized Linear Models

The common linear regression model (a form of the general linear model) specifies that the mean response µ is identical to a linear function η of the predictor variables xj:

$$\mu = \eta = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p \qquad (1)$$

and uses least squares as the criterion by which to estimate the unknown parameters β = (β0, β1, ..., βp)'. When observations are independent and normally distributed with constant variance σ², least squares estimation of β and σ² is equivalent to maximum likelihood estimation.

Generalized linear models encompass the general linear model and enlarge the class of linear least-squares models in two ways. First, the distribution of Y for fixed x is merely assumed to be from the exponential family of distributions, which includes important distributions such as the binomial, Poisson, exponential, and gamma distributions, in addition to the normal distribution. Second, the relationship between E(Y) = µ and η is specified by a link function η = g(µ), which is only required to be monotonic and differentiable.

The link function serves to link the random or stochastic component of the model, the probability distribution of the response variable, to the systematic component of the model (the linear predictor):

$$g(\mu) = \eta = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p, \qquad (2)$$

where g(µ) is the link function that links the random component, E(Y) = µ, to the systematic component η. For traditional linear models in which the random component consists of the assumption that the response variable follows the Normal distribution, the canonical link function is the identity link. The identity link specifies that the expected mean of the response variable is identical to the linear predictor, rather than to a non-linear function of the linear predictor. The canonical link functions for a variety of probability distributions are given below.

| Probability Distribution | Canonical Link Function |
| --- | --- |
| Normal | Identity |
| Binomial | Logit |
| Poisson | Log |
| Gamma | Reciprocal |

Although other link functions are possible, the canonical links are most often used.
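The canonical links in the table can be written out directly as functions together with their inverses; a sketch (assumes numpy):

```python
import numpy as np

# Canonical link g and inverse link for each distribution in the table.
links = {
    "normal":   (lambda mu: mu,                     # identity
                 lambda eta: eta),
    "binomial": (lambda mu: np.log(mu / (1 - mu)),  # logit
                 lambda eta: 1 / (1 + np.exp(-eta))),
    "poisson":  (lambda mu: np.log(mu),             # log
                 lambda eta: np.exp(eta)),
    "gamma":    (lambda mu: 1 / mu,                 # reciprocal
                 lambda eta: 1 / eta),
}

# Each link is monotonic and differentiable, so its inverse undoes it:
g, g_inv = links["binomial"]
print(g_inv(g(0.3)))  # 0.3, up to floating point error
```

The inverse link maps the unbounded linear predictor η back to the natural range of the mean, e.g. (0, 1) for the binomial and (0, ∞) for the Poisson.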

## Estimation and Testing

The parameters in a generalized linear model can be estimated by the maximum likelihood method. For a given probability distribution specified by f(yᵢ; β, φ) and observations y = (y₁, y₂, ..., yₙ)', the log-likelihood function for β and φ, expressed as a function of the mean values µ = (µ₁, ..., µₙ) of the responses {Y₁, Y₂, ..., Yₙ}, has the form

$$l(\mu, \phi; y) = \sum_{i=1}^{n} \log f(y_i; \beta, \phi).$$

The maximum likelihood estimates of the parameters β can be obtained by iteratively re-weighted least squares (IRLS). Detailed information about the iterative algorithm and asymptotic properties of the parameter estimates can be found in McCullagh and Nelder (1989).
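The IRLS iteration can be illustrated for the Poisson/log case; a minimal sketch, not the fitting code of any particular package, with hypothetical simulated data (assumes numpy):

```python
import numpy as np

def irls_poisson(X, y, n_iter=25):
    """Fit a Poisson GLM with log link by iteratively re-weighted least squares.

    Each iteration regresses the working response z = eta + (y - mu)/mu
    on X with weights W = mu, the standard IRLS update for the canonical
    log link.
    """
    # Start from a rough least squares fit on the log scale
    # (the 0.5 offset avoids log(0) for zero counts).
    beta = np.linalg.lstsq(X, np.log(y + 0.5), rcond=None)[0]
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu      # working response
        W = mu                       # IRLS weights for the log link
        XtW = X.T * W                # X'W
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

# Hypothetical counts generated with log(mu) = 0.5 + 0.8*x.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0, 200)
X = np.column_stack([np.ones_like(x), x])
y = rng.poisson(np.exp(0.5 + 0.8 * x))
print(irls_poisson(X, y))  # close to [0.5, 0.8]
```

At convergence with the canonical link and an intercept, the fitted means reproduce the total count exactly, one of the properties exploited in the deviance comparisons below.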

Analogous to the residual sum of squares in linear regression, the goodness-of-fit of a generalized linear model can be measured by the scaled deviance

$$D^* = 2\left[\, l(y; y) - l(\hat{\mu}; y) \,\right],$$

where l(y; y) is the maximum log-likelihood achievable for an exact fit in which the fitted values are equal to the observed values, and l(µ̂; y) is the log-likelihood function calculated at the estimated parameters β̂. The deviance function is very useful for comparing two models when one model has parameters that are a subset of the second model’s. The deviance is additive for such nested models if maximum likelihood estimates are used (McCullagh and Nelder 1989). Consider two nested models, the second having some covariates omitted, and denote the maximum likelihood estimates of the mean in the two models by µ̂₁ and µ̂₂, respectively. Then the deviance difference D₂* − D₁* is identical to the likelihood-ratio statistic and has an approximate χ² distribution with degrees of freedom equal to the difference between the numbers of parameters in the two models. For probability distributions in the exponential family the χ² approximation is usually quite accurate for differences of deviances even though it may be inaccurate for the deviances themselves (McCullagh and Nelder 1989).
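For a Poisson response the scaled deviance reduces to the closed form D = 2 Σ [yᵢ log(yᵢ/µ̂ᵢ) − (yᵢ − µ̂ᵢ)], with y log y taken as 0 when y = 0; a sketch (assumes numpy; the data are hypothetical):

```python
import numpy as np

def poisson_deviance(y, mu):
    """Scaled deviance for a Poisson fit: twice the gap between the
    saturated log-likelihood (fitted values equal to y) and the model's."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    term = np.zeros_like(y)
    pos = y > 0
    term[pos] = y[pos] * np.log(y[pos] / mu[pos])  # 0*log(0) taken as 0
    return 2.0 * np.sum(term - (y - mu))

# A perfect fit has zero deviance; a cruder fit, a larger one.
y = np.array([2.0, 0.0, 5.0, 3.0])
print(poisson_deviance(y, y + 1e-12))        # essentially 0
print(poisson_deviance(y, np.full(4, 2.5)))  # positive
```

Comparing two nested fits amounts to subtracting two such deviances and referring the difference to a χ² distribution, as described above.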

## Over-dispersion

If the sampling variance of a response variable Yi is significantly greater than that predicted by an expected probability distribution, Yi is said to be over-dispersed.

The covariance matrix of β̂ is estimated by Cov(β̂) = φ(X'WX)⁻¹, where X is the covariate matrix and W is the weight matrix used in the iterative algorithm. If over-dispersion occurs, ignoring it (i.e., setting φ = 1) will result in underestimating the standard errors of the parameter estimates, which may lead to incorrect conclusions. McCullagh and Nelder (1989) suggest modeling mean and dispersion jointly as a way to take possible over-dispersion into account. The detailed fitting procedure can be found in McCullagh and Nelder (1989).
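A simple diagnostic, distinct from the joint mean-dispersion modeling cited above, estimates φ by the Pearson statistic divided by the residual degrees of freedom; a sketch for Poisson-type counts, with the mean taken as known purely for illustration (assumes numpy):

```python
import numpy as np

def pearson_dispersion(y, mu, n_params):
    """Estimate the dispersion as X^2 / (n - p), where X^2 is the
    Pearson statistic sum((y - mu)^2 / V(mu)) and V(mu) = mu for Poisson."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    x2 = np.sum((y - mu) ** 2 / mu)
    return x2 / (len(y) - n_params)

rng = np.random.default_rng(2)
mu = np.full(500, 5.0)

# Well-behaved Poisson counts: variance equals the mean, estimate near 1.
y = rng.poisson(mu)
# Negative binomial counts with mean 5 but variance 10: over-dispersed.
over = rng.negative_binomial(5, 0.5, 500)

print(pearson_dispersion(y, mu, 1))     # near 1
print(pearson_dispersion(over, mu, 1))  # near 2
```

An estimate well above 1 is the usual warning that standard errors from a φ = 1 fit are too small.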

## Software

GLZ’s can be fit and evaluated using SPLUS, SAS, SPSS, and a number of other statistical packages. Of the major packages, SPLUS and SAS provide greater flexibility in fitting and evaluating GLZ’s.

## References

Agresti, A. 1996. An Introduction to Categorical Data Analysis. John Wiley & Sons: New York. (A very readable introduction to the many forms of the generalized linear model.)

McCullagh, P. and J.A. Nelder. 1989. Generalized Linear Models. Chapman and Hall: London. (The mathematical statistics of the generalized linear model.)

## Ecological Applications of Generalized Linear Models

Vincent, P.J. and J.M. Haworth. 1983. Poisson regression models of species abundance. Journal of Biogeography 10: 153-160.

Connor, E.F., E. Hosfield, D. Meeter, and X. Nui. 1997. Tests for aggregation and size-based sample-unit selection when sample units vary in size. Ecology 78: 1238-1249.

## Links to Other Websites

| Site | Description |
| --- | --- |
| The Generalized Linear Models Page | Introduction, bibliography, software, and other information on GLZ’s |
| Statsoft online textbook | Fairly comprehensive introduction to GLZ’s |
| GLMLAB | Using Matlab to fit GLZ’s |
| Introduction to GLM | Brief introduction to GLZ’s |