This page was last updated on 09/27/02
Generalized Linear Models (GLZ) are an extension of the linear modeling process that allows models to be fit to data that follow probability distributions other than the Normal distribution, such as the Poisson, Binomial, and Multinomial. Generalized Linear Models also relax the requirement of equality or constancy of variances that is required for hypothesis tests in traditional linear models.
The General Linear Univariate Model (GLUM)
Most parametric statistical analyses can be viewed as a process of fitting a linear model to the observed data and testing hypotheses about the fitted model's parameters. Even the lowly t test is a form of the General Linear Univariate Model (GLUM). The Analysis of Variance (ANOVA), Regression, Multiple Regression, and the Analysis of Covariance (ANCOVA) are more complicated forms of the GLUM.
The least squares criterion is used to obtain estimates of the parameters of these GLUM models. Additional assumptions must be met in order to test hypotheses about the model's parameters. Besides the assumption of independence of the observations, which is required for all statistical analyses, hypothesis tests derived from GLUMs require normality of the response variable and constancy or homogeneity of variances.
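The remark that the t test is a form of the GLUM can be checked numerically. The sketch below is a minimal illustration, assuming made-up measurements for two groups and the equal-variance (pooled) form of the t test; the function names are hypothetical. It computes the classical two-sample t statistic and the t statistic for the slope of a least-squares regression on a 0/1 group indicator.

```python
import math

def pooled_t(a, b):
    """Classical two-sample t statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((v - mean_a) ** 2 for v in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (mean_b - mean_a) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def slope_t(x, y):
    """t statistic for the slope of a least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    residual_var = rss / (n - 2)
    return b1 / math.sqrt(residual_var / sxx)

# Hypothetical measurements for two groups (not from the original page).
group_a = [4.1, 5.0, 3.8, 4.6, 4.4]
group_b = [5.9, 6.3, 5.1, 6.8, 6.0, 5.7]

# Code group membership as a 0/1 predictor and stack the responses.
x = [0] * len(group_a) + [1] * len(group_b)
y = group_a + group_b

t_classic = pooled_t(group_a, group_b)
t_regression = slope_t(x, y)
print(t_classic, t_regression)  # the two statistics agree up to rounding error
```

The agreement follows because, with a 0/1 predictor, the fitted slope is the difference between the group means and the residual variance is the pooled variance.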
The General Linear Multivariate Model (GLMM)
When attempting to explain variation in more than one response variable simultaneously, the modeling exercise is to fit the General Linear Multivariate Model (GLMM) to the data. Commonly used multivariate statistical procedures such as Multivariate Analysis of Variance (MANOVA), Multivariate Analysis of Covariance (MANCOVA), Discriminant Function Analysis (DFA), Canonical Correlation Analysis (CCA), and Principal Components Analysis (PCA) are all forms of the GLMM. To perform hypothesis tests in the context of the GLMM, one must assume that the response variables are multivariate normal and that the variance-covariance matrices are homogeneous.
When the distribution of the response variable(s) is not normal or multivariate normal, or if the variances or the variance-covariance matrices are not homogeneous, then application of hypothesis tests to GLUMs or GLMMs can lead to Type I and Type II error rates that differ from the nominal rates. Traditionally, transformations of the scale of the response variables have been applied to ensure that the assumptions required for hypothesis tests are met. For example, count data are often Poisson distributed and tend to be right skewed. Furthermore, the variance of a Poisson random variable is equal to its mean. Hence, for count data a transformation must both normalize the data and eliminate the inherent variance heterogeneity. Commonly, count data are transformed to a logarithmic scale or even a square-root scale; however, such transformations are not always successful in achieving the desired end. In fact, there is no a priori reason to believe that a scale exists that will ensure that data meet the normality and variance homogeneity assumptions.
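The variance behaviour described above can be illustrated without simulation. The following sketch, which assumes nothing beyond the Poisson probability function (the helper names and the chosen means are hypothetical), computes the variance of a Poisson count on the raw scale and on the square-root scale by summing over the probability mass function.

```python
import math

def poisson_pmf(k, mean):
    """Poisson probability P(Y = k), computed on the log scale for stability."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

def variance_on_scale(mean, transform):
    """Variance of transform(Y) for Y ~ Poisson(mean), by summing over the pmf."""
    # Truncation point; the probability beyond it is negligible for these means.
    upper = int(mean + 20 * math.sqrt(mean) + 50)
    first = second = 0.0
    for k in range(upper + 1):
        p = poisson_pmf(k, mean)
        t = transform(k)
        first += p * t
        second += p * t * t
    return second - first ** 2

for mean in (2.0, 10.0, 50.0):
    raw = variance_on_scale(mean, float)
    root = variance_on_scale(mean, math.sqrt)
    print(f"mean={mean:5.1f}  var(Y)={raw:8.4f}  var(sqrt(Y))={root:.4f}")
```

On the raw scale the variance tracks the mean. On the square-root scale it settles near 1/4 when the mean is large but is noticeably larger when the mean is small, which is one way a transformation can fall short of removing the variance heterogeneity.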
General-izing the Linear Model
The Generalized Linear Model is an extension of the General Linear Model to include response variables that follow any probability distribution in the exponential family of distributions. The exponential family includes such useful distributions as the Normal, Binomial, Poisson, Multinomial, Gamma, Negative Binomial, and others. Hypothesis tests applied to the Generalized Linear Model do not require normality of the response variable, nor do they require homogeneity of variances. Hence, Generalized Linear Models can be used when response variables follow distributions other than the Normal distribution, and when variances are not constant. For example, count data would be appropriately analyzed as a Poisson random variable within the context of the Generalized Linear Model.
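As a worked illustration of what membership in the exponential family means, the block below writes the Poisson distribution in the general exponential-family form. The symbols $\theta$, $a(\cdot)$, $b(\cdot)$, and $c(\cdot)$ are the conventional textbook notation and are an assumption here, not notation taken from this page; $\Phi$ denotes the dispersion parameter.

```latex
% General exponential-family density (conventional notation, assumed here):
f(y; \theta, \Phi) = \exp\left\{ \frac{y\,\theta - b(\theta)}{a(\Phi)} + c(y, \Phi) \right\}

% Poisson distribution with mean \mu:
f(y; \mu) = \frac{e^{-\mu}\,\mu^{y}}{y!}
          = \exp\left\{ y \log\mu - \mu - \log y! \right\}

% Matching terms:
\theta = \log\mu, \qquad b(\theta) = e^{\theta}, \qquad a(\Phi) = 1, \qquad c(y, \Phi) = -\log y!

% Mean and variance follow from b(\theta):
E(Y) = b'(\theta) = \mu, \qquad \mathrm{Var}(Y) = b''(\theta)\,a(\Phi) = \mu
```

The last line recovers the equality of mean and variance for Poisson counts, so the variance heterogeneity of count data is built into the model rather than removed by a transformation.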
Parameter estimates are obtained using the principle of maximum likelihood; therefore hypothesis tests are based on comparisons of likelihoods or the deviances of nested models.
What puts the -ized in Generalized Linear Models
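The common linear regression model (a form of the general linear model) specifies that the mean response $\mu$ is identical to a linear function $\eta$ of the predictor variables $x_j$:

$$\mu = \eta = \beta_0 + \sum_{j=1}^{p} \beta_j x_j \qquad (1)$$

and uses least squares as the criterion by which to estimate the unknown parameters $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$. When observations are independent and normally distributed with constant variance $\sigma^2$, least squares estimation of $\beta$ is equivalent to maximum likelihood estimation. (The maximum likelihood estimate of $\sigma^2$ divides the residual sum of squares by $n$ rather than by the residual degrees of freedom.)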
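The equivalence between least squares and maximum likelihood under normality can be seen directly from the log-likelihood. The derivation below is a short sketch; the double-subscript notation $x_{ij}$ (predictor $j$ for observation $i$) is an assumption introduced for the illustration.

```latex
% Log-likelihood for n independent Normal observations with constant variance:
l(\beta, \sigma^2; y) = -\frac{n}{2}\log\left(2\pi\sigma^2\right)
  - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2

% For any fixed \sigma^2, only the second term depends on \beta, so
\hat{\beta}_{ML} = \arg\max_{\beta}\, l(\beta, \sigma^2; y)
  = \arg\min_{\beta}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2
  = \hat{\beta}_{LS}
```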
Generalized linear models encompass the general linear model and enlarge the class of linear least-squares models in two ways. First, the distribution of $Y$ for fixed $x$ is merely assumed to be from the exponential family of distributions, which includes important distributions such as the binomial, Poisson, exponential, and gamma distributions, in addition to the normal distribution. Second, the relationship between $E(Y) = \mu$ and $\eta$ is specified by a link function $\eta = g(\mu)$, possibly non-linear, which is only required to be monotonic and differentiable.
The link function serves to link the random or stochastic component of the model, the probability distribution of the response variable, to the systematic component of the model (the linear predictor):
$$g(\mu) = \eta = \beta_0 + \sum_{j=1}^{p} \beta_j x_j \qquad (2)$$

where $g(\mu)$ is a link function that links the random component, $E(Y)$, to the systematic component $\eta$. For traditional linear models in which the random component consists of the assumption that the response variable follows the Normal distribution, the canonical link function is the identity link. The identity link specifies that the expected value of the response variable is identical to the linear predictor, rather than to a non-linear function of the linear predictor. The canonical link functions for a variety of probability distributions are given below.
Probability Distribution | Canonical Link Function
Normal | Identity
Binomial | Logit
Poisson | Log
Gamma | Reciprocal
Although other link functions are possible, the canonical links are most often used.
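The canonical links listed above can be written as small functions together with their inverses. This is a minimal sketch: the dictionary name, the helper function, and the coefficient values are hypothetical and chosen only for illustration.

```python
import math

# Each entry maps a distribution to (link g, inverse link g^{-1}).
CANONICAL_LINKS = {
    "normal":   (lambda mu: mu,                      lambda eta: eta),
    "binomial": (lambda mu: math.log(mu / (1 - mu)), lambda eta: 1 / (1 + math.exp(-eta))),
    "poisson":  (lambda mu: math.log(mu),            lambda eta: math.exp(eta)),
    "gamma":    (lambda mu: 1 / mu,                  lambda eta: 1 / eta),
}

def linear_predictor(beta, x):
    """eta = beta[0] + sum_j beta[j] * x[j-1], the systematic component."""
    return beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))

beta = [0.2, 0.5]   # hypothetical coefficients (intercept, slope)
x = [1.5]           # hypothetical predictor value
eta = linear_predictor(beta, x)

for name, (link, inverse) in CANONICAL_LINKS.items():
    mu = inverse(eta)  # mean implied by the linear predictor
    print(f"{name:8s} eta={eta:.3f} mu={mu:.4f} g(mu)={link(mu):.3f}")
```

Applying the inverse link to the linear predictor gives the fitted mean on the scale of the response, and applying the link to that mean returns the linear predictor. The same linear predictor therefore implies a different mean under each distribution.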
The parameters in a generalized linear model can be estimated by the maximum likelihood method. For a given probability distribution specified by $f(y_i; \beta, \Phi)$ and observations $y = (y_1, y_2, \ldots, y_n)'$, the log-likelihood function for $\beta$ and $\Phi$, expressed as a function of mean values $\mu = (\mu_1, \ldots, \mu_n)'$ of the responses $\{Y_1, Y_2, \ldots, Y_n\}$, has the form

$$l(\mu; y) = \sum_{i=1}^{n} \log f(y_i; \beta, \Phi).$$
The maximum likelihood estimates of the parameters $\beta$ can be obtained by iteratively reweighted least squares (IRLS). Detailed information about the iterative algorithm and asymptotic properties of the parameter estimates can be found in McCullagh and Nelder (1989).
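To make the iterative algorithm concrete, here is a minimal sketch of IRLS for one specific case: a Poisson response, the log link, and a single predictor. The function name, starting values, convergence rule, and data are assumptions made for illustration; this is not the general algorithm as presented by McCullagh and Nelder, and production software handles many details omitted here.

```python
import math

def poisson_irls(x, y, max_iter=50, tol=1e-10):
    """Fit log(mu_i) = b0 + b1 * x_i by iteratively reweighted least squares."""
    # Start from the data themselves, shifted so the log is defined at zero counts.
    mu = [yi + 0.5 for yi in y]
    eta = [math.log(m) for m in mu]
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Working response and weights for the log link with Poisson variance.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        # Weighted least squares of z on (1, x): solve the 2x2 normal equations.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        new_b0 = (swxx * swz - swx * swxz) / det
        new_b1 = (sw * swxz - swx * swz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        # Update the linear predictor and fitted means for the next pass.
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        if converged:
            break
    return b0, b1

# Hypothetical counts observed at six values of a predictor.
x_obs = [0, 1, 2, 3, 4, 5]
y_obs = [1, 3, 2, 6, 8, 13]
b0_hat, b1_hat = poisson_irls(x_obs, y_obs)
print(b0_hat, b1_hat)
```

Each pass is an ordinary weighted least-squares fit. What makes the procedure iterative is that the weights and the working response depend on the current fitted means, so they are recomputed until the coefficients stop changing.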
Analogous to the residual sum of squares in linear regression, the goodness-of-fit of a generalized linear model can be measured by the scaled deviance

$$D^*(y; \hat{\mu}) = 2\,[\,l(y; y) - l(\hat{\mu}; y)\,],$$

where $l(y; y)$ is the maximum log-likelihood achievable for an exact fit in which the fitted values are equal to the observed values, and $l(\hat{\mu}; y)$ is the log-likelihood function calculated at the estimated parameters $\hat{\beta}$. The deviance function is very useful for comparing two models when one model has parameters that are a subset of the second model. The deviance is additive for such nested models if maximum likelihood estimates are used (McCullagh and Nelder 1989). Consider two nested models with the second having some covariates omitted and denote the maximum likelihood estimates in the two models by $\hat{\mu}_1$ and $\hat{\mu}_2$, respectively. Then the deviance difference

$$D^*(y; \hat{\mu}_2) - D^*(y; \hat{\mu}_1) = 2\,[\,l(\hat{\mu}_1; y) - l(\hat{\mu}_2; y)\,]$$

is identical to the likelihood-ratio statistic and has an approximate $\chi^2$ distribution with degrees of freedom equal to the difference between the numbers of parameters in the two models. For probability distributions in the exponential family the $\chi^2$ approximation is usually quite accurate for differences of deviance even though it may be inaccurate for the deviances themselves (McCullagh and Nelder 1989).
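The comparison of nested models by their deviance difference can be sketched for a case in which the maximum likelihood fits have closed forms: Poisson counts from two groups, with and without a group effect. The data and variable names below are hypothetical, and the dispersion is taken to be 1, as it is for the Poisson distribution.

```python
import math

def poisson_deviance(y, mu):
    """Scaled deviance 2 * [l(y; y) - l(mu; y)] for Poisson data (dispersion 1)."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0  # y*log(y) -> 0 as y -> 0
        total += log_term - (yi - mi)
    return 2.0 * total

# Hypothetical counts from two sites.
site_a = [2, 3, 1, 4, 2]
site_b = [6, 5, 8, 7, 9]
y = site_a + site_b

# Model 1: a separate mean per site (the ML estimate is each site's sample mean).
mean_a = sum(site_a) / len(site_a)
mean_b = sum(site_b) / len(site_b)
mu_full = [mean_a] * len(site_a) + [mean_b] * len(site_b)

# Model 2: the site covariate omitted (the ML estimate is the overall mean).
mu_reduced = [sum(y) / len(y)] * len(y)

# Deviance difference = likelihood-ratio statistic, here with 1 degree of freedom.
lr_statistic = poisson_deviance(y, mu_reduced) - poisson_deviance(y, mu_full)

# With 1 degree of freedom the chi-square upper tail has the closed form
# P(X > d) = erfc(sqrt(d / 2)).
p_value = math.erfc(math.sqrt(lr_statistic / 2))
print(lr_statistic, p_value)
```

The closed-form tail probability is specific to one degree of freedom. Comparisons involving more parameters would need a general chi-square distribution function from a statistics library.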
Over-dispersion
If the sampling variance of a response variable $Y_i$ is significantly greater than that predicted by the assumed probability distribution, $Y_i$ is said to be over-dispersed.

The covariance matrix of $\hat{\beta}$ is estimated by

$$\mathrm{COV}(\hat{\beta}) = \Phi\,(X'WX)^{-1},$$

where $X$ is the covariate matrix, $W$ is a weight matrix used in the iterative algorithm, and $\Phi$ is the dispersion parameter. If over-dispersion occurs, ignoring it (i.e., setting $\Phi = 1$) will result in underestimating the standard errors of the parameter estimates, which may lead to incorrect conclusions. McCullagh and Nelder (1989) suggest modeling mean and dispersion jointly as a way to take possible over-dispersion into account, and they describe the detailed fitting procedure.
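The effect of the dispersion parameter on standard errors can be illustrated with a simpler device than joint modeling of mean and dispersion. The sketch below uses a different technique, a moment estimate of the dispersion (Pearson chi-square divided by residual degrees of freedom), applied to an intercept-only Poisson model with log link. The counts are hypothetical and the variable names are assumptions.

```python
import math

# Hypothetical counts that are far more variable than a Poisson model predicts.
y = [0, 1, 0, 12, 3, 0, 9, 1, 0, 14]
n = len(y)

# Intercept-only Poisson model with log link: the fitted mean is the sample mean.
mu_hat = sum(y) / n
n_params = 1

# Pearson chi-square divided by residual degrees of freedom estimates the dispersion.
pearson_chi2 = sum((yi - mu_hat) ** 2 / mu_hat for yi in y)
phi_hat = pearson_chi2 / (n - n_params)

# For this model X is a column of ones and W = diag(mu_hat), so X'WX = n * mu_hat.
xtwx = n * mu_hat
se_naive = math.sqrt(1.0 / xtwx)         # dispersion fixed at 1
se_adjusted = math.sqrt(phi_hat / xtwx)  # dispersion estimated from the data
print(phi_hat, se_naive, se_adjusted)
```

An estimated dispersion well above 1 indicates over-dispersion. The adjusted standard error of the intercept is larger than the naive one by a factor equal to the square root of the estimated dispersion, which is the underestimation the text warns about.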
GLZs can be fit and evaluated using SPLUS, SAS, SPSS, and a number of other statistical packages. Of the major packages, SPLUS and SAS provide greater flexibility in fitting and evaluating GLZs.
References
Agresti, A. 1996. An Introduction to Categorical Data Analysis. John Wiley & Sons: New York. (A very readable introduction to the many forms of the generalized linear model.)
McCullagh, P. and J.A. Nelder. 1989. Generalized Linear Models. Chapman and Hall: London. (The mathematical statistics of the generalized linear model.)
Ecological Applications of Generalized Linear Models
Vincent, P.J. and J.M. Haworth. 1983. Poisson regression models of species abundance. Journal of Biogeography 10: 153-160.
Connor, E.F., E. Hosfield, D. Meeter, and X. Niu. 1997. Tests for aggregation and size-based sample-unit selection when sample units vary in size. Ecology 78: 1238-1249.
Links to Other Websites
Site | Description
 | Introduction, bibliography, software, and other information on GLZs
 | Fairly comprehensive introduction to GLZs
 | Using Matlab to fit GLZs
 | Brief introduction to GLZs