Discriminant Function Analysis (DA)
Julia Barfield, John Poulsen, and Aaron French
Key words: assumptions, further reading, computations, validation of
functions, interpretation, classification, links
Introduction
Discriminant function analysis is used to discriminate between two or more naturally occurring
groups based on a suite of continuous or discriminating variables. It consists
of two closely related procedures that allow researchers to discover
underlying, dominant gradients of variation among groups of sample entities
(such as species, individuals, sites, or any naturally occurring group) from a
set of multivariate observations. The goal is to elucidate how variation among
groups is maximized and variation within groups is minimized along a gradient
(McGarigal 2000). Discriminant function analysis can be used both to explain
differences among groups and to predict group membership for
sampling entities of unknown membership.
Discriminant analysis is used by researchers in a wide variety of settings and fields,
including biological and medical sciences, education, psychology, finance,
engineering, and political science. It also goes by many names, sometimes causing
confusion. Terms such as canonical analysis of discriminance, multiple
discriminant analysis, canonical variates analysis, and Fisher’s linear
discriminant function analysis all refer to the same statistical technique,
which is designed to predict membership
in naturally occurring groups from a set of predictor variables.
Discriminant Function Analysis in Biology
In biology,
discriminant function analysis is used in many kinds of research.
For example, a researcher may want to investigate which variables
discriminate between fruits eaten by (1) primates, (2) birds, or (3) squirrels.
For that purpose, the researcher could collect data on numerous fruit
characteristics of those species eaten by each of the animal groups. Most
fruits will naturally fall into one of the three categories. Discriminant
analysis could then be used to determine which variables are the best predictors
of whether a fruit will be eaten by birds, primates, or squirrels.
Discriminant analysis is used very often in studies
on acoustic and visual communication within species. It is also used for
taxonomic classification, morphometric analyses to identify species or determine sex, and
studies of species distribution. If a researcher recovers bird bands, for instance, DFA
can be used to assign individuals of a given species and sex to their subpopulations, which can help
to answer questions about dispersal and seasonal distribution of animals.
Discriminant analysis can
also be used to test for niche separation by sympatric species or for the
presence or absence of a species. One approach would be to measure ecological
variables that could influence the occurrence of the species in question at sampling
locations. Data obtained by random sampling of ecological variables at a study
site could be used to identify which environmental factors are related to the
presence/absence of a particular species at a series of sites. The goal would
be to discover which variables best explain differences in distribution
patterns of species. If a suite of discriminating ecological variables has been
identified and consistent, non-random patterns appear in the relation of
ecological gradients to species distribution, then it should be possible to
define gradients that best characterize used and unused sites (McGarigal 2000).
Discriminant Analysis vs. MANOVA
To
understand discriminant analysis, it is important to be familiar with another multivariate
technique, multivariate analysis of variance (MANOVA). A MANOVA is simply a
multivariate version of an analysis of variance (ANOVA) with several dependent
variables. The main objective in using MANOVA is to determine if response
variables are altered by the observer’s manipulation of the independent
variables. The researcher wants to know if group memberships are associated
with reliable mean differences on a combination of dependent variables. The
null hypothesis would be that mean vectors of dependent variables (DVs) are
equal.
Discriminant
function analysis is MANOVA turned around. In MANOVA, the independent variables
are the groups and the dependent variables are the predictors. In DA, the
independent variables are the predictors and the dependent variables are the
groups. (In order to avoid semantic confusion, it’s easier to refer to
independent variables as the predictors -- or discriminating variables -- and
to dependent variables as the grouping
variables.)
The emphases of MANOVA and DA are different. While MANOVA seeks to find a linear
combination of variables that will maximize the test statistic, DA is used to
establish the linear combination of predictor variables that maximally
discriminates among groups. As previously stated, DA is used to predict
membership in naturally occurring groups and to determine if a combination of
variables can reliably predict group membership. Several variables are included
in a study to see which ones contribute most to the discrimination between
groups.
Logistic Regression vs. Discriminant Analysis
Logistic regression answers the same questions as discriminant analysis. It is often
preferred to discriminant analysis because it is more flexible in its assumptions
and in the types of data that can be analyzed. The key difference is that the
predictor or independent variables must be continuous in DA. Multiple
logistic regression, however, can handle both categorical and continuous
predictor/independent variables, and the predictors do not have to be normally
distributed, linearly related, or of equal variance within each group
(Tabachnick and Fidell 1996).
Discriminant Function Analysis Procedure
Discriminant function analysis is broken into a 2-step process:
(1) Discrimination: Significance testing of a set of discriminant functions. This
is sometimes referred to as interpretation.
(2) Classification: Classifying entities into groups using a classification
criterion that maximizes correct classification and emphasizes differences
among groups of sampling entities.
Linear Discriminant Functions
A
statistical method for measuring the differences among groups is needed. The
first step is to derive linear combinations of the discriminating variables,
also known as canonical discriminant functions. The discriminant functions are generated from a sample of individuals
(or cases), for which group membership is known. The functions can then be
applied to new cases with measurements on the same set of variables, where
group membership is not known.
These
canonical linear functions are like regression equations where each variable is
weighted by a coefficient. In discriminant analysis, each variable is weighted
according to its ability to discriminate among groups. Each sampling entity has
a single composite score derived by multiplying the sample values for each
variable by a corresponding weight and adding these products together. By
averaging the canonical scores for all the entities within a particular group,
we arrive at the group mean canonical score. The group mean is called a centroid
because it is the composite of several variables.
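As a minimal sketch of how composite scores and centroids are formed, the Python snippet below applies one set of weights to a small toy data set. The data, the group labels, and the weights are all invented for illustration; in a real analysis the weights would be estimated from the data rather than chosen by hand.

```python
import numpy as np

# Hypothetical measurements: six entities (rows), two variables (columns).
X = np.array([[2.0, 3.0], [3.0, 4.0], [4.0, 5.5],
              [6.0, 2.0], [7.0, 3.5], [8.0, 3.0]])
groups = np.array(["a", "a", "a", "b", "b", "b"])

# Hypothetical weights; a real analysis would estimate these from the data.
weights = np.array([0.8, -0.5])

# Composite score: weight each grand-mean-centered variable and add the products.
scores = (X - X.mean(axis=0)) @ weights

# Group centroid: the mean canonical score of the entities in that group.
centroids = {str(g): float(scores[groups == g].mean()) for g in np.unique(groups)}
print(centroids)
```

Centering on the grand mean is a convenience assumed here so that the scores average to zero over all entities; it does not change the distances between centroids.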
Deriving Discriminant Functions
It’s important to have some familiarity with matrix algebra to
understand how discriminant functions and coefficients are obtained.
Statistical computer programs have removed the need to perform these
calculations by hand, but use of them will be greatly enhanced by understanding
the concepts behind the procedure. Multivariate statistics textbooks generally
provide an appendix or chapter on matrix algebra that covers the basics, but a
good algebra textbook will also do the trick.
To start the process, the data for a discriminant analysis are
arranged in a matrix that takes the form of an N x P data set with G groups,
where N is the number of sampling entities (rows) and P is the number of
variables (columns), X1, X2 … Xp, measured on each
sample member. The G groups are random samples with sizes n1,
n2 … nG.
(Note that notation varies across reference sources. It is
important to be mindful of this when consulting several sources at once.
Assuming that all texts use the same notation will inevitably lead to confusion
and frustration.)
To derive the canonical discriminant functions, take a linear
combination of the X variables where each variable is weighted according to its
ability to discriminate among a priori
defined groups. Linear discriminant functions take the following mathematical
form:
$$ D = a_1 X_1 + a_2 X_2 + \dots + a_p X_p $$
There are Q possible canonical discriminant functions, where Q equals
G – 1 or P, whichever is smaller, corresponding to the dimensions of the data
set:
$$ D_1 = a_{11} X_1 + a_{12} X_2 + \dots + a_{1p} X_p $$
$$ D_2 = a_{21} X_1 + a_{22} X_2 + \dots + a_{2p} X_p $$
$$ D_i = a_{i1} X_1 + a_{i2} X_2 + \dots + a_{ip} X_p $$
The weights or coefficients are determined with the same fundamental
equations that MANOVA uses to obtain sums-of-squares-and-cross-products matrices. This is like determining sums of squares in ANOVA. In
the multivariate case, group means and standard deviations are not sufficient
to show interrelations among the variables, so it is necessary to use a matrix
of total sums-of-squares-and-cross-products (SSCP). Variance in the predicting
variables comes from two sources, within-groups differences and among-groups
differences. Using MANOVA computation, a matrix of
total variances and covariances is obtained, as well as a matrix of pooled
within-group variances and covariances.
First,
create cross-products matrices for between-group differences and within-group
differences:
$$ SS_{total} = SS_{bg} + SS_{wg} $$
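The partition of the total cross-products matrix can be checked numerically. The sketch below is one way to compute the three SSCP matrices with NumPy; the function name `sscp_matrices` and the toy data are assumptions made for illustration, not part of any particular statistics package.

```python
import numpy as np

def sscp_matrices(X, groups):
    """Return the total, within-group, and between-group SSCP matrices."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    grand_mean = X.mean(axis=0)
    centered = X - grand_mean
    total = centered.T @ centered            # deviations from the grand mean
    p = X.shape[1]
    within = np.zeros((p, p))
    between = np.zeros((p, p))
    for g in np.unique(groups):
        Xg = X[groups == g]
        mean_g = Xg.mean(axis=0)
        dev = Xg - mean_g                    # deviations from the group mean
        within += dev.T @ dev
        d = (mean_g - grand_mean).reshape(-1, 1)
        between += len(Xg) * (d @ d.T)       # group mean vs. grand mean, weighted by n
    return total, within, between

# Hypothetical data: two groups of three entities, two variables each.
X = [[2.0, 3.0], [3.0, 4.0], [4.0, 5.5], [6.0, 2.0], [7.0, 3.5], [8.0, 3.0]]
groups = ["a", "a", "a", "b", "b", "b"]
T, W, B = sscp_matrices(X, groups)
print(np.allclose(T, W + B))
```

The diagonal of each matrix holds ordinary sums of squares, and the off-diagonal entries hold the cross-products that carry the interrelations among variables.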
The determinants
are calculated for each cross-products matrix and used to calculate a test
statistic – either Wilks’ Lambda or Pillai’s Trace.
Wilks’ Lambda follows the equation:
$$ \Lambda = \frac{|SS_{wg}|}{|SS_{bg} + SS_{wg}|} $$
Next an
F ratio is calculated as in MANOVA:

For
cases where n is equal in all groups:
![]()
For unequal n between
groups, this is modified only by changing the error degrees of freedom (df error) to equal the
number of data points in all groups minus the number of groups (N – k). If the
computed F exceeds the critical F, then the groups
can be distinguished based on the combination of predictor variables.
Wilks' lambda is
used to generate an F test of mean
differences in DA, such that the smaller the lambda for an independent
variable, the more that variable contributes to the discriminant function.
Lambda varies from 0 to 1, with 0 meaning group means differ (thus the more the
variable differentiates the groups), and 1 meaning all group means are the
same. The F test shows which of the
discriminant functions contributes significantly to distinguishing among
groups.
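To make the test statistic concrete, the following sketch computes Wilks' lambda from a within-groups and an among-groups SSCP matrix and converts it to an approximate F. The conversion used here is Rao's approximation, which is one commonly cited form; it is an assumption that this matches the formula intended in the text above. The function names and the two small matrices are hypothetical.

```python
import math
import numpy as np

def wilks_lambda(W, B):
    """Ratio of the within-groups determinant to the total determinant."""
    return float(np.linalg.det(W) / np.linalg.det(W + B))

def rao_f(lam, p, n_groups, n_total):
    """Approximate F for Wilks' lambda (Rao's approximation, assumed here)."""
    q = n_groups - 1                      # df for the group effect
    df_error = n_total - n_groups         # N - k, as for unequal group sizes
    denom = p ** 2 + q ** 2 - 5
    # s is taken as 1 when the usual expression is undefined (e.g. p=2, q=1).
    s = math.sqrt((p ** 2 * q ** 2 - 4) / denom) if denom > 0 else 1.0
    y = lam ** (1.0 / s)
    df1 = p * q
    df2 = s * (df_error - (p - q + 1) / 2) - (p * q - 2) / 2
    return ((1 - y) / y) * (df2 / df1), df1, df2

# Hypothetical SSCP matrices for p = 2 variables, 2 groups, N = 6 entities.
W = np.array([[4.0, 1.0], [1.0, 3.0]])
B = np.array([[8.0, -4.0], [-4.0, 2.0]])
lam = wilks_lambda(W, B)
F, df1, df2 = rao_f(lam, p=2, n_groups=2, n_total=6)
print(lam, F, df1, df2)
```

The resulting F would be compared with a critical value on df1 and df2 degrees of freedom; looking up that critical value is left out of the sketch.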
One first performs the
multivariate test and, if statistically significant, proceeds to see which of
the variables have significantly different means across the groups. When an
overall relationship is found, the next step is to examine the discriminant
functions that compose the overall relationship and determine in how many dimensions groups differ reliably. The number of discriminant functions used in the analysis,
as previously mentioned, can be no greater than the number of groups minus one
(the degrees of freedom) or the number of predictor variables in the analysis,
whichever is smaller.
Eigenanalysis
Computationally,
discriminant analysis is essentially an eigenanalysis (or canonical correlation
analysis) problem to determine the successive functions and coefficients. The
discriminant functions each have an eigenvalue associated with them, which
represents the extent of group differentiation along the dimension specified by
the canonical function. The eigenvalues are determined by solving the following
equation:
$$ |A - \lambda W| = 0 $$
where λ is a constant,
called the eigenvalue, W is the within-groups SSCP matrix, and A is the
among-groups SSCP matrix.
An eigenanalysis also
produces an eigenvector that is associated with each eigenvalue. Eigenvectors are
determined by solving this equation:
$$ (A - \lambda_i W) v_i = 0 $$
where λi is
the eigenvalue corresponding to the ith
canonical function, vi is the eigenvector associated with the ith
eigenvalue, W is the within-groups SSCP matrix, and A is the among-groups SSCP
matrix. Each solution, which yields its own λ and the set of v’s,
corresponds to one discriminant function. However, v’s cannot be
interpreted as coefficients, since the solution does not have a logical
constraint on the origin or the metric units used for the discriminant space.
The coefficients v’s can be transformed into u’s of the
discriminant function as follows:
![]()
This adjustment makes the raw coefficients useful as
weighted coefficients that can be used to compare the relative importance of
the independent variables, much as beta weights are used in regression. The
number of coefficients in an eigenvector equals the number of variables in the
linear equations that define the discriminant or canonical functions; these coefficients are referred
to as canonical coefficients or weights.
The discriminant function with the largest λ
value is the most powerful discriminator, while the function with the smallest
λ value is the weakest. An eigenvalue close to 0 indicates the associated
discriminant function has minimal discriminating power. λ = 0 implies no
difference between the groups.
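The eigenanalysis can be sketched in a few lines of NumPy. The version below turns the generalized problem into an ordinary symmetric one through a Cholesky factor of W, which is a numerical convenience chosen here and not necessarily how any given package proceeds. The rescaling of the eigenvectors by the square root of N minus the number of groups, so that the pooled within-group variance of the scores equals one, is one common convention and is an assumption of this sketch. The function name and the matrices are hypothetical.

```python
import numpy as np

def canonical_functions(W, A, n_total, n_groups):
    """Solve (A - lambda W) v = 0 and return eigenvalues, raw v, and rescaled u."""
    W = np.asarray(W, dtype=float)
    A = np.asarray(A, dtype=float)
    L = np.linalg.cholesky(W)              # W = L L'
    L_inv = np.linalg.inv(L)
    M = L_inv @ A @ L_inv.T                # symmetric matrix with the same eigenvalues
    eigvals, Z = np.linalg.eigh(M)         # eigh returns eigenvalues in ascending order
    V = L_inv.T @ Z                        # back-transform; columns satisfy v' W v = 1
    q = min(n_groups - 1, W.shape[0])      # number of discriminant functions
    keep = np.argsort(eigvals)[::-1][:q]   # largest eigenvalue first
    eigvals, V = eigvals[keep], V[:, keep]
    U = V * np.sqrt(n_total - n_groups)    # assumed scaling: within-group variance of 1
    return eigvals, V, U

# Hypothetical within-groups (W) and among-groups (A) SSCP matrices.
W = np.array([[4.0, 1.0], [1.0, 3.0]])
A = np.array([[8.0, -4.0], [-4.0, 2.0]])
eigvals, V, U = canonical_functions(W, A, n_total=6, n_groups=2)
print(eigvals)
print(U)
```

With two groups there is a single discriminant function, so only one eigenvalue is returned even though there are two variables.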
Interpreting Discriminant Functions
The discriminant function score for the ith function is:
$$ D_i = d_{i1} Z_1 + d_{i2} Z_2 + \dots + d_{ip} Z_p $$
where Z = the standardized score on each predictor, and di = the standardized discriminant function
coefficient. The discriminant function score for a case can also be produced with
raw scores and unstandardized discriminant function coefficients. The discriminant
function coefficients are, by definition, chosen to maximize differences
between groups. Over all cases, the mean of the scores on each discriminant function is
zero, with a SD equal to one.
The discriminant score, also called the DA score, is the value
resulting from applying a discriminant function formula to the data for a given
case.
Discriminant
function analysis will find some optimal combination of variables so that the
first eigenvalue or discriminant
function is the largest, and thus provides the most
overall discrimination between groups. The second discriminant function
is orthogonal (independent) to the first function, and provides the second
greatest discrimination after the first function, and so on for Q eigenvalues
and functions.
Since the functions must be orthogonal, their contributions
to the discrimination between groups will not overlap. The first function picks
up the most variation; the second function picks up the greatest part of the
unexplained variation, and so on for each subsequent function.
Spatial Interpretation
Each discriminant function equation defines an axis in space, in
which a multivariate data set can be depicted as a multidimensional cloud of
sample points. The discriminant or canonical function projects through the
cloud at orientations that maximally separate group distributions along each
axis (McGarigal 2000).
We can visualize
how the functions discriminate among groups by plotting the individual scores
for the discriminant functions. To summarize the
position of each group, compute its centroid. The origin is the “grand
centroid” and the first axis is drawn through this centroid. The next canonical
axis will be drawn perpendicular to the first axis in the direction that best
separates the groups, so that there is a maximum among-group to within-group
ratio. The positions of group centroids on this canonical axis are maximally
separated. Subsequent canonical axes are constrained by orthogonality (independence)
and maximization of remaining group differences. Comparison of centroids shows
how far apart the groups are along the dimension defined by the canonical
function.
Factor Structure
Another
way to determine which variables define a particular discriminant function is
to look at the factor structure. The factor structure coefficients are the
correlations between the variables in the model and the discriminant functions.
The discriminant function coefficients denote the unique contribution of each variable
to the discriminant function, while the structure coefficients denote the
simple correlations between the variables and the functions.
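Structure coefficients are straightforward to compute once scores are available. The sketch below correlates each variable with the discriminant scores over the whole sample. Some programs report pooled within-groups correlations instead, so the use of total-sample correlations is an assumption here, as are the function name, the data, and the weights.

```python
import numpy as np

def structure_coefficients(X, scores):
    """Correlate each variable (column of X) with each discriminant function."""
    X = np.asarray(X, dtype=float)
    scores = np.asarray(scores, dtype=float)
    if scores.ndim == 1:
        scores = scores[:, np.newaxis]     # treat a single function as one column
    p = X.shape[1]
    corr = np.corrcoef(np.hstack([X, scores]), rowvar=False)
    return corr[:p, p:]                    # rows: variables, columns: functions

# Hypothetical data and weights for one discriminant function.
X = np.array([[2.0, 3.0], [3.0, 4.0], [4.0, 5.5],
              [6.0, 2.0], [7.0, 3.5], [8.0, 3.0]])
weights = np.array([0.8, -0.5])
scores = X @ weights
loadings = structure_coefficients(X, scores)
print(loadings)
```

A variable can have a small coefficient but a large structure correlation when it is strongly correlated with another variable that carries the weight, which is why the two are usually read together.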
Once the discriminant functions are determined and groups are
differentiated, the utility of these functions can be examined in terms of
their ability to correctly classify each data point into its a priori
group.
There
are many methods for performing classifications. All procedures involve
defining some notion of distance between an entity and its group centroid. The
entity is classified into the closest group. These procedures use either the
discriminating variables themselves or the canonical functions.
Fisher’s linear equation
This procedure is based on
R. A. Fisher’s original linear equation. In a 1936 paper, Fisher suggested that
classification should be based on a linear combination of discriminating
variables such that group differences are maximized and variation within groups
is minimized by using the pooled within-groups covariance matrix. An adaptation
of Fisher’s proposal has been developed to derive a linear combination for each
group, which is called a classification function.
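For cases with
an equal sample size for each group, the classification score for the jth group (Cj)
is expressed by the following equation:
$$ C_j = c_{j0} + c_{j1} x_1 + c_{j2} x_2 + \dots + c_{jp} x_p $$
where Cj is the score for the jth
group, j = 1 … k, cj0 is the
constant for the jth group, cj1 … cjp are the classification coefficients for the jth group, and x = raw scores of
each predictor. If W = within-group variance-covariance
matrix, and Mj = column matrix of means for group j, then the column of coefficients for group j is
$$ \mathbf{c}_j = W^{-1} M_j $$
and the constant is
$$ c_{j0} = -\tfrac{1}{2} \mathbf{c}_j' M_j $$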
A different classification equation is used for unequal samples in each group:
$$ C_j = c_{j0} + \sum_{i=1}^{p} c_{ji} x_i + \ln(n_j / N) $$
where nj = size of group j and N = total sample size.
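A minimal sketch of classification functions follows. It assumes the coefficients for each group are obtained by multiplying the inverse of the pooled within-group covariance matrix by the group mean vector, and that unequal group sizes are handled by adding the natural log of each group's share of the sample to the constant. The function names `classification_functions` and `classify`, and the toy data, are hypothetical.

```python
import numpy as np

def classification_functions(X, groups, use_priors=True):
    """Return {group: (constant, coefficient vector)} from training data."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    n_total, p = X.shape
    sscp_within = np.zeros((p, p))
    means = {}
    for g in labels:
        Xg = X[groups == g]
        means[g] = Xg.mean(axis=0)
        dev = Xg - means[g]
        sscp_within += dev.T @ dev
    pooled_cov = sscp_within / (n_total - len(labels))
    funcs = {}
    for g in labels:
        coef = np.linalg.solve(pooled_cov, means[g])   # W^-1 M_j without an explicit inverse
        const = -0.5 * coef @ means[g]
        if use_priors:
            const += np.log(np.sum(groups == g) / n_total)   # ln(n_j / N)
        funcs[str(g)] = (float(const), coef)
    return funcs

def classify(x, funcs):
    """Assign x to the group with the highest classification score."""
    x = np.asarray(x, dtype=float)
    scores = {g: const + float(coef @ x) for g, (const, coef) in funcs.items()}
    return max(scores, key=scores.get)

# Hypothetical training data with known group membership.
X = [[2.0, 3.0], [3.0, 4.0], [4.0, 5.5], [6.0, 2.0], [7.0, 3.5], [8.0, 3.0]]
groups = ["a", "a", "a", "b", "b", "b"]
funcs = classification_functions(X, groups)
print(classify([3.0, 4.0], funcs), classify([7.0, 3.0], funcs))
```

With equal group sizes the log term is the same for every group and has no effect on the assignment, which is why it can be dropped in that case.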
Mahalanobis distances
To use the Mahalanobis
distance approach to classification, calculate the distance from an entity to
the centroid of each group, then classify the entity into the group to which it
is closest. If the data are obtained from a population that is multivariate
normally distributed, the squared Mahalanobis distance has the same
properties as the chi-square statistic, so the distance can be converted into a
probability of belonging to a particular group. A particular entity can also
then be assigned to a group based on a maximum likelihood estimate.
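The distance calculation can be sketched as follows. The centroids and the pooled covariance matrix are invented values standing in for estimates from real data, and the helper name `mahalanobis_sq` is hypothetical. The conversion to a probability uses the fact that, for exactly two variables, the upper tail of a chi-square distribution with 2 degrees of freedom has the closed form exp(-D²/2); with more variables a chi-square routine from a statistics library would be needed instead.

```python
import math
import numpy as np

def mahalanobis_sq(x, centroid, cov_inv):
    """Squared Mahalanobis distance from x to a group centroid."""
    diff = np.asarray(x, dtype=float) - np.asarray(centroid, dtype=float)
    return float(diff @ cov_inv @ diff)

# Hypothetical group centroids and pooled within-group covariance matrix.
centroids = {"a": [3.0, 4.0], "b": [7.0, 3.0]}
pooled_cov = np.array([[1.0, 0.5], [0.5, 1.0]])
cov_inv = np.linalg.inv(pooled_cov)

x = [4.0, 4.0]                                   # entity of unknown membership
d2 = {g: mahalanobis_sq(x, c, cov_inv) for g, c in centroids.items()}
closest = min(d2, key=d2.get)                    # classify into the nearest group

# Two variables only: chi-square (2 df) upper-tail probability of each distance.
tail = {g: math.exp(-v / 2.0) for g, v in d2.items()}
print(closest, d2, tail)
```

The tail value for a group can be read as the chance that a true member of that group would lie at least this far from its centroid, under the normality assumption stated above.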
Once discriminant functions
have been derived, and significance testing has been carried out and assessed,
it is important to perform a validation of results. This is an often
overlooked, but important step in DA, since results are only reliable if group
means and dispersions are estimated accurately and with precision. If classification
of unknown entities is the ultimate goal of the analysis, it is particularly
important to assess the reliability and robustness of discriminant findings.
Split-Sample Validation
One common approach to
validating the analysis is through a split-sampling procedure, where the data
is split into two subsets. This is a type of resampling procedure. A DA is run
on one subset so that classification rules are established from the first
subset or the first half of the data. The rules are then applied to the second
subset to classify the samples in the second half of the data. The rate
of correct classification is then used to determine the reliability of the
classification rules. If the criterion performed poorly, it may be necessary to
take a larger sample to obtain more accurate estimates of group means and
dispersions (McGarigal 2000).
The split-sample
validation (also called cross-validation) procedure can be carried out several
times, or the data can be randomly divided into analysis and holdout samples
multiple times. This is similar to performing a jackknife resampling
procedure.
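A split-sample check can be sketched as below. The data are simulated, the split is made within each group so that both halves contain every group, and the classification rule is a linear one built from the pooled within-group covariance matrix with group proportions as priors. All of these choices, and the helper names `fit_lda` and `predict`, are assumptions made for illustration.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate linear classification functions from an analysis sample."""
    labels = np.unique(y)
    n_total, p = X.shape
    sscp_within = np.zeros((p, p))
    means, priors = [], []
    for g in labels:
        Xg = X[y == g]
        m = Xg.mean(axis=0)
        sscp_within += (Xg - m).T @ (Xg - m)
        means.append(m)
        priors.append(len(Xg) / n_total)
    means = np.array(means)
    pooled_cov = sscp_within / (n_total - len(labels))
    coefs = np.linalg.solve(pooled_cov, means.T).T        # one row per group
    consts = -0.5 * np.sum(coefs * means, axis=1) + np.log(priors)
    return labels, coefs, consts

def predict(model, X):
    """Assign each row of X to the group with the highest score."""
    labels, coefs, consts = model
    return labels[np.argmax(X @ coefs.T + consts, axis=1)]

# Simulated data: two well-separated groups of 20 entities, two variables.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
               rng.normal(10.0, 1.0, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Put a random half of each group in the analysis sample; the rest is the holdout.
train = np.zeros(len(y), dtype=bool)
for g in np.unique(y):
    idx = rng.permutation(np.flatnonzero(y == g))
    train[idx[: len(idx) // 2]] = True

model = fit_lda(X[train], y[train])
hit_rate = float(np.mean(predict(model, X[~train]) == y[~train]))
print(hit_rate)
```

Repeating the split with different random seeds and averaging the hit rates gives a more stable estimate than a single split.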
Resampling Validation
Several
resampling approaches can be used to validate the robustness of discriminant
functions. In addition to the split-sample procedure described above, a variety
of bootstrap, jackknife and randomization tests can be used for reliability
testing. The jackknife is considered a good approach when sample sizes are
small and it is not possible to perform a split-sample procedure.
Territorial Plots
Computer programs have an
option to run a graphic called a territorial plot, which shows all the entities
in relation to the centroids. It also demarcates the territories associated
with each group, so that a case that falls into a particular group’s territory
should be classified as belonging to that group. This plot allows for visual
inspection of distances and where each case falls, and may help in classifying
entities into the most appropriate group. It may also generate questions about
the accuracy of the analysis if group members appear to have been classified
erroneously.
Discriminant Function Analysis Assumptions
Discriminant analysis is
generally robust to small deviations from its assumptions, which is important since field data
rarely meet assumptions precisely. The larger the sample size, the more robust
the analysis is to violation of assumptions. When sample sizes are small, more
attention to violations is needed.
Discriminant
function analysis is computationally very similar to MANOVA, and all
assumptions for MANOVA apply.
Equality of
Variance-Covariance Matrices: The variance-covariance matrices of
variables are assumed to be homogeneous across groups. DA assumes groups have
equal dispersions (within-group variance-covariance structure is the same for
all groups), and the correlation between any two variables must be the same in
the respective populations from which the different groups have been sampled.
If a violation is suspected,
the analysis should be run with exclusion of one or two groups that are of less
interest. If the deviations are minor and overall results hold up, it should
not cause a problem with the analysis. But if this assumption is grossly
violated, certain desirable properties of the canonical functions are lost and
some degree of distortion will occur in the canonical representations of the
data. Also, the statistical relationships between distances in observation
space and their canonical representations become complex and non-intuitive.
Before accepting final conclusions for an important study, it is a good idea to review the within-groups variances and correlation matrices. Homoscedasticity is evaluated through scatterplots and corrected by transformation of variables.
Normal
Distribution: It is
assumed that the data (for the variables) represent a sample from a
multivariate normal distribution. You can examine whether or not variables are
normally distributed with histograms of frequency distributions. However, note
that violations of the normality assumption are not "fatal" and the
resultant significance tests are still reliable as long as non-normality is
caused by skewness and not outliers, which are a more serious problem
(Tabachnick and Fidell 1996).
Violations of normality mean
the computed probabilities are not exact and will not be optimal in the sense
of minimizing the number of misclassifications, even though they may still be
quite useful if interpreted with caution.
Sample size: Unequal sample sizes are
acceptable. The sample size of the smallest group needs to exceed the
number of predictor variables. As a “rule of thumb”, the smallest sample
size should be at least 20 for a few (4 or 5) predictors. The maximum
number of independent variables is n - 2, where n is the sample size.
While this low sample size may work, it is not encouraged, and generally it is
best to have 4 or 5 times as many observations as independent variables.
Independent Random
Samples: DA assumes that random
samples of observation vectors have been drawn independently from respective
P-dimensional multivariate normal populations.
Outliers: DA is highly sensitive to the inclusion
of outliers. Run a test for univariate and multivariate outliers for each
group, and transform or eliminate them. If one group in the study
contains extreme outliers that impact the mean, they will also increase
variability. Overall significance tests are based on pooled variances, that is,
the average variance across all groups. Thus, the significance tests of the
relatively larger means (with the large variances) would be based on the
relatively smaller pooled variances, resulting erroneously in statistical
significance.
Non-multicollinearity: DA requires that no discriminating variable be a
linear combination of other variables being analyzed. This stems from
mathematical requirements that the matrix be nonsingular. A variable defined by
a linear combination of other variables is redundant.
Multicollinearity is defined
as multiple near-linear dependencies (high correlations) in the data set. If
one of the independent variables is very highly correlated with another, or one
is a function (e.g., the sum) of other IVs, then the tolerance value for that
variable will approach 0 and the matrix will not have a unique discriminant
solution. There must also be low multicollinearity of the independents. To
the extent that independents are correlated, the standardized discriminant
function coefficients will not reliably assess the relative importance of the
predictor variables. Non-multicollinearity is not a specified assumption, but
it can affect interpretation of data.
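Tolerance can be checked before running the analysis. The sketch below regresses each variable on all the others and reports one minus the squared multiple correlation, so a value near 0 flags a variable that is nearly a linear combination of the rest. It works on the whole sample, whereas some programs compute tolerance from within-group relationships, so treat that as an assumption; the function name `tolerances` and the data are hypothetical.

```python
import numpy as np

def tolerances(X):
    """Tolerance (1 - R^2) of each column regressed on the remaining columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    result = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])    # intercept plus other variables
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        ss_res = resid @ resid
        ss_tot = np.sum((y - y.mean()) ** 2)
        result.append(ss_res / ss_tot)
    return np.array(result)

# Hypothetical data in which the third variable is the sum of the first two.
collinear = np.array([[1.0, 2.0, 3.0], [2.0, 1.0, 3.0], [3.0, 5.0, 8.0],
                      [4.0, 3.0, 7.0], [5.0, 6.0, 11.0]])
print(tolerances(collinear))

# Hypothetical data with two uncorrelated variables.
independent = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 0.0], [0.0, 1.0]])
print(tolerances(independent))
```

Dropping one variable from a redundant set, or replacing the set with a single composite, is the usual remedy when a tolerance is close to zero.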
Linearity: Variables change linearly along underlying
gradients and linear relationships exist among the variables such that they can
be combined in a linear fashion to create the canonical (discriminant)
functions. This is not a specified assumption of the math model, but it
determines the effectiveness of discriminant analysis.
To diagnose violations of
linearity, look at scatterplots of pairs of canonical functions for arched or
curvilinear configurations of sample points that often indicate nonlinearities.
Summary
To summarize, when interpreting multiple discriminant functions, which arise
from analyses with more than two groups and more than one continuous variable,
the different functions are first tested for statistical significance. If
the functions are statistically significant, then the groups can be
distinguished based on predictor variables. Standardized b coefficients
for each variable are determined for each significant function. The larger the
standardized b coefficient, the larger is the respective variable's unique
contribution to the discrimination specified by the respective discriminant
function.
Cooley, W. W. and P. R. Lohnes (1971). Multivariate Data Analysis. John Wiley & Sons, Inc.
Dunteman, George H. (1984). Introduction to multivariate analysis.
Huberty, Carl J. (1994). Applied Discriminant Analysis. John Wiley & Sons.
Klecka, William R. (1980). Discriminant Analysis. Quantitative Applications in the Social Sciences Series, No. 19.
Lachenbruch, P. A. (1975). Discriminant Analysis. NY: Hafner. For detailed notes on computations.
Manly,
McGarigal, Kevin, Cushman, Sam and Susan Stafford (2000). Multivariate Statistics for Wildlife and Ecology Research. Springer Verlag.
Morrison, D.F. (1967). Multivariate Statistical Methods. McGraw-Hill.
Overall, J.E. and C.J. Klett (1972). Applied Multivariate Analysis. McGraw-Hill.
Press, S. J. and
Tabachnick, B.G. and L.S. Fidell (1996). Using Multivariate Statistics.
How-To Guides: http://www.statsguides.bham.ac.uk/HowToGuides/WebPages/HTG/DFA/HTG_DFA_data_p1.htm
This website is a supplement to online help for computer packages.
http://www.unesco.org/webworld/idams/advguide/Chapt9_2.htm
This web page offers a good description of computations.
www.statsoft.com/textbook/stathome.html
Statsoft provides descriptions and explanations of many different statistical
techniques.
www2.chass.ncsu.edu/garson/pa765/discrim.htm
This website offers a good, readable treatment of DA. It
also offers very understandable explanations of how to read result print-outs
from SPSS and SAS. Other analyses like logistic regression and
log-linear models can be found here.
U.S. Environmental Protection Agency’s statistical primer:
http://www.epa.gov/bioindicators/primer/dfa.html