Discriminant Function Analysis (DA)

Julia Barfield, John Poulsen, and Aaron French

Introduction

Discriminant function analysis is used to discriminate between two or more naturally occurring groups based on a suite of continuous discriminating variables. It consists of two closely related procedures that allow researchers to discover underlying, dominant gradients of variation among groups of sample entities (such as species, individuals, sites, or any naturally occurring group) from a set of multivariate observations. The goal is to elucidate how variation among groups is maximized and variation within groups is minimized along a gradient (McGarigal 2000). Discriminant function analysis can be used both to explain differences among groups and to predict group membership for sampling entities of unknown membership.

Discriminant analysis is used by researchers in a wide variety of settings and fields including biological and medical sciences, education, psychology, finance, engineering and political science. It also goes by many names, sometimes causing confusion. Terms such as canonical analysis of discriminance, multiple discriminant analysis, canonical variates analysis, and Fisher’s linear discriminant function analysis all refer to the same statistical technique, which is designed to predict membership in naturally occurring groups from a set of predictor variables.

Discriminant Function Analysis in Biology

In biology, discriminant function analysis is used in many kinds of research.

For example, a researcher may want to investigate which variables discriminate between fruits eaten by (1) primates, (2) birds, or (3) squirrels. For that purpose, the researcher could collect data on numerous characteristics of the fruits eaten by each animal group. Most fruits will naturally fall into one of the three categories. Discriminant analysis could then be used to determine which variables are the best predictors of whether a fruit will be eaten by birds, primates, or squirrels.

Discriminant analysis is used very often in studies of acoustic and visual communication within species. It is also used for taxonomic classification, for morphometric analyses to identify or sex specimens, and in studies of species distribution. If a researcher recovers bird bands, for instance, DFA can be used to assign the recovered individuals to species, sex, or subpopulation, which can help to answer questions about the dispersal and seasonal distribution of animals.

Discriminant analysis can also be used to test for niche separation by sympatric species or for the presence or absence of a species. One approach would be to measure ecological variables that could influence the occurrence of the species in question at sampling locations. Data obtained by random sampling of ecological variables at a study site could be used to identify which environmental factors are related to the presence/absence of a particular species at a series of sites. The goal would be to discover which variables best explain differences in distribution patterns of species. If a suite of discriminating ecological variables has been identified and consistent, non-random patterns appear in the relation of ecological gradients to species distribution, then it should be possible to define gradients that best characterize used and unused sites (McGarigal 2000).

Discriminant Analysis vs. MANOVA

To understand discriminant analysis, it is important to be familiar with another multivariate technique, multivariate analysis of variance (MANOVA). A MANOVA is simply a multivariate version of an analysis of variance (ANOVA) with several dependent variables. The main objective in using MANOVA is to determine if response variables are altered by the observer’s manipulation of the independent variables. The researcher wants to know if group memberships are associated with reliable mean differences on a combination of dependent variables. The null hypothesis would be that mean vectors of dependent variables (DVs) are equal.

Discriminant function analysis is MANOVA turned around. In MANOVA, the independent variables are the groups and the dependent variables are the predictors.  In DA, the independent variables are the predictors and the dependent variables are the groups. (In order to avoid semantic confusion, it’s easier to refer to independent variables as the predictors -- or discriminating variables -- and to dependent variables as the grouping variables.)

The emphases of MANOVA and DA are different. While MANOVA seeks to find a linear combination of variables that will maximize the test statistic, DA is used to establish the linear combination of predictor variables that maximally discriminates among groups. As previously stated, DA is used to predict membership in naturally occurring groups and to determine if a combination of variables can reliably predict group membership. Several variables are included in a study to see which ones best contribute to the discrimination between groups.

Logistic Regression vs. Discriminant Analysis

Logistic regression answers the same questions as discriminant analysis. It is often preferred to discriminant analysis because it is more flexible in its assumptions and in the types of data that can be analyzed. The key difference is that the predictor (independent) variables must be continuous in DA, whereas multiple logistic regression can handle both categorical and continuous predictors, which need not be normally distributed, linearly related, or of equal variance within each group (Tabachnick and Fidell 1996).

Discriminant Function Analysis Procedure

Discriminant function analysis is a two-step process:

(1) Discrimination: Significance testing of a set of discriminant functions. This is sometimes referred to as interpretation.

(2) Classification: Classify entities into groups using a classification criterion that maximizes correct classification and emphasizes differences among groups of sampling entities.
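As a rough sketch of these two steps, here is a minimal two-group example in Python. The measurements are made up, and the nearest-mean decision rule is an illustrative simplification of the classification criteria described later; step 1 derives a discriminant direction from cases of known membership, and step 2 classifies a new case by its score on that axis.

```python
import numpy as np

# Hypothetical two-group example: four cases per group, two variables.
g1 = np.array([[2., 3.], [3., 4.], [4., 4.], [3., 5.]])
g2 = np.array([[6., 1.], [7., 2.], [8., 2.], [7., 3.]])

# Step 1 (discrimination): derive the discriminant weights from cases of
# known membership. For two groups, w = W^-1 (m1 - m2), with W the
# pooled within-group covariance matrix.
m1, m2 = g1.mean(axis=0), g2.mean(axis=0)
W = (np.cov(g1.T) * (len(g1) - 1) + np.cov(g2.T) * (len(g2) - 1)) \
    / (len(g1) + len(g2) - 2)
w = np.linalg.solve(W, m1 - m2)

# Step 2 (classification): assign a new case to the group whose mean
# discriminant score is closer to the case's own score.
new_case = np.array([3., 4.])
score = new_case @ w
group = 1 if abs(score - m1 @ w) < abs(score - m2 @ w) else 2
print(group)
```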

Linear Discriminant Functions

A statistical method for measuring the differences among groups is needed. The first step is to derive linear combinations of the discriminating variables, also known as canonical discriminant functions. The discriminant functions are generated from a sample of individuals (or cases), for which group membership is known. The functions can then be applied to new cases with measurements on the same set of variables, where group membership is not known.

These canonical linear functions are like regression equations where each variable is weighted by a coefficient. In discriminant analysis, each variable is weighted according to its ability to discriminate among groups. Each sampling entity has a single composite score derived by multiplying the sample values for each variable by a corresponding weight and adding these products together. By averaging the canonical scores for all the entities within a particular group, we arrive at the group mean canonical score. The group mean is called a centroid because it is the composite of several variables.

It’s important to have some familiarity with matrix algebra to understand how discriminant functions and coefficients are obtained. Statistical computer programs have removed the need to perform these calculations by hand, but their use will be greatly enhanced by understanding the concepts behind the procedure. Multivariate statistics textbooks generally provide an appendix or chapter on matrix algebra that covers the basics, but a good algebra textbook will also do the trick.

To start the process, the data for a discriminant analysis are arranged as an N x P data set with G groups, where N is the number of sampling entities (rows) and P is the number of variables (columns), X1, X2 … Xp, measured on each sample member. The G groups contain random samples of sizes n1, n2 … nG.

(Note that notation varies across reference sources. It is important to be mindful of this when consulting several sources at once. Assuming that all texts use the same notation will inevitably lead to confusion and frustration.)

To derive the canonical discriminant functions, take a linear combination of the X variables where each variable is weighted according to its ability to discriminate among a priori defined groups. Linear discriminant functions take the following mathematical form:

D = a1X1 + a2X2 + … + apXp

There are Q (equal to G – 1 or P, whichever is smaller) possible canonical discriminant functions corresponding to the dimensions of the data set:

D1 = a11X1 + a12X2 + … + a1pXp

D2 = a21X1 + a22X2 + … + a2pXp

…

Di = ai1X1 + ai2X2 + … + aipXp

Determining the weights or coefficients uses the same fundamental equations as MANOVA to obtain sums-of-squares-and-cross-products matrices; this is analogous to determining sums of squares in ANOVA. In the multivariate case, group means and standard deviations are not sufficient to show interrelations among the variables, so it is necessary to use a matrix of total sums-of-squares-and-cross-products (SSCP). Variance in the predictor variables comes from two sources: within-group differences and among-group differences. Using the MANOVA computation, a matrix of total variances and covariances is obtained, as well as a matrix of pooled within-group variances and covariances.

First, create cross-products matrices for between-group differences and within-group differences:

SStotal = SSbg + SSwg.

The determinants are calculated for each cross-products matrix and used to calculate a test statistic – either Wilks’ Lambda or Pillai’s Trace.

Wilks’ Lambda follows the equation:

Λ = |SSwg| / |SSbg + SSwg|

where | | denotes the determinant of a matrix.

Next an F ratio is calculated, as in MANOVA, of the general form:

F = [(1 – Λ) / Λ] × (dferror / dfeffect)

For cases where n is equal in all groups, dfeffect = k – 1 and dferror = k(n – 1), where k is the number of groups and n is the per-group sample size. For unequal n between groups, this is modified only by changing the dferror to equal the number of data points in all groups minus the number of groups (N – k). If the experimental F exceeds a critical F, then the experimental groups can be distinguished based on the combination of predictor variables.

Wilks' lambda is used to generate an F test of mean differences in DA, such that the smaller the lambda for an independent variable, the more that variable contributes to the discriminant function. Lambda varies from 0 to 1: a lambda near 0 means the group means differ (the variable strongly differentiates the groups), while a lambda of 1 means all group means are the same. The F test shows which of the discriminant functions contribute significantly to distinguishing among groups.
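The SSCP decomposition and Wilks’ lambda are easy to check numerically. The sketch below uses small, hypothetical three-group data; it verifies that SStotal = SSbg + SSwg holds exactly and computes lambda as a ratio of determinants.

```python
import numpy as np

# Hypothetical data: three groups, two variables, four cases each.
X = np.array([[2., 3.], [3., 4.], [4., 4.], [3., 5.],
              [6., 1.], [7., 2.], [8., 2.], [7., 3.],
              [4., 7.], [5., 8.], [6., 8.], [5., 9.]])
groups = np.repeat([0, 1, 2], 4)

grand_mean = X.mean(axis=0)

# Total SSCP: deviations of every case from the grand centroid.
d_tot = X - grand_mean
SS_total = d_tot.T @ d_tot

# Within-group and between-group SSCP matrices.
SS_wg = np.zeros((2, 2))
SS_bg = np.zeros((2, 2))
for g in range(3):
    Xg = X[groups == g]
    d_wg = Xg - Xg.mean(axis=0)           # deviations from group centroid
    SS_wg += d_wg.T @ d_wg
    d_bg = (Xg.mean(axis=0) - grand_mean).reshape(-1, 1)
    SS_bg += len(Xg) * (d_bg @ d_bg.T)    # centroid deviations, weighted by n

# The decomposition SS_total = SS_bg + SS_wg holds exactly.
assert np.allclose(SS_total, SS_bg + SS_wg)

# Wilks' lambda: ratio of determinants (small lambda = strong separation).
wilks = np.linalg.det(SS_wg) / np.linalg.det(SS_total)
print(round(wilks, 4))
```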

One first performs the multivariate test and, if statistically significant, proceeds to see which of the variables have significantly different means across the groups. When an overall relationship is found, the next step is to examine the discriminant functions that compose the overall relationship and determine in how many dimensions groups differ reliably. The number of discriminant functions used in the analysis, as previously mentioned, can be no greater than the number of groups minus one (the degrees of freedom) or the number of predictor variables in the analysis, whichever is smaller.

Eigenanalysis

Computationally, discriminant analysis is essentially an eigenanalysis (or canonical correlation analysis) problem to determine the successive functions and coefficients. The discriminant functions each have an eigenvalue associated with them, which represents the extent of group differentiation along the dimension specified by the canonical function. The eigenvalues are determined by solving the following equation:

|A – λW| = 0

where λ is a constant, called the eigenvalue, W is the within-groups SSCP matrix, and A is the among-groups SSCP matrix.

An eigenanalysis also produces an eigenvector that is associated with each eigenvalue. Eigenvectors are determined by solving this equation:

(A – λiW)vi = 0

λi is the eigenvalue corresponding to the ith canonical function, vi is the eigenvector associated with the ith eigenvalue, W is the within-groups SSCP matrix, and A is the among-groups SSCP matrix. Each solution, which yields its own λ and set of v’s, corresponds to one discriminant function. However, the v’s cannot be interpreted as coefficients directly, since the solution places no constraint on the origin or the metric units of the discriminant space. The coefficients v can be transformed into raw coefficients u of the discriminant function as follows (notation follows Klecka 1980):

ui = vi √(N – G)

with a constant term u0 = –Σ uj X̄j that places the origin at the grand centroid.

This adjustment makes the raw coefficients useful as weights that can be used to compare the relative importance of the independent variables, much as beta weights are used in regression. The number of coefficients in an eigenvector equals the number of variables in the linear equations that define the discriminant (canonical) functions; these coefficients are referred to as canonical coefficients or weights.

The discriminant function with the largest λ value is the most powerful discriminator, while the function with the smallest λ value is the weakest. An eigenvalue close to 0 indicates the associated discriminant function has minimal discriminating power. λ = 0 implies no difference between the groups.
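A minimal numerical sketch of the eigenanalysis, using hypothetical SSCP matrices: because W is invertible, |A – λW| = 0 reduces to the ordinary eigenproblem of W⁻¹A.

```python
import numpy as np

# Hypothetical among-groups (A) and within-groups (W) SSCP matrices.
A = np.array([[32.0, -16.0], [-16.0, 74.7]])
W = np.array([[6.0, 3.0], [3.0, 6.0]])

# |A - lambda W| = 0 is a generalized eigenproblem; since W is
# invertible it is equivalent to the ordinary eigenproblem of W^-1 A.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(W) @ A)
eigvals, eigvecs = eigvals.real, eigvecs.real  # real for this PSD/PD pair

# Sort functions by eigenvalue: the largest lambda is the strongest
# discriminator (at most Q = min(G - 1, P) functions exist).
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each eigenvector v_i satisfies (A - lambda_i W) v_i = 0.
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose((A - lam * W) @ v, 0.0, atol=1e-8)
print(eigvals)
```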

The discriminant function score for the ith function is:

Di = di1Z1 + di2Z2 + … + dipZp

where Zj is the standardized score on the jth predictor and dij is the corresponding discriminant function coefficient. A case’s discriminant score can equivalently be produced from raw scores and unstandardized discriminant function coefficients. The coefficients are, by definition, chosen to maximize differences between groups; over all cases, the discriminant scores have a mean of zero and a standard deviation of one.

The discriminant score, also called the DA score, is the value that results from applying a discriminant function to the data for a given case; the Z values in the equation above are the standardized predictor scores.

Discriminant function analysis will find some optimal combination of variables so that the first eigenvalue or discriminant function is the largest, and thus provides the most overall discrimination between groups. The second discriminant function is orthogonal (independent) to the first function, and provides the second greatest discrimination after the first function, and so on for Q eigenvalues and functions.

Since the functions must be orthogonal, their contributions to the discrimination between groups will not overlap. The first function picks up the most variation; the second function picks up the greatest part of the unexplained variation, and so on for each subsequent function.

Spatial Interpretation

Each discriminant function equation defines an axis in space, in which a multivariate data set can be depicted as a multidimensional cloud of sample points. The discriminant or canonical function projects through the cloud at orientations that maximally separate group distributions along each axis (McGarigal 2000).

We can visualize how the functions discriminate among groups by plotting the individual scores for the discriminant functions. To summarize the position of each group, compute its centroid. The origin is the “grand centroid” and the first axis is drawn through this centroid. The next canonical axis will be drawn perpendicular to the first axis in the direction that best separates the groups, so that there is a maximum within-group to among-group ratio. The positions of group centroids on this canonical axis are maximally separated. Subsequent canonical axes are constrained by orthogonality (independence) and maximization of remaining group differences. Comparison of centroids shows how far apart the groups are along the dimension defined by the canonical function.
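A small illustration of centroids on a canonical axis, using made-up discriminant scores for three groups: each group centroid is simply the mean canonical score of its members, and the grand centroid sits near the origin.

```python
import numpy as np

# Hypothetical discriminant scores on the first canonical axis for
# three groups of four cases each.
scores = np.array([2.1, 1.8, 2.4, 2.0,        # group 0
                   -1.9, -2.2, -1.7, -2.0,    # group 1
                   0.1, -0.2, 0.3, 0.0])      # group 2
groups = np.repeat([0, 1, 2], 4)

# Group centroids: mean canonical score within each group.
centroids = np.array([scores[groups == g].mean() for g in range(3)])

# The grand centroid is (approximately) the origin of the canonical
# axis, since scores are centered on the overall mean.
grand = scores.mean()
print(centroids, grand)
```

Comparing the centroid values shows how far apart the groups lie along this axis.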

Factor Structure

Another way to determine which variables define a particular discriminant function is to look at the factor structure. The factor structure coefficients are the correlations between the variables in the model and the discriminant functions. The discriminant function coefficients denote the unique contribution of each variable to the discriminant function, while the structure coefficients denote the simple correlations between the variables and the functions.
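Structure coefficients are just simple correlations, so they are easy to compute once scores are available. A sketch with hypothetical data and illustrative weights:

```python
import numpy as np

# Hypothetical raw data (12 cases, 2 variables) and illustrative
# weights for the first discriminant function.
X = np.array([[2., 3.], [3., 4.], [4., 4.], [3., 5.],
              [6., 1.], [7., 2.], [8., 2.], [7., 3.],
              [4., 7.], [5., 8.], [6., 8.], [5., 9.]])
w = np.array([-10.0, 8.0])
scores = X @ w

# Structure coefficients: simple correlations between each variable and
# the discriminant scores (contrast with the weights themselves, which
# give each variable's unique contribution).
structure = np.array([np.corrcoef(X[:, j], scores)[0, 1]
                      for j in range(X.shape[1])])
print(structure)
```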

Classification

Once the discriminant functions are determined and groups are differentiated, the utility of these functions can be examined in terms of their ability to correctly classify each data point into its a priori group.

There are many methods for performing classifications. All procedures involve defining some notion of distance between an entity and its group centroid. The entity is classified into the closest group. These procedures use either the discriminating variables themselves or the canonical functions.

Fisher’s linear equation

This procedure is based on R. A. Fisher’s original linear equation. In a 1936 paper, Fisher suggested that classification should be based on a linear combination of discriminating variables such that group differences are maximized and variation within groups is minimized by using the pooled within-groups covariance matrix. An adaptation of Fisher’s proposal has been developed to derive a linear combination for each group, which is called a classification function.

For cases with an equal sample size in each group, the classification score for group j (Cj) is given by the following equation:

Cj = cj0 + cj1x1 + cj2x2 + … + cjpxp

where Cj is the classification score for the jth group (j = 1 … k), cj0 is the constant for the jth group, and the x’s are the raw scores on each predictor. If W is the within-group variance-covariance matrix and Mj is the column vector of means for group j, then the coefficient vector is cj = W⁻¹Mj and the constant is cj0 = (–1/2)cj·Mj.

A different classification equation is used for unequal samples in each group: each group’s classification score is adjusted by the natural log of its prior probability,

Cj′ = Cj + ln(nj / N)

where nj is the sample size of group j and N is the total sample size.
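A sketch of Fisher’s classification functions for the equal-n case, following the coefficient and constant definitions above. The W matrix and group means are hypothetical:

```python
import numpy as np

# Hypothetical pooled within-group covariance W and group mean vectors.
W = np.array([[2/3, 1/3], [1/3, 2/3]])
M = [np.array([3.0, 4.0]), np.array([7.0, 2.0])]

# One classification function per group: coefficients c_j = W^-1 M_j,
# constant c_j0 = -(1/2) c_j . M_j.
def classification_score(x, Mj):
    c = np.linalg.solve(W, Mj)
    c0 = -0.5 * c @ Mj
    return c0 + c @ x

# A case is assigned to the group with the highest classification score.
x = np.array([3.0, 4.0])
scores = [classification_score(x, Mj) for Mj in M]
print(int(np.argmax(scores)))  # this case scores highest for the first group
```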

Mahalanobis distances

To use the Mahalanobis distance approach to classification, calculate the distance from an entity to each group centroid, then classify the entity into the group to which it is closest. If the data are drawn from a multivariate normal population, the squared Mahalanobis distance has the same properties as a chi-square statistic, so the distance can be converted into a probability of belonging to a particular group. An entity can then also be assigned to a group by maximum likelihood.
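The Mahalanobis rule is a few lines of linear algebra. A sketch with a hypothetical pooled covariance matrix and centroids:

```python
import numpy as np

# Mahalanobis distance from a case to each group centroid, using a
# hypothetical pooled within-group covariance; classify into the
# nearest group.
Sw = np.array([[2/3, 1/3], [1/3, 2/3]])
centroids = np.array([[3.0, 4.0], [7.0, 2.0], [5.0, 8.0]])
x = np.array([6.0, 3.0])

Sw_inv = np.linalg.inv(Sw)
# Squared Mahalanobis distance to each centroid.
d2 = np.array([(x - m) @ Sw_inv @ (x - m) for m in centroids])
print(int(np.argmin(d2)))  # index of the closest (most likely) group
```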

Validation of Functions

Once discriminant functions have been derived, and significance testing has been carried out and assessed, it is important to perform a validation of results. This is an often overlooked, but important step in DA, since results are only reliable if group means and dispersions are estimated accurately and with precision. If classification of unknown entities is the ultimate goal of the analysis, it is particularly important to assess the reliability and robustness of discriminant findings.

Split-Sample Validation

One common approach to validating the analysis is a split-sampling procedure, in which the data are split into two subsets. This is a type of resampling procedure. A DA is run on the first subset so that classification rules are established from that half of the data. The rules are then applied to the second subset to classify its samples. The correct classification rate is then used to judge the reliability of the classification rules. If the criterion performs poorly, it may be necessary to take a larger sample to obtain more accurate estimates of group means and dispersions (McGarigal 2000).

The split-sample validation (also called cross-validation) procedure can be carried out several times, or the data can be randomly divided into analysis and holdout samples multiple times. This is similar to performing a jackknife resampling procedure.
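Split-sample validation can be sketched as follows, using simulated two-group data and a simple nearest-centroid classifier as a stand-in for the full DA classification rule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: two well-separated groups, 20 cases each.
X = np.vstack([rng.normal([0, 0], 1.0, size=(20, 2)),
               rng.normal([4, 4], 1.0, size=(20, 2))])
y = np.repeat([0, 1], 20)

# Split into an analysis half and a holdout half.
idx = rng.permutation(40)
train, test = idx[:20], idx[20:]

# Derive a simple nearest-centroid rule from the analysis half only.
cents = np.array([X[train][y[train] == g].mean(axis=0) for g in (0, 1)])

# Apply the rule to the holdout half and compute the correct
# classification rate.
pred = np.array([np.argmin(((x - cents) ** 2).sum(axis=1)) for x in X[test]])
rate = (pred == y[test]).mean()
print(rate)
```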

Resampling Validation

Several resampling approaches can be used to validate the robustness of discriminant functions. In addition to the split-sample procedure described above, a variety of bootstrap, jackknife, and randomization tests can be used for reliability testing. The jackknife is considered a good approach when sample sizes are small and a split-sample procedure is not possible.

Territorial Plots

Computer programs have an option to run a graphic called a territorial plot, which shows all the entities in relation to the centroids. It also demarcates the territories associated with each group, so that a case that falls into a particular group’s territory should be classified as belonging to that group. This plot allows for visual inspection of distances and where each case falls, and may help in classifying entities into the most appropriate group. It may also generate questions about the accuracy of the analysis if group members appear to have been classified erroneously.

Assumptions

Discriminant function analysis is computationally very similar to MANOVA, and all assumptions for MANOVA apply. Discriminant analysis is generally robust to small deviations from these assumptions, which is important since field data rarely meet them precisely. The larger the sample size, the more robust the analysis is to violations; when sample sizes are small, violations deserve closer attention.

Equality of Variance-Covariance Matrices: The variance-covariance matrices of variables are assumed to be homogeneous across groups. DA assumes groups have equal dispersions (within-group variance-covariance structure is the same for all groups), and the correlation between any two variables must be the same in the respective populations from which the different groups have been sampled.

If a violation is suspected, the analysis should be run with exclusion of one or two groups that are of less interest. If the deviations are minor and overall results hold up, it should not cause a problem with the analysis. But if this assumption is grossly violated, certain desirable properties of the canonical functions are lost and some degree of distortion will occur in the canonical representations of the data. Also, the statistical relationships between distances in observation space and their canonical representations become complex and non-intuitive.

Before accepting final conclusions for an important study, it is a good idea to review the within-groups variances and correlation matrices.  Homoscedasticity is evaluated through scatterplots and corrected by transformation of variables.

Normal Distribution: It is assumed that the data (for the variables) represent a sample from a multivariate normal distribution. You can examine whether variables are normally distributed with histograms of frequency distributions. However, note that violations of the normality assumption are not "fatal": the resulting significance tests are still reliable as long as non-normality is caused by skewness rather than outliers, which are a more serious problem (Tabachnick and Fidell 1996).

Violations of normality mean the computed probabilities are not exact and will not be optimal in the sense of minimizing the number of misclassifications, even though they may still be quite useful if interpreted with caution.

Sample size: Unequal sample sizes are acceptable. The sample size of the smallest group needs to exceed the number of predictor variables. As a rule of thumb, the smallest group should have at least 20 cases for a few (4 or 5) predictors. The maximum number of independent variables is n – 2, where n is the sample size. While such a low sample size may work, it is not encouraged; generally it is best to have 4 or 5 times as many observations as independent variables.

Independent Random Samples: DA assumes that random samples of observation vectors have been drawn independently from their respective P-dimensional multivariate normal populations.

Outliers: DA is highly sensitive to the inclusion of outliers. Run a test for univariate and multivariate outliers for each group, and transform or eliminate them. If one group contains extreme outliers that affect the mean, they will also inflate variability. Overall significance tests are based on pooled variances, that is, the average variance across all groups; significance tests of the relatively larger means (with their large variances) would then be based on relatively smaller pooled variances, erroneously yielding statistical significance.

Non-multicollinearity: DA requires that no discriminating variable be a linear combination of other variables being analyzed. This stems from mathematical requirements that the matrix be nonsingular. A variable defined by a linear combination of other variables is redundant.

Multicollinearity means multiple near-linear dependencies (high correlations) in the data set. If one independent variable is very highly correlated with another, or is a function (e.g., the sum) of other IVs, then the tolerance value for that variable approaches 0 and the matrix has no unique discriminant solution. To the extent that the independents are correlated, the standardized discriminant function coefficients will not reliably assess the relative importance of the predictor variables. Low multicollinearity is not a formally specified assumption, but it does affect interpretation of the results.
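Near-linear dependencies are easy to detect numerically. The sketch below builds a predictor that is (almost) the sum of two others and flags the problem via the condition number of the correlation matrix, a cheap proxy for checking tolerance values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Three predictors where x3 is (nearly) the sum of x1 and x2 -- a
# classic multicollinearity problem for DA.
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2 + rng.normal(scale=1e-6, size=100)
X = np.column_stack([x1, x2, x3])

# The tolerance of a predictor is 1 - R^2 from regressing it on the
# others; values near 0 flag a near-linear dependency. The condition
# number of the correlation matrix exposes the same problem.
R = np.corrcoef(X.T)
cond = np.linalg.cond(R)
print(cond)  # a huge condition number signals near-singularity
```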

Linearity: Variables change linearly along underlying gradients and linear relationships exist among the variables such that they can be combined in a linear fashion to create the canonical (discriminant) functions. This is not a specified assumption of the math model, but it determines the effectiveness of discriminant analysis.

To diagnose violations of linearity, look at scatterplots of pairs of canonical functions for arched or curvilinear configurations of sample points that often indicate nonlinearities.

Summary
To summarize, when interpreting multiple discriminant functions, which arise from analyses with more than two groups and more than one continuous variable, the different functions are first tested for statistical significance. If the functions are statistically significant, then the groups can be distinguished based on predictor variables. Standardized b coefficients for each variable are determined for each significant function. The larger the standardized b coefficient, the larger is the respective variable's unique contribution to the discrimination specified by the respective discriminant function.

References

Cooley, W. W. and P. R. Lohnes (1971). Multivariate Data Analysis. John Wiley & Sons, Inc.

Dunteman, George H. (1984). Introduction to multivariate analysis. Thousand Oaks, CA: Sage Publications.  Chapter 5 covers classification procedures and discriminant analysis.

Huberty, Carl J. (1994). Applied Discriminant Analysis. John Wiley & Sons: New York.

Klecka, William R. (1980). Discriminant Analysis. Quantitative Applications in the Social Sciences Series, No. 19.  Thousand Oaks, CA: Sage Publications. This is a very accessible small booklet on discriminant analysis, which provides social science examples to explain the procedures.

Lachenbruch, P. A. (1975). Discriminant Analysis. NY: Hafner. For detailed notes on computations.

Manly, Bryan F.J. (2005). Multivariate Statistical Methods: A Primer, Third Edition. Chapman & Hall/CRC: Boca Raton, Florida. This is a good introduction to multivariate statistics for non-mathematicians. It is not intended as a comprehensive textbook, but provides a good place to start. Chapter eight covers discriminant function analysis.

McGarigal, Kevin, Cushman, Sam and Susan Stafford (2000). Multivariate Statistics for Wildlife and Ecology Research. Springer Verlag: New York. This is a good introduction to multivariate statistics for biologists/ecologists. It is very accessible and written with wildlife ecology graduate students in mind.

Morrison, D.F. (1967). Multivariate Statistical Methods. McGraw-Hill: New York. A general textbook explanation.

Overall, J.E. and C.J. Klett (1972). Applied Multivariate Analysis. McGraw-Hill: New York.

Press, S. J. and S. Wilson (1978). Choosing between logistic regression and discriminant analysis. Journal of the American Statistical Association, Vol. 73: 699-705.

Tabachnick, B.G. and L.S. Fidell (1996). Using Multivariate Statistics. Harper Collins College Publishers: New York. Tabachnick and Fidell compare and contrast statistical packages, and can be used with a modicum of pain to understand SPSS result print-outs.
