Concepts of Correspondence Analysis

 

Summer Lindzey

 

Correspondence analysis (CA) is best learned by first considering the problems that ordination techniques in general are meant to resolve.  I address these problems conceptually, rather than mathematically, because a number of in-depth mathematical treatments are already available (see references) and, frankly, are hard to penetrate without a lot of effort and time.  The purpose of this summary is to introduce the problems that motivate the math behind correspondence analysis, with brief digressions into the details; once you have the basics, you can decide whether the technique is appropriate and warrants dedicating time to mastering the intricacies.

 

What is Ordination, and what does it do?

Correspondence analysis is one of many ordination techniques aimed at reducing multivariate data to a manageable number of variables.  The essential question in ordination is: “Is there some other variable than the one(s) I used, possibly theoretical, that will better describe my subject?”  The purpose of ordination is two-fold: to extract theoretical variables that capture the majority of the variation in the data, and to depict the structure the data assume when described by this reduced number of variables.  Depicting the “structure” of the data is just a fancy way of saying “graphing”; in fact, the major appeal of ordination techniques is the two-dimensional graphs they can produce from a potentially inchoate mass of data.  Note that the goal here is not to test the significance of any of the variables, but merely to describe them more concisely.  As such, ordination techniques are only descriptive, and cannot be used to test hypotheses about data structure.

 

How does ordination “reduce” data?  By describing subjects’ responses according to their dominant sources of variation, ordination can produce a single “score” for a subject from multiple (and often redundant) measurements. Ordination can therefore conveniently represent multi-dimensional subjects as single points on an axis (the axis representing the scoring system). Using the same scoring system, other subjects can be evaluated and plotted on the same axis, providing immediate, direct comparisons: subjects occurring close together on the axis are more similar in their responses, while those far apart are less similar.  Furthermore, a second, different scoring system can be generated so as to provide more than one axis along which to plot and compare subjects.  The hope is that when taken together, the two or three scoring systems still capture the majority of the variation in the data.

 

Correspondence Analysis

History and Goals

Community ecologists frequently use CA to identify and distinguish communities on the basis of their species composition. In these studies, each species can be considered a variable, and communities are our subjects, which we attempt to describe efficiently and comprehensively with abundance data for the individual species. Early attempts at defining plant communities focused on using species with known ecological characteristics (e.g., response to moisture or elevation) as indicators of certain environmental conditions and therefore of habitat types and communities.  A community was therefore designated on the basis of the relative abundance of these indicator species.   However, many researchers became interested in whether groups of organisms could be described as communities without reference to pre-existing knowledge of their ecological requirements.

 CA accomplishes this by taking into account the relative distribution of as many species as possible, using observed associations between species, or the lack thereof, to compare or distinguish groups of organisms in different locations.  Correspondence analysis constructs a score for each location sampled by weighting the abundance of species in that location.  The trick is devising weights that adequately represent both the overall distribution of a species, and its co-occurrence with and abundance relative to others.  Remember, the goal is to reduce the data collected on many species to a single index that accounts for as much variation in species as possible.

 

Many algorithms have been used and described as “correspondence analysis”, but I will focus on two of the broadest, most commonly referenced ones here: “Reciprocal Averaging”, and then the modern “Correspondence Analysis” described by Legendre and Legendre (1998). 

 

Reciprocal Averaging

 

Reciprocal averaging (RA) was put forth by M.O. Hill in 1973 as a way of addressing particular weaknesses of PCA (principal components analysis) in describing multivariate community data.  At the time, Hill did not realize his solution was in fact another method of correspondence analysis, being developed concomitantly by other researchers in France.  He later published a treatment that resolved the differences and proved the mathematical relationship between the two.

 

A simple example will help illustrate the problems and goals of correspondence analysis and the reason for deriving weighted averages.  Consider the following artificial data taken from Pielou (1984).  Species are arranged in rows, and abundance data for each were collected in five locations (quadrats):

 

 

              Quad I     Quad II    Quad III   Quad IV    Quad V

Spp 1           15          2          0          2          1

Spp 2            9          6         15          0          0

Spp 3            1          7          5          8         29

 

Given these data, we might wonder if the quadrats represent samples of different communities, and more specifically, whether differences in species abundances across the quadrats can be more succinctly described.  To compare quadrats we could choose, for example, a single number that represents total species abundance and see if that distinguishes the five quadrats.  This produces the following sums, in order of quadrat number:  25, 15, 20, 10, 30.  Note that these sums can then be used as scores to arrange the quadrats, in the following order:

            Quad IV          Quad II            Quad III           Quad I             Quad V

               (10)              (15)              (20)               (25)               (30)

With this simple scoring system, we have essentially created a single axis along which locations are placed relative to each other; in other words we have created an ordination.  Now consider what this order implies, and whether it makes sense:

·        Quad I and III sit next to each other, only five individuals apart, and so appear fairly similar;

·        Quad IV and V are the most different.

Closer inspection of the relative species abundances in quadrats I and III reveals that they are hardly alike: while species one dominates Quad I, it is completely absent in Quad III.  Similarly, Quads IV and V may differ in overall abundance, but comprise the same species, just in different magnitudes of abundance.  A better ordination would take into account not only total species abundance, but the particular abundance of species dominating in a given location. In other words, we need to weight the abundances.
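The total-abundance scores, and the relative abundances that expose their weakness, can be checked with a few lines of Python.  This is only a sketch using NumPy; the variable names (abundance, totals, composition) are my own and do not come from Pielou.

```python
import numpy as np

quadrats = ["I", "II", "III", "IV", "V"]
# Rows are species 1-3, columns are quadrats I-V (the table in the text).
abundance = np.array([
    [15, 2,  0, 2,  1],
    [ 9, 6, 15, 0,  0],
    [ 1, 7,  5, 8, 29],
])

# Score each quadrat by its total abundance, then sort quadrats by that score.
totals = abundance.sum(axis=0)
order = [quadrats[j] for j in np.argsort(totals)]

# Relative abundances: each column rescaled so that it sums to 1.
composition = abundance / totals

print("totals:", dict(zip(quadrats, totals.tolist())))
print("order by total:", order)
print("composition (rows = species, columns = quadrats):")
print(np.round(composition, 2))
```

Comparing the columns of the composition table makes the point of the paragraph above: quadrats with similar totals can have very different species make-up, and quadrats with very different totals can share nearly the same make-up.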

But how do we decide on weights?   Do we give more weight to the species that is most frequent, or to the species that achieves the highest abundance at any given point?  Either choice essentially assigns uniform ecological importance to a given species across all locations, when that clearly may not be the case; the behavior or “importance” of a species may depend on the presence and abundance of others.

Reciprocal averaging approaches this problem on a trial and error basis by assigning arbitrary weights to species and re-calculating them until they settle on stable values.  Thus, the weights are derived from the data themselves, rather than being imposed by the researcher.  The re-calculation is guided by a back-and-forth averaging between rows (species) and columns (quadrats): quadrat scores become weighted averages of species scores, and species scores become weighted averages of quadrat scores.   This back-and-forth averaging between row and column scores gives the name “reciprocal” to the technique.  In general the procedure is as follows (Pielou gives a worked example using the above data):

1)           Assign arbitrary trial values for species weights (ranging 0 to 100)

2)           For each quadrat, calculate the average of the species weights, weighting each species by its abundance in that quadrat; these are the trial quadrat scores.

3)           Return to species data, and re-calculate species scores. Using the scores for each quadrat to weight abundance in that quadrat, calculate a weighted average for each species across the quadrats.  This is your revised set of species scores.

4)           Return to quadrat data, and using the new species scores as weights, calculate new quadrat scores (essentially repeat step 2).

5)           Moving back and forth between species and quadrat scores, re-calculate new scores until they converge (stop changing).
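The five steps above can be sketched in Python as follows.  This is a minimal illustration rather than Hill's or Pielou's own procedure: the function names, the starting weights (100, 50, 0), the convergence tolerance, and the choice to rescale scores to the range 0 to 100 after every pass are all assumptions of mine (the steps above do not say when rescaling happens, but without some rescaling the scores would simply shrink toward a common value).

```python
import numpy as np

# Rows are species 1-3, columns are quadrats I-V (the table in the text).
X = np.array([
    [15, 2,  0, 2,  1],
    [ 9, 6, 15, 0,  0],
    [ 1, 7,  5, 8, 29],
], dtype=float)

def rescale(scores):
    """Stretch scores linearly so that they run from 0 to 100."""
    return 100.0 * (scores - scores.min()) / (scores.max() - scores.min())

def reciprocal_averaging(X, start, tol=1e-10, max_iter=1000):
    row_tot = X.sum(axis=1)   # total abundance of each species
    col_tot = X.sum(axis=0)   # total abundance in each quadrat
    species = rescale(np.asarray(start, dtype=float))        # step 1
    for _ in range(max_iter):
        # Steps 2 and 4: quadrat score = abundance-weighted mean of species scores.
        quadrats = rescale((species @ X) / col_tot)
        # Step 3: species score = abundance-weighted mean of quadrat scores.
        new_species = rescale((X @ quadrats) / row_tot)
        converged = np.max(np.abs(new_species - species)) < tol
        species = new_species
        if converged:                                         # step 5
            break
    quadrats = rescale((species @ X) / col_tot)
    return species, quadrats

species_scores, quadrat_scores = reciprocal_averaging(X, start=[100, 50, 0])
print("species scores:", np.round(species_scores, 1))
print("quadrat scores:", np.round(quadrat_scores, 1))
```

A different set of starting weights should lead to the same ordination, possibly with the axis reversed, which is the sense in which the weights come from the data rather than from the researcher.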

 

 Applying this procedure to our sample data yields the following one-dimensional ordination of quadrats (scores are re-scaled to fit between 0 and 100):

 

            Quad V           Quad IV          Quad II            Quad III           Quad I

             (0)                  (18.6)              (52.1)              (73.0)              (100)

 

Because species scores are weighted according to quadrat scores, and vice versa, the derived scoring system gives a better account of similarities than our first attempt. Quadrats IV and V are now placed close together, each being more similar to the other than to any other quadrat.  In addition, Quadrats I and III, which the total-abundance scores placed close together, are now better distinguished by the derived species weights.   Thus, by accounting for species distributions within and across all quadrats, reciprocal averaging has more adequately described the variation throughout the sampled locations: a successful ordination.

 

Note, however, that while we have separated and ordered the quadrats, the specific ecological or environmental factor underlying this order is not apparent in our analysis; in fact, it has to be inferred from what we know.  Hence, these ordination techniques are considered indirect methods of detecting underlying environmental gradients.  Also, note that we derived only one axis or scoring system; a second is traditionally derived as well. That second axis is constructed with the constraint that it accounts for variation independent of the first.

 

 

Correspondence Analysis using Linear Algebra

Detecting and depicting structure in species distributions can also be accomplished by using a system of linear equations to solve for the “best” weights, or those that will account for the most variation. Conceptually, this is the same approach as PCA (principal components analysis); the difference lies in how the variation in species is quantified before weights are assigned. Rather than outlining the mathematical procedure involved, I will briefly describe the conceptual goals of the technique, which are sometimes the hardest to figure out.

 

In both PCA and CA the weights are derived by eigenanalysis, a technique in matrix algebra that can detect systematic variation among measurements and devise linear equations to represent that systematic variation.  In PCA, the matrix of species abundances is first summarized as a matrix of covariances or correlations, each entry measuring how the abundances of one pair of species vary together across the quadrats.  This matrix is then subjected to eigenanalysis to find weighted combinations of species that account for as much of the variation across the samples as possible. In ordination terminology, the source of these weights, which essentially become coefficients in a linear equation, is a set of eigenvectors.
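As a rough illustration of the PCA route, the sketch below runs an eigenanalysis on the covariance matrix of the three species in Pielou's table, treating quadrats as the samples.  It uses NumPy; the variable names and the choice of covariances rather than correlations are my own assumptions, and with only five quadrats the numbers are for demonstration only.

```python
import numpy as np

# Rows are species 1-3, columns are quadrats I-V (the table in the text).
X = np.array([
    [15, 2,  0, 2,  1],
    [ 9, 6, 15, 0,  0],
    [ 1, 7,  5, 8, 29],
], dtype=float)

samples = X.T                               # one row per quadrat, one column per species
centered = samples - samples.mean(axis=0)   # remove each species' mean abundance

# Species-by-species covariance matrix, computed across quadrats.
cov = np.cov(samples, rowvar=False)

# Eigenanalysis; eigh returns eigenvalues in ascending order, so reverse them.
eigvals, eigvecs = np.linalg.eigh(cov)
idx = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[idx], eigvecs[:, idx]

# Each eigenvector holds the species weights for one axis;
# quadrat scores are the weighted sums of centered abundances.
scores = centered @ eigvecs
explained = eigvals / eigvals.sum()

print("species weights, axis 1:", np.round(eigvecs[:, 0], 3))
print("quadrat scores, axis 1: ", np.round(scores[:, 0], 2))
print("share of variance by axis:", np.round(explained, 3))
```

The eigenvalues give the variance of the quadrat scores along each axis, which is how one judges whether the first one or two axes capture most of the variation.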

In correspondence analysis, the abundance data are transformed to a semi-chi-square statistic (I say semi because it’s never evaluated against the chi-square distribution to test its significance). You will recall that the goal of CA in community ecology is to use species associations to depict distance or similarity between sites, and the chi-square statistic measures the degree to which those associations depart from independence.  Each abundance value is replaced by its contribution to the chi-square statistic, or in some cases a factored form of that contribution, and the variance in these chi-square “distances” is evaluated by eigenanalysis.  Thus, the system of weights used to score sites or quadrats is derived from a metric of species associations, and the more these associations depart from independence, the further separated final scores will be.
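The chi-square route can be sketched in the same way.  The code below is one common formulation and not necessarily the exact algorithm in Legendre and Legendre: each cell is replaced by the signed square root of its chi-square contribution, and a singular value decomposition (whose squared singular values are the eigenvalues) supplies the weights.  Variable names, and the rescaling of the first axis to the range 0 to 100, are my own assumptions.

```python
import numpy as np

# Rows are species 1-3, columns are quadrats I-V (the table in the text).
X = np.array([
    [15, 2,  0, 2,  1],
    [ 9, 6, 15, 0,  0],
    [ 1, 7,  5, 8, 29],
], dtype=float)

P = X / X.sum()                  # table as proportions of the grand total
r = P.sum(axis=1)                # species (row) totals
c = P.sum(axis=0)                # quadrat (column) totals
expected = np.outer(r, c)        # proportions expected under independence

# Signed square root of each cell's chi-square contribution (per unit total).
residuals = (P - expected) / np.sqrt(expected)
total_inertia = (residuals ** 2).sum()   # chi-square statistic / grand total

# Eigenanalysis via SVD: squared singular values are the eigenvalues.
U, sing, Vt = np.linalg.svd(residuals, full_matrices=False)
eigenvalues = sing ** 2

# First-axis scores for species and quadrats.
species_axis1 = U[:, 0] / np.sqrt(r)
quadrat_axis1 = Vt[0] / np.sqrt(c)
if quadrat_axis1[0] < quadrat_axis1[4]:  # SVD sign is arbitrary; put Quad I high
    species_axis1, quadrat_axis1 = -species_axis1, -quadrat_axis1

span = quadrat_axis1.max() - quadrat_axis1.min()
rescaled = 100.0 * (quadrat_axis1 - quadrat_axis1.min()) / span

print("eigenvalues:", np.round(eigenvalues, 4))
print("quadrat scores, axis 1 (0-100):", np.round(rescaled, 1))
```

Given the relationship between the two methods that Hill established, the rescaled first-axis quadrat scores should agree with the reciprocal averaging ordination reported earlier for the same data.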

 

CA often produces more reliable ordinations than PCA in community ecology, presumably because it better models the non-linear responses of species to environmental and ecological gradients.  The linear combinations calculated in PCA can either manufacture differences where they do not exist, or likewise obscure subtle differences in species associations that separate communities.  Presumably, the chi-square metric in CA preserves ecological distance by modeling differences in associations rather than abundances of single species.  

 

But CA has a major fault that militates against its use, commonly referred to as the “arch” or “horseshoe” effect.  When a second, independent axis is constructed, its scores often turn out to be a curved function of the scores on the first axis, so that sites plotted on the two axes trace a spurious “arch” that may not correspond to underlying ecological forces.  Detrended correspondence analysis (DCA) was devised in an effort to eliminate the arch, but the procedure is somewhat arbitrary and in some cases may eliminate actual underlying structure in the data. Thus some researchers suggest avoiding DCA, while others suggest CA is not valid without it.  In response, more researchers are turning to canonical correspondence analysis, which constrains the ordination to known environmental measurements and thereby eliminates some of the guesswork in inferring underlying causality.

 

 

 

Suggested references:

 

Pielou, E.C.   1984.  Interpretation of Ecological Data: A Primer on Classification and Ordination. Wiley,  New York.  The most common reference I found.  A good introduction to general matrix algebra operations, and a good starting point, but the descriptions of PCA and CA are somewhat outdated.  Legendre and Legendre’s treatment is more satisfying.

 

Ludwig, J. and J. Reynolds. 1988.  Statistical Ecology: A Primer on Methods and Computing.  Wiley, New York.  Easy-to-follow outline of the procedures involved in PCA and CA; working through the chapter on PCA helps to understand CA.  They don’t explain why the procedures or certain statistics are used, however.

 

Legendre, P. and  L. Legendre. 1998.  Numerical Ecology.  Elsevier, Amsterdam. The most thorough and modern treatment of ordination I’ve found; surprisingly readable.  I strongly suggest working through the chapter on matrix algebra first.

 

Searle, S.  1982.  Matrix Algebra Useful for Statistics.  Wiley,  New York.  If you have time, the best, most accessible introduction to matrix algebra for these purposes.