Concepts of Correspondence Analysis
Summer Lindzey
Correspondence analysis (CA)
is best learned by first considering the problems that ordination techniques in
general are meant to resolve. I address
the problems conceptually, rather than mathematically, because a number of
in-depth, mathematical treatments are already available (see references), and
frankly, are hard to penetrate without a lot of effort and time. The purpose of this summary is to introduce
the problems that motivate the math behind correspondence analysis, with some
brief digression to the details; once you have the basics, you can decide if
the technique is appropriate and warrants dedicating time to mastering the
intricacies.
What is Ordination, and what does it do?
Correspondence analysis is
one of many ordination techniques aimed at reducing multivariate data into a
manageable number of variables. The
essential question in ordination is: “Is there some other variable than the
one(s) I used, possibly theoretical, that will better
describe my subject?” The purpose of
ordination is two-fold: to extract theoretical variables that capture the
majority of the variation in the data and to depict the structure the data
assumes when described by this reduced number of variables. Depicting the “structure” of the data is just
a fancy way of saying “graphing”; in fact, the major appeal of ordination
techniques is the two-dimensional graphs they can produce from a potentially
inchoate mass of data. Note that the
goal here is not to test the significance of any of the variables, but merely
to describe them more concisely. As
such, ordination techniques are only descriptive, and cannot be used to test
hypotheses about data structure.
How does ordination “reduce”
data? By describing subjects’ responses
according to their dominant sources
of variation, ordination can produce a single “score” for a subject from
multiple (and often redundant) measurements. Ordination can therefore
conveniently represent multi-dimensional subjects as single points on an axis
(the axis representing the scoring system). Using the same scoring system,
other subjects can be evaluated and plotted on the same axis, providing
immediate, direct comparisons: subjects occurring close together on the axis
are more similar in their responses, while those farther apart are less
similar. Furthermore, another, different
scoring system can be generated so as to provide more than one axis along which
to plot and compare subjects. The hope
is that when taken together, the two or three scoring systems still capture the
majority of the variation in the data.
Correspondence Analysis
History and Goals
Community
ecologists frequently use CA to identify and distinguish communities on the
basis of their species composition. In these studies, each species can be
considered a variable, and communities our subjects, which we attempt to
describe efficiently and comprehensively with abundance data for the individual
species. Early attempts at defining plant communities focused on using species
with known ecological characteristics (e.g.,
response to moisture or elevation) as indicators of certain environmental
conditions and therefore of habitat types and communities. A community was therefore designated on the
basis of the relative abundance of these indicator species. However, many researchers became interested
in whether groups of organisms could be described as communities without
reference to pre-existing knowledge of their ecological requirements.
CA accomplishes this by taking into account
the relative distribution of as many species as possible, using observed
associations between species, or the lack thereof, to compare or distinguish
groups of organisms in different locations.
Correspondence analysis constructs a score for each location sampled by
weighting the abundance of species in that location. The trick is devising weights that adequately
represent both the overall distribution of a species, and its co-occurrence
with and abundance relative to others.
Remember, the goal is to reduce the data collected on many species to a
single index that accounts for as much variation in species as possible.
Many
algorithms have been used and described as “correspondence analysis”, but I
will focus on two of the broadest, most commonly referenced ones here:
“Reciprocal Averaging”, and then the modern “Correspondence Analysis” described
by Legendre and Legendre
(1998).
Reciprocal Averaging.
Reciprocal averaging (RA)
was put forth by M.O. Hill in 1973 as a way of addressing particular weaknesses
of PCA (principal components analysis) in describing multivariate community
data. At the time, Hill did not realize
his solution was in fact another method of correspondence analysis, being
developed concomitantly by other researchers in
A simple example will help
illustrate the problems and goals of correspondence analysis and the reason
for deriving weighted averages. Consider
the following artificial data taken from Pielou
(1984). Species are arranged in rows, and
abundance data on each is collected in five locations (quadrats):
|       | Quad I | Quad II | Quad III | Quad IV | Quad V |
| Spp 1 | 15     | 2       | 0        | 2       | 1      |
| Spp 2 | 9      | 6       | 15       | 0       | 0      |
| Spp 3 | 1      | 7       | 5        | 8       | 29     |
Given this data, we might wonder
if the quadrats represent samples of different communities, and more
specifically, whether differences in species abundances across the quadrats can
be more succinctly described. To compare
quadrats we could choose, for example, a single number that represents total
species abundance and see if that distinguishes the five quadrats. This produces the following sums, in order of
quadrat number: 25, 15, 20, 10, 30. Note that these
sums can then be used as scores to arrange the quadrats, in the following
order:
Quad IV    Quad II    Quad III    Quad I    Quad V
(10)       (15)       (20)        (25)      (30)
With this simple scoring
system, we have essentially created a single axis along which locations are
placed relative to each other; in other words, we have created an ordination. Now consider what this order implies, and if
it makes sense:
· Quad I and III are neighbors on the axis, with similar scores (25 and 20);
· Quad IV and V are the most different.
Closer inspection of the
relative species abundances in quadrats I and III reveals that they are hardly
alike: while species one dominates Quad I, it is completely absent in Quad
III. Similarly, Quads IV and V may differ in
overall abundance, but comprise the same species, just in different magnitudes
of abundance. A better ordination would
take into account not only total species abundance, but the particular
abundance of species dominating in a given location. In other words, we need to
weight the abundances.
But how do we decide on weights? Do we give more weight to the species that
is most frequent, or the species that achieves the highest abundance at any
given point? Either choice essentially assigns
uniform ecological importance to a given species across all locations, when
that clearly may not be the case; the behavior or “importance” of a species may
depend on the presence and abundance of others.
Reciprocal
averaging approaches this problem on a trial-and-error basis by assigning arbitrary
weights to species and re-calculating them until they converge on stable
values. Thus, the weights are derived from the data themselves, rather than being imposed by the researcher. The re-calculation is guided by a back-and-forth
averaging between rows (species) and columns (quadrats): quadrat
scores become weighted averages of species scores, and species scores become weighted
averages of quadrat scores. The back-and-forth
averaging between row and column scores gives the name “reciprocal” to
the technique. In general the procedure
is as follows (Pielou gives a worked example using
the above data):
1) Assign arbitrary trial values for species weights (ranging 0 to 100).
2) Calculate average abundance for each quadrat using the above weights for each species; these are trial quadrat scores.
3) Return to species data, and re-calculate species scores. Using the scores for each quadrat to weight abundance in that quadrat, calculate a weighted average for each species across the quadrats. This is your revised set of species scores.
4) Return to quadrat data, and using the new species scores as weights, calculate new quadrat scores (essentially repeat step 2).
5) Moving back and forth between species and quadrat scores, re-calculate new scores until they converge (stop changing).
Applying this procedure to our sample data
yields the following one-dimensional ordination of quadrats (scores are
re-scaled to fit between 0 and 100):
Quad V    Quad IV    Quad II    Quad III    Quad I
(0)       (18.6)     (52.1)     (73.0)      (100)
Because species
scores are weighted according to quadrat scores, and vice versa, the scoring
system derived is a better account of similarities than our first attempt.
Quadrats IV and V are accurately represented close together, each being more
similar to the other than other quadrats.
In addition, Quadrats I and III no longer have the same score, but are
more adequately distinguished by the species weights derived by our second
attempt. Thus, by accounting for
species distributions within and across all quadrats, reciprocal averaging has
more adequately described the variation throughout the sampled locations – a
successful ordination.
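The five steps of reciprocal averaging can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not Hill's or Pielou's own procedure: the function names (rescale, reciprocal_averaging), the starting weights of 100, 50 and 0, and the convergence tolerance are all chosen here for illustration. Scores are re-scaled to run from 0 to 100 after every pass, because repeated averaging would otherwise shrink them toward a single value.

```python
# Abundance table from the text: rows are species, columns are quadrats I-V.
ABUNDANCE = [
    [15, 2, 0, 2, 1],    # Spp 1
    [9, 6, 15, 0, 0],    # Spp 2
    [1, 7, 5, 8, 29],    # Spp 3
]
QUADRATS = ["I", "II", "III", "IV", "V"]


def rescale(scores):
    """Stretch scores linearly so that they run from 0 to 100."""
    lo, hi = min(scores), max(scores)
    return [100.0 * (s - lo) / (hi - lo) for s in scores]


def reciprocal_averaging(table, start, tol=1e-10, max_iter=1000):
    """Alternate weighted averages over rows and columns until scores settle."""
    n_rows, n_cols = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    def quadrat_scores(species):
        # Steps 2 and 4: abundance-weighted mean of the species scores.
        return [
            sum(table[i][j] * species[i] for i in range(n_rows)) / col_totals[j]
            for j in range(n_cols)
        ]

    species = rescale(start)                     # step 1: trial species weights
    for _ in range(max_iter):
        quadrats = quadrat_scores(species)
        # Step 3: abundance-weighted mean of the quadrat scores, re-scaled.
        updated = rescale([
            sum(table[i][j] * quadrats[j] for j in range(n_cols)) / row_totals[i]
            for i in range(n_rows)
        ])
        change = max(abs(a - b) for a, b in zip(updated, species))
        species = updated
        if change < tol:                         # step 5: scores stopped changing
            break
    return species, rescale(quadrat_scores(species))


species_scores, quadrat_scores = reciprocal_averaging(ABUNDANCE, start=[100, 50, 0])
for name, score in sorted(zip(QUADRATS, quadrat_scores), key=lambda pair: pair[1]):
    print(f"Quad {name}: {score:.1f}")
```

Run on the table above, the sketch settles on the quadrat scores quoted in the text (0, 18.6, 52.1, 73.0 and 100). Other starting weights change the number of passes needed, and may reverse the direction of the axis, but they should not change the spacing of the quadrats along it.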
Note,
however, that while we have separated and ordered the quadrats, the specific
ecological or environmental factor underlying this order is not apparent in our
analysis; in fact, it has to be inferred from what we know. Hence, these ordination techniques are
considered indirect methods of detecting underlying environmental gradients. Also, note that we derived only one axis or scoring
system, and another is traditionally developed. That second axis is constructed
with the constraint that it accounts for variation independent of the first.
Correspondence Analysis using Linear Algebra
Detecting
and depicting structure in species distributions can also be accomplished by
using a system of linear equations to solve for the “best” weights, or those
that will account for the most variation. Conceptually, this is the same
approach as PCA (principal components analysis), but the difference lies in how
the variation in species is quantified before weights are assigned. Rather than
outlining the mathematical procedure involved, I will briefly describe the
conceptual goals of the technique, which are sometimes the hardest to figure
out.
In
both PCA and CA the weights are derived by eigenanalysis,
a technique in matrix algebra that can detect systematic variation among
measurements and devise linear equations to represent that systematic
variation. In PCA, the matrix of species
abundances is transformed into a matrix of covariances or correlations, each
abundance value being replaced by a measure of its correlation (or covariance)
with other abundances in other quadrats.
This correlation matrix is then subjected to eigenanalysis
to determine weighted combinations of species that can combine and account for
variation across all the samples. In ordination terminology, the specific
source of these weights, which essentially become coefficients in a linear
equation, is a
set of eigenvectors.
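To make the PCA route concrete, here is a minimal Python sketch using the three-species table from the reciprocal averaging example. It rests on assumptions the text does not spell out: quadrats are treated as the observations and species as the variables, the covariance (rather than correlation) matrix is used, and the dominant eigenvector is found by simple power iteration, which stands in for the eigen-solver of a linear algebra library. The helper name dominant_eigenpair is hypothetical.

```python
import math

# Abundance table from the text: rows are species, columns are quadrats I-V.
ABUNDANCE = [
    [15, 2, 0, 2, 1],    # Spp 1
    [9, 6, 15, 0, 0],    # Spp 2
    [1, 7, 5, 8, 29],    # Spp 3
]


def dominant_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue of a covariance-type matrix and its unit vector."""
    size = len(matrix)
    vector = [1.0] + [0.0] * (size - 1)
    for _ in range(iterations):
        product = [sum(matrix[i][k] * vector[k] for k in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    product = [sum(matrix[i][k] * vector[k] for k in range(size)) for i in range(size)]
    eigenvalue = sum(a * b for a, b in zip(vector, product))
    return eigenvalue, vector


n_species, n_quadrats = len(ABUNDANCE), len(ABUNDANCE[0])

# Center each species on its mean abundance across the quadrats.
centered = [[x - sum(row) / n_quadrats for x in row] for row in ABUNDANCE]

# Covariance between every pair of species (a symmetric 3 x 3 matrix).
covariance = [
    [sum(a * b for a, b in zip(centered[i], centered[k])) / (n_quadrats - 1)
     for k in range(n_species)]
    for i in range(n_species)
]

eigenvalue, weights = dominant_eigenpair(covariance)
share = eigenvalue / sum(covariance[i][i] for i in range(n_species))

# Quadrat scores: the eigenvector entries act as coefficients on centered abundances.
pca_scores = [
    sum(weights[i] * centered[i][j] for i in range(n_species))
    for j in range(n_quadrats)
]
print("species weights:", [round(w, 3) for w in weights])
print("share of variance on axis 1:", round(share, 3))
print("quadrat scores:", [round(s, 1) for s in pca_scores])
```

With these data the first axis should carry roughly three quarters of the total variance. The sign of the eigenvector is arbitrary, so the axis may come out reversed relative to another implementation.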
In
correspondence analysis, the abundance data is transformed to a semi-chi-square
statistic (I say semi, because it’s never evaluated
according to the chi-square distribution to test for its significance). You
will recall that the goal of CA in community ecology is to use species
associations to depict distance or similarity in sites, and the chi-square
statistic is used to measure the degree to which those associations depart from
independence. Each abundance value is
replaced by its chi-square statistic, or in some cases a factored form of this statistic, and the variance in these chi-square “distances”
is evaluated by eigenanalysis. Thus, the system of weights used to score
sites or quadrats is derived from a metric of species associations, and the
more these associations depart from independence, the further separated final
scores will be.
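The chi-square route can be sketched in the same way. The block below is a minimal illustration, not a definitive implementation: it assumes one common factored form of the statistic, q_ij = (p_ij - r_i c_j) / sqrt(r_i c_j), where p_ij is an abundance divided by the grand total and r_i and c_j are the species and quadrat totals divided by the grand total. As an assumption for self-containment, power iteration again stands in for a library eigen-solver, and the names chi_square_matrix and dominant_eigenpair are hypothetical.

```python
import math

# Abundance table from the text: rows are species, columns are quadrats I-V.
ABUNDANCE = [
    [15, 2, 0, 2, 1],    # Spp 1
    [9, 6, 15, 0, 0],    # Spp 2
    [1, 7, 5, 8, 29],    # Spp 3
]


def chi_square_matrix(table):
    """Replace each abundance by its signed chi-square contribution (factored form)."""
    total = sum(map(sum, table))
    p = [[x / total for x in row] for row in table]
    r = [sum(row) for row in p]              # species (row) proportions
    c = [sum(col) for col in zip(*p)]        # quadrat (column) proportions
    q = [
        [(p[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j]) for j in range(len(c))]
        for i in range(len(r))
    ]
    return q, r, c


def dominant_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue of a covariance-type matrix and its unit vector."""
    size = len(matrix)
    vector = [1.0] + [0.0] * (size - 1)
    for _ in range(iterations):
        product = [sum(matrix[i][k] * vector[k] for k in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    product = [sum(matrix[i][k] * vector[k] for k in range(size)) for i in range(size)]
    return sum(a * b for a, b in zip(vector, product)), vector


q, r, c = chi_square_matrix(ABUNDANCE)
inertia = sum(x * x for row in q for x in row)   # chi-square statistic / grand total

# Species-by-species cross-product matrix of Q (3 x 3), then its dominant eigenpair.
cross = [[sum(a * b for a, b in zip(q[i], q[k])) for k in range(len(q))] for i in range(len(q))]
eigenvalue, u = dominant_eigenpair(cross)

# First-axis quadrat scores: project the columns of Q onto u, then undo the sqrt(c) weighting.
axis1 = [sum(q[i][j] * u[i] for i in range(len(u))) / math.sqrt(c[j]) for j in range(len(c))]
if axis1[0] < axis1[4]:                          # eigenvector sign is arbitrary
    axis1 = [-s for s in axis1]
lo, hi = min(axis1), max(axis1)
ca_scores = [100.0 * (s - lo) / (hi - lo) for s in axis1]
print("total inertia:", round(inertia, 4))
print("first eigenvalue:", round(eigenvalue, 4))
print("quadrat scores (0-100):", [round(s, 1) for s in ca_scores])
```

Once re-scaled to 0-100, the first-axis quadrat scores from this sketch should coincide with the reciprocal averaging scores quoted earlier (100, 52.1, 73.0, 18.6 and 0 for Quads I to V), which illustrates that the two procedures arrive at the same axis. With these data the first eigenvalue is about two thirds of the total inertia.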
CA
often produces more reliable ordinations than PCA in community ecology,
presumably because it better models the non-linear responses of species to
environmental and ecological gradients.
The linear combinations calculated in PCA can either manufacture
differences where they do not exist, or conversely obscure subtle differences in
species associations that separate communities.
Presumably, the chi-square metric in CA preserves ecological distance by
modeling differences in associations rather than abundances of single
species.
But
CA has a major fault that militates against its use, commonly referred to as
the “arch” or “horseshoe” effect. When
constructing a second, yet independent, axis for plotting scores, CA often
introduces a spurious “arch”: scores on the second axis rise and then fall along the first axis, a curve that
may not correspond to underlying ecological forces. Detrended
correspondence analysis was devised in an effort to eliminate the arch, but the
procedure is somewhat arbitrary and in some cases may eliminate actual
underlying structure in the data. Thus, some researchers suggest avoiding DCA,
while others suggest CA is not valid without it. As a response, more researchers are turning
to canonical correspondence analysis, which constrains the ordination to known
environmental measurements and thereby eliminates some of the guesswork in
inferring underlying causality.
Suggested references:
Pielou, E.C. 1984. Interpretation of Ecological Data: A Primer on Classification and Ordination. Wiley.
Ludwig, J. and J. Reynolds. 1988. Statistical Ecology: A Primer on Methods and Computing. Wiley.
Legendre, P. and L. Legendre. 1998. Numerical Ecology. Elsevier.
Searle, S. 1982. Matrix Algebra Useful for Statistics. Wiley.