Spatial Autocorrelation
Andrea Zuur
Introduction
The
goal of this presentation is to provide ecology students with an understandable
primer on spatial autocorrelation within the context of ecology. The assumptions of spatial and classical
statistics are compared, spatial autocorrelation is defined, and the most
common spatial autocorrelation functions are reviewed. Spatial statistics is a vast field; consequently, experimental design,
scale, data type, and time-series analysis are not covered. References and resources, including software,
are presented at the end for those interested in more information.
Spatial Analysis
The
distribution of species, a driving force in ecology and conservation biology,
often occurs in patterns such as gradients or clusters. These patterns are the geographic result of
the interaction of geologic, climatic, topographic, and biological
variables. Spatial statistics provides
tools with which these patterns can be analyzed. The origin of spatial statistics is
attributed to South African mining engineer D. G. Krige, who developed
techniques to predict the location of ores within geologic formations. Matheron (1963) formalized and extended these
interpolation techniques, naming the method kriging in Krige's honor. The techniques developed by
Krige, Matheron and others have their home in geostatistics and are now applied
in fields ranging from epidemiology to ecology to analyze data distributed in
space. The space in which the data are
measured can be geographic, such as distances between ores in a geological
formation, trees in a forest, points in the human brain, or more abstract,
where distance is genetic and based on allele frequencies rather than
geographic distance.
Variables are measured at finite locations within a coordinate
system. Measuring variables at specific
coordinates allows the calculation of distances between measurements, and hence
the analysis of spatial patterns within the data.
Assumptions of Classical and
Spatial Statistics
Randomness
One
of the goals of statistics is to characterize a population based on a sampling
of that population. Both classical and
spatial statistics are based on the assumption that samples are randomly chosen
from the population. If samples are not randomly
chosen, the sample may be biased, and statistics calculated from a biased
sample will not accurately characterize the population of interest.
Classical
statistical tests rely on the assumption that subjects are independent of one
another. For example, if sampling to
determine the distribution of a plant, a researcher randomly places quadrats
throughout the study area; all individuals within each quadrat are counted to
determine abundance. The subject is
the quadrat and the response variable is species abundance. Each quadrat must be chosen such that it is
independent of all other quadrats. Each
independent unit, in this case each quadrat, counts as
a degree of freedom. If the quadrats are
not independent of one another, the degrees of freedom will be
overestimated, and the effective sample size will be smaller than the
number of quadrats suggests. Violating the assumption of independence in
this way increases the probability of rejecting a true null
hypothesis and committing a Type I error.
The
assumption of independence implies that variables are homogeneous. In such a world, abiotic and biotic
parameters would exhibit homogeneous distributions; precipitation, soil,
wind, plants and animals would be constant, unvarying and evenly distributed
across a landscape. In ecology, this is
almost never the case. Abiotic and biotic
parameters vary over space and time and their interaction results in complex
spatial patterns. The abundance of a plant at one location may
be much greater than at other locations due to differences in parameters such
as soil type and slope, which vary over space, and precipitation, wind
direction and wind speed, which vary over space and time. For example, abundance values measured in quadrats
at the bottom of a hill will be more similar to one another than to abundance values measured in
quadrats at the summit of the hill.
The quadrats at the bottom of the hill will have more soil moisture, more accumulated soil organic matter, and most
likely be more sheltered from wind than those quadrats at the top of the
hill. The variability in the independent
variables and their subsequent effects on the response variable belie the
assumption of independence and the implication of homogeneity.
In
spatial statistics, subjects are assumed to be dependent. In addition it is commonly assumed for
certain tests that the data are stationary and isotropic. The assumption of stationarity requires that
data are normally distributed with the same mean and variance throughout the study area. If data are isotropic, the characteristics of
patterns within the data are constant in all directions, whereas anisotropic
data exhibit a pattern that varies with direction. A more relaxed assumption than stationarity is the
intrinsic hypothesis, which assumes only that the differences between values at pairs of
points a given distance apart have a mean of zero and a finite variance.
Spatial autocorrelation:
definition and causes
Definition
Autocorrelation
literally means that a variable is correlated with itself. The simplest definition of autocorrelation
states that pairs of subjects that are close to each other are more likely to
have values that are more similar, and pairs of subjects far apart from each
other are more likely to have values that are less similar. The spatial structure of the data refers to
any patterns that may exist. Gradients
or clusters are examples of spatial structures that are positively correlated,
whereas negative correlation may be exhibited in a checkerboard pattern where
subjects appear to repulse each other.
When data are spatially autocorrelated, it is possible to predict the
value at one location from values sampled at nearby locations
using interpolation methods. The
absence of autocorrelation implies data are independent.
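This predict-from-neighbors idea can be sketched with inverse-distance weighting, a much simpler interpolator than the kriging mentioned in the introduction. The coordinates and values below are hypothetical, purely for illustration.

```python
def idw_predict(known, target, power=2):
    """Inverse-distance-weighted prediction at an unsampled location.

    known: list of ((x, y), value) sample points (hypothetical data)
    target: (x, y) location at which to predict
    """
    num = den = 0.0
    for (px, py), v in known:
        d = ((px - target[0]) ** 2 + (py - target[1]) ** 2) ** 0.5
        if d == 0:
            return v  # target coincides with a sample point
        w = 1.0 / d ** power  # nearer samples get larger weights
        num += w * v
        den += w
    return num / den

# A point midway between two samples gets the average of their values.
samples = [((0.0, 0.0), 10.0), ((2.0, 0.0), 20.0)]
midpoint_value = idw_predict(samples, (1.0, 0.0))
```

The weighting power controls how quickly influence decays with distance; larger powers make the prediction more local.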
Causes
There
are several causes of autocorrelation which can be simplified into two
categories, ‘spurious’ and ‘real’. It
should be noted that the terms ‘real’ and ‘spurious’ are defined differently by
different authors. Here we will define
spurious autocorrelation as an artifact of experimental design that most often
occurs when samples have not been randomly chosen, but it can also result from
some other aspect of the experimental design.
Real autocorrelation can be defined as caused by the interaction of a
response variable with itself (univariate) or with independent variables
(multivariate) due to some inherent characteristic of the variable(s) such as
habitat preference, mode of reproduction, etc.
In the case of univariate spatial autocorrelation, assuming the soil
type is constant and wind is not a factor, a plant may be found in clusters
simply because seeds fall to the ground and germinate near the parent plant,
making the response variable, plant abundance, spatially autocorrelated. In the case of multivariate spatial
autocorrelation, the independent variables soil type, wind speed and wind
direction interact; resulting in clusters of plants on the preferred soil type
along a gradient following wind direction.
Both cases result in similar values of abundance in quadrats close
together and far apart. Those quadrats
located in clusters will have similarly high values whereas quadrats that are
not located in clusters will have similarly low values.
The relationship between
variance, covariance and correlation
The
functions most often used to describe spatial autocorrelation are related to
variance, covariance and of course, correlation. The variance is a measure of dispersion of
a population, whereas covariance is a measure of the association between two
variables. We can see how variance is
related to covariance, the expected value of the product of the deviations of two
variables from their means, by reviewing the equations for sample variance and covariance:
Sample variance:

$$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

where:
$s_x^2$ = sample variance of variable $x$
$x_i$ = values of variable $x$ from $i = 1$ to $n$
$\bar{x}$ = sample mean of variable $x$
$n$ = sample size

Sample covariance:

$$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

where:
$s_{xy}$ = sample covariance between variables $x$ and $y$
$x_i$, $y_i$ = values of variables $x$ and $y$ from $i = 1$ to $n$
$\bar{x}$, $\bar{y}$ = sample means of variables $x$ and $y$
$n$ = sample size
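The two formulas differ only in their numerators, which a few lines of code make concrete. This is a minimal sketch in plain Python with hypothetical values; note that the covariance of a variable with itself reduces to its variance.

```python
def sample_variance(x):
    """s^2 = sum((x_i - mean)^2) / (n - 1)"""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** 2 for xi in x) / (n - 1)

def sample_covariance(x, y):
    """s_xy = sum((x_i - x_bar) * (y_i - y_bar)) / (n - 1)"""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

# Hypothetical data: the covariance of x with itself is its variance.
x = [2.0, 4.0, 6.0, 8.0]
assert abs(sample_covariance(x, x) - sample_variance(x)) < 1e-12
```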
A
correlation coefficient also provides a measure of how strongly two variables
are associated. It can be loosely
defined as a normalized form of the covariance, in which the covariance is the
numerator. Dividing the covariance by the
product of the standard deviations of the two variables standardizes it, resulting
in a value ranging from (–1) to (+1), where values between (0) and (+1)
indicate a positive association between variables, values between (–1) and (0)
indicate a negative association, and (0) indicates there is no correlation
between variables. An example of a
correlation coefficient commonly used in classical statistics is Pearson’s
correlation coefficient:
$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2\,\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

where:
$r_{xy}$ = sample correlation coefficient between variables $x$ and $y$
$x_i$, $y_i$ = values of variables $x$ and $y$ from $i = 1$ to $n$
$\bar{x}$, $\bar{y}$ = sample means of variables $x$ and $y$
(the denominator is the square root of the product of the sums of squared deviations of $x$ and $y$)
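Pearson's coefficient can be sketched directly from this formula. The data below are hypothetical, chosen so that a perfect positive and a perfect negative linear association give coefficients of exactly +1 and –1.

```python
def pearson_r(x, y):
    """Pearson correlation: centered cross-products over the product of
    the root sums of squared deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = (sum((xi - mx) ** 2 for xi in x)
           * sum((yi - my) ** 2 for yi in y)) ** 0.5
    return num / den

# Hypothetical data: y = 2x + 1 is perfectly positively correlated with x.
x = [1.0, 2.0, 3.0, 4.0]
y_up = [3.0, 5.0, 7.0, 9.0]
y_down = [9.0, 7.0, 5.0, 3.0]
```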
Structure functions
Because the goal of measuring spatial
autocorrelation is to determine whether samples close together have similar
values and vice versa, the functions include terms that account for the distance
between samples. Rather than taking the
difference between a sample value and its mean _{} as is done in
classical variance or correlation functions, spatial functions take the
difference in values between all pairs of samples _{} located a
given distance apart. Variance and
correlation values can be calculated for the extent of the sample area.
It is often more informative to calculate
and plot values for pairs of samples occurring at different distance
intervals. Each distance interval, referred to as a lag, is chosen by the
researcher and should be based on a priori knowledge about the system being
studied. Each point on such a plot represents the variance or correlation for
all pairs of points within one lag, providing information about the spatial
structure of the data. Plots of semivariance are called variograms; plots of
correlation coefficients are called correlograms.
In order to create a variogram or correlogram, a matrix of distances
between each pair of samples is created.
From this matrix, matrices are generated for each distance interval. All pairs of points located within a given
distance interval are represented by the value of (1); all pairs of points not
within that distance interval are given a value of (0) in the matrix. Using these matrices, correlation
coefficients are then calculated for all points within each distance
interval. It is also possible to produce
directional variograms and correlograms
in which semivariance or correlation within a distance
interval is plotted for a specific direction.
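The distance matrix and the per-lag 0/1 matrices described above can be sketched as follows, assuming Euclidean distance and hypothetical coordinates.

```python
def distance_matrix(coords):
    """Euclidean distances between every pair of (x, y) coordinates."""
    n = len(coords)
    return [[((coords[i][0] - coords[j][0]) ** 2
              + (coords[i][1] - coords[j][1]) ** 2) ** 0.5
             for j in range(n)] for i in range(n)]

def lag_indicator(dist, lo, hi):
    """0/1 matrix: 1 if a pair's distance falls in [lo, hi), 0 otherwise.
    The diagonal (a point paired with itself) is always 0."""
    n = len(dist)
    return [[1 if i != j and lo <= dist[i][j] < hi else 0
             for j in range(n)] for i in range(n)]

# Hypothetical coordinates along a 3-4-5 line.
coords = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d = distance_matrix(coords)
w = lag_indicator(d, 0.0, 6.0)  # pairs within the first lag class
```

Coefficients for one lag are then computed using only the pairs flagged with 1 in that lag's matrix.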
Semivariance and the Variogram
The semivariogram, commonly referred
to as the variogram, is a plot of semivariance as a function of distance. The variogram originated in the field of
geostatistics as a component of kriging.
Semivariance measures the dissimilarity of values within a single
variable, whereas covariance measures the similarity between two
variables. Unlike a correlation
coefficient, semivariance is not normalized and values are not constrained as
are most correlation coefficients. The
following version of the variogram is the numerator of Geary’s C, a spatial
autocorrelation coefficient:
$$\gamma(d) = \frac{1}{2W}\sum_{h=1}^{n}\sum_{i=1}^{n} w_{hi}\,(y_h - y_i)^2 \quad \text{for } h \neq i$$

where:
$\gamma(d)$ = semivariance as a function of distance
$W$ = sum of the values $w_{hi}$ within the weight matrix
$n$ = sample size
$w_{hi}$ = elements of the weight matrix as a function of distance: 1 if $y_h$ and $y_i$ are within a given distance class (for $h \neq i$), 0 in all other cases
$y_h$, $y_i$ = values at a pair of sample points, for $h \neq i$
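This Geary-numerator form of the semivariance can be sketched directly, with the 0/1 weights computed on the fly from a lag interval. Coordinates and values below are hypothetical, and distances are assumed Euclidean.

```python
def semivariance(values, coords, lo, hi):
    """gamma(d) = (1 / 2W) * sum of (y_h - y_i)^2 over pairs whose
    separation distance falls in the lag interval [lo, hi)."""
    num = 0.0
    W = 0  # number of (directed) pairs in this distance class
    n = len(values)
    for h in range(n):
        for i in range(n):
            if h == i:
                continue
            d = ((coords[h][0] - coords[i][0]) ** 2
                 + (coords[h][1] - coords[i][1]) ** 2) ** 0.5
            if lo <= d < hi:
                num += (values[h] - values[i]) ** 2
                W += 1
    return num / (2 * W) if W else float("nan")

# Hypothetical transect: a perfect gradient along a line.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
values = [0.0, 1.0, 2.0]
gamma_lag1 = semivariance(values, coords, 0.5, 1.5)
gamma_lag2 = semivariance(values, coords, 1.5, 2.5)
```

On a gradient like this, semivariance grows with lag, which is the classic rising variogram shape described below.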
A classic variogram (Figure 1.a), calculated from spatially dependent data in
which semivariance increases with the distance between points (the lag), has
several distinguishing features. The range is the lag distance
at which the data become independent.
The sill is the semivariance value at which the variogram levels off,
corresponding to the range. The nugget is the distance on the y-axis
between zero and the y-intercept. It
represents unaccounted variability due to measurement error or small-scale
variability at distances smaller than the smallest sampling distance.
Semivariance calculated for independent data remains constant as lag
increases. Unless a distinct gradient is
present, most variograms will show much more
variation than Figure 1.a.
Figure 1: Empirical variograms generated in S-Plus
using the default of 20 lags. a) The
semivariance of mean temperature from locations at different elevations,
showing that temperature is spatially dependent. As the distance between pairs of points
increases, temperature values become less similar and autocorrelation
decreases. b) Using the same
coordinates, the temperature data were replaced with randomly generated numbers,
showing the data are truly independent.
Moran’s I and Geary’s c: Univariate
Spatial Correlation Coefficients and Correlograms
These coefficients measure spatial
autocorrelation within a single quantitative variable. Moran’s I takes the form of a classic
correlation coefficient in that the mean of a variable is subtracted from each
sample value in the numerator. This
results in coefficients ranging from (–1) to (+1), where values between (0) and
(+1) indicate positive autocorrelation, values between (–1) and (0) indicate
negative autocorrelation, and (0) indicates no autocorrelation. As with its numerator, semivariance, Geary’s c is always positive
because the numerator is squared. Values usually range from (0) to (+2): values
less than (1) indicate positive autocorrelation and values greater than (1)
indicate negative autocorrelation.
Moran’s I:

$$I(d) = \frac{\dfrac{1}{W}\displaystyle\sum_{h=1}^{n}\sum_{i=1}^{n} w_{hi}\,(y_h - \bar{y})(y_i - \bar{y})}{\dfrac{1}{n}\displaystyle\sum_{i=1}^{n}(y_i - \bar{y})^2} \quad \text{for } h \neq i$$

Geary’s c:

$$c(d) = \frac{\dfrac{1}{2W}\displaystyle\sum_{h=1}^{n}\sum_{i=1}^{n} w_{hi}\,(y_h - y_i)^2}{\dfrac{1}{n-1}\displaystyle\sum_{i=1}^{n}(y_i - \bar{y})^2} \quad \text{for } h \neq i$$

where:
$I(d)$ = Moran’s I correlation coefficient as a function of distance
$c(d)$ = Geary’s c correlation coefficient as a function of distance
$w_{hi}$ = elements of the weight matrix as a function of distance: 1 if $y_h$ and $y_i$ are within a given distance class (for $h \neq i$), 0 in all other cases
$y_h$, $y_i$ = values of the variable at locations $h$ and $i$
$\bar{y}$ = sample mean of the variable
$W$ = sum of the values of the matrix $w_{hi}$
$n$ = sample size
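Both coefficients translate almost line for line into code. In this sketch the weight matrix is any 0/1 distance-class matrix like the one described under Structure functions; the usage data are a hypothetical one-dimensional checkerboard, chosen because alternating values should show strong negative autocorrelation (Moran's I near –1, Geary's c above 1).

```python
def morans_i(values, weights):
    """Moran's I for one distance class, given a 0/1 weight matrix."""
    n = len(values)
    ybar = sum(values) / n
    W = sum(sum(row) for row in weights)
    num = sum(weights[h][i] * (values[h] - ybar) * (values[i] - ybar)
              for h in range(n) for i in range(n))
    den = sum((y - ybar) ** 2 for y in values)
    return (n / W) * (num / den)

def gearys_c(values, weights):
    """Geary's c for one distance class; numerator matches the
    semivariance formula above."""
    n = len(values)
    ybar = sum(values) / n
    W = sum(sum(row) for row in weights)
    num = sum(weights[h][i] * (values[h] - values[i]) ** 2
              for h in range(n) for i in range(n))
    den = sum((y - ybar) ** 2 for y in values)
    return ((n - 1) * num) / (2 * W * den)

# Hypothetical checkerboard on a transect; weights link adjacent points.
vals = [0.0, 1.0, 0.0, 1.0]
w = [[1 if abs(i - j) == 1 else 0 for j in range(4)] for i in range(4)]
```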
Figure 2: Empirical correlograms generated in S-Plus
using the default of 20 lags and the same data as in Figure 1. a) Autocorrelation of mean temperature from
locations at different elevations.
Autocorrelation between pairs of points is positive for distance
intervals of less than 200,000 meters. Pairs of points separated by greater
distances show no correlation or, as distance increases, weak negative
correlation. b) Using the same
coordinates, the temperature data were replaced with randomly generated
numbers. Coefficients are approximately
equal to (0), showing the data are independent.
Mantel Test: Univariate and
Multivariate correlations
Applying the simple Mantel test, a
univariate correlation coefficient or a correlogram can be computed. It employs two distance matrices: one is
usually based on the geographic distances between pairs of samples, similar to
the distance intervals used in Moran’s I and Geary’s c, while the second is
based on differences between pairs of sample values. The partial Mantel
test compares two or more variables while controlling for a third variable,
which is usually distance.
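A simple Mantel test can be sketched as a Pearson correlation between the unfolded upper triangles of the two matrices, with significance assessed by permuting the rows and columns of one matrix together. This is an illustrative implementation under stated assumptions (symmetric matrices with zero diagonals), not the code used by any of the packages listed below.

```python
import random

def mantel_r(A, B):
    """Pearson correlation between the upper-triangle entries of two
    symmetric distance matrices A and B."""
    n = len(A)
    a = [A[i][j] for i in range(n) for j in range(i + 1, n)]
    b = [B[i][j] for i in range(n) for j in range(i + 1, n)]
    m = len(a)
    ma, mb = sum(a) / m, sum(b) / m
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a)
           * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

def mantel_test(A, B, n_perm=999, seed=0):
    """Permutation p-value: shuffle rows and columns of B together and
    count permuted statistics at least as large as the observed one."""
    rng = random.Random(seed)
    observed = mantel_r(A, B)
    n = len(A)
    count = 1  # the observed statistic counts as one permutation
    for _ in range(n_perm):
        p = list(range(n))
        rng.shuffle(p)
        Bp = [[B[p[i]][p[j]] for j in range(n)] for i in range(n)]
        if mantel_r(A, Bp) >= observed:
            count += 1
    return observed, count / (n_perm + 1)
```

Permuting rows and columns jointly preserves the internal structure of each matrix, which is what distinguishes the Mantel permutation scheme from shuffling matrix entries independently.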
References and Resources
Articles
Cressie, Noel A. C. 1993. Statistics for Spatial Data. John Wiley and Sons.
Dale, Mark R. T. and Marie-Josée Fortin. 2002. Spatial autocorrelation and statistical tests in ecology. Ecoscience 9(2):162-167.
Dutilleul, Pierre. 1993. Spatial heterogeneity and the design of ecological field experiments. Ecology 74(6):1646-1658.
Fortin, Marie-Josée and Jessica Gurevitch. 2001. Mantel tests: spatial structure in field experiments. In: Design and Analysis of Ecological Experiments, 2nd ed. Eds: Samuel Scheiner and Jessica Gurevitch.
Fortin, Marie-Josée, Pierre Drapeau and Pierre Legendre. 1989. Spatial autocorrelation and sampling design in plant ecology. Vegetatio 83:209-222.
Kitanidis, P. K. 1997. Introduction to Geostatistics: Applications in Hydrogeology.
Legendre, Pierre and Marie-Josée Fortin. 1989. Spatial pattern and ecological analysis. Vegetatio 80:107-138.
Legendre, Pierre, Mark R. T. Dale, Marie-Josée Fortin, Jessica Gurevitch, Michael Hohn, and Donald Myers. 2002. The consequences of spatial structure for the design and analysis of ecological field surveys. Ecography 25:601-615.
Perry, J. N., A. M. Liebhold, M. S. Rosenberg, J. Dungan, M. Miriti, A. Jakomulska and S. Citron-Pousty. 2002. Illustrations and guidelines for selecting statistical methods for quantifying spatial pattern in ecological data. Ecography 25:578-600.
Ripley, Brian D. 1981. Spatial Statistics. John Wiley and Sons.
Rossi, Richard E., David J. Mulla, Andre G. Journel and Eldon H. Franz. 1992. Geostatistical tools for modeling and interpreting ecological spatial dependence. Ecological Monographs 62(2):277-314.
Smouse, Peter E., Jeffrey C. Long and Robert R. Sokal. 1986. Multiple regression and correlation extensions of the Mantel test of matrix correspondence. Systematic Zoology 35(4):627-632.
Ver Hoef, Jay M. and Noel Cressie. 2001. Spatial statistics: analysis of field experiments. In: Design and Analysis of Ecological Experiments, 2nd ed. Eds: Samuel Scheiner and Jessica Gurevitch.
Note: For those interested in spatial
statistics within ecology, you may want to check out Marie-Josée
Fortin and Mark Dale’s soon-to-be-released book (June 2005), “Spatial
Analysis: A Guide for Ecologists”:
http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=0521804345
Online
Anselin, Luc,
professor in geography, course in Spatial Analysis at the
http://sal.agecon.uiuc.edu/courses/sa03/
Fortin,
M. J., professor in Landscape Ecology at the
http://www.zoo.utoronto.ca/fortin
Legendre,
http://www.bio.umontreal.ca/legendre/indexEnglish.html
Software (a selection, not exhaustive)
FREEWARE
R
package (Legendre and Casgrain
at
Windows version under development; currently available as a Mac OS version (4.0).
Lots of information is available: manuals, links, functions, etc. According to the website, this program can:
“Compute similarity, distance and
correlation (Rmode) matrices
Compute
geographic distances from latitudelongitude data
Create
a principal coordinates analysis (PCA)
Create
a correspondence analysis (CA)
Compute
a Mantel test
Create
a hierarchical or K-means cluster analysis
Study
spatial autocorrelation processes
Verify
and normalize your data matrices
And
much more! “
http://www.bio.umontreal.ca/Casgrain/en/labo/R/v4/index.html
R
project (Brian Ripley at
Similar to S-Plus; contains various modules for use in spatial statistics. With regard to spatial
autocorrelation, Brian D. Ripley (R News 1(2):14-15, June 2001) states:
"For spatial autocorrelation, there is still nothing
available yet…..the commercial module
http://agec144.agecon.uiuc.edu/csiss/Rgeo/
GeoDa (Luc Anselin at the Department of
Geography at the University of
Illinois, Urbana-Champaign):
In
addition to data management, visuals, mapping, etc., performs the following
spatial statistics: Moran’s I, Moran scatterplot: univariate, bivariate, EB
corrected LISA local Moran: univariate, bivariate, EB corrected
https://www.geoda.uiuc.edu/default.php
Norton,
M., a graduate student in the Legendre lab who has
developed his own software for autocorrelation and Mantel
tests and has helpful links:
http://biol10.biol.umontreal.ca/mnorton/stats.html
zt: a software
tool for simple and partial Mantel tests
http://www.psb.rug.ac.be/~erbon/mantel/
NOT FREEWARE
Statistics, Math, GIS
Software:
S-Plus:
Uses a module for spatial statistics called, not
surprisingly, Spatial Statistics.
It will compute variograms, correlograms and some common tests (Ripley’s K, etc.),
but not Moran’s I, Geary's c or Mantel tests.
These can be found on the web as freeware downloads. Here are a few resources for spatial
autocorrelation functions not found in S-Plus:
http://www.esapubs.org/archive/archive_index.htm
The
above site is also a great resource for data.
http://www.cnr.colostate.edu/~robin
Matlab:
Similar to S-Plus in that built-in spatial functions are limited, but you can write or find functions
online. Many people write MATLAB code for spatial statistics and make it freely
available online. Here is one link:
http://spatialstatistics.com/software_index.htm
ESRI
GIS ArcView and ArcGIS
software:
Writing code for spatial functions using Avenue, ArcView’s scripting language:
http://gis.esri.com/library/userconf/europroc97/11technology/T2/t2.htm
Linking
ArcView and SPlus:
http://gis.esri.com/library/userconf/proc00/professional/papers/PAP801/p801.htm
ESRI
is introducing a spatial statistics toolbox in the new version of ArcGIS 9 that includes Moran’s I
and other spatial autocorrelation functions:
http://www.esri.com/news/arcuser/0405/ss_intro.html
Discipline-Specific Software: many packages exist,
usually within the hydrology and engineering fields. Here are two examples:
Gamma
Design Software: (“variograms on the fly!”), no
Mantel tests, with good online help
Statios
software: variograms,
kriging, etc., no Moran’s or Mantel’s
http://www.statios.com/WinGslib/