Spatial Autocorrelation

Andrea Zuur

Introduction

The goal of this presentation is to provide ecology students with an understandable primer on spatial autocorrelation within the context of ecology.  The assumptions of spatial and classical statistics are compared, spatial autocorrelation is defined, and the most common spatial autocorrelation functions are reviewed.  Spatial statistics is a huge topic.  As such, topics such as experimental design, scale, data type, and time-series analysis are not covered.  References and resources including software are presented at the end for those interested in more information.

Spatial Analysis

The distribution of species, a driving force in ecology and conservation biology, often occurs in patterns such as gradients and/or clusters.  These patterns are the geographic result of the interaction of geologic, climatic, topographic, and biological variables.  Spatial statistics provides tools with which these patterns can be analyzed.  The origin of spatial statistics is attributed to South African mining engineer D. G. Krige, who developed techniques to predict the location of ores within geologic formations.  Matheron (1963) further developed these techniques into a method of interpolation and named it kriging, after Krige.  The techniques developed by Krige, Matheron and others have their home in geostatistics and are now applied in fields ranging from epidemiology to ecology to analyze data distributed in space.  The space in which the data are measured can be geographic, such as distances between ores in a geological formation, trees in a forest, or points in the human brain, or more abstract, where distance is genetic and based on allele frequencies rather than geographic distance.  Variables are measured at a finite number of locations within a coordinate system.  Measuring variables at specific coordinates allows the calculation of distances between measurements, and hence the analysis of spatial patterns within the data.

Assumptions of Classical and Spatial Statistics

Randomness

One of the goals of statistics is to characterize a population based on a sampling of that population.  Both classical and spatial statistics are based on the assumption that samples are randomly chosen from the population.  If samples are not randomly chosen, the sample population may be biased.  Statistics calculated from a biased sample population will not accurately characterize the population of interest.

Independence VS Dependence

Classical statistical tests rely on the assumption that subjects are independent of one another.  For example, when sampling to determine the distribution of a plant, a researcher randomly places quadrats throughout the study area; all individuals within each quadrat are counted to determine abundance.  The subject is the quadrat and the response variable is species abundance.  Each quadrat must be chosen such that it is independent of all other quadrats.  Each independent unit, in this case each quadrat, counts as a degree of freedom.  If the quadrats are not independent of one another, the degrees of freedom will be overestimated, because dependent quadrats carry partly redundant information and the effective sample size is smaller than the number of quadrats.  Testing with too many degrees of freedom increases the probability of rejecting a true null hypothesis, that is, of committing a Type I error.
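The effect of dependence on the Type I error rate can be illustrated with a small simulation.  The sketch below is not taken from any of the references; it assumes quadrats laid out along a transect, with each value correlated with its neighbor (a first-order autoregressive process, chosen only for simplicity).  Two variables that are unrelated to each other are generated repeatedly and tested for correlation at the 5% level.  All names and parameter values are hypothetical.

```python
import math
import random

N_QUADRATS = 50
# Critical t for a two-sided test at alpha = 0.05 with N_QUADRATS - 2 = 48
# degrees of freedom; this constant is only valid for N_QUADRATS = 50.
T_CRIT = 2.0106
R_CRIT = T_CRIT / math.sqrt((N_QUADRATS - 2) + T_CRIT ** 2)


def transect(n, rho, rng):
    """Simulate n quadrat values; each is correlated (rho) with its neighbor."""
    values = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        # Scale the random part so the series keeps unit variance.
        noise = math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        values.append(rho * values[-1] + noise)
    return values


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rejection_rate(rho, trials=500, seed=1):
    """Share of trials in which two UNRELATED variables test as correlated."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        x = transect(N_QUADRATS, rho, rng)
        y = transect(N_QUADRATS, rho, rng)
        if abs(pearson_r(x, y)) > R_CRIT:
            rejections += 1
    return rejections / trials


independent_rate = rejection_rate(rho=0.0)
autocorrelated_rate = rejection_rate(rho=0.9)
print(f"independent quadrats:    {independent_rate:.2f}")
print(f"autocorrelated quadrats: {autocorrelated_rate:.2f}")
```

With independent quadrats the rejection rate should stay near the nominal 5%; with strongly autocorrelated quadrats the same test rejects the true null hypothesis far more often, because it is credited with more degrees of freedom than the data actually contain.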

Independence and Homogeneity

The assumption of independence implies that variables are homogeneous.  In such a world, abiotic and biotic parameters would exhibit homogeneous distributions; precipitation, soil, wind, plants and animals would be constant, unvarying and evenly distributed across a landscape.  In ecology, this is almost never the case.  Abiotic and biotic parameters vary over space and time and their interaction results in complex spatial patterns.  The abundance of a plant at one location may be much greater than at other locations due to differences in parameters such as soil type and slope, which vary over space, and precipitation, wind direction and wind speed, which vary over space and time.  For example, abundance measured in quadrats located at the bottom of a hill will be more similar to one another than to abundance measured in quadrats located at the summit of the hill.  The quadrats at the bottom of the hill will have more soil moisture, more accumulated soil organic matter, and most likely be more sheltered from wind than those quadrats at the top of the hill.  The variability in the independent variables and their subsequent effects on the response variable belie the assumption of independence and the implication of homogeneity.

In spatial statistics, subjects are assumed to be dependent.  In addition, it is commonly assumed for certain tests that the data are stationary and isotropic.  The assumption of (second-order) stationarity requires that the mean and variance of the data are constant across the study area, and that the covariance between two points depends only on the distance and direction separating them, not on their location.  A more relaxed form of this assumption is the intrinsic hypothesis, which assumes that the differences in value between pairs of points separated by a given distance have a mean of zero and a finite variance that depends only on that distance.  If data are isotropic, the characteristics of patterns within the data are constant in all directions, whereas anisotropic data exhibit a pattern that varies with direction.

Spatial autocorrelation: definition and causes

Definition

Autocorrelation literally means that a variable is correlated with itself.  The simplest definition of autocorrelation states that pairs of subjects that are close to each other are more likely to have similar values, and pairs of subjects far apart from each other are more likely to have dissimilar values.  The spatial structure of the data refers to any patterns that may exist.  Gradients and clusters are examples of spatial structures that are positively autocorrelated, whereas negative autocorrelation may be exhibited in a checkerboard pattern where subjects appear to repel each other.  When data are spatially autocorrelated, it is possible to predict the value at one location from the value sampled at a nearby location using interpolation methods.  The absence of autocorrelation implies data are independent.

Causes

There are several causes of autocorrelation, which can be simplified into two categories, ‘spurious’ and ‘real’.  It should be noted that the terms ‘real’ and ‘spurious’ are defined differently by different authors.  Here we define spurious autocorrelation as an artifact of experimental design; it often occurs when samples have not been randomly chosen but can also result from some other aspect of the design.  Real autocorrelation can be defined as caused by the interaction of a response variable with itself (univariate) or with independent variables (multivariate) due to some inherent characteristic of the variable(s) such as habitat preference, mode of reproduction, etc.  In the case of univariate spatial autocorrelation, assuming the soil type is constant and wind is not a factor, a plant may be found in clusters simply because seeds fall to the ground and germinate near the parent plant, making the response variable, plant abundance, spatially autocorrelated.  In the case of multivariate spatial autocorrelation, the independent variables soil type, wind speed and wind direction interact, resulting in clusters of plants on the preferred soil type along a gradient following wind direction.  In both cases, abundance is similar in quadrats that are close together and less similar in quadrats that are far apart: neighboring quadrats within a cluster will have similarly high values, whereas neighboring quadrats outside the clusters will have similarly low values.

The relationship between variance, covariance and correlation

The functions most often used to describe spatial autocorrelation are related to variance, covariance and, of course, correlation.  The variance is a measure of the dispersion of a single variable around its mean, whereas covariance is a measure of the association between two variables.  Covariance is the expected value of the product of the deviations of two variables from their respective means, so the variance is simply the covariance of a variable with itself.  This relationship can be seen by reviewing the equations for sample variance and covariance:

Sample variance:

\[ s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

Sample covariance:

\[ s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]

where:

$s_x^2$ = sample variance of variable $x$

$s_{xy}$ = sample covariance between variables $x$ and $y$

$x_i$ = values of variable $x$, for $i = 1$ to $n$

$\bar{x}$ = sample mean of variable $x$

$y_i$ = values of variable $y$, for $i = 1$ to $n$

$\bar{y}$ = sample mean of variable $y$

$n$ = sample size
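As a worked example, the sketch below computes the sample variance and sample covariance (both with divisor n − 1) for two short lists.  The data, soil moisture and plant abundance in five quadrats, are invented for illustration, as are the function names.

```python
def sample_variance(x):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    return sum((xi - mean_x) ** 2 for xi in x) / (n - 1)


def sample_covariance(x, y):
    """Sum of products of paired deviations from the means, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)


# Hypothetical data for five quadrats.
soil_moisture = [10, 20, 30, 40, 50]   # percent
abundance = [2, 4, 6, 8, 10]           # individuals per quadrat

var_moisture = sample_variance(soil_moisture)
cov_moisture_abundance = sample_covariance(soil_moisture, abundance)

print(var_moisture)            # 250.0
print(cov_moisture_abundance)  # 50.0
# The covariance of a variable with itself is its variance.
print(sample_covariance(soil_moisture, soil_moisture) == var_moisture)  # True
```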

A correlation coefficient also provides a measure of how strongly two variables are associated.  It can be loosely defined as a normalized form of the covariance, in which the covariance is the numerator.  Dividing by the product of the standard deviations of the two variables standardizes the equation, resulting in a value ranging from (–1) to (+1), where values between (0) and (+1) indicate a positive association between variables, values between (0) and (–1) indicate a negative association, and (0) indicates there is no linear correlation between variables.  An example of a correlation coefficient commonly used in classical statistics is Pearson’s correlation coefficient:

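\[ r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} \]

where:

$r_{xy}$ = sample correlation coefficient between variables $x$ and $y$

$x_i$ = values of variable $x$, for $i = 1$ to $n$

$\bar{x}$ = sample mean of variable $x$

$y_i$ = values of variable $y$, for $i = 1$ to $n$

$\bar{y}$ = sample mean of variable $y$

$s_x s_y$ = product of the standard deviations of variables $x$ and $y$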
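A minimal sketch of Pearson’s correlation coefficient follows, written as the sum of cross-products of deviations divided by the square root of the product of the two sums of squares (the n − 1 divisors cancel).  The two lists are invented; they are chosen so that one variable tends to decrease as the other increases.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical data: elevation rank and abundance in five quadrats.
elevation_rank = [1, 2, 3, 4, 5]
abundance = [5, 3, 4, 1, 2]

r = pearson_r(elevation_rank, abundance)
print(round(r, 2))  # -0.8, a fairly strong negative association
```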

Structure functions

Because the goal of measuring spatial autocorrelation is to determine whether samples close together have similar values and vice versa, functions include terms that account for the distance between samples.  Rather than taking the difference between a sample value and its mean, $(y_i - \bar{y})$, as is done in classical variance or correlation functions, spatial functions take the difference in values between all pairs of samples, $(y_h - y_i)$, located a given distance apart.  A single variance or correlation value can be calculated for the full extent of the sample area.

It may be more informative to calculate and plot values for pairs of samples occurring at different distance intervals, which provides information about the spatial structure of the data.  The distance interval separating pairs of samples, referred to as a lag, is chosen by the researcher and should be based on a priori knowledge about the system being studied.  Each point on a plot represents the variance or correlation for all pairs of points within one lag.  Plots of semi-variance are called variograms; plots of correlation coefficients are called correlograms.  In order to create a variogram or correlogram, a matrix of distances between each pair of samples is created.  From this matrix, a weight matrix is generated for each distance interval.  All pairs of points located within a given distance interval are represented by the value of (1); all pairs of points not within that distance interval are given a value of (0) in the matrix.  Using these matrices, semi-variance or correlation coefficients are then calculated for all pairs of points within each distance interval.  It is also possible to produce directional variograms and correlograms, in which semi-variance or correlation within a distance interval is calculated only for pairs of points aligned in a specific direction.
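The construction of the distance matrix and of one binary weight matrix per distance interval can be sketched as follows.  The four quadrat coordinates and the class width of 1 are arbitrary choices made for illustration, and the function names are hypothetical.

```python
import math

# Hypothetical quadrat coordinates (metres); any coordinate system would do.
coords = [(0, 0), (1, 0), (2, 0), (0, 1)]


def distance_matrix(points):
    """Euclidean distance between every pair of points."""
    return [[math.dist(p, q) for q in points] for p in points]


def weight_matrix(dist, lower, upper):
    """Binary matrix: 1 if lower < distance <= upper (and h != i), else 0."""
    n = len(dist)
    return [
        [1 if h != i and lower < dist[h][i] <= upper else 0 for i in range(n)]
        for h in range(n)
    ]


dist = distance_matrix(coords)
# Three distance classes (lags) of width 1.
classes = [(0, 1), (1, 2), (2, 3)]
weights = [weight_matrix(dist, lo, hi) for lo, hi in classes]

for (lo, hi), w in zip(classes, weights):
    # The matrix is symmetric, so each pair of points appears twice.
    n_pairs = sum(map(sum, w)) // 2
    print(f"class ({lo}, {hi}]: {n_pairs} pairs")
```

Every pair of points falls into exactly one class, so the pair counts over all classes add up to n(n − 1)/2, here 6.  Classes containing very few pairs give unreliable estimates, which is one reason the choice of lag matters.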

Semi-Variance and the Variogram

The semi-variogram, commonly referred to as the variogram, is a plot of semi-variance as a function of distance.  The variogram originated in the field of geostatistics as a component of kriging.  Semi-variance measures the dissimilarity between values of a single variable at pairs of locations, whereas covariance measures similarity.  Unlike most correlation coefficients, semi-variance is not normalized, so its values are not constrained to a fixed range.  The following version of the variogram is the numerator of Geary’s c, a spatial autocorrelation coefficient:

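\[ \gamma(d) = \frac{1}{2W} \sum_{h=1}^{n} \sum_{i=1}^{n} w_{hi} (y_h - y_i)^2 \quad \text{for } h \neq i \]

where:

$\gamma(d)$ = semi-variance as a function of distance

$W$ = sum of the values of $w_{hi}$ within the weight matrix

$n$ = sample size

$w_{hi}$ = elements of the weight matrix, defined as a function of distance:

1 = $y_h$ and $y_i$ are within a given distance class, for $h \neq i$

0 = all other cases

$y_h, y_i$ = values of the variable at a pair of sample points $h$ and $i$, for $h \neq i$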
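A minimal sketch of the calculation: for each distance class, semi-variance is half the mean squared difference between the values of all pairs of points that fall in that class.  The example assumes six quadrats spaced one unit apart along a transect, with values that increase steadily (a gradient); the data and function name are hypothetical.

```python
import math


def semivariance(coords, values, lower, upper):
    """Semi-variance for pairs of points with lower < distance <= upper."""
    n = len(values)
    w_sum = 0          # W: sum of the binary weights
    total = 0.0        # weighted sum of squared differences
    for h in range(n):
        for i in range(n):
            if h == i:
                continue
            d = math.dist(coords[h], coords[i])
            w_hi = 1 if lower < d <= upper else 0
            w_sum += w_hi
            total += w_hi * (values[h] - values[i]) ** 2
    if w_sum == 0:
        return None    # no pairs in this distance class
    return total / (2 * w_sum)


# Hypothetical transect: six quadrats, one unit apart, values form a gradient.
coords = [(x, 0) for x in range(6)]
values = [1, 2, 3, 4, 5, 6]

gamma = [semivariance(coords, values, lo, lo + 1.0) for lo in (0.5, 1.5, 2.5)]
print(gamma)  # [0.5, 2.0, 4.5]
```

Semi-variance rises with each lag because values farther apart along the gradient are less alike.  Plotting these values against lag distance gives the empirical variogram.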

A classic variogram (Figure 1.a), calculated from data that are dependent, shows semi-variance increasing as the distance between points, or lag, increases, and has several distinguishing features.  The range is the lag distance at which the data become independent.  The sill is the semi-variance value at which the variogram levels off, which it reaches at the range.  The nugget is the distance on the y-axis between zero and the y-intercept.  It represents variability that is unaccounted for, due either to measurement error or to variation at distances shorter than the smallest sampling distance.  Semi-variance calculated for data that are independent remains roughly constant as lag increases (Figure 1.b).  Unless a distinct gradient is present, most variograms will show much more variation than Figure 1.a.

Figure 1:  Empirical variograms generated in S-Plus using the default of 20 lags.  a) The semi-variance of mean temperature from locations at different elevations, showing that temperature is dependent on elevation.  As the distance between pairs of points increases, elevation values become less similar and autocorrelation decreases.  b) Using the same coordinates, temperature data was replaced with randomly generated numbers showing the data are truly independent.

Moran’s I and Geary’s c: Univariate Spatial Correlation Coefficients and Correlograms

These coefficients measure spatial autocorrelation within a single quantitative variable.  Moran’s I takes the form of a classic correlation coefficient in that the mean of the variable is subtracted from each sample value in the numerator.  This results in coefficients that usually range from (–1) to (+1), where values between (0) and (+1) indicate positive autocorrelation, values between (0) and (–1) indicate negative autocorrelation, and (0) indicates no autocorrelation.  As with its numerator, semi-variance, Geary’s c is never negative because the differences in the numerator are squared.  Values usually range from (0) to (+2), where values less than (1) indicate positive autocorrelation, values greater than (1) indicate negative autocorrelation, and (1) indicates no autocorrelation.

Moran’s I:

\[ I(d) = \frac{\frac{1}{W} \sum_{h=1}^{n} \sum_{i=1}^{n} w_{hi} (y_h - \bar{y})(y_i - \bar{y})}{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2} \quad \text{for } h \neq i \]

Geary’s c:

\[ c(d) = \frac{\frac{1}{2W} \sum_{h=1}^{n} \sum_{i=1}^{n} w_{hi} (y_h - y_i)^2}{\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2} \quad \text{for } h \neq i \]

where:

$I(d)$ = Moran’s I correlation coefficient as a function of distance

$c(d)$ = Geary’s correlation coefficient as a function of distance

$w_{hi}$ = a matrix of weighted values, where elements are a function of distance:

1 = $y_h$ and $y_i$ are within a given distance class, for $h \neq i$

0 = all other cases

$y_h, y_i$ = values of the variable at locations $h$ and $i$

$\bar{y}$ = sample mean of the variable

$W$ = sum of the values of the matrix

$n$ = sample size
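Both coefficients can be computed directly from a binary weight matrix.  The sketch below assumes the commonly used distance-class form, in which Moran’s I divides the mean cross-product of deviations for pairs in the class by the variance with divisor n, and Geary’s c divides the semi-variance by the variance with divisor n − 1.  The two transects (a gradient and an alternating, checkerboard-like pattern) and all function names are hypothetical.

```python
import math


def weights_for_class(coords, lower, upper):
    """Binary weights: 1 if lower < distance <= upper and h != i, else 0."""
    n = len(coords)
    return [
        [1 if h != i and lower < math.dist(coords[h], coords[i]) <= upper else 0
         for i in range(n)]
        for h in range(n)
    ]


def morans_i(values, w):
    """Moran's I for one distance class described by weight matrix w."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(map(sum, w))
    cross = sum(w[h][i] * dev[h] * dev[i] for h in range(n) for i in range(n))
    return (cross / w_sum) / (sum(d * d for d in dev) / n)


def gearys_c(values, w):
    """Geary's c for one distance class described by weight matrix w."""
    n = len(values)
    mean = sum(values) / n
    w_sum = sum(map(sum, w))
    sq_diff = sum(w[h][i] * (values[h] - values[i]) ** 2
                  for h in range(n) for i in range(n))
    return (sq_diff / (2 * w_sum)) / (sum((v - mean) ** 2 for v in values) / (n - 1))


# Hypothetical transect of six quadrats, one unit apart; first distance class only.
coords = [(x, 0) for x in range(6)]
w_lag1 = weights_for_class(coords, 0.5, 1.5)

gradient = [1, 2, 3, 4, 5, 6]
alternating = [1, 0, 1, 0, 1, 0]

for name, values in (("gradient", gradient), ("alternating", alternating)):
    print(f"{name}: I = {morans_i(values, w_lag1):.2f}, c = {gearys_c(values, w_lag1):.2f}")
# gradient:    I = 0.60,  c = 0.14  (positive autocorrelation)
# alternating: I = -1.00, c = 1.67  (negative autocorrelation)
```

Repeating the calculation for each distance class and plotting the coefficients against lag gives the correlogram.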

Figure 2:  Empirical correlograms generated in S-Plus using the default of 20 lags using the same data as in Figure 1.  a) Autocorrelation of mean temperature from locations at different elevations.  Autocorrelation between pairs of points is positive for distance intervals less than 200,000 meters.  Pairs of points separated by distances of greater than 200,000 meters show no correlation or, as distance increases, weak negative correlation.  b) Using the same coordinates, temperature data was replaced with randomly generated numbers.  Coefficients are approximately equal to (0), showing the data are independent.

Mantel Test: Univariate and Multivariate correlations

Applying the simple Mantel test, a univariate correlation coefficient or a correlogram can be computed.  The test employs two distance matrices: one is usually based on the geographic distances between pairs of samples, similar to the distance intervals used in Moran’s I and Geary’s c, while the second is based on differences between pairs of sample values.  The partial Mantel test compares two variables while controlling for the effect of a third, which is usually distance.
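A simple Mantel test can be sketched as follows.  The statistic used here is the standardized form, the Pearson correlation between the corresponding off-diagonal elements of the two matrices.  Significance is assessed by randomly permuting the sample order of one matrix, which is the usual practice but an assumption here, since it is not described above.  The transect data, the one-tailed test, and all function names are hypothetical.

```python
import math
import random


def upper_triangle(m):
    """Off-diagonal elements above the diagonal, row by row."""
    n = len(m)
    return [m[h][i] for h in range(n) for i in range(h + 1, n)]


def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Return the Mantel correlation and a one-tailed permutation p-value."""
    rng = random.Random(seed)
    n = len(dist_a)
    a = upper_triangle(dist_a)
    observed = pearson_r(a, upper_triangle(dist_b))
    count = 1  # the observed arrangement counts as one permutation
    order = list(range(n))
    for _ in range(permutations):
        rng.shuffle(order)
        # Permute rows and columns of the second matrix together.
        permuted = [[dist_b[order[h]][order[i]] for i in range(n)] for h in range(n)]
        if pearson_r(a, upper_triangle(permuted)) >= observed:
            count += 1
    return observed, count / (permutations + 1)


# Hypothetical transect: eight quadrats with abundance increasing along it.
positions = [0, 1, 2, 3, 4, 5, 6, 7]
abundance = [1.0, 1.5, 2.1, 2.8, 4.2, 4.9, 6.1, 6.8]

geo_dist = [[abs(p - q) for q in positions] for p in positions]
val_dist = [[abs(u - v) for v in abundance] for u in abundance]

mantel_r, p_value = mantel(geo_dist, val_dist)
print(f"Mantel r = {mantel_r:.2f}, p = {p_value:.3f}")
```

A large positive correlation with a small p-value indicates that quadrats far apart in space also differ more in abundance, that is, the data are spatially structured.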

References and Resources

Articles

Cressie, Noel A. C.  1993.  Statistics for Spatial Data.  John Wiley and Sons.

Dale, Mark R.T.  and Marie-Josee Fortin. 2002.  Spatial autocorrelation and statistical tests in ecology.  Ecoscience  9(2):162-167.

Dutilleul, Pierre.  1993.  Spatial heterogeneity and the design of ecological field experiments.  Ecology 74(6):1646-1658.

Fortin, Marie-Josee and Jessica Gurevitch.  2001.  Mantel Tests: Spatial Structure in Field Experiments.  In:  Design and analysis of ecological experiments.  Eds: Samuel Scheiner and Jessica Gurevitch. 2nd ed. Oxford University Press.

Fortin, Marie-Josee, Pierre Drapeau and Pierre Legendre.  1989.  Spatial autocorrelation and sampling design in plant ecology.  Vegetatio 83:209-222.

Kitanidis, P. K.  1997.  Introduction to Geostatistics: Applications in hydrogeology.  Cambridge University Press.

Legendre, Pierre and Marie-Josee Fortin.  1989.  Spatial Pattern and ecological analysis.  Vegetatio 80:107-138.

Legendre, Pierre, Mark R. T. Dale, Marie-Josee Fortin, Jessica Gurevitch, Michael Hohn, and Donald Myers.  2002. The consequences of spatial structure for the design and analysis of ecological field surveys.  Ecography 25:601-615.

Legendre, Pierre.  1993.  Spatial autocorrelation: Trouble or new paradigm?  Ecology 74(6):1659-1673.

Perry, J. N., A. M. Liebhold, M. S. Rosenberg, J. Dungan, M. Miriti, A. Jakomulska and S. Citron-Pousty.  2002.  Illustrations and guidelines for selecting statistical methods for quantifying spatial pattern in ecological data.  Ecography 25:578-600.

Ripley, Brian D.  1981.  Spatial statistics.  John Wiley and Sons.

Rossi, Richard E., David J. Mulla, Andre G. Journel, Eldon H. Franz.  1992.  Geostatistical tools for modeling and interpreting ecological spatial dependence.  Ecological Monographs 62(2):277-314.

Smouse, Peter E., Jeffrey C. Long and Robert R. Sokal.  1986.  Multiple regression and correlation extensions of the Mantel test of matrix correspondence.  Systematic Zoology 35(4):627-632.

Ver Hoef, Jay M. and Noel Cressie.  2001.  Spatial Statistics: Analysis of Field Experiments.  In:  Design and analysis of ecological experiments.  Eds: Samuel Scheiner and Jessica Gurevitch. 2nd ed. Oxford University Press.

Note:  For those interested in spatial statistics within ecology, you may want to check out Marie-Josee Fortin and Mark Dale’s soon to be released book (June of 2005), “Spatial Analysis: a guide for ecologists”:

Online

Anselin, Luc, professor in geography, course in Spatial Analysis at the Univ. of Illinois at Urbana-Champaign; helpful notes, slides and handouts:

Fortin, M. J., professor in Landscape Ecology at the University of Toronto.  Extensive and readable publications; upcoming book (“Spatial Analysis. A Guide for Ecologists”).

Legendre, Pierre, ISI Highly Cited Researcher in Ecology/Environment, professor in Biological Sciences at the University of Montreal, author of “Numerical Ecology”.  Primary developer of the “R package” listed under Software below.  Extensive publications in several fields.

Software (some, not all)

FREEWARE

R package (Legendre and Casgrain at Univ. of Montreal):

Windows version under development, currently available as MacOs version 4.0.  Lots of information available: manuals, links, functions, etc.  According to the website, this program can:

“Compute similarity, distance and correlation (R-mode) matrices

Compute geographic distances from latitude-longitude data

Create a principal coordinates analysis (PCA)

Create a correspondance analysis (CA)

Compute a Mantel test

Create a hierarchical or K-Means cluster analysis

Study spatial autocorrelation processes

Verify and normalize your data matrices

And much more! “

R project (Brian Ripley at Oxford)

Similar to S-Plus, contains various modules for use in spatial statistics.  With regard to spatial autocorrelation, Brian D. Ripley (“Spatial statistics in R”, R News 1(2):14-15, June 2001) states:

"For spatial autocorrelation, there is still nothing available yet…..the commercial module S+SpatialStats for S-PLUS…has dampened enthusiasm for user-contributed spatial statistics code over the last decade."

GeoDa (Luc Anselin at the Department of Geography at the University of Illinois, Urbana-Champaign):

In addition to data management, visuals, mapping, etc., performs the following spatial statistics: Moran’s I, Moran scatterplot: univariate, bivariate, EB corrected LISA local Moran: univariate, bivariate, EB corrected

Norton, M., a graduate student in the Legendre lab who has developed his own software for autocorrelation and Mantel

zt: a software tool for simple and partial Mantel tests

NOT FREEWARE

Statistics, Math, GIS Software:

S-Plus:

Uses a module for spatial statistics called, not surprisingly, Spatial Statistics.  It will compute variograms and correlograms and some common tests (Ripley’s K, etc.), but no Moran’s I, Geary’s c or Mantel tests.  These can be found on the web, developed as freeware downloads.  Here are a few resources for spatial autocorrelation functions not found in S-Plus:

The above site is also a great resource for data.

Matlab:

Similar to S-Plus in that spatial functions are limited but you can write or find functions online.  Apparently there are lots of people writing code for MatLab that is freely available on their website.  Here is one link:

ESRI GIS ArcView and ArcGIS software:

Writing code for spatial functions using Avenue, a script-writing extension:

ESRI is introducing a spatial statistics toolbox in the new version of ArcGIS 9 that includes Moran’s I and other spatial autocorrelation functions:

Discipline Specific Software:  lots and lots, usually within the hydrological and engineering fields.  Here are two examples:

Gamma Design Software: (“variograms on the fly!”), no Mantel tests, with good online help

Statios software:  variograms, kriging, etc., no Moran’s or Mantel’s