Analysis of Molecular Variance (AMOVA)
Peter Werner
Population Differentiation
When a population is divided into isolated subpopulations, there is less heterozygosity than there would be if the population were undivided. Founder effects acting on different demes generally lead to subpopulations with allele frequencies that differ from those of the larger population. Also, these demes are smaller in size than the larger population; since allele frequency in each generation represents a sample of the previous generation's allele frequency, there will be greater sampling error in these small groups than there would be in a larger undifferentiated population. Hence, genetic drift will push these smaller demes toward different allele frequencies and allele fixation more quickly than would take place in a larger undifferentiated population.
Wright's F
The decline in heterozygosity due to subdivision within a population has usually been quantified using an index known as Wright's F statistic, also known as the fixation index. The F statistic is a measure of the difference between the mean heterozygosity among the subdivisions in a population, and the potential frequency of heterozygotes if all members of the population mixed freely and nonassortatively (Hartl and Clark 1997). The fixation index ranges from 0 (indicating no differentiation between the overall population and its subpopulations) to a theoretical maximum of 1, though in practice the observed fixation index is much less than 1 even in highly differentiated populations.
Fixation indexes can be determined for different hierarchical levels of a population structure, to indicate, for example, the degree of differentiation among demes within groups (F_{SG}), among groups of demes within a population (F_{GT}), and among demes within a population (F_{ST}) (Hartl and Clark 1997).
To determine the fixation index, the mean heterozygosity at each level must be determined. For a locus with two alternate alleles, the frequency of one allele is symbolized as p and the frequency of the alternative allele is equal to 1 – p. For a population subdivided to three hierarchical levels, the mean heterozygosity for each level is then determined as follows:
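| Level of population hierarchy | Heterozygosity |
|---|---|
| Demes | $H_{S} = \frac{1}{d}\sum_{i=1}^{d} 2p_{i}(1 - p_{i})$ |
| Groups of demes | $H_{G} = \frac{1}{G}\sum_{g=1}^{G} 2\bar{p}_{g}(1 - \bar{p}_{g})$ |
| Total population | $H_{T} = 2\bar{p}(1 - \bar{p})$ |

Here p_{i} is the allele frequency in deme i, \bar{p}_{g} is the mean allele frequency in group g, \bar{p} is the mean allele frequency in the total population, and d and G are the numbers of demes and groups. The unweighted averages assume demes and groups of equal size.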
(Table based on Hartl and Clark 1997.)
For each level of the population hierarchy, the mean allele frequency p is determined, then the allele frequency is multiplied by 2(1 – p); this product is the frequency of heterozygotes for that allele if panmixia occurs at that hierarchical level.
Once the heterozygosity at each hierarchical level is determined, F statistics can be calculated (Hartl and Clark 1997):
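| Level of population hierarchy | F-statistic |
|---|---|
| Among demes within group | $F_{SG} = \frac{H_{G} - H_{S}}{H_{G}}$ |
| Among groups within population | $F_{GT} = \frac{H_{T} - H_{G}}{H_{T}}$ |
| Among demes within population | $F_{ST} = \frac{H_{T} - H_{S}}{H_{T}}$ |

Here H_{S}, H_{G}, and H_{T} are the mean heterozygosities at the deme, group, and total-population levels.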
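The heterozygosity and F-statistic calculations described above can be sketched in a few lines of Python. This is only an illustration: the function names, the nested-list input layout, and the allele frequencies are made up for the example, and the unweighted averages assume demes of equal size and groups containing equal numbers of demes.

```python
from statistics import mean


def heterozygosity(p):
    """Expected heterozygote frequency 2p(1 - p) under panmixia."""
    return 2.0 * p * (1.0 - p)


def hierarchical_f(groups):
    """Return (F_SG, F_GT, F_ST) for one two-allele locus.

    `groups` is a list of groups; each group is a list of the allele
    frequencies p observed in its demes (hypothetical input layout).
    """
    demes = [p for group in groups for p in group]
    # Mean heterozygosity at the deme, group, and total-population levels.
    h_s = mean(heterozygosity(p) for p in demes)
    h_g = mean(heterozygosity(mean(group)) for group in groups)
    h_t = heterozygosity(mean(demes))
    f_sg = (h_g - h_s) / h_g  # among demes within group
    f_gt = (h_t - h_g) / h_t  # among groups within population
    f_st = (h_t - h_s) / h_t  # among demes within population
    return f_sg, f_gt, f_st


# Made-up allele frequencies: two groups of two demes each.
f_sg, f_gt, f_st = hierarchical_f([[0.1, 0.3], [0.7, 0.9]])
print(round(f_sg, 4), round(f_gt, 4), round(f_st, 4))
```

For these made-up frequencies the three indexes come out at about 0.0625, 0.36, and 0.40, and they satisfy the identity (1 – F_{ST}) = (1 – F_{SG})(1 – F_{GT}), which is a convenient check on any implementation.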
AMOVA
Wright's F is based upon comparison of gene frequencies among demes. However, molecular data reveal not only the frequency of molecular markers but also something about the number of mutational differences between different genes. A technique that could be used to estimate population differentiation by analyzing differences between molecular sequences rather than assumed Mendelian gene frequencies would therefore be very useful.
Analysis of Molecular Variance (AMOVA) is a method of estimating population differentiation directly from molecular data and testing hypotheses about such differentiation. A variety of molecular data – molecular marker data (for example, RFLP or AFLP), direct sequence data, or phylogenetic trees based on such molecular data – may be analyzed using this method (Excoffier, et al. 1992).
AMOVA treats any kind of raw molecular data as a Boolean vector p_{i}, that is, a 1 × n matrix of 1s and 0s, 1 indicating the presence of a marker and 0 its absence. A marker could be a nucleotide base, a base sequence, a restriction fragment, or a mutational event (Excoffier, et al. 1992).
Euclidean distances between pairs of vectors are then calculated by subtracting the Boolean vector of one haplotype from another, according to the formula (p_{j} – p_{k}). If p_{j} and p_{k} are visualized as points in n-dimensional space indicated by the intersections of the values in each vector, with n being equal to the length of the vector, then the Euclidean distance is simply a scalar that is equal to the shortest distance between those two points. The squared Euclidean distances are then calculated using the equation $\delta^{2}_{jk} = (p_{j} - p_{k})' W (p_{j} - p_{k})$. W is a weighting matrix; by default, it is an identity matrix and does not change the value of the final product; however, W can be a matrix with a number of values depending upon how one weights molecular change at different locations on a sequence or phylogenetic tree (Excoffier, et al. 1992).
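As a minimal sketch of this distance calculation, the Python function below evaluates the quadratic form (p_{j} – p_{k})' W (p_{j} – p_{k}) for two Boolean vectors. The function name, the nested-list representation of W, and the example haplotypes and weights are assumptions made for illustration only.

```python
def squared_distance(p_j, p_k, weights=None):
    """Squared Euclidean distance (p_j - p_k)' W (p_j - p_k).

    `weights` is an n x n nested list standing in for W;
    None means the identity matrix (all markers weighted equally).
    """
    diff = [a - b for a, b in zip(p_j, p_k)]
    if weights is None:
        return float(sum(x * x for x in diff))
    n = len(diff)
    return float(sum(diff[r] * weights[r][c] * diff[c]
                     for r in range(n) for c in range(n)))


# Two made-up haplotypes scored for five markers (1 = present, 0 = absent).
hap_j = [1, 0, 1, 1, 0]
hap_k = [1, 1, 0, 1, 0]
print(squared_distance(hap_j, hap_k))  # 2.0: they differ at two markers

# A hypothetical diagonal W that weights the second marker double
# and the third marker by one half.
w = [[0.0] * 5 for _ in range(5)]
for i, value in enumerate([1.0, 2.0, 0.5, 1.0, 1.0]):
    w[i][i] = value
print(squared_distance(hap_j, hap_k, w))  # 2.5
```

With the identity weighting, the squared distance between two Boolean vectors is simply the number of markers at which they differ.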
Squared Euclidean distances are calculated for all pairwise arrangements of Boolean vectors, which are then arranged into a matrix, and partitioned into submatrices corresponding to subdivisions within the population (Excoffier, et al. 1992):
$D^{2} = \begin{bmatrix} D^{2}_{11} & D^{2}_{12} & \cdots & D^{2}_{1d} \\ D^{2}_{21} & D^{2}_{22} & \cdots & D^{2}_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ D^{2}_{d1} & D^{2}_{d2} & \cdots & D^{2}_{dd} \end{bmatrix}$
They are arranged in such a way that the submatrices on the diagonal of the larger matrix contain pairs of individuals from the same population, while those on the off-diagonal contain pairs of individuals from different populations. The sums of the elements in the full matrix and in its diagonal submatrices, each divided by twice the number of individuals involved, yield sums of squares for the various hierarchical levels of the population.
These sums of squares can then be analyzed in a nested analysis of variance framework. A nested ANOVA differs from a simple ANOVA in that data are arranged hierarchically and mean squares are computed for groupings at all levels of the hierarchy. This allows for hypothesis tests of between-group and within-group differences at several hierarchical levels. The nested ANOVA framework for AMOVA is as follows (Excoffier, et al. 1992; Excoffier 2001):
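The way sums of squares fall out of the partitioned distance matrix can be sketched as follows. This is a simplified illustration with a single level of subdivision (demes only), equal weighting of markers, and made-up haplotypes; the function names are hypothetical. Each sum of squares is the sum of the squared distances in the relevant block divided by twice the number of individuals in that block.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two Boolean vectors (identity W)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sums_of_squares(haplotypes, demes):
    """Return (SS_total, SS_within_demes, SS_among_demes).

    `demes[i]` is the deme label of `haplotypes[i]`.
    """
    n = len(haplotypes)
    dist = [[squared_distance(a, b) for b in haplotypes] for a in haplotypes]
    # Total: every element of the full matrix, divided by 2n.
    ss_total = sum(map(sum, dist)) / (2.0 * n)
    # Within demes: each diagonal submatrix, divided by twice its deme size.
    ss_within = 0.0
    for label in set(demes):
        members = [i for i, d in enumerate(demes) if d == label]
        block = sum(dist[j][k] for j in members for k in members)
        ss_within += block / (2.0 * len(members))
    return ss_total, ss_within, ss_total - ss_within


# Four made-up haplotypes, two from each of two demes.
haps = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
labels = ["A", "A", "B", "B"]
print(sums_of_squares(haps, labels))  # (3.0, 1.0, 2.0)
```

The among-deme sum of squares is obtained here by subtraction, so the components always add up to the total.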
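| Level of variation | df | Sum of squares | Mean squares | Variance |
|---|---|---|---|---|
| Among haplotypes within demes | n – d | SS(WD) | $MS(WD) = \frac{SS(WD)}{n - d}$ | $\sigma^{2}_{c}$ |
| Among demes within groups | d – G | SS(AD) | $MS(AD) = \frac{SS(AD)}{d - G}$ | $\sigma^{2}_{b}$ |
| Among groups within population | G – 1 | SS(AG) | $MS(AG) = \frac{SS(AG)}{G - 1}$ | $\sigma^{2}_{a}$ |
| Total | n – 1 | $SS(T) = \frac{1}{2n}\sum_{j=1}^{n}\sum_{k=1}^{n}\delta^{2}_{jk}$ | | $\sigma^{2}_{T}$ |

Here n is the number of individuals, d the number of demes, G the number of groups, and \delta^{2}_{jk} the squared Euclidean distance between haplotypes j and k. The variance components are obtained by equating the mean squares to their expectations: $E[MS(WD)] = \sigma^{2}_{c}$, $E[MS(AD)] = \sigma^{2}_{c} + n'\sigma^{2}_{b}$, $E[MS(AG)] = \sigma^{2}_{c} + n''\sigma^{2}_{b} + n'''\sigma^{2}_{a}$, where n', n'', and n''' are average sample-size coefficients (Excoffier, et al. 1992). In a balanced design n' and n'' both equal the number of individuals per deme and n''' equals the number of individuals per group. The total variance is $\sigma^{2}_{T} = \sigma^{2}_{a} + \sigma^{2}_{b} + \sigma^{2}_{c}$.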
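The step from sums of squares to variance components can be sketched in Python under one strong simplifying assumption: a balanced design, in which every deme contains the same number of individuals and every group the same number of demes. In that case the standard nested ANOVA expectations give simple divisors; with unequal sample sizes these divisors are replaced by the average sample-size coefficients described by Excoffier et al. (1992). The function name and the input numbers below are invented for illustration.

```python
def variance_components(ss_wd, ss_ad, ss_ag, n, d, G):
    """Variance components for a balanced three-level design.

    ss_wd, ss_ad, ss_ag: sums of squares within demes, among demes
    within groups, and among groups; n individuals, d demes, G groups.
    Returns (var_a, var_b, var_c): among groups, among demes within
    groups, and within demes.
    """
    ms_wd = ss_wd / (n - d)
    ms_ad = ss_ad / (d - G)
    ms_ag = ss_ag / (G - 1)
    per_deme = n / d    # individuals per deme (balanced design assumed)
    per_group = n / G   # individuals per group (balanced design assumed)
    var_c = ms_wd
    var_b = (ms_ad - ms_wd) / per_deme
    var_a = (ms_ag - ms_ad) / per_group
    return var_a, var_b, var_c


# Made-up sums of squares for 12 individuals in 4 demes and 2 groups.
var_a, var_b, var_c = variance_components(8.0, 6.0, 10.0, n=12, d=4, G=2)
print(var_a, var_b, var_c)
```

Estimated variance components can come out negative when a mean square is smaller than the one below it in the hierarchy; this sketch returns such values unchanged rather than truncating them at zero.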
The variance components can be used to calculate a series of statistics called phi-statistics (Φ), which summarize the degree of differentiation between population divisions and are analogous to F-statistics. Φ-statistics are derived as follows (Excoffier, et al. 1992; Excoffier 2001):
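| Level of population hierarchy | Φ-statistic |
|---|---|
| Among demes within group | $\Phi_{SC} = \frac{\sigma^{2}_{b}}{\sigma^{2}_{b} + \sigma^{2}_{c}}$ |
| Among groups within population | $\Phi_{CT} = \frac{\sigma^{2}_{a}}{\sigma^{2}_{T}}$ |
| Among demes within population | $\Phi_{ST} = \frac{\sigma^{2}_{a} + \sigma^{2}_{b}}{\sigma^{2}_{T}}$ |

Here \sigma^{2}_{a}, \sigma^{2}_{b}, and \sigma^{2}_{c} are the variance components among groups, among demes within groups, and among haplotypes within demes, and \sigma^{2}_{T} is their sum.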
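Given the three variance components, the Φ-statistics are simple ratios. The sketch below uses the definitions of Excoffier et al. (1992), with the component among groups written as var_a, among demes within groups as var_b, and within demes as var_c; the function name and the numerical values are made up for illustration.

```python
def phi_statistics(var_a, var_b, var_c):
    """Return (Phi_SC, Phi_CT, Phi_ST) from the three variance components."""
    var_total = var_a + var_b + var_c
    phi_sc = var_b / (var_b + var_c)        # among demes within group
    phi_ct = var_a / var_total              # among groups within population
    phi_st = (var_a + var_b) / var_total    # among demes within population
    return phi_sc, phi_ct, phi_st


# Made-up variance components.
phi_sc, phi_ct, phi_st = phi_statistics(1.0, 0.5, 2.5)
print(round(phi_sc, 4), round(phi_ct, 4), round(phi_st, 4))
```

As with Wright's F, the three values are linked by (1 – Φ_{ST}) = (1 – Φ_{SC})(1 – Φ_{CT}), which can serve as a consistency check.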
A Φ-statistic can be treated as a hypothesis about differentiation at that level of a population; for example, Φ_{ST} can be treated as a hypothesis about differentiation between the population and its component demes. These hypotheses can be tested using the null distribution of the variance components. The null hypothesis is that there is no differentiation at that level; if the observed variance component does not differ significantly from its null distribution, the null hypothesis cannot be rejected, and there is no evidence that the subpopulations are differentiated from the larger population.
Because the molecular data consist of Euclidean distances derived from vectors of 1s and 0s, the data are unlikely to follow a normal distribution. A null distribution is therefore computed by resampling of the data (Excoffier, et al. 1992). In each permutation, each individual is assigned to a randomly chosen population while holding the sample sizes constant. These permutations are repeated many times, eventually building a null distribution. Hypothesis testing is carried out relative to these resampling distributions.
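The permutation procedure can be sketched as follows. This is a deliberately simplified illustration, not the procedure as implemented in any particular package: it handles only one level of subdivision (demes within a single population), uses equal marker weights, and shuffles deme labels among individuals, which keeps the sample sizes constant. The function names, the haplotypes, the number of permutations, and the random seed are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two Boolean vectors (identity W)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def phi_st(dist, demes):
    """Phi_ST for demes within one population, from a distance matrix."""
    n = len(demes)
    labels = sorted(set(demes))
    ss_total = sum(map(sum, dist)) / (2.0 * n)
    ss_within = 0.0
    sizes = []
    for label in labels:
        members = [i for i, d in enumerate(demes) if d == label]
        sizes.append(len(members))
        block = sum(dist[j][k] for j in members for k in members)
        ss_within += block / (2.0 * len(members))
    d = len(labels)
    ms_among = (ss_total - ss_within) / (d - 1)
    ms_within = ss_within / (n - d)
    # Average sample-size coefficient for a one-way design.
    n0 = (n - sum(s * s for s in sizes) / n) / (d - 1)
    var_among = (ms_among - ms_within) / n0
    return var_among / (var_among + ms_within)


def permutation_test(haplotypes, demes, n_perm=999, seed=0):
    """Return (observed Phi_ST, permutation p-value)."""
    rng = random.Random(seed)
    dist = [[squared_distance(a, b) for b in haplotypes] for a in haplotypes]
    observed = phi_st(dist, demes)
    shuffled = list(demes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign individuals; deme sizes unchanged
        if phi_st(dist, shuffled) >= observed - 1e-12:
            hits += 1
    # Count the observed arrangement as one member of the null distribution.
    return observed, (hits + 1) / (n_perm + 1)


# Made-up data: two demes of four haplotypes, scored for four markers.
haps = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0],
        [0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
observed, p_value = permutation_test(haps, labels)
print(round(observed, 4), p_value)
```

The p-value is the proportion of permuted data sets whose Φ_{ST} is at least as large as the observed value. Because the distance matrix does not change when labels are shuffled, it is computed once and reused for every permutation.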
What assumptions are made about the data? The individuals from which haplotypes are sampled should be chosen independently and at random, of course. Since the null distributions are obtained by resampling, the Euclidean distances between haplotypes need not be assumed to be normally distributed or to have homogeneous variances.
Because of genetic drift, any one haplotype should not be assumed to be completely representative of variation among the whole genome. It is therefore important that the data are derived from an adequate number of markers or base pairs.
Certain assumptions are made about the nature of the population (Excoffier, et al. 1992), for example, that mating is entirely random and nonassortative and no inbreeding occurs. If nonrandom mating or inbreeding is occurring, it will result in lower heterozygosity, and if the rates of nonrandom mating or inbreeding differ between populations, fixation estimates will be confounded.
The effects of selection are not fully accounted for by this model. There are almost certain to be differing selective pressures among different subpopulations, and selection can have very different effects on different alleles and allele combinations. All variance among different allele frequencies due to genetic drift can be assumed to be the product of a degree of sampling error that is common to all alleles. However, selection acting on different alleles is nonrandom; hence any given between-population difference in the frequency of a given allele is potentially unrepresentative of allele frequency variation as a whole.
Again, use of a large number of markers makes it more likely that one is getting a representative cross-section of alleles. However, because of the nonrandom nature of the effects of selection on different allele frequencies, increasing the percentage of the genome that is sampled will not necessarily yield an unbiased estimate of allele differentiation across the whole genome, at least not as readily as would be the case when compensating for the effects of differential genetic drift. Using neutral, nonselected genetic markers can be a useful means of avoiding the confounding effects of selection, if neutral markers can be identified.
AMOVA appears to be highly robust to the methods of estimating distance between haplotypes. Excoffier et al. (1992) examined the behavior of several different distance metrics. They constructed four different data sets using the same RFLP data, extracted from mtDNA sampled from 10 human populations in 5 geographical regions. The data sets were structured as follows:
D_{1}: The distance matrix was constructed from individual restriction site differences, with all restriction sites weighted equally.
D_{2}: The distance matrix was constructed from individual haplotypes that represented discrete restriction fragment patterns. Each haplotype consisted of one or more restriction fragments. All haplotype differences were weighted equally, regardless of whether a given haplotype difference represented one or several restriction fragment differences.
D_{3}: The distance matrix was derived from an unrooted phylogenetic network constructed from a parsimonious arrangement of restriction site differences. When connections of equal length were possible, haplotypes that were regionally closer or those that did not represent a change from one rare haplotype to another were favored. Each step on this network is scored as 1 for a single mutational event.
D_{4}: The distance matrix was again derived from an unrooted phylogenetic network constructed from a parsimonious arrangement of restriction site differences. In this case, a weighting matrix W was applied to the data, with the different weightings based on variation in nucleotide diversity at different restriction sites.
The results were as follows:
(Table from Excoffier, et al. 1992)
The Φ-statistics and the partitioning of the variance components are nearly identical for data sets D_{1}, D_{3}, and D_{4}, indicating that AMOVA is robust to most data arrangements. Data set D_{2} showed somewhat lower (but still significant) values for Φ_{ST} and Φ_{CT}, indicating that the grouping of restriction fragment data into discrete haplotypes represents some loss of information. It is possible that there is an analogous difference between direct sequence data and restriction fragment data, with restriction fragment data representing a loss of information when compared with direct sequence data.
References cited:
Excoffier, L., Smouse, P.E., and Quattro, J.M. 1992. Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics 131:479-491.
Excoffier, L. 2001. Analysis of population subdivision, p. 271-307. In: Balding, D.J., Bishop, M., and Cannings, C., eds. Handbook of statistical genetics. Chichester (UK): John Wiley & Sons.
Hartl, D.L. and Clark, A.G. 1997. Principles of population genetics. 3rd edition. Sunderland (MA): Sinauer Associates.
Further reading:
Bishop, G.R. 1996. Analysis of molecular variance [Internet]. Edinburgh: Biomathematics & Statistics Scotland; SMART. Available from: http://www.bioss.sari.ac.uk/smart/unix/mamova/slides/frames.htm
Schneider, S., Roessli, D., and Excoffier, L. 2000. Arlequin ver 2.000: a software package for population genetics data analysis [Users manual]. Geneva: University of Geneva, Genetics and Biometry Laboratory. Available from: http://lgb.unige.ch/arlequin/software/2.000/manual/Arlequin.pdf
Examples of studies using AMOVA:
Dyer, R.J. and Sork, V.L. 2001. Pollen pool heterogeneity in shortleaf pine, Pinus echinata Mill. Molecular Ecology 10:859-866. Available from: http://www.umsl.edu/~biology/Dyer/Downloads/cpSSR%20Pine.pdf
Garbelotto, M., Otrosina, W.J., Cobb, F.W., and Bruns, T.D. 1997. Population biology of the forest pathogen Heterobasidion annosum: implications for forest management. In: Proceedings of the 46th Annual Meeting of the California Forest Pest Council; 1997 Nov 12-13; Sacramento, CA. Available from: http://www.srs.fs.fed.us/pubs/rpc/199909/rpc_99sep_11.pdf
Gustine, D.L., Voigt, P.W., Brummer, E.C., and Papadopoulos, Y.A. 2002. Genetic variation of RAPD markers for North American white clover collections and cultivars. Crop Science 42:343-347. Available from: http://www.public.iastate.edu/~brummer/papers/Gustine2002CropSci42343.pdf
McMillan, W.O. and Bermingham, E. 1996. The phylogeographic pattern of mitochondrial DNA variation in the Dall's porpoise Phocoenoides dalli. Molecular Ecology 5:47-61.
Peakall, R., Smouse, P.E., and Huff, D.R. 1995. Evolutionary implications of allozyme and RAPD variation in diploid populations of dioecious buffalograss Buchloe dactyloides. Molecular Ecology 4:135-147.
Wu, J., Krutovskii, K.V., and Strauss, S.H. 1998. Abundant mitochondrial genome diversity, population differentiation and convergent evolution in pines. Genetics 150:1605-1614. Available from: http://www.fsl.orst.edu/tgerc/pubs/Wu_1998_Genetics.pdf
Software:
Arlequin: a software package for population genetics data analysis. Available from: http://lgb.unige.ch/arlequin/