Analysis of Molecular Variance (AMOVA)

 

Peter Werner

 

 

Population Differentiation

 

When a population is divided into isolated subpopulations, there is less heterozygosity than there would be if the population were undivided. Founder effects acting on different demes generally lead to subpopulations with allele frequencies that differ from those of the larger population. These demes are also smaller than the larger population; since the allele frequencies in each generation represent a sample of the previous generation's allele frequencies, sampling error is greater in these small groups than it would be in a larger undifferentiated population. Hence, genetic drift pushes these smaller demes toward different allele frequencies, and toward allele fixation, more quickly than would occur in a larger undifferentiated population.
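The effect of deme size on drift can be illustrated with a small simulation. The following is only a minimal sketch under simplifying assumptions (one biallelic locus, no selection, mutation, or migration, non-overlapping generations); the function names and the deme sizes are hypothetical choices made for illustration.

```python
import random

def generations_to_fixation(num_copies, start_freq=0.5, rng=None, max_generations=100_000):
    """Simulate pure drift at one biallelic locus until an allele is fixed or lost.

    num_copies is the number of gene copies in the deme (2N for N diploids).
    Each generation is a binomial sample of the previous generation's frequency.
    """
    rng = rng or random.Random()
    count = round(num_copies * start_freq)
    for generation in range(max_generations):
        if count == 0 or count == num_copies:
            return generation  # allele lost or fixed
        freq = count / num_copies
        # Draw the next generation: each gene copy is sampled independently.
        count = sum(1 for _ in range(num_copies) if rng.random() < freq)
    return max_generations

def mean_fixation_time(num_copies, replicates=20, seed=1):
    """Average number of generations to fixation over several replicate demes."""
    rng = random.Random(seed)
    times = [generations_to_fixation(num_copies, rng=rng) for _ in range(replicates)]
    return sum(times) / len(times)

small_deme = mean_fixation_time(num_copies=20)
large_population = mean_fixation_time(num_copies=200)
print(f"mean generations to fixation, 20 gene copies:  {small_deme:.1f}")
print(f"mean generations to fixation, 200 gene copies: {large_population:.1f}")
```

In runs of this kind the small deme should reach fixation in far fewer generations than the large one, which is the sampling-error argument made above.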

 

Wright's F

 

The decline in heterozygosity due to subdivision within a population has usually been quantified using an index known as Wright's F statistic, also known as the fixation index. The F statistic measures the difference between the mean heterozygosity among the subdivisions of a population and the potential frequency of heterozygotes if all members of the population mixed freely and non-assortatively, expressed as a proportion of the latter (Hartl and Clark 1997). The fixation index ranges from 0 (indicating no differentiation between the overall population and its subpopulations) to a theoretical maximum of 1, though in practice the observed fixation index is much less than 1 even in highly differentiated populations.

 

Fixation indices can be determined for different hierarchical levels of a population structure, to indicate, for example, the degree of differentiation among demes within groups ($F_{SG}$), among groups of demes within the population ($F_{GT}$), and among demes within the population ($F_{ST}$) (Hartl and Clark 1997).

 

To determine the fixation index, the mean heterozygosity at each level must first be determined. For a locus with two alleles, the frequency of one allele is denoted $p$ and the frequency of the alternative allele is $1 - p$. For a population subdivided into three hierarchical levels, the mean heterozygosity at each level is determined as follows:


 

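Level of population hierarchy    Heterozygosity

Demes                            $H_S$ = mean of $2 p_i (1 - p_i)$ over demes $i$

Groups of demes                  $H_G$ = mean of $2 p_g (1 - p_g)$ over groups $g$

Total population                 $H_T = 2 \bar{p} (1 - \bar{p})$

Here $p_i$ is the allele frequency in deme $i$, $p_g$ is the mean allele frequency in group $g$, and $\bar{p}$ is the mean allele frequency in the total population.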

        

 

(Table based on Hartl and Clark 1997.)

 

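At each level of the population hierarchy, the allele frequency $p$ is determined for every unit at that level (each deme, each group of demes, or the total population) and the expected heterozygosity $2p(1 - p)$ is calculated; this is the frequency of heterozygotes expected for that allele if panmixia occurs within the unit. The values are then averaged over all units at that level.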

 

Once the heterozygosity at each hierarchical level is determined, F statistics can be calculated (Hartl and Clark 1997):

 

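Level of population hierarchy      F-statistic

Among demes within group           $F_{SG} = (H_G - H_S) / H_G$

Among groups within population     $F_{GT} = (H_T - H_G) / H_T$

Among demes within population      $F_{ST} = (H_T - H_S) / H_T$

Here $H_S$, $H_G$, and $H_T$ are the mean heterozygosities at the deme, group, and total population levels.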
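As a worked illustration of the heterozygosity and F-statistic calculations, the sketch below uses made-up allele frequencies for four demes arranged in two groups. It assumes equal deme sizes and an equal number of demes per group, so that simple unweighted means can be used; the function names and the data are hypothetical.

```python
from statistics import mean

def expected_heterozygosity(p):
    """Expected heterozygote frequency 2p(1 - p) under panmixia."""
    return 2 * p * (1 - p)

def hierarchical_f_statistics(groups):
    """groups: list of groups, each a list of deme allele frequencies.

    Assumes equal deme sizes and equal numbers of demes per group.
    """
    deme_freqs = [p for group in groups for p in group]
    group_freqs = [mean(group) for group in groups]
    h_s = mean(expected_heterozygosity(p) for p in deme_freqs)   # deme level
    h_g = mean(expected_heterozygosity(p) for p in group_freqs)  # group level
    h_t = expected_heterozygosity(mean(deme_freqs))              # total population
    return {
        "H_S": h_s, "H_G": h_g, "H_T": h_t,
        "F_SG": (h_g - h_s) / h_g,
        "F_GT": (h_t - h_g) / h_t,
        "F_ST": (h_t - h_s) / h_t,
    }

# Hypothetical data: two groups of two demes each.
result = hierarchical_f_statistics([[0.1, 0.3], [0.7, 0.9]])
for name, value in result.items():
    print(f"{name} = {value:.4f}")
```

For these frequencies the heterozygosities are 0.30, 0.32, and 0.50, giving $F_{SG}$ = 0.0625, $F_{GT}$ = 0.36, and $F_{ST}$ = 0.40. The three indices satisfy $(1 - F_{SG})(1 - F_{GT}) = 1 - F_{ST}$.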

         

 

AMOVA

 

Wright's F is based upon comparison of gene frequencies among demes. Molecular data, however, reveal not only the frequency of molecular markers but also something about the number of mutational differences between different genes. A technique that estimates population differentiation by analyzing differences between molecular sequences, rather than assumed Mendelian gene frequencies, would therefore be very useful.

 

Analysis of Molecular Variance (AMOVA) is a method of estimating population differentiation directly from molecular data and testing hypotheses about such differentiation. A variety of molecular data – molecular marker data (for example, RFLP or AFLP), direct sequence data, or phylogenetic trees based on such molecular data – may be analyzed using this method (Excoffier, et al. 1992).

 

AMOVA treats any kind of raw molecular data as a Boolean vector $p_i$, that is, a 1 × n matrix of 1s and 0s, 1 indicating the presence of a marker and 0 its absence. A marker could be a nucleotide base, a base sequence, a restriction fragment, or a mutational event (Excoffier, et al. 1992).

 

Euclidean distances between pairs of vectors are then calculated from the difference between the Boolean vector of one haplotype and that of another, $(p_j - p_k)$. If $p_j$ and $p_k$ are visualized as points in n-dimensional space whose coordinates are the values in each vector, with n equal to the length of the vector, then the Euclidean distance is simply a scalar equal to the shortest distance between those two points. The squared Euclidean distances are then calculated using the equation $\delta^2_{jk} = (p_j - p_k)' W (p_j - p_k)$. W is a weighting matrix; by default it is an identity matrix and does not change the value of the final product. However, W can be a matrix with a number of values, depending upon how one weights molecular change at different locations on a sequence or phylogenetic tree (Excoffier, et al. 1992).
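The squared distance between two Boolean vectors can be sketched in a few lines of plain Python. The function name, the example haplotypes, and the diagonal weights below are hypothetical and serve only to illustrate the quadratic form $(p_j - p_k)' W (p_j - p_k)$.

```python
def squared_distance(p_j, p_k, weights=None):
    """Return (p_j - p_k)' W (p_j - p_k) for two equal-length 0/1 marker vectors.

    weights is an n x n list of lists; None stands for the identity matrix.
    """
    if len(p_j) != len(p_k):
        raise ValueError("haplotype vectors must have the same length")
    diff = [a - b for a, b in zip(p_j, p_k)]
    n = len(diff)
    if weights is None:
        # Identity weighting: simply count the markers that differ.
        return sum(d * d for d in diff)
    return sum(diff[r] * weights[r][c] * diff[c] for r in range(n) for c in range(n))

hap_a = [1, 0, 1, 1, 0]
hap_b = [1, 1, 0, 1, 0]

unweighted = squared_distance(hap_a, hap_b)
print(unweighted)  # 2 (the haplotypes differ at two markers)

# Hypothetical diagonal weighting: marker 2 counts double, marker 3 counts half.
site_weights = [1, 2, 0.5, 1, 1]
W = [[site_weights[r] if r == c else 0 for c in range(5)] for r in range(5)]
weighted = squared_distance(hap_a, hap_b, W)
print(weighted)  # 2.5
```

With the identity matrix the squared distance is just the number of markers at which the two haplotypes differ; a non-identity W lets some differences count more than others.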

 

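Squared Euclidean distances are calculated for all pairs of Boolean vectors and arranged into a matrix, which is partitioned into submatrices corresponding to subdivisions within the population (Excoffier, et al. 1992). Schematically, for $d$ subdivisions:

$$D^2 = \begin{bmatrix} D^2_{11} & D^2_{12} & \cdots & D^2_{1d} \\ D^2_{21} & D^2_{22} & \cdots & D^2_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ D^2_{d1} & D^2_{d2} & \cdots & D^2_{dd} \end{bmatrix}$$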

 

 

They are arranged in such a way that the submatrices on the diagonal of the larger matrix contain pairs of individuals from the same population, while those off the diagonal contain pairs of individuals from different populations. Summing the squared distances within the full matrix and within the diagonal submatrices, and dividing each sum by twice the corresponding sample size, yields the sums of squares for the various hierarchical levels of the population.
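The step from distance matrix to sums of squares can be sketched as follows. This is a minimal illustration with two hypothetical demes of three haplotypes each and identity weighting; rather than storing the matrix explicitly, each sum of squares is computed by summing all pairwise squared distances within the relevant block and dividing by twice the block's sample size.

```python
def squared_distance(p_j, p_k):
    """Squared Euclidean distance between two 0/1 vectors (identity weighting)."""
    return sum((a - b) ** 2 for a, b in zip(p_j, p_k))

def sum_of_squares(haplotypes):
    """Sum of all entries of the square distance block, divided by 2n."""
    n = len(haplotypes)
    block_total = sum(squared_distance(a, b) for a in haplotypes for b in haplotypes)
    return block_total / (2 * n)

# Hypothetical haplotypes for two demes.
demes = {
    "deme_1": [[1, 0, 0, 1], [1, 0, 0, 1], [1, 1, 0, 1]],
    "deme_2": [[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 1]],
}

everyone = [h for members in demes.values() for h in members]
ss_total = sum_of_squares(everyone)                                      # full matrix
ss_within = sum(sum_of_squares(members) for members in demes.values())  # diagonal blocks
ss_among = ss_total - ss_within

print(f"SS total  = {ss_total:.4f}")
print(f"SS within = {ss_within:.4f}")
print(f"SS among  = {ss_among:.4f}")
```

For these data the total sum of squares is 17/3, of which 4/3 lies within demes and 13/3 among demes.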

 

These sums of squares can then be analyzed in a nested analysis of variance framework. A nested ANOVA differs from a simple ANOVA in that the data are arranged hierarchically and mean squares are computed for groupings at all levels of the hierarchy. This allows for hypothesis tests of between-group and within-group differences at several hierarchical levels. The nested ANOVA framework for AMOVA is as follows (Excoffier, et al. 1992; Excoffier 2001):


 

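Level of variation                 df         Sum of squares   Mean square                    Variance component

Among haplotypes within demes      $n - d$    SS(WD)           MS(WD) = SS(WD) / $(n - d)$    $\sigma^2_c = MS(WD)$

Among demes within groups          $d - G$    SS(AD)           MS(AD) = SS(AD) / $(d - G)$    $\sigma^2_b = (MS(AD) - \sigma^2_c) / n_0$

Among groups within population     $G - 1$    SS(AG)           MS(AG) = SS(AG) / $(G - 1)$    $\sigma^2_a = (MS(AG) - \sigma^2_c - n_0' \sigma^2_b) / n_0''$

Total                              $n - 1$    SS(T)                                           $\sigma^2_T = \sigma^2_a + \sigma^2_b + \sigma^2_c$

The sums of squares are computed from the squared Euclidean distances $\delta^2_{jk}$ between haplotypes $j$ and $k$:

$$SS(WD) = \sum_{i=1}^{d} \frac{1}{2 n_i} \sum_{j \in i} \sum_{k \in i} \delta^2_{jk}$$

$$SS(AD) = \sum_{g=1}^{G} \frac{1}{2 n_g} \sum_{j \in g} \sum_{k \in g} \delta^2_{jk} - SS(WD)$$

$$SS(AG) = SS(T) - \sum_{g=1}^{G} \frac{1}{2 n_g} \sum_{j \in g} \sum_{k \in g} \delta^2_{jk}$$

$$SS(T) = \frac{1}{2n} \sum_{j=1}^{n} \sum_{k=1}^{n} \delta^2_{jk}$$

Here $n$ is the total number of haplotypes sampled, $d$ the number of demes, $G$ the number of groups, and $n_i$ and $n_g$ the numbers of haplotypes in deme $i$ and in group $g$. The coefficients $n_0$, $n_0'$, and $n_0''$ are average sample sizes that depend on the deme and group sample sizes; in a balanced design with $m$ haplotypes per deme and $k$ demes per group, $n_0 = n_0' = m$ and $n_0'' = mk$.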

 

 

 

The variance components can be used to calculate a series of statistics called phi-statistics (Φ), which summarize the degree of differentiation between population divisions and are analogous to F-statistics. Φ-statistics are derived as follows (Excoffier, et al. 1992; Excoffier 2001):

 

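Level of population hierarchy      Φ-statistic

Among demes within group           $\Phi_{SC} = \sigma^2_b / (\sigma^2_b + \sigma^2_c)$

Among groups within population     $\Phi_{CT} = \sigma^2_a / \sigma^2_T$

Among demes within population      $\Phi_{ST} = (\sigma^2_a + \sigma^2_b) / \sigma^2_T$

Here $\sigma^2_a$, $\sigma^2_b$, and $\sigma^2_c$ are the variance components among groups, among demes within groups, and among haplotypes within demes, and $\sigma^2_T$ is their sum.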
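The full chain from haplotypes to Φ-statistics can be sketched for the simplest case. The code below assumes a balanced design (the same number of haplotypes, m, in every deme and the same number of demes, k, in every group), for which the sample-size coefficients of the nested ANOVA reduce to m and mk; unbalanced data need the general coefficients given by Excoffier et al. (1992). The function names and the data are hypothetical.

```python
def squared_distance(p_j, p_k):
    return sum((a - b) ** 2 for a, b in zip(p_j, p_k))

def sum_of_squares(haplotypes):
    n = len(haplotypes)
    return sum(squared_distance(a, b) for a in haplotypes for b in haplotypes) / (2 * n)

def balanced_amova(groups):
    """groups: {group name: {deme name: [haplotype, ...]}}, balanced design only."""
    demes = [members for group in groups.values() for members in group.values()]
    pooled_groups = [[h for members in group.values() for h in members]
                     for group in groups.values()]
    everyone = [h for members in demes for h in members]
    n, d, g = len(everyone), len(demes), len(groups)
    m, k = n // d, d // g  # haplotypes per deme, demes per group
    if any(len(members) != m for members in demes) or \
       any(len(group) != k for group in groups.values()):
        raise ValueError("this sketch only handles balanced designs")

    ss_total = sum_of_squares(everyone)
    ss_within_groups = sum(sum_of_squares(pool) for pool in pooled_groups)
    ss_wd = sum(sum_of_squares(members) for members in demes)
    ss_ad = ss_within_groups - ss_wd
    ss_ag = ss_total - ss_within_groups

    ms_wd, ms_ad, ms_ag = ss_wd / (n - d), ss_ad / (d - g), ss_ag / (g - 1)

    var_c = ms_wd                                   # within demes
    var_b = (ms_ad - var_c) / m                     # among demes within groups
    var_a = (ms_ag - var_c - m * var_b) / (m * k)   # among groups
    var_t = var_a + var_b + var_c
    return {
        "var_a": var_a, "var_b": var_b, "var_c": var_c,
        "phi_sc": var_b / (var_b + var_c),
        "phi_ct": var_a / var_t,
        "phi_st": (var_a + var_b) / var_t,
    }

# Hypothetical data: 2 groups x 2 demes x 2 haplotypes.
data = {
    "group_1": {"deme_a": [[1, 0, 0, 0], [1, 0, 0, 1]],
                "deme_b": [[1, 1, 0, 0], [1, 1, 0, 1]]},
    "group_2": {"deme_c": [[0, 0, 1, 0], [0, 0, 1, 1]],
                "deme_d": [[0, 1, 1, 0], [0, 1, 1, 1]]},
}
amova = balanced_amova(data)
for name, value in amova.items():
    print(f"{name} = {value:.4f}")
```

For these data the variance components are 0.75, 0.25, and 0.5, so the three Φ-statistics are 1/3, 1/2, and 2/3. Variance components estimated this way can come out negative in real data, in which case the corresponding Φ-statistic is usually interpreted as zero.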

         

 


A Φ-statistic can be treated as a hypothesis about differentiation at that level of a population; for example, $\Phi_{ST}$ can be treated as a hypothesis about differentiation between the population and its component demes. These hypotheses can be tested against the null distribution of the variance components: if the observed variance component among subpopulations does not differ significantly from its null distribution, the null hypothesis that the subpopulations are not differentiated cannot be rejected, and the data give no support for differentiation at that level.

 

Because the molecular data consist of Euclidean distances derived from vectors of 1s and 0s, the data are unlikely to follow a normal distribution. A null distribution is therefore computed by resampling of the data (Excoffier, et al. 1992). In each permutation, each individual is assigned to a randomly chosen population while holding the sample sizes constant. These permutations are repeated many times, eventually building a null distribution. Hypothesis testing is carried out relative to these resampling distributions.
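The permutation procedure can be sketched for the simplest design, demes within a single population with no higher grouping. The sketch below is an illustration under that assumption, not the procedure of any particular software package; the function names, the data, and the number of permutations are hypothetical. The coefficient n_0 used for the among-deme variance is the usual average sample size for a one-way design.

```python
import random

def squared_distance(p_j, p_k):
    return sum((a - b) ** 2 for a, b in zip(p_j, p_k))

def sum_of_squares(haplotypes):
    n = len(haplotypes)
    return sum(squared_distance(a, b) for a in haplotypes for b in haplotypes) / (2 * n)

def phi_st(demes):
    """Phi_ST for a one-level design: a list of demes, each a list of haplotypes."""
    everyone = [h for members in demes for h in members]
    n, d = len(everyone), len(demes)
    ss_within = sum(sum_of_squares(members) for members in demes)
    ss_among = sum_of_squares(everyone) - ss_within
    ms_within = ss_within / (n - d)
    ms_among = ss_among / (d - 1)
    n_0 = (n - sum(len(members) ** 2 for members in demes) / n) / (d - 1)
    var_among = (ms_among - ms_within) / n_0
    total = var_among + ms_within
    return var_among / total if total else 0.0

def permutation_test(demes, permutations=999, seed=0):
    """Shuffle individuals among demes, keeping deme sizes fixed."""
    rng = random.Random(seed)
    observed = phi_st(demes)
    sizes = [len(members) for members in demes]
    pooled = [h for members in demes for h in members]
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if phi_st(shuffled) >= observed - 1e-12:
            at_least_as_large += 1
    # Count the observed arrangement itself as one member of the null set.
    return observed, (at_least_as_large + 1) / (permutations + 1)

# Hypothetical data: two clearly differentiated demes of four haplotypes.
deme_1 = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 1], [1, 0, 0, 0]]
deme_2 = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 0]]
observed_phi, p_value = permutation_test([deme_1, deme_2])
print(f"Phi_ST = {observed_phi:.4f}, permutation p-value = {p_value:.3f}")
```

The p-value is the proportion of arrangements whose Φ-statistic is at least as large as the observed one. With samples this small only 70 distinct assignments exist, so the attainable p-values are coarse; real analyses use larger samples.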

 

What assumptions are made about the data? The individuals from which haplotypes are sampled should be chosen independently and at random, of course. Since the null distributions are obtained by resampling, the Euclidean distances between haplotypes need not be assumed to be normally distributed or to have homogeneous variances.

 

Because of genetic drift, no single haplotype should be assumed to be completely representative of variation across the whole genome. It is therefore important that the data are derived from an adequate number of markers or base pairs.

 

Certain assumptions are made about the nature of the population (Excoffier, et al. 1992), for example, that mating is entirely random and non-assortative and no inbreeding occurs. If non-random mating or inbreeding is occurring, it will result in lower heterozygosity, and if the rates of non-random mating or inbreeding differ between populations, fixation estimates will be confounded.

 

The effects of selection are not fully accounted for by this model. There are almost certain to be differing selective pressures among different subpopulations, and selection can have very different effects on different alleles and allele combinations. All variance among different allele frequencies due to genetic drift can be assumed to be the product of a degree of sampling error that is common to all alleles. However, selection acting on different alleles is non-random, hence any given between-population difference in the frequency of a given allele is potentially non-representative of allele frequency variation as a whole.

 

Again, use of a large number of markers makes it more likely that one is getting a representative cross-section of alleles. However, because of the non-random nature of the effects of selection on different allele frequencies, increasing the percentage of the genome that is sampled will not necessarily yield an unbiased estimate of allele differentiation across the whole genome, at least, not as readily as would be the case when compensating for the effects of differential genetic drift. Using neutral, non-selected genetic markers can be a useful means of avoiding the confounding effects of selection, if neutral markers can be identified.

 

AMOVA appears to be highly robust to the methods of estimating distance between haplotypes. Excoffier et al. (1992) examined the behavior of several different distance metrics. They constructed four different data sets using the same RFLP data, extracted from mtDNA sampled from 10 human populations in 5 geographical regions. The data sets were structured as follows:

 

D1:   The distance matrix was constructed from individual restriction site differences, with all restriction sites weighted equally.

D2:   The distance matrix was constructed from individual haplotypes that represented discrete restriction fragment patterns. Each haplotype consisted of one or more restriction fragments. All haplotype differences were weighted equally, regardless of whether a given haplotype difference represented one or several restriction fragment differences.

 

D3:   The distance matrix was derived from an un-rooted phylogenetic network constructed from a parsimonious arrangement of restriction site differences. When connections of equal length were possible, haplotypes that were regionally closer or those that did not represent a change from one rare haplotype to another were favored. Each step on this network is scored as 1 for a single mutational event.

 

D4:   The distance matrix was again derived from an un-rooted phylogenetic network constructed from a parsimonious arrangement of restriction site differences. In this case, a weighting matrix W was applied to the data, with the different weightings based on variation in nucleotide diversity at different restriction sites.
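The difference between the first two distance definitions can be illustrated with a toy example. The sketch below is based only on the descriptions of D1 and D2 given above; the function names and the three haplotypes are hypothetical and are not taken from the mtDNA data of Excoffier et al. (1992).

```python
def site_distance(hap_a, hap_b):
    """D1-style distance: number of restriction sites that differ, equally weighted."""
    return sum((a - b) ** 2 for a, b in zip(hap_a, hap_b))

def haplotype_distance(hap_a, hap_b):
    """D2-style distance: 0 for identical haplotypes, 1 for any difference."""
    return 0 if hap_a == hap_b else 1

# Hypothetical restriction-site haplotypes (1 = site present, 0 = absent).
haplotypes = {
    "h1": (1, 1, 0, 0, 1),
    "h2": (1, 1, 0, 1, 1),
    "h3": (0, 0, 1, 1, 0),
}

pairs = [("h1", "h2"), ("h1", "h3"), ("h2", "h3")]
d1 = {pair: site_distance(haplotypes[pair[0]], haplotypes[pair[1]]) for pair in pairs}
d2 = {pair: haplotype_distance(haplotypes[pair[0]], haplotypes[pair[1]]) for pair in pairs}

for pair in pairs:
    print(pair, "D1-style:", d1[pair], "D2-style:", d2[pair])
```

Under the site-based distance, h1 and h2 are close (1 difference) while h3 is distant from both (5 and 4 differences); under the haplotype-identity distance all three pairs are equally distant, so the information about how many sites differ is discarded.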

 

The results were as follows:

(Table from Excoffier, et al. 1992)

 

The Φ-statistics and the partitioning of the variance components are nearly identical for data sets D1, D3, and D4, indicating that AMOVA is robust to most data arrangements. Data set D2 showed somewhat lower (but still significant) values for $\Phi_{ST}$ and $\Phi_{CT}$, indicating that the grouping of restriction fragment data into discrete haplotypes represents some loss of information. It is possible that there is an analogous difference between direct sequence data and restriction fragment data, with restriction fragment data representing a loss of information when compared with direct sequence data.

 

References cited:

 

Excoffier, L., Smouse, P.E., and Quattro, J.M. 1992. Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics 131: 479-491.

Excoffier, L. 2001. Analysis of population subdivision, p. 271-307. In: Balding, D.J., Bishop, M., and Cannings, C., eds. Handbook of statistical genetics. Chichester (UK): John Wiley & Sons.

 

Hartl, D.L. and Clark, A.G. 1997. Principles of population genetics. 3rd edition. Sunderland (MA): Sinauer Associates.

 

Further reading:

 

Bishop, G.R. 1996. Analysis of molecular variance [Internet]. Edinburgh: Biomathematics & Statistics Scotland; SMART. Available from: http://www.bioss.sari.ac.uk/smart/unix/mamova/slides/frames.htm

 

Schneider, S., Roessli, D., and Excoffier, L. 2000. Arlequin ver 2.000: a software package for population genetics data analysis [Users manual]. Geneva: University of Geneva, Genetics and Biometry Laboratory. Available from: http://lgb.unige.ch/arlequin/software/2.000/manual/Arlequin.pdf

 

Examples of studies using AMOVA:

 

Dyer, R.J. and Sork, V.L. 2001. Pollen pool heterogeneity in shortleaf pine, Pinus echinata Mill. Molecular Ecology 10:859-866. Available from: http://www.umsl.edu/~biology/Dyer/Downloads/cpSSR%20Pine.pdf

 

Garbelotto, M., Otrosina, W.J., Cobb, F.W., and Bruns, T.D. 1997. Population biology of the forest pathogen Heterobasidion annosum: implications for forest management. In: Proceedings of the 46th Annual Meeting of the California Forest Pest Council; 1997 Nov 12-13; Sacramento, CA. Available from: http://www.srs.fs.fed.us/pubs/rpc/1999-09/rpc_99sep_11.pdf

 

 

Gustine, D.L., Voigt, P.W., Brummer, E.C., and Papadopoulos, Y.A. 2002. Genetic variation of RAPD markers for North American white clover collections and cultivars. Crop Science 42:343-347. Available from: http://www.public.iastate.edu/~brummer/papers/Gustine2002CropSci42-343.pdf

 

McMillan, W.O. and Bermingham, E. 1996. The phylogeographic pattern of mitochondrial DNA variation in the Dall's porpoise Phocoenoides dalli. Molecular Ecology 5:47-61.

 

Peakall, R., Smouse, P.E., and Huff, D.R. 1995. Evolutionary implications of allozyme and RAPD variation in diploid populations of dioecious buffalograss Buchloe dactyloides. Molecular Ecology 4:135-147.

 

Wu, J., Krutovskii, K.V., and Strauss, S.H. 1998. Abundant mitochondrial genome diversity, population differentiation and convergent evolution in pines. Genetics 150:1605-1614. Available from: http://www.fsl.orst.edu/tgerc/pubs/Wu_1998_Genetics.pdf

 

Software:

 

Arlequin: a software package for population genetics data analysis. Available from: http://lgb.unige.ch/arlequin/