Introduction to Meta-Analysis

 

Lucas Bohnett, Jacqueline Levy, Victoria Martinez, and Ed Connor

 

 

Approaches to Research Synthesis:

 

Vote Counting

Combining Significance Levels

Combining Estimates of Effect Size

Obtaining Effect Size Estimates from Significance levels

Commonly Used Estimates of Effect Size

Fixed-and Random-effects models

Homogeneity Tests

Fixed-Effect Models

Random-Effects Models

Publication Bias

Calculation

Literature Cited

Software

 

Definition

Meta-Analysis - “...the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings. It connotes a rigorous alternative to the casual, narrative discussions of research studies which typify our attempts to make sense of a large volume of research literature.”  Glass (1976)

 

 

Introduction

 

Because of the ever-growing number of studies in experimental ecology, methods are needed for summarizing results across a series of studies and reaching general conclusions.  The process of statistically synthesizing the findings of independent experiments is known as meta-analysis, a quantitative re-evaluation of the outcomes of two or more studies.  Meta-analysis combines the results of multiple studies to reach an overall conclusion about the magnitude of a treatment effect, and can be performed whenever two or more studies examine the same conceptual hypothesis (i.e., the same null hypothesis). For example, one can estimate the average effect of a treatment or covariate among a group of studies and ask: is the effect large or small, is the effect positive or negative, and does the overall combined effect differ from zero?

 

The three major approaches are vote counting, combining significance levels (p-values), and combining estimates of effect size (r-family, d-family, odds ratios).  Which approach to research synthesis is appropriate depends upon the data available for synthesis.

 

 

 

General Procedure for Meta-Analysis:

 

1. Form a purpose & hypotheses

2. Identify relevant studies

3. Establish study inclusion / exclusion criteria

4. Extract and code study data

5. Data analysis & interpretation

6. Results and conclusions

 

Vote Counting

 

An approach to meta-analysis commonly found in review articles is the technique called vote counting.  Vote counting is a method for synthesizing results across studies by counting the number of instances found in the literature that are consistent or inconsistent with a hypothesis. For example, Schoener (1983), Connell (1983), and Denno et al. (1995) examined the ecological literature for experimental studies to determine if interspecific competition is common or rare in nature. For each instance in each study that found interspecific competition, they tallied one vote. They then reported what proportion of studies show interspecific competition, calculated as the number of positive “votes” divided by the total number of instances examined. Schoener (1983) defined a positive vote to be detection of a negative effect of one species on another, while Connell (1983) defined a positive vote to be detection of a statistically significant negative effect of one species on another (α = 0.05).  The interpretation that the overall pattern in the data is consistent with the hypothesis that interspecific competition is common in nature is either based on a subjective assessment that the observed proportion of instances of interspecific competition is sufficiently high, or could be subjected to a binomial test with the binomial parameter equal to 0.5.
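The binomial test mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a procedure taken from the studies cited: the vote tally is hypothetical, and the test is assumed to be one-tailed (the probability of observing at least this many positive votes if positive and negative votes were equally likely).

```python
from math import comb


def binomial_upper_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, x) * p**x * (1 - p) ** (trials - x)
        for x in range(successes, trials + 1)
    )


# Hypothetical tally: 15 of 20 instances were scored as positive votes.
positive_votes = 15
total_votes = 20
p_value = binomial_upper_tail(positive_votes, total_votes)
print(f"proportion positive = {positive_votes / total_votes:.2f}")
print(f"one-tailed binomial p-value = {p_value:.4f}")
```

Note that the test only asks whether positive votes outnumber negative ones; it inherits the weaknesses of vote counting, because every vote still counts equally regardless of sample size or effect magnitude.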

 

The problems with vote counting are that one can define a positive vote in different ways and, more importantly, that vote counting treats each study (each vote) as being equal. A vote derived from a study with a sample size of 2 is equivalent to a vote from a highly replicated study. Furthermore, a vote from a study in which the magnitude of the observed effect is very small is equal to a vote from a study in which the magnitude of the observed effect is very large. Because vote counting does not take into account the sample sizes of the studies, vote counting is biased towards studies with small sample sizes, since studies with large sample sizes and small sample sizes are given the same weight (Cooper and Hedges 1994).

 

On the other hand, if no other information than the existence and direction of an effect is reported in a series of studies, vote counting is the only means of synthesizing results.

 

Combining Significance levels

 

A long-standing approach to meta-analysis involves combining the significance levels derived from multiple tests of the same underlying null hypothesis. When the studies under evaluation provide data that do not meet the assumptions necessary to apply the parametric models described below, or only report the p-values, tests based on combining significance values can be used for synthesizing results.  Because of the non-parametric nature of the tests of combined significance, they can be applied broadly and are fairly easy to compute. P-values from studies in which an F, t, $\chi^2$, or other test statistic was applied can be readily combined to obtain an overall test of significance. While there are a number of tests of combined significance that can be used in synthesizing studies, Fisher’s method for combining probabilities is most widely used (Becker 1994, Sokal and Rohlf 1995):

 

$$-2\sum_{i=1}^{k} \ln p_i ,$$

 

where $p_i$ is the significance level obtained from the ith study, k is the number of studies, and $-2\sum \ln p_i$ is distributed as a $\chi^2$ variate with 2k degrees of freedom.  Even when each individual experiment in a series yields a non-significant hypothesis test, the combined test of significance is likely to be statistically significant if the treatment consistently increases or decreases the response variable.  For an example of Fisher’s Method applied to ecological studies see Simberloff and Connor (1981) or McQuate and Connor (1990).
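As a rough sketch of Fisher's method, the Python below computes the combined statistic (minus two times the sum of the natural logs of the p-values) and its upper-tail probability. The three p-values are hypothetical. Because the degrees of freedom (2k) are always even, the chi-squared tail probability has a closed form, so no statistics library is assumed.

```python
from math import exp, log


def fisher_combined(p_values):
    """Return (statistic, combined p-value) for Fisher's method."""
    k = len(p_values)
    statistic = -2.0 * sum(log(p) for p in p_values)
    # Upper tail of a chi-squared variate with 2k (even) degrees of freedom:
    # exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return statistic, exp(-half) * total


# Hypothetical p-values: none is significant on its own at alpha = 0.05.
p_values = [0.10, 0.08, 0.12]
statistic, combined_p = fisher_combined(p_values)
print(f"statistic = {statistic:.3f} on {2 * len(p_values)} df")
print(f"combined p-value = {combined_p:.4f}")
```

With these made-up inputs the combined test falls below 0.05 even though no single study does, which is the behavior described in the paragraph above.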

 

The problem with combining probabilities is that the same probability calculated from different studies could arise if one study had a large sample size and a small effect size, and another study had a large effect size and a small sample size (Becker 1994). Hence, the overall test is for statistical significance and provides no information on the average magnitude of the treatment or covariate effect.

 

Combining Estimates of Effect Size

 

The most recent developments in meta-analysis are procedures that permit one to combine estimates of “effect sizes” to obtain an overall estimate of the average effect size and its standard error, and to test hypotheses about the effects of covariates on the average effect size observed in a series of studies. Combining effect sizes is superior to combining probabilities because the same probability calculated from different studies could arise if one study had a large sample size and a small effect size, and another study had a large effect size and a small sample size (Becker 1994). Effect sizes, in contrast, may be combined in an unambiguous way by weighting each effect size in inverse proportion to its variance, which is in part a function of sample size (Shadish and Haddock 1994).

 

The effect size is critical in meta-analysis.  The effect size is chosen by the investigator and reflects the differences between experimental and control groups or is utilized to find the degree of relationship between the independent and dependent variables (Gurevitch and Hedges 1999).  The outcome of each study is summarized as an index of the effect size and these indices are summarized across studies (Gurevitch and Hedges 1999).

 

Effect sizes are measures of the effect of some experimental treatment or a covariate on a response variable that is observed in each study. A variety of effect size measures are available for use with response variables that are either continuous or discrete (Rosenthal 1994). Two families of effect size measures are available for continuous variables: the d-family and the r-family. Effect size measures in the d-family are appropriate when effect sizes arise from the assessment of the effect of a discrete covariate, such as in an ANOVA or t-test.   Effect size measures in the r-family are appropriate when effect sizes arise from the assessment of the effect of a continuous covariate, such as in regression or correlation.

 

The effect-size measure referred to as the odds ratio is utilized for analysis of categorical data, and is applied if the outcome is a dichotomous variable (success versus failure).  The odds ratio is a way of comparing whether the probability of an event is the same for two groups.  The odds of outcome 1 versus outcome 2 are the probability of outcome 1 divided by the probability of outcome 2, and the odds ratio is the ratio of these odds in the two groups.  An odds ratio equal to one therefore implies that the event is equally likely in both groups.

 

The sample size used in estimating an effect size may differ among studies; thus estimates of effect sizes may vary in precision.  Therefore, when combining effect sizes, each effect size must be weighted in proportion to its precision.  The precision is a function of study sample size; thus the larger the sample sizes the greater the weight. In some instances it is possible to obtain estimates of effect sizes even when only a p-value or test statistic has been reported (See Box below).

 

 

 

 

Obtaining effect-size when significance levels are given

 

When an effect size is not reported, it can often be obtained from the reported significance level (p-value).  Knowing the significance level is useful when neither an effect size estimate nor a test statistic is accessible.  The p-value can be used to obtain a lower-limit effect size estimate using $r = \phi = Z/\sqrt{N}$ (Cooper and Hedges 1994), where Z is the standard normal deviate corresponding to the p-value and N is the total sample size. A table of standard normal deviates (or of t, F, or $\chi^2$) is needed, depending on what kind of p-level you have.  Once the Z or t values are obtained, r, Cohen’s d, or Hedges’ g can be calculated to serve as effect size indices.
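A minimal sketch of this conversion in Python follows. It assumes the reported p-value is one-tailed and that N is the total sample size of the study; the function name and the example numbers are illustrative only. The standard normal deviate is obtained from the standard library rather than from a printed table.

```python
from math import sqrt
from statistics import NormalDist


def r_from_p(p_one_tailed: float, n: int) -> float:
    """Lower-limit effect size r = Z / sqrt(N) from a one-tailed p-value."""
    z = NormalDist().inv_cdf(1.0 - p_one_tailed)  # standard normal deviate
    return z / sqrt(n)


# Hypothetical study: one-tailed p = 0.025 with N = 100 subjects.
print(f"r = {r_from_p(0.025, 100):.3f}")
```

If a study reports a two-tailed p-value, it would need to be halved before being passed to this function.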

 

The d- family is the most common method for obtaining effect size from significance levels when using categorical covariates.   The r- family is also used and in some cases the d and r families are combined to obtain effect size estimates.  The d-family effect size estimate and the r-family effect size estimate can be inter-converted.

 

To obtain d from a t statistic with df degrees of freedom, use Cohen’s d equation: $d = 2t/\sqrt{df}$

To obtain r use: $r = \sqrt{t^2/(t^2 + df)}$

 

(See Cooper and Hedges (1994), Chapter 16 for computational formulas and calculations for the d and r family).

 

Commonly Utilized Effect Size Statistics

d-family

(both Hedges’ g and Cohen’s d are common)

 

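Hedges’ g: effect size based on mean differences

$$g = \frac{\bar{Y}_e - \bar{Y}_c}{s_p}$$

$\bar{Y}_e$ is the mean of the experimental group

$\bar{Y}_c$ is the mean of the control group

$s_p$ is the pooled standard deviation of the two groups

•the effect size g is a biased estimator of the population effect size

•Using g produces estimates that are too large, especially with small samples; therefore Hedges’ unbiased estimate is used:

$$d = g\left(1 - \frac{3}{4(n_e + n_c) - 9}\right)$$

where $n_e$ and $n_c$ are the sample sizes of the experimental and control groups

The variance of g, given large samples:

$$v_g = \frac{n_e + n_c}{n_e n_c} + \frac{g^2}{2(n_e + n_c)}$$

•CI for g is therefore given as

$$g \pm z^{*}\sqrt{v_g}$$

z* is the critical value from the normal distribution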
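The d-family calculations can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the pooled standard deviation is computed from the two group standard deviations in the usual way, the small-sample correction is the common approximation 1 - 3/(4(n_e + n_c) - 9), the large-sample variance is (n_e + n_c)/(n_e n_c) + g²/(2(n_e + n_c)), and the group summaries are hypothetical.

```python
from math import sqrt


def hedges_g(mean_e, mean_c, sd_e, sd_c, n_e, n_c, z_crit=1.96):
    """Return g, its small-sample corrected value, variance, and CI."""
    pooled_sd = sqrt(
        ((n_e - 1) * sd_e**2 + (n_c - 1) * sd_c**2) / (n_e + n_c - 2)
    )
    g = (mean_e - mean_c) / pooled_sd
    corrected = g * (1 - 3 / (4 * (n_e + n_c) - 9))  # bias correction
    variance = (n_e + n_c) / (n_e * n_c) + g**2 / (2 * (n_e + n_c))
    half_width = z_crit * sqrt(variance)
    return g, corrected, variance, (g - half_width, g + half_width)


# Hypothetical experiment: 10 replicates per group.
g, d, v_g, ci = hedges_g(12.0, 10.0, 2.0, 2.0, 10, 10)
print(f"g = {g:.3f}, corrected = {d:.3f}, variance = {v_g:.3f}")
print(f"95% CI for g: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The correction factor shrinks g slightly toward zero, and the shrinkage becomes negligible as the total sample size grows.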

 

r-family

 

•Correlations are widely used as a measure of the linear relationship between two continuous variables

 

 

The correlation coefficient r is a slightly biased estimator of the population correlation coefficient.  An approximately unbiased estimate of the population correlation coefficient may be obtained from the formula:

$$G(r) = r + \frac{r(1 - r^2)}{2(n - 3)}$$

where n is the sample size.

The sampling distribution of a correlation coefficient is somewhat skewed, especially if the population correlation is large. It is therefore conventional to convert r to z scores using Fisher's r-to-z transformation:

$$z_r = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right)$$

If you wish to work with unbiased estimates of the population correlation coefficient, you should first calculate the correction G(r) for each study and then transform the G(r) values into z-scores for analysis.

$z_r$ has a nearly normal distribution with variance:

$$v_z = \frac{1}{n - 3}$$

Using these statistics we can construct a confidence interval for the population value:

$$z_r \pm z^{*}\sqrt{v_z}$$

where z* is the critical value from the normal distribution
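A short Python sketch of the r-family calculations is given below. It assumes the standard Fisher transformation (the inverse hyperbolic tangent of r) with variance 1/(n - 3); the final step of back-transforming the interval limits to the correlation scale is an added convenience, not something stated in the text, and the example values are hypothetical.

```python
from math import atanh, sqrt, tanh


def fisher_z_interval(r: float, n: int, z_crit: float = 1.96):
    """Return Fisher's z for r and a CI back-transformed to the r scale."""
    z_r = atanh(r)  # equals 0.5 * ln((1 + r) / (1 - r))
    se = sqrt(1.0 / (n - 3))
    lower, upper = z_r - z_crit * se, z_r + z_crit * se
    return z_r, (tanh(lower), tanh(upper))


# Hypothetical study: r = 0.5 from n = 28 observations.
z_r, (r_low, r_high) = fisher_z_interval(0.5, 28)
print(f"z = {z_r:.4f}, 95% CI for r: ({r_low:.3f}, {r_high:.3f})")
```

The interval is symmetric on the z scale but asymmetric on the r scale, which reflects the skewed sampling distribution of r.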

 

Calculating r from g, and g from r

In order to evaluate common effect sizes for all included studies, it is often required to transform reported effect sizes.  The following is just a sample of a multitude of possible calculations:
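As one hedged illustration, the sketch below uses the widely cited conversions between a standardized mean difference (written d here) and r. These particular formulas assume equal sample sizes in the two groups and are not necessarily the ones the original notes displayed; unequal group sizes require a different constant in place of 4.

```python
from math import sqrt


def d_to_r(d: float) -> float:
    """Convert a standardized mean difference to r (equal group sizes)."""
    return d / sqrt(d**2 + 4)


def r_to_d(r: float) -> float:
    """Convert r to a standardized mean difference (equal group sizes)."""
    return 2 * r / sqrt(1 - r**2)


# Hypothetical effect size: a one standard deviation difference.
r = d_to_r(1.0)
print(f"d = 1.0 corresponds to r = {r:.4f}")
print(f"converting back gives d = {r_to_d(r):.4f}")
```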

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

odds-ratios

 

An odds ratio is calculated by dividing the odds in the treated or exposed group by the odds in the control group.

A single stratum odds ratio is estimated as follows:

 

                                    Exposed                  Non-Exposed

Caged                                       a                                 b

Tied-up                                     c                                 d

 

Sample estimate of the odds ratio = (ad)/(bc)  
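The estimate (ad)/(bc) can be computed as in the Python sketch below. The cell counts are hypothetical. The confidence interval uses the common large-sample standard error of the log odds ratio, sqrt(1/a + 1/b + 1/c + 1/d), which is an addition for illustration and is not described in the text.

```python
from math import exp, log, sqrt


def odds_ratio(a: int, b: int, c: int, d: int, z_crit: float = 1.96):
    """Return the odds ratio (ad)/(bc) and an approximate CI."""
    estimate = (a * d) / (b * c)
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(odds ratio)
    lower = exp(log(estimate) - z_crit * se_log)
    upper = exp(log(estimate) + z_crit * se_log)
    return estimate, (lower, upper)


# Hypothetical counts: caged (a=20 exposed, b=10 not),
# tied-up (c=10 exposed, d=20 not).
estimate, (lower, upper) = odds_ratio(20, 10, 10, 20)
print(f"odds ratio = {estimate:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```

Because the interval is built on the log scale, it is asymmetric around the estimate; an interval that excludes 1 suggests the odds differ between the two groups.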

"The odds that caged subjects were exposed to the disease were (odds ratio) times the odds for subjects tied-up"   

 

Fixed-and Random-effects models

 

Combining estimates of effect sizes in meta-analysis can be accomplished by using one of two models: fixed-effects models or random-effects models. For a fixed-effects model, one assumes that the studies under examination share a common true effect size, and that the differences among the observed effect sizes are due to sampling error alone (Scheiner and Gurevitch 1993).  In random-effects models, by contrast, one assumes that there is a distribution of effect sizes and that differences in effect sizes between studies are due not only to sampling error, but also to other factors such as measurement error and inherent differences between studies. The computations involved in fitting either model depend upon obtaining an effect size estimate for each study examined.

 

Homogeneity tests provide an analytic process for deciding between a fixed-effects model (the homogeneity test yields a high P-value, P > 0.05) and a random-effects model (the homogeneity test yields a low P-value, P < 0.05).    Philosophical aspects of the analytic variables should also be considered when deciding whether to implement a fixed- or a random-effects model.

 

Homogeneity Tests

 

Homogeneity tests were developed to determine the likelihood that variance among effect sizes is due only to sampling error, and are used to assess whether the observed variability in study results is greater than that expected to occur by chance.

If the homogeneity statistic is significant for a group of studies, a procedure analogous to analysis of variance can be used. (Studies are repeatedly divided into subgroups according to moderator variables until within-group variation is non-significant).

 

Cochran’s Q (SAS code and output below) tests the homogeneity of the one-dimensional margins when there are multiple strata and two response categories.  When there are two variables (m=2), Cochran's Q simplifies to McNemar's statistic. The Breslow-Day test assesses the homogeneity of odds ratios; it requires a large sample size within each stratum, which limits its usefulness.

 

 

 

Fixed-Effects Models

 

The assumptions of fixed effects meta-analysis are that studies under examination share a common true effect size, the control and experimental groups are normally distributed, and the differences of effect size are assumed to be due to sampling error alone.  The variances of the sampling error are known as conditional variances, and will be applied in the actual synthesis of data. “The unbiased estimate of the population effect would then be the simple average of observed study effects; and its standard error would allow computations of confidence intervals around that average” (Cooper and Hedges 1994).

 

In both the Fixed- and Random-effects models there are two null hypotheses that can be examined:

 

$$H_0: \bar{\theta}_{\cdot\cdot} = 0,$$

 

the overall grand average effect size does not differ from zero; and

 

$$H_0: \bar{\theta}_{1\cdot} = \bar{\theta}_{2\cdot} = \dots = \bar{\theta}_{p\cdot},$$

 

there is no difference in average effect sizes among the p levels of the covariate.

 

The null hypothesis of no covariate effect can be examined for both categorical and continuous covariates. For fixed-effects models, the model fitting with categorical covariates involves a weighted ANOVA and with continuous covariates a weighted regression. Statistical packages, such as SPSS, can be used to perform weighted ANOVA and weighted regression to fit fixed-effects model meta-analyses. The model sum-of-squares is distributed as a $\chi^2$ variate with degrees of freedom equal to the number of covariate levels – 1 in the weighted ANOVA, and to the number of covariates in the weighted regression.

 

Random-Effects Models

 

In a fixed effects meta-analysis, the sample estimates of effect sizes, $T_i$, from the k studies are viewed as estimates of a common population parameter $\theta_i$ that is the underlying population effect size and is a fixed value, so that $\theta_1 = \theta_2 = \dots = \theta_k = \theta$. The $T_i$ value from any particular study differs from $\theta$ because of sampling error or conditional variability: because $T_i$ is based on a random sample of subjects from a population, it will differ somewhat from $\theta$ for the population.

 

In a random effects model, $\theta_i$ is not a fixed value; rather, it is a random variable that follows its own distribution. Hence, the total variability of an observed effect size, $v_i^{*}$, is a combination of both the sampling error or conditional variation, $v_i$, about each population's $\theta_i$, and the random variation, $\sigma_\theta^2$, of each $\theta_i$ around the mean population effect size:

 

Variance of estimated effects = random effects variance + estimation (or conditional) variance

$$v_i^{*} = \sigma_\theta^2 + v_i .$$

 

$\sigma_\theta^2$ is referred to as the random effects variance, the between studies variance, or the variance component; $v_i$ as the within-study variance, estimation variance, or the conditional variance of the $T_i$ (i.e., conditional on $\theta$ being fixed at the value $\theta_i$); and $v_i^{*}$ as the unconditional variance. If the between studies variance equals zero, then the equations for random effects models reduce to those of fixed effects models.

 

When would a random effects model be appropriate? If $\sigma_\theta^2$ is significantly different from 0, then it might be appropriate to use a random rather than a fixed effects model.  However, since the power of this test might be low, the use of a random effects model may be warranted even when such a test is non-significant.  If the studies in a synthesis are viewed as a random sample from some larger population of potential studies that have been or could be done, and the researcher wishes to draw inferences about the larger population of potential studies, then a priori a random effects model is appropriate.
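The way the between-studies variance enters the calculations can be sketched in Python as follows. This is a minimal illustration that assumes an estimate of the between-studies variance is already in hand (the value used is hypothetical) and that each study is weighted by the reciprocal of its unconditional variance; how that variance is estimated is a separate step.

```python
def random_effects_mean(effects, conditional_variances, between_variance):
    """Weighted mean effect using unconditional variances as weights."""
    # Unconditional variance = between-studies variance + conditional variance
    unconditional = [v + between_variance for v in conditional_variances]
    weights = [1.0 / v for v in unconditional]
    weight_sum = sum(weights)
    mean = sum(w * t for w, t in zip(weights, effects)) / weight_sum
    return mean, 1.0 / weight_sum


# Hypothetical effect sizes and conditional variances from two studies.
effects = [0.2, 0.6]
variances = [0.04, 0.16]
fixed_mean, fixed_var = random_effects_mean(effects, variances, 0.0)
random_mean, random_var = random_effects_mean(effects, variances, 0.1)
print(f"fixed:  mean = {fixed_mean:.3f}, variance = {fixed_var:.3f}")
print(f"random: mean = {random_mean:.3f}, variance = {random_var:.3f}")
```

Setting the between-studies variance to zero reproduces the fixed-effects result, while a positive value makes the weights more nearly equal and widens the uncertainty around the mean.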

 

Things to consider before you begin

 

When planning a meta-analysis it is important to consider sources of variation in the studies that are being included in the meta-analysis. Osenberg et al. (1999) suggest that variation among studies in effect magnitudes may arise from four sources: experimental, parametric, functional, and structural. Experimental variation arises when the procedures under which studies were conducted lead to differences in effect sizes. Parametric variation occurs when systems are governed by the same basic processes, yet differ in effect magnitudes generated by those processes. Functional variation is when systems are so distinct that the functions that describe the interactions between variables assume different shapes. Structural variation occurs when systems differ in their causal processes. In any event, one must be aware of sources of variation in effect sizes, and account for such variation by appropriate selection of an effect size measure or perhaps by conducting a mixed-model analysis.

 

Publication Bias

 

Like any study, a meta-analysis is only as good as the data used in it. There can be problems with the available data, such as incomplete reporting of data, lack of independence, publication bias, and research bias. Studies that fail to report sample size and variance cannot be included in meta-analyses that combine estimates of effect size. If more than one parameter is used in a study, then the parameters are not independent. To correct for this lack of independence, separate analyses need to be conducted or only one parameter must be examined. Studies performed in the same lab are also an example of a lack of independence that could lead to between-study biases. Publication bias may exist when significant studies are published more often than non-significant ones. Begg (1994) outlines approaches to determine if the published literature represents a biased sample of the studies actually conducted. Begg (1994) also describes the file-drawer problem, and a method of estimating how many non-significant, unpublished studies would have to exist to change the conclusion of a meta-analysis. It is also possible that researchers choose to study organisms or systems in which an effect is more likely to be detected; this could be a problem for a meta-analysis that is trying to make generalizations about the natural world (Gurevitch and Hedges 1999).

 

 

Calculations

 

Fixed-Effects Models

 

The steps required to compute a fixed effects model in meta-analysis are similar to those in calculating an ANOVA; the means, sum of the scores, and the variance are calculated for each group.  The steps involved include:

 

1.      The calculation of the grand-mean

2.      Calculation of means for different categories of explanatory variables

3.      Calculation of the confidence intervals around the means

4.      Statistical tests are completed to determine the consistency of the effects within and among categories of the studies.

 

Effect size is calculated for each experiment as the difference between the means of two groups of individuals, divided by their pooled standard deviation to standardize the effect among studies. 

 

(We use an effect size measure from the d - family to illustrate the calculations for the Fixed-effects model and one from the r - family to illustrate the Random Effects Model).

 

Notation:

 

k  = total number of independent studies among all groups

$m_i$ = number of studies in the ith group

p = number of groups (levels of the covariate)

$T_{ij}$ = observed effect size of the jth study in the ith group

$v_{ij}$ = conditional variance of $T_{ij}$

$w_{ij}$ = weight = $1/v_{ij}$

$\theta$ = population effect size; under the fixed-effects model, we assume $\theta$ is the common effect size.

 

Group Weighted Mean

 

This is the general formula for the group-weighted mean.  The single dot indicates that the effect size measure has been averaged across all studies within a particular level of the covariate.  The group weighted mean effect size estimate for the ith group is

$$\bar{T}_{i\cdot} = \frac{\sum_{j=1}^{m_i} w_{ij} T_{ij}}{\sum_{j=1}^{m_i} w_{ij}},$$

i = 1, …, p, where the weight $w_{ij}$ is the reciprocal of the variance of $T_{ij}$, $w_{ij} = 1/v_{ij}$.

 

Grand Weighted Mean

 


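The Grand Weighted Mean, $\bar{T}_{\cdot\cdot}$, is obtained by averaging the effect sizes across all studies in all groups, each weighted by the reciprocal of its variance; equivalently, it is the weighted average of the group weighted means. Two dots indicate the overall grand mean.

$$\bar{T}_{\cdot\cdot} = \frac{\sum_{i=1}^{p}\sum_{j=1}^{m_i} w_{ij} T_{ij}}{\sum_{i=1}^{p}\sum_{j=1}^{m_i} w_{ij}}$$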

 

 

 

Group Mean Conditional Variance

 

The conditional variance of a group weighted mean is given by the reciprocal of the sum of the weights in that group:

$$v_{i\cdot} = \frac{1}{\sum_{j=1}^{m_i} w_{ij}}$$


 

 

Grand Mean Conditional Variance

 

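The Grand Mean Conditional Variance ($v_{\cdot\cdot}$), or sampling variance, is obtained as the reciprocal of the sum of the weights across all studies in all groups:

$$v_{\cdot\cdot} = \frac{1}{\sum_{i=1}^{p}\sum_{j=1}^{m_i} w_{ij}}$$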


 

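Having obtained the grand weighted mean $\bar{T}_{\cdot\cdot}$ and the sampling variance $v_{\cdot\cdot}$, one can test the null hypothesis that the overall grand mean effect size does not differ from zero.

 

Reject $H_0: \bar{\theta}_{\cdot\cdot} = 0$ if the absolute value of

$$Z = \frac{\bar{T}_{\cdot\cdot}}{\sqrt{v_{\cdot\cdot}}}$$

exceeds the two-tailed critical value $z_{\alpha/2}$ of the standard normal distribution (1.96 at α = 0.05).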

 

Confidence Intervals

 


Confidence intervals for the Grand mean or Group mean effect sizes can be obtained using the following formula and by inserting the appropriate weighted mean effect size $\bar{T}$ and conditional variance $v$:

$$\bar{T} - z_{\alpha/2}\sqrt{v} \le \bar{\theta} \le \bar{T} + z_{\alpha/2}\sqrt{v}$$

If the confidence interval does not include zero, reject $H_0$.
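The fixed-effects calculations for a weighted mean, its conditional variance, the Z test, and the confidence interval can be sketched in Python as below. The effect sizes and variances are hypothetical; the same function can be applied to all studies (for the grand mean) or to the studies within a single group (for a group mean).

```python
from math import sqrt


def fixed_effect_summary(effects, variances, z_crit=1.96):
    """Return weighted mean, its variance, Z statistic, and CI."""
    weights = [1.0 / v for v in variances]  # weight = 1 / conditional variance
    weight_sum = sum(weights)
    mean = sum(w * t for w, t in zip(weights, effects)) / weight_sum
    variance = 1.0 / weight_sum  # reciprocal of the summed weights
    z = mean / sqrt(variance)
    half_width = z_crit * sqrt(variance)
    return mean, variance, z, (mean - half_width, mean + half_width)


# Hypothetical effect sizes and conditional variances from five studies.
effects = [0.30, 0.50, 0.10, 0.20, 0.15]
variances = [0.04, 0.04, 0.02, 0.02, 0.02]
grand_mean, grand_var, z_stat, ci = fixed_effect_summary(effects, variances)
print(f"grand mean = {grand_mean:.4f}, variance = {grand_var:.4f}")
print(f"Z = {z_stat:.3f}, 95% CI ({ci[0]:.4f}, {ci[1]:.4f})")
```

For these made-up data the interval excludes zero and the absolute value of Z exceeds 1.96, so both criteria lead to the same decision, as they must.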

 

Test of Heterogeneity of Effect Sizes Between and Within Groups

 

To test the null hypothesis of no difference between groups (levels of the covariate) in the average effect size, an omnibus test for between group differences is conducted using the following formula:

$$Q_B = \sum_{i=1}^{p} w_{i\cdot}\left(\bar{T}_{i\cdot} - \bar{T}_{\cdot\cdot}\right)^2$$

$w_{i\cdot}$ is the reciprocal of the variance ($v_{i\cdot}$) of $\bar{T}_{i\cdot}$.

 

$Q_B$ can be considered to be the weighted sum of squares of group mean effect sizes about the grand mean effect size.

 

The null hypothesis is tested by comparing the observed value of $Q_B$ with the upper-tail critical values of the chi-squared distribution with p-1 degrees of freedom (Cooper and Hedges 1994).  If $Q_B$ exceeds the critical value $C_\alpha$, $H_0$ is rejected at the α level.

 

To test for heterogeneity within groups, an omnibus test for within-group variation is conducted using the following formula:

$$Q_W = \sum_{i=1}^{p}\sum_{j=1}^{m_i} w_{ij}\left(T_{ij} - \bar{T}_{i\cdot}\right)^2$$

The $w_{ij}$ are the reciprocals of $v_{ij}$, which is the sampling variance of $T_{ij}$.

 

The null hypothesis is tested by comparing the calculated value of $Q_W$ with the upper-tail critical values of the chi-squared distribution with k-p degrees of freedom, where k is the total number of studies (Cooper and Hedges 1994). If $Q_W$ exceeds the 100(1-α) percent critical value, $H_0$ is rejected. A significant $Q_W$ test would suggest that a Fixed-Effects Model might be inappropriate.
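The between-group and within-group heterogeneity statistics can be sketched in Python as follows. The two groups of effect sizes are hypothetical. The function returns only the statistics and their degrees of freedom; comparing them with chi-squared critical values (for example 3.841 for 1 df and 7.815 for 3 df at α = 0.05) is left to a table or a statistics package.

```python
def heterogeneity_statistics(groups):
    """groups: list of (effects, variances) pairs, one per covariate level.

    Returns (Q_between, Q_within, df_between, df_within).
    """
    summaries = []
    for effects, variances in groups:
        weights = [1.0 / v for v in variances]
        weight_sum = sum(weights)
        mean = sum(w * t for w, t in zip(weights, effects)) / weight_sum
        summaries.append((mean, weight_sum, effects, weights))

    total_weight = sum(w_sum for _, w_sum, _, _ in summaries)
    grand_mean = sum(m * w_sum for m, w_sum, _, _ in summaries) / total_weight

    # Weighted squared deviations of group means about the grand mean
    q_between = sum(w_sum * (m - grand_mean) ** 2 for m, w_sum, _, _ in summaries)
    # Weighted squared deviations of study effects about their group mean
    q_within = sum(
        w * (t - m) ** 2
        for m, _, effects, weights in summaries
        for w, t in zip(weights, effects)
    )
    k = sum(len(effects) for effects, _ in groups)
    p = len(groups)
    return q_between, q_within, p - 1, k - p


# Hypothetical data: two levels of a covariate with 2 and 3 studies.
groups = [
    ([0.30, 0.50], [0.04, 0.04]),
    ([0.10, 0.20, 0.15], [0.02, 0.02, 0.02]),
]
q_b, q_w, df_b, df_w = heterogeneity_statistics(groups)
print(f"Q_B = {q_b:.4f} on {df_b} df; Q_W = {q_w:.4f} on {df_w} df")
```

In this made-up example neither statistic exceeds its critical value, so there would be no evidence of differences between groups or of excess variation within groups.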

 

 

Estimating the random effects variance

 

Several procedures are available to estimate the random effects variance, $\sigma_\theta^2$. Shadish and Haddock (1994) present two approaches that are appropriate when no attempt is being made to determine if study characteristics (covariates) account for variation in effect sizes. Raudenbush (1994) outlines a more general procedure that can be used when covariates are used to model the effects of study characteristics. The Raudenbush (1994) approach will be presented below in the section on fitting random effects models with covariates.

 

 

Shadish and Haddock (1994) Method 1 for computation of $\sigma_\theta^2$