Introduction
to Meta-Analysis
Lucas Bohnett,
Jacqueline Levy, Victoria Martinez, and Ed Connor
Approaches to
Research Synthesis:
Combining Estimates of Effect Size
Obtaining Effect Size Estimates from
Significance levels
Commonly Used Estimates of Effect Size
Fixed-and Random-effects models
Meta-Analysis
- “...the statistical analysis of a large collection of analysis results from
individual studies for the purpose of integrating the findings. It connotes a rigorous
alternative to the casual, narrative discussions of research studies which
typify our attempts to make sense of a large volume of research
literature.” Glass (1976)
Because the number of studies in experimental ecology is ever-growing, methods have been developed for summarizing results across a series of studies and reaching general conclusions. The process of statistically synthesizing the findings of independent experiments is known as meta-analysis, a quantitative re-evaluation of the outcomes of two or more studies. Meta-analysis combines the results of multiple studies to reach an overall conclusion about the magnitude of a treatment effect, and can be performed whenever two or more studies examine the same conceptual hypothesis (i.e., the same null hypothesis). For example, one can estimate the average effect of a treatment or covariate among a group of studies and ask: is the effect large or small? Is it positive or negative? Does the overall combined effect differ from zero?
Three major approaches to research synthesis are vote counting, combining
significance levels (p-values), and combining estimates of effect size
(r-family, d-family, odds ratios). The choice of approach
depends upon the data available for synthesis.
1. Form a purpose & hypotheses
2. Identify relevant studies
3. Establish study inclusion / exclusion criteria
4. Extract and code study data
5. Data analysis & interpretation
6. Results and conclusions
An approach to
meta-analysis commonly found in review articles is the technique called vote
counting. Vote counting is a method for
synthesizing results across studies by counting the number of instances found
in the literature that are consistent or inconsistent with a
hypothesis. For example, Schoener (1983), Connell
(1983), and Denno et al. (1995) examined the
ecological literature for experimental studies to determine if interspecific
competition is common or rare in nature. For each instance in each study that
found interspecific competition, they tallied one vote. They then reported what
proportion of studies show interspecific competition - calculated as the number
of positive “votes” divided by the total number of instances examined. Schoener (1983) defined a positive vote to be detection of
a negative effect of one species on another, while Connell (1983) defined a
positive vote to be detection of a statistically significant negative effect of
one species on another (α = 0.05).
The interpretation that the overall pattern in the data is
consistent with the hypothesis that interspecific competition is common in
nature is either based on a subjective assessment that the observed proportion
of instances of interspecific competition is sufficiently high, or could
be subjected to a binomial test with the binomial parameter equal to
0.5.
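The binomial test mentioned above can be sketched in a few lines. This is only an illustration: the tally of 15 positive votes out of 20 instances is hypothetical, the function name is ours, and the one-sided alternative (competition detected in more than half of the instances) is an assumption about how the test would be framed.

```python
from math import comb

def binomial_upper_tail(successes: int, trials: int, p0: float = 0.5) -> float:
    """Exact one-sided binomial probability P(X >= successes) under H0: p = p0."""
    return sum(
        comb(trials, x) * p0**x * (1 - p0) ** (trials - x)
        for x in range(successes, trials + 1)
    )

# Hypothetical tally: 15 of 20 instances counted as positive votes.
positive_votes, instances = 15, 20
proportion = positive_votes / instances
p_value = binomial_upper_tail(positive_votes, instances)
print(f"proportion of positive votes = {proportion:.2f}, one-sided p = {p_value:.4f}")
```

Note that the test uses only the counts of votes, so it inherits every weakness of vote counting: sample sizes and effect magnitudes of the individual studies play no role.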
The problems
with vote counting are that one can define a positive vote in different ways,
and more importantly, vote counting treats each study (each vote) as being
equal. A vote derived from a study with a sample size of 2 is equivalent to
a vote from a highly replicated study. Furthermore, a vote from a study in which
the magnitude of the observed effect is very small is equal to a study in which
the magnitude of the observed effect is very large. Because vote counting does
not take into account the sample sizes of the studies, vote counting is biased
towards studies with small sample sizes, since studies with large sample sizes
and small sample sizes are given the same weight (Cooper and Hedges 1994).
On the other
hand, if no other information than the existence and direction of an effect is
reported in a series of studies, vote counting is the only means of
synthesizing results.
A long-standing
approach to meta-analysis involves combining the significance levels derived from
multiple tests of the same underlying null hypothesis. When the studies under
evaluation provide data that do not meet the assumptions necessary to apply the
parametric models described below, or only report the p-values, tests based on combining significance values can be used
for synthesizing results. Because of the
non-parametric nature of the tests of combined significance, they can be
applied broadly and are fairly easy to compute. P-values from studies in which an F, t,
χ², or other test statistic was applied can be readily combined
to obtain an overall test of significance.
While there are a number of tests of combined significance that can be used in
synthesizing studies, Fisher’s method for combining probabilities is most
widely used (Becker 1994, Sokal and Rohlf 1995):
$$ X^2 = -2 \sum_{i=1}^{k} \ln p_i , $$
where $p_i$ is the significance level obtained from the ith study, k is the number of studies, and
$X^2$ is distributed as a χ² variate with 2k
degrees of freedom. In a series
of experiments in which each individual experiment yields a non-significant
hypothesis test, the combined test of significance is likely to be
statistically significant if the treatment consistently increases or decreases the
response variable. For an example of Fisher’s method applied
to ecological studies see Simberloff and Connor
(1981) or McQuate and Connor (1990).
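A minimal sketch of Fisher's method follows, using only the Python standard library. The four p-values are hypothetical and the function name is ours. The upper-tail probability uses the closed form available for a chi-square variate with an even number of degrees of freedom, which is the case here because the test always has 2k degrees of freedom.

```python
from math import exp, log

def fisher_combined(p_values):
    """Return (X2, df, combined p-value) for Fisher's method.

    X2 = -2 * sum(ln p_i) is referred to a chi-square distribution on 2k df.
    For even df = 2k the upper tail equals exp(-x/2) * sum_{j<k} (x/2)^j / j!.
    """
    k = len(p_values)
    x2 = -2.0 * sum(log(p) for p in p_values)
    half = x2 / 2.0
    term, tail = 1.0, 0.0
    for j in range(k):
        tail += term            # adds (x/2)^j / j!
        term *= half / (j + 1)  # prepares the next term of the series
    return x2, 2 * k, exp(-half) * tail

# Hypothetical p-values from four studies, none significant on its own at 0.05.
study_p_values = [0.08, 0.10, 0.07, 0.12]
x2, df, p_combined = fisher_combined(study_p_values)
print(f"X2 = {x2:.2f} on {df} df, combined p = {p_combined:.4f}")
```

With these hypothetical inputs every individual test is non-significant, yet the combined test falls below 0.05, which is the behavior described above for effects that are consistent in direction.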
The problem with combining probabilities is that the same probability calculated from different studies could arise if one study had a large sample size and a small effect size, and another study had a large effect size and a small sample size (Becker 1994). Hence, the overall test is for statistical significance and provides no information on the average magnitude of the treatment or covariate effect.
The most recent
development in meta-analysis is a set of procedures that permit one to combine
estimates of “effect sizes” to obtain an overall estimate of the average effect
size and its standard error, and to test hypotheses about the effects of
covariates on the average effect size observed in a series of studies.
Combining effect sizes is superior to combining probabilities because the same
probability calculated from different studies could arise if one study had a
large sample size and a small effect size, and another study had a large effect
size and a small sample size (Becker 1994). Effect sizes, however, may be
combined in an unambiguous way by weighting each effect size in inverse proportion to
its variance, which is in part a function of sample size (Shadish and Haddock 1994).
The effect size
is critical in meta-analysis. The effect
size is chosen by the investigator and reflects the differences between
experimental and control groups or is utilized to find the degree of
relationship between the independent and dependent variables (Gurevitch and Hedges 1999).
The outcome of each study is summarized as an index of the effect size and these indices are summarized across
studies (Gurevitch and Hedges 1999).
Effect sizes are
measures of the effect of some experimental treatment or a covariate on a
response variable that is observed in each study. A variety of effect size
measures are available for use with response variables that are either
continuous or discrete (Rosenthal 1994). Two families of effect size measures
are available for continuous variables: the d-family
and the r-family. Effect
size measures in the d-family are
appropriate when effect sizes arise from the assessment of the effect of a discrete
covariate, such as in an ANOVA or t-test. Effect size measures in the r-family are appropriate when effect
sizes arise from the assessment of the effect of a continuous covariate, such as
in regression or correlation.
The effect-size measure referred to as the odds ratio is utilized for analysis of categorical data, and is applied if the outcome is a dichotomous variable (success versus failure). The odds ratio is a way of comparing whether the probability of an event is the same for two groups. So, the odds of outcome 1 versus outcome 2 are the probability of outcome 1 divided by the probability of outcome 2. Therefore an odds ratio equal to one implies that the event is equally likely in both groups.
The sample size
used in estimating an effect size may differ among studies; thus estimates of
effect sizes may vary in precision.
Therefore, when combining effect sizes, each effect size must be
weighted in proportion to its precision.
The precision is a function of study sample size; thus the larger the
sample sizes the greater the weight. In some instances it is possible to obtain
estimates of effect sizes even when only a p-value
or test statistic has been reported (See Box below).
When an effect size is not reported, it can be obtained from
the significance level, provided a p-value is given. Knowing the
significance level is useful when neither an effect size estimate nor a test
statistic is accessible. Even so,
this information can be used to obtain a lower-limit effect size estimate using
$$ r = \phi = Z / \sqrt{N} $$
(Cooper and
Hedges 1994), where Z is the standard normal deviate corresponding to the p-value and N is the sample size. A table of the appropriate distribution is needed in order to
find Z, t, F, or χ², depending on
what kind of p-level you have. Once the Z
values or t values are obtained,
r, Cohen’s d, or Hedges’ g can be
calculated in order to get the effect size indices.
The d-family is the most common method for obtaining effect size from significance levels when using categorical covariates. The r-family is also used, and in some cases the d and r families are combined to obtain effect size estimates. The d-family effect size estimate and the r-family effect size estimate can be inter-converted.
To obtain d from a t statistic with df degrees of freedom, use
Cohen’s d equation:
$$ d = \frac{2t}{\sqrt{df}} $$
To obtain r use:
$$ r = \sqrt{\frac{t^2}{t^2 + df}} $$
(See
Cooper and Hedges (1994), Chapter 16 for computational formulas and
calculations for the d and r family).
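The conversions from significance levels and test statistics can be sketched as follows. The formulas used are the standard ones, r = Z/√N for a one-tailed p-value and r = √(t²/(t² + df)) and d = 2t/√df for a t statistic; the function names and the example numbers are ours and purely illustrative.

```python
from math import copysign, sqrt
from statistics import NormalDist

def r_from_p(p_one_tailed: float, n: int) -> float:
    """Lower-limit effect size r = Z / sqrt(N) from a one-tailed p-value."""
    z = NormalDist().inv_cdf(1.0 - p_one_tailed)  # standard normal deviate
    return z / sqrt(n)

def r_from_t(t: float, df: int) -> float:
    """r-family effect size from a t statistic, keeping the sign of t."""
    return copysign(sqrt(t * t / (t * t + df)), t)

def d_from_t(t: float, df: int) -> float:
    """d-family effect size from a t statistic (assumes roughly equal group sizes)."""
    return 2.0 * t / sqrt(df)

# Hypothetical report: one-tailed p = 0.025 from a study with N = 100 subjects.
print(f"r from p: {r_from_p(0.025, 100):.3f}")
# Hypothetical report: t = 2.0 on 16 degrees of freedom.
print(f"r from t: {r_from_t(2.0, 16):.3f}, d from t: {d_from_t(2.0, 16):.3f}")
```

If a study reports only that p was below some threshold, substituting the threshold itself gives a conservative (lower-limit) estimate of r.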
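d-family
(both Hedges’ g and Cohen’s d are common)
Hedges’ g: effect size based on mean differences
$$ g = \frac{\bar{Y}_e - \bar{Y}_c}{s_p} $$
• $\bar{Y}_e$ is the mean of the experimental group
• $\bar{Y}_c$ is the mean of the control group
• $s_p$ is the pooled standard deviation
• the effect size g is a biased estimator of the population effect size
• using g produces estimates that are too large, especially with small samples; therefore Hedges’ unbiased estimate is used:
$$ d = \left( 1 - \frac{3}{4(n_e + n_c) - 9} \right) g $$
where $n_e$ and $n_c$ are the sample sizes of the experimental and control groups.
The variance of the corrected estimate, given large samples:
$$ v = \frac{n_e + n_c}{n_e n_c} + \frac{d^2}{2(n_e + n_c)} $$
• the CI for the population effect size $\delta$ is therefore given as
$$ d - z^* \sqrt{v} \le \delta \le d + z^* \sqrt{v} $$
• z* is the critical value from the normal distribution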
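As a hedged illustration of the d-family calculations, the sketch below computes g, the small-sample corrected estimate, its large-sample variance, and a confidence interval from raw data. It assumes the usual forms g = (Ȳe − Ȳc)/s_p, correction factor 1 − 3/(4(n_e + n_c) − 9), and variance (n_e + n_c)/(n_e n_c) + d²/(2(n_e + n_c)); the measurements and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def hedges_effect_size(experimental, control, level=0.95):
    """Return (g, corrected d, variance of d, confidence interval for d)."""
    n_e, n_c = len(experimental), len(control)
    # Pooled standard deviation from the two sample variances.
    s_p = sqrt(
        ((n_e - 1) * variance(experimental) + (n_c - 1) * variance(control))
        / (n_e + n_c - 2)
    )
    g = (mean(experimental) - mean(control)) / s_p
    # Small-sample correction shrinks g toward zero.
    d = (1.0 - 3.0 / (4.0 * (n_e + n_c) - 9.0)) * g
    v = (n_e + n_c) / (n_e * n_c) + d * d / (2.0 * (n_e + n_c))
    z_star = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z_star * sqrt(v)
    return g, d, v, (d - half_width, d + half_width)

# Hypothetical measurements from one small experiment.
treated = [12.1, 14.3, 13.8, 15.2, 12.9, 14.7]
untreated = [11.0, 12.4, 11.9, 13.1, 10.8, 12.2]
g, d, v, ci = hedges_effect_size(treated, untreated)
print(f"g = {g:.3f}, d = {d:.3f}, v = {v:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The variance formula is a large-sample approximation, so intervals from samples this small should be read with caution.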
r-family
• Correlations are widely used as a measure of the linear relationship between two continuous variables
$$ r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}} $$
The correlation coefficient r is a slightly biased estimator of the population correlation coefficient. An approximately unbiased estimate of the population correlation coefficient may be obtained from the formula:
$$ G(r) = r + \frac{r(1 - r^2)}{2(n - 3)} $$
The sampling distribution of a correlation coefficient is somewhat skewed, especially if the population correlation is large. It is therefore conventional to convert r to z scores using Fisher's r-to-z transformation
$$ z = \frac{1}{2} \ln\left( \frac{1 + r}{1 - r} \right) $$
If you wish to work with unbiased estimates of the population correlation coefficient, you should first calculate the correction G(r) for each study and then transform the G(r) values into z-scores for analysis.
z has
a nearly normal distribution with variance:
$$ v_z = \frac{1}{n - 3} $$
Using these statistics we can construct a confidence interval for the population value $\zeta$:
$$ z - \frac{z^*}{\sqrt{n - 3}} \le \zeta \le z + \frac{z^*}{\sqrt{n - 3}} $$
where z* is the critical value from the normal distribution
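The r-family calculations can be sketched in the same way. The code below applies Fisher's r-to-z transformation, z = ½ ln((1 + r)/(1 − r)), with variance 1/(n − 3), and builds a confidence interval on the z scale. Back-transforming the limits to the r scale with tanh is a common extra step that the text above does not describe; the example correlation and sample size are hypothetical, as are the function names.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

def bias_corrected_r(r: float, n: int) -> float:
    """Approximate correction G(r) = r + r(1 - r^2) / (2(n - 3))."""
    return r + r * (1.0 - r * r) / (2.0 * (n - 3))

def correlation_ci(r: float, n: int, level: float = 0.95):
    """Return (z, CI on the z scale, CI back-transformed to the r scale)."""
    z = atanh(r)                   # identical to 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / sqrt(n - 3)         # square root of the variance 1 / (n - 3)
    z_star = NormalDist().inv_cdf(0.5 + level / 2.0)
    lower, upper = z - z_star * se, z + z_star * se
    return z, (lower, upper), (tanh(lower), tanh(upper))

# Hypothetical study: r = 0.5 observed in n = 30 pairs of observations.
z, z_interval, r_interval = correlation_ci(0.5, 30)
print(f"z = {z:.3f}, CI for r = ({r_interval[0]:.3f}, {r_interval[1]:.3f})")
print(f"G(r) = {bias_corrected_r(0.5, 30):.4f}")
```

The interval on the r scale is asymmetric about r, which reflects the skewed sampling distribution that motivates the transformation.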
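In order to evaluate common effect sizes for
all included studies, it is often necessary to transform reported effect
sizes. The following is just a sample of
the multitude of possible calculations:
$$ r = \frac{d}{\sqrt{d^2 + 4}} \quad \text{(equal group sizes)} $$
$$ d = \frac{2r}{\sqrt{1 - r^2}} $$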
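As one example of such a transformation, the sketch below converts between the d and r families using the standard relations r = d/√(d² + 4) and d = 2r/√(1 − r²). Both assume equal sample sizes in the two groups, which is an assumption of this illustration rather than something stated in the text; the function names are ours.

```python
from math import sqrt

def r_from_d(d: float) -> float:
    """Convert a d-family effect size to r (assumes equal group sizes)."""
    return d / sqrt(d * d + 4.0)

def d_from_r(r: float) -> float:
    """Convert a correlation r to a d-family effect size (assumes equal group sizes)."""
    return 2.0 * r / sqrt(1.0 - r * r)

# Hypothetical effect size reported as d in one study.
d_reported = 1.0
r_equivalent = r_from_d(d_reported)
print(f"d = {d_reported:.2f} corresponds to r = {r_equivalent:.3f}")
print(f"round trip back to d: {d_from_r(r_equivalent):.3f}")
```

Converting every study to a single metric before combining keeps the weights and the pooled estimate on a common scale.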
odds-ratios
An odds ratio is calculated by dividing the odds in the treated or exposed group by the odds in the control group.
A single stratum odds ratio is estimated as follows:
             Exposed    Non-Exposed
Caged           a            b
Tied-up         c            d
Sample estimate of the odds ratio = (ad)/(bc)
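A short sketch of the single-stratum calculation follows. The cell counts are hypothetical. The confidence interval uses the usual large-sample standard error of the log odds ratio, √(1/a + 1/b + 1/c + 1/d), which is an addition for illustration and is not described in the text above.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def odds_ratio(a: int, b: int, c: int, d: int, level: float = 0.95):
    """Return the sample odds ratio (ad)/(bc) and a large-sample CI."""
    estimate = (a * d) / (b * c)
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(odds ratio)
    z_star = NormalDist().inv_cdf(0.5 + level / 2.0)
    lower = exp(log(estimate) - z_star * se_log)
    upper = exp(log(estimate) + z_star * se_log)
    return estimate, (lower, upper)

# Hypothetical counts laid out as in the table:
# caged: a = 20 exposed, b = 10 non-exposed; tied-up: c = 8 exposed, d = 16 non-exposed.
estimate, interval = odds_ratio(20, 10, 8, 16)
print(f"odds ratio = {estimate:.2f}, 95% CI = ({interval[0]:.2f}, {interval[1]:.2f})")
```

An interval that excludes 1 suggests that the odds differ between the two groups; all four cells must be non-zero for this calculation to work.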
"The odds that caged subjects were exposed to the disease were (odds ratio) times the odds for subjects tied-up"
Fixed-and Random-effects models
Combining estimates of effect sizes in meta-analysis can be carried out using one of two models: fixed-effects models or random-effects models. For a fixed-effects model, one assumes that the studies under examination share a common true effect size, and that differences among the observed effect sizes are due to sampling error alone (Scheiner and Gurevitch 1993). In random-effects models, by contrast, one assumes that there is a distribution of effect sizes and that differences in effect sizes between studies are due not only to sampling error, but also to other factors such as measurement error and inherent differences between studies. The computations involved in fitting either model depend upon obtaining an effect size estimate for each study examined.
Homogeneity tests provide an analytic process for deciding between a fixed-effects model (the homogeneity test yields a high P-value, P > 0.05) and a random-effects model (the homogeneity test yields a low P-value, P < 0.05). Philosophical aspects of the analytic variables should also be considered when deciding between the two models.
Homogeneity Tests
Homogeneity tests were developed to determine the likelihood that variance among effect sizes is due only to sampling error, and are used to assess whether the observed variability in study results is greater than that expected to occur by chance.
If the homogeneity statistic is significant for a group of studies, a procedure analogous to analysis of variance can be used. (Studies are repeatedly divided into subgroups according to moderator variables until within-group variation is non-significant).
Cochran’s Q (SAS
code and output below) tests the homogeneity of the one-dimensional margins
when there are multiple strata and two response categories. When there are two variables (m=2), Cochran's
Q simplifies to McNemar's statistic. The Breslow-Day test assesses the homogeneity of odds ratios; it requires a large sample size within
each stratum, which limits its usefulness.
The assumptions of fixed effects meta-analysis are that studies under examination share a common true effect size, the control and experimental groups are normally distributed, and the differences of effect size are assumed to be due to sampling error alone. The variances of the sampling error are known as conditional variances, and will be applied in the actual synthesis of data. “The unbiased estimate of the population effect would then be the simple average of observed study effects; and its standard error would allow computations of confidence intervals around that average” (Cooper and Hedges 1994).
In both the
fixed- and random-effects models there are two null hypotheses that can be
examined:
$$ H_0: \bar{\theta}_{\cdot\cdot} = 0 , $$
the overall grand average effect size
does not differ from zero; and
$$ H_0: \bar{\theta}_{1\cdot} = \bar{\theta}_{2\cdot} = \dots = \bar{\theta}_{p\cdot} , $$
there is no difference in average
effect sizes among the p levels of
the covariate.
The null hypothesis of no covariate effect can be examined
for both categorical and continuous covariates. For fixed-effects models, the
model fitting with categorical covariates involves a weighted ANOVA and with
continuous covariates a weighted regression. Statistical packages, such as
SPSS, can be used to perform weighted ANOVA and weighted regression to fit
fixed-effects model meta-analyses. The model sum-of-squares is distributed as a
χ² variate with degrees of freedom equal to the number of
covariate levels − 1 in the weighted ANOVA, and to the
number of covariates in the weighted regression.
In a fixed
effects meta-analysis, the sample estimates of effect sizes, $T_i$, from the k studies are viewed as estimates of a common population parameter $\theta_i$
that is the underlying population effect
size and is a fixed value, so that $\theta_1 = \theta_2 = \dots = \theta_k = \theta$.
The $T_i$ value from any particular study differs from $\theta$ because of sampling error or conditional
variability. Because $T_i$
is based on a random sample of subjects from a population, it will differ
somewhat from $\theta$ for
the population.
In a random
effects model, $\theta_i$ is not a fixed value; rather, it is a
random variable that follows its own distribution. Hence, the total variability
of an observed effect size, $v_i^*$, is a combination of both the sampling
error or conditional variation, $v_i$,
about each population's $\theta_i$,
and random variation, $\sigma_\theta^2$, of each $\theta_i$
around the mean population effect size:
Variance of estimated effects = random effects variance + estimation (or conditional) variance
$$ v_i^* = \sigma_\theta^2 + v_i . $$
$\sigma_\theta^2$ is referred to as the random effects variance,
the between-studies variance, or the variance component; $v_i$ as the within-study variance, estimation variance, or
the conditional variance of $T_i$ (i.e.,
conditional on
$\theta$ being fixed at the value $\theta_i$); and $v_i^*$ as
the unconditional variance. If the between-studies variance equals zero, then
the equations for random
effects models reduce to those of fixed effects models.
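The effect of the variance decomposition on the weights can be illustrated with a short sketch. The effect sizes, conditional variances, and the value 0.05 used for the random effects variance are all hypothetical; in practice that variance has to be estimated from the data. The function names are ours.

```python
def random_effects_weights(conditional_variances, between_study_variance):
    """Weights 1 / v_i*, where v_i* = sigma_theta^2 + v_i is the unconditional variance."""
    return [1.0 / (v + between_study_variance) for v in conditional_variances]

def weighted_mean(effects, weights):
    """Weighted average of the study effect sizes."""
    return sum(w * t for w, t in zip(weights, effects)) / sum(weights)

# Hypothetical effect sizes and conditional variances from three studies.
effects = [0.20, 0.50, 0.80]
variances = [0.01, 0.04, 0.09]

# A between-study variance of zero reproduces the fixed effects weights 1 / v_i.
fixed_mean = weighted_mean(effects, random_effects_weights(variances, 0.0))
# An assumed (not estimated) between-study variance of 0.05.
random_mean = weighted_mean(effects, random_effects_weights(variances, 0.05))
print(f"fixed effects mean = {fixed_mean:.3f}, random effects mean = {random_mean:.3f}")
```

Adding a common between-study variance to every study makes the weights more nearly equal, so the random effects mean moves toward the unweighted average of the effect sizes.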
When would a
random effects model be appropriate?
If $\sigma_\theta^2$
is significantly different from 0, then it might
be appropriate to use a random rather than a fixed effects model. However, since the power of this test might
be low, the use of a random effects model may be warranted even when such a
test is non-significant. If the studies in
a synthesis are viewed as a random sample from some larger population of
potential studies that have been or could be done, and the researcher wishes to
draw inferences about the larger population of potential studies, then a priori a random effects model is
appropriate.
When planning a
meta-analysis it is important to consider sources of variation in the studies
that are being included in the meta-analysis. Osenberg
et al. (1999) suggest that variation
among studies in effect magnitudes may arise from four sources: experimental,
parametric, functional, and structural. Experimental
variation arises when the procedures under which studies were conducted lead to
differences in effect sizes. Parametric variation occurs when systems are
governed by the same basic processes, yet differ in effect magnitudes generated
by those processes. Functional variation is when systems are so distinct that
the functions that describe the interactions between variables assume different
shapes. Structural variation occurs when systems differ in their causal
processes. In any event, one must be aware of sources of variation in effect
sizes, and account for such variation by appropriate selection of an effect
size measure or perhaps by conducting a mixed-model analysis.
Like any study a
meta-analysis is only as good as the data used in it. There can be problems
with the available data such as: incomplete reporting of data, lack of
independence, publication bias, and research bias. Studies that fail to report
sample size and variance cannot be included in meta-analyses that combine
estimates of effect size. If more than one parameter is used in a study then
the parameters are not independent. To correct for this lack of independence
separate analyses need to be conducted or only one parameter must be examined.
Studies performed in the same lab are also an example of a lack of independence
that could lead to between-study biases. Publication bias may exist when
statistically significant studies are more likely to be published than non-significant ones. Begg (1994) outlines approaches to determine if the
published literature represents a biased sample of the studies actually
conducted. Begg (1994) also describes the file-drawer
problem, and a method of estimating how many non-significant, unpublished
studies would have to exist to change the conclusion of a meta-analysis. It is
also possible that researchers choose to study organisms or systems in which
an effect is more likely to be detected; this could be a
problem for a meta-analysis that is trying to make generalizations about the
natural world (Gurevitch and Hedges 1999).
The steps
required to compute a fixed effects model in meta-analysis are similar to those
in calculating an ANOVA; the means, sum of the scores, and the variance are
calculated for each group. The steps
involved include:
1. The calculation of the grand-mean
2. Calculation of means for different
categories of explanatory variables
3. Calculation of the confidence intervals
around the means
4. Statistical tests are completed to
determine the consistency of the effects within and among categories of the
studies.
Effect size is
calculated for each experiment as the difference between the means of two
groups of individuals, divided by their pooled standard deviation to
standardize the effect among studies.
(We use an
effect size measure from the d -
family to illustrate the calculations for the Fixed-effects model and one from
the r - family to illustrate the
Random Effects Model).
Notation:
k = total number of independent studies among all groups
$m_i$ = number of studies in the ith group
p = number of groups (levels of the covariate)
$T_{ij}$ = observed effect size of the jth study in the ith group
$v_{ij}$ = conditional variance of $T_{ij}$
$w_{ij}$ = weight = $1/v_{ij}$
$\theta$ = population effect size; under the fixed-effects model, we assume $\theta_{11} = \dots = \theta_{p m_p} = \theta$ is the common effect size.
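The general formula for the group weighted mean is given here.
A single dot indicates that the effect size measure has been
averaged across all studies within a particular level of the covariate. The group weighted mean
effect size estimate for the ith group
is
$$ \bar{T}_{i\cdot} = \frac{\sum_{j=1}^{m_i} w_{ij} T_{ij}}{\sum_{j=1}^{m_i} w_{ij}} , \quad i = 1, \dots, p, $$
where the weight $w_{ij}$
is the reciprocal of the variance of $T_{ij}$, $w_{ij} = 1/v_{ij}$.
The Grand Weighted Mean, $\bar{T}_{\cdot\cdot}$, is obtained by
taking the weighted average over all studies in all groups. Two dots indicate the overall grand mean.
$$ \bar{T}_{\cdot\cdot} = \frac{\sum_{i=1}^{p} \sum_{j=1}^{m_i} w_{ij} T_{ij}}{\sum_{i=1}^{p} \sum_{j=1}^{m_i} w_{ij}} $$
The conditional
variance of a group mean is given by the reciprocal of the sum of the weights in that group.
$$ v_{i\cdot} = \frac{1}{\sum_{j=1}^{m_i} w_{ij}} $$
The Grand Mean
Conditional Variance ($v_{\cdot\cdot}$), or sampling variance, is the reciprocal
of the sum of the weights over all groups.
$$ v_{\cdot\cdot} = \frac{1}{\sum_{i=1}^{p} \sum_{j=1}^{m_i} w_{ij}} $$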
Having
obtained the grand weighted mean $\bar{T}_{\cdot\cdot}$
and the sampling variance $v_{\cdot\cdot}$, one can test the null hypothesis
that the overall grand mean effect size does not differ from zero.
$$ Z = \frac{\bar{T}_{\cdot\cdot}}{\sqrt{v_{\cdot\cdot}}} $$
Reject $H_0$
if the absolute value of Z
exceeds the two-tailed critical value
of the standard normal distribution at α = 0.05 (1.96).
Confidence intervals for the grand
mean or group mean effect sizes can be obtained using the following formula,
inserting the appropriate weighted mean effect sizes and conditional
variances:
$$ \bar{T} - z_{\alpha/2} \sqrt{v} \le \theta \le \bar{T} + z_{\alpha/2} \sqrt{v} $$
If the
confidence interval does not include zero, reject $H_0$.
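The grand mean calculations can be sketched as follows for a single set of studies. The weights are the reciprocals of the conditional variances, the sampling variance of the weighted mean is the reciprocal of the summed weights, and Z is the mean divided by its standard error. The effect sizes and variances are hypothetical, and the function name is ours.

```python
from math import sqrt
from statistics import NormalDist

def fixed_effects_summary(effects, variances, alpha=0.05):
    """Return (weighted mean, its variance, Z, confidence interval, reject H0?)."""
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    grand_mean = sum(w * t for w, t in zip(weights, effects)) / total_weight
    grand_variance = 1.0 / total_weight
    z = grand_mean / sqrt(grand_variance)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # 1.96 for alpha = 0.05
    half_width = z_crit * sqrt(grand_variance)
    interval = (grand_mean - half_width, grand_mean + half_width)
    return grand_mean, grand_variance, z, interval, abs(z) > z_crit

# Hypothetical effect sizes and conditional variances from five studies.
effects = [0.35, 0.10, 0.62, 0.48, 0.25]
variances = [0.04, 0.09, 0.06, 0.02, 0.05]
mean_t, var_t, z, ci, reject = fixed_effects_summary(effects, variances)
print(f"grand mean = {mean_t:.3f}, variance = {var_t:.4f}, Z = {z:.2f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), reject H0: {reject}")
```

The same function applies to a single group of studies, in which case it returns the group weighted mean and its conditional variance.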
To test the null hypothesis of no
difference between groups (levels of the covariate) in the average effect size,
an omnibus test for between-group differences is
conducted using the following formula:
$$ Q_B = \sum_{i=1}^{p} w_{i\cdot} \left( \bar{T}_{i\cdot} - \bar{T}_{\cdot\cdot} \right)^2 $$
where $w_{i\cdot}$ is
the reciprocal of the
variance, $v_{i\cdot}$, of $\bar{T}_{i\cdot}$.
$Q_B$ can be considered to be the weighted sum of squares of group
mean effect sizes about the grand mean effect size.
The null
hypothesis is tested by comparing the observed value of $Q_B$
with the upper-tail critical values of
the χ² distribution with p − 1
degrees of freedom (Cooper and Hedges 1994).
If $Q_B$
exceeds the critical value $C_\alpha$, $H_0$
is rejected at the α level.
To test for heterogeneity within groups, an omnibus
test for within-group variation is conducted using the following formula:
$$ Q_W = \sum_{i=1}^{p} \sum_{j=1}^{m_i} w_{ij} \left( T_{ij} - \bar{T}_{i\cdot} \right)^2 $$
The $w_{ij}$
are the reciprocals of $v_{ij}$, which is the sampling variance of $T_{ij}$.
The null
hypothesis is tested by comparing the calculated value of $Q_W$ with the upper-tail critical
values of the chi-squared distribution with k − p
degrees of freedom, where k
is the total number of studies (Cooper and Hedges 1994). If $Q_W$ exceeds
the 100(1 − α) percent critical value, $H_0$ is
rejected. A significant $Q_W$
test would suggest that a fixed-effects model might be inappropriate.
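The between-group and within-group statistics can be sketched together. The code assumes the forms Q_B = Σ w_i.(T̄_i. − T̄..)² and Q_W = ΣΣ w_ij(T_ij − T̄_i.)² with weights equal to reciprocal variances; the two groups of studies are hypothetical and the function name is ours. The statistics are returned with their degrees of freedom so they can be compared with tabled chi-square critical values.

```python
def homogeneity_statistics(groups):
    """groups: list of groups, each a list of (effect size, conditional variance) pairs.

    Returns (Q_B, df_B, Q_W, df_W).
    """
    group_means, group_weights = [], []
    q_within = 0.0
    for studies in groups:
        weights = [1.0 / v for _, v in studies]
        w_dot = sum(weights)
        t_dot = sum(w * t for w, (t, _) in zip(weights, studies)) / w_dot
        # Weighted squared deviations of each study from its own group mean.
        q_within += sum(w * (t - t_dot) ** 2 for w, (t, _) in zip(weights, studies))
        group_means.append(t_dot)
        group_weights.append(w_dot)
    grand_mean = sum(w * t for w, t in zip(group_weights, group_means)) / sum(group_weights)
    # Weighted squared deviations of each group mean from the grand mean.
    q_between = sum(w * (t - grand_mean) ** 2 for w, t in zip(group_weights, group_means))
    k = sum(len(studies) for studies in groups)
    p = len(groups)
    return q_between, p - 1, q_within, k - p

# Hypothetical data: two levels of a covariate with two studies each.
groups = [
    [(0.2, 0.1), (0.4, 0.1)],
    [(0.8, 0.1), (1.0, 0.1)],
]
q_b, df_b, q_w, df_w = homogeneity_statistics(groups)
print(f"Q_B = {q_b:.2f} on {df_b} df, Q_W = {q_w:.2f} on {df_w} df")
```

In this layout Q_B and Q_W add up to the total weighted sum of squares of all effect sizes about the grand mean, which parallels the partition of sums of squares in an ANOVA.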
Several
procedures are available to estimate the random effects variance,
$\sigma_\theta^2$. Shadish and Haddock (1994) present two approaches that are
appropriate when no attempt is being made to determine if study characteristics
(covariates) account for variation in effect sizes. Raudenbush
(1994) outlines a more general procedure that can be used when covariates are
used to model the effects of study characteristics. The Raudenbush
(1994) approach will be presented below in the section on fitting random effects
models with covariates.
Shadish and Haddock (1994) Method 1 for computation of