The Bootstrap, Jackknife, Randomization, and Other Non-Traditional Approaches to Estimation and Hypothesis Testing
Rationale
Much
of modern statistics is anchored in the use of statistics and hypothesis tests that only
have desirable and well-known properties when computed from populations that are normally
distributed. While it is claimed that many
such statistics and hypothesis tests are generally robust with respect to non-normality,
other approaches that require an empirical investigation of the underlying population
distribution or of the distribution of the statistic are possible and in some instances
preferable. In instances when the
distribution of a statistic, conceivably a very complicated statistic, is unknown, no
recourse to a normal theory approach is available and alternative approaches are required.
I. Hypothesis Testing
A. Normal Theory Approach
For illustration, consider Student's t-test for a difference in means when the variances are unknown but are considered to be equal. The hypothesis of interest is
H0: μ1 = μ2.
While several possible alternative hypotheses could be specified, for our purposes
HA: μ1 < μ2.
Given
two samples drawn from populations 1 and 2, assuming that these are normally distributed
populations with equal variances, and that the samples were drawn independently and at
random from each population, then a statistic whose distribution is known can be
elaborated to test H0:
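$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \qquad (1)$$
where $\bar{x}_i$, $s_i^2$, and $n_i$ (i = 1, 2) are the respective sample means, variances and sample sizes.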
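A minimal sketch of how (1) can be computed, assuming (1) is the usual pooled-variance form of the two-sample t statistic. The function name `pooled_t` and the data are hypothetical; they are not the values in Table 1.

```python
import math
from statistics import mean, variance


def pooled_t(sample1, sample2):
    """Two-sample Student's t with a pooled variance estimate.

    Returns the t value and its degrees of freedom, n1 + n2 - 2.
    """
    n1, n2 = len(sample1), len(sample2)
    # statistics.variance uses the n - 1 divisor (the sample variance).
    s1_sq, s2_sq = variance(sample1), variance(sample2)
    pooled = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    t = (mean(sample1) - mean(sample2)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical data, chosen so the arithmetic is easy to follow by hand.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.0, 4.0, 5.0, 6.0]
t_value, df = pooled_t(x, y)
```

Here both sample variances are 2.5, so the pooled variance is 2.5 and the denominator is sqrt(2.5 x 0.4) = 1; the means differ by -1, so t = -1 on 8 degrees of freedom.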
When the conditions stated above are strictly met and H0 is true, (1) is distributed as Student's t with (n1 + n2 - 2) degrees of freedom. The reason for making the assumptions specified above is to allow the investigator to make some statement about the likelihood of the computed t-value, and to decide whether to profess belief in H0 or HA.
The percentiles of the t distribution with the computed degrees of freedom can be interpreted as the conditional probability of observing the computed t value or one larger (or smaller) given that H0 is true:
$$P(t \ge t_{obs} \mid H_0) = \alpha$$
Therefore we would probably wish to profess belief in H0 for suitably large values of α and disbelief for suitably small values. Embedded in this probability we must also include the distributional assumptions mentioned above:
$$P(t \ge t_{obs} \mid H_0,\ \text{normality},\ \text{equal variances},\ \text{independent random samples}) = \alpha$$
If a specific alternative hypothesis had been stated, for example
HA: μ1 = μ2 - 5,
then under the assumption of normality and equal variances, the t-statistic could be recomputed given the new estimate of μ2 under the alternative hypothesis. The conditional probability of obtaining the observed difference in t values as computed under the null and alternative hypotheses (t0 - tA), given specified α, the observed variances, and the particular alternative hypothesis, could be computed:
$$P(t_0 - t_A \mid \alpha,\ s_1^2,\ s_2^2,\ H_A) = \beta$$
This probability is also conditioned on the assumption that both populations are normally distributed.
Recall that of these two conditional probabilities α is the Type I error rate, the probability of rejecting the tested null hypothesis when true, and that β is the Type II error rate, the probability of failing to reject the tested null hypothesis when false.
I
present this review to emphasize that the estimation of each of these probabilities, which
are interpreted as error rates in the process of making a decision about nature, in the
course of interpreting a specific statistical test, is totally contingent on assuming
specific forms for the distribution of the underlying populations. To know these error rates exactly requires that
all the conditions of these tests be met.
B. A distribution-free approach
One way to avoid these distributional assumptions has been the approach variously called nonparametric, rank-order, rank-like, or distribution-free statistics. A series of tests, many of which apply in situations analogous to normal theory statistics, has been elaborated (see refs. 1, 2, 3 for expanded treatments of these procedures).
The key to the function of these statistics is that they are based on the ranks of the actual observations in a joint ranking and not on the observations themselves. For example, the Wilcoxon distribution-free rank sum test can be applied in place of the two-sample (separate groups) Student's t test. The observations from both samples combined are ranked from least to greatest and the sum of the ranks assigned to the observations from either sample is computed. If both samples comprise observations of similar magnitude, then the ranks assigned to sample 1 should be similar to the ranks assigned to sample 2. For fixed sample sizes a fixed number of ranks is possible: for n1 = 5 and n2 = 7, 12 ranks will be assigned. Under the null hypothesis that the location of each population on the number line is identical, the sum of the ranks assigned to either sample should resemble the sum obtained from randomly assigning ranks to the observations in each sample. In this example there are $\binom{12}{5} = 792$ ways of assigning ranks to sample 1 and $\binom{7}{7}$, or 1, way to assign ranks to sample 2 after assigning the ranks to sample 1. The total number of possible arrangements of the ranks is then $792 \times 1 = 792$, and for each arrangement it is possible to compute the sum of the ranks. From this we can enumerate the distribution of the sum of the ranks, W. The rank sum for sample 1 in this example can range from 15 to 50. If we divide the number of arrangements that yield a particular rank sum w by the total number of arrangements, then we have the probability of observing that value:
$$P(W = w \mid H_0) = \frac{\#\{\text{arrangements with rank sum } w\}}{792}$$
To obtain the conditional probability that W ≥ w given H0, we simply tabulate the cumulative probabilities.
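The enumeration described above can be sketched in a few lines. This is only an illustration under the assumption of no tied observations; the function names `rank_sum_distribution` and `upper_tail` are mine.

```python
from collections import Counter
from itertools import combinations


def rank_sum_distribution(n1, n2):
    """Exact null distribution of W, the rank sum of the sample of size n1.

    Every subset of n1 ranks drawn from 1..(n1 + n2) is treated as equally
    likely, which is what the null hypothesis implies when there are no ties.
    """
    ranks = range(1, n1 + n2 + 1)
    counts = Counter(sum(subset) for subset in combinations(ranks, n1))
    total = sum(counts.values())
    dist = {w: k / total for w, k in sorted(counts.items())}
    return dist, total


def upper_tail(dist, w):
    """P(W >= w | H0), accumulated from the enumerated probabilities."""
    return sum(p for value, p in dist.items() if value >= w)


dist, n_arrangements = rank_sum_distribution(5, 7)
p_extreme = upper_tail(dist, 50)  # only the ranks 8..12 give a sum of 50
```

For n1 = 5 and n2 = 7 this visits all 792 arrangements, and the smallest and largest attainable sums are 15 and 50.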
The only additional assumption embedded in this approach is that the observations are independent, but this is also an assumption of the normal theory approach. Notice that we make no assumption about the forms of the underlying populations about which we wish to make inferences, and that the exact distribution of the test statistic, W, is known because it is enumerated. These distribution-free statistics are usually criticized for being less "efficient" than the analogous test based on assuming the populations to be normally distributed. It is true that when the underlying populations are normally distributed the asymptotic relative efficiencies (the ratio of the sample sizes two tests need in order to have equal power against a broad class of alternative hypotheses for fixed α) of distribution-free tests are generally lower than those of their normal theory analogs, but usually not markedly so. In instances where the underlying populations are non-normal, the distribution-free tests can be infinitely more efficient than their normal theory counterparts. In general, this means that distribution-free tests will have higher Type II error rates (β) than normal theory tests when the normal theory assumptions are met. Type I error rates will not be affected. However, if the underlying populations are not normally distributed then normal theory tests can lead to underestimation of both Type I and Type II error rates.
C. Randomization
So
far we have used two approaches to estimating error rates in hypothesis testing that
either require the assumption of a particular form of the distribution of the underlying
population, or that require the investigator to be able to enumerate the distribution of the test statistic
when the null hypothesis is true and under specific alternative hypotheses. What can be done when we neither wish to assume
normality nor can we enumerate the distribution of the test statistic?
Recall
the analogy I used when describing how to generate the expected sum of ranks assigned to a
particular sample under the null hypothesis of identical population locations on a number
line. The analogy was to a process of
randomly assigning ranks to observations independent of one's knowledge of which sample an
observation is a member. A randomization
test makes use of such a procedure, but does so by operating on the observations rather
than the joint ranking of the observations. For
this reason, the distribution of an analogous statistic (the sum of the observations in
one sample) cannot be easily tabulated, although it is theoretically possible to enumerate
such a distribution. From one instance to the next the observations may be of substantially different magnitude, so a single tabulation of the probabilities of observing a specific sum of observations could not be made; a different tabulation would be required for each application of the test. A further problem arises if the sample sizes are large. In the example mentioned previously there are only 792 possible arrangements of values, so the exact distribution of the sum of observations in one sample could conceivably have been enumerated. Had our sample sizes been 10 and 15 then over 3.2 million arrangements would have been possible. If you have had any experience in combinatorial enumeration then you would know that this approach rapidly becomes computationally impractical. With high-speed computers it is certainly possible to tally 3.2 million sums, but developing an efficient algorithm to be sure that each and every arrangement has been included, and included only once, is prohibitive.
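The growth in the number of arrangements is easy to check directly. The short sketch below only counts arrangements; it does not enumerate them, and the function name `n_partitions` is mine.

```python
import math


def n_partitions(n1, n2):
    """Number of ways to split n1 + n2 observations into groups of n1 and n2."""
    return math.comb(n1 + n2, n1)


small = n_partitions(5, 7)     # 792
large = n_partitions(10, 15)   # 3268760
# Chance that one random partition matches any one particular arrangement.
p_single = 1 / large
```

Moving from samples of 5 and 7 to samples of 10 and 15 multiplies the number of arrangements by more than four thousand.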
What
then? Sample. When the universe of possible arrangements is too
large to enumerate why not sample arrangements from this universe independently and at
random? The distribution of the test statistic over this series of samples can then be tabulated, its mean and variance computed, and the error rate associated with a hypothesis test estimated.
Table 1 contains samples of n1 = 10 and n2 = 15, obtained by sampling from 2 populations with μ1 = 200 and μ2 = 190, respectively. A normal theory t-test applied to these data yields t = 3.3216, df = 23, 0.0005 < p < 0.005. The same data examined by the distribution-free Wilcoxon rank sum test yield W* = 2.7735, 0.0026 < p < 0.0030. Applying the randomization approach with 1000 iterations, each of which samples without replacement first 10 and then 15 observations, and computing the t statistic for each of these samples yields the distribution depicted in Figure 1. The normal theory t distribution is depicted as a smooth curve. According to the randomization procedure the probability of observing a t value greater than or equal to that actually observed (3.3216) is 0.005 < p < 0.006. Remember that this sampling procedure, unlike an enumeration, allows each possible arrangement of values to be sampled more than once. The probability that on any iteration a particular arrangement will be chosen is in this instance $1/\binom{25}{10} = 1/3{,}268{,}760$, or approximately $3.06 \times 10^{-7}$. After 1000 such randomizations it is quite possible that some arrangements have been sampled more than once, but there is no reason to believe that particular arrangements yielding either low or high t-values should be systematically included or excluded from the 1000 randomizations.
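A rough sketch of the sampling procedure just described. The function names, the default of 1000 iterations, the fixed seed, and the data are all assumptions made for illustration; this is not the program that produced Figure 1. The statistic is passed in as a function, so a t value or any ad hoc statistic can be substituted for the difference in means used here.

```python
import random
from statistics import mean


def mean_difference(a, b):
    """Difference between the two sample means."""
    return mean(a) - mean(b)


def randomization_test(sample1, sample2, statistic=mean_difference,
                       n_iter=1000, seed=0):
    """Estimate P(statistic >= observed | H0) by random re-partitioning.

    Each iteration shuffles the pooled observations and splits them into
    groups of the original sizes, i.e. sampling without replacement.
    """
    rng = random.Random(seed)
    observed = statistic(sample1, sample2)
    pooled = list(sample1) + list(sample2)
    n1 = len(sample1)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        if statistic(pooled[:n1], pooled[n1:]) >= observed:
            hits += 1
    return observed, hits / n_iter


# Hypothetical, deliberately well-separated samples of sizes 10 and 15.
high = list(range(101, 111))
low = list(range(1, 16))
observed, p_value = randomization_test(high, low)
```

Because arrangements are drawn at random, the same arrangement can turn up more than once, exactly as noted above; the estimated probability is simply the fraction of iterations whose statistic is at least as large as the observed one.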
This
approach is obviously an empirical approach to learning something about the distribution
of a test statistic under specified conditions. This Monte Carlo sample procedure would
have to be performed anew for each new set of observations.
One aspect that may be an advantage of this approach over normal theory approaches
is that any ad hoc test statistic can be elaborated since a direct empirical investigation
of its distributional properties accompanies each test.
For example, we could just use the difference in the sample means, $\bar{x}_1 - \bar{x}_2$, as one test statistic. Figure 2 shows the distribution of $\bar{x}_1 - \bar{x}_2$ over the same 1000 randomizations. The actual difference between sample means is 8.9251 and under the randomization approach the probability of observing a difference this large or larger is 0.004 < p < 0.005. The computed probabilities of observing the t or the $\bar{x}_1 - \bar{x}_2$ actually observed compare favorably with the normal theory estimates.
Figures 3 and 4 illustrate the same procedure applied to two populations whose underlying distributions are exponential. Table 2 presents the sample data generated from these two populations. A normal theory test on these data yields t = -2.0874, df = 23, 0.01 < p < 0.025, while the randomization approach yielded a probability of obtaining the observed value of t or one greater of 0.017 < p < 0.018, and a probability of obtaining the observed or a greater difference in means of p > 0.05. Here we see the normal theory test breaking down and convergence in the results obtained from the distribution-free and randomization tests.
Is
all of this kosher? We can see the parallel
development of the distribution-free and the randomization tests, yet is the randomization
test actually yielding a meaningful result? The
answer is a resounding well-maybe-er-I-don't-know. The
randomization procedure essentially asks the question: given observed samples of sizes n1 and n2, if we assume that these samples came from the same underlying population whose distribution F is given by the n1 + n2 sampled values, with probability mass 1/(n1 + n2) for each observation, what is the chance of partitioning the observations into groups of the sizes observed that have means that differ by an amount as large as that observed? Is
this the best empirical estimate of the distribution of a test statistic?
D. The Bootstrap
The
Bootstrap is another empirical approach to understanding the distributional properties of
a test statistic, but is also useful as a means of estimating statistics and their
standard errors. The bootstrap is very
similar to the randomization procedure outlined above.
The observed distribution of sample values is used as an estimate of the underlying
probability distribution of the population F. Then,
the distribution of a statistic for fixed sample sizes is obtained by repeatedly sampling
from the distribution F, with each value receiving probability mass 1/(n1 + n2), but sampling values with
replacement, so that instead of individual partitions of the data having the potential to
occur more than once, the individual values themselves may appear repeatedly in a single
sample. Under this resampling algorithm the
number of possible sample arrangements is much greater than for the randomization
approach. For example, with a total sample size m = 12 and component samples of size 7 and 5, $12^7 \times 12^5 = 8.9161004 \times 10^{12}$ arrangements are possible. For m = 25, with n1 = 10 and n2 = 15, $25^{10} \times 25^{15} = 8.8817842 \times 10^{34}$ arrangements are possible, factors of $10^{10}$ and $10^{28}$ more arrangements, respectively, compared to the randomization approach. Any test statistic averaged across a series of, say, 1000 samples under this algorithm will have a larger standard error, since sub-samples of F can deviate from F more than under the randomization algorithm. Figures 5 and 6 illustrate the distributions of t values and $\bar{x}_1 - \bar{x}_2$ for 1000 bootstrap samples of the empirical probability distribution presented in Table 1. For the normal populations the bootstrap estimates the probability of the observed t or one greater to be 0.002 < p < 0.003 which, surprisingly, is somewhat less than the randomization approach. This comparison is reversed when examining the differences between means. The bootstrap estimates the probability of the observed mean difference or one greater as 0.022 < p < 0.023, which is an order of magnitude greater than that estimated by the randomization approach, or for that matter for the t-statistic from the same group of bootstrap samples.
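The bootstrap version differs from the randomization sketch only in that values are drawn with replacement. The following is a minimal illustration under my own assumptions (function names, seed, 1000 resamples, hypothetical data), not the code behind Figures 5 and 6.

```python
import random
from statistics import mean


def mean_difference(a, b):
    """Difference between the two sample means."""
    return mean(a) - mean(b)


def bootstrap_test(sample1, sample2, statistic=mean_difference,
                   n_boot=1000, seed=0):
    """Estimate P(statistic >= observed | H0) by resampling the pooled data.

    Each pooled value carries probability mass 1/(n1 + n2) and values are
    drawn WITH replacement, so one value may appear several times in a sample.
    """
    rng = random.Random(seed)
    observed = statistic(sample1, sample2)
    pooled = list(sample1) + list(sample2)
    n1, n2 = len(sample1), len(sample2)
    values = []
    for _ in range(n_boot):
        boot1 = rng.choices(pooled, k=n1)
        boot2 = rng.choices(pooled, k=n2)
        # A statistic that divides by a sample variance (such as t) would
        # need a guard here, because a resample can consist of one repeated value.
        values.append(statistic(boot1, boot2))
    p_value = sum(v >= observed for v in values) / n_boot
    return observed, p_value, values


# Number of ordered resamples for component samples of size 7 and 5.
n_ordered = 12 ** 7 * 12 ** 5

high = list(range(101, 111))   # hypothetical data
low = list(range(1, 16))
observed, p_value, boot_values = bootstrap_test(high, low)
```

Swapping `rng.choices` for a shuffle-and-split of the pooled values turns this back into the randomization procedure, which makes the two approaches easy to compare on the same data.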
Figures
7 and 8 provide similar data for the samples derived from exponentially distributed
populations presented in Table 2. The
bootstrap is more conservative than either the normal theory approach or the randomization
approach when examining the t value obtained for
the exponential populations. This is the
result I would generally expect in a comparison of the bootstrap and randomization.
Which
approach is best? While the randomization
approach can be seen to be analogous to the enumeration of distributions that
characterizes distribution-free statistics, it is unrealistic in that the distribution
of a test statistic across a series of randomized samples is restricted to sub-samples
that contain exactly the same observations as the true samples, once each. In some instances this may be the appropriate
procedure, but in general randomization may give unrealistically small standard errors for
test statistics, so that the true Type I error rates will be greater than nominally stated
and Type II error rates also will be greater than nominally stated. However, in all the examples presented above the
empirical randomization and bootstrap approaches compare favorably with the normal theory
approach.
II. Estimation
A. The Jackknife
In
the course of applying each of the empirical techniques in the construction of hypothesis
tests we could also have estimated test statistics and a suite of characteristics of the
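test statistics and the empirical distributions. In the test of means we obviously could estimate the means, the variances (or standard errors), and the medians (and their standard errors), etc. We could also estimate the bias associated with each of these estimators. Define an estimator $\hat{\theta}$ of the parameter $\theta$ with
$$E(\hat{\theta}) = \theta + c,$$
then the bias of the estimator is c. The sample mean, $\bar{x}$, is an unbiased estimator of μ because $E(\bar{x}) = \mu$; even though different samples may give different estimates of μ, they are all unbiased estimates. In general, however, most estimators are biased, and the bias can be depicted as a Taylor series expansion of the estimator. So the bias of $\hat{\theta}$ is
$$E(\hat{\theta}) - \theta = \frac{a_1}{n} + \frac{a_2}{n^2} + \frac{a_3}{n^3} + \cdots$$
If we define
$$\tilde{\theta} = n\hat{\theta} - (n - 1)\hat{\theta}_{(\cdot)}$$
to be a new estimator of $\theta$, where $\hat{\theta}_{(\cdot)}$ is the average of the n estimates $\hat{\theta}_{(i)}$ each computed with the i-th observation deleted, then the bias of $\tilde{\theta}$ is
$$E(\tilde{\theta}) - \theta = -\frac{a_2}{n(n - 1)} + O(n^{-3})$$
which is less than the bias of $\hat{\theta}$ since it eliminates the term of order 1/n. In practice the estimator $\tilde{\theta}$ is computed as
$$\tilde{\theta} = \frac{1}{n}\sum_{i=1}^{n}\left[n\hat{\theta} - (n - 1)\hat{\theta}_{(i)}\right]$$
where i = 1, ..., n. This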
is the first order jackknife estimator. It is
useful in that it is a less biased estimator although being somewhat more variable than
the un-jackknifed estimator, but this increased variability is at maximum
.
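A small sketch of the first order jackknife, assuming it takes the usual leave-one-out form in which each pseudo-value is n times the full-sample estimate minus (n - 1) times the estimate with one observation deleted. The function names and data are hypothetical. The plug-in variance (divisor n) is used as the biased estimator because its bias is exactly of order 1/n, so the jackknife removes it completely.

```python
from statistics import mean, variance


def jackknife(data, estimator):
    """Return the first order jackknife estimate and its pseudo-values."""
    n = len(data)
    theta_hat = estimator(data)
    # Leave-one-out estimates: the estimator with observation i deleted.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    pseudo_values = [n * theta_hat - (n - 1) * t for t in leave_one_out]
    return mean(pseudo_values), pseudo_values


def plug_in_variance(x):
    """Variance with divisor n, a biased estimator of the population variance."""
    m = mean(x)
    return sum((v - m) ** 2 for v in x) / len(x)


data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]   # hypothetical values
theta_tilde, pseudo_values = jackknife(data, plug_in_variance)
unbiased = variance(data)                # divisor n - 1, for comparison
```

For this particular estimator the jackknifed value coincides with the familiar unbiased sample variance; for estimators whose bias has higher-order terms the correction is only partial.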
Since
the standard error of an estimator decreases as
by a factor of
, the estimator
has dispersion greater by a factor of 1/n than
, but
usually only $n^{-3/2}$ greater than
. Therefore the reduction in bias achieved by using $\tilde{\theta}$ is not offset by a similar increase in the magnitude of the estimator's variance. If we depict the bootstrap estimator as $\hat{\theta}_B$, then the jackknife estimator of the standard error of $\hat{\theta}$ is
$$\widehat{SE}_J = \left[\frac{n}{n - 1}\right]^{1/2} \widehat{SE}_{F_B}(\hat{\theta}_{LIN})$$
where $F_B$ is the empirical bootstrap probability distribution of the random variables and $\hat{\theta}_{LIN}$ is a linear approximation of the estimator $\hat{\theta}$ on the empirical bootstrap probability distribution. This implies that the bootstrap estimator $\hat{\theta}_B$ has a standard error that is smaller by a factor of $[n/(n - 1)]^{1/2}$ than that of the jackknife estimator $\tilde{\theta}$. The jackknife estimator has bias
$$\widehat{BIAS}_J = \frac{n}{n - 1}\left[E_{F_B}(\hat{\theta}_{QUAD}) - \hat{\theta}\right]$$
where $\hat{\theta}_{QUAD}$ is a quadratic approximation of the estimator $\hat{\theta}$ on the distribution $F_B$ and E indicates the expectation with respect to bootstrap sampling. This implies that the bootstrap estimate of bias is smaller by a factor of n/(n - 1) than the jackknife estimate of bias (see ref. 4).
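The factor relating the two standard errors can be checked exactly for the sample mean, which is a linear statistic and so equals its own linear approximation. The sketch below assumes the usual leave-one-out formula for the jackknife standard error, and it uses the closed form for the bootstrap standard error of a mean rather than Monte Carlo resampling; the function names and data are hypothetical.

```python
import math
from statistics import mean


def jackknife_se(data, estimator):
    """Jackknife standard error from the n leave-one-out estimates."""
    n = len(data)
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    center = mean(leave_one_out)
    return math.sqrt((n - 1) / n * sum((t - center) ** 2 for t in leave_one_out))


def ideal_bootstrap_se_of_mean(data):
    """Exact bootstrap standard error of the mean.

    Resampling n values with replacement from the data gives a mean whose
    variance is the plug-in variance (divisor n) divided by n.
    """
    n = len(data)
    m = mean(data)
    plug_in_var = sum((v - m) ** 2 for v in data) / n
    return math.sqrt(plug_in_var / n)


data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]   # hypothetical values
se_jack = jackknife_se(data, mean)
se_boot = ideal_bootstrap_se_of_mean(data)
ratio = se_jack / se_boot                # should equal sqrt(n / (n - 1))
```

With n = 6 the ratio is sqrt(6/5), so the two estimates differ by under ten percent here and the gap closes as n grows.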
In general, then, the bootstrap will provide estimators with less bias and variance than the jackknife. Table 3 shows a data set generated by sampling from two normally distributed populations with μ1 = 200 and μ2 = 200. To test the hypothesis that the variances of these populations are equal, that is
$$H_0: \sigma_1^2 = \sigma_2^2$$
versus
the alternative that
,
we could use the normal theory approach, which is again conditioned on the assumptions mentioned earlier, and elaborate the test statistic based on the sample estimates of the variances,
$$F = \frac{s_1^2}{s_2^2},$$
which is F distributed with n1 - 1 numerator degrees of freedom and n2 - 1 denominator degrees of freedom. Alternatively we could use a
jackknife or a bootstrap estimate of the same or a similar test statistic. The F statistic computed under normal theory assumptions is $F_{9,14}$ = 8.286, p < 0.001, while the bootstrap estimate of the probability of obtaining the observed F or one greater is 0.017 < p < 0.018. The jackknife test is performed on the natural logs of the jackknifed variances rather than the variances themselves. A full description of the computations is given in reference (3). The test statistic for the jackknife test on variances is
$$Q = \frac{\bar{A} - \bar{B}}{\sqrt{V_1 + V_2}},$$
where $\bar{A}$ and $\bar{B}$ are the averages of the natural logs of the variances across the n1 and n2 jackknifed estimates and V1 and V2 are the variances of the jackknifed estimates of the variances. For large samples (n1 + n2 > 10), Q is approximately N(0,1), but for small equal size samples it follows Student's t distribution with n1 + n2 - 2 degrees of freedom. For this example Q = -1.8883, 0.0294 < p < 0.03. Figure 9 presents the distribution of the bootstrap estimates of F, and Table 4 presents the jackknifed pseudo-values, their standard errors and bias. Table 5 and Figure 10 provide a similar test for two exponential populations. Under the assumption of normality $F_{9,14}$ = 2.783, 0.025 < p < 0.05. The jackknife test, however, yields Q = 3.095, 0.0009 < p < 0.001, and the bootstrap yields p > 0.05. In these instances the jackknife is the most powerful test.
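For concreteness, here is a hedged sketch of both statistics. It assumes the jackknife test has the form I understand reference (3) to describe: pseudo-values of the natural log of the sample variance are averaged within each sample, and Q is the difference of the two averages divided by the square root of the sum of the variances of those averages. The function names and data are hypothetical; the data are not those of Tables 3 or 5, and the sign convention for Q (sample 1 minus sample 2) is my own choice.

```python
import math
from statistics import mean, variance


def jackknife_log_variance(sample):
    """Average of the pseudo-values of ln(s^2) and the variance of that average."""
    n = len(sample)
    full = math.log(variance(sample))
    pseudo = [
        n * full - (n - 1) * math.log(variance(sample[:i] + sample[i + 1:]))
        for i in range(n)
    ]
    avg = mean(pseudo)
    var_of_avg = sum((a - avg) ** 2 for a in pseudo) / (n * (n - 1))
    return avg, var_of_avg


def variance_tests(sample1, sample2):
    """Return the normal theory F ratio and the jackknife statistic Q."""
    f_ratio = variance(sample1) / variance(sample2)
    a_bar, v1 = jackknife_log_variance(sample1)
    b_bar, v2 = jackknife_log_variance(sample2)
    q = (a_bar - b_bar) / math.sqrt(v1 + v2)
    return f_ratio, q


# Hypothetical samples: x is visibly more dispersed than y.
x = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0]
y = [3.0, 3.5, 2.5, 4.0, 3.0, 3.2, 2.8]
f_ratio, q = variance_tests(x, y)
```

F would then be referred to the F distribution with n1 - 1 and n2 - 1 degrees of freedom, and Q to N(0,1) or Student's t as described above. Swapping the two samples inverts F and changes the sign of Q, so the direction of the alternative has to be fixed before either value is interpreted.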
III. Prospectus
So
far I have presented in a non-rigorous fashion a number of computationally expensive,
empirical approaches to estimation and hypothesis testing. The theory underlying some of
these approaches is well developed and I refer you to the reference list for that
material. However, much of what I have
presented has no rigorous theoretical underpinnings, but can be shown to be quite useful
particularly in situations where the assumption of normality is suspect. The prognosis among statisticians is that theory
will catch up to our computational prowess, so that many of these procedures will be
justified and should be adopted. In the
interim, however, should you choose to employ one of the more radical of these procedures
be prepared for considerable disagreement over its validity and usefulness. The prospects for further development of these kinds of procedures, and for work to establish their limitations, advantages, and care and maintenance, are considerable. At present, however, the burden of investigating the properties of one of these procedures, in its application to a particular situation and test statistic, rests with the investigator.
References
1. Bradley, J.V. 1968. Distribution-free statistical tests. Prentice-Hall, Inc: Englewood Cliffs, N.J.
2. Conover, W.J. 1980. Practical nonparametric statistics. John Wiley and Sons: New York.
3. Hollander, M. and D.A. Wolfe. 1973. Nonparametric statistical methods. John Wiley and Sons: New York.
4. Efron, B. and G. Gong. 1983. A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician 37: 36-48.
Other Readable Literature
Miller, R.G. 1974. The jackknife - a review. Biometrika 61: 1-15.
Peters, S.C. and D.A. Freedman. 1984. Some notes on the bootstrap in regression problems. Journal of Business and Economic Statistics 2: 406-409.
Efron, B. 1979. Bootstrap methods: another look at the jackknife. Annals of Statistics 7: 1-26.
Other Not So Readable Literature
Arvesen, J.N. 1969. Jackknifing U-statistics. Annals of Mathematical Statistics 40: 2076-2100.
Miller, R.G. 1964. A trustworthy jackknife. Annals of Mathematical Statistics 35: 1594-1605.
Miller, R.G. 1968. Jackknifing variances. Annals of Mathematical Statistics 39: 567-582.
Quenouille, M.H. 1956. Notes on bias in estimation. Biometrika 43: 353-360.
Some Applications
Zahl, S. 1977. Jackknifing an index of diversity. Ecology 58: 907-913.
Heltshe, J.F. and N.E. Forrester. 1985. Statistical evaluation of the jackknife estimate of diversity when using quadrat samples. Ecology 66: 107-111.
Routledge, R.D. 1980. Bias in estimating the diversity of large uncensused communities. Ecology 61: 276-281.