The Bootstrap, Jackknife, Randomization, and other non-traditional approaches to estimation and hypothesis testing




Much of modern statistics is anchored in the use of statistics and hypothesis tests that only have desirable and well-known properties when computed from populations that are normally distributed.  While it is claimed that many such statistics and hypothesis tests are generally robust with respect to non-normality, other approaches that require an empirical investigation of the underlying population distribution or of the distribution of the statistic are possible and in some instances preferable.  In instances when the distribution of a statistic, conceivably a very complicated statistic, is unknown, no recourse to a normal theory approach is available and alternative approaches are required.


I.   Hypothesis Testing


            A.  Normal Theory Approach


For illustration, consider Student's t-test for differences in means when variances are unknown but are considered to be equal.  The hypothesis of interest is that


          H0: μ1 = μ2.


While several possible alternative hypotheses could be specified, for our purposes


          HA: μ1 < μ2.


Given two samples drawn from populations 1 and 2, assuming that these are normally distributed populations with equal variances, and that the samples were drawn independently and at random from each population, then a statistic whose distribution is known can be elaborated to test H0:


          $$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}, \qquad s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} \qquad (1) $$


where $\bar{x}_i$, $s_i^2$ and $n_i$ (i = 1, 2) are the respective sample means, variances and sample sizes.


When the conditions stated above are strictly met and H0 is true, (1) is distributed as Student's t with (n1 + n2 - 2) degrees of freedom.
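

As a concrete check on the arithmetic, the following is a minimal Python sketch of the pooled-variance t statistic for the equal-variance case.  The function name pooled_t and the sample values are illustrative assumptions; they are not the data analyzed later in these notes.

```python
import math

def pooled_t(sample1, sample2):
    """Two-sample Student's t with a pooled variance estimate."""
    n1, n2 = len(sample1), len(sample2)
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2
    # Unbiased sample variances (divisor n - 1).
    var1 = sum((x - mean1) ** 2 for x in sample1) / (n1 - 1)
    var2 = sum((x - mean2) ** 2 for x in sample2) / (n2 - 1)
    # Pooled variance, weighted by the degrees of freedom of each sample.
    pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    t = (mean1 - mean2) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Made-up numbers for illustration only.
t_value, df = pooled_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(t_value, df)
```

The returned degrees of freedom are the ones used to look up the percentile of the t distribution.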


The reason for making the assumptions specified above is to allow the investigator to make some statement about the likelihood of the computed t-value, and to make a decision as to whether to profess belief in either H0 or HA.  The percentiles of the t distribution with the computed degrees of freedom can be interpreted as the conditional probability of observing the computed t value or one larger (or smaller) given that H0 is true:


          $$ \alpha = P(T \ge t \mid H_0). $$


Therefore we would probably wish to profess belief in H0 for suitably large values of α and disbelief for suitably small values.  Embedded in this probability we must also include the distributional assumptions mentioned above:


          $$ \alpha = P(T \ge t \mid H_0,\ \text{normal populations},\ \text{equal variances},\ \text{independent random samples}). $$
If a specific alternative hypothesis had been stated, for example


                          HA: μ1 = μ2 - 5,


then under the assumption of normality and equal variances, the t-statistic could be recomputed given the new estimate of μ2 under the alternative hypothesis.  The conditional probability of obtaining the observed difference in t values as computed under the null and alternative hypotheses (t0 - ta), given specified α, the observed variances, and the particular alternative hypothesis could be computed:


          $$ \beta = P(t_0 - t_a \mid \alpha,\ s_1^2,\ s_2^2,\ H_A). $$


This probability is also conditioned on the assumption that both populations are normally distributed.


Recall that of these two conditional probabilities α is the Type I error rate, the probability of rejecting the tested null hypothesis when true, and that β is the Type II error rate, the probability of failing to reject the tested null hypothesis when false.


I present this review to emphasize that the estimation of each of these probabilities, which are interpreted as error rates in the process of making a decision about nature, in the course of interpreting a specific statistical test, is totally contingent on assuming specific forms for the distribution of the underlying populations.  To know these error rates exactly requires that all the conditions of these tests be met.


            B.  A distribution-free approach


One way to avoid these distributional assumptions has been the approach now called non-parametric, rank-order, rank-like, and distribution-free statistics.  A series of tests, many of which apply in situations analogous to normal theory statistics, has been elaborated (see refs. 1, 2, 3 for expanded treatments of these procedures).


The key to the function of these statistics is that they are based on the ranks of the actual observations in a joint ranking and not on the observations themselves.  For example, the Wilcoxon distribution-free rank sum test can be applied in place of the two-sample (separate groups) Student's t test.  The observations from both samples combined are ranked from least to greatest and the sum of the ranks assigned to the observations from either sample is computed.  If both samples are comprised of observations that are of similar magnitude, then the ranks assigned to sample 1 should be similar to the ranks assigned to sample 2.  For fixed sample sizes a fixed number of ranks is possible; for n1 = 5 and n2 = 7, 12 ranks will be assigned.  Under the null hypothesis that the location of each population on the number line is identical, the sum of the ranks assigned to either sample should equal the sum obtained from randomly assigning ranks to the observations in each sample.  In this example there are $\binom{12}{5} = 792$ ways of assigning ranks to sample 1 and $\binom{7}{7}$, or 1, way to assign ranks to sample 2 after assigning the ranks to sample 1.  The total number of possible arrangements of the ranks is then 792 x 1 = 792, and it is possible for each arrangement to compute the sum of the ranks.  From this we can enumerate the distribution of the sum of the ranks, W.  W in this example can range from 15 to 50.  If we divide the number of arrangements that yield a particular rank sum w by the total number of arrangements, then we have the probability of observing a particular value of w equal to


          $$ P(W = w \mid H_0) = \frac{\#\{\text{arrangements with rank sum } w\}}{\binom{12}{5}}. $$


To obtain the conditional probability that W ≥ w given H0, we simply tabulate the cumulative probabilities.
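

The enumeration for n1 = 5 and n2 = 7 is small enough to carry out directly.  The Python sketch below lists every way of giving 5 of the 12 ranks to sample 1 and tabulates the null distribution of W.  The variable and function names are illustrative choices, and ties among observations are assumed not to occur.

```python
from collections import Counter
from itertools import combinations

n1, n2 = 5, 7
ranks = range(1, n1 + n2 + 1)

# Tally the rank sum W for every way of giving n1 of the 12 ranks to sample 1.
counts = Counter(sum(chosen) for chosen in combinations(ranks, n1))
total = sum(counts.values())

# Null probability of each attainable rank sum.
prob = {w: k / total for w, k in counts.items()}

def upper_tail(w):
    """P(W >= w | H0), obtained by accumulating the enumerated counts."""
    return sum(k for value, k in counts.items() if value >= w) / total

print(total, min(counts), max(counts))
print(upper_tail(45))
```

Because every arrangement is equally likely under H0, the tail probability is simply a count of arrangements divided by the total.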


The only additional assumption embedded in this approach is that the observations are independent, but this is also an assumption of the normal theory approach.  Notice we make no assumption about the forms of the underlying populations about which we wish to make inferences, and that the exact distribution of the test statistic, W, is known because it is enumerated.  These distribution-free statistics are usually criticized for being less "efficient" than the analogous test based on assuming the populations to be normally distributed.  It is true that when the underlying populations are normally distributed the asymptotic relative efficiencies (the ratio of the sample sizes two tests need to have equal power against a broad class of alternative hypotheses for fixed α) of distribution-free tests are generally lower than those of their normal theory analogs, but usually not markedly so.  In instances where the underlying populations are non-normal the distribution-free tests can be infinitely more efficient than their normal theory counterparts.  In general, this means that distribution-free tests will have higher Type II error rates (β) than normal theory tests when the normal theory assumptions are met.  Type I error rates will not be affected.  However, if the underlying populations are not normally distributed then normal theory tests can lead to underestimation of both Type I and Type II error rates.


            C.  Randomization


So far we have used two approaches to estimating error rates in hypothesis testing that either require the assumption of a particular form of the distribution of the underlying population, or that require the investigator to be able to enumerate the distribution of the test statistic when the null hypothesis is true and under specific alternative hypotheses.  What can be done when we neither wish to assume normality nor can we enumerate the distribution of the test statistic?


Recall the analogy I used when describing how to generate the expected sum of ranks assigned to a particular sample under the null hypothesis of identical population locations on a number line.  The analogy was to a process of randomly assigning ranks to observations independent of one's knowledge of which sample an observation is a member.  A randomization test makes use of such a procedure, but does so by operating on the observations rather than the joint ranking of the observations.  For this reason, the distribution of an analogous statistic (the sum of the observations in one sample) cannot be easily tabulated, although it is theoretically possible to enumerate such a distribution.  From one instance to the next the observations may be of substantially different magnitude, so a single tabulation of the probabilities of observing a specific sum of observations could not be made; a different tabulation would be required for each application of the test.  A further problem arises if the sample sizes are large.  In the example mentioned previously there are only 792 possible arrangements of values, so the exact distribution of the sum of observations in one sample could conceivably have been enumerated.  Had our sample sizes been 10 and 15 then over 3.2 million arrangements would have been possible.  If you have had any experience in combinatorial enumeration then you would know that this approach rapidly becomes computationally impractical.  With high-speed computers it is certainly possible to tally 3.2 million sums, but developing an efficient algorithm to be sure that each and every arrangement has been included, and included only once, is prohibitive.
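

The counts quoted above follow from the binomial coefficient.  A short Python check, using only the standard library (the variable names are illustrative):

```python
from math import comb

# Ways of choosing which observations fall in sample 1.
small = comb(12, 5)    # n1 = 5,  n2 = 7
large = comb(25, 10)   # n1 = 10, n2 = 15

print(small)           # number of arrangements for the small example
print(large)           # number of arrangements for samples of 10 and 15
print(1 / large)       # chance that one random draw hits a given arrangement
```

The jump from hundreds to millions of arrangements with a modest increase in sample size is what motivates sampling the arrangements instead of enumerating them.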


What then?  Sample.  When the universe of possible arrangements is too large to enumerate, why not sample arrangements from this universe independently and at random?  The distribution of the test statistic over this series of samples can then be tabulated, its mean and variance computed, and the error rate associated with a hypothesis test estimated.
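

A minimal Python sketch of this sampling idea follows.  It assumes the difference in sample means as the test statistic, a fixed random seed for repeatability, and invented data; the function names are hypothetical and any other statistic could be passed in.

```python
import random

def mean_difference(a, b):
    return sum(a) / len(a) - sum(b) / len(b)

def randomization_test(sample1, sample2, statistic=mean_difference,
                       n_iter=1000, seed=0):
    """Estimate P(statistic >= observed) by random repartitioning."""
    rng = random.Random(seed)
    n1 = len(sample1)
    pooled = list(sample1) + list(sample2)
    observed = statistic(sample1, sample2)
    hits = 0
    for _ in range(n_iter):
        # Shuffle, then split: the first n1 values form sample 1 and the
        # rest form sample 2, i.e. sampling without replacement.
        rng.shuffle(pooled)
        if statistic(pooled[:n1], pooled[n1:]) >= observed:
            hits += 1
    return observed, hits / n_iter

# Illustrative data only.
x = [12.1, 14.3, 13.8, 15.0, 14.1]
y = [10.2, 11.5, 9.8, 10.9, 11.1, 10.4]
obs, p_upper = randomization_test(x, y)
print(obs, p_upper)
```

The estimated probability is itself subject to sampling error, which shrinks as the number of iterations grows.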


Table 1 contains samples of n1 = 10 and n2 = 15, obtained from sampling from 2 populations with μ1 = 200,  and μ2 = 190, , respectively.  A normal theory t-test applied to these data yields t = 3.3216, df = 23, 0.0005 < p < 0.005.  The same data examined by the distribution-free Wilcoxon rank sum test yields W* = 2.7735, 0.0026 < p < 0.0030.  Applying this randomization approach with 1000 iterations of sampling without replacement first 10 and then 15 observations and computing the t statistic for each of these samples yields the distribution depicted in Figure 1.  The normal theory t distribution is depicted as a smooth curve.  According to the randomization procedure the probability of observing a t value greater than or equal to that actually observed (3.3216) is 0.005 < p < 0.006.  Remember that this sampling procedure, unlike an enumeration, allows each possible arrangement of values to be sampled more than once.  The probability that on any iteration a particular arrangement will be chosen is in this instance $1/\binom{25}{10}$ = 1/3,268,760, or approximately 3.06 x 10^-7.  After 1000 such randomizations it is quite possible that some arrangements have been sampled more than once, but there is no reason to believe that particular arrangements yielding either low or high t-values should be systematically included or excluded from the 1000 randomizations.


This approach is obviously an empirical approach to learning something about the distribution of a test statistic under specified conditions.  This Monte Carlo sampling procedure would have to be performed anew for each new set of observations.  One aspect that may be an advantage of this approach over normal theory approaches is that any ad hoc test statistic can be elaborated, since a direct empirical investigation of its distributional properties accompanies each test.  For example, we could just use the difference in the sample means, $\bar{x}_1 - \bar{x}_2$, as one test statistic.  Figure 2 shows the distribution of $\bar{x}_1 - \bar{x}_2$ over the same 1000 randomizations.  The actual difference between sample means is 8.9251 and under the randomization approach the probability of observing a difference this large or larger is 0.004 < p < 0.005.  The computed probability of observing the t or $\bar{x}_1 - \bar{x}_2$ actually observed compares favorably with the normal theory estimates.


Figures 3 and 4 illustrate the same procedure applied to two populations whose underlying distributions are exponential.  Table 2 presents the sample data generated from two populations with  and , respectively.  A normal theory test on these data yields t = -2.0874, df = 23, 0.01 < p < 0.025, while the randomization approach yielded a probability of obtaining the observed value of t or one greater of 0.017 < p < 0.018, and a probability of obtaining the observed or a greater difference in means of p > 0.05.  Here we see the normal theory test breaking down and convergence in the results obtained from the distribution-free and randomization tests.


Is all of this kosher?  We can see the parallel development of the distribution-free and the randomization tests, yet is the randomization test actually yielding a meaningful result?  The answer is a resounding well-maybe-er-I-don't-know.  The randomization procedure essentially asks the question, given observed samples n1 and n2, if we assume that these samples came from the same underlying population whose distribution F is given by the n1 + n2 sampled values, with probability mass 1/(n1 +n2) for each observation, what is the chance of partitioning the observations into groups of the size observed that have means that differ by an amount as large as that observed?  Is this the best empirical estimate of the distribution of a test statistic?




            D.   The Bootstrap


The Bootstrap is another empirical approach to understanding the distributional properties of a test statistic, but is also useful as a means of estimating statistics and their standard errors.  The bootstrap is very similar to the randomization procedure outlined above.  The observed distribution of sample values is used as an estimate of the underlying probability distribution of the population F.  Then, the distribution of a statistic for fixed sample sizes is obtained by repeatedly sampling from the distribution F, with each value receiving probability mass 1/(n1 + n2), but sampling values with replacement, so that instead of individual partitions of the data having the potential to occur more than once, the individual values themselves may appear repeatedly in a single sample.  Under this resampling algorithm the number of possible sample arrangements is much greater than for the randomization approach.  For example, with a total sample size m = 12, with component samples of size 7 and 5, 12^7 x 12^5 = 8.9161004 x 10^12 arrangements are possible.  For m = 25, and n1 = 10, n2 = 15, 25^10 x 25^15 = 8.8817842 x 10^34 arrangements are possible, factors of 10^10 and 10^28 more arrangements, respectively, compared to the randomization approach.  Any test statistic averaged across a series of say 1000 samples under this algorithm will have a larger standard error since sub-samples of F can deviate from F more than under the randomization algorithm.  Figures 5 and 6 illustrate the distribution of t values and $\bar{x}_1 - \bar{x}_2$ for 1000 bootstrap samples of the empirical probability distribution presented in Table 1.  For the normal populations the bootstrap estimates the probability of the observed t or one greater to be 0.002 < p < 0.003 which, surprisingly, is somewhat less than the randomization approach.  This comparison is reversed when examining the differences between means.  The bootstrap estimates the probability of the observed mean difference or one greater as 0.022 < p < 0.023, which is roughly five times that estimated by the randomization approach, and an order of magnitude greater than that for the t-statistic from the same group of bootstrap samples.
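

The resampling scheme just described can be sketched in Python as follows.  The sketch assumes the pooled observations serve as the estimate of F, uses the difference in means as the statistic, and runs on invented data with a fixed seed; the function names are hypothetical.

```python
import random

def mean_difference(a, b):
    return sum(a) / len(a) - sum(b) / len(b)

def bootstrap_test(sample1, sample2, statistic=mean_difference,
                   n_iter=1000, seed=0):
    """Estimate P(statistic >= observed) by resampling with replacement."""
    rng = random.Random(seed)
    n1, n2 = len(sample1), len(sample2)
    pooled = list(sample1) + list(sample2)
    observed = statistic(sample1, sample2)
    hits = 0
    for _ in range(n_iter):
        # Every draw gives each pooled value mass 1/(n1 + n2); a value may
        # therefore appear several times within one bootstrap sample.
        boot1 = rng.choices(pooled, k=n1)
        boot2 = rng.choices(pooled, k=n2)
        if statistic(boot1, boot2) >= observed:
            hits += 1
    return observed, hits / n_iter

# Counts of ordered with-replacement draws quoted in the text.
arrangements_12 = 12 ** 7 * 12 ** 5
arrangements_25 = 25 ** 10 * 25 ** 15

# Illustrative data only.
x = [12.1, 14.3, 13.8, 15.0, 14.1]
y = [10.2, 11.5, 9.8, 10.9, 11.1, 10.4]
obs, p_boot = bootstrap_test(x, y)
print(obs, p_boot, arrangements_12, arrangements_25)
```

The only change from a randomization sketch is the sampling step: draws with replacement take the place of a shuffle and split.  A statistic such as t would need a guard against bootstrap samples with zero variance.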


Figures 7 and 8 provide similar data for the samples derived from exponentially distributed populations presented in Table 2.  The bootstrap is more conservative than either the normal theory approach or the randomization approach when examining the t value obtained for the exponential populations.  This is the result I would generally expect in a comparison of the bootstrap and randomization.


Which approach is best?  While the randomization approach can be seen to be analogous to the enumeration of distributions that characterizes distribution-free statistics, it is unrealistic in that the distribution of a test statistic across a series of randomized samples is restricted to sub-samples that contain exactly the same observations as the true samples, once each.  In some instances this may be the appropriate procedure, but in general randomization may give unrealistically small standard errors for test statistics, so that the true Type I error rates will be greater than nominally stated and Type II error rates also will be greater than nominally stated.  However, in all the examples presented above the empirical randomization and bootstrap approaches compare favorably with the normal theory approach.


II.  Estimation


            A.  The Jackknife


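In the course of applying each of the empirical techniques in the construction of hypothesis tests we could also have estimated test statistics and a suite of characteristics of the test statistics and the empirical distributions.  In the test of means we obviously could estimate the means, the variances (or standard errors), the medians (and their standard errors), etc.  We could also estimate the bias associated with each of these estimators.  Define an estimator $\hat{\theta}$ of the parameter $\theta$ such that


          $$ E(\hat{\theta}) = \theta + c, $$


then the bias of the estimator is c.  The sample mean, $\bar{x}$, is an unbiased estimator of μ because $E(\bar{x}) = \mu$; even though different samples may give different estimates $\bar{x}$ of μ they are all unbiased estimates.  In general, however, most estimators are biased, and the bias can be depicted as a Taylor series expansion of the estimator.  So the bias of $\hat{\theta}_n$, the estimator computed from a sample of size n, is


          $$ E(\hat{\theta}_n) - \theta = \frac{a_1}{n} + \frac{a_2}{n^2} + \frac{a_3}{n^3} + \cdots $$


If we define


          $$ \tilde{\theta} = n \hat{\theta}_n - (n - 1) \hat{\theta}_{n-1} $$


to be a new estimator of $\theta$, then the bias of $\tilde{\theta}$ is


          $$ E(\tilde{\theta}) - \theta = -\frac{a_2}{n(n-1)} + O(n^{-3}), $$


which is less than the bias of $\hat{\theta}_n$ since it eliminates the term of order 1/n.  In practice the estimator $\tilde{\theta}$ is computed as


          $$ \tilde{\theta} = n \hat{\theta}_n - \frac{n-1}{n} \sum_{i=1}^{n} \hat{\theta}_{(-i)}, $$


where $\hat{\theta}_{(-i)}$ is the estimate computed with the i-th observation deleted, i = 1, ... n.  This is the first order jackknife estimator.  It is useful in that it is a less biased estimator although being somewhat more variable than the un-jackknifed estimator, but this increased variability is at maximum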




Since the standard error of an estimator decreases as  by a factor of , the estimator  has dispersion greater by a factor of 1/n than , but usually only n^(-3/2) greater than .  Therefore the reduction in bias achieved by using the jackknife estimator $\tilde{\theta}$ is not offset by a similar increase in the magnitude of the estimator's variance.  If we depict the bootstrap estimator as $\hat{\theta}_B$ then the jackknife estimator of the standard error of $\hat{\theta}$ is


          $$ \hat{\sigma}_J = \left[ \frac{n}{n-1} \right]^{1/2} \left[ \mathrm{Var}_{F_B}\!\left( \hat{\theta}_{LIN} \right) \right]^{1/2}, $$


where FB is the empirical bootstrap probability distribution of the random variables and $\hat{\theta}_{LIN}$ is a linear approximation of the estimator $\hat{\theta}$ on the empirical bootstrap probability distribution.  This implies that the bootstrap estimator $\hat{\theta}_B$ has a standard error that is [n/(n - 1)]^(1/2) times less than the jackknife estimator $\tilde{\theta}$.  The jackknife estimator has bias


          $$ \widehat{\mathrm{bias}}_J = \frac{n}{n-1} \left[ E_{F_B}\!\left( \hat{\theta}_{QUAD} \right) - \hat{\theta} \right], $$


where $\hat{\theta}_{QUAD}$ is a quadratic approximation of the estimator $\hat{\theta}$ on the distribution FB and E indicates the expectation with respect to bootstrap sampling.  This implies that the bootstrap estimate of bias is n/(n - 1) times less than the jackknife estimate of bias (see ref. 4).
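

The first order jackknife is easy to compute by deleting one observation at a time.  The Python sketch below returns the bias-corrected estimate, the estimated bias, and the usual jackknife standard error, and applies them to the variance computed with divisor n, a standard example of an estimator with bias of order 1/n.  The function names and the data are illustrative assumptions.

```python
import math

def jackknife(data, estimator):
    """Return (bias-corrected estimate, estimated bias, standard error)."""
    n = len(data)
    full = estimator(data)
    # Leave-one-out estimates: recompute with observation i deleted.
    loo = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    loo_mean = sum(loo) / n
    bias = (n - 1) * (loo_mean - full)
    corrected = n * full - (n - 1) * loo_mean   # equals full - bias
    se = math.sqrt((n - 1) / n * sum((v - loo_mean) ** 2 for v in loo))
    return corrected, bias, se

def plug_in_variance(values):
    """Variance with divisor n (biased downward by a factor (n - 1)/n)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Illustrative data only.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
corrected, bias, se = jackknife(data, plug_in_variance)
print(plug_in_variance(data), corrected, bias, se)
```

For this particular estimator the correction is exact: the jackknifed value coincides with the familiar variance computed with divisor n - 1.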


In general then the bootstrap will provide estimators with less bias and variance than the jackknife.  Table 3 shows a data set generated by sampling from two normally distributed populations with μ1 = 200, , and μ2 = 200 and .  To test the hypothesis that the variances of these populations are equal, that is


          $$ H_0: \sigma_1^2 = \sigma_2^2 $$


versus the alternative that


          $$ H_A: \sigma_1^2 > \sigma_2^2 $$


we could use the normal theory approach, which is again conditioned on the assumptions mentioned earlier, and elaborate the test statistic based on the sample estimates of $\sigma_1^2$ and $\sigma_2^2$,


          $$ F = \frac{s_1^2}{s_2^2}, $$

which is F distributed with n1 - 1 numerator degrees of freedom and n2 - 1 denominator degrees of freedom.  Alternatively we could use a jackknife or a bootstrap estimate of the same or a similar test statistic.  The F statistic computed under normal theory assumptions is F9,14 = 8.286, p < 0.001, while the bootstrap estimate of the probability of obtaining the observed F or one greater is 0.017 < p < 0.018.  The jackknife test is performed on the natural logs of the jackknifed variances rather than the variances themselves.  A full description of the computations is given in reference (3).  The test statistic for the jackknife test on variances is


          $$ Q = \frac{\bar{A}_1 - \bar{A}_2}{\sqrt{V_1 + V_2}}, $$


where $\bar{A}_1$ and $\bar{A}_2$ are the averages of the natural logs of the variances across the n1 and n2 jackknifed estimates and V1 and V2 are the variances of the jackknifed estimates of the variances.  For large samples (n1 + n2 > 10), Q is approximately N(0,1), but for small equal size samples it follows Student's t distribution with n1 + n2 - 2 degrees of freedom.  For this example Q = -1.8883, 0.0294 < p < 0.03.  Figure 9 presents the distribution of the bootstrap estimates of F, and Table 4 presents the jackknifed pseudo-values, their standard errors, and bias.  Table 5 and Figure 10 provide a similar test for two exponential populations.  Under the assumption of normality F9,14 = 2.783, 0.025 < p < 0.05.  The jackknife test, however, yields Q = 3.095, 0.0009 < p < 0.001, and the bootstrap yields p > 0.05.  In these instances the jackknife is the most powerful test.
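

A sketch of a jackknife test on variances in Python follows.  It assumes the pseudo-value form commonly attributed to Miller (1968): each pseudo-value is n ln(S^2) - (n - 1) ln(S^2 with observation i deleted), and V is taken as the variance of the mean of the pseudo-values.  The computations behind the Q values quoted above are those of reference (3) and may differ in detail; the function names and data here are hypothetical.

```python
import math

def log_variance_pseudovalues(sample):
    """Jackknife pseudo-values of the natural log of the sample variance."""
    n = len(sample)

    def log_var(values):
        m = sum(values) / len(values)
        return math.log(sum((v - m) ** 2 for v in values) / (len(values) - 1))

    full = log_var(sample)
    return [n * full - (n - 1) * log_var(sample[:i] + sample[i + 1:])
            for i in range(n)]

def jackknife_variance_test(sample1, sample2):
    """Q = (A1bar - A2bar) / sqrt(V1 + V2) under the assumptions above."""
    summaries = []
    for sample in (sample1, sample2):
        pseudo = log_variance_pseudovalues(list(sample))
        n = len(pseudo)
        mean = sum(pseudo) / n
        # Variance of the mean of the pseudo-values (assumed meaning of V).
        v = sum((p - mean) ** 2 for p in pseudo) / (n * (n - 1))
        summaries.append((mean, v))
    (a1, v1), (a2, v2) = summaries
    return (a1 - a2) / math.sqrt(v1 + v2)

# Illustrative data: y is x scaled by 10, so its variance is 100 times larger.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [10.0 * v for v in x]
q_xy = jackknife_variance_test(x, y)
print(q_xy)
```

Working on the log scale turns a ratio of variances into a difference, which is why Q has the form of a two-sample comparison of means.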


III.  Prospectus


So far I have presented in a non-rigorous fashion a number of computationally expensive, empirical approaches to estimation and hypothesis testing.  The theory underlying some of these approaches is well developed and I refer you to the reference list for that material.  However, much of what I have presented has no rigorous theoretical underpinnings, but can be shown to be quite useful, particularly in situations where the assumption of normality is suspect.  The prognosis among statisticians is that theory will catch up to our computational prowess, so that many of these procedures will be justified and should be adopted.  In the interim, however, should you choose to employ one of the more radical of these procedures, be prepared for considerable disagreement over its validity and usefulness.  The prospects for further development of these kinds of procedures, and for work to establish their limitations, advantages, and care and maintenance, are considerable.  At present, however, the burden of investigating the properties of one of these procedures, in its application to a particular situation and test statistic, rests with the investigator.




1. Bradley, J.V.  1968.  Distribution-free statistical tests.  Prentice-Hall, Inc.: Englewood Cliffs, N.J.

2. Conover, W.J.  1980.  Practical nonparametric statistics.  John Wiley and Sons: New York.

3. Hollander, M. and D.A. Wolfe.  1973.  Nonparametric statistical methods.  John Wiley and Sons: New York.

4. Efron, B. and G. Gong.  1983.  A leisurely look at the bootstrap, the jackknife, and cross-validation.  The American Statistician 37: 36-48.


Other Readable Literature


Miller, R.G.  1974.  The jackknife - a review.  Biometrika 61: 1-15.

Peters, S.C. and D.A. Freedman. 1984. Some notes on the Bootstrap in regression problems.  Journal of Business and Economic Statistics 2: 406-409.

Efron, B.  1979.  Bootstrap Methods: another look at the jackknife.  Annals of Statistics 7: 1-26.


Other Not So Readable Literature


Arvesen, J.N.  1969.  Jackknifing U-Statistics.  Annals of Mathematical Statistics 40: 2076-2100.

Miller, R.G.  1964.  A trustworthy jackknife.  Annals of Mathematical Statistics 35: 1594-1605.

Miller, R.G.  1968.  Jackknifing variances.  Annals of Mathematical Statistics 39: 567-582.

Quenouille, M.H.  1956.  Notes on bias in estimation.  Biometrika 43: 353-360.


Some Applications


Zahl, S.  1977.  Jackknifing an index of diversity.  Ecology 58: 907-913.

Heltshe, J.F. and N.E. Forrester. 1985.  Statistical evaluation of the jackknife estimate of diversity when using quadrat samples.  Ecology 66: 107-111.

Routledge, R.D. 1980. Bias in estimating the diversity of large uncensused communities.  Ecology 61: 276-281.