**Resampling Statistics**

**Rationale**

Much of modern statistics is anchored in the use of statistics and hypothesis tests whose desirable and well-known properties hold only when they are computed from samples drawn from normally distributed populations. While many such statistics and hypothesis tests are claimed to be generally robust to non-normality, other approaches that rely on an empirical investigation of the underlying population distribution, or of the distribution of the statistic, are possible and in some instances preferable.
When the distribution of a statistic, conceivably a very complicated statistic, is unknown, no normal theory approach is available and alternative approaches are required.

**General Overview**

Resampling statistics refers to the use of the observed data, or of a data-generating mechanism (such as a die), to produce new hypothetical samples (resamples) that mimic the underlying population, the results of which can then be analyzed. Resampling methods have numerous cross-disciplinary applications, especially in the sub-disciplines of the life sciences, and are widely used because they remain available when parametric approaches are difficult to employ or otherwise do not apply.

Resampled data are derived by using a randomizing mechanism to simulate many pseudo-trials. These approaches were difficult to use before the 1980s because they require many repetitions. With computers, the trials can be simulated in a few minutes, which is why these methods have become widely used.
The methods discussed here are used to make statistical inferences about the underlying population. The most practical use of resampling methods is to derive confidence intervals and test hypotheses. This is accomplished by drawing simulated samples from the data themselves (resamples) or from a reference distribution based on the data; afterwards, you can observe how the statistic of interest behaves across these resamples.
Resampling approaches can be used as a substitute for traditional (formulaic) statistical approaches, or when a traditional approach is difficult to apply.
These methods are widely used because of their ease of use. They generally require few mathematical formulas and only a small amount of mathematical (algebraic) knowledge. They are easy to understand and avoid the risk of choosing an incorrect formula for your analysis.

**Ecology and Evolution Publicized Applications**

This is a list of applications of resampling methods compiled by Phillip Crowley (1992).

· Analysis of null models, competition, and community structure

· Detecting density dependence

· Characterizing spatial patterns and processes

· Estimating population size and vital rates

· Environmental modeling

· Evolutionary processes and rates

· Phylogeny analysis

Ranking the methods by how often they appeared in the publications he searched, Crowley found this relationship:

Monte Carlo > Bootstrap > Permutation & Jackknife

Overall, the use of resampling methods had been increasing significantly over the prior decade.

**Types of Resampling Methods**

I. **Monte Carlo Simulation –** This method derives data from a mechanism (such as a proportion) that models the process you wish to understand (the population). This produces new samples of simulated data, which can be examined as possible results. After many repetitions, Monte Carlo tests produce exact p-values that can be interpreted as an error rate; increasing the number of repetitions sharpens the critical region.

II. **Randomization (Permutation) Test –** This is a type of statistical significance test in which a reference distribution is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. Like the bootstrap and Monte Carlo approaches, permutation methods derive p-values from the data themselves; when every rearrangement is enumerated, those p-values are exact. These tests are the oldest, simplest, and most common form of resampling tests and are suitable whenever the null hypothesis makes all permutations of the observed data equally likely. In this method, data are reassigned randomly without replacement. The test statistics are usually based on Student's t and Fisher's F tests. Most non-parametric tests are based on permutations of rank orderings of the data. This method has become practical because of computers; without them, it may be impossible to derive all the possible permutations. This method should be employed when the underlying distribution is unknown.

III. **Bootstrapping –** This approach is based on the fact that all we know about the underlying population is what we derived from our samples. Now the most widely used resampling method, it estimates the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter. Like all Monte Carlo based methods, this approach can be used to define confidence intervals and in hypothesis testing. This method is useful for sidestepping problems with non-normality or when the distribution parameters are unknown. It can also be used to calculate an appropriate sample size for experimental design.

IV. **Jackknife –** This method is used in statistical inference to estimate the bias and standard error of a statistic when a random sample of observations is used to calculate it. It provides a systematic method of resampling with a modest amount of calculation, and it offers an “improved” estimate of the sample parameter with less sampling bias. The basic idea behind the jackknife estimator lies in systematically recomputing the statistic, leaving out one observation at a time from the sample set. From these recomputed values, the bias and variance of the statistic can be estimated.
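The leave-one-out idea described for the jackknife can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `jackknife`, the illustrative data, and the choice of the plug-in (divide-by-n) variance as the statistic are all assumptions made for the example.

```python
import statistics

def jackknife(data, stat):
    """Return (bias-corrected estimate, estimated bias, standard error)."""
    n = len(data)
    theta_hat = stat(data)
    # Recompute the statistic n times, leaving out one observation each time.
    leave_one_out = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    theta_dot = sum(leave_one_out) / n
    bias = (n - 1) * (theta_dot - theta_hat)
    corrected = theta_hat - bias
    se = ((n - 1) / n * sum((t - theta_dot) ** 2 for t in leave_one_out)) ** 0.5
    return corrected, bias, se

# Illustrative data (an assumption for this sketch).
sample = [94, 38, 23, 197, 99, 16, 141]

# The plug-in variance divides by n and is biased downward.
corrected, bias, se = jackknife(sample, statistics.pvariance)
print(corrected, bias, se)
```

For this particular statistic, the bias-corrected jackknife estimate coincides with the usual divide-by-(n - 1) sample variance, which is a convenient way to check the code.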

**The Flow of Information for resampling methods**

In these methods, it
is necessary to specify the universe to sample from (random numbers, an
observed data set, true or false, etc.), specify the sampling procedure (number
of samples, sizes of samples, sampling with or without replacement), and
specify the statistic you wish to keep track of. The flow of information is as
follows:

1. Input data

2. Resample from the input data

3. Calculate the desired statistic

4. Record the statistic

5. Return to step 2 until (X) resamples have been completed, then continue to step 6

6. Calculate the p-value by counting the number of resamples that fall in the desired extreme domain and dividing by the total number of resamples

7. Present/print results
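The seven steps above map onto a short generic loop. The sketch below is one possible arrangement in Python; the function name `resample_p_value` and its parameters (`universe`, `draw`, `statistic`, `is_extreme`) are hypothetical names chosen for illustration, and the coin-tossing usage is an invented example rather than one from the text.

```python
import random

def resample_p_value(universe, draw, statistic, is_extreme,
                     n_resamples=1000, seed=0):
    """Generic resampling loop: returns (p-value, recorded statistics)."""
    rng = random.Random(seed)
    recorded = []
    for _ in range(n_resamples):                  # step 5: repeat X times
        resample = draw(universe, rng)            # step 2: resample
        recorded.append(statistic(resample))      # steps 3-4: compute, record
    extreme = sum(1 for value in recorded if is_extreme(value))
    return extreme / n_resamples, recorded        # step 6: proportion extreme

# Usage: estimated chance of 8 or more heads in 10 tosses of a fair coin.
p, values = resample_p_value(
    universe=[0, 1],                              # step 1: the universe
    draw=lambda u, rng: [rng.choice(u) for _ in range(10)],
    statistic=sum,
    is_extreme=lambda heads: heads >= 8,
    n_resamples=5000,
)
print(p)                                          # step 7: present results
```

Passing the sampling procedure and the statistic in as functions keeps the three choices the text asks for (universe, sampling procedure, statistic) separate from the loop itself.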

**Resampling Computer Programs**

To use these methods effectively, you should have a good program and a fast computer to handle the repetitions. Phillip I. Good has suggested the following programs, with the first four being recommended:

I. **R –** a programming language that is easy to manipulate. The program is free, and ready-made scripts are available throughout the Internet. However, be aware that you are on your own.

II. **C++ –** like R, this is a programming language with great potential for those entering statistics with a strong programming background. Also like R, it is do-it-yourself, so you are once again on your own.

III. **Resampling Stats –** easy to use, this programming language is very similar to BASIC. It has all the resampling functions already incorporated and is also available as a Microsoft Excel add-in. It is cheap and easy to follow but can eventually become limiting for intensive use of these methods.

IV. **S Plus –** based on the S language, as R is, this program has many built-in functions and pull-down menus, which make it easy to use. The program’s designers offer much support, but the package is expensive.

V. **SAS –** commonly used in statistical analysis, this package is C based. It is pricey and time consuming to debug.

**Monte Carlo Simulation**

Similar to what was outlined above, the general procedure of Monte Carlo simulation is as follows:

A. Make a simulated sample population using an unbiased randomizing mechanism (cards, dice, or a computer program) that is based on the population whose behavior you wish to investigate.

B. Create a pseudo-sample to simulate a real-life sample of interest.

C. Repeat step B (X) number of times.

D. Calculate the probability of interest from the tabulation of outcomes of the resampling trials.

__Monte Carlo Example (derived from a Simon 1997 example)__

On any given day, there is a 70% chance that it will be sunny. On a sunny day, the Redskins win 65% of their games. What is the probability that a given day is sunny and the Redskins win? This is a very simple example, easily solved by calculating the joint probability of the two events, but it serves as a good example of the BASIC-like programming involved in resampling. The outline:

1. Put seven blue balls and three yellow balls into an urn labeled A (the Nice Day urn). Put 65 green balls and 35 red balls into an urn labeled B (the Win/Lose Conditional on Nice Day urn).

2. Draw one ball from urn A. If it is blue, continue; otherwise record ‘no’ and stop.

3. If you have drawn a blue ball from urn A, now draw a ball from urn B; if it is green, record ‘yes’; otherwise record ‘no’.

4. Repeat steps 2-3 1000 (or more) times.

5. Count the number of ‘yeses’.

6. Compute the proportion of ‘yeses’ in the 1000 trials.

The Resampling Stats program looks like this:

URN 7#1 3#0 weather *Create a 10-day sample, 7 nice days*

URN 35#0 65#1 winlose *Create a 100-record sample, 65 wins*

REPEAT 1000 *Repeat the following 1000 times*

SAMPLE 1 weather a *Sample 1 of the days*

IF a = 1 *If a good day, continue; otherwise skip the ‘if’ block*

SAMPLE 1 winlose b *Sample 1 win/lose record*

IF b = 1 *If a win, continue; otherwise skip*

SCORE b z *Tally good-day wins*

END *End ‘if’*

END *End ‘if’*

END *Go back to repeat; after 1000, end ‘repeat’*

COUNT z = 1 k *Count good-day wins*

DIVIDE k 1000 kk *Then divide by 1000*

PRINT kk *Show the result on screen*

Using the Resampling Stats® plug-in for Microsoft Excel, the printout might look like this:

.454 = proportion of trials with a sunny day and a win. The exact joint probability is .70 × .65 = .455.
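For readers without Resampling Stats, the same two-urn simulation can be sketched in Python. The function name `simulate_sunny_wins`, the seed, and the number of trials are assumptions made for this illustration; only the probabilities 0.70 and 0.65 come from the example above.

```python
import random

def simulate_sunny_wins(n_trials=1000, p_sunny=0.70,
                        p_win_given_sunny=0.65, seed=1):
    """Estimate the probability of 'sunny day and a win' by simulation."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n_trials):
        if rng.random() < p_sunny:                # draw from urn A
            if rng.random() < p_win_given_sunny:  # draw from urn B
                yes += 1                          # record 'yes'
    return yes / n_trials

estimate = simulate_sunny_wins(n_trials=100_000)
exact = 0.70 * 0.65  # joint probability, for comparison
print(estimate, exact)
```

With more trials the estimate settles closer to the exact joint probability, which is the sense in which additional repetitions sharpen the answer.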

**Resampling Approaches in Estimation and Hypothesis Testing**

**I. Hypothesis Testing**

**A. Normal Theory Approach**

For illustration, consider Student's *t*-test for a difference in means when the variances are unknown but are considered to be equal. The hypothesis of interest is that

H_{0}: *μ*_{1} = *μ*_{2}.

While several possible alternative hypotheses could be specified, for our purposes

H_{A}: *μ*_{1} < *μ*_{2}.

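Given two samples drawn from populations 1 and 2, assuming that these are normally distributed populations with equal variances, and that the samples were drawn independently and at random from each population, a statistic whose distribution is known can be constructed to test H_{0}:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \qquad (1)$$

where $\bar{x}_i$, $s_i^2$, and $n_i$ ($i = 1, 2$) are the respective sample means, variances and sample sizes.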
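As a concrete check on the pooled-variance *t* statistic described above, the calculation can be sketched in Python. The function name `pooled_t` and the toy data are assumptions for illustration only.

```python
from statistics import mean, variance

def pooled_t(sample1, sample2):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * variance(sample1)
                  + (n2 - 1) * variance(sample2)) / df
    t = (mean(sample1) - mean(sample2)) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
    return t, df

# Toy data (an assumption for this sketch).
t, df = pooled_t([1, 2, 3], [4, 5, 6])
print(t, df)
```

The same function can serve as the test statistic inside the randomization and bootstrap procedures discussed later, since those procedures only change how the reference distribution is obtained.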

When the conditions
stated above are strictly met and H_{0} is true, (1) is distributed as
Student's *t* with (*n*_{1} + *n*_{2} - 2) degrees of freedom. As _{}

The reason for making the assumptions specified above is to allow the investigator to make some statement about the likelihood of the computed *t*-value, and to make a decision as to whether to profess belief in either H_{0} or H_{A}. The percentiles of the *t* distribution with the computed degrees of freedom can be interpreted as the conditional probability of observing the computed *t* value or one larger (or smaller) given that H_{0} is true:

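$$P(T \ge t \mid H_0 \text{ true}) = \alpha$$

Therefore, we would probably wish to profess belief in H_{0} for suitably large values of *α* and disbelief for suitably small values. Embedded in this probability we must also include the distributional assumptions mentioned above:

$$P(T \ge t \mid H_0 \text{ true},\ \text{normality},\ \text{equal variances}) = \alpha$$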

If a specific
alternative hypothesis had been stated, for example

H_{A}: *μ*_{1} = *μ*_{2} - 5,

then under the assumption of normality and equal variances, the *t*-statistic could be recomputed given the new estimate of *μ*_{2} under the alternative hypothesis _{} The conditional probability of obtaining the observed difference in *t* values as computed under the null and alternative hypotheses (*t*_{0} - *t*_{a}), given specified *α*, the observed variances, and the particular alternative hypothesis could be computed:

_{}

This probability is
also conditioned on the assumption that both populations are normally
distributed _{}

Recall that of these two conditional probabilities *α* is the Type I error rate, the probability of rejecting the tested null hypothesis when true, and that *β* is the Type II error rate, the probability of failing to reject the tested null hypothesis when false.

I present this review
to emphasize that the estimation of each of these probabilities, which are
interpreted as error rates in the process of making a decision about nature, in
the course of interpreting a specific statistical test, is totally contingent
on assuming specific forms for the distribution of the underlying
populations. To know these error rates
exactly requires that all the conditions of these tests be met.

**B. A distribution-free approach**

One way to avoid these distributional assumptions has been the approach variously called non-parametric, rank-order, rank-like, and distribution-free statistics. A series of tests, many of which apply in situations analogous to normal theory statistics, have been elaborated (see ref 1, 2, 3 for expanded treatments of these procedures).

The key to the function of these statistics is that they are based on the ranks of the actual observations in a joint ranking and not on the observations themselves. For example, the Wilcoxon distribution-free rank sum test can be applied in place of the two-sample (separate groups) Student's *t* test. The observations from both samples combined are ranked from least to greatest and the sum of the ranks assigned to the observations from either sample is computed. If both samples are comprised of observations that are of similar magnitude, then the ranks assigned to sample 1 should be similar to the ranks assigned to sample 2. For fixed sample sizes a fixed number of ranks are possible; for *n*_{1} = 5 and *n*_{2} = 7, 12 ranks will be assigned. Under the null hypothesis that the location of each population on the number line is identical, the sum of the ranks assigned to either sample should equal the sum obtained from randomly assigning ranks to the observations in each sample. In this example there are $\binom{12}{5} = 792$ ways of assigning ranks to sample 1 and $\binom{7}{7}$ or 1 way to assign ranks to sample 2 after assigning the ranks to sample 1. The total number of possible arrangements of the ranks is then $792 \times 1 = 792$, and it is possible for each arrangement to compute the sum of the ranks. From this we can enumerate the distribution of the sum of the ranks, *W*. *W* (the rank sum of sample 1) in this example can range from 15 to 50. If we divide the number of arrangements that yield a particular rank sum *w* by the total number of arrangements, then we have the probability of observing a particular value of *w*:

$$P(W = w \mid H_0) = \frac{\#\{\text{arrangements with rank sum } w\}}{\binom{12}{5}}$$

To obtain the conditional probability that *W* ≥ *w* given H_{0}, we simply tabulate the cumulative probabilities.
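Because the enumeration for *n*_{1} = 5 and *n*_{2} = 7 is small, it can be carried out directly. The following Python sketch lists every way of assigning 5 of the 12 ranks to sample 1 and tabulates the rank sums; the function name `rank_sum_distribution` and the cutoff of 45 used for the tail probability are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def rank_sum_distribution(n1, n2):
    """Enumerate the null distribution of the rank sum W for sample 1."""
    ranks = range(1, n1 + n2 + 1)
    # Every choice of n1 ranks for sample 1 is equally likely under H0.
    counts = Counter(sum(chosen) for chosen in combinations(ranks, n1))
    total = sum(counts.values())
    return {w: counts[w] / total for w in sorted(counts)}, total

dist, total = rank_sum_distribution(5, 7)
print(total, min(dist), max(dist))

# Cumulative (upper-tail) probability, here for an illustrative cutoff of 45.
upper_tail = sum(p for w, p in dist.items() if w >= 45)
print(upper_tail)
```

The tail probability is obtained simply by summing the enumerated probabilities, with no assumption about the shape of the underlying populations.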

The only additional assumption embedded in this approach is that the observations are independent, but this is also an assumption of the normal theory approach. Notice we make no assumption about the forms of the underlying populations about which we wish to make inferences, and that the exact distribution of the test statistic, *W*, is known because it is enumerated. These distribution-free statistics are usually criticized for being less "efficient" than the analogous test based on assuming the populations to be normally distributed. It is true that when the underlying populations are normally distributed, the asymptotic relative efficiencies (ratio of sample sizes of one test to another necessary to have equal power relative to a broad class of alternative hypotheses for fixed *α*) of distribution-free tests are generally lower than their normal theory analogs, but usually not markedly so. In instances where the underlying populations are non-normal, the distribution-free tests can be infinitely more efficient than their normal theory counterparts. In general, this means that distribution-free tests will have higher Type II error rates (*β*) than normal theory tests when the normal theory assumptions are met. Type I error rates will not be affected. However, if the underlying populations are not normally distributed, then normal theory tests can lead to underestimation of both Type I and Type II error rates.

**C. Randomization**

So far we have used
two approaches to estimating error rates in hypothesis testing that either
require the assumption of a particular form of the distribution of the
underlying population, or that require the investigator to be able to **enumerate** the distribution of the test
statistic when the null hypothesis is true and under specific alternative
hypotheses. What can be done when we
neither wish to assume normality nor can we enumerate the distribution of the
test statistic?

Recall the analogy I
used when describing how to generate the expected sum of ranks assigned to a
particular sample under the null hypothesis of identical population locations
on a number line. The analogy was to a process
of randomly assigning ranks to observations independent of one's knowledge of
which sample an observation is a member.
A randomization test makes use of such a procedure, but does so by
operating on the observations rather than the joint ranking of the
observations. For this reason, the
distribution of an analogous statistic (the sum of the observations in one
sample) cannot be easily tabulated, although it is theoretically possible to
enumerate such a distribution. From one
instance to the next the observations may be of substantially different
magnitude so a single tabulation of the probabilities of observing a specific
sum of observations could not be made, a different tabulation would be required
for each application of the test. A further problem arises if the sample sizes are large. In the example mentioned previously there are only $\binom{12}{5} = 792$ possible arrangements of values, so the exact distribution of the sum of observations in one sample could conceivably have been enumerated. Had our sample sizes been 10 and 15, then over 3.2 million arrangements ($\binom{25}{10} = 3{,}268{,}760$) would have been possible. If you have had any experience in combinatorial enumeration, then you know that this approach rapidly becomes computationally impractical. With high-speed computers it is certainly possible to tally 3.2 million sums, but developing an efficient algorithm to be sure that each and every arrangement has been included, and included only once, is prohibitive.

What then? **Sample.** When the universe of possible arrangements is too large to enumerate, why not sample arrangements from this universe independently and at random? The distribution of the test statistic over this series of samples can then be tabulated, its mean and variance computed, and the error rate associated with a hypothesis test estimated.
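Sampling arrangements at random, rather than enumerating them, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `randomization_test` is hypothetical, the test statistic is the difference in sample means, and the data are simulated stand-ins (the standard deviation of 10 is an arbitrary choice), not the values from Table 1.

```python
import random
from statistics import mean

def randomization_test(sample1, sample2, n_iter=1000, seed=0):
    """Sample random re-partitions of the pooled data (without replacement)."""
    rng = random.Random(seed)
    pooled = list(sample1) + list(sample2)
    n1 = len(sample1)
    observed = mean(sample1) - mean(sample2)
    diffs = []
    for _ in range(n_iter):
        rng.shuffle(pooled)  # one random arrangement of all observations
        diffs.append(mean(pooled[:n1]) - mean(pooled[n1:]))
    # Proportion of arrangements at least as extreme as the observed one.
    p_upper = sum(d >= observed for d in diffs) / n_iter
    return observed, diffs, p_upper

# Simulated stand-in data: sizes 10 and 15, means 200 and 190 (sd assumed).
gen = random.Random(42)
x1 = [gen.gauss(200, 10) for _ in range(10)]
x2 = [gen.gauss(190, 10) for _ in range(15)]
observed, diffs, p_upper = randomization_test(x1, x2)
print(observed, p_upper)
```

Each shuffle may repeat an arrangement that was already drawn; with millions of possible arrangements and only 1000 draws, that rarely matters in practice.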

Table 1 contains samples of *n*_{1} = 10 and *n*_{2} = 15, obtained by sampling from 2 populations with *μ*_{1} = 200, _{} and *μ*_{2} = 190, _{}, respectively. A normal theory *t*-test applied to these data yields *t* = 3.3216, df = 23, 0.0005 < p < 0.005. The same data examined by the distribution-free Wilcoxon rank sum test yields W* = 2.7735, 0.0026 < p < 0.0030. Applying this randomization approach with 1000 iterations of sampling without replacement, first 10 and then 15 observations, and computing the *t* statistic for each of these samples yields the distribution depicted in Figure 1. The normal theory *t* distribution is depicted as a smooth curve. According to the randomization procedure, the probability of observing a *t* value greater than or equal to that actually observed (3.3216) is 0.005 < p < 0.006. Remember that this sampling procedure, unlike an enumeration, allows each possible arrangement of values to be sampled more than once. The probability that on any iteration a particular arrangement will be chosen is in this instance $1/\binom{25}{10} = 1/3{,}268{,}760$, or approximately 3.06 x 10^{-7}. After 1000 such randomizations it is quite possible that some arrangements have been sampled more than once, but there is no reason to believe that particular arrangements yielding either low or high *t*-values should be systematically included or excluded from the 1000 randomizations.

This approach is
obviously an empirical approach to learning something about the distribution of
a test statistic under specified conditions. This Monte Carlo sample procedure
would have to be performed anew for each new set of observations. One aspect that may be an advantage of this
approach over normal theory approaches is that any ad hoc test statistic can be
elaborated since a direct empirical investigation of its distributional
properties accompanies each test. For example, we could just use the difference in the sample means, $\bar{x}_1 - \bar{x}_2$, as one test statistic. Figure 2 shows the distribution of $\bar{x}_1 - \bar{x}_2$ over the same 1000 randomizations. The actual difference between sample means is 8.9251 and under the randomization approach the probability of observing a difference this large or larger is 0.004 < p < 0.005. The computed probability of observing the *t* or $\bar{x}_1 - \bar{x}_2$ actually observed compares favorably with the normal theory estimates.

Figures 3 and 4 illustrate the same procedure applied to two populations whose underlying distributions are exponential. Table 2 presents the sample data generated from two populations with _{} and _{}, respectively. A normal theory test on these data yields *t* = -2.0874, df = 23, 0.01 < p < 0.025, while the randomization approach yielded a probability of obtaining the observed value of *t* or one greater of 0.017 < p < 0.018, and a probability of obtaining the observed or a greater difference in means of p > 0.05. Here we see the normal theory test breaking down and convergence in the results obtained from the distribution-free and randomization tests.

Is all of this
kosher? We can see the parallel
development of the distribution-free and the randomization tests, yet is the
randomization test actually yielding a meaningful result? The answer is a resounding well-maybe-er-I-don't-know. The
randomization procedure essentially asks the question, given observed samples *n*_{1} and *n*_{2}, if we assume that these samples came from the same
underlying population whose distribution F is given by the *n*_{1} + *n*_{2}
sampled values, with probability mass 1/(*n*_{1}
+*n*_{2}) for each observation,
what is the chance of partitioning the observations into groups of the size
observed that have means that differ by an amount as large as that
observed? Is this the best empirical
estimate of the distribution of a test statistic?

**D. The Bootstrap**

The Bootstrap is
another empirical approach to understanding the distributional properties of a test
statistic, but is also useful as a means of estimating statistics and their
standard errors. The bootstrap is very
similar to the randomization procedure outlined above. The observed distribution of sample values is
used as an estimate of the underlying probability distribution of the
population F. Then, the distribution of
a statistic for fixed sample sizes is obtained by repeatedly sampling from the
distribution F, with each value receiving probability mass 1/(*n*_{1} + *n*_{2}), but sampling values with replacement, so that
instead of individual partitions of the data having the potential to occur more
than once, the individual values themselves may appear repeatedly in a single
sample. Under this resampling algorithm
the number of possible sample arrangements is much greater than for the
randomization approach. For example with
a total sample size *m* = 12, with
component samples of size 7 and 5, 12^{7} x 12^{5} = 8.9161004
x 10^{12} arrangements are possible.
For *m* = 25, and *n*_{1} = 10, *n*_{2} = 15, 25^{10 }x 25^{15} = 8.8817842 x
10^{34} arrangements are possible, factors of 10^{10} and 10^{28}
more arrangements, respectively, compared to the randomization approach. Any test statistic averaged across a series
of say 1000 samples under this algorithm will have a larger standard error since
sub-samples of F can deviate from F more than under the randomization
algorithm. Figures 5 and 6 illustrate the distribution of *t* values and $\bar{x}_1 - \bar{x}_2$ for 1000 bootstrap samples of the empirical probability distribution presented in Table 1. For the normal populations the bootstrap estimates the probability of the observed *t* or one greater to be 0.002 < p < 0.003 which, surprisingly, is somewhat less than the randomization approach. This comparison is reversed when examining the differences between means. The bootstrap estimates the probability of the observed mean difference or one greater as 0.022 < p < 0.023, which is an order of magnitude greater than that estimated by the randomization approach, or for that matter for the *t*-statistic from the same group of bootstrap samples.
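The only mechanical difference between this bootstrap and the randomization procedure is that values are drawn from the pooled empirical distribution with replacement. A minimal Python sketch, in which the function name `pooled_bootstrap_test` and the toy data are assumptions made for illustration:

```python
import math
import random
from statistics import mean

def pooled_bootstrap_test(sample1, sample2, n_iter=1000, seed=0):
    """Draw both groups with replacement from the pooled observations."""
    rng = random.Random(seed)
    pooled = list(sample1) + list(sample2)
    n1, n2 = len(sample1), len(sample2)
    observed = mean(sample1) - mean(sample2)
    diffs = []
    for _ in range(n_iter):
        boot1 = rng.choices(pooled, k=n1)  # values may repeat within a sample
        boot2 = rng.choices(pooled, k=n2)
        diffs.append(mean(boot1) - mean(boot2))
    p_upper = sum(d >= observed for d in diffs) / n_iter
    return observed, p_upper

# Arrangement counts quoted in the text for sample sizes 7 and 5.
bootstrap_arrangements = 12 ** 7 * 12 ** 5
randomization_arrangements = math.comb(12, 5)
print(bootstrap_arrangements, randomization_arrangements)

# Toy data (an assumption for this sketch).
observed, p_upper = pooled_bootstrap_test([10, 11, 12, 13], [1, 2, 3, 4])
print(observed, p_upper)
```

Because a bootstrap sample need not contain each observation exactly once, its statistics are free to vary more than those of a randomization sample, which is the point made above about larger standard errors.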

Figures 7 and 8
provide similar data for the samples derived from exponentially distributed
populations presented in Table 2. The
bootstrap is more conservative than either the normal theory approach or the randomization
approach when examining the *t* value
obtained for the exponential populations.
This is the result I would generally expect in a comparison of the
bootstrap and randomization.

Which approach is best? While the randomization approach can be seen to be analogous to the enumeration of distributions that characterizes distribution-free statistics, it is unrealistic in that the distribution of a test statistic across a series of randomized samples is restricted to sub-samples that contain exactly the same observations as the true samples, once each. In some
instances this may be the appropriate procedure, but in general randomization
may give unrealistically small standard errors for test statistics, so that the
true Type I error rates will be greater than nominally stated and Type II error
rates also will be greater than nominally stated. However, in all the examples presented above
the empirical randomization and bootstrap approaches compare favorably with the
normal theory approach.

__More examples of Randomization and Bootstrap methods (Simon, 1997):__

Simon wrote *“Resampling: The New Statistics”*, an example-based book on Monte Carlo, permutation (randomization), and bootstrap methods that is available for free on the Resampling Stats website. I found that the following examples demonstrate the effectiveness of these methods.

**Bootstrap example in creating a confidence interval:**

Of 135 men aged 34-44 with high cholesterol, 10 developed myocardial infarction. How much confidence should we have that, if we were to take a much larger sample than was actually obtained, the sample mean (10/135 = .07) would be in the close vicinity of the actual mean? The general setup might look like this:

1. Construct a sample containing 135 representative balls: 10 red representing infarction and 125 green representing no infarction.

2. Mix, choose a ball, record its color, replace it, and repeat 135 times (to simulate a sample of 135 men).

3. Record the number of red balls among the 135 balls drawn.

4. Repeat steps 2-3 perhaps 1000 times, and observe how much the total number of reds varies from sample to sample.

How this looks in the BASIC-like Resampling Stats language:

URN 10#1 125#0 men *Create a sample of 135 men, 10 of whom have myocardial infarction*

REPEAT 1000

SAMPLE 135 men a *Resample 135 times with replacement*

COUNT a = 1 b *Count how many myocardial infarctions occurred*

DIVIDE b 135 c *Divide by 135 to get a sample mean*

SCORE c z *Keep track of every resampled sample mean*

END

HISTOGRAM z *Plot the sample means in a histogram*

PERCENTILE z (2.5 97.5) k *Calculate the 2.5^{th} and 97.5^{th} percentiles*

PRINT k *Print histogram and results*
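An equivalent of the program above can be sketched in Python. The function name `bootstrap_proportion_ci`, the seed, and the simple index-based percentile rule are assumptions for this illustration; other percentile conventions would give slightly different endpoints.

```python
import random

def bootstrap_proportion_ci(successes, n, n_resamples=1000, seed=0):
    """Percentile bootstrap interval (2.5th to 97.5th) for a proportion."""
    rng = random.Random(seed)
    urn = [1] * successes + [0] * (n - successes)  # 1 = infarction
    # Draw n balls with replacement, n_resamples times; keep each proportion.
    props = sorted(sum(rng.choices(urn, k=n)) / n for _ in range(n_resamples))
    lower = props[int(0.025 * n_resamples)]
    upper = props[int(0.975 * n_resamples) - 1]
    return lower, upper

lower, upper = bootstrap_proportion_ci(10, 135)
print(lower, upper)
```

The width of the interval shows directly how much the proportion of infarctions would be expected to vary between samples of 135 men.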

Using Resampling
Stats® plug-in for Microsoft Excel, this is what the print out might look like:

**Randomization in hypothesis testing:**

The prices of whisky in 16 monopoly states (where the state owns the liquor stores) and 26 private-ownership states are as follows:

16 monopoly states: $4.65, $4.55,
$4.11, $4.15, $4.20, $4.55, $3.80,

$4.00, $4.19, $4.75, $4.74, $4.50,
$4.10, $4.00, $5.05, $4.20.

Mean = $4.35

26 private-ownership states: $4.82,
$5.29, $4.89, $4.95, $4.55, $4.90,

$5.25, $5.30, $4.29, $4.85, $4.54,
$4.75, $4.85, $4.85, $4.50, $4.75,

$4.79, $4.85, $4.79, $4.95, $4.95,
$4.75, $5.20, $5.10, $4.80, $4.29.

Mean = $4.84

There is a difference in the price of whisky between monopoly and private-ownership states, with a mean difference of $.49. How likely is it that randomly choosing groups of 16 and 26 states would produce a difference of $.49 or more? To answer this question, the Resampling Stats program would look like this:

NUMBERS (4.82 5.29 4.89 4.95 4.55 4.90 5.25 5.30 4.29 4.85 4.54 4.75 4.85 4.85 4.50 4.75 4.79 4.85 4.79 4.95 4.95 4.75 5.20 5.10 4.80 4.29) priv *Create number array priv, the observed private-ownership sample*

NUMBERS (4.65 4.55 4.11 4.15 4.20 4.55 3.80 4.00 4.19 4.75 4.74 4.50 4.10 4.00 5.05 4.20) mono *Create number array mono, the observed state-owned sample*

CONCAT priv mono all *Concatenate the two samples*

REPEAT 1000

SHUFFLE 26 all priv$ *Allocate 26 of the concatenated values at random w/o replacement to the private resample*

SHUFFLE 16 all mono$ *Allocate 16 of the concatenated values at random w/o replacement to the monopoly resample*

MEAN priv$ p *Calculate the mean of the private resample*

MEAN mono$ g *Calculate the mean of the monopoly resample*

SUBTRACT p g diff *Subtract the monopoly mean from the private mean*

SCORE diff z *Score the difference for each resample*

END *Repeat 1000x then stop*

COUNT z >= .49 k *Count how many times a difference of .49 or greater occurred; store in k*

COUNT z <= -.49 j *Count how many times a difference of -.49 or less occurred; store in j*

ADD k j m *Add k and j together; store in m*

DIVIDE m 1000 n *Calculate a p-value by dividing m by 1000*

PRINT n *Print result*

Using the Resampling Stats® plug-in for Microsoft Excel, the printout might look like this:

The estimated probability of a difference greater than or equal to .49 is zero. A difference of .49 or more never occurred in 1000 resamples of the data.
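The whisky comparison can also be sketched in Python using the prices listed above. The function name `permutation_p_value` and the seed are assumptions; note that this sketch compares against the unrounded observed difference rather than the rounded value of .49, and it splits a single shuffle into the two groups so that every state lands in exactly one resample.

```python
import random
from statistics import mean

priv = [4.82, 5.29, 4.89, 4.95, 4.55, 4.90, 5.25, 5.30, 4.29, 4.85, 4.54,
        4.75, 4.85, 4.85, 4.50, 4.75, 4.79, 4.85, 4.79, 4.95, 4.95, 4.75,
        5.20, 5.10, 4.80, 4.29]
mono = [4.65, 4.55, 4.11, 4.15, 4.20, 4.55, 3.80, 4.00, 4.19, 4.75, 4.74,
        4.50, 4.10, 4.00, 5.05, 4.20]

def permutation_p_value(a, b, n_iter=1000, seed=0):
    """Two-sided randomization test on the difference in means."""
    rng = random.Random(seed)
    pooled = a + b                      # new list; inputs are not modified
    observed = mean(a) - mean(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)             # reassign states without replacement
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if abs(diff) >= abs(observed):  # as extreme in either direction
            hits += 1
    return observed, hits / n_iter

observed, p_value = permutation_p_value(priv, mono)
print(round(observed, 4), p_value)
```

A p-value of zero from 1000 shuffles should be read as "less than about 1 in 1000" rather than as literally impossible.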

**Bootstrap approach in Hypothesis Testing, a difference in treatment**

If you believe there is a difference between treatments, act as though they come from different populations, so that the difference is not due to chance. This is done by resampling with replacement from each treatment separately. The example is as follows: mice that were treated with a certain antibiotic lived 94, 38, 23, 197, 99, 16 and 141 days after their surgery, with a mean survival period of 86.8 days. The untreated controls lived 52, 10, 40, 104, 51, 27, 146, 30, and 46 days, with a mean survival of 56.2 days. The difference between the two means is 30.6 days. H_{A}: There is a difference in mean survival between treated and untreated mice.

The Resampling Stats program to test this hypothesis might look like this:

NUMBERS (94 38 23 197 99 16 141) treatmt *Treatment sample*

NUMBERS (52 10 40 104 51 27 146 30 46) control *Control sample*

REPEAT 1000

SAMPLE 7 treatmt treatmt$ *Sample 7 treatments with replacement*

SAMPLE 9 control control$ *Sample 9 controls with replacement*

MEAN treatmt$ tmean *Mean of treatments*

MEAN control$ cmean *Mean of controls*

SUBTRACT tmean cmean diff *Subtract the control mean from the treatment mean*

SCORE diff scrboard *Score the difference*

END *Repeat 1000x*

HISTOGRAM scrboard *Draw a histogram*

PERCENTILE scrboard (2.5 97.5) interval *Get the 95% confidence interval*

PRINT interval *Print the 95% confidence interval*

Using the Resampling Stats® plug-in for Microsoft Excel, this is what the printout might look like:

The difference of 30.6 days does not indicate a significant difference in survival after surgery between treated and untreated mice.
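For readers without Resampling Stats, the same bootstrap can be sketched in Python using only the standard library. The survival times are those listed in the program above. The function names, the fixed seed, and the simple nearest-rank percentile rule are assumptions made for this illustration, so the interval may differ slightly from what PERCENTILE reports.

```python
import random
import statistics

treatment = [94, 38, 23, 197, 99, 16, 141]        # treated mice, days survived
control = [52, 10, 40, 104, 51, 27, 146, 30, 46]  # untreated mice, days survived


def bootstrap_mean_differences(a, b, n_resamples=1000, seed=0):
    """Resample each group with replacement and score the difference in means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        a_star = rng.choices(a, k=len(a))  # like SAMPLE 7 treatmt treatmt$
        b_star = rng.choices(b, k=len(b))  # like SAMPLE 9 control control$
        diffs.append(statistics.fmean(a_star) - statistics.fmean(b_star))
    return diffs


def percentile_interval(values, lower=2.5, upper=97.5):
    """Nearest-rank percentile interval (a simplification of PERCENTILE)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[round(last * lower / 100)], ordered[round(last * upper / 100)]


scoreboard = bootstrap_mean_differences(treatment, control)
observed = statistics.fmean(treatment) - statistics.fmean(control)
low, high = percentile_interval(scoreboard)
print(f"observed difference: {observed:.1f} days")
print(f"95% percentile interval: ({low:.1f}, {high:.1f})")
```

If the printed interval contains zero, the resamples do not rule out a zero difference in mean survival between the two groups.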

**II. Estimation**

**A. The Jackknife**

In the course of applying each of the empirical techniques in the construction of hypothesis tests we could also have estimated test statistics and a suite of characteristics of the test statistics and the empirical distributions. In the test of means we obviously could estimate the means, the variances (or standard errors), the medians (and their standard errors), etc. We could also estimate the bias associated with each of these estimators. Define an estimator $\hat{\theta}$ of the parameter $\theta$; if

$$E(\hat{\theta}) = \theta + c$$

then the bias of the estimator is *c*. The sample mean, $\bar{x}$, is an unbiased estimator of $\mu$ because $E(\bar{x}) = \mu$; even though different samples may give different estimates $\bar{x}$ of $\mu$, they are all unbiased estimates. In general, however, most estimators are biased, and the bias can be depicted as a Taylor series expansion of the estimator. So the bias of $\hat{\theta}$ is

$$E(\hat{\theta}) - \theta = \frac{a_1}{n} + \frac{a_2}{n^2} + \frac{a_3}{n^3} + \cdots$$

If we define

$$\tilde{\theta} = n\hat{\theta} - (n-1)\hat{\theta}_{(\cdot)}$$

to be a new estimator of $\theta$, where $\hat{\theta}_{(\cdot)}$ is the average of the estimates computed from *n* - 1 observations, then the bias of $\tilde{\theta}$ is

$$E(\tilde{\theta}) - \theta = -\frac{a_2}{n(n-1)} + O(n^{-3})$$

which is less than the bias of $\hat{\theta}$ since it eliminates the term of order 1/*n*. In practice the estimator $\tilde{\theta}$ is computed as

$$\tilde{\theta} = \frac{1}{n}\sum_{i=1}^{n}\left[n\hat{\theta} - (n-1)\hat{\theta}_{(i)}\right]$$

where $\hat{\theta}_{(i)}$ is the estimate computed with the *i*th observation deleted and *i* = 1, ..., *n*. This is the first order jackknife estimator. It is useful in that it is a less biased estimator, although somewhat more variable than the un-jackknifed estimator, but this increased variability is at maximum

_{}.
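A small worked case may make the first order jackknife concrete. The sketch below, in Python with only the standard library, averages the pseudo-values *n* × (full-sample estimate) - (*n* - 1) × (leave-one-out estimate). It is applied to the variance estimator that divides by *n*, whose bias is exactly of order 1/*n*, so the jackknife removes it entirely and recovers the usual variance that divides by *n* - 1. The function names are hypothetical, and the data are simply the control survival times from the mouse example, reused for illustration.

```python
import statistics


def plug_in_variance(sample):
    """Biased variance estimator that divides by n instead of n - 1."""
    mean = statistics.fmean(sample)
    return sum((x - mean) ** 2 for x in sample) / len(sample)


def jackknife(sample, estimator):
    """Return the jackknifed estimate, the estimated bias, and the pseudo-values."""
    n = len(sample)
    theta_hat = estimator(sample)
    pseudo_values = []
    for i in range(n):
        leave_one_out = sample[:i] + sample[i + 1:]  # delete the i-th observation
        pseudo_values.append(n * theta_hat - (n - 1) * estimator(leave_one_out))
    theta_tilde = statistics.fmean(pseudo_values)
    bias_estimate = theta_hat - theta_tilde
    return theta_tilde, bias_estimate, pseudo_values


control = [52, 10, 40, 104, 51, 27, 146, 30, 46]
theta_tilde, bias_estimate, _ = jackknife(control, plug_in_variance)
print(f"plug-in variance:      {plug_in_variance(control):.2f}")
print(f"jackknifed variance:   {theta_tilde:.2f}")
print(f"unbiased (n - 1) form: {statistics.variance(control):.2f}")
```

The last two printed values agree, which is the bias reduction at work. For most estimators the correction is only approximate, because terms of order 1/*n*² and smaller remain.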

Since the standard error of an estimator decreases as $n \to \infty$ by a factor of $1/\sqrt{n}$, the jackknifed estimator $\tilde{\theta}$ has dispersion greater by a factor of 1/*n* than the original estimator $\hat{\theta}$, but usually only *n*^{-3/2} greater than $\hat{\theta}$. Therefore the reduction in bias achieved by using $\tilde{\theta}$ is not offset by a similar increase in the magnitude of the estimator's variance. If we depict the
bootstrap estimator as_{} then the jackknife estimator of the standard error of _{} is

_{}

where *F ^{B}* is
the empirical bootstrap probability distribution of the random variables and

_{}

Where _{} is a quadratic
approximation of the estimator _{} on the distribution *F ^{B }*and

In general then the bootstrap will provide estimators with less bias and variance than the jackknife. Table 3 shows a data set generated by sampling from two normally distributed populations with $\mu_1 = 200$, _{}, and $\mu_2 = 200$ and _{}. To test the hypothesis that the variances of these populations are equal, that is

$$H_0: \sigma_1^2 = \sigma_2^2$$

versus the alternative that

_{},

we could use the normal theory approach, which is again conditioned on the assumptions mentioned earlier, and elaborate the test statistic based on the sample estimates of $\sigma_1^2$ and $\sigma_2^2$,

$$F = \frac{s_1^2}{s_2^2},$$

which is *F* distributed with *n*_{1} - 1 numerator degrees of freedom and *n*_{2} - 1 denominator degrees of freedom. Alternatively we could use a jackknife or a bootstrap estimate of the same or a similar test statistic. The *F* statistic computed under normal theory assumptions is *F*_{9,14} = 8.286, p < 0.001, while the bootstrap estimate of the probability of obtaining the observed *F* or one greater is 0.017 < p < 0.018. The jackknife test is performed on the natural logs of the jackknifed variances rather than the variances themselves. A full description of the computations is given in reference (3). The test statistic for the jackknife test on variances is

_{},

where _{} and _{} are the averages of the natural logs of the variances across the *n*_{1} and *n*_{2} jackknifed estimates and *V*_{1} and *V*_{2} are the variances of the jackknifed estimates of the variances. For large samples (*n*_{1}+*n*_{2} > 10), *Q* is N(0,1), but for small equal size samples it follows Student's *t* distribution with *n*_{1}+*n*_{2} - 2 degrees of freedom. For this example *Q* = -1.8883, 0.0294 < p < 0.03. Figure 9 presents the distribution of the bootstrap estimates of *F*, and Table 4 presents the jackknifed pseudo-values, their standard errors, and bias. Table 5 and Figure 10 provide a similar test for two exponential populations. Under the assumption of normality *F*_{9,14} = 2.783, 0.025 < p < 0.05. The jackknife test, however, yields *Q* = 3.095, 0.0009 < p < 0.001, and the bootstrap yields p > 0.05. In these instances the jackknife is the most powerful test.
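The jackknife test on variances can be sketched in Python as follows. The form of *Q* used here is an assumption: the difference between the mean jackknife pseudo-values of the log variance in each sample, divided by the square root of *V*_{1} + *V*_{2}, with each *V* taken as the variance of the pseudo-values divided by the sample size. The data are also hypothetical, since Table 3 is not reproduced here: two simulated normal samples of sizes 10 and 15 (matching the 9 and 14 degrees of freedom quoted above) with made-up standard deviations of 30 and 10.

```python
import math
import random
import statistics


def log_variance_pseudo_values(sample):
    """Jackknife pseudo-values of ln(sample variance)."""
    n = len(sample)
    full = math.log(statistics.variance(sample))
    return [
        n * full - (n - 1) * math.log(statistics.variance(sample[:i] + sample[i + 1:]))
        for i in range(n)
    ]


def jackknife_q(sample_1, sample_2):
    """Assumed form of Q: difference in mean pseudo-values over sqrt(V1 + V2)."""
    a = log_variance_pseudo_values(sample_1)
    b = log_variance_pseudo_values(sample_2)
    v1 = statistics.variance(a) / len(a)  # variance of the mean pseudo-value
    v2 = statistics.variance(b) / len(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(v1 + v2)


# Hypothetical data standing in for Table 3 (not the original values).
rng = random.Random(1)
sample_1 = [rng.gauss(200, 30) for _ in range(10)]
sample_2 = [rng.gauss(200, 10) for _ in range(15)]

q = jackknife_q(sample_1, sample_2)
p_upper = 0.5 * math.erfc(q / math.sqrt(2))  # P(Z >= q) under N(0,1)
print(f"Q = {q:.3f}, upper-tail p = {p_upper:.4f}")
```

Working on the log scale makes *Q* unchanged when both samples are multiplied by the same constant, which is one reason the test is carried out on the logs of the variances rather than on the variances themselves.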

**III. Recommendations for Use**

In 1992, Philip Crowley gave recommendations for these methods:

1. Use a large number of repetitions: a large number will smooth out the distribution of the data. Use 1000 to start to get a rough idea and use 20,000 for the final analysis.

2. In the absence of random sampling, two or more samples with equivalent distributions should be tested by randomization.

3. Don't use the jackknife approach in confidence intervals and hypothesis testing: randomization and bootstrap approaches are superior.

4. With small sample sizes, be skeptical of parametric analysis.
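One way to see why the first recommendation matters is to look at the sampling error of a resampled p-value. The Python sketch below assumes that each repetition is an independent trial that either is or is not as extreme as the observed statistic, so the estimated p-value has the usual binomial standard error. The function name and the example p-value of 0.05 are chosen only for illustration.

```python
import math


def resampling_p_standard_error(p, repetitions):
    """Binomial standard error of a p-value estimated from `repetitions` resamples."""
    return math.sqrt(p * (1.0 - p) / repetitions)


for repetitions in (1000, 20000):
    se = resampling_p_standard_error(0.05, repetitions)
    print(f"{repetitions:>6} repetitions: standard error about {se:.4f}")
```

Under this assumption, going from 1000 to 20,000 repetitions shrinks the standard error by a factor of about 4.5 (the square root of 20), which matters most when the p-value sits close to the chosen significance level.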

**IV. Prospectus**

Thus far we have
presented in a non-rigorous fashion a number of computationally expensive,
empirical approaches to estimation and hypothesis testing. The theory
underlying some of these approaches is well developed and I refer you to the
reference list for that material.
However, much of what I have presented has no rigorous theoretical
underpinnings, but can be shown to be quite useful particularly in situations
where the assumption of normality is suspect.
The prognosis among statisticians is that theory will catch up to our
computational prowess, so that many of these procedures will be justified and
should be adopted. In the interim,
however, should you choose to employ one of the more radical of these
procedures be prepared for considerable disagreement over its validity and
usefulness. The prospects for further development of these kinds of procedures, and for work to establish their limitations, advantages, and care and maintenance, are considerable. At present, however, the burden of investigating the properties of one of these procedures, in its application to a particular situation and test statistic, rests with the investigator.

**References**

Bradley, J.V. 1968. Distribution-free statistical tests. Prentice-Hall, Inc: Englewood Cliffs, N.J.

Conover, W.J. 1980. Practical Nonparametric statistics. John Wiley and Sons: New York.

Crowley, P. 1992. Resampling Methods for Computation-Intensive Data Analysis in Ecology and Evolution. Annu. Rev. Ecol. Syst. 23:405-447.

Efron, B. and G. Gong.

Good, P.I. 2005. Resampling Methods: 3rd Edition. Birkhäuser.

Hollander, M. and D.A. Wolfe. 1973. Nonparametric Statistical Methods. John Wiley and Sons: New York.

Simon, J.L. 1997. Resampling: the New Statistics. Resampling Stats. Available free online.

**Other Readable Literature**

Miller, R.G. 1974. The jackknife - a review. Biometrika 61: 1-15.

Peters, S.C. and D.A. Freedman. 1984. Some notes on the
Bootstrap in regression problems.
Journal of Business and Economic Statistics 2: 406-409.

Efron, B. 1979. Bootstrap Methods: another look at the
jackknife. Annals of Statistics 7: 1-26.

**Other Not So Readable Literature**

Arvesen, J.N. 1969. Jackknifing U-Statistics. Annals of Mathematical Statistics 40:
2076-2100.

Miller, R.G. 1964. A trustworthy jackknife. Annals of Mathematical Statistics 35: 1594-1605.

Miller, R.G. 1968. Jackknifing variances. Annals of Mathematical Statistics 39: 567-582.

Quenouille, M.H. 1956. Notes on bias in estimation. Biometrika 43:
353-360.

**Some Applications**

Zahl, S. 1977. Jackknifing an index of diversity. Ecology 58: 907-913.

Heltshe, J.F. and N.E. Forrester. 1985. Statistical evaluation of the jackknife
estimate of diversity when using quadrat
samples. Ecology 66: 107-111.

Routledge, R.D. 1980. Bias in estimating the
diversity of large uncensused communities. Ecology 61: 276-281.

Originally by Dr. Edward Connor, modified by Eugenel Espiritu, June 1,
2008.