An Overview of Experimental Design
I. Hypothesis Testing
While much research in biology consists of data collection for descriptive purposes, there is a burgeoning trend toward collecting information with the hope of answering particular questions, or toward recasting information collected for descriptive purposes in light of particular hypotheses. This is due to a number of causes, among them the growing body of descriptive information on all aspects of natural history and the inappropriateness of existing data sets for answering specific questions. It may also represent an increase in scientists' awareness of the logical structure of the scientific method. Whatever the cause, the effect is that biologists are asking questions and designing their research efforts to answer them.
Hence our concern with asking answerable questions and with developing procedures upon which to base probabilistic inferences regarding the answers to these questions. However, before we discuss particular procedures and their application, some discussion of the form of the questions we ask, and of the possible outcomes of our attempts to answer a particular question, is warranted.
A. Null, tested, and alternative hypotheses
When presented as a statement rather than a question, the question of interest in a particular research program is called the "tested hypothesis." It may take either of two general forms: "Factor A is responsible for phenomenon B," or "Factor A is not responsible for phenomenon B." The latter form, which negates any relationship between Factor A and phenomenon B, is also called a "null hypothesis." Either of these hypotheses can serve as the tested hypothesis or as the corresponding alternative hypothesis. The alternative hypothesis is that set of hypotheses implied by the rejection of the tested hypothesis. Once again, a null hypothesis is simply an hypothesis of "no effect" or "no relationship" between some factor of interest and some observable phenomenon.
B. Intimacy, advocacy, and impartiality in hypothesis testing
Given a specified tested and alternative hypothesis, with what goal in mind should evidence be gathered to "test" the hypothesis? Should our "test" be an attempt to find evidence consistent with our tested hypothesis or to find evidence inconsistent with it? Do we wish to prove or disprove our tested hypothesis? As a linguistic convenience and as a widespread practice, many scientists strive to prove particular hypotheses, or at least say they do. However, for both logical and practical reasons it may be desirable to strive to falsify the tested hypothesis.
First, the practical reasons. If the researcher attempts to prove the tested
hypothesis by gathering evidence in its support, they run the risk of ignoring
evidence that might controvert the tested hypothesis. Furthermore, if the tested hypothesis is not a null
hypothesis, the process of attempting to prove the tested hypothesis amounts to
advocating the tested hypothesis. This may lead the researcher to develop
"pet theories" and seek to support them rather than exposing these
theories to critical scrutiny. Such intimacy between researcher and hypothesis
can impair the researcher's ability to cast aside favorite, yet untenable
hypotheses.
Logically speaking, Hume and Popper illustrate the asymmetry that exists between proof and disproof. The weight of evidence found consistent with a particular hypothesis is no match for a single instance inconsistent with that hypothesis. That single instance is sufficient to falsify the hypothesis, whereas no amount of evidence consistent with a hypothesis is sufficient to prove it. So, to ensure an impartial, detached evaluation of competing hypotheses and to assess their relative merits efficiently and rigorously, one should attempt to gather evidence capable of falsifying the tested hypothesis, not evidence designed to prove it.
C. The statistical hypothesis, error, and power
A statistical hypothesis is a statement about a statistical population that, on the basis of information obtained from the observed data, one seeks to refute. A statistical test is a set of rules whereby a decision about the hypothesis is reached. Associated with the decision rules is some indication of the accuracy of the decisions reached by following the rules. The measure of accuracy is a statement about the probability of making the correct decision when certain conditions are true in the population to which the hypothesis applies. The accuracy of the decisions based upon information supplied by an experiment depends to a great extent upon the design of the experiment. The decision rules are set up by the experimenter and depend upon what the experimenter considers to be the critical bounds for arriving at the wrong decision. However, the statistical hypothesis does not become false when the test statistic exceeds the critical bounds, nor is it true when the statistic does not exceed the bounds. The decision rules are merely guides in summarizing the results of the statistical test; following the guides enables the experimenter to attach probability statements to their decisions. The probability statements associated with the decision rules of the statistical test are predictions as to what may be expected to be the case if the experiment were repeated many times.
The logic of a statistical hypothesis test is as follows:
1. One assumes the tested hypothesis to be true.
2. One examines the consequences of this assumption in terms of a sampling distribution that depends upon the truth of this hypothesis.
3. If, as determined from the sampling distribution, the observed data have relatively high probability of occurring, the decision is made that the data do not contradict the hypothesis.
4. If the probability of an observed data set is low when the tested hypothesis is true, the decision is made that the data contradict the tested hypothesis.
5. Again, the tested hypothesis is often stated in such a way that when the data contradict it, the experimenter has demonstrated the presence of some experimental effect. The experimenter has been able to nullify the tested hypothesis, in favor of the alternative hypothesis that some effect is detectable. Voila, a null tested hypothesis.
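The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the germination data, the choice of an exact binomial test, the threshold of 0.05, and the function names `exact_binomial_p_value` and `decide` are all hypothetical and do not come from the notes.

```python
from math import comb


def exact_binomial_p_value(successes: int, trials: int, p0: float = 0.5) -> float:
    """Two-sided exact p-value for H0: P(success) = p0.

    Steps 1-2: assume H0 is true and build the sampling distribution of the
    number of successes. Then sum the probability of every outcome that is
    no more probable than the observed one.
    """
    pmf = [comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
           for k in range(trials + 1)]
    observed = pmf[successes]
    # A small relative tolerance guards against floating-point ties.
    return sum(p for p in pmf if p <= observed * (1 + 1e-9))


def decide(p_value: float, alpha: float = 0.05) -> str:
    """Steps 3-4: compare the probability of the data with a chosen threshold."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Hypothetical data: 16 of 20 seeds germinate; H0 says the chance is 0.5.
p = exact_binomial_p_value(16, 20)
print(f"p-value = {p:.4f}: {decide(p)}")
```

Note that the decision is only a summary of how improbable the data are under the tested hypothesis; it does not make the hypothesis true or false.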
The level of significance, $\alpha$, defines the probability level that is too low to warrant support of the tested hypothesis. It is one of the decision rules. If the probability of occurrence of the observed data (when the tested hypothesis is true) is smaller than the level of significance, then the data contradict the hypothesis being tested, and the decision is made to reject the tested hypothesis. This rejection is equivalent to supporting one of the possible alternative hypotheses that are not contradicted by the data. If the tested hypothesis is symbolized by H0, and the set of alternative hypotheses that remain tenable when H0 is rejected is symbolized by Ha, then the decision rules in a statistical test can be specified with respect to rejection or non-rejection of H0:
1. The rejection of H0 may be regarded as
the acceptance of Ha.
2. The non-rejection of H0 may be regarded as
a rejection of Ha.
If the decision rule rejects H0 when it is in fact true, the rule has led to an erroneous decision. The probability of making such an error is at most equal to $\alpha$, the level of significance. This kind of error is known as a Type I error: rejecting the tested hypothesis when it is true.
If the decision rules do not reject H0 when it is in fact false, they also lead to an erroneous decision. This kind of error is known as a Type II error: failing to reject the tested hypothesis when it is false. The potential magnitude of a Type II error depends in part upon the level of significance of the test, and in part upon which of the alternative hypotheses the data actually support. Associated with each possible alternative hypothesis is a different probability of a Type II error.
Type I errors can only occur if the decision is made to reject H0, and Type II errors may occur when the decision is made not to reject H0.
The probability of making a Type I error is under the direct control of the experimenter, since the experimenter sets the level of significance, $\alpha$. However, Type II error is controlled indirectly, primarily through the design of the experiment. If possible, the tested hypothesis is stated in such a way that the more costly error is the Type I error, since its magnitude can be directly controlled by the experimenter. This is why the tested hypothesis is often stated as a null hypothesis: rejection of the tested null hypothesis when it is true amounts to finding an experimental effect when none exists. Such an error could have a great impact on a research program since it will most likely lead the experimenter to consider the question answered. Better to be conservative and fail to find an experimental result. The experimenter would then be forced to repeat the experiment, possibly with modifications, or to perform other experiments to test the same hypothesis. Err on the side of innocence.
Nevertheless, it is best to try to minimize both sources of error. However, Type I and Type II errors are not independent. For a given experiment, the smaller the probability of a Type I error, $\alpha$, the larger numerically the potential Type II error can be.
The relationship between Type I and Type II errors can best be represented graphically. The rejection region for H0 is defined relative to the sampling distribution of the statistic of interest when H0 is true (red line). The blue line represents the sampling distribution of the same statistic when a particular alternative hypothesis, Ha, is true. $\beta$, the probability of a Type II error, is that area under the blue curve that lies within the region of non-rejection of the sampling distribution of the statistic when H0 is true. If $\alpha$ is smaller, the area of the blue curve that falls within the region of non-rejection of the sampling distribution when H0 is true is larger; hence $\beta$ is larger when $\alpha$ is smaller.

The power of a test is equal to $1 - \beta$. The power is the area under the sampling distribution of the statistic when Ha is true (the blue curve) that falls in the region of rejection of the sampling distribution when H0 is true. Since $\beta$ is the probability of failing to reject the tested hypothesis when false, $1 - \beta$, the power, is the probability of rejecting the tested hypothesis when false.
1 = P(rejecting when false) + P(failing to reject when false)
1 = Power + $\beta$
Power = $1 - \beta$
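The trade-off between the two error probabilities can be made concrete with a small calculation. The sketch below assumes a one-sided (upper-tail) test in which the statistic is normally distributed with the same standard error under H0 and under one specific Ha; the numerical values and the function name `type_ii_error` are hypothetical choices for illustration only.

```python
from statistics import NormalDist


def type_ii_error(alpha: float, mu0: float, mu_a: float, std_error: float) -> float:
    """Probability of a Type II error for an upper-tail test of H0: mu = mu0
    against one specific alternative Ha: mu = mu_a (with mu_a > mu0)."""
    null = NormalDist(mu0, std_error)   # sampling distribution when H0 is true
    alt = NormalDist(mu_a, std_error)   # sampling distribution when Ha is true
    critical = null.inv_cdf(1 - alpha)  # boundary of the rejection region
    # Area under the Ha curve that falls in the region of non-rejection.
    return alt.cdf(critical)


for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error(alpha, mu0=0.0, mu_a=2.0, std_error=1.0)
    print(f"alpha = {alpha:.2f}  beta = {beta:.3f}  power = {1 - beta:.3f}")
```

Under these assumptions, each reduction in the level of significance moves the critical value to the right and enlarges the Type II error probability for the same alternative.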
Power is the probability that the decision rule rejects H0 when a specified Ha is true. The closer Ha is to H0 (the greater the overlap in the corresponding sampling distributions), the lower the power of the test with respect to that particular alternative. A well-designed experiment will both be conservative (have low $\alpha$) and have high power, $1 - \beta$, with respect to all alternative hypotheses that are in a practical sense different from H0. For an H0 of $\mu_1 = \mu_2$, an Ha of $\mu_1 = \mu_2 + 0.001$ may not be, for all practical purposes, a different hypothesis from H0. Hence power with respect to this alternative is of no practical consequence.
The most common means to minimize the probability of a Type II error and to increase the power of a test relative to all possible reasonable alternative hypotheses, for fixed $\alpha$, is to increase the sample size or replication in the experiment. This is because the dispersion of the sampling distribution of a statistic decreases in proportion to $1/\sqrt{N}$, where N is the sample size. Hence, the overlap in the sampling distributions of the tested and alternative hypotheses decreases as N increases.
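As a rough illustration of how sample size buys power, the sketch below uses the textbook approximation for a one-sided z test with known population standard deviation. The effect size `delta`, the value of `sigma`, the target power of 0.80, and both function names are assumptions made for the example, not quantities taken from these notes.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power_one_sided(n: int, delta: float, sigma: float, alpha: float = 0.05) -> float:
    """Power of an upper-tail z test to detect a true difference `delta`
    when the standard error of the statistic is sigma / sqrt(n)."""
    std_error = sigma / sqrt(n)
    return 1 - Z.cdf(Z.inv_cdf(1 - alpha) - delta / std_error)


def required_n(delta: float, sigma: float, alpha: float = 0.05,
               power: float = 0.80) -> int:
    """Smallest n reaching the target power under the same assumptions."""
    z_sum = Z.inv_cdf(1 - alpha) + Z.inv_cdf(power)
    return ceil((z_sum * sigma / delta) ** 2)


# Hypothetical values: detect a difference of 1 unit when sigma is 2 units.
for n in (5, 10, 25, 50):
    print(f"N = {n:3d}  power = {power_one_sided(n, delta=1.0, sigma=2.0):.3f}")
print("N needed for power 0.80:", required_n(delta=1.0, sigma=2.0))
```

The same calculation run in reverse shows why very small differences from H0 are expensive to detect: the required N grows with the square of sigma / delta.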
While an inordinate emphasis has been placed on the
level of significance of a statistical test, the power of a test is usually
ignored. This is partially due to the reluctance of experimenters to present
results in which a tested null hypothesis could not be rejected, and to a lack
of information as to what constitutes a reasonable alternative hypothesis.
While it is easy for the experimenter to control Type I error, and the experimenter may wish to perform a conservative test (minimizing the probability of a Type I error), such tests may suffer from a complete lack of power to discriminate between the tested hypothesis and reasonable alternative hypotheses. If this is true, then the best solution may be to allow higher Type I error in order to increase the power of the test versus fixed alternatives. In such instances $\alpha$ values of 0.1, 0.2, or even 0.3 may be reasonable.
However, given a properly designed experiment in which the power of the test has been investigated for specified alternatives and specified $\alpha$ values, any desired power can be obtained simply by providing an adequate sample size, even though this may be very expensive. Given an initial preliminary survey, the sample size necessary to discriminate a particular alternative hypothesis can be estimated. Green (1979, p. 43) outlines such a procedure for a 2 x 2 factorial experiment in which a test of the interaction between time and treatment was the statistical hypothesis of interest.
D. The statistical hypothesis test and its relationship to ruling hypotheses or theories
In practice our hypothesis tests usually involve a specific set of observations regarding the effects of a particular factor (say, soil moisture) on a particular response variable (say, crop yield). If our tested hypothesis is in the form of a null hypothesis, H0: Soil moisture content has no effect on crop yield, a logical alternative hypothesis might be Ha: Soil moisture content affects crop yield. Two outcomes are possible from our test: we can either reject or fail to reject the tested hypothesis. If we reject our null hypothesis, and our observations were derived from controlled experiments so that only soil moisture was allowed to vary among replicate fields, then we might safely conclude that soil moisture does affect crop yield, or at least that our measures of soil moisture and crop yield would suggest so. However, if we fail to reject our null hypothesis we cannot conclude that soil moisture has no effect on crop yield, but only that from the data at hand one is not compelled to posit that it does. Possibly a better-controlled or better-designed experiment would have succeeded in falsifying the tested null hypothesis. How do these experimental outcomes relate to the general hypothesis that soil moisture affects plant growth and yield? In the instance where we rejected the null hypothesis we have demonstrated that this appears to be true for one crop, given our experimental design. In the instance of failing to reject the null hypothesis we are back to the drawing board to design a more critical experiment.
In either event, a single experiment is insufficient to lead us to believe that the truth content of the alternative or tested hypothesis is high. Repeated attempts to reject such an hypothesis, or repeated failures to reject it, with more and more critical experiments are necessary to establish its verisimilitude. It is seldom that a single experiment will have a major impact on how scientists in a discipline view a ruling theory.
II. A Typology of Evidence
A. Non-experimental research
1. Data-dredging
Data dredging, as described by Selvin and Stuart (1966), occurs in the process of examining data sets that were often collected for other purposes. If an hypothesis and an hypothesis test are stated prior to the examination of the data, and the results of the test are reported regardless of the outcome, then data dredging can be useful and result in a considerable saving of effort. Why collect more data to test an hypothesis if an adequate set already exists?
However, if a specific quantitative hypothesis is not
stated in advance, but rather emerges during the data analysis, perhaps along
with a novel "test" variable, then the strength of the test is
compromised, since a mechanistic explanation for the result must also be
developed a posteriori. In the initial stages of study on a new topic
this may be a useful process to help develop new hypotheses and to formalize
critical hypothesis tests. But, when the
topic has been the subject of much study, a specific a priori hypothesis should be available for testing.
Care must also be taken when engaging in three other types of data dredging: "snooping", "fishing", and "hunting."
Snooping is testing a large set of hypotheses. The problem arises because some tests are expected to be significant by chance alone, and because the hypotheses may not be independent. These problems also occur with experimenter-generated data sets.
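The "significant by chance alone" problem is easy to quantify in the simplest case. The sketch below assumes that every tested null hypothesis is true and that the tests are independent, which is precisely the assumption the text warns may fail; the level of 0.05, the counts of tests, and the function name `familywise_error_rate` are illustrative choices.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one Type I error among `n_tests` independent
    tests, each carried out at level `alpha`, when every H0 is true."""
    return 1 - (1 - alpha) ** n_tests


for m in (1, 5, 20, 100):
    rate = familywise_error_rate(0.05, m)
    print(f"{m:3d} tests: P(at least one false rejection) = {rate:.3f}, "
          f"expected false rejections = {0.05 * m:.2f}")
```

When the hypotheses are not independent the exact figure changes, but the qualitative point stands: a long list of tests will usually contain some "significant" results even if no effect exists.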
Fishing is choosing test variables based on an examination of the data rather than because of their importance to an a priori hypothesis. Also, by relegating variables to two classes, those chosen and those discarded, the interpretation of the results is clouded. In the absence of a specific a priori hypothesis, why not report the tests for all variables?
Hunting is the process of searching through many data sets to
find some relationships worth testing.
We never know how many data sets were found not to display the desired
relationship since negative results are seldom reported.
2. Uncontrolled Observations
By uncontrolled observations I mean observations on a
test variable under experimental conditions which cannot be compared to a set
of observations obtained on the same test variable in the absence of the
experimental conditions. Uncontrolled observations are often, but not always, experimenter-generated. They arise
either because of poor experimental design, or because of the nature of the
problem under study. An example of this
latter problem can be seen in studies of trends in global atmospheric
chemistry. These studies postulate recent anthropogenic changes in atmospheric
chemistry based on observations on present conditions and knowledge of the
increase in anthropogenic inputs. The only available control observations are
theoretical or budgetary predictions removing the anthropogenic inputs, or a
reconstructed fragmentary historical and prehistoric record. While these might be the only sorts of
"control" observations possible in these circumstances, the absence
of good estimates of experimental and control parameters and their variance
makes rigorous hypothesis testing difficult. Similar problems plague explanations
of the impact of introduced plants and animals on native populations. Usually
no information is available on population trends in the native biota prior to
the introduction of the exotic species.
The only cure for uncontrolled observations is to generate controlled observations. This can be accomplished by repeating an experiment with proper controls, if possible, or by generating theoretical expectations of the control observations by either deterministic models or by
B. Experimental Evidence
Experimental evidence is data collected by the experimenter for the express purpose of answering a particular question or testing a particular hypothesis. I do not mean to suggest that all experimental evidence is created equal as a basis for causal inferences. In fact, I can discern several kinds of experiments that, in the order I will present them, represent an increasing degree of intervention on the part of the experimenter into the workings of nature and, I believe, an increasing ability to intimately connect cause and effect.
1. Controlled observations
Controlled observations are collected by design to test
a particular hypothesis. The design
includes samples under the experimental conditions of interest and under
putative control conditions (lacking the experimental treatment). However, the observations are derived from a
sampling program that involves nature only passively. The only activity of the experimenter is to
make the observations, analyze the data, and interpret the result. For example, I conducted a sampling program
to determine if the amount of folivory on White Oak trees is related to the
timing of leafing and leaf development in the spring. Much of the folivory that
occurs on woody plants, in general, occurs during spring when leaves are young
and supple. If the amount of damage received by Oak trees is determined by the age of the foliage relative to the emergence time of leaf-feeding insects, then trees that foliate either sufficiently prior to insect emergence and feeding, or after insect emergence, may receive less damage. To test this hypothesis, I sampled groups of trees similar in size, but differing in the timing of foliation. Three classes of foliation were established (early, mid, and late), which, through inter-comparison, also serve as controls for one another (mid and late serve as controls for early, etc.). In this "experiment", I have intervened only to record my observations on folivory and foliation. I have neither altered the leafing time of the trees to observe the subsequent damage received, nor have I manipulated the herbivores to create or destroy a pattern of synchronous emergence and foliation. The key to distinguishing controlled observations from a more elaborate experiment is the passive role of both nature and the experimenter.
Inasmuch as the experimental observations have good control observations, such sampling programs can be a reasonable basis for causal inferences. However, the danger exists that the control observations are not true controls, since the experimenter may not be able to ensure that subjects used for experimental observations differ only in the experimental treatment from the subjects used in the control observations. In the above example it is conceivable that some aspect of leaf chemistry (nutritional quality, the concentration of volatile compounds that may be used in host location and stimulate feeding, or the concentration of chemicals that deter feeding) may co-vary with time of foliation. These chemical characteristics may be the proximal causal agents responsible for any observed relationship between foliation and folivory. Foliation time may either affect these characteristics or simply co-vary with them. In either instance, further experiments would be necessary to determine the actual causal mechanism. Therefore, although such inferences are more convincing than those based on uncontrolled observations, it is still difficult to base firm causal inferences purely on controlled observations.
2. Mensurative Experiments
The next step in experimental interventions I call mensurative experiments (sensu Hurlbert 1984). They involve the experimenter and a part of nature a bit more actively in the hypothesis test, but only to passively measure another part of nature. A common example of a mensurative experimental technique in ecology is the use of litter bags to examine the rate of litter decomposition in aquatic environments or on the forest floor. The experimenter packages a bit of nature in these litter bags (which have mesh that allows colonization by the bacteria and invertebrates that, along with chemical weathering, are responsible for litter decomposition) and exposes the bags under different experimental conditions or times to determine if significant variation in decomposition rates is detectable between the experimental conditions of interest. In this case, the experimental intervention into the workings of nature is solely to create a replicable sampling device with which to passively measure a natural process.
As with controlled
observations, mensurative experiments suffer because we cannot uniquely
associate a causal mechanism with the variety of experimental situations into
which we have placed our mensurative device. Were we to find different rates of
litter decomposition in temperate and tropical forests, to what would we
attribute these differences? The list of reasonable causes is lengthy. Once
again further experiments are necessary to establish the specific causal
mechanisms involved.
3. Manipulative Experiments
In a manipulative experiment, the experimenter may exercise total control over a portion of
nature to create all the desired experimental and control conditions. To repeat
the folivory experiment mentioned above as a manipulative experiment would
involve direct modification of the foliation time and exposure of all trees to
identical herbivore populations, possibly
via massive rearing and release of leaf-eating insects. The litter
decomposition experiment would involve a series of experiments in which several
factors such as temperature, humidity,
bacterial populations, fungal
populations, and invertebrate
populations are controlled or allowed to vary singly or in combination to test
for simple effects and interactions of factors in determining decomposition
rates. Obviously, it is often easiest
to perform manipulative experiments in laboratories or experimental enclosures
where there is some hope of success in actually controlling the multitude of
environmental and biological variables.
If it is in any way possible to perform a manipulative
experiment, it is preferred over the
previously mentioned kinds of experimental exercises. However, it is extremely difficult to perform
these kinds of experiments under field conditions. Manipulative experiments do,
however, have a greater ability to associate cause and effect since if properly
designed and executed they remove the possibility of co-varying potential
causal factors.
Manipulative experiments, even where logistically feasible, are not without their problems. The most important of these is the danger of introducing some experimental artifact via some aspect of the manipulation. This problem also besets mensurative experiments, but to a lesser degree since the experimental intervention in nature is less drastic.
III. Allocation of sampling effort
A. What is a sampling program to do?
A statistical population is the collection of all elements about which one seeks information, or about which one desires to make some inference. It is incumbent upon the experimenter to state a priori the population of elements about which they wish to make some statement. It is crucial that this population be defined in advance of designing a sampling program or experiment. The reason for this will be apparent shortly.
Usually only a small portion, or a sample, of this population can actually be observed. It is from data on the elements of the population that are members of the sample that conclusions or
inferences are drawn about the characteristics of the entire statistical population. Quantities computed from sample data are
commonly termed statistics while those characterizing populations are
known as parameters. Sample statistics then serve two roles:
1) to describe the data obtained in the sample
2) to estimate or test hypotheses about characteristics of the population
Had we enumerated the values of a particular
characteristic for all elements in a population, and then tabulated the
frequencies with which the elements of the population take on different values,
the resulting tabulation would be the population distribution of the character
of interest. The population distribution can be described by this sort of
enumeration or by a series of parameters. The number of parameters necessary to
describe a particular distribution depends on its form, but is in general a
more parsimonious approach than enumeration. For example, the Poisson
distribution can be specified by one parameter and the normal distribution by
two parameters.
If we are interested in estimating a population mean, $\mu$, the sample mean, $\bar{x}$, generally provides a good estimate. Similarly, the sample standard deviation, s, provides a good estimate of the population standard deviation, $\sigma$. The precision of these estimates depends on four factors:
1) the size of the sample
2) the manner of sampling
3) the characteristics of the underlying population
4) the principle used in estimating the parameter
If a sample is drawn such that:
1) all elements of the population have an equal chance of being drawn at all times, and
2) all possible samples of size n have an equal (or fixed and determinable) chance of being drawn,
then the sample is a random sample of size n from the underlying population. Of course, to meet these conditions the population must be defined in advance; otherwise a sampling program that ensures that all elements have an equal chance of being drawn cannot be designed.
Random samples ensure that all elements in a population
are at equal risk of being sampled and that the probability of sampling any
individual element from the population is independent of which other elements
may or may not be sampled.
Suppose 10,000 samples, each of n elements, are drawn from a population. Sample means, $\bar{x}$, and variances, $s^2$, could be computed from each sample. The tabulation of the frequencies with which our sample statistics take on different values is the sampling distribution of the statistic. In this instance, we have determined the distribution empirically. The form of these distributions depends in part upon the sampling method. As with population distributions, sampling distributions can be described more economically with parameters than by enumeration. Frequently the parameters of the sampling distribution of a statistic are related to the parameters of the underlying population. The mean or average value of the sampling distribution of a statistic is its expected value, and its standard deviation is the standard error of the statistic. The form of the sampling distribution as well as the magnitude of its parameters depend on:
1) the form of the population distribution,
2) the manner of sampling, and
3) the size of the sample.
If, for example, the underlying population is normally distributed and the samples are random samples, then if one draws a large number of samples, the sampling distribution of the sample mean, $\bar{x}$, will be approximately normal with mean $\mu$ and standard error $\sigma/\sqrt{n}$. This same consequence is derivable mathematically based on the properties of random samples. This is the importance of random samples: they permit the estimation of sampling distributions from purely mathematical considerations without necessitating the laborious kinds of enumerations I have mentioned. The key aspect of random sampling that allows this is that random samples ensure all elements of the population are at equal risk of being sampled and that the probability that any single element is sampled is independent of which other elements are sampled. Statistics obtained from samples drawn by other sampling plans that are not random have sampling distributions that are either unknown or can only be approximated with unknown precision. Good approximations to sampling distributions are required if one is to evaluate the precision of the inferences made from sample data.
Of course, the Central Limit Theorem generalizes this result for populations that are non-normal. The sampling distribution of a statistic such as the sample mean derived from a non-normal population is also approximately normal, and the approximation improves as the sample size increases. For the sample mean, $\bar{x}$, the expected value is still $\mu$, and the standard error is still $\sigma/\sqrt{n}$.
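The laborious enumeration described earlier is easy to carry out by simulation, which gives a quick check on the mathematical result. The sketch below assumes a deliberately non-normal population (exponential, with mean 1 and standard deviation 1), samples of size 30, and a fixed random seed; all of these, and the function name `simulate_sample_means`, are hypothetical choices for illustration.

```python
import random
from math import sqrt
from statistics import fmean, stdev


def simulate_sample_means(n: int, n_samples: int = 10_000,
                          rate: float = 1.0, seed: int = 1) -> list:
    """Draw `n_samples` random samples of size `n` from an exponential
    population and return the mean of each sample."""
    rng = random.Random(seed)
    return [fmean([rng.expovariate(rate) for _ in range(n)])
            for _ in range(n_samples)]


n = 30
means = simulate_sample_means(n)
# For an exponential population with rate 1, mu = 1 and sigma = 1.
print(f"mean of the sample means    = {fmean(means):.3f}  (mu = 1)")
print(f"std. dev. of sample means   = {stdev(means):.3f}")
print(f"sigma / sqrt(n) for n = {n}  = {1 / sqrt(n):.3f}")
```

The standard deviation of the 10,000 simulated means is the empirical standard error, and it should sit close to the value that random sampling theory gives without any enumeration at all.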
In the context of hypothesis testing, the role of sampling
is to enable the experimenter to
discover something about the sampling distribution of
the statistic of
interest, under the experimental conditions of interest, based on the underlying population of interest. This is because the statistical hypothesis test is
based upon the sample estimates of the parameters of
the sampling distribution of the statistic, not upon the sample estimates of
the underlying population parameters. The close relationship between the parameters of
the sampling distribution of
a statistic and the population parameters
estimated by the sample statistic tends to obscure this fact. The
statistical hypothesis test involves a comparison of sampling distributions.
Upon this comparison inferences about population distributions and their
parameters can be made.
B. Bias, precision and random sampling
So we sample to
estimate population parameters and to learn through knowledge of the sampling
distribution of our statistics just how good our estimates are. Two criteria are commonly used to judge the
accuracy of an estimate: bias and variance.
A statistic is an unbiased estimate of a parameter if
the expected value of the sampling distribution of the statistic is equal to
the parameter of which it is an estimate. Bias, therefore, is a property of the sampling distribution, not of a single
statistic. This implies that in the long
run the mean of an unbiased statistic computed from a large number of samples of equal
size will be equal to the parameter. In addition to ensuring
that the elements included in a sample are independent, random sampling also helps to prevent biasing estimates of
population parameters. If all elements
in a population were not at an equal risk of being sampled it is easy to see
that values systematically above or below the true population value may be
represented disproportionately in the sample.
The precision of an estimator is measured by the
standard error of its sampling distribution. The smaller the
standard error the greater the precision. The standard error is only a good measure of
precision if the sampling distribution is asymptotically normal. If this is
true, then the best unbiased estimator is the one with the smallest standard
error. This is called a minimum variance unbiased estimator. Increasing sample
size will also increase the precision of an estimator.
C. The preliminary survey
Once the statistical population of interest has been
defined, the attributes to be examined are selected, and the experimental
conditions decided upon, the
experimenter is left with the task of deciding where to invest sampling effort.
First and foremost, this is dictated
by the questions of interest. Assuming limited resources, there is no reason to
expend extra effort to test ancillary hypotheses that are not of pressing interest. It is easy to
compromise all the hypothesis tests you wish to perform by
attempting to design an all-encompassing sampling plan that allows you to test several hypotheses, but none with any power. You just cannot answer all the important questions in
biology in one M.S. or Ph.D. thesis. Believe me, I tried. State the questions you wish to answer and rank them in
importance. If the cost of sample collection or processing, in
time or money, or the inherent variation in the attributes of the populations you wish to study is high, then pare down the number of
questions so that at least some can be answered with adequate confidence and power.
Second, try out your sampling gear to assess its accuracy and to estimate the cost per sample. In
nature you can rest assured that appearances will be
deceiving and that field work always costs more in time and money than anticipated. If you are using some sort of sampling gear that you cannot normally
observe during its operation, try to observe its behavior at least once. If there is any subjective component to sample selection or to any other aspect of the collection, sorting, or enumeration of
samples, have more than one observer repeat the same procedure to see if any systematic bias is being introduced.
Third, carry out a preliminary survey so that you can estimate the amount of variation to be expected under each set of experimental conditions. If
you know where the variation lies in your subject populations, you can increase your replication to improve the precision of your estimates and thereby (by decreasing the standard error of the sampling distributions) increase the power of your tests against fixed alternatives for fixed values of
$\alpha$.
D. Optimal allocation of sampling effort
As I
mentioned before, the precision of our estimates of population parameters depends upon the form of the population distribution, the manner of sampling, and the size of
the sample. The experimenter has control only over these last two aspects. So the
allocation of sampling effort must involve variations in the manner of
sampling and the size of the sample.
Sample size
Increasing sample size will increase the precision of our estimates by decreasing the standard error of
our sample statistic. Increasing sample size should not decrease our estimate of
the standard deviation of
the underlying population. For fixed $\alpha$,
this
increases the power of an hypothesis test against all alternatives. For example, the sample size required to
be 95% confident that our sample estimate of the mean lies
within an allowable error, L, of the true population mean is approximately:
$n = 4\sigma^2 / L^2$
where n
is sample size and $\sigma$ is estimated by the sample standard deviation. The factor 4 comes from rounding the 95% normal deviate, 1.96, up to 2. Increasing precision is
synonymous with decreasing the allowable error, L,
and for fixed confidence we must increase n to achieve increased precision.
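As a rough illustration of this sample-size calculation, the sketch below assumes the usual normal-approximation formula n = (z * sigma / L)^2, with z = 1.96 for 95% confidence. The function name `sample_size_for_mean` and the numbers in the example are hypothetical, and sigma would in practice be a preliminary estimate.

```python
import math


def sample_size_for_mean(sigma, allowable_error, z=1.96):
    """Smallest n for which z * sigma / sqrt(n) <= allowable_error.

    sigma           : (estimated) population standard deviation
    allowable_error : L, the half-width we are willing to tolerate
    z               : normal deviate for the chosen confidence (1.96 for 95%)
    """
    if sigma <= 0 or allowable_error <= 0:
        raise ValueError("sigma and allowable_error must be positive")
    return math.ceil((z * sigma / allowable_error) ** 2)


# Halving the allowable error roughly quadruples the required sample size.
n_wide = sample_size_for_mean(sigma=10.0, allowable_error=5.0)
n_tight = sample_size_for_mean(sigma=10.0, allowable_error=2.5)
print(n_wide, n_tight)
```

Because n grows with the square of sigma / L, small gains in precision can be expensive, which is one reason a preliminary estimate of the variation is worth the effort.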
Sampling manner
The results concerning sampling distributions that I mentioned earlier hold for other types of
sampling than just simple random sampling. It
is sufficient for the sampling method to sample all elements independently and with known probabilities.
These probabilities need not be equal for all elements of
the population (as in
simple random sampling), as long as
we take account of these probabilities when constructing our estimates. Sampling plans that follow these criteria are known as
probability sampling, simple random sampling being the most common of
these. Two other commonly used methods of
probability sampling are stratified sampling and two-stage sampling.
Stratified sampling involves dividing a population into a number of parts, called strata, drawing a simple random sample from each stratum, and computing the estimate of the parameter of
interest as a weighted mean of the parameter estimates from each stratum. For the sample mean we have
$\bar{y}_{st} = \sum_h N_h \bar{y}_h / N$,
where $N_h$ is the total number of elements in the hth stratum, $\bar{y}_h$
is the sample mean for the hth stratum, and $N$
is the size of the population. Stratified sampling is useful because differences between
stratum means do not contribute to the standard error of the mean,
$\bar{y}_{st}$. That is, the sampling error arises solely because of variation among elements within strata. If we
can stratify an otherwise heterogeneous population into strata which are fairly homogeneous, we can increase the precision of our estimate over that achievable by simple random sampling. The
size of the sample we choose in any stratum is determined by
the experimenter. This freedom of choice allows the experimenter to allocate sampling effort efficiently. This control over the allocation of sampling effort is often the principal reason for the gain in
precision derived from stratification.
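A minimal sketch of the stratified estimate follows. It assumes the standard weighted-mean estimator (stratum means weighted by stratum size) and the usual variance formula for it, ignoring the finite population correction. The function names and example numbers are hypothetical.

```python
def stratified_mean(stratum_sizes, stratum_means):
    """Weighted mean of stratum sample means, with weights N_h / N."""
    total = sum(stratum_sizes)
    return sum(N_h * m_h for N_h, m_h in zip(stratum_sizes, stratum_means)) / total


def stratified_mean_variance(stratum_sizes, stratum_variances, sample_sizes):
    """Variance of the stratified mean: sum of (N_h / N)^2 * s_h^2 / n_h.

    Only within-stratum variances appear; differences between stratum
    means do not contribute. The finite population correction is ignored.
    """
    total = sum(stratum_sizes)
    return sum(
        (N_h / total) ** 2 * s2_h / n_h
        for N_h, s2_h, n_h in zip(stratum_sizes, stratum_variances, sample_sizes)
    )


# Two hypothetical strata: 600 and 400 elements.
sizes = [600, 400]
estimate = stratified_mean(sizes, stratum_means=[10.0, 20.0])
variance = stratified_mean_variance(sizes, stratum_variances=[4.0, 9.0], sample_sizes=[30, 20])
print(estimate, variance)
```

Note that the stratum means (10 and 20) differ widely, yet only the within-stratum variances (4 and 9) enter the variance of the estimate.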
If equal fractions of the elements in each stratum are sampled the weighting factors are equal for all strata and we need not modify our sample statistics to
account for the unequal probabilities of sampling elements in different strata. This is known as stratified sampling with proportional
allocation of sampling effort. The
optimum allocation of sampling effort in a stratified design is not necessarily
a proportional allocation program where $n_h / N_h$
is equal for all strata ($n_h$ being the sample size and $N_h$ the number of elements in the hth stratum). Rather the optimal solution is to take $n_h$
elements proportional to
$N_h \sigma_h / \sqrt{C_h}$, where $\sigma_h$ is the within-stratum standard
deviation, and $C_h$ is the
cost per sample in the hth stratum.
This method gives the smallest standard error of the estimated sample statistic
for a given total cost of sampling. In other words, take a larger sample in a
stratum that is unusually variable ($\sigma_h$ large), and a smaller sample where
sampling is unusually expensive ($C_h$
large). If the within-stratum standard deviations are all approximately equal
and the cost of sampling in each stratum is also equal then this method reduces
to the method of proportional allocation. Of course, in order to allocate effort optimally, rough estimates of standard
deviations and costs must be made.
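The allocation rule can be sketched as follows, assuming the standard result that the optimum sample size in a stratum is proportional to N_h * sigma_h / sqrt(C_h), scaled so that the total cost sum(C_h * n_h) equals a fixed budget. The function name `optimum_allocation` and all numbers are hypothetical, and the returned sample sizes are fractional and would need rounding in practice.

```python
import math


def optimum_allocation(stratum_sizes, stratum_sds, costs, total_cost):
    """Allocate samples to strata in proportion to N_h * sigma_h / sqrt(C_h),
    scaled so that the total sampling cost equals total_cost."""
    weights = [
        N_h * sd_h / math.sqrt(c_h)
        for N_h, sd_h, c_h in zip(stratum_sizes, stratum_sds, costs)
    ]
    # n_h = k * weight_h, with k chosen so that sum(C_h * n_h) == total_cost.
    k = total_cost / sum(c_h * w_h for c_h, w_h in zip(costs, weights))
    return [k * w_h for w_h in weights]


# Equal stratum sizes and costs: the more variable stratum gets more samples.
variable = optimum_allocation([1000, 1000], [10.0, 20.0], [1.0, 1.0], total_cost=300.0)

# Equal standard deviations and costs: reduces to proportional allocation.
proportional = optimum_allocation([1000, 3000], [5.0, 5.0], [2.0, 2.0], total_cost=400.0)
print(variable, proportional)
```

The second call shows the special case noted in the text: with equal within-stratum variation and equal costs, each stratum is sampled in proportion to its size.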
Two-stage sampling
In a two-stage sampling program the sample is derived by
first collecting a sample of primary sampling units, and then by sub-sampling within each of these units. The oak tree
experiment I described earlier is an example of a two-stage sampling program.
The primary units are the trees selected from the forests and the sub units are
the leaves or leaf clusters sampled within a tree. Sometimes two-stage sampling
is the only practical sampling method. On a live oak tree 5 meters in height I once counted 4,000 leaves on just one branch. Obviously enumerating all the leaves on the tree
would be very tedious. In general, it
is easy to sample the primary sampling units but difficult to sample the
sub-unit. The observation on each sub-unit is considered to be the sum of two
independent terms. One term, associated
with the primary unit, has the same
value for all second-stage units in the primary unit, and varies from one primary unit to the next with variance
$\sigma_1^2$.
The second term, which serves to
measure differences between second-stage units, varies independently from one
sub-unit to the next with variance
$\sigma_2^2$. If a sample consists of $n_1$ primary
units, from each of which $n_2$
sub-units are drawn, then the sample
as a whole contains $n_1$ independent values of the first term
and $n_1 n_2$ values of the second term. The variance of the sample
mean,
$\bar{y}$, per sub-unit is:
$V(\bar{y}) = \sigma_1^2 / n_1 + \sigma_2^2 / (n_1 n_2)$.
These two components of variation can be estimated from
an analysis of variance. In the standard analysis, the mean squares between and within primary units have the expectations
$E(MS_{between}) = \sigma_2^2 + n_2 \sigma_1^2$,
$E(MS_{within}) = \sigma_2^2$,
so that $\sigma_1^2$ is estimated by $(MS_{between} - MS_{within}) / n_2$.
Therefore, we
can juggle the number of primary and secondary units to minimize
$V(\bar{y})$.
But what choice of values is best? Naturally the answer
to this question depends on the relative costs of primary and secondary
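sampling units. If the cost associated solely with sampling
a primary unit is $C_1$
and the cost associated with sampling a secondary unit is $C_2$, then the total cost ($C_T$) of a two-stage program with $n_1$ primary units and $n_2$ sub-units per primary unit is
$C_T = C_1 n_1 + C_2 n_1 n_2$
If advance estimates of these
individual costs and of the variation due to each sampling stage are known then
one can allocate sampling effort to minimize the standard error of the
statistic of interest for fixed cost, or to achieve a desired precision of our
estimate, by minimizing the product
$V C_T = (\sigma_1^2 + \sigma_2^2 / n_2)(C_1 + C_2 n_2)$
where V
is the variance of the sample mean in this case, written in terms of the variance between primary units, $\sigma_1^2$, and the variance between sub-units, $\sigma_2^2$, and $C_T$ is total cost.
Since $n_1$
drops out of this expression we can solve for the value of $n_2$ that minimizes this expression:
$n_2 = \sqrt{C_1 \sigma_2^2 / (C_2 \sigma_1^2)}$
Then for known total cost $C_T$,
$n_1 = C_T / (C_1 + C_2 n_2)$,
and for known total variance V,
$n_1 = (\sigma_1^2 + \sigma_2^2 / n_2) / V$.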
Therefore, the value of n2
required for an optimal allocation of sampling effort can be obtained, and a similar value for n1 can also be obtained contingent on being able to specify the total
cost or variance desired.
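The two-stage allocation can be sketched numerically. The code below assumes the standard two-stage model, in which the variance of the sample mean is var_primary / n1 + var_sub / (n1 * n2) and the total cost is C1 * n1 + C2 * n1 * n2. All function names and numbers are hypothetical, and the resulting n1 and n2 would be rounded to whole units in practice.

```python
import math


def optimal_subunits(var_primary, var_sub, cost_primary, cost_sub):
    """n2 that minimizes (variance of the mean) x (total cost)."""
    return math.sqrt((cost_primary * var_sub) / (cost_sub * var_primary))


def primary_units_for_cost(total_cost, n2, cost_primary, cost_sub):
    """n1 affordable for a fixed total cost, given n2 sub-units per unit."""
    return total_cost / (cost_primary + cost_sub * n2)


def variance_of_mean(n1, n2, var_primary, var_sub):
    """Variance of the sample mean per sub-unit in a two-stage sample."""
    return var_primary / n1 + var_sub / (n1 * n2)


# Hypothetical advance estimates: variance components and unit costs.
var_primary, var_sub = 4.0, 16.0
cost_primary, cost_sub = 9.0, 1.0
total_cost = 150.0

n2_opt = optimal_subunits(var_primary, var_sub, cost_primary, cost_sub)
n1_opt = primary_units_for_cost(total_cost, n2_opt, cost_primary, cost_sub)
v_opt = variance_of_mean(n1_opt, n2_opt, var_primary, var_sub)

# For comparison: the same budget spent with too few or too many sub-units.
v_few = variance_of_mean(primary_units_for_cost(total_cost, 3, cost_primary, cost_sub), 3, var_primary, var_sub)
v_many = variance_of_mean(primary_units_for_cost(total_cost, 12, cost_primary, cost_sub), 12, var_primary, var_sub)
print(n2_opt, n1_opt, v_opt, v_few, v_many)
```

With these assumed values the optimum is 6 sub-units in each of 10 primary units; spending the same budget on 3 or on 12 sub-units per primary unit gives a larger variance of the mean.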
IV. Experimental Design
I have tried to
illustrate that the goal of a sampling program is both to
produce unbiased estimates of population
parameters and to learn something about the sampling distributions appropriate for the underlying population distributions. Also, the reason for choosing a particular sampling program is
to improve the power or
sensitivity of the statistical hypothesis test motivating the sampling.
In testing a statistical hypothesis one uses sampling distributions which are largely chosen for mathematical convenience (i.e., whose forms can be
specified if certain preconditions are met by the sampling program). One
proposes a model, imposes specific conditions upon the model, and derives the model's consequences in terms of
sampling distributions which are valid given the properties assumed for these sampling
distributions. To the extent that the model and conditions imposed upon it approximate the actual experiment, the model can be used as a guide in drawing inferences from the data.
To use models that allow the properties of the sampling distributions to
be specified in advance, the experiment must be designed to meet the preconditions associated with the particular model. If an experiment does not meet the specifications of existing models, the experimenter may be able to develop a model tailored to the specific experiment. However, the resulting data must still be analyzed. If
sampling distributions with known and manageable characteristics appropriate for an
experiment can be derived, the specific model can lead to
inferences with known precision. Without knowledge of
the properties of
the appropriate sampling distributions, inferences drawn from an experiment have unknown precision.
The analysis of
experimental data is
dependent upon the experimental design and the sampling distributions appropriate for the underlying population distributions. The design, in part, determines what the sampling program will be. For standard designs the sampling distributions necessary to
test the hypotheses of
interest have known and manageable properties (e.g., asymptotic normality), which has led to the widespread use of these designs. Alternative designs are often available for an experiment having specified objectives. Depending upon the specific situation, one design may be more efficient (that is, have greater power in the associated tests and narrower confidence intervals) for a given amount of experimental effort. The goal in planning experiments is
to find the design that is most efficient per unit effort relative to the primary objectives of the experiment.
Increasing sample size, improving measurement techniques,
and introducing various kinds of
controls all may decrease
experimental error and therefore improve power. Which method results in the greater increase in
power for a given unit of effort will depend upon conditions unique to
each experimental situation.
An examination of
purely statistical aspects of experimental designs will help the experimenter find the model best suited for their experiment. The model chosen should allow the experimenter to reach decisions regarding all the objectives of
the experiment. Whether or
not a particular model actually corresponds to a specific experimental situation requires an in-depth knowledge of the subject matter addressed by
the experiment. A careful assessment of the adequacy of alternative models may lead the experimenter to more fully understand the sources of
variation inherent in
the experiment. This may ultimately lead to
a better design and therefore a more clear-cut interpretation of
the experimental result.
Five criteria for evaluating experimental designs can be stated.
1. The model chosen and its underlying assumptions should be appropriate for the experimental material.
2. The design should provide as much information as possible with regard to the major objectives of the experiment for a given amount of experimental effort.
3. The design should provide some information with regard to all the experimental objectives.
4. The design must be feasible within the working conditions that exist for the experimenter.
5. The analyses based upon the design should provide unambiguous information on the primary objectives of the experiment.
In the following discussion several broad categories of
experimental designs will be presented. The benefits and costs of choosing one particular category of design over another will be examined.
A. Factorial Designs
Factorial experimental designs involve the comparison of the effects of
two or more factors acting simultaneously on a common response or
criterion variable. A factor can be considered a set of related treatments or related classifications. Each member of the set of related treatments belonging to
Factor A is considered a level of Factor A. The principal advantage of using a factorial design versus a series of single factor experiments is that it allows one to examine the effects of the interaction of each factor combination on the criterion variable. The presence of an interaction effect attributable to the combination
of factors above and beyond the effects of the factors singly can be
determined. However, the additional effort necessary to test an
hypothesis of interaction can be considerable. For example, if five replications are made at each level to test for the effects of two four-level factors singly, then such a design requires a total of 40 replications (2 experiments x 4 levels x 5 replications). To test for the effects of two four-level factors and their interaction, with five replications in each cell, requires 80 replications (4 x 4 cells x 5 replications). If prior information indicates that no interaction exists, a factorial design will not be as
economical as several single-factor designs.
Figures 2-5 illustrate the data layout and analysis of variance for a single factor and a two-factor factorial experiment. In each instance an equal number of
independent and randomly sampled elements are sampled at each factor level or combination of factor levels. Fully factorial designs, those with no confounding between factors and independent observations at all factor levels, are the most common and widely used designs. Other kinds of factorial experiments are sometimes useful.
Figure 2. Single Factor Experiment - data layout
| Treatment 1 | Treatment 2 | … | Treatment k |
| X11 | X12 | … | X1k |
| X21 | X22 | … | X2k |
| X31 | X32 | … | X3k |
| … | … | … | … |
| Xn1 | Xn2 | … | Xnk |
Figure 3. Single Factor Experiment - ANOVA Table
| Source of Variation | SS | df | MS | F |
| Treatments | SStreat | k-1 | SStreat/(k-1) | MStreat/MSerror |
| Error | SSerror | kn-k | SSerror/(kn-k) | |
| Total | SStotal | kn-1 | SStotal/(kn-1) | |
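The entries of the single-factor ANOVA table can be computed directly from the data layout. The sketch below assumes a balanced design (k treatments, n observations in each) and uses a hypothetical function name, `one_way_anova`, with made-up data. It is meant only to show how the SS, df, MS, and F columns relate to one another.

```python
def one_way_anova(groups):
    """Balanced single-factor ANOVA.

    groups : list of k lists, each holding the n observations for one treatment.
    Returns the SS, df, MS and F entries of the ANOVA table as a dict.
    """
    k = len(groups)
    n = len(groups[0])
    if k < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("expects k >= 2 treatments with equal n >= 2")

    group_means = [sum(g) / n for g in groups]
    grand_mean = sum(group_means) / k  # valid because the design is balanced

    # Between-treatment and within-treatment (error) sums of squares.
    ss_treat = n * sum((m - grand_mean) ** 2 for m in group_means)
    ss_error = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_treat, df_error = k - 1, k * n - k
    ms_treat = ss_treat / df_treat
    ms_error = ss_error / df_error
    return {
        "SStreat": ss_treat, "SSerror": ss_error, "SStotal": ss_treat + ss_error,
        "df_treat": df_treat, "df_error": df_error,
        "MStreat": ms_treat, "MSerror": ms_error,
        "F": ms_treat / ms_error,
    }


# Three hypothetical treatments with three observations each.
table = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(table)
```

The F ratio would then be compared with the F distribution on (k-1, kn-k) degrees of freedom.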
Figure 4. Two Factor Factorial Experiment - data layout
| Factor A | Factor B: Level 1 | Factor B: Level 2 | … | Factor B: Level p |
| Level 1 | X111 X112 X113 … X11n | X121 X122 X123 … X12n | … | X1p1 X1p2 X1p3 … X1pn |
| Level 2 | X211 X212 X213 … X21n | … | … | … |
| Level 3 | … | … | … | … |
| … | … | … | … | … |
| Level r | Xr11 Xr12 Xr13 … Xr1n | … | … | Xrp1 Xrp2 Xrp3 … Xrpn |
Figure 5. Two Factor Factorial Experiment - ANOVA Table (Factor A with r levels, Factor B with p levels, n observations per cell)
| Source of Variation | SS | df | MS | F, Model I | F, Model II | F, Model III (A fixed, B random) |
| Factor A | SSA | r-1 | SSA/(r-1) | MSA/MSe | MSA/MSAB | MSA/MSAB |
| Factor B | SSB | p-1 | SSB/(p-1) | MSB/MSe | MSB/MSAB | MSB/MSe |
| AB Interaction | SSAB | (p-1)*(r-1) | SSAB/((p-1)*(r-1)) | MSAB/MSe | MSAB/MSe | MSAB/MSe |
| Within cell (error) | SSe | pr*(n-1) | SSe/(pr*(n-1)) | | | |
| Total | SStotal | prn-1 | | | | |
Occasionally in executing a single-factor experiment a limited number or amount of primary sampling units are available to receive the experimental treatments. Those
that are available may not be
considered strict "replicates" because uncontrolled variation exists between primary units prior to the experiment. In order to incorporate enough replicates for each experimental
treatment it is often necessary, or maybe even desirable, to use more than one primary unit. The best design in this situation is a randomized complete block design. Each primary unit is considered a block and each treatment is randomly assigned to sub-blocks within each block. A 2-factor analysis of variance is performed with blocks as one factor, in order to
remove variation due to
blocks from the experimental error. If hypothesis tests are only performed on the treatment effect then the blocking factor can be considered a fixed factor. If analyzed in this manner the treatment-by-block interaction is implicitly considered to be zero.
B. Nested Designs
Three kinds of
nested designs are used in agricultural and psychological research and have many applications in biology. These designs are hierarchical, split-plot, and repeated measures. The primary purpose of these designs is to eliminate uncontrolled variation due to a priori differences in primary sampling units from the estimate of experimental error. In this sense we can see that these designs are a way to
remove confounding variation by adding
classificatory controls or
strata. Another reason for the use of these kinds of
nested designs in
biological research is
that we often wish to make inferences concerning hierarchically
arranged environments, habitats, and species.
1. Hierarchical Factors
Consider the example depicted in Figure
6, ignoring for the moment the high and low marsh categories. We wish to test the hypothesis that some characteristic, say above-ground biomass, does not differ between estuaries. We have estimates of biomass per marsh in each estuary. Our Factor B, marshes, is not completely crossed with Factor A, estuaries, since no marsh is found in
more than one estuary. The marsh factor is nested within each level of the factor estuaries. Since all levels of the factor marsh do not occur in combination with all levels of
the factor estuaries, we cannot examine the effects of
a marsh by estuary interaction. The degrees of freedom and SS for estuaries can be computed as in a normal 2-way ANOVA. The
SS for marshes is computed as the sum, over the levels of Factor A, of the SS for marshes within each level.
The degrees of freedom for each of these components is (q - 1), where q is the number of marshes in each estuary. If marshes are considered a random
factor, the F ratio to
test the hypothesis of no estuary effect
is F = MSA/MSB. If marsh is a fixed factor, the F ratio is MSA/MSWC, where MSWC is the within-cell mean square.
Designs with more levels of
nesting are possible.
If
we include data on biomass at locations high and low in each marsh the resulting design is a partially hierarchical design. The high-low factor is
not nested within marshes or estuaries, but rather completely crossed with them. If we consider the estuary and high-low factors fixed and marshes random then these hypotheses may be
tested: an estuary effect, a high-low effect, and a high-low/estuary interaction. An outline of the degrees of
freedom, mean squares, and F ratios is given in Figure 7. Note
that the within cell variation has been partitioned into two orthogonal components which are used as
error terms to evaluate different hypotheses.
Figure 6. Nested Analysis of Variance - Data Layout
| | Estuary 1 | Estuary 1 | Estuary 1 | Estuary 2 | Estuary 2 | Estuary 2 | Estuary 3 | Estuary 3 | Estuary 3 |
| | Marsh 1 | Marsh 2 | Marsh 3 | Marsh 4 | Marsh 5 | Marsh 6 | Marsh 7 | Marsh 8 | Marsh 9 |
| High | n | n | n | n | n | n | n | n | n |
| Low | n | n | n | n | n | n | n | n | n |
Figure 7. Nested Analysis of Variance - ANOVA Table
| Source of Variation | SS | df | MS | F |
| Estuaries | SSEstuaries | (p-1) | SSEstuaries/(p-1) | MSEstuaries/MSMarshes w Estuaries |
| Marshes within Estuaries | SSMarshes w Estuaries | p*(q-1) | SSMarshes w Estuaries/(p*(q-1)) | |
| High-Low | SSHigh-Low | r-1 | SSHigh-Low/(r-1) | MSHigh-Low/MSMarshes w (Estuary by High-Low) |
| Estuary by High-Low interaction | SSEstuary by High-Low | (p-1)*(r-1) | SSEstuary by High-Low/((p-1)*(r-1)) | MSEstuary by High-Low/MSMarshes w (Estuary by High-Low) |
| Marshes within (Estuaries by High-Low) | SSMarshes w (Estuary by High-Low) | p*(q-1)*(r-1) | SSMarshes w (Estuary by High-Low)/(p*(q-1)*(r-1)) | |
2. Split Plot Designs
Split-plot designs are equivalent to repeated measures designs and are used widely in agriculture. They are useful when one of the treatments is more difficult to apply than the other, or at
least when one treatment is easier to apply at
a larger scale. Figure 8 depicts the layout of a split-plot design. This design is similar in
form to the randomized complete block design except that in
this instance a treatment level is applied to
each whole block or plot. Within each whole plot, each level of
a second treatment is
randomly assigned to
sub-plots. The effects of factor A are confounded with differences
between whole plots while the effects of factor B are part of
the within plot variation. The estimates of
the effect of Factor B are free from variation due to whole plots. The interaction between Factor A and B is also free from whole-plot effects. The analysis of
this design is outlined in
Figure 9.
3. Repeated Measures Designs
In repeated measures experiments observations are made on the same subject at all levels of at
least one factor. For example, the paired t-test can be considered the simplest instance of
a repeated measures design. Each subject is
observed before and after the application of
some treatment. The advantage of such a design is that the subject acts as
a self-control. Variation between subjects that
occurs for reasons unrelated to the experiment can then be removed from one estimate of experimental error. This
may lead to a more sensitive test of the hypothesis of interest. For example, in a
t-test with un-correlated observations (no repeated measures) the estimate of experimental error is
$s_a^2 + s_b^2$,
while for a
t-test on paired observations it
is
$s_a^2 + s_b^2 - 2 r_{ab} s_a s_b$,
where $r_{ab} s_a s_b$ is
the covariance of
a and b. If
the covariance is positive then the estimated experimental error from a design involving correlated observations will be smaller than that from un-correlated observations by the amount
$2 r_{ab} s_a s_b$. However, the degrees of freedom for the estimate with correlated observations are only (n - 1), while they are ($n_a$
- 1) + ($n_b$
- 1) for un-correlated observations.
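The effect of the covariance term can be seen in a small numerical sketch. It assumes the usual identity that the variance of paired differences equals the sum of the two variances minus twice the covariance. The before/after measurements and the helper name `covariance` are hypothetical.

```python
from statistics import mean, variance


def covariance(a, b):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


# Hypothetical measurements on the same five subjects before and after treatment.
before = [10, 12, 9, 14, 11]
after = [12, 13, 11, 17, 12]

# Error estimate ignoring the pairing: sum of the two variances.
unpaired_error = variance(before) + variance(after)

# Error estimate using the pairing: subtract twice the covariance.
paired_error = unpaired_error - 2 * covariance(before, after)

# The same quantity computed directly from the within-subject differences.
differences = [y - x for x, y in zip(before, after)]
direct = variance(differences)

print(unpaired_error, paired_error, direct)
```

Because subjects with high initial values also tend to have high final values in these made-up data, the covariance is positive and the paired estimate of error is much smaller, at the price of fewer degrees of freedom (4 rather than 8).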
Figure 8. Split-Plot Design - data layout
| | Factor A: A1 | Factor A: A2 | Factor A: A3 | Factor A: A4 |
| | Plot 1-n | Plot 1-n | Plot 1-n | Plot 1-n |
| Factor B | B1 | B2 | B3 | B2 |
| | B2 | B1 | B1 | B3 |
| | B3 | B3 | B2 | B1 |
Figure 9. Split-Plot Design - ANOVA Table
| Source of Variation | SS | df | MS | F |
| A | SSA | (p-1) | SSA/(p-1) | MSA/MSplots w A |
| Plots w A | SSplots w A | p*(n-1) | SSplots w A/(p*(n-1)) | |
| B | SSB | (q-1) | SSB/(q-1) | MSB/MSB x plots w A |
| AB | SSAB | (p-1)*(q-1) | SSAB/((p-1)*(q-1)) | MSAB/MSB x plots w A |
| B by Plots w A | SSB x plots w A | p*(q-1)*(n-1) | SSB x plots w A/(p*(q-1)*(n-1)) | |
For a repeated measures experiment to be more efficient
than a design with un-correlated observations the reduction in experimental error associated with controlling for extraneous between subject variation must offset the reduction in degrees of freedom. In
field biological research where our subjects might be
trees, lakes, marshes, streams, or
grids, even given an effort to choose them to be as
similar in physical characteristics as possible, a repeated measures design may help to remove uninteresting between-subject variation. One-, two-, and three-factor designs, as well as more complicated multi-factor repeated measures designs, are possible, with repeated measures on anywhere from one to all of the factors. The main effects and interaction sums of squares are computed as in
a factorial experiment. The error variation is partitioned into a series of orthogonal terms that are used to evaluate the main effects and interactions. Figure 10 shows a comparison of the partition of variation in
a repeated measures and a non-repeated measures
single-factor experiment.

C. Square Designs
Square designs like Latin and Greco-Latin squares are useful in controlling for individual differences between experimental
units.
They are also useful in instances where the main effects of three
factors are of interest, but the number of
subjects available is low or the cost of making observations is high. The loss of information in
square designs involves the interaction terms. Square designs can involve repeated measures and in that context are used for controlling sequence effects associated with the order of
applying treatments to
subjects.
For Further Reference
Chamberlin, T.C. 1965. The method of multiple working hypotheses. Science 148: 754-759.
Cochran, W.G. and G.M. Cox. 1957. Experimental Designs. John Wiley and Sons.
Hurlbert, S.H. 1984. Pseudoreplication and the design of ecological field experiments. Ecological Monographs 54: 187-211.
Keppel, G. 1982. Design and Analysis: A Researcher's Handbook. Prentice-Hall.
McCall, R.B. and M.L. Applebaum. 1973. Bias in the analysis of repeated-measures designs: Some alternative approaches. Child Development 44: 401-415.
Nowell, A.R.M. et al. 1982. High energy benthic boundary layer experiment: HEBBLE. EOS, August 1982, p. 594-595.
Snedecor, G.W. and W.G. Cochran. 1967. Statistical Methods.
Selvin, H.C. and A. Stuart. 1966. Data dredging procedures in survey analysis. American Statistician 20: 20-23.
Winer, B.J. 1971. Statistical Principles in Experimental Design. McGraw-Hill.