Power Analysis

(Stephane Pauquet)


CONTENTS

1/ Introduction
2/ Statistical significance
3/ Factors affecting Power
4/ Power of test
5/ The effect size index
6/ Failures of the assumptions
7/ Significance testing
8/ The use of tables for significance testing
9/ Implications of ignoring Power
10/ References


1/ Introduction

The power of a statistical test is the probability that it will yield statistically significant results (Cohen, 1969). Scientists strongly wish for statistical significance, yet the concept is surprisingly poorly understood. Because many scientists have only a crude concept of statistical significance and an even more primitive concept of statistical power, tests are often conducted under conditions where the hypothesis under scrutiny (the null hypothesis) has a very low chance of being rejected, even if it is actually false.

Power analysis, which can be performed before (a priori) or after (a posteriori) the data are collected, is used to assess how likely a test is, by its design, to reject the null hypothesis. A test found a priori to have low power should lead to a new, possibly completely different, experimental design, or to changes in constraints such as the significance criterion. A test found a posteriori to have low power should either convince the scientist to repeat the experiment with a larger sample size, or at least prompt reflection on what inferences, if any, can be drawn from the experiment.

However, both types of power analysis should respect the same basic principles as any other statistical test, notably that the precise objectives of the investigation be defined prior to any experiment. For example, an a posteriori power analysis is uninformative when it uses the effect size (see below) observed in the data rather than the effect size that one wanted to detect.

 


2/ Statistical significance

Since samples only approximate population characteristics, a significance criterion α is set for the observed values. α serves as a "standard of proof" that the phenomenon under study exists or, equivalently, as a "standard of disproof" for the null hypothesis stating that the phenomenon does not exist. From there, we define:

a type I error as rejecting the null hypothesis (Ho) when in fact it is true; its probability is α, which is traditionally set very low.

a type II error as failing to reject Ho when in fact it is false; its probability is β, which is consequently usually high and often unknown.

This leads us to the definition of Power (= 1 - β) as the probability of correctly rejecting Ho, the null hypothesis.
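These definitions can be checked by simulation. The following is a minimal sketch, not part of the original text: it assumes a two-tailed z-test on two independent normal samples of equal size with a known common standard deviation, and the sample size, true difference and number of trials are arbitrary illustrative choices. The rejection rate estimates the type I error rate when Ho is true, and Power when Ho is false.

```python
import random
from statistics import NormalDist

def rejects_h0(sample_a, sample_b, sigma, alpha):
    """Two-tailed z-test of equal means (equal n, known common sigma)."""
    n = len(sample_a)
    mean_a = sum(sample_a) / n
    mean_b = sum(sample_b) / n
    z = (mean_a - mean_b) / (sigma * (2 / n) ** 0.5)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) >= z_crit

def rejection_rate(true_diff, n=20, sigma=1.0, alpha=0.05, trials=4000, seed=1):
    """Fraction of simulated experiments in which Ho is rejected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(true_diff, sigma) for _ in range(n)]
        b = [rng.gauss(0.0, sigma) for _ in range(n)]
        hits += rejects_h0(a, b, sigma, alpha)
    return hits / trials

type_i_rate = rejection_rate(true_diff=0.0)  # Ho true: should be close to alpha
power = rejection_rate(true_diff=0.8)        # Ho false: estimates 1 - beta
type_ii_rate = 1 - power                     # estimates beta
print(type_i_rate, power, type_ii_rate)
```

With Ho true, the rejection rate should stay near the chosen criterion of 0.05; with Ho false, it is the power of this particular design.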


3/ Factors affecting Power

Power increases with:

- sample size (n): the reliability of a sample depends upon its size (the smaller the sample, the larger the sampling error), so an increase in sample size increases statistical power;

- effect size (ES, or d when standardized): the degree to which a specified alternative hypothesis deviates from the null hypothesis;

- a higher α level: the lower α is set, the lower the power;

- lower observational variability: the lower the variance and standard deviation, the greater the power;

...and with the degree to which the data meet the assumptions of the statistical method applied.
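The direction of each of these effects can be illustrated numerically. The sketch below is an assumption-laden illustration rather than the method behind the Power tables: it uses a normal approximation to a two-tailed, two-sample test with n observations per group, and the function name approx_power and all input values are hypothetical.

```python
from statistics import NormalDist

def approx_power(diff, sigma, n, alpha=0.05):
    """Approximate power of a two-tailed, two-sample test (n per group).

    Normal approximation: the t distribution is replaced by the standard
    normal, which slightly overstates power for small samples.
    """
    z = NormalDist()
    d = diff / sigma                  # standardized effect size
    shift = d * (n / 2) ** 0.5        # expected value of the test statistic
    z_crit = z.inv_cdf(1 - alpha / 2)
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

base = approx_power(diff=5, sigma=10, n=30)
print(f"baseline (d = 0.5, n = 30) : {base:.3f}")
print(f"larger sample (n = 64)     : {approx_power(5, 10, 64):.3f}")
print(f"larger effect (diff = 8)   : {approx_power(8, 10, 30):.3f}")
print(f"higher alpha (0.10)        : {approx_power(5, 10, 30, alpha=0.10):.3f}")
print(f"lower variability (sd = 5) : {approx_power(5, 5, 30):.3f}")
```

Each of the four changes raises the power above the baseline, in line with the list above.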

The directionality of the significance criterion also bears on the power:

Two-tailed tests have less power than one-tailed tests at the same α, provided that the sample result is in the direction predicted:

Power[one-tailed test at α] = Power[two-tailed test at 2α]

For example, if a one-tailed test is conducted at α = 0.05 and yields results in the predicted direction, it has the same power as a two-tailed test conducted at α = 0.10.
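This equivalence can be checked with a short calculation. The sketch below assumes a normal approximation to the two-sample test with n observations per group; the function names and the values d = 0.5 and n = 30 are illustrative choices. Under this approximation the two powers differ only by the very small probability of rejecting in the wrong direction.

```python
from statistics import NormalDist

Z = NormalDist()

def power_one_tailed(d, n, alpha):
    """Power when the whole rejection region lies in the predicted direction."""
    shift = d * (n / 2) ** 0.5
    return 1 - Z.cdf(Z.inv_cdf(1 - alpha) - shift)

def power_two_tailed(d, n, alpha):
    """Power when alpha is split equally between the two tails."""
    shift = d * (n / 2) ** 0.5
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return (1 - Z.cdf(z_crit - shift)) + Z.cdf(-z_crit - shift)

one_05 = power_one_tailed(0.5, 30, 0.05)
two_10 = power_two_tailed(0.5, 30, 0.10)
two_05 = power_two_tailed(0.5, 30, 0.05)
print(f"one-tailed, alpha = 0.05: {one_05:.4f}")
print(f"two-tailed, alpha = 0.10: {two_10:.4f}")
print(f"two-tailed, alpha = 0.05: {two_05:.4f}")
```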

 


4/ Power of test

Power is a function of α, n and ES:

- If these are determined, Power is obtained by simple computation, or by using Power tables;

- Power Tables can be used to determine the power of a test, as well as to yield the value of any of these 4 parameters, knowing the 3 others.

Assumptions:

The Power tables are designed to yield values for the t-test (difference between the means of two independent samples of equal size drawn from normal populations having equal variances). Thus the primary assumptions are:
- n1 = n2
- σ1 = σ2

But we will see later that the analysis is robust to certain violations of these assumptions.

Application:

By rearranging the appropriate specific form of the following equation:

Power = some function of (ES, n, σ and α),

…it is possible to solve it for any one of its terms:

- "detectable" effect size

- variance

- sample size

- probability of type I error
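As an illustration of solving for one term, the sketch below rearranges a normal-approximation power formula for a two-tailed, two-sample test. It is an assumption made for illustration, not the exact form used by the Power tables: the function names are hypothetical, the negligible wrong-direction tail is ignored, and t-based tables give slightly larger sample sizes.

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist()

def required_n(d, power=0.80, alpha=0.05):
    """Observations per group needed to detect a standardized effect d."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return ceil(2 * (z_sum / d) ** 2)

def detectable_d(n, power=0.80, alpha=0.05):
    """Smallest standardized effect detectable with n observations per group."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return z_sum * (2 / n) ** 0.5

n_needed = required_n(0.5)   # sample size for d = 0.5
d_min = detectable_d(64)     # detectable effect size with n = 64
print(n_needed, round(d_min, 3))
```

The same rearrangement could be applied to the variance or to α; only the unknown being solved for changes.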


5/ The effect size index

The effect size ES is the degree of departure of the alternative hypothesis from the null hypothesis. It is indexed to obtain a "pure number", d, free of the original measurement unit (standardization procedure), by dividing the difference between the two population means by their common standard deviation:

- for a one-tailed test: d = (μ1 - μ2) / σ

- for a two-tailed test: d = |μ1 - μ2| / σ

(μ1, μ2 and σ expressed in the original measurement units)

This has led to a categorization of d values as "small", "medium" and "large", for instance (Cohen, 1977):

- 0.2 : small effect size relative to uncontrollable extraneous variables ("noise"). This is assumed when the phenomena under study are not under good experimental or measurement control, or both;

- 0.5 : medium effect size, conceived as "large enough to be visible to the naked eye" (experimentally perceptible), and;

- 0.8 : large effect size (as an example, it is represented by the estimated mean IQ difference between holders of the Ph.D. degree and typical college freshmen).

Note that only previous practice and knowledge in a particular field can serve for the setting of these otherwise arbitrary values.
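The standardization and the conventional benchmarks can be sketched as follows. The function names and the input values are hypothetical, and treating 0.2, 0.5 and 0.8 as lower cut-offs of ranges is an assumption made here for illustration; the benchmarks themselves are only reference points.

```python
def cohens_d(mu1, mu2, sigma, one_tailed=False):
    """Standardized effect size; signed for a one-tailed (directional) test."""
    diff = mu1 - mu2
    return diff / sigma if one_tailed else abs(diff) / sigma

def describe(d):
    """Label |d| against the 0.2 / 0.5 / 0.8 benchmarks (assumed cut-offs)."""
    size = abs(d)
    if size < 0.2:
        return "below small"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

d = cohens_d(105, 100, 10)   # means 5 units apart, standard deviation 10
print(d, describe(d))
```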


6/ Failures of the assumptions

case 1: n1 ≠ n2 ; σ1 = σ2

The power tables will yield useful approximate values if the harmonic mean n' is used instead of the arithmetic mean:

n' = 2 n1 n2 / (n1 + n2)

case 2: n1 = n2 ; σ1 ≠ σ2

The tables are still valid using σ' = √[(σ1² + σ2²) / 2] instead of the common standard deviation

(unless there is a big difference between σ1 and σ2, σ' will not differ greatly from their arithmetic mean);
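The two adjustments for cases 1 and 2 amount to a harmonic mean of the sample sizes and a root mean square of the standard deviations. A minimal sketch, assuming these standard definitions and using hypothetical function names and values:

```python
from math import sqrt

def harmonic_n(n1, n2):
    """Harmonic mean of two sample sizes (case 1: unequal n)."""
    return 2 * n1 * n2 / (n1 + n2)

def rms_sigma(s1, s2):
    """Root mean square of two standard deviations (case 2: unequal sigma)."""
    return sqrt((s1 ** 2 + s2 ** 2) / 2)

n_prime = harmonic_n(20, 60)        # 30.0, versus an arithmetic mean of 40
sigma_prime = rms_sigma(3.0, 4.0)   # about 3.54, versus an arithmetic mean of 3.5
print(n_prime, round(sigma_prime, 2))
```

The harmonic mean is pulled toward the smaller sample, which is why very unbalanced designs gain little from enlarging only one group.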

case 3: one sample of n observations

The t-test can also be used to test H0: μ = c on a single sample, c being a specified value relevant to some theory under consideration.

This requires the computation of d3 = (μ - c) / σ (conceptually, no change) and still another calculation (d = d3 √2) in order to consult the tables, because c is a hypothetical parameter, thus without sampling error, whereas in a two-sample test each mean contributes to sampling error.

The relevant t-test will be based on (n-1) degrees of freedom (vs. [2(n-1)] df in a two-sample test). This approach is problematic only in the case of very small samples.
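The reason for the extra factor of √2 can be seen numerically. The sketch below assumes a normal approximation (so the difference in degrees of freedom is ignored), and the function names and values are hypothetical: a one-sample test with effect size d3 has the same approximate power as a two-sample test with n per group and effect size d3 √2.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def _two_tailed_power(shift, alpha):
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return (1 - Z.cdf(z_crit - shift)) + Z.cdf(-z_crit - shift)

def power_one_sample(d3, n, alpha=0.05):
    """One sample of n observations tested against a fixed value c."""
    return _two_tailed_power(d3 * sqrt(n), alpha)

def power_two_sample(d, n, alpha=0.05):
    """Two independent samples of n observations each."""
    return _two_tailed_power(d * sqrt(n / 2), alpha)

d3 = (103 - 100) / 10      # hypothetical mean 103, c = 100, sigma = 10
d_table = d3 * sqrt(2)     # value used to enter two-sample tables
print(power_one_sample(d3, 40), power_two_sample(d_table, 40))
```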

case 4: one sample of n differences between paired observations (matched into n pairs X, Y)

Z = X - Y ; d4 = μz / σz ; d = d4 √2
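For paired observations, the calculation reduces to a one-sample problem on the differences. The sketch below is illustrative only: the data are made up, the sample mean and standard deviation of the differences stand in for the population values, and the names d4 and d_table are hypothetical labels for the paired effect size and the value used to enter two-sample tables.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical matched pairs (for example, before and after measurements)
x = [12, 15, 11, 14, 13]
y = [10, 14, 10, 11, 12]

z = [xi - yi for xi, yi in zip(x, y)]   # one sample of n differences
d4 = mean(z) / stdev(z)                 # effect size of the differences
d_table = d4 * sqrt(2)                  # same sqrt(2) adjustment as in case 3
print(z, round(d4, 3), round(d_table, 3))
```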


7/ Significance testing

...is an appraisal of the research results: the second column of the power tables yields the value of the significance criterion for the parameter under study (effect size d or correlation coefficient r).

A "significance at the α level" is attributed when the observed effect size (d) or correlation coefficient (r) equals or exceeds this criterion.

Advantage: the significance decision can be made without computation, in a "quick check" of the significance of results.


8/ The use of tables for significance testing

Provision has been made in the Power tables to facilitate significance testing:

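Cases 1 and 2 (see above): Calculate the standardized mean difference for the sample, ds:

ds = (x̄a - x̄b) / S

(with x̄a and x̄b being the sample means and S the pooled within-sample estimate of the population standard deviation):

S = √[(Σ(Xa - x̄a)² + Σ(Xb - x̄b)²) / (na + nb - 2)]

(samples need to be independent, but can be of unequal size)

Note: ds* is related to the t-statistic by:

ds = t √[(na + nb) / (na nb)]

(*in case 1, ds simplifies to: ds = t √(2/n'), with n' the harmonic mean of na and nb)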

The value of ds necessary for significance is called dc, i.e. the criterion value of ds. In entering the tables, the value of n to be used is the harmonic mean of na and nb, as defined above.
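The relation between ds and the t-statistic can be verified on a small data set. In the sketch below the data and function names are hypothetical, and the formulas are the standard pooled two-sample definitions, assumed here to match the ones intended above. The t value obtained from ds and the harmonic mean agrees with the t value computed directly.

```python
from math import sqrt
from statistics import mean

def pooled_sd(a, b):
    """Pooled within-sample estimate of the population standard deviation."""
    ma, mb = mean(a), mean(b)
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    return sqrt(ss / (len(a) + len(b) - 2))

def sample_d(a, b):
    """Standardized mean difference for two independent samples."""
    return (mean(a) - mean(b)) / pooled_sd(a, b)

a = [5.1, 4.9, 6.2, 5.8, 5.5, 6.0]   # hypothetical sample a
b = [4.2, 4.8, 5.0, 4.4]             # hypothetical sample b (unequal size)
na, nb = len(a), len(b)

ds = sample_d(a, b)
n_prime = 2 * na * nb / (na + nb)    # harmonic mean of na and nb
t_from_ds = ds * sqrt(n_prime / 2)
t_direct = (mean(a) - mean(b)) / (pooled_sd(a, b) * sqrt(1 / na + 1 / nb))
print(round(ds, 3), round(t_from_ds, 3), round(t_direct, 3))
```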

Case 3: Same as above. Provided that the sample sizes are approximately equal, the validity of the t-test is hardly affected (thus, valid under "nonextreme" conditions).

Case 4 (one sample of n differences between paired observations: X - Y = Z): Some transformations are required:

Compute d's = z̄ / sz (d's indicating a one-sample test, z̄ and sz being the mean and standard deviation of the n differences) and use d'c instead of dc: d'c = dc / √2, or 0.707 dc.
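A short sketch of this case follows. The differences and the value of dc are invented placeholders (dc would normally be read from the tables), and the function and variable names are hypothetical. The one-sample t-statistic is recovered from the observed index by multiplying by √n.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_criterion(dc):
    """Convert a two-sample criterion dc into the one-sample criterion d'c."""
    return dc / sqrt(2)              # equivalently 0.707 * dc

z = [0.4, -0.1, 0.6, 0.3, 0.5, 0.2, 0.7, 0.0]   # hypothetical differences X - Y
ds_prime = mean(z) / stdev(z)        # observed one-sample index
t_stat = ds_prime * sqrt(len(z))     # equivalent one-sample t-statistic

dc = 1.07                            # placeholder for a value read from the tables
dc_prime = one_sample_criterion(dc)
is_significant = abs(ds_prime) >= dc_prime
print(round(ds_prime, 3), round(dc_prime, 3), is_significant)
```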


9/ Implications of ignoring Power

Low-power impact assessment experiments can generate costly type II errors, resulting in the implementation of inappropriate actions (e.g. depletion of aquatic resources, Peterman, 1990);

Low power often results in inefficient and/or poorly adapted experiments, and can thus lead to missed opportunities to increase understanding of the processes under study, and;

The cost of type II errors can exceed that of type I errors (whereas traditional significance standards usually implicitly assume the opposite).

Taking action on the basis of a statistical analysis, whether or not it rejects the null hypothesis, can lead to one of the two following sequences of events (Peterman, 1990):

1) Hypothetical sequence of events following the rejection of Ho

[Figure 1]

 

2) Hypothetical sequence of events following the failure to reject Ho

[Figure 2]

The total cost of the decision path followed when acting as if Ho were true will likely be larger than that of the path followed when Ho is assumed to be false (compare Figs. 1 and 2).


10/ References

Cohen, J. 1977. Statistical Power Analysis for the Behavioral Sciences. Academic Press: New York.

Peterman, R. M. 1990. Statistical Power Analysis can Improve Fisheries Research and Management. Canadian Journal of Fisheries and Aquatic Sciences 47:2-15.

Phillips, P. C. 1998. Designing Experiments to Maximize the Power of Detecting Correlations. Evolution 52(1):251-255.

 

This page was last updated on 09/26/02