BIOL 458 BIOMETRY
Lab 10 - Multiple Regression
Many problems in the biological sciences involve the analysis of multivariate data sets. For data sets in which there is a single continuous dependent variable but several continuous independent variables, multiple regression is used. Multiple regression is a method of fitting linear models of the form:

$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$$

where $\hat{Y}$ is the estimated value of Y, the criterion variable; $X_1, X_2, \dots, X_k$ are the k predictor variables; and $b_0$ and $b_1, b_2, \dots, b_k$ are the regression coefficients. The values of the regression coefficients are determined by minimizing the sum of squares of the residuals, i.e., minimizing

$$\sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2$$
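The least-squares fit described above can be sketched outside SPSS as follows. This is only an illustration in Python with NumPy, not part of the lab procedure; the data values are made up, and the names X, Y, A, b, and ss_resid are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical data: 6 observations of k = 2 predictors (made-up numbers).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 5.0]])

# Y is built exactly as 1 + 2*X1 + 0.5*X2, so the fit should recover
# b0 = 1, b1 = 2, b2 = 0.5 with essentially zero residual.
Y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1]

# Design matrix: a leading column of ones carries the intercept b0.
A = np.column_stack([np.ones(len(Y)), X])

# Least squares picks b to minimize the sum of squared residuals.
b, *_ = np.linalg.lstsq(A, Y, rcond=None)

Y_hat = A @ b                        # estimated values of Y
ss_resid = np.sum((Y - Y_hat) ** 2)  # the quantity being minimized

print("coefficients:", b)
print("residual sum of squares:", ss_resid)
```

With real data the residual sum of squares will not be zero; the exact fit here only makes the result easy to check.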
Hypothesis tests about the regression coefficients, or about the contribution of particular terms or groups of terms to the fit of the model, are then performed to determine the utility of the model. In many studies, regression programs are used to generate a series of models; the "best" of these models is then chosen on the basis of a variety of criteria, as discussed in lecture. These models can be generated using a forward, backward, or stepwise regression routine. In forward regression, new independent variables are added to the model if they meet a set significance criterion for inclusion (often p<0.05 for the partial F-test for the inclusion of the term in the model). In backward regression, all independent variables are initially entered into the model and sequentially taken out if they do not meet a set significance criterion (often p>0.1 for the partial F-test for removal of a term). Stepwise regression uses both of these techniques: a variable is entered if it meets the p-value to enter, and after each variable is added to the equation, all other variables in the equation are tested against the p-value to remove a term and, if necessary, dropped from the model. SPSS, SAS, MINITAB, SYSTAT, BMDP, and other statistical packages include these routines. The computer output generated by these routines consists of a series of models for estimating the value of Y and the goodness-of-fit statistics for each model. Each model estimates the value of Y as a linear combination of the values of the predictor variables included in that model.
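To make the forward procedure concrete, here is a minimal sketch in Python with NumPy. It is not the SPSS implementation. As a simplification, it compares the partial F statistic to a fixed F-to-enter threshold (an assumed value of 4.0) rather than converting F to a p-value. The function names rss and forward_select and the simulated data are hypothetical.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares for an intercept plus the listed columns of X."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b
    return float(resid @ resid)

def forward_select(X, y, f_to_enter=4.0):
    """Add one variable at a time while the best partial F exceeds f_to_enter."""
    n, k = X.shape
    selected = []
    while len(selected) < k:
        rss_old = rss(X, y, selected)
        best_j, best_f = None, f_to_enter
        for j in range(k):
            if j in selected:
                continue
            rss_new = rss(X, y, selected + [j])
            # Residual df of the larger model: n - (number of predictors) - 1.
            df_resid = n - (len(selected) + 1) - 1
            # Partial F: drop in RSS from adding variable j, per residual mean square.
            f = (rss_old - rss_new) / (rss_new / df_resid)
            if f > best_f:
                best_j, best_f = j, f
        if best_j is None:   # no candidate meets the entry criterion
            break
        selected.append(best_j)
    return selected

# Simulated data (assumption): only column 0 truly affects y.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=40)

selected = forward_select(X, y)
print("order of entry (column indices):", selected)
```

Column 0 should enter first because it explains nearly all of the variation in y. Whether the pure-noise columns also enter depends on chance, which is one reason these routines are not guaranteed to find the best model.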
Further Instructions for Lab 10
In Lab 10 you use the same SPSS regression module to fit multiple regression models that you used in Lab 9 to fit bivariate regression models. The main differences now are in entering the multiple independent variables, selecting the algorithm for building the model, and evaluating the fit of each model.
In the Linear Regression sub-window, you will see a box with a pull-down arrow called Method that by default is occupied by the word Enter. Enter is one of several model-building algorithms available in the Method box. Enter in SPSS is equivalent to forcing all the variables in the Independent(s) box into the model simultaneously. The opposite of Enter is Remove, where all variables are removed simultaneously. Other model-building algorithms use various criteria to decide which variables are entered into (or removed from) the model, and when to stop adding or removing variables. SPSS has algorithms named Stepwise, Backward, and Forward.
In the Stepwise algorithm, the variable with the smallest probability of its F statistic is entered into the model first, provided it meets a criterion such as p<0.05. This process is then repeated for the variables not yet included in the model: the next variable that meets the criterion is added, and so on, until no remaining variable has an F statistic that meets the user-specified criterion (p<0.05, for example). As this process progresses, the F statistics for variables already in the model can change. If the significance level of one of these F statistics exceeds the removal criterion, that variable is removed from the model. Hence, in a Stepwise algorithm, variables can be both added to and removed from a model during the model-building process. The Forward algorithm is identical to the Stepwise algorithm, except that variables can only be added to the model, not removed.
The Backward algorithm puts all variables into the model, and then attempts to remove variables sequentially. The variable with the smallest partial correlation with the dependent variable is removed first if it meets the criterion for removal. If this variable is removed, then the variable with the next smallest partial correlation with the dependent variable is considered for removal, and removed if it meets the criterion. Note that in the Backward algorithm, variables are removed because the significance levels of their partial correlations exceed the criterion (p>0.05), the opposite of the criterion for a Stepwise or Forward algorithm.
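The Backward procedure can be sketched in the same spirit. This is a hedged illustration in Python with NumPy, not SPSS's own code. It ranks variables by partial F rather than partial correlation; within one step the two rankings agree, because every candidate shares the same residual degrees of freedom. A fixed F-to-remove threshold (an assumed value of 4.0) stands in for the p-value criterion, and the names rss and backward_eliminate and the simulated data are hypothetical.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares for an intercept plus the listed columns of X."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b
    return float(resid @ resid)

def backward_eliminate(X, y, f_to_remove=4.0):
    """Start with every variable; repeatedly drop the weakest one if its
    partial F falls below f_to_remove."""
    n, k = X.shape
    kept = list(range(k))
    while kept:
        rss_full = rss(X, y, kept)
        df_resid = n - len(kept) - 1
        # Partial F for each variable: rise in RSS when it alone is dropped.
        f_values = {
            j: (rss(X, y, [c for c in kept if c != j]) - rss_full)
               / (rss_full / df_resid)
            for j in kept
        }
        weakest = min(f_values, key=f_values.get)
        if f_values[weakest] >= f_to_remove:   # weakest variable still earns its place
            break
        kept.remove(weakest)
    return kept

# Simulated data (assumption): columns 0 and 1 affect y, column 2 is noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=40)

kept = backward_eliminate(X, y)
print("variables retained (column indices):", kept)
```

A Stepwise routine would combine this removal check with a forward entry step after each addition.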
Unfortunately, none of these algorithms are guaranteed to choose the best model. I prefer the Forward algorithm, but sometimes build models with different algorithms to see if they all choose the same best model.
Occasionally you might wish to enter variables into a model in a specific sequence, or to use different model-building algorithms for different groups of independent variables. To do so you need to look at the text and buttons surrounding the Independent(s) box in the Linear Regression sub-window. Note a light gray line enclosing this region, and blue text that says Block 1 of 1. SPSS allows you to group variables into blocks and specify a different variable selection method for each block. For example, to build the Analysis of Covariance models that I described in class, you would place the variable name for the covariate into the Independent(s) box and select Enter as the Method (since you don't want SPSS to do any thinking, just put the variable in the model). Then you would click on the Next button. Note that the blue text now says Block 2 of
Assessing model fit involves all the same procedures used in bivariate regression, since the same assumptions apply. The dependent variable should be normally distributed, scatter plots should indicate linear relationships between the dependent and independent variables, and residual plots should show homoscedasticity (equality of variances in the residuals along the regression line). In addition to these issues, one also needs to check for outliers or overly influential data points, and for high inter-correlations between pairs of independent variables (called multicollinearity). If two independent variables are highly correlated (r>0.9), then including both variables in the model causes problems in parameter estimation. You can pre-screen your independent variables by obtaining a correlation matrix before performing the regression, and allowing only one variable of a highly correlated pair to serve as a candidate variable for model building at a time. You could also examine the Tolerance values provided by SPSS in the output table named Excluded Variables. These values also tell you whether you have a problem with multicollinearity. Come to class to find out how to interpret the tolerance values.
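The pre-screening step can be sketched as follows in Python with NumPy. This is an illustration only. It assumes the common definition of tolerance as 1 minus the R-squared from regressing one predictor on all the other predictors, and it computes this for every predictor at once. The SPSS Excluded Variables table instead reports tolerance relative to the variables currently in the model. The function name tolerance and the simulated data are hypothetical.

```python
import numpy as np

def tolerance(X):
    """Tolerance of each column of X: 1 - R^2 from regressing that column
    on all of the other columns (with an intercept)."""
    n, k = X.shape
    tol = np.empty(k)
    for j in range(k):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        b, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ b
        ss_total = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        tol[j] = (resid @ resid) / ss_total   # equals 1 - R^2
    return tol

# Simulated predictors (assumption): x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(2)
x1 = rng.normal(size=50)
x2 = x1 + 0.05 * rng.normal(size=50)
x3 = rng.normal(size=50)
X = np.column_stack([x1, x2, x3])

r = np.corrcoef(X, rowvar=False)   # correlation matrix of the predictors
tol = tolerance(X)

print("correlation matrix:\n", np.round(r, 3))
print("tolerances:", np.round(tol, 4))

# Flag pairs above the r > 0.9 screening threshold mentioned in the text.
pairs = [(i, j) for i in range(3) for j in range(i + 1, 3) if abs(r[i, j]) > 0.9]
print("highly correlated pairs:", pairs)
```

In this simulated case the first two predictors form a flagged pair and have tolerances near zero, while the unrelated third predictor has a much larger tolerance.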
Lab 10 Assignment
The exercise to be performed in this lab is to use the SPSS stepwise and forward regression routines to generate a series of models, and to select the "best" model from each series, as discussed in lecture. Two data sets will be provided; you are to perform the analysis on either of these two. You must discuss in detail your reasons for choosing the models that you have selected, including plots of residuals, information about the distribution of the response variable, an examination of outliers, and other metrics that demonstrate goodness-of-fit.
DESCRIPTION OF DATA
The data are stored in a text file, MULTR2, or the equivalent SPSS file. The variables are as follows (they are in the same order in the data sets):
VARIABLE (UNITS)
______________________________________
Mean elevation (feet)
Mean temperature (degrees F)
Mean annual precipitation (inches)
Vegetative density (percent cover)
Drainage area (square miles)
Latitude (degrees)
Longitude (degrees)
Elevation at temperature station (feet)
1-hour, 25-year precipitation intensity (inches/hour)
Annual water yield (inches) (Dependent Variable)
The data consist of values of these variables measured on all gauged watersheds in the western region of the