Sean R. Avent
A time series is defined as a collection of observations made sequentially in time. The methods described here assume that the observations are separated by equal intervals of time.
This page is designed for those who have a basic knowledge of elementary statistics and need a short introduction to time-series analysis. Many references are included for those who need to probe further into the subject, which is recommended before these methods are applied. This guide should help readers decide whether these methods suit their data, and it gives a quick summary of the basics involved. A number of statistical packages are available for analyzing the data.
Goals of Time Series Analysis
Time series analysis can be used to accomplish different goals:
1) Descriptive analysis determines what trends and patterns a time series has by plotting or using more complex techniques. The most basic approach is to graph the time series and look at:
Overall trends (increase, decrease, etc.)
Cyclic patterns (seasonal effects, etc.)
Outliers: data points that may be erroneous
Turning points: changes between different trends within a data series
2) Spectral analysis is carried out to describe how variation in a time series may be accounted for by cyclic components. This is also referred to as analysis in the "frequency domain". With it, an estimate of the spectrum over a range of frequencies can be obtained, and periodic components in a noisy environment can be separated out.
Example: What is seen in the ocean as random waves may actually be a number of different frequencies and amplitudes that are quite stable and predictable. Spectral analysis is applied to wave height vs. time to determine which frequencies are most responsible for patterns that are present but can't be readily seen without analysis.
3) Forecasting predicts future values: if a time series has behaved a certain way in the past, its future behavior can be predicted within certain confidence limits by building models.
Example: Tidal charts are predictions based upon tidal heights in the past. The known components of the tides (e.g., positions of the moon and sun and their weighted values) are built into models that can be employed to predict future values of the tidal heights.
4) Intervention analysis can determine whether a certain event changes a time series. This technique is often used in planned experimental analysis. In other words, 'Is there a change in a time series before and after a certain event?'
Example: 1. If a plant's growth rate before changing the amount of light it gets is different from that afterwards, an intervention has occurred - the change in light is the intervention. 2. When a community of goats changes its behavior after a bear shows up in the area, then there may be an intervention.
5) Explanative Analysis (Cross Correlation)
Using one or more explanatory time series, the mechanism that produces a dependent time series can be estimated. A common question to be answered with this analysis would be "What relationship is there between two time series data sets?" This topic is not covered on this page, but it is discussed in Chatfield (1996) and Box et al. (1994).
Example: Atmospheric pressure and seawater temperature affect sea level. All three are recorded as time series, and together they can show how and to what degree pressure and temperature affect sea level.
Types of Time Series Data
Continuous vs. Discrete
Continuous - observations made continuously in time
1. Seawater level as measured by an automated sensor.
2. Carbon dioxide output from an engine.
Discrete - observations made only at certain times.
1. Animal species composition measured every month.
2. Bacteria culture size measured every six hours.
Stationary vs. Non-stationary
Stationary - Data that fluctuate around a constant value
Non-stationary - A series whose cycle parameters (i.e., length, amplitude or phase) change over time
Deterministic vs. Stochastic
Deterministic time series - These data can be predicted exactly.
Stochastic time series - Data are only partly determined by past values, and future values have to be described with a probability distribution. This is the case for most, if not all, natural time series. So many factors are involved in a natural system that we cannot possibly account for all of them.
Transformations of the Data
We can transform data to:
1. Stabilize the variance - use the logarithmic transformation
2. Make the seasonal effect additive - this makes the effect constant from year to year - use the logarithmic transformation.
3. Make data normally distributed - this reduces the skewness in the data so that we may apply appropriate statistics - use a Box-Cox transformation (a family that includes the logarithmic and square root transformations)
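The transformations listed above can be sketched in a few lines of Python. This is only an illustration: the function name `box_cox`, the parameter name `lam`, and the sample values are assumptions made here, not part of the original page. It uses the usual Box-Cox form, (x^lam - 1)/lam, which reduces to the natural logarithm when lam = 0 and to a rescaled square root when lam = 0.5.

```python
import math

def box_cox(values, lam):
    """Box-Cox power transform of strictly positive data.

    lam = 0 gives the natural log; lam = 0.5 is a shifted, rescaled square root.
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

# Hypothetical series whose fluctuations grow with its level (made-up values)
series = [10.0, 12.0, 9.0, 100.0, 120.0, 90.0, 1000.0, 1200.0, 900.0]
logged = box_cox(series, 0)      # logarithmic transformation
rooted = box_cox(series, 0.5)    # square-root-type transformation
```

On the log scale the swing from 120 to 90 is the same size as the swing from 12 to 9, which is the sense in which the transformation stabilizes the variance and makes a multiplicative effect additive.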
Many more transformations, not discussed here, are available for the different things we may want to do with time series data. These are discussed in the various texts listed throughout this page.
Autocorrelation
A series of data may have observations that are not independent of one another.
For example, a population density on day 8 depends on what that population density was on day 7, which in turn depends on day 6, and so forth.
The order of these data has to be taken into account so that we can assess the autocorrelation involved.
To find out if autocorrelation exists:
Autocorrelation coefficients measure correlations between observations a certain distance apart.
Based on the ordinary correlation coefficient r (see Zar for a full explanation), we can see if successive observations are correlated. An autocorrelation coefficient at lag k can be found by:
$$r_k = \frac{\sum_{t=1}^{N-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{N} (x_t - \bar{x})^2}$$
This is the covariance of $x_t$ and $x_{t+k}$ divided by the variance of $x_t$.
An $r_k$ value beyond $\pm 2/\sqrt{N}$ denotes a significant difference from zero and signifies an autocorrelation.
Also note that as k gets large, $r_k$ generally becomes smaller.
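As a minimal sketch, the lag-k coefficient (covariance at lag k divided by the variance) and the 2/sqrt(N) significance rule can be computed as follows. The function names and the sample series are hypothetical, chosen only for illustration.

```python
import math

def autocorr(x, k):
    """Sample autocorrelation coefficient r_k at lag k."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

def significance_bound(n):
    """Approximate limit: |r_k| > 2/sqrt(N) is treated as significant."""
    return 2.0 / math.sqrt(n)

# Hypothetical slowly varying series (made-up values)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0,
     1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0]
r = [autocorr(x, k) for k in range(0, 5)]
bound = significance_bound(len(x))
significant = [k for k in range(1, 5) if abs(r[k]) > bound]
```

Here `r[0]` is always 1, and `significant` lists the lags whose coefficients fall outside the bound.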
Another test for the presence or absence of autocorrelation is the Durbin-Watson d-statistic:
$$d = \frac{\sum_{t=2}^{N} (e_t - e_{t-1})^2}{\sum_{t=1}^{N} e_t^2}$$
where $e_t$ is the residual at time t.
Fig. 1 shows the five regions of values of d in which autocorrelation is accepted or not.
Figure 1. The five regions of the Durbin-Watson d-statistic.
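The d-statistic itself is simple to compute: it is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below assumes the residuals are already available; the two example series are made up. Values of d near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. The critical bounds that separate the five regions come from published tables and are not reproduced here.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: squared successive differences over the sum of squares."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residuals: one series changes sign slowly, the other at every step
smooth = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
alternating = [1.0, -1.0] * 6

d_smooth = durbin_watson(smooth)        # below 2: positively autocorrelated
d_alt = durbin_watson(alternating)      # above 2: negatively autocorrelated
```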
A Note on Non-Stationary Data
As stated above, the cycle parameters of non-stationary data change over time. Any such trend must be removed before calculating $r_k$ and the correlograms described below. Without this trend removal, the trend will tend to dominate the other features of the data.
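A small sketch shows how a trend can dominate the autocorrelation coefficient. It assumes first differencing as the trend-removal method (one option among several) and uses a made-up series: a linear trend plus a component that alternates at every step.

```python
def autocorr(x, k):
    """Sample autocorrelation coefficient r_k at lag k."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

def difference(x):
    """First difference: subtract each observation from the one that follows it."""
    return [x[t] - x[t - 1] for t in range(1, len(x))]

# Hypothetical series: linear trend plus an alternating component
x = [0.5 * t + (1.0 if t % 2 == 0 else -1.0) for t in range(40)]

r1_raw = autocorr(x, 1)               # strongly positive: the trend dominates
r1_diff = autocorr(difference(x), 1)  # strongly negative: the alternation shows
```

Before differencing, the lag-1 coefficient is large and positive even though successive values alternate around the trend; after differencing, the negative short-term correlation becomes visible.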
Correlograms
The autocorrelation coefficient $r_k$ can then be plotted against the lag (k) to develop a correlogram. This gives a visual look at a range of correlation coefficients at relevant time lags so that significant values may be seen.
The correlogram in Fig. 2 shows significant short-term correlation at low k and small correlation at longer lags. Remember that an $r_k$ value beyond $\pm 2/\sqrt{N}$ denotes a significant difference ($\alpha = 0.05$) from zero and signifies an autocorrelation. Some procedures may call for a more stringent $\alpha$ value, since at $\alpha = 0.05$ about one out of every twenty coefficients in a truly random data series is expected to appear significant.
Figure 2. A time series showing short-term autocorrelation together with its correlogram.
Fig. 3 shows an alternating (negatively correlated) time series.
The coefficient $r_k$ alternates in sign just as the raw data do ($r_1$ is negative, $r_2$ is positive, and so on), which indicates negative correlation between successive observations.
Figure 3. An alternating time series with its correlogram.
A fuller discussion of correlograms and the associated periodograms can be found in Chatfield (1996), Naidu (1996), and Warner (1998).
Box-Jenkins Models (Forecasting)
Box and Jenkins developed the AutoRegressive Integrated Moving Average (ARIMA) model, which combines the AutoRegressive (AR) and Moving Average (MA) models developed earlier with a differencing factor that removes the trend in the data.
The time series data can be expressed as: $Y_1, Y_2, Y_3, \ldots, Y_{t-1}, Y_t$
With random shocks (a) at each corresponding time: $a_1, a_2, a_3, \ldots, a_{t-1}, a_t$
In order to model a time series, we must state some assumptions about these 'shocks'. They have:
1. a mean of zero
2. a constant variance
3. no covariance between shocks
4. a normal distribution (although there are procedures for dealing with departures from normality)
An ARIMA (p,d,q) model is composed of three elements:
p: Autoregression
d: Integration or Differencing
q: Moving Average
A simple ARIMA (0,0,0) model without any of the three processes above is written as:
$$Y_t = a_t$$
The autoregression process [ARIMA (p,0,0)] refers to how important previous values are to the current one over time. A data value at t1 may affect the data values of the series at t2 and t3, but its influence decreases exponentially as time passes, so that the effect falls to near zero. It should be pointed out that $\phi$ is constrained between -1 and 1 and that, as its magnitude becomes larger, the effects at all subsequent lags increase. An ARIMA (1,0,0) model is written as:
$$Y_t = \phi_1 Y_{t-1} + a_t$$
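The exponential decay described above can be made concrete with a short simulation. This is a sketch under assumptions made here: the shocks are drawn as standard normal values, the series starts from zero, and the function names and the coefficient value 0.5 are illustrative.

```python
import random

def simulate_ar1(phi, n, seed=0):
    """Simulate Y_t = phi * Y_{t-1} + a_t with standard normal shocks a_t."""
    if not -1.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between -1 and 1")
    rng = random.Random(seed)
    y, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        y.append(prev)
    return y

def impulse_response(phi, lags):
    """Effect of a unit shock k steps later: phi ** k."""
    return [phi ** k for k in range(lags + 1)]

weights = impulse_response(0.5, 5)   # each step halves the remaining effect
series = simulate_ar1(0.5, 200)
```

With a coefficient of 0.5, the influence of a shock halves at every step; a coefficient closer to 1 in magnitude makes the influence persist over more lags.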
The integration process [ARIMA (0,d,0)] differences the data to remove trend and drift (i.e., it makes non-stationary data stationary). The first observation is subtracted from the second, the second from the third, and so on. So the final form without AR or MA processes is the ARIMA (0,1,0) model:
$$Y_t = Y_{t-1} + a_t$$
The order of differencing rarely exceeds one (d < 2 in most situations).
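A brief sketch of the ARIMA (0,1,0) case: a random walk is built by accumulating shocks, and first differencing recovers those shocks. The standard normal shocks, the starting level of zero, and the function names are assumptions for illustration.

```python
import random

def random_walk(n, seed=1):
    """ARIMA(0,1,0): Y_t = Y_{t-1} + a_t, started from a level of zero."""
    rng = random.Random(seed)
    shocks = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y, level = [], 0.0
    for a in shocks:
        level += a
        y.append(level)
    return y, shocks

def difference(y):
    """First difference: Y_t - Y_{t-1} for t = 2 .. N."""
    return [y[t] - y[t - 1] for t in range(1, len(y))]

walk, shocks = random_walk(100)
recovered = difference(walk)   # equals shocks[1:] up to rounding error
```

The differenced series is one observation shorter than the original, because the first value has no predecessor to subtract.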
The moving average process [ARIMA (0,0,q)] is used for serially correlated data. The process is composed of the current random shock and portions of the q previous shocks. An ARIMA (0,0,1) model is described as:
$$Y_t = a_t - \theta_1 a_{t-1}$$
As with the integration process, the MA process rarely exceeds the first order.
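The first-order moving average model can be simulated directly from its definition. The sketch below assumes standard normal shocks and uses illustrative names. It also evaluates the theoretical lag-1 autocorrelation of this model, -theta / (1 + theta^2), which follows from the covariance and variance of Y_t under the shock assumptions listed earlier; all higher-lag autocorrelations of an MA(1) process are zero.

```python
import random

def simulate_ma1(theta, n, seed=2):
    """Simulate Y_t = a_t - theta * a_{t-1} with standard normal shocks."""
    rng = random.Random(seed)
    shocks = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
    return [shocks[t] - theta * shocks[t - 1] for t in range(1, n + 1)]

def ma1_lag1_autocorrelation(theta):
    """Theoretical lag-1 autocorrelation of the MA(1) process above."""
    return -theta / (1.0 + theta ** 2)

rho1 = ma1_lag1_autocorrelation(0.5)
series = simulate_ma1(0.5, 500)
```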
Time Series Intervention Analysis
The basic question is "Has an event had an impact on a time series?"
The null hypothesis is that the level of the series before the intervention ($b_{pre}$) is the same as the level of the series after the intervention ($b_{post}$), or
$$H_0: b_{pre} - b_{post} = 0$$
After building the ARIMA model, an intervention term ($I_t$) can be added, and the ARIMA model then serves as the noise component ($N_t$):
$$Y_t = f(I_t) + N_t$$
The intervention component can be of four different types that are described by their onset and duration characteristics (Fig. 4):
Figure 4. Types of intervention components. From McDowall et al. (1980).
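As a deliberately simplified sketch, an intervention with abrupt onset and permanent duration can be coded as a step indicator, and the change in level estimated as the difference between the post- and pre-intervention means. This ignores the ARIMA noise component entirely, so it only illustrates the quantity in the null hypothesis and is not a substitute for the full analysis. The function names and data values are hypothetical.

```python
def step_indicator(n, start):
    """I_t = 0 before the intervention and 1 from time index `start` onward."""
    return [0 if t < start else 1 for t in range(n)]

def level_change(y, indicator):
    """Mean level after the intervention minus mean level before it."""
    pre = [v for v, i in zip(y, indicator) if i == 0]
    post = [v for v, i in zip(y, indicator) if i == 1]
    return sum(post) / len(post) - sum(pre) / len(pre)

# Hypothetical series with an intervention at the sixth observation
y = [5.0, 5.2, 4.9, 5.1, 4.8, 7.1, 6.9, 7.0, 7.2, 6.8]
indicator = step_indicator(len(y), 5)
shift = level_change(y, indicator)
```

Because successive observations are usually autocorrelated, an ordinary two-sample test on these means would overstate significance; that is why the noise component is modeled first.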
Frequency Analysis
Frequency analysis is used to decompose a time series into an array of sine and cosine functions which can be plotted by their wavelengths. This spectrum of wavelengths can be analyzed to determine which are most relevant (see Fig. 5). In Fig. 5 you can't tell from the raw data what the major components are, but once a spectral analysis is completed, you can pick out the relevant wavelengths.
In any one of these analyses, the data are considered to be stationary. If they are not, then a filter should be applied to the data before carrying out the appropriate analysis. All angles are presented in radians.
Figure 5. Frequency analysis data sets. The top four plots are the raw data, whereas the bottom four are the periodograms for the top four, but are not in order.
A Harmonic Analysis (a type of regression analysis) is used to fit a model when the period or cycle length is known a priori. This can estimate the amplitude, cycle phase, and mean.
$$X_t = \mu + A \cos(\omega t) + B \sin(\omega t) + e_t$$
$\omega = 2\pi/\tau$ (we know what the period $\tau$ is)
t = observation time or number
A and B = coefficients
$e_t$ = residuals that are uncorrelated
Given $\tau$, we can use OLS regression methods to estimate the amplitude and the phase of the cycle.
Amplitude: $R = (A^2 + B^2)^{1/2}$
Phase: $\phi = \arctan(-B, A)$ (the two-argument arctangent, so that $A \cos(\omega t) + B \sin(\omega t) = R \cos(\omega t + \phi)$)
Using SPSS, we run a multiple regression with $\sin(\omega t)$ and $\cos(\omega t)$ as the independent variables to get estimates of A and B. Once we have these, we can calculate the amplitude and phase, and the model is fit.
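The same regression can be sketched outside SPSS. The Python example below is a stand-in, not the procedure used on this page: it builds the regression on a constant, cos(wt) and sin(wt), solves the ordinary least squares normal equations with a small hand-written solver, and then converts A and B to amplitude and phase. The helper names and the noise-free test series are assumptions for illustration.

```python
import math

def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def harmonic_fit(x, period):
    """OLS fit of X_t = mu + A cos(w t) + B sin(w t), with w = 2*pi/period."""
    w = 2.0 * math.pi / period
    design = [[1.0, math.cos(w * t), math.sin(w * t)] for t in range(len(x))]
    # Normal equations: (D'D) beta = D'x
    dtd = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
    dtx = [sum(r[i] * v for r, v in zip(design, x)) for i in range(3)]
    mu, a, b = solve(dtd, dtx)
    amplitude = math.hypot(a, b)     # R = (A^2 + B^2)^(1/2)
    phase = math.atan2(-b, a)        # two-argument arctangent of (-B, A)
    return mu, a, b, amplitude, phase

# Hypothetical noise-free series: mean 10, amplitude 2, phase 0.5 rad, period 12
x = [10.0 + 2.0 * math.cos(2.0 * math.pi * t / 12.0 + 0.5) for t in range(48)]
mu, a, b, amplitude, phase = harmonic_fit(x, 12.0)
```

For this noise-free series the fit returns the mean, amplitude and phase that were used to generate it, which is a quick check that the sign convention for the phase is consistent.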
A Periodogram or Spectral Analysis is used if there is no reason to suspect a certain period or cycle length. These methods fit a suite of cycles of differing lengths or periods to the data.
To find which sinusoids describe the data and to what degree, a generalization of the harmonic analysis is applied to the residuals of the data. The overall sum of squares (SS) is partitioned into N/2 periodic components, each with df = 2. A harmonic analysis is then done on each component and the results are summarized in an ANOVA source table. From this, we get estimates of A and B (and hence an SS) for each component, and because the component SSs add up to SStotal, we can get estimates of the variance for each component.
The null hypothesis is that the variances are all the same, which is indicative of white noise. The result is plotted with intensity or SS on the Y-axis and frequency on the X-axis. A large peak represents a frequency that accounts for a significant share of the variation in the data.
$$X_t = \mu + \sum_{k=1}^{N/2} \left[ A_k \cos(\omega_k t) + B_k \sin(\omega_k t) \right]$$
|Dependent variable:||$X_t$ = time series|
|Independent variables:||$A_k$ = cosine parameter (regression coefficient of $\cos(\omega_k t)$)|
|||$B_k$ = sine parameter (regression coefficient of $\sin(\omega_k t)$)|
$A_k$ and $B_k$ determine the degree to which each function is correlated with the data.
Since the sine and cosine functions are orthogonal (mutually independent), periodogram values ($P_k$) can be calculated for each frequency and plotted against the frequency. These values are interpreted as the variance accounted for by each frequency.
$$P_k = (A_k^2 + B_k^2) \cdot \frac{N}{2}$$
$P_k$ = periodogram value at frequency $\omega_k$
N = overall length of series
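A minimal sketch of the periodogram calculation follows. It estimates the cosine and sine coefficients at each Fourier frequency directly (which is valid because the sinusoids are orthogonal at those frequencies) and forms the periodogram value as (A^2 + B^2) * N/2. For brevity it omits the highest (Nyquist) frequency, which needs separate handling when N is even. The function name and the two-component test series are assumptions.

```python
import math

def periodogram(x):
    """Periodogram at the Fourier frequencies k/N, k = 1 .. (N-1)//2.

    The Nyquist frequency (k = N/2 for even N) is left out to keep this short.
    """
    n = len(x)
    mean = sum(x) / n
    rows = []
    for k in range(1, (n - 1) // 2 + 1):
        w = 2.0 * math.pi * k / n
        a = 2.0 / n * sum((v - mean) * math.cos(w * t) for t, v in enumerate(x))
        b = 2.0 / n * sum((v - mean) * math.sin(w * t) for t, v in enumerate(x))
        rows.append((k / n, a, b, (a * a + b * b) * n / 2.0))
    return rows

# Hypothetical series: 3 cycles with amplitude 2 plus 5 cycles with amplitude 1
n = 24
x = [2.0 * math.cos(2.0 * math.pi * 3 * t / n) + math.sin(2.0 * math.pi * 5 * t / n)
     for t in range(n)]
spectrum = periodogram(x)
peak = max(spectrum, key=lambda row: row[3])   # (frequency, A, B, P) of the largest peak
```

In this example the periodogram values add up to the total sum of squares of the series, which illustrates the partitioning of variance among frequencies.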
Because the data are sampled at discrete times rather than continuously, a significant peak may leak into adjacent frequencies. To alleviate this problem, the following are suggested:
2. Tapering or windowing
These methods can be found in Warner (1998), Chatfield (1996), and Gardner (1988).
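One common form of tapering is the split cosine bell, sketched below: the first and last portions of the series are multiplied by a smooth cosine ramp so that the series fades in and out instead of starting and stopping abruptly. The 10% proportion, the function name, and the choice of this particular taper are assumptions for illustration; the references above describe the alternatives. Tapering is typically applied after the mean has been removed.

```python
import math

def cosine_taper(x, proportion=0.1):
    """Split cosine bell: ramp the first and last `proportion` of the series."""
    n = len(x)
    m = int(round(proportion * n))   # number of points tapered at each end
    out = list(x)
    for i in range(m):
        w = 0.5 * (1.0 - math.cos(math.pi * (i + 0.5) / m))
        out[i] *= w
        out[n - 1 - i] *= w
    return out

# Hypothetical constant series, so the taper weights are visible directly
tapered = cosine_taper([1.0] * 20, 0.1)
```

The middle of the series is left unchanged, and the weights at the two ends are symmetric.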
In order to get a power spectrum, we must smooth the periodogram so that each periodogram intensity is replaced by an average that includes weighted neighboring values. This gives a better and more reliable picture of the distribution of power (or variance accounted for). Smoothing procedures can differ by window width and weighting function.
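The smoothing step can be sketched as a weighted moving average over neighboring periodogram values. The three-point weights (0.25, 0.5, 0.25) and the handling of the end points are assumptions chosen for illustration; other window widths and weighting functions work the same way.

```python
def smooth_periodogram(values, weights=(0.25, 0.5, 0.25)):
    """Replace each periodogram value by a weighted average of its neighbours."""
    half = len(weights) // 2
    n = len(values)
    smoothed = []
    for i in range(n):
        total = 0.0
        weight_sum = 0.0
        for j, w in enumerate(weights):
            idx = i + j - half
            if 0 <= idx < n:          # at the edges, use only the available neighbours
                total += w * values[idx]
                weight_sum += w
        smoothed.append(total / weight_sum)
    return smoothed

# Hypothetical raw periodogram with one sharp peak
raw = [1.0, 3.0, 2.0, 40.0, 2.0, 1.0, 3.0]
power = smooth_periodogram(raw)
```

A wider window gives a smoother and more stable estimate but blurs peaks that are close together.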
Fourier frequencies are chosen with the longest cycle equal to the length of the series and the shortest cycle having a period of two observations. All frequencies in between are equally spaced and don't overlap. A Fast Fourier Transform uses the Euler relation to work with complex numbers (Chatfield, 1996) and is too math-intensive to do practically by hand. SPSS has a fast Fourier transform built in for these analyses.
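The set of Fourier frequencies is easy to list for a given series length. In the sketch below (the function name is hypothetical), frequency k/N corresponds to k complete cycles over the series, so the period in observations is N/k.

```python
def fourier_periods(n):
    """Cycle lengths, in observations, of the Fourier frequencies k/N, k = 1 .. N//2."""
    return [n / k for k in range(1, n // 2 + 1)]

periods = fourier_periods(12)   # from one cycle per series down to one cycle per two observations
```

For a series of 12 observations the longest cycle spans all 12 observations and the shortest spans 2.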
Spectrum analysis significance tests use upper and lower bounds of a confidence interval that are derived using a $\chi^2$ distribution. The degrees of freedom will depend on what kind of smoothing was used. This confidence interval can be superimposed on the power spectrum so that significant values may be seen.
For a more complete description see any one of the spectral analysis books listed below, but especially Chatfield (1996) and Warner (1998).
References
There are many references out there for time series analysis. Most refer to applications involving econometrics or social sciences, but most techniques can be applied to the biological sciences. Most of the web pages involve very advanced theories.
Books held by the SFSU Library
|Box, G.E.P., G.M. Jenkins, and G.C. Reinsel. 1994. Time series analysis: Forecasting and control. 3rd ed. Prentice Hall, Englewood Cliffs, NJ, USA.||A great introductory section, although the rest of the book is very involved and mathematically in-depth.|
|Chatfield, C. 1996. The analysis of time series: An introduction. 5th ed. Chapman and Hall, London, UK.||A very good and readable book that goes over most aspects of time series data. Highly recommended.|
|Gardner, W.A. 1988. Statistical spectral analysis: A nonprobabilistic theory. Prentice-Hall Inc., Englewood Cliffs, NJ, USA.||An in-depth book with advanced features and methods.|
|Harvey, A.C. 1981. Time series models. Halsted Press, New York, NY, USA.||A moderately involved book with some understandable sections on model building.|
|McDowall, D., R. McCleary, E.E. Meidinger, and R.A. Hay Jr. 1980. Interrupted time series analysis. Sage Publications, Inc., Thousand Oaks, CA, USA.||A good book in the Sage series on intervention analysis that covers the basics quite well. Very readable.|
|Naidu, P.S. 1996. Modern spectrum analysis of time series. CRC Press Inc., Boca Raton, FL, USA.||A complete account of spectrum analysis, but very involved and assumes great comfort with basic statistics.|
|Ostrom, C.W. 1978. Time series analysis: Regression techniques. Sage Publications, Beverly Hills, CA, USA.||A good and short book in the Sage series that goes over the basics with decent ease.|
|Warner, R. M. 1998. Spectral analysis of time-series data. Guilford Press, New York, NY, USA.||A very good book on spectral analysis that is especially good with experimental design and data collection/entry.|
|SPSS for Beginners - $5.95||This can be downloaded in a pdf (Acrobat Reader) file for a small fee. Chapter 17, Time Series Analysis can be downloaded separately for free from the SPSS site.|
|An online textbook from Statsoft that covers most aspects of Time Series Analysis.||Very complete and readable.|
|Autobox tutorial||A rather bulky tutorial on ARIMA Models|
|Carnegie Mellon University - Datasets||Very wide range of datasets to play with.|
|Rob J Hyndman's Forecasting Pages||A set of pages with everything forecasting.|
|Time Series Analysis and Chaosdynamics - Rotating Fluids||A very in-depth page on advanced time series analysis.|
|Forum: sci.stat.edu||If you get stuck, you can post a question to this forum.|
|Forum: sci.stat.consult||Or this forum.|
|Forum: comp.soft-sys.stat.spss||Or for any SPSS question, use this forum.|
|SPSS||Lots of Software - a great statistics package|
|AFS - Autobox||Looks useful, but I haven't played with it. Starts at $400 and goes up from there. Forecasting and intervention analysis.|
|UCLA Statistics Bookmark Database||Need Software - look here!|
Page last updated 14 Dec 1999.