Time Series Analysis

Anne Senter

A time series can be defined as a collection of quantitative observations that are evenly spaced in time and measured successively.  Examples of time series include the continuous monitoring of a person’s heart rate, hourly readings of air temperature, daily closing prices of a company stock, monthly rainfall data, and yearly sales figures.  Time series analysis is generally used when there are 50 or more data points in a series.  If the time series exhibits seasonality, there should be 4 to 5 cycles of observations in order to fit a seasonal model to the data.

Goals of time series analysis:

1. Description: identify patterns in correlated data, such as trends and seasonal variation

2. Explanation: understanding and modeling the data

3. Forecasting: prediction of short-term trends from previous patterns

4. Intervention analysis: how does a single event change the time series?

5. Quality control: deviations of a specified size indicate a problem

Time series are analyzed in order to understand the underlying structure and function that produce the observations.  Understanding the mechanisms of a time series allows a mathematical model to be developed that explains the data in such a way that prediction, monitoring, or control can occur.  Examples include prediction/forecasting, which is widely used in economics and business.  Monitoring of ambient conditions, or of an input or an output, is common in science and industry.  Quality control is used in computer science, communications, and industry.

It is assumed that a time series data set has at least one systematic pattern.  The most common patterns are trends and seasonality.  Trends are generally linear or quadratic.  To find trends, moving averages or regression analysis are often used.  Seasonality is a pattern that repeats itself systematically over time.  A second assumption is that the data contain enough random noise that the systematic patterns are hard to identify.  Time series analysis techniques therefore often apply some type of filter to the data in order to dampen the error.  Other potential patterns have to do with lingering effects of earlier observations or earlier random errors.
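As a rough illustration of how a moving average filters out noise so that a trend becomes visible, here is a minimal Python sketch.  The function name moving_average and the window length are illustrative assumptions; they are not part of the JMP analysis used later in this article.

```python
def moving_average(values, window):
    """Return the mean of every run of `window` consecutive observations."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and the series length")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Made-up numbers for illustration only.
print(moving_average([1, 2, 3, 4, 5], 3))  # [2.0, 3.0, 4.0]
```

A longer window smooths more strongly but returns fewer values, because only complete windows are averaged.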

There are numerous software programs that will analyze time series, such as SPSS, JMP, and SAS/ETS.  For those who want to learn or are comfortable with coding, Matlab, S-PLUS, and R are other software packages that can perform time series analyses.  Excel can be used if linear regression analysis is all that is required (that is, if all you want to find out is the magnitude of the most obvious trend).  A word of caution about using multiple regression techniques with time series data: because time series are autocorrelated, they violate the assumption of independence of errors.  Type I error rates will increase substantially when autocorrelation is present.  Also, inherent patterns in the data may dampen or enhance the effect of an intervention; in time series analysis, patterns are accounted for within the analysis.

Observations made over time can be either discrete or continuous.  Both types of observations can be equally spaced, unequally spaced, or have missing data.  Discrete measurements can be recorded at any time interval, but are most often taken at evenly spaced intervals.  Continuous measurements can be spaced randomly in time, such as measuring earthquakes as they occur because an instrument is constantly recording, or can entail constant measurement of a natural phenomenon such as air temperature, or a process such as velocity of an airplane.

Time series are very complex because each observation is somewhat dependent upon the previous observation, and often is influenced by more than one previous observation.  Random error is also influential from one observation to another.  These influences are called autocorrelation—dependent relationships between successive observations of the same variable.  The challenge of time series analysis is to extract the autocorrelation elements of the data, either to understand the trend itself or to model the underlying mechanisms.

Time series reflect the stochastic nature of most measurements over time.  Thus, data may be skewed, with non-constant mean and variance, non-normally distributed, and not randomly sampled or independent.  Another complication is that time series observations are often not evenly spaced in time, due to instrument failure or simply due to variation in the number of days in a month.

There are two main approaches used to analyze time series: (1) in the time domain or (2) in the frequency domain.  Many techniques are available to analyze data within each domain.  Analysis in the time domain is most often used for stochastic observations.  One common technique is the Box-Jenkins ARIMA method, which can be used for univariate (a single data set) or multivariate (comparing two or more data sets) analyses.  The ARIMA technique uses moving averages, detrending, and regression methods to detect and remove autocorrelation in the data.  Below, I will demonstrate a Box-Jenkins ARIMA time domain analysis of a single data set.

Analysis in the frequency domain is often used for periodic and cyclical observations.  Common techniques are spectral analysis, harmonic analysis, and periodogram analysis.  A widely used algorithm for these analyses is the Fast Fourier Transform (FFT).  Mathematically, frequency domain techniques use fewer computations than time domain techniques, so for complex data, analysis in the frequency domain is most common.  However, frequency analysis is more difficult to understand, so time domain analysis is generally used outside of the sciences.

Time series analysis using ARIMA methods

The ARIMA (auto-regressive, integrated, moving average) method is an iterative, exploratory process intended to find the best fit for your time series observations.  It uses three steps (identification, estimation, and diagnostic checking) to build an adequate model for a time series.  The auto-regressive component (AR) in ARIMA is designated as p, the integrated component (I) as d, and the moving average component (MA) as q.  The AR component represents the lingering effects of previous observations.  The I component represents trends, including seasonality.  The MA component represents lingering effects of previous random shocks (or error).  To fit an ARIMA model to a time series, the order of each model component must be selected.  Usually a small integer value (0, 1, or 2) is found for each component.  The goal is to find the most parsimonious model, that is, the one with the smallest number of estimated parameters needed to adequately model the patterns in the observed data.
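To make the difference between the AR and MA components more concrete, the Python sketch below follows a single unit shock through a first-order AR process and a first-order MA process.  The function names and the coefficient value 0.5 are hypothetical choices for illustration; they are not estimates from the precipitation data.

```python
def impulse_response_ar1(phi, steps):
    """AR(1): y[t] = phi * y[t-1] + e[t], with one unit shock at t = 0."""
    response = []
    previous = 0.0
    for t in range(steps):
        shock = 1.0 if t == 0 else 0.0
        previous = phi * previous + shock
        response.append(previous)
    return response


def impulse_response_ma1(theta, steps):
    """MA(1): y[t] = e[t] + theta * e[t-1], with one unit shock at t = 0."""
    shocks = [1.0] + [0.0] * (steps - 1)
    return [
        shocks[t] + (theta * shocks[t - 1] if t > 0 else 0.0)
        for t in range(steps)
    ]


print(impulse_response_ar1(0.5, 4))  # [1.0, 0.5, 0.25, 0.125]
print(impulse_response_ma1(0.5, 4))  # [1.0, 0.5, 0.0, 0.0]
```

In the AR case the shock keeps echoing through later observations and fades gradually, while in the MA case its influence stops completely after one time step.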

In order to demonstrate time series analysis, I introduce a data set of monthly precipitation totals from Portola, CA in the Sierra Nevada in Table 1.  When a time series has strong seasonality, as my data set does, a slightly different type of ARIMA (p,d,q) process is used, which is often called SARIMA (p,d,q)*(P,D,Q), where S stands for seasonal.  In this model, not only are there possible AR, I, and MA terms for the data, there is a second set of AR, I, and MA terms that take into account the seasonality of the data.

Time series data are correlated, which means that measurements are related to one another and change together to some degree.  Thus, each observation is partially predictable from previous observations, or from previous random shocks, or from both.  An assumption made after analysis is that the correlations inherent in the data set have been adequately modeled.  Thus after a model has been built, any leftover variations are considered to be independent and normally distributed with mean zero and constant variance over time.  These leftover variations are used to interpret the data.

Regardless of which technique is used, the first step in any time series analysis is to plot the observed values against time.  A number of qualitative aspects are noticeable as you visually inspect the graph.  In Figure 1, we see that there is a 12-month pattern of seasonality, no evidence of a linear trend, and variation from the mean that appears to be approximately equal across time.

 Monthly precipitation data from NOAA weather station in Portola, Ca., from January 1999 through April 2004

 Figure 1.  Precipitation occurs cyclically.  December falls on number 12, 24, 36, 48, 60, and 72.  Mean = 1.66 inches/month, standard deviation = 2.09, n = 76.

Is there a trend in this data set?  The simplest linear equation would be y = b, a flat line with no slope, in which all variation around b comes from the random shock, or error, of the data set.  The linear equation fitted to my data set is y = -0.0018x + 1.6688.  With a slope of only -0.0018, the fitted line is essentially flat and there is no evidence of a linear trend.  This data set needs no further work to eliminate a linear or quadratic trend.
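For readers who want to reproduce this kind of trend check themselves, here is a minimal ordinary least squares sketch in Python.  It assumes the observations are numbered 1, 2, 3, and so on; the function name linear_trend and the sample numbers are illustrative only, and the Portola data are not reproduced here.

```python
def linear_trend(values):
    """Fit y = slope * x + intercept by least squares, with x = 1, 2, ..., n."""
    n = len(values)
    xs = range(1, n + 1)
    x_mean = (n + 1) / 2
    y_mean = sum(values) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Made-up series that rises by 0.5 per time step.
slope, intercept = linear_trend([3.5, 4.0, 4.5, 5.0, 5.5])
print(round(slope, 4), round(intercept, 4))  # 0.5 3.0
```

A slope close to zero relative to the spread of the data, as in the precipitation series, suggests that no detrending is needed.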

If removal of the trend (detrending) were needed, I would proceed to differencing.  Ordinary least squares analysis is another method used to detect and remove trends.  Differencing has the advantages of ease of use and simplicity, but also has disadvantages, including over-correcting for trends, which skews the correlations in a negative direction.  There are other problems with differencing that are covered in textbooks.

Differencing means calculating the difference between pairs of observations at some time interval.  A difference of one time interval apart is calculated by subtracting value #1 from value #2, then #2 from #3, and so on, and plotting the resulting series to determine whether it has a mean of 0 and a constant variance.  If a difference of one does not detrend the data, calculate a difference of 2 by differencing the differences: subtract difference #1 from difference #2, then #2 from #3, and so on.  If the variance is still not constant, a log transformation can help; it is usually applied to the original observations before differencing, because the differences themselves can be negative.
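The differencing procedure described above can be sketched in a few lines of Python.  The function name difference and its order argument are my own illustrative choices, and the numbers are made up.

```python
def difference(values, order=1):
    """Difference a series `order` times (each pass subtracts the previous value)."""
    result = list(values)
    for _ in range(order):
        result = [later - earlier for earlier, later in zip(result, result[1:])]
    return result


series = [1, 4, 9, 16, 25]           # made-up series with a quadratic trend
print(difference(series))            # [3, 5, 7, 9]
print(difference(series, order=2))   # [2, 2, 2]
```

In this example a difference of one leaves a linear trend, while a difference of two leaves a constant series.  Each pass of differencing shortens the series by one observation.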

Seasonal autocorrelation is different from a linear or quadratic data trend in that it is predictably spaced in time.  Our precipitation data can be expected to have a 12-month seasonal pattern, whereas daily observations might have a 7-day pattern, and hourly observations often have a 24-hour pattern.

 Equation 1

In order to detect seasonality, plot the autocorrelation function (ACF), which is calculated from the residuals (observed value minus the mean for each data point).  The graph of the autocorrelations against the lag, a specified time interval, is called a lagged autocorrelation function or a correlogram.  The null hypothesis for the ACF is that the time series observations are not correlated with one another, i.e., that any pattern in the data is from random shocks only.  The residuals can be calculated using equation 1.

In time series analysis a lag is defined as follows: an event occurring at time t + k (k > 0) is said to lag behind an event occurring at time t, the extent of the lag being k.  In 1970, Box and Jenkins wrote, “…to obtain a useful estimate of the autocorrelation function, we would need at least 50 observations and the estimated autocorrelations would be calculated for k = 0, 1, …, K, where K was not larger than N/4”.  For my data set of 76 observations, I specified 19 autocorrelation lags (76/4 = 19).

A rule of thumb for an ACF is that plotted autocorrelations lying more than 2 standard errors away from zero indicate statistically significant autocorrelation.  In Figure 2, the autocorrelations at lag 6 and lag 12 lie more than 2 standard errors (that is, beyond the approximate 95% confidence limits) from zero.  I interpret this as a 6-month seasonal pattern that cycles between summer, when there is little to no precipitation, and winter, when precipitation is at its peak.  So, even though the linear equation reveals no trend, graphing the ACF reveals seasonality.
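The sample ACF and the 2-standard-error rule of thumb can be sketched in Python as shown here.  This is an illustration under stated assumptions, not the calculation JMP performs: it uses the common sample autocorrelation formula, approximates the standard error as 1/sqrt(N) (the value expected if the series were pure random shocks), and defaults to N/4 lags.  The function names and the synthetic series are hypothetical.

```python
import math


def acf(series, max_lag):
    """Sample autocorrelations r[0..max_lag] of deviations from the mean."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


def significant_lags(series, max_lag=None):
    """Lags whose autocorrelation lies beyond roughly 2 standard errors of zero."""
    n = len(series)
    if max_lag is None:
        max_lag = n // 4              # Box and Jenkins: no more than N/4 lags
    bound = 2 / math.sqrt(n)          # assumed white-noise standard error
    r = acf(series, max_lag)
    return [k for k in range(1, max_lag + 1) if abs(r[k]) > bound]


# Synthetic monthly series with a 12-step cycle (not the Portola data).
synthetic = [math.cos(2 * math.pi * t / 12) for t in range(72)]
print(significant_lags(synthetic))
```

For a strictly periodic series like this one, lag 6 shows a strong negative autocorrelation and lag 12 a strong positive one, which is the same kind of pattern discussed for Figure 2.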

I used the JMP software program from SAS to analyze my data set.  Though I will not cover how to perform a time series analysis in the spectral domain, I did use the spectral density graph to verify that the biggest seasonal pattern occurs at 12-month intervals, not at 6-month intervals.  In Figure 3, notice the large spike at period 12.

 Lagged autocorrelation function of Portola, Ca precipitation data.

 Figure 2.  Visual inspection shows significant deviations from zero correlation at lag 1, 6, and 12, and very close at lag 7 and 13.  Interpretation suggests that there are two seasonal (rainy season and dry season) patterns spaced about 6 months apart.  Number of autocorrelation lags equals 19.

 Spectral Density as a function of period

 Figure 3.  A strong signal appears at about period 12, corresponding to a yearly cycle.
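A spectral check of this kind can be approximated with a plain periodogram.  The Python sketch below computes one with a direct discrete Fourier transform and reports the period with the most power.  It is only a sketch: JMP’s spectral density plot may be scaled or smoothed differently, the function names are hypothetical, and the series is synthetic rather than the Portola data.

```python
import cmath
import math


def periodogram(series):
    """Return (period, power) pairs for Fourier frequencies k/n, k = 1..n//2."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    result = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            d * cmath.exp(-2j * math.pi * k * t / n) for t, d in enumerate(dev)
        )
        result.append((n / k, abs(coeff) ** 2 / n))
    return result


def dominant_period(series):
    """Period (in time steps) at which the periodogram has its largest value."""
    return max(periodogram(series), key=lambda pair: pair[1])[0]


# Synthetic monthly series with a yearly cycle.
synthetic = [2 + math.cos(2 * math.pi * t / 12) for t in range(72)]
print(dominant_period(synthetic))  # 12.0
```

A direct transform like this is slow for long series; the FFT gives the same result with far fewer computations.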

The partial autocorrelation function (PACF) is also used to detect trends and seasonality.  Figure 4 is the PACF of the precipitation data.  In general, the PACF is the amount of correlation between a variable and its lag that is not explained by correlations at all lower-order lags.  The equation to obtain partial autocorrelations is very complex, and is best explained in time series textbooks.

 Lagged partial autocorrelation function of Portola, Ca precipitation data.

 Figure 4.  Significant deviation from zero is evident at lags 1, 6, and 12, suggesting the same 6-month seasonal pattern.
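Although the PACF formula looks intimidating, partial autocorrelations can be obtained from ordinary autocorrelations with the Durbin-Levinson recursion.  The Python sketch below shows that recursion; the function name pacf_from_acf is hypothetical, its input is assumed to be a list of autocorrelations starting at lag 0, and the example uses the theoretical ACF of an AR(1) process rather than the precipitation data.

```python
def pacf_from_acf(r):
    """Partial autocorrelations for lags 1..K from autocorrelations r[0..K]."""
    max_lag = len(r) - 1
    partials = []
    phi_prev = []                      # coefficients of the previous order
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = r[1]
            phi_curr = [phi_kk]
        else:
            num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
            den = 1 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
            phi_kk = num / den
            phi_curr = [
                phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)
            ] + [phi_kk]
        partials.append(phi_kk)
        phi_prev = phi_curr
    return partials


# Theoretical ACF of an AR(1) process with coefficient 0.5: r[k] = 0.5 ** k.
ar1_acf = [0.5 ** k for k in range(5)]
print(pacf_from_acf(ar1_acf))
```

For this AR(1) example the partial autocorrelation is 0.5 at lag 1 and zero at every higher lag, because correlation at longer lags is fully explained by the lag-1 relationship.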

Now that our observations against time, as well as the ACF, and PACF have been graphed, we can begin to match our patterns to idealized ARIMA models.  The easy way to analyze a time series data set is to simply input numerous variations of ARIMA.  There are also systematic steps that you can take that will help suggest the best values for the AR, I, and MA terms.

Here I present a few general rules to apply when working to identify the best-fit ARIMA model.  These rules come from the Duke University website http://www.duke.edu/~rnau/411home.htm, which, along with the textbooks and other websites listed below, was instrumental in helping me understand time series analysis, and specifically the nuances of seasonally affected time series.

After adjusting the data by a seasonal difference of 1 using JMP, a visual inspection shows that the ACF decays more slowly than the PACF (Figure 5).  I used Duke’s Rule #3 (the optimal order of differencing is often the order of differencing at which the standard deviation is lowest) to help me determine that my data needed no differencing for trend but did need to be differenced for seasonality (both options are available in JMP).  A seasonal difference of 1 yields a standard deviation of 1.89, the lowest value of the iterations that I tried.

 ACF and PACF after seasonal differencing of 1.

 Figure 5.  All ACF and PACF lags fall below significant levels, indicating that autocorrelation has been eliminated.
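The comparison behind the lowest-standard-deviation rule can be sketched in Python as follows.  This assumes monthly data with a 12-step season; the function names and the synthetic series are illustrative, and the numbers produced are unrelated to the value of 1.89 reported for the precipitation data.

```python
import math
import statistics


def seasonal_difference(values, period=12):
    """Subtract from each observation the one a full season earlier."""
    return [values[t] - values[t - period] for t in range(period, len(values))]


def compare_differencing(values, period=12):
    """Standard deviation of the series under three differencing choices."""
    candidates = {
        "none": list(values),
        "nonseasonal": [b - a for a, b in zip(values, values[1:])],
        "seasonal": seasonal_difference(values, period),
    }
    return {name: statistics.stdev(v) for name, v in candidates.items()}


# Synthetic series: a yearly cycle plus a small repeating disturbance.
synthetic = [
    3 * math.sin(2 * math.pi * t / 12) + 0.1 * ((7 * t) % 5) for t in range(72)
]
result = compare_differencing(synthetic)
print(min(result, key=result.get))  # seasonal
```

Seasonal differencing removes the first full season from the series, so 12 observations are lost with monthly data.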

Using the iterative approach of checking model values via JMP, I found that the lowest values of Akaike’s ‘A’ Information Criterion (AIC), Schwarz’s Bayesian Criterion (SBC), and the -2LogLikelihood for my data set are obtained with an ARIMA (0,0,0)(1,1,1).  According to Duke’s Rule 8, it is possible for an AR term and an MA term to cancel each other out.  The rule suggests trying a model with one fewer AR term and one fewer MA term, particularly if it takes more than 10 iterations for the model to converge.  My model took 6 iterations to converge.

Duke’s Rule 12 states that if a series has a strong and consistent seasonal pattern, never use more than one order of seasonal differencing or more than 2 orders of total differencing (seasonal + nonseasonal).  Rule 13 states that if the autocorrelation at the seasonal period is positive, consider adding an SAR term, and if negative try adding an SMA term to the model.  Do not mix SAR and SMA terms in the same model.

Duke’s rules for seasonality suggest that I not accept a mixed model as the best-fit model for my data.  I eliminated the AR and MA terms, but that model yielded a higher value of AIC, Schwarz’s Bayesian Criterion, and a much higher value of the -2LogLikelihood.  I also successively eliminated the AR or the MA term while leaving the other term in, but still got higher values for all test parameters.  Based on the parameter values, I believe that the ARIMA (0,0,0)(1,1,1) is the best model for my data.

Parameter estimates of the most likely SARIMA models

 Model                                 DF   Variance    AIC         SBC         RSquare   -2LogLH
 Seasonal ARIMA(0, 0, 0)(1, 1, 0)12    62   3.5908132   83.784319   88.102085   -0.11     80.1373
 Seasonal ARIMA(0, 0, 0)(0, 1, 1)12    62   3.5125921   82.374756   86.692522   -0.09     79.272251
 Seasonal ARIMA(0, 0, 0)(0, 1, 0)12    63   3.6544726   83.93302    86.091903   -0.14     348.10154
 Seasonal ARIMA(0, 0, 0)(1, 1, 1)12    61   2.8333581   69.581017   76.057666   -0.04     75.26258

 Table 2.  Model #4, SARIMA (0,0,0)(1,1,1), has the lowest variance, AIC, SBC, and -2LogLH, and the highest (least negative) RSquare.  About 20 models were tested; these four had the lowest scores.
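When many candidate models are tested, the comparison can be automated.  The short Python sketch below copies the AIC and SBC values from Table 2 and picks the model with the lowest value of each criterion; the variable and function names are my own, and only the numbers come from the table.

```python
# (model, AIC, SBC) copied from Table 2.
TABLE_2 = [
    ("Seasonal ARIMA(0, 0, 0)(1, 1, 0)12", 83.784319, 88.102085),
    ("Seasonal ARIMA(0, 0, 0)(0, 1, 1)12", 82.374756, 86.692522),
    ("Seasonal ARIMA(0, 0, 0)(0, 1, 0)12", 83.93302, 86.091903),
    ("Seasonal ARIMA(0, 0, 0)(1, 1, 1)12", 69.581017, 76.057666),
]


def best_by(rows, column):
    """Name of the model with the smallest value in the given column."""
    return min(rows, key=lambda row: row[column])[0]


print(best_by(TABLE_2, 1))  # lowest AIC
print(best_by(TABLE_2, 2))  # lowest SBC
```

Both criteria select the (1, 1, 1) seasonal model, which agrees with the choice made above.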

I have demonstrated best-fitting an ARIMA model to a time series using description and explanation phases of time series analysis.  If I were to continue with this exercise, I could use this model to predict precipitation for the next year or two.  Most software programs are capable of extrapolating values based on previous patterns in the data set.  This topic is covered in textbooks.

There are numerous books, websites, and software programs available for working with time series.  I found that most of the books that were solely dedicated to time series were quite dense with formulas, thus difficult to understand.  Some websites were somewhat easier to understand but only a couple offered a step-by-step process to guide you through an analysis.  I used just one software program, JMP, and used the help guide extensively.  The help guide was useful in understanding the generated graphs, but offered definitions without elaboration as to how to interpret the defined data.  If you are going to analyze a time series, I suggest using multiple resources, and especially if you are new to time series analysis (like I am), find a knowledgeable person who can help you with interpretation of your results.

Books:

If the CD-ROM is available, this text will walk you through many analyses.

Brockwell, P.J. and Davis, R.A. 2002, 2nd ed.  Introduction to time series and forecasting. Springer, New York.

These guys wrote the book on ARIMA processes.

Box, G.E.P., Jenkins, G.M., and Reinsel, G.C. 1994, 3rd ed.  Time series analysis: Forecasting and control. Prentice Hall, Englewood Cliffs, NJ.

This book is pretty understandable, though still lots of formulas.

Chatfield, C. 2004, 6th ed.  The analysis of time series – an introduction. Chapman and Hall, London, UK.

An excellent discussion of problems and solutions to ARIMA techniques.

Glass, G.V., Willson, V.L., and Gottman, J.M. 1975.  Design and analysis of time-series experiments. Colorado Associated University Press, Boulder, Colorado.

Klein, J.L. 1997. Statistical visions in time: a history of time series analysis, 1662-1938.  Cambridge University Press, New York.

The time series chapter is understandable and easily followed.

Tabachnick, B.G., and Fidell, L.S. 2001, 4th ed. Using multivariate statistics. Allyn and Bacon, Needham Heights, MA.

Websites:

This is the best website that I found in my web searches.  It is a step-by-step guide to understanding many aspects of time series, including a series of ‘rules’ to use when analyzing your data.

http://www.duke.edu/~rnau/411home.htm

An introduction to time series analysis from an engineering point of view, with two worked examples.  Very helpful.

http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm

Extensive website with LOTS of useful information once you get through the business talk.  Has applets for determining stationarity, seasonality, mean, variance, etc.

Useful for definitions, would be great if they had examples of actual analyses.

http://www.statsoftinc.com/textbook/stathome.html

Step-by-step explanation of time series analysis, including examples of how to use Excel to adjust for seasonality and analyzing the data by using linear regression, all in the Crunching section.

http://www.bized.ac.uk/timeweb/index.htm

Type in time series in product search to see available books that are short but sweet.

http://www.sagepub.com/Home.aspx

Website for my precipitation data.

http://www.wrh.noaa.gov/cnrfc/monthly_precip.php

Website for the software package that I used in this presentation.

http://www.jmp.com/

Extensive and easy to use statistical software package.

http://www.spss.com/

Free software for analyzing time series data sets, but you need to code.

http://www.r-project.org/

Free statistics and forecasting software (didn’t try out, so can’t say how good)

http://www.wessa.net/