SURVIVAL/FAILURE ANALYSIS

Rafael Hidalgo Gonzalez

HISTORY

Peter L. Bernstein, in his book ‘Against the Gods: The Remarkable Story of Risk’, narrates how a small book published in London, titled Natural and Political Observations Made upon the Bills of Mortality, made history. The book contained a compilation of births and deaths in London from 1604 to 1661. The author, John Graunt, was 42 years old when he wrote it; he was a businessman with many other interests. William Petty, a professor and Graunt’s friend, helped him with some of the population analysis in the book.

“Graunt was hardly aware that he was the innovator of sampling theory. In fact, he worked with the complete set of bills of mortality rather than with a sample. But he reasoned systematically about raw data in ways that no one had ever tried before. The manner in which he analyzed the data laid the foundation for the science of statistics. The word “statistics” is derived from the analysis of quantitative facts about the state. Graunt and Petty may be considered the co-fathers of this important field of study.” (Bernstein, 1998)

Graunt put together the first recorded longitudinal study of event occurrence, some form of a life table:

Age    Died    Survived
0      -       100
6      36      64
16     24      40
26     15      25
36     9       16
46     6       10
56     4       6
66     3       3
76     2       1
86     1       0

Graunt’s accomplishment was to analyze the mortality statistics of London; he concluded correctly that more male than female babies were born and that women lived longer than men. He also created the first life table, assessing how many out of every 100 babies born in London survived until ages 6, 16, 26, and so on. Unfortunately, the table did not give a realistic representation of the true survival rate because the figures for ages after 6 were all guesses (adapted from Judith D. Singer & John B. Willett).

Thirty years after Graunt published his book came another important figure: Edmund Halley, the great scientist and astronomer (of Halley’s Comet). He analyzed data compiled for the years 1687 to 1691 from a small town in Germany, applying probability and risk management. He also put together a life table that could be used to reckon the price of insuring lives at different ages, and he used mathematics to evaluate annuities (Bernstein).

WHAT IS SURVIVAL ANALYSIS?

Survival analysis is used when we wish to study the occurrence of some event in a population of subjects and the time until the event is of interest. This time is called survival time or failure time. Survival analysis is often used in many fields of study. Examples include: time until failure of a light bulb and time until occurrence of an anomaly in an electronic circuit in industrial reliability, time until relapse of cancer and time until pregnancy in medical statistics, duration of strikes in economics, length of tracks on a photographic plate in particle physics.

Data that measure lifetime or the length of time until the occurrence of an event are called lifetime, failure time, or survival data. For example, variables of interest might be the lifetime of diesel engines, the length of time a person stayed on a job, or the survival time for heart transplant patients. Such data have special considerations that must be incorporated into any analysis.

Survival data consist of a response variable that measures the duration of time until a specified event occurs (event time, failure time, or survival time) and possibly a set of independent variables thought to be associated with the failure time variable. These independent variables (concomitant variables, covariates, or prognostic factors) can be either discrete, such as sex or race, or continuous, such as age or temperature. The system that gives rise to the event of interest can be biological, as for most medical data, or physical, as for engineering data. The purpose of survival analysis is to model the underlying distribution of the failure time variable and to assess the dependence of the failure time variable on the independent variables.

“Survival data arise when the aim is to study the time elapsed from some particular starting point to the occurrence of an event” (Marubini and Valsecchi).

A particular characteristic of survival data is the possibility for censoring of observations, that is, the actual time until the event is not observed. Such censoring can arise from withdrawal of a subject from the experiment or termination of the experiment. Because the response is usually a duration, some of the possible events may not yet have occurred when the period for data collection has terminated. For example, clinical trials are conducted over a finite period of time with staggered entry of patients. That is, patients enter a clinical trial over time and thus the length of follow-up varies by individuals; consequently, the time to the event may not be ascertained on all patients in the study. Additionally, some of the responses may be lost to follow-up (for example, a participant may move or refuse to continue to participate) before termination of data collection. In either case, only a lower bound on the failure time of the censored observations is known. These observations are said to be right censored. Thus, an additional variable is incorporated into the analysis indicating which responses are observed event times and which are censored times. More generally, the failure time may only be known to be smaller than a given value (left censored) or known to be within a given interval (interval censored). There are numerous possible censoring schemes that arise in survival analyses. Data with censored observations cannot be analyzed by ignoring the censored observations because, among other considerations, the longer-lived individuals are generally more likely to be censored. The method of analysis must take the censoring into account and correctly use the censored observations as well as the uncensored observations.
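The point that censored observations cannot simply be ignored can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the simulated exponential lifetimes, the fixed 2-year follow-up, and the variable names. It stores each observation as a pair (observed time, event indicator) and compares the mean lifetime computed after dropping censored cases with an estimate that uses them.

```python
import random

random.seed(1)

# Hypothetical setup: true lifetimes are exponential with mean 2.0 years,
# and every subject is followed for at most 2.0 years.
TRUE_MEAN = 2.0
STUDY_END = 2.0

records = []  # each record is (observed_time, event_indicator)
for _ in range(10_000):
    lifetime = random.expovariate(1.0 / TRUE_MEAN)
    if lifetime <= STUDY_END:
        records.append((lifetime, 1))   # event observed
    else:
        records.append((STUDY_END, 0))  # right censored: only a lower bound is known

# Naive analysis: throw away the censored (longer-lived) subjects.
uncensored = [t for t, event in records if event == 1]
naive_mean = sum(uncensored) / len(uncensored)

# For exponential lifetimes, a censoring-aware estimate of the mean is the
# total time at risk divided by the number of observed events.
total_time = sum(t for t, _ in records)
events = sum(event for _, event in records)
adjusted_mean = total_time / events

print(f"naive mean (censored cases dropped): {naive_mean:.2f}")
print(f"censoring-aware mean:                {adjusted_mean:.2f}")
```

Because the longer-lived subjects are exactly the ones that get censored, the naive mean falls well below the true mean of 2.0, while the censoring-aware estimate stays close to it.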

Another characteristic of survival data is that the response cannot be negative. This suggests that a transformation of the survival time, such as a log transformation, may be necessary, or that specialized methods may be more appropriate than those that assume a normal distribution for the error term. It is especially important to check any underlying assumptions as part of the analysis because some of the models used are very sensitive to these assumptions.

Statistical Problems

Given a complete sample of size n, there are three general questions to be answered:

  1. Estimate the distribution of the population. Two methods are available: the non-parametric approach and the parametric approach.

  2. Compare the failure time distributions of two or more groups. Once again there are two different approaches.

     Parametric approach: Assume the sample comes from a particular family of distributions, e.g., the normal distribution. For the normal distribution we need two parameters to specify the distribution: µ and σ. For the two-sample problem we use the t-statistic; for the k-sample problem we use ANOVA.

     Non-parametric approach: For the two-sample problem we use a permutation test; for the k-sample problem we use a rank test.

  3. If explanatory variables (covariates) are present, assess the joint dependence of the failure time distribution on these covariates. This is a regression problem:

y = f(x; θ)

Type I censoring: Consider an experiment that will be stopped at a pre-specified time X. Items that survive beyond X are not observed to fail, so only failure times smaller than X are observed; the observations greater than X are right censored. The number of failures observed is a random variable. This is also called time censoring.

Type II censoring: One of the questions in Type I censoring is how to determine the censoring time X. If X is large, the expense of the experiment is large; if X is small, it might turn out that only a small portion of the sample is observed to fail. To avoid this situation, one might terminate the experiment after the first ‘z’ failures have been observed. This is called Type II censoring or item censoring.

Random censoring: Sometimes the censoring time is not a fixed number but a random variable. For example, in a medical study some patients might drop out of the study for various reasons. Once a patient has dropped out, the observation is censored, so the censoring time is a random variable. The analysis is very difficult if the lifetime variable and the censoring variable are dependent. Usually we assume that the distribution of the censoring variable does not depend on the parameters of the lifetime distribution.
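The difference between Type I and Type II censoring can be sketched with a small simulation. The lifetimes, the stopping time of 80 hours, the choice of 10 failures, and the function names `type_one` and `type_two` are all hypothetical choices for illustration (the stopping time plays the role of X and the failure count the role of ‘z’ above).

```python
import random

random.seed(7)

# Hypothetical experiment: 20 items with exponential lifetimes (mean 100 hours).
lifetimes = [random.expovariate(1 / 100) for _ in range(20)]

def type_one(times, stop_time):
    """Time censoring: stop at a fixed time; the number of failures is random."""
    return [(min(t, stop_time), int(t <= stop_time)) for t in times]

def type_two(times, n_failures):
    """Item censoring: stop at the n-th failure; the stopping time is random."""
    ordered = sorted(times)
    stop_time = ordered[n_failures - 1]
    return [(min(t, stop_time), int(t <= stop_time)) for t in ordered]

t1 = type_one(lifetimes, stop_time=80.0)
t2 = type_two(lifetimes, n_failures=10)

print("Type I : failures observed =", sum(e for _, e in t1))
print("Type II: failures observed =", sum(e for _, e in t2),
      "| experiment length =", round(max(t for t, _ in t2), 1))
```

Under Type I censoring the experiment length is fixed and the failure count varies from run to run; under Type II censoring the failure count is fixed and the experiment length varies.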

CONCEPTS AND DEFINITIONS

(Adapted from statsoft textbooks)

Life Table Analysis

The most straightforward way to describe the survival in a sample is to compute the Life Table. The life table technique is one of the oldest methods for analyzing survival (failure time) data (e.g., see Berkson & Gage, 1950; Cutler & Ederer, 1958; Gehan, 1969). This table can be thought of as an "enhanced" frequency distribution table. The distribution of survival times is divided into a certain number of intervals. For each interval we can then compute the number and proportion of cases or objects that entered the respective interval "alive," the number and proportion of cases that failed in the respective interval (i.e., number of terminal events, or number of cases that "died"), and the number of cases that were lost or censored in the respective interval.

Based on those numbers and proportions, several additional statistics can be computed:

Number of Cases at Risk. This is the number of cases that entered the respective interval alive, minus half of the number of cases lost or censored in the respective interval.

Proportion Failing. This proportion is computed as the ratio of the number of cases failing in the respective interval, divided by the number of cases at risk in the interval.

Proportion Surviving. This proportion is computed as 1 minus the proportion failing.

Cumulative Proportion Surviving (Survival Function). This is the cumulative proportion of cases surviving up to the respective interval. Since the probabilities of survival are assumed to be independent across the intervals, this probability is computed by multiplying out the probabilities of survival across all previous intervals. The resulting function is also called the survivorship or survival function.
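The life table quantities defined so far can be computed in a few lines. This is a minimal sketch with made-up interval counts; the variable names are assumptions, and the cumulative proportion is reported here as survival to the end of each interval.

```python
# Hypothetical interval counts: (failed, censored) in each successive interval.
intervals = [(10, 4), (8, 6), (5, 2), (3, 10)]
entered = 100  # cases alive at the start of the first interval

cumulative_survival = 1.0
table = []
for failed, censored in intervals:
    at_risk = entered - censored / 2          # number of cases at risk
    prop_failing = failed / at_risk           # proportion failing
    prop_surviving = 1 - prop_failing         # proportion surviving
    cumulative_survival *= prop_surviving     # survival to the end of this interval
    table.append((entered, at_risk, prop_failing, prop_surviving, cumulative_survival))
    entered -= failed + censored              # cases entering the next interval

for row in table:
    print("entered=%3d  at_risk=%5.1f  q=%.3f  p=%.3f  S=%.3f" % row)
```

Subtracting half of the censored cases treats them as being at risk, on average, for half of the interval.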

Probability Density. This is the estimated probability of failure in the respective interval, computed per unit of time.

Hazard Rate. The hazard rate (the term was first used by Barlow, 1963) is defined as the probability per time unit that a case that has survived to the beginning of the respective interval will fail in that interval. Specifically, it is computed as the number of failures per time unit in the respective interval, divided by the average number of surviving cases at the mid-point of the interval.

Median Survival Time. This is the survival time at which the cumulative survival function is equal to 0.5. Other percentiles (25th and 75th percentile) of the cumulative survival function can be computed accordingly. Note that the 50th percentile (median) for the cumulative survival function is usually not the same as the point in time up to which 50% of the sample survived. (This would only be the case if there were no censored observations prior to this time).

Required Sample Sizes. In order to arrive at reliable estimates of the three major functions (survival, probability density, and hazard) and their standard errors at each time interval, the minimum recommended sample size is 30.

In summary, the life table gives us a good indication of the distribution of failures over time. However, for predictive purposes it is often desirable to understand the shape of the underlying survival function in the population. The major distributions that have been proposed for modeling survival or failure times are the exponential (and linear exponential) distribution, the Weibull distribution of extreme events, and the Gompertz distribution.

Estimation. The parameter estimation procedure (for estimating the parameters of the theoretical survival functions) is essentially a least squares linear regression algorithm (see Gehan & Siddiqui, 1973). A linear regression algorithm can be used because all four theoretical distributions can be "made linear" by appropriate transformations. Such transformations sometimes produce different variances for the residuals at different times, leading to biased estimates.
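To illustrate what "made linear" means, here is a minimal sketch for the Weibull case only. It uses ordinary (unweighted) least squares, which is a simplification and not necessarily the algorithm of Gehan & Siddiqui; the survival values are generated from a known Weibull curve purely so that the recovered parameters can be checked.

```python
import math

# Hypothetical life-table output: interval end times and the cumulative
# proportion surviving at each (generated from a Weibull with shape 1.5
# and scale 10, an assumption made for this illustration).
times = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
survival = [math.exp(-((t / 10.0) ** 1.5)) for t in times]

# Weibull: S(t) = exp(-(t / scale)**shape), so
# log(-log S(t)) = shape * log(t) - shape * log(scale), a straight line in log(t).
xs = [math.log(t) for t in times]
ys = [math.log(-math.log(s)) for s in survival]

# Ordinary least squares for the slope and intercept of that line.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

shape = slope
scale = math.exp(-intercept / slope)
print(f"shape = {shape:.3f}, scale = {scale:.3f}")
```

With real life-table estimates the points scatter around the line, and the unequal residual variances mentioned above are the reason weighted fits are often preferred.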

Goodness-of-Fit. Given the parameters for the different distribution functions and the respective model, we can compute the likelihood of the data. One can also compute the likelihood of the data under the null model, that is, a model that allows for different hazard rates in each interval. Without going into details, these two likelihoods can be compared via an incremental Chi-square test statistic. If this Chi-square is statistically significant, then we conclude that the respective theoretical distribution fits the data significantly worse than the null model; that is, we reject the respective distribution as a model for our data.

Plots. You can produce plots of the survival function, hazard, and probability density for the observed data and the respective theoretical distributions. These plots provide a quick visual check of the goodness-of-fit of the theoretical distribution.

One can compare the survival or failure times in two or more samples. In principle, because survival times are not normally distributed, nonparametric tests that are based on the rank ordering of survival times should be applied. A wide range of nonparametric tests can be used to compare survival times; however, most of the standard tests cannot handle censored observations.

Available Tests. The following five different (mostly nonparametric) tests for censored data are available: Gehan's generalized Wilcoxon test, the Cox-Mantel test, Cox's F test, the log-rank test, and Peto and Peto's generalized Wilcoxon test. A nonparametric test for the comparison of multiple groups is also available. Most of these tests are accompanied by appropriate z-values (values of the standard normal distribution); these z-values can be used to test for the statistical significance of any differences between groups. However, note that most of these tests will only yield reliable results with fairly large sample sizes; the small-sample "behavior" is less well understood.

Choosing a Two Sample Test. There are no widely accepted guidelines concerning which test to use in a particular situation. Cox's F test tends to be more powerful than Gehan's generalized Wilcoxon test when:

  1. Sample sizes are small (i.e., n per group less than 50);
  2. If samples are from an exponential or Weibull distribution;
  3. If there are no censored observations (see Gehan & Thomas, 1969).

Lee, Desu, and Gehan (1975) compared Gehan's test to several alternatives and showed that the Cox-Mantel test and the log-rank test are more powerful (regardless of censoring) when the samples are drawn from a population that follows an exponential or Weibull distribution; under those conditions there is little difference between the Cox-Mantel test and the log-rank test. Lee (1980) discusses the power of different tests in greater detail.

Multiple Sample Test. There is a multiple-sample test that is an extension (or generalization) of Gehan's generalized Wilcoxon test, Peto and Peto's generalized Wilcoxon test, and the log-rank test. First, a score is assigned to each survival time using Mantel's procedure (Mantel, 1967); next a Chi- square value is computed based on the sums (for each group) of this score. If only two groups are specified, then this test is equivalent to Gehan's generalized Wilcoxon test, and the computations will default to that test in this case.

Unequal Proportions of Censored Data. When comparing two or more groups it is very important to examine the number of censored observations in each group. Particularly in medical research, censoring can be the result of, for example, the application of different treatments: patients who get better faster or get worse as the result of a treatment may be more likely to drop out of the study, resulting in different numbers of censored observations in each group. Such systematic censoring may greatly bias the results of comparisons.


Cox's Proportional Hazard Model

The proportional hazard model is the most general of the regression models because it is not based on any assumptions concerning the nature or shape of the underlying survival distribution. The model assumes that the underlying hazard rate (rather than survival time) is a function of the independent variables (covariates); no assumptions are made about the nature or shape of the hazard function. Thus, in a sense, Cox's regression model may be considered to be a nonparametric method.

Assumptions. While no assumptions are made about the shape of the underlying hazard function, the model, which writes the hazard as the baseline hazard multiplied by the exponential of a linear function of the covariates, does imply two assumptions. First, it specifies a multiplicative relationship between the underlying hazard function and the log-linear function of the covariates. This assumption is also called the proportionality assumption. In practical terms, it is assumed that, given two observations with different values for the independent variables, the ratio of the hazard functions for those two observations does not depend on time. The second assumption, of course, is that there is a log-linear relationship between the independent variables and the underlying hazard function.

Cox's Proportional Hazard Model with Time-Dependent Covariates

An assumption of the proportional hazard model is that the hazard function for an individual (i.e., observation in the analysis) depends on the values of the covariates and the value of the baseline hazard. Given two individuals with particular values for the covariates, the ratio of the estimated hazards over time will be constant -- hence the name of the method: the proportional hazard model. The validity of this assumption may often be questionable. For example, age is often included in studies of physical health. Suppose you studied survival after surgery. It is likely that age is a more important predictor of risk immediately after surgery than some time after the surgery (after initial recovery). In accelerated life testing one sometimes uses a stress covariate (e.g., amount of voltage) that is slowly increased over time until failure occurs (e.g., until the electrical insulation fails; see Lawless, 1982, page 393). In this case, the impact of the covariate is clearly dependent on time. The user can specify arithmetic expressions to define covariates as functions of several variables and survival time.

An Example

I will use an ecological paper as an example to go through the different survival analysis concepts and possibilities. The paper, by A. J. Caesar, is called: "Synergistic interaction of soilborne plant pathogens and root-attacking insects in classical biological control of an exotic rangeland weed" [Biol. Control 28 (2003) 144–153], Biological Control, Volume 28, Issue 3, November 2003, Page 387.

The author of this paper set up several pots in a greenhouse with leafy spurge (Euphorbia esula/virgata), a weed that causes great damage to rangeland in the USA and other countries, to see the possible role of pathogen–insect interactions in the mortality of this weed. The paper is about biological control, and its conclusion was that “The main finding in this study is that the mortality of leafy spurge is due chiefly to plant pathogens contributing to a synergism with root-damaging insects. An earlier phase of the study had shown that insect–plant pathogen combinations caused accelerated damage/disease of leafy spurge plants compared to any single agent.”

The methods are described as follows: “The Kaplan–Meier and Cox proportional hazards survival analysis algorithms (JMP 4, SAS Institute, Cary, NC) were used to access the effect of varying numbers of Aphthona and inoculum level of R. solani at the beginning of the experiment on survival of leafy spurge plants. Survival analysis is applied usually to individual survival times grouped according to treatment. The duration of exposure to the baseline factors until death of individual leafy spurge plants was recorded to the nearest 48 h, and confirmed by lack of regrowth until the experiment was terminated. For the year 2000 studies, this period was 26–120 days for individual plants and 30–91 days in 2001. Plants still alive at the end of the study were coded as censored.”

The author also recommended survival analysis: “the application of survival-analysis data could potentially result in considerable savings in time and cost and reduced ecological risks. In addition to death of the target weed as an outcome, other more subtle outcomes could alternatively be measured. For example, the phenological effects of an agent or agents on time to flowering or seed set could be another measure. A further advantage is that survival analysis would encourage consideration of other mortality factors along with insects.”

A link to the paper is given and the main parts of the paper are also reproduced here. I do not have the original data of this paper, but I will try to follow the paper's development in order to illustrate the use of survival analysis, with the help of a book called ‘Biostatistics: The Bare Essentials’ by Norman and Streiner, which has a wonderful chapter on survival analysis.

Let’s start with the simple actuarial approach. The first thing is to make the X axis the length of time in the study, so that everyone starts from point 0. The risk of death is not constant as time passes, so we reduce our interval to a year or a day and list, for each year (or day) of the study, the number of subjects still alive (at risk), the number who “died,” and the number lost to follow-up. In this way the number at risk diminishes as time goes by. We continue until either the study ends or we run out of subjects. It is important to see that we treat people (or things) that dropped out of the study (lost) and those who were “censored” in the same way.

The next step is to calculate the probability of dying in each interval (year or day):

Probability (death) = Number who died / Number at risk of death

The notation is:

q_i = probability of death in year (or day) i

p_i = 1 − q_i = probability of survival in year i, the complement of the probability of death

D_i = number of subjects who died in year i

R_i = number of subjects at risk at the beginning of year i

The equation becomes a simple one:

\[ q_i = \frac{D_i}{R_i} \]

This is the hazard concept: the hazard is the probability of the occurrence of the outcome for subjects who began that interval.

 

From the definitions of survival and density of failure, we can derive the hazard function, which gives the probability density of an event occurring around time t, given that it has not occurred before t. It can be shown that the hazard has this form ([Kalbfleisch and Prentice, 1980]):

\[ \lambda(t) = \frac{f(t)}{S(t)} \]

For the actuarial table, the hazard in interval i is computed as

\[ \lambda_i = \frac{2\, q_i}{h_i\, (1 + p_i)} \]

where h_i is the width of the interval (in this case, 1). The hazard is evaluated at the middle of the interval, whereas q_i is the risk at the end of the interval.

The problem here is that we are looking at a whole interval (a year or a day). Because the interval is fixed, we do not know when within it subjects were lost, and so we do not know for how long they were at risk. Should we drop them at the beginning of the year, or should we keep them for the whole year?

Statistics reaches a compromise here. If we do not know what happened during the interval, we count each of these subjects as half a person. In a large sample this is a good assumption: if events occur at random during a year, about half will happen during the first six months and half during the second.

The equation becomes:

Probability (death) = Number who died in year i / (Number at risk at the start of year i − Lost or censored / 2)

This again is the hazard:

\[ q_i = \frac{D_i}{R_i - L_i / 2} \]

With this equation we can work out the probability of death and, from it, the probability of survival: 1 − (the probability of death). We then multiply together the survival probabilities of successive years (each is a conditional probability, given survival through the previous intervals, so the multiplication rule applies). The product is the cumulative probability of survival: the probability of surviving through each year for the subjects still in the study. These figures give us the survival curve, or survival function.

That was the actuarial life table; next is the Kaplan-Meier (K/M) approach. The two are similar, but the Kaplan-Meier approach has a few differences:

First, there is no fixed interval; the exact time of each event is used. Second, the survival function is not calculated at fixed times but each time an outcome occurs. Third, with the actuarial table the curve changes at the end of each interval, whereas with K/M it changes whenever an event occurs. In the actuarial table the intervals on the X axis are equal; not so with K/M. Finally, lost or censored subjects are counted as at risk up to the time they are lost, but not for later events.

When we have fewer than 50 subjects it is better to use the Kaplan-Meier approach; with more than 50, the actuarial approach is better.

We can also estimate the standard error. Because it applies to a specific time, we end up with a separate standard error for each interval. The formula is:

\[ SE(P_i) = P_i \sqrt{\frac{1 - P_i}{R_i}} \]

Now suppose we want to compare two or more groups, using one as a control, for example. The first thing is to draw the survival curves on the same graph to see what is going on. Then, if we assume that the survival rates are normally distributed, we pick a specific point and apply a z-test. The equation is:

\[ z = \frac{P_{i1} - P_{i2}}{\sqrt{SE(P_{i1})^2 + SE(P_{i2})^2}} \]

where P_{i1} and P_{i2} are the values of P (the cumulative probability of surviving) for the two groups at some arbitrarily chosen interval i (or time t, if we used the K/M approach), and the SE terms are the standard errors at those times.

This is very useful and easy to calculate. We can also determine the Relative Risk (RR). The RR is the ratio of the probability of some outcome occurring among subjects in (for example) Group 1 to the probability of it occurring among those in Group 2. The formula is:

\[ RR_i = \frac{1 - P_{i1}}{1 - P_{i2}} \]

This is the relative risk at interval i. The z-test can also be applied to test the significance of the RR.
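The standard error, z-test, and relative risk at a single interval can be worked through with a few lines of arithmetic. The survival proportions and numbers at risk below are invented for illustration, and the helper name `standard_error` is an assumption.

```python
import math

def standard_error(p, at_risk):
    """SE(P_i) = P_i * sqrt((1 - P_i) / R_i)."""
    return p * math.sqrt((1 - p) / at_risk)

# Hypothetical values at one pre-chosen interval i.
p1, r1 = 0.60, 40   # group 1: cumulative survival and number at risk
p2, r2 = 0.80, 45   # group 2

se1 = standard_error(p1, r1)
se2 = standard_error(p2, r2)

# z-test for the difference in cumulative survival at interval i.
z = (p1 - p2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Relative risk of the outcome in group 1 versus group 2 at interval i.
relative_risk = (1 - p1) / (1 - p2)

print(f"SE1 = {se1:.4f}, SE2 = {se2:.4f}, z = {z:.2f}, RR = {relative_risk:.2f}")
```

An absolute z value larger than 1.96 would be read as a difference at the conventional 5% level, for this one time point only.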

THE MANTEL-COX LOG-RANK TEST

The previous approach has two problems. First, you should pick your comparison time before you look at the data, ideally before you even start the trial; otherwise you could be tempted to choose the time that gives the result you want. Second, we have ignored most of the data and focused on only one point. The Mantel-Cox log-rank test (or logrank test) uses most of the data. This test is a modification of the Mantel-Haenszel chi-squared test. Although it is a nonparametric test, it is more powerful than the parametric z-test because it makes use of more of the data.

The log-rank test compares the observed number of events with the number expected, under the assumption that the null hypothesis of no group differences is true.

We first need to calculate the expected frequency for group x at interval i:

\[ E_{ix} = D_i \times \frac{R_{ix}}{R_{i1} + R_{i2}} \]

where D_i is the total number of deaths in interval i and R_{ix} is the number at risk in group x.

We do this for each interval and sum over intervals to obtain the observed (O_1, O_2) and expected (E_1, E_2) totals for each group. Finally we calculate how much the observed number of events differs from the expected number, using the Mantel-Cox chi-squared:

\[ \chi^2 = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} \]

with 1 degree of freedom (df). If we had more than two groups, we would extend the equation with one term per group and use k – 1 df (where k is the number of groups).

The overall Relative Risk for this test is:

\[ RR = \frac{O_1 / E_1}{O_2 / E_2} \]
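The log-rank calculation can be sketched as follows. The interval counts are invented for illustration, and the code uses the simple observed-versus-expected form of the chi-squared given in the text rather than a variance-based version.

```python
# Hypothetical per-interval data: (deaths_1, at_risk_1, deaths_2, at_risk_2).
intervals = [
    (4, 50, 1, 50),
    (5, 44, 2, 47),
    (6, 37, 3, 43),
    (4, 29, 2, 38),
]

observed = [0.0, 0.0]
expected = [0.0, 0.0]
for d1, r1, d2, r2 in intervals:
    total_deaths = d1 + d2                          # D_i
    observed[0] += d1
    observed[1] += d2
    expected[0] += total_deaths * r1 / (r1 + r2)    # E_i1
    expected[1] += total_deaths * r2 / (r1 + r2)    # E_i2

# Mantel-Cox chi-squared with 1 df for two groups.
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Overall relative risk of group 1 versus group 2.
relative_risk = (observed[0] / expected[0]) / (observed[1] / expected[1])

print(f"O = {observed}, E = {[round(e, 2) for e in expected]}")
print(f"chi-squared (1 df) = {chi_squared:.2f}, RR = {relative_risk:.2f}")
```

The expected totals always add up to the observed total number of deaths, which is a quick check on the arithmetic.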

Of course we could also adjust for Covariates and for that we use the Cox proportional hazards model.

SURVIVAL ANALYSIS and  MATH FORMULAS

 

Let T(x) denote an absolutely continuous random variable (rv) describing the failure time of a system defined by a vector of features x. If FT(t|x) is the cumulative distribution function for T(x), then we can define the survival function:

 

\[ S(t \mid x) = 1 - F_T(t \mid x) = P\big(T(x) > t\big) \qquad (1) \]

which is the probability that the failure occurs after time t.

The requirements for a function S(t|x) to be a survival function are, uniformly w.r.t. x:

S(0|x)=1 (there cannot be a failure before time 0),

S(+infinity|x)=0 (asymptotically all events realize),

S(t1|x)≥S(t2|x) when t1≤t2 (the survival function never increases).

From the definitions of survival and density of failure, we can derive the hazard function which gives the probability density of an event occurring around time t, given that it has not occurred before t. It can be shown that the hazard has this form ([Kalbfleisch and Prentice, 1980]):

 

\[ \lambda(t \mid x) = \frac{f_T(t \mid x)}{S(t \mid x)} \qquad (2) \]
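As an aside that is not part of the original text, the standard identities linking the failure density f_T(t|x), the hazard λ(t|x), and the survival function S(t|x) can be written out. They assume, as above, that T(x) is absolutely continuous and that S(0|x)=1.

```latex
\begin{align*}
f_T(t \mid x) &= -\frac{d}{dt}\, S(t \mid x), \\
\lambda(t \mid x) &= \frac{f_T(t \mid x)}{S(t \mid x)}
                   = -\frac{d}{dt} \log S(t \mid x), \\
S(t \mid x) &= \exp\!\left( -\int_0^t \lambda(u \mid x)\, du \right).
\end{align*}
```

The last line shows that specifying the hazard is enough to determine the whole survival function, which is why many models are written in terms of the hazard.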

In many studies (e.g. in medical statistics or in quality control), we do not observe realizations of the rv T(x). Rather, this variable is associated with some other rv V~r such that the observation is a realization of the rv Z=q(T,V). In this case we say that the observation of the time of the event is censored. If V is independent of T, we say that censoring is uninformative.

Depending on the form of the function q, we have different forms of censoring. Two of the most common ones in applications are right censoring, when q≡min(·,·) and left censoring, when q≡max(·,·).

Survival data can then be represented by triples of the form (t, x, y) where x is a vector of features defining the process, t is an observed time, and y is an indicator variable:

 

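\[ y = \begin{cases} 1 & \text{if the event is observed at time } t, \\ 0 & \text{if the observation is censored at time } t. \end{cases} \qquad (3) \]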

We can define the sampling density l(y|t,x) of a survival process (assuming, e.g. right censoring) by noting that, if we observe an event (y=1), it directly contributes to the evaluation of the sample density; if we do not observe the event (y=0), the best contribution is to evaluate S(t|x). We have then:

 

\[ l(y \mid t, x) = S(t \mid x)^{1-y}\, f_T(t \mid x)^{y} \qquad (4) \]

The joint sample density for a set of independent observations D={(tk,xk,yk)} can then be written as:

 

\[ l(D) = \prod_{k} S(t_k \mid x_k)^{1-y_k}\, f_T(t_k \mid x_k)^{y_k} \qquad (5) \]
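To make the joint sample density concrete, here is a minimal sketch that evaluates its logarithm for a homogeneous exponential model (no features x), where S(t) = exp(−rate·t) and f(t) = rate·exp(−rate·t). The data values, the exponential assumption, and the function name `log_likelihood` are all illustrative choices.

```python
import math

# Hypothetical right-censored sample: (time, y) with y = 1 for an observed
# event and y = 0 for a censored time.
data = [(2.1, 1), (3.4, 0), (0.7, 1), (5.0, 0), (1.8, 1), (4.2, 1)]

def log_likelihood(rate, sample):
    """Sum over observations of (1 - y) * log S(t) + y * log f(t)."""
    total = 0.0
    for t, y in sample:
        log_s = -rate * t                    # exponential: S(t) = exp(-rate * t)
        log_f = math.log(rate) - rate * t    # exponential: f(t) = rate * exp(-rate * t)
        total += (1 - y) * log_s + y * log_f
    return total

# Closed-form maximizer in the exponential case: events / total time at risk.
rate_hat = sum(y for _, y in data) / sum(t for t, _ in data)

print(f"rate_hat = {rate_hat:.4f}")
print(f"log-likelihood at rate_hat = {log_likelihood(rate_hat, data):.4f}")
```

Censored observations contribute only through S(t), yet they still add to the total time at risk and therefore lower the estimated rate.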

Since the censoring is supposed uninformative, it does not influence inference on the failure density, but gives contribution to the sample density.

Note that censored data models can be seen as a particular class of missing data models in which the densities of interest are not sampled directly ([Robert and Casella, 1999]).

3. Standard models for survival analysis

In the next sections we will describe some of the most commonly used models, both for homogeneous (time-only) and heterogeneous modeling.

3.1. The Kaplan–Meier non-parametric estimator

The Kaplan–Meier (KM) estimator is a non-parametric maximum likelihood estimator of the survival function. It is piecewise constant, and it can be thought of as an empirical survival function for censored data. In its basic form, it is only homogeneous.

Let k be the number of events in the sample, t1, t2,…,tk the event times (supposed unique and ordered), ei the number of events at time ti, and ri the number of times (event or censored) greater than or equal to ti. The estimator is given by the formula:

\[ \hat{S}(t) = \prod_{i:\, t_i \le t} \left( 1 - \frac{e_i}{r_i} \right) \qquad (6) \]

It can be shown that the KM estimator maximizes the generalized likelihood over the space of all distributions, so its evaluation on large data sets gives a good qualitative description of the true survival function. However, it should be noted that it is noisy when the data are few, in particular when events are rare, since it is piecewise constant. Therefore KM estimates from datasets sampled from the same distribution but with different numbers of samples can differ considerably.
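The estimator can be computed directly from its definition. This is a minimal sketch on a made-up sample; the names `events_at` and `km_curve` are assumptions, and tied event times are handled by letting e_i count all events at t_i.

```python
from collections import Counter

# Hypothetical sample: (time, y) with y = 1 for an event, 0 for a censored time.
data = [(3, 1), (5, 0), (7, 1), (7, 1), (9, 0), (11, 1), (14, 0), (15, 1)]

events_at = Counter(t for t, y in data if y == 1)   # e_i for each event time t_i

survival = 1.0
km_curve = []   # (t_i, estimated survival just after t_i)
for t_i in sorted(events_at):
    e_i = events_at[t_i]
    r_i = sum(1 for t, _ in data if t >= t_i)       # times (event or censored) >= t_i
    survival *= 1 - e_i / r_i
    km_curve.append((t_i, survival))

for t_i, s in km_curve:
    print(f"t = {t_i:2d}  S(t) = {s:.3f}")
```

The curve only steps down at event times; censored times change nothing at the moment they occur but shrink the risk set r_i for later events.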

3.2. Proportional hazards models

The most used survival specification which takes into account system features is to allow the hazard function to have the form:

 

small lambda, Greek(t|x)=small lambda, Greek(t)exp(wTx)

(7)

where $\lambda(t)$ is some homogeneous hazard function (called the baseline hazard) and w is the vector of feature-dependent parameters of the model. This is called a proportional hazards (PH) model. The two most widely used approaches for this kind of model are:

• Choose a parameterized functional form for the baseline hazard, then use Maximum Likelihood (ML) techniques to find values for the parameters of the model;

• Do not fix the baseline hazard; instead, estimate the feature-dependent part of the model by ML and the baseline hazard non-parametrically. This is called Cox's model ([Kalbfleisch and Prentice, 1980]).
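For the second approach, the feature-dependent parameters w are usually estimated by maximizing Cox's partial likelihood, in which the baseline hazard cancels out. The sketch below evaluates the partial log-likelihood in plain Python, assuming no tied event times; the function name and the data are hypothetical, and a real analysis would hand this quantity to a numerical optimizer.

```python
import math

def cox_partial_log_likelihood(w, times, events, features):
    """Cox partial log-likelihood, assuming no tied event times.

    w        : list of feature-dependent parameters
    times    : observation times (event or censoring)
    events   : 1 if the event was observed, 0 if censored
    features : list of feature vectors x, one per observation
    """
    # Linear scores w^T x for every observation.
    scores = [sum(wj * xj for wj, xj in zip(w, x)) for x in features]
    total = 0.0
    for i, (ti, yi) in enumerate(zip(times, events)):
        if yi != 1:
            continue  # censored observations appear only in risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = [math.exp(scores[k]) for k, tk in enumerate(times) if tk >= ti]
        total += scores[i] - math.log(sum(risk))
    return total

# Illustrative data with a single binary feature.
times = [1.0, 2.0, 3.0]
events = [1, 1, 0]
features = [[1.0], [0.0], [1.0]]
value_at_zero = cox_partial_log_likelihood([0.0], times, events, features)
```

Each term compares the subject that failed with everyone still at risk at that time, so only the ratios exp(w^T x) matter and the baseline hazard never has to be specified.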

 

 

§         Modulated Poisson and modulated renewal processes models

 

An Overview

Hazard h(t):

a function describing temporal change in the instantaneous death rate experienced by individuals in a sample. Commonly referred to as the ‘force of mortality’ or the ‘mortality density.’ More precisely, hazard is the probability density function that generates the probability of dying in a time interval. Units: number of deaths individual-at-risk$^{-1}$ time$^{-1}$. Hazards and hazard analysis can be applied to events other than deaths.
[Equation image not reproduced]

Hazard, baseline:

the mean of instantaneous hazards in a sample of N individuals, if the initial hazard distribution of that sample is maintained (i.e. replenished as deaths occur). Provides an unbiased estimate of mean instantaneous death rate in a heterogeneous population. Units: number of deaths individual-at-risk$^{-1}$ t$^{-1}$ (Eqn I).

[Eqn I: equation image not reproduced]

 

Hazard, mean:

the mean instantaneous hazard in a sample of N individuals. In heterogeneous samples, it reflects both changes in individual hazards through time and the change in population composition that results from the disproportionate loss of high-hazard individuals. Units: number of deaths individual-at-risk$^{-1}$ t$^{-1}$ (Eqn II). For a broad range of hazard distributions (represented by a range of $\alpha$ values in Eqn III), the mean hazard underestimates the baseline hazard, more so when the baseline hazard is greater, the hazard variance $\sigma^2$ is larger, and the time duration t is longer [34].

[Eqn II: equation image not reproduced]

[Eqn III: equation image not reproduced]
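The gap between baseline and mean hazard can be illustrated with a small numerical sketch. It assumes each individual has a constant hazard, so the expected fraction still alive at time t is exp(-h t); this simplification, the function name and the hazard values are assumptions made for illustration and are not the source's Eqns I to III.

```python
import math

def mean_hazard(hazards, t):
    """Expected mean hazard among survivors at time t.

    Assumes every individual has a constant hazard, so the weight of each
    individual is its survival probability exp(-h * t).
    """
    weights = [math.exp(-h * t) for h in hazards]
    return sum(h * w for h, w in zip(hazards, weights)) / sum(weights)

# Illustrative heterogeneous sample: half low-hazard, half high-hazard.
hazards = [0.1, 0.1, 0.5, 0.5]

# Baseline hazard: mean of the initial hazards (composition held fixed).
baseline = sum(hazards) / len(hazards)

at_start = mean_hazard(hazards, 0.0)   # equals the baseline at t = 0
later = mean_hazard(hazards, 5.0)      # survivors are mostly low-hazard
```

As time passes the high-hazard individuals are lost first, so the mean hazard among survivors drifts below the baseline even though no individual hazard has changed.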

 

Mortality rate, baseline:

hazard averaged across individuals and the sample period t. Provides an unbiased estimate of sample death rate over a time period. Units: number of deaths individual-at-risk$^{-1}$ t$^{-1}$ (Eqn IV).

[Eqn IV: equation image not reproduced]

 

Mortality rate, restricted:

hazard averaged across individuals and the sample period t, assuming individual hazards are identical. Systematically underestimates the mean death rate when hazards are heterogeneous (Box 1 and Fig. 1). Units: number of deaths individual-at-risk$^{-1}$ t$^{-1}$. It is related to hazards by Eqn V.

[Eqn V: equation image not reproduced]

Commonly calculated as (Eqn VI):

[Eqn VI: equation image not reproduced]
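The underestimation can be checked numerically. The sketch below assumes the common identical-hazard estimator, -ln(surviving fraction)/t, which follows from S(t) = exp(-m t) when every individual shares one constant rate m; whether this matches the source's Eqn VI exactly is an assumption, as are the function name and the hazard values.

```python
import math

def restricted_mortality_rate(hazards, t):
    """Rate inferred from the surviving fraction, assuming identical hazards.

    The expected surviving fraction is the average of exp(-h * t) over the
    individuals (constant individual hazards are assumed for illustration).
    """
    surviving_fraction = sum(math.exp(-h * t) for h in hazards) / len(hazards)
    return -math.log(surviving_fraction) / t

# Illustrative heterogeneous sample and its true mean hazard.
hazards = [0.1, 0.1, 0.5, 0.5]
true_mean = sum(hazards) / len(hazards)

estimate = restricted_mortality_rate(hazards, 5.0)
homogeneous = restricted_mortality_rate([0.3, 0.3, 0.3, 0.3], 5.0)
```

When all hazards are equal the estimator recovers the common rate; with heterogeneous hazards it falls below the mean hazard, because the survivors are disproportionately the low-hazard individuals.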

 

 

 

Selected Papers on Discrete-Time Survival Analysis

http://gseacademic.harvard.edu/~willetjo/dsta.htm

 

 

 

BOOKS

 

Applied Survival Analysis: Regression Modeling of Time to Event Data, Textbook and Solutions Manual
David W. Hosmer, Jr.,
Stanley Lemeshow
ISBN: 0-471-43732-8
Hardcover
648 pages
September 2002

Statistical Methods for Survival Data Analysis, Third Edition
Elisa T. Lee, John Wenyu Wang
ISBN: 0-471-36997-7
Hardcover
534 pages
April 2003

Survival Analysis: A Practical Approach
Mahesh K. B. Parmar, David Machin
ISBN: 0-471-93640-5
Hardcover
268 pages
September 1995

Survival Data Mining: Modeling Customer Event Histories
Will Potts, SAS Institute, Inc.
ISBN: 0-471-67621-7
Paperback
224 pages
October 2005

 

Survival Models and Data Analysis
Regina C. Elandt-Johnson, Norman L. Johnson
ISBN: 0-471-34992-5
Paperback
457 pages
February 1999

 

Survival analysis: a practical approach. Mahesh K. B. Parmar and David Machin (1996), John Wiley and Sons.

 

Survival analysis using the SAS system: a practical guide. Paul D. Allison (1995). SAS Institute Press.

 

Applied survival analysis: regression modeling of time to event data. David Hosmer and Stanley Lemeshow (1999), John Wiley & Sons.

Survival analysis: a self-learning text. David G. Kleinbaum (1996), Springer, New York.

 

The statistical analysis of failure time data. J. D. Kalbfleisch and R. L. Prentice (1980), John Wiley & Sons.

 

Statistical models and methods for lifetime data. J. F. Lawless (1982), John Wiley & Sons, New York.

 

Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence by Judith D. Singer & John B. Willett  New York: Oxford University Press, March, 2003

 

WEB SITES

 

http://www.la.utexas.edu/course-materials/sociology/soc386L/

 

http://www.statsoftinc.com/textbook/stsurvan.html

Survival Analysis: a good overall summary.

 

OTHER RESOURCES USED FOR THIS SUMMARY:

 

Analyzing Survival Data from Clinical Trials and Observational Studies. Ettore Marubini and Maria Grazia Valsecchi (1995), John Wiley.

 

Analysis of Multivariate Survival Data. Philip Hougaard (2000), Springer (Statistics for Biology and Health).

 

Biostatistics: A Bayesian Introduction. George G. Woodworth (2004), John Wiley.

 

Survival Analysis. Rupert G. Miller, Jr. (1981), John Wiley.

 

Using Multivariate Statistics, fourth ed. Barbara G. Tabachnick and Linda S. Fidell (2001), Allyn and Bacon.

 

IMPORTANT PAPERS:

 

Dealing with death data: individual hazards, mortality and bias. Michael S. Zens and David R. Peart, Department of Biological Sciences, Dartmouth College. Covers survival analysis in ecology and evolution.