Introduction to Parametric Duration Models

 

The purpose of this session is to show you how to use STATA's procedures for estimating parametric duration models. Note that we do not cover non-parametric or semi-parametric duration models which are an important part of this literature.

 

/* This file demonstrates the STATA procedures for evaluating duration data. First, let's read in the data. These

are data on the duration of labor strikes taken from Greene, p. 800. The variable T is the duration of the strike.

The variable PROD is the residual from a linear regression of the log of industrial production on time, time

squared, and a set of monthly dummy variables. */

 

 

use "C:\users\wood\documents\my teaching\maximum likelihood\data\strikes.dta", clear

 

/* In contrast to Limdep, STATA does not expect the dependent variable to be logged. You will want a censoring

variable to indicate whether or not a process is still ongoing. Below we create the censoring variable, STATUS,

for all durations above 80. Note that the censoring variable can be omitted from the specifications below if

there is no censoring. */

 

gen status=t<81

 

/*We need to stset the data set to alert STATA that we have duration data and want to estimate survival models*/

 

stset t, failure(status)

 

/* Let's first summarize the data */

 

summarize

 

/* Now let's just estimate some duration models with no covariates. This would be useful for exploring the

duration dependence of a process. First the Exponential model. Note that we are estimating this model in

the Accelerated Failure Time form using the option "time" */

 

streg , dist(exp) time

 

/*Now lets get the median duration of strikes predicted by this distribution.*/

 

predict medsurvexp, med

display medsurvexp

 

/* Let's now get the survival, hazard, and cumulative hazard functions for the exponential function.*/

 

stcurve, surv

more

stcurve , haz

more

stcurve, cumhaz

 

/* Let's now calculate AIC for later use in model testing and comparison. */

 

scalar aicexp = -2*e(ll)+2*(e(df_m)+1)

display aicexp

 

/*Note the distinctive hazard and integrated hazard functions for the Exponential model. This reflects the

memory-less property of this distribution. Also, note that this distribution results in hazard functions that

are strictly monotonic. */

 

/*Now let's estimate the same relationship with the Weibull model. Weibull is probably the most popular.

The reason is it's flexibility and the fact that it encompasses the exponential. You have a built-in test for

exponential, in the significance and size of p. Also, the Weibull's hazard function is more flexible in that

it posits exponential change in the hazard function, which is an advantage if you think that duration

dependence is not linear as in the exponential function. Note that both the exponential and weibull

disributions posit only monotonic change in the hazard function. */

 

streg , dist(weibull) time

 

/*Now lets get the median duration of strikes predicted by this distribution.*/

 

predict medsurvweib, med

display medsurvweib

 

/* Let's now get the survival, hazard, and cumulative hazard functions for the weibull model.*/

 

stcurve, surv

more

stcurve , haz

more

stcurve, cumhaz

 

/* Now let's calculate AIC for later use in model testing and comparison. */

 

scalar aicweib = -2*e(ll)+2*(e(df_m)+1)

display aicweib

 

/*STATA will also estimate the Gamma model.  Note that the GAMMA function allows non-monotonicity

in the hazard function.*/

 

streg , dist(gamma) time

 

/*Now lets get the median duration of strikes predicted by this distribution.*/

 

predict medsurvgam, med

display medsurvgam

 

/* Let's now get the survival, hazard, and cumulative hazard functions for the gamma model.*/

 

stcurve, surv

more

stcurve , haz

more

stcurve, cumhaz

 

/* Now let's calculate AIC for later use in model testing and comparison. */

 

scalar aicgamma = -2*e(ll)+2*(e(df_m)+1)

display aicgamma

 

/*Generally, note the distinctively different shape of the hazard functions for the Gamma model versus

the Weibull or exponential. This demonstrates that it does make a difference which model is selected.

If you think the baseline hazard should be non-monotonic, then you need to specify one of the functions

that allow non-monotonicity. STATA will also estimate a lognormal and loglogistic model, both of which

also allow a non-monotonic hazard function. */

 

streg , dist(lognormal) time

 

/*Now lets get the median duration of strikes predicted by this distribution.*/

 

predict medsurvlnorm, med

display medsurvlnorm

 

 

stcurve, surv

more

stcurve , haz

more

stcurve, cumhaz

more

 

/* Calculate AIC for the lognormal model. */

 

scalar aiclognorm = -2*e(ll)+2*(e(df_m)+1)

display aiclognorm

 

/* Calculate AIC for the loglogistic model. */

 

streg , dist(loglogistic) time

 

/*Now lets get the median duration of strikes predicted by this distribution.*/

 

predict medsurvloglog, med

display medsurvloglog

 

stcurve, surv

more

stcurve , haz

more

stcurve, cumhaz

more

 

scalar aicloglog = -2*e(ll)+2*(e(df_m)+1)

display aicloglog

 

/*We can select the appropriate model using AIC. We want the model with minimum AIC */

 

display aicexp

display aicweib

display aicgamma

display aiclognorm

display aicloglog

 

/*Based on this, we find that the gamma model provides the best fit for modeling duration dependence.

However, we might want to make this determination after including covariates. Note also that this differs from

the finding from LIMDEP using the F distribution encompassing test that the Weibull is the correct model. */

 

/*Another diagnostic for model specification utilizes the cumulative hazard function. A well specified model

will have a cumulative hazard function that begins at zero and increases in linear fashion. For example,

compare the cumulative hazard functions for the gamma versus the weibull models below.*/

 

streg , dist(gamma) time

stcurve, cumhaz

more

streg , dist(weibull) time

stcurve, cumhaz

 

/*This diagnostic suggests that the Weibull may be the best model. */

 

/*STATA allows you to include covariates in the duration model. First, let's estimate the Cox model, which makes

no assumptions about the form of duration dependence, but does assume proportional hazards. We obtain estimates

in both the coefficient form and the hazard ratio form.*/

 

stcox prod, nohr

stcox prod

 

/* Note that the hazard ratio form reveals a very high number. Why? The hazard ratio calculation computes the change in

hazard associated with a one unit change in the variable. However, the variable ranges from a minimum of -0.104 to a

maximum of 0.074. A one unit change is not consistent with the data. We could, however, transform the variable prod

to standard deviation units, which would provide a more interesting quantity.*/

 

egen prodstd = std(prod)

stcox prodstd

 

/* A one standard deviation change in lost production produces a 52 percent increase in the probability of the strike

ending in the very next time period.*/

 

/*We can test the proportional hazards assumption by interacting the right side variables with time and then testing

the significance of the interaction coefficients. STATA simplifies this as shown below. The tvc option specifies that

you want an interaction, and the texp option specifies the form of the interaction which in this case is just time. */

 

stcox prod, tvc(prod) texp(_t) nohr

test [tvc]prod

 

/*We can also test the proportional hazards assumption using the Schoenfeld residuals. STATA has a built in command

to perform this test, as well as plot the scaled residuals as a function of time. */

 

stcox prod, schoenfeld(sch*) scaledsch(sca*)

estat phtest

stphtest,detail

stphtest,plot(prod)

 

/*There is no evidence that we have violated the proportional hazard assumption from the preceding tests. However, we still

might be interested in the baseline hazard function for theoretical reasons, so a parametric model might be appropriate.

Now let's estimate some of the parametric survival models with a covariate. As observed in class, for these models,

the coefficients on covariates must be interpreted with some caution. The interpetation

is relatively straightforward with an exponential, or some values of the P parameter for other distribution.

Otherwise, exercise caution. */

 

/*If you are using one of the accelerated failure time distributions (e.g.,, exponential or weibull), then you can

get the estimates in either the coefficient, hazard ratio, or time ratio metric. Again, we have to think about the

scaling of the prod variable in interpreting the hazard or time ratios. */

 

/* First, the two models in coefficient form.*/

 

streg prodstd, dist(weibull) nohr

streg prodstd, dist(weibull) time

 

/* Now the Weibull in hazard ratio form.*/

streg prodstd, dist(weibull)

 

/* A one standard deviation increase in lost production produces an increase of 68.5 percent of the hazard of a

strike ending in the next period.*/

 

/* Now in accelerated failure time and time ratio form. */

 

streg prodstd, dist(weibull) tr

 

/* A one standard deviation increase in lost production produces strikes that are 47 percent shorter*/

 

/*If you are using the gamma, log-logistic, or gamma models, you can have the estimates reported in either the

coefficient metric or the time ratio metric, but not in a hazard ratio metric. */

 

streg prod, dist(gamma) time

streg prod, dist(gamma) tr

 

/*The preceding models are all log-linear. However, an alternative model, the Gompertz, is not log-linear.

Note that with the Gompertz there is no intercept. Below we report estimates in both hazard ratio and

coefficients form.*/

 

streg prod, dist(gompertz)

streg prod, dist(gompertz) nohr

 

/*STATA will also estimate models with heterogeneity. For example, here is a model that assumes

heterogeneity in the coefficients based on whether the data is censored. Be patient. This one takes a

while to converge.*/

 

set more off

streg prod, dist(weibull) strata(status)

 

/*Here is a model that assumes an unknown source of heterogeneity in the durations, modeled as a

gamma distribution. These are the so-called frailty models. */

 

streg prod, dist(weibull) frailty(gamma)

 

/*This lesson has provided a summary of parametric models of survival. As you can see, there is considerable

expertise required to use and interpret these models. A future course in survival analysis might be required to learn

everything about this class of models.*/