Censoring and Truncation

The purpose of this session is to show you how to use R's procedures for censored and truncated regression. We also estimate Heckman's two-stage procedure for samples with selection bias, which is a form of incidental truncation.

# This file demonstrates some of R's procedures for doing censored and truncated regression.

# In particular, we estimate a lower limit censored regression (i.e., Tobit), Cragg's model

# that assumes a heterogeneous censoring process, Heckman's incidental truncation model for

# dealing with sample selection bias, and truncated regression.

# We will use some of the Mroz data on female labor force participation and income for these

# examples. The first 428 observations of the Mroz data contained women who worked in 1975.

# The remaining 345 observations contained women who did not work. We will use only the first

# 50 observations from each of these subsets of the data.

attach(LFP)

summary(LFP)

# Now let's estimate a Tobit model and also save the log likelihood for later testing. The dependent

# variable (WHRS) is the wife's hours worked in 1975. The independent variables are a constant,

# the number of children less than 6 years old (KL6), the number of children between 6 and 18 (K618),

# the wife's age (WA), and the wife's education (WE). CIT, a dummy for whether the family lives in a

# large city, is used later in the selection equation.

# Currently Zelig does not support the Tobit model. However, it is easily obtained using the survival

# package; we will discuss why in the next lecture period. Note that the condition

# WHRS > 0 flags the non-censored observations. The type='left' option implies left censoring.

library(survival)

tobit.out <- survreg(Surv(WHRS, WHRS>0, type='left') ~ KL6 + K618 + WA + WE,
                     data=LFP, dist='gaussian')

summary(tobit.out)

# You can obtain predicted values from the Tobit regression as follows.

WHRShat <- predict(tobit.out,type="response")

WHRShat

# We will also save the log likelihood for the model for later use.

lltobit <- tobit.out$loglik[2]

lltobit

# However, this prediction assumes that we are interested in predictions of y*. y* is latent, and

# whether it is observed is determined by the censoring process. Thus, following Greene, p. 764, the

# expected value must take into account the probability of censoring. First, get the standard

# deviation of the latent residuals.

SD <- tobit.out$scale

SD

# Now calculate the inverse Mills ratio called lambda

lambda <- dnorm(WHRShat/SD)/pnorm(WHRShat/SD)

lambda
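# As a quick illustrative check of the formula, the inverse Mills ratio evaluated at
# zero is dnorm(0)/pnorm(0) = (1/sqrt(2*pi))/0.5 = sqrt(2/pi), about 0.798:

```r
# Illustrative check of the inverse Mills ratio phi(z)/Phi(z) at z = 0;
# the two quantities below should agree.
dnorm(0)/pnorm(0)
sqrt(2/pi)
```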

# Now calculate the expected value given censoring and X

EYgivenX <- pnorm(WHRShat/SD)*(WHRShat + SD*lambda)

EYgivenX

# McDonald and Moffitt suggest a useful decomposition of the marginal effects associated with the

# censored regression model. They show that a change in a right-hand-side variable affects the

# conditional mean through two channels:

# 1) it changes the conditional mean in the uncensored part of the distribution;

# 2) it changes the probability that an observation will lie in the uncensored part

# of the distribution.

xbar <- as.matrix(colMeans(cbind(1,LFP[3:6])))

beta <- coefficients(tobit.out)

BXoverS <- t(beta) %*% xbar/SD

Mu <- dnorm(BXoverS)/pnorm(BXoverS)

P <- pnorm(BXoverS)

P

P1 <- P*(1-BXoverS*Mu-Mu^2)

P1

P2 <- dnorm(BXoverS)*BXoverS+dnorm(BXoverS)*Mu

P2

# We can also calculate marginal effects on the expected value of the unconditional distribution,

# the expected value of the uncensored distribution, as well as probabilities of being in the

# uncensored part of the distribution as follows.

N <- nrow(LFP)

censored <- sum(WHRS<=0)

Fz <- (N-censored)/N

z <- qnorm(Fz)

fz <- 1/sqrt(2*pi)*exp(-.5*z^2)

cond <- (1-(z*fz/Fz))-fz^2/Fz^2

uncond <- Fz

prob <- fz/SD

rawcoeff <- beta[2:5]

mfxuncond <- rawcoeff*uncond

mfxuncensored <- rawcoeff*cond

puncensored <- rawcoeff*prob

table <- cbind(rawcoeff,mfxuncond,mfxuncensored,puncensored)

table

# The censored regression model assumes that the uncensored part of the distribution is normal

# If it is not, then estimates are inconsistent.

# We can test this assumption. Drop the censored observations and test for normality

LFP2 <- subset(LFP, subset=WHRS>0)

summary(LFP2)

library(e1071)

skewness(LFP2$WHRS)

kurtosis(LFP2$WHRS)

library(car)

qqPlot(LFP2$WHRS, dist="norm", labels=FALSE)

hist(LFP2$WHRS, probability=T, ylab='Density')

lines(density(LFP2$WHRS), lwd=2)

points(LFP2$WHRS, rep(0, nrow(LFP2)), pch="|")

box()

# N is small here, so we cannot be fully confident in these tests. However, there is some

# evidence of non-normality.
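# A formal test can supplement these diagnostics. As an illustrative sketch, base R's
# shapiro.test gives the Shapiro-Wilk test of normality (using the LFP2 subset created above):

```r
# Shapiro-Wilk test on the uncensored hours; a small p-value is
# evidence against normality.
shapiro.test(LFP2$WHRS)
```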

# Cragg has suggested that assuming a censoring limit that depends on the same distribution as

# the uncensored observations is often incorrect. He suggests a two equation system in which the

# first equation estimates the probability of being above the censoring limit and the second is a

# truncated regression on the uncensored observations. Below we estimate Cragg's model using

# Probit and a truncated regression procedure. Also, we do a likelihood ratio test of whether

# Cragg's model is significantly different from the Tobit model.

library(Zelig)

probit.out <- zelig(LFP ~ KL6 + K618 + WA + WE,  model = "probit", data = LFP)

summary(probit.out)

probitsummary.out <- summary(probit.out)

llprobit <- -.5 * probitsummary.out$deviance

llprobit

# We will write our own procedure for the truncated normal regression on the uncensored observations

# First, create a new dataset using the uncensored observations.

LFPsubset <- subset(LFP, subset=WHRS>0)

x <- cbind(1,LFPsubset[3:6])

y <- as.vector(LFPsubset$WHRS)

K <- ncol(x)

K1 <- K+1

n <- nrow(x)

# Assign the lower truncation limit

L <- 0

# Define the function to be optimized

llik.ltruncnormal <- function(par) {

  Y <- as.vector(y)

  X <- as.matrix(x)

  xbeta <- X %*% par[1:K]

  Sig <- par[K1]

  sum(log((1/Sig)*dnorm((Y - xbeta)/Sig)) - log(1 - pnorm((L - xbeta)/Sig)))

}

llik.ltruncnormal
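# Before optimizing, a quick sanity check with made-up data (not the Mroz data): when the
# truncation limit goes to -Inf, log(1 - pnorm(-Inf)) = 0, so the truncated log-likelihood
# should reduce to the ordinary normal log-likelihood.

```r
# Illustrative check with simulated data: with no truncation the
# correction term vanishes and we recover the normal log-likelihood.
set.seed(123)
ytest <- rnorm(20, mean = 5, sd = 2)
ll.trunc <- sum(log((1/2)*dnorm((ytest - 5)/2)) - log(1 - pnorm((-Inf - 5)/2)))
ll.normal <- sum(dnorm(ytest, mean = 5, sd = 2, log = TRUE))
all.equal(ll.trunc, ll.normal)
```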

# Now let's use the above function to estimate the model.

# Get start values from a regular regression.

regress.out <- lm(WHRS ~ KL6 + K618 + WA + WE, data = LFPsubset)

startvalues <- c(coefficients(regress.out),120)

# Estimate the lower truncated regression.

ltrunc.out <- optim(startvalues, llik.ltruncnormal, method = "BFGS",
                    control = list(trace=1, maxit=100, fnscale = -1), hessian = TRUE)

ltrunc.out

# Calculate standard errors, z statistics, and pvalues and report in a table.

v <- -solve(ltrunc.out$hessian)

se <- sqrt( diag(v))

b <- ltrunc.out$par

zstat <- b/se

pzstat <- 2* (1 - pnorm(abs(zstat)))

table <- cbind(b,se,zstat,pzstat)

table

# Get the log likelihood from the truncated regression.

lltruncate <- ltrunc.out$value

lltruncate

# Calculate Cragg's test statistic

lrtest <- 2*((llprobit+lltruncate)-lltobit)

lrtest

# The restricted model is Tobit. The unrestricted model is the two models estimated

# separately. The test statistic is 14.34, which is chi-squared with 5 degrees of freedom,

# corresponding to the number of additional parameters estimated. The critical value at the .05 level is 11.07, so

# we reject the null that the restricted model is true. The two equation approach is therefore

# more appropriate than Tobit.
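# Rather than looking up the critical value in a table, R can compute it and the
# p-value of the test directly (using lrtest from above):

```r
# Critical value at the .05 level for chi-squared with 5 df (about 11.07),
# and the p-value of the observed likelihood ratio statistic.
qchisq(0.95, df = 5)
pchisq(lrtest, df = 5, lower.tail = FALSE)
```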

# Now let's turn to estimating a model with sample selection bias. In these cases the

# truncation is incidental, due to sample selection on another variable that is correlated

# with the truncation in the dependent variable. As discussed in class, the standard model is

# Heckman's two stage procedure. Once again, there is no "canned" procedure in R to do

# the Heckman model. The easy thing to do in this case would be to use STATA or LIMDEP.

# However, for illustrative purposes we develop our own procedure in R.

# First, estimate the first stage probit.

heckprob.out <- zelig(LFP ~ CIT + KL6, model="probit", data=LFP)

summary(heckprob.out)

summaryheckprob.out <- summary(heckprob.out)

# Now retrieve the probit coefficients to form the index function. Note that the summary

# coefficient table has several columns; we want only the estimates in column 1.

gamma <- as.matrix(summaryheckprob.out$coefficients[,1])

w <- cbind(1,CIT,KL6)

z <- w %*% gamma

# Now calculate lambda and delta

LFP$lambda <- dnorm(z)/pnorm(z)

LFP$delta <- LFP$lambda * (LFP$lambda + z)

detach(LFP)

attach(LFP)

# Now get the selected subset of the data (the observations for which hours are observed)

LFPheckman <- subset(LFP, subset=LFP>0)

# Now run the second stage regression that includes lambda (the Mills ratio) from the probit.

heckman.out <- lm( WHRS ~ KL6 + K618 + WA + WE + lambda, data=LFPheckman)

summary(heckman.out)

# The coefficients from this OLS regression are now unbiased due to the inclusion of

# the source of the bias in the equation. However, the standard errors and inferential

# statistics will be incorrect due to not taking into account the sampling variability

# contained in lambda. Also, the residuals are heteroskedastic by definition.

# We need to compute a corrected covariance matrix of estimates for both problems.

# First, get the standard error of the regression corrected for selection.

blambda <- heckman.out$coefficients[6]

deltabar <- mean(LFPheckman$delta)

sigma <- sqrt(((t(heckman.out$residuals) %*% heckman.out$residuals)/nrow(LFPheckman)) + deltabar*blambda^2)

# Next obtain rho in the correction rho*sigma*lambda*wgamma, Greene p. 784

rho2 <- blambda^2/sigma^2

rho <- sqrt(rho2)

# Next obtain an n x n Identity matrix, the Delta matrix, and I-rho^2 Delta.

# These are from Greene, p 785

n <- nrow(LFPheckman)

I <- diag(n)

Delta <- as.vector(LFPheckman$delta) * I

IminusRho2Delta <- (I - as.numeric(rho2)*Delta)

# Now calculate the correction factor Q

xstar <- cbind(1,LFPheckman[,2],LFPheckman[,3],LFPheckman[,4],LFPheckman[,5],LFPheckman[,6])

w <- cbind(1, LFPheckman$CIT, LFPheckman$KL6)

Q <- (t(xstar) %*% Delta %*% w) %*% vcov(heckprob.out) %*% (t(w) %*% Delta %*% xstar)

# Now, obtain the corrected asymptotic covariance matrix of coefficients.

vcovheckman <- as.numeric(sigma^2) *(solve(t(xstar) %*% xstar) %*% (t(xstar) %*% IminusRho2Delta %*% xstar + Q) %*% solve(t(xstar) %*% xstar))

# Now produce corrected standard errors, z statistics and pvalues along with the coefficients

se <- sqrt( diag(vcovheckman))

b <- heckman.out\$coefficients[1:6]

zstat <-b/se

pzstat <- 2* (1 - pnorm(abs(zstat)))

table <- cbind(b,se,zstat,pzstat)

table
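# For comparison, the uncorrected OLS standard errors can be pulled straight from the
# second-stage fit; the corrected values above will generally differ from them:

```r
# Naive second-stage OLS standard errors, for comparison with the
# corrected standard errors reported in the table above.
sqrt(diag(vcov(heckman.out)))
```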

# Again, this section was illustrative only. You will probably want to use STATA or LIMDEP

# for most Heckman estimations. However, the preceding demonstrates how to

# program your own estimator. If you have an econometrics text with the matrices,

# then R can be used to implement the estimator. As a final note, the preceding

# would have been incorrect if taken directly from the textbook: Greene's formulation

# for delta_i on p. 784 is incorrect. So it pays to have the original

# reference and to check the math against multiple references.