Censoring and Truncation


The purpose of this session is to show you how to use R's procedures for censored and truncated regression. We also estimate Heckman's two-stage procedure for samples with selection bias, which is a form of incidental truncation.


# This file demonstrates some of R's procedures for doing censored and truncated regression.

# In particular, we estimate a lower-limit censored regression (i.e., Tobit), Cragg's model

# that assumes a heterogeneous censoring process, Heckman's incidental truncation model for

# dealing with sample selection bias, and truncated regression.


# We will use some of the Mroz data on female labor force participation and income for these

# examples. The first 428 observations of the Mroz data contain women who worked in 1975.

# The remaining 345 observations contain women who did not work. We will use only the first

# 50 observations from each of these subsets of the data.


# Load the packages used below: survival (for survreg), Zelig (for probit), and car (for qqPlot).

library(survival)

library(Zelig)

library(car)


LFP <- read.table("Documents/My Teaching/Maximum Likelihood/Data/tobit.txt",header=TRUE)




# Now let's estimate a Tobit model and also save the log likelihood for later testing. The dependent

# variable (WHRS) is the wife's hours worked in 1975. The independent variables are a constant,

# the number of children less than 6 years old (KL6), the number of children between 6 and 18 (K618),

# the wife's age (WA), and the wife's education (WE).


# Currently Zelig does not support the Tobit model. However, it is easily obtained using survreg()

# from the survival package. We will discuss why in the next lecture period. Note that the condition

# WHRS > 0 flags the non-censored observations. The "left" option implies left censoring.




tobit.out <- survreg(Surv(WHRS, WHRS>0, type='left') ~ KL6 + K618 + WA + WE,

                data=LFP, dist='gaussian')



# You can obtain predicted values from the Tobit regression as follows.

WHRShat <- predict(tobit.out,type="response")



# We will also save the log likelihood for the model for later use.

lltobit <- tobit.out$loglik[2]




# However, this prediction assumes that we are interested in predictions of y*. But y* is latent and

# partially determined by a probit. Thus, following Greene, p. 764, the expected value must take into

# account the probability of censoring. First, get the standard deviation of the latent residuals.


SD <- tobit.out$scale



# Now calculate the inverse Mills ratio called lambda

lambda <- dnorm(WHRShat/SD)/pnorm(WHRShat/SD)



# Now calculate the expected value given censoring and X

EYgivenX <- pnorm(WHRShat/SD)*(WHRShat + SD*lambda)
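As a sanity check on this formula, the censored mean pnorm(xb/sd)*(xb + sd*lambda) can be compared against a Monte Carlo average of a censored sample. The numbers below are arbitrary toy values, not estimates from the Mroz data.

```r
# Check E[y | x] under left censoring at zero against a simulated average (toy values).
set.seed(2)
mu0  <- 1.2; s0 <- 2                             # hypothetical linear index and scale
lam0 <- dnorm(mu0 / s0) / pnorm(mu0 / s0)        # inverse Mills ratio
EY0  <- pnorm(mu0 / s0) * (mu0 + s0 * lam0)      # censored-mean formula (Greene, p. 764)
ysim0 <- pmax(rnorm(1e6, mean = mu0, sd = s0), 0)
c(formula = EY0, simulated = mean(ysim0))        # the two agree closely
```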



# McDonald and Moffitt suggest a useful decomposition of the marginal effects associated with the

# censored regression model. They show that a change in the conditional mean due to a right-hand-side

# variable derives from two sources:


# 1) It affects the conditional mean in the uncensored part of the distribution

# 2) It affects the conditional mean by also affecting the probability that an observation will

# lie in the uncensored part of the distribution.
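A useful check on this decomposition: the two pieces sum to the total marginal effect, beta_k * pnorm(xb/sigma), which can also be obtained by differentiating the censored mean numerically. The sketch below uses toy numbers, not Mroz estimates.

```r
# The total marginal effect on E[y] equals beta_k * pnorm(xb/sigma) (toy values).
EYfun <- function(xb, s) pnorm(xb / s) * (xb + s * dnorm(xb / s) / pnorm(xb / s))
xb1 <- 0.8; s1 <- 1.5; bk <- 0.7            # hypothetical index, scale, and coefficient
h <- 1e-6
mfx.num <- (EYfun(xb1 + bk * h, s1) - EYfun(xb1, s1)) / h   # numerical d E[y] / d x_k
mfx.ana <- bk * pnorm(xb1 / s1)                             # analytic total effect
c(mfx.num, mfx.ana)                                         # match to several decimals
```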


xbar <- as.matrix(colMeans(cbind(1,LFP[3:6])))

beta <-coefficients(tobit.out)


BXoverS <- t(beta) %*% xbar/SD

Mu <- dnorm(BXoverS)/pnorm(BXoverS)

P <- pnorm(BXoverS)


P1 <- P*(1-BXoverS*Mu-Mu^2)   # part due to the change in the conditional mean of the uncensored observations


P2 <- dnorm(BXoverS)*BXoverS+dnorm(BXoverS)*Mu   # part due to the change in the probability of being uncensored



# We can also calculate marginal effects on the expected value of the unconditional distribution,

# the expected value of the uncensored distribution, as well as probabilities of being in the

# uncensored part of the distribution as follows.


N <- nrow(LFP)

censored <- sum(LFP$WHRS<=0)

Fz <- (N-censored)/N

z <- qnorm(Fz)

fz <- 1/sqrt(2*pi)*exp(-.5*z^2)

cond <- (1-(z*fz/Fz))-fz^2/Fz^2

uncond <- Fz

prob <- fz/SD

rawcoeff <- beta[2:5]

mfxuncond <- rawcoeff*uncond

mfxuncensored <- rawcoeff*cond

puncensored <- rawcoeff*prob

table <- cbind(rawcoeff,mfxuncond,mfxuncensored,puncensored)



# The censored regression model assumes that the uncensored part of the distribution is normal.

# If it is not, then the estimates are inconsistent.

# We can test this assumption. Drop the censored observations and test for normality.


LFP2 <- subset(LFP, subset=WHRS>0)






qqPlot(LFP2$WHRS, dist= "norm", labels=FALSE)


hist(LFP2$WHRS, probability=T, ylab='Density')

lines(density(LFP2$WHRS), lwd=2)

points(LFP2$WHRS, rep(0, nrow(LFP2)), pch="|")


lines(density(LFP2$WHRS, adjust=.5), lwd=1)


# N is small here, so we cannot be fully confident of these tests. However, there is some

# evidence of non-normality.
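A formal complement to the graphical checks is the Shapiro-Wilk test, which is well suited to small samples like ours. The sketch below uses simulated data (not the Mroz hours) to show how it behaves on skewed versus normal samples of size 50.

```r
# Shapiro-Wilk normality test on simulated small samples (n = 50, as in our subsets).
set.seed(3)
skewed  <- rexp(50, rate = 1/1000)        # right-skewed, hours-like values
normalx <- rnorm(50, 1000, 300)
shapiro.test(skewed)$p.value              # tiny: normality clearly rejected
shapiro.test(normalx)$p.value             # typically large: no evidence against normality
```

shapiro.test(LFP2$WHRS) applies the same test to the uncensored hours.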


# Cragg has suggested that assuming a censoring limit that depends on the same distribution as

# the uncensored observations is often incorrect. He suggests a two equation system in which the

# first equation estimates the probability of being above the censoring limit and the second is a

# truncated regression on the uncensored observations. Below we estimate Cragg's model using

# Probit and a Truncated regression procedure. Also, we do a likelihood ratio test of whether

# Cragg's model differs significantly from the Tobit model.



probit.out <- zelig(LFP ~ KL6 + K618 + WA + WE,  model = "probit", data = LFP)



probitsummary.out <- summary(probit.out)


llprobit <- -.5 * probitsummary.out$deviance
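The deviance trick works because, for Bernoulli data, the saturated log likelihood is zero, so deviance = -2 times the log likelihood. A quick check on simulated data (all names hypothetical):

```r
# For a binary-response probit fit with glm, logLik equals -deviance/2.
set.seed(4)
xg    <- rnorm(200)
ybing <- rbinom(200, 1, pnorm(0.5 + xg))
probit.glm <- glm(ybing ~ xg, family = binomial(link = "probit"))
c(as.numeric(logLik(probit.glm)), -0.5 * probit.glm$deviance)   # identical
```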



# We will write our own procedure for the truncated normal regression on the uncensored observations

# First, create a new dataset using the uncensored observations.

LFPsubset <- subset(LFP, subset=WHRS>0)


x <- cbind(1,LFPsubset[3:6])

y <- as.vector(LFPsubset$WHRS)

K <- ncol(x)

K1 <- K+1

n <- nrow(x)


# Assign the lower truncation limit


L <- 0


# Define the function to be optimized


# Note: the function resets X and Y from the global x and y defined above, so the

# X and Y arguments are placeholders.

llik.ltruncnormal <- function(par,X,Y) {

Y <- as.vector(y)

X <- as.matrix(x)

xbeta <- X%*%par[1:K]

Sig <- par[K1]

  sum( log((1/Sig)*dnorm((Y-xbeta)/Sig))-log(1-pnorm((L-xbeta)/Sig)))

}





# Now let's use the above function to estimate the model.

# Get start values from a regular regression.

regress.out <- lm(WHRS ~ KL6 + K618 + WA + WE, data = LFPsubset)

startvalues <- c(coefficients(regress.out),120)


# Estimate the lower truncated regression.

ltrunc.out <- optim(startvalues,llik.ltruncnormal, method = "BFGS", control = list(trace=1,maxit=100,fnscale = -1),

      hessian = TRUE)



# Calculate standard errors, z statistics, and pvalues and report in a table.

v <- -solve(ltrunc.out$hessian)

se <- sqrt( diag(v))

b <- ltrunc.out$par

zstat <-b/se

pzstat <- 2* (1 - pnorm(abs(zstat)))

table <- cbind(b,se,zstat,pzstat)
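Because hand-written likelihoods are easy to get subtly wrong, it is worth confirming that the truncated-normal likelihood recovers known parameters on simulated data. The sketch below uses an intercept-only version of the same likelihood; all names and numbers are hypothetical.

```r
# Recover known parameters of a lower-truncated normal by ML (simulated data).
set.seed(5)
mu.t <- 2; sig.t <- 3; lim <- 0
ystar.t <- rnorm(20000, mu.t, sig.t)
y.t <- ystar.t[ystar.t > lim]              # keep only draws above the truncation limit
llik.tn <- function(par) {
  m <- par[1]; s <- exp(par[2])            # log parameterization keeps s > 0
  sum(dnorm(y.t, m, s, log = TRUE) -
      pnorm((lim - m) / s, lower.tail = FALSE, log.p = TRUE))
}
fit.tn <- optim(c(0, 0), llik.tn, method = "BFGS", control = list(fnscale = -1))
c(fit.tn$par[1], exp(fit.tn$par[2]))       # close to c(2, 3)
```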



# Get the log likelihood from the truncated regression.


lltruncate <- ltrunc.out$value



# Calculate Cragg's test statistic


lrtest <- 2*((llprobit+lltruncate)-lltobit)



# The restricted model is Tobit. The unrestricted model is the two models estimated

# separately. The test statistic is 14.34, which is chi-squared with 5 degrees of freedom

# for the number of additional parameters being estimated. The critical value is 11.07, so

# we reject the null that the restricted model is true. The two equation approach is therefore

# more appropriate than Tobit.
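The critical value and p-value quoted above come straight from the chi-squared distribution functions:

```r
# Chi-squared critical value and p-value for the likelihood ratio test (5 df).
qchisq(0.95, df = 5)                        # 11.07, the critical value quoted above
pchisq(14.34, df = 5, lower.tail = FALSE)   # p-value for the reported test statistic
```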




# Now let's turn to estimating a model with sample selection bias. In these cases the

# truncation is incidental, due to sample selection on another variable that is correlated

# with the truncation in the dependent variable. As discussed in class, the standard model is

# Heckman's two stage procedure. Once again, there is no "canned" procedure in R to do

# the Heckman model. The easy thing to do in this case would be to use STATA or LIMDEP.

# However, for illustrative purposes we develop our own procedure in R.
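To see the two-stage logic end to end before applying it to the Mroz data, the sketch below simulates a selection model in which the outcome and selection errors are correlated (rho*sigma_e = 0.8 by construction), runs the first-stage probit with glm, and shows that adding the inverse Mills ratio to the second-stage OLS removes the selection bias. All names and numbers here are hypothetical, not the Mroz specification.

```r
# Heckman two-step on simulated data: OLS on the selected sample is biased;
# including the inverse Mills ratio corrects it.
set.seed(6)
ns <- 20000
xs <- rnorm(ns); ws <- rnorm(ns)
us <- rnorm(ns)                              # selection-equation error
es <- 0.8 * us + rnorm(ns, sd = 0.6)         # outcome error, corr(es, us) built in
sels <- (0.5 + ws + us) > 0                  # observe the outcome only when TRUE
ys <- 1 + 2 * xs + es                        # true intercept 1, slope 2

probit.sim <- glm(sels ~ ws, family = binomial(link = "probit"))  # first stage
zhat.sim  <- predict(probit.sim, type = "link")                   # index w'gamma
mills.sim <- dnorm(zhat.sim) / pnorm(zhat.sim)                    # lambda

naive.sim   <- coef(lm(ys[sels] ~ xs[sels]))                      # intercept biased upward
twostep.sim <- coef(lm(ys[sels] ~ xs[sels] + mills.sim[sels]))    # corrected
round(rbind(naive = naive.sim[1:2], twostep = twostep.sim[1:2]), 2)
```

The coefficient on the Mills ratio estimates rho*sigma_e (about 0.8 here); the naive intercept absorbs the selection term, while the two-step intercept returns to 1.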


# First, estimate the first stage probit.

heckprob.out <- zelig(LFP ~ CIT + KL6, model="probit", data=LFP)



summaryheckprob.out <- summary(heckprob.out)


# Now retrieve the index function from the probit

gamma <- as.matrix(summaryheckprob.out$coefficients[,1])

w <- cbind(1,LFP$CIT,LFP$KL6)

z <- w %*% gamma


# Now calculate lambda and delta


LFP$lambda <- dnorm(z)/pnorm(z)

LFP$delta <- LFP$lambda *(LFP$lambda+z)




# Now get the subset of the data that suffers from selection bias


LFPheckman <- subset(LFP, subset=WHRS>0)


# Now run the second stage regression that includes lambda (the Mills ratio) from the probit.


heckman.out <- lm( WHRS ~ KL6 + K618 + WA + WE + lambda, data=LFPheckman)



# The coefficients from this OLS regression are now unbiased due to the inclusion of

# the source of the bias in the equation. However, the standard errors and inferential

# statistics will be incorrect due to not taking into account the sampling variability

# contained in lambda. Also, the residuals are heteroskedastic by definition.


# We need to compute a corrected covariance matrix of estimates for both problems.


# First, get the standard error of the regression corrected for selection.

blambda <- heckman.out$coefficients[6]

deltabar <- mean(LFPheckman$delta)

sigma <- sqrt(((t(heckman.out$residuals) %*% heckman.out$residuals)/nrow(LFPheckman))+ deltabar*blambda^2)


# Next obtain rho in the correction rho*sigma*lambda*wgamma, Greene p. 784

rho2 <- blambda^2/sigma^2

rho <- sqrt(rho2)


# Next obtain an n x n Identity matrix, the Delta matrix, and I-rho^2 Delta.

# These are from Greene, p 785

n <- nrow(LFPheckman)

I <- diag(n)

Delta <- diag(LFPheckman$delta)

IminusRho2Delta <- (I - as.numeric(rho2)*Delta)


# Now calculate the correction factor Q

xstar <- cbind(1,LFPheckman[,2],LFPheckman[,3],LFPheckman[,4],LFPheckman[,5],LFPheckman[,6])

w <- cbind(1,LFPheckman$CIT,LFPheckman$KL6)

Q <- (t(xstar) %*% Delta %*% w) %*% vcov(heckprob.out) %*% (t(w) %*% Delta %*% xstar)


# Now, obtain the corrected asymptotic covariance matrix of coefficients.

vcovheckman <- as.numeric(sigma^2) *(solve(t(xstar) %*% xstar) %*% (t(xstar) %*% IminusRho2Delta %*% xstar + Q) %*% solve(t(xstar) %*% xstar))


# Now produce corrected standard errors, z statistics and pvalues along with the coefficients

se <- sqrt( diag(vcovheckman))

b <- heckman.out$coefficients[1:6]

zstat <-b/se

pzstat <- 2* (1 - pnorm(abs(zstat)))

table <- cbind(b,se,zstat,pzstat)



# Again, this section was illustrative only. You will probably want to use STATA or LIMDEP

# for most Heckman estimations. However, the preceding demonstrates how to

# program your own estimator. If you have an econometrics text with the matrices,

# then R can be used to implement the estimator. As a final note, the preceding

# would have been incorrect if taken directly from the textbook. Greene's formulation

# for delta_i on p. 784 is incorrect. So it pays to have the original

# reference and to check the math against multiple references.