Poisson and Negative Binomial Regression

The purpose of this session is to show you how to use R's procedures for count models including Poisson adn Negative Binomial Regression. We also show how to do various tests for overdispersion and for discriminating between models.

# We use McCullagh and Nelder's data on ship accidents. The variables are:

# Type = Ship type

# TA, TB, TC, TD, TE = Ship Type indicators

# T6064, T6569, T7074, T7579 = Year constructed indicators

# O6074, O7579 = Years operated indicators

# Months = Measure of service amount

# Acc = Accidents.

library(car)

library(stats)

library(MASS)

# Now let's read in the data. Note that there are some missing data defined by ".". R expects

# missing data to be coded NA. So we have changed the missing value code on input here.

rm(list = ls())

ShipAccidents <- read.table("Documents/My Teaching/Maximum Likelihood/Data/ShipAccidents.txt",

header=TRUE, na.strings=".")

attach(ShipAccidents)

# Let's go ahead and remove the missing values from the dataset.

detach(ShipAccidents)

ShipAccidents <- na.omit(ShipAccidents)

attach(ShipAccidents)

# First let's do some descriptive statistics on the data just to confirm what is

# obvious from looking at the data (there are a lot of zeros).

summary(ShipAccidents)

library(e1071)

skewness(Acc)

kurtosis(Acc)

qqPlot(Acc, dist= "norm", labels=FALSE)

# These are discrete data, so let's do a histogram of their distribution.

# And, just for fun, let's also superimpose non-parametric regression lines.

hist(Acc, probability=T, ylab='Density')

lines(density(Acc), lwd=2)

points(Acc, rep(0, length(Acc)), pch="|")

box()

lines(density(Acc, adjust=.5), lwd=1)

# Note that the discrete data are non-normal and cannot be approximated by a normal distribution.

# They are both skewed and kurtotic. This implies the need for some model other than a normal

# distribution. Which model? is the next question. We'll attempt to answer this question later.

# Let's begin by doing the estimation using the Poisson model. Again, we use the Zelig interface

# to R. Replicating McCullaugh and Nelder, we include months of service logged in the analysis

ShipAccidents$Logmonths <- log(Months)

detach(ShipAccidents)

attach(ShipAccidents)

library(Zelig)

poisson.out <- zelig(Acc ~ TB + TC + TD + TE + T6569+ T7074 + T7579 + O7579 + Logmonths,

model = "poisson", data = ShipAccidents)

summary(poisson.out)

poissonsummary.out <- summary(poisson.out)

# McCullagh and Nelder assert that the coefficient on Logmonths is known to be 1. Let's do a

# Wald test for whether the coefficient on Logmonths is consistent with their assertion.

b <- poissonsummary.out$coefficients[,1]

v <- poissonsummary.out$cov.unscaled

r <- rbind(c(0,0,0,0,0,0,0,0,0,1))

q <- 1

j <- nrow(r)

Wald <- t(r%*%b-q) %*% solve (r %*% v %*% t(r)) %*% (r%*%b-q)

Wald

Waldpvalue <- 1-pchisq(Wald, df=j)

Waldpvalue

# Based on the Wald statistic we are unable to reject the null that the coefficients on Logmonths

# is one.

# Note, McCullaugh and Nelder assert that the coefficient on Logmonths is known

# to be 1. In a sense, they cheat by not taking into account the sampling variability

# of this coefficient. However, to replicate their results we need to fix the coefficient

# on Logmonths at 1. This can be done using the offset command. Note that you can also

# restrict to other values than 1 by making a scalar transformation of the variable to

# be restricted. Note that R does not report a coefficient for the fixed parameter.

poissonmn.out <- zelig(Acc ~ TB + TC + TD + TE + T6569+ T7074 + T7579 + O7579 + offset(Logmonths),

model = "poisson", data = ShipAccidents)

summary(poissonmn.out)

poissonmnsummary.out <- summary(poissonmn.out)

# Let's do some interpretation of the results. First, the hard way.

# Get the vector of coefficients and data matrix

x <- as.matrix(cbind(1,ShipAccidents[3:6],ShipAccidents[8:10],ShipAccidents[12:12],

ShipAccidents[15:15]))

y <- as.vector(Acc)

beta <- c(poissonmnsummary.out$coefficients[,1],1)

beta

# Predicted accidents for each observation.

lambda <- exp(x %*% beta)

lambda

# Predicted accidents with all independent variables at their means.

xbar <- as.matrix(colMeans(cbind(1,ShipAccidents[3:6],ShipAccidents[8:10],ShipAccidents[12:12],

ShipAccidents[15:15])))

Expectation <- exp(t(xbar) %*% beta)

Expectation

# We could use the preceding to do a first differences analysis. However, let's use Zelig

# to do the work for us. First, set all variables at their means.

means.out <- setx(poisson.out)

means.out

# Simulate expected value as above. Compare the results to the preceding.

fitted.out <- sim(poisson.out, x = means.out)

summary(fitted.out)

# Simulated first difference when shifting Logmonths one standard deviation above the mean

Logmonths.low <- setx(poisson.out, Logmonths = mean(Logmonths))

Logmonths.high <- setx(poisson.out, Logmonths = mean(Logmonths) + sd(Logmonths))

fd.Logmonth <- sim(poisson.out, x = Logmonths.low, x1 = Logmonths.high)

summary(fd.Logmonth)

# Now let's test the hypothesis that the period of manufacture has no effect on accidents.

# Let's use a Wald statistic.

r <- rbind(c(0,0,0,0,0,1,0,0,0),c(0,0,0,0,0,0,1,0,0),c(0,0,0,0,0,0,0,1,0))

q <- c(0,0,0)

b <- poissonmnsummary.out$coefficients[,1]

v <- poissonmnsummary.out$cov.unscaled

j <- nrow(r)

Wald <- t(r%*%b-q) %*% solve (r %*% v %*% t(r)) %*% (r%*%b-q)

Wald

Waldpvalue <- 1-pchisq(Wald, df=j)

Waldpvalue

# We can reject the null hypothesis that the period of manufacture has no effect.

# We can also do this test using a likelihood ratio approach. Re-estimate the model leaving out

# the period effects.

poissonR.out <- zelig(Acc ~ TB + TC + TD + TE + O7579 + Logmonths,

model = "poisson", data = ShipAccidents)

summary(poissonR.out)

names(poissonR.out)

poissonRsummary.out <- summary(poissonR.out)

# Now construct the likelihood ratio test

j <- poissonRsummary.out$df.residual - poissonmnsummary.out$df.residual

lu <- -.5*poissonmnsummary.out$deviance

lr <- -.5*poissonRsummary.out$deviance

lrtest <- -2*(lr - lu)

lrtest

lrpvalue <- 1-pchisq(lrtest, df=j)

lrpvalue

# We generated lambda above as the model predictions. Let's also generate the set of errors.

err <- Acc-lambda

err

# McCullagh and Nelder also assert that the Poisson model exhibits overdispersion. Let's test

# for overdispersion using a regression based approach as suggested by Cameron and Trivedi

z <- ((Acc-lambda)^2-Acc)/(sqrt(2)*lambda)

# First the regression on a constant

test1 <- lm(z ~ 1)

summary(test1)

# Now the regression without a constant

test2 <- lm(z ~ lambda -1)

summary(test2)

# Since neither t-statistic is significant, there is no evidence of overdispersion in the data

# using a regression based approach.

# We can also test for overdispersion using the more general approach suggested on pp. 743-44

# of Greene.

ei2 <- as.vector(err*err)

vi <- as.vector(ei2 - lambda)

vi2 <- as.vector(vi*vi)

eivi <- as.vector(err*vi)

zi <- cbind(TB,TC,TD,TE,T6569,T7074,T7579,O7579,Logmonths)

mm <- t(zi) %*% diag(vi2) %*% zi

dd <- t(x) %*% diag(ei2) %*% x

md <- t(zi) %*% diag(eivi) %*% x

q <- mm - md %*% solve(dd) %*% t(md)

r <- t(zi) %*% vi

cm <- t(r) %*% solve(q) %*% r

# CM is a chi square statistic with k degrees of freedom, where k is the number of regressors.

# Here we reject the null that the variance is unrelated to the the regressors in a way that is not

# completely accounted for by the expected value of y. In other words, we have evidence

# of heterogeneity due to the regressors. We also reject the null that the mean equals the variance.

# Now let's test for overdispersion using a LaGrange Multiplier test as given by

# Green on p. 744.This is a test for poisson versus negative binomial.

mean(Acc)

ybar <- mean(Acc)

nybar <- nrow(x) * ybar

lmtest <- (t(err) %*% err-nybar)^2/(2*t(lambda) %*% lambda)

lmtest

# The LaGrange Multiplier statistic is Chi Squared with 1 degree of freedom. The test is

# non-significant, so no evidence of overdispersion for this test.

# We can also test for overdispersion in the context of the Negative Binomial model. If

# the reported heterogeneity parameter is significant then this is evidence for overdispersion.

startvalues <- poissonmn.out$coefficients

negbinmn.out <- zelig(Acc ~ TB + TC + TD + TE + T6569+ T7074 + T7579 + O7579 + offset(Logmonths),

model = "negbinom", start = startvalues, maxit=100, data = ShipAccidents)

summary(negbinmn.out)

# The pscl library does zero inflated models in R. Below is code to use that library on the ship accidents data.

library(pscl)

zinp <- zeroinfl(Acc ~ TB + TC + TD + TE + T6569+ T7074 + T7579 + O7579 + Logmonths | Logmonths, dist=c("poisson"), link = c("logit"))

summary(zinp)