Poisson and Negative Binomial Regression


# The purpose of this session is to show you how to use R's procedures for count models

# including Poisson and Negative Binomial Regression. We also show how to do various tests

# for overdispersion and for discriminating between models.






# Now let's read in the data. Note that there are some missing data defined by ".". R expects

# missing data to be coded NA. So we have changed the missing value code on input here.


rm(list = ls())

ShipAccidents <- read.table("C:/users/Wood/Documents/My Teaching/Maximum Likelihood/Data/ShipAccidents.txt",

    header=TRUE, na.strings=".")



# Let's go ahead and remove the missing values from the dataset.



ShipAccidents <- na.omit(ShipAccidents)



# First let's do some descriptive statistics on the data just to confirm what is

# obvious from looking at the data (there are a lot of zeros).







qqPlot(Acc, dist= "norm", labels=FALSE)


# These are discrete data, so let's do a histogram of their distribution.

# And, just for fun, let's also superimpose non-parametric regression lines.


hist(Acc, probability=T, ylab='Density')

lines(density(Acc), lwd=2)

points(Acc, rep(0, length(Acc)), pch="|")


lines(density(Acc, adjust=.5), lwd=1)


# Note that the discrete data are non-normal and cannot be approximated by a normal distribution.

# They are both skewed and kurtotic. This implies the need for some model other than a normal

# distribution. Which model? is the next question. We'll attempt to answer this question later.


# Let's begin by doing the estimation using the Poisson model. Again, we use the Zelig interface

# to R. Replicating McCullaugh and Nelder, we include months of service logged in the analysis


ShipAccidents$Logmonths <- log(Months)





poisson.out <- zelig(Acc ~ TB + TC + TD + TE + T6569+ T7074 + T7579 + O7579 + Logmonths,

    model = "poisson", data = ShipAccidents)



poissonsummary.out <- summary(poisson.out)


# McCullagh and Nelder assert that the coefficient on Logmonths is known to be 1. Let's do a

# Wald test for whether the coefficient on Logmonths is consistent with their assertion.


b <- poisson.out$coefficients

v <- poissonsummary.out$cov.unscaled

r <- rbind(c(0,0,0,0,0,0,0,0,0,1))

q <- 1

j <- nrow(r)

Wald <- t(r%*%b-q) %*% solve (r %*% v %*% t(r)) %*% (r%*%b-q)


Waldpvalue <- 1-pchisq(Wald, df=j)



# Based on the Wald statistic we are unable to reject the null that the coefficients on Logmonths

# is one.


# Note, McCullaugh and Nelder assert that the coefficient on Logmonths is known

# to be 1. In a sense, they cheat by not taking into account the sampling variability

# of this coefficient. However, to replicate their results we need to fix the coefficient

# on Logmonths at 1. This can be done using the offset command. Note that you can also

# restrict to other values than 1 by making a scalar transformation of the variable to

# be restricted. Note that R does not report a coefficient for the fixed parameter.


poissonmn.out <- zelig(Acc ~ TB + TC + TD + TE + T6569+ T7074 + T7579 + O7579 + offset(Logmonths),

    model = "poisson", data = ShipAccidents)



poissonmnsummary.out <- summary(poissonmn.out)


# Let's do some interpretation of the results. First, the hard way.

# Get the vector of coefficients and data matrix


x <- as.matrix(cbind(1,ShipAccidents[3:6],ShipAccidents[8:10],ShipAccidents[12:12],


y <- as.vector(Acc)

beta <- c(poissonmn.out$coefficients,1)



# Predicted accidents for each observation.


lambda <- exp(x %*% beta)



# Predicted accidents with all independent variables at their means.


xbar <- as.matrix(mean(cbind(1,ShipAccidents[3:6],ShipAccidents[8:10],ShipAccidents[12:12],


Expectation <- exp(t(xbar) %*% beta)



# We could use the preceding to do a first differences analysis. However, let's use Zelig

# to do the work for us. First, set all variables at their means.


means.out <- setx(poisson.out)



# Simulate expected value as above. Compare the results to the preceding.


fitted.out <- sim(poisson.out, x = means.out)



# Simulated first difference when shifting Logmonths one standard deviation above the mean


Logmonths.low <- setx(poisson.out, Logmonths = mean(Logmonths))

Logmonths.high <- setx(poisson.out, Logmonths = mean(Logmonths) + sd(Logmonths))

fd.Logmonth <- sim(poisson.out, x = Logmonths.low, x1 = Logmonths.high)



# Now let's test the hypothesis that the period of manufacture has no effect on accidents.

# Let's use a Wald statistic.


r <- rbind(c(0,0,0,0,0,1,0,0,0),c(0,0,0,0,0,0,1,0,0),c(0,0,0,0,0,0,0,1,0))

q <- c(0,0,0)

b <- poissonmn.out$coefficients

v <- poissonmnsummary.out$cov.unscaled

j <- nrow(r)

Wald <- t(r%*%b-q) %*% solve (r %*% v %*% t(r)) %*% (r%*%b-q)


Waldpvalue <- 1-pchisq(Wald, df=j)



# We can reject the null hypothesis that the period of manufacture has no effect.


# We can also do this test using a likelihood ratio approach. Re-estimate the model leaving out

# the period effects.


poissonR.out <- zelig(Acc ~ TB + TC + TD + TE + O7579 + Logmonths,

    model = "poisson", data = ShipAccidents)




# Now construct the likelihood ratio test


j <- poissonR.out$df.residual - poissonmn.out$df.residual


lu <- -.5*poissonmn.out$deviance

lr <- -.5*poissonR.out$deviance

lrtest <- -2*(lr - lu)


lrpvalue <- 1-pchisq(lrtest, df=j)



# We generated lambda above as the model predictions. Let's also generate the set of errors.


err <- Acc-lambda



# McCullagh and Nelder also assert that the Poisson model exhibits overdispersion. Let's test

# for overdispersion using a regression based approach as suggested by Cameron and Trivedi


z <- ((Acc-lambda)^2-Acc)/(sqrt(2)*lambda)


# First the regression on a constant


test1 <- lm(z ~ 1)



# Now the regression without a constant


test2 <- lm(z ~ lambda -1)



# Since neither t-statistic is significant, there is no evidence of overdispersion in the data

# using a regression based approach.


# We can also test for overdispersion using the more general approach suggested on pp. 743-44

# of Greene.


ei2 <- as.vector(err*err)

vi <- as.vector(ei2 - lambda)

vi2 <- as.vector(vi*vi)

eivi <- as.vector(err*vi)

zi <- cbind(TB,TC,TD,TE,T6569,T7074,T7579,O7579,Logmonths)

mm <- t(zi) %*% diag(vi2) %*% zi

dd <- t(x) %*% diag(ei2) %*% x

md <- t(zi) %*% diag(eivi) %*% x

q <- mm - md %*% solve(dd) %*% t(md)

r <- t(zi) %*% vi

cm <- t(r) %*% solve(q) %*% r



# CM is a chi square statistic with k degrees of freedom, where k is the number of regressors.

# Here we reject the null that the variance is unrelated to the the regressors in a way that is not

# completely accounted for by the expected value of y. In other words, we have evidence

# of heterogeneity due to the regressors. We also reject the null that the mean equals the variance.


# Now let's test for overdispersion using a LaGrange Multiplier test as given by

# Green on p. 744.This is a test for poisson versus negative binomial.




ybar <- mean(Acc)

nybar <- nrow(x) * ybar

lmtest <- (t(err) %*% err-nybar)^2/(2*t(lambda) %*% lambda)



# The LaGrange Multiplier statistic is Chi Squared with 1 degree of freedom.  The test is

# non-significant, so no evidence of overdispersion for this test.


# We can also test for overdispersion in the context of the Negative Binomial model. If

# the reported heterogeneity parameter is significant then this is evidence for overdispersion.


startvalues <- poissonmn.out$coefficients

negbinmn.out <- zelig(Acc ~ TB + TC + TD + TE + T6569+ T7074 + T7579 + O7579 + offset(Logmonths),

    model = "negbin", start = startvalues, maxit=100, data = ShipAccidents)



# In this case the estimation does not converge and no overdispersion parameter is reported.

# This does not bode well for McCullaugh and Nelder's analysis.


# All of the preceding interpretational tools using Zelig can also be applied to the

# negative binomial model in straight forward fashion. We leave that as an exercise.


# There are library packages in R that will also estimate other models of heterogeneity

# in count processes such as the zero inflated poisson or negative binomial.