Poisson and Negative
Binomial Regression
The
purpose of this session is to show you how to use R's procedures for count
models including Poisson adn Negative Binomial
Regression. We also show how to do various tests for overdispersion
and for discriminating between models.
#
We use McCullagh and Nelder's
data on ship accidents. The variables
are:
#
Type = Ship type
#
TA, TB, TC, TD, TE = Ship Type indicators
#
T6064, T6569, T7074, T7579 = Year constructed indicators
#
O6074, O7579 = Years operated indicators
#
Months = Measure of service amount
#
Acc = Accidents.
library(car)
library(stats)
library(MASS)
#
Now let's read in the data. Note that there are some missing data defined by
".". R expects
#
missing data to be coded NA. So we have changed the
missing value code on input here.
rm(list = ls())
ShipAccidents <- read.table("Documents/My
Teaching/Maximum Likelihood/Data/ShipAccidents.txt",
header=TRUE, na.strings=".")
attach(ShipAccidents)
#
Let's go ahead and remove the missing values from the dataset.
detach(ShipAccidents)
ShipAccidents <- na.omit(ShipAccidents)
attach(ShipAccidents)
#
First let's do some descriptive statistics on the data just to confirm what is
#
obvious from looking at the data (there are a lot of
zeros).
summary(ShipAccidents)
library(e1071)
skewness(Acc)
kurtosis(Acc)
qqPlot(Acc,
dist= "norm", labels=FALSE)
#
These are discrete data, so let's do a histogram of their distribution.
#
And, just for fun, let's also superimpose non-parametric regression lines.
hist(Acc,
probability=T, ylab='Density')
lines(density(Acc), lwd=2)
points(Acc, rep(0, length(Acc)), pch="|")
box()
lines(density(Acc,
adjust=.5), lwd=1)
# Note that the discrete data are non-normal and
cannot be approximated by a normal distribution.
#
They are both skewed and kurtotic. This implies the
need for some model other than a normal
#
distribution. Which model? is
the next question. We'll attempt to answer this question later.
#
Let's begin by doing the estimation using the Poisson model. Again, we use the Zelig interface
#
to R. Replicating McCullaugh
and Nelder, we include months of service logged in
the analysis
ShipAccidents$Logmonths <- log(Months)
detach(ShipAccidents)
attach(ShipAccidents)
library(Zelig)
poisson.out <- zelig(Acc ~ TB + TC + TD + TE +
T6569+ T7074 + T7579 + O7579 + Logmonths,
model = "poisson", data = ShipAccidents)
summary(poisson.out)
poissonsummary.out <- summary(poisson.out)
#
McCullagh and Nelder assert
that the coefficient on Logmonths is known to be 1.
Let's do a
#
Wald test for whether the coefficient on Logmonths is
consistent with their assertion.
b <- poissonsummary.out$coefficients[,1]
v <- poissonsummary.out$cov.unscaled
r <- rbind(c(0,0,0,0,0,0,0,0,0,1))
q <- 1
j <- nrow(r)
Wald
<- t(r%*%b-q) %*% solve (r %*% v %*% t(r)) %*% (r%*%b-q)
Wald
Waldpvalue <- 1-pchisq(Wald,
df=j)
Waldpvalue
#
Based on the Wald statistic we are unable to reject the null that the
coefficients on Logmonths
#
is one.
#
Note, McCullaugh and Nelder
assert that the coefficient on Logmonths is known
#
to be 1. In a sense, they cheat by not taking into
account the sampling variability
#
of this coefficient. However, to replicate their
results we need to fix the coefficient
#
on Logmonths at 1. This can
be done using the offset command. Note that you can also
#
restrict to other values than 1 by making a scalar
transformation of the variable to
#
be restricted. Note that R does not report a
coefficient for the fixed parameter.
poissonmn.out <- zelig(Acc ~ TB + TC + TD + TE +
T6569+ T7074 + T7579 + O7579 + offset(Logmonths),
model = "poisson", data = ShipAccidents)
summary(poissonmn.out)
poissonmnsummary.out <- summary(poissonmn.out)
#
Let's do some interpretation of the results. First, the hard
way.
#
Get the vector of coefficients and data matrix
x <- as.matrix(cbind(1,ShipAccidents[3:6],ShipAccidents[8:10],ShipAccidents[12:12],
ShipAccidents[15:15]))
y <- as.vector(Acc)
beta <- c(poissonmnsummary.out$coefficients[,1],1)
beta
# Predicted accidents for each observation.
lambda <- exp(x %*% beta)
lambda
# Predicted accidents with all independent
variables at their means.
xbar <- as.matrix(colMeans(cbind(1,ShipAccidents[3:6],ShipAccidents[8:10],ShipAccidents[12:12],
ShipAccidents[15:15])))
Expectation
<- exp(t(xbar) %*% beta)
Expectation
#
We could use the preceding to do a first differences
analysis. However, let's use Zelig
#
to do the work for us. First, set all variables at
their means.
means.out <- setx(poisson.out)
means.out
#
Simulate expected value as above. Compare the results to the preceding.
fitted.out <- sim(poisson.out, x = means.out)
summary(fitted.out)
#
Simulated first difference when shifting Logmonths
one standard deviation above the mean
Logmonths.low <- setx(poisson.out, Logmonths = mean(Logmonths))
Logmonths.high <- setx(poisson.out, Logmonths = mean(Logmonths) + sd(Logmonths))
fd.Logmonth <- sim(poisson.out, x = Logmonths.low, x1 = Logmonths.high)
summary(fd.Logmonth)
#
Now let's test the hypothesis that the period of manufacture has no effect on
accidents.
#
Let's use a Wald statistic.
r <- rbind(c(0,0,0,0,0,1,0,0,0),c(0,0,0,0,0,0,1,0,0),c(0,0,0,0,0,0,0,1,0))
q <- c(0,0,0)
b <- poissonmnsummary.out$coefficients[,1]
v <- poissonmnsummary.out$cov.unscaled
j <- nrow(r)
Wald
<- t(r%*%b-q) %*% solve (r %*% v %*% t(r)) %*% (r%*%b-q)
Wald
Waldpvalue <- 1-pchisq(Wald,
df=j)
Waldpvalue
#
We can reject the null hypothesis that the period of manufacture has no effect.
#
We can also do this test using a likelihood ratio approach. Re-estimate the
model leaving out
#
the period effects.
poissonR.out <- zelig(Acc ~ TB + TC + TD + TE +
O7579 + Logmonths,
model = "poisson", data = ShipAccidents)
summary(poissonR.out)
names(poissonR.out)
poissonRsummary.out <- summary(poissonR.out)
#
Now construct the likelihood ratio test
j <- poissonRsummary.out$df.residual
- poissonmnsummary.out$df.residual
j
lu <- -.5*poissonmnsummary.out$deviance
lr <- -.5*poissonRsummary.out$deviance
lrtest <- -2*(lr - lu)
lrtest
lrpvalue <- 1-pchisq(lrtest, df=j)
lrpvalue
#
We generated lambda above as the model predictions. Let's also generate the set
of errors.
err <- Acc-lambda
err
#
McCullagh and Nelder also
assert that the Poisson model exhibits overdispersion.
Let's test
#
for overdispersion using a
regression based approach as suggested by Cameron and Trivedi
z <- ((Acc-lambda)^2-Acc)/(sqrt(2)*lambda)
#
First the regression on a constant
test1 <- lm(z ~ 1)
summary(test1)
#
Now the regression without a constant
test2 <- lm(z ~ lambda -1)
summary(test2)
#
Since neither t-statistic is significant, there is no evidence of overdispersion in the data
#
using a regression based approach.
#
We can also test for overdispersion using the more
general approach suggested on pp. 743-44
#
of Greene.
ei2 <- as.vector(err*err)
vi <- as.vector(ei2 -
lambda)
vi2 <- as.vector(vi*vi)
eivi <- as.vector(err*vi)
zi <- cbind(TB,TC,TD,TE,T6569,T7074,T7579,O7579,Logmonths)
mm <- t(zi) %*% diag(vi2) %*% zi
dd <- t(x) %*% diag(ei2) %*% x
md <- t(zi) %*% diag(eivi) %*% x
q <- mm - md %*% solve(dd)
%*% t(md)
r <- t(zi) %*% vi
cm
<- t(r) %*% solve(q) %*% r
cm
#
CM is a chi square statistic with k degrees of freedom, where k is the number
of regressors.
#
Here we reject the null that the variance is unrelated to the the regressors in a way that is
not
#
completely accounted for by the expected value of y.
In other words, we have evidence
#
of heterogeneity due to the regressors.
We also reject the null that the mean equals the variance.
#
Now let's test for overdispersion using a LaGrange
Multiplier test as given by
#
Green on p. 744.This is a test for poisson versus
negative binomial.
mean(Acc)
ybar <- mean(Acc)
nybar <- nrow(x) * ybar
lmtest <- (t(err) %*% err-nybar)^2/(2*t(lambda) %*% lambda)
lmtest
#
The LaGrange Multiplier statistic is Chi Squared with 1 degree of freedom. The test is
#
non-significant, so no evidence of overdispersion
for this test.
#
We can also test for overdispersion in the context of
the Negative Binomial model. If
#
the reported heterogeneity parameter is significant
then this is evidence for overdispersion.
startvalues <- poissonmn.out$coefficients
negbinmn.out <- zelig(Acc ~ TB + TC + TD + TE +
T6569+ T7074 + T7579 + O7579 + offset(Logmonths),
model = "negbinom", start = startvalues,
maxit=100, data = ShipAccidents)
summary(negbinmn.out)
#
The pscl library does zero inflated models in R.
Below is code to use that library on the ship accidents data.
library(pscl)
zinp <- zeroinfl(Acc ~ TB + TC + TD + TE
+ T6569+ T7074 + T7579 + O7579 + Logmonths | Logmonths, dist=c("poisson"), link = c("logit"))
summary(zinp)