Poisson and Negative Binomial Regression

The purpose of this session is to show you how to use STATA's procedures for count models including Poisson, Negative Binomial zero inflated Poisson, and zero inflated Negative Binomial Regression. We also show how to do various tests for overdispersion and for discriminating between models.

/*This program estimates Poisson and Negative Binomial Regression models using the McCullagh and Nelder data on ship accidents. The variables are:

Type = Ship type

TA, TB, TC, TD, TE = Ship Type indicators

T6064, T6569, T7074, T7579 = Year constructed indicators

O6064, O7579 = Years operated indicators

Months = Measure of service amount

Acc = Accidents. */

/*Now let's read in the data */

use "C:\Documents and Settings\B. Dan Wood\My Documents\My Teaching\Maximum Likelihood\Data\shipaccidents.dta", clear

/* First, let's do some descriptive statistics on the data just to confirm what is obvious from looking at the data. */

summarize

summarize acc, detail

drop if acc==.

/* Note that the data are non-normal. They are both skewed and kurtotic. This implies the need for some model other than a normal distribution. Which model? is the next question. We'll attempt to answer this question later. */

/* Let's begin by doing the estimation using the Poisson model. */

gen logmth=log(months)

poisson acc tb tc td te t6569 t7074 t7579 o7579 logmth

fitstat /*Obtain various fit statistics on the poisson regression*/

listcoef, help /*List the coefficients and standardized coefficients*/

prvalue, x(o7579=0) rest(mean) /*Compute probabilities when o7579=0 and rest at mean */

prchange, help /*Compute various first differences */

mfx compute, at(mean) /*Compute marginal effects at the means */

poisgof /*Goodness of fit for poisson model */

/* McCullagh and Nelder assert that the coefficient on Log(Months) is known to be 1. As with all of the models we have estimated it is possible to restrict coefficients. Do this for the Poisson model as follows. We also save the predicted values and residuals for later testing.*/

constraint define 1 logmth=1

poisson acc tb tc td te t6569 t7074 t7579 o7579 logmth, constraints (1)

mfx compute, at(mean)

/* Generate lambda and errors for later use */

predict lambda, ir

gen err=acc-lambda

/* Now let's test the hypothesis that the period of manufacture has no effect on accidents. First with a Wald statistic. We'll also save the likelihood from this for later use. */

poisson acc tb tc td te t6569 t7074 t7579 o7579 logmth,constraints(1)

lrtest, saving(0)

/*A Wald test that the period of manufacture has no effect on accidents.*/

test t6569 t7074 t7579

/* Now test the same hypothesis with a Likelihood Ratio test. */

poisson acc tb tc td te o7579 logmth,constraints(1)

lrtest

/* McCullagh and Nelder also assert that the Poisson model exhibits overdispersion. Let's test for overdispersion using a regression based approach as suggested by Cameron and Trivedi*/

gen z=((acc-lambda)^2-acc)/(sqrt(2)*lambda)

regress z

regress z lambda, noconstant

/* Since neither t-statistic is significant, there is no evidence of overdispersion in the data using a regression based approach.

We can also test for overdispersion using the more general approach suggested on pp. 743-44 of Greene.*/

gen i=1

gen ei2 = err*err

mkmat ei2

gen vi = ei2 - lambda

mkmat vi

gen vi2 = vi*vi

mkmat vi2

gen eivi = err*vi

mkmat eivi

mkmat tb tc td te t6569 t7074 t7579 o7579 logmth,matrix(zi)

mkmat i tb tc td te t6569 t7074 t7579 o7579 logmth,matrix(x)

matrix mm = zi'*diag(vi2)*zi

matrix dd = x'*diag(ei2)*x

matrix md=zi'*diag(eivi)*x

matrix q = mm - md*inv(dd)*md'

matrix r = zi'*vi

matrix cm = r'*inv(q)*r

matrix list cm

/* CM is a chi square statistic with k degrees of freedom, where k is the number of regressors. Here we reject the null that the variance is unrelated to the the regressors in a way that is not completely accounted for by the expected value of y. In other words, we have evidence of heterogeneity due to the regressors. We also reject the null that the mean equals the variance.

Now let's test for overdispersion using a LaGrange Multiplier test as given by Green on p. 744.This is a test for poisson versus negative binomial.*/

mkmat i

egen ybar=mean(acc)

mkmat ybar

matrix nybar=i'*ybar

mkmat err, matrix(err)

mkmat lambda,matrix(lambda)

matrix lmtest=((err'*err-nybar)*(err'*err-nybar))*inv(2*lambda'*lambda)

matrix list lmtest

/* The LaGrange Multiplier statistic is Chi Squared with 1 degree of freedom. The test is non-significant, so no evidence of overdispersion. */

/* We can also test for overdispersion in the context of the Negative Binomial model. If the heterogeneity parameter is significant then this is evidence for overdispersion.*/

nbreg acc tb tc td te t6569 t7074 t7579 o7579 logmth,constraints(1)

/*The parameter alpha is not significant, so no evidence for overdispersion. We would not want to use the negative binomial for these data.*/

/* A way to deal with heterogeneity and the overdispersion problem is to estimate a so-called zero inflated model. However, this should only be used when there is reason to suspect multiple regimes in the data. We can estimate a zero inflated Poisson model, obtain the Vuong test, and obtain measures of fit, and effects. The default binary distribution is the logit. Specify probit option to get probit. */

zip acc tb tc td te t6569 t7074 t7579 o7579 logmth, offset(logmth) inflate(o7579) vuong

fitstat /*Obtain various fit statistics on the zip regression*/

listcoef, help /*List the coefficients and standardized coefficients*/

prvalue, x(o7579=0) rest(mean) /*Compute probabilities when o7579=0 and rest at mean */

prchange, help /*Compute various first differences */

mfx compute, at(mean) /*Compute marginal effects at the means */

/* We can also estimate a zero inflated Negative Binomial model, obtain the Vuong test, and obtain measures of fit, and effects. The default binary distribution is the logit. Specify probit to get probit. The zip option for the zinb procedure gives a test for zip vs. zinb. */

zinb acc tb tc td te t6569 t7074 t7579 o7579 logmth, offset(logmth) inflate(o7579) vuong zip

fitstat /*Obtain various fit statistics on the zip regression*/

listcoef, help /*List the coefficients and standardized coefficients*/

prvalue, x(o7579=0) rest(mean) /*Compute probabilities when o7579=0 and rest at mean */

prchange, help /*Compute various first differences */

mfx compute, at(mean) /*Compute marginal effects at the means */

/* Note that there is a problem with the zero inflated Negative Binomial model probabilities, confirming what we found above concerning the inappropriateness of the negative binomial. */