MAS8381: Statistics for Big Data

Introduction to Bayesian Poisson regression


Introduction

The main idea behind Bayesian modelling (and model-based statistical inference more generally) is that the models you fit to data should be as similar as possible to the models you would use to simulate data like the data you want to model. In this lab we will use a simple example to illustrate this point.

First, let's consider the problem of understanding how the number of days of snow in a particular winter in the UK varies with latitude. We could imagine recording the number of days of snow at a number of different locations with known latitude. We are only interested in how the snow varies, not in the locations we choose to measure at, so we will consider the locations to be fixed and known and just think about the count of snow days. We could obviously model this as a Gaussian random variable, but we know that the Poisson distribution is a much better probability model for a count, so we would prefer to use that. The idea is then that there will be a Poisson distributed random variable at each location where we make a measurement.

So now we model Y[i] ~ Po(mu[i]), where Y[i] is the count we observe at site i and mu[i] is the mean at site i.

So then we need to think about how the parameter(s) of the probability distribution vary with location. The Poisson distribution has just one parameter, its mean, so that is what we need to model.

We might be tempted to model it as a linear function of latitude, but that is problematic, since the linear function is not guaranteed to be positive, and a Poisson random quantity can't have a negative mean. Instead, it is better to model the log of the mean with a linear predictor. That way the mean will be the exponential of a linear predictor, and so guaranteed to be positive.
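To see the issue concretely, here is a quick illustration (the coefficient values below are arbitrary, chosen just to make the point):

# a linear function of latitude can go negative, but its exponential cannot
lat=seq(0,60,10)
linear=-3+0.1*lat
range(linear)      # includes negative values - not a valid Poisson mean
range(exp(linear)) # strictly positive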

With this in mind we put eta[i] = log(mu[i]) (and mu[i] = exp(eta[i])) and then model eta[i] with a linear predictor as eta[i] = beta[0] + beta[1]x[i], where x[i] is the latitude at site i, and the beta are the regression coefficients of inferential interest.

Simulating data

We now have enough to simulate some data for a selection of locations (don't worry about how I chose the regression coefficients - I just chose them to be vaguely realistic):

n=200
x=runif(n,50,60) # latitudes of the n sites
eta=-3+0.1*x # linear predictor
mu=exp(eta) # Poisson mean at each site
y=rpois(n,mu) # simulated counts of snow days
plot(x,y)

Frequentist inference

So now, we want to know if we could figure out what regression parameters would lead to data of this form. This model is not linear, so simple linear modelling methods cannot be used to fit it. In fact, it falls into a class of models known as generalised linear models (GLMs), of which more later. It turns out that we can fit this model using a maximum likelihood approach with the glm function as follows:

glm(y~x,family=poisson())

We see that it does a reasonable but imperfect job of inferring the parameters used to simulate the data.
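If we store the fitted model, we can compare the estimated coefficients, and approximate 95% confidence intervals, against the true values used in the simulation (a minimal sketch):

fit=glm(y~x,family=poisson())
coef(fit)    # compare with the true values, -3 and 0.1
confint(fit) # approximate 95% confidence intervals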

Bayesian inference

Even without knowing about generalised linear models, we can fit this model (with fairly vague priors for the regression coefficients) using JAGS as follows.

require(rjags)
data=list(x=x,y=y,n=n)
init=list(beta0=0,beta1=0)
modelstring="
  model {
    for (i in 1:n) {
      mu[i]<-exp(beta0+beta1*x[i]) # log link
      y[i]~dpois(mu[i])
    }
    # JAGS parameterises dnorm by precision (1/variance), so these are vague priors
    beta0~dnorm(0,0.001)
    beta1~dnorm(0,0.01)
  }
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
# burn-in
update(model,n.iter=1000)
# monitoring run
output=coda.samples(model=model,variable.names=c("beta0","beta1"),n.iter=100000,thin=100)
print(summary(output))
plot(output)
# use mcmcSummary
require(smfsb)
mcmcSummary(as.matrix(output),rows=2)

This runs, and works, and clearly identifies the regression parameters quite well, though there is considerable remaining posterior uncertainty about both. The mixing of the MCMC chain isn't great, however - really it should be run for at least 10 times longer.
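One quick way to quantify the mixing is to look at the effective sample size and autocorrelation of the chain, using functions from the coda package (which rjags loads):

effectiveSize(output) # effective number of independent samples per parameter
autocorr.diag(output) # autocorrelation of the (thinned) chain at a few lags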

  1. Try re-running the model using just 20 observations. How does this affect posterior uncertainty?
  2. Now try it with 1000 observations - does this nail down the parameters sufficiently well?
  3. How does the running time of the MCMC algorithm vary with sample size? What about the mixing of the chain?
  4. Monitor the mu[i] as well as the beta, and overlay the posterior means of the mu[i] on a plot of the data (one possible approach is sketched below).
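For exercise 4, one possible approach is to re-run the monitoring step with mu added to the monitored variables (a minimal sketch - variable names here are illustrative):

output2=coda.samples(model=model,variable.names=c("beta0","beta1","mu"),n.iter=100000,thin=100)
post=as.matrix(output2)
muMean=colMeans(post)[paste0("mu[",1:n,"]")] # posterior mean of each mu[i]
plot(x,y)
points(x,muMean,pch=19,col=2) # overlay posterior means on the data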

Darren J Wilkinson