Intrinsic Conditional Autoregressive (CAR) model – Lip Cancer in PyMC








The autoregression continues this week with an intrinsic conditional autoregressive, or CAR model, in PyMC. There are many, many ways to introduce spatial effects into a modelling framework; due to the availability of cheap MCMC sampling, particularly in WinBUGS, CAR models have been widely used over the past decade to structure the spatial dependence or clustering of gridded data.

The canonical example is in estimating the log-relative risk of lip cancer among 56 districts in Scotland between 1975 and 1980; the original data come from (1, Kemp et al. 1985). Disease mapping is perhaps the oldest form of epidemiology, including the famous cholera map of Dr. John Snow that helped him figure out the bacteria was waterborne, rather than being from ‘bad air.’ In the Scottish case, there is some interest in how lip cancer rates relate to the proportion of the population engaged in agriculture, forestry, or fishing (AFF), as exposure to sunlight is a risk factor.

The data consist of observed (O) cases of lip cancer, and expected (E) rates, based on the age of the local population:

import numpy as np

county = np.array(["skye_lochalsh", "banff_buchan", "caithness,berwickshire", "ross_cromarty", "orkney", "moray", "shetland", "lochaber", "gordon", "western_isles", "sutherland", "nairn", "wigtown", "NE.fife", "kincardine", "badenoch", "ettrick", "inverness", "roxburgh", "angus", "aberdeen", "argyll_bute", "clydesdale", "kirkcaldy", "dunfermline", "nithsdale", "east_lothian", "perth_kinross", "west_lothian", "cumnock_doon", " stewartry", " midlothian", "stirling", "kyle_carrick", "inverclyde", "cunninghame", "monklands", "dumbarton", "clydebank", "renfrew", "falkirk", "clackmannan", "motherwell", "edinburgh", "kilmarnock", "east_kilbride", "hamilton", "glasgow", "dundee", "cumbernauld", "bearsden", "eastwood", "strathkelvin", "tweeddale", "annandale"])

O = np.array([9, 39, 11, 9, 15, 8, 26, 7, 6, 20, 13, 5, 3, 8, 17, 9, 2, 7, 9, 7, 16, 31, 11, 7, 19, 15, 7, 10, 16, 11, 5, 3, 7, 8, 11, 9, 11, 8, 6, 4, 10, 8, 2, 6, 19, 3, 2, 3, 28, 6, 1, 1, 1, 1, 0, 0])

N = len(O)

E = np.array([1.4, 8.7, 3.0, 2.5, 4.3, 2.4, 8.1, 2.3, 2.0, 6.6, 4.4, 1.8, 1.1, 3.3, 7.8, 4.6, 1.1, 4.2, 5.5, 4.4, 10.5, 22.7, 8.8, 5.6, 15.5, 12.5, 6.0, 9.0, 14.4, 10.2, 4.8, 2.9, 7.0, 8.5, 12.3, 10.1, 12.7, 9.4, 7.2, 5.3, 18.8, 15.8, 4.3, 14.6, 50.7, 8.2, 5.6, 9.3, 88.7, 19.6, 3.4, 3.6, 5.7, 7.0, 4.2, 1.8])

logE = np.log(E)

Which for the observed cases of lip cancer look like:
scotland

And for the expected cases, based on demographics:
scotland2

So there was a big disparity between observed and expected rates of lip cancer in the late 1970’s. The exposure to sunlight factor may be responsible, or it may be spatially related to some unmeasured factor (in general space or time stand in for things we haven’t measured). The remaining data are for the CAR calculations outlined below.

aff = np.array([16, 16, 10, 24, 10, 24, 10, 7, 7, 16, 7, 16, 10, 24, 7, 16, 10, 7, 7, 10, 7, 16, 10, 7, 1, 1, 7, 7, 10, 10, 7, 24, 10, 7, 7, 0, 10, 1, 16, 0, 1, 16, 16, 0, 1, 7, 1, 1, 0, 1, 1, 0, 1, 1, 16, 10])/10.

num = np.array([4, 2, 2, 3, 5, 2, 5, 1,  6, 4, 4, 3, 4, 3, 3, 6, 6, 6 ,5, 3, 3, 2, 6, 8, 3, 4, 4, 4,11,  6, 7, 4, 4, 9, 5, 4, 5, 6, 5, 5, 7, 6, 4, 5, 4, 6, 6, 4, 9, 3, 4, 4, 4, 5, 5, 6])

adj = np.array([[5, 9,11,19],
[7,10],
[6,12],
[18,20,28],
[1,11,12,13,19],
[3, 8],
[2,10,13,16,17],
[6],
[1,11,17,19,23,29],
[2, 7,16,22],
[1, 5, 9,12],
[3, 5,11],
[5, 7,17,19],
[31,32,35],
[25,29,50],
[7,10,17,21,22,29],
[7, 9,13,16,19,29],
[4,20,28,33,55,56],
[1, 5, 9,13,17],
[4,18,55],
[16,29,50],
[10,16],
[9,29,34,36,37,39],
[27,30,31,44,47,48,55,56],
[15,26,29],
[25,29,42,43],
[24,31,32,55],
[4,18,33,45],
[9,15,16,17,21,23,25,26,34,43,50],
[24,38,42,44,45,56],
[14,24,27,32,35,46,47],
[14,27,31,35],
[18,28,45,56],
[23,29,39,40,42,43,51,52,54],
[14,31,32,37,46],
[23,37,39,41],
[23,35,36,41,46],
[30,42,44,49,51,54],
[23,34,36,40,41],
[34,39,41,49,52],
[36,37,39,40,46,49,53],
[26,30,34,38,43,51],
[26,29,34,42],
[24,30,38,48,49],
[28,30,33,56],
[31,35,37,41,47,53],
[24,31,46,48,49,53],
[24,44,47,49],
[38,40,41,44,47,48,52,53,54],
[15,21,29],
[34,38,42,54],
[34,40,49,54],
[41,46,47,49],
[34,38,49,51,52],
[18,20,24,27,56],
[18,24,30,33,45,55]])

# Change to Python indexing (i.e. -1)
for i in range(len(adj)):
    for j in range(len(adj[i])):
        adj[i][j] = adj[i][j]-1


weights = np.array([[1,1,1,1],
[1,1],
[1,1],
[1,1,1],
[1,1,1,1,1],
[1, 1],
[1,1,1,1,1],
[1],
[1,1,1,1,1,1],
[1, 1,1,1],
[1, 1, 1,1],
[1, 1,1],
[1, 1,1,1],
[1,1,1],
[1,1,1],
[1,1,1,1,1,1],
[1, 1,1,1,1,1],
[1,1,1,1,1,1],
[1, 1, 1,1,1],
[1,1,1],
[1,1,1],
[1,1],
[1,1,1,1,1,1],
[1,1,1,1,1,1,1,1],
[1,1,1],
[1,1,1,1],
[1,1,1,1],
[1,1,1,1],
[1,1,1,1,1,1,1,1,1,1,1],
[1,1,1,1,1,1],
[1,1,1,1,1,1,1],
[1,1,1,1],
[1,1,1,1],
[1,1,1,1,1,1,1,1,1],
[1,1,1,1,1],
[1,1,1,1],
[1,1,1,1,1],
[1,1,1,1,1,1],
[1,1,1,1,1],
[1,1,1,1,1],
[1,1,1,1,1,1,1],
[1,1,1,1,1,1],
[1,1,1,1],
[1,1,1,1,1],
[1,1,1,1],
[1,1,1,1,1,1],
[1,1,1,1,1,1],
[1,1,1,1],
[1,1,1,1,1,1,1,1,1],
[1,1,1],
[1,1,1,1],
[1,1,1,1],
[1,1,1,1],
[1,1,1,1,1],
[1,1,1,1,1],
[1,1,1,1,1,1]])

Wplus = array([sum(w) for w in weights])

This model has been analyzed in many places, but I’ll use the model specified on page 165 (2, Banerjee et al.), the WinBUGS code for which is available on Brad Carlin’s website. The data are counts so a Poisson model is a good starting point, with the linear model being used to explain deviations from the expected values in each district:

$$ O_{i}|\psi_{i}\sim P(E_{i}e^{\psi_{i}}) $$
$$ \psi_{i}=\beta_{0}+\beta_{1}AFF_{i}+\theta_{i}+\phi_{i} $$

Here the $\beta_{0}$ and $\beta_{1}$ are the intercept and slope for the relationship to proportion of the district engaged in outdoor occupations. The $\theta_{i}$’s are hierarchical, district level effects (despite having only one observation per district) that add extra Poisson variation:

$$ \theta_{i}\sim N(0,\tau_{h}) $$

given a common precision:

$$ \tau_{h}\sim \Gamma(3.2761,1.81) $$

The $\Gamma(3.2761,1.81)$ priors are given in Table 5.4 of (2, Banerjee et al.) as alternative priors to a more conventional $\Gamma(0.001,0.001)$ uninformative prior. The $\phi_{i}$’s are the spatial or clustering effects given by the CAR part of the model:

$$ \phi_{i}\sim CAR(\tau_{c}) $$

with precision

$$ \tau_{c}\sim \Gamma(1.0,1.0) $$

In WinBUGS then, this model is specified as (from Carlin’s website):

model
{
   for (i in 1 : regions) {
      O[i] ~ dpois(mu[i])
      log(mu[i]) <- log(E[i]) + beta0 + beta1*aff[i]/10 + phi[i] + theta[i]
      theta[i] ~ dnorm(0.0,tau.h)
   }
   phi[1:regions] ~ car.normal(adj[], weights[], num[], tau.c)

   beta0 ~ dnorm(0.0, 1.0E-5)  # vague prior on grand intercept
   beta1 ~ dnorm(0.0, 1.0E-5)  # vague prior on covariate effect

   tau.h ~ dgamma(3.2761, 1.81)    
   tau.c ~ dgamma(1.0, 1.0)  

   sd.h <- sd(theta[]) # marginal SD of heterogeneity effects
   sd.c <- sd(phi[])   # marginal SD of clustering (spatial) effects

   alpha <- sd.c / (sd.h + sd.c)
}

The WinBUGS code has three extra bits – sd.h, sd.c, and alpha – posterior nodes that keep tabs on the marginal standard deviations of global heterogeneity and spatial variation, and on their relative contributions to overall variation.

Looking a little more closely at this model brings up two important points. The first is that $\theta_{i}$ and $\phi_{i}$ represent two random effects, global (hierarchical) and spatial, from a single datapoint in each district, making them unidentified. So how can this be overcome?? By specifying hyperpriors $\tau_{h}$ and $\tau_{c}$; however these must be chosen sensibly so as to not favour one component over the other. To do so is tricky, because while $\tau_{h}$ is a marginal precision, $\tau_{c}$ is conditional according to the CAR structure we chose. This raises the second point – what the hell does $\sim CAR(\tau_{c})$ mean? What does this ‘distribution’ look like and what is its structure? In the WinBUGS code for this model, all we get is:

phi[1:regions] ~ car.normal(adj[], weights[], num[], tau.c)

which tells us about as much as $\sim CAR(\tau_{c})$. Because (sadly) I understand code far better than algebra, to find out I delved into Appendix 1 of the GeoBUGS manual, to find out what the car.normal() model meant. It contained a few key sentences:

…the joint multivariate Gaussian model can be expressed in the form of a set of conditional distribuitons
$$S_{i}|S_{-i}\sim Normal(\mu_{i}+\Sigma_{j=1}^{N}\gamma C_{ij}(s_{j}-\mu_{i}), \phi M_{ii})$$

($S_{-i}$ means all elements but i)

Meaning that, rather than calculating the entire massive multivariate normal for the entire dataset (with associated big matrix inversions), the spatial component can be conditionally normal given a mean ($\mu_{i}$) weighted by the neighbouring data cells ($\Sigma_{j=1}^{N}\gamma C_{ij}(s_{j}-\mu_{i})$) and the conditional variance for cell i. Then the intrinsic CAR model (i.e. car.normal()) is given as

$$ S_{i}|S_{-i}\sim Normal(S.bar_{i}, v/n_{i}) $$

which means that, in the intrinsic case

Si has a normal distribution with conditional mean given by the average of the neighbouring Sj ‘s and conditional variance inversely proportional to the number of neighbours ni.

This makes a lot of sense – the value we observe at i is informed by the average of neighbouring values. To be sure, I delved into the very bowels of WinBUGS code – written in the obscure Component Pascal language – to find that the car.normal() distribution is exactly what the manual said:

PROCEDURE (node: Node) Sample (OUT res: SET);
    VAR
        i, len: INTEGER;
        mu, tau, value, wPlus: REAL;
BEGIN
    len := LEN(node.neighs);
    i := 0;
    mu := 0.0;
    wPlus := 0.0;
    WHILE i < len DO
        mu := mu + node.neighs[i].value * node.weights[i];
        wPlus := wPlus + node.weights[i];
        INC(i)
    END;
    mu := mu / wPlus;
    tau := node.tau.Value() * wPlus;
    value := MathRandnum.Normal(mu, tau);
    node.SetValue(value);
    res := {}
END Sample;

This means the CAR sample is a Normally-distributed random variable, with the mean value for observation i, $\mu$, being the weighted (node.weights[i]) average of the values bordering that observation (node.neighs[i].value), given a precision (this is WinBUGS/PyMC remember) that is increased according to the weighted contribution of the number of neighbours present (wPlus). In this model the weights are all given as 1, indicating simply whether the other observations are neighbours or not.

Here then is the source of conditioning on $\tau_{c}$, the weighting scheme used and the number of observations neighbouring each observation. As a result, the prior used for $\tau_{c}$ above comes from (3, Bernardinelli et al.), who argue that the prior marginal standard deviation of $\phi_{i}$ is approximately equal to the prior conditional standard deviation divided by 0.7. As noted by (2, Banerjee et al.), any scale parameter for the prior $\Gamma$ that satisfies:

$$ sd(\theta_{i}) = \frac{1}{\sqrt{\tau_{h}}} \approx \frac{1}{0.7\sqrt{\bar{m}\tau_{h}}} \approx sd(\phi_{i}) $$

where $\bar{m}$ is the average number of neighbours should be a fair prior for each source of variation. Understanding this, all the elements are in place for a [PyMC] CAR model.

First the priors:

# Vague prior on intercept
beta0 = Normal('beta0', mu=0.0, tau=1.0e-5, value=0)
# Vague prior on covariate effect
beta1 = Normal('beta1', mu=0.0, tau=1.0e-5, value=0)

# Random effects (hierarchial) prior
tau_h = Gamma('tau_h', alpha=3.2761, beta=1.81, value=1)
# Spatial clustering prior
tau_c = Gamma('tau_c', alpha=1.0, beta=1.0, value=1)

and the random effects:

# Regional random effects
theta = Normal('theta', mu=0.0, tau=tau_h, value=np.zeros(N))

Next comes the CAR part, which is efficiently handled by a Python list comprehension within a decorator instantiation of stochastic node:

@stochastic
def mu_phi(tau=tau_c, value=np.zeros(N)):
    # Calculate mu based on average of neighbours 
    mu = np.array([ sum(weights[i]*value[adj[i]])/Wplus[i] for i in xrange(N)])
    # Scale precision to the number of neighbours
    taux = tau*Wplus
    return normal_like(value,mu,taux)

# Zero-centre phi
phi = Lambda('phi', lambda mu=mu_phi: mu-mean(mu))

This estimates each mu by summing the values of the adjacent observations (indexed by the ragged array adj) multiplied by the weights (the other ragged array, weights), and dividing by the total weight (Wplus, which because the weights are all 1 is simply the number of adjacent observations); the $\tau$ for each observation is the overall $tau_{c}$ is multipled by the total weight (number of observations). These values are then given a normal likelihood that is subsequently zero-centred at $\phi$. Finally the mean (linear) model and the Poisson likelihood:

# Mean model
mu = Lambda('mu', lambda b0=beta0,b1=beta1,theta=theta,phi=phi: exp(logE+b0+b1*aff+theta+phi) )

# Likelihood
Yi = Poisson('Yi', mu=mu, value=O, observed=T)

As for the WinBUGS model above, we will also keep tabs on the proportions of total variance for the random and spatial components:

# Marginal SD of heterogeniety effects
sd_h = Lambda('sd_h', lambda theta=theta: np.std(theta))
# Marginal SD of clustering (spatial) effects
sd_c = Lambda('sd_c', lambda phi=phi: np.std(phi))
# Proportion sptial variance
alpha = Lambda('alpha', lambda sd_h=sd_h,sd_c=sd_c: sd_c/(sd_h+sd_c))

All that remains is to run the two models in their various frameworks and compare to the results presented in Table 5.4 of (2, Banerjee et al.). Looks pretty good:

comparison

That is except the spatial precision tau_c is somewhat more precise in PyMC than WinBUGS. However the proportion of spatial variance is almost identical

$$ \alpha_{winbugs} = 0.57 [0.45, 0.67] $$
$$ \alpha_{pymc} = 0.57 [0.49, 0.65] $$
$$ \alpha_{banerjee} = 0.57 [0.46, 0.68] $$

Next up – time!


Bibliography

[1] Kemp I, Boyle P, Smans M, Muir C. Atlas of Cancer in Scotland, 1975–1980: Incidence and Epidemiologic Perspective. IARC Scientific Publication 72. Lyon, France: International Agency for Research on Cancer; 1985.

[2] Banerjee S, Carlin BP, Gelfand AE. Hierarchical Modeling and Analysis for Spatial Data. Boca Raton, FL: Chapman and Hall; 2004.

[3] Bernardinelli L, Clayton D, Montomoli C, Ghislandi M, Songini M. Bayesian estimates of disease maps: how important are priors. Statistics in Medicine 1995;14:2411-31.


Hierarchical Priors for Multinomial Data

As part of an ongoing project to understand autoregressive structures in hierarchical models I came across some WinBUGS code code that puts a non-Wishart prior on a Multivariate Normal precision matrix for a Multinomial-logistic model of pig weight histograms. Why might one do such a thing? Well, often when putting together histograms of things the distribution doesn’t look as smooth as one would expect. In the case of these pigs, the distribution of weights is slightly uneven:

Pig data

The objective in this model (from pg.177 of Congdon’s 2001 Bayesian Statistical Modelling book) is to smooth the observed histogram so that it looks more like the density of the population from which the sample is drawn, q(y).

Because histograms are broken into bins into which the observations falling are stacked, Congdon’s approach was to use a Multinomial distribution to estimate expected frequencies in weight bins ranging from 19 to 39. The key twist is that the smoothing occurs by assuming that adjacent points on the histogram have similar frequencies and cannily using an AR1 process to structure this belief.

Prior specification using matrices is a tricky business, not one that I’m an expert in by a long shot. But it is important for some water quality modelling I’ve been working on that will have a spatial AR process in the mix.

So on to the PyMC implementation…

Data

First, the observed frequency of pig weight gains from Table 5.8 of Congdon:

import numpy as np

# Observed frequencies
weights = np.array([1,1,0,7,5,10,30,30,41,48,66,72,56,46,45,22,24,12,5,0,1])

# Total number of observations
n = sum(weights)

# Total number of bins
s = len(weights)

Priors

Next in our Bayesian model comes prior specification. For this model the covariance matrix $V=P^{-1}$ (that is, the inverse of the precision matrix, $P$) assumes a smoothness structure:

\[ V_{ij}=\sigma^{2}\rho^{|i-j|} \]

which means the covariance between bins i and j is equal to some common variance, $\sigma^{2}$, times a correlation coefficient $\rho$, and that this relationship diminishes exponentially by how far the bins are away from each other (i.e. $|i-j|$). This, Congdon says, is akin to the smoothing we’d see in an AR1 time series and expresses our prior belief as to the histograms’s smoothness. If $\tau = \sigma^{-2}(1-\rho^{2})^{-1}$, then the resulting precision matrix $P$ has the form:

$$ \begin{align} P_{ij} = P_{ji}=0, & for & i=1,…,J-2; j=2+i, J \end{align}$$

$$ \begin{align} P_{jj}=\tau(1+\rho^{2}) & & j=2,…,J-1 \end{align}$$

$$ P_{11,JJ} = \tau $$

$$ \begin{align} P_{j,j+1} = P_{j+1,j} = -\rho\tau & & j=2,…,J-1 \end{align} $$

To build this precision matrix in PyMC requires first defining the priors for $\rho$ and $\tau$:

# Prior on elements of AR Precision Matrix 
rho = Uniform('rho', lower=0, upper=1)
tau = Uniform('tau', lower=0.5, upper=10)

Variable $\rho$ is the correlation and, because we expect adjacent bins to be similar, it is bounded positively between zero and one; $\tau$ is a precision bounded between 0.5 and 10 due to prior information that Congdon cites in his text; we could easily use $\tau\sim U(0,100)$ priors instead, the results are the same. Next the precision matrix itself:

# Define precision matrix
@deterministic
def P(tau=tau, rho=rho):
    M = (np.zeros(s**2)).reshape(s,s)
    M[range(1,s-1),range(1,s-1)] = np.ones(s-2)*(tau*(1+rho**2))
    M[(0,s-1),(0,s-1)] = tau
    M[range(0,s-1),range(1,s)] = np.ones(s-1)*(-tau*rho)
    M[range(1,s),range(0,s-1)] = np.ones(s-1)*(-tau*rho)
    return M

This uses the decorator instantiation of PyMC (@deterministic) that allows the matrix to be specified as a regular Python function (def). A bit about the code here to explain; in the same order as the algebra above and for each line of PyMC code in function P:

  1. Sets up an empty matrix M that is the same dimension as the number of bins;
  2. Sets the diagonal elements to $\tau(1+\rho^{2})$ except for the upper left and lower right elements;
  3. Sets the upper left and lower right elements to $\tau$;
  4. Sets the off-diagonal elements to $-\rho\tau$;
  5. Return the matrix M.

Model

Now that we have priors for P and have P itself is set up, we can estimate the multiple logit model parameters using a Multivariate Normal distribution:

# MVN for logit parameters
theta = MvNormal('theta', mu=np.ones(s)*-np.log(s), tau=P)

The mu=np.ones(s)*-np.log(s) part of this model correspond to a neutral (equal) prior for the Multinomial probabilities $\pi_{j}$ that will be used to estimate the frequencies in the smoothed histogram. From this, we can apply a multiple logit tranformation to get the bin probabilities:

# Bin probabilities
pi = Lambda('pi', lambda g=theta: np.exp(g)/sum(np.exp(g)))

Likelihood

These probabilities are then used with the data to complete the likelihood:

# Likelihood
Yi = Multinomial('Yi', n=n, p=pi, value=weights, observed=True)

Finally, we can calculate the expected bin frequencies for the smoothed histogram from the posterior probabilities for each bin and the total number of pigs weighed:

# Smoothed frequencies
Sm = Lambda('Sm', lambda pi=pi: n*pi)

Put this all together and run for $10e^{5}$ iterations, we get a rather satisfactory result, quite near the estimates from Congdon’s original Table 5.8 and the WinBUGS example:

Pig smoothed

Linked occupancy-abundance ZINB model


A recent paper in the Early View of Ecology (Smith et al. 2012) peaked my interest last week, as it was an effort to explicitly link the occupancy (zero or not) and abundance (counts, biomass etc) parts of a zero-inflated negative binomial model of marine reserve effectiveness. These kinds of models are/should be seeing increasing use among marine ecologists as they commonly provide a much improved fit to the kinds of data we see with fisheries or biodiversity assessments.

Traditional zero-inflated (ZI) models do one of two things: either they model the zero (presence/absence) part of the model separate from the counts (or biomass etc.) or they are modelled together. The former method has been used a fair bit in fisheries circles, appearing as the ‘Delta’ method, which goes back to a paper by Lo et al. 1992, who were interested in estimating how spotters found schools of anchovy and what environmental conditions led to high observed abundances. The latter, integrated method has been developed more recently, relying on more complex mixture distributions to model the zeros and counts together.

Typically these mixtures have involved combining two generalized linear models, one being a Bernoulli to quantify presence-absence, and the other being a distribution of choice for the response; common versions include Poisson (ZIP; for counts) and negative binomial (ZINB; also counts) versions but log-Normals (ZILN; for biomass) are also around. In each case, the zero and count models have covariates in them that may or may not be the same, depending on the question at hand. In a 2009 paper on shark depredation in the Northeastern Atlantic (pdf), John Carlson and I looked at what would predict probability of sharks encountering the gear and what would predict how many were caught:

finding (unsurprisingly) that longer soak times led to more encounters, but also that the use of more lightsticks led to fewer depredation events. While these models had different zero and count model covariates, it may often be the case that these predictors are shared between the zero and count model components. As such, there may be high levels of correlation between parameters in these components that can be difficult to handle.

To make things concrete, let’s delve into the data from Smith et al. – counts of a porgie-like snapper Pagrus auratus from the Te Whanganui-A-Hi marine reserve, in Northern New Zealand. The species is heavily fished in the area and is therefore monitored through the use of baited remote underwater visual surveys (BRUVS) in both spring and autumn. Looking at the data through time it is clear that it is both unbalanced in terms of sampling, and contains about 55% zeros.

Several covariates are available for the data and the counts of fish most naturally fit a negative binomial distribution in almost every application. So for a ‘traditional’ mixture model, the zero model might be

$$ logit(p) = \gamma_{0}+\gamma_{1}x_{1}+…+\gamma_{n}x_{n} $$

with covariates $X$ and

$$ Y\sim Bernoulli(p) $$

while for the count component

$$ \eta = \beta_{0}+\beta_{1}x_{1}+…+\beta_{n}x_{n} \\ Y \sim NB(exp(\eta),\delta) $$

might have the same covariates in a negative binomial model. For the Smith et al. example the covariates include marine reserve status, season, and year as a categorical factor. If we plunk those into the model above however the zero and count parameters can be highly correlated – not ideal behaviour for a statistical model. To wrangle these correlations we can model the joint distribution of the $\beta$’s and $\gamma$’s as multivariate normal. This step helps mop up the correlation and provide more stable results in general.

Although this approach can work in many contexts and is, as pointed out by Smith et al., appropriate where ‘the ecological drivers giving rise to excess zeros are fundamentally different from those which are important in predicting patterns of relative abundance.’ When conducting an underwater visual survey (UVC), many of those excess zeros are due to low abundances. The fact that detection probabilities are less than one means that when abundances get low, zeros will be recorded even though there are fish present. So in many cases, what’s driving the excess zeros is low abundance due to incomplete detectability.

To get around this particular problem, Smith et al. propose to explicitly link the zero and count mean models through a single linear predictor for both $\lambda$ and $p$. The move that they make is clever; instead of modelling the $X$ covariates separately, they link the zero model to the covariates in the count model. While the count model remains the same, the zeros model then becomes:

$$ logit(p) = \gamma_{0}+\gamma_{1}\eta $$

With a change in the probability of excess zeros being a function of the changes in $\eta$, that are modelled with the $X$ covariates.

While the change is a simple one, it does require a Bayesian implementation due to the highly non-linear nature of the model. Fortunately Smith et al. provide their OpenBUGS code and the data, so testing the method out for oneself is an easy task. Here I’ll walk through a PyMC version of the model, along with the associated Python code that I’ve put up on GitHub (code).

The first step in a PyMC-based model is to define the priors (unlike BUGS, order matters in Python), which in Bayes notation would be

$$\delta \sim \Gamma(0.001,0.001) \\
\gamma_{0} \sim N(0.0,0.01) \\
\gamma_{1} \sim N(0.0,0.01) \\
$$

etc. In PyMC these priors are written similarly:


# Overdispersion for NB
alpha = Gamma('alpha', alpha=0.0001, beta=0.0001, value=6.)

# Zero link intercept to count model
gamma0 = Normal('gamma0', mu=0.0, tau=0.01, value=-0.5)

# Zero link slope to count model
gamma1 = Normal('gamma1', mu=0.0, tau=0.01, value=-0.4)

# Count intercept
beta0 = Normal('beta0', mu=0.0, tau=0.01, value=0.1)

# Reserve effect
beta_reserve = Normal('beta_reserve', mu=0.0, tau=0.01, value=0.)

# Constrain reserve effects
beta_rez = Lambda('beta_rez', lambda b0=beta0,b1=beta_reserve: array([b0-b1,b0+b1]))

# Season effect
beta_season = Normal('beta_season', mu=0.0, tau=0.01, value=0.)

# Area effect
@stochastic
def tau_area(value=1.91, alpha=0, beta=100):
return half_cauchy_like(value,alpha,beta)
beta_area = Normal('beta_area', mu=0.0, tau=tau_area, value=np.zeros(6))

# Reserve given area
beta_rez_area = Lambda('beta_rez_area', lambda br=beta_rez[Ir],ba=beta_area: br+ba)

# Year effect
@stochastic
def tau_year(value=1.91, alpha=0, beta=100):
return half_cauchy_like(value,alpha,beta)
beta_year = Normal('beta_year', mu=0.0, tau=tau_year, value=np.zeros(9))

In this model the both the area and year effects share a common precision (tau_area and tau_year) among the various areas and years. The half_cauchy_like** hyper priors for the year and area effects are worth taking note of; they reflect an important paper from Andrew Gelman (2006) that illustrates several problems with the standard $\Gamma(0.001,0.001)$ precisions (inverse-variances) used in many, many Bayesian models. The punchline of that paper is that a ‘half-t’ (e.g. half-Cauchy) prior that is weakly informative is helpful when the number of groups in hierarchical model is small, as it keeps the variance from flying off the rails. From these priors the linear model components can be defined for the linked model. In PyMC this could be


# Count model
eta = Lambda('eta', lambda b0=beta_rez_area[Ia],b1=beta_season,b2=beta_year[Iy]: b0+b1*season+b2, trace=False)

# Link function for negative binomial
mu = Lambda('mu', lambda e=eta: exp(e), trace=False)

# Zero model
pzero = Lambda('pzero', lambda g0=gamma0,g1=gamma1,eta=eta: invlogit(g0+g1*eta), trace=False)

As stated above, this links the linear zero model to the linear count model and avoids the doubling up on parameters that ignores the occupancy-abundance relationship.

The last part of the PyMC model requires specifying the zero-inflated negative binomial likelihood; Smith et al. do this explicitly in OpenBUGS and the PyMC implementation is somewhat similar.

The first step is to chose a form for the negative binomial – there are several (NB versions) to chose from. I’m going to go with the PyMC implementation, simply because it is what I use every day. This version defines the negative binomial distribution as a Poisson random variable whose rate
parameter, $\lambda$ is gamma distributed, $\lambda \sim \Gamma(\mu,\alpha)$:

$$ f(x \mid \mu, \alpha) = \frac{\Gamma(x+\alpha)}{x! \Gamma(\alpha)} (\alpha/(\mu+\alpha))^\alpha (\mu/(\mu+\alpha))^x $$

The other component of the ZINB likelihood involves the zeros component, which is simply the log-probability of an observation being an extra zero. PyMC conveniently has the @stochastic class of Python objects, which handles probability distributions; because the data are fixed (i.e. observed) this $@stochastic$ becomes an $@observed$ stochastic object that we can create as per any other Python function:


# ZINB likelihood
@observed(dtype=int, plot=False)
def zinb(value=sna, mu=mu, alpha=alpha, psi=pzero):

# Initialise likeihood
like = 0.0

# Add zero component
like += sum((np.log(psi + (1.-psi)*(alpha/(mu+alpha))**alpha))*Iz)

# Add count component
like += sum((np.log(1.-psi) + np.log(gammaf(alpha+value))-np.log((ft(value)*gammaf(alpha))) + alpha*np.log(alpha/(mu+alpha)) + value*np.log(mu/(mu+alpha)))*Ic)

return like

It is worth understanding what each of these components are. First the $like=0$ bit initializes a value to keep track of the likelihood value, given the current parameters in the MCMC scheme. The second line is the likelihood for zeros, both from the excess zeros component and for the negative binomial. The zeros component log-likelihood is simply

$$ log(\psi) $$

the log probability of observing a zero, while the count log-likelihood is the log-probability of a non-zero

$$ log(1-\psi) $$

plus the log-likelihood for a negative binomial zero, given the current parameters $\mu$ and $\alpha$. By substituting zeros in for the $x$’s in the negative binomial, the first and last terms cancel out, leaving

$$ f(0 \mid \mu, \alpha) = (\alpha/(\mu+\alpha))^\alpha$$

as in the second component. To properly sum the log-likelihood for the zeros only, I’ve used a dummy variable for zeros, $Iz$, that makes the log-likelihood for any non-zeros =0. The third line does the same thing, but for the counts; in this case a second dummy Ic flags out the zeros and the full binomial log-likelihood is included. This is important because speed in PyMC is achieved through vectorization, rather than looping as in BUGS. The sum of lines 2 and 3 is the full ZINB log-likelihood and is simply returned to the Metropolis-Hastings algorithm.

After 100,000 iterations the results (medians and 95% uncertainty intervals) are fairly comparable:


Parameter Smith et al. PyMC
gamma_0 0.34 [-1.03, 1.76] 0.27 [-0.93, 1.72]
gamma_1 -1.55 [-2.72, -0.76] -1.49 [-2.56, -0.64]
beta_0 0.39 [0.24, 0.57] 0.47 [0.25, 0.83]
beta_res 0.83 [0.41, 1.40] 0.84 [0.31, 1.42]
beta_sea 0.39 [0.24, 0.57] 0.40 [0.23, 0.55]

And that’s it; I’m looking forward to a similar application in our work at AIMS on the Great Barrier Reef.

**I have used the ‘decorator’ instantiation, @stochastic, here as the wrapper for HalfCauchy had been omitted accidentally; this has been fixed, so this prior could be: tau_year = HalfCauncy('tau_year', alpha=0, beta=100, value=1.91).

Rejection sampling 101

Working with my friends Ed and Sean on some zero-truncated Poisson logNormal (ZTPLN) community ecology work, I’ve come to realize I’ve done far too little coding of random value generation from odd statistical distributions in the form of rejection sampling. As this forms a core part of MCMC, I thought it high time for me to code some rejection sampling algorithms in R. So this morning I delved into Gammerman and Lopes (2006) for a little nerd gospel.

The basic idea is to use a distribution that is easy to sample to generate value from one that is difficult to do so. The key requirement is that the easy distribution envelops the difficult distribution from which we want to generate a given sample. Formally *Enveloping* means that, for any given value $x$, easy distribution $f(x)$ is always $\geq g(x)$. Informally, this looks like

where the red (in this case easy) $f(x)$ distribution sits over top the black $g(x)$distribution we’re interested in sampling from. For this example I’m following Gammerman and Lopes’ use of an enveloping Cauchy distribution $f(x)$ to generate samples from a Normal distribution $g(x)$; of course, given a uniform random sampler, Normals are relatively easy to sample from, however the process is the same as for a more complex distribution.

The first step in setting setting up the rejection sampler is to write the expression (the probability density function or pdf) for the target density of interest $g(x)$, in this case for a Normal. The pdf for a Normal is given by


# Normal pdf
normd = function(x,mu,sigma){
1/(sqrt(2*pi)*sigma)*exp(-(x-mu)**2/(2*sigma**2))
}

which describes the familiar shape of a standard Normal ($\sim N(0,1)$) distribution.

The next step is to choose a class of enveloping distribution ($f(x)$) and parameters that will a) make $f(x)\gt g(x)$ for all possible values of $x$ and b) is hopefully close enough in shape to $g(x)$ so as to be an efficient algorithm. This process no doubt becomes easier with experience and there are adaptive rejection sampling algorithms to make things efficient, however we can also construct a suitable enveloping distribution through trial and error.

Starting with arbitrary values, let’s see how a Cauchy distribution with location$=0$ and scale$=2$ compares to our $g(x)~N(0,1)$:


# Plot densities of f(x) and g(x)
plot(seq(-5, 5, 0.01), dnorm(seq(-5, 5, 0.01)),type="l",xlab="",ylab="Density",ylim=c(0,0.5))
lines(seq(-5, 5, 0.01), dcauchy(seq(-5, 5, 0.01), location=0, scale = 2), col=2)
text(0,0.45,labels="N(0,1)")
text(2.3,0.4,labels=paste("Cauchy(0,",2,")",sep=""),col=2)

This is certainly non-enveloping so maybe if try multiplying $Cauchy(0,2)$ by some constant it will cover $g(x)$ for all values of $x$. These *enveloping constants* (here called $A$) are a key part of rejection sampling, and are the core part of the tuning process in adaptive rejection sampling algorithms. So if we try multiplying $Cauchy(0,2)$ by $A=3$ it seems to do the trick:


# Plot g(x) and modified f(x)
plot(seq(-5, 5, 0.01), dnorm(seq(-5, 5, 0.01)),type="l",xlab="",ylab="Density",ylim=c(0,0.5))
lines(seq(-5, 5, 0.01), dcauchy(seq(-5, 5, 0.01), location=0, scale = 2)*3, col=2)
text(0,0.45,labels="N(0,1)")
text(2.3,0.4,labels=paste("Cauchy(0,",2,")*3",sep=""),col=2)

Now that we have a suitable (though perhaps not optimal) enveloping distribution, it is time for the rejection sampling step. This incredibly nifty trick works by first drawing a random value, say $u$, that is between zero and one from uniform ($U(0,1)$) distribution. Next a random value for $x$ is drawn from the enveloping distribution $f(x)$. Finally there is an acceptance/rejection step that asks: if the probability $u$ is less than the density of $g(x)$ at $x$ divided by the density of $f(x)*A$, then we’ll accept $x$ as having been drawn from $g(x)$; if not, we’ll sample another value for $x$ from $f(x)$ and try again.

In R code this becomes:


# Rejection sampler
rejsamp = function(A){
while(1){
# Draw single value from between zero and one
u = runif(1)
# Draw candidate value for x from the enveloping distribution
x = rcauchy(1, location=0, scale = 2)
# Accept or reject candidate value; if rejected try again
if(u < normd(x,0,1)/A/dcauchy(x, location=0, scale = Cscale))
return(x)
}
}

Note the use of R functions runif to draw random probabilities; rcauchy to draw random values from $f(x)$; and dcauchy to provide the density of $f(x)$ at $x$. The d$name$ syntax is used for canned pdfs already programmed into R.

So, to run this rejection sampler we can use the replicate function, which will apply a given function as many times as specified and return the results in an array. So to generate 10k samples with $A=3$:


sampx = replicate(10000, rejsamp(3))

and having a look:


hist(sampx, freq=FALSE, ylim=c(0,0.5), breaks=100, main="N(0,1) by rejection sampling",xlab="")
lines(seq(-5, 5, 0.01), dnorm(seq(-5, 5, 0.01)))
lines(seq(-5, 5, 0.01), dcauchy(seq(-5, 5, 0.01), location=0, scale = 2)*3, col=2)
text(2.3,0.4,labels="Cauchy(0,2)*3",col=2)

Et voila! We have sampled from $g(x)$!