A sample text widget

Etiam pulvinar consectetur dolor sed malesuada. Ut convallis euismod dolor nec pretium. Nunc ut tristique massa.

Nam sodales mi vitae dolor ullamcorper et vulputate enim accumsan. Morbi orci magna, tincidunt vitae molestie nec, molestie at mi. Nulla nulla lorem, suscipit in posuere in, interdum non magna.

Linked occupancy-abundance ZINB model

A recent paper in the Early View of Ecology (Smith et al. 2012) peaked my interest last week, as it was an effort to explicitly link the occupancy (zero or not) and abundance (counts, biomass etc) parts of a zero-inflated negative binomial model of marine reserve effectiveness. These kinds of models are/should be seeing increasing use among marine ecologists as they commonly provide a much improved fit to the kinds of data we see with fisheries or biodiversity assessments.

Traditional zero-inflated (ZI) models do one of two things: either they model the zero (presence/absence) part of the model separate from the counts (or biomass etc.) or they are modelled together. The former method has been used a fair bit in fisheries circles, appearing as the ‘Delta’ method, which goes back to a paper by Lo et al. 1992, who were interested in estimating how spotters found schools of anchovy and what environmental conditions led to high observed abundances. The latter, integrated method has been developed more recently, relying on more complex mixture distributions to model the zeros and counts together.

Typically these mixtures have involved combining two generalized linear models, one being a Bernoulli to quantify presence-absence, and the other being a distribution of choice for the response; common versions include Poisson (ZIP; for counts) and negative binomial (ZINB; also counts) versions but log-Normals (ZILN; for biomass) are also around. In each case, the zero and count models have covariates in them that may or may not be the same, depending on the question at hand. In a 2009 paper on shark depredation in the Northeastern Atlantic (pdf), John Carlson and I looked at what would predict probability of sharks encountering the gear and what would predict how many were caught:

finding (unsurprisingly) that longer soak times led to more encounters, but also that the use of more lightsticks led to fewer depredation events. While these models had different zero and count model covariates, it may often be the case that these predictors are shared between the zero and count model components. As such, there may be high levels of correlation between parameters in these components that can be difficult to handle.

To make things concrete, let’s delve into the data from Smith et al. – counts of a porgie-like snapper Pagrus auratus from the Te Whanganui-A-Hi marine reserve, in Northern New Zealand. The species is heavily fished in the area and is therefore monitored through the use of baited remote underwater visual surveys (BRUVS) in both spring and autumn. Looking at the data through time it is clear that it is both unbalanced in terms of sampling, and contains about 55% zeros.

Several covariates are available for the data and the counts of fish most naturally fit a negative binomial distribution in almost every application. So for a ‘traditional’ mixture model, the zero model might be

$$ logit(p) = \gamma_{0}+\gamma_{1}x_{1}+…+\gamma_{n}x_{n} $$

with covariates $X$ and

$$ Y\sim Bernoulli(p) $$

while for the count component

$$ \eta = \beta_{0}+\beta_{1}x_{1}+…+\beta_{n}x_{n} \\ Y \sim NB(exp(\eta),\delta) $$

might have the same covariates in a negative binomial model. For the Smith et al. example the covariates include marine reserve status, season, and year as a categorical factor. If we plunk those into the model above however the zero and count parameters can be highly correlated – not ideal behaviour for a statistical model. To wrangle these correlations we can model the joint distribution of the $\beta$’s and $\gamma$’s as multivariate normal. This step helps mop up the correlation and provide more stable results in general.

Although this approach can work in many contexts and is, as pointed out by Smith et al., appropriate where ‘the ecological drivers giving rise to excess zeros are fundamentally different from those which are important in predicting patterns of relative abundance.’ When conducting an underwater visual survey (UVC), many of those excess zeros are due to low abundances. The fact that detection probabilities are less than one means that when abundances get low, zeros will be recorded even though there are fish present. So in many cases, what’s driving the excess zeros is low abundance due to incomplete detectability.

To get around this particular problem, Smith et al. propose to explicitly link the zero and count mean models through a single linear predictor for both $\lambda$ and $p$. The move that they make is clever; instead of modelling the $X$ covariates separately, they link the zero model to the covariates in the count model. While the count model remains the same, the zeros model then becomes:

$$ logit(p) = \gamma_{0}+\gamma_{1}\eta $$

With a change in the probability of excess zeros being a function of the changes in $\eta$, that are modelled with the $X$ covariates.

While the change is a simple one, it does require a Bayesian implementation due to the highly non-linear nature of the model. Fortunately Smith et al. provide their OpenBUGS code and the data, so testing the method out for oneself is an easy task. Here I’ll walk through a PyMC version of the model, along with the associated Python code that I’ve put up on GitHub (code).

The first step in a PyMC-based model is to define the priors (unlike BUGS, order matters in Python), which in Bayes notation would be

$$\delta \sim \Gamma(0.001,0.001) \\
\gamma_{0} \sim N(0.0,0.01) \\
\gamma_{1} \sim N(0.0,0.01) \\

etc. In PyMC these priors are written similarly:

# Overdispersion for NB
alpha = Gamma('alpha', alpha=0.0001, beta=0.0001, value=6.)

# Zero link intercept to count model
gamma0 = Normal('gamma0', mu=0.0, tau=0.01, value=-0.5)

# Zero link slope to count model
gamma1 = Normal('gamma1', mu=0.0, tau=0.01, value=-0.4)

# Count intercept
beta0 = Normal('beta0', mu=0.0, tau=0.01, value=0.1)

# Reserve effect
beta_reserve = Normal('beta_reserve', mu=0.0, tau=0.01, value=0.)

# Constrain reserve effects
beta_rez = Lambda('beta_rez', lambda b0=beta0,b1=beta_reserve: array([b0-b1,b0+b1]))

# Season effect
beta_season = Normal('beta_season', mu=0.0, tau=0.01, value=0.)

# Area effect
def tau_area(value=1.91, alpha=0, beta=100):
return half_cauchy_like(value,alpha,beta)
beta_area = Normal('beta_area', mu=0.0, tau=tau_area, value=np.zeros(6))

# Reserve given area
beta_rez_area = Lambda('beta_rez_area', lambda br=beta_rez[Ir],ba=beta_area: br+ba)

# Year effect
def tau_year(value=1.91, alpha=0, beta=100):
return half_cauchy_like(value,alpha,beta)
beta_year = Normal('beta_year', mu=0.0, tau=tau_year, value=np.zeros(9))

In this model the both the area and year effects share a common precision (tau_area and tau_year) among the various areas and years. The half_cauchy_like** hyper priors for the year and area effects are worth taking note of; they reflect an important paper from Andrew Gelman (2006) that illustrates several problems with the standard $\Gamma(0.001,0.001)$ precisions (inverse-variances) used in many, many Bayesian models. The punchline of that paper is that a ‘half-t’ (e.g. half-Cauchy) prior that is weakly informative is helpful when the number of groups in hierarchical model is small, as it keeps the variance from flying off the rails. From these priors the linear model components can be defined for the linked model. In PyMC this could be

# Count model
eta = Lambda('eta', lambda b0=beta_rez_area[Ia],b1=beta_season,b2=beta_year[Iy]: b0+b1*season+b2, trace=False)

# Link function for negative binomial
mu = Lambda('mu', lambda e=eta: exp(e), trace=False)

# Zero model
pzero = Lambda('pzero', lambda g0=gamma0,g1=gamma1,eta=eta: invlogit(g0+g1*eta), trace=False)

As stated above, this links the linear zero model to the linear count model and avoids the doubling up on parameters that ignores the occupancy-abundance relationship.

The last part of the PyMC model requires specifying the zero-inflated negative binomial likelihood; Smith et al. do this explicitly in OpenBUGS and the PyMC implementation is somewhat similar.

The first step is to chose a form for the negative binomial – there are several (NB versions) to chose from. I’m going to go with the PyMC implementation, simply because it is what I use every day. This version defines the negative binomial distribution as a Poisson random variable whose rate
parameter, $\lambda$ is gamma distributed, $\lambda \sim \Gamma(\mu,\alpha)$:

$$ f(x \mid \mu, \alpha) = \frac{\Gamma(x+\alpha)}{x! \Gamma(\alpha)} (\alpha/(\mu+\alpha))^\alpha (\mu/(\mu+\alpha))^x $$

The other component of the ZINB likelihood involves the zeros component, which is simply the log-probability of an observation being an extra zero. PyMC conveniently has the @stochastic class of Python objects, which handles probability distributions; because the data are fixed (i.e. observed) this $@stochastic$ becomes an $@observed$ stochastic object that we can create as per any other Python function:

# ZINB likelihood
@observed(dtype=int, plot=False)
def zinb(value=sna, mu=mu, alpha=alpha, psi=pzero):

# Initialise likeihood
like = 0.0

# Add zero component
like += sum((np.log(psi + (1.-psi)*(alpha/(mu+alpha))**alpha))*Iz)

# Add count component
like += sum((np.log(1.-psi) + np.log(gammaf(alpha+value))-np.log((ft(value)*gammaf(alpha))) + alpha*np.log(alpha/(mu+alpha)) + value*np.log(mu/(mu+alpha)))*Ic)

return like

It is worth understanding what each of these components are. First the $like=0$ bit initializes a value to keep track of the likelihood value, given the current parameters in the MCMC scheme. The second line is the likelihood for zeros, both from the excess zeros component and for the negative binomial. The zeros component log-likelihood is simply

$$ log(\psi) $$

the log probability of observing a zero, while the count log-likelihood is the log-probability of a non-zero

$$ log(1-\psi) $$

plus the log-likelihood for a negative binomial zero, given the current parameters $\mu$ and $\alpha$. By substituting zeros in for the $x$’s in the negative binomial, the first and last terms cancel out, leaving

$$ f(0 \mid \mu, \alpha) = (\alpha/(\mu+\alpha))^\alpha$$

as in the second component. To properly sum the log-likelihood for the zeros only, I’ve used a dummy variable for zeros, $Iz$, that makes the log-likelihood for any non-zeros =0. The third line does the same thing, but for the counts; in this case a second dummy Ic flags out the zeros and the full binomial log-likelihood is included. This is important because speed in PyMC is achieved through vectorization, rather than looping as in BUGS. The sum of lines 2 and 3 is the full ZINB log-likelihood and is simply returned to the Metropolis-Hastings algorithm.

After 100,000 iterations the results (medians and 95% uncertainty intervals) are fairly comparable:

Parameter Smith et al. PyMC
gamma_0 0.34 [-1.03, 1.76] 0.27 [-0.93, 1.72]
gamma_1 -1.55 [-2.72, -0.76] -1.49 [-2.56, -0.64]
beta_0 0.39 [0.24, 0.57] 0.47 [0.25, 0.83]
beta_res 0.83 [0.41, 1.40] 0.84 [0.31, 1.42]
beta_sea 0.39 [0.24, 0.57] 0.40 [0.23, 0.55]

And that’s it; I’m looking forward to a similar application in our work at AIMS on the Great Barrier Reef.

**I have used the ‘decorator’ instantiation, @stochastic, here as the wrapper for HalfCauchy had been omitted accidentally; this has been fixed, so this prior could be: tau_year = HalfCauncy('tau_year', alpha=0, beta=100, value=1.91).

Leave a Reply