Hierarchical Priors for Multinomial Data

As part of an ongoing project to understand autoregressive structures in hierarchical models I came across some WinBUGS code that puts a non-Wishart prior on a Multivariate Normal precision matrix for a Multinomial-logistic model of pig weight histograms. Why might one do such a thing? Well, often when putting together histograms the distribution doesn't look as smooth as one would expect. In the case of these pigs, the distribution of weights is slightly uneven:

Pig data

The objective in this model (from pg.177 of Congdon’s 2001 Bayesian Statistical Modelling book) is to smooth the observed histogram so that it looks more like the density of the population from which the sample is drawn, q(y).

Because histograms are broken into bins into which the observations are stacked, Congdon's approach was to use a Multinomial distribution to estimate expected frequencies in weight bins ranging from 19 to 39. The key twist is that the smoothing occurs by assuming that adjacent points on the histogram have similar frequencies and cannily using an AR1 process to structure this belief.

Prior specification using matrices is a tricky business, not one that I’m an expert in by a long shot. But it is important for some water quality modelling I’ve been working on that will have a spatial AR process in the mix.

So on to the PyMC implementation…

Data

First, the observed frequency of pig weight gains from Table 5.8 of Congdon:

import numpy as np

# Observed frequencies
weights = np.array([1,1,0,7,5,10,30,30,41,48,66,72,56,46,45,22,24,12,5,0,1])

# Total number of observations
n = sum(weights)

# Total number of bins
s = len(weights)
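
For anyone wanting to reproduce the raw histogram shown above, a quick plot of the observed frequencies against the weight bins (19 to 39) does the job; this is just for illustration (it assumes matplotlib is available) and is not part of the model:

# Quick look at the raw histogram (bins 19 to 39)
import matplotlib.pyplot as plt

bins = np.arange(19, 40)   # 21 weight bins
plt.bar(bins, weights)
plt.xlabel('Weight')
plt.ylabel('Frequency')
plt.show()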

Priors

Next in our Bayesian model comes prior specification. For this model the covariance matrix $V=P^{-1}$ (that is, the inverse of the precision matrix, $P$) assumes a smoothness structure:

\[ V_{ij}=\sigma^{2}\rho^{|i-j|} \]

which means the covariance between bins i and j is equal to some common variance, $\sigma^{2}$, times a correlation coefficient $\rho$, and that this relationship diminishes exponentially with how far apart the bins are (i.e. $|i-j|$). This, Congdon says, is akin to the smoothing we'd see in an AR1 time series and expresses our prior belief as to the histogram's smoothness. If $\tau = \sigma^{-2}(1-\rho^{2})^{-1}$, then the resulting precision matrix $P$ has the form:

$$ P_{ij} = P_{ji} = 0, \qquad i=1,\ldots,J-2;\; j=i+2,\ldots,J $$

$$ P_{jj} = \tau(1+\rho^{2}), \qquad j=2,\ldots,J-1 $$

$$ P_{11} = P_{JJ} = \tau $$

$$ P_{j,j+1} = P_{j+1,j} = -\rho\tau, \qquad j=1,\ldots,J-1 $$
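
Before building this in PyMC, it's worth convincing ourselves that this tridiagonal form really is the inverse of the AR1 covariance above. A quick numerical check (not part of the model; the particular values of $\sigma$ and $\rho$ are arbitrary, and the _chk suffixes just avoid clashing with the model variables defined below):

# Check: invert V_ij = sigma^2 * rho^|i-j| and compare with the tridiagonal form above
J, sigma, rho_chk = 5, 1.3, 0.6
tau_chk = 1.0/(sigma**2*(1.0 - rho_chk**2))

i, j = np.indices((J, J))
V = sigma**2*rho_chk**np.abs(i - j)    # AR1 covariance matrix
print(np.round(np.linalg.inv(V), 3))   # tau*(1+rho^2) on the interior diagonal, tau at the
                                       # corners, -rho*tau on the first off-diagonals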

To build this precision matrix in PyMC requires first defining the priors for $\rho$ and $\tau$:

# PyMC objects used in this post
from pymc import Uniform, deterministic, MvNormal, Lambda, Multinomial

# Prior on elements of AR Precision Matrix
rho = Uniform('rho', lower=0, upper=1)
tau = Uniform('tau', lower=0.5, upper=10)

Variable $\rho$ is the correlation and, because we expect adjacent bins to be similar, it is restricted to values between zero and one; $\tau$ is a precision bounded between 0.5 and 10 due to prior information that Congdon cites in his text; we could easily use a $\tau\sim U(0,100)$ prior instead and the results are the same. Next the precision matrix itself:

# Define precision matrix
@deterministic
def P(tau=tau, rho=rho):
    M = (np.zeros(s**2)).reshape(s,s)
    M[range(1,s-1),range(1,s-1)] = np.ones(s-2)*(tau*(1+rho**2))
    M[(0,s-1),(0,s-1)] = tau
    M[range(0,s-1),range(1,s)] = np.ones(s-1)*(-tau*rho)
    M[range(1,s),range(0,s-1)] = np.ones(s-1)*(-tau*rho)
    return M

This uses the decorator instantiation of PyMC (@deterministic), which allows the matrix to be specified as a regular Python function (def). To explain the code, in the same order as the algebra above, each line of the function P:

  1. Sets up an empty matrix M that is the same dimension as the number of bins;
  2. Sets the diagonal elements to $\tau(1+\rho^{2})$ except for the upper left and lower right elements;
  3. Sets the upper left and lower right elements to $\tau$;
  4. Sets the off-diagonal elements to $-\rho\tau$;
  5. Returns the matrix M.

Model

Now that we have priors for the elements of P and P itself is set up, we can estimate the multiple logit model parameters using a Multivariate Normal distribution:

# MVN for logit parameters
theta = MvNormal('theta', mu=np.ones(s)*-np.log(s), tau=P)

The mu=np.ones(s)*-np.log(s) part of this model corresponds to a neutral (equal) prior for the Multinomial probabilities $\pi_{j}$ that will be used to estimate the frequencies in the smoothed histogram. From this, we can apply a multiple logit transformation to get the bin probabilities:

# Bin probabilities
pi = Lambda('pi', lambda g=theta: np.exp(g)/sum(np.exp(g)))
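
As a quick sanity check on that 'neutral prior' claim, evaluating this transformation at the prior mean of theta gives equal probabilities of $1/s$ in every bin:

# At the prior mean of theta, every bin probability is 1/s (= 1/21 here)
g0 = np.ones(s)*-np.log(s)
print(np.exp(g0)/sum(np.exp(g0)))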

Likelihood

These probabilities are then used with the data to complete the likelihood:

# Likelihood
Yi = Multinomial('Yi', n=n, p=pi, value=weights, observed=True)

Finally, we can calculate the expected bin frequencies for the smoothed histogram from the posterior probabilities for each bin and the total number of pigs weighed:

# Smoothed frequencies
Sm = Lambda('Sm', lambda pi=pi: n*pi)

Putting this all together and running for $10^{5}$ iterations, we get a rather satisfactory result, quite near the estimates from Congdon's original Table 5.8 and the WinBUGS example:

Pig smoothed
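
For completeness, assembling and sampling the model in PyMC might look something like the following sketch (assuming the variables defined above are in the current namespace; the burn-in length is arbitrary):

# Assemble the nodes and sample
from pymc import MCMC

M = MCMC([rho, tau, P, theta, pi, Yi, Sm])
M.sample(iter=100000, burn=50000)

# Posterior summaries for the smoothed frequencies
print(M.stats()['Sm'])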

Linked occupancy-abundance ZINB model


A recent paper in the Early View of Ecology (Smith et al. 2012) piqued my interest last week, as it was an effort to explicitly link the occupancy (zero or not) and abundance (counts, biomass etc.) parts of a zero-inflated negative binomial model of marine reserve effectiveness. These kinds of models are (or should be) seeing increasing use among marine ecologists, as they commonly provide a much improved fit to the kinds of data we see in fisheries or biodiversity assessments.

Traditional zero-inflated (ZI) models do one of two things: either they model the zero (presence/absence) part separately from the counts (or biomass, etc.), or they model the two together. The former method has been used a fair bit in fisheries circles, appearing as the 'Delta' method, which goes back to a paper by Lo et al. 1992, who were interested in estimating how spotters found schools of anchovy and what environmental conditions led to high observed abundances. The latter, integrated method has been developed more recently, relying on more complex mixture distributions to model the zeros and counts together.

Typically these mixtures combine two generalized linear models, one being a Bernoulli to quantify presence-absence, and the other being a distribution of choice for the response; common versions include the Poisson (ZIP; for counts) and negative binomial (ZINB; also counts), but log-Normals (ZILN; for biomass) are also around. In each case, the zero and count models have covariates in them that may or may not be the same, depending on the question at hand. In a 2009 paper on shark depredation in the Northeastern Atlantic (pdf), John Carlson and I looked at what would predict the probability of sharks encountering the gear and what would predict how many were caught:

finding (unsurprisingly) that longer soak times led to more encounters, but also that the use of more lightsticks led to fewer depredation events. While these models had different zero and count model covariates, it may often be the case that these predictors are shared between the zero and count model components. As such, there may be high levels of correlation between parameters in these components that can be difficult to handle.

To make things concrete, let’s delve into the data from Smith et al. – counts of a porgie-like snapper Pagrus auratus from the Te Whanganui-A-Hi marine reserve, in Northern New Zealand. The species is heavily fished in the area and is therefore monitored through the use of baited remote underwater visual surveys (BRUVS) in both spring and autumn. Looking at the data through time it is clear that it is both unbalanced in terms of sampling, and contains about 55% zeros.

Several covariates are available for the data and the counts of fish most naturally fit a negative binomial distribution in almost every application. So for a ‘traditional’ mixture model, the zero model might be

$$ logit(p) = \gamma_{0}+\gamma_{1}x_{1}+…+\gamma_{n}x_{n} $$

with covariates $X$ and

$$ Y\sim Bernoulli(p) $$

while for the count component

$$ \eta = \beta_{0}+\beta_{1}x_{1}+…+\beta_{n}x_{n} \\ Y \sim NB(exp(\eta),\delta) $$

might have the same covariates in a negative binomial model. For the Smith et al. example the covariates include marine reserve status, season, and year as a categorical factor. If we plunk those into the model above, however, the zero and count parameters can be highly correlated – not ideal behaviour for a statistical model. To wrangle these correlations we can model the joint distribution of the $\beta$'s and $\gamma$'s as multivariate normal. This step helps mop up the correlation and provide more stable results in general.
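
For concreteness, a rough sketch (not Smith et al.'s model) of what a joint multivariate normal prior over the zero- and count-model coefficients might look like in PyMC; the number of covariates k and the prior precision here are hypothetical placeholders:

# Sketch only: joint MVN prior over the zero-model (gamma) and count-model (beta) coefficients
import numpy as np
from pymc import MvNormal, Lambda

k = 3                                               # hypothetical number of covariates
coefs = MvNormal('coefs', mu=np.zeros(2*k), tau=np.eye(2*k)*0.01)

gammas = Lambda('gammas', lambda c=coefs: c[:k])    # zero-model coefficients
betas = Lambda('betas', lambda c=coefs: c[k:])      # count-model coefficients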

This approach can work in many contexts and is, as pointed out by Smith et al., appropriate where 'the ecological drivers giving rise to excess zeros are fundamentally different from those which are important in predicting patterns of relative abundance.' But when conducting an underwater visual survey (UVC), many of those excess zeros are due to low abundances. Because detection probabilities are less than one, zeros will be recorded when abundances get low, even though fish are present. So in many cases, what's driving the excess zeros is low abundance coupled with incomplete detectability.

To get around this particular problem, Smith et al. propose to explicitly link the zero and count mean models through a single linear predictor for both $\lambda$ and $p$. The move that they make is clever; instead of modelling the $X$ covariates separately, they link the zero model to the covariates in the count model. While the count model remains the same, the zeros model then becomes:

$$ logit(p) = \gamma_{0}+\gamma_{1}\eta $$

Here the probability of excess zeros is a function of changes in $\eta$, which in turn is modelled with the $X$ covariates.

While the change is a simple one, it does require a Bayesian implementation due to the highly non-linear nature of the model. Fortunately Smith et al. provide their OpenBUGS code and the data, so testing the method out for oneself is an easy task. Here I’ll walk through a PyMC version of the model, along with the associated Python code that I’ve put up on GitHub (code).

The first step in a PyMC-based model is to define the priors (unlike BUGS, order matters in Python), which in Bayes notation would be

$$\delta \sim \Gamma(0.001,0.001) \\
\gamma_{0} \sim N(0.0,0.01) \\
\gamma_{1} \sim N(0.0,0.01) \\
$$

etc. In PyMC these priors are written similarly:


import numpy as np
from pymc import Gamma, Normal, Lambda, stochastic, observed, half_cauchy_like

# Overdispersion for NB
alpha = Gamma('alpha', alpha=0.0001, beta=0.0001, value=6.)

# Zero link intercept to count model
gamma0 = Normal('gamma0', mu=0.0, tau=0.01, value=-0.5)

# Zero link slope to count model
gamma1 = Normal('gamma1', mu=0.0, tau=0.01, value=-0.4)

# Count intercept
beta0 = Normal('beta0', mu=0.0, tau=0.01, value=0.1)

# Reserve effect
beta_reserve = Normal('beta_reserve', mu=0.0, tau=0.01, value=0.)

# Constrain reserve effects
beta_rez = Lambda('beta_rez', lambda b0=beta0,b1=beta_reserve: np.array([b0-b1,b0+b1]))

# Season effect
beta_season = Normal('beta_season', mu=0.0, tau=0.01, value=0.)

# Area effect
@stochastic
def tau_area(value=1.91, alpha=0, beta=100):
    return half_cauchy_like(value,alpha,beta)
beta_area = Normal('beta_area', mu=0.0, tau=tau_area, value=np.zeros(6))

# Reserve given area
beta_rez_area = Lambda('beta_rez_area', lambda br=beta_rez[Ir],ba=beta_area: br+ba)

# Year effect
@stochastic
def tau_year(value=1.91, alpha=0, beta=100):
    return half_cauchy_like(value,alpha,beta)
beta_year = Normal('beta_year', mu=0.0, tau=tau_year, value=np.zeros(9))

In this model both the area and year effects share a common precision (tau_area and tau_year) among the various areas and years. The half_cauchy_like** hyperpriors for the year and area effects are worth taking note of; they reflect an important paper by Andrew Gelman (2006) that illustrates several problems with the standard $\Gamma(0.001,0.001)$ precisions (inverse-variances) used in many, many Bayesian models. The punchline of that paper is that a weakly informative 'half-t' (e.g. half-Cauchy) prior is helpful when the number of groups in a hierarchical model is small, as it keeps the variance from flying off the rails. From these priors the linear model components can be defined for the linked model. In PyMC this could be


# Count model
eta = Lambda('eta', lambda b0=beta_rez_area[Ia],b1=beta_season,b2=beta_year[Iy]: b0+b1*season+b2, trace=False)

# Link function for negative binomial
mu = Lambda('mu', lambda e=eta: np.exp(e), trace=False)

# Zero model
pzero = Lambda('pzero', lambda g0=gamma0,g1=gamma1,eta=eta: invlogit(g0+g1*eta), trace=False)

As stated above, this links the linear zero model to the linear count model and avoids doubling up on parameters, which would ignore the occupancy-abundance relationship.

The last part of the PyMC model requires specifying the zero-inflated negative binomial likelihood; Smith et al. do this explicitly in OpenBUGS and the PyMC implementation is somewhat similar.

The first step is to choose a form for the negative binomial – there are several (NB versions) to choose from. I'm going to go with the PyMC implementation, simply because it is what I use every day. This version defines the negative binomial distribution as a Poisson random variable whose rate parameter, $\lambda$, is gamma distributed, $\lambda \sim \Gamma(\mu,\alpha)$:

$$ f(x \mid \mu, \alpha) = \frac{\Gamma(x+\alpha)}{x! \Gamma(\alpha)} (\alpha/(\mu+\alpha))^\alpha (\mu/(\mu+\alpha))^x $$
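
For reference, the log of this density can be written down directly; a short sketch using scipy's gammaln (the particular values are arbitrary) makes the parameterisation concrete:

# Log of the negative binomial density f(x | mu, alpha) as parameterised above
import numpy as np
from scipy.special import gammaln

def nb_logpmf(x, mu, alpha):
    return (gammaln(x + alpha) - gammaln(x + 1.) - gammaln(alpha)
            + alpha*np.log(alpha/(mu + alpha)) + x*np.log(mu/(mu + alpha)))

print(nb_logpmf(3, mu=2., alpha=6.))   # should agree with PyMC's negative_binomial_like
                                       # for the same mu/alpha parameterisation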

The other component of the ZINB likelihood involves the zeros component, which is simply the log-probability of an observation being an extra zero. PyMC conveniently has the @stochastic class of Python objects, which handles probability distributions; because the data are fixed (i.e. observed) this @stochastic becomes an @observed stochastic object that we can create as per any other Python function:


# ZINB likelihood
@observed(dtype=int, plot=False)
def zinb(value=sna, mu=mu, alpha=alpha, psi=pzero):

    # Initialise likelihood
    like = 0.0

    # Add zero component
    like += sum((np.log(psi + (1.-psi)*(alpha/(mu+alpha))**alpha))*Iz)

    # Add count component (gammaf and ft are gamma and factorial functions from the full code on GitHub)
    like += sum((np.log(1.-psi) + np.log(gammaf(alpha+value))-np.log((ft(value)*gammaf(alpha))) + alpha*np.log(alpha/(mu+alpha)) + value*np.log(mu/(mu+alpha)))*Ic)

    return like

It is worth understanding what each of these components is. First, the $like=0$ bit initializes a value to keep track of the likelihood, given the current parameters in the MCMC scheme. The second line is the likelihood for zeros, both from the excess-zeros component and from the negative binomial. The zeros component log-likelihood is simply

$$ log(\psi) $$

the log probability of observing a zero, while the count log-likelihood is the log-probability of a non-zero

$$ log(1-\psi) $$

plus the log-likelihood for a negative binomial zero, given the current parameters $\mu$ and $\alpha$. Substituting zero for $x$ in the negative binomial makes the first and last terms equal to one, leaving

$$ f(0 \mid \mu, \alpha) = (\alpha/(\mu+\alpha))^\alpha$$

as in the second component. To properly sum the log-likelihood for the zeros only, I've used a dummy variable for zeros, $Iz$, that makes the log-likelihood for any non-zeros equal to zero. The third line does the same thing, but for the counts; in this case a second dummy, $Ic$, flags out the zeros and the full negative binomial log-likelihood is included. This is important because speed in PyMC is achieved through vectorization, rather than looping as in BUGS. The sum of the second and third lines is the full ZINB log-likelihood, which is simply returned to the Metropolis-Hastings algorithm.
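
The indicators themselves are defined alongside the data in the GitHub code; a toy example (with sna standing in for the observed snapper counts) shows the idea:

# Toy illustration of the zero (Iz) and count (Ic) indicators
import numpy as np

sna = np.array([0, 3, 0, 7, 1, 0, 12])   # stand-in for the observed counts
Iz = (sna == 0).astype(int)              # 1 where the observation is a zero
Ic = 1 - Iz                              # 1 where the observation is a positive count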

After 100,000 iterations the results (medians and 95% uncertainty intervals) are fairly comparable:


Parameter   Smith et al.            PyMC
gamma_0     0.34 [-1.03, 1.76]      0.27 [-0.93, 1.72]
gamma_1     -1.55 [-2.72, -0.76]    -1.49 [-2.56, -0.64]
beta_0      0.39 [0.24, 0.57]       0.47 [0.25, 0.83]
beta_res    0.83 [0.41, 1.40]       0.84 [0.31, 1.42]
beta_sea    0.39 [0.24, 0.57]       0.40 [0.23, 0.55]
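
The PyMC column can be pulled straight from the traces; a sketch (assuming the nodes above are collected into an MCMC object and sampled, as in the previous post) might be:

# Sample and summarise (medians and 95% intervals)
from pymc import MCMC

M = MCMC([alpha, gamma0, gamma1, beta0, beta_reserve, beta_rez, beta_season,
          beta_area, beta_rez_area, beta_year, tau_area, tau_year,
          eta, mu, pzero, zinb])
M.sample(iter=100000, burn=50000)

for name in ['gamma0', 'gamma1', 'beta0', 'beta_reserve', 'beta_season']:
    q = M.stats()[name]['quantiles']
    print(name, q[50], [q[2.5], q[97.5]])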

And that’s it; I’m looking forward to a similar application in our work at AIMS on the Great Barrier Reef.

**I have used the ‘decorator’ instantiation, @stochastic, here as the wrapper for HalfCauchy had been omitted accidentally; this has been fixed, so this prior could be: tau_year = HalfCauchy('tau_year', alpha=0, beta=100, value=1.91).

Rejection sampling 101

Working with my friends Ed and Sean on some zero-truncated Poisson logNormal (ZTPLN) community ecology work, I've come to realize I've done far too little coding of random value generation from odd statistical distributions in the form of rejection sampling. As this forms a core part of MCMC, I thought it high time to code some rejection sampling algorithms in R. So this morning I delved into Gamerman and Lopes (2006) for a little nerd gospel.

The basic idea is to use a distribution that is easy to sample from to generate values from one that is difficult to sample. The key requirement is that the easy distribution envelops the difficult distribution from which we want to generate a given sample. Formally, *enveloping* means that, for any given value $x$, the easy distribution $f(x)$ is always $\geq g(x)$. Informally, this looks like

where the red (in this case easy) $f(x)$ distribution sits over top of the black $g(x)$ distribution we're interested in sampling from. For this example I'm following Gamerman and Lopes' use of an enveloping Cauchy distribution $f(x)$ to generate samples from a Normal distribution $g(x)$; of course, given a uniform random sampler, Normals are relatively easy to sample from, but the process is the same as for a more complex distribution.

The first step in setting up the rejection sampler is to write the expression (the probability density function or pdf) for the target density of interest $g(x)$, in this case a Normal. The pdf for a Normal is given by


# Normal pdf
normd = function(x, mu, sigma){
  1/(sqrt(2*pi)*sigma)*exp(-(x-mu)**2/(2*sigma**2))
}

which describes the familiar shape of a standard Normal ($\sim N(0,1)$) distribution.

The next step is to choose a class of enveloping distribution ($f(x)$) and parameters that will a) make $f(x)\gt g(x)$ for all possible values of $x$ and b) be close enough in shape to $g(x)$ to give an efficient algorithm. This process no doubt becomes easier with experience, and there are adaptive rejection sampling algorithms to make things efficient; however, we can also construct a suitable enveloping distribution through trial and error.

Starting with arbitrary values, let's see how a Cauchy distribution with location$=0$ and scale$=2$ compares to our $g(x)\sim N(0,1)$:


# Plot densities of f(x) and g(x)
plot(seq(-5, 5, 0.01), dnorm(seq(-5, 5, 0.01)),type="l",xlab="",ylab="Density",ylim=c(0,0.5))
lines(seq(-5, 5, 0.01), dcauchy(seq(-5, 5, 0.01), location=0, scale = 2), col=2)
text(0,0.45,labels="N(0,1)")
text(2.3,0.4,labels=paste("Cauchy(0,",2,")",sep=""),col=2)

This is certainly non-enveloping, so maybe if we try multiplying $Cauchy(0,2)$ by some constant it will cover $g(x)$ for all values of $x$. These *enveloping constants* (here called $A$) are a key part of rejection sampling, and are the core part of the tuning process in adaptive rejection sampling algorithms. So if we try multiplying $Cauchy(0,2)$ by $A=3$, it seems to do the trick:


# Plot g(x) and modified f(x)
plot(seq(-5, 5, 0.01), dnorm(seq(-5, 5, 0.01)),type="l",xlab="",ylab="Density",ylim=c(0,0.5))
lines(seq(-5, 5, 0.01), dcauchy(seq(-5, 5, 0.01), location=0, scale = 2)*3, col=2)
text(0,0.45,labels="N(0,1)")
text(2.3,0.4,labels=paste("Cauchy(0,",2,")*3",sep=""),col=2)

Now that we have a suitable (though perhaps not optimal) enveloping distribution, it is time for the rejection sampling step. This incredibly nifty trick works by first drawing a random value, say $u$, between zero and one from a uniform ($U(0,1)$) distribution. Next a random value for $x$ is drawn from the enveloping distribution $f(x)$. Finally there is an acceptance/rejection step that asks: if $u$ is less than the density of $g(x)$ at $x$ divided by the density of $f(x)$ at $x$ times $A$, then we accept $x$ as having been drawn from $g(x)$; if not, we sample another value for $x$ from $f(x)$ and try again.

In R code this becomes:


# Rejection sampler
rejsamp = function(A){
  while(1){
    # Draw single value from between zero and one
    u = runif(1)
    # Draw candidate value for x from the enveloping distribution
    x = rcauchy(1, location=0, scale=2)
    # Accept or reject candidate value; if rejected try again
    if(u < normd(x,0,1)/(A*dcauchy(x, location=0, scale=2))) return(x)
  }
}

Note the use of the R functions runif to draw random probabilities; rcauchy to draw random values from $f(x)$; and dcauchy to provide the density of $f(x)$ at $x$. The d*name* syntax (e.g. dnorm, dcauchy) is used for canned pdfs already programmed into R.

To run this rejection sampler we can use the replicate function, which applies a given function as many times as specified and returns the results in an array. So to generate 10k samples with $A=3$:


sampx = replicate(10000, rejsamp(3))

and having a look:


hist(sampx, freq=FALSE, ylim=c(0,0.5), breaks=100, main="N(0,1) by rejection sampling",xlab="")
lines(seq(-5, 5, 0.01), dnorm(seq(-5, 5, 0.01)))
lines(seq(-5, 5, 0.01), dcauchy(seq(-5, 5, 0.01), location=0, scale = 2)*3, col=2)
text(2.3,0.4,labels="Cauchy(0,2)*3",col=2)

Et voila! We have sampled from $g(x)$!