PyMC2 code repository

In an attempt to improve the state of my blog, I’m going to start linking to PyMC2 and PyMC3 code examples in the code repository. Enjoy.

Post-Doctoral Fellow – Advances to State-Space Models for Fisheries Science

Applications are invited for a CANSSI/Fields post-doctoral position in connection with a CANSSI Collaborative Research Team project to develop robust and statistically rigorous state-space model (SSM) methodologies for fisheries science with Joanna Mills Flemming and Chris Field (both at Dalhousie University). The research will be motivated by ongoing discussions with fisheries collaborators aimed at addressing pressing fisheries issues in Canada. Goals include:

- Building robust statistical methodologies for SSMs used in fisheries science and management that allow for both errors and misspecification in the observation process and in the state/dynamic process.

- Developing relevant model diagnostics that can be used to evaluate model goodness-of-fit for both the observation and state process. In practice, stock assessment scientists and fisheries managers need to know if their advice is sensitive to reasonable alternative assumptions.

The successful applicant should be familiar with SSMs, possess excellent computational skills, and be well organized. Interested applicants should send their CV and letter of interest, and arrange to have two letters of reference sent, to Joanna Mills Flemming (Joanna.Flemming – [at] – Dal.Ca) before April 1, 2014. The position is for two years, contingent upon funding, with a start date to be determined.

Correction to Hussey et al. 2014

In a frustrating development, my co-author Nigel Hussey reported that other researchers have had problems replicating the trophic position calculations from our recent Ecology Letters paper. In it, the trophic position equation is, regrettably (and completely my fault), missing a $+TP_{base}$ (trophic position of the baseline organism) term at the end. The correct equation is:

[Figure: corrected trophic position equation from Hussey et al. 2014]

My apologies to Nigel and my co-authors for not catching this in the proofs.

Ecology Letters Volume 17, Issue 2, pages 239–250, February 2014 http://onlinelibrary.wiley.com/doi/10.1111/ele.12226/full

Correction to JNAFS 30: 77-82.

For those following the latest in Von Bertalanffy growth curve fitting for sharks, my colleague Steve Campana discovered this week that our 2002 paper looking at age and growth in blue sharks contained an error in the reported growth curve parameters. The Von B curve has a couple of forms (for a recent review on best practices see Pardo et al. (2013) MEE 4:353-360), but in our 2002 paper we used the most standard form:

$$ L(age) = L_{\infty}\left(1-e^{-k(age-t_{0})}\right) $$

In the paper we reported the three parameters as $L_{\infty}=302$, $k=0.58$, and $t_{0}=-0.24$. These are way off, and I’m at a loss as to how I estimated them so badly ten years ago. In any case, we have issued a correction, giving revised Bayesian estimates $L_{\infty}=336\ [319, 353]$, $k=0.15\ [0.13, 0.17]$, and $t_{0}=-2.2\ [-2.6, -1.8]$ (posterior medians and 95% uncertainty intervals). The data, new fits, and old estimates show just how far off the old ones were. NB: PLOT YOUR MODEL AND DATA PROPERLY!!
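
To see just how different the two curves are, here is a minimal Python sketch (NumPy only; the function name vonb and the age range are mine) evaluating the standard Von Bertalanffy form at the old and the corrected point estimates:

import numpy as np

def vonb(age, Linf, k, t0):
    # Standard Von Bertalanffy growth function: expected length at age
    return Linf * (1 - np.exp(-k * (age - t0)))

ages = np.arange(0, 17)
len_old = vonb(ages, Linf=302, k=0.58, t0=-0.24)  # original (erroneous) 2002 estimates
len_new = vonb(ages, Linf=336, k=0.15, t0=-2.2)   # corrected posterior medians
print(np.round(len_old), np.round(len_new))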

Intrinsic Conditional Autoregressive (CAR) model – Lip Cancer in PyMC

The autoregression continues this week with an intrinsic conditional autoregressive, or CAR model, in PyMC. There are many, many ways to introduce spatial effects into a modelling framework; due to the availability of cheap MCMC sampling, particularly in WinBUGS, CAR models have been widely used over the past decade to structure the spatial dependence or clustering of gridded data.

The canonical example is estimating the log-relative risk of lip cancer among 56 districts in Scotland between 1975 and 1980; the original data come from (1, Kemp et al. 1985). Disease mapping is perhaps the oldest form of epidemiology, including the famous cholera map of Dr. John Snow that helped him figure out the disease was waterborne, rather than coming from ‘bad air.’ In the Scottish case, there is some interest in how lip cancer rates relate to the proportion of the population engaged in agriculture, forestry, or fishing (AFF), as exposure to sunlight is a risk factor.

The data consist of observed (O) cases of lip cancer, and expected (E) rates, based on the age of the local population:

import numpy as np

county = np.array(["skye_lochalsh", "banff_buchan", "caithness", "berwickshire", "ross_cromarty", "orkney", "moray", "shetland", "lochaber", "gordon", "western_isles", "sutherland", "nairn", "wigtown", "NE.fife", "kincardine", "badenoch", "ettrick", "inverness", "roxburgh", "angus", "aberdeen", "argyll_bute", "clydesdale", "kirkcaldy", "dunfermline", "nithsdale", "east_lothian", "perth_kinross", "west_lothian", "cumnock_doon", "stewartry", "midlothian", "stirling", "kyle_carrick", "inverclyde", "cunninghame", "monklands", "dumbarton", "clydebank", "renfrew", "falkirk", "clackmannan", "motherwell", "edinburgh", "kilmarnock", "east_kilbride", "hamilton", "glasgow", "dundee", "cumbernauld", "bearsden", "eastwood", "strathkelvin", "tweeddale", "annandale"])

O = np.array([9, 39, 11, 9, 15, 8, 26, 7, 6, 20, 13, 5, 3, 8, 17, 9, 2, 7, 9, 7, 16, 31, 11, 7, 19, 15, 7, 10, 16, 11, 5, 3, 7, 8, 11, 9, 11, 8, 6, 4, 10, 8, 2, 6, 19, 3, 2, 3, 28, 6, 1, 1, 1, 1, 0, 0])

N = len(O)

E = np.array([1.4, 8.7, 3.0, 2.5, 4.3, 2.4, 8.1, 2.3, 2.0, 6.6, 4.4, 1.8, 1.1, 3.3, 7.8, 4.6, 1.1, 4.2, 5.5, 4.4, 10.5, 22.7, 8.8, 5.6, 15.5, 12.5, 6.0, 9.0, 14.4, 10.2, 4.8, 2.9, 7.0, 8.5, 12.3, 10.1, 12.7, 9.4, 7.2, 5.3, 18.8, 15.8, 4.3, 14.6, 50.7, 8.2, 5.6, 9.3, 88.7, 19.6, 3.4, 3.6, 5.7, 7.0, 4.2, 1.8])

logE = np.log(E)

Which for the observed cases of lip cancer look like:
[Figure: map of observed lip cancer cases by district]

And for the expected cases, based on demographics:
[Figure: map of expected lip cancer cases by district]
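
Comparing the two maps numerically, the raw standardised incidence ratio (observed over expected) can be computed directly from the arrays above (a minimal sketch; the name smr is mine):

# Raw standardised incidence ratios: observed counts relative to expectation
smr = O / E
print(np.round(smr, 2))
# The largest raw ratio (about 6.4) is in the first district, skye_lochalsh
print(county[np.argmax(smr)], round(smr.max(), 2))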

So there was a big disparity between observed and expected rates of lip cancer in the late 1970s. The exposure-to-sunlight factor may be responsible, or it may be spatially related to some unmeasured factor (in general, space and time stand in for things we haven’t measured). The remaining data are for the CAR calculations outlined below.

aff = np.array([16, 16, 10, 24, 10, 24, 10, 7, 7, 16, 7, 16, 10, 24, 7, 16, 10, 7, 7, 10, 7, 16, 10, 7, 1, 1, 7, 7, 10, 10, 7, 24, 10, 7, 7, 0, 10, 1, 16, 0, 1, 16, 16, 0, 1, 7, 1, 1, 0, 1, 1, 0, 1, 1, 16, 10])/10.

num = np.array([4, 2, 2, 3, 5, 2, 5, 1,  6, 4, 4, 3, 4, 3, 3, 6, 6, 6 ,5, 3, 3, 2, 6, 8, 3, 4, 4, 4,11,  6, 7, 4, 4, 9, 5, 4, 5, 6, 5, 5, 7, 6, 4, 5, 4, 6, 6, 4, 9, 3, 4, 4, 4, 5, 5, 6])

adj = np.array([[5, 9,11,19],
[7,10],
[6,12],
[18,20,28],
[1,11,12,13,19],
[3, 8],
[2,10,13,16,17],
[6],
[1,11,17,19,23,29],
[2, 7,16,22],
[1, 5, 9,12],
[3, 5,11],
[5, 7,17,19],
[31,32,35],
[25,29,50],
[7,10,17,21,22,29],
[7, 9,13,16,19,29],
[4,20,28,33,55,56],
[1, 5, 9,13,17],
[4,18,55],
[16,29,50],
[10,16],
[9,29,34,36,37,39],
[27,30,31,44,47,48,55,56],
[15,26,29],
[25,29,42,43],
[24,31,32,55],
[4,18,33,45],
[9,15,16,17,21,23,25,26,34,43,50],
[24,38,42,44,45,56],
[14,24,27,32,35,46,47],
[14,27,31,35],
[18,28,45,56],
[23,29,39,40,42,43,51,52,54],
[14,31,32,37,46],
[23,37,39,41],
[23,35,36,41,46],
[30,42,44,49,51,54],
[23,34,36,40,41],
[34,39,41,49,52],
[36,37,39,40,46,49,53],
[26,30,34,38,43,51],
[26,29,34,42],
[24,30,38,48,49],
[28,30,33,56],
[31,35,37,41,47,53],
[24,31,46,48,49,53],
[24,44,47,49],
[38,40,41,44,47,48,52,53,54],
[15,21,29],
[34,38,42,54],
[34,40,49,54],
[41,46,47,49],
[34,38,49,51,52],
[18,20,24,27,56],
[18,24,30,33,45,55]])

# Change to Python indexing (i.e. -1)
for i in range(len(adj)):
    for j in range(len(adj[i])):
        adj[i][j] = adj[i][j]-1


weights = np.array([[1,1,1,1],
[1,1],
[1,1],
[1,1,1],
[1,1,1,1,1],
[1, 1],
[1,1,1,1,1],
[1],
[1,1,1,1,1,1],
[1, 1,1,1],
[1, 1, 1,1],
[1, 1,1],
[1, 1,1,1],
[1,1,1],
[1,1,1],
[1,1,1,1,1,1],
[1, 1,1,1,1,1],
[1,1,1,1,1,1],
[1, 1, 1,1,1],
[1,1,1],
[1,1,1],
[1,1],
[1,1,1,1,1,1],
[1,1,1,1,1,1,1,1],
[1,1,1],
[1,1,1,1],
[1,1,1,1],
[1,1,1,1],
[1,1,1,1,1,1,1,1,1,1,1],
[1,1,1,1,1,1],
[1,1,1,1,1,1,1],
[1,1,1,1],
[1,1,1,1],
[1,1,1,1,1,1,1,1,1],
[1,1,1,1,1],
[1,1,1,1],
[1,1,1,1,1],
[1,1,1,1,1,1],
[1,1,1,1,1],
[1,1,1,1,1],
[1,1,1,1,1,1,1],
[1,1,1,1,1,1],
[1,1,1,1],
[1,1,1,1,1],
[1,1,1,1],
[1,1,1,1,1,1],
[1,1,1,1,1,1],
[1,1,1,1],
[1,1,1,1,1,1,1,1,1],
[1,1,1],
[1,1,1,1],
[1,1,1,1],
[1,1,1,1],
[1,1,1,1,1],
[1,1,1,1,1],
[1,1,1,1,1,1]])

Wplus = np.array([sum(w) for w in weights])
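
Before building the model, a couple of quick consistency checks on the adjacency structure are cheap insurance (a sketch using the arrays defined above):

# Every district's neighbour list and weight list should have length num[i]
assert all(len(a) == n for a, n in zip(adj, num))
assert all(len(w) == n for w, n in zip(weights, num))
# With all weights equal to 1, Wplus is simply the number of neighbours
assert np.array_equal(Wplus, num)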

This model has been analyzed in many places, but I’ll use the model specified on page 165 (2, Banerjee et al.), the WinBUGS code for which is available on Brad Carlin’s website. The data are counts so a Poisson model is a good starting point, with the linear model being used to explain deviations from the expected values in each district:

$$ O_{i}|\psi_{i}\sim P(E_{i}e^{\psi_{i}}) $$
$$ \psi_{i}=\beta_{0}+\beta_{1}AFF_{i}+\theta_{i}+\phi_{i} $$

Here $\beta_{0}$ and $\beta_{1}$ are the intercept and slope for the relationship to the proportion of the district’s population engaged in outdoor occupations. The $\theta_{i}$’s are hierarchical, district-level effects (despite there being only one observation per district) that add extra-Poisson variation:

$$ \theta_{i}\sim N(0,\tau_{h}) $$

given a common precision:

$$ \tau_{h}\sim \Gamma(3.2761,1.81) $$

The $\Gamma(3.2761,1.81)$ priors are given in Table 5.4 of (2, Banerjee et al.) as alternative priors to a more conventional $\Gamma(0.001,0.001)$ uninformative prior. The $\phi_{i}$’s are the spatial or clustering effects given by the CAR part of the model:

$$ \phi_{i}\sim CAR(\tau_{c}) $$

with precision

$$ \tau_{c}\sim \Gamma(1.0,1.0) $$

In WinBUGS then, this model is specified as (from Carlin’s website):

model
{
   for (i in 1 : regions) {
      O[i] ~ dpois(mu[i])
      log(mu[i]) <- log(E[i]) + beta0 + beta1*aff[i]/10 + phi[i] + theta[i]
      theta[i] ~ dnorm(0.0,tau.h)
   }
   phi[1:regions] ~ car.normal(adj[], weights[], num[], tau.c)

   beta0 ~ dnorm(0.0, 1.0E-5)  # vague prior on grand intercept
   beta1 ~ dnorm(0.0, 1.0E-5)  # vague prior on covariate effect

   tau.h ~ dgamma(3.2761, 1.81)    
   tau.c ~ dgamma(1.0, 1.0)  

   sd.h <- sd(theta[]) # marginal SD of heterogeneity effects
   sd.c <- sd(phi[])   # marginal SD of clustering (spatial) effects

   alpha <- sd.c / (sd.h + sd.c)
}

The WinBUGS code has three extra bits – sd.h, sd.c, and alpha – posterior nodes that keep tabs on the marginal standard deviations of global heterogeneity and spatial variation, and on their relative contributions to overall variation.

Looking a little more closely at this model brings up two important points. The first is that $\theta_{i}$ and $\phi_{i}$ represent two random effects, global (hierarchical) and spatial, estimated from a single datapoint in each district, so individually they are unidentified. So how can this be overcome? By specifying the hyperpriors $\tau_{h}$ and $\tau_{c}$; however, these must be chosen sensibly so as not to favour one component over the other. Doing so is tricky because, while $\tau_{h}$ is a marginal precision, $\tau_{c}$ is conditional, according to the CAR structure we chose. This raises the second point – what the hell does $\sim CAR(\tau_{c})$ mean? What does this ‘distribution’ look like and what is its structure? In the WinBUGS code for this model, all we get is:

phi[1:regions] ~ car.normal(adj[], weights[], num[], tau.c)

which tells us about as much as $\sim CAR(\tau_{c})$ does. Because (sadly) I understand code far better than algebra, I delved into Appendix 1 of the GeoBUGS manual to find out what the car.normal() model means. It contains a few key sentences:

…the joint multivariate Gaussian model can be expressed in the form of a set of conditional distributions
$$ S_{i}|S_{-i}\sim Normal\left(\mu_{i}+\sum_{j=1}^{N}\gamma C_{ij}(S_{j}-\mu_{j}),\ \phi M_{ii}\right) $$

($S_{-i}$ means all elements but i)

Meaning that, rather than calculating one massive multivariate normal over the entire dataset (with the associated big matrix inversions), the spatial component can be handled conditionally: each $S_{i}$ is normal, given a mean made up of $\mu_{i}$ plus a weighted sum of the neighbouring deviations ($\sum_{j=1}^{N}\gamma C_{ij}(S_{j}-\mu_{j})$), and the conditional variance for cell i. The intrinsic CAR model (i.e. car.normal()) is then given as

$$ S_{i}|S_{-i}\sim Normal(\bar{S}_{i},\ v/n_{i}) $$

which means that, in the intrinsic case,

$S_{i}$ has a normal distribution with conditional mean given by the average of the neighbouring $S_{j}$’s and conditional variance inversely proportional to the number of neighbours $n_{i}$.

This makes a lot of sense – the value we observe at i is informed by the average of neighbouring values. To be sure, I delved into the very bowels of WinBUGS code – written in the obscure Component Pascal language – to find that the car.normal() distribution is exactly what the manual said:

PROCEDURE (node: Node) Sample (OUT res: SET);
    VAR
        i, len: INTEGER;
        mu, tau, value, wPlus: REAL;
BEGIN
    len := LEN(node.neighs);
    i := 0;
    mu := 0.0;
    wPlus := 0.0;
    WHILE i < len DO
        mu := mu + node.neighs[i].value * node.weights[i];
        wPlus := wPlus + node.weights[i];
        INC(i)
    END;
    mu := mu / wPlus;
    tau := node.tau.Value() * wPlus;
    value := MathRandnum.Normal(mu, tau);
    node.SetValue(value);
    res := {}
END Sample;

This means the CAR sample is a normally distributed random variable, with the mean value for observation i, $\mu$, being the weighted (node.weights[i]) average of the values bordering that observation (node.neighs[i].value), and with a precision (this is WinBUGS/PyMC, remember, so precisions rather than variances) that increases with the weighted number of neighbours present (wPlus). In this model the weights are all given as 1, indicating simply whether the other observations are neighbours or not.
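
To make this concrete, here is a tiny worked sketch (the values and names are made up, for a district with three neighbours and unit weights) of the conditional mean and precision that car.normal() samples from:

import numpy as np

# Hypothetical spatial effects of a district's three neighbours, all with weight 1
neighbour_values = np.array([0.2, -0.1, 0.4])
w = np.array([1.0, 1.0, 1.0])
tau_c_example = 2.0  # illustrative conditional precision parameter

mu_i = np.sum(w * neighbour_values) / np.sum(w)   # average of the neighbours = 0.1666...
tau_i = tau_c_example * np.sum(w)                 # precision scaled by the number of neighbours = 6.0
print(mu_i, tau_i)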

Here, then, is the source of the conditioning on $\tau_{c}$: the weighting scheme used and the number of observations neighbouring each observation. As a result, the prior used for $\tau_{c}$ above comes from (3, Bernardinelli et al.), who argue that the prior marginal standard deviation of $\phi_{i}$ is approximately equal to the prior conditional standard deviation divided by 0.7. As noted by (2, Banerjee et al.), any scale parameter for the prior $\Gamma$ that satisfies:

$$ sd(\theta_{i}) = \frac{1}{\sqrt{\tau_{h}}} \approx \frac{1}{0.7\sqrt{\bar{m}\tau_{c}}} \approx sd(\phi_{i}) $$

where $\bar{m}$ is the average number of neighbours, should be a fair prior for each source of variation. With that understood, all the elements are in place for a PyMC CAR model.
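
As a rough check that the two Gamma priors above satisfy this balance for the Scotland adjacency structure, here is a sketch that plugs the prior means in for the precisions:

m_bar = num.mean()                 # average number of neighbours (about 4.7 here)
tau_h_prior_mean = 3.2761 / 1.81   # prior mean of tau_h
tau_c_prior_mean = 1.0 / 1.0       # prior mean of tau_c
sd_theta = 1.0 / np.sqrt(tau_h_prior_mean)                # about 0.74
sd_phi = 1.0 / (0.7 * np.sqrt(m_bar * tau_c_prior_mean))  # about 0.66
print(sd_theta, sd_phi)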

First the priors:

from pymc import Normal, Gamma, Poisson, Lambda, stochastic, normal_like

# Vague prior on intercept
beta0 = Normal('beta0', mu=0.0, tau=1.0e-5, value=0)
# Vague prior on covariate effect
beta1 = Normal('beta1', mu=0.0, tau=1.0e-5, value=0)

# Random effects (hierarchical) prior
tau_h = Gamma('tau_h', alpha=3.2761, beta=1.81, value=1)
# Spatial clustering prior
tau_c = Gamma('tau_c', alpha=1.0, beta=1.0, value=1)

and the random effects:

# Regional random effects
theta = Normal('theta', mu=0.0, tau=tau_h, value=np.zeros(N))

Next comes the CAR part, which is efficiently handled by a Python list comprehension within a decorator instantiation of a stochastic node:

@stochastic
def mu_phi(tau=tau_c, value=np.zeros(N)):
    # Conditional mean: weighted average of each district's neighbours
    mu = np.array([sum(weights[i]*value[adj[i]])/Wplus[i] for i in range(N)])
    # Conditional precision: tau_c scaled by the (weighted) number of neighbours
    taux = tau*Wplus
    # Joint log-probability of the conditional normals
    return normal_like(value, mu, taux)

# Zero-centre phi
phi = Lambda('phi', lambda mu=mu_phi: mu - np.mean(mu))

This estimates each mu by summing the values of the adjacent observations (indexed by the ragged array adj) multiplied by the weights (the other ragged array, weights), and dividing by the total weight (Wplus, which, because the weights are all 1, is simply the number of adjacent observations); the $\tau$ for each observation is the overall $\tau_{c}$ multiplied by that same total weight (the number of neighbours). These values are then given a normal likelihood and subsequently zero-centred to give $\phi$. Finally, the mean (linear) model and the Poisson likelihood:

# Mean model
mu = Lambda('mu', lambda b0=beta0, b1=beta1, theta=theta, phi=phi: np.exp(logE+b0+b1*aff+theta+phi))

# Likelihood
Yi = Poisson('Yi', mu=mu, value=O, observed=True)

As for the WinBUGS model above, we will also keep tabs on the proportions of total variance for the random and spatial components:

# Marginal SD of heterogeneity effects
sd_h = Lambda('sd_h', lambda theta=theta: np.std(theta))
# Marginal SD of clustering (spatial) effects
sd_c = Lambda('sd_c', lambda phi=phi: np.std(phi))
# Proportion of spatial variance
alpha = Lambda('alpha', lambda sd_h=sd_h, sd_c=sd_c: sd_c/(sd_h+sd_c))
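
With every node defined, the model can be assembled and sampled with PyMC's MCMC machinery; a minimal sketch (the iteration, burn-in, and thinning numbers are illustrative, not necessarily those used for the results below):

from pymc import MCMC

M = MCMC([beta0, beta1, tau_h, tau_c, theta, mu_phi, phi, mu, Yi, sd_h, sd_c, alpha])
M.sample(iter=100000, burn=50000, thin=10)

# Posterior summary for the proportion of spatial variance
alpha_trace = M.trace('alpha')[:]
print(alpha_trace.mean(), np.percentile(alpha_trace, [2.5, 97.5]))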

All that remains is to run the two models in their various frameworks and compare to the results presented in Table 5.4 of (2, Banerjee et al.). Looks pretty good:

[Figure: comparison of posterior estimates from WinBUGS, PyMC, and Table 5.4 of Banerjee et al.]

That is, except that the spatial precision tau_c is somewhat more concentrated in PyMC than in WinBUGS. The proportion of spatial variance, however, is almost identical:

$$ \alpha_{winbugs} = 0.57 [0.45, 0.67] $$
$$ \alpha_{pymc} = 0.57 [0.49, 0.65] $$
$$ \alpha_{banerjee} = 0.57 [0.46, 0.68] $$

Next up – time!


Bibliography

[1] Kemp I, Boyle P, Smans M, Muir C. Atlas of Cancer in Scotland, 1975–1980: Incidence and Epidemiologic Perspective. IARC Scientific Publication 72. Lyon, France: International Agency for Research on Cancer; 1985.

[2] Banerjee S, Carlin BP, Gelfand AE. Hierarchical Modeling and Analysis for Spatial Data. Boca Raton, FL: Chapman and Hall; 2004.

[3] Bernardinelli L, Clayton D, Montomoli C, Ghislandi M, Songini M. Bayesian estimates of disease maps: how important are priors? Statistics in Medicine 1995;14:2411-2431.