Bayesian data analysis

This workshop demonstrates principles of data analysis (estimation and model selection) in a Bayesian framework. For probability and likelihood calculations consult the “Probability” tab at the R tips site.

Estimation using maximum posterior probability

Estimation in a Bayesian framework is related to likelihood methods. With likelihood, we treat the data as given and vary the parameter to find that value for which the probability of obtaining the data is highest. Bayesian methods go one step further, treating the parameter (or hypothesis) as a random variable and seeking the value having highest posterior probability, given the data. We need to specify a prior probability distribution for the parameter values.

Bayesian model selection

Selecting among candidate models requires a criterion for evaluating and comparing models. We’ve already investigated AIC in a previous workshop. Here we demonstrate the superficially similar Bayesian information criterion (BIC). The approach has a tendency to pick a simpler model than does AIC, but this should not be the basis for preferring or shunning it.

Elephant population size estimation

Mark-recapture

This example continues one from the likelihood workshop. Eggert et al. (2003. Molecular Ecology 12: 1389-1402) used mark-recapture methods to estimate the total number of forest elephants inhabiting Kakum National Park in Ghana by sampling dung and extracting elephant DNA to uniquely identify individuals. Over the first seven days of collecting the researchers identified 27 elephant individuals. Refer to these 27 elephants as marked. Over the next eight days they sampled 74 individuals, of which 15 had been previously marked. Refer to these 15 elephants as recaptured.

Probability model

Provided the assumptions are met (no births, deaths, immigrants, or emigrants while the study was being carried out; the dung of every elephant had an equal chance of being sampled, and selection of each individual was independent of the others), then the number of recaptured (previously marked) individuals X in the second sample should have a hypergeometric distribution with parameters k (the size of the second sample of individuals), m (total number of marked individuals in the population when the second sample was taken), and n (total number of unmarked individuals in the population at the time of the second sample).

Likelihood

Using the hypergeometric distribution, calculate the log-likelihood of each of a range of possible values for the number of elephants in the Park. Note that the total number of elephants is N = n + m, and that m is known (m = 27). Note also that only integer values for n (and N) are allowed, and that n cannot be smaller than k - X, the observed number of unmarked individuals in the second sample.
Plot the log-likelihood function. Find the value of n that maximizes the likelihood. Add m to this number to obtain the maximum likelihood estimate for population size N.*
Calculate the likelihood-based 95% confidence interval for the total number of elephants.**

* 133
** 104 < N < 193

Answers

# 1. Estimate N
m <- 27   # total marked individuals in the population
k <- 74   # size of second sample
X <- 15   # number of recaptures

# Note that N must be at least 86 = 74 - 15 + 27

N <- 86:200
loglike <- dhyper(X, m, n = N - 27, k, log = TRUE)

# 2. MLE
plot(loglike ~ N, type = "l", ylab = "Log-likelihood")

N[loglike == max(loglike)]

## [1] 133

# 3. 95% confidence interval
z <- N[loglike < max(loglike) - 1.92]
c( max(z[z < 133]), min(z[z > 133]) )

## [1] 104 193

Bayesian estimate

To develop a Bayesian estimate of population size, we will need to come up with prior probabilities for each of the possible values for n. In reality this might be based upon previous information (visual counts; tracks; knowledge of the maximum number of elephants that can be sustained in an area the size of the Park, etc.). In the absence of prior information, the convention is to use a “flat” or non-informative prior. To apply this to the elephant problem we will come up with a realistic minimum and maximum possible population size. Then we give each value within this interval a prior probability equal to 1 divided by the number of integer values within the interval.

Decide on a minimum and maximum possible value for n, the number of unmarked elephants in the Park (or N = n + m, the total number). The minimum n can be as small as 0 but the likelihood will be 0 for values smaller than k - X, so it is practical to set the smallest n to something greater or equal to k - X. For this first exercise don’t set the maximum too high. We’ll explore what happens later when we set the maximum to be a very large number.
Create a vector of n values (or N = n + m values) containing all the integers between and including the minimum and maximum values that you decided in (4).
Calculate the likelihoods of each of these values for n. Plot the likelihoods against n (or N). Plot the likelihood function. (We need the likelihoods rather than the log-likelihoods for the posterior probability calculations.)
Create a vector containing the prior probabilities for each of the possible values for n that you included in your vector in (5). If you are using a flat prior then the vector will be the same length as your n vector, and each element will be the same. Plot the prior probabilities against n. If all is OK at this stage then the plot should show a flat line. Also, confirm that the prior probabilities sum to 1.
Now apply Bayes Theorem. Using your two vectors from (6) and (7), calculate the posterior probability for each possible value of n (or of N = n + m) between your minimum and maximum values. After your calculation, confirm that all the posterior probabilities sum to 1. Plot the posterior probability against n (or N = n + m). Compare with the shape of the likelihood curve.
What is the most probable value of N = n + m, given the data? Compare this with your previous maximum likelihood estimate.*
Calculate the 95% credible interval for n (or N = n + m, the total population size).** The procedure for finding the lower and upper limits of the credible interval is a bit like that for likelihood. The idea is illustrated in the figure below. Starting from the highest point of the posterior probability, slide a horizontal line downward until you reach a point at which the corresponding values for the parameter (indicated below by the dashed vertical lines) bracket an area of 0.95 under the curve.

Try to think of a method to find the values for n (or N = n + m) that correspond to an area under the curve equal to 0.95. Trial and error might work. At the bottom of this section I describe a crude method that worked for me, but try to come up with a method yourself first.

Compare your 95% credible interval for population size with the approximate likelihood-based 95% confidence interval. Which interval is narrower? Also compare the interpretations of the two intervals. How are they different? Are they compatible? Which makes the most sense to you? Why?
Repeat the procedures (4)-(10) but using a much larger value for the maximum possible population size. How is your credible interval affected by the increase in the maximum value of the posterior probability distribution?
What general implication do you draw from the influence of the prior probability distribution on the interval estimate for population size? Do you consider this implication to be a weakness or a strength of the Bayesian approach?

* I got 133
** I got lower: 106, upper: 187, but you may get a different answer if you used a different prior

Here is one way to find the approximate lower and upper limits to the credible interval.

# First, order the posterior probabilities from highest to lowest
post.ordered <- posterior[order(posterior, decreasing = TRUE)]

# Remember to order the corresponding N values the same way
N.ordered <- N[order(posterior, decreasing=TRUE)]

# Obtain the cumulative sum of the posterior probabilities from lowest to highest
post.cumsum <- cumsum(post.ordered)

# Finally, find N corresponding to a cumulative posterior probability of 0.95.
range(N.ordered[post.cumsum <= 0.95])

The area under the curve corresponding to this interval will actually sum to slightly less than 0.95. To be conservative, subtract 1 from the lower limit, or add 1 to upper limit, whichever first brings us over 0.95 cumulative posterior probability.

Answers

# 5. Vector with possible values for N. This will determine the prior I'm using.
N <- 86:200

# 6. Likelihoods
like <- dhyper(X, m, n = N - 27, k, log = FALSE)
plot(like ~ N, type="l", ylab = "Likelihood")

# 7. Prior probabilities
Pprior <- rep(1/length(N), length(N))
sum(Pprior)

## [1] 1

# 8. Posterior probabilities
Ppost <- like * Pprior / sum(like*Pprior)
plot(N, Ppost, type="l", ylab = "Posterior probability")

# 9. Most probable value of N = n + m
N[Ppost == max(Ppost)]

## [1] 133

# 10. Credible interval

# Cumulative posterior probability, from top down
Pordered <- Ppost[order(Ppost, decreasing = TRUE)]
Pcumsum <- cumsum(Pordered)

# Corresponding N
Nordered <- N[order(Ppost, decreasing=TRUE)]

# N corresponding to cumulative Ppost of 0.95.
range(Nordered[Pcumsum <= 0.95])

## [1] 106 186

# Adjust 
Nordered[Pcumsum == min(Pcumsum[Pcumsum > 0.95])]

## [1] 187

# 12. Repeat with larger value maximum possible population size
N <- 86:2000
like <- dhyper(X, m, n = N - 27, k, log = FALSE)
Pprior <- rep(1/length(N), length(N))
Ppost <- like * Pprior / sum(like*Pprior)
N[Ppost == max(Ppost)]

## [1] 133

Pordered <- Ppost[order(Ppost, decreasing = TRUE)]
Pcumsum <- cumsum(Pordered)
Nordered <- N[order(Ppost, decreasing=TRUE)]
range(Nordered[Pcumsum <= 0.95])

## [1] 103 198

Nordered[Pcumsum == min(Pcumsum[Pcumsum > 0.95])]

## [1] 199

Biodiversity and ecosystem function

In this second example we will compare the fit of linear and nonlinear models to a data set using a Bayesian model selection criterion. Consult the “Fit model” tab of the R tips pages for information on how to calculate the Bayesian Information Criterion (BIC).

We haven’t discussed nonlinear model fitting before. Information on how to fit a nonlinear model using nls in R is also provided on the “Fit model” tab at the R tips web site.

We will investigate the relationship between an ecosystem function variable, CO₂ flux, and the number of eukaryote species (protists and small metazoans) in an experimental aquatic microcosms. (An average of 24 bacterial species was present but not included in the species counts.)

The data are from J. McGrady-Steed, P. M. Harris & P. J. Morin (1997, Nature 390: 162-165). Download it from here.

The variables are:

realized number of species, ranging from 2 to 19; may be lower than the number of species initially introduced by the experiments because of local extinctions.
cumulative CO₂ flux, a measure of total ecosystem respiration.

There are 82 data points representing replicate mesocosms. For this exercise we will assume that each replicate is independent, even though there is overlap in the species composition of different assemblages.

Models

These data can be used to test two alternative models to desceribe the role of biodiversity in ecosystem function:

H₁: Each new species added makes the same contribution to ecosystem respiration regardless of how many species are already present. In this case, a linear relationship is expected between community respiration and the number of species.

H₂: Multi-species communities include species that are functionally redundant, at least for univariate measures of ecosystem function. Under this view, as the number of species increases in a community, each new species makes a smaller and smaller contribution to ecosystem respiration until an asymptote is reached. In this case, the relationship might be described by a Michaelis-Menten curve (see Fit Model page).

Bayesian model selection

Download and read the data from the file.
Plot CO₂ flux against number of species.
Fit a simple linear regression to the data. Add the regression line to the plot. Judging by eye, is it a good fit?
Fit a Michaelis-Menten model to the data. You’ll need to use the version of the formula having a non-zero y-intercept. (Note: When I tried to fit the model to the data the estimation process did not converge, perhaps because 2 rather than 0 or 1 is the smallest value for the explanatory variable. I had better luck when I used the number of species minus 2 rather than number of species as the explanatory variable in the model formula). Add the fitted line to the plot. Judging by eye, is it a good fit?
Calculate BIC for both the linear and nonlinear models that you fit in (3) and (4)*. Which hypothesis has the lowest BIC? Does this accord with your visual judgements of model fit?
Calculate the BIC differences for the two models, and then the BIC weights**. These weights can be interpreted as Bayesian posterior probabilities of the models if both the linear and Michaelis-Menten models have equal prior probabilities, and if we assume that one of these two models is the “true” model. Of course, we can never know whether either of these models is “true”, but we can nevertheless use the weights as a measure of evidence in support of both model, if we are considering only these two.
Compare the models using AIC instead of BIC. Do you get the same “best” model using this criterion instead?
Which hypothesis about the role of biodiversity in ecosystem function receives strongest support from these data?
Assuming that it were possible, would conventional null hypothesis significance testing be a poorer, equivalent, or superior approach to the one used above to decide between the two models? Why?
Will ecosystem respiration really reach an asymptote or might it continue to increase, albeit at a slower and slower rate, as the number of species increases? The power function can be used to model the latter situation. Which function, the Michaelis-Menten or the power function, has strongest support? Use the number of species minus 2 rather than number of species as the explanatory variable in the model formula.

* linear = 1300.878, non-linear = 1294.914
** weight: 0.04823876 and 0.95176124 respectively

Answers

library(ggplot2)

# 1. Download and read data
x <- read.csv(url("https://www.zoology.ubc.ca/~bio501/R/data/biodiv.csv"), 
      stringsAsFactors = FALSE)
head(x)

##   species co2flux
## 1       2    -928
## 2       2   -1034
## 3       2   -1066
## 4       2   -1151
## 5       2   -1215
## 6       3    -854

# 2. Plot 
ggplot(x, aes(x = species, y = co2flux)) + 
    geom_point(size = 3, col = "firebrick", alpha = 0.5) + 
    labs(x = "Number of species", y = "CO2 flux") + 
    theme_classic()

# 3. Linear regression
z1 <- lm(co2flux ~ species, data=x)
summary(z1)

## 
## Call:
## lm(formula = co2flux ~ species, data = x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1062.0  -423.5   -96.6   386.4  1940.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -438.24     144.59  -3.031 0.003283 ** 
## species        63.46      16.40   3.870 0.000221 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 629.5 on 80 degrees of freedom
## Multiple R-squared:  0.1577, Adjusted R-squared:  0.1472 
## F-statistic: 14.98 on 1 and 80 DF,  p-value: 0.0002206

ggplot(x, aes(x = species, y = co2flux)) + 
    geom_point(size = 3, col = "firebrick", alpha = 0.5) + 
    geom_smooth(method = "lm", size = 1, se = FALSE) + 
    labs(x = "Number of species", y = expression(paste(CO[2], " flux"))) + 
    theme_classic()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# 4. Michaelis-Menten model 
z2 <- nls(co2flux ~ a + b*(species-2)/(c+(species-2)), data = x, 
  start = list(a = 0, b = 100, c = 1))
summary(z2)

## 
## Formula: co2flux ~ a + b * (species - 2)/(c + (species - 2))
## 
## Parameters:
##     Estimate Std. Error t value Pr(>|t|)    
## a -1054.3443   263.3068  -4.004  0.00014 ***
## b  1490.2317   284.8087   5.232 1.34e-06 ***
## c     1.0317     0.6672   1.546  0.12604    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 594.7 on 79 degrees of freedom
## 
## Number of iterations to convergence: 6 
## Achieved convergence tolerance: 3.818e-06

x$predictMM <- predict(z2)
ggplot(x, aes(x = species, y = co2flux)) + 
    geom_point(size = 3, col = "firebrick", alpha = 0.5) + 
    geom_smooth(method = "lm", size = 1, se = FALSE) + 
    geom_line(aes(y = predictMM), size = 1) +
    labs(x = "Number of species", y = expression(paste(CO[2], " flux"))) + 
    theme_classic()

# 5. BIC
BIC(z1)

## [1] 1300.878

BIC(z2)

## [1] 1294.914

# 6. BIC weights
delta <- c(BIC(z1), BIC(z2)) - min(c(BIC(z1), BIC(z2)))
L <- exp(-0.5 * delta)
w <- L/sum(L)
w

## [1] 0.04823876 0.95176124

# 7. AIC
AIC(z1)

## [1] 1293.658

AIC(z2)

## [1] 1285.287

# 10. Michaelis-Menten vs power function
z3 <- nls(co2flux ~ a + b*(species-2)^c, data = x, start=list(a = 1, b = 500, c = 1))
summary(z3)

## 
## Formula: co2flux ~ a + b * (species - 2)^c
## 
## Parameters:
##     Estimate Std. Error t value Pr(>|t|)    
## a -1.082e+03  2.644e+02  -4.091 0.000103 ***
## b  8.545e+02  2.862e+02   2.985 0.003771 ** 
## c  2.157e-01  9.113e-02   2.367 0.020366 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 592.4 on 79 degrees of freedom
## 
## Number of iterations to convergence: 6 
## Achieved convergence tolerance: 1.112e-06

BIC(z1)

## [1] 1300.878

BIC(z2)

## [1] 1294.914

BIC(z3)

## [1] 1294.275

x$predictPWR <- predict(z3)
ggplot(x, aes(x = species, y = co2flux)) + 
    geom_point(size = 3, col = "firebrick", alpha = 0.5) + 
    geom_smooth(method = "lm", size = 1, se = FALSE) + 
    geom_line(aes(y = predictMM), size = 1) +
    geom_line(aes(y = predictPWR), size = 1, col = "red") +
    labs(x = "Number of species", y = expression(paste(CO[2], " flux"))) + 
    theme_classic()