Bootstrap resampling methods

This workshop will illustrate some of the principles of data analysis using resampling. See the “Resample” tab on the R tips help pages for advice.

Methods summary

Here is a quick summary of the two methods, the bootstrap and the permutation test, applied in today’s workshop.

Boostrap estimation

The bootstrap is is a computer-intensive method used to calculate standard errors for an estimate and to obtain approximate confidence intervals for a population parameter. It works well in most situations except when sample size is small. The basic procedure is:

Take a random sample of the data, sampling with replacement
Compute the statistic on the bootstrap replicate obtained in (1)
Repeat steps (1) and (2) many times (e.g., B = 10000), saving the result each time
Plot the B bootstrap replicate estimates to visualize the shape of the sampling distribution
Calculate the standard deviation of the B boostrap replicate estimates to obtain the standard error of the estimate

It is also possible to use the bootstrap replicates to obtain a confidence interval for the parameter using a percentile method and the BC_a method.

Permutation test

The permutation test is a computer-intensive method for hypothesis testing. It is used to generate a null distribution for a test statistic measuring association between two variables (or differences between groups). When used to test differences between two or more groups in a numerical variable the method assumes that the distributions have equal shape. However, the method is fairly robust to departures from this assumption. The steps are:

Create a randomized data set in which the values for one of two variables are randomly reassigned to subjects (without replacement)
Calculate the test statistic measuring association between the variables (or difference between groups) on the randomized sample obtained in (1)
Repeat steps (1) and (2) many times (B = 10000), saving the result each time
Plot the B values of the test statistic generated in (3) to visualize the shape of the null distribution
Using the distribution of the test statistic from (3), calculate the tail probability for the observed value of the test statistic (calculated on the original data). Multiply this number by 2 if test is two-sided.

Caribbean bird immigration

Birds of the Caribbean islands of the Lesser Antilles are descended from rare immigrants from larger islands and the nearby mainland. The data here are the approximate dates of immigration, in millions of years, of each of 37 bird species now present on the Lesser Antilles (Ricklefs and Bermingham 2001, Science 294: 1522-1524). The dates were calculated from the difference in mitochondrial DNA sequences between each of the species and its closest living relative on larger islands or the mainland. The data can be downloaded here.

What shape is the frequency distribution of estimated immigration dates? Use a graph to display it.
What are the mean and median dates of immigration, in millions of years? Why are the two values so different?
Obtain a single bootstrap replicate of the immigration dates and plot the results. How different is the frequency distribution from that of the data?
Write a short loop to generate 10000 bootstrap replicate estimates for the sample median immigration date. It is a good idea to begin by using only 10 replicates until your loop is tested and found to be working smoothly. Store the resulting medians in a vector.
Plot the frequency distribution of your results from (4). What does this frequency distribution estimate?
Using your results in (4) to calculate a standard error for the sample median*.
Most of the familiar estimators of population parameters, such as sample mean and variance, are unbiased, which means that the mean of its sampling distribution equals the parameter value being estimated. For example, the mean of the sampling distribution of the sample mean is $, the very parameter being estimated. The sample mean is an unbiased estimator. However, some estimators are biased, and the bootstrap is often used to estimate bias. Is the median of immigration dates biased? Calculate the mean of the bootstrap replicate estimates of the median immigration date to estimate the bias**.
Use the percentile method (check the quantile() function) to generate an approximate 95% confidence interval for the population median***.
Apply the boot package to generate a more accurate boostrap confidence interval for the median using the BC_a method.****

* will be about 2.5
** will be about 4.4 (This exceeds the median of the original sample, 3.5. So, our bootstrap tells us that our statistic, the sample median, tends to overestimate the true median by about 0.9.)
*** should be about (1.8, 11.1)
**** should be about (1.8, 10.6)

Answers

library(boot)

# 1. Histogram of right-skewed data
x <- read.csv(url("https://www.zoology.ubc.ca/~bio501/R/data/antilles.csv"), stringsAsFactors = FALSE)
head(x)

##   immigration.date
## 1              0.0
## 2              0.0
## 3              0.0
## 4              0.2
## 5              0.3
## 6              0.5

hist(x$immigration.date, breaks=seq(0,30,by=2), col="firebrick", right = FALSE,
    xlab = "Immigration date (my)", las = 1, main = "")

# 2. Mean and median dates of immigration differ because distribution strongly skewed
mean(x$immigration.date)

## [1] 8.664865

median(x$immigration.date)

## [1] 3.5

# 3. Single bootstrap replicate of the immigration dates
xboot <- sample(x$immigration.date, replace=TRUE)
hist(xboot, breaks=seq(0,30,by=2), col="firebrick", right = FALSE,
    xlab = "Immigration date (my)", las = 1, main = "")

# 4. Loop to generate 10000 bootstrap replicate estimates 
z <- vector()
for(i in 1:10000){
    xboot <- sample(x$immigration.date, replace=TRUE)
    z[i] <- median(xboot)
    }

# 5. Bootstrap sampling distribution of median
hist(z, col="firebrick", right = FALSE,
    xlab = "Median immigration date (my)", las = 1, main = "")

# 6. Bootstrap standard error of sample median
sd(z)

## [1] 2.482597

# 7. Mean of bootstrap medians
mean(z)

## [1] 4.37471

# 8. Approximate 95% CI
quantile(z, probs = c(0.025,0.975))

##  2.5% 97.5% 
##   1.8  11.1

# 9. More accurate boostrap confidence interval
boot.median <- function(x,i){
    boot.median <- median(x[i])
    }
z <- boot(x$immigration.date, boot.median, R = 10000)
sd(z$t)

## [1] 2.524183

boot.ci(z, type = "bca")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = z, type = "bca")
## 
## Intervals : 
## Level       BCa          
## 95%   ( 1.8, 10.6 )  
## Calculations and Intervals on Original Scale

Trillium fragmentation

Logging in western North America impacts populations of western trillium (Trillium ovatum), a long-lived perennial inhabiting conifer forests. Jules and Rathcke (1999, Conservation Biology 13:784-793) measured attributes of eight local populations of western trillium confined to forest patches of varying size created by logging in southwestern Oregon. A subset of their data are here. The variables included are

population
forest fragment size (ha)
distance between local population and forest edge (m)
years since isolated
number of plants in local population
proportion of plants consumed by deer in of the years of study
recruitment rate, the density of new plants produced in each population per year

Plot recruitment against fragment size.
From visual inspection of your plot in (1), choose an appropriate transformation of one or both variables to meet (roughly) the assumptions of linear regression*.
Using a linear model on your transformed data, estimate the slope of the linear regression of recruitment on fragment size. Inspect the diagnostic plots. These will reveal that although transformation improved matters, there might still be some issues regarding the assumptions of equal variance of residuals. A small sample size makes it difficult to pursue the issue much further. Let’s take a conservative approach for the purposes of this exercise and use a nonparametric approach, the permutation test, to test the null hypothesis of zero slope. We’ll use the transformed data because the transformation rendered the relationship more linear (it is the slope of a line that we wish to test) and it eliminated the problem of one or two data points having excessive influence.
Using the permutation test, create a null distribution for the slope of the linear regression.
Plot the null distribution. Compare the distribution to the value of the slope you estimated in (3). Is your slope near the middle or toward one of the tails of the null distribution?
Calculate the tail probability using the results from your analysis of the data in (3) and the null distribution.
Use your results in (6) to calculate an approximate $P$-value for the test of the null hypothesis of zero slope**.
Note: As we’ve discussed in class, the analysis of data shouldn’t end with a P-value because it tells us nothing about magnitudes of effects. A drawback of the permutation test is that it is entirely focused on the P-value and provides no estimates of parameters. The bootstrap is more informative, but sample size here is probably too small for a reliable outcome.

* I log-transformed both variables
** will be about 0.26

Answers

# 1. Plot recruitment against fragment size
x <- read.csv(url("https://www.zoology.ubc.ca/~bio501/R/data/trillium.csv"), stringsAsFactors = FALSE)
x

##   population fragsize.ha distedge years   N herbivory recruitment
## 1          1        2.97       67    16  53      0.50      0.0053
## 2          2        3.72       65    14  48      0.13      0.0021
## 3          3        6.48       61    29  26      0.33      0.0069
## 4          4       11.90       30    13  64      0.06      0.0006
## 5          5       62.09       84    30 116      0.38      0.0124
## 6          6      116.74       97    26  56      0.13      0.0045
## 7          7      215.99       16    16  58      0.12      0.0028
## 8          8     1000.00      332    24 129      0.08      0.0182

plot(recruitment ~ fragsize.ha, data = x, pch = 16, col = "firebrick", las = 1,
    xlab = "Fragment size (ha)", ylab = "Recruitment", cex = 1.5)

# 2. Transform
x$ln.recruitment <- log(x$recruitment)
x$ln.fragsize <- log(x$fragsize.ha)
plot(ln.recruitment ~ ln.fragsize, data = x, pch = 16, col = "firebrick", las = 1,
    xlab = "Log fragment size", ylab = "Log recruitment", cex = 1.5)

# 3. Linear model
z <- lm(ln.recruitment ~ ln.fragsize, data = x)
summary(z)

## 
## Call:
## lm(formula = ln.recruitment ~ ln.fragsize, data = x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7533 -0.4112  0.2076  0.7624  0.8996 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6.2284     0.7409  -8.407 0.000154 ***
## ln.fragsize   0.2274     0.1846   1.232 0.264114    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.035 on 6 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:  0.06881 
## F-statistic: 1.517 on 1 and 6 DF,  p-value: 0.2641

plot(z)

# 4. Permutations of regression slope
slope <- vector()
for(i in 1:10000){
    x1 <- sample(x$ln.fragsize, replace = FALSE)
    slope[i] <- coef(lm(ln.recruitment ~ x1, data = x))[2]
    }

# 5. Null distribution
hist(slope, col="firebrick", right = FALSE, xlab = "Slope", 
      las = 1, main = "")

# 6. Tail probability
tailP <- length(slope[slope >= 0.2274])/10000
tailP

## [1] 0.1239

# 7. P-value
Pval <- 2 * tailP
Pval

## [1] 0.2478

Vampire attack

The vampire bat, Desmodus rotundus, commonly feeds on the blood of domestic cattle. It prefers cows to bulls, which suggested that the bats might be responding to a hormonal cue. To explore this, the frequency of vampire bat attacks on cows in estrous was compared with attacks on anestrous female cows. The following data are from one night of observation (D. C. Turner 1975, The vampire bat: A field study in behavior and ecology. Johns Hopkins Press, Baltimore).

	Female cows in estrous	Anestrous female cows
Bitten by vampire bat	15	6
Not bitten	7	322

To begin the analysis, create vectors for the two variables (estrous and bitten) each with two states (yes/no) corresponding to the above measurements for the 350 cows. 15 of the cows should have “yes” for both variables, 7 should have “yes” for estrous and “no” for bitten, and so on. Combine the two vectors into a data frame.
Create a 2x2 contingency table of your new variables to confirm that you obtain the same table as that above.
Use a conventional $\chi^2$ contingency test to test the null hypothesis of no difference in the frequency of cows bitten by vampire bats between estrous and anestrous cows. Use the chisq.test() command in R (type ?chisq.test for help). Store the results in an object (e.g., z). If you execute the command correctly you should receive a warning. What is the source of the problem? To determine this, examine the expected frequencies (they are stored in the results object, e.g.,that you named z). (Do you remember the assumptions of the $\chi^2$ contingency test from your first stats course?*)
Use the permutation test instead to test the null hypothesis. (HINT: check if this may have already be implemented in the function.)
Calculate the odds ratio using the variables you created in (1). The odds ratio is a commonly used measure of association in 2x2 tables. It is the ratio of the odds of success in a treatment group compared to the odds of success in a control group. Here, the ratio will be odds of estrous cows being bitten / odds of anestrous cows being bitten. The quickest way to calculate odds ratio from a table of frequencies is OR = (a/c) / (b/d), where

treatment

control

success

a

b

failure

c

d

A problem arises if one of the four table frequencies is 0, which can create an odds ratio of 0 or infinity. Although not optimal, a standard fix is to add 0.5 to each cell of the table before calculating OR.

Use the bootstrap to obtain a standard error for the estimate of odds ratio. By chance your resampling might occasionally produce a zero for the number of anestrous cows bitten, which will result in a division by zero and a value of “Inf” for the odds ratio or relative risk. One solution is to add 0.5 to each cell of the 2x2 table before calculating the odds ratio. Alternatively, just remove the “Inf” values from the vector of results using
```
z <- z[is.finite(z)]
```

Use the percentile method to obtain a bootstrap 95% confidence interval for the odds ratio.

* The $\chi^2$ approximation to the null distribution may not be accurate when too many of the expected frequencies are less than 5.

Answers

library(boot)

# 1. Create data frame

bitten <- c( rep(c("bitten","not"),c(15,7)), rep(c("bitten","not"),c(6,322)) )
estrous <- rep(c("estrous","no"), c(22,328))
x <- cbind.data.frame(bitten, estrous, stringsAsFactors = FALSE)
head(x)

##   bitten estrous
## 1 bitten estrous
## 2 bitten estrous
## 3 bitten estrous
## 4 bitten estrous
## 5 bitten estrous
## 6 bitten estrous

# 2. 2x2 contingency table
VampireTable <- table(x$bitten, x$estrous)

# 3. chi^2 test
z <- chisq.test(VampireTable)

## Warning in chisq.test(VampireTable): Chi-squared approximation may be incorrect

z$expected

##         
##          estrous     no
##   bitten    1.32  19.68
##   not      20.68 308.32

# 4. Permutation test
chisq.test(VampireTable, simulate.p.value = TRUE, B = 5000)

## 
##  Pearson's Chi-squared test with simulated p-value (based on 5000
##  replicates)
## 
## data:  VampireTable
## X-squared = 160.94, df = NA, p-value = 2e-04

# 5. Odds ratio
OR <- (VampireTable[1,1]/VampireTable[2,1])/(VampireTable[1,2]/VampireTable[2,2])
OR

## [1] 115

# 6. Bootstrap standard error
OR <- function(x, i){
    tab <- table(x[,1][i],x[,2][i])
      OR <- ((tab[1,1]+.5)/(tab[2,1]+.5))/((tab[1,2]+.5)/(tab[2,2]+.5))
      }
z <- boot(x, OR, R = 5000)
z

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = x, statistic = OR, R = 5000)
## 
## 
## Bootstrap Statistics :
##     original   bias    std. error
## t1* 102.5385 45.73285    164.8772

# 7. Use the percentile method to obtain a 95% confidence interval for the odds ratio.
boot.ci(z)

## Warning in boot.ci(z): bootstrap variances needed for studentized intervals

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 5000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = z)
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   (-266.3,  380.0 )   (-259.7,  167.9 )  
## 
## Level     Percentile            BCa          
## 95%   ( 37.2, 464.8 )   ( 30.8, 362.8 )  
## Calculations and Intervals on Original Scale

	treatment	control
success	a	b
failure	c	d