Resample

Resampling methods

This page provides tips on how to carry out bootstrapping and permutation tests in R by resampling data.


Bootstrap resampling

The bootstrap is mainly used in estimation. The method uses resampling with replacement to generates an approximate sampling distribution of an estimate. Standard errors and confidence intervals can be calculated using this bootstrap sampling distribution.

Resample data

Let’s assume that the data are a sample of measurements for a single variable stored in a vector x. The data may be numeric or categorical.

  1. A single bootstrap replicate is obtained as follows. The replace option is used to indicate that sampling is carried out with replacement.
    xboot <- sample(x, replace = TRUE)
    
  2. Calculate the statistic of interest (for example, the mean) on the resampled data in xboot and store the result in a vector created for this purpose.
    z <- vector()        # initialize z (do only once)
    z[1] <- mean(xboot)  # first bootstrap replicate estimate
    
  3. Repeat steps (1) and (2) many times. The result will be a large collection of bootstrap replicate estimates for subsequent analysis.

In other cases, two or more variables are measured on individuals (e.g., stem height, leaf area, petal diameter, etc). Assume that each row of a data frame mydata is a different individual, and each column a different variable.

  1. To resample individuals (i.e., rows),
    iboot <- sample(1:nrow(mydata), replace = TRUE))
    bootdata <- mydata[iboot,]
    

    The data frame bootdata will contain a single bootstrap replicate including all the variables.

  2. Calculate the statistic of interest on the resampled data and store the result in vector created for this purpose. For example, to calculate the correlation between two variables x and y in bootdata,
    z <- vector()        # initialize z (do only once)
    z[1] <- cor(bootdata$x, bootdata$y) 
    
  3. Repeat steps (1) and (2) many times. The result will be a large collection of bootstrap replicate estimates for subsequent analysis.

Bootstrap standard error

Assume that the vector z contains a large number of bootstrap replicate estimates of a statistic of interest. The bootstrap standard error of the statistic is obtained as the standard deviation of the bootstrap replicate estimates. (The most common mistake at this stage is to calculate the standard error of the bootstrap replicates rather than the standard deviation.)

sd(z)

Confidence interval approximation using the percentile method

The percentile method is often used to provide an approximate 95% confidence interval for the population parameter. The percentile method is not as accurate as the BCa method, but it is very quick and easy to calculate. Assume that the vector z contains a large number of bootstrap replicate estimates of the statistic of interest. By the percentile method, a 95% boostrap confidence interval for the parameter is obtained as

quantile(z, probs = c(0.025,0.975))

A large number of bootstrap replicate estimates is required for an accurate confidence interval.

Bootstrap estimate of bias

The bootstrap is often used to determine the bias of an estimate. Assume that the vector z contains a large number of bootstrap replicate estimates of a statistic of interest. Also, assume that the same statistic computed on the original data is stored in an object named “estimate”. The bootstrap estimate of bias is calculated as

mean(z) - estimate

Using the boot package

The boot library, included with the standard R installation, includes useful commands for bootstrapping. Notable, it will calculate confidence intervals using the BCa method, which is more accurate than those produced by the percentile method. BCa stands for “bias corrected and accelerated”. The method corrects for estimated bias and skewness. Consult Efron and Tibshirani (1998) for details.

To begin, load the boot library

library(boot)

Single variable

To use the boot package you will need to write a function to calculate the statistic of interest. The format is illustrated below for the sample mean, but any univariate function would be handled similarly. We’ll call our function “boot.mean”. When you have finished writing the script for a function you will need to cut and paste it into the command window so that R can access it (you’ll need to do this just once in an R session). Here, x refers to the vector of data. i serves as a counter, as in your own for loop, but it must be included as an argument in your function as shown.

boot.mean <- function(x,i){boot.mean <- mean(x[i])}

The command boot will automatically carry out all the resampling and computations required. For this example, x is the vector of original data and boot.mean is the name of the function we created above to calculate the statistic of interest. R specifies the number of bootstrap replicate estimates desired.

z <- boot(x, boot.mean, R = 2000)

The resulting object (which here named z) is a boot object containing all the results. Use the following additional commands to pull out the results.

print(z)                  # Bootstrap calculation of bias and SE
sd(z$t)                   # Another way to get the standard error

hist(z$t)                 # Histogram of boostrap replicate estimates
qqnorm(z$t)               # Normal quantiles of replicate estimates

boot.ci(z, type = "bca")  # 95% confidence interval using BCa
boot.ci(z, type = "perc") # Same using percentile method

Two or more variables

Here’s how you would use the boot command If the statistic of interest must be calculated from two (or more) variables (for example, a correlation, regression slope, or odds ratio). Assume that there are two variables of interest, x and y. If not done already, put the two variables into a data frame (here called mydata),

mydata <- cbind.data.frame(x, y, stringsAsFactors = FALSE)

Then create a function to calculate the statistic of interest on the variables in mydata. For example, to create a function that calculates the correlation coefficient between the two variables x and y, use

boot.cor <- function(mydata,i){
  x <- mydata$x[i]
  y <- mydata$y[i]
  boot.cor <- cor(x,y)
  }

Here, i refers to a vector of indices, which must be included as an argument in the function and employed as shown.

Finally, pass your data frame and function to the boot command,

z <- boot(mydata, boot.cor, R = 2000)

See the previous section for a list of commands to pull results from the boot object (here named z).


Permutation test

A permutation test uses resampling and the computer to generate a null distribution for a test statistic. The test statistic is a measure of association between two variables or difference between groups, such as a slope, a correlation, or an odds ratio. Each permutation step involves randomly resampling without replacement the values of one of the two variables in the data and recalculating the test statistic in each permutation. The two variables may be categorical (character or factor), numeric, or one of each.

Both variables categorical

R has a built-in permutation procedure for a contingency test of association when both of two variables are categorical (call them A1 and A2). To apply it, execute the usual command for the χ2 contingency test, but set the simulate.p.value option to TRUE. The number of replicates in the permutation is set by the option B (default is 2000). Each permutation rearranges the values in the contingency table while keeping all the row and column totals fixed to their observed values.

chisq.test(A1, A2, simulate.p.value = TRUE, B = 5000)

Data are numeric

If one or both of the variables is numeric, then you will need to create a short loop to carry out the resampling necessary for the permutation test. Choose one of the two variables to resample (call it x). It doesn’t matter which of the two variables you choose. Keep the other variable (call it y) unchanged (there is no benefit to resampling both variables).

  1. Resample x without replacement to create a new vector (call it x1).
    x1 <- sample(x, replace = FALSE)
    
  2. Calculate the test statistic to measure association between y and the randomized variable x1. Store the result in a vector created for this purpose. For example, to calculate the correlation between the two variables,
    z <- vector()        # initialize z (do only once)
    z[1] <- cor(x1, y)   # first permutation result
    
  3. Repeat steps (1) and (2) many times. The result will be a large collection of replicates representing the null distribution of your test statistic.