This page explains how to get started doing repetitive calculations, such as applying a function to many columns or rows of a data frame, or applying the same function to multiple groups.


Simple for loops

The for loop is probably the easiest way to repeat something over and over.

The following simple example repeats the same command 5 times. The variable i is a counter that starts at 1 and increases by 1 each time the commands between the curly braces “{ }” are executed.

for(i in 1:5){
  print("Back off, man, I'm a scientist")
  }

This next example uses i to index a different element of a vector on each iteration of the loop. The loop below would print the elements of a vector x, one element on each iteration.

for(i in seq_along(x)){   # seq_along(x) gives 1, 2, ..., length(x)
  print(x[i])             # use "print" to print to screen from inside loops
  }
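
The loop above assumes that a vector x already exists. Here is a minimal self-contained sketch, using a made-up vector x. It indexes with seq_along(x), which generates 1, 2, ..., length(x) and, unlike 1:length(x), produces no iterations when the vector is empty.

```r
x <- c(2.5, 3.1, 4.7)        # hypothetical example data
for(i in seq_along(x)){
  print(x[i])                # prints one element per iteration
}

empty <- numeric(0)          # a vector of length zero
for(i in seq_along(empty)){
  print(empty[i])            # never reached: seq_along(empty) has no elements
}
# By contrast, 1:length(empty) is 1:0, i.e. c(1, 0), so a loop over it would run twice
```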

Typically, you want to automate a repetitive task and save the results. For example, you might want to generate 100 random samples of size n = 10 and calculate (and save) the mean each time. The following loop will accomplish this. To save the results in a new vector named myMeans, create it before starting the loop. Inside the loop save the result from each iteration i into the i’th element of the vector.

myMeans <- vector("numeric", length = 100)
for(i in 1:100){
  x <- runif(10) # generate 10 random numbers from a uniform distribution
  myMeans[i] <- mean(x, na.rm = TRUE)
  }
print(myMeans) # see the results!
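
Because runif draws random numbers, the loop above gives different means every time it is run. If repeatable results are wanted, one option is to set the random number seed first. The sketch below assumes an arbitrary seed value (42) and then summarizes the saved means.

```r
set.seed(42)                                 # arbitrary seed, chosen only for illustration
myMeans <- vector("numeric", length = 100)   # storage for the 100 means
for(i in 1:100){
  x <- runif(10)                             # 10 random numbers between 0 and 1
  myMeans[i] <- mean(x)                      # save the mean of this sample
}
length(myMeans)                              # number of saved means
summary(myMeans)                             # summary of the 100 sample means
```

Running the block again with the same seed reproduces the same 100 means.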

As a final example, you might have a collection of variables (columns of a data frame mydata) and want to calculate the sample mean for each variable. The loop below uses i to index each variable (column) in turn. To save the results in a new vector named result, create it before starting the loop. Inside the loop save the result from each iteration i into the i’th element of the vector.

result <- vector("numeric", length = ncol(mydata)) # initialize vector to store results
for(i in seq_len(ncol(mydata))){                   # i runs from 1 to the number of columns
  result[i] <- mean(mydata[ ,i], na.rm = TRUE)     # calculate mean of ith variable and store it in result
  }
result                                             # see the results!
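
To try the loop without your own data, here is a sketch using a small made-up data frame (the variable names and values in mydata are invented for illustration). It also labels each mean with its variable name, which the loop does not do on its own.

```r
# Hypothetical data frame: three numeric variables, one with a missing value
mydata <- data.frame(height = c(1.2, 1.5, NA, 1.1),
                     mass   = c(10, 12, 14, 16),
                     age    = c(3, 5, 4, 8))

result <- vector("numeric", length = ncol(mydata))  # one slot per column
for(i in seq_len(ncol(mydata))){                    # same as 1:ncol(mydata) here
  result[i] <- mean(mydata[ ,i], na.rm = TRUE)      # mean of ith column, ignoring NA
}
names(result) <- names(mydata)                      # attach the variable names
result
```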

Repeat something on several columns

Use apply

Use the apply command to repeat a function on multiple columns of a data frame. The code is more compact than a for loop, although it is not necessarily faster.

MARGIN = 2 in the following example indicates columns. FUN indicates the function to use on each column. Arguments to FUN go last (in this example, na.rm = TRUE is an argument to the mean function). The output, here stored in result, is a vector containing the variable means, one for every column in mydata.

result <- apply(mydata, MARGIN = 2, FUN = mean, na.rm = TRUE)
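
A minimal sketch with a made-up data frame follows. Because apply works on a matrix version of the data frame, a non-numeric column would turn every value into text, so the example first selects the numeric columns using sapply and is.numeric. The column names (site, height, mass) and the helper name numericCols are invented for illustration.

```r
# Hypothetical data frame with one character column and two numeric columns
mydata <- data.frame(site   = c("a", "b", "c", "d"),
                     height = c(1.2, 1.5, NA, 1.1),
                     mass   = c(10, 12, 14, 16))

numericCols <- sapply(mydata, is.numeric)   # TRUE for each numeric column
result <- apply(mydata[ , numericCols], MARGIN = 2, FUN = mean, na.rm = TRUE)
result                                      # one mean per numeric column, labelled by column name
```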

Homemade functions can be used in the same way. The following calculates the standard error of each column (variable) of a data frame mydata.

se <- function(x){            # x stands for the vector handed to the function
  s <- sd(x, na.rm = TRUE)    # standard deviation, ignoring missing values
  n <- sum(!is.na(x))         # sample size: number of non-missing values
  s/sqrt(n)                   # standard error; the last value is what the function returns
  }
result <- apply(mydata, 2, FUN = se)
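
As a check on the function, the sketch below applies it to a small made-up data frame (names and values invented for illustration) and compares one result with the standard error calculated by hand.

```r
se <- function(x){
  n <- sum(!is.na(x))              # number of non-missing values
  sd(x, na.rm = TRUE) / sqrt(n)    # standard error of the mean
}

# Hypothetical data: height has one missing value, mass has none
mydata <- data.frame(height = c(1.2, 1.5, NA, 1.1),
                     mass   = c(10, 12, 14, 16))

result <- apply(mydata, 2, FUN = se)
result                             # one standard error per column

# By hand for mass: sd divided by the square root of n = 4
sd(mydata$mass) / sqrt(4)
```

For height the function uses n = 3, because the missing value is not counted.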

Repeat something on multiple rows

Use apply

The apply command is also used to repeat a function on multiple rows of a data frame.

The command is similar to that used above on columns, except that MARGIN = 1 is used to indicate rows. The output, here stored in result, is a vector containing the means, one for each of the rows in mydata.

result <- apply(mydata, MARGIN = 1, FUN = mean, na.rm = TRUE)
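
For instance, with a made-up data frame in which each row is an individual and each column is a repeated measurement (the names trial1 to trial3 are invented for illustration), a sketch of the row calculation is:

```r
# Hypothetical data: 3 individuals (rows) measured on 3 trials (columns)
mydata <- data.frame(trial1 = c(1, 4, NA),
                     trial2 = c(2, 5, 8),
                     trial3 = c(3, 6, 10))

result <- apply(mydata, MARGIN = 1, FUN = mean, na.rm = TRUE)
result   # means of rows 1 to 3: 2, 5 and 9 (the NA in row 3 is ignored)
```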

Analyze a variable by group

Use tapply

Use tapply to analyze a vector (variable) separately by groups. For example, to calculate the median of x separately for each group identified by the variable A, use

result <- tapply(mydata$x, INDEX = mydata$A, FUN = median, na.rm = TRUE)

Function options go last: in this example, na.rm = TRUE is an option of the function median.
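
A small worked sketch, assuming a made-up data frame with a numeric variable x and a grouping variable A that has two groups, "a" and "b":

```r
# Hypothetical data: six measurements in two groups, one value missing
mydata <- data.frame(x = c(1, 3, 5, 2, NA, 8),
                     A = c("a", "a", "a", "b", "b", "b"))

result <- tapply(mydata$x, INDEX = mydata$A, FUN = median, na.rm = TRUE)
result        # median is 3 for group a and 5 for group b
result["b"]   # results are labelled by group, so a single group can be extracted by name
```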


Analyze multiple variables by group

Use aggregate

Use aggregate instead of tapply to analyze multiple variables at once by group. The method lets you identify groups using more than one categorical variable, if needed (e.g., month and year).

For example, to calculate the median of the 2nd through 5th variables in mydata, separately for each group identified by the two categorical variables A and B, use

result <- aggregate(mydata[ ,2:5], by = list(mydata$A, mydata$B), 
                      FUN = median, na.rm = TRUE)

The group variable or variables must be enclosed in list(), even if you are using only one grouping variable.
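
The following sketch runs the same command on a made-up data frame whose 2nd through 5th columns are numeric (all names and values are invented for illustration). Naming the elements of the list, as in list(A = ..., B = ...), is optional; it labels the grouping columns of the output A and B rather than the default Group.1 and Group.2.

```r
# Hypothetical data: grouping variables A and B, numeric variables in columns 2 to 5
mydata <- data.frame(A  = rep(c("a", "b"), each = 2, times = 2),  # a a b b a a b b
                     v1 = c(1, 3, 5, 7, 9, 11, 13, NA),
                     v2 = c(2, 4, 6, 8, 10, 12, 14, 16),
                     v3 = c(5, 5, 6, 6, 7, 7, 8, 8),
                     v4 = c(0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5),
                     B  = rep(c("u", "v"), each = 4))             # u u u u v v v v

result <- aggregate(mydata[ ,2:5], by = list(A = mydata$A, B = mydata$B),
                    FUN = median, na.rm = TRUE)
result   # one row per combination of A and B, one column of medians per variable
```

The output is a data frame, so the medians for a particular group can be extracted in the usual way, for example result[result$A == "b" & result$B == "v", ].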

© 2009-2024 Dolph Schluter