This page explains how to get started doing repetitive calculations, such as applying a function to many columns or rows of a data frame, or applying the same function to multiple groups.
for loops
The for loop is probably the easiest way to repeat
something over and over.
The following simple example repeats the same command 5 times. The
element i is a counter that starts at 1 and increases by 1
each time the commands between the curly braces “{ }” are executed.
for(i in 1:5){
  print("Back off, man, I'm a scientist")
}
This next example uses i to index a different element of
a vector on each iteration of the loop. The loop below would print the
elements of a vector x, one element on each iteration.
for(i in seq_along(x)){ # seq_along(x) also copes with an empty x
  print(x[i]) # use "print" to print to screen from inside loops
}
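Here is a minimal runnable sketch of the same loop, assuming a small made-up vector x (the values are invented for illustration). It uses seq_along(x), which gives the same indices as 1:length(x) for a non-empty vector but produces no iterations at all when x is empty, whereas 1:0 would count down through 1 and 0.

```r
# Hypothetical vector, invented for illustration
x <- c(2.5, 3.1, 4.7)

# seq_along(x) yields the indices 1, 2, 3
for(i in seq_along(x)){
  print(x[i])
}

# With an empty vector the loop body is never executed
empty <- numeric(0)
for(i in seq_along(empty)){
  print(empty[i])
}
```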
Typically, you want to automate a repetitive task and save the
results. For example, you might want to generate 100 random samples of
size n = 10 and calculate (and save) the mean each time.
The following loop will accomplish this. To save the results in a new
vector named myMeans, create it before starting the loop.
Inside the loop save the result from each iteration i into
the i’th element of the vector.
myMeans <- vector("numeric", length = 100)
for(i in 1:100){
  x <- runif(10) # generate 10 random numbers from a uniform distribution
  myMeans[i] <- mean(x, na.rm = TRUE)
}
print(myMeans) # see the results!
As a final example, you might have a collection of variables (columns
of a data frame mydata) and want to calculate the sample
mean for each variable. The loop below uses i to index each
variable (column) in turn. To save the results in a new vector named
result, create it before starting the loop. Inside the loop
save the result from each iteration i into the
i’th element of the vector.
result <- vector("numeric", length = ncol(mydata)) # initialize vector to store results
for(i in 1:ncol(mydata)){
  result[i] <- mean(mydata[ ,i], na.rm = TRUE) # mean of ith variable, and store in result
}
result # see the results!
apply
Use the apply command to repeat a function on multiple
columns of a data frame. The code is more compact than the equivalent
for loop, although it is not necessarily faster.
MARGIN = 2 in the following example indicates columns.
FUN indicates the function to use on each column. Arguments
to FUN go last (in this example, na.rm = TRUE
is an argument to the mean function). The output, here
stored in result, is a vector containing the variable
means, one for every column in mydata.
result <- apply(mydata, MARGIN = 2, FUN = mean, na.rm = TRUE)
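To make the call above concrete, the sketch below runs it on a small hypothetical data frame (the column names a and b and their values are assumptions made for illustration). Because apply converts its input to a matrix, the example uses numeric columns only. The built-in colMeans is included purely as a cross-check.

```r
# Hypothetical data frame with numeric columns only
mydata <- data.frame(a = c(1, 2, NA, 4),
                     b = c(10, 20, 30, 40))

# One mean per column; na.rm = TRUE is passed on to mean()
result <- apply(mydata, MARGIN = 2, FUN = mean, na.rm = TRUE)
result

# Cross-check against the built-in shortcut for column means
colMeans(mydata, na.rm = TRUE)
```

The result is a named vector with one element per column, so individual means can be pulled out by name, for example result["b"].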
Homemade functions can be used in the same way. The following
calculates the standard error of each column (variable) of a data frame
mydata.
se <- function(x){ # x is a placeholder for the data passed to the function
  s <- sd(x, na.rm = TRUE) # calculate the standard deviation
  n <- sum(!is.na(x)) # sample size, counting non-missing values only
  s/sqrt(n) # standard error; the last value evaluated is what the function returns
}
result <- apply(mydata, 2, FUN = se)
apply
The apply command is also used to repeat a function on
multiple rows of a data frame.
The command is similar to that used above on columns, except that
MARGIN = 1 is used to indicate rows. The output, here stored in
result, is a vector containing the means, one for each of
the rows in mydata.
result <- apply(mydata, MARGIN = 1, FUN = mean, na.rm = TRUE)
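As a quick illustration of the row-wise call, the sketch below uses a small hypothetical data frame (column names and values are assumptions, not from the original). It checks that the output has one element per row and compares it with the built-in rowMeans.

```r
# Hypothetical data frame: 3 rows, 3 numeric columns
mydata <- data.frame(a = c(1, 2, NA),
                     b = c(10, 20, 30),
                     c = c(100, 200, 300))

# One mean per row; missing values are dropped within each row
result <- apply(mydata, MARGIN = 1, FUN = mean, na.rm = TRUE)
result

# The output has as many elements as mydata has rows
length(result) == nrow(mydata)

# Cross-check against the built-in shortcut for row means
rowMeans(mydata, na.rm = TRUE)
```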
tapply
Use tapply to analyze a vector (variable) separately by
groups. For example, to calculate the median of x
separately for each group identified by the variable A, use
result <- tapply(mydata$x, INDEX = mydata$A, FUN = median, na.rm = TRUE)
Function options go last: in this example, na.rm = TRUE
is an option of the function median.
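A minimal sketch of tapply in action, assuming a hypothetical data frame with a numeric variable x and a grouping variable A (both invented here so the example can run on its own):

```r
# Hypothetical data: x is numeric, A identifies two groups
mydata <- data.frame(x = c(1, 3, 5, 2, NA, 8),
                     A = c("a", "a", "a", "b", "b", "b"))

# Median of x within each level of A; the NA in group "b" is ignored
result <- tapply(mydata$x, INDEX = mydata$A, FUN = median, na.rm = TRUE)
result

# The result is named by group, so a single group can be extracted by name
result["b"]
```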
aggregate
Use aggregate instead of tapply to analyze
multiple variables at once by group. The method lets you identify groups
using more than one categorical variable, if needed (e.g., month and
year).
For example, to calculate the median of the 2nd through 5th variables
in mydata, separately for each group identified by the two
categorical variables A and B, use
result <- aggregate(mydata[ ,2:5], by = list(mydata$A, mydata$B),
    FUN = median, na.rm = TRUE)
The group variable or variables must be enclosed in
list(), even if you are using only one grouping
variable.
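The sketch below shows the single-grouping-variable case, using a hypothetical data frame (the columns A, y and z are assumptions for illustration). Naming the list element, here group, is optional; it sets the name of the grouping column in the output, which would otherwise be Group.1.

```r
# Hypothetical data: A is the grouping variable, y and z are numeric
mydata <- data.frame(A = c("a", "a", "b", "b"),
                     y = c(1, 3, 10, 20),
                     z = c(5, 7, NA, 40))

# Median of y and z within each level of A.
# The grouping variable is wrapped in list() even though there is only one.
result <- aggregate(mydata[ , c("y", "z")], by = list(group = mydata$A),
                    FUN = median, na.rm = TRUE)
result
```

The output is a data frame with one row per group and one column per analyzed variable, plus the grouping column.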
© 2009-2026 Dolph Schluter