Repeat commands using for loops and apply

This page explains how to get started doing repetitive calculations, such as applying a function to many columns of a data frame, or applying the same function to multiple groups.

Simple for loops

The “for” loop is probably the easiest to use.

The following simple example repeats the same command 5 times. The variable `i` is a counter that starts at 1 and increases by 1 each time the commands between the braces `{ }` are executed.

```
for (i in 1:5) {
  print("because it's 2015")
}
```

This next example uses the variable `i` to index a different element of a vector on each iteration of the loop. The loop below prints the elements of a vector `x`, one element per iteration.

```
for (i in seq_along(x)) {   # seq_along(x) is 1, 2, ..., length(x)
  print(x[i])               # use "print" to print to screen from inside loops
}
```
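As a self-contained sketch of indexing a vector inside a loop, assume a hypothetical vector `x` of three made-up values. The second loop illustrates why `seq_along(x)` is a safer index than `1:length(x)`: for an empty vector it runs zero times, whereas `1:0` would count down from 1 to 0.

```r
# Hypothetical vector, values made up for illustration
x <- c(2.5, 3.1, 4.7)

count <- 0
for (i in seq_along(x)) {
  print(x[i])          # print the ith element
  count <- count + 1   # count the iterations
}

# With an empty vector, seq_along() yields no iterations at all
empty <- numeric(0)
n_iter <- 0
for (i in seq_along(empty)) {
  n_iter <- n_iter + 1
}
n_iter
```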

Typically, you want to automate a repetitive task and save the results. For example, you might want to generate 100 random samples of size `n` = 10 and calculate (and save) the mean each time. The following loop will accomplish this. To save the results in a new vector named `myMeans`, create the vector before starting the loop. Inside the loop, save the result from each iteration `i` into the `i`th element of the vector.

```
myMeans <- vector("numeric", length = 100)   # initialize vector to store results
for (i in 1:100) {
  x <- runif(10)                       # generate 10 random numbers from a uniform distribution
  myMeans[i] <- mean(x, na.rm = TRUE)  # store the mean in the ith element
}
print(myMeans)                         # see the results!
```

As a final example, you might have a collection of variables (columns of a data frame `mydata`) and want to calculate the sample mean for each variable. The loop below uses `i` to index each variable (column) in turn. To save the results in a new vector named `result`, create the vector before starting the loop. Inside the loop, save the result from each iteration `i` into the `i`th element of the vector.

```
result <- vector("numeric", length = ncol(mydata))  # initialize vector to store results
for (i in seq_len(ncol(mydata))) {                  # i runs over the column numbers
  result[i] <- mean(mydata[ , i], na.rm = TRUE)     # mean of ith variable, and store in result
}
result                                              # see the results!
```
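To make the column loop concrete, here is a minimal sketch that assumes a small hypothetical data frame (the column names `height` and `mass` and their values are made up). Labelling the result with `names()` is an optional extra step, not part of the loop itself.

```r
# Hypothetical data frame with two numeric columns, one containing a missing value
mydata <- data.frame(height = c(1, 2, 3, NA),
                     mass   = c(10, 20, 30, 40))

result <- vector("numeric", length = ncol(mydata))   # one slot per column
for (i in seq_len(ncol(mydata))) {
  result[i] <- mean(mydata[ , i], na.rm = TRUE)      # mean of the ith column
}
names(result) <- names(mydata)   # label each mean with its column name
result
```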

Repeat same operation on all columns (variables) of a data frame

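Use the `apply` command to repeat a function on multiple columns of a data frame. The code is more compact than the equivalent `for` loop, although it is usually not much faster. Because `apply` first converts the data frame to a matrix, all of the columns should be numeric.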

`MARGIN = 2` in the following example indicates columns. `FUN` indicates the function to use on each column. Arguments to `FUN` go last (in this example, `na.rm = TRUE` is an argument to the `mean` function). The output, here stored in `result`, is a vector containing the variable means, one for every column in `mydata`.

```
result <- apply(mydata, MARGIN = 2, FUN = mean, na.rm = TRUE)
```
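The following sketch runs the same call on a small hypothetical data frame (column names and values are made up). For the specific case of column means, base R also provides `colMeans`, shown here only as a cross-check.

```r
# Hypothetical all-numeric data frame
mydata <- data.frame(height = c(1, 2, 3, NA),
                     mass   = c(10, 20, 30, 40))

# One mean per column; na.rm = TRUE is passed on to mean()
result <- apply(mydata, MARGIN = 2, FUN = mean, na.rm = TRUE)
result

# Base R shortcut for column means, which should agree with the result above
check <- colMeans(mydata, na.rm = TRUE)
check
```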

Homemade functions can be used in the same way. The following calculates the standard error of each column (variable) of a data frame `mydata`.

```
se <- function(x) {           # x is a placeholder for the vector passed in
  s <- sd(x, na.rm = TRUE)    # standard deviation, ignoring missing values
  n <- sum(!is.na(x))         # sample size, counting non-missing values only
  s / sqrt(n)                 # standard error; the last value is what the function returns
}
result <- apply(mydata, MARGIN = 2, FUN = se)
```
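As a worked sketch, the standard error function can be applied to a small hypothetical data frame (columns `a` and `b` are made up). Column `a` has one missing value, so its sample size is 3 rather than 4.

```r
# Standard error of the mean, ignoring missing values
se <- function(x) {
  s <- sd(x, na.rm = TRUE)   # standard deviation
  n <- sum(!is.na(x))        # number of non-missing values
  s / sqrt(n)
}

# Hypothetical data frame; column a contains a missing value
mydata <- data.frame(a = c(1, 2, 3, NA),
                     b = c(2, 4, 6, 8))

result <- apply(mydata, MARGIN = 2, FUN = se)
result   # one standard error per column
```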

Repeat same operation on all rows of a data frame

The `apply` command is also used to repeat a function on multiple rows of a data frame.

The command is similar to that used above on columns, except that `MARGIN = 1` is used to indicate rows. The output, here stored in `result`, is a vector containing the means, one for each of the rows in `mydata`.

```
result <- apply(mydata, MARGIN = 1, FUN = mean, na.rm = TRUE)
```
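A minimal sketch of the row-wise case, assuming a hypothetical data frame in which each row holds three repeated measurements (`t1`, `t2`, `t3` are made-up names). Base R's `rowMeans` is included only as a cross-check for this particular function.

```r
# Hypothetical data frame: two rows, three measurements per row
mydata <- data.frame(t1 = c(1, 4),
                     t2 = c(2, 5),
                     t3 = c(3, NA))

# One mean per row; the NA in row 2 is dropped by na.rm = TRUE
result <- apply(mydata, MARGIN = 1, FUN = mean, na.rm = TRUE)
result

# Base R shortcut for row means, which should give the same values
check <- rowMeans(mydata, na.rm = TRUE)
```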

Analyze a variable by group

Use `tapply` to analyze a vector (variable) separately by groups. For example, to calculate the median of `x` separately for each group identified by the variable `A`:

```
result <- tapply(mydata$x, INDEX = mydata$A, FUN = median, na.rm = TRUE)
```

Function options go last: in this example, `na.rm = TRUE` is an option of the function `median`.
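Here is a small sketch of the `tapply` call on hypothetical data (the values of `x` and the group labels in `A` are made up). The result has one element per group, named after the group.

```r
# Hypothetical data: a numeric variable x and a grouping variable A
mydata <- data.frame(x = c(1, 3, 5, 10, 20, NA),
                     A = c("a", "a", "a", "b", "b", "b"))

# Median of x within each level of A; the NA in group "b" is dropped
result <- tapply(mydata$x, INDEX = mydata$A, FUN = median, na.rm = TRUE)
result

names(result)   # the group labels
```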

Repeat an operation on multiple columns (variables) and by group

Use `aggregate` instead of `tapply` to analyze multiple variables at once by group. The method lets you identify groups using more than one categorical variable, if needed (e.g., month and year).

For example, to calculate the median of the 2nd through 5th variables in `mydata`, separately for each group identified by the two categorical variables `A` and `B`, use

```
result <- aggregate(mydata[ , 2:5], by = list(mydata$A, mydata$B),
                    FUN = median, na.rm = TRUE)
```

The group variable or variables must be enclosed in `list()`, even if you are using only one grouping variable.
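The sketch below applies `aggregate` to a small hypothetical data frame (all names and values are made up). Naming the elements of the list, as in `list(A = ..., B = ...)`, is an optional addition here: it labels the grouping columns in the output `A` and `B` instead of the default `Group.1` and `Group.2`.

```r
# Hypothetical data: two grouping variables and two numeric variables
mydata <- data.frame(A = c("a", "a", "b", "b"),
                     B = c("u", "u", "u", "v"),
                     x = c(1, 3, 5, 7),
                     y = c(2, 4, 6, 8))

# Median of x and y for each combination of A and B that occurs in the data
result <- aggregate(mydata[ , c("x", "y")],
                    by = list(A = mydata$A, B = mydata$B),
                    FUN = median, na.rm = TRUE)
result   # a data frame with one row per group
```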