Introduction to R

The goal of this first workshop is to get you started working in R, and to introduce the most commonly-used data types, operations, and functions. Help is on the R tips page. The most useful pages for this exercise are Calculate, Data, and Graphs & Tables for producing frequency tables.

Using the console

The command line in the R console is where you interact with R.

Oversized calculator

At its most basic, the command line is a bloated calculator. The basic operations are

+  -  *  /

for add, subtract, multiply, divide. Familiar calculator functions also work on the command line. For example, to take the natural log of 2, enter the following.

log(2)

## [1] 0.6931472

Try the calculator out to get a feel for this basic application and the style of the output. Try log10() for log base 10, and sqrt() and abs() for square root and absolute value.
R allows you to store or assign numbers and character strings to named variables called vectors, which are a type of “object” in R. For example, to assign the number “3” to a variable x, use
```
x <- 3
```
The assign symbol is a “less than” symbol followed by a dash, <-, with no space between. Try assigning a single number to a named variable.
R can also assign character strings (enter using double quotes) to named variables. Try entering
```
z <- "Wake up Neo"  # single or double quotes needed
print(z)
```
At any time, enter ls() to see the names of all the objects in the working R environment.
Assign a single number to the variable x and another number to the variable y. Then watch what happens when you type an operation, such as multiplication:
```
x * y
```
Finally, you can also store the result in a third variable.
```
z <- x * y
```
To print the contents of z, just enter the name on the command line, or enter print(z).

The calculator will also give a TRUE or FALSE response to a logical operation. Try one or more variations of the following examples on the command line to see the outcome.

2 + 2 == 4     # Note double "==" for logical "is equal to"
3 <= 2         # less than or equal to
"A" < "a"      # greater than
"Hi" != "hi"   # not equal to (i.e., R is case sensitive)

Vectors

Vectors in R are used to represent variables. R can assign sets of numbers or character strings to named variables using the c() command, for concatenate. (R treats a single number or character string as a vector too, having just one element.)

x <- c(1, 2, 333, 65, 45, -88)

Assign a set of 10 numbers to a variable x. Make sure it includes some positive and some negative numbers. To see the contents afterward, enter x on the command line, or enter print(x). Is it really a vector? Enter is.vector(x) to confirm.
Use integers in square brackets to indicate subsets of vector x.
```
x[5]        # fifth element
```
Try this out. See also what happens when you enter vectors of indices,
```
x[1:3]        # 1:3 is a shortcut for c(1,2,3)
x[c(2, 4, 9)] # need c() if more than one index
```
Print the 3rd and 6th elements of x with a single command.
Some functions of vectors yield integer results and so can be used as indices too. For example, enter the function
```
length(x)
```
Since the result is an integer, it is ok to use as follows
```
x[length(x)]
```
The beauty of this construction is that it will always give the last element of a vector x no matter how many elements x contains.
Logical operations can also be used to generate indices. First, enter the following command and compare with the contents of x,
```
x < 0
```
Now enter
```
x[x < 0]
```
Try this yourself: print all elements of x that are non-negative. The which() command will identify the elements corresponding to TRUE. For example, try the following and compare with your vector x.
```
which(x < 0)
```
Indicators can be used to change individual elements of the vector x. For example, to change the fifth element of x to 0,
```
x[5] <- 0
```
Try this yourself. Change the last value of your x vector to a different number. Change the 2nd, 6th, and 10th values of x all to 3 new numbers with a single command.
Missing values in R are indicated by NA. Try changing the 2nd value of x to a missing value. Print x to see the result.
R can be used as a calculator for arrays of numbers too. To see this, create a second numerical vector y of the same length as x. Now try out a few ordinary mathematical operations on the whole vectors of numbers,
```
z <- x * y
print(z)
```
```
z <- y - 2 * x
print(z)
```
Examine the results to see how R behaves. It executes the operation on the first elements of x and y, then on the corresponding second elements, and so on. Each result is stored in the corresponding element of z. Logical operations are the same,
```
z <- x >= y              # greater than or equal to
print(z)
```
```
z <- x[abs(x) < abs(y)]  # absolute values
print(z)
```
What does R do if the two vectors are not the same length? The answer is that the elements in the shorter vector are “recycled”, starting from the beginning. This is basically what R does when you multiply a vector by a single number. The single number is recycled, and so is applied to each of the elements of x in turn.
```
z <- 2 * x
print(z)
```
Make a data frame called mydata from the two vectors, x and y. Consult the R tips Calculate and Data frame pages for help on how to do this. Print mydata on the screen to view the result. If all looks good, delete the vectors x and y from the R environment. They are now stored only in the data frame. Type names(mydata) to see the names of the stored variables.
Vector functions applied to data frames may give unexpected results – data frames are not vectors. For example, length(mydata) won’t give you the same answer as length(x) or length(y). But you can still access each of the original vectors using mydata$x and mydata$y. Try printing one of them. All the usual vector functions and operations can be used on the variables in the data frame. We’ll do more with data frames below.

Vector of data: flying snakes

Photo by J. Socha, https://www.sciencenews.org/article/how-flying-snakes-stay-aloft

Analyze data in a vector

Paradise tree snakes (Chrysopelea paradisi) leap into the air from trees, and by generating lift they glide downward and away rather than plummet. An airborn snake flattens its body everywhere except for the heart region. It forms a horizontal “S” shape and undulates from side to side. By orienting the head and anterior part of the body, a snake can change direction, reach a preferred landing site, and even chase aerial prey. To better understand lift and stability of glides, Socha (2002, Nature 418: 603-604) videotaped eight snakes leaping from a 10-m tower (video here). One measurement taken was the rate of side-to-side undulation. Undulation rates of the eight snakes, measured in Hertz (cycles per second), were as follows:

0.9 1.4 1.2 1.2 1.3 2.0 1.4 1.6

We’ll store these data in a vector (variable) and try out some useful vector functions in R (review the common vector functions section on the Calculate page of the R tips web pages).

Put the glide undulation data above into a named vector. Afterward, check the number of observations stored in the vector.
Apply the hist() command to the vector and observe the result (a histogram). Examine the histogram and you will see that it counts two observations between 1.0 and 1.2. Are there any measurements in the data between these two numbers? What is going on? The default in R is to use right-closed, left-open intervals. To change to left-closed right-open, modify an option in the hist() command as follows,
```
hist(myvector, right = FALSE)
```
We’ll be doing more on graphs next week.
Hertz units measure undulations in cycles per second. The standard international unit of angular velocity, however, is radians per second. 1 Hertz is 2π radians per second. Transform the snake data so that it is in units of radians per second (note: pi is a programmed constant in R).
Using the transformed data hereafter, calculate the sample mean undulation rate WITHOUT using the function mean() (i.e., use other functions instead)*.
Ok, try the function mean() and compare your answer.
Calculate the sample standard deviation in undulation rate WITHOUT using the functions sd() or var(). Then calculate using sd() to compare your answer**.
Sort the observations using the sort() command.
Calculate the median undulation rate. When there is an even number of observations (as in the present case), the population median is most simply estimated as the average of the two middle measurements in the sample.
Calculate the standard error of the mean undulation rate. Remember, the standard error of the mean is calculated as the standard deviation of the data divided by the square root of sample size.

* 8.63938.
** 2.035985

Answers

All lines below beginning with double hashes are R output

# Input the data:
x <- c(0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6)

# Histogram
hist(x, right = FALSE, col = "firebrick", main = "")

# Convert hertz to radians/sec
z <- x * 2*pi
z

## [1]  5.654867  8.796459  7.539822  7.539822  8.168141 12.566371  8.796459 10.053096

# Mean without using mean() function
sum(z)/length(z)

## [1] 8.63938

# Using mean()
mean(z)

## [1] 8.63938

# Standard deviation without sd() or var()
sqrt( sum( (z - mean(z))^2 )/( length(z) - 1) )

## [1] 2.035985

# Compare with output of sd()
sd(z)

## [1] 2.035985

# Median
median(z)

## [1] 8.4823

# SE
sd(z)/sqrt(length(z))

## [1] 0.7198293

Missing data

Missing data in R are indicated by NA. Many functions for vectors, such as sum() and mean(), will return a value of NA if the data vector you used contained at least one missing value. Overcoming this usually involves modifying a function option to instruct R to ignore the offending points before doing the calculation. See the Calculate page of the R tips pages for help on how to do this.

Use the c() function to add a single new measurement to the snake vector created in the previous section (i.e., increase its length by one) but have the new observation be missing, as though the undulation rate measurement on a 9th snake was lost.
Check the length of this revised vector, according to R. The length should be 9, even though one of the elements of the vector is NA.
What is the sample mean of the measurements in the new vector, according to R? Use a method that does not involve you directly removing the offending point from the vector.
Recalculate the standard error of the mean, again leaving in the missing value in the vector. Did you get the same answer as in the previous section? If not, what do you think went wrong? Take great care when there are missing values.
Calculate the standard error properly, using the vector of 9 elements (i.e., the vector containing one NA). Use a method that will work on any vector containing missing values.

Answers

All lines below beginning with double hashes are R output

# Add a missing observation
z <- c(z, NA)

# mean() with NA in vector, use na.rm=TRUE argument
mean(z, na.rm = TRUE)

## [1] 8.63938

# Oops, the following is not correct! length() includes the missing case
n <- length(z)
sd(z, na.rm = TRUE)/sqrt(n)

## [1] 0.6786616

# Fix by excluding any missing value
  # use: n <- length(z[!is.na(z)])
  # or: n <- length(na.omit(z))
n <- length(na.omit(z))
sd(z, na.rm = TRUE)/sqrt(n)

## [1] 0.7198293

Anolis lizards in a data frame

Here we will read data on several variables from a comma-delimited (.csv) text file into a data frame, which is the usual way to bring data into R. The data are all the known species of Anolis lizards on Caribbean islands, the named clades to which they belong, and the islands on which they occur. A subset of the species is also classified into “ecomorphs” clusters according to their morphology and perching habitat. Each ecomorph is a phylogenetically heterogeneous group of species having high ecological and morphological similarity. The list was compiled by Jonathan Losos from various sources and are provided in the Afterword of his wonderful book (Losos 2009. Lizards in an evolutionary tree. University of California Press).

Download the file anolis.csv (click file name to initiate download) and save in a convenient place.
Open a new script file to write and submit your commands (or cut and paste to the command window) for the remainder of this section.
Read the data from the file into a data frame (e.g., call it mydata) using the read.csv() command. See the R tips Data tab for further help on this step. For this first attempt, include no additional arguments or options for the read.csv() command other than the file name so we can explore R’s behavior.
Use the str() command to obtain a compact summary of the contents of the data frame. Every variable shown should be a character which is the default for character data. Some functions will convert character data to factors, which is a special kind of character variable whose categories also have a numerical interpretation (useful when a specific order of categories is desired in a table or plot). Another useful command is class(), which will tell you what data type your object is. Try it out on a vector in your data frame, and again on the whole data frame.
Use the head() command to inspect the variable names and the first few rows of the data frame. Every variable in this data set contains character strings. (Why, yes, there’s a tail() command too!)
Let’s focus on the variable Ecomorph, since it has a manageable number of categories. Use the table() function on the Ecomorph vector to see the frequency (number of rows, namely number of species) belonging to each named group.
Notice anything odd about the table? One ecomorph category is blank and has 47 species (rows). Another ecomorph, the Trunk-Crown, seems to be present twice with different numbers of species. Same for Trunk-ground. Do you notice the cause of the problem?
The answer is that one species belongs to the “Trunk-Crown” (trailing space) category rather than to the “Trunk-Crown” (no spaces) category, caused by a typo in the data file. Use the which() command to identify the row with the typo.
Using assignment (<-), fix the single typo. Use the table() function afterward to check the effect of your change. Note that other Ecomorph categories still have the same problam.

Re-read the data from the file into R. This time, use options of the read.csv() function in base R to strip leading and trailing spaces from character string entries and treat empty fields as missing rather than as words with no letters.
How will you ever remember such a list of options in future, when it comes time to reading your own data into R? The answer is: you don’t have to. I couldn’t possibly remember it myself. If you keep a script file when you analyze the data you can always go back and consult it, and copy it the next time you need it. Type ?read.csv in the command line at any time to get a complete list of all the read options and their effects. Try it now. Those options can be handy when you need them.
Use table() once more to tally up the numbers of species in each Ecomorph category. Is there an improvement from the previous attempts? Which is the commonest Ecomorph and which is the rarest?
What happened to the missing values? Use table() but include the useNA = “ifany” option in your command to include a count of NA values in your output table (see the R tips Display page on frequency tables to see examples of the use of this argument). In this data set, NA refers to lizard species that do not belong to a standard ecomorph category, so it is worthwhile to include them in the table. Perhaps they should be given their own named group (“none”), which is less ambiguous than NA.
How many Anolis species inhabit Jamaica exclusively?*
What is the total number of Anolis species on Cuba?** This is not the same as the number occurring exclusively on Cuba — a few species live there and also on other islands. Figure out a way in R to count the number of species that occur on Cuba. Bonus points for the briefest command! [Hint: check the vector functions for character data on the R tips Calculate page.]
What is the tally of species belonging to each ecomorph on the four largest Caribbean islands: Jamaica, Hispaniola, Puerto Rico and Cuba?***
What is the most frequent ecomorph for species that do not occur on the four largest islands?***

* 6
** 63
*** Trunk-crown

Answers

All lines below beginning with double hashes are R output

# Load libraries you might need

library(dplyr)
library(readr)

# Read the data (this is a shortcut)
mydata <- read.csv(url("https://www.zoology.ubc.ca/~bio501/R/data/anolis.csv"))

# See variable types
str(mydata)

## 'data.frame':    154 obs. of  4 variables:
##  $ Species     : chr  "Anolis brunneus" "Anolis fairchildi" "Anolis smaragdinus" "Anolis ernestwilliamsi" ...
##  $ Island      : chr  "Bahamas" "Bahamas" "Bahamas" "Carrot Rock" ...
##  $ Ecomorph    : chr  "Trunk-Crown" "Trunk-Crown" "Trunk-Crown" "Trunk-Ground" ...
##  $ Series.Clade: chr  "carolinensis" "carolinensis" "carolinensis" "cristatellus" ...

# dplyr command instead
glimpse(mydata)

## Rows: 154
## Columns: 4
## $ Species      <chr> "Anolis brunneus", "Anolis fairchildi", "Anolis smaragdinus", "Anolis ernestwilliamsi", "Anolis…
## $ Island       <chr> "Bahamas", "Bahamas", "Bahamas", "Carrot Rock", "Cuba", "Cuba", "Cuba", "Cuba", "Cuba", "Cuba",…
## $ Ecomorph     <chr> "Trunk-Crown", "Trunk-Crown", "Trunk-Crown", "Trunk-Ground", "Crown-Giant", "Crown-Giant", "Cro…
## $ Series.Clade <chr> "carolinensis", "carolinensis", "carolinensis", "cristatellus", "equestris", "equestris", "eque…

class(mydata$Island)

## [1] "character"

class(mydata)

## [1] "data.frame"

head(mydata)

##                  Species      Island     Ecomorph Series.Clade
## 1        Anolis brunneus     Bahamas  Trunk-Crown carolinensis
## 2      Anolis fairchildi     Bahamas  Trunk-Crown carolinensis
## 3     Anolis smaragdinus     Bahamas  Trunk-Crown carolinensis
## 4 Anolis ernestwilliamsi Carrot Rock Trunk-Ground cristatellus
## 5        Anolis baracoae        Cuba  Crown-Giant    equestris
## 6       Anolis equestris        Cuba  Crown-Giant    equestris

tail(mydata)

##               Species                    Island Ecomorph Series.Clade
## 149    Anolis griseus  Southern Lesser Antilles                roquet
## 150     Anolis luciae  Southern Lesser Antilles                roquet
## 151  Anolis richardii  Southern Lesser Antilles                roquet
## 152 Anolis trinitatis  Southern Lesser Antilles                roquet
## 153     Anolis roquet Southern Lesser Antilles                 roquet
## 154    Anolis acutus                  St. Croix          cristatellus

# Frequency table of ecomorphs
table(mydata$Ecomorph)

## 
##                 Crown-Giant  Crown-Giant     Grass-Bush         Trunk   Trunk-Crown  Trunk-Crown   Trunk-Ground 
##            47            11             1            25             7            20             1            30 
## Trunk-Ground           Twig 
##             1            11

# Typo present
which(mydata$Ecomorph == "Trunk-Crown ")

## [1] 118

# Repair
mydata$Ecomorph[mydata$Ecomorph == "Trunk-Crown "] <- "Trunk-Crown"

# Make frequency table again
table(mydata$Ecomorph)

## 
##                 Crown-Giant  Crown-Giant     Grass-Bush         Trunk   Trunk-Crown  Trunk-Ground Trunk-Ground  
##            47            11             1            25             7            21            30             1 
##          Twig 
##            11

# Re-read data with options
mydata <- read.csv(url("https://www.zoology.ubc.ca/~bio501/R/data/anolis.csv"), 
            strip.white = TRUE, na.strings = c("NA",""), stringsAsFactors = FALSE)

# or use read_csv() function from readr package
mydata <- read_csv(url("https://www.zoology.ubc.ca/~bio501/R/data/anolis.csv"))

# Variable types
str(mydata) # or use dplyr command "glimpse(mydata)"

## spc_tbl_ [154 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Species     : chr [1:154] "Anolis brunneus" "Anolis fairchildi" "Anolis smaragdinus" "Anolis ernestwilliamsi" ...
##  $ Island      : chr [1:154] "Bahamas" "Bahamas" "Bahamas" "Carrot Rock" ...
##  $ Ecomorph    : chr [1:154] "Trunk-Crown" "Trunk-Crown" "Trunk-Crown" "Trunk-Ground" ...
##  $ Series.Clade: chr [1:154] "carolinensis" "carolinensis" "carolinensis" "cristatellus" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Species = col_character(),
##   ..   Island = col_character(),
##   ..   Ecomorph = col_character(),
##   ..   Series.Clade = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

# Number of species in each Ecomorph category
table(mydata$Ecomorph)

## 
##  Crown-Giant   Grass-Bush        Trunk  Trunk-Crown Trunk-Ground         Twig 
##           12           25            7           21           31           11

# Same, but include a tally of missing values too
#   (missing here means the species doesn't fall into an ecomorph series)
table(mydata$Ecomorph, useNA = "ifany")

## 
##  Crown-Giant   Grass-Bush        Trunk  Trunk-Crown Trunk-Ground         Twig         <NA> 
##           12           25            7           21           31           11           47

# How many Anolis species inhabit Jamaica exclusively?
  # use: length(mydata$Species[mydata$Island == "Jamaica"])
  # or: nrow(mydata[mydata$Island == "Jamaica",])
  # or: nrow(filter(mydata, Island == "Jamaica"))
length(mydata$Species[mydata$Island == "Jamaica"])

## [1] 6

# Total number of Anolis species on Cuba
# Use "grep" to find the species whose islands of occurrence include Cuba
  # use: length(mydata$Species[grep("Cuba", mydata$Island)])
  # or: length(grep("Cuba", mydata$Island))
  # or: nrow(mydata[grep("Cuba", mydata$Island), ])
  # or: nrow(filter(mydata, grepl("Cuba", Island)))
length(mydata$Species[grep("Cuba", mydata$Island)])

## [1] 63

# What is the tally of species belonging to each ecomorph 
#   on the four largest Caribbean islands:
#   Jamaica, Hispaniola, Puerto Rico and Cuba?
# **Wrong way**: Tally ecomorphs on just the four big islands. This fails to
#   include big-island species that are also found on smaller islands.
z <- mydata[mydata$Island %in% c("Cuba", "Jamaica", "Hispaniola", "Puerto Rico"), ]
table(z$Island, z$Ecomorph)

##              
##               Crown-Giant Grass-Bush Trunk Trunk-Crown Trunk-Ground Twig
##   Cuba                  6         15     1           7           13    4
##   Hispaniola            3          7     5           4            9    4
##   Jamaica               1          0     0           2            1    1
##   Puerto Rico           1          3     0           2            3    1

# One solution is to create a new variable indicating species occurrence on big islands
mydata$bigIsland <- rep(NA, nrow(mydata)) # create the new variable
mydata$bigIsland[grep("Cuba", mydata$Island)] <- "Cuba"
mydata$bigIsland[grep("Jamaica", mydata$Island)] <- "Jamaica"
mydata$bigIsland[grep("Hispaniola", mydata$Island)] <- "Hispaniola"
mydata$bigIsland[grep("Puerto Rico", mydata$Island)] <- "Puerto Rico"

# Finally:
table(mydata$bigIsland, mydata$Ecomorph)

##              
##               Crown-Giant Grass-Bush Trunk Trunk-Crown Trunk-Ground Twig
##   Cuba                  6         15     1           7           14    5
##   Hispaniola            3          7     6           4            9    4
##   Jamaica               1          0     0           2            1    1
##   Puerto Rico           2          3     0           2            3    1

# What is the most frequent ecomorph of species that do not occur 
#   on the four largest islands?
# One way is to make a new data frame that excludes large-island species
mydata2 <- mydata[!(mydata$bigIsland %in% 
                c("Cuba", "Jamaica", "Hispaniola", "Puerto Rico")), ]
table(mydata2$Ecomorph) # Trunk-Crown is the most frequent

## 
##  Trunk-Crown Trunk-Ground 
##            6            4