Introduction to R


The goal of this first workshop is to get you started working in R, and to introduce the most commonly-used data types, operations, and functions. Help is on the R tips page.

Try out the command line

The command line in the R console is where you interact with R. The command prompt is a red “>” symbol.

Oversized calculator

At its most basic, the command line is a bloated calculator. The basic operations are

+, -, *, /

for add, subtract, multiply, divide. Familiar calculator functions also work on the command line. For example, to take the natural log of 2, enter

  log(2)
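
A few more calculator-style calls, as a minimal sketch. The numbers and variable names are arbitrary choices for illustration. Note that log is the natural log, while log10 and log2 use bases 10 and 2.

```r
# log() is the natural log; exp() is its inverse
natural <- log(2)
back <- exp(natural)        # recovers 2, up to rounding error

# Other bases have their own functions
base10 <- log10(1000)
base2 <- log2(8)

# Square root and absolute value
root <- sqrt(16)
magnitude <- abs(-3.5)
```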

  1. Try the calculator out to get a feel for this basic application and the style of the output. Try log and a few other functions (see "Transform numerical data" in the vector functions section of the R tips start page).
  2. R allows you to store or assign numbers and characters to named variables called vectors, which are a type of "object" in R. For example, to assign the number "3" to a variable "x", use

      x <- 3

    The assign symbol is a < followed by a dash -, with no space between. Try assigning a single number to a named variable. 
  3. R can also assign character data (enter using double quotes) to named variables. Try entering

      z <- "Wake up Neo"  # quotes are necessary

  4. At any time, enter "ls()" to see the names of all the objects in the working R environment. (You can save the environment for later use by entering "save.image()" or by saving when you exit R.)
  5. Assign a single number to the variable "x" and another number to the variable "y". Then watch what happens when you type an operation, such as

      x * y

    Finally, you can also store the result in a third variable.

      z <- x * y

    To print the contents of z, just enter the name on the command line, or enter print(z).
  6. The calculator will also give a TRUE or FALSE response to a logical operation. Try one or more variations of the following examples on the command line to see the outcome.

      2 + 2 == 4    # Note double "==" for logical "is equal to"
      3 <= 2        # less than or equal to
      "A" > "a"     # greater than
      "Hi" != "hi"  # not equal to (i.e., R is case sensitive)

Vectors

Vectors in R are used to represent variables. R can assign sets of numbers or characters to named variables using the "c" command (for concatenate). (R treats single numbers or characters as vectors too, having just one element.)

  x <- c(1, 2, 333, 65, 45, -88)
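
As a rough sketch of how "c" behaves, the lines below combine numbers, character data, and existing vectors. The names words, more, and single are made up for this illustration.

```r
# Numbers combined into one vector
x <- c(1, 2, 333, 65, 45, -88)

# Character data work the same way
words <- c("red", "green", "blue")

# c() can also join existing vectors end to end
more <- c(x, 7, 8)

# A single number is a vector of length one
single <- 3
n_single <- length(single)
n_more <- length(more)
```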

  1. Assign a set of 10 numbers to a variable x. Make sure it includes some positive and some negative numbers. To see the contents afterward, enter x on the command line, or enter print(x). Is it really a vector? Enter is.vector(x) to confirm.
  2. Use integers in square brackets to indicate subsets of vector x.

      x[5]         # fifth element

    Try this out. See also what happens when you enter vectors of indices,

      x[1:3]       # 1:3 is a shortcut for c(1,2,3)

      x[c(2,4,9)]

    Print the 3rd and 6th elements of x with a single command.
  3. Some functions of vectors yield integer results and so can be used as indices too. For example, enter the function

      length(x)

    Since the result is an integer, it is ok to use as follows,

      x[length(x)]

    The beauty of this construction is that it will always give the last element of a vector x no matter how many elements x contains.
  4. Logical operations can also be used to generate indicators. First, enter the following command and compare with the contents of x,

      x > 0

    Now enter

      x[x > 0]

    Try this yourself: print all elements of x that are non-negative.
    The "which" command will identify the elements corresponding to TRUE. For example, try the following and compare with your vector x.

      which(x > 0)

  5. Indicators can be used to change individual elements of the vector x. For example, to change the fifth element of x to 0,

      x[5] <- 0

    Try this yourself. Change the last value of your x vector to a different number.
    Change the 2nd, 6th, and 10th values of x all to 3 new numbers with a single command.
  6. Missing values in R are indicated by NA (without quotes). Try changing the 2nd value of x to a missing value. Print x to see the result.
  7. R can be used as a calculator for arrays of numbers too. To see this, create a second numerical vector y of the same length as x. Now try out a few ordinary mathematical operations on the whole vectors of numbers,

      z <- x * y

      z <- y - 2 * x

    Examine the results to see how R behaves. It executes the operation on the first elements of x and y, then on the corresponding second elements, and so on. Each result is stored in the corresponding element of z. Logical operations are the same,

      z <- x >= y               # greater than or equal to

      z <- x[ abs(x) < abs(y)]  # absolute values


    What does R do if the two vectors are not the same length? The answer is that the elements in the shorter vector are "recycled", starting from the beginning. This is basically what R does when you multiply a vector by a single number. The single number is recycled, and so is applied to each of the elements of x in turn.

      z <- 2 * x

  8. Make a data frame called mydata from the two vectors, x and y. Print mydata on the screen to view the result. If all looks good, delete the vectors x and y from the R environment. They are now stored only in the data frame.  Type names(mydata) to see the names of the stored variables. 
  9. Vector functions applied to data frames may give unexpected results -- data frames are not vectors. For example, length(mydata) won't give you the same answer as length(x) or length(y). But you can still access each of the original vectors using mydata$x and mydata$y. Try printing one of them. All the usual vector functions and operations can be used on the variables in the data frame. We'll do more with data frames below.
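
The indexing, recycling, and data frame steps above can be sketched on a small example. The values in x and y and the helper names (positive, positions, short, recycled) are arbitrary choices for illustration, not part of the exercise.

```r
x <- c(4, -2, 0, 7, -5, 3)      # arbitrary example values
y <- c(1, 2, 3, 4, 5, 6)

# Logical indexing and which()
positive <- x[x > 0]            # values of the positive elements
positions <- which(x > 0)       # their positions in x

# Replace several elements with one command
x[c(1, 3)] <- c(10, 20)

# Recycling: the shorter vector is reused from its beginning
short <- c(1, 100)
recycled <- x * short           # same as x * c(1, 100, 1, 100, 1, 100)

# A data frame holds the vectors as columns
mydata <- data.frame(x = x, y = y)
rm(x, y)                        # the vectors now live only in mydata

n_columns <- length(mydata)     # number of variables, not number of rows
n_rows <- length(mydata$x)      # length of one of the original vectors
```

If the length of the longer vector is not a multiple of the length of the shorter one, R still recycles but issues a warning.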


Analyze a vector of data: flying snakes

Paradise tree snakes (Chrysopelea paradisi) leap into the air from trees, and by generating lift they glide downward and away rather than plummet. An airborne snake flattens its body everywhere except for the heart region. It forms a horizontal “S” shape and undulates from side to side. By orienting the head and anterior part of the body, a snake can change direction, reach a preferred landing site, and even chase aerial prey. To better understand lift and stability of glides, Socha (2002, Nature 418: 603-604) videotaped eight snakes leaping from a 10-m tower. One measurement taken was the rate of side-to-side undulation. Undulation rates of the eight snakes, measured in Hertz (cycles per second), were as follows:
   0.9 1.4 1.2 1.2 1.3 2.0 1.4 1.6

We'll store these data in a vector (variable) and try out some useful vector functions in R (review the common vector functions section on the start page of the R tips web pages).
  1. Read the glide undulation data above into a named vector. Afterward, check the number of observations stored in the vector.
  2. Apply the "hist" command to the vector and observe the result (a histogram). Examine the histogram and you will see that it counts two observations between 1.0 and 1.2. Are there any measurements in the data between these two numbers? What is going on? The default in R is to use right-closed, left-open intervals. To change to left-closed right-open, modify an option in the hist command as follows,

      hist(myvector, right=FALSE)

    We'll be doing more on graphs next week.
  3. Hertz units measure undulations in cycles per second. The standard international unit of angular velocity, however, is radians per second. 1 radian per second is 1/(2π) Hertz. Transform the snake data so that it is in units of radians per second (note: "pi" is a programmed constant in R).
  4. Calculate the sample mean undulation rate WITHOUT using the function "mean" (i.e., use other functions instead).
  5. Ok, try the function "mean" and compare your answer.
  6. Calculate the sample standard deviation in undulation rate WITHOUT using the functions "sd" or "var". Then calculate using "sd" to compare your answer. 
  7. Sort the observations using the "sort" command.
  8. Calculate the median undulation rate. When there is an even number of observations (as in the present case), the population median is most simply estimated as the average of the two middle measurements in the sample.
  9. Calculate the standard error of the mean undulation rate. 
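
One possible way to approach the calculations in this section is sketched below. It is only one of several routes to the same numbers, and the variable names (undulation, my_mean, my_sd, and so on) are chosen here for illustration.

```r
# Undulation rates in Hertz, copied from the text above
undulation <- c(0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6)
n <- length(undulation)

# 1 Hz corresponds to 2*pi radians per second
undulation_rad <- undulation * 2 * pi

# Mean from its definition
my_mean <- sum(undulation) / n

# Standard deviation from its definition (divisor is n - 1)
my_sd <- sqrt(sum((undulation - my_mean)^2) / (n - 1))

# Median of an even number of observations: average the two middle values
sorted <- sort(undulation)
my_median <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2

# Standard error of the mean
my_se <- my_sd / sqrt(n)

# Histogram bin counts without drawing the plot
counts_right <- hist(undulation, plot = FALSE)$counts
counts_left <- hist(undulation, right = FALSE, plot = FALSE)$counts
```

Comparing my_mean, my_sd, and my_median with mean(undulation), sd(undulation), and median(undulation) is a quick check on the hand calculations.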


Missing data

Missing data in R are indicated by NA (without quotes). Many functions for vectors, such as sum and mean, will return a value of NA if the data vector you used contained at least one missing value. Overcoming this usually involves modifying a function option to instruct R to remove the offending points before doing the calculation. See the start page of the R tips pages for help on how to do this.
  1. Use the "c" function to add a single new measurement to the snake vector created in the previous section (i.e., increase its length by one) but have the new observation be missing, as though the undulation rate measurement on a 9th snake was lost.
  2. Check the length of this revised vector, according to R.
  3. What is the sample mean of the measurements in the new vector, according to R? Use a method that does not involve you directly removing the offending point yourself.
  4. Recalculate the standard error of the mean. Did you get the same answer as in the previous section?
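
The effect of a missing value can be sketched as follows, assuming the na.rm option is the function option referred to above. The names with_missing, n_observed, and se are illustrative.

```r
undulation <- c(0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6)
with_missing <- c(undulation, NA)   # ninth measurement lost

n_total <- length(with_missing)     # counts the NA too
naive_mean <- mean(with_missing)    # NA, because one value is missing

# na.rm = TRUE drops missing values before calculating
clean_mean <- mean(with_missing, na.rm = TRUE)

# The sample size for the standard error must exclude the NA
n_observed <- sum(!is.na(with_missing))
se <- sd(with_missing, na.rm = TRUE) / sqrt(n_observed)
```

Using length(with_missing) as the sample size would count the missing snake and give a standard error that is too small.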


Anolis lizards in a data frame

Here we will read data on several variables from a comma-delimited (.csv) text file into a data frame, which is the usual way to bring data into R. The data are all the known species of Anolis lizards on Caribbean islands, the named clades to which they belong, and the islands on which they occur. A subset of the species is also classified into "ecomorphs", named according to perching habitat. Each ecomorph is a phylogenetically heterogeneous group of species having high ecological and morphological similarity. The list was compiled by Jonathan Losos from various sources and is provided in the Afterword of his wonderful book (Losos 2009. Lizards in an evolutionary tree. University of California Press).
  1. Download the file anolis.csv (click file name to initiate download) and save in a convenient place. 
  2. Open an ordinary text file using Tinn-R or the R console editor. Use it to write and submit your commands (or cut and paste to the command window) for the remainder of this section.
  3. Read the data from the file into a data frame (e.g., call it "mydata") using the read.csv command. See the R tips data tab for further help on this step. For this first attempt, include none of the recommended options in the read.csv command, so we can explore R's behavior. In versions of R before 4.0.0, all columns with character data are converted to factors by default. (From R 4.0.0 onward the default is to keep them as characters, so add the option stringsAsFactors = TRUE to reproduce the behavior explored in the following steps.) A factor is like a character variable except that its unique values represent levels that have names but also have a numerical interpretation.
  4. Use the "str" command to obtain a compact summary of the contents of the data frame. Every variable should be listed as a factor (the default). Another way to check the type of a specific variable in the data frame is to use the "is.factor" command, e.g.,

      is.factor(mydata$Island)     # returns TRUE or FALSE

  5. Use the "head" command to inspect the variable names and the first few lines of the data frame. Every variable in this data set contains characters (words).
  6. Use the "fix" command to see the whole data frame. Using the mouse, expand the width of the columns to view the data better. The data are sorted by island name. Notice that each species is listed only once, and that some species are found on more than one island. (Unfortunately, not all its islands are given for A. sagrei -- ignore that issue for now)
  7. Let's focus on the variable "Ecomorph", since it has a manageable number of categories. Since "Ecomorph" is a factor, it will have "levels" representing the different groups. Use the "levels" command to list them. Notice anything unexpected? One of the categories is an empty character string. Some of the groups appear to be listed twice. But look more closely -- are they really duplicates? 
  8. Use the "table" function on the "Ecomorph" vector to see the frequency (number of species) belonging to each named group. See, for example, that one species belongs to the "Trunk-Crown " (trailing space) group rather than to the "Trunk-Crown" (no spaces). Use the "which" command to identify the row with the typo.
  9. Using assignment ( <- ), fix the single typo. Use the "table" function afterward to check the effect of your change. 
  10. Weirdly, the now-eliminated category of "Trunk-Crown " (trailing space) is still present in the "table". This is because, even though no species belong to this category, the category remains a factor level! Confirm this using the "levels" function. This confusing behavior is one reason why I recommend you avoid reading character data in as factors. The presence of factor levels with no members can wreak havoc when fitting models to data. 
  11. Re-read the data from the file into R. This time, use all three of the recommended options on the R tips data tab. These options will instruct R to 1) keep character data as-is; 2) strip leading and trailing spaces from character entries, minimizing typos; and 3) treat empty fields as missing rather than as words with no letters.
  12. (You might wonder how you would ever be able to remember such a list of options in future, when it comes time to reading your own data into R. The answer is: you don't have to. I couldn't possibly remember it myself. If you keep a script file when you analyze the data you can always go back and consult it, and copy it the next time you need it. Also, type ?read.csv in the command line at any time to get a complete list of all the read options and their effects.)
  13. Use "table" once more to tally up the numbers of species in each Ecomorph category. Is there an improvement from the previous attempts? Which is the commonest Ecomorph and which is the rarest?
  14. How many Anolis species inhabit Jamaica exclusively?
  15. What is the total number of Anolis species on Cuba? This is not the same as the number occurring exclusively on Cuba -- a few species live there and also on other islands. Figure out an elegant way in R to count the number of species that occur on Cuba. Bonus points for the briefest command! [Hint: check the vector functions for character data on the R tips start page.]
  16. What is the tally of species belonging to each ecomorph on the four largest Caribbean islands: Jamaica, Hispaniola, Puerto Rico and Cuba?
  17. What is the most frequent ecomorph for species that do not occur on the four largest islands?
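
The factor behavior explored in this section can be reproduced on a tiny made-up data set, sketched below. Everything in it is an assumption for illustration: the rows are invented and are NOT taken from anolis.csv, the Species column name and the way multiple islands are written in one field are guesses, and the three read.csv options shown (stringsAsFactors, strip.white, na.strings) are assumed to be the ones recommended on the R tips data tab.

```r
# Invented toy rows for illustration only; not the real anolis.csv
csv_lines <- c(
  "Species,Island,Ecomorph",
  "sp1,Cuba,Trunk-Crown",
  "sp2,Cuba,Trunk-Crown ",          # trailing-space typo
  "sp3,Jamaica,Twig",
  "sp4,\"Cuba, Jamaica\",",         # two islands, no ecomorph
  "sp5,Hispaniola,Twig"
)

# First attempt: character columns read as factors
raw <- read.csv(text = csv_lines, stringsAsFactors = TRUE)
levels_before <- levels(raw$Ecomorph)    # includes "" and "Trunk-Crown "

# Fix the typo by assignment; the now-empty level is still there
typo_row <- which(raw$Ecomorph == "Trunk-Crown ")
raw$Ecomorph[typo_row] <- "Trunk-Crown"
levels_after <- levels(raw$Ecomorph)

# droplevels() is one way to discard levels that have no members
levels_dropped <- levels(droplevels(raw$Ecomorph))

# Second attempt: keep characters, strip spaces, treat empty fields as NA
clean <- read.csv(text = csv_lines, stringsAsFactors = FALSE,
                  strip.white = TRUE, na.strings = c("NA", ""))
ecomorph_counts <- table(clean$Ecomorph)

# Species occurring on Cuba at all, versus only on Cuba
n_cuba <- sum(grepl("Cuba", clean$Island))
n_cuba_only <- sum(clean$Island == "Cuba")
```

With character data kept as characters, the typo and the empty category disappear at the reading step, so no leftover factor levels need cleaning up afterward.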