Data

This page introduces the basics of working with data sets having multiple variables, often of several types. The focus here is on data frames, which are the most convenient data objects in R. Matrices and lists are also useful data objects, and these are introduced briefly at the end.


Manage data

Start with a spreadsheet program

It is best to enter your data in an ordinary text file, such as a .csv (comma-separated text) file, created with the help of a spreadsheet program. A text file is never obsolete, whereas a proprietary format may not be readable 10 years from now. Avoid special characters in your data files.

Long vs wide layouts

Keep data that you want analyzed together in a single worksheet. A “long” layout is recommended, rather than a wide layout. Here is an example of a wide layout of data on the numbers of individuals of 3 species recorded in plots and sites.

Plot    Site      species1   species2   species3
 1        A           0          12         4
 2        A          88           2         0
 3        B          12           4         1   
...

The equivalent long layout will be easier to analyze:

Plot   Site  Species Number
 1      A      1      0
 1      A      2     12
 1      A      3      4
 2      A      1     88
 2      A      2      2
 2      A      3      0
 3      B      1     12
 3      B      2      4
 3      B      3      1
...
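
If the data were already entered in the wide layout, they can be rearranged after reading them into R. The following is only a sketch in base R, assuming the three example plots shown above; the object names wide, long and species.cols are made up for illustration.

```r
# The three example plots from the wide layout above
wide <- data.frame(Plot = c(1, 2, 3),
                   Site = c("A", "A", "B"),
                   species1 = c(0, 88, 12),
                   species2 = c(12, 2, 4),
                   species3 = c(4, 0, 1),
                   stringsAsFactors = FALSE)

species.cols <- c("species1", "species2", "species3")
n.species <- length(species.cols)

# One row per plot-by-species combination
long <- data.frame(
  Plot = rep(wide$Plot, each = n.species),
  Site = rep(wide$Site, each = n.species),
  Species = rep(seq_len(n.species), times = nrow(wide)),
  # t() puts species in rows, so as.vector() reads the counts plot by plot
  Number = as.vector(t(as.matrix(wide[, species.cols]))),
  stringsAsFactors = FALSE)

long
```

The result has 9 rows, in the same order as the long table above.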

Data file tips

The following suggestions for your data file will save you frustration when it comes time to read the data into R.

  • Use brief, informative variable names in plain text. Keep more detailed explanations of variables in a separate text file. Avoid spaces in variable names — use a dot (e.g., “size.mm”).
  • Use the same symbol consistently to indicate missing values. A blank entry is ok.
  • Avoid putting non-numeric characters in columns of numeric data; otherwise R will assume that the entire column is non-numeric. For example, avoid using a question mark “12.67?” to indicate a number you are not sure about. Put the question mark and other comments into a separate column just for comments.
  • For dates, use the international format (YYYY-MM-DD) or use separate columns for year, month and day.
  • Keep commas out of your data set entirely, because they are column delimiters in your .csv file.
  • R is case-sensitive: “Hi” and “hi” are distinct entries.
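
To see why a stray character in a numeric column matters, here is a small sketch. The file contents are invented for illustration, and the text argument of read.csv is used so that the example runs without a file on disk.

```r
# Hypothetical file contents with a question mark in a numeric column
with.mark <- "id,size.mm\n1,10.2\n2,12.67?\n3,11.5"
bad <- read.csv(text = with.mark, stringsAsFactors = FALSE)
class(bad$size.mm)     # "character": the "?" spoils the whole column

# The same data with the doubt recorded in a separate comments column
with.comments <- "id,size.mm,comments\n1,10.2,\n2,12.67,uncertain\n3,11.5,"
good <- read.csv(text = with.comments, stringsAsFactors = FALSE)
class(good$size.mm)    # "numeric"
mean(good$size.mm)     # now works as expected
```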

Read data from text file

Read your .csv file using the read.csv command. The command without any options is as follows. Navigate to your data file in the popup window.

mydata <- read.csv(file.choose())

However, including a few options can save you some frustration.

mydata <- read.csv(file.choose(), stringsAsFactors = FALSE,
                   strip.white = TRUE, na.strings = c("NA", ""))

stringsAsFactors = FALSE tells R to keep character variables as they are rather than convert them to factors, which I find a little harder to work with (I explain what factors are further below).
strip.white = TRUE removes spaces at the start and end of character elements. Spaces are often introduced accidentally during data entry. R treats “word” and “ word” differently, which is not usually desired.
na.strings = c("NA","") tells R that in addition to the usual NA, empty strings in columns of character data are also to be treated as missing. By default, R treats a blank cell in a column of character data as a character string of zero length rather than as missing.
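
The effect of the last two options can be checked on a tiny example. This is a sketch with invented file contents, passed to read.csv through its text argument so that no file is needed.

```r
# Hypothetical file contents: a leading space, a blank cell, a trailing space
csv.text <- "site,species\nA, wren\nB,\nC,wren "

# Without the extra options
raw <- read.csv(text = csv.text, stringsAsFactors = FALSE)
raw$species      # " wren" ""     "wren "

# With the extra options
clean <- read.csv(text = csv.text, stringsAsFactors = FALSE,
                  strip.white = TRUE, na.strings = c("NA", ""))
clean$species    # "wren" NA     "wren"
```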

In place of file.choose() you can specify the full directory and file name,

mydata <- read.csv("c:\\directoryname\\filename.csv") # PC
mydata <- read.csv("/directoryname/filename.csv")     # Mac

Read data from an online file

In place of file.choose() you can specify the full URL and file name,

mydata <- read.csv(url("http://www.zoology.ubc.ca/~bio501/data/mydata.csv"),
                   stringsAsFactors = FALSE)

R automatically assigns variable types

As it reads your data, R will classify your variables into types.

  • Columns with only numbers are made into numeric or integer variables.
  • Columns with non-numeric characters are made into factors in versions of R before 4.0.0, unless you specify that they should remain characters using the stringsAsFactors = FALSE option in the read command. Since R 4.0.0 they remain characters by default.

To explain, a factor is a categorical variable whose categories represent levels. These levels are named, like characters, but the levels additionally have a numerical interpretation. If a variable A has 3 categories “a”, “b”, and “c”, R will order the levels alphabetically, by default, and give them the corresponding numerical interpretations 1, 2, and 3. This will also determine the order that the categories appear in graphs and tables. You can always change the order of the levels. For example, if you want “c” to be first (e.g., because it refers to the control group), set the order as follows:

A <- factor(A, levels = c("c","a","b"))

It is easy to convert character data to factors later, when you need them.
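
A short sketch of how the levels and their numerical interpretation change when the order is set. The values in A are invented for illustration.

```r
A <- c("b", "c", "a", "c")   # made-up character data

A <- factor(A)               # default: levels in alphabetical order
levels(A)                    # "a" "b" "c"
as.integer(A)                # 2 3 1 3

A <- factor(A, levels = c("c", "a", "b"))   # put "c" first
levels(A)                    # "c" "a" "b"
as.integer(A)                # 3 1 2 1
as.character(A)              # the names themselves are unchanged
```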

To check on how R has classified all your variables, enter

str(mydata)            # "str" stands for structure

To check on R's classification of just one variable, x,

class(mydata$x)        # integer, character, factor, numeric, etc
is.factor(mydata$x)    # result: TRUE or FALSE
is.character(mydata$x) # result: TRUE or FALSE
is.integer(mydata$x)   # result: TRUE or FALSE

Convert between variable types

You can always convert variables between types. The following should work well:

mydata$x <- as.factor(mydata$x)     # character to factor 
mydata$x <- as.character(mydata$x)  # factor to character

Warning: To convert factors to numeric or integer, first convert to character. Converting factors directly to numeric or integer data can lead to unwanted outcomes.

Always check the results of a conversion to make sure R did what you wanted.
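
The warning about factors is easiest to understand from an example. In this sketch (the values are made up), converting directly returns the level numbers rather than the numbers you see printed.

```r
f <- factor(c("10", "5", "20"))   # numbers stored as a factor
levels(f)                         # "10" "20" "5" (sorted as text)

as.numeric(f)                     # 1 3 2: the level numbers, not the values
as.numeric(as.character(f))       # 10 5 20: the values, as intended
```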


Write/save a data frame to a text file

To write a data frame to a comma delimited text file, use the following commands. Include the option row.names = FALSE if you don't want the row names or numbers of the data frame included in the first column of the csv file.

write.csv(mydata, file="c:\\directoryname\\filename.csv") # PC
write.csv(mydata, file="/directoryname/filename.csv")     # Mac
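
Here is a sketch of the row.names = FALSE option in use. The data frame is invented, and tempfile() is used only so that the example does not depend on a particular directory.

```r
mydata <- data.frame(site = c("A", "B"), count = c(3, 7),
                     stringsAsFactors = FALSE)

outfile <- tempfile(fileext = ".csv")   # temporary file, for illustration only
write.csv(mydata, file = outfile, row.names = FALSE)

readLines(outfile)[1]                   # header line has just the two variable names

# Read the file back in to check the result
check <- read.csv(outfile, stringsAsFactors = FALSE)
names(check)                            # "site" "count"
```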

Work with data frames

View the data frame

The following commands are useful for viewing aspects of a data frame.

head(mydata)     # prints the first few rows
tail(mydata)     # prints the last few rows
names(mydata)    # see the variable names
str(mydata)      # check the variable types
rownames(mydata) # view row names (numbers, if you haven't assigned names)

Useful data frame functions and operations

str(mydata)                     # summary of variables included
is.data.frame(mydata)           # TRUE or FALSE
ncol(mydata)                    # number of columns in data frame
nrow(mydata)                    # number of rows
names(mydata)                   # variable names
names(mydata)[1] <- c("quad")   # change 1st variable name to quad
rownames(mydata)                # optional row names

Some vector functions can be applied to data frames too, but with different outcomes:

length(mydata)                  # number of variables in data frame
var(mydata)                     # covariances between all variables (all must be numeric)

Access variables in data frame

The columns of the data frame are the vectors (representing variables). Access them by name using the "$" symbol.

mydata$site        # the site vector
mydata$quadrat     # the quadrat vector

Or, access variables using square brackets that include a comma. Integers before the comma refer to rows, and integers after the comma indicate columns: mydata[rows, columns].

mydata[2,3]        # 2nd row, 3rd column contents of data frame
mydata$species[2]  # 2nd element of species vector
mydata[,3][2]      # 2nd element of 3rd column vector

Transform elements or variables in a data frame

Change the value of "size" for individual #15 in a data frame to 20.3.

mydata$size[mydata$individual == 15] <- c(20.3)

Change the value in the 3rd row, 2nd column of mydata to 20.3.

mydata[3,2] <- c(20.3)

Log-transform a variable size.mm and save the result as a new variable logsize in the data frame. log yields the natural log (use log10 for log base 10).

mydata$logsize <- log(mydata$size.mm)

Apply a function to a variable in a data frame

For example, obtain the mean of the variable size.mm in mydata,

mean(mydata$size.mm, na.rm = TRUE)   # na.rm option removes missing values

See the "Loop" menu for information on how to apply a function to multiple variables at once, or to apply a function to variables by group.

Delete a variable from a data frame

For example, delete the variable x from mydata

mydata$x <- NULL     # NULL must be in upper case letters

Combine vectors into a data frame

Make a data frame by combining vectors of the same length using the data.frame command. The vectors need not be of the same type.

quadrat <- c(1:7)
site <- c(1,1,2,3,3,4,5)
species <- c("a","b","b","a","c","b","a")
mydata <- data.frame(quadrat = quadrat, site = site, species = species,
                     stringsAsFactors = FALSE)

The stringsAsFactors = FALSE is optional but recommended to preserve character data instead of converting to a factor.


Sort and order the rows of a data frame

It is often convenient to re-order the rows of a data frame mydata to correspond to the sorted order of one of its variables, for example the variable x. To do this use the order function as follows,

mydata.x <- mydata[order(mydata$x), ]
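
The order function can also sort in decreasing order, or by more than one variable. A small sketch with an invented data frame:

```r
mydata <- data.frame(id = 1:5,
                     site = c("B", "A", "B", "A", "C"),
                     x = c(3.2, 1.5, 2.8, 4.1, 0.9),
                     stringsAsFactors = FALSE)

mydata.x <- mydata[order(mydata$x), ]                       # smallest x first
mydata.x$id                                                 # 5 2 3 1 4

mydata.desc <- mydata[order(mydata$x, decreasing = TRUE), ] # largest x first
mydata.desc$id                                              # 4 1 3 2 5

mydata.site <- mydata[order(mydata$site, mydata$x), ]       # by site, then by x within site
mydata.site$id                                              # 2 4 3 1 5
```

The original row names are carried along with the rows, so they will appear out of sequence after sorting.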

Extract a subset of a data frame

One way is to use row and column indicators inside square brackets:

newdata <- mydata[ ,c(2,3)]   # data frame containing columns 2 and 3 only
newdata <- mydata[ ,-1]       # data frame leaving out first column
newdata <- mydata[1:3,1:2]    # extract first 3 rows and first 2 columns

Logical statements and variable names can also be used.

newdata <- mydata[mydata$sex == "f" & mydata$size.mm > 25, c("site","id","weight")]

Or, use the subset command. The select argument is optional, and is used to select which variables to include (the default is to include all the variables). Notice the double equal sign for the logical statement and the single equal sign for the argument.

newdata <- subset(mydata, sex == "f" & size.mm > 25, select = c(site,id,weight))
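
The two approaches differ when the logical statement cannot be evaluated because of a missing value. The sketch below uses an invented data frame in which one size is missing: square brackets return a row of NA for that case, whereas subset leaves it out.

```r
mydata <- data.frame(id = 1:4,
                     sex = c("f", "m", "f", "f"),
                     size.mm = c(30, 28, NA, 26),
                     stringsAsFactors = FALSE)

by.bracket <- mydata[mydata$sex == "f" & mydata$size.mm > 25, c("id", "size.mm")]
by.subset  <- subset(mydata, sex == "f" & size.mm > 25, select = c(id, size.mm))

nrow(by.bracket)   # 3: individuals 1 and 4, plus a row of NA
nrow(by.subset)    # 2: individuals 1 and 4 only
```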

Match data between two data frames

Often, measurements stored in two data frames relate to one another. One data frame might contain measurements of all captured individuals of a bird species (e.g., weight, age, sex), including the study site in which the individual was captured. A second data frame might contain physical measurements made on those study sites (e.g., elevation, rainfall). If the site names or numbers in both data frames correspond, then it is possible to bring measurements from one frame to the other using the "match" command.

For example, to bring the site variable "elevation" from the sites data frame to the birds data frame

birds$elevation <- sites$elevation[match(birds$siteno, sites$siteno)]

Always check the results to make sure R did what you wanted.
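
A small worked version of the match example, with invented numbers. Note that a bird from a site that is missing from the sites data frame (here, site 4) gets NA.

```r
# Hypothetical site measurements
sites <- data.frame(siteno = c(1, 2, 3),
                    elevation = c(120, 450, 900))

# Hypothetical bird captures; site 4 has no entry in sites
birds <- data.frame(id = 1:5,
                    siteno = c(2, 1, 3, 2, 4))

# match gives, for each bird, the row of sites with the same siteno
match(birds$siteno, sites$siteno)    # 2 1 3 2 NA

birds$elevation <- sites$elevation[match(birds$siteno, sites$siteno)]
birds$elevation                      # 450 120 900 450 NA
```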


Attach a data frame

It can be cumbersome to type the name of the data frame over and over to carry out operations and functions on the variables within, such as

plot(mydata$length.mm, mydata$mass.g)

It would be great if R somehow knew that the variables to plot were in the data frame mydata. Two functions can help: attach() and with().

Attach

Do not use attach(). When you attach a data frame, you are only attaching an unchangeable copy. Subsequent changes to variables in the real data frame located in your workspace, and addition of new variables, do not update the attached copy. To update the attached copy after making changes to the data frame you will need to detach and then attach again. This is a waste of time.

With

Use with to force R to look first in your data frame for any variables referred to in the command,

with(mydata, plot(length.mm, mass.g))

Other data structures: matrix

Some functions will give a matrix as output. Matrices have lots of uses, especially for algebra, but for data are not as convenient as data frames (for example, all columns of a matrix must be of the same data type -- all characters or all numbers but not both). Briefly, here's how to manipulate matrices and convert them to data frames.

Reshape a vector to a matrix

Use matrix to reshape a vector into a matrix. For example,

x <- c(1,2,3,4,5,6)
xmat <- matrix(x,nrow=2)

yields the matrix

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

and

xmat <- matrix(x,nrow=2, byrow=TRUE)

yields the matrix

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

Bind vectors to make a matrix

Use cbind to bind vectors of equal length together as the columns of a matrix. rbind will bind them as rows instead. For example,

x <- c(1,2,3)
y <- c(4,5,6)
xmat <- cbind(x,y)

yields the matrix

     x y
[1,] 1 4
[2,] 2 5
[3,] 3 6

Access subsets of a matrix

Use integers in square brackets to access subsets of a matrix. Within the bracket, integers before the comma refer to rows, whereas integers after the comma indicate columns: [rows, columns].

xmat[2,3]       # value in the 2nd row, 3rd column of matrix
xmat[, 2]       # 2nd column of matrix (result is a vector)
xmat[2, ]       # 2nd row of matrix (result is a vector)
xmat[ ,c(2,3)]  # matrix subset containing columns 2 and 3 only
xmat[-1, ]      # matrix subset leaving out first row
xmat[1:3,1:2]   # submatrix containing first 3 rows and first 2 columns only
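
The commands above assume a matrix with at least 3 rows and 3 columns. Here is a sketch using a made-up 3 x 3 matrix so that the results can be checked by eye.

```r
xmat <- matrix(1:9, nrow = 3)   # filled by column: the first column is 1, 2, 3

xmat[2, 3]                # 8
xmat[, 2]                 # 4 5 6 (a vector)
xmat[2, ]                 # 2 5 8 (a vector)
dim(xmat[, c(2, 3)])      # 3 2
dim(xmat[-1, ])           # 2 3

# Add drop = FALSE to keep a single column as a matrix instead of a vector
dim(xmat[, 2, drop = FALSE])   # 3 1
```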

Useful matrix functions

dim(xmat)     # dimensions (rows & columns) of a matrix
ncol(xmat)    # number of columns in matrix
nrow(xmat)    # number of rows
t(xmat)       # transpose a matrix

Convert a matrix to a data.frame

mydata <- as.data.frame(xmat, stringsAsFactors = FALSE)

The stringsAsFactors=FALSE is optional but recommended to preserve character data. Otherwise character variables are converted to factors.


Other data structures: list

Some R functions will output results in a list. A list is a collection of R objects bundled together in a single object. The component objects can be anything at all: vectors, matrices, data frames, and even other lists. The different objects needn't have the same length or number of rows and columns.

Create list

Use the list command to create a list of multiple objects. For example, here two vectors are bundled into a list

x <- c(1,2,3,4,5,6,7)
y <- c("a","b","c","d","e")
mylist <- list(x,y)                   # simple version
mylist <- list(name1 = x, name2 = y)  # names each list object

Entering mylist in the R command window shows the contents of the list, which is

[[1]]
[1] 1 2 3 4 5 6 7

[[2]]
[1] "a" "b" "c" "d" "e"

if the components were left unnamed, or

$name1
[1] 1 2 3 4 5 6 7

$name2
[1] "a" "b" "c" "d" "e"

if you named the list components.

Add an object to an existing list

Use the "$" symbol to name a new object in the list.

z <- c("A","C","G","T")
mylist$name3 <- z

Access list components

Use the "$" to grab a named object in a list. Or, use an integer between double square brackets,

mylist$name2        # the 2nd list object
mylist[[2]]         # the 2nd list component, here a vector
mylist[[1]][4]      # the 4th element of the 1st list component, here 4

Useful list functions

names(mylist)              # NULL if components are unnamed
unlist(mylist)             # collapse list to a single vector
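
A brief sketch of what unlist does when the list components are of different types (the list here is invented). Everything is converted to a single type, in this case character, and the element names are built from the component names.

```r
mylist <- list(name1 = c(1, 2, 3), name2 = c("a", "b"))

v <- unlist(mylist)
v               # a character vector: "1" "2" "3" "a" "b"
names(v)        # "name11" "name12" "name13" "name21" "name22"

unlist(mylist, use.names = FALSE)   # the same vector without names
```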

Convert a list of vectors to a data frame

This is advised only if all list objects are vectors of equal length.

x <- c(1,2,3,4,5,6,7)
y <- c("a","b","c","d","e","f","g")
mylist <- list(x = x, y = y)
mydata <- do.call("cbind.data.frame", list(mylist, stringsAsFactors=FALSE))

Notice how the option stringsAsFactors=FALSE for the command cbind.data.frame is contained inside the list() argument of do.call.
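
As an alternative that may be easier to read, as.data.frame can be applied to the list directly. This is offered as a sketch of a second route, not a replacement for the command above; the name mydata2 is made up for illustration.

```r
x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c("a", "b", "c", "d", "e", "f", "g")
mylist <- list(x = x, y = y)

mydata2 <- as.data.frame(mylist, stringsAsFactors = FALSE)
str(mydata2)    # 7 observations of 2 variables: x is numeric, y is character
```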