This page introduces the basics of working with data sets, which often contain multiple variables of different types.


Read me

Data objects

Data frames are the most convenient data objects in R.

Tibbles are a type of data frame used by tidyverse packages and can be converted to data frames as follows.

# convert tibble type to data frame
mydata <- as.data.frame(mydata)

# do the reverse (as_tibble() requires the tibble or dplyr package)
mydata <- as_tibble(mydata)

You will also run into matrices, arrays, and lists, which are described briefly at the end. A final section covers manipulating image data.


Useful R packages

You won’t need anything but base R to read and write data to files. The following packages are for more specialized uses, shown below. You might need to install them first.

library(readr)         # includes tidyverse read_csv() command to read .csv files 
library(readxl)        # read data from excel files
library(reshape2)      # reshape data from wide to long format
library(googlesheets4) # read data from online google spreadsheet
library(data.table)    # for really fast read and write
library(vroom)         # for really fast read
library(imager)        # for reading, manipulating, writing images

Make data file

Enter data with a spreadsheet

Enter your data using a spreadsheet program and save it in text format to a comma-delimited file (e.g., “myfile.csv”). A text file is never obsolete but data saved in a proprietary format may not be readable 30 years from now.

Read your data into R using the read.csv() command (more details are below).

mydata <- read.csv("myfile.csv")


What to enter in columns

These tips will save you frustration when it comes time to read into R.
  • Use brief, informative variable names in plain text. Keep more detailed explanations of variables in a separate text file.
  • Avoid spaces in variable names – use a dot or underscore instead (e.g., size.mm or size_mm).
  • Leave missing data cells blank.
  • Avoid non-numeric characters in columns of numeric data. R will assume that the entire column is non-numeric. For example, avoid using a question mark “12.67?” to indicate a number you are not sure about. Put the question mark and other comments into a separate column just for comments.
  • Use the international date format (YYYY-MM-DD).
  • Keep commas out, because they are column delimiters in your .csv file.
  • R is case-sensitive: “Hi” and “hi” are distinct entries.
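
To illustrate the tip about non-numeric characters, here is a minimal sketch. It uses a made-up three-row file supplied inline through the text argument of read.csv(); the column names id and size_mm are hypothetical.

```r
# Hypothetical file contents; the text argument stands in for a file on disk
csv_text <- "id,size_mm
1,12.10
2,12.67?
3,11.90"

bad <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(bad$size_mm)   # "character": the "?" spoils the whole column

# Converting afterwards turns the offending entry into NA (with a warning,
# suppressed here)
size_num <- suppressWarnings(as.numeric(bad$size_mm))
is.na(size_num)      # FALSE TRUE FALSE
```

Keeping the question mark in a separate comments column avoids the problem entirely.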


Long vs wide layouts

A “long” layout is recommended, instead of a “wide” layout, when using linear models to analyze data. Use different columns for variables and different rows for sample units.

Here’s an example of a “wide” layout with data on the running speed of individual marked lizards in separate years. The same response variable, running speed, is included in two columns, one for each year. (Data are from Huey, R. B. and A. E. Dunham. 1987. Evolution 41: 1116-1120.)

lizard1 <- read.csv("https://www.zoology.ubc.ca/~schluter/R/csv/LizardSpeed.csv")
head(lizard1)
##   lizardID speed1984 speed1985
## 1        1      1.43      1.37
## 2        2      1.56      1.30
## 3        3      1.64      1.36
## 4        4      2.13      1.54
## 5        5      1.96      1.82
## 6        6      1.89      1.79

The wide layout can be converted to a long layout using the melt() command in the reshape2 package.

library(reshape2) # load the package
lizard2 <- melt(lizard1, id.vars = c("lizardID"), variable.name = "Year", 
                    value.name = "Speed", factorsAsStrings = FALSE)
head(lizard2)
##   lizardID      Year Speed
## 1        1 speed1984  1.43
## 2        2 speed1984  1.56
## 3        3 speed1984  1.64
## 4        4 speed1984  2.13
## 5        5 speed1984  1.96
## 6        6 speed1984  1.89
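
If you would rather not load a package, the same reshaping can be sketched in base R. This is only an illustration, not the method used above: it types in the first three rows of the lizard data by hand and stores Year as a number rather than as the old column name.

```r
# First three rows of the lizard data shown above, typed in by hand
lizard1 <- data.frame(lizardID  = 1:3,
                      speed1984 = c(1.43, 1.56, 1.64),
                      speed1985 = c(1.37, 1.30, 1.36))

# Stack the two speed columns and record the year as a number
lizard2 <- data.frame(
  lizardID = rep(lizard1$lizardID, times = 2),
  Year     = rep(c(1984, 1985), each = nrow(lizard1)),
  Speed    = c(lizard1$speed1984, lizard1$speed1985)
)
lizard2   # 6 rows: one per lizard per year
```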



Read/write data

Read from csv file

The following command in base R reads a data file named “myfile.csv” into a data frame mydata.

mydata <- read.csv("myfile.csv")
mydata <- read.csv(file.choose()) # navigate to file location

R treats “word” and “ word” (with a leading space) differently, which is not usually desired. To remove leading and trailing spaces,

# remove leading and trailing spaces, treat blank cells as missing too
mydata <- read.csv("myfile.csv", strip.white = TRUE, 
            na.strings = c("NA",""))
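
Here is a small sketch of what these two arguments change. The file contents are invented and passed inline with the text argument, so nothing needs to exist on disk.

```r
# Hypothetical file: the second data row has a leading space and a blank cell
csv_text <- "site,count
A,3
 A,
B,5"

raw   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
clean <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                  strip.white = TRUE, na.strings = c("NA", ""))

unique(raw$site)    # "A" " A" "B": the leading space makes a third category
unique(clean$site)  # "A" "B"
clean$count         # 3 NA 5: the blank cell is read as missing
```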

Alternatively, you can use the read_csv() command from the readr package, which automatically takes care of leading and trailing spaces and knows what to do with blank cells.

library(readr)
mydata <- read_csv("myfile.csv")


Read online Google sheet

If you have allowed anyone with the link to read a Google sheet of your own making, then you can read it into R. To demonstrate, I have created a small Google sheet to test. You’ll need the googlesheets4 package.

library(googlesheets4)

# Suspend authorization
gs4_deauth()

# My test sheet
testSheet <- gs4_get("https://docs.google.com/spreadsheets/d/1yRSa4WMnVUc8Q46Td__gOcFar1aigdfxId_YaZZ27yA/edit?usp=sharing")

# Read the data
mydata <- as.data.frame(read_sheet(testSheet, sheet = 1))


Read Excel spreadsheet

Yes, this is possible. You’ll need the readxl package.

library(readxl)

# Read the data
mydata <- as.data.frame(read_excel("myExcelFile.xlsx", sheet = 1))


Write data to csv file

To write the data frame mydata to a comma-delimited text file,

write.csv(mydata, file = "myfile.csv")
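
By default write.csv() also writes the row names as an unnamed first column. The sketch below, using a made-up data frame and temporary files, shows the effect on reading the file back and how row.names = FALSE avoids it.

```r
# Made-up data frame for illustration
mydata <- data.frame(id = 1:3, size_mm = c(10.5, 12.1, 9.8))

# Default: row names are written, and come back as a column named "X"
f1 <- tempfile(fileext = ".csv")
write.csv(mydata, file = f1)
names(read.csv(f1))   # "X" "id" "size_mm"

# Without row names the file round-trips with the original columns
f2 <- tempfile(fileext = ".csv")
write.csv(mydata, file = f2, row.names = FALSE)
names(read.csv(f2))   # "id" "size_mm"
```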


Really fast read and write

For big data sets that seem to take forever to read and write, try these commands from the data.table package.

Use fwrite() to write to a .csv file really quickly.

library(data.table)

fwrite(mydata, file = "myfile.csv", sep = ",", col.names = TRUE, row.names = TRUE)

Use fread() to read a big .csv file. If your file has row names, these will be placed into the first variable column and must be restored with additional commands.

mydata <- data.frame(fread(file = "myfile.csv"))

# If you want to recover row names in first column:
rownames(mydata) <- mydata[,1]
mydata <- mydata[,-1]

Another method to read really large .csv files is to use vroom().

library(vroom)

mydata <- vroom("myfile.csv")



Variable types

As it reads your data, read.csv() will automatically classify your variables into the following types unless you specify otherwise using additional arguments.
  • Columns with only numbers are made into numeric or integer variables.
  • Columns with even one non-numeric character are classified as character variables.
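
One of the "additional arguments" is colClasses. As a sketch with invented data, here is how to stop read.csv() from turning an ID code such as 001 into the integer 1.

```r
# Hypothetical file contents supplied inline
csv_text <- "id,count,size_mm
001,4,10.5
002,7,12.1"

# Automatic classification
auto <- read.csv(text = csv_text)
sapply(auto, class)   # id "integer", count "integer", size_mm "numeric"
auto$id               # 1 2: the leading zeros are lost

# Force the id column to be read as character; other columns are still guessed
forced <- read.csv(text = csv_text, colClasses = c(id = "character"))
forced$id             # "001" "002"
```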

To check on how R has classified all your variables, enter

str(mydata)     # base R
glimpse(mydata) # dplyr command

To check on R’s classification of just one variable, x,

class(mydata$x)        # integer, character, factor, numeric, etc
is.character(mydata$x) # result: TRUE or FALSE
is.integer(mydata$x)   # result: TRUE or FALSE
is.factor(mydata$x)    # result: TRUE or FALSE


Factors are special

Include the read.csv() argument stringsAsFactors = TRUE if you want character data columns turned into factor variables instead of character text.

A factor is a categorical variable whose categories represent levels. These levels have names, but they additionally have a numeric interpretation. If a variable A has 3 categories “a”, “b”, and “c”, R will order the levels alphabetically, by default, and give them the corresponding numerical interpretations 1, 2, and 3. This will determine the order that the categories appear in graphs and tables.

You can always change the order of the levels for a factor variable. For example, if you want “c” to be first (e.g., because it refers to the control group), set the order as follows:

A <- factor(A, levels = c("c", "a", "b"))
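
To see the numeric interpretation of the levels directly, here is a short sketch with a made-up factor.

```r
# A made-up factor with three categories
A <- factor(c("b", "c", "a", "c"))
levels(A)       # "a" "b" "c" (alphabetical by default)
as.integer(A)   # 2 3 1 3

# Put "c" first, e.g. because it is the control group
A <- factor(A, levels = c("c", "a", "b"))
levels(A)       # "c" "a" "b"
as.integer(A)   # 3 1 2 1
```

The data values are unchanged; only the order of the levels, and hence their integer codes, differs.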


Change variable type

You can convert variables to a different type.

mydata$A <- as.factor(mydata$A)                # character to factor 
mydata$A <- as.character(mydata$A)             # factor to character
mydata$A <- as.numeric(as.character(mydata$A)) # factor to numeric

Always check the results of a conversion to make sure R did what you wanted. Check especially how missing values were converted.
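
The sketch below shows why the factor-to-numeric conversion goes through as.character(): applied directly to a factor, as.numeric() returns the level codes rather than the numbers shown. The values are made up.

```r
# Numbers that were accidentally stored as a factor
f <- factor(c("10", "20", "5"))

as.numeric(f)                 # 1 2 3: the level codes, not the values
as.numeric(as.character(f))   # 10 20 5: the intended numbers
```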



Manipulate data

View the data

The following commands are useful for viewing aspects of a data frame.

mydata             # print whole data frame
print(mydata, n=5) # print the first 5 rows (tibbles only; use head(mydata, 5) for a data frame)
head(mydata)       # print the first few rows
tail(mydata)       # print the last few rows

names(mydata)      # see the variable names
rownames(mydata)   # view row names (numbers, if you haven't assigned names)


Useful data frame functions

These functions are applied to the whole data frame.

str(mydata)                     # summary of variables in frame
is.data.frame(mydata)           # TRUE or FALSE
ncol(mydata)                    # number of columns in data
nrow(mydata)                    # number of rows

names(mydata)                   # variable names
names(mydata)[1] <- c("quad")   # rename 1st variable to "quad"
rownames(mydata)                # row names


Access variables

The columns of the data frame are vectors representing variables. They can be accessed several ways.

mydata$site          # the variable named "site" in mydata
mydata[ , 2]         # the second variable (column) of the data frame
mydata[5, 2]         # the 5th element (row) of the second variable

select(mydata, site) # dplyr: one-column data frame; pull(mydata, site) gives the vector like mydata$site


Transform a variable

For example, here is how to log transform a variable named size.mm and save the result as a new variable named logsize in the data frame mydata. (The function log yields the natural log, whereas log10 yields log base 10.)

mydata$logsize <- log(mydata$size.mm)            # natural log

mydata <- mutate(mydata, logsize = log(size.mm)) # equivalent command in dplyr package
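
A quick check with made-up sizes shows the difference between the two log functions.

```r
# Made-up sizes chosen so that log10 gives round numbers
mydata <- data.frame(size.mm = c(10, 100, 1000))

mydata$logsize   <- log(mydata$size.mm)     # natural log
mydata$log10size <- log10(mydata$size.mm)   # log base 10: 1 2 3

# The two differ only by the constant factor log(10)
all.equal(mydata$logsize, mydata$log10size * log(10))   # TRUE
```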


Delete variable

For example, to delete the variable site from mydata, use

mydata$site <- NULL             # NULL must be upper case

mydata <- select(mydata, -site) # dplyr method


Extract subset

There are several ways.

One is to use indicators inside square brackets using the following format: mydata[rows, columns].

newdata <- mydata[ , c(2,3)]   # all rows, columns 2 and 3 only
newdata <- mydata[ , -1]       # all rows, leave out first column
newdata <- mydata[1:3, 1:2]    # first three rows, first two columns

Logical statements and variable names within the square brackets also work.

The following commands extract three variables for females whose size is less than 25 mm. Note the double “==” sign to represent “equals” in the logical statement.

newdata <- mydata[mydata$sex == "f" & mydata$size.mm < 25, 
                  c("site","id","weight")]
newdata <- subset(mydata, sex == "f" & size.mm < 25, 
                  select = c(site,id,weight))
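
To confirm that the two commands pick out the same rows, here is a sketch with a small made-up data frame (the values are invented for illustration).

```r
# Made-up data: four individuals
mydata <- data.frame(site    = c("A", "A", "B", "B"),
                     id      = 1:4,
                     sex     = c("f", "m", "f", "f"),
                     size.mm = c(20, 22, 30, 24),
                     weight  = c(5.1, 6.0, 8.2, 5.9))

new1 <- mydata[mydata$sex == "f" & mydata$size.mm < 25,
               c("site", "id", "weight")]
new2 <- subset(mydata, sex == "f" & size.mm < 25,
               select = c(site, id, weight))

new1$id                 # 1 4: the two small females
all.equal(new1, new2)   # TRUE
```

The two approaches differ when sex or size.mm has missing values: the square-bracket version returns a row of NAs for each such case, whereas subset() drops them.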

You can also use dplyr’s filter and select commands. Use select to extract variables (columns), and use filter to select rows, as in the following examples.

# extract rows
temp <- filter(mydata, sex == "f")

# extract columns
newdata <- select(temp, site, id, weight) 


Sort and order rows

To re-order the rows of a data frame mydata to correspond to the sorted order of one of its variables, say x, use

mydata.x <- mydata[order(mydata$x), ]  # base R

mydata.x <- arrange(mydata, x)         # dplyr method
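
Two details of order() are worth knowing: missing values are placed last by default, and decreasing = TRUE reverses the sort. A sketch with made-up values:

```r
# Made-up data with one missing value
mydata <- data.frame(id = 1:4, x = c(3.2, 1.5, NA, 2.8))

asc  <- mydata[order(mydata$x), ]                     # smallest x first
desc <- mydata[order(mydata$x, decreasing = TRUE), ]  # largest x first

asc$id    # 2 4 1 3
desc$id   # 1 4 2 3 (the NA row is still last)
```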



Combine two data frames

Measurements stored in two data frames might relate to one another. For example, one data frame might contain measurements of individuals of a bird species (e.g., weight, age, sex) caught at multiple sites. A second data frame might contain physical measurements made at those sites (e.g., elevation, rainfall). If the site names in both data frames correspond, then it is possible to bring one or all the variables from the second data frame to the first.

For example, to bring the site variable “elevation” from the sites data frame to the birds data frame,

birds$elevation <- sites$elevation[match(birds$siteno, sites$siteno)]

To bring all the variables from the sites data set to the bird data set, corresponding to the same sites in both data frames, use the dplyr command

birds2 <- left_join(birds, sites, by="siteno")

Always check the results to make sure R did what you wanted.
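
One useful check is to count the rows that found no match. In this sketch, with two made-up data frames, one bird was caught at a site that is missing from the sites data.

```r
# Made-up data: bird 4 was caught at site 3, which is not in the sites table
birds <- data.frame(id = 1:4, siteno = c(2, 1, 2, 3))
sites <- data.frame(siteno = c(1, 2), elevation = c(150, 900))

birds$elevation <- sites$elevation[match(birds$siteno, sites$siteno)]

birds$elevation               # 900 150 900 NA
sum(is.na(birds$elevation))   # 1 bird has no matching site
```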



Matrix objects

Some functions will give a matrix as output, which is not as convenient as a data frame. For example, all columns of a matrix must be of the same data type. Briefly, here’s how to manipulate matrices and convert them to data frames.


Convert matrix to data.frame

mydata <- as.data.frame(xmat)
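
If the matrix has no column names, the new data frame gets the default names V1, V2, and so on, which you will probably want to replace. A sketch with a made-up matrix and hypothetical variable names:

```r
# Made-up matrix without column names
xmat <- matrix(1:6, nrow = 2)

mydata <- as.data.frame(xmat)
default_names <- names(mydata)     # "V1" "V2" "V3"

# Replace the defaults with (hypothetical) informative names
names(mydata) <- c("a", "b", "c")
str(mydata)
```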


Useful matrix functions

dim(xmat)     # dimensions (rows & columns) of a matrix
ncol(xmat)    # number of columns in matrix
nrow(xmat)    # number of rows
t(xmat)       # transpose a matrix


Subset a matrix

Use integers in square brackets to access subsets of a matrix. Within square brackets, integers before the comma refer to rows, whereas integers after the comma indicate columns: [rows, columns].

xmat[2,3]       # value in the 2nd row, 3rd column of matrix
xmat[, 2]       # 2nd column of matrix (result is a vector)
xmat[2, ]       # 2nd row of matrix (result is a vector)
xmat[ ,c(2,3)]  # matrix subset containing columns 2 and 3 only
xmat[-1, ]      # matrix subset leaving out first row
xmat[1:3,1:2]   # submatrix containing first 3 rows and first 2 columns only
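
As the comments note, extracting a single row or column returns a vector. If you need the result to remain a matrix, add drop = FALSE. A sketch with a made-up matrix:

```r
# Made-up 3 x 4 matrix, filled column by column
xmat <- matrix(1:12, nrow = 3)

xmat[2, 3]                      # 8
col2_vec <- xmat[, 2]           # 4 5 6, a plain vector
col2_mat <- xmat[, 2, drop = FALSE]

dim(col2_vec)                   # NULL: vectors have no dimensions
dim(col2_mat)                   # 3 1: still a matrix
```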


Reshape a vector to a matrix

Use matrix to reshape a vector into a matrix. For example,

x <- c(1,2,3,4,5,6)
xmat <- matrix(x, nrow = 2)

yields the matrix

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

and

xmat <- matrix(x, nrow = 2, byrow = TRUE)

yields the matrix

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6


Bind vectors to make matrix

Use cbind to bind vectors of equal length as columns, and use rbind to bind them as rows instead. For example,

x <- c(1, 2, 3)
y <- c(4, 5, 6)
xmat <- cbind(x, y)

yields the matrix

     x y
[1,] 1 4
[2,] 2 5
[3,] 3 6
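
For comparison, here is the rbind version with the same two vectors. The vector names become row names instead of column names.

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)

xmat2 <- rbind(x, y)   # each vector becomes a row
dim(xmat2)             # 2 3
rownames(xmat2)        # "x" "y"
```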



Lists

Some R functions will output results in a list. A list is a collection of R objects bundled together in a single object. The component objects can be anything at all: vectors, matrices, data frames, and even other lists. The different objects needn’t have the same length or number of rows and columns.


Make a list

Use the list command to create a list of multiple objects. For example, here two vectors are bundled into a list.

x <- c(1,2,3,4,5,6,7)
y <- c("a","b","c","d","e")
mylist <- list(x,y)                   # simple version
mylist <- list(name1 = x, name2 = y)  # give a name to each list object

Entering mylist in the R command window shows the contents of the list, which is

[[1]]
[1] 1 2 3 4 5 6 7

[[2]]
[1] "a" "b" "c" "d" "e"

if the components were left unnamed, or

$name1
[1] 1 2 3 4 5 6 7

$name2
[1] "a" "b" "c" "d" "e"

if you named the list components.


Add to a list

Use the “$” symbol to name a new object in the list.

z <- c("A","C","G","T")
mylist$name3 <- z


Access list elements

Use the “$” to grab a named object in a list. Or, use an integer between double square brackets,

mylist$name2        # the 2nd list object
mylist[[2]]         # the 2nd list component, here a vector
mylist[[1]][4]      # the 4th element of the 1st list component, here 4
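
Single square brackets also work on a list, but they return a shorter list rather than the component itself. The sketch below rebuilds the example list to show the difference.

```r
mylist <- list(name1 = c(1, 2, 3, 4, 5, 6, 7),
               name2 = c("a", "b", "c", "d", "e"))

class(mylist[1])       # "list": a list holding only the first component
class(mylist[[1]])     # "numeric": the component itself
mylist[["name2"]][2]   # "b": double brackets also accept a name
```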


Useful list functions

names(mylist)              # NULL if components are unnamed
unlist(mylist)             # collapse list to a single vector


Convert list to data frame

This is advised only if all list objects are vectors of equal length.

# construct a list
x <- c(1,2,3,4,5,6,7)
y <- c("a","b","c","d","e","f","g")
mylist <- list(x = x, y = y)

# convert it to a data frame
mydata <- do.call("cbind.data.frame", mylist)
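
An alternative that I find easier to read is as.data.frame() applied directly to the list. The sketch below uses a shorter made-up list and also shows what happens when the vectors differ in length.

```r
# Vectors of equal length convert cleanly
mylist <- list(x = c(1, 2, 3), y = c("a", "b", "c"))
mydata <- as.data.frame(mylist)
dim(mydata)     # 3 2
names(mydata)   # "x" "y"

# Vectors of unequal length produce an error (caught here with try)
res <- try(as.data.frame(list(x = 1:3, y = c("a", "b"))), silent = TRUE)
inherits(res, "try-error")   # TRUE
```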



Image data

Images are stored as arrays of numbers in R. A matrix is a type of array, but arrays can have more than two dimensions. The imager package is useful for reading, manipulating, and writing images.

library(imager)


Read an image

img <- load.image("myimage.jpg")

I’ll use the following example image, stored on the web, to demonstrate the commands. The imported image is stored as a cimg object, which is a type of array.

pip <- load.image("https://www.zoology.ubc.ca/~schluter/R/images/pip.jpg")

The dimensions of the image array can be checked with the dim() command. The first two dimensions of the image are the width (1536 pixels) and height (2040 pixels) of the 2D image. The third dimension is the depth of the image, which is 1 for a 2D image (a cimg object can also store 3D images). The 4th dimension of the array indicates the color channels (3 for RGB color).

dim(pip)
## [1] 1536 2040    1    3

For a 2D image, such as our example, each color channel is a matrix of 1536 x 2040 numbers. Each element of the matrix corresponds to a pixel and contains a number representing the intensity of the color. The matrix is too big to display in this document, but you can see the contents of the 5 x 5 pixels in the top left corner of the matrix of the “G” channel as follows.

pip[1:5, 1:5, 1, 2]
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 0.2705882 0.2784314 0.2862745 0.2901961 0.3058824
## [2,] 0.2823529 0.2941176 0.2980392 0.3019608 0.3176471
## [3,] 0.2941176 0.3019608 0.3058824 0.3137255 0.3254902
## [4,] 0.2901961 0.2941176 0.3019608 0.3098039 0.3254902
## [5,] 0.2745098 0.2784314 0.2862745 0.3019608 0.3254902

Each element of the matrix in a given color channel ranges from 0 (no color) to 1 (full color).
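
The same indexing can be tried without the imager package on a plain 4-dimensional array. This is only a stand-in: a tiny array of random numbers with the same dimension order (width, height, depth, color channel), not a real cimg object.

```r
# Stand-in for an image: width 4, height 3, depth 1, 3 color channels
set.seed(1)
img <- array(runif(4 * 3 * 1 * 3), dim = c(4, 3, 1, 3))

green <- img[, , 1, 2]   # the "G" channel as a 4 x 3 matrix
dim(green)               # 4 3
range(img)               # all values lie between 0 and 1
```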


Plot an image

The plot() command will display the image.

plot(pip, axes = FALSE)


Crop image

Cropping an image involves removing pixels from the edges. The crop.borders() command accomplishes this.

For example, to crop the example image so that it is square rather than rectangular, I need to remove 2040 - 1536 = 504 pixels from the height, or 504/2 = 252 pixels from the top and bottom.

pip_square <- crop.borders(pip, nx = 0, ny = 252)
dim(pip_square)
## [1] 1536 1536    1    3
plot(pip_square, axes = FALSE)


Resize image

Resizing an image involves changing the number of pixels in the image. Here I downsample the cropped example image to 100 x 100 pixels.

pip_100 <- resize(pip_square, size_x = 100, size_y = 100)
dim(pip_100)
## [1] 100 100   1   3
plot(pip_100, axes = FALSE)


Convert to grayscale

The grayscale() command converts an RGB image to black and white. The resulting image now has only one channel indicating grayscale, with 0 indicating black, 1 indicating white, and numbers in between indicating shades of gray.

pip_bw <- grayscale(pip_100)
dim(pip_bw)
## [1] 100 100   1   1
plot(pip_bw, axes = FALSE)


Transpose an image

For some applications it is necessary to transpose the image. To accomplish this, we transpose the matrix of each color channel. In the case of a grayscale image the only color channel (4th dimension of the array) is 1. Assigning the transposed matrix back into the same array works here because the image is square.

pip_transposed <- pip_bw # initialize
pip_transposed[,,1,1] <- t(pip_transposed[,,1,1])
plot(pip_transposed, axes = FALSE)
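
For an image that is not square, the transposed matrix no longer fits back into the original array. One option (my suggestion, using base R's aperm() rather than anything from imager) is to swap the first two dimensions of the whole array. The sketch uses a plain array as a stand-in for a grayscale image; converting the result back to a cimg object is not shown.

```r
# Stand-in for a non-square grayscale image: width 4, height 3
img <- array(1:12, dim = c(4, 3, 1, 1))

# Swap width and height; leave depth and color channel where they are
img_t <- aperm(img, c(2, 1, 3, 4))

dim(img_t)                                 # 3 4 1 1
all(img_t[, , 1, 1] == t(img[, , 1, 1]))   # TRUE
```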


Save an image

The save.image() command saves the image to a file. The file type is determined by the file extension.

save.image(pip_bw, "pip_bw.jpg")
 

© 2009-2024 Dolph Schluter