# Graphs and tables

This page provides tips and recommendations for making graphs and tables in R. It is updated and revised frequently (click the reload button on your browser to make sure you are seeing the most recent version).

## Frequency tables

These commands generate tables of frequencies or summary statistics. In the examples below, `x` is a single numeric variable. `A` is a categorical variable (factor or character variable) identifying different groups; `B` is a second such variable.

### Frequency table

Frequency table for a categorical variable A,

table(A)
table(A, useNA = "ifany") # includes NA as a category, if present
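As a quick illustration of the difference between the two calls, here is a minimal sketch using a made-up vector (the values of `A` and the names `freq` and `freq_na` are hypothetical, chosen only for this example).

```r
# Hypothetical categorical variable with one missing value
A <- c("a", "b", "a", NA, "c", "a")

freq    <- table(A)                    # NA is dropped from the tally
freq_na <- table(A, useNA = "ifany")   # NA is counted as its own category

print(freq)
print(freq_na)
```

The second table has one extra column, labeled `<NA>`, so its counts sum to the full length of `A`.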

### Contingency table

The following command generates a frequency table for two categorical variables, A and B. The command can be extended to three or more variables.

table(A, B)

To include the row and column sums in the contingency table, use the following commands.

mytable <- table(A, B)
addmargins(mytable, margin = c(1,2), FUN = sum, quiet = TRUE)
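The sketch below applies these commands to a small made-up data set (the values of `A` and `B` are hypothetical) so you can see where the sums end up.

```r
# Hypothetical data: two categorical variables measured on six individuals
A <- c("a", "a", "b", "b", "b", "c")
B <- c("u", "v", "u", "u", "v", "v")

mytable   <- table(A, B)   # rows are categories of A, columns are categories of B
with_sums <- addmargins(mytable, margin = c(1, 2), FUN = sum, quiet = TRUE)

print(with_sums)           # extra "Sum" row and "Sum" column hold the totals
```

The bottom-right cell of the result holds the grand total, which should equal the number of individuals.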

### Flat frequency table

The following command generates a "flat" frequency table for two categorical variables, A and B. In a flat table, A and B are separate columns of a table, and a third column tallies the frequencies of each combination. In the following, A and B are variables in a data frame named `mydata`.

myftable <- ftable(mydata, row.vars = c("A","B")) # flat table
x1 <- data.frame(myftable) # save the results in a new data frame "x1"

## Tables of descriptive statistics

The command "tapply" creates tables of results. The function argument can be any statistical function that can be applied to a single numeric variable (e.g., mean, standard deviation, median, etc).

When creating tables for display purposes, such as in a manuscript, you can round the results to a fixed number of decimal places. For example, to round a table of means "z" to two decimal places, use

round(z,2)

### Table display for one variable

For example, here is how to generate a one-way table of group means. `A` is the factor or character variable identifying the groups.

tapply(x, INDEX = A, FUN = mean, na.rm = TRUE)

The `na.rm` option removes missing values before computing (otherwise the mean returns "NA" if there are missing values). `na.rm` is not a `tapply` option; rather, it is an option for the function `mean`. In general you can pass optional arguments to FUN by including them immediately afterward.

The following shortcut works if the arguments are listed strictly in the order shown.

tapply(x, A, mean, na.rm= TRUE)

The command for a one-way table of group standard deviations is similar.

tapply(x, A, sd, na.rm = TRUE)

### Table display for two variables

The following example produces a two-way table of group medians. `x` is a numeric variable, whereas `A` and `B` are categorical variables (factors or character variables).

tapply(x, INDEX = list(A,B), FUN = median)
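To see what the one-way and two-way tables look like, here is a small sketch with made-up data (all values and the names `means` and `meds` are hypothetical). It also passes `na.rm = TRUE` through to `median`, in the same way as shown earlier for `mean`.

```r
# Hypothetical data: one numeric variable and two grouping variables
x <- c(2, 4, NA, 6, 8, 10)
A <- c("a", "a", "a", "b", "b", "b")
B <- c("u", "v", "u", "v", "u", "v")

# One-way table of group means; na.rm is handed on to mean()
means <- tapply(x, A, mean, na.rm = TRUE)

# Two-way table of group medians; rows are A, columns are B
meds <- tapply(x, INDEX = list(A, B), FUN = median, na.rm = TRUE)

print(round(means, 2))
print(meds)
```

Without `na.rm = TRUE`, any cell that contains a missing value would be reported as `NA`.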

## Drawing graphs in R

It is important to examine data visually before analyzing the numbers. Your eye is a great pattern detector and can easily pick out outliers, errors, and associations in data.

There are many different types of graphs. Here we've grouped them according to whether their purpose is to display frequencies of a single variable, or association between variables.

### Generic command options

Many (but not all -- try them) of the basic plotting commands will accept the same options to control axis limits, labeling, print a title, change the plotting symbol, change the size of the plotting symbols and text, and change the line types. Here are some of the most frequently modified options. Remember, these are not commands, but options (also called "arguments"), so use them inside the parentheses of a plotting command. If you are not sure whether a given option applies in your case, try it -- the worst that could happen is you get an error message, or R ignores it.

main = "Eureka" # add a title above the graph
pch = 16 # change plot symbol to a filled circle
col = "red" # change the item color
xlim = c(-10,10) # change limits of the x-axis (horizontal axis)
ylim = c(0,100) # change limits of the y-axis (vertical axis)
lty = 2 # change line type to dashed
las = 2 # rotate axis labels to be perpendicular to axis
cex = 1.5 # magnify the plotting symbols 1.5-fold
xlab = "Body size" # label for the x-axis
ylab = "Frequency" # label for the y-axis

To find out most of the basic plotting options (there are many), see the help windows for the functions below

?par # graphical parameters
?plot.default # basic plot decorations

### Save graph as a pdf file

After drawing your plot, you can use the menu (File -> Save As) to save to a pdf file. Or, draw the graph on a pdf device to begin with:

pdf( file = "mygraph.pdf") # opens the pdf device for plotting
... # Issue your R commands here to generate plot
dev.off() # closes the device when you are done
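A complete version of this sequence might look like the sketch below. The file location (a temporary folder), the plot itself, and the `width` and `height` values are assumptions made for the example; substitute your own file name and plotting commands.

```r
# Write a simple plot to a pdf file in R's temporary folder (hypothetical location)
outfile <- file.path(tempdir(), "mygraph.pdf")

pdf(file = outfile, width = 6, height = 4)   # page size in inches
plot(1:10, (1:10)^2, main = "Eureka", xlab = "x", ylab = "x squared")
dev.off()                                    # the file is not complete until the device is closed

file.exists(outfile)                         # confirm that the file was written
```

If the pdf file appears empty or will not open, the usual cause is a missing `dev.off()`.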

## Graphs to display frequencies of a single variable

These methods display the frequency distribution of a single variable.

### Histogram - show frequency distribution for a numeric variable

The basic command is

hist(x, col = "navy", right = FALSE)

The `col` argument sets the color of the bars. The `right = FALSE` option causes all the histogram intervals, or bins, except the last one to be left-closed. In other words, the value 1 would appear in the interval 1-2 rather than in the interval 0-1, where R's default (`right = TRUE`) would put it. Left-closed bins seem to be the usual convention. (The exception is when 1 is the upper limit of the right-most bin, in which case R puts it in the bin 0-1; use the `include.lowest` option to control that!)

To influence the width and number of histogram bins, use the `breaks` option. To set the approximate number of bins, use

hist(x, breaks = 20, right = FALSE)

For finer control of bin number and location, specify the breakpoints. For example, the following command creates a series of bins 1 unit wide between the limits 0 and 6 (make sure all the data fall between these limits or R will complain),

hist(x, breaks = seq(from=0, to=6, by=1), right = FALSE)

Notice that a value of exactly 6, the upper limit of breaks, will appear in the interval 5-6. This behavior is restricted to the rightmost bin. To prevent this from happening, increase the upper limit of the breaks by 1:

hist(x, breaks = seq(from=0, to=7, by=1), right = FALSE)
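The bin-closure rules described above are easy to check by asking `hist` for the counts without drawing anything (`plot = FALSE`). The data values below are made up for the purpose, and the object names are hypothetical.

```r
# Hypothetical data with values that fall exactly on bin boundaries
x   <- c(0, 1, 1, 2, 5, 6)
brk <- seq(from = 0, to = 6, by = 1)

left_closed  <- hist(x, breaks = brk, right = FALSE, plot = FALSE)
right_closed <- hist(x, breaks = brk, right = TRUE,  plot = FALSE)

left_closed$counts    # 1 2 1 0 0 2 : the 1s fall in bin 1-2; the 6 joins the last bin, 5-6
right_closed$counts   # 3 1 0 0 1 1 : the 1s fall in bin 0-1 instead
```

Comparing the two sets of counts shows which values moved when the bins changed from right-closed to left-closed.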

To display probability density instead of raw frequencies

hist(x, prob = TRUE, right = FALSE)

Superimpose a normal density curve on a histogram. To accomplish this we generate 101 evenly spaced points along the x axis, between the smallest and largest data value, using `seq`. Then we use `dnorm` to generate the normal density at each x point, using the mean and standard deviation of the data shown in the histogram.

hist(x, prob = TRUE, right = FALSE)
m <- mean(x, na.rm = TRUE)
s <- sd(x, na.rm = TRUE)
xpts <- seq(from = min(x, na.rm=TRUE), to = max(x, na.rm = TRUE), length.out = 101)
lines(dnorm(xpts, mean=m, sd=s) ~ xpts, col="red", lwd=2)

### Bar graph - show frequency distribution for a categorical (grouping) variable

In the following examples, `A` is a categorical variable, with each element corresponding to the category for a different individual or replicate. (A can be a factor or a character variable.) The basic command is below. The `table` command tallies up the frequency of individuals in each category and then `barplot` plots it.

barplot(table(A)) # makes a bar graph

R will arrange the categories in alphabetical order. If you want to fix a specific order, specify this in a factor command. For example, if the variable A has three groups "a", "b" and "c", fix a preferred order as follows

A <- factor(A, levels = c("c","a","b"))

To plot the bars in order of decreasing frequency (a good idea for bar graphs),

barplot(sort(table(A), decreasing = TRUE))

If you already have the frequencies for each category, then modify as follows. Let `A` be a variable that lists each named category exactly once. Let `y` be the variable containing the corresponding frequency of observations in each category.

barplot(y, names.arg = A)

Some other options to include in your barplot command to control appearance (enter "?barplot" for more options)

col = "green" # fill color
space = 0.1 # adjust space between bars
cex.names = 0.8 # shrink labels
names.arg = c("Hi","Ho","He") # set new names under bars
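Putting several of these pieces together, the sketch below draws a bar graph sorted by decreasing frequency and writes each count inside its bar. The data are made up, and labeling the bars with `text` is an optional extra rather than something the commands above require.

```r
# Hypothetical categorical variable
A <- c("c", "a", "b", "a", "c", "a")

freq <- sort(table(A), decreasing = TRUE)   # a = 3, c = 2, b = 1

# barplot() returns the x-positions of the bar midpoints
mids <- barplot(freq, col = "green", space = 0.1, cex.names = 0.8,
                ylab = "Frequency")

# Use the midpoints to print each frequency just below the top of its bar
text(mids, as.vector(freq), labels = as.vector(freq), pos = 1)
```

Saving the value returned by `barplot` is handy whenever you want to add points, text, or lines at the bar positions.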

### Normal quantile plot - compare data to the normal distribution

`x` is a numeric variable.

qqnorm(x)
qqline(x) # adds line through first and third quartiles

## Graphs to display association between two variables

The appropriate graph depends on which variables are numeric or categorical.

### Mosaic plot - association between two categorical variables

A and B can be factors or character variables. `color = TRUE` yields shades of grey, or a vector of colors can be used instead. A quick way to get a bunch of different colors is to use `color = rainbow(n)`, where `n` is the number of categories for the variable B.

mosaicplot(table(A,B), color = TRUE, las = 2, cex.axis = 0.8)
mosaicplot(table(A,B), color = c("red","blue","yellow"), las = 2, cex.axis = 0.8)
mosaicplot(table(A,B), color = rainbow(3), las = 2, cex.axis = 0.8)

The "las" option makes the labels perpendicular to the axes so that they don't overlap. The "cex.axis" is optional and is used to adjust the size of the labels (provide a number that is a multiple of the default size).

### Grouped bar graph - association between two categorical variables

A and B can be factors or character variables.

barplot(table(A,B), beside = TRUE)

The "space" option controls spacing between bars. Two values are needed: the first controls spacing between bars within each group, and the second controls the space between groups. With `table(A,B)`, each group corresponds to one category of B (the column variable of the table), and the bars within a group are the categories of A. The first number should be smaller than the second number.

barplot(table(A,B), beside = TRUE, space = c(.1,.3))

Other options:

barplot(table(A,B), beside = TRUE, cex.names = 0.8) # adj label size
barplot(table(A,B), beside = TRUE, legend.text = TRUE) # add legend

### Box plots - association between a categorical and a numeric variable

Generates a box plot for the numeric variable `y` separately for every group identified in the categorical variable `A` (`A` can be a factor or character variable).

boxplot(y ~ A) # formula method: response ~ explanatory

Options to include in the boxplot command to control appearance (enter ?boxplot to see more)

cex.axis = 0.8 # reduce size of axis labels (to fit them onto figure)
boxwex = .5 # adjust width of boxes
varwidth = TRUE # widths proportional to sqrt(n)

### Strip chart - association between a categorical and a numeric variable

A strip chart can be used instead of a box plot when the number of data points is not large.

stripchart(y ~ A, vertical = TRUE) # formula method

You can use `jitter` to reduce overlap of points. Changing the value of `jitter` adjusts how much noise is added. Change label size with `cex.axis`, and change symbol with `pch`.

stripchart(y ~ A, vertical = TRUE, method = "jitter", jitter = 0.2)
stripchart(y ~ A, vertical = TRUE, cex.axis = 0.8, pch = 1)
stripchart(y ~ A, vertical = TRUE, pch = "-")

Option `add = TRUE` adds points to a preexisting stripchart. For example, to use different symbols according to unique values of category variable `B`, try the following. The first of the three commands below sets up the plot but withholds the points. The following two commands add the data points for two groups. Include `vertical = TRUE` in every command so that the added points are drawn in the same orientation as the original plot.

stripchart(y ~ A, vertical = TRUE, method = "jitter", pch = "")
stripchart(y[B=="b1"] ~ A[B=="b1"], vertical = TRUE, method = "jitter", pch = 1, add = TRUE)
stripchart(y[B=="b2"] ~ A[B=="b2"], vertical = TRUE, method = "jitter", pch = 16, add = TRUE)

You can add points or lines to a strip chart by taking advantage of the fact that the categorical variable plotted along the x-axis also has a numerical interpretation: the groups sit at positions 1, 2, 3, and so on. For example, to plot a single point at the mean of `y` for all four categories of a variable `A`, use

stripchart(y ~ A, vertical = TRUE, method = "jitter", pch = 1)
points( c(1,2,3,4), tapply(y, A, mean), pch=16)

You can offset the positions of the points by tweaking the values of x,

points( c(1,2,3,4) + 0.2, tapply(y, A, mean), pch = 16)

Adding lines to a plot is similar. This is a simple way to add error bars to a strip chart. For example, any of the following commands adds a vertical line from `y` = 5 to `y` = 10 at the position of the first group in a strip chart,

lines( c(1,1), c(5,10)) # vertical line
lines( c(5,10) ~ c(1,1)) # same using formula method
arrows(1, 5, 1, 10, angle=90, code=3) # same with line ends
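Combining the point and line commands above, here is one way to add group means with standard error bars to a strip chart. This is only a sketch: the data are simulated, the object names (`m`, `se`, `pos`) are hypothetical, and the choice of one standard error for the bar length is an assumption made for the example.

```r
# Simulated data: three groups of eight observations each
set.seed(1)
A <- factor(rep(c("g1", "g2", "g3"), each = 8))
y <- rnorm(24, mean = rep(c(5, 7, 6), each = 8))

m  <- as.vector(tapply(y, A, mean))                     # group means
se <- as.vector(tapply(y, A, sd)) / sqrt(as.vector(table(A)))   # standard errors
pos <- seq_along(m) + 0.2                               # offset to the right of the points

stripchart(y ~ A, vertical = TRUE, method = "jitter", jitter = 0.1, pch = 1)
points(pos, m, pch = 16)                                # filled dot at each mean
arrows(pos, m - se, pos, m + se, angle = 90, code = 3, length = 0.05)
```

Because `arrows` accepts vectors, a single call draws the error bars for all groups at once.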

### Scatter plot - association between two numeric variables

Here's how to produce a scatter plot for two numeric variables, `x` and `y`. In the formula method, the variable to the right of the tilde is plotted along the horizontal (`x`) axis. In the alternate method, the variable listed first is plotted along the horizontal axis.

plot(y ~ x) # a scatter plot if x and y are numeric; formula method
plot(x, y) # alternate method to produce the same scatter plot
plot(y ~ x, pch = 16, col = 2) # change symbol to a colored dot

Add a smooth curve through the data to estimate the shape of the relationship between y and x. The `lowess` command uses locally weighted regression to accomplish this. "Local" means that y is predicted for each x using only data in the vicinity of that x, rather than all the data. The size of the vicinity is controlled by the option "f", which is a proportion between 0 (very narrow vicinity) and 1 (uses all the data). Try different values of "f" to best capture the relationship. The default is `f = 2/3`.

plot(y ~ x)
x1 <- x[order(x)]
y1 <- y[order(x)]
lines(lowess(x1, y1, f = 0.5))

The `order` command sorts the values of x and y by x. This ensures that the segments of the `lowess` curve are drawn in order, sequentially from left to right.

Add a line to the scatter plot

plot(y ~ x)
abline(a = myintercept, b = myslope) # specify values for a and b (intercept and slope)
abline(lm(y ~ x)) # add the least squares regression line
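Here is a sketch that draws both kinds of line on the same scatter plot, using simulated data (the true intercept of 1 and slope of 2 are assumptions of the simulation, and `fit` is a hypothetical object name). Saving the fitted model first lets you reuse it to inspect the estimated coefficients.

```r
# Simulated data with a known linear relationship plus noise
set.seed(2)
x <- runif(30, min = 0, max = 10)
y <- 1 + 2 * x + rnorm(30)

fit <- lm(y ~ x)                            # least squares regression

plot(y ~ x, pch = 16)
abline(fit, lty = 2)                        # dashed regression line
lines(lowess(x, y, f = 0.5), col = "red")   # smooth curve for comparison

coef(fit)                                   # estimated intercept and slope
```

When the relationship really is linear, as in this simulation, the smooth curve and the regression line should nearly coincide.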

Plot multiple groups with different symbols. You can use `pch` to vary the symbol, or `col` to vary the color, or vary both at the same time. `x` and `y` are numeric variables, whereas `A` is a categorical variable identifying groups. If `A` is already a factor (check using the `is.factor` command) you can use just `A` instead of `factor(A)` in the commands below.

plot(y ~ x, pch = as.numeric(factor(A)), col = as.numeric(factor(A)))

You'll want to add a legend too so that you can identify the groups. Issue the following command and then click on the plot (inside the plot region) with your cursor to place the legend. If `A` is already a factor (check using the `is.factor` command) you can use just `A` instead of `factor(A)` in these commands.

plot(y ~ x, pch = as.numeric(factor(A)))
legend( locator(1), legend = as.character(levels(factor(A))), pch = 1:length(levels(factor(A))) )

Replace `pch =` with `col =` in the plot command and the legend command if you want to identify groups by color instead of symbol, or include both color and symbol in both commands.
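Clicking to place the legend only works in an interactive session. If you are running a script, one alternative (an assumption of this sketch, not part of the commands above) is to give `legend` a position keyword such as `"topleft"` in place of `locator(1)`. The data below are simulated, and `g` is a hypothetical name for the factor version of `A`.

```r
# Simulated data: three groups of ten points each
set.seed(3)
A <- rep(c("a", "b", "c"), each = 10)
x <- rnorm(30)
y <- x + rnorm(30)

g <- factor(A)   # convert once and reuse in both commands

# Vary both symbol and color by group
plot(y ~ x, pch = as.numeric(g), col = as.numeric(g))

# Place the legend by keyword rather than by clicking
legend("topleft", legend = levels(g),
       pch = seq_along(levels(g)), col = seq_along(levels(g)))
```

Using the same integer codes in the plot and the legend is what guarantees that the legend symbols match the groups.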

Here are two ways to identify individual points on a scatter plot. The first just redraws the scatter plot and then adds the row number next to it (you can use another variable instead). This can get noisy if there are a lot of data points.

plot(y ~ x)
text(x, y, labels = seq_along(x), pos = 1, cex = 0.5)

The second method uses the cursor to click those few points on the plot you want identified. This version prints the row number when you click a point. You can change that by setting the `labels` option to a character variable that identifies the points instead.

plot(y ~ x)
identify(x, y, labels = seq_along(x))

### Overlay scatter plots for multiple groups

To draw a scatter plot between two numeric variables x and y separately for each category of a third variable `A`,

plot(y ~ x, pch = as.numeric(factor(A)))

To add a legend that identifies the groups, issue the following command and then click the cursor inside the plot region to place the legend.

legend(locator(1), legend = as.character(levels(factor(A))), pch = 1:length(levels(factor(A))))

Replace `pch =` with `col =` in the plot and legend commands if you want to use different colors instead of symbols, or include both!

### Line plot - display a sequence of measurements

Here's how to plot a sequence of x,y points and connect the dots with lines. This is especially useful when the x-variable represents a series of points in time or across a spatial gradient. The plot command below is the same as in the case of a scatter plot. The `lines` command adds lines to the same plot and connects the dots. The `order(x)` bit is to connect the dots from left to right, in case the x-values are unsorted.

plot(y ~ x, pch=16)
lines(y[order(x)] ~ x[order(x)])

To draw a line plot without any dots,

plot(y[order(x)] ~ x[order(x)], type = "l")

You can change line type, thickness, and color using the `lty`, `lwd`, and `col` options. Enter `?par` to learn more about these line options.

lines(y[order(x)] ~ x[order(x)], lty = 3, lwd = 2, col = "red")

## Graphs to display multiple variables

See also **Overlay scatter plots for multiple groups** in the previous section and **Multipanel plots** below for other solutions.

The command `interaction.plot` is quick but it does not show the data, and so is not useful for presentation. A better solution is to add lines to a `stripchart` instead.

### Interaction plots

Interaction plots are useful for displaying how the mean of a response variable `y` changes between the levels of two categorical variables, `A` and `B`. The graph is especially useful for determining whether an interaction is present between two factors `A` and `B` in a factorial experiment, or between a factor `A` and a blocking variable `B`. If the lines are parallel then there is no interaction.

interaction.plot(A, B, y)

The levels of the variable listed first (here, `A`) will be displayed along the x-axis of the plot. The y-axis will then display the mean of y separately for each category of the second variable, `B`. Variations on this command include

interaction.plot(B, A, y) # Put B along x-axis instead
interaction.plot(A, B, y, fun = median) # median of y
interaction.plot(A, B, y, col = 1:length(unique(B))) # color the lines
interaction.plot(A, B, y, las=2) # more room for A's labels
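The sketch below draws an interaction plot for simulated data and then an alternative that shows the raw data: a strip chart with the cell means connected by lines. The data frame, its dimensions, and the object name `cellmeans` are all hypothetical.

```r
# Simulated factorial design: 3 levels of A, 2 levels of B, 5 replicates per cell
set.seed(4)
mydata <- expand.grid(rep = 1:5, A = c("a1", "a2", "a3"), B = c("b1", "b2"))
mydata$y <- rnorm(nrow(mydata), mean = 10)

# Table of cell means: rows are levels of A, columns are levels of B
cellmeans <- with(mydata, tapply(y, list(A, B), mean))

# Quick look: lines only, no data points
with(mydata, interaction.plot(A, B, y, col = 1:2))

# Alternative that shows the data: strip chart plus one line of means per level of B
stripchart(y ~ A, data = mydata, vertical = TRUE, method = "jitter", pch = 1)
for (j in seq_len(ncol(cellmeans))) {
  lines(seq_len(nrow(cellmeans)), cellmeans[, j], col = j)
}
```

The lines in the second graph trace the same means as the interaction plot, but the reader can also judge the scatter of the observations around them.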

### Pairwise scatter plots for multiple variables

The following command creates a single graph with scatter plots between all pairs of numeric variables in a data frame, "mydata". The option `gap` adjusts the spacing between separate plots,

pairs(mydata, gap = 0.5)

Use the formula method to plot only the three numeric variables x1, x2, and x3 in the data frame mydata.

pairs(~ x1 + x2 + x3, data = mydata)

## Multipanel plots

R has the ability to produce a series of plots in the same graphics window. For example, you might want to produce the same plot, such as a histogram, separately for distinct groups defined by a categorical variable, all in the same window.

### Plot more than one graph in the same window

The easiest method is to use `par` to divide up the plot window into an array of rows and columns, and then make each plot in turn.

To plot more than one graph in the same graph window, precede your plotting command with one of the following commands. Reissue the command if you want to restart at the top of the window.

par(mfrow = c(2,2)) # sets up a window for 4 plots, 2 by 2.
par(mfrow = c(3,1)) # sets up a single column of 3 plots.
par(mfrow = c(1,1)) # returns to a 1 plot per window layout

The down side of this approach is that the panels won't use the window space efficiently, and a pleasing graph would require adjusting plot margins and axes using other arguments with `par`.
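As a sketch of this approach, the following draws a stacked column of histograms, one per group, using the same breaks in every panel so that they can be compared. The data are simulated, and the margin values passed to `mar` are an assumption that you will likely want to tune for your own graph.

```r
# Simulated data: one numeric variable measured in three groups
set.seed(5)
x <- rnorm(60)
B <- rep(c("b1", "b2", "b3"), each = 20)

# Common breaks that span all the data
brk <- seq(from = floor(min(x)), to = ceiling(max(x)), by = 0.5)

# par() returns the previous settings, which makes them easy to restore later
oldpar <- par(mfrow = c(3, 1), mar = c(4, 4, 2, 1))
for (g in unique(B)) {
  hist(x[B == g], breaks = brk, right = FALSE, col = "navy", main = g, xlab = "x")
}
par(oldpar)   # back to one plot per window and the original margins
```

Restoring the saved settings at the end avoids surprises in later plots drawn in the same window.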

### Using the `lattice` package

The `lattice` package makes it easy to draw a panel of plots separately for different groups defined by a categorical variable, all on the same scale. The down side of this package is that the commands to add points and lines to the individual panels can be tricky.

The `lattice` package is included with the basic installation but you need to load the library,

library(lattice)

The graph types available in the lattice package include the standard ones found also in R's basic graphics package, such as box plots, histograms, and so on. The table below lists the most common types and the relevant commands.

For example, to draw a **histogram** of a numeric variable `x` separately for four groups identified by the variable `B` (assume the variables are in the data frame "mydata"), use

histogram(~ x | B, data = mydata, layout = c(1,4), right = FALSE)

The `layout` option is special to lattice and draws the 4 panels in a grid with 1 column and 4 rows, so that the histograms are stacked and most easily compared visually. The `right=FALSE` option has the same meaning here as for the base R `hist` command.

To draw a bar graph showing the frequency distribution of a categorical variable `A` separately for each group identified by the variable `B`,

barchart( ~ table(A) | B, data = mydata)

This produces horizontal bar graphs, which leaves room for the category labels. To draw the bars vertically instead, while tilting the group labels on the x-axis by 45 degrees so that they fit,

barchart(table(A) ~ names(table(A)) | B, data = mydata, scales = list(x=list(rot=45)))

As a third example, draw a **scatter plot** to show the relationship between the numeric variables `x` and `y` separately for each group in the variable `B`. The `pch` option in this example replaces the default plot symbol with a filled dot, and the `aspect` option sets the relative lengths of the vertical and horizontal axes.

xyplot(y ~ x | B, data = mydata, pch = 16, aspect = 0.7)

It is possible to add plot elements to individual panels, but the commands and options take some getting used to. For example, to fit a separate regression line to each scatter plot, one to each group, use the `panel` argument in `xyplot` to construct a function that applies built-in panel functions to each group, as follows.

xyplot(y ~ x | B, data = mydata, pch = 16, aspect = 0.7,
  panel = function(x, y){ # Use x and y here, not real variable names
    panel.xyplot(x, y) # draws the scatter plot
    panel.lmline(x, y) # fits the regression line
  }
)
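A runnable variation on this command is sketched below with simulated data. It differs from the command above in two ways, both of which are choices made for this example: the panel function includes `...` so that options such as `pch` are handed on to the built-in panel functions, and the plot is saved to an object (the hypothetical name `p`) and printed explicitly, which is needed when lattice commands are run inside a function, a loop, or a sourced script.

```r
library(lattice)

# Simulated data: two groups with the same underlying linear relationship
set.seed(6)
mydata <- data.frame(x = runif(40, min = 0, max = 10),
                     B = rep(c("b1", "b2"), each = 20))
mydata$y <- 2 + 0.5 * mydata$x + rnorm(40)

p <- xyplot(y ~ x | B, data = mydata, pch = 16, aspect = 0.7, layout = c(2, 1),
  panel = function(x, y, ...) {   # x and y are the data for one panel
    panel.xyplot(x, y, ...)       # draws the scatter plot
    panel.lmline(x, y, ...)       # adds the regression line for this panel
  }
)
print(p)   # draws the plot
```

Because the plot is stored as an object, you can also redraw it later on a different device, such as a pdf file.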

This doesn't even begin to describe what's possible using the lattice package. Crawley has a few more examples of trellis graphics in *The R Book*. Sarkar (2008) gives a complete description. See the links to these books on the "Books" tab of the Biology 501 course page.

Table showing a few of the commonly used plotting commands in the lattice package. `x` and `y` are numeric variables, whereas `A` is a categorical variable (character or factor). `B` is a factor or character variable that will define the groups or subsets of the data frame to be plotted in separate panels. A separate plot in the graphics window will be made for each of the groups defined by the variable `B`.

Command | Graph type | Syntax (~ refers to the tilde, not to the minus sign) |
---|---|---|
`barchart` | bar graph | `barchart(~table(A) \| B, data=mydata)` |
`bwplot` | box plot | `bwplot(x ~ A \| B, data=mydata)` |
`histogram` | histogram | `histogram(~x \| B, data=mydata, right=FALSE)` |
`stripplot` | strip chart | `stripplot(x ~ A \| B, data=mydata, jitter=TRUE)` |
`xyplot` | scatter plot | `xyplot(y ~ x \| B, data=mydata)` |

### Using the `ggplot2` package

The `ggplot2` package is a popular one for drawing graphs. It has built-in functions to draw multi-panel plots by group, all on the same page. It is a language all its own, distinct from that used in base R, but its formula-like structures make sense once learned. The default graph style is more embellished than in base R, and if this makes you unhappy you will need to do some digging to figure out how to modify it. The second edition of the ggplot2 book by Wickham is helpful; find the link on the "Books" tab of the Biology 501 course page.

Load the package and use,

library(ggplot2) # load the package
theme_set(theme_bw()) # to employ the simplest ggplot2 style

The plot commands are formed by chaining together elements using the "+" symbol, much like you would include different terms in a linear model statement when fitting to data.

For example, the following command draws a **histogram** of a numeric variable `x` separately for groups identified by the variable `B` (assume the variables are in the data frame "mydata"). The `geom_histogram` function builds the histogram, with options to left-close the bins and color the bars. Use `breaks = seq(...)` to control the location and size of bins, and to put bin labels below the edges, rather than under the center, of bars. The `facet_wrap` function instructs R to plot by `B` groups in two columns of panels.

ggplot(mydata, aes(x)) +
  xlab("X variable name") +
  ylab("Frequency") +
  geom_histogram(closed = "left", breaks = seq(from=10, to=100, by=10), fill = "red") +
  facet_wrap(~B, ncol = 2)

As a second example, the following draws a **scatter plot** of `y` against `x`, separately for groups identified by the variable `B`, with all variables present in the data frame "mydata". The `geom_point` term draws the scatter plot in each panel, and `geom_smooth` with the argument `method = lm` fits a regression line to each panel of points.

ggplot(mydata, aes(x, y)) +
  xlab("X variable name") +
  ylab("Y variable name") +
  geom_point(col = "red", size = I(2)) +
  geom_smooth(method = lm, size = I(1), se = FALSE, col = "black") +
  facet_wrap(~B, ncol = 2)