# Graphs and tables

This page provides tips and recommendations for making graphs and tables in R.

In the examples below, `x` and `y` are numeric variables in the data frame `mydata`. `A` and `B` are categorical variables (factors or character variables) identifying different groups.
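To try the commands on this page without your own data, a small simulated data frame can stand in for `mydata`. The sketch below is only an illustration: the sample size, group labels, and distributions are arbitrary assumptions.

```r
# Simulated stand-in for mydata (all values are arbitrary assumptions)
set.seed(1)
n <- 60
mydata <- data.frame(
  x = rnorm(n, mean = 3, sd = 1),                       # numeric variable
  A = sample(c("a", "b", "c"), n, replace = TRUE),      # categorical variable
  B = factor(sample(c("b1", "b2"), n, replace = TRUE))  # categorical variable (factor)
)
mydata$y <- 2 + 0.5 * mydata$x + rnorm(n)               # numeric variable related to x
str(mydata)
```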

We include optional examples that make use of the add-on packages `dplyr` and `ggplot2`. Hadley Wickham’s `ggplot2` extends base R with its “grammar of graphics”. His book is the standard reference (Wickham, H. 2016. ggplot2: Elegant graphics for data analysis. 2nd edition), but plenty of introductory resources are available online.

Some will find the default theme of `ggplot2` to be visually noisy. You can easily change to a simpler theme, such as `theme_classic()`, `theme_minimal()`, or `theme_bw()`, by running the `theme_set()` command at the start of your session, as follows.

    library(dplyr)
    library(ggplot2)
    theme_set(theme_classic())

## Frequency tables

These commands generate tables of frequencies.

### Frequency table for one variable

This frequency table counts the number (frequency) of cases in each category of a categorical variable `A`. The `useNA` argument adds the category `NA` if one or more cases are missing.

    table(mydata$A, useNA = "ifany")
    # or
    with(mydata, table(A, useNA = "ifany"))

The `summarize` command in the `dplyr` package can generate frequency tables with its `n()` function.

    summarize(group_by(mydata, A), Frequency = n())

### Frequency (contingency) table for two variables

The following commands generate a frequency table for two categorical variables, `A` and `B`. The command can be extended to three or more variables.

    table(mydata$A, mydata$B, useNA = "ifany")

To include the row and column sums, use

    tab <- table(mydata$A, mydata$B)
    addmargins(tab, margin = c(1,2), FUN = sum, quiet = TRUE)
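As a quick check of what `addmargins` adds, the sketch below builds a toy table (the data are made up for illustration) and confirms that the result has one extra row and one extra column, with the grand total in the last cell.

```r
# Toy data, invented for illustration
A <- c("a", "a", "b", "b", "b", "c")
B <- c("u", "v", "u", "u", "v", "v")

tab <- table(A, B)
tab_m <- addmargins(tab, margin = c(1, 2), FUN = sum, quiet = TRUE)
print(tab_m)

# One extra row and one extra column have been added
dim(tab)
dim(tab_m)

# The last cell holds the grand total (the number of cases)
tab_m[nrow(tab_m), ncol(tab_m)]
```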

### Flat frequency table

The following command generates a "flat" frequency table for two categorical variables, `A` and `B`. In a flat table, `A` and `B` are separate columns of a table, and a third column tallies the frequencies of each combination. The table will show a count of 0 for category combinations not present in the data.

    data.frame(ftable(mydata[, c("A","B")], row.vars = c("A","B")))
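To see the zero counts in action, here is a minimal sketch using a toy data frame (named `toy` here, an invented example) in which one combination of categories never occurs.

```r
# Toy data in which the combination A = "c", B = "u" never occurs
toy <- data.frame(A = c("a", "a", "b", "b", "c"),
                  B = c("u", "v", "u", "v", "v"))

flat <- data.frame(ftable(toy[, c("A", "B")], row.vars = c("A", "B")))
print(flat)

# One row per combination of categories (3 x 2 = 6), including the empty one
flat[flat$Freq == 0, ]
```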

The `summarize` method of `dplyr` will tally only the combinations of categories that have a frequency > 0. Hence, `ftable` is preferred. Alternatively, a fix is available in the `tidyr` package (you might need to install it first):

    library(tidyr)
    mytab <- summarize(group_by(mydata, A, B), freq = n())
    complete(ungroup(mytab), A, B, fill = list(freq = 0))

## Table summaries of descriptive statistics

The `tapply` command creates tables of descriptive statistics by group (e.g., mean, standard deviation, median, etc.). So does the `summarize` command of the `dplyr` package, as shown here.

### Descriptive statistics for one variable

Here is how to generate a table of group means for a variable `y`, where `A` is the grouping variable.

    # result is a vector
    tapply(mydata$y, INDEX = mydata$A, FUN = mean, na.rm = TRUE)
    # dplyr method; result is a data frame
    summarize(group_by(mydata, A), ybar = mean(y, na.rm = TRUE))

The argument `na.rm = TRUE` removes missing values (otherwise the mean returns `NA` if missing values are present). With `tapply`, pass optional arguments to `FUN` by including them immediately afterward.
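As an illustration of passing extra arguments through `tapply`, the sketch below uses made-up data and hands `na.rm` to `mean`, and both `probs` and `na.rm` to `quantile`. The vectors and object names are only for illustration.

```r
# Toy data with one missing value (invented for illustration)
y <- c(1, 2, 3, NA, 10, 20, 30, 40)
A <- c("a", "a", "a", "a", "b", "b", "b", "b")

# na.rm = TRUE is passed on to mean()
grp_mean <- tapply(y, INDEX = A, FUN = mean, na.rm = TRUE)
grp_mean

# probs and na.rm are both passed on to quantile()
grp_median <- tapply(y, INDEX = A, FUN = quantile, probs = 0.5, na.rm = TRUE)
grp_median
```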

An advantage of the `summarize` method from `dplyr` is that you can calculate more than one descriptive statistic at once. Here we calculate mean, standard deviation, and number of observations (including missing observations) for the variable `x` by group.

    summarize(group_by(mydata, A), 
              xbar = mean(x, na.rm = TRUE), 
              s = sd(x, na.rm = TRUE), 
              n = n())

### More than one grouping variable

To calculate descriptive statistics (e.g., the median of `y`) with more than one grouping variable, use one of the following commands.

    # dplyr command, yields a data frame
    summarize(group_by(mydata, A, B), ymedian = median(y, na.rm = TRUE))
    # yields a data frame
    aggregate(mydata$y, by = list(A = mydata$A, B = mydata$B), FUN = median, na.rm = TRUE)
    # yields an r x c matrix
    tapply(mydata$y, INDEX = list(mydata$A, mydata$B), FUN = median, na.rm = TRUE)

### More than one response variable

The `dplyr` command `summarize` allows you to tabulate summaries for more than one variable. For example, if your data frame `mydata` contains a categorical variable named `A` that has multiple categories, you can obtain means and standard deviations by group for one (or more) numeric variables `y1`, `y2`, etc., as follows. The result can be saved as a new data frame (here, named `z`).

    z <- summarize(group_by(mydata, A), 
                   mean1 = mean(y1, na.rm = TRUE), 
                   mean2 = mean(y2, na.rm = TRUE), 
                   mean3 = mean(y3, na.rm = TRUE), 
                   sd1 = sd(y1, na.rm = TRUE), 
                   sd2 = sd(y2, na.rm = TRUE), 
                   sd3 = sd(y3, na.rm = TRUE))
    print(z)

## Drawing graphs in R

Here, the many types of graphs are grouped according to whether their purpose is to display the frequencies of a single variable (e.g., histograms, bar graphs) or the association between two (or more) variables and differences between groups (e.g., mosaic plots, box plots, strip charts, scatter plots).

### Command options for base R plots

Many (but not all) of the basic plotting commands in R accept the same options to set axis limits and labels, add a title, change the plotting symbol, change the size of symbols and text, and change line types. Here are some of the most frequently modified options. Use them inside the parentheses of a plotting command. If you are not sure whether a given option works in your case, try it: the worst that can happen is that you get an error message or R ignores the option.

    main = "Eureka"      # add a title above the graph
    pch = 16             # change plot symbol to a filled circle
    col = "red"          # change the item color
    xlim = c(-10,10)     # change limits of the x-axis (horizontal axis)
    ylim = c(0,100)      # change limits of the y-axis (vertical axis)
    lty = 2              # change line type to dashed
    las = 2              # rotate axis labels to be perpendicular to axis
    cex = 1.5            # magnify the plotting symbols 1.5-fold
    xlab = "Body size"   # label for the x-axis
    ylab = "Frequency"   # label for the y-axis

For details and the full list of plotting options in base R, get help as follows.

    ?par           # graphical parameters
    ?plot.default  # basic plot decorations

### Drawing graphs with ggplot

Building a graph using `ggplot` involves the combination of components including data, "aesthetics" that map variables to visuals, and "geoms" that create different kinds of plots. A basic scatter plot of `yvar` against `xvar` has the three components as follows,

    ggplot(data = mydata, mapping = aes(x = xvar, y = yvar)) + 
      geom_point()

where `geom_point()` is the specific geom for plotting points. Other graph components can be added, as demonstrated with examples below.

### Save graph as a pdf file

After drawing your plot, you can use the menu (File -> Save As) to save to a pdf file. Or, draw the graph on a pdf device to begin with:

    pdf(file = "mygraph.pdf")   # opens the pdf device for plotting
    # ... issue your R commands here to generate the plot
    dev.off()                   # closes the device when you are done
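A complete round trip might look like the following sketch. The temporary file path, the plotted values, and the `width` and `height` settings (in inches) are arbitrary choices for illustration.

```r
# Write a simple base R plot to a pdf file in the session's temporary directory
outfile <- file.path(tempdir(), "mygraph.pdf")

pdf(file = outfile, width = 6, height = 4)   # open the pdf device; size in inches
plot(1:10, (1:10)^2, pch = 16, xlab = "x", ylab = "x squared")
dev.off()                                    # closing the device finalizes the file

file.exists(outfile)                         # confirm that the file was written
```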

## Graphs to show a frequency distribution of a single variable

### Bar graph - frequency distribution for a categorical (grouping) variable

In the following examples, `A` is a categorical variable identifying groups.

    # base R
    barplot(table(mydata$A), col = "firebrick", space = 0.2, cex.names = 1.2)

    # Using ggplot
    ggplot(mydata, aes(x = A)) + 
      geom_bar(stat = "count")

    # ggplot with options:
    ggplot(mydata, aes(x = A)) + 
      geom_bar(stat = "count", fill = "firebrick") + 
      labs(x = "A group", y = "Frequency") + 
      theme(text = element_text(size = 15), 
            axis.text = element_text(size = 12), aspect.ratio = 0.8)

R will arrange the categories in alphabetical order by default. If you want a specific order, specify it in a `factor` command. For example, if the variable `A` has three groups "a", "b" and "c", fix a preferred order as follows and then rerun the plotting command.

    mydata$A <- factor(mydata$A, levels = c("c","a","b"))
    barplot(table(mydata$A), col = "firebrick")
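The effect of setting `levels` can be checked without drawing anything, because `table` lists categories in the order of the factor levels. The sketch below uses a made-up character vector.

```r
# Toy character vector (invented for illustration)
A <- c("b", "a", "c", "a", "b", "a")
table(A)                                   # default order: a, b, c

# Fix the preferred order through the factor levels
A_ordered <- factor(A, levels = c("c", "a", "b"))
table(A_ordered)                           # now in the order c, a, b
```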

To plot the bars in order of decreasing frequency (a good idea for bar graphs):

    # Using base R
    barplot(sort(table(mydata$A), decreasing = TRUE), col = "firebrick", space = 0.2)

    # Using ggplot
    mydata$A_ordered <- factor(mydata$A, 
                               levels = names(sort(table(mydata$A), decreasing = TRUE)))
    ggplot(mydata, aes(x = A_ordered)) + 
      geom_bar(stat = "count", fill = "firebrick") + 
      labs(x = "A group", y = "Frequency") + 
      theme(text = element_text(size = 15), 
            axis.text = element_text(size = 12), aspect.ratio = 0.8)

If the frequencies are already tabulated in a data frame named `mytab`, then modify as follows. `A` is a variable in `mytab` that lists each named category exactly once and `Freq` is a variable containing the corresponding frequency of cases in each category.

    mytab <- arrange(mytab, desc(Freq))
    barplot(mytab$Freq, names.arg = mytab$A, col = "firebrick")

    # or
    mytab$A_ordered <- factor(mytab$A, 
                              levels = mytab$A[order(mytab$Freq, decreasing = TRUE)])
    ggplot(mytab, aes(x = A_ordered, y = Freq)) + 
      geom_bar(stat = "identity", fill = "firebrick") + 
      labs(x = "A group", y = "Frequency") + 
      theme(text = element_text(size = 15), 
            axis.text = element_text(size = 12), aspect.ratio = 0.8)

### Histogram - frequency distribution for a numeric variable

The basic command is

    hist(mydata$x, col = "navy", right = FALSE)

The `col` argument sets the color of the bars. The `right = FALSE` option makes the histogram intervals, or bins, left-closed and right-open. In other words, the value 1 would appear in the interval 1-2 rather than in the interval 0-1, which seems to be the convention. The exception is a value equal to the upper limit of the right-most bin, which R places in that last bin; use the `include.lowest` option to control this.

Use the `breaks` option to influence the width and number of histogram bins. To set the approximate number of bins to 20, use

    hist(mydata$x, breaks = 20, right = FALSE)

For finer control of bin number and location, specify the breakpoints. For example, the following command creates a series of bins 1 unit wide between the limits 0 and 6 (make sure all the data fall between these limits or R will complain).

    hist(mydata$x, breaks = seq(from = 0, to = 6, by = 1), right = FALSE)

Notice that a value of exactly 6, the upper limit of breaks, will appear in the interval 5-6. This behavior is restricted to the rightmost bin. To prevent this from happening, increase the upper limit of the breaks by 1:

    hist(mydata$x, breaks = seq(from = 0, to = 7, by = 1), right = FALSE)
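To check where boundary values land, `hist` can return the bin counts without drawing when `plot = FALSE` is set. The sketch below uses made-up values that sit exactly on bin boundaries and compares the two settings of `right`.

```r
# Toy values chosen to sit on bin boundaries (invented for illustration)
x <- c(0, 1, 1, 2.5, 5, 6)
b <- seq(from = 0, to = 6, by = 1)

# plot = FALSE returns the bin counts without drawing the histogram
left_closed  <- hist(x, breaks = b, right = FALSE, plot = FALSE)$counts
right_closed <- hist(x, breaks = b, right = TRUE,  plot = FALSE)$counts

# Columns correspond to the bins 0-1, 1-2, ..., 5-6
rbind(left_closed, right_closed)
```

With `right = FALSE` the two values of 1 are counted in the bin 1-2, and the value 6 joins the value 5 in the last bin, 5-6.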

In `ggplot`, the barest histogram for a numeric variable `x` in `mydata` requires only

    ggplot(mydata, aes(x)) + 
      geom_histogram()

The example below improves the graph with a number of options. To see the impact of each option, leave it out and rerun.

    ggplot(mydata, aes(x)) + 
      geom_histogram(fill = "firebrick", col = "black", binwidth = 0.2, boundary = 0) + 
      labs(x = "The variable x", y = "Frequency") + 
      theme(aspect.ratio = 0.80)

To display probability density instead of raw frequencies,

    hist(mydata$x, prob = TRUE, right = FALSE)

    # in ggplot (ggplot2 3.3.0 or later; older versions use y = ..density..)
    ggplot(mydata, aes(x)) + 
      geom_histogram(aes(y = after_stat(density)))

To superimpose a normal density curve on a histogram, try the following commands. First, `seq` makes 101 evenly spaced points along the `x` axis between the smallest and largest data values. Then `dnorm` generates the normal density at each `x` point, using the mean and standard deviation of the data.

    hist(mydata$x, prob = TRUE, right = FALSE)
    m <- mean(mydata$x, na.rm = TRUE)
    s <- sd(mydata$x, na.rm = TRUE)
    xpts <- seq(from = min(mydata$x, na.rm = TRUE), 
                to = max(mydata$x, na.rm = TRUE), length.out = 101)
    lines(dnorm(xpts, mean = m, sd = s) ~ xpts, col = "red", lwd = 2)

In `ggplot`, add a `stat_function()` to get the same result.

    # after_stat() needs ggplot2 3.3.0 or later; older versions use y = ..density..
    ggplot(mydata, aes(x)) + 
      geom_histogram(aes(y = after_stat(density))) + 
      stat_function(fun = dnorm, 
                    args = list(mean = mean(mydata$x, na.rm = TRUE), 
                                sd = sd(mydata$x, na.rm = TRUE)))

### Normal quantile plot - compare data to the normal distribution

`x` is a numeric variable.

    qqnorm(mydata$x)
    qqline(mydata$x)  # adds line through first and third quartiles

In `ggplot`, `geom_qq_line()` adds the corresponding reference line (available in `ggplot2` 3.0.0 and later).

    ggplot(mydata, aes(sample = x)) + 
      geom_qq() + 
      geom_qq_line()

## Graphs to display association between two variables

The appropriate graph depends on which variables are numeric or categorical.

### Mosaic plot - association between two categorical variables

`A` and `B` can be factors or character variables. `color = TRUE` yields shades of grey, or a vector of colors can be used instead (shown below for 3 categories). A quick way to get a bunch of different colors is to use `color = rainbow(n)`, where `n` is the number of categories for the variable `B`. Other options in the examples below alter the orientation (`las`) and size (`cex.axis`) of the labels.

    mosaicplot(table(mydata$A, mydata$B), color = c("red","blue","yellow"), 
               las = 2, cex.axis = 0.8)
    mosaicplot(table(mydata$A, mydata$B), color = rainbow(3), 
               las = 2, cex.axis = 0.8)

To draw a mosaic plot using `ggplot`, first install the `ggmosaic` package. The example below loads the package, assuming it is installed.

    library(ggmosaic)
    ggplot(mydata, aes(x = product(A, B), fill = factor(A))) + 
      geom_mosaic() + 
      labs(x = "B group") + 
      theme(aspect.ratio = 1, 
            axis.text.x = element_text(angle = -25, hjust = .1, size = 12)) + 
      guides(fill = guide_legend(title = "A group", reverse = TRUE))

### Grouped bar graph - association between two categorical variables

`A` and `B` can be factors or character variables in a data frame `mydata`. The first line of code below is the bare minimum, whereas the second adds a few useful options.

    barplot(table(mydata$A, mydata$B), beside = TRUE)
    barplot(table(mydata$A, mydata$B), beside = TRUE, las = 1, col = rainbow(4), 
            cex.names = 0.8, space = c(0,0.8), xlab = "B group", 
            ylab = "Frequency", legend.text = TRUE)

In `ggplot`, `geom_bar(stat="count")` works only if all possible combinations of categories between `A` and `B` have a count of at least 1 in the data. Bars with counts of 0 will not be plotted at all, and remaining bars will widen to fill the space.

    ggplot(mydata, aes(x = A, fill = B)) + 
      geom_bar(stat = "count", position = "dodge") + 
      labs(x = "A group", y = "Frequency") + 
      theme(aspect.ratio = 0.80)

To avoid problems arising from counts of 0, generate a flat table first and then run `geom_bar(stat="identity")` (see the section on flat tables higher up this page).

    mytab <- data.frame(ftable(mydata[, c("A","B")], row.vars = c("A","B")))
    ggplot(mytab, aes(x = A, y = Freq, fill = B)) + 
      geom_bar(stat = "identity", position = "dodge") + 
      labs(x = "A group", y = "Frequency") + 
      theme(aspect.ratio = 0.80)

### Histograms by group

A plot of multiple histograms is useful for comparing the frequency distribution of a numeric variable between groups. For best results, stack the plots above one another if possible and use the same minimum and maximum of `x`-values on the axis. The code below draws a panel of 3 histograms of `x`, on top of one another, one for each of three groups (`a1` - `a3`) of the categorical variable `A` in `mydata`.

A panel of plots is accomplished easily with `ggplot` or the `lattice` package (a brief introduction to the `lattice` package is provided at the bottom of this page). In `ggplot`, add `facet_wrap()` to place graphs for different groups on the same page.

    ggplot(mydata, aes(x = x)) + 
      geom_histogram(fill = "firebrick", col = "black", binwidth = 0.2, boundary = 0) + 
      labs(x = "The variable x", y = "Frequency") + 
      theme(aspect.ratio = 0.5) + 
      facet_wrap( ~ A, ncol = 1, scales = "free_y")

To accomplish the task in base R, begin by setting up a graphics window with the desired number of rows and columns (here, 3 and 1, respectively) using the `mfrow` option of `par()`. Here the `mar` option is used to adjust the size of the margin around each plot, and `cex` is used to reduce the font size of labels to make room. The following code loops through the three unique categories of `A` and draws a histogram of `x` for each group.

    par(mfrow = c(3,1), mar = c(4, 4, 2, 1), cex = 0.7)
    for (i in unique(mydata$A)) {
      dat <- subset(mydata, A == i)
      hist(dat$x, breaks = 20, right = FALSE, xlim = range(mydata$x, na.rm = TRUE), 
           col = "firebrick", main = i, xlab = "x variable", ylab = "Frequency")
    }

### Box plots - association between a categorical and a numeric variable

The following code generates a box plot for the numeric variable `yvar` separately for every group identified in the categorical variable `A`. The example uses the formula method of `boxplot`, written as "response_variable ~ explanatory_variable". Set `varwidth = FALSE` if you want all boxes to have the same width.

    boxplot(yvar ~ A, data = mydata, varwidth = TRUE, ylab = "y value", 
            col = "firebrick", cex.axis = 0.8)

The next command shows how to make a basic box plot with `ggplot`. Below it is a command with more options.

    ggplot(mydata, aes(x = A, y = yvar)) + 
      geom_boxplot()

    ggplot(mydata, aes(x = A, y = yvar)) + 
      geom_boxplot(fill = "firebrick", notch = FALSE, varwidth = TRUE) + 
      labs(x = "A group", y = "y value") + 
      theme(text = element_text(size = 15), 
            axis.text = element_text(size = 12), aspect.ratio = 0.80)

### Strip chart - association between a categorical and a numeric variable

A strip chart can be used instead of a box plot when the number of data points is not large. Random noise ("jitter") is used to reduce overlap of points. The first example is a basic plot, whereas the second adds common options.

    stripchart(y ~ A, vertical = TRUE, data = mydata, method = "jitter")
    stripchart(y ~ A, vertical = TRUE, data = mydata, method = "jitter", 
               jitter = 0.2, cex.axis = 0.8, pch = 1, col = "firebrick")

The first line of code below shows the basic strip chart in `ggplot`. The following example includes common options.

    ggplot(mydata, aes(A, y)) + 
      geom_jitter()

    ggplot(mydata, aes(A, y)) + 
      geom_jitter(color = "firebrick", size = 3, width = 0.15) + 
      labs(x = "A group", y = "y value") + 
      theme(aspect.ratio = 0.80, text = element_text(size = 12), 
            axis.text = element_text(size = 10))

You can add points or lines to a base R `stripchart` by using integers to indicate the position of categories along the x-axis. For example, to add means and standard errors of `y` to a strip chart for a numeric variable `y` and a categorical variable `A` having four categories, use the commands below. The offset of 0.2 places the means just to the right of the data points.

    stripchart(y ~ A, data = mydata, vertical = TRUE, method = "jitter", pch = 1)
    m <- tapply(mydata$y, mydata$A, mean, na.rm = TRUE)
    se <- tapply(mydata$y, mydata$A, function(y){
            sd(y, na.rm = TRUE)/sqrt(length(na.omit(y))) })
    points(m ~ c(1:4 + 0.2), pch = 16, col = "red")
    segments(x0 = c(1:4 + 0.2), y0 = m - se, x1 = c(1:4 + 0.2), y1 = m + se, col = "red")
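The standard error calculation can be wrapped in a small helper to make it easier to reuse and check. The function name `se_mean` and the toy data below are hypothetical, introduced only for illustration.

```r
# Standard error of the mean, ignoring missing values (hypothetical helper)
se_mean <- function(v) {
  v <- v[!is.na(v)]
  sd(v) / sqrt(length(v))
}

# Toy data with one missing value (invented for illustration)
y <- c(2, 4, 6, NA, 10, 10, 10)
A <- c("a", "a", "a", "a", "b", "b", "b")

m  <- tapply(y, A, mean, na.rm = TRUE)   # group means
se <- tapply(y, A, se_mean)              # group standard errors
cbind(mean = m, se = se)
```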

### Scatter plot - association between two numeric variables

Here's how to produce a scatter plot for two numeric variables, `x` and `y`.

    # formula method, base R
    plot(y ~ x, data = mydata, pch = 16, col = 2)

    # basic scatter plot in ggplot
    ggplot(mydata, aes(x, y)) + 
      geom_point()

    # scatter plot with options in ggplot
    ggplot(mydata, aes(x, y)) + 
      geom_point(size = 2, col = "firebrick") + 
      labs(x = "x variable", y = "y variable") + 
      theme(aspect.ratio = 0.80)

Add a smooth curve through the data to estimate the relationship between `y` and `x`. The `lowess` command uses locally weighted regression to accomplish this. "Local" means that `y` is predicted for each `x` using only data in the vicinity of that `x`, rather than all the data. The size of the vicinity is controlled by the option `f`, which is a proportion between 0 (very narrow vicinity) and 1 (uses all the data). Try different values of `f` to best capture the relationship. The default is `f = 2/3`.

    plot(y ~ x, data = mydata)
    x1 <- mydata$x[order(mydata$x)]
    y1 <- mydata$y[order(mydata$x)]
    lines(lowess(x1, y1, f = 0.5))
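To get a feel for the effect of `f`, the sketch below overlays two `lowess` curves on simulated data. The data-generating model and the two values of `f` are arbitrary assumptions chosen for illustration.

```r
# Simulated data (arbitrary assumptions): a curved relationship plus noise
set.seed(42)
x <- runif(100, min = 0, max = 10)
y <- sin(x) + rnorm(100, sd = 0.3)

plot(y ~ x)
lines(lowess(x, y, f = 2/3), col = "blue")  # default span: smoother curve
lines(lowess(x, y, f = 0.2), col = "red")   # narrower span: follows local structure

# lowess() returns a list with the fitted points, sorted by x
fit <- lowess(x, y, f = 0.2)
str(fit)
```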

Using `ggplot`, you also get standard errors of the predicted values.

    ggplot(mydata, aes(x, y)) + 
      geom_point(size = 2, col = "firebrick") + 
      labs(x = "x variable", y = "y variable") + 
      geom_smooth(method = loess, size = 1, col = "black") + 
      theme(aspect.ratio = 0.80)

To add the least squares regression line instead,

    plot(y ~ x, data = mydata)
    abline(lm(y ~ x, data = mydata))

    # in ggplot
    ggplot(mydata, aes(x, y)) + 
      geom_point(size = 2, col = "firebrick") + 
      labs(x = "x variable", y = "y variable") + 
      geom_smooth(method = lm, size = 1, col = "black") + 
      theme(aspect.ratio = 0.80)

Here are two methods to identify individual points on a scatter plot. The first redraws the scatter plot and then adds the data row number next to each point (you can add text from another variable instead). This can get noisy if there are a lot of data points.

    plot(y ~ x, data = mydata)
    text(mydata$x, mydata$y, labels = seq_along(mydata$x), pos = 1, cex = 0.5)

The second method uses the cursor to click those few points on the plot you want identified. This version prints the row number when you click a point. To print something else, set the `labels` option to a character variable that identifies the points.

    plot(y ~ x, data = mydata)
    identify(mydata$x, mydata$y, labels = seq_along(mydata$x))

### Scatter plots by group

One way to make a scatter plot for multiple groups is to superimpose them all on a single plot but vary the symbols according to group. In base R, use `pch` to vary the symbol, `col` to vary the color, or both to vary both at the same time. `x` and `y` are numeric variables, whereas `A` is a categorical variable identifying groups. If `A` is already a factor you can use just `A` instead of `factor(A)` in the commands below. The `legend` command adds a legend identifying the groups. Click on the plot (inside the plot region) with your cursor to place the legend.

    plot(y ~ x, data = mydata, pch = as.numeric(factor(mydata$A)), 
         col = as.numeric(factor(mydata$A)))
    legend(locator(1), legend = as.character(levels(factor(mydata$A))), 
           pch = 1:length(levels(factor(mydata$A))), 
           col = 1:length(levels(factor(mydata$A))))

To accomplish the same with `ggplot`, specify which variable to vary by color and by shape within `aes()`, which maps variables to visuals. The example below includes an optional `geom_smooth()` line that superimposes the least squares line for each group.

    ggplot(mydata, aes(x, y, colour = A, shape = A)) + 
      geom_point(size = 2) + 
      geom_smooth(method = lm, size = 1, se = FALSE) + 
      labs(x = "x variable", y = "y variable") + 
      theme(aspect.ratio = 0.80)

Another way to make a scatter plot for multiple groups is to plot them separately in panes of a panel of plots. Use `facet_wrap()` in `ggplot` for this purpose, as follows. The `lattice` package can also be used to make panels of plots, as described briefly at the bottom of the page.

    ggplot(mydata, aes(x, y)) + 
      facet_wrap(~A, ncol = 2) + 
      geom_point(col = "firebrick", size = 2) + 
      geom_smooth(method = lm, size = 1, se = FALSE, col = "black") + 
      labs(x = "x variable", y = "y variable") + 
      theme(aspect.ratio = 0.80)

### Line plot - display a sequence of measurements

Here's how to plot a sequence of (`x`, `y`) points and connect the dots with lines. This is especially useful when the `x`-variable represents a series of points in time or across a spatial gradient. In base R:

    plot(y ~ x, data = mydata, pch = 16)
    lines(y[order(x)] ~ x[order(x)], data = mydata)

    # Eliminate the points, and some options
    plot(y[order(x)] ~ x[order(x)], data = mydata, type = "l", lty = 3, 
         lwd = 2, col = "red")

The basic `ggplot` command is below. There's no need to order by `x`.

    ggplot(data = mydata, aes(x, y)) + 
      geom_line()

A line plot in `ggplot` works if the explanatory variable is numeric (`x`) or categorical (`A`), as shown in the two commands below (which also include further options).

    ggplot(data = mydata, aes(x, y)) + 
      geom_line(color = "red") + 
      geom_point() + 
      labs(x = "x variable", y = "y variable") + 
      theme(text = element_text(size = 15), 
            axis.text = element_text(size = 12), aspect.ratio = 0.75)

    # To use categorical variable A, include "group = 1" in aes()
    ggplot(data = mydata, aes(A, y, group = 1)) + 
      geom_line(color = "red") + 
      geom_point() + 
      labs(x = "A variable", y = "y variable") + 
      theme(text = element_text(size = 15), 
            axis.text = element_text(size = 12), aspect.ratio = 0.75)

## Graphs to display multiple variables

See also **Scatter plots by group** in the previous section and **Using the lattice package** below for other solutions.

The command `interaction.plot` is quick, but it does not show the data and so is less suitable for presentation. A better solution is to add lines to a `stripchart` instead.

### Interaction plots

Interaction plots are useful for displaying how the mean of a response variable `y` changes between the levels of two categorical variables, `A` and `B`. The graph is especially useful for determining whether an interaction is present between two factors `A` and `B` in a factorial experiment, or between a factor `A` and a blocking variable `B`. If the lines are parallel then there is no interaction.

    interaction.plot(mydata$A, mydata$B, mydata$y)

The levels of the variable listed first (here, `A`) will be displayed along the x-axis of the plot. The y-axis will then display the mean of `y` separately for each category of the second variable, `B`. Variations on this command include

    # Put B along x-axis instead
    interaction.plot(mydata$B, mydata$A, mydata$y)
    # median of y
    interaction.plot(mydata$A, mydata$B, mydata$y, fun = median)
    # color the lines
    interaction.plot(mydata$A, mydata$B, mydata$y, col = 1:length(unique(mydata$B)))
    # more room for A's labels
    interaction.plot(mydata$A, mydata$B, mydata$y, las = 2)
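The cell means that an interaction plot displays can also be tabulated directly with `tapply`, which helps when judging whether the lines are parallel. The toy factorial data below are invented for illustration.

```r
# Toy factorial data (invented for illustration)
A <- rep(c("a1", "a2"), each = 4)
B <- rep(c("b1", "b2"), times = 4)
y <- c(1, 3, 1, 3, 2, 6, 2, 6)

# Cell means: rows are levels of A, columns are levels of B
cell_means <- tapply(y, INDEX = list(A, B), FUN = mean)
cell_means

# The difference between b2 and b1 within each level of A;
# unequal differences mean the lines are not parallel
b_effect <- cell_means[, "b2"] - cell_means[, "b1"]
b_effect
```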

### Pairwise scatter plots for multiple variables

The following command creates a single graph with scatter plots between all pairs of numeric variables in a data frame, `mydata`. The option `gap` adjusts the spacing between separate plots.

    pairs(mydata, gap = 0.5)

Use the formula method to plot only the three numeric variables `x1`, `x2`, and `x3` in the data frame `mydata`.

    pairs(~ x1 + x2 + x3, data = mydata)

## Using the **lattice** package

The `lattice` package makes it easy to draw a panel of plots separately by group. The basic plot is simple, but commands to add points and lines to the individual panes can be tricky.

The `lattice` package is included with the basic installation of R, but you need to load it with `library(lattice)`. The graph types available in the `lattice` package include the standard ones found also in R's basic graphics package, such as box plots, histograms, and so on. The table below lists the most common types and the relevant commands.

For example, to draw a histogram of a numeric variable `x` separately for four groups identified by the variable `B` in the data frame `mydata`, use

    library(lattice)
    histogram(~ x | B, data = mydata, layout = c(1,4), right = FALSE)

The `layout` option is special to `lattice` and draws the 4 panels in a grid with 1 column and 4 rows, so that the histograms are stacked and most easily compared visually. The `right = FALSE` option has the same meaning here as for the base R `hist` command.

To draw a bar graph showing the frequency distribution of a categorical variable `A` separately for each group identified by the variable `B`,

    barchart( ~ table(A) | B, data = mydata)

This produces horizontal bar graphs, which leaves room for the category labels. To draw the bars vertically instead, while tilting the group labels on the x-axis by 45 degrees so that they fit,

    barchart(table(A) ~ names(table(A)) | B, data = mydata, 
             scales = list(x = list(rot = 45)))

As a third example, draw a **scatter plot** to show the relationship between the numeric variables `y` and `x` separately for each group in the variable `B`. The `pch` option in this example replaces the default plot symbol with a filled dot, and the `aspect` option sets the relative lengths of the vertical and horizontal axes.

    xyplot(y ~ x | B, data = mydata, pch = 16, aspect = 0.7)

It is possible to add plot elements to individual panels, but the commands and options take some getting used to. For example, to fit a separate regression line to each scatter plot, one to each group, use the `panel` argument in `xyplot` to construct a function that applies built-in panel functions to each group, as follows.

    xyplot(y ~ x | B, data = mydata, pch = 16, aspect = 0.7, 
           panel = function(x, y){   # Use x and y here, not real variable names
             panel.xyplot(x, y)      # draws the scatter plot
             panel.lmline(x, y)      # fits the regression line
           }
    )

This doesn't even begin to describe what's possible using the lattice package. Crawley has a few more examples of trellis graphics in *The R Book*. Sarkar (2008) gives a complete description. See the links to these books on the "Books" tab of the Biology 501 course page.

Table showing a few of the commonly used plotting commands in the `lattice` package. `x` and `y` are numeric variables, whereas `A` is a categorical variable (character or factor). `B` is a factor or character variable that will define the groups or subsets of the data frame to be plotted in separate panels. A separate plot in the graphics window will be made for each of the groups defined by the variable `B`.

Command | Graph type | Syntax (`~` is the tilde, not the minus sign) |
---|---|---|
`barchart` | bar graph | `barchart(~table(A) \| B, data=mydata)` |
`bwplot` | box plot | `bwplot(x ~ A \| B, data=mydata)` |
`histogram` | histogram | `histogram(~x \| B, data=mydata, right=FALSE)` |
`stripplot` | strip chart | `stripplot(x ~ A \| B, data=mydata, jitter=TRUE)` |
`xyplot` | scatter plot | `xyplot(y ~ x \| B, data=mydata)` |