# Graphs and tables

This page provides tips and recommendations for making graphs and tables in R.

In the examples below, `x` and `y` are numeric variables in the data frame `mydata`. `A` and `B` are categorical variables (factors or character variables) identifying different groups.

We include optional examples that make use of the add-on packages `dplyr` and `ggplot2`. Hadley Wickham’s `ggplot2` extends base R with its “grammar of graphics”. His book is the standard reference (Wickham, H. 2016. ggplot2: Elegant graphics for data analysis. 2nd edition), but plenty of introductory resources are available online.

Some will find the default theme of `ggplot2` visually noisy. You can easily change to something else by running the `theme_set()` command at the start of your session, as follows. Other simple themes include `theme_minimal()` and `theme_bw()`.

    library(dplyr)
    library(ggplot2)
    theme_set(theme_classic())

## Frequency tables

These commands generate tables of frequencies.

### Frequency table for one variable

This frequency table counts the number (frequency) of cases in each category of a categorical variable `A`. The `useNA` argument adds the category `NA` if one or more cases are missing.

    table(mydata$A, useNA = "ifany")
    # or
    with(mydata, table(A, useNA = "ifany"))

The `summarize` command in the `dplyr` package can generate frequency tables with its `n()` function.

    summarize(group_by(mydata, A), Frequency = n())
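To see what `useNA = "ifany"` changes, here is a minimal base R sketch. The toy data frame is hypothetical and stands in for your own `mydata`.

```r
# Hypothetical toy data: six cases, one with a missing group label
mydata <- data.frame(A = c("a", "b", "a", NA, "c", "a"))

# By default, table() drops the missing case from the tally
freq_default <- table(mydata$A)

# useNA = "ifany" adds an NA category because a missing value is present
freq_with_na <- table(mydata$A, useNA = "ifany")

print(freq_default)   # counts sum to 5
print(freq_with_na)   # counts sum to 6
```

Comparing the two totals is a quick way to check whether any cases are missing before you interpret the frequencies.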

### Frequency (contingency) table for two variables

The following command generates a frequency table for two categorical variables, `A` and `B`. The command can be extended to three or more variables.

    table(mydata$A, mydata$B, useNA = "ifany")

To include the row and column sums, use

    mytab <- table(mydata$A, mydata$B)
    addmargins(mytab, margin = c(1,2), FUN = sum, quiet = TRUE)

The same thing can be accomplished with `dplyr` together with `spread()` from the `tidyr` package, except that zero counts are given as `NA`.

    library(tidyr)   # provides spread()
    spread(summarize(group_by(mydata, A, B), n = n()), B, n)

### Flat frequency table

The following commands generate a "flat" frequency table for two categorical variables, `A` and `B`. In a flat table, `A` and `B` are separate columns of a table, and a third column tallies the frequencies of each combination. The table will show a count of 0 for category combinations not present in the data.

    mytab <- table(Aname = mydata$A, Bname = mydata$B)
    data.frame(mytab, stringsAsFactors = FALSE)
    # or
    data.frame(ftable(mydata[, c("A","B")], row.vars = c("A","B")))

The `summarize` method of `dplyr` will tally only the combinations of categories that have a frequency greater than 0. Hence, `ftable` is preferred. Alternatively, a fix is available in the `tidyr` package (you might need to install it first):

    library(tidyr)
    mytab <- summarize(group_by(mydata, A, B), freq = n())
    complete(ungroup(mytab), A, B, fill = list(freq = 0))
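The zero-count behaviour of the base R approach can be checked with a small sketch. The toy data are hypothetical, and `as.data.frame()` is used here as a close equivalent of the `data.frame(mytab)` call above.

```r
# Hypothetical toy data in which the combination (a2, b2) never occurs
mydata <- data.frame(
  A = c("a1", "a1", "a2", "a2", "a2"),
  B = c("b1", "b2", "b1", "b1", "b1")
)

# Flat table: one row per combination of A and B, including empty ones
flat <- as.data.frame(table(A = mydata$A, B = mydata$B),
                      stringsAsFactors = FALSE)
print(flat)

# The absent combination is listed with a frequency of 0
flat$Freq[flat$A == "a2" & flat$B == "b2"]
```

The result has four rows (2 x 2 combinations) even though only three combinations occur in the data.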

## Table summaries of descriptive statistics

The `tapply` command creates tables of descriptive statistics by group (e.g., mean, standard deviation, median, etc.). So does the `summarize` command of the `dplyr` package, as shown here.

### Descriptive statistics for one variable

Here is how to generate a table of group means for a variable `y`, where `A` is the categorical grouping variable.

    # base R; result is a vector
    tapply(mydata$y, INDEX = mydata$A, FUN = mean, na.rm = TRUE)
    # dplyr method; result is a data frame
    summarize(group_by(mydata, A), ybar = mean(y, na.rm = TRUE))

The argument `na.rm = TRUE` removes missing values (otherwise the mean returns `NA` if missing values are present). With `tapply`, pass optional arguments to `FUN` by including them immediately afterward.
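A minimal sketch of how the extra argument changes the result, using hypothetical toy data with one missing value in group `a`:

```r
# Hypothetical toy data: group "a" contains a missing value
mydata <- data.frame(
  A = c("a", "a", "a", "b", "b"),
  y = c(1, 2, NA, 4, 6)
)

# Without na.rm, the mean of any group containing NA is NA
means_na <- tapply(mydata$y, INDEX = mydata$A, FUN = mean)

# na.rm = TRUE comes after FUN and is passed on to mean()
means_ok <- tapply(mydata$y, INDEX = mydata$A, FUN = mean, na.rm = TRUE)

print(means_na)   # a is NA, b is 5
print(means_ok)   # a is 1.5, b is 5
```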

An advantage of the `summarize` method from `dplyr` is that you can calculate more than one descriptive statistic at once. Here we calculate the mean, standard deviation, and number of observations (including missing observations) for the variable `x` by group.

    summarize(group_by(mydata, A),
              xbar = mean(x, na.rm = TRUE),
              s = sd(x, na.rm = TRUE),
              n = n())

### More than one grouping variable

To calculate descriptive statistics (e.g., the median of `y`) with more than one grouping variable, use one of the following commands.

    # dplyr command, yields a data frame
    summarize(group_by(mydata, A, B), ymedian = median(y, na.rm = TRUE))
    # yields a data frame
    aggregate(mydata$y, by = list(A = mydata$A, B = mydata$B), FUN = median, na.rm = TRUE)
    # yields an r x c matrix
    tapply(mydata$y, INDEX = list(mydata$A, mydata$B), FUN = median, na.rm = TRUE)

### More than one response variable

The `dplyr` command `summarize` allows you to tabulate summaries for more than one variable. For example, if your data frame `mydata` contains a categorical variable named `A` that has multiple categories, you can obtain means and standard deviations by group for one (or more) numeric variables `y1`, `y2`, etc., as follows. The result can be saved as a new data frame (here, named `z`).

    z <- summarize(group_by(mydata, A),
                   mean1 = mean(y1, na.rm = TRUE),
                   mean2 = mean(y2, na.rm = TRUE),
                   mean3 = mean(y3, na.rm = TRUE),
                   sd1 = sd(y1, na.rm = TRUE),
                   sd2 = sd(y2, na.rm = TRUE),
                   sd3 = sd(y3, na.rm = TRUE))
    print(z)

## Drawing graphs in R

Here, the many types of graphs are grouped according to whether their purpose is to display frequencies of a single variable (e.g., histograms, bar graphs), or association between two (or more) variables and differences between groups (e.g., mosaic plots, box plots, strip charts, scatter plots).

### Command options for base R plots

Many, though not all, of the basic plotting commands in base R accept the same options to control axis limits and labels, add a title, change the plotting symbol, change the size of the plotting symbols and text, and change the line types. Here are some of the most frequently modified options. Use them inside the parentheses of a plotting command to have their effect. If you are not sure whether a given option works in your case, try it. The worst that could happen is that you get an error message, or R ignores you.

    main = "Eureka"      # add a title above the graph
    pch = 16             # set plot symbol to a filled circle
    col = "red"          # set the item color
    xlim = c(-10,10)     # set limits of the x-axis (horizontal axis)
    ylim = c(0,100)      # set limits of the y-axis (vertical axis)
    lty = 2              # set line type to dashed
    las = 2              # rotate axis labels to be perpendicular to axis
    cex = 1.5            # magnify the plotting symbols 1.5-fold
    cex.lab = 1.5        # magnify the axis labels 1.5-fold
    cex.axis = 1.3       # magnify the axis annotation 1.3-fold
    xlab = "Body size"   # label for the x-axis
    ylab = "Frequency"   # label for the y-axis

For details and the full list of plotting options in base R, get help as follows,

    ?par            # graphical parameters
    ?plot.default   # basic plot decorations

### Drawing graphs with ggplot2

Building a graph using `ggplot2` involves the combination of components or "layers", including data, "aesthetics" that map variables to visuals, and "geoms" that create different kinds of plots. A basic scatter plot of `yvar` against `xvar` has the three components as follows,

    ggplot(data = mydata, mapping = aes(x = xvar, y = yvar)) +
      geom_point()

where `geom_point()` is the specific geom for plotting points. Other graph components can be added, as demonstrated with examples below.

### Save graph as a pdf file

After drawing your plot, you can use the menu (File -> Save As) to save to a pdf file. Or, draw the graph on a pdf device to begin with:

    pdf(file = "mygraph.pdf")   # opens the pdf device for plotting
    # ... issue your R commands here to generate the plot ...
    dev.off()                   # closes the device when you are done
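As a concrete sketch of the device workflow, the following writes a small plot to a temporary file. The output location, the figure dimensions, and the plotted values are all illustrative assumptions.

```r
# Hypothetical output location; replace with your own file name
outfile <- file.path(tempdir(), "mygraph.pdf")

pdf(file = outfile, width = 6, height = 4)   # width and height are in inches
plot(1:10, (1:10)^2, pch = 16, xlab = "x", ylab = "x squared")
dev.off()                                    # nothing is complete until the device is closed

file.exists(outfile)                         # confirm the file was written
```

If the file looks empty or will not open, the most common cause is a missing `dev.off()`.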

## Graph the frequency distribution of a single variable

### Bar graph - frequency distribution for a categorical variable

In the following examples, `A` is a categorical variable identifying groups.

    # base R
    barplot(table(mydata$A), col = "firebrick", space = 0.2, cex.names = 1.2)

    # Using ggplot
    ggplot(mydata, aes(x = A)) +
      geom_bar(stat = "count")

    # ggplot with options:
    ggplot(mydata, aes(x = A)) +
      geom_bar(stat = "count", fill = "firebrick") +
      labs(x = "A group", y = "Frequency") +
      theme(text = element_text(size = 15),
            axis.text = element_text(size = 12),
            aspect.ratio = 0.8)

R will arrange the categories in alphabetical order by default. If you want a specific order, specify it in a `factor` command. For example, if the variable `A` has three groups "a", "b" and "c", fix a preferred order as follows and then rerun the plotting command.

    mydata$A <- factor(mydata$A, levels = c("c","a","b"))
    barplot(table(mydata$A), col = "firebrick")
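The effect of the `levels` argument on the order of a frequency table (and hence of the bars) can be verified with a short sketch. The vector below is hypothetical toy data.

```r
# Hypothetical toy data: a character vector of group labels
A <- c("a", "b", "c", "a", "c", "c")

# Default: categories are tabulated in alphabetical order
names(table(A))          # "a" "b" "c"

# Fixing the levels changes the order of the table, and therefore of the bars
A_fixed <- factor(A, levels = c("c", "a", "b"))
counts <- table(A_fixed)
print(counts)            # c = 3, a = 2, b = 1

barplot(counts, col = "firebrick")
```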

To plot the bars in order of decreasing frequency (a good idea for bar graphs):

    # Using base R
    barplot(sort(table(mydata$A), decreasing = TRUE), col = "firebrick", space = 0.2)

    # Using ggplot
    mydata$A_ordered <- factor(mydata$A,
                               levels = names(sort(table(mydata$A), decreasing = TRUE)))
    ggplot(mydata, aes(x = A_ordered)) +
      geom_bar(stat = "count", fill = "firebrick") +
      labs(x = "A group", y = "Frequency") +
      theme(text = element_text(size = 15),
            axis.text = element_text(size = 12),
            aspect.ratio = 0.8)

If the frequencies are already tabulated in a data frame named `mytab`, then modify as follows. `A` is a variable in `mytab` that lists each named category exactly once, and `Freq` is a variable containing the corresponding frequency of cases in each category.

    mytab <- arrange(mytab, desc(Freq))   # arrange() and desc() are from dplyr
    barplot(mytab$Freq, names.arg = mytab$A, col = "firebrick")

    # or
    mytab$A_ordered <- factor(mytab$A,
                              levels = mytab$A[order(mytab$Freq, decreasing = TRUE)])
    ggplot(mytab, aes(x = A_ordered, y = Freq)) +
      geom_bar(stat = "identity", fill = "firebrick") +
      labs(x = "A group", y = "Frequency") +
      theme(text = element_text(size = 15),
            axis.text = element_text(size = 12),
            aspect.ratio = 0.8)

### Histogram - frequency distribution for a numeric variable

The basic command is

    hist(mydata$x, col = "navy", right = FALSE)

The `col` argument sets the color of the bars. The `right = FALSE` option causes all the histogram intervals, or bins, except the last one to be left-closed and right-open. In other words, the value 1 appears in the interval 1-2 rather than in the interval 0-1, which seems to be the usual convention. The exception is the right-most bin: if 1 is its upper limit, R puts the value in the bin 0-1 (use the `include.lowest` option to control that).
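The bin-boundary rule is easiest to see by counting rather than plotting. This sketch uses hypothetical values that sit exactly on the break points, with `plot = FALSE` so that `hist` returns the counts without drawing anything.

```r
# Hypothetical data with values lying exactly on the break points 1, 2 and 3
x <- c(0.5, 1, 1.5, 2, 3)
breaks <- 0:3

# right = FALSE: bins are [0,1), [1,2), [2,3]; the last bin also includes 3
left_closed <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)$counts

# right = TRUE (the default): bins are [0,1], (1,2], (2,3]
right_closed <- hist(x, breaks = breaks, right = TRUE, plot = FALSE)$counts

left_closed    # 1 2 2
right_closed   # 2 2 1
```

The values 1 and 2 move one bin to the right when `right = FALSE`, whereas the value 3 stays in the last bin under both settings.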

Use the `breaks` option to influence the width and number of histogram bins. To set the approximate number of bins to 20, use

    hist(mydata$x, breaks = 20, right = FALSE)

For finer control of bin number and location, specify the breakpoints. For example, the following command creates a series of bins 1 unit wide between the limits 0 and 6 (make sure all the data fall between these limits or R will complain),

    hist(mydata$x, breaks = seq(from = 0, to = 6, by = 1), right = FALSE)

Notice that a value of exactly 6, the upper limit of `breaks`, will appear in the interval 5-6. This behavior is restricted to the rightmost bin. To prevent this from happening, increase the upper limit of the breaks by 1:

    hist(mydata$x, breaks = seq(from = 0, to = 7, by = 1), right = FALSE)

In `ggplot`, the barest histogram for a numeric variable `x` in `mydata` requires only

    ggplot(mydata, aes(x)) + geom_histogram()

The example below improves the graph with a number of options. To see the impact of each option, leave it out and rerun.

    ggplot(mydata, aes(x)) +
      geom_histogram(fill = "firebrick", col = "black", binwidth = 0.2,
                     boundary = 0, closed = "left") +
      labs(x = "The variable x", y = "Frequency") +
      theme(aspect.ratio = 0.80)

To display probability density instead of raw frequencies,

    hist(mydata$x, prob = TRUE, right = FALSE)

    # in ggplot (after_stat() requires ggplot2 3.3.0 or later;
    # older versions use y = ..density.. instead)
    ggplot(mydata, aes(x)) +
      geom_histogram(aes(y = after_stat(density)), closed = "left")

To superimpose a normal density curve on a histogram, try the following commands. First, `seq` makes 101 evenly spaced points along the `x` axis between the smallest and largest data values. Then `dnorm` generates the normal density at each `x` point, using the mean and standard deviation of the data.

    hist(mydata$x, prob = TRUE, right = FALSE)
    m <- mean(mydata$x, na.rm = TRUE)
    s <- sd(mydata$x, na.rm = TRUE)
    xpts <- seq(from = min(mydata$x, na.rm = TRUE),
                to = max(mydata$x, na.rm = TRUE), length.out = 101)
    lines(dnorm(xpts, mean = m, sd = s) ~ xpts, col = "red", lwd = 2)

In `ggplot`, add a `stat_function()` to get the same result,

    ggplot(mydata, aes(x)) +
      geom_histogram(aes(y = after_stat(density)), closed = "left") +
      stat_function(fun = dnorm,
                    args = list(mean = mean(mydata$x, na.rm = TRUE),
                                sd = sd(mydata$x, na.rm = TRUE)))

### Normal quantile plot - compare data to the normal distribution

`x` is the numeric variable whose distribution is being compared with the normal.

    qqnorm(mydata$x)
    qqline(mydata$x)   # adds line through first and third quartiles

The same can be accomplished in `ggplot` as follows.

    ggplot(mydata, aes(sample = x)) +
      geom_qq() +
      geom_qq_line()

## Graphs to display association between two variables

The appropriate graph depends on which variables are numeric or categorical.

### Mosaic plot - association between two categorical variables

`A` and `B` can be factors or character variables. `color = TRUE` yields shades of grey, or a vector of colors can be used instead (shown below for 3 categories). A quick way to get a set of different colors is to use `color = rainbow(n)`, where `n` is the number of categories of the variable `B`. Other options in the examples below alter the orientation (`las`) and size (`cex.axis`) of the labels.

    mosaicplot(table(mydata$A, mydata$B), color = c("red","blue","yellow"),
               las = 2, cex.axis = 0.8)
    mosaicplot(table(mydata$A, mydata$B), color = rainbow(3),
               las = 2, cex.axis = 0.8)

To draw a mosaic plot using `ggplot`, first install the `ggmosaic` package. The example below loads the package, assuming it is installed.

***This command was broken the last time I attempted it (2018/09/12), using R version 3.5.1 and the most recent versions of `ggmosaic` and `ggplot2`.***

    library(ggmosaic)
    ggplot(mydata, aes(x = product(A, B), fill = factor(A))) +
      geom_mosaic() +
      labs(x = "B group") +
      theme(aspect.ratio = 1,
            axis.text.x = element_text(angle = -25, hjust = .1, size = 12)) +
      guides(fill = guide_legend(title = "A group", reverse = TRUE))

### Grouped bar graph - association between two categorical variables

`A` and `B` can be factors or character variables in a data frame `mydata`. The first line of code below is the bare minimum, whereas the second adds a few useful options.

    barplot(table(mydata$A, mydata$B), beside = TRUE)
    barplot(table(mydata$A, mydata$B), beside = TRUE, las = 1, col = rainbow(4),
            cex.names = 0.8, space = c(0.2,0.8), xlab = "B group",
            ylab = "Frequency", legend.text = TRUE)

In `ggplot`, use `geom_bar(stat="count")` with the argument `position = position_dodge2(preserve="single")` instead of `position="dodge"`, which doesn't handle category combinations of `A` and `B` having a count of 0. (`position_dodge2` leaves a small space between bars for `B` within groups of `A`, whereas `position_dodge` leaves no gap.)

    ggplot(mydata, aes(x = A, fill = B)) +
      geom_bar(stat = "count", position = position_dodge2(preserve = "single")) +
      labs(x = "A group", y = "Frequency") +
      theme(aspect.ratio = 0.80)

### Multiple histograms by group

A plot of multiple histograms is useful for comparing the frequency distribution of a numeric variable between groups. For best results, stack the plots above one another if possible, and use the same minimum and maximum of `x`-values on the axis. The code below draws a panel of 3 histograms of `x`, on top of one another, one for each of three groups (`a1` to `a3`) of the categorical variable `A` in `mydata`.

A panel of plots is accomplished easily with `ggplot` or the `lattice` package (a brief introduction to the lattice package is provided at the bottom of this page). In `ggplot`, add `facet_wrap()` to place graphs for different groups on the same page.

    ggplot(mydata, aes(x = x)) +
      geom_histogram(fill = "firebrick", col = "black", binwidth = 0.2,
                     boundary = 0, closed = "left") +
      labs(x = "The variable x", y = "Frequency") +
      theme(aspect.ratio = 0.5) +
      facet_wrap(~ A, ncol = 1, scales = "free_y")

To accomplish the task in base R, begin by setting up a graphics window with the desired number of rows and columns (here, 3 and 1, respectively) using the `mfrow` option of `par()`. Here the `mar` option is used to adjust the size of the margin around each plot, and `cex` is used to reduce the font size of labels to make room. The following code loops through the three unique categories of `A` and draws a histogram of `x` for each group.

    par(mfrow = c(3,1), mar = c(4, 4, 2, 1), cex = 0.7)
    for (i in unique(mydata$A)) {
      dat <- subset(mydata, mydata$A == i)
      hist(dat$x, breaks = 20, right = FALSE,
           xlim = range(mydata$x, na.rm = TRUE),   # same x-axis limits in every panel
           col = "firebrick", main = i, xlab = "x variable", ylab = "Frequency")
    }

### Strip chart - association between a categorical and a numeric variable

A strip chart (a.k.a. dot plot) can be used instead of a box plot when the number of data points is not large. Random noise ("jitter") is used to reduce overlap of points. The first example is a basic plot, whereas the second adds common options.

    # base R
    stripchart(y ~ A, vertical = TRUE, data = mydata, method = "jitter")
    stripchart(y ~ A, vertical = TRUE, data = mydata, method = "jitter",
               jitter = 0.2, cex.axis = 0.8, pch = 1, col = "firebrick")

The first line of code below shows the basic strip chart in `ggplot`. The second example includes common options. The third example adds horizontal line segments at the group means (which is harder in `ggplot` than it ought to be).

    ggplot(mydata, aes(A, y)) +
      geom_jitter()

    ggplot(mydata, aes(A, y)) +
      geom_jitter(color = "firebrick", size = 3, width = 0.15) +
      labs(x = "A group", y = "y value") +
      theme(aspect.ratio = 0.80, text = element_text(size = 12),
            axis.text = element_text(size = 10))

    # To add horizontal line segments at means, first define a function
    # that calculates the means
    groupMeans <- function(x) {
      m <- mean(x, na.rm = TRUE)
      c(y = m, ymin = m, ymax = m)
    }
    ggplot(mydata, aes(A, y)) +
      geom_jitter(color = "firebrick", size = 3, width = 0.15) +
      stat_summary(fun.data = "groupMeans", geom = "errorbar",
                   colour = "black", width = 0.5, size = 1) +
      labs(x = "A group", y = "y value") +
      theme(aspect.ratio = 0.80, text = element_text(size = 12),
            axis.text = element_text(size = 10))

You can add points or lines to a base R `stripchart` by using integers to indicate the positions of categories along the x-axis. For example, to add means and standard errors of `y` to a strip chart for a numeric variable `y` and a categorical variable `A` having four categories, use

    stripchart(y ~ A, data = mydata, vertical = TRUE, method = "jitter", pch = 1)
    m <- tapply(mydata$y, mydata$A, mean, na.rm = TRUE)
    se <- tapply(mydata$y, mydata$A, function(y){
      sd(y, na.rm = TRUE)/sqrt(length(na.omit(y)))
    })
    # offset the means by 0.2 so they sit just to the right of the points
    points(m ~ c(1:4 + 0.2), pch = 16, col = "red")
    segments(x0 = c(1:4 + 0.2), y0 = m - se, x1 = c(1:4 + 0.2), y1 = m + se,
             col = "red")
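The same idea can be written so that it does not depend on the number of categories. The sketch below is one possible variant, not the only way to do it: the helper name `std_err` and the toy data are hypothetical, and `seq_along(m)` replaces the hard-coded `1:4`.

```r
# Hypothetical helper: standard error of the mean, ignoring missing values
std_err <- function(y) {
  y <- y[!is.na(y)]
  sd(y) / sqrt(length(y))
}

# Hypothetical toy data: two groups, one missing value in group a1
mydata <- data.frame(
  A = rep(c("a1", "a2"), each = 4),
  y = c(1, 2, 3, NA, 2, 4, 6, 8)
)

m  <- tapply(mydata$y, mydata$A, mean, na.rm = TRUE)
se <- tapply(mydata$y, mydata$A, std_err)

# Category positions are 1, 2, ...; shift right by 0.2 to avoid the points
x_pos <- seq_along(m) + 0.2

stripchart(y ~ A, data = mydata, vertical = TRUE, method = "jitter", pch = 1)
points(x_pos, m, pch = 16, col = "red")
segments(x0 = x_pos, y0 = m - se, x1 = x_pos, y1 = m + se, col = "red")
```

Because the positions are derived from `m`, the code keeps working if a category is added or dropped.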

### Strip chart with lines for paired data

Paired data should be displayed accordingly. The following commands create a strip chart with lines connecting the two measurements of the same unit in the two treatments. The data frame `mydata` includes the response variable `y`, the treatment variable `A`, and an `id` variable indicating the identity of individuals measured under both treatments.

The following works in base R.

    interaction.plot(response = mydata$y, x.factor = mydata$A,
                     trace.factor = mydata$id, legend = FALSE, lty = 1,
                     xlab = "Treatment", ylab = "Y variable", type = "b",
                     pch = 16, las = 1, cex = 1.5, cex.lab = 1.5, cex.axis = 1.3)

Try the following in `ggplot`.

    ggplot(mydata, aes(y = y, x = A)) +
      geom_point(size = 5, col = "firebrick", alpha = 0.5) +
      geom_line(aes(group = id)) +
      labs(x = "Treatment", y = "Y variable") +
      theme(text = element_text(size = 18),
            axis.text = element_text(size = 16),
            aspect.ratio = 0.80)

### Box plot - association between a categorical and a numeric variable

The following code generates a box plot for the numeric variable `yvar` separately for every group identified in the categorical variable `A`. It uses the formula method of `boxplot`, written as "response_variable ~ explanatory_variable". Set `varwidth = FALSE` if you want all boxes to have the same width.

    boxplot(yvar ~ A, data = mydata, varwidth = TRUE, ylab = "y value",
            col = "firebrick", cex.axis = 0.8)

The next command shows how to make a basic box plot with `ggplot`. Below it is a command with more options.

    ggplot(mydata, aes(x = A, y = yvar)) +
      geom_boxplot()

    ggplot(mydata, aes(x = A, y = yvar)) +
      geom_boxplot(fill = "firebrick", notch = FALSE, varwidth = TRUE) +
      labs(x = "A group", y = "y value") +
      theme(text = element_text(size = 15),
            axis.text = element_text(size = 12),
            aspect.ratio = 0.80)

### Violin plot - association between a categorical and a numeric variable

A violin plot shows the frequency distribution of a numeric variable (and its mirror image) for several groups. The frequency distribution is a kernel density estimate, which "smooths" the distribution. The following code generates a violin plot for the numeric variable `yvar` separately for every group identified in the categorical variable `A`. This is easiest to accomplish in `ggplot`. The `stat_summary` layer adds a dot for the mean of each group.

    # the fun argument requires ggplot2 3.3.0 or later (older versions use fun.y)
    ggplot(mydata, aes(x = A, y = yvar)) +
      geom_violin(fill = "firebrick") +
      stat_summary(fun = mean, geom = "point", color = "black") +
      labs(x = "A group", y = "y value") +
      theme(text = element_text(size = 15),
            axis.text = element_text(size = 12),
            aspect.ratio = 0.80)

The job is possible without `ggplot`, but you'll need to install the `vioplot` package first using `install.packages()`, and then load it. Here, `yvar` is the variable plotted separately for groups of the variable `A`. In this example, `A` has 4 groups named `a1`, `a2`, `a3`, and `a4`. The amount of "smoothing" for the kernel density estimates is controlled by the `h` option.

    library(vioplot)
    vioplot(mydata$yvar[mydata$A == "a1"], mydata$yvar[mydata$A == "a2"],
            mydata$yvar[mydata$A == "a3"], mydata$yvar[mydata$A == "a4"],
            col = "#FFB531", drawRect = FALSE,
            names = c("a1","a2","a3","a4"), h = 0.5)
    mtext("y value", side = 2, line = 2, las = 3)
    mtext("A group", side = 1, line = 3)

### Scatter plot - association between two numeric variables

Here's how to produce a scatter plot for two numeric variables, `x` and `y`.

    # formula method, base R
    plot(y ~ x, data = mydata, pch = 16, col = 2)

    # basic scatter plot in ggplot
    ggplot(mydata, aes(x, y)) +
      geom_point()

    # scatter plot with options in ggplot
    ggplot(mydata, aes(x, y)) +
      geom_point(size = 2, col = "firebrick", alpha = 0.5) +
      labs(x = "x variable", y = "y variable") +
      theme(aspect.ratio = 0.80)

You can probably guess the intent of most of the `ggplot` options except `alpha`, which makes the dots partly transparent.

To add a smooth curve through the data, use locally weighted regression. "Local" here means that `y` is predicted for each `x` using only data in the vicinity of that `x`, rather than all the data.

In base R you can control the size of the vicinity using the option `f`, which is a proportion between 0 (very narrow vicinity) and 1 (uses all the data). Try different values of `f` to best capture the relationship. The default is `f = 2/3`.

    plot(y ~ x, data = mydata)
    x1 <- mydata$x[order(mydata$x)]
    y1 <- mydata$y[order(mydata$x)]
    lines(lowess(x1, y1, f = 0.5))
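To get a feel for the `f` option, it can help to overlay two fits on the same plot. The sketch below uses simulated data (a noisy sine curve, chosen purely for illustration) and compares a narrow vicinity with the widest one.

```r
# Simulated data (hypothetical): a curved relationship with added noise
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- sin(x) + rnorm(50, sd = 0.3)

# Small f: each fitted value uses only nearby points, so the curve bends freely
fit_narrow <- lowess(x, y, f = 0.2)

# f = 1: every fitted value uses all the data, so the curve is much stiffer
fit_wide <- lowess(x, y, f = 1)

plot(y ~ x, pch = 16, col = "grey40")
lines(fit_narrow, col = "red", lwd = 2)
lines(fit_wide, col = "blue", lwd = 2, lty = 2)
```

`lowess` returns a list with components `x` and `y`, with `x` in increasing order, which is why the result can be passed straight to `lines`.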

Using `ggplot`, you also get standard errors of the predicted values (set `se = FALSE` if not desired).

    ggplot(mydata, aes(x, y)) +
      geom_point(size = 2, col = "firebrick") +
      labs(x = "x variable", y = "y variable") +
      geom_smooth(method = "loess", size = 1, col = "black", se = TRUE) +
      theme(aspect.ratio = 0.80)

To add the least squares regression line to a plot, use the following.

    plot(y ~ x, data = mydata)
    abline(lm(y ~ x, data = mydata))

    # in ggplot
    ggplot(mydata, aes(x, y)) +
      geom_point(size = 2, col = "firebrick") +
      labs(x = "x variable", y = "y variable") +
      geom_smooth(method = "lm", size = 1, col = "black") +
      theme(aspect.ratio = 0.80)

In base R you can use the cursor to click data points to identify individuals. The following code prints the row number when you click a point (change that by setting the `labels` option to a character variable that labels the point instead).

    plot(y ~ x, data = mydata)
    identify(mydata$x, mydata$y, labels = seq_along(mydata$x))

### Line plot - display a sequence of measurements

Here's how to plot a sequence of (`x`, `y`) points and connect the dots with lines. This is especially useful when the `x`-variable represents a series of points in time or across a spatial gradient. In base R:

    plot(y ~ x, data = mydata, pch = 16)
    lines(y[order(x)] ~ x[order(x)], data = mydata)

    # Eliminate the points, and some options
    plot(y[order(x)] ~ x[order(x)], data = mydata, type = "l", lty = 3,
         lwd = 2, col = "red")

The basic `ggplot` command is below. There's no need to order by `x`.

    ggplot(data = mydata, aes(x, y)) +
      geom_line()

A line plot in `ggplot` works if the explanatory variable is numeric (`x`) or categorical (`A`), as shown in the two commands below (which also include further options).

    ggplot(data = mydata, aes(x, y)) +
      geom_line(color = "red") +
      geom_point() +
      labs(x = "x variable", y = "y variable") +
      theme(text = element_text(size = 15),
            axis.text = element_text(size = 12),
            aspect.ratio = 0.75)

    # To use categorical variable A, include "group = 1" in aes()
    ggplot(data = mydata, aes(A, y, group = 1)) +
      geom_line(color = "red") +
      geom_point() +
      labs(x = "A variable", y = "y variable") +
      theme(text = element_text(size = 15),
            axis.text = element_text(size = 12),
            aspect.ratio = 0.75)

## Graphs to display more variables

### Scatter plots by group

One way to make a scatter plot for multiple groups is to superimpose them all on a single plot but vary the symbols according to group. In base R, use `pch` to vary the symbol, `col` to vary the color, and both to vary both at the same time. `x` and `y` are numeric variables, whereas `A` is a categorical variable identifying groups. If `A` is already a factor you can use just `A` instead of `factor(A)` in the commands below. The `legend` command adds a legend identifying the groups; click on the plot (inside the plot region) with your cursor to place the legend.

    plot(y ~ x, data = mydata, pch = as.numeric(factor(mydata$A)),
         col = as.numeric(factor(mydata$A)))
    legend(locator(1), legend = as.character(levels(factor(mydata$A))),
           pch = 1:length(levels(factor(mydata$A))),
           col = 1:length(levels(factor(mydata$A))))

To accomplish the same with `ggplot`, specify which variable to vary by color and by shape within `aes()`, which maps variables to visuals. The example below includes an optional `geom_smooth()` layer that superimposes the least squares line for each group.

    ggplot(mydata, aes(x, y, colour = A, shape = A)) +
      geom_point(size = 2) +
      geom_smooth(method = "lm", size = 1, se = FALSE) +
      labs(x = "x variable", y = "y variable") +
      theme(aspect.ratio = 0.80)

Another way to make a scatter plot for multiple groups is to plot them separately in panes of a panel of plots. Use `facet_wrap()` in `ggplot` for this purpose, as follows. The `lattice` package can also be used to make panels of plots, as described briefly at the bottom of the page.

    ggplot(mydata, aes(x, y)) +
      facet_wrap(~A, ncol = 2) +
      geom_point(col = "firebrick", size = 2) +
      geom_smooth(method = "lm", size = 1, se = FALSE, col = "black") +
      labs(x = "x variable", y = "y variable") +
      theme(aspect.ratio = 0.80)

### Box plots by group

A grouped box plot displays a numeric response variable `y` for different levels of two categorical variables, `A` and `B`. This can be accomplished in base R by overlaying multiple box plots, but it is much easier to produce the plot in `ggplot`.

    ggplot(data = mydata, aes(y = y, x = A, fill = B)) +
      geom_boxplot(width = 0.6, position = position_dodge(width = 0.7)) +
      labs(x = "A variable", y = "y variable") +
      theme(aspect.ratio = 0.75, text = element_text(size = 16),
            axis.text = element_text(size = 14))

`width` controls the width of each box, whereas `position_dodge(width = )` controls the horizontal distance between the adjacent boxes depicting different levels of the `B` variable within each group defined by the `A` variable.

### Strip charts by group

A grouped strip chart displays a numeric response variable `y` for different levels of two categorical variables, `A` and `B`. This can be accomplished in base R by overlaying multiple strip charts, but it is much easier to produce the plot in `ggplot`.

    ggplot(data = mydata, aes(y = y, x = A, fill = B, color = B)) +
      geom_jitter(size = 3, position = position_dodge(width = 0.7)) +
      labs(x = "A variable", y = "y variable") +
      theme(aspect.ratio = 0.75, text = element_text(size = 16),
            axis.text = element_text(size = 14))

`position_dodge(width = )` controls the horizontal distance between the adjacent strips depicting different levels of the `B` variable within each group defined by the `A` variable.

### Interaction plots

The command `interaction.plot` is quick but rudimentary, as it fails to show the data.

Interaction plots display how the mean of a numeric response variable `y` changes between the levels of two categorical variables, `A` and `B`. The graph is especially useful for determining whether an interaction is present between two factors `A` and `B` in a factorial experiment, or between a factor `A` and a blocking variable `B`. If the lines are parallel, then there is no interaction.

    with(mydata, interaction.plot(A, B, y))

The levels of the variable listed first (here, `A`) will be displayed along the x-axis of the plot. The y-axis will then display the mean of `y` separately for each category of the second variable, `B`. Variations on this command include

    # Put B along the x-axis instead
    interaction.plot(mydata$B, mydata$A, mydata$y)
    # median of y
    interaction.plot(mydata$A, mydata$B, mydata$y, fun = median)
    # color the lines
    interaction.plot(mydata$A, mydata$B, mydata$y, col = 1:length(unique(mydata$B)))
    # more room for A's labels
    interaction.plot(mydata$A, mydata$B, mydata$y, las = 2)
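The "parallel lines" criterion can be checked numerically from the table of cell means that the plot is built on. The sketch below uses hypothetical, noise-free data constructed with purely additive effects, so the lines are exactly parallel. Real data will show some departure from parallelism through sampling error alone.

```r
# Hypothetical balanced design: 3 levels of A, 2 levels of B, 2 replicates
mydata <- expand.grid(A = c("a1", "a2", "a3"), B = c("b1", "b2"), rep = 1:2)

# Purely additive effects (assumed for illustration): no A x B interaction
effA <- c(a1 = 0, a2 = 1, a3 = 3)
effB <- c(b1 = 0, b2 = 2)
mydata$y <- unname(effA[as.character(mydata$A)] + effB[as.character(mydata$B)])

# Cell means: rows are levels of A, columns are levels of B
cell_means <- tapply(mydata$y, list(mydata$A, mydata$B), mean)
print(cell_means)

# With no interaction, the gap between the B lines is the same at every level of A
gap <- cell_means[, "b2"] - cell_means[, "b1"]
print(gap)

with(mydata, interaction.plot(A, B, y))
```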

### Pairwise scatter plots for multiple variables

The following command creates a single graph with scatter plots between all pairs of numeric variables in a data frame, `mydata`. The option `gap` adjusts the spacing between separate plots.

    pairs(mydata, gap = 0.5)

Use the formula method to plot only the three numeric variables `x1`, `x2`, and `x3` in the data frame `mydata`.

    pairs(~ x1 + x2 + x3, data = mydata)
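If the data frame also contains categorical columns, one option is to select the numeric columns first rather than list them by name. The sketch below uses the built-in `iris` data set as a stand-in for `mydata`. Treat it as one possible approach.

```r
# Identify the numeric columns (iris has four, plus the factor Species)
num_cols <- sapply(iris, is.numeric)
names(iris)[num_cols]

# Pairwise scatter plots of the numeric variables only
pairs(iris[, num_cols], gap = 0.5)
```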

## Using the **lattice** package

The `lattice` package makes it easy to draw a panel of plots separately by group. The basic plot is simple, but commands to add points and lines to the individual panes can be tricky.

The `lattice` package is included with the basic installation, but you need to load the library. The graph types available in the lattice package include the standard ones found also in R's basic graphics package, such as box plots, histograms, and so on. The table below lists the most common types and the relevant commands.

For example, to draw a histogram of a numeric variable `x` separately for four groups identified by the variable `B` in the data frame `mydata`, use

    library(lattice)
    histogram(~ x | B, data = mydata, layout = c(1,4), right = FALSE)

The `layout` option is special to lattice and draws the 4 panels in a grid with 1 column and 4 rows, so that the histograms are stacked and most easily compared visually. The `right=FALSE` option has the same meaning here as for the base R `hist` command.

To draw a bar graph showing the frequency distribution of a categorical variable `A` separately for each group identified by the variable `B`,

    barchart(~ table(A) | B, data = mydata)

This produces horizontal bar graphs, which leaves room for the category labels. To draw the bars vertically instead, while tilting the group labels on the x-axis by 45 degrees so that they fit,

    barchart(table(A) ~ names(table(A)) | B, data = mydata,
             scales = list(x = list(rot = 45)))

As a third example, draw a **scatter plot** to show the relationship between the numeric variables `x` and `y` separately for each group in the variable `B`. The `pch` option in this example replaces the default plot symbol with a filled dot, and the `aspect` option sets the relative lengths of the vertical and horizontal axes.

    xyplot(y ~ x | B, data = mydata, pch = 16, aspect = 0.7)

It is possible to add plot elements to individual panels, but the commands and options take some getting used to. For example, to fit a separate regression line to each scatter plot, one to each group, use the `panel` argument in `xyplot` to construct a function that applies built-in panel functions to each group, as follows.

    xyplot(y ~ x | B, data = mydata, pch = 16, aspect = 0.7,
           panel = function(x, y){   # Use x and y here, not real variable names
             panel.xyplot(x, y)      # draws the scatter plot
             panel.lmline(x, y)      # fits the regression line
           }
    )

This doesn't even begin to describe what's possible using the lattice package. Crawley has a few more examples of trellis graphics in *The R Book*. Sarkar (2008) gives a complete description. See the links to these books on the "Books" tab of the Biology 501 course page.

Table showing a few of the commonly used plotting commands in the lattice package. `x` and `y` are numeric variables, whereas `A` is a categorical variable (character or factor). `B` is a factor or character variable that defines the groups or subsets of the data frame to be plotted in separate panels. A separate plot in the graphics window will be made for each of the groups defined by the variable `B`.

Command | Graph type | Syntax (`~` is the tilde, not the minus sign) |
---|---|---|
`barchart` | bar graph | `barchart(~table(A) \| B, data=mydata)` |
`bwplot` | box plot | `bwplot(x ~ A \| B, data=mydata)` |
`histogram` | histogram | `histogram(~x \| B, data=mydata, right=FALSE)` |
`stripplot` | strip chart | `stripplot(x ~ A \| B, data=mydata, jitter=TRUE)` |
`xyplot` | scatter plot | `xyplot(y ~ x \| B, data=mydata)` |