Display

Graphs and tables

This page provides tips and recommendations for making graphs and tables in R.

In the examples below, x and y are numeric variables in the data frame, mydata. A and B are categorical variables (factors or character variables) identifying different groups.
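
If you want something to run the commands on, the following is a minimal sketch of a made-up data frame with the same variable names. The group labels, sample size, and distributions are assumptions chosen only for illustration; substitute your own data.

```r
# Hypothetical example data; only the names mydata, A, B, x and y come from this page
set.seed(1)                       # make the random numbers reproducible
n <- 60
mydata <- data.frame(
  A = sample(c("a1", "a2", "a3"), n, replace = TRUE),    # categorical (character)
  B = factor(sample(c("b1", "b2"), n, replace = TRUE)),  # categorical (factor)
  x = rnorm(n, mean = 10, sd = 2),                       # numeric
  y = rnorm(n, mean = 50, sd = 5),                       # numeric
  stringsAsFactors = FALSE                               # keep A as character
)
mydata$x[c(3, 17)] <- NA          # add a couple of missing values
str(mydata)
```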

We include optional examples that make use of the add-on packages dplyr and ggplot2. Hadley Wickham’s ggplot2 extends base R with its “grammar of graphics”. His book is the standard reference (Wickham, H. 2016. ggplot2: Elegant graphics for data analysis. 2nd edition) but plenty of introductory resources are available online (e.g., this one).

Some will find ggplot2's default theme to be visually noisy. You can easily change to something else by running the theme_set() command at the start of your session, as follows. Other simpler themes include theme_minimal() and theme_bw().

library(dplyr)
library(ggplot2)
theme_set(theme_classic())

Frequency tables

These commands generate tables of frequencies.

Frequency table for one variable

This frequency table counts the number (frequency) of cases in each category of a categorical variable A. The useNA argument adds the category NA if one or more cases are missing.

table(mydata$A, useNA = "ifany")
# or
with(mydata, table(A, useNA = "ifany"))

The summarize command in the dplyr package can generate frequency tables with its n() function.

summarize(group_by(mydata, A), Frequency = n())

Frequency (contingency) table for two variables

The following commands generate a frequency table for two categorical variables, A and B. The command can be extended to three or more variables.

table(mydata$A, mydata$B, useNA = "ifany")
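
As a sketch of the extension to three variables, the example below uses a small made-up data frame, toy, with a hypothetical third categorical variable C. The result is a three-dimensional array of counts, which ftable() prints more compactly.

```r
# Toy data; the variable C and all values are made up for illustration
toy <- data.frame(
  A = c("a1", "a1", "a2", "a2", "a2", "a1"),
  B = c("b1", "b2", "b1", "b1", "b2", "b1"),
  C = c("c1", "c1", "c1", "c2", "c2", "c2")
)
tab3 <- table(toy$A, toy$B, toy$C, dnn = c("A", "B", "C"))  # 2 x 2 x 2 array of counts
tab3          # prints one A x B table for each category of C
ftable(tab3)  # same counts, laid out as a single two-dimensional table
```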

To include the row and column sums, use

mytab <- table(mydata$A, mydata$B)
addmargins(mytab, margin = c(1,2), FUN = sum, quiet = TRUE)

The same thing can be accomplished with dplyr together with the spread command from the tidyr package, except zero counts are given as NA.

library(tidyr)
spread( summarize(group_by(mydata, A, B), n = n()), B, n )

Flat frequency table

The following commands generate a "flat" frequency table for two categorical variables, A and B. In a flat table, A and B are separate columns of a table, and a third column tallies the frequencies of each combination. The table will show a count of 0 for category combinations not present in the data.

mytab <- table(Aname = mydata$A, Bname = mydata$B)
as.data.frame(mytab, stringsAsFactors = FALSE)
# or
data.frame(ftable(mydata[, c("A","B")], row.vars = c("A","B")))

The summarize method of dplyr will tally only the combinations of categories that have a frequency greater than 0. Hence, ftable is preferred. Alternatively, a fix is available in the tidyr package (you might need to install first):

library(tidyr)
mytab <- summarize(group_by(mydata, A, B), freq = n())
complete(ungroup(mytab), A, B, fill = list(freq = 0))

Table summaries of descriptive statistics

The tapply command creates tables of descriptive statistics by group (e.g., mean, standard deviation, median, etc). So does the summarize command of the dplyr package, as shown here.

Descriptive statistics for one variable

Here is how to generate a table of group means for a variable y, where A is the categorical grouping variable.

tapply(mydata$y, INDEX = mydata$A, FUN = mean, na.rm = TRUE) # result is a vector

summarize(group_by(mydata, A), ybar = mean(y, na.rm = TRUE)) # dplyr method; result is a data frame

The argument na.rm = TRUE removes missing values (otherwise the mean returns NA if missing values are present). With tapply, pass optional arguments to FUN by including them immediately afterward.
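
The sketch below illustrates how extra arguments placed after FUN are handed on to it, using a tiny made-up vector y and grouping variable A. The second call passes na.rm to mean(), and the third passes both probs and na.rm to quantile().

```r
# Made-up data: group "a" contains one missing value
y <- c(2, 4, NA, 10, 20, 30)
A <- c("a", "a", "a", "b", "b", "b")

m1 <- tapply(y, INDEX = A, FUN = mean)                 # mean of group "a" is NA
m2 <- tapply(y, INDEX = A, FUN = mean, na.rm = TRUE)   # na.rm is handed on to mean()
q  <- tapply(y, INDEX = A, FUN = quantile, probs = 0.5, na.rm = TRUE)  # group medians
m1
m2
q
```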

An advantage of the summarize method from dplyr is that you can calculate more than one descriptive statistic at once. Here we calculate mean, standard deviation, and number of observations (including missing observations) for the variable x by group.

summarize(group_by(mydata, A), xbar = mean(x, na.rm = TRUE),
              s = sd(x, na.rm = TRUE), n = n()) 

More than one grouping variable

To calculate descriptive statistics (e.g., the median of y) with more than one grouping variable, use one of the following commands.

summarize(group_by(mydata, A, B), ybar = median(y, na.rm = TRUE))        # dplyr command, yields a data frame
aggregate(mydata$y, by = list(A = mydata$A, B = mydata$B), FUN = median, na.rm = TRUE) # yields a data frame
tapply(mydata$y, INDEX = list(mydata$A, mydata$B), FUN = median, na.rm = TRUE)         # yields an r x c matrix

More than one response variable

The dplyr command summarize allows you to tabulate summaries for more than one variable. For example, if your data frame mydata contains a categorical variable named A that has multiple categories, you can obtain means and standard deviations by group for one (or more) numeric variables y1, y2, etc., as follows. The result can be saved as a new data frame (here, named z).

z <- summarize(group_by(mydata, A), 
               mean1 = mean(y1, na.rm = TRUE),
               mean2 = mean(y2, na.rm = TRUE),
               mean3 = mean(y3, na.rm = TRUE),
               sd1 = sd(y1, na.rm = TRUE),
               sd2 = sd(y2, na.rm = TRUE),
               sd3 = sd(y3, na.rm = TRUE))
print(z)

Drawing graphs in R

Here, the many types of graphs are grouped according to whether their purpose is to display frequencies of a single variable (e.g., histograms, bar graphs), or associations between two (or more) variables and differences between groups (e.g., mosaic plots, box plots, strip charts, scatter plots).

Command options for base R plots

Many (but not all -- try them) of the basic plotting commands in base R will accept the same options to control axis limits, labeling, print a title, change the plotting symbol, change the size of the plotting symbols and text, and change the line types. Here are some of the most frequently modified options. Use them inside the parentheses of a plotting command to have their effect. If you are not sure whether a given option works in your case, try it. The worst that could happen is you get an error message, or R ignores you.

main = "Eureka"    # add a title above the graph
pch = 16           # set plot symbol to a filled circle
col = "red"        # set the item color
xlim = c(-10,10)   # set limits of the x-axis (horizontal axis)
ylim = c(0,100)    # set limits of the y-axis (vertical axis)
lty = 2            # set line type to dashed
las = 2            # rotate axis labels to be perpendicular to axis
cex = 1.5          # magnify the plotting symbols 1.5-fold
cex.lab = 1.5      # magnify the axis labels 1.5-fold
cex.axis = 1.3     # magnify the axis annotation 1.3-fold
xlab = "Body size" # label for the x-axis
ylab = "Frequency" # label for the y-axis
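
To see several of these options working together, here is a minimal sketch of a single plot command. The variables size and score are simulated placeholders, not part of mydata.

```r
# Made-up data for illustration only
set.seed(2)
size  <- rnorm(30, mean = 0, sd = 3)         # hypothetical x values
score <- 50 + 4 * size + rnorm(30, sd = 5)   # hypothetical y values

plot(score ~ size,
     main = "Eureka", pch = 16, col = "red",       # title, filled red circles
     xlim = c(-10, 10), ylim = c(0, 100),          # axis limits
     las = 1, cex = 1.5,                           # horizontal tick labels, larger symbols
     cex.lab = 1.5, cex.axis = 1.3,                # larger labels and annotation
     xlab = "Body size", ylab = "Score")
abline(h = 50, lty = 2)                            # dashed horizontal reference line
```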

For details and the full list of plotting options in base R, get help as follows,

?par              # graphical parameters
?plot.default     # basic plot decorations

Drawing graphs with ggplot2

Building a graph using ggplot2 involves the combination of components or "layers" including data, "aesthetics" that map variables to visuals, and "geoms" that create different kinds of plots. A basic scatter plot of yvar against xvar has the three components as follows.

ggplot(data = mydata, mapping = aes(x = xvar, y = yvar)) + geom_point()

where geom_point() is the specific geom for plotting points. Other graph components can be added, as demonstrated with examples below.

Save graph as a pdf file

After drawing your plot, you can use the menu (File -> Save As) to save to a pdf file. Or, draw the graph on a pdf device to begin with:

pdf(file = "mygraph.pdf")     # opens the pdf device for plotting
...                           # Issue your R commands here to generate plot
dev.off()                     # closes the device when you are done
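
A complete version of this sequence might look like the sketch below. The file location (a temporary folder) and the page dimensions are arbitrary choices for illustration; width and height are given in inches.

```r
outfile <- file.path(tempdir(), "mygraph.pdf")   # hypothetical file location
pdf(file = outfile, width = 6, height = 4)       # open the device; size in inches
hist(rnorm(100), col = "firebrick", main = "", xlab = "x")  # draw on the pdf device
dev.off()                                        # the file is complete only after this
file.exists(outfile)                             # check that the file was written
```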

Graph the frequency distribution of a single variable

Bar graph - frequency distribution for a categorical variable

In the following examples, A is a categorical variable identifying groups.

# base R
barplot(table(mydata$A), col = "firebrick", space = 0.2, cex.names = 1.2)

# Using ggplot
ggplot(mydata, aes(x = A)) + geom_bar(stat="count")

# ggplot with options:
ggplot(mydata, aes(x = A)) + 
	geom_bar(stat="count", fill = "firebrick") +
	labs(x = "A group", y = "Frequency") +
	theme(text = element_text(size = 15), 
	  axis.text = element_text(size = 12), aspect.ratio = 0.8)

R will arrange the categories in alphabetical order by default. If you want to fix a specific order, specify this in a factor command. For example, if the variable A has three groups "a", "b" and "c", fix a preferred order as follows and then rerun the command.

mydata$A <- factor(mydata$A, levels = c("c","a","b"))
barplot(table(mydata$A), col = "firebrick")
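
The effect of the levels argument can be checked on a small made-up vector before plotting. In this sketch, A and A2 are stand-alone vectors rather than columns of mydata.

```r
A <- c("b", "a", "c", "a", "c", "c")         # made-up character variable
names(table(A))                              # default order is alphabetical
A2 <- factor(A, levels = c("c", "a", "b"))   # fix a preferred order
names(table(A2))                             # tables and plots now follow that order
barplot(table(A2), col = "firebrick")
```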

To plot the bars in order of decreasing frequency (a good idea for bar graphs), use the following.

# Using base R
barplot(sort(table(mydata$A), decreasing = TRUE), col = "firebrick", space = 0.2)

# Using ggplot
mydata$A_ordered <- factor(mydata$A, 
	levels = names(sort(table(mydata$A), decreasing = TRUE)) )
ggplot(mydata, aes(x = A_ordered)) + 
	geom_bar(stat="count", fill = "firebrick") +
	labs(x = "A group", y = "Frequency") +
	theme(text = element_text(size = 15), 
	  axis.text = element_text(size = 12), aspect.ratio = 0.8)

If the frequencies are already tabulated in a data frame named mytab, then modify as follows. A is a variable in mytab that lists each named category exactly once and Freq is a variable containing the corresponding frequency of cases in each category.

mytab <- arrange(mytab, desc(Freq))
barplot(mytab$Freq, names.arg = mytab$A, col = "firebrick")

# or
mytab$A_ordered <- factor(mytab$A, levels = mytab$A[order(mytab$Freq, decreasing = TRUE)] )
ggplot(mytab, aes(x = A_ordered, y = Freq)) + 
	geom_bar(stat="identity", fill = "firebrick") +
	labs(x = "A group", y = "Frequency") +
	theme(text = element_text(size = 15), axis.text = element_text(size = 12), aspect.ratio = 0.8)

Histogram - frequency distribution for a numeric variable

The basic command is

hist(mydata$x, col = "navy", right = FALSE)

The col argument sets the color of the bars. The right = FALSE option causes all the histogram intervals, or bins, except the last one to be left-closed. In other words, the value 1 would appear in the interval 1-2 rather than in the interval 0-1, which seems to be the convention (unless 1 is the upper limit of the right-most bin, in which case R puts it in the bin 0-1; use the include.lowest option to control that!).

Use the breaks option to influence the width and number of histogram bins. To set the approximate number of bins to 20, use

hist(mydata$x, breaks = 20, right = FALSE)

For finer control of bin number and location, specify the breakpoints. For example, the following command creates a series of bins 1 unit wide between the limits 0 and 6 (make sure all the data fall between these limits or R will complain),

hist(mydata$x, breaks = seq(from=0, to=6, by=1), right = FALSE)

Notice that a value of exactly 6, the upper limit of breaks, will appear in the interval 5-6. This behavior is restricted to the rightmost bin. To prevent this from happening, increase the upper limit of the breaks by 1:

hist(mydata$x, breaks = seq(from=0, to=7, by=1), right = FALSE)
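
The binning rules described above can be verified without drawing anything by asking hist() for its counts. The sketch below uses a small made-up vector that includes values falling exactly on break points.

```r
x <- c(0, 1, 1, 2, 5.5, 6)    # made-up values, several exactly on a break point

# Left-closed bins: the two 1's are counted in the interval 1-2,
# but the value 6 still lands in the last interval, 5-6
left <- hist(x, breaks = 0:6, right = FALSE, plot = FALSE)$counts

# Default right-closed bins: the two 1's are counted in the interval 0-1
right <- hist(x, breaks = 0:6, right = TRUE, plot = FALSE)$counts

# Extending the breaks to 7 moves the value 6 into its own interval, 6-7
padded <- hist(x, breaks = 0:7, right = FALSE, plot = FALSE)$counts

left
right
padded
```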

In ggplot, the barest histogram for a numeric variable x in mydata requires only

ggplot(mydata, aes(x)) + geom_histogram() 

The example below improves the graph with a number of options. To see the impact of each option, leave out and rerun.

ggplot(mydata, aes(x)) + 
	geom_histogram(fill = "firebrick", col = "black", binwidth = 0.2, 
		boundary = 0, closed = "left") + 
	labs(x = "The variable x", y = "Frequency") + 
	theme(aspect.ratio = 0.80)

To display probability density instead of raw frequencies,

hist(mydata$x, prob = TRUE, right = FALSE)

# in ggplot
# after_stat(density) replaces the older ..density.. notation in ggplot2 3.3.0 and later
ggplot(mydata, aes(x)) + geom_histogram(aes(y = after_stat(density)), closed = "left")

To superimpose a normal density curve on a histogram, try the following commands. First, seq generates 101 evenly spaced points along the x-axis between the smallest and largest data values. Then dnorm generates the normal density at each x point, using the mean and standard deviation of the data.

hist(mydata$x, prob = TRUE, right = FALSE)
m <- mean(mydata$x, na.rm = TRUE)
s <- sd(mydata$x, na.rm = TRUE)
xpts <- seq(from = min(mydata$x, na.rm=TRUE), to = max(mydata$x, na.rm = TRUE), length.out = 101)
lines(dnorm(xpts, mean=m, sd=s) ~ xpts, col="red", lwd=2)

In ggplot, add a stat_function() to get the same,

ggplot(mydata, aes(x)) + 
	geom_histogram(aes(y = after_stat(density)), closed = "left") + 
	stat_function(fun = dnorm, args = list(mean = mean(mydata$x, na.rm = TRUE), 
     	     	sd = sd(mydata$x, na.rm = TRUE))) 

Normal quantile plot - compare data to the normal distribution

x is the numeric variable whose distribution is being compared with the normal.

qqnorm(mydata$x)
qqline(mydata$x)  # adds line through first and third quartiles

The same can be accomplished in ggplot as follows.

ggplot(mydata, aes(sample = x)) + geom_qq() + geom_qq_line()

Graphs to display association between two variables

The appropriate graph depends on which variables are numeric or categorical.

Mosaic plot - association between two categorical variables

A and B can be factors or character variables. color = TRUE yields shades of grey, or a vector of colors can be used instead (shown below for 3 categories). A quick way to get a bunch of different colors is to use color = rainbow(n), where n is the number of categories for the variable B. Other options in the examples below alter the orientation (las) and size (cex.axis) of the labels.

mosaicplot(table(mydata$A, mydata$B), color = c("red","blue","yellow"), las = 2, cex.axis = 0.8)

mosaicplot(table(mydata$A, mydata$B), color = rainbow(3), las = 2, cex.axis = 0.8)

To draw a mosaic plot using ggplot, first install the ggmosaic package. The example below loads the package, assuming it is installed.

*** This command was broken the last time I attempted it (2018/09/12) using R version 3.5.1 and the most recent versions of ggmosaic and ggplot2 ***

library(ggmosaic)
ggplot(mydata, aes(x = product(A, B), fill=factor(A))) + 
	geom_mosaic() +
	labs(x = "B group") + 
	theme(aspect.ratio = 1, axis.text.x = element_text(angle = -25, hjust= .1, size = 12)) + 
	guides(fill = guide_legend(title = "A group", reverse = TRUE))

Grouped bar graph - association between two categorical variables

A and B can be factors or character variables in a data frame mydata. The first line of code below is the bare minimum, whereas the second adds a few useful options.

barplot(table(mydata$A, mydata$B), beside = TRUE)

barplot(table(mydata$A, mydata$B), beside = TRUE, 
	las = 1, col = rainbow(4), cex.names = 0.8, space = c(0.2,0.8),
	xlab = "B group", ylab = "Frequency", legend.text = TRUE)

In ggplot, use geom_bar(stat="count") and the argument position_dodge2(preserve="single"), instead of position="dodge", which doesn't handle category combinations of A and B having a count of 0. (position_dodge2 leaves a small space between bars for B within groups of A, whereas position_dodge leaves no gap.)

ggplot(mydata, aes(x = A, fill = B)) + 
	geom_bar(stat = "count", position = position_dodge2(preserve="single")) +
	labs(x = "A group", y = "Frequency") +
	theme(aspect.ratio = 0.80)

Multiple histograms by group

A plot of multiple histograms is useful for comparing the frequency distribution of a numeric variable between groups. Stack the plots above one another if possible, for best results, and use the same minimum and maximum of x-values on the axis. The code below draws a panel of 3 histograms of x, on top of one another, one for each of three groups (a1 - a3) of the categorical variable A in mydata.

A panel of plots is accomplished easily with ggplot or the lattice package (a brief introduction to the lattice package is provided at the bottom of this page). In ggplot, add facet_wrap() to place graphs for different groups on the same page.

ggplot(mydata, aes(x = x)) + 
	geom_histogram(fill = "firebrick", col = "black", binwidth = 0.2, 
		boundary = 0, closed = "left") +
	labs(x = "The variable x", y = "Frequency") + 
	theme(aspect.ratio = 0.5) + 
	facet_wrap( ~ A, ncol = 1, scales = "free_y")

To accomplish the task in base R, begin by setting up a graphics window with the desired number of rows and columns (here, 3 and 1, respectively) using the mfrow option of par(). Here the mar option is used to adjust the size of the margin around each plot, and cex is used to reduce the font size of labels to make room. The following code loops through the three unique categories of A and draws a histogram for x for each group.

par(mfrow=c(3,1), mar = c(4, 4, 2, 1), cex = 0.7) 
for( i in unique(mydata$A) ){
	dat <- subset(mydata, A == i)
	hist(dat$x, breaks = 20, right = FALSE, xlim = range(mydata$x, na.rm = TRUE), 
		col="firebrick", main = i, xlab = "x variable", ylab = "Frequency")
	}

Strip chart - association between a categorical and a numeric variable

A strip chart (a.k.a. dot plot) can be used instead of a box plot when the number of data points is not large. Random noise ("jitter") is used to reduce overlap of points. The first example is a basic plot, whereas the second adds common options.

# base R
stripchart(y ~ A, vertical = TRUE, data = mydata, method = "jitter")

stripchart(y ~ A, vertical = TRUE, data = mydata, method = "jitter", 
	jitter = 0.2, cex.axis = 0.8, pch = 1, col = "firebrick")

The first line of code below shows the basic strip chart in ggplot. The second example includes common options. The third example adds horizontal line segments at the group means (which is harder in ggplot than it ought to be).

ggplot(mydata, aes(A, y)) + geom_jitter()

ggplot(mydata, aes(A, y)) +
	geom_jitter(color = "firebrick", size = 3, width = 0.15) +
	labs(x = "A group", y = "y value") + 
	theme(aspect.ratio = 0.80, text = element_text(size = 12), axis.text = element_text(size = 10))

# To add horizontal line segments at means, first define a function that calculates the means
groupMeans <- function(x) {
	m <- mean(x, na.rm = TRUE)
	c(y = m, ymin = m, ymax = m)
	}
ggplot(mydata, aes(A, y)) +
	geom_jitter(color = "firebrick", size = 3, width = 0.15) +
	stat_summary(fun.data = "groupMeans", geom = "errorbar", colour = "black", width = 0.5, size = 1) +
	labs(x = "A group", y = "y value") + 
	theme(aspect.ratio = 0.80, text = element_text(size = 12), axis.text = element_text(size = 10))

You can add points or lines to a base R stripchart by using integer numbers to indicate position of categories along the x-axis. For example, to add means and standard errors of y to a strip chart for a numeric variable y and a categorical variable A having four categories, use

stripchart(y ~ A, data = mydata, vertical = TRUE, method = "jitter", pch = 1)
m <- tapply(mydata$y, mydata$A, mean, na.rm=TRUE)
se <- tapply(mydata$y, mydata$A, function(y){ sd(y, na.rm=TRUE)/sqrt(length(na.omit(y))) })
points( m ~ c(1:4 + 0.2), pch=16, col = "red")
segments(x0 = c(1:4 + 0.2), y0 = m - se, x1 = c(1:4 + 0.2), y1 = m + se, col = "red")
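
The standard error used here is the standard deviation divided by the square root of the number of non-missing observations. As a sketch, the calculation can be wrapped in a small helper and checked on made-up numbers; se_mean is a hypothetical name, not a built-in R function.

```r
# Hypothetical helper: standard error of the mean, ignoring missing values
se_mean <- function(v) {
  v <- v[!is.na(v)]            # drop missing values before counting
  sd(v) / sqrt(length(v))      # standard deviation over root sample size
}

# Made-up data: two groups, one missing value in group "a"
y <- c(2, 4, 6, NA, 10, 12, 14, 16)
A <- c("a", "a", "a", "a", "b", "b", "b", "b")

m  <- tapply(y, A, mean, na.rm = TRUE)   # group means
se <- tapply(y, A, se_mean)              # group standard errors
cbind(mean = m, se = se)
```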

Strip chart with lines for paired data

Paired data should be displayed accordingly. The following commands create a strip chart with lines connecting the two measurements of the same unit in the two treatments. The data frame mydata includes the response variable y, the treatment variable A, and an id variable indicating identity of individuals measured under both treatments.

The following works in base R.

interaction.plot(response = mydata$y, x.factor = mydata$A, trace.factor = mydata$id,
	legend = FALSE, lty = 1, xlab = "Treatment", ylab = "Y variable", 
	type = "b", pch = 16, las = 1, cex = 1.5, cex.lab = 1.5, cex.axis = 1.3)

Try the following in ggplot.

ggplot(mydata, aes(y = y, x = A)) +  
	geom_point(size = 5, col = "firebrick", alpha = 0.5) + 
	geom_line(aes(group = id)) +
	labs(x = "Treatment", y = "Y variable") + 
	theme(text = element_text(size = 18), axis.text = element_text(size = 16), aspect.ratio = 0.80)

Box plot - association between a categorical and a numeric variable

The following code generates a box plot for the numeric variable yvar separately for every group identified in the categorical variable A. The following shows use of the formula method in boxplot, written as "response_variable ~ explanatory_variable". Set varwidth = FALSE if you want all boxes to have the same width.

boxplot(yvar ~ A, data = mydata, varwidth = TRUE, 
	ylab="y value", col = "firebrick", cex.axis = 0.8)

The next command shows how to make a basic box plot with ggplot. Below it is a command with more options.

ggplot(mydata, aes(x = A, y = yvar)) + geom_boxplot()

ggplot(mydata, aes(x = A, y = yvar)) + 
	geom_boxplot(fill = "firebrick", notch = FALSE, varwidth = TRUE) + 
	labs(x = "A group", y = "y value") + 
	theme(text = element_text(size = 15), axis.text = element_text(size = 12), aspect.ratio = 0.80)

Violin plot - association between a categorical and a numeric variable

A violin plot shows the frequency distribution for a numerical variable (and its mirror image) for several groups. The frequency distribution is a kernel density estimate, which "smooths" the distribution. The following code generates a violin plot for the numeric variable yvar separately for every group identified in the categorical variable A. This is easiest to accomplish in ggplot. The stat_summary layer adds a dot for the mean of each group.

ggplot(mydata, aes(x = A, y = yvar)) + 
	geom_violin(fill = "firebrick") + 
	stat_summary(fun = mean,  geom = "point", color = "black") +
	labs(x = "A group", y = "y value") + 
	theme(text = element_text(size = 15), axis.text = element_text(size = 12), aspect.ratio = 0.80)

The job is possible without ggplot, but you'll need to install the vioplot package first using install.packages(), and then load it. Here, yvar is the variable plotted separately for groups of the variable A. In this example, A has 4 groups named a1, a2, a3, and a4. The amount of "smoothing" for the kernel density estimates is controlled by the h option.

library(vioplot)
vioplot(mydata$yvar[mydata$A=="a1"], mydata$yvar[mydata$A=="a2"], 
	mydata$yvar[mydata$A=="a3"], mydata$yvar[mydata$A=="a4"],
	col="#FFB531", drawRect = FALSE, names = c("a1","a2","a3","a4"),
	h = 0.5)
mtext("y value", side=2, line=2, las = 3)
mtext("A group", side=1, line = 3)

Scatter plot - association between two numeric variables

Here's how to produce a scatter plot for two numeric variables, x and y.

# formula method, base R
plot(y ~ x, data = mydata, pch = 16, col = 2)

# basic scatter plot in ggplot
ggplot(mydata, aes(x, y)) + geom_point()

# scatter plot with options in ggplot
ggplot(mydata, aes(x, y)) + 
	geom_point(size = 2, col = "firebrick", alpha = 0.5) + 
	labs(x = "x variable", y = "y variable") + 
	theme(aspect.ratio = 0.80)

You can probably guess the intent of most of the ggplot options except alpha, which makes the dots partly transparent.

To add a smooth curve through the data, use locally weighted regression. "Local" here means that y is predicted for each x using only data in the vicinity of that x, rather than all the data.

In base R you can control the size of the vicinity using the option f, which is a proportion between 0 (very narrow vicinity) and 1 (uses all the data). Try different values of f to best capture the relationship. The default is f = 2/3.

plot(y ~ x, data = mydata)
x1 <- mydata$x[order(mydata$x)]
y1 <- mydata$y[order(mydata$x)]
lines(lowess(x1, y1, f = 0.5))
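
To get a feel for the option f, it can help to overlay curves fitted with different values on the same plot. The sketch below uses simulated data (a sine curve plus noise), so the variables here are stand-alone vectors rather than columns of mydata. A smaller f follows local wiggles more closely, whereas a larger f gives a smoother curve.

```r
# Simulated data for illustration only
set.seed(3)
x <- sort(runif(100, min = 0, max = 10))
y <- sin(x) + rnorm(100, sd = 0.3)

fit_narrow  <- lowess(x, y, f = 0.2)   # narrow vicinity
fit_default <- lowess(x, y, f = 2/3)   # the default vicinity

plot(y ~ x, pch = 16, col = "grey50")
lines(fit_narrow,  col = "red",  lwd = 2)
lines(fit_default, col = "blue", lwd = 2)
legend("topright", legend = c("f = 0.2", "f = 2/3"),
       col = c("red", "blue"), lwd = 2)
```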

Using ggplot, you also get a band showing the standard errors of predicted values (set se = FALSE if not desired).

ggplot(mydata, aes(x, y)) + 
	geom_point(size = 2, col = "firebrick") + 
	labs(x = "x variable", y = "y variable") + 
   	geom_smooth(method = "loess", size = 1, col = "black", se = TRUE) +
	theme(aspect.ratio = 0.80)

To add the least squares regression line to a plot, use the following.

plot(y ~ x, data = mydata)
abline(lm(y ~ x, data = mydata))

# in ggplot
ggplot(mydata, aes(x, y)) + 
	geom_point(size = 2, col = "firebrick") + 
	labs(x = "x variable", y = "y variable") + 
   	geom_smooth(method = "lm", size = 1, col = "black") +
	theme(aspect.ratio = 0.80)

In base R you can use the cursor to click data points to identify individuals. The following code prints the row number when you click a point (change that by setting the labels option to a character variable that labels the point instead).

plot(y ~ x, data = mydata)
identify(mydata$x, mydata$y, labels = seq_along(mydata$x))

Line plot - display a sequence of measurements

Here's how to plot a sequence of (x, y) points and connect the dots with lines. This is especially useful when the x-variable represents a series of points in time or across a spatial gradient. In base R:

plot(y ~ x, data = mydata, pch=16)
lines(y[order(x)] ~ x[order(x)], data = mydata)

# Eliminate the points, and some options
plot(y[order(x)] ~ x[order(x)], data = mydata, type = "l", 
	lty = 3, lwd = 2, col = "red")

The basic ggplot command is below. There's no need to order by x.

ggplot(data = mydata, aes(x, y)) +
	geom_line()

A line plot in ggplot works if the explanatory variable is numeric (x) or categorical (A), as shown in the two commands below (which also include further options).

ggplot(data = mydata, aes(x, y)) +
	geom_line(color = "red") +
	geom_point() +
	labs(x = "x variable", y = "y variable") + 
	theme(text = element_text(size = 15), 
		axis.text = element_text(size = 12), aspect.ratio = 0.75)

# To use categorical variable A, include "group = 1" in aes()
ggplot(data = mydata, aes(A, y, group = 1)) +
	geom_line(color = "red") +
	geom_point() +
	labs(x = "A variable", y = "y variable") + 
	theme(text = element_text(size = 15), 
		axis.text = element_text(size = 12), aspect.ratio = 0.75)

Graphs to display more variables

Scatter plots by group

One way to make a scatter plot for multiple groups is to superimpose them all on a single plot but vary the symbols according to group. In base R, use pch to vary the symbol, col to vary the color, and both to vary both at the same time. x and y are numeric variables, whereas A is a categorical variable identifying groups. If A is already a factor you can use just A instead of factor(A) in the commands below. The legend command adds a legend identifying the groups---click on the plot (inside the plot region) with your cursor to place the legend.

plot(y ~ x, data = mydata, pch = as.numeric(factor(mydata$A)),
         col = as.numeric(factor(mydata$A)))
legend( locator(1), legend = as.character(levels(factor(mydata$A))), 
	pch = 1:length(levels(factor(mydata$A))), 
	col=1:length(levels(factor(mydata$A))) )

To accomplish the same with ggplot, specify which variable to vary by color and by shape within aes(), which maps variables to visuals. The example below includes an optional geom_smooth() line that will result in the least squares lines for each group also superimposed.

ggplot(mydata, aes(x, y, colour = A, shape = A)) + 
	geom_point(size = 2) + 
	geom_smooth(method = "lm", size = 1, se = FALSE) +
	labs(x = "x variable", y = "y variable") + 
	theme(aspect.ratio = 0.80)

Another way to make a scatter plot for multiple groups is to plot them separately in panes of a panel of plots. Use facet_wrap() in ggplot for this purpose, as follows. The lattice package can also be used to make panels of plots, as described briefly at the bottom of the page.

ggplot(mydata, aes(x, y)) + 
	facet_wrap(~A, ncol = 2) + 
	geom_point(col = "firebrick", size = 2) +
   	geom_smooth(method = "lm", size = 1, se = FALSE, col = "black") +
	labs(x = "x variable", y = "y variable") + 
	theme(aspect.ratio = 0.80)

Box plots by group

A grouped box plot displays a numeric response variable y for different levels of two categorical variables, A and B. This can be accomplished in base R by overlaying multiple box plots, but it is much easier to produce the plot in ggplot.

ggplot(data = mydata, aes(y = y, x = A, fill = B)) + 
	geom_boxplot(width = 0.6, position = position_dodge(width = 0.7)) +
	labs(x = "A variable", y = "y variable") +
	theme(aspect.ratio = 0.75, text = element_text(size = 16), axis.text = element_text(size = 14))

width controls the width of each box, whereas position_dodge(width = ) controls the horizontal distance between the adjacent boxes depicting different levels of the B variable within each group defined by the A variable.

Strip charts by group

A grouped strip chart displays a numeric response variable y for different levels of two categorical variables, A and B. This can be accomplished in base R by overlaying multiple strip charts, but it is much easier to produce the plot in ggplot.

ggplot(data = mydata, aes(y = y, x = A, fill = B, color = B)) + 
	geom_jitter(size = 3, position = position_dodge(width = 0.7)) +
	labs(x = "A variable", y = "y variable") +
	theme(aspect.ratio = 0.75, text = element_text(size = 16), axis.text = element_text(size = 14))

position_dodge(width = ) controls the horizontal distance between the adjacent strips depicting different levels of the B variable within each group defined by the A variable.

Interaction plots

The command interaction.plot is quick but rudimentary, as it fails to show the data.

Interaction plots display how the mean of a numeric response variable y changes between the levels of two categorical variables, A and B. The graph is especially useful for determining whether an interaction is present between two factors A and B in a factorial experiment, or between a factor A and a blocking variable B. If the lines are parallel then there is no interaction.

interaction.plot(mydata$A, mydata$B, mydata$y)

The levels of the variable listed first (here, A) will be displayed along the x-axis of the plot. The y-axis will then display the mean of y separately for each category of the second variable, B. Variations on this command include

interaction.plot(mydata$B, mydata$A, mydata$y)                # Put B along x-axis instead
interaction.plot(mydata$A, mydata$B, mydata$y, fun = median)  # median of y
interaction.plot(mydata$A, mydata$B, mydata$y, 
        col = 1:length(unique(mydata$B)))                     # color the lines
interaction.plot(mydata$A, mydata$B, mydata$y, las=2)         # more room for A's labels
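
The means that interaction.plot draws can also be tabulated directly, which makes the "parallel lines" idea concrete. The sketch below uses a small made-up data frame, toy, in which the difference between b2 and b1 is the same at both levels of A, so the lines are parallel.

```r
# Made-up data with no interaction between A and B
toy <- data.frame(
  A = rep(c("a1", "a2"), each = 4),
  B = rep(c("b1", "b2"), times = 4),
  y = c(10, 14, 11, 15, 20, 24, 21, 25)
)
cellmeans <- tapply(toy$y, list(toy$A, toy$B), mean)  # rows are A, columns are B
cellmeans
bdiff <- cellmeans[, "b2"] - cellmeans[, "b1"]        # effect of B within each level of A
bdiff                                                 # equal differences mean parallel lines
interaction.plot(toy$A, toy$B, toy$y)
```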

Pairwise scatter plots for multiple variables

The following command creates a single graph with scatter plots between all pairs of numeric variables in a data frame, "mydata". The option gap adjusts the spacing between separate plots,

pairs(mydata, gap = 0.5)

Use the formula method to plot only the three numeric variables x1, x2, and x3 in the data frame mydata.

pairs(~ x1 + x2 + x3, data = mydata)

Using the lattice package

The lattice package makes it easy to draw a panel of plots separately by group. The basic plot is simple, but commands to add points and lines to the individual panes can be tricky.

The lattice package is included with the basic installation but you need to load the library. The graph types available in the lattice package include the standard ones found also in R's basic graphics package, such as box plots, histograms, and so on. The table below lists the most common types and the relevant commands.

For example, to draw a histogram of a numeric variable x separately for four groups identified by the variable B in the data frame mydata, use

library(lattice)
histogram(~ x | B, data = mydata, layout = c(1,4), right = FALSE)

The layout option is special to lattice and draws the 4 panels in a grid with 1 column with 4 rows, so that the histograms are stacked and most easily compared visually. The right=FALSE option has the same meaning here as for the base R hist command.

To draw a bar graph showing the frequency distribution of a categorical variable A separately for each group identified by the variable B,

barchart( ~ table(A) | B, data = mydata)

This produces horizontal bar graphs, which leaves room for the category labels. To draw the bars vertically instead, while tilting the group labels on the x-axis by 45 degrees so that they fit,

barchart(table(A) ~ names(table(A)) | B, data = mydata, scales = list(x=list(rot=45)))

As a third example, draw a scatter plot to show the relationship between the numeric variables y and x separately for each group in the variable B. The pch option in this example replaces the default plot symbol with a filled dot, and the aspect option sets the relative lengths of the vertical and horizontal axes.

xyplot(y ~ x | B, data = mydata, pch = 16, aspect = 0.7)

It is possible to add plot elements to individual panels, but the commands and options take some getting used to. For example, to fit a separate regression line to each scatter plot, one to each group, use the panel argument in xyplot to construct a function that applies built-in panel functions to each group, as follows.

xyplot(y ~ x | B, data = mydata, pch = 16, aspect = 0.7,
	panel=function(x, y){       # Use x and y here, not real variable names
		panel.xyplot(x, y)  # draws the scatter plot
		panel.lmline(x, y)  # fits the regression line
		}
	)

This doesn't even begin to describe what's possible using the lattice package. Crawley has a few more examples of trellis graphics in The R Book. Sarkar (2008) gives a complete description. See the links to these books on the "Books" tab of the Biology 501 course page.

Table showing a few of the commonly used plotting commands in the lattice package. x and y are numeric variables, whereas A is a categorical variable (character or factor). B is a factor or character variable that will define the groups or subsets of the data frame to be plotted in separate panels. A separate plot in the graphics window will be made for each of the groups defined by the variable B.

Command     Graph type     Syntax (~ refers to the tilde, not to the minus sign)
barchart    bar graph      barchart(~table(A) | B, data=mydata)
bwplot      box plot       bwplot(x ~ A | B, data=mydata)
histogram   histogram      histogram(~x | B, data=mydata, right=FALSE)
stripplot   strip chart    stripplot(x ~ A | B, data=mydata, jitter=TRUE)
xyplot      scatter plot   xyplot(y ~ x | B, data=mydata)