Displaying data using graphs and tables

The purpose of this exercise is to tour the table and graphics capabilities of R, and to explore the best methods for displaying patterns in data. If you need help with some of the commands, check the "tables and graphs" section of the R tips pages. 

Data set 1: Body mass of late Quaternary mammals

These data were published as a data paper in Ecology and deposited in the Ecological Archives (F. A. Smith, S. K. Lyons, S. K. M. Ernest, K. E. Jones, D. M. Kaufman, T. Dayan, P. A. Marquet, J. H. Brown, and J. P. Haskell. 2003. Body mass of late Quaternary mammals. Ecology 84: 3403.) See the metadata for a description. Most of the variables are categorical, with multiple named categories. Body mass is the sole numeric variable.

Read and examine the data

The original data were saved in the file mammals.csv on our server here. Download the file to your computer and open it in a spreadsheet program (e.g., Excel, Calc) to have a look at it.

Read the contents of the file into a data frame. In the original data set, the continent North America was indicated with "NA", but I changed it to "NAM". Can you guess why? Clearly the authors of the data table were not R users!
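
If you want to check your guess, the following sketch may help. It does not use the real file: csv_text is a made-up three-row stand-in, and the column name "mass" is an assumption. It compares how read.csv treats the string "NA" under its default settings and with na.strings changed.

```r
# A tiny stand-in for mammals.csv (made-up rows, not the real data).
# The column name "mass" is a placeholder for illustration only.
csv_text <- "continent,status,mass
NA,extant,120
AF,extinct,4500
NAM,extant,35"

# Default settings: the string "NA" is interpreted as a missing value.
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(is.na(default_read$continent))

# Telling read.csv that only empty fields are missing keeps "NA" as text.
literal_read <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                         na.strings = "")
print(literal_read$continent)

# Reading the downloaded file would look something like this
# (the path depends on where you saved it):
# mammal <- read.csv("mammals.csv", stringsAsFactors = FALSE)
```

Recoding the continent to "NAM" in the file, as was done here, avoids having to remember the na.strings argument every time the data are read.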

Carry out the following inspections of the data
  1. View the first few lines of the data frame on the screen. You'll see that every row represents the data for a different mammal species.
     head(mammal)
  2. Use the fix command to see the entire data frame.
  3. For the two most interesting character variables, tabulate the number of cases in each group. What do the different groups of the variable "status" stand for? Visit the web site with the metadata for an explanation.
  4. You'll notice from the table that there's a typo in the data for the variable "continent". One case is shown as being on the continent "Af" rather than "AF". Fix this using the command line in R.
  5. Create a new variable in the mammal data frame: the log (base 10) of body mass. (See "Manage data frames" on the R Tips "manage data" page if you need help with this.)
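
The inspection steps above can be sketched on a small made-up data frame. This is only one possible approach, not the official solution. The variable names "continent" and "status" come from the exercise, but "mass" and "logmass" are assumed names, and the six rows are invented, so substitute the names you see in your own data frame.

```r
# Toy stand-in for the mammal data frame (invented rows and masses).
mammal <- data.frame(
  continent = c("AF", "Af", "NAM", "SA", "AF", "EA"),
  status    = c("extant", "extinct", "extant", "extant", "historical", "extinct"),
  mass      = c(1200, 4.5e6, 35, 250000, 80, 9000),  # assumed column name
  stringsAsFactors = FALSE
)

# First few rows, then the number of cases in each group.
head(mammal)
table(mammal$continent)   # the typo "Af" shows up as its own category
table(mammal$status)

# Fix the typo. which() drops any missing values from the index.
mammal$continent[which(mammal$continent == "Af")] <- "AF"
table(mammal$continent)

# New variable: log (base 10) of body mass.
mammal$logmass <- log10(mammal$mass)
head(mammal)
```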

Plots of single variables

  1. Produce a bar graph of the number of mammal species on each "continent". Which continent has the most mammal species? Which has the least?
  2. Redo the barplot using the cex.names option to shrink the labels, so that there is room for them all.
  3. Redo the barplot in color. Add a label to the y axis.
  4. Redo the barplot to increase the limit of the y-axis to 1500 species. (The result might not be immediately evident in the axis labeling because by default R applies internal rules to make graphs "pretty". Try increasing the limit to 1600 or 1700 and see what happens.)
  5. The plot categories are listed in alphabetical order by default, which is arbitrary and makes the visual display less efficient than other possibilities. Redo the barplot with the continents appearing in order of decreasing frequency.
  6. Generate a histogram of the body mass of mammal species. How informative is that?!
  7. Generate a histogram of the cube root of body mass (converts body mass to a scale that is proportional to body length). Any improvement?
  8. Generate a histogram of log body mass. Is this more informative?
  9. Redo the previous histogram but use the breaks option to force a bin width of 2 units (i.e., generate breaks between 0 and 10 by 2 units). How much detail is lost? (Note: if you used "log" rather than "log10" to create your variable of log body mass, you will need to use breaks between 0 and 20.)
  10. Redo the previous histogram but use a bin width of 1; then try 0.5; and then 0.1. 
  11. Redo the histogram, but display probability density instead of frequency.
  12. How does the frequency distribution of log body mass depart from a normal distribution? Answer by visual examination of the histogram you just created. Now answer by examining a normal quantile plot instead.  Which display is more informative?
  13. Redo the normal quantile plot but use the option pch="." to change the plotting symbol to the "." character.
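
The options named in the steps above (cex.names, ylim, breaks, pch) can be tried out on simulated data before you apply them to the real data frame. The sketch below is a rough guide only. The vectors continent and logmass are simulated, not the real mammal data, and the colors and axis limits are arbitrary choices.

```r
set.seed(1)
# Simulated stand-ins: 200 "species" with a continent and a log10 body mass.
continent <- sample(c("AF", "AUS", "EA", "NAM", "SA"), 200, replace = TRUE,
                    prob = c(0.30, 0.10, 0.25, 0.15, 0.20))
logmass <- 10 * rbeta(200, 2, 4)   # values strictly between 0 and 10

# Bar graph of counts: smaller labels, color, y-axis label, extended y limit.
counts <- table(continent)
barplot(counts, cex.names = 0.8, col = "firebrick",
        ylab = "Number of species", ylim = c(0, 100))

# Same graph with categories in order of decreasing frequency.
sorted_counts <- sort(counts, decreasing = TRUE)
barplot(sorted_counts, cex.names = 0.8, col = "firebrick",
        ylab = "Number of species", ylim = c(0, 100))

# Histogram with a forced bin width of 2 units.
h_wide <- hist(logmass, breaks = seq(0, 10, by = 2), col = "grey",
               xlab = "Log10 body mass", main = "")

# Narrower bins, showing probability density instead of frequency.
h_dens <- hist(logmass, breaks = seq(0, 10, by = 0.5), freq = FALSE,
               col = "grey", xlab = "Log10 body mass", main = "")

# Normal quantile plot with "." as the plotting symbol.
qqnorm(logmass, pch = ".")
qqline(logmass)
```

Note that hist stops with an error if any value falls outside the range of the breaks, which is why the range of the breaks depends on whether you used log or log10.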

Plots of two variables

  1. Use a box plot to compare the distribution of body sizes (the log scale is most revealing) of mammals having different extinction status. Are extinct mammals similar to, larger than, or smaller than extant mammals? (You may need to use the cex.axis option to shrink the labels so that they all fit on the graph.)
  2. Examine the previous box plot. How do the shapes of the body size distributions compare between extinct and extant mammals?
  3. Redo the previous box plot but make box width proportional to the square root of sample size. Add a title to the plot.
  4. Use the tapply command to calculate the median log body mass of each extinction-status group of mammals. Check that these are consistent with the box plot results.
  5. Use tapply again to calculate the mean of log body mass of each extinction-status mammal group. Why is the mean log size of extant mammals larger than the median, but the mean log size for extinct mammals smaller than the median?
  6. Repeat the tapply command for untransformed body mass (rather than log-transformed body mass). What is the magnitude of the difference in median body mass of the different groups? 
  7. Create a two-way frequency table (contingency table) describing the frequencies of mammal species in different extinction status groups on different continents. Which continent has seen the most extinctions? Which continent has the greatest number of extinctions relative to the number of extant species?
  8. Draw a mosaic plot illustrating the relative frequencies of mammal species in different extinction status groups on different continents. Try switching the order of the variables in the mosaicplot command to change the explanatory and response variable. Which continent has experienced the greatest number of extinctions relative to total numbers of species? (A mosaic plot is perhaps not ideal for these data because the frequencies are so small for some categories, such as "introduction". In this case R also has difficulties squeezing in the labels. Perhaps this is a case in which a table is superior to a graph.)
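
As a rough guide to the commands in this section, here is a sketch using a simulated data frame. The names "continent" and "status" follow the exercise. The names "mass" and "logmass", the group labels, and all of the values are invented for illustration, so the numbers it produces say nothing about real mammals.

```r
set.seed(2)
n <- 120
# Simulated stand-in for the mammal data frame.
mammal <- data.frame(
  continent = sample(c("AF", "EA", "NAM", "SA"), n, replace = TRUE),
  status    = sample(c("extant", "extinct", "historical"), n, replace = TRUE,
                     prob = c(0.7, 0.2, 0.1)),
  stringsAsFactors = FALSE
)
mammal$logmass <- ifelse(mammal$status == "extinct",
                         rnorm(n, mean = 5, sd = 1),
                         rnorm(n, mean = 3, sd = 1.2))
mammal$mass <- 10^mammal$logmass

# Box plot by status. varwidth = TRUE makes box widths proportional to
# the square root of the number of observations in each group.
boxplot(logmass ~ status, data = mammal, varwidth = TRUE, cex.axis = 0.8,
        ylab = "Log10 body mass", main = "Body mass by extinction status")

# Group medians and means. With real data containing missing values,
# add na.rm = TRUE as an extra argument.
group_medians <- tapply(mammal$logmass, mammal$status, median)
group_means   <- tapply(mammal$logmass, mammal$status, mean)
mass_medians  <- tapply(mammal$mass, mammal$status, median)
print(group_medians)
print(group_means)
print(mass_medians)

# Two-way frequency table, and the proportions within each continent (row).
status_by_continent <- table(mammal$continent, mammal$status)
print(status_by_continent)
row_props <- prop.table(status_by_continent, margin = 1)
print(round(row_props, 2))

# Mosaic plot, then the same table with the variables switched.
mosaicplot(status_by_continent, main = "", color = TRUE)
mosaicplot(t(status_by_continent), main = "", color = TRUE)
```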


Data set 2: Fly sex and longevity

The data are from L. Partridge and M. Farquhar (1981), Sexual activity and the lifespan of male fruitflies, Nature 294: 580-581. The experiment placed male fruit flies with varying numbers of previously-mated or virgin females to investigate whether mating activity affects male lifespan.

More plots of two variables

The data are in the file fruitflies.csv, available here. Download the file to your computer and open it in a spreadsheet program to have a look at it. Read the data file into a new data frame. Our goal is to find a plot type that most clearly and efficiently visualizes the differences among treatment groups.
  1. View the first few lines of the data frame on the screen, and familiarize yourself with the variable names.
  2. Use a box plot to examine the distribution of longevities in the treatment groups. Add a label to the y axis. Do the treatment groups differ in longevities? Describe the pattern of differences between treatments.
  3. Use a dot plot (stripchart) to examine the distribution of longevities in the treatment groups. Try the jitter method to reduce overlap between points. Adjust the treatment label sizes so that they all fit on the graph. Compare with the box plot results. Which is more revealing? 
  4. The variable "thorax" stands for thorax length, which was used as a measure of body size. The measurement was included in case body size also affected longevity. Produce a scatter plot of thorax length and longevity. Make "longevity" the response variable (i.e., plot it on the vertical axis). Is there a relationship?
  5. Use the lowess smoother to draw a smooth curve through the scatter plot of longevity on thorax length.
  6. Redraw the scatter plot but this time use different symbols or colors for the different treatment groups. Add a legend to identify the symbols. Describe the pattern of differences between treatments.
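
The plotting steps above might look something like the following sketch. It uses simulated data rather than fruitflies.csv. The names "thorax" and "longevity" follow the exercise text, whereas "treatment" and the five group labels are assumptions, so check the variable names in your own data frame.

```r
set.seed(3)
# Simulated stand-in for the fly data frame (invented labels and values).
groups <- c("none", "1 mated", "8 mated", "1 virgin", "8 virgin")
fly <- data.frame(
  treatment = factor(rep(groups, each = 20), levels = groups),
  thorax    = round(runif(100, min = 0.65, max = 0.95), 2)
)
fly$longevity <- round(-50 + 130 * fly$thorax + rnorm(100, sd = 10))

# Box plot of longevity by treatment, with shrunken axis labels.
boxplot(longevity ~ treatment, data = fly, ylab = "Longevity",
        cex.axis = 0.7)

# Strip chart of the same data, with jitter to reduce overlap.
stripchart(longevity ~ treatment, data = fly, vertical = TRUE,
           method = "jitter", jitter = 0.2, pch = 1, cex.axis = 0.7,
           ylab = "Longevity")

# Scatter plot with a different symbol and color for each treatment.
group_id <- as.numeric(fly$treatment)
plot(fly$thorax, fly$longevity, pch = group_id, col = group_id,
     xlab = "Thorax length", ylab = "Longevity")

# Lowess smooth through all of the points.
smooth <- lowess(fly$thorax, fly$longevity)
lines(smooth)

# Legend matching symbols and colors to treatment groups.
legend("topleft", legend = levels(fly$treatment),
       pch = seq_along(groups), col = seq_along(groups), cex = 0.8)
```

Storing the treatment as a factor with explicit levels fixes the order of the groups in every plot and in the legend. Otherwise R orders them alphabetically.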

Trellis graphics

If time permits, load the lattice package and use it to create panels of graphs for multiple groups in a single window, all on the same scale. This continues our effort to find a plot type that most clearly and efficiently visualizes the differences among treatment groups.
  1. Use the histogram command to plot the frequency distribution of male longevity for each treatment group separately. How easy is it to compare the distributions among treatments?
  2. Repeat the previous command but use the layout option to stack the histograms one above the other. Comment on how this affects your ability to compare the distributions among treatments.
  3. Create a panel of scatter plots showing the relationship between male longevity and male size (as measured by thorax length) separately for each treatment group. Compare this with the previous exercise in which all points were placed on the same scatter plot with different symbols.
  4. Experiment with some of the other plotting commands in the lattice package to plot variables from the mammal and fly data sets.
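
A minimal sketch of the lattice commands mentioned above, again on simulated data. The variable name "treatment" and the group labels are assumptions, and the values are invented. The layout argument takes the number of columns first and the number of rows second.

```r
library(lattice)

set.seed(4)
# Simulated stand-in for the fly data frame (invented labels and values).
groups <- c("none", "1 mated", "8 mated", "1 virgin", "8 virgin")
fly <- data.frame(
  treatment = factor(rep(groups, each = 20), levels = groups),
  thorax    = round(runif(100, min = 0.65, max = 0.95), 2)
)
fly$longevity <- round(-50 + 130 * fly$thorax + rnorm(100, sd = 10))

# One histogram panel per treatment group, all on the same scale.
p_hist <- histogram(~ longevity | treatment, data = fly)

# Same panels stacked in 1 column and 5 rows.
p_stack <- histogram(~ longevity | treatment, data = fly, layout = c(1, 5))

# One scatter plot panel per treatment group.
p_scatter <- xyplot(longevity ~ thorax | treatment, data = fly)

# Lattice plots are objects, so print them to draw them.
print(p_hist)
print(p_stack)
print(p_scatter)
```

At the interactive prompt the plots are drawn automatically when you type the command, but inside a script or a function the explicit print is needed.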