Linear models

In this workshop we will use R to fit linear models to data. See the "fit linear model" page of Rtips for help. You might also need to refer to the graphics help page to visualize your results.

Linear models for fixed effects are implemented in the R command lm. This method is not suitable for analyzing data that include random effects. Other, related methods (e.g., lme) are available for experiments and observational studies that have random effects.

Predicting lion age using a numeric explanatory variable

We'll start with linear regression because you are probably most familiar with this type of linear model. The data are from Whitman et al (2004 Nature 428: 175-178), who noticed that the amount of black pigmentation on the noses of male lions increases as they get older. They used data on the proportion of black on the noses of 32 male lions of known age in Tanzania. We will fit a linear model to these data to predict a lion's age from the proportion of black in his nose. The data can be downloaded here.

Read and examine the data

  1. Read the data from the file. 
  2. View the first few lines of data to make sure it was read correctly.
  3. Create a scatter plot of the data. Choose the response and explanatory variables with care: we will be attempting to predict age from the proportion black in the nose.
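The steps above can be sketched in R as follows. The file name "lions.csv" and the column names "age" and "black" are assumptions (the exercise does not state them), so check head() on your own copy; the small data frame below is made-up and stands in for the real download.

```r
# Step 1: read the data. File and column names are hypothetical:
# lions <- read.csv("lions.csv")

# Made-up rows with the assumed structure, for illustration only:
lions <- data.frame(black = c(0.10, 0.20, 0.35, 0.50, 0.70),
                    age   = c(1.5, 2.0, 4.0, 6.5, 9.0))

# Step 2: inspect the first few rows
head(lions)

# Step 3: age is the response, proportion black the explanatory variable
plot(age ~ black, data = lions,
     xlab = "Proportion black in nose", ylab = "Age (years)")
```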

Fit a linear model

  1. Fit a linear model to the lion data. Store the output in an lm object. Choose the response and explanatory variables with care: we will be attempting to predict age from the proportion black in the nose. 
  2. Add the best-fit line to the scatter plot. Does the relationship appear linear? Check for any serious problems such as outliers or changes in the variance of residuals.
    (The variance of the residuals tends to rise with increasing black in the nose, but the trend is not severe. The residuals might not be quite normal, but they are not badly skewed so we are probably OK.)
  3. Using the same fitted model object, obtain the estimates for the coefficients, slope and intercept, and standard errors.
  4. Obtain 95% confidence intervals for the slope and intercept.
  5. Summarize the fit of the model to the data with an ANOVA table.
  6. Apply the plot command to the lm object created in (1) to diagnose violations of assumptions (keep hitting <return> in the command window to see all the plots). Recall the assumptions of linear models. Do the data conform to the assumptions? Are there any potential concerns? What options would be available to you to overcome any potential problems with meeting the assumptions?

    Most of the plots will be self-explanatory, except perhaps the last one. "Leverage" measures the influence that each data point has on the estimated parameters. For example, if the slope changes a great deal when a point is removed, that point is said to have high leverage. "Cook's distance" measures the effect of each data point on the predicted values for all the other data points. A value greater than 1 is said to be worrisome. Points with high leverage don't necessarily have high Cook's distance, and vice versa.

    Optional: One of the data points (the oldest lion) has rather high leverage. To see the effect this has on the results, refit the data leaving this point out. Did this change the regression line substantially? (If the effect is large, you could turn to robust regression methods, which are less sensitive to outliers. Again, see if this changes the results. Robust regression methods are implemented in the rlm command in the MASS library.)
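The fitting steps above can be sketched as follows, again using a made-up data frame in place of the real file (column names "age" and "black" are assumptions):

```r
# Stand-in data; substitute your own read.csv() result
lions <- data.frame(black = c(0.10, 0.20, 0.35, 0.50, 0.70, 0.80),
                    age   = c(1.5, 2.0, 4.0, 6.5, 9.0, 10.5))

z <- lm(age ~ black, data = lions)  # step 1: fit and store the lm object

plot(age ~ black, data = lions)
abline(z)                           # step 2: add the best-fit line

summary(z)                          # step 3: coefficient estimates with SEs
confint(z)                          # step 4: 95% confidence intervals
anova(z)                            # step 5: ANOVA table
plot(z)                             # step 6: diagnostic plots
```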

Prediction

  1. Display the data once again in a scatter plot. Add the regression line.
  2. Add confidence bands to the scatter plot. These are confidence limits for the predicted mean lion age at each value of the explanatory variable. You can think of these as putting bounds on the most plausible values for the "true" or population regression line. Note the spread between the upper and lower limits and how this changes across the range of values for proportion black in the nose. 
  3. Add prediction intervals to the scatter plot. These are confidence limits for the prediction of new individual lion ages at each value of the explanatory variable. Whereas confidence bands address questions like "what is the mean age of lions whose proportion black in the nose is 0.5 ?", prediction intervals address questions like "what is the age of that new lion over there, which has a proportion black in the nose of 0.5 ?".
  4. Examine the confidence bands and prediction intervals. Is the prediction of mean lion age from black in the nose relatively precise? Is prediction of individual lion age relatively precise? Could this relationship be used in the field to age lions?
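One way to compute and draw both kinds of intervals is with predict(), which takes an interval argument. The sketch below reuses the made-up lion data frame (real file and column names may differ):

```r
lions <- data.frame(black = c(0.10, 0.20, 0.35, 0.50, 0.70, 0.80),
                    age   = c(1.5, 2.0, 4.0, 6.5, 9.0, 10.5))
z <- lm(age ~ black, data = lions)

# Grid of x-values spanning the observed range
xnew <- data.frame(black = seq(min(lions$black), max(lions$black),
                               length.out = 100))
conf <- predict(z, newdata = xnew, interval = "confidence")  # mean age
pred <- predict(z, newdata = xnew, interval = "prediction")  # new individuals

plot(age ~ black, data = lions)
abline(z)                                   # regression line
lines(xnew$black, conf[, "lwr"], lty = 2)   # confidence band (narrower)
lines(xnew$black, conf[, "upr"], lty = 2)
lines(xnew$black, pred[, "lwr"], lty = 3)   # prediction interval (wider)
lines(xnew$black, pred[, "upr"], lty = 3)
```

The prediction interval is always at least as wide as the confidence band at every x, because it includes the scatter of individual lions around the mean in addition to uncertainty in the line itself.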

Effects of light treatment on circadian rhythms

Our second example fits a linear model with a categorical explanatory variable. The data are from an experiment by Wright and Czeisler (2002 Science 297: 571) that re-examined a previous claim that light behind the knees could reset the circadian rhythm of an individual the same way as light to the eyes. One of three light treatments was randomly assigned to 22 subjects (a three-hour episode of bright lights to the eyes, to the knees, or to neither). Effects were measured two days later as the magnitude of phase shift in each subject’s daily cycle of melatonin production. A negative measurement indicates a delay in melatonin production, which is the predicted effect of light treatment. The data can be downloaded here.

Read and examine the data

  1. Read the data from the file. 
  2. View the first few lines of data to make sure it was read correctly.
  3. Determine whether the categorical variable "treatment" is a factor. If not a factor, convert treatment to a factor using the "factor" command. This will be convenient when we fit the linear model.
  4. Use the levels command on the factor variable "treatment" to see how R has ordered the different treatment groups. The order should be alphabetical by default. Conveniently, the control group is listed first in the alphabetical sequence, which is what we want. However, to get some practice, change the order so that the "knee" treatment group is listed second and the "eyes" group is listed third.
  5. Create a plot of the data, showing the individual data points in each treatment group.
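A sketch of steps 1–5 in R. The exercise names the variable "treatment"; the file name and the response-variable name "shift" are assumptions, and the rows below are made-up stand-ins for the real download:

```r
# circadian <- read.csv("circadian.csv")   # hypothetical file name

# Made-up rows with the assumed structure, for illustration only:
circadian <- data.frame(
  treatment = c("control", "knee", "eyes", "control", "knee", "eyes"),
  shift     = c(0.1, -0.4, -1.2, -0.3, -0.6, -1.5))
head(circadian)                                  # step 2

is.factor(circadian$treatment)                   # step 3: check
circadian$treatment <- factor(circadian$treatment)
levels(circadian$treatment)                      # step 4: alphabetical order
# Reorder so control is first, knee second, eyes third:
circadian$treatment <- factor(circadian$treatment,
                              levels = c("control", "knee", "eyes"))

# Step 5: show the individual points in each group
stripchart(shift ~ treatment, data = circadian,
           vertical = TRUE, method = "jitter")
```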

Fit a linear model

  1. Fit a linear model to the light treatment data.
  2. Create a graphic that illustrates the fit (predicted or fitted values) of the model to the data.
  3. Use graphs to check whether the assumptions of linear models are met in this case. Are there any potential concerns? What options would be available to overcome any potential problems with meeting the assumptions?
  4. Obtain the parameter estimates with standard errors for the coefficients of the fitted model. Interpret the parameter estimates. What do they represent? 
  5. Why are the P-values from the t-tests of model parameters shown generally invalid? Under what circumstance would they be valid?
  6. Obtain 95% confidence intervals for the parameters.
  7. Summarize the fit of the model to the data, and test the overall treatment effects with an ANOVA table.
  8. Refit a linear model to the same data, but this time suppress the intercept term in the model formula. Obtain the parameter estimates with standard errors for the fitted model. Interpret the parameter estimates. What do they represent? Why are the P-values not useful here?
  9. Obtain 95% confidence intervals for the parameters.
  10. Re-test the overall treatment effects with an ANOVA table. Did anything change?
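The fitting steps can be sketched as follows, continuing with the made-up stand-in data (column name "shift" is an assumption). Suppressing the intercept changes what the coefficients represent: with the default parameterization the intercept is the control-group mean and the other coefficients are differences from control, whereas without the intercept each coefficient is a group mean itself.

```r
# Made-up, balanced stand-in data:
circadian <- data.frame(
  treatment = factor(rep(c("control", "knee", "eyes"), each = 4),
                     levels = c("control", "knee", "eyes")),
  shift = c(0.1, -0.3, 0.2, -0.1,
            -0.4, -0.6, -0.2, -0.5,
            -1.2, -1.5, -0.9, -1.3))

z <- lm(shift ~ treatment, data = circadian)  # step 1
summary(z)   # step 4: intercept = control mean; other coefficients
             # are differences from the control group
confint(z)   # step 6
anova(z)     # step 7: overall test of treatment effects

# Step 8: refit with the intercept suppressed
z0 <- lm(shift ~ treatment - 1, data = circadian)
summary(z0)  # coefficients are now the group means themselves
confint(z0)  # step 9
anova(z0)    # step 10: note the table changes, because without an
             # intercept the sums of squares are no longer measured
             # relative to the grand mean
```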

Fly sex and longevity revisited

We analyzed these data previously in our graphics workshop. If time permits, we will analyze them further by fitting a linear model to the data.

The data are from L. Partridge and M. Farquhar (1981), Sexual activity and the lifespan of male fruit flies, Nature 294: 580-581. The experiment placed male fruit flies with varying numbers of previously-mated or virgin females to investigate whether mating activity affects male lifespan. To begin, download the file fruitflies.csv from here.

The linear model will have longevity as the response variable, and two explanatory variables: treatment (categorical) and thorax length (numerical; representing body size). This model will fit a separate linear regression of longevity on thorax for each treatment group. The hope is that it will be OK to assume regression lines for different groups have the same slope. This would allow us to compare different treatment groups after correcting for variation among males in thorax length. This in turn should reduce the amount of noise in the data and lead to more precise estimates of treatment effects on longevity.

Read and examine the data

  1. Read the data from the file. 
  2. View the first few lines of data to make sure it was read correctly.
  3. Determine whether the categorical variable "treatment" is a factor. If not a factor, convert treatment to a factor using the "factor" command. This will be convenient when we fit the linear model.
  4. Use the "levels" command on the factor variable "treatment" to see how R has ordered the different treatment groups (should be alphabetically).
  5. Change the order of the categories so that a sensible control group is first in the order of categories. Arrange the order of the remaining categories as you see fit.
  6. Create a scatter plot, with longevity as the response variable and body size (thorax length) as the explanatory variable. Use different symbols (and colors too, if you like) for different treatment groups. (Use the plot method rather than xyplot in the lattice package).
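A sketch of steps 1–6. The file is fruitflies.csv, but the column names "longevity", "thorax", and "treatment", and the category labels used below, are assumptions based on the description; check your own copy with head() before reordering levels:

```r
# flies <- read.csv("fruitflies.csv")

# Made-up rows with the assumed structure, for illustration only:
flies <- data.frame(
  treatment = c("no females", "virgin", "mated",
                "no females", "virgin", "mated"),
  thorax    = c(0.80, 0.84, 0.78, 0.90, 0.88, 0.92),
  longevity = c(60, 45, 55, 70, 50, 65))
head(flies)

flies$treatment <- factor(flies$treatment)   # steps 3-4
levels(flies$treatment)                      # alphabetical by default
# Step 5: put a sensible control group first (label is hypothetical):
flies$treatment <- factor(flies$treatment,
                          levels = c("no females", "mated", "virgin"))

# Step 6: different symbol (and color) per treatment group
plot(longevity ~ thorax, data = flies,
     pch = as.numeric(flies$treatment),
     col = as.numeric(flies$treatment))
```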

Fit a linear model

  1. Fit a linear model to the fly data, including both body size (thorax length) and treatment as explanatory variables. Place thorax length before treatment in the model formula. Leave out the interaction term -- we'll assume for now that there is no interaction between the explanatory variables thorax and treatment.
  2. Visualize the results by adding the fitted values to the scatter plot.
  3. Use graphical diagnostic tools for lm to check whether the assumptions of linear models are met in this case. Are there any potential concerns?
  4. Attempt to fix the problem identified in step (3) using a transformation of the response variable. You can also transform the body size variable if necessary to preserve a straight-line relationship. Refit the model and reapply the graphical diagnostic tools to check assumptions.
  5. Once you feel you have satisfied the assumption of the linear model method, obtain the parameter estimates and standard errors for the fitted model. Interpret the parameter estimates. What do they represent? 
  6. Obtain 95% confidence intervals for the treatment and body size parameters.
  7. Test overall treatment effects with an ANOVA table. Interpret each significance test -- what exactly is being tested?
  8. Refit the model to the data but this time enter the two explanatory variables in the model in the reverse order in which you entered them previously. Test the treatment effects with an ANOVA table. Why isn't the table identical to the one from your analysis in (7)?
  9. Our analysis so far has assumed that the regression slopes for different treatment groups are the same. Is this a valid assumption? We have the opportunity to investigate just how different the estimated slopes really are. To do this, fit a new linear model to the data, but this time include the interaction term between the explanatory variables. The best approach would be to save the results in a new lm object (e.g., z1 instead of z). This way you will be able to compare directly the fits of the two models, one with an interaction term and one without.
  10. The coefficients table will be more complicated to interpret in the model including an interaction term, so let's skip this step. Instead, go right to the ANOVA table to test the interaction term using the new model fit. Interpret the result. Does it mean that the interaction term really is zero?
  11. The two models that we fit, one with an interaction term and the other without, are "nested" models (not to be confused with nested experimental designs) in the sense that one of them is a subset of the other, and was obtained simply by dropping terms (in this case, leaving out the interaction). Think of the model with the interaction term as the "full" model and that without the interaction as the "reduced" model. When two models z and z1 are nested, it is possible to compare them using

      anova(z,z1)

    Try this yourself. This particular comparison tests whether addition of the interaction term results in a "significantly" improved fit. The previous anova command already gave you this result, but here you've done it directly, "manually".  R's ability to compare any nested models in this way provides great flexibility for data analysis.
  12. Another way to help assess whether the assumption of no interaction is a sensible one for these data is to determine whether the fit of the model is "better" when an interaction term is present or not, and by how much. A simple measure of model fit can be obtained using the adjusted R2 value. The ordinary R2 measures the fraction of the total variation in the response variable that is "explained" by the explanatory variables. This, however, cannot be compared between models that differ in the number of parameters because fitting more parameters always results in a larger R2, even if the added variables are just made-up random numbers. To compare the fit of models having different parameters, use the adjusted R2 value instead, which takes account of the number of parameters being fitted. Use the summary command on each of the fitted models, one with and the other without an interaction term, and compare their adjusted R2 values. Are they much different? If not, then maybe it is OK to assume that any interaction term is likely small and can be left out.
  13. Another method to determine whether or not an interaction term should be included in the model is to compare the two models using model selection techniques, such as Akaike Information Criterion (AIC). The model with the smallest AIC value is considered "best". Compare the AIC scores of the two models. Which is "best" by this criterion?
  14. How do we compare treatment effects when the interaction term is too large to be ignored? One strategy is to try transforming the data to a scale on which the regression lines from different groups really are parallel. If this fails, another strategy is to fit the model with the interaction terms, and then assess treatment effects by the magnitude of the differences between groups at a specified value for the x-variable (such as at the overall mean value of x).
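The model fitting and comparison steps above can be sketched as follows. The data are simulated stand-ins with the assumed column names from earlier (the real fruitflies.csv will differ), and the log transformation of longevity is one common choice for the kind of increasing-variance problem step (4) anticipates, not necessarily the only fix:

```r
set.seed(1)
# Simulated stand-in data; category labels are hypothetical
flies <- data.frame(
  treatment = factor(rep(c("no females", "mated", "virgin"), each = 10),
                     levels = c("no females", "mated", "virgin")),
  thorax = runif(30, 0.65, 0.95))
grp <- c("no females" = 0, "mated" = -0.1, "virgin" = -0.3)
flies$longevity <- exp(3 + 2 * flies$thorax +
                       grp[as.character(flies$treatment)] +
                       rnorm(30, sd = 0.1))

# Steps 1 and 9: additive model (z) and model with interaction (z1)
z  <- lm(log(longevity) ~ thorax + treatment, data = flies)
z1 <- lm(log(longevity) ~ thorax * treatment, data = flies)

anova(z1)        # step 10: tests the thorax:treatment interaction
anova(z, z1)     # step 11: direct comparison of the nested models

summary(z)$adj.r.squared   # step 12: compare adjusted R2 values
summary(z1)$adj.r.squared

AIC(z, z1)       # step 13: smaller AIC = "better" by this criterion
```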