# Fit models to data

This page provides tips and recommendations for fitting linear and nonlinear models to data. Updated and revised frequently (click the reload button on your browser to make sure you are seeing the most recent version).

The main model-fitting commands covered on this page are:

`lm`

— linear models for fixed effects`lmer`

and`lme`

— linear models for mixed effects`glm`

— generalized linear models for fixed effects`nls`

— nonlinear least squares`gam`

— generalized additive models

Also covered is the use of

`visreg`

— visualize model fits (fixed effects, but see here for random effects)`emmeans`

— obtain model predicted means and post hoc (Tukey) tests (formerly called`lsmeans`

)

**Note on lmer vs lme**:

`lmer`

(in the `lme4`

package) is emphasized here, but these pages also show how to use `lme`

(in the `nlme`

package). Luke (2017; Behav Res 49:1494–1502) shows that inference for linear mixed models using methods available in `lmer`

is more accurate than inference using the methods built into `lme`

. **Note on Sums of Squares**: The default `anova`

in R fits terms in the model **sequentially** (“type I sum of squares”). Other statistical packages such as SAS and JMP use **marginal** testing of terms in anova (“type III sum of squares”) instead. The question of *“which set of sums of squares is the Right Thing provokes low-level holy wars on R-help from time to time”* (see here).

Under sequential fitting, the sum of squares for each term or factor in the `anova`

table is the improvement in the error sum of squares when that term is added to the linear model, compared with a model including only the terms listed above it (but not those listed below it) in the table. This means that the order in which you list the variables in the `lm`

formula affects the `anova`

table of results, including the *P*-values. The formula `y~A+B+A:B`

will lead to different sums of squares than the formula `y~B+A+A:B`

when the design is unbalanced. Note that `anova`

also respects hierarchy: the intercept is fitted first, before any other terms, main effects are fitted next, and interactions are fitted last. An interaction is never tested without its corresponding main effects included in the model.

Under **marginal** testing of terms (“type III sum of squares”), order of appearance of terms in the formula doesn’t matter, and neither does hierarchy. The contribution of each model term is measured by the improvement in the error sum of squares when that term is entered last into the model. Main effects are tested with their interactions already in the model. The `Anova`

function in the `car`

package, combined with a change in the contrasts used to calculate sums of squares, can be used to fit models using type III sum of squares (and also Type II, an in-between solution). Instructions are given below.

A discussion of the pros and cons of type I and type III approaches can be found here, with further explanation provided here.

## Fit a linear model to data

Linear models for fixed effects are implemented in the `lm`

method in R. You can pass a data frame to the `lm`

command, using a formula to indicate the model you want to fit.

z <- lm(response ~ explanatory, data = mydata) z <- lm(response ~ explanatory, data = mydata, na.action = na.exclude)

The resulting object (which I've named z) is an `lm`

object containing all the results. You use additional commands to pull out these results, including residuals and predicted values. The argument `na.action = na.exclude`

is optional --- it tells R to keep track of cases having missing values, in which case the residuals and predicted values will have NA's inserted for those cases. Otherwise R just drops missing cases.

Here are some of the most useful functions to extract results from the `lm`

object:

summary(z) # parameter estimates and overall model fit plot(z) # plots of residuals, normal quantiles, leverage coef(z) # model coefficients (means, slopes, intercepts) confint(z) # confidence intervals for parameters resid(z) # residuals predict(z) # predicted values predict(z, newdata = mynewdata) # see below fitted(z) # predicted values anova(z1, z2) # compare fits of 2 models, "full" vs "reduced" anova(z) # ANOVA table (** terms tested sequentially **)

The command `predict(z, newdata = mynewdata)`

will used the model to predict values for new observations contained in the data frame `mynewdata`

. The explanatory variables (the X variables) in `mynewdata`

must have exactly the same names as in `mydata`

.

In the case of simple linear regression you can add a line to a scatter plot of the data using the `abline`

function.

plot(response ~ explanatory, data = mydata) abline(z)

### Basic types of linear models

In the examples below, I'm assuming that all the variables needed are in a data frame `mydata`

.

The simplest linear model fits a constant, the mean, to a single variable. This is useful if you want to obtain an estimate of the mean with a standard error and confidence interval, or wanted to test the null hypothesis that the mean is zero.

z <- lm(y ~ 1, data = mydata)

The most familiar linear model is the linear regression of y on x.

z <- lm(y ~ x, data = mydata)

To fit a linear regression model without an intercept (i.e., fit a regression through the origin) use one of the following modifications:

z <- lm(y ~ x - 1, data = mydata) z <- lm(y ~ 0 + x, data = mydata)

If A is a categorical variable (factor or character) rather than a numeric variable then the model conforms to a single factor ANOVA instead.

z <- lm(y ~ A, data = mydata)

More complicated models include more variables and their interactions

z1 <- lm(y ~ x1 + x2, data = mydata) # no interaction between x1 and x2 z2 <- lm(y ~ x1 + x2 + x1:x2, data = mydata) # interaction term present z2 <- lm(y ~ x1 * x2, data = mydata) # interaction term present

Analysis of covariance models include both numeric and categorical variables. The linear model in this case is a separate linear regression for each group of the categorical variable. Interaction terms between these two types of variables, if included in the model, fit different linear regression slopes; otherwise the same slope is forced upon every group.

z <- lm(y ~ x + A + x:A, data = mydata)

### Simple linear regression

Fit the model to data. Note: avoid calling the variables `mydata$y`

and `mydata$x`

in the `lm`

formula, so that R knows to find the variables in `mydata`

in later commands.

z <- lm(y ~ x, data = mydata)

Add the regression line to a scatter plot. Using `ggplot`

gives a regression line that doesn't extend beyond the data points.

plot(y ~ x, data = mydata) abline(z) # or ggplot(mydata, aes(y = y, x = x)) + geom_point(size = 2, col = "red") + geom_smooth(method = lm, se = FALSE) + theme(aspect.ratio = 0.80)

To obtain the regression coefficients (parameter estimates) with confidence intervals, as well as R^{2} values,

summary(z) # estimates of slope, intercept, SE's confint(z, level = 0.95) # 95% conf. intervals for slope and intercept

To test the null hypothesis of zero slope with the ANOVA table,

anova(z)

Add 95% confidence bands to a scatter plot. Use `visreg`

or `ggplot`

(you might need to install and load the `visreg`

package first).

visreg(z, points.par = list(pch = 16, cex = 1.2, col = "red")) # or ggplot(x, aes(y = age, x = black)) + geom_point(size = 2, col = "red") + geom_smooth(method = lm, se = TRUE) + theme(aspect.ratio = 0.80)

You can use the following also to include 95% prediction intervals on your `ggplot`

graph. Heed the warning from the `predict`

command.

x.p <- predict(z, interval = "prediction") x.p <- cbind.data.frame(x, x.p) ggplot(x.p, aes(y = age, x = black)) + geom_point(size = 2, col = "red") + geom_smooth(method = lm, se = TRUE) + geom_line(aes(y = lwr), color = "red", linetype = "dashed") + geom_line(aes(y = upr), color = "red", linetype = "dashed") + theme(aspect.ratio = 0.80)

Here are some useful diagnostic plots for checking assumptions. Most of the plots produced by `plot(z)`

will be self-explanatory, except perhaps the last one. "Leverage" calculates the influence that each data point has on the estimated parameters. "Cook's distance" measures the effect of each data point on the predicted values for all the other data points. A value greater than 1 is said to be worrisome.

plot(z) # residual plots, etc: keep hitting enter hist(resid(z)) # histogram of residuals

Finally, here are a couple of useful tricks to constrain parameter values for simple linear regression. `x`

and `y`

are numeric variables in the data frame `mydata`

.

z <- lm(y ~ x, data = mydata) # Ordinary linear regression, for comparison z <- lm(y ~ 1, data = mydata) # Force the regression line through the origin z <- lm(y ~ 1 + offset(1*x), data = mydata) # Force a slope of 1 through the data z <- lm(y ~ 1 + offset(b*x), data = mydata) # Force a slope of "b" through the data # (b must be a number)

Forcing a line through the origin (forcing an intercept of 0) results in only the slope being estimated. Forcing a slope of 1 or b through the data results in only the intercept being estimated. Unfortunately, `visreg`

doesn't work well for these models.

### Single factor ANOVA

Before fitting a linear model to the data, check that the categorical variable is a factor. If not, make it a factor, as this will help later when we fit the model.

is.factor(mydata$A) # or class(mydata$A) # result should be "factor" if A is a factor mydata$A <- factor(mydata$A) # convert categorical variable A to a factor

Check how R has sorted the factor groups (levels). You will see that R orders them alphabetically, by default, which isn't necessarily the most useful order.

levels(mydata$A)

For reasons that will become clear below, it is often useful to rearrange the order of groups so that any control group is first, followed by treatment groups*. Let's assume that there are four groups, "a" to "d", and that group "c" is the control. Change the order in R as follows:

mydata$A <- factor(mydata$A, levels=c("c","a","b","d")

*NB: Do not use the command "ordered" for this purpose, which will mess things up.

Fit the linear model to the data. `y`

is the numeric response variable and `A`

is the categorical explanatory variable (character or factor).

z <- lm(y ~ A, data = mydata)

Visualize the fitted model in a graph that includes the data points. The second command below is a fast but crude way to add the fitted values to the plot.

stripchart(y ~ A, vertical = TRUE, method = "jitter", pch = 16, col = "red", data = mydata) stripchart(fitted(z) ~ A, vertical = TRUE, add = TRUE, pch="------", method = "jitter", data = mydata)

The `visreg`

package provides a fast tool to visualize the model fit with a strip chart, including optional 95% confidence intervals for the predicted means.

library(visreg) # loads package, if not done yet z <- lm(y ~ A, data = mydata) # fit model visreg(z) # basic plot visreg(z, points.par = list(cex = 1.2, col = "red") # with points options visreg(z, whitespace = 0.4) # adjust space between groups

Or, add the predicted values to your strip chart with base R.

stripchart(y ~ A, vertical = TRUE, method = "jitter", pch = 16, col = "red", data = mydata) yhat <- tapply(fitted(z), mydata$A, mean) for(i in 1:length(yhat)){ lines(rep(yhat[i], 2) ~ c(i-.2, i+.2)) }

The `emmeans`

package (formerly called `lsmeans`

) estimates the group means fitted by the a single factor ANOVA model, including SE's and confidence intervals. It will also carry out post hoc (Tukey) tests between all pairs of means. You'll have to install and load the package first, if you haven't already done so.

library(emmeans) emmeans(z, "A", data = mydata) # use the real name of your factor in place of "A"

Note that these are not the SE's and confidence intervals you would calculate yourself on each group separately, because they use the residual mean square from the model fitted to all the data. This should generally result in smaller SE's and narrower confidence intervals. The calculation is valid if the assumption of equal variances within different groups is met.

To carry out Tukey tests between all pairs of means, use

grpMeans <- emmeans(z, "A", data = mydata) grpMeans pairs(grpMeans) cld(grpMeans)

The `pairs()`

command from the `emmeans`

package produces a table of Tukey pairwise comparisons between means. The `cld()`

command makes a table depicting comparisons between means using compact-letter display (e.g, "1", "2"). Two means sharing a symbol are not significantly different from each other.

Check assumptions of single factor ANOVA:

plot(z) # residual plots, etc hist(resid(z)) # histogram of residuals

Estimate parameters. By default in R, the intercept term estimates the mean of the first group (set this to be the control group or the most convenience reference group). Remaining parameters estimate the difference between the mean of each group and the first (control) group. Ignore the *P*-values because they are **invalid** for hypothesis tests of the differences except in the special case of a planned contrast.

summary(z) # coefficients table with estimates confint(z, level = 0.95) # conf. intervals for parameters

Test the null hypothesis of equal group means with the ANOVA table.

anova(z)

See how R is representing the categorical variable behind the scenes by using indicator (dummy) variables.

model.matrix(z)

### Fit multiple factors

When two or more categorical factors are fitted, the possibility of an interaction might also be evaluated. Remember that in R the order in which you enter the variables in the formula affects the `anova`

results. By default, `anova`

fits the variables **sequentially** ("type I sum of squares"), while at the same time respecting hierarchy. Some other statistics programs such as SAS and JMP use **marginal** fitting of terms instead ("type III sum of squares") instead. See notes on this at the top of the current page.

In the case of two factors, A and B, the linear model is as follows.

z <- lm(y ~ A + B, data = mydata) # additive model, no interaction fitted z <- lm(y ~ A * B, data = mydata) # main effects and interaction terms fitted summary(z) # coefficients table confint(z, level = 0.95) # confidence intervals for parameters anova(z) # A is tested before B; interaction tested last

We can use the `Anova`

command in the `car`

package to give us the anova table based on marginal fitting of terms ("type III sum of squares"). To accomplish this we first need to override R's default coding of categorical variables, as follows.

z <- lm(y ~ A * B, contrasts = c("contr.sum", "contr.poly"), data = mydata) library(car) Anova(z, type = 3)

Use the `visreg`

package to visualize the scatter of data around the model fitted values, and the contribution of each variable. When a *subset* of multiple variables is plotted, the method visualizes the predicted values for the plotted variable(s) *while conditioning on the value of the other variable* (adjusted to the most common category for other factor). In this case you are seeing a slice of the model but not the overall predicted means or fitted values. Note that this conditioning makes sense only when no interaction term is fitted. When *all* factors are plotted (e.g., both factors A and B), then the predicted values are indeed shown for every combination of groups (levels). You can choose to overlay the fits or plot in separate panels. Try different options to see which method is most effective.

library(visreg) visreg(z, xvar = "B", whitespace = 0.4, points.par = list(cex = 1.1, col = "red")) visreg(z, xvar = "A", by = "B", whitespace = 0.4, points.par = list(cex = 1.1, col = "red")) visreg(z, xvar = "A", by = "B", whitespace = 0.5, overlay = TRUE, band = FALSE, points.par = list(cex = 1.1))

Use the `emmeans`

package to obtain fitted group means. In the formulas below, use the real names of your factors in place of "A" and "B".

library(emmeans) z <- lm(y ~ A + B, data = mydata) # additive model, no interaction emmeans(z, "A", data = mydata) # group means for A variable, averaged over levels of B emmeans(z, c("A", "B"), data = mydata) # model fitted means* for all combinations of A and B groups z <- lm(y ~ A * B, data = mydata) # main effects and interaction terms fitted emmeans(z, c("A", "B"), data = mydata) # means for all combinations of A and B groups

These will not usually be the actual group means. Instead, they are the predicted means for group combinations under the constraint that no interaction between A and B is present.

### Fit a factor and a continuous covariate

Here, I use an example of a linear model having a numeric response variable (y) and two explanatory variables, one categorical (A) and the other numeric (x). (Fitting a model with both numerical and categorical variables is also known as ancova, or analysis of covariance.) Note that the order of terms in your formula affects the sums of squares and `anova`

tests. See the notes at the top of this page on sequential vs marginal fitting of terms. To fit terms sequentially, use the following methods.

z <- lm(y ~ x + A, data = mydata) # no interaction term included, or z <- lm(y ~ x * A, data = mydata) # interaction term present summary(z) # coefficients table confint(z, level = 0.95) # confidence intervals for parameters anova(z) # sequential fitting of terms

Check model assumptions:

plot(z) # residual plots, etc hist(resid(z)) # histogram of residuals

Use `Anova`

command in the `car`

package to carry out marginal fitting of terms instead ("type III sum of squares"). First, we need to override R's default contrasts.

z <- lm(y ~ x * A, contrasts = c("contr.sum", "contr.poly"), data = mydata) library(car) Anova(z, type = 3)

Use the `visreg`

package to help visualize the scatter of the data around the model fits and the contributions of each variable. When you plot a *subset* of the variables in the model, `visreg`

shows you their contributions when the data values are adjusted for the other factors, conditioning on the most common factor level (for categorical factors) or the median value (for numeric variables). In this case you aren't looking at the overall model fitted values. If you plot more than one variable, you can choose to overlay the visuals for the different levels of the factor, or you can plot in separate panels.

library(visreg) visreg(z, xvar = "A", whitespace = 0.4, points.par = list(cex = 1.1, col = "red")) visreg(z, xvar = "x", by = "A", points.par = list(cex = 1.1, col = "red")) visreg(z, xvar = "x", by = "A", whitespace = 0.4, overlay = TRUE, band = FALSE, points.par = list(cex = 1.1))

Or, add the separate regression lines for each group to the scatter plot yourself.

plot(y ~ x, pch = as.numeric(A), data = mydata) legend(locator(1), as.character(levels(A)), pch = 1:length(levels(A))) # click on plot to place legend groups <- levels(A) # stores group names for(i in 1:length(groups)){ xi <- x[A==groups[i]] # grabs x-values for group i yhati <- fitted(z)[A==groups[i]] # grabs yhat's for group i lines(yhati[order(xi)] ~ xi[order(xi)]) # connects the dots }

## Fit a linear mixed-effects model to data

Linear mixed-effects models are implemented with the `lmer`

function of the `lme4`

package in R, and with the `lme`

function of the `nlme`

package. These methods use restricted maximum likelihood (REML) to produce unbiased estimates of model parameters and to test hypotheses.

`lmer`

is recommended here because it uses the Satterthwaite or Kenward-Roger approximations to degrees of freedom. Luke (2017; Behav Res 49:1494–1502) has shown that these approximations lead to more accurate inference than is available in `lme`

, especially when sample size is small.

To begin, load the packages you will need (you might need to install them first). To use `lmer`

, load the `lmerTest`

package, which includes `lme4`

and additional useful methods.

library(visreg) library(lmerTest) # for lmer() in lme4 package library(nlme) # for lme() in nlme package library(emmeans)

The `lmer`

command expects a formula that includes fixed as well as random effects in parentheses. For example, a model for a single random effect `B`

and only an intercept for a fixed effect would looks like the following. `y`

is the response variable, and along with `B`

is contained in `mydata`

.

z <- lmer(y ~ 1 + (1|B), data = mydata)

In contrast, `lme`

expects two formulas, one for the fixed effects and the second for the random effects. The same analysis as the one above would look like the following:

z <- lme(y ~ 1, random = ~ 1|B, data = mydata) # avoid mydata$y and mydata$B notation

In both cases the first "`1`

" indicates that a fixed intercept is fitted. The second part, "`1|B`

" indicates that a separate random intercept is to be fitted for each group identified by the categorical variable `B`

.

The resulting object (here named `z`

) contains all the results. Here are some of the most useful commands to extract results from the `z`

object:

summary(z) # variances for random effects, fit metrics plot(z) # plot of residuals against predicted values VarCorr(z) # variance components for random effects confint(z) # lmer: conf. intervals for fixed effects and variances intervals(z) # lme: conf. intervals for fixed effects and variances resid(z) # residuals fitted(z) # best linear unbiased predictors (BLUPs) anova(z, type = 1) # lmer: test fixed effects sequentially (Type I SS) anova(z, type = 3) # lmer: as above but using Type III Sums of Squares anova(z) # lme: test fixed effects sequentially (Type I SS)

A **frequent error** when analyzing data with random effects is using the same labels to identify different sampling units in different groups. For example, the same number codes 1 through 5 might be used to label the 5 subplots of every plot. This will confuse R and cause frustration. Instead, use the codes a1 through a5 to label the subplots from plot "a", b1 through b5 to label the subplots taken from plot "b", and so on.

### Example: One random factor -- repeated measurements of individuals

The simplest example of a model with random effects is an observational study in which multiple measurements are taken on a random sample of individuals (or units, plots, etc) from a population. In this case "individual" is the random factor or group.

A frequent application is estimation of the **repeatability** of a trait or measurement. Lessels and Boag (1987, Auk 104: 116-121) define repeatability as "*the proportion of variance in a character that occurs among rather than within individuals*". It can be thought of as the correlation between repeated measurements made on the same individuals.

In the example below, a numeric variable `y`

is measured two (or more) times on randomly sampled individuals whose identities are recorded in the variable `B`

. Both variables are in the data frame `mydata`

.

z <- lmer(y ~ 1 + (1|B), data = mydata) # lme4/lmerTest package z <- lme(y ~ 1, random = ~ 1|B, data = mydata) # nlme package

Use the following commands to obtain estimates of the parameters (grand mean, for the fixed part; standard deviation and variances among groups for the random part, as well as the variance of the residuals).

summary(z) VarCorr(z) confint(z) # lmer: 95% confidence intervals fixed effects intervals(z) # lme: 95% confidence intervals fixed effects

Repeatability of a measurement is calculated as

*r* = var_{among} / (var_{among} + var_{within})

where var_{among} is the variance among the means of groups (individuals, in this case), and var_{within} is the variance among repeat measurements within groups. In the output of the `VarCorr`

command, the estimated variance among groups is shown as the variance associated with the random intercepts. The variance associated with the residual is the estimate of the within-group variance.

To see a plot of the residual values against fitted values,

plot(z)

You might notice a positive trend in the residuals, depending on the amount of variation between repeat measurements and the number of repeat measurements per individual in the data. To confirm what is happening, plot the original data and superimpose the `lme`

fitted (predicted) values:

stripchart(y ~ B, vertical = TRUE, pch = 1) stripchart(fitted(z) ~ B, vertical = TRUE, add = TRUE, pch = "---")

Notice that the predicted values are smaller than the mean for individuals having large values of `y`

, whereas the predicted values are larger than the mean for individuals having small values of `y`

. This is not a mistake. Fitting linear mixed effects models obtains "best linear unbiased predictors" for the random effects, which tend to lie closer to the grand mean than do the mean of the measurements of each individual. Basically, `lme`

is predicting that the "true" trait value for individuals having extreme measurements is likely to be closer, on average, to the grand mean than is the average of the repeat measurements. This effect diminishes the more measurements you have taken of each individual.

`visreg`

doesn't work in this simple case because the method requires at least one fixed factor in the model (not just an intercept).

### Example: Two nested random factors -- nested sampling designs

The above procedures can be extended to study designs in which sampled units are nested within other units, and there are no treatment fixed effects.

For example, a study might measure a variable `y`

in quadrats randomly sampled within transects. Transects are randomly placed within woodlots, and woodlots are randomly sampled from a population of woodlots in a region. Variable `A`

identifies the transect, and `B`

indicates the woodlot. The multiple values of `y`

for each combination of `A`

and `B`

are the quadrat measurements. This sampling design is modeled as

z <- lmer(y ~ 1 + (1|B/A), data = mydata) # lme4/lmerTest z <- lme(y ~ 1, random = ~ 1|B/A, data = mydata) # nlme

where "`B/A`

" expresses the nesting: "woodlot, and transect within woodlot". In this case, random intercepts are fitted to woodlots and to transects from each woodlot.

This same formulation could be used to model the results of a half-sib breeding design in quantitative genetics, where multiple offspring are measured per dam (`A`

), and multiple dams are mated to each sire (`B`

).

Take special care to label sampling units (e.g., transects) **uniquely**. For example, if you have 5 transects in each woodlot, don't give them the same labels 1 through 5 in each woodlot. Instead, label those from woodlot 1 as w1.1 through w1.5, and those from woodlot 2 as w2.1 through w2.5. Otherwise R won't compute the correct quantities.

In this model there are no treatment fixed effects to test, so the `anova`

command will not accomplish anything.

### Example: One fixed and one random factor -- randomized block, and subjects-by-treatment repeated measures designs

A randomized complete block (RCB) design is like a paired design but for more than two treatments. In the typical RCB design, every treatment is applied exactly once within every block. This yields a single value for the response variable from every treatment by block combination. In this case it is not possible to include an interaction term between treatment and block. In field experiments, blocks are typically experimental units (e.g., plots) grouped by location or in time. In lab experiments blocks may be separate environment chambers, or separate shelves within a single chamber. Used this way, blocks aren't true random samples from a population of blocks. Nevertheless blocks are usually modeled as random effects when the data are analyzed.

The data from a RCB experiment include the response variable `y`

, a categorical variable `A`

indicating the treatment (fixed effect), and a categorical variable `B`

indicating block (random effect). If you plan to use `visreg`

to visualize the scatter of the data around the model fits, convert `A`

and `B`

to factors before fitting.

"Subject" is the random effect in human or animal experiments in which each fixed treatment is applied once (in random order) to every individual subject or animal. These are also called "subjects by treatment" repeated measures designs, and they are often analyzed in the same way as RCB designs. Here, each treatment is assigned to every subject in **random order**. The analysis assumes that there is no carry-over between treatments: earlier measurements made on a subject under one treatment should not influence later measurements taken on the same subject in another treatment. The categorical variable `B`

indicates the identities of the subjects.

z <- lmer(y ~ A + (1|B), data = mydata) # lme4/lmerTest package z <- lme(y ~ A, random = ~ 1|B, data = mydata) # nlme package

The formulas are similar to that for a single random effect except that now the fixed part of the model includes the fixed effect `A`

as an explanatory variable. The random part of the model again fits an intercept to each of the random groups defined in `B`

(subjects or blocks).

You can use `interaction.plot`

to view how the mean `y`

changes between treatment levels of `A`

separately for each level of `B`

. Parallel lines would indicate a lack of an interaction between `A`

and `B`

. The drawback of this function is that it does not show the data (not a problem if you have just one data point for each combination of categories of `A`

and `B`

).

interaction.plot(A, B, y)

The `visreg`

function seems to work in this case to visualize model fits to data (although it won't include the confidence intervals). If `A`

is the fixed (treatment) effect and `B`

is the random effect, try one of the following.

visreg(z, xvar = "A") visreg(z, xvar = "A", by = "B", scales=list(rot = 90))

To obtain estimates of the fixed effects and variance components, with confidence intervals,

summary(z) VarCorr(z) confint(z) # lmer intervals(z) # lme

Use `emmeans`

to obtain model-based estimates of the treatment (`A`

) means.

emmeans(z, "A", data = mydata) # uses kenward-roger degrees of freedom method emmeans(z, "A", data = mydata, lmer.df = "satterthwaite") # uses satterthwaite degrees of freedom method

Use the `anova`

command to test the fixed effects (the grand mean and the treatment `A`

).

anova(z, type = 1) # lmer: test fixed effects sequentially (Type I SS) anova(z, type = 3) # lmer: as above but using Type III Sums of Squares anova(z) # lme: test fixed effects sequentially (Type I SS)

### Example: One fixed and one random factor -- factorial design

The approach to a factorial design with one fixed and one random factor is similar to the randomized complete block design except that each combination of levels of the fixed and random effect may be replicated.

Two-factor ANOVA is more complex when one of the factors is random because random sampling of groups adds extra sampling error to the design. This sampling error adds noise to the measurement of differences between group means for the fixed factors that interact with the random factor.

If `A`

is the fixed factor and `B`

is the random factor, use the following if you want to include the interaction between `A`

and `B`

. An interaction between a fixed and a random factor is a random effect.

z <- lmer(y ~ A + (1|B/A), data = mydata) # lme4/lmerTest package z <- lme(y ~ A, random = ~ 1|B/A, data = mydata) # nlme package

The corresponding formulas when there is no interaction term are

z <- lmer(y ~ A + (1|B), data = mydata) # lme4/lmerTest package z <- lme(y ~ A, random = ~ 1|B, data = mydata) # nlme package

To plot the residuals against the fitted values,

plot(z)

To test the fixed effects, use

anova(z, type = 1) # lmer: test fixed effects sequentially (Type I SS) anova(z, type = 3) # lmer: as above but using Type III Sums of Squares anova(z) # lme: test fixed effects sequentially (Type I SS)

Use `emmeans`

to obtain model-based estimates of the fixed factor (`A`

) means.

emmeans(z, "A", data = mydata)

### Example: One numeric variable and one random factor -- linear regression in multiple random groups

In the examples below, `y`

is the response variable, `x`

is the numeric explanatory variable, and `B`

identifies the random groups, with all variables in a data frame `mydata`

. `x`

is always modeled as a fixed effect.

The simplest model for linear regression in multiple random groups assumes equal slopes within each group. In this case, only a random intercept is modeled in each group.

z <- lmer(y ~ x + (1|B), data = mydata) # lme4/lmerTest package z <- lme(y ~ x, random = ~ 1|B, data = mydata) # nlme package

If slopes and intercepts both vary among random groups, then the model includes an interaction between `x`

and `B`

, which is a random effect.

z <- lmer(y ~ x + (x|B), data = mydata) # lme4/lmerTest package z <- lme(y ~ x, random = ~ x|B, data = mydata) # nlme package

To obtain estimates of the fixed effects and variance components, with confidence intervals,

summary(z) VarCorr(z) confint(z) # lmer intervals(z) # lme

Use `emmeans`

to obtain model-based estimates of the treatment (`A`

) means.

emmeans(z, "A", data = mydata)

Use the `anova`

command to test the fixed effects (the grand mean and the treatment `A`

).

anova(z, type = 1) # lmer: test fixed effects sequentially (Type I SS) anova(z, type = 3) # lmer: as above but using Type III Sums of Squares anova(z) # lme: test fixed effects sequentially (Type I SS)

### Example: Two random factors -- factorial design

Use `lmer`

in the `lmerTest`

package if fitting models to a factorial design with two crossed random factors, `A`

and `B`

.

z <- lmer(y ~ (1|A) + (1|B), data = mydata) # without interaction z <- lmer(y ~ (1|A) + (1|B) + (1|A:B), data = mydata) # with interaction

Accomplishing the same goal using `lme`

in the `nlme`

package is more challenging, and is explained here).

To obtain estimates of the fixed effects and variance components, with confidence intervals,

summary(z) VarCorr(z) confint(z) # lmer intervals(z) # lme

In this model there are no fixed effects to test, so the `anova`

command will not accomplish anything.

### Example: Two fixed and one random factor -- split plot design

A basic split plot design has two treatment variables, which are fixed factors, `A`

and `B`

. One of these is applied to whole plots (ponds, subjects, etc) as in a completely randomized design. All levels of the second factor `B`

are applied to subplots within every plots. For example, if there are two treatment levels of factor `B`

(e.g., a treatment and a control), then each plot is split in two; one of the treatments is applied to one side (randomly chosen) and the second is applied to the other side. In this case, each side of a given plot yields a single value for the response variable. Because levels of `B`

are repeated over multiple plots within each level of `A`

, it is possible to investigate the effects of each treatment variable as well as their interaction.

Let "plot" refer to the variable identifying the random plot (pond, subject, etc). The response variable is modeled as

z <- lmer(y ~ A * B + (1|plot), data = mydata) # lme4/lmerTest package z <- lme(y ~ A * B, random = ~ 1|plot, data = mydata) # nlme package

Notice that "`A * B`

" is shorthand for "`A + B + A:B`

", where the last term is the interaction between `A`

and `B`

.

Plot the residuals against the fitted values in a partial check of assumptions.

plot(z)

To visualize the model fit, `visreg`

seems to work in this case, since there is only one random factor.

visreg(z, xvar = "A", by = "B", scales=list(rot = 90))

Use `emmeans`

to generate model-based fitted means for treatments.

emmeans(z, c("A", "B"), data = mydata) # means of each treatment combination emmeans(z, c("A"), data = mydata) # means of "A" treatment levels averaged over levels of "B"

Use the `anova`

command to test the fixed effects (the grand mean and the treatment `A`

).

## Fit a generalized linear model

Generalized linear models for fixed effects are implemented using `glm`

in R. The `glm`

command is similar to the `lm`

command, used for fitting linear models, except that the error distribution and link function must be specified using the "family" argument.

For example, to model a binary response variable, specify a binomial error distribution and the logit link function:

z <- glm(response ~ explanatory, family = binomial(link="logit"), data = mydata)

For count data, specify the Poisson error distribution and the log link function.

z <- glm(response ~ explanatory, family = poisson(link="log", data = mydata)

If the variance of the error distribution in the data is greater than that expected under the binomial and poisson distributions ("overdispersion"), try the following instead,

family = quasibinomial(link = "logit") # or family = quasipoisson(link = "log"))

These error distributions model overdispersion for binary and count data, and additionally provide an estimate of the dispersion parameter (a value greater than one indicates overdispersion).

Type `help(family)`

to see other error distributions and link functions that can be modeled using `glm`

.

The resulting object (which I've named `z`

) is a **glm object** containing all the results. Here are some of the most useful commands used to extract results from the `glm`

object:

summary(z) # parameter estimates and overall model fit coef(z) # model coefficients resid(z) # deviance residuals predict(z) # predicted values on the transformed scale predict(z, se.fit = TRUE) # Includes SE's of predicted values fitted(z) # predicted values on the original scale anova(z, test = "Chisq") # Analysis of deviance - sequential anova(z1, z2, test = "Chisq") # compare fits of 2 models, "reduced" vs "full" anova(z, test = "F") # Use F test for gaussian, quasibinomial or quasipoisson link

"Deviance residuals" are not the same as ordinary residuals in linear models. They are a measure of the goodness of fit of the model to each data point. Each residual can be thought of as the contribution of the corresponding data point to the residual deviance (given in the analysis of deviance table).

There is no `plot(z)`

method for `glm`

model objects. If you enter `plot(z)`

, the resulting plots will look strange and won't be easy to interpret. R is using the plot function for `lm`

model objects, and they aren't valid for `glm`

objects. Instead, it is more useful to visualize the data by adding the fitted curves or means onto scatter plots or other graphs that show the data.

A method for calculating likelihood based confidence intervals for parameters is available in the MASS library. A function to estimate LD50's is also found here, for the case of a dose-response curve with a binary response variable.

library(MASS) confint(z, level = 0.95) # approximate 95% confidence intervals dose.p(z, p = 0.50) # LD50 for a dose-response curve

### Logistic and log-linear regression

These methods are analogous to linear regression. We assume that `x`

is a continuous numeric explanatory variable and that the response variable `y`

is either binary data (logistic regression) or count data (log-linear regression). The only difference between logistic and log-linear regression is the error distribution and link function, which are specified by the `family`

argument.

z <- glm(y ~ x, family = binomial(link="logit"), data = mydata) z <- glm(y ~ x, family = poisson(link="log"), data = mydata)

You can use `visreg`

to visualize the model on the transformed scale (the function uses `predict(z)`

to generate the result). Replace `x`

in the command below with the name of your x-variable (in quotes). The `glm`

method of `visreg`

fits a linear model on the transformed scale by default. This plot will show points, but they are not the transformed data, however. They are "working values" obtained by transforming the residuals of the fitted model on the original scale. `glm`

repeatedly recalculates the working values and the fitted model as it converges on the maximum likelihood estimate. `visreg`

shows you the results from the final iteration.

Use the `scale = "response"`

argument to see the fitted relationship on the original scale. You can then add the data points, but make sure you have enough room on your y-axis to plot all the points.

library(visreg) visreg(z, xvar = "x") # fit on the transformed scale visreg(z, xvar = "x", ylim = range(y), scale = "response") # fit on the original scale points(y ~ x) # add the data points

Use `fitted(z)`

to generate the predicted values corresponding to your data points on the original scale of the data. Since the response variable is discrete I have added a bit of jitter to the points to minimize overlap.

plot(jitter(y, amount = 0.02) ~ x, data = mydata) yhat <- fitted(z) lines(yhat[order(x)] ~ x[order(x)])

Approximate confidence bands can also be added to your scatter plot. `visreg`

can do this, but here I show how to calculate and plot them in base R. Confidence intervals are based on a normal approximation and assume a large sample size -- they are not very accurate for small and moderate sample sizes. Replace the 1.96 in the command below with 1 to plot standard error bands instead. First, calculate the upper and lower limits as follows.

zhat <- predict(z, se.fit = TRUE) # result on logit or log scale zupper <- zhat$fit + 1.96 * zhat$se.fit zlower <- zhat$fit - 1.96 * zhat$se.fit

Then convert to the original scale.

yupper <- exp(zupper)/(1 + exp(zupper)) # for logit link ylower <- exp(zlower)/(1 + exp(zlower)) # or yupper <- exp(zupper) # for log link ylower <- exp(zlower)

Finally, plot the bands.

lines(yupper[order(x)] ~ x[order(x)], lty = 2) lines(ylower[order(x)] ~ x[order(x)], lty = 2)

Use `summary`

to view the parameter estimates (slope and intercept) on the logit or log scale, along with standard errors. Note that *P*-values in the `summary()`

output are based on a normal approximation and are not accurate for small to moderate sample sizes -- use the log likelihood ratio test instead (see below).

summary(z)

To obtain 95% likelihood-based confidence intervals for the parameters, use a command from the MASS library

library(MASS) confint(z, level = 0.95)

Use the following command to obtain the analysis of deviance table. This provides log-likelihood ratio tests of the most important parameters. Model terms are tested __sequentially__, and the results will depend on the order in which variables are entered into the model formula.

anova(z, test = "Chisq")

### Categorical explanatory variable

Glm models can also be fitted to data when the explanatory variable is categorical rather than continuous. This is analogous to single-factor ANOVA in `lm`

, but here the response variable is binary or has three or more categories.

The glm models are as follows. `A`

is the categorical variable identifying treatment groups.

z <- glm(y ~ A, family = binomial(link="logit"), data = mydata) z <- glm(y ~ A, family = poisson(link="log"), data = mydata)

As in `lm`

, it may be useful to order the different groups (categories of `A`

) so that a control group is first, followed by treatment groups. This can help interpretation of the model parameters. For example, if there are 4 groups in `A`

and group "c" is the control group, set the order of the group

A <- factor(A, levels=c("c","a","b","d"))

As in `lm`

, you can also change the way that R models the categorical variable. For example, a useful approach is to fit a model in which the parameters are the group means (in `glm`

these means are fitted on the logit or log scale before being back-transformed to the original scale). To accomplish this, refit the linear model with a "-1" in the formula, which leaves out an intercept.

z <- glm(y ~ A - 1, family = binomial(link = "logit"), data = mydata) z <- glm(y ~ A - 1, family = poisson(link = "log"), data = mydata)

Use `emmeans`

to generate model-based fitted means for treatments.

library(emmeans) # load emmeans library emmeans(z, c("A"), data = mydata) # means of "A" treatment on transformed scale

You'll need to transform the means and the lower and upper limits of the confidence intervals to the original scale using the inverse of the link function.

### Modeling contingency tables

Binary response data ("success" = 1, "failure" = 0) are often compared among groups using contingency tables. For example, here are fictitious data on the number of successes and failures of independent units belonging to one of 4 groups identified as levels of the variable `A`

. The first row below has the variable names of each column.

A successes failures n a 5 10 15 b 15 20 35 c 25 30 55 d 35 40 75

There are two approaches to analyzing these data with `glm`

. Both yield identical results when the response variable is binary. Only the second approach can be used when the response variable has more than two categories.

The first approach is to fit a glm model to the proportion of successes, taking care to indicate the number of observations on which each proportion is based using the `weights`

option. Assume that the variables "successes" and `n`

are in your data frame `mydata`

. First calculate the proportion of successes.

mydata$prop <- mydata$successes/mydata$n z <- glm(prop ~ A, weights = n, family = binomial(link = "logit"), data = mydata)

The second approach is to model the interaction between the grouping variable `A`

and a single outcome variable (success/failure). To use this approach we need to reshape the data as a data frame of frequencies,

A freq outcome a 5 success b 15 success c 25 success d 35 success a 10 failure b 20 failure c 30 failure d 40 failure

We then fit the frequencies with a glm model whose linear predictor includes the grouping variable `A`

, the outcome variable `outcome`

, and their interaction

z <- glm(freq ~ A * outcome, family = poisson(link = "log"), data = mydata) summary(z) anova(z, test="Chisq")

The summary coefficients of interest are those corresponding to the interaction between `A`

and `outcome`

, rather than coefficients for the main effects for `A`

and `outcome`

. The interaction measures the extent to which probability of success or failure depends on `A`

, which is the point of the analysis. Similarly, the interaction is the term of principal interest in the analysis of deviance table.

### Checking assumptions of logistic and log-linear regression

Generalized linear models are freed from the assumption that residuals are normally distributed with equal variance, but the method nevertheless makes important assumptions that should be checked.

Unfortunately there is no `plot(z)`

method for `glm`

model objects. If you enter `plot(z)`

, the resulting plots will look strange because the data are discrete, and they won't be easy to interpret. (R is using the plot function for `lm`

model objects, which aren't valid for `glm`

objects.) Instead, it is more useful to graph the fitted curves or means onto plots of the data.

To check whether the fitted curve from the logistic or log-linear regression accurately captures the trend in the data, you might compare it to that of a `lowess`

fit. The `lowess`

method calculates a smooth curve that estimates the trend in the data (rather like a running mean). Is it similar to the `glm`

fitted curve?

plot(jitter(y, amount = 0.02) ~ x, data = mydata) lines(lowess(mydata$x, mydata$y))

To use an analogous smoothing method that also takes the error distribution and link function into account, try fitting a cubic spline instead of using `lowess`

. See the section "Fit a Cubic Spline" elsewhere on this page for help.

Logistic and log-linear regression assume that the "__dispersion parameter__" is 1. What this means is that the variance of the residuals are as expected from the binomial or Poisson distributions specified by the link function. In the case of count data, for example, a dispersion parameter of 1 means that the variance of the explanatory variable within each group should be approximately equal to the mean for the group. However, count data are notorious for having higher variances than expected (dispersion parameter > 1). One way to check is to tabulate the mean and variance for each group and see how they compare. Another way is to estimate the dispersion parameter directly by refitting the data to a quasi-likelihood model that includes such a parameter. For binary data (logistic regression) the quasi-likelihood model is fitted as follows.

z <- glm(y ~ x, family = quasibinomial(link="logit"), data = mydata) summary(z) anova(z, test = "F")

For count data the corresponding method is

z <- glm(y ~ x, family = quasipoisson(link="log"), data = mydata) summary(z) anova(z, test = "F")

If the estimated dispersion parameter is much greater than 1, then you should analyze your data using the quasibinomial or quasipoisson link function. This will lead to more reliable standard errors for parameters and more accurate *P*-values in tests. Note that you should use `anova(z, test = "F")`

, i.e., an F test, when testing hypotheses that use gaussian, quasibinomial or quasipoisson link functions. This is because the quasi-likelihood is not a real likelihood and the generalized log-likelihood ratio test is not accurate.

## Model selection

R includes a number of model selection tools to evaluate and compare the fits of alternative models to data. Most of the methods shown below work on `lm`

objects as well as `lme, glm, nl, gls`

and `gam`

objects.

### AIC and related quantities

These functions compute log-likelihoods and the Akaike information criterion for fitted models. To use them, begin by fitting the model of interest to the data,

z <- lm(y ~ x1 + x2 + x3, data = mydata)

Then apply the function of interest to the fitted model object, z.

logLik(z) # log-likelihood of the model fit AIC(z) # Akaike Information Criterion AIC(z, k = log(n)) # Bayesian Information Criterion (n is the sample size) BIC(z) # Bayesian Information Criterion

The function `AIC`

can be used to compare two or more models fitted to the same data. The model with the **smallest** AIC values is the "best". The same response variable must be fitted in each case, but the explanatory variables can differ.

z1 <- lm(y ~ x1 + x2 + x3, data = mydata) # first model z2 <- lm(y ~ x3 + x4, data = mydata) # second model c(AIC(z1), AIC(z2)) # vector of AIC values AIC(z1, z2) # same but returns a data frame

What matters is not AIC itself but the difference in AIC between fitted models. The model with the smallest AIC will have a difference of zero.

x <- c(AIC(z1), AIC(z2), AIC(z3)) # stores AIC values in a vector delta <- x - min(x) # AIC differences

Finally, we can use these quantities to calculate the Akaike weights for each of a set of fitted models. The model with the highest weight is the "best". The magnitude of the weight of a given model indicates the weight of evidence in its favor, given that one of the models in the set being compared is the best model. The weights are used in model averaging, a technique in multimodel inference. Weights should sum to 1.

x <- c(AIC(z1), AIC(z2), AIC(z3)) # stores AIC values in a vector delta <- x - min(x) # AIC differences L <- exp(-0.5 * delta) # relative likelihoods of models w <- L/sum(L) # Akaike weights

A 95% confidence set for the “best” model can be obtained by ranking the models and summing the weights until that sum is ≥ 0.95.

### BIC and related quantities

The computations are similar for BIC (Bayesian Information Criterion) as for AIC. Under the BIC criterion the model with the **smallest** BIC is the "best" in the set of models fitted.

To calculate BIC, use one of the following commands on the fitted model object "z".

AIC(z, k = log(n)) # n is the sample size BIC(z)

What matters is not BIC itself but the difference in BIC between models. The model with the smallest BIC will have a difference of zero. These differences can be used to compute the BIC weights. The model with the highest weight is the "best" model in the set.

x <- c(bic1, bic2) # stores two BIC values in a vector delta <- x - min(x) # BIC differences L <- exp(-0.5 * delta) # w <- L/sum(L) # BIC weights

BIC weights can be interpreted as posterior probabilities of the models, assuming that one of the models in the set is the "true" model (whatever that means), if all the models in the set have equal prior probabilities.

### All subsets regression

The `leaps`

command in the leaps library carries out an exhaustive search for the best subsets of the explanatory variables for predicting a response variable. The method is only suited to analyzing **numeric **variables, not categorical variables (a variable having just two categories can be recoded as a numeric variable with indicator 0's and 1's). To begin, load the leaps library,

library(leaps)

To prepare your data for analyzing, you will need to create a new data frame that contains only the explanatory variables you wish to include in the analysis. Don't include the response variable or other variables in that data frame. Any interaction terms to fit will need to be added manually.

x <- mydata[ ,c(2,5,6,7,11)] x$x1x2 <- x$x1 * x$x2 # interaction between x1 and x2

Then run leaps. Models are compared using Mallow's *C _{p}* rather than AIC, but the two quantities are related. We can compute AIC ourselves later.

z <- leaps(x, y, names=names(x)) # note that x comes before y

By default, the resulting object (here called z) will contain the results for the 10 best models for each value of *p*, the number of predictors, according to the criterion of smallest *C _{p}*. To increase the number of models saved for each value of

*p*to 20, for example, modify the

`nbest`

option.z<-leaps(x, y, names = names(x), nbest = 20)

Additional useful commands:

plot(z$size, z$Cp) # plotCvs number of predictors,_{p}plines(z$size, z$size) # adds the line forCi <- which(z$Cp==min(z$Cp)) # finds the model with smallest_{p}= pCvars <- which(z$which[i,]) # id variables of best model_{p}

At the end of this Rtips section on model selection below is a home-made (and rather slow) `leaps2aic`

command that will calculate additional quantities of interest from the leaps output. To use it, copy and paste the text into your R command window (you will need to do this only once in a session). These calculations assume that the the set of saved models from `leaps`

includes that model having the lowest AICc and AIC, which is likely. However, the `leaps`

output might not include all runners up (to be safe, increase the number of models saved using the `nbest`

option in `leaps`

). `leaps2aic`

computes differences in AIC (AIC relative to the best) as well as differences in AICc (AIC adjusted for small sample size) and BIC (Bayesian Information Criterion).

To use the `leaps2aic`

command, make certain that the arguments `x`

, `y`

and `z`

are exactly the same as the ones you passed to the leaps command immediately beforehand. `y`

is the response variable, `x`

is the data frame containing the explanatory variables, and `z`

is the saved output object from the leaps command.

z1 <- leaps2aic(x, y, z) z1[,c("model","AICc", "AIC")] # prints model and AIC differences

The leaps package also includes the command `regsubsets`

, which has the advantage of a formula method, but the output is not as functional.

z <- regsubsets(y ~ x1 + x2 + x3, data = mydata) summary(z) plot(z)

### stepAIC

`stepAIC`

is a command in the MASS library that will automatically carry out a restricted search for the "best" model, as measured by either AIC or BIC (Bayesian Information criterion) (but not AICc, unfortunately). `stepAIC`

accepts both categorical and numerical variables. It does not search all possible subsets of variables, but rather it searches in a specific sequence determined by the order of the variables in the formula of the model object. (This can be a problem if a design is severely unbalanced, in which case it may be necessary to try more than one sequence.) The search sequence obeys certain restrictions regarding inclusion of interactions and main effects, such as

- A quadratic term x
^{2}is not fitted unless x is also present in the model - The interaction a:b is not fitted unless both a and b are also present
- The threeway interaction a:b:c is not fitted unless all two-way interactions of a, b, c, are also present

and so on. `stepAIC`

focuses on finding the best model, and it does not yield a list of all the models that are nearly equivalent to the best.

To use `stepAIC`

you will need to load the MASS library

library(MASS)

To begin, fit a "full" model with all the variables of interest included. It can be quite a chore, and take multiple lines of script, to write down all the variables in the model. To simplify this procedure, put the single response variable and all the explanatory variables you want to investigate into a new data frame. Leave out all variables you are not interested in fitting. There's no need to include variables representing interactions. Then you can use one of the following shortcuts

z <- lm(y ~., data = mydata) # additive model, all variables z <- lm(y ~(.)^2, data = mydata) # includes 2-way interactions z <- lm(y ~(.)^3, data = mydata) # includes 3-way interactions

To use `stepAIC`

, include the fitted model output from the previous `lm`

command,

z1 <- stepAIC(z, upper=~., lower=~1, direction="both")

The output will appear on the screen. The output object (here called `z1`

) stores only `lm`

output from fitting the best model identified by `stepAIC`

. To obtain results from fitting the best model, use the usual commands for `lm`

objects, e.g.,

summary(z1) plot(z1) anova(z1)

Now, to interpret the output of `stepAIC`

. Start at the top and follow the sequence of steps that the procedure has carried out. First, it fits the full model you gave it and prints the AIC value. Then it lists the AIC values that result when it drops out each term, one at a time, leaving all other variables in the model. (It starts with the highest-order interactions, if you have included any.) It picks the best model of the bunch tested (the one with the lowest AIC) and then starts again. At each iteration it may also add variables one at a time to see if AIC is lowered further. The process continues until neither adding nor removing any single term results in a lower AIC.

To use the BIC criterion with `stepAIC`

, change the option "k" to the log of the sample size

z1 <- stepAIC(z, upper = ~., lower = ~1, direction = "both", k = log(nrow(mydata)))

Here is the `leaps2aic`

function.

-------------- leaps2aic <- function(x,y,z){ # Calculate AIC and AICc for every model stored in "z", the # output of "leaps". This version sorts the output by AICc. # Requires the same x and y objects inputted to "leaps" rows <- 1:length(z$Cp) for(i in 1:nrow(z$which)){ vars <- which(z$which[i,]) newdat <- cbind.data.frame(y, x[,vars]) znew <- lm(y~., data = newdat) L <- logLik(znew) aL <- attributes(L) z$model[i] <- paste(z$label[-1][vars],collapse=" + ") z$p[i] <- z$size[i] # terms in model, including intercept z$k[i] <- aL$df # parameters (including sigma^2) z$AIC[i] <- -2*L[1] + 2*aL$df z$AICc[i] <- -2*L[1] + 2*aL$df + (2*aL$df*(aL$df+1))/(aL$nobs-aL$df-1) z$BIC[i] <- -2*L[1] + log(aL$nobs)*aL$df } z$AIC <- z$AIC - min(z$AIC) z$AICc <- z$AICc - min(z$AICc) z$BIC <- z$BIC - min(z$BIC) result <- do.call("cbind.data.frame", list(z[c("model","p","k","Cp","AIC","AICc","BIC")], stringsAsFactors=FALSE)) result <- result[order(result$AICc),] rownames(result) <- 1:nrow(result) newdat <- cbind.data.frame(y,x) print(summary( lm( as.formula(paste("y ~ ",result$model[1])), data=cbind.data.frame(y, x) ))) return(result) }

## Fit nonlinear models to data

The function `nls`

in R allows fitting of nonlinear relationships between a response variable and one or more explanatory variables. Table 20.1 in Crawley's The R book lists a variety of nonlinear functions useful in biology. Examples of nonlinear functions include the power function,

y = ax# when y intercept is 0^{b}y = a + bx# when y intercept is not zero^{c}

the exponential function,

y = a(1 - e^{-bx}) # basic versiony = a - be^{-cx}# similar, but more flexible

and the Michaelis-Menten function,

y = ax/(b+x)# when y intercept is 0y = a + bx/(c+x)# when y intercept is not zero

Unlike `lm`

, `nls`

requires that the formula statement includes all parameters, including an intercept if you want one fitted. Whereas in `lm`

you would write the formula for linear regression as

y ~ x

to fit the same model in "nls" you would instead write the formula as

y ~ a + b*x

which includes "a" for the intercept and "b" for the slope. Including them in the formula tells the `nls`

function what to estimate.

To fit a power function to the data, the complete "nls" command would be

z <- nls(y ~ a * x^b, data = mydata, start = list(a=1, b=1))

Note that `nls`

requires that you provide reasonable starting values for the parameters, in this case "a" and "b", so that it knows where to begin searching. The program will then run through a series of iterations that bring it closer and closer to the least squares estimates of the parameter values. Your starting values don't have to be terribly close, but the program is more likely to succeed in finding the least squares estimates the closer you are.

The resulting object (which I've named `z`

) is an **nls object** containing all the results. You use additional commands to pull out these results. Here are some of the most useful commands to extract results from the `lm`

object:

summary(z) # parameter estimates and overall model fit coef(z) # model coefficients (means, slopes, intercepts) confint(z) # confidence intervals for parameters resid(z) # residuals fitted(z) # predicted values predict(z, newdata=x1) # predicted values for new observations anova(z1, z2) # compare fits of 2 models, "full" vs "reduced" logLik(z) # log-likelihood of the parameters AIC(z) # Akaike Information Criterion BIC(z # Bayesian Information Criterion

If nls has difficulty finding the least squares estimates, it may stop before it has succeeded. For example, it will stop and print an error if it runs out of iterations before it has converged on the solution (by default, the maximum number of iterations allowed is 50). You can help it along by increasing the number of iterations using the "control" option,

z <- nls(y ~ a * x^b, data = mydata, start = list(a=1,b=1), control = list(maxiter=200, warnOnly=TRUE)))

The control option `warnOnly=TRUE`

ensures that when the program stops prematurely, it prints a warning rather than an error message. The difference is that with a warning you get to evaluate the fit of the not-yet-converged model stored in the model object (here stored in "z"). Type `?nls.control`

to see other options.

## Fit a cubic spline

The function "gam" allows estimation of a function, *f,* relating a response variable `Y`

to a continuous explanatory variable `X`

,

`Y`

=`f`

(`X`

) + random error.

For example, data on individual fitness, such as survival or reproductive success (`Y`

), may be obtained to estimate the form of natural selection *f* on a trait `X`

. In this case *f* is a "fitness function".

A cubic spline can be used to estimate *f*, which may take any smooth form that the data warrant. The cubic spline is a type of generalized additive model (gam). As in the case of generalized linear models, the random errors may have a distribution that is normal, binomial, Poisson, or otherwise. The advantage of the spline method is that it is flexible and avoids the need to make prior assumptions about the shape of *f*. See Schluter (1988, Evolution 42: 849–861; reprint) for a more detailed explanation of its use in estimating a fitness function.

Key to fitting the data with a cubic spline is the choice of smoothing parameter that controls the "roughness" of the fitted curve. A small value for the smoothing parameter (close to 0) will lead to a very wiggly curve that passes close to each of the data points, whereas a large value for the smoothing parameter is equivalent to fitting a straight line through the data. In the procedure used here, the smoothing parameter is chosen as that value minimizing the generalized cross-validation (GCV) score. Minimizing this score is tantamount to maximizing the predictive ability of the fitted model.

To begin, load the built-in mgcv library in R,

library(mgcv)

The following command fits a cubic spline to the relationship between a response variable y and a single continuous variable `x`

, both of which are columns of the data frame, `mydata`

. The method will find and fit the curve with a smoothing parameter that minimizes the GCV score. If the response variable is binary, such as data on survival (0 or 1), then a logit link function should be used.

z <- gam(y ~ s(x), data = mydata, family = binomial(link = "logit"), method = "GCV.Cp")

Use instead `family=poisson(link = "log")`

for Poisson data and `family=gaussian(link = "identity")`

for normal errors.

The results are stored in the fitted "gam" object, here called "z". To extract results use the following commands.

summary(z) # regression coefficients fitted(z) # predicted (fitted) values on original scale predict(z) # predicted values on the logit (or log, etc) scale plot(z) # the spline on the logit (or log, etc) scale residuals(z, type = "response") # residuals

Other useful results stored in the gam object:

z$sp # "best" value for the smoothing parameter z$gcv.ubre # the corresponding (minimum) GCV score sum(z$hat) # effective number of parameters of fitted curve

To plot predicted values with Bayesian standard errors (plotted on the logit or log scale if residuals are binomial or Poisson).

plot(z, se = TRUE, seWithMean=TRUE) # plus or minus 2 standard error plot(z, se = 1, seWithMean=TRUE) # plus or minus 1 standard error

To plot predicted values with standard errors on the original scale, in the case of non-normal errors, supply the appropriate inverse of the link function. For example,

plot(z, se = 1, seWithMean = TRUE, rug = FALSE, shift = mean(predict(z)), trans = exp) # Poisson data plot(z, se = 1, seWithMean = TRUE, rug = FALSE, shift = mean(predict(z)), trans = function(x){exp(x)/(1+exp(x))}) # binomial data

To plot predicted values for nicely spaced x-values, not necessarily the original x-values, with 1 standard error for prediction above and below, use this modified procedure instead. The commands below assume that your variables are named `x`

and `y`

and are in a data frame named `mydata`

.

newx <- seq(min(mydata$x), max(mydata$x), length.out = 100) z1 <- predict(z, newdata=list(x = newx), se.fit = TRUE) invlogit <- function(x){exp(x)/(exp(x) + 1)} yhat <- invlogit(z1$fit) upper <- invlogit(z1$fit + z1$se.fit) lower <- invlogit(z1$fit - z1$se.fit) plot(newx, yhat, type="l", ylim = c(0,1)) lines(newx, upper, lty = 2) lines(newx, lower, lty = 2)

What is the relationship between the choice of smoothing parameter (lambda) and the GCV score? The following commands will plot the relationship for binary data.

lambda <- exp(seq(-4,0, by=.05)) # fit a range of lambdas >0 gcvscore <- sapply(lambda, function(lambda, mydata){ gam(y ~ s(x), data = mydata, family = binomial, sp = lambda, method="GCV.Cp")$gcv.ubre},mydata) plot(lambda, gcvscore, type = "l") # or plot(log(lambda), gcvscore, type = "l")