Phylogenetic

Phylogenetic tools

The ape package in R contains tools of phylogenetic comparative methods. For more details on the package see the online book by E. Paradis, Analysis of phylogenetics and evolution with R. A link to this book can be found on the “textbooks” tab of the course main web pages.

Load the ape library to begin. You will also need to load the nlme library later.

library(ape)

Phylogenetic trees

This section explains commands to read and write files containing phylogenetic trees, and to display trees in the plot window in R.

Read tree from file

Phylogenetic trees are usually represented in one of two text formats, newick and nexus. The formats are explained below. To read a tree contained in a text file, use the corresponding command.

mytree <- read.tree(file.choose())   # newick format
mytree <- read.nexus(file.choose())  # nexus format

The object mytree is a special type of list. The following commands access some of the information stored in mytree.

mytree               # prints basic tree information
mytree$tip.label     # species names
mytree$edge.length   # lengths of all tree branches
mytree$edge          # identity of branches (from node, to node)

View tree

Use the plot command to view a phylogeny stored in the object mytree. Include the cex option to help reduce overlap of species labels.

plot(mytree, cex=0.7)

Write trees to file

write.tree(file.choose())   # newick format
write.nexus(file.choose())  # nexus format

Tree formats

The newick format is used by most programs that estimate phylogenies. Trees in this format look like the following (you might need to scroll right within the box to see the entire line):

((((Homo:0.21,Pongo:0.21):0.28,Macaca:0.49):0.13,Ateles:0.62):0.38,Galago:1.00);

Each species at the tips of the phylogeny has a name followed by a number indicating the length of the branch connecting it to its immediate ancestor. Sister taxa — those sharing an immediate common ancestor — are wrapped by parentheses. In the above example, the innermost parentheses indicate that Homo and Pongo are sister taxa. The next set of parentheses outward indicate that Macaca is the sister to the Homo/Pongo group. And so on.

The nexus format is also widely used. The same tree shown above in newick format looks like the following in nexus format:

BEGIN TAXA;
DIMENSIONS NTAX = 5;
TAXLABELS
Homo
Pongo
Macaca
Ateles
Galago
;
END;
BEGIN TREES;
TRANSLATE
1       Homo,
2       Pongo,
3       Macaca,
4       Ateles,
5       Galago
;
TREE * UNTITLED = [&R] ((((1:0.21,2:0.21):0.28,3:0.49):0.13,4:0.62):0.38,5:1);
END;

Trait data

Trait data read from a text file should be stored in a data frame. Use the familiar commands in R to read the data, e.g.,

mydata <- read.csv(file.choose(), stringsAsFactors=FALSE)

Species names

Make sure the species names are included in the data frame. They should match to the letter (and case) the species names in the mytree object that stores the tree information (the species names are in mytree$tip.label). So be sure to check carefully.

Row names

Although not crucial, it can be convenient to change the row names of mydata to be the species names. For example, if the variable species in mydata contains the species names, set the rownames as follows

rownames(mydata) <- mydata$species

Row order

The species data in the data frame containing the trait measurements must be in the same row order as the names of the species in the tree object, mytree (species names are in mytree$tip.label). If this is not the case then you need to rearrange the rows of the data frame to match the order of species names in the tree object. The following command will do this for you; it assumes that the rownames of mydata are the species names:

mydata <- mydata[match(mytree$tip.label,rownames(mydata)),]

Phylogenetically independent contrasts (PIC)

This is a method that estimates and tests a regression or correlation between two variables while correcting for the nonindependence of data points resulting from phylogeny. It assumes that trait evolution mimics a Brownian motion process with unchanging rates through time and along all branches of the tree. This is a difficult assumption to verify.

Before beginning, make sure that the rows of the species data in mydata are in the same order as the names of the species in the tree object, mytree, namely mytree$tip.label. If not, see the previous section to rearrange.

PICs

The method transforms the two variables x and y separately into phylogenetically independent contrasts, x1 and y1. Then fit a linear model using these contrasts but leave out the intercept term so that the fitted line passes through the origin. The regression is though the origin because each contrast is calculated as a difference (e.g., species A minus species B) and the direction of the difference is arbitrary (it could just as well have been calculated as species B minus species A).

x1 <- pic(mydata$x, mytree)
y1 <- pic(mydata$y, mytree)

View x1 and y1 before entering the next command. Check especially for values listed as “NaN” or “Inf”. They may result if some of the tip branches in the tree have length 0 (the branch lengths are given in mytree$edge.length). See below for suggestions on what to do in this case.

If all is well, then fit the linear model through the origin. All the usual commands can be used to pull results from the fitted lm object, z.

z <- lm(y1 ~ x1 - 1)    # the "-1" forces line through origin

Correlation

Often the goal is to estimate a correlation coefficient rather than a regression coefficient. The method for calculating a correlation between independent contrasts differs from the usual calculation for a correlation because we need to account for the fact that the direction of the contrast is arbitrary. Both the following calculations should give the same answer. First calculate x1, y1, and z as in the PICs section above. Then calculate the correlation between independent contrasts as

r <- sum(x1*y1)/(sqrt(sum(x1^2))*sqrt(sum(y1^2)))  # or
r <- sqrt(summary(z)$r.squared)

Zero branch lengths

The PIC method in ape will crash if certain of the branches of the tree have length zero. If this happens to you, inspect the branch lengths.

mytree$edge.length        # lengths of all tree branches
range(mytree$edge.length) # shortest and longest branch

Try adding a very small number to each of the branches to prevent crashes caused by zero branch lengths. Here I add a constant “0.001” to each branch.

mytree$edge.length <-  mytree$edge.length + 0.001

Retry the PICs method to see if this fixes the problem.


General least squares method

General least squares (GLS) is a linear model technique that allows non-independent data points to be fitted when the expected similarity (“correlation”) between data points is known. In the present case, phylogeny provides the expected correlations, assuming that trait evolution conforms to a Brownian motion process. The basic method gives identical results to PICs. The advantage of analyzing the data using GLS instead of PICs is that it provides a more flexible framework for analyzing trait evolution.

Phylogenetic correlation matrix

GLS uses a matrix containing the expected correlations between the trait values of all pairs of species. The expected correlation between any pair is the proportion of evolutionary history, from root to tips, that they share through common descent. The phylogenetic correlation matrix is used to weight the data from each species in a linear model. You can view the matrix corresponding to a given phylogenetic tree using the following command:

vcv.phylo(mytree, cor=TRUE)

Here is the phylogenetic correlation matrix corresponding to the example tree above. The diagonal elements of the correlation matrix are “1”. The off-diagonal elements give the fraction of the total tree, from root to tip, that is shared between a given pair of species i and j. For example, Homo and Pongo share a great deal of their histories (ρ = 0.79), prior to branching apart. At the other extreme, Galago shares none of its history with any other species in this tree (ρ = 0).

Homo

Pongo

Macaca

Ateles

Galago

Homo

1

0.79

0.51

0.38

0

Pongo

0.79

1

0.51

0.38

0

Macaca

0.51

0.51

1

0.38

0

Ateles

0.38

0.38

0.38

1

0

Galago

0

0

0

0

1



If the off-diagonal elements of the correlation matrix are all zero, then this indicates a “star phylogeny” and the results of GLS will be the same as those from an ordinary linear model that ignores phylogeny.

Using GLS

The gls command is from the nlme library. The variables x and y are assumed to be in the data frame mytree. We are going back to the data here and are not using the PICs calculated earlier. Since we are modeling the data directly rather than the contrasts, we don’t remove the intercept term from the model formula.

library(nlme)
z <- gls(y ~ x, data=mydata, correlation=corBrownian(1,mytree))

The results are stored in the gls model object z. This object behaves similarly to a lm model object from ordinary linear model fitting, and a number of the same commands are used to pull out results.

plot(z)    # residual plot
summary(z) # coefficients, SE's, fit statistics
confint(z) # confidence intervals for coefficients

Pagel’s λ

The GLS approach makes it easy to ask whether it is even necessary to take phylogeny into account when fitting a linear model to species data. Pagel’s λ (“lambda”) is one way to evaluate the importance of phylogeny to species data. The method is founded on the assumption of a modified Brownian motion process.

Under the strict Brownian motion model, the expected correlation ρij between the trait values of any two species i and j is measured by the fraction of total phylogenetic history, from root to tip, that they share. An alternative possibility is that every correlation is weaker by the same amount, λ*ρij, where λ is a constant between 0 and 1. A λ value less than 1 implies that the effect of phylogeny is not as strong as that expected under a strict Brownian process, as though each species has experienced an extra bit of independent evolution since it split from its ancestor. If λ = 0 then there is no effect of phylogeny at all. λ is fitted by multiplying all the off-diagonal values of the phylogenetic correlation matrix by λ and then fitting a GLS model with this modified correlation matrix.

The method is simple to implement in ape. The three commands below fit models to the data in which λ is 1, 0, and 0.5.

z <- gls(y ~ x, data=mydata, correlation=corPagel(1,mytree, fixed = TRUE))
z <- gls(y ~ x, data=mydata, correlation=corPagel(0,mytree, fixed = TRUE))
z <- gls(y ~ x, data=mydata, correlation=corPagel(0.5,mytree, fixed = TRUE))

The following command finds and fits the maximum likelihood value for λ, given the data. The best-fit value of λ is a useful guide to the extent to which species differences are predicted by phylogeny.

z <- gls(y ~ x, data=mydata, correlation=corPagel(1,mytree, fixed = FALSE))

The Ornstein–Uhlenbeck process

Under Brownian motion, the expected or average difference between species is proportional to the amount of time since they split from a common ancestor. This assumes that species can evolve ever greater differences without constraint. However, real evolution seems to have bounds, even if they are soft. The Ornstein–Uhlenbeck (OU) process attempts to capture this idea of limits or constraints in a modified random walk that includes an “attractor” or “optimum.” (The process is sometimes said to represent a model of “stabilizing selection”, but this is confusing because the attractor isn’t an optimum in the population genetic sense.) Under an OU process, the farther a species trait evolves away from the attractor, the stronger the tendency for the next step in its random walk to be toward the attractor rather than away from it. Think of it as a Brownian motion process with a soft boundary.

Use gls to fit an OU model to the data as follows.

z <- gls(y ~ x, data=mydata, correlation=corMartins(1,mytree,fixed=TRUE))  # λ = 1
z <- gls(y ~ x, data=mydata, correlation=corMartins(1,mytree,fixed=FALSE)) # λ is estimated