Phylogenetic tools
The ape
package in R contains tools of phylogenetic comparative methods. For more details on the package see the online book by E. Paradis, Analysis of phylogenetics and evolution with R. A link to this book can be found on the “textbooks” tab of the course main web pages.
Load the ape
library to begin. You will also need to load the nlme
library later.
library(ape)
Phylogenetic trees
This section explains commands to read and write files containing phylogenetic trees, and to display trees in the plot window in R.
Read tree from file
Phylogenetic trees are usually represented in one of two text formats, newick and nexus. The formats are explained below. To read a tree contained in a text file, use the corresponding command.
mytree < read.tree(file.choose()) # newick format mytree < read.nexus(file.choose()) # nexus format
The object mytree
is a special type of list. The following commands access some of the information stored in mytree
.
mytree # prints basic tree information mytree$tip.label # species names mytree$edge.length # lengths of all tree branches mytree$edge # identity of branches (from node, to node)
View tree
Use the plot command to view a phylogeny stored in the object mytree
. Include the cex
option to help reduce overlap of species labels.
plot(mytree, cex=0.7)
Write trees to file
write.tree(file.choose()) # newick format write.nexus(file.choose()) # nexus format
Tree formats
The newick format is used by most programs that estimate phylogenies. Trees in this format look like the following (you might need to scroll right within the box to see the entire line):
((((Homo:0.21,Pongo:0.21):0.28,Macaca:0.49):0.13,Ateles:0.62):0.38,Galago:1.00);
Each species at the tips of the phylogeny has a name followed by a number indicating the length of the branch connecting it to its immediate ancestor. Sister taxa — those sharing an immediate common ancestor — are wrapped by parentheses. In the above example, the innermost parentheses indicate that Homo and Pongo are sister taxa. The next set of parentheses outward indicate that Macaca is the sister to the Homo/Pongo group. And so on.
The nexus format is also widely used. The same tree shown above in newick format looks like the following in nexus format:
BEGIN TAXA; DIMENSIONS NTAX = 5; TAXLABELS Homo Pongo Macaca Ateles Galago ; END; BEGIN TREES; TRANSLATE 1 Homo, 2 Pongo, 3 Macaca, 4 Ateles, 5 Galago ; TREE * UNTITLED = [&R] ((((1:0.21,2:0.21):0.28,3:0.49):0.13,4:0.62):0.38,5:1); END;
Trait data
Trait data read from a text file should be stored in a data frame. Use the familiar commands in R to read the data, e.g.,
mydata < read.csv(file.choose(), stringsAsFactors=FALSE)
Species names
Make sure the species names are included in the data frame. They should match to the letter (and case) the species names in the mytree
object that stores the tree information (the species names are in mytree$tip.label). So be sure to check carefully.
Row names
Although not crucial, it can be convenient to change the row names of mydata
to be the species names. For example, if the variable species
in mydata
contains the species names, set the rownames as follows
rownames(mydata) < mydata$species
Row order
The species data in the data frame containing the trait measurements must be in the same row order as the names of the species in the tree object, mytree
(species names are in mytree$tip.label
). If this is not the case then you need to rearrange the rows of the data frame to match the order of species names in the tree object. The following command will do this for you; it assumes that the rownames of mydata
are the species names:
mydata < mydata[match(mytree$tip.label,rownames(mydata)),]
Phylogenetically independent contrasts (PIC)
This is a method that estimates and tests a regression or correlation between two variables while correcting for the nonindependence of data points resulting from phylogeny. It assumes that trait evolution mimics a Brownian motion process with unchanging rates through time and along all branches of the tree. This is a difficult assumption to verify.
Before beginning, make sure that the rows of the species data in mydata
are in the same order as the names of the species in the tree object, mytree
, namely mytree$tip.label
. If not, see the previous section to rearrange.
PICs
The method transforms the two variables x and y separately into phylogenetically independent contrasts, x1 and y1. Then fit a linear model using these contrasts but leave out the intercept term so that the fitted line passes through the origin. The regression is though the origin because each contrast is calculated as a difference (e.g., species A minus species B) and the direction of the difference is arbitrary (it could just as well have been calculated as species B minus species A).
x1 < pic(mydata$x, mytree) y1 < pic(mydata$y, mytree)
View x1 and y1 before entering the next command. Check especially for values listed as “NaN” or “Inf”. They may result if some of the tip branches in the tree have length 0 (the branch lengths are given in mytree$edge.length
). See below for suggestions on what to do in this case.
If all is well, then fit the linear model through the origin. All the usual commands can be used to pull results from the fitted lm
object, z.
z < lm(y1 ~ x1  1) # the "1" forces line through origin
Correlation
Often the goal is to estimate a correlation coefficient rather than a regression coefficient. The method for calculating a correlation between independent contrasts differs from the usual calculation for a correlation because we need to account for the fact that the direction of the contrast is arbitrary. Both the following calculations should give the same answer. First calculate x1
, y1
, and z
as in the PICs section above. Then calculate the correlation between independent contrasts as
r < sum(x1*y1)/(sqrt(sum(x1^2))*sqrt(sum(y1^2))) # or r < sqrt(summary(z)$r.squared)
Zero branch lengths
The PIC method in ape
will crash if certain of the branches of the tree have length zero. If this happens to you, inspect the branch lengths.
mytree$edge.length # lengths of all tree branches range(mytree$edge.length) # shortest and longest branch
Try adding a very small number to each of the branches to prevent crashes caused by zero branch lengths. Here I add a constant “0.001” to each branch.
mytree$edge.length < mytree$edge.length + 0.001
Retry the PICs method to see if this fixes the problem.
General least squares method
General least squares (GLS) is a linear model technique that allows nonindependent data points to be fitted when the expected similarity (“correlation”) between data points is known. In the present case, phylogeny provides the expected correlations, assuming that trait evolution conforms to a Brownian motion process. The basic method gives identical results to PICs. The advantage of analyzing the data using GLS instead of PICs is that it provides a more flexible framework for analyzing trait evolution.
Phylogenetic correlation matrix
GLS uses a matrix containing the expected correlations between the trait values of all pairs of species. The expected correlation between any pair is the proportion of evolutionary history, from root to tips, that they share through common descent. The phylogenetic correlation matrix is used to weight the data from each species in a linear model. You can view the matrix corresponding to a given phylogenetic tree using the following command:
vcv.phylo(mytree, cor=TRUE)
Here is the phylogenetic correlation matrix corresponding to the example tree above. The diagonal elements of the correlation matrix are “1”. The offdiagonal elements give the fraction of the total tree, from root to tip, that is shared between a given pair of species i and j. For example, Homo and Pongo share a great deal of their histories (ρ = 0.79), prior to branching apart. At the other extreme, Galago shares none of its history with any other species in this tree (ρ = 0).

Homo 
Pongo 
Macaca 
Ateles 
Galago 
Homo 
1 
0.79 
0.51 
0.38 
0 
Pongo 
0.79 
1 
0.51 
0.38 
0 
Macaca 
0.51 
0.51 
1 
0.38 
0 
Ateles 
0.38 
0.38 
0.38 


Galago 
0 
0 
0 
0 

If the offdiagonal elements of the correlation matrix are all zero, then this indicates a “star phylogeny” and the results of GLS will be the same as those from an ordinary linear model that ignores phylogeny.
Using GLS
The gls
command is from the nlme library. The variables x
and y
are assumed to be in the data frame mytree
. We are going back to the data here and are not using the PICs calculated earlier. Since we are modeling the data directly rather than the contrasts, we don’t remove the intercept term from the model formula.
library(nlme) z < gls(y ~ x, data=mydata, correlation=corBrownian(1,mytree))
The results are stored in the gls
model object z
. This object behaves similarly to a lm
model object from ordinary linear model fitting, and a number of the same commands are used to pull out results.
plot(z) # residual plot summary(z) # coefficients, SE's, fit statistics confint(z) # confidence intervals for coefficients
Pagel’s λ
The GLS approach makes it easy to ask whether it is even necessary to take phylogeny into account when fitting a linear model to species data. Pagel’s λ (“lambda”) is one way to evaluate the importance of phylogeny to species data. The method is founded on the assumption of a modified Brownian motion process.
Under the strict Brownian motion model, the expected correlation ρ_{ij} between the trait values of any two species i and j is measured by the fraction of total phylogenetic history, from root to tip, that they share. An alternative possibility is that every correlation is weaker by the same amount, λ*ρ_{ij}, where λ is a constant between 0 and 1. A λ value less than 1 implies that the effect of phylogeny is not as strong as that expected under a strict Brownian process, as though each species has experienced an extra bit of independent evolution since it split from its ancestor. If λ = 0 then there is no effect of phylogeny at all. λ is fitted by multiplying all the offdiagonal values of the phylogenetic correlation matrix by λ and then fitting a GLS model with this modified correlation matrix.
The method is simple to implement in ape
. The three commands below fit models to the data in which λ is 1, 0, and 0.5.
z < gls(y ~ x, data=mydata, correlation=corPagel(1,mytree, fixed = TRUE)) z < gls(y ~ x, data=mydata, correlation=corPagel(0,mytree, fixed = TRUE)) z < gls(y ~ x, data=mydata, correlation=corPagel(0.5,mytree, fixed = TRUE))
The following command finds and fits the maximum likelihood value for λ, given the data. The bestfit value of λ is a useful guide to the extent to which species differences are predicted by phylogeny.
z < gls(y ~ x, data=mydata, correlation=corPagel(1,mytree, fixed = FALSE))
The Ornstein–Uhlenbeck process
Under Brownian motion, the expected or average difference between species is proportional to the amount of time since they split from a common ancestor. This assumes that species can evolve ever greater differences without constraint. However, real evolution seems to have bounds, even if they are soft. The Ornstein–Uhlenbeck (OU) process attempts to capture this idea of limits or constraints in a modified random walk that includes an “attractor” or “optimum.” (The process is sometimes said to represent a model of “stabilizing selection”, but this is confusing because the attractor isn’t an optimum in the population genetic sense.) Under an OU process, the farther a species trait evolves away from the attractor, the stronger the tendency for the next step in its random walk to be toward the attractor rather than away from it.
Use gls
to fit an OU model to the data as follows.
z < gls(y ~ x, data=mydata, correlation=corMartins(1,mytree,fixed=TRUE)) # λ = 1 z < gls(y ~ x, data=mydata, correlation=corMartins(1,mytree,fixed=FALSE)) # λ is estimated