Variation in the molecular substitution rate between lineages is a known factor when analyzing sequence data. This variation may be caused by the species intrinsic characteristics. The various ways speciesí characteristics have been shown to influence the rate of molecular evolution include generation time, metabolic rate, reproductive mode, mating preference, sex of carrier, habitat, or any other life history trait, or morphological feature.
traitRate is a program for detecting trait-dependent shifts in the rate of
molecular evolution. Currently, traitRate allows only characters with two
possible states (binary) to be analyzed. traitRate implements a likelihood
model that combines models of sequence and character (morphological) evolution.
The program outputs the relative rate of sequence evolution being at character
state 1 relative to character state 0, along with the log-likelihood of the
model and the log-likelihood of the null model that assumes that the state of
the binary character is not associated with the rate of evolution.
To run the program type traitRate followed by the path to the control file. The control file specifies the paths to the input/output files and the various options to be used. The obligatory inputs to traitRate are an ultrametric tree file in Newick format, a sequence alignment file in a number of common formats, and a character data file in a FASTA format. The user is responsible for the correspondence between the different files so that all extant taxa in the tree have both sequence and character data, and that all taxa with sequence and character data appear in the tree.
The first phase of the analysis is the estimation of the parameters of the underlying model. This is done with standard maximum likelihood techniques. The program outputs the maximum-likelihood parameter estimation and the log-likelihood of the null and alternative models. The likelihood ratio test, approximated with a chi-square distribution with one degree of freedom, can be used to compare between these two nested models. A log-likelihood difference of 1.92 between the two models is significant at the 0.05 level. Importantly, significant result using the likelihood ratio test between these two models is only the first condition and does not necessarily imply that the analyzed trait influences the rate of sequence evolution. A "parametric trait bootstrap" procedure (see below) should follow to verify whether the observed rate variation is caused by the analyzed trait compared to other, uncorrelated, traits that evolve in a similar manner. The last step is done using simulations, which may be computationally intensive. See below for directions regarding this procedure.
1. Input tree
The input species tree should be a rooted ultrametric tree in which the distances from the root to the tips are all equal. The program accepts either a single tree or multiple plausible trees as representing the phylogeny of the analyzed species.
To use a single tree, the option _treeFile in the control file should be used.
For multiple trees, the option _treesFile should be used. In this case, a multi-newick file should be used. See here for an example.
Note that computation time is much longer using multiple trees.
2. Number of stochastic mappings
The program relies on a sampling technique, called stochastic mapping, to approximate the likelihood of the combined sequence and character data. The approximation is more accurate when more mappings are used. However, the computation time increases linearly with the number of mappings. Although, practically, we find this parameter to have little influence on the results, we recommend trying several different numbers of mappings. The number of mappings is specified using the _stochasicMappingIterations option in the control file.
3. Running the parametric trait bootstrap procedure
Step one - simulate random character data. This procedure generates random character data by simulating character evolution along the input phylogeny using the parameters of the character model that were inferred for the original dataset.
In order to run this step, the parameters of the character model (_charModelParam1 and _charModelParam2) should be set to be equal to their inferred value. In addition set the following parameters in the control file:
_outDir <simulate directory>
The _endSimulationsIter parameter specifies the number of iteration to perform. As always, the more the better but 200 (or 100) should be enough. In addition, the same character data file and tree file as in the original run should be provided.
Step two - run traitRate for each simulated data. This step should,
preferentially run in parallel for the different simulated character datasets.
The input for each run should be the original tree file, the original sequence
file, and the simulated character file. For example, for the 10th simulated
data the following parameters should be specified in the control file:
_outDir <simulate directory>/sim_10/
_characterFile <simulate directory>/sim_10/simRandomChars.fasta
_seqFile <path to original sequence alignment file>
_treeFile <path to original species tree file>
Step three - assign significance level. Collect the difference in the log-likelihood score between the null and alternative models from all simulated runs. The log-likelihood difference observed for the original data should then be compared to the simulated distribution to produce a p-value. A perl script is provided here for this step.
|_mainType||The type of analysis to run.
Optimize_Model = Optimize the model parameters and infer relative rate
Run_Fix_Param = Run analysis with the specified parameters values
simulateFalsePositive = simulate random character data based on a given set of model parameters. See Running the parametric trait bootstrap procedure
|_treeFile or _treesFile||A path to the species phylogeny specifying one tree (_treeFile) or multiple trees (_treesFile) in Newick format||Either _treeFile or _treesFile is Obligatory|
|_characterFile||A path to a FASTA-format file with the character state assignments||Obligatory|
A path to a sequence alignment file.
Supported formats: Phylip, Clustal, Fasta, Mase, Molphy.
|_outDir||A path to the location of the output directory.||RESULTS|
|_outFile||Out file name||log.txt|
|_logFile||Log file name. Will be output in the _outDir directory||log.txt|
|_scaledTreeFile||Name of the scaled tree file. Will be output in the _outDir directory||scaled.tree|
|_sequenceType||nc (nucleotide sequence) prot (amino-acid sequence)||nc|
Nucleotide models: JC, K2, HKY, GTR|
Protein models: JTT, MT_REV, CP_REV, LG, WAG, AA_JC
See Sequence models
If 1, then the total length of the tree is fixed to a certain value. This value is optimized during the model optimization procedure.|
0 = total tree length will not be optimized.
A path to a reference tree file. The total branch lengths of this tree can be
used to fix the total branch lengths of the species tree.
When using this option, the _bScaleTree parameter can be set to 0.
|_stochasicMappingIterations||Number of stochastic mapping iterations||100|
An index specifying the first character simulation iteration|
See Running the parametric trait bootstrap procedure
An index specifying the last character simulation iteration|
See Running the parametric trait bootstrap procedure
The program combines models of sequence and character-state evolution such that rates of sequence change depend on the character state of a lineage at each point in time. The model parameters thus depend on the specified character and sequence models. Currently, only one character model is implemented that has two parameters and is applicable to binary data only. The model parameters are:
_relRate = the rate of sequence evolution being at character state 1 relative to state 0.
_charModelParam1 = the rate of change from state 1 to state 0.
_charModelParam2 = A scaling parameter that transforms the units of branch lengths from average number of nucleotide substitutions per site to average number of character transitions. The parameters of the sequence model are:
_gammaParam: specifies the single shape parameter of the gamma distribution that models among-site-rate-variation.
_gammaCategories: the number of discrete gamma categories.
_seqModelParam1-6: specify the rate matrix parameters. For example, when using the HKY model, _seqModelParam1 specifies the transition versus transversion rate bias.
_treeLength: the total tree length after the scaling procedure. This parameter allows the adjusted tree to be stretched or shrunk, as different choices for the relative rate parameter are explored.