traitRate: phenotype-genotype association detection

traitRate (VERSION 1.0)
traitRate is a program for the detection of trait-dependent shifts in the rate of sequence evolution.

Download the program
Current version is from August 2010.
For support and questions please email me: itaymay 'at' gmail.com
traitRate.exe (Windows)

Examples of input files:
params.txt (traitRate control file).
in_msa.fasta (input sequence alignment file).
in_chars.fasta (input characters assignment file).
model.tree (Newick tree file format).

You can try the program by typing
>traitRate.exe params.txt

Source code and copyrights
Source code (C++) is available for download here: [traitRate_source-1.0.tar.gz].
The sources include a large number of files that belong to a general phylogenetic C++ library and some C++ files specific to the traitRate program. In order to compile traitRate, first the phylogenetic lib should be compiled and then the traitRate program.
To compile the program follow these steps:
1. Go to directory libs/phylogeny/ by typing: "cd download_dir/libs/phylogeny/"
2. To compile the library type: "make doubleRep"
3. Go to the traitRate source directory by typing: "cd ../../programs/traitRate/"
4. Compile the program by typing: "make doubleRep".
The name of the executable will be traitRate.doubleRep

If there are problems with the compilations (occasionally, with various versions of g++) - please email me and I'll try to help. To modify the code, or use parts of it for other purposes, permission is requested. Please contact Itay Mayrose at itay 'at' gmail.com

In citing the traitRate program please refer to

Mayrose I and Otto SP. 2010. A Likelihood Method for Detecting Trait-Dependent Shifts in the Rate of Molecular Evolution. Mol Biol Evol.

Overview

Variation in the molecular substitution rate between lineages is a known factor when analyzing sequence data. This variation may be caused by the species intrinsic characteristics. The various ways species’ characteristics have been shown to influence the rate of molecular evolution include generation time, metabolic rate, reproductive mode, mating preference, sex of carrier, habitat, or any other life history trait, or morphological feature.

traitRate is a program for detecting trait-dependent shifts in the rate of molecular evolution. Currently, traitRate allows only characters with two possible states (binary) to be analyzed. traitRate implements a likelihood model that combines models of sequence and character (morphological) evolution. The program outputs the relative rate of sequence evolution being at character state 1 relative to character state 0, along with the log-likelihood of the model and the log-likelihood of the null model that assumes that the state of the binary character is not associated with the rate of evolution.

Methodology
To run the program type traitRate followed by the path to the control file. The control file specifies the paths to the input/output files and the various options to be used. The obligatory inputs to traitRate are an ultrametric tree file in Newick format, a sequence alignment file in a number of common formats, and a character data file in a FASTA format. The user is responsible for the correspondence between the different files so that all extant taxa in the tree have both sequence and character data, and that all taxa with sequence and character data appear in the tree.

Significance tests
The first phase of the analysis is the estimation of the parameters of the underlying model. This is done with standard maximum likelihood techniques. The program outputs the maximum-likelihood parameter estimation and the log-likelihood of the null and alternative models. The likelihood ratio test, approximated with a chi-square distribution with one degree of freedom, can be used to compare between these two nested models. A log-likelihood difference of 1.92 between the two models is significant at the 0.05 level. Importantly, significant result using the likelihood ratio test between these two models is only the first condition and does not necessarily imply that the analyzed trait influences the rate of sequence evolution. A "parametric trait bootstrap" procedure (see below) should follow to verify whether the observed rate variation is caused by the analyzed trait compared to other, uncorrelated, traits that evolve in a similar manner. The last step is done using simulations, which may be computationally intensive. See below for directions regarding this procedure.

IMPORTANT NOTES
1. Input tree
The input species tree should be a rooted ultrametric tree in which the distances from the root to the tips are all equal. The program accepts either a single tree or multiple plausible trees as representing the phylogeny of the analyzed species.
To use a single tree, the option _treeFile in the control file should be used.
For multiple trees, the option _treesFile should be used. In this case, a multi-newick file should be used. See here for an example.
Note that computation time is much longer using multiple trees.

2. Number of stochastic mappings
The program relies on a sampling technique, called stochastic mapping, to approximate the likelihood of the combined sequence and character data. The approximation is more accurate when more mappings are used. However, the computation time increases linearly with the number of mappings. Although, practically, we find this parameter to have little influence on the results, we recommend trying several different numbers of mappings. The number of mappings is specified using the _stochasicMappingIterations option in the control file.

3. Running the parametric trait bootstrap procedure
Step one - simulate random character data. This procedure generates random character data by simulating character evolution along the input phylogeny using the parameters of the character model that were inferred for the original dataset.
In order to run this step, the parameters of the character model (_charModelParam1 and _charModelParam2) should be set to be equal to their inferred value. In addition set the following parameters in the control file:
_mainType simulateFalsePositive
_outDir <simulate directory>
_startSimulationsIter 0
_endSimulationsIter 199
The _endSimulationsIter parameter specifies the number of iteration to perform. As always, the more the better but 200 (or 100) should be enough. In addition, the same character data file and tree file as in the original run should be provided.

Step two - run traitRate for each simulated data. This step should, preferentially run in parallel for the different simulated character datasets. The input for each run should be the original tree file, the original sequence file, and the simulated character file. For example, for the 10th simulated data the following parameters should be specified in the control file:
_mainType Optimize_Model
_outDir <simulate directory>/sim_10/
_characterFile <simulate directory>/sim_10/simRandomChars.fasta
_seqFile <path to original sequence alignment file>
_treeFile <path to original species tree file>

Step three - assign significance level. Collect the difference in the log-likelihood score between the null and alternative models from all simulated runs. The log-likelihood difference observed for the original data should then be compared to the simulated distribution to produce a p-value. A perl script is provided here for this step.

Outputs
traitRate.res:
This file includes various run statistics as well as the inferred model parameters, frequencies of the two character states at the root of the tree, and the log-likelihood values of the null and alternative models.
log.txt
All outputs that are printed during an execution of the program
scaled.tree
A Newick tree file with branch lengths scaled according to the two-clock model. The character state assignment is appended by a hyphen to each taxon name. This tree is useful to visualize the estimated effect of the trait on the rate of sequence evolution.

Control file options
An example of a control option file can be found here. A description of each parameter is given below. If a parameter is missing from the file or remarked (using the # sign), its default value will be used

Parameter Description Default
_mainType The type of analysis to run.
Possible options:
Optimize_Model = Optimize the model parameters and infer relative rate
Run_Fix_Param = Run analysis with the specified parameters values
simulateFalsePositive = simulate random character data based on a given set of model parameters. See Running the parametric trait bootstrap procedure Optimize_Model
_treeFile or _treesFile A path to the species phylogeny specifying one tree (_treeFile) or multiple trees (_treesFile) in Newick format Either _treeFile or _treesFile is Obligatory
_characterFile A path to a FASTA-format file with the character state assignments Obligatory
_seqFile A path to a sequence alignment file.
Supported formats: Phylip, Clustal, Fasta, Mase, Molphy. Obligatory
_outDir A path to the location of the output directory. RESULTS
_outFile Out file name log.txt
_logFile Log file name. Will be output in the _outDir directory log.txt
_scaledTreeFile Name of the scaled tree file. Will be output in the _outDir directory scaled.tree
_sequenceType nc (nucleotide sequence) prot (amino-acid sequence) nc
_seqModelType Nucleotide models: JC, K2, HKY, GTR
Protein models: JTT, MT_REV, CP_REV, LG, WAG, AA_JC
See Sequence models JC
_bScaleTree If 1, then the total length of the tree is fixed to a certain value. This value is optimized during the model optimization procedure.
0 = total tree length will not be optimized. 1
_referenceTree A path to a reference tree file. The total branch lengths of this tree can be used to fix the total branch lengths of the species tree.
When using this option, the _bScaleTree parameter can be set to 0.
""
_stochasicMappingIterations Number of stochastic mapping iterations 100
_startSimulationsIter An index specifying the first character simulation iteration
See Running the parametric trait bootstrap procedure 0
_endSimulationsIter An index specifying the last character simulation iteration
See Running the parametric trait bootstrap procedure 0

Model parameters
The program combines models of sequence and character-state evolution such that rates of sequence change depend on the character state of a lineage at each point in time. The model parameters thus depend on the specified character and sequence models. Currently, only one character model is implemented that has two parameters and is applicable to binary data only. The model parameters are:
_relRate = the rate of sequence evolution being at character state 1 relative to state 0.
_charModelParam1 = the rate of change from state 1 to state 0.
_charModelParam2 = A scaling parameter that transforms the units of branch lengths from average number of nucleotide substitutions per site to average number of character transitions. The parameters of the sequence model are:
_gammaParam: specifies the single shape parameter of the gamma distribution that models among-site-rate-variation.
_gammaCategories: the number of discrete gamma categories.
_seqModelParam1-6: specify the rate matrix parameters. For example, when using the HKY model, _seqModelParam1 specifies the transition versus transversion rate bias.
_treeLength: the total tree length after the scaling procedure. This parameter allows the adjusted tree to be stretched or shrunk, as different choices for the relative rate parameter are explored.

Parameter	Description	Default
_mainType	The type of analysis to run. Possible options: Optimize_Model = Optimize the model parameters and infer relative rate Run_Fix_Param = Run analysis with the specified parameters values simulateFalsePositive = simulate random character data based on a given set of model parameters. See Running the parametric trait bootstrap procedure	Optimize_Model
_treeFile or _treesFile	A path to the species phylogeny specifying one tree (_treeFile) or multiple trees (_treesFile) in Newick format	Either _treeFile or _treesFile is Obligatory
_characterFile	A path to a FASTA-format file with the character state assignments	Obligatory
_seqFile	A path to a sequence alignment file. Supported formats: Phylip, Clustal, Fasta, Mase, Molphy.	Obligatory
_outDir	A path to the location of the output directory.	RESULTS
_outFile	Out file name	log.txt
_logFile	Log file name. Will be output in the _outDir directory	log.txt
_scaledTreeFile	Name of the scaled tree file. Will be output in the _outDir directory	scaled.tree
_sequenceType	nc (nucleotide sequence) prot (amino-acid sequence)	nc
_seqModelType	Nucleotide models: JC, K2, HKY, GTR Protein models: JTT, MT_REV, CP_REV, LG, WAG, AA_JC See Sequence models	JC
_bScaleTree	If 1, then the total length of the tree is fixed to a certain value. This value is optimized during the model optimization procedure. 0 = total tree length will not be optimized.	1
_referenceTree	A path to a reference tree file. The total branch lengths of this tree can be used to fix the total branch lengths of the species tree. When using this option, the _bScaleTree parameter can be set to 0.	""
_stochasicMappingIterations	Number of stochastic mapping iterations	100
_startSimulationsIter	An index specifying the first character simulation iteration See Running the parametric trait bootstrap procedure	0
_endSimulationsIter	An index specifying the last character simulation iteration See Running the parametric trait bootstrap procedure	0