Models for chromosome-number evolution

chromEvol (VERSION 1.3)
chromEvol is a program for analyzing changes in chromosome number along a phylogeny.

Download the program:
The current version is from January 2012 and includes some minor bug fixes.
For support and questions please email me: itaymay 'at' gmail.com
chromEvol.exe (Windows)

Examples of input files:
params.txt (chromEvol control file).
Arist.counts (chromosome counts file format).
Arist.tree (Newick tree file format).

You can try the program by typing
>chromEvol.exe params.txt

Source code and copyrights:
Source code (C++) is available for download here: [chromEvol_source-1.3.tar.gz].
The makefile within can be used to compile the executable (using the make or gmake commands). Alternatively, type:
g++ -o chromEvol.exe -O3 *.cpp -DDOUBLEREP

If there are problems with the compilation (these occur occasionally with some versions of g++), please email me and I'll try to help. To modify the code, or to use parts of it for other purposes, please request permission first. Please contact Itay Mayrose at itay 'at' gmail.com

In citing the chromEvol program please refer to:

Mayrose I, Barker MS, Otto SP. 2010. Probabilistic models of chromosome number evolution and the inference of polyploidy. Systematic Biology. 59(2):132-144

Overview:

Chromosome number is a remarkably dynamic feature of eukaryotic evolution. Chromosome numbers can change by a duplication of the whole genome (a process termed polyploidy), or by gaining or losing single chromosomes. Of the various mechanisms of chromosome number change, polyploidy has received significant attention because of the impact such an event may have on the organism. Polyploids often differ markedly from their progenitors in morphological, physiological, or life history characteristics, and these differences may contribute to the establishment and success of a polyploid species in novel ecological settings.

ChromEvol implements a series of likelihood models for the evolution of chromosome numbers. By comparing the fit of the different models to biological data, it may be possible to gain insight into the pathways by which chromosome number evolves. For each model, the program infers the set of ancestral chromosome numbers and estimates where along the tree polyploidy events (and other chromosome number changes) occurred.

Methodology:
To run the program, type chromEvol followed by the path to the control file. The control file specifies the paths to the input/output files and the various options to be used. The obligatory inputs to chromEvol are a tree file in Newick format and a chromosome counts file in the format described below. The counts should represent the haploid number. The user is responsible for the correspondence between the two files: all extant taxa in the tree must have chromosome count data, and all taxa with count data must appear in the tree.
The analysis proceeds in three phases. First, the parameters of the underlying model are estimated using standard maximum likelihood techniques. Then, the most likely ancestral chromosome numbers are inferred, as well as the probability of each chromosome number at each internal node. Finally, the program estimates the expected number of polyploidizations and single-chromosome transitions that have occurred. This last step is done using simulations, which may be computationally intensive; the trade-off between accuracy and running time can be controlled through the control file (see the _simulationsNum option).
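The correspondence between the tree and the counts file can be checked before running the program. The following is a minimal Python sketch and is not part of chromEvol: the function names are hypothetical, and the tip-label extraction assumes a simple Newick string without quoted names or comments.

```python
import re


def newick_leaf_names(newick: str) -> set[str]:
    """Return tip labels: names that directly follow '(' or ','.

    Assumption: labels are unquoted and contain no whitespace.
    """
    return set(re.findall(r"[(,]\s*([^\s(),:;]+)", newick))


def counts_taxa(counts_text: str) -> set[str]:
    """Return taxon names taken from the '>' header lines of a counts file."""
    return {
        line[1:].strip()
        for line in counts_text.splitlines()
        if line.startswith(">")
    }


def check_correspondence(newick: str, counts_text: str) -> tuple[set[str], set[str]]:
    """Return (tips without count data, counted taxa missing from the tree)."""
    tips = newick_leaf_names(newick)
    counted = counts_taxa(counts_text)
    return tips - counted, counted - tips


# Made-up inputs for illustration only.
tree = "((TAXA_A:0.1,TAXA_B:0.2)N2:0.3,TAXA_C:0.4)N1;"
counts = ">TAXA_A\n24\n>TAXA_B\n26\n>TAXA_D\n12\n"
missing_counts, missing_tips = check_correspondence(tree, counts)
print("Tips without counts:", missing_counts)
print("Counted taxa not in tree:", missing_tips)
```

Both sets should be empty before the files are passed to chromEvol.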

IMPORTANT NOTES:
1. The input phylogenetic tree is assumed to be rooted. The root name is printed in the chromEvol.res output file. To verify that this is the correct root node, the user should view the file allNodes.tree. If the default rooting is not correct, the root can be changed using the _rootAt option in the control file.
2. For efficient multi-dimensional optimization, the program sets an upper bound of 100.0 for the rate parameters. If one of the optimized model parameters is close to this upper bound, this indicates that the model parameters may not have reached their global optimum. The solution is to multiply all branch lengths by a certain factor (e.g., 10) and run the program again. Multiplying all branch lengths by a scalar can be done using the _branchMul option in the control file.
3. The branch lengths of the tree represent the expected number of chromosome-number transitions along the branch. When the branch lengths are exceptionally large (or small), the program will infer unrealistic ancestral states and will overestimate the number of transitions. By default, the program scales the input tree (multiplying all branches by a constant) so that the total tree length represents a realistic number of chromosome-number transitions across the whole tree. Use the _branchMul parameter in the control file (see below) to scale the tree by a specified scalar or to keep the branch lengths identical to the input tree (_branchMul = 1.0).
4. The program also accepts multiple chromosome counts for a certain species. For example, when 60% of the individuals in the population have a haploid chromosome count of 24 and 40% have a haploid chromosome count of 26, these should be written in the data file as follows:
>TAXA_B
24=0.6_26=0.4
If two chromosome numbers are valid for a specific taxon, for example, if both 24 and 25 are valid for TAXA_A, then each of these can be given a 0.5 probability:
>TAXA_A
24=0.5_25=0.5
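
The check described in note 2 can be sketched in Python as follows. This is only an illustration: the 5% tolerance used to decide what counts as "close" to the upper bound is an arbitrary assumption, the function name is hypothetical, and the rate values are made up rather than read from a real chromEvol.res file.

```python
UPPER_BOUND = 100.0  # upper bound on rate parameters stated in note 2


def rates_near_bound(rates: dict[str, float], tolerance: float = 0.05) -> list[str]:
    """Return the parameters within `tolerance` (relative) of the upper bound.

    Assumption: the tolerance is an arbitrary choice, not a chromEvol setting.
    """
    threshold = UPPER_BOUND * (1.0 - tolerance)
    return [name for name, value in rates.items() if value >= threshold]


# Made-up optimized values for illustration only.
estimated = {"_gainConstR": 99.7, "_lossConstR": 12.4, "_duplConstR": 0.8}
flagged = rates_near_bound(estimated)
if flagged:
    print("Near the upper bound:", flagged)
    print("Consider rerunning with a larger _branchMul, e.g. 10")
```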

Control file options:
An example of a control option file can be found here. This file specifies a model with 3 free parameters (_lossConstR, _gainConstR, _duplConstR) with the demi-polyploidy rate equal to the polyploidy rate. A description of each parameter is given below.

Parameter	Description	Default
_dataFile	A path to a file with the chromosome count data	Obligatory
_treeFile	A path to a tree file in Newick format	Obligatory
_outDir	A path to the output directory. The directory must already exist (the program will not create a new directory).	RESULTS
_mainType	Possible options: Run_Fix_Param = run the analysis with the specified parameter values; Optimize_Model = optimize the specified model parameters and then run the analysis; All_Models = run the analysis for each of the eight models (see Models comparison)	Optimize_Model
_maxChrNum	The maximal chromosome number allowed. Negative values (-X): set the maximal chromosome number allowed to be X units larger than the maximal chromosome number observed in the data file	-10 (10 units larger than the maximal chromosome number observed in the data file)
_minChrNum	The minimal chromosome number allowed. Negative values (-X): set the minimal chromosome number allowed to be X units smaller than the minimal chromosome number observed in the data file	1
_simulationsNum	The number of simulations used to compute the expected number of changes of each transition type along each branch. Note: this step is computationally expensive. Lower values result in faster computations with decreased accuracy	10000
_rootAt	The internal node assumed to represent the root of the tree	N1
_branchMul	If different from 1.0, all branch lengths of the tree are multiplied by this scalar. Should be used if one of the model parameters is close to its boundary value (100), or to scale the tree when the branch lengths are exceptionally large or small	999 (= scale the tree so that the total tree length equals the number of different character types)
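
The negative-value convention for _maxChrNum and _minChrNum can be illustrated with a short Python sketch. The function names are hypothetical and the observed counts are made up. Whether the program additionally restricts the resulting minimum (for example, to at least 1) is not stated here, so no such restriction is applied.

```python
def resolve_max_chr_num(setting: int, observed_max: int) -> int:
    """A negative setting -X means X units above the observed maximum."""
    return observed_max - setting if setting < 0 else setting


def resolve_min_chr_num(setting: int, observed_min: int) -> int:
    """A negative setting -X means X units below the observed minimum."""
    return observed_min + setting if setting < 0 else setting


observed = [7, 12, 14, 24]  # made-up haploid counts
print(resolve_max_chr_num(-10, max(observed)))  # default _maxChrNum setting
print(resolve_min_chr_num(1, min(observed)))    # default _minChrNum setting
print(resolve_min_chr_num(-2, min(observed)))   # 2 units below the observed minimum
```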

Model parameters
Currently the program supports 6 types of transitions between different chromosome numbers. The user may include all parameters in the model or choose to ignore some of them. By specifying different sets of parameters, the user may compare different hypotheses regarding the evolution of chromosome number along a given phylogeny. The model parameters are specified in the control file. To include a parameter, its name should be followed by a space and a positive number. If the Optimize_Model option is specified, this number is the initial parameter value for the optimization; if Run_Fix_Param is specified, the parameter is fixed to this value. To exclude a parameter, its name should be followed by the number -999.

Parameter	Description
_gainConstR	An increase by a single chromosome
_gainLinearR	Rates for single chromosome increases are dependent on the current chromosome number
_lossConstR	A decrease by a single chromosome
_lossLinearR	Rates for single chromosome decreases are dependent on the current chromosome number
_duplConstR	A duplication of the whole genome (polyploidy)
_demiPloidyR	A demi-polyploidization. This parameter allows for transitions from a genome with n haploid chromosomes to 1.5n (e.g., 4x to 6x). If -2 is specified, the rate of demi-polyploidy is set equal to that of polyploidy, so the number of model parameters does not increase
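
As an illustration of these conventions, the Python sketch below writes out a control file for the three-parameter model mentioned earlier (constant gain, loss and duplication rates, with the demi-polyploidy rate tied to the polyploidy rate). It assumes the control file holds one option per line in the form "name value"; the helper name, the initial rate values of 1.0 and the relative file paths are placeholders, not recommendations.

```python
def build_control_file(settings: dict[str, object]) -> str:
    """Render the settings as one '<name> <value>' line per option."""
    return "\n".join(f"{name} {value}" for name, value in settings.items()) + "\n"


settings = {
    "_mainType": "Optimize_Model",
    "_dataFile": "Arist.counts",  # example file names listed on this page
    "_treeFile": "Arist.tree",
    "_outDir": "RESULTS",         # must already exist
    "_gainConstR": 1.0,           # placeholder initial values for optimization
    "_lossConstR": 1.0,
    "_duplConstR": 1.0,
    "_demiPloidyR": -2,           # demi-polyploidy rate equal to polyploidy rate
    "_gainLinearR": -999,         # excluded from the model
    "_lossLinearR": -999,         # excluded from the model
}
control_text = build_control_file(settings)
print(control_text)
```

The printed text can be saved to a file and passed to chromEvol as the control file.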

Models comparison
A set of 8 models, each with a different set of parameters, can be optimized. The maximal log-likelihood values and AIC scores of each model are printed to the file modelsSummary.txt. To run this option, include the following line in the control file:
_mainType All_Models
The following models are run:

Model	Model parameters
CONST_RATE	_gainConstR, _lossConstR, _duplConstR
CONST_RATE_DEMI	_gainConstR, _lossConstR, _duplConstR = _demiPloidyR
CONST_RATE_DEMI_EST	_gainConstR, _lossConstR, _duplConstR, _demiPloidyR
CONST_RATE_NO_DUPL	_gainConstR, _lossConstR
LINEAR_RATE	_gainConstR, _gainLinearR, _lossConstR, _lossLinearR, _duplConstR
LINEAR_RATE_DEMI	_gainConstR, _gainLinearR, _lossConstR, _lossLinearR, _duplConstR = _demiPloidyR
LINEAR_RATE_DEMI_EST	_gainConstR, _gainLinearR, _lossConstR, _lossLinearR, _duplConstR, _demiPloidyR
LINEAR_RATE_NO_DUPL	_gainConstR, _gainLinearR, _lossConstR, _lossLinearR
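
To show how such a comparison works, the Python sketch below applies the standard AIC definition, AIC = 2k - 2 ln L, where k is the number of free parameters. The parameter counts are read off the table above (a tied _duplConstR = _demiPloidyR pair counts as one parameter). The log-likelihood values are made up, and the sketch assumes, without confirmation from the program, that modelsSummary.txt uses this same definition.

```python
# Number of free parameters per model, counted from the table above.
FREE_PARAMS = {
    "CONST_RATE": 3,
    "CONST_RATE_DEMI": 3,
    "CONST_RATE_DEMI_EST": 4,
    "CONST_RATE_NO_DUPL": 2,
    "LINEAR_RATE": 5,
    "LINEAR_RATE_DEMI": 5,
    "LINEAR_RATE_DEMI_EST": 6,
    "LINEAR_RATE_NO_DUPL": 4,
}


def aic(log_likelihood: float, n_params: int) -> float:
    """Standard Akaike information criterion: 2k - 2 ln L."""
    return 2.0 * n_params - 2.0 * log_likelihood


def best_model(log_likelihoods: dict[str, float]) -> str:
    """Return the model with the lowest AIC score."""
    return min(
        log_likelihoods,
        key=lambda model: aic(log_likelihoods[model], FREE_PARAMS[model]),
    )


# Made-up log-likelihoods for illustration only.
made_up = {"CONST_RATE": -120.0, "CONST_RATE_DEMI": -118.5, "LINEAR_RATE": -117.9}
for model, lnl in made_up.items():
    print(model, aic(lnl, FREE_PARAMS[model]))
print("Lowest AIC:", best_model(made_up))
```

The model with the lowest AIC score is preferred. A model with more parameters is chosen only if its gain in log-likelihood outweighs the penalty of 2 per extra parameter.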

Chromosome counts file format

Chromosome counts data should be supplied in a format similar to a FASTA file, with a few extensions. For each species in the input tree, two lines should be specified. The first line lists the species name, preceded by the symbol '>'. The species name must be identical to the name as it appears in the input tree file. The second line specifies the chromosome count for that species.
Extensions:
1. If the count for a certain species is unknown, the symbol 'X' can be used. The program then treats this species as missing data (similar to a gap in molecular sequence data).
2. The program also accepts multiple chromosome counts for a certain species (NOTE: THIS OPTION IS NOT FULLY TESTED).
There are two possible scenarios for this option.
(A) Two counts are sampled from a population. For example, when 60% of the individuals have a haploid chromosome count of 24 and 40% have a haploid chromosome count of 26, this should be written in the data file as follows:
>TAXA_B
24=0.6_26=0.4
(B) Two chromosome numbers are valid for a specific taxon. For example, if both 24 and 25 are valid for TAXA_A, then these two counts should be separated by '_' and each given a probability of 0.5:
>TAXA_A
24=0.5_25=0.5
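
The format above can be read with a few lines of Python. The sketch below is not part of chromEvol and its function names are hypothetical. It assumes strictly alternating header and count lines, treats 'X' as missing data, and treats a single number as having probability 1.0.

```python
from typing import Optional


def parse_count(field: str) -> Optional[dict[int, float]]:
    """Parse one count line.

    'X' -> None (missing data), '24' -> {24: 1.0},
    '24=0.6_26=0.4' -> {24: 0.6, 26: 0.4}.
    """
    field = field.strip()
    if field == "X":
        return None
    if "=" not in field:
        return {int(field): 1.0}
    distribution = {}
    for part in field.split("_"):
        number, probability = part.split("=")
        distribution[int(number)] = float(probability)
    return distribution


def parse_counts(text: str) -> dict[str, Optional[dict[int, float]]]:
    """Map each taxon name to its parsed count (assumes header/count line pairs)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    records = {}
    for header, value in zip(lines[0::2], lines[1::2]):
        if not header.startswith(">"):
            raise ValueError(f"expected a '>' header line, got {header!r}")
        records[header[1:]] = parse_count(value)
    return records


example = ">TAXA_A\n24=0.5_25=0.5\n>TAXA_B\n24=0.6_26=0.4\n>TAXA_C\nX\n>TAXA_D\n12\n"
parsed = parse_counts(example)
for taxon, count in parsed.items():
    print(taxon, count)
```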

Outputs:

chromEvol.res:
This file includes various run statistics as well as the inferred model parameters, frequencies of the chromosome numbers at the root of the tree, and the log-likelihood value of the optimized parameter set.
log.txt
All outputs that are printed during an execution of the program
allNodes.tree:
A tree file in Newick format that specifies the names of the nodes (internal and external) of the input tree. Internal node names are given in place of the bootstrap values and can be viewed in tree viewing programs such as njplot or FigTree.
mlAncestors.tree
A Newick tree file with the maximum-likelihood ancestor reconstruction. The reconstruction of ancestral states is printed as the bootstrap values and can be viewed using a tree viewing program.
posteriorAncestors.tree
A Newick tree file with the posterior probabilities of the two most probable chromosome numbers at each internal node. These probabilities are printed as the bootstrap values and can be viewed using a tree viewing program.
Exp.tree
Similar to posteriorAncestors.tree above. Lists, in a rather crude format, the expected number of gains//losses//polyploidy//demi-polyploidy events that are inferred to occur along each branch. These expectations are printed in place of the bootstrap values for the node at the lower end of the branch (further from the root) and can be viewed using a tree viewing program.
ancestorsProbs.txt
A table with the probability of each chromosome number to occur at each internal node.
expectations.txt
Lists branches with an expectation above 0.5 to have experienced a gain, a loss, a polyploidization, or a demi-polyploidization event. A table at the end of the file lists the expected number of gain/loss/polyploidy/demi-polyploidy events for all branches of the phylogeny. The name of the branch is given as the name of the node below it (further from the root).
modelsSummary.txt (only when the _mainType option All_Models is specified)
Lists the log-likelihood and AIC scores of all models. This should be used to choose the model that best fits a particular dataset.
