bModelTest: Bayesian site model selection for nucleotide data

Remco Bouckaert

doi:10.1101/020792

Abstract

bModelTest allows for a Bayesian approach to inferring a site model for phylogenetic analysis. It is based on trans dimensional MCMC proposals that allow switching between substitution models, whether gamma rate heterogeneity is used and whether a proportion of the sites is invariant. The model can be used with the set of reversible models on nucleotides, but we also introduce other sets of substitution models, and show how to use these sets of models. With the method, the site model can be inferred during the MCMC analysis and does not need to be pre-determined, as is now often the case in practice, by likelihood based methods.

1 Introduction

One of the choices that need to be made when performing a Bayesian phylogenetic analysis is which site model to use. A common approach is to use a likelihood based method like ModelTest [16], JModelTest [15], or JModelTest2 [4] to determine the substitution model and whether to use gamma rate hetero-geneity [21] and/or invariant sites. These three components form what we call the site model. The site model recommended by such likelihood based method is then used in the Bayesian analysis. A more elegant method is to co-estimate site model and phylogeny in a single analysis, thus preventing possible bias caused by the phylogeny assumed by the above methods. Obviously, in a Bayesian setting it is more natural to use Bayesian methods of model selection.

One way to select substitution models of DNA evolution is to use reversible jump between all possible reversible models [8], or just a nested set of model [2]. An alternative is to use stochastic Bayesian variable selection [5], though this does not address whether to use gamma rate heterogeneity or invariant sites. Wu et al. [20] use reversible jump for substitution models and furthermore select for each site whether to use gamma rate heterogeneity or not. Since the method divides sites among a set of substitution models, it does not address invariant sites.

We introduce a method that estimates all three components of the site model simultaneously, which combines selection of substitution model as with selection for rate heterogeneity model and invariant sites, all of these are using reversible jump. The method is implemented in the bModelTest package of BEAST 2 [3] with GUI support for BEAUti making it trivial to use. It is open source and available under LGPL licence. Source code, installation instructions and documentation can be found at https://github.com/BEAST2-Dev/bModelTest.

2 Method

Symmetric nucleotide models can be represented by a 4x4 rate matrix

with six rates r_ac, r_ag, r_at, r_cg, r_ct and r_gt. A particular grouping of rates can conveniently be represented by a six figure number where each of the six numbers corresponds to one of the six rates in the order listed before respectively. Rates that are grouped have the same number. For examples, model 123456 corresponds to a model where all rates are independent. This is also known as the GTR model. Model 1211211 corresponds to the HKY model. By convention, the lowest number representing a model is used, so even though 646646 and 212212 represent the same model, we only use 121121.

There are 203 reversible models in total [8]. However, it is well known that transitions (A↔C, and G↔T mutations) are much more likely than tranversions (the other mutations) [13, 17]. Hence grouping transition rates with transversion rates is not appropriate and these rates should be treated differently. If we restrict the set of substitution model such that we only allow grouping within transitions and within transversions (with the exception of model 111111, where all rates are grouped), only 31 models are left (see Figure 1 and details in Appendix). If one is interested in using named models, we can use Jukes Cantor [9, 6] (111111), HKY [10] (121121), TN93 [18] (121131), K81 [11] (123321), TIM [14] (123341), TVM [14] (123421),and GTR [19] (123456). However, to facilitate stepping between TIM and GTR during the MCMC (see proposals below) we like to use nested models, and models 123345 and 123451 nicely fit between TIM and GTR, leaving us with a set of 7 models (Figure 1).

The state space consist of the following parameters:

the model number M,
a variable size rate parameter (depending on model number) R,
a flag to indicate whether 1 or 4 gamma categories should be used,
a shape parameter α, used for gamma rate heterogeneity when there are 4 gamma categories,
a flag to indicate a category for invariant sites should be used,
the proportion of invariant sites p_inv,

Figure 1:

Model sets supported by bModelTest, all reversible models (left), transition/tranversion split models (middle), and named models (right). Arrows indicate which models can be reached by splitting a model. Note all models with the same number of groupings are at the same height.

Rates rates r_ac, r_ag, r_at, r_cg, r_ct and r_gt are determined from M and R. Further, we restrict R such that the sum of the six rates ∑r.. equals 6 in order to ensure identifiability.

2.1 Prior

By default, bModelTest uses the flat Dirichlet prior on rates from [8] taking the linear transformation in account. From emperical studies [13, 17], we know that transition rates tend to be higher than transition rates. It makes sense to encode this information in our prior and bModelTest allows for rates to get a different prior on transition rates (default log normal with mean 1 and standard deviation of 1.25) and transition rates (default exponential with mean 1).

An obvious choice for the prior on models is to use a uniform prior on all models. As Figure 1 shows, there are many more models with 3 parameters than with 1. An alternative allowed in bModelTest is to use a uniform prior on the number of groups. That way, Jukes Cantor and GTR get a prior probability of 1/6, since these are the only models with 1 and 6 groups respectively. Depending on the model set, a much lower probability is assigned to each of the other models such that the sum over models with k groups is 1/6.

For p_inv a uniform prior with interval (0, 1) is used and for α an exponential with mean 1. These priors only affect the posterior when the respective flags are set.

2.2 MCMC proposals

The probability of acceptance of a (possibly trans-dimensional) proposal [7] is

min{1, posterior ratio × proposal ratio × Jacobian}

where the posterior ratio is the posterior of the proposed state S′ divided by that of the current state S, the proposal ratio the probability of moving from S to S′ divided by the probability of moving back from S′ to S, and the Jacobian is the determinant of the matrix of partial derivatives of the parameters in the propsoed state with respect to that of the current state.

2.2.1 Model merge/split proposal

For splitting (or merging) substitution models, suppose we start with a model M. To determine the proposed model M′, we randomly select one of the child (or parent) nodes in the graph (as shown in Figure 1). This differs from Huelsen-beck’s approach [8] where first randomly a group is selected, then randomly a subgrouping created, since the graph based method is easier to generalise to other model sets (e.g. the one used in [12]). If there are no candidates to split (that is, model M = 123456 is GTR) the proposal returns the current state (this proposal is important to guarantee uniform sampling of models). Likewise, when attempting to merge model M = 111111, the currents state is proposed (M′ = 111111). Let r be the rate of the group to be split. We have to generate two rates r_i and r_j for the split into groups of size n_i and n_j. To ensure rates sum to 6, we select u uniformly from the interval (−n_ir, n_jr) and set r_i = r + u/n_i and r_j = r − u/n_j.

For a merge proposal, the rate of the merged group r from two split groups i and j with sizes n_i and n_j, as well as rates r_i and r_j is calculated as

When we select merge and split moves with equal probability, the proposal ratio for splitting becomes where |M_split| (and ) is the number of possible candidates to split (and merge) into from model M (and M′ respectively). The proposal ratio for merging is

The Jacobian for splitting is and for merging it is .

2.2.2 Rate exchange proposal

The rate exchange proposal randomly selects two groups, and exchanges a random amount such that the condition that all six rates sum to 6 is met. A random number is selected from the interval [0, δ] where δ is a parameter of the proposal (0.15 appears a good number). Let n_i, r_i, n_j and r_j as before, then the new rates are and . The proposal fails when .

The proposal ratio as well as the Jacobian are 1.

2.2.3 Birth/death proposal

Birth and death proposals set or unset the category count flag and sample a new value for α from the prior when the flag is set. The proposal ratio is d(α′) for birth and 1/d(α) for death where d(.) is the density used to sample from. Likewise for setting the proportion invariant flag and sampling p_inv from the prior. The Jacobian is 1 for all these proposals.

2.2.4 Scale proposal

For the α, we use the standard scale operator in BEAST 2 [3], adapted so it only samples if the category count flag is set for α. Likewise, for pInv this scale operator is used, but only if the proportion invariant flag is set.

3 Results

Since implementation of the split/merge and rate exchange proposals is not straightforward, nor getting the math right for proposal ratio and Jacobian, unit tests were written to guarantee their correctness and lack of bias in proposals.

To validate the method we performed a simulation study by drawing site models from the prior, then used these models to generate sequence data of 10K sites length on a tree (in Newick (A:0.2,(B:0.15,C:0.15):0.05)) with three taxa under a strict clock. The data was analysed using a Yule tree prior, a strict clock and bModelTest as site model with uniform prior over models and exponential with mean one for tranversions and log-normal with mean one and variance 1.25 for transition rates. A hundred alignments were generated with gamma rate heterogeneity and a hundred without rate heterogeneity using a BEASTShell [1] script. Invariant sites can be generated in the process and are left in the alignment.

Comparing the model used to generate the alignments with inferred models is best done by comparing the individual rates of these models. Figure 2 shows the rate estimates for the six rates agains the rates used to generate the data. Clearly, there is a high correlation between the estimated rates and the ones used to generate (R² > 0.99 for all rates). Results were similar with and without rate heterogeneity. Note values for rates AG and CT (middle panels) tend to be higher than the transversion rates due to the prior they are drawn from.

Figure 2:

Rate original (horizontal) against estimates (vertical) in simulated data. In reading order, rate AC, AG, AT, CG, CT and GT.

Figure 3 shows histograms with the proportion of time gamma shapes were used during the MCMC run. When data was generated without gamma shapes, gamma rate heterogeneity was not used most of the time (left of Figure 3), while gamma rates were used for most of the analyses most of the time when gamma rate heterogeneity was present (middle of Figure 3). ¹ When rate heterogeneity was present, shape estimates were fairly close to the ones used to generate the data (right of Figure 3). However, there were quite a few outliers, especially when the shape parameter was low, the estimate tended to be biased upward. This can happen due to the fact that when the gamma shape is small, a large proportion of sites gets a very low rate and the invariant category can model those instances. The mean number of invariant sites was 6083 when no rate heterogeneity was used, while it was 6907 when rate heterogeneity was used, a difference of about 8% of the sites.

Figure 3:

Proportion of time gamma rate heterogeneity is used without (left) and with (middle) rate heterogeneity used to generate the data. Right, gamma shape original (horizontal) against estimates (vertical) when rate heterogeneity is used to generate the data.

Figure 4 shows similar plots as Figure 3 but for the proportion invariant. The actual proportion invariant was determined empirically by counting the number of invariant sites in the generated alignments.² It appears that if there is less than 60% invariant sites, adding a category to model them does not give a better fit. When invariant rates were used they had a high correlation (R² = 0.77) with that of the empirical proportion invariant.

To compare results with jModelTest 2 [4], we used gopher data (available from BEAST 2 as file examples/nexus/gopher.nex) and GTR+G is given the highest likelihood score. So, it may be tempting to just use that model. However, with bModelTest we get a more nuanced picture (Figure 5) showing GTR only makes a small fraction of the posterior. Further, 78% of the time gamma rate heterogeneity is in use, while 60% of the posterior has an invariant site category.

Figure 4:

Proportion of time proportion invariant is used without (left) and with (middle) rate heterogeneity used to generate the data. Right, empirical proportion invariant in alignment (horizontal) against estimates (vertical) when rate heterogeneity is used to generate the data.

Figure 5:

Model distribution for gopher data using the transition/tranversion split models (left). Numbers on x-axis correspond to models in Appendix. Rate AC vs AG (middle) and AC vs AT (right) show no correlation for the former and high correlation for the latter.

The calculation of the tree likelihood consumes the bulk (>> 90%) of computational time. Note that for a category with invariant sites, the rate is zero, hence only sites that are invariant (allowing for missing data) contribute to the tree likelihood. The contribution is 1 for those sites for any tree and for any parameter setting, so by counting the number of invariant sites, the tree likelihood can be calculated in constant time. Switching between with and without gamma rate heterogeneity means switching between one and k rate categories, which requires k time as much calculation. Having two tree likelihood objects, one for each of these two scenarios, and a switch object that selects the one required allows use of the BEAST 2 updating mechanism [5] so that only the tree likelihood that needs updating is performing calculations.

4 Conclusion

bModelTest is a BEAST package which can be used in any analysis where trees are estimated based on DNA sequences, such as multi-species coalescent analysis (*BEAST), various forms of phylogeographical analyses, sampled ancestor analysis, demographic reconstruction using coalescent, birth death skyline analysis, etc. The GUI support provided through BEAUti makes it easy to set up an analysis with the bModelTest site model: just select bModelTest instead of the default gamma site model from the combo box in the site model panel.

bModelTest allows estimation of the site model using a full Bayesian approach, without the need to rely on non-Bayesian tools for determining aspects of the site model.

Acknowledgements

This work was funded by a Rutherford fellowship (http://www.royalsociety.org.nz/programmes/funds/rutherford-discovery/) from the Royal Society of New Zealand awarded to Prof. Alexei Drummond.

Appendix

View this table:

Footnotes

remco{at}cs.auckland.ac.nz
↵¹ Estimated shape parameters only take values of the shape parameter in account in the posterior sample where gamma rate heterogeneity was present.
↵² Estimated proportion invariant only take values of the proportion invariant parameter in account in the posterior sample where the invariant category was present.

References

1.↵
Remco R. Bouckaert. Beastshell – scripting for bayesian hierarchical clustering. Submitted, 2015.
2.↵
Remco R. Bouckaert, Mónica Alvarado-Mora, and Jõao Rebello Pinho. Evolutionary rates and hbv: issues of rate estimation with bayesian molecular methods. Antiviral therapy, 2013.
3.↵
Remco R. Bouckaert, Joseph Heled, Denise Kühnert, Tim Vaughan, Chieh-Hsi Wu, Dong Xie, Marc A Suchard, Andrew Rambaut, and Alexei J Drummond. Beast 2: a software platform for bayesian evolutionary analysis. PLoS Comput Biol, 10(4): e1003537, Apr 2014.
4.↵
Diego Darriba, Guillermo L Taboada, Ramón Doallo, and David Posada. jmodeltest 2: more models, new heuristics and parallel computing. Nature methods, 9(8): 772–772, 2012.
OpenUrl
5.↵
Alexei J. Drummond and Remco R. Bouckaert. Bayesian evolutionary analysis with BEAST. Cambridge University Press, 2015.
6.↵
J. Felsenstein. Evolutionary trees from dna sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17: 368–376, 1981.
7.↵
P.J. Green. Reversible jump markov chain monte carlo computation and bayesian model determination. Biometrika, 82: 711–732, 1995.
OpenUrl CrossRef Web of Science
8.↵
John P. Huelsenbeck, Bret Larget, and Michael E. Alfaro. Bayesian phylogenetic model selection using reversible jump markov chain monte carlo. Molecular Biology and Evolution, 21(6): 1123–1133, June 2004.
9.↵
1. H.N. Munro
T. Jukes and C. Cantor. Evolution of protein molecules. In H.N. Munro, editor, Mammaliam Protein Metabolism, pages 21–132. Academic Press, New York, 1969.
10.↵
M. Kimura. A simple model for estimating evolutionary rates of base sub-stitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16: 111–120, 1980.
OpenUrl CrossRef PubMed Web of Science
11.↵
Motoo Kimura. Estimation of evolutionary distances between homologous nucleotide sequences. Proceedings of the National Academy of Sciences, 78(1): 454–458, 1981.
OpenUrl Abstract/FREE Full Text
12.↵
Mark Pagel and Andrew Meade. Bayesian analysis of correlated evolution of discrete characters by reversible-jump markov chain monte carlo. The American Naturalist, 167(6): 808–825, 2006.
OpenUrl CrossRef PubMed Web of Science
13.↵
Lúisa Pereira, Fernando Freitas, Verónica Fernandes, Joana B Pereira, Marta D Costa, Stephanie Costa, Valdemar Máximo, Vincent Macaulay, Ricardo Rocha, and David C Samuels. The diversity present in 5140 human mitochondrial genomes. Am J Hum Genet, 84(5): 628–40, May 2009.
OpenUrl CrossRef PubMed Web of Science
14.↵
David Posada. Using modeltest and paup* to select a model of nucleotide substitution. Current protocols in bioinformatics, pages 6–5, 2003.
15.↵
David Posada. jmodeltest: phylogenetic model averaging. Molecular biology and evolution, 25(7): 1253–1256, 2008.
OpenUrl CrossRef PubMed Web of Science
16.↵
David Posada and Keith A. Crandall. Modeltest: testing the model of dna substitution. Bioinformatics, 14(9): 817–818, 1998.
OpenUrl CrossRef PubMed Web of Science
17.↵
N.A. Rosenberg. The shapes of neutral gene genealogies in two species: probabilities of monophyly, paraphyly and polyphyly in a coalescent model. Evolution, 2003.
18.↵
K. Tamura and M. Nei. Estimation of the number of nucleotide substitutions in the control region of mitochondrial dna in humans and chimpanzees. Molecular Biology and Evolution, 10: 512–526, 1993.
OpenUrl CrossRef PubMed Web of Science
19.↵
Simon Tavaré. Some probabilistic and statistical problems in the analysis of dna sequences. Lectures on mathematics in the life sciences, 17: 57–86, 1986.
OpenUrl
20.↵
Chieh-Hsi Wu, Marc A Suchard, and Alexei J Drummond. Bayesian selection of nucleotide substitution models and their site assignments. Mol Biol Evol, 30(3): 669–88, Mar 2013.
OpenUrl CrossRef PubMed Web of Science
21.↵
Ziheng Yang. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. Journal of Molecular Evolution, 39(3): 306–314, 1994.
OpenUrl CrossRef PubMed Web of Science