## Abstract

Phylogenetic inference typically assumes that the data has evolved under Stationary, Reversible and Homogeneous (SRH) conditions. Many empirical and simulation studies have shown that assuming SRH conditions can lead to significant errors in phylogenetic inference when the data violates these assumptions. Yet, many simulation studies focused on extreme non-SRH conditions that represent worst-case scenarios and not the average empirical dataset. In this study, we simulate datasets under various degrees of non-SRH conditions using empirically derived parameters to mimic real data and examine the effects of incorrectly assuming SRH conditions on inferring phylogenies. Our results show that maximum likelihood inference is generally quite robust to a wide range of SRH model violations but is inaccurate under extreme convergent evolution.

- Phylogenetic inference
- model violations
- systematic bias
- simulations
- evolution under non-SRH conditions
- test of symmetry

## Main Text

Markov processes are commonly used in model-based phylogenetic analyses such as maximum likelihood (ML) and Bayesian inference (Felsenstein 2004; Yang 2006). A Markov model is represented by an instantaneous rate matrix Q of size 4-by-4 for DNA or 20-by-20 for protein sequences, which describes the substitution rates between nucleotides or amino acids (henceforth denoted as states), respectively. The Markovian property is convenient because the probabilities of the next states depend only on the current states, independently of how the current states had evolved (Felsenstein 1981; Felsenstein 1983; Yang 1994; Swofford, et al. 1996; Yang 2006). For mathematical simplicity and computational tractability, most studies assume that the Markov model is stationary, reversible, and homogeneous (SRH) (Kimura 1980; Felsenstein 1981; Hasegawa, et al. 1985; Tavaré 1986; Tamura and Nei 1993; Yang 1994). Homogeneity means that a single Q matrix operates along all edges of the tree, i.e., all substitution rates stay constant through time. Stationarity means that the state frequencies also remain constant along all edges of the tree. Reversibility means that, at equilibrium, the expected amount of change from state A to another state B is the same as the expected amount of change from B back to A.

The assumptions of homogeneity, stationarity, and reversibility come at the cost of complying with biological reality (Roberts and Yang 1995; Foster and Hickey 1999; Foster 2004; Ababneh, et al. 2006). For example, the reversibility assumption implies that the likelihood of a tree topology will be the same regardless of the placement of the root (Felsenstein 1981). Moreover, a reversible substitution model has up to 8 free parameters for nucleotides and 208 for amino acids, while a non-reversible substitution model has up to 11 free parameters for nucleotides and 379 for amino acids, provided that the model is still stationary and homogeneous (Yang 1994). These degrees of freedom can increase dramatically if the model is non-stationary and/or non-homogeneous (Barry and Hartigan 1987; Boussau and Gouy 2006): at the limit there can be an independent model of evolution on every branch of a tree, meaning that the total number of parameters is the product of the number of parameters in the substitution model and the number of branches in the tree.
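The parameter counts quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the per-branch total assumes a rooted binary tree with 2*m*–2 branches for *m* taxa.

```python
def free_parameters(k: int, reversible: bool) -> int:
    """Free parameters of a stationary, homogeneous Markov model on k states."""
    if reversible:
        # k(k-1)/2 exchangeabilities with one fixed to set the rate scale,
        # plus k-1 free state frequencies (the frequencies sum to 1).
        return k * (k - 1) // 2 - 1 + (k - 1)
    # k(k-1) off-diagonal rates with one fixed to set the rate scale; the
    # stationary frequencies are then implied by the rate matrix itself.
    return k * (k - 1) - 1


def per_branch_total(k: int, reversible: bool, n_taxa: int) -> int:
    """Upper limit when every branch carries its own model (assumed rooted binary tree)."""
    n_branches = 2 * n_taxa - 2
    return free_parameters(k, reversible) * n_branches


if __name__ == "__main__":
    for k in (4, 20):
        print(k, free_parameters(k, True), free_parameters(k, False))
```

Under these assumptions, a fully branch-specific non-reversible nucleotide model on a 20-taxon tree would already have several hundred free parameters, which illustrates why such models are rarely practical.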

Using stationary, reversible, and homogeneous substitution models to infer a phylogeny from data that has evolved under more complex conditions compromises the consistency of the ML estimation (Felsenstein 2004). Ideally, we would like to use data that comply with the assumptions of the models we apply, or alternatively, use models that are not violated by the data in hand. However, the use of non-SRH models is computationally demanding and is often not practical in large datasets. On the other hand, removing data that do not comply with the SRH assumption will come at a cost of losing phylogenetic information. Both simulation (Huelsenbeck and Hillis 1993; Hillis, et al. 1994; Galtier and Gouy 1998; Ho and Jermiin 2004; Jermiin, et al. 2004; Boussau and Gouy 2006) and empirical (Phillips, et al. 2004; Collins, et al. 2005; Nguyen, et al. 2012; Betancur, et al. 2013; Naser-Khdour, et al. 2019) studies have shown that applying SRH models to data that have evolved under more complex conditions can lead to significant errors in phylogenetic inference. However, most of these simulation studies have used parameters that do not reflect most empirical datasets, and sometimes represent extreme conditions such as the independent convergence of distantly-related taxa to a GC content that differs substantially from the rest of the taxa in the tree. While these simulations are based on biological observations such as the evolution of extreme GC content differences among closely related bacteria (Mooers and Holmes 2000), they do not represent the degree of violation of SRH conditions typical of most datasets. Indeed, apart from extreme cases it remains relatively poorly understood to what extent different types and degrees of violations of the SRH conditions affect phylogenetic inference.

In this study, we examine the influence of violating the SRH assumptions on phylogenetic inference with SRH models using parameters that are derived from thousands of empirical datasets. We simulate nucleotide alignments under various non-stationary (and thus non-reversible) and/or non-homogeneous conditions and examine the effects of incorrectly assuming SRH conditions on inferring phylogenies from these data. Moreover, we examine the ability of different methods to detect non-SRH evolution across multiple sequence alignments. Several tests for detecting non-SRH evolution in nucleotide and amino acid alignments have been introduced (Lanave, et al. 1984; Lanave, et al. 1986; von Haeseler, et al. 1993; Lockhart, et al. 1994; Kumar and Gadagkar 2001; Phillips and Penny 2003; Weiss and von Haeseler 2003; Foster 2004; Ababneh, et al. 2006; Ho, et al. 2006; Jermiin, et al. 2019; Naser-Khdour, et al. 2019). However, these tests are rarely used in phylogenetic analysis (Jermiin, et al. 2004; Jermiin, et al. 2009), likely because many of them are difficult to apply in practice. In this study we focussed on three tests for detecting non-SRH evolution that are implemented in the widely used IQ-TREE software (Minh, et al. 2020): the MaxSymTests (Naser-Khdour, et al. 2019), the compositional chi-square test (Preparata and Saccone 1987) as implemented in IQ-TREE (Nguyen, et al. 2015), and the test of non-stationarity proposed by Weiss and von Haeseler (2003). The MaxSymTests ask whether there is evidence in a single alignment that the evolutionary symmetry imposed by SRH evolution is violated, and are a relatively new extension of similar tests designed for pairs of sequences (Jermiin, et al. 2019). The Weiss and von Haeseler (WvH) test checks the homogeneity of the substitution model across the tree based on pairwise sequence comparisons and performs a parametric bootstrap to assess statistical significance (Weiss and von Haeseler 2003).
The compositional chi-square test checks if the state composition of each sequence in the alignment is similar to the average state composition of the whole alignment, and is commonly used to detect and sometimes remove sequences that clearly violate the SRH conditions (e.g. Aouad, et al. 2018; Liu, et al. 2018; Martijn, et al. 2018; Puttick, et al. 2018; Song, et al. 2018; Fan, et al. 2020). The Chi-square test gives researchers a way of understanding whether each sequence in an alignment has state frequencies that are plausible given the overall state frequencies of the alignment. We know of no existing test that combines individual chi-square tests to assess whether the state frequencies across all sequences of an alignment are plausible under an SRH model. It is possible to do this with model adequacy tests, but this requires one to first fit a full model and a tree (Foster 2004; Brown and ElDabaje 2009; Duchene, et al. 2017), while our current work focusses on tests that can be performed quickly and efficiently on very large datasets prior to tree inference. We therefore use two different approaches in this study to leverage the information from individual chi-square tests.

The two approaches we take to using information from chi-square tests reflect different ways of balancing false-positive and false-negative outcomes, and so may be thought of as appropriate for different situations. Our first approach to using the chi-square tests is to take the most conservative possible approach and score an alignment as violating SRH assumptions if at least one sequence fails the test. Using the Chi-square frequencies in this way is very conservative, and liable to have a high false positive rate that increases with the number of sequences in an alignment. However, in some practical cases when many loci are available but only a small number can be used for analyses, e.g. selecting ∼50 loci for a Bayesian analysis out of many thousands available from whole genomes, a conservative approach such as this with a high false positive rate may be warranted. Our second approach is less conservative. In this approach we record the proportion of sequences in an alignment that fail the Chi-square test, and ask whether this proportion is correlated with the degree of non-stationarity in the simulations. This approach may be more useful in practical cases where researchers wish to rank a set of loci with respect to the severity of model violations.

## Materials and Methods

### Simulations

In order to investigate the ability of SRH models to correctly infer topologies and branch lengths from non-SRH data, we devised a new approach that allows us to simulate alignments gradually ranging from true SRH conditions (with identical base frequencies and identical reversible substitution processes on every branch of the topology) to the most extreme violation with completely unrelated base frequencies and non-reversible substitution processes on every branch of the topology. For an alignment of *m* taxa and *n* sites, we will denote the set of all branches in the rooted tree τ as Φ = {1, …, *l*}.

We simulate data under two different simulation schemes as follows:

1. An inheritance scheme designed to reflect the evolutionary process, in which each node in the tree inherits its substitution processes from its parent with a constant strength of inheritance modified by the branch length connecting the two nodes. The scheme reflects the continuity of evolutionary processes that are changing through time along a phylogenetic tree.

2. A two-matrix scheme designed to reflect previous approaches to simulating non-SRH evolution, where two independent subtrees (that are neither sisters nor descendants of each other) have an identical substitution process that is distinct from the substitution process operating on the rest of the tree. This scheme resembles convergent evolution.

Applying these two schemes allows us to ask how evolutionarily-inspired non-SRH simulations are affected by SRH assumptions (scheme 1) and then to directly compare these to the more extreme forms of non-SRH evolution that are more often simulated (scheme 2). We will describe both simulation approaches in more detail below. But we start by describing how we choose model parameters for our simulations.

### Estimating Empirical Parameter Distributions and Tree Topologies for Simulations

Both of our simulation approaches require us to choose base frequency vectors and rate matrices with which to simulate alignments. Generating these at random could limit the applicability of our results because it is unlikely that randomly-generated base frequency vectors or rate matrices would reflect reality. To address this, we instead estimated base frequency vectors and rate matrices from a large collection of empirical alignments, and then used these parameters for our simulations.

In order to estimate the distributions of the empirical base frequencies (Π) and the substitution rates (*X*) we used 32,666 partitions from 49 nucleotide datasets (Appendix Table A.1), consisting of different types of loci (codon positions, rRNA, tRNA, introns, intergenic spacers and UCEs) and genomes (nuclear, mitochondrial, viral, plastid). Since different partitions of the genome evolve differently, for each partition we ran IQ-TREE with a GTR model and free rate heterogeneity across sites (Yang 1995) with 4 categories + invariant sites. This gave us the distributions of 32,666 estimates of each parameter in the GTR matrix (A↔C, A↔G, A↔T, C↔G, C↔T, G↔T) and the distribution of each base frequency (*π*_{A}, *π*_{C}, *π*_{G}, *π*_{T}).

We use a similar approach to estimate the distribution of branch lengths. Estimating branch lengths from each partition separately could be misleading because there tends to be a high stochastic error in branch lengths estimated from short single-partition alignments (Kumar, et al. 2012). Therefore, in order to estimate the empirical distribution of the branch lengths, we instead estimated a single set of branch lengths from each of our 49 nucleotide datasets and complemented these with an additional 18 amino-acid datasets. For each dataset, we ran IQ-TREE with the best-fit fully-partitioned model (Chernomor, et al. 2016), which allows each partition to have its own evolutionary model and edge-linked rates determined by ModelFinder (Kalyaanamoorthy, et al. 2017). We then rooted the tree with the outgroup taxa (if provided) and extracted the empirical branch lengths of the ingroup (*T*) for each of the 33,178 partitions from 67 nucleotide and amino acid datasets.

Finally, for each parameter in *X* (5 parameters, with G↔T fixed to 1) and Π (4 parameters), and for each distribution in *T* (67 distributions, each dataset being an independent distribution), we found the best-fit distribution among 36 common probability distributions using the Kolmogorov-Smirnov test as implemented in SciPy (Virtanen, et al. 2020). We then sampled parameters for our simulations from these best-fit distributions. Since the parameters of Π are not independent, to sample a base-frequency vector we randomly sampled one value from the best-fit distribution of each of the four base frequencies and then normalized these values to sum to 1.
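The selection and sampling steps described above can be sketched as follows. This is a minimal illustration and not the authors' script: the study used SciPy with 36 candidate distributions, whereas this sketch uses only the standard library, three candidate families chosen for illustration (normal, log-normal, exponential), and simple closed-form parameter estimates. All function names are hypothetical.

```python
import math
import random
import statistics


def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov distance between a sample and a CDF."""
    xs = sorted(sample)
    n = len(xs)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs)
    )


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def fit_candidates(sample):
    """Fit each illustrative family; return {name: fitted CDF}.

    Assumes strictly positive data (needed for the log-normal fit).
    """
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)
    logs = [math.log(x) for x in sample]
    log_mean, log_sd = statistics.fmean(logs), statistics.stdev(logs)
    return {
        "normal": lambda x: normal_cdf(x, mean, sd),
        "lognormal": lambda x: normal_cdf(math.log(x), log_mean, log_sd),
        "exponential": lambda x: 1.0 - math.exp(-x / mean),
    }


def best_fit(sample):
    """Return (name, KS distance) of the candidate with the smallest KS distance."""
    scores = {
        name: ks_statistic(sample, cdf)
        for name, cdf in fit_candidates(sample).items()
    }
    name = min(scores, key=scores.get)
    return name, scores[name]


def sample_base_frequencies(draw_fns):
    """Draw one value per base from its own distribution, then normalize to sum to 1."""
    raw = [draw() for draw in draw_fns]
    total = sum(raw)
    return [value / total for value in raw]


if __name__ == "__main__":
    random.seed(0)
    rates = [random.lognormvariate(-1.0, 0.5) for _ in range(2000)]
    print(best_fit(rates))
    print(sample_base_frequencies([lambda: random.uniform(0.1, 0.4)] * 4))
```

Ranking candidates by the KS distance is one reasonable reading of "best-fit"; ranking by the KS p-value, for example with `scipy.stats.kstest`, would be an alternative.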

The tree topology τ is derived from birth-death simulations with speciation rate λ, extinction rate μ and the fraction of sampled taxa *f*, using the TreeSim package with a fixed number of extant species (Stadler 2011). In principle, it is possible to estimate the speciation and extinction rates from empirical data (Nee, et al. 1994; Rannala and Yang 1996; Magallon and Sanderson 2001). However, not knowing the fraction of sampled taxa a priori will tend to bias such estimates (Stadler 2013; Hua and Lanfear 2018). Because of the challenges of reliably estimating empirical speciation and extinction rates, we instead randomly sampled the speciation rate, the extinction rate and the fraction of sampled taxa from uniform distributions, in an attempt to cover all the realistic regions of the parameter space.

Note that under these conditions λ is always greater than μ.

We simulated datasets with 20, 40, 60, 80 and 100 taxa. For each number of taxa (*m*), we simulated 3960 topologies with a random speciation rate (λ), a random extinction rate (μ) and a random fraction of sampled taxa (*f*). For each of these topologies, we then randomly chose a distribution from the set *T* and sampled the branch lengths from this distribution (2*m*–2 branch lengths in total).

Other Python libraries that we used for the simulations are NumPy (Walt, et al. 2011), pandas (McKinney 2010) and ETE3 (Huerta-Cepas, et al. 2016). The Python scripts for all simulations can be found on GitHub (https://github.com/suhanaser/empiricalGTRdist).

### Inheritance Evolution: Inheritance Scheme Simulations

An evolutionary scenario would, ideally, have each lineage inheriting the parameters of its molecular evolutionary process from its parent lineage. At one extreme, where inheritance is perfect and the original evolutionary process is SRH, such a scenario yields a molecular evolutionary process that is SRH across the entire topology, defined simply by a single SRH model at the root node. At the other extreme, where the association between parent and offspring lineages is no better than random and the original process is not SRH, the process is maximally non-SRH. To mimic this situation, we designed a simulation approach that allows us to vary the homogeneity and stationarity assumptions both independently and together.

Our inheritance scheme allows us to vary the degree to which a single alignment has evolved under SRH conditions by simply adjusting the strength of inheritance of the substitution process and the base frequencies either jointly via a parameter we call *ρ*, or independently via parameters *ω* and *ν*, respectively. When the inheritance parameters are set to 1 and the model at the root of the tree is reversible, the model will conform to SRH conditions. We can simulate increasing violation of SRH conditions simply by decreasing the inheritance parameters towards zero. When the relevant inheritance parameter is less than one, each branch inherits some proportion of its substitution model from the parent branch, while the remaining proportion of the model is selected at random from the empirical parameter distributions. In practice, the parameter in a descendant branch is calculated as the weighted sum of the parameter in the parent branch (where the weight is the inheritance parameter) and a randomly generated parameter from the appropriate empirical distribution (where the weight is one minus the inheritance parameter).
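The weighted-sum rule described above can be written down in a few lines. The sketch below is a hedged illustration rather than the published implementation: the function names are hypothetical, and the `draw` callables stand in for sampling from the fitted empirical distributions.

```python
import random


def inherit(parent_value, random_value, weight):
    """Weighted sum of the parent's parameter and a freshly drawn random parameter."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("inheritance weight must lie in [0, 1]")
    return weight * parent_value + (1.0 - weight) * random_value


def inherit_vector(parent, draw_fns, weight):
    """Apply the same inheritance weight to every parameter of a model.

    `draw_fns` holds one zero-argument sampler per parameter (an assumption of
    this sketch; in the study the draws come from empirical best-fit distributions).
    """
    return [inherit(p, draw(), weight) for p, draw in zip(parent, draw_fns)]


if __name__ == "__main__":
    random.seed(1)
    parent_rates = [1.2, 4.0, 0.9, 1.1, 4.3]
    samplers = [lambda: random.uniform(0.5, 5.0)] * 5
    for w in (1.0, 0.5, 0.0):
        print(w, inherit_vector(parent_rates, samplers, w))
```

With a weight of 1 the child model is an exact copy of the parent, and with a weight of 0 it is an independent draw, matching the two extremes described in the text.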

We simulated data under five different categories of conditions using this scheme, in order to examine independently and together the effects of relaxing the stationarity and homogeneity assumptions.

*SRH conditions (Fig. 1a).* The simplest case is a model that conforms to the SRH assumptions, where the model parameters are generated from the empirical distributions. All branches inherit this reversible model from their parent branch without variation, such that all branches on the tree have the same reversible substitution model.

*Relaxing the stationarity assumption (Fig. 1b).* In order to hold the homogeneity assumption but relax the stationarity assumption, we introduce a parameter called *ν* (0 ≤ *ν* ≤ 1) that allows us to vary the state frequencies at the root while still keeping the same rate matrix for all branches of the tree. In the mathematical description of this scheme, *Q*_{i} is the substitution rate matrix operating on branch *i*, and *d*_{root} is the branch length of the root branch. When *ν* = 1, *π*_{root} is equal to *π*_{0} and this scheme reduces to the SRH conditions above. When *ν* = 0, *π*_{root} is equal to *π*, meaning that the root frequency is generated separately from *π*_{0}. *π*_{root} varies between these two extremes when *ν* is between 0 and 1, with lower *ν* reflecting a larger deviation from stationary conditions.

*Relaxing the homogeneity assumption (Fig. 1c).* In order to hold the stationarity assumption but relax the homogeneity assumption, we simulate data in which *ν* is set to 1 (such that all branches have the same base frequencies as the root node), but we introduce a parameter *ω* that varies between zero and one (such that the inheritance of the parameters of the *Q* matrix ranges from completely random to near-perfect). In the mathematical description of this scheme, *Q*_{i} is the process operating on branch *i*, *S*_{j} are the substitution rates on the parent branch of branch *i*, and *d*_{i} is the branch length of branch *i*.

*Relaxing the stationarity and homogeneity assumptions simultaneously but independently (Fig. 1d).* We can simulate non-stationary and non-homogeneous data by setting both *ν* and *ω* to values less than one. When we relax both assumptions, we allow *Q*_{i} and *π*_{root} to vary simultaneously but independently.

*Relaxing the stationarity and homogeneity assumptions jointly (Fig. 1e).* While the fourth set of simulation conditions, above, allows us to vary homogeneity and stationarity simultaneously but independently, it suffers from the limitation that we have a maximum of two base frequency vectors in the tree (*π*_{root} and *π*_{0}). To relax this assumption further, we will allow *Q*_{i} to vary while *π*_{root} stays fixed. In those settings, both homogeneity and stationarity will increase with *ρ*.

### Convergent Evolution: The Two-Matrix Scheme Simulations

Previous studies simulating non-SRH evolution on phylogenies have used an approach in which two distantly related branches undergo severe but correlated changes in the molecular evolutionary process. To compare this approach to the more evolutionarily motivated approach described above, we randomly chose two nodes that are neither sisters nor descendants of each other and assigned to all their descendant branches a rate matrix (denoted by *S*_{1}*π*_{1}) that differs from that of the rest of the tree (Fig. 2).
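The node-selection rule for the two-matrix scheme can be illustrated without any tree library. In the sketch below the tree is represented as a hypothetical child-to-parent dictionary, and all function names are illustrative; the published scripts may implement this step differently (for example with ETE3).

```python
import random


def ancestors(node, parent):
    """Return the set of all ancestors of `node` (excluding the node itself)."""
    found = set()
    while node in parent:
        node = parent[node]
        found.add(node)
    return found


def eligible_pair(a, b, parent):
    """True if a and b are distinct, not sisters, and neither descends from the other."""
    if a == b or a not in parent or b not in parent:  # the root has no parent entry
        return False
    if parent[a] == parent[b]:
        return False
    return a not in ancestors(b, parent) and b not in ancestors(a, parent)


def choose_convergent_nodes(parent, rng=random):
    """Pick uniformly among all eligible pairs of non-root nodes."""
    nodes = sorted(parent)
    pairs = [
        (a, b)
        for i, a in enumerate(nodes)
        for b in nodes[i + 1:]
        if eligible_pair(a, b, parent)
    ]
    if not pairs:
        raise ValueError("tree has no pair of nodes satisfying the constraints")
    return rng.choice(pairs)


if __name__ == "__main__":
    # Toy rooted tree: R -> (X, Y), X -> (A, B), Y -> (C, D)
    toy = {"X": "R", "Y": "R", "A": "X", "B": "X", "C": "Y", "D": "Y"}
    print(choose_convergent_nodes(toy, random.Random(1)))
```

Enumerating all pairs is quadratic in the number of nodes, which is acceptable for trees of up to 100 taxa; rejection sampling would be an alternative for larger trees.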

### Simulation Parameters

The simulation parameters that we use in this study are the strength of inheritance of the substitution process (*ω*), the strength of inheritance of the base frequencies (*ν*), the strength of inheritance of both the substitution process and the base frequencies (*ρ*), the number of sites (*n*), and the number of taxa (*m*), where the parameter space is:

The inheritance weight parameters (*ω*, *ν*, *ρ*) were chosen to represent an even spread of corrected inheritance weights (i.e., the inheritance weights raised to the power of *d*, where *d* is the branch length) between zero and one. The number of taxa and number of sites were chosen to reflect typical sizes of empirical datasets. For simulation under the inheritance scheme, we simulated 10 alignments of each combination of *n*, *m*, *ν*, and *ω*, or of *n*, *m*, and *ρ*, for a total of 19,800 simulations. For simulation under the two-matrix scheme, we simulated 1000 alignments of each combination of *n* and *m* for a total of 15,000 simulations.
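The branch-length correction mentioned above (the inheritance weight raised to the power of the branch length) is easy to explore numerically. The sketch below is illustrative only; both function names are hypothetical, and the inverse function simply shows how a raw weight could be chosen to reach a target corrected weight on a branch of a given length.

```python
def corrected_weight(weight, branch_length):
    """Inheritance weight raised to the power of the branch length d."""
    return weight ** branch_length


def raw_weight_for(target_corrected, branch_length):
    """Raw weight that yields `target_corrected` on a branch of the given length.

    Assumes branch_length > 0 and 0 <= target_corrected <= 1.
    """
    return target_corrected ** (1.0 / branch_length)


if __name__ == "__main__":
    # Short branches keep the corrected weight close to 1 even for small raw weights.
    for d in (0.01, 0.1, 1.0):
        print(d, [round(corrected_weight(w, d), 4) for w in (0.1, 0.5, 0.9)])
```

Because typical branch lengths are much smaller than 1, raw weights very close to zero are needed before the corrected weight departs noticeably from 1, which is presumably why the weights were chosen on the corrected scale.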

### Tree Inference

Our first goal is to understand how the incorrect use of SRH models on data that have evolved under non-SRH processes can affect phylogenetic inference. To do this, we compare the tree topologies and branch lengths estimated with SRH models in IQ-TREE to the topologies and branch lengths used to simulate each dataset. For each simulated alignment, we ran IQ-TREE with ModelFinder (Kalyaanamoorthy, et al. 2017) and 1000 ultrafast bootstrap replicates (Hoang, et al. 2018). In order to assess the ability of SRH models to infer the correct tree topology we then compared the simulated tree topology to the estimated tree topology using three different metrics – normalized Robinson-Foulds distance (Robinson and Foulds 1981), Quartet distance (Estabrook, et al. 1985), and the Path-difference distance (Steel and Penny 1993). The normalized Robinson-Foulds distance between two trees is the fraction of internal branches that appear in one tree but not the other. It ranges from 0 to 1, where 0 means that the two trees are topologically identical and 1 means that the two trees have no branches in common. In order to assess the accuracy of branch length estimates, we tested whether the estimated branch lengths and the original branch lengths are drawn from the same distribution using the two-sample Kolmogorov-Smirnov test.
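To make the normalized Robinson-Foulds definition above concrete, the following sketch computes it for small trees written as nested tuples of leaf names. It is a minimal illustration with hypothetical function names, not the software used in the study; trees are compared as unrooted, and the distance is normalized by the total number of internal splits in both trees.

```python
def leaf_set(tree):
    """All leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial splits of the tree, each stored as the side lacking a reference leaf."""
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if reference in clade else clade
        # Keep only internal branches: both sides must hold at least two leaves.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def normalized_rf(tree_a, tree_b):
    """Fraction of internal splits found in one tree but not the other."""
    splits_a, splits_b = bipartitions(tree_a), bipartitions(tree_b)
    total = len(splits_a) + len(splits_b)
    if total == 0:
        return 0.0
    return len(splits_a ^ splits_b) / total


if __name__ == "__main__":
    true_tree = ((("A", "B"), "C"), ("D", "E"))
    estimate = ((("A", "C"), "B"), ("D", "E"))
    print(normalized_rf(true_tree, estimate))
```

Storing each split as the side that excludes a fixed reference leaf gives every bipartition a single canonical representation, so rooting differences between the two input trees do not affect the result.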

### Detecting non-SRH Processes

Our second goal is to examine the ability of different methods to detect non-SRH evolution. To do this, we tested the ability of three tests implemented in IQ-TREE to detect violation of the SRH assumptions: the MaxSymTests (Naser-Khdour, et al. 2019), the compositional Chi-square test, and the WvH test (Weiss and von Haeseler 2003). These three tests only need the composition of the alignment and therefore can be used with any analysis in IQ-TREE by adding the appropriate options to the command line, except for the Chi-square test, which runs automatically for each alignment (Table 1).

Since the Chi-square test tells us whether each sequence in the alignment fails the compositional homogeneity assumption, we use two different approaches that leverage the results of the Chi-square test (see also the Introduction):

1. A very conservative approach, in which we consider the alignment to fail the Chi-square test if one or more of the sequences in the alignment fails the test.

2. A less conservative ranking approach, in which we record for each alignment the proportion of sequences that fail the Chi-square test.

In the first case, we ask whether the proportion of replicate simulated alignments with one or more sequences failing the Chi-square test increases with the degree of violation of SRH conditions in the simulations. In the second case, we ask whether the proportion of sequences that fail the Chi-square test increases with the degree of violation of the SRH conditions in the simulations.
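The two ways of summarizing per-sequence Chi-square results can be sketched as below. This is an assumption-laden illustration, not IQ-TREE's code: it assumes a nucleotide alignment in which all four bases occur, a test with 3 degrees of freedom against the alignment-wide base composition, and a 5% significance threshold. The function names are hypothetical.

```python
import math
from collections import Counter

STATES = "ACGT"


def chi2_sf_df3(x):
    """Survival function of the chi-square distribution with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


def sequence_p_values(alignment):
    """Chi-square p-value of each sequence's composition against the pooled composition."""
    counts = {
        name: Counter(c for c in seq.upper() if c in STATES)
        for name, seq in alignment.items()
    }
    pooled = Counter()
    for c in counts.values():
        pooled.update(c)
    total = sum(pooled.values())
    freqs = {s: pooled[s] / total for s in STATES}  # assumes every base occurs

    p_values = {}
    for name, c in counts.items():
        n = sum(c.values())
        stat = sum((c[s] - n * freqs[s]) ** 2 / (n * freqs[s]) for s in STATES)
        p_values[name] = chi2_sf_df3(stat)
    return p_values


def any_sequence_fails(p_values, alpha=0.05):
    """Conservative approach: the alignment fails if at least one sequence fails."""
    return any(p < alpha for p in p_values.values())


def proportion_failing(p_values, alpha=0.05):
    """Ranking approach: the proportion of sequences that fail the test."""
    return sum(p < alpha for p in p_values.values()) / len(p_values)


if __name__ == "__main__":
    aln = {f"taxon{i}": "ACGT" * 100 for i in range(9)}
    aln["biased"] = "ACGG" * 100
    pvals = sequence_p_values(aln)
    print(any_sequence_fails(pvals), proportion_failing(pvals))
```

Because the first summary flags an alignment as soon as a single sequence fails, its false-positive rate is expected to grow with the number of sequences, whereas the second summary yields a continuous score that can be used to rank loci.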

## Results

### Empirical Distributions

We derived the empirical distributions of the substitution model parameters, the nucleotide frequencies, and the proportion of invariant sites from 32,666 nucleotide alignments (Appendix Table A.2). The empirical distributions of branch lengths were derived from 67 nucleotide and amino acid datasets comprising 33,178 partitions (Appendix Table A.1).

Using the Kolmogorov-Smirnov test, we found the best-fit probability distribution for each of these empirical distributions (Table 2, Appendix Table A.2, Appendix Figs. A.1-3).

### Phylogenetic Inference is Unaffected by Violation of SRH Conditions in an Inheritance Framework

Surprisingly, our results for the inheritance simulation scheme show that there is no detectable relationship between the severity with which SRH conditions were violated during the simulations and the accuracy of the tree topology or the tree length inferred from the simulated data. Specifically, we saw no relationship between the inheritance weight and the normalized RF (Robinson-Foulds), QD (Quartet Distance), or NPD (Normalized Path Difference) metrics in any of our inheritance simulations (Fig. 3, Appendix Figs. A.4-7). These metrics measure the difference between the inferred tree and the tree from which the alignment was simulated. If stronger violation of the SRH conditions affected phylogenetic inference, we would expect the distances to be higher when the inheritance weight is lower, because a lower inheritance weight implies stronger model violation through less homogeneity (for the rate matrix) and less stationarity (for the base frequencies). In addition, our results show that the proportion of simulated datasets for which the simulated tree is recovered from the simulated alignment is constant at around 0.25 in the inheritance scheme simulations regardless of the inheritance weight (Fig. 3, Appendix Fig. A.4). Finally, we see no correlation between the inheritance weights and the proportion of datasets that fail a Kolmogorov-Smirnov test comparing the true and estimated branch lengths, suggesting that violation of SRH assumptions in our evolutionary framework has no detectable effect on the estimation of branch lengths (Fig. 4, Appendix Fig. A.19).

### Tree Topologies, but not Branch Lengths, are Affected by Severe and Convergent Violation of SRH Conditions

Our results show that convergent violation of SRH assumptions, in which two distantly related branches have identical substitution models, has increasingly severe effects on phylogenetic inference as the severity of the changes in the substitution models increases. Under the two-matrix scheme, we expect to see higher distances between the true tree and the estimated tree when there are larger Euclidean distances between the original matrix and the matrix under which the divergent clades evolve. In two out of the three metrics (Robinson-Foulds and Path-Difference) we found a weak but significant correlation between the distance between the matrices and the distance between the topologies (Fig. 3, Appendix Fig. A.4, Appendix Figs. A.8-10). However, in the third metric (Quartet Distance) we found no correlation. Notably, the distance between the true tree and the estimated tree increases only when the Euclidean distance between the two matrices is very high. Nevertheless, the proportion of simulated datasets for which the simulated tree is recovered from the simulated alignment declines exponentially as the difference between the matrices in the two-matrix scheme increases (Fig. 3).

The proportion of analyses in which the simulated tree is recovered declines from around 0.20 when there is no model violation to zero when the Euclidean distance between the matrices is around 2000, confirming that even the lowest levels of SRH violation have detectable negative effects on phylogenetic inference under the two-matrix scheme.

Finally, we see no correlation between the Euclidean distance between the two matrices and the proportion of datasets that fail a Kolmogorov-Smirnov test comparing the distributions of the true and estimated branch lengths, suggesting that violation of SRH assumptions in the convergent framework has limited effects on the estimation of branch lengths (Fig. 4, Appendix Fig. A.22).

### Tests for Detecting non-SRH Processes are Successful but have High False-Negative Rates

As expected, the ability of all three MaxSym tests to reject the null hypothesis of stationarity and homogeneity improves as the inheritance weight in the evolutionary simulations decreases (i.e. as the violation of SRH conditions increases), the distance between the two matrices in the convergent simulations increases, and the number of sites in the alignment increases (Fig. 5, Appendix Figs. A.11-13). Moreover, the three MaxSym tests have a reasonable false positive rate of approximately 4.5% (Appendix Table A.4). However, they also have very high false-negative rates of 50-90%, depending on the test and the particular simulation conditions (Fig. 5, Appendix Table A.3, Appendix Figs. A.14-16). In the two-matrix scheme simulations, the false negative rates of MaxSym, MaxSym_{mar}, and MaxSym_{int} tests are 67%, 66%, and 87%, respectively. Thus, across all simulation conditions, a significant result from a MaxSymTest can be reliably interpreted as indicating that an alignment violates the SRH conditions, but the test will fail to identify many such alignments.
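For clarity, the false-positive and false-negative rates reported here can be defined operationally as in the sketch below. The function name and the boolean-list representation are hypothetical; the sketch assumes that each simulated alignment is labelled by whether it was simulated under non-SRH conditions and by whether the test rejected it.

```python
def error_rates(rejected, violates_srh):
    """Return (false_positive_rate, false_negative_rate) for paired boolean lists.

    rejected[i]     -- True if the test rejected SRH for alignment i
    violates_srh[i] -- True if alignment i was simulated under non-SRH conditions
    """
    pairs = list(zip(rejected, violates_srh))
    n_srh = sum(1 for _, v in pairs if not v)
    n_non_srh = sum(1 for _, v in pairs if v)
    false_pos = sum(1 for r, v in pairs if r and not v)
    false_neg = sum(1 for r, v in pairs if v and not r)
    fpr = false_pos / n_srh if n_srh else float("nan")
    fnr = false_neg / n_non_srh if n_non_srh else float("nan")
    return fpr, fnr


if __name__ == "__main__":
    rejected = [True, False, False, False, True, False]
    truth = [False, False, False, False, True, True]
    print(error_rates(rejected, truth))
```

Under this definition a low false-positive rate combined with a high false-negative rate means that a rejection is informative, while a non-rejection says little about whether the SRH conditions hold.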

Similarly to the MaxSym tests, both Chi-square approaches show an increase in the proportion of alignments and/or sequences that fail the test in each dataset as the inheritance weight decreases and the number of sites increases (Fig. 5a, Appendix Fig. A.17). The false-positive rate of the Chi-square test is 6% (Appendix Table A.6). The false-negative rate of the test in the inheritance-scheme simulations is 57% (Appendix Table A.5). Moreover, similar to the MaxSym tests, in the two-matrix scheme simulations the percentage of datasets that pass the Chi-square test decreases logarithmically as the distance between the two matrices increases (Fig. 5b, Appendix Fig. A.18). The false-negative rate of the test under extreme convergent evolution is the smallest of all the tests considered here under these conditions, at around 44% (Appendix Table A.5).

In the inheritance-scheme simulations, similarly to the MaxSym tests and the Chi-square approaches, the WvH test shows an increase in the proportion of alignments that fail the test as the inheritance weights (*ω* and *ρ*) decrease. However, *ν* has no effect on the proportion of alignments that fail the WvH test. The false-positive rate of the WvH test is 3.5% (Appendix Table A.8), which is lower than that of any of the MaxSym tests or the Chi-square test. In addition, the false-negative rate of the WvH test (Appendix Table A.7) in the inheritance-scheme simulations is lower than that of all the other tests (∼30%), but it is still high under the two-matrix scheme simulations (∼67%). However, due to numerical instability, the WvH test could only be applied to half of the datasets in the two simulation schemes.

### MaxSymTest_{int} is a good predictor of correct tree inference

A key question for empiricists is whether tests of model adequacy are likely to improve phylogenetic inference. To explore this in our simulation framework, we asked whether datasets that are rejected by the tests we evaluated tend to be associated with more phylogenetic tree error than those that are not rejected. To do this, we used three different metrics of tree distance (the normalized Robinson-Foulds (nRF), Path-Difference, and Quartet distances) and asked whether datasets that fail a test (i.e. have detectable non-SRH processes) tend to result in trees that are further from the true tree (e.g. have higher nRF distances) when analysed using SRH models. All three metrics showed very similar results, so we show the normalized Robinson-Foulds results here (Fig. 6) and the other metrics in the supplementary information (Appendix Fig. A.23a, Appendix Fig. A.24a).
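To make the normalized Robinson-Foulds distance concrete, the following minimal sketch computes it for two trees on the same taxa. It is an illustration only, not the software used in this study. It assumes trees are written as nested Python tuples, that taxon labels are sortable, and that the distance is normalized by 2(n − 3), the maximum for two fully resolved unrooted trees with n taxa. The helper names `leaf_set`, `splits`, and `normalized_rf` are hypothetical.

```python
def leaf_set(tree):
    """Return the set of taxon labels in a tree written as nested tuples."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree, taxa):
    """Collect the non-trivial bipartitions (splits) induced by a tree.

    Each split is stored as the side that does NOT contain a fixed
    reference taxon, so the same split is always stored the same way.
    """
    ref = min(taxa)
    found = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = taxa - below if ref in below else below
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(taxa) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def normalized_rf(tree_a, tree_b):
    """Normalized RF distance, assuming both trees are fully resolved."""
    taxa = leaf_set(tree_a)
    if taxa != leaf_set(tree_b):
        raise ValueError("trees must share the same taxa")
    n = len(taxa)
    if n < 4:
        raise ValueError("need at least four taxa")
    differing = splits(tree_a, taxa) ^ splits(tree_b, taxa)
    return len(differing) / (2 * (n - 3))


# Hypothetical five-taxon example: the trees disagree on one of two splits.
true_tree = (("A", "B"), ("C", ("D", "E")))
inferred_tree = (("A", "C"), ("B", ("D", "E")))
nrf = normalized_rf(true_tree, inferred_tree)
```

A value of 0 means the inferred topology matches the true topology, and a value of 1 means the two trees share no non-trivial split.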

For the inheritance-scheme simulations, we found, as expected, that datasets that failed the MaxSym tests were associated with trees much further from the true tree than those that passed the tests, although there was substantial variation within each category (Fig. 6a, Appendix Fig. A.23a, Appendix Fig. A.24a). Surprisingly, this pattern was reversed for the Chi-squared test, and there was only a very small difference in tree distances for the WvH test (Fig. 6a). Welch’s t-tests suggest that all of the differences are statistically significant (p << 0.05, Fig. 6a).
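For readers unfamiliar with Welch’s t-test, the following minimal sketch computes the test statistic and the Welch-Satterthwaite degrees of freedom for two groups of tree distances, without assuming equal variances. The function name `welch_t` and the example distances are hypothetical and serve only as illustration. In practice the p-value would be obtained from a t distribution, for example with `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    # Squared standard error of each group mean (sample variance / n).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Hypothetical nRF distances for datasets that failed and passed a test.
failed = [0.5, 0.6, 0.7]
passed = [0.1, 0.2, 0.3]
t_stat, dof = welch_t(failed, passed)
```

A positive statistic here indicates that datasets failing the test have a larger mean distance to the true tree than datasets passing it.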

For the two-matrix simulations, the MaxSym_{int} test was the only test for which datasets that failed were associated with trees further from the true tree (Fig. 6b, Appendix Fig. A.23b, Appendix Fig. A.24b). For all other tests, datasets that failed the test were associated with trees that were markedly *closer* to the true tree than datasets that passed the test (Fig. 6b, Appendix Fig. A.23b, Appendix Fig. A.24b). Again, Welch’s t-tests suggest that all of the differences are statistically significant (p << 0.05, Fig. 6b).

## Discussion

Using two different simulation schemes, we explored the impact of violating the assumption of evolution under stationary, reversible, and homogeneous (SRH) conditions on ML phylogenetic tree inference. Our study extends the simulations of many previous studies by simulating data under an evolutionary scenario in which molecular evolutionary models evolve along a phylogeny. Our results show that the inference of phylogenetic tree topologies and branch lengths is surprisingly robust to violations of SRH assumptions under an evolutionary scheme. However, as in previous studies, we show that in extreme cases of convergent molecular evolution the incorrect assumption of SRH conditions can severely mislead phylogenetic inference.

The first simulation scheme we introduced in this paper, which we called *the inheritance scheme*, allows tree branches to inherit their substitution process from their ancestor. The second simulation scheme, which we called *the two-matrix scheme*, is similar to previous studies and allows two distantly related monophyletic sub-trees to evolve with a different evolutionary process from the rest of the tree (Galtier and Gouy 1995; Jermiin, et al. 2004; Jayaswal, et al. 2011; Duchene, et al. 2017).

Surprisingly, our results show no correlation between errors in the topology or branch length inference and any of the inheritance scheme parameters, even in extreme cases where the evolutionary process is completely heterogeneous and non-stationary. These results indicate that ML tree inference with SRH models is surprisingly robust to even quite extreme violations of the SRH conditions.

Under the two-matrix simulation scheme, we found a small but significant positive correlation between topological inference error and the extent of the violation of the SRH assumptions. Specifically, the more extreme the evolutionary convergence, the larger the errors in topological inference that assumes SRH conditions. Despite this, we found no correlation between branch length inference and the distance between the two matrices. These results emphasize the limitations of ML inference under certain model violations, especially when these violations are highly imbalanced along the tree, as in the case of the two-matrix scheme simulations. These results indicate that the inference of the substitution model is influenced more by the imbalance of the model violation distribution along the tree than by the model violation itself. This conclusion agrees well with previous simulation studies that used similar simulation conditions (e.g. Jermiin, et al. 2004; Duchene, et al. 2017; Jermiin, et al. 2019).

In this study, we also tested the power of the MaxSym tests, the WvH test, and two variations of the Chi-squared test to detect model violation due to non-SRH evolution. Our results show that those tests were able to detect some model violation in both simulation schemes. As expected, the power of all tests to detect model violation due to non-SRH evolution improves dramatically as alignment length increases, simply reflecting the larger amount of information available in longer alignments. However, the power of most of the tests we looked at was somewhat limited: even in the best-case scenario, when violation of the SRH conditions was severe, most tests were able to detect this violation in less than 50% of the simulated datasets (Fig. 5). The two exceptions were the WvH test, which was able to detect the vast majority of datasets simulated with model violation under the inheritance scheme simulations (Fig. 6a), and the conservative Chi-squared test, which was able to detect the majority of datasets simulated with model violation under the convergent evolution scheme. However, the WvH test could not be applied to half of the datasets in our simulations due to numerical instability, suggesting that it may be less useful in practice for detecting violations of SRH conditions than the other tests.

The utility of any test of model adequacy in practice is likely to be tied to the amount of phylogenetic error that the test helps empiricists avoid. All models used in phylogenetic analyses are gross oversimplifications of highly complex molecular evolutionary processes, and so merely detecting violations of models is necessary but not sufficient for a model adequacy test to be useful. Because of this, we asked for each test whether the datasets that fail the test were associated with more or less topological inference error than the datasets that passed the test. Surprisingly, the only test that performed consistently well in this regard was the MaxSymTest_{int}. Under the inheritance scheme simulations, all three MaxSym tests are good predictors of phylogenetic accuracy; trees that pass any of those tests are closer to the true tree than trees that fail. The WvH and Chi-squared tests, on the other hand, are bad predictors of phylogenetic accuracy; trees that *fail* the Chi-squared test are usually closer to the true tree, while there is only a small difference between trees that fail and trees that pass the WvH test. Surprisingly, under the convergent simulation scheme, the MaxSymTest_{int} is the *only* test for which datasets that pass the test are closer to the true tree than datasets that fail the test (Fig. 6). For all other tests, the datasets that pass the test were substantially *further* from the true tree than those that fail the test.

It is challenging to disentangle why some tests of the SRH assumptions tend to detect datasets that are associated with more topological error, while others show the opposite tendency (Fig. 6), although we suspect this is often driven by the interplay of the power of the tests, phylogenetic signal, and stochastic error in tree estimation. Across all simulation conditions, the only test that consistently showed the desirable behaviour from an empirical standpoint (i.e. where datasets that fail the test are associated with more topological error) was the MaxSymTest_{int}. All other tests showed evidence of the opposite tendency (Fig. 6) in at least some simulation conditions. In the case of the WvH test, for which alignments that fail the test were consistently associated with *less* topological inference error when analysed under SRH models, we suspect that the underlying reason may be the reliance of the test on a parametric bootstrap. The WvH test depends fundamentally on a tree estimated with an SRH model to estimate the null distribution of the test statistic. If this tree is wrong, as we show occurs under model violation, then the null distribution may be incorrect and the test misled. For the other tests, we suspect that the tendency is driven largely by the fact that datasets with few informative sites will tend to both pass the tests *and* be associated with high topological error, with both caused by the limited information in the data, although further work is needed to understand these relationships in more detail. Nevertheless, the observation that, across all simulation conditions, datasets that fail the MaxSymTest_{int} are associated with higher topological error does suggest that violations of the *homogeneity* assumption might be the most important when it comes to phylogenetic inference with SRH models, since the MaxSymTest_{int} tests primarily for violations of homogeneity.

These results, combined with the results from the inheritance scheme simulations, emphasize the need to use different methods and tests for model violation in phylogenetic analyses, since each test can capture a different aspect of model violation. A new phylogenetic protocol (Jermiin, et al. 2020) stresses the need to validate the assumptions of the models in advance. If the data at hand violate the model’s assumptions, then different models or methods should be considered. A surprising result from this work is that the MaxSymTest_{int} is a good predictor of phylogenetic accuracy. Yet, one should bear in mind that this test has the highest false-negative rate among all of the tests examined in this study.

It is noteworthy that our results from the different simulation schemes agree with the results from empirical data (Naser-Khdour, et al. 2019). They emphasize the impact of model violation due to non-SRH evolution on phylogenetic inference and suggest that reducing model violation in phylogenetic analysis, either by using the new protocol of phylogenetic inference (Jermiin, et al. 2020) or by using more complex substitution models (e.g. Galtier and Gouy 1998; Tamura and Kumar 2002; Blanquart and Lartillot 2008; Dutheil, et al. 2012; Zou, et al. 2012; Groussin, et al. 2013; Jayaswal, et al. 2014), has the potential to improve phylogenetic accuracy.

For the purpose of this study, in order to simulate data that mimic empirical alignments as closely as possible, we extracted the empirical distributions of base frequencies, substitution rates, proportion of invariable sites, and branch lengths from tens of thousands of empirical datasets. In addition to their use in this paper, these empirical distributions, along with their best-fit distributions, may be useful for a wide variety of simulation studies, or for specifying prior distributions for Bayesian phylogenetic methods.

## Funding

This work was supported by an Australian Research Council grant (Grant No. DP200103151 to R.L. and B.Q.M.) and by a Chan-Zuckerberg Initiative grant to B.Q.M. and R.L.