The Influence of Model Violation on Phylogenetic Inference: A Simulation Study

Phylogenetic inference typically assumes that the data has evolved under Stationary, Reversible and Homogeneous (SRH) conditions. Many empirical and simulation studies have shown that assuming SRH conditions can lead to significant errors in phylogenetic inference when the data violates these assumptions. Yet, many simulation studies focused on extreme non-SRH conditions that represent worst-case scenarios and not the average empirical dataset. In this study, we simulate datasets under various degrees of non-SRH conditions using empirically derived parameters to mimic real data and examine the effects of incorrectly assuming SRH conditions on inferring phylogenies. Our results show that maximum likelihood inference is generally quite robust to a wide range of SRH model violations but is inaccurate under extreme convergent evolution.

1987; Boussau and Gouy 2006): at the limit there can be an independent model of evolution 47 on every branch of a tree, meaning that the total number of parameters is the product of the 48 number of parameters in the substitution model and the number of branches in the tree. 49 Using stationary, reversible, and homogeneous substitution models to infer a 50 phylogeny from data that has evolved under more complex conditions compromises the 51 consistency of the ML estimation (Felsenstein 2004). Ideally, we would like to use data that 52 comply with the assumptions of the models we apply, or alternatively, use models that are not 53 violated by the data in hand. However, the use of non-SRH models is computationally 54 demanding and is often not practical in large datasets. On the other hand, removing data that 55 do not comply with the SRH assumption will come at a cost of losing phylogenetic 56 information. Both simulation (Huelsenbeck and Hillis 1993;Hillis, et al. 1994 focusses on tests that can be performed quickly and efficiently on very large datasets prior to 104 tree inference. We therefore use two different approaches in this study to leverage the 105 information in from individual chi-square tests. 106 The two approaches we take to using information from chi-square tests reflect 107 different ways of balancing false-positive and false-negative outcomes, and so may be 108 thought of as appropriate for different situations. Our first approach to using the chi-square 109 tests is to take the most conservative possible approach and score an alignment as violating 110 SRH assumptions if at least one sequence fails the test. Using the Chi-square frequencies in 111 this way is very conservative, and liable to have a high false positive rate that increases with 112 the number of sequences in an alignment. However, in some practical cases when many loci 113 are available but only a small number can be used for analyses, e.g. selecting ~50 loci for a 114 Bayesian analysis out of many thousands available from whole genomes, a conservative 115 approach such as this with a high false positive rate may be warranted. Our second approach 116 is less conservative. In this approach we record the proportion of sequences in an alignment 117 that fail the Chi-square test, and ask whether this proportion is correlated with the degree of 118 non-stationarity in the simulations. This approach may be more useful in practical cases 119 where researchers wish to rank a set of loci with respect to the severity of model violations.

122
Simulations 123 In order to investigate the ability of SRH models to correctly infer topologies and branch 124 lengths from non-SRH data, we devised a new approach that allows us to simulate alignments 125 gradually ranging from true SRH conditions (with identical base frequencies and identical 126 reversible substitution processes on every branch of the topology) to the most extreme 127 violation with completely unrelated base frequencies and non-reversible substitution 128 processes on every branch of the topology. For an alignment of m taxa and n sites, we will 129 denote the set of all branches in the rooted tree τ as Φ = {1, … , }. 130 We simulate data under two different simulation schemes as follows: 131 1. An inheritance scheme designed to reflect the evolutionary process, in which each node 132 in the tree inherits its substitution processes from its parent with a constant strength of 133 inheritance modified by the branch length connecting the two nodes. The scheme reflects 134 the continuity of evolutionary processes that are changing through time along a 135 phylogenetic tree. 136 2. A two-matrix scheme designed to reflect previous approaches to simulating non-SRH 137 evolution, where two independent subtrees (that are not sisters nor descendants of each 138 other) have an identical substitution process and that is distinct from the substitution 139 process that operates on the rest of the tree. This scheme resembles convergent evolution. 140 Applying these two schemes allows us to ask how evolutionarily-inspired non-SRH 141 simulations are affected by SRH assumptions (scheme 1) and then to directly compare these 142 to the more extreme forms of non-SRH evolution that are more often simulated (scheme 2). 143 We will describe both simulation approaches in more detail below. But  Both of our simulation approaches require us to choose base frequency vectors and 148 rate matrices with which to simulate alignments. Generating these at random could limit the 149 applicability of our results because it is unlikely that randomly-generated base frequency 150 vectors or rate matrices would reflect reality. To address this, we instead estimated base 151 frequency vectors and rate matrices from a large collection of empirical alignments, and then 152 used these parameters for our simulations. 153 In order to estimate the distributions of the empirical base frequencies (Π) and the 154 substitution rates ( ) we used 32,666 partitions from 49 nucleotide datasets (Appendix Table  155 A.1). Consisting of different types of loci (codon positions, rRNA, tRNA, introns, intergenic 156 spacers and UCEs) and genomes (nuclear, mitochondria, virus, plastid). Since different 157 partitions of the genome evolve differently, for each partition, we ran IQ-TREE with a GTR 158 model and free rate heterogeneity across sites (Yang 1995) with 4 categories + invariant sites. 159 This gave us the distributions of 32,666 estimates of each parameter in the GTR matrix 160 (A↔C, A↔G, A↔T, C↔G, C↔T, G↔T) and the distribution of each base frequency ( , 161 , , ). 162 We use a similar approach to estimate the distribution of branch lengths. Estimating 163 branch lengths from each partition separately could be misleading because there tends to be a 164 high stochastic error in branch lengths estimated from short single-partition alignments 165 (Kumar, et al. 2012). Therefore, in order to estimate the empirical distribution of the branch 166 lengths, we instead estimated a single set of branch lengths from each of our 49 nucleotide 167 datasets and complemented these with an additional 18 amino-acid datasets. For each dataset, we ran IQ-TREE with the best-fit fully-partitioned model (Chernomor, et al. 2016), which 169 allows each partition to have its own evolutionary model and edge-linked rates determined by 170 ModelFinder (Kalyaanamoorthy, et al. 2017). We then rooted the tree with the outgroup taxa 171 (if provided) and extracted the empirical branch lengths of the ingroup (T) for each of the 172 33,178 partitions from 67 nucleotide and amino acid datasets. 173 Finally, for each parameter in (5 parameters -G↔T equals to 1) and Π (4 174 parameters), and for each distribution in T (67 distributions -each dataset is an independent 175 distribution) we find the best-fit distribution from 36 common probability distributions using 176 the Kolmogorov-Smirnov test using SciPy (Virtanen, et al. 2020). We then sampled 177 parameters for our simulations from these best-fit distributions. Since the parameters of Π are 178 not independent, to sample a base-frequency vector we randomly sampled a parameter from 179 each of the four base-frequency's best-fit distribution and then normalized these parameters 180 to sum to 1. 181 The tree topology τ is derived from birth-death simulations with speciation rate λ, 182 extinction rate μ and the fraction of sampled taxa using TreeSim package with a fixed 183 number of extant species (Stadler 2011). In principle, it is possible to estimate the speciation 184 and extinction rates from empirical data (Nee, et al. 1994; Rannala and Yang 1996;Magallon 185 and Sanderson 2001). However, not knowing the fraction of sampled taxa a priori will tend to 186 bias such estimates (Stadler 2013;Hua and Lanfear 2018). Because of the challenges of 187 reliably estimating empirical speciation and extinction rates, we instead randomly sampled 188 the speciation rate, the extinction rate and the fraction of sampled taxa from uniform 189 distributions, to attempt to cover all the realistic regions of the parameter space. inheritance is perfect and the original evolutionary process is SRH, such a process would 205 define a molecular evolutionary process that is SRH across the entire topology by simply 206 defining a single SRH model at the root node. At the other extreme, where the association 207 between parent and offspring lineages is no better than random and the original process is not 208 SRH, there is no association between parent and offspring lineages and the process is 209 maximally non-SRH. To mimic this situation, we designed a simulation approach that allows 210 us to vary the homogeneity and stationarity assumptions both independently and together. 211 Our inheritance scheme allows us to vary the degree to which a single alignment has 212 evolved under SRH conditions by simply adjusting the strength of inheritance of the 213 substitution process and the base frequencies either jointly via a parameter we call , or 214 independently via parameters and respectively. When the inheritance parameters are set 215 to 1 and the model at the root of the tree is reversible, the model will conform to SRH 216 conditions. We can simulate increasing violation of SRH conditions simply by decreasing the inheritance parameters towards zero. When the relevant inheritance parameter is less than 218 one, each branch inherits some proportion of its substitution model from the parent branch, 219 while the remaining proportion of the model is selected at random from the empirical 220 parameter distributions. In practice, the parameter in a descendant branch is calculated as the 221 weighted sum of the parameter in the parent branch (where the weight is the inheritance 222 parameter) and a randomly-generated parameter from the appropriate empirical distribution 223 (where the weight is one minus the inheritance parameter). 224 We simulated data under five different categories of conditions using this scheme, in 225 order to examine independently and together the effects of relaxing the stationarity and 226 homogeneity assumptions. 227

2.
Relaxing the stationarity assumption (Fig. 1b).-In order to hold the homogeneity 233 assumption but relax the stationarity assumption, we introduce a parameter called 234 (0 ≤ ≤ 1) that allows to vary the state frequency at the root while still keeping the 235 same rate matrix for all branches of the tree. Mathematically, this can be described as: 236 Where is the substitution rate matrix operating on branch i, and is the branch 238 length of the root branch. 239 When ν = 1, is equal to 0 and this scheme boils down to the first SRH 240 condition. When ν = 0, is equal to , meaning that the root frequency is 241 generated separately from 0 .
will vary between these two extremes when ν is 242 between 0 and 1, with lower ν reflecting a larger deviation from stationary conditions. 243

3.
Relaxing the homogeneity assumption (Fig. 1c).-In order to hold the stationarity 244 assumption but relax the homogeneity assumption we need to simulate data in which 245 is set to 1 (such that all branches have the same base frequencies as the root node), but 246 we introduce a parameter that varies between zero and one (such that the inheritance 247 of the parameters of the matrix ranges from completely random to near-perfect). We 248 can describe this mathematically as follows: 249 Where is the process operating on branch i, are the substitution rates on the 251 parent branch of branch i, and is the branch length of the branch i. 252

4.
Relaxing the stationarity and homogeneity assumptions simultaneously but 253 independently (Fig. 1d).-We can simulate non-stationary and non-homogeneous data 254 by setting both and to values less than one. When we relax both assumptions, we 255 will allow and to vary simultaneously but independently: 256 Relaxing the stationarity and homogeneity assumptions jointly (Fig. 1e).-While the 4 th 258 set of simulation conditions, above, allows us to vary homogeneity and stationarity 259 jointly but independently, it suffers from the limitation that we have a maximum of two 260 base frequency vectors in the tree ( and 0 ). To relax this assumption further, we 261 will allow to vary while stays fixed. In those settings, both homogeneity and 262 stationarity will increase with . 263 Previous studies for simulating non-SRH evolution on phylogenies have used an 275 approach in which two distantly related branches undergo severe but correlated changes in 276 the molecular evolutionary process. To compare this approach to the more evolutionarily-277 motivated approach described above, we randomly chose two nodes that are not sisters and 278 not descendants of each other and assigned a different rate matrix (denoted by 1 1 ) from the 279 rest of the tree to all their descendant branches (Fig. 2FIGURE 2).

Tree Inference 301
Our first goal is to understand how the incorrect use of SRH models on data that have 302 evolved under non-SRH processes can affect phylogenetic inference. To do this, we compare 303 the tree topologies and branch lengths estimated with SRH models in IQ-TREE to the 304 topologies and branch lengths used to simulate each dataset. For each simulated alignment, 305 we ran IQ-TREE with ModelFinder (Kalyaanamoorthy, et al. 2017) and 1000 ultrafast 306 bootstrap replicates (Hoang, et al. 2018). In order to assess the ability of SRH models to infer 307 the correct tree topology we then compared the simulated tree topology to the estimated tree 308 topology using three different metricsnormalized Robinson-Foulds distance (Robinson and 309 Foulds 1981), Quartet distance (Estabrook, et al. 1985), and the Path-difference distance 310 (Steel and Penny 1993). The normalized Robinson-Foulds distance between two trees is the 311 fraction of internal branches that appear in one tree but not the other. It ranges from 0 to 1, 312 where 0 means that the two trees are topologically identical and 1 means that the two trees 313 have no branches in common. In order to assess the accuracy of branch length estimates, we 314 tested whether the estimated branch lengths and the original branch lengths are drawn from 315 the same distribution using the two-sample Kolmogorov-Smirnov test. 316

Detecting non-SRH Processes 317
We, therefore, tested the ability of three tests implemented in IQ-TREE to detect 318 We derived the empirical distributions of the substitution model parameters, the 340 nucleotide frequencies, and the proportion of invariant sites from 32,666 nucleotide 341 alignments (Appendix Table A.2). The empirical distribution of branch lengths we derived 342 from 67 nucleotide and amino acid alignments consist of 33,178 partitions (Appendix Table  343 A.1). 344 Using Kolmogorov-Smirnov test, we found the best-fit probability distribution for 345 each one of these empirical distributions (  Difference) metrics in any of our inheritance simulations (Fig. 3FIGURE 3, Appendix Figs.  357 A.4-7). These metrics measure the difference between the inferred tree and the tree from 358 which the alignment was simulated. If stronger violation of the SRH conditions affects 359 phylogenetic inference we should expect to see that the distances are higher when the 360 inheritance weight is lower, because a lower inheritance weight implies stronger model 361 violation through less homogeneity (for the rate matrix) and less stationarity (for the base frequencies). In addition, our results show that the proportion of simulated datasets for which 363 the simulated tree is recovered from the simulated alignment is constant at around 0.25 in the 364 inheritance scheme simulations regardless of the inheritance weight (Fig. 3, Appendix Fig.  365 A.4). Finally, we see no correlation between the inheritance weights and the proportion of 366 datasets that fail a Kolmogorov-Smirnov test comparing the true and estimated branch 367 lengths, suggesting that violation of SRH assumptions in our evolutionary framework has no 368 detectable effect on the estimation of branch lengths (Fig. 4, Appendix Fig. A.19). 369

SRH Conditions 371
Our results show that convergent violation of SRH assumptions by allowing two 372 distantly related branches to have identical substitution models has increasingly severe effects 373 on phylogenetic inference as the severity of the changes in the substitution models increases. 374 Under the two-matrix scheme, we expect to see higher distances between the true tree and the 375 estimated tree when there are larger Euclidian distances between the original matrix and the 376 matrix under which the divergent clades evolve. In two out of the three metrics (Robison-377 Foulds and Path-Difference) we found a weak but significant correlation between the distance 378 between the matrices and the distance between the topologies (Fig. 3, Appendix Fig. A.4, 379 Appendix Figs. A.8-10). However, in the third metric (Quartet Distance) we found no 380 correlation. Notably, the distance between the true tree and the estimated tree increases only 381 when the Euclidean distance between the two matrices is very high. Nevertheless, the 382 proportion of simulated datasets for which the simulated tree is recovered from the simulated 383 alignment declines exponentially as the difference between the matrices in the two-matrix 384 scheme increases (Fig. 3).

FIGURE 3.
Normalized Robinson-Foulds distance between the estimated tree topology 387 and the original tree topology as a function of the inheritance weight (ν, ω ,ρ) in the first 388 simulation scheme, and the distance between the two matrices (delta(Q)) in the second 389 simulation scheme. The small plots show the proportion of datasets in which the distance 390 between the estimated topology and the original topology equals zero as a function of the 391 inheritance weight and the distance between the two matrices. If violation of SRH model 392 assumptions increases topological error, we expect the nRF distance to increase towards the 393 right of each plot. The figure shows that for the first simulation scheme, which mimics a 394 stochastic evolutionary process, there is no detectable association between violation of SRH 395 conditions and topological error. For the second simulation scheme, which mimics an 396 extreme convergent situation, topological error increases with increasing violation of SRH 397 conditions. 398 The proportion of analyses in which the simulated tree is recovered positively 400 declines from around 0.20 when there is no model violation to zero when the Euclidean 401 distance between the matrices is around 2000, confirming that even the lowest levels of SRH 402 violation have detectable negative effects on phylogenetic inference under the two-matrix 403 scheme. 404 Finally, we see no correlation between the Euclidean distance between the two 405 matrices and the proportion of datasets that fail a Kolmogorov-Smirnov test comparing the 406 distributions of the true and estimated branch lengths, suggesting that violation of SRH 407 assumptions in the convergent framework has limited effects on the estimation of branch 408 lengths (Fig. 4, Appendix Fig. A.22 As expected, the ability of all three MaxSym tests to reject the null hypothesis of 417 stationarity and homogeneity improves as the inheritance weight in the evolutionary simulations decreases (i.e. as the violation of SRH conditions increases), the distance 419 between the two matrices in the convergent simulations increases, and the number of sites in 420 the alignment increases (Fig. 5, Appendix Figs. A.11-13). Moreover, the three MaxSym tests 421 have a reasonable false positive rate of approximately 4.5% (Appendix Table A.4). However, 422 they also have very high false-negative rates of 50-90%, depending on the test and the 423 particular simulation conditions (Fig. 5, Appendix Table A.3, Appendix Figs. A.14-16). In 424 the two-matrix scheme simulations, the false negative rates of MaxSym, MaxSymmar, and 425 MaxSymint tests are 67%, 66%, and 87%, respectively. Thus, across all simulation conditions, 426 a significant result from a MaxSymTest can be reliably interpreted as indicating that an 427 alignment violates the SRH conditions, but the test will fail to identify many such alignments. 428 Similarly to the MaxSym tests results, the ℎ 2 and ℎ 2 tests show an 429 increase in the proportion of alignments and/or sequences that fail the test in each dataset as 430 the inheritance weight decreases, and the number of sites increases (Fig. 5a, Appendix Fig.  431 A.17). The false-positive rates of the ℎ 2 test is 6% (Appendix Table A.6). The false-432 negative rate of the ℎ 2 test in the inheritance-scheme simulations is 57% (Appendix 433 Table A.5). Moreover, similar to the MaxSym tests, in the two-matrix scheme simulation, the 434 percentage of datasets that pass the ℎ 2 decreases logarithmically the higher the distance 435 between the two matrices (Fig. 5b, Appendix Fig. A.18). The false negative rate of the 436 ℎ 2 test under extreme convergent evolution is the smallest of all the tests considered 437 here under these conditions, and it is around 44% (Appendix Table A.5). 438 In the inheritance-scheme simulations, similarly to the MaxSym tests, the ℎ 2 439 and ℎ 2 , the WvH test shows an increase in the proportion of alignments that fail the test 440 as the inheritance weight (ω and ρ) decreases. However, ν has no effect on the proportion of 441 alignments that fail the WvH test. The false-positive rates of the WvH test is 3.5% (Appendix WvH test, and the Chi-square test as a function of (a) the inheritance weights (ν,ρ,ω) (b) 453 maximum Euclidian distance between the two matrices. We define datasets that pass the 454 Chi2Cons test as datasets where all the sequences pass the Chi-square test. On the other hand, 455 Chi2Rank shows the proportion of sequences that pass the Chi-square test in each dataset. 456 457

MaxSymTestint is a good predictor of correct tree inference 458
A key question for empiricists is whether tests of model adequacy are likely to 459 improve phylogenetic inference. To explore this in our simulation framework, we asked 460 whether datasets that are rejected by the tests we evaluated tended to be associated with more 461 phylogenetic tree error than those that were not rejected. To do this, we used three different 462 metrics of tree distance (the normalized Robison-Foulds (RF), Path-Difference, and Quartet 463 distance) and asked whether datasets that fail the test (i.e. have detectable non-SRH 464 processes) tended to result in trees that were further from the true tree (i.e. had higher nRF 465 distances) when analysed using SRH models. All three showed very similar results, so we show the normalized Robinson-Foulds results here (Fig. 6) and the other metrics in the 467 supplementary information (Appendix Fig. A.23a, Appendix Fig. A.24a). 468 For the inheritance scheme simulations we found as expected that datasets which 469 failed the MaxSym tests were associated with trees much further from the true tree than those 470 that passed the tests, although there was substantial variation within each category (Fig. 6a,  471 Appendix Fig. A.23a, Appendix Fig. A.24a). Surprisingly, this pattern was reversed for the 472 ℎ 2 test, and there was a very small difference in tree distances with the WvH test (Fig.  473   6a). Welch's t-test results suggest all of the differences are statistically significant (p<<0.05, 474 Fig. 6a). 475 For the two-matrix simulations the only test for which datasets which failed were 476 associated with trees further from the true tree was the MaxSymint test (Fig. 6b, Appendix 477 Fig. A.23b, Appendix Fig. A.24b). For all other tests, datasets which failed the test were 478 associated with trees that were markedly closer to the true tree than datasets which passed the 479 tests (Fig. 6b, Appendix Fig. A.23b, Appendix Fig. A.24b). Again, Welch's t-test results 480 suggest all of the differences are statistically significant (p<<0.05, Fig. 6b). The first simulation scheme we introduced in this paper, which we called the 500 inheritance scheme, allows tree branches to inherit their substitution process from their 501 ancestor. The second simulation scheme, which we called the two-matrix scheme, is similar 502 to previous studies and allows two distantly related monophyletic sub-trees to evolve with a different evolutionary process from the rest of the tree (Galtier and Gouy 1995;Jermiin, et al. 504 2004;Jayaswal, et al. 2011;Duchene, et al. 2017). 505 Surprisingly, our results show no correlation between errors in the topology or branch 506 length inference and any of the inheritance scheme parameters, even in extreme cases where 507 the evolutionary process is completely heterogeneous and non-stationary. These results 508 indicate that ML tree inference with SRH models is surprisingly robust to even quite extreme 509 violations of the SRH conditions. 510 Under the two-matrix simulation scheme, we found a small but significant increase in 511 topological inference error and the extent of the violation of the SRH assumptions. 512 Specifically, the more extreme the evolutionary convergence, the larger the errors in 513 topological inference that assumes SRH conditions. Despite this, we found no correlation 514 between branch length inference and the distance between the two matrices. In this study, we also tested the power of the MaxSym tests, WvH test and two 523 variations of the Chi-squared test to detect model violation due to non-SRH evolution. Our 524 results show that those tests were able to detect some model violation in both simulation 525 schemes. As expected, the power of all tests to detect model violation due to non-SRH 526 evolution improves dramatically as alignment length increases, reflecting simply the larger 527 amount of information available in longer alignments. However, the power of most of the 528 tests we looked at was somewhat limitedeven in the best-case scenario when violation of 529 the SRH conditions was severe, most tests were able to detect this violation in less than 50% 530 of the simulated datasets (Fig. 5). The two exceptions were the WvH test, which was able to 531 detect the vast majority of datasets simulated with model violation under the inheritance 532 scheme simulations (Fig. 6a) and the conservative Chi-Square test, which was able to detect 533 the majority of datasets simulated with model violation under the convergent evolution 534 scheme. However, the WvH test could not be applied to half of the datasets in our simulations 535 due to numerical instability, suggesting that it may be less useful for detecting violations of 536 SRH conditions in practice than the other tests. 537 The utility of any test of model adequacy in practice is likely to be tied to the amount 538 of phylogenetic error that a test helps empiricists avoid. All models used in phylogenetic 539 analyses are gross oversimplifications of highly complex molecular evolutionary processes, 540 and so merely detecting violations of models is necessary but not sufficient for a model 541 adequacy test to be useful. Because of this, we asked for each test whether the datasets that 542 fail the test were associated with more or less topological inference error than the datasets 543 that passed the test. Surprisingly, the only test that performed consistently well in this regard 544 was the MaxSymTestint. Under the inheritance scheme simulations, all three MaxSym tests 545 are good predictors of phylogenetic accuracy; trees that pass any of those tests are closer to 546 the true tree than trees that fail. The WvH and ℎ 2 tests on the other hand are bad 547 predictors of phylogenetic accuracy; trees that fail the ℎ 2 test are usually closer to the 548 true tree, while there is only a small difference between trees that fail and trees that pass the 549 WvH test. Surprisingly, under the convergent simulation scheme, the MaxSymTest int is the 550 only test for which datasets that pass the test are closer to the true tree than datasets that fail the test (Fig. 6). For all other tests, the datasets that pass the test were substantially further 552 from the true tree than those that fail the test. 553 It is challenging to disentangle why some tests of the SRH assumptions tend to detect 554 datasets that are associated with more topological error, while others show the opposite 555 tendency (Fig. 6), although we suspect this is often driven by the interplay of the power of the 556 tests, phylogenetic signal, and stochastic error in tree estimation. Across all simulation 557 conditions, the only test which consistently showed the desirable behaviour from an empirical 558 standpoint (i.e. where datasets that fail the test are associated with more topological error) 559 was the MaxSymTestint. All other tests showed evidence of having the opposite tendency 560 ( Fig. 6) in at least some simulation conditions. In the case of the WvH test, for which 561 alignments that fail the test were consistently associated with less topological inference error 562 when analysed under SRH models, we suspect that the underlying reason may be the reliance 563 of the test on a parametric bootstrap. The WvH test depends fundamentally on a tree 564 estimated with an SRH model to estimate the null distribution of the test statistic. If this tree 565 is wrong, as we show occurs under model violation, then the null distribution may be 566 incorrect and the test misled. For the other tests we suspect that the tendency is driven largely 567 by the fact that datasets with few informative sites will tend to both pass the tests and be 568 associated with high topological error, with both caused by the limited information in the 569 data, although further work is needed to understand these relationships in more detail. For the purpose of this study, in order to simulate data that mimic as closely as 592 possible empirical alignments, we extracted the empirical distributions of base frequencies, 593 substitution rates, proportion of invariable sites, and branch lengths from tens of thousands of 594 empirical datasets. In addition to their use in this paper, these empirical distributions, along 595 with their best-fit distributions may be useful for a wide variety of simulation studies, or for 596 specifying prior distributions for Bayesian phylogenetic methods. 597