An estimate of the deepest branches of the tree of life from ancient vertically-evolving genes

Core gene phylogenies provide a window into early evolution, but different gene sets and analytical methods have yielded substantially different views of the tree of life. Trees inferred from a small set of universal core genes have typically supported a long branch separating the archaeal and bacterial domains. By contrast, recent analyses of a broader set of non-ribosomal genes have suggested that Archaea may be less divergent from Bacteria, and that estimates of inter-domain distance are inflated due to accelerated evolution of ribosomal proteins along the inter-domain branch. Resolving this debate is key to determining the diversity of the archaeal and bacterial domains, the shape of the tree of life, and our understanding of the early course of cellular evolution. Here, we investigate the evolutionary history of the marker genes key to the debate. We show that estimates of a reduced Archaea-Bacteria (AB) branch length result from inter-domain gene transfers and hidden paralogy in the expanded marker gene set. By contrast, analysis of a broad range of manually curated marker gene datasets from an evenly sampled set of 700 Archaea and Bacteria reveal that current methods likely underestimate the AB branch length due to substitutional saturation and poor model fit; that the best-performing phylogenetic markers tend to support longer inter-domain branch lengths; and that the AB branch lengths of ribosomal and non-ribosomal marker genes are statistically indistinguishable. Furthermore, our phylogeny inferred from the 27 highest-ranked marker genes recovers a clade of DPANN at the base of the Archaea, and places CPR within Bacteria as the sister group to the Chloroflexota.

Much remains unknown about the earliest period of cellular evolution and the deepest 41 divergences in the tree of life. Phylogenies encompassing both Archaea and Bacteria have been 42 inferred from a "universal core" set of 16-56 genes encoding proteins involved in translation and 43 other aspects of the genetic information processing machinery (Ciccarelli et al., 2006;Fournier 44 and Gogarten of genes encoding translational and in particular ribosomal proteins along the AB branch as 73 compared to other genes. Interestingly, the same observation was made previously using a 74 smaller set of 38 non-ribosomal marker proteins (Petitjean et al., 2014), although the difference 75 in AB branch length between ribosomal and non-ribosomal markers in that analysis was reported 76 to be substantially lower (roughly two-fold, compared to roughly ten-fold for the 381 protein set 77 (Petitjean et  divergences (Gouy et al., 2015). Failure to model site-specific amino acid preferences has 90 previously been shown to lead to under-estimation of the AB branch length due to a failure to 91 detect convergent changes (Tourasse and Gouy, 1999;Williams et al., 2020), although the 92 published analysis of the 381 marker set did not find evidence of a substantial impact of these 93 features on the tree as a whole (Zhu et al., 2019). Those analyses also identified phylogenetic 94 incongruence among the 381 markers, but did not determine the underlying cause (Zhu et al.,95 2019). 96 97 This recent work (Zhu et al., 2019) raises two important issues regarding the inference of the 98 universal tree: first, that estimates of the genetic distance between Archaea and Bacteria from 99 classic "core genes" may not be representative of ancient genomes as a whole, and second, that 100 there may be many more suitable genes to investigate early evolutionary history than generally 101 recognized, providing an opportunity to improve the precision and accuracy of deep phylogenies. 102 Here, we investigate these issues in order to determine how different methodologies and marker 103 sets affect estimates of the evolutionary distance between Archaea and Bacteria. First, we 104 examine the evolutionary history of the 381 gene marker set (hereafter, the expanded marker 105 gene set) and identify several features of these genes, including instances of inter-domain gene 106 transfers and mixed paralogy, that may contribute to the inference of a shorter AB branch length 107 in concatenation analyses. Then, we re-evaluate the marker gene sets used in a range of previous 108 analyses to determine how these and other factors, including substitutional saturation and model 109 fit, contribute to inter-domain branch length estimations and the shape of the universal tree. 110 Finally, we identify a subset of marker genes least affected by these issues, and use these to 111 estimate an updated tree of the primary domains of life and the length of the stem branch that 112 separates phylogenies and the species tree, but did not investigate the underlying causes. Our analyses 180 establish the basis for these disagreements in terms of gene transfers and the mixing of 181 orthologues and paralogues within and between domains. The estimation of genetic distance 182 based on concatenation relies on the assumption that all of the genes in the supermatrix evolve 183 on the same underlying tree; genes with different gene tree topologies violate this assumption 184 and should not be concatenated because the topological differences among sites are not 185 modelled, and so the impact on inferred branch lengths is difficult to predict. In practice, it is often 186 difficult to be certain that all of the markers in a concatenate share the same gene tree topology, 187 and the analysis proceeds on the hypothesis that a small proportion of discordant genes are not 188 expected to seriously impact the inferred tree. However, the concatenated tree inferred from the 189 expanded marker set differs from previous trees in that the genetic distance between Bacteria 190 and Archaea is greatly reduced, such that the AB branch length appears comparable to distances 191 among bacterial phyla (Zhu et al., 2019). Because an accurate estimate of the AB branch length 192 has a major bearing on unanswered questions regarding the root of the universal tree (Gouy et 193 al., 2015), we next evaluated the impact of the conflicting gene histories within the expanded 194 marker set on inferred AB branch length.

196
The ( Figure 1A). To estimate AB branch lengths for genes in which the domains were not 203 monophyletic in the ML tree, we first performed a constrained ML search to find the best gene 204

inferred branch length between Archaea and Bacteria is shortened by inter-domain
tree that was consistent with domain monophyly for each family under the LG+G4+F model in IQ-205 TREE 2 . While it may seem strained to estimate the length of a branch that 206 does not appear in the ML tree, we reasoned that this approach would provide insight into the 207 contribution of these genes to the AB branch length in the concatenation, in which they conflict 208  with the overall topology. AB branch lengths were significantly (p = 3.653✕10 -6 , Wilcoxon rank 209 sum test) shorter for markers that rejected domain monophyly (Bonferroni-corrected p < 210 0.0001656; Figure 1A): mean AB branch length was 0.00668 substitutions/site for markers that 211 significantly rejected domain monophyly, and 0.287 substitutions/site for markers that did not 212 reject domain monophyly). This behaviour might result from marker gene transfers reducing the 213 number of fixed differences between the domains, so that the AB branch length in a tree in which 214 Archaea and Bacteria are constrained to be reciprocally monophyletic will tend towards 0 as the 215 number of transfers increases. 216 217 To test the hypothesis that phylogenetic incongruence among markers might reduce the inferred 218 Archaea-Bacteria distance, we evaluated the relationship between AB distance and two 219 complementary metrics of marker gene verticality: ∆LL, the difference in log likelihood between 220 the constrained ML tree and the ML gene tree (a proxy for the extent to which a marker gene 221 rejects the reciprocal monophyly of Bacteria and Archaea) and the "split score" (Dombrowski et  222 al., 2020), which measures the extent to which marker genes recover established relationships 223 for defined taxonomic levels of interest (for example, at the level of domain, phylum or order), 224 averaging over bootstrap distributions of gene trees to account for phylogenetic uncertainty (see 225 Methods). We evaluated split scores at both the between-domain and within-domain ( genes that recover the reciprocal monophyly of Archaea and Bacteria also evolve more vertically 234 within each domain, and that these vertically-evolving marker genes support a longer AB branch 235 and a greater AB distance. Consistent with this inference, AB branch lengths estimated using 236 concatenation decreased as increasing numbers of low-verticality markers (that is, markers with 237 higher ∆LL) were added to the concatenate ( Figure 1H). These results suggest that inter-domain 238 gene transfers reduce the AB branch length when included in a concatenation. 239 240 An alternative explanation for the positive relationship between marker gene verticality and AB 241 branch length could be that vertically-evolving genes experience higher rates of sequence 242 evolution. For a set of genes that originate at the same point on the species tree, the mean root-243 to-tip distance (measured in substitutions per site, for gene trees rooted using the MAD method 244 ( that vertically-evolving genes evolve relatively slowly (note that large values of ∆LL and split score 248 denote low verticality). Thus, the longer AB branches of vertically-evolving genes do not appear 249 to result from a faster evolutionary rate for these genes. Taken together, these results indicate 250 that the inclusion of genes that do not support the reciprocal monophyly of Archaea and Bacteria, 251 or their constituent taxonomic ranks, in the universal concatenate explain the reduced estimated 252 AB branch length. 253 254

Finding ancient vertically-evolving genes 255 256
To estimate the AB branch length and the phylogeny of prokaryotes using a dataset that resolves 257 some of the issues identified above, we performed a meta-analysis of several previous studies to 258 identify a consensus set of vertically-evolving marker genes. We identified unique markers from 259 these analyses by reference to the COG ontology ( and 16 ribosomal and non-ribosomal genes, respectively, comprising the 54 marker genes set. 285 We found no evidence that there was a longer AB branch associated with ribosomal markers than 286 for other vertically-evolving "core" genes ( Figure 2B; Gamma rate category; 868 sites) and slowest sites (sites with highest probability of being in the 299 slowest Gamma rate category, 1604 sites) and compared relative branch lengths inferred from 300 the entire concatenate, using IQ-TREE 2 to infer site-specific rates (Figure 3).  Slow-evolving sites support a relatively long inter-domain branch and less diversity within the domains (that is, shorter between-taxa branch lengths within domains). This suggests that substitution saturation (overwriting of earlier changes) may reduce the relative length of the AB branch at fast-evolving sites and genes.

Figure 2. The relationship between marker gene verticality, AB branch length, and functional category. (A) Vertically-evolving phylogenetic markers have longer AB
Notably, the proportion of inferred substitutions that occur along the AB branch differs between 302 the slow-evolving and fast-evolving sites. As would be expected, the total tree length measured 303 in substitutions per site is shorter from the slow-evolving sites, but the relative AB branch length 304 is longer (1.2 substitutions/site, or ~2% of all inferred substitutions, compared to 2.6 305 substitutions/site, or ~0.04% of all inferred substitutions for the fastest-evolving sites; see Figure  306 3- Figure Supplement 1 for absolute tree size comparisons). Since we would not expect the 307 distribution of substitutions over the tree to differ between slow-evolving and fast-evolving sites, 308 this result suggests that some ancient changes along the AB branch at fast-evolving sites have 309 been overwritten by more recent events in evolution ---that is, that substitutional saturation leads 310 to an underestimate of the AB branch length (this is the case for both the expanded marker set, 311 and the 27 marker set (Figure 3-Figure Supplement 2). 312 313 Another factor that has been shown to lead to underestimation of genetic distance on deep 314 branches is a failure to adequately model the site-specific features of sequence evolution (Lartillot 315 and LG+G4+F model (1.45 substitutions/site). Thus, substitution model fit has a major effect on the 332 estimated length of the AB branch, with better-fitting models supporting a longer branch length 333 ( Table 1). The same trends are evident when better-fitting site-heterogeneous models are used 334 to analyse the expanded marker set: considering only the top 5% of genes by ∆LL score, the AB 335 branch length is 1.2 under LG+G4+F, but increases to 2.4 under the best-fitting LG+C60+G4+F 336 model (Figure 3-Figure Supplement 3 The Hadesarchaea (92% bootstrap support) and a clade comprising Theionarchaea, 383 Methanofastidiosa, and Thermococcales (92% bootstrap support) branch basal to the clade 384 comprising TACK and Asgard Archaea in our analysis, rather than with other Euryarchaeota. 385

Figure 4: A phylogeny of Archaea and Bacteria inferred from a concatenation of 27 marker genes. Consistent with some recent studies (Dombrowski et al., 2020; Guy and Ettema, 2011; Raymann et al., 2015; Williams et al., 2017), we recovered the DPANN, TACK and Asgard Archaea as monophyletic groups. Although the deep branches within
Bacteria are poorly resolved, we recovered a sister group relationship between CPR and Chloroflexota, consistent with recent reports (Taib et al., 2020, Coleman et al., 2021). The tree was inferred using the best-fitting LG+C60+G4+F model in IQ-TREE 2 . Branch lengths are proportional to the expected number of substitutions per site. Support values are ultrafast (UFBoot2) bootstraps (Hoang et al., 2018). Numbers in parenthesis refer to the number of taxa within each collapsed clade. Please note that collapsed taxa in the Archaea and Bacteria roughly correspond to order-and phylum-level lineages, respectively. Our results demonstrate that slow and fast-evolving sites from the same set of marker genes 420 support different tree shapes and branch lengths; it therefore seems possible that between-421 dataset differences are due, at least in part, to evolutionary rate variation within and between 422 marker genes. inferring deep divergence times, because age estimates of the last universal common ancestor

Figure 5. Molecular clock estimates of LUCA and LACA age are uncertain due to a lack of deep calibrations and maximum ages for microbial clades. (A) Posterior node age estimates from Bayesian molecular clock analyses of 1) Conserved sites as estimated
previously (Zhu et al., 2019); 2) Random sites (Zhu et al., 2019) 3) Ribosomal genes (Zhu et al., 2019) 4) The top 5% of marker gene families according to their ∆LL score (including only 1 ribosomal protein) and 5) The same top 5% of marker genes trimmed using BMGE (Criscuolo and Gribaldo, 2010) to remove poorly-aligned sites. In each case, a strict molecular clock was applied, with the age of the Cyanobacteria-Melainabacteria split constrained between 2.5 and 2.6 Ga. In 6) and 7) an expanded set of fossil calibrations were implemented with a relaxed (lognormal) molecular clock. In 6) a soft maximum age of 4.520 Ga was applied, representing the age of the moon-forming impact (Kleine et al., 2005 File 5a), resulted in age estimates for LUCA that exceed the age of the Earth, >~5.5Ga ( Figure  447 5), approaching the age inferred from the ribosomal genes (7.46-8.03 Ga). These results ( Figure  448 5) suggest that the apparent agreement between the fossil record and divergence times estimated 449 from the expanded gene set may be due, at least in part, to the shortening of the AB branch due 450 to phylogenetic incongruence among marker genes. 451 452 In the original analyses, the age of LUCA was estimated using a strict clock with a single 453 calibration constraining the split between Cyanobacteria and Melainabacteria derived from 454 estimates of the Great Oxidation Event and a secondary estimate of the age of cyanobacteria 455 derived from an independent analysis (Shih et al., 2017). The combination of a strict clock and 456 only two calibrations is not sufficient to capture the variation in evolutionary rate over deep 457 timescales (Drummond et al., 2006). To investigate whether additional calibrations might help to 458 improve age estimates for deep nodes in the universal tree, we performed analyses on our new 459 27 marker gene dataset using two different relaxed clock models (with branchwise independent 460 and autocorrelated rates) and 7 additional calibrations (Supplementary File 5b). Unfortunately, all 461 of these were minimum age calibrations with the exception of the root (for which the moon-forming 462 impact 4.52Ga (Kleine et al., 2005) provides a reasonable maximum), due to the difficulty of 463 establishing uncontroversial maximum ages for microbial clades. Maximum age constraints are 464 essential to inform faster rates of evolution because, in combination with more abundant minimum 465 age constraints, they imply that a given number of substitutions must have accumulated in at most 466 a certain interval of time. In the absence of other maximum age constraints, the only lower bound 467 on the rate of molecular evolution is provided by the maximum age constraint on the root (LUCA). 468 469 These new analyses indicated that even with additional minimum age calibrations, the age of 470 LUCA inferred from the 27-gene dataset was unrealistically old, falling close to the maximum age 471 constraint in all analyses even when the maximum was set to the age of the known universe 472 (13.7Ga (Planck Collaboration et al., 2018); Figure 5). Inspection of the inferred rates of molecular 473 evolution across the tree ( Figure 5B) provides some insight into these results: the mean rate is 474 low (mean = 0.21, 95% credibility interval = 0.19-0.22 subs. site -1 Ga -1 ), so that long branches 475 (such as the AB stem), in the absence of other information, are interpreted as evidence of a long 476 period of geological time. These low rates likely result both from the limited number of calibrations 477 and, in particular, the lack of maximum age constraints. 478 479 An interesting outlier among inferred rates is the LUCA to LACA branch, which has a rate tenfold 480 greater than the average (mean = 2.51, 95% HPD = 1.6-3.5 subs. site -1 Ga -1 ). The reason is that 481 calibrations within Bacteria imply that LBCA cannot be younger than 3.227 Ga (Manzimnyama 482 Banded Ironstone Formation provides evidence of cyanobacterial oxygenation (Satkoski et al.,  483 2015), Supplementary File 5b)); as a result, with a 4.52Ga maximum the LUCA to LBCA branch 484 cannot be longer than 1.28Ga. By contrast, the early branches of the archaeal tree are poorly 485 constrained by fossil evidence. Analysis without the 3.227Ga constraint resulted in overlapping 486 age estimates for LBCA (4.47-3.53Ga) and LACA (4.37-3.44Ga). Finally, analysis of the archaeal 487 and bacterial subtrees independently (that is, without the AB branch, rooted on LACA and LBCA, 488 respectively) resulted in LBCA and LACA ages that abut the maximum root age (LBCA: 4.52-489 4.38Ga; LACA: 4.52-4.14Ga). This analysis demonstrates that, under these analysis conditions, 490 the inferred age of the root (whether corresponding to LUCA, LACA or LBCA) is strongly 491 influenced by the prior assumptions about the maximum age of the root. 492 493 In sum, the agreement between fossils and age estimates from the expanded gene set appears 494 to result from the impact of phylogenetic incongruence on branch length estimates. Under more 495 flexible modelling assumptions the limitations of current clock methods for estimating the age of 496 LUCA become manifest: the sequence data only contain limited information about the age of the 497 root, with posterior estimates driven by the prior assumptions about the maximum age of the root. 498 This analysis implies several possible ways to improve age estimates of deep branches in future 499 analyses. More calibrations, particularly maximum age constraints and calibrations within 500 Archaea, are essential to refine the current estimates. Given the difficulties in establishing 501 maximum ages for archaeal and bacterial clades, constraints from other sources such as donor- no evidence that ribosomal genes overestimate stem length; since they appear to be transferred 514 less frequently than other genes, our analysis affirms that ribosomal proteins are useful markers 515 for deep phylogeny. In general, high-verticality markers, regardless of functional category, 516 supported a longer AB branch length. Furthermore, our phylogeny was consistent with recent 517 work on early prokaryotic evolution, resolving the major clades within Archaea and nesting the 518 CPR within Terrabacteria. Notably, our analyses suggested that both the true Archaea-Bacteria 519 branch length ( Figure 6A), and the phylogenetic diversity of Archaea, may be underestimated by 520 even the best current models, a finding that is consistent with a root for the tree of life between 521 the two prokaryotic domains. 522 523 Phylogenies inferred from "core" genes involved in translation and other conserved cellular 524 processes have provided one of the few available windows into the earliest period of archaeal 525 and bacterial evolution. However, core genes comprise only a small proportion of prokaryotic 526 genomes, and have sometimes been viewed as outliers (Zhu et al., 2019) in the sense that they 527 are unusually vertical among prokaryotic gene families. This means that they are among the few

Figure 6. The impact of marker gene choice, phylogenetic congruence, alignment trimming, and substitution model fit on estimates of the Archaea-Bacteria branch length. (A) Analysis using a site-homogeneous model (LG+G4+F) on the complete 381 gene expanded set (i) results in an AB branch substantially shorter than previous estimates. Removing the genes most seriously affected by inter-domain gene transfer (ii)
, trimming poorly-aligned sites (iii) using BMGE (Criscuolo and Gribaldo, 2010) in the original alignments (see below), and using the best-fitting site-heterogeneous model (iv) (LG+C60+G4+F) substantially increase the estimated AB length, such that it is comparable with published estimates from the "core" set: 3.3 (Williams et al., 2020)  (https://github.com/biocore/wol/tree/master/data/), along with the genome metadata and the 554 individual newick files. We checked each published tree for domain monophyly, and also 555 performed approximately unbiased (AU) (Shimodaira, 2002) tests to assess support for domain 556 monophyly on the underlying sequence alignments using IQ Individual database searches were conducted as follows: arCOGs were assigned using PSI-584 BLAST v2.7.1+ (settings: -evalue 1e-4 -show_gis -outfmt 6 -max_target_seqs 1000 -dbsize 585 100000000 -comp_based_stats F -seg no) (Altschul et al., 1997 protein accession were used to extract the sequences from the 700 genomes, which were used 638 for subsequent phylogenetic analyses. 639 640 Marker gene inspection and analysis 641 642 We aligned these 95 marker gene sequence sets using MAFFT-L-INS-i 7.475 (Katoh and Toh,  643 2008) and removed poorly-aligned positions with BMGE 1.12 (Criscuolo and Gribaldo, 2010). We 644 inferred initial maximum likelihood trees (LG+G4+F) for all 95 markers and mapped the KO and 645 Pfam domains and descriptions, inferred from annotation of the 700 genomes, to the 646 corresponding tips (see above). Manual inspection took into consideration monophyly of Archaea 647 and Bacteria and the presence of paralogs, and other signs of contamination (HGT, LBA). 648 Accordingly, single gene trees that failed to meet reciprocal domain monophyly were excluded, 649 and any instances of HGT, paralogous sequences, and LBA artefacts were manually removed 650 from the remaining trees resulting in 54 markers across the three published datasets that were 651 subject to subsequent phylogenetic analysis (LG+C20+G4+F) and further refinement (see below). 652 653 Ranking markers based on split score 654 655 We applied an automated marker gene ranking procedure devised previously (the split score, 656 ) to rank each of the 54 markers that satisfied reciprocal monophyly 657 based on the extent to which they recovered established phylum-, class-or, order-level 658 relationships within the archaeal and bacterial domains (Supplementary File 4). 659 The script quantifies the number of splits, or occurrences where a taxon fails to cluster within its 660 expected taxonomic lineage, across all gene phylogenies. Briefly, we assessed monophyletic 661 clustering using phylum-, class-, and order-level clades within Archaea (Cluster1) in combination 662 with Cluster0 (phylum) or Cluster3 (i.e. on class-level if defined and otherwise on phylum-level; 663 Supplementary File 4) for Bacteria. We then ranked the marker genes using the following split 664 score criteria: the number of splits per taxon and the splits normalized to the species count. The 665 percentage of split phylogenetic groups was used to determine the highest ranking (top 50%) 666 markers. 667 668 Concatenation 669 670 Based on the split score ranking of the 54 marker genes (above), the top 50% (27 markers, 671 Supplementary File 4) marker genes were manually inspected using criteria as defined above, 672 and contaminating sequences were manually removed from the individual sequence files. 673 Following inspection, marker protein sequences were aligned using MAFFT-L-INS-i 7.475 (Katoh 674 and Standley, 2013) and trimmed using BMGE (version 1.12, under default settings) (Criscuolo 675 and Gribaldo, 2010). We concatenated the 27 markers into a supermatrix, which was used to 676 infer a maximum-likelihood tree ( Figure 5, under LG+C60+G4+F), evolutionary rates (see below), 677 and rate-category supermatrices as well as to perform model performance tests (see below). 678 679 Constraint analysis 680 681 We performed a maximum likelihood free topology search using IQ-TREE 2. of the markers from the expanded, bacterial, core and non-ribosomal sets. We also performed a 684 constrained analysis with the same model, in order to find the maximum likelihood tree in which 685 Archaea and Bacteria were reciprocally monophyletic. We then compared both trees using the 686 approximately unbiased (AU) Shimodaira (2002) test in IQ-TREE 2.0.6 (Minh et al., 2020) with 687 10,000 RELL (Shimodaira, 2002) bootstrap replicates. To evaluate the relationship between 688 marker gene verticality and AB branch length, we calculated the difference in log-likelihood 689 between the constrained and unconstrained trees in order to rank the genes from the expanded 690 marker set. We then concatenated the top 20 markers (with the lowest difference in log-likelihood 691 between the constrained and unconstrained trees) and iteratively added 5 markers with the next 692 smallest difference in log-likelihood to the concatenate, this was repeated until we had 693 concatenates up to 100 markers (with the lowest difference in log-likelihood) we inferred trees 694 under LG+C10+G4+F in IQ-TREE 2.0.6, with 1000 ultrafast bootstrap replicates and calculated 695 AB length. 696 697 Site and gene evolutionary rates 698 699 We inferred rates using the --rate option in IQ-TREE 2.0.6 (Minh et al., 2020) for both the 381 700 marker concatenation from Zhu (Zhu et al., 2019) and the top 5% of marker genes based on the 701 results of difference in log-likelihood between the constrained tree and free-tree search in the 702 constraint analysis (above). We also used this method to explore the differences in rates for the 703 27 marker set. We built concatenates for sites in the slowest and fastest rate categories, and 704 inferred branch lengths from each of these concatenates using the tree inferred from the 705 corresponding dataset as a fixed topology. 706 707 Substitution model fit 708 709 Model fit tests were undertaken using the top 5% concatenate described above, with the 710 alignment being trimmed with BMGE 1.12 (Criscuolo and Gribaldo, 2010) with default settings 711 (BLOSUM62, entropy 0.5) for all of the analyses except the 'untrimmed' LG+G4+F run, other  712  models  on  the  trimmed  alignment  were  LG+G4+F,  LG+R4+F  and  713 LG+C10 LG+C60+G4+F,LG+C50+G4+F,LG+C40+G4+F,LG+C30+G4+F,LG+C20+G4+F,LG+C10+G4+ 718 F,LG+G4+F,LG+R4+F) to the default, to find the best fitting model for the analysis. This revealed 719 that, according to AIC,BIC and cAIC, LG+C60+G4+F was the best fitting model. For comparison, 720 we also performed analyses using the following models: 721 LG+G4+F,LG+C20+G4+F,LG+C40+G4+F (Table 1) analyses were performed in MCMCTree (Yang, 2007) under a strict clock model. We used the 732 normal approximation approach, with branch lengths estimated in codeml under the LG+G4 733 model. In each case, a fixed tree topology was used alongside a single calibration on the 734 Cyanobacteria-Melainabacteria split. The calibration was modelled as a uniform prior distribution 735 between 2.5 and 2.6 Ga, with a 2.5% probability that either bound could be exceeded. For each 736 alignment, four independent MCMC chains were run for 2,000,000 generations to achieve 737 convergence. 738