Reconstruction of Escherichia coli ancient diversification by layered phylogenomics and polymorphism fingerprinting

The rapidly increasing availability of whole genomes provides the opportunity to reach an updated, comprehensive view of bacterial evolution. A staggered view of diversification, based on the combined strategy of layered phylogenomics and polymorphism fingerprinting, gives a new perspective on phylogenetic reconstruction. Layered phylogenomics is based on the assignment of genes to five different evolutionary layers: minimal genome, genus-core genome, species-core genome, phylogroup-core genome and phylogroup-flexible genome. Polymorphism fingerprinting is based on the detection of positions that are conserved within each phylogenetic group but differ from those of its hypothetical ancestors. This approach was applied to Escherichia coli because, although the species has been studied extensively, unresolved evolutionary questions remain. Phylogenetic analysis based on 6,220 full genomes identified three E. coli root lineages, defined as D, EB1A and FGB2. A new phylogroup, called G, was detected near phylogroup B2. The closest phylogroup to ancestral E. coli was phylogroup D, whereas E and F were the closest ones in their respective lineages; moreover, A and B2 were the most distant phylogroups in EB1A and FGB2, respectively. We suspect that the EB1A and FGB2 lineages represent different adaptive strategies. In the deepest branch of the EB1A lineage, the number of accumulated mutations was lower than in recent branches, whereas in the FGB2 lineage the opposite occurred. The FGB2 lineage was enriched in genes related to host colonization-pathogenicity and toxin-antitoxin systems (such as hipA), whereas the B1A sub-lineage acquired functions related to uptake and metabolism of carbohydrates (such as bgl, mng or xlyE). This new combined strategy yields a detailed staggered evolutionary reconstruction, which helps us to understand the deepest events and the selection forces that have driven E. coli diversification.
This approach could add resolution to the reconstruction of the evolutionary trajectories of other microorganisms.

Author summary

Phylogeny based on whole genomes provides the opportunity to study the history of eco-adaptive diversification of any bacterial taxon. Different strategies have been proposed to trace the evolutionary trajectories of some species, such as Escherichia coli; however, these analyses were based on a limited number of sequences, and the evolutionary reconstructions sometimes reached conflicting conclusions, especially in the ancestral inferences. To add resolution to evolutionary reconstructions, we propose a combination of approaches: layered phylogenomics, based on the use of different sets of genes corresponding to successive evolutionary steps, and polymorphism fingerprinting, which detects hallmarks of ancient mutations. We chose E. coli because it is a paradigmatic example of evolutionary inconsistencies despite having been the subject of many evolutionary analyses. Three ancestral lineages were established with this strategy, and a staggered reconstruction of the origin and diversification of E. coli phylogroups was inferred. Moreover, in the context of this study, a new E. coli phylogroup was defined. The main lineages represent different adaptive strategies: one lineage gained genes involved in pathogenicity, and another acquired genes that allow energy to be obtained from different sources.

Since its first description by Theodor Escherich in 1885, Escherichia coli has fascinated several generations of researchers. E. coli has been extensively used as a model to understand bacterial adaptability [1,2]. The population diversity of E. coli was initially described in terms of four main phylogroups (A, B1, B2 and D) [3]. In the following years, the increasing number of available sequences allowed the identification of three new phylogroups (C, E and F) and five cryptic clades, revealing that the population structure of E. coli was more complex than initially suggested [4]. When the first whole E. coli genome was sequenced in 1997, a new possibility opened in the comparative genomics of this microorganism [5]. The growing availability of whole E. coli genomes provided an unprecedented level of discrimination and the opportunity to perform solid evolutionary reconstructions [6]. Traditionally, bacterial genomes have been divided into a core gene pool encoding the basic cellular functions and a flexible gene pool conferring strain-, pathotype- or ecotype-specific characteristics that allow adaptation to special conditions [7]. For instance, from the first available studies based on a limited number of sequences, ranging from 20 to 61 genomes [8,9], until the most recent ones using fewer than 250 genomes [10,11] [...]

To help answer these unresolved evolutionary questions by adding resolution to the evolutionary reconstructions, a new strategy was proposed to elucidate the successive steps in E. coli diversification. Our strategy was a combination of two approaches.

One of them, coined "layered phylogenomics" (LP), is based on stratified phylogenetic analysis of genes representing successive evolutionary steps. The layers are divided into minimal genome, genus-core genome, species-core genome, phylogroup-core genome and phylogroup-flexible genome (Fig 1) [...]

[...] another six genomes were also excluded because of their poor sequencing quality. Once the remaining 6,220 genomes were confirmed, the E. coli core genome was established at 1,027 genes. A phylogenetic tree was constructed with these genes and was used as the reference phylogeny. This tree confirmed most of the previously known E. coli phylogroups, but we were unable to unequivocally separate phylogroup C from B1. In contrast, a new phylogroup was found, which we propose to designate phylogroup G, following the established nomenclature (Fig 2A). The estimates of evolutionary divergence over sequence pairs between phylogroups reinforced the identification of phylogroup G (Fig 2B). This phylogroup is a monophyletic clade with low diversity, located next to phylogroup B2. Two E. coli genomes (KTE146 and EPEC-503225) were located in an intermediate position between the node of Escherichia cryptic clade I and the origin of the E. coli diversification. These sequences could therefore be better candidates than cryptic clade I for the ancestral reconstruction of E. coli diversification, as the evolutionary distance between cryptic clade I and the E. coli origin is too large for that clade to be considered the most recent ancestor.

The E. coli core genome phylogeny also suggested three root lineages. They were [...]

In the first layer, corresponding to the so-called minimal genome, only 51 of the previously described genes [17,18] were found in all E. coli genomes. In the Escherichia genus-core genome (second layer), 189 genes were found; from this number we subtracted the 51 genes of the minimal genome, so that the corresponding phylogenetic tree was reconstructed from the 138 remaining genes of the genus-core genome. In the third layer, the E. coli species-core genome was reconstructed using 838 genes, after excluding the 189 from the Escherichia [...]

To reinforce this evolutionary scenario, we explored the gain and loss of ancient genes by reconstructing the hypothetical ancient E. coli core genome from the phylogroup-core genomes, the fourth evolutionary layer in our model [19]. The gene content of [...]

When the threshold of ancient genes was 95%, no differences among phylogroups were detectable; however, the stepwise increase of this threshold towards 99% progressively revealed differences among them (Fig 4). Consistent with the previous analysis, [...]

[...] B1A sublineage. In the FGB2 lineage, only the bgl operon was acquired by the GB2 sublineage (Fig 6).

On the other hand, the B1A sublineage, from the EB1A lineage, lost genes encoding key proteins involved in the uptake of metals such as iron, manganese and molybdenum, including proteins from the siderophore ABC transport system and metal ABC transport (ECSMS35_RS09855 to ECSMS35_RS09880). Moreover, genes involved in vitamin B12 and hemin metabolism were also lost (hmuV, ECSMS35_RS191855 to ECSMS35_RS19215). These genes, which might influence tissue colonization and pathogenicity, were essentially preserved in phylogroup E, suggesting that B1A [...]

The symE gene, encoding a toxin belonging to a type I toxin-antitoxin system, has probably evolved by gene duplication [27].
Phylogroup B2 lost genes involved in toxin-antitoxin systems (tisAB/istR, hicAB or pemI/pemK). [...]

[...] matter of concern, as some genes could be lost [29]. However, the analysis of 32,000 [...] (Fig 3) [35].
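
The layer sizes reported above follow from successive set subtraction: removing the 51 minimal-genome genes from the 189 genus-core genes leaves 138, and removing the 189 genus-core genes from the 1,027 species-core genes leaves 838. The Python sketch below reproduces only this bookkeeping. The gene identifiers and the helper `layer_specific` are hypothetical placeholders, not names from the study, and the sketch assumes that each layer is fully nested in the next one.

```python
def layer_specific(layers):
    """For each layer, keep only the genes not already assigned to an older layer.

    layers: list of (layer_name, gene_ids), ordered from most ancient to most recent.
    """
    seen = set()
    specific = {}
    for name, genes in layers:
        genes = set(genes)
        specific[name] = genes - seen  # genes first appearing in this layer
        seen |= genes
    return specific


# Placeholder identifiers (hypothetical); only the set sizes come from the text.
ids = [f"g{i:04d}" for i in range(1027)]
layers = [
    ("minimal", ids[:51]),         # 51 minimal-genome genes
    ("genus_core", ids[:189]),     # 189 genus-core genes, minimal genes included
    ("species_core", ids[:1027]),  # 1,027 species-core genes, genus-core included
]

sizes = {name: len(genes) for name, genes in layer_specific(layers).items()}
print(sizes)  # {'minimal': 51, 'genus_core': 138, 'species_core': 838}
```

Keeping each layer disjoint from the older ones means that every tree is built from genes that are informative for one evolutionary step only.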

The results obtained using the LP-PF strategy were reinforced by the analysis of the genes gained and lost in the different phylogroups with respect to the hypothetical ancestral core genome (Fig 4). This evolutionary analysis strongly suggests that early steps in the diversification [...]

Within the FGB2 root lineage, phylogroup B2, which has been suggested to be the most host-adapted phylogroup, including to humans [43], seems to have lost some environmental-

To guarantee the correct classification of all downloaded genomes, the genes present in 100% of the 6,290 available Escherichia genomes were defined as the Escherichia genus-core genome (S1 Table). These genes were selected and aligned using SeaView4.4 [45]. [...] Table). Once the operative E. coli database was established, the next steps were aimed at defining the species-core genome, that is, the set of genes present in 100% of E. coli genomes [...]

[...] monophyletic groups with more than 10 sequences were considered as new phylogroups.
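
The core genome definitions used here (genes present in 100% of the genomes of a group) can be illustrated with a small presence/absence example. This is a minimal Python sketch, not the pipeline used in the study: the function `core_genes`, the genome and gene names, and the optional `threshold` argument are assumptions introduced for illustration.

```python
def core_genes(presence, threshold=1.0):
    """Return the genes found in at least `threshold` (fraction) of the genomes.

    presence: dict mapping a genome name to the set of gene ids it carries.
    threshold=1.0 corresponds to the strict "present in 100% of genomes" rule.
    """
    n_genomes = len(presence)
    counts = {}
    for genes in presence.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return {gene for gene, c in counts.items() if c >= threshold * n_genomes}


# Toy presence/absence data (hypothetical names).
presence = {
    "genome_1": {"geneA", "geneB", "geneC"},
    "genome_2": {"geneA", "geneB"},
    "genome_3": {"geneA", "geneB", "geneC"},
    "genome_4": {"geneA", "geneC", "geneD"},
}

print(sorted(core_genes(presence)))                  # strict core: ['geneA']
print(sorted(core_genes(presence, threshold=0.75)))  # relaxed: ['geneA', 'geneB', 'geneC']
```

With the strict rule, a single poorly assembled genome can remove a gene from the core set, which is one reason to exclude low-quality sequences before the core genome is defined.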

Orphan sequences (groups of fewer than 10 sequences) were excluded from subsequent analyses. As a necessary requirement for defining a new phylogroup, the estimated evolutionary distance between the hypothetical new group and the known phylogroups had to be higher than the distance among previously established phylogroups. The evolutionary distance between two phylogroups was obtained from the relative length of the branches. The mean intragroup evolutionary distance was estimated as the mean distance of each branch to the origin of the phylogroup; the subtree of each phylogroup was obtained from the tree, and the distances were extracted with the TreeStat program included in the BEAST software (tree.bio.ed.ac.uk/software/beast/).
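
As an illustration of the intragroup measure, the sketch below averages the distance from the origin of a phylogroup subtree to each of its tips. It is a toy Python example under stated assumptions, not a substitute for TreeStat: the tuple-based tree encoding, the function names and the branch lengths are hypothetical, and "distance of each branch to the origin" is read here as the root-to-tip path length.

```python
def tip_depths(node, depth=0.0):
    """Yield the path length from the subtree origin to every tip below `node`.

    node: (branch_length, children); a tip has an empty list of children.
    """
    length, children = node
    depth += length
    if not children:
        yield depth
    else:
        for child in children:
            yield from tip_depths(child, depth)


def mean_intragroup_distance(subtree):
    """Mean origin-to-tip distance; the branch leading to the origin is ignored."""
    _, children = subtree
    depths = [d for child in children for d in tip_depths(child)]
    return sum(depths) / len(depths)


# Hypothetical phylogroup subtree with three tips (arbitrary branch lengths).
subtree = (0.0, [
    (2.0, []),                         # tip at distance 2.0
    (1.0, [(1.0, []), (4.0, [])]),     # tips at distances 2.0 and 5.0
])

print(mean_intragroup_distance(subtree))  # 3.0
```

The same traversal can be applied to each phylogroup subtree in turn, so that intragroup values are comparable across phylogroups.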

To infer the staggered diversification processes in E. coli, the previously described combined strategy was implemented. The phylogenetic trees for the different layers of the LP approach (minimal genome, genus-core genome, species-core genome and phylogroup-core genome) were inferred by maximum likelihood (ML) with RAxML using GTR + I + Γ as [...]

[...] (or gained) in two phylogroups sharing a common ancestor, only a single event (loss or gain) was considered. If they did not share a common ancestor, we considered that two independent events had occurred. We could therefore calculate how many genes were lost, and how many times, in each branch and in the E. coli tree as a whole.
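
The counting rule above can be sketched as follows: a gene lost (or gained) in every phylogroup of a clade counts as one event on the branch leading to that clade, while phylogroups that do not form such a clade contribute separate events. This Python sketch is an interpretation, not the authors' code. The topology `TREE` is a simplified, hypothetical arrangement of the lineages named in the text (D, EB1A, FGB2), and `count_events` is an illustrative helper.

```python
# Simplified topology as nested tuples; leaves are phylogroup names (assumption).
TREE = ("D", ("E", ("B1", "A")), ("F", ("G", "B2")))


def leaves(node):
    """Return the set of phylogroups below a node."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node:
        found |= leaves(child)
    return found


def count_events(node, affected):
    """Count independent events explaining a loss (or gain) seen in `affected`.

    A clade whose phylogroups are all affected counts as a single event.
    """
    if leaves(node) <= affected:
        return 1
    if isinstance(node, str):
        return 0  # unaffected phylogroup
    return sum(count_events(child, affected) for child in node)


print(count_events(TREE, {"B1", "A"}))       # 1: shared B1A ancestor
print(count_events(TREE, {"A", "B2"}))       # 2: no shared ancestor within a clade
print(count_events(TREE, {"G", "B2", "A"}))  # 2: one event in GB2, one in A
```

Counting events per clade rather than per phylogroup avoids inflating the number of losses or gains on deep branches.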

Once the gained/lost genes were identified, they were classified based on their presumptive [...] (defense mechanism) and W (extracellular structure). Replication, including D (cell