RT Journal Article
SR Electronic
T1 What can we learn from over 100,000 Escherichia coli genomes?
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 708131
DO 10.1101/708131
A1 Kaleb Abram
A1 Zulema Udaondo
A1 Carissa Bleker
A1 Visanu Wanchai
A1 Trudy M. Wassenaar
A1 Michael S. Robeson II
A1 Dave W. Ussery
YR 2020
UL http://biorxiv.org/content/early/2020/01/15/708131.abstract
AB The explosion of microbial genome sequences in public databases allows for large-scale population genomic studies of bacterial species, such as Escherichia coli. In this study, we examine and classify more than one hundred thousand E. coli and Shigella genomes. After removing outliers, a semi-automated Mash-based analysis of 10,667 assembled genomes reveals 14 distinct phylogroups. A representative genome or medoid identified for each phylogroup serves as a proxy to classify more than 95,000 unassembled genomes. This analysis shows that most sequenced E. coli genomes belong to 4 phylogroups (A, C, B1 and E2(O157)). Authenticity of the 14 phylogroups described is supported by pangenomic and phylogenetic analyses, which show differences in gene preservation between phylogroups. A phylogenetic tree constructed with 2,613 single copy core genes along with a matrix of phylogenetic profiles is used to confirm that the 14 phylogroups change at different rates of gene gain/loss/duplication. The methodology used in this work is able to identify previously uncharacterized phylogroups in E. coli species. Some of these new phylogroups harbor clonal strains that have undergone a process of genomic adaptation to the acquisition of new genomic elements related to virulence or antibiotic resistance. This is, to our knowledge, the largest E. coli genome dataset analyzed to date and provides valuable insights into the population structure of the species.
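
The abstract's idea of using a medoid as a proxy for each phylogroup can be illustrated with a small sketch. This is not the authors' pipeline: it assumes the usual definition of a medoid (the group member with the smallest total distance to the other members) and assumes that an unassembled genome is assigned to the group whose medoid is nearest. The genome names, group labels, distance values, and the helpers `find_medoid`, `classify`, and `toy_dist` are all hypothetical; in practice the pairwise distances would come from a tool such as Mash rather than a hand-written table.

```python
from typing import Callable, Dict, Iterable

Distance = Callable[[str, str], float]


def find_medoid(members: Iterable[str], dist: Distance) -> str:
    """Return the member with the smallest summed distance to all members."""
    members = list(members)
    if not members:
        raise ValueError("a group needs at least one member")
    return min(members, key=lambda m: sum(dist(m, other) for other in members))


def classify(query: str, medoids: Dict[str, str], dist: Distance) -> str:
    """Return the label of the group whose medoid is closest to the query."""
    return min(medoids, key=lambda group: dist(query, medoids[group]))


# Hypothetical pairwise distances (made-up numbers, NOT real Mash output).
TOY_DISTANCES = {
    frozenset(("g1", "g2")): 0.01,
    frozenset(("g1", "g3")): 0.03,
    frozenset(("g2", "g3")): 0.01,
    frozenset(("g4", "g5")): 0.02,
    frozenset(("g4", "g6")): 0.02,
    frozenset(("g5", "g6")): 0.04,
    # Distances from a hypothetical unassembled genome "q" to two genomes.
    frozenset(("q", "g2")): 0.05,
    frozenset(("q", "g4")): 0.02,
}


def toy_dist(a: str, b: str) -> float:
    """Symmetric lookup into the toy table; a genome is at distance 0 from itself."""
    if a == b:
        return 0.0
    return TOY_DISTANCES[frozenset((a, b))]


# Hypothetical groups standing in for phylogroups.
groups = {"X": ["g1", "g2", "g3"], "Y": ["g4", "g5", "g6"]}

# One representative genome per group.
medoids = {label: find_medoid(members, toy_dist) for label, members in groups.items()}

# The query is compared only against the medoids, not against every genome.
assigned = classify("q", medoids, toy_dist)
```

The design point is cost: once the medoids are fixed, classifying a new genome needs one distance computation per group instead of one per assembled genome, which is what makes a proxy attractive when there are tens of thousands of genomes to assign.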