Reclassification of Shigella species as later heterotypic synonyms of Escherichia coli in the Genome Taxonomy Database

Members of the genus Shigella have high genomic similarity to Escherichia coli and are often considered to be atypical members of this species. In an attempt to retain Shigella species as recognizable entities, they were reclassified as Escherichia species in the Genome Taxonomy Database (GTDB) using an operational average nucleotide identity (ANI)-based approach nucleated around type strains. This resulted in nearly 80% of E. coli genomes being reclassified to new species including the common laboratory strain E. coli K-12 (to ‘E. flexneri’) because it is more closely related to the type strain of Shigella flexneri than it is to the type strain of E. coli. Here we resolve this conundrum by treating Shigella species as later heterotypic synonyms of E. coli, present evidence supporting this reclassification, and show that assigning E. coli/Shigella strains to a single species is congruent with the GTDB-adopted genomic species definition.


Introduction
The genus Escherichia currently comprises the six species E. coli (Castellani and Chalmers, 1919), E. hermannii (Brenner et al., 1982), E. fergusonii (Farmer et al., 1985), E. albertii (Huys et al., 2003), E. marmotae (Liu et al., 2015), and the recently described E. ruysiae (van der Putten et al., 2021). However, a recent phylogenetic study places E. hermannii within the effectively published but currently unvalidated genus 'Atlantibacter' (Hata et al., 2016), and both the NCBI (Schoch et al., 2020) and GTDB (Parks et al., 2018) taxonomies follow this reclassification. There are also a number of strains that have been recognized as distinct Escherichia lineages that have no assigned species name and are referred to as 'cryptic clades' (Walk et al., 2009). However, the most enduring anomaly with this genus is paraphyly between E. coli and the four currently recognized Shigella species Pettengill et al., 2016;Hu et al., 2019), which has been attributed to independent acquisition of the pINV virulence plasmid and subsequent niche adaptation (Pupo et al., 2000;Lan and Reeves 2002;Yang et al., 2005).
Numerous studies have shown that E. coli and Shigella species have high genomic similarity (Brenner et al., 1973;Richter and Rosselló-Móra 2009;Jain et al., 2018;Ciufo et al., 2018). Consequently, it is widely recognized that E. coli and Shigella species could be considered a single species (Lan and Reeves, 2002;Richter and Rosselló-Móra, 2009;van den Beld and Reubsaet, 2012); however, this reclassification has not been adopted due to the medical significance of Shigella as pathogens causing shigellosis, a form of bacillary dysentery (Ewing 1949;Sahl et al., 2015). At the same time, enteroinvasive E. coli (EIEC) also causes dysentery making the clinical distinction between E. coli and Shigella ambiguous as few biochemical properties distinguish Shigella from EIEC (Johnson 2000;Yang et al., 2005; van den Beld 2012; Hendriks et al., 2020). Studies have also shown that E. coli and Shigella species are paraphyletic and that S. boydii, S. dysenteriae, and S. flexneri, primarily defined based on serotype, are not monophyletic entities (Sahl et al., 2015;Pettengill et al., 2016;Pupo et al., 2000;Hu et al., 2019) further justifying the reclassification of Shigella species as E. coli.
The Genome Taxonomy Database (GTDB) has adopted a genomic species definition based on average nucleotide identity (ANI) and alignment fraction (AF) nucleated around genomes from type strains for delineating species (Parks et al., 2020) and an estimation of time of divergence for circumscribing higher taxonomic ranks (Parks et al., 2018). Application of these criteria resulted in Shigella species being reassigned to the genus Escherichia as 'E. flexneri' and 'E. dysenteriae' with S. sonnei and S. boydii being classified as synonyms of 'E. flexneri' due to their high genomic relatedness (>97%). Notably, nearly 80% of the genomes in GTDB R06-RS202 formerly classified as E. coli were reclassified as either 'E. flexneri' or 'E. dysenteriae' as a result of the adopted criteria for delineating species. This includes the common laboratory strain E. coli str. K-12 which was assigned to 'E. flexneri' due to its higher genomic similarity to the type strain of 'E. flexneri' than to the type strain of E. coli (U5/41 = ATCC 11775;Meier-Kolthoff et al., 2014).
A practical consequence of these GTDB reassignments is that traditional properties of species such as 'E. dysenteriae' and 'E. flexneri' being comprised of human pathogenic strains are no longer applicable. As this is likely to result in confusion, we propose that Shigella spp. be treated as later heterotypic synonyms of E. coli in GTDB. Here we put forth the evidence supporting this reclassification in anticipation of introducing this change in the next release of GTDB.

Results and discussion
Reclassification of Escherichia coli and Shigella in GTDB GTDB assigns strains to species based on the ANI and AF to genomes assembled from the type strain of the species (henceforth referred to as type strain genomes) or to a selected representative genome of the species when a type strain genome is not available (Parks et al., 2020). This provides quantitative criteria for establishing species membership that is consistent with the majority of species assignments defined using a polyphasic approach. In GTDB release R06-RS202, 68.2% of genomes with an NCBI species assignment have the same species assignment in the GTDB. Genomes with incongruent GTDB and NCBI species assignments result from the transfer of species to other genera (10.4%), defining overclassified species as synonyms (1.4%), dividing species considered to be underclassified into multiple species (5.9%), and otherwise reclassifying genomes consider misclassified according to the GTDB (14.1%; Supp. Table 1). Notably, >50% of incongruent species assignments arise from just nine NCBI species with E. coli and Shigella spp. being the most conspicuous accounting for 32.6% of all incongruent assignments between GTDB and NCBI (Table 1; Supp. Table 2). This is in contrast to other Escherichia species where all or nearly all genomes assigned to these species within GTDB have the same assignment in the NCBI Taxonomy (Table 1).
Strains have traditionally been assigned to E. coli and Shigella spp. based on biochemistry and serotyping (Pettengill et al., 2016;Chattaway et al., 2017) which is in contrast to the genomic species definition (Doolittle and Papke 2006;Konstantinidis et al., 2006;Richter and Rosselló-Móra, 2009) adopted by GTDB. Specifically, GTDB delineates species using a fixed AF threshold of 0.65 and an ANI threshold that is allowed to vary between 95% and 97% in order to preserve the majority of existing species classifications (Parks et al., 2020). Despite using a flexible ANI threshold, 'E. sonnei' and 'E. boydii' are considered later heterotypic synonyms of 'E. flexneri' in GTDB as the ANI between these type strain genomes is 97.9% and 97.8%, respectively ( Fig. 1). Furthermore, 79.2% of the 20,973 genomes classified as E. coli at NCBI are reassigned within GTDB with the majority being reclassified as 'E. flexneri' (12,400 genomes) or 'E. dysenteriae' (2,044 genomes; Table 1). Similarly, the majority (68.3%) of genomes classified as S. dysenteriae at NCBI are classified as 'E. flexneri' in GTDB (Table 1). GTDB species assignments reflect the closest type strain genome as determined using ANI and this large number of reassignments illustrate the degree to which the traditional classification of E. coli and Shigella spp. conflict with the adopted genomic species definition (Fig. 1). A few clear misclassifications within the NCBI taxonomy are also evident such as the reassignment of a S. sonnei and three S. boydii genomes to species in the genus Serratia in GTDB (Supp. Table 3). , and the representative genomes from the GTDB species E. coli_C and E. coli_D. The distance between genomes indicates their ANI though the diagram is conceptual as it is not possible to accurately represent the pairwise distance between all genomes. The larger circles depict the ANI criteria used by GTDB for assigning genomes to each species (Parks et al., 2020). This criterion is typically 95% in the GTDB, but the high similarity of these species results in this criterion being increased. The single red square represents a genome traditionally classified as E. coli which is reclassified as S. flexneri in GTDB since it resides within the ANI threshold of both species, but is more similar to the S. flexneri type strain genome. Reclassification of Shigella spp. as later heterotypic synonyms of E. coli results in a species with a 95% ANI criterion commensurate with the majority of GTDB species. E. coli_C and E. coli_D are also reclassified as E. coli as they have an ANI >95% to the type strain of E. coli.
Genomes assembled from type strains of a species are highly similar Here we verify that type strain genomes from Escherichia and Shigella species are highly similar to each other giving confidence that the assemblies are of high quality and unlikely to be misclassified ( Table 2). All species except E. albertii, E. ruysiae, and S. flexneri had two or more type strain genomes available, and intraspecific type strain genomes had an ANI >99.8% with an AF generally >0.97. Lower AFs were observed in a few cases, but this can be attributed to either assembly quality (e.g., E. marmotae assembly GCF_000807695.1 consists of 218 contigs compared to the single chromosome and two plasmids of the E. marmotae assembly GCF_002900365.1) or naturally occurring differences in the phage or genomic islands within individual genomes.

Phylogeny and genomic similarity of Escherichia and Shigella type strains
A maximum-likelihood tree was used to establish the phylogenetic relationships between type strain genomes for Escherichia and Shigella species along with the representative genomes of the seven GTDB R06-RS202 placeholder species within Escherichia ( Fig. 2A; Supp. Table 4). As expected, E. coli and Shigella spp. form a monophyletic lineage which also includes the GTDB species E. coli_D. Genomic species are generally defined as strains with an ANI greater than 94% to 96% and an AF greater than 0.  Table 5). Furthermore, all genomes classified as E. coli, 'E. flexneri', 'E. dysenteriae', E. coli_C, or E. coli_D in GTDB have an ANI ≥95% and AF ≥0.65 to the E. coli type strain (ATCC 11775 T ) genome when using FastANI (Figs. 2C and 2D), and thus satisfy the GTDB criteria for classifying these strains as a single species (Parks et al., 2020). This result is robust to the method used to establish genomic similarity as indicated by BLAST-based ANI (Rodriguez-R and Konstantinidis, 2016) providing highly correlated results with all E. coli, 'E. flexneri', 'E. dysenteriae', E. coli_C, and E. coli_D genomes still reported as having ≥95% ANI to E. coli ATCC 11775 T (Supp. Fig. 1).

Figure 2.
Phylogenetic and genomic similarity of Escherichia and Shigella species. (A) Phylogenetic tree inferred from the concatenated multiple sequence alignment of 2,329 core genes using IQ-Tree under the GTR+F+R5 model. E. albertii NBRC 107761 T was used to root the tree (see Methods). SH-aLRT and UFBoot support values were 100% for all nodes. (B) UPGMA tree indicating the genomic distance, i.e. 100-ANI, between Escherichia and Shigella type strain genomes as determined with FastANI. (C) ANI between the E. coli ATCC 11775 T genome and all genomes classified as Escherichia in GTDB R06-RS202. (D) AF between the E. coli ATCC 11775 T genome and all genomes classified as Escherichia in GTDB R06-RS202. Box-and-whisker plots show the lower and upper quartiles as a box, the median value as a line within the box, and the minimum and maximum values as whiskers.

Escherichia coli and Shigella spp. form a single, distinct ANI species cluster
It is informative to consider the clustering which occurs when considering all pairwise ANI values between Escherichia and Shigella genomes and not just the ANI to the type strain/representative genomes of species. Here we explore this using Mash which provides a computationally efficient approximation to ANI allowing it to be applied to large numbers of genomes. Specifically, we use Mash to dereplicate all 23,538 high-quality Escherichia genomes in GTDB R06-R202 to 1,148 genomes which have an ANI <99% and can be seen as operationally defined strains (see Methods). Visualization of these 1,148 operational strains using a 95% ANI clustering criterion indicates that E. coli, 'E. flexneri', 'E. dysenteriae', E. coli_C, and E. coli_D form a single cluster in support of reclassifying these 5 GTDB species as a single species (Fig. 3). E. sp005843885 and E. sp004211955 also form a cluster which is unsurprising as the ANI between the GTDB representative genomes of these species is 94.95%. Figure 3. Clustering of 1,148 GTDB R06-RS202 Escherichia genomes dereplicated at 99% ANI. Each node represents a genome and two genomes are connected by an edge if their Mash distance is ≤0.05, which is approximately equivalent to an ANI ≥95%. Nodes are colored by GTDB species assignment. The graph layout was determined using the force directed method implemented in Cytoscape (Shannon et al., 2003).

Conclusions
In previous releases of GTDB, we attempted to retain at least some Shigella species as distinct entities within the genus Escherichia, i.e. 'E. flexneri' and 'E. dysenteriae', using our type strain nucleated ANIbased species delineation approach (Parks et al., 2020), albeit with reduced ANI radii (Fig. 1A). This was a compromise to accommodate the well-known issue of Shigella species being genomically closely related to E. coli (Brenner et al., 1973;Richter and Rosselló-Móra 2009;Jain et al., 2018;Ciufo et al., 2018), while still retaining some of the clinically important classification information associated with Shigella. An unforeseen consequence of this compromise was the reclassification of almost 60% of E. coli genomes including E. coli K-12 to 'E. flexneri' ( Table 1), which is a source of potential confusion. Indeed, K-12 is often mistakenly thought to be the type strain of E. coli, but in fact is genomically and phenotypically quite distinct from the actual type strain, E. coli U5/41 T (DSM 30083 T ), which has been sequenced only comparatively recently (Meier-Kolthoff et al., 2014). To remove this confusion, we intend to treat Shigella species as later heterotypic synonyms of E. coli (Fig. 1B) in the next GTDB release, consistent with previous genomic circumscriptions (Lan and Reeves, 2002;Richter and Rosselló-Móra, 2009;van den Beld and Reubsaet, 2012), which also returns strain K-12 to the species E. coli. The other major consequence of this change is that the GTDB taxonomy as it currently stands is not useful for clinical classification of Shigella isolates. In our opinion, reclassifying Shigella spp. as later heterotypic synonyms of E. coli will ultimately best serve the community as this adheres to the species definition adopted by the GTDB, makes E. coli commensurate with the majority of bacterial species, and most accurately reflects the phylogenetic relationship and genomic similarity of E. coli and Shigella strains.

Escherichia/Shigella genome dataset
The 23,686 genomes classified as Escherichia in GTDB R06-RS202 or Escherichia/Shigella according to the NCBI Taxonomy (Schoch et al. 2020;downloaded September 23, 2020) were considered in this study. Two notable exceptions are inclusion of the E. ruysiae type strain genome GCF_902498915.1 which was released after GTDB R06-RS202 and filtering of the genome GCF_009711095.1 which is classified as 'E. alba' at NCBI but recognized as belonging to the genus Intestinirhabdus (Xu et al., 2020). The E. ruysiae type strain genome (GCF_902498915.1) was compared to GTDB species representatives in Escherichia and found to be highly similar to the representative of Escherichia sp000208585 (GCF_000208585.1) with an ANI of 99.8% and AF of 0.97. Consequently, the 32 genomes in the GTDB R06-RS202 species cluster Escherichia sp000208585 are referred to as E. ruysiae in this study. A highquality set of 23,538 genomes defined as having a completeness ≥90% and contamination ≤5% as estimated using CheckM v1.1.3 (Parks et al., 2015) was used for analyses that might be negatively impacted by the presence of lower-quality genome assemblies.
Inferring core gene tree Prokka v1.14.6 (Seemann, 2014) using the Escherichia specific database was used to annotate genomes and the core gene set determined using Roary v3.12.0 (Page et al., 2015). The core gene set for the 16 Escherichia/Shigella type strain or GTDB representative genomes was defined as any gene present in ≥14 of the genomes which resulted in a set of 2,329 core genes. These genes were aligned with MAFFT v7.394 (Katoh et al., 2013) which produced a 2,263,396 base multiple sequence alignment. The tree was inferred with IQ-Tree v1.6.12 (Nguyen et al., 2015) using ModelFinder (Kalyaanamoorthy et al., 2017) to establish GTR+F+R5 as the best-fit model and tree stability assessed with the SH-aLRT test and ultrafast bootstraps set to 1,000 replicates (Hoang et al., 2018). The tree was rooted on E. albertii based on a preliminary tree using Salmonella enterica LT T (GCF_000006945.2) as an outgroup which indicate this Escherchia species to be the most basal member of the genus. This rooting is also in agreement with E. albertii being the most basal species in the ANI-based UPGMA tree.

Calculating ANI and AF
The ANI and AF between genomes was calculated with FastANI v1.3 (Jain et al., 2018) with default parameters unless otherwise indicated. Since the ANI and AF values produced by FastANI are not symmetric, the maximum of the two reciprocal calculations was used in agreement with the approach adopted by GTDB (Parks et al., 2020). ANI values were also calculated with the ani.rb script in the Enveomics Collection (Rodriguez-R LM & Konstantinidis, 2016) using default parameters and BLASTn 2.9.0+ (Camacho et al., 2009) to identify orthologous fragments.
Mash v2.0 (Ondov et al., 2016) with a sketch size of 5,000 and k-mer size of 16 was used to estimate the ANI between all 554,013,906 pairwise combinations of the 23,538 high-quality GTDB R06-RS202 Escherichia genomes as the use of less computationally efficient methods was not practical. Genomes were dereplicated in a greedy manner consisting of three steps: i) sort genomes in descending order of estimated assembly quality, ii) select the highest-quality genome to form a new cluster, and iii) assign any unclustered genome with a Mash distance <0.01 (≈99% ANI) to this new cluster. These steps were repeated until all genomes were assigned to a cluster. Estimated assembly quality was defined as completeness -5*contamination as estimated using CheckM v1.1.3 (Parks et al., 2015).
Pairwise ANI values were visualized as a UPGMA tree calculated with DendroPy v4.5.1 (Sukumaran and Holder, 2010) and as a graph generated with force directed layout method implemented in Cytoscape (Shannon et al., 2003).