Population structure analysis and laboratory monitoring of Shigella with a standardised core-genome multilocus sequence typing scheme

The laboratory surveillance of bacillary dysentery is based on a standardised Shigella typing scheme that classifies Shigella strains into four serogroups and more than 50 serotypes on the basis of biochemical tests and lipopolysaccharide O-antigen serotyping. Real-time genomic surveillance of Shigella infections has been implemented in several countries, but without the use of a standardised typing scheme. We studied over 4,000 reference strains and clinical isolates of Shigella, covering all serotypes, with both the current serotyping scheme and the standardised EnteroBase core-genome multilocus sequence typing scheme (cgMLST). The Shigella genomes were grouped into eight phylogenetically distinct clusters, within the E. coli species. The cgMLST hierarchical clustering (HC) analysis at different levels of resolution (HC2000 to HC400) recognised the natural groupings for Shigella. By contrast, the serotyping scheme was affected by horizontal gene transfer, leading to a conflation of genetically unrelated Shigella strains and a separation of genetically related strains. The use of this cgMLST scheme will enhance the laboratory surveillance of Shigella infections.


Characterisation of the O-antigen gene clusters 137
The Shigella O-antigen biosynthetic gene (rfb) cluster was analysed after extraction of the 138 region between the housekeeping genes galF (encoding UTP-glucose-1-phosphate 139 uridylyltransferase) and gnd (encoding 6-phosphogluconate dehydrogenase), which are known 140 to flank the rfb cluster 27 . Newly identified rfb clusters were annotated based on a previously 141 annotated closely matched E. coli cluster in the NCBI BLASTn nucleotide collection (nr/nt) 142 database (100% coverage and at least 99% identity) or with ORFfinder 143 (https://www.ncbi.nlm.nih.gov/orffinder/) when no matching cluster was found in the NCBI 144 BLAST database (https://blast.ncbi.nlm.nih.gov/Blast.cgi). The GenBank accession codes of 145 all the Shigella rfb clusters are listed in Supplementary Table 2. We also used three tools for in 146 silico serotyping: SeroPred, the serotype prediction tool implemented in EnteroBase, 147 ShigaTyper 14 , and ShigEiFinder 16 . Short-read and SPAdes assemblies were used for 148 ShigaTyper and ShigEiFinder, respectively. 149 based on the "cgMLST V1 + HierCC" scheme. We visualised the cgMLST data with 162 We also performed cgSNV analysis, to assess the phylogenetic relationships of 398 Shigella 164 (317 from the reference dataset and 81 from the reference+ dataset) and 95 E. coli (68 ECOR 165 and 27 EIEC) strains. An Escherichia fergusonii genome (RHB19-C05, GenBank accession 166 no. GCF_013892435.1) was used as an outgroup for the cgSNV analysis. The paired-end reads 167 and simulated paired-end reads were mapped onto the reference genome of E. coli K12-168 MG1655 (GenBank accession no. NC_000913.3) with Snippy version 4.6 169 (https://github.com/tseemann/snippy) using the default parameters but with a minimum read 170 coverage of 4 and a 75% read concordance at a locus for a variant to be reported. An alignment 171 of 92,688 SNVs was used for phylogenetic inference. A maximum-likelihood (ML) 172 phylogenetic tree was generated with RAxML-NG 29 version 1.0.1 using the general time-173 reversible (GTR) model of nucleotide substitution with a gamma model of the between-site 174 heterogeneity rate (GTR + G) and 100 bootstrap iterations. The best-scoring ML tree of the 20 rfb sequences were trimmed to ensure the same start and end points as for the E. coli rfb 180 sequences from DebRoy and coworkers 27 . A sequence alignment was generated with MEGA 181 X 31 version 10.2.1, using ClustalW with default settings. A ML phylogeny was created with

Gene analyses 187
The presence of the ipaH gene, a multicopy gene unique to Shigella and EIEC 32 , the presence 188 and structure of the mannitol (mtl) 33  According to cgMLST, all these genomes belonged to the same hierarchical cluster, HC2350_1 210 (Supplementary Data 1). As expected, all the Shigella and EIEC genomes contained the pathogenicity gene ipaH, whereas the ECOR genomes did not ( Supplementary Fig. 1). A 212 NINJA neighbour-joining (NJ) tree of core genomic allelic distances was generated with the 213 dataset for the 493 Shigella and E. coli genomes (Fig. 1A). The differential contribution of the 214 reference and reference+ datasets to Shigella population diversity is shown in Supplementary 215 MLST analysis of 46 diverse Shigella strains. The HC2000_1465 cluster, named S1, can be 228 divided into five HC1100 clusters (Fig. 2). Only the HC1100_36524 cluster (subcluster S1d) 229 contained strains from a single serotype, S. dysenteriae 7. The HC1100_45518 cluster (S1e) 230 contained only S. flexneri 6 strains, but most strains from this serotype were in another HC1100, 231 HC1100_1465 (S1b), along with S. dysenteriae 3 (Supplementary Results section "Aerogenic 232 strains of S. boydii 14 and S. dysenteriae 3") and various serotypes of S. boydii. The 233 HC1100_1466 cluster (S1c) contained S. dysenteriae 5 and various serotypes of S. boydii. 234 Finally, the HC1100_4194 cluster (S1a) included only S. dysenteriae strains, but from diverse level of resolution, four Shigella serotypes were grouped within specific HC400 clusters, 237 whereas the other serotypes were split between two to six HC400 clusters (Supplementary  238   Table 5). 239

240
The second cluster, HC2000_4118, comprised various serotypes of S. dysenteriae (2, prov. 241 E670/74, prov. 96-265, and prov. BEDP 02-5104) and S. boydii (5, 7, 9, 11, 15-17) (Fig. 3). for S. boydii 9, and HC1100_11421 (S2c) for S. boydii 11. This last serotype was also found in 250 the S1 cluster (S1b subcluster). At higher resolution, it was possible to assign some serotypes 251 to a particular HC400 cluster. This was the case for S. boydii 16 (HC400_11449) and S. boydii 252 17 (HC400_11452). However, at this level of resolution, other serotypes were split between 253 two to four clusters (Supplementary Table 5 flexneri 6, which grouped in S1 (Fig. 4). This cluster seems to correspond to the Cluster 3 258 reported by Pupo and coworkers 8 , except that S. boydii 12 rather than S. boydii prov. E1621-54 259 was reported in Cluster 3 in this previous study (Supplementary Results section "Genomic could be divided into seven distinct HC1100 clusters (Fig. 4A). One of these S3 subclusters, 262 HC1100_11429, contained exclusively S. boydii prov. E1621-54. The other six HC1100 263 clusters contained two or more S. flexneri serotypes per cluster. Connor and coworkers 2 264 previously subdivided >350 genomes of S. flexneri 1-5, X, Y into seven phylogenetic groups 265 (PGs), based on a Bayesian analysis of population structure. As 140 S. flexneri genomes from 266 our study were included in the analysis by Connor and coworkers 2 , we compared the clustering 267 by cgMLST HC1100 to that obtained by PG. HC1100_204, HC1100_543, HC1100_1468, 268 HC1100_11594, HC1100_1530 corresponded to PG2, PG4, PG5, PG6 and PG7, respectively 269 ( Fig. 4). HC1100_192 encompassed PG1 and PG3, and the use of a higher HC resolution made 270 it possible to link HC400_192 to PG3. However, PG1 did not correspond to a single HC400 271 cluster. Instead, it corresponded to two such clusters: HC400_237 and HC400_327. Results section "Updating the Shigella typing scheme"). We found that all these provisional 296 serotypes belonged to the three main Shigella clusters, S1 to S3 (Figs. 2-4), and that many of 297 those reported under different names were actually identical. We propose adding S. dysenteriae

Performance of available in silico serotype prediction tools 304
In silico serotyping tools have been developed by various groups, and can be used to maintain 305 links with the current Shigella serotyping system. We assessed the performances of the three 306 tools currently available: the EnteroBase "SeroPred" tool 18 , ShigaTyper 14 , and ShigEiFinder 16 307 with our 317 genomes from well-characterised reference strains. ShigEiFinder (Supplementary 308 We present here a broad overview of the population of Shigella. The hierarchical clustering of 319 cgMLST data and a cgSNV analysis showed that Shigella strains belong to eight 320 phylogenetically distinct clusters, within the species E. coli. Our results are consistent with 321 previous studies suggesting multiple origins of the Shigella phenotype 8,39 . However, the higher 322 resolution of cgMLST, and comprehensive sampling from thousands of phenotypically 323 characterised isolates and reference strains covering all serotypes, including provisional 324 serotypes and atypical strains, made it possible to complete, and in some cases amend, the 325 subclusters. This is particularly interesting for S3, which contains the S. flexneri 1-5, X, and Y 355 serotypes generated via horizontal gene transfer rather than by vertical descent. We therefore 356 recommend integrating the seven phylogenetic groups (PG1-PG7) described for S. flexneri into 357 routine genomic surveillance for S. flexneri. These PGs can be easily inferred from cgMLST 358 HC1000/HC400; it is even possible to obtain up to eight groups (after subdividing PG1 into 359 PG1a and PG1b). The cgMLST HC analysis also provides, in a single step, a wide range of 360 clustering levels, from HC0 (no allelic difference allowed) to HC2350 (maximum of 2,350 361 allelic differences), with a standard nomenclature. For the most frequent Shigella serotypes, 362 such as S. sonnei and S. flexneri 2a, higher resolution levels, such as HC5 and HC10, can also 363 help to identify a single-source outbreak or an epidemic strain, before confirmation by cgSNV 364 analysis. The use of cgMLST HC data also makes it possible to query EnteroBase, which 365 contains over 160,000 E. coli/Shigella genomes, to identify strains with similar HC types. This 366 can facilitate the investigation of unusual types of Shigella or outbreaks with an international 367 dimension. HC10 was recently used to investigate the origins of an outbreak of S. sonnei 368 infections in Belgium, and made it possible to link this outbreak to South America 40 . 369 370 However, the use of cgMLST HC data in surveillance should be paired with in silico serotyping, 371 to achieve backward compatibility with the current serotyping scheme. This is a very important 372 point for the maintenance of international surveillance with laboratories that cannot currently 373 afford genomic surveillance and to prevent disjunction with the seven decades of serotyping 374 data accumulated worldwide. For this purpose, we found that ShigEiFinder 16 had the best 375 performance of the three available tools. However, it requires optimisation for certain serotypes. 376 The complete set of rfb sequences provided by our study would be helpful for improving this 377 tool. 378

379
In conclusion, by studying >4,000 serotyped reference strains and routine isolates covering the 380 overall diversity of Shigella, we were able to demonstrate that cgMLST is a robust and portable 381 genomic method revealing natural groupings for this pathovar of E. coli. The cgMLST method 382 has strong added value in the framework of the laboratory monitoring of Shigella, as it prevents 383 genetically unrelated strains being conflated, and genetically related strains being separated. 384 However, we strongly recommend combining cgMLST with in silico serotyping to maintain 385 backward compatibility with the current Shigella serotyping scheme. the Shigella genomes are also labelled on the right with cgMLST HC2000 and HC1100 data. 572