Informing shigellosis prevention and control through pathogen genomics

Shigella spp. are the leading bacterial cause of severe childhood diarrhoea in low- and middle-income countries (LMIC), are increasingly antimicrobial resistant and have no licensed vaccine. We performed genomic analyses of 1246 systematically collected shigellae from seven LMIC to inform control and identify factors that could limit the effectiveness of current approaches. We found that S. sonnei contributes ≥20-fold more disease than other Shigella species relative to its genomic diversity and highlight existing diversity and adaptative capacity among S. flexneri that may generate vaccine escape variants in <6 months. Furthermore, we show convergent evolution of resistance against the current recommended antimicrobial among shigellae. This demonstrates the urgent need to integrate existing genomic diversity into vaccine and treatment plans for Shigella, and other pathogens.


37
Shigellosis is a diarrhoeal disease responsible for approximately 212,000 annual deaths and 38 accounting for 13.2% of all diarrhoeal deaths globally (1). The Global Enteric Multicenter Study (GEMS) 39 was a large case-control study conducted between 2007 and 2011, investigating the aetiology and burden 40 of moderate-to-severe diarrhoea (MSD) in children less than five years old in low-and middle-income 41 countries (LMICs) (2). GEMS revealed shigellosis as the leading bacterial cause of diarrhoeal illness in 42 children, who represent a major target group for vaccination (3). The aetiological agents are Shigella, a 43 Gram-negative genus comprised of S. flexneri, S. sonnei, S. boydii and S. dysenteriae, with the former two 44 serotypes causing the majority (90%) of attributable shigellosis in children in LMICs (3). Currently, the 45 disease is primarily managed through supportive care and antimicrobial therapy. However, there has been 46 an increase in antimicrobial resistance (AMR) among Shigella (4). Particularly concerning is the rise of 47 resistance against the fluoroquinolone antimicrobial ciprofloxacin, the current World Health Organisation 48 (WHO) recommended treatment, such that fluoroquinolone-resistant (FQR) Shigella is one of a dozen 49 pathogens for which WHO notes new antimicrobial therapies are urgently needed (5). The high disease 50 burden and increasing AMR of Shigella call for improvements in treatment and management options for 51 shigellosis, and significant momentum has built to rise to this challenge. 52 However, there is still no licenced vaccine available for Shigella and one of the main challenges in 53 its development is the considerable genomic and phenotypic diversity of the organisms (6). The distinct 54 lipopolysaccharide O-antigen structures of Shigella determine its serotype and is responsible for conferring 55 of shigellosis (i.e., isolates derived from case patients rather than from controls, as defined in GEMS). The 119 dominant genomic subtype was PG3, which comprised the majority (47%, 378/806) of total isolates, as 120 well as case (50%, 341/687) isolates, with some regional variation (Fig. 2). This resulted in an increased 121 odds of cases (OR = 2.3, 95% CI = 1.5-3.6, p = 0.0001) for PG3 compared with other genomic subtypes 122 (PGs and Sf6) (methods, table S3). The association of cases with the dominant serotype, S. flexneri serotype 123 2a (accounting for 29% (234/806) of total isolates and 31% (210/687) of case isolates) also resulted in an 124 increased odds of cases (OR = 1.9, 95% CI = 1.7-3.2, p = 0.0099) (table S3). But the higher prevalence and 125 larger effect size of PG3 relative to serotype 2a on case status offers compelling evidence that targeting 126 vaccination by phylogroup might offer broader coverage per licenced vaccine relative to, or in combination 127 with, a serotype-specific approach. 128 129

Diversity of S. flexneri relevant to serotype-targeted vaccines 130 131
The development of serotype-targeted vaccines is complicated by the diversity and distribution of 132 serotypes, which are heterogenous over time and place (8,19,26,27). Furthermore, genetic determinants 133 of O-antigen modification are often encoded on mobile genetic elements (28,29) that can move horizontally 134 among bacterial populations, causing the recognised, but poorly quantified phenomenon of serotype 135 switching (20,27,28), which may result in the rapid escape of infection induced immunity against 136 homologous serotypes. For our analyses of serotype switching, we focused on S. flexneri owing to high 137 disease burden and serotypic diversity. Phenotypic serotyping data were overlaid onto the phylogeny and 138 revealed that while generally there was a strong association of genotype (i.e. PG/Sf6) with serotype 139 (Fisher's exact test; p<2. 20E-16), multiple serotypes were observed for each genotype (Fig. 3). The greatest 140 serotype diversity was observed in PG3, comprised of seven distinct serotypes and two subserotypes. 141 Correlation of serotypic diversity (number of serotypes) and genomic diversity (maximum pairwise SNP 142 distance within genotype) revealed no evidence for an association, but a significant positive correlation of 143 serotypic diversity with the number of isolates in each genotype was found ( fig. S6), indicating that serotype 144 diversity scales with prevalence. 145 To qualitatively and quantitatively determine serotype switching across S. flexneri, we examined 146 the number of switches occurring within each genotype. A switching event was inferred when a serotype 147 emerged (either as a singleton or monophyletic clade) that was distinct from the majority (>65%) serotype 148 within a genotype ( Fig. 3 and fig. S7). PG6 was excluded from the analysis, as only three isolates from 149 GEMS belonged to this genotype and a dominant serotype could not be inferred. Quantitatively, this 150 revealed serotype switching was infrequent, with only 26 independent switches (3.3% of isolates) identified 151 across the five S. flexneri genotypes. Although the frequency of switching varied across the genotypes, 152 statistical support for an association of serotype switching with genotype fell short of significance (Fisher's 153 exact test; p = 0.09). Qualitatively, the majority (22/26) of switching resulted in a change of serotype, with 154 few (4/26) resulting in a change of subserotype. Examination of O-antigen modification genes revealed that 155 serotype switching was facilitated by changes in the composition of phage-encoded gtr and oac genes in 156 the genomes, as well as point mutations in these genes (table S4). Our data also revealed that few (4/26) 157 switching events resulted in more than two descendant isolates ( fig. S7). This indicates that while natural 158 immunity drives the fixation of relatively few serotype-switched variants in the short term, the potential 159 pool of variants that could be driven to fixation by vaccine-induced selective pressure following a serotype-160 targeted vaccination program is much larger. 161 In order to estimate the likely timeframe over which serotype switching events might be expected 162 to occur, we estimated the divergence time of the phylogenetic branch giving rise to each switching event. 163 To streamline the analysis, we focused on two subclades of PG3, the most prevalent phylogroup, in which 164 seven independent serotype switching events were detected ( fig. S8) have demonstrated protection in animal models (Table 1). First, we assessed the distribution of the 180 candidates among GEMS Shigella isolates which revealed that the proportional presence of antigens varied 181 across species and with genetic context. Specifically, genes encoded on the virulence plasmid (ipaB, ipaC, 182 ipaD, icsP) were present in >85% of genomes for each species with the exception of S. sonnei ( fig. S9). 183 The low proportion (≤5%) of virulence plasmid encoded genes detected among S. sonnei was caused by a 184 similarly low detection of the virulence plasmid among S. sonnei (6%), which likely arose due to loss during 185 sub-culture (32). In contrast, the chromosomally encoded ompA was present in >98% of all isolates, while 186 the sigA gene (carried on the chromosomally integrated SHI-1 pathogenicity island (17)) was present in 187 99% of S. sonnei genomes, but only 63% of S. flexneri genomes. Notably, among S. flexneri genomes, the 188 sigA gene was exclusively found in PG3 and Sf6, and present in >96% of isolates in each genotype) ( fig.  189 S2), indicating an appropriate distribution for targeting the two genotypes. Second, we assessed the antigens 190 for amino acid variation and modelled the likely impact of detected variants, as antigen variation may also 191 lead to vaccine escape, as demonstrated for the P1 variant of 34). We determined the 192 distribution of pairwise amino acid (aa) sequence identities per antigen against S. flexneri vaccine strains 193 for each species (methods). Overall, sequence identities were >90% but varied with antigen ( fig. S9). For 194 example, OmpA was present in the highest proportion of genomes, but showed ~5% sequence divergence, 195 while SigA was present in fewer genomes, but exhibited little divergence (<0.5%) among species. The least 196 conserved sequence was IpaD, ranging from 3 to 7% divergence within species. 197 Not all antigenic variation will affect antibody binding, so we performed in silico analyses of the 198 detected variants to assess whether they may compromise the antigens as vaccine targets. Again, we focused 199 our analyses on S. flexneri owing to its high disease burden and the likely complication of serotype-based 200 vaccination strategies for this species. We detected 121 variants across the six antigens, the majority (79%) 201 of which correlated with genotype (i.e. belonging to either PGs 1-5 or Sf6, fig. S11). We then determined 202 if amino acid variants were located in immunogenic regions (i.e. epitope/peptide fragment) (fig. S10) and 203 assessed their potential destabilization of protein structure through in silico protein modelling. For IpaB, 204 IpaC and IpaD, the epitopes have been empirically determined (35, 36). The sequence and location of 205 peptide fragments of SigA, IcsP and OmpA used in vaccine development are available (37, 38). Variants 206 located within the immunogenic regions were identified for all antigens relative to PG3 reference sequences 207 (methods, Fig. 4). Only 4 of 121 variants were predicted to be highly destabilising to protein structure, and 208 these occurred in: OmpA (residue 89) at a periplasmic turn, SigA (residues 1233 and 1271) in adjacent 209 extracellular turns in the translocator domain (fig. S12), and in IpaD (residue 247) within a beta-turn-beta 210 motif flanking the intramolecular coiled-coil (Fig. 4). While it remains possible that these mutations could 211 affect antigenicity through the disruption of folding or global stability, it is less likely than if they occurred 212 in immunogenic regions. These results thus indicate that it is less likely that existing natural variation will 213 compromise antigen-based vaccine candidates for Shigella compared with serotype-based vaccines. 214 However, our approach is limited and the knowledge base incomplete. For example, there was no suitable 215 template available for IpaC, and some epitopes were predicted to be in membrane regions which should be 216 inaccessible to antibodies, indicating the need for more accurate publicly available protein structures to be 217 developed for many of the vaccine antigen candidates. Overall, a high frequency of AMR genes conferring resistance against aminoglycoside, 236 tetracycline, trimethoprim, and sulphonamide antimicrobials was observed, while resistance against other 237 antimicrobial classes varied with region and species (Fig. 5B). The extended spectrum beta-lactamase gene 238 blaCTX-M-15 was detected in a small (9/1246) percentage of isolates, and genes conferring resistance to 239 macrolides and lincosamides were also infrequent (fig. S13), indicating that the recommended second-line 240 treatments likely remain effective antimicrobials (43). 241 However, higher rates of resistance were found against the first-line treatment. FQR in Shigella can 242 be conferred through the acquisition of FQR-genes or, more typically, by point mutations in the 243 chromosomal Quinolone Resistance Determining Region (QRDR) within the DNA gyrase (gryA) and the 244 topoisomerase IV (parC) genes. Single and double QRDR mutations are known to confer reduced 245 susceptibility to ciprofloxacin and are evolutionary intermediates on the path to resistance, conferred by 246 triple mutations in this region (41, 44). Overall, FQR-genes were uncommon in S. flexneri (4%, 33/806), S. 247 sonnei (1%, 3/305) and S. dysenteriae (7%, 4/60), but were present in 32% (24/75) of S. boydii. QRDR 248 mutations were identified in all species (fig. S13), but were more common among S. sonnei (65%, 199/305) 249 and S. flexneri (54%, 435/806) than compared with S. boydii (15%, 11/75) and S. dysenteriae (30%, 18/60). 250 Among these, triple QRDR mutations were identified in 13% (106/806) of S. flexneri and 14% (44/305) of 251 S. sonnei. Analysis of the QRDR mutants across the phylogenies indicate marked convergent evolution 252 toward resistance across the genus. Specifically, all triple QRDR mutant S. sonnei belonged to one 253 monophyletic subtype (previously described as globally emerging from Southeast Asia (45)), while three 254 distinct triple QRDR mutational profiles were found across three polyphyletic S. flexneri genotypes (Fig.  255 5C). Thus, the polyphyletic distribution of single, double, and triple QRDR mutants indicates continued 256 convergent evolution of lineages with reduced susceptibility or resistant to FQR. 257 We then stratified the dataset by geographic region which revealed that FQR were largely 258 associated with isolates from Asia where fluoroquinolones are more frequently used compared to African 259 sites ( Fig. 5B) (46), which is consistent with trends observed in atypical enteropathogenic Escherichia coli 260 isolated from GEMS (46). Our analyses thus suggest that for the period of GEMS trial (2007 -2011), 17% 261 (150/881) of Shigella isolates from Asia were resistant and 58% (508/881) had reduced susceptibility to the 262 WHO recommended antimicrobial. The high level of reduced susceptibility together with marked 263 convergent evolution toward resistance suggests that management of shigellosis with fluroquinolones at 264 these sites may soon be ineffective and regional antimicrobial treatment guidelines may require updating. 265 These results indicate the value of AMR and microbiological surveillance in LMICs and the control and 266 management of shigellosis will be improved by initiatives such as the Africa Pathogen Genomics Initiative 267 (47) and the WHO Global Antimicrobial Resistance Surveillance System (48). 268

270
Pathogen genomics is a powerful tool that has a wide range of applications to help combat 271 infectious diseases. Here, we have applied this tool to an unparalleled systematically collected Shigella 272 dataset to characterise the relevant population diversity of this pathogen across LMICs in a pre-vaccine era. 273 Our results revealed that current antimicrobial treatment guidelines for shigellosis should be updated, and 274 that improved surveillance will be essential to guide antimicrobial stewardship. This study has also 275 highlighted the urgent need to continue the development of Shigella vaccines for children in endemic areas.  Sequencing reads for isolates sequenced using the LITE pipeline and re-sequenced at CGR were combined 318 to increase overall sample depth of coverage. Sequence variants were identified against reference using 319 SAMtools v1.9-47 mpileup and bcftools v1.9-80 (53). Low quality SNPs were filtered if mapping quality 320 <60, Phred-scaled quality score <30 and read depth <4. 321 322

Phylogenetic reconstruction and inference of genomic diversity 323 324
Filtered SNP variants were used to generate a reference-based pseudogenome for each sample, 325 where regions with depth of coverage >4x were masked in the pseudogenome. Additionally, regions 326 containing phage (identified using PHASTER (55)) and insertion sequences were identified from the 327 reference genomes, and co-ordinates were used to mask these sites on the pseudogenomes using BEDTools 328 v2.28.0 maskfasta (56). For each species, chromosome sequences from the masked pseudogenomes were 329 extracted and concatenated. Gubbins v2.3.4 (57) was used to remove regions of recombination and invariant 330 sites from the concatenated pseudogenomes. This generated a chromosomal SNP alignment length of 331 78,251 bp for S. flexneri (n=806), 5,081 bp for S. sonnei (n=305), 98,842 bp for S. boydii (n=75) and 45,031 332 bp for S. dysenteriae (n=60). Maximum-likelihood phylogenetic reconstruction was performed 333 independently for each species and inferred with IQ-TREE v2.0-rc2 (58)  To measure the extent of shigella genomic diversity among GEMS population, pairwise SNP 343 distance was determined from the alignment of core genome SNPs identified outside regions of 344 recombination using snp-dists v0.7.0 (https://github.com/tseemann/snp-dists). For each species, the 345 genomic diversity, measured by SNPs per kbp, was determined by dividing the core genome SNP alignment The pangenome of each species was defined using Roary v3.12.0 (68) without splitting paralogues. 388 The pangenome accumulation curves were generated separately for each species using the specaccum 389 function from Vegan v2.5-7 (https://github.com/vegandevs/vegan/), with 100 permutations and random Medicine (Baltimore, Maryland), serotyping was performed as previously describe (19). In silico serotyping 397 of S. flexneri genomes was performed using ShigaTyper v1.0.6 (69) which detects the presence of serotype-398 determining genetic elements from sequencing reads to predict serotype. ShigaTyper predictions were 84% 399 concordant to the serotype data provided. SRST2 v2 (70)  To determine the presence of antigen vaccine candidates among GEMS Shigella isolates, genes of 405 the antigen vaccine candidates was screened against draft genome assemblies using screen_assembly (17)  406 with a threshold of ≥80% identity and ≥70% coverage to the reference sequence. Reference sequences for 407 ipaB, ipaC, ipaD and icsP were derived from S. flexneri 5a strain M90T (accession GCA_004799585) and In order to assess the effect of point mutations on protein stability and vaccine escape, six antigen 417 candidates from S. flexneri PG3 were modelled: OmpA, SigA, IcsP, IpaB, IpaC and IpaD (Table 1). PG3 418 was selected as it is the most prevalent phylogroup and is therefore the target of current vaccine 419 development. To model the antigen targets, we first searched for a suitable template using HHPred (73,74). 420 Five of the six proteins (OmpA, SigA, IcsP, IpaB and IpaD) had suitable homologues available. To improve 421 the performance of the comparative modelling, the signal peptides for OmpA, SigA and IcsP were removed 422 and OmpA, SigA and IpaB were modelled in two parts to make use of optimal templates. RosettaCM (75) The table below the plots  492 demonstrates for each species the genomic diversity (as measure by total number of SNPs per kbp 493 [methods]), the contribution to GEMS shigellosis burden and the shigellosis burden relative to genomic 494 diversity. For S. sonnei, the genomic diversity and shigellosis burden relative to genomic diversity that was 495 calculated excluding the two outliers are shown in bracket. 496

Fig. 2. The diversity of S. flexneri genomic subtypes across seven GEMS study sites. An unrooted ML 498
phylogenetic tree of S. flexneri genomes identified six distinct genomic subtypes, each highlighted in a 499 different colour according to the inlaid key displayed above the tree. The bar plot below the tree 500 demonstrates the relative frequencies of the subtypes at each study site. 501 Frequencies of AMR genotypic profiles among Shigella spp. Each point in the scatterplot represents a 516 unique AMR genotype profile: the proportion of isolates with a particular profile is displayed along the y-517 axis. Profiles identified in only a single isolate are not displayed. MDR genotypic profile conferring 518 resistance or reduced suppressibility to three or more drug classes are highlighted in pink, and normal AMR 519 genotype profile conferring resistance or reduced suppressibility in fewer than three drug classes are in 520 grey. Numbers displayed above the plot represents the number of AMR genotype profiles plotted for each 521 species. (B) Detection of known AMR genetic determinants associated with drug class grouped by country. 522 Each cell in the heatmap represents the percentage of isolates from a region containing genetic determinants 523 associated with resistance to a drug class. Genetic determinant conferring reduced susceptibility to 524 quinolone is indicated with an asterisk. (C) The genetic convergent evolution of ciprofloxacin resistance in 525 S. flexneri and S. sonnei. The presence of multiple monophyletic clades of QRDR mutations (single, double, 526 or triple according to the inlaid key) conferring reduced susceptibility or resistance to ciprofloxacin is 527 shown in the outer ring. B and C for S. boydii and S. dysenteriae are shown elsewhere (fig. S15). 528  the number of unique protein coding genes in the pangenome as a new genome is randomly added, with the 534 number of genomes plotted along the x-axis. Random permutation of the data were subsampled 100 times, 535 in which genomes are subsampled without replacement at each iteration. The y-axis shows the minimum 536 and maximum range of unique gene count after each iteration in (A) and the mean value in (B). 537

Fig. S2. 539
Phylogeny of S. flexneri population from GEMS. ML phylogenetic tree constructed using core genome 540 SNPs from alignments of 817 S. flexneri genomes from GEMS and publicly available genomes. Tree was 541 rooted using E. coli genome. The outer concentric rings illustrate different genotypic and epidemiological 542 data according to the numbered inlaid keys displayed next to the tree. Scale bars represents the number of 543 SNPs. 544

Fig. S3. 546
Phylogeny of S. sonnei population from GEMS. Midpoint rooted ML phylogenetic tree constructed using 547 core genome SNPs from alignments of 308 S. sonnei genomes from GEMS and publicly available genomes. diversity of the subtype, as measured by pairwise core SNP distance and plotted along the x-axis. Linear 565 regression analysis was performed to assess the association between serotype diversity and the different 566 properties of subtypes. The regression coefficient of determination (R 2 ) and p-value are displayed on the 567 top left of each plot. 568

Fig. S7. 570
Serotype switching events across S. flexneri genomic subtypes. ML phylogenetic tree of each subtype was 571 generated based on core genome SNPs. Serotypes determined through biochemical serotyping are displayed 572 on the right-hand side of each tree, and coloured according to the inlaid key. The 26 inferred serotype 573 switching events occurring along the phylogenetic branches are labelled accordingly. Numbers inside each 574 backets represents switch IDs, with further details provided in table S3. Where the dominant serotype 575 cannot be determined, a question mark is displayed, indicating switch from unknown ancestral type. 576 Serotype switching events resulting in more than two descendant isolates are highlighted in red. 577

Fig. S8. 579
Estimation of time frame for serotype switching among S. flexneri PG3 isolates. ML phylogenetic tree of 580 S. flexneri PG3 (n=384) generated using core genome SNPs is displayed on the right. Isolate serotype is 581 displayed on the outer ring, coloured according to the inlaid key displayed next to the tree. Two subclades 582 with branches highlighted in red were selected for BEAST analysis. Maximum clade credibility trees based 583 on two subclades within PG3 are displayed on the left. Independent switching events occurring along the 584 various phylogenetic branches are highlighted in black, labelled and annotated. BEAST estimated time 585 frame of divergence along the branches of the seven isolates that have undergone serotype switching are 586 shown in table S5. 587

Fig. S9. 589
The distribution of vaccine antigen candidate and protein sequence identity among Shigella spp. (A) 590 Lefthand y-axis refers to the grouped bar plot displaying presence of vaccine candidate genes identified 591 among Shigella isolates from GEMS. Bars are grouped by genes and coloured according to species. 592 Righthand y-axis (blue) refers to the boxplot displaying the interquartile range, median (red) and 593 minimum/maximum pairwise percentage identity of the amino acid sequences of antigen vaccine 594 candidates among GEMS, compared against the reference sequences. Presence of genes were identified 595 using BLASTn search against draft genome assemblies and amino acid sequence percentage identity were 596 inferred using BLASTp. (B) Mapping coverage of Shigella spp. virulence plasmid. Low percentage of 597 virulence plasmid were detected among S. sonnei isolates, likely contributed by the fact that S. sonnei 598 virulence plasmid is comparatively unstable and often lost during subculturing. 599 Temporal phylogenetic signal for S. flexneri. Correlation between isolate sampling time in months (x-axis) 649 and phylogenetic root-to-tip divergence (y-axis), as estimated by TempEst based on ML phylogeny of each 650 subclade. The two datasets correspond to S. flexneri 2a isolates belonging to node A (left) and S. flexneri 651 2b isolates belonging to node B (right) from PG3 in fig. S8. The linear regression line is coloured in red, 652 with the coefficient of determination (R 2 ) and p-value displayed for each plot. 653