Comparative genomic analysis of skin and soft tissue Streptococcus pyogenes isolates from low- and high-income settings

Streptococcus pyogenes is a leading cause of human morbidity and mortality, especially in resource-limited settings. The World Health Organisation has recently made a vaccine for S. pyogenes a global health priority to reduce the burden of post-infection rheumatic heart disease. For a vaccine to be active against all relevant strains in each region, molecular characterisation of circulating S. pyogenes isolates is needed. We performed extensive comparative whole genome analyses of S. pyogenes isolates from skin and soft tissue infections in The Gambia, West Africa, where there is a high burden of such infections. To act as a comparator to this low-income country (LIC) collection of isolates, we performed genome sequencing of isolates from skin infections in Sheffield, UK, as representative high-income country (HIC) isolates. LIC isolates from The Gambia were genetically more diverse (46 emm-types in 107 isolates) than HIC isolates from Sheffield (23 emm-types in 142 isolates), with only 7 overlapping emm-types, and even these had different genetic backgrounds. Characterisation of other molecular markers indicated some shared features, including a high prevalence of the skin infection-associated emm-pattern D and the variable fibronectin-collagen-T antigen (FCT) types FCT-3 and FCT-4. A previously unidentified FCT type (FCT-10) was identified in LIC isolates belonging to two different emm-types. A high proportion (79/107; 73.8%) of LIC isolates carried genes for tetracycline resistance, compared to 53/142 (37.3%) HIC isolates. There was also evidence of different circulating prophages, as very few prophage-associated DNases and lower numbers of superantigens were detected in LIC isolates. Our study provides much needed insight into the genetics of circulating isolates in a LIC (The Gambia), and how they differ from those circulating in HICs (Sheffield, UK). Common molecular features may act as bacterial drivers for specific infection types, regardless of the diverse genetic background.


Emm pattern and FCT regions
To determine the emm pattern in the genome of each isolate, in silico PCR (https://github.com/simonrharris/in_silico_pcr) was used to extract the sequence of the whole mga regulon (the beginning of mga to the end of scpA) from de novo assemblies, and the extracted sequence was then annotated with Prokka. To improve assemblies where the mga regulon was not within contiguous sequence, de novo assemblies were ordered against a completed reference genome of the same emm-type (where available) using ABACAS (27) and the in silico PCR was repeated. An emm pattern of I, II, III, IV, V or VI was assigned using BLAST to identify genes followed by visual determination of gene location within the regulon. For 22 LIC and 3 HIC isolates, the emm pattern could not be determined as contiguous sequence for the mga regulon could not be obtained (detailed in Supplementary Table 1).

Alleles of the emm-like genes enn and mrp were assigned by comparison to those identified by Frost et al. (28), requiring 100% nucleotide identity across the entire gene sequence. Where we could not obtain contiguous sequence for the mga regulon, enn and mrp alleles were determined by BLAST of each allele sequence against the entire de novo assembly. New alleles for enn and mrp were kindly assigned by Prof Pierre Smeesters and Dr Anne Botteaux. In some cases, breaks in the de novo assemblies occurred within the enn gene and therefore alleles could not be confirmed (detailed in Supplementary Table 1).

To determine the arrangement of the genes in the FCT region and the FCT type, in silico PCR was used to extract the FCT region, which was then annotated with Prokka. Assemblies in which amplicons were not obtained due to contig breaks in the FCT region were again ordered against a close reference of the same emm-type (where available). The ORFs within each extracted FCT region were searched with BLAST against the entire NCBI database and, in combination with the order of the genes, the FCT types were assigned based on previously assigned FCT types where possible (29). For some isolates, it was not possible to obtain contiguous sequence for the FCT region and so the FCT type was estimated based on manual inspection of the de novo assembly and identification of FCT-associated genes through BLAST.

Results

Genetic diversity of S. pyogenes LIC and HIC skin and soft tissue isolates
We performed whole genome sequencing on 115 of 127 S. pyogenes skin infection isolates collected in The Gambia (15). After quality control and filtering of reads and de novo assemblies, we obtained high quality genome sequence data for a total of 107 Gambian (LIC) S. pyogenes isolates for further analyses (Supplementary Table 1). Within the genomes of these 107 isolates, we determined 46 different emm-types, with no obvious dominant emm-type; the most common was emm80 (6/107, ~6%), closely followed by emm85, emm229 and emm/stG1750 (5/107 isolates, ~5% each). Although emm/stG1750 has been previously identified in group G streptococci, in this case these isolates were S. pyogenes with the group A carbohydrate. The multi-locus sequence types (STs) for all 107 isolates were determined and revealed 57 different types, of which 25 were assigned for the first time. Although multiple STs could be found within single emm-types, no STs were shared by multiple emm-types.
An emm-pattern could be assigned to the majority of isolates using the previously determined classifications. The exceptions were two emm147 isolates, one emm162 isolate, one emm247 isolate and five emm/stG1750 isolates, for which an emm pattern had not been previously described. For the 98 isolates with known emm-patterns, 48% (n=47) were D, 40% (n=39) were E and 12% (n=12) were A-C (Figure 1 and Figure 2). In addition to emm-pattern, an emm-cluster type could also be assigned to these 98 isolates. The emm-cluster type is based on the sequence of the full M protein and is broadly associated with emm-pattern (30). The majority of isolates (56/98, ~57%) were assigned to one of the six E emm-cluster types: E1 (n=4), E2 (n=2), E3 (n=14), E4 (n=16), E5 (n=2) and E6 (n=18), representing 25 emm-types. All E1-E4 and all but four E6 emm-types were positive for the serum opacity factor (sof) gene, commonly associated with E emm-clusters (11); however, E5 emm-types were sof negative. The remaining isolates were A-C4 (n=6), D1 (n=1), D2 (n=1), D4 (n=17) or singletons (n=17).

Phylogenetic analysis of the core genome of all 107 LIC isolates showed clustering by emm-type (Figure 1). The exceptions to this were emm25, emm65, emm85, emm89 and emm209, within each of which two distinct lineages were identified. Pairwise distance analysis identified a median of 22 SNPs when comparing isolates with the same emm type (range 0 to 11,142 SNPs), and a median of 9,816 SNPs when comparing isolates with different emm types (range 1,423 to 12,428) (Supplementary Figure 1A).

After read quality filtering and assembly assessment, we obtained draft genomes from 142 S. pyogenes skin infection isolates collected in Sheffield, UK.
Within these 142 HIC isolates there were 23 different emm-types but ~59% of the isolates were represented by just 5 emm-types: emm108 (30/142, 21%), emm89 (19/142, 13%), emm12 (15/142, 11%), emm1 (10/142, 7%) and emm4 (9/142, 6%). An emm-pattern could be assigned to all 142 isolates: 36% (n=51) were D, 35% (n=50) were E and 29% (n=41) were A-C (Figure 2 and Figure 3). An emm-cluster type was also assigned to all 142 isolates and the most common was D4 (n=50, 35%). No other D cluster-types were found. The most common E cluster type was E4 (n=26), followed by E6 (n=14), E1 (n=9) and E3 (n=2) (Figure 2). The A-C clusters were represented by emm1 (A-C3, n=10), emm12 (A-C4, n=15) and emm3 (A-C5, n=5), emm-types that were absent from the LIC population (Figure 2). Only emm5 (n=4) and emm6 (n=7) were singleton emm-cluster types.

Consistent with the fewer emm-types within the HIC isolate collection, we identified only 28 different STs, the most common being ST14, ST101, ST36 and ST28, reflecting their association with the dominant genotypes emm108, emm89, emm12 and emm1, respectively. As with the LIC isolates, STs were unique to a single emm-type.

The phylogenetic analysis of the HIC isolates based on core-genome SNPs also grouped isolates into lineages based on emm-types, and all emm-types formed single lineages (Figure 3). Pairwise genetic distance analysis identified a median of 17 SNPs between isolates of the same emm type (range 0 to 2,206), compared to a median of 11,100 SNPs between isolates belonging to different emm types (range 3,057 to 12,339) (Supplementary Figure 1B).

Surprisingly, only seven out of the 62 total emm-types identified were common to both LIC and HIC isolates: emm4, 28, 75, 77, 80, 81 and 89.
However, with the exception of emm80 (emm80.0), the overlapping emm-types were of different emm sub-types between the two sites (Figure 2, Supplementary Table 1). All were emm-cluster E emm-types, except emm80, which belongs to emm-cluster D4. Pairwise comparison of isolates from the two different sites within each of these emm-types revealed a level of genetic distance similar to that observed when isolates of different emm-types were compared, indicating that, although they may share an emm-type, they do not share a core genome.

It is also possible that closely related isolates may exist within both collections but carry different emm genes. Core-gene phylogeny of all isolates from both sites combined showed clear segregation of isolates from different sites, except in one instance where an emm192 HIC isolate clustered with two emm56 LIC isolates (Supplementary Figure 2).

The core genome of isolates from both sites combined was 1191 genes from a total of 7921 genes. However, while 1416 genes were present in at least one HIC isolate and absent from all LIC isolates, 3418 genes were present in at least one LIC isolate but absent from all HIC isolates. This indicates a greater accessory genome in LIC isolates. The core genome of LIC isolates alone was 1288 genes, similar to HIC isolates at 1242, but there was a total of 6408 genes in LIC isolates compared to 4411 genes in HIC isolates.

[...] isolates within ten different emm types, because it was not contiguous in the de novo assemblies, possibly due to sequence quality or repetitive regions.
For the HIC isolates, this was the case for only single isolates within emm types emm1, 12 and 108, and other isolates within these emm-types had confirmed [...]

Six different Mga-regulon compositions were identified across isolates from both sites (Figure 4) but the vast majority of emm-types from both sites were Mga-regulon type I, consisting of mga, mrp, emm, enn and scpA. This type was found in 31/36 emm-types in LIC isolates and 16/23 emm-types in HIC isolates, accounting for 88% (75/85) and 71% (98/139) of the LIC isolates and HIC isolates, respectively. Mga-regulon type II, with the emm1 streptococcal inhibitor of complement (sic) or emm12 SIC related gene (drs), was only found in HIC isolates.

Alleles for mrp and enn were extracted and compared for associations with emm and geographical location of the isolate. Ninety-seven mrp genes and 92 enn genes were extracted from the 107 LIC isolate genomes, resulting in 44 unique mrp sub-alleles and 48 unique enn sub-alleles. From the 142 HIC isolate genomes, we extracted 101 mrp genes and 99 enn genes, resulting in 22 unique mrp sub-alleles and 21 unique enn sub-alleles. For the majority, unique alleles were associated with emm-type and geographical location, although phylogenetic analysis showed that overall there was limited geographical restriction between closely related alleles (Supplementary Figure 3). For both Mrp and Enn there were two main clades, one associated with E cluster emm-patterns and the other with a mix of emm-patterns. We did identify some instances of the same mrp allele associated with different emm types, although, with one exception, this was restricted to the LIC isolates. The mrp202 allele was shared by emm119 and emm162 isolates and mrp60 was shared by emm85 and emm89 isolates.
Sub-alleles (same amino acid sequence but different nucleotide sequence) mrp193.14 and mrp193.15 were found in emm116 and emm86, respectively. Different sub-alleles of mrp195 were found in the LIC emm18, emm95 and emm/stG1750 isolates but also in HIC emm53 isolates. A similar pattern was also found with enn, with different sub-alleles of enn199 found in the LIC emm65 and emm182 isolates, and sub-alleles of enn26 found in the LIC emm168 but also HIC emm89 isolates.

We also looked for the presence of the fbaA gene, downstream of scpA (outside of the Mga-regulon), which encodes a surface protein associated with the infection potential of pattern D skin isolates (11,31). This gene was found in all D pattern and E pattern isolates but was absent in 75% of A-C pattern HIC and LIC isolates (Supplementary Table 1).
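As an illustration of the site-specific gene counts reported earlier in this section (genes present in at least one isolate from one site and absent from all isolates from the other), the underlying set arithmetic can be sketched as follows. This is a minimal sketch and not the pipeline used in the study: the function name, the toy gene presence sets and the strict definition of the core genome as genes present in 100% of isolates are all assumptions made for illustration.

```python
from typing import Dict, Set


def partition_pan_genome(
    presence: Dict[str, Set[str]],
    site_of: Dict[str, str],
    site_a: str,
    site_b: str,
) -> Dict[str, Set[str]]:
    """Split a pan-genome into core and site-specific gene sets.

    presence maps isolate name -> set of genes detected in that isolate.
    site_of maps isolate name -> site label (hypothetical labels here).
    """
    isolates_a = [i for i, s in site_of.items() if s == site_a]
    isolates_b = [i for i, s in site_of.items() if s == site_b]

    # Genes seen in at least one isolate from each site.
    genes_a = set().union(*(presence[i] for i in isolates_a))
    genes_b = set().union(*(presence[i] for i in isolates_b))

    # Assumption: core = present in every isolate from both sites.
    core = set.intersection(*(presence[i] for i in isolates_a + isolates_b))

    return {
        "pan": genes_a | genes_b,
        "core": core,
        "only_a": genes_a - genes_b,  # in >=1 isolate of A, absent from all of B
        "only_b": genes_b - genes_a,  # in >=1 isolate of B, absent from all of A
    }


# Toy data (invented for illustration, not taken from the study).
presence = {
    "LIC_1": {"gyrA", "recA", "tetM", "geneX"},
    "LIC_2": {"gyrA", "recA", "geneY"},
    "HIC_1": {"gyrA", "recA", "sda1"},
}
site_of = {"LIC_1": "LIC", "LIC_2": "LIC", "HIC_1": "HIC"}

result = partition_pan_genome(presence, site_of, "LIC", "HIC")
for name in ("pan", "core", "only_a", "only_b"):
    print(name, len(result[name]), sorted(result[name]))
```

Pan-genome tools commonly apply a softer threshold for core genes (for example, presence in nearly all isolates), so the counts from a sketch like this would not be expected to match published figures exactly.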

Hyaluronic capsule biosynthesis genes
Although the hyaluronic capsule is considered an important virulence factor, recently it was shown that genotypes emm4, emm22 and emm89 lack the hasABC operon required to synthesise the capsule. Additionally, in HICs there is a high proportion of isolates within different genotypes in which hasA or hasB has either been deleted or carries a mutation that would render the encoded protein non-functional, which is predicted to result in the lack or reduction of capsule (33). The hasABC operon was detected in all the LIC isolates, including the emm4 and emm89 isolates, supporting the finding that they have a different core genome compared to HIC emm4 and emm89, which all lacked the hasABC operon. No variations were detected in the hasA and hasB genes that would lead to truncated proteins in the LIC isolates, except for one emm74 isolate with a hasA variant that would encode a truncated HasA. In the HIC isolates and consistent with previous findings (33), all emm28, emm77 and emm87 isolates were predicted to produce truncated HasA, and all emm81 and emm94 isolates were predicted to produce truncated HasB. Three other isolates were predicted to produce truncated HasA and a further two to produce truncated HasB, but these were sporadic examples within emm-types (Supplementary Table 1).
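The prediction of truncated HasA or HasB described above rests on finding a stop codon before the expected end of the coding sequence. The following is a minimal sketch of that check, not the method used in the study: the function names and the short toy sequences are hypothetical, and the check only considers stop codons in the reading frame of the supplied sequence.

```python
from typing import Optional

# Stop codons of the standard (and bacterial) genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_in_frame_stop(cds: str) -> Optional[int]:
    """Return the 0-based codon index of the first in-frame stop codon, or None."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    for pos in range(0, usable, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            return pos // 3
    return None


def is_truncated(cds: str, expected_codons: int) -> bool:
    """True if a stop codon occurs before the expected number of coding codons.

    expected_codons is the length of the intact protein in codons,
    excluding the terminal stop codon.
    """
    stop = first_in_frame_stop(cds)
    return stop is not None and stop < expected_codons


# Toy sequences (invented): an intact 4-codon ORF and a nonsense variant.
intact = "ATGAAACCCGGGTAA"   # ATG AAA CCC GGG TAA
variant = "ATGAAATAGGGGTAA"  # ATG AAA TAG ... premature stop at codon 2

print(is_truncated(intact, expected_codons=4))   # False
print(is_truncated(variant, expected_codons=4))  # True
```

In practice the expected length would come from a reference allele, and frameshifts or deletions would need an alignment-based comparison rather than this frame-only scan.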

FCT-types in the LIC and HIC isolates
The fibronectin- and collagen-binding T-antigen (FCT) region, which is classified into 9 different types (FCT1-9), encodes pilin structural and biosynthesis proteins and adhesins that could be genetic determinants of tissue tropism (34). Therefore, we investigated the diversity of the FCT regions in isolates across the two geographical settings. Eight different patterns were identified across the two sites, corresponding to FCT1-6 and FCT9, as well as a previously unidentified pattern found among the LIC isolates, which we termed FCT10; it was similar to FCT5, but with an additional fibronectin binding protein (Figure 6). FCT3 was found in the most emm-types in both the HIC and LIC isolate collections, 9/23 (39%) and 20/46 (43%) emm-types, respectively, although this represented only 23% of the HIC isolates compared to 41% of LIC isolates. FCT4 was also found in a high proportion of emm types, accounting for 7/23 (30%) and 11/46 (24%) emm-types, representing 28% and 30% of HIC and LIC isolates, respectively. Due to the prevalence of emm108 and emm1 in HIC, 33% of HIC isolates were either FCT1 or FCT2, whereas only 6% of the LIC isolates were FCT1 and no LIC isolates were FCT2. There was only one example of isolates of the same emm-type with two different FCT-types, and that was within the two LIC emm118 isolates. While one emm118 (ST1205) isolate was estimated to be FCT4, the other (ST354) was estimated to be FCT10, alongside the two LIC emm63 isolates. However, the FCT types of both emm118 isolates were estimated, as the FCT regions were not found within a single contiguous sequence.

We also compared the amino acid sequences of the FCT regulatory genes rofA, nra and msmR and identified a number of different variations. For the majority, variations were common to all isolates within an emm-type and there were no obvious variations that may affect function.
We found that 9/10 HIC emm1 isolates carried three variations within RofA that characterised them as being part of the M1UK lineage associated with high speA expression (35). No other isolates were found to carry any of these three RofA variations.

Vaccine antigen diversity
Based on the number of isolates with emm-types present in the vaccine, the potential coverage of the 30-valent M protein vaccine in the LIC isolates was 24%, with only 11 vaccine-included emm-types (Supplementary Figure 6). In contrast, the potential coverage of the HIC isolates was 61%, although only 14 emm-types were vaccine-included. This suggests limited potential for this vaccine in low-income settings such as The Gambia, although there may be potential for cross-protection, as has been seen for some emm-types (4, 36).

Among other potential vaccine candidates, the genes spy0651, spy0762, spy0942, pulA, oppA, shr, speB, adi, ropA(tf), spyCEP, slo, spyAD, fbp54 and scpA were recently highlighted as conserved potential targets (37). All LIC and HIC isolates carried all 14 genes and BLASTp indicated that all genes were highly conserved in all isolates, with less than 1% sequence divergence (>99% identity) from the corresponding genes in reference genome MGAS5005 (emm1).

[...] indicate a much higher level of diversity than that seen in HICs (6-9) and this is reflected in the limited African genomic data (37). In this study, we aimed to contribute genomic data and provide molecular characterisation of S. pyogenes in The Gambia by whole genome sequencing isolates collected during a population-based study of skin infections in children aged 5 years and under. To act as a comparison isolate collection, we also genome sequenced isolates from Sheffield, UK to represent HIC isolates.
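The direct vaccine coverage figures reported under Vaccine antigen diversity are the fraction of isolates whose emm-type is included in the vaccine. A minimal sketch of that calculation is given below; the function name and the toy inputs are hypothetical, and the placeholder vaccine set is not the actual composition of the 30-valent vaccine. Cross-protection between emm-types is not modelled.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def vaccine_coverage(
    isolate_emm_types: Iterable[str],
    vaccine_emm_types: Set[str],
) -> Tuple[float, List[str]]:
    """Return (fraction of isolates covered, sorted list of covered emm-types)."""
    counts = Counter(isolate_emm_types)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no isolates supplied")

    # emm-types that occur in the collection and are included in the vaccine.
    covered_types = sorted(t for t in counts if t in vaccine_emm_types)
    covered_isolates = sum(counts[t] for t in covered_types)
    return covered_isolates / total, covered_types


# Toy inputs (invented for illustration only).
toy_isolates = ["emm1", "emm1", "emm12", "emm108", "emm80"]
toy_vaccine = {"emm1", "emm12"}  # placeholder set, not the real vaccine content

coverage, covered = vaccine_coverage(toy_isolates, toy_vaccine)
print(f"coverage = {coverage:.0%}, covered emm-types = {covered}")
```

Counting covered isolates and covered emm-types separately mirrors the distinction drawn in the text, where a collection can have high isolate-level coverage from relatively few vaccine-included emm-types.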
Consistent with other findings from LICs (9), we identified a high number of different emm-types in the LIC isolate collection from The Gambia compared to the HIC isolate collection from the UK, and no dominant type. In the HIC isolates, five emm-types (emm108, emm89, emm12, emm1 and emm4) accounted for ~60% of the isolates. There was also limited overlap across the two sites, with only 7 shared emm-types: emm4, 28, 75, 77, 80, 81 and 89. However, it was clear that these emm-types represented a different genetic background between the two locations, supporting previous findings that emm might not be a good marker for characterising a diverse global population (37).

Although we did not specifically select for impetigo isolates or patient age range amongst the HIC isolate collection, all were associated with some form of non-invasive skin infection. Little [...]

The emm-pattern D, previously determined to be associated with skin infections, was the most common in the LIC isolates (48%) and the HIC isolates (36%), although emm-pattern E was almost equally as common in HIC isolates (35%). A review of population-based studies (11) found that among impetigo isolates, 49.8% were D, 42% were E and 8.2% were A-C patterns, compared to 1.7% D, 51.7% E and 46.6% A-C patterns among pharyngeal isolates. This distribution is consistent with our findings in the LIC isolates (48% D, 40% E, 12% A-C) but we found a higher level of A-C isolates (29%) in HIC isolates. This could be due to the more diverse collection of HIC isolates, given that we did not focus specifically on impetigo. Interestingly, the dominant HIC emm-types were either pharyngeal specialist pattern A-C (emm1 and emm12) or generalist pattern E (emm4 and emm89), with only emm108 representing skin specialist pattern D.

In the LIC isolates, all six E emm-clusters were represented, with the most common being E6 (18%), closely followed by E4 (16%) and E3 (14%). E6 was recently found to be the leading cluster in Gambian non-invasive isolates (skin and pharyngeal) but with E3 leading among invasive isolates (9). D4 was also common in LIC isolates (17%) but more so in HIC isolates, where 35% of the isolates were D4. This was almost equal to all E clusters combined, but is again explained by the high number of emm108 isolates. A higher number of singleton emm-cluster types was also found in the LIC isolates (n=17), representing 9 emm-types, compared to HIC isolates (n=11), representing just two emm-types. There was an association between E emm-cluster isolates and carriage of the sof gene, as all E1-E4 emm-types were sof positive. [...] 65, 182 and 205) were sof negative and all E5 emm-types were negative. HIC emm12 isolates carried a sof gene that would only produce a truncated form of SOF, as previously identified (11).

Consistent with the high number of D/E pattern isolates, we also found that the majority of isolates had the Mga-regulon pattern I, and therefore carried the emm-like genes mrp and enn. Within the HIC emm4 isolates we found that 4/9 carried the emm-enn fusion gene, and this was also associated with degraded prophages in these isolates (40,41). Given the high number of isolates carrying Mrp and Enn, it is possible that they contribute to pathogenesis to the same extent as the M protein, or even more (28). The M-like proteins have not been well characterised and their role and expression may vary depending on the allele or other genetic factors. The existence of two major clades within the Mrp and Enn phylogeny is of interest and may indicate varying domains and functions. Despite being adjacent to the emm gene, we did not observe sharing of enn and mrp alleles with emm-type over the two geographical sites.
We did, however, see the same allele or very closely related alleles of mrp and enn shared with different emm-types across different geographical locations.

HIC emm4 and emm89 isolates were acapsular, as expected, but this was not the case for LIC emm4 and emm89, again reflecting very different genetic backgrounds. All LIC isolates carried the hasABC genes required to synthesise the capsule; only one isolate had a mutation that would lead to a truncated HasA and a probably acapsular phenotype.

The FCT region encodes genes thought to be involved in adhesion to the host, particularly the pili, which are likely to mediate primary host-pathogen interactions (42). Factors essential for pili construction are encoded within the FCT region and include a major pilus subunit, one or two minor subunits, at least one specific sortase and a chaperone (42). The pili of the M1 isolate, SF370, have been shown to be essential for adherence to human tonsil and human skin (43), indicating their role in primary interactions and establishing infection. Other factors included within the FCT region are fibrinogen and fibronectin binding proteins, which may also contribute to host cell interactions, as well as transcriptional regulators. We identified the previously described FCT types FCT1-6 and FCT9 among our isolates but also a new FCT type (FCT10) that was based on FCT5 with an additional fibronectin binding protein. FCT2 and FCT6 were restricted to HIC isolates and the new FCT10 was only found in LIC isolates. FCT3 and FCT4 were the most common types across both sites, found in 70% (16/23) and 74% (34/46) of emm-types, representing 54% (76/142) and 69% (74/107) of HIC and LIC isolates, respectively. FCT3 and FCT4 have been shown to share the greatest similarity and can undergo recombination (42).
Both these FCTs have a cpa gene, which encodes a collagen binding subunit found at the pilus tip, one or two fibronectin-binding proteins (sfbI/sfbII) and the regulator msmR upstream of the fibronectin-binding protein. The pilus and fibronectin-binding proteins may contribute to tissue-specific host cell adhesion, in addition to others located outside the FCT region. This includes fbaA, which we found to be present in all isolates except for the majority of A-C pattern types, and which has been found to contribute to skin infection (31). The regulator msmR has been shown to have a positive effect on the expression of the fibronectin binding protein and may also control other surface proteins, impacting on host cell adhesion (44).

It is not clear if specific FCT types confer tissue tropism and previous work has shown that there is a high level of variability in host cell interactions and biofilm formation between isolates sharing the same FCT (45). This indicates that there are other bacterial factors involved in the expression of FCT related genes. The role of the regulators nra or rofA does vary between isolates of differing genetic backgrounds, with evidence of environmental effects such as pH and temperature (42). We explored the sequences of rofA, nra and msmR and found a number of different variations; however, many seemed to be related to emm-type and it is difficult to determine if any variation would impact on function. This was also the case for the two- [...]

[...] between isolates. The chromosomal speG and smeZ genes were the most common in both populations, with more than 90% of the isolates carrying these genes. The prophage-associated speC and ssa were more common in HIC isolates compared to LIC isolates, and three HIC isolates actually carried two copies of speC, along with the DNase spd1, on two separate prophages. Typically, speA is prophage associated but the divergent speA.4 allele is associated with a prophage-like element that has previously been found only in emm6, emm32, emm67 and emm77 (32). In the HIC isolates we found this allele only in emm6. Although speA was almost equally as common in the LIC population, all but one of the 22 speA-positive LIC isolates carried speA.4 associated with the prophage-like element. Only one LIC emm89 isolate carried speA on what appeared to be a complete prophage, and this allele differed from speA.1 by a single base pair. Interestingly, we also identified a gene in one LIC isolate (emm65) that appeared to be a fusion of 5' speK and 3' speM, and since speK and speM are phage encoded, it could be a result of recombination of phages carrying the two genes.
BLASTp of this potential fusion protein identified a similar variant (differing by two to three amino acids) in six published genomes: NS88.3 (emm98, locus accession PWO34032), emm89.14 (QCK42181), emm100 (QCK70992), NS426 (VGQ95836), NS76 (VGR28970) and NS6221 (VHG25078).

Only two of the prophage-associated DNases (spd1 and spd3) were found in the LIC isolates, while six DNases (sda1, sda2, sdn, spd1, spd3 and spd4) were identified in the HIC isolates. Almost all (136/142, 96%) of the HIC population carried at least one prophage-associated DNase, whereas only two LIC isolates carried spd3 and only 24% of LIC isolates carried spd1, which was associated with the superantigen speC. DNases, such as sda1, have been shown to be necessary and sufficient to degrade neutrophil extracellular traps (46); therefore, the lack of these in LIC isolates from The Gambia could suggest a reduced capacity for immune evasion, and warrants further investigation into their invasive capacity. It is possible that other prophage-associated DNases exist but are yet to be identified. It also suggests differences in circulating phages between the two sites, although the accessory genome appeared to be much greater in LIC isolates compared to HIC isolates. This could be related to the high prevalence of tetracycline resistance genes within the LIC population that may be carried on mobile genetic elements. Further investigation is needed to determine prophage content, as well as other mobile genetic elements; this is, however, notoriously difficult with short read sequence data and may require supporting long read data.

The most advanced multi-valent S. pyogenes experimental vaccine is based on 30 emm-types identified from isolates causing infection predominantly in high income countries (4, 5).
Based on the emm-type distributions, we determined the direct coverage of the vaccine to be only 24% in the LIC population, compared to 61% in the HIC population, although we did not explore cross-reactivity between emm-types. The high proportion of emm108 in HIC isolates was unexpected as this was not a previously recognised dominant emm-type and highlights the potential for sudden and dramatic increases in new emm-types that could escape a serotype-specific vaccine. If such a vaccine were introduced, monitoring of new variants in the non-invasive as well as the invasive bacterial populations would be needed, and on a global scale. Alternatively, a vaccine targeting antigens with limited variability between isolates may be preferable, if these can still provide similar levels of protection. We have confirmed that several previously identified potential targets (37) are also highly conserved in our LIC and HIC bacterial populations. However, both our LIC and HIC isolates represent only single geographical locations: Sukuta, The Gambia and Sheffield, UK. Further in-depth genomic analysis of international S. pyogenes populations, encompassing more LICs and different infection types, is needed to confirm the diversity and distribution of potential vaccine targets.

Our study confirms work by others (37) showing that emm-typing alone is insufficient to comprehensively characterise global isolates. Furthermore, genetic features that have been characterised in particular HIC emm-types, such as the absence of the hasABC locus in emm4, may not be present in LIC isolates of the same genotype. In the absence of WGS, other molecular markers, such as MLST, enn, mrp and FCT type, could be used in addition to emm-typing to characterise the diverse genetic background of isolates from different geographical settings.
More work is required to understand why there is such high genetic diversity in LIC settings compared to HIC settings, and with such limited overlap. This may be linked to infection types but there are insufficient data both on pharyngeal infections in LICs, like The Gambia, and on skin infections in HICs. By increasing the characterisation of isolates from different infections over wider geographical settings we could gain real insight into the molecular mechanisms underpinning tissue tropism.

[...] (22) with 100 bootstraps. Isolates clustered by emm-type except those indicated, whereby two lineages were represented by a single emm genotype (star, emm25; filled square, emm65; open square, emm85; open circle, emm89; filled circle, emm209). Also shown is the presence (black)/absence (white) of the superantigen genes (speA, speC, speG, speH-M, speQ, speR, ssa and smeZ) and DNase genes spd1 and spd3; four other DNase genes (sda1, sda2, sdn, and [...]

[...] bootstraps. All isolates clustered by emm-type. Presence (black)/absence (white) of superantigens (speA, speC, speG, speH-M, speQ, speR, ssa and smeZ) and DNases (sda1, sda2, sdn, spd1, spd3 and spd4) is indicated. Antimicrobial resistance genes (AMR) tetM, ermA and ermB were also identified in some isolates (white, absent; black, present). The positivity for serum opacity factor (sof) is also shown, but in all emm12 isolates this gene would produce a truncated variant of SOF (grey). Scale bar represents substitutions per site. emm-types are coloured for easy visualisation and type numbers are also given.

FCT regions were extracted from de novo assemblies and the FCT type assigned based on the predicted function and order of genes within the extracted region. The emm-types of isolates with each FCT type are shown for HIC and LIC isolates. A new FCT region was identified (FCT10) as similar to FCT5 but with an additional fibronectin binding protein after the sortase genes. For all emm-types there was at least one isolate with a designated FCT type in a single contiguous region. The only exception to this was emm118 (*), where the FCT types of the two isolates were estimated to be FCT4 and the new FCT10, respectively, as the FCT region was split over two contigs. In FCT1, transposases were found in HIC emm6 and emm108, and in FCT9, transposases were found in HIC emm75 and LIC emm4. fbp, fibronectin binding protein; cwp, cell wall protein; ap, ancillary protein; trp, transposase.