Genetic structure, function and evolution of capsule biosynthesis loci in Vibrio parahaemolyticus

Capsule-forming extracellular polysaccharides are crucial to bacterial host colonization, invasion, immune evasion and ultimately pathogenicity. Due to warming ocean waters and human encroachment on coastal ecosystems, Vibrio parahaemolyticus has emerged as a globally important food-borne enteropathogen implicated in acute gastroenteritis, wound infections, and septic shock. Conventionally, the antigenic properties of lipopolysaccharide (LPS, O antigen) and capsular polysaccharide (CPS, K antigen) have provided a basis for serotyping V. parahaemolyticus, while disclosure of the genetic elements encoding 13 O-serogroups has allowed molecular serotyping methods to be developed. However, the genetic structure of the CPS loci for 71 K-serogroups has remained unidentified, limiting progress in understanding the roles of CPS in V. parahaemolyticus pathophysiology. In this study, we identified and characterized the genetic structure and evolutionary relationships of the CPS loci of 40 K-serogroups through whole genome sequencing of 443 V. parahaemolyticus strains. We found a distinct pattern of CPS gene clusters across different K-serogroups, and extended the cluster to a new right border by identifying glpX as a key gene conserved across all serotypes. A total of 217 genes involved in CPS biosynthesis were annotated. The functional contents and genetic structure of the 40 K-serogroups were analyzed. Based on inferences from species trees and gene trees, we proposed an evolutionary model of the CPS gene clusters of the 40 K-serogroups. Horizontal gene transfer by recombination from other Vibrio species, gene duplication and nonsense mutations are likely to play instrumental roles in the evolution of CPS in V. parahaemolyticus. To the best of our knowledge, this is the first time that CPS gene clusters of different K-serogroups in V. parahaemolyticus have been identified on a large scale and characterized in an evolutionary context. This work should help advance understanding of the variation of CPS in V. parahaemolyticus, and provide a framework for developing diagnostically relevant serotyping methods.

Author summary

Due to warming ocean waters and human encroachment on coastal ecosystems, Vibrio parahaemolyticus has emerged as a globally important food-borne enteropathogen. However, the genetic structure of the CPS loci for 71 K-serogroups of V. parahaemolyticus has remained unidentified, limiting progress in understanding the roles of CPS in V. parahaemolyticus pathophysiology. In this study, we identified and characterized the genetic structure of the CPS loci of 40 K-serogroups through whole genome sequencing of 443 V. parahaemolyticus strains. We extended the CPS gene cluster to a new right border by identifying glpX as a key gene conserved across all serotypes. We proposed an evolutionary model of the CPS gene clusters of the 40 K-serogroups. We also found that horizontal gene transfer by recombination from other Vibrio species, gene duplication and nonsense mutations are likely to play instrumental roles in the evolution of CPS in V. parahaemolyticus. To the best of our knowledge, this is the first time that CPS loci of different K-serogroups in V. parahaemolyticus have been identified on a large scale and characterized in an evolutionary context. This work should help advance understanding of the variation of CPS in V. parahaemolyticus, and provide a framework for developing diagnostically relevant serotyping methods.


Based on this, we deduce that CPS gene cluster length is directly proportional to ORF number across the different K-serogroups.
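
One way to check a proportionality claim of this kind is to compute the Pearson correlation between ORF number and cluster length, together with a least-squares slope constrained through the origin. The sketch below is only an illustration: the function names and the four data points are made up for the example and are not the measurements behind Fig 2.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def slope_through_origin(xs, ys):
    """Least-squares slope of y = k * x (no intercept), i.e. strict proportionality."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Hypothetical toy values (NOT data from the study): ORF counts and
# cluster lengths in kb for four imaginary K-serogroups.
toy_orf_counts = [20, 25, 30, 40]
toy_lengths_kb = [22.0, 27.5, 33.0, 44.0]

r = pearson_r(toy_orf_counts, toy_lengths_kb)
kb_per_orf = slope_through_origin(toy_orf_counts, toy_lengths_kb)
print(f"Pearson r = {r:.3f}, slope = {kb_per_orf:.2f} kb per ORF")
```

A correlation close to 1 combined with a stable kb-per-ORF slope would be consistent with direct proportionality; real data would also need the residuals inspected.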

Interestingly, the CPS gene clusters of the K-serogroups that combine with O2 to form serotypes (namely O2:K28 and O2:K3) are apparently longer than those of K-serogroups combined with other O-serogroups, in agreement with their ORF numbers (Fig 2). This suggests that K28 and K3 have a closer evolutionary relationship, a view supported by the evolutionary analysis that follows.

Representation of 40 CPS gene clusters' structure, function and their differences

In order to clarify the coding gene structure and gene function of the gene clusters of the 40 K-serogroups, we selected one representative strain from each K-serogroup to analyze and cluster the coding genes within the CPS gene cluster (Table 1). Accordingly, the coding genes of the 40 K-serogroups were clustered into 219 genes, which were subsequently classified into 4 classes: 48 pathway genes, 12 processing and transportation genes, 6 glycosyltransferase genes, and 153 others (4 with known functions that cannot be classified, and 149 of unknown function) (Table 2).

K28 (four core genes) located in the left flank region and seven core genes located in the right flank region (Fig 4, S3 Fig).

For the remaining 207 genes, 78% of them are distributed in fewer than three K serotypes (S2 Table, S4 Fig) and are mostly located in the middle regions of the CPS gene cluster (Fig 4).

The average within-gene identity is 34.8%, which is lower than that of the core genes. Moreover, 16 genes have multiple copies in several K serotypes (Fig 3B, S4 Fig, Table 3).

continuous genes coding from right to left, which is distinct from other K-serogroups (Fig 4). The following insertion sequence analysis indicates that these ORFs coding from right to left arose from 2 insertion events.

Functional characteristics of CPS gene cluster's coding region.

Pathway genes: The pathway gene class, containing 48 genes, is the largest of the 3 gene function classes (Table 2), and its genes are mainly located in the middle and right flank regions of the CPS gene clusters (Fig 4). The average proportion of pathway genes per K-serogroup is 39.48%; more specifically, K17 has the largest proportion (52%) while K36 has the smallest (21.9%) (Fig 3A). In total, 5 pathway genes (glpX, hldD, hpcD, tpiA, ugd) have a distribution frequency of 100% across the 40 K-serogroups, which is more than in the other 2 gene classes. Moreover, the sequences of these core genes in the 40 K-serogroups are more highly conserved than those of any other genes (their average within-gene identity is 83.9%) (Table 2, Fig 3B, Fig 3C). Some of the pathway genes have been reported in other enteropathogenic bacteria to be responsible for biosynthesis of CPS precursors. gtaB and glmM, encoding UTP-glucose-1-phosphate uridylyltransferase and phosphoglucosamine mutase respectively, exist in 72.5% and 42.5% of the 40 K-serogroups (S2 Table), and were reported existing Table). In summary, the core pathway genes of the 40 K-serogroups might be essential for synthesis of the common precursor of CPS, while the pan pathway genes may convert the common precursor into different products, which would change the chemical properties and immunogenicity of CPS.
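
The notion of distribution frequency and of core genes used above can be made concrete with a small presence/absence calculation. The following is a minimal sketch, assuming each K-serogroup is represented by the set of gene names found in its CPS gene cluster; the function names and the four toy serogroups ("Ka" to "Kd") are hypothetical and are not taken from S2 Table.

```python
def distribution_frequency(presence):
    """Map each gene to the fraction of serogroups whose cluster contains it.

    presence: dict of serogroup name -> set of gene names in that cluster.
    """
    n_serogroups = len(presence)
    counts = {}
    for genes in presence.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return {gene: count / n_serogroups for gene, count in counts.items()}

def core_genes(presence):
    """Genes present in every serogroup (distribution frequency of 100%)."""
    freq = distribution_frequency(presence)
    return sorted(gene for gene, f in freq.items() if f == 1.0)

# Hypothetical toy input (NOT the study's data): four imaginary serogroups.
toy_presence = {
    "Ka": {"glpX", "hldD", "gtaB", "glmM"},
    "Kb": {"glpX", "hldD", "gtaB"},
    "Kc": {"glpX", "hldD", "gtaB"},
    "Kd": {"glpX", "hldD"},
}

toy_freq = distribution_frequency(toy_presence)
toy_core = core_genes(toy_presence)
print(toy_core, toy_freq["gtaB"], toy_freq["glmM"])
```

With real annotations, genes below the 100% threshold would correspond to the pan (accessory) genes discussed in the text.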

Processing and transportation genes: Processing and transportation genes are responsible for forming the CPS repeat units and subsequently translocating mature CPS to the cell surface [32]. In total, 12 genes were identified as belonging to the processing and transportation gene class, and mainly with higher sequence identity, except K28 and K3 with a low sequence identity of 31.5% (S2 Table, Fig 3B, Fig 3C). According to the annotation in the UniProt database, kpsD is involved in the translocation of the polysialic acid capsule across the outer membrane to the cell surface, and wzc is probably involved in the export of colanic acid from the cytosome to the outer membrane; we therefore speculated that kpsD and wzc can non-specifically process and transport very distinct CPS structures. According to a recent study by Rendueles et al. [33], CPS can be biosynthesized by one of the following five mechanisms recognized by processing and transportation genes:

K28 and K3 belong to the ABC-dependent mechanism according to our annotation strategy, while the mechanism of the other 38 K-serogroups cannot be identified by this approach.

Glycosyltransferase genes: 6 genes can be classified into the glycosyltransferase gene class, and they are mainly located in the middle region of the CPS gene clusters (in orange in Fig 4). The average proportion of glycosyltransferase genes across the 40 K-serogroups is 8.43%, the lowest of the three gene classes: K19 has the highest proportion (15.2%) while K63 has the lowest (2.4%) (Fig 3A). These genes are all distributed at lower frequencies among the 40 K-serogroups, ranging from 5% to 77.5%, and at the sequence level they show lower within-gene pairwise identity (their average within-gene identity is 21.5%) (Table 2, Fig 3B, Fig 3C). The diversity and uniqueness among

Additionally, we proposed that during the evolutionary process, K28 and K3 were generated by two insertion events from their ancestor, which had a genetic structure similar to that of the other K-serogroups (Fig 4, Fig 5). The donor sequences of insertion events 1 and 2 are each terminated by the same gene, wecA and Rv2957, respectively. The genes in these two insertion sequences are distributed uniquely within group 1, supporting the notion that they were acquired by recombination from other species.

To determine the potential donor sources of the insertion sequences, we extracted these sequences

Table). We reasoned that the most recent donor sources of the whole insertion sequences in events 1 and 2 are unidentifiable by the current approach because of the low availability of public sequences relevant to the analysis, although we did find homologous cysD in other Vibrio species.

Group 4. We proposed that insertion event 3 occurred in the ancestor (N10) of group 4 and group 5 (labeled with bold orange lines in Fig 4, Fig 5). The sequence of insertion event 3 contains three genes, gnu, epsL and pglF, in a conserved order. This insertion sequence exists specifically in group 4 (7/17) and group 5 (7/13) but not in other groups.

Fig 4, Fig 5). This is consistent with the insertion sequence in the right junction region of putative CPS gene clusters of

Group 5. Thirteen K-serogroups forming an independent clade were classified as group 5 (Fig 5). We inferred that the common ancestor N17 of these K-serogroups underwent insertion event 4 (labeled with bold blue lines in Fig 4; Fig 5), so that the K gene clusters of its descendants share a six-gene sequence specific to this group (Fig 5). The insertion sequence contains 6 conserved genes in a fixed order (Fig 4). The entire insertion sequence exists in group 5 at high frequency (10/13 in this group, mean pairwise identity 70%), but at low frequencies in group 3 (1/5) and group 4 (2/17). We speculate that the insertion sequence might have originated from an insertion event between the ancestral recipient V. parahaemolyticus K-serogroup and the donor species, and remained in group 5 during evolution, whereas a gradual loss has taken place in some

multi-copy as a mechanism for gene divergence (Table 3).

We also found that the early evolution of CPS

Fig 5). In a gene-by-gene analysis, the two glycosyltransferase genes, Rv2957 and bshA, occur as dual copies at high frequencies of 15/40 (15 of 40 K-serogroups) and 9/40, respectively, and even as triple copies at substantial frequencies of 5/40 and 2/40, consistent with the above analysis. Other multi-copy genes occur in fewer than 5 K-serogroups. In addition, the pathway genes gtaB and epsM occur as triple copies at frequencies of 2/40 and 1/40, respectively (S2 Table, Fig 4, Fig 3B). Collectively, genes of

We found that these multi-copy genes may have been generated by three different mechanisms in V. parahaemolyticus. Firstly, some genes acquire multiple copies by gene mutation. More concretely, these genes undergo a nonsense mutation that creates a stop codon, while the downstream region undergoes mutations that create a start codon. All multi-copies of the TTHA0252 (old right-border gene) genes in five K serotypes belonging to groups 3, 4 and 5 were generated by this mechanism (Fig 4, S8 Fig), suggesting that these are parallel events. In addition, the gene gtaB in K58 has been split into two neighboring genes by nonsense mutations with respect to K6 (in group 3).
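
The mutation mechanism described above, in which a nonsense mutation introduces a stop codon and a downstream mutation introduces a start codon, can be illustrated with a toy open reading frame (ORF) scan. This is a simplified sketch under stated assumptions: only ATG is treated as a start codon, only a single reading frame is scanned, and the 27-nucleotide sequence is invented for the example rather than taken from any V. parahaemolyticus gene.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs of ATG...stop ORFs in reading frame 0.

    `end` is exclusive and includes the stop codon. ORFs with fewer than
    `min_codons` codons before the stop are ignored.
    """
    orfs = []
    start = None
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:
                orfs.append((start, i + 3))
            start = None
    return orfs

def point_mutate(seq, position, base):
    """Return a copy of seq with the base at `position` replaced."""
    return seq[:position] + base + seq[position + 1:]

# Hypothetical toy gene (NOT a real sequence): one intact ORF of 9 codons.
intact = "ATGGCTGAACAGCTGAAACTGGGTTAA"

# Nonsense mutation: CAG -> TAG at codon 4 creates a premature stop.
split = point_mutate(intact, 9, "T")
# Downstream mutation: CTG -> ATG at codon 5 creates a new start codon.
split = point_mutate(split, 12, "A")

orfs_before = find_orfs(intact)
orfs_after = find_orfs(split)
print(len(orfs_before), "ORF before;", len(orfs_after), "ORFs after")
```

In this toy case the single ancestral ORF becomes two adjacent ORFs, which an annotation pipeline would report as two neighboring copies of the same gene.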

Secondly, multi-copy genes which cluster together on gene trees may arise from

uniquely occur in K17, and kdsB duplication with 45.1% identity, uniquely occur in K18 (Table 3), suggesting that these duplication events may have contributed to the origin of K17 and K18. There are 3 duplicated genes in K17, which may be crucial to the development of multi-copy genes, as K17 has the longest CPS gene cluster except for K28 and K3 (Fig 2). Located on the sequence of insertion

Thirdly, apart from mutation and duplication, multi-copy genes that cluster into different clades on gene trees plausibly emerged by recombination from unknown donor sources. Ten such genes display multiple copies linked to recombination, most of which are pathway genes. Interestingly, nearly all the pathway genes with high distribution frequency became multi-copy by recombination, while those with low distribution frequency tend to have been generated by duplication (Table 3).

region of CPS gene cluster (Fig 4).

We identified the gene glpX, downstream of VP0238, as an accurate right-border gene of V.

Fig 5). These 24 pandemic K-serogroups are distributed across all five groups, but are unexpectedly concentrated in group 4 (Fig 5). Half of the pandemic K-serogroups belong to group 4, at a high frequency of 70.6% (12/17), while the frequencies for the other half are 7/13 in group 5, 3/5 in group 3, 1/3 in group

In this study, we found that 5 multi-copy genes have evolved through gene duplication in some or all K-serogroups with multi-copy genes. Most of them belong to the pathway gene class (4/5

(Fig 4 and 5). This means that K68, K25, K18 could not have stemmed from K6 by gene gain or loss but by CPS loci