Genome-wide association study of gastric cancer- and duodenal ulcer-derived Helicobacter pylori strains reveals discriminatory amino acid differences and novel oncoprotein candidates

Genome-wide association studies (GWASs) can reveal genetic variations associated with a phenotype in the absence of any hypothesis of candidate genes. The problem of false-positive sites linked with the responsible site might be bypassed in bacteria with a high homologous recombination rate, such as Helicobacter pylori, which causes gastric cancer (GC). We conducted a GWAS followed by regression-based prediction of GC and duodenal ulcer H. pylori strains. We identified 14 single nucleotide polymorphisms (11 amino acid changes) that, combined, allowed effective disease discrimination. They were often informative of the underlying molecular mechanisms, such as electric charge alteration at the ligand-binding pocket, alteration in subunit interaction, and mode-switching of DNA methylation. We also identified three novel virulence factors/oncoprotein candidates. These results provide both defined targets for further informatic and experimental analyses to gain insights into GC pathogenesis and a basis for identifying a set of biomarkers for application in clinical settings.


Introduction 66
The faster-than-exponential decrease in the cost of DNA sequencing has brought about the 67 era of population genomics, which refers to the comparative analysis of numerous genome 68 sequences within a population. Among the various population genomics methods available, 69 genome-wide association study (GWAS) has the advantage of being able to reveal genetic 70 variations associated with a particular phenotype in the absence of any hypothesis of 71 candidate genes. GWASs have revealed the genetic basis of various human diseases, 72 including some with multiple genetic factors. Although GWAS in bacteria has been difficult 73 due to the strong population structure (Falush and Bowden 2006), methodological 74 influence H. pylori growth or survival. Lactate was recently identified as a highly used 224 molecule in the stomach, and one that varies between stomach regions (Keilberg, et al. 2020). 225 The α family of carbonic anhydrases in the periplasm processes CO2 produced from urea 226 by urease and is essential for acid acclimation (Supuran and Capasso 2016). The β family of 227 carbonic anhydrases, including IcfA (HPF57_0004, Figure 4B (i)), likely plays the same role 228 in the cytoplasm (Nishimori, et al. 2007). Residue L47, corresponding to the discriminatory 229 SNP, forms a pair at the dimerization joint ( Figure 4B (ii) (iii)). We predicted that this residue 230 affects subunit interaction to regulate the reaction speed of this very fast-acting enzyme. It 231 could be related to the elevated acid secretion in UCs and its suppression during GC 232 progression. 233 One GWAS SNP was mapped on a DNA methyltransferase, M subunit of a Type I 234 restriction-modification system (HPF57_0490, Figure 4C). In addition to being responsible 235 for modification against restriction, these methyltransferases affect gene expression (Yano, et 236 al. 2020). When the two assemblies of the methyltransferase (2 × S1M2) recognize their 237 target DNA sequences, they transform from an open to a closed form ). The 238 amino acid residue D115 on the helices of the two assemblies moves a large distance to bind 239 and thus connect the two assemblies. A mutation in the E. coli homolog corresponding to the 240 residue at position 115 and some mutations in this helix switch the enzyme from maintenance 241 methylation mode (hemi-methylated to fully methylated DNA) to de-novo methylation mode 242 (unmethylated to fully methylated DNA) (Kelleher, et al. 1991), somehow recognizing the 243 methylation status of the target DNA sequence. This likely affects restriction attacks on 244 incoming unmethylated DNA and on endogenous DNA upon loss of methylation. The 245 necessity of such destructive genome maintenance is expected to differ between the stomach 246 environment and the duodenal environment as well as between different ulcer/cancer stages. 247 Amino acid changes corresponding to the discriminatory SNPs included four more known 248 virulence factors: CagA oncoprotein (S6A Figure), DsbG/K disulfide bond (S-S)-forming 249 enzyme (S6B Figure), FecA-2 iron importer (S6C Figure), and OmpA101 outer membrane 250 protein (S6D Figure). 251 252

Novel virulence factor/oncoprotein candidates 253
Our GWAS identified three new virulence factor/oncoprotein candidates. The first 254 (HPF57_0139, HP0096), designated CtbP (C-terminal binding protein) here, is similar to the 255 human "C-terminal of adenovirus E1A" binding protein 1 (CtBP1) (34% amino acid 256 sequence identity in BLAST) and CtBP2 (32%) (Bellesis, et al. 2018)  promoters (Chinnadurai 2009). Their dehydrogenase activity with NAD as the acceptor is 263 used to monitor intracellular NADH/NAD concentrations (or the energy state, in a sense). 264 Both CtBPs have been implicated in repression of the epithelial phenotype and of apoptotic 265 pathways, and in cancer progression (Chinnadurai 2009). The H. pylori homolog has a more 266 basic charge around the NAD-binding site ( Figure 5A (iii) vs. (v)) and a stronger dimer-267 dimer interaction (Bellesis, et al. 2018). H. pylori CtbP might interfere with NAD sensing 268 and/or assembly of human CtBPs and affect their NAD-based transcription regulation and 269 cell death/differentiation. H. pylori may inject its CtbP into the host cell to hack it at the CtBP 270 hubs of a protein-protein network, just as adenovirus uses E1A (King, et al. 2018). V65 in 271 CtBP1, corresponding to GWAS amino acid E48 in H. pylori CtbP, is located at the N- UniProt) as a bacterial DegP/Q subfamily, which includes H. pylori HtrA. Figure 5B (i) 290 shows Isp modeled on and aligned with a 24-mer cage of DegP (Backert, et al. 2018). The 291 negatively charged area in the funnel is narrowed by an amino acid substitution at position 292 178, corresponding to the discriminatory SNP ( Figure 5B (iv)-(v)), presumably affecting 293 protein binding. We expect that this protein interferes with HtrA family proteases in human 294 mitochondria, other bacteria, or their own (HtrA). 295 The third virulence factor/oncoprotein candidate is TriH (triple halves) (HPF57_0151, 296 HP0130), which carries three half pathogenicity-related domains ( Figure 5C). It carries a  in TriH. We expect that this TriH domain enters nuclei and interacts with cohesin kleisin to 323 affect cohesin action. 324 Our fourth novel virulence factor candidate, HPF57_1209 (HP1250), with an SH3b 325 domain, turned out to be the previously studied Csd5 for helical cell shape (Blair, et al. 2018). 326 Its C-terminal SH3b domain binds to peptidoglycan, while its N-terminal membrane domain 327 interacts with bactofilin, a peptidoglycan precursor synthase, and the inner membrane-328 spanning F1FO ATP synthase. Its N-terminal 1-30 aa is predicted to be a signal peptide 329 (SignalP 5.0). Residue 124 in its central disordered region, corresponding to the 330 discriminatory SNP, varied among strains. Deletion of the disordered region resulted in 331 reduced side curvature. We hypothesize that this residue and this domain may adjust cell 332 curvature to its target mucous layers in different tissues (stomach vs. duodenum) or to 333 different cancer progression stages, as seen with many other proteins (Blair, et al. 2018). 334 335

Intergenic SNPs 336
One of the three intergenic SNPs was found 91 bp upstream of HPF57_0521 (corresponding 337 to HP0490) (S7 Figure). It had the lowest P-value (0.000003) among the discriminatory 338 SNPs (upper right green dot in the Q-Q plot in Figure 2). Upstream of HP0490, there is an 339 extended Pribnow box (tgnTAtaAT) as the -10 motif of sigma 80 preceded by periodic AT-340 rich sequences, although a transcription start site was not detected in previous experiments 341 (Bischler, et al. 2015), likely because of the high transcription of the upstream ribosome 342 protein gene, HP0491. In a predicted secondary structure (M-fold) of the expected transcript, 343 the discriminatory SNP is located at a loop-stem boundary, presumably slightly affecting 344 interaction with a protein or an RNA. Use of the sub-optimal UUG start codon instead of 345 AUG and the presence of this antisense RNA suggest tight regulation of HP0490. The 346 HP0490 gene (kch, trkA) encodes a K + channel protein regulating K + conductance across the 347 15 membrane (UniProt) and is essential for H. pylori colonization of the murine stomach (Stingl, 348 et al. 2007). Mutation of HP0490 might modulate the expression of K + conductance for 349 persistence in the gastric/duodenal environment. 350 The second intergenic SNP is present in the promoter region (-29 bp) of the omega 351 subunit of RNA polymerase (HPF57_0798). (In H. pylori, the upstream promoter element is 352 characterized by an AT-rich sequence (Sharma, et al. 2010).) E. coli omega binds ppGpp 353 alarmon (Mechold, et al. 2013) in stringent response (Hauryliuk, et al. 2015), although its N-354 terminal MAR motif for ppGpp-binding is not conserved in H. pylori (Hauryliuk, et al. 2015). This was the first GWAS to reveal GC-related genetic features by focusing on the highest-371 risk H. pylori population of GC and to utilize the largest dataset of H. pylori strains isolated 372 from GC and DU patients reported to date. Generally, it becomes more difficult to isolate H. 373 pylori as the cancer stage progresses. The dataset is peerless in that is covers more than 100 374 strains from GC patients, as well as those from DU patients, which enabled the 375 unprecedented GWAS. However, we should keep in mind that the sample size is still smaller 376 than that of GWASs in other bacterial species that utilized thousands of genome sequences 377  pylori (S6B Figure), a homolog of DsbG of E. coli, is secreted and affects the stomach 408 epithelium (Kim, et al. 2002) and enables colonization (Kaakoush, et al. 2007). It can interact 409 with and refold reduced HcpE (HP0235) (Lester, et al. 2015). DsbG/K acts on HopQ and 410 helps HopQ-CEACAM interaction for the delivery of CagA in which another discriminatory 411 SNP was found. Therefore, the loci identified by our GWAS are not only multifaceted, but 412 can interact with and affect each other. 413 The discriminatory SNP with the lowest P-value was located upstream of kch (trkA) (S7 414 These genes mostly belonged to the cag pathogenicity island (PAI) and encoded outer 432 membrane proteins, such as babA. In our GWAS, focusing on the hspEAsia population, we 433 found none of these previously reported GC-associated loci. This discrepancy is likely due to 434 two major differences. First, the preceding study compared GC and gastritis, while we 435  The set of discriminatory SNPs identified in this study will be potentially applicable to 441 personalized risk stratification in clinical settings for early-stage discrimination between GC 442 and DU and for the selection of appropriate treatments. A recent study developed a high-443 throughput multiple allele detection assay (Zhang, et al. 2016). Incorporation of the 444 discriminatory SNPs into such a technique will assist clinicians in diagnosis and clinical 445 decision making. 446 In conclusion, our study revealed multifaceted genetic features of H. pylori associated 447 with the pathogenesis of GC as compared to DU, and demonstrated the effectiveness of 448 GWAS followed by regression-based prediction in distinguishing these H. pylori-related 449 diseases, although the individual effect of each discriminatory SNP was not significant 450 despite using the largest-to-date, but still limited in sample size dataset. Our analysis 451 provided insights into the pathogenesis of GC and provided a basis for identifying a set of 452 biomarkers for application in clinical settings. remove adapter sequences and low-quality bases from the raw short-read data (Bolger, et al. 468 2014). Trimmed reads were de-novo assembled to produce contigs using the SPAdes 469 (v3.12.0) genome assembler with the "-careful" option to reduce mismatches in the assembly 470 RS_HGAP_Assembly.2. After the removal of overlapping ends, the chromosomal contig was 490 reshaped to start from the ori-sequence. Thereafter, it was re-sequenced with 491 RS_Resequencing.1 to create consensus sequences.  Table 2). 502 503

Population assignment of each strain 504
We inferred the population structure of 614 global strains in total, using chromosome 505 painting and fineSTRUCTURE, as previously described (Lawson, et al. 2012). Briefly, a 506 contig from each genome was initially mapped to the genome of strain 26695 as a reference, 507 using Snippy v4.0.7 (https://github.com/tseemann/snippy). The Snippy-core function was 508 used to create genome-wide haplotype data for all strains. Subsequently, ChromoPainter 509 (v0.04) inferred chunks of DNA donated from a donor to a recipient for each recipient 510 haplotype, and summarized the results into a co-ancestry matrix. Using this co-ancestry 511 matrix, fineSTRUCTURE (v0.02) then clustered individuals by a Bayesian Markov chain 512 Monte Carlo (MCMC) approach with 100,000 iterations for both the burn-in and the MCMC 513 chain after the burn-in. 514

GWAS 516
All isolates assigned to hspEAsia based on the fineSTRUCTURE results and for which 517 clinical information of interest (GC or DU) was available were used for GWAS. First, a 518 maximum-likelihood phylogenetic tree based on core-genome SNPs was reconstructed using 519 PhyML (Guindon, et al. 2010), and the distribution of GC and DU in the tree was visualized 520 using Phandango (Hadfield, et al. 2017). The tree is shown as mid-point-rooted. Core-521 genome SNPs were extracted based on mapping of each genome against that of the East 522 Asian-type (hspEASia) H. pylori strain F57, using Snippy v4.0.7. We used strain F57 as a 523 reference because it was isolated from a GC patient in Japan and its genome sequence has 524 been determined by whole-genome shotgun sequencing (Kawai, et al. 2011). 525 Next, we conducted a pairwise genome alignment between each genome and strain F57 526 using progressiveMauve (Darling, et al. 2010), which enables the construction of positional 527 homology alignments even for genomes with variable gene content and rearrangement. 528 Subsequently, we combined all alignments into a multiple genome alignment in which each 529 position corresponded to that of the strain F57 reference genome. Next, we extracted SNPs 530 with ≤ 10% missing frequency and >5% minor allele frequency. We conducted a SNP GWAS 531 based on a previous study (Berthenet, et al. 2018), in which a linear mixed regression model 532 with the bugwas package (Earle, et al. 2016b) was used to control for population structure 533 based on an n × n relatedness matrix calculated from bi-allelic SNPs. We also conducted a SNP 534 GWAS in which the algorithm-factored spectrally transformed linear mixed model (FaST-535 LMM) implemented in pyseer (Lees, et al. 2018) was used to control for population structure 536 from the same set of bi-allelic SNPs. A Q-Q plot was created using the R statistical program to 537 assess the number and magnitude of observed associations between SNPs and disease (GC and 538 DU) as compared to the association statistics expected under the null hypothesis of no 539

association. 540
We also conducted a GWAS focusing on the presence or absence of specific genes rather 541 than SNPs, based on pan-genome analysis using the Roary pipeline (Page, et al. 2015), 542 although we did not conduct further analysis based on Q-Q plot assessment (see Results). A 543 gene presence or absence matrix was used as an input of the linear mixed regression model 544 implemented in the bugwas package. 545 546

Discrimination between GC and DU using a set of SNPs identified by GWAS 547
Top hit SNPs deviating from the null hypothesis in the Q-Q plot and positively associated 548 with GC were used to calculate a simple discriminatory score for each strain (Berthenet, et  We next used a prediction approach as follows. We used a univariate logistic regression 556 model to predict the probability of a strain i (pi) being isolated from a GC patient from the 557 discriminatory score: 558 We then conducted a 2-fold cross-validation in which the logistic regression model was fit to 560 a training dataset to estimate the regression coefficients, and the probability for each strain in 561 a test dataset (ˆi p ) was predicted from xi. ROC curves were drawn from the true host disease 562 status (GC or DU) and the predicted probability (ˆi p ) of each strain to calculate the AUC, 563 determine the optimal cutoff value for the discrimination of GC and DU, and calculate the 564 sensitivity and specificity of the discrimination, using the R package pROC (Robin, et