Genome-wide association studies of global Mycobacterium tuberculosis resistance to thirteen antimicrobials in 10,228 genomes

The emergence of drug resistant tuberculosis is a major global public health concern that threatens the ability to control the disease. Whole genome sequencing as a tool to rapidly diagnose resistant infections can transform patient treatment and clinical practice. While resistance mechanisms are well understood for some drugs, there are likely many mechanisms yet to be uncovered, particularly for new and repurposed drugs. We sequenced 10,228 Mycobacterium tuberculosis (MTB) isolates worldwide and determined the minimum inhibitory concentration (MIC) on a grid of twofold concentration dilutions for 13 antimicrobials using quantitative microtiter plate assays. We performed oligopeptide- and oligonucleotide-based genome-wide association studies using linear mixed models to discover resistance-conferring mechanisms not currently catalogued. Use of MIC over binary resistance phenotypes increased heritability for the new and repurposed drugs by 26-37%, increasing our ability to detect novel associations. For all drugs, we discovered uncatalogued variants associated with MIC, including in the Rv1218c promoter binding site of the transcriptional repressor Rv1219c (isoniazid), upstream of the vapBC20 operon that cleaves 23S rRNA (linezolid) and in the region encoding an α-helix lining the active site of Cyp142 (clofazimine, all p<10-7.7). We observed that artefactual signals of cross resistance could be unravelled based on the relative effect size on MIC. Our study demonstrates the ability of very large-scale studies to substantially improve our knowledge of genetic variants associated with antimicrobial resistance in M. tuberculosis.

treatment is 85% successful overall, that drops to 57% for rifampicin-resistant and MDR-TB 32 [1]; underdiagnosis and treatment failures then amplify the problem by encouraging 33 onward transmission of MDR-TB [2]. New treatment regimens for MDR-TB are therefore an 34 important focus, introducing new and repurposed drugs such as bedaquiline, clofazimine, 35 delamanid and linezolid [3,4]; however resistance is already emerging [5,6,7]. 36 37 Understanding mechanisms of resistance in TB is important for developing rapid 38 susceptibility tests that improve individual patient treatment, recommending drug regimens 39 that reduce the development of MDR and developing new and improved drugs that expand 40 treatment options [8,9]. Genomics can accelerate drug susceptibility testing, replacing 41 slower culture-based methods by predicting resistance from the sequenced genome rather 42 than directly phenotyping the bacteria [10]. Genome sequencing-based susceptibility testing 43 for first-line drugs has achieved sensitivities of 91.3-97.5% and specificities of 93.6-99.0% 44 [11], surpassing the thresholds for clinical accreditation, motivating its adoption by multiple 45 public health authorities [12]. In low-resource settings, molecular tests such as Cepheid 46 GeneXpertⓇ and other line probe assays offer rapid and more economical susceptibility 47

88
CRyPTIC collected isolates from 27 countries worldwide, oversampling for drug resistance 89 [31]. 10    across the drugs, we tested for associations at 10,510,261 variably present oligopeptides 137 and 5,530,210 oligonucleotides; these captured substitutions, insertions and deletions. The drugs differed in the number of genes or intergenic regions that were significant, the drugs 139 with fewest significant genes being isoniazid (12), levofloxacin (13) and moxifloxacin (6). We 140 defined the significance of a gene or intergenic region by the most significant oligopeptide 141 within it, and assessed all significant variants above a 0.1% minor allele frequency (MAF) 142 threshold for the top 20 significant genes. The top 20 genes for each drug are detailed in 143  [18,19,20,21,22,23]. 165 The most significant catalogued variants for each drug were (lowercase for nucleotides, 166 uppercase for amino acids): rrs a1401g (amikacin, kanamycin), embB M306V (ethambutol), 167 fabG1 c−15t (ethionamide), katG S315T (isoniazid), gyrA D94G (levofloxacin, moxifloxacin), 168 and rpoB S450L (rifampicin) [17,11]. For the remaining drugs with no previously catalogued 169 resistance determinants, the genes identified by the top signals were: Rv0678 (bedaquiline, 170 clofazimine), ddn (delamanid), fabG1 (ethionamide), katG (isoniazid), rplC (linezolid) and 171 rpoB (rifabutin). The top variants identified for each drug were all significant at p<1.04x10 -15 . 172 173 For many drugs, the direction of effect of the most significant oligopeptide variants was to 174 decrease MIC (Supplementary Figure 5), implying that low-MIC oligopeptides and 175 oligonucleotides are more likely to be genetically identical across strains than high-MIC 176 haplotypes. This would be consistent with the independent evolution of increased MIC from 177 a shared, low-MIC TB ancestor. Uncatalogued variants significantly associated with MIC are 178 important because they could improve resistance prediction and shed light on underlying 179 resistance mechanisms; they may be novel or previously implicated in resistance but not to 180 a standard of evidence sufficient to be catalogued. We discuss the choice of catalogues in 181 the Discussion [17,11].  that drug by [17,11]. Gene names separated by colons indicate intergenic regions. Genes or intergenic regions capturing repeat regions are highlighted with the superscript R . Alphabetic characters following gene names 188 are used to cross-reference with the corresponding Manhattan plots in Figure 3.

190
We next looked at uncatalogued variants in known resistance-conferring genes. We 191 identified uncatalogued variants in gyrB associated with levofloxacin and moxifloxacin MIC 192 (minimum p-value levofloxacin: p<10 -15.6 , moxifloxacin: p<10 -11.6 . The primary mechanisms 193 of resistance to the fluoroquinolones levofloxacin and moxifloxacin are mutations in gyrA or 194 gyrB, the subunits of DNA gyrase. The gyrB Manhattan plots for levofloxacin and 195 moxifloxacin both contained two adjacent peaks within the gene, but for each drug just one 196 of the two peaks was significant, and these differed between the drugs (Figure 4). 197 Interpretation of oligopeptides and oligonucleotides requires an understanding of the 198 variants that they capture, which we visualised by aligning them to H37Rv and interpreting 199 the variable sites (e.g. Figure  were associated with lower MIC (e.g. NTEVQAIITAL, -log10p = 11.63, b = -1.33, present in 212 6332/6388 genomes). Amino acids 461 and 501 are at the interface between gyrB and the 213 bound fluroquinolone [34]. gyrB is included in the reference catalogues for predicting 214 levofloxacin but not moxifloxacin resistance, therefore our results support inclusion in 215 future moxifloxacin catalogues [17,11].

219
Oligopeptide Manhattan plots are shown for A levofloxacin B moxifloxacin. Oligopeptides are coloured by the 220 reading frame that they align to, black for in frame and grey for out of frame in gyrB. Oligopeptides aligned to 221 the region by nucmer but not realigned by BLAST are shown in grey on the right hand side of the plots. The not catalogued by [17,11] for each of the drugs. A well-recognized challenge in GWAS for 231 antimicrobial resistance is the presence of artefactual cross resistance. To mitigate this risk, 232 we preferentially highlight variants significantly associated with a single drug. However, 233 many catalogued resistance variants demonstrated artefactual cross resistance. For 234 example, variants in the rifampicin resistance determining region were in the top 20 235 significant associations for all drugs except for delamanid (Table 1) Rv1217c-Rv1218c multidrug efflux transport system [38]. It binds two motifs, a high-affinity 262 intergenic sequence in the operon's promoter, and a low-affinity intergenic sequence 263 immediately upstream of Rv1218c [38]. overexpression of Rv1218c has been observed to correlate with higher isoniazid MIC in vitro 271 [39]. 272

Second-line drugs 274
Amikacin and kanamycin. Oligopeptides in PPE42 (Rv2608) were significantly associated 275 with aminoglycoside MIC, for both amikacin and kanamycin (minimum p-value p<10 -12.8 , 276 Supplementary Figure 9). PPE42 is an outer membrane-associated PPE-motif family protein 277 and potential B cell antigen. It elicits a high humoral and low T cell response [40] and is one 278 of four antigens in the vaccine candidate ID93 [41]. The C-terminal major polymorphic 279 tandem repeats (MPTRs) contain a region of high antigenicity [40]. of WhiB7 [46]. WhiB7 is induced by antibiotic treatment and other stress conditions and 298 activates its own expression along with other drug resistance genes, for example tap and 299 erm [45]. Variants in and upstream of another whiB-like transcriptional regulator, whiB6, 300 were previously found to be associated with resistance to ethionamide [19,47], 301 capreomycin, amikacin, kanamycin and ethambutol [22,23]. WhiB7 has been implicated in 302 cross-resistance to multiple drugs, including macrolides, tetracyclines and aminoglycosides 303 [45,46], however activation of WhiB7 is not induced by all antibiotics, for instance isoniazid 304 [43]. Interestingly, oligopeptides and oligonucleotides in or upstream of whiB7 were not 305 found to be significantly associated with any of the other 12 antimicrobials. This could 306 indicate yet another mechanism by which whiB7 is involved in resistance to anti-307 tuberculosis drugs. 308 309 Levofloxacin. Oligopeptides in tlyA (Rv1694) were significantly associated with MIC of the 310 fluoroquinolone levofloxacin (minimum p-value p<10 -7.8 , Supplementary Figure 11). tlyA 311 encodes a methyltransferase which methylates ribosomal RNA. Variants in tlyA, including 312 loss-of-function mutations, confer resistance to the aminoglycosides viomycin and 313 capreomycin [48] by knocking out its methyltransferase activity [49]. 314 An extremely low frequency oligopeptide was associated with increased MIC, and captured 315 a one-nucleotide adenosine insertion between positions 590 and 591 in codon 198 in a 316 conserved region [50]. In contrast, oligopeptides containing the reference alleles in this 317 region were associated with decreased MIC (e.g. GKGQVGPGGVV, -log10p = 7.83, b = -1.86, 318 present in 7281/7300 genomes). The resulting frameshift likely mimics the knockout effect 319 of deleting the 27 C-terminal residues of TlyA, which ablates methyltransferase activity [51].
While loss-of-function mutations conferring antimicrobial resistance were previously 321 reported to specifically increase aminoglycoside MIC, fluoroquinolones were not 322 investigated [52]. The signal in tlyA may therefore reveal genuine, previously unidentified 323 cross-resistance. compared to sensitive isolates [55], and Era (but not AmiA2) has been shown to be required 353 for optimal growth of H37Rv [56]. These variants may therefore enhance tolerance to 354 bedaquiline.  Supplementary Figure 14). Oligopeptides 359 associated with higher MIC captured the amino acid residue 176I (e.g. EDFQITIDAFA, -log10p 360 = 7.99, b = 1.14, present in 100/7297 genomes). The association signal falls within the F a-361 helix of CYP142, which lines the entrance to the active site with largely hydrophobic 362 residues, forming part of the substrate binding pocket [57,58]. Homology with CYP125 363 suggests that residue 176 captured by the GWAS is within 5 Å of the binding substrate [58]. 364 The potential for cytochrome P450 enzymes as targets for anti-tuberculosis drugs has been 365 highlighted [59]; CYP142 is inhibited by azole drugs [59] and has been found to form a tight 366 complex with nitric oxide (NO) [60]. The anti-mycobacterial activity of clofazimine has been 367 shown to produce reactive oxygen species [61], therefore the substitution identified by the 368 hydrophobic residues, so the mechanism for how this would disrupt binding is unknown. 370 371 Delamanid. Oligonucleotides in pknH (Rv1266c), which encodes a serine/threonine-protein 372 kinase, were significantly associated with delamanid MIC (minimum p-value p<10 -30.2 , 373 Supplementary Figure 15). Delamanid is a prodrug activated by deazaflavin-dependent 374 nitroreductase which inhibits cell wall synthesis. PknH phosphorylates the adjacent gene 375 product EmbR [62], enhancing its binding of the promoter regions of the embCAB operon 376 [63]. Mutations in embAB are responsible for ethambutol resistance [64]. The peak GWAS 377 signal localized to the C-terminal periplasmic domain of PknH [62]. Oligonucleotides below 378 our MAF threshold captured extremely low frequency triplet deletions of either ACG at 379 nucleotides 1645-7 or GAC at nucleotides 1644-6. In contrast, oligonucleotides containing 380 the reference alleles in this region were associated with decreased MIC (e.g. 381 CAAGACGGTCACCGTCACGAATAAGGCCAAG , -log10p = 30.21, b = -3.29, present in 382 7555/7564 genomes). These variants likely disrupt intramolecular disulphide binding linking 383 the two highly conserved alpha helices that form the V-shaped cleft of the C-terminal sensor 384 domain [65]. Since NO is released upon activation of DLM, and deletion of PknH alters 385 sensitivity to nitrosative and oxidative stresses [66], these rare variants may alter tolerance 386 to delamanid mediated by NO. 387 388 Linezolid. Oligonucleotides in vapB20 (Rv2550c) were significantly associated with linezolid 389 MIC (minimum p-value p<10 -8.6 , Supplementary Figure 16). VapB20 is an antitoxin 390 cotranscribed with its complementary toxin VapC20 [67]. The latter modifies 23S rRNA [68], 391 the target of linezolid which inhibits protein synthesis by competitively binding 23S rRNA. 392 The peak signal in vapB20 occurred just upstream of the promotor and VapB20 binding 393 sites, 21 nucleotides upstream of the -35 region [68]. Oligonucleotides below our MAF 394 threshold associated with increased MIC shared a cytosine 33  In this study we tested oligopeptides and oligonucleotides for association with quantitative 402 MIC measurements for 13 antimicrobials to identify novel resistance determinants. 403 Analysing MIC rather than binary resistance phenotypes enabled identification of variants 404 that cause subtle changes in MIC. This is important, on the one hand, because higher 405 rifampicin and isoniazid MIC in sensitive isolates are associated with increased risk of 406 relapse after treatment [69]. Conversely, low-level resistance among isolates resistant to RIF 407 and isoniazid mediated by particular mutations may sometimes be overcome by increasing 408 the drug dose, or replacing rifampicin with rifabutin, rather than changing to less desirable 409 drugs with worse side effects [70,71,72,73,74]. The investigation of MIC was particularly 410 effective at increasing heritability for the new and repurposed drugs. 411

412
The MICs were positively correlated between many drugs, particularly amongst first-line 413 drugs. Consequently, many of the 10,228 isolates we studied were MDR and XDR. In GWAS, 414 this generates artefactual cross resistance, in which variants that cause resistance to one 415 drug appear associated with other drugs to which they do not confer resistance. In practice, it is difficult to distinguish between associations that are causal versus artefactual without 417 experimental evidence. Nevertheless, we found frequent evidence of artefactual cross 418 resistance: several genes and intergenic regions featured among the top 20 strongest 419 signals of association to multiple drugs, including rpoB (12 drugs), embB (7), fabG1 (7), rrs 420 (6) We elected to classify significant variants as catalogued versus uncatalogued, rather than 444 known versus novel, for several reasons. The catalogues represent a concrete, pre-existing 445 knowledgebase collated by expert groups for use in a clinical context [17,11]. We chose 446 [17,11] as they are the most recent and up to date catalogues available for the drugs we 447 investigated. The inclusion criteria for variants to be considered catalogued are therefore 448 stringent; it follows that a class of variants exist that have been reported in the literature 449 but not assimilated into the catalogues [17,11]. The literature is vast and heterogenous, 450 with evidence originating from molecular, clinical and genome-wide association studies. 451 Inevitably, some uncatalogued variants in the literature will be false positives, while others 452 will be real but did not meet the standard of evidence or clinical relevance for cataloguing. 453 Evidence from CRyPTIC that supports uncatalogued variants in the latter group is of equal or 454 greater value than the discovery of completely novel variants, because it contributes to a 455 body of independent data supporting their involvement. For instance, gyrB did not appear 456 in the catalogues we used for moxifloxacin [17,11]. Yet our rediscovery of gyrB 501D 457 complements published reports associating the substitution with moxifloxacin resistance 458 [75,76,77], strongly enhancing the evidence in favour of inclusion in future catalogues. 459 Indeed, the recent WHO prediction catalogue, published after the completion of this study 460 and which draws on the CRyPTIC data analysed here includes the E501D resistance-461 associated variant [78]. Moreover, of the five new genes added to the forthcoming WHO 462 catalogue [78] but not featuring in the catalogues [17,11]  clofazimine, delamanid and linezolid. The improvement was most striking for delamanid, 472 whose heritability was not significantly different to zero for the binary resistance 473 phenotype. In contrast, the scope for improvement was marginal for the better-studied 474 drugs isoniazid and rifampicin, where MIC heritabilities of 94.6-94.9% were achieved. This 475 demonstrates the ability of additive genetic variation to explain almost all the phenotypic 476 variability in MIC for these drugs. Nevertheless, we were still able to find uncatalogued hits 477 for these drugs. The very large sample size also contributed to increased heritability 478 compared to previous pioneering studies. Compared to Farhat et al 2019 [22] who 479 estimated the heritability of MIC phenotypes in 1452 isolates, we observed increases in 480 heritability of 2.0% (kanamycin), 3.3% (amikacin), 14.0% (isoniazid), 10.8% (rifampicin), 481 11.2% (ethambutol) and 19.4% (moxifloxacin). Furthermore, many of the uncatalogued 482 signals we report here as significant detected rare variants at below 1% minor allele 483 frequency, underlining the ability of very large-scale studies to improve our understanding 484 of antimicrobial resistance not only quantitatively, but to tap otherwise unseen rare variants 485 that reveal new candidate resistance mechanisms. 486

Sampling frames 490
CRyPTIC collected isolates from 27 countries worldwide, oversampling for drug resistance, 491 as described in detail in [31]. Clinical isolates were subcultured for 14 days before 492 inoculation onto one of two CRyPTIC designed 96-well microtiter plates manufactured by 493 ThermoFisher. The first plate used (termed UKMYC5) contained doubling-dilution ranges for 494 14 different antibiotics, the second (UKMYC6) removed para-aminosalicylic acid due to poor 495 results on the plate [30] and changed the concentration of some drugs. Para-aminosalicylic 496 acid was therefore not included in the GWAS analyses. Phenotype measurements were 497 determined to be high quality, and included in the GWAS analyses, if three independent 498 methods (Vizion, AMyGDA and BashTheBug) agreed on the value [31]. Sequencing pipelines 499 differed slightly between the CRyPTIC sites, but all sequencing was performed using 500 Illumina, providing an input of matched pair FASTQ files containing the short reads. 501 502 15,211 isolates were included in the initial CRyPTIC dataset with both genomes and 503 phenotype measurements after passing genome quality control filters [31,79], however 504 some plates were later removed due to problems identified at some laboratories with 505 inoculating the plates [31]. Genomes were also excluded if they met any of the following 506 criteria, determined by removing samples at the outliers of the distributions: (i) no high 507 quality phenotypes for any drugs; (ii) total number of contigs > 3000; (iii) total bases in 508 contigs < 3.5x10 6 or > 5x10 6 ; (iv) number of unique oligonucleotides < 3.5x10 6 or > 5x10 6 ; (v) 509 sequencing read length not 150/151 bases long. This gave a GWAS dataset of 10,422 510 genomes used to create the variant presence/absence matrices. We used Mykrobe 511 [80,79,81] to identify Mycobacterium genomes not belonging to lineages 1-4 or 512 representing mixtures of lineages. This led to the exclusion of 193 genomes, which were 513 removed from GWAS by setting the phenotypes to NA. The number of genomes with a high 514 quality phenotype for at least one of the 13 drugs was therefore 10,228. Of these 533 were 515 lineage 1, 3581 lineage 2, 805 lineage 3, and 5309 lineage 4. Due to rigorous quality control 516 described above, only samples with high quality phenotypes were tested for each drug, 517 resulting in a range of 6,388-9,418 genomes used in each GWAS. 518 519 Phylogenetic inference 520 A pairwise distance matrix was constructed for the full CRyPTIC dataset based on variant 521 calls [79]. For visualisation of the dataset, a neighbour joining tree was built from the 522 distance matrix using the ape package in R and subset to the GWAS dataset. Negative 523 branch lengths were set to zero, and the length was added to the adjacent branch. The 524 branch lengths were square rooted and the tree annotated by lineages assigned by Mykrobe 525 [80]. 526 527 Oligonucleotide/oligopeptide counting 528 To capture SNP-based variation, indels, and combinations of SNPs and indels, we pursued 529 oligonucleotide and oligopeptide-based approaches, focusing primarily on oligopeptides. 530 Where helpful for clarifying results, we interpreted significant associations using 531 oligonucleotides. Sequence reads were assembled de novo using Velvet Optimiser [82] with 532 a starting lower hash value of half the read length, and a higher hash value of the read 533 length minus one; if these were even numbers they were lowered by one. If the total 534 sequence length of the reads in the FASTQ file was greater than 1x10 9 , then the reads were 535 randomly subsampled prior to assembly down to a sequence length of 1x10 9 which is 536 around 227x mean coverage. For the oligopeptide analysis, each assembly contig was 537 translated into the six possible reading frames in order to be agnostic to the correct reading 538 frame. 11 amino acid long oligopeptides were counted in a one amino acid sliding window 539 from these translated contigs. 31bp nucleotide oligonucleotides were also counted from the 540 assembled contigs using dsk [83]. We used the surrounding context of the contigs that the oligopeptides/oligonucleotides 549 were identified in to assist with their alignment. First, we aligned the contigs of each 550 genome to the H37Rv reference genome [84] using nucmer [85], keeping alignments above 551 90% identity, assigning a H37Rv position to each base in the contig. Variant/gene combinations were then kept if seen in five or more genomes. In some specific 560 regions where significant oligonucleotides or oligopeptides appeared to be capturing an 561 invariant region, a threshold of just one genome was used to visualise low frequency 562 variants in the region. This was used only for interpretation of the signal in the region, and 563 not for the main analyses. To improve alignment for the most significant genes and 564 intergenic regions, all oligonucleotides/oligopeptides in the gene/IR plus those that aligned 565 to a gene/IR within 1kb were re-aligned to the region using BLAST. Alignments were kept if 566 above 70% identity, recalculated along the whole length of the oligonucleotide/oligopeptide 567 assuming the whole oligonucleotide/oligopeptide aligned. Oligopeptides were aligned to all 568 six possible reading frames and only the correct reading frame was interpreted. An 569 oligonucleotide/oligopeptide was interpreted as unaligned if it did not align to any of the six 570 possible reading frames. A region was determined to be significant if it contained significant 571 oligopeptides above a minor allele frequency (MAF) of 0.1% that were assigned to the 572 region that also aligned using BLAST. If no significant oligopeptides aligned to the correct 573 reading frame of a protein, or if the significant region was intergenic, then oligonucleotides 574 were assessed. 575

576
Covariates 577 Isolates were sampled from 9 sites and minimum inhibitory concentrations (MIC) were 578 measured on two versions of the quantitative microtiter plate assays, UKMYC5 and UKMYC6 579 [31]. UKMYC6 contained adjusted concentrations for some drugs. Therefore in order to 580 account for possible batch effects, we controlled for site plus plate type in the LMM by 581 coding them as binary variables. These plus an intercept were included as covariates in the 582 GWAS analyses. 583

Testing for locus effects 585
We performed association testing using linear mixed model (LMM) analyses implemented in 586 the software GEMMA to control for population structure [32]. Significance was calculated 587 using likelihood ratio tests. We computed the relatedness matrix from the 588 presence/absence matrix using Java code which calculates the centred relatedness matrix. 589 GEMMA was run using no minor allele frequency cut-off to include all variants. When 590 assessing the most significant regions for each drug, we excluded oligopeptides below 0.1% 591 MAF. To understand the full signal at these regions, oligo-peptides and nucleotides were 592