Drug resistance prediction for Mycobacterium tuberculosis with reference graphs

The dominant paradigm for analysing genetic variation relies on a central idea: all genomes in a species can be described as minor differences from a single reference genome. However, this approach can be problematic or inadequate for bacteria, where there can be significant sequence divergence within a species. Reference graphs are an emerging solution to the reference bias issues implicit in the “single-reference” model. Such a graph represents variation at multiple scales within a population – e.g., nucleotide- and locus-level. The genetic causes of drug resistance in bacteria have proven comparatively easy to decode compared with studies of human diseases. For example, it is possible to predict resistance to numerous anti-tuberculosis drugs by simply testing for the presence of a list of single nucleotide polymorphisms and insertion/deletions, commonly referred to as a catalogue. We developed DrPRG (Drug resistance Prediction with Reference Graphs) using the bacterial reference graph method Pandora. First, we outline the construction of a Mycobacterium tuberculosis drug resistance reference graph, a process that can be replicated for other species. The graph is built from a global dataset of isolates with varying drug susceptibility profiles, thus capturing common and rare resistance- and susceptible-associated haplotypes. We benchmark DrPRG against the existing graph-based tool Mykrobe and the haplotype-based approach of TBProfiler using 44,709 and 138 publicly available Illumina and Nanopore samples with associated phenotypes. We find DrPRG has significantly improved sensitivity and specificity for some drugs compared to these tools, with no significant decreases. It uses significantly less computational memory than both tools, and provides significantly faster runtimes, except when runtime is compared to Mykrobe on Nanopore data. We discover and discuss novel insights into resistance-conferring variation for M. tuberculosis - including deletion of genes katG and pncA – and suggest mutations that may warrant reclassification as associated with resistance. 3. Impact statement Mycobacterium tuberculosis is the bacterium responsible for tuberculosis (TB). TB is one of the leading causes of death worldwide; before the coronavirus pandemic it was the leading cause of death from a single pathogen. Drug-resistant TB incidence has recently increased, making the detection of resistance even more vital. In this study, we develop a new software tool to predict drug resistance from whole-genome sequence data of the pathogen using new reference graph models to represent a reference genome. We evaluate it on M. tuberculosis against existing tools for resistance prediction and show improved performance. Using our method, we discover new resistance-associated variations and discuss reclassification of a selection of existing mutations. As such, this work contributes to TB drug resistance diagnostic efforts. In addition, the method could be applied to any bacterial species, so is of interest to anyone working on antimicrobial resistance. 4. Data summary The authors confirm all supporting data, code and protocols have been provided within the article or through supplementary data files. The software method presented in this work, DrPRG, is freely available from GitHub under an MIT license at https://github.com/mbhall88/drprg. We used commit 9492f25 for all results via a Singularity[1] container from the URI docker://quay.io/mbhall88/drprg:9492f25. All code used to generate results for this study are available on GitHub at https://github.com/mbhall88/drprg-paper. All data used in this work are freely available from the SRA/ENA/DRA and a copy of the datasheet with all associated phenotype information can be downloaded from the archived repository at https://doi.org/10.5281/zenodo.7819984 or found in the previously mentioned GitHub repository. The Mycobacterium tuberculosis index used in this work is available to download through DrPRG via the command drprg index --download mtb@20230308 or from GitHub at https://github.com/mbhall88/drprg-index.


19
The dominant paradigm for analysing genetic variation relies on a central idea: all genomes 20 in a species can be described as minor differences from a single reference genome. However, 21 this approach can be problematic or inadequate for bacteria, where there can be significant 22 sequence divergence within a species. 23 Reference graphs are an emerging solution to the reference bias issues implicit in the "single-24 reference" model. Such a graph represents variation at multiple scales within a population -25 e.g., nucleotide-and locus-level. 26 The genetic causes of drug resistance in bacteria have proven comparatively easy to decode 27 compared with studies of human diseases. For example, it is possible to predict resistance to 28 numerous anti-tuberculosis drugs by simply testing for the presence of a list of single 29 nucleotide polymorphisms and insertion/deletions, commonly referred to as a catalogue. 30 We developed DrPRG (Drug resistance Prediction with Reference Graphs) using the bacterial 31 reference graph method Pandora. First, we outline the construction of a Mycobacterium 32 tuberculosis drug resistance reference graph, a process that can be replicated for other 33 species. The graph is built from a global dataset of isolates with varying drug susceptibility 34 profiles, thus capturing common and rare resistance-and susceptible-associated haplotypes. 35 We benchmark DrPRG against the existing graph-based tool Mykrobe and the haplotype-36 based approach of TBProfiler using 44,709 and 138 publicly available Illumina and 37 Nanopore samples with associated phenotypes. We find DrPRG has significantly improved 38 sensitivity and specificity for some drugs compared to these tools, with no significant 39 decreases. It uses significantly less computational memory than both tools, and provides 40 significantly faster runtimes, except when runtime is compared to Mykrobe on Nanopore 41 data. 42

74
Human industrialisation of antibiotic production and use over the last 100 years has led to a 75 global rise in prevalence of antibiotic resistant bacterial strains. The phenomenon 76 was even observed within patients in the first clinical trial of streptomycin as a drug for 77 tuberculosis (TB) in 1948 [2], and indeed as every new drug class has been introduced, so has 78 resistance followed. Resistance mechanisms are varied, and can be caused by point mutations 79 at key loci (e.g., binding sites of drugs [3,4]), frame-shifts rendering a gene non-functional Phenotypic and genotypic methods for detecting reduced susceptibility to drugs play 84 complementary roles in clinical microbiology. Carefully defined phenotypic assays are used 85 to give (semi)quantitative or binary measures of drug susceptibility; these have the benefit of 86 being experimental, quantitative measurements, and are able to detect resistance caused by 87 hitherto unknown mechanisms. Prediction of drug resistance from genomic data has different 88 advantages. Detection of a single nucleotide polymorphism (SNP) is arguably more 89 consistent than a phenotypic assay, as it is not affected by whether the resistance it causes is 90 near some threshold defining a resistant/susceptible boundary. Additionally, combining 91 sequence datasets from different labs is more reliable than combining different phenotypic 92 datasets, and using sequence data allows one to detect informative genetic changes (e.g., a 93 tandem expansion of a single gene to form an array, thus increasing dosage). More subtly, 94 defining the cut-off to separate resistant from susceptible is only simple when the minimum 95 inhibitory concentration distribution is a simple bimodal distribution; in reality it is 96 sometimes a convolution of multiple distributions caused by different mutations, and genetic 97 data is sometimes needed to deconvolve the data and choose a threshold [8,9]. 98 99 The key requirement for a genomic predictor is to have an encodable understanding of the 100 genotype-to-phenotype map. Research has focussed on clinically important pathogens, 101 primarily is conservative, and for the user to update the catalogue or rules with minimal effort. Finally, 140 to provide resistance predictions, it takes a prebuilt index (an MTB one is currently provided) 141 and sequencing reads (FASTQ).

143
We describe the DrPRG method, and to evaluate it, gather the largest MTB dataset of 144 sequencing data with associated phenotype information and reveal novel insights into 145 resistance-determining mutations for this species. 146

147
DrPRG is a command-line software tool implemented in the Rust programming language. 148 There are two main subcommands: build for building a reference graph and associated 149 index files, and predict for producing genotypic resistance predictions from sequencing 150 reads and an index (from build) . 151

Constructing a resistance-specific reference graph and index 152
The The final reference graph is then indexed with pandora[11]. 174

Genotypic resistance prediction 175
Genotypic resistance prediction of a sample is performed by the predict subcommand of 176 DrPRG. It takes an index produced by the build command (see Constructing a resistance-177 specific reference graph and index) and sequencing reads -Illumina or Nanopore are 178 accepted. To generate predictions, DrPRG discovers novel variants (pandora), adds these to 179 the reference graph (make_prg and MAFFT), and then genotypes the sample with respect 180 to this updated graph (pandora). The genotyped VCF is filtered such that we ignore any 181 variant with less than 3 reads supporting it and require a minimum of 1% read depth on each 182 strand. Next, each variant is compared to the catalogue. If an alternate allele has been called 183 that corresponds with a catalogue variant, resistance ('R') is noted for the drug(s) associated 184 with that mutation. If a variant in the VCF matches a catalogue mutation, but the genotype is 185 null ('.'), we mark that mutation, and its associated drug(s), as failed ('F'). Where an alternate 186 allele call does not match a mutation in the catalogue, we produce an unknown ('U') 187 prediction for the drug(s) that have a known resistance-conferring mutation in the relevant 188 gene.

189
DrPRG also has the capacity to detect minor alleles and call minor resistance ('r') or minor 190 unknown ('u') in such cases. Minor alleles are called when a variant (that has passed the 191 above filtering) is genotyped as being the susceptible (reference) allele, but there is also read 192 depth on the resistant (alternate) allele above a given minor allele frequency parameter (--193 maf; default is 0.1 for Illumina data). Minor allele calling is turned off by default for 194 Nanopore data as we found it led to a drastic increase in the number of false positive calls 195 (this is also the case for Mykrobe and TBProfiler).

196
When building the index for DrPRG and making predictions, we also accept a file of "expert 197 rules" for calling variants of a certain class. A rule is associated with a gene, an optional 198 position range, a variant type, and the drug(s) that rule confers resistance to. Currently 199 supported variant types are missense, nonsense, frameshift, and gene absence.

200
The output of running predict is a VCF file of all variants in the graph and a JSON file of 201 resistance predictions for each drug in the index, along with the mutation(s) supporting that 202 prediction and a unique identifier to find that variant in the VCF file (see Supplementary  203 Section S1 for an example). The reference graph gene presence/absence (as determined by 204 pandora) is also listed in the JSON file. 205

Benchmark 206
We compare the performance of DrPRG against Mykrobe  We used Mykrobe and TBProfiler with default parameters, except for a parameter in each 221 indicating the sequencing technology of the data as Illumina or Nanopore and the TBProfiler 222 option to not trim data (as we do this in Quality control). 223 We compare the prediction performance of each program using sensitivity and specificity. To 224 calculate these metrics, we consider a true positive (TP) and true negative (TN) as a case 225 where a program calls resistance and susceptible, respectively, and the phenotype agrees; a 226 false positive (FP) as a resistant call by a program but a susceptible phenotype, with false 227 negatives (FN) being the inverse of FP. We only present results for drugs in the catalogue and 228 where at least 10 samples had phenotypic data available. 229 To benchmark the runtime and memory usage of each tool, we used the Snakemake 230 benchmark feature within our analysis pipeline [36]. 231

Datasets 232
We gathered various MTB datasets where whole-genome sequencing data (Nanopore or 233 Illumina) were available from public repositories (ENA/SRA/DRA) and associated 234 phenotypes were accessible for at least one drug present in our catalogue[16,27,37-49]. 235 All data was downloaded with fastq-dl (v1. where the best mapping was to H37Rv. 247 After quality control, we removed any sample with average read depth less than 15, or where 248 more than 5% of the reads mapped to contaminants. 249 Lineage information was extracted from the TBProfiler results (see Benchmark). 250

Statistical Analysis 251
We used a Wilcoxon rank-sum paired data test from the Python library SciPy [53] to test for 252 significant differences in runtime and memory usage between the three prediction tools.

253
The sensitivity and specificity confidence intervals were calculated with a Wilson's score 254 interval with a coverage probability of 95%. 255

256
To benchmark DrPRG, Mykrobe, and TBProfiler, we gathered an Illumina dataset of 45,702 257 MTB samples with a phenotype for at least one drug. After quality control (see Quality  258 control), this number reduced to 44,709. In addition, we gathered 142 Nanopore samples, of 259 which 138 passed quality control. In Figure 1 we show all available drug phenotypes for 260 those interested in the dataset, yet our catalogue does not offer predictions for all drugs listed 261 (see Benchmark). Lineage counts for all samples that passed quality control and have a 262 single, major lineage call can be found in Table 1.

Sensitivity and specificity performance 271
We present the sensitivity and specificity results for Illumina data in Figure 2 and Suppl.

272
Table S1 and the Nanopore data in Figure 3 and Suppl. Table S2. 273 When comparing DrPRG's performance to that of Mykrobe and TBProfiler, we look for 274 instances where the confidence intervals do not overlap; indicating a significant difference.

275
With Illumina data (Figure 2 and Suppl.   Nanopore data. For specificity, the tools are all very similar and either exceed or fall below 314 the threshold together (see Figure 2). The target of >98% is met by all tools on Illumina data 315 only for ofloxacin, amikacin, linezolid, and delamanid. Mykrobe also exceeds the target for 316 capreomycin. As such, amikacin is the only drug where both sensitivity and specificity 317 performance exceed the minimal requirement of the WHO target product profiles. Only 318 capreomycin and kanamycin specificity targets are exceeded (by all tools) with Nanopore 319 data. 320 However, for Illumina data, we did find that likely sample-swaps or phenotype instability [ (n=441) and isoniazid (n=241) resistance; rrs a1401g, which is associated with resistance to 332 capreomycin (n=241), amikacin (n=70), and kanamycin (n=48). In addition there were 333 common false positives from gyrA mutations A90V and D94G, which are associated with 334 resistance to the fluoroquinolones levofloxacin (n=108 and n=70, respectively), moxifloxacin 335 (n=419 and n=349) and ofloxacin (n=19 and n=17), and are known to cause heteroresistance 336 and minimum inhibitory concentrations (MIC) close to the critical concentration 337 threshold[57-59]. 338

Evaluation of potential additions to the WHO catalogue 339
False negatives are much harder to investigate as it is not known which mutation(s) were 340 missed as they are presumably not in the catalogue if all tools failed to make a call. However, 341 looking through those FNs where DrPRG makes an "unknown" resistance call, we note some 342 potential mutations that may need reclassification or inclusion in the WHO catalogue. For 343 delamanid FNs, we found five different nonsense mutations in the ddn gene in seven samples 344 -W20* (n=2), W27* (n=1), Q58* (n=1), W88* (n=2), and W139* (n=1) -none of which 345 occurred in susceptible samples. We also found 13 pyrazinamide FN cases with a nonstop 346 (stop-loss) mutation in pncA -this mutation type was also seen in two susceptible samples. 347 Another pncA mutation, T100P, was also observed in 10 pyrazinamide FN samples and no 348 susceptible samples. T100P only appears once in the WHO catalogue data ("solo" in a 349 resistant sample). As such, it was given a grading of uncertain significance. In seven of these cases, it co-occurs with ahpC mutations t-75g (n=2) and t-76a (n=5). This 365 mutation occurs in only three resistant isolates in the WHO catalogue dataset, giving it an 366 uncertain significance, but we add a further 11 cases. This mutation has been found to cause a 367 high isoniazid MIC and be associated with resistance[62,63]. 368

Detection of large deletions 369
There are expert rules in the WHO catalogue which treat gene loss-of-function ( the associated drug (n=100), the results are nevertheless striking (Figure 4). Given the low 378 false-positive rate of pandora for gene absence detection [11], these no-phenotype samples 379 provide insight into how often gene deletions are occurring in clinical samples.

380
Of the 34 isolates where katG was identified as being absent, and an isoniazid phenotype was 381 available, all 34 were phenotypically resistant. DrPRG detected all 34 (100% sensitivity) and 382 TBProfiler identified 26 (76.5% sensitivity). Deletions of pncA were detected in 56 isolates, 383 of which 49 were phenotypically resistant. DrPRG detected 47 (95.9% sensitivity) and 384 TBProfiler detected 46 (93.9% sensitivity). Lastly, ethA was found to be missing in 16 385 samples with an ethionamide phenotype, of which 10 were phenotypically resistant. Both 386 DrPRG and TBProfiler correctly predicted all 10 (100% sensitivity). No gid deletions were 387 discovered. We note that the TP calls made by Mykrobe were due to it detecting large 388 deletions that are present in the catalogue, which is understandable given the whole gene is 389 deleted.

390
We conclude that DrPRG is slightly more sensitive at detecting large deletions than 391 TBProfiler (and Mykrobe) for katG, and equivalent for pncA and ethA. However we note that 392 the WHO expert rule which predicts resistance to isolates missing specific genes appears 393 more accurate for katG (100% of isolates missing the gene are resistant) than for pncA (87%

Runtime and memory usage benchmark 402
The runtime and peak memory usage of each program was recorded for each sample and is 403 presented in Figure 5. DrPRG (median 161 seconds) was significantly faster than both 404 TBProfiler (307 seconds; p≤0.0001) and Mykrobe (230 seconds; p≤0.0001) on Illumina data.

417
In this work, we have presented a novel method for making drug resistance predictions with 418 reference graphs. The method, DrPRG, requires only a reference genome and annotation, a 419 catalogue of resistance-conferring mutations, a VCF of population variation from which to 420 build a reference graph, and (optionally) a set of rules for types of variants in specific genes 421 which cause resistance. We apply DrPRG to the pathogen M. tuberculosis, for which there is 422 a great deal of information on the genotype/phenotype relationship, and a great need to 423 provide good tools which implement and augment current and forthcoming versions of the 424 WHO catalogue. We illustrate the performance of DrPRG against two existing methods for 425 drug resistance prediction -Mykrobe and TBProfiler.

427
We benchmarked the methods on a high-quality Illumina sequencing dataset with associated 428 phenotype profiles for 44,709 MTB genomes; the largest known dataset to-date [16]. All tools 429 used the same catalogue and rules, and for most drugs, there was no significant difference 430 between the tools. However, DrPRG did have a significantly higher specificity than 431 TBProfiler for rifampicin predictions, and sensitivity for ethionamide predictions. DrPRG's 432 sensitivity was also significantly greater than Mykrobe's for rifampicin, streptomycin, 433 amikacin, capreomycin, kanamycin, and ethionamide. Evaluating detection of gene loss, we 434 found DrPRG was more sensitive to katG deletions than TBProfiler. 435 We also benchmarked using 138 Nanopore-sequenced MTB samples with phenotype 436 information, but found no significant difference between the tools. This Nanopore dataset 437 was quite small and therefore the confidence intervals were large for all drugs. Increased 438 Nanopore sequencing over time will provide better resolution of the overall sensitivity and 439 specificity values and improve the methodological nuances of calling variants from this 440 emerging, and continually changing, sequencing technology. 441 DrPRG also used significantly less memory than Mykrobe and TBProfiler on both Nanopore 442 and Illumina data. In addition, the runtime of DrPRG was significant faster than both tools on 443 Illumina data and faster than TBProfiler on Nanopore data. While the absolute values for 444 memory and runtime for all tools mean they could all easily run on common computers found 445 in the types of institutions likely to run them, the differences for the Nanopore data warrant 446 noting. As Nanopore data can be generated "in the field", computational resource usage is 447 critical number of resistance-conferring mutations, or they're localised to a specific region (e.g. the 487 rifampicin-resistance determining region in rpoB), resistance-conferring mutations are 488 numerous -with most being rare -and distributed throughout pncA [16,76,77]. Adding all of 489 these mutations will, and does, lead to decreased performance of the reference graph[33], and 490 so improving minor allele calling for pyrazinamide remains a challenge we need to revisit in 491 the future. 492 One final limitation is the small number of Nanopore-sequenced MTB isolates with 493 phenotypic information. In order to get a clearer picture of the sensitivities and specificities 494 this sequencing technology can provide, we need much larger and more diverse data.

496
In conclusion, DrPRG is a fast, memory frugal software program that can be applied to any 497 bacterial species. We showed that on MTB, it performs as well as, or better than two other 498 commonly used tools for resistance prediction. We also collected and curated the largest 499 dataset of MTB Illumina-sequenced genomes with phenotype information and hope this will 500 benefit future work to improved genotypic drug susceptibility testing for this species. While 501 we applied DrPRG to MTB in this study, it is a framework that is agnostic to the species.

502
MTB is likely one of the bacterial species with the least to gain from reference graphs given 503 its relatively conserved (closed) pan-genome compared to other common species [78]. As 504 such, we expect the benefits and performance of DrPRG to improve as the openness of the 505 species' pan-genome increases[11]; especially given its good performance on a reasonably 506 closed pan-genome.

Acknowledgements 520
We thank Kerri M. Malone for sharing her MTB knowledge through many discussions and 521 critiquing the final manuscript. We also thank Martin Hunt