Predicting MmpR-based bedaquiline resistance using sequence- and structure-based features

Accurate molecular detection of resistance to bedaquiline, a core drug for the treatment of drug resistant tuberculosis, remains challenging. In this study, we investigated whether three sequence- and nine structure-based features describing the impact of Rv0678 variants on the MmpR transcriptional repressor could predict the bedaquiline phenotype of isolates containing Rv0678 variants. Paired genotypic and phenotypic data was used to train a binary random forest classifier. The mean value of the individual features was similar for resistant and susceptible variants (p≥0.05). Leave one out cross validation showed predictability of bedaquiline resistance from Rv0678 variants when using a binary classifier with a combination of sequence and structural features (ROC AUC = 0.766; f1 score = 0.746), but the performance was too low for clinically use. Evolutionary conservation of the affected residue was the most important individual feature (mean decrease in impurity = 0.226 ± 0.003) to discriminate bedaquiline resistance from susceptibility. Prediction of bedaquiline resistance was only possible when restricting the data to missense variants, as the selected features could not be applied to Rv0678 insertions and deletions. Additionally, prediction of bedaquiline resistance was only possible when restricting the data to isolates for which the phenotype was determined using Mycobacterial Growth Indicator Tubes, suggesting possible misclassification of the bedaquiline phenotype by other methods. The results of this study suggest that structural features describing Rv0678 missense variants could be used to predict bedaquiline resistance, albeit not yet at the performance level required for clinical practice. Author summary Guidelines by the World Health Organization indicate that antibiotic resistant tuberculosis should be treated with bedaquiline, a newly discovered antibiotic. However, resistance to bedaquiline has already emerged and is spreading rapidly. More than 500 different variants have been discovered in the bedaquiline resistance genes in Mycobacterium tuberculosis. It is not known which DNA variants specifically drive bedaquiline resistance due to insufficiency of data. Here, we used a machine learning technique to predict whether a variant in the Rv0678 gene confers bedaquiline resistance. To this end, for each variant we defined features that describe the impact of the variant on the MmpR (encoded by the Rv0678 gene) protein structure. These features were then fed into a classification algorithm to predict bedaquiline resistance. We successfully developed a bedaquiline resistance prediction model, although performance was limited. We also found that accurate predictions were only possible when restricting the data to samples that were tested on the MGIT platform, a World Health Organization endorsed method to test for bedaquiline resistance. Our study shows that predicting bedaquiline resistance from protein structural features is feasible. Furthermore, our study provides new insights into phenotypic testing platforms to assess bedaquiline resistance.

209 Notably, evolutionary conservation calculated using ConSurf was the best individual feature 210 to discriminate BDQ resistance from susceptibility. This is not surprising, as conserved 211 residues tend to be located in protein core regions and are, in general, more important to 212 protein stability and function compared to less conserved residues [24][25][26]. For example, it has 213 been previously reported that isoniazid resistance conferring variants tend to affect highly 214 conserved residues in the katG gene, while variants affecting less conserved residues are 215 associated with a lower level of isoniazid resistance [27].
216 Surprisingly, the BDQ resistance predictability observed for classifying missense variants was 217 lost when including indel and nonsense variants. Due to their highly disruptive nature, indel 218 and nonsense variants are associated with a higher fitness cost relative to missense 219 variants [28]. For example, indels and nonsense variants have been described to confer high-220 level isoniazid and PZA resistance [29,30]. In contrast, Rv0678 indel and nonsense variants are 221 frequently reported in clinical isolates [31,32], even in BDQ susceptible clinical isolates [9]. The 222 poor predictability was likely due to the difficulty in describing the impact of indel and 223 nonsense variants with features that accurately capture the information loss on protein 224 level [33].
225 Another surprising observation was that the predictability was lost when phenotypic data 226 obtained by platforms other than the MGIT platform were included in the analysis. Although 227 many different phenotypic platforms are used to determine BDQ resistance, a recent study 228 showed that the MGIT platform is the most reproducible method to determine BDQ 229 resistance, followed by broth microdilution using dry plates, such as the Thermo Fisher 230 plates [34]. Currently, only MGIT and 7H11 are endorsed by the WHO with interim critical 231 concentrations of 1µg/ml and 0.25µg/ml, respectively [35]. These interim critical 232 concentrations are often based on a consensus in the absence of solid scientific data and 233 could thus result in misclassification of resistance [36]. The loss of predictability when 234 including MIC data obtained by any DST platform or when limiting the data to the Thermo 235 Fisher microtiter plate assay was likely due to the misclassification of isolates as susceptible 236 or resistant by the phenotypic methods other than the gold standard MIGT method.
237 Several limitations of this study should be considered. First, the data used in the analysis was 238 derived from multiple studies with different study designs and different methodologies for 239 phenotypic assessment which could have resulted in misclassifications. We tried to overcome 240 this by limiting the main analysis to studies that used the gold standard MGIT platform. 241 Second, while we excluded all isolates containing a variant in the atpE gene, the other 242 candidate BDQ resistance genes (pepQ, mmpL5, mmpS5 and Rv1979c) were not considered 243 in the analysis. Although rare, variants in the mmpL5 gene have been shown to override 244 Rv0678-mediated BDQ resistance, resulting in BDQ hypersusceptibility [37]. Similarly, variants 245 in pepQ and Rv1979c have been observed in BDQ resistant isolates, although evidence for 246 their role in clinical BDQ resistance is very scarce [38,39]. Third, many isolates with variants 247 had to be excluded. We tried to overcome this by performing sensitivity analyses that 248 included indels and nonsense variants, but isolates containing multiple variants and variants 249 in the promotor region of Rv0678 had to be excluded from all analyses. Finally, we could only 250 identify 12 relevant sequence-and structure-based features. The publication of the deep 251 learning based AlphaFold 2 has revolutionized protein structure prediction[40]. Although not 252 yet recommended to predict the impact of variants on protein stability and function, a recent 253 study has hinted that AlphaFold 2 could in future be used to study the effect of missense 254 variants on protein structures [41,42]. It will be interesting to see if improvements of the 255 AlphaFold 2 model will enable to study of the effect of multiple variants (missense, nonsense, 256 or indel) from de novo predicted protein structures compared to the wild type structures. 257 Lastly, the clinical performance of this computational prediction model would ideally be 258 validated using an independent dataset of clinical isolates with known BDQ phenotype and 259 Rv0678 variants. Since BDQ is still a relatively novel drug, such datasets were not available at 260 the time this study was conducted.
261 Many studies have shown that PZA resistance-conferring and benign variants are frequently 262 observed across the entire length of the pncA gene without any apparent resistance hotspots, 263 resulting in a markedly low sensitivity to identify PZA resistance when only considering the 264 genotypic information [43,44]. Similarly, Rv0678 variants occur across the entire gene and the 265 standard statistical approach is inadequate to accurately diagnose BDQ resistance from 266 genotypic data[9, 44]. Our study suggest that a combination of sequence-and structure-267 based features may, in future, be able to predict BDQ resistance from Rv0678 missense 268 variants. This could facilitate the development of sequencing-based strategies as diagnostics 269 to detect BDQ resistance and prevent emergence and spread of BDQ-resistant TB. 286 Several isolates were excluded from the analysis (Fig. 1). First, isolates which contained a 287 variant in the Rv0678 promoter region, or more than one missense variant in the Rv0678 288 coding region were excluded because features could not be calculated for these isolates. 289 Second, isolates with a variant in both the atpE and Rv0678 gene were also excluded because 290 the presence of these atpE variants could confound the genetic causality of BDQ resistance 291 by the Rv0678 variant. When information on the atpE gene was missing, the isolate was 292 considered atpE wild type and included in the analysis. Third, isolates for which the MIC 293 platform used was not specified were excluded because this information is needed to classify 294 the isolate as phenotypically resistant or susceptible. Fourth, variants that occurred an equal 295 number of times in resistant and susceptible isolates could not be classified as resistant or 296 susceptible. Isolates containing these variants were excluded. Fifth, all isolates with a variant 297 affecting amino acid positions 1-14 and 160-165 of the MmpR protein were excluded because 298 these regions are not included in the MmpR crystal structure, thus precluding the assessment 299 of the impact of the Rv0678 variant on the protein. Finally, all synonymous variants in the 300 Rv0678 gene were ignored for the analysis.
301 For the main analysis, all isolates containing an indel variant or missense variant resulting in 302 a premature stop-codon (nonsense) were excluded, and only isolates for which the MIC was 303 determined on the MGIT platform, the current gold standard, were included in the analysis.
304 Five sensitivity analysis datasets were constructed to investigate the impact of inclusion of 305 non-missense variants and isolates for which the MIC was determined using non-MGIT 306 platforms (Fig. 1). 327 For Rv0678 indels and nonsense variants, ConSurf scores were calculated by averaging the 328 evolutionary scores of the residues in the affected regions. Because the other features could 329 not be calculated for indel and nonsense variants, binary features indicating the variant type 330 and whether the variant causes a disruption of the DNA reading frame were created for all 331 variants. For each indel, two sets of feature scores were created. In one set the feature scores 332 for all indel and nonsense variants were set to zero to investigate whether the binary features 333 describing the variant type are sufficient to predict BDQ resistance. In the other set, the 334 feature scores for all indel and nonsense variants were set to the maximum score of the 335 missense variants, to accurately represent the highly disruptive nature of such variants.
336 The MmpR structures in ligand-bound and DNA-bound conformations were visualised in 337 ChimeraX and the dimerization and DNA-binding domains were defined as described by 338 Radhakrishnan et al.[46,53] Feature distributions were visualized using the Matplotlib library 339 (version 3.1.2) in Python [54] and were compared between resistance and susceptibility 340 variants using the Mann-Whitney U test. P-values were adjusted for false discovery rate using 341 the Benjamini-Hochberg procedure and were considered significant when p-value < 0.05. 342 Construction of machine learning classifier 343 In order to predict whether a variant would be associated with a BDQ resistant phenotype, a 344 binary random forest classifier was trained using the Scikit-learn Python module (version 345 1.0.2)[55] with default parameters and the number of trees set to 1000. Classification models 346 were trained using repeated ten-fold cross validation. Models were evaluated by calculating 347 the receiver operator characteristic curve (ROC) with area under the curve (AUC) and 348 precision recall curve (PR) with f1 score after aggregating performance measures over the