Accessory genes define species-specific pathways to antibiotic resistance

Background The rise of antimicrobial resistance (AMR) is a growing concern globally and a deeper understanding of AMR gene carriage vs usage is vital for future responses to reduce the spread of AMR. Identification of AMR phenotype by laboratory-based assays are often hindered by difficulties in establishing cultures. This issue could be resolved by rapid computational assessment of an organism’s genome, however, AMR gene finder tools are not intended to infer AMR phenotype which is likely to be a product of multiple gene interactions. Methods To understand the importance of multi-gene interactions to the relationship between AMR genotype and AMR phenotype, we applied machine learning approaches to 16,950 genomes from microbial isolates representing 28 different genera with 1.2 million corresponding laboratory-determined MICs for 23 different antibiotics. We then elucidated the genomic paths to phenotypic antimicrobial resistance with the aim of allowing for the development of rapid determination of AMR phenotype from genomes or even whole microbiomes. Results The application of machine learning models resulted in a >1.5-fold increase in average prediction accuracy of AMR phenotype across the 23 antibiotic models. Interpretation of these models revealed 528 distinct genomic pathways to phenotypic resistance, many of which were species-specific and involved genes which have not previously been associated with AMR phenotype. This is the first study to demonstrate the utility of machine learning models in the prediction of AMR phenotype for a wide range of clinically relevant organisms and antibiotics. This could be applied as a rapid and affordable alternative to culture-based techniques, estimating taxonomy in addition to AMR phenotype, and providing real-time monitoring of multi-drug resistant pathogens. Availability and implementation Contact ldillon05@qub.ac.uk View supplementary information at this link https://osf.io/cj4bq/?view_only=c0ee87b7609543b688953089be4c376f See Code Availability for scripts used.


INTRODUCTION 9
The overuse and misuse of antibiotics has escalated the rate at which many bacteria have evolved resistance 10 to multiple antibiotics [9, 37], including last-resort treatments [2]. This has led to a growing prevalence 11 of antimicrobial-resistant infections worldwide [27], which can be challenging to treat [7]. This has Models for 23 different antibiotics were selected with respect to various data constraints. Each model is 134 trained specific to a single antibiotic and the genomes present in the model must have corresponding MIC 135 values. For a model to be able to learn from the data and thus predict the correct AMR phenotype, the 136 models had to have both susceptible and resistant organisms (Supplementary Table 3). The proportion of 137 organisms with a susceptible to resistant MIC value can be seen in Fig.S1. 138 The J48 model was chosen for further analysis due to the interpretability of its decisions. Hence, providing  Table 4). There was no difference in eight of the antibiotic 144 models using the different parameters and the rest of the models had minor differences. The most accurate 145 C value could be found by using 0.25 or 0.5 for 15 out of 23 antibiotic models. The C value of 0.25 was 146 selected as this level of tree pruning is recommended to not overfit the model or prune the tree too much  The individual fold results allowed the standard error of the models to be calculated (Supplementary Table   150 3).

151
To evaluate what factors may impact the models or improve model accuracy, the composition of AMR 152 genes used to train the models was analysed. The models were originally trained using specific AMR 153 genes for the antibiotic the model represented. For example, Ampicillin-specific AMR genes to train the 154 Ampicillin model. The antibiotic target is defined in the CARD database in which the genes are annotated 155 to correspond to specific antibiotics. The models were then trained with all AMR genes present in the 156 genomes regardless of which antibiotic model they were training (Supplementary Table 3, Supplementary  Table 5, Fig.S2). A custom Python script used to make the .arff files for this analysis (RGI all analysis.py).

158
Investigating the role of taxonomy on decision tree model accuracy 159 To investigate the role of taxonomy on model accuracy, for each antibiotic model, one genus was excluded 160 from the training data. The excluded genus was then used to test the model. This included each genus 161 available for each antibiotic model (see Supplementary Table 6 for details). This way we can evaluate 162 how the models may perform on a species that was not in the training set. We used a custom python script 163 to develop the .arff files (taxa test train files.py).

164
To process CSV files into the format required for weka (.arff) we created a simple tool to translate a .csv 165 file into a .arff file. This code is freely available at https://github.com/LucyDillon/CSV_2_ 166 arff. 168 To investigate the role of accessory genes in AMR phenotype, the genomes from BV-BRC were analysed was compared with the genome AMR phenotype using the same J48 model and parameters in Weka, 174 using a custom python script to make the .arff files (Eggnog analysis.py). A 10-fold cross-validation was 175 used to evaluate the model's accuracy in predicting AMR phenotype, from which the standard error was 176 calculated.

Analysis of accessory gene involvement in AMR phenotype
then labelled as having a known AMR gene function in the models. Finally, "pathways to resistance" 180 were identified in all of the resulting decision tree models by identifying all possible paths through 181 the resulting trees that lead to a "resistance" outcome, using Apply Decision Tree available at https: 182 //github.com/ChrisCreevey/apply_decision_tree. All gene families traversed to reach 183 each resistance outcome on the decision trees were considered important to that resistance path (regardless 184 if it needed to be present or absent) and included in subsequent analyses of different paths to resistance.  To find associations with the gene families across multiple models, STRING and Cytoscape (version 3.9.1)

194
[38] were used to analyse the data using the same options as above, but including all gene families from 195 all antibiotic models. To reduce the network to directly link to the decision trees, edges in the network 196 were only retained if both gene families were present in the STRING protein-protein interactions (and 197 therefore predicted to interact) and the pair was also present in at least one decision tree model. Details of 198 this analysis can be found in the Cytoscape analysis.md file.

199
The predicted protein-protein interaction network of each path to resistance was also produced using

213
Within this study, we analysed several techniques for predicting AMR phenotype from genomic data,  Even though AMR gene finder tools are designed to identify the presence of AMR genes in genomic 217 data, their results are frequently used to directly infer AMR phenotype in literature [40,6,17,43]. We 218 examined the accuracy of predicting AMR phenotype solely based on the presence/absence of AMR 219 genes for 23 antibiotics and 16,950 genomes, from organisms with laboratory-derived MIC data. This 220 naive model assumed an antibiotic-resistant phenotype when an AMR gene which targeted a particular 221 antibiotic as defined in the CARD database was found in a genome.

222
The average prediction accuracy of this model (as defined by the number of genomes correctly predicted to 223 be susceptible or resistant to an antibiotic divided by the total number of genomes tested) was 57.6% and 224 ranged from 3.5% (Clindamycin) to 100% (Moxifloxacin) (Fig.1). Clindamycin had quite a poor ratio of . The boxplots represent the following methods used in this study to predict AMR phenotype in the following order: Naive RGI analyses (orange), logistic regressions using the RGI data (blue), J48 decision trees using RGI genes specific to the antibiotic (green), J48 decision trees using all RGI genes regardless of the antibiotic model (yellow), J48 decision trees using Eggnog gene families (pink). The statistical significances are the result of a pairwise Wilcoxon signed-rank test adjusted for multiple testing using the Benjamini-Hochberg method (q<0.05). No significant difference between distributions is indicated by a shared letter above their respective boxplot (see Supplementary Table 11 for more details). Outliers are represented with a triangle-shaped point.

291
To see if this observation extended to non-classic AMR genes, decision trees were generated for the 23 292 antibiotics using eggNOG gene family functional profiles generated for all 16,950 genomes. The average 293 7/17 accuracy for these models was 92.2% (ranging from 74.0% (Tigecycline) -100.0% (Moxifloxacin)). In the comparison of the eggNOG models to the RGI models, the mean value was the highest for RGI all 295 analysis (92.5%) (Supplementary Table 3, Supplementary Table 12). The difference between the RGI 296 decision tree models and eggNOG gene families was not significant overall (RGI specific genes vs Eggnog 297 q = 3.66E-01, RGI all genes models vs Eggnog q = 2.49E-01) (Supplementary Table 11). Overall, the 298 eggNOG models were not significantly worse than the AMR gene-based decision trees, this highlights 299 that the decision trees are able to extract key genes involved in AMR phenotype.

300
The average precision of the eggNOG-based decision tree was 84.3% (ranging from 50.0% (Fosfomycin) 301 to 100.0% (Moxifloxacin)) (Supplementary Table 8). This was significantly better than the logistic 302 regression of RGI genes. This suggests that the models based on the eggNOG genes are less biased than 303 the logistic regression and the RGI-specific analysis.

304
The average recall of the eggNOG-based decision trees was 86.6% (ranging from 66.0% (Colistin)-100.0% 305 (Moxifloxacin)) (Supplementary Table 8). This was significantly better than the naive RGI, the logistic 306 regression and RGI-specific decision tree models. 307 We can gain further biological insight by using these accessory genes which may be involved in resistance 308 pathways. This could provide novel information about pathways to resistance to particular antibiotics.

309
Using the eggNOG decision trees we found an additional 675 gene families across all 23 models which 310 are not in the RGI database but are linked to the AMR phenotype. 311 Figure 2. An example of a decision tree with two routes to resistance, indicated by the red and blue lines. For example, in the red pathway, if more than one copy of Gene A, D and E are present the genome will be resistant but if one of those genes is not present (i.e. Gene E) the organism will be susceptible.

312
The use of decision trees allowed biological interpretation of (428 susceptible routes and 528 resistant 313 routes in eggNOG models) paths to resistance and susceptibility ( Fig.3 and Fig.S3-S5 Table 13). The models showed the importance of the absence or number of copies of genes which could 315 influence the AMR phenotype of an organism. The co-occurrence of genes was another important factor 316 when determining the AMR phenotype. This highlights key genes involved in AMR phenotype that may 317 not be classic AMR genes (Fig.S7). RGI genes were matched to the gene families in the decision tree 318 models, we can see that the majority of models contained RGI gene families. In the eggNOG-based Figure 3. Predicting Tetracycline resistance using Eggnog gene families copy number or absence. A. A J48 decision tree model to predict Tetracycline AMR phenotype. RGI-associated gene families have been highlighted with a thick black outline. COG0480 relates to gene tet(44), COG0642 relates to gene adeS, and COG2946 relates to gene tetU. The decision trees have numbers in the phenotype boxes to represent the number of genomes. This may include two numbers in some cases, the first number indicates the total number of genomes and the second number are incorrectly classified genomes.* B. Stacked bar chart showing the routes to susceptibility and resistance for Tetracycline. This is a genus level analysis, the species, family, order, class, and phylum analysis can be found in Fig.S8. The pathway numbers relate to the numbers on the decision tree (Part A). Note: pathway 9 is not 100% Campylobacter, 0.4% are Nesseria. C. Protein-protein interactions between gene families for each pathway to resistance. The lines (edges) represent the protein-protein interactions from STRING and the thicker the line, the higher the confidence (see Supplementary Table 13 for details). See part A for details of each pathway (the pathway numbers correspond to the numbers on the phenotype boxes in part A).* Note*: COG 28JVV and 2EFWV are recognised as NOG03658 and NOG52501 in the STRING database, respectively.