Abstract
Nontyphoidal Salmonella species are the leading bacterial cause of food-borne disease in the United States. Whole genome sequences and paired antimicrobial susceptibility data are available for Salmonella strains because of surveillance efforts from public health agencies. In this study, a collection of 5,278 nontyphoidal Salmonella genomes, collected over 15 years in the United States, was used to generate XGBoost-based machine learning models for predicting minimum inhibitory concentrations (MICs) for 15 antibiotics. The MIC prediction models have average accuracies of 95-96% within ± 1 two-fold dilution factor and can predict MICs with no a priori information about the underlying gene content or resistance phenotypes of the strains. By selecting diverse genomes for training sets, we show that highly accurate MIC prediction models can be generated with fewer than 500 genomes. We also show that our approach for predicting MICs is stable over time despite annual fluctuations in antimicrobial resistance gene content in the sampled genomes. Finally, using feature selection, we explore the important genomic regions identified by the models for predicting MICs. To date, this is one of the largest MIC modeling studies to be published. Our strategy for developing whole genome sequence-based models for surveillance and clinical diagnostics can be readily applied to other important human pathogens.
Abbreviations
- AMP: ampicillin
- AMR: antimicrobial resistance
- AUG: amoxicillin/clavulanic acid (Augmentin)
- AXO: ceftriaxone
- AZI: azithromycin
- CDC: United States Centers for Disease Control and Prevention
- CHL: chloramphenicol
- CIP: ciprofloxacin
- CLSI: Clinical and Laboratory Standards Institute
- COT: trimethoprim/sulfamethoxazole (co-trimoxazole)
- FDA: United States Food and Drug Administration
- FIS: sulfisoxazole
- FOX: cefoxitin
- FSIS: USDA Food Safety and Inspection Service
- GEN: gentamicin
- KAN: kanamycin
- ME: major error
- MIC: minimum inhibitory concentration
- NAL: nalidixic acid
- NARMS: National Antimicrobial Resistance Monitoring System
- SNP: single nucleotide polymorphism
- STR: streptomycin
- TET: tetracycline
- TIO: ceftiofur
- USDA: United States Department of Agriculture
- VME: very major error
- WGS: whole genome sequencing
- XGBoost: Extreme Gradient Boosting
Introduction
Nontyphoidal Salmonella species are the leading bacterial cause of food-borne disease in the United States[1, 2], causing over one million illnesses per year[3] and an estimated 80 million illnesses annually world-wide[4]. Nontyphoidal Salmonella causes acute gastroenteritis and is usually contracted via fecal contamination of food sources[5]. Although these infections are usually self-limiting and typically do not require antibiotic treatment[6], severe infections can occur[7]. Antimicrobial resistance (AMR) is prevalent in Salmonella isolates and infections caused by highly antimicrobial resistant Salmonella strains result in worse outcomes than infections caused by susceptible strains[8–11].
In 1996, the National Antimicrobial Resistance Monitoring System (NARMS) was established as a collaboration between the United States Centers for Disease Control and Prevention (CDC), U.S. Food and Drug Administration (FDA), U.S. Department of Agriculture (USDA), and state and local health departments. A primary goal of NARMS is to monitor antimicrobial resistance in Salmonella and other food-borne bacteria, including Campylobacter, Escherichia and Enterococcus[12]. The data collected by NARMS is used to inform public health decisions aimed at identifying contaminated food sources and reducing the spread of AMR through enhanced stewardship. In recent years, NARMS has adopted whole genome sequencing (WGS) as a routine monitoring tool. The WGS data are used to determine the source of outbreak strains, the virulence factor and AMR genes carried by each strain. As a result, a large collection of bacterial whole genome sequences with extensive metadata is available for downstream research efforts[13].
Whole genome sequencing is now routinely used for public health surveillance and to guide diagnostic and patient care decisions[14–18]. For routine surveillance, WGS provides the highest possible resolution for individuating traits in bacteria, assessing phylogenetic relationships, conducting outbreak investigations, and predicting virulence and epidemicity. From the clinical perspective, rapid diagnostics are key to improving patient care. For a conventional microbiology laboratory diagnosis, the total time for organism growth, isolation, taxonomic identification, and antimicrobial minimum inhibitory concentration (MIC) determination may exceed 36 hours for relatively fast-growing bacteria and several days for slower growing organisms[19–21]. Since reducing the time to optimal antimicrobial therapy significantly improves patient outcomes[22–24], rapid sequencing-based approaches that predict MICs may have clinical utility. The extensive WGS datasets generated by health agencies and the scientific community, such as those for nontyphoidal Salmonella strains, provide the necessary training sets required for building predictive models.
Several investigations have recently built models for predicting AMR phenotypes from WGS data. To date, the most common approach has relied on using a curated reference set of genes and polymorphisms that are implicated in AMR[25–33]. This reference-guided approach best predicts susceptibility and resistance when organisms are well studied and the AMR mechanisms are known. As larger collections of genomes have become available, several studies have used machine learning algorithms to predict susceptible and resistant phenotypes[27, 29, 31, 34–38]. By using WGS and AMR phenotype data to train a machine learning model, predictions without a priori information about the underlying gene content of the genome or molecular mechanism for resistance to each agent are possible. Although this reference-free approach requires many genomes, it is unbiased and can potentially be used to enable the discovery of new genomic features that are involved in AMR[36, 37]. These two general approaches have also been used to predict MICs for Streptococcus, Neisseria, and Klebsiella[35, 38–40]. When a curated reference collection of genes and SNPs is used for predicting MICs, a rules-based or machine learning algorithm is required for determining how much a given feature contributes to the MIC. Thus, for MIC prediction, both reference-guided and reference-free approaches are expected to have similar advantages and disadvantages if the collection of genes and SNPs used by the reference-guided method is sufficient for predicting all MICs, including those that are in the susceptible range. For example, in previous work, we built a machine learning model to predict MICs for a comprehensive population-based collection of 1,668 Klebsiella pneumoniae clinical isolates[38]. For each genome, we used nucleotide 10-mers and the MICs for each antibiotic as features to train the model. Extreme gradient boosting (XGBoost) was chosen as the machine learning algorithm[41]. The model could rapidly predict the MICs for 20 antibiotics with an average accuracy of 92%. This demonstrated that it is possible to successfully predict MICs without using a precompiled set of AMR genes or polymorphisms.
In this study, we build models that use whole genome sequence data to predict MICs for nontyphoidal Salmonella based on strains collected and sequenced by NARMS from 2002-2016. Our strategy can be used to guide responses to outbreaks and inform antibiotic stewardship decisions.
Materials and Methods
Genomes and Metadata
A total of 5,278 nontyphoidal Salmonella genome sequences were used in this study. All strains were collected and sequenced as part of the NARMS program. The strains were recovered from either raw retail meat and poultry or directly from livestock animals at slaughter. Antimicrobial susceptibility testing was performed using broth microdilution on the Sensititre® system (Thermo Scientific) for 15 antibiotics: ampicillin (AMP), amoxicillin/clavulanic acid (AUG), ceftriaxone (AXO), azithromycin (AZI), chloramphenicol (CHL), ciprofloxacin (CIP), trimethoprim/sulfamethoxazole (COT), sulfisoxazole (FIS), cefoxitin (FOX), gentamicin (GEN), kanamycin (KAN), nalidixic acid (NAL), streptomycin (STR), tetracycline (TET), and ceftiofur (TIO) at FDA and USDA NARMS laboratories[13]. Clinical breakpoints are based on CLSI and FDA guidelines[42]. Whole genome sequencing was performed on the Illumina HiSeq and MiSeq platforms using standard methods[25]. Accession numbers and MICs for each isolate are listed in Table S1. All non-AMR metadata, including serotype, host, geographic location of isolation, and isolation year, were taken from the metadata associated with each NCBI SRA entry.
Genomic Analyses
The short read sequence data for each strain were assembled with the PATRIC genome assembly service[43], using the “Full SPAdes” pipeline, which uses BayesHammer[44] for read correction and SPAdes for assembly[45]. All genomes were annotated using the PATRIC annotation service[43], which uses a variation of the RAST tool kit annotation pipeline[46]. Annotated genomes are available on the PATRIC website (https://patricbrc.org). PATRIC genome identifiers are displayed in Table S1. Protein annotations, including those specifically asserted to be involved in AMR[47], were downloaded from the PATRIC workspace and used for subsequent analyses. A phylogenetic tree was generated for the strains in the analysis by aligning the genes for the beta and beta prime subunits of the RNA polymerase using MAFFT[48], concatenating the alignments, and computing a tree with FastTree[49]. The tree was rendered using iTOL[50].
MIC Prediction
Model Generation
A model for predicting minimum inhibitory concentrations for the 15 antibiotics was built following the methods previously described by Nguyen and colleagues[38]. Briefly, each genome was divided into the set of nonredundant overlapping nucleotide 10-mers using the k-mer counting program KMC[51]. A matrix was built where the k-mers, antibiotics, and MICs are treated as features for each genome. Each row in the matrix contains the k-mers for a genome as well as the MIC for a single antibiotic. The MIC prediction model was built using an XGBoost[41] regressor predicting linearized MICs. All model parameters were identical to those used by Nguyen et al.[38]. Ten-fold cross validations were used to assess the overall accuracy and sensitivity of every model used in this study. A non-overlapping training set (80% of the data), validation set (10% of the data), and test set (10% of the data) were generated for each fold. The validation set was used to monitor each model to prevent overfitting. Unless otherwise stated, the accuracy is reported as the ability to predict the correct MIC within ± 1 two-fold dilution step of the laboratory-derived MIC. Defining an accuracy to be within one two-fold dilution step is consistent with FDA requirements for automated MIC measuring device standards and is consistent with established clinical microbiology practices[20, 52, 53]. A comparison of raw accuracies and accuracies within ± 1 two-fold dilution step is shown in Table S2. To assess the accuracy of a model over various metadata categories, including date, serotype, source, and location, the training set genomes are used to make the model, and the test set genomes are used to assess the model accuracy for a given fold. For models based on date ranges, all parameters are identical and the accuracy is reported over the genomes from the held-out dates.
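The accuracy criterion described above can be illustrated with a minimal sketch. It assumes that "linearized" means a log2 transform of the MIC, so that one two-fold dilution step corresponds to a difference of 1.0, and that regressor outputs are rounded to the nearest dilution step before comparison. Both are assumptions made for illustration, and the function names and example values are hypothetical rather than taken from the published software.

```python
import math

def log2_mic(mic):
    """Linearize an MIC (ug/mL) as log2 so one two-fold dilution step equals 1.0."""
    return math.log2(mic)

def within_one_dilution_accuracy(predicted_log2, measured_mic, tolerance=1.0):
    """Fraction of predictions within +/- `tolerance` two-fold dilution steps."""
    if len(predicted_log2) != len(measured_mic):
        raise ValueError("inputs must have the same length")
    hits = 0
    for pred, mic in zip(predicted_log2, measured_mic):
        # Assumption: round the regressor output to the nearest dilution step.
        if abs(round(pred) - log2_mic(mic)) <= tolerance:
            hits += 1
    return hits / len(measured_mic)

measured = [0.25, 2, 8, 32]          # hypothetical laboratory MICs
predicted = [-2.2, 2.1, 3.4, 2.6]    # hypothetical regressor outputs (log2 scale)
accuracy = within_one_dilution_accuracy(predicted, measured)
```

In this toy case the first three predictions fall within one dilution step of the measured value and the last one is two steps away, so three of four predictions count as correct.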
Subsampling
In order to perform the model building on a machine with 1.5 TB of RAM (machines with more memory are currently somewhat uncommon), we reduced the matrix size to sets of size n, where n ≤ 250, 500, 1,000, 2,000, 3,000, 4,000, and 4,500 genomes, respectively. To create a diverse subset of size n, a hierarchical clustering method[54] was used to create n clusters by using the 10-mer distribution of each genome as input features. To mitigate the curse of dimensionality[55, 56], the taxicab/Manhattan distance (L1 norm) was used, rather than the Euclidean distance (L2 norm), since previous research has shown it to be both computationally fast and more accurate for high dimensional data[57]. From the resulting n clusters, one genome from each cluster was selected uniformly at random to create the subset containing n genomes. For each subset of genomes, a matrix was generated, and models were generated as described above.
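The subsampling procedure can be sketched as follows. This is an illustrative reconstruction using SciPy, not the authors' implementation: the linkage method (average linkage here) is an assumption because the text does not name one, and the function name and toy count matrix are hypothetical. Note that SciPy's `maxclust` criterion returns at most n clusters, so the subset may contain fewer than n genomes.

```python
import random

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def select_diverse_subset(kmer_counts, n, seed=0):
    """Return sorted row indices of up to n genomes, one drawn at random per cluster."""
    # Agglomerative clustering on Manhattan (L1, "cityblock") distances.
    # Average linkage is an assumption; the source does not state the linkage.
    tree = linkage(kmer_counts, method="average", metric="cityblock")
    labels = fcluster(tree, t=n, criterion="maxclust")
    members = {}
    for index, label in enumerate(labels):
        members.setdefault(int(label), []).append(index)
    rng = random.Random(seed)
    # One genome per cluster, chosen uniformly at random.
    return sorted(rng.choice(group) for group in members.values())

# Toy k-mer count matrix: six genomes forming three tight groups.
toy = np.array(
    [[0, 0], [0, 1], [10, 10], [10, 11], [50, 50], [51, 50]], dtype=float
)
subset = select_diverse_subset(toy, n=3)
```

Because `linkage` computes all pairwise distances, a sketch like this is only practical for small collections; a collection of thousands of genomes with a full 10-mer feature space would need a more memory-conscious approach.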
Feature identification
In order to unambiguously identify k-mers that are important to MIC prediction, we built separate models for each individual antibiotic using the method described above, except that we increased the k-mer length to 15 nucleotides in order to reduce the number of redundant k-mers within each genome and to enable analyses with BLAST[58]. We also measured k-mer hits as presence versus absence, rather than counts, in order to simplify the analysis. Each model was built using the set of 1,000 diverse genomes from the subsampling experiment described above and 10-fold cross validations were performed on each model.
XGBoost's internal feature importance was computed for each fold within the 10-fold cross validation. This results in an importance score per feature (15-mer) from each fold. In order to generate an overall importance score for the top features, we summed the feature importance scores from each fold for the top ten features. This overall importance score captures both the importance of the 15-mer to a given fold and the number of times that 15-mer was chosen as a top feature within each of the ten folds.
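The aggregation of per-fold importance scores can be sketched as below. The per-fold dictionaries are hypothetical; in practice they might be obtained from XGBoost with `Booster.get_score(importance_type=...)`, although which importance type was used is not stated here, so that choice is an assumption. The toy example keeps the top two features per fold instead of the top ten to stay short.

```python
from collections import defaultdict

def overall_importance(fold_importances, top_n=10):
    """Sum importance scores over folds, counting only each fold's top_n features."""
    totals = defaultdict(float)
    for importances in fold_importances:
        ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
        for kmer, score in ranked[:top_n]:
            totals[kmer] += score
    # Highest overall score first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical 15-mers and per-fold importance scores for three folds.
folds = [
    {"ACGTACGTACGTACG": 4.0, "TTGACCGATTGACCA": 2.0, "GGCATCGGCATCGGA": 1.0},
    {"ACGTACGTACGTACG": 3.0, "GGCATCGGCATCGGA": 2.5, "TTGACCGATTGACCA": 0.5},
    {"TTGACCGATTGACCA": 5.0, "ACGTACGTACGTACG": 1.0, "GGCATCGGCATCGGA": 0.5},
]
ranking = overall_importance(folds, top_n=2)
```

A k-mer that is selected in many folds accumulates score from each of them, so the summed value reflects both how strongly and how consistently it was used.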
XGBoost's internal feature importance is unable to provide correlations between features and label values, and thus does not provide an indication of whether a k-mer is related to antibiotic resistance or susceptibility. This is partially due to the fact that many non-linear correlations exist that may use multiple features. In order to see if the high scoring k-mers correlate with resistance or susceptibility, we computed the distribution of MICs for the genomes containing each high scoring k-mer. For example, a k-mer conferring susceptibility should be found in more genomes with lower MICs, while a k-mer conferring resistance should exist in genomes with higher MICs. Each high scoring k-mer was also compared to the set of protein-encoding genes within each Salmonella genome. If a k-mer was found within a known AMR gene, that gene was reported. Otherwise, all protein-encoding genes within 3 kb of the k-mer were reported in order to assess the neighborhood of the k-mer.
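As a minimal sketch of the MIC distribution step, the function below tallies the MICs of the genomes that contain a given k-mer. The genome identifiers, k-mer labels, and MIC values are hypothetical, and the median is included only as one convenient summary of the distribution.

```python
from collections import Counter
from statistics import median

def mic_distribution(kmer, genome_kmers, genome_mic):
    """Tally MICs over the genomes containing `kmer`; also return the median MIC."""
    mics = [genome_mic[g] for g, kmers in genome_kmers.items() if kmer in kmers]
    return Counter(mics), (median(mics) if mics else None)

# Hypothetical presence/absence data and MICs (ug/mL) for one antibiotic.
genome_kmers = {
    "g1": {"K1", "K2"},
    "g2": {"K1"},
    "g3": {"K2"},
    "g4": {"K1"},
}
genome_mic = {"g1": 32, "g2": 32, "g3": 2, "g4": 64}

k1_counts, k1_median = mic_distribution("K1", genome_kmers, genome_mic)
```

Here the hypothetical k-mer K1 occurs only in genomes with high MICs, which is the pattern expected for a resistance-associated k-mer.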
To find k-mers that are being used by the individual antibiotic models to predict susceptible MICs, we computed the presence or absence of each k-mer with a high XGBoost feature importance value (described above) for the entire data set of 5,278 genomes. The k-mers with the largest difference in occurrence between the susceptible and resistant genomes are the ones that are being chosen by the models for predicting susceptible MICs. To determine if there were significant SNPs in these k-mers, we found the genomic feature containing the k-mer (protein-encoding gene, RNA gene, or intergenic region) using BLASTn[58]. The corresponding feature or region was then found for all genomes in the collection. The features were aligned using MAFFT[48] and manually curated using Jalview[59]. Poor quality sequence was removed, all duplicates and paralogs were removed, and the subalignment covering the k-mer was extracted. To prevent possible biases due to clonality that may exist in the full set of genomes, the analysis was repeated on the diverse subset of 1,000 genomes (described above). We report a SNP in a k-mer region as being significant if the susceptible and resistant sets are significantly different (P-value < 0.001) for a given nucleotide position based on a Chi-square test for both the full set of 5,278 genomes and the set of 1,000 diverse genomes. Sequence logos for k-mers containing significant SNPs were generated using WebLogo[60]. K-mers from the azithromycin and ciprofloxacin models were excluded from this analysis because they each had only seven resistant genomes. Comparisons of codon usage were computed versus the genome average, genome mode, and high expression gene sets as described previously[61, 62].
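The per-position Chi-square comparison can be sketched with SciPy's `chi2_contingency`. This is an illustration under stated assumptions: the contingency table layout (susceptible and resistant rows, one column per observed nucleotide) and the handling of invariant columns are choices made here, and the aligned sequences are hypothetical three-base fragments.

```python
from collections import Counter

from scipy.stats import chi2_contingency

def snp_p_value(susceptible_seqs, resistant_seqs, position):
    """Chi-square P-value for nucleotide usage at one aligned position."""
    s_counts = Counter(seq[position] for seq in susceptible_seqs)
    r_counts = Counter(seq[position] for seq in resistant_seqs)
    alleles = sorted(set(s_counts) | set(r_counts))
    if len(alleles) < 2:
        return 1.0  # invariant column: nothing to test
    # Rows: susceptible, resistant. Columns: observed nucleotides only,
    # so no column has an expected count of zero.
    table = [[s_counts[a] for a in alleles], [r_counts[a] for a in alleles]]
    return chi2_contingency(table)[1]

# Hypothetical aligned fragments (a TCG to TCA change at the third position).
susceptible = ["TCG"] * 38 + ["TCA"] * 2
resistant = ["TCA"] * 35 + ["TCG"] * 5

p_snp = snp_p_value(susceptible, resistant, 2)
p_invariant = snp_p_value(susceptible, resistant, 0)
```

Applying the same function to both the full collection and the diverse subset, and requiring P < 0.001 in each, mirrors the two-set criterion described above.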
Software availability
The Salmonella MIC prediction model based on 4,500 genomes—including the software and documentation for running the model—is available at our GitHub page, https://github.com/PATRIC3/mic_prediction.
Results
Model Construction
For this study, we used a publicly available collection of 5,278 Salmonella whole genome sequences generated by the NARMS project between 2002 and 2016. The strains were isolated from retail meat and food animal sources in the United States. The collection includes 98 different serotypes, including Heidelberg (678 genomes), Kentucky (618 genomes), and Typhimurium var. 5- (588 genomes) from 41 states (Table S1). Isolates were tested for resistance to up to 15 antimicrobial agents using the broth microdilution method. Many of the strains were randomly selected for sequencing as part of a compulsory nation-wide collection program (Table 1).
The nontyphoidal Salmonella MIC prediction model was built using a strategy similar to the one we previously described for predicting MICs for K. pneumoniae clinical isolates[38]. Since the Salmonella data set has many more genomes and greater sampling in the range of susceptible MICs, it provides a critical test case for determining if the approach remains robust for other pathogens. In the Klebsiella study, we built individual models for each antibiotic, as well as a single large integrated model by combining the data from all antibiotics. We found that the combined model achieved slightly higher overall accuracies (by ~1-2%); however, the matrix that was necessary to train this model had a large memory footprint. Indeed, if we were to build a similar matrix for the current Salmonella data set using all 5,278 genomes, the model training would exceed 1.5 TB of RAM. Therefore, we first built models for all antibiotics using subsets of the genomes ranging in size from 250-4,500 genomes that were rationally selected to maximize genetic diversity (Figure 1). A matrix built from 4,500 genomes is the largest we can train on a 1.5 TB machine using this protocol. As the training set size increases from 250 to 1,000 genomes, the accuracy increases from 88.5% to 91.4%. Then as the training set increases beyond 1,000 genomes, the accuracy continues to improve more modestly, with the 4,500-genome model having an average accuracy of 95.2%. These results indicate that the overall MIC prediction approach, which was developed previously for Klebsiella pneumoniae, also works for Salmonella despite the differences in sampling, genetic diversity and MICs. We also found that a smaller number of well-chosen diverse genomes can serve as a useful proxy for representing the entire set, since models built from ≥500 genomes have accuracies exceeding 90%.
Model Accuracy
We computed the overall accuracy for each antibiotic using the model that is based on 4,500 genomes. For this model, all 15 antibiotics have average accuracies ≥90%, with their Q1 quartile bound ≥89% (Figure 2). Chloramphenicol and ceftiofur had the highest accuracies (99%), and gentamicin and tetracycline had the lowest accuracies (91% and 90%, respectively) (Table S2). Since the model is robust to the various mechanisms of resistance for the 15 antibiotics, it is possible that the slightly lower accuracies for gentamicin and tetracycline could be due to the distribution of multiple AMR genes/mechanisms across the population of strains with resistant genomes (which will be analyzed in more detail below). Figure 3 depicts the accuracy of the 4,500-genome model for each MIC. Overall, the model is robust for both the resistant and susceptible MICs, and it tends to be more accurate when a MIC is represented by many genomes. The model tends to have lower accuracies for the highest and lowest MICs, perhaps because of underlying genetic differences between strains that have been reported with ≥ or ≤ values, which represents a range of MICs rather than a discrete value.
The utility of AMR diagnostic devices is often described in terms of error rate. Major errors (MEs) are defined as susceptible genomes that have been incorrectly assigned resistant MICs by the model. Very major errors (VMEs) are defined as resistant genomes that have been incorrectly assigned susceptible MICs by the model. FDA standards for automated systems recommend a major error rate ≤ 3%[53]. All antibiotics used in the model have ME rates within this range (Table 2). The FDA standards for VME rates indicate that the lower 95% confidence limit should be ≤1.5% and upper limit should be ≤7.5%[53]. Seven of the 15 antibiotics—amoxicillin/clavulanic acid, ceftriaxone, chloramphenicol, cefoxitin, streptomycin, tetracycline and ceftiofur—have acceptable VME rates by this measure. Ampicillin and sulfisoxazole have VME rates with 95% confidence intervals approaching this range: [0.022, 0.033] and [0.026, 0.053] respectively. The VME rates are higher for some of the remaining antibiotics because there are fewer resistant genomes. As more resistant genomes are collected, and the data set becomes more balanced, we expect VME rates to be reduced.
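The error definitions above can be made concrete with a short sketch. It assumes that the ME rate is computed over genomes with susceptible laboratory MICs and the VME rate over genomes with resistant laboratory MICs, and it uses illustrative breakpoints (susceptible at ≤ 4 and resistant at ≥ 16) that are placeholders rather than values taken from this study. The function names and MIC values are hypothetical.

```python
def categorize(mic, susceptible_max, resistant_min):
    """Map an MIC to S, I, or R using the given breakpoints."""
    if mic <= susceptible_max:
        return "S"
    if mic >= resistant_min:
        return "R"
    return "I"

def error_rates(measured, predicted, susceptible_max, resistant_min):
    """Return (ME rate, VME rate); a rate is None when its denominator is zero."""
    n_susceptible = n_resistant = major = very_major = 0
    for lab_mic, model_mic in zip(measured, predicted):
        truth = categorize(lab_mic, susceptible_max, resistant_min)
        call = categorize(model_mic, susceptible_max, resistant_min)
        if truth == "S":
            n_susceptible += 1
            if call == "R":
                major += 1        # susceptible genome called resistant
        elif truth == "R":
            n_resistant += 1
            if call == "S":
                very_major += 1   # resistant genome called susceptible
    me = major / n_susceptible if n_susceptible else None
    vme = very_major / n_resistant if n_resistant else None
    return me, vme

# Hypothetical laboratory and predicted MICs (ug/mL) for one antibiotic.
measured = [2, 4, 4, 32, 32, 64, 8, 2]
predicted = [2, 32, 4, 32, 4, 32, 8, 4]
me_rate, vme_rate = error_rates(measured, predicted, susceptible_max=4, resistant_min=16)
```

Because the VME denominator is the number of resistant genomes, a single miscalled genome moves the VME rate substantially when resistant isolates are rare, which is consistent with the wider VME intervals reported for antibiotics with few resistant genomes.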
In addition to the extensive MIC data, NARMS reports rich metadata including isolation date, food or animal source, collection year, geographic location and serotype. We computed the accuracy of the model over each available metadata category to determine if the model is robust to these differences and to ensure that no subset is biasing the model. The genomes span a 15-year collection period, with all the years except 2002 (the oldest) and 2016 (the most recent) having over 100 isolates. The model accuracy ranges from 94-97% over each collection year (Table 3). That is, the genetic factors that contribute to the MICs have either remained stable over the 15-year period or have been learned as the model was trained. Although the data set is mostly composed of poultry meat or live animal isolates, the accuracy ranges between 94% and 96% over the four contamination sources: turkey, beef, pork, and chicken (Table 4). No obvious biases were detected in the accuracies based on the state of isolation (an average of 95% accuracy over 41 states with a 95% CI equal to [0.95-0.96]) (Figure 4) or the serovars of each isolate (94% accuracy over 97 serovars with a 95% CI equal to [0.94-0.96]) (Table S3). Since the traditional Salmonella serotyping scheme is based on the lipopolysaccharide O and flagellar H antigens, which are encoded by genes that influence the cell surface[63], we also constructed a phylogenetic tree for Salmonella genomes to observe the model accuracy over the various clades. Overall, no phylogenetic bias in the model accuracy was detected (Figure S1).
One concern with using a model that is trained on data from previous years, in some cases over 15 years old, is that the training set may not be representative of currently circulating strains. That is, the model may be inaccurate for predicting MICs for genomes of strains that are currently circulating or will emerge in the future. For example, shifts in clonal groups, evolution of AMR-associated genes, and introduction of AMR genes by horizontal gene transfer are all possible[64, 65]. We evaluated this possibility by building models from subsets of the whole genome sequence data using strains collected in earlier years and measuring the accuracy of the models on genomes collected in later years. Models were built for years prior to 2009 through 2014 and tested on the remaining genomes (Table 5). These models have accuracies ranging from 86-92%. As the number of years used for building the models increases, the number of genomes available for testing decreases, so we also tested each model on only the 462 genomes from 2015 and 2016. Similarly, the accuracy of each model on the 2015 and 2016 genomes ranges from 87-90% (Table S4). The results indicate that within this data set, models generated from genomes collected at earlier dates yield stable MIC predictions for genomes collected at later dates. This finding is consistent with the pattern of AMR genes that is observed within the data set. Although AMR gene content may vary from year to year, we do not observe any major sweeps or fixation events that drastically alter the AMR gene content of the collection between years, which would cause the MIC predictions to fail for a large fraction of the genomes (Table S5). Taken together, these data suggest that the MIC prediction models generated in this study are likely to be sustainable over time.
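The date-based evaluation can be sketched as a simple split on isolation year. The sketch assumes that "prior to" a cutoff means strictly earlier than that year, which is one reading of the text, and the genome identifiers and years are hypothetical.

```python
def temporal_split(isolation_year, cutoff_year):
    """Split genome ids into (train, test): before the cutoff vs. cutoff and later."""
    train = sorted(g for g, year in isolation_year.items() if year < cutoff_year)
    test = sorted(g for g, year in isolation_year.items() if year >= cutoff_year)
    return train, test

# Hypothetical genome ids and isolation years.
isolation_year = {
    "g1": 2004, "g2": 2008, "g3": 2009, "g4": 2012, "g5": 2015, "g6": 2016,
}

# One train/test split per cutoff year from 2009 through 2014.
splits = {cutoff: temporal_split(isolation_year, cutoff) for cutoff in range(2009, 2015)}
train_2009, test_2009 = splits[2009]

# Fixed late test set: genomes from 2015 and 2016 only.
late_test = sorted(g for g, year in isolation_year.items() if year >= 2015)
```

Evaluating every cutoff model on the same fixed late set removes the effect of the shrinking test set, which is the motivation given above for the 2015 and 2016 comparison.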
Genomic regions important for MIC prediction
The 4,500-genome model described above contains data from all antibiotics and MICs, which makes it difficult to extract the features (k-mers) that contribute to the MIC predictions for each antibiotic. To address this limitation, we modified the protocol by building separate models for each antibiotic. Instead of using 10-mers, we increased the k-mer length to 15 nucleotides to reduce redundancy and make them identifiable using BLAST[58]. We also searched for presence or absence of k-mers, rather than using k-mer counts, to simplify the analysis of the XGBoost decision trees. Since a 15-mer matrix can be 4^5 (1,024) times larger than a 10-mer matrix, we used ≤1,000 diverse genomes to reduce the memory footprint during training. Overall, the average accuracy for the individual models is nearly identical to the average accuracy for the combined 4,500-genome model (96% vs. 95%, respectively), and in nearly all cases, the 95% confidence intervals overlap between the combined and single antibiotic models (Table S6). Thus, for this data set, single antibiotic models with fewer genomes and larger k-mers perform as well as a combined model (Figure S2).
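The difference between count features and presence/absence features, and the growth of the feature space with k, can be illustrated with a short sketch. It uses a plain sliding window over a single strand and does not merge reverse-complement k-mers, which dedicated k-mer counters may do, so it is a simplification rather than a description of the pipeline used here. The sequence and function names are hypothetical.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count overlapping k-mers in a sequence (single strand only)."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def kmer_presence(sequence, k):
    """Presence/absence features: the set of distinct k-mers."""
    return set(kmer_counts(sequence, k))

contig = "ACGTACGTAC"  # hypothetical 10-base fragment
counts = kmer_counts(contig, 4)
presence = kmer_presence(contig, 4)

# Number of possible k-mers over {A, C, G, T} is 4**k.
feature_space_ratio = 4**15 // 4**10
```

The ratio of possible 15-mers to possible 10-mers is 4^5 = 1,024, which is why the longer k-mers were paired with a smaller set of genomes.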
During model training, XGBoost assigns an importance value to each k-mer used in a decision tree. When the model is used to predict the MICs for a new genome, the k-mers with the highest importance values are the most informative for the MIC prediction. Thus, by analyzing the feature importance values of each k-mer, we can use the models as a tool for understanding the genomic regions that differentiate MICs. For each antibiotic-specific model, we parsed the XGBoost decision trees from each fold of the ten-fold cross validation to extract the importance values for each k-mer. To understand the relationship between known AMR genes and the important k-mers that were chosen by each model, we then searched for k-mers with high importance values that occur within an AMR gene or in close proximity to one (in this case, we consider a window of 3 kb, approximately 3 genes, to be close proximity). Table 6 lists the highest-ranking k-mers from each model that occur within or in close proximity to an AMR gene. In most cases, the k-mers correspond to known AMR genes including class A and C beta-lactamases for the beta-lactam antibiotics, aminoglycoside nucleotidyl- and acetyltransferases for the aminoglycosides, DNA gyrase and QnrB for the fluoroquinolones, TetA and TetR for tetracycline, and dihydrofolate reductase and dihydropteroate synthase for co-trimoxazole and sulfisoxazole. In the case of azithromycin, the collection contains mostly susceptible genomes (Table 1), so the first macrolide resistance gene observed corresponds with the eighth ranking k-mer. The top ten k-mers with the highest feature importance values from each of the ten folds used in model training are listed in Tables S7-S21. In addition to the top AMR k-mers displayed in Table 6, these tables show other highly ranking k-mers from the same AMR genes as well as k-mers from related genes that are known to confer resistance to the given antibiotic. In some cases, k-mers matching regions or genes from unrelated AMR mechanisms have high importance values, suggesting a pattern of co-occurrence on horizontally transferred genetic elements.
Since each model is predicting the entire range of MICs, some of the highly ranking k-mers will be used to predict susceptible MICs. To assess this, we computed the fraction of susceptible and resistant genomes with each k-mer from Tables S7-S21. The set of k-mers that are most enriched in the susceptible genomes is shown in Table 7. Overall, seven of the top ten k-mers represent significantly different SNPs (P-value < 0.001) in both the complete set of 5,278 genomes and in the set of 1,000 diverse genomes used to build the models (Figure S3). The top k-mer associated with susceptibility is from the nalidixic acid model and occurs in the DNA gyrase gyrA gene. This is also the top k-mer that was found in an AMR gene for nalidixic acid from Table 6. In this case, the model is relying more heavily on the “wild type” version of the k-mer rather than any of the resistant versions (the remaining k-mers from Table 6 occur almost exclusively in resistant genomes). The same gyrA k-mer is also found as a highly ranking k-mer in the case of ciprofloxacin (Table S12). Two significant gyrA SNPs are captured by this k-mer (Figure S3). These are missense mutations in the resistant genomes occurring at Ser-83 and Asp-87, and changes at these positions have been shown to confer quinolone resistance in E. coli[66, 67]. The remaining significant mutations from Figure S3 that occur in the protein-encoding genes are same-sense (not amino acid changing) mutations. In the cases of eptA (Ser, TCG to TCA), oadA (Ala, GCC to GCA), the AraJ precursor gene (Leu, CTG to CTA), and the second gcd mutation (Thr, ACG to ACA), the codon changes from a commonly used codon in the susceptible genomes to the least preferred codon in the resistant genomes. In the cases of the nrfE/nrfF mutation (Asn, AAT to AAC) and the first gcd mutation (Asp, GAC to GAT), the resistant genomes have the preferred codon of the pair. Whether these SNPs have a modulating effect on protein translation or contribute to the fitness of the resistant organisms requires further analysis.
Discussion
In this study, we have built machine learning-based MIC prediction models for nontyphoidal Salmonella genomes using XGBoost[41] that achieve overall accuracies of 95-96% within ± 1 two-fold dilution factor. To our knowledge, this is one of the largest and most accurate MIC prediction models to be published to date. Importantly, it provides a model strategy for performing MIC prediction directly from genome sequence data that could be applied to other human or veterinary pathogens.
The success of our MIC prediction model was dependent on the large, publicly available, population-based collection of genomes with associated metadata. Since researchers often focus on collecting highly resistant or otherwise unusual strains, the opportunities to generate balanced models are rare. We demonstrate the many benefits from comprehensive sampling for the entire range of possible MICs. First, diverse and balanced data sets improve model accuracies because there is better sampling across all MIC dilutions. Second, having balanced data enabled us to achieve acceptable ME and VME rates for 7 of the 15 antibiotics used in the study. Third, compared with our recent model for Klebsiella pneumoniae, the larger and more balanced data set in nontyphoidal Salmonella enabled us to build models for individual antibiotics that had similar accuracies to the combined model. This enabled us to begin to disambiguate the important genomic regions relating to resistant and susceptible MICs. Finally, we show that MICs in the susceptible range can be accurately predicted with the algorithm using all genomic data rather than scoping it to known AMR genes or gene polymorphisms. This contrasts with prior work correlating MICs to known resistance mechanisms in Salmonella[68]. In future studies, our strategy could be used as a starting point for identifying the subtle genomic changes that result in different MICs.
For each single-antibiotic model, we analyzed the k-mers that had high feature importance values and were important to the models for predicting MICs. The highly ranking k-mers that were enriched in the resistant genomes mainly occurred within or in close proximity to well-known AMR genes. With the exception of the gyrA k-mer, the highly ranking k-mers that were enriched in the susceptible genomes were significant in several cases, but more difficult to interpret. Some of these susceptibility k-mers hint at a possible relationship between AMR and oxidative stress or electron transport, such as the k-mers matching components of the nitrate and nitrite reductases and pqq-dependent glucose dehydrogenase, which is consistent with the known link between antibiotics and oxidative stress[69, 70]. The molecular mechanisms underlying the susceptibility k-mers and AMR phenotypes should be further investigated.
The genomes in this study were collected over a 15-year period from 41 U.S. states. By building models encompassing ranges of earlier dates, we demonstrated stable and accurate MIC prediction for genomes collected at later dates. Presently, we are not aware of any large publicly available collections of Salmonella genomes with MIC data from other countries. Since AMR gene content may vary across pathogen populations, validation of the Salmonella models using strains from other countries is important to their application in global health. Nevertheless, the present analysis clearly demonstrates that the current model provides accurate MIC predictions for United States isolates. Similarly, an analysis of this model on Salmonella Typhi strains would provide information about the utility of the model over broader phylogenetic distances.
Author contribution statement
MN: study design, experiments, data generation, manuscript preparation; SWL: study design; PFM: study design, data generation; RJO: study design; RO: software engineering; RLS: study design; GHT: study design, data generation; SZ: study design, data generation; JJD: study design, data generation, manuscript preparation.
Additional Information
Accession Codes
Data are available under BioProjects PRJNA292661 and PRJNA292666. SRA run accessions for each genome are listed in Table S1.
Competing financial interests
The authors claim no competing financial interests.
Disclaimer
The views expressed in this article are those of the authors and do not necessarily reflect the official policy of the Department of Health and Human Services, the U.S. Food and Drug Administration, and Centers for Disease Control and Prevention or the U.S. Government. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture or Food and Drug Administration.
Acknowledgements
This work was supported by the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services [Contract No. HHSN272201400027C]. We thank Emily Dietrich for her helpful edits.