Benchmarking software to predict antibiotic 1 resistance phenotypes in shotgun 2 metagenomes using simulated data 3 4

A: Nell Hodgson Woodruff School of Nursing, Emory University, Atlanta, GA, US 8 B: Population Biology, Ecology, and Evolution Program, Graduate Division of Biological and 9 Biomedical Science, Emory University, Atlanta, GA, US 10 C: Cockrell School of Engineering, The University of Texas at Austin, Austin, TX 11 D: Division of Infectious Diseases, Department of Medicine, School of Medicine, Emory 12 University, Atlanta, GA, US 13 E: Department of Gynecology & Obstetrics, Emory University School of Medicine 14 F: Department of Human Genetics, School of Medicine, Emory University, Atlanta, GA, US 15 16


Introduction 63
Antibiotic resistant bacterial infections pose a serious threat to public health. Particularly 64 concerning is that the burden of multi-drug resistant pathogens is increasing globally, creating 65 complex clinical scenarios in which there are limited (if any) therapeutic options. In the United 66 States alone, multi-drug resistant infections cost over $4.5 billion annually and kill over 35,000 67 people each year. 1 Genes that confer antimicrobial resistance (AMR) are increasingly present in 68 commensal members of the human microbiome and are recognized as an important reservoir for 69 conferring pathogen resistance through horizontal gene transfer. 2,3 Detecting AMR potential 70 through non-culture based, high throughput DNA sequencing and bioinformatic approaches is of 71 growing relevance and importance. Two key approaches to mitigating AMR infections are 72 antibiotic stewardship and AMR surveillance. While antibiotic stewardship focuses on 73 prescribing antibiotics appropriately, AMR surveillance focuses on describing AMR genes 74 already present in a community. 75 AMR surveillance is a key strategy in understanding the threat of AMR. Currently, AMR 76 surveillance typically relies on phenotypic characterization through culture or genotypic 77 characterization through molecular diagnostics based on PCR and hybridization techniques. 4 78 However, there is a move toward genome-based methods 5 with the Illumina short-read platform 79 being the dominant platform for data generation at the present time. 6 Direct sequencing of 80 clinical samples using shotgun metagenomic approaches is of growing interest for minimizing 81 sample processing and for fully characterizing the commensal members of the microbiome. 82 However, the bioinformatic tools that currently exist to detect AMR have typically not been 83 assessed for their performance on shotgun metagenomic sequence data. Further, as is common 84 with software developed in academic settings, tools are not always maintained or easy to install. 85 Software managers like conda and docker help to alleviate this problem, however, it can still be 86 difficult for those without a bioinformatics background to understand the state of the tools and 87 select the best one for their needs. 88 As shotgun metagenomic sequencing is emerging as a powerful tool for detecting and 89 controlling AMR, 7 it is essential to understand how well these tools perform with these data. In 90 addition to testing these tools against a widely available data type, they should be compared 91 against samples with extensive phenotypic resistance (acquired and mutational AMR genes). 92 This analysis aims to compare a set of existing bioinformatic tools in their ability to 93 accurately identify AMR genes in a community. We describe a software pipeline, hAMRoaster, 94 that provides statistics on accuracy of software when the presence of phenotypes is known. As 95 shotgun metagenomic data is more often used in research and surveillance, and likely soon in 96 clinical diagnostics, 8 we believe this approach of validating tools using synthetic data will be 97 important in selecting the most appropriate software. hAMRoaster was written as a Python script to take three inputs: a) the text output of 103 AMR prediction run tool on a FASTQ or FASTA test file, such as a text file processed through 104 hAMRonization, 9 b) a list of known phenotypes associated with the test file and c) (optional) a 105 tab formatted This table is identical to that produced by the hAMRonization 9 software. hAMRonization is 112 conda installable and can compile the outputs of many AMR tools into a unified format. 113 shortBRED 11 and fARGene 12 are not included in hAMRonization at the time of analysis, so 114 hAMRoaster can take the path to the raw output for these tools and partially match it to the 115 hAMRonization output. 116 hAMRoaster requires an input to the "known" phenotypic resistance in the mock 117 community (--AMR_key flag of hAMRoaster), such as a result of susceptibility testing tables 118 that are available from NCBI Biosamples. Antibiotics in the table of known resistances are 119 matched to their respective drug classes. Results classified as "susceptible" or "intermediate" in 120 susceptibility testing are filtered out so only resistant instances are considered. In cases where 121 susceptibility testing occurred with two or more agents, each agent is considered independently 122 (e.g. resistance to "amoxicillin-tetracycline" was treated as resistance to "amoxicillin" and 123 "tetracycline" independently). Each identified AMR gene is labeled with its corresponding drug 124 class for comparison. In instances where a gene confers resistance to multiple drug classes, the 125 detected gene is split into multiple rows so that each conferred resistance can be independently 126 compared to what is known from the susceptibility testing. Gene to drug class linkage is verified 127 using the CARD database 13 when applicable. Any genes corresponding to 'unknown' or 'other' 128 drug classes (including hypothetical resistance genes) are excluded from further analysis. Genes 129 that confer resistance to an antibiotic that was only effective in combination with another drug 130 (e.g. clavulanic acid in amoxicillin-clavulanic acid) are classified as 'Other' and excluded from 131 analysis. 132 A detected AMR gene is labeled as a true positive by hAMRoaster if the drug class 133 matched to an AMR gene corresponds to a drug class represented in the mock community. 134 Similarly, a false positive is coded as a drug class that is called by the software, but tested as 135 susceptible in the mock community (--AMR key parameter). Observed AMR genes are labeled 136 "Unknown" if the corresponding drug class is not tested in the mock community and not 137 included in the AMR key file. Once true/false positives and true/false negatives are determined 138 per tool, hAMRoaster calculates sensitivity, specificity, precision, accuracy, recall, and percent 139 unknown. 140

Creation of a synthetic mock communities of antibiotic resistance bacteria 141
Bacterial members of the base mock community were chosen from NCBI's BioSample 142 Database 14 and met the following criteria: (1) the strain had extensive antibiotic susceptibility 143 testing data using CLSI or EUCAST testing standards as part of the public NCBI BioSample 144 record; (2) the strain was isolated from human tissue; (3) the strain was the cause of a clinical 145 infection; (4) the FASTA was available to download from NCBI BioSample Database. 14 Eight 146 bacteria, each representing a different species, with overlapping resistance to 43 antibiotics 147 across 18 drug classes, were selected for the mock community ( Running antibiotic prediction software on mock communities 159 All tools for AMR prediction were run on the mock community at all coverage levels 160 using default settings for either simulated FASTQ or assembled contigs. When both options were 161 available, assembled contigs were run. 162

Statistical Analysis 163
Data were analyzed in Python v3.7.7 and plotted in R v4.0.4. In initial runs we found that 164 some tools provided results with a very high number of observed AMR genes because of 165 multiple overlapping matches on the same gene. Because of this, we condensed the results so 166 that the first observed gene is included in the dataset and subsequent genes that start before the 167 observed end of that gene were not included. Unweighted Cohen's kappa was calculated for each 168 pairwise combination of tools to test agreement between tools. 169

Results 174
Selection of nine open source, conda-installable tools for detection of antibiotic resistance 175 phenotypes 176 To identify tools for antibiotic resistance prediction, we used a multi-headed search 177 strategy. We searched PubMed using terms "AMR", "antibiotic resistance genes", 178 "bioinformatics", and "antimicrobial resistance". We also searched GitHub using the same set of 179 terms. Once an initial list of tools was compiled, we performed a second PubMed literature 180 review including the search terms from above plus the names of the tools ("tool 1" OR "tool 2"). 181 We also used Twitter to ask the research community what bioinformatic tools they use to 182 identify AMR (link available in supplementary materials). These searches identified 16 potential 183 tools to identify AMR genes ( Table 2). The search for tools concluded on March 1, 2021. 184 In order for an identified tool to be considered eligible for comparison, it had to meet the 185 following criteria: (1) be conda or Docker installable; (2) have source code publicly available in 186 a data repository and be actively maintained (defined as tool updates or GitHub responses within 187 the last year); (3) have an open source license; and (4) take FASTQs or FASTAs as input files. 188 Nine tools met the criteria to be included in this analysis: ABRIcate 17 , fARGene 18 ResFinder 19 , 189 shortBRED 11 , RGI 20 , AMRFinderPlus 21 , starAMR 22 , sraX 23 , and deepARG 24 . PointFinder 190 also qualified 25 , but was a subtool of ResFinder and only identified mutational resistance for 191 some organisms, so it was excluded from analysis. The code used to install and run all tools is 192 available on the hAMRoaster GitHub. 193 Identified tools fell into two groups -those that aligned reads to a database, and those that 194 compared reads against some model of AMR (Table 2)

Range of phenotypes detected 244
Overall, the number of AMR genes detected across all tools ranged from 13 to over 700 245 at 100x coverage ( Table 3). For some tools, genes detected did not match to a tested phenotype 246 in the mock community, so the prediction fell into the "unknown" category. Among the tools 247 tested, AMR Finder Plus had the highest degree of unclassifiable/unknown results (observed 248 AMR gene not testing in the mock community). An overview of these results are available in 249 The highest sensitivity for phenotype detection ranged from >0.99 (RGI) to 0.23 (sraX) at 252 the lowest coverage levels (Fig. 2). In general, coverage did not greatly affect sensitivity, with 253 the exception of sraX, which increased to 0.53 at the highest level. fARGene and deepARG had 254 a high sensitivity value (>0.90) at all coverage levels. RGI, deepARG, and fARGene are all tools 255 that compare reads to a model of AMR instead of aligning reads directly to a database, indicating 256 that this method may be appropriate when high sensitivity values are preferred. As a note, in this 257 dataset, there were only 2 possible true negatives because only two drug classes were always 258 susceptible to antibiotics in those two drug classes when tested (nitrofuran and polypeptide). 259 When all software predictions were combined we achieved the 0.99 sensitivity across the 260 coverage (Supplementary Table 1). However this came at the cost of low specificity 0.11 . 261 Specificity in this study is artificially low for most tools because the number of possible true 262 negatives is low (only two). Therefore we did not assess this metric. 263

Condensing Results 264
All tools provide results in which the detected AMR genes are overlapping, where one 265 gene starts between the start and stop codon of another. If we remove overlapping genes so that 266 only the first detected gene was included, and all genes that started before its stop codon were 267 removed, the counts for all tools decrease (Table 4). However, this process does not necessarily 268 improve metrics or counts, and it is unclear that such a tactic is useful for real life uses as there is 269 no simple way to determine which detected AMR genes to include and which should be filtered 270 out. 271

Concordance between tools 272
An analysis of the agreement between tools of detected AMR genes within drug classes 273 revealed that overall, there was low agreement (<0.50) between tools at all coverage levels 274 ( The central challenge in developing this software was to compare detected AMR genes to 286 resistance phenotypes. Detected AMR genes needed to be classified by their corresponding drug 287 class(es) so they could be matched to the known phenotypically resistant drug classes. One 288 hurdle in this translation is that tools use different databases, and some databases classify genes 289 differently. For example, shortBRED classifies gene families, while CARD classifies specific 290 genes. While this analysis checked the drug classification via the DNA/Protein Accession value 291 in CARD, only around 300 of the >1,000 genes detected could directly map to genes in CARD 292 by accession value. The hAMRonization tool overcomes this challenge by providing a drug class 293 column and filling in the values from ChEBI ontology 41 when possible. The hAMRoaster 294 strategy is to assign a CARD drug class value to every detected AMR gene first by accession 295 number, then by gene name. If neither of these methods assign a drug class for an AMR gene, 296 then the drug class provided by hAMRonization is used. Another challenge in converting 297 detected AMR genes to drug classes is that some drugs are only administered in combination, for 298 example clavulanic acid with amoxicillin. For these instances, resistance to the drug only used in 299 combination (e.g. clavulanic acid) is treated as an "other" drug class and excluded from analysis. 300 In these cases, we used the experience of practicing clinicians to identify combination 301 antibiotics. 302 The analysis presented here used synthetic data to compare tool performance. Synthetic 303 data has the benefit of allowing controlled input with known ground truth. Therefore users can 304 focus on the types of organisms and phenotypes they need to to detect in their own datasets, 305 perform experiments with real samples, and manipulate a range of factors such as relative 306 abundance and sequencing error. The NCBI BioSample repository (used in this study) is an 307 invaluable resource for creating such datasets as it contains many samples with AMR phenotypes 308 determined by international standards. Researchers could also sequence and phenotype 309 culturable organisms in their own laboratories to provide testing standards to evaluate software. 310 Here, we exclusively examined synthetic short read Illumina data, but this analysis strategy 311 could be adapted to understand the effect of using data generated on long read technologies such 312 as the Pacific Bioscience and Oxford Nanopore platforms. 313

Overall trends in performance and reasons for variability between tools 314
Tools used one of two basic strategies, either aligning reads to a database of AMR genes 315 or using a more complex model of sequenced-based AMR detection ( Table 2). The methods 316 appear to lead to the different AMR genes detected across tools, as demonstrated in Figure 1 and 317 summarized in Table 3. 318 We found the sensitivity of almost all tools to be very good (>0.80), with the exception 319 of sraX, which had a proportionally high number of false negatives compared to true positives. 320 All tools except shortBRED and starAMR detected a large number of genes that were not 321 associated with a lab-determined phenotype in our mock community. This is a feature of the 322 approach of limiting focus to a specific set of phenotypes in the testing process. In practice, 323 researchers and epidemiologists may be only interested in a narrow range of AMR phenotypes. 324 hAMRoaster calculates specificity, precision, accuracy, recall , and F1 (Table 3). 325 However, all of these measures are dependent on false positives and/or true negatives in their 326 calculations. As these values are inherently low in our mock community due to the robust 327 resistance profile, these metrics are not particularly informative for understanding how well these 328 tools detect resistance in this phenotypically resistant sample. Similarly, we calculated all 329 effective metrics when the results of all tools are combined. While sensitivity in the combined 330 data was very high (>0.99), there was a very high number of overall detected AMR genes, 331 including overlapping results between genes, thus,.it would be difficult for researchers to 332 meaningfully use this type of result to understand the AMR profile. We calculated Cohen's 333 kappa to capture the agreement at the drug class level between AMR tools to see if all AMR 334 tools detected resistance to the same drug classes. We found that agreement was surprisingly 335 low across all tools ( Finally, this research supports the need for the further development of software tools for 339 the detection of AMR genes in the human microbiome. It is increasingly recognized that the 340 confined location and genetic diversity of this microbial population provides ideal conditions for 341 genetic exchange among residential microbes and between residential and transient, including 342 pathogenic microbes. Notably, rates of horizontal gene transfer among bacteria in the human 343 microbiome (especially the gastrointestinal tract) are estimated to be many times higher than 344 among bacteria in other diverse ecosystems, such as soil. 42 Refined tools appropriate for use in 345 shotgun metagenomic data will be important for tracking the spread of AMR genes from diverse 346 environmental sources to the human microbiome and across sites in the human body and 347 understanding whether AMR genes are derived from vertical inheritance or via horizontal gene 348 transfer, for example. 349 In conclusion, this study compared bioinformatics tools for detecting AMR genes in a 350 simulated short read metagenomic sample at three coverage levels at one time point. While tools 351 use slightly different methods and databases, these tools overall had high sensitivity for detection of AMR genes. Moreover agreement between tools was low, indicating the importance of tool 353 selection. In our test set we found starAMR had the highest sensitivity value with fewer than 354 20% unknown detected genes at all coverage levels. We advocate that researchers should test 355 these software tools using pipelines such as hAMRoaster with a synthetic community that 356 highlights the resistance profiles and sample of interest. In particular, this assessment of 357 performance of available tools should take place before the commencement of the study as the 358    highly sensitive (greater than 0.80). All genes corresponding to "Other" or "Unknown" drug 565 classes were not included in these calculations. Similarly, AMR genes corresponding to 566 phenotypic resistance that was not tested in the mock community was considered "Unknown" 567 and not included in the sensitivity analysis. 568     data were condensed so that overlapping genes are excluded from the count data (i.e. genes that 601 start between the start and stop codon of another gene are not considered in analysis).