Speed, accuracy, sensitivity and quality control choices for detecting clinically relevant microbes in whole blood from patients

Infections are a serious health concern worldwide, particularly in vulnerable populations such as the immunocompromised, elderly, and young. Advances in metagenomic sequencing availability, speed, and decreased cost offer the opportunity to supplement or replace culture-based identification of pathogens with DNA sequence-based diagnostics. Adopting metagenomic analysis for clinical use requires that all aspects of the pipeline be optimized and tested, including data analysis. We tested the accuracy, sensitivity, and resource requirements of Centrifuge within the context of clinically relevant bacteria. Binary mixtures of bacteria showed Centrifuge reliably identified organisms down to 0.1% relative abundance. A staggered mock bacterial community showed Centrifuge outperformed CLARK while requiring fewer computing resources. Shotgun metagenomes obtained from whole blood in three febrile neutropenia patients showed Centrifuge could identify both bacteria and viruses as part of a culture-free workflow. Finally, Centrifuge results changed minimally when time-consuming read quality control and host screening steps were eliminated.

AUTHOR SUMMARY
Immunocompromised patients, such as those with febrile neutropenia (FN), are susceptible to infections, yet cultures fail to identify causative organisms ~80% of the time. High-throughput metagenomic sequencing offers a promising approach for identifying pathogens in clinical samples. Mining through metagenomes can be difficult given the volume of reads, overwhelming human contamination, and lack of well-defined bioinformatics methods. The goal of our study was to assess Centrifuge, a leading tool for the identification and quantitation of microbes, and to provide a streamlined bioinformatics workflow for real-world data from FN patient blood samples. To ensure the accuracy of the workflow, we carefully examined each step using known bacterial mixtures that varied by genetic distance and abundance.
We show that Centrifuge reliably identifies microbes present at just 1% relative abundance and requires substantially less compute time and fewer computing resources than CLARK. Moreover, we found that Centrifuge results changed minimally when quality control and host screening were omitted, allowing for a further reduction in compute time. Next, we leveraged Centrifuge to identify viruses and bacteria in blood draws from three FN patients, and confirmed suspected pathogens using genome coverage plots. We developed a web-based tool in iMicrobe and detailed protocols to promote re-use.


INTRODUCTION
The current gold standard for clinical diagnosis of infections relies on isolating organisms by culture-based methods followed by identification and drug resistance testing. Methods for identifying pathogens that rely on culture have several drawbacks including fastidious bacteria, the time required for growth in culture, and the difficulty targeting viruses, fungi, and parasites.

Also, the hazard ratio of dying was nearly four-fold higher in culture-negative patients than for patients where no culture was taken (presumably due to lack of fever), indicating the high cost in lives when cultures fail. Therefore, we seek to apply metagenomic sequencing to overcome the low rate and time delay of culture-based diagnostic methods in clinical settings such as febrile neutropenia.

The potential of metagenomic shotgun sequencing has been demonstrated in a broad

On the data analysis side, there are no standards for analysis of metagenomic data obtained from clinical samples; however, there have been recent innovations in taxonomic classification algorithms that make it possible to quantify microbial species directly from reads in metagenomic datasets rapidly. These algorithms use two main approaches to assign reads to species in a reference database including: (1) a mapping approach using a Burrows-Wheeler

sequence datasets. We tested the linearity and threshold for detection of Centrifuge using three sets of bacterial mixtures, selected to represent taxonomic distances from phylum to genus-level. We created dilution mixtures over a six-log range of relative abundance with each organism ranging from 0.1% to 99.9% of the mixture (Figure 1). Centrifuge correctly identified all four species in the mixtures and misidentified less than one percent of the reads in any of the 18 combinations sequenced (false positives, Figure 1). Centrifuge was sensitive to the lowest relative abundance (0.1%) in four out of six opportunities, failing to detect the extremes in the E. coli/S. saprophyticus mixture. Reads matching phage present in the mixtures were classified and quantitated by Centrifuge separately from their host genomes. Because the phage relative abundance estimates were not included with their host, the bacteria present were underestimated so that the abundance estimates shown in Figure 1 do not add to 100%.

Relative to CLARK, Centrifuge required less than a tenth of the memory and a quarter of the runtime, while using half the number of central processing units (Table 1).
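The effect of counting phage separately from their bacterial hosts, noted above for Figure 1, can be illustrated with a small arithmetic sketch. The read counts below are invented for illustration and are not data from this study, and abundance is treated here as a simple share of classified reads, which is a simplification of how Centrifuge estimates abundance.

```python
def relative_abundances(read_counts):
    """Return each taxon's share of all classified reads, in percent."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no classified reads")
    return {taxon: 100.0 * n / total for taxon, n in read_counts.items()}


# Hypothetical read counts for one binary mixture (NOT data from this study).
counts = {
    "Escherichia coli": 880_000,
    "Staphylococcus saprophyticus": 100_000,
    "phage (classified separately)": 20_000,
}

abund = relative_abundances(counts)

# Summing only the bacterial entries leaves out the phage share,
# so the bacterial abundances fall short of 100%.
bacterial_total = sum(v for k, v in abund.items() if not k.startswith("phage"))
print(f"bacterial abundances sum to {bacterial_total:.1f}%")
```

Under these assumed counts the two bacterial estimates sum to 98.0%, with the remaining 2.0% assigned to the phage entry.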

Pathogens were enriched using a simple sample preparation method from whole blood samples drawn from three patients with febrile neutropenia, and the resulting metagenomic DNA sequenced. Table 2 shows the starting number of raw reads and the percent passing through each step from quality control, to host-screening by alignment, and finally Centrifuge analysis. The reads classified by Centrifuge identified three likely pathogens: Pseudomonas fluorescens with a relative abundance of 50.7% in patient 1, Human parvovirus with a relative abundance of 99.8% in patient 2, and Torque teno virus in patient 3 with a relative abundance of 62.8% (Figure 3). Comparing the percentages shown in Table 2 with the relative abundances calculated by

Figure 6A shows the relative amount of reads that were classified as human, microbial, or unknown when the datasets were analyzed by Centrifuge without removing reads by alignment to the human genome before analysis. The relative proportion of host (human) reads in the data agreed well with the proportions found by alignment (see Table 2). While the proportion of host DNA was less than in prior studies, suggesting that the enrichment for pathogen DNA used in this study was successful, a significant proportion of the reads were still human.

Having established that a significant proportion of the reads in the datasets were of host origin by both alignment and Centrifuge, we compared three approaches for removing host reads

lack of well-defined bioinformatics methods. The goal of our study was to assess Centrifuge, a leading tool for identification and quantitation of metagenomic data, using clinically relevant datasets to establish its accuracy in microbial/viral identification and abundance estimates with an eye toward reducing compute time.

The first dataset used to assess Centrifuge was a series of binary bacterial mixtures chosen for their phylogenetic distance and mixed so that each pair was combined across six logs

bacteria, E. coli and S. flexneri, even when one of the organisms was present as 0.1% of the mixture. As the proportion of E. coli decreased, the relative abundance estimate diverged from expected, so that the E. coli estimate was 2.1% when E. coli was only 0.1% of the mixture. The same inaccuracy did not occur as the S. flexneri relative abundance decreased to 0.1%, suggesting Centrifuge misidentified a portion of the S. flexneri genome as E. coli but not the other way around. The difficulty classifying S. flexneri is also suggested by the fact that the false positive rate increased from 0% to 1%, the highest measured, as S. flexneri relative abundance increased. One likely cause for more relative matches to E. coli than S. flexneri is that E. coli strains and isolates represent the most substantial fraction of the Centrifuge reference database.
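As a rough check on the E. coli overestimate described above, the following sketch computes what fraction of S. flexneri reads would have to be assigned to E. coli to turn a true 0.1% into an observed 2.1%. It is a back-of-the-envelope model, not the authors' analysis, and it assumes a two-organism mixture, abundance proportional to read share, no misassignment from E. coli to S. flexneri, and no reads lost to other taxa. The function name is hypothetical.

```python
def implied_misassignment_rate(true_minor, observed_minor):
    """Fraction of the major organism's reads that must be assigned to the
    minor organism to explain the observed minor-organism abundance.

    Assumes: observed_minor = true_minor + rate * (1 - true_minor).
    """
    true_major = 1.0 - true_minor
    return (observed_minor - true_minor) / true_major


# Values from the text: E. coli truly 0.1% of the mixture, estimated at 2.1%.
rate = implied_misassignment_rate(true_minor=0.001, observed_minor=0.021)
print(f"implied S. flexneri -> E. coli misassignment: {rate:.2%}")
```

Under these assumptions, roughly 2% of S. flexneri reads assigned to E. coli would be enough to account for the reported discrepancy.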

Centrifuge appears to be capable of detecting organisms even when they are present in minor abundance, regardless of the phylogenetic distances between them. Overall, Centrifuge

relative abundance estimates. A reasonable assumption would be that as phylogenetic distance increases, the number of discriminatory k-mers increases to allow for better read classification by Centrifuge. Instead, we observed high classification accuracy for the most closely related pair (E.

We compared Centrifuge's performance against another leading k-mer based taxonomic classifier, CLARK, in analyzing sequence data from a more complex community of 20 bacteria. The mock community was also mixed in varying relative abundances as with the binary mixtures, albeit in a different range (~0.01-35%). Abundance calculations between the two algorithms were nearly identical across the relative abundance range; however, the processing time and computational resources for CLARK were greater (Table 1).

Centrifuge abundance report results were filtered to only include organisms at the species or strain-level with a minimum of 1% of total reads classified and at least 5% abundance as calculated by Centrifuge. Similarly to the bacterial mixtures, no phage or prophage passed the filters above, so there was no effect on relative abundance calculations.
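The report filter described above could be sketched as follows. This is an illustration under stated assumptions rather than the authors' script: the column names follow the tab-separated Centrifuge summary report, the `abundance` column is assumed to be a fraction between 0 and 1, "strain-level" is assumed to correspond to the `leaf` rank label, and "total reads classified" is approximated as the sum of `numReads` over all rows. The function name, the taxa, and all counts in the example are hypothetical.

```python
import csv
import io


def filter_report(rows, min_read_frac=0.01, min_abundance=0.05,
                  ranks=("species", "leaf")):
    """Keep taxa at the chosen ranks that hold at least `min_read_frac` of
    all classified reads and at least `min_abundance` estimated abundance."""
    rows = list(rows)
    total_reads = sum(int(r["numReads"]) for r in rows)
    if total_reads == 0:
        return []
    kept = []
    for r in rows:
        if r["taxRank"] not in ranks:
            continue  # drop genus-level and higher entries
        if int(r["numReads"]) / total_reads < min_read_frac:
            continue  # fewer than 1% of classified reads
        if float(r["abundance"]) < min_abundance:
            continue  # below 5% estimated abundance
        kept.append(r["name"])
    return kept


# Hypothetical report contents (NOT data from this study).
header = ["name", "taxID", "taxRank", "genomeSize",
          "numReads", "numUniqueReads", "abundance"]
records = [
    ["Species A", 1001, "species", 6500000, 6000, 5500, 0.60],
    ["Species B", 1002, "species", 5000000, 50, 40, 0.02],
    ["Genus X", 1003, "genus", 0, 3000, 0, 0.0],
    ["Species C", 1004, "species", 3800, 950, 900, 0.38],
]
tsv = "\n".join("\t".join(map(str, rec)) for rec in [header] + records)

rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
kept = filter_report(rows)
print(kept)
```

With these made-up rows, only Species A and Species C pass: Species B falls below both thresholds and Genus X is excluded by rank. The rank labels and the choice of denominator should be checked against the actual report before relying on a filter like this.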