Abstract
Bacteriophages and plasmids usually coexist with their host bacteria in microbial communities and play important roles in microbial evolution. Accurately identifying sequence contigs as phages, plasmids, and bacterial chromosomes in mixed metagenomic assemblies is critical for further unravelling their functions. Many classification tools have been developed for identifying either phages or plasmids in metagenomic assemblies. However, only two classifiers, PPR-Meta and viralVerify, were proposed to simultaneously identify phages and plasmids in mixed metagenomic assemblies. Due to the very high fraction of chromosome contigs in the assemblies, both tools achieve high precision in the classification of chromosomes but perform poorly in classifying phages and plasmids. Short contigs in these assemblies are often wrongly classified or classified as uncertain.
Here we present 3CAC, a new three-class classifier that improves the precision of phage and plasmid classifications. 3CAC starts with an initial three-class classification generated by existing classifiers and further improves the classification of short contigs and contigs with low confidence classification by using proximity in the assembly graph. Evaluation on simulated metagenomes and on real human gut microbiome samples showed that 3CAC outperformed PPR-Meta and viralVerify in both precision and recall, and increased F1-score by at least 10 percentage points.
1 Introduction
The metagenomes of microbial communities are mainly composed of bacterial chromosomes and the associated extrachromosomal mobile genetic elements (eMGEs), such as plasmids and bacteriophages (phages). These eMGEs carry genes related to antibiotic resistance [6, 36, 19], virulence factors [14, 30] and auxiliary metabolic pathways [11, 28, 12]. They can frequently move between species in the microbial community [32, 8] and enable their hosts to rapidly adapt to environmental changes [35, 33]. Despite their important roles in horizontal gene transfer events and in antibiotic resistance, our understanding of these eMGEs is still limited. Part of the difficulty is the challenge of identifying such elements efficiently from mixed metagenomic assemblies [3, 16, 1, 2, 24, 34, 38].
Multiple algorithms have been developed for identifying either phages or plasmids from metagenomic assemblies in recent years. VirSorter and VirSorter2 identify viral metagenomic fragments by searching for reference homologs and testing enrichment of virus-like proteins [29, 10]. These knowledge-based tools have high precision in virus classification but poor ability to identify novel viruses, due to reference database-associated bias. Other tools, such as deepVirFinder [27], Seeker [4], and VIBRANT [12], use machine learning to learn k-mer signatures of viral sequences and perform better on novel virus classification, since they are more loosely linked to annotation databases. cBar was the first tool designed primarily for plasmid identification in metagenomes [40]. More recently, two supervised-learning approaches, PlasFlow [15] and PlasClass [23], were shown to classify plasmid fragments from metagenomic assemblies more accurately. Although both phages and plasmids are commonly found in the metagenomes of microbial communities, all of these tools identify either only phages or only plasmids from metagenomic assemblies.
Currently, only two published tools, PPR-Meta [7] and viralVerify [2], can identify phages and plasmids simultaneously from metagenomic assemblies. However, due to the overwhelming abundance of chromosome fragments in the assemblies (usually ≥ 70%), both tools achieve high precision in chromosome classification but very low precision in classification of phages and plasmids [7, 2]. Moreover, classification of short contigs is challenging for all the existing classifiers, as they analyze each contig independently [2, 7, 15, 29, 26]. Here we present 3CAC (3-Class Adjacency based Classifier), an algorithm that employs existing two-class and three-class classifiers to generate an initial three-class classification with high precision, and then improves the classification of short contigs and of contigs classified with lower confidence by taking advantage of classification of their neighbors in the assembly graph. Evaluation on simulated and real metagenome datasets with short and long reads showed that 3CAC improved both precision and recall, and increased F1-score by at least 10 percentage points.
2 Methods
3CAC accepts as input a set of contigs and their associated assembly graph, uses the classification result of existing tools as a starting point, and repeatedly improves the classification using the assembly graph. Its output is a classification of each contig in the input as phage, plasmid, chromosome, or uncertain. The details of the algorithm are described below.
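To make the input concrete, the following minimal sketch builds a contig adjacency map from an assembly graph. It assumes the graph is available as GFA1 text and that each segment (S line) corresponds to one contig; neither assumption is stated in the text, and the function name `read_gfa_adjacency` is hypothetical.

```python
def read_gfa_adjacency(lines):
    """Build an undirected adjacency map {contig: set(neighbors)} from GFA1 lines.

    Assumption: every S line names one contig and every L line is an overlap
    between two contigs. Orientation and overlap length are ignored.
    """
    neighbors = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 2:
            # Register the contig even if it never appears in a link.
            neighbors.setdefault(fields[1], set())
        elif fields[0] == "L" and len(fields) >= 4:
            a, b = fields[1], fields[3]
            if a != b:  # a self-loop does not provide a neighboring contig
                neighbors.setdefault(a, set()).add(b)
                neighbors.setdefault(b, set()).add(a)
    return neighbors


# Toy graph: c1 and c2 overlap, c3 is isolated.
example = [
    "S\tc1\tACGT",
    "S\tc2\tGGCA",
    "S\tc3\tTTAA",
    "L\tc1\t+\tc2\t-\t0M",
]
graph = read_gfa_adjacency(example)
```

Treating the graph as undirected is a simplification that suits a neighbor-based refinement, since only the presence of an overlap matters there.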
2.1 Generating the initial classification
3CAC exploits existing two-class and three-class classifiers to generate an initial three-class classification as follows.
(1) Generating a three-class classification. The algorithm runs either viralVerify or PPR-Meta on the set of input contigs and classifies each contig as phage, plasmid, chromosome, or uncertain. viralVerify was designed to classify contigs as viral, non-viral, or uncertain. For non-viral contigs, viralVerify can further classify them as plasmid or non-plasmid using the -p option. Here, we used the -p option of viralVerify to classify each of the input contigs as viral, plasmid, chromosome, or uncertain. PPR-Meta calculates three scores representing the probabilities that a contig is a phage, a plasmid, or a chromosome. By default, PPR-Meta classifies a contig into the class with the highest score. If a score threshold is specified and no score passes the threshold, the sequence is classified as uncertain. Here, we ran PPR-Meta with a score threshold of 0.7.
(2) Improving plasmid classification. To improve the precision of plasmid classification, PlasClass is run on contigs classified as plasmids in step (1). PlasClass outputs for each contig the probability that it originated from a plasmid. By default, PlasClass classifies a contig as plasmid if it has a probability > 0.5 and as chromosome otherwise. To ensure high precision, here we keep contigs with probability ≥ 0.7 as plasmids. Contigs with probability ≤ 0.3 are moved to the chromosome class. The remaining contigs are reclassified as uncertain.
(3) Improving phage classification. Similarly, to improve the precision of phage classification, we run deepVirFinder on all contigs classified as phages in step (1). deepVirFinder generates a score and a p-value for each input contig. Contigs with higher scores or lower p-values are more likely to be viral sequences. Here, a contig is kept in the phage class if its p-value ≤ 0.03 and moved to the chromosome class if its p-value > 0.03 and its score ≤ 0.5. The remaining contigs are reclassified as uncertain.
We will denote the algorithm up to this step Initial(vV) and Initial(PM) if viralVerify or PPR-Meta were used in step (1), respectively.
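The threshold rules of the initial classification can be sketched as follows. This is an illustration only, not the 3CAC implementation: it does not call PPR-Meta, PlasClass, or deepVirFinder, and it assumes their per-contig scores have already been parsed into plain Python values. All function names are hypothetical, and treating a score equal to the PPR-Meta threshold as passing is an assumption.

```python
def ppr_meta_label(scores, threshold=0.7):
    """Pick the highest-scoring class, or 'uncertain' if no score reaches the threshold.

    scores: dict mapping 'phage', 'plasmid', 'chromosome' to a score.
    Assumption: a score equal to the threshold counts as passing.
    """
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "uncertain"


def refine_plasmid(prob, keep=0.7, reject=0.3):
    """Re-examine a contig called plasmid, given its PlasClass probability."""
    if prob >= keep:
        return "plasmid"
    if prob <= reject:
        return "chromosome"
    return "uncertain"


def refine_phage(score, p_value, max_p=0.03, max_score=0.5):
    """Re-examine a contig called phage, given its deepVirFinder score and p-value."""
    if p_value <= max_p:
        return "phage"
    if score <= max_score:  # here p_value > max_p
        return "chromosome"
    return "uncertain"


def initial_label(label, plasmid_prob=None, dvf_score=None, dvf_p=None):
    """Apply the two-class refinement only to contigs called plasmid or phage."""
    if label == "plasmid":
        return refine_plasmid(plasmid_prob)
    if label == "phage":
        return refine_phage(dvf_score, dvf_p)
    return label
```

For instance, a contig called phage with a deepVirFinder p-value of 0.2 and a score of 0.9 satisfies neither the keep rule nor the move-to-chromosome rule, so it becomes uncertain and is left for the graph-based refinement.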
2.2 Refining the classification using the assembly graph
In genomics and metagenomics, assembly graphs, such as de Bruijn graphs [18, 25] and string graphs [21, 31], are used as the core data structure to combine overlapping reads (or k-mers) into contigs. Nodes in an assembly graph represent contigs and edges represent subsequence overlaps between contigs. Existing classifiers take contigs as input and classify each of them independently based on its sequence, ignoring the overlap information between neighboring contigs in the assembly graph. However, recent studies showed that neighboring contigs in an assembly graph are more likely to come from the same taxonomic group [5, 20]. Based on this insight, we exploit the assembly graph to improve the classification in the following two steps.
(1) Correction of classified contigs. Scan all the classified contigs in the assembly graph in random order. If a classified contig has ≥ 2 classified neighbors and all of them belong to the same class, while the contig itself was classified into a different class, we reason that the contig was wrongly classified and correct its classification to match that of its classified neighbors. This step is repeated until no change is made.
(2) Propagation of the classification to uncertain contigs. Scan all the uncertain contigs in the assembly graph in random order. If an uncertain contig has one or more classified neighbors and all of them belong to the same class, we classify the contig into the same class as its classified neighbors. We repeat this step until no more uncertain contigs can be classified.
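The correction and propagation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released 3CAC code: the graph is a plain dict of neighbor sets, labels are strings, and the names `refine` and `_unanimous_class` are hypothetical. The fixed `seed` is added only to make the random scan order reproducible.

```python
import random

UNCERTAIN = "uncertain"


def _unanimous_class(contig, graph, labels, min_classified):
    """Return the class shared by all classified neighbors, or None.

    None is returned when fewer than min_classified neighbors are classified
    or when the classified neighbors disagree.
    """
    classes = [labels.get(n, UNCERTAIN) for n in graph.get(contig, ())]
    classes = [c for c in classes if c != UNCERTAIN]
    if len(classes) >= min_classified and len(set(classes)) == 1:
        return classes[0]
    return None


def refine(graph, labels, seed=0):
    """Run correction, then propagation, and return a new label dict."""
    labels = dict(labels)  # do not modify the caller's dict
    rng = random.Random(seed)

    # Correction: a classified contig with >= 2 classified neighbors that all
    # share another class takes that class. Repeat until nothing changes.
    changed = True
    while changed:
        changed = False
        order = [c for c in labels if labels[c] != UNCERTAIN]
        rng.shuffle(order)
        for contig in order:
            target = _unanimous_class(contig, graph, labels, 2)
            if target is not None and target != labels[contig]:
                labels[contig] = target
                changed = True

    # Propagation: an uncertain contig with >= 1 classified neighbor, all of
    # one class, takes that class. Repeat until nothing changes.
    changed = True
    while changed:
        changed = False
        order = [c for c in labels if labels[c] == UNCERTAIN]
        rng.shuffle(order)
        for contig in order:
            target = _unanimous_class(contig, graph, labels, 1)
            if target is not None:
                labels[contig] = target
                changed = True
    return labels


# Toy path b - a - c - d - e: 'a' is miscalled, 'd' and 'e' are uncertain.
graph = {"a": {"b", "c"}, "b": {"a"}, "c": {"a", "d"}, "d": {"c", "e"}, "e": {"d"}}
labels = {"a": "phage", "b": "chromosome", "c": "chromosome",
          "d": UNCERTAIN, "e": UNCERTAIN}
refined = refine(graph, labels)
```

In the toy graph, `a` is corrected because both of its classified neighbors are chromosomes, and the chromosome label then spreads to `d` and `e` over successive propagation passes. Each correction removes at least two disagreeing edges between classified contigs, so the correction loop in this sketch always terminates.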
Figure 1 shows the result of applying steps (1) and (2) in a small assembly graph, which is part of the graph generated by assembling simulated long reads (Sim4; see details in the Results section).
We will use the names 3CAC(vV) and 3CAC(PM) for the full 3CAC algorithms initialized with viralVerify and PPR-Meta solutions, respectively.
Vertices colored red, blue, green, and grey represent contigs classified as phages, plasmids, chromosomes, and uncertain, respectively. (a) The result of Initial(vV). (b) After the correction step. The four contigs encircled in (a) were corrected. (c) After the propagation step.
3 Results
We tested 3CAC on both simulated and real metagenomic assemblies and compared it to PPR-Meta and viralVerify.
3.1 Evaluation criteria
3CAC, viralVerify and PPR-Meta were evaluated based on precision, recall, and F1 score, calculated as follows.
– Precision: the fraction of correctly classified contigs among all classified contigs. Note that uncertain contigs were not included in the calculation.
– Recall: the fraction of correctly classified contigs among all contigs.
– F1 score: the harmonic mean of precision and recall, calculated as F1 score = (2 × precision × recall) / (precision + recall).
Following [23, 7], the precision, recall, and F1 score here were calculated by counting the number of contigs and did not take into account their length. The precision and recall were also calculated separately for phage, plasmid and chromosome classification. For example, the precision of phage classification was calculated as the fraction of correctly classified phage contigs among all contigs classified as phages, and the recall of phage classification was calculated as the fraction of correctly classified phage contigs among all phage contigs.
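The evaluation criteria above can be computed as in the following sketch, which counts contigs and ignores their lengths. The function names and the toy data are illustrative assumptions, as is returning 0.0 when a denominator is empty.

```python
def evaluate(truth, predicted, uncertain="uncertain"):
    """Overall precision, recall, and F1 over contigs with a known class.

    truth: {contig: class}; predicted: {contig: class or 'uncertain'}.
    Uncertain contigs are excluded from the precision denominator only.
    """
    classified = [c for c in truth if predicted.get(c, uncertain) != uncertain]
    correct = sum(1 for c in classified if predicted[c] == truth[c])
    precision = correct / len(classified) if classified else 0.0
    recall = correct / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def evaluate_class(truth, predicted, cls):
    """Precision, recall, and F1 for a single class such as 'phage'."""
    hits = sum(1 for c in truth if truth[c] == cls and predicted.get(c) == cls)
    called = sum(1 for c in truth if predicted.get(c) == cls)
    actual = sum(1 for c in truth if truth[c] == cls)
    precision = hits / called if called else 0.0
    recall = hits / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


truth = {"c1": "phage", "c2": "plasmid", "c3": "chromosome",
         "c4": "chromosome", "c5": "chromosome"}
predicted = {"c1": "phage", "c2": "uncertain", "c3": "chromosome",
             "c4": "phage", "c5": "chromosome"}
overall = evaluate(truth, predicted)             # precision 3/4, recall 3/5
phage = evaluate_class(truth, predicted, "phage")  # precision 1/2, recall 1/1
```

In the toy data, four contigs are classified and three of them correctly, so precision is 0.75, while recall is 0.6 because the uncertain contig still counts in the recall denominator.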
3.2 Performance on simulated metagenome assemblies
We generated two short-read and two long-read metagenome samples as follows. Sequences of complete bacterial genomes were randomly selected from the NCBI database along with their associated plasmids. The abundance of bacterial genomes was modeled by the log-normal distribution and the copy numbers of plasmids were simulated by the geometric distribution as in [23]. The phage genomes and their abundance profiles were sampled from [26]. Two metagenomic datasets of different complexities were designed. For each of the datasets, 150bp-long short reads were simulated from the genome sequences using InSilicoSeq [9] and assembled by metaSPAdes [22]. Long reads were simulated from the genome sequences using NanoSim [39] and assembled by metaFlye [13]. The error rate of long reads was 9.8% and their average length was 14.9kb. For each assembly, contigs were matched to the reference genomes used in the simulation by minimap2 [17]. Contigs having matches to a reference genome with ≥ 90% mapping identity along ≥ 80% of the contig length were assigned to the class of that reference, and these assignments were used as the gold standard to test the classifiers. Table 1 presents a summary of the simulated metagenome assemblies.
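The matching criterion for the gold standard can be illustrated on a single minimap2 alignment record in PAF format. The text does not say how mapping identity was computed; this sketch assumes identity is the number of matching residues divided by the alignment block length (PAF columns 10 and 11) and that coverage is the aligned fraction of the contig from a single alignment. The function name `paf_match` is hypothetical.

```python
def paf_match(line, min_identity=0.9, min_coverage=0.8):
    """Return (contig, reference) if one PAF record meets both thresholds, else None.

    PAF columns used (1-based): 1 query name, 2 query length, 3 query start,
    4 query end, 6 target name, 10 residue matches, 11 alignment block length.
    """
    f = line.rstrip("\n").split("\t")
    contig, qlen, qstart, qend = f[0], int(f[1]), int(f[2]), int(f[3])
    reference, n_match, block_len = f[5], int(f[9]), int(f[10])
    identity = n_match / block_len    # assumed definition of mapping identity
    coverage = (qend - qstart) / qlen  # fraction of the contig that is aligned
    if identity >= min_identity and coverage >= min_coverage:
        return contig, reference
    return None


# 900 of 1000 bp aligned with 870 matching residues: passes both thresholds.
good = paf_match("contig_1\t1000\t50\t950\t+\tref_plasmid\t50000\t100\t1000\t870\t900\t60")
# Only half of the contig is aligned: fails the coverage threshold.
bad = paf_match("contig_2\t1000\t0\t500\t+\tref_chrom\t90000\t0\t500\t495\t500\t60")
```

A contig whose match spans several separate alignments would need its records merged before applying the coverage threshold, which this sketch does not attempt.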
The number of genome references for the real human gut metagenomes is the number of all complete phage, plasmid and chromosome genomes in the NCBI database.
Figure 2 shows the performance of PPR-Meta, viralVerify and the first phase of 3CAC on these simulated metagenome assemblies. Both PPR-Meta and viralVerify had high precision in chromosome classification, but their precision in phage and plasmid classification was usually low. Further analysis revealed that both algorithms distinguished well between phages and plasmids. Their low precision in phage and plasmid classification was due to contamination from chromosome contigs (Supplementary Table A.1). By using the two-class classifiers PlasClass and deepVirFinder, the first phase of 3CAC markedly improved the precision in phage and plasmid classification, while slightly decreasing the precision in chromosome classification (Figure 2, Supplementary Table A.2). In contrast, recall decreased in phage and plasmid classification but increased in chromosome classification (Supplementary Figure B.1).
See supplementary Figure B.1 for recall.
Figure 3 shows the results of the initial phase of 3CAC on the short-read simulated metagenome assemblies for different contig lengths. Short contigs tended to have lower recall in the initial classification of 3CAC, while precision was not sensitive to contig length. When the initial classification of 3CAC was generated from the PPR-Meta solution, recall decreased sharply for contigs shorter than 1kb. When the viralVerify solution was used, recall was even lower for contigs shorter than 1kb, and the improvement with length was roughly linear. We reasoned that, because these classifiers classify each input contig independently, short contigs cannot be classified reliably. However, Table 1 shows that more than half of the contigs assembled from short reads are shorter than 1kb. To assist in the classification of these short contigs, 3CAC was designed to take advantage of longer, confidently classified contigs that are neighbors of these short contigs in the assembly graph. Figure 3 shows that 3CAC significantly increased recall for all contigs with almost no loss of precision. Remarkably, the recall for contigs shorter than 1kb increased from < 0.2 to ≥ 0.8. For contigs assembled from long reads, 3CAC not only improved the recall substantially but also slightly improved the precision (Figure 4).
Results are shown for contigs of lengths < 1 kb, 1-2 kb, …,9-10 kb, ≥ 10kb.
The analysis above shows that the two phases of the 3CAC algorithm improved the precision and recall of the three-class classification. Evaluation of PPR-Meta, viralVerify and 3CAC on these simulated metagenome assemblies showed that 3CAC performed best on all the assemblies (Figure 5). 3CAC outperformed PPR-Meta and viralVerify in both precision and recall. For contigs assembled from short reads (Sim1 and Sim2), the recall and F1 scores of viralVerify were more than doubled by 3CAC. We also calculated the precision, recall and F1 scores for phage, plasmid, and chromosome classification separately (Supplementary Table A.3). 3CAC(vV) had the best F1 scores on all the datasets and the highest precision in classification of phages and plasmids. Note that PPR-Meta here was run with default settings. Running PPR-Meta with a score threshold of 0.7 (as done in Initial(PM)) resulted in higher precision but lower recall and lower F1 score. Supplementary Table A.4 shows that 3CAC also outperformed PPR-Meta with a score threshold of 0.7.
3.3 Performance on human gut microbiome samples
Five publicly available human gut microbiome samples with short-read sequencing datasets (NCBI accession numbers: ERR12976697, ERR1297651, ERR1297751, ERR1297845, ERR1297770) were selected and assembled together using metaSPAdes [22]. Another set of five human gut microbiome samples with long-read sequencing datasets (NCBI accession numbers: SRX2529348, SRX2529347, SRX2529346, SRX2529341, SRX2529340) was selected from [34] and assembled together using metaFlye [13]. To identify the class of contigs in the real metagenome assemblies, we downloaded all complete phage, plasmid and chromosome genomes from the NCBI database and mapped the contigs to all the reference genomes using minimap2 [17]. A contig was considered matched to a reference sequence if it had ≥ 80% mapping identity along ≥ 80% of the contig length. Contigs that matched reference genomes of two or more classes were excluded to avoid ambiguity. Overall, 131,578 out of 469,022 contigs in the short-read assembly and 4,743 out of 12,541 contigs in the long-read assembly had matches to a single class and were used as the gold standard to test the classifiers. Table 1 summarizes the properties of the datasets and the assemblies.
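The exclusion of ambiguous contigs can be sketched as follows, assuming the contig-to-reference matches that pass the identity and coverage thresholds have already been reduced to (contig, class) pairs. The function name `single_class_gold_standard` and the toy data are hypothetical.

```python
def single_class_gold_standard(matches):
    """Keep only contigs whose reference matches all belong to one class.

    matches: iterable of (contig, reference_class) pairs, one per accepted match.
    Returns {contig: class} for unambiguous contigs; contigs matching references
    of two or more classes are dropped.
    """
    classes_by_contig = {}
    for contig, cls in matches:
        classes_by_contig.setdefault(contig, set()).add(cls)
    return {
        contig: next(iter(classes))
        for contig, classes in classes_by_contig.items()
        if len(classes) == 1
    }


matches = [
    ("c1", "plasmid"), ("c1", "chromosome"),  # ambiguous: excluded
    ("c2", "phage"), ("c2", "phage"),         # two matches, one class: kept
    ("c3", "chromosome"),
]
gold = single_class_gold_standard(matches)
```

Contigs with no accepted match never appear in `matches`, so they are absent from the result as well, which mirrors their exclusion from the evaluation.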
Figures 6(a) and 7(a) show the results of PPR-Meta, viralVerify and 3CAC on the short-read and long-read assemblies, respectively. On the long-read assembly, 3CAC(vV) and 3CAC(PM) had comparable performance. 3CAC was best in precision, recall and F1 score (Figure 7).
(a) performance on all contigs; (b) performance on non-isolated contigs in the assembly graph.
Interestingly, on the short-read assembly, 3CAC(PM) and PPR-Meta had higher F1 scores than 3CAC(vV) (Figure 6 (a)). Further analysis revealed that this was due to the large number of isolated contigs in the short-read assembly graph. The second phase of 3CAC is performed only on contigs that have neighbors in the assembly graph. However, 59% of the contigs assembled from short reads were isolated and had no neighbors in the assembly graph, while the fraction in the long-read assembly was only 21%. Figures 6 (b) and 7 (b) show the results on the non-isolated contigs in the assembly graph. For both the long-read and short-read assemblies, 3CAC(PM) and 3CAC(vV) had comparable performance and outperformed PPR-Meta and viralVerify in precision, recall, and F1 score.
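The fraction of isolated contigs, which bounds how many contigs the graph-based phase can reach, is simple to compute from an adjacency map. The sketch below assumes the graph is a dict from each contig to its set of neighbors; the function name and the toy graph are hypothetical.

```python
def isolated_fraction(graph):
    """Fraction of contigs that have no neighbors in the assembly graph."""
    if not graph:
        return 0.0
    isolated = sum(1 for neighbors in graph.values() if not neighbors)
    return isolated / len(graph)


# Toy graph with five contigs, three of which have no neighbors.
toy = {"c1": {"c2"}, "c2": {"c1"}, "c3": set(), "c4": set(), "c5": set()}
fraction = isolated_fraction(toy)
```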
Supplementary Figure B.2 shows the precision, recall and F1 score separately for phage, plasmid and chromosome classification in both the short-read and long-read assemblies. In classification of phages and plasmids, PPR-Meta had the highest recall, but its precision was as low as 0.02. Compared to PPR-Meta, 3CAC had higher precision in classification of phages and plasmids at the cost of lower recall, and it tended to have better F1 scores. In chromosome classification, 3CAC performed best on the long-read assembly, while PPR-Meta performed slightly better on the short-read assembly.
3.4 Software and Resource usage
3CAC uses classifications generated by existing classifiers, and so the running time of its first phase depends on the classifiers used. On our datasets, PPR-Meta, PlasClass and deepVirFinder were fast and required less than an hour. viralVerify took up to 4-5 hours for the real metagenome assemblies with 8 threads. The second phase of 3CAC is fast and took less than 30 minutes on a single thread for all the datasets we tested. Performance was measured on a 44-core, 2.2 GHz server with 792 GB of RAM. 3CAC will be freely available under Shamir-Lab on GitHub soon.
4 Discussion and Conclusion
Classification of phages and plasmids from mixed metagenome assemblies is important for further unravelling and understanding the functions of these mobile genetic elements in microbial communities. Many two-class classifiers have been developed in recent years to identify either phages or plasmids from metagenome assemblies. A naive way to identify phages and plasmids simultaneously from mixed metagenome assemblies is to use phage classifiers and plasmid classifiers to identify phages and plasmids respectively, and then combine the classification results. However, this is impractical, since phage sequences are often arbitrarily classified as plasmids or chromosomes by plasmid classifiers, and likewise plasmid sequences by phage classifiers. In this work, we first exploit three-class classifiers to accurately separate plasmids from phages and then use two-class classifiers to further improve the precision of phage and plasmid classification. The key improvement of 3CAC is obtained by using the structure of the assembly graph to assist the classification of short and uncertain contigs. This leads to a significant improvement in recall with almost no loss of precision.
Evaluating the performance of classifiers on real metagenome assemblies remains challenging due to the lack of a gold standard. By mapping contigs to all the available reference genomes, we are able to identify the class of a fraction of the contigs. However, as shown in previous studies [7], some plasmid genomes are quite similar to their host bacterial chromosomes. Thus, many contigs from metagenome assemblies have matches to both plasmid and chromosome reference genomes, and it is hard to identify their classes. Additionally, many contigs with no matches to the reference database may represent novel species, but they were excluded from our evaluation. Keeping in mind these shortcomings of the gold standard for real metagenome assemblies, 3CAC outperformed existing three-class classifiers substantially.
3CAC is initialized with the solutions of PPR-Meta or viralVerify. The overall performance of the two initializations was comparable, with PPR-Meta doing slightly better on the short-read real data, and viralVerify doing slightly better on the long-read real data and in all simulations. PPR-Meta can be run with different score thresholds, and a higher score threshold results in higher precision and lower recall. In our experiments, we tried the score thresholds 0.7 and 0.8, and the difference in the results was minor.
3CAC has some limitations. The propagation step of 3CAC can greatly improve the recall, but it can only be performed on non-isolated contigs in the assembly graph. The recall of isolated contigs is still limited by the performance of existing classifiers. 3CAC also relies on current two-class and three-class classifiers. In the future, we plan to extend 3CAC to a stand-alone classification tool that does not rely on existing classifiers. Currently, 3CAC scans contigs in the assembly graph in random order, in both the correction and the propagation steps. That order may affect the results, and a more judicious order may improve the classification. Finally, there is room for extending 3CAC to a four-class algorithm that would also be able to classify eukaryotic contigs in metagenome assemblies [37].
Supplementary Material
A Supplementary tables
PPR-Meta was run with a score threshold ≥ 0.7 to assure high precision.
PPR-Meta and PPR-Meta(0.7) represent running PPR-Meta on default setting and with a score threshold of 0.7, respectively.
B Supplementary figures
Acknowledgments
This study was supported in part by grant 2016694 from the United States - Israel Binational Science Foundation (BSF), Jerusalem, Israel and the United States National Science Foundation (NSF). L.P. was supported in part by a fellowship from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University. L.P. was also supported in part by postdoctoral fellowships from the Planning and Budgeting Committee (PBC) of the Council for Higher Education (CHE) in Israel.
Footnotes
lianrongpu{at}mail.tau.ac.il, rshamir{at}tau.ac.il