Abstract
Resolving bacterial and archaeal genomes from metagenomes has revolutionized our understanding of Earth’s biomes, yet producing high-quality genomes from assembled fragments has been a long-standing problem. While automated binning tools and their combinations produce prokaryotic bins at high throughput, manual refinement of these bins has been slow and sometimes difficult. Here, we present uBin, a GUI-based, standalone bin refiner that runs on all major operating platforms and was specifically designed for educational purposes. When applied to the public CAMI dataset, refinement with uBin improved 78.9% of bins by decreasing their contamination. We also applied the bin refiner as a standalone binner to public metagenomes from the International Space Station and demonstrate the recovery of near-complete genomes, whose replication indices indicate active proliferation of microbes in Earth’s lower orbit. uBin is an easy-to-install software for bin refinement, binning of simple metagenomes and communication of metagenomic results to other scientists and in classrooms. The software is open source and available under https://github.com/ProbstLab/uBin.
Main Text
Genome-resolved metagenomics aims at recovering genomes from shotgun sequencing data of environmental DNA. The genomes allow determination of the metabolic capacities of the individual community members and provide the basis for many downstream ‘omics techniques like metatranscriptomics and metaproteomics. Results from these technologies can provide important insight into the interactions of microbes within the community and with the environment [1,2]. Although long-read sequencing can nowadays produce complete genomes from environmental samples [2], the percentage of closed genomes from complex ecosystems remains as low as 5.3% [3]. Consequently, genomes need to be binned from metagenomes using genome-wide shared characteristics such as similar abundance patterns and k-mer frequencies [4,5]. Many automatic and semi-automatic tools have been developed to extract genomes from metagenomes [6–10]. The quality of the resulting bins, however, can vary greatly depending on metagenome complexity, sample type or microbial community characteristics [6]. Recent studies have shown that contamination in genomes from metagenomes in public databases is a frequent occurrence [11,12] and suggested genome curation as a mandatory analysis step prior to genome submission to public databases [13].
While established tools exist to determine bin quality [6,14], i.e., by searching candidate genomes for ubiquitous or specific marker genes to evaluate completeness and contamination, tools to improve bin quality are scarce. Some established tools are used for genome refinement [15,16] but have not been designed for educational purposes and are sometimes not open source [16]. Consequently, we developed uBin as an interactive graphical user interface that is easy to install on macOS, Windows, and Ubuntu for usage in, e.g., classrooms. uBin is inspired by ggKbase [16] and enables the curation of genomes based on a combination of GC content, coverage and taxonomy and couples this to information on completeness and contamination for supervised binning. In addition, uBin can be directly used as a standalone software to bin genomes from low complexity samples.
We tested the performance of uBin (macOS, 16 GB of RAM) on simulated datasets of varying complexity from the Critical Assessment of Metagenome Interpretation (CAMI) challenge. The pre-assembled CAMI scaffolds were binned using four automated binners (using tetranucleotide frequency and differential coverage) and the results were aggregated using DAS Tool [6] (see Supplementary Methods for details). The dereplicated bins were curated using uBin, and the quality of the bins before and after curation was compared to the correct assignment based on the CAMI dataset (see Tab. S1 for F-scores of bins pre- and post-uBin curation). uBin-curated bins showed a highly significant quality improvement in medium (p < 10⁻⁴) and high complexity datasets (p < 10⁻⁵), using both paired t-tests and unpaired Kruskal-Wallis tests (Fig. 1A). No significant difference could be detected for the low complexity dataset (p > 0.70 / 0.65).
The bin quality of the low complexity dataset was significantly higher than the bin quality in medium (0.197 higher F-score, p < 10⁻⁶) and high complexity (0.118 higher F-score, p < 10⁻⁴) datasets (ANOVA coupled to TukeyHSD, p < 2×10⁻⁶) after DAS Tool [6] bin aggregation. Subsequent to curation with uBin, the differences between these datasets were much less pronounced (ANOVA, p < 0.01), with only the comparison of the high and medium complexity datasets showing a significant difference (p < 0.01, average 0.077 higher F-score in high complexity). We conclude that low complexity datasets bin very well with automated binners, while medium to high complexity datasets can greatly benefit from manual curation.
To challenge the above-mentioned conclusion, we applied uBin for the curation of bins from environmental metagenomes of medium and high complexity. As the true genome composition is unknown for these datasets, we used CheckM [14] to assess the completeness and contamination of constructed genomic bins. CheckM [14] provides a metric that is independent of the marker sets used within DAS Tool [6] and uBin (see Tab. S1 for F-scores of bins pre- and post-curation). We detected a significant improvement in genome quality when using uBin curation and directly comparing the bins in paired tests (p-values are provided in Fig. 1B).
Following the conclusion that binning of low complexity genomes can be achieved easily, we tested uBin’s capability as a standalone binner compared to Emergent Self-Organizing Maps (ESOMs) [8] on public metagenomes of the International Space Station (ISS). uBin outperformed ESOM-based binning when used as a standalone tool and when used as a curation tool of the ESOM bins (Fig. 2A, see Supplementary Material for details). Using uBin, we successfully reconstructed 53 genomes with at least 94% completeness (Fig. 2B) and 6% or less contamination (see Tab. S2 for completeness and contamination statistics of recovered ISS genomes). When comparing their phylogenetic placement based on 16 ribosomal proteins to the taxonomic classification of uBin, we observed agreement between the taxonomic classification methods (see Tab. S3 for the phylogenetic and uBin-based taxonomic placement of genomes). The one exception was the genome ISS_JPL_2332_S1_L003_Corynebacterium_afermentans_66_84, which was phylogenetically placed next to a Turicella genome [17]. This genome has since been reclassified as Corynebacterium otitidis ATCC 51513 (NZ_AHAE00000000, see File S1 for the full phylogenetic tree).
These bins represent an important step for space science since these are the first environmental genomes reconstructed from the ISS or associated transport flights. To investigate whether the genomes are actively replicated under these conditions, we calculated the in situ replication measure iRep [1] for 43 out of 53 genomes. Across all sampling sites, the iRep values of the recovered population genomes varied from 1.20 to 2.55, which implies an active metabolism. For instance, the lowest iRep value, which was calculated for Methylobacterium aquaticum, indicated that on average 20% of its sampled population was undergoing genome replication. While closely related organisms often had similar replication measures (Fig. S3), the main discriminatory factor for varying replication indices was the origin of the flight (Fig. 2C), indicating community-wide shifts in replication between the different flights. The dataset also allowed us to address a long-standing question of indoor microbiology, namely how external DNA influences the measurement of iRep values in metagenomics. Samples of the third sampled ISS flight were analyzed using both regular metagenomics and metagenomics following propidium monoazide (PMA) treatment, which removes external DNA fragments and enables DNA sequencing of cells with intact membranes. When comparing the iRep values of the paired samples (n=7 per group), no significant difference could be observed (paired t-test and Wilcoxon test, Fig. 2D), although the variance of the iRep values increased tremendously after PMA treatment. Equivalence testing confirmed that there are no differences between these two sample types (p < 0.01). We suggest that PMA treatment can improve the accuracy of iRep measures of environmental samples and recommend its usage where appropriate.
The uBin software presented here is designed for the improvement of bins and as a standalone binner for simple metagenomes with few species. It is independent of the operating system (available for Windows, macOS, Linux) and GUI-based so that a wide audience of non-bioinformaticians can make use of it. The initial data processing (like metagenomic data processing in general) requires bioinformatics knowledge, but easy-to-use wrapper scripts are provided along with the software. Thus, uBin is ideally used by bioinformaticians to communicate metagenomic data to non-bioinformatics peers and to students in classrooms. After binning or curation with uBin, the user can export each genome to an individual FASTA file. These genomes can then be further explored for metabolic analyses with, e.g., MAGE [18] or KEGG mapper [19]. Consequently, uBin represents an important software link between automated binners along with the widely used software DAS Tool and downstream analyses including genome refinement to completion [20].
Supplementary Material
Supplementary Methods
Software implementation
uBin is written in TypeScript(3.2+)/JavaScript. It utilizes React (https://reactjs.org/) for its user interface and Redux (https://redux.js.org/) to manage the application state/data.
All imported data is stored in a local SQLite (sqlite3) database. Communication between uBin and the database is abstracted through TypeORM (https://typeorm.io/), an ORM written in TypeScript. To build the application and to provide cross-platform support, we use Electron (https://www.electronjs.org/).
The user interface uses HTML/CSS + Blueprint JS (a User-Interface (UI) toolkit, https://blueprintjs.com/) for general UI elements, react-vis (https://uber.github.io/react-vis/) for its Sunburst plot, and VX (a library for d3-based React visualization components, https://github.com/hshoff/vx) for every other plot. Crossfilter (https://github.com/crossfilter/crossfilter) is used to calculate the data to be plotted on-the-fly.
Metagenomic data assembly and processing
Quality control of ISS metagenome raw reads was performed using BBduk (B Bushnell, http://jgi.doe.gov/data-and-tools/bb-tools/) and Sickle [21]. Reads were assembled into contigs and scaffolded using metaSPAdes 3.12 [22] (see Tab. S5 for read and assembly statistics). Genes were predicted for scaffolds larger than 1 kbp using Prodigal [23] in meta mode and annotated using DIAMOND [24] against UniRef100 (as of Dec. 2017) [25], modified with NCBI taxonomic information of the respective protein sequences (FunTaxDB, tentatively accessible through https://uni-duisburg-essen.sciebo.de/s/pi4cuYwyZ3KJVMl). The consensus taxonomy of each scaffold was predicted by considering the taxonomic assignment of each protein on the scaffold at each taxonomic level and choosing the lowest taxonomic rank at which more than 50% of the protein taxonomies agree. Reads were mapped to scaffolds using Bowtie2 [26] and the average scaffold coverage was estimated along with scaffolds’ length and GC content. Previously published ubiquitous single copy genes [27] were identified using HMMER 3.2 [28] and custom tables collecting GC, coverage, length, taxonomy and presence / absence of single copy genes of scaffolds were generated using scripts available along with uBin under https://github.com/ProbstLab/uBin-helperscripts.
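The consensus taxonomy rule described above can be illustrated with a minimal Python sketch. It is not the helper script distributed with uBin; the function name `consensus_taxonomy`, the list-based lineage format, and the choice to count all proteins on the scaffold in the denominator (including those without an assignment at a given level) are assumptions made for illustration only.

```python
from collections import Counter


def consensus_taxonomy(protein_lineages, threshold=0.5):
    """Return the deepest lineage shared by more than `threshold` of proteins.

    `protein_lineages` is a list of lineages, one per protein on a scaffold,
    each ordered from the highest rank (e.g. domain) to the lowest available.
    Hypothetical input format, assumed for this sketch.
    """
    consensus = []
    n_proteins = len(protein_lineages)
    if n_proteins == 0:
        return consensus
    max_depth = max(len(lineage) for lineage in protein_lineages)
    for level in range(max_depth):
        # Count full lineage prefixes so that identical names under
        # different parent taxa are never merged.
        prefixes = Counter(
            tuple(lineage[: level + 1])
            for lineage in protein_lineages
            if len(lineage) > level
        )
        best_prefix, count = prefixes.most_common(1)[0]
        # Assumption: proteins lacking an assignment at this level still
        # count towards the denominator.
        if count / n_proteins > threshold:
            consensus = list(best_prefix)
        else:
            break
    return consensus


example = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
    ["Bacteria", "Firmicutes", "Bacilli"],
]
print(consensus_taxonomy(example))
```

In this toy example, three of four proteins agree at the phylum level, but only two of four agree at the class level, so the scaffold would be assigned down to the phylum only.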
Binning and curation
ISS assemblies were binned using Emergent Self-Organizing Maps (ESOM) [8]. Scaffolds were fragmented using the esomWrapper.pl [8] script, using 10 kbp and 5 kbp as maximum and minimum fragment sizes, respectively. Streptomyces griseus NBRC13350 (high GC, NC_010572.1) and Escherichia coli K12 (low GC, NC_000913.3) genomes were spiked in to verify successful ESOM training. For ESOM training, the starting radius was set to 50 and the map size was adjusted to the suggested size in the esomWrapper.pl output. ISS data was additionally binned directly using uBin.
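To illustrate what fragmenting scaffolds with a minimum and maximum fragment size can look like, the following Python sketch cuts a sequence into consecutive windows. It does not reproduce esomWrapper.pl; the function name `fragment_scaffold` and the handling of short remainders (dropping any window shorter than the minimum) are assumptions, and the actual script may treat trailing pieces differently.

```python
def fragment_scaffold(sequence, min_len=5_000, max_len=10_000):
    """Cut a scaffold into consecutive windows of at most `max_len` bases.

    Assumption for this sketch: windows shorter than `min_len` (including a
    short trailing remainder or a short scaffold) are discarded.
    Returns a list of (start_position, fragment) tuples.
    """
    fragments = []
    for start in range(0, len(sequence), max_len):
        window = sequence[start:start + max_len]
        if len(window) >= min_len:
            fragments.append((start, window))
    return fragments


# A 27 kbp toy scaffold yields windows of 10, 10 and 7 kbp.
toy_scaffold = "ACGT" * 6_750
print([len(fragment) for _, fragment in fragment_scaffold(toy_scaffold)])
```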
CAMI datasets were binned using the automatic binners abawaca [29] and MaxBin2 [7]. abawaca was run twice, using 3 kbp and 5 kbp as well as 5 kbp and 10 kbp as minimum and maximum fragment sizes, respectively, and MaxBin2 was run with both of its available marker gene sets. The output of these four binning runs was aggregated using DAS Tool [6]. Tomsk and SulCav binning has been described previously [30].
Tables containing bin, GC, coverage, length, taxonomy and single copy gene presence / absence information were loaded into uBin and used to curate draft genomes. Coding regions and single copy genes on genomes were predicted as described above, but omitting the -meta flag in Prodigal.
Calculation of in situ replication indices
Bacterial in situ replication indices (iRep [1]) were calculated by mapping reads to the genomes and filtering the mappings for a maximum of 3 mismatches, which corresponds to a 2% mismatch rate in the 150 bp reads. All other settings of the iRep software were left at their defaults.
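The relation between the mismatch cutoff and the read length can be checked with a short Python sketch. The tuple-based alignment records and the function names are hypothetical and only serve to show the arithmetic; this is not how the iRep software or the read mapper represents alignments, and reading the cutoff as "at most 3 mismatches" is an assumption.

```python
READ_LENGTH = 150      # bp, as stated in the text
MAX_MISMATCHES = 3     # assumed to be an upper bound per read


def mismatch_rate(mismatches, read_length=READ_LENGTH):
    """Fraction of mismatched positions in a read."""
    return mismatches / read_length


def filter_alignments(alignments, max_mismatches=MAX_MISMATCHES):
    """Keep (read_id, mismatches) records with at most `max_mismatches`."""
    return [record for record in alignments if record[1] <= max_mismatches]


# Toy records: (read identifier, number of mismatches to the genome).
toy_alignments = [("read1", 0), ("read2", 3), ("read3", 4)]
print(mismatch_rate(MAX_MISMATCHES))      # 3 / 150
print(filter_alignments(toy_alignments))  # read3 is removed
```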
Estimation of sample complexity
The sample complexity was estimated using the diversity of the rpS3 marker gene. rpS3 genes were annotated as described above. We are aware that sample complexity can also stem from other factors like k-mer frequency or coverage distribution patterns that this estimation does not take into account. However, these metrics cannot be assessed easily for environmental samples as the real composition is unknown. See Tab. S4 for rpS3-based complexity estimates across analyzed samples.
Phylogenomics
Ribosomal proteins were identified with blastp [31] (e-value 10⁻⁵) against the set of 16 ribosomal proteins used in [32], aligned using MUSCLE [33] with default parameters, trimmed with BMGE [34] using the BLOSUM30 substitution matrix and concatenated. The phylogenetic tree was calculated using FastTree 2.1.8 [35] with default parameters. The tree was visualized in Dendroscope 3.7.2 [36].
Calculation of F-scores
Precision and recall of CAMI bins were determined from the known genomic assignment of the scaffolds and the bins to which they were allocated during binning and curation. Each genomic bin was assigned to the CAMI genome that contributed the most scaffolds to the bin. Precision and recall of genomes from real-world datasets were determined using completeness as a proxy for True Positives, 1 − completeness as False Negatives, contamination as a proxy for False Positives and 1 − contamination as True Negatives. The F-score was calculated as the mean of precision and recall.
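The proxy-based calculation for real-world genomes can be sketched in Python as follows. Completeness and contamination are expected as fractions between 0 and 1. The function names are hypothetical, and because the text does not specify which mean is meant, the sketch assumes the conventional F1 definition, i.e., the harmonic mean of precision and recall.

```python
def precision_recall(completeness, contamination):
    """Derive precision and recall from completeness/contamination proxies.

    Both inputs are fractions in [0, 1] (e.g. CheckM percentages / 100).
    """
    true_pos = completeness          # proxy for True Positives
    false_neg = 1.0 - completeness   # proxy for False Negatives
    false_pos = contamination        # proxy for False Positives
    denominator = true_pos + false_pos
    precision = true_pos / denominator if denominator > 0 else 0.0
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


def f_score(precision, recall):
    """Harmonic mean of precision and recall (assumed F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy bin with 94% completeness and 6% contamination.
p, r = precision_recall(0.94, 0.06)
print(round(p, 3), round(r, 3), round(f_score(p, r), 3))
```

Under these proxies, recall reduces to the completeness itself, so differences in F-score between bins of similar completeness are driven by contamination.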
Statistical evaluation
Statistical evaluation was performed in R [37]. Both paired and unpaired Welch t-tests [38] as well as Kruskal-Wallis [39] tests, one- and two-way ANOVAs [40] and TukeyHSD [41] significance tests were performed. ggplot2 [42] was used to visualize data. The TOSTpaired.raw function within the TOSTER [43] package was used to confirm the non-significance of PMA-related tests, using 0.1 as the equivalence bound.
Metagenome availability
Accessions to raw reads and assemblies used in this study are listed in Tab. S4.
Software availability
The platform-independent genome curation software uBin is freely available under the MIT license at https://github.com/ProbstLab/uBin. The installation of the software from the OS-dedicated installers is dependency-free, while source code installation requires a Unix-based OS and package managers like npm or yarn.
The authors declare no competing interest. All data is publicly available.
Supplementary Figures
Supplementary Tables
Tab. S1 | F-scores pre- and post-uBin of CAMI, Tomsk and SulCav datasets.
TableS1_Fscores_CAMI_Tomsk_SulCav.xlsx
Additional Supplementary Files
File S1 | Phylogenetic tree for placement of ISS genomes based on 16 ribosomal proteins.
FileS1_ISS_PhyloTree.tree
Acknowledgments
This study was funded by the Ministerium für Kultur und Wissenschaft des Landes Nordrhein-Westfalen (“Nachwuchsgruppe Dr. Alexander Probst”). We thank the students who tested and worked with uBin over the last two years in classrooms. We thank Christine Sun for her contribution to the script for calculating consensus taxonomy of scaffolds and Kasthuri Venkateswaran for input regarding sampling locations of the ISS samples. We thank Ken Dreger for the administration and maintenance of our servers.
Footnotes
We added a supplementary table detailing the F-scores of CAMI and real-world test dataset bins before and after uBin curation. We also rephrased the comparison to other existing software.