Abstract
Identifying the biological diversity of a microbial population is of fundamental importance due to its implications in industrial processes, environmental studies and clinical applications. Today, there is still an outstanding need for new, easy-to-use bioinformatics tools that analyze both amplicon and shotgun metagenomics data, covering both prokaryotic and eukaryotic organisms, with the highest accuracy and the lowest running time. To address this need, we introduce GAIA, an online software solution designed to provide users with the maximum information from 16S, 18S, ITS or shotgun analyses. GAIA is able to obtain a comprehensive and detailed overview at any taxonomic level of microbiomes of different origins: human (e.g. stomach or skin), agricultural and environmental (e.g. land, water or organic waste). Using recently published benchmark datasets from shotgun and 16S experiments, we compared GAIA against several available pipelines. Our results show that for shotgun metagenomics, GAIA obtained the highest F-measures at species level among all tested pipelines (CLARK, Kraken, LMAT, BlastMegan, DiamondMegan and NBC). For 16S metagenomics, GAIA also obtained excellent F-measures, comparable to QIIME at family level. The overall objective of GAIA is to provide both the academic and industrial sectors with an integrated metagenomics suite that allows users to perform metagenomics data analysis easily, quickly and affordably with the highest accuracy.
Introduction
The study of the different microbiomes present in either the environment or inside the human body is of fundamental importance, as it is highly relevant for industrial processes as well as environmental and clinical applications. For instance, dysbiosis in distinct communities has been related to diseases or has been established as an early sign of environmental pollution [1-3].
Traditionally, studying bacterial communities required the isolation and culture of each individual microorganism, which is a significant limitation considering that less than 1% of the known prokaryotes are culturable [4]. Nowadays, high-throughput sequencing is the standard approach for metagenomics analysis. These technologies are essential for the development of metagenomics, which is defined as the culture-independent genomic analysis of all the microorganisms in a particular environmental niche [5]. The data analysis can be either taxonomic, which aims to identify the origin of all the genomic material in the sample, or functional, where the goal is to identify the genes within a sample and their function in terms of Gene Ontology (GO) and metabolic pathways.
There are two main approaches in metagenomics: amplicon and shotgun analysis. Amplicon metagenomics is based on the PCR amplification of a genetic marker: the 16S rRNA in bacteria and archaea, ITS regions in fungi or the 18S rRNA in other eukaryotes. Even though it is a cheap and well-established technology, these markers lack resolution, as they cannot differentiate between closely related species [6]. Shotgun analysis is defined as the unrestricted sequencing of the genomes (whole genome sequencing, WGS, metagenomics) or transcriptomes (metatranscriptomics) inside a sample. Shotgun analyses allow taxonomic identification down to strain level and also functional annotation by identifying genes present in the samples. In addition, the expression of these genes can be quantified by means of metatranscriptomics, thus allowing the identification of key roles of microorganisms in the samples. On the other hand, shotgun analyses are more computationally expensive than amplicon sequencing, as they require higher sequencing coverage of the genomes in the sample. Additionally, the data analysis is more complex due to the large amount of data generated [7].
Dozens of pipelines have been developed so far for the analysis of metagenomics samples for taxonomic identification. Some of these pipelines have been compared in recent publications: Siegwald et al. (2017) [8] benchmarked commonly-used 16S pipelines, McIntyre et al. (2017) [9] benchmarked commonly-used shotgun metagenomics pipelines, and Brown, et al. (2017) [10] benchmarked pipelines that are able to process Nanopore data.
All things considered, there are still limitations to be overcome in the field of metagenomics: 1) there is still room for improvement in terms of accuracy, 2) some of the pipelines work on amplicon but not on shotgun analyses and vice versa, 3) the vast majority of the pipelines lack truly user-friendly interfaces, 4) few pipelines include eukaryotic databases, 5) computational and data storage demands are high, especially for shotgun metagenomics, and 6) processing times are long. To overcome these limitations, we introduce GAIA: a new online integrated metagenomics suite that aims to become the reference method for amplicon and WGS metagenomics as well as metatranscriptomics. To validate GAIA, we gathered the results and datasets available from the different benchmarks, which include the following pipelines: BMP [11], mothur [12], QIIME [13], LMAT [14], BlastMegan [15], DiamondMegan [15], NBC [16], CLARK [17] and Kraken [18].
Materials and methods
Benchmark
Whole genome sequencing
The datasets (FASTA and FASTQ files) from Segata et al. (2013) and Ounit and Lonardi (2016) used in McIntyre et al. (2017) for benchmarking were downloaded from http://ftp-private.ncbi.nlm.nih.gov/nist-immsa/IMMSA/. FASTA files were converted to FASTQ files using an in-house script, and every base was assigned a quality score of 40 (Phred+33). The datasets were then mapped against GAIA’s Prokaryotes database v1.0. Precision, recall and F-measure values at read-level for the benchmarked tools were extracted directly from the paper. Precision, recall and F-measures for GAIA were calculated as:
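The equations are not reproduced in this version of the text. Assuming the standard definitions were used, with TP the number of reads assigned to the correct taxon, FP the number of reads assigned to an incorrect taxon and FN the number of reads not recovered for their true taxon (this mapping of reads to TP, FP and FN is our assumption), they read:

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
F\text{-measure} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```

As a rough consistency check, the average precision (0.982) and recall (0.902) reported in the Results give 2 · 0.982 · 0.902 / (0.982 + 0.902) ≈ 0.940, in line with the reported average F-score of 0.94, although the mean of per-dataset F-scores need not equal the F-score of the mean precision and recall.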
The datasets in FASTQ format from Brown et al. (2017) were downloaded from the European Nucleotide Archive (ENA) using the accessions PRJEB8672 and PRJEB8716. Accuracy values at read-level for the benchmarked tools were extracted directly from the paper. Accuracy for GAIA was calculated as:
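The equation is not reproduced in this version of the text. One plausible read-level definition, stated here as an assumption rather than as the formula used in the original benchmark, is the fraction of reads assigned to the taxon they originate from:

```latex
\[
\text{Accuracy} = \frac{\text{number of reads assigned to the correct taxon}}{\text{total number of reads}}
\]
```

Whether the denominator counts all reads or only the classified reads is not stated in the text, so the choice of all reads above should be treated as hypothetical.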
Amplicon sequencing
The datasets from Siegwald, et al. (2017) were downloaded from http://pegase-biosciences.com/metagenetics/. These datasets were then mapped against GAIA’s Amplicon NCBI 16s database v1.0. Precision, recall and F-measures at read-level for the benchmarked tools were extracted directly from the paper. Precision, recall and F-measure values for GAIA were calculated as:
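The equations are not reproduced in this version of the text. The following Python sketch illustrates one common read-level convention, which we assume here: precision is the fraction of classified reads that are correct, and recall is the fraction of all reads that are correctly classified. The function name `read_level_scores` and the toy read identifiers are hypothetical and are not part of GAIA.

```python
from typing import Dict, Optional, Tuple


def read_level_scores(
    truth: Dict[str, str],
    predicted: Dict[str, Optional[str]],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) at read level.

    Assumed convention (not taken from the paper):
      precision = correct reads / classified reads
      recall    = correct reads / all reads
    A read missing from `predicted`, or mapped to None, is unclassified.
    """
    total = len(truth)
    classified = 0
    correct = 0
    for read_id, true_taxon in truth.items():
        call = predicted.get(read_id)
        if call is None:
            continue  # unclassified read: lowers recall only
        classified += 1
        if call == true_taxon:
            correct += 1

    precision = correct / classified if classified else 0.0
    recall = correct / total if total else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Toy example: four reads, two correct, one misclassified, one unclassified.
truth = {
    "r1": "Escherichia coli",
    "r2": "Escherichia coli",
    "r3": "Bacillus subtilis",
    "r4": "Bacillus subtilis",
}
predicted = {
    "r1": "Escherichia coli",
    "r2": "Escherichia coli",
    "r3": "Bacillus cereus",
    "r4": None,
}
precision, recall, f_measure = read_level_scores(truth, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
```

Under this convention a misclassified read lowers both precision and recall, whereas an unclassified read lowers recall only.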
Results
Pipeline
GAIA consists of different Python, Java, Bash and R scripts that perform the following five steps: i) quality check and trimming, in which GAIA calls BBDuk [19] to remove both adapter sequences and low-quality portions from the reads for Illumina and Ion Torrent data and, for Oxford Nanopore data, uses Porechop [20] for adapter removal; ii) BWA [21] is used to map the high-quality reads from any platform (Illumina, Ion Torrent and Oxford Nanopore) against custom-made databases created from NCBI sequences [22]; iii) reads are classified into the most specific taxonomic level using an in-house Lowest Common Ancestor (LCA) algorithm; iv) minimum identity thresholds are applied to classify reads into strain, species, genus, family, order, class, phylum and domain levels; v) alpha and beta diversities are finally calculated using phyloseq [23]. Additionally, should the input datasets come from different conditions, GAIA includes an additional step to perform a differential abundance analysis using DESeq2 [24] (Figure 1).
Figure 1. Overview of the GAIA pipeline and the distinct steps it follows.
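GAIA's LCA algorithm is in-house and is not described in detail here. The Python sketch below therefore shows only the generic idea behind step iii: when a read maps to several reference taxa, it is assigned to the deepest node shared by all of their lineages. The toy taxonomy `PARENT` and the function names are hypothetical, and the identity thresholds of step iv are not modeled.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy (child -> parent); the root has parent None.
PARENT: Dict[str, Optional[str]] = {
    "Bacteria": None,
    "Proteobacteria": "Bacteria",
    "Gammaproteobacteria": "Proteobacteria",
    "Enterobacterales": "Gammaproteobacteria",
    "Enterobacteriaceae": "Enterobacterales",
    "Escherichia": "Enterobacteriaceae",
    "Escherichia coli": "Escherichia",
    "Escherichia albertii": "Escherichia",
    "Salmonella": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
}


def lineage(taxon: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the path from the root down to `taxon`."""
    path: List[str] = []
    node: Optional[str] = taxon
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def lowest_common_ancestor(
    taxa: List[str], parent: Dict[str, Optional[str]]
) -> Optional[str]:
    """Return the deepest taxon shared by the lineages of all `taxa`."""
    if not taxa:
        return None
    lineages = [lineage(t, parent) for t in taxa]
    lca: Optional[str] = None
    # Walk the lineages in parallel from the root until they diverge.
    for ranks in zip(*lineages):
        if all(r == ranks[0] for r in ranks):
            lca = ranks[0]
        else:
            break
    return lca


# A read hitting two species of one genus is assigned to that genus.
same_genus = lowest_common_ancestor(
    ["Escherichia coli", "Escherichia albertii"], PARENT
)
# A read hitting species of two genera is assigned to their shared family.
same_family = lowest_common_ancestor(
    ["Escherichia coli", "Salmonella enterica"], PARENT
)
# A read with a single hit keeps the most specific level.
single_hit = lowest_common_ancestor(["Escherichia coli"], PARENT)
print(same_genus, same_family, single_hit, sep=" | ")
```

This illustrates why ambiguous reads are reported at a higher taxonomic level rather than being forced into a single species.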
Benchmark
In order to assess the performance of GAIA and to compare it with other metagenomics pipelines, we conducted a benchmark using both whole genome and amplicon sequencing data, as described in the following paragraphs. For each dataset, GAIA’s precision, recall and F-score were calculated and compared with those of the other pipelines.
Whole genome sequencing
The datasets used from McIntyre et al. (2017) were generated in silico and can be divided into two groups. The first group of datasets was created according to complexity: high complexity datasets contain more species (100 different species) with more variable abundances than the low complexity datasets (25 different species). The second group of datasets was created from the species that are commonly found in mouth, city parks, gut or indoors. By comparing the results of GAIA with the expected ones, precision, recall and F-scores for each dataset were calculated (Supplementary Table 1). On average, GAIA obtained a precision of 0.982, a recall of 0.902 and the highest average F-score (0.94) at the species level, followed by CLARK-S and CLARK with F-scores of 0.936 and 0.921, respectively.
The Oxford Nanopore datasets used from Brown, et al. (2017) were real data generated from four separate cultures of Escherichia coli, Pseudomonas fluorescens, Microcystis aeruginosa and Synechococcus elongatus, and three mixed cultures of these four species. By comparing the results of GAIA with the expected ones, the accuracy for each dataset was calculated (Supplementary Table 2). On average, GAIA obtained the highest accuracy of 0.967, followed by Kraken with an accuracy of 0.946.
Amplicon sequencing
The datasets used from Siegwald et al. (2017) were generated in silico, with and without simulated sequencing errors, on different hypervariable regions of the 16S rRNA gene (V3, V4-V5) and can be divided into three groups:
High Complexity (HC): all taxa equally distributed with no dominant organisms.
Medium Complexity (MC): four dominant species of different genera accounting for 20% of reads and the remaining taxa are equally distributed.
Low Complexity (LC): 1 dominant species accounting for 30% of reads and the remaining taxa are equally distributed.
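The three abundance profiles can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used to build the benchmark: the number of taxa (100) is hypothetical, the percentage is interpreted as the combined share of the dominant species, and the function name `abundance_profile` is our own.

```python
from typing import List


def abundance_profile(
    n_taxa: int, n_dominant: int, dominant_share: float
) -> List[float]:
    """Relative abundances with `n_dominant` dominant taxa listed first.

    Assumption: `dominant_share` is the combined fraction of reads from all
    dominant taxa, split evenly among them; the remaining taxa share the rest
    equally.
    """
    if not 0 <= n_dominant < n_taxa:
        raise ValueError("n_dominant must be in [0, n_taxa)")
    if not 0.0 <= dominant_share < 1.0:
        raise ValueError("dominant_share must be in [0, 1)")
    if n_dominant == 0:
        return [1.0 / n_taxa] * n_taxa
    per_dominant = dominant_share / n_dominant
    per_other = (1.0 - dominant_share) / (n_taxa - n_dominant)
    return [per_dominant] * n_dominant + [per_other] * (n_taxa - n_dominant)


N_TAXA = 100  # hypothetical community size, not taken from the benchmark

hc = abundance_profile(N_TAXA, n_dominant=0, dominant_share=0.0)   # HC
mc = abundance_profile(N_TAXA, n_dominant=4, dominant_share=0.20)  # MC
lc = abundance_profile(N_TAXA, n_dominant=1, dominant_share=0.30)  # LC

for name, profile in (("HC", hc), ("MC", mc), ("LC", lc)):
    print(name, f"max={max(profile):.4f}", f"min={min(profile):.4f}")
```

If the 20% in the MC datasets instead applies to each dominant species separately, `dominant_share` would be 0.80 and the remaining taxa would share 20% of the reads.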
At family level, using the Siegwald et al. (2017) benchmark, GAIA obtained equal or slightly higher F-measures relative to QIIME: GAIA (0.957), QIIME UCLUST with the SILVA database (0.956) and QIIME SortMeRNA SUMACLUST with the Greengenes database (0.955) (Supplementary Table 3). At genus level, GAIA (0.83) showed the third best F-measure after CLARK (0.878) and Kraken (0.859), and was followed by QIIME UCLUST with the SILVA database (0.776) and QIIME SortMeRNA SUMACLUST with the Greengenes database (0.665) (Supplementary Table 4).
Online platform
In order to provide a comfortable user experience, the GAIA pipeline was integrated into an online software solution, which delivers the software in a way that can be accessed from any device with an Internet connection and a web browser without any bioinformatics skills required. The analysis is performed interactively online and it includes dynamic charts and tables using Google Charts and DataTables (JavaScript-based) (Figure 2).
Figure 2. Screenshot of the upload page (A), in which the user uploads sequencing data, and screenshot of the taxonomy barplot at genus level once the analysis has been completed (B).
Conclusions
We propose GAIA as a new software solution able to obtain a comprehensive and detailed overview at any taxonomic level (including strains) of microbiomes of different origins, such as human (e.g. stomach or skin), agricultural and environmental (e.g. land, water or organic waste), in an accurate and easy way. The high benchmark scores presented here validate the algorithm. On average, GAIA obtained the highest scores at species level for WGS metagenomics, and it also obtained excellent scores for amplicon sequencing. For 16S rRNA data, GAIA was the best-performing software at family level and within the top three at genus level. In addition, at both family and genus level for amplicon sequencing, GAIA obtained higher F-scores than QIIME, the most cited software for this kind of analysis. As metagenomics is shifting towards shotgun analyses, which are able to sequence any organism within a sample, GAIA’s database also includes eukaryotes to perform the so-called true metagenomics: a complete view of the life present in the samples. GAIA has been created so that users can spend more time interacting with their results and less time setting up the analysis. The overall objective of GAIA is to provide academia and industry with an integrated metagenomics suite that allows users to perform metagenomics data analysis easily and quickly. GAIA is available at http://gaia.sequentiabiotech.com.
Acknowledgements
We thank Anna Delgado Tejedor for her contribution in evaluating GAIA during her internship at Sequentia Biotech.