MetaPrism: A Toolkit for Joint Taxa/Gene Analysis of Metagenomic Sequencing Data

Background In microbiome research, metagenomic sequencing generates enormous amounts of data. These data are typically classified into taxa for taxonomy analysis, or into genes for functional analysis. However, a joint analysis where the reads are classified into taxa-specific genes is often overlooked. Result To enable the analysis of this biologically meaningful feature, we developed a novel bioinformatic toolkit, MetaPrism, which can analyze sequence reads for a set of joint taxa/gene analyses: 1) classify sequence reads and estimate the abundances for taxa-specific genes; 2) tabularize and visualize taxa-specific gene abundances; 3) compare the abundances between groups, and 4) build prediction models for clinical outcome. We illustrated these functions using a published microbiome metagenomics dataset from patients treated with immune checkpoint inhibitor therapy and showed the joint features can serve as potential biomarkers to predict therapeutic responses. Conclusions MetaPrism is a toolkit for joint taxa and gene analysis. It offers biological insights on the taxa-specific genes on top of the taxa-alone or gene-alone analysis. MetaPrism is open source software and freely available at https://github.com/jiwoongbio/MetaPrism. The example script to reproduce the manuscript is also provided in the above code repository.

In microbiome research, metagenomic sequencing generates enormous amounts of data. These 3 data are typically classified into taxa for taxonomy analysis, or into genes for functional analysis. 4 However, a joint analysis where the reads are classified into taxa-specific genes is often 5 overlooked. 6 Result 7 To enable the analysis of this biologically meaningful feature, we developed a novel 8 bioinformatic toolkit, MetaPrism, which can analyze sequence reads for a set of joint taxa/gene 9 analyses: 1) classify sequence reads and estimate the abundances for taxa-specific genes; 2) 10 tabularize and visualize taxa-specific gene abundances; 3) compare the abundances between 11 groups, and 4) build prediction models for clinical outcome. We illustrated these functions using 12 a published microbiome metagenomics dataset from patients treated with immune checkpoint 13 inhibitor therapy and showed the joint features can serve as potential biomarkers to predict 14 therapeutic responses.

Introduction 2
The human microbiome consists of ~39 trillion bacteria and influences host health. Recently, the 3 use of metagenomic sequencing has become increasingly popular as a more unbiased approach to 4 gut microbiome profiling as compared to 16S rRNA sequencing. A common approach to 5 comparing differences in the gut microbiome between groups (cases and controls) is to identify 6 significant differences in either taxa or microbial genes.  (Table S1). However, these tools analyze either taxonomic abundances (taxonomic 9 profiling) or gene abundances (function profiling) separately. As each microorganism carries its 10 own genes, taxonomic and functional profiling results are not intrinsically independent. In fact, 11 recent discoveries demonstrated that taxon-specific genes have a causative role in disease Klebsiella [6]. Therefore, joint analysis, where taxonomy and functional features are analyzed 16 together, could provide useful biological and clinical insights [7]. However, bioinformatics tools 17 for joint analyses are comparatively lacking.

18
Our innovation in this manuscript is to define and utilize joint taxa/gene features via 19 bioinformatics approach, with the goal of offering biologically interpretable findings. For example, 20 our method characterizes the genes discovered for each species. This allows to quantitative 21 analysis of this species-specific gene, which is usually not readily available. Our approach is 22 initiated from de novo assembled contigs which are both taxonomically and functionally annotated.

1
Our simulations showed this method can accurately detect bacterial species and their carried genes.

2
In a recent review article[7], Langille prompted that understanding the gene contents at species 3 level can offer better interpretation than using the taxon or gene content alone, and potentially 4 provide better prediction outcomes. This confirmed that the joint feature is useful for general 5 microbiome studies. Our tool provided these joint features as the first step for a wide range of 6 downstream analysis tasks. For example, we demonstrated that the quantity of taxa-specific gene 7 abundances is a potentially useful biomarker to predict the immunotherapy responses.

8
To facilitate joint analysis, we developed MetaPrism, a novel bioinformatics tool to (1) classify 9 metagenomic sequence reads into both taxa and gene level, (2) normalize the taxa-specific gene 10 abundances within samples, (3) tabularize or visualize these joint features, (4) perform 11 comparative microbiome studies, and (5) build prediction models for clinical outcomes.

12
MetaPrism is open-sourced and is available at https://github.com/jiwoongbio/MetaPrism. Given 13 the advantages of joint analysis, MetaPrism is a useful tool for microbiome metagenomic sequence 14 studies.

17
MetaPrism is a toolkit for joint analysis tasks. At its core, MetaPrism will infer the taxa and gene 18 for each metagenome sequence read. One approach is to align each read to bacterial nucleotide 19 reference genomes to obtain its taxonomy and align it to a protein database to obtain its gene 20 functions. However, this approach is technical challenging: due to the short lengths of Illumina 21 sequence reads and the high sequence similarities between bacteria genomes, alignment of short 22 reads is not feasible. We thus developed a novel algorithm ( Figure 1A) in an integrated toolkit 1 ( Figure 1B) to tackle this challenge.
2 First, we perform de novo assembly for each sample using metaSPAdes [8] with all metagenomic 3 sequence reads to obtain long contigs. As these contigs are much longer than sequence reads, that 4 allows for accurate taxonomical and functional profiling. Second, we identify the taxonomy of these contigs. All the contigs are aligned to a large 1 reference database of more than 4,000 bacterial genomes using centrifuge [9]. Ambiguous 2 alignments will be filtered out from the subsequent analysis.  In brief, we simulated metagenomic sequence reads from known species and inferred the gene 15 abundances using FMAP (Kim, et al., 2016b) and MetaPrism. This benchmark showed that gene 16 abundances inferred by MetaPrism were accurate and achieved the highest correlation between 1 inferred abundances and true abundances (Figure 2). 2 Based on these joint features, MetaPrism provided the following downstream joint analysis 3 functions (demonstrated in Figure 1B Table S2.

10
The gut microbiome plays an important role in modulating immune checkpoint therapy [14]. Here 11 we demonstrated a joint analysis using MetaPrism to build a therapy response prediction model. 12 We collected stool samples of 12 melanoma patients before anti-PD1 (pembrolizumab) therapy 13 and performed metagenomic sequencing. 6 patients responded to the therapy and 6 did not. 14 Starting from the metagenomic sequence reads, we performed quality control, including removal 15 of human contamination as previously described [14]. Then all remaining sequence reads were 16 processed in MetaPrism (detailed analysis steps provided in Supplementary Texts). On average,

17
MetaPrism inferred the taxonomy and gene features for 1.2 billion reads per sample. Next,

18
MetaPrism normalized the reads within samples by reporting the mean depth per assembled contig.

19
The taxa-specific gene abundances were ranked using a random forest model with leave-one-out 20 cross-validation. This prediction model reached 69% accuracy to predict the immunotherapy 21 responses, which is higher than the accuracy using taxa features alone (54%) or gene features alone 22 (62%). Furthermore, it detected four joint features with variable importance greater than 50%.

23
MetaPrism visualized these abundances with red to green colors representing the depth values 1 (Figure 3). The most important feature is the K00826 gene (branched-chain amino acid 2 aminotransferase) from the genus Eubacterium (Table 1). It is a novel joint feature that may serve 3 as a potential biomarker for cancer therapy. MetaPrism_heatmap.pl to visualize four joint features (taxa-specific gene abundances) in the immune checkpoint 7 therapy study. The colors from red to green represent the increased gene abundances, the mean depth normalized by 8 the contig lengths. P10, P14, P23, P25, P34, P39 are patients who respond to the therapy; P8, P16, P24, P30, P32, P42 9 are patients having progressive outcomes. 10 11 Table 1. Prediction models and performances for taxonomical analysis, functional analysis, and joint analysis. 12 We tabularized the details of prediction models used in three types of analyses and their prediction performances.  In terms of computation, all the above analyses can be accomplished on a standard computation 4 cluster (e.g., 128GB memory with 2 GB hard drive space per sample). 6 We present a novel bioinformatics tool, MetaPrism. It implements functions to quantify the joint 7 features (both taxonomic and functional) from metagenomic sequence reads, as well as other 8 functions for downstream data analyses. We demonstrate that the joint features can provide novel 9 insights to understand the microbial role in a cancer immunotherapy study.

10
MetaPrism is flexible and can be customized. For example, to study species-specific antibiotic 11 resistance genes (ARGs), a reference protein database with ARGs, such as ARDB [15] or CARD

13
In a GVHD study, with the interests to study the patients' resistome, we performed the analysis 14 using the ARDB in MetaPrism and found increased abundances of antibiotic-resistance genes (e.g., 15 mdtG, AcrA, AcrB, and TolC) in Klebsiella and E. coli in the GVHD patients compared with the 16 abundances in non-GVHD patients. This finding may hint optimal antibiotic prescription for better 17 management of GVHD.

18
MetaPrism characterizes the joint features based on the contigs that are de novo assembled from 19 metagenomic sequence reads. This is a distinct feature compared with other software. For example,

20
HUMAnN2 used a tiered search strategy that relied on a curated reference database for organism-21 specific genes [3]. As human microbiome contains trillions of microbial genes, reference databases 1 can be inadequate to enumerate the organism-specific genes. Thus we designed the MetaPrism to 2 reduce the dependency on curated reference databases. The tradeoff for this decision is that 3 MetaPrism requires more computational resources for the de novo assembling step.

4
In all, MetaPrism is free and useful software to facilitate joint analyses and it is suitable for 5 general microbiome studies. Researchers can expect MetaPrism to quantify species-specific gene 6 abundances and use these interpretable features in association studies and prediction tasks.