Bayesian Classification of Microbial Communities Based on 16S rRNA Metagenomic Data

We propose a Bayesian method for the classification of 16S rRNA metagenomic profiles of bacterial abundance, by introducing a Poisson-Dirichlet-Multinomial hierarchical model for the sequencing data, constructing a prior distribution from sample data, calculating the posterior distribution in closed form; and deriving an Optimal Bayesian Classifier (OBC). The proposed algorithm is compared to state-of-the-art classification methods for 16S rRNA metagenomic data, including Random Forests and the phylogeny-based Metaphyl algorithm, for varying sample size, classification difficulty, and dimensionality (number of OTUs), using both synthetic and real metagenomic data sets. The results demonstrate that the proposed OBC method, with either noninformative or constructed priors, is competitive or superior to the other methods. In particular, in the case where the ratio of sample size to dimensionality is small, it was observed that the proposed method can vastly outperform the others. Author summary Recent studies have highlighted the interplay between host genetics, gut microbes, and colorectal tumor initiation/progression. The characterization of microbial communities using metagenomic profiling has therefore received renewed interest. In this paper, we propose a method for classification, i.e., prediction of different outcomes, based on 16S rRNA metagenomic data. The proposed method employs a Bayesian approach, which is suitable for data sets with small ration of number of available instances to the dimensionality. Results using both synthetic and real metagenomic data show that the proposed method can outperform other state-of-the-art metagenomic classification algorithms.

Many recent studies using both preclinical models and human subjects have high-2 lighted the interplay between host genetics, gut microbes, and colorectal tumor initia-3 June 1, 2018 1/11 tion/progression. For instance, the findings in [1] provided novel insights into the complex 4 effects of inflammation on microbial composition/activity and the host's intestinal mu-5 cosa ability to protect itself from microorganisms with genotoxic capabilities. Also, from 6 a mechanistic perspective, the study in [2] demonstrated that Frizzled proteins (receptors 7 that regulate Wnt signaling, required for colonic stem cell maintenance, self-renewal, 8 and repair of the epithelial lining) can be activated by toxin B produced by pathogenic 9 bacteria (Clostridium difficile). These findings demonstrate that colonic stem cells are 10 the target of some microbial toxins. Moreover, the results in [3] demonstrated that 11 dietary fiber and fat content have a remarkable effect on the colonic microbiota and its 12 metabolic activity in high-risk vs. low-risk cancer populations. 13 There is ample evidence that the microbiome (eg, patchy bacterial biofilms) in the gut 14 can modulate host immune cell function, promoting low-grade chronic inflammation. This, 15 combined with intestinal barrier deterioration induced by (1) colorectal cancer-initiating 16 genetic lesions, and (2) crosstalk between microbiota and diets low in fiber, results in the 17 invasion of secreted microbial products (eg, oncotoxins), which can drive tumor growth. 18 Intriguingly, the protective effects of dietary fiber on cancer development may be based 19 on dramatic shifts in microbial community function-particularly short-chain fatty acid 20 production-which stimulate gut mucosal metabolism and increase epithelial barrier 21 function. 22 The characterization of microbial communities by 16S rRNA gene amplicon sequencing 23 has received renewed interest in the last decade, in part due to the emergence of high-24 throughput sequencing technology [4]. Microbial metagenomics provides a means to 25 determine what organisms are present without the need for isolation and culturing. Next 26 generation sequencing, applied to microbial metagenomics, has transformed the study 27 of microbial diversity [5]. For amplicons reads it is possible to classify sequence reads 28 against known taxa, and determine a list of those organisms that are present and the 29 read frequency associated with them [6]. In this case an unsupervised strategy can be 30 used to identify proxies to traditional taxonomic units by clustering sequences, so called 31 Operational Taxonomic Units (OTUs).

32
State-of-the-art classification methods for 16S rRNA metagenomic data include 33 Random Forests (RF) [7,8] and the phylogeny-based Metaphyl algorithm [9]. RF is a 34 popular classification algorithm, the basic principle of which is to generate a number of 35 random classification trees, and then classify a test point by majority voting among the 36 trees. The most important procedure for generating the ensemble of trees is to search for 37 the best split at each node of the tree over a random selection of from a large number 38 of features, which is the method proposed in [7], but other procedures are possible [8]. 39 Random forests have proved to be very successful in Bioinformatics [19]. In metagenomics, 40 in particular, they have become the industry standard [10,11]. The Metaphyl algorithm 41 of [9], on the other hand, seeks to introduce prior biological information in the form of 42 microbial species phylogenetic trees, under the assumption that the natural hierarchical 43 grouping of features contained in the phylogeny can help the classification procedure. 44 The Metaphyl classifier is obtained by a regularized maximum-likelihood procedure 45 where the penalty term is guided by the phylogeny tree. Many other approaches for 46 the classification of metagenomic data can be found in the literature; see [9] for a good 47 review of those.

48
In this paper, we consider a Bayesian paradigm for including prior knowledge, which 49 consists of 1) a hierarchical model for the 16S rRNA metagenomic data, where each 50 microbial sample is represented as a set of operational taxonomic unit (OTU) frequencies; 51 2) prior distribution construction; 3) posterior distribution calculation from the sample 52 data in closed form; 4) derivation of an Optimal Bayesian Classifier (OBC) [12]. The 53 prior distribution on the parameters expresses the biological information available about 54 the problem; in the absence of external knowledge, one may use a "noninformative" 55 prior, or construct the prior from a small portion of the available data (we consider both 56 approaches in our numerical experiments). The hierarchical model proposed here allows 57 computation of the posterior probability distribution in closed form, without a need for 58 computationally-expensive, approximate Monte-Carlo Markov-Chain computations. The 59 posterior distributions are used to derive the OBC, which minimizes the expected error 60 over the space of all classifiers under the assumed likelihood model. Ordinary "Bayes" 61 classifiers minimize the misclassification probability when the underlying distributions are 62 known. However, Optimal Bayesian classification trains a classifier from data assuming 63 the underlying distributions are not known exactly, but are rather part of an uncertainty 64 class of distributions, each having a weight based on the prior and the observed data.

65
The advantages of the proposed OBC classification algorithm include being applicable: 66 1) in the absence of phylogenetic information (required by the the Metaphyl algorithm), 67 2) presence of multiple classes, 3) with small ratios of sample size to dimensionality. The 68 performance of the proposed approach is compared to both the Random Forest and 69 Metaphyl algorithms, as well as a state-of-the-art nonlinear radial-basis function (RBF) 70 Support Vector Machine (SVM) algorithm, using both synthetic and real-world 16S 71 rRNA metagenomic data sets, for varying sample size, dimensionality, and classification 72 difficulty. The results show that the OBC classifier with both noninformative and 73 constructed prior consistently outperform the others on the synthetic data. On the real 74 data sets, the proposed OBC with constructed prior is competitive with the Random 75 Forest algorithm and outperforms the Metaphyl and SVM algorithms on three of the 76 four data sets using pre-defined training-testing splits; the fourth data set is resampled 77 to simulate small ratios of sample size to dimensionality (i.e., number of OTUs), in which 78 case the proposed OBC method, with either noninformative or constructed priors, vastly 79 outperformed all other methods. We remark that a summary of this work, without the 80 real data results or the prior construction method, appeared in [13].  We propose a Poisson-Dirichlet-Multinomial (PDM) hierarchical Bayesian model for 16S 84 rRNA metagenomic microbial abundance data. Multinomial sampling has been used 85 previously in the study of microbial communities [14,15]; the novelty of our approach 86 resides in adding Dirichlet and Poisson distributions to allow for OTU sparsity and a 87 varying number of total reads in the sequencing experiment.

88
Let M be the number of OTUs, and let x = (x 1 , . . . , x M ) be the microbial abundance 89 vector, where x j is the number of sequencing reads corresponding to the j-th OTU, for 90 j = 1, . . . , M . Let N = M j=1 x j be the total number of sequencing reads. Since N is 91 bound to change across different experiments, we model this parameter as a Poisson 92 random variable with parameter λ. The hierarchical model for x assumes a multinomial 93 distribution with probability vector parameter p = (p 1 , · · · , p M ) and N : The parameters of the model are p and λ. The prior distribution for p is assumed to 95 be a Dirichlet distribution with hyperparameter θ θ θ = (θ 1 , . . . , θ M ), while the prior for 96 λ is a Gamma distribution with hyperparameters α, β. We point out that there is in 97 general separate set of parameters and priors, one for each class label (some may be 98 shared between classes), but for simplicity we have not denoted this in this section. See 99 Fig. 1 for a graphical representation of the proposed hierarchical model.

101
Prior construction concerns setting the values of the hyperparameters to reflect external 102 knowledge about the model. If no external knowledge is available, the hyperparameters 103 may be set to make the priors noninformative [16]. In this paper, we assume a noninfor-104 mative prior for λ by setting α = β = 1, but for p, we consider both noninformative 105 and informative priors. In the former case, we set θ θ θ = (1, . . . , 1), in which case the 106 Dirichlet becomes a uniform distribution. In the informative case, we estimate the 107 hyperparameters though a maximum-likelihood procedure using a subset of the available 108 data. This procedure is similar to the one in [17], except that we do not impose any 109 prior knowledge constraints, i.e., the process is entirely data-driven. We describe this 110 procedure next. 111 We divide the data S n into two subsamples, one for constructing the prior, and 112 another for deriving the posterior and obtaining the classifier (which is discussed in the 113 next two sections). In all that follows, we consider the binary classification problem, but 114 the approach can be easily adapted to multiple classes (which is the case for some of 115 the data sets considered in Section 2. The class-specific sample sizes n p 0 and n p 1 used for 116 prior construction are set to a fixed proportion of the original sample sizes n 0 and n 1 , 117 respectively, where the indices indicate the class label (in this paper, the ratio is 0.3). In 118 the sequel we omit the class label to simplify the notation, but the Dirichlet parameters 119 are fitted separately for each class.
Our objective is to maximize the expected mean-likelihood with respect to the parameter 124 June 1, 2018 4/11 to obtain the constructed prior: where Π is the set of all Dirichlet priors. Using (1), we have 126 (θ θ θ; S prior It can be shown that where ψ(θ) = d dθ log Γ(θ) is a digamma function. The optimization in (3) involves 128 transforming the problem to maximization over the hyperparameter space and then 129 applying a numerical procedure, such as conjugate gradient descent, to find the desired 130 hyperparameter values. More details can be found in [17]. } be the training data. The posterior distributions for the parameters (θ θ θ y , λ) for classes y = 0, 1 (parameter λ is shared between the classes) are determined as follows: p(θ θ θ y , λ | S train n t ) ∝ p(θ θ θ y , λ) p(S train n t |θ θ θ y , λ) x y ij is the number of reads for each profile, i = 1, · · · , n t y , y = 0, 1, 133 andθ y j = θ y j + ny i=1 x y ij , j = 1, · · · , M , y = 0, 1.

135
Let ψ : R d → 0, 1 be a classifier designed on training data S = S train n t , which takes metagenomic abundance profiles X X X ∈ R d into one of the two labels 0 or 1. The error of the classifier is the probability of a mistake given the sample data: where Y ∈ {0, 1} denotes the true label corresponding to X X X.
In addition, the joint distribution depends on the prevalence c = P (Y = 1 | Υ). The expected value of the classification error over the posterior distribution of Υ becomes the quantity of interest: The OBC [12] is the classifier that minimizes the quantity in Eq 9. It was shown in [12] 137 that the OBC is given by are the effective class-conditional densities and p(Υ | Y = y, S) are the parameter 139 posterior probabilities, for y = 0, 1.

140
Plugging Eqs 1 and 7 into 10 leads to an integration that can be accomplished analytically, without a need for MCMC methods, yielding the effective class-conditional densities In addition, we assume that the parameter c is beta distributed with hyperparameters 141 (β 0 , β 1 ), independently of Υ (prior to observing the data). It can be shown that the 142 posterior distribution p(c | S) is also beta with hyperparameters β 0 + n t 0 and β 1 + n t 1 , in 143 which case E Υ|S [c] =

148
In this section, the performance of the proposed classification approach is compared to 149 state-of-the-art metagenomic classification algorithms in the literature, using synthetic 150 and real-world data sets.

152
Synthetic OTU abundance data and phylogeny trees were generated using the strategy 153 proposed in [9], which considers a common phylogenetic tree T for the OTUs in all 154 the 16S rRNA metagnomic profiles. To generate sample data for each class, the tree 155 is traversed systematically, deciding for each internal node v what fraction of species 156 would come from each of the subtrees rooted at the child nodes of v.

157
Two parameters are assigned to each node v for each class y: µ y v is the average 158 proportion of species that correspond to the subtree rooted at the left child node of 159 v in the class y, and (σ y v ) 2 is the variance of this proportion within the class. A new 160 metagenomic profile is generated by sampling the pro-portions of species at each node 161 v according to the normal distribution N (µ y v , (σ y v ) 2 ). The value of µ y v is in turn sampled 162 from the normal distribution N (μ y v , (σ v ) 2 ), whereσ v characterizes the variance between 163 the classes, whileμ k v are base values. The within-and between-class variances can be 164 controlled by using the parameters σ y v andσ v , respectively. The exact values ofσ y v and 165 σ v will be sampled at each tree node v according to N (0,λd(v)) and N (0, λd(v)), where 166 d(v) is the distance between v and the tree root. Note that the parameterslambda and 167 λ influence the difficulty of the classification problem, which is proportional to λ and 168 inversely proportional toλ. 169 We consider a total of 128 OTUs and three sample sizes for the training data, 170 n = 30, 50, 70, with equal class prevalences, c = 0.5. To generate data sets of different 171 complexity, we considered values forλ and λ of 0.5, 1 and 1.5. The sample sizes n 0 and 172 n 1 are fixed at n 0 = n 1 = n/2. Classifier accuracy is obtained by testing each designed 173 classifier on a large synthetic test data set, and averaging the results over 1000 iterations 174 using different synthetic training data sets. 175 Fig. 3 compares the performance of our proposed optimal Bayesian classifier, using 176 both noninformative and constructed priors, against that of the kernel SVM [18], RF [19] 177 and MetaPhyl [9] classification algorithms, mentioned in Section 1. As expected, the 178 classification error rate is reduced when the variance within class was decreased and 179 variance between classes was increased. Also as the sample size increases, classification 180 performance improves for all classification rules. We observe that the kernel SVM 181 classifier clearly performs poorly comparing to other classifiers, the RF and Metaphyl 182 classifier have comparable performance overall, while the proposed OBC classifier exhibits 183 the best performance over different values for between and within class variances.

185
Here, we consider four different 16S rRNA metagenomic data sets. The sample sizes, 186 dimensionality, and number of classes of each data set are displayed in Table 1. Each of 187 these data sets is accompanied by a phylogeny tree, which allows the application of the 188 Metaphyl classification algorithm. major categories of habitat: External Auditory Canal (EAC), Gut, Hair, Nostril, Oral 193 cavity, and Skin. This data set is an example of a relatively easy classification task due to 194 the generally pronounced differences between the communities. These microbiota, varies 195 systematically across body habitats and time; such trends may reveal how microbiome 196 changes cause or prevent disease [20]. from the "keyboard" data set, for which at least 397 raw sequences were recovered [14]. 201 The class labels are the anonymized identities of the three experimental subjects. This 202 classification task is the easiest of all four data sets because of the clear distinctions 203 between the individuals, the fact that all of the samples come from approximately the 204 same time point, and the large number of training samples available for each class. 205 Fierer et al. Subject Hand (FSH) [22]: The influence of sex and washing on the 206 diversity of hand surface bacteria is studied in this work. Data set contains the palmar 207 surfaces of the dominant and non-dominant hands of 51 healthy young adult volunteers 208 to characterize bacterial diversity on hands and to assess its variability within and 209 between individuals. This data set is a more challenging version of the previous ones. 210 The class labels are the concatenation of the experimental subject identities and the 211 label of which hand (left vs. right) the sample came from on that individual. There were 212 three subjects, and so there are six classes in this data set.

213
Turnbaugh et al. Twin Gut [23]: This data set contains the faecal microbial 214 communities of adult female monozygotic and dizygotic twin pairs concordant for 215 leanness or obesity, and their mothers, to address how host genotype, environmental 216 exposure and host adiposity influence the gut microbiome. This data set is a challenging 217 classification task because the classes correspond to microbial communities from the 218 same body habitat and thus are very similar. 219 We compare our classifier with RF, kernel SVM and MetaPhyl classifiers, described 220 in Section 1. For comparing different methods, we used the training and test data 221 splits provided in the CBH, FSH, FS studies. For the Twin Gut data set, we iteratively 222 sampled subsets of data of small sample sizes from the data set for training, applied the 223 classification rules, and tested the resulting classifiers on the remaining large collection 224 of points not selected. This was done to simulate the effects of sample sizes on classifier 225 training. This process is repeated 500 times and the results are averaged. In all cases, for 226 the OBC with prior construction, the "training" data includes both the subsample for 227 prior construction and for the subsample for training proper (i.e., posterior inference). 228 The classification error rates for the CBH, FSH, and FS data sets are displayed in 229 Table 2. We can observe that the proposed OBC with constructed priors outperforms 230 all others on the CBH and FSH data sets, while being slightly outperformed by the 231 state-of-the-art Random Forest classifier on the FS data set; in fact, the error rates for 232 all classifiers is small on this latter data set, revealing an easy classification problem. 233 Notice that the results for the OBC with constructed priors are always better than the 234 OBC with noninformative priors, as expected. The results for the Twin Gut data sets, for varying sizes of the training data set are 236 displayed in Figure 4. Here the small-sample effects become clear. In particular, it is 237 evident that Random Forests cannot handle very small training sample sizes well. The 238 performance of the proposed method is vastly superior to the others fon this data set. 239 We believe that the reason is that the ration between sample size and dimensionality 240 (i.e., the number of OTUs) is much smaller in this data set than the others. In addition, 241 the classes are unbalanced in this data set.

243
In this paper, we presented a model-based Bayesian framework for the classification of 244 metagenomic microbial abundance data. This approach was contrasted to state-of-the 245 art metagenomic classification algorithms, including Random Forests and the phylogeny-246 based Metaphyl algorithm. The advantages of the proposed OBC classification algorithm 247 include being applicable in the absence of phylogenetic information and in the presence 248 of multiple classes and small ratios of sample size to dimensionality. The proposed 249 classification method showed promising results in a comprehensive set of numerical 250 experiments, particularly when the ratio of sample size to dimensionality is small.