## Abstract

**Background** One of the major challenges in microbial studies is to discover associations between microbial communities and a specific disease. A specialized feature of microbiome count data is that intestinal bacterial communities have clusters reffered as enterotype characterized by differences in specific bacterial taxa, which makes it difficult to analyze these data under health and disease conditions. Traditional probabilistic modeling cannot distinguish dysbiosis of interest with the individual differences.

**Results** We propose a new probabilistic model, called ENIGMA (Enterotype-like uNIGram mixture model for Microbial Association analysis), to address these problems. ENIGMA enables us to simultaneously estimate enterotype-like clusters characterized by the abundances of signature bacterial genera and environmental effects associated with the disease.

**Conclusion** We illustrate the performance of the proposed method both through the simulation and clinical data analysis. ENIGMA is implemented with R and is available from GitHub (https://github.com/abikoushi/enigma).

## Introduction

More than 100 trillion microbes live on and within human beings and consists of complex microbial communities (microbiota). The majority of microbes cannot be cultured in laboratories, which makes it difficult to understand which individual microorganisms mediate vital microbiome-host interactions under health and disease conditions. However, recent important advances in high-throughput sequencing technology have allowed us to observe the composition of these intestinal microbes. That is, for each sample drawn from an ecosystem, the number of occurrences of each operational taxonomic units (OTUs) is measured and the resulting OTU abun-dance are summarized at any level of the bacterial phylogeny. Discovering recurrent microbial compositional patterns that are related with a specific disease is a significant challenge since individuals with the same disease typically harbor different microbial community structures.

The recent large-scale sequencing surveys of the human intestinal microbiome, such as the US NIH Human Microbiome Project (HMP) and the European Metage-nomics of the Human Intestinal Tract project (MetaHIT), have shown considerable variations in microbiota composition among individuals [1, 2]. In particular, the presence of community clusters characterized by differences in the abundance of signature taxa, referred to as enterotypes, have been first reported in humans [3]. Later, other studies found enterotype-like clusters which might reflect features of host-microbial physiology and homeostasis in different species [4, 5] or across human body sites [6–9]. These observed microbial stratification has motivated the development of methods to examine unknown clusters of microbial communities.

Probabilistic modeling of microbial metagenomics data often provides a powerful framework to characterize the microbial community structures [10–12]. For example, Knights *et al.* [10] applied a Dirichlet prior to a single-level hierarchy and proposed a Bayesian approach to estimate the proportion of microbial communities. Holmes *et al.* [11] extended the Dirichlet prior to Dirichlet multinomial mixtures to facilitate clustering of microbiome samples. Shafiei *et al.* [12] proposed a hierarchical model for Bayesian inference of microbial communities (BioMiCo) to identify clusters of OTUs related with environmental factors of interest.

However, such models are not suitable for identification of enterotype-like clusters of microbial communities doe to the following two reasons. First, the frameworks of Knights *et al.* [10] and Holmes *et al.* [11] do not explicitly address the association between the microbial compositional patterns and environmental factors of interest. Second, the framework of Shafiei *et al.* [12] models the structure of each sample by a hierarchical mixture of multinomial distributions that are dependent to factors of interest. It is known that individual host properties such as body mass index, age, or gender cannot explain the observed enterotypes [3]. Thus, such enterotype-like clusters that describes interindividual variability among humans do not always to directly affect host probabilities such as diseases ranging from localized gastroen-terologic disorders to neurologic, respiratory, metabolic hepatic, and cardiovascular illnesses.

Here, we introduce a novel probabilistic model of a microbial community structures, called ENIGMA (Enterotype-like uNIGram mixture model for Microbial Association analysis), to address these problems. ENIGMA includes the following contributions:

ENIGMA takes OTU abundances as input and models each sample by underlying unigram mixture whose parameters are represented by unknown group effects and known effects of interest. The group effects are represented by the baseline parameters which change with a latent group of microbial communities. One of the most important features for our model is that the group effects are independent of the effects of interest. This enables to separate interindividual variability and fixed effects of the host properties related with disease risk.

ENIGMA is regarded as a Bayesian learning for the association between community structure and factors of interest. Our model can be used to simultaneously learn how enterotype-like clusters of OTUs contributes to microbial structure and how microbial compositional patterns might be related to the known features of the sample.

We provide an efficient learning procedure for ENIGMA by using a Laplace approximation to integrate out the latent variables and estimate the evidence of the complete model and the credible intervals of the parameters. The software package that implements ENIGMA in the R environment is available from https://github.com/abikoushi/enigma.

We describe our proposed framework and algorithm in section named “Methods”. We evaluate the performance of ENIGMA on simulated data in terms of its accuracy to estimate parameters and identify clusters in Section named “Simulation Study”. We apply ENIGMA to clinical metagenomics data and demonstrate how ENIGMA simultaneously identifies enterotype-like clusters and gut microbiota related with Parkinson’s disease (PD) in Section named “Results on Clinical Data”.

## Methods

Suppose that we observe microbiome count data of *K* taxa for *N* samples with *M* individual host properties, (*y*_{nk}, *x*_{nm}) (*n* = 1,…, *n*; *k* = 1,…, *K*; *m* = 1,…, *M*) where represents the abundance of the *k*-th taxa in the *n*-th sample and *x*_{nm} represents a binary variable such that *x*_{nm} = 1 if the *n*-th sample has the *m*-th host property and *x*_{nm} = 0 otherwise. Here the word “taxa” could be at any level of the bactgerial phylogeny, e.g., species, genes, family, order, etc.

### Model

Figure 1 illustrates the plate diagram of the proposed model for metagenome sequencing, where *y** _{n}* is the read count vector of the

*n*-th sample,

*x**is the vector of the host properties of the*

_{n}*n*-th sample and

*z*

_{n}∈ {1,…,

*L*} is a latent class of the

*n*-th sample. Our model is a simple extension of unigram mixture model. We assume that each sample is generated from a multinomial distribution with the parameter vector

*p**= (*

_{n}*p*

_{n}

_{1},…,

*p*

_{nK})

^{T}. The elements of

*p*_{n},

*p*

_{nk}(

*k*= 1,…,

*K*) are probabilities of the occurrence of the

*K*taxa for the

*n*-th sample. We also assume that

*p*

_{nk}can be influenced independently by the environmental factor on the taxa that is common to all latent classes and the interindividual factor on the latent enterotype-like classes. More specifically, the generative process of ENIGMA is defined by:where

*γ**is baseline parameter (*

_{l}*K*-dimensional vector) which change with the latent class,

*M × K*matrix

**= (**

*B**β*

_{mk}) is effect of a environmental factor common the all enterotypes,

*β**is a*

_{m}*m*-th row-vector of

**= (**

*B, π**π*

_{1},…,

*π*

_{L}) is a mixing ratio of components,

*O*

_{K}is

*K*-dimensional zero matrix and

*I*

_{K}is

*K*-dimensional identity matrix. Here softmax function is defined by softmax for a vector

**= (**

*x**x*

_{1},…,

*x*

_{K})

^{T}using element-wise exponential function and the probability function of categorical distribution is parameterized as Pr(

*z*=

*l|*

**) =**

*π**π*

_{l},

*l*∈ {1,…,

*L*}. In a Bayesian approach we need to define prior distributions for

**, and**

*π, β*

*γ*_{l}. We set a prior based on the Dirichlet distribution for

**, and flat priors to the hyperparameters**

*π**σ*and

*τ*for

**and**

*β***, respectively. For the convenience of later section, let**

*γ*

*p**′*= softmax(

_{l}

*γ*_{l}) be probabilities of the occurrence of bacteria in the latent classes

*l*.

### Parameter estimation

Let us denote observed matrix by ** Y** = (

*y*

_{nk}),

**= (**

*X**x*

_{nm}), the unknown parameters by

**= (**

*θ*

*α, B, γ*_{1},…,

*γ**) and their prior by*

_{L}, σ, τ*φ*(

**). The posterior distribution is represented as follows:First, latent variable**

*θ**z*

_{n}must be marginalized. The likelihood belongs toThe posterior distribution is proportional to product of the likelihood and prior density:Let be the MAP estimator of

**, found by maximizing log**

*θ**p*(

**).**

*θ, Y, X*We use a Laplace approximation [15] for parameter estimation. A Taylor expansion around giveswhere and *H*() is Hessian of log *p*(** θ**|

**) evaluated at . Eq.4 giveswhere**

*Y, X**C*is normalizing constant. This relation shows that

*p*(

*θ|*

**) can be approximated by normal distribution**

*Y, X**N*(,

*H*

^{−}

^{1}()). Credible intervals can be calculated from this multivariate normal distribution.

We used stochastic programming language Stan (http://mc-stan.org/) for its implementation. The MAP estimators were obtained by L-BFGS method. Credible intervals were computed from the using a Stan function to compute the Hessian at the MAP estimates.

After fitting the model, we are left with the task of classify the enterotype of each samples. The conditional probability of *z*_{n} = *l* isThis is the probability which *n*-th sample belong enterotype *l*. Then, *n*-th sample is then classified into the *l*-th enterotype that maximizizes the conditional probability gven by Eq.5.

### Model Selection

We are also interested in whether or not the whole set rather than individual bacteria is related to the environmental factors of interest. We consider the comparison between the two models when ** B** /=

**0**and

**=**

*B***0**. We can use the log marginal likelihood as the goodness of fit for model comparison. The marginal likelihood is given byFrom Eq.4, we haveSo, log marginal likelihood is approximated by following formula:where

*D*is the number of free parameters. In model comparison, we choose the model with the larger log marginal likelihood.

## Simulation Study

To show the performance of ENIGMA, we conducted several experiments by simulation. The synthetic data can be naturally produced via our generative process given by Eq.1. We set *M* = 1, *L* = 3, *π*_{l} = 1/3, and ** α** = (1, 1, 1)

^{T}. We first generated

**and**

*B*

*γ**from the standard normal distribution. The variables*

_{l}

*x**, and*

_{n}, z_{n}

*y**are then sampled from the Bernoulli distribution with probability of 0.5, the categorical distribution, and the multinomial distribution, respectively. For the above parameter setting, we randomly generate a count dataset of 100 taxa for 100 samples for evaluation.*

_{n}**Coverage probability (CP)**: The coverage probability is the proportion of the time that the interval contains the true value. A discrepancy between the coverage probability and the nominal coverage probability frequently occurs. When the actual coverage is greater than the nominal coverage, the interval is called conservative. If the interval is conservative, there is no inconsistency in interpretation.**Bias**: The bias ofis defined by difference between true value and estimated value*B**E*[]*−*.*B***Standard error (SE)**: The standard error is the standard deviation of the estimate. The smaller standard error indicates the higher accuracy of estimation.**Root mean squared error (RMSE)**: The RMSE is defined by . The smaller RMSE indicates the higher accuracy of estimation.**Accuracy**: The accuracy is the percentage of samples correctly classified into original group.

For calculating these metrics, we note that we calculated the sample means and standard deviations of and ( *−* ** B**)

^{2}from the 10,000 synthetic datasets.

Figure 2 shows the comparison of true ** B** and the mean and standard deviation of estimates through the 10,000 simulations. We observed that the points are arranged diagonally, which implies the estimator of ENIGMA is unbiased. We also calculated the proportion of the time that the 95% credible interval contains the true value of

**. We found that this proportion is greater than nominal value 0.95 for all**

*B***in Figure 3. Table 2 shows the coverage probability (CP), bias, standard error (SE), and RMSE of , respectively. We observed that the bias and standard error decrease when**

*B**β*

_{mk}is large (i.e. the corresponding abundance is large). We also found that the accuracy of classification given by Eq.5 is exactly 100%. Thus, these results indicate that ENIGMA can produce reasonable estimates.

## Results on Clinical Data

To validate the performance of ENIGMA on discovering clusters of micribial communities and associations between microbes and a specific disease, we applied ENIGMA to the real metagenomic sequencing data from Scheperjans *et al.* [16], Hill-Burns *et al.* [17], Heintz-Buschart *et al.* [18] and Hopfner *et al.* [19]. The data is analized by sequencing the bacterial 16S ribosomal RNA genes sampled from patients of Parkinson’s disease (PD) and control in Finland, USA, and Germany. Table 1 shows the summary statistics of the data. The OTUs are mapped to the SILVA taxonomic reference, version 132 (https://www.arb-silva.de/) and the abundances of family-level taxa are calculated. Following the evidence of Arumugam *et al.* [3], the number of latent classes in ENIGMA is chosen to be *L* = 3. We set the hyperparameters of Dirichlet prior ** α** = (1, 1, 1)

^{T}, which is equivalent to a non-informative prior.

We evaluated whether the model where bacteria have the associations to the PD patients is better than the model without the associations in terms of marginal likelihood. We note that the marginal likelihood represents the model evidence which expresses the preference of the data for different models. Let *M*_{1} be the model which is described Eq. 1 and *M*_{0} be the model setting all *β*_{mk} = 0 in Eq. 1. Table 3 shows that the marginal likelihood of *M*_{1} is greater than *M*_{0}. It is better to explain the data by considering the association between the microbiota and PD.

Figure 4 shows the estimated probabilities of the occurrences of bacteria for the three latent classes, *p**′ _{l}*, (

*l*= 1, 2, 3). Bacteria detected in less than three countries were removed. Arumugam

*et al.*[3] showed that enterotype is characterized by the differences in the abundance of

*Bacteroides, Prevotella*, and

*Ruminococcus*. The result of ENIGMA shows the same tendency as previous survey. According to the results of ENIGMA, the abundance of

*Enterobacteriaceae*and

*Lachnospiraceae*also differ greatly among clusters. Bacterial abundance differs between countries. In USA there is a large abundance of

*Verrucomicrobiaceae*, but in Finland there are few. Conversely, in Finland there is more

*Prevotellaceae*, but in USA it is less.

Table 5 shows the coefficients whose 95% credible intervals do not contain zero in more than two countries. The microbes with these coefficients indicates that the corresponding microbial composition patterns are significantly related to PD. We found that, in family levels, *Clostridiaceae, Comamonadaceae, Pasteurellacea, Prevotellaceae, Actinomycetaceae, Bifidobacteriaceae, Enterococcaceae, Lactobacil-laceae, Synergistaceae, Verrucomicrobiaceae* and *Victivallaceae*, the signs of these coefficients matched in all countries. These results are consistent with previous studies. Hill-Burns *et al.* [17] reported PD patients contained high levels of *Bifidobacte-riaceae* and *Verrucomicrobiaceae* and low levels of *Pasteurellaceae*. *Scheperjans et al.* [16] reported PD patients contained high levels of *Lactobacillaceae, Verrucomi-crobiaceae* and low levels of *Prevotellaceae*. Hopfner *et al* reported PD patients have high levels of *Lactobacillaceae* and *Enterococcaceae*.

We compared ENIGMA to the Wilcoxon rank sum test, one of the classical methods for identifying bacteria related with a environmental factor of interest [18]. Table 4 shows bacteria significantly related with the PD patients with *p*-value < 0.05 in more than two countries. We observed that the bacteria detected by the Wilcoxon test were almost included in those of ENIGMA (Table 5). We note that all of the corrected *p*-values in Table 4 are larger than 0.05. This result shows that ENIGMA is superior to the Wilcoxon rank sum test in terms of identifying more associations between microbiota and the PD patients.

The analyses with real data thus show that ENIGMA can identify enterotype-like clusters and the associations between the gut microbiota and the PD patients, and some of the results are strongly supported by the previous researches.

## Conclusion

We proposed a novel hierarchical Bayesian model, ENIGMA, to discover the underlying microbial community structures and associations between microbiota and their environmental factors from microbial metagenome data. ENIGMA is based on a probabilistic model of a microbial community structures and supplied with labels for one or more environmental factors of interest for each sample. The structures of each sample is modeled by a multinomial distribution whose parameters are represented independently by group and environmental effects of each sample, which prevent mixing of individual differences and effects of interest. This framework enables the model to learn (*i*) how microbes contribute to an underlying community structures (cluster) and (*ii*) how microbial compositional patterns are explained environmental factors of interest, simultaneously. The effectiveness of ENIGMA was evaluated on the bases of experiments involving both synthetic and read datasets. We believe that these newly discovered clusters and associations estimated from ENIGMA would provide more insight in the the mechanisms of a microbial community.

There is one major limitation of ENIGMA is its scalability and efficiency, since the number of the parameters in the model grow proportional to the number of taxa when the number of environmental factors of interest is large. Further works should focus on developing a scalable probabilistic model of microbial compositions to analyze underlying microbial structures with a large number of these effects by using sparse parameter estimation [20]. We are also interested in developing a dynamic probabilistic model similar to reproted by Blei and Lafferty [21] to analyze time-varying bacteria compositions during the progression of a disease.

## Competing interests

The authors declare that they have no competing interests.

## Author’s contributions

KA and TS designed the proposed algorithm; KO and MH designed the experiments.

## Acknowledgements

This work was supported by Grants-in-Aid from the Ministry of Education, Culture, Sports, Science and Technology of Japan (MEXT); Ministry of Health, Labour and Welfare of Japan (MHLW); Japan Agency for Medical Research and Development (AMED), and the Hori Sciences and Arts Foundation.