Summary
Numerous human conditions are associated with the microbiome, yet studies are inconsistent as to the magnitude of the associations and the bacteria involved, likely reflecting insufficient sample sizes. Here, we collected diverse phenotypes and gut microbiota from 34,057 individuals from Israel and the U.S. Analyzing these data using a much-expanded set of microbial genomes, we derive an atlas of robust and numerous previously unreported associations between bacteria and human traits, which we show to replicate in cohorts from both continents. Using machine learning models trained on microbiome data, we predict human traits with high accuracy across continents. Subsampling our cohort to smaller cohort sizes yielded highly variable models and thus sensitivity to the selected cohort, underscoring the utility of large cohorts and possibly explaining the source of discrepancies across studies. Finally, many of our prediction models saturate at these numbers of individuals, suggesting that similar analyses on larger cohorts may not further improve these predictions.
Introduction
The human gut microbiota is linked to metabolic disorders such as diabetes and obesity but these links are based on relatively small cohorts of several dozens or hundreds of individuals 1–8. Although these studies reported many statistically significant associations, many of these effects are either moderate or do not replicate in other works 9,10. One such example is alpha diversity, for which there are contradicting reports regarding its association with different phenotypes. While microbiome diversity is mostly regarded as a positive indicator of health 8,11–14, other studies found that increased diversity is associated with microbiome instability 15,16. Diversity was also shown to increase with age 17,18, but this association was not conclusive in other cohorts 19. These discrepancies call for studying these questions across larger cohorts from diverse backgrounds. Indeed, in the field of genetics, large cohorts are required since many traits are known to be polygenic and to be affected by small effects from many variants 20,21. Similarly, in the microbiome we expect that individual bacterial species may have a low abundance or mild associations with human phenotypes, necessitating large sample sizes. In addition, many bacterial species are present in only a relatively small fraction of the population such that the association between their abundance and traits can only be studied in large cohorts that have enough individuals that harbor them.
Apart from cohort size, there are other challenges in finding robust signals from the microbiome. One such challenge stems from the large number of genes that are shared between different bacteria through mechanisms such as horizontal gene transfer 22,23. Such sharing causes many short metagenomic sequencing reads to map non-uniquely to multiple bacteria, making it difficult to estimate bacterial relative abundance. Several methods were devised to address this issue, e.g., by mapping to genes that appear in a single copy and are unique to a single species 24. However, these methods need to be applied anew every time we wish to use a different reference genome set, as we wished to do here given the much expanded reference of species-level genome bins (SGBs) published in 2019, which added 3,796 new SGBs to the human microbiome catalog 25.
To address the above issues and with the aim of deriving robust microbiome associations, we used metagenomic sequencing to profile the gut microbiome of 34,057 individuals from both Israel and the U.S., for which we also obtained a rich set of phenotypes. We devised a novel algorithm for assessing bacterial relative abundances based on unique genetic elements, and applied it to the recent and much expanded SGB dataset of Pasolli et al. 25. Using the relative abundances on this expanded genome set and much larger cohort, we identified numerous associations between microbiome diversity and several human traits. We were also able to develop models that predict these traits using only microbiome data with high accuracy, as in the case of age (R2=0.31). Notably, these associations replicate across continents, and models derived from the Israeli cohort generalize well to the U.S. cohort, so they are not specific to a certain environment.
By subsampling our cohort to typical cohort sizes used in other studies, we show that associations and predictions derived from smaller cohorts are highly variable and thus sensitive to the selected cohort, underscoring the need for larger cohorts in the microbiome field.
Results
Metagenome samples for 34,057 participants from two continents
We obtained gut metagenomic profiles from 30,083 and 3,974 individuals from Israel and the U.S., respectively, who submitted their sample to a consumer microbiome company and signed an appropriate consent form. Participants also answered questionnaires and provided self-reported phenotypic data and blood tests (e.g., age, gender, BMI and the diabetes marker HbA1C%, Table 1).
We randomly selected 90% of the samples from the Israeli cohort (n=27,075 samples) to be our discovery cohort on which we trained predictive models using cross-validation and set aside as independent test sets the remaining 10% of the Israeli cohort (“test1”, n=3,008 samples) and the entire U.S. cohort (“test2”, n=3,974 samples) (Figure 1a-e, Table 1). These test sets were only used once to evaluate the performance of the models developed on the discovery cohort.
To compute bacterial relative abundance, we used the representatives of the species-level genome bins (SGBs) classification of Pasolli et al. 25, as they represent a greatly expanded set of genomes with thousands of new bacterial genomes that increase the number of mapped reads and allow better exploitation of metagenomic samples. We restricted ourselves to 3,127 SGBs that provide a good representation of species diversity (Methods) and used only reads that mapped uniquely to a single SGB representative. We developed a method for estimating the relative abundance of each SGB in every sample (Methods). Our method is based on examining only reads that map uniquely to a single SGB, since when using unique mappings we expect uniform coverage across SGB genome bins that have the same number of unique positions. This property allows robust estimation of relative abundances, as coverage across the genome bins depends linearly on the SGB’s relative abundance. The mean relative abundances of the different species are the same in the two Israeli cohorts but are somewhat different than in the US cohort (Figure 1f-g).
Microbiome diversity increases with age and associates with metabolic parameters
We first examined the association of microbiome diversity and human phenotypes since the literature is conflicted even on these basic associations. To this end, we computed alpha diversity using the species-level Shannon index and ranked individuals by deciles of alpha diversity (Figure 2a). When comparing the top decile and the bottom decile of alpha diversity, we found that HbA1C%, BMI, fasting glucose and fasting triglycerides are significantly higher in the bottom decile while age and HDL cholesterol are significantly lower in the bottom decile (P-value < 1e-16 after FDR correction, Mann-Whitney rank-sum test), with a consistent trend across deciles (Table S1-S6). Similarly, examining alpha diversity as a function of these traits, we found significant correlations between alpha diversity and each of these traits (Figure 2b). Notably, these associations were consistent in both the Israeli and U.S. cohorts (Figure 2b), even though the Israeli cohort has significantly higher alpha diversity values (mean 7.3±0.77 vs. 7.18±0.67, P-value<10^-40, Mann-Whitney rank-sum test). The higher diversity of the Israeli cohort persisted even when subsampling the Israeli cohort to match the U.S. cohort on age and BMI (Methods, Table S7).
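The diversity ranking described above can be illustrated with a minimal sketch. The function names, the base-2 logarithm and the handling of ties are assumptions made for illustration; the text does not specify them.

```python
import math

def shannon_index(abundances, base=2.0):
    """Shannon diversity of one sample from non-negative species abundances.

    The logarithm base is an assumption; the text does not state which was used.
    """
    total = sum(abundances)
    if total <= 0:
        raise ValueError("sample has no detected species")
    h = 0.0
    for a in abundances:
        if a > 0:  # species with zero abundance contribute nothing
            p = a / total
            h -= p * math.log(p, base)
    return h

def extreme_deciles(diversity):
    """Return sample indices of the bottom and top deciles of alpha diversity."""
    order = sorted(range(len(diversity)), key=lambda i: diversity[i])
    k = max(1, len(order) // 10)
    return order[:k], order[-k:]
```

Phenotype values of the two index groups could then be compared with a rank-sum test such as `scipy.stats.mannwhitneyu`, followed by a multiple-testing correction across phenotypes.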
Microbiome-phenotype associations are consistent across continents
We previously employed linear mixed models to estimate the fraction of phenotypic variance that can be inferred from microbiome composition, termed microbiome-association-index (b2) 26. Because our previous estimates were based on a cohort of 715 individuals and therefore had wide 95% confidence intervals, we revisited these estimates for our two new and larger cohorts. We estimated explained variance based on alpha diversity alone (Figure 3a, Methods), and based on the full species relative abundances (Figure 3b, Methods). Notably, our new estimates agreed well with our previous findings (Table S8), but our much larger cohort of 30,083 Israeli individuals has substantially narrower 95% confidence intervals (Table S9). We found that microbiome composition strongly associates with self-reported diabetes (b2=52%), age (b2=28%), HbA1C% (b2=15%), fasting blood glucose (b2=13%), BMI (b2=11%), fasting triglyceride (b2=9%), HDL cholesterol (b2=6%) and smoking status (b2=6%). In contrast, the blood levels of thyroid-stimulating hormone (TSH), albumin and clotting (measured by International Normalized Ratio, INR) were not significantly associated with the microbiome in our cohort. Notably, b2 estimates from our U.S. cohort of 3,974 individuals were consistent with those derived from the Israeli cohort (Pearson correlation R=0.75, P-value <0.001) but had wider confidence intervals (Figure 3b, Table S9,S10). Although, as expected, the variance explained by the full relative abundance matrix (our b2 estimates) was higher than that explained by alpha diversity alone, these two microbiome features agreed well (Pearson correlation R=0.52, P-value <0.03).
Different traits are accurately predicted by microbiome composition
We next asked whether various traits can be accurately predicted based only on microbiome composition. We compared two models: a linear model (with ridge regression regularization) and gradient boosted decision trees (GBDT) (Methods). Both models used species relative abundances as input. Our models obtained significant predictions for many traits (Figure 4a, b) such as age (R2=0.35 for linear regression, R2=0.31 for GBDT, for 10-fold cross-validation on the Israeli training samples), gender (AUC=0.64, 0.78), HbA1C% (R2=0.24, 0.26) and BMI (R2=0.15, 0.15).
We obtained significant predictions even when performing the analyses separately for each gender, with the exception of height, which was significantly predicted (R2=0.13) in the entire cohort but not in the gender-separated predictions, indicating that its predictions were driven by the prediction of gender. Since metformin, the most common drug used to treat patients with type 2 diabetes, is known to affect microbiome composition, we also evaluated the performance of an HbA1C% predictor only on participants who did not report taking metformin and obtained equivalent performance (R2=0.19 GBDT).
The linear models are attractive since their accuracy was close to that of gradient boosted decision trees and they are easier to interpret. However, boosted trees performed better for 11 of the 12 phenotypes that had significant predictions, age being the exception (overall mean R2 improvement of 0.02+/-0.011, Figure S1), suggesting that non-additive interactions between different bacteria are predictive of several traits. As additional evidence for the importance of non-additive interactions among bacteria in predicting traits, for both HbA1C% and BMI the R2 of the GBDT predictions on held-out subjects was higher than the estimated b2 for these traits (Figure 3b). The b2 estimation used linear mixed models to estimate the fraction of variation predicted by the microbiome, and thus does not capture any non-linear interactions.
We investigated whether the predictive power of the microbiome is mediated through age and sex, since some of the above traits such as HbA1C% are known to increase with age 27. We found that the microbiome composition predicted age with high accuracy (R2=0.31), and age and gender alone predict HbA1C% with R2=0.20 and BMI with R2=0.02 (GBDT, Figure 4c-e). Therefore, we asked whether microbiome-based predictions of HbA1C% and BMI are mediated entirely by the microbiome's ability to predict age and gender, or whether the microbiome carries additional predictive power specific to these traits. We found that adding microbiome features to age and gender in the GBDT model significantly improved the predictions of both HbA1C% (from R2=0.20 to 0.36, Figure 3c) and BMI (from R2=0.03 to 0.18, Figure 3c), demonstrating that some of the association between microbiome and these traits is not mediated through age and gender.
We also evaluated the accuracy of the above models, derived only from the Israeli training set, on our two independent held-out cohorts from Israel and the U.S. and found that they all had highly significant predictions (Figure 4f), thereby validating the robustness of our models. We note that prediction accuracy for age and HbA1C% was lower in the U.S. cohort, which may be explained by differences in the microbiome composition between the IL and U.S. cohorts and by lower age and HbA1C% levels in the U.S. cohort.
Finally, to examine the importance of cohort size on prediction accuracy, we applied our above prediction pipeline to different random subsamples of our training cohort, ranging from a few hundred subjects to 24,000 (Methods). We found that prediction accuracy increases with cohort size (Figure 4g-i) and does not saturate even with a cohort of 1,000 individuals. For age, we observed an almost two-fold increase in the R2 (from 0.18 to 0.30) when increasing the cohort from 1,000 to 12,000 individuals. For cohorts of hundreds of individuals the standard deviation of the predictions was high, as in the case of HbA1C%, for which different subsamples of 200 individuals can reach both R2=0.4 and R2=0.0 as likely outcomes (within 2 standard deviations). Together, these results highlight the need for large microbiome cohorts, as is known to be the case in the field of human genetics.
An atlas of bacterial species that robustly correlate with age, HbA1C% and BMI
We next sought to identify which individual bacterial species are responsible for driving the predictions of our models for age, HbA1C% and BMI, since these traits were predicted with the highest accuracy. We found many bacterial species that exhibited highly significant correlations to these traits (Figure 5a-c, e.g., 640, 454, 779 bacteria out of the top 1345 occurring bacteria had significant Spearman correlation, P-value<0.05 after Bonferroni correction for age, HbA1C% and BMI respectively). Moreover, the Spearman correlation of the bacterial abundances with these traits was in good agreement between the Israeli and U.S. cohorts (Figure 5a-c, R=0.58, 0.57, for age, HbA1C% and BMI, P-value<10^-39). Notably, the 3 bacteria most strongly associated with BMI in both cohorts included a bacterial species from the Eubacteriaceae family that was only recently assembled and that has no genome in public repositories (unknown SGB, Figure 5a-c). Again, we subsampled the cohorts to smaller cohort sizes and observed that large cohorts are necessary for results to replicate (Figure 5d-f).
As a useful resource for the community, we compiled our results into an atlas of summary statistics for all bacterial species and top predicted phenotypes (Table S11-S16). For each species we report bacterial associations to human phenotypes based on bacterial log relative abundances. Specifically, we report the Spearman correlation coefficient and P-value, the Pearson correlation coefficient and P-value, the coefficient in the linear model (trained with Ridge regularization), and bacterial feature importance in the GBDT model using the feature attribution framework of SHapley Additive exPlanations (SHAP) 28. In genetics, summary statistics of single nucleotide polymorphisms are widely used to generate polygenic risk scores which were shown to be predictive of disease 29,30. Similarly, researchers can now use our resource to generate microbiome-based predictions of phenotypes in their datasets by extracting our reported bacterial regression coefficients and multiplying them by the log of the relative abundances of the corresponding species in their dataset.
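As an illustration of how such coefficients might be applied, the following minimal sketch computes a linear score for one sample. The SGB identifiers, the base-10 logarithm, the abundance floor for undetected species and the intercept are all assumptions made for illustration and are not taken from the atlas tables.

```python
import math

def atlas_score(coefficients, abundances, intercept=0.0, floor=1e-4):
    """Linear score: intercept + sum(coef * log10(abundance)) over atlas species.

    coefficients: hypothetical mapping of SGB id -> regression coefficient.
    abundances:   mapping of SGB id -> relative abundance in one sample.
    floor:        assumed stand-in abundance for undetected species, so that
                  the logarithm is defined.
    """
    score = intercept
    for species, coef in coefficients.items():
        a = max(abundances.get(species, 0.0), floor)
        score += coef * math.log10(a)
    return score
```

For the score to be meaningful, the abundances in the new dataset would need to be estimated and transformed in the same way as those used to fit the coefficients.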
Discussion
In this study, we collected the largest cohort to date of metagenomic samples and phenotypic data from two continents, and analyzed it using a much expanded set of reference microbial species. Together, this allowed us to identify highly robust associations between gut microbiome composition and phenotypes, which replicate in both cohorts. We compiled these robust associations into an atlas that can be used by the community to derive trait predictions on smaller datasets, akin to the use of summary statistics in the field of genetics. We show that a large fraction of the variance of several traits such as age, HbA1C% and BMI can be accurately predicted by both linear models and boosting decision trees models. These predictions replicate across continents and there is also high agreement in the set of individual bacterial species that associate with these traits in the Israeli and U.S. cohorts.
When sub-sampling our large cohort into smaller sized cohorts, we found that even cohorts of 1,000 individuals have significantly lower average accuracy of associations between bacteria and phenotypes. Models derived from different sub-samples of smaller cohort sizes display high variability both in the set of bacteria that associate with each trait and in prediction accuracy. These results may explain the relatively low agreement that exists across studies in the set of bacteria associated with different traits and conditions, and they call for employing larger cohort sizes in microbiome studies.
Using an expanded reference set allowed us to study many bacterial species for the first time and to identify novel associations for them. Notably, even among the top associated bacteria we found unnamed bacteria that are prevalent and appear in hundreds and sometimes thousands of individuals from our cohort. These findings emphasize the importance of expanding the reference set of the human microbiome even further, and suggest that such newly identified species may have strong associations with important host phenotypes.
Overall, by combining larger microbiome cohorts and expanded bacterial genome references we robustly characterize bacterial links to many important health parameters, serving an important first step towards unraveling the causal links and mechanisms by which bacteria affect host phenotype.
Methods
Microbiome sample collection, processing and analysis
Participants provided a stool sample using an OMNIgene-Gut stool collection kit (DNA Genotek), which was processed according to the methods described in Mendes-Soares et al. 31. Genomic DNA was purified using the PowerMag Soil DNA isolation kit (MoBio) optimized for the Tecan automated platform. Illumina-compatible libraries were prepared as described in 32 and sequenced on an Illumina NextSeq 500 (75 bp, single end) or on a NovaSeq 6000 (100 bp, single end). Reads were processed with Trimmomatic 33 version 0.32 to remove reads containing Illumina adapters, filter low-quality reads and trim low-quality regions (parameters used: -phred33 ILLUMINACLIP:<adapter file>:2:30:10 SLIDINGWINDOW:6:20 CROP:100 MINLEN:90 for 100 bp reads; CROP:75 MINLEN:65 for 75 bp reads). Reads mapping to host DNA were detected by mapping with bowtie2 34,35 (with default parameters and an index created from hg19) and removed from downstream analysis.
All samples were subsequently downsampled to a depth of 5M reads. Samples with fewer reads were removed from further analysis, leaving us with a reduced sample of participants that was used for downstream microbiome analyses.
Relative abundance estimation of SGBs
The bacterial reference dataset for relative abundance estimation is based on the representative assembly of the species-level genome bins (SGBs) and genus-level genome bins (GGBs) defined by Pasolli et al. 25. By construction, all assemblies in each SGB are at high average nucleotide identity with one another, and the representative was chosen to be the best quality assembly amongst them.
Out of the 4,930 human SGBs (associated with various body sites), we chose to work with 3,127 SGBs, each of which either belongs to a unique genus or contains at least 5 assemblies, enough to justify a new SGB. We employed this restriction since we noticed that the cutoff threshold used by Pasolli et al. to cluster assemblies into SGBs sometimes resulted in a small group of assemblies, with little nucleotide difference from a large nearby SGB, being artificially split into a new SGB.
Abundance was calculated by counting reads that best matched a single SGB of the set. To avoid sample reads that may be assigned to more than one SGB (which might mislead us to believe that an SGB appears in a sample when it actually does not), we created a mapping of all 100/75-bp reads that are unique to a single one of these representatives. We divided each representative genome assembly into consecutive windows such that each window includes 100 unique 100/75-bp reads (unique-100-bins). Since different proportions of reads are unique in different areas of the assembly, these windows are not of constant length, but the number of sample reads expected to uniquely map to them should be constant.
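The window construction can be sketched as follows. This is a simplified illustration and not the published implementation: it assumes a precomputed per-position flag stating whether the read starting at that position is unique to the representative, and it drops any trailing partial window.

```python
def unique_windows(is_unique, reads_per_window=100):
    """Split a genome into consecutive windows with a fixed number of unique reads.

    is_unique[i] is True when the read starting at position i maps only to this
    representative (assumed to be precomputed). Returns half-open (start, end)
    position intervals, each containing exactly reads_per_window unique starts.
    """
    windows = []
    start = 0
    count = 0
    for pos, unique in enumerate(is_unique):
        if unique:
            count += 1
            if count == reads_per_window:
                windows.append((start, pos + 1))
                start = pos + 1
                count = 0
    # A trailing window with fewer unique reads is dropped (an assumption).
    return windows
```

Because each window holds the same number of unique read starts, windows in regions with few unique positions span more bases than windows in highly unique regions.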
We used bowtie2 35 to map samples from our cohort against an index built from the set of SGB representatives (requiring all mappings of length 100/75 to score −40 or above). When analysing the mapping, we considered only reads whose best mapping was unique (and thus mapped to a location that is unique in the set of representatives). We counted the number of reads uniquely mapped to each window of each SGB.
To assess the cover of each SGB, we first chose a window size, a multiple of the original unique-100-bins, for which the average cover was at least 20 reads. Next, we summed the number of reads in each enlarged window and examined the distribution of covers over the windows.
Finally, we took the dense mean of that distribution 36, to avoid our cover estimation being biased by a relatively small part of the reference that is highly covered (which may arise from plasmids or horizontal transfer not identified in the uniqueness process because it did not appear in any other representative) or lowly covered (since this is a representative of an SGB, a strain present in our sample may not include all parts of the representative). When the dense 50% of the cover distribution started above 0, we concluded that the SGB exists in the sample and estimated its relative abundance. The cover estimate for each SGB is the dense mean cover of its representative, normalized by the enlarged-window size.
The relative abundance estimation is the cover divided by the sum of the covers of all representatives we concluded exist in this sample.
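The cover and relative abundance steps can be sketched as below. The sketch assumes that the "dense 50%" is the narrowest run of sorted window covers containing half of the windows and that the dense mean is the mean over that run; this reading, and all function names, are assumptions rather than the definition given in reference 36 or in the published code.

```python
def dense_half(values):
    """Return the narrowest run of sorted values that holds half of them."""
    ordered = sorted(values)
    k = max(1, (len(ordered) + 1) // 2)
    best = min(range(len(ordered) - k + 1),
               key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[best:best + k]

def sgb_cover(window_counts, window_size):
    """Dense-mean cover per unit window, or 0.0 when the SGB is judged absent."""
    dense = dense_half(window_counts)
    if dense[0] <= 0:  # the dense half does not start above zero
        return 0.0
    return sum(dense) / len(dense) / window_size

def relative_abundances(covers):
    """Normalize per-SGB covers so that the detected SGBs sum to 1."""
    total = sum(covers.values())
    return {sgb: c / total for sgb, c in covers.items() if c > 0}
```

Under this reading, a few windows with extreme coverage (for example from a plasmid) or with zero coverage (a region missing from the strain in the sample) fall outside the dense half and do not shift the estimate.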
The code for the algorithm is available on GitHub: https://github.com/erans99/UniqueRelativeAbundance
Cohort matching
We subsampled the IL cohort to match the US cohort on age and BMI using the MatchIt package from the CRAN repository for R.
Alpha diversity explained variance
We calculated the variance explained by alpha diversity by regressing out gender and age from each phenotype and then modeling the residual phenotype as a function of alpha diversity using ordinary least squares. To obtain confidence intervals, we bootstrapped the data 10,000 times.
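A minimal standard-library sketch of this procedure is shown below. The function names, the percentile form of the bootstrap interval and the use of Gram-Schmidt orthogonalization to regress out covariates are choices made for illustration and are not stated in the text.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residualize(y, covariates):
    """Residuals of y after OLS on an intercept plus the covariate columns."""
    basis = []
    for col in [[1.0] * len(y)] + [list(c) for c in covariates]:
        for b in basis:  # Gram-Schmidt: remove what earlier columns explain
            coef = dot(col, b) / dot(b, b)
            col = [c - coef * v for c, v in zip(col, b)]
        if dot(col, col) > 1e-12:  # skip columns that are collinear
            basis.append(col)
    resid = list(y)
    for b in basis:
        coef = dot(resid, b) / dot(b, b)
        resid = [r - coef * v for r, v in zip(resid, b)]
    return resid

def r_squared(x, y):
    """R2 of a one-predictor OLS fit with intercept (squared Pearson r)."""
    xc, yc = residualize(x, []), residualize(y, [])
    sxx, syy = dot(xc, xc), dot(yc, yc)
    if sxx == 0 or syy == 0:
        return 0.0
    return dot(xc, yc) ** 2 / (sxx * syy)

def bootstrap_r2(x, y, n_boot=10000, seed=0):
    """Percentile 95% interval of R2 from resampling individuals with replacement."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(r_squared([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]
```

In use, the phenotype would first be passed through `residualize` with the age and gender columns, and the residuals would then be given to `r_squared` and `bootstrap_r2` together with the alpha diversity values.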
Microbiome-association-index
We calculated b2 estimates using linear mixed models as previously described 26. We used age and gender as fixed-effect covariates and built a microbiome genetic-relationship matrix using our SGB-based relative abundances. Because the b2 calculation assumes that the phenotype is normally distributed, we removed sample outliers from the IL and US cohorts using the same thresholds (removing less than 5% of individuals, Table S17). To account for differences between the population and study prevalence of binary traits, we applied the correction of Lee et al. 37, which has been shown to provide a lower bound on the fraction of explained variance 38. We also provide uncorrected estimates in Table S18. Blood SGPT levels were far from normally distributed, and b2 was therefore not estimated for this phenotype.
Phenotypes prediction
We used the gradient boosting trees regressor from XGBoost 39 as the regression model for continuous phenotypes, and the gradient boosting trees classifier from XGBoost as the classification model for phenotypes with binary values. All XGBoost hyperparameters were tuned using only cross-validation on the training set.
The parameters of the predictors when using microbiome features were: colsample_bylevel=0.075, max_depth=6, learning_rate=0.0025, n_estimators=4000, subsample=0.6, min_child_weight=20. These parameters were used for regression as well as classification.
The rest of the parameters had the default values of XGBoost.
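For reference, the reported settings can be collected as in the following sketch. The helper name `make_model` is hypothetical, and the sketch assumes the scikit-learn style estimators `XGBRegressor` and `XGBClassifier` from the xgboost Python package; the text does not state which interface was used.

```python
# Hyperparameters reported for predictors that use microbiome features.
XGB_PARAMS = {
    "colsample_bylevel": 0.075,
    "max_depth": 6,
    "learning_rate": 0.0025,
    "n_estimators": 4000,
    "subsample": 0.6,
    "min_child_weight": 20,
}

def make_model(binary=False):
    """Build an XGBoost estimator with the reported settings (others default)."""
    import xgboost  # imported lazily so the settings can be inspected without it
    cls = xgboost.XGBClassifier if binary else xgboost.XGBRegressor
    return cls(**XGB_PARAMS)
```

The small learning rate combined with a large number of trees and strong column subsampling is a common way to regularize boosted trees when features are numerous and correlated.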
For the Ridge linear regression we used the RidgeCV from the scikit-learn package. The parameters used for the regressor were: alphas=[0.1,1,10,100,1000], normalize=True. The rest of the parameters were the default. The input to the Ridge linear regression was log transformed SGB abundance.
For binary phenotypes, the SGDClassifier from the scikit-learn package was used with default parameters (L2 regularization).
When using microbiome features for the prediction, only the top 1345 occurring SGBs were used, i.e., the SGBs that were found in the highest number of samples, to avoid overfitting on rare SGBs.
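The prevalence filter can be sketched as follows, assuming each sample is represented as a mapping from SGB identifier to relative abundance. The function name and the tie-breaking rule are illustrative assumptions.

```python
from collections import Counter

def top_prevalent_sgbs(samples, k=1345):
    """Return the k SGB ids detected (abundance > 0) in the most samples."""
    prevalence = Counter()
    for sample in samples:
        prevalence.update(sgb for sgb, a in sample.items() if a > 0)
    # Ties are broken by SGB id so that the selection is deterministic.
    ranked = sorted(prevalence, key=lambda s: (-prevalence[s], s))
    return ranked[:k]
```

To avoid information leaking from held-out data, the selection would be computed on the training samples only and then applied unchanged to the test sets.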
Calculating prediction accuracy as a function of cohort size
For cohort size n (for n=200, 500, 1000, 2000, 3000, 4000, 6000, 8000, 12000, 16000, 20000, 24000; for prediction of HbA1C% the maximum size was 16000) we repeated the following process 10 times: we randomly selected a subset of n samples, ran 10-fold cross-validation of the prediction and recorded the mean and standard deviation across folds. By repeating the procedure 10 times we obtained the mean and standard deviation of the prediction accuracy estimate.
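The procedure can be sketched as below. The callback `score_fold` is a hypothetical placeholder for fitting a model on the training indices and scoring it on the test indices, and summarizing each repeat by its mean cross-validated score is one possible reading of the description above.

```python
import random
import statistics

def kfold_indices(n, k, rng):
    """Shuffle positions 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def subsample_accuracy(n_total, n, score_fold, repeats=10, folds=10, seed=0):
    """Mean and SD, over repeated subsamples, of the mean cross-validated score.

    score_fold(train_idx, test_idx) is supplied by the caller: it fits a model
    on train_idx and returns its accuracy (for example R2) on test_idx.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(repeats):
        subset = rng.sample(range(n_total), n)  # random cohort of size n
        parts = kfold_indices(n, folds, rng)
        scores = []
        for f in range(folds):
            test = [subset[i] for i in parts[f]]
            train = [subset[i] for g in range(folds) if g != f for i in parts[g]]
            scores.append(score_fold(train, test))
        means.append(statistics.mean(scores))
    return statistics.mean(means), statistics.stdev(means)
```

Calling `subsample_accuracy` once per cohort size n yields one point and one error bar of an accuracy-versus-cohort-size curve.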