Abstract
The gut microbiome ecosystem is a significant driver of host health and disease. High-throughput longitudinal studies have begun to unravel the complex dynamics of these ecosystems, and quantitative frameworks are now being developed to understand their organizing principles. Dimensionality reduction offers unique insights into gut bacterial dynamics by leveraging collective abundance fluctuations of multiple bacteria across multiple subjects driven by similar underlying ecological factors. However, methods providing lower-dimensional representations of gut microbial dynamics both at the community and individual taxa level are currently missing. To that end, we develop EMBED: Essential Microbiome Dynamics. Similar to normal modes in structural biology, EMBED infers ecological normal modes (ECNs), which represent the unique set of orthogonal dynamical trajectories capturing the collective behavior of microbial communities across subjects. We show that a small number of ECNs accurately describe gut microbiome dynamics across multiple data sets. Importantly, we find that ECNs reflect specific ecological behaviors, providing natural templates along which the dynamics of individual bacteria may be partitioned. Moreover, the multi-subject treatment in EMBED systematically identifies subject-specific and universal dynamical processes. Collectively, our results highlight the utility of dimensionality reduction approaches to understanding the dynamics of the gut microbiome and provide a framework to study the dynamics of other high-dimensional systems as well.
Introduction
Deciphering the temporal dynamics of the human gut microbiome is essential to understanding its role in human health and disease. Advances in sequencing technologies have enabled the characterization of these complex ecosystems at unprecedented scale and resolution1,2. In contrast to static snapshots across large populations, high-resolution longitudinal studies offer unique insights into the biological processes structuring communities within individual hosts. For example, recent longitudinal studies have elucidated the determinants of the gut microbiome in early childhood3,4, the effects of the gut microbiome on outcomes following bone-marrow transplant5, and the recolonization of gut microbial communities following antibiotic perturbation6–10.
A significant challenge in understanding gut microbiome dynamics is its enormous organizational complexity, comprising thousands of individual bacterial species whose abundances vary substantially across space, time, and host ecosystems11–15. Systems biology approaches are now beginning to reveal broad-scale insights into the temporal behavior of the gut microbiome, including its defining features of long-term stability and resilience to perturbations16–20. More recently, methods have also been developed to address the significant technical challenges of inferring true relative abundances of bacteria from large-scale sequencing data21–23. Collectively, these studies have suggested that abundances of individual bacterial species do not fluctuate independently, but rather as a collective community with coordinated responses to factors such as host diet24,25, medications10,26, and environmental exposures12.
The correlated nature of bacterial abundance dynamics suggests that dimensionality reduction may offer unique insights by distilling the behavior of large communities into a handful of variables. Indeed, dimensionality reduction techniques are widely utilized in sequencing-based studies27. Popular approaches based on multidimensional scaling, such as principal coordinate analysis, have been seminal to understanding the organizing principles of the human microbiome28–30. Other non-probabilistic approaches based on log-transformations do not account for zero abundances and technical sampling noise and could potentially lead to inaccurate reconstructions31,32. Crucially, while these approaches may be useful in identifying broad shifts in the overall microbiome community, they lack information on the dynamics of individual bacterial taxa.
To that end, we have developed EMBED: Essential Microbiome Dynamics, a probabilistic reduced-dimensional descriptor of gut microbiome dynamics that identifies the common dynamical templates of bacterial communities across multiple subjects exposed to the same perturbation. In EMBED, we model bacterial abundances using the exponential Gibbs-Boltzmann distribution33 with unknown extensive and intensive variables that are learned directly from data (Fig. 1A). The Gibbs-Boltzmann distribution has its origins in statistical physics and can be thought of as a latent space embedding model with a softmax non-linearity. The result is a set of unique and orthogonal trajectories, which we refer to as Ecological Normal Modes (ECNs), that capture the collective temporal behavior of bacterial communities across multiple subjects. Moreover, our framework provides a set of “loadings” that represent the contribution of each identified ECN to the dynamical profiles of individual bacterial taxa in individual subject-specific ecosystems. Thus, similar to how the principal components in principal component analysis (PCA) represent a lower-dimensional basis to reconstruct community abundance profiles, ECNs represent a set of basis functions to reconstruct the temporal variation in the abundances of individual bacterial taxa. In addition to providing an ecologically motivated description of bacterial dynamics, our approach has several salient features that are particularly well-suited for sequencing studies of the gut microbiome. First, EMBED utilizes the exponential Gibbs-Boltzmann distribution, which captures the extensive variability of the species abundances in the gut33. Second, by restricting the number of specified ECNs to be low, EMBED naturally provides a reduced-dimensional description of the community, thereby filtering out potentially unimportant signal in the data13. Third, ECNs are inferred using a fully probabilistic method that further accounts for sequencing noise inherent in all microbiome studies13. Fourth, similar to the normal modes in biomolecular dynamics34, ECNs are the unique and orthonormal dynamical modes that represent statistically independent collective abundance fluctuations. Fifth, by treating individual subjects separately, EMBED systematically identifies universal and subject-specific dynamical behaviors and the bacterial taxa that exhibit those behaviors.
We used EMBED to study several publicly available, high-resolution longitudinal data sets that encompass major ecological perturbations such as dietary changes and antibiotic administration10–12,25. EMBED accurately captured the dynamics in these communities with only a handful of ECNs, demonstrating the highly correlated nature of bacterial abundance dynamics and the efficacy of EMBED as a dimensionality reduction method. The identified ECNs reflected specific ecological behaviors, providing natural templates to reconstruct the dynamics of individual bacterial taxa. Indeed, we found major groups of bacteria that are partitioned according to their relative contributions along each of the identified ECNs which further indicates that the identified ECNs represent a collection of distinct ecological behaviors observed in the community. Additionally, subject-specific analyses identified universal and subject-specific dynamics and taxa exhibiting those dynamics. Collectively, our study provides an ecologically motivated dimensionality reduction framework to better understand dynamics in the gut microbiome.
Results
EMBED identifies reduced-dimensional descriptors for longitudinal microbiome dynamics
We sketch the mathematical foundation of identifying ecological normal modes using EMBED (Fig. 1A). A detailed derivation is found in the Supplementary Information. Briefly, we consider that microbial abundances n_os(t) are quantified across several taxa “o”, subjects “s”, and time points “t”. We model the data n_os(t) as arising from a multinomial distribution (Eq. 1), where N_s(t) = ∑_o n_os(t) is the total read count on a given day t in the microbiome sample of subject s. The probabilities q_os(t) are modeled as a Gibbs-Boltzmann distribution33 (Eq. 2). In Eq. 2, z_k(t) are time-specific latents that are shared by all OTUs and subjects, and θ_kos are OTU- and subject-specific loadings that are shared across all time points. The number of latents/loadings K is chosen such that K ≪ O, T, where O is the number of OTUs and T the number of time points, thereby achieving a lower-dimensional description of the data. These parameters can be simultaneously estimated using log-likelihood maximization.
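The multinomial likelihood and the Gibbs-Boltzmann parameterization described in this paragraph can be written out explicitly. The multinomial form follows directly from the definitions of n_os(t), N_s(t), and q_os(t). The second expression is a sketch under two assumptions that the text here does not fix: the conventional negative sign in the exponent, and the symbol Ω_s(t), introduced only as a label for the normalizing constant.

```latex
P\big(\{n_{os}(t)\}_{o}\big)
  = \frac{N_s(t)!}{\prod_{o} n_{os}(t)!}\;\prod_{o} q_{os}(t)^{\,n_{os}(t)}
  \tag{1}

q_{os}(t)
  = \frac{1}{\Omega_s(t)}\exp\!\Big(-\sum_{k=1}^{K}\theta_{kos}\,z_k(t)\Big),
\qquad
\Omega_s(t) = \sum_{o}\exp\!\Big(-\sum_{k=1}^{K}\theta_{kos}\,z_k(t)\Big)
  \tag{2}
```

With this parameterization, q_os(t) is a softmax over OTUs of the K-term inner product between latents and loadings, which is the latent space embedding with a softmax non-linearity mentioned in the Introduction.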
The long-term stability of the gut microbiome is now well-established14,15,18. Therefore, we model the dynamics of the latents as return-to-normal fluctuations around a fixed steady state (Eq. 3). In Eq. 3, the matrix A is assumed to be symmetric and the noise ε to be Gaussian distributed and uncorrelated. To identify ecological normal modes (ECNs) y_k(t) whose dynamics are statistically independent of each other, we diagonalize the interaction matrix, A = vT Λ v. Here, v is the orthogonal matrix of eigenvectors and Λ is the diagonal matrix of its eigenvalues. The dynamics then decouple along the eigenvectors of A, with the ECNs y(t) = v z(t) serving as a redefined set of latents, u′ = v u, and ε′ = v ε. We redefine the corresponding loadings Φ = vT θ. Since v vT = vT v = I, this simultaneous transformation does not change the model predictions33. Moreover, the redefined noise ε′ is Gaussian distributed and uncorrelated as well. Notably, if we start with orthonormal sets of latents z_k(t), the ECNs are also orthonormal. As we show in the Supplementary Information, the ECNs are uniquely defined for a given longitudinal data set. The actual dynamics of the latents are likely to be more complex than the linear model invoked here. Yet, similar to normal mode analysis in biomolecular dynamics34, ECNs represent a re-orientation of the latent variable space that uncovers the unique and orthogonal templates of microbial abundance fluctuations.
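The linear model and its rotation into normal modes can be written out as below. This is a sketch rather than the original display equations: the discrete one-step form is an assumption, with u read as the constant offset that sets the steady state. The last line uses a column-vector layout for the loadings of a single OTU-subject pair, which is also an assumption.

```latex
z(t+1) = A\,z(t) + u + \varepsilon
  \tag{3}

A = v^{T}\Lambda v,\quad y(t) = v\,z(t)
\;\;\Longrightarrow\;\;
y(t+1) = v\,A\,z(t) + v\,u + v\,\varepsilon
       = \Lambda\,y(t) + u' + \varepsilon'

\theta_{os}^{T}\,z(t)
  = \theta_{os}^{T}\,v^{T}v\,z(t)
  = \big(v\,\theta_{os}\big)^{T}\,y(t)
```

Because Λ is diagonal, each component y_k(t) evolves independently of the others. The last line shows why the model predictions are unchanged: the latents and loadings are rotated by the same orthogonal matrix, so their inner product is preserved. Whether the rotated loadings are written with v or with its transpose depends on how the loadings are arranged.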
EMBED accurately reconstructs microbiome abundance time series using a few ecological normal modes
We first highlight the intuition of EMBED with simple illustrative in silico examples (see Supplementary Information for details). The first community comprised OTUs whose abundances oscillated at a single frequency but with one of two phases. The second community comprised a single set of OTUs oscillating with high frequency and another set that fluctuated as a sum of two oscillations. The third community comprised a set of OTUs whose abundances decreased exponentially, and those whose abundances oscillated with one of two different frequencies. In silico data was generated by first normalizing the abundances and then sampling read counts from a multinomial distribution (SI Fig. 1). As expected, EMBED identified a small number of ECNs that were sufficient to capture the abundance variation in all three communities (SI Fig. 2). Importantly, the identified ECNs directly corresponded to salient dynamical features of the abundance profiles (SI Fig. 3). Specifically, ECN y1(t) was relatively stable over time and the corresponding loading vector Φ1 correlated strongly with the mean OTU abundance, capturing steady-state behavior of OTUs over longer time periods (SI Fig. 4). The rest of the ECNs separately captured other major features of the underlying dynamics: out of phase oscillations (A), three different oscillation frequencies (B), and exponential decay and oscillations at different frequencies (C). Finally, the inferred ECNs were uniquely determined for each community (SI Fig. 5). While simplified, these examples show how EMBED can be used to identify any existing modes of dynamics underlying complex microbial communities.
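The data-generation step for the first in silico community can be sketched as follows. This is a minimal illustration, not the authors' simulation code: the function name, number of OTUs, oscillation period, phase values, baseline range, and sequencing depth are all assumed for the example.

```python
import numpy as np

def simulate_two_phase_community(n_otus=20, n_days=60, depth=10_000, seed=0):
    """Toy community: every OTU oscillates at one frequency, with one of two phases.

    All numerical choices here (10-day period, phases 0 and pi/2, baselines
    drawn from [1, 5]) are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n_days)
    # First half of the OTUs share phase 0, second half share phase pi/2.
    phases = np.where(np.arange(n_otus) < n_otus // 2, 0.0, np.pi / 2)
    baseline = rng.uniform(1.0, 5.0, size=n_otus)
    # The offset of 1.5 keeps every abundance strictly positive.
    abundance = baseline[:, None] * (
        1.5 + np.sin(2 * np.pi * t[None, :] / 10 + phases[:, None])
    )
    # Normalize each day to relative abundances ...
    q = abundance / abundance.sum(axis=0, keepdims=True)
    # ... then draw read counts for each day from a multinomial distribution.
    counts = np.stack(
        [rng.multinomial(depth, q[:, d]) for d in range(n_days)], axis=1
    )
    return q, counts

q, counts = simulate_two_phase_community()
```

Both returned arrays have shape (OTUs, days); every column of `counts` sums to the chosen sequencing depth, so sampling noise enters only through the multinomial draw.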
Next, using several longitudinal microbiome time series, we investigated the accuracy of EMBED-based time series reconstruction. We compared EMBED with a recently developed method by Martino et al.31 (centered log-ratio transform followed by sparse singular value decomposition, or CLR-SSVD; see Supplementary Information). This dimensionality reduction method also forms the basis of a recent multi-subject analysis32. Briefly, non-zero microbiome abundances are log-transformed using the so-called robust centered log-ratio transform (CLR)35. Sparse singular value decomposition (SSVD)36 is then performed on these non-zero abundances, using a user-specified number of components. Finally, an inverse CLR transform is performed on the SSVD-based reconstruction. We investigated the ability of CLR-SSVD and EMBED to reconstruct the same 23 abundance time series from four different studies10–12,25. In Fig. 1B, we compare the mean Kullback-Leibler divergences (averaged over the total number of days for each time series) using K = 3, 5, and 7 components for EMBED- and CLR-SSVD-based reconstructions. Notably, for each time series and each K, EMBED offered a more accurate representation of the data than CLR-SSVD (SI Table 1). EMBED-based reconstruction is also accurate for the time series of individual bacterial taxa. The taxa-specific Pearson correlation coefficient between the reconstruction and the data, averaged across taxa and data sets, was r = 0.89 ± 0.07 (for K = 7), compared to an average correlation of r = 0.71 ± 0.1 for CLR-SSVD. Collectively, these results show that EMBED identifies key ecological normal modes that can accurately represent collective abundance fluctuations in microbiome time series. Notably, a much smaller number of EMBED modes is sufficient to accurately capture the abundance dynamics compared to CLR-SSVD.
We next sought to identify underlying ecological modes of dynamics in the gut microbiome by using EMBED to reconstruct low-dimensional representations of bacterial communities subjected to various ecological perturbations.
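The reconstruction error compared in Fig. 1B is a Kullback-Leibler divergence averaged over days. A minimal sketch of that metric is shown below. It assumes the divergence is taken from the observed relative abundances to the model probabilities (the direction is not stated here) and that zero observed abundances contribute nothing to the sum; the function name is hypothetical.

```python
import numpy as np

def mean_kl_divergence(counts, q_model):
    """Average over days of KL(p_obs || q_model).

    counts  : (n_taxa, n_days) array of read counts
    q_model : (n_taxa, n_days) array of reconstructed probabilities
    """
    counts = np.asarray(counts, dtype=float)
    q_model = np.asarray(q_model, dtype=float)
    # Observed relative abundances for each day.
    p_obs = counts / counts.sum(axis=0, keepdims=True)
    # Terms with p_obs == 0 are set to zero (the 0 * log 0 = 0 convention).
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p_obs > 0, p_obs * np.log(p_obs / q_model), 0.0)
    # Sum over taxa for each day, then average over days.
    return terms.sum(axis=0).mean()

counts = np.array([[10, 0], [30, 50], [60, 50]])
uniform = np.full(counts.shape, 1.0 / 3.0)
kl_uniform = mean_kl_divergence(counts, uniform)
```

A perfect reconstruction gives zero, and any mismatch gives a positive value, so lower values in a comparison such as Fig. 1B indicate a closer fit.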
Effect of dietary oscillations on the gut microbiome
Host diet has been shown to be a major factor influencing gut bacterial dynamics in both humans and mice24,25, but in a subject-specific manner37. We therefore applied EMBED to the data collected by Carmody et al.25 to better understand bacterial abundance changes in response to highly controlled dietary perturbations. Briefly, the diets of individually housed mice were alternated every ∼3 days between a low-fat, plant-polysaccharide diet (LFPP) and a high-fat, high-sugar diet (HFHS). Daily fecal samples were collected for over a month (SI Fig. 6).
Using K = 5 ECNs, EMBED obtained a lower-dimensional time series approximation that reconstructed the original data with high accuracy (average taxa Pearson correlation coefficient r = 0.75 ± 0.18, average community Pearson correlation coefficient r = 0.98 ± 0.003) (SI Fig. 2). We investigated each of the underlying ECNs. The first ECN y1(t) represented a relatively constant abundance throughout the entire time series (Fig. 2A). Moreover, the corresponding loading vector Φ1 showed a significant correlation with the average individual OTU abundance across time (average Spearman correlation coefficient across subjects, r = −0.86 ± 0.06, SI Fig. 4), suggesting that despite large-scale, cyclic dietary changes, gut bacterial abundances in the community tended to fluctuate around a constant average abundance.
In contrast, ECNs y2(t) and y3(t) collectively captured the cyclic nature of dietary oscillations, confirming that the murine diet rapidly and reproducibly alters abundance dynamics even at the individual OTU level. To identify OTUs whose oscillatory dynamics were similar across subjects, we clustered the loadings Φ2 and Φ3 of individual OTUs on ECNs y2(t) and y3(t). We found that bacteria in the community largely clustered into three groups (Fig. 2C): those whose abundances increased with the LFPP diet (blue, group 1), and those whose abundances increased with the HFHS diet to different extents (black and magenta, groups 2 and 3). In keeping with recent studies38–40, we found that the genus Saccharicrinis, a member of the Bacteroidetes phylum, was significantly enriched in group 3 (p = 0.0015, hypergeometric test), consistent with the notion that bacteria belonging to this genus are able to degrade plant polysaccharides and utilize the metabolic byproducts present in the LFPP diet.
Unexpectedly, we found two ECNs, y4(t) and y5(t), that represented profound non-oscillatory behavior in abundance fluctuations. y4(t) represented an overall drift in abundance over the time series and y5(t) represented a U-shaped recovery. The loadings corresponding to these two modes were significantly correlated across subjects (Spearman correlation coefficient r = 0.37 ± 0.16, averaged across mice). The top 5 OTUs with the most negative and most positive loadings Φ4 (omitting OTUs that were also in the top 5 negative/positive for loadings Φ5) experienced a significant, irreversible increase and decrease, respectively, throughout the time course of the experiment (Fig. 2B, top). Thus, while the dynamics of most gut bacteria in this community exhibit rapid and reversible changes in response to dietary oscillations, there exist certain bacteria that exhibit irreversible changes over time. This concept of hysteresis has been explored previously in the gut microbiome25,41, but the underlying mechanisms likely warrant continued investigation. In contrast, the top 5 OTUs with the most negative and most positive loadings Φ5 (omitting OTUs that were also in the top 5 negative/positive for loadings Φ4) experienced an inverted U-shaped and a U-shaped abundance profile, respectively (Fig. 2B, bottom). Interestingly, the OTUs that exhibited the drifting and the U-shaped abundance profiles differed from subject to subject (SI Table 2, SI Fig. 6). This strongly suggests that these universal non-oscillatory dynamics are primarily driven by the state of the ecosystem rather than specific functions of the bacterial taxa that exhibit these behaviors. This is reminiscent of the universal dynamical behaviors recently reported by Ji et al.14 that were shared across different host organisms but were exhibited by different bacterial taxa.
EMBED systematically identifies OTUs that exhibit universal dynamics and those that exhibit subject-specific behavior. Each OTU within each subject-specific ecosystem is characterized by a five-dimensional vector of loadings corresponding to the five ECNs. OTUs whose loading vectors are similar across all subjects have similar dynamics across subjects, and vice versa for OTUs with different loading vectors. To identify these universal and subject-specific OTUs, we computed the average distance between the OTU-specific loading vectors across all pairs of subjects. This average distance correlated strongly with the average distance between the subject-specific OTU abundance trajectories as well (inset of Fig. 2D). In Fig. 2D, we plot the average abundance of the 10 OTUs with the most similar Φ loadings (bottom) and the 10 OTUs with the most dissimilar Φ loadings (top). The black lines show the OTU-averaged abundances for individual subjects and the colored bold lines (green and orange) show the average across subjects. As seen in Fig. 2D, the top 10 OTUs whose dynamics were similar across all subjects strongly preferred the HFHS diet. Notably, the genus Oscillibacter is overrepresented among these OTUs (4 out of 10 compared to 5 out of 73, hypergeometric test p = 9 × 10^−4). Interestingly, this overrepresentation was found at the genus and the family level and was not observed at higher taxonomic classifications (SI Table 3). Importantly, no other genus or family was overrepresented. This strongly suggests a specific genus-level preference for the high-fat, high-sugar diet in the genus Oscillibacter that can override subject-specific ecosystem parameters. Notably, Oscillibacter are known to prefer high-fat42 as well as high-sugar diets43. Future work is needed to further establish the mechanistic connection between Oscillibacter and HFHS diets.
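The cross-subject comparison of loading vectors described in this paragraph can be sketched as follows. The distance metric is not specified here, so the example assumes a Euclidean distance; the function name and the array layout (subjects, OTUs, modes) are also assumptions made for illustration.

```python
import numpy as np
from itertools import combinations

def mean_cross_subject_distance(phi):
    """Average pairwise distance between subjects' loading vectors, per OTU.

    phi : array of shape (n_subjects, n_otus, n_modes)
    Returns an array of length n_otus. Small values indicate OTUs whose
    loadings (and hence dynamics) are similar across subjects.
    """
    phi = np.asarray(phi, dtype=float)
    pairs = combinations(range(phi.shape[0]), 2)
    # Euclidean distance (assumed metric) for every pair of subjects.
    dists = [np.linalg.norm(phi[a] - phi[b], axis=1) for a, b in pairs]
    return np.mean(dists, axis=0)

# Toy input: 3 subjects, 2 OTUs, 2 modes. OTU 0 is identical across
# subjects; OTU 1 differs in one subject.
phi = np.zeros((3, 2, 2))
phi[1, 1] = [3.0, 4.0]
d = mean_cross_subject_distance(phi)
ranking = np.argsort(d)  # most universal OTUs first
```

Sorting the OTUs by this value and taking the two ends of the ranking gives the kind of "most similar" and "most dissimilar" sets plotted in Fig. 2D.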
ECNs identify modes of recovery of bacteria under antibiotic action
Broad-spectrum oral antibiotics have significant effects on the gut flora both during and after administration. Specifically, microbiome abundance dynamics following antibiotic administration can potentially exhibit a combination of several typical behaviors which may reflect different survival strategies7,9,10,16,44. These include quick recovery following removal of antibiotic, slow but partial recovery, and one-time changes followed by resilience to repeat antibiotic treatment. The temporal variation in abundances of any bacteria could be a combination of these typical behaviors. Moreover, given that the gut ecosystems differ across different hosts, the response of specific bacteria to the same antibiotic treatment could vary from host to host16. To better parse the major modes of gut bacterial dynamics associated with antibiotic administration, we analyzed the data collected by Ng et al.10. Briefly, several mice were given the antibiotic ciprofloxacin in two regimens (day 1-4 and day 14-18) and fecal microbiome samples were collected daily over a period of 30 days (SI Fig. 7).
We found that a small number of ECNs (K = 4) was sufficient to capture the data with high accuracy (average taxa Pearson correlation coefficient r = 0.80 ± 0.2, average community Pearson correlation coefficient r = 0.98 ± 0.01) (SI Fig. 2). As shown in Fig. 3A and consistent with the previous analysis, we found that ECN y1(t) was relatively stable throughout the study and the corresponding loading vector Φ1 was strongly correlated with the mean OTU abundance over time (Spearman correlation coefficient r = −0.57 ± 0.07) (SI Fig. 4). This suggests that on average, even after several large-scale perturbations, there exists a characteristic range of abundances beyond which individual OTUs tend not to deviate, at least on the time scale considered. Interestingly, we found that the remaining ECNs followed broad classes of behaviors in response to periods of stress. ECN y2(t) appeared to represent an inelastic one-time change followed by a relatively stable response. ECN y3(t) represented the opposite: it responded to the second antibiotic treatment but not the first. In contrast, ECN y4(t) represented elastic changes in the microbiome, potentially representing abundances that reproducibly decreased (or increased) under the action of the antibiotic but quickly bounced back to pre-antibiotic levels when it was withdrawn.
These salient dynamical features were captured when we clustered the OTUs using the loadings Φ2–Φ4, which identified seven major groups of OTUs with distinct dynamical behaviors (Fig. 3B, C). Interestingly, while some of the groups simply reflected behaviors of individual ECNs, others could be understood according to their relative contributions across multiple ECNs. For example, the behavior of OTUs in groups 1 and 3 aligned with ECN y2(t), albeit with opposing trends. Group 1 OTUs flourished during the first antibiotic treatment but the second treatment did not elicit a similar response. In contrast, OTUs in group 3 diminished in abundance after the first antibiotic treatment but were resistant to subsequent antibiotic action.
OTUs in groups 2, 5, 6, and 7 displayed highly elastic dynamics in response to both periods of antibiotic administration. Group 2 OTUs, among which the genus Akkermansia was overrepresented (both Akkermansia OTUs among the 41 OTUs are in group 2, hypergeometric test p = 0.026), flourished during the antibiotic treatment but decreased in abundance in a reversible manner when antibiotics were withdrawn. OTUs in groups 5, 6, and 7, in contrast, diminished in abundance in the presence of antibiotics in a reversible manner. The genus Blautia was overrepresented in group 6 (3 out of 6 compared to 5 out of 41, hypergeometric test p = 0.017), while the genus Aestuariispira was overrepresented in group 7 (both Aestuariispira OTUs among the 41 OTUs are in group 7, hypergeometric test p = 0.0073). Finally, group 4 comprised OTUs that were exquisitely sensitive to the initial antibiotic administration and whose abundance did not make any meaningful recovery. The genus Coprobacter was overrepresented among these OTUs (2 out of 5 compared to 3 out of 41, hypergeometric test p = 0.035).
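The enrichment p-values quoted with explicit counts can be checked with a short calculation. The sketch below assumes the reported test is the one-sided (upper-tail) hypergeometric probability of seeing at least the observed number of OTUs from a genus in a group; the function and argument names are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(k, group_size, genus_total, n_otus):
    """Upper-tail hypergeometric probability P(X >= k).

    k           : OTUs from the genus observed in the group
    group_size  : number of OTUs in the group
    genus_total : OTUs from the genus in the whole community
    n_otus      : total number of OTUs in the community
    """
    upper = min(group_size, genus_total)
    # Count the ways to place i genus members and (group_size - i) others.
    hits = sum(
        comb(genus_total, i) * comb(n_otus - genus_total, group_size - i)
        for i in range(k, upper + 1)
    )
    return hits / comb(n_otus, group_size)

p_blautia = hypergeom_enrichment_p(3, 6, 5, 41)         # 3 of 6 vs 5 of 41
p_coprobacter = hypergeom_enrichment_p(2, 5, 3, 41)     # 2 of 5 vs 3 of 41
p_oscillibacter = hypergeom_enrichment_p(4, 10, 5, 73)  # 4 of 10 vs 5 of 73
```

Under this assumption the three values come out at roughly 0.017, 0.035, and 9 × 10^−4, in line with the figures reported for Blautia, Coprobacter, and Oscillibacter.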
Notably, OTUs in groups 5 and 7 exhibited significant subject-to-subject variability as quantified by both the average subject-to-subject variability in OTU-specific Φ loadings (Fig. 3D) and the subject-to-subject variability in OTU-specific abundance trajectories (SI Fig. 7). While these OTUs exhibited qualitative dynamics of recovery across all subjects (SI Fig. 7), the time course and the extent of recovery varied from subject-to-subject.
Discussion
Bacteria in host-associated microbiomes live in complex ecological communities governed by competitive and cooperative interactions, and a constantly changing environment. Extensive spatial and temporal variability is a hallmark of these communities. Recent systems biology approaches have made progress in distilling some of this complexity by utilizing generalized quantitative frameworks. For example, simple and universal statistical features have recently been discovered in these communities14,15. Dimensionality reduction offers an alternative approach by leveraging the correlated nature of bacterial abundance fluctuations in the community, but its use towards understanding microbiome dynamics has thus far been limited.
To address this issue, we developed EMBED, essential microbiome dynamics. EMBED is a novel dimensionality reduction approach specifically tailored to identify the underlying ecological normal modes in the dynamics of bacterial communities that are shared across subjects undergoing identical environmental perturbations. These ECNs can be viewed as dynamical templates along which the trajectories of individual bacteria within individual host ecosystems can be decomposed. Identified ECNs shed light on the underlying structure of bacterial community dynamics. By applying EMBED to several time series data sets representing major ecological perturbations, we identified immediate and reversible changes to the gut community in response to these stimuli. However, EMBED also identified more subtle, longer-term, and perhaps irreversible changes to specific members of the community, the mechanisms and consequences of which would be interesting to pursue further. For example, EMBED identified genus-level associations with specific dynamical behaviors under diet oscillations that were not observed at higher taxonomic levels, potentially implicating specific functional properties of the genus.
One key parameter in EMBED is the number of components. A high number of components will necessarily fit the data better, potentially fitting to the technical noise. How do we decide the appropriate number of components? Importantly, EMBED is a probabilistic model and potentially information theoretic criteria45,46 could be used to identify the correct number of components. These criteria seek a balance between increase in number of parameters and the accuracy of fit to data (likelihood). We note that the total likelihood of the data is linearly proportional to the sequencing depth. However, the reported sequencing depth is typically over-inflated compared to the true nucleotide capture probability of the experiments47 leading to an inflated estimate of the total likelihood. One approach to solve this is to obtain technical repeats which can in turn allow us to estimate the true technical noise13,47.
While EMBED was specifically developed to study microbiomes, it reflects a more generalizable framework that can easily be applied to other types of longitudinal sequencing data as well. We therefore expect that EMBED will be a significant tool in the analysis of dynamics of high dimensional sequencing data beyond the microbiome.
Footnotes
We extended the analysis to subject-specific variability in the microbiome