Looking at the Full Picture: Utilizing Topic Modeling to Determine Disease-Associated Microbiome Communities

The microbiome is a complex micro-ecosystem that provides the host with pathogen defense, food metabolism, and other vital processes. Alterations of the microbiome (dysbiosis) have been linked with a number of diseases, such as cancers, multiple sclerosis (MS), and Alzheimer’s disease. Generally, differential abundance testing between healthy and patient groups is performed to identify important bacteria (those enriched or depleted in one group). However, providing a single bacterial species to an individual lacking that species has not improved health as successfully as fecal matter transplant (FMT) therapy, which transfers the entire gut microbiome of a healthy individual (or a mixture from several healthy individuals) to an individual with a disease. FMTs do, however, have limited success, possibly because not all bacteria in the community may be responsible for the healthy phenotype. Therefore, it is important to identify the communities of microorganisms linked to the healthy as well as the disease state of the host. Here we applied topic modeling, a natural language processing tool, to assess latent interactions occurring among microbes, thus providing a representation of the community of bacteria relevant to the healthy vs. disease state. Specifically, we utilized our previously published data on the gut microbiome of patients with relapsing-remitting MS (RRMS), a neurodegenerative autoimmune disease that has been linked to a variety of factors, including a dysbiotic gut microbiome. With topic modeling we identified communities of bacteria associated with RRMS, including genera previously discovered but also other taxa that would have been overlooked with differential abundance testing alone.
Our work shows that topic modeling can be a useful tool for analyzing the microbiome in dysbiosis and that it could be considered along with the commonly utilized differential abundance tests to better understand the role of the gut microbiome in health and disease.


Trillions of bacteria (the microbiome) living in and on the human body play an important role in keeping us healthy, and an alteration in their composition has been linked to multiple diseases such as cancers, multiple sclerosis (MS), and Alzheimer's. Identifying specific bacteria for targeted therapies is crucial; however, studying individual bacteria fails to capture their interactions within the microbial community. The relative success of fecal matter transplants (FMTs) from healthy individual(s) to patients and the failure of individual bacterial therapy suggest the importance of the microbiome community in health. Therefore, there is a need to develop tools to identify the communities of microbes making up the healthy and disease state microbiome. Here we applied topic modeling, a natural language processing tool, to identify microbial communities associated with relapsing-remitting MS (RRMS). Specifically, we show the advantage of topic modeling in identifying the bacterial community structure of RRMS patients, which includes previously reported bacteria linked to RRMS but also otherwise overlooked bacteria. These results reveal that integrating topic modeling with traditional approaches improves the understanding of the microbiome in RRMS and that it could be employed with other diseases that are known to have an altered microbiome.

The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license. This version posted July 25, 2023.

The microbiome is the collection of microorganisms that live in and on our body. Although the microbiome includes bacteria, viruses, fungi, and phages, the majority of microbiome studies have focused on bacteria. With regard to the bacterial microbiome, it has been established that there is a community structure in which a number of different species from various bacterial phyla live together. Their composition is regulated by various host- and microbe-specific factors, and in a steady state they help to maintain homeostasis, keeping the host healthy. However, alteration in the composition of the microbiome (dysbiosis) has been linked to a number of diseases, including cancers, multiple sclerosis, Parkinson's disease, Alzheimer's disease, inflammatory bowel disease (IBD), and others [1][2][3][4]. In the majority of microbiome studies, the relative abundance of each individual microbe is compared one at a time between people with a particular disease and healthy controls. This type of analysis has provided several major findings on overly enriched or overly depleted microbes that are linked to disease, for example, Fusobacterium nucleatum with colorectal cancer [5] and Clostridium difficile with IBD [6]. These findings are helpful in each respective area of research; however, providing a singular species of bacteria to an individual lacking that species for health improvement has not been as successful as fecal matter transplant (FMT) therapy. With an FMT, where the entire microbiome is provided, the recipient can see improvement of disease [7][8][9]. This reveals that the community of microorganisms is important to our health and that we should consider the structure of the community to better prevent, diagnose, and treat disease. FMTs do, however, have limited success, possibly due to concerns that not all bacteria in the community may be responsible for the healthy phenotype. Therefore, there is a need for a method to identify communities within the healthy community.

In this work we aim to show the benefits of using the natural language processing (NLP) tool, topic modeling, to assess the community structure associated with diseases. Topic modeling is an unsupervised machine learning approach that assesses all the terms (bacteria) within documents (samples) and groups them into topics (communities) based on term similarities and patterns.
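To make the document/term analogy concrete, the following minimal Python sketch arranges per-sample genus counts as a document-term matrix, the input a topic model expects. It is an illustration only and not the pipeline used in this work: the function name build_document_term_matrix, the sample names, and the read counts are all hypothetical.

```python
def build_document_term_matrix(sample_counts):
    """Return (terms, matrix): sorted genus names and one count row per sample.

    Each sample plays the role of a "document" and each genus the role of
    a "term"; a cell holds the number of reads assigned to that genus.
    """
    terms = sorted({genus for counts in sample_counts.values() for genus in counts})
    matrix = {
        sample: [counts.get(genus, 0) for genus in terms]
        for sample, counts in sample_counts.items()
    }
    return terms, matrix


# Hypothetical toy read counts; not taken from the Chen or Yadav datasets.
toy_counts = {
    "sample_1": {"Bacteroides": 120, "Blautia": 30},
    "sample_2": {"Bacteroides": 15, "Faecalibacterium": 80},
}
terms, dtm = build_document_term_matrix(toy_counts)
for sample, row in dtm.items():
    print(sample, dict(zip(terms, row)))
```

A topic model fitted to such a matrix would then describe each sample as a mixture of topics and each topic as a distribution over genera.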
bioRxiv preprint doi: https://doi.org/10.1101/2023.07.21.549984

Data For Analysis
Our primary analysis was performed on the data from Chen et al. 2016 [11] and validated with the data from Yadav et al. 2022 [3]. For simplicity, each dataset is referred to by the first author's last name (e.g., Chen and Yadav). After data processing, we retained 175 genera in the Chen dataset and 160 in the Yadav validation dataset.

Number and Similarity of Validated Community Types
Our cosine similarity analysis revealed 34 community-type associations with high similarity (cosine > 0.80) between the Chen and Yadav topics (Fig. 1), highlighting similarities in the community types across datasets. These associations comprised 13 topics from the exploratory dataset and nine from the validation dataset that were also associated with RRMS versus healthy control (HC) status. In the Chen dataset, 10 of the 30 topics were enriched in samples obtained from the RRMS patients compared to controls (Fig. 2). In the Yadav dataset, four of the 30 topics were associated with RRMS versus HC status, with three enriched in RRMS and one enriched in HC samples. The plots for all statistically significant topic associations can be found in Supplementary Figure 2.

Specifically, of the significant community types found in the Chen data, five were validated based on having high cosine values (> 0.80) to topics derived independently from the Yadav validation data. All these topics were significantly (p ≤ 0.05 and q ≤ 0.25) enriched in RRMS patients compared to HC. In detail, Chen Topic 4, Chen Topic 6, and Chen Topic 23 were similar to Yadav Topic 8 (cosine = 0.92, 0.86, 0.81, respectively). We will refer to this validated community as Community Type A (Fig. 3.a). Chen Topic 5 and Chen Topic 10 were similar to Yadav Topic 9 (cosine = 0.84, 0.85, respectively). We will refer to this validated community as Community Type B (Fig. 3.b). For an overview of the topics making up these Community Types see [...]

We next examined the genera with high probabilities of assignment to these community types. In Community Type A, Bacteroides was the most often assigned genus, followed by Blautia; both were increased in RRMS compared to HC. Many other genera were also assigned to this community and were higher in RRMS than HC, including Streptococcus, Eggerthella, Faecalitalea, and Lachnoclostridium.

Several genera assigned to Community Type A varied in directional abundance between datasets. For example, Ruminococcus and Agathobacter were higher in the RRMS group in the Chen dataset and lower in the RRMS group in the Yadav dataset. Inversely, Erysipelatoclostridium was lower in RRMS in Chen and higher in RRMS in Yadav.

In Community [...]

We assessed the assigned genera's differential abundance within the statistically significant community types. Community Type A comprised many enriched genera; however, only Eggerthella and Blautia were significantly increased in RRMS compared to HC in both the Chen and Yadav datasets. None of the depleted genera differed significantly between RRMS and HC in both datasets. In Community Type B several genera were identified as differing in abundance, but the only shared significant finding between datasets was the increase in Blautia in RRMS.

Discussion
We hypothesized that communities of microbes might be associated with RRMS patients when compared to healthy controls and that microbes not identified by one-at-a-time differential abundance testing approaches would be important to these dysbiotic community types.
As such, we utilized our previously published data and performed topic modeling on this dataset to look for community types associated with RRMS. Out of 30 topics assessed, we identified 10 that were more often associated with RRMS when compared to HC, and we validated these findings utilizing a separate dataset.

Several themes were found in Community Type A and Community Type B, suggesting similar dysbiotic communities associated with RRMS. We found that Bacteroides was one of the most often assigned genera. This genus was higher in RRMS than HC in Chen, and validated in Yadav, but did not reach statistical significance at this sample size in either dataset. This finding highlights the possibility that differences in clusters of microbes might be more important than differences in specific microbes in the dysbiotic gut microbiome communities seen in MS patients. Blautia also had a high assignment [...] cross the blood-brain-barrier (BBB) [28,29], which is of interest in MS research, as the gut-brain axis is often implicated in the pathobiology of this disease. One hypothesized mechanism of action is that [...] permeability. Of note, Faecalitalea was assigned to the RRMS community types and was increased in RRMS patients compared to HC. This genus is thought to be beneficial as it can ferment many sugars and its major end product is also butyric acid [35]. Butyric acid is considered to support the integrity of the gut [36].

Collectively, our findings indicate that the complex dysbiotic microbiota in RRMS patients can be characterized by a diverse community of bacteria, specifically comprising a reduction in beneficial symbiont bacteria, an increase in potentially harmful pathogenic bacteria, and an overall shift of certain commensal bacteria towards a pathobiont phenotype. As a number of bacteria in these communities do not reach statistical significance on their own, our findings highlight that the collective impact of these bacteria is greater than their individual effect. Thus, a healthy or disease phenotype outcome can be attributed to the balance between symbionts and pathobionts shifting towards pathobionts. It seems there are certain keystone symbiont species, such as Faecalibacterium, which are mostly associated with a healthy phenotype, likely due to their inability to adjust to environmental changes lacking nutritional sources such as dietary fibers [37]. However, other commensal gut bacteria, such as Bacteroides, Blautia, and Eggerthella spp., can be more flexible due to their adaptability to thrive in diverse conditions and utilize a wide range of substrates as a food source. They can efficiently switch their metabolic pathways and enzymatic activities to utilize different nutrients, ensuring their survival and maintenance in the ever-changing gut ecosystem [38,39]. Based on our data, we propose a potential mechanism through which healthy gut microbiota can be converted to a dysbiotic phenotype.
This can be explained through a term used in social sciences, "bottom-up influencer", where peripheral pressure leads to changes in a central authority [40]. In a steady state, the gut microbiota is dominated by keystone species such as Faecalibacterium, which has been shown to be highly abundant in human populations. These keystone species regulate the properties of other peripheral members of this community (e.g., Blautia, Dorea, and Eggerthella) to sustain a healthy state by producing beneficial metabolites required for maintaining an intact gut barrier and inducing anti-inflammatory responses. However, in certain scenarios, such as infection or dietary or environmental changes, peripheral members can adjust to the changing environment better than the keystone species, resulting in their higher abundance due to the reduction/depletion of keystone species. A higher abundance of certain commensal bacteria in a disease state suggests these community members might acquire an inflammatory potential in the absence of keystone species. As there is considerable heterogeneity in human populations, certain individuals might be more prone to these bottom-up influencers, thus being more likely to have a dysbiotic phenotype and, further, a predisposition to diseases such as MS. It is possible that some of these bacteria start metabolizing or degrading host-derived nutrients such as mucins or synthesize immunostimulatory LPS, as shown by us recently [41].

[...] patients. In addition, with the use of topic modeling, we observed associations for community structures related to RRMS that cannot be identified with differential abundance testing. These findings should be further validated with more datasets and diverse cohorts but highlight the potential of topic modeling in microbiome research. In the future, we hope that topic modeling will be incorporated with traditional statistical approaches for microbiome analysis and help provide a better picture of the microbiome as a whole in complex diseases such as RRMS.

Clinical and Sequence Data
The sequence data were downloaded utilizing the SRA toolkit and denoised with DADA2 [42] using the default parameters, with trimming of the forward and reverse reads at 240 and 180 nt, respectively, for Chen, and at 290 and 275 nt, respectively, for Yadav. The taxonomy was assigned using the assignTaxonomy function and the Silva database (Version 138). Low prevalence features (relative abundance < 1e-5) were removed. Post-filtering, the reads were aggregated at the genus level for analysis.

The method = "VEM" option was selected to perform variational inference when deriving the latent topics. This was performed on each dataset and an average of the ideal topic numbers was selected. A total of 30 topics was chosen based on this approach (Supplementary Figure 1).
To derive the final set of topics we used the LDA function from the topicmodels [47] package to perform latent Dirichlet allocation on each dataset. This model was chosen as it allows for fractional membership, that is, assignment to multiple topics, when assigning reads to the underlying community types. We then extracted the beta and gamma probability matrices from our topic model using the tidytext package [48] and multiplied the per-document-per-topic probabilities by the read count for each sample to assign reads to each topic (i.e., generate a document-term matrix). A new phyloseq object was then built with the document-term matrix serving as the abundance table.
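The read-assignment step can be illustrated with a short Python sketch. It is not the R code used for the analysis: it assumes that gamma holds one row of per-topic probabilities per sample (each row summing to 1) and that the products are left unrounded, neither of which is specified above. The function name assign_reads_to_topics and all values are hypothetical.

```python
def assign_reads_to_topics(gamma, read_counts):
    """Scale each sample's topic probabilities by that sample's total reads.

    gamma:       {sample: [P(topic 1 | sample), P(topic 2 | sample), ...]}
    read_counts: {sample: total read count for the sample}
    Returns a sample-by-topic table of (fractional) read counts.
    """
    return {
        sample: [probability * read_counts[sample] for probability in probabilities]
        for sample, probabilities in gamma.items()
    }


# Hypothetical values for two samples and three topics.
gamma = {"sample_1": [0.7, 0.2, 0.1], "sample_2": [0.1, 0.1, 0.8]}
read_counts = {"sample_1": 1000, "sample_2": 500}

topic_counts = assign_reads_to_topics(gamma, read_counts)
for sample, row in topic_counts.items():
    print(sample, [round(value, 1) for value in row])
```

Because each row of gamma sums to 1, the reads assigned across topics add back up to the sample's total, so the resulting table can stand in for an abundance table with topics in place of taxa.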

Validation of Topics Found in Original Dataset
We extracted the topic-term-probability matrix from the Chen and Yadav LDA models and assessed the similarity in the community structure (topics) between our exploratory and validation data by calculating the cosine similarity between each Chen topic and each Yadav topic. The communities that had a cosine similarity of 0.80 or higher were considered to reflect similar community types identified independently in each dataset, thus validating the findings from the Chen data.
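The comparison can be sketched in Python as follows. This is an illustration rather than the authors' R script: it assumes each topic is represented as a mapping from genus to probability and that a genus absent from one dataset contributes a probability of zero, which is one possible way to align the two vocabularies and is not stated above. The topic names and probabilities are hypothetical.

```python
import math


def cosine_similarity(topic_a, topic_b):
    """Cosine similarity between two topics given as {genus: probability}."""
    genera = set(topic_a) | set(topic_b)
    # Genera missing from one topic are treated as probability 0 (assumption).
    dot = sum(topic_a.get(g, 0.0) * topic_b.get(g, 0.0) for g in genera)
    norm_a = math.sqrt(sum(p * p for p in topic_a.values()))
    norm_b = math.sqrt(sum(p * p for p in topic_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_topic_pairs(topics_x, topics_y, threshold=0.80):
    """Return (name_x, name_y, similarity) for every pair at or above threshold."""
    pairs = []
    for name_x, topic_x in topics_x.items():
        for name_y, topic_y in topics_y.items():
            similarity = cosine_similarity(topic_x, topic_y)
            if similarity >= threshold:
                pairs.append((name_x, name_y, similarity))
    return pairs


# Hypothetical topic-term probabilities, not values from the fitted models.
exploratory = {
    "explore_topic_1": {"Bacteroides": 0.6, "Blautia": 0.3, "Dorea": 0.1},
    "explore_topic_2": {"Faecalibacterium": 0.9, "Blautia": 0.1},
}
validation = {
    "valid_topic_1": {"Bacteroides": 0.5, "Blautia": 0.4, "Streptococcus": 0.1},
}

pairs = similar_topic_pairs(exploratory, validation)
for name_x, name_y, similarity in pairs:
    print(name_x, name_y, round(similarity, 2))
```

In this toy example only the first exploratory topic clears the 0.80 cutoff, because it shares its dominant genera with the validation topic; the second topic overlaps only through a low-probability genus.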

Differential Abundance of Topics and Bacteria within Topics
To assess differences in the relative abundance of each community type between samples collected from RRMS patients and HC, we performed a differential abundance analysis using the LinDA (linear models for differential abundance analysis) function from the MicrobiomeStat [49] package with feature.dat.type = "count", and with is.winsor = F for the community type comparison and is.winsor = T for the genera comparison. The Benjamini-Hochberg false discovery rate (FDR) correction was applied to account for multiple testing. LinDA was also used to test for differences in the genus-level relative abundances. Community types (topics) and bacteria with p ≤ 0.05 and a q-value ≤ 0.25 were considered differentially abundant.

Data Availability
The sequence data used for analysis can be found at the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under the BioProject numbers PRJNA335855 and PRJNA732670. The following data and scripts are available on GitHub (https://github.com/raeshrode/TheFullPicture_Article): (1) R script to analyze the Chen data, the Yadav data, and cosine similarity; (2) R environments after topic model analysis of the Chen and Yadav datasets; (3) abundance tables, metadata, and taxa tables for each dataset; and (4) the R scripts to create the network plots.

[...] product from the patent were used in the present study. All other authors declare no commercial or financial relationships that could be a potential conflict of interest.
[...] Yadav. Specifically, communities Chen Topic 5, Chen Topic 10, and Yadav Topic 9.