Meta-analysis of the human gut microbiome uncovers shared and distinct microbial signatures between diseases

Microbiome studies have revealed gut microbiota’s potential impact on complex diseases. However, many studies often focus on one disease per cohort. We developed a meta-analysis workflow for gut microbiome profiles and analyzed shotgun metagenomic data covering 11 diseases. Using interpretable machine learning and differential abundance analysis, our findings reinforce the generalization of binary classifiers for Crohn’s disease (CD) and colorectal cancer (CRC) to hold-out cohorts and highlight the key microbes driving these classifications. We identified high microbial similarity in disease pairs like CD vs ulcerative colitis (UC), CD vs CRC, Parkinson’s disease vs type 2 diabetes (T2D), and schizophrenia vs T2D. We also found strong inverse correlations in Alzheimer’s disease vs CD and UC. These findings detected by our pipeline provide valuable insights into these diseases.

observed from traditional medical observation (6).Doing so has provided a scaffold for drugrepositioning to repurpose drugs to target similar cancer types.We propose that the same approach should be taken when conducting microbiome studies.Existing literature also offers strong motivation for undertaking this study.Firstly, microbes function as a central hub for the metabolism of dietary compounds, thereby serving as a coordination point for the distribution of nutrients (7).Consequently, it is plausible that any metabolic disorder may have an associated microbiome component (8).Secondly, microbes have a strong interaction with the immune system (9), thus play a critical role in the development and modulation of the immune system.Thirdly, microbes are known to metabolize ingested drugs, and are known to contribute to efficacy (10,11).All these facts indicate that microbes can have a multifaceted impact on disorders that have not been previously linked to the gut microbiome.Consequently, it is highly possible that there are disorders that are observed to have similar microbiome patterns, despite having dissimilar disease phenotypes.Previous studies have shown that the imbalance of the bacterial community played a contributory role in the development of complex disorders, including neurological disorders, immune disorders, metabolic disorders, and gastrointestinal disorders.Recent studies have further highlighted the shared microbial signatures that contribute to these diseases, underscoring the need for more indepth studies.For instance, Prevotella copri was found to be more prevalent in both type 2 diabetes (T2D) and rheumatoid arthritis patients compared to healthy controls, possibly due to its immune-relevant role in the pathogenesis (12,13).More recently, dysregulation of the gutbrain axis (GBA) has been demonstrated to contribute to the development of several neurological disorders, such as Alzheimer's disease (AD), autism spectrum disorder (ASD) and mood disorders (14,15).We explored the intersection of microbial signatures associated with those disorders, leveraging existing datasets that delve into the gut microbiome of these conditions.Our goal is to provide a computational pipeline that can measure disease similarity based on microbiome composition.We utilize interpretable machine learning and differential abundance methods to identify both disease-specific microbes and microbes that are commonly observed across diseases.The key to comparing multiple disease cohorts was leveraging recent insights from removing batch effects within studies (15,16).While previous meta-analysis research has mainly focused on analyzing multiple diseases to understand what makes each disease unique (17,18), our pipeline represents the largest shotgun metagenomic meta-analysis conducted to measure the similarity between diseases with high resolution.We focus on diseases that are found to be associated with the imbalance of gut microbiome.We included datasets investigating 11 disorders ranging from metabolic disorders, gastrointestinal (GI) disorders, neurological disorders to cancer.Since estimating disease similarity is a necessary first step prior to drug repositioning, we provide a modest first step in highlighting the possibility of incorporating microbiome insights into the drugrepositioning pipeline.To investigate the similarity between these disorders, we focus on shotgun metagenomics.While there is a lot more 16S rRNA gene data that is available, we have opted not to include these due to the lack of species resolution and genomic insights.This allows us to obtain high level species or strain resolution while gaining insights to the potential functional roles of these microbes.We address a critical gap in understanding complex disease by examining shared microbial signatures.

RESULTS
We have developed a novel pipeline (Figure 1) that computes disease similarity at both microbial species and gene level, enabling a consistent data standard to make different studies more comparable.We compiled a large multi-study meta-analysis, with consistent processing to enable comparisons across studies that accounts for batch effects.Our findings reveal a high degree of similarity between Crohn's disease (CD) vs ulcerative colitis (UC), CD vs colorectal cancer (CRC), Parkinson's disease (PD) vs T2D, as well as schizophrenia vs T2D.Our results show that the similarity at the microbial species level was consistent with the similarity at the microbial gene level, explained by both the enrichment of pathogenic microbes and the depletion of beneficial microbes.Lastly, we found that the microbial gene profiles between AD and inflammatory bowel disease (IBD) are anticorrelated, highlighting a more pronounced metabolic distinction between these two disorders than previously suspected.

Consistent data processing and cohort selection for this meta-analysis
In this study, we applied the Snakemake pipeline to process available samples, and constructed binary disease classifiers for each study.After fitting both binary gradient boosting (GB) and random forest (RF) classifiers, we found that GB classifiers showed better overall performance across diseases.We then employed GB classifiers in subsequent studies and utilized them to exclude studies that cannot discriminate the disease phenotype based on microbial profile.The resulting dataset derived from 18 studies (19)(20)(21)(22)(23)(24)(25)(26)(27)(28)(29)(30)(31)(32)(33)(34)(35), encompasses a total of 2091 samples (Table 1).These samples are distributed across 11 countries spanning Europe, Asia, and America.The following diseases were included: four neurological disorders (AD, ASD, schizophrenia, and PD); two autoimmune disorders (multiple sclerosis (MS) and type 1 diabetes (T1D); two metabolic disorders (obesity and T2D); two GI disorders (CD and UC); and one cancerous disorder (CRC).We compared the classifier performances per cohort and per disease.Per cohort refers to build classifiers for each dataset, and per disease refers to build classifiers with the datasets for a disease combined.The results showed that the overall classification accuracy increased when it was tested per disease.Suggesting that the consolidation of datasets from diverse cohorts enhances the overall representativeness of the disease, consistent with previous findings investigating both ASD (15) and CRC (26).Our analysis revealed three diseases with with-in cohort cross-validation Area Under the receiver operating characteristic Curve (AUC) exceeding 0.95 utilizing different machine learning algorithms in certain datasets: ASD, CD, and CRC (Table 1).Notably, both CD and CRC are diseases related to GI, predominantly impacting the GI tract.Within the ASD cohort, where we observed high classification accuracy, most of the ASD patients also have GI symptoms (36).Classifier is trained on the training dataset, and its predictive accuracy is assessed on a hold-out test dataset.This is important to emulate real-world clinical environments, where there could be drift between clinical studies due to confounding demographics and experimental protocols.
Additionally, we collected a hold-out cohort evaluation using two independent datasets to assess the generalization performance of the binary classifiers on previously unseen test datasets for CD (35) and CRC (37).

Comparison of Crohn's disease and colorectal cancer: SHAP (SHapley Additive exPlanations) interpretation and differential abundance analysis
Population-based cohort studies have found that CD is a risk factor for CRC (38), we choose to compare the shared microbial signature between CD and CRC first as a sanity check for our analysis.We applied two distinct approaches to gain insights.Firstly, we used SHAP to interpret the binary disease classifiers.SHAP provides a valuable means of understanding the contribution of each feature in the classification process, offering interpretability to complex machine learning models.Secondly, we conducted a comprehensive analysis of differential abundance.This approach allows us to identify significant variations in the abundance of microbes between disease cases and healthy controls.Since these two quantities are generated from distinct methods, they provide different perspectives that are sometimes in conflict.By leveraging both pieces of information, we looked at microbes that could be strongly explained by both the Shapley values and large log-fold changes.While we did not observe an overlap in the features that contribute most to the classification for CD and CRC, we found there is a considerable overlap in the microbial species that exhibit differential abundance in CD and CRC patients.Both the binary classifiers of CD and CRC displayed robust generalization abilities when tested on previously unseen cohorts, with the AUC values of 1.00 and 0.87, respectively.In the case of CD classification, control-associated species within the Faecalibacterium genus, such as F. sp3900551435 (shapley rank 1st in CD control and rank 2nd in CD case) and F. prausnitzii (shapley rank 6th in CD control and rank 7th in CD case), exhibit high absolute shapley values in both cases and controls of the CD cohort (Supplementary Table 1).As demonstrated in prior research, F. prausnitzii can produce proteins with anti-inflammatory properties and is involved in CD pathogenesis (39).Altogether, our results implies that these control-associated microbes played a substantial role in distinguishing CD patients from controls (Figure 2a).However, in CRC, case-associated microbes had a more pronounced influence on CRC classification (Figure 2b), particularly exemplified by Fusobacterium nucleatum, Allisonella pneumosintes, and Prophyromonas asaccharolytica (Supplementary Table 2).Specifically, F. nucleatum was ranked first in terms of shapley values in both CRC case and control groups.F. nucleatum was known to be enriched in colorectal adenomas and adenocarcinomas (40) and it can create a proinflammatory microenvironment that supports the progression of colorectal neoplasia (41).It's also one of the common oral bacteria.This has been previously observed in lung cancer, where oral commensals are more abundant in the lower airway of lung-cancer patients compared to the control population (42).Recent studies found the connection between oral bacteria and gut is possibly through both the ectopic gut colonization by oral commensals and induction of migratory Th17 cells, constituting a complex interplay between the microbiome and immune system (43,44).Our results confirmed its pivotal role in distinguishing CRC patients from controls.We also identified other candidates that warrant further investigation.It has been found that individuals diagnosed with CD may face an elevated risk of developing CRC, possibly due to the chronic inflammation associated with CD (38,45).Our findings revealed the overlapped case-associated microbes between CD and CRC contributed to the similarities of these two diseases, such as Fusobacterium spp.and veillonella spp.along with the shared depletion of potential probiotic Coprococcus spp.(Figure 2c).It is worth noting that we found many Fusobacterium species worth looking into (F.animalis, F. sp000235465, F. nucleatum, F. vincentii, F. polymorphum), including one of them (F.prausnitzii) that have been validated by previous studies (39).Differential abundance analysis found 19% and 17% overlap of the caseassociated and control-associated microbes respectively between these two diseases (Figure 3a).Our findings, in agreement with previous studies, highlight the potential involvement of these shared microbial signatures in the progression of both CD and CRC.The shared microbial features we find here are crucial for us to better understand the common features between these diseases, and can help us step closer to the real therapeutic target for these complex diseases.Differential abundance analysis revealed disease pairs with high similarity at the microbial species level Among the top disease-pairs exhibiting the most significant overlap of case-associated microbes, CD vs UC had the highest co-occurrence, followed by PD vs T2D (Figure 3a).For case-associated microbes, 27% were common to both CD and UC, 23% were shared between PD and T2D.On the other hand, schizophrenia vs T2D, as well as CD vs UC, showed a substantial overlap of control-associated microbes.Specifically, 20% and 18% of the top control-associated microbes were shared in these pairs, respectively (Figure 3a).We choose to conduct a more in-depth comparison at microbial species level for those disease pairs: CD vs UC, PD vs T2D, and schizophrenia vs T2D.Based on the observation that these pairs exhibited the highest overlaps of differential abundant microbes.We identified the shared differential abundant microbes for these three disease pairs.Differential abundant microbes shared by more than two diseases have also been identified in our study (Figure 3).Among the case-associated microbes shared by CD and UC, recognizable microbes such as Pediococcus acidilactici, and known commensal species like Morganella morganii (Figure 3b, 3e), are potentially involved in gut dysbiosis.A recent study in mice has confirmed that the overabundance of P. acidilacitic may play a role in triggering IBD by producing lipopolysaccharide and exopolysaccharide byproducts (46).Cao et al. have demonstrated that M. morganii isolated from IBD patients can generate genotoxic metabolites called indolimines.These metabolites have the potential to induce DNA damage and contribute to cancer progression (47).Within the controlassociated microbes, Coprococcus eutactus, a potent probiotic that can alleviate colitis through acetate-mediated IgA response and microbiota restoration (48), is amongst the most prevalent in the control populations.Our results unveiled several other case-associated microbes that warrant further investigation, including species from the genera Enterobacter and Citrobater, among others (Figure 3b).Altogether, this highlights the strong microbial similarity between the two IBD subtypes, which is consistent with both previous microbiome studies and the clinical phenotype (23,49).
We found control-associated microbes depleted in both schizophrenia and T2D include species from Lachnospira and Haemophilus genera (Figure 3c, 3f).Microbes from the Lachnospiraceae family are known to produce butyrate, which is one of several SCFAs that has beneficial effects on cellular metabolism and intestinal homeostasis.The loss of such microbes is linked to chronic inflammation and is likely involved in metabolic diseases such as T2D (50).On the other hand, Haemophilus parainfluenzae is a common commensal that has been recognized as an opportunistic pathogen, but its specific functional role remains unclear.Studies have reported a lower abundance of H. parainfluenzae in mental disorders compared to healthy controls (51).We found the decreased abundances of these microbes contribute to the similarity between schizophrenia and T2D.Similar to schizophrenia, PD is another neurological disorder that showed high similarity with T2D, however mainly contributed by the shared increase of case-associated microbes in patients (Figure 3d).We observed that M. morganii is also differentially increased in both PD and T2D patients, along with species from the genera Acidaminococcus, Limosilactobacillus, and others.The increased abundance of Acidaminococcus intestini in disease cases has been found in seven diseases (Figure 3e), making it the most commonly shared case-associated microbe in our analysis.A recent cross-sectional study found Acidaminococcus intestini was one of the microbes that were more abundant in subjects consuming the most pro-inflammatory diets (52).We did observe that Akkermansia muciniphila was differentially increased in all these disorders (PD, schizophrenia, and T2D) as well (Figure 3e).This finding is highly controversial in the literature, since previous studies have observed that Akkermansia muciniphila is both beneficial (53) and pathogenic (54).From our analysis, it is difficult to determine the causal role of Akkermansia muciniphila in these diseases.Follow up mechanistic and clinical studies will be necessary to explore the involvement of this microbe in more depth.

Microbial gene level comparison and pathway analysis showed consistency with species level results
Disease similarity based on the microbial genes can be accessed by comparing the Pearson correlation coefficient R between the inferred log2 fold changes (LFC) across every two diseases (Figure 4a).The R value for CD vs UC stands at 0.6, representing the highest positive correlation observed across all disease pairs (Figure 4b).As two subtypes of IBD, both CD and UC are characterized by transmural inflammation, with CD being able to affect any area from the mouth to the perianal region while UC is limited to the colon's mucosal layer (55).Previous studies have demonstrated that IBD is influenced by genetic predisposition, immune system dysregulation, and environmental factors (23,35).Our pathway analysis revealed the involvement of both caseassociated and control-associated microbes in various metabolic pathways.Specifically, we identified discrepancies in amino acid metabolism, energy metabolism, and lipid metabolism that differentiated between case-associated microbes and control-associated microbes for CD and UC.Case-associated microbial genes exhibited a higher prevalence within most of these metabolic pathways, many of which contributed to inflammation and infection (Figure 4e).Pathogenic microbes such as Fusobacterium, Klebsiella, and Stenotrophomonas (56-58) are heavily involved in pathways including tryptophan biosynthesis, oxidative phosphorylation, and fatty acid biosynthesis.This is consistent with previous studies, highlighting the microbial and clinical similarity between these two disorders.Conversely, AD has a strong negative correlation between differential gene abundances between both CD (R=-0.55)(Figure 4c) and UC (R=-0.46)(Figure 4d), highlighting how the microbial link with AD affects the same pathways, but possibly through a different mechanism of action.Amongst the genes that were most differentiating between cases and controls for CD, UC, and AD, many of these genes are involved in the pathways related to the metabolism of amino acid, lipid, and energy metabolic.In AD patients, this was partially due to the decrease in microbes, such as Hafnia alvei, Ruminococcus, and Citrobacter, which had a greater prevalence of genes that are encoded in these pathways (Figure 4e).Some of the control-associated microbes are known to generate metabolites like histamine, conjugated fatty acids, and dopamine, which act as neuroprotective agents in AD (59).AD is known for the accumulation of beta-amyloid plaques and tau tangles in the brain (19), and is often characterized by metabolic abnormalities, including compromised bioenergetics, impaired lipid metabolism, and an overall decreased metabolic capacity (60).Most of the drugs designed to treat AD patients, such as Lecanemab, Donanemab, and Remternetug, are focused on removing plaque (61).In contrast, many drugs that target IBD are immunosuppressants, such as Azathioprine, Mercaptopurine (6-MP), and Methotrexate (62).While drugs used to treat IBD and AD are known to have very different functional roles, it is interesting to see how the microbial gene profiles between these two disease populations have discordance in the same metabolic pathways (Figure 4e).It is currently not clear to us why this discordance exists.However, these findings highlight interesting directions for pre-clinical follow up studies, particularly in exploring the utility of immune-enhancing drugs in AD.Both the enrichment of pathogens and depletion of control-associated microbes contribute to the similarity between complex human diseases Various types of microbiome shifts in complex human diseases have been identified by previous studies, encompassing the depletion of beneficial microbes, enrichment of pathogens, and a comprehensive reconstruction of gut microbial communities (17).In many of the disease pairs that exhibit a high overlap in microbial signatures, we found that both the enrichment of pathogens and the depletion of beneficial microbes contribute substantially to their similarity.This holds true regardless of the previous classification of their dysbiosis patterns in prior studies.Dysbiosis associated with CRC was generally characterized by increased prevalence of the pathogenic microbes (25) while CD was consistently characterized by the depletion of controlassociated microbes (63).Combined similarity networks with the sum of overlapped microbe weights show that both shifts contribute to the similarities between diseases (Figure 5).The color of edges shows the difference of shifts, and the width of edges between two diseases are proportional to the overlapped microbes.The similarities between CD and CRC comprise a mixture of both shifts.This indicates that the dysbiosis patterns of some diseases are more complicated than initially clarified, opening new opportunities for repurposing narrow-spectrum antimicrobials and probiotic treatments.There is consistency between the similarity observed at the microbial species level and that at the microbial gene level.AD and the two IBD subtypes showed the least overlap in differentially abundant microbes.They also exhibited the least similarities at the microbial gene level.Furthermore, disease pairs like CD vs UC, CD vs CRC, PD vs T2D, as well as schizophrenia vs T2D, demonstrated a high overlap in differentially abundant microbes and high Rs in microbial gene abundances.Discrepancies may arise when comparing these similarities at different levels.For instance, both PD vs UC and PD vs T2D have strong microbial gene similarities (R=0.43, and R=0.43) (Figure 3a).However, PD vs UC has a very small overlap in differentially abundant microbes (overlap=21%), while PD vs T2D has a larger overlap (overlap=35%) (Figure 5).This is consistent with the functional redundancies that have been observed in microbial communities, even if there is small overlap between the microbial taxa, there could still be strong overlap in the metabolic function due to the common metabolic roles different microbes play (64).

DISCUSSION
We have assembled the largest shotgun metagenomics meta-analysis that has inferred disease similarity with high resolution, across 2091 samples from 18 studies, encompassing 11 different disease types.We conducted case-control differential abundance analysis within each disease, following comparison between diseases.Our results demonstrated that binary disease classifiers for CD and CRC exhibit a strong generalization capability when applied to unseen data.We discovered a high degree of microbial similarity between CD and CRC.This finding aligns with the fact that CD is a risk factor for CRC, thereby validating our pipeline.Furthermore, CD and UC are detected to have the strongest microbial similarity.Given that both are subtypes of IBD, this observation further substantiates the effectiveness of our pipeline.We identified two neurological disorders, PD and schizophrenia, that exhibited high microbial similarity with T2D.Higher prevalence of T2D in schizophrenia patients has been observed in observational clinical studies (65), but there has not been a microbiome connection that has been previously established.Our findings offer valuable perspective on the potential for repositioning T2D drugs to treat these neurological disorders or vice versa.Metformin is a commonly used oral treatment for T2D (66).Metformin alters the gut microbiome of T2D patients and altered gut microbiota mediates some of metformin's antidiabetic effects (67).Interestingly, recent studies suggest that metformin has a positive effect on conditions such as anxiety or depression (68,69).Mouse study also established the neuroprotective effect of metformin in PD and supports the therapeutic potential of metformin in the treatment of PD (70,71).One surprising finding we discovered was that microbiome profiles in AD were anti-correlated with microbiome profiles in IBD.Microbiome components have been observed to play a role in both diseases.In a study on AD involving fecal microbiota transplantation (FMT), fecal microbiota from Alzheimer's patients and age-matched healthy controls were transplanted into microbiotadepleted rats respectively.It was observed that the severity of impairments in hippocampal neurogenesis in these rats correlated with the clinical cognitive scores of the donor patients (72).Multiple FMT in IBD have shown promising results reducing inflammation in patients (73,74).However, common biological mechanisms between these two diseases have not been previously established.AD is characterized by inflammation in the brain, while IBD is mainly characterized by GI inflammation.The anti-correlated microbial gene profiles between AD and IBD highlights potentially novel directions for drug design.If drugs designed to target IBD were applied to Alzheimer's patients, would it antagonize Alzheimer's symptoms?Furthermore, is it possible for these efforts to uncover new therapeutic strategies that could counteract the effects of these drugs?While our findings provide valuable insights, it is important to note that our study is subject to several notable limitations.First, there are multiple confounding factors that could bias our findings.For instance, most studies did not perform absolute quantification, thus it is not possible to identify microbes that are truly differential between the case and control populations (75).It is possible that the microbes detected was due to our choice of reference frame, we assume that the average microbe isn't changed between the case-control cohorts, but if there is a significantly altered microbial load between the cohorts, that could lead to false positives or false negatives in the differential abundance results (75).To ameliorate this issue, we focused on the top 100 microbes differentially increased in the cases and the top 100 microbes differentially increased in the controls to avoid the issue of identifying an unstable reference frame.There are also likely a few biological confounders that aren't well-documented but could affect our findings, such as medication history (i.e., antibiotics usage) or dietary patterns.To improve our ability to perform causal inference, it is important to not only account for these relevant confounders, but also take advantage of longitudinal observational cohorts and clinical trials to identify indirect effects on outcomes due to external interventions.Incorporating multiple omics levels will also help improve causal resolution, since increasing the number of observed biomarkers will increase the chances of observing a biomarker that plays a causal role in the disease symptoms.Our analysis focused solely on shotgun metagenomics data, overlooking the potential insights offered by other omics-level data (76).Host transcriptomic profiles facilitated the identification of host gene-microbiome associations in gastrointestinal disorders (77).Metabolomics would yield insights into lipid and bile-acid metabolism which has been observed in the context of IBD (78).Proteomics will likely play an important role in understanding amino acid metabolism and immune response, which we have shown to play a role in estimating disease similarity.While observational and clinical studies can help identify putative causal biomarkers, preclinical studies with follow up mechanistic experimentation are needed to confirm the causal roles of these biomarkers.Furthermore, the availability of microbiome datasets presents obstacles for performing a more comprehensive microbiome-centric disease meta-analysis.Some diseases, such as T1D, have fewer microbiome studies, especially when compared to other conditions like IBD and CRC.Furthermore, most of the studies that we analyzed focused on a single disease, which by design excludes other disease comorbidities.At this moment, we are not aware of observational studies that investigate population-level comorbidities from a microbiome perspective.Our findings strongly suggest that broadening the range of microbiome data collection could significantly enhance the analysis of disease comorbidity.This would not only improve our understanding of known microbiome-associated diseases but could also unveil microbiome associations for disorders that have not been previously shown to have a microbiome component.

METHODS Curate shotgun metagenomic datasets
The Sequence Read Archive (SRA) stands as the most extensive publicly accessible repository of sequencing data across various sequencing platforms.To identify studies and datasets exploring the human gut microbiome in the context of complex human diseases, we utilized the SRAdb package (https://github.com/seandavi/SRAdb). Using keywords such as "gut microbiome", "human", and "shotgun", we identified relevant studies and datasets within the SRA repository.Datasets that have metadata available were selected and subjected to case-control matching within studies based on age and gender information.Samples were filtered to exclude individuals with obesity when BMI information was available.Unmatched samples were subsequently removed from the analysis.The retained samples then underwent consistent processing methods to generate microbial abundances.To streamline and automate the workflow, a Snakemake (79) pipeline was developed for this study, which can be accessed at (https://github.com/jindongmin/snakemake_metagenomics).The pipeline takes the SRA BioProjecct IDs as input and outputs the microbial abundance biom tables.The workflow began with downloading the sequencing data with fasterq, which was followed by quality profiling and filtering steps with fastp (80).Kraken2 (81) and bracken (82) were employed to classify the reads to the best matching location in the taxonomic tree and compute the abundance of species.In terms of Kraken2 databases, we benchmarked Web of Life (WoL) and the Unified Human GI Genome version 2 (UHGG v2.0) databases (83).Our finding indicated that UHGG v2.0 offered a more comprehensive coverage of species at the time of our access to the databases.Consequently, we opted to utilize UHGG v2.0 in our study.Read lengths are specified based on the characteristics of the sequencing data per study.

Build disease classifiers and filter datasets
Machine learning algorithms including GB and RF were employed for cross-validation to assess the accuracy of gut microbes in distinguishing between disease cases and control subjects.We fitted classifiers for each dataset using q2-sample-classifer (84), each microbial abundance was treated as a feature.Binary disease classifiers for CD and CRC were constructed by combining all the datasets per disease.The samples were randomly divided into training and testing sets, with an 80/20 split.The training set was utilized to construct the model and obtain optimal model parameters, while the hold-out testing dataset was used to generate predictions.The performance of GB and RF classifiers was evaluated across the datasets using the AUC.An AUC value of 0.5 indicates that the corresponding classification has the same predictive ability as random guessing.To ensure that the included datasets possessed discernible microbial signatures capable of distinguishing between cases and control subjects, we applied a threshold for AUC of 0.6 and retained only those datasets with an AUC greater than 0.6.

SHAP interpretation of binary disease classifiers
SHAP is an explanatory approach rooted in game theory that aims to shed light on the outcomes produced by machine learning models (85).It leverages shapley values, which are a solution concept derived from cooperative game theory (86).These values provide insights into the individual contributions of players within a coalition game.In the context of SHAP, each feature (the abundance of microbes) is considered a player, and by calculating shapley values, we can ascertain their respective influences on the predictions made by disease classifiers.

Differential abundance analysis
Microbial abundance and microbial gene abundance were analyzed using the DESeq2 package (87).The microbial species abundance data were represented as count matrices, where rows corresponded to microbial species, and columns represented samples.Healthy controls were specified as reference.Microbial species were ranked with 5% or 95% Confidence Intervals (CI) of the LFC depending on whether it decreased or increased in disease cases.And differentially abundant microbes were identified based on the ranking.Microbes with 5% CI of LFC ranked top are termed as case-associated microbes, and microbes with 95% CI of LFC ranked bottom are termed as control-associated microbes.The analysis of differential gene abundance followed a similar approach as the microbial abundance analysis.Differential gene abundances were generated with the eggNOG annotations (88).The count matrix was created with genes as rows and samples as columns.And the LFCs were calculated for further analysis.

Disease similarity analysis
Disease similarity at the microbial species level was measured using the overlap of differentially abundant microbes.We looked at the top 100 case-associated and top 100 control-associated microbes.Concentrating on these top microbes allows us to prioritize the most relevant and significant microbial species associated with disease status.To investigate the shared microbial signatures between diseases at the species level, pairwise comparisons were conducted to determine the number of overlapping differentially abundant microbes.Disease similarity at the microbial gene level was represented using the Pearson correlation coefficient (R) of LFCs between two diseases.A higher positive correlation coefficient value indicates a stronger similarity in the differentially abundant pattern, while a negative correlation value represents how reversed the patterns are.

Microbial gene analysis
In order to determine if a particular gene is more commonly observed in case-associated microbes or healthy-associated microbes than by random chance, a genome wide binomial test (15) was performed between two groups of taxa.The significance level for the test was set as 0.001.Microbial genes identified that were statistically significant were subsequently mapped to KEGG pathways to elucidate their respective functional roles.

ACKNOWLEDGEMENTS
The authors thank NYU Department of Biology and the Simons Foundation.This study used High-Performance Computing resources at the Flatiron Institute Scientific Computing Core.This work was supported by the NIAID grant 5R01AI130945.The overall design and data analysis pipeline.Flow Chart of this meta-analysis.First, shotgun metagenomic datasets investigating the human gut microbiome in multiple diseases were curated and processed consistently with the Snakemake metagenomic pipeline built in this study, and the microbial abundances matrices were generated.Second, Gradient Boosting and Random Forest classifiers were built for each dataset/disease.Then datasets with classification accuracy above the threshold of 0.6 remained in the following analysis.Disease specific microbial signatures and microbial similarity at the species level were analyzed with the differential abundance results.At the gene level, Pearson correlation coefficients of microbial genes between every disease pair were calculated and used as the proxy for disease similarity.Disease pairs that showed high or low similarity were further investigated with pathway analysis.

Figure 1 .
Figure1.The overall design and data analysis pipeline.

Figure 1 .
Figure1.The overall design and data analysis pipeline.Flow Chart of this meta-analysis.First, shotgun metagenomic datasets investigating the human gut microbiome in multiple diseases were curated and processed consistently with the Snakemake metagenomic pipeline built in this study, and the microbial abundances matrices were generated.Second, Gradient Boosting and Random Forest classifiers were built for each dataset/disease.Then datasets with classification accuracy above the threshold of 0.6 remained in the following analysis.Disease specific microbial signatures and microbial similarity at the species level were analyzed with the differential abundance results.At the gene level, Pearson correlation coefficients of microbial genes between every disease pair were calculated and used as the proxy for disease similarity.Disease pairs that showed high or low similarity were further investigated with pathway analysis.

Figure 2 .
Figure 2. Interpretation of binary classifiers and differentially abundant microbes overlaps in Crohn's Disease (CD) and Colorectal Cancer (CRC).

Figure 2 .dFigure 3 .Figure 3 .Figure 4 .
Figure 2. Interpretation of binary classifiers and differentially abundant microbes overlaps inCrohn's Disease (CD) and Colorectal Cancer (CRC).The differential abundant microbes are identified first by computing LFC between case and control within one disease, then ranked by the 5% Confidence Interval (CI) of LFC to identify the top-100 case-associated microbes, finally ranked by the 95% CI of LFC to identify the top-100 control-associated microbes.Each dot represents one microbe, and its color is coded by its ranking.Dots colored blue and salmon represent the microbes differentially abundant in disease cases and controls respectively.Dots colored gray are the ones that are considered neutral.(a)-(b) X axis is the shapley values, Y axis is the log2 fold change (LFC) between case and control.Left panels are the cases, which have a sum of shapley values as negative values, right panels are the controls, which have a sum of shapley values as positive values.Dots with high absolute Shapley values and high log2FC are labeled.(a) Shapley values vs LFC in CD cases and controls.(b) Shapley values vs LFC in CRC cases and controls.(c) Overlap of the differentially abundant microbes between CD and CRC.X axis is the microbes, y axis is the microbe's rankings.A smaller ranking number for caseassociated microbes indicates a greater increase of the microbe in disease cases.A smaller ranking number for control-associated microbes indicates a greater increase of the microbe in healthy controls. d

Figure 4 .
Figure 4. Microbial gene level similarity between diseases and the pathway signatures of the microbes.(a) Pearson correlation coefficient R between the inferred microbial gene log2 fold changes across every two diseases.(b) Scatterplot of the Pearson R between Crohn's Disease (CD) and Ulcerative Colitis (UC).(c) Scatterplot of the Pearson R between CD and Alzheimer's Disease (AD).(d) Scatterplot of the Pearson R between UC and AD.(e) Amino acid metabolism, energy metabolism, and lipid metabolism pathways of the microbial signatures in AD, CD, and UC.The x axis are the differential abundant microbes, the blue ones represent control-associated microbes, while the salmon ones represent the case-associated.The y axis are the KEGG pathway modules.The numbers on the right green bar represent the number of genes.

Figure 5 .
Figure 5. Combined similarity networks with sum of overlapped microbe weights.

Figure 5 .
Figure5.Combined similarity networks with sum of overlapped microbe weights.Each node represents one disease type, the weight of edges shows how similar two diseases are.The number in each edge is proportional to the overlapped differential abundant microbes in each disease (case vs control): top-100 (case-associated) and bottom-100 (control-associated).The colors of the edges indicate the origin of the similarities: salmon color edges represent the similarity conferred by the overlap of case-associated microbes; the blue color represents the similarity conferred by the overlap of control-associated microbes.

Table 1 .
Diseases and metagenomic datasets included in this meta-analysis 813Diseases and metagenomic datasets included in this meta-analysis