Population Structure Discovery in Meta-Analyzed Microbial Communities and Inflammatory Bowel Disease

Microbial community studies in general, and of the human microbiome in inflammatory bowel disease (IBD) in particular, have now achieved a scale at which it is practical to associate features of the microbiome with environmental exposures and health outcomes across multiple large-scale populations. This permits the development of rigorous meta-analysis methods, of particular importance in IBD as a means by which the heterogeneity of disease etiology and treatment response might be explained. We have thus developed MMUPHin (Meta-analysis Methods with a Uniform Pipeline for Heterogeneity in microbiome studies) for joint normalization, meta-analysis, and population structure discovery using microbial community taxonomic and functional profiles. Applying this method to ten IBD cohorts (5,151 total samples), we identified a single consistent axis of microbial associations among studies, including newly associated taxa such as Acinetobacter and Turicibacter detected due to the sensitivity of meta-analysis. Linear random effects models further revealed associations with medications, disease location, and interaction effects consistent within and between studies. Finally, multiple unsupervised clustering metrics and dissimilarity measures agreed on a lack of discrete microbiome “types” in the IBD gut microbiome. These results thus provide a benchmark for consistent characterization of the IBD gut microbiome and a general framework applicable to meta-analysis of any microbial community types.


26
Meta-analysis for molecular epidemiology in large populations has seen great success in linking 27 high-dimensional 'omic features to complex health-related phenotypes. One example of this is in 28 genome-wide association studies (GWAS 1 ), where the appropriate study scale, achieved by 29 rigorous integration of multiple cohorts, has both facilitated reproducible discoveries (in the form 30 of disease-associated loci 2-4 ) and addressed confounding due to unobserved population 31 structure 5 . The inflammatory bowel diseases (IBD) represent a particular success story for GWAS 32 meta-analysis 3,4 , and environmental and microbial contributors complementing the condition's 33 complex genetic architecture have been detailed by many individual studies 6-8 . However, in the 34 absence of methods appropriate for large-scale microbial meta-analysis, the extent to which these 35 findings reproduce across studies, or can be extended by increased joint sample sizes, remains 36 undetermined. Likewise, it is unclear whether reproducible population structure in the microbiome, 37 such as microbially-driven IBD "subtypes," exists to help explain the clinical heterogeneity of these 38 conditions 9 . 39  (Table 1). We uniformly processed the associated sequence data and harmonized 87 metadata across cohorts. Microbial taxonomic profiles were then corrected for batch-and study-effects before 88 downstream analyses for omnibus and per-feature association with disease phenotypes and unsupervised population 89 structure discovery. b) MDS ordination of all microbial profiles (Bray-Curtis dissimilarity) before batch correction 90 visualize the strongest associations with gut microbial composition, including disease, sample type (biopsy or stool), 91 cohort (visualized separately for larger and smaller studies), and dominant phyla.

92
Using this joint dataset and upon uniform bioinformatics processing (Methods), we first assessed 93 the factors that corresponded to overall variation in microbiome structure, which included disease 94 status, sample type (biopsy versus stool), and dominant phyla (Bacteroidetes and Firmicutes, Fig.  95 1b). Cohort effects prior to batch correction and meta-analysis were also significant. Microbiome 96 differences associated with disease were notable even without normalization. However, this can 97 be misleading due to the confounding of cohort structure between studies, such as the 98 differentiation between RISK (a predominantly mucosal study of CD) and PROTECT (a 99 predominantly stool study of UC). Inter-individual differences largely independent of population or 100 disease, such as Bacteroidetes versus Firmicutes dominance, were also universal among studies 101 and sample types as expected 9,26 . Many of these factors were of comparable effect size, both 102 visually and as quantified below, emphasizing the need for covariate-adjusted statistical modelling 103 to delineate the biological (disease, treatment) and technical (cohort, batch) effects associated 104 with individual taxa throughout the cohorts (Supplemental Notes, Supplemental Fig. 1-3

108
A statistical framework for meta-analysis of microbial community profiles 109 We developed a collection of novel methods for meta-analysis of environmental exposures, 110 phenotypes, and population structures across microbial community studies, specifically 111 accounting for technical batch effects and interstudy differences (Methods, Fig. 1a biologically interest (exposure, phenotype). Second, we combined well-validated data 117 transformation and linear modelling combinations for microbial community profiles 33 with fixed and 118 random effect modelling 34 for meta-analytical synthesis of per-feature (taxon, gene, or pathway) 119 differential abundance effects. Lastly, we generalized and formalized approaches from cancer 120 transcriptional subtyping 35 to permit unsupervised discovery and validation of both discrete and 121 continuous population structures in microbial community data (Supplemental Fig. 4) (Fig. 2a-b, Supplemental Fig. 5). This was true both in terms 131 of reducing the overall microbial variability attributable to technical artifacts and in terms of the 132 ratio of "biological" versus technical variability (Fig. 2a). ComBat correction 15 , suited for gene 133 expression data, was capable of reducing batch effects to a lesser degree, but also tended to 134 reduce desirable "biological" variation in the process, likely due to noise introduced by it changing 135 many zero counts to non-zero values. Previously proposed techniques for microbial community 136 data, namely quantile normalization 18 and BDMMA 19 , are only appropriate for differential 137 abundance analysis and do not provide batch-normalized profiles, thus precluding PERMANOVA 138 batch effect quantification; their per-feature testing performance is evaluated together with 139 MMUPHin in the following section. MMUPHin thus provides batch-corrected microbial community 140 profiles that retain biologically meaningful variation more than (or not even possible using) existing 141 methods. 142 For differential abundance testing, MMUPHin successfully corrected for false associations when 143 batch/cohort effects were confounded with variables of interest, which is a common concern for 144 'omics meta-analysis 38 , while quantile normalization 18 and BDMMA 19 had either inflated or overly 145 conservative false positive rates ( Fig. 2c-d, Supplemental Fig. 6). We also validated MMUPHin's 146 support for unsupervised population structure discovery, in addition to these "supervised" 147 differential abundance and statistical association tests. In microbial communities, valid, 148 generalizable population structure can manifest as either discretely clustered subtypes 39 or as 149 continuously variable gradients of community configurations 40 , but methods for discovery are thus identifying major axes of variation that explain the largest amount of heterogeneity between 157 microbial profiles and are also consistent across studies ( Fig. 2g-h, Supplemental Fig. 8). As a 158 result, MMUPHin was able to successfully identify discrete clusters (i.e. microbiome "types") when 159 present, as well as significantly consistent continuous patterns of microbiome variation that recur 160 among populations (Supplemental Notes). 161

Meta-analysis of the IBD microbiome 178
Given these validations of MMUPHin's accuracy in simulated data, we next applied it to the 10-179 study, 4,789-sample IBD gut amplicon profile meta-analysis introduced above (Fig. 3). MMUPHin 180 successfully reduced the effects both of differences among studies, and of batches within studies 181 (study effect correction modelling disease and sample type as covariates, see Methods), 182 although these remained among the strongest source of variation among taxonomic profiles as 183 quantified by PERMANOVA R2 (Fig. 3a, Methods, Supplemental Table 3

210
We identified individual taxonomic features consistently associated with disease and treatment 211 variables (Fig. 3b, Supplemental Table 4), with meta-analysis multivariate differential 212 abundance analysis, adjusting for common demographics (age, gender, race) and further 213 stratifying for sample type and disease when appropriate (Methods, Supplemental Table 3). At 214 a very high level, differential abundance patterns between CD and control microbiomes were 215 consistent with, and often more severe than contrasts between UC and control, confirming with 216 increased resolution previous observations that CD patients tend to have more aggravated 217 dysbiosis than UC patients 9 . As expected, our meta-analysis confirms many of the taxa 218 associated with IBD reported by previous individual (Fig. 3b, detailed in Supplemental Notes); 219 these findings strongly supports the emerging hypotheses of pro-inflammatory aerotolerant 220 clades forming a positive feedback loop in the gut during inflammation, often of oral origin 7 , and 221 depleting the gut's typical fastidious anaerobe population as a result. 222 We also identified two taxa not previously associated with IBD, both of modest effect sizes and 223 likely newly detected by the meta-analysis' increased power. The genus Acinetobacter was 224 enriched in CD, and Turicibacter was depleted. Turicibater in particular is poorly represented in 225 reference sequence databases, with only nine genomes for one species (Turicibacter sanguinis) 226 currently in the NCBI genome database; this makes it easy to overlook in shotgun metagenomic 227 profiles relative to amplicon sequencing. The genus Acinetobacter, conversely, is quite well 228 characterized due to its role in antimicrobial resistant infections 44 , and it was previously linked 229 specifically to the primary sclerosing cholangitis phenotype in UC 45 , although without follow-up to 230 our knowledge. Turicibacter is overall less characterized both in isolation and with respect to 231 disease, although our findings and others' suggest it might be inflammation-sensitive when 232 present; it was one of many clades increased in mice during CD8+ T cell depletion 46 and reduced 233 in a homozygous TNF deletion 47 . As the strains of Acinetobacter implicated in gut inflammation 234 are unlikely to be those responsible for e.g. nosocomial infections, further investigation of both 235 clades using more detailed data or IBD-specific isolates is warranted. 236 Among treatment variables (samples or time points during which subjects were receiving 237 antibiotics, immunosuppressants, steroids, and/or 5-ASAs), antibiotics had the strongest effects 238 on individual taxa, as well as the greatest number of significantly associated taxa (Fig. 3b). These 239 associations are also broadly in agreement with previous observations for microbiome responses 240 to antibiotics in IBD or generally, e.g. the depletion of Faecalibacterium, Ruminococcus, and 241 Bacteroides in patients treated with antibiotics, and the enrichment of (often stereotypically 242 resistant) taxa such as Streptococcus, Acinetobacter, and the Enterobacteriaceae, with 243 differential responses to the treatment groups speaking to both administration considerations and 244 their impact on host versus microbial community bioactivities (Supplemental Notes). 245 Subsets of IBD-linked taxa were additionally associated with the diseases' phenotypic severity 246 ( Fig. 4a, Supplemental Table 5). Montreal classification 43 was used as a proxy for disease 247 severity, including Behavior categories for Crohn's disease (B1 non-stricturing, non-penetrating, 248 B2 stricturing, non-penetrating, B3 stricturing and penetrating) and Extent for ulcerative colitis (E1 249 limited to rectum, E2 up to descending colon, E3 pancolitis). We tested for features differentially 250 abundant in the more severe phenotypes when compared against the least severe category (B1 251 CD and E1 UC, Methods). Among statistically significant results, many extended those identified 252 above as overall IBD associated (Fig. 3b), such as the depletion of Faecalibacterium in B3 CD 253 and Roseburia in B2 CD, as well as the enrichment of Enterobacteriaceae in E3 UC. In most 254 cases, microbial dysbiosis was also additionally aggravated from the moderate to the most 255 extreme disease manifestations; such differences were statistically significant (Methods) in, for 256 example, the progressive depletion of Bacteroides in CD and UC, as well as the enrichment of 257 Enterobacteriaceae in UC. This meta-analysis is uniquely powered to detect these subtle 258 differences, which aid in shedding light on the microbiome's response to progressive inflammation 259 and disease subtypes. Pancolitis corresponds with a unique microbial configuration distinct from 260 regional colitis and not generally detectable in smaller studies 6 , for example, while more severe 261 CD induces essentially a more extreme form of the same dysbiosis observed in less severe forms 262 of the disease. 263  Table 6). Interaction effects, in the 279 statistical sense, were defined as a main exposure (IBD or treatment) having differential effects 280 on taxon abundance with respect to either sample type (biopsy/stool) or diseases (CD/UC); they 281 were identified via moderator meta-analysis models (Methods). Overall, we found elevated 282 effects of both CD (relative to controls) and antibiotic treatment in stool as compared to biopsy-283 based measurements of the microbiome (Supplemental Table 6). An example of this is 284 Dehalobacterium, with significantly greater depletion in CD stool relative to biopsies (Fig. 4b). 285 Dehalobacterium, as with Turicibacter above, is underrepresented in reference sequence 286 databases, better-detected by amplicon sequencing, and thus not a common microbial signature 287 of IBD. It has been linked to CD in at least one existing 16S-based stool study 48 . In contrast, 288 several UC-specific microbial disruptions were more prominent at the mucosa (i.e. in biopsies, 289 Supplemental Table 6). Coupled with the severity-linked differences above, this suggests CD-290 induced changes in the entire gut microbial ecosystem largely as a consequence of inflammation, 291 with UC-induced dysbioses both more local and more specific to disease and treatment regime. 292 Additional results include effect of steroids on the Enterobacteriaceae, which tended to be more 293 abundant in CD patients receiving steroids, but less abundant in UC recipients (Fig. 4b,  294 Supplemental Table 6, Supplemental Notes). 295

Consistent IBD microbial population structure discovered by unsupervised analysis 296
The existence of subtypes within gut microbial communities has been a major open question in 297 human microbiome studies, and it is of particular importance within IBD as a potential explanation 298 for heterogeneity in disease etiology and treatment response 6,9 . To systematically characterize 299 population structure in the IBD gut microbiome that was reproducible among studies, we 300 performed both discrete and continuous structure discovery on the 10 cohorts using our meta-301 analysis framework. To identify potential discrete community types (i.e. clusters), we performed 302 clustering analysis within each cohort's IBD patient population, and evaluated the clustering 303 strength via prediction strength (Methods). We found no evidence to support discrete clustering 304 structure within individual cohorts, nor were we able to reproduce each cohort's clustering results 305 externally (Fig. 5a). This lack of discrete structure was consistent when we further stratified 306 samples to either CD or UC populations (Supplemental Fig. 9), or extended to additional 307 dissimilarity metric and clustering strength measurements (Supplemental Fig. 9, Methods). Our 308 observation that the IBD gut microbiome cannot be well characterized by discrete clusters is thus 309 consistent with previous findings on gut microbial heterogeneity for healthy populations 40 and 310 suggests that, at the level powered by this study, such microbiome subtypes are not clearly 311 responsible for clinical heterogeneity. 312  (Supplemental Fig. 9). b) Conversely, 318 two reproducible, continuously variable patterns of microbiome population structure were identified using groups of 319 similar principal components (Methods) 35 . These patterns were consistent within and between cohorts, disease types, 320 and sample types, as well as under different edge strength cutoffs (Supplemental Fig. 11), and their consensus 321 loadings were reproducible among cohorts (Supplemental Fig. 12). c) Top 20 genera with highest absolute loadings 322 for the disease-associated dysbiosis score corresponding to the first cluster in b. Many of these taxa were also IBD-323 associated (Fig. 3b). d) Distribution of the dysbiosis pattern across CD, UC, non-IBD control, and healthy populations.
Although it was defined in an unsupervised way solely within the IBD population, across which the pattern is highly 325 variable, it also differentiates well between IBD and control populations (Supplemental Fig. 13).

326
Conversely, we identified two consistent, continuously varying gradients of microbial community 327 variation in the IBD microbiome ( Fig. 5b-d, Supplemental Fig. 10). These gradients represent 328 patterns of microbes that occur with greater or lesser abundance in tandem, and which covary 329 across subjects in a population; they were identified as principal component (PC) vectors that 330 recur among different cohorts (see Methods) 35 . Briefly, we used the four largest IBD cohorts (CS-331 PRISM, Pouchitis, PROTECT, and RISK) as training datasets to identify two clusters of consistent 332 PCs (Fig. 5b), which were confirmed with sensitivity analysis (Supplemental Fig. 11) and 333 validated in the remaining cohorts (Supplemental Fig. 12). The consensus loadings (i.e. within-334 cluster average) representing these two clusters (Fig. 5c, Supplemental Fig. 10, Supplemental 335 Table 7) were used to assign continuously varying scores to the IBD population that capture 336 gradient changes in the microbiome that occurred consistently within IBD, across diseases, 337 sample types, and cohorts. This disease-linked "type" of microbiome variation corresponded 338 roughly to severity or extent of inflammation, as detailed below. 339 In particular, while the second continuous population structure captured the Firmicutes-340 Bacteroidetes tradeoff present in most gut microbiome studies (Supplemental Fig. 10) 9,26,40 , the 341 first continuous score was IBD-specific and corresponded roughly to more extreme disease-342 associated dysbiosis in CD and UC populations (Fig. 5d). This is evidenced by the taxa with 343 highest weights in the scores' consensus loading vector (Fig. 5c), which included taxa 344 differentially abundant between IBD and control populations (Fig. 3). The score was consistent 345 both within CD and UC while also further differentiating IBD, non-IBD control, and healthy 346 populations (Fig. 5d, Supplemental Fig. 13), even though it was identified unsupervisedly only 347 from diseased subsets. The composition of the score and its population structure are also 348 consistent with our recent definition of dysbiotic gut microbiome configurations corresponding with multi'omic perturbations during IBD activity 9 . Together with the supervised meta-analysis results 350 above, these unsupervised population structure findings confirm that there are no detectable 351 discrete subtypes of the gut microbiome in IBD even among ~5,000 combined samples, while 352 showing a single continuously variable gradient of microbiome changes reproducibly present 353 during more dysbiotic diseases. 354

355
Here, we provide a novel framework for microbial community meta-analysis and apply it to the 356 first large-scale integration of over 5,100 amplicon profiles of the stool and mucosal microbiomes 357 in IBD. This identified a significantly reproducible gradient in the gut microbiome indicative of 358 increasing dysbiosis in subsets of patients. The study also showed no evidence of additional 359 population structure, such as microbiome-driven discrete disease subtypes, within CD or UC. The 360 increased power provided by meta-analysis supported many of the taxonomic associations 361 previously ascribed to IBD (e.g. Faecalibacterium, Ruminococcus, Enterobacteriaceae) while 362 uncovering new associations (Turicibacter, Acinetobacter) not confidently associated with 363 inflammation by other populations or data types. Almost all effects were exhibited similarly using 364 either stool or mucosal profiling, with a small number of exceptions showing significant 365 differentiation (e.g. Dehalobacterium). Novel disease-treatment response interactions were 366 observed (e.g. steroids on Enterobacteriaceae). Finally, the meta-analysis framework developed 367 for the study, MMUPHin, has been extensively evaluated and its performance for batch effect 368 removal, supervised meta-analysis of exposures and covariates, and unsupervised population 369 structure discovery validated on a variety of simulated microbial community types. It is extensible 370 to integration of microbial community taxonomic or functional profiles from other data types (e.g. 371 metagenomic sequencing) or environments.
However, all microbial community meta-analyses should be approached with caution, since in 373 many cases unwanted sources of technical variation between studies (i.e. batch effects) are so 374 large as to potentially mask biological signals even after correction 49-51 (Supplemental Notes). 375 Reducing inter-study variation in microbial community profiles is challenging relative to other 376 'omics data types due to 1) the extreme heterogeneity of microbes within most communities 377 (exacerbating both technical and biological differences), and 2) feature zero-inflation arising from 378 both biological and technical reasons 13,52 . Notably, despite these challenges, MMUPHin was able 379 to meta-analyze amplicon profiles in this study both to associate microbial shifts with disease 380 outcome, to associate them with treatment-specific differences, and to identify a single pattern of 381 typical microbial variation within IBD. While previous efforts have developed IBD dysbiosis scores 382 by contrasting patients with control groups 7,9 , this pattern of microbial variation was present 383 specifically within IBD patients (both CD and UC), and in agreement with supervised methods, 384 captured several classes of microbial functional responses in the gut (Supplemental Note). 385 The IBD gut microbiome particularly stands to benefit from meta-analysis, as have other multiply-386 sampled conditions such as colorectal cancer 53,54 , in order to identify ecological and 387 microbiological changes during the disease that are reproducible across populations. We consider 388 this study based on 16S rRNA gene sequencing to be a proof of concept, able to achieve 389 unprecedented power due to the number of amplicon profiled samples available, but with greater 390 precision possible in future work using e.g. metagenomic and other 'omics technologies. This also 391 enabled comparison of responses in the stool versus mucosal microbiomes, the latter of which 392 are not amenable to metagenomic profiling from biopsies; these were in overall good agreement, 393 but the few areas of significantly differential responses to inflammation are likely of particular 394 immunological interest. The large sample and population sizes also provide some confidence in 395 ruling out discrete, microbially-driven population subtypes as an explanation for CD and UCs' 396 clinical heterogeneity. Instead, the work identified a single consistent axis of gradient microbial 397 change corresponding to increasing departures from "normal" microbiome configurations 7,9,55 . 398 This pattern of consistent microbial dysbiosis can continue to be explored in further work on its 399 functional, immunological, and clinical consequences. Overall, this study represents one of the 400 first large-scale, methodologically appropriate, targeted meta-analysis of the IBD microbiome, and 401 the corresponding methodology and its implementation are freely available for future meta-402 analyses of human-associated and environmental microbial populations.
specific location and scale parameters. # is a feature-specific standardization factor. # are 419 covariate-specific coefficients, and !"# is an independent error term following a standard normal 420 distribution. !"# is a binary (0, 1) zero-count indicator, to allow for zero inflation of features. As in 421 ComBat, !# and !# are modelled with normal and inverse-gamma priors, respectively. 422 Hyperparameters are estimated with empirical Bayes estimators as in ComBat 15 . The posterior 423 means, * %# 6 and * %# 6 , along with standard frequentist estimates # 6 and # 7 are used to provide 424 batch-corrected count data: 425 Per-sample feature counts are then re-normalized to keep sample read depth unchanged post-427 correction. In practice, the user provides sample microbial abundance table ( ), batch/study 428 information, and optionally any other covariates that are potentially confounded with batch but 429 encode important biological information. MMUPHin outputs an adjusted profile : that is corrected 430 for the effect of batches but retains the effects of (if provided). 431 Meta-analysis differential abundance testing 432 For meta-analytical differential abundance testing, after batch correction, MMUPHin first performs 433 multivariate linear regression within individual studies using previously validated data 434 transformation and modelling combinations appropriate for microbial community profiles 435 (MaAsLin2 33 ). This yields study-specific, per-feature differential abundance effects estimations 436 %# 6 , where indicates study and indicates feature. These are then aggregated into meta-437 analysis effect size with fixed/random effects modelling as implemented in the metafor R 438 package 34 : 439 %# 6 = # + !# + !# 440 # is the overall differential abundance effect of feature . !# is per-study measurement error, 441 and !# is study-specific random effects term (not present in fixed-effect models). In practice, the 442 user provides a microbial community profile, study design (batch) information, the main exposure 443 variable of interest, and optional additional covariates. If any meta-analyzed studies include 444 repeated measures (e.g. longitudinal designs), then random covariates can also be provided and 445 will be modelled for such studies. MMUPHin then performs MaAsLin2 regression modelling within 446 each study and aggregates effect sizes of the exposure variable %# 6 across studies using the 447 resulting random/fixed effects model. The estimated overall effect, # 6 , is reported as the overall 448 differential abundance effect for feature . 449

Simulation validation of MMUPHin 522
We performed extensive simulation studies (Fig. 2, Supplemental Fig. 5-8, Supplemental Table  523 2) to validate the performance of each component of MMUPHin. In all cases these employed 524 realistic microbial abundance profiles generated using SparseDOSSA 525 (http://huttenhower.sph.harvard.edu/sparsedossa). This is a model of microbial community 526 structure using a set of zero-inflated log-normal distributions fit to selected training data, in this 527 case drawn from the IBD gut microbiome 6 . Controlled microbial associations with simulated 528 covariates can then (optionally) be spiked in. Note that although the assumed null distributions in 529 MMUPHin and SparseDOSSA are the same (zero-inflated log normal), the models of effects for 530 batch and biological variables are substantially different: MMUPHin assumes exponentiated 531 effects, while SparseDOSSA assumes re-standardized linear effects. 532 Specifically, SparseDOSSA models null microbial feature abundances using a zero-inflated log-533 normal distribution: 534 This is the same initial distributional assumption as the MMUPHin batch correction model, when 536 there are no batch or covariates effects. However, for spiked-in associations with metadata (batch, 537 biological variables, etc.), SparseDOSSA uses a different model. Given a simulated, pre-spiking-538 in feature count vector # with mean # 9 and standard error # 9 , as well as a metadata variable 539 vector with mean ; and standard error ; , the post-spiked-in feature count is set to: 540 where is a configurable spike-in strength parameter. By this definition, microbial features post-542 spike-in have the same mean and approximately the same variance as before, the only difference 543 being the added association with the metadata variable(s) used. This is to ensure the counts of 544 the modified feature are not dominated by the values of the target covariate, but instead 545 distributed similarly to real data. The SparseDOSSA association model thus differs from 546 MMUPHin's model in two substantial ways: i) MMUPHin's associations are defined within the 547 exponentiated component and are thus better described as a multiplicative effect, whereas 548 SparseDOSSA's effects are directly applied on untransformed data, and ii) SparseDOSSA 549 additionally ensures realistic data generation with the re-standardization procedure.
Thus, the only component of the SparseDOSSA model that requires fitting to training data is the 551 aforementioned zero-inflated log-normal null distribution. In our analysis, this was always PRISM 6 , 552 while other parameters were specified across a wide range of combinations to simulate different 553 application scenarios. These include the effect sizes of the associated batch and biological 554 variables (i.e. the parameter), number of batches, sample sizes, as well as dimensionality (both 555 the total number of features and the percentage of features randomized to be associated with 556 batch/biological variables). For each combination of simulation parameters, we performed 20 557 random replications (i.e. running simulation/evaluation with the same parameters but different 558 random seeds). Supplemental Table 2 presents the full list of parameter combinations. 559 Evaluating batch adjustment 560 For evaluation of MMUPHin's batch effect adjustment component, we simulated metadata that 561 included batch (with varying total batch numbers 2, 4, 6, 8), a binary positive control (simulated 562 "biological" covariate), continuous positive control ("biological"), and negative control (binary, and 563 guaranteed to be non-associated with microbial features) variables. Microbial abundance data 564 was simulated to be associated with the batch and the two positive control variables at varying 565 effect sizes (1, 2, 5, 10 for batch variable and fixed at 10 for positive control variables), but not 566 with the negative control variable. We additionally varied the number of samples per batch (20 to 567 simulate multiple-batches in a single study scenario, 100 to simulate meta-analysis with moderate 568 sized studies and 500 to simulate large meta-analysis), total number of microbial features (n=200 569 and 1000), as well as the percentage of features associated with metadata (5%, 10%, and 20%) 570 (Supplemental Table 2). 571 Performance of batch correction methods was quantified by omnibus associations (PERMANOVA 572 R2) between the simulated microbial abundance data with the batch and positive control variables, 573 before and after batch correction. For ComBat 15 and our method, batch correction was performed 574 with both positive control variables as well as the negative control variable as covariates. 575 MMUPHin successfully reduced the confounding batch effect, but retained the effect of positive 576 control variables, and did not inflate the effect of negative control variable (Fig. 2a, Supplemental  577   Fig. 5). 578 Evaluating meta-analytic differential abundance testing 579 We evaluated false positive rates (FPR) in particular for meta-analytic feature association testing, 580 specifically the null case in which there are no associations between microbial features and 581 covariates, but false associations can arise in the presence of batch effects with unbalanced 582 distribution of covariate values across studies (Fig. 2b). For simulation, we generated a binary 583 covariate unevenly distributed between two "studies" at varying levels of disparity (Supplemental 584 Table 2). Microbial abundance data was simulated to be associated only with the two studies and 585 not with the covariate (i.e. study confounded null data), with varying strengths of batch effect (from 586 0 to 10). The number of samples per batch varied between 100 and 500 to, again, simulate 587 moderate-and large-sized meta-analysis. Lastly, we varied total number of microbial features and 588 the percentage of features associated with metadata as above. 589 FPRs were calculated as the percentage of simulated microbial features with nominal p-values < 590 0.05 for associations with the exposure variable. Four data normalization and analysis regimes 591 were evaluated (Fig. 2c, Supplemental Fig. 6): a) naive MaAsLin2 model on the study effect 592 confounded null data (without explicitly modelling the batches), b) the quantile normalization 593 procedure, paired with two-tailed Wilcoxon tests, as proposed in 18 , c) BDMMA as proposed in 19 , 594 with the default 1,0000 total MCMC sampling and 5,000 burn-in, d) the complete MMUPHin meta-595 analysis model for the batch corrected data as described above. Note that due to its computational 596 cost we were only able to evaluate the Dirichlet-multinomial regression model on a subset of 597 parameter combinations, namely number of samples per batch = 100, number of features = 200, 598 and percent of associated microbes = 5%. These parameters roughly agree with those used in 599 the simulation analysis in the method's original publication 19 . 600 We also evaluated the computational costs of quantile normalization, BDMMA, and MMUPHin 601 (Supplemental Fig. 6). For this, the same subset of 20 replications (batch effect 0, exposure 602 imbalance 0, number of samples per batch 100, and number of features 200) were ran through 603 the three methods under the same computation environment (single core Intel(R) Xeon(R) CPU 604 E5-2680 v2 @ 2.80GHz). 605 Evaluating unsupervised discrete structure discovery 606 To simulate microbial abundance data with known discrete clustering structure, we again used 607 the simulation model above, with microbial feature associations added both with a discrete "batch" 608 variable and a discrete clustering variable, at varying number of batches (2, 4, 6, 8), number of 609 clusters (3, 4, 5, 6), as well as effect size of association (0 to 10 for batch, fixed at 10 for cluster). 610 For the evaluation of MMUPHin's unsupervised methods (both here and during continuous 611 population structure discovery below), we fixed the number of samples per batch at 500, the 612 number of total features at 1,000, and the percent of associated features at 20%. These were 613 guided by the fact that the underlying unsupervised methods (clustering, PCA) require larger 614 sample sizes for good performance even without batch confounding, and are generally only 615 practical with higher feature dimensions (Supplemental Table 2). 616 Performance of clustering was evaluated as the percentage of replicates in which the right number 617 of synthetically defined underlying clusters was identified using prediction strength, across 618 technical replicates for a fixed combination of simulation parameters. That is, the number of 619 clusters within a simulation was identified as that which maximized prediction strength. This was 620 compared to the "truth" (i.e. the known simulation parameter) and counted as a success only if 621 the two agreed. The percentage of success for a given parameter combination across the 20 random replications was used as the evaluation metric for model performance. We compared the 623 performance of clustering before and after MMUPHin batch correction (Fig. 2e, Supplemental 624 Table 7). Note that batch correction is modelled only using the batch variable and specifically not 625 including the cluster variable as a covariate in the batch correction model above, as the underlying 626 cluster structure is unknown in non-synthetic unsupervised analyses settings. 627 Evaluating unsupervised continuous structure discovery 628 To simulate microbial abundance data with known continuously variable population structure, we 629 spiked in feature associations with both a simulated batch covariate (4, 6, 8) and a continuously 630 varying gradient (uniformly distributed between -1 and 1), at varying number of batches and effect 631 size of both associations (as above). The number of samples per batch, total number of microbial 632 features, and the percentage of features associated were fixed at the same values as above 633 (Supplemental Table 2). 634 Performance of continuous structure discovery analysis was evaluated as the Spearman 635 correlation between the known simulated gradient score and the strongest continuously valued 636 population structure as identified by MMUPHin's continuous structure discovery method (above). 637 We again compared the performance of continuous score discovery on the batch confounded and 638 batch corrected data (Fig. 2g, Supplemental Fig. 8). Note that, as above, batch correction is 639 again modelled only using the batch variable and does not have any access to the synthetic 640 continuous gradient, as any underlying continuous population structure is unknown during 641 unsupervised analyses settings.