Microbiome differential abundance methods produce disturbingly different results across 38 datasets

Identifying differentially abundant microbes is a common goal of microbiome studies. Multiple methods have been applied for this purpose, and they are largely used interchangeably in the literature. Although it has been observed that these tools can produce different results, there have been very few large-scale comparisons describing the scale and significance of these differences. In addition, it is challenging for microbiome researchers to know which differential abundance tools are appropriate for their study and how these tools compare to one another. Here, we have investigated these questions by analyzing 38 16S rRNA gene datasets with two sample groups for differential abundance testing. We tested for differences in amplicon sequence variants and operational taxonomic units (referred to as ASVs for simplicity) between these groups with 14 commonly used differential abundance tools. Our findings confirmed that these tools identified drastically different numbers and sets of significant ASVs. However, for many tools the number of features identified correlated with aspects of the tested study data, such as sample size, sequencing depth, and effect size of community differences. We also found that the ASVs identified by each method depended on whether the abundance tables were prevalence-filtered before testing. ALDEx2 and ANCOM produced the most consistent results across studies and agreed best with the intersection of results from different approaches. In contrast, several methods, such as LEfSe, limma voom, and edgeR, produced inconsistent results and in some cases were unable to control the false discovery rate. In addition to these observations, we were unable to find supporting evidence for a recent recommendation that limma voom, corncob, and DESeq2 are more reliable overall compared with other methods.
Although ALDEx2 and ANCOM are two promising conservative methods, we argue that those researchers requiring more sensitive methods should use a consensus approach based on multiple differential abundance methods to help ensure robust biological interpretations.

Introduction

[…] negative binomial distribution. To identify differentially abundant taxa, a null and an alternative hypothesis are compared for each taxon. The null hypothesis is that the same parameters for the negative binomial distribution explain the distribution of taxa across all sample groupings. The alternative hypothesis is that different parameters are needed to account for differences between sample groupings. If the null hypothesis can be rejected for a specific taxon, then it is considered differentially abundant. This idea is the foundation of distribution-based DA tests, including other methods such as corncob (Martin et al., 2020) and metagenomeSeq (Paulson et al., 2013), which model microbiome data with the beta-binomial and zero-inflated Gaussian distributions, respectively.
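The null-versus-alternative comparison described above can be sketched as a likelihood-ratio test. The code below is a minimal, self-contained illustration of the concept only, not a reimplementation of any evaluated tool (real tools add normalization, dispersion shrinkage, and other refinements); the function names are ours.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2, nbinom

def nb_loglik(params, counts):
    # Negative binomial parameterized by mean mu and size r;
    # scipy's nbinom takes (r, p) with p = r / (r + mu).
    mu, r = params
    if mu <= 0 or r <= 0:
        return -np.inf
    return nbinom.logpmf(counts, r, r / (r + mu)).sum()

def fit_nb(counts):
    # Maximum-likelihood fit; returns the maximized log-likelihood.
    res = minimize(lambda th: -nb_loglik(th, counts),
                   x0=[counts.mean() + 0.5, 1.0], method="Nelder-Mead")
    return -res.fun

def nb_lrt_pvalue(counts, groups):
    # Null: one NB fit for all samples. Alternative: separate fits per
    # group. Rejecting the null calls the taxon differentially abundant.
    ll_null = fit_nb(counts)
    ll_alt = sum(fit_nb(counts[groups == g]) for g in np.unique(groups))
    stat = max(0.0, 2 * (ll_alt - ll_null))
    return chi2.sf(stat, df=2)  # alternative adds one mean + one dispersion
```

Under the null, the statistic is approximately chi-squared with two degrees of freedom, since the alternative model adds one extra mean and one extra dispersion parameter.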
Compositional data analysis (CoDa) methods represent an alternative to these approaches. It has recently become more widely appreciated that sequencing data are compositional. […] A key distinction between the CoDa approaches discussed in this paper is what quantity is used as the denominator, or the reference, for the transformation. The centred log-ratio (CLR) transformation is a CoDa approach that uses the geometric mean of the relative abundance of all taxa within a sample as the reference for that sample. An extension of this approach is implemented in the tool ALDEx2 (Fernandes et al., 2014). The additive log-ratio transformation is an alternative approach where the reference is the relative abundance of a single taxon, which should be present with low variance in read counts across samples. ANCOM is one tool that implements this additive log-ratio approach (Mandal et al., 2015).

Evaluating the numerous options for analyzing microbiome data outlined above has proven difficult. This is largely because there are no gold standards against which to compare DA tool results. Simulating datasets with specific taxa that are differentially abundant is a partial solution to this problem, but it is imperfect. For example, it has been noted that parametric simulations can result in circular arguments for specific tools (Hawinkel et al., 2019): it is unsurprising that distribution-based methods perform best when applied to simulated data based on that distribution. Nonetheless, simulated data with no expected differences have been valuable for […]

Although these general observations have been well substantiated, there is less agreement regarding the performance of tools across evaluation studies. Certain observations have been reproducible, such as the higher FDR of edgeR and metagenomeSeq.
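Both log-ratio transformations are simple to state in code. The sketch below is our own numpy illustration (the pseudocount handling of zeros is one common convention, not necessarily what ALDEx2 or ANCOM do internally):

```python
import numpy as np

def clr(counts, pseudocount=1.0):
    """Centred log-ratio: each taxon is expressed relative to the
    geometric mean of all taxa in its sample (rows = samples)."""
    x = np.asarray(counts, dtype=float) + pseudocount
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

def alr(counts, ref_idx, pseudocount=1.0):
    """Additive log-ratio: each taxon is expressed relative to a single
    reference taxon, ideally one with low variance across samples."""
    x = np.asarray(counts, dtype=float) + pseudocount
    return np.log(x) - np.log(x[:, [ref_idx]])
```

Because the per-sample total cancels inside a log-ratio, applying clr to raw counts or to relative abundances gives identical results; this scale invariance is central to the CoDa argument.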
Similarly, ALDEx2 has been repeatedly shown to have low power to detect differences (Calgaro et al.). […]

Thirty-eight different datasets were included in our main analyses for assessing the characteristics of microbiome differential abundance tools. Two additional datasets were also included for a comparison of differential abundance consistency across diarrhea-related […] needed to be processed from raw sequences. These raw sequences were processed with QIIME 2. […] ASVs for each sample were then output into tab-delimited files. Rarefied tables were also produced for each dataset, where the rarefied read depth was taken to be the lowest read depth of any sample in the dataset over 2000 reads (with samples below this threshold discarded).
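The rarefaction rule just described (subsample every sample to the lowest read depth above 2000, discarding shallower samples) can be sketched as follows; this is our own minimal numpy illustration, not the code used in the study:

```python
import numpy as np

def rarefy_table(counts, min_depth=2000, rng=None):
    """Rarefy all samples (rows) to the lowest read depth >= min_depth,
    discarding samples below that threshold, by subsampling reads
    without replacement."""
    rng = rng or np.random.default_rng()
    counts = np.asarray(counts)
    depths = counts.sum(axis=1)
    keep = depths >= min_depth
    depth = depths[keep].min()  # lowest depth among retained samples
    out = []
    for row in counts[keep]:
        # expand the row into individual reads, then subsample
        reads = np.repeat(np.arange(row.size), row)
        picked = rng.choice(reads, size=depth, replace=False)
        out.append(np.bincount(picked, minlength=row.size))
    return np.array(out), keep
```

Subsampling without replacement guarantees that every retained sample ends at exactly the same depth and that no taxon count exceeds its original value.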

Differential abundance testing

We created a custom shell script (run_all_tools.sh) that ran each differential abundance tool on each dataset within this study. As input, the script took a tab-delimited ASV abundance table, a rarefied version of that same table, and a metadata file that contained a column splitting the samples into two groups for testing. This script also accepted a prevalence cut-off filter to remove ASVs below a minimum cut-off, which was set to 10% (i.e., ASVs found in fewer than 10% of samples were removed) for the filtered data analyses we present. Note that in a minority of cases a genus abundance table was input instead, in which case all options were kept the same.

When the prevalence filter option was set, the script also generated new filtered rarefied tables based on an input rarefaction depth.
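The 10% prevalence cut-off amounts to dropping any feature observed in fewer than one in ten samples. A minimal sketch (our own numpy illustration; the study's script implements this step in its own way):

```python
import numpy as np

def prevalence_filter(counts, min_prevalence=0.10):
    """Remove ASVs (columns) observed in fewer than min_prevalence of
    samples (rows), mirroring the 10% cut-off used for the filtered
    analyses."""
    counts = np.asarray(counts)
    prevalence = (counts > 0).mean(axis=0)  # fraction of samples with the ASV
    keep = prevalence >= min_prevalence
    return counts[:, keep], keep
```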

Following these steps, each individual differential abundance method was run on the input data using either the rarefied or non-rarefied table, depending on which is recommended for that tool. The workflow used to run each differential abundance tool (with run_all_tools.sh) is described below. The first step in each of these workflows was to read the dataset tables into R. […]

We converted the metadata and non-rarefied feature tables into a phyloseq object, which we input to corncob's differentialTest function (Martin et al., 2020). This function first converted the data into relative abundances and then fit each taxon abundance to a beta-binomial model, using logit link functions for both the mean and overdispersion. Because corncob models each of these simultaneously and performs both differential abundance and differential variability testing […]

t-test

We applied total sum scaling normalization to the rarefied feature table and performed an unpaired Welch's t-test for each feature to compare the specified groups. We corrected the resulting p-values for multiple testing with the BH method.

Wilcoxon test

Using raw feature abundances in the rarefied case, and CLR-transformed abundances (after applying a pseudocount of 1) in the non-rarefied case, we performed Wilcoxon rank-sum tests for each feature to compare the specified sample groupings. We corrected the resulting p-values with the BH method.
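The t-test workflow above (TSS normalization, per-feature Welch's test, BH correction) is simple enough to sketch end to end. This is a Python illustration of the procedure, not the R code used in the study; welch_da and bh_adjust are our own names:

```python
import numpy as np
from scipy.stats import ttest_ind

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity from the largest p-value downwards
    q = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(q, 1.0)
    return out

def welch_da(rarefied, groups, alpha=0.05):
    """Per-feature Welch's t-test on a TSS-normalized rarefied table
    (rows = samples, cols = features), with BH correction."""
    rel = rarefied / rarefied.sum(axis=1, keepdims=True)  # total sum scaling
    a, b = rel[groups == 0], rel[groups == 1]
    pvals = ttest_ind(a, b, equal_var=False).pvalue  # Welch's t-test
    return bh_adjust(pvals) < alpha
```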

Comparing numbers of significant hits between tools

We compared the number of significant ASVs each tool identified in the 38 different datasets. Each tool was run as described above using default settings, with some modifications suggested by the tool authors, as noted above. A heatmap representing the number of significant hits found by each tool was constructed using the pheatmap R package (Kolde, 2012). Spearman correlations between the number of significant ASVs identified by a tool and the following dataset […]

Cross-tool, within-study differential abundance consistency analysis

We compared the consistency between different tools within all datasets by pooling all ASVs identified as significant by at least one tool in the 38 different datasets. The number of methods that identified each ASV as differentially abundant was then tallied.

Cross-study differential abundance consistency analysis

Two additional datasets were acquired to bring the number of diarrhea-related datasets to five.

The ASVs in each of these datasets were previously taxonomically classified, and so we used these classifications to collapse all feature abundances to the genus level. Note that taxonomic classification was performed using several different methods, which represents another source of technical variation. We excluded unclassified and sensu stricto-labelled genera. We then ran all differential abundance tools on these datasets at the genus level. These comparisons were between the diarrhea and non-diarrhea sample groups. The same processing workflow was used for the supplementary obesity dataset comparison as well.

For each tool and study combination, we determined which genera were significantly different at an alpha of 0.05 (where relevant). For each tool we then tallied the number of times each genus was significant, i.e., how many datasets each genus was significant in based on a given tool. The null expectation distributions of these counts per tool were generated by randomly sampling genera from each dataset for 100,000 replicates. The probability of sampling a genus (i.e., calling it significant) was set equal to the proportion of actually significant genera. For each replicate we tallied the number of times each genus was sampled across datasets. We then compared the observed and expected distributions of the number of studies each genus was found to be significant in. Note that, to simplify this analysis, we ignored the directionality of the significance (e.g., whether a genus was higher in case or control samples). We excluded genera never found to be significant. We performed bootstrap Kolmogorov-Smirnov […]

To investigate how different DA tools impact biological interpretations across microbiome datasets, we tested 14 different differential abundance testing approaches (Table 1) […] including the human gut, plastisphere, freshwater lakes, and urban environments (Supp. Table 1). The features in these datasets corresponded to both ASVs and clustered operational taxonomic units, but we refer to them all as ASVs below for simplicity.

We also investigated how prevalence filtering each dataset prior to analysis impacted the observed results. We chose to either use no prevalence filtering (Fig. 1A) or a 10% prevalence filter that removed any ASVs found in fewer than 10% of samples within each dataset (Fig. 1B).
We found that in both the filtered and unfiltered analyses the percentage of significant ASVs identified by each DA method varied widely across datasets, with means ranging from 3.8-32.5% and 0.8-40.5%, respectively. Interestingly, we found that many tools behaved differently between datasets. Specifically, some tools identified the most features in one dataset while identifying only an intermediate number in other datasets. This was especially evident in the unfiltered datasets (Fig. 1A).

To investigate possible factors driving this variation, we examined how the number of […] Fig. 2A) and 0.31-0.52 for filtered data (Fig. 2B). We also found in the filtered datasets that the number of features found by all tools significantly correlated with the median read depth, range in read depth, and sample size. There was much less consistency in these correlations across the unfiltered data. For instance, only the t-test, both Wilcoxon methods, and both limma voom methods correlated significantly with the range in read depth (Fig. 2B). We also found that edgeR was negatively correlated with mean sample richness in the unfiltered analysis. […]

Despite the variability of tool performance between datasets, we did find that several tools tended to identify more significant hits (Supp. Fig 1C-D). In the unfiltered datasets, we […] other tools found 0-11% of ASVs to be significant (Fig. 1A). Similarly, we found that both limma voom methods identified over 99% of ASVs to be significant in several cases, such as the Built-[…] (Supp. Fig 1C-D). As with the unfiltered data, ANCOM-II was the most stringent method […] highest numbers of significant ASVs (Fig. 1B).

Finally, we examined the mean relative abundance of the features identified by each tool to determine whether tools may be biased toward the identification of highly abundant features.
We found that ALDEx2 (median: 0.013%) and ANCOM-II (median: 0.024%), and to a lesser degree DESeq2 (median: 0.006%), tended to find significant features that were higher in relative abundance in the unfiltered datasets. A similar trend for ALDEx2 (median: 0.011%) and ANCOM-II (median: 0.029%) was also found in the filtered datasets (Supp. Fig 1A-B). […] ASVs that were different from those of most other tools (Fig. 3A). However, we also found that many of the ASVs identified by the limma voom methods were also identified as significant […]

The overlap in significant ASVs based on the prevalence-filtered data was similar overall to the unfiltered data results (Fig. 3B). One important exception was that the limma voom approaches identified a much higher proportion of ASVs that were also identified by most other tools.

The above analyses summarized the variation in tool performance across datasets, but it is difficult to discern which tools performed most similarly from these results alone. To identify overall similarly performing tools, we conducted principal coordinates analysis based on the Jaccard distance between significant sets of ASVs (Fig. 3C, 3D). One clear trend for both unfiltered and filtered data is that edgeR and LEfSe cluster together and are separated from other methods on the first principal coordinate. Interestingly, corncob, which is a methodologically distinct approach, also clusters relatively close to these two methods on the first PC. The major outliers on the second principal coordinate differ depending on whether the data was prevalence-filtered. For the unfiltered data, the main outliers are the limma voom methods, followed by Wilcoxon (CLR; Fig. 3C). In contrast, ANCOM-II is the sole major outlier on the second principal coordinate based on the filtered data (Fig. 3D). These visualizations highlight the major tool clusters based on the resulting sets of significant ASVs. However, the percentage of variation explained by the top two components is relatively low in each case, which means that substantial information regarding potential tool clustering is missing from these panels (Supp. […]

False discovery rate of microbiome differential abundance tools depends on the dataset

We next investigated the FDR of each DA tool across eight datasets. For each dataset we selected the most frequently sampled group and randomly reassigned its samples as case or control samples. Each DA tool was then run on those samples and the results were compared. This was repeated a total of 10 times for each filtered dataset and 100 times for each unfiltered dataset (except for corncob, ANCOM-II, and ALDEx2). We used a higher number of replicates for the unfiltered datasets because they were less stable across replicates.
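The randomization scheme just described can be sketched as follows. For brevity this illustration uses a plain per-feature Welch's test with raw p-values, whereas the actual analysis ran each full DA tool and applied BH correction; fdr_replicate is our own name:

```python
import numpy as np
from scipy.stats import ttest_ind

def fdr_replicate(counts, n_case, rng=None):
    """One randomized FDR replicate: samples all come from the same real
    group, so any 'significant' feature after a random case/control split
    is a false positive. Returns the fraction of raw p-values < 0.05."""
    rng = rng or np.random.default_rng()
    labels = np.zeros(counts.shape[0], dtype=bool)
    labels[rng.choice(counts.shape[0], size=n_case, replace=False)] = True
    p = ttest_ind(counts[labels], counts[~labels], equal_var=False).pvalue
    return np.mean(p < 0.05)
```

Under the null, a well-calibrated test should flag roughly 5% of features at the raw 0.05 threshold, and essentially none after multiple-test correction.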
The percentage of significant ASVs after BH multiple-test correction was relatively low for most tested tools on the filtered data (Fig. 4B). Two outliers were edgeR (mean: 10.3%; SD: 9.0%) and LEfSe (mean: 4.4%; SD: 1.2%), which consistently identified more significant hits compared with other tools (range of other tool means: 0-2.2%). Both limma voom methods output highly variable percentages of significant ASVs, especially based on the unfiltered data (Fig. 4A). In particular, in 5/8 of the unfiltered datasets, the limma voom methods identified more than 5% of ASVs as significant on average. Interestingly, while these two methods exhibited similar performance overall, their performance within the unfiltered Freshwater-Treatment dataset differed greatly, with the TMMwsp method identifying 0.001% and the TMM method identifying 9.0% of ASVs as significant.

Only ALDEx2 and the t-test (rare) approach consistently identified no ASVs as significantly different in this analysis.

Overall, we found that the raw numbers of significant ASVs were lower in the filtered data than in the unfiltered data (as expected, given that many ASVs were filtered out), and that most tools identified only a small percentage of significant ASVs regardless of the filtering procedure. The exceptions were the two limma voom methods, which had high FDRs with unfiltered data, and edgeR and LEfSe, which had high FDRs on the filtered data. Although these tools stand out on average, we also observed that in several replicates on the unfiltered datasets the Wilcoxon (CLR) approach identified almost all features as statistically significant (Supp. Fig. 4). This was also true for both limma voom methods, which highlights that a minority of replicates are driving up the average FDR of these methods. Such extreme values were not observed for the filtered data (Supp. Fig 5).

Investigation into these outlier replicates for the Wilcoxon (CLR) approach revealed that the mean differences in read depth between the two tested groups were consistently higher in replicates in which 30% or more of ASVs were significant. Interestingly, this pattern was absent when examining replicates for the limma voom methods (Supp. Fig 6).

Tools vary in how consistently they identify the same significant genera within diarrhea case-control datasets

Separate from the above analysis comparing consistency between tools on the same dataset, we next investigated whether certain tools provide more consistent signals across datasets of the same disease. This analysis focused on the genus level across tools to help limit inter-study variation. We specifically focused on diarrhea as a phenotype, which has been shown to exhibit a strong effect on the microbiome and to be relatively reproducible across studies (Duvallet et al., […]) […] was identified as significant would not be informative.

Instead, we analyzed the observed distribution of the number of studies in which each genus was identified as significant, compared with the expected distribution given random data. This approach enabled us to compare the tools based on how much more consistently each tool performed relative to its own random expectation. For instance, on average edgeR identified significant genera more consistently across studies than ALDEx2 (the mean numbers of datasets that genera were found in were 1.67 and 1.54 for edgeR and ALDEx2, respectively). However, this observation was simply driven by the increased number of significant genera identified by edgeR. Indeed, when compared with the random expectation,

ALDEx2 displayed a 1.35-fold increase in consistency in calling significant genera in the observed data (p < 0.0001). In contrast, edgeR produced results that were only 1.10-fold more consistent than the random expectation (p = 0.02).
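The random expectation referred to here can be approximated with a simplified simulation. This sketch is ours: the actual analysis sampled genera with per-dataset significant proportions over 100,000 replicates, whereas this version assumes one shared genus pool and a single proportion:

```python
import numpy as np

def null_consistency(n_genera, n_studies, p_sig, n_reps=10_000, rng=None):
    """Null distribution of cross-study consistency: each genus is
    called significant in each study with probability p_sig (the
    observed proportion of significant calls), and we record how many
    studies each genus is significant in, dropping genera never called."""
    rng = rng or np.random.default_rng()
    calls = rng.random((n_reps, n_studies, n_genera)) < p_sig
    per_genus = calls.sum(axis=1)      # studies per genus, per replicate
    return per_genus[per_genus > 0]    # exclude never-significant genera
```

Comparing a tool's observed per-genus study counts against this distribution separates genuine cross-study reproducibility from a tool simply calling many genera significant.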

ALDEx2 and edgeR represent the extremes of how consistently tools identify the same genera as significant across studies, but there is a large range (Fig. 5). Notably, all tools were significantly more consistent than the random expectation across these datasets (p < 0.05) (Table 2). In addition to ALDEx2, the other top-performing approaches based on this evaluation included both MaAsLin2 workflows, limma voom (TMM), and ANCOM-II.

We conducted a similar investigation across five obesity 16S datasets, which was more challenging to interpret due to the lower consistency in general (Supp. Table 2). Specifically, most significant genera were called in only a single study, and only MaAsLin2 (with both non-rarefied and rarefied data) and the t-test (rare) approaches performed significantly better than […]

[…] test different hypotheses, we believe this perspective ignores how these tools are used in practice.

In particular, these tools are frequently used interchangeably in the microbiome literature.

Accordingly, an improved understanding of the variation in DA method performance is crucial to properly interpret microbiome studies. We have illustrated here that these tools can produce […] identify only a relatively small number of ASVs as significant. We hypothesize that these latter tools are more conservative and have higher precision, but with a concomitant probable loss in sensitivity. This hypothesis is related to our observation that significant ASVs identified by these two tools tended to also be identified by almost all other differential abundance methods, which we interpret to mean that these ASVs are more likely to be true positives.

Given that ASVs commonly identified as significant are likely more reliable, it is noteworthy that significant ASVs in the unfiltered data tended to be called by fewer tools. This was particularly true for both limma voom approaches and the Wilcoxon (CLR) approach.

Although it is possible that many of these significant ASVs are incorrectly missed by other tools, it is more likely that these tools simply perform especially poorly on unfiltered data for several reasons, such as data sparsity.

This issue with the limma voom approaches was also highlighted by high false positive rates on several unfiltered randomized datasets, which agrees with a past FDR assessment of this approach (Hawinkel et al., 2019). It is also important to acknowledge that our randomized approach for estimating FDR is not a perfect representation of real data; that is, real sample groupings will likely contain some systematic differences in microbial abundances (although the effect size may be very small), whereas our randomized datasets should have none.

Accordingly, identifying only a few significant ASVs under this approach is not necessarily proof that a tool has a low FDR in practice. However, tools that identified many significant ASVs in the absence of distinguishing signals likely also have a high FDR on real data. […] such (Thorsen et al., 2016), that was not the case in our analysis. This agrees with a recent report that metagenomeSeq (using the zero-inflated log-normal approach, as we did) appropriately controlled the FDR but exhibited low power (Lin and Peddada, 2020). […] taxonomic markers in categories based on the tool characteristics presented within this paper or the number of tools that agree upon their identification. Importantly, the application of multiple DA tools to the same dataset should be reported explicitly. Clearly this approach would make results more difficult to interpret biologically, but it would provide a clearer perspective on which differentially abundant features are robust to reasonable changes in the analysis.

A common counterargument to using consensus approaches with DA tools is that there is no assurance that the intersection of the tool outputs is more reliable; it is possible that the tools are simply picking up the same noise as significant. Although we think this is unlikely, running multiple DA tools is in any case still important for giving context when reporting significant features.
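A minimal version of such a consensus filter can be sketched as follows (the tool names, feature identifiers, and the min_tools threshold below are hypothetical; choosing the threshold is itself a judgment call):

```python
from collections import Counter

def consensus_significant(results_by_tool, min_tools=3):
    """Keep features called significant by at least min_tools of the DA
    methods applied to the same dataset (a simple consensus approach)."""
    # tally how many tools called each feature significant
    tally = Counter(f for sig in results_by_tool.values() for f in set(sig))
    return {f for f, n in tally.items() if n >= min_tools}
```

Reporting the per-feature tally alongside the consensus set also makes explicit which results depend on a single method.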

For example, researchers might be using a tool that produces highly non-overlapping sets of […] large numbers of significant ASVs in the unfiltered data. However, these significant ASVs tended to be more tool-specific in the unfiltered data, and there was much more variation in the percentage of significant ASVs across tools. Accordingly, we would suggest performing prevalence filtering (e.g., at 10%) of features prior to DA testing, although we acknowledge that more work is needed to estimate an optimal cut-off rather than arbitrarily selecting one. […]

We would like to thank everyone who responded to MGIL's queries on Twitter regarding which differential abundance tools to evaluate. We would also like to thank the authors of all DA tools and datasets used in this study for making their code and data freely available. JTN is funded by […]