statSuma: automated selection and performance of statistical comparisons for microbiome studies

There is a reproducibility crisis in scientific studies. Part of this crisis arises from the incorrect application of statistical tests to data that follow inappropriate distributions, have inconsistent equivariance, or have very small sample sizes. As determining which test is most appropriate for all data in a multicategorical study (such as comparing taxa between sites in microbiome studies) is difficult, we present statSuma, an interactive Python notebook that can be run from any desktop computer using the Google Colaboratory web service and does not require a user to have any programming experience. This software assesses the underlying data structures in a given dataset to advise which pairwise or listwise statistical procedure would be best suited for all data. As some users may be interested in further mining specific trends, statSuma performs 5 different two-tailed pairwise tests (Student’s t-test, Welch’s t-test, Mann-Whitney U-test, Brunner-Munzel test, and a pairwise Kruskal-Wallis H-test) and advises the best test for each comparison. This software also advises whether ANOVA or a multicategorical Kruskal-Wallis H-test is most appropriate for a given dataset and performs both procedures. A data distribution-vs-Gaussian distribution plot is produced for each taxon at each site, and a variance plot is produced for every combination of 2 taxa at each site, so that Gaussian tests and variance tests can be visually confirmed alongside associated statistical determinants.


Introduction
The advent of readily available, low-cost high throughput DNA sequencing has promoted the rampant growth of microbiome and metagenome studies. These studies, while informative and insightful, are often fraught with well-meaning but ultimately incorrect data assumptions and statistical test applications (Martin, 2019; Free, 2020). The selection of statistical tests requires knowledge of underlying data characteristics such as sample distribution, sample size and equivariance (Figure 1), which can easily be misinterpreted, especially in complex studies with many groups and relatively small sample sizes per group (Makin and De Xivry, 2019; Konietschke, Schwab and Pauly, 2021). Statistical analyses are of considerable importance in microbiology and microbiome research to ensure accurate result reporting and optimal experimental design (Adams-Huet and Ahn, 2009). Statistical analysis is often cumbersome for non-statisticians, requiring a researcher to purchase paid statistical software or to use a programming language (e.g. Python or R) for analyses not available in such packages.
Here, we present statSuma, an interactive Python notebook (which can be run from any desktop computer using the Google Colaboratory web service) that automatically selects statistical comparisons, reports the reasoning behind each selection, and performs the comparisons for the user. The software is most concerned with some of the most popular comparison procedures: Student's (Gosset's) t-test (Student, 1908), Welch's t-test (Welch, 1947), Mann-Whitney U-test (Mann and Whitney, 1947), Brunner-Munzel test (Brunner and Munzel, 2000), analysis-of-variance (ANOVA) (Fisher, 1921), and the Kruskal-Wallis H-test (Kruskal and Wallis, 1952). Due to the excessive variance associated with microbiome studies (Falony et al., 2016; Leigh, Murphy and Walsh, 2021), it is anticipated that comparisons utilising equivariance will be rare.

Test selection
The software focusses on the comparisons most used in microbiome studies; all comparisons discussed are two-tailed analyses of independent data. For the purposes of these explanations, it is assumed that all groups used for a given comparison have more than one sample > 0 and a standard deviation > 0. Selection of a particular test is predicated on 6 main criteria, including the underlying hypothesis to be tested (discussed below). Once all these criteria are satisfactorily addressed, the most appropriate statistical comparison is determined. The user selects whether a test is to be pairwise (ngroups = 2) or listwise (ngroups > 2), and statSuma determines the minimum nsamples for each comparison and whether sample sizes are equal between all groups, thus satisfying these requirements.
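The group bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration only, not statSuma's own code: the function name `summarise_groups` and the dictionary layout are assumptions made for this example.

```python
from typing import Dict, Sequence


def summarise_groups(groups: Dict[str, Sequence[float]]) -> dict:
    """Summarise group structure for one comparison.

    Hypothetical helper (not part of statSuma): `groups` maps a group
    label (e.g. a site name) to the observations for that group.
    """
    sizes = [len(values) for values in groups.values()]
    if len(sizes) < 2:
        raise ValueError("a comparison needs at least two groups")
    return {
        "n_groups": len(sizes),                # 2 = pairwise, > 2 = listwise
        "min_n_samples": min(sizes),           # smallest group drives test eligibility
        "equal_sizes": len(set(sizes)) == 1,   # True when all groups share one size
    }


summary = summarise_groups({"site_a": [3.1, 2.9, 3.4], "site_b": [1.2, 1.5]})
print(summary)
```

Reporting the smallest group size, rather than the mean, reflects that a test's minimum sample requirement must hold for every group in the comparison.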
A Shapiro-Wilk test (Shapiro and Wilk, 1965) is employed to determine whether both datasets follow a Gaussian distribution (H0: X ~ N(μ, σ²); HA: X ≁ N(μ, σ²)). Again, P ≥ α (α = 0.05) is taken to indicate that a Gaussian distribution is observed. As with the Levene's test, statSuma sets α to 0.05 by default when determining whether a distribution is Gaussian (this can be changed by the user).
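As a minimal sketch of this check, the Shapiro-Wilk test is available in SciPy as `scipy.stats.shapiro`. The wrapper names `is_gaussian` and `both_gaussian` below are illustrative assumptions, not statSuma's own functions, and α = 0.05 is used as the adjustable threshold.

```python
import numpy as np
from scipy import stats


def is_gaussian(sample, alpha=0.05):
    """Shapiro-Wilk check: the sample is treated as Gaussian when P >= alpha.

    Hypothetical wrapper for illustration; returns (verdict, p_value).
    """
    _, p_value = stats.shapiro(np.asarray(sample, dtype=float))
    return bool(p_value >= alpha), float(p_value)


def both_gaussian(sample_a, sample_b, alpha=0.05):
    """A pairwise comparison counts as Gaussian only if both groups pass."""
    return is_gaussian(sample_a, alpha)[0] and is_gaussian(sample_b, alpha)[0]


# Evenly spaced normal quantiles stand in for well-behaved Gaussian data;
# the second sample is dominated by a single extreme value.
bell_shaped = stats.norm.ppf(np.linspace(0.05, 0.95, 20))
skewed = [1, 1, 1, 2, 1, 1, 2, 1, 1, 100]
print(is_gaussian(bell_shaped), is_gaussian(skewed))
```

Note that failing to reject H0 is not proof of normality, which is one reason a visual check of the distribution remains useful alongside the test.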
For pairwise tests, if both distributions are determined to be equivariant and Gaussian, a Student's t-test is most appropriate (Student, 1908). In this scenario, the means of two groups are compared (H0: μ(a) = μ(b); HA: μ(a) ≠ μ(b)). If both distributions are assumed to be Gaussian but not equivariant, the Welch's t-test is most appropriate (Welch, 1947); again, this is a comparison of means (H0: μ(a) = μ(b); HA: μ(a) ≠ μ(b)). Both t-tests can technically be performed with a minimum sample size of 2; however, larger sample sizes are strongly recommended to ensure statistical power (Rusticus and Lovato, 2014). If either distribution is non-Gaussian but the two distributions are equivariant, a Mann-Whitney U-test is most appropriate. Generally speaking, the Mann-Whitney U-test is a comparison of medians which is also sensitive to differences in data distributions (H0: η(a) = η(b); HA: η(a) ≠ η(b)). A Mann-Whitney U-test requires at least 8 samples per group to function correctly (Mann and Whitney, 1947; Cheung and Klotz, 1997 (b)).
Once all pairwise data examinations have been completed, one test is recommended for all comparisons to ensure an appropriate, standard approach across all comparisons in a given study.
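The pairwise decision rules above can be sketched as a small function. This is an illustration under stated assumptions rather than statSuma's implementation: the function name and return format are hypothetical, and the final branch (non-Gaussian and non-equivariant data mapped to the Brunner-Munzel test) is an assumption based on the list of tests the software performs, as that case is not spelled out in the passage above.

```python
from typing import List, Tuple

MIN_N_T_TEST = 2        # technical minimum for both t-tests
MIN_N_MANN_WHITNEY = 8  # minimum per group quoted for the Mann-Whitney U-test


def choose_pairwise_test(
    both_gaussian: bool, equivariant: bool, min_n_samples: int
) -> Tuple[str, List[str]]:
    """Return (test name, caveats) for one two-group comparison.

    Hypothetical helper for illustration only.
    """
    if min_n_samples < MIN_N_T_TEST:
        raise ValueError("each group needs at least 2 samples")
    caveats: List[str] = []
    if both_gaussian:
        # Gaussian data: comparison of means, with or without equal variances.
        test = "Student's t-test" if equivariant else "Welch's t-test"
    elif equivariant:
        test = "Mann-Whitney U-test"
        if min_n_samples < MIN_N_MANN_WHITNEY:
            caveats.append("Mann-Whitney U-test needs at least 8 samples per group")
    else:
        # Assumption: this case is not described in the text above.
        test = "Brunner-Munzel test"
    return test, caveats


print(choose_pairwise_test(both_gaussian=False, equivariant=True, min_n_samples=5))
```

Returning caveats alongside the test name keeps the sample-size warning visible without silently switching to a different test.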
As comparisons that do not require underlying Gaussian distributions provide accurate results for non-Gaussian data and as comparisons that do not require underlying equivariance provide accurate results for equivariant data (Nahm, 2016;Delacre, Lakens and Leys, 2017), these methods are ranked as follows: (

Plots
For visual analysis, statSuma offers an optional plot of the standardised distribution of each taxon alongside a generated (standardised) Gaussian distribution for comparison. For these plots, the y-axis is given as probability, as the cumulative scores of the bins for each taxon (and for the associated generated Gaussian curve) are each equated to 1 (e.g. Figure 2). For completeness, Q-Q plots are also produced for each taxon (e.g. Figure 3). Similarly, optional plots are offered to visually assess equivariance between two taxon distributions. Taxon distributions are each cumulatively standardised so that groups with unequal sample sizes can be visualised together (e.g. Figure 4). Again, the y-axis is given as probability, as the cumulative total for each taxon is equated to 1.
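One way to obtain plot values of this kind is sketched below using only the Python standard library. It is an assumption-laden illustration, not statSuma's plotting code: the function name, the choice of 10 bins and the z-score range of -4 to 4 are all hypothetical. Observations are converted to z-scores, binned, and scaled so the bin heights sum to 1; the reference Gaussian is scaled the same way.

```python
from statistics import NormalDist, mean, stdev


def standardised_bin_probabilities(values, n_bins=10, z_range=(-4.0, 4.0)):
    """Return (bin edges, observed probabilities, Gaussian probabilities).

    Hypothetical helper: both probability lists are scaled to sum to 1 so
    that groups of different sizes can share one y-axis.
    """
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        raise ValueError("standard deviation must be > 0")
    z_scores = [(v - mu) / sd for v in values]

    lo, hi = z_range
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]

    counts = [0] * n_bins
    for z in z_scores:
        # Clamp so values outside z_range fall into the outermost bins.
        index = min(max(int((z - lo) / width), 0), n_bins - 1)
        counts[index] += 1
    observed = [count / len(z_scores) for count in counts]

    # Probability mass of a standard Gaussian in each bin, rescaled to sum to 1.
    standard_normal = NormalDist()
    mass = [standard_normal.cdf(edges[i + 1]) - standard_normal.cdf(edges[i])
            for i in range(n_bins)]
    gaussian = [m / sum(mass) for m in mass]
    return edges, observed, gaussian


edges, observed, gaussian = standardised_bin_probabilities([1, 2, 3, 4, 5, 6, 7, 8])
print(observed)
print([round(g, 3) for g in gaussian])
```

The two probability lists can then be drawn as a bar chart and an overlaid curve with any plotting library.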

Application to real-world data

Discussion
This software is designed to guide researchers in choosing the most appropriate statistical comparators in microbiome studies and to provide access to some statistical tests that are not yet commonly available in professional statistical analysis packages (such as the Brunner-Munzel test). Due to the variability and cyclical shifts associated with microbiomes (Falony et al., 2016; Leigh, Murphy and Walsh, 2021), Gaussian distributions and equivariance were expected to be rare. In practice, equivariance and ubiquitous Gaussian distribution were not observed in any instance.
We advise researchers against using statSuma blindly, without ensuring that the provided tests are appropriate for the hypotheses they intend to explore. It is provided as a helpful guide in choosing and performing statistical analyses, with the aim of reducing the time researchers spend on these tasks.

Code availability
The software and dataset used for this publication are fully available at https://github.com/RobLeighBioinformatics/statSuma

Figure 1: Example of distributions and equivariance
In this example, 4 distributions are presented, 3 of which are Gaussian (as determined by the green box in the Gaussian grid) and one is non-Gaussian. There is one instance of equivariance in this example (between the red and blue distributions; indicated by the green box in the equivariance matrix) and the rest are non-equivariant (denoted as red boxes in the equivariance matrix). Black boxes in the equivariance matrix indicate a non-comparator.