The Microbiome and Epidemiology
Compositional data analysis of the microbiome: fundamentals, tools, and challenges
Introduction
Compositional data are vectors of nonnegative elements constrained to sum to a constant. This simple feature can have surprisingly adverse effects when traditional methods of multivariate statistics are applied naively [1]. The dangers of ignoring compositionality were noted by Pearson, who recognized more than a century ago that “spurious correlations” would result should values constructed as proportions be compared haphazardly [2]. Compositional data are subject to the “closure problem,” which occurs when components necessarily compete to make up the constant sum [3]. As a result, a large change in the absolute abundance of one component can drive apparent changes in the measured abundance of the others, violating the assumption of sample independence and creating inevitable errors in covariance estimates that can lead to bias and flawed inference. Diverse academic disciplines have begun to appreciate the complexity of analyzing compositional data, ranging from forensics [4], [5] and psychology [6] to the assessment of antibiotic use [7] and nutritional epidemiology [8].
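The closure effect described above can be illustrated with a small simulation. The following is a minimal sketch, not taken from the article: the function names `close` and `closure_demo`, the lognormal parameters, and the three-taxon setup are all assumptions chosen for illustration. Two taxa are generated independently in absolute terms, and a third, highly variable taxon is added before each sample is rescaled to proportions.

```python
import numpy as np


def close(abundances):
    """Rescale each row (sample) so that its components sum to 1."""
    abundances = np.asarray(abundances, dtype=float)
    return abundances / abundances.sum(axis=1, keepdims=True)


def closure_demo(n_samples=2000, seed=0):
    """Return (raw_r, closed_r) for two taxa that are independent in absolute terms.

    All distribution parameters are illustrative assumptions, not values
    from any real microbiome data set.
    """
    rng = np.random.default_rng(seed)
    # Taxon 0 fluctuates strongly; taxa 1 and 2 are nearly constant and
    # are drawn independently of each other and of taxon 0.
    absolute = rng.lognormal(
        mean=[5.0, 4.0, 4.0], sigma=[1.0, 0.1, 0.1], size=(n_samples, 3)
    )
    proportions = close(absolute)
    # Pearson correlation between taxa 1 and 2, before and after closure.
    raw_r = np.corrcoef(absolute[:, 1], absolute[:, 2])[0, 1]
    closed_r = np.corrcoef(proportions[:, 1], proportions[:, 2])[0, 1]
    return raw_r, closed_r
```

In this setup the absolute abundances of taxa 1 and 2 are essentially uncorrelated, yet their proportions rise and fall together because both are driven by the fluctuations of taxon 0. This is one concrete form of the spurious correlation that closure can introduce.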
In microbiome sequencing surveys, the compositional nature of the data arises because a correction must be made for samples having different numbers of sequences, while the total absolute abundance of all bacteria in each sample is unknown. These complications stem from sample collection, polymerase chain reaction (PCR) amplification, and the sequencing technology itself: the absolute abundances of bacteria are not recoverable from sequence counts, although the proportions of different taxa remain informative. Numerous schemes are used in the literature to convert the number of sequences for each taxon within each sample to relative abundance. Popular techniques include proportional abundance and rarefying, the latter being the default choice in the popular Quantitative Insights Into Microbial Ecology (QIIME) pipeline [9], [10]. Neither approach corrects for compositionality, and it has been argued that this lack of correction has led to erroneous analyses that fail to discriminate between true and spurious correlations between taxa [11], [12]. However, it remains unclear whether these normalization schemes routinely produce spurious correlations in the study of complex microbial communities, like the gut, or whether errors due to compositionality are instead restricted to the analysis of microbial communities where only a few taxa dominate, such as the vaginal microbiome.
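The two normalization schemes named above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the QIIME code: the function names `to_proportions` and `rarefy`, the `depth` and `seed` parameters, and the choice to drop samples with fewer than `depth` reads are all hypothetical. Rarefying is implemented here as subsampling reads without replacement to a common depth.

```python
import numpy as np


def to_proportions(counts):
    """Proportional abundance: divide each sample (row) by its read total."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


def rarefy(counts, depth, seed=0):
    """Subsample every sample to `depth` reads without replacement.

    Samples with fewer than `depth` reads are dropped (an assumption of
    this sketch). Returns the rarefied table and a boolean mask marking
    which input samples were kept.
    """
    counts = np.asarray(counts, dtype=int)
    rng = np.random.default_rng(seed)
    keep = counts.sum(axis=1) >= depth
    rarefied = np.zeros((int(keep.sum()), counts.shape[1]), dtype=int)
    for i, row in enumerate(counts[keep]):
        # Expand the row into one taxon label per read, then draw reads.
        reads = np.repeat(np.arange(row.size), row)
        drawn = rng.choice(reads, size=depth, replace=False)
        rarefied[i] = np.bincount(drawn, minlength=row.size)
    return rarefied, keep
```

Both outputs still carry a fixed total per sample (1 for proportions, `depth` for rarefied counts), which is why neither scheme removes the sum constraint discussed in the text.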
In this review, we examine the historical literature on the compositionality problem and some modern approaches to its solution that have been proposed for the analysis of next-generation sequencing data sets. We track recent progress and indicate where we think more research is needed. We also emphasize that the analysis of compositional data will always remain at least partially intractable, despite the development of sophisticated statistical transformations: the absolute abundances of microbes before sequencing can never be recovered from sequence data alone, and this will inevitably color inference based on compositional samples.
Section snippets
Compositional data sets are best analyzed after a log-ratio transformation
The initial literature on compositional data analysis has largely been attributed to a pioneering author, John Aitchison, whose classic treatise, “The Statistical Analysis of Compositional Data,” has remained enormously influential for nearly 3 decades [3]. However, Aitchison, developing his theory in the 1980s, was analyzing data sets considerably smaller than those of current next-generation sequencing. His examples were often sourced from geology and usually featured problems such as how
Compositional data analysis in practice
Ordination and dimensionality reduction of compositional data require several important considerations, with distance metrics chief among them. The Aitchison distance, computed from differences between log-ratio-transformed abundances summed over all taxa, is one means of working within the restrictions of the Aitchison geometry while retaining metric properties [42]. In the metagenomics literature, however, distance measures and dissimilarities like Bray–Curtis and UniFrac are much more commonly used. It remains an
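As a concrete illustration of the Aitchison distance mentioned in this snippet, the sketch below uses its standard formulation as the Euclidean distance between centered log-ratio (clr) transformed vectors. The function names `clr` and `aitchison_distance` are chosen here for illustration, and the sketch assumes strictly positive inputs; how zeros should be replaced is a separate question that it does not address.

```python
import numpy as np


def clr(x):
    """Centered log-ratio transform: log of each part minus the mean log."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        # Logs are undefined for zeros; zero replacement is left to the caller.
        raise ValueError("clr needs strictly positive parts; handle zeros first")
    log_x = np.log(x)
    return log_x - log_x.mean(axis=-1, keepdims=True)


def aitchison_distance(x, y):
    """Euclidean distance between the clr coordinates of two compositions."""
    return float(np.linalg.norm(clr(x) - clr(y)))
```

Because the clr transform subtracts the mean log from every part, multiplying a sample by any positive constant leaves its clr coordinates unchanged. For example, `aitchison_distance([1, 2, 4], [10, 20, 40])` is zero up to floating-point error, so the distance does not depend on sequencing depth in the way that distances on raw counts do.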
References (52)
- et al.
Compositional data analysis for elemental data in forensic science
Forensic Sci Int
(2009)
- et al.
Transformations for compositional data with zeros with an application to forensic evidence evaluation
Chemometr Intell Lab
(2011)
- et al.
Analysis of compositional data in communication disorders research
J Commun Disord
(2009)
- et al.
Model-based replacement of rounded zeros in compositional data: classical and robust approaches
Comput Stat Data Anal
(2012)
- et al.
Interpretation of multivariate outliers for compositional data
Comput Geosci
(2012)
- et al.
zCompositions—R package for multivariate imputation of left-censored data under a compositional approach
Chemometr Intell Lab
(2015)
A short history of compositional data analysis
Mathematical contributions to the Theory of Evolution—on a form of spurious correlation which may arise when indices are used in the measurement of organs
Proc R Soc Lond
(1897)
The Statistical Analysis of Compositional Data
(1986)
- et al.
Analysing the composition of outpatient antibiotic use: a tutorial on compositional data analysis
J Antimicrob Chemother
(2011)
Applying compositional data methodology to nutritional epidemiology
Stat Methods Med Res
QIIME allows analysis of high-throughput community sequencing data
Nat Methods
Using QIIME to Analyze 16S rRNA Gene Sequences from Microbial Communities
Curr Protoc Microbiol
Microbial co-occurrence relationships in the Human Microbiome
PLoS Comput Biol
Compositional data in community ecology: the paradigm or peril of proportions?
Ecology
Microbiome, Metagenomics and High-Dimensional Compositional Data Analysis
Annu Rev Stat Its Appl
Sparse and compositionally robust inference of microbial ecological networks
PLoS Comput Biol
A taxonomic signature of obesity in the microbiome? Getting to the guts of the matter
PLoS One
Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis
Microbiome
Isometric logratio transformations for compositional data analysis
Math Geol
Waste not, want not: why rarefying microbiome data is inadmissible
PLoS Comput Biol
Effects of library size variance, sparsity, and compositionality on the analysis of microbiome data
PeerJ Prepr
Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2
Genome Biol
Differential expression analysis for sequence count data
Genome Biol
Getting started with microbiome analysis: sample acquisition to bioinformatics
Curr Protoc Hum Genet
Reagent and laboratory contamination can critically impact sequence-based microbiome analyses
BMC Biol