Compositional data analysis of the microbiome: fundamentals, tools, and challenges

doi:10.1016/j.annepidem.2016.03.002

Annals of Epidemiology

Volume 26, Issue 5, May 2016, Pages 330-335

https://doi.org/10.1016/j.annepidem.2016.03.002

Abstract

Purpose

Human microbiome studies fall within the realm of compositional data because the absolute abundances of microbes are not recoverable from sequence data alone. In compositional data analysis, each sample consists of proportions of various organisms whose sum is constrained to a constant. This simple feature can lead traditional statistical treatments, when naively applied, to produce erroneous results and spurious correlations.

Methods

We review the origins of compositionality in microbiome data, the theory and usage of compositional data analysis in this setting, and some recent attempts at solutions to these problems.

Results

Microbiome sequence data sets are typically high dimensional, with the number of taxa much greater than the number of samples, and sparse, as most taxa are observed in only a small number of samples. These features of microbiome sequence data interact with compositionality to produce additional challenges in analysis.

Conclusions

Despite sophisticated approaches to statistical transformation, the analysis of compositional data may remain a partially intractable problem, limiting inference. We suggest that current research needs include better generation of simulated data and further study of how the severity of compositional effects changes when sampling microbial communities of widely differing diversity.

Introduction

Compositional data are vectors of nonnegative elements constrained to sum to a constant. This simple feature of compositional data can have surprisingly adverse effects when traditional methods of multivariate statistics are naively used [1]. The dangers of ignoring the effects of compositionality were noted by Pearson, who recognized more than a century ago that “spurious correlations” would result should values constructed as proportions be compared haphazardly [2]. Compositional data are subject to the “closure problem,” which occurs when components necessarily compete to make up the constant sum [3]. As a result, a large change in the absolute abundance of one component can drive apparent changes in the measured abundance of others, violating the assumption of sample independence and creating inevitable errors in covariance estimates that can lead to bias and flawed inference. Diverse academic disciplines have begun to appreciate the complexity of the analysis of compositional data, ranging from forensics [4], [5] and psychology [6] to the assessment of antibiotic use [7] and nutritional epidemiology [8].
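The closure problem described above can be illustrated with a small simulation. The sketch below is not taken from the article: it assumes three hypothetical taxa whose absolute abundances are drawn independently from a log-normal distribution (an arbitrary choice made only for illustration), closes each sample to proportions, and compares the correlation between the first two taxa before and after closure.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_closure(n_samples=5000, n_taxa=3, seed=0):
    """Draw independent absolute abundances, then close each sample to sum to 1."""
    rng = random.Random(seed)
    absolute, closed = [], []
    for _ in range(n_samples):
        # Assumption: independent log-normal abundances for every taxon.
        sample = [rng.lognormvariate(0.0, 0.5) for _ in range(n_taxa)]
        total = sum(sample)
        absolute.append(sample)
        closed.append([value / total for value in sample])
    return absolute, closed


absolute, closed = simulate_closure()
corr_absolute = pearson([s[0] for s in absolute], [s[1] for s in absolute])
corr_closed = pearson([s[0] for s in closed], [s[1] for s in closed])
print(f"correlation of absolute abundances: {corr_absolute:+.3f}")
print(f"correlation after closure:          {corr_closed:+.3f}")
```

Under these assumptions the absolute abundances are essentially uncorrelated, whereas the proportions of the same two taxa are clearly negatively correlated, purely because they compete for a fixed total. With more taxa of similar abundance the induced correlation shrinks, which is consistent with the open question of how severe the effect is in high-diversity communities.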

In the case of microbiome sequencing surveys, the compositional nature of the data comes from the fact that a correction must be made for different samples having different numbers of sequences, while the total absolute abundance of all bacteria in each sample is unknown. These complications arise from sample collection, polymerase chain reaction (PCR) amplification, and the sequencing technology itself: the absolute abundances of bacteria are not recoverable from sequence counts, although the proportions of different taxa remain informative. Numerous schemes are used in the literature to convert the number of sequences for each taxon within each sample to relative abundance. Popular techniques include proportional abundance and rarefying, the latter being the default choice in the popular Quantitative Insights Into Microbial Ecology (QIIME) pipeline [9], [10]. Neither of these approaches corrects for compositionality, and it has been argued that this lack of correction has led to erroneous analyses that fail to discriminate between true and spurious correlations between taxa [11], [12]. However, it remains unclear whether these sorts of normalization schemes routinely produce spurious correlations in the study of complex microbial communities, like the gut, or whether errors due to compositionality are instead restricted to the analysis of microbial communities where only a few taxa dominate, such as the vaginal microbiome.
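The two normalization schemes named above can be sketched as follows. This is a minimal illustration and not the QIIME implementation: the function names, the taxa, and the counts are hypothetical, and rarefying is written here as simple subsampling of reads without replacement to a common depth.

```python
import random


def proportional_abundance(counts):
    """Convert raw sequence counts per taxon to proportions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def rarefy(counts, depth, seed=0):
    """Subsample reads without replacement down to a common depth."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError("depth exceeds the number of reads in the sample")
    # Expand to one label per read, then draw `depth` reads at random.
    reads = [taxon for taxon, n in counts.items() for _ in range(n)]
    drawn = random.Random(seed).sample(reads, depth)
    rarefied = {taxon: 0 for taxon in counts}
    for taxon in drawn:
        rarefied[taxon] += 1
    return rarefied


# Hypothetical counts for two samples sequenced to different depths.
sample_a = {"Bacteroides": 600, "Prevotella": 300, "Lactobacillus": 100}
sample_b = {"Bacteroides": 6000, "Prevotella": 3000, "Lactobacillus": 1000}

props_a = proportional_abundance(sample_a)
props_b = proportional_abundance(sample_b)
rarefied_b = rarefy(sample_b, depth=1000)
print(props_a)
print(props_b)
print(rarefied_b)
```

Both outputs place samples of different sequencing depth on a common scale, but each sample still sums to a fixed total (1 for proportions, the chosen depth for rarefied counts), so the data remain compositional after either step.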

In this review, we examine the historical literature on the compositionality problem and some modern approaches to its solution that have been proposed for the analysis of next-generation sequencing data sets. We track recent progress and indicate where we think more research is needed. We also emphasize that the analysis of compositional data will always be at least partially intractable, despite the development of sophisticated statistical transformations, because the absolute abundances of microbes before sequencing can never be recovered from sequence data alone; this will inevitably color inference based on compositional samples.

Section snippets

Compositional data sets are best analyzed after a log-ratio transformation

The initial literature on compositional data analysis has largely been attributed to a pioneering author, John Aitchison, whose classic treatise, “The Statistical Analysis of Compositional Data,” has remained enormously influential for nearly 3 decades [3]. However, Aitchison, developing his theory in the 1980s, was analyzing data sets considerably smaller than those of current next-generation sequencing. His examples were often sourced from geology and usually featured problems such as how

Compositional data analysis in practice

Ordination and dimensionality reduction of compositional data require several important considerations, with distance metrics being chief among them. The Aitchison distance, formed from log-ratio differences summed over all taxa (equivalently, the Euclidean distance between centered log-ratio transformed samples), is one such means of working within the restrictions of the Aitchison geometry to retain metric properties [42]. In the metagenomics literature, however, distance measures and dissimilarities like Bray–Curtis and UniFrac are much more commonly used. It remains an
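As a rough sketch of the Aitchison distance mentioned in this snippet, the code below uses the standard formulation as the Euclidean distance between centered log-ratio (clr) transformed samples. The function names and example values are hypothetical, and the sketch assumes strictly positive parts; sparse microbiome tables would first need some form of zero replacement, which is not shown here.

```python
import math


def clr(composition):
    """Centered log-ratio transform: log of each part over the geometric mean."""
    if any(part <= 0 for part in composition):
        raise ValueError("clr is undefined for zero or negative parts")
    logs = [math.log(part) for part in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr-transformed compositions."""
    if len(x) != len(y):
        raise ValueError("compositions must have the same number of parts")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


# Hypothetical three-taxon samples; x_counts is x expressed as raw counts.
x = [0.6, 0.3, 0.1]
y = [0.2, 0.5, 0.3]
x_counts = [600, 300, 100]

d_xy = aitchison_distance(x, y)
d_scaled = aitchison_distance(x_counts, y)
print(f"distance from proportions: {d_xy:.6f}")
print(f"distance from raw counts:  {d_scaled:.6f}")
```

The two printed distances agree up to floating-point rounding, because multiplying a sample by a constant shifts every log by the same amount and the clr centering removes that shift. This scale invariance is one reason log-ratio distances are attractive when sequencing depth differs between samples.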

References (52)

G.P. Campbell et al.
Compositional data analysis for elemental data in forensic science
Forensic Sci Int
(2009)
T. Neocleous et al.
Transformations for compositional data with zeros with an application to forensic evidence evaluation
Chemometr Intell Lab
(2011)
L. Pennington et al.
Analysis of compositional data in communication disorders research
J Commun Disord
(2009)
J.A. Martín-Fernández et al.
Model-based replacement of rounded zeros in compositional data: classical and robust approaches
Comput Stat Data Anal
(2012)
P. Filzmoser et al.
Interpretation of multivariate outliers for compositional data
Comput Geosci
(2012)
J. Palarea-Albaladejo et al.
zCompositions—R package for multivariate imputation of left-censored data under a compositional approach
Chemometr Intell Lab
(2015)
J. Bacon-Shone
A short history of compositional data analysis
K. Pearson
Mathematical contributions to the Theory of Evolution—on a form of spurious correlation which may arise when indices are used in the measurement of organs
Proc R Soc Lond
(1897)
J. Aitchison
The Statistical Analysis of Compositional Data
(1986)
C. Faes et al.
Analysing the composition of outpatient antibiotic use: a tutorial on compositional data analysis
J Antimicrob Chemother
(2011)
M.L. Leite
Applying compositional data methodology to nutritional epidemiology
Stat Methods Med Res
(2014)
J.G. Caporaso et al.
QIIME allows analysis of high-throughput community sequencing data
Nat Methods
(2010)
J. Kuczynski et al.
Using QIIME to Analyze 16S rRNA Gene Sequences from Microbial Communities
Curr Protoc Microbiol
(2012)
K. Faust et al.
Microbial co-occurrence relationships in the Human Microbiome
PLoS Comput Biol
(2012)
D.A. Jackson
Compositional data in community ecology: the paradigm or peril of proportions?
Ecology
(1997)
H. Li
Microbiome, Metagenomics and High-Dimensional Compositional Data Analysis
Annu Rev Stat Its Appl
(2015)
Z.D. Kurtz et al.
Sparse and compositionally robust inference of microbial ecological networks
PLoS Comput Biol
(2015)
M.M. Finucane et al.
A taxonomic signature of obesity in the microbiome? Getting to the guts of the matter
PLoS One
(2014)
A.D. Fernandes et al.
Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis
Microbiome
(2014)
J.J. Egozcue et al.
Isometric logratio transformations for compositional data analysis
Math Geol
(2003)
P.J. McMurdie et al.
Waste not, want not: why rarefying microbiome data is inadmissible
PLoS Comput Biol
(2014)
S.J. Weiss et al.
Effects of library size variance, sparsity, and compositionality on the analysis of microbiome data
PeerJ Prepr
(2015)
M.I. Love et al.
Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2
Genome Biol
(2014)
S. Anders et al.
Differential expression analysis for sequence count data
Genome Biol
(2010)
R. Kumar et al.
Getting started with microbiome analysis: sample acquisition to bioinformatics
Curr Protoc Hum Genet
(2014)
S.J. Salter et al.
Reagent and laboratory contamination can critically impact sequence-based microbiome analyses
BMC Biol
(2014)

