Abstract
Recovering metagenome-assembled genomes (MAGs) from shotgun sequencing data is an increasingly common task in microbiome studies, as MAGs provide deeper insight into the functional potential of both culturable and non-culturable microorganisms. However, metagenome-assembled genomes vary in quality, and may contain omissions and contamination. These errors present challenges for detecting genes and comparing gene enrichment across sample types. To address this, we propose happi, an approach to testing hypotheses about gene enrichment that accounts for genome quality. We illustrate the advantages of happi over existing approaches using published Saccharibacteria MAGs and via simulation.
1 Background
Members of the same bacterial species can display a wide variety of different phenotypes, and intra-species variation in pathogenicity, virulence, drug resistance, environmental range, and stress response has been observed across the tree of life [16, 19, 23, 30, 34]. Variation in phenotypes can in part be explained by genotypic variation, which is also considerable because mechanisms of genetic recombination in bacteria facilitate large genetic variation even within narrow organismal groups. For example, of 7,385 gene clusters observed in a study of 31 genomes in the genus Prochlorococcus, only 766 gene clusters were detected in all genomes [8]. We refer to the set of genes shared by all members of a clade as the core genome and we refer to the set of genes not shared by all members as the accessory genome [33]. Together, these sets of genes comprise a clade’s pangenome: the entire collection of genes present in one or more organisms within the clade. In this paper, we describe a novel tool for pangenome analysis. Our tool is a statistical method to model the association between gene presence and covariates (predictors). Our method offers interpretable parameter estimates, a fast algorithm for estimation, and a flexible hypothesis testing procedure.
While cultivation-based studies have historically been used to study the gene content of bacteria, it has become increasingly common to employ shotgun metagenomics to study bacterial genomes and communities. Shotgun metagenomic sequencing involves untargeted sequencing of all DNA in an environment, enabling the study of genomes in their environmental context. Short reads from shotgun sequencing can be assembled into contigs and binned into metagenome-assembled genomes (MAGs), which represent a partial reconstruction of an individual bacterial genome. Despite major advances in methods for binning MAGs, MAGs can contain two types of errors. First, there can be genes that are truly present in the genome the MAG represents, but are unobserved in a MAG. Common reasons for this error include inadequate sequencing depth, high diversity in the metagenomes under study, and the inherent limitations of short read sequencing for reconstructing repetitive regions [10, 22, 24, 29, 37]. A second type of error in MAGs is erroneously observed genes: genes that are included in a MAG that are not truly present in the originating genome. This phenomenon is often referred to as contamination. The use of automated binning tools in the absence of manual inspection and refinement can lead to elevated rates of contamination. For example, the identification of contaminating contigs from manual refinement of MAGs produced by a massive unsupervised genome reconstruction effort removed 30 putative functions from a single contaminated genome [5, 20].
To address the challenges that contaminating and unobserved genes create for detecting enriched genes, our proposed method incorporates information about each genome’s quality. Under our proposed model, a gene may be unobserved in a genome either because the gene is not present in the source genome, or because it could not be recovered from the obtained sequencing data. If, for example, the coverage of short reads across the genome was high and most of the expected core genes were observed, then the lack of detection of a given gene is more likely attributable to its true absence. The user can select which variables they believe to be the most informative for genome quality in their dataset. We develop estimators of the parameters of our model, discuss interpretation of model parameters, propose a hypothesis testing approach, and illustrate the performance of our model on shotgun sequencing and simulated data.
2 Results
2.1 A Hierarchical Model for Gene Presence
We present a hierarchical model for the association between bacterial gene presence and covariates of interest (e.g., host treatment status, environment of origin, relevant confounders, etc.). We consider observations on n genomes, which could be either metagenome-assembled genomes, isolate genomes, reference genomes, or any combination. Let Yi be an indicator variable for the gene of interest being observed in genome i; that is, Yi = 1 if the gene is observed in genome i and Yi = 0 otherwise. However, we are not interested in whether the gene is observed in each genome; we are interested in whether it is present in each genome. To this end, we define λi to be a latent (unobserved) random variable that indicates if the gene is truly present in genome i (λi = 1 if present, and λi = 0 otherwise).
We propose a logistic model to connect gene presence to the covariate vector $X_i \in \mathbb{R}^p$:

$$\operatorname{logit} \Pr(\lambda_i = 1 \mid X_i) = X_i^T \beta, \tag{1}$$

where the λi’s are conditionally independent given Xi and follow a Bernoulli distribution. Therefore, when comparing groups of genomes that differ by one unit in $X_{\cdot k}$ but are alike with respect to $X_{\cdot 1}, \dots, X_{\cdot, k-1}, X_{\cdot, k+1}, \dots, X_{\cdot p}$, βk gives the difference in the log-odds that the gene will be present between these two groups of genomes. To connect λi to Yi we propose the following model:

$$\Pr(Y_i = 1 \mid \lambda_i, M_i) = \lambda_i f(M_i) + (1 - \lambda_i)\,\varepsilon, \tag{2}$$

where the Yi are conditionally independent Bernoulli random variables; ε is the probability that a gene is observed in a genome in which it is absent (e.g., due to contamination or crosstalk); $M_i \in \mathbb{R}^q$ is a vector of genome quality covariates; and $f: \mathbb{R}^q \to [0, 1]$ is a flexible function to connect quality variables to the probability of detecting a present gene. Relevant quality variables are context-dependent and could include coverage of the gene from metagenomic read recruitment, completion (percentage of single copy core genes observed in the genome), redundancy (percentage of single copy core genes observed more than once in the genome), and an indicator for the genome originating from an isolated bacterial population.
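As an illustration of the two-layer structure described above, the following Python sketch draws the latent presence indicator and the observed indicator for a single genome, and computes the marginal probability of observing the gene. The logistic-shaped detection curve `detection_prob`, its constants, and all function names are hypothetical choices made for this example; they are not part of happi.

```python
import math
import random


def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def detection_prob(coverage):
    """Hypothetical monotone f: probability of detecting a present gene.

    The logistic shape and its constants are assumptions for illustration only.
    """
    return expit(0.5 * (coverage - 5.0))


def simulate_genome(x, beta, coverage, epsilon, rng):
    """Draw (lambda_i, Y_i) for one genome under the two-layer model."""
    p_present = expit(sum(b * xi for b, xi in zip(beta, x)))
    lam = 1 if rng.random() < p_present else 0
    # A present gene is detected with probability f(M_i);
    # an absent gene is (erroneously) observed with probability epsilon.
    p_observed = detection_prob(coverage) if lam == 1 else epsilon
    y = 1 if rng.random() < p_observed else 0
    return lam, y


def marginal_obs_prob(x, beta, coverage, epsilon):
    """Pr(Y_i = 1 | X_i, M_i), summing over the latent lambda_i."""
    p_present = expit(sum(b * xi for b, xi in zip(beta, x)))
    return p_present * detection_prob(coverage) + (1.0 - p_present) * epsilon


rng = random.Random(1)
beta = [0.0, 1.0]  # intercept and one covariate effect (illustrative values)
draws = [simulate_genome([1.0, 1.0], beta, 3.0, 0.0, rng) for _ in range(5)]
print(draws)
print(marginal_obs_prob([1.0, 1.0], beta, 3.0, 0.0))
```

With ε = 0, a draw with λi = 0 can never produce Yi = 1, so every observed gene is truly present, while a non-detection remains ambiguous whenever f(Mi) < 1.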
2.2 Parameter Estimation
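The latent variable structure of our model makes the Expectation-Maximization algorithm [9] an appealing choice for estimating unknown parameters θ = (β, f). Because we do not observe λi, ε and f are not, in general, jointly identifiable. Therefore, we treat ε as a hyperparameter that can be fixed by the user or varied in sensitivity analyses. To improve stability of parameter estimates, we impose a Firth-type penalty on β. The complete data penalized log-likelihood is linear in λi, which allows us to simplify the expected complete data penalized log-likelihood at step t of an EM iteration as

$$Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^n p_i^{(t)} \left[ Y_i \log f(M_i) + (1 - Y_i) \log\left(1 - f(M_i)\right) \right] + \sum_{i=1}^n \left[ p_i^{(t)} \log \operatorname{expit}(X_i^T \beta) + \left(1 - p_i^{(t)}\right) \log\left(1 - \operatorname{expit}(X_i^T \beta)\right) \right] + \operatorname{pen}(\beta) + c^{(t)}, \tag{3}$$

where $\operatorname{expit}(x) = e^x / (1 + e^x)$ for all x, $\operatorname{pen}(\beta)$ denotes the Firth-type penalty, $c^{(t)}$ collects terms that do not depend on θ, and $p_i^{(t)} = \Pr(\lambda_i = 1 \mid Y_i, X_i, M_i; \theta^{(t)})$ can be simplified as

$$p_i^{(t)} = \frac{\Pr(\lambda_i = 1 \mid X_i; \beta^{(t)}) \, \Pr(Y_i \mid \lambda_i = 1, M_i; f^{(t)})}{\Pr(Y_i \mid X_i, M_i; \theta^{(t)})},$$

where the terms in the numerator are given in (1) and (2), and the denominator is given by

$$\Pr(Y_i \mid X_i, M_i; \theta^{(t)}) = \operatorname{expit}(X_i^T \beta^{(t)}) \, f^{(t)}(M_i)^{Y_i} \left(1 - f^{(t)}(M_i)\right)^{1 - Y_i} + \left(1 - \operatorname{expit}(X_i^T \beta^{(t)})\right) \varepsilon^{Y_i} (1 - \varepsilon)^{1 - Y_i}.$$

We maximize the expected complete data penalized log-likelihood separately for β and f. Owing to the form of the expected complete data penalized log-likelihood, efficient algorithms exist to perform each of these maximizations. Optimizing (3) with respect to β is equivalent to fitting a binomial generalized linear model with logit link function for outcomes $p_i^{(t)}$ via Firth-penalized maximum likelihood, and we find Newton’s method to be stable and fast for this purpose.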
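The E-step weight is the conditional probability that the gene is present given what was observed, which follows from Bayes' rule applied to the two layers of the model. The sketch below computes it for one genome; the function name and argument names are hypothetical, and the inputs (the current presence probability and the current value of f at the genome's quality variables) are assumed to be supplied by the caller.

```python
def posterior_presence(y, p_present, f_m, epsilon):
    """Pr(lambda_i = 1 | Y_i = y, X_i, M_i) by Bayes' rule.

    y          observed indicator (0 or 1)
    p_present  current Pr(lambda_i = 1 | X_i)
    f_m        current f(M_i), the probability of detecting a present gene
    epsilon    probability of observing a gene that is absent

    The denominator is zero only in degenerate cases (for example
    p_present = 0 and epsilon = 0 with y = 1), which this sketch
    does not handle.
    """
    like_present = f_m if y == 1 else 1.0 - f_m
    like_absent = epsilon if y == 1 else 1.0 - epsilon
    numerator = p_present * like_present
    denominator = numerator + (1.0 - p_present) * like_absent
    return numerator / denominator


# A non-detection in a high-quality genome (f = 0.8) is stronger evidence
# of absence than a non-detection in a low-quality genome (f = 0.2).
print(posterior_presence(0, 0.5, 0.8, 0.0))
print(posterior_presence(0, 0.5, 0.2, 0.0))
```

This is the mechanism by which quality information enters the analysis: the same observed non-detection contributes a different weight to the regression for β depending on how detectable the gene was in that genome.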
Optimizing for f depends on the class of functions in which f falls. We investigated two flexible non-parametric options, $f \in \mathcal{F}_{\text{mono}}$ and $f \in \mathcal{F}_{\text{spline}}$, where $\mathcal{F}_{\text{mono}}$ is the class of bounded non-decreasing functions that map from the range of the quality variable to $[0, 1]$, and $\mathcal{F}_{\text{spline}}$ is the class of linear combinations of k I-spline basis functions and a constant function where all basis functions have nonnegative coefficients. Both $\mathcal{F}_{\text{mono}}$ and $\mathcal{F}_{\text{spline}}$ result in a monotone estimate for f. To obtain the EM update for $f \in \mathcal{F}_{\text{mono}}$, we use the primal active set algorithm of isotone [7] with custom loss function given by the first term in (3) plus a penalty term, controlled by a tuning constant a, that prevents the fitted values from growing without bound. We found that setting a = 50 gives a sensible tradeoff between algorithm convergence and numerical stability. To obtain the EM update for $f \in \mathcal{F}_{\text{spline}}$, we fit a logistic regression for Yi with predictors consisting of an I-spline basis with all non-intercept coefficients constrained to be nonnegative. We use the I-spline basis functions implemented in splines2 [35]. In an analysis where we used short-read subsampling to approximate an empirical f, we found that $\mathcal{F}_{\text{spline}}$ outperformed $\mathcal{F}_{\text{mono}}$ (see Section 4.2.2), and for that reason we consider $f \in \mathcal{F}_{\text{spline}}$ throughout the remainder of this manuscript. We run the estimation algorithm for tmax steps or until the relative increase in the log-likelihood is below threshold Δ for 5 consecutive steps.
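The stopping rule can be sketched as follows. The text does not define the relative increase precisely, so this sketch assumes it is the change in log-likelihood divided by the absolute value of the previous log-likelihood; the function and argument names are hypothetical.

```python
def should_stop(loglik_history, t_max, delta, patience=5):
    """Decide whether the EM iterations should stop.

    loglik_history  log-likelihood at initialisation followed by its value
                    after each EM update
    t_max           maximum number of EM updates
    delta           threshold on the relative increase
    patience        number of consecutive small increases required

    Assumption: relative increase = (current - previous) / |previous|.
    """
    steps = len(loglik_history) - 1  # number of EM updates performed so far
    if steps >= t_max:
        return True
    if steps < patience:
        return False
    recent = loglik_history[-(patience + 1):]
    for previous, current in zip(recent, recent[1:]):
        scale = max(abs(previous), 1e-12)  # guard against division by zero
        if (current - previous) / scale >= delta:
            return False
    return True


history = [-100.0, -50.0, -49.9, -49.89, -49.889, -49.8889, -49.88889]
print(should_stop(history, t_max=1000, delta=0.01))
```

Requiring several consecutive small increases, rather than one, guards against stopping on a single flat step early in the iterations.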
2.3 Hypothesis Testing
To enable inference on the odds that a gene will be present in groups of genomes that differ in their covariate attributes, we construct a hypothesis test for null hypotheses of the form Aβ = c for $A \in \mathbb{R}^{h \times p}$ and $c \in \mathbb{R}^h$, where rank(A) = h. This allows testing of null hypotheses including βk = 0 (the odds that the gene will be present are equal when comparing groups of genomes that differ in $X_{\cdot k}$ but are alike with respect to $X_{\cdot 1}, \dots, X_{\cdot, k-1}, X_{\cdot, k+1}, \dots, X_{\cdot p}$). We propose to use a likelihood ratio test for Aβ = c, rejecting H0 at level α if

$$2\left(\ell(\hat{\theta}) - \ell(\hat{\theta}_0)\right)$$

exceeds the upper 100α% quantile of a $\chi^2_h$ distribution, where $\hat{\theta}$ is the maximum likelihood estimate of θ; $\hat{\theta}_0$ is the maximum likelihood estimate of θ under the null hypothesis; and $\ell(\cdot)$ is the log-likelihood function.
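For concreteness, the following sketch converts two maximized log-likelihoods into a likelihood ratio p-value. It assumes the usual form of the test: the statistic is twice the difference in maximized log-likelihoods, referred to a chi-squared distribution with h = rank(A) degrees of freedom. Only h = 1 and h = 2 are handled, because their tail probabilities have closed forms in the Python standard library; the function name is hypothetical.

```python
import math


def lrt_pvalue(loglik_full, loglik_null, df):
    """Likelihood ratio p-value for df = 1 or df = 2.

    Assumes the statistic 2 * (loglik_full - loglik_null) is compared with
    a chi-squared distribution on df degrees of freedom.
    """
    # Numerical error in the fits can make the difference slightly negative.
    stat = max(2.0 * (loglik_full - loglik_null), 0.0)
    if df == 1:
        # Pr(chi2_1 > s) = Pr(|Z| > sqrt(s)) = erfc(sqrt(s / 2))
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        # Pr(chi2_2 > s) = exp(-s / 2)
        return math.exp(-stat / 2.0)
    raise NotImplementedError("this sketch only covers df = 1 and df = 2")


print(lrt_pvalue(-20.0, -23.0, df=1))
```

Testing a single coefficient, such as βk = 0, corresponds to h = 1.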
2.4 Data Analysis: Saccharibacteria MAGs
We consider a publicly-available dataset of n = 43 non-redundant Saccharibacteria (TM7) MAGs recovered from supragingival plaque (n = 27) and tongue dorsum (n = 16) samples of seven individuals from [28] (see Section 4 for more information). The wide variation in mean coverage across the MAGs (1.07 – 26.35×) makes this an appealing dataset on which to illustrate our quality variable-adjusting pangenomics method.
We consider methods that allow us to test the null hypothesis that the probability (equivalently, odds) that a gene is present in Saccharibacteria genomes is equal for tongue- and plaque-associated genomes. The alternative hypothesis is that the probabilities differ. We compare our proposed method (happi: a Hierarchical Approach to Pangenomics Inference) with three competitors: a logistic regression model for Yi with a likelihood ratio test (GLM-LRT); a logistic regression model for Yi with a Rao test (GLM-Rao); and Fisher’s exact test (Fisher). Note that these latter three methods test hypotheses about the odds that a gene is observed, while our proposed approach tests hypotheses about the odds that a gene is present, but we believe that results can be reasonably compared between these methods. We consider a single quality variable Mi for our analysis with happi: mean coverage across genome i. Our primary comparison is with GLM-Rao, which is the method currently implemented for pangenomics hypothesis testing in anvi’o [28]. We also note that the results from GLM-Rao and GLM-LRT are highly correlated, especially for larger p-values.
The methods identified different sets of differentially present genes. Out of 713 COG functions tested, happi identified 171 differentially present genes when controlling false discovery rate at the 5% level; GLM-LRT identified 219 genes; GLM-Rao identified 175 genes; and Fisher identified 146 genes. Our proposed method calculated lower p-values for 20%, 35% and 85% of genes compared to GLM-LRT, GLM-Rao, and Fisher’s test, respectively. We show results from 6 specific model estimates in Figure 1: 3 genes for which happi produced greater p-values than GLM-Rao (upper panels), and 3 genes for which it produced smaller p-values than GLM-Rao (lower panels). In all instances where happi produced greater p-values than GLM-Rao, non-detections generally occurred in genomes with low mean coverage. GLM-Rao does not account for coverage information, and so unlike happi, it can conflate gene absence with non-detections due to quality. We believe that statements about significance should be moderated when detection patterns can be attributed to quality variables, and therefore that it is reasonable that p-values are larger in these three cases. In contrast, happi produced smaller p-values than GLM-Rao in instances when non-detections occurred for greater coverage MAGs, or broadly across the range of MAG coverage (lower panels). In these instances, differences in detection are less likely to be attributable to quality factors, and it is reasonable that the significance of findings can be strengthened by including data on quality variables.
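Counts of discoveries such as those above come from applying a false discovery rate procedure to the per-gene p-values. The text does not say which procedure was used; the sketch below assumes the Benjamini-Hochberg step-up procedure, and the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, level=0.05):
    """Return one rejection flag per p-value using the Benjamini-Hochberg
    step-up procedure (assumed here; the text does not name a procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank r with p_(r) <= level * r / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= level * rank / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


flags = benjamini_hochberg([0.001, 0.03, 0.035, 0.5], level=0.05)
print(sum(flags), "of", len(flags), "hypotheses rejected")
```

Because the procedure is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value meets a later threshold, as happens for the second value in this toy input.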
Figure 1. We test the null hypothesis that the probability that a gene is present is equal for tongue- and plaque-associated Saccharibacteria genomes. The top 3 panels show genes for which our proposed method resulted in greater p-values than existing methods, and the lower 3 panels show genes for which our proposed method resulted in smaller p-values than existing methods. Our method reduced p-values when differences in detection cannot be attributed to genome quality factors (here, coverage), and increased p-values in situations when non-detection may be conflated with lower quality genomes. Points have been jittered vertically to separate observations.
2.5 Simulation Study
Finally, we investigate the performance of our approach by evaluating its Type 1 error rate and power. To generate data that realistically reflect the relationship between coverage and gene detection in shotgun metagenomics studies, we construct f(·) for use in this simulation by subsampling short reads from host-associated E. coli genomes ([1]; see Section 4.2 and Figure 3). We consider q = 1 and p = 2, with Xi1 = 1, a continuous covariate Xi2 that is correlated with the quality variable Mi, and ϵ = 0. σx is a parameter that controls the degree of correlation between Mi and Xi2, with larger values resulting in less correlation between quality variables and the predictor of interest. We simulate data according to the model described in (1) and (2), with β = (β0, β1)T = (0, 0)T for Type 1 error simulations and β = (0, β1)T with β1 ≠ 0 for power simulations. Note that because Xi2 is continuous, a Fisher’s exact test cannot be applied in this setting.
The results of Type 1 error rate simulations are shown in Figure 2 (left panels). We only show results for GLM-Rao because GLM-LRT and GLM-Rao produced highly similar p-values (mean squared difference 1.3 × 10⁻⁵, correlation = 0.99996, nsim = 3000). Notably, the logistic regression methods are anti-conservative, and do not control Type 1 error rates at nominal levels.
Figure 2. Simulations can be useful for evaluating the Type 1 and Type 2 error rates of methods for testing statistical hypotheses. (left) We find that logistic regression methods do not control Type 1 error, while happi controls Type 1 error at nominal levels. (right) We evaluate the power of happi to reject a false null hypothesis, finding that larger samples have greater power. In situations with greater correlation between quality variables and the covariate of interest, happi exhibits comparatively lower power.
Figure 3. We subsampled reads from a publicly available E. coli isolate genome to understand the impact of coverage on the probability of detecting a gene, finding that the probability of detection increases with coverage. We use a nonparametric smoother to interpolate this curve and use it as the true function f in our simulations.
For example, for a 5%-level test, Type 1 error rates for GLM-LRT range from 8.6% (n = 30 and σx = 0.5; 95% CI: 6.1–11.1%) to 31.6% (n = 100 and σx = 0.25; 95% CI: 27.5–35.7%). Stated differently, under H0, GLM-LRT will return p-values that are usually too small, leading to more frequent incorrect conclusions of an association. In contrast, happi does control the Type 1 error rate, behaving near-exactly. We estimate that happi’s Type 1 error rate for a 5%-level test when n = 30 and σx = 0.5 is 5.2% (95% CI: 3.3–7.2%), and when n = 100 and σx = 0.25, happi’s empirical Type 1 error rate is 6.0% (95% CI: 3.9–8.1%). Greater correlation between the quality variable (coverage) and the covariate of interest leads to greater anti-conservativeness for logistic regression methods, which incorrectly attribute differences in gene presence to the covariate of interest. However, happi appears to control Type 1 error across the range of σx investigated here.
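The width of these intervals reflects Monte Carlo error from the 500 simulations run per setting (Section 4.2.3). The text does not state how the intervals were computed; the sketch below assumes a normal-approximation (Wald) interval for a binomial proportion, which lands within about 0.1 percentage points of the intervals quoted above. The function name is hypothetical.

```python
import math


def wald_interval(rate, n_sim, z=1.96):
    """Normal-approximation 95% interval for an empirical rejection rate.

    Assumption: the interval is rate +/- z * sqrt(rate * (1 - rate) / n_sim).
    """
    se = math.sqrt(rate * (1.0 - rate) / n_sim)
    return rate - z * se, rate + z * se


for rate in (0.086, 0.316, 0.052, 0.060):
    lower, upper = wald_interval(rate, n_sim=500)
    print(f"rate {100 * rate:.1f}%: {100 * lower:.2f}% to {100 * upper:.2f}%")
```

With 500 simulations the half-width at a true rate of 5% is roughly 2 percentage points, so an empirical rate of 6.0% is compatible with nominal control, whereas 8.6% is not.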
We show the power of happi to correctly reject a null hypothesis at the 5% level in Figure 2 (right panels). We do not evaluate power for GLM-Rao and GLM-LRT because they have uncontrolled Type 1 error rates, making them invalid tests. We observe that the power of happi to reject a false null hypothesis increases with the effect size and sample size, but decreases with greater correlation between Mi and Xi2. Stated differently, happi has low power to detect true associations between gene presence and covariates of interest when covariates are correlated with genome quality, though this can be remedied with larger sample sizes.
Taken together, these results show that happi is robust to potential correlation between covariates of interest and genome quality. This is not the case for logistic regression-based methods, which cannot distinguish between differential gene presence due to genome quality and differential gene presence due to associations with covariates. No method will perform well under the alternative with small sample sizes and high correlation (see Figure 2, third panel), but happi has some power for large sample sizes and large effect sizes in this setting, and controls Type 1 error at nominal levels regardless of the sample size.
3 Discussion
Many tools exist to study associations between microbial genome variation and microbial or host phenotypes [4, 6, 11, 18, 27]. Studies investigating the association between microbial genomes and phenotypes are often referred to as microbial genome-wide association studies (mGWAS) [21, 25]. Most mGWAS tools have been developed for the analysis of pure microbial isolates, and do not account for differential genome quality in genomes analyzed collectively. mGWAS tools may be better-suited when the hypothesized causal direction is that the presence of genetic features gives rise to a phenotypic characteristic, and not the reverse. In this paper, we propose and validate a novel method (happi) to understand how non-microbial variation (e.g., environmental variation) is associated with microbial genome variation. The implied direction of modeling is reversed in our model compared to mGWAS models: our response variable is gene presence rather than phenotype. This allows interrogation of questions about factors influencing selection pressures on genomes, rather than questions about the impact of the microbiome on phenotypic outcomes.
We view the main advantage of happi as its use of data about genome quality factors. To support the increasing use of shotgun metagenomic data to recover fragmented microbial genomes, researchers need methods that are capable of analyzing incomplete and imperfect genomes. While we are not aware of methods for modeling gene enrichment in MAGs, we offer comparisons to commonly used methods for analyzing near-complete genomes, such as Fisher’s exact test (used by PanPhlAn3 [2, 26] and Scoary [4]) and logistic regression (used by anvi’o [12, 28]; see also [3]). In situations where differences in gene detection can be attributed to differences in genome quality, happi correctly infers that gene enrichment is ambiguous, and correspondingly identifies associations as less significant compared to competitor methods. However, in situations where genome quality cannot explain gene detection patterns, happi has greater precision than other methods and produces smaller p-values. We show via simulation that the advantages of happi are most pronounced when there is correlation between covariates and quality variables.
Results generated from happi are easily interpretable with reasonable run times on a modern laptop without parallelization, averaging 1.04 seconds per gene over 713 genes in n = 43 samples with tmax = 1000 and Δ = 0.01 on a 2.6 GHz i7 processor with 16 GB RAM. Since genes are treated independently, this analysis can be trivially parallelized, and furthermore, accuracy in estimation can be traded off for reduced runtime by reducing tmax or increasing Δ.
We suggest several avenues for further research. The first is to study the impact of experimental design on the statistical power of our proposed hypothesis testing procedure. Researchers often have to decide how to allocate budget across number of samples (including replicates and control data) and sequencing depth per sample. While existing guidelines for sequencing depth have focused on taxonomy estimation, MAG reconstruction, and gene detection [13, 14, 22, 24, 31, 37], our proposed modeling approach enables the principled study of the design of shotgun sequencing experiments to maximize power to detect differences in gene presence across sample groups.
Our latent variable model has possible utility for modeling the presence of amplicon sequence variants, and could offer a method for studying patterns of sequence variant presence when shotgun sequencing is infeasible or not preferred. For example, if a sequence variant is observed Wi times in sample i, then it would be reasonable to model Yi = 1{Wi>0}. This would permit inference on the equality of the probability that the sequence variant is absent in a sample across sample groups. Notably, by choosing an ϵ > 0 (e.g., via the use of negative control samples), happi can adjust for the impact of index switching in studies that leverage multiplexing [15, 17]. We leave the application of happi to modeling the presence of amplicon sequence variants to future research.
Collectively, we have shown that happi is accurate and robust, even when genome quality is correlated with gene presence predictors. As the recovery of metagenome-assembled genomes becomes increasingly common, statistical tools that account for errors in recovered genomes become increasingly necessary. By leveraging genome quality metrics, happi provides sensible and interpretable results in an analysis of metagenome-assembled genome data, improves statistical inference under simulation, and can run efficiently on a local machine. Finally, by distributing open-source software in R implementing our proposed estimation and inference methods, we hope that happi can be used widely in a variety of genomics research settings. happi is available as an open-source R package via https://github.com/statdivlab/happi under a BSD-3-Clause license.
4 Methods
4.1 Methods: Saccharibacteria MAGs
The Saccharibacteria MAGs used in Section 2.4 were taken from publicly available data [28]. Specifically, data on genome quality metrics (i.e., mean coverage) of these Saccharibacteria MAGs were retrieved from supplementary materials at https://doi.org/10.6084/m9.figshare.11634321, and information on the presence or absence of COG functions in each MAG was extracted from the Saccharibacteria pangenome contigs databases and profiles located at https://doi.org/10.6084/m9.figshare.12217811. Functional annotation of the genes was performed using NCBI’s Clusters of Orthologous Groups (COG) database [32]. Further details on sampling, assembly, binning, and refinement can be found in [28]. In our data analysis, we specified tmax = 1000, Δ = 0.01 and ϵ = 0. We set ϵ = 0 because these MAGs had undergone careful manual refinement to remove contamination from other genomes. We suggest the use of ϵ > 0 when binning is performed automatically and without additional manual refinement.
4.2 Methods: simulation studies
4.2.1 Subsampling study of E. coli isolate DRR102664
To investigate the probability of detecting a gene that is truly present (Pr(Yi = 1|λi = 1, Mi = m)), we conducted a subsampling simulation study of an E. coli isolate genome taken from [1]. We selected E. coli isolate DRR102664 to perform our subsampling simulation and the eaeA gene (K12790) as our target gene of interest. In enteropathogenic Escherichia coli, the eaeA gene produces a 94-kDa outer membrane protein called intimin, which has been shown to be necessary to produce the attaching-and-effacing lesion. For our subsampling study, we subsampled paired sequences 50 times from the DRR102664 genome at each of the approximate coverages m = (2×, 3×,…, 24×, 25×). Coverages were estimated using the calculation . We annotated and identified the eaeA gene in each set of subsampled sequences and calculated the empirical probability of detection as the fraction of samples of coverage m that detected eaeA. The results of our subsampling investigation of the impact of coverage on the probability of detection given presence are shown in Figure 3.
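The empirical detection probability at each coverage is a simple proportion over the subsamples drawn at that coverage. A minimal sketch follows; the function name and the toy detection flags are hypothetical and are not the values from this study.

```python
def empirical_detection(detections_by_coverage):
    """Map each coverage to the fraction of subsamples in which the target
    gene was detected.

    detections_by_coverage  dict from coverage to a list of 0/1 flags,
                            one flag per subsample at that coverage
    """
    curve = {}
    for coverage in sorted(detections_by_coverage):
        flags = detections_by_coverage[coverage]
        curve[coverage] = sum(flags) / len(flags)
    return curve


# Toy input with 4 subsamples per coverage (the study used 50 per coverage).
toy = {2: [0, 0, 1, 0], 10: [1, 0, 1, 1], 25: [1, 1, 1, 1]}
print(empirical_detection(toy))
```

These per-coverage proportions are the points to which a smooth detection curve is subsequently fit.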
4.2.2 Evaluating estimators for f
Many different choices of functions f could be used to connect the probability of detecting a present gene to quality variables Mi. We evaluated two options under simulation: $f \in \mathcal{F}_{\text{mono}}$, for $\mathcal{F}_{\text{mono}}$ the class of bounded non-decreasing functions, and $f \in \mathcal{F}_{\text{spline}}$, for $\mathcal{F}_{\text{spline}}$ the class of nonnegative linear combinations of I-spline basis functions and a constant function (see Section 2.2). As in Section 2.5, we generate Mi and Xi2 with correlation governed by σx, and set Xi1 = 1, β0 = 0 and ϵ = 0. The true f(·) in this simulation is a generalized additive model with binomial response [36] fit to the observations shown in Figure 3. This was done to select a true detection curve that reflects empirical probabilities of detecting a gene at a given coverage, such as gene eaeA in E. coli isolate genome DRR102664. We evaluated both estimators via mean squared error and median squared error for estimating β1. We investigated all combinations of n ∈ {30, 50, 100}, β1 ∈ {0.5, 1, 2} and σx ∈ {0.25, 0.5}, and performed 250 draws for each combination. For 17 out of 18 combinations of n, β1 and σx, we found that $\mathcal{F}_{\text{spline}}$ outperformed $\mathcal{F}_{\text{mono}}$ with respect to median squared error, with an average reduction in median squared error of 54%. For 18 out of 18 combinations, $\mathcal{F}_{\text{spline}}$ outperformed $\mathcal{F}_{\text{mono}}$ with respect to mean squared error, with an average reduction of 51%. For this reason, we chose to set $f \in \mathcal{F}_{\text{spline}}$ as the default option in happi, and used this class of functions for both our data analyses and error rate simulations.
4.2.3 Type 1 error and power simulations
For the Type 1 error rate and power simulations shown in Section 2.5, we performed 500 simulations for each combination of σx, β1 and n. We set a minimum of 16 EM iterations, tmax = 50 and Δ = 0.1 for both the null and alternative models.
Funding
This work was supported in part by the National Institute of General Medical Sciences (R35 GM133420); and the National Institute of Environmental Health Sciences (T32ES015459).
Availability of data and materials
happi is available as an open-source R package at https://github.com/statdivlab/happi. The data supporting the conclusions of this article along with code for reproducing our results are made available at https://github.com/statdivlab/happi_supplementary.
Acknowledgements
The authors would like to thank Taylor Reiter and members of the StatDivLab for expert advice and constructive suggestions.