## Abstract

Recovering metagenome-assembled genomes (MAGs) from shotgun sequencing data is an increasingly common task in microbiome studies, as MAGs provide deeper insight into the functional potential of both culturable and non-culturable microorganisms. However, metagenome-assembled genomes vary in quality, and may contain omissions and contamination. These errors present challenges for detecting genes and comparing gene enrichment across sample types. To address this, we propose `happi`, an approach to testing hypotheses about gene enrichment that accounts for genome quality. We illustrate the advantages of `happi` over existing approaches using published Saccharibacteria MAGs and via simulation.

## 1 Background

Members of the same bacterial species can display a wide variety of different phenotypes, and intra-species variation in pathogenicity, virulence, drug resistance, environmental range, and stress response has been observed across the tree of life [16, 19, 23, 30, 34]. Variation in phenotypes can in part be explained by genotypic variation, which is also considerable because mechanisms of genetic recombination in bacteria facilitate large genetic variation even within narrow organismal groups. For example, of 7,385 gene clusters observed in a study of 31 genomes in the genus *Prochlorococcus*, only 766 gene clusters were detected in all genomes [8]. We refer to the set of genes shared by all members of a clade as the *core genome* and we refer to the set of genes not shared by all members as the *accessory genome* [33]. Together, these sets of genes comprise a clade’s *pangenome*: the entire collection of genes present in one or more organisms within the clade. In this paper, we describe a novel tool for pangenome analysis. Our tool is a statistical method to model the association between gene presence and covariates (predictors). Our method offers interpretable parameter estimates, a fast algorithm for estimation, and a flexible hypothesis testing procedure.

While cultivation-based studies have historically been used to study the gene content of bacteria, it has become increasingly common to employ shotgun metagenomics to study bacterial genomes and communities. Shotgun metagenomic sequencing involves untargeted sequencing of all DNA in an environment, enabling the study of genomes in their environmental context. Short reads from shotgun sequencing can be assembled into contigs and binned into metagenome-assembled genomes (MAGs), which represent a partial reconstruction of an individual bacterial genome. Despite major advances in methods for binning MAGs, MAGs can contain two types of errors. First, there can be genes that are truly present in the genome the MAG represents, but are unobserved in a MAG. Common reasons for this error include inadequate sequencing depth, high diversity in the metagenomes under study, and the inherent limitations of short read sequencing for reconstructing repetitive regions [10, 22, 24, 29, 37]. A second type of error in MAGs is erroneously observed genes: genes that are included in a MAG that are not truly present in the originating genome. This phenomenon is often referred to as contamination. The use of automated binning tools in the absence of manual inspection and refinement can lead to elevated rates of contamination. For example, the identification of contaminating contigs from manual refinement of MAGs produced by a massive unsupervised genome reconstruction effort removed 30 putative functions from a single contaminated genome [5, 20].

To address the challenges that contaminating and unobserved genes create for detecting enriched genes, our proposed method incorporates information about each genome’s quality. Under our proposed model, a gene may be unobserved in a genome either because the gene is not present in the source genome, or because it could not be recovered from the obtained sequencing data. If, for example, the coverage of short reads across the genome was high and most of the expected core genes were observed, then the lack of detection of a given gene is more likely attributable to its true absence. The user can select which variables they believe to be the most informative for genome quality in their dataset. We develop estimators of the parameters of our model, discuss interpretation of model parameters, propose a hypothesis testing approach, and illustrate the performance of our model on shotgun sequencing and simulated data.

## 2 Results

### 2.1 A Hierarchical Model for Gene Presence

We present a hierarchical model for the association between bacterial gene presence and covariates of interest (e.g., host treatment status, environment of origin, or relevant confounders). We consider observations on *n* genomes, which could be metagenome-assembled genomes, isolate genomes, reference genomes, or any combination of these. Let *Y*_{i} be an indicator variable for the gene of interest being *observed* in genome *i*: *Y*_{i} = 1 if the gene is observed in genome *i* and *Y*_{i} = 0 otherwise. However, we are not interested in whether the gene is *observed* in each genome; we are interested in whether it is *present* in each genome. To this end, we define λ_{i} to be a latent (unobserved) random variable that indicates whether the gene is truly present in genome *i* (λ_{i} = 1 if present).

We propose a logistic model to connect gene presence to a covariate vector *X*_{i} = (*X*_{i1}, …, *X*_{ip})^{T}:

$$\operatorname{logit} \Pr(\lambda_i = 1 \mid X_i) = X_i^T \beta, \tag{1}$$

where the λ_{i}’s are conditionally independent given *X*_{i} and follow a Bernoulli distribution. Therefore, when comparing groups of genomes that differ by one unit in *X*_{·k} but are alike with respect to *X*_{·1}, *X*_{·2}, …, *X*_{·,k–1}, *X*_{·,k+1}, …, *X*_{·p}, *β*_{k} gives the difference in the log-odds that the gene will be present between these two groups of genomes. To connect λ_{i} to *Y*_{i}, we propose the following model:

$$Y_i \mid \lambda_i, M_i \sim \operatorname{Bernoulli}\big(\lambda_i f(M_i) + (1 - \lambda_i)\,\varepsilon\big), \tag{2}$$

where the *Y*_{i} are conditionally independent Bernoulli random variables; *ε* is the probability that a gene is observed in a genome in which it is absent (e.g., due to contamination or crosstalk); *M*_{i} is a vector of *q* genome quality covariates; and *f* is a flexible function that connects quality variables to the probability of detecting a present gene. Relevant quality variables are context-dependent and could include coverage of the gene from metagenomic read recruitment, completion (percentage of single copy core genes observed in the genome), redundancy (percentage of single copy core genes observed more than once in the genome), and an indicator for the genome originating from an isolated bacterial population.
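To make the two-stage structure concrete, the following Python sketch draws latent presence indicators from the logistic presence model and then draws observed detections from the detection model. It is an illustration only and is not part of `happi`: the function names, the detection curve `toy_f`, and all numeric values are assumptions chosen for the example.

```python
import math
import random

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def simulate_gene(X, M, beta, f, epsilon=0.0, seed=0):
    """Draw latent presence (lam) and observed detection (y) for each genome."""
    rng = random.Random(seed)
    lam, y = [], []
    for x_i, m_i in zip(X, M):
        # Presence model: logit Pr(lambda_i = 1 | X_i) = X_i^T beta
        p_present = expit(sum(b * x for b, x in zip(beta, x_i)))
        lam_i = 1 if rng.random() < p_present else 0
        # Detection model: probability f(M_i) if present, epsilon if absent
        p_obs = f(m_i) if lam_i == 1 else epsilon
        y_i = 1 if rng.random() < p_obs else 0
        lam.append(lam_i)
        y.append(y_i)
    return lam, y

def toy_f(m):
    """Hypothetical monotone detection curve in mean coverage m (an assumption)."""
    return expit(0.5 * (m - 5.0))

# Intercept plus a binary covariate; coverage spread over 1x to 26x (made-up values)
X = [[1.0, float(i % 2)] for i in range(200)]
M = [1.0 + 25.0 * i / 199 for i in range(200)]
lam, y = simulate_gene(X, M, beta=[0.0, 1.0], f=toy_f, epsilon=0.0)
```

With `epsilon = 0`, a gene is never observed in a genome where it is absent, so every observed detection implies presence, while a non-detection remains ambiguous.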

### 2.2 Parameter Estimation

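The latent variable structure of our model makes the Expectation-Maximization (EM) algorithm [9] an appealing choice for estimating unknown parameters *θ* = (*β*, *f*). Because we do not observe λ_{i}, *ε* and *f* are not, in general, jointly identifiable. Therefore, we treat *ε* as a hyperparameter that can be fixed by the user or leveraged for sensitivity analyses. To improve stability of parameter estimates, we impose a Firth-type penalty on *β*. The complete data penalized log-likelihood is linear in λ_{i}, which allows us to simplify the expected complete data penalized log-likelihood at step *t* of an EM iteration as

$$Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} p_i^{(t)} \left[ Y_i \log f(M_i) + (1 - Y_i) \log\{1 - f(M_i)\} \right] + \sum_{i=1}^{n} \left[ p_i^{(t)} \log \operatorname{expit}(X_i^T \beta) + (1 - p_i^{(t)}) \log\{1 - \operatorname{expit}(X_i^T \beta)\} \right] + \operatorname{pen}(\beta) + C, \tag{3}$$

where $\operatorname{expit}(x) = 1/(1 + e^{-x})$ for all *x*, $\operatorname{pen}(\beta)$ denotes the Firth-type penalty, *C* collects terms that do not depend on *θ*, and $p_i^{(t)} = \Pr(\lambda_i = 1 \mid Y_i, X_i, M_i; \theta^{(t)})$ can be simplified as

$$p_i^{(t)} = \frac{\Pr(Y_i \mid \lambda_i = 1, M_i; f^{(t)}) \, \Pr(\lambda_i = 1 \mid X_i; \beta^{(t)})}{\Pr(Y_i \mid X_i, M_i; \theta^{(t)})},$$

where the terms in the numerator are given in (1) and (2), and the denominator is given by

$$\Pr(Y_i \mid X_i, M_i; \theta^{(t)}) = \Pr(Y_i \mid \lambda_i = 1, M_i; f^{(t)}) \, \Pr(\lambda_i = 1 \mid X_i; \beta^{(t)}) + \varepsilon^{Y_i} (1 - \varepsilon)^{1 - Y_i} \, \Pr(\lambda_i = 0 \mid X_i; \beta^{(t)}).$$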

We maximize the expected complete data penalized log-likelihood separately for *β* and *f*. Owing to the form of the expected complete data penalized log-likelihood, efficient algorithms exist to perform each of these maximizations. Optimizing (3) with respect to *β* is equivalent to fitting a binomial generalized linear model with logit link function, taking as outcomes the current posterior probabilities that λ_{i} = 1, via Firth-penalized maximum likelihood, and we find Newton’s method to be stable and fast for this purpose.
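The E-step underlying these updates is an application of Bayes' rule to each genome. The sketch below is a minimal Python illustration under the model above, not the `happi` implementation; the function name `e_step` and its argument names are assumptions, and the current values of the presence probabilities and detection probabilities are taken as given.

```python
def e_step(y, pi, f_m, epsilon):
    """Posterior probability that the gene is present in each genome.

    y       : list of 0/1 detections
    pi      : current Pr(gene present) for each genome
    f_m     : current Pr(detected | present) for each genome
    epsilon : Pr(detected | absent), treated as a fixed hyperparameter
    """
    posterior = []
    for y_i, pi_i, f_i in zip(y, pi, f_m):
        like_present = f_i if y_i == 1 else 1.0 - f_i
        like_absent = epsilon if y_i == 1 else 1.0 - epsilon
        numerator = like_present * pi_i
        denominator = numerator + like_absent * (1.0 - pi_i)
        # Fall back to the prior if the observation has zero probability
        posterior.append(numerator / denominator if denominator > 0 else pi_i)
    return posterior

# Three genomes with the same prior: a detection, a non-detection in a
# high-quality genome, and a non-detection in a low-quality genome
post = e_step(y=[1, 0, 0], pi=[0.5, 0.5, 0.5], f_m=[0.9, 0.9, 0.2], epsilon=0.0)
```

In this toy input, the non-detection in the low-quality genome leaves a much higher posterior probability of presence than the non-detection in the high-quality genome, which is the behavior the model is designed to capture.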

Optimizing for *f* depends on the class of functions in which *f* falls. We investigated two flexible non-parametric options for *f*: the class of bounded non-decreasing functions, and the class of linear combinations of *k* I-spline basis functions and a constant function where all basis functions have nonnegative coefficients. Both classes result in a monotone estimate for *f*. To obtain the EM update over the first class, we use the primal active set algorithm of `isotone` [7] with a custom loss function given by the first term in (3) plus a penalty term, with tuning parameter *a*, that prevents the estimate from growing without bound. We found that setting *a* = 50 gives a sensible tradeoff between algorithm convergence and numerical stability. To obtain the EM update over the second class, we fit a logistic regression with predictors consisting of an I-spline basis with all non-intercept coefficients constrained to be nonnegative. We use the I-spline basis functions implemented in `splines2` [35]. In an analysis where we used short-read subsampling to approximate an empirical *f*, we found that one class outperformed the other (see Section 4.2.2), and for that reason we consider only the better-performing class throughout the remainder of this manuscript. We run the estimation algorithm for *t*_{max} steps or until the relative increase in the log-likelihood is below a threshold Δ for 5 consecutive steps.
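The stopping rule can be sketched as follows. This Python fragment is an illustration rather than the `happi` code: the function name `should_stop`, the `patience` argument, and the definition of relative increase as the change divided by the absolute value of the previous log-likelihood are assumptions, and the log-likelihood values are made up.

```python
def should_stop(logliks, t_max, delta, patience=5):
    """Decide whether to stop EM given the log-likelihood after each step.

    Stops when t_max steps have been run, or when the relative increase in
    the log-likelihood has been below delta for `patience` consecutive steps.
    Assumes no recorded log-likelihood is exactly zero.
    """
    if len(logliks) >= t_max:
        return True
    if len(logliks) < patience + 1:
        return False
    recent = logliks[-(patience + 1):]
    relative_increases = [(new - old) / abs(old)
                          for old, new in zip(recent, recent[1:])]
    return all(r < delta for r in relative_increases)

# Made-up log-likelihood trace that flattens out after a few steps
history = [-100.0, -60.0, -50.0, -49.9, -49.89, -49.889, -49.8889, -49.88889]
converged = should_stop(history, t_max=1000, delta=0.01)
```

A larger Δ or smaller *t*_{max} stops the loop earlier, which trades accuracy of the estimates for run time.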

### 2.3 Hypothesis Testing

To enable inference on the odds that a gene will be present in groups of genomes that differ in their covariate attributes, we construct a hypothesis test for null hypotheses of the form **A***β* = *c* for **A** ∈ ℝ^{h×p} and *c* ∈ ℝ^{h}, where rank(**A**) = *h*. This allows testing of null hypotheses including *β*_{k} = 0 (the odds that the gene will be present are equal when comparing groups of genomes that differ in *X*_{·k} but are alike with respect to *X*_{·1}, *X*_{·2}, …, *X*_{·,k–1}, *X*_{·,k+1}, …, *X*_{·p}). We propose to use a likelihood ratio test for *H*_{0}: **A***β* = *c*, rejecting *H*_{0} at level *α* if $2\{\ell(\hat{\theta}) - \ell(\hat{\theta}_0)\}$ exceeds the upper 100*α*% quantile of a $\chi^2_h$ distribution, where $\hat{\theta}$ is the maximum likelihood estimate of *θ*; $\hat{\theta}_0$ is the maximum likelihood estimate of *θ* under the null hypothesis; and $\ell(\cdot)$ is the log-likelihood function.
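As a worked illustration of the test, the sketch below computes a likelihood ratio statistic from two maximized log-likelihoods and converts it to a p-value using the chi-squared upper tail with *h* degrees of freedom. It uses only the Python standard library; the function names are assumptions, the log-likelihood values in the example are made up, and the closed-form tail expressions apply to integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of a chi-squared distribution, integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2.0) / r
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: two-sided normal tail plus a finite series
    tail = math.erfc(math.sqrt(x / 2.0))
    term, total = 1.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return tail + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0) * total

def lrt_pvalue(loglik_alt, loglik_null, h):
    """Likelihood ratio statistic and p-value for a null of rank h."""
    stat = 2.0 * (loglik_alt - loglik_null)
    return stat, chi2_sf(max(stat, 0.0), h)

# Made-up maximized log-likelihoods for a single-parameter null (h = 1)
stat, p_value = lrt_pvalue(loglik_alt=-20.0, loglik_null=-22.5, h=1)
```

Testing a single coefficient such as *β*_{k} = 0 corresponds to *h* = 1; a joint test of several coefficients uses the rank of **A**.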

### 2.4 Data Analysis: Saccharibacteria MAGs

We consider a publicly-available dataset of *n* = 43 non-redundant Saccharibacteria (TM7) MAGs recovered from supragingival plaque (*n* = 27) and tongue dorsum (*n* = 16) samples of seven individuals from [28] (see Section 4 for more information). The wide variation in mean coverage across the MAGs (1.07 – 26.35×) makes this an appealing dataset on which to illustrate our quality variable-adjusting pangenomics method.

We consider methods that allow us to test the null hypothesis that the probability (equivalently, odds) that a gene is present in Saccharibacteria genomes is equal for tongue- and plaque-associated genomes. The alternative hypothesis is that the probabilities differ. We compare our proposed method (`happi`: a Hierarchical Approach to Pangenomics Inference) with three competitors: a logistic regression model for *Y*_{i} with a likelihood ratio test (GLM-LRT); a logistic regression model for *Y*_{i} with a Rao test (GLM-Rao); and Fisher’s exact test (Fisher). Note that these latter three methods test hypotheses about the odds that a gene is observed, while our proposed approach tests hypotheses about the odds that a gene is present, but we believe that results can be reasonably compared between these methods. We consider a single quality variable *M*_{i} for our analysis with `happi`: mean coverage across genome *i*. Our primary comparison is with GLM-Rao, which is the method currently implemented for pangenomics hypothesis testing in anvi’o [28]. We also note that the results from GLM-Rao and GLM-LRT are highly correlated, especially for larger p-values.

The methods identified different sets of differentially present genes. Out of 713 COG functions tested, `happi` identified 171 differentially present genes when controlling false discovery rate at the 5% level; GLM-LRT identified 219 genes; GLM-Rao identified 175 genes; and Fisher identified 146 genes. Our proposed method calculated lower p-values for 20%, 35% and 85% of genes compared to GLM-LRT, GLM-Rao, and Fisher’s test, respectively. We show results from 6 specific model estimates in Figure 1: 3 genes for which `happi` produced greater p-values than GLM-Rao (upper panels), and 3 genes for which it produced smaller p-values than GLM-Rao (lower panels). In all instances where `happi` produced greater p-values than GLM-Rao, non-detections generally occurred in genomes with low mean coverage. GLM-Rao does not account for coverage information, and so unlike `happi`, it can conflate gene absence with non-detections due to quality. We believe that statements about significance should be moderated when detection patterns can be attributed to quality variables, and therefore that it is reasonable that p-values are larger in these three cases. In contrast, `happi` produced smaller p-values than GLM-Rao in instances when non-detections occurred in higher-coverage MAGs, or broadly across the range of MAG coverage (lower panels). In these instances, differences in detection are less likely to be attributable to quality factors, and it is reasonable that the significance of findings can be strengthened by including data on quality variables.
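Counting genes at a fixed false discovery rate requires a multiple-testing procedure applied to the per-gene p-values. The text above does not name the procedure used, so the sketch below shows the Benjamini-Hochberg step-up procedure purely as one common choice; the function name and the example p-values are assumptions.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_passing_rank = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: find the largest rank with p_(rank) <= alpha * rank / m
        if pvalues[i] <= alpha * rank / m:
            largest_passing_rank = rank
    return sorted(order[:largest_passing_rank])

# Made-up p-values for ten genes
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, alpha=0.05)
```

Because the rule depends on the ranks of all p-values, methods that produce systematically smaller p-values will tend to flag more genes at the same nominal level.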

### 2.5 Simulation Study

Finally, we investigate the performance of our approach by evaluating its Type 1 error rate and power. To generate data that most realistically reflects the relationship between coverage and gene detection in shotgun metagenomics studies, we construct *f*(·) for use in this simulation by subsampling short reads from host-associated *E. coli* genomes ([1]; see Section 4.2 and Figure 3). We consider *q* = 1 quality variable and *p* = 2 covariates, and let *X*_{i1} = 1 and *ϵ* = 0. The covariate of interest *X*_{i2} is generated to be correlated with the quality variable *M*_{i}: *σ*_{x} is a parameter that controls the degree of correlation between *M*_{i} and *X*_{i2}, with larger values resulting in less correlation between quality variables and the predictor of interest. We simulate data according to the model described in (1) and (2), with *β* = (0, 0)^{T} for Type 1 error simulations and *β* = (0, *β*_{1})^{T} with *β*_{1} ≠ 0 for power simulations. Note that because *X*_{i2} is continuous, a Fisher’s exact test cannot be applied in this setting.

The results of Type 1 error rate simulations are shown in Figure 2 (left panels). We only show results for GLM-Rao because GLM-LRT and GLM-Rao produced highly similar p-values (mean squared difference 1.3 × 10^{-5}, correlation = 0.99996, *n*_{sim} = 3000). Notably, the logistic regression methods are anti-conservative, and do not control Type 1 error rates at nominal levels.

For example, for a 5%-level test, Type 1 error rates for GLM-LRT range from 8.6% (*n* = 30 and *σ*_{x} = 0.5; 95% CI: 6.1–11.1%) to 31.6% (*n* = 100 and *σ*_{x} = 0.25; 95% CI: 27.5–35.7%). Stated differently, under *H*_{0}, GLM-LRT will return p-values that are usually too small, leading to more frequent incorrect conclusions of an association. In contrast, `happi` does control the Type 1 error rate, behaving near-exactly. We estimate that `happi`’s Type 1 error rate for a 5%-level test when *n* = 30 and *σ*_{x} = 0.5 is 5.2% (95% CI: 3.3–7.2%), and when *n* = 100 and *σ*_{x} = 0.25, `happi`’s empirical Type 1 error rate is 6.0% (95% CI: 3.9–8.1%). Greater correlation between the quality variable (coverage) and the covariate of interest leads to greater anti-conservativeness for logistic regression methods, which incorrectly attribute differences in gene presence to the covariate of interest. However, `happi` appears to control Type 1 error across the range of *σ*_{x} investigated here.

We show the power of `happi` to correctly reject a null hypothesis at the 5% level in Figure 2 (right panels). We do not evaluate power for GLM-Rao and GLM-LRT because they have uncontrolled Type 1 error rates, making them invalid tests. We observe that the power of `happi` to reject a false null hypothesis increases with the effect size and sample size, but decreases with greater correlation between *M*_{i} and *X*_{i2}. Stated differently, `happi` has low power to detect true associations between gene presence and covariates of interest when covariates are correlated with genome quality, though this can be remedied with larger sample sizes.

Taken together, these results show that `happi` is robust to potential correlation between covariates of interest and genome quality. This is not the case for logistic regression-based methods, which cannot distinguish between differential gene presence due to genome quality and differential gene presence due to associations with covariates. No method will perform well under the alternative with small sample sizes and high correlation (see Figure 2, third panel), but `happi` has some power for large sample sizes and large effect sizes in this setting, and controls Type 1 error at nominal levels regardless of the sample size.
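Empirical error rates of this kind are proportions of rejections across simulated datasets, and their uncertainty can be summarized with a binomial confidence interval. The sketch below uses a normal-approximation (Wald) interval; the interval method is an assumption, since the text does not state how the reported intervals were computed, and the function name is hypothetical.

```python
import math

def rejection_rate_ci(n_reject, n_sim, z=1.96):
    """Empirical rejection rate with a normal-approximation 95% interval."""
    rate = n_reject / n_sim
    half_width = z * math.sqrt(rate * (1.0 - rate) / n_sim)
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)

# 30 rejections out of 500 simulated datasets, i.e. a 6.0% empirical rate
rate, lower, upper = rejection_rate_ci(30, 500)
```

With 500 simulations per setting (Section 4.2.3), a 6.0% rate gives an interval of roughly 3.9% to 8.1% under this approximation, in line with the interval reported for `happi` above.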

## 3 Discussion

Many tools exist to study associations between microbial genome variation and microbial or host phenotypes [4, 6, 11, 18, 27]. Studies investigating the association between microbial genomes and phenotypes are often referred to as microbial genome-wide association studies (mGWAS) [21, 25]. Most mGWAS tools have been developed for the analysis of pure microbial isolates, and do not account for differential genome quality in genomes analyzed collectively. mGWAS tools may be better-suited when the hypothesized causal direction is that the presence of genetic features gives rise to a phenotypic characteristic, and not the reverse. In this paper, we propose and validate a novel method (`happi`) to understand how non-microbial variation (e.g., environmental variation) is associated with microbial genome variation. The implied direction of modeling is reversed in our model compared to mGWAS models: our response variable is gene presence rather than phenotype. This allows interrogation of questions about factors influencing selection pressures on genomes, rather than questions about the impact of the microbiome on phenotypic outcomes.

We view the main advantage of `happi` as its use of data about genome quality factors. To support the increasing use of shotgun metagenomic data to recover fragmented microbial genomes, researchers need methods that are capable of analyzing incomplete and imperfect genomes. While we are not aware of methods for modeling gene enrichment in MAGs, we offer comparisons to commonly used methods for analyzing near-complete genomes, such as Fisher’s exact test (used by PanPhlAn3 [2, 26] and Scoary [4]) and logistic regression (used by anvi’o [12, 28]; see also [3]). In situations where differences in gene detection can be attributed to differences in genome quality, `happi` correctly infers that gene enrichment is ambiguous, and correspondingly identifies associations as less significant compared to competitor methods. However, in situations where genome quality cannot explain gene detection patterns, `happi` has greater precision than other methods and produces smaller p-values. We show via simulation that the advantages of `happi` are most pronounced when there is correlation between covariates and quality variables.

Results generated from `happi` are easily interpretable with reasonable run times on a modern laptop without parallelization, averaging 1.04 seconds per gene over 713 genes in *n* = 43 samples with *t*_{max} = 1000 and Δ = 0.01 on a 2.6 GHz i7 processor with 16 GB RAM. Since genes are treated independently, this analysis can be trivially parallelized, and furthermore, accuracy in estimation can be traded off for reduced runtime by reducing *t*_{max} or increasing Δ.

We suggest several avenues for further research. The first is to study the impact of experimental design on the statistical power of our proposed hypothesis testing procedure. Researchers often have to decide how to allocate budget across number of samples (including replicates and control data) and sequencing depth per sample. While existing guidelines for sequencing depth have focused on taxonomy estimation, MAG reconstruction, and gene detection [13, 14, 22, 24, 31, 37], our proposed modeling approach enables the principled study of the design of shotgun sequencing experiments to maximize power to detect differences in gene presence across sample groups.

Our latent variable model has possible utility for modeling the presence of amplicon sequence variants, and could offer a method for studying patterns of sequence variant presence when shotgun sequencing is infeasible or not preferred. For example, if a sequence variant is observed *W*_{i} times in sample *i*, then it would be reasonable to model *Y*_{i} = **1**{*W*_{i} > 0}. This would permit inference on the equality of the probability that the sequence variant is absent in a sample across sample groups. Notably, by choosing an *ϵ* > 0 (e.g., via the use of negative control samples), `happi` can adjust for the impact of index switching in studies that leverage multiplexing [15, 17]. We leave the application of `happi` to modeling the presence of amplicon sequence variants to future research.

Collectively, we have shown that `happi` is accurate and robust, even when genome quality is correlated with gene presence predictors. As the recovery of metagenome-assembled genomes becomes increasingly common, statistical tools that account for errors in recovered genomes become increasingly necessary. By leveraging genome quality metrics, `happi` provides sensible and interpretable results in an analysis of metagenome-assembled genome data, improves statistical inference under simulation, and can run efficiently on a local machine. Finally, by distributing open-source software in `R` implementing our proposed estimation and inference methods, we hope that `happi` can be used widely in a variety of genomics research settings. `happi` is available as an open-source `R` package via https://github.com/statdivlab/happi under a BSD-3-Clause license.

## 4 Methods

### 4.1 Methods: Saccharibacteria MAGs

The Saccharibacteria MAGs used in Section 2.4 were taken from publicly available data [28]. Specifically, data on genome quality metrics (i.e., mean coverage) of these Saccharibacteria MAGs were retrieved from supplementary materials at https://doi.org/10.6084/m9.figshare.11634321, and information on the presence or absence of COG functions in each MAG was extracted from the Saccharibacteria pangenome contigs databases and profiles located at https://doi.org/10.6084/m9.figshare.12217811. Functional annotation of the genes was performed using NCBI’s Clusters of Orthologous Groups (COG) database [32]. Further details on sampling, assembly, binning, and refinement can be found in [28]. In our data analysis, we specified *t*_{max} = 1000, Δ = 0.01 and *ϵ* = 0. We set *ϵ* = 0 because these MAGs had undergone careful manual refinement to remove contamination from other genomes. We suggest the use of *ϵ* > 0 when binning is performed automatically and without additional manual refinement.

### 4.2 Methods: simulation studies

#### 4.2.1 Subsampling study of *E. coli* isolate DRR102664

To investigate the probability of detecting a gene that is truly present, Pr(*Y*_{i} = 1 | λ_{i} = 1, *M*_{i} = *m*), we conducted a subsampling simulation study of an *E. coli* isolate genome taken from [1]. We selected *E. coli* isolate DRR102664 to perform our subsampling simulation and the `eaeA` gene (K12790) as our target gene of interest. In enteropathogenic *Escherichia coli*, the `eaeA` gene produces a 94-kDa outer membrane protein called intimin, which has been shown to be necessary to produce the attaching-and-effacing lesion. For our subsampling study, we subsampled paired sequences 50 times from the DRR102664 genome at approximate coverages *m* = 2×, 3×, …, 24×, 25×. Coverages were estimated by calculation. We annotated and identified the `eaeA` gene in each set of subsampled sequences and calculated the empirical probability of detection as the fraction of subsamples at coverage *m* in which `eaeA` was detected. The results of our subsampling investigation of the impact of coverage on the probability of detection given presence are shown in Figure 3.
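The empirical detection probability at each coverage is a simple per-coverage fraction. The sketch below illustrates the calculation in Python; the function name `detection_curve` and the input records are made up for illustration and are not the study data.

```python
from collections import defaultdict

def detection_curve(records):
    """Fraction of subsamples at each coverage in which the gene was detected.

    records: iterable of (coverage, detected) pairs, one per subsample.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for coverage, detected in records:
        totals[coverage] += 1
        hits[coverage] += 1 if detected else 0
    return {m: hits[m] / totals[m] for m in sorted(totals)}

# Made-up outcomes: four subsamples at 2x coverage and four at 10x coverage
records = [(2, False), (2, True), (2, False), (2, False),
           (10, True), (10, True), (10, True), (10, False)]
curve = detection_curve(records)
```

In the study design described above, each coverage level would contribute 50 such records.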

#### 4.2.2 Evaluating estimators for *f*

Many different choices of functions *f* could be used to connect the probability of detecting a present gene to quality variables *M*_{i}. We evaluated two options under simulation: an estimator over the class of bounded non-decreasing functions, and an estimator over the class of monotone I-spline functions (see Section 2.2). As in Section 2.5, we set *X*_{i1} = 1, *β*_{0} = 0 and *ϵ* = 0, and generated the remaining variables in the same way. The true *f*(·) in this simulation is a binomial generalized additive model [36] fit to the observations shown in Figure 3. This was done to select a true detection curve that reflects empirical probabilities of detecting a gene at a given coverage, such as gene `eaeA` in *E. coli* isolate genome DRR102664. We evaluated all estimators via mean squared error and median squared error for estimating *β*_{1}. We investigated all combinations of *n* ∈ {30, 50, 100}, *β*_{1} ∈ {0.5, 1, 2} and *σ*_{x} ∈ {0.25, 0.5}, and performed 250 draws for each combination. For 17 out of 18 combinations of *n*, *β*_{1} and *σ*_{x}, we found that one estimator outperformed the other with respect to median squared error, with an average reduction in median squared error of 54%. For 18 out of 18 combinations, the same estimator outperformed the other with respect to mean squared error, with an average reduction of 51%. For this reason, we chose to set this estimator as the default option in `happi`, and used this class of functions for both our data analyses and error rate simulations.
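The two evaluation criteria can differ substantially when a few simulation draws yield extreme estimates, which is one reason to report both. The sketch below computes mean and median squared error for a set of estimates of *β*_{1}; the function name and the example estimates are assumptions for illustration.

```python
import statistics

def squared_error_summary(estimates, truth):
    """Mean and median squared error of estimates around the true value."""
    squared_errors = [(e - truth) ** 2 for e in estimates]
    return statistics.mean(squared_errors), statistics.median(squared_errors)

# Made-up estimates of beta_1 = 1 across four draws, one of them extreme
mse, median_se = squared_error_summary([0.9, 1.1, 1.0, 3.0], truth=1.0)
```

Here the single extreme draw dominates the mean squared error while leaving the median squared error small.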

#### 4.2.3 Type 1 error and power simulations

For the Type 1 error rate and power simulations shown in Section 2.5, we performed 500 simulations for each combination of *σ*_{x}, *β*_{1} and *n*. We set a minimum of 16 EM iterations, *t*_{max} = 50 and Δ = 0.1 for both the null and alternative models.

## Funding

This work was supported in part by the National Institute of General Medical Sciences (R35 GM133420); and the National Institute of Environmental Health Sciences (T32ES015459).

## Availability of data and materials

`happi` is available as an open-source `R` package at https://github.com/statdivlab/happi. The data supporting the conclusions of this article along with code for reproducing our results are made available at https://github.com/statdivlab/happi_supplementary.

## Acknowledgements

The authors would like to thank Taylor Reiter and members of the StatDivLab for expert advice and constructive suggestions.