Cryptic Functional Variation in the Human Gut Microbiome

1. CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not. Abstract Background: The human gut microbiome harbors microbes that perform diverse biochemical functions. Previous work suggested that functional variation between gut microbiota is small relative to taxonomic variation. However, these conclusions were largely based on broad pathways and qualitative patterns. Identifying microbial genes with highly variable or invariable abundance across hosts requires a new statistical test.


Background
The microbes that inhabit the human gut encode a wealth of proteins that contribute to a broad range of biological functions, from modulating the human immune system [1,2,3] to participating in metabolism [4,5].Shotgun metagenomics is revolutionizing our ability to identify and quantify protein-coding genes from these microbes.However, we still lack a comprehensive understanding of the factors that govern host-microbe interactions and microbial tness within the gut.For example, some bacteria become long-term gut residents while others (e.g., some probiotics [6]) only inhabit the gastro-intestinal (GI) tract transiently.Dierences in microbiome composition and diversity have been associated with many diseases.Establishing causality and designing microbiome therapies will require much deeper understanding of microbially-encoded genes and their relationships to human genetics 2 and the gut environment.This knowledge could not only oer mechanistic explanations for host-microbe associations, but may ultimately help us to engineer the gut microbiome for human health.
Recent screens have shed light on bacterial genes allowing human colonization, such as sugar or polysaccharide utilization genes (e.g.[7,8]).These screens, while highly informative, are labor-intensive and limited to specic bacterial clades (e.g., Bacteroides or Lactobacilli ).They also require not only that a particular species be culturable, but also that techniques for forward genetics (e.g., transposon insertion) have been developed in that species.Furthermore, it is likely that multiple niches exist within the human gut, and that the metabolic environment of the gut may be inuenced by factors like diet [9,10], variation in immune function [11,12], prior antibiotic use [13,14], and host genetics [15].This suggests that many dierent symbiotic gut microbial lifestyles are possible and that selective pressures are likely to dier between individual hosts.For these reasons, a computational approach incorporating high-throughput microbiome sequencing data from multiple human populations could oer unique insights.
Gene families that are necessary for life in the gut are expected to have consistent levels across dierent human hosts, spanning geography and other variables like age and sex.Conversely, gene families that contribute to survival in one particular type of gut environment (e.g., associated with a particular diet) should vary between subjects.Therefore, we propose that the variability of gene families, rather than only their average abundance across sampled metagenomes, is a statistic that can be informative regarding selection in the gut and dierences in host-microbe interactions across people.
There is substantial interest in characterizing the extent of functional variation across gut microbiomes [16,17].Previous work suggested that the overall variability of gene functions across metagenomes is lower than 1. the variability of those same functions across fullysequenced genomes [17] and 2. the variability of taxa in the same metagenomes [16] (the functional stability hypothesis).However, these analyses tended to be qualitative or binary in nature, as opposed to probabilistic, and/or have been conducted on relatively broad units of function, such as biological pathways or the entire set of annotated functions.On the other hand, microbe-host interactions usually depend upon specic genes, such as colitis-inducing cytolethal distending toxins of Helicobacter hepaticus [18] and the enzymes of commensal bacteria that protect against these toxins by producing anti-inammatory polysaccharide A [19].Similarly, dierences in drug ecacy [20] or side-eects [21] across patients can be attributed to levels of specic microbiome proteins.
To enable high-resolution, quantitative analysis of functional stability in the microbiome, we developed a statistical test that identies individual gene families whose abundances are either signicantly variable or invariable across samples, without needing to choose a comparison group from either metagenomes or whole genomes.Our method ts gene family abundances to a model and calculates the variance of the residuals, then compares this residual variance statistic to a data-driven null distribution based on the negative binomial distribution.Using simulated data, we show that under certain assumptions, this test has high power (>90%) and controls the false-positive rate appropriately.When these assumptions are violated, we can use simulations to control the false discovery rate (FDR) empirically.
We apply this test to healthy gut metagenomes (n = 123) spanning three dierent shotgun sequencing studies and nd both signicantly invariable (3,768) and variable (1,219) gene families (FDR<5%).Many pathways, including some commonly viewed as housekeeping or previously identied as stable across gut microbiota (e.g., central carbon metabolism), include signicantly variable gene families.Phylogenetic distribution (PD) correlates overall with variability in gene family abundance, and exceptions to this trend highlight functions that may be involved in adaptation, such as two-component signaling and specialized secretion systems.Proteobacteria emerge as a source for genes with the greatest variability in abundance across hosts, suggesting a relationship between inammation and gene-level dierences in what gut microbiota are doing.

A null distribution based on the negative binomial allows detection of signicant (in)variability
While calculating the residual variance statistic V g allows a ranking of gene families by variability, by itself, it does not tell us whether this variability (or lack thereof ) is surprising.
Since there is no straightforward formula for the p-value associated with this statistic, we developed a method to assess signicance (Supplemental Figure S1).We choose the null hypothesis that the metagenomic read count data (before any normalization, e.g., for gene length and average genome size) are distributed negative-binomially.The negative binomial distribution is frequently used to model high-throughput sequencing data [22], and can be conceptualized as an overdispersed Poisson where the variance can exceed the mean.
To obtain our null distribution, we allow the mean value to change for dierent gene families as observed, but x the overdispersion parameter k that controls the mean-variance relationship across genes.This has similarities to previous approaches to model RNAseq distributions [23] and to identify (in)variable genes from single-cell RNAseq data [24] (see also Discussion).Intuitively, this null hypothesis means that the gene families we identify as signicantly variable or invariable will be those whose abundances are over-or underdispersed, respectively, relative to the median gene family.This choice of null is important: if we instead simply test for high variance, regardless of mean abundance, highly abundant gene families (e.g., single-copy proteins in the bacterial ribosome) are signicantly variable despite being nearly universally present at equal abundance in each bacterial genome, because genes with high mean abundance will have high variance in any sequencing experiment.
Conversely, thousands of lower-abundance gene families would appear to be signicantly invariable simply by virtue of having relatively low read counts.
Based on simulations with similar properties to real data (see Methods, Supplemental Figure S3), we nd that this test has high power (> 0.9).The false positive rate (i.e., type I error α) is well-controlled as long as the overdispersion parameter k used in the null distribution is accurately estimated.This appears to be easiest to achieve when fewer than 50% of gene families are signicantly variable or invariable.To make the test more robust to estimation of k, we developed a simulation based method to empirically identify signicance thresholds that control FDR for both variable and invariable families (Supplemental Table S7).

Thousands of gene families in the gut microbiome are signicantly (in)variable
To describe variation within healthy gut microbiota across dierent human populations, we randomly selected 123 metagenomes of healthy individuals from the Human Microbiome Project (HMP) [16], controls in a study of type II diabetes (T2D) [25], and controls in a study of glucose control (GC) [26].These span American, Chinese, and European populations, respectively (see Methods).We map these metagenomes to KEGG Orthology families with ShotMAP [27] and quantify gene family abundance using log-transformed reads per kilobase of genome equivalents (RPKG) [28].This produces an n × m data matrix D of log-RPKG abundances consisting of n = 17,417 gene families across m = 123 healthy human gut samples.
Before testing for signicantly variable and invariable KEGG families, we conducted a simulation that mirrored this dataset (n = 120, variable-to-invariable gene family ratio between 1:2 and 1:3).The observed α was somewhat higher than the targeted level for the variable gene families (Supplemental Figure S4).We therefore used the simulation results to empirically identify FDR controlling signicance thresholds.
We nd 2,357 gene families with more variability than expected and 5,432 with less (leaving 9,628 non-signicant) at an empirical FDR of 5% (Supplemental Figure S5).Restricting ourselves further to gene families with at least one annotated representative from a bacterial or archaeal genome in KEGG, we obtain 1,219 signicantly variable and 3,813 signicantly invariable gene families (and 2,194 non-signicant).The dierences in the residual variation of these gene families can be visualized using a heatmap of the residual g,s values (Supplemental Figures S6, S7).
Importantly, the magnitude of the residual variance statistic V g is not the sole determi- nant of signicance, as observed by the overlap in distributions of V g between the variable, invariable, and non-signicant gene families.For example, both low-abundance gene families with many zero values and high-abundance but invariable gene families will tend to have low residual variance, but the evidence for invariability is much stronger for the second group.
Our test accurately discriminates between these scenarios, tending to call the second group signicantly invariable and not the rst (Supplemental Figure S5, inset).

Biological pathways, including those in central metabolism, contain a range of stable and variable components
It has been observed that person-to-person dierences in the taxonomic composition of healthy gut microbiomes are much larger than dierences in functional composition [16].

6
. CC-BY 4.0 International license peer-reviewed) is the author/funder.It is made available under a The copyright holder for this preprint (which was not .http://dx.doi.org/10.1101/056614doi: bioRxiv preprint first posted online Jun. 2, 2016; This has been interpreted as evidence that diverse gut microbiota are doing the same things [16,17]. Our results support but also qualify this conclusion.Many of the specic pathways identied as stable (e.g., aminoacyl-tRNA metabolism, central carbon metabolism) have more stable than variable genes in our analysis.However, these pathways also include signicantly variable genes (Figure 2).For example, even the highly conserved KEGG module set aminoacyl tRNA includes one variable gene at an empirical FDR of 5%, SepRS.SepRS is an O-phosphoseryl-tRNA synthetase, which is an alternative route to biosynthesis of cysteinyl-tRNA in methanogenic archaea [29].Methanogen abundance has previously been noted to be variable between individual human guts, though this may be due to variability in DNA extraction for archaea and not due to true dierences in abundance [30].Another gene in this category is variable at a weaker level of signicance (10% empirical FDR): PoxA, a variant lysyl-tRNA synthetase.Recent experimental work has shown that this protein has a diverged, novel functionality, lysinylating the elongation factor EF-P [31,32].
By comparison, in the KEGG module set central carbohydrate metabolism, 77% of the tested prokaryotic gene families were signicantly stable, and 5.6% (5 genes) were signicantly variable (Figure 2) at an empirical FDR of 5%.In this case, the variable gene families highlight the complexities of microbial carbon utilization.Glucose can be metabolized by two alternative pathways: the more famous Embden-Meyerhof-Parnas (EMP) pathway (i.e., classical glycolysis), or the Entner-Doudoro pathway (ED).Both take glucose to pyruvate, but with diering yields of ATP and electron carriers; ED also allows growth on sugar acids like gluconate [33].Indeed, while all genes in the core module of glycolysis dealing with 3-carbon compounds were signicantly invariable across individuals, the ED-specic gene family edd, which takes 6-phosphogluconate to 2-keto-3-deoxy-phosphogluconate (KDPG), was signicantly variable according to our test.
We discovered signicant variability in abundance for other unusual glycolytic enzymes and enzymes in the tricarboxylic acid cycle (TCA).Multifunctional (K16306, K01622) variants of fructose-bisphosphate aldolase were signicantly variable, while the typical FBA enzyme (FbaA) was signicantly stable.A subunit of fumarate reductase, frdD, was also signicantly variable.Fumarate reductase catalyzes the reverse reaction from the typical TCA cycle enzyme succinate dehydrogenase and can be used for redox balance during anaerobic growth [34].Conversely, the standard succinate dehydrogenase genes sdhA, sdhB and sdhC were signicantly invariable.These results suggest that using our test to identify variable genes within otherwise stable pathways can reveal diverged functionality as well as families that play domain or clade-specic roles.
We also nd that the majority of signicantly variable gene families annotated to bacte-7 .CC-BY 4.0 International license peer-reviewed) is the author/funder.It is made available under a The copyright holder for this preprint (which was not .http://dx.doi.org/10.1101/056614doi: bioRxiv preprint first posted online Jun. 2, 2016; rial secretion system (16 out of 18) are involved in specialized secretion systems, especially the type III and type VI systems (Figure 2D).These secretion systems are predominantly found in Gram negative bacteria and are often involved in specialized cell-to-cell interactions, between microbes and between pathogens or symbionts and the host.They allow the injection of eector proteins, including virulence factors, directly into target cells [35,36].Type VI secretion systems have also been shown to be determinants of antagonistic interactions between bacteria in the gut microbiome [37,38].In contrast, gene families in the Sec (general secretion) and Tat (twin-arginine translocation) pathways were nearly all signicantly stable at an empirical FDR of 5%, with only one gene in each being found to be signicantly variable.This contradicts previous suggestions that the Sec and Tat pathways were some of the most variable in the human microbiome [16].This discrepancy is probably due to our accounting for the mean-variance relationship in shotgun data; the Sec and Tat systems are abundant and phylogenetically diverse [39].

Phylogenetic distribution correlates with, but does not totally explain, gene family variability
To explore the relationship between gene family taxonomic distribution and variability in abundance across hosts, we constructed trees of the sequences in each KEGG family using ClustalOmega and FastTree.We then calculated phylogenetic distribution (PD), using tree density to correct for the overall rate of evolution [40] (Figure 4a).
These results are consistent with the hypothesis that one function of the gut microbiome is to encode carbohydrate-utilization enzymes the host lacks [41].Additionally, recent experiments have also shown that the major gut commensal Bacteroides thetaiotaomicron contain enzymes adapted to the degradation of sulfated glycans including GAGs [42,43], and that many Bacteroides species can in fact use the GAG chondroitin sulfate as a sole carbon source [44].
Out of the 298 signicantly-variable gene families with above-median PD, we found no pathway enrichments but three module enrichments.
Only the archaeal (q = 1.5 × 10 −3 ) and eukaryotic (q = 8.7 × 10 −9 ) ribosomes were enriched, as expected from the abundance patterns of ribosomal proteins from dierent domains of life across hosts (2b).We also discovered an enrichment for the type VI secretion system (q = 0.039).
Finally, gene families described as hypothetical were also enriched in the variable/high-PD gene set (p = 2.4×10 −8 , odds ratio = 2.2) and depleted in the invariable/low-PD set (p = 5.4×10 −13 , odds ratio = 0.41).Intriguingly, specialized secretion systems were also observed to vary within gut-microbiome-associated species in a strain-specic manner, using a wholly separate set of data [45].
Transporters were also recently observed to show strain-specic variation in copy number across dierent human gut microbiomes [45], and analyses by Turnbaugh et al. identied membrane transporters as enriched in the variable set of functions in the microbiome [17].
However, we mainly nd transporters enriched amongst gene families with similar abundance across hosts, despite being phylogenetically restricted (invariable/low-PD genes).Part of this dierence is likely due to our stratifying by phylogenetic distribution, which previous studies did not do.

Proteobacteria are a major source of variable genes
To explore which taxa contribute variable and stable genes, we rst computed correlations between phylum relative abundances (predicted using MetaPhlAn2 [46]) and gene family abundances.This analysis revealed that Proteobacterial levels are correlated with abundance of many variable genes (Figure 5b).Proteobacteria are a comparatively minor component of these metagenomes (median = 1%), compared to Bacteroidetes (median = 59%) and Firmicutes (median = 33%).However, some hosts had up to 41% Proteobacteria.Overgrowth of Proteobacteria has been associated with metabolic syndrome [47] and inammatory bowel disease [48].Also, Proteobacteria can be selected (over Bacteroidetes and Firmicutes) by intestinal inammation as tested by TLR5-knockout mice [49], and some Proteobacteria can induce colitis in this background [50], potentially leading to a feedback loop.Thus, the variable gene families we discovered could be biomarkers for dysbiosis and inammation in otherwise healthy hosts.
We also examined correlations between gene abundance and three taxonomic summary statistics that have been previously linked to microbiome function: average genome size (AGS) [28], the Bacteroidetes/Firmicutes ratio [17,51], and α-diversity (Shannon index).All of these statistics were less often correlated with variable gene families than with invariable or non-signicant gene families (see Supplemental Information, Supplemental Figure S9).
These statistics therefore do not explain the variability of gene families in this dataset.9 .

Each major bacterial phylum has a largely unique complement of variable gene families
The variable gene families we identied seem to include both genes whose variance is explained by phylum-level variation (e.g., Proteobacteria), and genes that vary within negrained taxonomic classications, such as strains within species.Also, some gene families may confer adaptive advantages in the gut only within certain taxa.We were therefore motivated to detect lineage-specic variable and invariable gene families, independent of phylum-level trends.To do so, we repeated the test, but using only reads that mapped best to sequences from each of the four most abundant bacterial phyla (Bacteroidetes, Firmicutes, Actinobacteria, and Proteobacteria).Because we do not necessarily expect the assumption that fewer than 50% of gene families will be signicantly (in)variable to hold within each individual phylum, we estimate the average level of overdispersion from the full dataset instead.
Most (77%) gene families showed phylum-specic eects.Invariable gene families tended to agree, but the reverse was true for variable gene families: 19.4% of gene families that were invariable in one phylum were invariable in all, compared to just 0.34% (8 genes) in the variable set.Gene families invariable in all four phyla were enriched for basal cellular machinery, as expected (Supplemental Table S3).
Mirroring results we obtained in Figure 5, Proteobacteria-specic variable gene families also tended to be variable overall (59%); the opposite trend was true for Bacteroidetes-(12%), Firmicutes-(29%), and Actinobacteria-specic (18%) variable gene families (Figure 6A).This supports the hypothesis that Proteobacterial abundance is a dominant driver of functional variability in the human gut microbiome.It further suggests that many overall-variable gene families are not merely markers for a phylum that varies itself (i.e., Proteobacteria), but are also variable at ner taxonomic levels, such as the species or even the strain level [45,52].
Comparing the two dominant phyla in the gut, Bacteroidetes and Firmicutes, we further observe that the overall proportions of variable and invariable families are similar across pathways, with some exceptions: for example, lipopolysaccharide (LPS) biosynthesis has many invariable gene families in Bacteroidetes and very few in Firmicutes, which we expect given that LPS is primarily made by Gram-negative bacteria.Conversely, both two-component signaling and the PTS system have many more invariable gene families in Firmicutes than in Bacteroidetes (Figure 6B).However, phylum-specic variable gene families tend not to overlap (median overlap: 0%, compared to 46% for invariable gene families).This is even true for pathways where the overall proportion of variable and invariable gene families is similar, such as cofactor and vitamin biosynthesis and central carbohydrate metabolism (Figure 6C).
Furthermore, the enriched biological functions of the phylum-specic variable gene families dier by phylum (Supplemental Table S4).For instance, Proteobacterial-specic variable gene families are enriched (Fisher's test enrichment q = 0.13) for the biosynthesis of siderophore group nonribosomal peptides; iron scavenging is known to be important in the establishment of both pathogens (e.g.Yersinia ) and commensals (E. coli ) [53].Another phylum-specic variable function appears to be the Type IV secretion system (T4SS) within Firmicutes (q = 0.021): homologs of this specialized secretion system have been shown to be involved in a wide array of biochemical interactions, including the conjugative transfer of plasmids (e.g.antibiotic-resistance cassettes) between bacteria [54].We conclude that our approach enables the identication of substantial variation within all four major bacterial phyla in the gut, much of which is not apparent when data are analyzed at broader functional resolution or without stratifying by phylum.

Variable genes are not biomarkers for body mass index, sex or age
To explore associations of gene variability with measured host traits, we used a two-sided partial Kendall's τ test that controls for study eects (Methods).Body mass index, sex, and age were measured in all three studies we analyzed.None of these variables correlated signicantly with any variable gene family abundances, even at a 25% false discovery rate.
This suggests that major sources of variation in microbiota gene levels, possibly including diet and inammation, were not measured in these studies. of these studies presented data showing that abundances of several Clusters of Orthologous Groups (COGs) and KEGG modules across the metagenomes of human subjects varied less than phylum abundances among the same samples.However, it is not clear a priori whether phylum-level taxonomic variation should indeed be comparable to functional variation, since many functions are so widely conserved, and since many of the tested pathways actually include diverse biological functions (e.g., carbon metabolism).Moreover, these analyses do not identify individual gene families that could break this overall trend.
Turnbaugh et al. [17] also compared the variance of gene families in particular pathways across metagenomes to the variance across sequenced whole genomes.They concluded that biological pathways tended to vary less in metagenomes than in whole genomes.However, there are barriers to extending this test to individual gene families or biological pathways.
First, a set of representative whole genomes must be chosen to construct a null distribution.
Choosing an appropriate set of genomes to compare to a metagenome is not trivial, especially since the most appropriate genomes may not have been sequenced.Second, samples from a mixture of genomes should have lower variance than samples from individual genomes in that mixture; accounting for this bias could be dicult as it would depend on the precise composition of the sample.This study did classify gene families into stable versus variable subsets, by looking for families that were observed in all samples versus not observed in at least one.However, while this approach is useful and intuitively appealing, it is more descriptive or qualitative than our method, and would miss gene families that were reliably observed, but whose abundances still varied more than expected between subjects.
We nd that basic microbial cellular machinery, such as the ribosome, tRNA-charging, and primary metabolism, are universal functional components of the microbiome, both in general and when each individual phylum is considered separately.This nding is consistent with previous results [17], and indeed, is not surprising given the broad conservation of these processes across the tree of life.However, we also identify candidate invariable gene families that have narrower phylogenetic distributions.These include, for example, proteins involved in two-component signaling, starch metabolism, and glycosaminoglycan metabolism.Previous experimental work has underscored the importance of some of these pathways in gut symbionts: for instance, multiple gut-associated Bacteroides species are capable of using the glycosaminoglycan chondroitin sulfate as a sole carbon source [42], and the metabolism of resistant starch in general is thought to be a critical function of the human microbiome.
These results suggest that the method we present is capable of identifying protein-coding gene families that contribute to tness of symbionts within the gut.
We also identify signicantly variable gene families, including specialized secretion systems, e.g., the T6SS.Phylum-specic tests also reveal that gene families involved in the T6SS are also variable even within Proteobacteria.We nd that variable but broadly-conserved gene families include many genes of unknown function, and that these gene families tend to correlate with Proteobacterial abundance (though others also correlate with the abundance of Firmicutes, Bacteroidetes, and Actinobacteria).While fewer in number, we also nd invariable gene families whose function is not yet annotated; these gene families may represent functions that are either essential or provide advantages for life in the gut, and may therefore be particularly interesting targets for experimental follow-up (e.g., assessing whether strains in which these gene families have been knocked out in fact have slower growth rates, either in vitro or in the gut).
Though the interpretation of invariable gene families is potentially more straightforward, variable gene families have a variety of ecological interpretations, e.g., rst-mover eects, drift, host demography, and selection within particular gut environments.Computationally distinguishing among these possibilities is likely to present challenges.For example, distinguishing selection from random drift will probably require longitudinal data and appropriate models.Separating eects of host geography, genetics, medical history, and lifestyle will be possible only when richer phenotypic data is available from a more diverse set of human populations.To control for study bias and batch eects, it will be important to include multiple sampling sites within each study.
While statistical tests focused on dierences in variances are not yet common throughout genomics, there is some recent precedent using this type of test to quantify the gene-level heterogeneity in single-cell RNA sequencing data [56,24], and to identify variance eects in genetic association data [57].Like Vallejos et al. [24], we model gene counts using the negative binomial distribution, and identify both signicantly variable and invariable genes, although we frame our method as a frequentist hypothesis test as opposed to a Bayesian hierarchical model.Unlike previous approaches in this domain, the method we describe does not explicitly decompose biological from technical noise, and therefore does not require the use of experimentally-spiked-in controls, which are not present in most experiments involving sequencing of the gut microbiome.
A similar statistical method for detecting signicant (in)variability such as the one we present here could also be applied to other biomolecules measured in counts, such as metabolites, proteins, or transcripts.Performing such analyses on human microbiota would reveal patterns in the variability in the usage of particular genes, reactions, and pathways, which would expand on our investigation of potential usage based on presence in the DNA of organisms in host stool.Another important extension is to generalize our method for comparing hosts from dierent pre-dened groups (disease states, countries, diets) to identify gene families that are stable in one group (e.g., healthy controls) but variable in another 13 .
(e.g., patients).Since metagenomic samples contain substantial heterogeneity, investigating group dierences in functional variability could allow the detection of dierent trends from the more common comparison of means.

Conclusion
This study present a novel statistical method that provides a ner resolution estimate of functional redundancy [58] in the human microbiome than was previously possible.We found that most biological pathways, including tRNA charging, central carbon metabolism, and bacterial secretion, have both invariable and variable components.Some variable genes have surprisingly broad phylogenetic distributions, and Proteobacteria emerge as a major source of variable genes.Since Proteobacteria have also been linked to inammation and metabolic syndrome [47], we speculate that baseline inammation may be one variable inuencing functions in the gut microbiome.

Data collection and processing
Stool metagenomes from healthy human guts were obtained from three sources: 1. two American cohorts from the Human Microbiome Project [16], n = 42 samples selected; 2. a Chinese cohort from a case-control study of type II diabetes (T2D) [25], n = 44 samples from controls with neither type II diabetes nor impaired glucose tolerance; 3. and a European cohort from a case-control study of glucose control [26], n = 37 samples from controls with normal glucose tolerance.
After downloading these samples from NCBI's short read archive (SRA), the FASTA-formatted les were mapped to KEGG Orthology (KO) [59] protein families as previously described [27].For consistency, each sample was rareed to a depth of 1.5×10 7 reads, and additionally, as reads from HMP were particularly variable in length, they were therefore trimmed to a uniform length of 90 bp.
For each sample, we used ShotMAP to detect how many times a particular gene family matched a read (counts; we add one pseudocount for reasons described below).The bitscore cuto for matching a protein family was selected based on the average read length of 14 .
each sample as recommended previously [27].For every gene family in every sample, we also computed the average family length (AFL), or the average length of the matched genes within a family.Finally, we also computed per-sample average genome size using MicrobeCensus [28] (http://github.com/snayfach/MicrobeCensus).These quantities were used to estimate abundance values in units of RPKG, or reads per kilobase of genome equivalents [28].
These RPKG abundance values were strictly positive with a long right tail and highly correlated with the variances (Spearman's r = 0.99).This strong mean-variance relationship is likely simply because these abundances are derived from counts that are either Poisson or negative-binomially distributed.We therefore took the natural log of the RPKG values as a variance stabilizing transformation.Because log(0) is innite, we add a pseudocount before normalizing the counts and taking the log transform.Since there is no average family length (AFL) when there are no reads for a given gene family in a given sample, we impute it in those cases using the average AFL across samples.

Model tting
We t a linear model to the data matrix of log-RPKG D of log-RPKG described above, with n gene-families by m samples, to capture gene-specic and dataset-specic eects: where g ∈ [1, n] is a particular gene family, s ∈ [1, m] is a particular sample, G g is the grande or overall mean of log-RPKG s Dg,s m for a given gene family g, Y is the set of studies, I y,s is an indicator variable valued 1 if sample s is in study y and 0 otherwise, X g,y is a mean oset for gene family g in study y, and the residual for a given gene family and sample are given by g,s .For each gene family, the variance across samples of these g,s , which we term the residual variance or V g , becomes our statistic of interest.
Overall trends in these data are explained well by this model, with an R 2 = 0.20.The residuals, which are approximately symmetrically distributed around 0, represent variation in gene abundance not due to study eects.

Modeling residual variances under the null distribution
Having calculated this statistic V g for each gene family g, we then need to compare this statistic to its distribution under a null hypothesis H 0 .This requires us to model what the data would look like if in fact there were no surprisingly variable or invariable gene families.
To do this, we use the negative binomial distribution to model the original count data (before 15 .CC-BY 4.0 International license peer-reviewed) is the author/funder.It is made available under a The copyright holder for this preprint (which was not .http://dx.doi.org/10.1101/056614doi: bioRxiv preprint first posted online Jun. 2, 2016; adding pseudocounts and normalization to obtain RPKG).
The negative binomial distribution is commonly used to model count data from high throughput sequencing.It can be conceptualized as a mixture of Poisson distributions with dierent means, which themselves follow a Gamma distribution.Like the Poisson distribution, the negative binomial distribution has an intrinsic mean-variance relationship.However, instead of a single mean-variance parameter as in the Poisson, the negative binomial can be described with two, a mean parameter and a size parameter, which we refer to here as k such that k = µ 2 σ 2 −µ .k ranges from (0, ∞), with smaller values corresponding to more overdispersion (i.e., higher variance given the mean) and larger values approaching, in the limit, the Poisson distribution.
To model the case where no gene family has unusual variance given its mean value, i.e., our null hypothesis, we assume that the data are negative-binomially distributed with the observed means µ g,y for each gene g and study y, but where the amount of overdispersion is modeled with a single size parameter k y for each study y: To estimate this k y , the overall size parameter for a given study y, we estimate the mode of per-gene-family size parameters k g,y within data set y, using the method-of-moments estimator for each k g,y .We accomplish this by tting a Gaussian kernel density estimate to the log-transformed k g,y values, and then nding the k y value that gives the highest density.
(From simulations, we found that the mode method-of-moments was more robust than the median or harmonic mean: see Supplemental Figure S2.)We can then easily generate count data under this null distribution, add a pseudocount and normalize by AFL and AGS, t the above linear model, and obtain null residual variances V 0 g using exactly the same procedure described above.
Statistical signicance is then obtained by a two-tailed test: Here, B refers to the number of null test statistics V 0 g (in this case, B = 750), and the overlined test statistics refer to their mean across the null distribution.
The resulting p-values are then corrected for multiple testing by converting to FDR q-values using the procedure of Storey et al. [60] as implemented in the qvalue package in R [61].An alternative approach to determining signicance is based on the bootstrap.
While using a parametric null distribution allows us to explicitly model the null hypothesis, it also breaks the structure of covariance between gene families, which may be substantial because genes are organized into operons and individual genomes within a metagenome.This structure can, optionally, be restored using a strategy outlined by Pollard and van der Laan [62].Instead of using the test statistics V 0 g obtained under the parametric null as is, we can use these test statistics to center and scale bootstrap test statistics V g , which we derive from applying a cluster bootstrap with replacement from the real data and then tting the above linear model ( 2) to the resampled data to obtain bootstrap residual variances: A similar non-parametric bootstrap approach has previously been successfully applied to testing for dierences in gene expression [63].
As expected, when the residuals are plotted in a heatmap as in Figure S6, variable gene families are generally brighter (i.e., more deviation from the mean) than invariable gene families, though not exclusively: this is because our null distribution, unlike the visualization, models the expected mean-variance relationship.We visualize this information by scaling each gene family by its expected standard deviation under the negative binomial null (i.e., by the mean root variance b∈[1,B] V 0 g b /B) (Figure S7).

Power analysis
The test we present controls α as expected if the correct size parameter k is estimated from the data (Supplemental Figure S2a-b).Estimating this parameter accurately is known to be dicult, however, particularly for highly over-dispersed data [64], and in this case we must also estimate this parameter from a mixture of true positives and nulls.We nd that the mode of per-gene-family method-of-moments estimates is more robust to dierences in the ratio of variable to invariable true positives (Supplemental Figure S2e-g) than the median or harmonic mean (the harmonic mean mirrors the approach in Yu et al. [23]).
Power analysis was performed on simulated datasets comprising three simulated studies.
Null data were drawn from a negative-binomial distribution with a randomly-selected size parameter k in common to all gene families, which was drawn from a log-normal distribution (log-mean= −0.65, sd= 0.57).Gene family means were also drawn from a log-normal (log mean= 2.94, sd = 2.23).True positives were drawn from a similar negative-binomial distribution, but where the size parameter was multiplied by an eect size z (for variable gene families) or its reciprocal 1/z (for invariable gene families).The above test was then applied to the simulated data, and the percent of Type I and II errors was calculated by comparing to the known gene family labels from the simulation.Using similar parameters to those estimated from our real data, we see that α decreases and power approaches 1 with increasing sample size (see Supplemental Figure S3) and that n = 120 appears to be sucient to achieve control over α.However, at n = 120, we also noted that α appeared to be greater for variable vs.
invariable gene families (Supplemental Figure S4), possibly because accurately detecting additional overdispersion in already-overdispersed data may be intrinsically dicult.We therefore performed additional simulations to determine q-value cutos corresponding to an empirical FDR of 5%.We calculated appropriate cutos based on datasets with 43% true positives and a variable:invariable gene family ratio ranging from 0.1 to 10, taking the median cuto value across these ratios (Supplementary Table S7).Using these cutos, the overall dataset had 45% true positives and a variable:invariable gene family ratio of 0.43.

Calculating phylogenetic distribution of gene families
The phylogenetic distribution (PD) of KEGG Orthology (KO) families was estimated using tree density [40].We rst obtained sequences of each full-length protein annotated to a particular KO, and then performed a multiple alignment of each family using ClustalOmega [65].These multiple alignments were used to generate trees via FastTree [66].For both the alignment and tree-building, we used default parameters for homologous proteins.
For all families represented in at least 5 dierent archaea and/or bacteria (6,703 families total), we then computed tree densities, or the sum of edge lengths divided by the mean tip height.Using tree density instead of tree height as a measure of PD corrects for the rate of evolution, which can otherwise cause very highly-conserved but slow-evolving families like the ribosome to appear to have a low PD [40].Empirically, this measure is very similar to the number of protein sequences (Supplemental Figure S8), but is not as sensitive to high or variable rates of within-species duplication: for example, families such as transposons, which exhibit high rates of duplication as well as copy-number variation between species, have a larger number of sequences than even very well-conserved proteins such as RNA polymerase, but have similar or even lower tree densities, indicating that they are not truly more broadly conserved.
Many protein families (8,931 families) did not have enough observations in order to reliably calculate tree density, with almost all of these being annotated in only a single bacterium/archaeum.For these, we predicted their PD by extrapolation.To predict PD, we used a linear model that predicted tree density based on the total number of annotations (including annotations in eukaryotes).In ve-fold cross-validation, this model actually had a relatively small mean absolute percentage error (MAPE) of 13.1%.We also considered a model that took into account the taxonomic level (e.g., phylum) of the last common ancestor of all organisms in which a given protein family was annotated, but this model performed essentially identically (MAPE of 13.0%).Predicted tree densities are given in Supplemental Table S6.The PD of gene families varied from 1.2 (an iron-chelate-transporting ATPase only annotated in H. pylori ) to 434.9 (the rpoE family of RNA polymerase sigma factors).

Gene family enrichment
We were interested in whether particular pathways were enriched in several of the gene family sets identied in this work.For subsets of genes (such as those with specically low PD), a 2-tailed Fisher's exact test (i.e., hypergeometric test) was used instead to look for cases in which the overlap between a given gene set and a KEGG module or pathway was signicantly larger or smaller than expected.The background set was taken to be the intersection of the set of gene families observed in the data with the set of gene families that had pathway-or module-level annotations.p-values were converted to q-values as above.Finally, enrichments were enumerated by selecting all modules or pathways below q ≤ 0.25 that had positive odds-ratios (i.e., enriched instead of depleted).

Associations with clinical and taxonomic variables
We were interested in using a non-parametric approach to detect association of residual RPKG with clinical and taxonomic variables (e.g., the inferred abundance of a particular phylum via MetaPhlan2).To take into account potential study eects in clinical and taxonomic variables without using a parametric modeling framework, we used partial Kendall's τ correlation as implemented in the ppcor package for R [67], coding the study eects as binary nuisance variables.Kendall's τ was used over Spearman's ρ because of better handling of ties (an issue with taxonomic variables especially, since many, particularly at the ner-grained levels, were often zero).The null distribution was obtained by permuting the clinical/taxonomic variables within each study 250 times, and then re-assessing the partial τ .Finally, p-values were calculated by taking the fraction of null partial correlations equally or more extreme (i.e., distant from zero) than the real partial correlations.
Phylum-level relative abundances were predicted from the shotgun data using MetaPhlAn2 with the --very-sensitive ag [46].

Phylum-specic tests
We created taxonomically-restricted data sets in which the abundance of each gene family was computed using only metagenomic reads aligning best to sequences from each of the four most abundant bacterial phyla (Bacteroidetes, Firmicutes, Actinobacteria, and Proteobacteria).Phylum-specic data were obtained from the overall data as follows.First, the NCBI taxonomy was parsed to obtain species annotated below each of the four major bacterial phyla (Bacteroidetes, Firmicutes, Actinobacteria, and Proteobacteria); these species were then matched with KEGG species identiers.Next, the original RAPSearch2 [68] results were ltered, so that the only reads remaining were those for which their best hit in the KEGG database originally came from the genome of a species belonging to the specic phylum in question (e.g., E. coli for Proteobacteria).Finally, when performing the test, normalization for average genome size was accomplished by normalizing gene family counts by the median abundance of a set of 29 bacterial single-copy gene families [69], which had been ltered in the same phylum-specic way as all other gene families; this approach is similar to the MUSiCC method for average genome size correction [70].This also controls for overall changes in phylum abundance.Finally, k y values for individual studies were estimated based on the non-phylum-restricted data, since the expectation that < 50% of gene families were dierentially variable might not hold for each individual phylum.We used the same q-value cutos as in the overall test to set an estimated empirical FDR (Supplementary Table S7).
Otherwise, tests were performed as above.

Codebase
The scripts used to conduct the test and related analyses are available at the following URL: http://www.bitbucket.org/pbradz/variance-analyze Counts of reads mapped to KEGG Orthology (KO) groups and average family lengths for all of the samples used in this study can be obtained at FigShare: • https://figshare.com/s/fcf1abf369155588ae41(overall) • https://figshare.com/s/90d44cffdfb1d214ef83(phylum-specic)      Gene families in specic functional groups are also highlighted in dierent colors, specically the bacterial ribosome (pale green), the type VI secretion system (or T6SS; orange), the KinABCDE-Spo0FA sporulation control two-component signaling (yellow), and hypothetical genes (tan squares).Gene families that are signicantly invariable (ribosome and sporulation control) or signicantly variable (hypothetical genes and the T6SS) at an estimated 5% FDR are outlined in black.The bacterial ribosome, as expected, has very high PD and is strongly invariable.The Type VI secretion system genes, in contrast, are conserved but variable, while some genes involved in the Kin-Spo sporulation control two-component signaling pathway have low PD but are invariable.Only gene families with at least one annotated bacterial or archaeal homolog are shown.

Additional Files
Figure S1: Schematic shows overview of data processing and method.A) Data is processed by taking reads from multiple datasets (represented by letters here) with a certain number of samples (represented by S A , S B , etc.).These reads will eventually map to multiple gene families G. MicrobeCensus [28] is used to estimate average genome size, while Shotmap [27]is used to map reads, yielding both matrices of counts (right hand side) and matrices of average lengths of the best-hit proteins (average family length or AFL).AFL and AGS estimates are used to normalize counts.B) We calculate our statistic and assign p-values as follows.First, we normalize counts from Shotmap using AFL and AGS, log-transform the resulting reads per kilobase of genome (RPKG), then apply a simple linear model to t dataset-and gene-family-specic eects.The resulting residuals (residual log RPKG) form a matrix of G genes by A +S B +S C samples.We take the variance across all samples for each gene to obtain a 1xG vector of residual variances.To get a null distribution, we can either use data generated from a negative binomial t, or, optionally, from a negative binomial t integrated with (shaded section) bootstrap resampling.For the negative binomial t, from the count matrices, we estimate the mean of each gene in each dataset, as well as datasetspecic overdispersion parameters k.We then use these to make simulated count datasets ( × B indicating that this card is replicated once for each of B simulations), which we process as in the case of the real data, yielding simulated log-RPKG matrices and simulated residual variances for each gene family.For the resampling (if applicable), we sample with replacement from each count dataset, yielding resampled counts.We process these in the same way to obtain resampled residual variances.Finally, if using the resampled data, we center and scale the resampled residual variances using per-gene-family means and standard deviations from the simulated residual variances; otherwise, we simply take the values from applying the test to the negative binomial simulations.These form the background distribution (bottom solid curve) for each gene in G ( × G indicating that this card is replicated once for each of G genes).The actual observed residual variance (dashed line) is then compared to this distribution to obtain p-values (gray shaded area).
Figure S2: Size parameter estimation aects power and α.For each mock dataset y, simulated null data is generated from a negative binomial distribution, xing the size parameter k y but allowing the mean µ g,y to vary for each of 1,000 genes; simulated truepositive gene families are drawn from a negative-binomial distribution with size equal zk y or k y /z, where z is the eect size.A-C) The choice of estimator aects the accuracy of size estimates.The mode method-of-moments estimator (C, y-axis) more accurately estimates the true size specied the simulation (x-axis) than the harmonic mean (A, yaxis) or median (B, y-axis), and is more tolerant to dierences in the ratio of true-positive variable and invariable gene families (colors).D-E) When the size parameter is known, α (D) and power (E) are well controlled, with α approximately equal to 0.05 at p ≤ 0.05 and power approaching 1.Here, each simulation comprises three mock studies with dierent size parameters, mirroring our actual data.Bar heights are means from 4 simulations and error bars are ±2 SD.The proportion of variable:invariable gene families was 0.5 and 44% of genes were true positives.
Figure S3: Size parameter estimation aects power and α. α (A) is minimized and power (B) is maximized when the mode method-of-moments estimator is used to get estimates of the study-specic dispersion parameters k y .Bars are from 4 simulations.The proportion of variable:invariable gene families was 0.4 and 43% of genes were true positives.
Figure S4: The mode estimator is robust to changes in the proportion of true positives and the ratio of variable to invariable gene families.α (A-C) and power as a function of the proportion of true positives (x-axis) and the ratio of variable to invariable true positives (y-axis) for = 120.α = 0.05 and power = 1 are shown in colorbars to the left of each heatmap for reference.α and power are calculated overall (left), for variable gene families (center), and for invariable gene families (right).In general, α is better controlled for the invariable gene families than for the variable gene families; we therefore use dierent empirical cutos for each set of genes.Table S1: Module and pathway enrichments for variable and invariable gene sets (Fisher's exact test q ≤ 0.25).
Table S2: Module and pathway enrichments for variable/high-PD and invariable/low-PD gene sets (Fisher's exact test q ≤ 0.25).
Table S3: Module and pathway enrichments for gene families with invariable abundances in every phylum-specic test (Fisher's exact test, ≤ 0.25).
Table S4: Module and pathway enrichments for gene families variable in each phylum-specic test (Fisher's exact test, q ≤ 0.25).

Figure 1 :
Figure 1: The residual variance statistic captures variation in gene families after accounting for between-study variation.The left panels (original abundances) show lled circles representing log-abundance (RPKG) for gene families from the KEGG Orthology (KO), with per-study means shown in solid horizontal lines and the distance from these means shown as dashed vertical lines.The right hand panels (residuals) show the same gene families after tting a linear model that accounts for these per-study means, with an accompanying density plot showing the distribution of these residuals.V g values in bold underneath density plots are the calculated variances of these residuals.These gene families are sets of orthologs corresponding to A) the waaL family of lipopolysaccharide Oantigen ligases, an immunogenic component of Gram-negative bacterial outer membranes, and B) the vicR family of OmpR family transcriptional regulators that is involved in twocomponent vic signaling, is conserved across many Gram-positive bacteria, and is essential in Streptococcus pneuomoniae [71].Despite having similar overall mean log-abundances and similar magnitude study-specic eects, waaL has much higher residual variance across individual metagenomes than vicR.

Figure 2 :
Figure 2: Most pathways include a mixture of both variable and invariable gene families.A) Stacked bar plots show the fraction of invariable (blue), non-signicant (gray), and variable (red) gene families annotated to KEGG Orthology pathway sets (rows), at different false discovery rate (FDR) cutos (color intensity).Only gene families with at least one annotated bacterial or archaeal homolog are counted.B) Fraction of strongly invariable, non-signicant, and strongly variable gene families within the ribosomes of dierent kingdoms.Row labels with only one kingdom indicate gene families unique to that kingdom, while rows with multiple kingdoms (e.g.Eukaryotes/archaea) indicate gene families shared between these two kingdoms.As expected, the bacterial ribosome is completely invariable.

Figure 3 :
Figure 3: Variable and invariable gene families within broad biological pathways separate by gene function.A-C) Heatmaps showing scaled residual log-RPKG for gene families (rows) involved in A) tRNA metabolism, B) central carbohydrate metabolism, and C) bacterial secretion systems.Variable (red) and invariable (blue) gene families are clustered separately, as are samples within a particular study (columns).log-RPKG values are scaled by the expected variance from the negative-binomial null distribution.

Figure 4 :
Figure 4: Phylogenetic distribution (PD) of gene families partially explains gene family variability.Scatter plot shows log 10 PD (x-axis) vs. log 10 residual variance statistic (y-axis).Red points are signicantly variable while blue points are signicantly invariable.

Figure 5 :Figure 6 :
Figure 5: Variable gene families correlate with the predicted abundance of Proteobacteria.Bar plots give the fraction of gene families in each category (signicantly invariable, non-signicant, and signicantly variable, 5% FDR) that are signicantly correlated to predicted relative abundances of phyla, as assessed by MetaPhlan2, using partial Kendall's τ to account for study eects and a permutation test to assess signicance.Asterisks give the level of signicance by chi-squared test of non-random association between gene family category and the number of signicant associations.(***: p ≤ 10 −8 by chi-squared test after Bonferroni correction; **: p ≤ 10 −4 .)

Figure S5 :
Figure S5: We identify signicantly variable and invariable gene families.Density plots of distributions of residual variance (V G ) statistics for signicantly invariable (blue dashed line), non-signicant (black solid line), and signicantly variable (red dashed line) gene families.The distributions have the expected trend (e.g., signicantly variable gene families tend to have higher residual variance) but also overlap, indicating the importance of the calculated null distribution.The inset shows the proportion of zero values for the non-signicant (black) and signicantly invariable (blue) gene families with V G falling in the lowest range (vertical dashed lines), indicating that the test dierentiates between gene families that only appear invariable because they have few observations, and gene families that are consistently abundant yet invariable.

Figure S8 :
Figure S8: Number of leaves is correlated with tree density, but tree density corrects for the overall rate of evolution.The number of leaves (i.e., individual sequences) is plotted vs. tree density on a log-log scatter plot, with each circle representing one gene family.Two outliers with lower density than expected are plotted in colors: a putative transposase (green) and a Staphylococcus leukotoxin (red).Both families have large numbers of sequences from the same organism.

Figure S9 :
Figure S9: Variable families are less-often correlated to measured host characteristics.A) Bar plots give the fraction of gene families with at least one bacterial or archaeal representative in each category (signicantly invariable, non-signicant, and signicantly variable) that are signicantly correlated to various sample characteristics, using partial Kendall's τ to account for study eects and a permutation test to assess signicance.These sample characteristics are average genome size (AGS), the ratio of Bacteroidetes to Firmicutes (B/F ratio), and a measure of α-diversity (Shannon index).(***: p ≤ 10 −8 by chi-squared test after Bonferroni correction; **: p ≤ 10 −4 .)

Figure S7 :
Figure S7: Heatmap showing signicantly variable and invariable gene families (scaled).As with S6, but residual log-RPKG abundances scaled by their expected variance under the negative binomial null model (see Methods).

6
Author contributions PHB performed the experiments and analyses.PHB and KSP developed the test, designed the experiments, wrote the paper, and read and approved the nal manuscript.

Table S5 :
SRA IDs and characteristics (read length, average genome size from Microbe-Census) for samples used in this study.

Table S7 :
q-value cutos to reach a given empirical FDR, estimated from simulation.