## Abstract

Identifying sources of variation in DNA methylation levels is important for understanding gene regulation. Recently, bisulfite sequencing has become a popular tool for estimating DNA methylation levels at base-pair resolution, and for investigating the major drivers of epigenetic variation. However, modeling bisulfite sequencing data presents several challenges. Methylation levels are estimated from proportional read counts, yet coverage can vary dramatically across sites and samples. Further, methylation levels are influenced by genetic variation, and controlling for genetic covariance (e.g., kinship or population structure) is crucial for avoiding potential false positives. To address these challenges, we combine a binomial mixed model with an efficient sampling-based algorithm (MACAU) for approximate parameter estimation and *p*-value computation. This framework allows us to account for both the over-dispersed, count-based nature of bisulfite sequencing data, as well as genetic relatedness among individuals. Furthermore, by leveraging the advantages of an auxiliary variable-based sampling algorithm and recent mixed model innovations, MACAU substantially reduces computational complexity and can thus be applied to large, genome-wide data sets. Using simulations and two real data sets (whole genome bisulfite sequencing (WGBS) data from *Arabidopsis thaliana* and reduced representation bisulfite sequencing (RRBS) data from baboons), we show that, compared to existing approaches, our method provides better calibrated test statistics in the presence of population structure. Further, it improves power to detect differentially methylated sites: in the RRBS data set, MACAU detected 1.6-fold more age-associated CpG sites than a beta-binomial model (the next best approach). Changes in these sites are consistent with known age-related shifts in DNA methylation levels, and are enriched near genes that are differentially expressed with age in the same population. 
Taken together, our results indicate that MACAU is an effective tool for analyzing bisulfite sequencing data, with particular salience to analyses of structured populations. MACAU is freely available at www.xzlab.org/software.html.

## Introduction

DNA methylation — the covalent addition of methyl groups to cytosine bases — is a major epigenetic gene regulatory mechanism utilized by a wide variety of species. DNA methylation levels predict gene expression patterns, are involved in genomic imprinting and X-inactivation, and function to suppress the activity of transposable elements [1–3]. In addition, DNA methylation is essential for normal development [4–7]. For example, mutant *Arabidopsis* plants with reduced levels of DNA methylation display a range of abnormalities including reduced overall size, altered leaf size and shape, and reduced fertility [4–6]. In humans, DNA methylation levels are strongly linked to disease, including major public health burdens such as diabetes [8,9], Alzheimer’s disease [10,11], and many forms of cancer [8,12–16]. These observations point to a central role for DNA methylation in shaping genome architecture, influencing development, and driving trait variation. Consequently, there is substantial interest in characterizing the genome-wide distribution of DNA methylation marks, and particularly, in identifying the genetic [17–20] and environmental [21–24] factors that explain variation in DNA methylation levels.

Recently, high-throughput sequencing-based approaches have increased the feasibility, and consequently the popularity, of measuring DNA methylation levels. These methods, which include whole genome bisulfite sequencing (WGBS or BS-seq) [25], reduced representation bisulfite sequencing (RRBS) [26,27], and sequence capture followed by bisulfite conversion [28,29], produce base-pair resolution estimates of DNA methylation levels at genome-wide scales. All such methods rely on the differential sensitivity of methylated versus unmethylated cytosines to the chemical sodium bisulfite. Sodium bisulfite converts unmethylated cytosines to uracil (and ultimately thymine following PCR), while methylated cytosines are protected from conversion. Estimates of DNA methylation levels for each cytosine base can thus be obtained directly through high-throughput sequencing. Specifically, DNA methylation levels are estimated as the ratio of mapped cytosine reads (reflecting an originally methylated version of the base) to the total number of mapped reads at the same target (reflecting both methylated and unmethylated versions of the base).

The raw data produced by bisulfite sequencing methods are therefore count data, in which both the number of methylated reads and the total coverage at a site contain useful information. Higher total coverage corresponds to a more reliable estimate of the true DNA methylation level; however, in a typical experiment, total coverage can vary dramatically (e.g., by several orders of magnitude) across individuals and sites (Fig. S1). Many commonly used analysis methods, including all tools initially designed for array-based data [30,31], ignore this variability by converting counts to percentages or proportions (e.g., t-tests, Mann-Whitney U tests, or linear models, Table 1). Thus, a site at which 5 of 10 reads are designated as methylated (i.e., read as a cytosine) is treated identically to a site at which 50 of 100 reads are designated as methylated. This assumption reduces the power to uncover true predictors of variation in DNA methylation levels, because it ignores substantial sources of error in DNA methylation level estimates.
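The cost of discarding coverage can be made concrete with a short calculation. The Python sketch below is illustrative only and is not part of any tool discussed here: it assumes the textbook binomial standard error, sqrt(p(1 - p)/r), ignores over-dispersion, and uses a hypothetical helper name, `methylation_estimate`.

```python
import math

def methylation_estimate(methylated, total):
    """Return (estimated methylation level, binomial standard error).

    Assumes reads are independent binomial draws (no over-dispersion),
    which is a simplification made for illustration.
    """
    if total <= 0 or not 0 <= methylated <= total:
        raise ValueError("need total > 0 and 0 <= methylated <= total")
    level = methylated / total
    se = math.sqrt(level * (1.0 - level) / total)
    return level, se

# The two sites from the text: identical proportions, different coverage.
low_cov = methylation_estimate(5, 10)
high_cov = methylation_estimate(50, 100)

print("5/10   -> level %.2f, SE %.3f" % low_cov)
print("50/100 -> level %.2f, SE %.3f" % high_cov)
```

Both sites yield the same estimated level, but under these assumptions the standard error at 10x coverage is sqrt(10) times larger than at 100x coverage. A proportion-based analysis treats the two estimates as equally reliable.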

To address this problem, several recently introduced methods for differential DNA methylation analysis implement a beta-binomial model (e.g., ‘DSS: Dispersion Shrinkage for Sequencing data’ [32], ‘RADMeth: Regression Analysis of Differential Methylation’ [33], and ‘MOABS: Model Based Analysis of Bisulfite Sequencing data’ [34]). These methods model the binomial nature of bisulfite sequencing data, while taking into account the well-known problem of over-dispersion in sequencing reads. Because they work directly on count data, they can reliably account for variation in read coverage across sites and individuals. Consequently, beta-binomial methods consistently provide increased power to detect true associations between genetic or environmental sources of variance and DNA methylation levels [32–34].

However, beta-binomial-based methods only model over-dispersion due to independent variation, making them unsuited to studying DNA methylation variation in data sets affected by population structure or kinship. Taking these sources of structure into account is important because genetic variation is well known to exert strong and pervasive effects on DNA methylation levels [18,20,35,36]. In humans, methylation levels at more than ten thousand CpG sites are influenced by local genetic variation [19], and DNA methylation levels in whole blood are 18%-20% heritable on average, with the heritability estimates for the most heritable loci (top 10%) averaging around 68% [35,36]. As a result, DNA methylation levels will frequently covary with genetic relatedness (either kinship or population structure), and failure to account for this covariance could lead to spurious associations or reduced power to detect true effects. This phenomenon has been extensively documented for genotype-phenotype association studies [37–41], and controlling for genetic covariance between samples is now a basic requirement for these types of analyses. Similar logic applies to analyses of gene regulatory phenotypes, and studies of gene expression variation often do take genetic structure into account by using mixed model approaches [42–44]. However, despite growing interest in environmental epigenetics and epigenome-wide association studies (EWAS), no methods exist that appropriately control for genetic effects on DNA methylation levels in bisulfite sequencing data sets (Table 1).

To address this gap, we present a binomial mixed model (BMM) that accounts for both covariance between samples and extra over-dispersion caused by independent noise. We also present an efficient, sampling-based inference algorithm to accompany this model, called MACAU (Mixed model association for count data via data augmentation). MACAU works directly on binomially distributed count data and uses random effects to model relatedness/population structure and over-dispersion. Hence, MACAU enables parameter estimation and hypothesis testing in a wide variety of settings. To illustrate the advantages of our approach, we compared MACAU’s performance with currently available methods using both simulated data and two real data sets (publicly available *Arabidopsis thaliana* WGBS data [45] and newly generated RRBS data from wild baboons, *Papio cynocephalus*). We found that MACAU appropriately controls for type I error and provides increased power compared to alternative methods, which either fail to account for the count nature of bisulfite sequencing data (e.g., linear mixed models [38,39,46,47]) or fail to account for genetic relatedness (e.g., beta-binomial models).

## Results

### The binomial mixed model and the MACAU algorithm

Here, we briefly describe the model and the algorithm. Additional details are provided in Text S1.

To detect differentially methylated sites, we model each potential target of DNA methylation individually (i.e., we model each CpG site one at a time). For each site, we consider the following binomial mixed model (BMM):

$$y_i \sim \mathrm{Bin}(r_i, \pi_i),$$

where *r*_{i} is the total read count for the *i*th individual; *y*_{i} is the methylated read count for that individual, constrained to be an integer value less than or equal to *r*_{i}; and *π*_{i} is an unknown parameter that represents the true proportion of methylated reads for the individual at the site. We use a logit link to model *π*_{i} as a linear function of several parameters:

$$\log\left(\frac{\pi_i}{1-\pi_i}\right) = \mathbf{w}_i^{T}\boldsymbol{\alpha} + x_i\beta + g_i + e_i,$$

$$\mathbf{g} = (g_1, \ldots, g_n)^{T} \sim \mathrm{MVN}\left(\mathbf{0},\ \sigma^{2}h^{2}\mathbf{K}\right),$$

$$\mathbf{e} = (e_1, \ldots, e_n)^{T} \sim \mathrm{MVN}\left(\mathbf{0},\ \sigma^{2}(1-h^{2})\mathbf{I}\right),$$

where ***w***_{i} is a *c*-vector of covariates including an intercept and ***α*** is a *c*-vector of corresponding coefficients; *x*_{i} is the predictor of interest and *β* is its coefficient; ***g*** is an *n*-vector of genetic random effects that model correlation due to population structure or kinship; ***e*** is an *n*-vector of environmental residual errors that model independent variation; ***K*** is a known *n* by *n* relatedness matrix that can be calculated based on pedigree or genotype data and that has been standardized to ensure *tr*(***K***)/*n* = 1 (this ensures that *h*^{2} lies between 0 and 1, and can be interpreted as heritability, see [48]; *tr* denotes the trace); ***I*** is an *n* by *n* identity matrix; *σ*^{2}*h*^{2} is the genetic variance component; *σ*^{2}(1-*h*^{2}) is the environmental variance component; *h*^{2} is the heritability of the logit transformed methylation proportion (i.e., *logit*(***π***)); and MVN denotes the multivariate normal distribution.

Both ***g*** and ***e*** model over-dispersion (i.e., the increased variance in the data that is not explained by the binomial model). However, they model different aspects of over-dispersion: ***e*** models the variation that is due to independent environmental noise (a known problem in data sets based on sequencing reads: [49–52]), while ***g*** models the variation that is explained by kinship or population structure. Effectively, our model improves and generalizes the beta-binomial model by introducing this extra ***g*** term to model individual relatedness due to population structure or stratification.

We are interested in testing the null hypothesis that the predictor of interest has no effect on DNA methylation levels: *H*_{0} : *β* = 0. This test requires obtaining the maximum likelihood estimate $\hat{\beta}$ from the model. Unlike its linear counterpart, estimating $\hat{\beta}$ from the binomial mixed model is notoriously difficult, as the joint likelihood consists of an *n*-dimensional integral that cannot be solved analytically [53]. Standard frequentist approaches rely on numerical integration [54] or Laplace approximation [55,56], but neither strategy scales well with the increasing dimension of the integral, which in our case is equal to the sample size. Because of this problem, frequentist approaches often produce biased estimates and overly narrow (i.e., anti-conservative) confidence intervals [57–61]. To overcome this problem, we instead use a Markov chain Monte Carlo (MCMC) algorithm-based approach for inference. After drawing accurate posterior samples, we rely on the asymptotic normality of both the likelihood and the posterior distributions [62] to further obtain the approximate maximum likelihood estimate $\hat{\beta}$ and its standard error se($\hat{\beta}$). This procedure allows us to construct approximate Wald test statistics and *p*-values for hypothesis testing. Despite the stochastic nature of the procedure, the MCMC errors are small enough to ensure stable *p*-value computation across multiple MCMC runs (Fig. S2).
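To make the structure of the model concrete, the following Python sketch draws methylated read counts from a binomial mixed model of the kind described above and computes a two-sided Wald *p*-value. It is a generative illustration under stated assumptions, not MACAU's inference algorithm: the only covariate is an intercept, the logit of the methylation proportion is taken to be the sum of the intercept, the predictor effect, a genetic effect, and an independent error, all function names are hypothetical, and the parameter values are arbitrary. The estimate and standard error passed to `wald_p_value` would have to come from the fitting procedure in Text S1, which is not reproduced here.

```python
import math
import random
from statistics import NormalDist

def standardize_kinship(K):
    """Scale K so that tr(K)/n = 1, the standardization the model assumes."""
    n = len(K)
    scale = sum(K[i][i] for i in range(n)) / n
    return [[v / scale for v in row] for row in K]

def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def simulate_bmm(r, x, K, alpha, beta, sigma2, h2, rng):
    """Draw y_i ~ Bin(r_i, pi_i) with logit(pi_i) = alpha + x_i*beta + g_i + e_i."""
    n = len(r)
    L = cholesky(standardize_kinship(K))
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # g ~ MVN(0, sigma2 * h2 * K): effects correlated through relatedness
    g = [math.sqrt(sigma2 * h2) * sum(L[i][k] * z[k] for k in range(i + 1))
         for i in range(n)]
    # e ~ MVN(0, sigma2 * (1 - h2) * I): independent over-dispersion
    e = [rng.gauss(0.0, math.sqrt(sigma2 * (1.0 - h2))) for _ in range(n)]
    y = []
    for i in range(n):
        eta = alpha + x[i] * beta + g[i] + e[i]
        pi = 1.0 / (1.0 + math.exp(-eta))  # inverse logit
        y.append(sum(rng.random() < pi for _ in range(r[i])))
    return y

def wald_p_value(beta_hat, se):
    """Two-sided p-value for H0: beta = 0 from an estimate and its standard error."""
    if se <= 0:
        raise ValueError("standard error must be positive")
    return 2.0 * NormalDist().cdf(-abs(beta_hat / se))

# Hypothetical example: two pairs of close relatives, uneven coverage.
rng = random.Random(1)
K = [[1.0, 0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.5],
     [0.0, 0.0, 0.5, 1.0]]
r = [10, 100, 25, 40]        # total read counts
x = [1.8, 5.2, 9.6, 18.0]    # predictor values (e.g., ages), made up
y = simulate_bmm(r, x, K, alpha=0.0, beta=0.05, sigma2=1.0, h2=0.3, rng=rng)
print(list(zip(y, r)))
```

Setting `h2=0` in this sketch removes the correlated term, leaving only independent over-dispersion, which is the setting in which the model is expected to behave like a beta-binomial model.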

For efficient, approximate *p*-value computation, we developed a novel MCMC algorithm based on an auxiliary variable representation of the binomial distribution [63–65] (Text S1). Our main contribution here is a framework that approximates the distribution of these latent variables (Fig. S3, Table S1-S2) and allows us to take advantage of recent innovations for fitting mixed effects models [38,46,47,66] (Text S1). These modifications substantially reduce the computational burden of fitting the BMM. Our algorithm reduces per-MCMC iteration computational complexity from cubic to quadratic with respect to the sample size. This results in an over 50-fold speed up compared with the popular software MCMCglmm [67] (Table S3) and makes our implementation of the BMM efficient for data sets ranging up to hundreds of individuals and millions of sites.
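One standard device from the linear mixed model literature shows how a per-iteration cost can fall from cubic to quadratic in the sample size. It is offered here as an illustrative assumption rather than as a description of MACAU's internals, which are given in Text S1. Assuming ***g*** and ***e*** are independent, and writing the eigendecomposition of the relatedness matrix as ***K*** = ***UDU***^{T}:

```latex
\begin{aligned}
\mathrm{Cov}(\mathbf{g} + \mathbf{e})
  &= \sigma^{2}\left(h^{2}\mathbf{K} + (1-h^{2})\mathbf{I}\right) \\
  &= \mathbf{U}\,\sigma^{2}\left(h^{2}\mathbf{D} + (1-h^{2})\mathbf{I}\right)\mathbf{U}^{T},
  \qquad \mathbf{K} = \mathbf{U}\mathbf{D}\mathbf{U}^{T},\quad
  \mathbf{U}^{T}\mathbf{U} = \mathbf{I}.
\end{aligned}
```

After rotating the data by ***U***^{T}, the covariance is diagonal for any value of *h*^{2}, so inverses and determinants cost O(*n*) and the remaining work per iteration is dominated by matrix-vector products of cost O(*n*^{2}). The O(*n*^{3}) eigendecomposition is computed once, because ***K*** is shared across sites and iterations.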

Because our model effectively includes the beta-binomial model as a special case, we expect it to perform similarly to the beta-binomial model in settings in which population structure is absent (we say “effectively” because strictly speaking, the beta-binomial model uses a beta distribution to model independent noise while we use a normal distribution). However, we expect our model to outperform the beta-binomial model in settings in which population structure is present. In addition, in the presence of population stratification, we expect the beta-binomial model to produce inflated test statistics (thus increasing the false positive rate) while our model should provide calibrated ones. Below, we test these predictions using both simulations and real data applications.

### Count-based models perform well in the absence of genetic effects on DNA methylation levels

We first compared the performance of the BMM implemented in MACAU with the performance of other currently available methods for analyzing bisulfite sequencing data in the absence of genetic effects. Intuitively, since the BMM models count data and effectively includes the beta-binomial model as a special case, we expected it to perform similarly to the beta-binomial model; further, we expected both models to outperform methods that do not model counts. To test our prediction, we simulated the effect of a predictor variable on DNA methylation levels across 5000 CpG sites (4500 true negatives and 500 true positives). To approximate the distribution of a predictor variable in a real population, and because we analyze age-associated variation in DNA methylation levels in a baboon RRBS data set in detail below, we conducted our simulation using known age values sampled from the same baboon population. For all simulations, we set the effect of genetic variation on DNA methylation levels equal to zero, which is equivalent to either (i) setting the heritability of DNA methylation levels to zero (unlikely based on prior findings [35,36]), or (ii) studying completely unrelated individuals in the absence of population structure. To explore MACAU’s performance across a range of conditions, we simulated age effects on DNA methylation levels across three different effect sizes (percent of variance in DNA methylation explained (PVE) = 5%, 10%, or 15%) and three different sample sizes (n = 20, 50, and 80).

Because age is naturally modeled as a continuous variable, we focused our comparisons only on approaches that could accommodate continuous predictor variables (comparisons in which we artificially binarized age, which allowed us to include a larger set of approaches, produced qualitatively similar results: Fig. S4). Specifically, in addition to the BMM implemented in MACAU, we considered the performance of a beta-binomial model, a linear model, a binomial model, and a linear mixed model (implemented in the software GEMMA [46]). As expected, we found that MACAU performed similarly to the beta-binomial model, and that these two approaches consistently detected more true positive age effects on DNA methylation levels (at a 10% empirical FDR) than all other methods (Figs. S5-S6). For example, in the “easiest” case we simulated (PVE = 15%, n = 80), we found that the beta-binomial model detected 30% of simulated true positives, while the BMM implemented in MACAU detected 27.8%. The slight loss of power in the BMM is a consequence of the smaller degrees of freedom caused by the additional genetic variance component. In comparison, the linear model detected 21.2% of true positives; the linear mixed effects model, 14%; and the binomial model, 8.4% (Fig. S5). The binomial model exhibits low power when FDR is used to control for multiple hypothesis testing due to poor type I error calibration, as has been previously reported [33]. Area under a receiver operating characteristic curve (AUC) was also consistently very similar between the beta-binomial and MACAU (Fig. S6), although the advantage of the count-based methods was less clear by this measure. This reduced contrast is because AUC is based on true positive-false positive trade-offs across a range of p-value thresholds; methods can consequently yield high AUCs even when they harbor little power to detect true positives at FDR thresholds that are frequently used in practice. 
Taken together, our simulations suggest a general advantage to count-based models for samples that contain no genetic structure. Further, the differences in performance between the beta-binomial model and the BMM implemented in MACAU were consistently small in this setting (Figs. S5-S6).

### Binomial mixed models control for false positive associations that arise from population structure

Next, we investigated the performance of each method in the presence of population structure. When DNA methylation levels are heritable and the predictor variable of interest is confounded with population structure, false positive associations should arise if genetic covariance between samples is not modeled. Because the BMM accounts for population structure while the beta-binomial model does not, we therefore expected MACAU to produce well-calibrated test statistics and the beta-binomial model to produce inflated test statistics. To test this prediction, we drew on publicly available phenotype data and SNP genotype data for 24 *Arabidopsis thaliana* accessions [68,69] in which leaf tissue samples were recently subjected to whole genome bisulfite sequencing [45]. Among these accessions, a secondary dormancy phenotype (measured as the slope between the germination percentages of non-dormant seeds after one and six weeks of cold treatment) is correlated with population structure (R^{2}= 0.38 against the first principal component of the genotype matrix for these accessions; p = 7.84 x 10^{-4}; Fig. S7). Because secondary dormancy is associated with environmental conditions that are experienced after the seed has already dispersed, we have no expectation that secondary dormancy should be associated with DNA methylation levels in leaf tissue. Consequently, we used the true distribution of secondary dormancy characteristics and the true genetic structure among these 24 accessions to simulate a dataset that consisted entirely of true negatives. Specifically, we simulated data sets (containing 4000 sites each) in which the association between secondary dormancy and DNA methylation levels in leaf tissue was always equal to 0, but the effect of genetic variation on DNA methylation levels was either moderate (h^{2} = 0.3) or large (h^{2} = 0.6). 
Thus, in these data sets, population structure could confound the relationship between the predictor variable (the capacity for secondary dormancy) and DNA methylation levels if not taken into account.

As predicted, we found that the BMM implemented in MACAU appropriately controlled for genetic effects on DNA methylation levels: whether DNA methylation levels were moderately (h^{2} = 0.3) or strongly (h^{2} = 0.6) heritable, MACAU did not detect any sites associated with secondary dormancy at a relatively liberal false discovery rate threshold of 20% (whether calculated against empirical permutations or calculated using the R package *qvalue* [32]). In addition, the *p*-value distributions for secondary dormancy effects on DNA methylation levels, in both simulations, did not differ from the expected uniform distribution (Fig. 1; Kolmogorov-Smirnov (KS) test when h^{2} = 0.3: D = 0.015, p = 0.909; when h^{2} = 0.6: D = 0.016, p = 0.874; genomic control factors: 0.90 when h^{2} = 0.3, 0.93 when h^{2} = 0.6). In contrast, when we analyzed the same simulated data sets with a beta-binomial model, we erroneously detected 2 CpG sites associated with secondary dormancy when heritability was set to 0.3, and 4 CpG sites when heritability was set to 0.6 (at a 20% FDR in both cases). More concerningly, the distributions of *p*-values produced by the beta-binomial model were significantly different from the expected uniform distribution and skewed towards low (significant) values (KS test when h^{2} = 0.3: D = 0.084, p = 1.75 x 10^{-8}; when h^{2} = 0.6: D = 0.096, p = 2.80 x 10^{-11}; genomic control factors: 1.18 when h^{2} = 0.3, 1.32 when h^{2} = 0.6). These numbers suggest an increasing problem with false positives as the heritability of DNA methylation levels increases.
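The two calibration diagnostics used above, the KS test against a uniform distribution and the genomic control factor, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the analyses: the KS *p*-value uses the asymptotic Kolmogorov distribution, the genomic control factor is computed as the median of the 1-degree-of-freedom chi-square statistics implied by the *p*-values divided by the null median, and the function names and input *p*-values are made up.

```python
import math
from statistics import NormalDist, median

def ks_uniform(pvals):
    """KS statistic D and asymptotic p-value for H0: p-values ~ Uniform(0, 1)."""
    xs = sorted(pvals)
    n = len(xs)
    # Largest gap between the empirical CDF and the uniform CDF F(x) = x.
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
    lam = math.sqrt(n) * d
    if lam < 0.2:
        # The series below converges slowly here; the p-value is ~1.
        return d, 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                 for k in range(1, 101))
    return d, min(1.0, max(0.0, 2.0 * series))

def genomic_control(pvals):
    """Median implied chi-square (1 df) divided by its median under the null."""
    nd = NormalDist()
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in pvals]  # requires 0 < p <= 1
    return median(chi2) / (nd.inv_cdf(0.75) ** 2)

# Made-up inputs: evenly spread p-values versus p-values skewed toward zero.
n = 1000
calibrated = [(i + 0.5) / n for i in range(n)]
inflated = [p * p for p in calibrated]

for name, pv in (("calibrated", calibrated), ("inflated", inflated)):
    d, p = ks_uniform(pv)
    print(name, "D = %.4f" % d, "KS p = %.3g" % p,
          "lambda = %.2f" % genomic_control(pv))
```

A genomic control factor near 1 together with a small KS statistic indicates well-calibrated test statistics, whereas values above 1 indicate an excess of small *p*-values, as seen for the beta-binomial model.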

To investigate the calibration of test statistics in a real data set, we next analyzed the relationship between the secondary dormancy phenotype and publicly available WGBS data for the same 24 *Arabidopsis* accessions (n = 830,676 CpG sites tested [32–34]). We again compared the performance of a simple linear model, a binomial model, a beta-binomial model, the BMM implemented in MACAU, and an LMM implemented in GEMMA. Again illustrating its poor handling of type I error, the binomial model detected more than 100,000 secondary dormancy-associated sites at a 10% empirical FDR threshold, with a genomic control factor of 3.81. A beta-binomial model substantially improved over the binomial model, but still detected 39 secondary dormancy-associated sites at a 20% empirical FDR threshold, and 150 sites and 690 sites at a 10% and a 20% *qvalue* FDR threshold, respectively (genomic control factor = 1.16). Given the clear confounding of population structure and secondary dormancy in this sample, as well as the results of our simulations, these associations are probably spurious. In contrast, MACAU, the linear mixed model (GEMMA), and the simple linear model did not identify any CpG sites associated with secondary dormancy, either at a 10% or a 20% false discovery rate threshold (Fig. 1; genomic control factors: MACAU, 0.89; GEMMA, 0.97; linear model, 0.99). Based on our earlier simulations, the similar performance of the three models likely stems from different causes: both the linear model and the linear mixed model have lower power to detect positive hits (either true positives or false positives), whereas MACAU combines the increased power conferred by modeling the raw count data with appropriate controls for population structure (see Fig. 1 and results below).

### MACAU provides increased power to detect true positives in the presence of kinship

We next investigated the power of different approaches to detect truly differentially methylated sites in the presence of relatedness. Because it appropriately models genetic similarity between relatives, we expected the BMM implemented in MACAU to exhibit improved power over the other methods. To test this prediction, we returned to the baboon data set that was the focus of our initial simulations. Instead of assuming no genetic contribution to variation in DNA methylation levels, here we instead simulated moderate to large genetic effects (h^{2} = 0.3 and 0.6 respectively, as in the *Arabidopsis* simulation above). We simulated relatedness values based on the distribution of relatedness values within a single mixed-sex baboon social group. Female baboons remain in their natal groups throughout their lives, producing relatedness values that are primarily due to matrilineal descent. The resulting genetic structure is one in which females tend to be more closely related to each other, on average, than males or male-female dyads [70], but in which not all females are related (because multiple matrilines co-reside in a single group). Thus, baboon social groups contain a large set of unrelated dyads, some pairs of close relatives, and some distant relatives (Fig. S8). We simulated an effect of age on DNA methylation levels in a data set consisting of 80 baboons with known ages and dyadic relatedness levels. We simulated a range of non-zero effect sizes (percent variance explained by age = 5%, 10%, or 15%) for 5000 CpG sites, containing 500 true positives and 4500 true negatives. We chose these parameters to mimic the distribution of effect sizes observed in real data sets, which can range from small to substantial but which are generally limited to a minority of sites [9,17,36,71].

In simulations in which age had a moderate effect on DNA methylation levels (PVE = 10%), MACAU detected 11.4% (when h^{2} = 0.3) and 20.6% (when h^{2} = 0.6) of simulated true positives at a 10% empirical FDR. In comparison, the beta-binomial model (the next best model) detected 8.2% and 10.4% of true positives, respectively (Fig. 2). As in our earlier simulations, we again observed that a simple binomial model was prone to type I error, which resulted in failure to detect true age-associated sites when empirical FDRs were calculated against permuted data. Our additional simulations at PVE = 5% or PVE = 15% confirmed MACAU’s advantage over other methods across a range of effect sizes (Fig. S9). As expected, the magnitude of this advantage was positively correlated with the heritability of DNA methylation levels.

### Age-associated DNA methylation levels in wild baboons

Finally, we applied MACAU to a real RRBS data set that we generated from 50 wild baboons, drawn from the same population used to parameterize the simulations above. This data set included 433,871 CpG sites, enriched (as expected in RRBS data sets [26,27]) for putatively functional regions of the genome (e.g., genes, gene promoters, CpG islands: Fig. S11). We used these data to investigate the epigenetic signature of age at sampling (range = 1.76 – 18.01 years in our sample, Table S4); we focused on age because it is a known predictor of DNA methylation levels in humans and other animals [35,72,73] and because DNA methylation changes with age are well characterized [35,36,74–76]. Consequently, we were able to not only assess MACAU’s power to detect sites significantly associated with age, but also test its ability to identify known age-related signatures in DNA methylation data.

As in our simulations, we found that MACAU provided increased power to detect age effects in the presence of familial relatedness. We detected 1.6-fold more age-associated CpG sites at a 10% empirical FDR using MACAU compared to the results of a beta-binomial model, the next best approach (1.4-fold more sites at a 20% empirical FDR; Fig. 3 and Fig. S10). This advantage was consistently observed across all FDR thresholds we considered, except for relatively low (<7.5%) empirical FDR thresholds, when all of the methods were very low powered as a result of the modest sample size.
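An empirical FDR of the kind referred to above can be computed by comparing the observed *p*-values with *p*-values obtained after permuting the predictor. The Python sketch below shows one common definition and is an assumption rather than the exact procedure used here: at a given threshold, the expected number of false discoveries is estimated as the average number of permuted *p*-values at or below the threshold, and divided by the number of observed *p*-values at or below it. The function name and the toy *p*-values are hypothetical.

```python
def empirical_fdr(observed, permuted, threshold):
    """Estimate the FDR at a p-value threshold from permutation p-values.

    observed:  p-values from the real data, one per site
    permuted:  list of p-value lists, one list per permutation of the predictor
    """
    n_discoveries = sum(p <= threshold for p in observed)
    if n_discoveries == 0:
        return 0.0  # no discoveries, so no false discoveries by convention
    expected_false = sum(
        sum(p <= threshold for p in perm) for perm in permuted
    ) / len(permuted)
    return min(1.0, expected_false / n_discoveries)

# Toy example with six sites and two permutations (values are made up).
observed = [0.001, 0.002, 0.003, 0.2, 0.5, 0.9]
permuted = [
    [0.002, 0.3, 0.4, 0.5, 0.6, 0.7],
    [0.1, 0.2, 0.3, 0.5, 0.8, 0.9],
]
fdr = empirical_fdr(observed, permuted, threshold=0.01)
print("empirical FDR at p <= 0.01: %.3f" % fdr)
```

Under this definition, a method whose permuted *p*-values are themselves skewed toward small values needs a much stricter threshold to reach a given FDR, which is one way poor type I error calibration translates into lost power.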

We performed several analyses to investigate the likely validity and functional importance of the age-associated CpG sites we identified. Based on the results of previous studies, we expected that age-associated sites in CpG islands would tend to gain methylation with age [75,76], while sites in other regions of the genome (e.g., CpG island shores, gene bodies) would tend to lose methylation with age [75,76]. In addition, we expected that, in whole blood, bivalent/poised promoters should gain DNA methylation with age, while enhancers should lose methylation with age (as discussed in [74,75,77]). Our results conformed to these patterns: sites in CpG islands tended to gain methylation with age (71.4% of sites were positively correlated with age); and sites in promoters, CpG island shores, and gene bodies tended to lose methylation with age (72.7%, 75.4%, and 75.2% of sites were negatively correlated with age, respectively; Fig. 3). In addition, we found that positively correlated, age-associated sites were highly enriched in chromatin states associated with bivalent/poised promoters (as defined by the Roadmap Epigenomics Project [78]). Specifically, age-associated CpG sites in bivalent/poised promoters were 3.4 times more likely to show increases in DNA methylation with age, compared to age-associated CpG sites in other regions (p < 10^{-10}, Fisher’s exact test). Furthermore, negatively correlated age-associated sites (i.e., sites where DNA methylation levels decreased with age) were strongly enriched in enhancers (defined as sites either marked by H3K4me1 in human PBMCs [79] or sites within chromatin states annotated as ‘enhancers’ by the Roadmap Epigenomics Project [78], p = 2 x 10^{-4}, Fisher’s exact test).
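Enrichment comparisons of this kind reduce to a 2 x 2 contingency table, for example sites that gain versus lose methylation with age, inside versus outside an annotation. As a hedged illustration, the following Python sketch implements a two-sided Fisher's exact test from the hypergeometric distribution using only the standard library. The function name is hypothetical, the counts are invented for the example and are not taken from the baboon data, and the two-sided *p*-value follows the common convention of summing the probabilities of all tables no more likely than the observed one.

```python
from math import comb

def fisher_exact(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns (sample odds ratio, p-value).
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(k):
        # Hypergeometric probability of k counts in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Sum over tables at least as extreme; small tolerance for float ties.
    p_value = sum(prob(k) for k in range(k_min, k_max + 1)
                  if prob(k) <= p_obs * (1 + 1e-7))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(1.0, p_value)

# Invented counts: rows = inside / outside an annotation,
# columns = sites gaining / losing methylation with age.
odds, p = fisher_exact(3, 1, 1, 3)
print("odds ratio = %.1f, p = %.4f" % (odds, p))
```

With real data, the four cells would be the counts of age-associated sites cross-classified by direction of change and by annotation.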

Finally, we reasoned that true positive age-associated CpG sites should also contain information about age-associated gene expression levels. To test this hypothesis, we turned to previously generated whole blood RNA-seq data [42] from the same baboon population (n = 63; only four baboons in the RNA-seq data set were also included in the DNA methylation data set). Overall, we observed a strong enrichment of differentially methylated CpG sites in or near (within 10 kb) blood-expressed genes (n = 12,018 genes), compared to the background set of all CpG sites near genes (Fisher’s exact test, p < 10^{-10}). Further, CpG sites near age-associated genes (n = 1396 genes, 10% FDR) were 30.5% more likely to be differentially methylated with age compared to the background set of all CpG sites near genes (Fisher’s exact test, p = 0.032).

## Discussion

DNA methylation levels can have potent effects on downstream gene regulation, and, in doing so, can shape key behavioral, physiological, and disease-related phenotypes [8,21,80–82]. These observations have motivated an increasing number of DNA methylation studies in humans and other organisms, highlighting the need for sophisticated statistical methods that can accommodate the complexities of a broad array of data sets. Here, we demonstrate that the binomial mixed model implemented in our software MACAU can (i) effectively control for confounding relationships between genetic background and a predictor variable of interest and (ii) provide increased power to detect true sources of variance in DNA methylation levels in data sets that contain kinship or population structure. In addition, MACAU provides increased flexibility over current count-based methods that cannot accommodate biological replicates (e.g., Fisher’s exact test), continuous predictor variables (e.g., DSS, MOABS, RADMeth), or biological or technical covariates (e.g., MOABS, DSS; see also Table 1). Given the increasing interest in both the environmental [22,71,83] and genetic [17,18,20,84] architecture of DNA methylation levels, we believe MACAU will be a useful tool for generalizing epigenomic studies to a larger range of populations. MACAU is particularly well suited to data sets that contain related individuals or population structure; notably, several major population genomic resources contain structure of these kinds (e.g., the HapMap population samples [85], the Human Genome Diversity Panel [86], and the 1000 Genomes Project in humans [87]; the Hybrid Mouse Diversity Panel [88]; and the 1001 Genomes Project in *Arabidopsis* [89]).

Indeed, our results suggest MACAU is a useful tool even in data sets that are less affected by genetic structure, or when the heritability of DNA methylation levels is unclear. Because the beta-binomial model is incorporated as a special case, MACAU exhibits only a slight loss of power relative to a beta-binomial model without random effects when h^{2} = 0, while conferring better power and better test statistic calibration when h^{2} > 0 (Fig. S5-S6; Fig. 1). Previous studies in humans have shown that, while the heritability of DNA methylation levels varies across loci, an appreciable proportion of loci are either modestly (h^{2} >= 0.3: 21.06% of all CpG sites) or highly (h^{2} >= 0.6: 6.95% of all CpG sites) heritable [36,90]; further, DNA methylation QTLs are widespread across the genome [19,35,84]. Thus, because investigators will rarely have *a priori* knowledge of the heritability of DNA methylation levels at a given locus, and because the advantage of a beta-binomial model is small even when heritability is zero, we recommend applying MACAU in cases where genetic effects on DNA methylation levels are poorly understood. In addition, our model provides a natural framework for incorporating the spatial dependency of DNA methylation levels across neighboring sites [91,92], which we expect to increase power even further [91,92]. However, we do note that, even with the efficient algorithm implemented here, fitting the binomial mixed model (or its extensions) remains more computationally expensive than other approaches for moderately sized datasets (Table S3). While it remains appropriate for the sample sizes used in current studies (e.g., dozens to hundreds of individuals), rapid increases in sample size—especially in the context of EWAS—strongly motivate additional algorithm development to scale up the binomial mixed model for data sets that include thousands or tens of thousands of individuals.

Although we developed MACAU with the analysis of bisulfite sequencing data in mind, we note that a count-based binomial mixed model may be an appropriate tool in other settings as well. For example, allele-specific gene expression (ASE) is often measured in RNA-seq data by comparing the number of reads originating from a given variant to the total number of mapped reads for that site [66,93–95]. The structure of these data is highly similar to the structure of bisulfite sequencing data, which focus on counts of methylated versus total reads. Unsurprisingly, beta-binomial models have also emerged as one of the most popular methods for estimating ASE values [95–97]. Researchers interested in the predictors of variation in ASE levels (which could include *trans*-acting genetic effects, environmental conditions, or properties of the individual, such as sex or disease status) might also benefit from using MACAU. Recent work from the TwinsUK study motivates the need for such a model: Grundberg et al. demonstrated a strong heritable component to ASE levels [98], which could be effectively taken into account using the random effects approach implemented here.

Finally, linear mixed models have also been recently proposed to account for cell type heterogeneity in epigenome-wide association studies focused on array data [99]. In this framework, the random effect covariance structure is based on overall covariance in DNA methylation levels between samples, which is assumed to be largely attributable to variation in tissue composition. MACAU provides a potential avenue for extending these ideas to sequencing-based data sets.

## Materials and Methods

### *Arabidopsis thaliana* whole genome bisulfite sequencing (WGBS) data set

We downloaded publicly available WGBS data generated by Schmitz et al. [45], as well as previously published SNP genotype data [69] and secondary dormancy data [68] for 24 *Arabidopsis* accessions. We used the SNP genotype data (specifically, 188,093 sites with minor allele frequency >5%) to construct a pairwise genetic relatedness matrix, *K*, as the product of a standardized genotype matrix [48] (implemented with a built-in function in MACAU). We used this estimate of *K* for both the simulations and our analyses of the real WGBS data.
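As a rough illustration of how a relatedness matrix can be computed as the product of a standardized genotype matrix, the sketch below centers and scales each SNP and then averages the cross-products over SNPs. The function name `relatedness_matrix` and the toy genotypes are hypothetical, and the exact standardization used by MACAU's built-in function may differ from the one assumed here.

```python
import math

def relatedness_matrix(genotypes):
    """Compute K = X X^T / p from an n x p genotype matrix (list of rows).

    Each SNP (column) is centered by its mean and scaled by its standard
    deviation; monomorphic SNPs are skipped. Assumption: this is a common
    "standardized genotype" estimator, not necessarily MACAU's own.
    """
    n = len(genotypes)
    p = len(genotypes[0])
    columns = []
    for j in range(p):
        col = [genotypes[i][j] for i in range(n)]
        mean = sum(col) / n
        var = sum((g - mean) ** 2 for g in col) / n
        if var == 0:
            continue  # a monomorphic SNP carries no information
        sd = math.sqrt(var)
        columns.append([(g - mean) / sd for g in col])
    used = len(columns)
    return [[sum(c[i] * c[k] for c in columns) / used for k in range(n)]
            for i in range(n)]

# Toy example: 4 individuals, 5 SNPs coded as 0/1/2 copies of an allele.
toy = [
    [0, 1, 2, 0, 1],
    [0, 1, 2, 1, 1],
    [2, 0, 0, 1, 1],
    [2, 1, 0, 2, 1],
]
K = relatedness_matrix(toy)
```

With this scaling, every standardized SNP contributes a sum of squares equal to the number of individuals, so the trace of `K` divided by the number of individuals equals 1.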

In these analyses, we focused on CpG sites measured in ≥50% of accessions, and excluded sites that were constitutively hypermethylated (average DNA methylation level >0.90) or hypomethylated (average DNA methylation level <0.10, following [71,99]). We also excluded highly invariable sites (i.e., sites where the standard deviation of DNA methylation levels fell in the lowest 5% of the overall data set) and sites with very low coverage (i.e., sites where the mean coverage fell in the lowest quartile for the overall data set, below a mean of 3.34 reads). After filtering, the final data set consisted of 830,676 sites.
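The filtering steps above can be sketched as follows. This is a minimal illustration, not the authors' script: the function names, the sorted-index quantile rule, and the choice to compute both quantile cutoffs across all sites before any other filter is applied are assumptions.

```python
from statistics import mean, pstdev

def lower_quantile(values, q):
    """Value at quantile q using a simple sorted-index rule (an assumption)."""
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]

def filter_sites(sites, min_measured=0.5, low=0.10, high=0.90,
                 sd_quantile=0.05, cov_quantile=0.25):
    """Return the indices of sites that pass all filters.

    `sites` is a list of (methylated_counts, total_counts) pairs, one pair
    of per-sample lists per CpG site; a total count of 0 means "not measured".
    """
    stats = []
    for meth, total in sites:
        levels = [m / t for m, t in zip(meth, total) if t > 0]
        measured = len(levels) / len(total)
        level_mean = mean(levels) if levels else 0.0
        level_sd = pstdev(levels) if len(levels) > 1 else 0.0
        stats.append((measured, level_mean, level_sd, mean(total)))
    sd_cut = lower_quantile([s[2] for s in stats], sd_quantile)
    cov_cut = lower_quantile([s[3] for s in stats], cov_quantile)
    keep = []
    for idx, (measured, level_mean, level_sd, cov) in enumerate(stats):
        if (measured >= min_measured and low <= level_mean <= high
                and level_sd > sd_cut and cov > cov_cut):
            keep.append(idx)
    return keep

# Hypothetical toy data: five sites measured in four samples.
toy_sites = [
    ([5, 2, 8, 1], [10, 10, 10, 10]),     # variable, well covered
    ([10, 10, 10, 9], [10, 10, 10, 10]),  # constitutively hypermethylated
    ([1, 0, 0, 0], [2, 0, 0, 0]),         # measured in only 25% of samples
    ([3, 3, 3, 3], [6, 6, 6, 6]),         # invariable
    ([4, 12, 2, 9], [20, 20, 20, 20]),    # variable, well covered
]
kept = filter_sites(toy_sites)
```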

### Baboon reduced representation bisulfite sequencing (RRBS) data set

#### Study subjects and sample collection

To investigate age effects on DNA methylation levels, in both real and simulated data sets, we drew on data and samples from a wild population of yellow baboons in the Amboseli ecosystem of southern Kenya. This population has been monitored for over four decades by the Amboseli Baboon Research Project (ABRP) [100], and the ages of animals born in the study population (n = 37, 74% of the data set) are therefore known to within a few days’ error. For animals that immigrate into the study population, ages are estimated from morphological features by trained observers (n = 13, 26% of the data set) [101]. Pairwise relatedness values were available from previously collected microsatellite data (14 highly variable loci) [102,103] analyzed with the program COANCESTRY [104]. Using the age and relatedness data sets, we simulated age effects on DNA methylation levels for either n = 50 or n = 80 baboons from a single social group. In addition, we used previously collected blood samples from the Amboseli population, paired with age and microsatellite genotype records, to investigate age effects on DNA methylation levels in a newly generated RRBS data set.

To generate the new RRBS data, we used whole blood samples collected from 50 animals (35 males and 15 females) by the ABRP between 1989 and 2011 following well-established procedures [42,105,106]. Briefly, animals were immobilized by an anesthetic-bearing dart delivered through a hand-held blow gun. They were then quickly transferred to a processing site for blood sample collection. Following sample collection, study subjects were allowed to regain consciousness in a covered holding cage until they were fully recovered from the effects of the anesthetic. Upon recovery, study subjects were released near their social group and closely monitored. Blood samples were stored at the field site or at an ABRP-affiliated lab at the University of Nairobi until they were transported to the United States.

Importantly, given the large range in sample collection dates, we observed no correlation between the age of our study subjects at sample collection and sample age (i.e., time since the collection date; Spearman rank correlation, p = 0.779). Further, to ensure that variation in sample collection dates did not influence our results, we also controlled for sample age as a covariate in our final analyses of the RRBS dataset (see *Analysis of age-related changes in DNA methylation levels*).

#### RRBS data generation and low-level processing

Genomic DNA was extracted from whole blood samples using the DNeasy Blood and Tissue Kit (QIAGEN) according to the manufacturer’s instructions. RRBS libraries were created from 180 ng of genomic DNA per individual, following the protocol by Boyle et al. [26]. In addition, 1 ng of unmethylated lambda phage DNA (Sigma Aldrich) was incorporated into each library to assess the efficiency of the bisulfite conversion. All RRBS libraries were sequenced using 100 bp single end sequencing on an Illumina HiSeq 2000 platform, yielding a mean of 28.97 ±8.97 million reads per analyzed sample (range: 9.59 – 79.78 million reads; Table S4).

We removed adaptor contamination and low-quality bases from all reads using the program TRIMMOMATIC [107]. We then mapped the trimmed reads to the olive baboon genome (*Panu* 2.0) using BSMAP, a tool designed for high-throughput DNA methylation data [108]. We used a Python script packaged with BSMAP to extract the number of reads as cytosine (reflecting an originally methylated base) and the total read count for each individual and CpG site. We performed the same set of filtering steps described for the *Arabidopsis* WGBS data set to produce our final data set for the baboons. Specifically, we excluded sites that were constitutively hypermethylated or hypomethylated, sites that were highly invariable, and sites that had low average coverage across individuals (in this case, the lowest quartile for mean coverage levels was 4.74 reads). The final filtered data set consisted of 433,871 CpG sites.

### Simulations

To simulate the methylated read counts and total read counts that result from WGBS and RRBS, we performed the following procedure:

First, we simulated the proportion of methylated reads for each site. To do so, we drew secondary dormancy values or age values, *x*, as the predictor of interest, from the actual values for the *Arabidopsis* accessions or from the baboon population, respectively. For each CpG site, we simulated the DNA methylation level, π, as a linear function of *x* and its effect size (*β*), as well as the effects of genetic variation (*g*) and random environmental variation (*e*), passed through a logit link (based on the model described in the Results section).

For the baboon RRBS simulations, we determined *K* from 14 highly variable microsatellite loci [102,103], focusing on the true values for either n = 50 or n = 80 baboons drawn from a single social group in the Amboseli population (i.e., the same population we sampled in the real RRBS dataset). For the *Arabidopsis* WGBS simulations, *K* was determined from publicly available SNP genotype data [69]. For each simulation, we set *h*^{2} to 0, 0.3, or 0.6 to simulate no, modest, or highly heritable DNA methylation levels. We also estimated the variance term σ^{2} from the real data sets. Specifically, we took the mean estimate of σ^{2} across all sites (as calculated in MACAU) for each real data set, and used this value as the fixed value of σ^{2} in the corresponding simulations.

Next, for each site, we simulated total read counts *r*_{i} for each individual *i* from a negative binomial distribution that models the extra variation observed in the real data:

$$r_i \sim \mathrm{NB}(t, p)$$

where *t* and *p* are site-specific parameters estimated from the real data. Specifically, we generated 10,000 sets of *t* and *p* parameters by fitting a negative binomial distribution to the total read count data from 10,000 randomly selected CpG sites in the real baboon RRBS data set or the real *Arabidopsis* data set, using the function ‘fitdistr’ in the R package *MASS* [109]. To simulate counts for a given CpG site, we randomly selected one of these parameter sets to produce the total number of reads. Finally, we simulated the number of methylated reads for each individual at that locus (*y*) by drawing from a binomial distribution parameterized by the number of total reads (*r*) and the DNA methylation level (*π*).
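The simulation procedure can be sketched for a single CpG site as below. This is an illustrative reading of the steps, not the code used in the study. The function names, the intercept argument, and the parameterization of the negative binomial (size *t* and success probability *p*, drawn as a gamma-Poisson mixture) are assumptions.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a small positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def draw_poisson(mean, rng):
    """Knuth's multiplication method; adequate for the modest means used here."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_site(x, K, beta, h2, sigma2, t, p, intercept=0.0, seed=0):
    """Simulate (total, methylated) read counts for one CpG site.

    Assumption: NB(t, p) has size t and success probability p, so its
    mean is t(1 - p) / p; it is drawn as a gamma-Poisson mixture.
    """
    rng = random.Random(seed)
    n = len(x)
    # Covariance of g + e on the logit scale: sigma2 * (h2 * K + (1 - h2) * I).
    cov = [[sigma2 * (h2 * K[i][j] + (1 - h2) * (i == j)) for j in range(n)]
           for i in range(n)]
    low = cholesky(cov)
    normals = [rng.gauss(0, 1) for _ in range(n)]
    ge = [sum(low[i][k] * normals[k] for k in range(i + 1)) for i in range(n)]
    totals, methylated = [], []
    for i in range(n):
        # Methylation level: inverse logit of the linear predictor.
        pi = 1 / (1 + math.exp(-(intercept + x[i] * beta + ge[i])))
        rate = rng.gammavariate(t, (1 - p) / p)  # gamma mixing (shape, scale)
        r = draw_poisson(rate, rng)
        y = sum(rng.random() < pi for _ in range(r))  # Binomial(r, pi) draw
        totals.append(r)
        methylated.append(y)
    return totals, methylated

# Hypothetical inputs: four unrelated individuals with ages as the predictor.
ages = [2.0, 5.5, 9.0, 14.0]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
totals, methylated = simulate_site(ages, identity, beta=0.1, h2=0.3,
                                   sigma2=0.5, t=5.0, p=0.3, seed=1)
```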

### Comparison of MACAU to existing methods

For all simulated and real data sets, we used raw methylated and total read counts to compare the results of a beta-binomial model (using a custom R script), a binomial model (implemented via ‘glm’ in R), and the binomial mixed model implemented in MACAU. For computation time comparison, we also used the MCMCglmm software that implements the binomial mixed model [67]. In addition, we used the same count data to run a Fisher’s exact test (implemented in R), DSS [32], and RadMeth [33] in the subset of analyses that utilized these programs. Finally, to analyze simulated and real data sets using a linear model (implemented using ‘lm’ in R) or the linear mixed model implemented in GEMMA [46], we estimated DNA methylation levels by dividing the number of methylated reads by the total read count for each individual and CpG site. We then quantile normalized the resulting proportions for each CpG site to a standard normal distribution, and imputed any missing data using the K-nearest neighbors algorithm in the R package *impute* [110].
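For the linear model and linear mixed model inputs, the conversion from counts to quantile-normalized methylation levels might look like the sketch below. The rank-to-quantile rule, (rank − 0.5)/n, and the tie-breaking by input order are assumptions, and missing values are simply left as `None` rather than imputed.

```python
from statistics import NormalDist

def quantile_normalize(values):
    """Map values to standard normal quantiles by rank; None marks missing data.

    Assumption: ranks are converted with (rank - 0.5) / n and ties are
    broken by input order; the original analysis may have used another rule.
    """
    observed = sorted((v, i) for i, v in enumerate(values) if v is not None)
    n = len(observed)
    out = [None] * len(values)
    for rank, (_, i) in enumerate(observed, start=1):
        out[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return out

# Hypothetical (methylated, total) counts for five individuals at one site.
counts = [(3, 10), (9, 10), (0, 0), (5, 10), (1, 10)]
levels = [m / t if t > 0 else None for m, t in counts]
z = quantile_normalize(levels)
```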

To compute empirical false discovery rates in simulated data, we divided the number of false positives detected at a given *p*-value threshold by the total number of sites called by the model as significant at that threshold (i.e., the sum of false positives and true positives). To compute empirical false discovery rates in the real data, in which the false positives and true positives were unknown, we used permutations. Specifically, we permuted the predictor variable for each data set four times, reran our analyses, and then calculated the false discovery rate as the average number of sites detected at a given *p*-value threshold in the permuted data divided by the total number of sites detected at that threshold in the real data. For simulated data sets only, we also calculated the area under the receiver operating characteristic curve (AUC) to produce a measure of the overall tradeoff between detecting true positives and calling false positives.
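Both false discovery rate calculations reduce to simple counting, as in the sketch below. The function names and toy *p*-values are hypothetical, and treating a site as detected when its *p*-value is less than or equal to the threshold is an assumption.

```python
def fdr_simulated(pvalues, is_true_effect, threshold):
    """False positives divided by all sites called significant."""
    called = [truth for p, truth in zip(pvalues, is_true_effect) if p <= threshold]
    if not called:
        return 0.0
    return called.count(False) / len(called)

def fdr_permutation(real_pvalues, permuted_pvalue_sets, threshold):
    """Average number of hits across permutations divided by hits in the real data."""
    real_hits = sum(p <= threshold for p in real_pvalues)
    if real_hits == 0:
        return 0.0
    perm_hits = [sum(p <= threshold for p in ps) for ps in permuted_pvalue_sets]
    return (sum(perm_hits) / len(perm_hits)) / real_hits

# Simulated data: the truth is known for every site.
sim_fdr = fdr_simulated([0.001, 0.02, 0.04, 0.5],
                        [True, False, True, False], threshold=0.05)

# Real data: four permutations of the predictor stand in for the null.
perm_fdr = fdr_permutation(
    [0.001, 0.01, 0.02, 0.03, 0.6],
    [[0.04, 0.5, 0.6, 0.7, 0.8], [0.2, 0.3, 0.4, 0.5, 0.6],
     [0.01, 0.03, 0.5, 0.6, 0.7], [0.9, 0.8, 0.7, 0.6, 0.04]],
    threshold=0.05)
```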

### Analysis of age-related changes in DNA methylation levels

Our initial analyses of the baboon RRBS dataset focused only on the relative ability of each method to detect age-associated sites. For these analyses, we therefore did not control for other biological covariates that may contribute to variance in DNA methylation levels (note that biological covariates cannot be incorporated into several implementations of the beta-binomial model [32,34]: see Table 1). However, to investigate patterns of age-related changes in DNA methylation levels, and to compare them to previously described patterns in the literature, we wished to control for such covariates. To do so, we reran the differential methylation analysis in MACAU, this time controlling for sex, sample age, and efficiency of the bisulfite conversion rate estimated from the lambda phage spike-in.

First, we investigated whether age-associated sites were enriched in functionally coherent regions of the genome, many of which have previously been identified as age-related [35,75,76]. To do so, we defined gene bodies as the regions between the 5’-most transcription start site (TSS) and 3’-most transcription end site (TES) of each gene using *Panu* 2.0 annotations from Ensembl [111]. We defined promoter regions as the 2 kb upstream of the TSS. CpG islands were annotated based on the UCSC Genome Browser track for baboon [112], with CpG island shores defined as the 2 kb regions flanking either side of the CpG island boundary (following [27,113,114]). Finally, because no enhancer annotations are available that are specific to baboons, we used H3K4me1 ChIP-seq data generated by ENCODE (from human peripheral blood mononuclear cells) to define enhancer regions [79]. In addition, we used chromatin state annotations from the Roadmap Epigenomics Project (also generated from human peripheral blood mononuclear cells) to further investigate biases in the locations of age-associated sites [78]. Using these annotation sets, we performed Fisher’s Exact Tests to ask whether age-associated sites were enriched or underrepresented in specific genomic regions.

Second, we asked whether differentially methylated sites were more likely to fall close to blood-expressed genes. For this comparison, we drew on previously published RNA-seq data, generated from whole blood samples collected in the Amboseli baboon population [42]. We defined blood-expressed genes as those genes that had non-zero counts in more than 10% of individuals in the RNA-seq data sets, and that had mean read counts greater than or equal to 10. We then compared the number of differentially methylated CpG sites near blood-expressed genes (i.e., within the gene body or within 10 kb of the gene TSS or TES) to the number of differentially methylated CpG sites near genes that were not expressed in blood, using a Fisher’s Exact Test.
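An enrichment comparison of this kind comes down to a Fisher's exact test on a 2 × 2 table. The sketch below is a self-contained version using the hypergeometric distribution; the function name and the counts are hypothetical, and the two-sided rule (summing all tables no more probable than the observed one) is the common convention, assumed here rather than taken from the study.

```python
from math import comb

def fisher_exact(a, b, c, d):
    """Odds ratio and two-sided p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k counts in the top-left cell.
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    p_value = sum(prob(k) for k in range(lo, hi + 1)
                  if prob(k) <= p_obs * (1 + 1e-7))
    odds = (a * d) / (b * c) if b * c > 0 else float("inf")
    return odds, min(1.0, p_value)

# Hypothetical counts:
#   a = differentially methylated sites near blood-expressed genes
#   b = other sites near blood-expressed genes
#   c = differentially methylated sites near genes not expressed in blood
#   d = other sites near genes not expressed in blood
odds_ratio, p_value = fisher_exact(3, 1, 1, 3)
```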

Finally, we investigated whether CpG sites that occur near genes that are differentially expressed with age were also more likely to be differentially methylated with age. For this comparison, we defined ‘age-associated genes’ as genes differentially expressed with age (at a 10% FDR) in the RNA-seq data set [42]. We compared the number of differentially methylated CpG sites near blood-expressed, age-associated genes to the number of differentially methylated CpG sites near genes that were not within this set of genes, again using a Fisher’s Exact Test.

### Software and data availability

The MACAU software and a custom script for implementing a beta-binomial model in R are available at: www.xzlab.org/software.html. Previously published data sets are available at http://bergelson.uchicago.edu/regmap-data/regmap.html/ (*Arabidopsis* SNP genotype data), http://www.ncbi.nlm.nih.gov/geo/ (*Arabidopsis* WGBS data: GSE43857, Baboon RNA-seq data: GSE63788), and http://www.nature.com/nature/journal/v465/n7298/full/nature08800.html#supplementary-information (*Arabidopsis* phenotype data). The baboon RRBS data set will be made publicly available at the NCBI Gene Expression Omnibus upon manuscript acceptance.

## Funding

This work was supported by a seed grant from the Duke Population Research Institute to JT (a component of 5R24-HD065563-03 to S. Sanders); NIH grants R21-AG049936 and R03-AG045459-01 to JT and SCA; NSF grant BCS-1455808 to JT and AJL; and a start-up fund from the University of Michigan to XZ.

## Text S1: Detailed Methods

### 1 Binomial Mixed Model

To detect differentially methylated sites, we model each potential target of DNA methylation one site at a time. For each site, we consider the following binomial mixed model (BMM):

$$y_i \sim \mathrm{Bin}(r_i, \pi_i),$$

where *r*_{i} is the total read count for the *i*th individual; *y*_{i} is the methylated read count for that individual, constrained to be an integer value less than or equal to *r*_{i}; and *π*_{i} is an unknown parameter that represents the true proportion of methylated reads for the individual at the site. We use a logit link to model *π*_{i} as a linear function of parameters:

$$\mathrm{logit}(\pi_i) = \mathbf{w}_i^T \boldsymbol{\alpha} + x_i \beta + g_i + e_i,$$

$$\mathbf{g} = (g_1, \ldots, g_n)^T \sim \mathrm{MVN}(\mathbf{0}, \sigma^2 h^2 \mathbf{K}),$$

$$\mathbf{e} = (e_1, \ldots, e_n)^T \sim \mathrm{MVN}(\mathbf{0}, \sigma^2 (1 - h^2) \mathbf{I}),$$

where logit denotes a logistic transformation, logit(*π*_{i}) = log(*λ*_{i}); *λ*_{i} = *π*_{i}/(1 − *π*_{i}) is the odds; **w**_{i} is a *c*-vector of covariates including an intercept and **α** is a *c*-vector of corresponding coefficients; *x*_{i} is the predictor of interest and *β* is its coefficient; **g** is an *n*-vector of genetic random effects that model correlation due to population structure or individual relatedness; **e** is an *n*-vector of environmental residual errors that model independent variation; **K** is a known *n* by *n* relatedness matrix that can be calculated based on a pedigree or genotype data and that has been standardized to ensure *tr*(**K**)/*n* = 1 (this ensures that *h*^{2} lies between 0 and 1, and can be interpreted as heritability, see [1]); **I** is an *n* by *n* identity matrix; *σ*^{2}*h*^{2} is the genetic variance component; *σ*^{2}(1 − *h*^{2}) is the environmental variance component; *h*^{2} is the heritability of the logit transformed methylation proportion (i.e., logit(*π*)); and MVN denotes the multivariate normal distribution.
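To make the variance decomposition concrete, the sketch below rescales a relatedness matrix so that *tr*(**K**)/*n* = 1 and then builds the covariance of **g** + **e** on the logit scale, which is *σ*^{2}(*h*^{2}**K** + (1 − *h*^{2})**I**) because the two random effects are independent. The function names and the toy matrix are hypothetical.

```python
def standardize_relatedness(K):
    """Rescale K so that trace(K) / n equals 1."""
    n = len(K)
    scale = sum(K[i][i] for i in range(n)) / n
    return [[K[i][j] / scale for j in range(n)] for i in range(n)]

def random_effect_covariance(K, sigma2, h2):
    """Covariance of g + e: sigma2 * (h2 * K + (1 - h2) * I)."""
    n = len(K)
    return [[sigma2 * (h2 * K[i][j] + (1 - h2) * (1.0 if i == j else 0.0))
             for j in range(n)] for i in range(n)]

# Toy relatedness matrix for two individuals (hypothetical values).
K_raw = [[2.0, 1.0], [1.0, 2.0]]
K_std = standardize_relatedness(K_raw)
cov = random_effect_covariance(K_std, sigma2=2.0, h2=0.5)
```

After standardization the diagonal of the covariance averages to *σ*^{2}, so *h*^{2} reads directly as the share of that variance attributed to the genetic term.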

The binomial mixed model proposed here belongs to the generalized linear mixed model family [2]. Both **g** and **e** model over-dispersion, the increased variance in the data that is not explained by the binomial model. However, they model different aspects of over-dispersion: **e** models the variation that is due to independent environmental noise (a known problem in data sets based on sequencing reads), while **g** models the variation that is explained by kinship or population structure. Effectively, our model improves and generalizes the previous beta binomial model by introducing this extra **g** term to model individual relatedness due to kinship, population structure, or stratification.

### 2 Inference Method Overview

We are interested in testing the null hypothesis *H*_{0} : *β* = 0. This requires obtaining the maximum likelihood estimate $\hat{\beta}$ from the model. Unlike its linear counterpart, obtaining $\hat{\beta}$ from the binomial mixed model is not a trivial task, as the joint likelihood consists of an *n*-dimensional integral that cannot be solved analytically [2]. Previous frequentist approaches to address this problem include direct numerical integration using Gauss-Hermite quadrature [3], or Laplace approximation applied to the likelihood function [4] or the quasi-likelihood function [5–8]. However, neither numerical integration nor analytic approximation scales well with the increasing dimension of the integral, which unfortunately equals the sample size in our model. Even a second order Laplace approximation yields a biased estimate and an overly narrow confidence interval, especially when the uncertainty in the variance component estimate is large [9–13]. Therefore, frequentist estimation and inference in the binomial mixed model remain notoriously difficult and are still an active area of research [14].

In contrast to the frequentist methods, Markov chain Monte Carlo (MCMC)-based Bayesian approaches provide an appealing alternative [11]. Bayesian methods naturally account for the uncertainty in the variance component estimates and can achieve arbitrary inference accuracy if the chain is allowed to run long enough. Despite these attractive theoretical features, however, constructing an efficient MCMC algorithm for practical problems is not easy. Previous MCMC approaches for generalized linear mixed models either require a normal approximation to the likelihood function that diminishes their gains over the frequentist methods [15,16], or use *n* steps of the Metropolis–Hastings algorithm to sample the *n*-dimensional latent rate variables, where efficient proposal distributions for all of them can be hard to construct [17,18]. To improve upon these previous approaches, a new MCMC algorithm [19–21] has been recently developed based on an auxiliary variable representation of the binomial distribution [22]. By introducing latent variables to replace the observed count data, the algorithm makes sampling and computation relatively straightforward.

Therefore, we rely on this particular form of MCMC in the present study. Our main contribution is to further develop an accurate approximation to the distribution of these latent variables, where the approximation form is specifically designed to allow us to adapt recent mixed model innovations [23–26] that substantially reduce the computational burden. By using a mean-normal mixture approximation to the negative log gamma distribution, our approach reduces the per-MCMC iteration computational complexity from *O*(*n*^{3}) to *O*(*n*^{2}), where *n* is the sample size. This modification allows the binomial mixed model to be efficiently applied to hundreds of individuals and millions of methylation sites.

Although we use MCMC for posterior sampling, our primary goal is not to perform a Bayesian analysis by producing Bayes factors for model comparison (although this is an interesting area to explore in the future). Rather, our goal is to use MCMC as a convenient and accurate tool to obtain the marginal likelihood of *β* that is otherwise infeasible or inaccurate to obtain under various frequentist approaches. Under asymptotics, both the likelihood function and the marginal posterior distribution for *β* will be approximately normal [27]. Since, on the log scale, the likelihood function is simply the difference between the posterior and the prior, once we have obtained the posterior mean and standard deviation of *β* and paired these values to their prior counterparts, we can easily obtain the approximate likelihood function and compute the approximate maximum likelihood estimate $\hat{\beta}$ and its standard error *se*($\hat{\beta}$) using the method of moments. We can then construct approximate Wald test statistics and *p* values for hypothesis testing.
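One way to read this step, assuming a normal prior and an approximately normal posterior for *β*, is that the likelihood precision is the posterior precision minus the prior precision, and the likelihood mean follows from the corresponding precision-weighted difference. The sketch below implements that reading; the function name and the numerical values (including the prior standard deviation of 10) are hypothetical, and the exact moment-matching used in MACAU may differ.

```python
import math
from statistics import NormalDist

def approximate_wald_test(post_mean, post_sd, prior_mean, prior_sd):
    """Recover an approximate MLE, its standard error, and a Wald p-value.

    Assumes normal prior and posterior, so that on the log scale
    likelihood = posterior - prior (up to a constant).
    """
    post_prec = 1.0 / post_sd ** 2
    prior_prec = 1.0 / prior_sd ** 2
    lik_prec = post_prec - prior_prec
    if lik_prec <= 0:
        raise ValueError("posterior must be more concentrated than the prior")
    beta_hat = (post_mean * post_prec - prior_mean * prior_prec) / lik_prec
    se = math.sqrt(1.0 / lik_prec)
    z = beta_hat / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return beta_hat, se, p_value

# Hypothetical MCMC summaries for one site.
beta_hat, se, p_value = approximate_wald_test(
    post_mean=0.5, post_sd=0.2, prior_mean=0.0, prior_sd=10.0)
```

Because the prior is much wider than the posterior in this example, the recovered estimate and standard error are almost identical to the posterior mean and standard deviation, which illustrates why the choice of prior has little effect on the final test.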

In the present study, we use flat priors for all nuisance parameters (**α**, *σ*^{2}, *h*^{2}), or *p*(**α**) ∼ 1, *p*(*σ*^{2}) ∼ 1 and *p*(*h*^{2}) ∼ 1. For the parameter of interest, *β*, we could also use a flat prior, in which case the posterior would be the likelihood. For computational stability reasons, however, we use a relatively informative prior instead. A relatively informative prior restricts the sampling space when the likelihood is not informative, allowing efficient posterior sampling. Since we rely on the difference between the posterior and the prior for approximate inference, the choice of prior for *β* does not influence the eventual results. In the present study, we set .

Applications to real data confirm that this procedure produces well-calibrated *p*-values (Figure 1), suggesting that a few dozen samples are large enough to ensure asymptotic behavior. Moreover, although our approach is inherently stochastic – because the posterior mean and standard deviation of *β* may be slightly different for different chains – we show that a thousand MCMC iterations per site is large enough to produce stable estimates of the test statistics and *p* values (Figure S2).

### 3 The MACAU Algorithm

Below, we describe the MACAU algorithm (Mixed model Association for Count data via data AUgmentation) in detail.

#### 3.1 Data Augmentation

To bypass the difficult likelihood function that results from the count nature of the data, we introduce continuous auxiliary variables to replace *y*_{i}. For the *i*th individual, observing *y*_{i} methylated reads out of *r*_{i} total reads is equivalent to observing a sequence of *r*_{i} binary read indicators (*y*_{i1},…, *y*_{ir_i}), where *y*_{ij} = 1 indicates that the *j*th read is a methylated read and *y*_{ij} = 0 indicates otherwise. We can view each *y*_{ij} as a random variable generated from a logistic regression model with mean log(*λ*_{i}). We further introduce a continuous latent variable *u*_{ij} [19, 20], often referred to as a utility [22], such that

$$u_{ij} = \log(\lambda_i) + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim \mathrm{EV}(0, 1),$$

where EV(0, 1) denotes a standard type-1 extreme value distribution with density function $f(\varepsilon) = \exp(-\varepsilon - e^{-\varepsilon})$. Then

$$y_{ij} = I(u_{ij} > u_{ij}^{0}),$$

where $u_{ij}^{0} \sim \mathrm{EV}(0, 1)$ is an independent reference utility and $I(\cdot)$ is the indicator function. The above two equations come from the fact that the difference between two type-1 extreme value distributed random variables follows a logistic distribution, and a random variable that follows a logistic distribution serves as a liability variable for a logistic regression [22].

The attractive feature of introducing *u*_{ij} is that, conditional on all *u*_{ij}, the posterior of (**α**, *β*, *σ*^{2}, *h*^{2}) no longer depends on the observed methylated read indicator *y*_{ij}, hence removing the non-linearity constraint that comes with the binomial aspect of our model. Applying the relationship between the EV distribution and the exponential distribution, we have $e^{-u_{ij}} \sim \mathrm{Exp}(\lambda_i)$ and $e^{-u_{ij}^{0}} \sim \mathrm{Exp}(1)$, where Exp denotes the exponential distribution. This relationship allows us to easily sample *u*_{ij} conditional on *λ*_{i} and *y*_{ij} based on the convenient exponential distribution rather than the more difficult EV distribution, as $e^{-u_{ij}} \sim \mathrm{Exp}(1 + \lambda_i)$ if *y*_{ij} = 1 and $e^{-u_{ij}} \sim \mathrm{Exp}(1 + \lambda_i) + \mathrm{Exp}(\lambda_i)$ if *y*_{ij} = 0.

An undesirable feature of the above approach, however, is that we have to work with a much larger latent space of *u*_{ij} than the original *n* observations of *y*_{i}. This drawback can be mitigated by combining all exponentiated negative latent utilities together [21], by introducing a new latent variable

$$z_i = -\log\left(\sum_{j=1}^{r_i} e^{-u_{ij}}\right) = \log(\lambda_i) + \varepsilon_i,$$

where *ε*_{i} follows a negative log gamma distribution, − log(Ga(*r*_{i}, 1)); Ga denotes a gamma distribution with the two parameters representing shape and rate, respectively. This is because a gamma random variable is a summation of independent exponential random variables with the same rate parameter.

Using the latent variable *z*_{i} instead of *u*_{ij} reduces the size of the latent space back to the observed space. Conditional on *z*_{i}, we again do not need to use *y*_{i}, allowing us to bypass the count feature of the observed data in the algorithm.
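The data augmentation step can be sketched directly from the conditional distributions stated above: each read contributes an Exp(1 + *λ*) draw, and each unmethylated read contributes an additional Exp(*λ*) draw. The way the per-read terms are combined here, as the negative log of their sum, is an assumption chosen to be consistent with the negative log gamma statement; the function name and parameter values are hypothetical.

```python
import math
import random

def sample_z(y, r, lam, rng):
    """Draw the aggregated latent variable for one individual.

    y: methylated reads, r: total reads (r >= 1), lam: odds pi / (1 - pi).
    Each read contributes exp(-u_ij), drawn from Exp(1 + lam) if the read
    is methylated and from Exp(1 + lam) + Exp(lam) otherwise.
    """
    total = 0.0
    for j in range(r):
        value = rng.expovariate(1.0 + lam)   # rate parameter 1 + lam
        if j >= y:                           # read j is unmethylated
            value += rng.expovariate(lam)
        total += value
    return -math.log(total)

# Monte Carlo check with hypothetical values: after averaging over
# y ~ Binomial(r, lam / (1 + lam)), z - log(lam) should behave like
# -log(Ga(r, 1)), whose mean is minus the digamma function at r.
rng = random.Random(1)
lam, r, draws = 2.0, 3, 20000
pi = lam / (1.0 + lam)
shifted = []
for _ in range(draws):
    y = sum(rng.random() < pi for _ in range(r))
    shifted.append(sample_z(y, r, lam, rng) - math.log(lam))
mc_mean = sum(shifted) / draws
expected_mean = -(1.0 + 0.5 - 0.5772156649015329)  # -digamma(3)
```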

#### 3.2 Normal Mixture Approximation

To further circumvent the difficulty introduced by the non-normality of *ε*_{i}, we follow previous ideas [20, 21] to approximate the non-normal distribution by using a mixture of normals. Importantly, we take advantage of recent innovations in efficient mixed model algorithms [23–26] by using a mean mixture of normals, in which the normal components have different means but share the same variance.

Specifically, for every possible integer value of *r*, we identify a normal approximation in the form of $\sum_{k=1}^{k_r} w_{rk}\,\mathrm{N}(m_{rk}, s_r^2)$ to the negative log gamma distribution − log(Ga(*r*, 1)). Because the mean (−Ψ(*r*), where Ψ denotes the digamma function) and the variance (Ψ′(*r*), where Ψ′ denotes the trigamma function) of the negative log gamma distribution are functions of *r*, to ensure approximation stability we work on the standardized version of the negative log gamma distribution, centering by the mean and scaling by the standard deviation. Then, we estimate the number of components *k*_{r}, the weights *w*_{rk}, the means *m*_{rk} and the shared variance $s_r^2$ via the Nelder-Mead algorithm by minimizing the Kullback–Leibler (KL) divergence between the two distributions. These parameter estimates ensure that the KL divergence is smaller than 0.0005, so that the difference between the approximate and the exact distributions is negligible in practice. Because the negative log gamma distribution is asymptotically normal, the approximation becomes easier for larger *r*. Therefore, we can use an increasingly small number of normal components for accurate approximation.

For small values of *r* (*r* ∈ [1, 5]), we provide detailed parameter values in Table S1. For intermediate values of *r* (*r* ∈ [6, 169]), we no longer need to store parameters for every *r*. Instead, we can interpolate the weight, mean and variance estimates across the range of *r* using rational functions without loss of accuracy. These functions are provided in Table S2. For large values of *r* (*r* ∈ [170, ∞)), we use a single normal distribution N(0, Ψ′(*r*)) for approximation. The mean normal mixture approximations are accurate: even in the most difficult case where *r* = 1, we observe only a small difference between the approximate and the exact distributions (Figure S3).
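The quantities used to standardize the negative log gamma distribution are easy to compute for integer *r*, because the digamma and trigamma functions reduce to finite sums at integer arguments. The sketch below computes them and reports which of the three approximation regimes applies; the function names are hypothetical, and the mixture parameters themselves (Tables S1 and S2) are not reproduced.

```python
import math

EULER_GAMMA = 0.5772156649015329

def digamma_int(r):
    """Digamma function at a positive integer: -gamma + sum_{k<r} 1/k."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, r))

def trigamma_int(r):
    """Trigamma function at a positive integer: pi^2/6 - sum_{k<r} 1/k^2."""
    return math.pi ** 2 / 6.0 - sum(1.0 / k ** 2 for k in range(1, r))

def neg_log_gamma_moments(r):
    """Mean and variance of -log(Ga(r, 1))."""
    return -digamma_int(r), trigamma_int(r)

def standardize(eps, r):
    """Center by the mean and scale by the standard deviation."""
    mean, var = neg_log_gamma_moments(r)
    return (eps - mean) / math.sqrt(var)

def approximation_regime(r):
    """Which approximation the text describes for a given total read count."""
    if r <= 5:
        return "tabulated mixture"
    if r <= 169:
        return "interpolated mixture"
    return "single normal"

mean_1, var_1 = neg_log_gamma_moments(1)
mean_2, var_2 = neg_log_gamma_moments(2)
```

The variance shrinks as *r* grows (roughly like 1/*r*), in line with the distribution becoming closer to normal for large read counts.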

#### 3.3 Detailed Sampling Steps and Efficient Computation

Now we are ready to describe the detailed MCMC algorithm. Here, with the normal mixture approximation, we have

We introduce a vector of latent indicators *γ* = (*γ*_{1},…, *γ*_{n}), where each *γ*_{i} ∈ (1,…, *k*_{r_i}) indicates which normal component the corresponding *ε*_{i} is from. Conditional on *z*_{i} and (**α**, *β*, *g*_{i}, *e*_{i}), we have where *k* ∈ (1,…, *k*_{r_i}) and **Φ** denotes the normal density function. Conditional on *γ*, we can integrate out **α**, *β*, **g**, **e** and *ε* analytically to obtain the marginal distribution of *σ*^{2} and *h*^{2}, where **z** = (*z*_{1},…, *z*_{n})^{T}, **m**_{γ} = (*m*_{r_1γ_1},…, *m*_{r_nγ_n})^{T}, **W** = (**w**_{1},…, **w**_{n})^{T}, **D**_{r} is an *n* by *n* diagonal matrix with *ii*th element , **V** = *h*^{2}**K** + (1 − *h*^{2})**I**, **H** = *σ*^{2}**V** + **D**_{r}, **P**_{w} = **H**^{−1} − **H**^{−1}**W**(**W**^{T}**H**^{−1}**W**)^{−1}**W**^{T}**H**^{−1} and .

We can use the Metropolis–Hastings (MH) algorithm to obtain posterior samples for *σ*^{2} and *h*^{2} jointly. Afterwards, we can obtain posterior samples for **α**, *β* and **g** + **e** in turn,

Finally, conditional on *y*_{i} and *λ*_{i}, the posterior of *z*_{i} is easy to sample by using the relationship between the gamma distribution and the exponential distribution.

The most computationally expensive part of the algorithm is the MH step: a naive approach to evaluating *P*(*σ*^{2}, *h*^{2}|*z*_{i}, *γ*_{i}) would involve cubic operations. Our mean normal mixture approximation allows us to evaluate this marginal likelihood efficiently, because we can apply the recently developed mixed model innovations [23–26]. This is possible because, given the observed data, **D**_{r} is a fixed diagonal matrix whose elements do not depend on *γ*, which changes in every MCMC iteration. Therefore, for a given matrix **V**, we can perform a single eigen-decomposition that involves only **V** and **D**_{r}, and use it to decompose **H**. Afterwards, we can transform the latent variables and other covariates accordingly. This procedure avoids any cubic operations later on in the MCMC steps. Therefore, with the mean normal mixture approximation, we only need to perform eigen-decompositions at the beginning of the MCMC. Afterwards, each Gibbs step only requires quadratic operations (transformation of **z − m**_{γ}).

In practice, because **V** is a function of *h*^{2}, we assign a discrete uniform prior to *h*^{2} and compute the eigen-decomposition for each discrete value of *h*^{2}. In the present study, we found that using either 10 or 100 discrete values of *h*^{2} yields almost identical results (we present the results for the former in the main text), suggesting that a fine grid for *h*^{2} is not necessary because of our small sample size. Finally, for all analyses in the present study, we ran 1100 Gibbs sampling iterations with the first 100 as burn-in. In each Gibbs iteration, after sampling the latent variables **z** and the latent indicators *γ*, we further ran 10 MH steps before continuing the Gibbs iterations.
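One way to see why a single eigen-decomposition per value of *h*^{2} suffices is sketched below. The particular matrix being decomposed and the symbols $\mathbf{U}$, $\boldsymbol{\Delta}$, $\delta_i$ and $\tilde{\mathbf{z}}$ are assumptions introduced for illustration; they follow from $\mathbf{H} = \sigma^2\mathbf{V} + \mathbf{D}_r$ with $\mathbf{D}_r$ fixed, diagonal and positive, but the exact form used in the software may differ.

$$
\begin{aligned}
\mathbf{D}_r^{-1/2}\,\mathbf{V}\,\mathbf{D}_r^{-1/2} &= \mathbf{U}\boldsymbol{\Delta}\mathbf{U}^T, \qquad \boldsymbol{\Delta} = \mathrm{diag}(\delta_1,\dots,\delta_n),\\
\mathbf{H} = \sigma^2\mathbf{V} + \mathbf{D}_r &= \mathbf{D}_r^{1/2}\,\mathbf{U}\,(\sigma^2\boldsymbol{\Delta} + \mathbf{I})\,\mathbf{U}^T\,\mathbf{D}_r^{1/2},\\
\mathbf{H}^{-1} &= \mathbf{D}_r^{-1/2}\,\mathbf{U}\,(\sigma^2\boldsymbol{\Delta} + \mathbf{I})^{-1}\,\mathbf{U}^T\,\mathbf{D}_r^{-1/2},\\
(\mathbf{z}-\mathbf{m}_\gamma)^T\mathbf{H}^{-1}(\mathbf{z}-\mathbf{m}_\gamma) &= \sum_{i=1}^{n}\frac{\tilde z_i^{\,2}}{\sigma^2\delta_i + 1}, \qquad \tilde{\mathbf{z}} = \mathbf{U}^T\mathbf{D}_r^{-1/2}(\mathbf{z}-\mathbf{m}_\gamma).
\end{aligned}
$$

In this sketch, $\mathbf{U}$ and $\boldsymbol{\Delta}$ depend on *h*^{2} but not on *σ*^{2} or *γ*, so they can be computed once per grid value of *h*^{2}. Each subsequent evaluation then needs only the quadratic-cost transformation of **z − m**_{γ} followed by sums over *n* terms.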

### 4 Parameter Estimation and *p* Value Computation

We first obtain the posterior mean and the posterior variance from the MCMC samples. Since both the likelihood and the posterior are asymptotically normal, and because we also use a normal distribution as the prior, we can easily obtain the approximate maximum likelihood estimate and its standard error by the method of moments.
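The method-of-moments step can be written out as follows. This is a sketch under stated assumptions: the symbols $\hat\beta_p$ and $\hat\sigma_p^2$ (posterior mean and variance), $\sigma_0^2$ (prior variance), and $\hat\beta$ and $\mathrm{se}(\hat\beta)$ (approximate maximum likelihood estimate and standard error) are introduced here for illustration, and the prior is assumed to be centered at zero.

$$
\begin{aligned}
\frac{1}{\hat\sigma_p^2} &= \frac{1}{\sigma_0^2} + \frac{1}{\mathrm{se}(\hat\beta)^2}
&&\Longrightarrow&
\mathrm{se}(\hat\beta)^2 &= \left(\frac{1}{\hat\sigma_p^2} - \frac{1}{\sigma_0^2}\right)^{-1},\\
\frac{\hat\beta_p}{\hat\sigma_p^2} &= \frac{\hat\beta}{\mathrm{se}(\hat\beta)^2}
&&\Longrightarrow&
\hat\beta &= \mathrm{se}(\hat\beta)^2\,\frac{\hat\beta_p}{\hat\sigma_p^2}.
\end{aligned}
$$

These identities are the usual normal-normal conjugacy relations (posterior precision equals prior precision plus likelihood precision), solved for the likelihood quantities. The standard error is well defined only when $\hat\sigma_p^2 < \sigma_0^2$.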

The condition required for this calculation, namely that the posterior variance be smaller than the prior variance, is guaranteed by asymptotics. In rare cases, however, this condition may not be satisfied because of the limited number of MCMC sampling iterations used in practice. This may be particularly concerning for sites where the likelihood function is not informative. Arguably, these non-informative sites are the ones that we do not want to analyze in the first place. Therefore, this condition gives us a natural way to perform post-filtering. In the software implementation, we do not analyze sites where the ratio of the posterior variance to the prior variance exceeds a user-defined threshold *c* (*c* ≤ 1). We use *c* = 0.95 throughout the present study. This post-filtering step, however, has minimal influence on the results, as only a few dozen sites, out of half a million, are filtered out in each analysis.
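The conversion and post-filtering described above can be sketched in a few lines. This is an illustrative sketch rather than the MACAU code: the function name and arguments are hypothetical, the prior is assumed to be normal with mean zero, the filter is assumed to compare the posterior-to-prior variance ratio with *c*, and the two-sided Wald *p*-value is an assumption about how the test statistic is formed.

```python
import math

def approximate_mle(post_mean, post_var, prior_var, c=0.95):
    """Convert posterior summaries to an approximate MLE, SE and p-value.

    Returns None when the site fails the post-filter, i.e. when the
    posterior variance is not sufficiently smaller than the prior variance
    (assumed filter: post_var / prior_var > c).
    """
    if post_var / prior_var > c:
        return None
    # Likelihood precision = posterior precision - prior precision.
    lik_var = 1.0 / (1.0 / post_var - 1.0 / prior_var)
    # With a zero-mean prior, the precision-weighted means give the estimate.
    beta_hat = lik_var * post_mean / post_var
    se = math.sqrt(lik_var)
    # Two-sided p-value from a normal (Wald) approximation (assumption).
    p_value = math.erfc(abs(beta_hat / se) / math.sqrt(2.0))
    return beta_hat, se, p_value

if __name__ == "__main__":
    print(approximate_mle(post_mean=0.5, post_var=0.04, prior_var=1.0))
    print(approximate_mle(post_mean=0.1, post_var=0.99, prior_var=1.0))
```

In the second call the posterior variance is almost as large as the prior variance, so the data carry little information about the effect and the site is skipped.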

## Acknowledgments

We thank the Kenya Wildlife Services, Institute of Primate Research, National Museums of Kenya, National Council for Science and Technology, members of the Amboseli-Longido pastoralist communities, Tortilis Camp, and Ker & Downey Safaris for their assistance in Kenya. We also thank Jeanne Altmann for general support and access to the Amboseli data set and samples, Raphael Mututua, Serah Sayialel, Kinyua Warutere, Mercy Akinyi, Tim Wango, and Vivian Oudu for invaluable assistance with sample collection; Matthew Stephens and Sayan Mukherjee for insight and support on previous versions of MACAU; and Dan Runcie for useful suggestions on data applications. Finally, we thank the Baylor College of Medicine Human Genome Sequencing Center for access to the current version of the baboon genome assembly (*Panu 2.0*).

## Footnotes

Other author emails: amanda.lea{at}duke.edu, alberts{at}duke.edu, jt5{at}duke.edu
