Bayesian mixture model for clustering rare-variant effects in human genetic studies

Rare-variant aggregate analysis from exome and whole genome sequencing data typically summarizes with a single statistic the signal for a gene or the unit that is being aggregated. However, when doing so, the effect profile within the unit may not be easily characterized across one or multiple phenotypes. Here, we present an approach we call Multiple Rarevariants and Phenotypes Mixture Model (MRPMM), which clusters rare variants into groups based on their effects on the multivariate phenotype and makes statistical inferences about the properties of the underlying mixture of genetic effects. Using summary statistic data from a meta-analysis of exome sequencing data of 184,698 individuals in the UK Biobank across 6 populations, we demonstrate that our mixture model can identify clusters of variants responsible for significantly disparate effects across a multivariate phenotype; we study three lipid and three renal traits separately. The method is able to estimate (1) the proportion of non-null variants, (2) whether variants with the same predicted consequence in one gene behave similarly, (3) whether variants across genes share effect profiles across the multivariate phenotype, and (4) whether different annotations differ in the magnitude of their effects. As rare-variant data and aggregation techniques become more common, this method can be used to ascribe further meaning to association results.


Introduction
Population-scale sequencing studies are becoming pervasive [1][2][3][4] . As a result, analyses considering the joint contribution of rare variants to disease susceptibility and phenotypic variation are also becoming pervasive. Commonly used aggregation approaches for rare-variant association studies include the sequence kernel association test, the burden test, and more-general Bayesian model comparison methods [5][6][7][8][9][10][11][12][13][14] . However, aggregation as performed in these methods also tends to lose information within the blocks (typically, genes) specified; that is, they fail to indicate whether certain variants (or certain types of variants) within the same block may have different effects on the multivariate phenotype. In particular, they do not pinpoint which variants are driving the association signal. It is thus critical that we develop methods that can "trace back" variants' effects, clustering the variants used within the blocks of interest into groups that have distinct per-phenotype aggregate effects while adequately accounting for the uncertainty that rare-variant studies exhibit.
Clustering methods fall into several categories, the most prevalent of which are distancebased methods (such as K-means and self-organizing maps) that are sensitive to the noise that typically plagues biological data 8 . Model-based methods such as mixture models are a simple but elegant alternative that assume that data are generated from multiple source distributions, which are then learned and set as the "clusters". Algorithms such as Expectation Maximization can estimate latent variables that underlie these clusters. The finite-mixture model, one such model-based method, assumes a finite number of clusters and asserts that this number can be estimated using goodness-of-fit criteria like the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) 15,16 .
In this study, we propose to use a Bayesian hierarchical mixture model where a hierarchical structure is introduced to allow the sharing of information among related clusters (in our case, genes) and where the number of clusters are pre-specified. Our approach, the Multiple Rarevariants and Phenotypes Mixture Model (MRPMM), considers several factors when estimating parameters of interest underlying the mixture of effects driving an association signal. We calculate matrices of genetic correlations among phenotypes of interest using both common or null variants and rarer or significant variants. The method also estimates the spread of effects across predicted consequences of the genomic variants (variant annotations), which is a critical aspect of interpreting genetic findings. The annotations represent expected severity of impact on phenotypes (for example, protein-truncating variants, or PTVs 17,18 , are predicted to truncate the protein product, and are purported to be much more deleterious than other variants). In MRPMM, we use summary statistics (for single-variant single-phenotype GWAS, these are per-variant estimates of marginal univariate effect size and corresponding standard errors). In practice, sharing of individual genotype and phenotype data across groups in large genetic consortia is difficult to achieve due to privacy concerns and consent issues; using summary statistics can bypass these issues while also increasing computational efficiency without reducing accuracy. Insights from Liu et al. 19 and Cichonska et al. 20 also suggest that the use of additional summary statistics, like covariance estimates across variants and studies, respectively, enable a lossless ability to detect gene-based association signals using summary statistics alone.

Methods
Algorithm: Our goal is to cluster the variants into groups based on their effects on the multivariate phenotype; that is, we are interested in the joint posterior distribution that indicates cluster memberships per-variant and per-gene. For multi-parameter models such as these, the joint posterior may be difficult to sample from directly. Often, it is easier to sample sequentially from the full conditional distribution of each parameter 21 using a Gibbs sampler, a Markov Chain Monte Carlo (MCMC) algorithm that constructs a dependent sequence of parameter values whose distribution approximates the joint posterior [21][22][23] . We implement the Gibbs sampler in MRPMM as follows: • The prior for b c is N (0, Θ 0 ), where Θ 0 is an estimate of genetic correlation across the traits.
• Consider the model with C > 1 clusters. In order to model the sharing of clusters across the genes, we first draw a C-dimensional probability vector π 0 ∼ Dirichlet (1, 1, 1, . . . , 1) .
• Next, for each gene j, we draw a probability vector π j |π 0 ∼ Dirichlet (απ 0 ) to determine the mixture proportions π jc that dictate how the variants in gene j are distributed across the clusters 1, . . . , C.
• The parameter α governs how similar the cluster proportions are across genes; it is drawn from a prior α ∼ Inv-Gamma (1, 1) .
• The algorithm also takes into account the functional annotation, via σ 2 a , a variance parameter for annotation a across the clusters. It follows an inverse-gamma prior σ 2 a ∼ Inv-Gamma (sh a , sp a ) , with hyperparameter values sh a (shape) and sp a (spread).
• Under the above assumptions of the model, the phenotype of individual i with a rare allele of variant m with annotation a in gene j is V Y is the estimated residual variance-covariance matrix of phenotypes after the effects of the variants have been regressed out.
• If we have access only to summary statistics of estimated effect sizes β jm , then we estimate variance-covariance matrices V jm for each variant m in gene j as where SE m denotes the K-dimensional vector of standard errors across K phenotypes for variant m, and Ω is the K × K matrix of correlation of errors estimated from null variants. Then, our sampling model for the data is β jm |(π j , b, σ 2 a ) ∼ C c=1 π jc N σ a b c , V jm .
• Similar to the individual level data model above, this formulation assumes independence between the variants, which is approximately true when each individual carries at most one of the rare variants considered. A connection between the two data types is that where f jm is the frequency of the rare allele at variant m of gene j among the 2N haplotypes, and y jm is the mean phenotype of the carrier individuals of that allele.
• To compare models with different numbers of clusters, we use the Bayesian Information Criterion (BIC) 15 , defined for the model M C with C clusters as where D denotes the observed data, θ C is the vector of maximum likelihood estimates of the parameters of M C , ν c is the number of independent parameters of M c , and n is the number of data points. The difference in BIC values between two models approximates twice the logarithm of the Bayes Factor between the models, with lower BIC value corresponding to the model preferred.
MRPMM utilizes Metropolis-Hastings (MH) steps to update the parameters of interest. The algorithm is run for n burn + n iter iterations, of which the first n burn are discarded as an initial "burnin". With superscripts in parentheses denoting iteration, the steps of the algorithm are as follows.
2. Repeat the following steps for iterations t = 1, 2, . . . , n burn + n iter : (a) Update π 0 (probability of cluster assignment independent of gene). We use a Metropolis-Hastings sub-step 24 using a proposal centred around the current value, drawing π 0 ∼ Dirichlet γπ To calculate the acceptance probability, we define the normalizing constant for the C-dimensional Dirichlet distribution with parameters z as and the density at point x as x zc−1 c .
With this notation, the Metropolis-Hastings transition probability from π , and the density of the observed data depends on π (t−1) 0 only through the product J j=1 p Dir π j |απ (t−1) 0 . Hence, the proposal acceptance probability is (b) For each gene j = 1, . . . , J we update π j (per-gene cluster parameter) to be where δ jm is the index of the cluster to which the variant m of gene j belongs to and I(·) is the indicator function.
(c) Update δ jm for all j = 1, . . . , J and m = 1, . . . , M j . For all c, compute and renormalize in such a way that p jmc ∝ p jmc sums to 1 over all c. Then, we sample δ jm ∼ Discrete(p jmc ).
(d) Update b c (cluster effect profile) using a Gibbs update from a Gaussian distribution: (e) Update σ 2 a (annotation spread parameter). We use a Metropolis-Hastings sub-step using a random-walk proposal. That is, sequentially for each annotation a, we sample a proposal value where ξ 0 is a hyperparameter controlling the spread of the proposals.
In the examples, we have used ξ 0 = 1. Then we calculate the acceptance probability where the products are over those variants v jm whose annotation is a and c jm is the (f) Update α (parameter determining sharing of clusters across genes). We again use a Metropolis-Hastings sub-step using a random-walk proposal. That is, we sample a where ξ α is a fixed value controlling the variance of the proposal distribution. Then, we compute the acceptance probability where the prior for α is Inv-Gamma(1,1) (i.e., p(α) = α −2 exp (−α −1 )), and p Dir is the density of the Dirichlet distribution as defined above in part (a). With probability λ, we set α (t) = α , and with probability 1 − λ, α (t) = α (t−1) . refined the population definition as described elsewhere 25 Semi-related individuals were grouped as individuals whose genetic data (after passing UK Biobank QC filters; sufficiently low missingness rates; and genetically inferred sex matching reported sex), using a King's relationship table, were between conditional third and conditional second degrees of relatedness. Admixed individuals were grouped as unrelated individuals who were flagged as "used_in_pca_calculation" by the UK Biobank and were not assigned to any of the other populations 6 .

Data
We performed genome-wide association analysis on individuals with whole-exome sequenc- Field 30720], and effective glomerular filtration rate [derived from UK Biobank Field 30700]).
The analyses were performed for each of the six population subgroups as defined above using PLINK v2.00a (20 October 2020). The quantitative trait values were rank normalized using thepheno-quantile-normalize flag. We used age, sex, and the first ten genetic principal components as covariates in the analyses. The analysis was performed for 5,850,789 rare (minor allele frequency ≤ 0.01) protein-truncating (492,151) and protein-altering (5,358,638) variants.
For the admixed population, we conducted local ancestry-corrected GWAS. We first assem- Hunter-Gatherer, East Asian, European, Native American, Oceanian, South Asian, and West Asian.
We then used RFMix v2.03 30 to assign each of the 20,727 windows across the phased genomes to one of these eight ancestry clusters (for all individuals in the UK Biobank). These local ancestry assignments were subsequently used with PLINK2 as local covariates in the GWAS for the admixed individuals for SNPs within those respective windows. PLINK2 allows for the direct input of the RFMix output (the MSP file, which contains the most likely subpopulation assignment per conditional random field [CRF] point) as local covariates using the "local-cov", "local-psam", and "local-haps" flags, the "local-cats0=n" flag (where n is the number of assignments), and the "localpos-cols=2,1,2,7" flag (for a typical RFMix MSP output file -see "Association Analysis" page on PLINK website).
Subsequently, we used METAL 31 to perform inverse-variance weighted meta-analysis to generate a single summary statistic file per phenotype.
For the remainder, we used Variant Effect Predictor (VEP) 32 to annotate the most severe consequence, the gene symbol, and HGVSp of each variant in the UK Biobank exome data. We calculated minor allele frequencies using PLINK. We provide these metadata, which are necessary for MRPMM, in exome and array tables, available for direct download via the Global Biobank Engine 33 (Code Availability).

Results
To assess MRPMM's ability to estimate the underlying mixture of effects from summary statis-  [3,569], semi-related [18,100], and admixed [11,961]). After using a Bayesian model comparison approach 6 to consider GWAS evidence across each set of phenotypes and across each gene, we chose 13 genes that had a log 10 Bayes Factor (BF) ≥ 5 for lipid traits, and 5 genes that met that criterion for renal traits. We then ran MRPMM across PTVs in these genes, increasing the number of hypothesized clusters until there was an increase in BIC, a goodness-of-fit measure which is minimized under ideal conditions. We found that four clusters (including the "null cluster" with a constant effect of 0 across phenotypes) was favored for both sets of phenotypes (Figure 1), and that each cluster had a distinct effect size profile on the multivariate phenotype (Figures 2, 3).
Specifically, we see that there are some APOB variants (Figure 4a -left side) that have a strong negative effect on TG and LDL levels (Figure 2d) through exclusive membership to Cluster 3. To the right of those variants, other variants are definitively placed into Cluster 2 and Cluster 1 respectively. The variants that are definitively placed into Cluster 2 feature the PCSK9 and ANGPTL3 genes; MRPMM shows that these variants down-regulate TG and LDL levels ( Figure   2c). The variants which belong exclusively to Cluster 1 feature the PDE3B, APOC3, CETP, and ANGPTL8 genes and have positive effects on HDL levels, negative effects on TG levels, and mild negative effects on LDL levels (Figure 2b). We can perform a similar visual analysis with the renal multivariate phenotype results (Figure 4b). Variants from the CGNL1, RNF186, SLC22A2, and SLC34A3 genes largely seem to fall under Cluster 2, whereas variants from CST3 clearly fall into Cluster 1 in an isolated manner. Cluster 3 seems to be populated partially by several variants in CGNL1. These variant-specific breakdowns and their gene-aggregate counterparts (Figures 5a, 5b) help trace back the variants' effect profile on the multivariate phenotype. Unlike in aggregation approaches, where a single statistic captures which genes are associated with the trait without providing much context as to the nature of the association, MRPMM has the useful characteristic of being able to cluster not only variants but also genes into different effect profiles. We provide single-trait MRPMM results for all traits across the UK Biobank on the Global Biobank Engine 33 (Code Availability).

Future Directions and Discussion
In this study, we used a Bayesian hierarchical mixture model to estimate the underlying mixture of components driving association signals between various protein-truncating variants and a set of genetically related phenotypes. By explicitly modeling the sharing of effects across genes, we are able to use Gibbs sampling to approximately infer the joint posterior distribution and thereby assign variants to clusters. In both applications, we see that clusters have vastly different effect size profiles on the sets of phenotypes chosen; this shows that aggregating rare-variant signal in blocks (e.g. genes) may not fully encapsulate the information that is available in summary statistics. Coupling a Bayesian model comparison approach as described by Venkataraman et al. 6 with MRPMM may be a way to (1)         that are marked as belonging to Cluster 3 on the left are responsible for coding for a hepatic lipase enzyme, hence their strong effects on triglycerides and LDL cholesterol levels (see Figure 3D).
Only PTVs that have posterior probability ≤ 0.2 of belonging in the null cluster (Cluster 0) are shown. Variants are sorted by the posterior probability of belonging to the null cluster.

Figure 4b:
Variant-level posterior probabilities that PTVs in candidate genes in the UK Biobank exome sequencing data set belong to each of four clusters hypothesized with respect to a renal-related multivariate phenotype of creatinine (CRE), cystatin C (CSTC), and effective glomerular filtration rate (eGFR). The PTVs in CST3 that are marked as belonging to Cluster 1 in the middle are responsible for coding cystatin C directly, hence their strong effects CSTC levels (see Figure 4B). Only PTVs that have posterior probability ≤ 0.2 of belonging in the null cluster (Cluster 0) are shown. Variants are sorted by the posterior probability of belonging to the null cluster. Figure 5a: Gene-level posterior probabilities that PTVs in candidate genes in the UK Biobank exome sequencing data set belong to each of four clusters hypothesized with respect to a lipid-related multivariate phenotype of high-density lipoprotein cholesterol (HDL), low-density lipoprotein cholesterol (LDL), and triglyceride (TG) levels. All PTVs in the analysis are accounted for here. Figure 5b: Gene-level posterior probabilities that PTVs in candidate genes in the UK Biobank exome sequencing data set belong to each of four clusters hypothesized with respect to a renal-related multivariate phenotype of creatinine (CRE), cystatin C (CSTC), and effective glomerular filtration rate (eGFR). All PTVs in the analysis are accounted for here.

Code Availability
We have published the full set of associations (log 10 BF ≥ 5) from an independent effects model