Joint modelling of whole genome sequence data for human height via approximate message passing

Human height is a model for the genetic analysis of complex traits, and recent studies suggest the presence of thousands of common genetic variant associations and hundreds of low-frequency/rare variants. However, it has not yet been possible to fine-map the genetic basis of height, since all variant effects have not been modelled jointly leaving correlations unaccounted for. To address this issue, we develop a new algorithmic paradigm based on approximate message passing, gVAMP, to directly fine-map whole-genome sequence (WGS) variants and gene burden scores, conditional on all other measured DNA variation genome-wide. We find that the genetic architecture of height inferred from WGS data differs from that inferred from imputed single nucleotide polymorphism (SNP) variants: common variant associations from imputed SNP data are allocated to WGS variants of lower frequency, and there is a stronger relationship of effect size and variant frequency. Thus, even fine-mapped imputed variants are systematically mis-assigned and without the joint analysis of WGS data it remains premature, if not unfounded, to make statements regarding the number of independent associations and their properties. We validate gVAMP on various datasets across UK Biobank traits where it outperforms widely used methods for polygenic risk score prediction and association testing, offering a scalable foundation towards analyzing hundreds of millions of variables measured on millions of people.


Introduction
Efficient utilization of large-scale biobank data is crucial for inferring the genetic basis of disease and predicting health outcomes from DNA.The common statistical approach of single-marker or single-gene burden score regression [1][2][3][4], gives marginal associations that do not account for linkage disequilibrium (LD), and in whole genome sequence (WGS) data, even weak associations that are physically distant from causal variants will be discovered as significant at scale.While fine-mapping aims to identify causal variants, current methods focus only on genome-wide significant loci within one region at a time [5], in isolation from the rest of the genome, resulting in miscalibration and a compromise of power.Thus, we currently lack accurate statistical models to jointly estimate the effect of each locus, conditional on all other genetic variants.Applying whole genome regression (WGR), where the effect of each variant is estimated conditional on all others, has the potential to resolve these issues and reveal the underlying genetic architecture of complex traits.
Here, we focus on the highly heritable polygenic phenotype of human height, and develop a new framework, gVAMP, which fits tens of millions of WGS variants jointly at scale.Applying gVAMP to WGS data on hundreds of thousands of UK Biobank participants, we find a stronger relationship of effect size and variant frequency, due to common variant associations in imputed SNP data being allocated to WGS variants of lower frequency.These insights could not be obtained from existing statistical approaches, and we additionally validate gVAMP on a number of datasets by benchmarking against the state of the art: gVAMP outperforms widely used summary statistic methods such as LDpred2 [6] and SBayesR [7] for polygenic risk score prediction, and an individuallevel REGENIE [1] method for association testing.Additionally, we show that its performance matches that of MCMC sampling schemes [8] but with a dramatic speed-up in time (analysing 8.4M SNPs jointly in under day as opposed to weeks).This lays the foundations for a wider range of analyses in large WGS datasets that are entirely infeasible for other methods.

Overview of the approach
Our focus is on the simple idea of joint association testing controlling for local and long-range LD: we estimate the significance of each variant, conditional on all other observed DNA locations genome-wide.To do this, we consider a general form of whole-genome Bayesian linear regression, common to genome-wide association studies (GWAS) [7,8], estimating the effects vector β ∈ R P from a vector of phenotype measurements y = (y 1 , . . ., y N ) ∈ R N given by y i = ⟨x i , β⟩ + ϵ i , for i ∈ {1, . . ., N }. (1) Here, x i is the row of the normalized genotype matrix X corresponding to the i-th individual, ⟨x i , β⟩ = x T i β denotes the inner product, and ϵ = (ϵ 1 , . . ., ϵ N ) is an unknown noise vector with multivariate normal distribution N (0, γ −1 ϵ • I) and unknown noise precision γ −1 ϵ .To allow for a range of genetic effects, we select the prior on β to be of an adaptive spike-and-slab form: Here, λ ∈ [0, 1] is the DNA variant inclusion rate, L is the unknown number of Gaussian mixtures, (π i ) L i=1 denote the mixture probabilities and (σ 2 i ) L i=1 the variances for the slab component.
Current association testing [1][2][3][4], fine-mapping [5] and polygenic risk score methods [8,9] are all based on forms of Equations ( 1) and ( 2), with parameters estimated by restricted maximum likelihood (REML), Markov Chain Monte Carlo (MCMC), expectation maximisation (EM), or variational inference (VI).REML and MCMC are computationally intensive and slow; while EM and VI are faster, they trade speed for accuracy with few theoretical guarantees.Furthermore, current software implementations of these algorithms limit either the number of markers or individuals.
Mixed linear model association (MLMA) approaches are restricted to using less than one million SNPs to control for the polygenic background [1,2], resulting in a loss of power [8] and the potential for inadequate control for fine-scale confounding factors [10].Polygenic risk score algorithms are limited to a few million SNPs, and lose power by modelling only blocks of genetic markers [6,7].
Likewise, fine-mapping methods are generally limited to focal segments of the DNA [5], and they are unable to fit all genome-wide DNA variants together.Thus, no existing approach can apply the statistical model of Equations ( 1) and ( 2) to jointly estimate the effects vector β and the genome-wide significance of each element in WGS data.
We overcome this issue by developing a new approach for GWAS inference, dubbed genomic Vector Approximate Message Passing (gVAMP).Approximate Message Passing (AMP) [11][12][13] refers to a family of iterative algorithms with several attractive properties: (i) AMP allows the usage of a wide range of Bayesian priors; (ii) the AMP performance for high-dimensional data can be precisely characterized by a simple recursion called state evolution [14]; (iii) using state evolution, joint association test statistics can be obtained [15]; and (iv) AMP achieves Bayes-optimal performance in several settings [15][16][17].However, we find that existing AMP algorithms proposed for various applications [18][19][20][21] cannot be transferred to biobank analyses as: (i) they are entirely infeasible at scale, requiring expensive singular value decompositions; and (ii) they give diverging estimates of the signal in either simulated genomic data or the UK Biobank data.To address the problem, we combine a number of principled approaches to produce an Expectation Propagation method tailored to whole genome regression as described in the Methods (Algorithm 1).gVAMP approximates the posterior E[β | X, y], providing joint effect size estimates and statistical testing via state evolution (see "gVAMP SE association testing" in the Methods).Additionally, we learn all unknown parameters in an adaptive Bayes expectation-maximisation (EM) framework [22,23], which avoids expensive cross-validation and yields biologically informative inference of the phenotypic variance attributable to the genomic data (SNP heritability, h 2 SN P ) allowing for the first full characterisation of the genetic architecture of human complex traits in WGS data.testing framework, capable of estimating the effects of each genomic position conditional on all other SNP markers.In the panel "Imputed SNPs", we combine 8,430,446 autosomal imputed SNP markers with 17,852 whole exome sequencing gene burden scores, estimating the effects jointly within the gVAMP SE testing framework.In the panel "WGS" we combine 16,854,878 whole genome sequence variants with 17,852 whole exome sequencing gene burden scores, again estimating the effects jointly within the gVAMP SE testing framework.

The genetic architecture of human height
When analysing 415,000 UK Biobank individuals, we find that the genetic architecture of human height inferred from 16,854,878 WGS variants differs to that inferred from 8,430,446 imputed SNP markers (Figure 1).gVAMP estimates the proportion of phenotypic variance in human height attributable to WGS data as 0.652, as compared to 0.63 for the imputed SNP data, comparable to a previously published REML estimate in a different WGS dataset [25] and family-based estimates [26].This confirms that additional phenotypic variation is attributable to variants in WGS data that are missing in imputed SNP data.Surprisingly, despite this increase in attributable height variation, when gVAMP maps effects to single-locus positions across the DNA in WGS data, we find 526 genome-wide significant effects as compared to 930 in the imputed data (Figure 1).This decrease in genome-wide significant loci occurs because the WGS analysis attributes height variation to DNA variants of lower minor allele frequency (MAF), as compared to the imputed SNP data analysis, giving a reduction in association testing power (Figure 2).
For WGS variants significant at different thresholds, we determine whether the same variant, or a variant within a given number of base pairs (distance, x-axis), is identified in the imputed SNP data at the same significance threshold.We find little overlap in the locations of the SNPs that we genome-wide significance thresholds, we determine whether we also identify a variant within a given number of base pairs (distance, x-axis) in the imputed SNP data at the same threshold.The overlap is calculated as the proportion of WGS variants of either > 1% or < 1% minor allele frequency (MAF) that are discovered at a given threshold within a certain base pair distance.(b) For each whole genome sequence (WGS) variant discovered at different genome-wide significance thresholds, we determine whether a variant was identified within the latest height GWAS study [24] at the same threshold for a given number of base pairs (distance, x-axis), with the overlap calculated as in (a).We select the GWAS marginal summary statistics for European individuals including (incl.UKB), or excluding (excl.UKB) the UK Biobank (for analysis details see [24]).
(c) For each WGS variant at genome-wide significance level p ≤ 5 • 10 −6 , we determine the imputed SNP at p ≤ 5 • 10 −6 with the closest MAF and show a histogram of these frequency differences for WGS variants of different frequencies.(d) For all WGS variants and imputed SNPs, we calculate the proportional contribution to phenotypic variance across different MAF groups.
map to single-locus resolution across the two datasets within 1kb (a small proportion maps to the same location), but substantial overlap within 100kb either side of the WGS findings (Figure 2a).
Thus, similar DNA regions are identified, but the effects are assigned to different variant locations.
When we determine whether the same variant, or a variant within a given number of base pairs (distance, x-axis), was identified in the most recent GWAS study of human height [24], we again find little overlap in the locations of the SNPs that we map to single-locus resolution across the two datasets within 1kb, but substantial overlap within 100kb either side of the WGS findings (Figure 2b).This demonstrates that our results are predominantly replicated in large-scale GWAS studies, but again that in WGS data effects are localised to different DNA variants.
When we then examine the properties of the WGS variants we identify, we find that WGS variants of MAF ≤ 16% are generally always mis-mapped in the imputed data to variants of higher frequency (Figure 2c).For each WGS variant discovered at genome-wide significance level p ≤ 5 • 10 −6 , we determine whether there is an imputed SNP at p ≤ 5 • 10 −6 within 250kb, and show a histogram of the imputed variant with closest MAF, separating the discovered WGS variants by their frequency: we find that, while some variants map to the same location across the two datasets (frequency difference of 0), the majority do not and are assigned in the imputed data to variants of higher frequency (Figure 2c).We also find that each WGS variant not discovered in the imputed data can have multiple neighbouring SNPs of various MAF distribution with the same significance level (Figure S1).As a consequence, the phenotypic variance attributable to different MAF groups differs in WGS as compared to imputed SNP data, with less variance attributable to common SNPs in WGS data (Figure 2d).This shows that for human height, many discoveries in imputed SNP data are attributable to variants of lower frequency in WGS data.Thus, fine-mapped imputed variants can be systematically mis-assigned, without a full joint analysis of WGS data.
Despite the expectation that fine-mapped variants would show elevated marginal test statistics in standard GWAS association testing, we find that many genome-wide significant height-associated rare variants discovered in both the WGS and imputed data were not found in previous UK Biobank analyses in Open Targets including: rs116467226, an intronic variant by TPRG1; rs766919361, an intronic variant by FGF18; rs141168133, an intergenic variant 19kb from ID4; rs150556786, an upstream gene variant for GRM4; rs574843917, a non-coding transcript exon variant in GPR21; rs532230290, an intergenic variant 42kb from SCYL1; rs543038394, intronic in OVOL1; rs1247942912, a non-coding transcript exon variant in AC024257.3;rs577630729, a regulatory region variant for ISG20; and rs140846043, a non-coding transcript exon variant of MIRLET7BHG. 10 out of our top 38 height-associated WGS variants of ≤ 1% MAF were not previously discovered, and only become height-associated when conditioning on the entire polygenic background captured by WGS data within our analysis.
We can directly determine the relationship between effect size and minor allele frequency, which again differs between WGS and imputed SNP data (Figure 3a).For variants of significance p ≤ 5•10 −4 , the power relationship of effect size and locus variance, denoted as α in the literature [27], is α = −0.318,95% CI = 0. We highlight the benefits of joint estimation to explore genetic architecture, where controlling for LD allows effects to be summed over different categories, facilitating gene/annotation analyses.
We find a general concordance of the estimated effect sizes of the 17,852 WES gene burden scores when fit alongside either imputed or WGS data, for most but not all genes (Figure 3b).We sum up the joint effects to determine the variation in height attributable to each gene, and again find general concordance across markers annotated to each of the 17,852 genes, conditional on all other markers, across the imputed SNP and WGS analysis (Figure 3c).We see little relationship between the variance attributable to the burden score of a given gene and the SNPs annotated to the gene (Figure S2), with some notable exceptions: ACAN, ADAMTS17, ADAMTS10, and LCORL.The top genes, where gVAMP attributes ≥ 0.04% of height variation in addition to those listed above are EFEMP1, ZBTB38, and ZFAT.All the 38 genome-wide significant gene burden scores are for genes that have previous GWAS height associations linked to them in Open Targets, but our analysis suggests that the effect is attributable to a rare protein coding variant rather than the common variants suggested by current genome-wide association studies.,852 WES gene burden scores, we find general concordance of the estimated effect sizes when fit alongside either imputed or WGS data, for most but not all genes, with squared correlation 0.532.(c) Likewise, we also find general concordance of the phenotypic variance attributable to markers annotated to each of the 17,852 genes, when fit conditional on either imputed or WGS variants, with squared correlation 0.494.(d) Finally, we show similar patterns of enrichment when annotating markers to functional annotations in either the proportion of variance attributable to each group (labelled "variance"), or in the average effect size relative to the genome-wide average effect size (labelled "effect") from joint estimation of either imputed or WGS data.
Additionally, across annotations, we find that the phenotypic variance attributable to different DNA regions is higher for intergenic and intronic variants (Figure 3d).However, when adjusting for the number of SNPs contained by each category by using the average effect size of the group relative to the average effect size genome-wide, we find that exonic variants that are nonsynonymous, splicing and stop-gain have the largest average effects of the WGS variants included within our model (Figure 3d).Taken together, joint association testing of all WGS variants resolves many previously discovered height-associated DNA regions to rare DNA variants where exonic variants have large effect sizes, insights that cannot be provided by other GWAS approaches at a scale of 17M DNA variants.As gVAMP is the only algorithm that scales to tens of millions of WGS variants, we can only validate and benchmark gVAMP against state-of-the-art approaches in a number of alternative datasets.We find that (i) gVAMP outperforms summary statistic approaches [6,7] for polygenic risk score prediction, (ii) it outperforms REGENIE [1] for mixed-linear model association testing, and (iii) it has similar performance to MCMC approaches [8], but in a fraction of the compute time, which allows analyses at far larger scale that then result in improved performance (Figure 4).

Validation and benchmarking of gVAMP
Specifically, we compare the prediction accuracy of gVAMP to the widely used summary statistic methods LDpred2 [6] and SBayesR [7], and to the individual-level method GMRM [8] for imputed SNP data in the UK Biobank across 13 traits (training data sample size and trait codes given in Table S1).gVAMP outperforms all methods for most phenotypes and, in comparison to published estimates, we obtain the highest out-of-sample prediction accuracy yet reported to date for most traits.Specifically, for human height, we obtain an accuracy of 45.7%, which is a 97.8% relative increase over LDpred2 (Table 1 and Figure 4) and higher than the accuracy obtained from the latest height GWAS study of 3.5M people of 44.7% [24], despite our sample size of only 414,055.However, we caution that modelling WGS data does not improve the out-of-sample prediction obtained as compared to the 8.4M imputed SNP results for human height, and this is again likely because of the reduction in the MAF of the markers included within the WGS model.
Generally, gVAMP performs similarly to GMRM, an MCMC sampling algorithm, improving over it when analysing the full set of 8,430,446 imputed SNPs (Table 1 and Figure 4).gVAMP estimates the h 2 SN P of each of the 13 traits at an average of 3.4% less than GMRM when using 2,174,071 SNP markers, but at an average of 3.9% greater than GMRM when using 8,430,446 SNP markers.This implies that more of the phenotypic variance is captured by the SNPs when the full imputed SNP data are used (Figure S3).We note that GMRM already takes several days to analyze 2,174,071 SNPs and would take many weeks to analyze 8,430,446 SNPs, making it entirely infeasible to run at that scale.In contrast, gVAMP yields estimates on 8,430,446 SNPs in under a day (Supplementary Note 1, Figure S5c).Second, we highlight that gVAMP can also be used for standard leave-one-chromosome-out (LOCO) statistical testing.We compare gVAMP to REGENIE and to GMRM for association testing of the 13 traits within a MLMA framework.gVAMP performs similarly in LOCO testing conducted using a predictor from GMRM, with the use of the full 8,430,446 imputed SNP markers generally improving performance (Table 2 and Figure 4b).REGENIE yields far fewer associations than either GMRM or gVAMP for all traits (Table 2 and Figure 4b), consistent with simulation study results presented in Supplementary Note 1.Using the gVAMP SE association testing framework, we find hundreds of marker associations for each trait that can be localised to the single locus level, conditional on all other SNPs genome-wide (Table 2, Figure 4c), with the obvious caveat that these results are for imputed SNP data and re-analysis of WGS data may yield different results as we show above for height.For all 13 traits, we find that the SE association estimates we obtain converge in number and location after iteration 20 (Figure S4).
In addition, we conduct a series of simulation studies showing that gVAMP is the only approach to generate genetic predictors and association test statistics in a single step, without additional computations, with accuracy similar to individual-level MCMC methods achieved in a fraction of the compute time (Supplementary Note 1).As compared to REGENIE, gVAMP completes in 2/3 of the time given the same data and compute resources, and it is dramatically faster (12.5× speed-up) than GMRM (Supplementary Note 1, Figure S5c).

Discussion
Our results reveal that a different genetic architecture for human height is inferred in WGS data, as compared to imputed SNP data, which shows that fine-mapping results can be miscalibrated by missing rare variants.Although large sample sizes of WGS data will be needed to pinpoint the variants responsible for the heritability of traits, our results show that the prioritization of relevant genes and gene sets is feasible at smaller sample sizes in imputed data.We highlight that gVAMP is not restricted to the analysis of WGS data, and it also provides a general approach to obtain genetic predictors and MLMA association test statistics in a single step, with accuracy similar to individual-level MCMC methods, but in a fraction of the compute time.We demonstrate this both in an extensive simulation study and in the analysis of 13 UK Biobank traits.Importantly, we provide a different association testing approach where the effects of each locus or burden score can be estimated conditional on all other DNA variation genome-wide.This allows associations to be localised to the single-locus, or single-gene level, refining associations by testing each of them against a full genetic background of millions of DNA variants.
There are a number of remaining limitations.Our results suggest that the detection and accurate estimation of the effects of height-associated variants is expected to be difficult even with millions of WGS samples.There are a very large number of rare variants within the human population that are missing from our WGS analysis of 16,854,878 variants, and we believe it is quite likely that the true underlying genetic architecture of human height is even rarer than we present here.
Two solutions could be to (i) apply gVAMP region-by-region, or (ii) make slight modifications in the implementation, so that gVAMP streams data, at the cost of increased run time.Additionally, while our approach can be applied within any sub-grouping of data (by age, genetic sex, ethnicity, etc.), this is not within the scope of the present work.Combining inference across different groups is of great importance [28], and previous work suggests that better modelling within a single large biobank can facilitate improved association testing in other global biobanks [29].Here, while our approach can be used in the same way, maximising association and prediction across the human population requires a model that is capable of accounting for differences in the design matrix (minor allele frequency and linkage disequilibrium patterns) across different datasets.Our ongoing work now aims at expanding the gVAMP framework to make inference across a diverse range of human groups, to model different outcome distributions (binary outcomes, time-to-event, count data, etc.), to allow for different effect size relationships across allele frequency and LD groups, to model multiple outcomes jointly, and to do all of this using summary statistic as well as individual-level data across different biobanks.This is key to obtaining the sample sizes that are likely required to fully explore the genetic basis of complex traits.
In summary, gVAMP is a different way to create genetic predictors and to conduct association testing.With increasing sample sizes reducing standard errors, a vast number of genomic regions are being identified as significantly associated with trait outcomes by one-SNP-at-a-time association testing.Such large numbers of findings will make it increasingly difficult to determine the relative importance of a given mutation, especially in whole genome sequence data with dense, highly correlated variants.Thus, it is crucial to develop statistical approaches fitting all variants jointly and asking whether, given the LD structure of the data, there is evidence for an effect at each locus, conditional on all others.

Methods gVAMP algorithm
Approximate message passing (AMP) was originally proposed for linear regression [11,14,30] assuming a Gaussian design matrix X.To accommodate a wider class of structured design matrices, vector approximate message passing (VAMP) was introduced in [12].The performance of VAMP can be precisely characterized via a deterministic, low-dimensional state evolution recursion, for any right-orthogonally invariant design matrix.We recall that a matrix is right-orthogonally invariant if its right singular vectors are distributed according to the Haar measure, i.e., they are uniform in the group of orthogonal matrices.gVAMP extends EM-VAMP, introduced in [12,22,23], in which the prior parameters are adaptively learnt from the data via EM, and it is an iterative procedure consisting of two steps: (i) denoising, and (ii) linear minimum mean square error estimation (LMMSE).The denoising step accounts for the prior structure given a noisy estimate of the signal β, while the LMMSE step utilizes phenotype values to further refine the estimate by accounting for the LD structure of the data.
A key feature of the algorithm is the so called Onsager correction: this is added to ensure the asymptotic normality of the noise corrupting the estimates of β at every iteration.Here, in contrast to MCMC or other iterative approaches, the normality is guaranteed under mild assumptions on the normalized genotype matrix.This property allows a precise performance analysis via state evolution and, consequently, the optimization of the method.
In particular, the quantity γ 1,t in line 7 of Algorithm 1 is the state evolution parameter tracking the error incurred by r 1,t in estimating β at iteration t.The state evolution result gives that r 1,t is asymptotically Gaussian, i.e., for sufficiently large N and P , r 1,t is approximately distributed as Here, β represents the signal to be estimated, with the prior learned via EM steps at iteration t: Compared to Equation (2), the subscript t in λ t , π t,l , σ t,l indicates that these parameters change through iterations, as they are adaptively learned by the algorithm.Similarly, r 2,t is approximately distributed as N (β, γ −1 2,t I).The Gaussianity of r 1,t , r 2,t is enforced by the presence of the Onsager coefficients α 1,t and α 2,t , see lines 17 and 22 of Algorithm 1, respectively.We also note that α 1,t Algorithm 1 gVAMP 1: Input: preprocessed normalized genotype matrix X ∈ R N ×P , max number of iterations N it , initial estimate of effect sizes r 1,0 = 0 P ∈ R P , initial estimate of effect sizes precision γ 1,0 = 10 −6 > 0, initial estimate of noise precision γ ϵ,0 = 2, initial set of parameters defining the prior distribution Θ 0 = {λ, (π }, max number of variance auto-tuning steps N var_tune = 5 ∈ N, threshold for stopping criterion ε = 10 −4 > 0, damping factor ρ ∈ (0, 1).2: for t = 0, 1, . . ., N it do 3:
The vectors r 1,t , r 2,t are obtained after the LMMSE step, and they are further improved via the denoising step, which respectively gives β1,t , β2,t .In the denoising step, we exploit our estimate of the approximated posterior by computing the conditional expectation of β with respect to r 1,t , r 2,t in order to minimize the mean square error of the estimated effects.For example, let us focus on the pair (r 1,t , β1,t ) (analogous considerations hold for (r 2,t , β2,t )).Then, we have that Here, f t : R → R denotes the denoiser at iteration t and the notation f t (r 1,t ) assumes that the denoiser f t is applied component-wise to elements of r 1,t .Note that, in line 15 of Algorithm 1, we take this approach one step further by performing an additional step of damping, see "Algorithm stability" below.
From Bayes theorem, one can calculate the posterior distribution (which here has the form of a spike-and-slab mixture of Gaussians) and obtain its expectation.Hence, by denoting a generic component of r 1,t as r 1 , it follows that where N (r 1 ; 0, γ −1 1,t + σ 2 t,l ) denotes the probability density function of a Gaussian with mean 0 and variance γ −1 1,t + σ 2 t,l evaluated at r 1 .Furthermore, we set , with σ 2 t, * := max l (σ 2 t,l ).This form of the denoiser is particularly convenient, as we typically deal with very sparse distributions when estimating genetic associations.We also note that the calculation of the Onsager coefficient in line 17 of Algorithm 1 requires the evaluation of a conditional variance, which is computed as the ratio of the derivative of the denoiser over the error in the estimation of the signal, i.e., The calculation of the derivative of f t is detailed in Supplementary Note 2.
If one has access to the singular value decomposition (SVD) of the data matrix X, the periteration complexity is of order O(N P ).However, at biobank scales, performing the SVD is computationally infeasible.Thus, the linear system (γ ϵ,t X T X + γ 2,t I) −1 (γ ϵ,t X T y + γ 2,t r 2,t ) (see line 21 of Algorithm 1) needs to be solved using an iterative method, in contrast to having an analytic solution in terms of the elements of the singular value decomposition of X.In the next section, we provide details on how we overcome this issue.

Scaling up using warm-start conjugate gradients
We approximate the solution of the linear system (γ ϵ,t X T X + γ 2,t I) −1 (γ ϵ,t X T y + γ 2,t r 2,t ) with a symmetric and positive-definite matrix via the conjugate gradient method (CG), see Algorithm 2 in Supplementary Note 2, which is included for completeness.If κ is the condition number of γ ϵ,t X T X + γ 2,t I, the method requires O( √ κ) iterations to return a reliable approximation.
Additionally, inspired by [31], we initialize the CG iteration with an estimate of the signal from the previous iteration of gVAMP.This warm-starting technique leads to a reduced number of CG steps that need to be performed and, therefore, to a computational speed-up.However, this comes at the expense of potentially introducing spurious correlations between the signal estimate and the Gaussian error from the state evolution.Such spurious correlations may lead to algorithm instability when run for a large number of iterations (also extensively discussed below).This effect is prevented by simply stopping the algorithm as soon as the R 2 measure on the training data or the number of SE associations starts decreasing.
In order to calculate the Onsager correction in the LMMSE step of gVAMP (see line 22 of Algorithm 1), we use the Hutchinson estimator [32] to estimate the quantity Tr[(γ ϵ,t X T X + γ 2,t I) −1 ]/P .We recall that this estimator is unbiased, in the sense that, if u has i.i.d.entries equal to −1 and +1 with the same probability, then Furthermore, in order to perform an EM update for the noise precision γ ϵ one has to calculate the trace of a matrix which is closely connected to the one we have seen in the previous paragraph.
In order to do so efficiently, i.e., to avoiding solving another large-dimensional linear system, we store the inverted vector (γ ϵ,t X T X + γ 2,t I) −1 u and reuse it again in the EM update step (see the subparagraph on EM updates).

Algorithm stability
We find that the application of existing EM-VAMP algorithms to the UK Biobank dataset leads to diverging estimates of the signal.This is due to the fact that the data matrix (the SNP data) might not conform to the properties required in [12], especially that of right-rotational invariance.
Furthermore, incorrect estimation of the noise precision in line 28 of Algorithm 1 may also lead to instability of the algorithm, as previous applications of EM-VAMP generally do not leave many hyperparameters to estimate.
To mitigate these issues, different approaches have been proposed including row or/and column normalization, damping (i.e., doing convex combinations of new and previous estimates) [33], and variance auto-tuning [23].In particular, to prevent EM-VAMP from diverging and ensure it follows its state evolution, we empirically observe that the combination of the following techniques is crucial.
1. We perform damping in the space of denoised signals.Thus, line 15 of Algorithm 1 reads as Here, ρ ∈ (0, 1) denotes the damping factor.This ensures that the algorithm is making smaller steps when updating a signal estimate.
2. We perform auto-tuning of γ 1,t via the approach from [23].Namely, in the auto-tuning step, one refines the estimate of γ 1,t and the prior distribution of the effect size vector β by jointly re-estimating them.If we denote the previous estimates of γ 1,t and Θ with γ (previous) 1,t and Θ (previous) , then this is achieved by setting up an expectation-maximization procedure whose aim is to maximize , Θ (previous)   with respect to γ 1,t and Θ.
3. We filter the design matrix for first-degree relatives to reduce the correlation between rows, which has the additional advantage of avoiding potential confounding of shared-environmental effects among relatives.

Estimation of the prior and noise precision via EM
The VAMP approach in [12] assumes exact knowledge of the prior on the signal β, which deviates from the setting in which genome-wide association studies are performed.Hence, we adaptively learn the signal prior from the data using expectation-maximization (EM) steps, see lines 8 and 28 of Algorithm 1.This leverages the variational characterization of EM-VAMP [22], and its rigorous theoretical analysis presented in [23].In this subsection, we summarize the hyperparameter estimation results derived based upon [34] in the context of our model.We find that the final update formulas for our hyperparameter estimates are as follows.
The intuition behind {ζ j } P j=1 is that each ζ j tells what fraction of posterior probability mass was assigned to the event that it has a non-zero effect.Then, the update formula for the sparsity rate λ t+1 reads as • Probabilities of mixture components in the slab part {π i } L i=1 : We define {ξ j,i } L,P i=1,j=1 as , ∀i = 1, . . ., L, ∀j = 1, . . ., P.
The intuition behind {ξ j,i } L,P i=1,j=1 is that each ξ j,i approximates the posterior probability that a marker j belongs to a mixture i conditional on the fact that it has non-zero effect.Thus, the update formula for π i,t+1 reads as • Variances of mixture components in the slab part {σ 2 i } L i=1 : Using the same notation, the update formula reads as Here we also introduce a mixture merging step, i.e., if the two mixtures are represented by variances that are close to each other in relative terms, then we merge those mixtures together.Thus, we adaptively learn the mixture number.
• Precision of the error γ ϵ : We define Σ t := (γ ϵ,t X T X + γ 2,t I) −1 .Then, the update formula for the estimator of γ ϵ reads as In the formula above, the term ||y − X β2,t || 2 /N takes into account the quality of the fit of the model, while the term Tr(XΣ t X T )/N prevents overfitting by accounting for the structure of the prior distribution of the effect sizes via the regularization term γ 2,t .We note that the naive evaluation of this term would require an inversion of a matrix of size P × P .We again use the Hutchinson estimator for the trace to approximate this object, i.e., Tr(XΣ t X T ) = Tr(X T XΣ t ) ≈ u T (X T XΣ t )u, where u has i.i.d.entries equal to −1 and +1 with the same probability.Furthermore, instead of solving a linear system Σ t u with a newly generated u, we re-use the u sampled when constructing the Onsager coefficient, thus saving the time needed to construct the object Σ t u.

C++ code optimization
Our open-source gVAMP software (https://github.com/medical-genomics-group/gVAMP) is implemented in C++, and it incorporates parallelization using the OpenMP and MPI libraries.MPI parallelization is implemented in a way that the columns of the normalized genotype matrix are approximately equally split between the workers.OpenMP parallelization is done on top of that and used to further boost performance within each worker by simultaneously performing operations such as summations within matrix vector product calculations.Moreover, data streaming is employed using a lookup table, enabling byte-by-byte processing of the genotype matrix stored in PLINK format with entries encoded to a set {0, 1, 2}: The lookup table enables streaming in the data in bytes, where every byte (8 bits) encodes the information of 4 individuals.This reduces the amount of memory needed to load the genotype matrix.In addition, given a suitable computer architecture, our implementation supports SIMD instructions which allow handling four consecutive entries of the genotype matrix simultaneously.
To make the comparisons between different methods fair, the results presented in the paper do not assume usage of SIMD instructions.Additionally, we emphasize that all calculations take un-standardized values of the genotype matrix in the form of standard PLINK binary files, but are conducted in a manner that yields the parameter estimates one would obtain if each column of the genotype matrix was standardized.

Participant inclusion
UK Biobank has approval from the North-West Multicenter Research Ethics Committee (MREC) to obtain and disseminate data and samples from the participants (https://www.ukbiobank.ac. uk/ethics/), and these ethical regulations cover the work in this study.Written informed consent was obtained from all participants.
Our objective is to use the UK Biobank to provide proof of principle of our approach and to compare to state-of-the-art methods in applications to biobank data.We first restrict our analysis to a sample of European-ancestry UK Biobank individuals to provide a large sample size and more uniform genetic background with which to compare methods.To infer ancestry, we use both self-reported ethnic background (UK Biobank field 21000-0), selecting coding 1, and genetic ethnicity (UK Biobank field 22006-0), selecting coding 1.We project the 488,377 genotyped participants onto the first two genotypic principal components (PC) calculated from 2,504 individuals of the 1,000 Genomes project.Using the obtained PC loadings, we then assign each participant to the closest 1,000 Genomes project population, selecting individuals with PC1 projection ≤ absolute value 4 and PC2 projection ≤ absolute value 3. We apply this ancestry restriction as we wish to provide the first application of our approach, and to replicate our results, within a sample that is as genetically homogeneous as possible.Our approach can be applied within different human groups (by age, genetic sex, ethnicity, etc.).However, combining inference across different human groups requires a model that is capable of accounting for differences in minor allele frequency and linkage disequilibrium patterns across human populations.Here, the focus is to first demonstrate that our approach provides an optimal choice for biobank analyses, and ongoing work focuses on exploring differences in inference across a diverse range of human populations.Secondly, samples are also excluded based on UK Biobank quality control procedures with individuals removed of (i) extreme heterozygosity and missing genotype outliers; (ii) a genetically inferred gender that did not match the self-reported gender; (iii) putative sex chromosome aneuploidy; (iv) exclusion from kinship inference; (v) withdrawn consent.

Whole genome sequence data
We process the population-level WGS variants, recently released on the UK Biobank DNAnexus platform.We use BCF tools to process thousands of pVCF files storing the chunks of DNA sequences, applying elementary filters on genotype quality (GQ ≤ 10), local allele depth (smpl_sum LAD < 8), missing genotype (F_MISSING > 0.1), and minor allele frequency (MAF < 0.0001).We select this MAF threshold as it means that on average about 80 people will have a genotype that is non-zero, which was the lowest frequency for which we felt that there was adequate power in the data to detect the variants.While we accept that it is quite possible to include additional rare variants, we wished for a conservative threshold that was at least an order of magnitude lower than the threshold we used for the imputed SNP data described below to facilitate a comparison among the analysis of the different data types.
Simultaneously, we normalize the indels to the most recent reference, removing redundant data fields to reduce the size of the files.For all chromosomes separately, we then concatenate all the pre-processed VCF files and convert them into PLINK format.The compute nodes on the DNAnexus system are quite RAM limited, and it is not possible to analyse the WGS data outside of this system, which restricts the number of variants that can be analysed jointly.To reduce the number of variants to the scale which can be fit in the largest computational instance available on the DNAnexus platform, we rank variants by minor allele frequency and remove the variants in high LD with the most common variants using the PLINK clumping approach, setting a 1000 kb radius, and R 2 threshold to 0.36.This selects a focal common variant from a group of other common variants with correlation ≥ 0.6, which serves to capture the common variant signal into groups, whilst keeping all rare variation within the data.Finally, we remove the variants sharing the same base pair position, not keeping any of these duplicates, and merge all the chromosomes into a large data instance, including the final 16,854,878 WGS variants.

Imputed SNP data
We use genotype probabilities from version 3 of the imputed autosomal genotype data provided by the UK Biobank to hard-call the single nucleotide polymorphism (SNP) genotypes for variants with an imputation quality score above 0.3.The hard-call-threshold is 0.1, setting the genotypes with probability ≤ 0.9 as missing.From the good quality markers (with missingness less than 5% and p-value for the Hardy-Weinberg test larger than 10 −6 , as determined in the set of unrelated Europeans) we select those with MAF ≥ 0.002 and rs identifier, in the set of European-ancestry participants, providing a dataset of 9,144,511 SNPs.From this, we took the overlap with the Estonian Genome Centre data as described in [8] to give a final set of 8,430,446 autosomal markers.
For our simulation study and UK Biobank analyses described below, we select two subsets of 8,430,446 autosomal markers.We do this by removing markers in very high LD using the "clumping" approach of PLINK, where we rank SNPs by minor allele frequency and then select the highest MAF SNPs from any set of markers with LD R 2 ≥ 0.8 within a 1MB window to obtain 2,174,071 markers.We then further subset this with LD R 2 ≥ 0.5 to obtain 882,727 SNP markers.This results in the selection of two subsets of "tagging variants", with only variants in very high LD with the tag SNPs removed.This allows us to compare analysis methods that are restricted in the number of SNPs that can be analysed, but still provide them a set of markers that are all correlated with the full set of imputed SNP variants, limiting the loss of association power by ensuring that the subset is correlated to those SNPs that are removed.

Whole exome sequence data burden scores
We then combine this data with the UK Biobank whole exome sequence data.for LoF variants have received a score of 2, those with a single heterozygous LoF variant 1, and the rest 0. We chose to use the WES data to create the burden scores rather than the WGS data as existing well-tested pipelines were available.

Phenotypic records
Finally, we link these DNA data to the measurements, tests, and electronic health record data available in the UK Biobank [35] and, for the imputed SNP data, we select 7 blood based biomarkers and 6 quantitative measures which show ≥ 15% SNP heritability and ≥ 5% out-of-sample prediction accuracy [8].Our focus is on selecting a group of phenotypes for which there is sufficient power to observe differences among approaches.We split the sample into training and testing sets for each phenotype, selecting 15,000 individuals that are unrelated (SNP marker relatedness < 0.05) to the training individuals to use as a testing set.This provides an independent sample of data with which to access prediction accuracy.We restrict our prediction analyses to this subset as predicting across other biobank data introduces issues of phenotypic concordance, minor allele frequency and linkage disequilibrium differences.In fact, our objective is to simply benchmark methods on as uniform a dataset as we can.As stated, combining inference across different human groups, requires a model that is capable of accounting for differences in minor allele frequency and linkage disequilibrium patterns across human populations and, while our algorithmic framework can provide the basis of new methods for this problem, the focus here is on benchmarking in the simpler linear model setting.Samples sizes and traits used in our analyses are given in Table S1.

Statistical analysis in the UK Biobank gVAMP model parametes for WGS
We apply gVAMP to the WGS data to analyse human height using the largest computational instance currently available on the DNAnexus platform, employing 128 cores and 1921.4GB total memory.Efficient C++ gVAMP implementation allows for parallel computing, utilizing OpenMP and MPI libraries.Here, we split the memory requirements and computational workload between 2 OpenMP threads and 64 MPI workers.For the prior initialization, we set an initial number of 22 non-zero mixtures, we let the variance of those mixtures follow a geometric progression to a maximum of 1/N , with N the sample size, and we let the probabilities follow a geometric progression with factor 1/2.The prior probability 1 − λ of SNP markers being assigned to the 0 mixture is initialized to 99.5%.The SNP marker effect sizes are initialised with 0. Based on the experiments in the UK Biobank imputed dataset, in WGS we set the initial damping factor ρ to 0.1, and adjust it to 0.05 for iteration 4 onward, stabilizing the algorithm.We then report the results corresponding to the iterate having the largest number of SE associations.We also note that, after the first few iterations, the number of SE associations is typically rather stable (see Figure S4).

gVAMP model parameters for imputed SNP data
We run gVAMP on the 13 UK Biobank phenotypes on the full 8,430,446 SNP set, and on the 2,174,071 and 882,727 LD clumped SNP set.We find that setting the damping factor ρ to 0.1 performs well for all the 13 outcomes in the UK Biobank that we have considered.For the prior initialization, we set an initial number of 22 non-zero mixtures, we let the variance of those mixtures follow a geometric progression to a maximum of 1/N , with N the sample size, and we let the probabilities follow a geometric progression with factor 1/2.The SNP marker effect sizes are initialised with 0. This configuration works well for all phenotypes.We also note that our inference of the number of mixtures, their probabilities, their variances and the SNP marker effects is not dependent upon specific starting parameters for the analyses of the 2,174,071 and 882,727 SNP datasets, and the algorithm is rather stable for a range of initialization choices.Similarly, the algorithm is stable for different choices of the damping ρ, as long as said value is not too large.
Generally, appropriate starting parameters are not known in advance and this is why we learn them from the data within the EM steps of our algorithm.However, it is known that EM can be sensitive to the starting values given and, thus, we recommend initialising a series of models at different values to check that this is not the case (similar to starting multiple Monte Carlo Markov chains in standard Bayesian methods).The feasibility of this recommendation is guaranteed by the significant speed-up of our algorithm compared to existing approaches, see Supplementary Note 1, For the sparsity parameter, we consider either initializing it to 50, 000 included signals (λ 0 = 50, 000/P ), or to further increase the probability of SNP markers being assigned to the 0 mixture to 97%, which results in a sparser initialised model.We also consider inflating the variances to a maximum of 10/N to allow for an underlying effect size distribution with longer tails.It is trivial to initialise a series of models and to monitor the training R 2 , SNP heritability, and residual variance estimated within each iteration over the first 10 iterations.Given the same data, gVAMP yields estimates that more closely match GMRM when convergence in the training R 2 , SNP heritability, residual variance, and out-of-sample test R 2 are smoothly monotonic within around 10-40 iterations.
Following this, training R 2 , SNP heritability, residual variance, and out-of-sample test R 2 may then begin to slightly decrease as the number of iterations becomes large.Thus, as a stopping criterion for the 2,174,071 and 882,727 SNP datasets, we choose the iteration that maximizes the training R 2 , and in practice it is easy to optimise the algorithm to the data problem at hand.
We highlight the iterative nature of our method.Thus, improved computational speed and more rapid convergence is achieved by providing better starting values for the SNP marker effects.
Specifically, when moving from 2,174,071 to 8,430,446 SNPs, only columns with correlation R 2 ≥ 0.8 are being added back into the data.Thus, for the 8,430,446 SNP set, we initialise the model with the converged SNP marker and prior estimates obtained from the 2,174,071 SNP runs, setting to 0 the missing markers.Furthermore, we lower the value of the damping factor ρ, with typical values being 0.05 and 0.01.We experiment both with using the noise precision from the initial 2,174,071 SNP runs and with setting it to 2. We then choose the model that leads to a smoothly monotonic

gVAMP SE association testing
We provide an alternative approach to association testing, which we call state evolution p-value testing (SE association testing), where the effects of each marker can be estimated conditional on all other genetic variants genome-wide.Relying on the properties of the EM-VAMP estimator, whose noise is asymptotically Gaussian due to the Onsager correction [12], we have r 1,t ≈ β + N (0, γ −1 1,t I), where β is the ground-truth value of the underlying genetic effects vector.More precisely, one can show that 1 N ∥r 1,t − β − N (0, γ −1 1,t I)∥ → 0, as N, P → ∞, with the ratio N/P being fixed.Therefore, for each marker with index j, a one-sided p-value for the hypothesis test H0 : β j = 0 is given by Φ(−|(r 1,t ) i | • γ 1/2 1,t ), where Φ is the CDF of a standard normal distribution and (r 1,t ) i denotes the i-th component of the vector r 1,t .We conduct this association testing for height in the WGS data and for the full 8,430,446 imputed SNP markers for the empirical UK Biobank analysis of 13 traits, using the estimates of r 1,t .We remark that the testing results are generally stable after 20 iterations (Figure S4).To these, we apply a Bonferroni multiple testing correction to give a conservative comparison for presentation, but we note that the estimates made are joint, rather than marginal, and thus FDR control methods may also be an alternative.AMP theory provides a joint association testing framework, capable of estimating the effects of each genomic position conditional on all other SNP markers.We show this SE p-value testing approach for each iteration of our iterative algorithm, where we calculate the number of genome-wide fine-mapped associations for 13 UK Biobank traits at a p-value threshold of less than 5 • EXP(0) .
Conjugate gradient algorithm for solving linear systems Algorithm 2 Conjugate gradient method for solving a symmetric linear system Ax = b.
1: Input: Initial estimate of the solution x 0 , initial residual r 0 = b, initial search direction p 0 = r 0 , linear system matrix A, right-hand side vector b, stopping error threshold ε > 0.

Figure 1 .
Figure 1.Joint association plot of 8.4 million imputed SNPs and 17 million WGS for human height in 415,000 UK Biobank participants.AMP theory provides a joint association

Figure 3 .
Figure3.The relationship between effect size and minor allele frequency for human height inferred from whole genome sequence data differs to that inferred from imputed SNP variants in the UK Biobank.(a) For different genome-wide significance thresholds, we plot the relationship between joint effect size and minor allele frequency (MAF) for imputed SNPs, whole genome sequence (WGS) variants and whole exome sequence (WES) burden scores fit alongside either imputed SNPs or WGS data.(b) Across 17,852 WES gene burden scores, we find general concordance of the estimated effect sizes when fit alongside either imputed or WGS data, for most but not all genes, with squared correlation 0.532.(c) Likewise, we also find general concordance of the phenotypic variance attributable to markers annotated to each of the 17,852 genes, when fit conditional on either imputed or WGS variants, with squared correlation 0.494.(d) Finally, we show similar patterns of enrichment when annotating markers to functional annotations in either the proportion of variance attributable to each group (labelled "variance"), or in the average effect size relative to the genome-wide average effect size (labelled "effect") from joint estimation of either imputed or WGS data.

Figure 4 .
Figure 4. Validating gVAMP through polygenic risk score accuracy and association testing benchmarks in the UK Biobank within imputed SNP data.(a) Relative prediction accuracy of gVAMP in a hold-out set of the UK Biobank across 13 traits as compared to other approaches.(b) Relative number of leave-one-chromosome-out (LOCO) testing of gVAMP across 13 UK Biobank traits as compared to other approaches at 8,430,446 markers.(c) Number of genome-wide fine-mapped associations obtained via gVAMP SE association testing for 13 UK Biobank traits at a p-value threshold of less than 5 • 10 −8 for all 8,430,446 SNP markers.

Figure S2 .
Figure S2.The variance attributable to gene burden scores calculated from whole exome sequence data (y-axis) shows no relationship to the variance attributable to DNA variants within and around the gene (x-axis) for either imputed SNPs or whole genome sequence variants.

Figure S3 .Figure S4 .
Figure S3.SNP heritability estimation of GMRM versus gVAMP with different numbers of SNP markers across 13 trait in the UK Biobank.Comparison of the proportion of phenotypic variation attributable to 2,174,071 autosomal SNP genetic markers (SNP heritability) estimated by GMRM (x-axis) to the SNP heritability estimated by gVAMP (y-axis) at either the same 2,174,071 SNPs (red) or 8,430,446 SNP markers (blue).The slope of the lines shows a 1-to-1 relationship of gVAMP to GMRM, but with an average of 3.4% lower estimate for gVAMP at 2.17M SNPs.Analysing 8.4M SNPs with gVAMP increases the heritability estimate over GMRM by 3.9%, which is consistent with an increase in phenotypic variance captured by the full imputed sequence data, as opposed to a selected subset of SNP markers.The dashed grey line gives y = x.

The genetic architecture of human height inferred from 16,854,878 whole genome sequence variants differs to that inferred from 8,430,446 imputed SNP markers in the UK Biobank.
(a)For each whole genome sequence (WGS) variant discovered at different 022, p-value ≤ 2 • 10 −16 for imputed SNPs, which is consistent with

Table 1 . Polygenic risk score prediction accuracy R 2 for 13 different traits from statistical models trained in the UK Biobank data and tested in a UK Biobank hold-out set.
Training data sample size and trait codes are given in TableS1for each trait.The sample size of the hold-out test set is 15, 000 for all phenotypes.LDpred2 and SBayesR give estimates obtained from the LDpred2 and SBayesR software respectively, using summary statistic data of 8,430,446 SNPs obtained from the REGENIE software.GMRM denotes estimates obtained from a Bayesian mixture model at 2,174,071 SNP markers ("GMRM 2M").gVAMP denotes estimates obtained from an adaptive EM Bayesian mixture model within a vector approximate message passing (VAMP) framework, using either 887,060 ("gVAMP 880k"), 2,174,071 ("gVAMP 2M"), or 8,430,446 SNP markers ("gVAMP 8M").

Table 2 . Genome-wide significant associations for 13 UK Biobank traits from GMRM, gVAMP and REGENIE at 8,430,446 genetic variants
. REGENIE denotes results obtained from leave-one-chromosome-out (LOCO) testing using the REGENIE software, with 882,727 SNP markers used for step 1 and 8,430,446 markers used for the LOCO testing of step 2. GMRM refers to LOCO testing at 8,430,446 SNPs, using a Bayesian MCMC mixture model in step 1, with either 882,727 ("GMRM 880k") or 2,174,071 SNP markers ("GMRM 2M").gVAMP refers to LOCO testing at 8,430,446 SNPs, using the framework presented here, where in step 1 either 882,727 ("gVAMP 880k"), 2,174,071 ("gVAMP 2M"), or 8,430,446 SNP markers ("gVAMP 8M") were used.We also present leave-one-out ("gVAMP 8M LOO", see Methods) and state-evolution (SE) p-value testing for 8,430,446 SNP markers ("gVAMP 8M SE", see Methods).For LOCO testing, the values give the number of genome-wide significant linkage disequilibrium independent associations selected based upon a p-value threshold of less than 5 • 10 −8 and R 2 between SNPs in a 5 Mb genomic segment of less than 1%.For LOO and SE testing, values give the number of genome-wide significant associations selected based upon a p-value threshold of less than 5 • 10 −8 .
Genomic data preparation and aggregation is conducted with custom pipeline (repo) on the UK Biobank Research Analysis Platform (RAP) with DXJupyterLab Spark Cluster App (v.2.1.1).Only biallelic sites and high quality variants are retained according to the following criteria: individual and variant missingness < 10%, Hardy-Weinberg Equilibrium p-value > 10 −15 , minimum read coverage depth of 7, at least one sample per site passing the allele balance threshold > 0.15.Genomic variants in canonical, protein coding transcripts (Ensembl VERSION) are annotated with the Ensembl Variant Effect Predictor (VEP) tool (docker image ensemblorg/ensemblvep:release_110.1).High-confidence (HC) loss-of-function (LoF) variants are identified with the LOFTEE plugin (v1.0.4_GRCh38).For each gene, homozygous or multiple heterozygous individuals • 10 −8 for all 8,430,446 SNP markers.