Whole genome association testing in 333,100 individuals across three biobanks identifies rare non-coding single variant and genomic aggregate associations with height

The role of rare non-coding variation in complex human phenotypes is still largely unknown. To elucidate the impact of rare variants in regulatory elements, we performed a whole-genome sequencing association analysis for height using 333,100 individuals from three datasets: UK Biobank (N=200,003), TOPMed (N=87,652) and All of Us (N=45,445). We performed rare (<0.1% minor-allele-frequency) single-variant and aggregate testing of non-coding variants in regulatory regions based on proximal, intergenic and deep-intronic annotation. We observed 29 independent variants associated with height at P < 6 × 10−10 after conditioning on previously reported variants, with effect sizes ranging from -7cm to +4.7cm. We also identified and replicated non-coding aggregate-based associations proximal to HMGA1 containing variants associated with a 5cm taller height and of highly-conserved variants in MIR497HG on chromosome 17. We have developed a novel approach for identifying non-coding rare variants in regulatory regions with large effects from whole-genome sequencing data associated with complex traits.


lntroduction
The role of rare non-coding variation in common human phenotypes is still largely unknown.
Previous studies have been largely limited to studying common variation using genotyping arrays or rare variation in the coding regions of genes using exome sequencing.Studies of rare variation in the non-coding genome, which is by far the most abundant form of inherited variation, could lead to the identification of important gene regulatory elements with large effects on human diseases and traits.
Most genetic variation associated with complex phenotypes lies in non-coding regions of the genome 1 .Array-based genome-wide association studies have had substantial success at identifying common variants associated with complex phenotypes and disease 2 .For height a large proportion of the common variant heritability has been explained 2 .In contrast, the identification of rarer variation, potentially with substantially larger effects, has been largely limited to coding variation based on exome sequencing (e.g.loss-of-function variants in GIGYF1 associated with diabetes 3 ) or imputation of lower frequency variants 2 .
Despite the success of common variant and rare coding variant-based approaches, the vast majority of inherited human genetic variation is both rare and in the 99% of the genome that is non-coding.Identifying the rare non-coding variation associated with common diseases and traits could reveal new regulatory gene mechanisms, and substantially increase our understanding of human biology and disease.
Whole genome sequencing (WGS) has been successful at identifying rare non-coding causes of monogenic disease in several cases 4,5 .For example, we have recently shown that rare variants in an intronic regulatory element of HK1 causes inappropriate expression of Hexokinase 1 in pancreatic beta-cells leading to congenital hyperinsulinism 6 .However, there have been few sequencing-based studies aiming to identify rare non-coding variation associated with complex phenotypes 7 , despite estimates of the relative functional importance of the non-coding genome of 6-15% 89 .Two recent studies from TOPMed performed WGS rare-variant analysis for lipid-levels 10 (N = 66,000), where they identified suggestive associations with variants in DNA hypersensitivity sites proximal to PCSK9 altering lipids, and in blood pressure 11 (N = 51,456), where genomic aggregate signals at KIF3B were identified.
Identifying associations between rare variants and complex traits has several advantages over common variant associations.Firstly, rare causal variants are likely to have larger effect sizes and so potentially be of greater clinical relevance.Secondly, rare variants are less likely to be in linkage disequilibrium with other variants and so provide more direct information about likely causal regulatory regions and genes involved.Finally, rare variant aggregate associations, where genetic variants of similar predicted consequence and location are tested in aggregate, can also provide strong evidence for specific non-coding elements that are responsible for an association compared to single variant associations.
We performed an analysis of height, a model complex trait, focussed on identifying novel rare variant associations from large-scale WGS data.We performed a discovery analysis using WGS data on 200,003 individuals from UK Biobank (UKB) and replicated our results in 133,097 individuals from All of Us 12 and TOPMed 13 .We show that our approach can identify new rare single variant and aggregate associations in the non-coding genome.
Importantly, our analytical approach to WGS-based association analyses can be applied to other complex phenotypes.

Methods Summary
We performed discovery association analyses using WGS data on 200,003 individuals from the UKB, a population cohort from the United Kingdom 14 .We analysed rank inversenormalised standing height, a model polygenic trait, with genomic data on 789,700,118 genetic variants including single nucleotide variants (SNVs), small insertions/deletions (indels) and large structural variants (SVs) including copy number deletions and duplications.
To identify novel rare non-coding genetic associations that have not previously been identified, we conditioned our analyses on 12,661 variants from the latest GIANT height consortium analysis of 5.4 million people 2 based on imputed genotype array data, an exomearray analysis of height 15 , and genome-wide significant (P<5 ൈ 10 ି଼ ) variants from an exome-wide association study of height 16 .Our primary discovery analysis was performed in 183,078 individuals of genetically-inferred European ancestry.We also performed the same analyses in individuals with genetically-inferred South Asian (N = 4,439) and African (N=3,077) ancestry.We replicated our results in a cross-ancestry analysis using 87,652 individuals with WGS from TOPMed, and 45,445, 20,548 and 13,683 individuals with genetically-inferred European, African and self-reported Hispanic ancestry/ethnicity with WGS data in All of Us respectively (refer to ST1 for a breakdown of ancestries and cohort demographics).Statistical significance was defined as P<6.3 ൈ 10 ିଵ based on 20 simulated randomly generated phenotypes (see Methods).

Single Variant Association Testing
We tested all genetic variants with a minor allele count (MAC) ≥20, excluding variants with a low-quality genotype calling score (graphTyper AA score<0.5),using REGENIE 17 .Variants which were associated at our threshold were then clumped using PLINK 18 , and a sequential variant conditioning procedure was applied to determine the variant most likely to be responsible for the signal (see Methods).

Genomic Aggregate Association Testing
After annotating each variant using the Ensembl Variant Effect Predictor 19 , we segmented variants in the genome into classification groups, including gene-centric (i.e.coding and splicing; or proximal regulatory, including 5kb upstream and 5kb downstream -+/-5kbp from the 5/3' UTR's) and non-gene-centric potentially regulatory variation (intergenic and intronic based on any transcript), as well as a sliding window test that covered the whole genome, excluding exons.We performed genomic unit aggregate testing limited to rare (within-sample minor allele frequency, MAF <0.1%) genetic variants in functionally annotated regions based on three published weights representing in silico predicted deleteriousness (Combined Annotation Dependent Depletion, CADD 20 ), conservation (Genomic Evolutionary Rate Profiling, GERP 21 ) and non-coding constraint (Junk Annotation Residual Variation Intolerance Score, JARVIS 22 ).Variants that were classified as coding in any transcript were excluded from regions we defined as proximal (within 5kbp of the 5/3' UTR 19 ), and variants in proximal regions were subsequently excluded from regions defined as non-proximal potentially regulatory regions -see the Methods section for precise definitions.We refer to proximal-regulatory regions and non-proximal regulatory regions as "proximal" and "regulatory" respectively for the remainder of the manuscript.

We identfied 29 rare and low-frequency novel single variants, associated with human height in UKB
After adjusting for published height genetic variants (ST2), 28 rare (MAF<0.1% & MAC>20) and low-frequency (0.1%<MAF<1%) SNVs and indels remained independently associated with height (Fig. 1).These variants had effect sizes ranging from -7.25cm to +4.71cm (-0.79 to 0.52SD) -see ST3,4.As expected, variants with a lower minor-allelefrequency had the largest effect estimates (Fig. 2).We additionally identified evidence of association with a 47,543bp structural deletion in the pseudo-autosomal region of chromosome X (X:819,814-867,357).The proximal-SHOX deletion occurs 173kbp downstream of SHOX, and is present in 0.3% of the population and associates with lower height (β = -2.79cm [-3.33, -2.25], P = 5.01 ൈ 10 ିଶସ , ST3).This exact deletion, downstream of SHOX, has only previously been reported in clinical cohorts with Leri-Weill dyschondrosteosis 23 , a genetic disorder characterised by shortened limbs and short stature.In these clinical cohorts, 15% had at least one copy of the 47.5kbp deletion.In the UKB population, the deletion was present in 824 individuals (0.3%) (one carrier was a homozygote).
We did not identify any novel replicating associations in the South Asian or African ancestry specific analyses in the UKB.Only one genetic variant achieved genome-wide significance in an analysis of individuals of South Asian ancestry (X:116495780:AGTGTGTGTGT:A, P=1.43ൈ 10 ିଵ ), but did not associate in the European (P=0.91) or African (P=0.98)ancestry-specific analyses (we were unable to test this variant in All of Us or TOPMed).

We identified and replicated three rare (MAF<0.1%) non-coding regions associated with height
We performed 57,608,498 genomic aggregate association tests, consisting of 5,941,548 coding, 13,005,638 proximal regulatory, 4,861,759 intergenic/deep intronic and 33,799,553 non-coding sliding window association tests.We performed three different types of statistical test: i) 'BURDEN', where the direction of effects for all variants is assumed to be the same, ii) 'SKAT', where there is no assumption about directionality or similarity of magnitude of effects, and iii) 'ACAT', where there is no assumption about directionality or magnitude of effects and not all variants need be associated with the outcome 24 .
We identified six non-coding regions of interest based on aggregate tests (P<6.31ൈ 10 ିଵ ; ST6,7).Four regions remained significant after adjusting for previously identified height loci (Table 1).The four regions consisted of nine genomic aggregate tests proximal to: HMGA1, C17orf49, GH1, CSHL1, PRR5-ARGHGAP8 and MIR6835.We did not find any novel genomic unit associations based on African or South Asian ancestry-specific analyses in our discovery analysis.The aggregate-based tests at HMGA1 and C17orf49 replicated in All of

Us and TOPMed when combined with genetically inferred individuals of South Asian (SAS)
and African (AFR) in the UKB, and we were unable to replicate PRR5-ARHGAP8 (Table 1).Table 1.Significant rare (<0.1%) non-coding genetic aggregate associations with human height.Genomic aggregates which were significant after correcting for known height loci ('log10p conditioned') are assigned a loci number and appear in bold if they lie within the same region with evidence of correlation.The classification column denotes how variants were classified according to Figure 1, and the annotation column denotes how the variants were additionally grouped together (see methods) based on variant scores.Replication was calculated as a meta-analysis of TOPMed, All of Us and non-EUR analyses within UKB. * Indicates that the meta-analysis was calculated using the ACAT p-value combination method, as betas are not produced for the ACAT aggregate tests.

Multiple rare variants, and a common variant, form an allelic series in a regulatory region upstream of HMGA1, with substantial effects on height
There were 2,006 rare variants included in the upstream non-coding association for HMGA1 (High-mobility group protein) in the UKB -ST8.Several variants appeared to be responsible for these aggregate signals (Fig. 3).The two rare variants most strongly associated with increased height were 6:34237902:G:A (ߚ=4.83cm,P = 2.00 ൈ 10 ିଵଷ , MAF = 0.04%) and 6:34236873:C:G (β = 3.97cm, P = 1.00 ൈ 10 ିଵ , MAF=0.0470%).The five most-strongly associated variants, at P<5.76 ൈ 10 ି (Fig. 3C), were statistically independent of each other, as determined by sequential conditional testing.Our results remained statistically significant after removing several low-quality indels (P = 1.45e-11).The most strongly associated rare variant alters the first base of the transcription start site of the MANE Select transcript (ENST00000311487.9,NM_145899.3) of HMGA1 25 (Fig. 3).
This variant could result in reduced transcription of this transcript and may result in an alternative start site becoming dominant.
We next searched for evidence of a role for coding variation in the impact of HMGA1 in height.HMGA1 is a constrained gene (pLI score =0.83) and there are no predicted protein truncating variants in the UKB and only a single individual with a first exon frameshift in gnomAD.There was also no evidence of individual missense or aggregate coding association with height for HMGA1 either in the UKB WGS data (min(P)=3.09e-4),or in GeneBass, based on 394,841 exome-sequences from UKB (min(P) = 0.284).

Rare variants of microRNA host-gene MIR497HG affect height
There were 235 highly conserved (GERP>2) rare variants which contributed to the noncoding C17orf49 (Chromosome 17 Open Reading Frame 49) genomic aggregate result in the UKB, cumulatively associated with a 1.36cm reduction in height (95% CI 1.11, 1.48cm, P = 1.26 ൈ 10 ିଵଵ ) -see ST9.Of the 235 variants that contributed to the aggregate signal, 152 (64.7%) had an effect estimate with the same direction of effect as the aggregate (binomial P = 7.96 ൈ 10 ି ), suggesting that multiple variants are responsible.
The proximal region of C17or49 overlaps with microRNA host cluster MIR497HG, from which microRNAs MIR195 and MIR497 are derived (Fig. 4A).We thus re-analysed the C17orf49 proximal region excluding miRNA variants, and additionally tested the microRNA as independent genome units (Fig. 4B).The strength of association between the identified C17orf49 proximal aggregate and height was reduced after removing any variant overlapping miRNA (β = 1.11cm,P = 3.98 ൈ 10 ିହ ) -ST10.Further, the association between a genomic aggregate of miRNA variants in MIR195 and height was more than double that of the primary C17orf49 signal (ߚ = 3.05cm [95% CI 1.44, 4.65cm, P = 1.97 ൈ 10 ିସ ]), showing nominal evidence of heterogeneity (P=0.0454) to the primary signal.There is extensive literature on the genes that MIR195 and MIR497 bind and affect expression of, while there is little previous literature referencing C17orf49, except for a small number of studies of cancer phenotypes 26 .MIR497 and MIR195 expression have been associated with a range of genes that influence cancer 27 , and both have been implicated in quiescence of skeletal muscle cells 28 .Reduced expression of MIR497 has also been shown to promote osteoblast proliferation and collagen synthesis 29 .Zhao et al. also reported an association between down-regulation of MIR497, one of the three miRNA which overlapped with the proximal aggregate, and idiopathic short stature in a clinical cohort of Chinese children with short stature 30 (P<0.05).Zhang et al. 31 have additionally implicated MIR497 in chondrogenesis (cartilage development), and shown that the miRNA impacts IHH (Indian Hedgehog Homolog), which is essential for bone formation 32 .
MIR195 has also been shown to interact with HMGA1 and affect expression.For example, it has been shown that MIR195 and MIR497 repress HMGA1, which in turn downregulates one of the HMGA1 downstream targets Id3, which has an inhibitory effect on myogenic differentiation 33 .We therefore tested interaction of the common HMGA1 variant and the miRNA, but did not detect an association either at the aggregate (min(P) = 0.541) or single variant (min(P) = 3.09 ൈ 10 ିଷ ) level.

Promoter variants of GH1 have substantial effects on height
Nine rare highly conserved (GERP>2) variants contributed to the upstream non-coding aggregate for GH1 (Growth Hormone 1) in the UKB -see ST11.The aggregate signal was associated with a 0.34SD (3.11cm) reduction in height.One of the 9 variants, which replicated, was independently associated with height (17:63918961:A:G, β = -4.24cm[95% CI -5.53, -2.94cm], P = 1.46 ൈ 10 ିଵ , MAF = 0.04%), and has previously been reported as a variant of unknown significance in multiple clinical cohorts for idiopathic short stature as NM_000515.5_c.185T> C.These findings in clinical cohorts of idiopathic short stature include: three carriers of the variant we identified (c.185T > C) were previously identified in a Sri Lankan cohort of patients with Isolated Growth Hormone deficiency 34 (IGHD) ; three siblings with consanguineous parents with IGHD 35 .The variant was originally identified in a cohort of 41 unrelated children with short stature, and 11 unrelated patients with IGHD 36 (as -123T>C).That study showed that that the variant occurs in a distal binding site for POU1F1 36 (Pituitary-specific positive transcription factor 1), which might regulate GH1 37 .

Discussion
By conducting one of the largest whole-genome sequence-based analyses to date with a focus on rare non-coding variation, we have provided novel insights into the genetic architecture of height not previously detected by standard array-based GWAS or exome sequencing approaches.Our results clearly demonstrate that our approach to analysing whole-genome sequencing data has revealed a largely untapped potential for linking rare non-coding genetic variation to complex, common human phenotypes.
We identified six non-coding regions based on genomic aggregate testing, four of which contained at least one genomic aggregate that survived adjustment for genetic variation known to impact height.We presented evidence for replication of three of these non-coding genomic aggregates, proximal to C17orf49, GH1 and HMGA1.These loci implicated novel highly-conserved miRNA regulating gene expression, an altered transcription start site, pituitary growth factor co-gene regulation, multiple proximal enhancers, and conservation and constraint of genetic variation in the biology of human growth via height.
Three of the variants identified were proximal, non-coding associations (CUL3, HMGA1, GHRH) that showed strong evidence of replication in the All of Us and TOPMed studies.
Our work further highlights the importance of adjusting for common variants in rare and low frequency variant discovery analyses to circumvent linkage-driven associations.Before adjustment for common variants, we observed 319 rare and low frequency variants, which dropped to 80 (non-independent) after adjusting.We observed no additional rare variant associations after adjusting for common variation, despite some recent claims to the contrary 38 , although we did not explicitly test adjusting for a polygenic risk score, as the study suggested.
We chose to report genetic aggregate results after correcting for known variation only, despite some genes (e.g.HMGA1) containing genetic variants that were independently significant in our analysis.Although conditioning upon independent variants within the aggregates often decreased the strength of association, we do not interpret this as a suggestion that the association at the locus is driven entirely by a single variant.This is a topical point for rare variant analyses: at sufficiently high sample sizes, we predict that a large proportion of genetic variants within an identified genetic aggregate will be independently associated.
We propose that this does not imply, however, that the association itself is not aggregate.
There are some limitations to our study.First, we acknowledge that our study is currently limited by sample size: a maximum allele frequency cut-off of 0.1% for genomic aggregate restricts our analysis to approximately 183 carriers per variant.Upcoming releases of wholegenome sequencing data from UKB (500,000 total by early 2024), All of Us and TOPMed will substantially increase the identification of novel findings.Sample sizes for analysis of individuals not of inferred-European genetic ancestry were particularly limited, restricting rare variant analysis and reducing statistical power more so than for common variant analysis.
We were additionally limited to replication in non-UKB datasets: future methodological advances will allow individual-level meta-analysis, substantially increasing statistical power.However, this should not understate the significance of replication of our findings in independent cohorts with differing different ancestral backgrounds.Further, there is a lack of high-quality tissue-based functional data available for the non-coding genome, which will improve as more non-coding sequencing data becomes available.
In conclusion, we have identified several non-coding single variant and genomic aggregate genetic loci associated with human height using generalised annotation criteria.Our approach provides a template for future rare-variant analyses of whole-genome sequencing data of other complex phenotypes.

UK Biobank and Whole Genome Sequencing
The whole genome sequencing performed for UKB had an average coverage of 32.5X, with a minimum of 23.5X, using Illumina NovaSeq sequencing machines provided by deCODE 39 .
The genome build used for sequencing was GRCh38: single variant nucleotide polymorphisms and short 'indels' were jointly called using GraphTypher 40 .deCODE found that the number of variants identified per individual was 40 times larger than that found using WES in the initial 150,000 release of whole genome sequences.Structural variants were called using the same process.
Of the 200,000 individuals whose genomes were sequenced, we found, using genetic principal components as previously described 41 , there were 183,078 individuals of European ancestry in this subset of the UK Biobank.

Genetic Data Format
We performed a multi-allele splitting procedure on each of the 60,648 pVCF whole genome sequencing files provided by the UK Biobank using bcftools 42 and then converted those pVCFs to PLINK 18 (v1.9).bed/bim/famformat.We then grouped multiple PLINK files together, to produce 1,196 non-overlapping PLINK files each covering approximately 2.5Mbp of the genome, which we use as input to REGENIE 17 (v3.1) to perform both single variant and genome unit testing.

Common Variant Conditioning
We adjusted for all known loci at most 5Mbp from each variant by further grouping each of the 1,196 PLINK format files into triplets, with the two genotype files up-and downstream of the central PLINK file, to ensure that a genetic variant which was close to the beginning of an individual genome chunk was conditioned on sufficiently distant loci.We merged genome chunks at the beginning and end of a chromosome, and at either end of the centromere with only one chunk, be it downstream or upstream as appropriate.

Genetic Variant Exclusion
We excluded all variants from our association analyses if GraphTypher, the software used to by the UK Biobank to perform genotype calling, assigned an AAScore which was less than 0.5 39 , denoting variant quality.

Single Variant Association Testing
We performed single variant association testing on any variant with at least 20 carriers in the population (MAC20).We conditioned our association tests on all common variants identified in the most recently published GWAS 2 as well as published exome array variants 15 , and significant (P<5.00ൈ 10 ି଼ ) exome variants published by Regeneron for standing height 16 , to minimise the likelihood that novel non-coding associations were driven by known common GWAS or coding loci -ST2.

Null Association Model
We randomly generated and performed association testing for 20 normally distributed (mean zero and unit standard deviation) 'dummy' phenotypes, with an N matching that of our European ancestry analysis, in order to estimate the number of independent tests, because Bonferroni correction is known to be over-conservative for highly correlated tests.To determine a significance threshold, we took the minimum p-value across all single variant and genomic unit tests across any of the 20 simulated phenotypes, representing a 95% significance level relative to the null.

Defining independent variants
Single variants which passed genome-wide significance were analysed using PLINK's clumping procedure, based on ‫ݎ‬ ଶ <0.001 (linkage disequilibrium) and a minimum clump distance of 250kb.Variants classified as independent by PLINK then underwent a formal conditional analysis step.For each window (as defined above) containing more than one 'clumped' variant, we conditioned on the top variant in the window, which we classify as an independent variant.

LocusZoom
We generated a LocusZoom 43 plot for each genetic variant which passed our clumping procedure, based on statistical linkage disequilibrium derived from the UK Biobank whole genome sequencing data.In these cases, all variants with MAC1 within +/-750kbp of the lead variant were tested for association with height, and the lead variant within the locus was determined using the PLINK clumping procedure with a maximum r2<=0.001and distance of at least 250kbp.If a variant passed only one of these criteria, we performed a bespoke independence test, where significant variants are conditioned on one-by-one until no association remains.

Genetic Variant Annotation
We annotated all genetic variants using Variant Effect Predictor (VEP) 19 .Where possible, we assigned each variant to one of three classifications: coding, proximal-regulatory or intergenic-regulatory.A variant was classified as coding if it had an impact on an exon of any transcript; proximal-regulatory if the variant lay within a 5kbp window around a transcript, and was not already a coding variant in any transcript, and finally intergenicregulatory if the variant fell within a conserved, constrained, intronic or non-coding exon region (details below), and was neither proximal-regulatory or coding.We additionally tested variants in sliding windows of size 2000 base pairs, regardless of the number of variants in each window, with proximal and coding variants excluded to minimise hypothesis overlap.
We then assigned each variant to groupings, which we refer to as masks, according to their predicted consequence and location.We used five published variant scores to group variants by consequence:

Genomic Evolutionary Rate Profiling (GERP)
The GERP score is a measure of conservation at the variant level 21 .We classified a variant if it had a GERP score >2.

phastCons score
phastCon is a window-based measure of conservation across species 44 : either strictly mammalian (phastCon 30), or for all species (phast_100).We tested non-coding genome windows, i.e. excluding any window containing an exon, that had a phastCon score in the top percentile.

Constrained Score
Constraint was calculated in windows of size 1kbp 8 based on the local mutability and observed mutation rate of each window.We tested windows with a constraint z-score greater than or equal to four.

Splice AI (AI) score
The splice AI score 45 is a measure of how well predicted each variant within a pre-mRNA region is of being a splice donor/acceptor, or neither.A variant was classified as a splice site with high confidence if it had an AI>70.

Combined Annotation Dependent Deletion score (CADD)
The CADD score 20 predicts how deleterious a variant is likely to be.We applied the CADD score only to coding variants, and considered loss-of-function variants only if tagged as high confidence by VEP.Missense variants with CADD>25 were segregated for testing in a separate mask.

JARVIS Score
The JARVIS score was derived to better prioritise non-coding genetic variation for association study, based on a machine learning model derived from measures of constraint 22 .
Each genome mask consisted of a number of variants with different consequences, based on their location, one of the above scores and/or predicted coding consequences.For example, for a variant to be classified as missense CADD>25, it must change a codon of an exon of a gene transcript, and be predicted to be highly deleterious.
In Table 2 we present the full list of consequences assigned to each mask and classification.We re-assigned variants that fulfilled two distinct criteria within a given genome unit to avoid duplication.In these cases, a variant was re-labelled as a combination of the two criteria, and were attached to any mask which selects variants from at least one of those criteria.

Pseudo Genes
We assigned variants to pseudo gene transcripts if they contained pseudo-exons.However, pseudo-exons were not excluded from proximal regions of non-pseudo gene associations, instead being tested as a regulatory genome unit.If a pseudo-exon overlapped with any significant genome unit signal, we performed a bespoke analysis.

Association Testing
All association analyses were corrected for age, sex, age squared, UK Biobank recruitment centre (as a proxy for geography) and the first forty genetic principal components.
Genome unit testing was performed for variants with a maximum allele frequency threshold of 0.1%, using REGENIE, based on the genetic units specified in We performed each of the four statistical tests above for each mask for which a genome unit has at least one variant.Additionally, a singleton association test was performed for all variants with MAC=1 in each unit.REGENIE also estimated an `all-mask` association strength for each genome unit, which is an aggregation of the test statistics of the individual masks.To ensure that this did not result in a mixing of non-coding and coding association statistics, we split each gene transcript into a coding transcript, which we tested for all coding masks, and a proximal transcript that we tested for all proximal masks.Regulatory genome units were either classified by their ENSR assignment, by the extent of a 1kb constrained window, or a phastCon conserved window.We named sliding windows by the range of chromosome which they covered.

Signal Classification
We determined whether a genomic unit signal was the result of the net effect of many variants of similar consequence or driven by one variant/a single loci of variants, by performing a second batch of genomic unit association testing corrected for single variants that passed the significance threshold in the single variant analysis.

Fine Mapping
To calculate the credible set for any common variant which lay within our rare-variant loci (single variant or aggregate), we performed a fine-mapping procedure using the recentlyreleased SuSiEx 46 software.SuSiEx leverages linkage-disequilibrium information across ancestries.R 2 between all variants was calculated directly from UKB WGS data, stratified by genetically determined ancestry.

Heterogeneity Calculations
We used the R-package metafor 47 to calculate all heterogeneity p-values between effect estimates, under the assumption of a fixed-effects model.

Replication within non-European UKBB ancestries
We first attempted to replicate our results by repeating our analysis for individuals of South Asian (SAS) and African (AFR) ancestry, with samples sizes of 4,439 and 3,077 respectively.
The National Institutes of Health and the National Heart Lung and Blood in the US sponsored the creation TOPMed.The WGS was performed at a target depth of >30x using DNA extracted from blood.We analysed 87,652 multi-population samples from 33 studies in the freeze 8 TOPMed (ST1).Population group was defined by self-reported information from participant questionnaires in each study (Supplementary Note).For individuals who had unreported or non-specific population memberships (e.g., "Multiple" or "Other"), we applied the Harmonized Ancestry and Race/Ethnicity (HARE) method (Fang et al. 2019; Zhang et al. 2023) to infer their group memberships using genetic data.The population groups were thus labelled by their self-identified or primary inferred population group.Among the 87,652 participants, 52,519 (60%) were female and 44,846 (51%) were non-European.Additional descriptive tables of the participants are presented in ST1.

Replication using All of Us
We have also conducted a mutual-replication analysis with short read WGS data from All of Us freeze 6, stratified by continental genetic ancestries European (EUR), AFR, and Admixed American (AMR).The AllofUs team pre-computed principal components by projecting AllofUs into the same PC space as the Human Genome Diversity Project and 1000 Genomes.
These PCs were then used as input into a random forest classifier to derive continental ancestry classifications.Low quality variants were removed from the dataset before association analyses were performed using REGENIE 17 .

Figure 1 :
Figure 1: Manhattan plots of results split by single variant and genomic aggregate analysis.From top to bottom: unconditioned single variants, single variants conditioned on known height loci, rare (<0.1% minor-allele frequency) coding genome aggregates, followed by rare non-coding genome units proximal genome aggregates, regulatory genome aggregates and sliding window aggregates.We plot -log10(p) on the y-axis.Red horizontal lines indicate the position of genome-wide significance considering only that panel, whilst blue indicates genome-wide significance across the entire study.For the single variant, coding and proximal panels, loci leads are labelled by their annotated gene based on the output of Variant Effect

Figure 2 :
Figure 2: Variant minor-allele-frequency versus absolute effect size for the 28 genetic variants (red) identified after adjusting for previously published height loci, contrasted

Figure 3 :
Figure 3: A) UCSC genome browser window showing genomic features in the region upstream of HMGA1, including JARVIS score and conservation score.Custom track 'Common Variants' shows the locations and -log10(P) values of variants with MAF>0.01%, and 'Rare Variant Associations' displays the locations and -log10(P) values of variants which contributed to the genomic aggregate (MAF<0.001%).B) Manhattan plot showing the distribution of log10-pvalues centred on the common GWAS signal at the HMGA1 locus.C)

Figure 4 :
Figure 4: A) UCSC genome browser window showing genomic features in the region of the region upstream of C17orf49, including JARVIS score and conservation score.-log10(P) values of rare (<0.01%) variants which contributed to the aggregate association are highlighted in a custom track.The vertical blue, red and green lines show the boundaries of MIR195, MIR497 and MIR497-HG respectively.B) Forest plot demonstrating how the effect estimate for the association between the proximal and miRNA aggregates, depending on how variants are allocated.C) QQ plot for variants in the C17orf49 proximal aggregate.
included in each grouping.UTR = Untranslated Region, 3` = variants at the 3` end of a transcript, 5` = variants at the 5` end of a transcript, GERP = Genomic Evolutionary Rate Profiling score (a measure of conservation), Start Gained/Lost = the inclusion or removal of a start codon, Downstream = downstream of a transcript, CADD = Combined Annotation Dependent Deletion score, AI = Splice AI (AI) score.