ABSTRACT
Genetic correlation (rg) analysis is commonly used to identify traits that may have a shared genetic basis. Traditionally, rg is studied on a global scale, considering only the average of the shared signal across the genome; however, this approach may fail to detect scenarios where the rg is confined to particular genomic regions, or shows opposing directions at different loci. Tools dedicated to local rg analysis have started to emerge, but are currently restricted to analysis of two phenotypes. For this reason, we have developed LAVA, an integrated framework for local rg analysis which, in addition to testing the standard bivariate local rg’s between two traits, can evaluate the local heritability for all traits of interest, and analyse conditional genetic relations between several traits using partial correlation or multiple regression. Applied to 20 behavioural and health phenotypes, we show considerable heterogeneity in the bivariate local rg’s across the genome, which is often masked by the global rg patterns, and demonstrate how our conditional approaches can elucidate more complex, multivariate genetic relations between traits.
Introduction
Results from just over a decade of genome-wide association studies (GWAS) have demonstrated that statistical pleiotropy across the genome is ubiquitous, meaning that particular genetic variants, genes, or genomic regions often show association with more than one trait1–3. Pleiotropy is valuable to study for a number of reasons, as it could elucidate biological pathways that are shared between traits4–6, help generate hypotheses about the functional significance of GWAS results7–9, and improve our understanding of the aetiology and overlap between complex traits and diseases1,10,11.
Pleiotropy on the single variant level has traditionally been studied using colocalization methods, which typically employ a Bayesian analysis framework with the aim of detecting true, causally shared genetic effects7,8,11,12; but there now exists a wide range of different cross-trait genetic association methods aimed at elucidating pleiotropy5,9,13–15. Given the notion that extensive pleiotropy may result in a genome-wide correlation between the genetic association signals, genetic correlation analysis has been frequently employed to identify traits for which there could be widespread pleiotropy across the genome, and this type of analysis has become a standard follow-up analysis to genome-wide association studies (GWAS)13,16–18. Notably, an observed genetic correlation (rg) does not guarantee the presence of true, causal pleiotropy, as strong linkage disequilibrium (LD) between different causal SNPs could also give rise to a correlation between genetic signals17, but genetic correlation analysis nonetheless facilitates prioritisation of scenarios where pleiotropy is likely.
While pleiotropy is typically discussed on a local level, such as single SNPs or genes, rg is traditionally studied on a global, genome-wide scale. Since a global rg merely represents an average of the shared association signals across the genome, local rg’s in opposing directions could result in a low or completely non-significant global rg, and strong local rg’s in the absence of any global relationship may go undetected17,19. In addition, global rg’s offer limited insight into the biological mechanisms that are shared between phenotypes, as the exact source of any detected genetic relation remains unidentified. To overcome these limitations, some have employed strategies such as partitioning the rg by annotation (e.g. GNOVA20), or restricted testing only to variants that are assumed to be associated (MiXER21).
Due to often high levels of LD between nearby SNPs, global rg methods cannot easily be translated to a local scale; but methods aimed at estimating local rg have also started to emerge (Rho-Hess19, SUPERGNOVA22, LOGOdetect23). To our knowledge, however, no existing tool currently offers the opportunity to model the local genetic relations between more than two phenotypes simultaneously. To address this, we have developed LAVA (Local Analysis of [co]Variant Annotation), a flexible and user-friendly tool aimed at detecting regions of shared genetic association signal between any number of phenotypes. LAVA can analyse binary as well as continuous phenotypes with varying degrees of sample overlap, and in addition to evaluating standard bivariate local rg’s, it can test the local joint univariate association signal (i.e. the local h2) for the traits in question, which can be used to filter out non-associated loci that may yield unstable rg estimates, and analyse the genetic relations between several traits simultaneously. Local genetic association analysis of multiple traits can be performed either via partial correlation or multiple linear regression, allowing for complex, multivariate genetic relationships to be examined in more detail than is currently possible with standard bivariate approaches.
In this paper, we demonstrate the features of LAVA through application to real data, and validate its properties and robustness using simulation. We first describe the details of the method and the various analysis options, and then apply LAVA to 20 behavioural and health-related traits. We examine the heterogeneity of genetic relations between these phenotypes, and then zoom in on the major histocompatibility complex (MHC) on chromosome 6, to examine the variability of the conditional local relationships between selected health phenotypes in this LD-dense region.
RESULTS
Input processing and estimation of local genetic signal
For any genomic region of interest, consider a centred continuous phenotype vector Yp (for phenotype p) with sample size N, and a standardised genotype matrix X containing Ksnp SNPs. We can model the relation between a phenotype and all SNPs in this region using a multiple linear regression model of the form Yp = Xαp + εp, where αp represents the vector of joint SNP effects (which account for the LD between SNPs) and εp the vector of residuals, which are normally distributed with variance $\sigma_p^2$.
Given that the standard least squares estimate of αp is of the form $\hat{\alpha}_p = (X^T X)^{-1} X^T Y_p$, if we denote the local SNP LD matrix as S = cor(X) and the vector of estimated marginal SNP effects as $\hat{\beta}_p = X^T Y_p / (N - 1)$ (which do not account for LD), we can express $\hat{\alpha}_p$ as $\hat{\alpha}_p = S^{-1} \hat{\beta}_p$. Then, after obtaining $\hat{\beta}_p$ from GWAS summary statistics for Yp, by using a reference genotype data set (e.g. 1,000 Genomes24) from a population with a matching ancestry/LD structure to compute S, we can estimate the joint SNP effects αp (effectively removing the correlation between SNP effects that is due to LD), without the need for any individual level data. To ensure that the direction of effect is consistent across traits (which is crucial for preventing false positives; see Suppl. Fig. 1), LAVA aligns the summary statistics to the reference data before computing $\hat{\alpha}_p$.
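The identity between the joint least squares estimate and the LD-adjusted marginal effects can be checked numerically. The following is a minimal sketch on simulated data, not LAVA's implementation; all variable names and the simulated effect sizes are illustrative assumptions, and individual-level data is used here only to verify the algebra.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K_snp = 500, 5

# Simulate correlated "genotypes", then standardise each SNP (mean 0, variance 1).
raw = rng.normal(size=(N, K_snp))
raw[:, 1] += 0.8 * raw[:, 0]          # induce LD between SNP 0 and SNP 1
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# Simulate a phenotype with arbitrary (hypothetical) joint effects and centre it.
Y = X @ np.array([0.3, 0.0, -0.2, 0.0, 0.1]) + rng.normal(size=N)
Y = Y - Y.mean()

S = (X.T @ X) / (N - 1)               # local LD matrix; equals cor(X) for standardised X
beta_marginal = (X.T @ Y) / (N - 1)   # marginal effects, one SNP at a time

# Joint effects recovered from marginal effects and LD only.
alpha_joint = np.linalg.solve(S, beta_marginal)

# Reference: ordinary least squares on the individual-level data.
alpha_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
```

Here `alpha_joint` and `alpha_ols` agree up to floating point error, which is why summary statistics plus a reference LD matrix suffice when the reference LD matches that of the GWAS sample.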
Once we have obtained $\hat{\alpha}_p$, we can estimate the residual variance $\sigma_p^2$, and hence also the phenotypic variance explained by the SNPs in the locus (i.e. the univariate local genetic signal, or local h2). To determine whether the local h2 is significant, we test the explained phenotypic variance using an F-test (see Methods). We recommend using this test to filter out non-associated loci prior to any rg analysis, since rg’s will not be interpretable or reliable for phenotypes that do not show any local genetic signal.
Note that the above regression formulation concerns continuous phenotypes; for binary phenotypes we employ a largely similar strategy, reconstructing a multiple logistic regression model for the locus based on the marginal SNP effects and using a χ2-test to test the joint association of SNPs with the phenotype (see Methods for more detail).
Estimation of bivariate local genetic correlations
To determine the bivariate local genetic correlation for any region and set of P traits, we define the local genetic component matrix G = Xα, where X represents the standardised genotypes at that locus and α the Ksnp by P matrix of joint effects on each trait (as outlined above). We denote the realised covariance matrix of G as Ω, such that each diagonal element $\omega_p^2$ represents the local genetic variance of Gp for phenotype p, and each off-diagonal element ωpq the local genetic covariance of Gp and Gq (for phenotypes p and q). Thus, for two phenotypes p and q: $\Omega = \begin{pmatrix} \omega_p^2 & \omega_{pq} \\ \omega_{pq} & \omega_q^2 \end{pmatrix}$. From this Ω, the local genetic correlation can be directly computed as $\rho_{pq} = \omega_{pq} / \sqrt{\omega_p^2 \omega_q^2}$. Since genotype and phenotype data are all standardised, the square of the estimated local genetic correlation, $\hat{\rho}_{pq}^2$, can be interpreted as the proportion of variance in the local genetic component Gp that is explained by Gq (and vice versa). Since G is unobserved, Ω cannot be computed directly, and we therefore estimate it using the Method of Moments25. Once estimated, we compute ρpq as shown above, and generate simulation-based p-values to evaluate its significance (see Methods).
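As a small worked example of the last step, the correlation is the off-diagonal element of Ω divided by the square root of the product of its diagonal elements. The sketch below assumes Ω has already been estimated; the numerical values and the function name `local_rg` are hypothetical, and the Method of Moments estimation and the simulation-based p-values are not shown.

```python
import math

def local_rg(omega):
    """Correlation implied by a 2x2 local genetic covariance matrix.

    omega: nested list [[var_p, cov_pq], [cov_pq, var_q]].
    """
    var_p, cov_pq = omega[0][0], omega[0][1]
    var_q = omega[1][1]
    return cov_pq / math.sqrt(var_p * var_q)

# Hypothetical estimate of Omega for two traits in one locus.
omega = [[0.04, 0.01],
         [0.01, 0.01]]

rho = local_rg(omega)   # local genetic correlation
rho_sq = rho ** 2       # proportion of variance in G_p explained by G_q
```

With these illustrative values the correlation is 0.5, so a quarter of the variance in one local genetic component is explained by the other.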
As shown in Suppl. Figs. 2-4, this approach produces unbiased parameter estimates with well contained type 1 error rates for both binary and continuous phenotypes, and a wide range of locus sizes. See also the Suppl. Note 1 for a comparison of the local rg estimation employed in LAVA, to that of Rho-Hess and SUPERGNOVA.
Since any potential sample overlap between summary statistics sets can result in an upward bias in the estimated correlation, known or estimated sample overlap (obtained e.g. via bivariate LDSC13) should be provided to LAVA. Any shared variance that is due to sample overlap will be modelled as a residual covariance, effectively removing such bias (see Methods; Suppl. Fig. 5).
Estimation of multivariate local genetic relations
Multivariate local genetic analysis can be used to obtain the conditional genetic associations, using several traits simultaneously. This has been implemented in two forms: partial correlation, which models the local rg between two phenotypes while controlling for their rg’s with one or more other phenotypes, and multiple regression, which can model local genetic signal of an outcome phenotype using a set of predictor phenotypes simultaneously.
The partial genetic correlation between the phenotypes p and q, conditioned on their local rg’s with some other phenotype(s) Z, is denoted ρpq|z. This ρpq|z can be computed directly from Ω (see Methods), and indicates how much of the initial correlation ρpq remains once the rg’s with the phenotype(s) in Z are accounted for.
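To make the idea concrete, a partial correlation can be obtained from a covariance matrix with the standard conditional-covariance (Schur complement) formula. The sketch below uses that textbook formula as an assumption; the exact computation used by LAVA is the one given in its Methods, and the function name, indices, and example matrix here are hypothetical.

```python
import numpy as np

def partial_rg(omega, p, q, z):
    """Partial correlation between traits p and q given the traits in z.

    omega: (P x P) local genetic covariance or correlation matrix.
    p, q:  integer indices of the two traits of interest.
    z:     list of integer indices of the traits conditioned on.
    """
    omega = np.asarray(omega, dtype=float)
    pq = [p, q]
    # Covariance of (p, q) after removing the part explained by the traits in z.
    cond = omega[np.ix_(pq, pq)] - omega[np.ix_(pq, z)] @ np.linalg.solve(
        omega[np.ix_(z, z)], omega[np.ix_(z, pq)]
    )
    return cond[0, 1] / np.sqrt(cond[0, 0] * cond[1, 1])

# Hypothetical local genetic correlations between three traits.
omega = np.array([[1.0, 0.5, 0.6],
                  [0.5, 1.0, 0.7],
                  [0.6, 0.7, 1.0]])

rho_bivariate = omega[0, 1]
rho_partial = partial_rg(omega, 0, 1, [2])
```

In this example the partial correlation is considerably smaller than the bivariate one, illustrating a case where much of the relation between the first two traits is accounted for by the third.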
In contrast, the multiple regression models the genetic signal for a single outcome phenotype Y using the genetic signal for one or more predictor phenotypes X. We formulate this as GY = GXγ + ε for standardised genetic components GX and GY, such that γ reflects the vector of standardised regression coefficients and ε the residuals, with residual variance τ2 (all computed directly from Ω; see Methods). Here, the γ’s indicate how much the genetic component for each individual predictor in X contributes to the genetic component of Y (conditioned on the other predictors) and τ2 the proportion of local heritability of Y that is independent of X. From this τ2, we can then compute the model r2 as r2 = 1 – τ2, which tells us how much of the local heritability for Y can be explained by the genetic components of all predictor phenotypes jointly. (95% confidence intervals are computed for all individual γ’s, as well as the total model r2; Methods). For a more in-depth overview of the differences and similarities between partial correlation and multiple regression, see Suppl. Note 2.
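The quantities γ, r2 and τ2 can be illustrated with the usual standardised regression algebra applied to a covariance matrix. This is a sketch under the assumption that γ solves the normal equations on the correlation scale; the authoritative definitions are in the Methods, confidence intervals are not computed, and the function name and example values are hypothetical.

```python
import numpy as np

def genetic_regression(omega, outcome, predictors):
    """Standardised regression of one genetic component on others.

    omega:      (P x P) local genetic covariance matrix.
    outcome:    integer index of the outcome trait.
    predictors: list of integer indices of the predictor traits.
    Returns (gamma, r2, tau2).
    """
    omega = np.asarray(omega, dtype=float)
    # Rescale the covariance matrix to a correlation matrix.
    d = 1.0 / np.sqrt(np.diag(omega))
    corr = omega * np.outer(d, d)
    c_xx = corr[np.ix_(predictors, predictors)]
    c_xy = corr[predictors, outcome]
    gamma = np.linalg.solve(c_xx, c_xy)   # standardised coefficients
    r2 = float(c_xy @ gamma)              # variance of the outcome explained jointly
    tau2 = 1.0 - r2                       # residual (unexplained) proportion
    return gamma, r2, tau2

# Hypothetical example: trait 0 is the outcome, traits 1 and 2 are predictors.
corr = np.array([[1.0, 0.6, 0.5],
                 [0.6, 1.0, 0.4],
                 [0.5, 0.4, 1.0]])
sd = np.array([0.1, 0.2, 0.05])
omega = corr * np.outer(sd, sd)

gamma, r2, tau2 = genetic_regression(omega, outcome=0, predictors=[1, 2])
```

Because the two predictors are themselves correlated, each γ is smaller than the corresponding bivariate correlation with the outcome.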
As can be seen in Suppl. Figs. 6-9, our two multivariate local genetic association approaches provide overall unbiased estimates and controlled type 1 error rates. The only exceptions were a few instances of the partial correlation for binary phenotypes where we saw a minor median-bias, although no mean-bias, for some null simulations with very low univariate signal (although type 1 error rates for these were nevertheless controlled). Additionally, there was a very slight type 1 error rate inflation at a local odds ratio of 1.5, which had an error rate of .06 for an alpha level of .05 (Suppl. Fig. 9), though we note that this represents a rather extreme level of local heritability for complex, non-mendelian phenotypes.
Bivariate local genetic correlation analyses reveal extensive overlap of local genetic association signals between traits
To demonstrate our method, we applied LAVA to 20 health related and behavioural traits (see Table 1), testing the pairwise local genetic correlations within 2,495 genomic loci (genome-wide), followed by conditional local genetic analyses for a subset of strongly intercorrelated phenotypes. In order to partition the genome, we developed an algorithm that sections the chromosomes into approximately equal sized (∼1Mb) semi-independent blocks, by recursively splitting the chromosomes into smaller regions while minimising the LD between them (see Methods – Genome partitioning for details).
As our summary statistics were based on European samples, we utilised the individuals of European ancestry from the 1000 Genomes (phase 3)24 as a genotype reference (both for the definition of genomic regions and for all LAVA analyses). Sample overlap was estimated using the intercepts from bivariate LDSC (see Methods).
Given that the detection of valid and interpretable local rg’s requires the presence of sufficient local genetic signal, we used the univariate test as a filtering step for the bivariate local rg analyses. Since the power to detect a significant local heritability depends on the power of the original GWAS, this step could potentially lead to the exclusion of some relevant loci, particularly for phenotypes with a small sample size. However, as with a true lack of genetic signal, such low-powered scenarios would likely also produce unstable rg estimates, and we therefore tested local correlations only for pairs of traits which both exhibited univariate local genetic signal at p < .05 / 2,495. This resulted in a total of 21,374 bivariate tests across all trait pairs, spanning 1,919 unique loci.
With a Bonferroni corrected p-value threshold of 2.34e-6 (.05 / 21,374), we detected a total of 546 significant bivariate local rg’s across 234 loci, of which 81 loci were associated with more than one trait pair. For 193 of these correlations, the 95% confidence intervals (CI’s) for the explained variance included 1, which is consistent with the scenario that the genetic signal of those pairs of phenotypes in these loci is completely shared (Fig. 1).
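The two significance thresholds used in this filtering scheme follow from dividing the alpha level by the number of tests at each stage. The short sketch below only reproduces that arithmetic from the counts reported in the text; the variable names are illustrative.

```python
ALPHA = 0.05
N_LOCI = 2495              # loci tested in the univariate filtering step
N_BIVARIATE_TESTS = 21374  # bivariate tests remaining after filtering

# Bonferroni threshold for the univariate (local h2) filter.
univariate_threshold = ALPHA / N_LOCI

# Bonferroni threshold for the bivariate local rg tests.
bivariate_threshold = ALPHA / N_BIVARIATE_TESTS
```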
The trait pairs exhibiting the greatest number of significant local rg’s were BMI and educational attainment (39), which also had the largest sample sizes (see Table 1), followed by depression and neuroticism (33), and BMI and waist-hip ratio (20). As can be seen in Fig. 1, several conceptually related phenotypes tended to show a large number of significant rg’s, with depression and neuroticism having the highest proportion of regions within which the CI’s included 1.
Given the number of immune phenotypes analysed here, we chose to retain the major histocompatibility complex (MHC; chr6:26-34Mb, 21 loci) in our analyses as this locus is highly relevant to the aetiology of these phenotypes. Of the 546 significant local rg’s, 229 were found within these MHC loci (particularly for immune phenotypes), consistent with the notion that there is extensive pleiotropy within this region2,26.
Local genetic correlation analysis more accurately captures the heterogeneous genetic relationships between phenotypes
For all trait pairs, we examined the strength and direction of effect of the local rg’s by taking the average of the observed correlation coefficients across tested loci. As shown in Fig. 2, consistently positive rg’s with multiple significant loci were observed for many phenotypes (e.g. neuroticism & depression, cholesterol & CAD, BMI & diabetes, Crohn’s & UC), for which we also saw concordance with the observed pattern of global rg’s, as obtained via bivariate LDSC13 (Fig. 2).
However, there were also several trait pairs with a global rg close to 0 which nonetheless exhibited significant local genetic correlations (e.g. BMI & WHR, BMI & neuroticism, asthma & Crohn’s, asthma & UC, alcohol & WHR), supporting the notion that global rg’s fail to capture the complexity and heterogeneity in the genetic overlap between many traits. As expected, these traits tended to exhibit local rg’s in opposite directions, and/or within a limited number of loci. Yet even for pairs like alcohol intake frequency and BMI, which showed a consistent pattern of negative local rg’s and a highly significant negative global rg (r = -.3, p = 8.85e-42), two significant local rg’s in a positive direction were still found. Similar patterns were observed for several other phenotypes, such as BMI and cholesterol, diabetes and cholesterol, and alcohol and neuroticism (for an overview of the total number of positive versus negative local genetic correlations detected per phenotype pair, see Suppl. Fig. 10).
Bivariate local genetic correlations implicate potential pleiotropy hotspots
From the bivariate analyses, we identified a total of 81 regions within which significant rg’s were found between multiple trait pairs. Most of these were located within the MHC (chr6:26-34Mb), a region within which extensive pleiotropy has been noted previously2,26. Within MHC hotspots, immune related phenotypes were among the most frequently intercorrelated (with lupus displaying the greatest number of significant genetic correlations of all; Fig. 3), consistent with the known role of the MHC in immune function38,39.
The largest hotspot, i.e. the locus with the greatest number of genetic correlations, was locus 963 (chr6:32,454,578-32,539,567, within the MHC), where a total of 27 significant correlations between 10 different phenotypes were detected (Fig. 3a; Suppl. Table 1). This locus contains a single protein coding gene, HLA-DRB5, which has been linked with several of the associated phenotypes previously (e.g. asthma40, diabetes6, WHR41, lupus42). The second largest hotspot, locus 961 (chr6:31,427,210-32,208,901; also within the MHC), was the most diverse, showing a total of 24 significant genetic correlations for 15 different traits. Here, lupus was situated as a hub phenotype, showing significant correlations with most other phenotypes (Fig. 3b), a pattern that was observed across a few other MHC loci as well (Suppl. Tables 2,6,9,13). Notably, both of these loci were contained within a region identified as the top pleiotropic locus in a recent large scale investigation of pleiotropy by Watanabe et al. (2019)2, across a total of 558 traits.
Although the MHC is a region known for its complex LD structure43, we nonetheless observed clustering of conceptually related traits within these loci (e.g. cholesterol & WHR vs Crohn’s & UC, depression & neuroticism; see Suppl. Tables 1-11, 13, 15-18, 21, 25-27), and saw instances of substantial local heritability, without necessarily the presence of any local genetic correlation (e.g. IgAD and RA in loci 958-960: univariate p’s < 1e-118, bivariate p’s > .05). This suggests that shared and nonshared genetic signal might be distinguishable even in regions with notoriously strong LD; though, it should be noted that in scenarios where separate causal SNPs are in perfect LD, true pleiotropy will be inseparable from confounding, and such instances might be more common within LD-dense regions like the MHC.
Outside the MHC, the two largest hotspots had 8 significant rg’s each: the first on chromosome 11 (112,755,447-113,889,019), involving depression, neuroticism, alcohol intake, educational attainment, and WHR (locus ID 1719; Suppl. Table 12), and the second on chromosome 3 (47,588,462-50,387,742), involving educational attainment, insomnia, alcohol intake, BMI, CAD, UC, and Crohn’s (locus ID 464; Suppl. Table 14). These hotspots also overlapped with loci identified among the top pleiotropic regions for psychiatric, cognitive, metabolic, and immunological phenotypes previously (see Suppl. Table 4 in Watanabe et al. 20192). In addition, locus 1719 on chromosome 11 contains both NCAM1 and DRD2 (among 8 other genes), which have been frequently implicated in behavioural and psychiatric traits (e.g. alcohol dependence44, smoking2, cannabis use45, depression34, neuroticism2,46, sleep duration47, ADHD symptoms48), and this locus also overlapped with a hotspot flagged by SUPERGNOVA22 within which significant rg’s were identified for autism, bipolar disorder, depression, cognitive performance, schizophrenia and smoking initiation22, suggesting this region might be a key regulator of brain related phenotypes.
Finally, we also identified two loci that were specific to RA, hypothyroidism, and lupus: the first on chromosome 2 (191,051,955-193,033,982; locus ID 374) and the second on chromosome 1 (113,418,038-114,664,387; locus ID 100) (see Suppl. Tables 40 & 42). In both cases, positive rg’s were observed between hypothyroidism and RA, both of which showed negative rg’s with lupus.
For a complete overview of all hotspots, including the relevant statistics, associated genes, and network plots, see Suppl. Tables 1-81.
Genetics of asthma partially explain local genetic correlations between other health-related phenotypes
To demonstrate the application of our partial local rg approach – which can model the local rg between two phenotypes of interest conditioned on one or more other phenotypes – we selected a subset of the aforementioned MHC hotspots within which we discovered a sub-cluster of four phenotypes that were consistently interconnected: asthma, hypothyroidism, RA, and diabetes (see network plots in Suppl. Tables 3-5,8). Given that asthma tended to show consistently strong rg’s with the other phenotypes within this cluster (see Fig. 4 & Suppl. Tables 3-5,8), we hypothesized that the shared genetic signal with asthma could account for some of the overlap between RA, diabetes, and hypothyroidism. We therefore computed the partial genetic correlations between these phenotypes, accounting for their local rg with asthma.
As shown in Fig. 4, conditioning on asthma, in most cases, resulted in a substantial decrease of the rg’s between hypothyroidism, RA, and diabetes. On several occasions, the 95% CI’s for the partial rg’s included 0, indicating that they were no longer even nominally significant. This reduction in signal was particularly evident for locus 965 (chr6:32,586,785-32,629,239), in which the partial rg’s were no greater than .08 after accounting for asthma (with all of the CI’s spanning 0), despite bivariate rg’s ranging from .32 to .52. This suggests that, in part, genetic variants associated with asthma within these loci might confer a more general susceptibility to these phenotypes.
Notably, the degree of change in the local rg after conditioning on asthma was locus dependent, indicating variation in the local genetic covariance structure even for adjacent blocks within the MHC.
Shared genetic aetiology of hypothyroidism within the MHC
To also demonstrate the multiple regression approach – which can model the genetic signal for a single outcome phenotype of interest using the genetic signal for one or more predictor phenotypes – we selected hypothyroidism as the outcome and computed the full joint local genetic relations across a set of relevant MHC loci, using asthma, diabetes, and rheumatism as predictors. With this, we determined the total proportion of variance in the genetic component of hypothyroidism that can be attributed to the genetic signal for these three traits simultaneously.
As seen in Fig. 5, there was notable variation in the total multivariate r2 across adjacent MHC loci, with the proportion of the genetic component of hypothyroidism that could be explained by that of the predictor phenotypes ranging from as little as 13% to as much as 79%. Note, however, that the CI’s for the joint r2’s did not include 1, suggesting that some proportion of the local heritability for hypothyroidism within these loci is nonetheless independent of the three predictor phenotypes.
There was also substantial variation in the strength of the effects of the three traits on hypothyroidism in the multivariate model, largely mirroring the bivariate correlations. In general, either asthma or diabetes tended to account for most of the shared association signal across loci, with 95% CI’s for the multiple regression coefficients of other traits all spanning 0. The relationship between RA and hypothyroidism was largely accounted for by either asthma or diabetes in all but locus 968 (chr6:32,897,999-33,194,975), where there was some independent association signal for both asthma and RA.
DISCUSSION
Global genetic correlation (rg) analysis is commonly used to identify pairs of traits that have a shared genetic basis, and is a widely popular follow-up to GWAS. The traditional, global approach to rg analysis reports only the average rg across the genome, and may therefore fail to detect more complex and heterogeneous genetic relationships where the signal might be confined to specific regions or even show opposing association patterns at different loci13,17,19.
Here, we presented a novel method, LAVA, which is an integrated statistical framework aimed at testing the local genetic relations within user-defined genomic regions. LAVA handles both continuous and binary phenotypes with varying degrees of sample overlap, and in addition to computing standard bivariate local rg’s between two phenotypes, LAVA can test the local univariate genetic association signal for each phenotype, and model the conditional local genetic relations between several traits simultaneously using either partial correlation or multiple linear regression.
Applied to 20 different behavioural and health related traits across 2,495 semi-independent regions defined based on LD, we identified a total of 546 significant bivariate local rg’s across 234 regions. Although the direction of effect for individual pairs was in many cases consistent across loci (particularly for traits showing a strong global rg), there was substantial variability in the strength of the association across the genome, indicating that the genome-wide rg is far from constant. In addition, we identified significant rg’s in opposing directions for several phenotypes, implying a more complex aetiological relationship than that revealed by a global rg analysis. Significant local rg’s were also observed between several trait pairs whose global correlation was not significant, further emphasizing the value of stratifying rg by region.
From the bivariate local rg analyses, we identified several regions that harboured significant rg’s between multiple trait pairs, implicating these regions as potential pleiotropy hotspots. As expected, many of these hotspots were located in the MHC, likely owing to the number of immune- and health related phenotypes included in our example (with the MHC frequently implicated in immune function38,39), and the MHC having been flagged as a pleiotropy hotspot in the past2,26.
We emphasize that while the aim of an rg analysis is to elucidate pleiotropy, the ability to do so is naturally limited by the amount of LD that exists within a region. While this LD structure is accounted for by LAVA, there may be cases where distinct causal scenarios yield identical patterns of SNP associations within a locus, and with the extensive LD that exists within the MHC, there is an increased chance that true pleiotropy may be indistinguishable from confounding here. In spite of this, however, we did observe substantial levels of univariate genetic signal without necessarily the presence of any genetic correlation, with the rg patterns reflecting some clustering of conceptually related traits, suggesting that strong genetic signal may be distinguishable from genetic covariance even within LD dense regions such as the MHC. However, experimental evidence will be required to confirm these observations.
Based on the elaborate rg patterns observed within the MHC, we selected a subset of loci within this region to demonstrate how more complex association patterns can be disentangled via our two multivariate models: the partial correlation, which tests the genetic correlation between two phenotypes of interest conditioned on some other phenotype(s), and the multiple linear regression, which models the genetic signal of an outcome phenotype using that of several predictor phenotypes jointly (i.e. conditioned on each other). For a cluster of consistently associated phenotypes – asthma, diabetes, RA, and hypothyroidism – we showed how such models allow us to examine in detail the patterns of mediation and confounding that exist between them; providing further insights into the genetic association between traits beyond what can be achieved using standard bivariate models.
The LAVA analysis framework can be applied to answer a wide array of research questions. It may be used in a more targeted manner to follow up on a smaller subset of loci highlighted through GWAS, identifying regions of shared association with aetiologically informative phenotypes, or in a more agnostic manner, scanning multiple traits across the entire genome (as done in this paper). Approaching the genomic region as the unit of interest, LAVA could be applied to study the function of particular blocks or genes by mapping out patterns of genetic sharing within a locus across the phenome (similar to a PheWAS). This general analysis framework will have implications for our understanding of disease aetiology and genetic heterogeneity as a whole, which can be further aided by integrating summary statistics of molecular phenotypes or endophenotypes (such as gene expression, metabolites, or brain regions), facilitating the functional interpretation of GWAS results by evaluating the local rg’s with these lower level phenotypes. In this setting, the conditional models could prove particularly useful as they may enable identification of key tissues or regions, offering unique insight into the biological mechanisms that underlie complex traits.
Our method is not without limitations. As already discussed, analytical approaches like LAVA can only pinpoint locations where pleiotropy is likely, but these may be confounded by excessive LD and, ultimately, experimental validation will be required to establish the true nature of any observed genetic overlap. In addition, significant local genetic correlations could be detected from multiple nearby regions, but as LAVA can currently only analyse a single locus at a time, it is unable to condition on the association signal from nearby loci, and it is therefore possible that local genetic correlations are observed in regions adjacent to those harbouring the true association signal. LAVA is also limited by the number of overlapping SNPs within different summary statistics data sets, which could potentially lead to a failure to detect true correlations in scenarios where there are too few shared SNPs between them. We endeavour to address these limitations as best as possible in future versions of LAVA.
METHODS
Model overview and input processing of continuous phenotypes
For any given locus and phenotype p, consider a linear regression model of the standardised phenotype vector Yp on a genotype matrix X (containing Ksnp SNPs, also standardised): Yp = Xαp + ϵp, where αp represents the vector of standardised joint SNP effects and ϵp the vector of normally distributed residuals with variance $\sigma_p^2$. Denote the SNP LD matrix as S = cor(X) and the vector of estimated marginal SNP effects as $\hat{\beta}_p$ (standardised); we obtain the estimated joint effects from these GWAS summary statistics as $\hat{\alpha}_p = S^{-1} \hat{\beta}_p$, using the reference data set to compute S. Here it is assumed that the SNP LD in the reference data is the same as in the original GWAS sample. We can then estimate the residual variance as $\hat{\sigma}_p^2 = \frac{N-1}{N-K-1}\left(1 - \hat{\alpha}_p^T S \hat{\alpha}_p\right)$, with N being the original GWAS sample size and K the number of SNP principal components (see below), with explained variance $r_p^2 = 1 - \hat{\sigma}_p^2$. The estimated joint effects are distributed as $\hat{\alpha}_p \sim \mathrm{MVN}\left(\alpha_p, \frac{\sigma_p^2}{N-1} S^{-1}\right)$, with $\sigma_p^2/(N-1)$ being the sampling variance.
As we cannot be certain whether beta coefficients provided as input are standardised, for each SNP $s$ we create Z-scores using the p-value and the sign of the provided effect size as $Z_{ps} = -\mathrm{sign}(\beta_{ps}) \times \Phi^{-1}(P_{ps}/2)$, with $\Phi$ the cumulative normal distribution function and $P_{ps}$ the SNP p-value. We then convert $Z_{ps}$ to the corresponding correlation $r_{ps} = Z_{ps}/\sqrt{Z_{ps}^2 + N_s - 2}$, which equals the standardised beta coefficient (note: when the per-SNP sample size $N_s$ is not provided, we use the overall $N$ as a proxy). If Z-scores or T-statistics are provided we can also use these directly, in which case p-values and beta coefficients are not necessary.
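The conversion from p-values to standardised effects can be sketched as follows. This is an illustrative example rather than LAVA's own code: the function names `z_from_p` and `z_to_r` are hypothetical, and the sketch assumes two-sided p-values and the conversion $Z = -\mathrm{sign}(\beta)\,\Phi^{-1}(P/2)$, $r = Z/\sqrt{Z^2 + N - 2}$.

```python
import math
from statistics import NormalDist


def z_from_p(p_value, beta):
    """Signed Z-score from a two-sided p-value and the sign of the effect size."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p-value must lie in (0, 1]")
    if beta == 0:
        return 0.0  # sign(0) = 0, so the Z-score is zero as well
    # inv_cdf(p/2) is <= 0; negating it gives the magnitude of the Z-score,
    # and copysign attaches the direction of the reported effect.
    magnitude = -NormalDist().inv_cdf(p_value / 2.0)
    return math.copysign(magnitude, beta)


def z_to_r(z, n):
    """Convert a Z-score to the corresponding correlation for sample size n."""
    return z / math.sqrt(z * z + n - 2)


z = z_from_p(0.05, beta=-0.3)   # roughly -1.96
r = z_to_r(z, n=10_000)         # small negative standardised effect
```

Note that a p-value reported as exactly 0 (numerical underflow in the GWAS output) cannot be converted this way and would have to be handled separately.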
Due to the substantial LD between SNPs, it is unlikely that the LD matrix $S$ will be of full rank, in which case it is not invertible. This therefore requires us to work in a lower dimensional space. To do so, we compute the singular value decomposition $X_{ref}/\sqrt{N_{ref}-1} = U\Lambda Q^T$, such that $S = Q\Lambda\Lambda Q^T$ and hence $S^{-1} = Q(\Lambda\Lambda)^{-1}Q^T$ (here $N_{ref}$ denotes the sample size of the reference data set with genotype matrix $X_{ref}$). For each component $j$, the corresponding squared singular value $\lambda_j^2$ is proportional to the amount of the total variance accounted for by that component. We order the components by decreasing singular value, and select the smallest subset of the first $K$ components such that these account for at least 99% of the total variance (pruning away the rest).
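The component selection rule can be illustrated with a short sketch. The function name `select_components` and its arguments are hypothetical; the only input is the list of singular values, which is assumed to have been computed beforehand.

```python
def select_components(singular_values, min_prop=0.99):
    """Return the smallest K such that the first K components (ordered by
    decreasing singular value) account for at least min_prop of the variance."""
    # Variance per component is proportional to the squared singular value.
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    cumulative = 0.0
    for k, value in enumerate(squared, start=1):
        cumulative += value
        if cumulative >= min_prop * total:
            return k
    return len(squared)


k = select_components([3.0, 2.0, 1.0, 0.1])  # the last component is pruned
```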
Defining $Q^*$ as the $K_{snp}$ by $K$ pruned eigenvector matrix, and $\Lambda^*$ as the corresponding $K$ by $K$ diagonal singular value matrix, we approximate the inverse of $S$ as $S^{-1} \approx Q^*(\Lambda^*\Lambda^*)^{-1}Q^{*T}$. We then define the scaled principal component matrix $W = XR$ with projection matrix $R = Q^*\Lambda^{*-1}$. Finally, we define the corresponding vector of joint effects $\delta_p = R^+\alpha_p$, with $R^+ = \Lambda^*Q^{*T}$, such that $W\delta_p$ closely approximates $G_p = X\alpha_p$, and use this sparser $\delta_p$ in place of $\alpha_p$ (and $W\delta_p$ in place of $G_p$) for parameter estimation.
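The following short derivation, given as a sketch, shows why the scaled components are uncorrelated with unit variance and why $W\delta_p$ approximates $G_p$. It assumes $R = Q^*\Lambda^{*-1}$ and $R^+ = \Lambda^*Q^{*T}$, with $Q^*$ having orthonormal columns so that $Q^{*T}SQ^* = \Lambda^*\Lambda^*$.

```latex
% Assumptions: R = Q^* \Lambda^{*-1}, R^+ = \Lambda^* Q^{*T}, Q^{*T} S Q^* = \Lambda^* \Lambda^*
\begin{align*}
\mathrm{cov}(W) &= R^T S R
  = \Lambda^{*-1}\bigl(Q^{*T} S Q^*\bigr)\Lambda^{*-1}
  = \Lambda^{*-1}\Lambda^*\Lambda^*\Lambda^{*-1} = I_K, \\
W\delta_p &= X R R^{+} \alpha_p
  = X Q^* Q^{*T} \alpha_p \approx X\alpha_p = G_p .
\end{align*}
```

The approximation in the second line is the only place where pruning matters: $Q^*Q^{*T}$ projects $\alpha_p$ onto the retained components, discarding the part that lies in the pruned directions.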
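To test the proportion of phenotypic variance that can be attributed to the local genetic signal, we construct the test statistic $T_p = \hat{\delta}_p^T\hat{\delta}_p/(K\hat{\sigma}^2_p)$, where $\hat{\sigma}^2_p$ is the estimated sampling variance of the elements of $\hat{\delta}_p$ (the estimated residual variance divided by $N - 1$), and evaluate this using an F-distribution with $K$ and $N - K - 1$ degrees of freedom.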
Processing of binary phenotypes
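In order to obtain the joint SNP effects from GWAS summary statistics of binary phenotypes, for the scaled principal components $W$ as defined above, and with $i$ denoting the individual, we reconstruct the multiple logistic regression model $Y_{pi} \sim \mathrm{Bernoulli}(\mu_{pi})$ and $\mu_{pi} = 1/(1 + \exp(-\eta_{pi}))$, where $\eta_{pi} = \delta_{0p} + W_i\delta_p$, with $\delta_{0p}$ the model intercept. To do so, we use the iteratively reweighted least squares (IRLS) approach49, which iteratively updates estimates of the model coefficients according to the equation: $\hat{\delta}_p^{(t+1)} = \hat{\delta}_p^{(t)} + V_p^{(t)}W^T(Y_p - \mu_p^{(t)})$. Here, $V_p^{(t)} = (W^T\mathrm{diag}(c^{(t)})W)^{-1}$ (with $W$ including a column of ones for the intercept), where $c$ is a vector with $c_{pi} = \mu_{pi}(1 - \mu_{pi})$, and $t$ is the index of the current iteration.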
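For readers less familiar with IRLS, the update can be sketched on individual-level data with an intercept and a single predictor. This is a generic textbook illustration under that simplifying assumption, not LAVA's implementation, which works from summary-level sufficient statistics; the function name `irls_logistic` is hypothetical.

```python
import math


def irls_logistic(x, y, n_iter=25, tol=1e-10):
    """Fit logit(mu_i) = b0 + b1 * x_i by iteratively reweighted least squares."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        mu = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        c = [m * (1.0 - m) for m in mu]            # IRLS weights mu * (1 - mu)
        # Score vector: design matrix transposed times (y - mu)
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # 2 x 2 information matrix: design^T diag(c) design
        a = sum(c)
        b = sum(ci * xi for ci, xi in zip(c, x))
        d = sum(ci * xi * xi for ci, xi in zip(c, x))
        det = a * d - b * b
        # Newton step: inverse information matrix times the score
        step0 = (d * g0 - b * g1) / det
        step1 = (a * g1 - b * g0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            break
    return b0, b1


x = [-2, -2, -1, -1, 0, 0, 1, 1, 2, 2]
y = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
b0, b1 = irls_logistic(x, y)
```

At convergence the score is zero, meaning the fitted probabilities reproduce the observed sufficient statistics (the sum of `y` and the sum of `x * y`).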
For the sufficient statistic $W^TY_p$ needed for this process, we note that for the intercept $1^TY_p = N P_{case}$, with $P_{case}$ designated as the proportion of individuals in the original sample that are cases ($Y_{pi} = 1$). In addition, we have that $W^TY_p = R^TX^TY_p$ for the standardised SNP genotype matrix $X$ and $R$ the projection matrix for $W$. This sufficient statistic can therefore be computed from the individual $X_s^TY_p$ for each SNP $s$. To obtain these, we define the marginal logistic regression model $Y_{pi} \sim \mathrm{Bernoulli}(\mu_{psi})$, with $\mu_{psi} = 1/(1 + \exp(-\eta_{psi}))$ and $\eta_{psi} = \beta_{0ps} + X_{is}\beta_{ps}$, and observe that at convergence of the IRLS algorithm $X_s^T\hat{\mu}_{ps} = X_s^TY_p$, the left side of which can be obtained by filling in the marginal SNP effect estimates $\hat{\beta}_{0ps}$ and $\hat{\beta}_{ps}$. Because the intercept is unlikely to have been reported in the GWAS summary statistics and the slope may not be on the correct scale, we use a search algorithm to re-estimate these SNP effects from the GWAS Z-statistics and the case and control counts reported for $s$ (substituting the general case and control counts for the sample if not available per SNP).
From this, the $\hat{\delta}_p$ can then be estimated using the IRLS algorithm as outlined above, which has sampling covariance matrix $V_p = (W^T\mathrm{diag}(c)W)^{-1}$. Because the components in $W$ are independent and all have the same variance, in practice $V_p$ should be close to a diagonal matrix, and the standard errors for each $\hat{\delta}_{pj}$ should be very similar. To verify this, the ratio between the maximum and median standard error is computed, and if this ratio exceeds 1.75, the PC with the highest standard error is discarded, and the process is repeated until no PC has a ratio above that threshold. Subsequently, we define $\hat{\sigma}^2_p$ as the mean of the diagonal elements of this sampling covariance matrix, and assume $\hat{\delta}_p \sim \mathrm{MVN}(\delta_p, \sigma^2_p I_K)$ (our simulations show that this approach has no appreciable effect on type 1 error rates, see Suppl. Fig. 3). From this we can then also define a test for the univariate signal, similar to the F-test for continuous phenotypes, using the test statistic $T_p = \hat{\delta}_p^T\hat{\delta}_p/\hat{\sigma}^2_p$. Given the distribution for $\hat{\delta}_p$, this test statistic has a $\chi^2_K$ distribution under the null of no genetic association.
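The standard error check can be sketched as below. This is a simplified illustration: it operates on a fixed list of standard errors, whereas in the actual procedure the model would presumably be refitted after each discarded PC (an assumption on our part). The function name `filter_components` is hypothetical.

```python
from statistics import median


def filter_components(std_errors, max_ratio=1.75):
    """Discard the PC with the largest standard error until max/median <= max_ratio.

    Returns the indices of the retained PCs and the mean sampling variance
    (mean of the squared standard errors) across those PCs.
    """
    kept = list(range(len(std_errors)))
    while len(kept) > 1:
        current = [std_errors[j] for j in kept]
        if max(current) / median(current) <= max_ratio:
            break
        worst = max(kept, key=lambda j: std_errors[j])
        kept.remove(worst)
    sigma2 = sum(std_errors[j] ** 2 for j in kept) / len(kept)
    return kept, sigma2


kept, sigma2 = filter_components([1.0, 1.1, 0.9, 1.0, 3.0, 2.0])
# the two PCs with standard errors 3.0 and 2.0 are discarded in turn
```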
Estimating bivariate local genetic correlations
We define $\Omega_G$ as the $P \times P$ realised covariance matrix of the genetic components $G = X\alpha$ of any $P$ phenotypes, which is the main variable of interest for the estimation of the local genetic correlation. In practice, since we are working with the sparser joint effects of the PCs $\delta$ (rather than the $\alpha$'s), which have the same covariance as $G$ up to a scaling factor $K$ ($\Omega_G = \alpha^TS\alpha = \alpha^TR^{+T}R^+\alpha = \delta^T\delta$, and hence $\Omega_G = K\Omega_\delta$), we actually use $\Omega_\delta = \mathrm{cov}(\delta)$ instead. As all the output is standardised, however, this makes no practical difference (since $\Omega_G$ and $\Omega_\delta$ have identical correlational structure). We will use $\Omega$ to refer to $\Omega_\delta$ henceforth. For two phenotypes $p$ and $q$, this $\Omega$ can be subdivided as $\Omega = \begin{pmatrix}\omega_p^2 & \omega_{pq}\\ \omega_{pq} & \omega_q^2\end{pmatrix}$, with each diagonal element $\omega_p^2$ reflecting the (scaled) variance of the genetic component of phenotype $p$, and each off-diagonal element $\omega_{pq}$ the (scaled) covariance of the genetic components for phenotypes $p$ and $q$. We can compute the corresponding bivariate local genetic correlations from the elements of this $\Omega$ as $\rho_{pq} = \omega_{pq}/\sqrt{\omega_p^2\omega_q^2}$, with $\rho_{pq}^2$ representing the proportion of explained variance (i.e., the local $r^2$).
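For estimation of $\Omega$, we note that the $K \times P$ matrix of estimated joint effects of the principal components $\hat{\delta}$ is distributed as $\hat{\delta} \sim \mathrm{MN}(\delta, I_K, \Sigma)$ (a matrix normal distribution with independent rows), where $\Sigma$ represents the $P \times P$ sampling covariance matrix. We then use the Method of Moments25 to estimate $\Omega$ as follows: with $\delta^T\delta = K\Omega$, the expected value of $\hat{\delta}^T\hat{\delta}$ has the form $E[\hat{\delta}^T\hat{\delta}] = K(\Omega + \Sigma)$, and hence $\Omega = E[\hat{\delta}^T\hat{\delta}]/K - \Sigma$. Plugging in the sample moments for $\hat{\delta}^T\hat{\delta}$, we therefore obtain the estimator $\hat{\Omega} = \hat{\delta}^T\hat{\delta}/K - \hat{\Sigma}$.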
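As an illustration of the Method of Moments step, the sketch below computes an estimate of $\Omega$ as the cross-product of the estimated PC effects divided by $K$, minus the sampling covariance, and derives the local genetic correlation from it. The function names and the toy input are hypothetical, and the estimator form is the one assumed in this sketch rather than code taken from LAVA.

```python
import math


def estimate_omega(delta_hat, sigma):
    """Method-of-moments estimate: delta_hat^T delta_hat / K - sigma.

    delta_hat: list of K rows, each with P estimated PC effects.
    sigma:     P x P sampling covariance matrix (list of lists).
    """
    k = len(delta_hat)
    p = len(delta_hat[0])
    return [
        [sum(row[a] * row[b] for row in delta_hat) / k - sigma[a][b]
         for b in range(p)]
        for a in range(p)
    ]


def local_rg(omega, p, q):
    """Local genetic correlation between phenotypes p and q."""
    if omega[p][p] <= 0 or omega[q][q] <= 0:
        return float("nan")  # variance estimates can be negative when signal is weak
    return omega[p][q] / math.sqrt(omega[p][p] * omega[q][q])


delta_hat = [[1.0, 1.0], [1.0, -1.0], [2.0, 2.0], [0.0, 0.0]]
sigma = [[0.25, 0.0], [0.0, 0.25]]
omega = estimate_omega(delta_hat, sigma)
rho = local_rg(omega, 0, 1)
```

Because the sampling variance is subtracted, the estimated variances can be zero or negative for loci with little univariate signal, which is one reason to filter on the univariate test before estimating correlations.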
If there is no sample overlap, $\Sigma$ is defined as $\Sigma = \mathrm{diag}(\sigma^2)$, where $\sigma^2$ is a length $P$ vector with the sampling variances of each phenotype. In the presence of possible sample overlap, estimates of the sampling correlation across phenotypes must be provided by the user. These can be obtained using cross-trait LDSC13, creating a $P \times P$ covariance matrix with the intercepts for the genetic covariance for the off-diagonal elements (for the diagonal, use the intercept from a cross-trait analysis of a phenotype with itself, or its univariate LDSC intercept). LAVA then internally converts this to a correlation matrix, $C$, and computes the sampling covariance matrix as $\Sigma = \mathrm{diag}(\sigma)\,C\,\mathrm{diag}(\sigma)$, with $\sigma$ the vector of sampling standard deviations.
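A minimal sketch of this conversion is given below, assuming the sampling covariance is obtained by scaling the intercept-derived correlation matrix with the sampling standard deviations. The function name `sampling_covariance` and the numbers are purely illustrative.

```python
import math


def sampling_covariance(intercepts, sampling_var):
    """Build the sampling covariance matrix from LDSC intercepts.

    intercepts:   P x P symmetric matrix of (cross-trait) LDSC intercepts.
    sampling_var: length-P list of sampling variances.
    """
    p = len(sampling_var)
    sd = [math.sqrt(v) for v in sampling_var]
    cov = [[0.0] * p for _ in range(p)]
    for a in range(p):
        for b in range(p):
            # Convert the intercept matrix to a correlation matrix ...
            corr = intercepts[a][b] / math.sqrt(intercepts[a][a] * intercepts[b][b])
            # ... and scale by the sampling standard deviations.
            cov[a][b] = sd[a] * sd[b] * corr
    return cov


cov = sampling_covariance([[4.0, 1.0], [1.0, 1.0]], [0.04, 0.01])
```

With an identity intercept matrix (no sample overlap) this reduces to a diagonal matrix holding the sampling variances.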
Local multiple regression & partial correlations
Local conditional genetic associations between more than two phenotypes can be obtained using either multiple regression or partial correlation.
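For the multiple regression approach, consider an outcome phenotype $Y$ and set of predictor phenotypes $X$, with corresponding genetic components $G_Y$ and $G_X$. Here, we can decompose $G_Y$ as $G_Y = G_X\lambda^{(r)} + \varepsilon$ into a component that can be explained by $G_X$ and a residual component $\varepsilon$ with $\mathrm{cov}(G_X, \varepsilon) = 0$, such that $\lambda^{(r)}$ reflects the vector of unstandardised regression coefficients; the variance of $\varepsilon$ is denoted as $\tau^2_{(r)}$. Subdividing $\Omega = \begin{pmatrix}\Omega_X & \omega_{XY}\\ \omega_{XY}^T & \omega_Y^2\end{pmatrix}$, we can then compute $\lambda^{(r)} = \Omega_X^{-1}\omega_{XY}$ and $\tau^2_{(r)} = \omega_Y^2 - \omega_{XY}^T\Omega_X^{-1}\omega_{XY}$. Denoting the vector of standard deviations in $\Omega_X$ as $\omega_X = \sqrt{\mathrm{diag}(\Omega_X)}$, we can then use these to obtain the standardised regression coefficients $\lambda = \lambda^{(r)} \circ \omega_X/\omega_Y$ (with $\circ$ the elementwise product) and standardised residual variance $\tau^2 = \tau^2_{(r)}/\omega_Y^2$. The corresponding explained variance for the full model is computed as $r^2 = 1 - \tau^2$.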
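The regression parameters can be computed from the covariance blocks alone, as in the following sketch. It assumes the usual least-squares solution (coefficients obtained by solving the predictor covariance block against the predictor-outcome covariances); the function names are hypothetical and a small Gaussian elimination routine is included only to keep the example self-contained.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def local_regression(omega_x, omega_xy, var_y):
    """Standardised coefficients, standardised residual variance and r2."""
    raw = solve(omega_x, omega_xy)                          # unstandardised coefficients
    tau2_raw = var_y - sum(c * g for c, g in zip(omega_xy, raw))
    sd_x = [math.sqrt(omega_x[j][j]) for j in range(len(raw))]
    sd_y = math.sqrt(var_y)
    std = [g * s / sd_y for g, s in zip(raw, sd_x)]
    tau2 = tau2_raw / var_y
    return std, tau2, 1.0 - tau2


# Two predictors correlated at .5; the first is related to Y only through the second.
std, tau2, r2 = local_regression([[4.0, 1.0], [1.0, 1.0]], [0.5, 0.5], 1.0)
```

In this toy example the first predictor has a non-zero marginal covariance with the outcome but a conditional coefficient of zero, which is the kind of pattern the conditional model is meant to reveal.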
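The partial correlations between the genetic components of two phenotypes $X$ and $Y$, conditional on a set of other phenotypes (denoted $Z$), can be expressed using the linear equations $G_X = G_Z\lambda_X + \varepsilon_X$ and $G_Y = G_Z\lambda_Y + \varepsilon_Y$, with $\omega_{XY|Z} = \mathrm{cov}(\varepsilon_X, \varepsilon_Y)$. As with the parameters from the multiple regression, this can also be computed from the $\Omega$ directly. Given the partial covariance $\omega_{XY|Z} = \omega_{XY} - \omega_{XZ}\Omega_Z^{-1}\omega_{ZY}$ (with subscripts denoting the subset of relevant variances and covariances for $X$, $Y$, and $Z$), and the partial variance $\omega^2_{X|Z} = \omega_X^2 - \omega_{XZ}\Omega_Z^{-1}\omega_{ZX}$ (and likewise for $\omega^2_{Y|Z}$), we can simply compute the partial correlation as $\rho_{XY|Z} = \omega_{XY|Z}/\sqrt{\omega^2_{X|Z}\,\omega^2_{Y|Z}}$.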
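As an illustration, the partial correlation can also be obtained with the standard recursion over conditioning variables, which gives the same value as the matrix expression while avoiding an explicit matrix inverse. This is a sketch using that well-known alternative, not LAVA's implementation; `partial_cor` is a hypothetical name.

```python
import math


def partial_cor(omega, x, y, z):
    """Partial correlation of variables x and y given the index list z.

    omega is a covariance (or correlation) matrix as a list of lists.
    Conditioning variables are removed one at a time with the recursion
    r_xy|Zw = (r_xy|Z - r_xw|Z * r_yw|Z) / sqrt((1 - r_xw|Z^2) * (1 - r_yw|Z^2)).
    """
    if not z:
        return omega[x][y] / math.sqrt(omega[x][x] * omega[y][y])
    w, rest = z[0], z[1:]
    r_xy = partial_cor(omega, x, y, rest)
    r_xw = partial_cor(omega, x, w, rest)
    r_yw = partial_cor(omega, y, w, rest)
    return (r_xy - r_xw * r_yw) / math.sqrt((1 - r_xw ** 2) * (1 - r_yw ** 2))


# Four phenotypes that are all correlated at .5 with each other.
omega = [[1.0 if a == b else 0.5 for b in range(4)] for a in range(4)]
r_given_one = partial_cor(omega, 0, 1, [2])       # conditioning on one phenotype
r_given_two = partial_cor(omega, 0, 1, [2, 3])    # conditioning on two phenotypes
```

The recursion makes repeated calls and is only practical for small conditioning sets; for larger sets the matrix form is preferable.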
Simulation p-values and confidence intervals
Because the sampling distributions for the local genetic correlation, partial correlation, and multiple regression coefficients have no tractable closed form, we employ a simulation procedure with partial integration to obtain empirical p-values for these parameters. Below, we denote the particular statistic being tested as T, with observed value Tobs.
First, we define a pure simulation approach, observing that the sufficient statistic $\hat{\delta}^T\hat{\delta}$ has a noncentral Wishart distribution with $K$ degrees of freedom, scale matrix $\Sigma$ and non-centrality matrix $K\Omega$. For a statistic $T$, we can therefore specify the $\Omega_0$ corresponding to the null hypothesis to be tested and use that to define the non-centrality matrix. We can then generate a random sample of null $(\hat{\delta}^T\hat{\delta})^*$ matrices, and for each of those compute the corresponding $\hat{\Omega}^*$ and from there the statistic $T^*$. The sample of null $T^*$ values can then be compared to the observed statistic $T_{obs}$ to obtain an empirical p-value, defining this p-value as the proportion of simulations for which $T^*$ has a value more extreme than $T_{obs}$.
A drawback of empirical p-values is that they can require a substantial number of simulations to reach sufficient accuracy for low p-values. To deal with this, we augment the simulation procedure with a partial integration step as follows. For a single phenotype $p$, the distribution of $\hat{\delta}_p$ given $\hat{\delta}_{-p}$ (the estimates for the remaining phenotypes) is multivariate normal with parameters of known form, and consequently many of the statistics of interest will have a normal distribution given $\hat{\delta}_{-p}$ (and $\Omega_0$). We can therefore generate draws for $\hat{\delta}_{-p}^T\hat{\delta}_{-p}$ from the noncentral Wishart distribution, and for each such draw compute the parameters of the conditional distribution of the statistic, then obtain the corresponding conditional p-value for $T_{obs}$ for that draw. We then compute the final p-value as the mean of the conditional p-values across all draws.
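The pure simulation approach can be sketched for the bivariate case without sample overlap. The sketch relies on the fact that a noncentral Wishart draw can be generated as the cross-product of a normal matrix whose mean has the required cross-product; here the null mean is built with two orthogonal columns. All names and parameter values are hypothetical, the true genetic variances are treated as known, and the partial integration step described in the text is not included.

```python
import math
import random


def empirical_p(t_obs, var_p, var_q, sigma2_p, sigma2_q, k, n_sim=2000, seed=1):
    """Two-sided empirical p-value for the local covariance under a null of zero.

    var_p, var_q:       genetic variances assumed under the null (diagonal of Omega_0)
    sigma2_p, sigma2_q: sampling variances (no sample overlap, so Sigma is diagonal)
    k:                  number of principal components (must be even in this sketch)
    """
    if k % 2:
        raise ValueError("this sketch requires an even number of components")
    rng = random.Random(seed)
    # Null mean matrix with orthogonal columns, so that its cross-product is
    # K * diag(var_p, var_q): the non-centrality matrix under the null.
    mean_p = [math.sqrt(var_p)] * k
    mean_q = [math.sqrt(var_q) * (1 if j % 2 == 0 else -1) for j in range(k)]
    sd_p, sd_q = math.sqrt(sigma2_p), math.sqrt(sigma2_q)
    extreme = 0
    for _ in range(n_sim):
        cross = 0.0
        for j in range(k):
            dp = mean_p[j] + rng.gauss(0.0, sd_p)
            dq = mean_q[j] + rng.gauss(0.0, sd_q)
            cross += dp * dq
        t_star = cross / k  # simulated covariance estimate (off-diagonal of Sigma is 0)
        if abs(t_star) >= abs(t_obs):
            extreme += 1
    return extreme / n_sim


p_value = empirical_p(t_obs=0.5, var_p=1.0, var_q=1.0,
                      sigma2_p=0.1, sigma2_q=0.1, k=50)
```

With 2,000 simulations the smallest non-zero p-value this can return is 0.0005, which illustrates why a pure simulation approach becomes expensive at low p-values.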
Although the resulting p-value is still empirical and subject to simulation uncertainty, with this procedure we can obtain sufficiently reliable p-values even at very low value ranges without needing prohibitively many simulations. By default, LAVA performs 10,000 simulations to estimate the p-value. This is increased to 100,000 or 1,000,000 simulations if the p-value estimate falls below thresholds of 1e-4 and 1e-6 respectively.
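The escalation of the number of simulations can be sketched as a simple loop. The function `adaptive_p_value` and the callback `estimate_p` are hypothetical; the sketch also assumes that the second threshold is checked against the refined estimate from the 100,000-simulation run, which the text does not specify.

```python
def adaptive_p_value(estimate_p,
                     schedule=((10_000, 1e-4), (100_000, 1e-6), (1_000_000, 0.0))):
    """Re-estimate the p-value with more simulations while it stays below threshold.

    estimate_p: callable taking the number of simulations and returning a p-value.
    schedule:   pairs of (number of simulations, threshold below which to escalate).
    """
    p = None
    for n_sim, threshold in schedule:
        p = estimate_p(n_sim)
        if p >= threshold:
            break  # estimate is not low enough to warrant more simulations
    return p


calls = []

def fake_estimator(n_sim):
    """Stand-in for a real simulation routine; always reports p = 5e-5."""
    calls.append(n_sim)
    return 5e-5

p = adaptive_p_value(fake_estimator)
# calls is now [10000, 100000]: one escalation, since 5e-5 < 1e-4 but >= 1e-6
```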
For a pair of phenotypes $p$ and $q$, to test the null hypothesis of no local correlation, $H_0: \rho_{pq} = \omega_{pq} = 0$, we use the local covariance $\omega_{pq}$ as the statistic $T$ to test. For the integration step, to ensure symmetry, we use the conditional distribution of $\hat{\delta}_p$ given $\hat{\delta}_q$ for half of the simulations, and the distribution of $\hat{\delta}_q$ given $\hat{\delta}_p$ for the other half. Similarly, to test the null hypothesis of no local partial correlation given a set of phenotypes $Z$, $H_0: \rho_{pq|Z} = \omega_{pq|Z} = 0$, we use the local partial covariance $\omega_{pq|Z}$ as the statistic $T$. We use the conditional distribution of $\hat{\delta}_p$ given $\hat{\delta}_q$ and $\hat{\delta}_Z$ for half the simulations, and the conditional distribution of $\hat{\delta}_q$ given $\hat{\delta}_p$ and $\hat{\delta}_Z$ for the other half. For the regression model, with outcome phenotype $Y$ and set of predictor phenotypes $X$, to test the null hypothesis of no conditional effect for predictor phenotype $j$, $H_0: \lambda_j = 0$, we use the semi-standardised coefficient (standardised for $X$, but not $Y$) as the statistic. For the integration step, we use the conditional distribution of $\hat{\delta}_Y$ given $\hat{\delta}_X$.
Optionally, LAVA can also be requested to generate 95% confidence intervals for the local correlation, partial correlation and standardised regression coefficients, as well as for the multiple $r^2$ parameter of the multiple regression model. These are computed by generating 10,000 draws from the noncentral Wishart distribution (with non-centrality matrix $K\hat{\Omega}$), and for each statistic of interest computing the simulated statistics $T^*$ for all draws. The 2.5% and 97.5% quantiles of these $T^*$ are then used as estimates of the boundaries of the confidence interval for $T_{obs}$.
Genome partitioning
In order to partition the genome into smaller regions, we developed a method that uses the LD information between SNPs and groups them into approximately equal sized, semi-independent blocks (available for download at https://github.com/cadeleeuw/lava-partitioning).
The blocking procedure is as follows: for each chromosome, a break point metric is computed for each pair of consecutive SNPs. Starting with the whole chromosome as the initial block, the blocks are then recursively split into two smaller blocks using this metric and a minimum size requirement, continuing until the threshold for the break point metric or the minimum size is reached and the blocks cannot be divided any further.
Each pair of consecutive SNPs defines a potential break point, for which a metric is computed to determine which breakpoint is the most suitable (i.e., at which point the LD between the SNPs is the lowest). The break point metric between each SNP pair can be thought of as the strength of the LD between the SNPs on each side of the break point. For computational efficiency, we compute only the correlations near the diagonal of the entire SNP x SNP matrix, i.e., between the most proximal SNPs (in this case, we used a window of 200 SNPs). Each break point defines a triangular wedge on this thick diagonal, and the break point metric is simply the mean of the squared correlations in this wedge.
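One way to read the wedge definition is sketched below: for a break point between SNP `b` and SNP `b + 1`, the metric averages the squared correlations over all SNP pairs that straddle the break point and lie within the window. This interpretation, the function name `break_point_metric`, and the toy LD matrix are assumptions made for illustration and are not taken from the lava-partitioning program.

```python
def break_point_metric(ld, b, window):
    """Mean squared correlation over pairs (i, j) with i <= b < j and j - i <= window.

    ld:     SNP-by-SNP correlation matrix (list of lists); only entries near
            the diagonal are accessed.
    b:      index of the last SNP to the left of the break point.
    window: maximum distance (in SNPs) between the two SNPs of a pair.
    """
    n = len(ld)
    total, count = 0.0, 0
    for i in range(max(0, b - window + 1), b + 1):
        for j in range(b + 1, min(n, i + window + 1)):
            total += ld[i][j] ** 2
            count += 1
    return total / count if count else 0.0


# Two perfectly correlated pairs of SNPs with no LD between the pairs.
ld = [[1, 1, 0, 0],
      [1, 1, 0, 0],
      [0, 0, 1, 1],
      [0, 0, 1, 1]]
between_blocks = break_point_metric(ld, b=1, window=3)   # break between the pairs
within_block = break_point_metric(ld, b=0, window=3)     # break inside the first pair
```

A break point between the two LD blocks receives a metric of zero, whereas a break point inside a block receives a higher value, making it a less suitable place to split.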
When a block is split, the minimum size requirement is first used to determine the region within the block that contains the subset of potentially valid break points, and within this region, the break point with the lowest metric value is identified. If this value is above some user-defined maximum, the block will not be split any further.
A small margin was also applied to the minimum break point in a block, treating all other potential breakpoints with metrics within that margin as equivalent. The break point closest to the centre of the block was then selected to split the block, in order to encourage more even sizes of sub-blocks. Prior to applying this algorithm, SNPs with a MAF smaller than 1% were filtered out to speed up computation time. These SNPs were added back in after the blocks had been created, applying a variation of the same algorithm to further refine the boundaries between the blocks.
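The selection of a break point within a block can be sketched as follows. The function name `choose_break` and its parameters are hypothetical, and treating the margin as an additive tolerance on the lowest metric value is an assumption; the refinement step for low-MAF SNPs is not covered.

```python
def choose_break(metrics, min_size, max_metric, margin):
    """Pick the break point used to split a block, or None if it cannot be split.

    metrics[b] is the break point metric between SNP b and SNP b + 1 of the block.
    """
    n_snps = len(metrics) + 1
    # Break points that leave at least min_size SNPs on both sides.
    valid = [b for b in range(len(metrics))
             if b + 1 >= min_size and n_snps - (b + 1) >= min_size]
    if not valid:
        return None
    lowest = min(metrics[b] for b in valid)
    if lowest > max_metric:
        return None  # LD is too strong everywhere: do not split any further
    # Treat all break points within the margin of the minimum as equivalent ...
    candidates = [b for b in valid if metrics[b] <= lowest + margin]
    # ... and take the one closest to the centre to encourage even sub-blocks.
    centre = n_snps / 2
    return min(candidates, key=lambda b: abs((b + 1) - centre))


metrics = [0.0, 0.5, 0.10, 0.4, 0.11, 0.3, 0.2, 0.0, 0.0]
split_at = choose_break(metrics, min_size=3, max_metric=0.25, margin=0.02)
```

In this toy block of 10 SNPs the lowest valid metric is at index 2, but index 4 falls within the margin and lies at the centre of the block, so it is selected instead.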
For this paper, we used the default values of the program for all parameters (see program manual for details), except the minimum size requirement which was set to 2500 SNPs in order to obtain an average block size of around 1Mb.
Simulations and model evaluation
Simulations were conducted in order to validate the robustness of our models, examining the influence of heritability, block size, sample overlap, allele mismatch, and case/control ratio. To ensure an ecologically valid LD structure for our simulations, we used real genotype data from the 1,000 Genomes project (European subset), from which we simulated phenotypes under various scenarios. In order to achieve a larger sample size than the standard N = 503, we stacked the sample 40 times, and subsetted to the first 20,000 individuals. Univariate power for a given locus is fully determined by the sample size and univariate joint effect size (h2 or OR). Consequently, simulation conditions at N = 20,000 and a particular effect size are representative of conditions at higher N and lower effect size that have the same level of power; for example, for continuous phenotypes with h2 values of 1%, 5%, 10%, and 25%, approximately equivalent power is obtained at an N of 100,000 with h2 values of .2%, 1%, 2%, and 6% respectively (see Suppl. Note 3). For this reason, we opted to keep the sample size constant, and only varied the effect sizes. The original 1,000 Genomes sample (N = 503) was also used as an LD reference for the analysis of the simulated data. Note that as we did not seek to evaluate the influence of mismatching LD between the original data and the reference data, we used the same data set for simulation and analysis in order to prevent unknown violations of model assumptions from influencing the results. An extensive investigation of the effects of LD mismatch would nonetheless be worth undertaking in the future.
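The quoted equivalences can be approximately reproduced by holding the quantity N·h2/(1 − h2) constant. Whether this is the exact criterion used in Suppl. Note 3 is an assumption on our part; the sketch below (with the hypothetical function `equivalent_h2`) only shows that this rule yields values close to those given in the text.

```python
def equivalent_h2(h2, n_from, n_to):
    """h2 at sample size n_to giving the same N * h2 / (1 - h2) as h2 at n_from."""
    signal = n_from * h2 / (1.0 - h2)
    ratio = signal / n_to
    return ratio / (1.0 + ratio)


for h2 in (0.01, 0.05, 0.10, 0.25):
    h2_large_n = equivalent_h2(h2, n_from=20_000, n_to=100_000)
    print(f"h2 = {h2:.2f} at N = 20,000 ~ h2 = {h2_large_n:.4f} at N = 100,000")
```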
The simulations were based on 5 randomly selected loci. Locus size was varied by resizing these 5 loci from the centre SNP and outward until the desired size was achieved (50, 500, 1000, or 5000 SNPs). SNPs with MAF <.01 or an SD of 0 were excluded. All simulations were repeated 1,000 times per block by default, though this was increased to 10,000 for some conditions in order to evaluate type 1 error at lower significance levels.
To evaluate the type 1 error rate for the standard bivariate local rg analysis, we simulated two phenotypes with a true local genetic correlation of 0, quantifying the proportion of times where a significant local genetic correlation was detected at different significance levels (p < .05, p < .01, and p < .001). Estimation bias was assessed by simulating true local genetic correlations $\rho$ of 0 and .5, and comparing the distribution of estimated correlations to their true value.
For the multiple linear regression, we simulated two genetically correlated (at $\rho$ = .5) predictor phenotypes, X0 and X+, exhibiting true rg’s with an outcome phenotype Y of 0 and .5, respectively. Detection of a significant effect of X0 in the multivariate model was considered a false positive, and we used the estimated betas for both X0 and X+ to evaluate bias.
For the partial genetic correlations, we generated four predictor phenotypes X, Y, Z1 and Z2 simultaneously, with δ such that , and . This was accomplished by first generating unit-variance δz1 and δz2 such that then setting and with for the noise terms and ΔX ⊥ ΔY. Here, we used the p-values for to evaluate the type 1 error rate, and the estimated values for both and to evaluate bias.
The actual phenotype data was simulated as follows: first, the 1,000 Genomes24 genotype data was read into R (using the snpStats package) and standardised (after increasing N as explained above). We subsequently computed the scaled principal components W as defined previously, and for the bivariate and multiple regression simulations, we used these to create the desired Ω (see Suppl. Note 4 for more detail). With this, we then generated the true δ, with which we computed the genetic components G = Wδ. We obtained the residual variance, and drew the residuals ϵ from a multivariate normal distribution with covariance set by this residual variance and C (with C being the residual correlation matrix used to indicate the degree of sample overlap). The δ were scaled such that var(G) = 1.
For continuous phenotypes, we then generated the $N \times P$ phenotype matrix $Y$ as $Y = \beta_1G + \epsilon$, with $\beta_1 = \sqrt{h^2/(1 - h^2)}$ and $h^2$ the desired local heritability value. The $P \times 1$ vector of residuals $\epsilon_i$ for individual $i$ was drawn from a normal distribution with zero mean and covariance matrix $C$. The matrix $C$ was set to the desired residual correlation matrix for conditions simulating sample overlap, and to $I_P$ otherwise.
For binary phenotypes, the outcome $Y_{pi}$ for phenotype $p$ and individual $i$ was modelled as a Bernoulli random variable with probability $\pi_{pi}$ defined using a logistic model: $\mathrm{logit}(\pi_{pi}) = \beta_0 + \beta_1G$. As for the continuous phenotypes, the $\beta_1$ parameter was used to control the effect size, defining $\beta_1 = \log(\mathrm{OR})$ with OR the odds ratio relative to a 1 SD change in the genetic component $G$. The intercept $\beta_0$ was used to control the population prevalence under the model. Its value was determined using a simple linear search, selecting $\beta_0$ so as to obtain the desired prevalence.
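Finding the intercept that yields a target prevalence can be sketched as below. Note that this example uses bisection rather than the linear search described in the text, and approximates the model prevalence by the mean case probability over a sample of genetic component values; the function names are hypothetical.

```python
import math


def prevalence(beta0, beta1, g):
    """Mean case probability under logit(pi_i) = beta0 + beta1 * g_i."""
    return sum(1.0 / (1.0 + math.exp(-(beta0 + beta1 * gi))) for gi in g) / len(g)


def find_intercept(target, beta1, g, lo=-20.0, hi=20.0, tol=1e-10):
    """Bisection on beta0; the prevalence increases monotonically with beta0."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if prevalence(mid, beta1, g) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


g = [-1.0, 0.0, 1.0]                       # toy genetic component values
beta1 = math.log(1.5)                      # odds ratio of 1.5 per SD of G
beta0 = find_intercept(0.2, beta1, g)      # intercept giving 20% prevalence
```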
SNP Alignment
As misalignment of SNPs can cause noticeable type 1 error inflation (Suppl. Fig. 1), LAVA performs alignment of SNP effect alleles for all summary statistics prior to analysis. This is done first by removing any SNPs with strand ambiguous alleles or alleles that are not present in the reference data set; then, in the case that the reported effect allele does not correspond to that of the reference data set, the sign of the marginal SNP effect size $\hat{\beta}_{ps}$ (for phenotype $p$ and SNP $s$) is flipped.
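The alignment logic can be illustrated with a small sketch. The data layout (dictionaries keyed by SNP ID) and the function name `align_snps` are hypothetical, the first reference allele is assumed to be the counted allele, and strand flips (complementary allele coding) are not handled here.

```python
# Allele pairs that look identical on both strands and cannot be aligned safely.
AMBIGUOUS = {frozenset("AT"), frozenset("CG")}


def align_snps(sumstats, reference):
    """Align effect sizes to the reference alleles.

    sumstats:  dict mapping SNP ID -> (effect_allele, other_allele, beta)
    reference: dict mapping SNP ID -> (counted_allele, other_allele)
    Returns a dict mapping SNP ID -> beta expressed for the reference counted allele.
    """
    aligned = {}
    for snp, (effect, other, beta) in sumstats.items():
        if snp not in reference:
            continue                      # SNP absent from the reference data
        pair = frozenset((effect.upper(), other.upper()))
        ref_a1, ref_a2 = (a.upper() for a in reference[snp])
        if pair in AMBIGUOUS or pair != frozenset((ref_a1, ref_a2)):
            continue                      # strand ambiguous or alleles do not match
        # Flip the sign when the effect allele is the reference's other allele.
        aligned[snp] = beta if effect.upper() == ref_a1 else -beta
    return aligned


sumstats = {
    "rs1": ("A", "G", 0.1),   # matches the reference coding
    "rs2": ("G", "A", 0.2),   # effect allele swapped: sign is flipped
    "rs3": ("A", "T", 0.3),   # strand ambiguous: removed
    "rs4": ("A", "C", 0.4),   # alleles differ from the reference: removed
    "rs5": ("C", "T", 0.5),   # not in the reference: removed
}
reference = {"rs1": ("A", "G"), "rs2": ("A", "G"),
             "rs3": ("A", "T"), "rs4": ("A", "G")}
aligned = align_snps(sumstats, reference)
```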
GWAS summary statistics & LD reference data
The GWAS Atlas2 (https://atlas.ctglab.nl) was used to search for and access publicly available summary statistics for the 20 traits analysed here, details of which can be found in Table 1. We aimed to select a combination of health related and behavioural traits, with the intention of selecting a number of related traits from different categories (e.g., immune, cardiovascular, body composition, psychiatric) in order to facilitate the detection of local genetic correlations, while maintaining some level of phenotypic diversity. When imputation quality metrics were available, we filtered out any SNPs with an INFO score < .9.
As a reference for the estimation of LD in all of our analyses, we used the European subset of the 1,000 Genomes24 data as downloaded from https://ctg.cncr.nl/software/magma.
Global genetic correlation analysis and estimation of sample overlap using LDSC
Bivariate LD-score regression (LDSC)13 was used to evaluate the global rg’s between all trait pairs, as well as to obtain estimates of their level of sample overlap (as required for our LAVA analyses). To account for the sample overlap, we created a (symmetric) matrix based on the intercepts from the bivariate LDSC analyses (the diagonals populated by the intercepts from the analysis of each phenotype with itself). This was then converted to a correlation matrix and provided to LAVA (see ‘Methods: Estimating bivariate local genetic correlations’, for an overview of how LAVA uses this information). Summary statistics for each phenotype were munged using HapMap SNPs.
Data availability
All analyses in this study relied on publicly available summary statistics downloaded from the GWAS Atlas2 (https://atlas.ctglab.nl; original sources and Atlas-IDs are referenced in Table 1). The locus file used for all the LAVA analyses can be downloaded from https://github.com/josefin-werme/LAVA.
Code availability
The LAVA software is implemented as an R package which is publicly available at https://github.com/josefin-werme/LAVA. The method used for genome partitioning can be downloaded from https://github.com/cadeleeuw/lava-partitioning.
AUTHOR CONTRIBUTIONS
J.W, S.vd.S, D.P, & C.d.L conceived of the study. J.W and C.d.L developed the statistical framework and implemented the software. J.W performed the analyses, simulations, and wrote the manuscript. J.W, S.vd.S, D.P, & C.d.L participated in the interpretation of the results and revision of the manuscript. All authors provided meaningful contributions at each stage of the project.
CONFLICTING INTERESTS
The authors declare no competing financial interests.
ACKNOWLEDGEMENTS
This work was funded by COSYN (Comorbidity and Synapse Biology in Clinically Overlapping Psychiatric Disorders: Horizon 2020 Program of the European Union under RIA grant agreement 667301 to D.P.) and the Netherlands Organization for Scientific Research (NWO: VICI 435-14-005). The analyses were carried out on the Genetic Cluster Computer, which is financed by the Netherlands Organization for Scientific Research (NWO: 480-05-003), by the VU University (Amsterdam, The Netherlands) and the Dutch Brain Foundation, hosted by the Dutch National Computing and Networking Services SurfSARA.
Footnotes
Updated methods to add some missing subscripts and improve definitions; Updated simulation plot for partial correlations in supplements and expanded supplemental figure legends