## Abstract

Horizontal integration of summary statistics from different GWAS traits can be used to evaluate evidence for their shared genetic causality. One popular method to do this is a Bayesian method, coloc, which is attractive in requiring only GWAS summary statistics and no linkage disequilibrium estimates and is now being used routinely to perform thousands of comparisons between traits.

Here we show that while most users do not adjust default software values, misspecification of prior parameters can substantially alter posterior inference. We suggest data-driven methods to derive sensible prior values, and demonstrate how sensitivity analysis can be used to assess robustness of posterior inference.

The flexibility of coloc comes at the expense of an unrealistic assumption of a single causal variant per trait. This assumption can be relaxed by stepwise conditioning, but this requires external software and an LD matrix aligned to study alleles. We have now implemented conditioning within coloc, and propose a new alternative method, masking, that does not require LD and approximates conditioning when causal variants are independent. Importantly, masking can be used in combination with conditioning where allelically aligned LD estimates are available for only a single trait.

We have implemented these developments in a new version of coloc which we hope will enable more informed choice of priors and overcome the restriction of the single causal variant assumptions in coloc analysis.

**Author Summary** Determining whether two traits share a genetic cause can be helpful to identify mechanisms underlying genetically-influenced risk of disease or other traits. One method for doing this is “coloc”, which updates prior knowledge about the chance of two traits sharing a causal variant with observed genetic association data in a Bayesian statistical framework. To do this using only summary genetic association data that is commonly shared, the method makes certain assumptions, in particular about the number of genetic causal variants that may underlie each measured trait in a genomic region.

We walk through several data-driven approaches to summarise the prior knowledge required for this technique, and propose sensitivity analysis as a means of checking that inference is robust to uncertainty about that prior knowledge. We also show how the assumption about the number of causal variants in a region may be relaxed, and that this improves inferential accuracy.

## Introduction

As genome-wide association studies (GWAS) have considered a greater diversity of traits in greater numbers of samples, comparative analyses of GWAS results have become a useful tool to explore the aetiological connections between different traits. For example, estimates of genetic correlation obtained via LD score regression quantify the average proportion of genetic variance of two traits that is shared across the genome,^{1} although typically large sample sizes are required in both trait studies for accuracy.^{2} Linking traits through genetics overcomes at least one major challenge of observational studies, reverse causality, and with careful design, can also address confounding. Epidemiologists have developed and widely deployed the technique of Mendelian randomization (MR),^{3} which has been used, for example, to establish causal effects of factors such as alcohol intake on aspects of health.^{4} The method uses a genetic variant or variants with established effects on one trait, and assesses whether a second trait is (proportionally) associated with these instrumental variables. If certain assumptions hold,^{5} this provides evidence that the first trait is somehow causal for the second. MR has been extended to routinely assess the potential for one GWAS trait to mediate another.^{6} However, the ubiquity of genetic effects on some measurable aspect of human physiology or health, concordant with an omnigenic model,^{7} raises concerns that LD between causal variants can violate the MR assumption that the instrumental variable is only associated with the outcome through the “mediating” trait.^{8} This can be addressed through alternative approaches that focus not on whether one trait is causal for another, but whether two traits share the same causal variants in a single, LD-defined, genetic region, termed colocalisation.
One such method is built on MR: SMR/HEIDI^{9} is a two-stage approach for when genetic instruments are not known from independent data. For example, it tests first for joint association of a SNP with gene expression and a GWAS trait, then for heterogeneity in the estimated proportional effect across multiple SNPs in the region, to assess whether the causal variant(s) for the two traits colocalise or are merely in LD.

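Another popular colocalisation method, coloc,^{10} enumerates every possible configuration of causal variants for each of two traits, and calculates the support for each causal model in the form of a Bayes factor, under an assumption that at most one causal variant per trait exists in the region. Each configuration corresponds to exactly one of five mutually exclusive hypotheses about association and genetic sharing in the region: *H*_{0}, no association with either trait; *H*_{1}, association with trait 1 only; *H*_{2}, association with trait 2 only; *H*_{3}, association with both traits through two distinct causal variants; and *H*_{4}, association with both traits through a single shared causal variant.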

The coloc approach has also been extended beyond pairs of traits, although computational efficiency scales poorly with the number of traits^{11,12} unless decisions are binarised,^{13} and to deal with GWAS data that share controls, though at the expense of requiring raw genotype data.^{11}

As a Bayesian method, coloc requires specification of three informative prior probabilities, *p*_{1}, *p*_{2} and *p*_{12}, which are, respectively, the prior probabilities that any random SNP in the region is associated with trait 1 only, trait 2 only, or both traits (Figure 1). Although values for these were suggested in the initial proposal,^{12} appropriate values should depend on the specific datasets used, particularly for *p*_{12}, and no specific guidance on *how* this choice should be made was given.

One of the strengths of coloc is the simplicity of data required. The assumption of at most one causal variant per trait allows inference to be made through reconstructing joint models across all SNPs from univariate (single SNP) GWAS summary data.^{14,15} Importantly, this requires no reference LD matrix and allows combining data from traits studied in differently structured populations. Further, p-values will suffice if internal or external estimates of minor allele frequency (MAF) are available, so that (unsigned) effect estimates and their standard errors can be reconstructed. However, the single causal variant assumption is convenient rather than realistic, and when it does not hold, colocalisation effectively tests whether the *strongest* signals for the two traits colocalise,^{10} which has been shown to be conservative.^{16}

e-CAVIAR^{17} removes the assumption of a single causal variant per trait by integrating over the fine mapping posteriors for two traits, but requires signed effect estimates that are aligned to a reference LD matrix and that the traits are studied in the same population, and does not allow use of any prior knowledge that shared causal variants are more or less likely than distinct variants. Perhaps the most challenging of these is the alignment of signed effect estimates to a reference LD matrix. This can be impossible in the case that signed estimates are not provided due to privacy concerns,^{18} or that alleles are not provided. Even where alleles are available, palindromic SNPs (A/T, C/G) cannot be aligned unambiguously, particularly for MAF ≈ 0.5.

The assumption of a single causal variant in coloc may be relaxed by successively conditioning on the most significant variants for each trait, and testing for colocalisation between each pair of conditioned signals, although this requires either complete genotype data or use of external software such as CoJo^{19} together with signed and LD-aligned effect estimates to allow reconstruction of conditional regression effect estimates.

To support more accurate coloc analyses, we explored a variety of data-driven approaches to inform prior choice across a range of traits and developed a framework to explore sensitivity of conclusions to the priors used. Further, we implemented an existing conditioning approach in the coloc package, but also developed an alternative approach to conditioning which does not require aligned LD and effect estimates, to offer an option to deal with multiple causal variants which preserves the simplicity of the data required for coloc analyses.

## Results

We used Scopus to identify 60 papers which cited coloc^{10} and were published in 2018 and extracted the subset of 25 applied papers for which full text could be accessed (Supp Table 1). The studies covered a variety of trait pairs, generally integrating a disease GWAS with molecular quantitative trait loci (QTL) data,^{20–39} but also comparing pairs of disease GWAS,^{40} eQTL and pQTL^{41,42} or eQTL and other molecular traits.^{43,44} Conditioning was used to allow for multiple causal variants in only one study^{40} and 22 out of 25 studies used the software default priors across this diverse range of trait pairs.

Given that it is likely that the prior probability of colocalisation will depend on the trait pairs under consideration, we decided to evaluate the effect of mis-specifying prior parameters and/or not conditioning when multiple causal variants exist.

### The importance and elicitation of prior parameter values

Before examining the robustness of inference to changes in prior values, we elucidate some properties of prior parameters. While priors are expressed per SNP, our hypotheses and posterior relate to a region - a set of *n* neighbouring SNPs. The prior that one SNP in the region is causally associated with trait 1 is ≈ *np*_{1} (and similarly *np*_{2} for trait 2, *np*_{12} for colocalisation). All these scale with the number of SNPs - the larger the set of SNPs we consider, the greater the chance one of them is causal for any trait. Despite this, the prior odds for *H*_{4}/*H*_{1} - colocalisation compared to association with trait 1 only - remain constant at *p*_{12}/*p*_{1}.

The prior for *H*_{3} (two distinct variants for the two traits) is ≈ *n*(*n* − 1)*p*_{1}*p*_{2}, which scales with the square of *n*. This means that the prior odds of the two hypotheses of greatest interest, *H*_{4}/*H*_{3}, depend not only on the per SNP prior of causality for one or other trait, but also on the number of SNPs in a region, to the extent that the same *p*_{1}, *p*_{2}, *p*_{12} may favour either *H*_{3} or *H*_{4} as larger regions are considered (Figure 2). This effect can be understood by assuming we know that two traits have a causal variant in a region (so either *H*_{3} or *H*_{4} is true). Simple combinatorics implies that it is more likely that the same SNP associates with both traits as the number of SNPs in the region decreases.
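The dependence of the prior odds on region size can be illustrated numerically. The following is a minimal Python sketch, not part of the coloc package: the function name `hypothesis_priors` is hypothetical, it uses only the approximations quoted above (*np*_{1}, *np*_{2}, *n*(*n* − 1)*p*_{1}*p*_{2}, *np*_{12}), and it takes the default per SNP values of 10^{−4}, 10^{−4} and 10^{−5} as illustrative inputs.

```python
def hypothesis_priors(n, p1=1e-4, p2=1e-4, p12=1e-5):
    """Approximate region-level priors for H1..H4 from per-SNP priors.

    These are the first-order approximations given in the text; they are
    only sensible while the four values sum to well below 1.
    """
    return {
        "H1": n * p1,                 # one causal SNP, trait 1 only
        "H2": n * p2,                 # one causal SNP, trait 2 only
        "H3": n * (n - 1) * p1 * p2,  # two distinct causal SNPs
        "H4": n * p12,                # one shared causal SNP
    }


# Prior odds H4/H1 stay fixed, while H4/H3 shrink as the region grows.
for n in (100, 500, 2000):
    pr = hypothesis_priors(n)
    print(n, pr["H4"] / pr["H1"], pr["H4"] / pr["H3"])
```

With these inputs, the ratio *H*_{4}/*H*_{3} equals *p*_{12}/((*n* − 1)*p*_{1}*p*_{2}), so the same per SNP priors favour *H*_{4} in small regions and *H*_{3} in large ones.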

### Marginal priors

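To elicit values for *p*_{1}, *p*_{2}, we reparameterise, focusing on the possible marginal events for any SNP:

$$
\begin{aligned}
A_1 &: \text{the SNP is causally associated with trait 1}, & q_1 &= P(A_1) = p_1 + p_{12},\\
A_2 &: \text{the SNP is causally associated with trait 2}, & q_2 &= P(A_2) = p_2 + p_{12}.
\end{aligned}
$$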

Note that in this notation, *A*_{1} and *A*_{2} are not mutually exclusive, so that colocalisation is *A*_{1} ∩ *A*_{2}. *q*_{1}, *q*_{2} can be estimated empirically by considering evidence from the wealth of single trait association data that already exists. For eQTLs, we use GTEx data^{45} and find that *q*. (that is, *q*_{1} or *q*_{2}, whichever trait is considered) depends on the MAF of the SNPs considered, which reflects variable power with fewer true eQTL variants detectable at lower MAF, and on the search window around the gene considered, as previously noted, tending to 10^{−4} for common SNPs and windows of ∼1 Mb (Figure 3).

The GWAS Catalog^{46} enables us to consider something similar by aggregating over 5000 GWAS studies. We find, as expected, and again as previously noted,^{47} that the number of hits per study increases steadily with increasing sample size (Figure 3), but that the count also depends on the class of trait considered, with “harder” endpoints such as breast cancer and heel bone mineral density identifying orders of magnitude more associations compared to “weaker” endpoints such as tendency to strenuous sports or activity levels. The largest studies find ∼100–1000 hits out of ∼2 million common SNPs, leading to estimates that between 5 in 100,000 and 5 in 10,000 common SNPs are detectably causal for these traits, which corresponds to *q*. = 5 × 10^{−5} – 5 × 10^{−4}.

An alternative approach is to choose the prior according to the p-value that we would consider significant. The threshold of *p* < 5 × 10^{−8} has been widely adopted as “genome-wide significant” for GWAS studies in European populations. Across a range of designs (case/control or quantitative trait, with varying MAF and sample size), we see that a prior of *q*. = 10^{−4} gives a strong posterior probability of association (≈ 0.94).
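The link between a significance threshold and a per SNP prior can be sketched as follows. This is an illustration only: it assumes Wakefield's approximate Bayes factor, which this section does not itself specify, and the function name `posterior_association` and the `shrinkage` argument (the ratio *W*/(*V* + *W*) of prior effect variance *W* to total variance, with *V* the sampling variance of the effect estimate) are hypothetical. The value 0.9 used below is an arbitrary choice representing a reasonably well powered study, not a value taken from the analyses reported here.

```python
import math
from statistics import NormalDist


def posterior_association(p_value, prior, shrinkage):
    """Posterior probability of association for one SNP.

    p_value   : two-sided p-value for the SNP
    prior     : per-SNP prior probability of association (q in the text)
    shrinkage : W / (V + W), strictly between 0 and 1 (an assumption)
    """
    # Convert the two-sided p-value to an absolute z-score.
    z = -NormalDist().inv_cdf(p_value / 2.0)
    # Wakefield-style log Bayes factor in favour of association.
    log_abf = 0.5 * math.log(1.0 - shrinkage) + 0.5 * z * z * shrinkage
    # Combine with the prior odds and convert back to a probability.
    log_prior_odds = math.log(prior) - math.log1p(-prior)
    return 1.0 / (1.0 + math.exp(-(log_prior_odds + log_abf)))


# Posterior at the genome-wide significance threshold for several priors.
for prior in (1e-5, 1e-4, 1e-3):
    print(prior, round(posterior_association(5e-8, prior, shrinkage=0.9), 3))
```

The posterior depends on the assumed shrinkage as well as the prior, so the sketch is best read as showing the direction and rough size of the effect of the prior rather than reproducing the value quoted above.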

The default coloc marginal prior of *q*_{1} = *q*_{2} = 10^{−4}+*p*_{12} ≈ 10^{−4} is thus supported by the convergence of these three approaches to values of the order of 10^{−4}.

### Prior probability of joint or conditional causality

*q*_{1} and *q*_{2} themselves place some constraints on *p*_{12}. On the one hand, the chance of joint causality cannot be greater than the chance of causal association with either trait. On the other hand, if traits were independent, then causal variants for each trait would happen to co-occur at the same location with probability *q*_{1} × *q*_{2}. However, simulations show that the distribution of expected posterior probabilities varies considerably with *p*_{12} over this range (Figure 4), indicating that we need to make some effort to elicit plausible values. The results suggest that the coloc default of *p*_{12} = 10^{−5} may be overly liberal, with data simulated under *H*_{3} having posterior support for *H*_{4}, particularly for smaller samples, and that *p*_{12} = 5 × 10^{−6} may be a more generally robust choice.

We consider different approaches to data-driven estimation of *p*_{12}. First, we can set a lower bound if we take into account that not all of the genome is understood to be functional. Estimates of the functional proportion vary considerably, from 25%^{48} to 80%.^{49} Even for traits that are genetically independent, knowing that a SNP is causal for one trait implies it is functional, and thus more likely to be causal for another trait than a random SNP that may or may not be functional. Assuming the proportion of genetic variants that are functional is *f*, the probability of co-occurrence by chance alone is *q*_{1}*q*_{2}/*f* (see Appendix).
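The range these constraints imply can be worked through directly. The sketch below is illustrative: `p12_bounds` is a hypothetical helper that takes the lower bound *q*_{1}*q*_{2}/*f* described above and the upper bound min(*q*_{1}, *q*_{2}), which follows from joint causality being no more likely than causality for either trait.

```python
def p12_bounds(q1, q2, f):
    """Plausible range for p12 given marginal priors and functional fraction f."""
    if not 0.0 < f <= 1.0:
        raise ValueError("f must be a proportion in (0, 1]")
    lower = q1 * q2 / f   # chance co-occurrence among functional variants
    upper = min(q1, q2)   # joint causality cannot exceed either marginal
    return lower, upper


# Range implied by q1 = q2 = 1e-4 at the two quoted functional fractions.
for f in (0.25, 0.8):
    lower, upper = p12_bounds(1e-4, 1e-4, f)
    print(f, lower, upper)
```

For *q*_{1} = *q*_{2} = 10^{−4}, the range spans more than three orders of magnitude for either value of *f*, which is why further elicitation of *p*_{12} is needed.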

In the case of comparing two GWAS studies, it may be possible to estimate the genetic correlation, *r*_{g}. We show in the appendix that, when shared variants do not have any systematically different distribution of allele frequencies or effects compared to non-shared variants,
where *n*_{12}, *n*_{1}, *n*_{2} are the number of variants shared, distinct to trait 1 and distinct to trait 2.

Putting these together, we find

Second, where studies of both traits are well powered, then methods for joint analysis of trait pairs may be informative. For example, gwas-pw^{50} extends the original coloc by using empirical Bayes to estimate per-hypothesis priors via joint analysis of all regions genome-wide. However, this comes at a cost of ignoring the dependence of per-hypothesis priors on the number of SNPs in a region, and even in simulated data did not generate consistent estimates. The latter may reflect the limited information that exists in any pair of GWAS (the number of regions where detectable signals exist for both traits). Nonetheless, such an approach can probably give a useful order of magnitude estimate for *p*_{12}.

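Finally, in the absence of data about joint trait association at the genome-wide level, it is necessary to rely more on investigator judgement, and here it may be helpful to consider conditional probabilities.

$$
q_{1|2} = P(A_1 \mid A_2) = \frac{p_{12}}{p_2 + p_{12}}, \qquad
q_{2|1} = P(A_2 \mid A_1) = \frac{p_{12}}{p_1 + p_{12}}.
$$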

The term *q*_{1|2} represents the probability that a SNP, already known to be causal for trait 2, is also causal for trait 1. In asymmetric analyses such as GWAS and eQTL, it may be simpler to condition on one event rather than the other: does the investigator have a clearer idea of the chance that a SNP that causally regulates gene expression in a given tissue is causally associated with a disease, or of the chance that a SNP that is causally associated with a disease does so via transcriptional regulation in that same tissue?

To aid translation of priors between the two parameterisations discussed here, we have created an online tool “coloc explorer” at https://chr1swallace.shinyapps.io/coloc-priors.
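The translation between the two parameterisations is simple arithmetic, sketched below in Python. This is not the code behind the online tool: the function names are hypothetical, and the sketch assumes *q*_{1} = *p*_{1} + *p*_{12}, *q*_{2} = *p*_{2} + *p*_{12} and *q*_{1|2} = *p*_{12}/*q*_{2}, in line with the definitions used in this section.

```python
def to_marginal(p1, p2, p12):
    """Convert (p1, p2, p12) to marginal and conditional probabilities."""
    q1 = p1 + p12
    q2 = p2 + p12
    return {
        "q1": q1,
        "q2": q2,
        "q1_given_2": p12 / q2,  # P(causal for trait 1 | causal for trait 2)
        "q2_given_1": p12 / q1,  # P(causal for trait 2 | causal for trait 1)
    }


def from_marginal(q1, q2, q1_given_2):
    """Convert marginals plus one conditional back to (p1, p2, p12)."""
    p12 = q1_given_2 * q2
    if p12 > min(q1, q2):
        raise ValueError("implied p12 exceeds a marginal probability")
    return {"p1": q1 - p12, "p2": q2 - p12, "p12": p12}


marg = to_marginal(1e-4, 1e-4, 1e-5)
print(marg)
print(from_marginal(marg["q1"], marg["q2"], marg["q1_given_2"]))
```

Eliciting *q*_{1|2} (or *q*_{2|1}) and converting back in this way lets an investigator state a belief on the scale they find most natural while still supplying *p*_{12} to the analysis.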

### Sensitivity analysis

In the expected case that an investigator does not have a strong prior belief in a single value for *p*_{12}, we can use sensitivity analysis to consider whether conclusions are robust over a range of plausible values. Helpfully, it is not necessary to reanalyse the complete dataset multiple times. Given that

$$
P(H_i \mid D, \pi) \propto P(D \mid H_i)\, P(H_i \mid \pi),
$$

where *D* represents study data and *π* = (*p*_{1}, *p*_{2}, *p*_{12}) is the prior parameter vector used for analysis, we can derive posterior probabilities under an alternative prior parameter *π** as

$$
P(H_i \mid D, \pi^{*}) \propto P(H_i \mid D, \pi)\, \frac{P(H_i \mid \pi^{*})}{P(H_i \mid \pi)},
$$

and so we can rapidly explore sensitivity of inference to changes to *p*_{12}. Figure 5 shows an example where conclusions depend heavily on the relative prior belief in *H*_{3} and *H*_{4}, and a conclusion of colocalisation by a decision rule of *P*(*H*_{4}|*D*, *π*) > 0.5 is only valid if prior beliefs are that *H*_{4} is at least as likely as *H*_{3}. An alternative example where results are robust over a wide range of *p*_{12} is shown in Figure S1.
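The reweighting can be sketched in a few lines of Python. This is an illustration rather than the coloc implementation: it uses the fact that the posterior under a new prior is proportional to the old posterior multiplied by the ratio of new to old hypothesis priors, it approximates the hypothesis priors by *np*_{1}, *np*_{2}, *n*(*n* − 1)*p*_{1}*p*_{2} and *np*_{12} as earlier in this section, and the function names and the example posterior are hypothetical.

```python
def hypothesis_priors(n, p1, p2, p12):
    """Approximate priors for H0..H4 in a region of n SNPs."""
    h1, h2 = n * p1, n * p2
    h3 = n * (n - 1) * p1 * p2
    h4 = n * p12
    h0 = 1.0 - (h1 + h2 + h3 + h4)
    if h0 <= 0.0:
        raise ValueError("approximation invalid: priors sum to 1 or more")
    return [h0, h1, h2, h3, h4]


def reweight(posterior, n, prior_used, prior_new):
    """Posterior for H0..H4 under prior_new, given the posterior under prior_used."""
    old = hypothesis_priors(n, *prior_used)
    new = hypothesis_priors(n, *prior_new)
    # Multiply each posterior by the ratio of new to old prior, then renormalise.
    weights = [post * pn / po for post, pn, po in zip(posterior, new, old)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical posterior from an analysis run with p12 = 1e-5 in 1000 SNPs.
posterior = [0.0, 0.0, 0.0, 0.45, 0.55]
for p12 in (1e-5, 5e-6, 1e-6):
    updated = reweight(posterior, 1000, (1e-4, 1e-4, 1e-5), (1e-4, 1e-4, p12))
    print(p12, round(updated[4], 3))
```

In this made-up example the posterior for *H*_{4} passes the 0.5 decision rule only at the largest value of *p*_{12}, which is the kind of prior dependence a sensitivity analysis is intended to expose.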

### Conditioning and masking to allow for multiple causal variants

In order to deal with multiple causal variants in a region, we implemented the CoJo approach^{19} within the coloc package. We also propose an alternative to conditioning which does not depend on allelic alignment and can be used with p-values alone: masking. Stepwise regression proceeds by identifying the top SNP, and then re-estimating association statistics across all other SNPs to test whether they provide any additional information to infer the trait of interest. Conditional effect estimates at SNPs in LD with the top SNP(s) differ from their unconditional values, so that they capture the residual evidence for association, but conditional and unconditional effect estimates are (effectively) the same at SNPs independent from the top SNP(s). Our proposed masking algorithm relaxes the assumption of a single causal variant by instead assuming that if multiple causal variants exist for any individual trait, they are unlinked. It therefore first identifies lead SNPs, then successively masks all SNPs in LD with the top signal(s), testing for significant association in the remainder, and adding SNPs sequentially while residual association remains (Figure 6). When colocalising, each lead SNP is taken in turn, and any SNPs in LD with *any other* lead SNP are masked, by setting the log Bayes factor to -3 for any SNP-specific hypothesis relating to that SNP/trait pair. We have implemented both approaches in the development version of the coloc package, https://github.com/chr1swallace/coloc/tree/condmask.
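The selection of lead SNPs by successive masking can be sketched as below. This is a simplified Python illustration of the steps just described, not the code in the coloc package: the function names, the p-value threshold of 10^{−6}, the *r*^{2} threshold of 0.01 and the cap on the number of signals are all assumptions made for the example, and `r2` is assumed to be a symmetric matrix of unsigned LD estimates with 1 on the diagonal.

```python
def find_lead_snps(pvalues, r2, p_threshold=1e-6, r2_threshold=0.01, max_hits=3):
    """Return indices of lead SNPs found by successive masking."""
    available = set(range(len(pvalues)))
    leads = []
    while available and len(leads) < max_hits:
        # Strongest remaining signal among unmasked SNPs.
        best = min(available, key=lambda i: pvalues[i])
        if pvalues[best] > p_threshold:
            break  # no significant residual association
        leads.append(best)
        # Mask the new lead and every SNP in LD with it.
        available = {i for i in available
                     if i != best and r2[best][i] <= r2_threshold}
    return leads


def masked_snps(lead, leads, r2, r2_threshold=0.01):
    """SNPs to mask when analysing `lead`: those in LD with any other lead."""
    others = [other for other in leads if other != lead]
    return sorted(i for i in range(len(r2))
                  if any(r2[o][i] > r2_threshold for o in others))


# Toy region: SNPs 0 and 1 are in strong LD; SNP 3 is an independent signal.
pvalues = [1e-12, 1e-10, 0.2, 1e-8, 0.5]
r2 = [
    [1.0, 0.9, 0.0, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.3],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.3, 0.0, 1.0],
]
leads = find_lead_snps(pvalues, r2)
print(leads, [masked_snps(lead, leads, r2) for lead in leads])
```

Only unsigned *r*^{2} values enter the sketch, which reflects why masking needs neither signed effect estimates nor allelic alignment.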

We compared conditioning and masking to single coloc analysis across a variety of simulated datasets (Figures 7, 8). A single coloc comparison generally relates to the strongest signals for each of the two traits, as previously reported,^{10} which can miss colocalising signals that are secondary to a primary independent signal (Figure 7, row 3) or that have differently ordered effect sizes (Figure 8, row 5). Conditioning allows more distinct comparisons and shows a marked improvement on single coloc, in particular being able to identify a greater proportion of the truly colocalising signals. Masking increases the number of comparisons compared to single coloc, but is less informative than conditioning. In particular, the number of comparisons that cannot be clearly assigned to a specific causal variant pair (at least one lead SNP does not have *r*^{2} > 0.8 with a causal variant) increases when multiple causal variants are in LD (*r*^{2} > 0.01, Figures S2, S3), and this fraction of comparisons is often inaccurate, finding posterior support for *H*_{3} when *H*_{4} is true.

## Discussion

This paper has focused on two practical aspects of Bayesian colocalisation analysis that hitherto have not received detailed attention. The ability of Bayesian methods to incorporate prior knowledge and beliefs is a strength of the coloc approach, but also places onus on a researcher to evaluate their prior beliefs. Elicitation of informative priors is a subject that has received much attention in the statistical literature^{51} but rather less within the genetics community. Nonetheless, the use of Bayesian methods in genomics is growing in popularity, as a natural way to fit joint models to large and complex data sets and to enable integrative analysis over different traits or datasets. When data are large, and the number of events are also large, then empirical Bayes can enable an analyst to learn the prior from the same data used for testing. However, in the case of smaller studies or less common events, the wealth of existing information from other large studies as well as investigators’ own beliefs can be used.

For coloc, the choice of marginal prior parameter values can be readily informed in this way. For joint causality this is harder, and while we suggest and walk through several alternative ways of doing this, the conclusions we draw are not universally applicable; each investigator should use both available data and their own judgement to elicit their own prior beliefs and those of their co-investigators. Perhaps the most widely applicable are the results of simulations, which suggest that values of the order *p*_{12} ≈ 5 × 10^{−6} lead to robust inference over a range of scenarios, but the adoption of sensitivity analysis will help evaluate robustness of inference to changes in prior parameter values.

Attempts to colocalise disease and eQTL signals have ranged from underwhelming^{52} to positive.^{53} One key difference between outcomes is the disease-specific relevance of the cell types considered, which is consistent with variable chromatin state enrichment in different GWAS according to cell type.^{54} For example, studies considering the overlap of open chromatin and GWAS signals have convincingly shown that tissue relevance varies by up to 10-fold,^{55} with pancreatic islets of greatest relevance for traits like insulin sensitivity and immune cells for immune-mediated diseases.^{54} This suggests that *p*_{12} should depend explicitly on the specific pair of traits under consideration, including cell type in the case of eQTL or chromatin mark studies. One avenue for future exploration is whether fold change in enrichment of open chromatin/GWAS signal overlap between cell types could be used to modulate *p*_{12} and select larger values for more *a priori* relevant tissues.

The other focus of this paper is on dealing with multiple causal variants for single traits in a single region. Single coloc can be misleading when there are completely shared causal variants in the two traits, but with different effect sizes, such that colocalisation concludes there are single effects in each trait, different to each other (e.g. row 5 of Figure 8). Inference is much improved with conditioning, and we hope that by including the conditioning method within coloc we will enable more widespread use of this step. Note that if the two traits are measured in different populations, then colocalisation can still be performed, with a separate LD matrix for each. However, if the summary statistics from a single trait are the results of meta-analysis of different populations, then conditioning needs to be performed in each population separately.

One advantage of coloc has been the minimal amount of data pre-processing required. In particular, there is no need to harmonize alleles between the two datasets or to some reference dataset. However, harmonization cannot be avoided if multiple causal variants are to be dealt with via conditioning. Although masking loses accuracy in comparison to conditioning, it improves on single coloc, and importantly does not appear to lead to erroneous positive conclusions for *H*_{4} when *H*_{3} is true, although the reverse - supporting *H*_{3} for a secondary comparison when *H*_{4} is true - can occur when causal variants are themselves in LD. Therefore secondary *H*_{3} conclusions should be treated with some caution, but secondary *H*_{4} conclusions may signal true colocalisations that would have otherwise been missed. Often a researcher may be colocalising results from one dataset for which they have complete information (e.g. because it was generated in their lab) with a public disease GWAS with less information, and here we recommend the hybrid strategy of conditioning in the dataset with full information and masking in the public dataset.

While we have discussed the thought process required to consider prior parameter values, thought is also required to interpret partially colocalising signals (i.e. a convincing mixture of one colocalising and one non-colocalising variant). When the two datasets are different disease GWAS, it may be reasonable that they share only one signal, with the alternate signal operating through a different mechanism. But if there are two signals for an eQTL only one of which colocalises with a disease signal, then this should be interpreted with greater caution than complete colocalisation. It suggests that there are two ways of modifying expression of a gene but that only one of those ways is also associated with variable disease risk. This might mean that the right gene has been identified in the wrong tissue, given the overlap in eQTL signals between tissues,^{45} but it might also indicate incidental colocalisation. Similarly, lack of colocalisation may indicate only that the correct tissue or state has not been assayed. We anticipate that systematic analysis of multiple tissues and genes with a single disease may lead to a set of posterior probabilities that are jointly more amenable to interpretation than a single isolated analysis. However, colocalisation will always be limited by its basis in analysis of observational data, and experimental manipulation through CRISPR or through genotype-targeted assays will be required to establish causality.

## Materials and Methods

Code to run the simulations and analyses described below is available at https://github.com/chr1swallace/coloc-mask-paper.

### Simulations

We evaluated different prior parameter settings, sensitivity analysis, or strategies for dealing with multiple causal variants by simulation. In each case, we simulated GWAS data by sampling 2*N* haplotypes of length *M* SNPs for *N* individuals from 1000 Genomes samples (either EUR or YRI), and selected one or two causal variants at random from amongst common SNPs (MAF>5%) according to the question being addressed.

Effect estimates at each variant were sampled from the set {0.17, 0.33, 0.50, 0.67, 0.83, 1.00, 1.17, 1.33, 1.50}, sample sizes *N* from the set {100, 200, 500, 1000, 2000, 5000, 10000} and number of SNPs *M* from {250, 500, 750}. Quantitative traits with residual standard deviation 1 were then simulated according to linear models, i.e. as

$$
y = \sum_i b_i G_i + e,
$$

where *i* indexes causal variants, *b*_{i} and *G*_{i} are the effect estimate and genotype at variant *i*, and *e* ∼ *N*(0, 1).
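The trait model can be sketched in Python as follows. This is an illustration under a strong simplification, not the simulation code used here: genotypes are drawn as independent binomial allele counts at an arbitrary MAF of 0.2 instead of being sampled from 1000 Genomes haplotypes, so the sketch carries no LD structure, and the function name `simulate_trait` is hypothetical.

```python
import random

EFFECTS = [0.17, 0.33, 0.50, 0.67, 0.83, 1.00, 1.17, 1.33, 1.50]


def simulate_trait(genotypes, causal, effects, rng):
    """Simulate y = sum_i b_i * G_i + e with e ~ N(0, 1) for each individual.

    genotypes : list of per-individual lists of allele counts (0, 1 or 2)
    causal    : indices of the causal variants
    effects   : effect b_i for each causal variant, in the same order
    rng       : object with a gauss(mu, sigma) method, e.g. random.Random
    """
    return [
        sum(b * g[i] for i, b in zip(causal, effects)) + rng.gauss(0.0, 1.0)
        for g in genotypes
    ]


rng = random.Random(1)
n_individuals, n_snps, maf = 500, 250, 0.2
# Stand-in genotypes: independent SNPs, two alleles per individual per SNP.
genotypes = [
    [(rng.random() < maf) + (rng.random() < maf) for _ in range(n_snps)]
    for _ in range(n_individuals)
]
causal = rng.sample(range(n_snps), 2)           # two causal variants
effects = [rng.choice(EFFECTS) for _ in causal]
y = simulate_trait(genotypes, causal, effects, rng)
print(len(y), causal, effects)
```

Replacing the stand-in genotypes with resampled reference haplotypes would restore realistic LD, which is what makes colocalisation non-trivial in the simulations described above.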

For all analyses, we used *p*_{1} = *p*_{2} = 10^{−4} and varied *p*_{12} as described in the text.

### GTEx analysis

We used GTEx data to estimate the probability that a random SNP could be causally associated with the expression of a gene within some bp-defined window. We analysed GTEx v7 Whole Blood significant eQTLs, downloaded from https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL.tar.gz on 25 June 2019. We used masking to define independent signals within this set for each gene (*r*^{2} < 0.01) using 1000 Genomes EUR samples to estimate LD. We estimated q as the ratio of the number of significant lead eQTLs in multiples of 100 kb windows around the TSS to the number of SNPs in 1000 Genomes with SNPs grouped by MAF into 5 groups: [0, 0.1], (0.1, 0.2], (0.2, 0.3], (0.3, 0.4], (0.4, 0.5].

### GWAS catalog analysis

We used the GWAS summaries in the GWAS catalog (https://www.ebi.ac.uk/gwas/api/search/downloads/full, download date: 12 June 2019) to estimate the proportion of common SNPs that were independently associated with any given case/control or quantitative trait and examined how this varied according to reported sample size.

## Supporting information

**S1 Table. Summary of applied papers from 2018 using coloc.**

**S1 Appendix. Supporting mathematical derivations.**

**S1 Figure. Example of sensitivity analysis on a dataset which shows evidence for colocalisation at a predefined rule of posterior** *P*(*H*_{4}) > 0.5 **across a wide range of** *p*_{12}.

**S2 Figure. Average posterior probabilities for each hypothesis when trait 1 has two causal variants, and trait 2 has just one, according to whether the maximum** *r*^{2} **between multiple causal variants is** ≤ 0.01 **or** > 0.01.

**S3 Figure. Average posterior probabilities for each hypothesis when both traits have two causal variants, according to whether the maximum** *r*^{2} **between multiple causal variants is** ≤ 0.01 **or** > 0.01.

## Acknowledgements

We thank Stasia Grinberg and members of the BSU for helpful discussions during the preparation of this manuscript.

CW is supported by the Wellcome Trust (WT107881) and the MRC (MC_UU_00002/4).

The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS.

The NHGRI-EBI GWAS Catalog is funded by NHGRI Grant Number 2U41HG007823, and delivered by collaboration between the NHGRI, EMBL-EBI and NCBI.

## Footnotes

This version corrects a typo that suggested an incorrect value should be used for *p*_{12}.