Abstract
The functional mechanisms underlying disease association identified by Genome-wide Association Studies remain unknown for susceptibility loci located outside gene coding regions. In addition to the regulation of gene expression, synthesis of effects from multiple surrounding functional variants has been suggested as an explanation of hard-to-interpret associations.
Here, we define filter criteria based on linkage disequilibrium measures and allele frequencies which reflect expected properties of synthesizing variant sets. For eligible candidate sets we search for those haplotypes that are highly correlated with the risk alleles of a genome-wide associated variant.
We applied our methods to 1,000 Genomes reference data and to confirmed Crohn’s Disease and Type 2 Diabetes susceptibility loci. Of these, a proportion of 32% allowed explanation by three-variant haplotypes carrying at least two functional variants, as compared to a proportion of 16% for random variants ($P = 2.92 \cdot 10^{-6}$). More importantly, we detected examples of known loci whose association can be fully explained by surrounding missense variants: three missense variants from MUC19 synthesize rs11564258 (LOC105369736/MUC19, intron; Crohn’s Disease). Next, rs2797685 (PER3, intron; Crohn’s Disease) is synthesized by a 57 kilobase haplotype defined by five missense variants from PER3 and three missense variants from UTS2. Finally, the association of rs7178572 (HMG20A, intron; Type 2 Diabetes) can be explained by the synthesis of eight haplotypes, each carrying at least one missense variant in either PEAK1, TBC1D2B, CHRNA5 or ADAMTS7.
In summary, application of our new methods highlights the potential of synthesis analysis to guide functional follow-up investigation of findings from association studies.
1. Introduction
Genome-wide Association Studies (GWAS) [1, 2] have detected a multitude of genetic risk variants associated with complex diseases and phenotypes. For a large portion of these variants, the underlying biological mechanisms are still unknown. In public databases (GWAS Catalog, dbGene, etc.), the gene closest to the strongest association signal is provisionally listed as the susceptibility gene. However, other genes nearby might embody the true functional origin and cause the association signal via more or less complicated patterns of linkage disequilibrium (LD). As pointed out by the authors of [3, 4], there is no guarantee that causal variants are necessarily in particularly high LD with the top association signal (we will call it the tag variant throughout this work). For instance, interactions between multiple causal loci can interfere with the ability to find any of them separately, yet create a strong signal at a distantly linked marker [5].
Goldstein [6], see also [7], suggested that an association of a common variant with a complex disease can be synthetically created by multiple rarer functional variants from a surrounding genomic region. In this case the rarer variants occur more often or exclusively on a haplotype branch carrying a specific allele of the tag variant, generating in this way the strongest association signal at this locus. This situation is depicted in figure 1 of [8]. The idea can be quantified by checking whether the LD measures between tag variant and candidate variants yield $|D'| \approx 1$ and $r^2$ not large [6, 9]. Examples of such synthetic associations have been reported from GWAS [10, 11, 12, 13, 14] and sequencing studies [8]. A statistical method to test a given set of variants for synthetic association with a quantitative trait has been described by [9].
The concept of synthetic association has also given rise to some debate: some authors considered the ubiquity of synthetic associations to be unlikely [15], while others discussed whether synthetic associations should have already been detected by linkage analysis [16, 17] and whether a rare-only model for synthetic association is applicable to many GWAS findings [17, 18]. In addition, the expected properties of synthetic associations were empirically assessed via simulation studies [4, 19], with partially contradictory results. The authors of [4] stated that rarer variants that contribute to a synthetic association might be as far as 2.5 Megabases (Mb) away, whereas the authors of [19] reported that in 90% of their simulations at least one rare causal variant was already captured within a window of size 100 kilobases (kb). In any case, there is so far no complete understanding of which role synthetic associations actually play in the etiology of complex diseases, how frequently the phenomenon of synthesis really occurs, and whether it is built up by a few low-frequency and common variants or by many rare variants.
In view of the lack of consensus on the relevance of synthetic association, we started an empirical evaluation of the frequency of the phenomenon. Until now, no methods have been provided to systematically search for sets of variants that synthesize the association of a common variant. A major reason for this is the computational load. Already within a ±100 kb region surrounding a susceptibility locus, typically 2,000 variants are to be expected. With n eligible variants in an identified trait-related susceptibility region, there are $2^n - 1$ variant sets to be investigated. Even when only sets with, for instance, up to 6 variants are to be tested for potential synthesis, $8.84 \cdot 10^{16}$ different sets have to be investigated. In view of this large number, an efficient search engine is a prerequisite for the detection of synthetic associations. Identification of such variant sets is highly relevant for the follow-up of association signals that were produced by GWAS or next generation sequencing (NGS) association studies, in order to come closer to the variants with disease-relevant biological function [20].
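The size of this search space can be checked with a few lines of Python. The sketch below is only an illustration of the counting argument; the function name `count_variant_sets` is hypothetical and not part of any published software.

```python
from math import comb


def count_variant_sets(n, max_size=None):
    """Count the non-empty variant sets that can be formed from n variants.

    Without a size limit this is 2^n - 1; with a limit it is the sum of
    binomial coefficients C(n, k) for k = 1 .. max_size.
    """
    if max_size is None:
        return 2 ** n - 1
    return sum(comb(n, k) for k in range(1, max_size + 1))


# About 2,000 variants in a +/-100 kb window, sets of up to 6 variants
n_sets = count_variant_sets(2000, 6)
print(f"{n_sets:.3e}")
```

For n = 2,000 and a maximal set size of 6 the count is of the order of $10^{16}$ to $10^{17}$, in line with the figure quoted above, and it is dominated by the six-variant sets.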
This work is organized as follows: in section 2 we define filtering criteria under which potential variant sets for synthetic associations are selected, and we motivate the haplotype–tag variant correlation as a measure of synthesis. We apply our methods to data from the 1,000 Genomes Project [21] in section 3: a) frequency of the phenomenon of synthesis with broad-sense functional variants compared between random micro-array variants and known susceptibility loci of Crohn’s Disease and Type 2 Diabetes; b) identification and description of synthetic associations given by missense variants. A detailed discussion of our findings and conclusions are given in section 4.
2. Methods
We consider an LD region that is associated with a disease phenotype. The top association signal (tag variant) has been reported according to a GWAS catalog or a consensus meta-analysis. We assume that variant genotype data, either from public reference data, GWAS, imputed GWAS or NGS association studies, are available for the tag variant and a sufficiently large surrounding region. We advise including variants from a window of 2 to 5 Mb centered on the tag variant.
2.1. Filtering rules
Let $a_i$, $A_i$ be the alleles of variant $s_i$. Let $f(a_i)$ be the allele frequency of $a_i$ and let $h(a_i a_j)$ be the frequency of the two-variant haplotype. The alleles $a_i$ and $a_j$ are said to be in-phase if $D = h(a_i a_j) - f(a_i) f(a_j) > 0$. From the data we calculate the allele frequencies and the LD measures $r^2 = D^2 / (f(a_i) f(a_j) f(A_i) f(A_j))$ and $D' = D / D_{max}$, where $D_{max} = \min(f(a_i) f(A_j), f(A_i) f(a_j))$ if $D > 0$ and $D_{max} = \min(f(a_i) f(a_j), f(A_i) f(A_j))$ if $D < 0$. Note that $D'$ carries a leading sign.
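As an illustration of these definitions, the following Python sketch computes $D$, $r^2$ and the signed $D'$ for one pair of alleles. It assumes the standard Lewontin normalisation, in which $D_{max}$ for $D > 0$ is the smaller of $f(a_i) f(A_j)$ and $f(A_i) f(a_j)$, and it assumes both variants are polymorphic. The function name `ld_measures` and the example frequencies are hypothetical.

```python
def ld_measures(h_ab, f_a, f_b):
    """Return (D, r2, D_prime) for alleles a and b of two biallelic variants.

    h_ab : frequency of the haplotype carrying both a and b
    f_a  : allele frequency of a (0 < f_a < 1)
    f_b  : allele frequency of b (0 < f_b < 1)
    """
    f_A, f_B = 1.0 - f_a, 1.0 - f_b
    d = h_ab - f_a * f_b
    r2 = d * d / (f_a * f_A * f_b * f_B)
    if d > 0:
        d_max = min(f_a * f_B, f_A * f_b)
    else:
        d_max = min(f_a * f_b, f_A * f_B)
    d_prime = d / d_max  # the sign of D carries over to D'
    return d, r2, d_prime


# Hypothetical rare allele (5%) that occurs only on haplotypes carrying
# a common allele (30%): |D'| = 1 although r2 stays small.
d, r2, d_prime = ld_measures(h_ab=0.05, f_a=0.05, f_b=0.30)
print(round(d_prime, 6), round(r2, 4))
```

The example reflects the constellation that the filter criteria are meant to capture: a rarer allele sitting entirely on one haplotype branch of a more common allele.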
We employ the following allele frequency criteria for synthetic association. Let $S = \{s_1, \ldots, s_k\}$ be a set of k variants and let $s \notin S$ be the tag variant. Let $D'(a_i, a_j)$, $r^2(a_i, a_j)$ be the pairwise LD measures for two variants $s_i$, $s_j$. Let two predefined intervals, one for $|D'|$ and one for $r^2$, quantify the conditions $|D'| \approx 1$ and $r^2$ not large [6, 9], and let $t \gtrsim 1$ be a tolerance parameter.
Let a be the risk allele of s, i.e. the allele with a reported OR > 1. Let $a_i$ be the alleles of $s_i$ that are in-phase with a. S is said to be a risk candidate set if
In the same way, we can regard A as the protective allele of s, i.e. the allele with a reported OR < 1. Let now $A_i$ be the alleles of $s_i$ that are in-phase with A. S is then said to be a protective candidate set if the above criteria hold, where the second condition is replaced by
2.)
One may also view this twofold search from a different perspective: in one search we seek candidate sets that are synthesized by the minor allele of the tag variant, which means $D'(a_i, a) > 0$, and in the other we seek sets that are synthesized by the major allele of the same variant, $D'(a_i, a) < 0$.
In the context of this work, we prune variants for r2 = 1, while we do not remove variants that are marked to have known functional consequences. As parameters we choose t = 1.1 and filter for an LD space of . More details on our implementation are given in Appendix A.
2.2. Tag variant – set haplotype correlation
The goal of our approach is to find variant sets that explain a given tag variant via synthesis, in particular to find haplotypes that are in nearly perfect LD with one of the alleles of the tag variant. For a candidate set S of tag variant s we phase and reconstruct the haplotypes from the genotypes, employing the EM algorithm with maximum-likelihood estimation of haplotype frequencies according to [22, 23]. Here, we use an implementation that improves on our previous implementation in FAMHAP [24]. We evaluate the haplotypes in a binary storage representation, which is presented in detail in Appendix B. Haplotypes with a frequency below a cutoff are not considered. From the reconstructed haplotypes with their estimated frequencies we compose dichotomized haplotype markers, each consisting of one haplotype versus all others. Then we calculate the Pearson product-moment correlation coefficient between the allele x of tag variant s that is tested for being synthesized by S and each haplotype marker h of candidate set S,

$$r_{xh} = \frac{\sum_{i=1}^{M} (x_i - \bar{x})(h_i - \bar{h})}{\sqrt{\sum_{i=1}^{M} (x_i - \bar{x})^2 \, \sum_{i=1}^{M} (h_i - \bar{h})^2}},$$

where M is the number of individuals, $x_i \in \{0, 1, 2\}$ is the ith individual’s allele count of s and $h_i = h_i(a_1 \ldots a_k) \in [0; 2]$ is the frequency of the haplotype with set variant alleles $a_j$ for individual i from the maximum-likelihood estimation. $\bar{x}$ ($\bar{h}$) denotes the mean value of all $x_i$ ($h_i$). Synthesis is established if $|r_{xh}| \approx 1$. Note that $r_{xh}$ carries phase information in terms of a leading sign. In case of k = 1, $r_{xh}$ is equal to r, with $r^2$ the standard pairwise LD measure.
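The correlation between a tag allele and a dichotomized haplotype marker can be sketched in Python as follows. This is a plain Pearson correlation over per-individual values and makes no claim to mirror the authors' implementation; the function name `haplotype_tag_correlation` and the example data are hypothetical.

```python
from math import sqrt


def haplotype_tag_correlation(x, h):
    """Pearson correlation between tag allele counts and haplotype dosages.

    x : per-individual allele counts of the tag variant (0, 1 or 2)
    h : per-individual expected counts of one haplotype (values in [0, 2])
    """
    m = len(x)
    if m == 0 or m != len(h):
        raise ValueError("x and h must be non-empty and of equal length")
    mean_x = sum(x) / m
    mean_h = sum(h) / m
    cov = sum((xi - mean_x) * (hi - mean_h) for xi, hi in zip(x, h))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_h = sum((hi - mean_h) ** 2 for hi in h)
    if var_x == 0 or var_h == 0:
        raise ValueError("correlation is undefined for a constant marker")
    return cov / sqrt(var_x * var_h)


# Hypothetical data: the haplotype is carried exactly by tag allele carriers
tag_counts = [0, 1, 2, 0, 1, 0]
hap_dosage = [0.0, 1.0, 2.0, 0.0, 1.0, 0.0]
print(haplotype_tag_correlation(tag_counts, hap_dosage))
```

A value close to +1 would indicate that the haplotype marker is in phase with the tested allele, a value close to -1 that it tags the opposite allele.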
3. Data analysis and results
3.1. Frequency of the synthesis phenomenon
Nearly perfect pairwise LD ($r^2 \approx 1$) between neighboring variants is a common phenomenon. Likewise, perfect LD between a single variant and a haplotype marker defined by a set of variants is likely to exist in regions of strong LD. In order to assess the frequency of the phenomenon, we randomly selected 1,000 variant markers from the Illumina© 550K marker panel. For these variants, we systematically searched for all three-marker syntheses in a 2 Mb surrounding interval in the CEU subsample (85 individuals) of the 1,000 Genomes Project phase 1 integrated release [21] reference data (accessed March 2012). In table 1 we list the absolute numbers and proportions of syntheses for $|r_{xh}|$ cutoffs of 0.995 and 0.975. A portion of 74.4% (55.8%) of tag variant markers allowed a synthesis by three surrounding variants at an $r_{xh}$ cutoff of 0.975 (0.995). We investigated how many of these syntheses comprised “broad-sense” functional variants, i.e. variants not classified as ‘unknown’, ‘intergenic’ or ‘intronic’. 55.3% (36.9%) of variants allowed a synthesis including at least one functional variant, 15.7% (8.2%) allowed a synthesis with at least two functional variants and 5.1% (3.4%) could be explained by a synthesis built entirely from functional variants. Thus, formal synthesis, also involving functional variants, is in general a common phenomenon.
Next, we investigated the frequency of syntheses for Crohn’s Disease and Type 2 Diabetes susceptibility loci. We took 71 consensus variants for Crohn’s Disease [25] and 62 for Type 2 Diabetes [26] from the GWAS Catalog [27]. Of these variants, 70 Crohn’s variants and 60 Type 2 Diabetes variants were available in the 1,000 Genomes CEU reference data. For those loci for which our method revealed synthesizing sets, we list the respective rs-numbers, the top set and the correlation coefficient in supplemental table A. The portions of syntheses, ignoring functional annotations, for Crohn’s Disease, 74.3% (64.3%), and Type 2 Diabetes, 80.0% (61.7%), did not differ significantly from the portions observed for random variants (P > 0.05). However, a substantial increase in the portion of synthetic associations involving functional variants was observed for the susceptibility loci of both phenotypes. After adjustment for multiple comparisons, four settings showed significance: 17.1% of the Crohn’s Disease variants allowed a synthesis with three functional variants at $r_{xh} = 0.975$ as compared to a portion of 5.1% for random variants ($P = 3.5 \cdot 10^{-5}$). 37.1% (28.6%) of Crohn’s Disease variants allowed a synthesis with at least two functional variants at $r_{xh} = 0.975$ (0.995), which reflects a strongly significant increase as compared to random variants ($P = 2.01 \cdot 10^{-8}$ and $P = 4.1 \cdot 10^{-6}$). Finally, Type 2 Diabetes showed a portion of 21.7% synthesis involving at least two functional variants at $r_{xh} = 0.995$ ($P = 3.9 \cdot 10^{-4}$). The nominally highly significant cells for both disease phenotypes are highlighted in table 1.
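The text does not state which test produced these P-values. As one plausible way to compare two such proportions, the sketch below implements a one-sided Fisher's exact test in plain Python. This choice of test is an assumption, the counts are back-calculated from the reported percentages (26 of 70 Crohn’s Disease variants versus 157 of 1,000 random variants with at least two functional variants at the 0.975 cutoff), and the result is not expected to reproduce the published value exactly. The function name `fisher_exact_greater` is hypothetical.

```python
from math import comb


def fisher_exact_greater(hits_1, n_1, hits_2, n_2):
    """One-sided Fisher's exact test for an excess of hits in group 1.

    Returns P(X >= hits_1), where X is hypergeometric: the number of hits
    falling into group 1 when all hits are distributed at random over the
    n_1 + n_2 variants.
    """
    total_hits = hits_1 + hits_2
    total = n_1 + n_2
    upper = min(total_hits, n_1)
    numerator = sum(
        comb(total_hits, x) * comb(total - total_hits, n_1 - x)
        for x in range(hits_1, upper + 1)
    )
    return numerator / comb(total, n_1)


# Assumed counts, reconstructed from 37.1% of 70 and 15.7% of 1,000
p_value = fisher_exact_greater(26, 70, 157, 1000)
print(f"{p_value:.2e}")
```

Exact integer arithmetic is used for the binomial coefficients, so no numerical library is needed for tables of this size.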
In summary, table 1 shows a significant increase of syntheses involving functional variants for Crohn’s Disease and Type 2 Diabetes susceptibility loci which suggests that a portion of these syntheses potentially reflects the actual functional causes behind the respective GWAS association signal.
3.2. Examples of associated susceptibility loci for Crohn’s Disease and Type 2 Diabetes
We further restricted the syntheses discovered in section 3.1 for “narrow-sense” functional variants, i.e. ‘missense’, ‘nonsense’, ‘stop-loss’, ‘frameshift’ and ‘splice-site’ variants. We discovered several complete or nearly complete syntheses made up from variants of those annotation types. In the following we will describe three examples in detail.
First, rs11564258 is an intron variant located in LOC105369736/MUC19 on chromosome 12 which has previously been identified as a susceptibility locus for Crohn’s Disease [25, 27]. Its minor A-allele conveys an odds ratio of 1.73 [1.55;1.95] [25]. Synthesis analysis revealed twelve different three-variant functional syntheses for rs11564258 with $|r_{xh}| > 0.99$. A list of these sets can be found in supplemental table B. The synthesizing sets partially overlapped and were made up of a total of 14 different missense variants located in MUC19. Some of these variants have already been reported in [14]. In order to disentangle the complex LD pattern, we determined the joint haplotype distribution of the tag variant and the synthesizing variants. The respective 15-variant haplotypes with a total length of 138 kb can be found in table 3 and the description of the variants in table 2. The A-allele of the tag variant perfectly tags a single haplotype of frequency 0.024, which fits the previously reported [25] minor allele frequency of 0.025 for rs11564258. While all 14 variants that are involved in the synthesis are missense variants, the tagged haplotype, marked by the red A-allele of the tag variant, does not always carry the minor allele of these variants. In fact, the tagged haplotype carries the major allele at most positions, with the exception of rs1444220. Under the convention that the minor allele is the missense allele, this suggests that the missense alleles are protective alleles: all further haplotypes carry at least one of these “protective” alleles and the tag haplotype is characterized by an absence of protective alleles. We note, however, that the classification into risk and protective alleles is to a large extent a matter of terminology to describe one of the two sides of a coin. In any case, rs11564258 risk allele carriers can be fully characterized by the allele patterns present at 14 missense variants in MUC19.
As shown above, various subsets of three variants are in fact sufficient to obtain the one-to-one correspondence. All of these three-variant sets contain rs1492319 and rs1444220, while the remaining 12 missense variants are equally well suited to complete the synthesis, see supplemental table B.
Second, rs2797685 is an intron variant located in PER3 on chromosome 1 which has previously been identified as a susceptibility locus for Crohn’s Disease [25, 27]. Its minor T-allele conveys an odds ratio of 1.05 [1.01-1.10] [25]. Synthesis analysis revealed a total of 40 syntheses at $|r_{xh}| \geq 0.99$ with a cardinality between 3 and 6 variants. The synthesizing sets are given in supplemental table B; each of them consists of one variant that is not functional in the “narrow sense” and otherwise exclusively of missense variants. Information on the variants is given in table 2, and the 17-variant haplotypes of the joint distribution with a total length of 56 kb can be found in table 4 (upper panel). The tagged haplotype carries mostly major alleles, except for the minor alleles of the 4 variants rs34433622, rs7550657, rs228693 and rs2859387. Since this haplotype comprises a number of intron variants and a synonymous coding variant, we checked whether the “narrow-sense” variants are sufficient to characterize the tag variant. This 9-variant haplotype is given in table 4 (lower panel). The lines marked in yellow demonstrate a degeneracy of the 8-missense-variant haplotype, which splits into a haplotype of frequency f = 0.308 carrying the C-allele of the tag variant and a haplotype of frequency f = 0.165 carrying the T-allele. At least one additional variant, for instance the synonymous coding variant rs2859387, is necessary in order to resolve this degeneracy.
Third, rs7178572 is an intron variant located in HMG20A on chromosome 15 which has previously been identified as a susceptibility locus for Type 2 Diabetes [26, 27]. Its minor G-allele conveys an odds ratio of 1.08 [1.04-1.13] [26]. Synthesis is established by a two-variant haplotype of size 371 kb comprising the missense variant rs1867780 located in PEAK1 and rs7119 in the 3′ untranslated region of HMG20A. In total, synthesis analysis revealed 17 synthesizing sets, which are listed in supplemental table B. All sets include the aforementioned two variants and between zero and two additional missense variants inside exons of the genes TBC1D2B, CHRNA5, ADAMTS7 and RASGRF1. In table 5 we list the 1,877 kb joint nine-variant haplotypes of the tag and all synthesizing variants. The A-allele of the tag variant perfectly tags a single haplotype of frequency 0.324 consisting of the wildtype alleles of all contributing variants.
4. Discussion
We presented methods for the exhaustive search of multi-locus haplotype markers in near perfect LD with a tag variant. Such haplotype markers fulfill the formal criteria of a synthetic association [6]. Our filtering criteria, which are based on marker allele frequencies and LD measures and which we derived heuristically from typical examples of synthetic association presented in [4], effectively reduce the enormously large space of potentially synthesizing variant sets. The approach can be applied in a case-control setting as well as to public reference genotype data. Via these filtering parameters and our ordering algorithm, a quasi-exhaustive search for syntheses is made computationally feasible. Inferring those variants may yield additional insight and help to identify causal genes.
Our data analysis demonstrated that formal synthesis is a very common phenomenon in regions of linkage disequilibrium. We could further show that syntheses involving functional variants occur more frequently with known GWAS susceptibility loci (Crohn’s Disease and Type 2 Diabetes) than with random variants. A potential limitation of this finding is that it is not trivial to select an appropriate set of random variants for comparison. GWAS susceptibility loci are certainly not randomly distributed in the genome. In addition, their allele frequency spectrum will differ from the overall spectrum, if only for reasons of power in discovery studies. In any case, we believe that the detection of syntheses is of substantial relevance, whether or not it is possible to prove an enrichment of its frequency among disease-associated variants: LD, in general, is a very strong phenomenon and it can be expected that the majority of syntheses found in reference data will carry over to case-control data in the sense that the synthesis will explain the association signal of a tag variant. Verification in case-control data for the phenotype the tag variant is associated with is, of course, an important validation step. It has to be noted that by statistical means alone it is not possible to ultimately prove causality of a set of variant markers. In the presence of near perfect LD between two or more variant or haplotype markers, any of these might be causal. Still, it is important to know all syntheses of a susceptibility locus. First, these syntheses point to functional variants that might underlie the signal. Those variants are at least primary candidates for causality. Second, even if the functional variants of a perfect synthesis are not causal, it has to be acknowledged in functional studies that carriers of the tag risk allele also carry the functional variants of the synthesis, and hence are exposed to their biological consequences.
For some analyzed regions we found syntheses involving variants which are as far as 1 Mb or more away from the tag variant. In this sense, we can confirm the statement given in [4] that variants contributing to a synthesis may lie that far away from the main association peak. Of note, the functional synthesis for the diabetes susceptibility locus in HMG20A we described also comprised variants more than 1 Mb away from the tag variant.
In summary, we have shown that the inference of synthetic variants has the potential to yield additional insight into the biology underlying hard-to-interpret association signals. We found a significantly increased number of syntheses involving functional variants for previously confirmed Crohn’s Disease susceptibility loci, suggesting that a relevant portion of these reflect the true cause of the association. Furthermore, we detected intriguing examples of syntheses with missense variants for Crohn’s Disease and Type 2 Diabetes. For the future, we plan to apply our approach to all known phenotype associations listed in the GWAS Catalog [27] and to provide a database of functional syntheses of susceptibility loci. This will provide a valuable resource for the follow-up research of hits from association studies.
Our methods are implemented in the efficient software tool GetSynth, which is freely available and regularly updated at http://sourceforge.net/projects/getsynth/. The software is written in C/C++ and requires the binary genotype files defined by PLINK [29, 30] as input format. Filter criteria, set size, function classes, the number of functional variants that shall be involved in a synthesis and many further options can be pre-specified by the user.
Supplemental Data
Supplemental Data include two tables.
Acknowledgements
Web resources
GetSynth: http://sourceforge.net/projects/getsynth/
1,000 Genomes Project: http://www.1000genomes.org
PLINK2: http://www.cog-genomics.org/plink2/
GWAS CATALOG: http://www.ebi.ac.uk/gwas/
Ensembl: http://www.ensembl.org
Appendix A. Ordering algorithm
As a first step the data set is pruned for variants with $r^2 \geq c$, where c has to be chosen such that $c \leq 1$. While pruning and setting a very narrow allowed LD space in condition 1.) of section 2.1 reduces the number of variants n to be tested, a pre-definition of a maximal set size N reduces the number of considered sets from the exponential $2^n - 1$ to a sum of binomial coefficients, $\sum_{k=1}^{N} \binom{n}{k}$. The number of tests to perform is further reduced by conditions 2.) and 3.) of section 2.1. Nonetheless, the sheer number of possible sets of rare variants from sequencing may become a computational challenge if the number of provided variants becomes large.
We arrange the list of candidate variants from condition 1.) in decreasing order of $|D'(a_i, a)|$. Our algorithm successively picks one variant at a time and forms all possible sets with all formerly picked variants up to a set cardinality of N. If a set passes conditions 2.) and 3.), the test is performed. The sets are created in an order such that all subsets of a set have been considered before the set itself. Runtime can be reduced further by additional restrictions: if an annotation file with the genetic functions of the variants is provided, a minimum number of “functional” variants per set can be demanded. Supersets of already detected sets can optionally be omitted.
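The enumeration order described above can be sketched as a Python generator. This is a minimal illustration of the ordering idea only, not the GetSynth implementation: the filter conditions of section 2.1 are left out, and the function name `ordered_candidate_sets` as well as the example values are hypothetical.

```python
from itertools import combinations


def ordered_candidate_sets(variants, abs_d_prime, max_size):
    """Yield variant sets so that each set appears after all of its subsets.

    variants    : iterable of variant identifiers
    abs_d_prime : dict mapping each variant to |D'| with the tag variant
    max_size    : maximal set cardinality N
    """
    # Rank candidates by decreasing |D'| with the tag variant
    ranked = sorted(variants, key=lambda v: -abs_d_prime[v])
    picked = []
    for v in ranked:
        # Combine the new variant with 0 .. N-1 formerly picked variants,
        # smaller sets first, so subsets always precede their supersets.
        for n_partners in range(max_size):
            for partners in combinations(picked, n_partners):
                yield partners + (v,)
        picked.append(v)


# Hypothetical |D'| values for three candidate variants
d_prime = {"s1": 0.97, "s2": 1.00, "s3": 0.99}
for candidate in ordered_candidate_sets(["s1", "s2", "s3"], d_prime, max_size=2):
    print(candidate)
```

In a full search, each yielded set would be checked against the remaining filter conditions before the haplotype test is run, and sets that extend an already detected synthesis could be skipped at this point.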
Appendix B. Haplotype binary evaluation
We transform the genotypes of a set of variants per individual into binary bitsets storing partial multi-variant genotypes. As a bitset, an integer of sufficiently large word length can be used, or an alternative bitset construct provided by the employed programming language. First the variants are ordered in an arbitrary way. Individuals with missing genotypes have to be omitted from the calculation, or the missing genotypes have to be imputed beforehand. We use one bitset storing the heterozygous genotypes ‘h1’ and one bitset storing the homozygous two-mutation genotypes ‘h2’. The bitset for the two-wildtype homozygous genotypes ‘h0’ is not needed for the following consideration, but could be obtained by the binary operation NOT(h1 OR h2). An example for these genotypes on eight variants is:
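A minimal Python sketch of this encoding is given here. The eight genotypes are hypothetical and chosen only for illustration, the helper names are not taken from GetSynth, and placing the first variant in the least significant bit is an assumed convention.

```python
def encode_genotypes(genotypes):
    """Encode per-variant mutant allele counts (0, 1, 2) into bitsets h1, h2.

    Bit i of h1 is set if variant i is heterozygous, bit i of h2 if it is
    homozygous for the mutant allele. Missing genotypes are not supported.
    """
    h1 = h2 = 0
    for pos, g in enumerate(genotypes):
        if g == 1:
            h1 |= 1 << pos
        elif g == 2:
            h2 |= 1 << pos
        elif g != 0:
            raise ValueError("genotypes must be coded as 0, 1 or 2")
    return h1, h2


def wildtype_bitset(h1, h2, n_variants):
    """Recover h0 = NOT(h1 OR h2), restricted to the n_variants used bits."""
    return ~(h1 | h2) & ((1 << n_variants) - 1)


# Hypothetical individual with genotypes at eight variants
genotypes = [0, 1, 2, 0, 1, 0, 2, 1]
h1, h2 = encode_genotypes(genotypes)
h0 = wildtype_bitset(h1, h2, len(genotypes))
print(f"h1={h1:08b} h2={h2:08b} h0={h0:08b}")
```

Python integers have arbitrary length, so the same code would work for sets larger than a machine word; in C/C++ a fixed-width integer or `std::bitset` would play this role.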
The multi-variant genotype ‘h2’ trivially contributes to both resulting haplotypes in a haplotype pair, but the ‘h1’ part needs to be distributed in any combination ($2^{\text{checksum}}$ in total, where checksum is the number of ones in ‘h1’) to one of both haplotypes. In order to avoid idle cycles, one should omit the zeros in ‘h1’ by making use of a map-array. Let now type be any sub-bitset of ones from ‘h1’; then kmpl = h1 XOR type is the complement sub-bitset to type from ‘h1’. The pairs of haplotypes for an individual are simply constructed by h2 OR type and h2 OR kmpl. With all combinations of possible haplotype pairs per individual we can perform the maximum-likelihood estimation of the haplotype frequencies [23, 24]. The expectation maximization recursion formula for the frequencies is given by

$$h_n^{(t+1)} = \frac{1}{2M} \sum_{i=1}^{M} \sum_{(l,m) \in C_i} z_{lmn} \, \frac{h_l^{(t)} h_m^{(t)}}{\sum_{(l',m') \in C_i} h_{l'}^{(t)} h_{m'}^{(t)}},$$

where M is the number of individuals, $C_i$ is the set of all ordered pairs of haplotypes compatible with ‘h1’ and ‘h2’ for individual i, $h_j^{(t)}$ denotes the frequency of haplotype j after iteration t and $z_{lmn} \in \{0, 1, 2\}$ is an indicator variable equal to the number of times the haplotype n is present in the pair (l, m).
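The construction of compatible haplotype pairs and the subsequent EM estimation can be sketched in Python as follows. This is an illustration under stated assumptions rather than the GetSynth code: instead of a map-array, the sub-bitsets of ‘h1’ are enumerated with the common `(sub - 1) & h1` idiom, which likewise visits only the heterozygous positions; the EM update is the standard one for unphased genotypes; and all function names and example data are hypothetical.

```python
from collections import defaultdict


def haplotype_pairs(h1, h2):
    """List all ordered haplotype pairs compatible with bitsets h1 and h2."""
    pairs = []
    sub = h1
    while True:
        # sub plays the role of 'type', h1 ^ sub that of 'kmpl'
        pairs.append((h2 | sub, h2 | (h1 ^ sub)))
        if sub == 0:
            break
        sub = (sub - 1) & h1  # next smaller sub-bitset of h1
    return pairs


def em_haplotype_frequencies(individuals, n_iter=50):
    """Estimate haplotype frequencies from (h1, h2) genotype bitsets by EM."""
    compatible = [haplotype_pairs(h1, h2) for h1, h2 in individuals]
    haplotypes = {h for pairs in compatible for pair in pairs for h in pair}
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start
    m = len(individuals)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in compatible:
            # E-step: posterior weight of each compatible ordered pair
            weights = [freq[l] * freq[r] for l, r in pairs]
            total = sum(weights)
            for (l, r), w in zip(pairs, weights):
                counts[l] += w / total
                counts[r] += w / total
        # M-step: expected haplotype counts divided by 2M chromosomes
        freq = {h: counts[h] / (2 * m) for h in haplotypes}
    return freq


# Hypothetical data on two variants: one individual homozygous mutant at
# both, one homozygous wildtype at both, one heterozygous at both.
individuals = [(0b00, 0b11), (0b00, 0b00), (0b11, 0b00)]
freq = em_haplotype_frequencies(individuals)
for hap in sorted(freq):
    print(f"{hap:02b} {freq[hap]:.4f}")
```

In this toy example the double heterozygote is resolved almost entirely into the two haplotypes that are also observed in the homozygous individuals, which is the expected behaviour of the maximum-likelihood estimate.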