Introductory Paragraph
Linkage disequilibrium (LD) is a fundamental concept in genetics, critical for studying genetic associations and molecular evolution. However, LD measurements are only reliable for common genetic variants, leaving low-frequency variants unanalyzed. In this work, we introduce cumulative LD (cLD), a stable statistic that captures the rare-variant LD between genetic regions and opens the door to furthering biological knowledge using rare genetic variants. In application, we find that cLD reveals an increased genetic association between genes in 3D chromatin interactions, a phenomenon for which negative results were recently reported based on standard LD between common variants. Additionally, we show that cLD is higher between gene pairs reported in interaction databases, identifies unreported protein-protein interactions, and reveals interacting genes distinguishing case/control samples in association studies.
Main
Linkage Disequilibrium (LD) is a fundamental concept in population genetics that statistically captures non-random associations between two genetic variants due to reasons such as lack of recombination or different age of mutations1. LD serves as a core component in genotype-phenotype association mapping, as a statistically significant genetic variant could be just a proxy in LD with the genuine causal variant(s)2. Also, LD is critically important in analyzing the fine resolution of genotype-phenotype association mapping3 and forming polygenic risk4. Additionally, from the perspective of molecular evolution, LD values substantially higher than expected under neutrality may indicate interesting phenomena, e.g., interactions between loci that are favoured by selection5. As such, LD has been extensively utilized in evolutionary studies.
The calculation of LD uses the allele frequencies of the genetic variants in its denominator to normalize the statistic (Online Methods; Supplementary Materials 1.1) and therefore suffers from high variance (instability) when allele frequencies are close to zero. As such, in practice, researchers only analyze common genetic variants with minor allele frequency (MAF) higher than a threshold (e.g., 0.05), excluding more than 90% of human genetic variants6.
In the field of association mapping, researchers have developed multiple techniques to aggregate the associations of multiple rare variants with a phenotype into a single shared effect. One of the pioneering methods that is still popularly used7 is synthesizing a cumulative allele frequency from multiple rare genetic variants in the same genetic region (e.g., within a gene). The cumulative minor allele frequency (cMAF) is defined on a region containing multiple rare variants: an individual will be labelled as a “mutant” if it has at least one of the rare variants, and then the proportion of individuals in the sample that are labelled as mutants will be the cMAF for this region (Fig. 1a).
An example to show the calculation of cLD. a) There are two individuals [1, 4] who have mutations in region A. Therefore, the cMAF P(A) for region A is 2/6 = 0.33. b) There are three individuals [3, 4, 5] who have mutations in region B and the cMAF P(B) for region B is 3/6 = 0.50. If we consider regions A and B together, there is one individual with mutations in both regions: [4]. Thus, P(AB) is 1/6 = 0.17. Finally, by substituting P(A), P(B) and P(AB) into the standard formula of LD we have cLD = 0.375.
Building on the idea of cMAF and the essence of LD, we developed a statistic, cumulative Linkage Disequilibrium (cLD), to capture the aggregated correlation between two sets of rare variants. Specifically, for the traditional calculation of LD between two variants, g1 and g2, with minor alleles a and b respectively, the essential quantities are the individual MAFs, P(a) and P(b), and the frequency with which a and b appear on the same haplotype, P(ab). For calculating cLD between two regions, A and B, we first use cMAF to define P(A) and P(B) (the proportions of individuals carrying at least one rare variant within regions A and B, respectively), and then define P(AB) as the proportion of individuals who have at least one rare variant in each of regions A and B (Fig. 1b; Online Methods; Supplementary Materials 1.1 & 1.2).
As the cMAF of a region is at least as high as the MAF of any single variant within it, cLD’s variance (reflecting its instability) should be lower than LD’s. We verify this intuition by deriving the closed form of the variance of both LD and cLD (denoted as Var(LD) and Var(cLD)) using multinomial distributions and their multivariate normal approximation as well as the multivariate Delta Method8 (Online Methods; Supplementary Materials 2.1 & 2.2). Using allele frequencies from the 1000 Genomes Project data6 and the formula (Supplementary Materials 2.3), we see that the variance of cLD is orders of magnitude smaller (i.e., more stable) than that of the alternative, calculating LD directly on rare variants, in all ethnic populations and all cMAF bins (Fig. 2a; Supplementary Figs. 2.2a & 2.3a). Additionally, following the conventional statistical procedure of bootstrapping to empirically estimate stability, we sub-sample half of each population sample 1,000 times to form bootstrapped distributions for both cLD and LD (Online Methods; Supplementary Materials 2.4). The subsampling shows that cLD exhibits a much narrower bootstrapped distribution than LD across three ethnic groups (Fig. 2b; Supplementary Figs. 2.2b & 2.3b), further confirming the greater stability of cLD compared to traditional measures of LD.
a) The gene pairs were split into four different bins based on the cMAF values, i.e., <0.05, 0.05 - 0.1, 0.1 - 0.2, and 0.2-0.4 (y-axis). The x-axis is the ratio between the variance of LD and cLD. b) Probability density distribution of cLD and LD from bootstrapped samples. Results from the European population are shown. See Supplementary Materials Section 2 for other populations.
By aggregating information from multiple independent mutations, cLD is sensitive to subtle interactions poorly reflected by LD (which can only account for two variants at a time). Interactions within the 3D structure of genomes are one place where this difference allows cLD to provide insight where LD-based methods fail. The availability of high-throughput experimental technologies that can assess chromatin conformation, such as Hi-C9,10, allows researchers to analyze genetic regions that are in close contact in 3D spatial structure. There was a widely disseminated expectation that 3D genomic interaction in the form of chromatin contact may leave a footprint in the form of genetic LD11. Motivated by this expectation, Whalen and Pollard calculated the standard LD based on common variants (MAF > 0.05) in 1000 Genomes Project data6 and reported negative results, stating that the genetic LD map does not overlap with the 3D contact map12. However, by reanalyzing the 1000 Genomes sequencing data and Hi-C data9,10 in the developing brain using cLD on rare variants (Online Methods; Supplementary Materials 3.1 & 3.2), we revealed that 3D chromatin interactions did leave genetic footprints in the form of higher cLD in pairs of genes that are in adjacent Hi-C regions (Fig. 3a; Supplementary Fig. 3.1). To assess the statistical significance of the enrichment of cLD in 3D contact, we conducted Mantel-Haenszel and Fisher exact tests (Supplementary Materials 3.4), both of which are highly significant (P-value < 1.0E-50; Supplementary Table 3.2; Supplementary Materials 3.4.1). As Whalen & Pollard’s work12 is not at the resolution of genes, we re-calculated standard LD using common variants based on gene pairs, which shows a subtle effect (Fig. 3b; Supplementary Fig. 3.2) that is still not statistically significant (P-value = 0.999; Supplementary Table 3.3; Supplementary Materials 3.4.1).
Additionally, we checked the ratio between the number of gene pairs within 3D contact regions and the number outside them as a function of cLD value (Supplementary Materials 3.5) and found that the ratio is large and increases as the cLD cut-off increases (Fig. 3c-e; Supplementary Table 3.7). Taken together, 3D interactions clearly overlap with genetic interactions, and cLD outperforms LD in detecting this.
Enrichment of cLD among pairs of genes in chromatin contact regions. a) The comparisons of cLD values between the 3D chromatin interaction regions and non-interaction regions among 13 different distance groups in the European population. b) The same comparisons using LD values. c-e) The enrichment of the ratios between the number of gene pairs in 3D chromatin interaction regions and the number of gene pairs that are not in 3D regions. The x-axis shows the cLD value cutoffs. c) European population. d) African population. e) East Asian population.
To demonstrate that gene-gene interactions leave footprints in rare genetic mutations regardless of their physical position, we computed the distribution of cLD among interacting gene pairs reported in Reactome13, BioGRID14, MINT15 and IntAct16 (Online Methods; Supplementary Materials 3.3). We compared this distribution against a null distribution formed by all pairs of genes. Indeed, the comparisons led to the expected result: for gene pairs separated by any physical distance within 2 Mb, cLD is elevated in interacting gene pairs (Fig. 4; Supplementary Fig. 3.3). Again, the Mantel-Haenszel and Fisher exact tests confirm that the differences are significant (P-value < 1.0E-20; Supplementary Table 3.5; Supplementary Materials 3.4.2).
The comparisons of cLD values in European populations between gene pairs found in interaction databases and all pairs that are not in databases. Each bar represents the average over pairs whose distance is smaller than the value of its x-axis label but larger than the value of the previous x-axis label. Other populations show the same trend (Supplementary Materials).
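The Mantel-Haenszel test mentioned above pools evidence across strata (here, distance groups) so that physical distance does not confound the comparison. The following is a minimal sketch of the standard Mantel-Haenszel pooled odds ratio and chi-square statistic (without continuity correction), not the exact procedure of Supplementary Materials 3.4. The layout of each 2x2 table (interacting versus non-interacting pairs, high versus low cLD) and the toy counts are assumptions made for illustration.

```python
import math
from typing import Iterable, Tuple

Table = Tuple[int, int, int, int]  # (a, b, c, d) for one distance stratum


def mantel_haenszel(tables: Iterable[Table]) -> Tuple[float, float, float]:
    """Return (pooled odds ratio, chi-square statistic, p-value) across strata.

    Hypothetical layout of each 2x2 table (an assumption, not the paper's):
        a = interacting pairs with high cLD
        b = interacting pairs with low cLD
        c = non-interacting pairs with high cLD
        d = non-interacting pairs with low cLD
    """
    or_num = or_den = 0.0
    observed = expected = variance = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        or_num += a * d / n
        or_den += b * c / n
        observed += a
        expected += (a + b) * (a + c) / n
        variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    odds_ratio = or_num / or_den if or_den > 0 else float("inf")
    chi2 = (observed - expected) ** 2 / variance if variance > 0 else 0.0
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value


# Toy counts for two distance strata (illustrative only).
strata = [(10, 5, 5, 10), (10, 5, 5, 10)]
odds_ratio, chi2, p_value = mantel_haenszel(strata)
```

A pooled odds ratio above 1 in this layout would indicate that high cLD is more common among interacting pairs after stratifying by distance.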
cLD is also effective at identifying novel pairs of likely interacting proteins. Looking at all pairs of genes, we observed several pairs without prior evidence of interaction that have extraordinarily high cLD, such as 3BCZ and 4RIQ (encoded by the genes MEMO1 and DPY30, respectively), with a cLD of 0.86. We conducted protein docking analysis for the 19 pairs of genes with large cLD values (top 0.01% among all gene pairs), cMAF > 0.05 and existing IDs in PDB that are not reported in any of the interaction databases (Online Methods; Supplementary Materials 4.1; Supplementary Table 4.1). We found multiple lines of evidence for the interaction of five pairs (Supplementary Table 4.2) in terms of both binding affinity and interacting residues (Fig. 5a-d; Supplementary Figs. 4.1-4.4).
Protein docking interaction between 3BCZ and 4RIQ revealed by cLD (= 0.86) with a binding affinity of −341.21 kJ/mol. a) Structure of the 3BCZ (red) and 4RIQ (blue) protein-protein complex. b-d) 2D representation of the closest interacting residues around the protein-protein interaction interfaces, including multiple non-covalent bonds, for example, hydrogen bonds (green dotted lines) and hydrophobic interactions (red and rose semicircles with spikes). Residues of 3BCZ are labelled in uppercase letters (T, U, O, R, N) and residues of 4RIQ in lowercase letters.
In the context of case/control association studies, cLD can be used to identify pairs of genes whose interactions may be responsible for human disease. Using data from the Autism Spectrum Disorders (ASD) whole exome sequencing dataset17, we calculated cLD scores for all pairs of genes, with separate scores for the populations of cases and controls (Online Methods; Supplementary Materials 5.1 & 5.2). The difference in cLD for a pair of genes between cases and controls (ΔcLD) is indicative of an interaction that is non-randomly associated with disease status. Using a hypergeometric test, we analyzed the enrichment among high-ΔcLD genes for ASD susceptibility genes reported by DisGeNet18, an established general database for diseases, and SFARI19, a gold-standard database focusing on ASD (Supplementary Materials 5.3). The genes included in the pairs with the highest ΔcLD scores are highly enriched in the autism-related genes in both DisGeNet (Fig. 6a) and SFARI (Fig. 6b). Gene Ontology20 and pathway (KEGG)21,22 enrichment analyses for the high-ΔcLD genes (Online Methods; Supplementary Materials 5.4) also showed sensible biological functions and pathways (Fig. 6c,d) that are well supported by the literature (Supplementary Materials 5.4)20–35. Taking a closer look at the 20 genes in the top 10 gene pairs with the highest ΔcLD values, we found that 14 genes (70%) have been reported to be associated with ASD, namely DENND4A, EFCAB5, ABI2, RAPH1, MSTO1, DAP3, ARL13B, PRB2, PRB1, ZNF276, FANCA, ADAM7, SLC26A1 and TUBB8 (Supplementary Table 5.1). Moreover, among the remaining six genes, we also identified indirect links with ASD for two, RAB11A and IDUA (Supplementary Materials 5.3).
a-b) Grouped bar charts show the ratio of the number of selected genes validated in the database to the number of genes in the database (q/m) and the ratio of the number of selected genes to the total number of genes in the population minus m (k/n). The values on top of each bar are the p-values of the hypergeometric distribution probability test. The x-axis indicates the number of top gene pairs, from the top 200 to the top 2,000. a) DisGeNet database. b) SFARI database. c) A dot plot showing the top 10 KEGG pathways ranked by the GeneRatio values. The size of the dots indicates the number of genes enriched and the color indicates the level of enrichment (adjusted P-values). The GeneRatio is calculated as count/setSize, where ‘count’ is the number of genes that belong to a given gene set and ‘setSize’ is the total number of genes in the gene set. d) A bar plot showing the top 10 enriched biological processes ranked by p-values. The correlation is more significant as the red/blue ratio increases. The number on the x-axis indicates the number of genes that belong to a given gene set.
LD is a broadly applicable concept relevant to many types of genetic analyses. cLD allows us to expand on this concept and capture additional information from the distributions of many variants segregating in a population at low frequencies within particular regions of a genome. In contrast to previous attempts to utilize LD between multiple variants, which focus on dominant haplotypes36 or joint distributions37, cLD emphasizes biological interactions. With its demonstrated power in identifying gene and protein interactions, cLD might offer an essential tool to analyze biological interactions and their evolution.
Author contributions
Conceived the study: QZ. Supervised the study: JW and QZ. Analyzed real data: DW, JH, DP, PK, QL. Conducted mathematical derivation and statistical simulations: DW, WZ, JW. Provided comments: CC, AP. Wrote the paper: DW and QZ with major input from JH, DP, AP, and minor input from all authors.
Online Methods
Definition of LD and cLD
The calculation of LD between two bi-allelic loci relies on the estimates of three key quantities: PA, the frequency of an allele at locus A; PB, the frequency of an allele at locus B; and PAB, the frequency with which these two alleles of A and B appear together. One can then define the unnormalized disequilibrium statistic D = PAB - PAPB. To rescale the statistic based on allele frequency, one can normalize D by the allele frequency variances, which gives the r2 statistic:
$$r^2 = \frac{D^2}{P_A (1 - P_A)\, P_B (1 - P_B)}$$
An alternative definition is D’ (which is not the focus of this paper). As LD involves PA and PB in the denominator, it is highly unstable when PA or PB is very close to zero, which means LD cannot be used reliably if A or B is a rare variant.
The cLD statistic is designed to handle the above problem by aggregating rare variants cumulatively. More specifically, we look at two sets of variants in two genetic regions, e.g., two genes, again named A and B. Assume that there are m SNPs in gene A and r SNPs in gene B, and that the sample size is n. For gene A, we use {S1i, S2i, …, Smi} to denote the alleles in the i-th individual (i = 1, 2, …, n), where Ssi is the allele of the s-th (s = 1, 2, …, m) SNP. Similarly, for gene B, we use {K1i, K2i, …, Kri}, where Kki is the allele of the k-th (k = 1, 2, …, r) SNP in the i-th individual. Note that each Ssi and Kki is either 0 or 1 (0 denotes a major allele, whereas 1 denotes a minor allele).
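Then we have the cMAFs (PA and PB) and the joint frequency PAB defined below, where 1(·) is the indicator function:
$$P_A = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\!\left(\sum_{s=1}^{m} S_{si} \ge 1\right), \qquad P_B = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\!\left(\sum_{k=1}^{r} K_{ki} \ge 1\right), \qquad P_{AB} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\!\left(\sum_{s=1}^{m} S_{si} \ge 1\right)\mathbb{1}\!\left(\sum_{k=1}^{r} K_{ki} \ge 1\right)$$
Based on these values, we can calculate the r2 version of cLD:
$$\mathrm{cLD}_{r^2} = \frac{(P_{AB} - P_A P_B)^2}{P_A (1 - P_A)\, P_B (1 - P_B)}$$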
The more rigorous mathematical descriptions and the definition of D’ version are provided in Supplementary Materials.
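The definition above can be illustrated with a minimal sketch, assuming each region is stored as a list of SNP rows, each row holding the 0/1 alleles of the n haplotypes. The function names `carrier_flags` and `cld_r2` and the toy data are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence


def carrier_flags(region: Sequence[Sequence[int]]) -> List[int]:
    """Collapse a region to one 0/1 flag per haplotype.

    region[s][i] is the allele (0 = major, 1 = minor) of SNP s in haplotype i.
    A haplotype is flagged 1 if it carries at least one minor allele.
    """
    n = len(region[0])
    return [int(any(snp[i] == 1 for snp in region)) for i in range(n)]


def cld_r2(region_a: Sequence[Sequence[int]], region_b: Sequence[Sequence[int]]) -> float:
    """r^2 version of cLD between two regions (NaN if a cMAF is 0 or 1)."""
    a = carrier_flags(region_a)
    b = carrier_flags(region_b)
    if len(a) != len(b):
        raise ValueError("both regions must cover the same haplotypes")
    n = len(a)
    p_a = sum(a) / n                                  # cMAF of region A
    p_b = sum(b) / n                                  # cMAF of region B
    p_ab = sum(x & y for x, y in zip(a, b)) / n       # carriers in both regions
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")
    d = p_ab - p_a * p_b
    return d * d / denom


# Toy data (hypothetical): four haplotypes, two rare SNPs in A, one SNP in B.
region_a = [[1, 0, 0, 0], [0, 1, 0, 0]]
region_b_linked = [[1, 1, 0, 0]]
region_b_unlinked = [[1, 0, 1, 0]]
linked = cld_r2(region_a, region_b_linked)
unlinked = cld_r2(region_a, region_b_unlinked)
```

In the toy data, neither SNP in region A is individually correlated perfectly with the SNP in region B, but the aggregated carrier flags are, which is the situation cLD is designed to capture.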
Derivation of variances
To obtain the variance of cLD and LD, we derived their asymptotic distributions. The details are in Supplementary Materials. The gist of our approach is summarized in the following three steps:
First, using multinomial random variables, we rewrite the formulas of cLD and LD in terms of counts. We use Xijk to denote the allele of the k-th variant of the j-th gene for the i-th individual (haplotype). For a pair of variants, the i-th pair (Xi1u, Xi2v) (i = 1,…,n) can take the values (1,1), (0,1), (1,0) and (0,0). If we use O1 to O4 to denote the counts of these four possible pairs, then O = (O1, O2, O3, O4) follows O ∼ multinom(n; p), where p = (p1, p2, p3, p4) represents the population probabilities. The LD between the u-th and v-th variants can be rewritten as
$$\mathrm{LD}(O/n) = \frac{(O_1 O_4 - O_2 O_3)^2}{(O_1 + O_2)(O_3 + O_4)(O_1 + O_3)(O_2 + O_4)}$$
Similarly, we use multinomial random variables to describe cLD as follows.
In analogy to the case of LD, we use Xij to denote the aggregated allele of the j-th gene for the i-th individual (haplotype). For a pair of genes, the i-th pair (Xi1, Xi2) (i = 1,…,n) can take the values (1,1), (0,1), (1,0) and (0,0). Using M1 to M4 to denote the counts of these four possible pairs, M = (M1, M2, M3, M4) follows M ∼ multinom(n; q), where q = (q1, q2, q3, q4) represents the population probabilities. The cLD between a pair of genes can be rewritten as
$$\mathrm{cLD}(M/n) = \frac{(M_1 M_4 - M_2 M_3)^2}{(M_1 + M_2)(M_3 + M_4)(M_1 + M_3)(M_2 + M_4)}$$
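Second, we use the central limit theorem (CLT) to derive the asymptotic multivariate normal distribution. In the LD case, with the population mean p = (p1, p2, p3, p4), we can write the covariance matrix as
$$\Sigma_p = \mathrm{diag}(p) - p\,p^{\top} = \begin{pmatrix} p_1(1-p_1) & -p_1 p_2 & -p_1 p_3 & -p_1 p_4 \\ -p_1 p_2 & p_2(1-p_2) & -p_2 p_3 & -p_2 p_4 \\ -p_1 p_3 & -p_2 p_3 & p_3(1-p_3) & -p_3 p_4 \\ -p_1 p_4 & -p_2 p_4 & -p_3 p_4 & p_4(1-p_4) \end{pmatrix}$$
Then by the multivariate CLT8 we have
$$\sqrt{n}\left(\frac{O}{n} - p\right) \xrightarrow{d} N(0, \Sigma_p).$$
In the cLD case, with the population mean q = (q1, q2, q3, q4), we can write the covariance matrix as
$$\Sigma_q = \mathrm{diag}(q) - q\,q^{\top}$$
Then by the multivariate CLT8 we have
$$\sqrt{n}\left(\frac{M}{n} - q\right) \xrightarrow{d} N(0, \Sigma_q).$$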
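Third, as cLD and LD are functions of random variables, we apply the multivariate Delta method8 to derive the distributions of cLD and LD. In the LD case, let ∇LD(p) denote the Jacobian (gradient) of LD evaluated at p, and let Σp = diag(p) − ppT be the covariance matrix of a single multinomial draw. Then the asymptotic distribution of LD(O/n) is
$$\mathrm{LD}(O/n) \sim \mathrm{AN}\!\left(\mathrm{LD}(p),\ \frac{1}{n}\,\nabla \mathrm{LD}(p)^{\top}\,\Sigma_p\,\nabla \mathrm{LD}(p)\right),$$
where ‘AN’ stands for asymptotic normal.
In the cLD case, let ∇cLD(q) denote the Jacobian (gradient) of cLD evaluated at q, and let Σq = diag(q) − qqT. Then the asymptotic distribution of cLD(M/n) is
$$\mathrm{cLD}(M/n) \sim \mathrm{AN}\!\left(\mathrm{cLD}(q),\ \frac{1}{n}\,\nabla \mathrm{cLD}(q)^{\top}\,\Sigma_q\,\nabla \mathrm{cLD}(q)\right).$$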
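The three steps above can be sketched numerically. This is a minimal illustration under stated assumptions, not the closed-form derivation in the Supplementary Materials: the gradient is approximated by central finite differences, the four cells are ordered (1,1), (0,1), (1,0), (0,0), and the function names and example probabilities are hypothetical. The same function applies to LD (cell probabilities p) and cLD (cell probabilities q), since both share the r2 form.

```python
from typing import Callable, List, Sequence


def r2_from_cells(p: Sequence[float]) -> float:
    """r^2 from four cell probabilities ordered (1,1), (0,1), (1,0), (0,0)."""
    p1, p2, p3, p4 = p
    denom = (p1 + p2) * (p3 + p4) * (p1 + p3) * (p2 + p4)
    return (p1 * p4 - p2 * p3) ** 2 / denom


def gradient(f: Callable[[Sequence[float]], float],
             p: Sequence[float], h: float = 1e-6) -> List[float]:
    """Central finite-difference approximation of the gradient of f at p."""
    grad = []
    for i in range(len(p)):
        up = list(p)
        down = list(p)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def delta_variance(f: Callable[[Sequence[float]], float],
                   p: Sequence[float], n: int) -> float:
    """First-order Delta-method variance: (1/n) * g^T (diag(p) - p p^T) g."""
    g = gradient(f, p)
    total = 0.0
    for i in range(len(p)):
        for j in range(len(p)):
            sigma_ij = (p[i] if i == j else 0.0) - p[i] * p[j]
            total += g[i] * sigma_ij * g[j]
    return total / n


# Hypothetical cell probabilities for a pair of low-frequency variants.
cells = (0.05, 0.05, 0.05, 0.85)
var_n100 = delta_variance(r2_from_cells, cells, n=100)
var_n200 = delta_variance(r2_from_cells, cells, n=200)
```

Because the approximation is first order, it degenerates to zero when the two loci are exactly independent (the gradient of r2 vanishes there), so it is most informative away from that point.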
Assessing the instability of LD and cLD using bootstrapped distributions
We randomly sampled half of the haplotypes in each population (EUR, AFR, or EAS), calculated the average cLD and average LD over the gene pairs within a cMAF group, and repeated this procedure 1,000 times. Based on these bootstrapped cLDs and LDs, we formed the bootstrapped distributions of cLD and LD (with appropriate re-scaling; Supplementary Materials). More specifically, we randomly sampled 1,000 genes from the 1000 Genomes Project data and assessed their pairwise LD and cLD in stratified cMAF bins (Supplementary Materials) using half of the haplotypes in the given population (AFR, EAS or EUR). These randomly drawn subsamples (each with half of the individuals in the original population) form the bootstrapped samples. We define the LD of a gene pair as the average value of LD(O/n) over all rare SNV pairs within that gene pair. In each iteration, we calculated the average cLD and average LD over the gene pairs in each group (Supplementary Materials). Then, using these averages, we plotted the bootstrapped densities in Fig. 2b for visualization.
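The resampling loop described above can be sketched as follows for a single pair of regions. This is a simplified illustration: it assumes that half of the haplotypes are drawn without replacement, works on pre-computed 0/1 carrier flags, and omits the averaging over gene pairs within a cMAF bin and the re-scaling described in the Supplementary Materials. The function names and seed are hypothetical.

```python
import random
from typing import List, Sequence


def cld_from_flags(a: Sequence[int], b: Sequence[int]) -> float:
    """r^2 version of cLD from per-haplotype carrier flags (NaN if undefined)."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x & y for x, y in zip(a, b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")
    return (p_ab - p_a * p_b) ** 2 / denom


def half_subsample_cld(a: Sequence[int], b: Sequence[int],
                       n_iter: int = 1000, seed: int = 0) -> List[float]:
    """cLD recomputed on n_iter random halves of the haplotypes."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(a)
    values = []
    for _ in range(n_iter):
        idx = rng.sample(range(n), n // 2)  # half of the haplotypes, no replacement
        values.append(cld_from_flags([a[i] for i in idx], [b[i] for i in idx]))
    return values


# Toy flags (hypothetical): two regions whose carriers coincide exactly.
flags = [1] * 10 + [0] * 10
resampled = half_subsample_cld(flags, flags, n_iter=50, seed=1)
```

The spread of the returned values is what the bootstrapped density visualizes; a narrower spread corresponds to a more stable statistic.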
Calculation of cLD and LD for gene pairs in 3D interaction regions
1000 Genome Variant Call Data
The variant call data of the Phase 3 analysis of the 1000 Genomes dataset were obtained through the European Bioinformatics Institute’s dedicated FTP server (http://ftp.1000genomes.ebi.ac.uk). The complete variant call dataset was located via the Announcements page of the 1000 Genomes website (internationalgenome.org) and downloaded from the /vol1/ftp/release/20130502/ directory of the FTP server (ebi.ac.uk).
3D chromatin conformation assessed by Hi-C
Here, we utilized a Hi-C assessment in the developing brain, which has 27,982 brain-specific paired 3D-interacting regions, measured from neurons derived from human induced pluripotent stem cells (hiPSCs) [39, 52]. This dataset is available in the Synapse database (https://www.synapse.org/) with Synapse ID: syn12979149.
cLD calculation
We first calculated the distance between the genes in each pair and separated the gene pairs into 13 distance groups (Supplementary Materials). Within each distance group, we calculated cLD for all gene pairs and further split them into two categories: those located in 3D interaction regions (assessed by Hi-C experiments) and those located in non-3D interaction regions. Gene pairs with only one gene in an interaction region were discarded. Finally, the average cLD over gene pairs within interaction and non-interaction regions was used to generate the bar charts in Fig. 3a.
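The stratification above can be sketched as a small grouping routine. The bin edges, the tuple layout, and the use of `None` to mark discarded pairs (only one gene in an interaction region) are assumptions for illustration; the actual 13 distance groups are defined in the Supplementary Materials.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# (distance between genes, cLD, True/False for 3D contact, None if discarded)
GenePair = Tuple[float, float, Optional[bool]]


def mean_cld_by_group(pairs: Iterable[GenePair],
                      edges: List[float]) -> Dict[Tuple[int, bool], float]:
    """Average cLD per (distance group, in-contact flag).

    edges are ascending upper bounds of the distance groups; a pair falls in
    group g when edges[g-1] < distance <= edges[g].
    """
    sums: Dict[Tuple[int, bool], float] = defaultdict(float)
    counts: Dict[Tuple[int, bool], int] = defaultdict(int)
    for distance, cld, in_contact in pairs:
        if in_contact is None:
            continue  # only one gene lies in an interaction region
        group = bisect_left(edges, distance)
        if group == len(edges):
            continue  # farther apart than the last distance group
        sums[(group, in_contact)] += cld
        counts[(group, in_contact)] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical bin edges and gene pairs, for illustration only.
example_edges = [10.0, 20.0]
example_pairs = [
    (5.0, 0.2, True),
    (8.0, 0.4, True),
    (15.0, 0.1, False),
    (12.0, 0.9, None),
    (50.0, 0.5, True),
]
group_means = mean_cld_by_group(example_pairs, example_edges)
```

Each key of the result corresponds to one bar in a chart such as Fig. 3a: one distance group, split by whether the gene pair is in 3D contact.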
LD calculation
In general, the procedure mirrors the one used above for cLD, using distance groups and 3D-interaction vs non-interaction categories. As LD is originally defined on individual variants, not genes, the following averaging steps were taken. For each gene pair in the 3D interaction regions, we randomly chose 2,000 rare variant pairs from it to calculate their LD values. For each selected rare variant pair, we calculated its distance and then, among the gene pairs without 3D interactions, we randomly selected another rare variant pair with the same or a very similar distance (Supplementary Materials). As a result, we obtained 2,000 randomly selected variant pairs from gene pairs without interaction that are matched with the 2,000 variant pairs from gene pairs with interaction. The average LD within each of these two sets of 2,000 pairs then contributes to the bar charts in Fig. 3b (Supplementary Materials).
Calculation of cLD and LD for gene pairs in gene-gene interaction databases
Gene-gene interaction databases
Four frequently used databases, BioGRID14, Reactome13, MINT15 and IntAct16, were aggregated as the source of gene-gene interactions (Supplementary Materials). The related datasets were downloaded from their corresponding websites and the IDs were matched using standard gene models (GENCODE v17). To quantify the distance between genes, only data for gene pairs within the same chromosome were used. Calculation of cLD and LD follows the same procedure as described for the 3D-interaction case (Supplementary Materials).
Protein docking analysis
HDOCKlite-v1.138,39 [PMID: 32269383] was employed for conducting the protein-protein docking analysis between the protein pairs (Supplementary Materials). The protein’s crystal structure was obtained from the Protein Data Bank40 (https://www.rcsb.org/) and validated41 for the study (Supplementary Materials). The output file of the docked complex was visualized with PyMOL 2.5.142, and the 2D plot of the protein-protein binding region was analyzed and deduced using LigPlot+ v.2.243 (Supplementary Materials). These analyses are illustrated in Fig. 5.
ΔcLD genes, their functional annotation, and pathway enrichment
Calculation of cLD-differential gene pairs
We calculated cLD using the Autism Spectrum Disorder (ASD)17 [phs000298.v4.p3] whole exome sequencing dataset. We first calculated cLD values for each gene pair in the case and control groups separately. Then, we calculated the absolute difference between the cLD values in the case and control groups for each gene pair. These absolute differences were sorted from largest to smallest. The top-ranked gene pairs were collected and are called cLD-differential gene pairs (Supplementary Materials).
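The ranking step above can be sketched in a few lines, assuming cLD values for cases and controls are already available as dictionaries keyed by gene pair. The function name, the tie-breaking rule, and the gene labels G1 to G4 are hypothetical.

```python
from typing import Dict, List, Tuple

GenePairKey = Tuple[str, str]


def top_delta_cld(case_cld: Dict[GenePairKey, float],
                  control_cld: Dict[GenePairKey, float],
                  top_k: int) -> List[Tuple[GenePairKey, float]]:
    """Rank gene pairs by |cLD(case) - cLD(control)|, largest first."""
    shared = case_cld.keys() & control_cld.keys()  # pairs scored in both groups
    deltas = [(pair, abs(case_cld[pair] - control_cld[pair])) for pair in shared]
    # Sort by descending difference; break ties by pair name for reproducibility.
    deltas.sort(key=lambda item: (-item[1], item[0]))
    return deltas[:top_k]


# Hypothetical cLD values for three gene pairs.
case_values = {("G1", "G2"): 0.90, ("G1", "G3"): 0.50, ("G2", "G4"): 0.20}
control_values = {("G1", "G2"): 0.10, ("G1", "G3"): 0.40, ("G2", "G4"): 0.20}
top_pairs = top_delta_cld(case_values, control_values, top_k=2)
candidate_genes = sorted({gene for pair, _ in top_pairs for gene in pair})
```

The set of genes appearing in the top-ranked pairs (`candidate_genes` here) is what would be passed on to downstream enrichment analyses.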
Functional annotation and pathway enrichment
We selected the top 200, 500, 1,000, 1,500 and 2,000 cLD-differential gene pairs with the largest cLD differences, and included all genes within these top gene pairs as candidate sets for the downstream functional annotations. We made use of two databases, Simons Foundation Autism Research Initiative (SFARI)19 and DisGeNet18 (https://www.disgenet.org/), as gold standards, as they are frequently used in the field of ASD studies and general disease gene queries, respectively. We used the hypergeometric distribution probability to assess the significance (p-value) of the enrichment of the cLD-differential genes for the gold-standard genes (Supplementary Materials). These results are illustrated in Fig. 6a,b. Additionally, we included all genes within the top 2,000 cLD-differential gene pairs and conducted GO enrichment20 and KEGG pathway analysis22 for these genes. These results are illustrated in Fig. 6c,d.
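The hypergeometric enrichment test described above can be sketched with exact integer arithmetic. This is a minimal illustration that assumes an upper-tail test (the probability of observing at least the given overlap by chance); the exact parameterization used in the paper is given in the Supplementary Materials, and the function and argument names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population: int, in_database: int,
                         selected: int, overlap: int) -> float:
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `population` genes, of which `in_database` are gold-standard genes."""
    if not (0 <= in_database <= population and 0 <= selected <= population):
        raise ValueError("in_database and selected must lie within the population")
    upper = min(in_database, selected)
    favourable = sum(
        comb(in_database, x) * comb(population - in_database, selected - x)
        for x in range(max(overlap, 0), upper + 1)
    )
    return favourable / comb(population, selected)


# Toy numbers (hypothetical): 10 genes, 5 in the database, 5 selected.
p_all_hits = hypergeom_upper_tail(10, 5, 5, 5)
p_any = hypergeom_upper_tail(10, 5, 5, 0)
```

A small p-value indicates that the selected genes overlap the gold-standard set more than expected from random selection.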
Acknowledgement
Q.Z. is supported by NSERC Discovery Grant (RGPIN-2018-05147), University of Calgary VPR Catalyst grant and New Frontiers in Research Fund (NFRFE-2018-00748); J.W. is supported by NSERC Discovery Grant (RGPIN-2018-04328); A.P. is supported by NIH (R35 GM134957-01) and American Diabetes Association (Pathway to Stop Diabetes grant 1-19-VSN-02); D.P. is supported by Alberta Innovates Graduate Scholarship and Eyes High International Scholarship; J.H. is supported by CSC Scholarship. The computational infrastructure is funded by Canada Foundation for Innovation JELF grant (36605).
Footnotes
* joint first authors