Types of cis- and trans-gene regulation of expression quantitative trait loci across human tissues

Expression quantitative trait loci (eQTLs) have been identified for most genes in the human genome across tissues and cell types. While most of the eQTLs are near the associated genes, some can be far away or on different chromosomes, with the regulatory mechanisms largely unknown. Here, we study cis- and trans-regulation by eQTLs on protein-coding genes and long noncoding RNAs (lncRNAs) across nearly 50 tissues and cell types. Specifically, we constructed trios consisting of an eQTL, its cis-gene and trans-gene and inferred the regulatory relationships with causal network inference. We identify multiple types of regulatory networks for trios: across all the tissues, more than half of the trios are inferred to be conditionally independent, where the two genes are conditionally independent given the genotype of the eQTL (gene 1 ← eQTL → gene 2). Around 1.5% of the trios are inferred to be mediation (eQTL → mediator → target), around 1.3% fully connected among the three nodes, and just a handful v-structures (eQTL → gene 1 ← gene 2). Unexpectedly, across the tissues, on average more than half of the mediation trios have the trans-gene as the mediator. Most of the mediators (cis and trans) are tissue specific, and cis-gene mediators are significantly enriched for protein-coding genes, whereas trans-gene mediators have a similar distribution of protein-coding genes and lncRNAs to the whole genome.


INTRODUCTION
Gene expression is regulated by genetic variants, among many factors. Expression Quantitative Trait Loci (eQTLs) have been identified to be widespread in the genome by several other consortia [Võsa et al., 2021, The GTEx Consortium, 2020b, Bryois et al., 2014, Battle et al., 2014, Lappalainen et al., 2013, Powell et al., 2012. In particular, the Genotype-Tissue Expression (GTEx) consortium examines the expression profiles and identifies regulatory variants in around 50 tissues and cell types. They have identified at least one cis-eQTL (i.e., the eQTL is near its target gene, which is the cis-gene of this eQTL) for nearly all the protein-coding genes and nearly 70% of the long intergenic noncoding RNA (lincRNA) genes, as well as a small number of interchromosomal trans-eQTLs (i.e., the eQTL is far away from its target gene, which is the trans-gene of the eQTL). On the other hand, trans-eQTLs are enriched in the GWAS (GenomeWide Association Study) Catalog variants [The GTEx Consortium, 2020b], suggesting potentially important roles of trans-eQTLs in complex traits and diseases.
However, the relationship between the cis-and trans-gene of the same eQTL remains unclear. A popular model is cis-gene mediation, where eQTL regulates the cis-gene (e.g., a transcription factor), which acts as the mediator and regulates the trans-gene [Bryois et al., 2014, Pierce et al., 2014, Kirsten et al., 2015, Yao et al., 2017, Yang et al., 2017, Delaneau et al., 2019, The GTEx Consortium, 2020b, Yang et al., 2021, Viñuela et al., 2021, possibly through regulatory elements (such as cis-regulatory domains, which may contain transcription factor binding sites) on the genome [Delaneau et al., 2019]. However, other modes of relationships have also been observed: Bryois et al. [2014] analyzed 869 lymphoblastoid cell lines with a highly conservative approach and identified 49 cases where the eQTL regulates the cis-gene and the trans-gene independently, and 2 cases where the eQTL affects the trans-gene, which in turn affects the cis-gene, in addition to 19 cisgene mediation trios. Delaneau et al. [2019] also estimated a high probability of 0.86 for an eQTL to regulate the cis-and trans-gene independently. Furthermore, there is experimental evidence supporting functionally important inter-chromosomal interactions exist. For example, enhancers located on multiple chromosomes may converge and regulate the expression of olfactory receptor genes in cis and in trans [Bashkirova andLomvardas, 2019, Monahan et al., 2019], suggesting that trans-regulation may be complex and much of it remains unknown.
Analyses of the multiple tissues and cell types of the GTEx consortium have further demonstrated tissue sharing and tissue specificity in the effect of eQTLs: around 80% of all the eQTLs found are shared in more than five tissues, and 25-30% of all the eQTLs are shared in nearly all the GTEx tissues [The GTEx Consortium, 2020b]. To better understand regulation of cis-and trans-genes across tissues, we take a causal network approach to classify possible regulatory relationships in trios of an eQTL and its cis-and trans-target genes, using 48 tissues and cell types in the GTEx version 8 data [The GTEx Consortium, 2020b]. Our analysis goes beyond cis-gene mediation and examines diverse regulatory types, while also accounting for potential confounding from other genes. Our analysis further examines tissue sharing and tissue specificity of these causal relationships. Similar to GTEx, here we focus on protein-coding genes and long noncoding RNAs https://cran.r-project.org/web/packages/MRPC/index.html). MRPC stands for "incorporating the principle of Mendelian Randomization into the PC algorithm", and the PC algorithm is a classical algorithm for inferred directed networks that is named after developers Peter Spirtes and Clark Glymour [Spirtes et al., 2000]). MRPC uses the eQTL as the instrumental variable, and views eQTLs, genes and potential confounding variables as nodes in a network that can represent diverse causal relationships among these nodes.
This approach allowed us to identify multiple types of regulatory networks (or models) for trios (see Section "MRPC analysis accounting for confounding" in Methods). The more interesting types among them are mediation (M 1 ), v-structure (M 2 ), conditional independence (M 3 , a.k.a., pleiotropy), and fully connected (M 4 ) ( Figure 1A). In these networks, the two genes have correlated expression levels, although the correlation arises from different regulatory mechanisms.
Specifically, in the mediation network, one gene acts as the mediator between the eQTL and the other gene. When conditioned on the mediator, the downstream target is determined only by the mediator and is independent of the eQTL. In the network of conditional independence, the two genes are regulated by the same eQTL, inducing correlation between the two genes, even though there is no direct relationship between them. In the fully connect network, the two genes are not only regulated by the same eQTL, but also influence by additional, unknown processes between the two genes that lead to extra correlation between them. By contrast, in the v-structure network, both the eQTL and a gene are the parents of the other gene. Whereas the eQTL and the "parent" gene are independent, their information becomes dependent when the expression of the "child" gene is given. Apart from these networks, there are also inferred networks with only one edge -these are under the null model (M 0 ) category or the "other" category.
Across tissues, the number of trios we selected for network inference varies from 627 to 4,668 (median: 2,278; mean: 2,259; Supplementary Table 1). The same eQTLs or genes may appear in multiple tissues. The number of trios generally increases with the sample size (median: 235; mean: 315.2; Supplementary Figure 1). We further derived principal components (PrCs) from the wholegenome gene expression data in each tissue. Using Holm's method to control the familywise error rate across all the p-values at 5% [Holm, 1979], we identified PrCs that are highly significantly associated with the eQTLs or genes in all the trios of that tissue. We then applied MRPC to each  2.2. Cis-versus trans-gene mediation. The mediation model is of particular interest, as in general the hypothesis is that an eQTL regulates its trans-gene target through the cis-gene, which acts as a mediator. We term this the "cis-gene mediation" ( Figure 1B). A trans-gene may also act as a mediator; we term this type of mediation the "trans-gene mediation" ( Figure 1B).

Trans-gene mediation
Unexpectedly, we inferred a large number of trios to be trans-gene mediation (Figure 3; Supplementary Tables 4, 5, 6, 7 and 8). Among the 1,718 mediation trios we inferred across tissues, we identified 395 (23%) cis-gene mediation trios and 1,323 (77%) trans-gene mediation trios. The number of cis-gene mediation trios varies from 3 to 15 across tissues (median: 8; mean: 8.2), and that of trans-gene mediation trios varies from 7 to 46 (median: 28; mean: 27.6). Furthermore, the 395 cis-gene mediation trios were also inferred by MRPC-ADDIS (with less stringent FDR control). Only 696 of the 1,323 trans-gene mediation trios were also inferred by MRPC-ADDIS; most of the remaining trios (491) were inferred to be M 4 , which contains three edges.
In cis-gene mediation trios, the absolute Pearson correlation between an eQTL and its cis-gene tends to be larger than that between the eQTL and its trans-gene ( Figure 4). This trend is reversed in trans-gene mediation trios, which is consistent with the statistical inference: direct association (i.e., having an edge between two nodes) tends to have a stronger correlation than indirect association (i.e., not having an edge between two nodes), and an edge with a stronger association is more likely to be causal than an edge with a weaker association. Due to the same reasons, when the cis-gene is inferred to be the mediator, the association between the two genes also tends to be stronger than that between the eQTL and the trans-gene. When the trans-gene is inferred to be the mediator, the association between the two genes tends to be stronger than that between the eQTL and the cis-gene.
These mediation trios involve 1,339 unique genes as mediators. Among them, 243 are cis-genes and not trans-genes for an eQTL, 1,066 are trans-genes and not cis-genes, and 30 are both cis-and trans-genes for an eQTL (Supplementary Table 9). Most of these mediators appear to be tissue specific: 215 (79%) of the 243+30=273 cis-gene mediators appear only in one tissue, whereas

Cis-gene mediation trios
Trans-gene mediation trios FIGURE 4. Pearson correlations in mediation trios. Top: scatterplots of the absolute correlations between the SNP genotype and the cis-gene expression versus that between the SNP and the trans-gene expression. Middle: scatterplots of the absolute correlations between the SNP and the cis-gene expression versus that between the expression of the two genes. Bottom: scatterplots of the absolute correlations between the SNP and the trans-gene expression versus that between the expression of the two genes.
We next investigated whether the mediators are enriched for protein-coding genes or lncRNAs, protein-coding genes and lncRNAs in the human genome, protein-coding genes account for 57% and lncRNAs 43%. The distribution of the two types does not differ between trans-gene mediators and the whole genome. On the other hand, cis-gene mediators are significantly enriched for proteincoding genes, compared to trans-gene mediators (chi-square test p-value: 5×10 −7 ) and to the whole genome (chi-square test p-value: 2×10 −7 ). The mediators inferred by MRPC-ADDIS show a similar distribution of protein-coding genes and lncRNA, and the corresponding chi-square test p-values are similarly small (1×10 −10 and 2×10 −11 , respectively).  [Yang et al., 2017] to all the candidate trios in five GTEx tissues with the largest sample sizes: adipose subcutaneous, tibial artery, muscle skeletal, sun exposed skin, and whole blood. GMAC aims to detect mediation; specifically, GMAC examines in each trio whether the data supports the candidate mediator having an effect on the target gene, even if the eQTL may also have a direct effect on the target (see Discussion on the implications of this approach). Although Yang et al. [2017] considered only cis-genes to be candidate mediators in their analysis, the GMAC method itself is agnostic to which gene can be the candidate mediator. In our application, we applied GMAC to each trio twice: first treating the cis-gene as a potential mediator, and next treating the trans-gene as a potential mediator. With this analysis, we were interested in testing whether a method different from MRPC also identified many trans-gene mediation trios.
Most of these trios are the same trios, and GMAC inferred mediation when the cis-gene was the mediator and also when the trans-gene was the mediator. This means that these trios follow an M 4 model in our framework where the edge between the two genes is bidirected (Figure 1). Nevertheless, this result is consistent with ours and confirms that trans-gene mediation is at least as common as cis-gene mediation.

DISCUSSION
Using our causal network inference method, we identified multiple types of regulatory relationships in trios of an eQTL and its cis-and trans-genes, which provides a more comprehensive picture of the complex relationships in trios. Across the tissues, more than half of the tested trios are inferred to be conditionally independent, and around 1.6% of the trios are inferred to be mediation.
Interestingly, on average more than half of the mediation trios have the trans-gene as the mediator.
Furthermore, cis-gene mediators are enriched for protein-coding genes compared with the genome average and with the trans-gene mediators.
An edge inferred by our statistical method indicates that the relationship is strong enough not to be explained away by other factors we have considered. Such a strong relationship, albeit still a statistical result, is therefore more likely to reflect a genuine regulatory relationship. On the other hand, it is very likely that an inferred edge condenses a complex regulation process, which may involve many genes or processes other than transcription.
However, although the number of trios we examined is on the order of thousands across tissues, this is still a very small number of possible trios for the human genome. The trios we considered here generally have strong association. This is because when multiple SNPs were identified to be eQTLs for the same cis-gene, we used the SNP with the smallest p-value. As a result, we may have missed many eQTLs that could have slightly weaker association than the chosen one, but may have stronger association with trans-genes. The distribution of the regulation types may be biased due to this omission, but the distribution from our analysis is quantitatively comparable to that from other studies [Bryois et al., 2014, Delaneau et al., 2019, with the majority of the trios being the conditional independence type, and a small percentage of mediation trios.
Several studies have examined cis-gene mediation [Bryois et al., 2014, Yang et al., 2017, The GTEx Consortium, 2020b. In particular, Yang et al. [2017] systematically identified trios of cisgene mediation across GTEx tissues (using data from version 7), using their GMAC method, which is also based on the principle of Mendelian randomization and accounts for confounding variables.
However, the GMAC method focused exclusively on mediation and cannot detect other relationships. Yang et al. [2017] also focused on trios with strong associations and examined a smaller number of trios in each tissue (median: 1,112.5; mean: 1,473). On the other hand, at an FDR of 5%, they identified more trios of cis-gene mediation than we did in more than half of the tissues: specifically, they identified a median of 102.5 (mean: 140.0) cis-gene mediation trios, much higher than ours. Take whole blood for example, they identified 281 trios of cis-gene mediation, whereas we identified only 50 mediation trios of both types (9 cis-gene mediation trios and 41 trans-gene mediation trios).
Although surprising initially, the observation of a large number of trans-gene mediating trios is consistent with existing literature on the prevalence of trans-regulation. As summarized by Liu et al. [2019], trans-eQTLs contribute to 60-90% of the heritability in gene expression across multiple studies [Price et al., 2008, 2011, Wright et al., 2014, suggesting that eQTLs often regulate genes that are far away. Trans-gene mediation has also been identified before by other studies. For example, Bryois et al. [2014] applied the CIT (Causal Inference Test) method to the lymphoblastoid cell lines (LCLs) from a cohort of 869 individuals and identified 19 trios of cis-gene mediation, 2 trios of trans-gene mediation, as well as 49 conditional independent trios. The number of trans-gene mediation trios is low in this study, but the number of conditional independent trios is also low, suggesting lower power to detect any type. Overall, the observation that the conditional independent trios are much more frequent than mediation trios is consistent with our findings.
What is the potential mechanism for trans-gene mediation? GTEx found enrichment of trans-eQTLs at CTCF (CCCTC-binding factor) binding sites, and hypothesized that such eQTLs may disrupt CTCF binding, which influences the spatial chromatin interaction and therefore gene ex- We validated the common presence of trans-gene mediation with the GMAC method. However, it is important to note that GMAC detects mediation by testing for association between two genes in a trio, in the presence of an eQTL and confounder variables. This means that GMAC tests only the edge between the two genes, regardless the presence of other edges [Yang et al., 2017]. Therefore, mediation under GMAC corresponds to any of three models under our framework: M 1 , M 2 , or M 4 (all have an edge between two genes), whereas lack of a mediation relationship under GMAC corresponds to either M 0 or M 3 (no edge between two genes) (Figure 1). For the trios from the five tissues analyzed by both GMAC and MRPC-LOND, we found that the agreement is as high as 98% between the two methods for no mediation across tissues (Supplementary Table 13). On the other hand, this agreement is only 25-44% for mediation, although this is likely due to MRPC-LOND being a high conservative method (Supplementary Table 14).

The comparison of MRPC and GMAC raises the question: what does mediation mean? GMAC
allows an eQTL to regulate a target gene directly and through a mediator, whereas MRPC requires that all the eQTL effect goes through the mediator and that there is no edge connecting the eQTL and the target gene. We may consider the former partial mediation and the latter complete mediation. Partial mediation, however, is a challenge in statistical inference. Consider the following two models: i) V → T 1 , V → T 2 , T 1 → T 2 ; and ii) V → T 1 , V → T 2 , T 1 ← T 2 . These two graphical models always have the same likelihood, which means that they are Markov equivalent [Verma and Pearl, 1990] and cannot be distinguished without additional biological information. Both models are interpreted as mediation by GMAC, but are considered to be statistically unidentifiable under MRPC and are therefore inferred as M 4 (V → T 1 , V → T 2 , T 1 − T 2 ; Figure 1)  4.3. Selection and identification of trios. Using the eQTLs reported by GTEx and the PEERnormalized gene expression described in the two sections above, we next ran the R package Ma-trixEQTL [Shabalin, 2012] to look for trans-genes located 1 Mb away from the eQTLs with a p-value < 10 −5 . Multiple trans-genes may be identified for the same eQTL. We constructed trios for each eQTL with a cis-gene and a trans-gene. Different trios may have the same eQTL, or the same gene. A gene may also be a cis-gene in one trio but a trans-gene in another. From these trios, we further selected those that have only protein-coding genes or lncRNAs.
4.4. Identification of associated confounding variables. We performed principal component analysis on the PEER-normalized, genomewide gene expression in each tissue and then tested the significance of the three variables in each candidate trio to each PrC using a simple regression and obtained a p-value. We used Holm's method to control the familywise error rate across all the p-values at 5% [Holm, 1979]. Each PrC reflects potential impact from a large number of genes and represents the influence from the larger gene regulatory network to which a trio may belong. We then included these PrCs as additional nodes in the MRPC analysis of the trio. Due to the strong control of Holm's method, the median number of PrCs included in the end varies between 0 and 2 across tissues.  [Scutari, 2010] and pcalg [Kalisch et al., 2012]). Existing methods in this class are computationally efficient but difficult to modify to account for the PMR. The other class is developed for genomic data and explicitly accounts for the PMR (e.g., CIT from Millstein et al. [2009], QPSO from Wang and van Eeuwijk [2014], and findr from Wang and Michoel [2017]). However, the types of causal networks detected by existing methods are often limited to a subset of the five basic models in Figure 1. Recently, several methods [Howey et al., 2020, Yazdani et al., 2016, Zhu et al., 2012, including our MRPC method, have been developed to combine the strengths of two classes of methods for causal network inference.
The PC algorithm consists of two main steps: inferring a graph skeleton, where the key edges are retained but undirected; and determining the direction of the edges. Our MRPC improves both steps and achieves better power and lower FDR on small networks Fu, 2019, Badsha et al., 2021]. The improvement in Step 1 is further explained below.
Step 2 in MRPC incorporates the PMR, which takes advantage of the additional information in eQTLs. Under the PMR, the genotypes can be reasonably assumed to be randomly allocated in a natural population, and can therefore be viewed as randomization of the individuals. Since the genotypes influences the phenotypes, but not the other way around, the PMR then views an eQTL as an instrumental variable for causal inference. Causal inference on a trio aims to infer a three-node network for the eQTL and the two genes, with an edge pointing from the eQTL to one or both genes (Figure 1).
Step 1 in MRPC, as well as other PC-like algorithms, starts from a fully connected network and performs a series of tests on each edge to see whether the two nodes are marginally correlated or conditionally correlated, given one other node, or two other nodes, or any subset of other nodes.
If a test produces an insignificant p-value, it means that the correlation between the two nodes may be not strong enough or can be explained away by other nodes. The edge would be removed and never tested again. Hypothesis testing in PC-like algorithms is therefore online, meaning that the number of statistical tests to be performed is unknown in advance, and that the threshold for a p-value to be considered significant cannot be fixed beforehand. Several methods have been developed to control the overall FDR in this online setting. We have implemented such a method called LOND [Javanmard and Montanari, 2015] in MRPC, and demonstrated that LOND achieved better power and lower actual FDR than existing methods on small networks through extensive simulations Fu, 2019, Badsha et al., 2021].
However, LOND may be too conservative and leads to the true edges being missed, especially in larger networks [Badsha et al., 2021]. We have therefore also implemented the ADDIS method [Tian and Ramdas, 2019], another less-conservative online FDR control method, in MRPC. We used both ADDIS and LOND in MRPC to control the FDR at 5% for each trio.
To compare the performance of ADDIS and LOND, we performed the following simulation.
We included three known confounders in GTEx gene expression data: PCR, platform, and sex.
We further used the PrCs derived from the GTEx whole blood gene expression data as additional, potential confounders. We simulated data independently for trios, where each trio has a set of randomly selected confounders. We generated data for a total of 200 trios under all the five basic causal networks (34-47 trios per network) in Figure 1A, with the confounders being the parent nodes of both gene nodes in each trio. We varied the number of confounders, effect size, standard deviation in the error, and minor allele frequency of the eQTL in the simulation. Most of the values used for the parameters are similar to those in the GMAC paper [Yang et al., 2017]. Specifically, the effect size from the eQTL is drawn from Unif (0.5, 1.5), and the effect size from a gene is drawn from Unif (0.5, 1). The standard deviation in the random error is set to be aβ , where β is the effect size of a gene, and a is drawn from Unif (0.3, 1.5). The minor allele frequency is drawn from Unif (0.01, 0.5). For each trio, we selected a random number of PrCs from a discrete uniform distribution between 1 and 15. The effect size from these PrCs was drawn from Unif (0.15, 0.5). The effect size from the three known confounders was drawn from Unif (0.01, 0.1). We then applied MRPC-ADDIS and MRPC-LOND to each dataset. We evaluated recall and precision in two ways: i) of the edges in the true and inferred networks; and ii) only for the edge between the two genes.
This simulation showed that ADDIS and LOND have similar performance overall (Supplementary Table 15). ADDIS achieved better recall (i.e., power) than LOND did across models, although the recall rates were not high. Both ADDIS and LOND achieved similarly high precision rates across models, once again confirming our previous observations that MRPC is conservative but accurate on inferred edges Fu, 2019, Badsha et al., 2021].
4.6. Gene type enrichment analysis among mediators. Both cis-genes and trans-genes may be inferred to be the mediator. We used the Ensembl human gene annotations (GRCh38/hg38) to label each gene as a protein-coding gene or a lncRNA. We summarized the counts for cis-gene mediators, and separately for trans-gene mediators. We performed chi-square tests to compare the distribution of gene types among cis-gene (or trans-gene) mediators to the whole genome.

4.7.
HiC sequencing data analysis. We downloaded four HiC-sequencing datasets from the EN- To account for multiple testing, we applied Holm's method [Holm, 1979] to control the family-wise error rate at 5%, and the Benjamini and Yekutieli method [Benjamini and Yekutieli, 2001] and the q-value method [Storey, 2002] to control the FDR at 5%. For each trio, GMAC identified and removed covariates that were a common child or intermediate variable to the two genes at an FDR of 10%, and identified confounders (defined as a parent node to the two genes in GMAC) at an FDR of 5%. The input to GMAC consisted of the eQTL and the PEER-normalized expression values of the cis-and trans-gene. Since the genotypes are missing in some individuals, we performed imputation using multiple correspondence analysis (MCA; Josse et al. [2016]) prior to the GMAC analysis. We ran GMAC twice on each trio, first with the cis-gene as the potential mediator and second with the trans-gene as the potential mediator. GMAC output a p-value for each trio. Yang et al. [2017] used these unadjusted p-values in two ways to select mediation trios: i) applying a cutoff p < 0.05, which means no correction for multiple testing; and ii) applying the q-value method and setting a cutoff of q < 0.05 [Yang et al., 2017]. Here, we took the middle road: we applied the q-value method but set the cutoff q < 0.1 for each tissue.
We further applied GMAC to the simulated data generated in Section "MRPC analysis accounting for confounding" to compare with MRPC. Since GMAC can infer only M 1 , we assessed recall and precision only on the edge (or its absence) between the two genes (Supplementary Table 16).
GMAC showed a very high recall rate (0.97) but low precision (0.79) for this edge, indicating its tendency of inferring the presence of such an edge, regardless whether other edges are present or not. In contrast, MRPC-ADDIS and MRPC-LOND both had low recall (0.42 and 0.33, respectively) but high precision (0.85 and 0.87, respectively), again confirming that MRPC is much more conservative in its inference but tends to get it right.
SUPPLEMENTARY Recall and precision of MRPC and GMAC on simulated data. Here, the two metrics are evaluated for the edge between the two genes (T 1 and T 2 ) in a trio. The five basic models are grouped into two classes: M 1 , M 2 and M 4 have an edge between the two genes, whereas M 0 and M 3 do not. We evaluated recall and precision for the two classes separately.      Distribution of Mediator Gene Types For each Tissue SUPPLEMENTARY FIGURE 6. A stacked bar plot of MRPC-ADDIS inferred mediators by gene types.