Detecting global influence of transcription factor interactions on gene expression in lymphoblastoid cells using neural network models

Background Transcription factor(TF) interactions are known to regulate target gene(TG) expression in eukaryotes via TF regulatory modules(TRMs). Such interactions can be formed due to co-localizing TFs binding proximally to each other in the DNA sequence or over long distances between distally binding TFs via chromatin looping. While the former type of interaction has been characterized extensively, long distance TF interactions are still largely understudied. Furthermore, most prior approaches have focused on characterizing physical TF interactions without accounting for their effects on TG expression regulation. Understanding TRM based TG expression regulation could aid in understanding diseases caused by disruptions to these mechanisms. In this paper, we present a novel neural network based TRM detection approach that consists of using multi-omics TF based regulatory mechanism information to generate features for building non-linear multilayer perceptron TG expression prediction models in the GM12878 immortalized lymphoblastoid cells. Results We estimated main effects of 149 individual TFs and interaction effects of 48 distinct combinations of TFs forming TRMs based on their influence on TG expression. We identified several well-known and discovered multiple previously uncharacterized TF interactions within our detected set of TRMs. We further characterized the pairwise TRMs using long distance chromatin looping and motif co-occurrence data. We found that nearly all the TFs constituting TRMs detected by our approach interacted via chromatin looping, and that these TFs further interacted with promoters to influence TG expression through one of four possible regulatory configurations. Conclusion Here, we have provided a framework for detecting TRMs using neural network models containing multi-omics TF based regulatory features. We have also described these TRMs based on their regulatory potential along with presenting evidence for the possibility of TF interactions forming the TRMs occurring via chromatin looping.

Availability of high-throughput ChIP-Seq datasets, which provide the sequence specific binding information for each TF, has enabled researchers to detect TF interactions across the whole genome. Gerstein et al. analyzed the co-localization maps of different TFs in K562 and GM12878 cell lines to detect significantly co-associating TFs using a discriminative machine learning approach [6]. They detected several well-characterized TF interactions such as the GATA1-complex(GATA1-GATA2-TAL1), MYC complex(MYC-MAX-E2F6) and the AP1factors (FOS-JUN-JUND-FOSL) as well as some novel TF interactions such as GATA1-CCNT2-HMGN3 and GATA1-NRSF-REST using their approach. Others have used nonparametric modeling approaches to identify pairwise or higher-order interactions of TFs [7], [9]. For example, Guo and Gifford developed a topic modeling approach called Regulatory Motif Discovery(RMD) that identifies different TF interactions utilizing TF co-localization information. [7] They detected multiple well known TF interactions such as the cohesin complex (CTCF-RAD21-SMC3) complex, the transcription pre-initiation complex (POL2-TBP-TAF1) and the AP1 factor complex. Bailey et. al. identified several literature-annotated interactions by identifying closely binding TFs based on significant spacings between their sequence motifs [8]. Lastly, soft and hard clustering methods such as k-means clustering, non-negative matrix factorization and self-organizing maps have also been used to identify co-localizing TFs across the genome [14]- [17]. Although these studies have helped in systems-wide detection and characterization of TF interactions, they have the following limitations: 1) Interactions that nonadditively influence TG expression via distally binding TFs caused by chromatin looping (the "TF collective" model) cannot be detected using the abovementioned methods, as they rely upon TF co-localization information to identify proximally co-binding TFs (more consistent with the "Enhanceosome" and "Billboard" models). 2) The unsupervised clustering and topic modelling methods require the user to pre-determine the number of TF interactions to be identified preventing the agnostic discovery of TF interactions. 3) Lastly, and most importantly, for most of these studies the quantitative impact of TF interactions on TG expression remains unknown.
In this study, we use a multi-omics machine learning framework to model the impact of multiple TF based regulatory mechanisms on TG expression and detect TRMs based on the interaction effects learned from these models. We generated a gene regulatory network(GRN) containing information from datasets representing TF-TG, TF cooperativity and TG coregulation. The TF-TG interactions in our multi-omic GRN were also weighted based on chromatin looping interactions made by distally binding TFs with the TG promoters to appropriately capture their effect on TG regulation. We used the features from this GRN to predict TG expression values in the GM12878 lymphoblastoid cell line(LCL) using non-linear deep learning multilayer perceptron(MLP) prediction models. By aggregating interaction effects among different combinations of TFs from our learned models, we were able to identify specific TRMs that had high impact on TG expression. We validated the TF interactions, that we discovered within these TRMs, based on long distance chromatin looping contacts between their distal binding sites and significant spacing between their motifs for proximal binding sites. We also characterized the transcriptional regulatory programs for these modules based on the orientation and interaction of the corresponding ChIP-seq peaks relative to the promoters of TGs. Using our flexible multi-omics machine learning framework, we were able to detect TRMs significantly influencing TG expression, while characterizing their regulatory architectures using biologically relevant information.

Target gene expression could be better predicted by modelling complex non-linear interactions among transcription factors.
We hypothesized that information beyond sequence co-localization of TFs would be useful for detecting TRMs, formed by the "TF collective" model described in Background, essential for TG expression regulation. As a basis to examine this hypothesis, we developed a multi-omics machine learning framework shown in Figure-1, which modelled the influence of multiple TF based regulatory mechanisms (TF co-operativity, TF-TG binding and TF-TG coregulation) on TG expression, in the GM12878 LCL, by using features derived from a gene regulatory network(GRN) built using the PANDA algorithm [18]. We have shown previously that features obtained from such a GRN explain more variance in TG expression compared to using TF-TG binding information alone. [19] In order to generate these GRNs, we first extracted all the cis-regulatory and intronic TF binding sites(TFBS) corresponding for 149 TFs for each TG. Next, we created an adjacency matrix weighted by the number of chromatin looping interactions between each TF and the TG promoter based on GM12878 high throughput chromatin capture(Hi-C) data. We used this weighted adjacency matrix to create a motif network reflecting GM12878 specific co-expression and PPI networks.
We used the TF-TG edge-weights from this GRN to build our prediction models where we tested for purely additive as well non-additive influence of TF features on TG expression. As a baseline comparison, we first built ElasticNet(ENET) regularized regression models which assume an additive(linear) influence of TF features over TG expression without considering any non-additive effects of TF interactions. We next used a traditional neural network based multilayer perceptron(MLPs) capable of modelling non-additive(non-linear) interaction effects of TFs on TG expression, which we hypothesized will help identify "TF collective" based TRMs. We further evaluated a hybrid model (MLP-U) that can decompose the effects into additive and non-additive components. This model was composed of set of univariate MLP models capturing individual TF influence over TG expression along with a traditional MLP to capture all possible interaction effects (see Supplementary Figure S1). Thus, we used 3 different prediction models :1) an ENET to model TF main effects only 2) an MLP to model complex interaction effects, and 3) an MLP-U model that can be decomposed into additive and nonadditive components. Further details about these models, especially the MLP and MLP-U

Figure 1:Using a multi-omics GRN framework to predict gene expression. We downloaded ChIP-seq data for 149 GM12878 TFs from the ENCODE consortium whose accession numbers are provided in Supplementary table S1. We used the peaks that passed the optimal IDR(Irreproducible Discovery Rate) threshold defined by the consortium and mapped them onto the regulatory region of each gene to define TFBs. We used CTCF peaks within a 50Kb window upstream and downstream of the gene body in order to demarcate the regulatory boundaries. Furthermore, we weighted the TF-TG interactions based on the number of contacts made by the corresponding peaks with the promoter of TGs. We used a weighting scheme where promoter TFBS were automatically upweighted because of the inability of HiC data to capture them due to limited resolution. We created PANDA GRNs using the weighted adjacency matrices, the PPI data corresponding to the TFs obtained from BioGRID and the lymphoblastoid co-expression data obtained from GEUVADIS. After generating the PANDA GRN, we built elastic net(ENET) and multilayer perceptron(MLP) models that used them as input features to predict log FPKM values(gene expression) of an independent dataset. We used two different internal cross-validation strategies to train the ENET and the MLP models and assessed their accuracy by computing Pearsons correlation coefficient(PCC) between observed and predicted expression.
architectures have been provided in the MLP network architecture and building the prediction models section of the Methods.
We used an independent GM12878 LCL expression dataset(accession: ENCSR889TRN) to train and test the prediction models. We used 80% of the total TG set for training the models as well as for internal cross validation and used the remaining 20% for testing the prediction accuracy (Figure 1). We performed the prediction task for 20 iterations, each time using a different set of test TGs. We used the median Pearson's correlation coefficient(PCC) aggregated across all of these iterations to compare the performance of the 3 different models. As shown in Figure-2A, the models capable of capturing interaction effects of TFs (MLP and MLP-U) perform significantly better (median PCCMLP = 0.68; median PCCMLP-U = 0.63) compared to the ENET models (median PCCENET = 0.57), which model linear influence of individual TFs on gene expression. This improvement in performance was also statistically significant (median PCCMLP vs. median PCCENET p-value = 1.91e-06; median PCCMLP-U vs. median PCCENET p-value = 1.91e-06) as calculated by performing paired Wilcoxon sign rank tests. We have shown the PCC for the three types of models obtained from each prediction iteration in Supplementary table S2. Furthermore, we observed that over all the prediction iterations for the MLP-U models, the main effects, obtained from their univariate component, were more predictive of TG expression, explaining about 34% of the variance on average, than the interaction effects, captured by their MLP component, which explained 23% of the variance in TG expression(see Partitioning TG expression variance explained by the univariate and MLP components from the MLP-U models of Supplementary Methods and Supplementary Figure S2).
In conclusion, accurate prediction of TG expression requires efficient modelling of main effects of individual TFs as well as interaction effects of TRMs.

Context dependent influence of individual transcription factors on target gene expression could be discerned from our models.
The MLP-U architecture, described in Supplementary Figure S1, allowed us to model main effects of individual TFs separately from the interaction effects of TF combinations. We used equations(1)- (18) to calculate these main and interaction effects from the trained MLP-U models(see Obtaining main and interaction effects from the MLP-U models of the Methods section).
Our learned MLP-U models contained individual univariate MLPs corresponding to each one of the 149 TFs. We aggregated all the learned connection weights at the first layer of these MLPs and multiplied them with the nodal influence score for each node in that layer. After averaging these nodal scores, we calculated an average main effect for each TF across all the prediction iterations followed by scaling it in the range (-1,1). We have provided the scaled and the raw main effects for each of the 149 TFs (Supplementary table S3A).

C
In order to examine the validity of our main effects aggregation approach, we divided the TFs into 5 bins based on their scaled main effects (Supplementary Figure S3 and Supplementary table S3A). The bin placement of the TFs derived from their main effects reflected their functional roles. For instance, activating TFs such as TAF1, MYC, TBLXR1, RELA and BCL11A were present in the right most bin(5) because of their highly positive main effects. On the other hand, transcription repressors such as MXI1, HDAC2, SMC3, MAZ and ZNF592 had strongly negative main effects placing them in the left most bin (1). We compared the main effects obtained from the MLP-U models to those obtained from the ENET model by computing the difference in ranks(DIR) of the TFs based on their effects for the two modelling approaches (Supplementary table S3A). Positive DIR for a TF reflected decrease in the MLP-U main effect, while a negative DIR represented increase in the MLP-U main effect, compared to that obtained from the ENET models.

Interaction effects aided the detection of well-known and novel transcription factor regulatory modules.
The MLP component of the MLP-U models quantify the non-additive interaction effects of different combinations of TFs on TG expression. These effects could reflect the influence of non-linear "TF collective" interactions on TG expression. We applied the NID algorithm [29] to compute interaction effects in the form of NID scores for such TRMs. This calculation is done at each node of the first layer, for all the possible combinations/orders of the interactions and only the top ranked interactions for each order are retained. The interactions are aggregated such that lower order redundant interactions are removed and higher order top ranking interactions are retained giving a final set of highly impactful interactions of different orders. We defined these interactions along with their average NID scores as TRMs. We applied Log2 normalization to the average NID scores calculated for each TRM across all the 20 prediction iterations (Supplementary table S3B).
We detected 48 unique TRMs out of which 32 were pairwise interactions, 12 were 3-way and 4 were 4-way interactions as shown in Figure-2B. The pairwise TRMs were formed by 36 unique TFs, 3-way TRMs were formed by 22 TFs and the 4-way TRMs were formed by 12 TFs. Furthermore, we observed that among the higher order (3-way or higher) TRMs, the "nestedness" or the proportion of all the possible pairwise TRMs being also detected was never 100% (Supplementary Figure S4). We found that JUND formed the largest number of TRMs(11) followed by GATAD2B (10), RELB(10) and ATF2 (9). All of these TFs are versatile DNA binding proteins capable of affecting cell proliferation, division and apoptosis, which explains their presence in a large number of TRMs.
Multiple literature annotated TF interactions were present in the TRMs we detected. For instance the pairwise TRM of ATF2-JUND(Log2NID score = 2.57) where both the TFs are part of the well-known AP-1 factor complex, which is involved in expression regulation of multiple TGs [30]- [32]. TF GATAD2B is known to form a repressive complex involving nucleosome remodeling and deacetylase activity with the CHD family of TFs [33]. We discovered that GATD2B and CHD1 were present in two different TRMs: ARID3A-CHD1-GATAD2B(Log2NID score = 2.64) and ARID3A-CHD1-GATAD2B-RELA(Log2NID score = 2.63). The presence of ARID3A and RELA in these TRMs has not been validated by the existing literature, although both of them have been associated with immune cell proliferation [34], [35]. We also discovered the three way TRM EZH2-KDM5A-SUZ12(Log2NID score =2.40, where the methyltransferase EZH2 and scaffolding protein SUZ12 are known to form the polycombrepressive complex PRC2 , which interacts and competes with H3K4me3 demethylase KDM5A during the process of angiogenesis and hematopoiesis [36]. We also discovered the pairwise TRM KDM5A-SUZ12(Log2NID score = 2.47) indicating that KDM5A and SUZ12 may be the primary interactors within the three-way TRM.
We also detected several TRMs containing previously uncharacterized TF interactions. For example, the TRM with the highest influence over TG expression was RELB-STAT1 with the largest Log2NID score of 3.18. Both of these TFs play an important role in immune response and lymphocyte development [37], [38]. Thus, their closely related functions could point to the possibility of their interaction in vivo. Another intriguing, albeit unvalidated interaction, that we discovered was EP300-TAF1(Log2NID score = 2.39). Both of these TFs are well known lysine acetyltransferases and are responsible for activating and regulating transcription of several TGs and were also found to have the highest frequency of oncogenic mutations among all the other lysine acetyltransferases [39]. The Log2NID scores for all pairwise TRMs are shown in the form of a heat-map in the Figure 2C (all scores are available in Supplementary table S3B). Thus, we detected TRMs containing many previously uncharacterized as well as some well-known TF interactions using the NID algorithm.

Chromatin looping plays an essential role in forming transcription factor regulatory modules and in mediating their regulation of target genes.
Apart from some well-known interactions, our discovered TRMs contained a significant number of previously uncharacterized TF interactions. As described in the Background, TRM formation can be brought about by either co-localization of proximally binding TFs based on motif proximity or by distally binding TFs brought in close proximity by long distance chromatin looping. Thus, to characterize the TRMs, we used chromatin looping and TF motif co-occurrence information to identify TFs interacting with each other via chromatin looping (long-distance interactions) or by binding in close proximity(see Detecting co-binding TF ChIP-Seq peaks and Detecting TF ChIP-Seq peaks interacting via chromatin looping sections of the Methods).

Figure 3: Pairwise TRMs interact via long distance chromatin looping. A) We overlapped the GM12878 Hi-C data at 5Kb resolution with the ChIP-seq peak pair regions corresponding to the 32 pairwise TRMs within the cis-regulatory regions of the TGs. B)Barplot showing the mean log10 Hi-C contacts(5Kb resolution) between peak regions of the pairwise TRMs shaded according to the respective Log2 NID scores across all the TG. We weren't able to detect any HiC contacts between the peak pairs of the TRM SUZ12-ZNF284
Using GM12878 specific high throughput chromatin capture(Hi-C) data, we looked for long distance interactions between ChIP-seq peaks, present within TG's cis-regulatory regions, corresponding to all pairwise TF modules we detected (Figure 3A). We compared the enrichment of Hi-C contacts for these peaks with that obtained from a background set of peak pairs, within the TG's cis-regulatory regions, corresponding to random pairwise combinations of TFs not present in the detected set of pairwise TRMs using a chi-square test (see Supplementary  table S4A). We observed significant enrichment of Hi-C contacts at 5Kb resolution among 36,734 ChIP-seq peak pairs corresponding to 31 pairwise TF modules (χ 2 p-value = 9e-04) within TG's cis regulatory region as shown in Figure 3B and provided in Supplementary table S4B. The only pairwise TRM that did not contain any Hi-C contact points between the peak pairs was SUZ12-ZNF284. The enrichment of Hi-C contacts at 1Kb resolution was not statistically significant, however with the χ 2 p-value of 0.3423(see Supplementary Figure S3).
In order to identify co-localizing TF interactions based on their sequence/motifs, we used the SpaMo tool from the MEME suite(v.5.1.1) [8] to examine pairwise TRMs,. We looked for significant spacing between TF motifs occurring within their overlapping peak pair regions. We found significant motif co-occurrence for 60 peak pairs corresponding to 6 pairwise modules (adjusted p-value < 0.05, see Supplementary table S4C). Additionally, we did not find these cobinding TRMs in the set of modules previously described by other approaches [6], [7], [9], [40].
To further characterize the regulatory architecture of the TRMs, we defined four transcription regulatory programs shown in Figure 4A based on their interactions with TG promoters. We first identified 2,038 TGs where TFs peaks were interacting with each other either via Hi-C or via motif co-occurrence. We then determined the regulatory programs followed by the TRM peak pairs for each TG (see Supplementary table S5). As shown in Figure  4B, on an average 95% of the peak pairs corresponding to each pairwise TRM followed a configuration where at least one is interacting with the TG promoter and the two peaks interact with each other via long distance chromatin looping. Furthermore, TRMs HDAC2-PAX8, KDM5A-SUZ12 and SREBF1-ZNF274, for which the TFs are not known to directly bind to the

Figure 4: Pairwise TF TRMs follow different regulatory programs for different TGs. A) We utilized HiC and co-binding data to define 4 TF regulatory patterns/programs for the pairwise modules for different TGs. B) Barplot shows the proportion of the total peak pairs for each pairwise TRM following each of the 4 transcription regulatory programs shown in A
TG promoters, regulated all their TGs using this program exclusively. We observed that for the remaining TRMs, about 4.5% of the peak pairs followed the second regulatory program which constituted one of them being present directly within the TG promoter while interacting with the other one via chromatin looping. About 17% of the peak pairs corresponding to the TRM EZH2-MYC, which contained TFs with known TG promoter binding activity, followed this regulatory program. Lastly, we found only 25 co-localizing peak pairs corresponding to 4 pairwise modules (RELB-SPI1, JUND-RELB, FOXM1-PKNOX1 and MAZ-PBX3) interacting with the promoters of 15 TGs via chromatin looping and 1 instance of co-localizing peak pair for the TRM RELB-SPI1 directly binding the promoter of 1 TG. Hence, only RELB-SPI1, which contained TFs important for lymphocyte development, contained peak pairs following all four types of transcription regulatory programs.
Thus, based on the above analyses, we conclude that the pairwise TRMs identified from the MLP-U learned models almost exclusively contained TF peak interactions occurring over long distance via chromatin looping. In addition, these TRMs mostly regulated their TGs also via long distance chromatin interactions with the TG promoters.

Discussion
In this study, we designed a machine learning prediction framework for identifying TRM for the GM12878 immortalized LCL utilizing multiple big "omics" data sources. We used a modified form of the neural network MLP architecture called MLP-U in order to account for the influence of individual TFs as well as of TF interactions on TG expression within the same model. We found that accounting for both these effects resulted in more accurate TG expression prediction compared to accounting for just the linear effects of TFs using the ENET regularized regression models. The traditional MLP models produced better prediction than the MLP-U models because of the recapitulation of the main effects of TFs. In other words, both main effects and interaction effects were being modelled using complex non-linear functions in the traditional MLP architecture leading to perhaps an overestimation of the main effects resulting in the better TG expression prediction.
One of the biggest drawbacks of a neural network model is that it is usually considered a "black-box" as features learned during the training as well as testing of the models are difficult to interpret. We overcome this limitation and extracted biologically relevant information using the NID algorithm [29]. We calculated main effects of individual TFs as well as interaction effects of TF combinations. We observed that the direction of the TF main effects correlated well with their known functional roles. However, these effects were largely different compared those obtained from ENET models as the MLP-U captured context/interaction dependent TF main effects, while the ENET models estimate TF main effects only. Furthermore, we also detected highly influential TF interactions forming TRMs via statistical interactions in models of TG expression. We derived literature-based annotations for some of these TRMs, while many were novel TF interactions not identified by other approaches. This could be due to two reasons. First, the non-additive non-linear nature of the TF interactions, reflecting the "TF collective model" we detected is fundamentally different from that of the linear, co-localizing TFs, reflecting the "Enhanceosome" and "Billboard" models identified by the previous approaches. Second, our strategy for identifying TF interactions was to model their influence on TG expression, which was largely ignored by the previous approaches.
Thus, a co-localizing set of TFs not significantly impacting TG expression would be missed using our approach, though these TFs presumably have little influence on the expression of nearby genes. Additionally, we found that a significant proportion of the TF peaks for the pairwise TRMs interacted with each other and with the promoters of the TGs they regulated via chromatin looping. Therefore, long distance chromatin interactions likely play a large role in formation of TRMs as well as in their regulation of the TGs. This further validates the idea that TF interactions are not limited to proximally binding co-localizing sets of TFs. We used Hi-C chromatin looping data in two mutually independent contexts; we first included Hi-C contacts made by distal TFs with the TG promoters while building our GRNs, and further validated these chromatin interactions by examining Hi-C contact enrichments between the TF peak regions themselves. While the former Hi-C data aggregation was done to quantify the influence of distally binding TFs on TG regulation via promoter interaction, the latter instance reflected characterization of pairwise TFs interacting over long distances.
We focused our analyses on the GM12878 LCL in this study due to the density of TF binding data available, however our approach is flexible enough to analyze TRM based TG regulation in other commonly studied human cell-lines when these data are available A key limitation of our approach is the need for high-density omics assay data that often require large input DNA quantities that likely limit their application to cell-lines only. In different cellular contexts and environmental conditions, additional higher order TRMs may exist, and the precise models underlying these interactions will be difficult to elucidate. However, we did identify pairwise TF interactions that form a basis for higher order interactions that could act as a starting point for further experimental validation or examination under different environmental conditions.

Conclusions
In this study, we have detected TRMs significantly impacting TG expression using neural network based prediction models containing multi-omics GRN derived TF regulatory features. We have demonstrated multiple ways in which long distance chromatin looping plays a role in TRM based TG regulation. Our approach for detection, characterization and validation of TRMs provides a roadmap for a multi-omics analysis to study the complex phenomenon of transcription regulation genome-wide, and may provide insights into the impact of transcriptional dysregulation in the genetic basis of human phenotypes.

Building multi-omics GRN
We utilized the Passing Attributes between Networks for Data Assimilation (PANDA) algorithm to build the GRN. This algorithm uses a TF binding site(TFBS) based motif network, a PPI network and a co-expression network for building the GRN (Figure-1). We generated these three networks using the following approach: Motif network: We isolated all the ChIP-Seq peaks within a 50Kb window upstream of the TSS of the longest transcript and downstream of the body of each protein coding TG. We then used the most distant CTCF peaks to demarcate the cis-regulatory boundaries for these TFBS, as it is a well-known insulator protecting the enhancers of TG gene from acting upon the promoters of another as shown in Figure-1. Furthermore, we added the TFBS found in the intronic regions of each TG to this set in order to capture the effect of introns on transcriptional regulation. We have shown previously that inclusion of intronic TFBS in the GRN framework ultimately improves the model prediction accuracy [41], as introns are hypothesized to have regulatory influence over TG expression [3], [42]. We then weighted each TFBS based on the number of Hi-C contacts(1Kb) it makes with the TG's promoter( Figure-1) using the weighting scheme described in Generating Hi-C Weightings of the Supplementary Methods. Using such a weighting scheme helps to capture regulatory information provided by long distance interactions of distal TFBS with TG promoters created via chromatin looping while preserving the influence of proximal promoter based TFBS [41]. We created a weighted motif network using the unique TF-TG interactions and the average Hi-C weight for them.
PPI network: We downloaded PPI data from the BioGRID database(v.3.5.188) to generate the PPI network.
Co-expression network: We extracted expression residuals for the 462 LCL samples within the GEUVADIS datasets using a genome-wide genetic relationship matrix(GRM) based mixedlinear regression model and used them to generate the co-expression network (see Building coexpression network for the PANDA GRN of Supplementary Methods). This was done to adjust out the genetic effect of the variants in the dataset.
We used the above networks to generate GRN utilizing the R(v.3.4.2) implementation of the PANDA algorithm. After 25 iterations, we obtained convergence by setting the threshold for Hamming's distance at 0.001 and by using the value of 0.1 for the update parameter.

MLP network architecture and building the prediction models
We utilized two different MLP architectures in our paper: 1)MLP-U(MLP-Univariate) and 2)Traditional MLP as shown in Supplementary Figure S1. The MLP-U architecture contained individual univariate MLPs receiving inputs corresponding to each TF in addition to the traditional MLP. All the univariate MLPs had 3 layers containing 10 nodes each and the traditional MLP also contained 3 layers with 800, 500 and 1000 nodes for each model. The nonlinear activation function for all the layers was Rectified Linear Unit(ReLU).
We built the ENET and the MLP prediction models using log10 FPKM expression values of 11,780 protein coding TGs, where we used 80% of the data(9,424 TGs) for training the models and the remaining 20%(2356 TGs) to test the models and assess their prediction accuracy. We used two different internal cross-validation strategies to train the two types of models: 1) For the MLP-U and MLP models, we further divided the training data into 85% training and 15% validation sets. We then trained these models using the backpropagation algorithm. Additionally, we summed the output from all the individual univariate MLPs and the traditional MLP at the last node for training the MLP-U models. We note here that the traditional MLP architecture was only used as a comparison in the paper and most of the analyses were done using the trained MLP-U models. 2) For the ElasticNet(ENET) prediction model, we used an alpha of 0.5 and trained the models based on 20 fold inner cross-validation. We trained and tested the models for 20 iterations (Figure-1), and computed Pearson's Correlation Coefficient(PCC) each time to assess model performance.
Thus, we had an input matrix X of size N × T, containing N TGs and T TFs. The values in this matrix were scaled edge-weights corresponding to the vertex TFt ⟶TGn, where n ϵ{N} and t ϵ {T} derived from the learned PANDA GRN network. The output was a column vector y of size N containing scaled and centered log FPKM (Fragments per kilobase per million) expression values of the N genes. For the MLP-U models, it was derived based on a generalized additive model: (1)

Obtaining main and interaction effects from the MLP-U models
For each trained MLP-U model, we performed an additional 5-fold prediction task in order to capture the prediction performance over all the TGs within each iteration. Thus, we essentially conducted 100 prediction rounds for which we stored the model weights learned during the training process.
In order to calculate the main effect corresponding to each TF, we utilized the learned MLP-U models. Specifically, we extracted layer weights from each one of the univariate MLP corresponding to each TF feature and aggregated them across all the prediction iterations. These iterations corresponded to a set S of 20 random numbers s each representing an instance/state for bootstrapping test set genes for each prediction task.
For each random state , we picked 5 non-overlapping sets of test genes = { | ∈ }; ∈ ; 1 ≤ ≤ 5 ; For each , we then used the remaining genes as the training set _ such that We then predicted the expression values of Gsi genes according to the following equation: S = { s | s ϵ ℝ, k > 0 } and |S| = 20 (2) Here ( ) is the vector containing predicted expression for gene set using the input matrix by the model trained using the input from genes in set _ . The first part of equation (5) captures the main effect of each one of the TF with representing the corresponding univariate MLP while the second part captures the interaction effect of K interactions, via the traditional MLP ,on the gene expression trait. Thus, for each iteration , the expression vector of gene set ( ) is derived from a generalized additive model containing main effects and interaction effects derived from a collection of complex non-linear functions Φ and respectively. The parameters for this model were learned during the training process using the training set _ . Furthermore, we had 5 models for each random iteration each containing a different set of test genes.
We note here that the = 3 for all the models in our case.
We utilized the learned models and to calculate the main effect for each TF and the interaction effect of interactions respectively. We used an extension of the neural interaction detection(NID) developed by Tsang et al. in order to compute these effects [29].
For each random state , we first aggregated the layer weights across all the models The main effect for each TF and the interaction effect of the -interaction at unit of the first layer across all the models for a random state was calculated using the following equations: