## Abstract

**Background** The inference of gene regulatory networks is of great interest and has various applications. The recent advances in high-throughout biological data collection have facilitated the construction and understanding of gene regulatory networks in many model organisms. However, the inference of gene networks from large-scale human genomic data can be challenging. Generally, it is difficult to identify the correct regulators for each gene in the large search space, given that the high dimensional gene expression data only provides a small number of observations for each gene.

**Results** We present a Bayesian approach integrating external data sources with knockdown data from human cell lines to infer gene regulatory networks. In particular, we assemble multiple data sources including gene expression data, genome-wide binding data, gene ontology, known pathways and use a supervised learning framework to compute prior probabilities of regulatory relationships. We show that our integrated method improves the accuracy of inferred gene networks. We apply our method to two different human cell lines, which illustrates the general scope of our method.

**Conclusions** We present a flexible and systematic framework for external data integration that improves the accuracy of human gene network inference while retaining efficiency. Integrating various data sources of biological information also provides a systematic way to build on knowledge from existing literature.

## Background

The inference of gene regulatory networks has attracted great interest in recent years, especially gene network inference from large-scale data. Advances in technology have led to the generation of high-throughput biological data. Gene regulatory networks play an important role in understanding the interactions between genes and have many applications. However, inferring gene networks from high-dimensional genomic data can be challenging.

We define a gene regulatory network as a directed graph that represents the regulatory relationships between genes, in which each node represents a gene and each directed edge represents the regulatory relationship between a regulator (parent node) and a target gene (child node). Furthermore, these regulatory relationships or edges from regulators to target genes can be calibrated by probabilities representing the likelihood of such edges, especially in Bayesian approaches.

There is an extensive literature on methods for the inference of human gene regulatory networks and their applications. Some authors inferred gene networks to uncover causal relationships between gene expression and disease, which could help drug discovery and development [1, 2, 3] as well as disease biomarkers [4]. Woo et *al.* [5] proposed a method to predict changes in gene expression level after drug perturbation, which offers insight into target prioritization of novel compounds. Also, gene network inference could advance the understanding of the mechanisms underlying various biological processes and identify genes that play important roles in biological activities.

Bayesian networks are one of the most commonly used modeling approaches in gene network construction. For example, Friedman et *al.* [6] built a framework on Bayesian networks to infer interactions between genes based on multiple expression measurements. A Bayesian network is a directed acyclic graph that describes the joint probabilities of the conditional independence between nodes. Bayesian network methods have been applied to yeast gene expression data and further extended using probabilistic graphical models [7]. Many other models have also been developed to infer gene regulatory networks, including ordinary differential equation methods (e.g. [8, 9]) and regression-based approaches (e.g. [10]). Ordinary differential equations have been used in both static and time-series gene expression data. These methods usually suffer from the curse of dimensionality, especially in the case of human data when the number of genes is large. Subsequently, dimension reduction techniques have been used, such as forward feature selection [11], singular value decompositions [12] and Principal Component Analysis [13].

Yeung and colleagues developed Bayesian regression-based network inference methods by integrating external data to yeast time series gene expression data [14, 15, 16, 17]. Specifically, they developed a regression framework using Bayesian Model Averaging (BMA) to select predictive variables using time series gene expression in yeast. BMA methods sample each model in the ensemble to improve the accuracy of inference. In addition, they also developed a supervised learning framework to integrate multiple data sources, including genome-wide binding data, additional gene expression data, protein-protein interaction data and prior knowledge from the literature.

Given that the high-dimensional gene expression data typically consist of limited numbers of experiments, it is difficult to identify the correct regulators for each gene in the large search space. The expected number of regulators for each target gene is relatively small compared to the whole gene set. To help the inference of gene regulatory networks, many types of external data have been used. These external data include genome-wide binding data [18], genetic interactions data [19] and protein-protein interaction data [20, 21].

## Related Work

### Time-series Gene Expression Data

Various types of data have been used to infer gene networks, including both time-series and static gene expression data [22, 23]. Compared to static data, time-series data provide much additional information from sequential time points. Using time-series gene expression data, dynamic Bayesian networks considering gene expression levels from different time points allow self-loops [24, 25, 26, 27], which are not possible in Bayesian networks due to the directed acyclic graph assumption.

Although time-series gene expression data may provide useful information from which gene regulatory relationships can be derived, it can also introduce noise and redundant information which can subsequently result in a reduction of accuracy. It is also difficult to determine the optimal number of time points profiled in the experimental design and the intervals between consecutive time points, especially after taking into account the balance of data informativeness and experiment efforts [28, 29]. Interpolation approaches using measured time points have been employed to solve the problem when the number of time points are not sufficient to infer a gene network of high accuracy. Interpolation approaches not only makes the expression levels distributed more smoothly across the time points, but also handle the estimation of time derivatives of each time point which can be utilized to build ordinary differential equations (ODE) models [30, 29]. However, interpolation cannot help to alleviate the curse of dimensionality of time-series data [31]. Due to the challenge of distinguishing signal from noise in time-series gene expression data and the increased dimensionality, gene network inference from time-series data is generally time and resource consuming.

Another limitation is that it is highly difficult to infer causality using time series gene expression data alone without additional data sources. Causality in gene regulatory networks is of great biological interest. In the context of gene networks, an inferred directed edge in the form of (*A* → *B*) means the following: gene *A* is the parent or regulator of gene *B*, and that if the expression level of gene A changes then we can expect the expression level of gene *B* will change as well. However, time-series data only provide information on statistical causality. Methods have been developed to infer directed edges by leveraging additional data sources and expert knowledge [15] or using graphical models [32]. Nevertheless, these models are limited to the inference of statistical causality.

### Perturbation Gene Expression Data

To derive directed gene regulatory networks, perturbation data such as over-expression and knockdown data has been used in many proposed methods [33, 34, 6]. As static expression data without any time points, perturbation data does not reflect any dynamic biological behavior over time but the experimental design could be used to derive a causal relationship. Specifically, after gene *A* is perturbed *(i.e.* by either knockdown or over-expression), the expression level of gene *B* is observed to change. Since the causal event (perturbation) is included in the experimental design, we can infer a directed edge (*A* → *B*). Knockdown data has been widely used in the literature in gene network inference. For example, Pinna *et al.* [35] showed the effectiveness of inferring gene networks using genetic perturbation expression data followed by graph analyses when applied to synthetic data from the DREAM4 *in silico* network challenge. Subsequently, Pinna *et al.* used non-linear ordinary differential equations (ODE) to generate additional synthetic data to optimize input parameters. Other methods applied to knockdown data include Bayesian networks [6, 36] and correlation-based Gaussian noise models [37]. In addition to accounting for causality in the experimental design, processing perturbation data is generally not too time or resource consuming.

However, the accuracy of gene network inference using knockdown data to some extent depends on the assumption that all the knocked down genes are fully suppressed in the experiments, which could potentially be difficult to accomplish in practice. The difficulties do not only come from the limitations of experimental techniques, but also from the fact that many functional genes are not able to be completely removed or knocked out, although this problem could be partially solved by partially suppressing the target gene. Perturbation experiments also require advanced lab techniques and much more resources compared to time-series and static gene expression data. Therefore, the existing data sources generally contain relatively less knockdown data. For example, in the LINCS L1000 gene expression data most of the knockdown experiments are generated using only 8 cell lines [38]. The lack of data sources could limit the usage of perturbation type data.

### Combined Data

Bonneau *et al.* proposed combining time-series and knockdown gene expression data in which a regression model coupled with bi-clustering algorithms were developed and applied to infer gene networks using data from the archaeon *Halobacterium* [39]. Other studies have combined static and knockdown gene expression data using ordinary differential equation (ODE) based methods with or without external knowledge integration [40, 41].

### Ordinary Differential Equations (ODE)

As mentioned before, many different models have been used for gene network inference including ordinary differential equations (ODE) [8, 9]. By representing the model as linear or non-linear differential equations, the interaction of expression level measured from different genes could be described by variables in the equations. Linear differential equations are generally more highly abstract in terms of describing the gene interactions and could be handled by existing linear algebra methods. Non-linear differential equations are able to describe complex behaviors with the trade-off between computational cost and strict constraints [8, 42]. Differential equations could be employed to model different types of data such as static gene expression data [43, 44] and time-series gene expression data [13].

ODEs could suffer from the curse of dimensionality given a large set of candidate genes such as human gene set. Dimension reduction techniques such as forward feature selection [11], singular value decompositions [12] and principal component analysis [13] have been used in ODE approaches to reduce dimensionality of genomic data.

### Regression-based methods

Regression models are also widely used in gene network inference. The models could be built on linear differential equations [43] or other types of functions to select the variables using a regression approach. By formulating network inference as a variable selection problem, regression-based methods could be solved using existing techniques but suffer from the curse of dimensionality. Commonly used regression-based methods include regularization methods and Bayesian Model Averaging (BMA). Regularization methods such as least absolute shrinkage and selection operator (LASSO) [45], least angle regression (LARS) [46] and elastic net [47] have been employed to model different types of data in gene network inference [48, 49, 50, 51]. Variations of BMA (Bayesian model averaging) methods have been proposed to facilitate gene network inference. Examples include iBMA [52], ScanBMA [16] and fastBMA [17] for analyzing high dimensional gene expression data.

### Bayesian Networks

Another established gene network inference approach is Bayesian networks [26, 53, 7]. A Bayesian network is a flexible framework that can be applied to both continuous and discrete types of data. It allows the incorporation of prior knowledge [54]. Although only statistical causality can be inferred by a Bayesian network, biological causality can be implied with constraints of prior knowledge or expression levels [55, 24].

However, Bayesian network inference could be time consuming given a large set of variables since it is a NP-complete problem [56, 57]. The computational complexity increases exponentially with the number of network nodes. A Bayesian network is a directed acyclic graph (DAG), hence it cannot contain any feedback loops that are ubiquitous in real-life biological systems. As mentioned before, this constraint could be removed with the application of Dynamic Bayesian networks to time-series data.

### Data Integration

Instead of using a single data source, many proposed methods have incorporated external knowledge in the construction of Bayesian networks and other models. For example, Le *et al.* [58] and Geier *et al.* [28] have applied Bayesian networks to synthetic data with prior knowledge. Imoto *et al.* [21] employed Bayesian networks on gene expression data with known regulatory interactions in yeast. Other approaches of data integration in Bayesian networks include James *et al.* [59] using gene expression data in Escherichia coli with literature knowledge, Djebbari and Quackenbush [20] integrating literature knowledge and protein-protein interaction (PPI), and Nariai *et al.* [60] integrating protein-protein interactions and pathway data. Other external knowledge integration methods include linear and non-linear differential equations [61, 62]. Data integration on human microarray data has also been conducted [61]. Multiple source knowledge integration has also been shown to lead to positive results in gene network inference [63, 64].

Integrating different sources of biological information can avoid the bias generated from a single data source. However, external knowledge can also introduce additional noise. The overall impact of data integration depends on the quality, assumptions and biases of the external knowledge.

### Our Contributions

In this paper, we present an approach integrating external data sources with knockdown data from human cell lines for predictive regulatory network inference. Our methods build on the previous work by Young and colleagues [65, 66] in which they developed a Bayesian regression framework to infer gene networks from knockdown expression data. Our key contribution is the supervised learning framework that integrates multiple data sources to derive the prior probabilities of regulatory relationships in humans. Subsequently, we incorporate these prior probabilities in the Bayesian regression-based approach to infer gene networks from knockdown data. Our results show improved accuracy of the inferred gene networks. In addition, we extend Young *et al.* [65] by applying our methods to more than one cell line (skin melanoma cell line A375 and lung cancer cell line A549).

## Methods

### LINCS L1000 Gene Expression Data

The Library of Integrated Cellular Signatures (LINCS) http://lincsproject.org [38] is a National Institutes of Health (NIH) funded program that aims to develop comprehensive signatures of cellular states and related tools. Many types of large scale data were generated to profile changes induced by genetic and drug perturbations across human cell lines. In particular, the LINCS L1000 data generated by the Broad Institute measure the gene expression levels across approximately 1,000 landmark genes. These landmark genes were chosen to capture approximately 80% of the information for 20,000 genes in the human genome. The LINCS L1000 gene expression data are publicly available from the Gene Expression Omnibus (GEO) database with accession number GSE70138 http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE70138.

The L1000 experiments were performed using Luminex Bead technology [67], generating high-throughput gene-expression assays using 384-well plates. To measure the expression level of specific genes, color-coded microspheres bind fluorophore and the corresponding RNA sequence. Therefore, the expression level of each gene could be represented by the intensity of the fluorescence. For a pair of genes, two types of beads sharing one bead color were designed to measure the expression of the two different genes. In each perturbation experiment, about 35,000 to 50,000 beads across 500 bead colors were added to each well to measure the expression levels of approximately 1000 landmark genes.

The beads for each pair of genes were mixed in approximately 2:1 ratio. Therefore, two peaks are expected in a histogram of fluorescence levels, and these observed peaks were deconvoluted to assign expression values to the appropriate pair of genes. To reduce the noise from experimental conditions, there are several wells used for control on each plate. In addition, technical replicate data was generated in which the same perturbations were performed in the same wells across multiple plates.

The L1000 gene expression data was generated and processed by the Broad Institute LINCS Center for Transcriptomics as part of their Connectivity Map project [68]. L1000 gene expression data is publicly available in different formats varying from levels 1 to 5 [69]. Level 1 represents the raw unprocessed data from the Luminex Bead technology. In level 2, the gene expression values of the landmark genes were deconvoluted from the observed fluorescence levels and normalized to a set of internal standards. Subsequently, quantile normalization was performed on these landmark genes, and interpolated to all 20,000 human genes in level 3. The level 4 and level 5 data consist of the gene signatures comparing the perturbed experiments to the unperturbed experiments.

Young *et al.* [66] observed that the deconvolution step introduces artifacts in the data. There are three types of artifacts found and discussed in the expression data of two paired genes on the same bead color. First, the two genes can be assigned the same expression value if their expression levels are not different enough to be distinguished. Second, together with the quantile normalization step, the deconvolution step can generate incorrect additional clusters. Finally, sometimes the paired two genes can be assigned flipped expression values, which means gene A is assigned the expression value of gene B and vice versa. A correction which can be applied to the data to eliminate some effects of these artifacts will be discussed in a later section.

As a large-scale genomics data resource, LINCS L1000 data provides a rich set of human gene expression information. Specifically, we used the knockdown experiments data in this paper. There are approximately 4500 knockdown experiments in L1000 dataset. Most of these data were generated using 8 cell lines: A375, A549, HA1E, HCC515, HEPG2, HT29, MCF7 and PC3. Data from these knockdown experiments is usually collected 96 hours after the perturbation [38].

### Method Outline

We integrate external data sources in gene network inference with human genetic perturbation data from the LINCS L1000 project. We show that the accuracy of our inferred networks is improved by external data integration and MCDC (model-based clustering with data correction). Details of the data integration and MCDC are described in the Methods section. Figure 1 shows an overview of our approach.

Our inferred gene networks consist of directed edges in that we aim to infer causal relationship between gene pairs. This feature of our gene network does not come from the Bayesian regression-based approach but the gene knockdown experimental design. With this biological context, we assume that the observed changes in gene expression level originate from the knocked down gene.

### BayesKnockdown

We use the BayesKnockdown Bioconductor package [70] to calculate posterior probabilities of regulatory relationships. In [65], this BayesKnockdown package was applied to L1000 gene expression data from a single cell line (A375). Here we extend Young *et al.* [65] by integrating additional data sources and by applying the package to an additional cell line (A549).

To prepare the input data for the BayesKnockdown package, the LINCS L1000 knockdown experiments are first transformed by calculating z-scores to account for bias and noise among replicates:
where *x*_{hi} represents the gene expression level of gene *h* and experiment (well) *i* on plate *p*, and *s*_{hp} are, respectively, the mean and standard deviation for gene *h* across all control experiments on plate *p*. A linear regression model is then applied to model the change in a target gene *t* as dependent on the change in the knockdown gene *h*, with as the error term:

In the BayesKnockdown package [65], the linear regression model is estimated with a Bayesian approach using Zellner’s g-prior [71] for the model parameters. The parameter *g* specifies the expected size of the regression coefficient *β*_{1}. The value of g can be estimated using the Expectation-Maximization algorithm [72, 16]. Then the regression model with g-prior is used to calculate the probability *Pr*(*h* → *t*|*x*) that gene *h* regulates gene *t* given the data *x*, versus the probability *Pr*_{0} that there is no regulatory relationship:

In the absence of external data sources, the prior probability of regulation *π*_{ht} is set to 0.0005 for all the gene pairs in Young *et al.* [65]. The value of 0.0005 is derived from prior knowledge for yeast data, reflecting the expected number of regulators per gene [73]. The coefficient of determination *R*^{2} for the simple linear regression is calculated from the correlation of the expression data of gene *h* and gene *t*. Then we have the posterior probability of the regulatory relationship between a given gene pair.

### Data integration using supervised machine learning

We download transcription factor and target gene pairs (TF-G pairs) from the PAZAR database, a public resource for transcription factor and regulatory sequence annotation [74, 75, 76]. Subsequently, we map the target genes from Ensemble IDs to Entrez IDs using Biomart [77]. After the data processing, we keep the TF-gene pairs for which both the TFs and target genes are in the L1000 landmark genes. This results in a total of 232 TF-gene pairs that we label as positive training samples (Y=1) in our supervised framework. Due to a lack of documentation on non-regulatory TF-gene pairs, we randomly generate 240 negative training samples of TF-gene pairs (Y=0) that are not documented in PAZAR.

After collecting the positive and negative training samples of TF-gene pairs, we derive the training data using external data sources to generate attributes in the supervised framework as described below. Table 1 summarizes the different types of attributes defined using external data sources in our supervised learning framework.

Gene expression data across human cell lines. For each TF-gene pair, we compute the Pearson’s correlation between TF and gene across 917 human cell lines from the Cancer Cell Line Encyclopedia (CCLE) [78]. The CCLE data is publicly available from https://portals.broadinstitute.org/ccle/home and from GEO with accession number GSE36133. As another attribute (or variable) in the training data, we also compute the Pearson’s correlation between TF and gene across 675 commonly used human cell lines in the RNAseq data generated by Klijn

*et al.*[79]. The Klijn*et al.*data is publicly available from ArrayExpress with accession number E-MTAB-2706 http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-2706/.Gene ontology (GO). Gene Ontology (GO) defines a controlled vocabulary and descriptions of gene products across biological systems [80, 81]. Genes assigned to the same ontology terms are expected to share common functionalities. Intuitively, we expect regulatory TF-gene pairs to share common GO terms. Since GO terms are hierarchical in nature, we filter out large and hence, less informative GO terms. The upper boundary is set to 100. We define a binary attribute in our supervised framework: if a given TF-gene pair are both assigned to the same GO term, we define the binary variable to be 1, otherwise 0.

Genome-wide binding data. We also use genome-wide binding data (ChIP-chip, ChIP-seq) from ENCODE [82]. The chromatin immunoprecipitation (ChIP) technology could be used to detect binding between proteins and DNA

*in vivo.*Since transcriptional regulation is typically preceded by binding, we define a binary attribute for a (TF, gene) pair to be 1 if TF binds gene, and 0 otherwise. We derive these binary variables by parsing the processed ChIP data from the ENRICHR website [83].Pathways data. We hypothesize that a regulatory (TF, gene) pair is more likely to be assigned to the same biochemical pathways. Therefore, we define a binary variable for each of WikiPathways [84], KEGG [85], BioCarta [86] and Reactome [87, 88]. If TF and gene appear in the same pathway, we define the binary variable to be 1, otherwise 0. We derive these binary variables by parsing the processed library data from the ENRICHR website [83] http://amp.pharm.mssm.edu/Enrichr/.

Some attributes in our original supervised learning framework have very similar values, which decreases the informativeness of the attributes. We iteratively remove one attribute at a time and apply logistic regression to filter down to 14 attributes in our final supervised learning framework. After finalizing the training data used in the supervised learning framework, we perform 10-fold CV using different machine learning methods on our supervised framework. Table 2 summarizes the results. Logistic regression yield the highest AUROC (0.76) in our cross validation studies and is used to compute prior probabilities of regulatory relationships in next steps.

### Sampling bias correction

After obtaining priors, we correct for sampling bias in the prior as previously reported in Lo *et al.* [15]. Specifically, we add an offset of *log*(π_{1}/π_{0}) to the log odds in our logistic regression model. Here π_{1}and π_{0} are the sampling rates for positive and negative cases respectively in the training data. We use the prior knowledge from Lo *et al.* that the average number of regulators for each gene is about 2.76. Knowing there are approximately 20000 human genes, we then compute π_{1} = 232/(20000 × 2.76), π_{0} = 240/(20000 × (20000 - 2.76)) and π_{1}/π_{0} = 7003.864 from our 232 positive instances and 240 negative instances in the supervised framework. Figure 2 shows the histograms before and after the correction in cell line A375. Figure 3 shows the histograms in cell line A549. Threshold probability values are set to 0.5.

The intuition behind this correction is that our supervised training data consists of approximately the same proportion of positive and negative training samples (TF-G pairs). However, the positive cases are expected to be rare in real situations. Therefore we perform this correction to better match the biological relationships between genes in practice.

### MCDC

MCDC (Model-Based Clustering with Data Correction) is a method intended to remove artifacts in gene expression data [66]. The gene expression data was originally paired by bead color when the LINCS data was generated. In the deconvolution step during the processing, sometimes additional noisy clusters can be generated and the expression values of the paired genes can be reversed. MCDC corrects this using model-based clustering.

Model-based clustering [93, 94, 95, 96] assumes that the data come from a distribution consisting of a mixture of multiple components. Each of these components can be modeled by a Gaussian distribution with parameters which can be estimated using an EM (Expectation-Maximization) algorithm. MCDC extends model-based clustering to detect flipped points in the gene expression data.

Instead of the original data, MCDC uses a transformation matrix to determine whether each data point should be corrected. Clustering is then done with both the original and transformed data, resulting in a probability of transformation for each data point. This method thus could be used to identify flipped data points. Furthermore, to eliminate the effect of noisy clusters generated from expression level estimating process, the expression levels of the paired genes are estimated as the mean of the largest cluster after selecting the best model in MCDC.

MCDC runs with the number of clusters ranging from 1 to some maximum number. Here, the maximum number is set to 9. The best number of clusters and the best model are then selected using the BIC (Bayesian information criterion) values [96].

Our work applies MCDC to the untreated data in cell line A375 and A549. The untreated data is then used as the control to the knockdown data.

### Assessment

The high-quality transcription factor binding profile (JASPAR) database [97] provides experimentally defined transcription factor (TF) DNA-binding sites for eukaryotes. We use the TF-gene targeting relationships in this data to assess the resulting gene networks. We use Fisher’s exact test to calculate the p-values of contingency table.

A Pearson’s chi-square test is applied to a 2 × 2 contingency table to assess the consistency of our constructed network with the known regulatory relationships. Table 3 shows an example contingency table with the definitions of TP, FP, TN and FN. Precision is defined as *T P*/(*TP* + *FP*).

To assess our result, we use TRANSFAC and JASPAR [97] lists of edges as our reference standard. The TRANSFAC and JASPAR (T&J) edgelist contains approximately 4200 edges for 37 transcription factors that overlaps LINCS landmark genes. This is the same gold standard which was used in Young *et al.* [65, 66]. Although the T&J edgelist is limited to transcription factors that are previously well-studied, it is difficult to find a comprehensive standard for gene network assessment in mammalian systems at current stage.

## Results and Discussion

### Results:NIH LINCS Data A375

We evaluate the performance of our proposed method by comparing our inferred networks to the T&J dataset using contingency tables. Table 4 shows the assessment. We used two cutoff posterior probability values, 0.5 and 0.95, as thresholds for positive edges. The two tables in the first row are computed using knockdown data only. The two tables in the second row correspond to the network inferred using knockdown data and our external knowledge integration. The two tables in the last row show the assessment results of the network inferred from knockdown data using untreated data with MCDC correction as control, as well as external knowledge integration.

We observe that both MCDC and integrating prior knowledge improve both the p-value from the Fisher’s exact test and the precision. By applying both MCDC and external knowledge integration, the p-values were improved from around 0.01 to around 0.001. Also, the precision increased from 0.14 to 0.17 at 0.5 cutoff, and 0.2 to 0.3 at 0.95 cutoff.

Next, we compare our inferred edges to the T&J dataset by ranked lists. The assessment results are shown in Table 5. We first identify all the edges in the intersection of our predicted edges and T&J edges. Then we rank these found edges by posterior probabilities from our prediction in descending order. Finally, we rank all our predicted edges by posterior probabilities in descending order. For each edge also in T&J dataset, we note down the corresponding ranking in our edgelist for the same edge. With the edge ranks we not only assess our inferred networks by the found edges, but also involved the values of posterior probabilities.

As an example, for the first column “Knockdown Data”, the sixth edge in our edgelist is the first edge found in T&J in Table 5. This number means that the top 5 edges in our edgelist are not found in T&J. These ranked lists can help us to determine the differences between T&J and our own edgelist.

We can see that the external knowledge integration improves the results of middle-ranked edges. As another example, In the third column that corresponds to the network inferred using both the L1000 knockdown data and external knowledge integration, we can see that the 25th found edge is ranked 127th in our edgelist. In the first column, from knockdown data only, the 25th found edge is ranked 135th in our edgelist. The larger difference in rankings indicates a larger difference between our prediction and T&J dataset.

Figure 4 shows precision-recall curves under different combinations of data and prior. After external knowledge integration, the area under the curve has been improved. Furthermore, with the MCDC correction to the untreated data the area under the curve has been further improved.

Our gene network inferred from A375 gene expression data is shown in the form of a directed graph in two figures. We used A375 knockdown data and MCDC-corrected untreated data, then integrated priors from our supervised framework to infer this gene network. Figure 5 shows all the inferred edges at a cutoff of 0.5. Figure 6 shows all the true positive edges found in TRANSFAC and JASPAR database at a cutoff of 0.5. Some of our inferred edges are also found in literature such as *CREB1 → JUN* [98, 99, 100]. Compared to another cell line we worked on, A375 cell line has less noise and therefore constructs a more reliable gene network.

### Results: Lung Cancer A549

We apply our proposed methods and assessment criteria to another cell line A549. Table 6 shows the contingency table comparing T&J and our result. The two thresholds for positive edges are again set to 0.5 and 0.95. Similar to the results on cell line A375, we observe that the MCDC correction and integrated prior knowledge also improve the p-value and precision in A549, although the effect applying MCDC is not as significant as in A375. By applying both the MCDC correction and external knowledge integration, the p-values were improved from 0.001 level to 0.0001 level. Also, the precision increased from 0.13 to 0.15 at 0.5 cutoff, and 0.14 to 0.15 at 0.95 cutoff.

The assessment results of edge ranks are shown in Table 7. As in A375, external knowledge integration and MCDC correction improve the results of middle-ranked edges. Figure 7 shows the precision-recall curves for the A549 cell line. After external knowledge integration, the area under the curve has been improved. The MCDC correction to the untreated data has not further improved the result significantly, which is different from A375.

## Conclusions

In this paper, we present an approach that integrates external data sources with knockdown data from human cell lines for gene network inference. Our key contribution is to present a supervised learning framework to systematically integrate multiple data sources, and we demonstrate the flexibility of our Bayesian regression framework to integrate multiple data sources. Furthermore, the Bayesian approach we used to estimate our linear regression model is highly computationally efficient. Our ultimate aim is causal inference, and some basis for this is provided by the experimental design of the knockdown data. In addition, we extend Young *et al.* [65] by applying our methods to more than one cell line (skin melanoma cell line A375 and lung cancer cell line A549). Note that the supervised learning step is universal for all cell lines. When we apply our methods to additional cell lines, we only need to repeat the regression step.

We show that the accuracy of the inferred gene networks is improved with the help of prior knowledge integration and MCDC. One point notable from our results is that the improvement in performance varies from cell line to cell line. Studying the different response to the data integration and MCDC of different cell lines remains a meaningful future direction.

## Availability of supporting data

The scripts used in the current study are available in the GitHub repository, https://github.com/xlxlx/GeneNetworkInference

## Competing interests

The authors declare that they have no competing interests.

## Author’s contributions

KYY envisioned this project, identified the datasets, designed the supervised learning framework and the empirical studies. AER and WCY designed the BayesKnockDown algorithm and edited the manuscript. WCY wrote and provided the R code for BayesKnockDown and model-based clustering data correction. XL extended WCY’s R code, performed the empirical studies, compared the performance of various machine learning algorithms, and created all the figures and tables. XL and KYY wrote the manuscript. LHH participated in the design of the empirical studies, project discussions and edited the manuscript. All authors read and approved the final manuscript.

## Acknowledgements

We would like to thank Bo Ding for her contributions to the data processing and the supervised learning step. AER, LHH, KYY and WCY are supported by NIH grant U54 HL127624. AER is also supported by NIH grants R01 HD054511 and R01 HD070936.

## Footnotes

xlianguw{at}uw.edu, wmchad{at}gmail.com, lhhung{at}uw.edu, raftery{at}uw.edu, kayee{at}uw.edu

## References

- [1].↵
- [2].↵
- [3].↵
- [4].↵
- [5].↵
- [6].↵
- [7].↵
- [8].↵
- [9].↵
- [10].↵
- [11].↵
- [12].↵
- [13].↵
- [14].↵
- [15].↵
- [16].↵
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].↵
- [45].↵
- [46].↵
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵
- [63].↵
- [64].↵
- [65].↵
- [66].↵
- [67].↵
- [68].↵
- [69].↵
- [70].↵
- [71].↵
- [72].↵
- [73].↵
- [74].↵
- [75].↵
- [76].↵
- [77].↵
- [78].↵
- [79].↵
- [80].↵
- [81].↵
- [82].↵
- [83].↵
- [84].↵
- [85].↵
- [86].↵
- [87].↵
- [88].↵
- [89].↵
- [90].↵
- [91].↵
- [92].↵
- [93].↵
- [94].↵
- [95].↵
- [96].↵
- [97].↵
- [98].↵
- [99].↵
- [100].↵