Joint Gene Network Construction by Single-Cell RNA Sequencing Data

In contrast to differential gene expression analysis at single gene level, gene regulatory networks (GRN) analysis depicts complex transcriptomic interactions among genes for better understandings of underlying genetic architectures of human diseases and traits. Recently, single-cell RNA sequencing (scRNA-seq) data has started to be used for constructing GRNs at a much finer resolution than bulk RNA-seq data and microarray data. However, scRNA-seq data are inherently sparse which hinders direct application of the popular Gaussian graphical models (GGMs). Furthermore, most existing approaches for constructing GRNs with scRNA-seq data only consider gene networks under one condition. To better understand GRNs under different but related conditions with single-cell resolution, we propose to construct Joint Gene Networks with scRNA-seq data (JGNsc) using the GGMs framework. To facilitate the use of GGMs, JGNsc first proposes a hybrid imputation procedure that combines a Bayesian zero-inflated Poisson (ZIP) model with an iterative low-rank matrix completion step to efficiently impute zero-inflated counts resulted from technical artifacts. JGNsc then transforms the imputed data via a nonparanormal transformation, based on which joint GGMs are constructed. We demonstrate JGNsc and assess its performance using synthetic data. The application of JGNsc on two cancer clinical studies of medulloblastoma and glioblastoma identifies novel findings in addition to confirming well-known biological results.


Introduction
Complex transcriptomic interactions exist among genes (Allocco et al., 2004;Trapnell et al., 2014). Deciphering such genetic architecture is essential in understanding the mechanism of complex diseases (Buil et al., 2015). Gene regulatory network (GRN) analysis is an attractive way to build transcriptional relationships among genes in providing a comprehensive view of gene expression dependencies (Wolfe et al., 2005;Peng et al., 2005;Margolin et al., 2006;Faith et al., 2007;Langfelder and Horvath, 2008;Sales and Romualdi, 2011;Ballouz et al., 2015;Ha et al., 2015;van Dam et al., 2018;Grimes et al., 2019;Ramirez et al., 2020;Van de Sande et al., 2020). In a GRN, individual genes are represented as nodes and gene pairs with interactions are connected by edges (Chen and Mar, 2018).
In practice, one focus area in biological and clinical research is the study of GRN changes across different conditions since transcriptomic interactions often vary across different tissues, cell types and cell states. Estimating a single network that applies to multiple conditions may not be appropriate and can lead to a biased estimation. In contrast, separately constructing graphical models under each condition is statistically valid (Lee et al., 2018). However, its performance may be improved by utilizing the network similarity across the related conditions. As such, joint graphical modeling is preferred (Danaher et al., 2014;Oates et al., 2014;Chun et al., 2015;Yang et al., 2015;Haslbeck and Waldorp, 2015;Cai et al., 2016;Qiu et al., 2016;Hallac et al., 2017;Huang et al., 2017;Lyu et al., 2018;Zhu and Koyejo, 2018;Lee et al., 2018;Geng et al., 2019;Jia and Liang, 2019;Zhang et al., 2019;Wu et al., 2019Wu et al., , 2020. GGMs and their adaptive models are popular for constructing GRNs with microarray data due to their ease of interpretation. Recently, bulk RNA-seq data and newly emerging scRNAseq data enable researchers to explore the gene transcriptional relationships at a much higher resolution, further expanding their knowledge on genetic mechanisms and novel therapeutic targets. Many network construction methods (Bansal et al., 2007;Chai et al., 2014;Karlebach and Shamir, 2008;Marbach et al., 2012;De Smet and Marchal, 2010) have been developed for bulk RNA-seq data, including the popular GGMs. However, they cannot be readily applied to scRNA-seq data (Chen and Mar, 2018;Blencowe et al., 2019). Though both are count data, scRNA-seq data are inherently more sparse with many inflated zero counts. Hence, traditional data processing and normalization methods (e.g., log-transformation) that work well for bulk RNA-seq data may no longer work for the scRNA-seq data (Townes et al., 2019).
The excessive amount of zeros in the scRNA-seq data are due to either technical artifacts (false zeros), known as "dropouts" (Hicks et al., 2018), or biological absence of expression (true zeros) (Jiang et al., 2017;Dong and Jiang, 2019). Only the technical artifacts need to be imputed in the downstream analysis. However, there is no experimental way to differentiate the two types of zeros which challenges the pre-process and analysis of scRNA-seq data.
Furthermore, most existing GRN models for scRNA-seq data are developed for homogeneous cell population under one condition (Matsumoto et al., 2017;Chan et al., 2017;Aibar et al., 2017;Woodhouse et al., 2018;Papili Gao et al., 2018;Sanchez-Castillo et al., 2018;Iacono et al., 2018;Chiu et al., 2018;Deshpande et al., 2019;Aubin-Frankowski and Vert, 2020;Blencowe et al., 2019;Pratapa et al., 2020;Cha and Lee, 2020) with joint GRN methods across multiple conditions just beginning to emerge (Wu et al., 2020). Although the Gaussian copula graph model of Wu et al. (2020) is on joint GRN construction for scRNAseq data, its major focus is on the novel penalty function that it employs. Additionally, and more importantly, the model directly applies the nonparanormal transformation (Liu et al., 2009) to the count data, which, as mentioned above, may not be appropriate for the single-cell setting (Yoon et al., 2019). To facilitate GGMs and properly employ the Gaussian copula model, for bulk RNA-seq data, Jia et al. (2017) propose to impute and continuize the read counts by learning the posterior distribution under a Bayesian Poisson model framework. Inspired by Jia et al. (2017), we propose to impute and continuize the scRNA-seq count data by a Bayesian zero-inflated Poisson (ZIP) model in which the zeroinflation part accommodates and predicts the dropout events. The choice of the Poisson distribution is supported by the increasing evidence of empirical studies (Townes et al., 2019;Svensson, 2020;Kim et al., 2020). After the dropout zeros are predicted and initially imputed with the Bayesian ZIP model, we further improve the imputation with McImpute (Mongia et al., 2019) to borrow information across genes in realizing that the gene-specific ZIP model does not utilize information across genes. McImpute imputes the dropout events based on low-rank matrix completion, yet it may over-simplify the imputation procedure as it does not employ the gene expression distribution information.
Here, we introduce a comprehensive tool for constructing Joint Gene Networks with scRNAseq (JGNsc) data from different but related conditions. The framework consists of three major steps as shown in Figure 1: (1) a hybrid iterative imputation and continuization procedure; (2) a nonparanormal transformation of the processed data from step (1); and (3) joint GRNs construction which employs the fused lasso technique (Danaher et al., 2014). There are two tuning parameters associated with the model in step (3). Hence, we also investigate the performances of several tuning parameter selection criteria on them, including Akaike information criterion (AIC) (Akaike, 1998), Bayes information criterion (BIC) (Schwarz et al., 1978), Extended Bayesian Information criterion (EBIC) (Chen and Chen, 2008), and the stability approach to regularization selection (StARS) (Liu et al., 2010). we present the proposed method in Section 2, show results from empirical evaluations and real data analysis in section 3, and conclude in section 4.

Hybrid Imputation Procedure for scRNA-seq Data
The proposed Bayesian framework largely follows Jia et al. (2017), in which we replace the Poisson model with the ZIP model to take account the dropout events in scRNA-seq data.
For gene i (i = 1, 2, ..., G) and cell j (j = 1, 2, ..., n), where G is the total number of genes and n is the total number of cells, let the observed expression read count be y ij , and the latent variable for indicating the non-dropout event be: . Assuming the gene-specific dropout rate is constant across samples within a homogeneous group, or 1 − p i , the ZIP model is where δ(0) is the Dirac delta function with a unit mass concentrated at zero; θ i is the mean expression factor for gene i after being scaled for the library sizes. The priors that we place on θ i and p i are That is, the prior of the mean gene expression θ i follows a Gamma distribution with two gene-specific parameters α i and β i . The priors for α i and β i are further assumed to be Gamma distribution, and are a priori independent with hyperparameters a 1 , b 1 , a 2 , b 2 , a 3 , b 3 .
The full conditional posterior distributions for the parameters of interest are given below (see supporting information for more details): We consider y ij as a dropout event if the averaged value T t=1 z (t) ij /T is greater than a prechosen threshold, such as 0.75, which we use in the simulated and real data analysis, and impute it toθ i × τ j , whereθ i can be the posterior mean or mode of θ i . We refer the imputed values here as MCMC-imputed values, and let the imputed matrix beΘ. Instead of imputing all zero events as done in Jia et al. (2017) for bulk RNA-seq data, we choose to only impute the zeros that are predicted to be dropouts.
In realizing thatθ i is estimated one gene at a time, we introduce a hybrid imputation procedure that integrates the Bayesian ZIP model and McImpute (Mongia et al., 2019) to further improve the imputation accuracy. McImpute is a low-rank matrix completion method for imputing scRNA-seq data. It borrows expression information from other genes and cells while performing imputation. McImpute has the following objective function: Within each iteration, α% of MCMC-imputed values for dropouts are masked back to zeros.
The impact of α is investigated in the simulation study.
Algorithm 1 JGNsc Hybrid imputation algorithm 1: Input: Single-cell gene expression matrix Y , number of iterations B.

5:
Mask α% of the MCMC-imputedΘ values back to zero, and define 6: the new data matrix as M .

Joint Graphical Lasso Model
Prior to building the joint GGMs, we apply the nonparanormal transformation (Liu et al., 2009) to the final imputed data. The nonparanormal transformation is a popular procedure for pre-processing non-Gaussian distributed data that follow a nonparanormal distribution (Jia et al., 2017;Yoon et al., 2019;Wu et al., 2020).
. When the f j 's are monotonic and differentiable, the nonparanormal distribution N P N (µ, Σ, f ) is a Gaussian copula, and the conditional independence structure of the original graph is preserved in the precision matrix Ω = Σ −1 .
After the imputation and the Gaussian transformation, we then employ the Joint Graphical Lasso (JGL) (Danaher et al., 2014), which extends GGMs to simultaneously estimate multiple related precision matrices. For K groups of Gaussian data, denote the precision matrix under condition k (k = 1, · · · , K) as Ω (k) , with ω (k) uv being the (u, v) th entry. The task of JGL is to: where n k is the number of observations under condition k, S (k) = (Y (k) ) T Y (k) /n k is the empirical covariance matrix for the k th expression dataset, and P ({Ω}) is a convex penalty function which is defined as under the fused lasso framework. The penalty function involves two tuning parameters λ 1 and λ 2 , where λ 1 ensures a sparsity solution of the graphical lasso model, while λ 2 enforces the parameters to be shared across conditions.
For tuning parameter selection, many different criteria are available, and the choice of a criterion is largely driven by data and research goals (Wysocki and Rhemtulla, 2019) in practice. There is no one gold standard that has been established for scRNA-seq data. Via simulations, we evaluate four popular tuning parameter selection criteria that are adapted to the JGL models, including AIC, BIC, EBIC, and StARS with detailed definitions given below: where E k is the number of non-zero elements inΩ (k) λ 1 ,λ 2 , 0 γ 1 is a parameter for EBIC and is set to 0.5 in simulation to control false discovery rate while maintaining positive selection rate. We adapt the StARS method to the JGL models by optimizing one parameter at a time iteratively with details outlined in the supplements.

Simulation of scRNA-seq Data under Different Conditions
To simulate scRNA-seq data under different but related conditions, we first simulate the conditional independence structures based on an inverse nonparanormal transformation algorithm developed by Yahav and Shmueli (2012), which is later applied by Jia et al. (2017) according to the simulation scheme below. For each condition k (k = 1, 2, · · · , K), (a) Randomly sample from a multivariate Gaussian distribution with a known precision matrix Ω (k) (Details available in Web supporting information). Denote the random samples by X 1 , ..., X G , where each variable X i = (X i1 , ..., X in ) T consists of n realizations.
(b) For each variable X i , derive the empirical CDF based on the n realizations, and then calculate the cumulative probability for each X ij .
(c) Generate the kth scRNA-seq count variables with pre-specified parameters X * 1 , ..., X * G , where X * i = (X * i1 , ..., X * in ) T consists of n realizations. X * ij is sampled from a Poisson distribution with true mean level θ i , which is then assumed to be generated from Gamma distribution. Derive the empirical CDF for each X * i , and then calculate the cumulative probability for each X * ij .
(d) Map the quantiles of the data points in X * i , which is generated in (c) to the cumulative probability values calculated from (b). Denote the mapped counts for variable i as Y ij .
(e) Based on the zero-inflation parameter distribution, we simulate the random dropout event for each data point z ij . The final count matrix will be presented as (z ij × Y ij ).

Results
This session will show analysis results from our simulation study as well as two real data examples.

Simulation Settings and Results
We simulate scRNA-seq data under K = 2 conditions following the strategy described in section 2.3. The number of genes is set to G = 100. We vary the sample size, the precision matrix structure, and the dropout rate. Since n < G and n G are both possible in real single-cell studies, we set n 1 = n 2 = n ∈ {50, 100, 200, 300, 500}. To evaluate the impact of the mask rate α on the performance of JGNsc, we also vary α at {5%, 10%, ..., 95%}. Web We continuize and impute the dropout events in the scRNA-seq data using a selected set of ∼ 6K genes (genes of interest and genes with non-zero counts in at least 300 cells in each group), and use the enzyme-related genes from mammalian metabolic enzyme database (Corcoran et al., 2017) (2) in Group 3 MBs, where Otx2 is thought to be functionally cooperating with Myc (Lu et al., 2017;Bunt et al., 2011;Boulay et al., 2017), its connection to metabolic genes was distinct from Myc's, highlighting the unique link between Myc and metabolism in Group 3 MB cells ( Figure 3B). Further Gene Set Enrichment Analysis (GSEA) based on the previously defined metabolic pathway (Corcoran et al., 2017) reveals that overlapping yet distinct metabolic pathways were connected to each of the two TFs ( Figure 3C). Collectively, these results from the JGNsc analysis suggest that the role of Myc in regulating the expression of metabolic genes is MB-subtype-dependent, and that Myc and Otx2 likely play diverse roles in regulating the transcription of metabolic genes.
Next, we extend the JGNsc analysis to a scRNA-seq data from another type of brain tumor, GBM. With a median survival of about 12-15 months, GBM is the most common and lethal brain tumor, due to its intratumoral heterogeneity and treatment resistance. It is believed that these features of GBMs are largely due to the existence of a subset of tumor initiating cells -GBM stem cells (GSCs) (Lathia et al., 2015), and that Myc, a TF in regulating the expression of metabolic genes, is essential for sustaining this population of tumor cells (Wang et al., 2008(Wang et al., , 2017. To test the connection between Myc and the expression of metabolic enzyme-encoding genes in GBM tumor cells exhibiting prevalent heterogeneity, we apply JGNsc to a subset of GBM scRNA-seq data (GSE131928) from Neftel et al. (2019).

Discussion
We propose an integrated framework JGNsc to construct joint gene networks using scRNAseq data under different conditions. The JGNsc framework consists of three major steps: first, continuize and impute scRNA-seq data using a MCMC procedure; second, transform the data into Gaussian form using nonparanormal distribution; and third, jointly construct gene network using JGL. The novelty of our paper lies in the MCMC data continuization step for scRNA-seq data, where iterative matrix completion imputation is implemented. This step helps to correct the posterior distribution of non-dropout event, and in turn improve the expression estimation of the dropouts.
In our simulation, we also demonstrate that a proper imputation of scRNA-seq data would greatly improve the downstream joint network construction performance. In addition, JGNsc performs consistently better than the other benchmark methods when we vary the sample size and data structure. By transforming the estimated precision matrices into the partial correlation matrices, our method also allows the investigation of the relative magnitude and direction of gene-wise connections.
As future research, we may adopt the following penalty function: uv | γ is the weight assigned for the pair of ω  Figure 2. Benchmark of data processing methods with sample size varying. As the sample size increases, the overall performance is improved for each method. The JGNsc Hybrid method outperforms other benchmark methods in terms of all four metrics for the joint network partial correlation.  Figure 3. JGNsc networks for Medulloblastoma data. In a network, each node is a gene and each edge shows the partial correlation between a pair of genes. If the partial correlation is zero, there is no connection between this pair of genes. Further, if two genes are connected by a red line, then their partial correlation is positive; otherwise if they are connected by a green line, their partial correlation is negative. (A) Network visualization of JGNsc results for Myc related genes from purine enzymes metabolism pathway. Connections between non-MYC related genes are not shown. (B) Network visualization of JGNsc results for Otx2 related genes from purine enzymes metabolism pathway. Connections between non-Otx2 related genes are not shown. (C) GSEA for metabolism enzymes genes connected to Myc or Otx2. Fisher's exact test was performed for each group and each pathway respectively. The pathways with significant p-value for Fisher's exact test for either Myc or Otx2 connected gene set under a condition. The complete results are available at supporting information Web Figure 12.