scTenifoldKnk: a machine learning workflow performing virtual knockout experiments on single-cell gene regulatory networks

Gene knockout (KO) experiments are a proven approach for studying gene function. A typical KO experiment usually involves the phenotypic characterization of KO organisms. The recent advent of single-cell technology has greatly boosted the resolution of cellular phenotyping, providing unprecedented insights into cell-type-specific gene function. However, the use of single-cell technology in large-scale, systematic KO experiments is prohibitive due to the vast resources required. Here we present scTenifoldKnk—a machine learning workflow that performs virtual KO experiments using single-cell RNA sequencing (scRNA-seq) data. scTenifoldKnk first uses data from wild-type (WT) samples to construct a single-cell gene regulatory network (scGRN). Then, a gene is knocked out from the constructed scGRN by setting weights of the gene’s outward edges to zeros. ScTenifoldKnk then compares this “pseudo-KO” scGRN with the original scGRN to identify differentially regulated (DR) genes. These DR genes, also called virtual-KO perturbed genes, are used to assess the impact of the gene KO and reveal the gene’s function in analyzed cells. Using existing data sets, we demonstrate that the scTenifoldKnk analysis recapitulates the main findings of three real-animal KO experiments and confirms the functions of genes underlying three Mendelian diseases. We show the power of scTenifoldKnk as a predictive method to successfully predict the outcomes of two KO experiments that involve intestinal enterocytes in Ahr-/mice and pancreatic islet cells in Malat1-/mice, respectively. Finally, we demonstrate the use of scTenifoldKnk to perform systematic KO analyses, in which a large number of genes are virtually deleted, allowing gene functions to be revealed in a cell type-specific manner.


Introduction
Gene knockout (KO) experiments are a proven approach for studying gene function. A typical KO experiment involves the phenotypic characterization of organisms that carry the gene KO. For example, in KO mice, a gene is made inoperative or knocked out using genetic techniques. Deletion of one or more alleles provides mechanistic insight as to how the target gene functions within a particular biological context. Gene function may be predicted from phenotypic differences between the KO and wild-type (WT) samples, and its expression is a molecular phenotype that can be quantitatively measured. Gene expression is regulated in a coordinated manner in all living organisms. Typically, regulation is observable as synchronized patterns of transcription in a gene regulatory network (GRN). Co-regulation and co-expression are often seen among genes associated with the same biological processes and pathways or regulated by the same master transcription factors [1]. If one gene is knocked out, functionally related genes can mediate a homeostatic response. Thus, in unraveling regulatory mechanisms and synchronized patterns of cellular transcriptional activities, network analysis of gene expression provides mechanistic insights.
The recent advent of single-cell technology has greatly improved cellular phenotyping resolution-e.g., droplet-based single-cell RNA sequencing (scRNA-seq) makes it possible to profile transcriptomes of thousands of individual cells per assay. Applications of single-cell technology in KO experiments allow investigators to probe gene function in a cell type-specific manner. Large-scale, systematic KO experiments aim at deleting many genes individually in an organism. The use of scRNA-seq in such a large-scale systematic KO experiment would allow the function of each gene in various cell types to be studied at a single-cell level of resolution. However, limited resources tend to prohibit this kind of experiments, especially when multiple genes are targeted. Also, essential genes cannot be knocked out. For that reason, predictive computational methods stand as a possible solution for such prohibitive experiments [2][3][4]. Taking advantage of the synchronized gene expression variability allowing the detection of co-expression between genes belonging to the same biological processes or being under the effect of the same transcription factors, it is now possible to construct personalized transcriptome-wide GRNs that are sample, tissue, and cell type-specific [5]. The topology of those biological networks is known to serve as a basis to accurately predict perturbations, like those caused by a KO gene [2,6].
Here we present a machine learning workflow, called scTenifoldKnk, that can be used to perform virtual KO experiments from the topology of the constructed GRNs. scTenifoldKnk takes expression data from scRNA-seq of wild-type (WT) samples as input and constructs a denoised single-cell GRN (scGRN) using principal component (PC) regression and tensor decomposition. The WT scGRN is copied and then converted to a pseudo-KO scGRN by artificially zeroing out the target gene in the adjacency matrix. A quasi-manifold alignment method [7,8] is adapted to compare the two scGRNs (WT vs. pseudo-KO). Through comparing the two scGRNs, scTenifoldKnk predicts changes in transcriptional programs and assesses the impact of KO on the WT scGRN. This information is then used to elucidate functions of the KO gene in analyzed cells through enrichment analysis.
ScTenifoldKnk is computationally efficient enough to allow the method to be applied to systematic KO experiments. In such a systematic study, we assume that thousands of genes in analyzed cells will be knocked out one by one. As mentioned, due to the experimental and biological limitations, such systematic KO experiments would never be possible in a real-animal experimental setting. The other features of scTenifoldKnk include: (1) scTenifoldKnk requires no data from KO animals or cell lines, as it only utilizes the scRNA-seq data from WT samples; and (2) scTenifoldKnk can be used to perform double-KO experiments-that is, to knock out more than one gene at a time.
The remainder of this paper is organized as follows. We first present an overview of the workflow of scTenifoldKnk and use simulated data to demonstrate its basic functions. We then use existing data generated from authentic-animal KO experiments to highlight the use of scTenifoldKnk. These existing data sets contain scRNA-seq expression matrices from both WT and KO samples. Although the KO data sets were available, they were not used by scTenifoldKnk as input. Instead, the KO data sets were specifically used as positive controls to show that scTenifoldKnk can produce expected results. Next, we show applications of scTenifoldKnk to delete genes that cause three different Mendelian diseases with known phenotypic outcomes. Finally, we conduct our own KO animal experiments to validate the utility of scTenifoldKnk as a predictive tool and discuss the contribution of scTenifoldKnk to the future of predictive biology.

Results
Result 1: The scTenifoldKnk workflow scTenifoldKnk takes a single gene-by-cell count matrix from the WT sample as input. Eventually, scTenifoldKnk employs a network comparison method to compare a pseudo-KO scGRN to the WT scGRN to identify differentially regulated genes. These genes are called virtual-KO perturbed genes. From the enriched function of these perturbed genes, the function of the KO gene (i.e., the gene that is virtually knocked out) can be inferred. scTenifoldKnk is implemented with a modular structure containing three core modules illustrated in Figure 1 and briefly summarized in three steps as follows.
Step 1: Constructing scGRN with scRNA-seq data from a WT sample With the scRNA-seq data from a WT sample, scTenifoldKnk first constructs an scGRN using a pipeline we proposed previously, namely scTenifoldNet [9]. This network construction step contains three sub-steps ( Figure 1A):

. Subsampling cells randomly
Denote as the scRNA-seq expression data matrix, which contains the expression levels for genes and cells. Then, (< n) cells in are randomly sampled to form ′ using an m-out-of-n bootstrap procedure. This subsampling process repeats times to create subsets of cells 1 ′ , … , ′ .

Substep 1.2. Constructing a GRN for each subsampled set of cells
For each ′ , principal component (PC) regression is run times to construct a GRN. Each time the expression level of one gene is used as the response variable and the expression levels for the remaining genes as dependent variables. The constructed GRN from ′ is stored as a signed, weighted and directional graph, represented with a × adjacency matrix , each of whose columns stores the regression coefficients for the PC regression of a gene.
is then normalized via dividing by the maximal absolute value.

Substep 1.3. Denoising adjacency matrices to obtain the final GRN
Tensor decomposition is used to denoise the adjacency matrices { } obtained from the PC regression step. First, the collection of { } for GRNs is processed as a third-order tensor Ξ, containing × × elements. Next, the CANDECOMP/PARAFAC (CP) decomposition is applied to decompose Ξ into components. Then, Ξ is reconstructed using top components to obtain denoised tensors Ξ . Denoised { } in Ξ are collapsed by taking the average of edge weights to obtain the final averaged matrix, .
Step 2: Generating pseudo-KO scGRN by virtually knocking out a gene In the last step, the scRNA-seq expression data matrix from the WT sample, , is first used to construct the WT scGRN. In this step named virtual KO, the adjacency matrix of the WT scGRN, , is copied, and then the entire row of corresponding to the target gene is set to 0 ( Figure  1B). In this way, the virtual KO operation is performed on directly. The modified is denoted as � , that is, the adjacency matrix of pseudo-KO scGRN.
Step 3. Comparing scGRNs to identify virtual-KO perturbed genes In this step, we assume that WT and pseudo-KO scGRNs, and � , have been obtained. A quasi-manifold alignment method is then used to align and � (Figure 1C, see Methods for details). All genes included in the two scGRNs are projected in a k-dimensional space, where k << p. After the projection, each gene has two low-dimensional representations: one is in respect to and the other � . For each gene , is the Euclidean distance between the gene's two projections. The greater , the more significant the differential regulation. Finally, a χ 2 test is applied to detect significantly differentially regulated (DR) genes, i.e., virtual-KO perturbed genes.
A more detailed description of scTenifoldKnk modules is in the Methods section. corresponding to the KO gene is set to zero. The modified 〈 〉 is called the pseudo-KO scGRN. (C) Quasimanifold alignment. This method is used to learn latent representations of two networks, 〈 〉 and 〈 〉 , and align them based on their underlying manifold structures. The distance between a gene's projections with respect to the two scGRNs on the low-dimensional latent representation is used to measure the level of differential regulation of the specific gene. The significantly differentially regulated genes are identified as virtual-KO perturbed genes.
Result 2: Virtual knockout experiments using simulated scRNA-seq data We first used the simulated data to validate the relevance of our method. We generated a synthetic scRNA-seq data set using the simulator SERGIO-a single-cell expression simulator guided by GRNs [10]. To simulate the data, we supplied SERGIO with a predefined GRN with five co-expression modules of different sizes (containing 5, 10, 25, 40, and 20 genes, respectively). The simulated scRNA-seq data set was a sparse matrix (70% zeros) of 3,000 cells and 100 genes. We applied the PC regression method to the simulated data and constructed an scGRN. As expected, in the scGRN, genes were clearly clustered into five distinct modules (see Figure 2A for the adjacency matrix), mirroring the predefined modules given by the generative model of SERGIO. Genes in the same module were supposed to be functionally related or under the same regulation. Because SERGIO simulates gene expression in steady-state cells [10], we only used it to simulate co-expression modules rather than the regulatory processes of some upstream regulators acting on these modules. We regarded the constructed network as the WT scGRN. Next, we set the weight of the 20 th gene (gene #20) in the WT scGRN to zero to produce the pseudo-KO scGRN. Gene #20 belongs to the third module, which includes a total of 25 genes. We then used scTenifoldKnk to compare the pseudo-KO scGRN with the WT scGRN to identify genes significantly differentially regulated due to the KO. These genes were predicted likely to be perturbed along with gene #20. Because we knew the KO effect was due to the deletion of gene #20, we expected that the identified genes to be those closely correlated with gene #20. Indeed, as expected, scTenifoldKnk showed all significant genes were from the third module ( Figure 2B & 2C, top), in which gene #20 is located. We repeated the analysis using genes #50 and #100 as another two examples. Again, the results were as expected ( Figure 2B & 2C, middle and bottom). Thus, we concluded that when a member gene is knocked out from a tightly regulated module, other member genes in the same module should be detected by scTenifoldKnk. Algorithmically, genes in the same module with the KO gene were detected because their projected positions in low-dimensional latent representations of WT and pseudo-KO networks changed more than other genes not in the same module (see Methods for details). (A) Heatmap of a 100 × 100 adjacency matrix of scGRN constructed from simulated scRNA-seq data of 100 genes and 3,000 cells. The color is scaled according to the normalized PC regression coefficient values between genes. The network contains 5 predefined co-regulated modules of different sizes, indicated by the blocks of gene pairs with a high correlation. The number of genes of each module is 5, 10, 25, 40, and 20, respectively. (B) The significance of the scTenifoldKnk DR test of genes in the simulated network is shown as -log(p-value). Red and black dots indicate whether genes are significant or not after false discovery correction. Assuming genes in the same module should be detected as significance genes, sensitivity is defined as = TP/(TP+FN), and specificity as = (TN)/(TN+FP), where T, P, F, and N stands for true, positive, false and negative, respectively. Balanced accuracy is defined as the average sensitivity and specificity values. (C) QQ-plots of expected vs. observed p-values of genes given by the DR tests.     Virtual vs. real KO experiment 1: Nkx2-1 is required for the transcriptional control, development, and maintenance of alveolar type-1 and type-2 cells NK homeobox 2-1 (Nkx2-1) is a transcription factor expressed in lung epithelial cells of alveolar type I (AT1) and type II (AT2). AT1 cells cover 95% of the gas exchange surface and are 0.1 μm thick to allow passive oxygen diffusion into the bloodstream. Nkx2-1 is essential at all developmental stages of AT1 cells. Loss of Nkx2-1 results in the impairment of three main defining features of AT1 cells, molecular markers, expansive morphology, and cellular quiescence [11]. AT2 cells are cuboidal and secrete surfactants to reduce surface tension. Mutations in Nkx2-1 interrupt the expression of Sftpb and Sftpc, two genes related to AT2 cell function and molecular identity [11,12].
To examine the molecular and cellular changes underlying the mutant phenotypes in mouse lungs caused by the Nkx2-1 -/-KO, Little, Gerner-Mauro [11] generated a comprehensive set of data using the lung samples from WT and Nkx2-1 -/-KO mice. Using bulk RNA-seq and immunostaining assays, they observed that the expression of marker genes of AT1 and AT2 cells was downregulated in the Nkx2-1 mutant cells. They also found that the expression of marker genes for gastrointestinal cells was upregulated in Nkx2-1 -/mutant AT1 cells, which form dense microvilli-like structures apically. Using ChIP-seq, they found that Nkx2-1 binds to a set of genes implicated in regulating the cytoskeleton, membrane composition, and extracellular matrix. Little, Gerner-Mauro [11] also generated scRNA-seq data for 2,312 and 2,558 epithelial cells from lung samples of the WT and Nkx2-1 -/-KO mice, respectively.
We obtained the scRNA-seq data, generated by Little, Gerner-Mauro [11], and used the expression matrix of 8,647 genes × 2,312 cells from WT mice as the input for scTenifoldKnk. We constructed the WT scGRN and then knocked out Nkx2-1. The final report of scTenifoldKnk analysis contained 171 significant genes (FDR < 0.05, Supplementary Table S1). These virtual-KO perturbed genes included seven marker genes of AT1 cells (Egfl6, Ager, Cldn18, Icam1, Crlf1, Gprc5a, and Aqp5) and 25 marker genes of AT2 cells (highlighted in Supplementary Table S1).
Enrichr functional enrichment test [43] indicated that these genes were enriched for the following functional categories: epithelial to mesenchymal transition led by the WNT signaling pathway members, surfactant homeostasis, lamellar body, and cell adhesion molecules. These enriched function annotations were known to be related to the function of AT2 cells. Next, we applied the interaction enrichment analysis to the 171 significant genes. The interaction enrichment analysis was based on the STRING database [44]. We found that these significant genes appeared in a fully connected component in the STRING interaction network, which is unexpected by chance (P < 0.01, STRING interaction enrichment test), indicating a closely related functional relationship between those genes. We subsequently performed the gene set enrichment analysis (GSEA [45]) to evaluate the extent of perturbation caused by the Nkx2-1 KO at the transcriptome-wide level. The GSEA analysis identified gene sets containing marker genes of AT1 and AT2 cells (FDR < 0.01 in both cases). That is to say, AT1 and AT2 marker genes were among the topmost perturbed genes caused by the deletion of Nkx2-1. GSEA analysis also showed that the Nkx2-1 -/-KO impacted genes with functions related to intestinal microvillus ( Figure 3A), cell cycle, and the cytoskeleton. These results are consistent with those reported in the original study [11].
Again, we emphasize that scTenifoldKnk did not use data generated from the KO mice. Instead, all results of scTenifoldKnk were derived in silico using data generated from the WT mice. Overall, the results of our scTenifoldKnk analysis are consistent with those of the original study [11], in which scRNA-seq data was experimentally derived from KO mice and combined with other experimental techniques to characterize the KO individual.
Virtual vs. real KO experiment 2: Trem2 regulates microglial cholesterol metabolism The Triggering Receptor Expressed on Myeloid cells 2 (Trem2) is a single-pass transmembrane immune receptor selectively expressed in microglia within the central nervous system. Trem2 is known to be involved in late-onset Alzheimer's disease and plays a role in modulating proliferation, survival, immune response, calcium mobilization, cytoskeletal dynamics, mTOR signaling, autophagy, and energy metabolism [46]. The function of Trem2 is known to be mediated via signaling transducer Hcst and adaptor Tyrobp [17]. Trem2 is also known to play a role in regulating lipid metabolism, with most studies focusing on lipids in the form of either lipoprotein particles or cell surface-exposed signals, such as candidate Trem2 ligands [13]. By comparing WT and Trem2 -/-KO mice, Poliani, Wang [15] showed that Trem2 regulates many genes, such as Apoe and Lpl, which control lipid transport and catabolism in microglia. Trem2 was also found to modulate gene expression of macrophages in adipose and control blood cholesterol metabolism in obese mice [14], further linking the function of Trem2 to lipid metabolism. To examine whether Trem2 mediates myelin lipid processing in microglia, Nugent, Lin [16] isolated and characterized Cd11b + /Cd45 low microglial cells from Trem2 +/+ , Trem2 +/-, and Trem2 -/mice, fed with a 0.2% demyelinating cuprizone diet for 12 weeks. They analyzed a comprehensive set of analytical data using FACS, RNA-seq, scRNA-seq, and lipidomics. They reported that Trem2 upregulates Apoe and other genes involved in cholesterol transport and metabolism, causing robust intracellular accumulation of a storage form of cholesterol upon chronic phagocytic challenge. Trem2 was also shown to regulate the expression of genes associated with cell damage response, lysosome and phagosome function, Alzheimer's disease, and oxidative phosphorylation [16].
To perform the virtual KO analysis, we obtained scRNA-seq data from WT (Trem2 +/+ ) mice [16]. The expression matrix contained data of 7,715 genes and 765 Cd11b + /Cd45 low microglial cells. We used this WT expression matrix as the input and used scTenifoldKnk to knock out Trem2. The final results of scTenifoldKnk analysis contained 128 virtual-KO perturbed genes (FDR < 0.05, Supplementary Table S2). The Enrichr analysis [43] showed that these genes were enriched with Alzheimer's disease, oxidative phosphorylation, lysosome, TYROBP causal network, metabolic pathway of LDL, HDL and TG, and microglia pathogen phagocytosis pathway. Such an enrichment indicates that the proteins were at least partially biologically connected as a group. These virtual-KO perturbed genes were highly interactive with each other, as shown by their positions on the STRING interaction network. The network of virtual-KO perturbed genes had significantly more interactions than expected (P < 0.01, STRING interaction enrichment test), which means that gene products exhibited more interactions among themselves than what would be expected for a random set of proteins of similar size, drawn from the genome. This result suggests that the identified virtual-KO perturbed genes are closely related to shared functions. Collectively, these scTenifoldKnk findings provide insight into understanding Trem2 function by revealing the list of genes perturbed following Trem2 deletion ( Figure 3B). We show that the scTenifoldKnk results recapitulate those reported in the original study [16], demonstrating the utility of scTenifoldKnk.
Next, to assess the robustness of scTenifoldKnk, we randomly subsampled 500 cells from each set of replicates and performed the virtual KO of Trem2. We repeated the process independently five times with each set. With each of the subsamples, scTenifoldKnk produced a ranked gene list. We found that these ranked gene lists were positively correlated with each other (average rank correlation coefficient = 0.55, Spearman's tests, Supplementary Figure S3A). GSEA analyses with these ranked gene lists produced similar results (Supplementary Figure S3B). The most enriched functions included Oxidative phosphorylation, Alzheimer's disease, Cholesterol metabolism, Phagosome, and Lysosome, which were shared among all subsamples. These results suggest scTenifoldKnk is robust against the heterogeneity of input cells.
Virtual vs. real KO experiment 3: Hnf4a and Hnf4g stabilize enterocyte identity Hepatocyte nuclear factor 4 alpha and gamma, Hnf4a and Hnf4g, are transcription factors that regulate gene expression in the gut epithelium. Hnf4a and Hnf4g function redundantly, and thus, an independent deletion of one paralog causes no gross abnormalities [18,47]. Hnf4a and Hnf4g double-KO Hnf4ag DKO mice exhibit fluid-filled intestines indicative of intestinal malfunction [18].  Figure 3C left panel). The Hnf4ag DKO mutant cells also showed increased expression for genes in the BMP/SMAD signaling pathway and decreased expression of genes involved in lipid metabolism, microvillus and absorption, and genes related to the cytoplasm [18].

Epithelial cells in the
We obtained the scRNA-seq expression matrix of 4,100 cells from the Hnf4ag WT samples used as input for scTenifoldKnk. We constructed the WT scGRN of 2,591 genes and then virtually knocked out both Hnf4a and Hnf4g genes at the same time. The final result of scTenifoldKnk contained 65 virtual-KO perturbed genes (FDR < 0.05, Supplementary Table S3). These genes were enriched with electron transport chain, fat digestion and absorption, cholesterol metabolism, chylomicron assembly, and cytoplasmic vesicle lumen. A search of the STRING database [44] indicated that all virtual-KO perturbed genes form a fully connected network module. STRING is a database of known and predicted protein-protein interactions. The interaction enrichment test, a statistical test provided by the STRING database, indicated that such a full interconnection of 65 genes is less likely to be expected by chance (P < 0.01). Furthermore, GSEA analysis revealed that these virtual-KO perturbed genes were enriched with canonical marker genes of enterocytes ( Figure 3C), which is consistent with the finding of the original study [18]. Mendelian diseases are a family of diseases caused by the loss or malfunctioning of a single gene. For many Mendelian diseases, we know a great deal about their genetic basis and pathophysiological phenotypes [48]. Thus, we decided to use three Mendelian diseases as "positive controls" to test the performance of scTenifoldKnk. We attempted to determine whether scTenifoldKnk can accurately predict the molecular phenotypic consequences of gene deletion. The three chosen Mendelian diseases were cystic fibrosis, Duchenne muscular dystrophy, and Rett syndrome. As described in more detail below, in each case, we performed scTenifoldKnk analysis using existing scRNA-seq data generated from cell types that are most relevant to the disease conditions ( Table 1).

Cystic fibrosis
Cystic fibrosis (CF) is one of the most common autosomal recessive diseases [49]. It is caused by mutations in CFTR, a gene encoding for a transmembrane conductance regulator [20,50], which functions as a channel across the membrane of cells that produce mucus, sweat, saliva, tears, and digestive enzymes. The CFTR protein also regulates the function of other channels. CFTR is expressed in epithelial cells of many organs, including the lung, liver, pancreas, and digestive tract [19]. The most common CFTR mutation that causes CF is the deletion of phenylalanine 508 (ΔF508), which disrupts the function of the chloride channels, preventing patients from regulating the flow of chloride ions and water across cell membranes [20]. The truncated CFTR protein leads to a reduction of surfactant and causes the build-up of sticky, thick mucus that clog the airways, increasing the risk of bacterial infections and permanent lung damage [21].
To test scTenifoldKnk, we obtained scRNA-seq data from 7,326 pulmonary alveolar type II (AT2) cells in the GEO database (access number: GSM3560282). The original data sets were generated by Frank, Penkala [22] [51,52]. Ctsh encodes for cathepsin H, a cysteine peptidase, which is involved in the processing and secretion of the pulmonary surfactant protein B [53]. GSEA analysis of the ranked gene list was conducted using gene sets of the MGI mammalian phenotypes database as a reference. The result showed that the gene list scTenifoldKnk produced was significant in terms of ion transmembrane transporter activity, abnormal surfactant secretion, and alveolus morphology (FDR < 0.01 in all cases, Figure 4A). These results are consistent with the known pathophysiological changes resulting from the loss of Cftr function in the lungs.
To test the specificity of scTenifoldKnk, we decided to use genes that show highly similar expression profiles to that of Cftr to repeat the analysis. The selection of these genes was made by computing for each gene both the mean and variance of their expression levels across cells.
Five closest genes to Cftr-Akap7, Zranb1, Krcc1, Mta1 and Rps8-were selected. These selected genes may not necessarily have any functions related to that of Cftr in alveolar type II cells. When scTenifoldKnk was applied to predict perturbation profiles caused by their deletion, we found none of the sets of predicted functions similar to those specific functions (i.e., ion transmembrane transporter function, abnormal surfactant secretion, and alveolus morphology) that are associated with Cftr. A principal component analysis (PCA) was used to confirm that perturbation profiles of Mta1, Akap7 and Zranb1 differ most from that of Cftr, while the perturbation profiles of Krcc1 and Rps8 are relatively more similar to that of Cftr (Supplementary Figure S4).
The two genes closer to Cftr, Krcc1 and Rps8 have been reported to play a role in regulating cellular plasticity and fibrotic process [54,55]. These results suggest that scTenifoldKnk specifically reveals gene function using the KO gene's perturbation profile rather than the KO gene's expression profile.

Duchenne muscular dystrophy
Duchenne muscular dystrophy (DMD) arises as a result of mutations that affect the open reading frame of DMD [56,57]. The DMD gene encodes dystrophin, a large cytoskeletal structural protein, which is mostly absent in DMD patients [23]. The absence of dystrophin results in a disturbance of the linkage between the cytoskeleton and the glycoproteins of the extracellular matrix, generating an impairment of muscle contraction, eventually leading to muscle cell necrosis [23,24] (Figure 4B, left).
We obtained the scRNA-seq data of 5,159 muscle cells from the mouse limb (quadriceps) in the GEO database (Access number: GSM4116571). The original data was generated to study gene expression patterns in skeletal muscle cells [25]. The original study was not focused on DMD. We used scRNA-seq data from normal tissue to construct the WT scGRN of 9,783 genes. We subsequently performed the scTenifoldKnk virtual KO analysis to predict the molecular phenotype due to the impact of the Dmd KO. The final results of scTenifoldKnk included 190 virtual-KO perturbed genes (FDR < 0.05, Supplementary Table S5). These genes were enriched with functions related to beta-1 integrin cell surface interaction, contractile actin filament bundle, actomyosin, extracellular matrix receptor interaction, and extracellular matrix organization ( Figure  4B, middle). The GSEA analysis against the MGI mammalian phenotype database gave the following top hits (FDR < 0.01): abnormal collagen fibril morphology, abnormal skeletal muscle morphology, and abnormal skeletal muscle fiber morphology (Figure 4B, right). These phenotype terms represent consistently with known effects of the loss of DMD function in muscle cells, verifying that scTenifoldKnk can predict phenotypic effects caused by gene KO that are pertinent for the biological context.

Rett syndrome
The third Mendelian disease we considered was the Rett syndrome (RTT, MIM 312750), which is a severe neurodevelopmental disease [58,59]. RTT is known to be caused by mutations in Mecp2, a transcriptional repressor required to maintain normal neuronal functions [27,28]. Mecp2 deficiency in the brain decreases the expression level of genes involved in the Bdnf signaling pathway, mediated by repressing the transcription factor Rest [26].
We obtained scRNA-seq data generated from mouse neurons (SRA database access numbers: SRX3809326 and SRX3809327) for the mouse brain atlas project [33]. The two data sets contain 2,054 and 2,156 neurons, respectively, derived from two CD1 P19 female mice that served as biological replicates. We analyzed the two data sets independently to see whether scTenifoldKnk could, as expected, produce similar results with data gathered from biological replicates. Two scGRNs containing 8,652 and 8,555 genes were constructed first. The scTenifoldKnk analysis of virtual KO of Mecp2 produced 377 and 322 virtual-KO perturbed genes, respectively (FDR < 0.05, Supplementary Table S6 & S7), including 211 shared genes. The number of shared genes was significantly greater than random expectation (P < 1x10 -5 , hypergeometric test), indicating a high overlap rate between results of scTenifoldKnk analyses when applied to the two data sets from biological replicates. We also compared two ranked gene lists generated from the analysis of two replicate data sets. If scTenifoldKnk results are robust, we expected that the relative positions of the same genes in the two ranked lists should be more similar to each other than with a list of randomly ranked genes. Indeed, we found the correlation between rankings of two reported lists was positive (Spearman's correlation coefficient = 0.68) and highly significant (P < 1x10 -12 ).
Many of these 211 genes were found to be targets of the transcription factor Rest (FDR < 0.01). The enriched functions include axon, synaptic vesicle cycle, GABA synthesis, release, reuptake and degradation, syntaxin binding, and transmission across chemical synapses. GSEA analyses using ranked gene list as input showed BDNF signaling pathway was highly significant (FDR < 0.01, for both replicates, Supplementary Figure S2). These results are consistent with previous experimental results. For example, it is known that the most prominent alterations in gene products due to Mecp2 KO are related to synapses and synaptic vesicle proteins [29]. At the phenotypic level, the Mecp2 KO causes early defects in GABAergic synapses [30] and mediates autism-like stereotypies and RTT phenotypes [31]. Mutations in syntaxin-1 are known to elicit phenotypes similar to those found in RTT [32].  scTenifoldKnk predicts transcriptional changes in enterocytes of Ahr -/-KO mice The Aryl hydrocarbon receptor (AhR) is a transcription factor that regulates the gene expression of metabolic enzymes, such as cytochrome P450s. It acts primarily as a sensor for both dietary and gut microbial-derived ligands. These include structurally diverse phytochemicals, pharmaceuticals, and endogenous compounds, including tryptophan metabolites and xenobiotic chemicals. AhR also plays a role in regulating immunity, stem cell maintenance, and cellular differentiation [34,37,[60][61][62].
To study the role of Ahr in intestinal stem cell fate determination, we generated inducible, stem cell-specific Ahr -/-KO mice (see Methods). We then used scRNA-seq to generate data for five WT and five KO mice and obtained 13,626 and 8,425 sequenced cells from WT and KO mice, respectively. Data was pooled and projected to low-dimensional embedding, showing no global shift in overall gene expression profiles between cells from WT and KO mice ( Figure 5A). We subsequently clustered cells into groups and identified seven distinct cell types: non-cycling Lgr5 + colonic stem cells, Goblet cells, cycling Lgr5 + colonic stem cells, pre-enterocytes, transitamplifying cells, enteroendocrine cells, and Tuft cells ( Figure 5B). Since enterocytes are the most abundant epithelial cell lineage in the intestine, we decided to conduct scTenifoldKnk using data from enterocytes (indicated by dashed oval in Figure 5B) to assess the function of Ahr.
We conducted scTenifoldKnk analysis to knock out Ahr from the scGRN of WT mice. The scGRN was constructed from an input expression matrix of 1,029 cells (i.e., WT enterocytes) and 8,478 genes. The output of scTenifoldKnk was a ranked gene list including 53 top genes, i.e., virtual-KO perturbed genes (FDR < 0.05, Supplementary Table S8). Enrichr functional enrichment tests indicated that these genes were associated with Myc active pathway, nuclear receptors metapathway, MHC class II protein complex binding, Chaperone-mediated protein complex assembly, response to unfolded protein, RNA binding, translation elongation factor activity, cytoplasmic vesicle lumen, T cell receptor regulation of apoptosis, and secretory granule lumen ( Figure 5C). Next, a STRING database [44] search indicated that identified virtual-KO perturbed genes form two network modules (both P < 0.01, STRING interaction enrichment tests, Supplementary Figure S5). Gene members in each module are fully connected, further confirming closely related functional associations among identified genes. Finally, GSEA analysis against a comprehensive collection of gene sets (see Methods) identified 60 non-redundant functional gene sets (Supplementary Table S9), including many related to cell cycle and cell proliferation signaling.
GSEA analysis against the canonical marker gene sets (retrieved from PanglaoDB, see Methods) showed that enterocytes was the most significant hit (Figure 5D, left), suggesting that Ahr KO induces a change in the expression of enterocyte marker genes, which may be responsible for the alteration of enterocyte differentiation fate [35]. Taken together, our comprehensive functional annotation for the virtual-KO perturbed genes revealed the multifactorial functions of Ahr as a key player of enterocyte cell differentiation and proliferation. These results are consistent with the known functions of Ahr. For example, a previous study showed that Ahr deletion in intestinal epithelial cells resulted in an abnormal barrier function and uncontrolled proliferation of intestinal cells, promoting malignant transformation [34]. Organoids with dysregulated Ahr had substantially compromised differentiation to goblet and enterocyte cell fate [35]. Intestinal-specific Ahr KO increases basal stem cell and crypt injury-induced cell proliferation and promotes colon tumorigenesis [36]. Myc is a key transcriptional effector of the Wnt signaling pathway through its actions to promote cellular proliferation [63].
Without scTenifoldKnk, we would have to use differential expression (DE) analysis. Indeed, DE analysis has been a dominant method for assessing the impact of gene KO (e.g., as in [11,16,18]). Here, we put ourselves in such a hypothetical scenario-that is, scTenifoldKnk is not available. We resorted to comparing gene expression levels between the WT and Ahr KO mice.
To identify DE genes, we used function FindMarkers in the R package Seurat [64] with its default parameter setting. This function calls an established method for scRNA-seq DE analysis-MAST [65]. A total of 68 DE genes were identified [FDR < 0.05 and log(FC) > 0.25], including 46 upregulated and 22 downregulated genes in the Ahr-KO enterocytes ( Figure 5E). Three upregulated DE genes, Gstm1, Gpx2, and Ifitm2, and one down-regulated DE gene, Dmb1t, are also among the list of 53 virtual-KO perturbed genes identified by scTenifoldKnk (highlighted with red and blue boxes, respectively, in Figure 5C & 5E). Note that, in the WT scGRN, the sign of edges from Ahr to the three up-regulated genes is positive (shown as red edges in Figure 5C), while the sign of edge from Ahr to the down-regulated Dmb1t is negative (shown using the blue edge between Ahr and Dmb1t in Figure 5C). The GSEA analysis of the DE gene list, with the comprehensive gene sets as the references (Methods), reported 35 non-redundant significant gene sets, with 16 overlaps with the 60 non-redundant significant gene sets in Supplementary  Table S9, which were identified using scTenifoldKnk ranked gene list. The overlap is significantly larger than expected by chance (P < 10 -20 , hypergeometric test, assuming N= 10 9 of nonredundant gene sets). In addition, GSEA analysis of the DE gene list with marker gene sets as references showed that enterocytes was one of the significant hits ( Figure 5D, right), which is the same as the result of GSEA analysis of virtual-KO perturbed genes (Figure 5D, left). In summary, we obtained overall similar results using scTenifoldKnk and DE analysis. The difference is that scTenifoldKnk is a virtual KO analysis, whereas DE analysis requires data from the real KO experiment.

scTenifoldKnk predicts transcriptional changes in pancreatic beta cells of Malat1 -/-KO mice
Malat1 is a long, non-coding RNA gene that acts as a transcriptional regulator for numerous genes, which are involved in the generation of reactive oxygen species, cell cycle regulation, cancer metastasis, and cell migration [39][40][41][42]. Malat1 upregulation is also associated with hyperglycemia [38]. We have previously shown that the ablation of Malat1 changes insulin responses in vivo by suppressing Jnk activity and inducing the activation of the insulin receptor substance 1 (IRS-1) [42]. The ablation of Malat1 also induces the phosphorylation of Akt in a process mediated by increased reactive oxygen species (ROS) [42].
To explore the regulatory role of Malat1 in pancreatic cell functions, we generated Malat1 -/-KO mice (see Methods). We performed scRNA-seq experiments with isolated pancreatic cells from WT (C57BL/6) and Malat1 -/-KO mice ( Figure 6A). The beta cells were the most abundant cells in both samples, 1,142 WT and 1,695 KO ( Figure 6B).
We used scTenifoldKnk to knock out Malat1 over the WT scGRN and identified 167 virtual-KO perturbed genes (FDR < 0.05, Supplementary Table S10). Enrichr functional enrichment analysis indicated that these genes were significantly associated with glucose metabolisms functions such as Cori cycle, glycolysis and gluconeogenesis, and the regulation of insulin secretion, as well as with pathways related to cellular carcinomas ( Figure 6C). This result is consistent with the known function of Malat1. For example, Malat1, which stands for Metastasis Associated Lung Adenocarcinoma Transcript 1, is known to be overexpressed in lung carcinoma. In addition, Malat1 is known to have pleiotropic functions in various physiological and pathophysiological processes [42,[66][67][68][69][70][71][72], including cytoskeleton, circadian rhythm, and hypoxiainducible factor (HIF) signaling pathway, as predicted ( Figure 6C). A STRING database search showed that virtual-KO perturbed genes form a highly connected interaction network (P < 0.01, STRING interaction enrichment test, Supplementary Figure S6). GSEA analysis with scTenifoldKnk-reported gene list indicated that 79 non-redundant gene sets were enriched significantly (FDR < 0.05, Supplementary Table S11), including insulin secretion and circadian rhythm ( Figure 6D).
To validate these predictions made using scTenifoleKnk, we performed DE analysis using MAST to compare gene expression in beta cells between WT and KO mice. We identified 1,695 DE genes (FDR < 0.05, Figure 6E). Two up-regulated genes (Insig1 and Meg3) and two downregulated genes (Glu1 and Hist1h1) were among the 167 virtual-KO perturbed genes ( Figure 6C). GSEA analysis with the ranked list of DE genes sorted by the fold-change produced only 17 nonredundant enriched gene sets, including insulin secretion.

Result 6: Systematic KO of all genes in a given sample to generate the landscape of genes' perturbation profiles
As mentioned, scTenifoldKnk is designed and implemented to be computationally efficient. Indeed, we benchmarked the performance of scTenifoldKnk. When the WT scGRN is given, scTenifoldKnk could complete virtual KO analysis at a rate of one minute per gene on a standard workstation. That is to say, for a given data set of 6,000 genes, scTendifoldKnk could knock out all of these genes individually in 4 to 5 days. The process can also be split and run in parallel to increase the speed of computing. The outcome of such a systematic KO experiment is a collection of perturbation profiles of all genes. For each gene (e.g., gene ), the perturbation profile of the gene (i.e., gene ) is a vector of distances of all other genes ({ − }) produced by scTenifoldKnk (see Figure 1C). The distance value quantifies the level of a gene (i.e., a gene in { − }) being perturbed by the deletion of the KO gene (i.e., gene ). Figure 7A illustrates an analytic flowchart when scTenifoldKnk is used in a systematic KO experiment. For a given WT scGRN with n genes, scTenifoldKnk can be used to delete individual genes from 1 to n. For the i-th gene deletion, , scTenifoldKnk produces a KO perturbation profile for the i-th gene. The KO perturbation profile is a vector of distances: � ,1 , ,2 , . . . , � , where i = 1, 2, …, n. Combining all genes' KO perturbation profiles into an n × n matrix, called KO perturbation profile matrix, is followed by t-SNE embedding and clustering of genes.
To demonstrate the use of the systematic KO functionality, we downloaded scRNA-seq data from the brain immune atlas and obtained the expression matrix of 6,853 genes and 5,271 microglial cells (see Methods). These microglial cells were derived from WT homeostatic mice [73]. After knocking out all genes, we obtained the KO perturbation landscape of all 6,853 genes, as shown in the t-SNE embedding (Figure 7B). Each point represents a perturbation profile caused by KO of a gene. Genes with similar perturbation profiles will be closely located in the low dimensional embedding. Therefore, examining genes closely located will allow us to discover potential functional associations between them [74]. We selected three clusters of different sizes to examine the member of genes in each cluster and the functional relationships between these genes. We found that all three clusters contain genes that show a significantly higher level of functional associations than expected by chance (P < 0.01, STRING interaction enrichment tests). We then focused on the smallest cluster, which contains 16 genes ( Figure 7C). According to the STING database, 12 of these genes (Apoc1, Apoe, Clec12a, Clec4n, Cp, Fth1, Lilrb4, Mrc1, Ms4a6c, Ms4a7, Pilra and Pla2g7) are functionally associated with each other. These associations are supported by evidence from published literature [75][76][77]. All these supporting references are related to the microglia study. The remaining 4 genes (Htra3, Tgfbi, Pf4 and Ifitm2) are not connected to this 12-gene sub-network. Nevertheless, a new study [78] showed that Htra3 is overexpressed in repopulating microglia. Therefore, these "isolated" genes may be worthy of further scrutiny for their functions in microglia with the potential links to the other genes in the 12gene sub-network.
Instead of using genes' KO perturbation profiles, using genes' expression information can also produce a landscape of genes. Indeed, we applied t-SNE to the UMI count matrix, which was transformed and standardized using the same procedure (see Methods), and obtained the embedding plot of genes (Supplementary Figure S7B). The difference between this embedding plot derived from gene expression and the one derived from the gene's KO perturbation profile is obvious. The former has no structure among cells but is just a cloud of data points. Performing clustering using the embedding derived from gene expression ought to not produce clusters containing more true interactions than using the embedding derived from gene expression profile across cells. To test whether similar results were obtained using expression and perturbation profiles, we calculated the average distance between genes that belong to the same KEGG gene set. We found that, across all KEGG gene sets, the average distance in the embedding derived from a KO perturbation profile was significantly smaller than that in the embedding derived from the expression profile (Supplementary Figure S7C). Genes in the three example clusters, as shown in Figure 7B, were found to be scattered in different clusters in the embedding derived from gene expression (Supplementary Figure S7B). We also conducted the same test using the embedding derived from the genes' profile of the weight of edges in the WT scGRN. The same pattern was uncovered-that is, the average distance is smaller in the embedding derived from the KO perturbation profile than in that derived from the scGRN edge weight. These results suggested that gene sets identified using the gene's KO perturbation profile were more likely to be functionally connected than gene sets identified using other types of gene profiles.  (Apoc1, Apoe, Clec12a, Clec4n, Cp, Fth1, Lilrb4, Mrc1, Ms4a6c, Ms4a7, Pilra and Pla2g7) in the cluster highlighted with the rectangle in B. The references [75][76][77], from which the associations between genes are established, are given in the table.

Result 7: Systematic KO of a gene from different cell types
ScTenifoldKnk can be used to systematically knock out a gene in different cell types. In this way, it is possible to identify cell-type-specific functions of the gene, as well as functions shared across multiple cell types. Here we use MYDGF (Myeloid Derived Growth Factor) as an example KO gene to illustrate this application of scTenifoldKnk. MYDGF, also known as C19orf10, is a 142residue protein broadly expressed in multiple tissues and cell types [79,80]. Mydgf has been shown in a mouse model to enhance cardiac myocyte survival, tissue repair, and angiogenesis caused by myocardial infarction [81]. To elucidate MYDGF's functional roles, we downloaded multiple human scRNA-seq data sets from the PanglaoDB database [82]. From the downloaded data sets, we extracted cells from 45 different cell types. Subsequently, a virtual KO of MYDGF for each of these cell types and recovered 45 cell-type-specific perturbation profiles was performed. We first examined the perturbation profile of endothelial cells, in which the function of MYDFG has been studied [83]. GSEA analysis with the ranked list of genes showed that the enriched functions of MYDGF include Cell cycle, VEGFA-VEGFR2 pathway, and Intra-Golgi traffic and activation (Figure 8A). These results are consistent with previous findings [83], suggesting that scTenifoldKnk can recapitulate results from a cell-type-specific gene function study. We also concatenated perturbation profiles of 45 different cell types. There were 1,288 genes expressed across all tested cell types. As expected, several cell types exhibited perturbation profiles similar to that of endothelial cells (Figure 8B). These cell types include T-cell, hepatocyte, cardiomyocyte, and kidney cell. GSEA analysis identified enriched functions, including AKT signaling, endothelial NOS activation, apoptosis, muscle contraction, and ALK2 pathway (Figure 8C) [84]. These findings are consistent with the fact that overexpression of Mydgf increases AKT phosphorylation and cell proliferation via AKT/MAPK signaling pathways [85]. Identified functions were also related to promoting survival and growth, as previously reported [79-81, 83, 85]. In summary, we used Mydgf as an example gene to demonstrate that scTenifoldKnk can be used to knock out a gene across multiple cell types in order to identify shared as well as cell type-specific functions of the KO gene.

Discussion
Gene expression is almost always under coordinated regulation in cells of living organisms. Inferring GRNs is the key to a better understanding of such coordinated regulation. However, inferring GRNs is a very difficult process-there are always a large number of unknown variables in the system, and the power of inference is limited by the sample size. The development of singlecell technology has brought new "oil" to network science. We have previously shown that scRNAseq information can be leveraged to fuel the machine learning algorithms for reliable scGRN construction [9].
In a GRN, the regulatory effect manifests as observable synchronized patterns of expression between genes. These genes are associated with the same biological process, pathway, or under the control of the same set of transcription factors [1]. When a gene involved in a process is perturbed (e.g., knocked out), the expected first responders for such perturbation are those functionally closely related to the KO gene. Thus, modeling influence patterns in a GRN, such as using topological models to approximate perturbation patterns [6], can be used to bypass expensive experimental measurement. Thus, in principle, GRN-based perturbation analysis can contribute to the planning and design of real-animal experimental work for testing or validating hypotheses. There is evidence showing that gene expression data need not necessarily be collected from perturbation experiments for GRN-based analysis to be successful [86,87].
Our contribution is to provide an scGRN-based perturbation analytical system-scTenifoldKnk. By performing virtual KO analyses with a series of existing scRNA-seq data, we showed that scTenifoldKnk can be used to identify KO responsive genes. The reported genes were found to be enriched with molecular functions that are similar to those enriched in genes identified by the authentic KO experiments. We also tested scTenifoldKnk using data from three different cell types known to be affected in three Mendelian diseases, respectively. While these diseases represent conditions involving distinct causal genes and different dysregulated molecular processes, scTenifoldKnk demonstrated its value in all three cases. The top hits reported by scTenifoldKnk are highly interpretable, suggesting scTenifoldKnk can be used to reveal molecular mechanisms underlying less characterized pathophysiological processes. The predictive power of scTenifoldKnk was further demonstrated with scRNA-seq data from our own KO experiments, involving enterocytes in Ahr -/mice and islet cells in Malat1 -/mice. Finally, we showed two case studies of systematic KO experiments.
Despite some obvious limitations associated with the virtual KO method, we start by discussing its advantages. First, the virtual KO method, as we implemented in scTenifoldKnk, is species agnostic-it works with scRNA-seq data from humans and animals alike. This feature gives the method a huge advantage for KO experiments focusing on human samples. In the lack of human KO samples, the KO animals are used as surrogates. The evolutionary divergence between humans and model animals is assumed to play a neglectable role in shaping orthologue gene function-but we know this is not always the case. While applying scTenifoldKnk to human scRNA-seq data, researchers can avoid many pitfalls caused by extending the conclusions from animal KO experiments to humans. Second, scTenifoldKnk allows for any gene to be knocked out for functional analysis as long as the gene expression is detectable in the WT sample. One may want to knock out all genes or a set of genes one by one to obtain an effect profile for each of the KO genes. The effect profile can be as simple as a virtual-KO disrupted gene list produced by scTenifoldKnk. Genes have similar effect profiles that are most likely to share molecular functions or are involved in the same signaling pathways. Genes of known functions can be used as positive controls to gauge the performance of scTenifoldKnk in the tested system and cell type. Third, using scTenifoldKnk, it is feasible to virtually delete more than one gene from a system at the same time to study synergistic KO effects. An example is the double-KO settings for Hnf4ag DKO . This is significant because the number of possible combinations of multiple KO targets is often too large to allow researchers to explore the impact of even a small fraction of the multigene KO possibilities. In this circumstance, scTenifoldKnk will be an invaluable tool to predict the consequences of these KO combinations and provide experimental design guidance by suggesting the gene combinations that should be prioritized. Fourth, scTenifoldKnk can be used to study the effects of gene KO across multiple cell types. A typical scRNA-seq experiment generates expression data for multiple cell types. In this case, the scTenifoldKnk analysis will allow researchers to study diverse phenotypes associated with the KO gene in all cell types as long as the gene is expressed. Finally, scTenifoldKnk can be used to study the function of essential genes, for which the gene KO causes lethal outcomes, making it impossible to establish the KO animals. When scTenifoldKnk is applied to these essential genes, especially with embryonic expression data of the genes from the WT samples, developmental functions of these genes can be studied.
The resultant gene list of scTenifoldKnk analysis may differ substantially from that of the differential gene expression analysis, with few overlapping genes, as seen in the analyses of Ahr -/and Malat1 -/-KO mice. We attribute this difference to that scTenifoldKnk detects direct and early responders to the perturbation caused by the gene KO, whereas differentially expressed genes tend to be late responders and reflect a final steady state of the response. Given the hierarchical structure in GRNs [87,88], it is not unexpected that early and late responders are not the same set of genes. Furthermore, the simple and effective design of scTenifoldKnk does not mean that the analysis can be achieved by simply "picking up" genes that have a large weight in the adjacency matrix of the original scGRN. We examined the correlation between the scTenifoldKnkreported distance and the weight of genes. We found that, in all real data analyses, scTenifoldKnk picked genes with a wide range of weight (see Supplementary Figure S8 for two examples: Trem2 KO and Dmd KO). These results showed that scTenifoldKnk's manifold alignment algorithm does not depend upon the weight of the edge between a gene and the KO gene to assert responder genes. scTenifoldKnk can be used in extended research areas other than KO experiments. For example, biologists often need to know whether a manipulation has an effect or not. scTenifoldKnk can be applied not only to detect novel targets but also to prioritize known targets prior to in vivo or in vitro studies with different experimental techniques. One may use drugs to block the transcription of a gene in a candidate pathway. If the drug has an effect, one would conclude that the drug works on that pathway involved; otherwise, the pathway is not affected. The scTenifoldKnk-based analysis represents a novel analytical method that can be applied to genome-wide association studies (GWASs). Throughout the last decade, GWASs have been successful in detecting associations between variants and phenotypes; however, a phenotypic trait is usually associated with many variants, presumably influencing gene expression regulation. scTenifoldKnk may be used to help geneticists to assess functional consequences in order to prioritize actionable gene targets. We point out that the prediction of scTenifoldKnk does not have to be perfect. As we can see, with many example data sets, scTenifoldKnk could recapitulate major findings reported in real-KO experiments. We showed the landscape of KO perturbation profiles of all genes in microglia. Experimentally, such systematic perturbation analysis has only been done in yeast [89]. With scTenifoldKnk, systematic KO analysis in silico can be performed in a cell-type-specific manner in organismal systems.
Limitations of scTenifoldKnk are inherited from being a virtual KO method. scTenifoldKnk cannot be used to predict the consequence of gene overexpression, which is also a commonly used method for gene function study. Also, as the power of scTenifoldKnk is rooted from the WT scGRN, the regulatory network from the WT sample, the prediction that scTenifoldKnk may "favor" regulatory rather than structural genes. Nevertheless, it is still possible to make adjustments to some details in the implementation of scTenifoldKnk to make it better fit user analytical needs. For example, instead of knocking out a target gene by setting its values to zeros in the adjacency matrix, a random shuffle of a gene's expression values in the expression matrix may mimic gene dysregulation.
Prediction of gene expression responses to perturbation using scRNA-seq data is an active research area. To the best of our knowledge, there are two software tools that have been developed for this purpose: scGen [3] and CellOracle [4]. scGen is a package implemented in Python, using TensorFlow variational autoencoders combined with vector arithmetic to predict gene expression changes in cells. Overall, scGen works like a neural network-empowered regression tool that predicts the changes of gene expression in cells in response to specific perturbations such as disease and drug treatment. scGen requires training data sets from samples before and after being exposed to the same perturbation. CellOracle is a workflow, developed in Python with several R dependencies, that integrates scRNA-seq and single-cell chromatin accessibility data (scATAC-seq) data to infer GRN and predict the changes of gene expression in response to specific perturbations. CellOracle constructs a GRN that accounts for the relationship between transcription factors and their target genes based on sequence motif analysis using the information provided by the scATAC-seq data. After that, the constructed GRN is further refined using regularized Bayesian regression models to remove weak connections and is adjusted to infer the context-dependent GRN using the scRNA-seq data. scTenifoldKnk has a different design compared to either scGen or CellOracle. scTenifoldKnk is specifically designed for performing virtual KO experiments. It requires only the scRNA-seq data from the WT sample: unlike scGen, scTenifoldKnk does not need training data; unlike CellOracle, scTenifoldKnk does not require extra scATAC-seq data to obtain information of transcription factors and their targets.
The minimalist design of scTenifoldKnk should allow for adoption widely in most scenarios.
Overall, we have shown evidence that our machine learning workflow represents a powerful and efficient method for conducting virtual KO experiments. The highly efficient implementation of scTenifoldKnk allows systematic deletion of a large number of genes in scRNA-seq data. The prediction power offered by scTenifoldKnk enables accurate prediction of perturbations in regulatory networks caused by the deletion of a gene, revealing its function in a cell type-specific manner. We anticipate that scTenifoldKnk will be adopted and widely applied in the predictive science of single-cell biomedical research.
Methods scTenifoldKnk is a machine learning workflow for virtual KO experiments with scRNA-seq data. It utilizes a scRNA-seq expression matrix from a WT sample as input, without using any data from a KO sample, to predict regulatory network changes and perturbed genes caused by the KO of a gene.
Construction of the WT scGRN scTenifoldKnk uses a method we proposed previously in scTenifoldNet [90] to construct scGRNs. The procedure of scGRN construction consists of three steps: cell subsampling, principal component (PC) regression, and tensor decomposition. In cell subsampling, cells are randomly sampled to form subsets. This strategy assures that the outliers in the sample are not included by many sub-sample sets and reduce the influences of outliers. When constructing the scGRN for each sub-sample set, both the accuracy and computational efficiency of the algorithms should be considered. Among existing methods, PCNet based on PC regression is an efficient approach to construct scGRN. It can also surpass the multicollinearity problem when explanatory variables are linearly correlated and provides stable constructing results. Finally, to aggregate multiple GRNs, scTenifoldNet uses the CP tensor decomposition algorithm. CP decomposition factorizes the multilayer scGRN and regenerates all adjacency matrices using top components, which achieves denoising and enhancing, i.e., making main signals stronger and other signals weaker. Overall, scTenifoldNet provides relatively stable and accurate results in constructing GRN. We briefly describe each step below.
(1) Cell subsampling: Initially, scTenifoldNet builds several subsets of cells via random sampling. Denote ∈ ℝ × as the scRNA-seq data matrix that reflects gene expression levels for genes in cells. A sub-sample set of cells is constructed via randomly sampling m (< n) cells in . By repeating this subsampling process for t times, t sub-sample sets of cells are derived, denoted as 1 ′ , … , ′ ∈ ℝ × .
(2) Network construction: For each ′ , scTenifoldNet builds a GRN with an adjacency matrix via PC regression, where a PC analysis (PCA) is applied to the original explanatory variables, and then the response variable is regressed on a few leading PCs. Since PC regression only utilizes d PCs (d << n) as the covariates in regression, it mitigates over-fitting and reduces the computation time. To build an scGRN, each time scTenifoldNet focuses on one gene (referred to as the response gene) and applies PC regression. The expression level of the response gene is used as the response variable, and the expression levels of other genes are used as the explanatory variables in PC regression. scTenifoldNet repeats this process for another N -1 times, with one different gene as the response gene each time. In the end, scTenifoldNet collects the coefficients of regression models together and forms a p × p adjacency matrix , whose (i, j) entry saves the coefficient of the i-th gene on the j-th gene.
could reflect the interaction strengths between each pair of genes.
(3) Network denoising: The adjacency matrices of the networks 1 , … , can be stacked to form a third-order tensor Ξ ∈ ℝ ( × × ) . To remove noise and construct an overall adjacency matrix, scTenifoldNet applies CANDECOMP/PARAFAC (CP) tensor decomposition to Ξ to extract important latent factors. More specifically, scTenifoldNet approximates Ξ by Ξ : where ∘ denotes the outer product, ∈ ℝ , ∈ ℝ , and ∈ ℝ are unit-norm vectors, and is a scalar. The reconstructed tensor Ξ ∈ ℝ ( × × ) includes denoised adjacency matrices, and by taking the average of them, scTenifoldNet obtains the overall stable adjacency matrix. After further normalizing its entries by dividing them by their maximum absolute value, scTenifoldNet generates the final adjacency matrix of scGRN for the given sample. For later use, denote it as .
(4) Adjusting edge weights of the directed network: We provided an option to adjust edge weights for the constructed scGRN. PC regression constructs directed networks. To obtain the strictly directed network, with a given, denoised network , for each gene pair (i, j) and (j, i), only the entry with a larger absolute weight is kept. More specifically, we defined the (i, j) entry for the strictly directed network by (i, j) = (i, j), if | (i, j)| > | (j, i)| and (i, j) = 0 otherwise. Note that if | (i, j)| < | (j, i)|, then we set (i, j) = 0, then the information in (i, j) is removed. In order to keep the information of (i, j), instead of removing the information completely, we defined a new parameter λ. Using this new parameter, we can integrate and to generate an "interpolated network" i , which contains part of the information of d (i, j). Given parameter λ, i is define as: where ∈ [0,1]. It is easy to check that when = 0, we get the original denoised network and when = 1, it is to go back to the strictly directed network .

Deletion of the KO gene from the WT scGRN
We propose the virtual KO method that directly works on the WT scGRN ( Figure 1B). The adjacency matrix represents the scGRN constructed using the WT data. In the virtual KO method, the entire row of the adjacency matrix for the gene is set to zero. We denote the adjacency matrix of the scGRN generated as � .
Comparison between the WT and pseudo-KO scGRNs After obtaining and � , two comparable low-dimensional feature vectors of each gene in the two networks are built and then compared to detect affected genes. Our approach of creating lowdimensional feature vectors is inspired by manifold alignment [7,8] and its application [91]; our approach is referred to as quasi-manifold alignment because the adjacency matrices used here are not symmetric matrices while they are required to be symmetric in the original procedure. Here and � serve as the inputs for manifold alignment and the outputs are the lowdimensional features ∈ ℝ × and � ∈ ℝ × of genes before and after knocking out the target gene, where k << n. Before giving the details of the alignment procedure, we point out that and � may include negative values, which reflect the negative correlation between genes. Before doing alignment, we add 1 to all entries in and � , and the range of and � is transformed from [-1,1] to [0,2].
To perform quasi-manifold alignment, we first construct a joint adjacency matrix by combining and � together, where = � 2 2 � �. We can treat as the adjacency matrix of a joint network formed by linking the corresponding genes in two networks. The off-diagonal block of this matrix reflects the corresponding genes between two networks. is a tuning parameter. In practice, we select as the mean of the row summations of and � . We further build = [ � ] ∈ ℝ 2 × , and the manifold alignment problem of two networks characterized by the adjacency matrices and � is equivalent to the manifold learning problem that finds the low dimensional features for the joint network characterized by the adjacency matrix . For the sake of convenience, we denote ∈ ℝ as the i-th row of that reflects the projection corresponding to the i-th gene in the large network. The next step is to build a "Laplacian" matrix = − , where is a diagonal matrix with = ∑ ( ) ,

=1
. Denote 1 , 2 , … , as the eigenvectors corresponding to the smallest nonzero eigenvalues of . Note that is not a symmetric matrix. We found that the usual solution of symmetrizing does not work well with either simulated or real data. We, therefore, use asymmetric matrix in our quasi-manifold alignment procedure. Since is not symmetric, there may be imaginary parts in the eigendecomposition. Based on our experiment, taking only the real part of eigenvectors with respect to the eigenvalue that has the smallest real part will give better overall results. The final low dimensional representation is = [ ( 1 ), ( 2 ), … , ( )], where ( ) means the real part of .

Test for significance of virtual-KO perturbed genes
The virtual-KO perturbed genes are identified as genes with significant differences in their regulatory patterns in two scGRNs constructed from the WT and KO data. The method for testing the significance of the difference for each gene is described here. With = � � � = [ 1 , 2 , … , ] obtained in manifold alignment, for each gene, we calculate the distance d j between its two projected feature vectors from two networks. The rankings of d j are used to help identify significant genes. To avoid arbitrariness in deciding the number of selected genes, we proposed a χ 2 test. Specifically, since 2 is calculated by taking the summation of squares of the differences of projected representations of two samples, its distribution could be approximately χ 2 . Instead of 2 , we use the scaled fold-change ⋅ 2 / 2 ��� as the test statistic for each gene to adjust the scale of the distribution, where is the degree of freedom. ⋅ 2 / 2 ��� approximately follows a χ 2 distribution with the degree of freedom (df) if the gene does not perform differently before and after knocking out the target gene. By using the upper tail (P[X>x]) of the χ 2 distribution and the Benjamini-Hochberg (B-H) FDR correction [92] for multiple testing correction, we assign a Pvalue for each gene. To determine df, since the number of the selected significant genes will increase as df increases, we choose df = 1 to make a conservative selection of genes with high confidence.
Gene functional annotation and enrichment tests Identified virtual-KO perturbed genes were evaluated and assessed for their molecular functions and biological processes using Enrichr functional enrichment analysis. The enrichment test results of Enrichr were derived from a large number of reference gene sets. In this study, we reported results derived from the following gene set collections: KEGG_2019_Human, KEGG_2019_Mouse, GO_Biological_Process_2018, GO_Cellular_Component_2018, GO_Molecular_Function_2018, BioPlanet_2019, WikiPathways_2019_Human, WikiPathways_2019_Mouse, and Reactome_2016. GSEA analysis was conducted using a ranked list of all genes as input and comparing against either KEGG_2019_Mouse or KEGG_2019_Human gene sets. The genes were ranked according to their distance between corresponding projection on the aligned manifold between WT and pseudo-KO scGRNs. To remove redundant hits, identified gene sets were compared pairwise. For each pair of identified gene sets, e.g., A and B, if Jaccard index, = | ∩ | | ∪ | ⁄ , is greater than 0.8, then this pair of gene sets were merged into one. In this way, non-redundant sets of identified gene sets were created and used for functional inference. GSEA analysis was also conducted against gene sets of marker genes. The marker genes were obtained from PanglaoDB [82]. The protein interaction enrichment tests were performed using the web tool of the STRING database [44].
Systematic KO analysis using scRNA-seq data in microglia The scRNA-seq data used in the systematic KO analysis was obtained from the brain immune atlas (https://www.brainimmuneatlas.org), which is a scRNA-seq resource for assessing and capturing the diversity of the brain immune compartment, as published in [73]. The data was generated using the 10x genomics chromium platform, including more than 61,000 CD45+ immune cells from whole brains or isolated dura mater, subdural meninges, and choroid plexus of mice. The downloaded data is called the full aggregate data set (combining cells of whole brain and choroid plexus cells from WT + Irf8 KO mice), stored in the file named filtered_gene_bc_matrices_mex_irf8_fullAggr.zip. The downloaded matrix was processed, and the sub-matrix contained 5,271 microglia from the WT mice. For all genes, the KO perturbation profile of each gene, i.e., a vector of DR distances, was transformed using Box-Cox transformation and then was standardized using z-score transformation. The processed KO perturbation profiles of all genes were combined into one matrix for t-SNE embedding.
Ahr -/-KO mouse strain and crypt cell isolation Animals were housed under conventional conditions, adhering to the guidelines approved by the Institutional Animal Care and Use Committee at Texas A&M University. Stem cell-targeted Lgr5-EGFP-IRES-Cre ERT2 , AhR f/f and tdTomato f/f mouse strains have all been previously described [37]. The mouse genotypes used in this study include: tamoxifen inducible -Lgr5-EGFP-Cre ERT2 X Tomato f/f (WT, control), AhR f/f X Lgr5-EGFP-Cre ERT2 X Tomato f/f (KO). Male mice (n=5 per genotype, 8-10 weeks of age) were intraperitoneally injected with 2.5 mg of tamoxifen (Sigma, T5648) dissolved in corn oil (25 mg/ml) once a day, for four consecutive days. Mice were maintained on an AIN-76A, semi-purified diet (Research Diets, D12450B), fed ad libitum and housed on a 12 h light-dark cycle. For all experiments, littermate controls were cohoused with the KO mice. Two weeks post tamoxifen injection, colons were removed, washed with cold PBS without calcium and magnesium (PBS-/-), everted on a disposable mouse gavage needle (Instech Laboratories) and incubated in 15 mM EDTA in PBS-/-at 37°C for 35 min as previously described [37]. Subsequently, following transfer to chilled PBS-/-, crypts were mechanically separated from the connective tissue by vigorous vortexing. Cell suspensions were then filtered through a 40-µm mesh and Tomato-expressing cells (includes GFP + /Tom + as well as GFP negative/Tom + ) were collected using a MoFlo Astrios Cell Sorter (Beckman Coulter) or a Bio-Rad S3e Cell Sorter. Tomato positive cells represent colonic stem cells and their progeny. Dead cells were excluded by staining with propidium iodide or 7-AAD. Samples were subsequently processed using the 10x Genomics scRNA-seq pipeline described below. A total of 62,741 cells from 10 mice were sequenced. These included 34,889 sorted colonocytes from the WT and 27,852 from the KO mice. The average number of genes detected per sample was 20,141. From all sequenced cells, 40,690 (21,263 from WT and 19,427 from KO samples) were removed with the scRNA-seq quality control procedure. These cells had either a high proportion of mitochondrial reads (greater than 10%) or exhibited an extremely large or small library size.

Malat1 -/-KO mouse strain and pancreatic islet cell isolation
Malat1 null mice were obtained from Texas A&M Institute for Genomic Medicine (TIGM). Islets were isolated based on published procedures [93][94][95]. Briefly, mice were killed by cervical dislocation. The pancreas was inflated with 3 mL of cold collagenase solution (0.3 mg/mL) (Collagenase type 4, Worthington Biochemical Corporation) through the common bile duct with a 20 G needle, starting at the gallbladder. The pancreas was then removed from the body and placed in a siliconized vial containing 2 mL of 1 mg/mL collagenase solution and digested at 37°C in a water bath for 12∼15 min. After three washes of the digested pancreas with centrifugation (97 × g at 4°C for 1 min) to collect the tissues, islets were purified by density gradient centrifugation (cold polysucrose/sodium diatrizoate solution 1.1119 g/ml, 560 × g at 4 °C for 15 min) and then dispersed in 96 well plates and cultured in RPMI (Gibco) with 10% FBS for further experiments. To determine islet function and confirm success of isolation of the functional islets, the glucosestimulated insulin secretion level was measured using a highly sensitive sandwich ELISA assay kit following the manufacturer's instructions (EMD Millipore, EZRMI-13K). After confirming the function of isolated islets, the islets were washed in 10 ml of PBS and dissociated using 2ml of Accutase (Sigma-Aldrich, Cat#A6964) and incubated at room temperature for 20 mins with gentle intermittent mixing by pipetting. The dissociated islets were confirmed using microscopy and were stopped with islet culture media (RPMI with 10% FBS). The dissociated islets were then filtered through a 40-mm filter to remove disassociated tissue and obtain a single-cell suspension of pancreatic islet cells. A total of approximately 20,000 cells were resuspended in 30 ml PBS with 0.05% BSA. Samples were subsequently processed using the 10x Genomics scRNA-seq pipeline described below.
10x Genomics scRNA-seq Single-cell sample preparation was conducted according to Sample Preparation Demonstrated Protocol provided by 10x Genomics as follows: 1 ml of cell suspension from each mouse genotype was pelleted in Eppendorf tubes by centrifugation (400 g, 5 min). The supernatant was discarded, and the cell pellets were then resuspended in 1x PBS with 0.04% BSA, followed by two washing procedures by centrifugation (150 g, 3 min). After the second wash, cells were resuspended in ~500 µL 1x PBS with 0.04% BSA followed by gently pipetting 10-15 times. Cells were counted using an Invitrogen Countess automated cell counter (Thermo Fisher Scientific, Carlsbad, CA), and the viability of cells was assessed by Trypan Blue staining (0.4%). Subsequently, single-cell GEMs (Gel bead in EMulsion) and sequencing libraries were prepared using the 10x Genomics Chromium Controller in conjunction with the single-cell 3′ v3 kit. Cell suspensions were diluted in nuclease-free water to achieve a targeted cell count of 5,000 for each cell line. The cDNA synthesis, barcoding, and library preparation were then carried out according to the manufacturer's instructions. Libraries were sequenced in the North Texas Genome Center facilities on a NovaSeq6000 sequencer (Illumina, San Diego). For the mapping of reads to transcripts and cells, sample demultiplexing, barcode processing, and unique molecular identifier (UMI) counting were performed using the 10x Genomics pipeline CellRanger v.2.1.0 with default parameters. Specifically, for each library, raw reads were demultiplexed using the pipeline command 'cellranger mkfastq' in conjunction with 'bcl2fastq' (v2.17.1.14, Illumina) to produce two fastq files: the read-1 file containing 26-bp reads, consisting of a cell barcode and a unique molecule identifier (UMI), and the read-2 file containing 96-bp reads including cDNA sequences. Reads then were aligned to the mouse reference genome (mm10), filtered, and counted using "cellranger count" to generate the gene-barcode matrix.

Data and code availability
The scRNA-seq data of KO mice reported in this paper has been deposited in the NCBI's Gene