WENDY: Gene Regulatory Network Inference with Covariance Dynamics

Determining the structure of gene regulatory networks (GRNs) is a central problem in biology, with a variety of inference methods available for different types of data. However, for a prominent and intricate scenario with single-cell gene expression data collected post-intervention across multiple time points, where joint distributions remain unknown, there is only one known specifically developed method, which does not fully utilize the rich information contained in this data type. In response, we introduce an inference approach tailored to this challenging context: netWork infErence by covariaNce DYnamics, dubbed WENDY. The core idea of WENDY is to model the dynamics of the covariance matrix, and solve this dynamics as an optimization problem to determine the regulatory relationships. To assess its efficacy, we benchmark WENDY against alternative inference methods using synthetic data. Our findings underscore WENDY’s robust performance across diverse synthetic datasets. Moreover, we deploy WENDY to analyze three distinct experimental datasets, uncovering potential gene regulatory mechanisms.


Introduction
In general, a gene is transcribed into mRNA and then translated into proteins.This process, known as gene expression, commonly employs mRNA count or protein count to denote the expression level.In addition to directly changing cell phenotypes [66,17], influencing extracellular processes [5], or even manipulating macroscopic neurological circuitry [47,79], certain proteins can affect the transcription of other genes (mutual regulation) or their own corresponding genes (autoregulation).Genes and their regulatory relationships form a gene regulatory network (GRN).
Determining the GRN structure is a central problem in biology, as it reveals how a living organism is maintained [4], and provides control of essential biological processes [86], especially treating cancer [54,16].However, directly establishing the GRN using traditional technologies is extremely difficult since they cannot measure the expression levels of many genes within the same cell.Instead, numerous methods have been developed to infer the GRN structure from gene expression data.Particularly, recent advancements in single-cell RNA-sequencing technologies have made it possible to profile the whole transcriptome of single cells at large-scale.However, single-cell RNA-seq can only measure one time point because cells have to be killed during the experimental process, making it challenging to study gene regulation relationships that require multiple observations over time.
In this paper, we focus on a specific data type arising from the following setup: First, implement an intervention that affects gene expression (e.g., drugs).Then measure the expression level (generally mRNA count) of n genes for different single cells at multiple time points, and select the data from time points where the expression has not yet reached a stationary state.Since gene expression at the single-cell level is stochastic, for each time point, we obtain many samples of an n-dimensional random vector.However, since we need to kill a cell before measuring its gene expression levels, one cell can only be measured once.Thus, we measure different cells at different time points, and we do not have a joint distribution for gene expression at different time points.Although this approach has become common in recent experimental research [13], and it provides more informative data compared to most other approaches, to our knowledge, there is only one inference method developed specifically for this data type, SINCERITIES [61].A major limitation of SINCERITIES is that it requires data from at least six time points to perform well.Additionally, for single-cell expression data of n genes over T time points, this method only extracts n(T −1) numbers for further analyses, implying low data utilization efficiency.There have been many inference methods for single-cell time series gene expression data, where the joint distribution of expression levels at different time points is known [37,101].Since obtaining the joint distribution of gene expression is difficult, such methods are usually not practically applicable.There are also many inference methods for single-cell gene expression data measured at a single time point [8,95], or bulk level gene expression data measured at multiple time points after interventions [62,51,87].These data types are more common because of their low cost.Nevertheless, they provide less information single-cell gene expression data compared to the data type we examine in this study, namely, time series data from singlecell gene expression.Therefore, while it is feasible to convert our considered data type into these more common forms and use corresponding inference methods, such transformations result in a significant loss of the rich information inherent in the original dataset.
In this paper, we introduce an algorithm named NetWork infErence by covariaNce DYnamics (WENDY), designed to connect single-cell gene expression data at different time points, even in the absence of knowledge about the joint distribution.The core idea behind WENDY is to compute the covariance matrices of gene expression levels at two time points and model the evolution of these covariance matrices over time.To infer the GRN, we formulate a non-convex optimization problem based on the dynamics of covariance matrices and derive a numerical solution.For a visual representation of WENDY's workflow, refer to Figure 1.
One of WENDY's key advantages is its requirement of only two time points worth of data.This feature is particularly valuable in scenarios where intervention and / or measurement ultimately result in cell death, precluding measurements at additional time points.However, if data from more time points are available, WENDY can still be applied to each pair of neighboring time points to detect potential rapid changes in the GRN during the experiment.Furthermore, for single cell expression data comprising n genes across T time points, WENDY extracts (0.5n 2 + 0.5n)T numbers for further analyses, indicating significantly higher data utilization efficiency.
The paper proceeds as follows.In Section 2, we present a classification framework for gene expression data and review existing GRN inference methods.Section 3 details the WENDY method, including the mathematical gene expression model and the approach to solving the dynamics of this model.In Section 4, we evaluate WENDY and other GRN inference methods using synthetic data to compare their performance.In Section 5, we evaluate WENDY and other GRN inference methods using experimental data to compare their performance.Finally, we conclude with discussions in Section 6.

Data classification and literature review 2.1 A framework for data classification
There are different types of gene expression data that can be used to infer the GRN structure.Different data types correspond to different inference methods.We first present a framework for classifying related data types, modified from the framework by Wang and Wang [85].See Table 1 for this classification framework.There are different dimensions to classify data types.
(1) We can measure the gene expression levels when the dynamics of gene expression is stationary (invariant along time), or we can add an intervention to drive the dynamics of gene expression away from stationarity, and measure the gene expression levels when they gradually return to the (possibly new) stationary state.For the intervention, we consider general interventions such as adding drugs (we cannot control which genes are affected) and specific interventions such as gene knockdown and gene knockout (we can select any genes to affect).Considering our capability to measure gene expression levels pre-and post-intervention, a specific intervention yields more informative data than a general one.Moreover, scenarios with intervention are richer in information compared to those without, where only stationary expression levels are observed.
(2) We can measure the average expression levels of many cells (bulk level) or measure the expression levels for each single cell (single-cell level).On single-cell level, gene expression is essentially stochastic, and we shall obtain a random variable for the expression level of each gene.On bulk level, the stochasticity is averaged out, and we should obtain a deterministic value for the expression level of each gene.Single-cell level measurement is more informative than bulk level measurement.
In practice, repeating the same bulk level measurement can still lead to different values, making some researchers regard such data as stochastic and apply inference methods designed for single-cell data [8].Nevertheless, at bulk level, randomness from single cells is averaged out, and the different values from bulk level measurement can only come from systematic differences, such as different cell phenotypes or different environmental factors.Such unobserved systematic differences can affect multiple genes and make them correlated, although these genes might not have direct regulatory relations.Therefore, we do not consider bulk level data that have different values for the same measurement.
(3) We can measure expression levels at one time point or multiple time points.When we measure expression level at multiple time points, one essential issue is whether we can measure the same cell multiple times.For bulk level data, this does not matter, as the data are deterministic, and whether the cells at t+1 are the same as the cells at t should not make a difference.However, for single-cell level data, since the measured levels are stochastic, there is an essential difference.Denote the single-cell expression level of a gene at time t as X(t).If the same cell can be measured multiple times, then we have the joint probability distribution of a time series, P[X(0) = x 0 , X(1) = x 1 , X(2) = x 2 , . ..].Otherwise, we only have marginal probability distributions for each time point, P[X(0) = x 0 ], P[X(1) = x 1 ], P[X(2) = x 2 ], . .., but not the correspondence between time points, and certain quantities cannot be calculated, such as the correlation coefficients of expression levels at two time points.Time series data types are more informative than one-time data types, and joint distribution is more informative than marginal distributions.
In practice, measuring the expression levels of many genes is destructive, and we cannot measure the same cell more than once.If we only want to measure the expression level of a single gene, there are some techniques (fluorescent proteins [91], etc.) that can measure the same cell multiple times.Another approach is to measure the amount of spliced and unspliced mRNAs, which provides both the current expression level and an approximation of its time derivative (RNA velocity [44]).This approach provides two measurements of the same cell, and some inference methods for time series data can be applied.

Known inference methods for different data types
Given this classification framework, we can review inference methods for different data types.In this framework, there are 15 data types (scenarios).Some data types do not have enough information that can be used to infer the GRN structure.Some data types (e.g., Scenario 13) have more information than some other data types (e.g., Scenario 8), but the extra information cannot lead to new GRN inference methods.Thus for such scenarios, we can only use methods for other less informative scenarios.This approach loses a lot of information, and therefore cannot justify the time and money spent to obtain more informative data.Some data types have extra information that allows for inference methods that work for such scenarios but not for less informative scenarios.
For bulk level data types, since we only have a single deterministic value for each gene, it is difficult to obtain the correlation between genes.For Scenarios 1, 3, 6, the GRN structure cannot be inferred.For Scenarios 8 and 13, we can regard the gene expression One-time For different data types (scenarios), we study whether the GRN structure can be inferred.
For each data type, No means that the GRN structure cannot be inferred.Ditto means that there are no specific inference methods, but the GRN structure can be inferred using the same method for a less informative data type.Yes means specific inference methods exist.Scenario 11 only has an inference method that also requires the data in Scenario 1.
We focus on Scenario 9, which is not as well-studied and has only one specific inference method.
time series as solution trajectories of an ordinary differential equation (ODE) system.If we assume that the ODE system is linear [62] or has certain nonlinear forms [51], we can discretize the ODE system into an algebraic equation system and use regression to infer the ODE parameters.Here the ODE parameters represent the GRN.We can infer all the edges, including the directions.For Scenario 11, one can add an intervention on each gene and observe which genes (descendants of this gene in the GRN) are also affected.Such ancestor-descendant relationships can be used to partially infer the GRN structure [85].Not all regulatory relationships (edges) can be inferred, but one can infer at least n − 1 edges for a GRN with n genes.
For single-cell level one-time data types (Scenarios 2, 7, 12), there have been numerous GRN inference methods for Scenario 2, while Scenarios 7 and 12 generally do not have extra information that supports specific inference methods.For Scenario 2, most inference methods turn the GRN inference problem into a feature selection problem: select genes whose levels can be best used to predict the level of the target gene.Then such selected genes might have regulatory relationships with the target gene.The selection can be made by calculating certain quantities between the target gene and the candidate genes that measure their similarity: covariance [60], mutual information [8], or other information theory quantities [95].Besides, one can apply regression [33], decision tree [38], or other machine learning and deep learning [70,100,103] methods to directly select out genes that can predict the target gene.Regularization terms (e.g., L 1 [32] and L 2 [41] regularizers) can be added to the regression to make the result sparse.Besides the idea of feature selection, another approach is to build probabilistic models (Bayesian network [2,45], stochastic differential equation (SDE) [82], and others [49,12]), and use likelihood to determine the most probable network.Since the number of candidate networks is large, a common solution is to apply Markov chain Monte Carlo (MCMC) to approximate the network probabilities [55,2,1,31,102].Deterministic models, such as Boolean networks [48], can also be used.There is a well-developed platform that combines different inference methods for Scenario 2 [89].One problem of Scenario 2 is to determine the direction of a regulatory relationship, since if the level of V i can predict the level of V j , then generally the level of V j can also predict the level of V i .To solve this problem, one can add specific interventions (Scenario 12) on V i to see whether V j is affected.
For single-cell level time series data types without joint distribution of different time points (Scenarios 4, 9, 14), it is common to treat Scenario 4 as Scenario 2 (treat data at different time points separately), and treat Scenarios 9, 14 as Scenario 8 (average over different cells), and apply corresponding inference methods.The only GRN inference method we know that works specifically for Scenario 9 (but not any less informative scenarios) is SINCERITIES [61], which considers the Kolmogorov-Smirnov distance between the distributions of the same gene at two time points, and then applies linear regression, similar to methods for Scenario 8.There are two other regression-based methods, Harissa and CAR-DAMOM [35,77], that essentially work for Scenario 2, but can partially (and insufficiently) integrate the time information when work for Scenario 9.
For single-cell level time series data types with joint distribution of different time points (Scenarios 5, 10, 15), there have been numerous GRN inference methods for Scenario 5, while Scenarios 10 and 15 generally do not have extra information that support specific inference methods.For Scenario 5, most inference methods are similar to those for Scenario 2, especially those methods on regression [96], tree-based feature selection [37,101], or more advanced machine learning tools [57,3], although there are also inference methods based on more complicated biological models [39].One common approach is to model the gene expression by a vector autoregressive model (either linear or nonlinear) [27,28,71], and then use Granger causality to determine whether one gene directly regulates another gene [29,30,56,98].This approach solves the famous "correlation does not imply causation" problem from two aspects.First, when gene V i and gene V j are correlated, it is possible that they are not directly regulating each other, such as In this case, given the values of V k , V i cannot provide more information for V j , and Granger causality can determine that V i does not directly regulate V j .Second, when there is directly regulation between V i and V j , we do not know whether V i is the cause or the result of V j .Since Granger causality determines whether the past of V i contains unique information of the future of V j , the direction of regulation is also known, since causality can only travel forward along time.This explains why most methods for Scenario 5 can determine the direction of regulations, different from their analogies for Scenario 2.
Besides treating a more informative data type as a less informative data type, one can also use certain methods to transform a less informative data type into a more informative data type, provided there are certain assumptions about the underlying systems.For instance, from single-cell one-time data (Scenarios 2, 7, 12), one can construct the pseudotime to transform the data into time series data [67,74], and apply corresponding inference methods [22].
Different methods need different assumptions regarding gene expression and gene regulation.For instance, some methods need the gene expression dynamics to be linear [62], and some other methods need the GRN to have no directed cycle [95].Besides, different methods have different inference abilities: some methods can determine all edges, including the direction [37], while some other methods can only partially determine some edges, and/or cannot determine the edge direction [85].
Most GRN inference methods can only determine regulations between genes, but not autoregulation.Autoregulation inference methods generally need stronger model assumptions [92], more informative data types [25], or only produce partial results [83].
The above discussion only considers the situation of inferring GRN after obtaining all data.Another situation is to design intervention experiments, so that the GRN can be inferred with the minimal cost [18].
Readers may refer to other reviews for more details about GRN inference methods for different scenarios [7,65,99,59,85] or for other data types (besides mRNA/protein count) that can help with GRN inference, such as ChIP-seq and ATAC-seq [24,6].

Novel GRN inference method
In this section, we present an algorithm of netWork infErence by covariaNce DYnamics (WENDY), that works for Scenario 9, single-cell level time series data without joint distribution of different time points, measured after general interventions.It can determine all regulatory edges including their directions, but not autoregulation.Using extra information such as DNA sequence and transcription factor binding motifs, some regulatory edges can be excluded.Such prior knowledge can be incorporated by WENDY, but we first consider the original problem that any regulatory edge is possible (except autoregulation).

Mathematical model of gene expression
We start by building a model of gene expression for a single cell.Although various factors can affect gene expression, we only study regulations between genes.Assume there are n genes V 1 , . . ., V n , which form a GRN.We assume that no other genes affect V 1 , . . ., V n , meaning that there is no hidden variable.Denote the expression levels of V 1 , . . ., V n by X 1 , . . ., X n .A common model for the dynamics of X i is a (stochastic) differential equation Here c 1,i means the basal synthesis rate of X i , and c 2,i X i means the total degradation rate of X i .The function f that reflects the interaction between genes can take any form with any number of unknown parameters.The noise term is generally Gaussian.We think that the random fluctuation is proportional to X i , and decide to follow Pinna et al. and Papili Gao et al. [64,61] to set the noise term to be σ i X i dW i (t), where σ i is an unknown constant, and W i (t) is a standard Brownian motion.Since f in Eq. 1 is an arbitrary function, it is impossible to determine its parameters.We need to add restrictions on the form of f .Besides single gene regulations V j → V i , transcription factors can bind to enhancers, whose signals are transmitted by coactivators (e.g., mediators) to promoters, which can bind to general transcription factors and regulate the expression of certain genes [72,40,11].Therefore, to model the complicated gene expression, at least we need to consider cooperative regulations with two genes, V j + V k → V i .This means that Eq. 1 becomes Here g and h are known functions (not necessarily linear), and a j→i , b j,k→i are regulation coefficients.
To determine the GRN, we need to determine all coefficients a j→i , b j,k→i , whose total number is n 3 /2 + n 2 /2 for n genes.For single-cell gene expression data, at each time point, the data set is a matrix of m cells × n genes.From the data, we can estimate the parameters of the joint distribution of n genes.If this joint distribution is Gaussian, we can obtain n parameters for the mean, and n(n + 1)/2 parameters for the (co)variance (since the covariance matrix is symmetric), which fully determine the distribution.Even though the joint distribution of n genes might not be Gaussian in reality, the number of parameters that can be estimated from expression data at each time point should be at the level of n 2 /2.Therefore, to determine n 3 /2 + n 2 /2 GRN parameters in Eq. 2, we need at least n time points.This generally does not hold in reality, since a small GRN of interest might contain tens of genes, but the number of time points for type 9 data is commonly only a few [34,76,19].Thus we have to further (over)simplify Eq. 2 and drop cooperative terms: Now there are only n 2 GRN parameters, and theoretically, single-cell data at only 2 time points can provide enough information.For instance, we can calculate the covariance matrices of X 1 , . . ., X n at time 0 and time t, and determine what a j→i in Eq. 3 can lead to such covariance matrices.This is an inverse problem of Eq. 3, which is much harder than the original problem.
In some models, the function g is nonlinear, meaning that Eq. 3 cannot be solved analytically.Thus, it is very difficult even if we only want to solve the inverse problem numerically, especially if we want the solver to be numerically stable.We need to further (over)simplify the model to assume that it is linear: Since Eq. 4 is linear, a i→i g(X i ) (represents autoregulation) and −c 2,i X i (represents degradation) are combined to a i,i X i , while a j,i = a j→i .This means that we cannot determine the existence of autoregulation, even if we can solve a i,i .Since the autoregulation of V i is masked by the natural synthesis and degradation of V i , most GRN inference methods cannot determine the existence of autoregulation.We combine Eq. 4 for different V i to obtain Here and ⊙ is the entrywise (Hadamard) product: If we directly solve the covariance matrix of X 1 (t), . . ., X n (t) from Eq. 5, then it has an integral that hinders the following optimization procedure.Thus we first ignore the noise term in Eq. 5, and its solution is Since e tA is still difficult to handle in the following optimization procedure, we consider the first-order approximation where I is the n × n identity matrix.Then we add back the integrated noise term, which is then approximately where ǫ(t) = [ǫ 1 (t), . . ., ǫ n (t)] is an n-dimensional normal random noise with mean (0, . . ., 0) and covariance matrix G.Here we assume that G is diagonal, meaning that noise terms ǫ i (t), ǫ j (t) for different genes are independent, but the diagonal elements of G are unknown.Besides, ǫ(t) and X(0) are also independent.For type 9 data, after adding drugs or other interventions, we measure the expression levels of n genes at time 0, and the expression levels are n random variables: X(0) = [X 1 (0), . . ., X n (0)].Then at time t, we measure the expression levels of these n genes again to get n random variables: X(t) = [X 1 (t), . . ., X n (t)].Since we add interventions to drive the system away from stationary, X(0) and X(t) are not identically distributed.However, we do not have the joint distribution of X(0) and X(t), meaning that we do not know which sample of X(0) corresponds to which sample of X(t).From such data, we want to solve A, an invertible n × n matrix that represents the GRN we want: A i,j > 0 means gene i activates gene j; A i,j < 0 means gene i inhibits gene j; and A i,j = 0 means gene i does not regulate gene j directly.However, as discussed above, A i,i is the combination of degradation and possibly autoregulation of gene i, and A i,i = 0 does not necessarily mean autoregulation of gene i.In the next subsection, we will present a numerical method that calculates A in Eq. 7.
Gene expression has many biochemical subtleties, and Eq. 7 is certainly an oversimplification of Eq. 1.Here the simplification from Eq. 1 to Eq. 3 is inevitable, since type 9 data generally do not have many different time points, and cannot provide enough information to solve a model with too many unknown parameters (as in Eq. 2).If we want to learn some knowledge about the GRN from type 9 data, some simplification is necessary.The simplifications from Eq. 3 to Eq. 7 are only for numerical purposes.In Subsection 4.3, We will see that although our method is derived from Eq. 7, it has good performance on data generated by Eq. 11, which is in the form of Eq. 3. Therefore, the simplification from Eq. 3 to Eq. 7 does not harm the generalizability of our method.Besides, although our method cannot determine autoregulation V i → V i , it can determine feedback loop V i → V j → V i , which is the fundamental mechanism of some "autoregulations" observed in experiments [20].

Dynamics of covariance matrix
Since we do not know the joint distribution of X(0) and X(t), we cannot determine which sample of X(0) corresponds to which sample of X(t), meaning that directly attacking Eq. 7 does not help.Instead, we can assume that some statistical quantities on both sides should be equal.
If we take expectation on Eq. 7, it returns to Eq. 6.For any A, we can find the value of c to make Eq.6 hold.Thus we cannot solve A from Eq. 6.Instead, we can consider the covariance matrices of X(0) and X(t), denoted as K(0) and K(t).We have Given K(0) and K(t), we cannot solve A directly from Eq. 8, even if we set D = 0. Assume K(0) and K(t) are invertible.As covariance matrices, they are positive-definite and have Cholesky decomposition K(0) = L T 0 L 0 and K(t) = L T 1 L 1 with upper-triangular L 0 and L 1 .Then for any orthonormal matrix O with O T O = I, A = (L −1 0 OL 1 − I)/t is a solution of Eq. 8 with D = 0. Thus Eq. 8 has infinitely many solutions in this case, and we need to add some conditions to obtain a unique A.

Optimization formulation for covariance dynamics
Assume we measure the expression levels of n genes for m single cells.When m < n, which is common in reality, if we directly calculate the covariance matrix, it will always be degenerate (non-invertible).Therefore, we need to apply a specific method, called graphical lasso, that estimates the covariance matrix K in this case, where K is invertible, and the inverse of K is sparse [26].
Therefore, if N is multivariate normal, then K −1 i,j = 0 if and only if N i and N j are independent conditioned on other variables.
By assuming the expression levels of n genes satisfy a multivariate normal distribution, there is a GRN inference method [60]: gene i and gene j have a direct regulatory relation (direction unknown) if and only if K −1 i,j = 0. Nevertheless, this method assumes that the gene expression is in stationary.
For the data set we consider, the intervention might change the dynamics, and the gene expression is not in stationary.Therefore, K −1 i,j = 0 might just mean that gene i and gene j has a direct regulatory relation before the intervention, not necessarily implying A i,j = 0.However, the inverse should be true: if K(0) −1 i,j = 0 and K(t) −1 i,j = 0, then gene i and gene j should have no direct regulatory relation, whether before intervention or after intervention.Define C = {(i, j) | K(0) −1 i,j = 0 and K(t) −1 i,j = 0}.Then A i,j = 0 if (i, j) ∈ C. Since K(0) −1 and K(t) −1 are symmetric, C is also symmetric: (i, j) ∈ C implies (j, i) ∈ C. Besides, since K(0) −1 and K(t) −1 are sparse, C contains most edges.
Certain data, such as DNA sequence and transcription factor binding motifs and ATACseq data, can provide prior knowledge that gene i cannot regulate gene j, meaning that A i,j = 0. We denote the set of such forbidden edges as F. Notice that F might not be symmetric.
From the data, after estimating the covariance matrices, we obtain invertible covariance matrices K(0) and K(t), where K(0) −1 and K(t) −1 are sparse.Now we have Since the diagonal matrix D is unknown, we only want to match off-diagonal elements of K(t) and (I + tA T )K(0)(I + tA).Therefore, we need to solve A from Eq. 9 regardless of diagonal elements, so that (I + tA) i,j = 0 whenever (i, j) ∈ C ∪ F. Under this restriction, there might not be a solution.Instead, we can minimize the matching error and turn it into an optimization problem: where A i,j = 0 whenever (i, j) ∈ C ∪ F or i = j, while λ ≥ 0 is a predetermined constant.The constraints are handled by only optimizing over the nonzero edges, and thus (10) simplifies to a (possibly) regularized nonlinear least squares problem.We set λ = 0 in the following simulations, but allow users to adjust λ if necessary.Since C ∪ F contains most edges, the final A is sparse, which is biologically favorable, since we do not want a very dense GRN.If calculated A i,j > 0 or A i,j < 0 (i = j), we claim that gene i regulates (activates/inhibits) gene j.The diagonal elements of A do not provide information about gene regulation, as we cannot distinguish between autoregulation and normal gene expression.
We use the BFGS algorithm to minimize (10).The overall algorithm is presented in Algorithm 1.For the WENDY method, step 4 of Algorithm 1 is much faster than step 2, since the solver terminates after a constant number of iterations.Graphical lasso has time complexity O(n 3 ) [26], which does not depend on the cell number m.Therefore, the overall time complexity of WENDY method is O(n 3 ).In Section 4, we will see that in practice, the time cost of WENDY increases with n, but not m.
Algorithm 1: Detailed workflow of WENDY method.

Input:
Single-cell gene expression data at T = 0 (for m 1 cells and n genes) and at T = t (for m 2 cells and the same n genes), both measured after general interventions.Prior knowledge of forbidden edges, F 2. Call graphical lasso method to calculate the covariance matrix K(0) for expression data at T = 0, so that K(0) is invertible, and K(0) −1 is sparse.Also calculate the covariance matrix K(t) for expression data at T = t

Call BFGS solver for the optimization problem arg min
with constraints A i,j = 0, ∀(i, j) ∈ C ∪ F

Output:
The GRN matrix A. A i,j > 0 or A i,j < 0 means gene i activates/inhibits gene j.However, A i,i = 0 does not necessarily mean autoregulation of gene i

Theoretical comparison with other methods
To infer GRN structure in Scenario 9, besides WENDY, we have four other options: (I) apply SINCERITIES method; (II) calculate the average gene expression levels over all cells to transform the data into Scenario 8, and apply corresponding methods; (III) only consider the data at one time point, which is Scenario 2, and apply corresponding methods; (IV) treat the data at each time point as Scenario 2, and apply corresponding methods, but calibrate the inferred GRN by the GRN inferred for the previous time point, similar to Harissa and CARDAMOM.
(I, II) For an expression level data set with n genes at T time points, SINCERITIES only extracts n(T − 1) values, and then applies linear regression.If we transform the data into Scenario 8, we can only obtain nT values before fitting to an ODE system.If we abandon data at other time points to switch to Scenario 2, we lose the temporal information.In comparison, WENDY extracts n 2 T /2 + nT /2 values from such data before proceeding to the next step.Therefore, WENDY can extract more information from data.SINCERITIES requires at least T = 3 time points to work, and requires at least T = 6 time points to perform well.Methods for Scenario 8 need at least T = 3 time points to work, and also more time points to work well.In comparison, WENDY only needs T = 2 time points.This can be explained by information theory: the GRN (without autoregulation) has n 2 − n independent values.WENDY uses data at T = 2 time points to extract n 2 + n values; SINCERITIES needs data at T = n time points to extract n 2 − n values; methods for Scenario 8 need data at T = n − 1 time points to extract n 2 − n values.
SINCERITIES and methods for Scenario 8 require more time points.This not only increases the cost, but also requires that the GRN does not change among such time points.This might not hold in reality, such as in the early development of embryos, where the regulatory strength might change rapidly.Also, when studying the effect of drugs on gene expression, the regulatory effect of drugs will gradually disappear.Instead, since WENDY only needs two time points, when there are data from more time points, we can apply WENDY for each pair of adjacent time points, and study how the GRN evolves along time.
(III) The theoretical foundation of some methods for Scenario 2 requires that the gene expression is at stationary.For Scenario 9, after adding the intervention, we need to wait for the gene expression to return to stationary state.However, in some experiments, the intervention (e.g., adding certain drugs) can drive the gene expression too far away from stationary, which leads to the death of cells before returning to stationary.Therefore, WENDY is a better solution to this transient scenario.
(IV) Harissa and CARDAMOM [35,77] have the same idea: for Scenario 2 data, build a nonlinear regression model under complicated pre-processing.Given the single-cell expression data D t at time t, they can calculate the regression error R(D t , A t ) for any possible GRN A t .Different from (III), they do not directly minimize R(D t , A t ), but add a calibration term that prevent the current inferred GRN A t from being too different with the inferred GRN A t−1 for the previous time point t − 1: When t = 1, set A 0 to be the zero matrix.For Scenario 9 data at time points 1, 2, . . ., T , use the last inferred GRN A T as the final result.This idea can partially utilize the time information, but most information between two neighboring time points is wasted.For the situation that the GRN does not change along time, this idea relies heavily on data at later time points, and wastes the data at early time points.For the situation that the GRN can change along time, this idea only reflects the GRN at the last time point, but not GRN at earlier time points.Since these two methods are based on a piecewise-deterministic Markov process model, not the stochastic differential equation model that generate the testing data for the next section, we will not test their performance.

Performance evaluation on synthetic data
In this section, we use two synthetic data sets with different settings to test the performance of WENDY and four other GRN inference methods that work for different data types: GENIE3, dynGENIE3, NonlinearODEs, SINCERITIES.WENDY ranks the second for the DREAM4 data set, and ranks the first for the SINC data set.

Methods and measurements
GENIE3 [38] works for Scenario 2, single-cell level one-time expression data at stationary.The idea is to infer the level of one gene by the levels of other genes using random forest or extra trees.All edges can be inferred, but sometimes the direction is hard to determine.
dynGENIE3 [37] is the revised version of GENIE3 that works for time series data, such as Scenario 5 (and 10), single-cell level time series expression data at stationary or after general interventions.The idea is basically the same as GENIE3, but here it infers the level of one gene at a later time by the levels of other genes at an earlier time.Therefore, all edges including the directions can be determined.dynGENIE3 requires the correspondence of cells at different time points.This means that it cannot be applied to Scenario 9 data directly.Therefore, dynGENIE3 is included just for completeness, not as a main comparison target.
NonlinearODEs [51] works for Scenario 8, bulk level time series expression data measured after general interventions.The idea is to fit the data to a nonlinear ODE model.It uses XGBoost to determine the importance score of each edge.There is a tunable parameter α that can be chosen manually or automatically, and we use the from data mode to determine the parameter α automatically.All edges including the directions can be determined.
SINCERITIES [61] works for Scenario 9, single-cell level time series expression data measured after general interventions, where the joint distribution at different time points is unknown.The idea is to calculate the distance between the distributions of the same gene at two time points, and then applies linear regression.All edges including the directions can be determined.
Our WENDY method also works for Scenario 9.All edges including the directions can be determined.For the data sets in this section, there is no extra biological knowledge about regulations.Therefore, the set of forbidden edges F = ∅.
To test a GRN inference method under different circumstances, a common practice is to use synthetic data [61,51,37].The data are generally generated by numerically simulating an SDE system.We will test different inference methods on two data sets: DREAM4 and SINC.
Each inference method obtains a calculated GRN matrix A ′ , whose entries take values in R. To generate the synthetic data, there is a true GRN matrix A, which can take values 1/0/ − 1.To evaluate the inference result A ′ with the true GRN matrix A, we calculate two measurements, AUROC and AUPR [61].AUROC is the area under the curve of true positive rate versus false positive rate, and AUPR is the area under the curve of precision versus recall.They evaluate how the inferred GRN fits the true GRN from different perspectives.Since A ′ i,i = 0 does not necessarily mean autoregulation, we do not compare the diagonal elements of A and A ′ .See Algorithm 2 for the calculation procedure of such quantities.

DREAM4 data set
The most common synthetic data set used to evaluate GRN inference methods is DREAM4 -In Silico Network Challenge [52].It has multiple challenges, each with multiple data types.We will consider the two in silico challenges and use only the time series data set.There are 5 GRNs with 10 genes, each accompanied by 5 stochastic trajectories at 21 time points.There are also 5 GRNs with 100 genes, each accompanied by 10 stochastic trajectories at 21 time points.The time points are equally distributed: 0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000.Each GRN is represented by a matrix A, where A i,j = 1 or A i,j = 0, which means that gene i can or cannot regulate gene j.Also, A i,i = 0 for each gene i.These two data sets are denoted as DREAM4 (10 genes) and DREAM4 (100 genes).DREAM4 data are generated by GeneNetWeaver software [69], which integrates some unknown SDE system.
We apply WENDY, GENIE3, dynGENIE3, NonlinearODEs, and SINCERITIES methods to DREAM4 data sets.To test the performance of different methods on DREAM4 data, we compare different settings of the same method and choose the best one.Specifically, we find that for methods that can use data from multiple time points, it is not always good to use data from all 21 time points.Instead, data from some time points should be abandoned.Here we list the best settings for each method on each data set: (1) For WENDY, we regard DREAM4 data as Scenario 9 data.For DREAM4 data (10 genes), we use the data at t = 450 and t = 850 without the cell correspondence between different time points.For DREAM4 data (100 genes), we use the data at t = 300 and t = 550 without the cell correspondence between different time points.
See Table 2 for the results, where the AUROC and AUPR are averaged over all 5 GRNs in the same data set.We can see that on average, WENDY is slightly better than NonlinearODEs and GENIE3, which are slightly better than SINCERITIES, while dynGENIE3 is significantly better than all other methods.This is not surprising, since dynGENIE3 can utilize the cell correspondence (joint distribution) of different time points, which is not realistic, but contains more information.
Besides, some results are lower than those values reported in corresponding papers [38,37,51], since we only use the time series data, not combining with other data types, and we do not manually fine-tune the parameters accordingly.One problem of the DREAM4 data is that each GRN only generates 5 or 10 stochastic trajectories (each corresponds to a measured cell population).This fits with the mainstream of bulk level data in the early 2010s, when it was difficult to repeat the measurement many times.Nevertheless, with the development of single-cell RNA sequencing, it is easier to obtain single-cell data from thousands of cells.Therefore, we want to generate single-cell expression data over more cells and test the performance of inference methods under different cell numbers.

SINC data set
Pinna et al. and Papili Gao et al. [64,61] use the following SDE to generate synthetic data: Here X i (t) is the expression level of gene i at time t, and W j (t) is a standard Brownian motion, independent with other W i (τ ).A i,j describes the GRN.V = 30, β = 1, θ = 0.2, σ = 0.1.When t → ∞, each X i (t) will converge to the stationary value and fluctuate slightly around it [93,14,15].
The following equation is the first-order approximation of Eq. 11 when X i (t) is small: Also, Eq. 7 is the discretization of Eq. 12: A, b i , F i,i in Eq. 7 correspond to V βA − V θI, V β, σ 2 t in Eq. 12.These facts provide the theoretical support that WENDY (derived from Eq. 7) can work for data generated by Eq. 11, since we only care about off-diagonal elements of A, which represent mutual regulations between different genes.For other gene regulation mechanisms, we can also use first-order approximation and discretization to obtain Eq. 7 or similar forms.Therefore, WENDY should be applicable to different gene regulation mechanisms.For Eq. 11, we simulate the system with the Euler-Maruyama method [42] for time t ∈ [0, 1] with time step ∆t = 0.01.This means treating dX j (t) as X j (t + ∆t) − X j (t), dt as ∆t, and dW j (t) as a normal random variable N (0, ∆t).
For the GRN matrix A, we use the 40 networks in Papili Gao et al. [61]: 10 Escherichia coli networks with 10 genes, 10 E. coli networks with 20 genes, 10 Saccharomyces cerevisiae (yeast) networks with 10 genes, 10 yeast networks with 20 genes.Each network has A i,j = 1/0/ − 1, and A i,i = 0 for each i.
For each group of data, we run the simulation m times, where m represents the number of cells/trajectories measured in reality.We set m = 10, m = 30, and m = 100 to test the performance of different methods under different m.The initial state X i (0) is independently and uniformly sampled in [0, 1).For each network in Papili Gao et al.'s paper, and each value of m, we generate 100 groups of data.The same as Papili Gao et al.'s paper, the simulation finishes at t = 3.0.We record the expression levels at the following 11 time points: t =(0.0, 0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3.0).For the data sets generated by Eq. 11 on GRNs with 10 or 20 genes, we name them SINC (10 or 20 genes) data.Since the initial state is generally different from the stationary state, SINC data should be regarded as Scenario 10 data (single-cell level time series data under general interventions, where the joint distribution of different time points is known).
To test the performance of different methods on SINC data, we compare different settings of the same method and choose the best one.Specifically, we find that for methods that can use data from multiple time points, it is not always good to use data from all 11 time points.Instead, data from some time points should be abandoned.Here we list the best settings for each method on each data set: (1) For WENDY, we regard SINC data as Scenario 9 data.For SINC data (10 genes), we use the data at t = 0.0 and t = 3.0 without the cell correspondence between different time points.For SINC data (20 genes), we use the data at t = 2.7 and t = 3.0.
We apply WENDY, SINCERITIES, NonlinearODEs, GENIE3, and dynGENIE3 to SINC data sets and compare the corresponding AUROC and AUPR.Each AUROC or AUPR value is averaged over 2000 simulations (20 GRNs, each with 100 groups of data).See Table 3 for the results.We can see that WENDY is significantly better than all other methods.Here dynGENIE3 has the worst performance, possibly because it cannot determine the sign (activation/inhibition) of a regulation.
For different values of cell/trajectory number m, we can see that the performance metrics generally increase with m, meaning that more cells provide more information.
Notice that WENDY is not always better in all settings.For example, if we only have the data at t = 0.0 and t = 0.1 for SINC data with m = 100, then WENDY has AUROC 0.77 and AUPR 0.34 for SINC (10 genes) data, and AUROC 0.83 and AUPR 0.29 for SINC (20 genes) data, while GENIE3 has AUROC 0.77 and AUPR 0.32 for SINC (10 genes) data, and AUROC 0.92 and AUPR 0.50 for SINC (20 genes) data.Besides, for the setting in Table 3, if we reduce m to 5, then the performance of WENDY is also no better than GENIE3.
We also test the running time of each inference method.For SINC (10 genes) and SINC (20 genes) data sets, we measure the execution time (averaged over different GRNs) of each algorithm for m = 10, m = 30, and m = 100 cells.See Table 4 for the running

Performance evaluation on experimental data
Besides synthetic data, there are also some GRNs determined by experiments, so that we can use them as the ground truth for the corresponding gene expression data [36,68,63].In this section, we use two experimental data sets in Scenario 9 with known GRNs to test the performance of WENDY and three other GRN inference methods that work for different data types: GENIE3, NonlinearODEs, SINCERITIES.Since each cell is only measured once, we do not have the joint distribution of expression levels at different time points, and dynGENIE3 does not apply.WENDY ranks the first for the THP-1 data set, and ranks the third for the hESC data set.

THP-1 data set
THP-1 data set considers single-cell expression levels of monocytic THP-1 human myeloid leukemia cells, measured at 8 time points, t = 0, 1, 6, 12, 24, 48, 72, 96h, after applying phorbol-12-myristate-13-acetate [43].For each time point, there are 120 cells measured.However, since each cell can be measured only once, we do not know how cells at different time points correspond to each other.Therefore, this data set is in Scenario 9.
To test the performance of different methods on THP-1 data, we compare different settings of the same method and choose the best one.Specifically, we find that for methods that can use data from multiple time points, it is not always good to use data from all 8 time points.Instead, data from some time points should be abandoned.Here we list the best settings for each method: (1) For WENDY, we regard THP-1 data as Scenario 9 data.We use the data at t = 0 and t = 48 without the cell correspondence between different time points.
(3) For NonlinearODEs, we regard THP-1 data as Scenario 8 data by taking average of gene expression levels over different cells.We use the data at 2 time points: t =(0, 1).( 4) For GENIE3, we regard THP-1 data as Scenario 2 data by considering only one time point.We use the data at t = 48.
We apply WENDY, SINCERITIES, NonlinearODEs, and GENIE3 to THP-1 data set and compare the corresponding AUROC and AUPR.See Table 5 for the results.We can see that WENDY is better than all other methods.Here GENIE3 has the worst performance, possibly because it cannot determine the sign (activation/inhibition) of a regulation.The results for SINCERITIES are slightly different from those reported by Papili Gao et al. [61], possibly because we do not consider diagonal elements of the GRN in the calculation of AUROC and AUPR., and the regulation mechanism of their gene expression is highly complicated [46].The same as THP-1 data set, there is no cell correspondence between different time points, and it is in Scenario 9. Matsumoto et al. [53] calculated the mean expression level of each gene at each time point, and then calculated the variance of the mean expression levels at different time points for each gene.Using this approach, they selected out top 100 highly varying genes, which might actively regulate each other.Given the data of these 100 highly varying genes, we further select out genes that express in at least 95% of the cells.The reason is that when the expression data have too many 0s, graphical lasso might fail to converge, which fails WENDY.There are 18 genes selected: POU5F1, AEBP2, MIER1, SMAD2, ZNF652, ZFX, TERF1, SOX11, BBX, ZFP42, TULP4, ZNF471, ARID4B, ZNF483, SHOX, ZNF587, ZFP14, CEBPZ.For such genes, there has been a GRN determined by experiments [58,73] that we can use as the ground truth.
Similar to the THP-1 data set, We apply WENDY (use t = 12, 24 data), SINCERITIES (use t = 0, 12, 24 data), NonlinearODEs (use t = 24, 36 data), and GENIE3 (use t = 0 data) to hESC data set and compare the corresponding AUROC and AUPR.See Table 6 for the results.We can see that WENDY is worse than SINCERITIES and NonlinearODEs.

Discussion
In this paper, we address the GRN inference problem for single-cell time series gene expression data following general interventions, where the joint distribution of different time points is unknown.Although this type of data is common in recent experiments, there are few GRN inference methods that fully utilize the information contained in the data.Therefore, we introduce WENDY, a GRN inference method developed for single-cell gene expression data spanning two time points after an intervention.This method is capable of inferring all mutual regulatory relations, including direction.We test WENDY and other GRN inference methods on two synthetic data sets and two experimental data sets, and the performance of WENDY is satisfactory.Besides, the time cost of WENDY is almost as low as the fastest method.The model in Subsection 3.1 only considers cells of the same type and under the same environment.In reality, cells of a complex organism are under different regulations by environmental factors, especially during development.For instance, during animal morphogenesis, retinoic acid can form a concentration gradient (positional information) [84], and leads cells at different positions to different fates by regulating certain genes [90].If the levels of such spatially heterogeneous factors can be measured for each single cell, then one can add them into the GRN inference procedure.Otherwise, they become hidden variables in the gene expression model, and can make the inference much more difficult [50].Besides such position-related factors, cell type can affect gene regulation [94].One should label the cell type during experiments or after experiments using cluster analysis [9].If the GRN inference is directly applied to data from different types of cells, the inference result might be unreliable [83].In this work, we assume that cells are of the same type, and they are under the same environment.
Although most gene expression models are similar to Eq. 1 that only considers the (mRNA) levels of V 1 , . . ., V n , some researchers argue that the actual gene expression mechanism should be more complex.In some models [10,77,35,78], each gene can switch between "on" and "off" states, which correspond to different transcription rates.Some proteins can affect the transition rates between "on" and "off" states, which is the only way of gene mutual-regulation.When one gene is turned on, the number of its mRNA increases quickly, until it is turned off, when the mRNA count starts to decay exponentially.This leads to the mRNA bursts phenomenon [23], meaning that the mRNA count does not fluctuate locally, but has global bursts.Since proteins are translated from mRNAs, proteins also have bursts.However, proteins degrade much slower than mRNAs, meaning that mRNA count and protein count do not match exactly.Therefore, in such models, besides the mRNA counts X 1 , . . ., X n , there are also hidden variables for gene states and protein counts.However, more complicated models, generally using piecewise-deterministic Markov processes, are harder to solve, and researchers need to conduct large-scale simplifications.Besides, more complicated models need more accurate data, but the current mRNA measurement techniques are not very sensitive [97].Therefore, we think that it is worthwhile to study differential equation based models.
Similar to most other GRN inference methods, WENDY cannot infer autoregulation.One potential future direction is to develop an autoregulation inference method inspired by the principles of WENDY.For instance, we can choose a reasonable nonlinear gene expression model that allows autoregulation, and study whether autoregulation can make a difference for the dynamics of covariance matrices.
In interventional experiments, it is customary to measure expression levels before intervention as the control group.These control data align with Scenario 2 data, enabling the application of corresponding inference methods.Therefore, a comparison of GRNs before and after intervention can elucidate the effect of the intervention on gene regulation.
As of 2024, single-cell level gene expression measurement remains relatively insensitive.It is common to miss all mRNAs of one gene in a single cell, resulting in experimental data with many zeros.This characteristic may impede graphical lasso, and consequently, WENDY.A potential future direction involves developing more robust inference methods capable of handling data with numerous missing values.
Although WENDY aims for data type 9, other types of data can help to improve the results of WENDY.For instance, one can use a CRISPR gene knockout study to verify regulations inferred by WENDY.Besides, advancements in biotechnology introduce new types of data, which are not in our classification framework, but might be applicable to GRN inference, and can enhance WENDY's capabilities through incorporation.
In our tests, we observed that WENDY performs better for data in earlier time points when dealing with a small number of genes (10 or fewer).Conversely, for data collected long after intervention, where gene expression approaches a new steady state, the inference results are less reliable compared to earlier time points.One possible reason is that after enough time, the covariance matrix approaches its steady state, and the small changes of covariances might be covered by random noise.Therefore, when applying WENDY to experimental data, caution should be exercised with results obtained several days after the intervention.
Theoretically, the GRN matrix A calculated by WENDY is not symmetric, allowing determination of the directions of the regulatory relations (i → j or j → i).However, our simulations indicate that A i,j and A j,i generally exhibit proximity, resulting in a significant decrease in AUPR.A prospective avenue involves developing a novel solver capable of producing highly asymmetric results.
In our WENDY implementation, we employ the standard graphical lasso algorithm to determine the inverse of the covariance matrix.Other algorithms with similar purposes [80,21] can also be considered for integration into WENDY.

2 iFigure 1 :
Figure 1: Workflow of the WENDY method.Given single-cell level gene expression data at two time points, where the joint distribution (cell correspondence) between two time points is unknown, first calculate the covariance matrix of gene expression for each time point.Then use the mathematical gene expression model to derive the equation of covariance matrices.Last, transform this into an optimization problem and solve the GRN numerically.

Table 2 :
AUROC and AUPR of different methods on DREAM4 data sets

Table 3 :
AUROC and AUPR of different methods on SINC data sets the form of mean ± standard deviation) in different settings.We can see that for each algorithm, the time cost increases with the gene number n.In addition, WENDY, SINCERITIES and non-linearODEs are insensitive to the cell number m, while the time costs of GENIE3 and dynGENIE3 increase significantly with m.When m and n are large, WENDY is roughly the same fast as SINCERITIES and NonlinearODEs, but much faster than GENIE3 and dynGENIE3.Therefore, WENDY has a satisfactory speed.

Table 4 :
Computational time for different algorithms.We consider the SINC data of n = 10/20 genes over m = 10, m = 30, or m = 100 cells.For each situation, we use 20 different GRNs, and present the computational time as mean ± standard deviation.All time costs are in seconds, and are measured on a desktop with Intel i7-13700 CPU.

Table 5 :
[19]C and AUPR of different methods on THP-1 data set data set considers single-cell expression levels of human embryonic stem cell-derived progenitor cells, measured at 6 time points, t =0, 12, 24, 36, 72, 96h[19].For each time point, there are 66-172 cells measured.Such cells are under development (which can be regarded as an intervention) hESC

Table 6 :
AUROC and AUPR of different methods on hESC data set