A Variational Inference Approach to Single-Cell Gene Regulatory Network Inference using Probabilistic Matrix Factorization

Inferring gene regulatory networks (GRNs) from single-cell gene expression datasets is a challenging task. Existing methods are often designed heuristically for specific datasets and lack the flexibility to incorporate additional information or compare against other algorithms. Further, current GRN inference methods do not provide uncertainty estimates for the interactions that they predict, making inferred networks challenging to interpret. To overcome these challenges, we introduce Probabilistic Matrix Factorization for Gene Regulatory Network inference (PMF-GRN). PMF-GRN uses single-cell gene expression data to learn latent factors representing transcription factor activity as well as regulatory relationships between transcription factors and their target genes. This approach incorporates available experimental evidence into prior distributions over latent factors and scales well to single-cell gene expression datasets. By utilizing variational inference, we facilitate hyperparameter search for principled model selection and direct comparison to other generative models. To assess the accuracy of our method, we evaluate PMF-GRN using the model organisms Saccharomyces cerevisiae and Bacillus subtilis, benchmarking against database-derived gold standard interactions. We discover that, on average, PMF-GRN infers GRNs more accurately than current state-of-the-art single-cell GRN inference methods. Moreover, our PMF-GRN approach offers well-calibrated uncertainty estimates, as it performs GRN inference in a probabilistic setting. These estimates are valuable for validation purposes, particularly when validated interactions are limited or a gold standard is incomplete.

(A) PMF-GRN graphical model overview. Input single-cell gene expression W is decomposed into latent factors U and V, representing TF activity and TF-gene interactions respectively. V is further decomposed into A and B, representing the degree of existence of an interaction, and the strength and direction of an interaction, respectively. Information obtained from chromatin accessibility data or genomics databases is incorporated into the prior distribution for A. Additional latent variables are included to model observation noise σ_obs and sequencing depth d, in order to better model our observed single-cell gene expression input data. (B) Input experimental data for PMF-GRN includes single-cell RNA-seq gene expression data. ATAC-seq is used to determine chromatin accessibility through peak calling. Motif enrichment within these accessible regions can be used to create a prior-known network to better inform the prior distribution. When experimental information is unavailable, databases can be used instead to construct a prior-known network. (C) The hyperparameter selection process is performed for model selection. The provided prior-known network is split into a training and validation set. 80% of the prior-known information is used to infer a GRN, while the remaining 20% is used for validation by computing AUPRC. This process is repeated multiple times, using different hyperparameter configurations, in order to determine the optimal hyperparameters for the GRN inference task at hand. Finally, using the optimal hyperparameters, as determined by the highest achieved AUPRC, a final network is inferred using the full prior and evaluated using an independent gold standard.
(we choose 20%) by leaving the rows corresponding to these genes in A but setting the prior logistic normal means for all entries in these rows to the same low number.

The second step is to carry out a hyperparameter search using this modified prior-knowledge matrix. The early stopping and model selection criteria are both the 'validation' AUPRC of the posterior point estimates of A corresponding to the held-out genes against the entries for these genes in the full prior hyperparameter matrix. This step is motivated by the idea that inference using the selected hyperparameter configuration should yield a GRN whose columns correspond to the TF names that the user has assigned to these columns.

The third step is to choose the hyperparameter configuration corresponding to the highest validation AUPRC and use it, together with the full prior, to infer the final GRN.

The proposed PMF-GRN framework decouples the generative model from the inference procedure. Instead of requiring a new inference procedure for each generative model, it enables a single inference procedure through (stochastic) gradient descent with the ELBO objective function above, across a diverse set of generative models.
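The gene hold-out validation used in this hyperparameter search can be sketched as follows. This is a minimal illustration with a synthetic prior matrix; the posterior stand-in takes the place of an actual PMF-GRN inference run:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# Synthetic binary prior-knowledge matrix (genes x TFs); 1 = known interaction.
A_bar = (rng.random((100, 10)) < 0.1).astype(float)

# Hold out 20% of genes: keep their rows in A, but set their prior means
# to the same low value so inference receives no prior signal for them.
held_out = rng.choice(A_bar.shape[0], size=20, replace=False)
prior_means = A_bar.copy()
prior_means[held_out] = 0.005

# A real run would infer posterior point estimates of A from `prior_means`;
# here a noisy stand-in plays that role.
posterior_A = prior_means + 0.1 * rng.random(A_bar.shape)

# Validation AUPRC: posterior scores for held-out genes against the full prior.
val_auprc = average_precision_score(A_bar[held_out].ravel(),
                                    posterior_A[held_out].ravel())
```

The highest validation AUPRC across hyperparameter configurations then selects the configuration used for the final inference run with the full prior.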

Inference can easily be performed in the same way for each model. Through this framework, it is possible to define the prior and likelihood distributions as desired, with the following mild restrictions: we must be able to evaluate the joint distribution of the observations and the latent variables, the variational distribution, and the gradient of the log of the variational distribution.
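These three requirements are exactly what a score-function (REINFORCE) ELBO gradient estimator needs. A minimal one-dimensional sketch, using a toy Gaussian model rather than the PMF-GRN model itself:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: z ~ N(0,1), w | z ~ N(z,1); variational family q(z) = N(m, 1).
w = 2.0  # a single observation

def log_joint(z):
    # log p(w, z) up to additive constants
    return -0.5 * z**2 - 0.5 * (w - z) ** 2

def log_q(z, m):
    # log q(z; m) up to additive constants
    return -0.5 * (z - m) ** 2

def grad_log_q(z, m):
    # d/dm log q(z; m)
    return z - m

# Score-function estimator of the ELBO gradient: only the three quantities
# above are required, mirroring the mild restrictions stated in the text.
m = 0.0
for _ in range(2000):
    z = rng.normal(m, 1.0, size=64)
    grad = np.mean(grad_log_q(z, m) * (log_joint(z) - log_q(z, m)))
    m += 0.05 * grad

# The exact posterior here is N(w/2, 1/2), so m should approach 1.0.
```

The same estimator applies unchanged if the prior, likelihood, or variational family is swapped out, which is what lets a single inference procedure cover a diverse set of generative models.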

The use of stochastic gradient descent in variational inference comes with a significant computational advantage.

As each step of inference can be done with a small subset of observations, we can run GRN inference on a very large dataset without any constraint on the number of observations. This procedure is further sped up by using modern hardware, such as GPUs.

Under this probabilistic framework, we carry out model selection, such as choosing distributions and their corresponding hyperparameters, in a principled and unified way. Hyperparameters can be tuned with regard to a predefined objective, such as the marginal likelihood of the data or the posterior predictive probability of held-out parts of the observations. We can further compare and choose the best generative model using the same procedure.

This framework allows us to encode any prior knowledge via the prior distributions of latent variables. For instance, we incorporate prior domain knowledge about TF-gene interactions as hyperparameters that govern the prior distribution over the matrix A. If prior knowledge about TFA is available, this can be similarly incorporated into the model via the hyperparameters of the prior distribution over U.

Because our approach is probabilistic by construction, inference also estimates uncertainty without any separate external mechanism. These uncertainty estimates can be used to assess the reliability of the predictions, i.e., more trust can be placed in interactions that are associated with less uncertainty. We verify this correlation between the degree of uncertainty and the accuracy of interactions in the experiments.
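A calibration check of this kind can be sketched as follows, with synthetic scores in place of real posterior estimates; here noise is made to grow with the posterior variance, mimicking a well-calibrated model:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(3)

# Synthetic gold standard and posterior estimates for 1,000 interactions;
# estimates with larger posterior variance are given noisier means.
truth = (rng.random(1000) < 0.2).astype(int)
post_var = rng.random(1000)
post_mean = truth + rng.normal(0.0, 0.2 + post_var)

# Split at the median posterior variance and score each half: a
# well-calibrated model ranks the low-variance half more accurately.
low = post_var < np.median(post_var)
auprc_low = average_precision_score(truth[low], post_mean[low])
auprc_high = average_precision_score(truth[~low], post_mean[~low])
```

On real inferred GRNs, the same comparison is made between interaction sets binned by posterior variance rather than a synthetic split.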

Overall, the proposed approach of probabilistic matrix factorization for GRN inference is scalable, generalizable and aware of uncertainty, which makes it considerably more advantageous than most existing methods.

To demonstrate PMF-GRN's ability to infer informative and robust GRNs, we use two single-cell RNA-seq datasets from the model organism Saccharomyces cerevisiae. S. cerevisiae is a relatively simple and well-studied eukaryote with an available and reliable gold standard, which allows us to test and evaluate our model's performance.

We perform three experiments using two independently collected single-cell RNA-seq S. cerevisiae datasets (8; 39) to test PMF-GRN and compare our performance against three state-of-the-art GRN inference methods: the Inferelator (AMuSR, BBSR, StARS) (31), Scenic (33), and CellOracle (34). In the first experiment, we infer a GRN for each of the two single-cell datasets and average the posterior means of A to simulate a "multi-task" approach (Figure 2A). To provide a baseline for each method in the scenario where data cannot be cleanly separated into tasks, we combine the two expression datasets into one observation before inferring a GRN. This baseline demonstrates a large performance decrease for BBSR, indicating that the method may only be useful when gene expression is organized into tasks. This could present challenges when attempting to infer GRNs in more complicated organisms where cell types or conditions are less easily defined. Here, we show the effectiveness of PMF-GRN in recovering the true underlying GRN for both scenarios, as performance remains relatively stable whether a network was inferred using the individual or combined data. We also provide an example of how we can use PMF-GRN on a single observation matrix or multiple observation matrices to infer a consensus GRN by simple averaging.

In the second experiment, we implement a 5-fold cross-validation approach to establish a baseline for each model. Cross-validation is an essential technique for evaluating the performance of machine learning models like PMF-GRN, as it allows us to test our method's ability to generalize to new data. It further allows us to simulate the process of training and testing PMF-GRN on multiple subsets of the available data, providing a more robust and reliable estimate of model accuracy. In the context of GRN inference, cross-validation is particularly important because it helps us assess the performance of PMF-GRN in predicting TF-target gene interactions based on limited data, which is often the case in experimental settings.

We first combine the two S. cerevisiae single-cell RNA-seq datasets into one observation matrix for simplicity. To perform cross-validation, the gold standard is divided into an 80%-20% split, where a network is inferred using 80% of the gold standard as "prior-known information" and evaluated using the remaining 20%. We repeat this cross-validation process five times using different random splits of the gold standard to obtain meaningful results. We observe that PMF-GRN outperforms Scenic and CellOracle, while achieving competitive performance to BBSR and StARS (Figure 2B). We note that for this experiment we are unable to implement the AMuSR algorithm, as it is a multi-task inference approach that requires more than one task (dataset).
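The gold-standard splitting described above can be sketched as follows, using a synthetic network and five independent random 80%-20% splits along the gene axis:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic gold-standard network (genes x TFs).
gold = (rng.random((200, 20)) < 0.05).astype(int)
n_genes = gold.shape[0]

splits = []
for _ in range(5):
    idx = rng.permutation(n_genes)
    cut = int(0.8 * n_genes)
    # 80% of genes provide "prior-known information" for inference;
    # the held-out 20% are reserved for AUPRC evaluation.
    splits.append((idx[:cut], idx[cut:]))
```

Each repeat then infers a GRN with the training genes' rows of the gold standard as the prior and evaluates AUPRC on the held-out genes.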

Finally, in the third experiment, we demonstrate the robustness of each GRN inference method against noisy prior information. To do so, we infer GRNs where increasing amounts of noise have been added to the input prior-known information. Here, we show that as noise increases, PMF-GRN's AUPRC decreases similarly to CellOracle's, while on average performing better than BBSR, StARS and CellOracle, demonstrating that it is one of the most robust approaches to inferring accurate GRNs from noisy priors (Figure 2C).
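One simple way to inject such noise into a binary prior is to flip a fixed fraction of its entries, as in this sketch:

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic binary prior network (genes x TFs).
prior = (rng.random((100, 10)) < 0.1).astype(int)

def add_noise(prior, frac, rng):
    """Flip `frac` of the entries of a binary prior network."""
    noisy = prior.copy()
    n_flip = int(frac * noisy.size)
    flat = rng.choice(noisy.size, size=n_flip, replace=False)
    noisy.flat[flat] = 1 - noisy.flat[flat]
    return noisy

# A 10%-noise version of the prior: exactly 10% of entries are flipped.
noisy_10 = add_noise(prior, 0.10, rng)
```

Repeating inference across increasing values of `frac` and plotting AUPRC then traces out each method's robustness curve.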

From the results of our experiments on the S. cerevisiae data, we have the following observations. The first main observation is that on average the proposed PMF-GRN performs better than the Inferelator in recovering the true GRN, regardless of whether we pick the mean or median Inferelator algorithm in terms of AUPRC. Specifically, we see that PMF-GRN performs markedly better than two Inferelator algorithms (AMuSR and StARS), and similarly to the remaining algorithm (BBSR). However, when the expression data is not separated into tasks, PMF-GRN outperforms BBSR. In comparison to CellOracle, we observe that PMF-GRN infers competitive GRNs during normal inference. However, PMF-GRN greatly outperforms CellOracle when performing cross-validation. Finally, we observe that PMF-GRN consistently outperforms Scenic in all experiments considered.

The second main observation is that our approach eliminates the high variance associated with choosing between different inference algorithms. Implementing the Inferelator on the S. cerevisiae datasets in a normal setting yields AUPRCs approximately in the range 0.2 to 0.4, without any a priori information on which of these algorithms to use. The resulting inferred GRN could be arbitrarily accurate or inaccurate depending on which algorithm is chosen. In contrast, our method is reliable as it provides one set of results, chosen using a principled objective function, performing competitively with the best performing Inferelator algorithm (BBSR) and CellOracle.

Finally, in order to highlight the identifiability issue and ensure that the prior-known information provided is useful, we demonstrate the performance of PMF-GRN where prior information is not used (e.g. all prior logistic normal means of A are set to the same low number). We use the same process for all other GRN inference algorithms by providing an empty prior. Additionally, we demonstrate PMF-GRN's performance when we randomly shuffle the prior-known TF-target gene interaction hyperparameters before using them to build the prior distribution for A. We repeat this process for all other GRN inference algorithms by providing them with prior-known information in which the gene names have been shuffled randomly. As anticipated, the resulting AUPRC scores are close to 0, implying that our approach, as well as the Inferelator and CellOracle, are capable of taking such prior information into account well, and that the prior information we provided is useful.

To demonstrate GRN inference on a second dataset, we carry out experiments using two microarray datasets for the prokaryote Bacillus subtilis (B1 - GSE27219 (40) and B2 - GSE67023 (41)). Although PMF-GRN is not primarily designed to learn GRNs from microarray data, we show that it is still possible to learn informative GRNs with this data. For our B. subtilis experiments, we have access to prior knowledge derived from the SubtiWiki database (42; 43; 44). Here, we implement a 5-fold cross-validation approach by using five random splits of the SubtiWiki database-derived information, where 80% is used as prior knowledge and 20% is used as the gold standard for evaluation.
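The shuffled-prior negative control amounts to permuting the gene labels of the prior network, which keeps the prior's density but destroys its information content. A sketch with a synthetic prior:

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic prior network with hypothetical gene names as row labels.
genes = [f"g{i}" for i in range(100)]
prior = (rng.random((100, 10)) < 0.1).astype(int)

# Permute the rows: each block of prior knowledge is now assigned to the
# wrong gene, so any AUPRC gain over an empty prior should disappear.
perm = rng.permutation(prior.shape[0])
shuffled_prior = prior[perm]
shuffled_genes = [genes[i] for i in perm]
```

Running inference with `shuffled_prior` in place of `prior` gives the negative-control AUPRC reported as the 'Shuffled' condition.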

The two B. subtilis datasets were previously normalized after data collection as part of standard microarray processing. However, each dataset was normalized using different approaches (described in 4). For B1, the expression data underwent no further normalization and was simply converted to integers to simulate single-cell-like data. For B2, the expression data was re-scaled and then converted to integers, in order to contain only positive integers resembling single-cell-like data. The results from our experiments are shown in Figure 3, and the numbers used to create this figure are given in Supplementary Tables 4 and 5. Using five repeats of cross-validation, we show the performance of GRNs inferred for the two B. subtilis datasets (B1 and B2). We remark that the difference in performance between B1 and B2 is likely a result of the chosen microarray processing normalization. To further support this claim, we demonstrate GRN performance after re-scaling the data with min-max scaling (Supplemental Figure S1).
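Min-max rescaling to single-cell-like integer counts can be sketched as follows; the target range of 0-100 is an illustrative choice, not necessarily the paper's exact setting:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic microarray expression matrix (samples x genes), arbitrary scale.
expr = rng.normal(500.0, 200.0, size=(30, 50))

# Min-max scale each gene to [0, 1], then rescale and round so the matrix
# contains only non-negative integers resembling single-cell-like counts.
mins = expr.min(axis=0)
maxs = expr.max(axis=0)
scaled = (expr - mins) / (maxs - mins)
counts = np.rint(scaled * 100).astype(int)
```

Applying the same transform to both B1 and B2 places the two independently normalized datasets on a common scale before inference.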

'No Prior' and 'Shuffled' results are also shown in Figure 3 by black and gray dots respectively. Here, we are able to demonstrate that for B1 and B2, each GRN yields a better performance as compared to these negative controls. We further find that interactions associated with lower posterior uncertainty are more likely to be accurate than those associated with higher uncertainty. This is in line with our expectations.

The more certain the model is about the degree of existence of a regulatory interaction, the more accurate it is likely to be, showing that our model is well-calibrated.

In this paper we present a framework for probabilistic matrix factorization, optimized using automatic variational inference, for inferring GRNs from single-cell gene expression data. In contrast with previous methods, our framework decouples the model that defines the data generation process from the inference procedure. Concretely, this means that we can modify the latent variables that constitute the model, along with their distributions, without altering the inference procedure. This flexibility allows different sequencing data and modeling assumptions to be readily incorporated into the model. Building new models no longer requires defining a new inference procedure, as has previously been the case.

In order to demonstrate successful GRN inference, we infer a consensus GRN for S. cerevisiae using our principled model selection method, and compare our results to GRNs inferred by the Inferelator, Scenic and CellOracle with respect to a reliable gold standard. Whereas the Inferelator yields a set of highly varying results across its variants, our approach results in a single inferred GRN. This GRN yields an AUPRC that is higher than Scenic's, as well as the mean and median AUPRC achieved by the Inferelator's respective algorithms (AMuSR and StARS), and is comparable to the AUPRC achieved by the best performing Inferelator algorithm (BBSR), in addition to CellOracle. However, when the expression data is not separated into tasks, we demonstrate that BBSR can no longer recover a competitive network. Our model hence yields a reliable, high-performing set of results without any need for heuristic model selection.
We further evaluate PMF-GRN by performing cross-validation and find that PMF-GRN, BBSR, and StARS yield high performance, indicating that these methods are able to generalize well to new data and do not overfit to training data. In contrast, Scenic and CellOracle do not perform well during cross-validation, indicating that these methods may not be generalizable.

Finally, because prior-known networks are inherently noisy due to limited validated regulatory interactions, we design an experiment to test each algorithm's robustness against increasing amounts of noise in the input prior-known information. This experiment identifies PMF-GRN and CellOracle as the methods which are overall most robust against noisy priors, indicating that inferred GRNs will remain reliable regardless of noisy interactions introduced into the prior during inference.

Using two microarray B. subtilis datasets, we further demonstrate that PMF-GRN is capable of learning informative GRNs. To include database information in both the prior knowledge and evaluation, we use an approach motivated by cross-validation. Here, we find that scaling allows us to place these datasets on the same scale, making them more comparable during inference. Although PMF-GRN is not primarily designed for microarray data, we show that it is still possible to learn informative networks by simply converting the expression values to integers to represent single-cell-like counts.

In order to determine the effect of incorporating our prior domain knowledge into the model, we compare results obtained using shuffled and unshuffled hyperparameters for the matrix A. We observe that for both S. cerevisiae and B. subtilis, not using prior information or shuffling the prior information results in very low AUPRCs, whereas using the prior information as intended results in significantly better AUPRCs. This result holds for PMF-GRN as well as for CellOracle and all Inferelator algorithms. For Scenic, however, we show that it is challenging to obtain a GRN that performs better than these negative controls. This shows that prior information is essential for addressing the latent factor identifiability issue and obtaining interpretable results from matrix factorization, as well as regression-based approaches.

In contrast to previous methods, our model provides well-defined uncertainty estimation in addition to point estimates of GRNs. We evaluate these uncertainty estimates as provided by our model by computing the AUPRC for inferred TF-target gene interactions corresponding to different levels of posterior uncertainty. We find that the AUPRC increases as the posterior variance decreases, demonstrating that when our model is more certain about its estimates, it produces better rankings of TF-target gene interactions compared to when it is uncertain. This indicates that our model is well-calibrated. For downstream experimental validation, biologists could therefore place more trust in model estimates that have a lower posterior variance. We also note that the computational cost of our model scales linearly with the number of cells in the dataset. This enables application of our method to single-cell RNA-seq datasets of any size.

We envision many possible directions for future work to design a better algorithm for inferring GRNs under our framework. The framework could be extended to explicitly model multiple expression matrices and their batch effects. We could also probabilistically model prior information for A obtained from ATAC-seq and TF motif databases, and include this as part of the probabilistic model over which we carry out inference.

We index cells, genes and TFs using n ∈ {1, ..., N}, m ∈ {1, ..., M} and k ∈ {1, ..., K}, respectively. We treat each cell's expression profile W_n as a random variable, with local latent variables U_n and d_n. We assume that U, V, σ_obs and d are independent, i.e. p(U, V, σ_obs, d) = p(U)p(V)p(σ_obs)p(d). In addition to our iid assumption over the rows of U and d, we also assume that the entries of U_n are mutually independent, and that all entries of A and B are mutually independent.
We choose a lognormal distribution for our prior over U and a logistic normal distribution for our prior over d:

p(log(U_nk)) = N(μ_u, σ_u^2),    p(logit(d_n)) = N(0, 9),

where μ_u ∈ R and σ_u ∈ R_+.

We use a logistic normal distribution for our prior over A, a normal distribution for our prior over B, and a logistic normal distribution for our prior over σ_obs:

p(logit(A_mk)) = N(logit(clip(Ā_mk, a_max, a_min)), σ_a^2),    p(B_mk) = N(0, σ_b^2),    p(log(σ_obs)) = N(0, 1),

where Ā_mk ∈ {0, 1}, a_max ∈ (0, 1), a_min ∈ (0, 1), σ_a ∈ R_+, σ_b ∈ R_+, and clip(Ā_mk, a_max, a_min) = max(min(Ā_mk, a_max), a_min). Ā_mk is given by a pipeline that is used by other methods such as the Inferelator. The pipeline leverages ATAC-seq and TF binding motif data to provide binary initial interaction estimates.

We impose the same independence assumptions on each approximate posterior as we do for its corresponding prior. Specifically, we use the following distributions:

q(log(U_nk)) = N(Ũ_nk, σ̃_U_nk^2),    q(logit(d_n)) = N(d̃_n, σ̃_d_n^2),    q(logit(A_mk)) = N(Ã_mk, σ̃_A_mk^2),    q(B_mk) = N(B̃_mk, σ̃_B_mk^2),    q(log(σ_obs)) = N(õ, σ̃_o^2),

where the parameters on the right-hand sides of the equations are called variational parameters; Ũ_nk, d̃_n, Ã_mk, B̃_mk, õ ∈ R and σ̃_U_nk, σ̃_d_n, σ̃_A_mk, σ̃_B_mk, σ̃_o ∈ R_+. To avoid numerical issues during optimization, we place constraints on several of these variational parameters.

Inference is carried out using the Adam optimizer with learning rate 0.1 and beta values of 0.9 and 0.99. We clip gradient norms at a value of 0.0001. We set a_min = 0.005, a_max = 0.995, σ_b^2 = 1 and μ_u = 0. We vary σ_a and σ_u as hyperparameters that control the strengths of the priors over A and U, respectively. We also vary β as a hyperparameter.

We choose a hyperparameter configuration using validation AUPRC as the objective function as well as the early stopping metric. We hold out the hyperparameters of p(A) for a fraction of the genes. We do this by setting the prior logistic normal means for all entries in these rows to the same low number.

A prior-known TF-target gene interactions matrix was obtained from the SubtiWiki database (49) from "regulations" (downloaded 07/21/22).
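The clipped-logit construction of the prior means over A can be sketched as follows, using a synthetic binary prior matrix and the a_min/a_max values given in the text:

```python
import numpy as np

rng = np.random.default_rng(2)

def logit(p):
    return np.log(p / (1.0 - p))

# Synthetic binary prior-knowledge matrix A_bar (genes x TFs).
A_bar = (rng.random((50, 8)) < 0.15).astype(float)

# Clip to (a_min, a_max) so the logit is finite, then take the logit as the
# mean of the logistic normal prior over A.
a_min, a_max = 0.005, 0.995
prior_mean = logit(np.clip(A_bar, a_min, a_max))
```

Entries with prior evidence receive a high logit mean and all others the same low mean, which is also what makes the gene hold-out step above possible: held-out rows are simply assigned the low mean everywhere.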
Using the columns "regulator locus" and "gene locus", a cross-tab integer matrix was created, where 1 represents the existence of an interaction and 0 represents no interaction. This matrix was randomly split 5 times in 80%-20% proportions along the gene axis to generate independent prior-known information and gold standard matrices.
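This cross-tabulation can be reproduced with a few lines of pandas; the locus names below are illustrative stand-ins, whereas the real table comes from the SubtiWiki download:

```python
import pandas as pd

# Toy "regulations" table with the two columns named in the text.
regs = pd.DataFrame({
    "regulator locus": ["tfA", "tfA", "tfB"],
    "gene locus": ["g1", "g2", "g2"],
})

# Genes x regulators integer matrix: 1 marks a reported interaction,
# 0 no interaction (clip binarizes any duplicate-row counts).
mat = pd.crosstab(regs["gene locus"], regs["regulator locus"]).clip(upper=1)
```

The resulting matrix is then split 80%-20% along the gene axis, as described above, to form prior and gold-standard matrices.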

To demonstrate the importance of scaling microarray data to place independently collected datasets on the same scale, we show how Min-Max Scaling improves inference for both B. subtilis datasets. For "Min-Max