Reactive SINDy: Discovering governing reactions from concentration data

The inner workings of a biological cell or a chemical reaction can be rationalized by the network of reactions, whose structure reveals the most important functional mechanisms. For complex systems, these reaction networks are not known a priori and cannot be efficiently computed with ab initio methods, therefore an important approach goal is to estimate effective reaction networks from observations, such as time series of the main species. Reaction networks estimated with standard machine learning techniques such as least-squares regression may fit the observations, but will typically contain spurious reactions. Here we extend the sparse identification of nonlinear dynamics (SINDy) method to vector-valued ansatz functions, each describing a particular reaction process. The resulting sparse tensor regression method “reactive SINDy” is able to estimate a parsimonious reaction network. We illustrate that a gene regulation network can be correctly estimated from observed time series.


I. INTRODUCTION
Mapping out the reaction networks behind biological processes, such as gene regulation in cancer [1], is paramount to understanding the mechanisms of life and disease. A well-known example of gene regulation is the lactose operon whose crystal structure was resolved in [24] and dynamics were modeled in [43]. The system's "combinatorial control" in E. coli cells was quantitatively investigated in [22], in particular studying repression and activation effects. These gene regulatory effects often appear in complex networks [35] and there exist databases resolving these for certain types of cells, e.g., E. coli cells [11] and yeast cells [23]. Another example where mapping the active reactions is important is that of chemical reactors [30], where understanding which reactions are accessible for a given set of educts and reaction conditions is important to design synthesis pathways [7,20].
The traditional approach to determine a reaction network is to propose the structure of the network based on chemical insight and subsequently fit the parameters given available data [32]. To decipher complex reaction environments such as biological cells, it would be desirable to have a data-driven approach that can answer the question which reactions are underlying a given observation, e.g., the time series of a set of reactants. However, in sufficiently complex reaction environments the number of reactive species and possible reactions is practically unlimited -as an illustration, consider vast amount of possible isomerizations and post-translational modifications for a single protein molecule. Therefore, the more specific formulation is "given observations of a set of chemical species, what is the minimal set of reactions necessary to explain their time evolution?". This formulation calls for a machine learning method that can infer the reaction network underlying the observation data.
Knowledge about the reaction network can be applied to parameterize other numerical methods to further investigate the processes at hand. Such methods include particlebased approaches derived from the chemical master equation [13,18,41,42], as well as highly detailed but parameter-rich methods such as particle-based or interacting-particle reaction dynamics [2,8,10,17,33,39,40] capable of fully resolving molecule positions in space and time -see [3,34] for recent reviews.
Existing methods to infer regulatory networks include ARCANE [26] that uses experimental essay data and information theory, as well as the likelihood approach presented in [37] that takes the stochasticity of observed reactant time series into account.
The method presented in this work can identify underlying complex reaction networks from concentration time series by following the law of parsimony, i.e., by inducing sparsity in the resulting reaction network. This promotes the interpretability of the model and avoids overfitting. We formulate the problem as data-driven identification of a dynamical system, which renders the method consistent with and an extension of the framework of sparse identification of nonlinear dynamics (SINDy) [5]. Specifically, the problem of identifying a reaction network from time traces of reactant concentrations can be solved by finding a linear combination from a library of nonlinear ansatz functions that each corresponds to a reaction acting on a set of reactants. With this formulation, the reaction rates can be determined via regression. Sparsity is induced by equipping the regression algorithms with a sparsity inducing regularization. SINDy was investigated, generalized, and applied in many different ways, e.g., including control [6] (SINDYc), in the context of partial differential equations [31], updating already existing models [29] (abrupt-SINDy), and looking into convergence properties [44].
We extend and apply SINDy to the case of learning reaction networks from non-equilibrium concentration data. Similar approaches make use of SINDy but do not resolve specific 2 reactions [25], use weak formulations to avoid numerical temporal derivatives [28], or use compressive sensing and sparse Bayesian learning [27].
Our extension of the original SINDy method mostly involves estimating parameters which are coupled across the equations of the arising dynamical system. In the context of learning reaction networks this means that we look for specific reactions and their rate constants that might have lead to the observations instead of net flux across species. We demonstrate the algorithm on a gene regulatory network in three different scenarios of measurement: When there is no noise in the data we can find, given sufficient amounts of data, all relevant processes of the ground truth. If there is noise in the data we converge to the correct reaction network and rates with decreasing levels of noise. The final scenario generalizes the method to two measurements with different initial conditions, also converging to the correct model with decreasing levels of noise.

II. REACTIVE SINDY: SPARSE LEARNING OF REACTION KINETICS
We are observing the concentrations of S chemical species in time t: . . .
We assume that their dynamics are governed by classical reaction-rate equations subject to the law of mass action. A general expression for the change of concentration of reactant s as a result of order-0 reactions (creation or annihilation), order-1 reactions (transitions of other species into s or transitions of s into other species), order-2 reactions (production or consumption of s by the encounter of two species), etc, is given by: where the β (...) s,k -values are constants belonging to the reactions of order k. These rate constants however can incorporate several underlying reactions at once. For example, the two reactions both contribute toẋ 1 = β To disentangle (2) into single reactions, we choose a library of R possible ansatz reactions that each represent a single reaction: . . .
With this ansatz, the reaction dynamics (2) becomes a set of linear equations with unknown parameters ξ r that represent the sought macroscopic rate constants: where ξ r are the to-be estimated macroscopic rate constants. The two reactions in the previous example (3)(4) would be modeled by the ansatz functions illustrating that the values of the coefficients ξ 1 and ξ 2 can be used to decide whether a single reaction is present and to what degree. Now suppose we have measured the concentration vector (1) at T time points t 1 < · · · < t T . We represent these data as a matrix Given this matrix, a library Θ : ansatz reactions can be proposed with corresponding reaction functions where X i * denotes the i-th row in X. Applying the concentration trajectory to the library yields Θ(X) ∈ R T ×S×R . The goal is to find coefficients Ξ = ξ 1 ξ 2 · · · ξ R , so thaṫ In particular, the system is linear in the coefficients Ξ, which makes regression tools such as elastic net regularization [45] applicable. To this end, one can consider the regularized minimization problem (reactive SINDy): Here, · F denotes the Frobenius norm, λ ∈ [0, 1] is a hyperparameter that interpolates linearly between LASSO [15,38] and Ridge [16] methods, and α ≥ 0 is a hyperparameter that, depending on λ, can induce sparsity and give preference to smaller solutions in the L 1 or L 2 sense. For α = 0 the minimization problem reduces to standard least-squares (LSQ) with the constraint Ξ ≥ 0. Reactive SINDy (10) is therefore a generalization of the SINDy method to to the vector-valued ansatz functions. Since only the concentration data X is available but not its temporal derivative,Ẋ is approximated numerically by second order finite differences with the exception of boundary data. Once the pair (X,Ẋ) is obtained, the problem becomes invariant under temporal reordering. Hence, when presented with multiple trajectories the data matrices X i andẊ i can simply be concatenated.
In order to solve (10) the numerical sequential least-squares minimizer SLSQP [21] is applied via the software package SciPy [19]. Code related to this paper can be found under https://github.com/readdy/readdy_learn.

III. RESULTS
We demonstrate the method by estimating the reactions of a gene-regulatory network from time series of concentrations of the involved molecules. Let S := {A, B, C} be a set of three species of proteins which are being translated each from their respective mRNA molecule. Each mRNA in turn has a corresponding DNA which it is transcribed from. The proteins and mRNA molecules decay over time whereas the DNA concentration remains constant. The network contains reactions of the following form [36] for each of the species i ∈ S. These reactions model a regulation of species j by virtue of the fact that the transcription product inhibits the transcription processes. In our example proteins of type A regulate the mRNA B molecules, proteins of type B regulate the mRNA C molecules and proteins of type C regulate the mRNA A molecules (Fig. 1). Using this reaction model, time series of concentrations are generated using the rates given in Tab II under the initial condition described in Tab Ia, which were chosen so that all the reactions in the reaction model significantly contribute to the temporal evolution of the system's concentrations. The generation samples the integrated equations equidistantly with a discrete time step of τ = 3 · 10 −3 yielding 667 frames which amounts to a cumulative time of roughly T = 2.  Figure 1. The regulation network example described in Sec. III. Each circle depicts a species, each arrow corresponds to one reaction. Blue arrows denote transcription from DNA to mRNA, green arrows denote translation from mRNA to protein, and red arrows denote the regulatory network.
The proposed estimation method is applied to analyze these time series of concentrations in order to recover the underlying reaction network from data. To this end we use the library of ansatz functions given in Tab. II, which contains a large number of possible reactions, only few of which are actually part of the model.

A. Learning the reaction network in the low-noise limit
We first demonstrate that the true reaction network can be reconstructed when using a finite amount of observation data without additional measurement noise, i.e., the observations are reflecting the true molecule concentrations at any given time point. The minimization problem (10) is solved using the concentration time series shown in Fig. 1b.
We first set the hyperparameter α = 0 in the minimization problem (10), which results in constrained least-squares regression without any of the regularization terms. In this case we estimate a reaction network that can reproduce the observations almost exactly (Fig. 2). However, the result is mechanistically wrong as the sparsity pattern does not match the reaction network used to generate the data. On the one hand many spurious reactions are estimated that were not in the true reaction scheme and would lead to wrong conclusions about the mechanism, such as A + A A and A + C C. More dramatically, the reaction responsible for the decay of A particles is completely ignored (Fig. 3).
Next, we sought sparse solutions by using α > 0 and additionally eliminating reactions with rate constants smaller than a cutoff value κ. For a suitable choice of hyperparameters  Figure 3. Estimated reaction rates in the system described in Sec. III A. The y and x axes contain reaction educts and products, respectively. A circle at position (i, j) represents a reaction i j whose rate has a linear relation with the area of the circle. The black outlines denote the reactions with which the system was generated and contain the respective rate value. Red crosses denote reactions that were used as additional ansatz reactions. Blue circles are estimated by LSQ and orange circles depict rates which were obtained by solving the minimization problem (10). The latter rates are subject to a cutoff κ = 0.22 corresponding to the green circle's area under which a sparse solution with the correct processes can be recovered. If a certain rate was estimated in both cases, two wedges instead of one circle are displayed. α ≈ 1.91 · 10 −7 , λ = 1, and κ = 0.22, a sparse solution is obtained that finds the correct reaction scheme and also recovers the decay reaction (Fig. 3).
The value of the cutoff was determined by comparing the magnitude of estimated rates and finding a gap, see Fig. 6. The hyperparameter pair (α, λ) was obtained by a grid search and evaluating the difference Ξ α,λ − Ξ 1 , whereΞ α,λ is the estimated model under a particular hyperparameter choice and Ξ is the ground truth. If the ground truth is unknown, a hyperparameter pair can be estimated by utilizing cross-validation as in the following sections.

B. Learning the reaction network from data with stochastic noise
In contrast to Sec. III A, we now employ data that includes measurement noise. Such noise can originate from uncertainties in the experimental setup or from shot noise in singleor few-molecule measurements. In gene regulatory networks such noise is commonly observed when only few copy numbers of mRNA are present [4,9,14]. In order to simulate noise from few copies of molecules, the system of Sec. III with initial conditions as given in Tab. Ia is integrated using the Gillespie stochastic simulation algorithm (SSA) [12,13]. In the limit of many particles and realizations, the Gillespie SSA converges to the integrated reaction-rate equations subject to the law of mass action. As our model is based on exactly these dynamics, the initial condition's concentrations are interpreted in terms of hundreds of particles. Each realization is then transformed back to a time series of concentrations. We define the noise level as the mean-squared deviation of the concentration time series from the integrated reaction-rate equations. Data with different noise levels are prepared by averaging multiple realizations of the time series obtained by the Gillespie SSA.
It can be observed that decreasing levels of noise lead to fewer spurious reactions when applying reactive SINDy (10), see Fig. 4a. Also the estimation error ξ −ξ 1 with respect to the ground truth ξ decreases with decreasing levels of noise (Fig. 4b). In both cases, the regularized method with a suitable hyperparameter pair (α, λ) performs better than LSQ.
The hyperparameters (α, λ) are obtained by shuffling the data and performing a 10-fold cross validation.

C. Learning the reaction network from multiple initial conditions
Preparing the experiment that generates the data in different initial conditions can help identifying the true reaction mechanisms as a more diverse dataset makes it easier to confirm or exclude the participation of specific reactions. This section extends the analysis of Sec. III B to two initial conditions, where the first initial condition is identical to the one used previously and the second initial condition is given in Tab. Ib.
The corresponding time series are depicted in Fig. 5a. The gray graph corresponds to a sample trajectory generated by the Gillespie SSA. For both initial conditions the same time step of τ = 3 · 10 −3 has been applied, amounting to 2 · 667 = 1334 frames. Once the data matrices and the corresponding derivativesẊ 1 ,Ẋ 2 have been obtained, the frames are concatenated so that X = x 1 (t 1 ) x 2 (t 1 ) · · · x 1 (t 667 ) x 2 (t 667 ) , analogously forẊ.  Similarly to Sec. III B, decreasing levels of noise lead to fewer spurious reactions (Fig. 5b) and a smaller L 1 distance to the ground truth (Fig. 5c). Again applying the optimization problem with a suitable set of parameters (α, λ, κ) performs better than LSQ. Compared to the previous section the convergence is better due to twice as much available data. At noise levels of smaller than roughly 10 −6 the model can reliably be recovered when using the regularized method.
The hyperparameters (α, λ) are obtained by shuffling the data and performing a 20-fold cross validation.

IV. CONCLUSION
In this work we have extended the SINDy method to reactive SINDy, not only parsimoniously detecting potentially nonlinear terms in a dynamical system from noisy data, but also yielding, in this case, a sparse set of rates with respect to generating reactions (8). Mathematically this has been achieved by permitting vector-valued basis functions and  obtaining a tensor linear regression problem. We have applied this method on data generated from a gene regulation network and could successfully recover the underlying model.
The studies of Sec. III B and Sec. III C have shown that the applied regularization terms can mitigate noise up to a certain degree compared to the unregularized method, so that identification of the reaction network is more robust and closer to the ground truth.
Potentially, this method could be used to identify reaction networks from time series measurements even if the initial conditions are not always exactly identical, as was demonstrated in Sec. III C. One apparent limitation is that the method can only be applied if the data stems from the equilibration phase, as the concentration-based approach has derivatives equal zero in the equilibrium, which precludes the reaction dynamics to be recovered. In future work, we will consider the identification of reaction schemes from instantaneous fluctuations of particle numbers in equilibrium.

ACKNOWLEDGMENTS
The authors are grateful to the Center for Theoretical Biological Physics (CTBP, supported by NSF PHY-1427654) at Rice University for hosting their sabbatical visit, during which part of this work was performed.
We    Table II. Full set of ansatz reactions Θ used in Sec. III. The given rate constants define the ground truth reaction model.