SMGen: A generator of synthetic models of biochemical reaction networks

Several software tools for the simulation and analysis of biochemical reaction networks have been developed in the last decades; however, assessing and comparing their computational performance in executing the typical tasks of Computational Systems Biology can be limited by the lack of a standardized benchmarking approach. To overcome these limitations, we propose here a novel tool, named SMGen, designed to automatically generate synthetic models of reaction networks that, by construction, are characterized by both features (e.g., system connectivity, reaction discreteness) and non trivial emergent dynamics of real biochemical networks. The generation of synthetic models in SMGen is based on the definition of an undirected graph consisting of a single connected component, which generally results in a computationally demanding task. To avoid any burden in the execution time, SMGen exploits a Main-Worker paradigm to speed up the overall process. SMGen is also provided with a user-friendly Graphical User Interface that allows the user to easily set up all the parameters required to generate a set of synthetic models with any user-defined number of reactions and species. We analysed the computational performance of SMGen by generating batches of symmetric and asymmetric Reaction-based Models (RBMs) of increasing size, showing how a different number of reactions and/or species affects the generation time. Our results show that when the number of reactions is higher than the number of species, SMGen has to identify and correct high numbers of errors during the creation process of the RBMs, a circumstance that increases the overall running time. Still, SMGen can create synthetic models with 512 species and reactions in less than 7 seconds. The open-source code of SMGen is available on GitLab: https://gitlab.com/sgr34/smgen.


Introduction
Systems Biology is a multidisciplinary research field that combines mathematical, 23 computational, and experimental expertise to understand and predict the behavior of 24 complex biological systems [1,2]. Among the different formalisms that can be used 25 to describe intracellular processes, Reaction-based Models (RBMs) [3][4][5] are the most 26 suitable for obtaining a detailed comprehension of the mechanisms that control the 27 emergent behavior of the system under analysis [4]. The analysis of RBMs can be used 28 to drive the design of focused lab experiments; to this aim, computational tasks such as 29 Parameter Estimation (PE), Sensitivity Analysis (SA), and Parameter Sweep Analysis dreds or thousands of reactions and molecular species, thus hampers the possibility of 48 performing a thorough analysis of the performance of these simulators. 49 The computational performance of several GPU-powered tools have been already 50 assessed using randomly generated synthetic RBMs [13,18,19]. However, only a few 51 generators of biochemical models have been proposed so far, hindering the possibility 52 of having a common and well-defined benchmarking approach. For instance, Komarov 53 et al. [13,14] developed a tool to generate synthetic networks, which was then used to 54 test the performance of their GPU-based simulators. Given the number of reactants, 55 the type of reactions to be included in the RBM, and the total number of reactions, 56 they generated synthetic RBMs by exploiting a hash table to avoid duplicates. The tool 57 was then modified by randomly sampling the values of the initial concentrations of 58 the species from a uniform distribution and the kinetic constants from a logarithmic 59 distribution [18]. Another known and established model generator is the Reaction 60 Mechanism Generator (RMG) [26], which was specifically developed to create synthetic 61 chemical processes. RMG exploits an extensible set of 45 reaction families to generate 62 elementary reactions from chemical species, while the reaction rates are estimated using 63 a database of known rate rules and reaction templates. RMG relies on graphs to represent 64 the chemical structures, and trees to represent thermodynamic and kinetic data. Due 65 to its peculiarities, RMG was used to, e.g., automatically create kinetic models for the 66 conversion of bio-oil to syngas through gasification [27]. Finally, other tools, such as 67 Moleculizer [28], were introduced for the generation of reaction systems to obtain a 68 deeper understanding of transduction networks. 69 Despite the efforts done to automatically define synthetic models, all these genera-70 tors share a common drawback, that is, they have a limited flexibility and can generate 71 only a restricted set of biochemical networks and processes. As a matter of fact, the exist- networks-the tool is not publicly available. 78 Considering the impelling necessity of defining a common benchmarking approach 79 that allows for fairly evaluating and comparing different simulation approaches [29], we 80 propose here a novel tool, named SMGen, designed to automatically generate synthetic 81 biological networks codified as RBMs, which allow to obtain non trivial dynamics. It 82 is worth mentioning that we are mainly interested in the generation of RBMs suitable 83 for the analysis and comparison of the performance of the biochemical simulators. An 84 example of such RMBs is a model composed of plausible reactions (e.g., transformation, 85 production and degradation reactions, which are common in real biochemical networks) 86 leading to non trivial dynamics. For instance, in the case of a dynamics that instantly 87 exhaust all reactants, some of the most advanced integration algorithms are able to 88 simulate such stable and/or flat dynamics in just one computation step, thus hampering 89 the possibility of a fair comparison among the different simulation approaches. 90 In addition, SMGen overcomes the existing approaches and tools, which generally 91 do not allow for using different probability distributions for the initialization of the 92 species amounts and the kinetic constants. As a matter of fact, the possibility of ex-93 ploiting different probability distributions for the initialization of these parameters is an plex networks to measure their information and entropy [30]. In addition, considering 114 that every RBM can be converted into the corresponding system of coupled Ordinary 115 Differential Equations (ODEs), studying the symmetries of this system of ODEs can 116 reveal the intrinsic properties of the system of interest [31]. As a matter of fact, Ohlsson 117 et al. pointed out that an alternative analysis of the system of ODEs can be carried out by 118 considering the symmetries of the system solutions, aiming at formalizing the structures 119 and behavior of the underlying dynamics of biological systems. Moreover, the possibil-120 ity of evaluating GPU-powered simulators using symmetric and asymmetric RBMs is 121 fundamental to understand their performance under different conditions. Indeed, a fair 122 comparison would allow the user to select the best simulator based on characteristics of 123 the RBM that has to be analysed. Thanks to its features and efficiency, SMGen was used 124 to generate the synthetic RBMs necessary to realize a thorough comparison of the perfor-125 mance of different meta-heuristics in solving the PE problem of biochemical networks 126 [3,5], which is one of the most common and difficult computational issues in Systems 127 Biology. The outcome of such analyses showed that some well-known and widely used 128 meta-heuristics, generally able to outperform all competitors in the optimization of 129 benchmark functions, obtained poor performance when used to solve the PE problem. Moreover, SMGen was exploited to generate symmetric and asymmetric RBMs required 131 for the in-depth analyses and comparisons of the computational performance of different 132 biochemical simulators and to highlight their peculiarities [19]. In particular, these 133 analyses allowed for determining the best simulator to employ under specific conditions, 134 such as the size of the RBM and the total number of simulations to perform. SMGen can 135 also be used for the generation of biochemical networks to investigate the performance 136 of approaches specifically designed to tackle other well-known and computationally 137 demanding tasks in Systems Biology (e.g., SA and PSA [1,7]).

138
SMGen allows also for exporting the generated RBMs into the Systems Biology 139 Markup Language (SBML) [32], Version 4 Level 2, and into the BioSimWare standard 140 [33], which is used by different GPU-powered simulators. Thus, we designed and 141 developed SMGen to be a unifying, user-friendly, and standalone tool freely accessible 142 to the Systems Biology community. The RBMs can be easily generated by using the 143 provided user-friendly Graphical User Interface (GUI), which is designed to help the 144 users in setting all the parameters required to generate the desired RBMs.

145
The manuscript is structured as follows. Section 2 describes the mathematical 146 formalism of RBMs, as well as the structural characteristics that must be complied to 147 generate synthetic biological networks. In addition, we provide all the algorithms and where a ij and b ij ∈ N are the stoichiometric coefficients, and k i ∈ R + is the kinetic 153 constant associated with R i . The stoichiometric coefficients specify how many molecules 154 of species S j , with j = 1, . . . , N, appear either as reactants or products in reaction R i . Note 155 that some species might not appear in a reaction, so that the corresponding stoichiometric 156 coefficient will be equal to 0. The order of a reaction is equal to the total number of 157 molecules (of the same or different species) that appear as reactants in that reaction.

164
Starting from an RBM and assuming the law of mass-action [34][35][36], the system of coupled ODEs corresponding to the RBM can be derived as follows: where each ODE describes the variation in time of a species' concentration. In Equation 2 165 the N-dimensional vector X = [X 1 · · · X N ] represents the concentration values of species 166 S 1 , . . . , S N , while X A is the vector-matrix exponentiation form [34]; the symbol • denotes  easily allows for ensuring the system connectivity, a property that is strictly required 175 to ensure that each species S j ∈ S, with j = 1, . . . , N, will be involved in at least one 176 reaction R i ∈ R, with i = 1, . . . , M.
To be more precise, as a first step, an undirected graph with a single connected component is built. This undirected graph is randomly generated by using N − 1 edges and by taking into account the maximum number of reactants and products, obtaining a connected biochemical reaction network. It is worth noting that a graph with a single connected component can be built if and only if the following condition is met: where max num r and max num p are the maximum number of reactants and products, 178 respectively. As a second step, starting from the initial undirected graph, the 179 stoichiometric matrices, which can be viewed and treated as a Petri net [37,38] 180 that, in turn, can be considered as a bipartite graph, are generated. Then, the  SMGen is provided with a user-friendly GUI (see Figure 1) that allows the user to easily • the maximum number of reactants and products max num r and max num p that might 219 appear in any reaction;

220
• the probability distribution D s that is used to initialize the species amounts (to 221 be chosen among uniform, normal, logarithmic or log-normal distributions). It is

228
Note that all kinetic constants are generated using the same distribution probability;

229
• the minimum and maximum values min r and max r for the kinetic constants;

230
• the total number of RBMs that the user wants to generate;

231
• the output format file to export the generated RBMs (i.e., BioSimWare [33] and    Figure 3. Workflow of a single Worker execution. First, the graph of the reactions is randomly initialized, and then converted into the data structures used to store the reactants and products. Second, the stoichiometric coefficients are randomly generated and the consistency of the reactants and products is verified. Third, the initial amounts of the species and the kinetic parameters of the reactions are randomly generated using the probability distributions specified by the user. • each Worker (Proc p , with p = 3, . . . , P) generates an RBM. As soon as a Worker

253
The workflow of each Worker consists in 9 different phases, in which a specific algorithm 254 is executed (see Figure 3). for i = 1 to M do 8: for j = 1 to N do 9: if A[i, j] == 1 then 10:  The steps performed by each Worker to generate an RBM are the following:

295
We analyzed the performance of SMGen regarding both its capability of creating     Table 2 shows the list of reactions along with the kinetic constants of one of these 100 310 synthetic RBMs. Since the species X 0 appears only as a reactant, its amount will be kept 311 constant during the simulation. The initial molecular amounts of all species-given as 312 number of molecules-are listed in Table 3. This small RBM includes the basic "cascade 313 of reactions" structure typically observed in signaling pathways, starting from the source 314 represented by species X 0 and X 4 , toward species X 2 and X 3 .

315
We simulated the dynamics of this RBM for 50 time steps (arbitrary units), and 316 the achieved dynamics are shown in Figure 4. These plots evidence that, although the 317 RBM was randomly generated by SMGen, the simulated behavior is non trivial (e.g., the 318 reactants are not instantly exhausted, resulting in a flat dynamics). Note that currently 319 SMGen can be only used to generate RBMs and does not include any simulation tools 320 yet. The simulation of the dynamics of this RBM was carried out using FiCoS [19]. It    Table 2. RBMs to collect statistically sound results. Figure 6 shows the average running time  Figure 6) the generation times are higher than 368 the opposite situation (top panel in Figure 6). This circumstance is due to the potential 369 higher number of errors that SMGen has to identify and correct. Indeed, when M N, 370 the probability that repeated reactions are randomly generated is higher than the case  the higher the number of operations that must be performed by each thread, leading to 388 a higher running time [18,19].

389
SMGen was developed in Python and was designed to be a unifying, user-friendly, e.g., the Brusselator model [38,43]). The analysis of the features of the generated models, 403 according to different distributions, will be addressed in a future work. 404 We assessed the capabilities of SMGen for the creation of RBMs characterized by a we assume that all reactions follow the law of mass-action; however, we will implement 430 additional kinetics (e.g., Michaelis-Menten and Hill kinetics [44,45]) in the future.

431
SMGen relies on graph theory and linear algebra properties to comply with specific 432 structural characteristics that are essential to create synthetic but real RBMs. Neverthe- are repeated sub-graphs appearing with a frequency higher than in random graphs 438 and depending on the network's null model [46]. Graphlets have been used to ana-439 lyze local network structures, and to cluster different network types [46]. Probabilistic 440 graphlets have been also developed to analyze the local wiring patterns of probabilistic 441 networks. When applied to study biological networks, probabilistic graphlets resulted 442 in a robust tool, able to capture the biological information underlying the biological 443 networks, thanks to the capabilities of managing the low signal topology information 444 [47]. Graphlets can also be used to measure the structural similarity among large net-445 works, calculating the graphlet degree distribution [48]. Motifs can be used to study 446 autoregulation, single input module, dense overlapping regulons and feedback loops, as 447 well as to reveal answers to many important biological questions [49,50]. The frequencies 448 of the motifs were also exploited as classifiers for the selection of the network models 449 [51]. Studying the motifs is also fundamental to understand the stability and robustness 450 of the biological networks to small perturbations [52]. As a matter of fact, Prill et al.

451
showed that the stability and robustness of a network is strictly related to the relative 452 abundance of the motifs [52]. This result suggests that the structural organization of a 453 biological network can be highly related to the dynamic properties of the small network 454 motifs. We plan to extend SMGen to incorporate network graphlets and motifs during 455 the generation of the RBMs, to obtain synthetic models with an improved biological 456 significance.

3')
)LWWHGGLVWULEXWLRQ 3RZHUODZGLVWULEXWLRQ Figure A2. Degree distribution of an interaction network induced by an RBM randomly generated using SMGen. The discrete fitted distribution is shown using linearly spaced bins.
single connected component (independently from the maximum number of reactants and products); i.e., with the following constraint: (max num r + max num p ) × M > N.
Then, we calculated the network deficiency for each generated RBM; finally, for each

526
Biological networks, in particular metabolic networks [57], often show scale-freeness 527 properties [54]. In order to prove that SMGen produces networks that are characterized 528 by this property, we converted the bipartite graph defined by the stoichiometric matrices 529 to a common interaction graph, where nodes represent the chemical species and edges 530 represent reactions involving the nodes as reactants or products. Then, we analyzed 531 the structural properties of the generated network and, in particular, we investigated 532 whether the degree distribution follows a power law distribution, which is characteristic 533 of scale-free graphs. Since the scale-free properties emerge only for large scale networks, 534 we generated and analyzed the interaction network induced by an RBM with by 10, 000 535 chemicals species involved in 10, 000 reactions. According to our results, the degree 536 distribution seems to fit well with a power law (see Figure A2), confirming that SMGen