WEBNG: A templating tool for weighted ensemble sampling of rule-based models

Time scales for biological processes span many orders of magnitude, forcing modelers to tackle coupled processes that have large time scale gaps. This results in rare events, which take longer to occur than the fastest processes in the model. Efficient generation of rare events has been a focus of modelers for a long time and multiple software packages implement various rare event sampling algorithms. However, these packages frequently require expertise to get started with, making it harder for researchers to start using them. WEBNG (short for Weighted Ensemble–BioNetGen) is an open source software framwework that bridges the open source software packages WESTPA, which implements the weighted ensemble method for sampling rare events, and BioNetGen, which facilitates the specification and simulation of biochemical reaction network models following a rule-based approach. WEBNG simplifies rare event sampling in simulations of rule-based models by taking a model specified in the BioNetGen language (BNGL) and generating a WESTPA simulation folder ready to simulate with default parameters selected to match model observables. WEBNG is written in Python with dependencies only on proven, open-source packages that are in active development, which makes WEBNG easy to install and maintain. Here, we describe the architecture and features of WEBNG and demonstrate its capabilities through application to a two-gene model of cell fate transitions.

Biological molecules can undergo many transformations, including phosphorylation, 2 binding, ubiquitination, etc. The multisite/multivalent nature of many molecular 3 components of biological networks gives rise to combinatorial complexity in the number 4 of individual species (complexes) and reactions that can occur in the reaction network 5 governing system dynamics [1]. To simplify modeling such systems, rule-based modeling 6 approaches were developed to avoid manually enumeration of the network [2,3]. Rule- 7 based models define rules that specify how species can interact, and these rules are 8 the applied iteratively to generate the network [4]. BioNetGen (BNG) [3,5,6] is an 9 open-source rule-based modeling software package that allows for model construction in 10 BioNetGen language (BNGL) and simulation of BNGL models. 11 Rare events in modeling are events that take a long time to occur compared to the 12 time step used to simulate the model. Time scales on which biological processes happen 13 span many orders of magnitude. Most biological models need to tackle coupled processes 14 that happen at drastically different time scales [7], resulting in rare events. While in 15 1/8 some models it is possible to simplify some of the faster processes to close the gap 16 between time scales, in many models rare events are unavoidable because the fastest 17 process in the system must be modeled for accuracy. Models with rare events can be 18 computationally expensive to simulate because it takes many events to get a statistically 19 robust estimate of any aspect of a biological process, which means the model has to be 20 simulated for a long time in order to sample enough rare events. 21 Tackling this problem has been a focus of the computational modeling field and many 22 rare event sampling methods have been developed [8][9][10][11][12]. One of these methods is called 23 weighted ensemble path sampling method [8]. Weighted ensemble path sampling works 24 by organizing multiple parallel trajectories and resampling them at a fixed time interval 25 to replicate trajectories that are making progress towards the rare event of interest and 26 terminate trajectories that are not. Each trajectory is assigned a statistical weight that 27 is tracked and updated during resampling in order to ensure that each trajectory that is 28 generated has the correct statistical weight for computing system properties.

29
WESTPA is an open-source, scalable, and interoperable software package that applies 30 the WE strategy [13]. WESTPA has been successfully used to study multiple challenging 31 biological processes [14][15][16][17][18] and is being actively developed. The fact that WE does 32 not bias the kinetics of the model makes it easy to use with any model simulated with 33 stochastic dynamics, including rule-based models. WESTPA has been designed with a 34 general interface for stochastic simulation engines making it straightforward to couple 35 with BioNetGen's ssa method, which implements an efficient version Gillespie's Direct 36 Method [19] for stochastic simulation of reaction networks [6].

37
To this end we have developed a tool called WEBNG (Weighted Ensemble-BioNetGen), 38 which generates a WESTPA template for rare event simulation of a BNGL model. The 39 user provides basic parameters for a WESTPA simulation along with a BNGL model to 40 simulate, and the tool creates a WESTPA simulation folder that can then be run like 41 any other WESTPA simulation. WEBNG is designed to tackle high dimensional WE 42 simulations that are common for rule-based models and also provides several built-in 43 analyses that are tailored to network-level simulations (as opposed to the molecular 44 level simulations for which WESTPA was originally designed). WEBNG aims to lower 45 the entry barrier for systems biology researchers to use the weighted ensemble method 46 to explore models where rare events occur in an efficient way without having to learn 47 details of the WESTPA interface that are needed for advanced applications.

49
WEBNG Architecture 50 WEBNG comes with three subcommands that will set up the WESTPA simulation 51 folder and analyze the resulting simulation in steps (Fig. 1). The first subcommand, 52 template, allows the researcher to point to the BNGL model file they want to simulate 53 and this subcommand will generate a YAML file that is easy to read and modify and is 54 populated with reasonable default values for package locations, WESTPA simulation 55 parameters, and parameters for the built-in analyses, all of which can be modified by 56 the user.

57
The next subcommand, setup, takes as a single argument the name of the YAML file 58 generated in the previous step. This subcommand will use this information to generate the 59 standard WESTPA simulation folder, which contains all of the files needed to carry out a 60 simuluation using WESTPA. From this point the user can add more sophisticated options 61 that are not currently supported by WEBNG prior to invoking WESTPA.  BNG model with reasonable defaults while also allowing an advanced WESTPA user to 65 change simulation parameters and options freely. One important thing to note, by default 66 WEBNG uses an adaptive Voronoi binning scheme [20] for the progress coordinates, 67 which is a good starting point for high dimensional problems in the absence of further 68 information. Binning strategies will be further discussed in the Results section.

69
Once the simulation is run, the user can use the analysis subcommand, which also 70 takes in the same YAML file generated in the template step. This subcommand will 71 take in the parameters provided in the YAML file and run the enabled analyses on the 72 simulation. The built-in analyses shown in Fig. 1 are probability distributions, 73 clustering, and kinetic analysis. Here, we provide brief descriptions of each analysis 74 method with detailed examples provided in the next section.

75
There are two built-in probability distribution analyses, one that shows the joint 76 probability distribution between every progress coordinate on a heatmap, and one that 77 shows the evolution of the probability distributions of each progress coordinate. The 78 joint probability distribution analysis is particularly useful to assess binning efficiency 79 and simulation convergence.

80
WEBNG allows for automated clustering using the Generalized Robust Perron Cluster 81 Cluster Analysis (GPCCA+) [21] implemented in the PyGPCCA+ python library [22]. 82 GPCCA+ is a generalized version of the clustering algorithm PCCA+ [23,24], which is a 83 clustering method based on the kinetics of a system. PCCA+ was originally designed to 84 work on reversible systems under equilibrium, GPCCA+ extends this method to handle 85 non-reversible systems as well. The user provides the number of clusters they want, and 86 the GPCCA+ algorithm finds the ideal clustering of the bins in which transitions between 87 clusters are minimized and transitions within clusters are maximized, maximizing the 88 stability of each cluster. The resulting clustering can also be used in conjunction with 89 WESTPA tools to calculate rate constants between clusters. Finally, WEBNG uses 90 the clusters determined by GPCCA+ and the calculated transition matrices to make a 91 GraphML file [25]. This file can be opened in Gephi [26] to visualize the clusters and 92 transitions between them.

94
WEBNG is distributed via the Python Package Index (PyPI), which simplifies instal-95 lation to a single command from the command line. Installing WEBNG also installs 96 PyBioNetGen (a Python package that comes with BioNetGen) and WESTPA. This not 97 3/8 only allows the user to avoid having to install the other two packages separately but also 98 enables the template command to automatically locate the other packages and provide 99 their paths in the YAML configuration file.

100
Application of WEBNG to a Gene Expression Model 101 Tse et al. used weighted ensemble to run stochastic simulations of multiple gene regulatory 102 network models, identify different phenotypes of the model, and estimate mean first 103 passage times between those phenotypes. We used this study as a guide in designing of 104 WEBNG with the aim to quickly set up simulations for the systems used in the paper 105 and replicate the analyses in a streamlined fashion. Here, we have replicated the results 106 of the exclusive mutual inhibition and self-activation (ExMISA) model shown in Fig. 2. 107 In this model, two genes, A and B, encode transcription factors that activate their own 108 transcription and repress the transcription of the other gene. The model includes the 109 creation and degradation of the transcription factors as well as binding and unbinding 110 of these transcription factors to the regulatory regions of both genes. 111 Figure 2. Diagram of the genetic switch model ExMISA [27].
The analysis reveals four expected high probability regions (Fig. 3(A)), one region 112 where both proteins are low in count (lo/lo), one where both proteins are high in count 113 (hi/hi) and two regions where one protein is low and the other is high in count (hi/lo 114 and lo/hi). The automated clustering correctly identifies each cluster (Fig. 3(B)) and 115 automated network visualization can visualize each state and transitions between them 116 (Fig. 3(C)). Standard WESTPA tools can also be used to calculate the rate constants 117 and therefore mean first passage times (MFPTs) between phenotypes identified from 118 clustering ( Fig. 3(C)).

119
Discussion 120 WEBNG is a simple templating tool that bridges two open software packages, WESTPA 121 and BioNetGen, and lowers the barrier to entry to weighted ensemble rare event sampling 122 of rule-based models. WESTPA, BNG and WEBNG are all well documented, which 123 helps any potential researcher who wants to model processes that contain rare events 124 using rule-based modeling by providing a simple starting point. BioNetGen also has 125 the capability to take as input any model encoded using the Systems Biology Markup 126 Language (SBML) standard [28]. Using PyPI as the distribution point not only makes 127 installation simple but also allows for all three software packages to be easily kept up-to-128 date and in sync. New analyses, higher simulation efficiency, and increased support for 129 WESTPA parameters are currently in development for WEBNG. Each colored dot represents one discrete possible state the model is in and each color represents a different cluster. (C) Transition probabilities between GPCCA+ generated clusters and steady state probabilities estimated from those transitions. MFPT is estimated using WESTPA tools and using definitions that match Tse et al. study. Network graph is automatically generated from the transition matrix and the steady state probabilities. Thicker edges mean higher probability transitions between clusters and node sizes scale with the probability of each phenotype in steady state. Each edge is colored by the state that they start from.
Another advantage of a dedicated software package for templating these simulations 131 is reproducibility. There are many studies where the model is implemented in a language-132 specific manner (including the Tse et al. [27], in which the WE simulation is coded in 133 MATLAB) and contain hard-coded variables which makes reproducing results challenging 134 even when the code is provided with the published paper. Dedicated software packages 135 like WEBNG allow for the same simulation setup to be generated as long as the original 136 model file is provided and results analyzed in a consistent manner to make results more 137 5/8 reproducible.