Rule-based modeling using wildcards

Many biological molecules exist in multiple variants, such as proteins with different post-translational modifications, DNAs with different sequences, and phospholipids with different chain lengths. Representing these variants as distinct species, as most biochemical simulators do, leads to the problem that the number of species, and chemical reactions that interconvert them, typically increase combinatorially with the number of ways that the molecules can vary. This can be alleviated by “rule-based modeling methods,” in which software generates the chemical reaction network from relatively simple “rules.” This article presents a new approach to rule-based modeling. It is based on wildcards that match to species names, much as wildcards can match to file names in computer operating systems. It is much simpler to use than the formal rule-based modeling approaches developed previously but can also lead to unintended consequences if not used carefully. This article demonstrates rule-based modeling with wildcards through examples for: signaling systems, protein complexation, polymerization, nucleic acid sequence copying and mutation, the “SMILES” chemical notation, and others. The method is implemented in Smoldyn, a spatial and stochastic biochemical simulator, for both the generate-first and on-the-fly expansion, meaning whether the reaction network is generated before or during the simulation.


Introduction
Since about the time that Boyle posited that matter was composed of minute particles "associated into minute masses or clusters" (1), now recognized as molecules, the dominant paradigm in chemistry has been to classify molecules into chemical species.
This paradigm forms the foundation of chemical kinetics (2, 3) and is supported by the finding that different molecules of the same species are completely indistinguishable from each other (4, 5). Correspondingly, most modern biochemical simulation software represents molecules as members of species, treating all members of a single species identically (see reviews (6, 7)). However, many biological molecules do not fit neatly into these classes. For example, a cell might have a hundred or more DNA molecules, each with a different sequence. Similarly, a cell might have thousands of copies of some protein, but the copies vary according to whether they are bound to other proteins, bound to cofactors, or post-translationally modified with phosphate, methyl, or other moieties.
Several approaches have been developed to represent this molecular variation in computational models. One is to represent every multimer as an explicit graph, including its component monomers and their interconnections (e.g. (8-12)). Here, every molecule is its own entity and the concept of a species as a class of molecules is unnecessary. A second approach is to maintain the species concept, but to include states in the molecule definitions. For example, some biochemical simulators allow molecules to have modification states (13-15), surface-binding states (16), or an entire hierarchy of states (17). A third approach is to define each molecular variant as a separate species, with minimal variation within species. The possible variations can lead to a combinatorial expansion in the number of species (18), leading to the development of so-called rule-based modeling methods for automating reaction network expansion from "rules" that describe molecular complexation and modifications (e.g. (19-21)).
I recently followed this last approach, adding rule-based modeling to the Smoldyn simulator (22). Smoldyn is a widely used biochemical simulator that represents molecules as individual particles in 1D, 2D, or 3D space; these molecules diffuse, react with each other, and interact with surfaces (16, 23, 24). Smoldyn  performs a separate type of rule-based modeling using wildcard characters (22), which is the focus of this article. In this method, a modeler can specify groups of species using wildcard characters, much as a computer user can specify groups of files using wildcard characters. When used in chemical reactions, these wildcards can be used to define new species and new reactions.
Whereas rule-based modeling with formal languages, such as BNGL, were designed around an underlying model of how protein complexation and modification generally works, this is not the case for wildcards. Instead, the wildcard approach is simply a well-defined set of text-replacement tools with which the modeler can create his or her own notational scheme. This offers substantial versatility and generally simplifies input files. However, it can also lead to undesired results if not used carefully. Thus, the main text of this article focuses on the precise behavior of the wildcards, while the notes section presents examples that illustrate how the method can be used effectively. Windows with the install scripts, which is generally easy. Install on Linux computers by compiling the source code with CMake and Make, which is also straightforward.

Materials
Smoldyn runs on most laptops and larger computers that are less than 5 years old, as well as many older computers. Support is available by e-mailing support@smoldyn.org.

Running Smoldyn
To simulate a model in Smoldyn Smoldyn displays the simulated system to a graphics window and saves quantitative data to one or more output files.

Wildcards for matching
Molecules in Smoldyn are classified into chemical species and can also adopt any of 5 physical states. These states are in solution (e.g. a cell's cytoplasm) or the 4 surfacebound states called "front," "back," "up," and "down." Originally, the former two surface-bound states were for peripheral membrane proteins and the latter two were for integral membrane proteins, although they are all essentially equivalent in practice. All molecules of a single species and state behave identically, meaning that they have the same diffusion coefficients, graphical display parameters, surface interaction rates, and chemical reaction rates. Any other molecular variation needs to be expressed using separate species. For example, if a model includes the yeast Fus3 protein, which can bind to zero, one, or two phosphate groups (27), then each of its phosphorylation states would need to be represented as a separate species. Alternatively, if a model includes a receptor that diffuses at one rate in normal membrane regions and more slowly in lipid rafts, then this variation would again need to be represented using separate species.
Modeling such variations can lead to a rapid proliferation of separate species and so can become tedious to address. Use of wildcards alleviates this problem because it enables one to represent multiple species at once. For example, if the three Fus3 species were named Fus3, Fus3p, and Fus3pp, then the species pattern Fus3* would represent all three species. Also, if the receptors mentioned above were named R_normal and R_raft, then the species pattern R_* would represent both species. More generally, a species pattern is a species name that may include wildcard characters. In both of these examples, the '*' wildcard is used to represent variable portions of the species names.
Smoldyn supports text-matching and structural wildcards, where the former ones match to specific portions of the species names and the latter ones enable logical operations in species patterns. The text-matching wildcards include '*', which matches to any zero or more characters, '?', which matches to any one character, and [...], which matches to any one character from a specified list. The structural wildcard characters include '|', which is an OR operator, '&', which is a permutation operator, and {...}, which specifies the order of operation for the other two structural wildcards (the normal order of operations is that '&' takes precedence over '|'). The structural wildcards are most easily explained through examples, this time using the generic protein monomer names A, B, and C: the pattern A|B matches to either A or B, the pattern A&B matches to either AB or BA, the pattern A&B|C matches to AB, BA, or C, and the pattern A&{B|C} matches to AB, BA, AC, or CA. See Table 1.
Internally, when Smoldyn parses the user's input file and expects a species name, it inputs the given text as a species pattern. The pattern may be as simple as a single species name but could also include one or more wildcard characters.

Wildcards for substitutions
Smoldyn also supports wildcards in chemical reaction definitions, where they can be used to specify multiple chemical reactions at once. Smoldyn inputs each chemical reaction equations as a reaction pattern, which again may include wildcards but does not have to.
First, consider elementary reaction patterns, meaning reaction patterns that do not contain structural wildcards. In this case, Smoldyn substitutes any text that the wildcards match for the reactants into the corresponding wildcards in the products. For example, the reaction Ste5 + Fus3* → Ste5-Fus3* specifies that any of the three Fus3 species described above can associate with the Ste5 protein (27). In this case, the respective products would be Ste5-Fus3, Ste5-Fus3p, and Ste5-Fus3pp. If the same text-matching wildcard is used multiple times on each side of the equation, then Smoldyn corresponds the first instance in the reactants to the first instance in the products, the second to the second, and so on. For example, if Ste5 can also be phosphorylated, then Ste5* + Fus3* → Ste5*-Fus3* specifies that the binding reaction occurs for all phosphorylation states of both proteins, and that they maintain their phosphorylation states in the reaction. The correspondence can also be given explicitly using the '$n' wildcard on the product side of a reaction, using any value of n from 1 to 9, where it represents the n'th item of matching text. For example, the previous reaction could also be written as Ste5* + Fus3* → Ste5$1-Fus3$2. Text-matching wildcards in the reactants do not have to appear in the products; for example, Fus3* → X shows that all three Fus3 species decay to the same product. On the other hand, text-matching wildcards in the products must appear in the reactants, meaning that Smoldyn would not accept the reaction X → Fus3*.
Much like the case for species patterns, Smoldyn expands reaction patterns that include structural wildcards to lists of elementary reaction patterns and then performs matching and substitution on these elementary patterns. In the reaction pattern A&* → X*, for example, Smoldyn would first expand it to A* → X* and *A → X*, and would then perform matching and substitution on these two elementary reaction patterns. There In addition to the chemical reaction equation, Smoldyn allows modelers to specify several other reaction parameters. These include the reaction rate constant, how any dissociation products should be arranged, whether molecule serial numbers should be retained, and others. These parameters are entered in the same way for single reactions, reactions defined using wildcards, and reactions defined as rules, described next.

Reaction network expansion
In On the other hand, if the statement is suffixed with "_rule", such as difc_rule or

Properties of new species
As mentioned above, each Smoldyn species has several properties, including its diffusion coefficient, graphical display parameters, and set of surface interaction behaviors. Typically, one assigns these properties to all species that are defined in the input file using the difc, color, display_size, action, and rate statements, where the last two statements define molecule-surface interaction behaviors. However, if Smoldyn acts on these statements before it performs reaction network expansion (which is always the case for on-the-fly expansion), then they do not apply to any newly generated species.
The rule statements described above, such as difc_rule, are one way to address this problem. An alternate and often better approach is that Smoldyn can assign species properties automatically by computing reaction product properties from the reactant properties.
It does so using the following assumptions: (i) reactants diffuse as though they are roughly spherical, (ii) reactant volumes add upon binding, and (iii) molecule diffusion coefficients scale as the inverse of the molecule's radius (22). This last assumption follows from the Stokes-Einstein equation, which appears to be reasonably accurate even within cells (26, 30). These assumptions lead to the following equations for the product of the generic reaction A + B → AB: where r A and r B are the reactant radii and D A and D B are the reactant diffusion coefficients.
Smoldyn assigns the product diffusion coefficient as D AB and computes the product's graphical display radius from the r AB equation. Next, Smoldyn computes the product's display color using a radius-weighted average of the reactant colors. For each of the red, green, and blue colors, it computes the product brightness value using where v A and v B are the reactant brightness values. Finally, Smoldyn determines surface interactions for products using the method that the new species behaves like the reactant that has the "greater action," where the possible actions are ordered with increasing value as: transmission, reflection, absorption, and porting (which is for hybrid simulations).
For example, if a surface reflects reactant A and transmits reactant B, then reflection is the greater action, so the surface reflects product AB.

Symmetric species
Reaction networks that include structurally symmetric species often include multiple reactions that form the same products, which increases the effective reaction rate. An exception arises to this multiplicity computation if the reaction rule for a bimolecular reaction can match to both possible orderings of a single pair of reactants.
For example, the rule * + * → ** can match to the two reactants A and AA as either A + AA or AA + A (see Note 5). Because these two possible reaction orderings typically reflect two different chemical bonds being formed, Smoldyn only considers one of the two orderings (the one in which the reactant's internal indices are in increasing order).

Notes
The following notes illustrate the use of wildcards for rule-based modeling using several example problems. These models, and additional files that I used for their 1. Simple reaction networks with low symmetry. Reaction networks that are conceptually simple and have low symmetry are typically easy to define using wildcards. This is illustrated with an example of second messenger signaling, where extracellular "first messengers" bind to cell receptors, which then release intracellular "second messengers" (31, 32). Figure 1A  Each line shows the rule name, the reaction rule, and the reaction rate constants. Note that the use of wildcards, which in this case is just the '*' character, enabled each rule to represent a separate process in a clear manner. Also note that the reactant and product states (the spatial localizations given within parentheses) are straightforward to define and reasonably intuitive. Smoldyn uses them to correctly place all receptor complexes at the membrane, ligands in the extracellular space, and messengers in the cytosol ( Figure   1C). Figure   3. Symmetric complexes, modeled with asymmetric notation. Reaction networks that include structurally symmetric protein complexes, such as dimers and higher oligomers (34), generally require a little more care. In particular, it is often the case that a single complex can be represented correctly in multiple ways, leading to the question of whether the model notation should just include one of the ways, or all of them. Which approach is simplest depends on the specific problem; this note shows an example of the former approach, in which each complex is represented in just one way. Figure 3A shows a simple model of reversible dimer assembly for a symmetric complex that has the form A-B-B-A, a form that is loosely based upon receptor tyrosine kinases such as the epidermal growth factor and insulin receptors (35). The model includes the monomers A and B, dimers AB and BB, trimer ABB, and the tetramer ABBA. The notation is asymmetric in that it includes the species AB but not the species BA, which would be chemically identical. Similarly, it includes ABB but not BBA. It can be expressed with the reaction rules:

More complicated networks with low symmetry.
These rules make heavy use of the OR operator. For example, the first reaction rule shows that A can bind to any of B, BB, BB, or ABB, and the products are, respectively, AB, ABB, ABB, and ABBA. The repeated BB reactants in this rule reflect the fact that A can bind to either the left or right side of BB so the rate constant for this reaction should be twice the listed value (AB_ON). Similarly, in the second reaction rule, ABBA dissociates twice to A + ABB to reflect the two A-B bonds in ABBA. These rules are somewhat inelegant in that they do not reflect the symmetry of the system, include strings of OR operators, and only include irreversible reactions despite the fact that the model reactions are reversible. This inelegance arises from the decision to use asymmetric notation and from limitations in the wildcard approach. Nevertheless, these rules are substantially simpler than the full list of 12 reactions.
The reaction network that Smoldyn computed from these rules was identical to ones that arose from BioNetGen and manual expansion (22), validating the rule approach. Figure 3B shows that a Smoldyn simulation that was defined with these rules agreed well with a deterministic simulation of the same network, computed using ordinary differential equations.
4. Symmetric complexes, modeled with symmetric notation. This note continues on the topic of symmetric complexes, but now using symmetric notation. Here, if a complex can be represented correctly in multiple ways, the approach is to not just pick one of them but to use all of them. This increases network complexity due to the greater number of species and reactions, but can simplify the reaction rules through maintenance of the network symmetry.     A more serious flaw with this model is that Smoldyn assigns the same reaction rate to all association reactions. This is the correct behavior for the given reaction rule, but does not account for the fact that the A + A → AA reaction can happen in either of two ways: either of the two reactant monomers can end up at the "left" end of the product.
The following reaction rules (model "polymer_end2") fix this flaw Here, monomer association proceeds twice as fast as association of higher polymers.
Results from these latter rules agreed with a comparable model written in BNGL (22) and, at equilibrium, they also showed an exponential length distribution for all polymers with more than one monomer (see supplementary information). Figure 5B shows a similar model, but one which represents polymers that can anneal and break (model "polymer_mid"). It can be expressed with the reaction rule rxnmid * + * <-> ** 2*KF KR As above, the association reaction rate was doubled to account for the fact that either of the two reactants can end up on the "left" side of the product. This follows from the fact that Smoldyn only considers a single ordering for any particular pair of reactants; for example, it generates the reaction A + AA → AAA but not also AA + A → AAA. This model reached equilibrium much faster than the former ones but produced essentially the same exponential length distribution as the "polymer_end1" model ( Figure 5C). This model led to a much larger reaction network, with 151 species and 4037 reactions, because each species can participate in many more reactions.

Polymer sequences and chemical structures.
The pattern matching aspects of the wildcard method enable it to be used to define reactions that are specific to individual polymer sequences and chemical structures. These applications would be possible using BioNetGen or other rule-based modeling approaches, but would generally not be as convenient.
The central dogma of molecular biology is that cells transcribe DNA to mRNA and then translate mRNA to protein (45). Figure 6A shows that this process can be modeled using wildcards if sequences are reasonably short. The reaction rule performs transcription, where "Dna" and "Rna" are prefixes that indicate the sequence type and the "$1" portions of the products show that the same text gets substituted into each one. Ideally, this rule would not only preserve the sequence, which it does, but also replace all T symbols, for DNA thymine bases, with U symbols, for RNA uracil bases.
However, there is no easy way to do so with the wildcard method as it is currently designed. A wildcard approach that used regular expressions, which are more sophisticated pattern matching approaches, would solve this problem but would also be more difficult to use. The following reaction rules perform translation by modeling ribosome ("Rib") binding to the beginning of an mRNA sequence, translation of each codon, and finally dissociation of RNA, ribosome, and protein ("Prot" prefix). Wildcards can also be used to define reactions based on chemical structures that are not sequence data. In particular, they are useful in conjunction with the SMILES notation (51), a scheme that allows most chemical complexes to be uniquely expressed using a single line of normal text characters (e.g. ethanol, CH 3 CH 2 OH is CCO in SMILES notation). As an example, the E. coli lipid synthesis pathway includes several enzymes that act repeatedly on lipids, adding a two carbon groups with each repetition (52). Each enzyme is specific to a particular chemical functional group but has low specificity with regard to the lipid chain length. This can be representing using wildcards starting with the 10-carbon lipid cis-3-decenoyl-ACP, written in SMILES notation as           Analytical model

Tables
The above table shows that the reaction time constants, which are the inverse of the reaction rate constants, vary over 15 orders of magnitude. This suggests the possibility of simplifying the model by separating the fast processes from the slow ones, and treating the fast ones as though they are at equilibrium. Indeed, this approach works well here because reactions 0 and 1, which interconnect the group of species T 2,nsb , T 2 , A-B, T 2 A-B, A-BT 2 , and T 2 A-BT 2 , are much faster than reaction 2, which leads out of this group, to AT 2 B. Thus, assume that this group of species is at equilibrium.
Focus first on reaction 0. At equilibrium, the concentration ratio between T 2 and T 2,nsb is This shows that 720 times as much transposase is non-specifically bound than is freely diffusing. With the assumption that there are many fewer transposons than total transposase dimers, then an insignificant number of transposases will be bound to transposons. Define [T 2,total ] as the sum of the non-specifically bound and freely diffusing transposase dimer concentrations, [T 2,nsb ] + [T 2 ], which is a close approximation to the actual total transposase dimer concentration. Using this, the fraction of total transposase that is freely diffusing can be approximated as Focus next on reaction 1. At equilibrium, the following concentration ratios are equal to each other and can be given in terms of k 1f and k 1r , Define the association constant K as The inverse of K, which is about 500 µM, is a dissociation constant equal to [A-B][T 2,total ]/[T 2 A-b]; it is the total transposase dimer concentration at which there are equal concentrations of transposon in the unbound state A-B as in the singly bound state T 2 A-B. Also define the transposon concentration sum as This is close to the total transposon concentration, except that it ignores the concentration of transposons in the AT 2 B state. These are not necessarily negligible because reaction 3 is slow. The fraction of transposons in each binding state can be computed by combining the above equations, giving Finally focus on reactions 2 and 3. The concentration of AT 2 B can be computed by setting its rate equation to 0, [ ] Next, use the fact that reaction 2 is upstream of reaction 3 and has a negligible reverse reaction rate, so its rate is the transposition rate. In other words, any transposon that undergoes reaction 2 will almost certainly go on to transposition, so the rate of reaction 2 is the rate of transposition. Define the rate of reaction 2 as φ, for the transposition flux, which is Substituting in the above values gives [ ] The final term can be neglected if k 3 >> k 2f . This final result is the transposition rate as a function of the transposase dimer concentration. For the most part, it depends only on the rate of DNA ring closing, which is k 2f , and on the association constant for transposase onto transposons, which is K.

Smoldyn input file
This file was only used to verify that the reaction network that was defined using wildcards exactly matched the one that was created manually. The Smoldyn simulation proceeded too slowly for it to be practical.

MinD dimerization
Model parameters • Cell volume. The cytoplasmic volume of an E. coli cell is roughly 0.67 µm 3 (from BioNumbers ID 100011 (2) and (3)). However, this model assumes a slightly larger volume of 1 µm 3 , which is 1 fl, in part because the MinD system is of particular interest in cells that are about to divide, which are larger than average cells.
• • Nucleotide exchange from MinD ADP to MinD ATP . There are two possible mechanisms for conversion from MinD ADP to MinD ATP : transfer of just a phosphate group from free ATP to bound ADP, leading to free ADP and bound ATP, or exchange of entire nucleotides. Experiments with radiolabeled ATP showed that the latter mechanism is the correct one (10). There are no good in vivo values for the reaction rate constant for this reaction. Huang et al. (9) used a value of 1 s -1 for the reaction MinD ADP → MinD ATP without justification, except to note that it's within an observed range for guanine nucleotide exchange that spans 5 orders of magnitude; note that this is a pseudo-first order reaction because it ignores the ATP reactant. However, their model behavior agrees well with the experimental data, which does provide some justification for their parameters. Lacking better values, this model assumes a rate constant of 1 s -1 . Dividing by the ATP concentration of 1.54 mM gives a reaction rate constant of 650 M -1 s -1 for the second order reaction MinD ADP + ATP → MinD ATP + ADP. After unit conversion, this is 1.1×10 -6 µm 3 s -1 . The model assumes this value for the nucleotide exchange from MinD ADP to MinD ATP , represented here as k DT = 1.1×10 -6 µm 3 s -1 .
• Nucleotide exchange from MinD ATP to MinD ADP . Lackner et al. showed that ATP competes about 3 times more effectively at binding to MinD than does ADP (their Figure  2B) (11). This contrasts crystal structure results, in (12), which suggests similar binding affinity. Based on the former result, this model assumes that the reaction rate constant for MinD ATP + ADP → MinD ADP + ATP is one third that of the reaction in the other direction, making it k TD = 0.37×10 -6 µm 3 s -1 .
• Nucleotide gain. There are no good data on the rate of nucleotide binding to unbound MinD in the reactions MinD apo + ATP → MinD ATP and MinD apo + ADP → MinD ADP . Thus, this model assumes the same reaction rate constants as for the respective nucleotide exchange reactions. In particular, it assumes a rate constant of k AT = 1.1×10 -6 µm 3 s -1 for the former reaction and k AD = 0.37×10 -6 µm 3 s -1 for the latter reaction.
• MinD ATP nucleotide loss. To my knowledge, all published models of the Min system assume that all of the cell's MinD protein is bound to either ADP or ATP (e.g. refs. (9, 13,  14)). This model modifies this assumption slightly, assuming instead that the equilibrium concentration of MinD ATP is 20-fold higher than that of MinD apo . The reaction MinD apo + ATP ↔ MinD ATP is at equilibrium when k AT  ). Setting these two results for the same ratio equal to each other and simplifying shows that k DA = k DT k TA k AD /(k TD k AT ). Plugging in the numbers from above, k AT = 0.05, which is the same as the k TA value.
• MinD ATP dimerization and dissociation rate constants. MinD ATP has been shown to be predominantly dimeric when its concentration is above 2 µm, and primarily monomeric at lower concentrations (15). This model assumes this value for the MinD ATP dissociation constant. There are no good data for the dimerization and dissociation rate constants, so this model assumes that the dissociation rate constant is 1 s -1 (making it the same as nucleotide exchange from MinD ADP to MinD ATP ). Combining this assumption with the dissociation constant implies that the association rate constant is 5×10 5 M -1 s -1 , which is 8.5×10 -4 µm 3 s -1 .
• Other dimer dissociation rate constants. This model assumes the same dissociation rate constant for all MinD dimers. From above, this is 1 s -1 . •

Polymerization
Theory for polymer length distribution Consider the polymerization model "polymer_end1," in which polymerization and depolymerization arise through assembly and disassembly at one polymer end. It is expressed using the reaction rule: "A + * -> A*". This section shows that the equilibrium length distribution is exponential.
In this model, the association and dissociation rate constants, given here as k f and k r , respectively, are the same for all polymer lengths. Based on this, define the association constant as This is also the equilibrium constant for the dimerization reaction where the dimer is expressed here as A 2 , rather than as AA as it is in the simulation. As an equation, this means that where brackets denote concentrations. The same equilibrium constant applies to longer polymers too due to the assumption that the reaction rate constants are the same for all polymer lengths. Thus, The solution with the positive sign simplifies to The value of the right hand side is much greater than 1 for this low association case, which disagrees with the statement made earlier that [ ] [ ]

( )
This agrees with prior statements, implying that the negative sign leads to the correct solution.
Thus, the equilibrium distribution of polymer lengths can be calculated from the two following equations, both of which are copied from above, This shows that the length distribution depends exponentially on the polymer length, in agreement with prior results by Flory (17).
Next consider the model "polymer_end2", which is identical to the one just shown except that monomer dimerization, through the reaction A + A → A 2 , has twice the association reaction rate. Using the same definition for K a as before, In the latter equation, n can adopt values of 2 or larger. Following the same procedure as before, these rearrange to Substituting into the [A tot. ] definition leads to Expanding and rearranging the latter equality leads to a cubic equation in [A], which is too complex to solve here. However, importantly, the length distribution is still exponential for n ≥ 2, Also, the actual monomer concentration, [A], is half of the value that would be predicted from this equation.
The "polymer_mid" model is more difficult to evaluate because it has many more reactions. Of these reactions, all of the forward reaction rate constants are the same; they are all twice as large as the values in the "polymer_end1" model due to the fact that association can happen with either reactant ending up on the "left" side of the product. Also, all of the reverse reaction constants for asymmetric dissociation are the same; they are also twice as large as the values in the "polymer_end1" model, in this case because there are two ways for each asymmetric dissociation to arise. However, the reverse reaction constants for symmetric dissociation are a factor of two smaller than the others because each of these dissociations can only happen in one way. There are many fewer of these latter reactions than the others, so nearly all of the reaction equilibrium constants are the same as for the "polymer_end1" model, which also leads to the same equilibrium polymer length distribution. Figure 3 of the main text shows this result. Nevertheless, the slower dissociations for symmetric species undoubtedly affect the length distribution, and this should be particularly noticeable for small polymers because symmetric dissociations are a larger fraction of their total dissociations. Indeed, the simulation data shown in the figure shows a small but statistically significant deviation away from the exponential line for small polymer sizes, presumably arising from this effect.
The standard deviation for the equilibrium populations of different polymer lengths can be computed from the standard result that, when at equilibrium, the probability density for the population size for any individual chemical species obeys a Poisson distribution. A Poisson distribution has the variance equal to its mean, so the standard deviation is the square root of the mean.

Polymer length distribution figure
The following figure shows equilibrium polymer length distributions for all three models tested here. As in Figure 3 of the main text, red dots represent the "polymer_end1" model, blue dots represent the "polymer_mid" model, the solid black line represents the theoretical exponential distribution, and the dashed black lines represent the theoretical computation for one standard deviation away from the exponential distribution. In addition, the green dots represent the "polymer_end2" model. These points agree qualitatively with the predictions described above.
Smoldyn input file for polymer_end1 model # Smoldyn configuration file to test wildcards in reactions # This file simulates polymerization, one unit at a time # This file uses a simple reaction rule, which gives a simple length distribution # but ignores multiplicity in the monomer association reaction.

Transcription and translation
The model here uses numbers that are approximately correct for a Saccharomyces cerevisiae yeast cell. In particular, it uses a spherical cell geometry with a plasma membrane radius of 2.5 µm and a nuclear radius of 1 µm; typical cell values are 2.5 to 5 µm and 0.9 µm (18), respectively. The simulation runs for 10,000 s, which is about 2.8 hours; typical cell generation times are around 1.5 to 2.3 hours in rich media (19).