## Abstract

DNA strands have to sample numerous states to find the alignment that maximizes Watson-Crick-Franklin base pairing. This process depends strongly on sequence, which affects the stability of the native duplex as well as the prevalence of non-native inter- and intra-molecular helices. We present a theory which describes DNA hybridization as a three-stage process: diffusion, registry search, and zipping. We find that non-specific binding affects each of these stages in different ways. Mis-registered intermolecular binding in the registry search stage helps DNA strands sample different alignments and accelerates hybridization. Non-native intramolecular structure affects all three stages by rendering portions of the molecule inert to intermolecular association, limiting the mis-registered alignments that can be sampled, and impeding the zipping process. Once in-register base pairs are formed, the stability of the native structure is important to hold the molecules together long enough for non-native contacts to break.

## Introduction

The double stranded nature of DNA plays a crucial role in the storage, retrieval, and transfer of genetic information. These functions depend on the ability of DNA to separate into individual strands and then re-hybridize. The recognition of specific nucleotide sequences during hybridization also plays a fundamental role in medicine, biotechnology, and nanotechnology^{1–4} by enabling techniques like PCR^{5–7}, DNA-based nanostructures (DNA origami)^{8–12}, and methods for the diagnosis of diseases like HPV, HIV, and cancer^{13–15}. The specificity in DNA hybridization is encoded in the complementary H-bonding patterns of the A-T and G-C Watson-Crick-Franklin (WCF) base pairs. However, the major driving force for duplex formation comes from the stacking of aromatic bases^{16}. This contribution from base stacking has important consequences on the stability of double-stranded DNA helices. For example, there is a free energy penalty to initiate a helix due to the fact that base stacking cannot occur until consecutive base pairs have formed. Additionally, base stacking interactions depend on the identity of neighboring bases, resulting in helix stabilities that are strongly sequence dependent. Both the initiation and sequence dependent effects can be approximately captured using base pair interaction energies that account for nearest-neighbor bases ^{5,17–22}.

While there are numerous tools to predict the thermodynamic stability of DNA duplexes^{5,23–29}, the kinetics of hybridization are less well understood. This is a major gap because kinetic limitations can prevent DNA molecules from finding the most stable state. Because there are only four bases, each base pair formed in a non-native alignment has a 25% probability of being an AT or GC match. Therefore, many random portions of DNA will be able to hybridize. While these states have a negligible effect on the equilibrium population of states^{30}, they can prevent DNA from finding the correctly hybridized state within the required timescale and require special attention in the design of sequence-specific DNA oligomers^{31}.

Experiments and theoretical analysis of long DNA molecules (kbp to Mbp) have shown that hybridization follows second order reaction kinetics that are limited by a slow nucleation step followed by a rapid zipping step^{32}, and further analysis has explained how the hybridization rate scales with length, solvent viscosity, and sequence complexity^{33}. Work on small molecules has provided high resolution views of hybridization. Simulations have revealed “inchworm” and “slithering” mechanisms in the search over intermolecular alignments^{2,3,10,31,34–39} and theory has explained the temperature dependence of the alignment search^{40}. Single molecule experiments have shown that this search is very sensitive to sequence perturbations, with a threshold of ∼7 bp to initiate hybridization^{41}. However, the sequences explored in these studies were too short to show the extensive non-native traps expected in longer molecules. Multi-state kinetics, indicative of traps and intermediate states, have been observed on sequences as short as 12 bp^{42} and non-Arrhenius temperature effects are suggestive of different hybridization mechanisms at high and low temperature^{43}. Thus, there is a gap in our understanding of hybridization on scales between the short oligomer and genomic DNA. The scale of tens to hundreds of base pairs is of considerable practical importance because this range encompasses oligonucleotides used by PCR, CRISPR, DNA origami, and more. At this intermediate scale, sequence dependent variations do not average out as they do on the genomic scale. Additionally, at this scale there is a significant probability of non-native base pairing, which increases the variability between sequences.

Here we present a theoretical model developed to explain the hybridization kinetics of 36 bp oligonucleotides studied by Zhang et al.^{44}. Our model accounts for the competition between native and non-native base pairing during the random search over binding alignments with a three-stage model^{33}, using random walks to efficiently compute the effects of non-native interactions in each stage. We find that non-native *inter*molecular base pairs facilitate hybridization by increasing the lifetime of the encounter complexes in which the DNA strands search for native base pairs. In contrast, non-native *intra*molecular base pairs impede hybridization by creating barriers in the zipping process. However, intramolecular structure has a secondary beneficial effect of limiting the number of alignments that need to be searched.

## Methods

We introduce a model that accounts for three types of base pairing interactions.

### Native base pairs

Native base pairs, which we refer to as “in-register”, determine the stability of the final DNA duplex but only have a secondary influence on hybridization kinetics because most of the hybridization process involves searching non-native states.

### Mis-registered intermolecular base pairs

Misaligned DNA strands have a low probability of forming WCF matches and, therefore, have weaker binding affinity than in-register states. However, these misaligned states have a strong effect on hybridization kinetics because most intermolecular collisions are out-of-register.

### Intramolecular base pairs

It is common for DNA strands to have self-complementary regions that allow the molecules to form intramolecular base pairs. While self-hybridization can lead to complex folds, especially in RNA, here we restrict our analysis to sequences that remain unstructured or only form simple “stem-loop” structures. We use the NUPACK software to predict these structures and compute their stability^{23} (Fig. 2). Such self-hybridized regions sequester a portion of the sticky bases, thereby increasing the probability that DNA molecules collide without hybridization. The intramolecular base pairs also must be broken to complete hybridization.

Figure 3 shows that metrics specific to each of these kinds of interactions correlate poorly with hybridization kinetics when considered individually. The reason for this is that hybridization is a multi-stage process in which the bond types contribute differently to each stage.

Our model includes these non-specific interactions using three stages, similar to those proposed by Niranjani et al.^{33}. The stages, which we refer to as the diffusion stage, the registry stage, and the zipping stage (Fig. 4), each have a characteristic timescale as well as a probability that the system advances to the next stage.

### Diffusion stage

This stage describes molecules that have not formed any intermolecular base pairs. The characteristic time, *t*_{diff}, is the inverse collision rate and *P*_{diff} is the probability that a collision results in the formation of an intermolecular base pair. We take *P*_{diff} = 0 in collisions where either of the colliding bases is involved in an intramolecular stem. We also neglect the formation of intermolecular base pairs by bases in loop regions under the assumption that these bases are too constrained to properly stack.

### Registry stage

The registry stage is an encounter complex stabilized by non-native base pairs. We characterize the encounter complex using the registry variable *R*, which indicates how many bases a molecule needs to be shifted to form native base pairs (Fig. 1). The registry variable can take positive or negative values −*L*+1 ≤ *R* ≤ *L*−1, where *L* is the number of base pairs in a strand and *R* = 0 indicates the in-register (native) alignment^{45,46}. Each registry has a binding lifetime *t*_{reg}(*R*) in which the molecules search for in-register base pairs. *P*_{reg}(*R*) describes the probability that the molecules held together by non-native base pairs in registry *R* form an in-register base pair before the non-native bonds dissociate.

### Zipping stage

Once two molecules form the first in-register base pair they advance to the zipping stage. *P*_{zip} indicates the probability that all native base pairs successfully hybridize before the molecules dissociate. This stage has an average duration *t*_{zip+} or *t*_{zip−} depending on whether the zipping stage ends with the formation of all in-register bonds or rupture of all in-register bonds before zipping completes, respectively. We assume that failure in either the registry or zipping stages results in the molecule returning to the diffusion stage.

#### Molecules randomly sample registries during hybridization

The hybridization rate is computed from the rate that molecules progress through these three stages. The first step is to write the time required for *N* intermolecular collisions.
In Eq. 1 the first term represents failed collisions in which two strands are unable to bring bases into contact. *P*_{diff} (*R*) is the probability of forming a first bond in registry *R* after an inter-molecular collision. The second term accounts for molecules that form intermolecular bonds but do not form in-register bonds before dissociation. *P*_{reg}(*R*) is the probability of forming an *R* = 0 base pair after forming at least one base pair in registry *R*. The third term describes events in which in-register bonds form but fail to reach the fully zipped state. *P*_{zip} is the probability of fully zipping after forming at least one *R* = 0 base pair. The last term describes successful collisions where two strands are able to arrive at the fully zipped state.
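
As a sketch of the bookkeeping described here, one way to write the four terms is shown below. The assignment of elapsed times to each outcome (for example, *t*_{diff} + *t*_{reg}(*R*) for a failed registry search) and the placement of the sum over registries are assumptions made for illustration, chosen so that the low concentration limit yields one factor of *t*_{diff} per registry.

$$
\begin{aligned}
T_{\mathrm{tot}}(N) = N \sum_{R} \Big[ & \big(1 - P_{\mathrm{diff}}(R)\big)\, t_{\mathrm{diff}} \\
& + P_{\mathrm{diff}}(R)\big(1 - P_{\mathrm{reg}}(R)\big)\big(t_{\mathrm{diff}} + t_{\mathrm{reg}}(R)\big) \\
& + P_{\mathrm{diff}}(R)\,P_{\mathrm{reg}}(R)\big(1 - P_{\mathrm{zip}}\big)\big(t_{\mathrm{diff}} + t_{\mathrm{reg}}(R) + t_{\mathrm{zip}-}\big) \\
& + P_{\mathrm{diff}}(R)\,P_{\mathrm{reg}}(R)\,P_{\mathrm{zip}}\big(t_{\mathrm{diff}} + t_{\mathrm{reg}}(R) + t_{\mathrm{zip}+}\big) \Big]
\end{aligned}
$$

In this form the four probability weights sum to one for every registry, so each registry contributes exactly *t*_{diff} when the diffusion time dominates.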

Noting that *T*_{tot}(*N*) scales linearly with *N*, we can write *T*_{tot}(*N*) = *NT*_{tot}(1), where *T*_{tot}(1) is the average time per collision. After *N* collisions there will be Σ_{R}[*N* ·*P*_{diff} (*R*)·*P*_{reg}(*R*)·*P*_{zip}] successful events, so the hybridization rate is

$$
\text{rate} = \frac{\sum_{R} P_{\mathrm{diff}}(R)\,P_{\mathrm{reg}}(R)\,P_{\mathrm{zip}}}{T_{\mathrm{tot}}(1)} \tag{2}
$$

This expression can be simplified considerably in the low concentration limit where *t*_{diff} ≫ *t*_{reg}, *t*_{zip}, which describes the 50 pM concentrations used in the experiments of Zhang et al.^{44}. In this limit *t*_{diff} + *t*_{reg} + *t*_{zip} ≃ *t*_{diff}, which allows us to rewrite the hybridization rate as

$$
\text{rate} \simeq \frac{\sum_{R} P_{\mathrm{diff}}(R)\,P_{\mathrm{reg}}(R)\,P_{\mathrm{zip}}}{(2L-1)\,t_{\mathrm{diff}}} \tag{3}
$$

where (2*L* − 1) is the number of registries that need to be sampled.
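
The relationship between the full rate and its low concentration limit can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes that all four outcome terms are summed over registries (the reading that reproduces the (2*L* − 1) factor), and the function names and dictionary-based inputs are hypothetical.

```python
def time_per_collision(p_diff, p_reg, t_reg, p_zip, t_diff, t_zip_plus, t_zip_minus):
    """Average time per collision, T_tot(1), accumulated over registries R.

    p_diff, p_reg, t_reg are dicts keyed by registry R (assumed layout).
    """
    total = 0.0
    for R in p_diff:
        pd, pr = p_diff[R], p_reg[R]
        total += (1 - pd) * t_diff                                      # no base pair formed
        total += pd * (1 - pr) * (t_diff + t_reg[R])                    # registry search fails
        total += pd * pr * (1 - p_zip) * (t_diff + t_reg[R] + t_zip_minus)  # zipping fails
        total += pd * pr * p_zip * (t_diff + t_reg[R] + t_zip_plus)     # success
    return total


def hybridization_rate(p_diff, p_reg, t_reg, p_zip, t_diff, t_zip_plus, t_zip_minus):
    """Successful events per collision divided by the time per collision."""
    successes = sum(p_diff[R] * p_reg[R] * p_zip for R in p_diff)
    return successes / time_per_collision(
        p_diff, p_reg, t_reg, p_zip, t_diff, t_zip_plus, t_zip_minus
    )


def hybridization_rate_low_conc(p_diff, p_reg, p_zip, t_diff):
    """Low concentration limit: every registry costs one diffusion time."""
    n_registries = len(p_diff)  # 2L - 1 when all registries are listed
    successes = sum(p_diff[R] * p_reg[R] * p_zip for R in p_diff)
    return successes / (n_registries * t_diff)
```

When *t*_{diff} is many orders of magnitude larger than *t*_{reg} and *t*_{zip}, the two functions agree to within the ratio of those timescales; they separate only as the concentration rises and *t*_{diff} shrinks.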

#### The growth of the hybridized region is a random walk

To compute *t*_{reg}, *t*_{zip}, and *P*_{zip} we model the formation and breakage of base pairs as a first passage random walk^{33}. To do this, we use the number of base pairs formed as the reaction coordinate ^{47}. A system starting with *x* base pairs will evolve to *x* + 1 base pairs at a rate *k*_{+}(*x*) or to *x* − 1 base pairs at a rate *k*_{−}(*x*). Using these rates, we can write the probability of base pair formation as *p*_{+} = *k*_{+}(*x*)/(*k*_{+}(*x*)+*k*_{−}(*x*)) and the probability of base pair breakage as *p*_{−} = *k*_{−}(*x*)/(*k*_{+}(*x*)+ *k*_{−}(*x*)). The forward and backward rates are related by the detailed balance relationship
where Δ*G*(*x*) is the free energy change to form the *x*th base pair defined as Δ*G*(*x*) = *G*(*x*) − *G*(*x* − 1). We assume that the bond formation rate *k*_{+} ≃ 10^{9} s^{−1} is independent of sequence, while bond breakage is limited by the bond breakage energy, *k*_{−} = *k*_{+}*e*^{ΔG} (with Δ*G* expressed in units of *k*_{B}*T*), where Δ*G* is the free energy of base pair formation computed with the nearest-neighbor free energies of SantaLucia et al.^{5,17–22}.
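
The step probabilities follow directly from these two rates. The sketch below assumes Δ*G* is supplied in units of *k*_{B}*T* and uses a single sequence-independent formation rate; the function names are hypothetical.

```python
import math

K_PLUS = 1e9  # base pair formation rate in s^-1, as quoted in the text


def breakage_rate(dG_kT, k_plus=K_PLUS):
    """k_minus = k_plus * exp(dG), with dG in units of k_B T (assumed)."""
    return k_plus * math.exp(dG_kT)


def step_probabilities(dG_kT, k_plus=K_PLUS):
    """Return (p_plus, p_minus) for the next move of the random walk."""
    k_minus = breakage_rate(dG_kT, k_plus)
    total = k_plus + k_minus
    return k_plus / total, k_minus / total
```

A favorable base pair (Δ*G* < 0) gives *p*_{+} > 1/2, so the walk is biased toward helix growth, while Δ*G* > 0 biases it toward dissociation.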

## Results and Discussion

### Diffusion stage

#### The duration of the diffusion stage is determined by random collisions

We estimate the diffusion time using the Smoluchowski formula for an absorbing sphere
where *D*_{s} is the diffusion constant of the strands, *a* is the effective radius of the polymer coil, and *c* is the concentration of DNA. Inserting the Stokes-Einstein relation, *D*_{s} = *k*_{B}*T*/6*πηa*, for the diffusion constant shows that the size dependence cancels when the target and incoming species are identical
where *η* is the viscosity of the solvent.
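
The cancellation of the coil radius can be verified numerically. This sketch assumes the single absorbing sphere form of the collision rate, 4π*D*_{s}*ac*, so that *t*_{diff} = 3*η*/(2*k*_{B}*Tc*); conventions that use the relative diffusion constant and the sum of the two radii change the numerical prefactor but not the cancellation. The function names are hypothetical.

```python
import math

K_B = 1.380649e-23   # Boltzmann constant, J/K
N_A = 6.02214076e23  # Avogadro constant, 1/mol


def stokes_einstein(T, eta, a):
    """Diffusion constant of a sphere of radius a (m) in a solvent of viscosity eta (Pa s)."""
    return K_B * T / (6 * math.pi * eta * a)


def diffusion_time(conc_molar, T, eta, a):
    """Inverse collision rate for an absorbing sphere (assumed prefactor 4*pi)."""
    c = conc_molar * N_A * 1e3  # mol/L -> molecules per m^3
    D = stokes_einstein(T, eta, a)
    return 1.0 / (4 * math.pi * D * a * c)
```

Because *D*_{s} is proportional to 1/*a*, the product *D*_{s}*a* is independent of the coil size, and *t*_{diff} depends only on viscosity, temperature, and concentration.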

#### The diffusion stage ends with the collision between two exposed bases

The sticking probability, *P*_{diff} (*R*), serves two purposes in our model. First, it provides a weighting factor to account for the fact that random collisions between DNA molecules are biased in favor of small |*R*|. To see this we observe that, of the *L*^{2} possible intermolecular base-base contacts, only (1,*L*) is consistent with registry *R* = *L* − 1. In contrast, there are *L* pairings: (1,1), (2,2),…(*L, L*) consistent with *R* = 0. Therefore, for unstructured molecules we have
where *C*_{1} is a sequence-independent geometric factor that accounts for collisions in orientations incompatible with base pair formation (i.e., between phosphate backbones).
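
The counting argument can be confirmed by brute force. The sketch below enumerates all *L*^{2} base-base contacts and tallies them by registry; taking *R* = *j* − *i* for the contact (*i*, *j*) is an assumed sign convention, and `p_diff_unstructured` is a hypothetical helper that encodes the resulting weight *C*_{1}(*L* − |*R*|)/*L*^{2}.

```python
from collections import Counter


def contacts_per_registry(L):
    """Count base-base contacts (i, j) consistent with each registry R = j - i."""
    counts = Counter()
    for i in range(1, L + 1):
        for j in range(1, L + 1):
            counts[j - i] += 1
    return counts


def p_diff_unstructured(R, L, c1=1.0):
    """Sticking probability for an unstructured strand; c1 is the geometric factor."""
    return c1 * max(L - abs(R), 0) / L**2
```

For *L* = 36 this gives 36 contacts at *R* = 0 and a single contact at *R* = ±35, so collisions are biased toward small |*R*|.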

The second purpose of *P*_{diff} is to ensure that intermolecular base pairs only form between bases not previously engaged in intra-molecular base pairs. For stem-loop molecules this means the contact must be between free tails (we neglect binding in the loops). This limits registries that are possible because registries with |*R*| greater than the length of the tail cannot form base pairs. To handle this we need to apply Eq. 7 separately for each tail
where *ℓ*_{1} and *ℓ*_{2} are the lengths of the two tails, we have neglected binding between opposite tails, *ℓ* − |*R*| is the number of possible collisions for each registry, and Θ is the Heaviside function defined as Θ(*n*) = 1 for *n >* 0 and Θ(*n*) = 0 for *n* ≤ 0. When a molecule with two free tails necessitates the use of Eq. 8, we also separately calculate *P*_{reg} and *P*_{zip} for each tail. In this case, Eq. 3 takes the form
where the superscripts on *P*_{reg} and *P*_{zip} indicate the respective tail. The most important consequence of Eq. 8 is that molecules with extensive intra-molecular bonding have low values of *P*_{diff} (*R*) because the two strands can only form favorable interactions between the free tails.
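
A minimal sketch of the per-tail counting follows. It applies the (*ℓ* − |*R*|) contact count with the Heaviside cutoff to each free tail and sums the results; the normalization by *L*^{2} and the function names are assumptions for illustration.

```python
def heaviside(n):
    """Theta(n) = 1 for n > 0 and 0 otherwise, as defined in the text."""
    return 1 if n > 0 else 0


def p_diff_tails(R, L, tail_lengths, c1=1.0):
    """Sticking probability when only the free tails can form base pairs."""
    total = 0
    for ell in tail_lengths:
        n = ell - abs(R)
        total += n * heaviside(n)  # registries beyond the tail length contribute nothing
    return c1 * total / L**2
```

For a 36-nucleotide strand with two 7-nucleotide tails, registries with |*R*| ≥ 7 are excluded entirely, which is the origin of the low *P*_{diff} values for heavily structured molecules.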

### Registry stage

#### Non-native base pairs transiently hold two strands together

The registry stage is an encounter complex held together by non-native base pairs. The lifetime of the encounter complex is determined by the affinity of base pairs at the intermolecular contact. We compute the lifetime using Gillespie simulations^{48} in which the allowed moves are the formation and breakage of base pairs at either end of the intermolecular helix. These moves enter the simulations with rate constants given by Eq. 4 (see Appendix for details). The registry stage duration *t*_{reg} is computed as the average first passage time for a random walk starting at a single base pair to reach a state with zero base pairs.
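
A simplified Gillespie calculation of the first passage time is sketched below. It is not the authors' code: the helix is allowed to grow and shrink at one end only (the model described here uses both ends), the per-base-pair free energies are passed as a hypothetical list in units of *k*_{B}*T*, and the function name is an assumption.

```python
import math
import random


def registry_lifetime(dG_kT, k_plus=1e9, n_traj=20000, seed=0):
    """Mean first passage time from one base pair to zero base pairs.

    dG_kT[x - 1] is the free energy (units of k_B T) of the x-th base pair;
    the helix can grow to at most len(dG_kT) base pairs.
    """
    rng = random.Random(seed)
    x_max = len(dG_kT)
    total = 0.0
    for _ in range(n_traj):
        x, t = 1, 0.0
        while x > 0:
            k_grow = k_plus if x < x_max else 0.0
            k_shrink = k_plus * math.exp(dG_kT[x - 1])
            k_tot = k_grow + k_shrink
            t += rng.expovariate(k_tot)        # exponential waiting time
            if rng.random() * k_tot < k_grow:  # choose the move in proportion to its rate
                x += 1
            else:
                x -= 1
        total += t
    return total / n_traj
```

For an isolated base pair that cannot extend, the simulated lifetime converges to 1/*k*_{−}; longer complementary stretches lengthen the lifetime because the walk must climb back out of the helix before dissociating.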

Since the affinity of base pairs depends on both the sequence and alignment of molecules, there is a large variation in the registry stage lifetimes. Fig. 5 shows *t*_{reg} of an unstructured sequence as a function of the alignment *R*. Most alignments have very short binding lifetimes, less than 10 ns, due to the lack of contiguous complementary base pairs. However, misalignments of *R* = ±12 and *R* = ±17 allow for stretches of 7 and 6 WCF base pairs, respectively. As a result, these registries have binding lifetimes on the order of 1 *μ*s.
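
The contiguous stretches that stabilize particular registries can be located with a simple scan. This sketch assumes the partner strand is the exact complement, so that base *i* faces the complement of base *i* + *R* and a WCF pair forms exactly when the two bases of the same strand are identical. The index convention, the function name, and the example sequence are illustrative and are not taken from the data set.

```python
def longest_wcf_run(seq, R):
    """Longest contiguous stretch of WCF pairs when the partner strand is shifted by R."""
    shift = abs(R)
    best = run = 0
    for i in range(len(seq) - shift):
        if seq[i] == seq[i + shift]:  # complement of seq[i + shift] pairs with seq[i]
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


# Hypothetical 8-mer with an internal repeat of period 4
example = "ACGTACGT"
runs = {R: longest_wcf_run(example, R) for R in range(-7, 8)}
```

Under this convention the result is symmetric in ±*R*, in line with the paired registries *R* = ±12 and *R* = ±17 noted for the unstructured sequence.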

#### DNA strands search for in-register base pairs during the registry search

While the molecules are held together by mis-registered base pairs, the free tails are free to fluctuate and search for in-register contacts via “inchworm” moves. We expect that the tails will come into contact on a time scale comparable to the Zimm time, *τ*_{Zimm}, which describes the relaxation modes of a polymer in dilute solution^{49,50}. Each tail-to-tail contact provides an opportunity for the molecule to find in-register base pairs. The probability of success depends on both the amount of time the molecules are held together, which determines the number of attempts, and the length of the free tails, which determines the probability a given attempt is successful. The number of attempts during the registry stage is *t*_{reg}(*R*)*/τ*_{Zimm}. To estimate the probability that one of these attempts is successful, consider two unhybridized segments of length *ℓ* joined by a single base pair in registry *R*. This leaves *ℓ* −1 bases on each strand available to form new intermolecular contacts (we neglect fluctuations in the number of non-native base pairs), so the number of possible new contacts is (*ℓ* − 1)^{2}. However, only *ℓ* − 1 of these contacts are in-register. Furthermore |*R*| of the in-register contacts involve one of the strands folding back across the original base pair. After such a fold, any new base pairs will have parallel backbones (e.g., 5’ → 3’ with 5’ → 3’), which is incompatible with hybridization. Therefore, the probability that a random inter-strand contact is compatible with in-register hybridization is
where (*ℓ* − 1) − |*R*| is the number of possible in-register base pairs and (*ℓ* − 1)^{2} is the total possible base pairs the molecules can search.

If *t*_{reg}(*R*)*/τ*_{Zimm} is small, *P*_{reg}(*R*) can be approximated by *t*_{reg}(*R*)*P*_{native}*/τ*_{Zimm}. However, if *t*_{reg}(*R*) is large this expression can exceed unity because the small time approximation counts multiple successful attempts in the allotted time. The desired quantity, *P*_{reg}(*R*), is the probability that at least one tail-to-tail contact finds the in-register state during *t*_{reg}. This is equivalent to *P*_{reg}(*R*) = 1 − *P*_{fail}(*R*), where *P*_{fail}(*R*) is the probability that all attempts fail. The probability that a single attempt fails is 1 − *P*_{native}(*R*), so
where *ℓ* = *L* for unstructured molecules and the length of the relevant tail, *ℓ*_{1} or *ℓ*_{2}, for stem-loop sequences.
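
The registry search probability can be sketched as follows, using the contact counts and the "at least one success" argument given in the text. Clamping the in-register count at zero for |*R*| ≥ *ℓ* − 1 and treating the number of attempts as a continuous exponent are assumptions, and the function names are hypothetical.

```python
def p_native(R, ell):
    """Probability that a random tail-to-tail contact is in-register (requires ell > 1)."""
    n_in_register = (ell - 1) - abs(R)
    return max(n_in_register, 0) / (ell - 1) ** 2


def p_reg(R, ell, t_reg, tau_zimm):
    """Probability that at least one attempt succeeds during the registry lifetime."""
    if R == 0:
        return 1.0  # the first contact is already in-register
    attempts = t_reg / tau_zimm
    return 1.0 - (1.0 - p_native(R, ell)) ** attempts
```

For short lifetimes the result approaches the linear estimate *t*_{reg}*P*_{native}/*τ*_{Zimm}, while for long lifetimes it saturates below one instead of exceeding unity.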

The Zimm time is given by *τ*_{Zimm} ∝ *ηR*^{3}*/kT* ^{49}, which can be rewritten as
where *C*_{2} is a constant and *R* = *bn*^{ν} is the Flory radius with *b* ≈ 0.3 nm and *ν* = 0.588.
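
Substituting the Flory radius into the proportionality gives one explicit form, written here as a sketch. It assumes that *n* counts the nucleotides in the fluctuating tail and that all numerical prefactors are absorbed into *C*_{2}:

$$
\tau_{\mathrm{Zimm}} = C_{2}\,\frac{\eta\, b^{3}\, n^{3\nu}}{k_{B}T}, \qquad 3\nu \approx 1.76
$$

Since the exponent is smaller than 2, the search time grows more slowly with tail length than it would for a simple diffusive scan.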

The rare cases where the initial intermolecular contact forms an in-register base pair can be described by *P*_{reg}(0) = 1 and *t*_{reg} = 0, which means the molecules proceed directly from the diffusion stage to the zipping stage, skipping the registry stage.

Fig. 7 compares the computed *P*_{reg} and *t*_{reg}, averaged over registries, for all sequences in the data set. We note two trends. First, longer binding lifetimes correlate with a higher probability of finding the in-register state. This is due to the fact that long-lived encounter complexes have more opportunities for the tails to collide and explore potential registries. Second, when the binding lifetimes are equivalent, the probability of finding in-register bonds is greater when part of the molecule is self-hybridized (orange squares and blue circles). This is because the intramolecular helix restricts the number of mis-registered base pairs that need to be sampled. In fact, unstructured regions shorter than 11 bases (orange squares) have > 70% chance of finding the in-register state with binding times of just a few nanoseconds. However, unstructured sequences (red triangles), which must search 36^{2} contacts, have less than 50% chance even when the binding lifetimes approach a microsecond.

### Zipping stage

#### The first in-register bonds provide an anchor to hold the strands together while hybridization completes

After forming the first in-register bonds, the DNA strands enter the zipping stage. The dynamics of the zipping stage are similar to the registry stage in that both phases involve the formation and breakage of base pairs at the edge of a growing helix. The success of this stage is determined by whether the native base pairs are able to form across the full length of the DNA before the molecules detach. There are two obstacles to a successful outcome. The first is that the initial in-register base pairs provide very little stability, so the nascent helix is prone to dissociate before the “toehold” grows large enough to form multiple base stacking interactions. Second, if the molecules have intramolecular base pairing, the stems will need to be broken before native base pairs can form in these regions. These obstacles depend only minimally on the specific sequence of the molecule, which allows us to obtain an analytic form for *P*_{zip} that captures the most important characteristics of the zipping stage.

To solve for *P*_{zip} we introduce the quantity *P*_{zip}(*x*), which is the probability that a pair of molecules with *x* base pairs forms the last (*L*th) base pair before falling apart. The quantity *P*_{zip} in Eq. 1, which describes the probability of full zipping after forming the first base pair, is given by *P*_{zip}(1). *P*_{zip}(*x*) satisfies the recursion relation
which says that a random walk starting with *x* base pairs will proceed to *x*+1 with probability *p*_{+} and to *x* − 1 with probability *p*_{−} ^{51}. In the continuum limit Eq. 13 takes the form of a one-dimensional convection-diffusion equation
where the drift velocity and diffusion coefficient are given by
If *x* describes a region of the molecule without intramolecular bonding, then the new base pairs are favorable (Δ*G*(*x*) < 0) so Eqs. 4 and 15 give *v >* 0. As a result, there is a strong bias for the helix to grow. However, in regions with intramolecular helices, two non-native bonds must be broken for each intermolecular bond that is formed. Since the two intramolecular bonds that are broken have the same sequence as the single intermolecular bond, they incur an energetic cost −2Δ*G*(*x*). Therefore, the net change is Δ*G*(*x*) − 2Δ*G*(*x*) = −Δ*G*(*x*) *>* 0 indicating that the formation of intermolecular bonds is unfavorable in the stem region. As a result, zipping within regions of intramolecular structure has a bias in the negative direction (*v* < 0). Importantly, sequence-dependent energetics of hybridization in stem regions are equal in magnitude and opposite in sign to that of unstructured regions. These energetic regimes are accounted for by taking *P*_{zip}(*x*) to be a piecewise continuous function with the boundary conditions shown in Fig. 8.
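
For a region with uniform energetics the recursion can be solved exactly and compared with its continuum limit. The sketch below is illustrative only: it assumes a single averaged Δ*G* in units of *k*_{B}*T*, absorbing boundaries at *x* = 0 (failure) and *x* = *L* (success), and the continuum coefficients *v* = *p*_{+} − *p*_{−} and *D* = 1/2 obtained from a Taylor expansion with unit step size. The paper's own definitions of *v* and *D* may differ by a common factor, which leaves the ratio *v*/*D* unchanged.

```python
import math


def p_zip_discrete(x, L, dG_kT):
    """Exact solution of P(x) = p_plus * P(x + 1) + p_minus * P(x - 1), P(0) = 0, P(L) = 1."""
    r = math.exp(dG_kT)  # ratio p_minus / p_plus = k_minus / k_plus
    if r == 1.0:
        return x / L
    return (1 - r**x) / (1 - r**L)


def p_zip_continuum(x, L, dG_kT):
    """Solution of the drift-diffusion form with the same boundary conditions."""
    p_plus = 1.0 / (1.0 + math.exp(dG_kT))
    v = 2.0 * p_plus - 1.0  # p_plus - p_minus
    D = 0.5
    if v == 0.0:
        return x / L
    return (1 - math.exp(-v * x / D)) / (1 - math.exp(-v * L / D))
```

With Δ*G* < 0 both forms rise steeply over the first few base pairs, reflecting that most failures occur before a stable toehold is established; with Δ*G* > 0, as in a stem region, the probability of advancing is strongly suppressed.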

Figure 8c shows the solution to Eq. 14 for an unstructured sequence and both tails of a stem-loop sequence. The local diffusion coefficient and drift velocity are obtained by averaging the nearest-neighbor binding energies within each of the three regions (or entire sequence for the unstructured molecule). For the unstructured sequence, the zipping probability grows rapidly with the number of formed base pairs, *x*, reflecting the fact that failed zipping events are dominated by cases where the molecules have yet to establish a stable toehold. In contrast, the stem-loop structure shows an intermediate plateau over the region encompassing the anchoring tail and the intramolecular stems. This feature is a result of a local energy minimum where the intermolecular helix meets the bases of the intramolecular stems (Fig. 8c inset). Deviations away from this minimum require either the rupture of intermolecular base pairs or intramolecular stems, both of which are unfavorable. Therefore, the local drift velocities will tend to return random walks with 0 < *x* < *x*_{1} to the minimum at *x* = *x*_{0}. This tendency to return to the base of the stems explains why the zipping probability is flat between *x* = 1 and *x*_{1}. However, the probability jumps dramatically when the stems fully dissolve because there is no further impediment to zipping.

#### Successful zipping is a competition between the stability of the anchoring base pairs and the structure that must be broken

The most important factor influencing *P*_{zip} is the stability of the intramolecular structure that must be broken compared to the stability of the intermolecular helix that holds the molecules together^{52,53}. Fig. 9 plots *P*_{zip} versus the ratio of the length of the unstructured tail to the length of the intramolecular stems. The figure shows that *P*_{zip} transitions from large values (∼70%) when the intramolecular stems are shorter than the toehold intermolecular bonds, to small values (< 10%) when the stems are longer than the toehold. This observation suggests that *P*_{zip} is a competition between the lifetime of the intermolecular toehold and the lifetime of intramolecular stems. This interpretation is supported by the interesting case of sequence S14, which has two free tails of length *ℓ*_{1} = *ℓ*_{2} = 7 nucleotides. Furthermore, each tail has five A/T nucleotides and two G/C nucleotides. However, *P*_{zip} for tail 1 is greater than *P*_{zip} for tail 2 because the nearest-neighbor interactions in tail 1 lead to a more favorable binding energy (Table 1, Fig. 8c). Therefore, tail 1 has a longer binding lifetime, giving it a better chance for hybridization to progress through the intramolecular stems.

#### In-register molecules zip rapidly unless there is intramolecular structure

The zipping time can be subdivided into events for which the zipping is either successful (*t*_{zip+}) or not (*t*_{zip−}). These times were computed via Gillespie simulation, as described above for the registry time, and plotted in Fig. 10. The failure times are generally shorter than the successful events because many failure events occur shortly after initial binding. Fig. 10 also shows that longer intramolecular stems slow the zipping time because these systems require an energy excitation large enough to rupture the stems.

### Diffusion is slower than the registry search or zipping

Figure 11 compares *t*_{diff}, *t*_{reg}, and *t*_{zip} for two molecules at 55 °C with opposite characteristics. Sequence 12 has the longest zipping time in our calculations due to a very stable stem-loop structure. However, the calculated *t*_{zip} ≃ 10^{−2} s is still about 3 orders of magnitude faster than the time between diffusive collisions at the experimental concentration of 50 pM. This plot predicts that the hybridization rate will increase with concentration up to approximately 10^{−8} M, at which point the zipping time will become rate limiting. Sequence 73 shows the opposite extreme. Since S73 lacks intramolecular structure, zipping is unimpeded and completes in less than 10^{−7} s. In addition, S73 has a very long registry stage due to mis-registered states with lifetimes on the order of 10^{−7} s. In this case the hybridization rate will not plateau until concentrations on the order of 10^{−2} M, when mis-registered traps become limiting. To capture the plateau of hybridization rates at high concentrations, it is necessary to use Eq. 2 and not the approximate form Eq. 3.
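
The crossover concentrations quoted here follow from the inverse scaling of *t*_{diff} with concentration. The sketch below is an order-of-magnitude illustration only: the reference value *t*_{diff} ≈ 10 s at 50 pM is inferred from the statement that collisions are about three orders of magnitude slower than a 10^{−2} s zipping time and is not read from Fig. 11, and the function name is hypothetical.

```python
def crossover_concentration(c_ref, t_diff_ref, t_internal):
    """Concentration at which t_diff falls to t_internal, using t_diff proportional to 1/c."""
    return c_ref * t_diff_ref / t_internal


C_REF = 50e-12     # M, experimental concentration
T_DIFF_REF = 10.0  # s, assumed from "about 3 orders of magnitude" above 1e-2 s

c_slow_zipper = crossover_concentration(C_REF, T_DIFF_REF, 1e-2)  # stem-loop case
c_fast_zipper = crossover_concentration(C_REF, T_DIFF_REF, 1e-7)  # unstructured case
```

These evaluate to 5 × 10^{−8} M and 5 × 10^{−3} M, the same orders of magnitude as the plateau concentrations stated for the two sequences.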

### The wide range of hybridization rates comes from both native and non-native base pairing

The hybridization rate computed with our model is compared to the experiments of Zhang et al.^{44} in Fig. 12. There is a strong correlation between the computed and experimental values over nearly three orders of magnitude. Importantly, our model captures temperature dependent effects as well as the presence or absence of intramolecular structure with the same set of parameters.

Based on these results, we can understand the effects of both native and non-native base pairing on the hybridization rate. Typically in biomolecule self-assembly, native bonds have negligible contribution to the assembly kinetics because most of the time is spent searching the non-native ensemble^{47,54–58}. However, here we find that native bonds play an important role in holding molecules together long enough for the hybridization to progress through non-native stem-loop structures. Another interesting finding is that non-native intermolecular bonds are beneficial for hybridization because they stabilize encounter complexes that facilitate the search for native contacts. This differs from the alignment search during amyloid protein aggregation where non-specific interactions seem to uniformly inhibit assembly kinetics^{45,47,59}. We speculate that the difference is due to the strong specificity of the WCF base pairing, which limits non-native bonding to small regions with complementary sequences. This limitation on non-specific bonding frees the rest of the molecule to search for native contacts. In contrast, the backbone H-bonds stabilizing amyloid structures are more promiscuous, which allows the non-native bonding to propagate along the molecule and limits the length of free tails that can perform “inchworm” moves. Finally, we find that non-native intramolecular base pairs have both a beneficial effect in restricting the registries to be searched, and an inhibitory effect in creating a barrier to complete hybridization. The inhibitory effect is only significant, however, when the intramolecular stem is longer than the neighboring regions that hold the molecules together. Therefore, short intramolecular stems are likely to be mostly beneficial. The varied effects of these three types of bonding explain why hybridization rates correlate poorly with metrics like native stability, intramolecular structure, or the lifetime of mis-registered states (Fig. 3).

There are several approximations in our model that may contribute to the scatter in Fig. 12. These include our neglect of fluctuations in the size of stem structures and of multiple inchworm moves in a single hybridization attempt. Another approximation that is likely to be significant is our neglect of intermolecular interactions between the loops formed by intramolecular stems. Clearly, the potential for WCF base pairing by these loops will strongly depend on the size of the loop, and further investigation will be necessary to identify how long the loop needs to be before it contributes significantly to hybridization. In addition, there are other factors that come into play near surfaces^{60–64} or *in vivo*^{65}.

It is interesting that our model performs as well as it does without accounting for the sliding of two unhybridized molecules past each other^{2,3,10,31,34–39}. Intuitively, one might expect that such “slithering” moves would be most efficient to resolve small registry errors, whereas inchworm moves would dominate when larger registry shifts are necessary (the exponent in Eq. 12 suggests super-diffusive scaling for long distances). However, Fig. 7 shows that our inchworm calculation predicts a success rate nearing unity for displacements less than 10 bases. Therefore, the distinction between these two kinds of moves is not significant in the regime where slithering is expected to dominate.

## Conclusion

While our model neglects many of the complexities of DNA hybridization, it provides valuable insight into key processes that have been previously reported. For instance, the slow nucleation events noted by Wetmur and coworkers^{32} are explained by the factor *P*_{diff} *P*_{reg}*P*_{zip} appearing in the numerator of Eq. 3. This factor shows that, before molecules can enter the zipping phase, they must collide in an alignment compatible with an encounter complex and then find the native registry. The factor *P*_{zip} will also play an important role in larger DNA molecules, which will contain many regions of intramolecular hybridization. However, once the zipping process proceeds past the first stem-loop structure, it is likely that the hybridized region will be large enough that *P*_{zip} ≃ 1. Therefore, nucleation depends on the combination of three factors: 1) a collision between unstructured regions with enough complementarity to stabilize an encounter complex, 2) the base pairs stabilizing the encounter complex must be close enough to the in-register alignment that native contacts can form within the Zimm time, and 3) the first in-register contacts must provide a strong enough anchor to stabilize the complex long enough for hybridization to work through nearby regions of non-native structure. The requirement of a strong anchor point also provides insight into the “rule of 7” observation that seven continuous base pairs provide a substantial boost to hybridization kinetics^{41}.

In conclusion, we have presented a model that reduces the complex dynamics of hybridization to three basic processes. We hope that an understanding of how sequence and non-native interactions affect these events provides intuition for the design of DNA sequences for diagnostics, biotechnology, and DNA-based nanostructures.

## Acknowledgement

This work was supported by National Institutes of Health grants R01GM107487 and R01GM141235.

## Appendix

### Zipping probability calculation

Upon the formation of the first in-register base pair, the intermolecular helix will rapidly grow to encompass the adjacent unstructured region. This means that zipping is insensitive to the precise location of the first contact, which allows us to approximate it as a zipper starting from the left end and proceeding to the right.

### Zipping probability of unstructured sequences

The probability of successfully zipping after forming *x* base pairs satisfies the convection-diffusion equation (Eq. 14). The solution to this equation is
where *A* and *B* are constants. The formation of intermolecular base pairs in unstructured sequences is always favorable due to the absence of existing base pairs that need to be broken. Therefore, Δ*G* < 0 and *v* is always positive. The boundary conditions (Fig. 8) indicate failed zipping at *x* = 0 and success at *x* = *L*
Using the boundary conditions to solve for *A* and *B* we find
The quantity needed in Eqs. 2 and 3 is the zipping probability after the formation of the first base pair
where *v* and *D* are determined from Eqs. 4, 15, and 16. The bond affinity, Δ*G*, appearing in these expressions is determined by averaging the nearest-neighbor energies at 0.5 M salt over the entire sequence.
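
As a worked sketch of the steps described in this subsection, the calculation can be written out explicitly. It assumes that Eq. 14 takes the standard steady-state form of a convection-diffusion equation for a splitting probability with constant *v* and *D*, and that *x* is measured in base pairs. The symbol *P*(*x*) for the zipping probability after *x* base pairs is a naming choice made here for illustration.

$$
D\,\frac{d^{2}P}{dx^{2}} + v\,\frac{dP}{dx} = 0
\quad\Longrightarrow\quad
P(x) = A + B\,e^{-vx/D}
$$

$$
P(0) = 0,\qquad P(L) = 1
\quad\Longrightarrow\quad
A = -B = \frac{1}{1 - e^{-vL/D}},\qquad
P(x) = \frac{1 - e^{-vx/D}}{1 - e^{-vL/D}}
$$

$$
P_{\mathrm{zip}} = P(1) = \frac{1 - e^{-v/D}}{1 - e^{-vL/D}}
$$

Under these assumptions, *P*_{zip} approaches 1 − e^{−*v*/*D*} for long strands with *v* > 0 and reduces to 1/*L* in the driftless limit *v* → 0.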

### Zipping probability of stem-loop sequences

To calculate *P*_{zip} for stem-loop sequences we divide the molecule into three regions (Fig. 8). The first region, 0 < *x* < *x*_{0}, is the unstructured tail which forms the first intermolecular base pairs. The second region, *x*_{0} < *x* < *x*_{1}, encompasses the stem-forming nucleotides adjacent to region 1. The third region, *x*_{1} < *x* < *L*, includes the loop, the stem-forming nucleotides not adjacent to region 1, and the unpaired tail (if present). Zipping in regions 1 and 3 is unimpeded by intramolecular structure, so Δ*G* is negative and *v* is positive. Zipping in region 2 requires disruption of the existing intramolecular stems, so Δ*G* is positive and *v* is negative. The solutions to Eq. 14 in each of these regions are
To find *A*_{1}, *B*_{1}, *A*_{2}, *B*_{2}, *A*_{3} and *B*_{3} we use the boundary conditions shown in Fig. 8.
The final result is not particularly illuminating and is too unwieldy to show here, but it is plotted in Fig. 8c. The drift velocities *v*_{i} and diffusion coefficients *D*_{i} are determined for each region by averaging the nearest-neighbor stabilities at 0.5 M salt within that region.
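
The qualitative effect of the stem region can be illustrated with a discrete analog of the three-region calculation. The sketch below is not the continuum solution described above. It treats zipping as a one-dimensional walk over the number of base pairs and uses the exact splitting probability for such a walk. It assumes that, at each position, the ratio of the unzipping rate to the zipping rate is exp(Δ*G*/*k*_{B}*T*). The function names and the free energies of −1 and +1 (in units of *k*_{B}*T*) are hypothetical and chosen only for illustration.

```python
import math


def zipping_probability(dg_per_step):
    """Probability of reaching the fully zipped state before losing all base
    pairs, starting from one base pair, for a discrete one-dimensional walk.

    dg_per_step[i] is the free energy (in kT) of adding a pair when i + 1
    pairs exist. Assumption: the ratio of the unzipping rate to the zipping
    rate at that position is exp(dg_per_step[i]).
    """
    total = 1.0   # running value of 1 + sum of cumulative rate-ratio products
    weight = 1.0  # cumulative product of backward/forward rate ratios
    for dg in dg_per_step:
        weight *= math.exp(dg)
        total += weight
    return 1.0 / total


def stem_loop_profile(x0, x1, length, dg_free, dg_stem):
    """Per-position free energies (kT) for a tail, a stem, and the remainder.

    Positions x0 <= i < x1 require breaking intramolecular pairs (dg_stem);
    all other positions zip freely (dg_free). All values are illustrative.
    """
    return [dg_stem if x0 <= i < x1 else dg_free for i in range(1, length)]


# Illustrative comparison: a 10-pair duplex with and without a 3-pair stem.
unstructured = zipping_probability([-1.0] * 9)
stem_loop = zipping_probability(stem_loop_profile(3, 6, 10, -1.0, 1.0))
print(f"unstructured: {unstructured:.3f}, stem-loop: {stem_loop:.3f}")
```

In this toy setting the uphill stem region lowers the zipping probability relative to the unstructured case, which is the trend the three-region calculation is meant to capture.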

### Gillespie simulation

The binding lifetimes *t*_{reg}(*R*) depend sensitively on the sequence of bases at the point of first intermolecular contact. Therefore, it is necessary to account for the helix extending in both directions from the point of contact. The bidirectional extension of a helix means that a given base pair has a different Δ*G* of formation depending on whether it is on the left or right end of the growing helix, because it has a different nearest neighbor in each case. Due to this complication, we do not attempt to solve the first-passage-time analog of Eq. 13^{45,51} and instead compute the binding lifetimes by Gillespie simulation^{45,48,66}.

At each step of the Gillespie algorithm, the allowed moves are the formation and breakage of base pairs at either end of the helix, as shown in Fig. 13. The rate constants for each move are determined from Eq. 4 with Δ*G* computed from the nearest-neighbor model^{5,17–22}.
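
A minimal sketch of such a simulation is given below. It is an illustration under stated assumptions rather than the implementation used for the results in this paper. Base pairs form at a constant rate `k0`, and an end pair breaks at a rate `k0 * exp(dG)`, where `dG` (in units of *k*_{B}*T*, negative when stabilizing) is the stacking free energy that the end pair shares with its inner neighbor. The function name, the `dg_stack` input, and the `dg_last` parameter for the final isolated pair are all hypothetical. In practice the stacking values would come from the nearest-neighbor model.

```python
import math
import random


def binding_lifetime(dg_stack, start, k0=1.0, dg_last=0.0, rng=None,
                     max_steps=1_000_000):
    """One Gillespie sample of the time until a two-ended helix dissociates.

    dg_stack[i]: free energy (kT, negative = stabilizing) of the stack between
                 pairable positions i and i + 1 (hypothetical input).
    start:       index of the first intermolecular base pair.
    dg_last:     free energy (kT) holding the final isolated pair (assumed).
    """
    rng = rng or random.Random()
    n = len(dg_stack) + 1        # number of pairable positions
    left = right = start         # helix spans positions left..right inclusive
    t = 0.0
    for _ in range(max_steps):
        moves = []               # entries are (rate, new_left, new_right)
        if left > 0:
            moves.append((k0, left - 1, right))       # form a pair on the left
        if right < n - 1:
            moves.append((k0, left, right + 1))       # form a pair on the right
        if left < right:
            # Breaking an end pair costs the stack shared with its inner
            # neighbor, so the two ends generally have different rates.
            moves.append((k0 * math.exp(dg_stack[left]), left + 1, right))
            moves.append((k0 * math.exp(dg_stack[right - 1]), left, right - 1))
        else:
            moves.append((k0 * math.exp(dg_last), None, None))  # last pair breaks
        total = sum(rate for rate, _, _ in moves)
        t += rng.expovariate(total)    # exponentially distributed waiting time
        pick = rng.random() * total    # select a move with probability rate/total
        for rate, new_left, new_right in moves:
            pick -= rate
            if pick < 0:
                break
        if new_left is None:
            return t                   # the strands have dissociated
        left, right = new_left, new_right
    raise RuntimeError("no dissociation within max_steps")


# Illustrative use: average lifetime of a three-position contact.
rng = random.Random(0)
samples = [binding_lifetime([-1.0, -1.0], start=1, rng=rng) for _ in range(1000)]
print(sum(samples) / len(samples))
```

Averaging many samples gives an estimate of the mean lifetime for one alignment. Keeping a separate stacking term for each end is what lets the sketch reflect the left/right asymmetry described above.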

### List of single DNA strands

The following tables show the sequences from Zhang et al.^{44} used in the development of our model. These sequences are filtered so that they contain only unstructured molecules or simple stem-loop structures in the monomer state (as predicted by NUPACK^{23}). Furthermore, we only consider sequences where both strands have identical predicted structures in the monomer state.

## References

- (1).
- (2).
- (3).
- (4).
- (5).
- (6).
- (7).
- (8).
- (9).
- (10).
- (11).
- (12).
- (13).
- (14).
- (15).
- (16).
- (17).
- (18).
- (19).
- (20).
- (21).
- (22).
- (23).
- (24).
- (25).
- (26).
- (27).
- (28).
- (29).
- (30).
- (31).
- (32).
- (33).
- (34).
- (35).
- (36).
- (37).
- (38).
- (39).
- (40).
- (41).
- (42).
- (43).
- (44).
- (45).
- (46).
- (47).
- (48).
- (49).
- (50).
- (51).
- (52).
- (53).
- (54).
- (55).
- (56).
- (57).
- (58).
- (59).
- (60).
- (61).
- (62).
- (63).
- (64).
- (65).
- (66).