The sequences near Chi sites allow the RecBCD pathway to avoid genomic rearrangements

Bacterial recombinational repair is initiated by RecBCD, which creates a 3′ single-stranded DNA (ssDNA) tail on each side of a double strand break (DSB). Each tail terminates in a Chi site sequence that is usually distant from the break. Once an ssDNA-RecA filament forms on a tail, the tail searches for homologous double-stranded DNA (dsDNA) to use as a template for DSB repair. Here we show that the nucleoprotein filaments rarely trigger sufficient synthesis to form an irreversible repair unless a long strand exchange product forms at the 3′ end of the filament. Our experimental data and modeling suggest that terminating both filaments with Chi sites allows recombinational repair to strongly suppress fatal genomic rearrangements resulting from mistakenly joining different copies of a repeated sequence after a DSB has occurred within a repeat. Taken together, our evidence highlights cellular fail-safe mechanisms that bacteria use to avoid potentially lethal situations.

When a DSB occurs in bacteria, it can be repaired using RecA-mediated homologous recombination following the well-known RecBCD pathway (Figure 1a) (Symington, 2014, Mawer and Leach, 2014, Azeroglu et al., 2016, Kowalczykowski, 2015, Smith, 2012, Smith, 1991). RecBCD degrades or resects each end of the broken double-stranded DNA (dsDNA) until it recognizes a Chi site. Chi sites are ~8 bp DNA sequences that alter the function of RecBCD to create two 3′ ssDNA tails that terminate in Chi sites (Symington, 2014, Mawer and Leach, 2014, Azeroglu et al., 2016, Kowalczykowski, 2015, Smith, 2012, Smith, 1991) (Figure 1ai, ii). RecA then binds to the ssDNA tails, creating two ssDNA-RecA filaments with Chi sites at their 3′ ends. Those ssDNA-RecA filaments then search for homologous regions in the dsDNA.

To determine whether a region of dsDNA is homologous to the initiating strand, ssDNA-RecA [...] Qi et al., 2015), and in vivo results suggest that DNA repair is extraordinarily rare unless L prod > 20 bp (Lovett et al., 2002, Watt et al., 1985, Shen and Huang, 1986).

In the presence of ATP hydrolysis, heteroduplex stability in vitro increases only slightly as L prod extends from 20 to 75 bp (Danilowicz et al., 2017). If L prod > 80 bp the nucleoprotein filament [...]

The rarity of long repeats in random sequences of the same length as an E. coli genome is also illustrated by the dark gray bar clearly seen in the inset of Figure 2c, in which the histogram shows the [...] irreversible recombination products were to form between 20-30 bp repeats anywhere in the genome.

Though the method that we used to find long repeated sequences only finds exact repeats, long repeated regions containing some mismatches appear in the graph as several shorter exact repeats. We find that those exactly repeated shorter regions are almost never separated by more than one single base.
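
The remark that a long repeat containing a mismatch shows up as several shorter exact repeats can be illustrated with a small seed-and-extend search. The following Python sketch is not the Matlab code used for the analysis; the function name `exact_repeats`, the toy sequence, and the short seed length are hypothetical choices made for illustration, and only direct repeats on one strand are considered.

```python
from collections import defaultdict


def exact_repeats(bases, seed=20):
    """Return sorted (start_a, start_b, length) for maximal exact direct repeats.

    Illustrative sketch: every seed-length substring is indexed, and each pair
    of identical seeds is extended in both directions until the first mismatch.
    """
    index = defaultdict(list)
    for i in range(len(bases) - seed + 1):
        index[bases[i:i + seed]].append(i)

    found = set()
    for positions in index.values():
        # Pairwise comparison is quadratic for highly repeated seeds.
        for n, a in enumerate(positions):
            for b in positions[n + 1:]:
                lo_a, lo_b = a, b  # a < b always holds here
                while lo_a > 0 and bases[lo_a - 1] == bases[lo_b - 1]:
                    lo_a -= 1
                    lo_b -= 1
                hi_a, hi_b = a + seed, b + seed
                while hi_b < len(bases) and bases[hi_a] == bases[hi_b]:
                    hi_a += 1
                    hi_b += 1
                found.add((lo_a, lo_b, hi_a - lo_a))
    return sorted(found)


if __name__ == "__main__":
    # Toy example: a 16-base unit, a spacer, and a copy with one substitution.
    toy = "ACGTTGCAAGGCTTAC" + "TTTTTT" + "ACGTTGCATGGCTTAC"
    for start_a, start_b, length in exact_repeats(toy, seed=6):
        print(start_a, start_b, length)
```

In the toy sequence the single substitution splits the 16-base repeat into two exact repeats of 8 and 7 bases separated by one base, which is the pattern described above.
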
In vivo results indicate that the probability of recombining DNA increases exponentially as the homologous region in the recombining DNA strand extends from N = 20 to N = 75, with recombination at N = 75 more than 100-fold more probable than at N = 50 (Lovett et al., 2002, Watt et al., 1985). Remarkably, recombination increases only slightly as N increases from 75 to ~300 bp. It has been speculated that in vivo several parallel sequence-matched interactions with L prod < 75 bp separated by ~200 bp may enhance discrimination against N repeat < ~200-300 bp. Studies in E. coli suggest that RecA-dependent genomic rearrangements between directly repeated sequences in plasmids are improbable unless the repeat length is at least ~300 bp, though RecA-independent rearrangements between shorter repeats do occur (Bi and Liu, 1994).
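
As a rough worked example of the exponential dependence on N, suppose purely for illustration that a single exponential, P(N) proportional to e^{kN}, holds between N = 50 and N = 75. The rate constant k is a hypothetical symbol introduced here, not a fitted value from the cited studies. The reported ratio of more than 100 then bounds k from below:

```latex
\frac{P(75)}{P(50)} = e^{k\,(75-50)} > 100
\quad\Longrightarrow\quad
k > \frac{\ln 100}{25\ \mathrm{bp}} \approx 0.18\ \mathrm{bp}^{-1},
\qquad e^{k} \gtrsim 1.2
```

Under that assumption, each additional matched base pair in this range would raise the recombination probability by at least about 20%.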

In vivo results mix the discrimination provided by RecA alone with the discrimination provided by other factors, and we note that not all in vivo recombination follows the RecBCD pathway. In the following, we will demonstrate how the RecBCD pathway reduces the probability that a DSB creates one searching filament that includes a region of a repeat with 75 < N < 300 bp at its 3′ end and eliminates the possibility that the 3′ ends of both filaments will include more than 20 bases that originate from the same repeat.

Without considering the detailed statistical distribution of Chi sites with respect to repeats, some advantages of the RecBCD pathway can be appreciated by considering a case in which a DSB occurs in the middle of a long repeated sequence. In the hypothetical DSB repair mechanism illustrated in Figure 1b, a DSB occurring within a repeated sequence will create two searching filaments whose 3′ ends terminate in regions of the repeated sequence that flanked the DSB. Genomic rearrangement will result if the two searching filaments pair with both sides of a different copy of the repeated sequence flanking the break.

In contrast, Figure 2 indicates that in the RecBCD pathway, which is illustrated in Figure 1a, the repeated sequence that flanked the DSB is likely to lie within the L Chi bases removed by RecBCD.

In particular, Figure 2d indicates that the space between adjacent Chi sites on opposite strands is typically > 10 kb. Furthermore, Figure 2e indicates that since 30% Chi site recognition is observed in vivo (Cockram et al., 2015, Taylor and Smith, 1992), an L Chi distribution that peaks at ~50 kb would be created.
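
To make the role of the 30% recognition probability concrete, the sketch below assumes (as a simplification that may differ from the calculation actually used) that each Chi site RecBCD encounters is recognized independently with probability p. Under that assumption the number X of Chi sites encountered, up to and including the recognized one, follows a geometric distribution with E(X) = 1/p. The function names are hypothetical.

```python
def recognition_pmf(k, p):
    """P(X = k): the k-th Chi site encountered is the first one recognized."""
    if k < 1 or not 0 < p <= 1:
        raise ValueError("need k >= 1 and 0 < p <= 1")
    return (1 - p) ** (k - 1) * p


def expected_chi_sites(p):
    """E(X) = 1/p for a geometric distribution with success probability p."""
    if not 0 < p <= 1:
        raise ValueError("need 0 < p <= 1")
    return 1.0 / p


if __name__ == "__main__":
    p = 0.30  # recognition probability quoted in the text
    print(f"E(X) = {expected_chi_sites(p):.2f} Chi sites")
    for k in range(1, 6):
        print(k, round(recognition_pmf(k, p), 4))
```

With p = 0.3 this gives E(X) of about 3.3, so under the independence assumption RecBCD would on average pass roughly two Chi sites before recognizing one.
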
Importantly, Figure 2f shows a histogram of the repeats averaged over four E. coli genomes.

The maximum x-axis value in Figure [...]

[...] IV may stabilize D-loops prior to re-establishment of DNA polymerase III-dependent replication (Lovett, 2006), and that even in eukaryotic cells, translesion polymerases may aid DSB repair by stabilizing strand invasion intermediates (Lovett, 2006). This is consistent with new work indicating that most Pol IV molecules carry out DNA synthesis outside replisomes (Henrikus et al., 2018).

In these experiments, we study DNA synthesis by E. coli DNA Polymerase IV (Pol IV) as well as by the large fragment of Bacillus subtilis DNA polymerase I (LF-Bsu). These polymerases both lack 3′-5′ exonuclease activity. LF-Bsu has been modified to remove the exonuclease activity that Pol IV intrinsically lacks. In the following, we will present experimental results for both proteins indicating that under conditions relevant in vivo, DNA synthesis initiated by RecA-mediated homology recognition is highly unlikely unless there is a sequence-matched heteroduplex product with length L prod > 50 bp that terminates within 8 bp of the 3′ end of the initiating strand.

We first formed ssDNA-RecA filaments and then allowed these filaments to interact with the dsDNA. If a sufficiently stable heteroduplex forms, a DNA polymerase can extend the initiating strand. That extension begins at the terminal 3′ OH of the initiating strand and proceeds in the 5′ to 3′ direction with respect to the initiating strand. In our in vitro experiments, the synthesis can eventually reach an end of the dsDNA. We will refer to that end of the dsDNA as the 3p end. We will specify positions in the dsDNA using D, their separation from the 3p end of the dsDNA. We monitored the base pairing between the two strands in the dsDNA by measuring the emission due to a fluorescein label on one of the dsDNA strands (Figure 3a). Initially, the fluorescein emission is quenched by the nearby rhodamine label on the other strand, but if the two dsDNA strands separate, the fluorescence emission will increase.

To study effects due to the DNA polymerases, we positioned the dsDNA labels L base pairs beyond the 3′ end of the filament. L was chosen to be large enough that long strand exchange products do not produce significant fluorescence increases even if the product extends to the 3′ end of the filament. In what follows, we will show that under these conditions the presence of a DNA polymerase lacking 3′-5′ exonuclease activity can produce large fluorescence increases as long as RecA filaments and dNTPs are present. Importantly, this fluorescence also depends strongly on N, the [...]

Synthesis triggered by two filaments stabilizes recombination products

As illustrated in Figure 1, if each of the filaments triggers DNA synthesis that completes a double strand, then no unpaired bases will remain. To study synthesis triggered by the initiating [...]

Chi sites rarely occupy the 3′ ends of long repeats

We will refer to the sequence provided by the genome database as the "given" strand. The other strand in the genome is complementary to the given strand, so we refer to that strand as the "comp" strand. In the RecBCD pathway, as indicated in Figure 1aiv, one initiating ssDNA will terminate with a Chi site from the given strand and the other initiating ssDNA will terminate with a Chi site from the comp strand.

[...] in this genome contains more than one Chi site, seven repeats in the given strand contain two Chi sites. The cyan bar shows the separation between the two Chi sites.

[...] can only occur by joining long repeats that occupy the 5′ side of a Chi site. For these calculations, we assumed that DSBs are distributed randomly on the genome and that the function of RecBCD is changed by the first Chi site it encounters.
Given these assumptions, we calculated the fraction of the DSBs that create initiating strands whose 3′ ends terminate in at least one repeat containing N rep 3′ > n bases on a specified initiating strand (DSB1 frac (n)) or on both initiating strands (DSB2 frac (n)).

[...]

Fortunately, the difference between the magenta and orange lines is much easier to interpret because the orange line indicates that if the RecBCD pathway is followed, no DSB would create two filaments that include regions of the same repeat. In contrast, in the DSB ends pathway many do. Importantly, Figure 6d shows that summing over the results for all 12 genomes yielded > 20 instances in which two Chi sites on the same strand occur in one repeat. That statistic predicts that summing over the same genomes should yield ~20 repeats that could create N rep 3′ > 20 bases on both initiating strands; however, the actual sum was zero. Consequently, for the RecBCD pathway the suppression of DSB2 frac (n) is not the result of the observed reduction of instances in which Chi sites occupy one strand on a repeat. Thus, the statistical distribution of Chi sites in the genomes of enteric bacteria suggests that strong suppression of DSB2 frac is much more important than preventing Chi sites from occupying one strand in a repeat. This strong suppression avoids formation of searching filament pairs that include regions of the same long repeat at their 3′ ends, so the strong statistical suppression supports our proposal that the placement of Chi sites allows the RecBCD pathway illustrated in Figure 1a to strongly suppress genomic rearrangement. However, it is probable that in rare instances Chi sites may be associated with increased genomic recombination if the system does not follow the pathway shown in Figure 1a.
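
The DSB1 frac (n) and DSB2 frac (n) quantities defined above can be illustrated with a toy calculation. The Python sketch below is an assumption-laden illustration rather than the procedure actually used: it places one candidate DSB at every coordinate of a circular genome, takes the first given-strand Chi site at or to the left of the break and the first comp-strand Chi site at or to the right of it, and assumes that RecBCD always responds to the first Chi site it meets. The function name, the coordinate convention, and the per-site flags (True where a Chi site has N rep 3′ > n) are hypothetical.

```python
from bisect import bisect_left, bisect_right


def dsb_fractions(genome_len, given_chi, given_flag, comp_chi, comp_flag):
    """Toy estimate of DSB1 frac (per strand) and DSB2 frac on a circular genome.

    given_chi, comp_chi: sorted Chi coordinates in given-strand numbering.
    given_flag, comp_flag: parallel lists, True where N rep 3' > n at that site.
    Assumes uniformly distributed DSBs and recognition of the first Chi site.
    """
    hits_given = hits_comp = hits_both = 0
    for x in range(genome_len):  # one candidate DSB per coordinate
        # Given-strand tail ends at the first Chi at or to the left of x;
        # index -1 wraps around to the last Chi, which models circularity.
        g = bool(given_flag[bisect_right(given_chi, x) - 1])
        # Comp-strand tail ends at the first Chi at or to the right of x.
        c = bool(comp_flag[bisect_left(comp_chi, x) % len(comp_chi)])
        hits_given += g
        hits_comp += c
        hits_both += g and c
    return (hits_given / genome_len,
            hits_comp / genome_len,
            hits_both / genome_len)


if __name__ == "__main__":
    fractions = dsb_fractions(
        100,
        given_chi=[10, 60], given_flag=[True, False],
        comp_chi=[30, 80], comp_flag=[False, True],
    )
    print(fractions)
```

For real genomes the flags would come from a repeat search around each Chi site, and the per-coordinate loop could be replaced by summing the lengths of the intervals between Chi sites.
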
The dsDNA containing 90 bp with internal labels was obtained by heating the corresponding oligonucleotides and then cooling them slowly from 90 to 40 °C in 1 °C steps, each equilibrated for 1 minute; the emission at 518 nm was acquired (excitation at 493 nm) at each temperature step.

For each of the genomes, the given strand is the strand given by the database from which we obtained the sequence. The sequences for the given strands of DNA for E. coli genomes were acquired from PATRIC in FASTA format. They were converted to a simple .txt file with A, C, G, and T bases and read into Matlab as a single continuous string, called "bases", running from 5′ to 3′. The sequence of the comp strand is the complement of the given strand; however, if each base in the .txt file for the given strand is simply replaced by the complementary base, the resulting comp strand sequence runs from the 3′ end to the 5′ end. To get the comp strand sequence running from 5′ to 3′, the order of the bases in the comp strand must be reversed.

[...] sequences being compared. The first mismatch in either direction was found and its distance to the starting position as well as its absolute location in "bases" was recorded. If there were conflicts between two comparisons within the same group, indicating that at least one sequence in the group was a subsequence of the others, the maximum distance was chosen only for the sequences where the conflict occurred. Therefore, not all sequences within a particular grouping necessarily have the same distance.
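
The construction of the comp strand described above (complement each base, then reverse the order) can be sketched in a few lines. The example is written in Python rather than Matlab, and the helper names are hypothetical. The Chi motif used here, 5′-GCTGGTGG-3′, is the canonical E. coli Chi sequence; it is supplied from general knowledge and is an assumption with respect to this text, which only describes Chi sites as ~8 bp sequences.

```python
CHI = "GCTGGTGG"  # canonical E. coli Chi, written 5'->3' (assumed, see lead-in)
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def comp_strand_5_to_3(given):
    """Complement every base, then reverse so the result reads 5'->3'."""
    return given.translate(_COMPLEMENT)[::-1]


def find_all(text, motif):
    """Start indices of every (possibly overlapping) occurrence of motif."""
    hits = []
    i = text.find(motif)
    while i != -1:
        hits.append(i)
        i = text.find(motif, i + 1)
    return hits


def chi_sites(given):
    """Chi start coordinates on both strands, in given-strand numbering."""
    comp = comp_strand_5_to_3(given)
    on_given = find_all(given, CHI)
    # A hit at index j in comp covers given-strand indices
    # len(given) - j - len(CHI) through len(given) - j - 1.
    on_comp = sorted(len(given) - j - len(CHI) for j in find_all(comp, CHI))
    return on_given, on_comp


if __name__ == "__main__":
    toy = "AAGCTGGTGGAACCACCAGCTT"
    print(comp_strand_5_to_3("AACG"))
    print(chi_sites(toy))
```

Reporting comp-strand hits in given-strand coordinates keeps both sets of Chi sites on one axis, which is convenient when measuring separations between Chi sites on opposite strands.
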

In the resulting array of start and end position pairs, all repeats were discarded, as these are a remnant of [...] once. From this, the length of homology is easily calculated for each particular sequence, and start, difference, and end information was succinctly summarized in array "start_difference_end".

[...] where X is the random variable denoting the number of Chi sites up to and including the recognized Chi site, p is the probability that a particular Chi site is recognized, and E(X) is the expectation value for the random variable with a given p. Adjusted distances for each position were then calculated and a new histogram with bin size 10,000 was generated for each E. coli genome. The individual bin counts were averaged for the four genomes and normalized.

Repeats adjacent to Chi sites

A method similar to the one used for N repeat was used to find repeats >= 20 bp adjacent to a Chi site that would remain as part of the searching filament (N rep 3′). For Chi sites on the given strand, the 20 bp to the 5′ end of the start location of the Chi site in "bases" was selected as the key; for Chi sites on the comp [...]

Table 1. Abbreviations used in the text.

3p end: The end of the dsDNA that would be reached by synthesis initiated by RecA-mediated recombination at the 3′ end of the initiating ssDNA.
F: The change in fluorescence with time.
F: The difference between F for the positive and the F for a control with N = 20.
L: The separation between the fluorescent labels and the 3′ end of the initiating ssDNA.
D label: The separation between the fluorescent labels and the 3p end of the dsDNA.
D init: The separation between the 3′ end of the initiating ssDNA and the 3p end of the dsDNA.
DSB1 frac (n): The fraction of the DSBs that creates initiating strands with N rep 3′ > n on a specified initiating strand.
DSB2 frac (n): The fraction of the DSBs that creates initiating strands with N rep 3′ > n on both initiating strands.
L Chi: The number of bases surrounding the DSB that are not incorporated in the searching filaments because they are removed by RecBCD.
L prod: The length of a heteroduplex product joining the initiating and complementary strands.
M 3′: The number of contiguous mismatched bp at the 3′ end of the initiating ssDNA.
N: The number of contiguous bp in the dsDNA that are sequence-matched to bases in the initiating ssDNA in experiments with only one initiating ssDNA.
N 1: The number of contiguous bp in the dsDNA that are sequence-matched to bases in one of the initiating ssDNAs in experiments with two initiating ssDNAs.
N 2: The number of contiguous bp in the dsDNA that are sequence-matched to bases in the other initiating ssDNA in experiments with two initiating ssDNAs.
N repeat: The length of a repeated sequence occurring anywhere in the genome.
N rep 3′: The length of a repeated sequence that is positioned on the 5′ side of a Chi site. In the RecBCD pathway, these repeats would occur at the 3′ end of searching filaments.