Predicting efficiency of writing short sequences into the genome using prime editing

Short sequences can be precisely written into a selected genomic target using prime editing. This ability facilitates protein tagging, correction of pathogenic deletions, and many other exciting applications. However, it remains unclear what types of sequences prime editors can easily insert, and how to choose optimal reagents for a desired outcome. To characterize features that influence insertion efficiency, we designed a library of 2,666 sequences up to 69 nt in length and measured the frequency of their insertion into four genomic sites in three human cell lines, using different prime editor systems. We discover that insertion sequence length, nucleotide composition and secondary structure all affect insertion rates, and that mismatch repair proficiency is a strong determinant for the shortest insertions. Combining the sequence and repair features into a machine learning model, we can predict insertion frequency for new sequences with R = 0.69. The tools we provide allow users to choose optimal constructs for DNA insertion using prime editing.

Cas9-fused reverse transcriptase using the pegRNA-encoded template sequence. Next, DNA repair mechanisms resolve the conflicting sequences on the two DNA strands, ultimately writing the intended edit into the genome. When CRISPR/Cas9 was compared to molecular scissors capable of disrupting target genes, and base editors were called molecular pencils for their ability to substitute single nucleotides, prime editors were described as molecular word processors able to perform search and replace operations directly on the genome 13.

The prime editing system is complex, and the determinants of its efficiency are not fully understood. Several partly independent steps, including three DNA binding events and successful mismatch repair, are needed to produce an edit, each potentially introducing biases.

In the largest study so far to understand these biases, Kim

Here, we systematically measure the insertion efficiency of over 2,600 sequences and identify the features responsible. We find that insertion sequence length, nucleotide composition, secondary structure, and repair pathway activity together explain most of the variation in insertion rate. We then use these insights to train a sequence-based prediction model informed by mismatch repair efficiency that predicts editing outcomes for novel sequences with high accuracy, and allows selection of optimal reagents for new insertions.
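Prediction accuracy of this kind is commonly summarized as a correlation between predicted and measured insertion rates. The sketch below assumes a Pearson correlation coefficient; the text here does not state which metric is used, and the function name `pearson_r` is a hypothetical helper rather than part of the study's code.

```python
from math import sqrt


def pearson_r(predicted, measured):
    """Pearson correlation between two equal-length lists of numbers.

    Hypothetical helper for illustration; not taken from the study's scripts.
    """
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    # Sum of cross-products and sums of squares around the means
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_p * var_m)
```

Given model predictions and measured rates for held-out sequences, `pearson_r(predictions, measurements)` returns a value between -1 and 1, with values closer to 1 indicating better agreement.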

We sought to systematically characterize how the length and composition of the inserted sequence, as well as cell line, target site, and the version of the prime editor system, affect insertion rates. To do so, we designed 2,666 pegRNAs encoding insertions immediately upstream of the nick site. These comprise 270 sequences useful for molecular biology (including e.g. His-6 tag, recombinase sites, and mNeonGreen11 25), 1,957 eukaryotic linear motifs 26,27, and 439 sequences with variable secondary structure (Figure 1a). The insertion lengths ranged from 1 to 69 nt, with varied GC content (Figure 1b). We used lentiviruses to deliver the library against four target sites (three previously tested, HEK3, EMX1 and FANCF 12, and the safe-harbor CLYBL locus 28) in two cell lines (HEK293T and HAP1), followed by

Overall, the intended insertions were the dominant mutations generated, and we do not consider the unintended edits further.

To understand the consistency of insertion efficiencies across contexts, we next compared them between replicates, cell lines, and target sites. Insertion rates into the same target site

To test the hypothesis that rates of inserting short sequences differ between cell lines due to mismatch repair activity, we screened the HEK3- and FANCF-targeted libraries in HAP1 cells in which MLH1 is knocked out (HAP1 ∆MLH1, Supplementary Figure 2b,c). We found that average insertion rates increased 3.9- to 5.1-fold in the mutant background compared to wild-type HAP1 cells. The rates for 1-4 nt sequences were most affected, increasing by 43-66 fold to

Combining insert sequence length, GC content and structure explained why some sequences inserted much better than others, which can help guide the choice of tags to insert. For example, the long 66 nt ELMI003108 sequence that was inserted in the HEK3 locus at 1.39% insertion frequency (0.66% on average for the other 10 sequences > 66 nt) had a GC content of 62% and formed a strong structure alone as well as within the pegRNA context (ViennaFold free energy = -23.8 and -37.5 respectively, Figure 3f). Other longer sequences that inserted well relative to their size were recombinase sites, presumably due to their secondary structure that often contains hairpins (Supplementary Figure 4a,b).
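As a rough illustration of two of the sequence features discussed above, the following Python sketch computes the length and GC fraction of a candidate insert. The function name `insert_features` is hypothetical, and the example sequence is one possible encoding of a His-6 tag chosen for illustration, not a sequence taken from the library. Folding free energy is omitted because it requires an external tool such as the ViennaRNA package.

```python
def insert_features(seq: str) -> dict:
    """Return length and GC fraction for a DNA insert.

    Hypothetical helper for illustration; not taken from the study's scripts.
    """
    seq = seq.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("expected a non-empty DNA sequence of A, C, G, T")
    # Count G and C bases, then normalize by insert length
    gc_count = sum(base in "GC" for base in seq)
    return {"length": len(seq), "gc_fraction": gc_count / len(seq)}


# Illustrative input: six CAC (histidine) codons, an assumed His-6 encoding
example_insert = "CAC" * 6
features = insert_features(example_insert)
print(features["length"], round(features["gc_fraction"], 2))
```

Features computed this way could be tabulated for several candidate tags to compare them before choosing which one to insert.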

Finally, we considered how aspects of the CRISPR/Cas system itself impact insertion rates.

First, it is known that the occurrence of four consecutive thymines acts as a transcription terminator for RNA polymerase III and strongly impairs guide RNA expression 33,34. We confirmed that the average insertion rate for sequences that contain this tetranucleotide was

We found that the extra nick increased the average insertion rate by 5.7-fold to 1.59%, and moderately increased the indel rate by 1.7-fold to 0.22% (Figure 4b). Importantly, the relative

We present the first comprehensive analysis of prime editing insertion efficiencies using 2,666 pegRNAs. We discovered that short sequences insert with predictable frequencies across cell lines, target sites, and prime editor systems based on their length, GC content and tendency to form secondary structure. Together, these features make it possible to accurately predict variation in insertion rates, and to choose optimal reagents for writing short stretches of DNA into genomes.

We uncovered a complex relation between insertion sequence length and efficiency.

Sequences between 15 and 21 nt generally inserted well, while longer sequences were incorporated less frequently, but still at moderate efficiencies even for sequences larger than

Longer sequences are less efficiently recognized by MMR 30 and therefore insert better than short ones in mismatch repair proficient cell lines.

We further discovered that stronger secondary structure of the pegRNA 3'-extension led to higher insertion efficiency. This effect was evident when comparing different inserts into the same target, but also explained variable rates when attempting to write the same sequence into different target sites. One potential explanation is that structured pegRNAs are more

One day after transfections, 2 µg/ml of puromycin was added to the cells to start selection.

Cells were kept in selection for 3 days and harvested 5 days after transfection.

Data for the same screen but from different sequencing lanes was concatenated. The resulting merged fastq files were processed using a custom R script (read_match_pegRNAs.R in

Third, sequences with a perfectly preserved sequence around the cut site were removed.

Fourth, sequences that are 120 nt long were removed (120 nt corresponds to the length of a sequence without indels). The remaining sequences were classified as indels. The scripts