A comprehensive evaluation of CRISPR lineage recorders using TraceQC

The CRISPR-Cas9 genome editing-based lineage tracing system is emerging as a powerful tool to track cell lineages at unprecedented scale and resolution. However, the complexity of CRISPR-Cas9 induced mutations has raised challenges in lineage reconstruction, which requires a unique computational analysis framework. Meanwhile, multiple distinctive CRISPR-based high-throughput lineage recorders have been developed over the years in which the data analysis is incompatible across platforms. To address these challenges, first, we present the TraceQC, a cross-platform open-source package for data processing and quality evaluation of CRISPR lineage tracing data. Second, by using the TraceQC package, we performed a comprehensive analysis across multiple CRISPR lineage recorders to uncover the speed and distribution of CRISPR-induced mutations. Together, this work provides a computational framework for the CRISPR lineage tracing system that should broadly benefit the design and application of this promising technology.


Introduction 34
Determining the origin of cells in multicellular organisms is a long-term goal in The CRISPR-Cas9 based lineage tracing technology is a new research field that 45 undergoes rapid expansion. Up to now, several lineage tracing platforms have been developed. 46 by engineering a target array into the zebrafish's genome 8 . Using the CRISPR introduced 48 mutations, the GESTALT mapped a complete tissue-level developmental tree of zebrafish. 49 Subsequently, several studies have integrated the CRISPR lineage recorder with single-cell RNA 50 sequencing to provide a cell-level lineage mapping of zebrafish 9-11 and mouse 12,13 . While most 51 recorders used the canonical CRISPR system, the homing guide RNA (hgRNA) lineage recorder 52 made a major modification by directing the Cas9-hgRNA complex to cuts the DNA locus of 53 itself 14 . This modification has increased the efficiency and diversity of CRISPR genome 54 editing 15,16 . 55 Across all the CRISPR lineage recorders, the design of the DNA barcodes is vastly 56 different. Some distinctive categories are: 1) Synthesized target array 8,9,12,13 . 2) The hgRNA 57 system 15,16 . 3) Specific DNA sequences that pre-exist in the genome 10,11,17 (Fig. 1A). Although 58 each platform has different design philosophy on DNA barcode, one commonality is that the 59 barcode sequences are usually compact (less than 300-bp), which makes the high-throughput 60 sequencing readout very homogeneous. As a result, the sequencing analysis could be compatible 61 across multiple CRISPR-based lineage tracing technologies. 62 In general, it takes two major steps to reconstruct the lineage tree.
Step one is to identify 63 the CRISPR-induced mutations from the sequencing data. A major characteristic of CRISPR-64 TraceQC to favor insertion and deletion over substitution. Next, each aligned sequence is 139 indexed according to the reference barcode in which location information is extracted for each 140 mutation. To reduce technical artifacts (e.g. PCR error, sequencing error, sample contamination), 141 the TraceQC package denoises the dataset using filters based on read count and alignment score 142 (Fig. 1C, Methods). Finally, the package provides a series of functions that evaluate the mutation 143 outcomes across multiple platforms (Fig. 1D, Fig. S13-S28). To illustrate the robustness of the 144 TraceQC package, we analyzed the dataset from all major platforms (Table 1) and revealed 145 differences in mutation characteristics such as speed and rate. 146 147 TraceQC determines preferences of mutation patterns across various platforms. Although 148 most platforms exclude substitutions from the mutation library because they are not as prevalent 149 as indels in CRISPR-induced mutations, it is a viable outcome of NHEJ 32-34 . Using the TraceQC 150 pipeline, we detected the three mutation types: insertion, deletion, and substitution vastly exist in 151 every lineage recorder. On average, three mutations types contribute to a similar diversity to the 152 entire mutation library ( Fig. 2A). However, deletion is the most prominent mutation type among 153 the three, as it has the longest length per sequence (Fig. 2B). Each lineage recorder has indel 154 length preference due to differences in experimental designs (Fig. 2C). For example, the barcode 155 sequences of GESTALT and Carlin is a synthesized target array in which deletions can be 156 classified into intra-target deletion and inter-target deletion. While the length of intra-target 157 deletion is mostly less than 10-bp in both GESTALT and Carlin, the length of inter-target 158 deletion could reach 200-bp ( Fig S1). In contrast, LINNAEUS is a single-target platform in 159 which CRISPR is used to target red fluoresce protein transgene. Therefore, the targetable region 160 of LINNAEUS has shorter length than target array and we observed that most mutations are 161 concentrated upstream to the PAM sequence (Fig. 2D). Although the homing-guide RNA system 162 is also a single-target platform, its average deletion length is longer than LINNAEUS due to 163 deletions spanning into the extended homing-guide scaffold regions downstream to the PAM 164 sequence (Fig. S9). 165 Contrary to deletion, the pattern of insertion is more consistent across platforms. In 166 Carlin, GESTALT, and hgRNA system, the most frequent insertion length is 1-bp. Large 167 insertions that are more than 16-bp are rarely seen (Fig. 2C). The mutation properties discovered 168 by the TraceQC are concordant with the original research, and therefore demonstrate that the 169 TraceQC pipeline has provided accurate cross-platform data analysis.  Therefore, a sound lineage recorder should synchronize the cell division rate with the mutation 175 generations rate such that enough signals are generated to track cell development. Ideally, the 176 mutation speed should be kept moderate. While slow mutation speed will not produce enough 177 signals to mark every cell, quick mutation speed will exhaust the CRISPR barcode too quickly to 178 record a prolonged process (Fig. 3A). 179 180 Adjusting mutation speed in the CRISPR-Cas9 system. Presumably, the mutation speed of 181 CRISPR is determined by the binding efficiency between target sequence and the Cas9 protein. A study has shown that the target sequence, especially the PAM-proximal region is a major 183 determinant of Cas9-target binding efficiency 39 . For example, in the Carlin's target array, each target can be edited by a perfectly matched guide RNA, which results in similar mutation speed 185 across multiple targets (Fig. S2). However, modifications on the target sequence or sgRNA 186 sequence could decrease mutation speed 39,40 . In the hgRNA-invitro system, researchers 187 engineered the B-21 barcode with multiple PAM binding sites that decreases the mutation 188 speed 41 (Fig. S3). Besides, the hgRNA system has further synthesized barcodes with various 189 lengths to adjust the mutation speed ( Fig. S3 -S4). The result shows the mutation speed becomes S5). Therefore, the aggregated result is that the mutation speed decreases exponentially through 207 time (Fig. 3B, 3C). Contrarily, a different induction method is to add the Dox several times to 208 replenish the concentration of Cas9 protein, which makes the mutation speed steady (Fig. 3B, C). 209 210 Leaky Cas9 expression introduces background mutations in the inducible CRISPR 211 system. The inducible CRISPR system allows users to initiate the lineage recorder at a desirable 212 time point, which enables lineage tracking of a specific developmental process such as cancer 213 metastasis 5,6 . However, one notable drawback is the leakiness of the inducible CRISPR system, 214 in which Cas9 express before the adding Dox 42,43 . The system leakiness can result in an average 215 of 30%-50% of sequences acquire de novo mutations before the Dox induction. Nevertheless, the 216 system still contains abundant unmutated sequences until it saturates at 90% mutated sequences 217 Moreover, the barcode sequence could be re-targeted by CRISPR-Cas9, resulting in 219 multiple mutations can appear in one sequence. In hgRNA and Carlin, we found the average 220 deletion length increases after Dox induction, suggesting those barcodes have undergone more 221 than one round of mutation (Fig. 3E, Fig. S6). Besides, the initially mutated sequences have 222 shown limited changes on the PAM sequence (Fig. 3F) and a small percentage of deletions ( Fig.  223 S6), which further demonstrated that the sequences could be re-targeted by CRISPR. To sum up, 224 although the leaky Cas9 expression brings background noise to the inducible lineage recorder, 225 the system can still accumulate plenty of mutations subsequently. Therefore, the system's 226 leakiness will not significantly sacrifice the lineage recorder's robustness when adjusted 227

correctly. 228
The lineage-independent mutation causes parallel evolution of barcodes. In genome-editing-230 based lineage recorders, parallel evolution is almost inevitable when independent lineage 231 acquires the exact same mutation. This causes cells from independent lineage cannot be 232 distinguishable during tree reconstruction (Fig. 4A). In the CRISPR lineage recorder, parallel barcodes. Across all the lineage recorders, the CRISPR-induced indels are overwhelmingly 244 concentrated around the PAM sequence (Fig. 4D, Fig S7-S9). This is most likely due to the 245 highly specified DSB location. For GESTALT and the Carlin, the barcode sequences are target 246 array, which means the deletion of both platforms can be classified into inter-target and intra-247 target deletions (Fig. 4B). The inter-target deletion occurs when two independent Cas9-guide 248 RNA complexes create two DSB simultaneously, which causes drop-out of all the targets in-249 between (Fig. 4B). In Carlin, the inter-target deletion and intra-target deletion have a very similar 250 mutation hotspot (Fig. 4E, Fig S7). 251 Increased mutations diversity improves barcode randomness. In general, increase the 253 diversity of CRISPR mutations can increase the randomness of barcodes 44 , which reduces the 254 number of parallel evolutions. In GESTALT and Carlin, the diversity of barcodes is multiplied 255 by the number of targets in the array. Also, the inter-target deletions further produce additional 256 level of diversity by the number of combinations (Fig. 4C). Similarly, the hgRNA-invivo system 257 uses 60 independent targets throughout the genome to increase the barcode diversity. However, 258 the combinatory effect of multiple barcodes is only detectable via single-cell sequencing. 259 Meanwhile, the hgRNA-invivo system has extended scaffold sequence that allows additional 260 mutations to extend to that region, which generates higher diversity than the canonical guide 261 RNA system (Fig. S9). 262 Across all the lineage recorders, the three mutation types: deletion, insertion, and 263 substitution have significant differences in mutation distribution. We discovered that insertion 264 and inter-target deletion are more random than substitution and intra-target deletion (Fig. 5A, 5B, 265 In insertion, the additional diversity is produced by the randomly inserted sequence. In 267 theory, the inserted sequence of length n has a possibility of 4 n possible combination, which 268 generates an exponential increase of diversity. As expected, we discovered that the randomness 269 of insertion patterns increases as the length increase (Fig. S10). In GESTALT and Carlin, the 270 increased diversity of inter-target deletion is generated by the random target-to-target interaction. 271 As a result, the inter-target deletion is more random than intra-target deletion (Fig. 5A, 5B). brings the most irreversible damage to the DNA sequence. When evaluating the accumulation of each mutation type over time, we discovered that inter-target deletion had gained the most 276 diversity whereas the diversity of insertions and substitutions are gained little or decreased (Fig.  277   5C, Fig. S11). This is mainly because the DNA barcode is gradually saturated by deletion over 278 time as the prolonged CRISPR activity removes a great proportion of the barcode sequence (Fig.  279   S6). As a result, the barcode is too exhausted to bear substitutions and insertions in later 280 development stage. In addition, in GESTALT and Carlin, the inter-target deletion removes all 281 targets in between, which completely clean the diversity of a large proportion of barcode. 282

Contrarily, substitution and insertion have limited diversity when the barcode. In contrast to 283
Carlin, the hgRNA is a single-target platform in which the deletions are shorter. Therefore, the 284 domination of deletion in the mutation library is mitigated compared to GESTALT and Carlin 285 (Fig. S12). 286 287

Discussion 288
The two properties of CRISPR-induced mutations: speed and rate, have a broad impact 289 on lineage reconstruction. In the non-inducible CRISPR system, the mutation speed depends 290 entirely on the binding between the Cas9 and the target sequence. But in the inducible CRISPR 291 system, the mutation speed can also be controlled by Dox concentration, bringing flexibility and 292 accessibility to the recorder. Moreover, the pulsed induction of Dox enables a stable expression 293 of Cas9, which results in a steady CRISPR mutation speed. The current application of CRISPR 294 lineage recorder usually focuses on a specific developmental process in which the cell division 295 rate is steady. Therefore, pulsed induction of Dox could be useful in many experiments' settings. 296 However, in many organisms, the mutation rate of cell division varies among different tissues 297 and different developmental stages, e.g. mouse embryogenesis 45 . Also, studies have shown that 298 CRISPR activity varies in different tissues 46 . This could bring challenges to adjusting speed in 299 lineage recorders. Moreover, the current pulsed induction experiment in Carin and hgRNA is 300 only performed in in-vitro systems. In in-vivo experiment, closely monitoring the Cas9 301 expression level could be challenging. Determining the optimal Dox concentration prior to the 302 in-vivo experiment to can help identifying the desired mutation speed. 303 The randomness of CRISPR mutations affects the level of parallel evolutions in lineage 304 recorders. In general, increase the diversity of barcodes by incorporating multiple targets could 305 greatly increase the barcode randomness. We demonstrated that the inter-target interaction of the 306 GESTALT and Carlin platform generated increased the mutation diversity, thus provide robust 307 lineage barcoding through time. The hgRNA mouse model used 60 independent targets 308 throughout the genome to increase the barcode diversity. However, the current experiment 309 readout by bulk DNA sequencing cannot detect the independent target interactions, which could 310 be evaluating using single-cell sequencing. 311 Meanwhile, in GESTALT and Carlin, we discovered that inter-target deletion might 312 saturate the DNA sequence too quickly, preventing insertion and substitution from occurring.

One limitation of our mutation distribution analysis is the lack of quantitative measures. 321
Although several studies have demonstrated that NHEJ-induced mutation patterns can be 322 predicted in-silico 24,35-38 , these machine learning models cannot be directly applied to the 323 CRISPR lineage tracing dataset. First, the NHEJ mechanism varies among different organisms, 324 which results in different mutation patterns 36 . Second, these models are trained on a single-target 325 canonical guide RNA library. It cannot capture the properties of target array or hgRNA system. 326 Nevertheless, from the lineage tracing dataset, we discovered the characteristics of CRISPR-327 induced mutations are mostly concordant with the previously identified characteristics, such as 328 the prevalence of 1-bp insertion. More importantly, we found the mutation frequencies are highly 329 correlated between replicate, which means a machine learning solution could be applied to 330 predict the mutation distribution. 331 The scale and complexity of single-cell lineage tree bring challenges to the classical 332 phylogenetic algorithms 50,51 . Also, the CRISPR-induced mutations require a novel computational 333 modeling framework. Unsurprisingly, every new lineage reconstruction algorithm has realized 334 the importance of mutation rate. However, directly estimating the mutation rate from the lineage 335 tracing data is one of the biggest challenges, which causes them to creatively detoured the 336 problem. For example, the GAMPL algorithm uses Markov chain to model mutation generation. 337 Instead of estimating the mutation rate, GAMPL used lumped matrix to estimate the transition 338 rate between meta-states. In contrast, Cassiopeia obtained the mutation rate through experiment. 339 They performed an in-vitro experiment of CRISPR barcoding for 15 cell cycles. Then mutation 340 rates are derived empirically using frequency. Although TraceQC does not directly provide an 341 estimation of mutation rate, we believe it could be obtained from a combination of in-vitro 342 experiment and in-silico mutation frequency prediction. Next, an integration between the sequencing analysis pipeline and tree reconstruction algorithms could produce a more accurate 344 lineage reconstruction strategy. 345 Overall, the single-cell resolution CRISPR lineage tracing is a promising technology with 346 many potential applications. We developed the TraceQC package to provided sequencing 347 analysis and quality evaluation of lineage tracing datasets across multiple platforms. We hope 348 our study could facilitate a wider application of this technology and provide some insights into 349 its future development.  Alignment quality evaluation. The alignment quality filter is used to remove reads that are 367 contaminated. In every CRISPR lineage recorder, the barcode sequence is usually amplified and 368 sequenced specifically. However, some samples could contain up to 20-40% reads that do not 369 come from the CRISPR barcode. Using the alignment score difference, it is usually easy to 370 separate the contaminated sequence from the mutated CRISPR barcode (Fig. S14C). To 371 quantitively determine the contaminated sequence, a decoy sequence library is generated by 372 randomly substitute e percentage of the sequence. Next, TraceQC trains a local regression model 373 between the substituted percentage and the alignment score. Finally, the sequence below the 374 optimal substitute percentage can be selected. Presumably, when the CRISPR barcode is transcribed, the mRNA transcript from each cell 385 should be identical. Therefore, TraceQC provides merging functions for mutations in each cell. 386 First, for each cell, reads with the same UMI are grouped together. The mutations that appears in 387 more than 50% of reads are retained. During this step, UMI with a low read count should be 388 filtered according to the guideline of the particular single-cell sequencing platform. Next, for each UMI in each cell, a second merger is applied to retain mutations that appear in more than 390 50% UMI. 391 392 QC plots. There are three main aspects that TraceQC evaluates: 1) The quality of sequencing 393 alignment. Using alignment score, TraceQC removes the percentage of contaminated sequences. 394 The method is described in the section: alignment quality evaluation. 2) Mutation patterns can be 395 Carlin used paired-end sequencing. We merged read R1 and R2 using software FLASH with 403 default parameters. Next, the dataset is processed as follows: 404 1. The merged read is processed using the TraceQC alignment with the default parameter. 405 We used e = 0.4 to filter out contaminated sequences. 406 2. We normalized each sequencing sample using count per million (CPM). Reads with 407 CPM > 10 are retained. We processed a part of GESTALT's in-vivo bulk DNA-seq dataset (GESTALT paper fig.  411 3). Although both GESTALT and Carlin used paired-end sequencing, the reads of GESTALT cover approximately 60% of the entire barcode, whereas Carlin covers 100%. Therefore, we 413 processed the R1 and R2 of GESTALT separately using the same procedure. Finally, the 414 mutation event of R1 and R2 are merged for each sample. 415 The hgRNA-invitro dataset used single-end sequencing. The complete dataset is 416 processed using the same procedure as Carlin. 417 The hgRNA-invivo platform contains 60 independent CRISPR barcodes. A DNA 418 identifier is assigned to each barcode which presumably cannot be edited by CRISPR. Therefore, 419 we first used the 12 base pairs DNA sequence upstream to locate the DNA identifier. In this step, 420 the Levenshtein distance of 1 is allowed for the 12-bp sequence. Next, the 10-bp directly 421 downstream is extracted to match with the identifier. The sequences are grouped by each 422 identifier and processed by the TraceQC pipeline using the same procedure as Carlin.  We used e = 0.4 to filter out contaminated sequences. 431 2. For each cell, reads are grouped by UMI. UMIs with read count < 10 are filtered out.