SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms

The emergence of long-read RNA sequencing (lrRNA-seq) has provided an unprecedented opportunity to analyze transcriptomes at isoform resolution. However, the technology is not free from biases, and transcript models inferred from these data require quality control and curation. In this study, we introduce SQANTI3, a tool specifically designed to perform quality analysis on transcriptomes constructed using lrRNA-seq data. SQANTI3 provides an extensive naming framework to describe transcript model diversity in comparison to the reference transcriptome. Additionally, the tool incorporates a wide range of metrics to characterize various structural properties of transcript models, such as transcription start and end sites, splice junctions, and other structural features. These metrics can be utilized to filter out potential artifacts. Moreover, SQANTI3 includes a Rescue module that prevents the loss of known genes and transcripts exhibiting evidence of expression but displaying low-quality features. Lastly, SQANTI3 incorporates IsoAnnotLite, which enables functional annotation at the isoform level and facilitates functional iso-transcriptomics analyses. We demonstrate the versatility of SQANTI3 in analyzing different data types, isoform reconstruction pipelines, and sequencing platforms, and how it provides novel biological insights into isoform biology. The SQANTI3 software is available at https://github.com/ConesaLab/SQANTI3.

transcript ends (Fig.1c). The Reference Match (RM) subcategory is defined as see Methods). These results indicate that the proposed TSS ratio metric is a 245 reasonable short-reads-based alternative to CAGE-seq support. 246 To understand the relationship between TSS-related attributes and tran-247 script length completeness, we compared TSS metrics for transcript models 248 in the FSM and ISM categories. While the vast majority of FSM transcripts 249 showed both CAGE-seq support and TSS ratios larger than 1.5 (Fig.2d, top   250 panel), the ISM category was enriched in transcripts without an overlapping 251 CAGE-seq peak (Fig.2d, bottom panel). We identified 7,599 ISM isoforms with 252 a reliable TSS according to both CAGE-seq data and TSS ratio, one-third of 253 which were novel TSS with respect to the reference annotation. Conversely, 254 4,591 FSM isoforms showed low TSS ratios and lacked support from the refer-255 ence annotation or CAGE-seq data, suggesting that the TSS of these transcript 256 models might not be correctly defined.  Fig.A2b). This suggests that long-read sequencing methods detect alternative 268 3' ends with higher sensitivity than Quant-seq, possibly because Quant-seq 269 requires high expression of the alternative TTS isoform to call a polyA site. 270 Transcripts that were likely to be a product of intrapriming (defined as 60% As 271 downstream of the TTS [23]), rarely contained a polyA motif or the polyA was 272 located closer to the 3'end than expected ( Supplementary Fig.A2c). Overall, 273 these results further support polyA motif detection as a reliable indicator of a 274 bona fide transcription termination site. 275 The results above demonstrate the usefulness of SQANTI3 QC for evalu-276 ating the 3' and 5' ends of lrRNA-seq transcript models. However, they also 277 suggest that conducting a more in-depth analysis of TSS/TTS variability pat-278 terns is advisable, which can be achieved through the novel FSM and ISM supported by CAGE-seq, and 70.9% had a TSS ratio greater than 1.5 (Fig.2e).

288
The subcategory-level analysis of FSM, therefore, reveals incompleteness in SQANTI3 reference annotation and suggests that novel combinations of known start/end 290 sites and intron chains are yet to be described. 291 The analysis of ISM subcategories showed that 3' Fragment (3'F) was by 292 far the most abundant group with a total of 47,594 transcript models (70.2% 293 of ISM), where only 9.4% had a known TSS, 16% had CAGE-seq support 294 and 39.5% displayed an above-threshold TSS ratio (Fig.2e). This pattern was 295 recapitulated by the 13,236 mono-exonic ISM transcripts, for which most 3' 296 ends were validated by orthogonal data, whereas 5' ends remained largely 297 unsupported (Fig.2e). Moreover, transcripts from the mono-exon subcategory 298 presented a larger difference in length ( Supplementary Fig. A2d) and exon 299 number ( Supplementary Fig.A2e) with respect to their matched reference tran-300 script than the rest of ISM, ruling out the possibility that these were fragments 301 of initially shorter molecules. These results suggest that ISM transcripts were 302 enriched in 5'end degradation products. 303 The diversity of TSS and TTS patterns in lrRNA-seq is apparent in the 304 NEXN gene (Fig.2f ). Although all detected multi-exon transcripts were asso- flagged as a potential intrapriming artifact due to a 20-bp stretch with 90% 320 As found immediately downstream of the TTS, which, together with the loca-321 tion of the polyA motif 39 bp from the 3' end, suggested a TTS annotation of 322 poor quality (Fig.2f ). Notably, the degree of isoform diversity and TSS/TTS 323 support variability exhibited by the NEXN gene was frequent in the WTC11 324 lrRNA-seq transcriptome estimated by Isoseq3. 325 In summary, the IsoSeq3 processing of the WTC11 cDNA-PacBio data 326 presents a high level of TSS and TTS variability, which can be effectively char-327 acterized using SQANTI3 (sub)categories in combination with complementary 328 data sources. Our analyses suggest that a combination of artifacts and true 329 biological variability causes the observed diversity at transcript ends. These 330 and previous SQANTI results [23] motivated the design of a comprehensive 331 strategy for removing lrRNA-seq artifacts.  (Fig.3f ).

447
To evaluate the rescued elements, the reference transcriptome was quanti-448 fied at the gene and transcript levels using short reads (see Methods). Genes for SQANTI3 which at least one isoform passed the filter showed the highest expression val-450 ues (Extended Data Fig.A6a), independently of the complementary data input.

451
Genes initially removed and then rescued had consistently higher expression 452 than those that remained excluded. Similar results were obtained when per-453 forming these analyses at the transcript level ( Supplementary Fig.A7). Finally, 454 to validate the biological importance of the rescued transcript set, we retrieved 455 their functional scores (TRIFID scores) from the APPRIS database [34]. 456 Known transcript models that were rescued had higher scores than those that 457 were not rescued (Fig.A6b) of the novel TP transcripts (Fig.3g). SQANTI3 Rescue was also robust to the 489 confounding factor introduced by reference over-annotation, with only between 490 1 and 3 FP transcripts being selected by the algorithm.

491
To better understand the impact of each step of the SQANTI3 pipeline on 492 transcriptome quality, performance metrics were computed for the HIS results ( Fig.3h). The initial reconstruction by IsoSeq3 yielded almost perfect sensi-494 tivity (94%), while precision was much lower (32%). As previously observed, 495 precision significantly improved after filtering, whereas sensitivity decreased 496 slightly due to some TP transcripts being flagged as artifacts. Importantly, the 497 F-score revealed that overall performance steadily improved after every step 498 of the SQANTI3 curation pipeline. Specifically, a stronger F-score increase 499 was observed when using the ML filter (82%) than when applying the rules The SQANTI3 pipeline is designed to perform data-based QC and curation of 519 transcriptomes, particularly those created using tools with high detection levels 520 and low reference dependence, resulting in a significant proportion of novel iso- with poor orthogonal support, we adjusted the rules filter to require SJ cover- The SQANTI3 framework offers not only quality control and curation but 587 also the integration of IsoAnnotLite, which allows for isoform-level functional 588 annotation. This feature facilities downstream analyses on isoform biology, 589 for example, using the tappAS software [19]. To demonstrate this capability,  The SQANTI3 analysis shown here confirms that novel combinations of 3' 707 and 5' ends with intron chains are still to be discovered, even in well-annotated 708 organisms such as human and mouse, and that many novel transcripts can be 709 found with considerable support. This implies that the generation of sample- input and output files are described.

734
The Quality Control module is the cornerstone of the SQANTI3 pipeline.

735
It is designed to characterize transcriptomes built using lrRNA-seq data and were defined in SQANTI [23].

770
In addition to short and long-read data for coverage and expression based-  Processing of short read data:

782
To facilitate the integration of matching short-read data, SQANTI3 has been 783 upgraded to run STAR [41] and Kallisto [42] internally for mapping and   [44] in which the characteristics that make an isoform 862 reliable are specified.

863
The JSON file is structured in two different levels of hierarchy: rules and 864 requisites. A rule is made of one or more requisites, all of which must be 865 fulfilled for an entry to be considered a true transcript. This means that requi-866 sites will be evaluated as AND in terms of logical operators. If different rules 867 (i.e. sets of requisites) are defined for the same structural category, they will 868 be treated independently from one another. In that case, to pass the filter, 869 transcripts will need to pass at least one of these independent rules, mean-870 ing that rules will be evaluated as OR in terms of logical operators. Rules 871 can be set for any numeric or categorical column in the classification file.   Matches between each rescue target and its same-gene candidates are next 917 found by mapping candidate sequences to targets, a process known as rescue-918 by-mapping. To achieve this, minimap2 [45] is run in the long-read alignment 919 mode using the -a parameter, combined with the -x map-hifi (i.e. high-     The WTC11 cell line is an induced pluripotent stem cell (iPSC) line derived 997 from human fibroblasts, often used as a model for cell differentiation [46].

998
The data used in this paper was generated within the LRGASP project [31], 999 where this cell line was deeply sequenced using different technologies. We only 1000 used cDNA-PacBio data for reconstructing transcripts models. Raw data used  The PacBio cDNA lrRNA-Seq datasets used in the present study, i.e.

Rules filter
For the WTC11 dataset, the Rules Filter was defined as follows. The 5'-ends 1092 were considered valid if: 1) they overlapped a CAGE-Seq peak or an annotated refTSS site; 2) the distance to any other annotated TSS in the same gene was 1094 less than 50bp; or 3) they had a TSS ratio 1.5. Similarly, 3'-ends were accepted 1095 if: 1) they were supported by Quant-Seq data or by polyASite annotation, 2) 1096 the distance to any other annotated TTS was less than 50bp, or 3) there was 1097 a canonical polyA motif close to the TTS. FSM and ISM were required to 1098 have support in both their 5' and 3'-ends to pass the filter. For the rest of the 1099 transcript models, it was required that all SJ were supported by at least three        2) The rest of the LR-defined transcript models filtered out (rescue candidates) are mapped against the reference transcriptome combined with the accepted LR-defined isoforms (rescue targets), allowing several hits per candidate. 3) Reference transcriptome was previously evaluated and filtered with the same data and criteria as the LR-defined transcripts. 4) Rescue is completed by evaluating targets. They need to pass the filtering and do not increase the redundancy, meaning that if the target is an LR-defined transcript present or it is a reference transcript already represented as an FSM in the filtered transcriptome, these targets will not be added to the final annotation. Upper track (black) represents the GENCODE annotation for that locus, which includes 5 isoforms for the GST5 gene coded in the negative strand. The bottom track (light blue) shows the LR-defined isoforms identified in the WTC11 sample, while the tracks in between (gold, red, turquoise, and dark blue) are the orthogonal data available to validate those isoforms. The bottom table indicates which filters passed each of the isoforms regarding the informative situation simulated: Low Input (LI), High Input Reference (HIR), and High Input Sample-specific (HIS). SQANTI3 Filtered isoforms (orange) were those FSM/ISM that did not pass the corresponding filter and were not eventually rescued because of unacceptable quality attributes. On the other hand, LR-defined known transcripts lost during filtering but recovered by introducing transcript models from the reference (dark blue) are the the rescue strategy's goal. In some exceptional cases, genes not initially detected were included in the final transcriptome (yellow) if a discarded sequence mapped them and the filtering criteria were fulfilled.