Systematic assessment of long-read RNA-seq methods for transcript identification and quantification

The Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. The consortium generated over 427 million long-read sequences from cDNA and direct RNA datasets, encompassing human, mouse, and manatee species, using different protocols and sequencing platforms. These data were utilized by developers to address challenges in transcript isoform detection and quantification, as well as de novo transcript isoform identification. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. When aiming to detect rare and novel transcripts or when using reference-free approaches, incorporating additional orthogonal data and replicate samples is advised. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.

A higher proportion of FSMs was observed among transcripts detected by more pipelines, indicating that agreement is more frequent for known transcript models than for novel isoforms. Similarly, transcripts detected by many pipelines showed higher expression values, regardless of the SQANTI category (Supplementary Fig. 2c and Supplementary Data Report, pages 18 to 24), indicating that high expression is a signature of consistent detection. No association was found between consistent detection and transcript length or exon number (Pilot Data Report, pages 25 to 38).
Using LRGASP metrics we evaluated general characteristics of transcript models, including how well they are supported by the GENCODE annotation and by orthogonal data (CAGE, poly(A) motifs and Illumina reads).
We found significant differences in the definition of 5' and 3' ends when pipelines were compared (Supplementary Fig. 2d). While methods A and B tended to provide FSM transcript models with 5' and 3' ends closely matching the reference TSS and TTS, respectively, pipelines C and E showed greater variability. These differences were maintained regardless of the sequencing platform (PacBio or ONT), suggesting an algorithmic rather than a data property. Interestingly, similar distributions of distances to the closest CAGE peaks were found for both FSM (Pilot Data Report, page 13) and ISM (Pilot Data Report, page 14) transcripts when pipelines were compared. NIC are novel transcripts that contain novel combinations of annotated donor or acceptor sites; one type of NIC is intron retention. Pipelines also varied greatly in the number (Pilot Data Report, page 100) and percentage (Supplementary Fig. 2e) of transcripts showing an intron retention event, with pipelines based on ONT data having a generally higher incidence. Similarly, the percentage of NNC transcripts having at least one non-canonical splice junction varied from 0% for the B and E algorithms to values above 20% for other computational methods (Supplementary Fig. 2f).
The utilization of LRGASP evaluation metrics on the full transcript model dataset allowed us to qualitatively compare pipelines, reveal their specific biases for transcript detection, and provided a means to select candidates for experimental validation. However, as no ground truth exists in this case, formal performance metrics cannot be calculated with these data. The incorporation of spike-ins (SIRVs), simulated data and a set of highly curated GENCODE genes in the LRGASP challenge allowed for evaluation against different types of ground-truth datasets. Since our mock data included the Lexogen SIRV-Set3 with 69 spiked-in isoforms, LRGASP performance metrics can be illustrated with these data (Supplementary Fig. 2h-l).
Out of the 11 pipelines in our mock run, 8 provided predictions for SIRVs. Pipelines predicted between 26 and 75 SIRV transcripts (Supplementary Fig. 2h), indicating a great diversity in the isoform calls returned by different methods. Also, some pipelines detected many partial transcripts while others did not show this problem at all (Pilot Data Report, page 46). Since analysis pipelines could include multiple transcript models matching the same SIRV transcript, for example having the same set of junctions but different 3' or 5' ends, we introduced the redundancy and non-redundant prediction metrics to evaluate these cases. While some pipelines had consistent redundancy levels of one, indicating each SIRV was detected by one single transcript model, others had mean redundancy values greater than one (Pilot Data Report, page 56), which suggests that analysis methods follow different strategies for using reference annotation to consolidate their transcript models. Finally, metrics such as False Negatives were also evaluated (Supplementary Fig. 2i).
We identified UJCs to compare transcripts across pipelines in our pilot experiment and computed barcodes as described in the Methods section. Supplementary Fig. 3 gives examples of analyses performed based on barcode information. Supplementary Fig. 3a-f shows transcript model characteristics as a function of the increasing number of Nanopore or PacBio pipelines in which the UJC was detected. We found a number of FSMs detected by Nanopore but not by PacBio, but also a concentration of this category in the detection by both sequencing platforms. Interestingly, NIC were frequently found by only one pipeline, although there were also examples of NICs found by all PacBio pipelines (Supplementary Fig. 3b). A similar pattern, though lower in number, could be seen in fusion transcripts (Supplementary Fig. 3c).
This suggests that Nanopore might have a higher capacity for identifying transcripts present in the reference while novel transcripts are strongly pipeline and sequencing platform-specific with a slightly higher percentage of PacBio novel transcripts being robust to pipeline choices.
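The UJC idea behind these cross-pipeline comparisons can be sketched as follows. This is a minimal illustration with invented coordinates, not the consortium's implementation, and the redundancy metric is simplified to a mean count of predicted models per detected reference isoform:

```python
def ujc(exons):
    """Unique Junction Chain: the ordered tuple of intron (donor, acceptor)
    coordinates. Transcripts sharing a UJC differ only at their 5'/3' ends."""
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))

def redundancy(models_per_isoform):
    """Mean number of predicted models per detected reference isoform
    (e.g. per SIRV); 1.0 means one model per detected isoform."""
    counts = [len(models) for models in models_per_isoform.values() if models]
    return sum(counts) / len(counts)

# Two predictions with identical junction chains but different ends share a UJC
a = [(100, 200), (300, 400), (500, 650)]
b = [(120, 200), (300, 400), (500, 600)]
assert ujc(a) == ujc(b) == ((200, 300), (400, 500))

# Grouping predictions by the SIRV they match illustrates redundancy > 1
assert redundancy({"SIRV101": ["m1", "m2"], "SIRV102": ["m3"]}) == 1.5
```

Keying transcripts by their junction chain rather than by exact coordinates is what allows models with variable 5'/3' ends to be counted as the same underlying isoform.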
When looking at transcript properties, the analysis shows that PacBio recovers more transcripts with a higher predicted number of exons (Supplementary Fig. 3d) and greater length (Supplementary Fig. 3e), and that, in general, highly expressed transcripts are those identified by all pipelines regardless of the long-read sequencing platform (Supplementary Fig. 3f). This conclusion is corroborated when we looked at aggregated values per sequencing platform by setting our barcode selection to >1 for dRNA-ONT (position 2) and 0 for all others (positions 1 and 3). Transcript models predicted exclusively by dRNA-ONT were more highly expressed (Supplementary Fig. 3g), had fewer exons (Supplementary Fig. 3h) and were shorter (Supplementary Fig. 3i) than all other transcripts.
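The barcode selection described above can be sketched as a simple filter. The encoding below is a toy assumption for illustration (one detection count per platform group, with invented names and counts), not the actual barcode layout used in the study:

```python
# Each UJC carries a "barcode" of detection counts, one position per
# platform/pipeline group. Assumed layout for this sketch (0-indexed):
# 0 = cDNA-PacBio, 1 = dRNA-ONT, 2 = cDNA-ONT.
barcodes = {
    "ujc_1": (0, 3, 0),   # detected only by dRNA-ONT pipelines
    "ujc_2": (2, 1, 1),   # detected across platforms
    "ujc_3": (0, 1, 0),   # dRNA-ONT exclusive, single pipeline
}

def exclusive_to(barcodes, position):
    """UJCs detected (>0) at `position` and by no other platform group."""
    return [u for u, bc in barcodes.items()
            if bc[position] > 0
            and all(c == 0 for i, c in enumerate(bc) if i != position)]

assert exclusive_to(barcodes, 1) == ["ujc_1", "ujc_3"]
```

Transcript properties (expression, exon count, length) of the filtered set can then be compared against the rest, as in Supplementary Fig. 3g-i.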
In summary, our mock-run analysis demonstrated the ability of LRGASP metrics to highlight important differences between both experimental and computational lrRNA-seq methods.

Challenge 2 mock evaluation
To test the validity of the proposed metrics in LRGASP Challenge 2, we conducted a performance evaluation of two different pipelines (referred to as Pipelines 1 and 2) on two types of lrRNA-seq data (cDNA-PacBio and cDNA-ONT) from GM12878 cells. We evaluated the accuracy of transcript abundance estimation by examining the variation and similarity of estimates among multiple replicates using the irreproducibility, ACVC, consistency and ACC metrics (Supplementary Fig. 4a-b). In addition, SIRV Set-3 data were used to evaluate how close the estimates were to the ground-truth values using the SCC, NRMSE and MRD metrics. Pipeline 2 had the lowest coefficient of variation (ACVC=0.92) and the highest reproducibility (irreproducibility=0.07) among multiple replicates in cDNA-ONT data (Supplementary Fig. 4c), which differed from its performance on cDNA-PacBio data (ACVC=1.28, irreproducibility=0.11). This indicates that the same pipeline can perform differently on different data. In addition, both pipelines on the mock data showed a decreasing coefficient of variation with increasing transcript abundance (Supplementary Fig. 4d), so quantification of lowly expressed transcripts remains a challenge.
Similarly, Pipeline 2 had the highest ACC value (14.57) as well as the best consistency (0.98) of transcript abundance estimation across multiple replicates on cDNA-ONT data (Supplementary Fig. 4e). The consistency scores of both pipelines decreased dramatically when the abundance threshold was 5 or smaller (Supplementary Fig. 4f). Therefore, abundance estimates from both pipelines showed greater variability and error for lowly expressed transcripts.
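The replicate-variability evaluation rests on the coefficient of variation across replicates; a minimal sketch of that underlying idea is given below (the exact LRGASP definitions of ACVC, irreproducibility, ACC and consistency are not reproduced here, and the abundance values are invented):

```python
import statistics

def cv(values):
    """Coefficient of variation of abundance estimates across replicates:
    sample standard deviation divided by the mean."""
    m = statistics.mean(values)
    return statistics.stdev(values) / m if m > 0 else float("nan")

# Abundance estimates (e.g. TPM) for two transcripts across three replicates
high = [98.0, 102.0, 100.0]   # highly expressed: replicates agree well
low = [1.0, 4.0, 0.5]         # lowly expressed: much noisier

assert cv(high) < 0.05
assert cv(low) > 0.5
```

Aggregating such per-transcript CVs, stratified by abundance, reproduces the qualitative pattern in Supplementary Fig. 4d: variation shrinks as abundance grows.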
Finally, SIRV data also showed the highest correlation between the ground truth and the estimates for Pipeline 2 (SCC=0.69, Supplementary Fig. 4g), with its best performance on cDNA-ONT data (Supplementary Fig. 4h).
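The ground-truth comparison metrics can be sketched as follows. This is an illustrative implementation, not the paper's: the NRMSE here is normalized by the standard deviation of the ground truth, which is one common convention and may differ from the LRGASP definition, and ties in the Spearman ranks are not handled:

```python
import math

def spearman(x, y):
    """Spearman correlation via the classic rank-difference formula
    (assumes no ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def nrmse(truth, est):
    """Root-mean-square error normalized by the standard deviation of the
    ground truth (an assumed normalization convention)."""
    n = len(truth)
    mse = sum((t - e) ** 2 for t, e in zip(truth, est)) / n
    mean_t = sum(truth) / n
    sd = math.sqrt(sum((t - mean_t) ** 2 for t in truth) / n)
    return math.sqrt(mse) / sd

truth = [1.0, 2.0, 4.0, 8.0]   # invented SIRV-style ground-truth abundances
est = [1.2, 1.9, 5.0, 7.5]     # invented pipeline estimates
assert spearman(truth, est) == 1.0  # rank order perfectly preserved
assert nrmse(truth, est) < 0.5
```

SCC rewards getting the rank order of abundances right, while NRMSE penalizes absolute deviations, so the two metrics capture complementary aspects of quantification accuracy.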

Challenge 3 mock evaluation
For the Challenge 3 mock evaluation, we used the different long-read libraries generated for the manatee sample (Sequel I, Sequel II and three MinION; Supplementary Table 2) and processed them independently with the IsoSeq3 algorithm 4, resulting in five different "pipelines" providing transcript model predictions for the manatee. FASTA sequences were mapped to the manatee draft genome using minimap2 5. Note that Challenge 3 instructions for manatee data indicate that all reads from each sequencing platform must be combined to predict transcript models; therefore, our mock pipelines use data subsets of the actual competition. Supplementary Fig. 5 shows the results of this analysis. While the number of predicted transcript models was very different for each pipeline (Supplementary Fig. 5a), in all cases the mapping rate was high (Supplementary Fig. 5b), with PacBio transcript models reaching a 100% mapping rate, and a majority of detected transcripts were multi-exon (Supplementary Fig. 5c). The distribution of transcript lengths showed that ONT3 and both PacBio pipelines had higher values than ONT1 and ONT2, with median values varying between 1094 and 1394 nts (Supplementary Fig. 5d). Clear differences were observed between PacBio and Nanopore pipelines in the number of transcripts with complete short-read junction support (Supplementary Fig. 5e), the total number of non-canonical junctions (Supplementary Fig. 5f) and the number of transcripts containing at least one non-canonical junction (Supplementary Fig. 5g), with PacBio pipelines showing, in general, higher support and a lower incidence of non-canonical junctions. Also, the predicted coding potential for PacBio pipelines was higher than for Nanopore (Supplementary Fig. 5h). As for the BUSCO analysis, ONT pipelines returned higher numbers of BUSCO complete, incomplete and duplicated sequences than PacBio pipelines, although the relative numbers were roughly similar, except for the PB2 pipeline, which had a lower fraction of BUSCO complete genes (Supplementary Fig. 5i).
This analysis indicated that our LRGASP metrics were able to capture differences between analysis pipelines and revealed that, although in all cases transcript models could be mapped to the genome, their number, sequence and splice-site accuracy differed considerably, with ONT pipelines returning more complete transcriptomes and PacBio pipelines returning better-supported and more accurate splice sites.
In these mock evaluations for all three challenges on published GM12878 data, we highlighted the variability between sequencing platforms and computational methods. This further motivates the need for the LRGASP effort to highlight these differences in a real study and to use our benchmarks for evaluation.
The following is an overview of the data used for each challenge and the result files that were submitted (Supplementary Figs. 5-7).
Each entry must have met the following requirements:

Requirements for Challenge 1 and 2
At least one experiment must have been supplied for each sample available for a given challenge, library prep, and sequencing platform combination that is selected. Human and mouse samples have biological replicates that must have been used for the entry.
A major goal of LRGASP is to assess the capabilities of long-read sequencing for transcriptome analysis and how much improvement there is over short-read methods. Additionally, long-read computational pipelines vary in whether they use only long-read data or incorporate additional data for transcript analysis. To facilitate comparisons between long-read and short-read methods and variation in tool parameters, we broke down submissions into different categories:
• long-only - Use only LRGASP-provided long-read RNA-seq data from a single sample, library preparation method and sequencing platform.
• short-only - Use only LRGASP-provided short-read Illumina RNA-seq data from a single sample. This is to compare with long-read approaches.
• long and short - Use only LRGASP-provided long-read and short-read RNA-Seq data from a single long-read library preparation method and the Illumina platform. Additional accessioned data in public genomics data repositories can also be used.
• freestyle - Any combination of at least one LRGASP data set as well as any other accessioned data in public genomics data repositories. For example, multiple library methods can be combined (e.g. cDNA-PacBio + CapTrap-PacBio, cDNA-ONT + CapTrap-ONT + R2C2-ONT + dRNA-ONT, all data, etc.).
In all the above categories, the genome and transcriptome references specified by LRGASP were used. For the long and short and freestyle categories, additional transcriptome references could be used.
All replicates must have been used in each experiment. Challenge 2 must have reported replicates separately in the expression matrix. Each team could submit multiple entries for each challenge; however, they could only submit one entry per challenge + data type + library prep + sequencing platform combination. This was to encourage tool development that is robust to different library preps and sequencing platforms, while preventing multiple entries that are subtle parameter changes.
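The one-entry-per-combination rule can be checked mechanically. The validator below is hypothetical (not LRGASP's actual submission tooling), with invented field names:

```python
def validate_entries(entries):
    """Raise if a team uses any (challenge, data category, library prep,
    platform) combination more than once."""
    seen = set()
    for e in entries:
        key = (e["challenge"], e["category"], e["library_prep"], e["platform"])
        if key in seen:
            raise ValueError(f"duplicate entry for combination {key}")
        seen.add(key)
    return True

entries = [
    {"challenge": 1, "category": "long-only",
     "library_prep": "cDNA", "platform": "PacBio"},
    {"challenge": 1, "category": "long-only",
     "library_prep": "cDNA", "platform": "ONT"},
]
assert validate_entries(entries)  # distinct platforms, so both are allowed
```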
For Challenge 1, the submitted GTF file only contained transcripts that had been assigned a read. For Challenge 2, submitters had the option of quantifying against the reference transcriptome or a transcriptome derived from the data (i.e., results from Challenge 1). The GTF used for quantification was included as part of the Challenge 2 submission.
The type of platform and library preparation method used in a given experiment, except for freestyle experiments, was limited to data from a single library preparation method plus sequencing technology (long-only). LRGASP Illumina short-read data of the same sample could optionally be used in an experiment with the LRGASP long-read data (long and short):
• cDNA-Illumina - short-only
• cDNA-PacBio - long-only or long and short

Requirements for Challenge 3
At least one experiment had to be supplied for each sample available for a given library prep and sequencing platform combination that was selected. Mouse samples had biological replicates that were used for the entry.
Manatee samples only had cDNA library preparation type and sequencing data from Illumina, ONT, and PacBio.
For similar reasons as described above, the data used for a given experiment had to fit into one of the following categories:
• long-only - Use only LRGASP-provided long-read RNA-Seq data from a single sample, library preparation method and sequencing platform. No genome reference can be used.
• short-only - Use only LRGASP-provided short-read Illumina RNA-Seq data from a single sample. This is to compare with long-read approaches. No genome reference can be used.
• long and short - Use only LRGASP-provided long-read and short-read RNA-Seq data from a single long-read library preparation method and the Illumina platform. No genome reference can be used.
• long and genome - Use only LRGASP-provided long-read RNA-Seq data from a single long-read library preparation method. A genome reference sequence can be used.
• freestyle -Any combination of at least one LRGASP data set as well as any other accessioned data in public genomics data repositories. For example, multiple library methods can be combined (e.g. cDNA-PacBio + CapTrap-PacBio, cDNA-ONT + CapTrap-ONT + R2C2-ONT + dRNA-ONT, all data, etc.).
In all the above categories, except for freestyle, a transcriptome reference could not be used. The submitted FASTA file only contained transcripts that had been assigned a read. Each team could submit multiple entries for each challenge; however, they could only submit one entry per challenge + data type + library prep + sequencing platform combination.
LRGASP biological data was available at the ENCODE DCC.

Supplementary Methods
Additional details of all protocols for library preparation and sequencing can be found at the ENCODE DCC and are linked to each dataset produced by LRGASP (Extended Data Table 1).

Capping SIRVs
Exogenous synthetic RNA references (spike-ins) are widely used to calibrate measurements in RNA assays, but they lack the 7-methylguanosine (m7G) cap structure that most natural eukaryotic RNA transcripts bear at their 5' ends.

A total of nine good-quality RNA samples were selected to create an RNA pool. These samples included six females, one calf, one lactating female and one male, and had RIN values from 8.0 to 8.8.

Manatee genome sample preparation
The genome of the Florida manatee Lorelei was sequenced using Nanopore and PacBio. Lorelei is the same individual manatee for which an Illumina-based genome assembly was released by the Broad Institute in 2012 6 .
An EDTA, -80ºC whole-blood sample aliquot was used. gDNA was extracted from 1400 µl of blood using the DNeasy kit (QIAGEN, MD, USA) following the company's specifications for 100 µl aliquots of blood.

with RNAse inhibitor, oligo dT, dNTPs and water, incubated at 72ºC for 3 minutes, then ramped down to 50ºC for an additional 3 minutes. We then added a first-strand synthesis buffer (5x RT buffer, TSO oligo, Maxima H(-) RT and water), previously equilibrated to 50ºC, to the priming reaction. First-strand synthesis was carried out as follows: extension at 50ºC for 90 min, 85ºC for 5 min, and a hold at 4ºC. To the first-strand reaction, we then added 2x SeqAmp (Takara) reaction buffer, IS primers, water and SeqAmp polymerase. First-strand cDNA was amplified for 11 cycles as follows: 95ºC 1 min, 98ºC 15 sec, 65ºC 30 sec and 68ºC 13 min, finished by incubation at 72ºC for 10 min and a hold at 4ºC. The amplified products were purified using SPRI beads, quantified on Qubit, and checked for length distribution on the Agilent Bioanalyzer. The short-read protocol is described in the Nextera DNA Flex Library Prep Reference Guide 8

minutes. Resulting products were purified with 1.8x AMPure RNA Clean XP beads (catalog num. A63987, Beckman Coulter). After first-strand generation, the m7G cap structure at the 5' end of the transcripts was selectively captured using the CAP-trapper technique 10,11, which leads to the removal of uncapped RNAs. The diol group on the m7G cap was oxidized with 1M NaOAc (pH 4.5) and NaIO4 (250 mM). Tris-HCl (1M, pH 8.5) was added to stop the reaction and the whole reaction was purified with 1.8x AMPure RNA Clean XP beads.
Aldehyde groups were biotinylated using a mixture containing NaOAc (1M, pH 6.0) and Biotin (Long Arm)

Assembly (NEBuilder, NEB) with a 1:1 cDNA:splint ratio (100 ng each). After Gibson assembly, a linear digestion (ExoI, ExoIII and Lambda Exonuclease) was performed to eliminate non-circularized DNA. The circular Gibson assembly product was cleaned up using SPRI beads. The circularized library was used as a template for rolling circle amplification (RCA) using Phi29 polymerase and random hexamer primers.
Following the RCA reaction, T7 endonuclease was used to debranch the DNA product. A DNA clean and concentrator column was used to purify the DNA. Purified RCA product was size-selected using a 1% low melt agarose gel. The main band just over the 10 kb marker was excised from the gel and digested with beta-agarase followed by SPRI bead clean up. The cleaned and size selected RCA product was sequenced using the ONT 1D

Manatee ONT genome sequencing
Two µg of genomic DNA in a total volume of 100 µl was fragmented by the g-Tube fragmentation method (Covaris, Woburn, MA, USA) by centrifuging at 6,000x g for 1 min. The large DNA fragments were enriched by using 0.85x volume of Agencourt AMPure XP beads (Beckman Coulter, Brea, CA, USA) in the purification procedure. The enriched DNA fragments were subjected to library preparation with Nanopore Genomic DNA Ligation Sequencing Kit (Oxford Nanopore Technologies, Oxford, UK) following the manufacturer's protocol.
A total of 700 ng of final library product was loaded on a flow cell and sequenced with a Nanopore GridION sequencer (Oxford Nanopore Technologies, Oxford, UK) for a 72-hr run. A total of 5 flow-cell runs were conducted for this project.

Manatee cDNA PacBio library preparation and sequencing
Approximately 280 ng of total pooled RNA was processed according to a modified IsoSeq protocol. Switching reaction and water. Amplified cDNA was purified by AMPure: one round at a 0.8-1.0:1.0 bead-to-sample ratio and one round at a 0.65:1.0 ratio. The yield of amplified cDNA from this modified protocol (300-400 ng) was about 10-fold lower than that of the standard protocol (i.e., without globin removal). The average cDNA size was ~1400 bp. When increased amounts of cDNA were desired, the cDNA was amplified for 5 additional PCR cycles.
Two preps obtained with the above-described protocol were pooled together and 500 ng were loaded on an electrophoretic lateral fractionation system (ELF, SageScience). Fragments above 2.5 kb were collected, reamplified (10 cycles), and re-pooled equimolarly with non-size-selected cDNA fragments. This re-pooled cDNA prep is referred to as "enriched cDNA_>2.5kb". Both non-enriched cDNA and enriched cDNA_>2.5kb cDNA were used for SMRTbell library construction starting with 1 µg of cDNA as described. Two runs were done on non-size-selected manatee cDNA, while only one run was done on the cDNA that had been enriched for >2.5 kb fragments. Sequencing runs were allowed to proceed for 48 hours.

Long-read data processing
Base calling of ONT data from human, mouse and manatee was performed with Guppy 4 (Supplementary Table 2).

Reference genome and annotations
For submissions of transcript models and quantification, transcript annotations and genome models corresponding to GENCODE human v38 and mouse M27 were used. Submissions of challenge predictions were expected to end in Fall 2021, prior to the release of GENCODE human v39 and mouse M28. The newly released GENCODE annotations would, therefore, be used for the evaluations. GRCh38 was the reference genome sequence for human, and GRCm39 was used for mouse. GENCODE annotations were based on these genomes. Please note that GENCODE M25 and earlier annotation releases were based on GRCm38.

Simulated data
Simulating RNA reads simply from the reference transcriptome would only allow the assessment of the reconstruction of known transcript models. Thus, we extended both human and mouse annotations with artificial novel transcripts. To obtain those, we mapped reference transcripts of an undisclosed mammalian organism to the human and mouse genomes and converted the alignments into transcript models using SQANTI 13 . We then arbitrarily selected isoforms of known genes that have only canonical splice sites (GT-AG, GC-AG and AT-AC) and merged them into the human and mouse GENCODE Basic annotations.
To generate realistic isoform expression profiles, we selected undisclosed human and mouse long-read datasets. To simulate reads produced by different sequencing platforms, we used existing simulation methods. Illumina 2x150 bp read pairs were generated with the RSEM simulator 14 using an error model obtained from real RNA-Seq data 15 (accession number ERR1474891).
ONT reads were simulated with NanoSim 16 using the pre-trained cDNA and dRNA models available in the package, with average error rates of 15.9% (4.8% substitutions, 6.0% deletions, 5.1% insertions) and 11.2% (2.8% substitutions, 5.9% deletions, 2.5% insertions), respectively. NanoSim exploits models trained on real data to produce realistic sequencing error patterns, read-length distributions and the unaligned sequences at read ends typical of ONT sequencing. The complete list of Nanopore data characteristics is described in the Trans-NanoSim manuscript 16 . Manual inspection revealed that, as transcript truncation is done randomly in Trans-NanoSim, no 3'/5' bias is introduced. Thus, simulated ONT data may have slightly different coverage profiles compared to the real cDNA-ONT and dRNA-ONT data.
PacBio CCS reads were obtained with IsoSeqSim (https://github.com/yunhaowang/IsoSeqSim), which truncates input reference transcript sequences and uniformly inserts errors according to the given probabilities. A uniform error distribution appears to be a reasonable choice according to a previously developed tool for simulating genomic PacBio reads 17 . The error rate was estimated from the real cDNA-PacBio CCS reads obtained in this work as 1.6% (0.4% substitutions, 0.6% deletions, 0.6% insertions). To create a realistic coverage profile for read truncation in IsoSeqSim, we used pre-computed Sequel II truncation probabilities provided with the package.
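A uniform error-insertion scheme of this kind can be sketched as follows. This is a simplified illustration using the per-base rates quoted above, not IsoSeqSim's actual code:

```python
import random

def add_uniform_errors(seq, p_sub=0.004, p_del=0.006, p_ins=0.006, rng=None):
    """Walk the sequence and, at each base, substitute, delete or insert
    with the given per-base probabilities (uniform along the read)."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    bases = "ACGT"
    out = []
    for b in seq:
        r = rng.random()
        if r < p_sub:
            out.append(rng.choice([x for x in bases if x != b]))
        elif r < p_sub + p_del:
            continue  # deletion: drop the base
        elif r < p_sub + p_del + p_ins:
            out.append(b)
            out.append(rng.choice(bases))  # insertion after the base
        else:
            out.append(b)
    return "".join(out)

read = add_uniform_errors("ACGT" * 2500)  # a 10 kb synthetic transcript
# insertions and deletions (~0.6% each) roughly cancel in expectation
assert abs(len(read) - 10000) < 200
```

Because the probabilities do not depend on position, errors land uniformly along the read, in contrast to NanoSim's trained, pattern-aware error models.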
To verify the generated data, we mapped real and simulated reads to the respective genomes with minimap2 5 in spliced mode and computed empirical error rates (Supplementary Table 3). As the table shows, error rates appear similar, with the exception of the cDNA-ONT data, for which the real data sequenced in this work are more accurate than the NanoSim-generated reads. We simulated two datasets containing reads from all three platforms listed above but with slightly different properties. The human dataset was simulated with 100 million Illumina read pairs, 30 million cDNA-ONT reads and 10 million PacBio reads. The mouse dataset also contained 100 million Illumina read pairs, but equal amounts of PacBio CCS and dRNA-ONT reads were generated (20 million sequences each).
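Empirical error rates of this kind are typically derived from alignment records; a minimal sketch over a SAM-style CIGAR string and NM tag is shown below (a simplified assumption about the input, not the consortium's actual pipeline; note that introns, encoded as N operations, must be excluded from the aligned length):

```python
import re

def cigar_error_rate(cigar, nm):
    """Approximate per-base error rate from a SAM CIGAR string and NM tag.
    NM counts mismatched + inserted + deleted bases; the denominator sums
    M/I/D/=/X operations, excluding intronic N gaps and clipping."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    aligned = sum(int(n) for n, op in ops if op in "MID=X")
    return nm / aligned

# A spliced read: 980 aligned bases across a 200 nt intron, with NM=16
rate = cigar_error_rate("500M200N480M", 16)
assert round(rate, 4) == 0.0163  # matches the ~1.6% PacBio CCS estimate
```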

To allow users to simulate their own data, the methods described above are implemented as simple command-line scripts, which are available at https://github.com/LRGASP/lrgasp-simulation/.

CAGE data of WTC11 samples for validation of transcript 5' ends
To validate novel 5' ends, we used recently generated deep-coverage CAGE data on the WTC11 line.

QuantSeq data (3' end sequencing) from Challenge 1 and 2 samples was produced for validation of 3' ends and was not released until the close of the challenge submissions. Data was obtained from two RNA biological replicates of WTC11, from the exact same RNA used for long-read sequencing.
To validate novel polyadenylation sites, we collected poly(A)-seq data using the Quant-Seq method from Lexogen, which can map poly(A) sites de novo.

LRGASP Data QC
Initial quality control (QC) metrics were determined for the LRGASP data (Supplementary Fig. 1). Reads

GENCODE benchmarks and computational evaluation
Full manual annotation was undertaken on 50 selected loci on both the human and mouse reference genomes.
Transcript models were only annotated during this exercise based on their support from long transcriptomic datasets generated by the consortium specifically for LRGASP. No transcript annotation was based on transcriptomic data from externally produced datasets, although annotators used any publicly available orthogonal data to aid in the interpretation of aligned consortium data. For example, FANTOM5 CAGE datasets were used to help identify transcription start sites and transcript 5' ends, and RNA-seq-supported introns derived from high-throughput reanalysis pipelines such as Recount were used to support putative introns identified in the alignments of long transcriptomic data.

Manual annotation was performed according to the guidelines of the HAVANA (Human And Vertebrate
Analysis aNd Annotation) group 19,20 . Transcriptomic data was aligned to the human and mouse reference genome using appropriate methods. The benefits of aligning the transcriptomic data using multiple methods were tested to reduce the impact of alignment errors and artifacts.
Annotators also took advantage of local alignment tools integrated into annotation software to give further alternative views of alignments and improve annotation accuracy. Transcript models were manually extrapolated from the alignments by annotators using the otter annotation interface 21 . Alignments were navigated using the Blixem alignment viewer 22,23 and, where required, visual inspection of the dot-plot output from the Dotter tool 24 was used to resolve any alignment with the genomic sequence that was unclear or absent from Blixem. Short alignments (<15 bases) that could not be visualized using Dotter were detected using Zmap DNA Search 24 (essentially a pattern-matching tool). The construction of exon-intron boundaries required the presence of canonical splice sites (defined as GT-AG, GC-AG and AT-AC), and any deviations from this rule were given clear explanatory tags (for example, non-canonical splice site supported by evolutionary conservation). All non-redundant splicing transcripts at an individual locus were used to build transcript models, and all alternatively spliced transcripts were assigned an individual biotype based on their putative functional potential. Once the correct transcript structure was ascertained, the protein-coding potential of the transcript was determined based on its context within the locus, similarity to known protein sequences, the sequences of orthologous and paralogous proteins, candidate coding regions (CCRs) identified by PhyloCSF, evidence of translation from mass spectrometry and Ribo-seq data, the presence of Pfam functional domains, the presence of possible alternative ORFs, the presence of retained intronic sequence, and the likely susceptibility of the transcript to nonsense-mediated mRNA decay (NMD).
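The canonical splice-site rule applied during annotation can be sketched as a simple check on the intronic dinucleotides. This is a minimal plus-strand illustration with an invented toy genome; real annotation also handles the minus strand and genome sequence lookup:

```python
CANONICAL = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}

def splice_sites(genome, intron_start, intron_end):
    """Donor = first two intronic bases, acceptor = last two
    (plus strand, 0-based, end-exclusive coordinates)."""
    return genome[intron_start:intron_start + 2], genome[intron_end - 2:intron_end]

def is_canonical(genome, intron_start, intron_end):
    return splice_sites(genome, intron_start, intron_end) in CANONICAL

# Toy genome: exon | GT......AG intron | exon
genome = "AAAACCC" + "GTTTTTTAG" + "GGGAAA"
assert is_canonical(genome, 7, 16)       # GT..AG: canonical
assert not is_canonical(genome, 8, 16)   # shifted boundary gives TT..AG
```

Introns failing this check would be the ones requiring the explanatory tags described above (for example, support from evolutionary conservation).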
Although the annotation of transcript functional biotype and CDS is not required of submitters, they were added to transcripts as a matter of routine manual annotation and may be used to investigate the detection or nondetection of groups of transcripts by submitters. When necessary, annotations were checked by a second annotator to ensure the completeness and consistency of annotation between the genes annotated for LRGASP and the remainder of the Ensembl/GENCODE gene set.

Computational pipeline description from submitters
Challenge 1

Name: Bambu
Description: Bambu trains a transcript discovery model on each sample using the known reference annotation to predict whether novel aligned reads are likely to represent full-length transcripts. This optimizes several parameters relevant to transcript discovery and reduces them to a single tunable parameter customized to the specific sample transcriptome: the novel discovery rate (NDR). By ranking novel transcripts with the NDR, Bambu can extend the annotations across a large range of sensitivity and precision.

Name: FLAIR2
Description: We ran the FLAIR2 isoform discovery pipeline using the flair-collapse module with the --annotation_reliant, --check_splice, and --stringent parameters. The short-and-long submissions used short-read data to identify confident splice junctions. Notes: FLAIR2 run with the --annotation_reliant argument invokes an alignment of the reads to an annotated transcriptome first, followed by novel isoform detection. When including --check_splice, this enforces higher-quality matching specifically around each splice site for read-to-isoform assignment steps.

Name: Iso_IB
Description: An IsoSeq evidence-based approach to predict gene models, alternative splice events and isoforms using a custom path from the cDNA-Cupcake workflow.

Name: IsoTools
Notes: The filtering strategy used in the IsoTools pipeline has a significant impact on the number and nature of identified transcripts. While the strategy described above was developed specifically for the challenge, it may not be optimal for all datasets and research questions. Please consult the documentation for detailed instructions on how to use IsoTools for your specific dataset and research question. Note that the filtering syntax has been improved since pre-release version 0.2.5, which was used for the submission.

Name: LyRic
Config: LyRic was run with default parameters except as specified in its LRGASP configuration file, available at https://github.com/guigolab/LyRic/blob/master/config_LRGASP.json. The LyRic working directory was staged for LRGASP data processing using the setup_LRGASP.sh script (https://github.com/guigolab/LyRic/blob/master/setup_LRGASP.sh). All PacBio sequencing data were reprocessed from BAM files using the pb_gen pipeline (https://github.com/guigolab/pb_gen) with default parameters, except for adapter sequences (the 'PB_ADAPT' parameter), which were changed according to the LRGASP organizers' adapter sequence specifications. PacBio and ONT FASTQ files were processed by LyRic using a filter_SJ_Qscore of 30 and 10, respectively. Briefly, filter_SJ_Qscore represents the minimum average Phred sequencing quality of read sequences +/-3 nts around all their splice junctions for a spliced read to be considered for transcript model building. See the LyRic documentation (https://guigolab.github.io/LyRic/documentation.html) and the input LRGASP sample annotation file (https://github.com/guigolab/LyRic/blob/master/sample_annotations_LRGASP.tsv) for more details. Reads were merged into transcript models with the tmerge utility (https://github.com/guigolab/tmerge) using the following two-step nested approach.
First, reads were merged separately within replicates, requiring a minimum of two reads supporting each transcript model. The resulting transcript models were then merged again, this time across all three replicates of each LRGASP sample, requiring replicate-specific transcript models to be detected at least once in every replicate. Notes: LyRic's output transcript models are completely agnostic to any pre-existing reference annotation. In other words, LyRic does not adjust the coordinates of the transcript models it produces based on a reference annotation. Funding: National Human Genome Research Institute of the US National Institutes of Health (grant 2U24HG007234-09). We acknowledge support of the Spanish Ministry of Science and Innovation to the EMBL partnership, Centro de Excelencia Severo Ochoa and CERCA Programme / Generalitat de Catalunya.
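The two-step nested merge described above can be sketched as follows. For illustration, transcript models are reduced to their intron chains (single letters here); this is a minimal sketch of the strategy, not the tmerge implementation:

```python
# Step 1: within each replicate, keep models supported by >= 2 reads.
# Step 2: across replicates, keep models detected in every replicate.
from collections import Counter

def merge_within_replicate(read_intron_chains, min_reads=2):
    counts = Counter(read_intron_chains)
    return {chain for chain, n in counts.items() if n >= min_reads}

def merge_across_replicates(per_replicate_models):
    # A model must be present in all replicates of the sample to survive.
    return set.intersection(*per_replicate_models)

rep1 = ["A", "A", "B", "C"]   # intron chains observed in individual reads
rep2 = ["A", "A", "B", "B"]
rep3 = ["A", "A", "C", "C"]

per_rep = [merge_within_replicate(r) for r in [rep1, rep2, rep3]]
final = merge_across_replicates(per_rep)  # only "A" passes both filters
```

The nesting matters: a model with two reads in one replicate but zero in another is removed at step 2 even though it passed step 1.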

Name: Mandalorion
Description: Mandalorion uses PacBio or ONT-based R2C2 consensus reads. It aligns those reads using minimap2, then parses the alignments to generate isoform models. Mandalorion then generates read-based consensus sequences for each isoform using the pyabpoa and racon tools, aligns these isoform consensus sequences, and filters the isoforms based on these alignments and their abundance. The final isoforms are reported as both FASTA and PSL/GTF files.
Config: All runs were performed using the "-R 3" and "-I 150" flags, setting the minimum read number and the minimum isoform length, respectively.
Notes: Mandalorion only uses individual splice sites in any provided GTF annotation file; it discards information on how splice sites are connected into splice junctions. It also ignores annotated transcription start sites and poly(A) sites when constructing isoforms. Because Mandalorion does not rely heavily on information in the annotation file, performance is very similar for novel and annotated isoforms, and the ends of identified isoforms will agree with read alignments rather than annotated TSS and poly(A) sites. Another consequence of not relying heavily on an annotation file is that Mandalorion will not "assemble" isoforms that are longer than the provided reads, which is apparent for some of the "long SIRV" data.
Funding: NIH/NIGMS R35GM133569 to Christopher Vollmers

Name: Spectra
Description: Spectra is a tool that builds gene models from full-length cDNA reads, not fragmented or incomplete ones, guided by genome alignments. The resulting gene models are entirely (end-to-end) supported by one or more read observations.
Version: v0.1a
Team: Hideya Kawaji, Tokyo Metropolitan Institute of Medical Science
URL: https://github.com/hkawaji/spectra
Config: PacBio consensus sequences are computed with a set of helper scripts bundled in the repository.
Notes: The development of this tool was motivated by the observation that read counts of long RNA molecules are depleted in the contributed data sets, even with the best protocol. Discoveries supported by experimental evidence were maximized by selecting high-quality data and setting a minimum read count threshold.
Funding: AMED (grant number 21kk0305013h0002).
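The two Spectra criteria described above, end-to-end read support and a minimum read count, can be sketched as below. The exon-structure representation and thresholds are illustrative assumptions, not Spectra's actual code:

```python
# A gene model is reported only if at least `min_reads` reads support it
# end-to-end, i.e., a read's exon structure matches the model exactly.

def keep_model(model_exons, supporting_reads, min_reads=1):
    """model_exons: tuple of (start, end) exon pairs; supporting_reads:
    exon structures observed in individual reads."""
    full_length = sum(1 for r in supporting_reads
                      if tuple(r) == tuple(model_exons))
    return full_length >= min_reads

model = ((100, 200), (300, 400))
reads = [((100, 200), (300, 400)),   # full-length match
         ((150, 200), (300, 400))]   # truncated 5' end: does not count
print(keep_model(model, reads, min_reads=1))  # True
print(keep_model(model, reads, min_reads=2))  # False
```

Raising `min_reads` trades discovery for confidence, which is why a depleted long-molecule read count motivates careful data selection.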

Name: StringTie2
Description: StringTie2 is a guided transcriptome assembler, able to assemble either short- or long-read RNA-seq data, even in the absence of a reference annotation. Since version 2.2.0 it is also capable of handling mixed transcriptomic data that includes both short and long RNA-seq reads sequenced from the same sample.

Name: TALON pipeline (TranscriptClean, TALON, LAPA)
Description:
• TranscriptClean: TranscriptClean corrects common long-read sequencing artifacts such as microindels and mismatches.
○ Noncanonical splice junctions, if not provided in the input set of splice junctions, are corrected to the nearest canonical splice sites where possible; otherwise they are discarded.
• TALON: TALON annotates long reads to their transcripts of origin, quantifies the expression of annotated transcripts, and filters novel transcript models based on reproducibility and evidence of internal priming.
○ Any read meeting the coverage and identity filters that contains new splice sites will constitute a novel model.
○ Any read whose intron chain matches that of a reference model, or of a novel model already cataloged in the database, and whose 5'/3' ends lie within a certain distance of the cataloged model's ends, will be assigned to that model. Otherwise, it is used to create a new model with new 5'/3' ends.
○ Transcripts are quantified simply by counting the number of reads that belong to each cataloged transcript model.
○ Unannotated (novel) transcripts are filtered for reproducibility and for evidence of internal priming (see the settings used in the config section).
• LAPA: LAPA is used to refine the 3' end calls made by TALON.
• Known problems with submission:
○ For the ONT reads (dRNA, cDNA, CapTrap), the adapters were not removed, in contrast to the PacBio reads. This affected our coverage and identity filters, which require certain mapping rates to include each read. Many ONT reads with adapters still attached had large unalignable regions and were discarded, leading to erroneous results.
○ The ONT cDNA and cDNA CapTrap reads are unstranded, while our tools expect only 5'-3' oriented alignments as input. This produced many antisense transcripts that should have been called as real transcripts but were discarded by the filter that takes novelty into account.
○ Spike-in models were not included in our reference transcriptome annotation. We therefore treated them as novel transcripts subject to the reproducibility filter, leading to many of them being erroneously discarded. Our performance on the spike-ins represents our efforts at reference-free annotation and should be interpreted as such. This likely explains the discrepancies between our performance on the spike-ins and on the simulated data.
○ TALON labels transcript models using the SQANTI novelty categories. Users have the option of filtering out all ISMs, even those that pass the reproducibility and internal-priming filters. Internally, we have found that our precision on spike-ins is much better when we do so (data available on request). For this submission, however, we did not, leading to the expected high levels of ISMs reported in our dataset.
○ In general, we ran all our tools with default parameters regardless of the protocol, similar to what an average user would do.
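The read-to-model assignment rule described above for TALON (intron-chain match plus 5'/3' end proximity, with read counting for quantification) can be sketched as follows. The data structures and the distance cutoff are illustrative assumptions, not TALON's implementation:

```python
# A read is assigned to a cataloged model when its intron chain matches
# and both ends fall within `max_dist` of the model's ends; otherwise it
# founds a new model with new 5'/3' ends. Quantification = read counting.

def assign_read(read, catalog, max_dist=500):
    for model in catalog:
        if (read["introns"] == model["introns"]
                and abs(read["start"] - model["start"]) <= max_dist
                and abs(read["end"] - model["end"]) <= max_dist):
            model["count"] += 1
            return model["id"]
    new_model = {"id": f"novel_{len(catalog)}", "introns": read["introns"],
                 "start": read["start"], "end": read["end"], "count": 1}
    catalog.append(new_model)
    return new_model["id"]

catalog = [{"id": "ENST_known", "introns": ((200, 300),),
            "start": 100, "end": 500, "count": 0}]
r1 = {"introns": ((200, 300),), "start": 120, "end": 480}  # matches known
r2 = {"introns": ((210, 300),), "start": 120, "end": 480}  # new splice site
print(assign_read(r1, catalog))  # ENST_known
print(assign_read(r2, catalog))  # novel_1
```

Note how a read with the right intron chain but ends far outside `max_dist` would also create a new model, which is how the pipeline produces models with new 5'/3' ends.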

Method changes from Registered Report phase 1
While the great majority of the analyses indicated in the Registered Report are present in the final version of the manuscript, some modifications were introduced. These are listed in Supplementary Table 4.

Modification: Orthogonal data
Section: Target selection for Challenge 1
Description: We could not find a tractable and cost-manageable technique for providing reliable isoform quantity estimates. Isoform-specific qPCR was cost prohibitive and, for the targets of interest, not amenable to analysis.

Modification: Validation
Description: We added additional test categories, including novel and suspect transcripts from the GENCODE manual annotation. We report data for the human WTC11 sample. No mouse targets were validated due to insufficient material and resources.

Primers-Juju bulk RT-PCR primer design tool
To facilitate the design of a large number of RT-PCR primers for validating a subset of the predicted isoforms, we developed a tool called Primers-Juju. It provides a semi-automated interface between the visualization of the transcript models in the UCSC browser and the Primer3 primer design package.
The design process starts with a UCSC track hub containing the consolidated transcript models from all pipelines. Unique features of the transcripts to be validated are identified by visualization. A pair of genomic regions that could contain a primer pair amplifying the targeted transcript is manually defined. Each region may lie within a single exon or span a splice junction across two exons. These regions are marked using the UCSC Browser region-highlight facility. Supplementary Fig. 15 shows an example of specifying the design regions for primers targeting a unique intron.
The genomic coordinates of the pair of regions are recorded in a spreadsheet, along with the transcript identifier. Additional transcripts that would also be amplified by the primers may be included for validation.
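Parsing such a specification spreadsheet could look like the sketch below. The TSV layout and column names (`target_id`, `transcript_id`, `region_5p`, `region_3p`) are hypothetical, invented only for illustration; the actual Primers-Juju input format may differ:

```python
# Minimal sketch of reading a target-specification TSV: one row per
# target, holding a transcript identifier and the two genomic design
# regions recorded from the browser. Columns are illustrative.
import csv
import io

SPEC_TSV = """target_id\ttranscript_id\tregion_5p\tregion_3p
WTC11_t1\tLRGASP_tx_001\tchr1:1000-1100\tchr1:5000-5100
"""

def parse_region(spec):
    """Parse 'chrom:start-end' into a (chrom, start, end) tuple."""
    chrom, span = spec.split(":")
    start, end = (int(x) for x in span.split("-"))
    return chrom, start, end

targets = []
for row in csv.DictReader(io.StringIO(SPEC_TSV), delimiter="\t"):
    targets.append({
        "target_id": row["target_id"],
        "transcript_id": row["transcript_id"],
        "regions": (parse_region(row["region_5p"]),
                    parse_region(row["region_3p"])),
    })
```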
Primers-Juju provides a command-line tool that takes the specification spreadsheet, with multiple targets and transcript annotations, and performs primer design and validation. The input specifications are validated against the targeted transcripts, with minor adjustments for inexact bounds. The sequence of each transcript is obtained from the genomic sequence of its exons, and the target regions are converted to transcript coordinates.
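The genomic-to-transcript coordinate conversion mentioned above can be sketched as below; this is a plus-strand-only illustration with made-up exon coordinates, not the Primers-Juju implementation:

```python
# Map a genomic position into transcript coordinates by walking the
# transcript's exons and accumulating spliced length.

def genomic_to_transcript(exons, genomic_pos):
    """exons: sorted (start, end) half-open genomic intervals.
    Returns the 0-based transcript coordinate of genomic_pos, or None
    if the position falls in an intron."""
    offset = 0
    for start, end in exons:
        if start <= genomic_pos < end:
            return offset + (genomic_pos - start)
        offset += end - start
    return None

exons = [(1000, 1100), (2000, 2150)]       # 100 nt + 150 nt exons
print(genomic_to_transcript(exons, 1050))  # 50: inside the first exon
print(genomic_to_transcript(exons, 2010))  # 110: 100 + 10 into exon two
print(genomic_to_transcript(exons, 1500))  # None: intronic position
```

A region that spans a splice junction maps to a contiguous transcript interval even though it is split in the genome, which is what allows junction-spanning primers.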
The Primer3 programmatic library is given each transcript sequence and region pair and attempts to design a stable primer pair to amplify the transcript, returning up to five candidate primer pairs per region. The In-Silico PCR command-line tool is then used to check for potential off-target primer pairs. Queries are run against the genome sequences and against transcriptome sequences consisting of the known annotations as well as the LRGASP consolidated transcripts.
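The selection step that follows, keeping only specific pairs and recommending the most stable one, can be sketched as below. The penalty values and hit counts are illustrative assumptions, not output of Primer3 or In-Silico PCR, and this is not the Primers-Juju code:

```python
# Among candidate primer pairs, drop any whose in-silico PCR query
# returns more than one amplicon (off-target), then pick the pair
# with the lowest Primer3-style penalty (lower = more stable).

def pick_primer_pair(candidates):
    """candidates: dicts with 'penalty' and 'n_hits' (amplicon count
    from a genome + transcriptome in-silico PCR query)."""
    on_target_only = [c for c in candidates if c["n_hits"] == 1]
    if not on_target_only:
        return None  # no specific pair: target needs a redesign
    return min(on_target_only, key=lambda c: c["penalty"])

candidates = [
    {"name": "pair_1", "penalty": 0.2, "n_hits": 3},  # stable but off-target
    {"name": "pair_2", "penalty": 0.5, "n_hits": 1},
    {"name": "pair_3", "penalty": 1.1, "n_hits": 1},
]
best = pick_primer_pair(candidates)  # pair_2: specific and most stable
```

Screening against the transcriptome as well as the genome is what catches pairs that would co-amplify other annotated or consolidated isoforms.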
Primers-Juju generates additional tracks for the hub with primer pairs and amplicon sequences, as seen in Supplementary Fig. 16. It also produces reports and recommends the most stable primer with no off-target hits to order.
Primers-Juju source code is available at https://github.com/diekhans/PrimerS-JuJu/ and was developed by the University of California, Santa Cruz (UCSC) and the Centre de Regulació Genòmica (CRG).

Data and code availability
Biological sequencing data are available from the ENCODE Portal (https://www.encodeproject.org/) and are described in the RNA-Seq data matrix (Extended Data Table 1).
Experimental data used in GENCODE manual evaluation:
• ssCAGE WTC11 GEO: GSE185917 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE185917
• WTC11 QuantSeq

Supplementary Fig. 6: Challenge submissions. a, Overview of submissions to Challenges 1 and 2. Each entry was derived from a specific data category, library prep, and sequencing platform combination. All available samples for the selected combination must be included in an entry. b, Overview of submissions for Challenge 3.

Supplementary Fig. 8. Flow diagram of Challenge 1: Transcript isoform detection with a high-quality genome. Samples, library prep methods, and sequencing platforms used in the challenge are indicated at the top. Participants select which data category, library prep, and sequencing platform to analyze, run their pipelines to generate transcript predictions, and submit an entry that includes predictions for all samples. The entries include a GTF file of the transcript models and a TSV file that assigns the reads that supported each transcript model.

Supplementary Fig. 9. Flow diagram of Challenge 2: Transcript isoform quantification. Samples, library prep methods, and sequencing platforms used in the challenge are indicated at the top. Participants select which data category, library prep, and sequencing platform to analyze, run their pipelines to generate transcript predictions, and submit an entry that includes predictions for all samples. The entries include a GTF file of the transcript models that are quantified and a TSV file of the expression quantification. The H1 and endodermal cell samples were released after the initial submission deadline, and participants were required to submit their quantification after the deadline.

Supplementary Fig. 10. Flow diagram of Challenge 3. Samples, library prep methods, and sequencing platforms used in the challenge are indicated at the top. Participants select which data category and sequencing platform to analyze, run their pipelines to generate transcript predictions, and submit an entry that includes predictions for all samples. The entries include a FASTA file of the transcript models and a TSV file that assigns the reads that supported each transcript model.

Supplementary Fig. 11. Flow diagram of the evaluation for Challenge 1. Benchmarks and the additional orthogonal data used for the evaluation are indicated. For example, CAGE and QuantSeq data from WTC11 cells were generated and made available only after participant submissions; they therefore represent "hidden" data. These were used to define 5' transcript starts and 3' ends.