A new long-read RNA-seq analysis approach identifies and quantifies novel transcripts of very large genes

RNA-seq is widely used for studying gene expression, but commonly used sequencing platforms produce short reads that only span up to two exon-junctions per read. This makes it difficult to accurately determine the composition and phasing of exons within transcripts. Although long-read sequencing improves this issue, it is not amenable to precise quantitation, which limits its utility for differential expression studies. We used long-read isoform sequencing combined with a novel analysis approach to compare alternative splicing of large, repetitive structural genes in muscles. Analysis of muscle structural genes that produce medium (Nrap −5kb), large (nebulin - 22 kb) and very-large (titin - 106 kb) transcripts in cardiac muscle, and fast and slow skeletal muscles identified unannotated exons for each of these ubiquitous muscle genes. This also identified differential exon usage and phasing for these genes between the different muscle types. By mapping the in-phase transcript structures to known annotations, we also identified and quantified previously unannotated transcripts. Results were confirmed by endpoint PCR and Sanger sequencing, which revealed muscle-type specific differential expression of these novel transcripts. The improved transcript identification and quantification demonstrated by our approach removes previous impediments to studies aimed at quantitative differential expression of ultra-long transcripts.

Long-read sequencing technologies such as Pacific biosciences (PacBio) isoform sequencing (IsoSeq) are able to sequence full-length transcripts up to 10 kb in length (Wang et al. 2016), with a better error rate than other popular emerging long read sequencing technologies, such as Oxford Nanopore (Buck et al. 2017). However, the entire length of longer transcripts such as titin and nebulin still cannot be captured within maximum IsoSeq read lengths. In addition, IsoSeq is a transcript discovery tool as the bioinformatics pipeline does not support transcript quantification.
Here, we have addressed above deficits of the PacBio long-read IsoSeq technology and used it to compare alternative splicing of transcripts between different murine muscle types: fast skeletal, slow skeletal and heart (Supplementary Figure S2 and Table S1). Our study focuses on the large structural genes: titin, nebulin and Nrap to test the limits of IsoSeq. We present a new analysis approach to identify and quantify in-phase transcript structures or consecutive multi-exon stretches of these three transcripts. We have successfully phased exons of known splicing patterns, discovered multiple novel patterns and determined the percent expression of each transcript between tissue subtypes. To our knowledge, this is the first study to apply an exon phasing approach that results in accurate quantification of very long and difficult transcripts. This method improves the utility of IsoSeq, by expanding its potential beyond isoform discovery, to analysis of differential expression.

Oligo-dT internal priming provides coverage across large transcripts (> 10kb)
For preliminary analysis between muscle types, the aligned data for all three samples were surveyed in the integrated genomics viewer (IGV). Genes with transcripts close to the 6 kb average read length were often fully covered by a single continuous read (e.g. Nrap, Myomesins, Myosins). For larger genes, such as titin (106 kb transcript) and nebulin (22 kb transcript), the entire genomic loci were tiled with reads up to the 5' end of the gene (Figure 1).
This was unexpected given that IsoSeq uses a poly-A capture method. However, the read coverage was not evenly distributed. Dramatic shifts in read depth, visually similar to an upsidedown "staircase" pattern were observed when viewing the aligned reads across the gene.
Upon close inspection, reads that aligned upstream of the poly-A region (last 3' exon) of the gene consistently started on specific exons in all three samples. Focusing in on these exons (e.g. exon 137 of nebulin), the 3' ends of these reads aligned within 20 bases upstream of multiple stretches of 'A' bases intermixed in the genomic sequence ( Figure 1B). These reads were likely generated from cDNAs synthesized with internally primed oligo-dT primers. Internal priming has long been described with the use of oligo-dT primers for cDNA generation (Nam et al. 2002). With older technologies such as expressed sequence tags and microarrays, internal priming is viewed as a source of false splice variants. Modern oligo-dT primers used in our study reduce internal priming by adding a random addition of two non-T bases to the 3' end to anchor them to the polyA tail. However, stretches of 'A' bases within an exon followed by non-A bases can circumvent anchoring oligo-dT primers. These observations suggest that coverage of ultra-long transcripts can benefit from internal priming in long-read sequencing if the gene contains enough exons that support it. Other large genes such as Obscurin (27 kb transcript) do not have enough exons that support internal priming for complete coverage (data not shown).
Internal priming also produces reads that do not represent complete mRNA transcripts, creating a source of variability in exon coverage across the length of the gene that may be problematic for exon-based analyses using short-read RNA-Seq. Nebulin is on the anti-sense DNA strand and is oriented in the right (5') to left (3') direction. The mouse nebulin transcript (NM_010889.1) can be up to 22.4kb, or 2x the maximum read length (10kb). Red arrow, the first location upstream of the 3'-end to map several reads due to internal priming. B) Zoomed image of exon 137 showing reads produced from cDNAs generated by internal priming of oligo-dTs. Reference sequence shows consecutive 'A' bases followed by A. 167,630 bp  52,167,640 bp  52,167,650 bp  52,167,660 bp   40 bp   52,140 kb  52,160 kb  52,180 kb  52,200 kb  52,220 kb  52,240 kb  52,260 kb  52,280 kb  52,300 kb  52,320 kb  52,340 kb   204 kb   chr2   qA1  qA2 qA3  qB  qC1.1  qC1.3  qC3  qD  qE1  qE2  qE4  qF1  qF3  qG2  qH1 qH3 qH4

5' 3'
Nebulin -Exon 137 (NM_010889) non-A bases downstream of where the reads align. Purple lines are coding/exon regions of a reads. Thin connecting blue lines signify intronic sequences that were spliced out. The black lines, dots, and arrowheads are indel errors.

Observing splice transcripts across samples
To survey splice differences between tissues, nebulin transcripts were scanned for variable regions. The 3' region of nebulin is known to be differentially spliced between muscle types, and our long-read sequencing data confirms those reports (Donner et al. 2004). A stretch of exons between 137 and 152 was highly variable (Figure 2), and three novel exons were detected (between exon 147 and 148 of transcript NM_010889.1) that are unannotated in the RefSeq mouse database. Two of these were annotated in the more comprehensive mouse Gencode mouse database (mm10, release M10), but were part of two short nebulin transcripts. One of three exons between 147 and 148 was unannotated (u-002). These three exons are conserved and expressed in 3 out of 4 human nebulin transcripts (RefSeq). Four exons between exon 85 and 86 of NM_010889.1 (or exons 2, 3, 4 and 5 of ENSMUST00000028320.13) were expressed in all Soleus and EDL transcripts and also conserved in human (data not shown). These results suggest that long-read isoform sequencing is able to resolve complex splice isoforms in highly repetitive genes such as nebulin. Identifying unannotated exons and quantifying differential exon usage.
To identify unannotated exons and compare differential exon usage between tissues, we first selected for genes with at least 10 cluster reads produced from the IsoSeq method. This threshold resulted in 686 out of 48,440 genes that were adequately covered for further analysis. To reduce the complexity of the splice analysis, we collapsed the gencode (release M10) annotation file using a python script included in the DEXSeq R package, `dexseq_prepare_annotation.py`. This  Next, we developed the exCOVator script to perform a preliminary exon-based analysis (exons are treated independently from their neighbors) to determine exon usage between very large transcripts that are not fully covered by the IsoSeq maximum read-length. The script identifies unannotated exons and calculates percent spliced-in (PSI) values for each exon from multiple samples at once using the HT-Seq python library (Supplementary Figure S3C and S3D).
Unannotated exons are identified by screening reads for exons missing from the meta transcript of the collapsed annotation file. This identified 1,218 unannotated exons that were added into the meta transcripts of the collapsed gencode annotation file. Using this new collapsed annotation file, sample cluster reads that matched the annotated exonic parts were counted. The full-length read counts from each cluster read was extracted and used for PSI calculations for each exonic part. Finally, the full-length PSI was compared across samples. We selected for exonic parts that were at least 20% difference in PSI between at least two out of three samples to reduce artifactual differences. This resulted in 2,631 unique exonic parts differentially spliced in 433 genes between at least two of the three muscle samples.
For this study, we focused only on analyzing three large muscle structural genes: 1) Nrap, that produces ~6kb transcripts which fall within average read-lengths, 2) nebulin, that produces ~22kb transcripts that are over twice as long as the maximum read length and 3) titin, that produces 106kb transcripts that are over 10 times the read-length.

Novel muscle type specific splicing patterns in nebulin related anchoring protein
Differential usage of exon 12 of Nrap was observed between all three muscle tissues. Lu et al.
described the expression and alternate splicing of Nrap in skeletal and cardiac muscle during development (Lu et al. 2008), and identified two splice variants: 1) Nrap-s, a splice transcript of Nrap that includes exon 12 (relative to RefSeq annotation NM_008733.4) and is exclusively expressed in skeletal muscle and 2) Nrap-c, a splice transcript that is missing exon 12 and is exclusively expressed in cardiac muscle. The present analysis confirms their findings that there is no expression of exon 12 in Nrap transcripts in cardiac tissue (0/567 reads). In contrast to their findings, our data suggests that Nrap-c is not exclusively expressed in cardiac tissue. Both Nrap-s and Nrap-c splice isoforms are expressed in skeletal muscle and there is a 53% higher usage of exon 12 among the transcripts in the predominantly fast-twitch EDL (576/862 reads; 67% PSI) compared to the slow-twitch soleus muscle (377/2621 reads; 14% PSI; shown as exonic part 38; Figure 3A and 3B).
To validate the findings, we performed RT-PCR and Sanger Sequencing using the same primers in the publication(Supplementary Table 2) (Lu et al. 2008), and total RNA extracted from muscles of two independent mice ( Figure 3). Our RT-PCR results confirm the IsoSeq analysis data, showing: 1) low usage of exon 12 in soleus transcripts (more 444 bp product than 549 bp), 2) high usage of exon 12 in EDL transcripts (more 549 bp product than 444bp) and 3) no usage of exon 12 in cardiac transcripts (single 444 bp product). Sanger sequencing of the excised bands further confirmed the splice products ( Figure 3D). By corroborating the findings of the prior study, the present results establish that PacBio long-read IsoSeq can be used as a semiquantitative analysis approach for differential exon usage between samples. Additionally, the IsoSeq data extends the prior data and show differential exon usage of exon 12 among different skeletal muscle types, as both Nrap-s and Nrap-c transcripts are expressed in fast and slow skeletal muscle at varying amounts. Not only is there alternative splicing of Nrap transcripts between cardiac, and skeletal muscles, but also within skeletal muscle types. This suggests that Nrap-s and Nrap-c are misnomers.

Nebulin exon 138 is differentially used between slow and fast skeletal muscle
Our preliminary observations and exCOVator found multiple differentially used exons between 137 and 152 of nebulin (NM_010889.1; Figure 2, 4A and 4B). We used exon 138 to validate differential exon usage in this region ( Figure 4C and 4D). The results showed a single RT-PCR product at ~500 bp in soleus muscles, suggesting that exon 138 is present and that there is a consistent retention of all exons between 135 and 139 in the majority of transcripts in the soleus muscle ( Figure 4C). The EDL muscles had two product sizes, a ~500 bp product similar to the soleus and a smaller ~400 bp product missing exon 138. These results suggest differential exon usage of 138 in skeletal muscles which is excluded from a subset of EDL transcripts, but not in the soleus.
Cardiac muscle has low expression of nebulin (Bang and Chen 2015). However, we identified a prominent single ~500 bp product similar in size to the product in the soleus ( Figure   4C). This product could be the result of the higher sensitivity of RT-PCR compared to IsoSeq.
However, the soleus band is overloaded making size determination more ambiguous. It is possible that some sensitivity was lost by selecting for cDNAs > 5kb prior to IsoSeq. Sanger sequencing of the amplicon products show that the cardiac splice product is identical to what is found in the EDL and soleus ( Figure 4D), which rules out possible mis-priming to nebulette, a nebulin gene family member that is highly expressed in the heart.   Figure 5A and 5B). Both Gencode and RefSeq databases show this exon to be constitutively expressed. Therefore, the present data suggest titin exon 191 may be an unannotated cassette exon that is specifically removed from a subset of cardiac muscle transcripts. RT-PCR data shows that exon 191 is retained in Soleus and EDL transcripts by the presence of a single product matching the predicted size of 852 bp ( Figure 5C). However, the cardiac sample shows three different RT-PCR products. The 615 bp product is predicted to only be missing exon 191. The 882 bp product contains all exons between 190 and 194. The product in between at ~780 bp is an unknown splice product. Attempted Sanger sequencing of the unknown product resulted in truncated sequences and remains ambiguous ( Figure 5D; Cardiac Band 2). This is likely because the product is a heteroduplex between the smaller 615 bp and larger 882 bp splice products. Overall, these data confirm that titin exon 191 is an unannotated cassette exon that is only removed from a subset of mouse cardiac transcripts. Figure 5. Titin exon 191, found to be an undocumented cassette exon that is spliced out in a

Annotation and quantification of transcript structures
One of the strengths of long-read sequencing is the ability to phase multiple neighboring exons to determine transcript splice patterns within the same read. However, the exon-based approach used here to identify and quantify differentially used exons does not account for phasing. To address this, we developed another tool, exPhaser, to quantify and annotate splicing patterns of larger transcript structures. The exPhaser script takes in a bed file of exons and sample files (bams) and determines the splicing pattern of input exons within each sample read. Then it outputs a table of FL-read counts for each splicing pattern, and all transcript annotations that match the pattern. As exon input for exPhaser, cassette exons were selected that defined the isoform according to known annotations and exCOVator output (e.g. novel exons and differentially used exons). We also filtered out unannotated isoforms with < 10 FLreads or after visually determining in IGV that they were read/alignment errors.

Exons Nebulin
Percent of FL-Reads After analyzing differential exon usage using exCOVator, the results showed that differential splicing of nebulin only occurred in the 3' half of the transcript. We also surveyed other regions upstream known to be alternatively spliced such as exons 65-67 and 83-87. No differential splicing was detected between EDL and Soleus in these exons. In addition, we analyzed the splicing pattern of two mutually exclusive nebulin exons referred in the literature as exon 127¨ and 128¨ ( Figure 6C) (Donner et al. 2006). Translated to our exon numbering using NM_010889, exon 127¨ = 123 and exon 128¨ is not annotated in RefSeq mm10, but instead annotated as exon 7**** of ENSMUST00000148356 in the mouse Gencode database. Our phasing results show that transcript structures in the EDL all include exon 7**** (155/155 FL reads). On the other hand, the Soleus expresses a mixture of transcripts that contain exon 7**** and exon 123, but with slightly more transcripts expressing exon 7**** at 63% (514/817 FL-reads) than exon 123 at 37% (303/817 FL reads) which agrees with previous studies (Donner et al. 2006).

Phasing multiple transcript structures of titin
For titin, we could not phase all differentially spliced exons at once because they are too far apart to be contained within the length of a single read. To illustrate this, we compared the expression of titin N2-A (NM_011652, principle skeletal isoform) vs N2-B (*NM_028004, principle cardiac isoform) transcripts between the three muscle types, using the N2-A isoform for exon numbering unless noted. The N2-B transcript excludes exons 47-167 of N2-A, but includes an alternate exon 45*. Initially, we sought to phase transcripts using exon 45* (N2-B specific), 46, 47 (N2-A specific) and 168. However, the distance between exon 45* and 168 in N2-A transcripts is much greater than the 10kb maximum read length. Therefore, we phased the skeletal and cardiac specific exons using two groups of exons ( Figure 7A and 7B Next, we phased four other sets of differentially used exons in titin including exon 191, exon 312, exon 45 ‡ of ENSMUST00000099980 and exons 11, 12 and 13 ( Figure 7E). First, we phased exon 191, the validated cassette exon that is only removed from a subset of cardiac muscle transcripts ( Figure 6B). The cardiac phasing data were similar to the RT-PCR results  Additionally, the exon is in-frame in both species. These data support that alternative splicing of exon 312 is conserved and that it is a cassette exon that is only removed from a subset of skeletal muscle transcripts.
There are some notable differences between exon 312 in mouse and humans that do not affect the conclusions. The human equivalent of exon 312 (LRG exon 363) is larger with an additional 7 bp on 3' end and 53 bp on the 5'end of the exon. However, the extended human exon 312 remains in-frame because its flanking exons are also different lengths and accommodate for the discrepancy.
Third, we phased the alternative 3' exon 45 ‡ of ENSMUST00000099980. After manually inspecting the reads in IGV, we noted numerous partial reads of the large exon 45 ‡ (data not shown). Therefore, we decided to phase the exon (only 50 bp of 5' end) and its upstream calculations on 3' exons using short-read data.
Lastly, we phased exons 11, 12 and 13 of titin (and their neighboring exons 10 and 14) and observed that the most highly expressed transcripts in each of the three tissues have not previously been annotated ( Figure 7E). Our results show that there are extensive differences in the usage of exons 11, 12 and 13 in the three muscle types. Exon 11 is not used in any titin transcripts expressed in the two skeletal muscles and is likely that the N2-A annotation may be incorrect. In contrast, cardiac muscle makes extensive use of exon 11, which appears in 86.7% of expressed transcripts. Exons 12 and 13 seem to be co-expressed within skeletal muscle, but only in less than half of Soleus (40.8%) and very rarely in EDL (1.7%) transcripts. Exon 12 is rarely used in heart (9.5%), but exon 13 is expressed in more than half (69.6%) of expressed transcripts. This suggests that peptides coded by all three exons are dispensable for titin's function in fast skeletal muscle since 98.3% of transcripts in EDL do not express these exons.
There is also similar expression of exon 13 in slow skeletal and cardiac muscle. However, slow skeletal muscle includes exon 12 in more transcripts than cardiac muscle with the inverse expression pattern seen for exon 11.
All three exons are conserved in humans and retain the same numbering. Savarese et al show that exon 11 is constitutively spliced out in human which agrees with our mouse data.
They also show exon 12 to have an average of 54% inclusion and exon 13 with a 79% inclusion in skeletal muscle transcripts. Our soleus muscle data agrees with theirs, showing a similar exon inclusion rate of 40.8%. However, in the EDL muscle, exon 12 is only found in 1.7% of transcripts which is a large discrepancy. Our mouse exon 13 results also conflict with their human data. We found exon 13 to be co-expressed in transcripts with exon 12, sharing the same inclusion rate of 1.7% in EDL is vastly lower than their average of 79% inclusion in skeletal muscle. It's likely that these discrepancies are due to the averaging of values across multiple skeletal muscles in the Savarese et al study.
Exon 11 is expressed in left ventricle tissue at 59% in DCM and 69% in GTEx according to Cardiodb. This agrees with our analysis that cardiac tissue includes exon 11 in 86.7% of transcripts, the highest inclusion rate out of our three muscle tissues. The database reports that exon 12 is included in 79% of left ventricle transcripts in the DCM study and 75% in GTEx.
These inclusion rates of exon 12 are much higher than our study in which the exon is included in only 9.5% of mouse cardiac transcripts. Exon 13 is expressed in 96% of DCM and 95% of GTEx transcripts, but only 72.5% in mouse cardiac transcripts. We see some agreement between species such as exon 11 being included exclusively in transcripts expressed in heart, but the other two exons seem to be included at higher rates in humans. These differences could also be due to species specific splicing or different sampling locations of the heart.

DISCUSSION
Our work here has developed a unique approach to identify and quantify novel isoforms of ultralong transcripts between samples using long-read sequencing and the IsoSeq approach. On its own, IsoSeq has so far only been used for isoform discovery purposes. Use of our exon-based (exCOVator) approach enables quantifying differential exon usage and identifying novel exons in large structural protein transcripts. We showed this for three such proteins of specific importance in muscle: titin, nebulin and Nrap, without resorting to additional short-read data.
This reduces cost, but more importantly, we specifically show that this method is useful for transcripts with highly complex and repetitive sequences, which are difficult to resolve using short-read sequencing and other molecular assays. Using our exon phasing (exPhaser) approach, we determined splicing patterns between cassette exons to generate transcript structures. The exon pattern of these structures was mapped to known annotations for identification and used to count reads to quantify their expression in each sample.
Nrap transcript sizes (~5kb) are within our 6kb average read length making them an ideal test sample for our study. There are also fewer known splice isoforms, with one described as unique to cardiac (Nrap-c) and another to skeletal muscle (Nrap-s) (Mohiddin et al. 2003;Lu et al. 2008). These isoforms were validated by our exon-based analysis. We found that Nrap-c (exon 12 excluded) was indeed the only isoform expressed in heart (Figure 3). We also confirmed that Nrap-c is not exclusively expressed in cardiac muscle, but is differentially expressed in skeletal muscles as well, with higher expression of Nrap-c in the slow-twitch Soleus than in the fast-twitch EDL muscle ( Figure 3C). Nrap-s (includes exon 12) was also differentially expressed in the two skeletal muscle types with an inverse pattern to Nrap-c. This muscle type difference went undetected in previous studies using a mixture of lower limb muscles (Lu et al. 2008). By phasing every known cassette exon in Nrap, we quantified all known Nrap isoforms in the three muscle types. Doing so also increased our sensitivity for rare transcript isoforms, leading to the detection of an isoform exclusively expressed in heart and soleus muscles ( Figure 6A). This rare transcript lacks both exon 12 and exon 2, the latter being one of two exons that code for the N-terminal LIM domain known to interact with proteins that link the actin cytoskeleton to the cell membrane (Zhang et al. 2001). The LIM domain is also responsible for Nrap's role in myofibrillogenesis during development (Carroll et al. 2001).
Alternative splicing of exon 2 may modulate the binding affinity of the Nrap LIM domain to these myofibrillar and linker proteins.
At 22 kb, nebulin is over twice the length of IsoSeq's maximum read length. This resulted in reads of partial nebulin transcripts. However, internal priming enabled tiled coverage across the entire length of the transcripts. We only detected alternative splicing of the 3' half of the gene encoding the super repeat and Z-disk regions of the protein. As these alternatively spliced exons are close enough in proximity, they fit within a single read and were phased together. In silico translation of mouse nebulin used an older and more inclusive annotation containing 166 exons which is most similar to ENSMUST00000238749.1 in Gencode release M22 with 165 exons (Buck et al. 2010). However, this transcript is likely not accurate as it includes both exons 127 and 128, which are expressed in a mutually exclusive manner (Donner et al. 2006;Lam et al. 2018). Thus, we used the only mouse nebulin RefSeq annotation (NM_010889.1/ENSMUST00000036934.11) that contains 157 exons as a reference and below we clarify the annotation further.
Using our exon-based analysis on IsoSeq data, we detected differential splicing of multiple nebulin exons between 122-152 (NM_010889.1). Exons 13-132 (13-137 of ENSMUST00000238749.1) code for the super repeat region and exons 134-157 (139-165) code for the Z-disk region of nebulin in mouse. The majority of differential splicing of nebulin was detected within the Z-disk region which included an unannotated exon u-002 (Figure 1, 4A, and 6C). Exon u-002 is in-frame and conserved in human nebulin transcripts (see footnote). In the fast-twitch EDL, most of the exons of the Z-disk region are removed while the slow-twitch soleus included the majority of exons at a higher percentage.
The Z-disk plays a key role in contractile force transmission within and across sarcomeres. Differential splicing of the nebulin Z-disk region correlates with Z-disk width being thinner in white muscles and thicker in red muscles (Rowe 1973). Independent observations also found that the slow soleus and fast tibialis cranialis muscles corroborate our splicing data, suggesting that nebulin may have a functional role regulating Z-disk width (Buck et al. 2010).
We also detected alternative splicing of exon 127 and 128 (ENSMUST00000238749.1), which are located in the super repeat region right outside of the Z-disk anchorage point. They are expressed in a mutually exclusive manner, are conserved in humans (NM_001271208), and code for peptides with different charge and hydrophobicity, suggesting distinctive functional roles (Donner et al. 2006;Lam et al. 2018). In humans, these exons show developmentally regulated inclusion (Lam et al. 2018). The peptides in the super repeat region of nebulin are binding sites for kelch-like family member 40 (KLHL40), loss of which causes a nemaline-like myopathy (Garg et al. 2014). In turn, KLHL40 is suggested to stabilize nebulin and Leimodin 3 (LMOD3, another protein implicated in nemaline myopathy) by acting like a chaperone and maintaining proper folding of these two proteins during muscle contraction (Garg et al. 2014).
Using exPhaser we identified and quantified multiple novel nebulin transcripts in both the Z-disk and 3' end of super repeat region ( Figure 6C). Thus, our phasing data presents a catalog of more rigorous splice combinations. It is worth noting that none of the current nebulin transcripts annotated in RefSeq and Gencode release M10 (see footnote), are expressed in the EDL and Soleus muscles.
Titin transcripts (~106 kb) are over tenfold longer than maximum IsoSeq read lengths.
This did not thwart the ability to determine differential exon usage through our exon-based exCOVator analysis. However, the maximum read length put limitations on the exPhaser analysis. The phasing analysis required breaking up the exons into multiple groups based on proximity. We then quantified transcripts known to be cardiac and skeletal muscle specific in mouse. This helped uncover multiple novel splicing patterns expressed in specific muscle types.
Using this strategy together with Cardiodb, TITINdb and other resources (Savarese et al. 2018;Roberts et al. 2015;Laddach et al. 2017), we compared our mouse findings to human studies to determine if alternative splicing of these exons are conserved.
Cardiac specific cassette exon 191 (NM_133378, N2-A) in human (243 LRG) codes for one of multiple immunoglobulin-like (Ig-like) domains in the highly extensible I-band region of titin (Linke 2017). Ig-like domains can quickly unfold and refold, a property important for muscle elasticity and passive stiffness (Rief et al. 1997). This is especially relevant in heart where stiffness can affect the ejection fraction. It's possible that this cassette exon is differentially spliced to adjust the length of the I-band region of titin and stiffness of sarcomeres in the heart (Opitz et al. 2004).  (Bang et al. 2001). It is speculated to play a role in myofibrillar signaling during muscle development and cardiac disease (Bang et al. 2001). Novex-3 is expressed in all striated muscles, but not smooth muscles (Bang et al. 2001). It is interesting that we found it expressed more in skeletal muscle compared to cardiac muscle as it is often labeled as a cardiac isoform.
Exons 11, 12 and 13 of titin are conserved in humans with the same numbering (LRG).
They code for z-repeat domains 4, 5, and 6 respectively in the N-terminal region of titin imbedded in the Z-disk. These exons are variably spliced in human and may be involved in assembling Z-disks of variable width (Gautel et al. 1996). Titin's Z-repeats interact with alphaactinin in the Z-disk, and have been proposed to be important for Z-disk organization, assembly and force transmission between adjacent sarcomeres (Joseph et al. 2001). Alternative splicing of these exons seems to correlate with Z-disk width much like the Z-disk region of nebulin; transcripts produced by the fast EDL muscle excludes more exons compared to Soleus. In addition, exon 11 of titin is also always spliced out in skeletal muscle in a similar manner to humans (Savarese et al. 2018).
IsoSeq is generally recommended for transcript isoform discovery rather than for quantitative measures. In contrast, we found that counting full-length reads extracted from uncollapsed cluster reads was quantitative based on our validations. However, there are false positives caused by spurious reads that require manual inspection in a genome browser.
Recent improvements to sample preparation and data processing in IsoSeq 3 (SmrtLink v8) allow improved read length and data quality. The use randomized primers could also open the door for more evenly distributed coverage for ultra-long transcripts but would require some changes in the data processing pipeline.
Our study analyzed alternative splicing at a bulk tissue level, but it is unknown if the alternatively spliced transcripts are expressed by different muscle fibers in the muscle or by the same fiber. Use of single-cell isoform sequencing approaches that combine single-cell RNA-seq with IsoSeq would be needed to answer these questions (Gupta et al. 2018). Further, new methods for isolating intact ultra-long transcripts and converting them to cDNA would improve alternative splicing analysis of large structural genes. Improved direct RNA sequencing methods being developed may also help circumvent this limitation (Garalde et al. 2018).
We successfully applied our strategy to study three important muscle structural genes that produce complex, repetitive, ultra-long transcripts. It is difficult at best, to assess the alternative splicing of these transcripts with short-read data, and current long-read analysis approaches, are also not yet optimized. Our strategy provides a potential solution to these issues and enables simultaneous ultra-long transcript isoform discovery and quantitation, opening the door to differential expression analysis of difficult transcripts.

FOOTNOTE
Our analysis was based on Gencode release M10 (Ensembl 85) which matches the latest Gencode genome reference used to map the IsoSeq data. At the time of writing, we checked the latest genode annotation (release M22, Ensembl 97) to make sure that our findings were up to date. We found no annotation changes that would affect our analysis of titin and Nrap.
However, we found that two new inferred transcripts (ENSMUST00000238749.1 and  Assessing RNA quantity and quality. RNA was quantified using a Qubit Fluorometer (v2.0) and RNA broad range assay kit (ThermoFisher, #Q10210). Quality of total RNA was assessed on an Agilent BioAnalyzer 2100 using the RNA 6000 Nano kit (Agilent, # 5067-1511).
All samples had a RIN > 7.6. The purified total RNA samples were stored in -80°C.  -356-200) was loaded onto the system. During sequencing loading optimization (first 8 SMRTcells), two concentrations were used on four cells each: 1.0nM and 1.2nM. The 1.2nM concentration resulted in a higher yield of total reads of interest and the concentration was used for the remaining SMRTcells. The pooled IsoSeq libraries were sequenced using a total of 24 SmrtCells (#100-171-800) on the PacificBiosciences RSII machine (output statistics in Table   S1).

Data processing and alignment
Raw data (bax.h5) was processed according to the official Iso-Seq protocol using SmrtAnalysis (v2.3) with a few modifications to help process the large amount of data. Briefly, the fasta files produced from individual SmrtCells after the circular consensus step (CCS) were merged into two fasta files composed of eight SmrtCells. The two merged files were demultiplexed into the three muscle samples based on their barcodes yielding six total files. Barcodes numbers were converted to zero-based according to protocol (dT_BC1àBC0, BC1àdT_BC2, and dT_BC3àBC2). The six files were processed through classify and cluster steps in the IsoSeq method to produce fastq files. Fastq files belonging to each muscle were concatenated producing three large fastq files (e.g. soleus.fastq, edl.fastq, cardiac.fastq) ready for alignment.
To allow for this type of merging, cluster numbers (c1, c2, c3 etc) from each fastq file's read names were renumbered for each consecutive fastq file to avoid overwriting redundant cluster numbers. See mergeFastq.py script in github repository. Finally, the alignment steps were performed using STAR v2.5.0a and GENCODE primary mouse sequence and annotation (GRCm38) version M10 (Ensembl 85) released 2016-07-19. All data processing steps were performed on the Colonial One George Washington University HPC using the Slurm workload manager.

Reverse-Transcription PCR and Sanger Sequencing
Primers were designed using NCBI Primer-Blast (Table S2). RT-PCR reactions were performed using SuperScript III One-Step RT-PCR system (ThermoFisher, #12574030) following manufacturing protocol. Samples used in validation (mouse 2 and 3) are independent from the sequenced sample (mouse1). The amplification products were loaded into a 2% agarose gel made with 1x TBE buffer and 5μl ethidium bromide for electrophoresis and imaging.
The RT-PCR products were excised from the agarose gel and purified using Qiagen gel purification kit and sent along with primers to Quintarabio for Sanger Sequencing. Alignments of the sequences were visualized in IGV after using the BLAT tool.

Bioinformatics analysis
Scripts for identifying novel exons, differential exon usage analysis, exon phasing and transcript structure quantification were created using python and the HT-Seq library. For full details of the analysis approach see supplementary methods and the GitHub page (see data availability section).

DATA ACCESS
All raw and processed sequencing data generated in this study have been submitted to the NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE138362. All scripts, detailed notes, links and archived versions of PacBio protocols used in this project are made available in our GitHub repository (https://github.com/puapinyoying/isoseq_manuscript_resources).

ACKNOWLEDGEMENT
We would like to thank Adam K. L. Wong, Keith Crandall and the Colonial One HPC team at George Washington university for providing technical support and computational resources for the project. Hiroki Morizono for bioinformatics lessons and support. Linda Werling for writing and student support. Healther Locovare for helping us optimize the IsoSeq bench protocol for our project. We would also like to thank past and present members of the Eric Hoffman lab and Genetic Medicine department at Children's National Hospital for helpful discussions. This work was supported by the National Institutes of Health R01NS029525.

DISCLOSURE DECLARATION
No conflicts of interest.