Full-length transcript sequencing of human and mouse identifies widespread isoform diversity and alternative splicing in the cerebral cortex

Alternative splicing is a post-transcriptional regulatory mechanism producing multiple distinct mRNA molecules from a single pre-mRNA. Alternative splicing has a prominent role in the central nervous system, impacting neurodevelopment and various neuronal functions as well as being increasingly implicated in brain disorders including autism, schizophrenia and Alzheimer’s disease. Standard short-read RNA-Seq approaches only sequence fragments of the mRNA molecule, making it difficult to accurately characterize the true nature of RNA isoform diversity. In this study, we used long-read isoform sequencing (Iso-Seq) to generate full-length cDNA sequences and map transcript diversity in the human and mouse cerebral cortex. We identify widespread RNA isoform diversity amongst expressed genes in the cortex, including many novel transcripts not present in existing genome annotations. Alternative splicing events were found to make a major contribution to RNA isoform diversity in the cortex, with intron retention being a relatively common event associated with nonsense-mediated decay and reduced transcript expression. Of note, we found evidence for transcription from novel (unannotated) genes and fusion events between neighbouring genes. Although global patterns of RNA isoform diversity were found to be generally similar between human and mouse cortex, we identified some notable exceptions. We also identified striking developmental changes in transcript diversity, with differential transcript usage between human adult and fetal cerebral cortex. Finally, we found evidence for extensive isoform diversity in genes associated with autism, schizophrenia and Alzheimer’s disease. Our data confirm the importance of alternative splicing in the cerebral cortex, dramatically increasing transcriptional diversity and representing an important mechanism underpinning gene regulation in the brain. We provide these transcript-level data as a resource to the scientific community.


Introduction
Alternative splicing is a post-transcriptional regulatory mechanism producing multiple distinct molecules, or RNA isoforms, from a single mRNA precursor. In eukaryotes, alternative splicing dramatically increases transcriptomic and proteomic diversity from the coding genome and represents an important mechanism in the developmental and cell-type specific control of gene expression. The mechanisms involved in alternative splicing include the use of alternative first and last exons, exon skipping, alternative 5' and 3' splice sites, mutually exclusive exons and intron retention 1 . These phenomena are relatively common, influencing the transcription of >95% of human genes 2 . Importantly, because alternatively spliced transcripts from a single gene can produce proteins with very different, often antagonistic, functions 3,4 , there is increasing interest in the role of RNA isoform diversity in health and disease 5 ; the correction of alternative splicing deficits has been shown to have dramatic therapeutic benefit in spinal muscular atrophy 6 . Alternative splicing appears to be particularly important and prevalent in the central nervous system 7 , where it impacts upon neurodevelopment 8 , aging 9 and key neuronal functions 10 . Of note, mis-splicing is a common feature of many neuropsychiatric and neurodegenerative diseases 11 with recent studies highlighting splicing differences associated with autism 12 , schizophrenia (SZ) 13 and Alzheimer's disease (AD) 14 .
A systematic analysis of the full complement of transcripts across tissues and development is an important step in understanding the functional biology of the genome. For example, transcript-level annotation can be used to improve the interpretation of the functional consequences of rare genetic variants 15 . Current efforts to characterize RNA isoform diversity are constrained by the fact that standard short-read RNA-Seq approaches cannot span full-length transcripts, making it difficult to characterize the diverse landscape of alternatively spliced transcripts 16 .
[…] reported in this paper were based on the subset of SQANTI2-filtered transcripts unless otherwise indicated, although both unfiltered and filtered datasets are provided to the community as a resource and as genome browser tracks (see Web Resources). We subsequently complemented our whole-transcriptome Iso-Seq data with short-read RNA-Seq (Illumina) data and additional full-length transcriptome data generated from nanopore sequencing (ONT) on an overlapping set of human cortex samples. An overview of the methods and datasets used in our analysis is given in Supplementary Figure 3. Taken together, our analysis represents the most comprehensive characterization of full-length transcripts and transcript diversity in the human and mouse cortex yet undertaken.

SMRT sequencing can accurately quantify patterns of gene expression in the human and mouse cortex
Following stringent quality control (QC) of our data, SMRT sequencing reads mapped to 12,832 (human cortex) and 13,450 (mouse cortex) 'annotated' genes already present in existing genomic databases (Table 1, Supplementary Figure 4). Gene expression patterns from Iso-Seq reflected expected transcriptional profiles for the brain regions profiled. Using the Human Gene Atlas database 23 , for example, we found that the most abundantly expressed genes (top 500, ranked by transcripts per million (TPM)) in the human cortex Iso-Seq dataset were most significantly enriched for 'prefrontal cortex' genes (odds ratio = 5.91, adjusted P = 1.93 × 10^-35) (Supplementary Figure 5 and Supplementary Table 4).
Likewise, using the Mouse Gene Atlas database 23 , we found that the most abundantly expressed genes in the mouse cortex Iso-Seq dataset were most significantly enriched for 'prefrontal cerebral cortex' (odds ratio = 5.47, adjusted P = 1.58 × 10^-19) (Supplementary Table 4). Although the Iso-Seq method has been shown to be accurate at characterizing the diversity of RNA molecules present in a sample 24 , its sensitivity for quantifying levels of gene expression has not been systematically explored.
We therefore generated highly-parallel short-read RNA-Seq (Illumina) data on human fetal cortex samples (n = 3, total 153M mapped reads), which represent a subset of the human cortex Iso-Seq data, and mouse cortex (n = 8, total 128M mapped reads, Supplementary Table 5) samples, finding a strong correlation between gene-level expression quantified using the two methods in both datasets despite the relatively low number of sample comparisons (human fetal cortex: n = 9,223 genes, corr = 0.58, P < 2.23 × 10^-308; mouse cortex: n = 12,978 genes, corr = 0.80; P < 2.23 × 10^-308,

Transcript-level analysis of gene expression identifies widespread RNA isoform diversity amongst expressed genes in the cerebral cortex
In total, we identified 42,645 unique transcripts (mean length = 2.6kb, s.d = 1.3kb, range = 0.082-11.8kb) in the human cortex and 51,159 unique transcripts (mean length = 2.9kb, s.d = 1.6kb, range = 0.08-15.9kb) in the mouse cortex (Table 1). As expected, transcripts were enriched near to annotated Cap Analysis Gene Expression (CAGE) peaks derived from the FANTOM5 25 dataset, which facilitates the mapping of transcripts, transcription factors, transcriptional promoters and enhancers (human cortex: mean distance from CAGE peak = 542bp downstream, 30,978 (72.6%) transcripts located within 50bp of a CAGE peak, Figure 3; mouse cortex: mean distance from CAGE peak = 247bp downstream, 35,781 (69.9%) transcripts located within 50bp of a CAGE peak, Figure 3), and were also located proximal to annotated transcription start sites and transcription termination sites ([…] Table 6), with the majority of genes (human cortex: n = 8,599 (66.7%), mouse cortex: n = 9,612 (70.7%)) characterized by more than one RNA isoform, and a notable proportion of genes characterized by more than ten isoforms (human: n = 443 (3.4%), mouse: n = 670 (4.9%), Figure 3). The gene displaying greatest RNA isoform diversity in human cortex was MEG3, a maternally expressed imprinted long non-coding RNA (lncRNA) gene involved in synaptic plasticity 26 […] Table 4), an interesting observation given the role that RNA-binding proteins (RBPs) themselves play in regulating tissue-specific patterns of alternative splicing 28 […] Figure 3, Supplementary Figure 13), suggesting that they would have been hard to detect using traditional short-read RNA-Seq because of the difficulty in assembling transcripts with limited read coverage 29 .

Validation of novel transcripts identified by Iso-Seq using short-read RNA-seq and nanopore sequencing
To validate the novel multi-exonic transcripts identified in our Iso-Seq datasets, we first used exon-spanning reads obtained from the highly-parallel short-read RNA-Seq data generated for an overlapping subset of samples. Strikingly, these data supported all splice junctions for the majority of novel transcripts (human fetal cortex: 5,618 (89.2%) junctions, mouse cortex: 15,195 (82.9%) junctions), with less than 1% of transcripts having no support from short-read RNA-Seq data (Supplementary Figure 16). Next, we interrogated publicly-available Iso-Seq data 30 from an Alzheimer's disease (AD) brain sample, processing the data through the same analytical pipeline (see Methods). Of the 13,931 novel transcripts identified in our human cortex dataset and mapped to annotated genes, 5,817 (41.76%) were also detected in this single AD dataset. Finally, we used an alternative long-read sequencing method (nanopore sequencing, see Methods) to generate additional long-read transcript sequences for a subset of human cerebral cortex samples (n = 2, 40.7 million reads, 23,609 polished transcripts (mean length = 1.39kb, s.d = 0.97kb, range = 0.085-7.47kb) mapping to 9,762 genes).
Overall, transcriptional patterns were very similar between the PacBio and ONT datasets (Supplementary Figure 17) with a large proportion of novel transcripts of annotated genes from the Iso-Seq dataset also detected in the ONT dataset (human cortex: 7,081 (50.83%) of novel transcripts).
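The cross-dataset validation percentages reported above are simple set overlaps. The following is a minimal sketch of that arithmetic, assuming transcripts from the two datasets have already been matched to shared identifiers (for example by their splice-junction chain); the function name and the toy identifiers are hypothetical and are not part of our pipeline.

```python
def percent_detected(novel_ids, other_ids):
    """Return (count, percent) of novel transcripts also found in another dataset.

    Both arguments are iterables of transcript identifiers that are assumed
    to be comparable across datasets (an assumption of this sketch).
    """
    novel = set(novel_ids)
    if not novel:
        return 0, 0.0
    shared = novel & set(other_ids)
    return len(shared), 100.0 * len(shared) / len(novel)


# Toy identifiers (hypothetical), not real transcript IDs.
iso_seq_novel = {"tx1", "tx2", "tx3", "tx4"}
ont_transcripts = {"tx2", "tx4", "tx9"}
count, pct = percent_detected(iso_seq_novel, ont_transcripts)
print(count, pct)

# Arithmetic check against the counts reported in the text.
print(round(100 * 7081 / 13931, 2))  # 50.83 (ONT dataset)
print(round(100 * 5817 / 13931, 2))  # 41.76 (AD Iso-Seq dataset)
```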

A subset of cortex-expressed transcripts represent fusion events between neighbouring genes
Transcriptional read-through between two (or more) adjacent genes can produce 'fusion transcripts' 31 that represent an important class of mutation in several types of cancer 32 .
Although fusion events are thought to be rare 33 , we found that ~0.4% of transcripts included exons from two or more adjacent genes (human cortex: n = 153 fusion transcripts associated with 114 genes (0.89%); mouse cortex: n = 219 fusion transcripts associated with 160 genes (1.19%)). A number of genes were associated with more than one fusion transcript (human cortex: n = 23 (20.2%) genes; mouse cortex: n = 40 (25%) genes), and we identified examples of fusion transcripts encompassing more than two genes, e.g. a fusion transcript incorporating exons from three adjacent pseudogenes in the human cortex, AC138649.4_AC138649.1_PDCD6IPP1 (Supplementary Figure 18). The majority of fusion transcripts identified in our Iso-Seq data were supported by short-read RNA-Seq generated on both mouse cortex (n = 212 (96.8%) transcripts) and human cortex (n = 53 (98.1%) transcripts associated with 47 genes (0.49%)). We also confirmed many specific fusion events using our human cortex ONT nanopore sequencing dataset (n = 54 (35.29%) transcripts). Furthermore, several of the fusion genes identified in the human cortex (n = 5 (4.4%) genes) and mouse cortex (n = 7 (4.4%) genes) were predicted as potential 'conjoined-genes' in the ConjoinG database 34 . Although the majority of fusion events were specific to the human or mouse cortex datasets, we found evidence of protein-coding fusion transcripts incorporating exons from TMEM107 and VAMP2 in both species (Figure 4). Of note, both of these genes are known to be associated with rare neurodevelopmental disorders 35 , and these fusions were supported by RNA-Seq junction-spanning reads in both species and ONT sequencing reads from the human cortex.

Identification of novel cortex-expressed genes using long-read sequencing
Although the vast majority of transcripts identified in both the human and mouse cortex were assigned to annotated genes (human cortex: 99.9% of total transcripts; mouse cortex: 99.7% of total transcripts), a small number were not and potentially represent novel genes (human: n = 60 novel transcripts mapping to 49 novel genes; mouse: n = 156 novel transcripts mapping to 131 novel genes) (Supplementary Table 9). These novel genes were all multi-exonic (human: mean length = 2.0kb, s.d = 0.9kb, range = 0.4-4.9kb, mean number of exons = 2.9; mouse: mean length = 1.7kb, s.d = 1.1kb, range = 0.3-6.9kb, mean number of exons = 2.5), with over half the identified transcripts from these genes predicted to be non-coding (human: n = 40 (66.7%) novel-gene transcripts; mouse: n = 86 (55.1%) novel-gene transcripts), and shorter than annotated genes (human cortex: W = 1.6 × 10^6, P = 1.5 × 10^-4; mouse cortex: W = 5.7 × 10^6, P = 2.8 × 10^-25, Supplementary Figure 19). The overall expression of these novel genes was lower than that of annotated genes (human cortex: W = 5.1 × 10^5, P = 9.8 × 10^-18; mouse cortex: W = 1.5 × 10^6, P = 7.7 × 10^-59), although a number were characterized by relatively high expression (Supplementary Figure 19). Although the majority of these novel genes did not show high homology with other genomic regions, BLAST analysis identified 18 (21.7%) homologous (greater than 500bp, more than 90% identity) novel-gene transcripts in human cortex and 26 (16.0%) homologous novel-gene transcripts in mouse cortex (Supplementary Table 10). Of the 60 novel-gene transcripts identified in the human cerebral cortex, 28 (46.7%) were also identified as novel by the GTEx consortium (CHESS v2.2 annotation) 36 . Furthermore, our matched short-read RNA-Seq data fully supported novel genes identified in a subset of human cerebral cortex samples (human fetal cortex: n = 22 novel transcripts mapping to 20 novel genes).
Ten (16.67%) of the putative novel-gene transcripts identified in the human cortex were also supported by transcripts present in a publicly-available Iso-Seq dataset from an AD brain sample 30 . Finally, further evidence of transcription from a large proportion of these novel-gene transcripts (human cortex: 28 (46.67%)) was provided by our ONT nanopore sequencing datasets generated on an overlapping set of human cortical samples. We used the FANTOM5 CAGE dataset to show that about a quarter of the novel-gene transcripts (human cortex: n = 13 (21.7%), mouse cortex: n = 39 (25.0%)) were located within 50bp of a CAGE peak (Supplementary Table 9). Of note, there was an enrichment of antisense transcripts amongst those mapping to novel genes (human cortex: n = 34 transcripts (56.7%) mapping to 27 novel genes; mouse cortex: n = 78 transcripts (50%) mapping to 62 novel genes) (Supplementary Table 9). The majority of the antisense novel genes were found within an annotated gene (human cortex: n = 21 (77.8% of antisense novel genes), mouse cortex: n = 58 (93.5% of antisense novel genes)), with a relatively large proportion of these sharing exonic regions (exon-exon overlap, human cortex: n = 8 (38%), mouse cortex: n = 42 (72.4%)), reflecting sense-antisense (SAS) pairs 37 . Furthermore, there were striking examples of antisense novel genes that shared exons from two genes in the mouse cortex (Figure 4, Supplementary Figure 20). Although the majority of novel genes were specific to either human or mouse cortex, we identified one common protein-coding novel gene that overlapped, and was located upstream of and antisense to, E2F3 (identity = 83.1%, E-value = 0, alignment length: 1,926bp, mismatches: 205bp, Supplementary Table 11). This common E2F3-associated novel gene was strongly conserved between both species with a similar transcript structure of two exons, and was supported by RNA-Seq in both human and mouse cortex (Figure 4).

Many transcripts map to long non-coding RNA genes and a subset of these were found to contain open reading frames
Although the majority of transcripts mapping to annotated genes were classified as protein-coding by the presence of an ORF in SQANTI2 (human cortex: n = 39,352 (92.4%) transcripts associated with 11,959 genes; mouse cortex: n = 47,757 (93.6%) transcripts associated with 12,748 genes), a relatively large number of transcripts mapped to genes annotated as encoding lncRNA (human cortex: n = 1,545 transcripts associated with 829 genes; mouse cortex: n = 1,041 transcripts associated with 587 genes). lncRNA transcripts were found to be longer than transcripts not defined by reference genome as lncRNA (non-

Alternative splicing events make a major contribution to RNA isoform diversity in the cortex
Alternative splicing (AS), the process by which different combinations of splice sites within a messenger RNA precursor are selected to produce variably spliced mRNAs, is the primary mechanism underlying transcript diversity in eukaryotes 41 and a major source of transcriptional diversity in the central nervous system 10

Intron retention is a relatively common form of alternative splicing in the cortex that is associated with reduced expression and nonsense-mediated mRNA decay (NMD)
Intron retention (IR), the process by which specific introns remain unspliced in polyadenylated transcripts, is the least understood AS mechanism in vertebrates 44 but is hypothesized to be a particularly important mechanism of transcriptional control in the brain 45 . We found evidence for IR in a relatively large proportion of genes (IR-genes) in both the human cortex (n = 5,752 IR-transcripts associated with 2,625 (20.4%) detected genes) and mouse cortex (n = 4,216 IR-transcripts associated with 2,279 (16.8%) genes) (Supplementary Table 13 […] Table 14) (human cortex: n = 194 (7.4% of genes with IR-transcripts, 1.5% of total detected genes), mouse cortex: n = 125 (5.5% of genes with IR-transcripts, 0.9% of total detected genes)). Overall, there was considerable overlap in the list of IR-genes between human and mouse cortex (Figure 5), with 786 homologous genes showing evidence of IR in both the human (62.0% of IR-genes) and mouse (53.2% of IR-genes) cortex. Importantly, a larger proportion of lowly expressed genes showed evidence for IR than highly expressed genes in both human (< 2.5 log10 TPM, n = 2,332 (88.8%) genes; > 2.5 log10 TPM, n = 293 (11.2%) genes) and mouse (< 2.5 log10 TPM, n = 2,030 (89.1%) genes; > 2.5 log10 TPM, n = 249 (10.9%) genes, Figure 5) cortex, corroborating previous analyses suggesting that IR is associated with reduced transcript abundance 46 . Nonsense-mediated mRNA decay (NMD) acts to reduce transcriptional errors by degrading transcripts containing premature stop codons 47 and is one mechanism by which IR can influence gene expression 48 . Overall, about a tenth of transcripts mapping to annotated genes (human cortex: n = 5,062 (11.9%) transcripts associated with 2,420 (18.9%) of genes; mouse cortex: n = 4,944 (9.7%) transcripts associated with 2,264 (16.8%) of genes) were predicted to undergo NMD (NMD-transcripts), characterised by the presence of an ORF and a coding sequence (CDS) end motif before the last junction.
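The NMD criterion stated above (an ORF whose CDS ends before the last junction) can be written as a simple predicate. This is a minimal sketch only: the function name, the transcript-coordinate representation and the strict boundary comparison are assumptions made for illustration, and the sketch is not the SQANTI2 implementation.

```python
def predicted_nmd(exon_lengths, cds_end, has_orf=True):
    """Flag a transcript as a predicted NMD target.

    exon_lengths: exon lengths in 5'->3' transcript order.
    cds_end: 1-based transcript coordinate of the last CDS base.
    has_orf: whether an ORF was predicted for the transcript.

    Returns True when an ORF is present and the CDS ends upstream of the
    last exon-exon junction. Treating a CDS that ends exactly at the
    junction as not upstream is an assumption of this sketch.
    """
    if not has_orf or len(exon_lengths) < 2:
        # No ORF, or a mono-exonic transcript with no junction at all.
        return False
    # Transcript coordinate of the last base of the penultimate exon.
    last_junction = sum(exon_lengths[:-1])
    return cds_end < last_junction


# Toy transcript with three exons of 100, 200 and 300 bases.
print(predicted_nmd([100, 200, 300], cds_end=250))  # CDS ends in exon 2
print(predicted_nmd([100, 200, 300], cds_end=450))  # CDS ends in last exon
```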
These NMD-transcripts were found to be more lowly expressed presumably reflecting the fact that these genes were sequenced closer to saturation in our datasets. Despite the overall stability in cortical RNA isoform diversity between human and mouse, there were some notable exceptions for specific genes (Supplementary Table 6).
The largest absolute difference in numbers of multi-exonic isoforms detected between human and mouse cortex was observed in the genes encoding SORBS1 (human cortex: n =

Developmental changes in cortical RNA isoform abundance
Our human cortex Iso-Seq dataset was generated using samples derived from both fetal and adult donors, enabling us to identify developmental variation in transcript diversity. Overall, we detected 23,191 transcripts mapping to 9,647 annotated genes in the fetal cortex (mean length = 2.8kb, s.d = 1.3kb, range = 0.1-11.8kb) and 27,842 transcripts mapping to 10,949 annotated genes in the adult cortex (mean length = 2.9kb, s.d = 1.2kb, range = 0.08-9.5kb), with a high degree of overlap in detected genes (n = 8,078 (83.7% of fetal annotated genes, 73.7% of adult annotated genes)) between datasets (Supplementary Figure 4). Using the Human Gene Atlas database 19 , we found that the most abundant genes (top 500, ranked by TPM) in the fetal cortex dataset were most significantly enriched for 'fetal brain' (odds ratio = 6.15, P = 3.32 × 10^-25) and those in the adult cortex were most significantly enriched for 'prefrontal cortex' genes (odds ratio = 6.37, P = 7.[…] We next calculated differences in the expression of specific RNA isoforms for genes robustly detected (>20 TPM) in both the fetal and adult cortex (see Methods). Of note, we identified 1,424 transcripts mapping to 1,083 genes that were classified as 'fetal specific' (i.e. they were not detected in the adult cortex). Likewise, 1,062 transcripts mapping to 798 genes were classified as 'adult specific' (i.e. they were not detected in the fetal cortex). For 222 genes (6.09% out of 3,648 detected genes with at least one transcript showing >20 TPM), we identified a switch in the dominant isoform, i.e. differential transcript usage, between human fetal and human adult cortex (Supplementary Table 16 […] corroborating previous studies suggesting that IR plays a role in the developmental regulation of gene transcription in the brain 53 .
Furthermore, although genes with IR-transcripts were generally more lowly expressed, they were more highly expressed in the fetal cortex than the adult cortex (W = 1.1 × 10^6, P = 2.87 × 10^-8, Supplementary Figure 40). GO analysis of the 913 genes uniquely associated with IR in the fetal cortex showed that the most enriched molecular function was also 'RNA binding' (odds ratio = 1.99, adjusted P = 5.6 × 10^-11, Supplementary Table 4).
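The notion of a switch in the dominant isoform between fetal and adult cortex, described in this section, can be illustrated with a short sketch. It assumes per-gene dictionaries of transcript TPM values and takes the dominant isoform to be the most abundant transcript above 20 TPM; the function names, data layout and toy values are hypothetical and are not taken from our analysis code.

```python
def dominant_isoform(tpm_by_transcript, min_tpm=20.0):
    """Return the most abundant transcript above min_tpm, or None."""
    expressed = {t: v for t, v in tpm_by_transcript.items() if v > min_tpm}
    if not expressed:
        return None
    # Ties are resolved by dictionary order (an arbitrary choice here).
    return max(expressed, key=expressed.get)


def dominant_switches(fetal, adult, min_tpm=20.0):
    """Find genes whose dominant isoform differs between two conditions.

    fetal, adult: {gene: {transcript: TPM}}.
    Returns {gene: (fetal_dominant, adult_dominant)}.
    """
    switches = {}
    for gene in fetal.keys() & adult.keys():
        f_dom = dominant_isoform(fetal[gene], min_tpm)
        a_dom = dominant_isoform(adult[gene], min_tpm)
        if f_dom is not None and a_dom is not None and f_dom != a_dom:
            switches[gene] = (f_dom, a_dom)
    return switches


# Toy expression values (hypothetical gene and transcript names).
fetal = {"GENE_A": {"A.1": 120.0, "A.2": 30.0}, "GENE_B": {"B.1": 55.0}}
adult = {"GENE_A": {"A.1": 25.0, "A.2": 90.0}, "GENE_B": {"B.1": 70.0}}
print(dominant_switches(fetal, adult))
```

In this toy example only GENE_A is reported, because its most abundant transcript changes from A.1 to A.2, whereas GENE_B keeps the same dominant isoform.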

Differential transcript usage across human fetal brain regions
We next generated Iso-Seq data on two additional fetal brain regions (hippocampus and striatum) from matched donors (Supplementary Table 1). Although the sequencing depth for these additional brain regions was lower than that obtained for the fetal cortex (fetal hippocampus: 0.483M CCS reads, 8,416 transcripts mapping to 5,606 genes; fetal striatum: 0.547M CCS reads, 9,678 transcripts mapping to 6,035 genes, Supplementary Table 17), we used these datasets to explore fetal transcriptional differences across the hippocampus, striatum and cortex. Amongst transcripts mapping to annotated genes, we again identified both known (hippocampus: n = 7,261 (86.28%) transcripts; striatum: n = 8,118 (83.88%) transcripts) and novel (hippocampus: n = 1,155 (13.72%) transcripts; striatum: n = 1,560 (16.12%)) transcripts (Supplementary Table 17). As expected, there was considerable overlap in genes detected across the three brain regions (3,385 transcripts associated with 2,650 genes based on TPM>20), although a notable subset of transcripts were uniquely expressed in each brain region (cortex: n = 2,180; hippocampus: n = 1,502; striatum: n = 2,346, Supplementary Figure 41). We further identified striking evidence for differential transcript usage across brain regions for a subset of genes; dominant isoform switches between cortex and hippocampus (n = 5 genes), cortex and striatum (n = 19 genes) and striatum and hippocampus (n = 6 genes) were observed (Supplementary Table 18

Widespread isoform diversity in genes associated with brain disease
Alternative splicing has been increasingly implicated in health and disease, and is recognized to play a prominent role in brain disorders hypothesized to involve the cerebral cortex including autism 12 , schizophrenia (SZ) 13 and AD 14 . There has been considerable progress in identifying genes associated with these disorders using genome sequencing and genome-wide association study (GWAS) approaches 55 , although the full repertoire of RNA isoforms transcribed from these genes in the cortex has not been systematically characterized. First, we used the human GWAS catalogue database 23 to interrogate the most transcriptionally diverse genes in the human cerebral cortex, finding them to be enriched for genes implicated in relevant GWAS datasets ('Alzheimer's disease (late onset)': odds ratio = 6.25, P = 0.04, 'autism spectrum disorder or schizophrenia': odds ratio = 2.39, P = 0.01, 'schizophrenia': odds ratio = 2.49, P = 0.005, Supplementary Table 4). Second, we assessed RNA isoform diversity in genes robustly associated with AD (three familial AD genes 56 and 59 genes nominated from the most recent GWAS meta-analysis 57,58 ), autism (393 genes nominated as being category 1 (high confidence) and category 2 (strong candidate) from the SFARI Gene database https://gene.sfari.org/), and SZ (339 genes nominated from the most recent GWAS meta-analysis 59 ). Amongst disease-associated genes detected in the cortex, we found evidence for considerable isoform diversity (human cortex: 2,765 transcripts were mapped to 619 disease-associated genes; mouse cortex: 3,846 transcripts were mapped to 687 disease-associated genes, Supplementary Table 19). The vast majority of disease-associated genes detected in the cortex were characterized by more than one RNA isoform in both the human (n = 472 (76.3%) genes) and mouse cortex (n = 561 (81.6%) genes).
MEF2C (AD-associated) and TCF4 (autism- and SZ-associated) were the most "isoformic" disease genes in both human (MEF2C: n = 19 isoforms; TCF4: n = 40 isoforms) and mouse (Mef2c: n = 36 isoforms; Tcf4: n = 83 isoforms) cortex; of note, both genes have been shown to be key members of transcriptional networks associated with neuropsychiatric disease 60 . Importantly, a large number of the transcripts mapping to disease-associated genes had not been previously annotated in existing databases in human (n = 994 (35.9%) isoforms) and mouse cortex (n = 1,654 (43%) isoforms), identifying novel transcripts that may have potential relevance to understanding neurodegenerative and neuropsychiatric disorders. Interestingly, transcripts from disease-[…] Table 20).

Discussion
We used long-read isoform sequencing to characterize full-length cDNA sequences and generate detailed maps of alternative splicing in the human and mouse cerebral cortex. We identify considerable RNA isoform diversity amongst expressed genes in the cortex, including many novel transcripts not present in existing genome annotations. Of note, we detect full-length transcripts from several previously unannotated genes in both the human and mouse cortex, and many examples of fusion transcripts incorporating exons from neighbouring genes. Although global patterns of RNA isoform diversity appear to be generally similar between human and mouse cortex, we identified some notable exceptions with certain genes showing species-specific transcriptional complexity. Furthermore, we also identify some striking developmental changes in transcript diversity, with certain genes characterized by differential transcript usage between fetal and adult cortex. Importantly, we show that genes associated with autism, schizophrenia and Alzheimer's disease are characterized by considerable RNA isoform diversity, identifying novel transcripts that might play a role in pathology. Our data, which are available as a browsable resource for the research community (see Resources), confirm the importance of alternative splicing in the cortex and highlight its role as an important mechanism underpinning gene regulation in the brain.
Our findings highlight the power of novel long-read sequencing approaches for transcriptional profiling. By generating reads spanning entire transcripts it is possible to systematically characterize the repertoire of expressed RNA isoforms and fully assess the […] show that read-through RNA transcripts (or gene fusion transcripts), formed when exons from two genes fuse together, occur at relatively high levels in the cortex. Although many of these fusion transcripts appear to be associated with NMD, many have the potential to be translated into proteins or may have a regulatory effect at the RNA level. Despite gene fusion transcripts having a well-documented role in several human cancers 32 , the systematic analysis of gene fusion and read-through transcripts has been limited to date given the limitations of existing short-read sequencing technologies 63 . Our data support recent findings suggesting that read-through transcripts occur naturally 64 , and suggest that some fusion transcripts may have protein-coding potential, with important implications for brain disease.
Third, we are able to highlight the extent to which alternative splicing events make a major contribution to isoform diversity in the cortex. In particular, we show that IR is a relatively common form of alternative splicing in the cortex that is associated with reduced expression and NMD. Importantly, IR was more prevalent in the fetal human cortex than in the adult human cortex, supporting previous studies suggesting that IR plays a role in the developmental regulation of gene transcription in the brain 45 . Finally, we highlight major developmental changes in cortical isoform abundance in the human brain. In particular, we identify striking examples of differential transcript usage between fetal and adult cortex, and also highlight differences in isoform expression between different regions of the human brain.
Our results should be interpreted in the context of several limitations. First, we profiled tissue from a relatively small number of human and mouse donors. Although we found highly consistent patterns of alternative splicing across these biological replicates and rarefaction curves confirmed our sequencing dataset was close to saturation, we were unable to explore inter-individual variation in alternative splicing. Nonetheless, upon merging multiple Iso-Seq datasets we were able to identify transcripts from very lowly-expressed genes (such as the antisense novel gene upstream of E2F3 (Figure 4)) that would otherwise have been filtered out due to low read count. Recent studies have highlighted considerable evidence for genetic influences on isoform diversity in the human cortex, with splicing quantitative trait loci (sQTL) widely implicated in health and disease 65 . Future work will aim to extend our analyses to larger numbers of samples to explore population-level variation in transcript abundance in the cerebral cortex and differences associated with pathology. Second, despite the advantages of long-read sequencing approaches for the characterization of novel full-length transcripts, these methods are often assumed to be less quantitative than traditional short-read RNA sequencing methods 66 . We implemented a stringent QC pipeline (see Methods) and undertook considerable filtering of our data, finding high consistency across biological replicates and validating our findings using complementary approaches (i.e. nanopore sequencing, RNA-Seq, and by comparison to existing genomic databases).
We show that transcriptional profiles generated using Iso-Seq reflect those expected from the tissues we assessed (i.e. the cerebral cortex), and we found a strong correlation with both gene- and transcript-level expression measured using short-read RNA-Seq on the same samples. Because we adopted stringent QC approaches, many true transcripts, particularly lowly-expressed transcripts, are likely to have been filtered out of our final dataset, and our analyses probably underestimate the extent of RNA isoform diversity in the cerebral cortex; we therefore also provide a less conservatively-filtered dataset for download from our online track hub (see Web Resources). Third, our analyses were performed on 'bulk' cortex tissue containing a heterogeneous mix of neurons, oligodendrocytes and other glial cell-types. It is likely that these different cell-types express a specific repertoire of RNA isoforms and we are not able to explore these differences in our data. Of note, novel approaches for using long-read sequencing in single cells will enable a more granular approach to exploring transcript diversity in the cortex. Although such approaches are currently limited by technological and analytical constraints, a recent study used long-read transcriptome sequencing to identify cell-type-specific transcript diversity in the mouse hippocampus and prefrontal cortex 67 . Finally, although we explored the extent to which novel transcripts contained ORFs, the extent to which they are actually translated and contribute to cortical proteomic diversity is not known.
In summary, our data confirm the importance of alternative splicing and alternative first exon usage in the cerebral cortex, dramatically increasing transcriptional diversity and representing an important mechanism underpinning gene regulation in the brain. We highlight the power of long-read cDNA sequencing for completing our understanding of human gene annotation, and our transcript annotations, isoform data, and Iso-Seq analysis pipeline are available as a resource to the research community (see Web Resources).
[Table residue: transcript counts by structural category (NIC, NNC, Genic Genomic, Antisense, Fusion, Intergenic, Genic In...); the numeric values are not recoverable from the extraction. The fragment ends with a reference to Table 15.]
Iso-Seq human dataset comparisons: Each human Iso-Seq dataset was assigned a unique transcript ID. We used Cupcake's chain_samples.py script to merge the full datasets (prior to SQANTI2 filtering) to allow cross-comparison, followed by SQANTI2 reannotation. FL read counts from each individual SMRT cell were extracted from the read_stat.txt file and normalised to TPM (calculated as FL read counts / total transcriptome counts * 1,000,000).
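The TPM normalisation described above can be sketched as follows. This is a minimal illustration rather than the pipeline's actual code: the function name `fl_counts_to_tpm`, the dictionary input, and the example transcript IDs are hypothetical, and the sketch assumes that "total transcriptome counts" means the sum of FL read counts across all transcripts in one SMRT cell.

```python
def fl_counts_to_tpm(fl_counts):
    """Convert full-length (FL) read counts for one SMRT cell to TPM.

    fl_counts: dict mapping transcript ID -> FL read count
               (hypothetical input shape, not the read_stat.txt format).
    """
    # Assumption: the denominator is the total FL count for this SMRT cell.
    total = sum(fl_counts.values())
    if total == 0:
        raise ValueError("no FL reads in this SMRT cell")
    # TPM = FL read count / total transcriptome count * 1,000,000
    return {tid: count / total * 1_000_000 for tid, count in fl_counts.items()}


# Illustrative counts only; the transcript IDs are made up.
example = {"PB.1.1": 30, "PB.1.2": 10, "PB.2.1": 60}
tpm = fl_counts_to_tpm(example)
```

By construction, the TPM values within one SMRT cell sum to 1,000,000, which makes expression comparable across cells with different sequencing yields.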
Differential transcript usage between human fetal and adult samples was then tested using a Wilcoxon rank sum test. Transcripts annotated to a gene were examined for P < 0.05, requiring at least two transcripts expressed exclusively in fetal or adult samples, respectively, each with a normalised expression level of >20 TPM. Because only two SMRT cells were generated from the fetal hippocampus and striatum samples, limiting the power of the Wilcoxon rank sum test, differential transcript usage between fetal brain regions was assessed first by determining differential transcript expression using a more stringent threshold (>100 TPM and no transcript expression in the respective brain regions), and then by detecting at least two transcripts with expression exclusive to two or more brain regions.
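A per-transcript version of this test can be sketched as below. This is a hedged illustration, not the authors' implementation: `average_ranks`, `rank_sum_p` and `exclusive_group` are hypothetical helper names, the p-value uses the normal approximation without tie or continuity correction, and the exclusivity rule (zero TPM in every sample of one group, and at least one sample above the threshold in the other) is only one possible reading of the criteria in the text.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank sum p-value (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                       # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2             # expected rank sum under H0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))


def exclusive_group(fetal_tpm, adult_tpm, min_tpm=20):
    """Return 'fetal' or 'adult' if expression is exclusive to that group.

    Assumption: exclusive means 0 TPM in all samples of the other group and
    at least one sample above min_tpm in the expressing group.
    """
    if all(v == 0 for v in adult_tpm) and max(fetal_tpm) > min_tpm:
        return "fetal"
    if all(v == 0 for v in fetal_tpm) and max(adult_tpm) > min_tpm:
        return "adult"
    return None


# Illustrative TPM values for one transcript across four SMRT cells per group.
fetal = [120.0, 95.0, 150.0, 110.0]
adult = [0.0, 0.0, 0.0, 0.0]
p_value = rank_sum_p(fetal, adult)
group = exclusive_group(fetal, adult)
```

With very few SMRT cells per group the normal approximation is crude and the smallest attainable p-value is limited, which is consistent with the text's use of a threshold-based rule instead of the rank sum test for the fetal brain region comparison.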
Oxford Nanopore library preparation, sequencing and data processing: As validation, RNA from human fetal (n = 1) and human adult (n = 1) cortex was sequenced using Oxford Nanopore Technologies (ONT). cDNA was synthesised using Maxima H Minus RT (Thermo Fisher Scientific), followed by 15 cycles of PCR using Takara LA Taq (Clontech). Quantification and size distribution were then determined using the Qubit DNA High Sensitivity assay (Invitrogen) and Bioanalyzer 2100 (Agilent), and library preparation proceeded with ONT's PCR barcoding kit (SQK-PCB109).
Sequencing was then performed on an ONT PromethION using a FLO-PRO002 flow cell for human samples, and reads were base-called using Guppy (v4.0). The resulting fastq files were processed through the Pychopper/Pinfish 72 pipeline to produce polished transcripts, and SQANTI2 filtering was then applied.
Validation of Transcriptome Landscape: Novel transcripts, such as those not associated with GENCODE-defined transcripts, were compared with publicly-available data from PacBio's Alzheimer's disease brain Iso-Seq dataset 30 , which was re-processed using the Iso-Seq 3.