The annotation and function of the Parkinson’s and Gaucher disease-linked gene GBA1 has been concealed by its protein-coding pseudogene GBAP1

The human genome contains numerous duplicated regions, such as parent-pseudogene pairs, causing sequencing reads to align equally well to either gene. The extent to which this ambiguity complicates transcriptomic analyses is currently unknown. This is concerning as many parent genes have been linked to disease, including GBA1, causally linked to both Parkinson’s and Gaucher disease. We find that most of the short sequencing reads that map to GBA1, also map to its pseudogene, GBAP1. Using long-read RNA-sequencing in human brain, where all reads mapped uniquely, we demonstrate significant differences in expression compared to short-read data. We identify novel transcripts from both GBA1 and GBAP1, including protein-coding transcripts that are translated in vitro and detected in proteomic data, but that lack GCase activity. By combining long-read with single-nuclear RNA-sequencing to analyse brain-relevant cell types we demonstrate that transcript expression varies by brain region with cell-type-selectivity. Taken together, these results suggest a non-lysosomal function for both GBA1 and GBAP1 in brain. Finally, we demonstrate that inaccuracies in annotation are widespread among parent genes, with implications for many human diseases.


MAIN
The human genome contains regions that evade comprehensive analysis through short-read sequencing technologies and thus remain poorly studied. While these difficulties can be attributed to challenges with sequencing (e.g., high GC content), they are most commonly the result of duplicated genomic regions 1 . This leads to sequencing reads aligning to multiple genomic locations due to a high degree of sequence similarity, a phenomenon known as multimapping. Given that defective gene copies with high sequence similarity to their parent genes, termed pseudogenes, are frequently found in the human genome this is a common problem 2 .
While the impact of multimapping has been investigated in the context of pathogenic variant detection and can cause variants to be "missed" using conventional analyses 3 , the effect of multimapping on transcriptomic analyses has received less attention despite the problem being similar in nature 4 . This is surprising given the considerable number of genes affected, many of which are implicated in human disease. Short-read RNA-sequencing (RNA-seq) has been crucial to our understanding of transcript annotation, gene expression and its tissue and cell-type specific regulation. However, a major challenge in analysing these datasets is the difficulty of annotating parent-pseudogene pairs due to reads that cannot unambiguously map to either the parent gene or pseudogene, and so accurately quantifying gene expression.
Here, we focused on the disease-relevant example of GBA1 and its expressed pseudogene GBAP1. GBA1 encodes glucocerebrosidase (GCase), a lysosomal hydrolase 5 that degrades the glycosphingolipid, glucosylceramide 6 . Mutations in GBA1 result in decreased GCase activity causing Gaucher disease (GD) 7-11 when biallelic, and when heterozygous are among the most important genetic risk factors for Parkinson's disease (PD) [12][13][14][15] . Heterozygous GBA1 mutations also contribute to a more rapid progression of motor and non-motor symptoms in PD [16][17][18][19][20] and appear to be important predictors for non-motor symptom progression after deep brain stimulation surgery in patients with PD 21,22 .
To address the limitations of short sequencing reads, which seldom span multiple splice junctions 23 , we utilised long-read RNA-seq to examine human brain regions and iPSC-derived brain cells in depth. Our focus was on GBA1 and GBAP1, and we discovered significant differences in gene expression compared to short-read RNAseq. Moreover, we identified a significant number of novel transcripts from both genes, comprising novel protein-coding transcripts. We supported these findings by integrating short-read RNA-seq data, biochemistry, and proteomic data, which validated the novel protein-coding transcripts and confirmed that GBAP1 is translated in cells and human brain. Furthermore, we utilized both long-read sequencing and annotation-agnostic short-read sequencing data and found that inaccuracies in annotation is common among parent genes. Fig. 1 summarizes our analyses.

Pseudogenes are commonly expressed and alternatively spliced across human tissues
We started by quantifying pseudogenes from GENCODE (v38) annotation to investigate their impact on transcriptomic analyses. We identified a total of 14,709 pseudogenes in the human genome 2,24 , which can be divided into processed pseudogenes (n = 10,666) and unprocessed pseudogenes (n = 3,565), derived from retrotransposition of processed mRNAs and segmental duplications, respectively ( Fig. 2a). To date, 10,370 pseudogenes have been confidently assigned to 3,665 unique parent genes (Supplementary Table 1) 25 . We found that 734 (20.0%; Fig. 2b) parent genes were linked to 1,015 OMIM phenotypes, accounting for 17.0% of all OMIM disease genes (https://omim.org/) 26 .
To examine pseudogene expression across tissues, we used uniquely mapped shortread RNA-seq data generated by the Genotype-Tissue Expression (GTEx) 27,28 Consortium (v8, accessed 10/11/2021). We found that 64.7% of pseudogenes are expressed in ≥ 1 tissue (Fig. 2c), and that on average, 25.7 ± 2.5% of pseudogenes are expressed per tissue (n = 41; Supplementary Fig. 1). We then assessed the percentage of expressed pseudogenes that are alternatively spliced (> 1 transcript expressed) across human brain, heart, and lung samples using publicly available long-read RNA-seq data. On average, we found that 54.8 ± 2.6% of unprocessed pseudogenes and 13.5 ± 3.4% of processed pseudogenes are alternatively spliced (Fig. 2d). Taken together, this is consistent with the observation that a proportion of pseudogenes are of functional importance 29 .

Multimapping results in significant underestimation of GBA1 expression in human brain
We next examined the sequence similarity between pseudogenes and their parent genes as a way to investigate the potential functionality and complicating effects of the widespread expression and alternative splicing of pseudogenes. Our findings revealed that pseudogenes share an average of 80.0 ± 13.4% sequence similarity to the coding sequence (CDS) with their parent genes (Fig. 3a). As a result, genomic regions containing pseudogenes have the potential to confound transcriptomic analyses in all human tissues for a considerable proportion of protein-coding genes, including many that are causally linked to disease.
To explore this hypothesis in detail, we focused on the parent-pseudogene pair, GBA1-GBAP1 30 . This choice was driven by: (i) the high sequence similarity of GBA1-GBAP1 of 96%, which we reasoned would make both genes prone to inaccuracies in gene expression measures and transcript annotation (Fig. 3a). (ii) GBAP1's broad tissue expression (determined using RNA-seq data provided by GTEx), which means that simply masking its specific genomic region during mapping would be incorrect) ( Supplementary Fig. 2). (iii) GBA1 has been extensively studied due to its widely known role in disease, and its pseudogene is well-recognised.
We began by studying GBA1 and GBAP1 expression using gene-level measures from human tissues (n = 41) available through GTEx. Counter to previous RT-PCR-based quantifications showing that GBA1 is expressed at significantly higher levels than GBAP1 31 , we found GBA1 and GBAP1 expression to be equivalent in many tissues ( Supplementary Fig. 3), including the human brain (log2 fold change = 0.9 ± 0.5) (Fig. 3b). We questioned whether this observation could be explained by multimapping reads, which are often discarded in standard processing and so do not contribute to gene-level quantification of expression in many publicly available data sets (e.g., GTEx 27 , PsychENCODE 32 and recount3 33 ). To explore this question, we reanalysed publicly available short-read RNA-seq of human anterior cingulate cortex samples derived from 18 individuals, (n, control = 5, PD, with or without dementia = 13) 34 . Using this high-depth data set (100-bp paired-end reads, with a mean depth of 182.9 ± 14.9 million read pairs per sample), we assessed the proportion of reads that uniquely mapped to GBA1. We found that only 41.7 ± 11.2% of all reads mapped to GBA1 were uniquely mapped (Extended Data Fig. 1a), with 96.0 ± 2.0% of multimapped reads also aligning to GBAP1 (Extended Data Fig. 1b). Considering that most reads mapped to GBA1 and GBAP1 are not used for quantification, we concluded that long-read RNA-seq would be required to assess their relative expression. Therefore, we applied direct cDNA Oxford Nanopore sequencing (ONT) to pooled human frontal lobe (n individuals = 26) and hippocampus samples (n individuals = 27) (total library size: 42.7 million and 48.0 million reads, respectively) and found higher expression of GBA1 (numerator) compared to GBAP1 (denominator) (frontal lobe, log2 fold change = 2.3; hippocampus, log2 fold change = 3.1). That is, quantification with short-read RNA-seq wrongly estimated the relative expression of this parent-pseudogene pair by a 2-3 log2 fold difference (frontal cortex, Grubbs' test statistic = 3.58, P = 0.03; hippocampus, Grubbs' test statistic = 4.27, P < 0.01, Grubbs test for one outlier) (Fig. 3c).
Long-read RNA-sequencing reveals novel potentially protein-coding transcripts for GBA1 and GBAP1 with no dominant transcript in the human brain The inaccuracies in quantification suggested that high dependence on short-read RNA-seq technologies may have also led to inaccuracies in GBA1 and GBAP1 transcript structures. To address this, we performed targeted Pacific Biosciences (PacBio) isoform sequencing (Iso-Seq) (Extended Data Fig. 2a) on 12 human brain regions. Brain tissue was used because of GBA1's importance in neurological disease [12][13][14][15]35,36 , and previous evidence suggesting that transcriptome annotation is most incomplete in human brain 37 . We used PacBio Iso-Seq, which has >99% base pair accuracy enabled by circular consensus sequencing (CCS), which in turn, allows accurate mapping. To ensure that full-length reads were generated from mature mRNA alone, we used high-quality polyadenylated RNA (RNA integrity number > 8) pooled from multiple individuals per tissue (Supplementary Table 2). GBA1 and GBAP1 cDNAs were enriched using biotinylated hybridization probes designed against exonic and intronic genic regions (Supplementary fig. 4) to ensure that few assumptions were made regarding transcript structure. Collapsing mapped reads resulted in 2,368 GBA1 and 3,083 GBAP1 unique transcripts, each supported by ≥ 2 full-length (FL) HiFi reads across all samples (Extended Data Fig. 3a,b). After QC (Quality Control) and filtering for a minimum of 0.3% transcript usage per sample (equating to a mean of 43.4 -11,127.2 and 15.4 -1,161.3 FL HiFi reads for GBA1 and GBAP1 respectively) we identified 32 GBA1 and 48 GBAP1 transcripts (Fig. 4), thus providing the most reliable annotation of GBA1 and GBAP1 transcription to date.
Contrary to the expectation that most protein-coding genes express one dominant transcript 38-40 , we did not find a dominant GBA1 or GBAP1 transcript across any of the 12 brain regions sequenced. In fact, the most highly expressed GBA1 transcript (PB.845.2786; a full splice match to ENST00000368373), only corresponded to a mean of 38.4 ± 7.6% of total transcription at the locus (Fig. 4b). Although less surprising for a pseudogene, the most highly expressed transcript of GBAP1 (Non-coding novel) only corresponded to a mean of 14.0 ± 5.0% of total transcription at the locus (Fig.   4e).

GBAP1 are identified
We found that of all the coding transcripts detected, 18 GBA1 transcripts had a novel open reading frame (ORF) and 7 GBAP1 transcripts were predicted to encode a protein, despite GBAP1 being classified as a pseudogene (Fig. 4a,d). Since usage of unannotated 5' transcription start sites (TSSs) was a common feature of GBA1 and GBAP1 transcripts with novel ORFs (open reading frames) (Supplementary Fig. 5

),
we focused on validating these sites using Cap Analysis Gene Expression (CAGE) peaks (defined by FANTOM5 41,42 ). We found, despite the fact that CAGE seq only captures the first 20-30 nucleotides from the 5'-end (unique mapping only), 57% (n = 4) and 50% (n = 9) of novel GBA1 and GBAP1 5' TSSs, respectively, were located within 50 bp of CAGE peaks providing additional confidence in calling of these transcripts. Moreover, we validated all novel ORFs through additional targeted Iso-Seq of GBA1 and GBAP1 in iPSC-derived cortical neurons (n = 6), astrocytes (n = 3), and microglia (n = 3). In summary, we were able to detect GBA1 and GBAP1 transcripts with novel ORFs using a different RNA-seq technology and validate them in an independent data set.
To explore the coding potential of GBA1 and GBAP1 transcripts with novel ORFs, we employed a sequence-based approach along with AlphaFold2 43 (which accurately predicts GBA1 structure; Supplementary Fig. 6). We focused on the most highly expressed GBA1 (n = 3) and GBAP1 (n = 2) ORFs ( Fig. 5a, b). Although protein isoforms of both genes were predicted to have highly similar tertiary structures at the C-terminus, we predicted that all protein products would be unlikely to have GCase activity due to the partial/full loss of key enzymatic sites, or the absence of the lysosomal targeting sequence (LIMP2-interface region; Fig. 5c-h and Extended Data Fig. 4) 44,45 . To assess the coding potential of these novel GBA1 and GBAP1 transcripts, we amplified the ORFs and cloned them into a vector with a C-terminus FLAG tag. We transfected these vectors into H4 cells with homozygous knockout of GBA1, and found translation of all transcripts as detected with both an anti-FLAG antibody and an antibody directed to the conserved C-terminus ( Fig. 6a and Supplementary Fig. 7). However, none of these transcripts encoded protein isoforms with GCase activity, including those transcribed from GBAP1 (Fig. 6b). We also found no evidence to suggest that these protein isoforms inhibited constitutive GCase activity in H4 parental cells expressing GBA1 (Fig. 6c). Furthermore, immunohistochemical analysis in the H4 GBA1 KO and the H4 parental line (expressing endogenous GBA1) showed the lack of lysosomal localization of novel GBA1 and GBAP1 protein isoforms, as predicted (Fig. 6d).
To explore translation in vivo, we interrogated public untargeted mass spectrometry data of human prefrontal cortex 46 . Since novel GBA1 isoforms have no unique sequences that differentiate them, we focused on GBAP1 isoforms. We found proteomic support for GBAP1 (PB.845.1693) within the dataset with a protein Q-value of <0.01. In particular, we identified the unique amino acid sequence QWALDGAEYR, which is unique to GBAP1 and was not identified when searched within the UniProt human protein reviewed dataset. This shows translation of GBAP1 within the human prefrontal cortex.

GBA1 and GBAP1 transcripts show cell-type-selectivity in human brain
We found that novel protein-coding transcripts of GBA1 without predicted GCase activity were common, collectively accounting for between 15.8% (cerebellum) -31.7% (caudate nucleus) of transcription from the GBA1 locus. Notably, we found that only 48% of transcription in the caudate nucleus was predicted to encode a protein isoform with GCase activity. This high variability in the usage of GBA1 transcripts with novel ORFs across the human brain led us to hypothesise that these transcripts may have high cell type specificity. To test this, we employed both 5' single-nucleus RNA-seq (snRNA-seq) of human dorsolateral prefrontal cortex (DLPFC) and targeted PacBio Iso-Seq of human iPSC-derived brain-relevant cell types. Our analysis revealed cell-type-specific differences in the expression of GBA1 and GBAP1 (Fig. 7).
Specifically, we used 5' snRNA-seq of DLPFC to assess the expression of GBA1 and GBAP1 in various cell types, including astrocytes, excitatory neurons, inhibitory neurons, microglia, oligodendrocytes, and oligodendrocyte precursor cells (OPCs) (Fig. 7a). Our analysis showed an absence of signal at the first exon of PB.845.2888 (GBA1) in microglia, along with an overall lower expression of novel GBA1 transcripts in microglia and OPCs (Fig. 7b). Interestingly, we found that microglia showed significantly lower relative expression of shorter GBA1 ORFs lacking GCase activity (PB.275.2954 and PB.845.2888) compared to neurons or astrocytes, using PacBio Iso-Seq of human iPSC-derived neurons, astrocytes, and microglia (Fig. 7d).
Likewise, our analysis revealed that excitatory neurons had higher expression of GBAP1 ORF transcripts as compared to microglia, using 5' snRNA-seq of DLPFC (Fig.   7c). Further, using PacBio Iso-Seq of human iPSC-derived neurons, astrocytes, and microglia, we found significant cell-type-specific differences in GBAP1 ORF usage, with lower utilization of all GBAP1 ORFs in microglia compared to excitatory neurons and astrocytes (Fig. 7e). Additionally, our profiling of H3K4me3 mark in neurons using CUT&RUN 47 supported transcriptional activity at the 5' transcription start sites (TSS) of GBAP1 ORF transcripts (Extended Data Fig. 5).

Inaccurate annotation is frequent among parent genes across human tissues
We have shown significant inaccuracies in annotation of the parent gene GBA1.
However, we wanted to explore the scope of this problem. To do so we compared inaccuracies in annotation of all 3,665 parent genes compared with other proteincoding genes (including paralogs). Initially, we used public long-read RNA-seq data from 29 samples (n, Brain = 9, Heart = 16 and lung = 6; Supplementary Table 3) to assess the proportion of transcripts per gene, with at least one novel splice site in the coding sequence that would result in a novel ORF. Despite a low sequencing depth (mean, 2.2 ± 0.9 million full-length reads per sample), we found a significant increase in such events among parent genes compared to other protein-coding genes (parent genes = 23.9 ± 11.5%; protein-coding genes = 22.7 ± 11.4%; two-sided Wilcoxon rank-sum test p < 0.01; Fig. 8a). We extended this analysis to a greater number of samples (n = 7,595) and human tissues (n = 41, GTEx) using annotation-agnostic short-read RNA-seq analyses to quantify the proportion of parent genes with evidence of novel splicing (Method). Based on the identification of novel expressed genomic regions 37 and novel splice site usage, we found that the proportion of genes with incomplete annotation was significantly higher among parent genes compared to other protein-coding genes (novel expression regions: parent genes = 13.9 ± 1.4%; protein-coding genes = 10.8 ± 1.3%; two-sided Wilcoxon rank-sum test p < 0.01; Fig. 8b; splice site usage: parent genes = 66.5 ± 3.5%; protein-coding genes = 54.8 ± 4.3; two-sided Wilcoxon rank-sum test p < 0.01; Fig. 8c). This observation was consistent across all tissues analysed (Supplementary Fig. 8).

DISCUSSION
Here we show that widespread expression and alternative splicing of pseudogenes in human tissues has limited our understanding of both pseudogene and parent gene transcription with a significant impact on our appreciation of gene function. Our long-read RNA-seq analysis of the parent gene GBA1 and its pseudogene GBAP1 demonstrated significant diversity in transcription and showed that contrary to expectation 48,49 , no single transcript dominated expression of either gene in human brain. A substantial portion of transcription from both loci was novel, leading to the identification of novel protein-coding transcripts with tissue and cell-type specific biases in usage. Together these findings have a significant impact on our understanding of the potential mechanisms through which genetic variation at the GBA1-GBAP1 locus could explain phenotypic diversity in Gaucher disease and modulate disease risk and expressivity in Parkinson's disease.
Even though current annotation is known to be incomplete, especially in the brain 37 , the extent of transcriptional variety and novelty at parent gene loci was surprising, and particularly so at GBA1. After all, GCase dysfunction has been implicated in human disease since 1965 6 and mutations in GBA1 have been described since 1987 50 , making GBA1 one of the most studied genes in the genome. Nonetheless, we found that as much as 31.7% of GBA1 transcription in the caudate nucleus, a key component of the basal ganglia, may be translated into novel protein isoforms that do not localise in lysosomes and thus, lack GCase activity. Given that reduced GCase activity has been linked to PD [12][13][14][15] and basal ganglia circuitry dysfunction is a feature of this disease, these findings were all the more striking.
We also demonstrate that inaccuracies in annotation were significantly more common across parent genes as compared to other protein coding genes (Fig. 8) and were not restricted to GBA1. High sequence similarity within the genome and subsequent multimapping of short RNA-sequencing reads, has impacted on our understanding of many genes, including those already causally linked to disease.
Such loci are predictable using sequence similarity analyses, the technology to resolve these "problem" loci is available and the impact on our understanding of disease is likely to be significant. As exemplified by GBA1-GBAP1, our limited understanding of transcription from this locus results in errors in quantification of gene expression and all dependent analysis from differential gene expression in disease to quantitative trait loci detection. Beyond a research setting, inaccuracies in annotation will affect variant interpretation and consequently diagnostic yield for some disease-associated genes. Finally, and perhaps most importantly, significant inaccuracies in transcript annotation impact on our understanding of gene function.
Directed by our long-read RNA-sequencing results, we have found that some GBAP1 transcripts are more highly expressed in neurons and astrocytes, share a similar predicted 3D protein structure to GBA1, have protein products that do not localise to lysosomes and lack GCase activity. Yet, we find robust evidence of translation of such GBAP1 transcripts in human brain using high-throughput mass spectrometry data 46 .
Extrapolating these findings to GBA1, where mass spectrometry data was uninformative, would suggest a non-lysosomal function for both GBA1 and GBAP1 in brain, and particularly in neurons.
We propose that improving our understanding of the molecular functions of parentpseudogene pairs will become increasingly important to the development and success of RNA-targeting therapies. Accurate annotation is required at the tissueand cell-level to design effective Antisense Oligonucleotides (ASOs) or gene therapies. Furthermore, some pseudogenes may represent particularly high value therapeutic targets due to their potential to operate as genetic modifiers of Mendelian disorders. In fact, Nusinersen, which targets the splicing of former pseudogene SMN2, is a highly successful treatment for Spinal muscular atrophy 51 .
Thus, a deeper understanding of pseudogene function could lead to new and innovative therapeutic strategies.
Taken together, our findings from the GBA1-GBAP1 study demonstrate the need for thorough re-examination of transcription in duplicated genomic regions, such as parent-pseudogene pairs. By employing accurate full-length transcript sequencing, we are able to resolve these complex loci with unprecedented detail, leading to novel transcript discovery and, as a result, new insights into the functionality of human diseases.

Expression analysis from GTEx
Pseudogene and parent gene expression was assessed using median transcript per million (TPM) expression per tissue generated by the Genotype-Tissue Expression Consortium (GTEx, v8, accessed on 10/11/2021). As GTEx only use uniquely mapped reads for expression and multimapping was a concern, expression was assessed as a binary variable. That is, a gene with a median TPM > 0 was considered to be expressed.
For quantitative expression of GBA1 and GBAP1 we used RNA-seq data for 17,510 human samples originating from 54 different human tissues (GTEx, v8) that was downloaded using the R package recount (v 1.4.6) 52 . Cell lines, sex-specific tissues, and tissues with 10 samples or below were removed. Samples with large chromosomal deletions and duplications or large copy number variation previously associated with disease were filtered out (smafrze != "EXCLUDE"). For any log2 fold change calculations GBA1 is the numerator and GBAP1 is the denominator.

Alternative splicing analysis using long-read RNA-sequencing
To identify alternative splicing of pseudogenes we used publicly available long-read RNA-seq data from ENCODE 53 (https://www.encodeproject.org/rna-seq/long-readrna-seq/). We included 29 samples from Brain (n = 9), Heart (n = 16) and lung (n = 6). A description of the samples included can be found in Supplementary Table 2.

Online Mendelian Inheritance in Man data
Phenotype relationships and clinical synopses of all Online Mendelian Inheritance in Man (OMIM) genes were downloaded using API through https://omim.org/api (accessed 14/04/2022) 26 . Parent genes were annotated genes as OMIM morbid if they were listed as causing a mendelian phenotype.

Sequence similarity
Sequence similarity of parent genes and pseudogenes has previously been calculated by Pei et al. 2 and is available through The Pseudogene Decoration Resource (psiDr; http://www.pseudogene.org/psidr/similarity.dat; accessed 14/04/2022). We compared the sequence similarity of parent and pseudogenes considering the coding sequence (CDS) of parent genes.
outFilterMultimapNmax sets the rate of multimapping permitted; as a conservative estimate we set this to 10, half the ENCODE standard. outSAMmultNmax was set to -1, which allowed multimapped reads to be kept in the same output SAM/BAM file.
The QC and alignment processes were performed using a nextflow 57 pipeline. BAM files were sorted and indexed using Samtools (v 1.14; RRID:SCR_002105) 58 and filtered in R (v 4.0.5; RRID:SCR_001905) for reads overlapping the GBA1 or GBAP1 locus, using GenomicRanges (v 1.42.0; RRID:SCR_000025) 59 and Rsamtools (version 2.6.0). Only paired first mate reads on the correct strand (minus for both GBA1 and GBAP1) selected. The "NH" tag, which provides the number of alignments for a read was also extracted from the SAM header. The CIGAR string of the read was used to provide a width of the reads relative to the reference by adding operations that consume the reference together. Reads were then filtered, using dplyr (v 1.0.9; RRID:SCR_016708) 60 and tibble (v 3.1.6) 60 , with this new width to leave reads that aligned completely within the GBA1 and GBAP1 locus. Reads were then split between unique alignment and multimapping alignments based on the "NH" tag. The percentage of reads (uniquely mapped / (uniquely mapped + multimapped)) that mapped uniquely to either the GBA1 or GBAP1 locus was then calculated.
Additionally, for reads that multimapped to the GBA1 or GBAP1 locus the read name was extracted and searched for within the reads that multimapped to the alternate locus (i.e., reads names from reads that multimapped to the GBA1 locus were searched against read names for reads that multimapped to the GBAP1 locus). This provided a percentage of reads that aligned to GBA1 that that also aligned elsewhere and the percentage of reads aligning to GBAP1. Code and commentary can be found here: https://github.com/Jbrenton191/GBA_multimapping_2022.

Samples
Human Poly A+ RNA of healthy individuals that passed away from sudden death/trauma derived from frontal lobe and hippocampus were commercially purchased through Clontech (Supplementary Table 2).

Direct cDNA sequencing
A total of 100ng of Poly A+ RNA per sample was used for initial cDNA synthesis and subsequent library preparation according to the direct cDNA sequencing (SQK-DCS109) protocol described in detail at protocols.io

Comparing short-read quantification versus long-read quantification
For each sample in GTEx a log2 fold change was calculated with GBA1 as the numerator and GBAP1 as the denominator across frontal lobe and hippocampus.
Shapiro-Wilk normality test in each tissue was used to confirm a normal distribution.
To compare against ONT long-read quantification we used Grubbs' test (maximum normalized residual test) for a single outlier.

PACBIO TARGETED ISO-SEQ
Samples Human brain samples: Human Poly A+ RNA of healthy individuals that passed away from sudden death/trauma derived from caudate nucleus, cerebellum, cerebral cortex, corpus callosum, dorsal root ganglion, frontal lobe, hippocampus, medulla oblongata, pons, spinal cord, temporal lobe, and thalamus were commercially purchased through Clontech (Supplementary Table 2). iPSC, neuroepithelial, neural progenitor, cortical neuron, astrocyte, and microglia cells: Control iPSCs consisted of the previously characterized lines Ctrl1 62 , ND41866 (Coriel), RBi001 (EBiSC/Sigma) and SIGi1001 (EBiSC/Sigma) as well as the isogenic line previously generated 63 . Reagents were purchased from Thermo Fisher Scientific unless otherwise stated. iPSCs lines were grown in Essential 8 media on geltrex substrate and passaged using 0.5M EDTA. Cortical neurons were differentiated using dual SMAD inhibition for 10 days (10µM SB431542 and 1µM dorsomorphin, Tocris) in N2B27 media before maturation in N2B27 alone 64 . Day 100 +/-5 days was taken as the final timepoint. Astrocytes were generated following a similar neural induction protocol until day 80 before repeatedly passaging cortical neuronal inductions in 10ng/ml FGF2 (Peprotech) to enrich for astrocyte precursors.
Total RNA was extracted using the Qiagen RNeasy kit according to the manufacturer's protocol with β-mercaptoethanol added to buffer RLT and with a DNase digestion step included.

cDNA synthesis
A total of 250ng of RNA was used per sample for reverse transcription. Two different cDNA synthesis approaches were used: (i) Human brain cDNA was generated by SMARTer PCR cDNA synthesis (Takara) and (ii) iPSC derived cell lines were generated using NEBNext® Single Cell/Low Input cDNA Synthesis & Amplification Module (New England Biolabs). For both reactions sample-specific barcoded oligo dT (12 µM) with PacBio 16mer barcode sequences were added (Supplementary Table 3).
SMARTer PCR cDNA synthesis: First strand synthesis was performed as per manufacturer instructions, using sample-specific barcoded primers instead of the 3' SMART CDS Primer II A. We used a 90 min incubation to generate full-length cDNAs.
cDNA amplification was performed using a single primer (5' PCR Primer II A from the SMARTer kit, 5′ AAG CAG TGG TAT CAA CGC AGA GTA C 3′) and was used for all PCR reactions post reverse transcription. We followed the manufacturer's protocol with our determined optimal number of 18 cycles for amplification; this was used for all samples. We used a 6 min extension time in order to capture longer cDNA transcripts. PCR products were purified separately with 1X ProNex® Beads. The PacBio targeted Iso-Seq protocol is described in detail at protocols.io (dx.doi.org/10.17504/protocols.io.n92ld9wy9g5b/v1).

Automated Analysis of Iso-Seq data using Snakemake
For the analysis of targeted PacBio Iso-Seq data, we created two Snakemake 68  https://ccs.how/), which combines multiple subreads of the same SMRTbell molecule and to produce one highly accurate consensus sequence, also called a HiFi read (≥ Q20). We used the following parameters: --minLength 10 -maxLength 50000 -

CAGE-seq analysis
To assess whether predicted 5' TSSs of novel transcript were in proximity of Cap Analysis Gene Expression (CAGE) peaks we used data from the FANTOM5 dataset 41,42 .

Single nucleus RNA-sequencing analysis
Raw base calls were demultiplexed to obtain sample specific FASTQ files using Cell Ranger mkfastq and default parameters (v 6; 10x Genomics; RRID:SCR_017344).
Reads were aligned to the GRCh38 genome assembly using the Cell Ranger count (v 6; 10x Genomics; RRID:SCR_017344) with default parameters (--include-introns were used for nuclei mapping) 73 . Nuclei were filtered based on the number of genes detected -nuclei with less of the mean minus a standard deviation, or more than the mean plus two standard deviations were discarded to exclude low quality nuclei or possible doublets. The data was normalized to center log ratio (CLR) to reduce sequencing depth variability. Clusters were defined with Seurat function FindClusters

CUT&RUN analysis
Paired-end reads (2x150 bp) were aligned to the hg38 genome using bowtie2 75

Cell transfection
Cells were transfected using Lipofectamine 3000 reagent (Invitrogen L3000008) according to manufacturer's instructions. GBA1 or GBAP1 transcripts subcloned in the pcDNA3.1(+)-C-DYK vector were designed using the GenSmart design tool and acquired from GenScript.

Western blot
Protein was extracted from whole cells using

Mass spectrometric analysis of prefrontal cortex proteomes
A public mass spectrometry dataset was retrieved from ProteomeXchange (PXD026370). This data set consists of human brain tissue was collected post-mortem from patients diagnosed with multiple system atrophy (n = 45) and from controls (n = 30) to perform a comparative quantitative proteome profiling of tissue from the prefrontal cortex (Broadman area 9) 46 .
The data analysis was performed using MetaMorpheus 77 (v 0.0.320; https://github.com/smith-chem-wisc/MetaMorpheus). The search was conducted for two 2 GBAP1 isoforms (PB.845.1693 and PB.845.525), and a list of 267 frequent protein contaminants found within mass spectrometry data as provided by MetaMorpheus. An FDR (false discovery rate) of 1% was applied for presentation of PSMs (peptide spectrum matches), peptides, and proteins following review of decoy target sequences.

ANNOTATION OF PARENT GENES AND PROTEIN-CODING GENES
To explore inaccuracies in annotation of parent genes and protein-coding genes we applied three independent approaches:

Long-read RNA-sequencing
To identify full-length transcripts with at least one novel splice junction we used the same long-read RNA-seq samples available from ENCODE 53 as previously described.
Transcripts with novel splice junction resulting in novel ORF were those transcripts that had a predicted ORF that was not present in GENCODE v38 annotation.

Novel expressed regions
Novel unannotated expression 37

Splice junctions
To identify novel junctions with potential evidence of incomplete annotation, we used data provided by IntroVerse 78 .
IntroVerse is a relational database that comprises exon-exon split-read data on the Second, we extracted all novel donor and acceptor junctions that had evidence of use in >=5% of the samples of each tissue and grouped them by gene. We then classify those genes either as "parent" or "protein-coding." Finally, we calculated the proportion that each category of genes presented within each tissue. Focusing on the parent genes category, this can be described as it follows: = Let denote the total number of parent genes containing at least one novel junction shared by >=5% of the samples of the current tissue. Let denote the total number of parent genes available for study. Let denote the current tissue.
We mirrored the formula above to calculate the proportion of protein-coding genes per tissue.

FIGURE GENERATION
The code for all figures in this manuscript can be accessed through: https://github.com/egustavsson/GBA_GBAP1_manuscript.git    (amino acids 117-446), and glycol_hydro_30C (amino acids 469-531). d, X-ray structure of GBA1 (PDB 2v3f) with catalytic Glu residues highlighted in yellow and probable LIMP-2 interface region highlighted in purple. e, Alphafold2 predictions of GBA1 MANE select (ENST00000368373) and f, the three most highly expressed novel protein-coding GBA1 isoforms colored by prediction confidence score (pLDDT). g, Xray structure of GBA1 (PDB 2v3f) (violet) superimposed on AlphaFold2 predicted structure of the longer ORF generated by GBAP1 PB.845.1693 (green). h, Alphafold2 predictions of the two most highly expressed novel protein-coding GBAP1 isoforms colored by prediction confidence score (pLDDT).