Pseudogenes limit the identification of novel common transcripts generated by their parent genes
ABSTRACT
Genomic sequences with high sequence similarity, such as parent-pseudogene pairs, cause short sequencing reads to align to multiple locations, thus complicating genomic analyses [1]. However, their impact on transcriptomic analyses, including the estimation of gene expression and transcript annotation, has been less studied. Here, we investigated the impact of pseudogenes on transcriptomic analyses by focusing on the disease-relevant example of GBA1 and its expressed pseudogene GBAP1. Using short-read RNA-sequencing data from human brain samples [2], we found that only 42% of all reads mapping to GBA1 did so uniquely, with the remaining reads mapping primarily to GBAP1. This resulted in a significant misestimation of the relative expression of GBA1 to GBAP1. Using targeted long-read RNA-sequencing of 12 human brain regions, we identified 18 GBA1 transcripts that had a novel open reading frame (ORF) and 7 GBAP1 transcripts predicted to encode a protein, despite GBAP1 being classified as a pseudogene. Furthermore, we demonstrated the ability of these transcripts to generate stable protein that lacked GBA’s important function as a lysosomal glucocerebrosidase (GCase). However, we found that these transcripts were surprisingly common, collectively accounting for 32% of transcription from the GBA1 locus in the caudate nucleus, and their usage showed cell type selectivity in human brain. Finally, we used annotation-independent analyses of both long- and short-read RNA-sequencing data sets to show that parent genes were more likely to have evidence of incomplete annotation. Given that 734 (17%) of the genes causing Mendelian disease have at least one pseudogene, these findings significantly impact our understanding of human disease and highlight the need for long-read RNA-sequencing analyses at many loci.
Competing Interest Statement
S.S., Y.G., J.E., H.S. and C.F.B. are employed by Astex Pharmaceuticals. The other authors declare no competing interests.