Predicting long non-coding RNAs using RNA sequencing
Introduction
Multiple classes of non-coding RNAs (ncRNAs) are transcribed from vertebrate genomes. A number of these classes possess general housekeeping functions and include ribosomal RNA (rRNA), transfer RNA (tRNA) and small nuclear/nucleolar RNA (sn/snoRNA). A further set of short RNA molecules (miRNA, siRNA and piRNA) has regulatory roles in diverse cellular processes including cellular differentiation [1], [2], cancer progression [3], [4] and immunity [5], [6]. Beyond these classes, large-scale sequencing of cDNA in mammalian cells has identified widespread transcription from the mouse genome [7], giving rise to long transcripts that appear to fall outside of the housekeeping and short RNA classes. From these initial transcript maps, long non-coding RNA (lncRNA) species have been identified whose loci lie both within and between protein coding genes. While lncRNAs remain the most enigmatic ncRNA species in terms of function, there is now much effort centred on their functional characterisation and their molecular mechanisms in different cell types [8], [9], [10], [11].
Owing to the development of next generation sequencing (NGS) technology, lncRNA identification is now more readily achievable and several sequencing-based assays have been developed to predict lncRNAs. As a result of RNA polymerase II ChIP-seq [12], chromatin signature [9] and RNA-seq [8] data analyses, there are now multiple catalogues of lncRNAs that are expressed from diverse species, tissues and cell lines. However, each of these approaches to predicting lncRNAs has both advantages and disadvantages. In addition, predicting the molecular and cellular functions of these transcripts has, in general, not been achievable. This is due in part to our current poor understanding of RNA sequence–structure–function relationships, and to the heterogeneity in location and expression patterns of these transcripts: they can be expressed as cytosolic or nuclear transcripts [13], as sense or antisense transcripts of protein coding genes [7], from intronic regions of protein coding loci or from intergenic regions.
While the molecular mechanisms of most lncRNAs have yet to be elucidated, it is clear that many lncRNAs act by activating or repressing the transcriptional activity of other genes either in cis or in trans, or else by modifying their transcript abundance [14]. Activation and repression can be facilitated by interactions between lncRNAs and chromatin modifying enzymes. Indeed, establishment of repressive chromatin via the recruitment of histone modifying proteins has been described for a number of lncRNAs [15]. XIST is one such lncRNA transcript which facilitates genomic imprinting of the X chromosome. XIST coats the X chromosome in an allele-specific manner [16] and through direct interactions with the polycomb repressive complex 2 (PRC2) renders the chromosome inactive via PRC2-mediated H3K27 trimethylation [17]. Genome-wide RNA immunoprecipitation sequencing (RIP-seq) analysis has revealed many additional lncRNAs that associate with the PRC2 complex and play a role in genomic imprinting [18]. For example, Gtl2 RNA may promote the association of PRC2 to the imprinted Dlk1 locus [18]. Non-genomic-imprinting roles of lncRNAs have also been described. Chromatin state at the HOX locus in human fibroblasts is regulated via expression of the HOTAIR long non-coding RNA [19]. HOTAIR may directly interact with PRC2 permitting H3K27 trimethylation and the establishment of silent chromatin at the HOXD locus. In addition to mediating repressive chromatin, lncRNAs also act in activation processes. For example, ncRNA-a lncRNAs directly bind to the Mediator complex, permitting H3S10 phosphorylation and activation of downstream target genes such as SNAI1 and AURKA [20].
In addition to lncRNA function in epigenetic processes, there is interest in their potential roles at enhancer regions. There is growing evidence that many enhancers are transcribed as enhancer RNAs (eRNAs). eRNAs are produced from genomic regions marked by higher levels of histone H3, lysine 4 mono-methylation (H3K4me1) than H3K4me3 [21], [22], [23]. Co-activator binding at these loci confers tissue specificity of gene expression level. Recently, transcribed enhancers were described in the inflammatory response and neuronal activity [22], [23]. During inflammation, expression at numerous enhancer loci preceded that of genomically neighbouring immune mediators, suggestive of a cis-regulatory function for these eRNAs. Whether there is a consequence of transcription at these loci – whether eRNA function is RNA sequence dependent or independent – remains unknown. There are several explanations for enhancer transcription (reviewed in [24]). eRNA transcripts may form ribonucleoprotein complexes that facilitate epigenetic or non-epigenetic regulation. However, whether eRNAs are capable of directing transcription factors or chromatin remodelers in a sequence dependent manner remains to be experimentally tested. The act of transcription may be important for regulation at proximal targets. The binding of polII at enhancers may provide a mechanism through which histone modifying enzymes associated with the polII complex [25] create a domain of permissive chromatin, rendering the transcript a non-functional by-product [24]. Whether eRNAs, in general, possess sequence-dependent function remains unknown.
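The chromatin signature described above (more H3K4me1 than H3K4me3 at transcribed enhancers) can be expressed as a simple ratio test. The following is a minimal sketch only: the `Region` record, the signal values, the pseudocount and the ratio threshold are all hypothetical choices for illustration and are not taken from the studies cited here.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A genomic interval with hypothetical normalised histone-mark signals."""
    name: str
    h3k4me1: float
    h3k4me3: float


def is_enhancer_like(region, pseudocount=1.0, min_ratio=1.0):
    """Return True when H3K4me1 signal exceeds H3K4me3 signal.

    The pseudocount (an assumed value) avoids division by zero and damps
    ratios computed from very low signal.
    """
    ratio = (region.h3k4me1 + pseudocount) / (region.h3k4me3 + pseudocount)
    return ratio > min_ratio


# Toy values: the first region is H3K4me1-dominated, the second is
# H3K4me3-dominated, as would be expected at a promoter.
regions = [Region("region_A", 10.0, 2.0), Region("region_B", 2.0, 10.0)]
enhancer_like = [r.name for r in regions if is_enhancer_like(r)]
```

In practice the threshold would need to be calibrated against the data set in question; the sketch only shows the direction of the comparison.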
Competitive endogenous RNAs (ceRNAs) represent another class of lncRNA. Originally described as transcribed retropseudogenes that retain the miRNA-binding function of their parent mRNAs, ceRNAs now include lncRNAs that did not derive from protein-coding genes [26]. ceRNAs have been proposed to function as miRNA ‘decoys’ or ‘sponges’, thereby de-repressing levels of protein coding transcripts that share with the ceRNAs the same miRNA response elements (MREs) [26]. Although ceRNA-mediated regulation represents an elegant mechanism by which lncRNAs may control protein function through miRNA mediators, the proportion of lncRNAs that act as ceRNAs remains unknown.
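The idea of a lncRNA and an mRNA sharing miRNA response elements can be illustrated with a deliberately simplified seed-match search. This sketch assumes that an MRE is a perfect match to the reverse complement of miRNA positions 2 to 8 (a common simplification, not the definition used in [26]); the function names, the sequence labels `mirA` and `mirB`, and the toy transcripts are hypothetical.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Reverse complement of an RNA string (A, C, G, U only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def seed_site(mirna):
    """Target-site motif for the miRNA seed (positions 2-8, 1-based)."""
    return reverse_complement(mirna[1:8])


def shared_mirnas(lncrna, mrna, mirnas):
    """Names of miRNAs whose seed site occurs in both transcripts."""
    shared = []
    for name, seq in mirnas.items():
        site = seed_site(seq)
        if site in lncrna and site in mrna:
            shared.append(name)
    return sorted(shared)


# Toy data: both transcripts carry the site for mirA but not for mirB.
mirnas = {
    "mirA": "UGAGGUAGUAGGUUGUAUAGUU",
    "mirB": "UAGCUUAUCAGACUGAUGUUGA",
}
lncrna = "AAACUACCUCAAA"
mrna = "GGGCUACCUCGGG"
candidates = shared_mirnas(lncrna, mrna, mirnas)
```

A shared site is only a prerequisite for the proposed decoy mechanism; relative transcript abundance and site accessibility are not modelled here.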
A final possibility is that an as-yet-unknown proportion of lncRNAs represent transcriptional noise, generated through random collisions of RNA polII complexes with DNA [27]. We also do not mean to imply that these lncRNA functional categories are mutually exclusive, since it is likely that some lncRNAs possess functions from multiple categories.
Accurate identification and annotation of lncRNAs is a necessary first step towards understanding the full functional potential of transcriptomes. The technology for sequencing full length transcripts is available and is allowing for the generation of large RNA-seq data sets. In this review we first discuss the relative merits of individual protocols for identifying lncRNAs using NGS technology but then focus our discussion on various considerations that are required when undertaking an RNA-seq experiment for the discovery of lncRNAs. Factors discussed include the type of sequencing library, the sequencing protocol, read mapping and transcript building algorithms, as well as lncRNA categorisation using computational methods.
Methods for detecting long non-coding RNAs
The full transcriptional repertoire of a given organism is not predictable from just its genomic sequence. Protein-coding gene transcripts and some families of ncRNAs (e.g. tRNAs and rRNAs) can be predicted reasonably accurately based on the presence of long open reading frames (ORFs) and sequence similarity, respectively [28]. However, many of what we now consider to be lncRNAs were not initially predicted or identified in the years soon after the sequencing of the human genome, leaving a
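The ORF-based reasoning mentioned in this section can be sketched as a scan for the longest open reading frame in a transcript. This is an illustrative simplification: it considers only the three forward frames, requires an ATG start and an in-frame stop codon, and uses a hypothetical cut-off of 100 codons that is an assumption rather than a value stated in the text.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_codons(seq):
    """Length in codons (stop excluded) of the longest ATG-to-stop ORF.

    Only the three forward frames are scanned; ORFs lacking an in-frame
    stop codon are ignored.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                best = max(best, (i - start) // 3)
                start = None
    return best


def lacks_long_orf(seq, max_orf_codons=100):
    """True when no ORF exceeds the (assumed) codon threshold."""
    return longest_orf_codons(seq) <= max_orf_codons


# Toy transcript: ATG AAA TTT TAG gives a three-codon ORF.
toy_length = longest_orf_codons("ATGAAATTTTAG")
```

The absence of a long ORF is not sufficient on its own to call a transcript non-coding, so a filter of this kind would normally be one of several lines of evidence.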
Predicting lncRNAs using RNA-seq
A workflow for the discovery of lncRNAs is outlined in Fig. 2A. While the study design will dictate how analysis of the resulting data is performed, lncRNA discovery approaches show similarities among many studies (Table 1). Below we discuss steps and considerations required for detecting lncRNAs using RNA-seq.
Categorising lncRNAs
Protein coding genes can often be functionally annotated computationally through homology searches to known protein families, patterns of expression and protein domain structure. Ideally lncRNAs would be classified in a similar manner. However, little is currently known about specific features that can distinguish different classes of lncRNAs. LncRNAs are generally better conserved than neutrally evolving sequence [84], [85] suggesting conserved function across species. However, annotation of
Future perspectives
RNA-seq has led to the identification of many novel long non-coding loci. These large-scale studies have revealed fundamental characteristics of lncRNAs including their low levels of expression, temporal and spatial patterns of expression, sequence conservation and association with histone modifications. Functional assays have also revealed diverse mechanisms through which lncRNAs act to regulate protein coding genes at both the transcriptional and translational level. However there remains
References (91)
et al., Cancer Lett. (2011)
Trends Immunol. (2008)
Semin. Cell Dev. Biol. (2011)
et al., Cell (2009)
et al., Cell (2011)
Mol. Cell (2010)
Cell (2007)
Cell (2011)
Cell (2005)
Cell (2007)
Genomics
Neuron
J. Mol. Biol.
Cell
Am. J. Physiol. Heart Circ. Physiol.
PLoS Genet.
World J. Stem Cells
Immune Netw.
Science
Genes Dev.
Nature
Nat. Biotechnol.
Nat. Rev. Genet.
Genome Res.
Science
J. Cell Biol.
Nature
Nat. Struct. Mol. Biol.
Nature
PLoS Biol
Annu. Rev. Genet.
Mol. Cell Biol.
Nat. Struct. Mol. Biol.
Nature
Nature
Science
Genes Dev.
Genome Res.
Science
Genome Res.
Nat. Genet.
Nature
Proc. Natl. Acad. Sci. USA
PLoS ONE
Nucleic Acids Res.