Predicting long non-coding RNAs using RNA sequencing

doi:10.1016/j.ymeth.2013.03.019

Methods, Volume 63, Issue 1, 1 September 2013, Pages 50-59

https://doi.org/10.1016/j.ymeth.2013.03.019

Abstract

The advent of next-generation sequencing technologies, and in particular RNA-sequencing (RNA-seq), has expanded our knowledge of the transcriptional capacity of human and other animal genomes. In particular, recent RNA-seq studies have revealed that transcription is widespread across the mammalian genome, resulting in a large increase in the number of putative transcripts from both within, and intervening between, known protein-coding genes. Long transcripts that appear to lack protein-coding potential (long non-coding RNAs, lncRNAs) have been the focus of much recent research, in part owing to observations of their cell-type and developmental time-point restricted expression patterns. A variety of sequencing protocols are currently available for identifying lncRNAs, including RNA polymerase II occupancy, chromatin state maps and, the focus of this review, deep RNA sequencing. In addition, there are numerous analytical methods available for mapping reads and assembling transcript models that predict the presence and structure of lncRNAs from RNA-seq data. Here we review current methods for identifying lncRNAs using large-scale sequencing data from RNA-seq experiments and highlight analytical considerations that are required when undertaking such projects.

Introduction

Multiple classes of non-coding RNAs (ncRNAs) are transcribed from vertebrate genomes. A number of these classes possess general housekeeping functions and include ribosomal RNA (rRNA), transfer RNA (tRNA) and small nuclear/nucleolar RNA (sn/snoRNA). A further set of short RNA molecules (miRNA, siRNA and piRNA) have regulatory roles during diverse cellular processes including cellular differentiation [1], [2], cancer progression [3], [4] and immunity [5], [6]. Nevertheless, large-scale sequencing efforts of cDNA in mammalian cells have identified widespread transcription from the mouse genome [7], giving rise to long transcripts that appear to fall outside of these classes of housekeeping or short RNAs. From these initial transcript maps, long non-coding RNA (lncRNA) species have been identified whose loci lie both within and between protein-coding genes. While lncRNAs remain the most enigmatic ncRNA species in terms of function, there is now much effort centred on their functional characterisation and their molecular mechanisms in different cell types [8], [9], [10], [11].

Owing to the development of next-generation sequencing (NGS) technology, lncRNA identification is now more easily achievable and several assay-based sequencing protocols have been developed that predict lncRNAs. As a result of RNA polymerase II ChIP-seq [12], chromatin signatures [9] and RNA-seq [8] data analyses, there are now multiple catalogues of lncRNAs that are expressed from diverse species, tissues and cell lines. However, each of these approaches to predicting lncRNAs has both advantages and disadvantages. In addition, predicting the molecular and cellular functions of these transcripts has, in general, not been achievable. This is due in part to our current poor understanding of RNA sequence–structure–function relationships, and to the heterogeneity in location and expression patterns of these transcripts: they can be expressed as cytosolic or nuclear transcripts [13], as sense or antisense transcripts of protein-coding genes [7], from intronic regions of protein-coding loci or from intergenic regions.

While the molecular mechanisms of most lncRNAs have yet to be elucidated, it is clear that many lncRNAs act by activating or repressing the transcriptional activity of other genes either in cis or in trans, or else by modifying their transcript abundance [14]. Activation and repression can be facilitated by interactions between lncRNAs and chromatin modifying enzymes. Indeed, establishment of repressive chromatin via the recruitment of histone modifying proteins has been described for a number of lncRNAs [15]. XIST is one such lncRNA transcript which facilitates genomic imprinting of the X chromosome. XIST coats the X chromosome in an allele-specific manner [16] and through direct interactions with the polycomb repressive complex 2 (PRC2) renders the chromosome inactive via PRC2-mediated H3K27 trimethylation [17]. Genome-wide RNA immunoprecipitation sequencing (RIP-seq) analysis has revealed many additional lncRNAs that associate with the PRC2 complex and play a role in genomic imprinting [18]. For example, Gtl2 RNA may promote the association of PRC2 to the imprinted Dlk1 locus [18]. Non-genomic-imprinting roles of lncRNAs have also been described. Chromatin state at the HOX locus in human fibroblasts is regulated via expression of the HOTAIR long non-coding RNA [19]. HOTAIR may directly interact with PRC2 permitting H3K27 trimethylation and the establishment of silent chromatin at the HOXD locus. In addition to mediating repressive chromatin, lncRNAs also act in activation processes. For example, ncRNA-a lncRNAs directly bind to the Mediator complex, permitting H3S10 phosphorylation and activation of downstream target genes such as SNAI1 and AURKA [20].

In addition to lncRNA function in epigenetic processes, there is interest in their potential roles at enhancer regions. There is growing evidence that many enhancers are transcribed as enhancer RNAs (eRNAs). eRNAs are produced from genomic regions marked by higher levels of histone H3, lysine 4 mono-methylation (H3K4me1) than H3K4me3 [21], [22], [23]. Co-activator binding at these loci confers tissue specificity of gene expression level. Recently, transcribed enhancers were described in the inflammatory response and neuronal activity [22], [23]. During inflammation, expression at numerous enhancer loci preceded that of genomically neighbouring immune mediators, suggestive of a cis-regulatory function for these eRNAs. Whether there is a consequence of transcription at these loci – whether eRNA function is RNA sequence dependent or independent – remains unknown. There are several explanations for enhancer transcription (reviewed in [24]). eRNA transcripts may form ribonucleoprotein complexes that facilitate epigenetic or non-epigenetic regulation. However, whether eRNAs are capable of directing transcription factors or chromatin remodelers in a sequence dependent manner remains to be experimentally tested. The act of transcription may be important for regulation at proximal targets. The binding of polII at enhancers may provide a mechanism through which histone modifying enzymes associated with the polII complex [25] create a domain of permissive chromatin, rendering the transcript a non-functional by-product [24]. Whether eRNAs, in general, possess sequence-dependent function remains unknown.

Competitive endogenous RNAs (ceRNAs) represent another class of lncRNA. Originally described as transcribed retropseudogenes that retain the miRNA-binding function of their parent mRNAs, ceRNAs now include lncRNAs that did not derive from protein-coding genes [26]. ceRNAs have been proposed to function as miRNA ‘decoys’ or ‘sponges’, thereby de-repressing levels of protein coding transcripts that share with the ceRNAs the same miRNA response elements (MREs) [26]. Although ceRNA-mediated regulation represents an elegant mechanism by which lncRNAs may control protein function through miRNA mediators, the proportion of lncRNAs that act as ceRNAs remains unknown.
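The idea of two transcripts sharing miRNA response elements can be illustrated with a minimal sketch. It assumes the common simplification that an MRE is an exact match to the reverse complement of miRNA positions 2-8 (the seed); the function names, the toy miRNA names and the sequences are hypothetical and are not taken from the review. Real ceRNA analyses rely on dedicated target-prediction tools and expression data, so this only shows the counting logic.

```python
# Illustrative only: exact seed-match counting, not a validated target predictor.

def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence (A/C/G/U)."""
    pairs = {"A": "U", "U": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(rna))


def seed_site(mirna: str) -> str:
    """Sequence a transcript must contain to pair with miRNA positions 2-8."""
    seed = mirna.upper().replace("T", "U")[1:8]  # positions 2-8 (1-based)
    return reverse_complement(seed)


def count_sites(transcript: str, mirna: str) -> int:
    """Count (possibly overlapping) exact seed-match sites in a transcript."""
    site = seed_site(mirna)
    seq = transcript.upper().replace("T", "U")
    return sum(
        1
        for i in range(len(seq) - len(site) + 1)
        if seq[i:i + len(site)] == site
    )


def shared_mirnas(transcript_a: str, transcript_b: str, mirnas: dict) -> list:
    """Names of miRNAs with at least one seed-match site in both transcripts."""
    return sorted(
        name
        for name, seq in mirnas.items()
        if count_sites(transcript_a, seq) > 0 and count_sites(transcript_b, seq) > 0
    )


if __name__ == "__main__":
    # Hypothetical toy inputs (assumptions for illustration only).
    toy_mirnas = {
        "toy_mir_1": "UGAGGUAGUAGGUUGUAUAGUU",
        "toy_mir_2": "UAAAGUGCUUAUAGUGCAGGUAG",
    }
    candidate_cerna = "AAACUACCUCAAA"
    coding_transcript = "GGCTACCTCGGCUACCUC"
    print(shared_mirnas(candidate_cerna, coding_transcript, toy_mirnas))
```

Sharing a seed-match site is at most a necessary condition for the decoy model described above; it says nothing about relative abundance or whether binding occurs in vivo.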

A final possibility is that an as-yet-unknown proportion of lncRNAs represent transcriptional noise, generated through random collisions of RNA polII complexes with DNA [27]. We also do not mean to imply that these lncRNA functional categories are mutually exclusive, since it is likely that some lncRNAs possess functions from multiple categories.

Accurate identification and annotation of lncRNAs is a necessary first step towards understanding the full functional potential of transcriptomes. The technology for sequencing full length transcripts is available and is allowing for the generation of large RNA-seq data sets. In this review we first discuss the relative merits of individual protocols for identifying lncRNAs using NGS technology but then focus our discussion on various considerations that are required when undertaking an RNA-seq experiment for the discovery of lncRNAs. Factors discussed include the type of sequencing library, the sequencing protocol, read mapping and transcript building algorithms, as well as lncRNA categorisation using computational methods.

Section snippets

Methods for detecting long non-coding RNAs

The full transcriptional repertoire of a given organism is not predictable from just its genomic sequence. Protein-coding gene transcripts and some families of ncRNAs (e.g. tRNAs and rRNAs) can be predicted reasonably accurately based on the presence of long open reading frames (ORFs) and sequence similarity, respectively [28]. However, many of what we now consider to be lncRNAs were not initially predicted or identified in the years soon after the sequencing of the human genome, leaving a

Predicting lncRNAs using RNA-seq

A workflow for the discovery of lncRNAs is outlined in Fig. 2A. While the study design will dictate how analysis of the resulting data is performed, lncRNA discovery approaches show similarities among many studies (Table 1). Below we discuss steps and considerations required for detecting lncRNAs using RNA-seq.
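One step that such workflows commonly include is a filter for transcripts that appear to lack protein-coding potential. The sketch below is a minimal illustration, assuming a simple longest-ORF criterion on the forward strand; the thresholds `min_length` and `max_orf_codons` are hypothetical parameters chosen for illustration and are not values given in this excerpt. It is not the workflow of Fig. 2A, and published pipelines typically combine several lines of evidence rather than ORF length alone.

```python
# Illustrative only: a crude coding-potential filter based on ORF length.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_codons(seq: str) -> int:
    """Longest forward-strand ORF, in codons from ATG up to (not including) the stop."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, (i - start) // 3)
                start = None
    return best


def is_candidate_lncrna(seq: str, min_length: int = 200, max_orf_codons: int = 100) -> bool:
    """True if the transcript is long and has no ORF reaching the cutoff.

    Both thresholds are assumptions for this sketch, not values from the text.
    """
    return len(seq) >= min_length and longest_orf_codons(seq) < max_orf_codons


if __name__ == "__main__":
    # Hypothetical toy transcripts (assumptions for illustration only).
    coding_like = "ATG" + "GCT" * 120 + "TAA" + "A" * 50
    noncoding_like = "A" * 100 + "ATGGCTTAA" + "C" * 150
    print(is_candidate_lncrna(coding_like), is_candidate_lncrna(noncoding_like))
```

A filter of this kind only flags candidates: short ORFs can still be translated, and ORFs interrupted by assembly errors can make a coding transcript look non-coding.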

Categorising lncRNAs

Protein coding genes can often be functionally annotated computationally through homology searches to known protein families, patterns of expression and protein domain structure. Ideally lncRNAs would be classified in a similar manner. However, little is currently known about specific features that can distinguish different classes of lncRNAs. LncRNAs are generally better conserved than neutrally evolving sequence [84], [85] suggesting conserved function across species. However, annotation of

Future perspectives

RNA-seq has led to the identification of many novel long non-coding loci. These large-scale studies have revealed fundamental characteristics of lncRNAs including their low levels of expression, temporal and spatial patterns of expression, sequence conservation and association with histone modifications. Functional assays have also revealed diverse mechanisms through which lncRNAs act to regulate protein coding genes at both the transcriptional and translational level. However there remains

References (91)

A.L. Zimmerman et al., Cancer Lett. (2011)
M.A. Lindsay, Trends Immunol. (2008)
J.S. Mattick, Semin. Cell Dev. Biol. (2011)
C.P. Ponting et al., Cell (2009)
Y. Jeon et al., Cell (2011)
J. Zhao, Mol. Cell (2010)
J.L. Rinn, Cell (2007)
L. Salmena, Cell (2011)
B.E. Bernstein, Cell (2005)
A. Barski, Cell (2007)
T. Li, Genomics (2012)
T.G. Belgard, Neuron (2011)
S.F. Altschul, J. Mol. Biol. (1990)
N.T. Ingolia et al., Cell (2011)
E. Berardi, Am. J. Physiol. Heart Circ. Physiol. (2012)
C. Ciaudo, PLoS Genet. (2009)
M. Garg, World J. Stem Cells (2012)
T.Y. Ha, Immune Netw. (2011)
P. Carninci, Science (2005)
M.N. Cabili, Genes Dev. (2011)
M. Guttman, Nature (2009)
M. Guttman, Nat. Biotechnol. (2010)
A. Sandelin, Nat. Rev. Genet. (2007)
T. Derrien, Genome Res. (2012)
J.T. Lee, Science (2012)
C.M. Clemson, J. Cell Biol. (1996)
F. Lai, Nature (2013)
F. Koch, Nat. Struct. Mol. Biol. (2011)
T.K. Kim, Nature (2010)
F. De Santa, PLoS Biol. (2010)
G. Natoli et al., Annu. Rev. Genet. (2012)
H. Cho, Mol. Cell Biol. (1998)
K. Struhl, Nat. Struct. Mol. Biol. (2007)
E.S. Lander, Nature (2001)
J. Kawai, Nature (2001)
S. Katayama, Science (2005)
J.L. Rinn, Genes Dev. (2003)
T. Ravasi, Genome Res. (2006)
P. Kapranov, Science (2002)
D. Kampa, Genome Res. (2004)
N.D. Heintzman, Nat. Genet. (2007)
T.S. Mikkelsen, Nature (2007)
T.Y. Roh, Proc. Natl. Acad. Sci. USA (2006)
L.X. Garmire, PLoS ONE (2011)
H. Sun, Nucleic Acids Res. (2011)

