Elsevier

Methods

Volume 63, Issue 1, 1 September 2013, Pages 50-59

Predicting long non-coding RNAs using RNA sequencing

https://doi.org/10.1016/j.ymeth.2013.03.019

Abstract

The advent of next-generation sequencing technologies, and in particular RNA sequencing (RNA-seq), has expanded our knowledge of the transcriptional capacity of human and other animal genomes. Recent RNA-seq studies have revealed that transcription is widespread across the mammalian genome, resulting in a large increase in the number of putative transcripts both within known protein-coding genes and in the regions between them. Long transcripts that appear to lack protein-coding potential (long non-coding RNAs, lncRNAs) have been the focus of much recent research, in part owing to observations that their expression is restricted to particular cell types and developmental time points. A variety of sequencing protocols are currently available for identifying lncRNAs, including RNA polymerase II occupancy, chromatin state maps and, the focus of this review, deep RNA sequencing. In addition, there are numerous analytical methods available for mapping reads and assembling transcript models that predict the presence and structure of lncRNAs from RNA-seq data. Here we review current methods for identifying lncRNAs using large-scale sequencing data from RNA-seq experiments and highlight analytical considerations that are required when undertaking such projects.

Introduction

Multiple classes of non-coding RNAs (ncRNAs) are transcribed from vertebrate genomes. A number of these classes possess general housekeeping functions and include ribosomal RNA (rRNA), transfer RNA (tRNA) and small nuclear/nucleolar RNA (sn/snoRNA). A further set of short RNA molecules (miRNA, siRNA and piRNA) have regulatory roles during diverse cellular processes including cellular differentiation [1], [2], cancer progression [3], [4] and immunity [5], [6]. Nevertheless, large-scale sequencing of cDNA from mammalian cells has identified widespread transcription from the mouse genome [7], giving rise to long transcripts that appear to fall outside of these classes of housekeeping or short RNAs. From these initial transcript maps, long non-coding RNA (lncRNA) species have been identified whose loci lie both within and between protein-coding genes. While lncRNAs remain the most enigmatic ncRNA species in terms of function, there is now much effort centred on their functional characterisation and their molecular mechanisms in different cell types [8], [9], [10], [11].

Owing to the development of next-generation sequencing (NGS) technology, lncRNA identification is now more easily achievable and several assay-based sequencing protocols have been developed that predict lncRNAs. As a result of analyses of RNA polymerase II ChIP-seq [12], chromatin signature [9] and RNA-seq [8] data, there are now multiple catalogues of lncRNAs that are expressed from diverse species, tissues and cell lines. However, each of these approaches to predicting lncRNAs has both advantages and disadvantages. In addition, predicting the molecular and cellular functions of these transcripts has, in general, not been achievable. This is due in part to our current poor understanding of RNA sequence-structure-function relationships, and to the heterogeneity in the location and expression patterns of these transcripts: they can be expressed as cytosolic or nuclear transcripts [13], as sense or antisense transcripts of protein-coding genes [7], from intronic regions of protein-coding loci or from intergenic regions.

While the molecular mechanisms of most lncRNAs have yet to be elucidated, it is clear that many lncRNAs act by activating or repressing the transcriptional activity of other genes either in cis or in trans, or else by modifying their transcript abundance [14]. Activation and repression can be facilitated by interactions between lncRNAs and chromatin-modifying enzymes. Indeed, establishment of repressive chromatin via the recruitment of histone-modifying proteins has been described for a number of lncRNAs [15]. XIST is one such lncRNA transcript that facilitates genomic imprinting of the X chromosome. XIST coats the X chromosome in an allele-specific manner [16] and, through direct interactions with the polycomb repressive complex 2 (PRC2), renders the chromosome inactive via PRC2-mediated H3K27 trimethylation [17]. Genome-wide RNA immunoprecipitation sequencing (RIP-seq) analysis has revealed many additional lncRNAs that associate with the PRC2 complex and play a role in genomic imprinting [18]. For example, Gtl2 RNA may promote the association of PRC2 with the imprinted Dlk1 locus [18]. Roles of lncRNAs outside genomic imprinting have also been described. Chromatin state at the HOX locus in human fibroblasts is regulated via expression of the HOTAIR long non-coding RNA [19]. HOTAIR may directly interact with PRC2, permitting H3K27 trimethylation and the establishment of silent chromatin at the HOXD locus. In addition to mediating repressive chromatin, lncRNAs also act in activation processes. For example, ncRNA-a lncRNAs directly bind to the Mediator complex, permitting H3S10 phosphorylation and activation of downstream target genes such as SNAI1 and AURKA [20].

In addition to lncRNA function in epigenetic processes, there is interest in their potential roles at enhancer regions. There is growing evidence that many enhancers are transcribed as enhancer RNAs (eRNAs). eRNAs are produced from genomic regions marked by higher levels of histone H3, lysine 4 mono-methylation (H3K4me1) than H3K4me3 [21], [22], [23]. Co-activator binding at these loci confers tissue specificity of gene expression level. Recently, transcribed enhancers were described in the inflammatory response and neuronal activity [22], [23]. During inflammation, expression at numerous enhancer loci preceded that of genomically neighbouring immune mediators, suggestive of a cis-regulatory function for these eRNAs. Whether there is a consequence of transcription at these loci – whether eRNA function is RNA sequence dependent or independent – remains unknown. There are several explanations for enhancer transcription (reviewed in [24]). eRNA transcripts may form ribonucleoprotein complexes that facilitate epigenetic or non-epigenetic regulation. However, whether eRNAs are capable of directing transcription factors or chromatin remodelers in a sequence dependent manner remains to be experimentally tested. The act of transcription may be important for regulation at proximal targets. The binding of polII at enhancers may provide a mechanism through which histone modifying enzymes associated with the polII complex [25] create a domain of permissive chromatin, rendering the transcript a non-functional by-product [24]. Whether eRNAs, in general, possess sequence-dependent function remains unknown.

Competitive endogenous RNAs (ceRNAs) represent another class of lncRNA. Originally described as transcribed retropseudogenes that retain the miRNA-binding function of their parent mRNAs, ceRNAs now include lncRNAs that did not derive from protein-coding genes [26]. ceRNAs have been proposed to function as miRNA ‘decoys’ or ‘sponges’, thereby de-repressing levels of protein coding transcripts that share with the ceRNAs the same miRNA response elements (MREs) [26]. Although ceRNA-mediated regulation represents an elegant mechanism by which lncRNAs may control protein function through miRNA mediators, the proportion of lncRNAs that act as ceRNAs remains unknown.

A final possibility is that an as-yet-unknown proportion of lncRNAs represent transcriptional noise, generated through random collisions of RNA polII complexes with DNA [27]. We also do not mean to imply that these lncRNA functional categories are mutually exclusive, since it is likely that some lncRNAs possess functions from multiple categories.

Accurate identification and annotation of lncRNAs is a necessary first step towards understanding the full functional potential of transcriptomes. The technology for sequencing full-length transcripts is available and allows the generation of large RNA-seq data sets. In this review we first discuss the relative merits of individual protocols for identifying lncRNAs using NGS technology and then focus on the considerations that are required when undertaking an RNA-seq experiment for the discovery of lncRNAs. Factors discussed include the type of sequencing library, the sequencing protocol, read mapping and transcript building algorithms, as well as lncRNA categorisation using computational methods.

Methods for detecting long non-coding RNAs

The full transcriptional repertoire of a given organism is not predictable from just its genomic sequence. Protein-coding gene transcripts and some families of ncRNAs (e.g. tRNAs and rRNAs) can be predicted reasonably accurately based on the presence of long open reading frames (ORFs) and sequence similarity, respectively [28]. However, many of what we now consider to be lncRNAs were not initially predicted or identified in the years soon after the sequencing of the human genome, leaving a

Predicting lncRNAs using RNA-seq

A workflow for the discovery of lncRNAs is outlined in Fig. 2A. While the study design will dictate how analysis of the resulting data is performed, lncRNA discovery approaches show similarities among many studies (Table 1). Below we discuss steps and considerations required for detecting lncRNAs using RNA-seq.
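
As a rough illustration of the ORF-based reasoning used to separate protein-coding transcripts from candidate lncRNAs, the sketch below flags a transcript as a candidate when it is long but contains no long open reading frame. It is a minimal sketch, not the workflow of Fig. 2A: the function names and both thresholds (200 nucleotides and 100 codons) are assumptions made for illustration and are not taken from this review. Only the three forward-strand frames are scanned, and an ORF is counted only when it runs from an ATG to an in-frame stop codon.

```python
# Hypothetical thresholds chosen for illustration; neither value comes from the review.
MIN_TRANSCRIPT_LENGTH = 200  # nucleotides; assumed cut-off for a "long" transcript
MAX_ORF_CODONS = 100         # assumed longest ORF tolerated in a non-coding call

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_codons(sequence):
    """Return the length in codons (ATG included, stop excluded) of the
    longest ATG-to-stop ORF in the three forward-strand reading frames."""
    seq = sequence.upper().replace("U", "T")  # accept RNA or DNA alphabets
    best = 0
    for frame in range(3):
        start = None  # position of the ATG that opened the current ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, (i - start) // 3)
                start = None  # close the ORF and look for the next ATG
    return best


def is_candidate_lncrna(sequence):
    """Crude filter: long transcript whose longest ORF is short."""
    return (len(sequence) >= MIN_TRANSCRIPT_LENGTH
            and longest_orf_codons(sequence) <= MAX_ORF_CODONS)


if __name__ == "__main__":
    coding_like = "ATG" + "GCT" * 150 + "TAA"  # one 151-codon ORF
    orf_free = "A" * 300                        # long, no start codon
    print(is_candidate_lncrna(coding_like), is_candidate_lncrna(orf_free))
```

A filter of this kind is only a first pass. Short ORFs occur by chance in long transcripts, and truncated transcript models can hide a genuine ORF, so in practice such a rule would be combined with other lines of evidence.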

Categorising lncRNAs

Protein coding genes can often be functionally annotated computationally through homology searches to known protein families, patterns of expression and protein domain structure. Ideally lncRNAs would be classified in a similar manner. However, little is currently known about specific features that can distinguish different classes of lncRNAs. LncRNAs are generally better conserved than neutrally evolving sequence [84], [85] suggesting conserved function across species. However, annotation of

Future perspectives

RNA-seq has led to the identification of many novel long non-coding loci. These large-scale studies have revealed fundamental characteristics of lncRNAs including their low levels of expression, temporal and spatial patterns of expression, sequence conservation and association with histone modifications. Functional assays have also revealed diverse mechanisms through which lncRNAs act to regulate protein coding genes at both the transcriptional and translational level. However there remains

References (91)

  • A.L. Zimmerman et al., Cancer Lett. (2011)
  • M.A. Lindsay, Trends Immunol. (2008)
  • J.S. Mattick, Semin. Cell Dev. Biol. (2011)
  • C.P. Ponting et al., Cell (2009)
  • Y. Jeon et al., Cell (2011)
  • J. Zhao, Mol. Cell (2010)
  • J.L. Rinn, Cell (2007)
  • L. Salmena, Cell (2011)
  • B.E. Bernstein, Cell (2005)
  • A. Barski, Cell (2007)
  • T. Li, Genomics (2012)
  • T.G. Belgard, Neuron (2011)
  • S.F. Altschul, J. Mol. Biol. (1990)
  • N.T. Ingolia et al., Cell (2011)
  • E. Berardi, Am. J. Physiol. Heart Circ. Physiol. (2012)
  • C. Ciaudo, PLoS Genet. (2009)
  • M. Garg, World J. Stem Cells (2012)
  • T.Y. Ha, Immune Netw. (2011)
  • P. Carninci, Science (2005)
  • M.N. Cabili, Genes Dev. (2011)
  • M. Guttman, Nature (2009)
  • M. Guttman, Nat. Biotechnol. (2010)
  • A. Sandelin, Nat. Rev. Genet. (2007)
  • T. Derrien, Genome Res. (2012)
  • J.T. Lee, Science (2012)
  • C.M. Clemson, J. Cell Biol. (1996)
  • F. Lai, Nature (2013)
  • F. Koch, Nat. Struct. Mol. Biol. (2011)
  • T.K. Kim, Nature (2010)
  • F. De Santa, PLoS Biol. (2010)
  • G. Natoli et al., Annu. Rev. Genet. (2012)
  • H. Cho, Mol. Cell Biol. (1998)
  • K. Struhl, Nat. Struct. Mol. Biol. (2007)
  • E.S. Lander, Nature (2001)
  • J. Kawai, Nature (2001)
  • S. Katayama, Science (2005)
  • J.L. Rinn, Genes Dev. (2003)
  • T. Ravasi, Genome Res. (2006)
  • P. Kapranov, Science (2002)
  • D. Kampa, Genome Res. (2004)
  • N.D. Heintzman, Nat. Genet. (2007)
  • T.S. Mikkelsen, Nature (2007)
  • T.Y. Roh, Proc. Natl. Acad. Sci. USA (2006)
  • L.X. Garmire, PLoS ONE (2011)
  • H. Sun, Nucleic Acids Res. (2011)