Abstract
Long noncoding RNAs (lncRNAs) can promote or repress the cellular hallmarks of cancer. Understanding their molecular roles and realising their therapeutic potential depend on high-quality catalogues of cancer lncRNA genes. Presently, such catalogues depend on labour-intensive curation of heterogeneous data with permissive criteria, resulting in unknown numbers of genes without direct functional evidence. Here, we present an approach for semi-automated curation focused exclusively on pathogenic functionality. The result is Cancer LncRNA Census 2 (CLC2), comprising 492 gene loci in 33 cancer types. To complement manual literature curation, we develop an automated pipeline, CLIO-TIM, to identify novel cancer lncRNAs based on functional evolutionary conservation with mouse. This yields 95 novel lncRNAs, which display characteristics of known cancer genes and include LINC00570 (ncRNA-a5), which we demonstrate experimentally to promote cell proliferation. The clinical importance and curation accuracy of CLC2 lncRNAs are highlighted by a range of features, including evolutionary selection, expression in tumours, and both somatic and germline polymorphisms. The entire dataset is available in a highly curated format facilitating the widest range of downstream applications. In summary, we show how manual and automated methods can be integrated to catalogue known and novel functional cancer lncRNAs with unique genomic and clinical properties.
Introduction
Cancer arises through a series of somatic genetic mutations, leading via defined cellular phenotypic hallmarks to the formation of a tumour (Hanahan & Weinberg, 2011)(Yates & Campbell, 2012). Such mutations are principally thought to alter the function of polypeptides encoded by protein-coding genes (pc-genes) (Sondka et al., 2018) and the increased probability of suffering dysfunctional mutations defines known cancer genes (Furney et al., 2006). Datasets such as the Cancer Gene Census (CGC) collect and organise comprehensive sets of cancer pc-genes according to defined criteria, and represent invaluable and widely-used resources for scientific research and drug discovery (Sondka et al., 2018).
The past decade has witnessed the discovery of numerous non-protein-coding RNA genes in mammalian cells (Guttman et al., 2009; Uszczynska-Ratajczak et al., 2018). The most numerous, but most poorly understood, of these produce long noncoding RNAs (lncRNAs), defined as transcripts >200 nt in length with no detectable protein-coding potential (Derrien et al., 2012). Although their molecular mechanisms are highly diverse, many lncRNAs have been shown to interact with other RNA molecules, proteins and DNA through structural and sequence-specific interactions (Guttman & Rinn, 2012) (Johnson & Guigó, 2014). Most lncRNAs are clade- and species-specific, but a subset display deeper evolutionary conservation in their gene structure (Ulitsky et al., 2011) and a handful have been demonstrated to have functions that were conserved across millions of years of evolution (Marín-Béjar et al., 2017; Ulitsky et al., 2011). The number of known lncRNA genes in human has grown rapidly, and present catalogues range from 18,000 to ∼100,000 (Frankish et al., n.d.); however, just a tiny fraction have been functionally characterized (Kopp & Mendell, 2018)(Ulitsky & Bartel, 2013)(Ma et al., 2019)(Quek et al., 2015). Understanding the clinical and therapeutic significance of these numerous novel genes is a key contemporary challenge.
LncRNAs have been implicated in molecular processes governing tumorigenesis (Slack & Chinnaiyan, 2019). LncRNAs may promote or oppose cancer hallmarks (Du et al., 2013b). This fact, coupled to the emergence of potent in vivo inhibitors in the form of antisense oligonucleotides (ASOs) (Dias & Stein, 2002), has given rise to serious interest in lncRNAs as drug targets in cancer by both academia and pharma (Tony Gutschner et al., 2013; Wahlestedt, 2013)(Kaczmarek et al., 2017)(Slack & Chinnaiyan, 2019).
Initially, cancer lncRNAs were discovered by classical functional genomics workflows employing microarray or RNA-seq expression profiling (Huarte et al., 2010; Iyer et al., 2015). More recently, CRISPR-based functional screening (Esposito et al., 2019) and bioinformatic predictions (Lanzós et al., 2017; Mularoni et al., 2016; Rheinbay et al., 2017) have also emerged as powerful tools for novel cancer gene discovery. To assess their accuracy, these approaches require accurate benchmarks in the form of curated databases of known cancer lncRNAs.
Any discussion of lncRNAs and cancer requires careful terminology. Tumours display large numbers of differentially expressed genes (Iyer et al., 2015). However, just a fraction of these are likely to functionally contribute to a relevant cellular phenotype or cancer hallmark (T Gutschner et al., n.d.; Hosono et al., 2017; Lee et al., 2016; Leucci et al., 2016; Munschauer et al., 2018). Such genes, termed here “functional cancer lncRNAs”, are the focus of this study. Remaining changing genes are non-functional “bystanders”, which are largely irrelevant in understanding or inhibiting the molecular processes causing cancer.
There are a number of excellent databases of cancer-associated lncRNAs: lncRNADisease (Bao et al., 2019), CRlncRNA (Wang et al., 2018), EVLncRNAs (Zhou et al., 2018) and Lnc2Cancer (Gao et al., 2019). These principally employ labour-intensive manual curation, and rely extensively on differential expression to identify candidates. On the other hand, these databases have not begun to use more recent high-confidence sources of functional cancer lncRNAs, such as high-throughput functional screens (Esposito et al., 2019)(Abbott et al., 2015). For these reasons, existing annotations likely contain unknown numbers of bystander lncRNAs, while omitting large numbers of bona fide functional cancer lncRNAs. Thus, studies requiring high-confidence gene sets, including benchmarking or drug discovery, call for a database focussed exclusively on functional cancer lncRNAs.
Here we address this need through the creation of the Cancer LncRNA Census 2 (CLC2). This extends our previous CLC dataset by several fold (Carlevaro-Fita et al., 2020). More importantly, CLC2 takes a major step forward methodologically, by implementing an automated curation component that utilises functional evolutionary conservation for the first time. Using these data, we present a comprehensive analysis of the genomic and clinical features of cancer lncRNAs. Most importantly, we present a practical and versatile dataset intended for use by basic researchers and drug discovery projects.
Results
Integrative, semi-automated cataloguing of cancer lncRNAs
We sought to develop an improved map of lncRNAs with functional roles in either promoting or opposing cancer hallmarks or tumorigenesis. Such a map should prioritise lncRNAs with genuine causative roles, and exclude false-positive “bystander” lncRNAs whose expression changes but which play no functional role.
We began with conventional manual curation of lncRNAs from the scientific literature, covering the period from January 2017 (directly after the end of the first CLC (Carlevaro-Fita et al., 2020)) to the end of December 2018. The criteria for defining cancer lncRNAs were identical: genes must be annotated in GENCODE (here version 28), and cancer function must be demonstrated by in vitro or in vivo experiments or germline or somatic mutational evidence (Figure 1A). Altogether we collected 253 novel lncRNAs in this way, which added to the original CLC (n=122) amounts to 375 lncRNAs, hereafter denoted as “literature lncRNAs” (Figure 1A).
A) Literature curation with four criteria is used to define “literature lncRNAs”. B) Transposon insertion mutagenesis screens identify “mutagenesis lncRNAs”. C) Validated hits from CRISPRi proliferation screens are denoted “CRISPRi lncRNAs”. Statistical significance calculated by one-sided Fisher’s test.
We previously showed that transposon insertion mutagenesis (TIM) hits are enriched in cancer-associated lncRNA genes, implying that cancer lncRNAs’ functions can be deeply evolutionarily conserved (Carlevaro-Fita et al., 2020). We developed a pipeline to automatically identify likely human functional cancer lncRNAs by orthology to a collection of TIM hits (Abbott et al., 2015). In this way 123 lncRNAs were detected, of which 102 were not already in the literature set. These were added to the CLC2, henceforth denoted as “mutagenesis lncRNAs” (Figure 1B). This analysis is discussed in more detail in the next section.
Pooled functional screens based on CRISPR-Cas9 loss-of-function have recently emerged as a powerful means of identifying functional cancer lncRNAs (Esposito et al., 2019). The most comprehensive dataset presently available comes from a CRISPR-inhibition (CRISPRi) screen of ∼16,000 lncRNAs in seven human cell lines, with proliferation as a readout (Liu et al., 2017). Of the 499 hits identified, 322 are annotated by GENCODE. These hits are significantly enriched for known cancer lncRNAs from the literature search (Figure 1C). That study independently validated 21 GENCODE-annotated hits. Four (19%) of these were already mentioned in the scientific literature, and 2 (10%) were detected in the TIM screen above. Given their high confidence, we added the remaining 15 novel lncRNAs to CLC2 (“CRISPRi lncRNAs”) (Figure 1C).
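The enrichment tests reported throughout use a one-sided Fisher’s exact test. As a minimal sketch of that calculation (not the script used in this study), the function below computes the upper-tail hypergeometric probability for a 2x2 table using only the Python standard library. The function name and the table layout are illustrative assumptions.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """Upper-tail (enrichment) p-value for the 2x2 table [[a, b], [c, d]].

    a: genes in the set of interest that are hits
    b: genes in the set of interest that are not hits
    c: other genes that are hits
    d: other genes that are not hits
    """
    set_size = a + b            # size of the gene set of interest
    n_hits = a + c              # total number of hits
    n_total = a + b + c + d     # all genes tested
    upper = min(set_size, n_hits)
    # Sum hypergeometric probabilities for overlaps >= the observed one.
    favourable = sum(
        comb(set_size, k) * comb(n_total - set_size, n_hits - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(n_total, n_hits)

# Classic worked example: table [[3, 1], [1, 3]] gives 17/70.
p_value = fisher_one_sided(3, 1, 1, 3)
```

A small p-value indicates that the overlap between the gene set and the hits is larger than expected if hits were distributed at random among all tested genes.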
Altogether, CLC2 comprises 492 unique lncRNA genes detected in 33 cancer types. PVT1, SNHG1 and SNHG12 genes are detected in all three sources. The entire CLC2 dataset is available in Supplementary Table 1 and 2. Importantly, the dataset is fully annotated with evidence information, enabling users to filter particular sets of lncRNAs in which they have greater levels of confidence.
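The composition of CLC2 follows directly from the counts given in the preceding paragraphs. The short check below restates that arithmetic; all numbers are taken from the text, and the variable names are illustrative only.

```python
# Counts as stated in the text; variable names are illustrative.
original_clc = 122           # lncRNAs in the first CLC
new_literature = 253         # newly curated from the literature
literature = original_clc + new_literature

mutagenesis_new = 102        # CLIO-TIM lncRNAs not already in the literature set

validated_crispri = 21       # independently validated GENCODE-annotated hits
in_literature = 4            # already present in the literature
in_tim = 2                   # already detected by the TIM analysis
crispri_new = validated_crispri - in_literature - in_tim

total = literature + mutagenesis_new + crispri_new
print(literature, crispri_new, total)  # 375 15 492
```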
Automated annotation of human cancer lncRNAs via functional conservation
We recently showed that transposon insertional mutagenesis (TIM) screens can be used to identify cancer lncRNAs in mouse (Copeland & Jenkins, 2010) (Carlevaro-Fita et al., 2020), and that these often have human orthologues which are also cancer genes (Figure 2A). TIM screens identify “common insertion sites” (CIS), where multiple transposon insertions at a particular genomic location have given rise to a tumour, thereby implicating the underlying gene as an oncogene or tumour suppressor.
A) Overview of transposon insertional mutagenesis (TIM) method for identifying functional cancer genes. Engineered transposons carry bidirectional cassettes capable of either blocking or upregulating gene transcription, depending on orientation. Transposons are introduced into a population of cells, where they integrate at random genomic sites. The cells are injected into a mouse. In some cells, transposons will land in and perturb expression of a cancer gene (either tumour suppressor or oncogene), giving rise to a tumour. DNA of tumour cells is sequenced to identify the exact location of the transposon insertion. Clusters of such insertions are termed Common Insertion Sites (CIS). B) (Left) Schematic of the CLIO-TIM pipeline used here to identify human cancer genes using mouse CIS. (Right) An example of a CLIO-TIM predicted cancer lncRNA. C) The density of hCIS sites, normalised by gene length, in indicated classes of lncRNAs. Statistical significance calculated by one-sided Fisher’s test. D) Upper panels: Expression of LINC00570 RNA in response to inhibition by CRISPRi (left) or ASOs (right). Lower panels: Measured populations of the same cells over time. Statistical significance calculated by Student’s t-test.
We reasoned that this could be extended to identify new functional cancer lncRNAs in human, and developed a pipeline for this: CLIO-TIM (cancer lncRNA identification by orthology to TIM). Briefly, CLIO-TIM uses chain alignments to map mouse CIS to orthologous regions of the human genome, and then identifies their likely target gene by proximity (see Methods) (Figure 2B) (SUPP FIG 1B).
Using 26,345 mouse CIS from public databases (Abbott et al., 2015), CLIO-TIM identifies 16,430 orthologous regions in human (hCIS) (Figure 2B) (SUPP FIG 1A). Altogether, 123 lncRNAs and 9,295 pc-genes are identified as potential cancer genes. An example is the human-mouse orthologous lncRNA locus shown in Figure 2B, comprising Gm36495 in mouse and LINC00570 in human. A CIS lies upstream of the mouse gene’s TSS, mapping to the first intron of the human orthologue. LINC00570 is an alternative identifier for ncRNA-a5, a cis-acting lncRNA identified by Ørom et al. (Ørom et al., 2010) that has not previously been associated with cancer or cell growth.
We expect that hCIS regions are enriched in known cancer genes. Consistent with this, the 698 pc-genes from the COSMIC Cancer Gene Census (CGC) (Sondka et al., 2018) (red in SUPP FIG 1D) are 155-fold enriched with hCIS over intergenic regions (light grey). Turning to lncRNAs, the 375 literature lncRNAs are 19.5-fold enriched, supporting their disease importance (Figure 2C). Thus, CLIO-TIM predictions are enriched in genuine protein-coding and lncRNA functional cancer genes. As expected, the overall numbers of genes implicated by CLIO-TIM agree with independent analysis in the CCGD database (SUPP FIG 1C).
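The fold enrichments quoted here compare length-normalised hCIS densities between a gene class and intergenic space. A minimal sketch of that calculation is shown below; the per-megabase unit, the function names and the numbers in the usage line are assumptions for illustration and are not values from this study.

```python
def density(n_insertions, total_length_bp):
    """Insertions per megabase of sequence."""
    return n_insertions / (total_length_bp / 1e6)

def fold_enrichment(n_set, length_set_bp, n_background, length_background_bp):
    """Ratio of insertion density in a gene set to that in the background."""
    return density(n_set, length_set_bp) / density(n_background, length_background_bp)

# Hypothetical example: 20 insertions in 1 Mb of genes versus
# 10 insertions in 10 Mb of intergenic sequence.
example = fold_enrichment(20, 1_000_000, 10, 10_000_000)
```

Normalising by length matters because longer genes are more likely to receive an insertion by chance alone.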
An additional 209 hCIS fall in intergenic regions that are part of neither pc-genes nor lncRNAs, leading us to ask whether some may affect lncRNAs that are not annotated by GENCODE (Figure 2C). To test this, we utilised the large set of cancer-associated lncRNAs from miTranscriptome (Iyer et al., 2015). 186 hCIS intersect 2167 miTranscriptome genes, making these potentially novel non-annotated transcripts involved in cancer. Nevertheless, simulations indicated that this rate of overlap was no greater than expected by random chance (see Methods), making it unlikely that substantial numbers of cancer lncRNAs remain undiscovered in intergenic regions, at least with the datasets used here (SUPP FIG 1E).
In addition to known cancer lncRNAs, CLIO-TIM identifies 102 lncRNAs not previously linked to cancer (FIG 2C, dark grey) with a 3.8-fold enrichment of insertions over intergenic space. These lncRNAs represent novel functional cancer genes, and were added to the CLC2. This assumes that human orthologues of mouse cancer genes have a conserved function. We tested this using LINC00570, which is predicted to be a cancer gene by CLIO-TIM but has never previously been linked to cancer or cell proliferation. LINC00570 expression is dysregulated in several tumour types and its high expression is a poor prognostic indicator for Glioblastoma Multiforme (GBM) patients (SUPP FIG 2A) (Tang et al., 2019). We asked whether LINC00570 promotes cell growth in cancer cells. We found that it is robustly detected in cervical carcinoma HeLa cells and to a lesser extent in HCT116 colon carcinoma cells (SUPP FIG 2B). We designed three distinct antisense oligonucleotides (ASOs) targeting LINC00570 introns 2 and 3 and exon 3 of the short isoform. Transfection of these ASOs led to strong and reproducible decreases in steady state RNA levels in HeLa cells (Figure 2D). As a consequence, we observed significant decreases in cell proliferation rates (Figure 2D, SUPP FIG 2C). We observed a similar effect through CRISPRi-mediated inhibition of gene transcription by two independent guide RNAs in HeLa (Figure 2D), and using the same ASOs in HCT116 cells (SUPP FIG 2D and E). Therefore, the CLIO-TIM pipeline correctly identified LINC00570 as a likely new functional cancer lncRNA in human.
Enhanced cancer lncRNA catalogue integrating manual annotation, CRISPR screens and functional conservation
We here describe some general features of the CLC2 dataset. Figure 3A shows a breakdown of the composition of CLC2 in terms of source, gene function and evidence strength. Where possible, the genes are given a functional annotation, oncogene (og) or tumour suppressor (ts), according to evidence of promoting or opposing cancer hallmarks. Oncogenes (n=275) quite considerably outnumber tumour suppressors (n=95), although it is not clear whether this reflects genuine biology or an ascertainment bias relating to scientific interest or technical issues. Smaller sets of lncRNAs are associated with both functions, or have no functional information (those from TIM screens where the functions of hits are ambiguous).
A) The CLC2 database broken down by source, function and evidence type. B) Comparison of CLC2 to other leading cancer lncRNA databases.
In terms of the quality of evidence sources, CLC2 represents a strong improvement over the original CLC. The fraction of lncRNAs with high quality in vivo evidence (defined as functional validation in mouse models or identified by mutagenesis analysis) now stands at 66%, compared to 24% previously (FIG 3A, SUPP FIG 3B). In total the updated CLC2 comprises 33 cancer types (vs 29) and more lncRNAs are reported for every cancer subtype (SUPP FIG 3A).
Several longstanding cancer lncRNA collections have provided an invaluable resource for the community (Figure 3B). Comparing only GENCODE v28 genes, CLC2 with 492 genes is second only to Lnc2Cancer (n=512) (Gao et al., 2019). However, Lnc2Cancer uses looser inclusion criteria, including lncRNAs without GENCODE IDs and those that are differentially expressed in tumours with no functional evidence. The remaining databases are smaller. Furthermore, of all the databases, CLC2 has the greatest number of GENCODE gene loci that are unique to it (n=225), including numerous literature-annotated cases and also 95 novel mutagenesis lncRNAs. Just 40 lncRNAs are common to all five databases (Bao et al., 2019; Gao et al., 2019; Wang et al., 2018; Zhou et al., 2018). In summary, CLC2 is a large, practical and high-quality cancer lncRNA annotation that complements existing resources.
Unique genomic properties of CLC2 lncRNAs
Cancer genes, both protein-coding and not, have elevated characteristics of essentiality and clinical importance, compared to other genes (Furney et al., 2008)(Furney et al., 2006)(Du et al., 2013a)(Sondka et al., 2018). We next asked whether CLC2 lncRNAs display genomic features expected for functionally important genes.
In the following analyses, we compared gene features of CLC2 lncRNAs to all other lncRNAs. Comparison of gene sets can often be confounded by covariates such as gene length or gene expression, therefore where appropriate we used control gene sets that were matched to CLC2 by expression (denoted “nonCLCmatched”) (SUPP FIG 4A) and reported findings correcting for gene length (SUPP FIG 4B).
Using the LnCompare tool (Carlevaro-Fita et al., 2019), we find that the promoters and exons of CLC2 genes display elevated signatures of functionality, in terms of evolutionary conservation in mammalian and vertebrate phylogeny (Figure 4A) and expression in cancer cell lines (Figure 4B). Strikingly we observe a similar effect when considering the novel mutagenesis lncRNAs alone: their promoters are significantly more conserved than expected by chance, and their expression is an order of magnitude higher than other lncRNAs (Figure 4C and D).
In each panel, two gene sets are compared: the test set (either all CLC2 genes, or mutagenesis genes alone), and the set of all other lncRNAs (GENCODE v24). Y-axis: Log2 fold difference between the means of gene sets. X-axis: false-discovery rate adjusted significance, calculated by Wilcoxon test. A) Evolutionary conservation for all CLC2, calculated by PhastCons. B) Expression of all CLC2 in cell lines. C) Evolutionary conservation for mutagenesis lncRNAs, calculated by PhastCons. D) Expression of mutagenesis lncRNAs in cell lines. For (A) and (C), “Promoter mean” and “Exon mean” indicate mean PhastCons scores (7-vertebrate alignment) for those features, while “Exon-coverage” indicates percent coverage by PhastCons elements. Promoters are defined as a window of 200 nt centered on the transcription start site.
Further, we found that CLC2 lncRNAs are enriched in repetitive elements (SUPP FIG 5C) and are more likely to house a small RNA gene, possibly indicating that some act as precursor transcripts (SUPP FIG 5B). CLC2 lncRNAs also have non-random distributions of gene biotypes, being depleted for intergenic class and enriched in divergent orientation to other genes (SUPP FIG 5A).
CLC2 lncRNAs display consistent tumour expression changes and prognostic properties
Although gene expression was not a criterion for inclusion, we do expect that CLC2 lncRNA levels will be altered in tumours. Furthermore, we expect that oncogenes should be overexpressed, and tumour suppressors downregulated.
To test this, we analysed TCGA RNA-sequencing (RNA-seq) data from 686 individual tumours with matched healthy tissue (total n=1,372 analyzed samples) in 20 different cancer types (SUPP FIG 6A and B), and classed every gene as either differentially expressed (in at least one cancer subtype with a log2 Fold Change >1 and a FDR <0.05) or not. CLC2 lncRNAs are 3.4-fold more likely to be differentially expressed compared to expression-matched lncRNAs (Figure 5A). LncRNAs from each individual evidence source (literature, mutagenesis, CRISPRi) display the same trend. Similar effects were found for pc-genes (SUPP FIG 7A).
A) The percent of indicated genes that are significantly differentially expressed in at least one tumour type from the TCGA. Statistical significance calculated by one-sided Fisher’s test. B) Here, only differentially expressed genes from (A) are considered. LncRNAs with both tumour suppressor and oncogene labels are excluded. Remaining lncRNAs are divided by those that are up- or down-regulated (positive or negative fold change). Statistical significance calculated by one-sided Fisher’s test. C) The density of germline cancer-associated SNPs is displayed. Only SNPs falling in gene exons are counted, and are normalised to the total length of those exons. Statistical significance calculated by one-sided Fisher’s test. D) Examples of mutagenesis lncRNAs with an exonic cancer SNP. E) Length-normalised overlap rate of copy number variants (CNVs) in lncRNA gene span. Statistical significance calculated by one-sided Fisher’s test.
Next we asked whether the direction of expression change corresponds to gene function. Indeed, oncogenes are enriched for overexpressed genes, whereas tumor suppressors are enriched for down-regulated genes, supporting the functional labelling scheme (Figure 5B).
Cancer genes’ expression is often prognostic for patient survival. By correlating expression to patient survival, we found that the expression of 392 CLC2 lncRNAs correlated to patient survival in at least one cancer type (SUPP FIG 7C). When analyzing the most significant correlation of each CLC2 lncRNA compared to expression-matched nonCLC lncRNAs, we find a weak but significant enrichment (SUPP FIG 7C), showing that CLC2 lncRNAs tend to be prognostic for patient survival.
In summary, the gene expression characteristics of CLC2 genes, and of subsets from different evidence sources, support their functional labels as oncogenes and tumour suppressors and are more broadly consistent with their important roles in tumorigenesis.
CLC2 lncRNAs are enriched with cancer genetic mutations
Cancer genes are characterized by a range of germline and somatic mutations that lead to gain or loss of function. We hypothesised that cancer lncRNAs should be enriched with germline single nucleotide polymorphisms that have been linked to cancer predisposition (Deng et al., 2017). We obtained 5,331 cancer-associated single nucleotide polymorphisms (SNPs) from genome-wide association studies (GWAS) (Buniello et al., 2019) and mapped them to lncRNA and pc-gene exons, calculating a density score that normalises for exon length (SUPP FIG 4B). As expected, exons of known cancer pc-genes are >2-fold enriched in germline SNPs (SUPP FIG 7B). When performing the same analysis with CLC2 lncRNAs, one observes an even more pronounced enrichment of 4.0-fold when comparing to expression-matched nonCLC lncRNAs (Figure 5C). Once again, the lncRNAs from each evidence source individually show enrichment for cancer SNPs >2-fold (Figure 5C). Three mutagenesis lncRNAs, namely miR143HG/CARMN, LINC00511 and LINC01488, exhibit an exonic cancer SNP (Figure 5D).
Cancer genes are also frequently the subject of large-scale somatic mutations, or copy number variants (CNVs). Using a collection of CNV data from LncVar (Chen et al., 2017), we calculated the gene-span length-normalized coverage of lncRNAs by CNVs. CLC2 lncRNAs are enriched for CNVs compared to all non cancer lncRNAs (Figure 5E).
In summary, CLC2 lncRNAs display germline and somatic mutational patterns consistent with known oncogenes and tumour suppressors.
Discussion
We have presented the Cancer LncRNA Census 2, a resource of lncRNAs with functional roles in cancer. We hope this dataset will be of utility to a wide range of studies, from bioinformatic identification of new disease genes, to developing a new generation of cancer therapeutics with anti-lncRNA ASOs (Amodio et al., 2018).
A key novelty of CLC2 is its use of automated gene curation using functional evolutionary conservation. This responds to the challenge arising from the rapid growth of scientific literature, which makes manual curation increasingly impractical. Other automated methods like text mining and machine learning will also be important, although it will be necessary to ensure their predictions are sufficiently accurate in identifying bona fide functional lncRNAs and removing bystanders. Available evidence suggests that the CLIO-TIM pipeline accurately identifies functional cancer genes: in addition to experimental validation of the LINC00570 in two cell lines using two perturbations, the entire set of 102 predicted “mutagenesis lncRNAs” carry a range of features consistent with literature-curated cancer lncRNAs: promoter conservation, high expression, differential tumour expression and germline SNP enrichment. Removing cases that overlap other databases, this set amounts to 95 completely novel functional cancer lncRNAs, an invaluable resource for discovery of new molecular pathways and therapeutic targets.
We also integrated hits from the latest CRISPRi screens into CLC2 (Liu et al., 2017). Similar to the “mutagenesis” lncRNAs, the CRISPRi hits have features consistent with cancer functionality.
We recognise that some colleagues may ascribe lower confidence to the novel genes in CLC2 originating from mutagenesis and CRISPR sources. For this reason, the CLC2 data table is organised to facilitate filtering by confidence, to extract only the 375 literature-supported cases, or indeed any other subset based on source, evidence or function as desired by the researcher.
LINC00570 is a new functional cancer lncRNA predicted by CLIO-TIM. The gene was previously studied by Ørom and colleagues, as a cis-activating enhancer-like RNA named ncRNA-a5 (Ørom et al., 2010). That and a subsequent study showed that perturbation by siRNA transfection affects the expression of the nearby pc-gene ROCK2 in HeLa. However, these studies did not investigate the effect on cell proliferation. We here show, by means of two independent perturbations, that LINC00570 promotes proliferation of HeLa and HCT116 cells. These findings make LINC00570 a potential therapeutic target for follow-up.
Intriguingly, amongst the novel mutagenesis lncRNAs identified by CLIO-TIM are genes previously linked to other diseases. miR143HG/CARMEN1 (CARMN) was shown to regulate cardiac specification and differentiation in mouse and human hearts (Ounzain et al., 2015). In addition to being a TIM target, CARMEN1 also contains a germline cancer SNP correlating with the risk of developing lung cancer (Park et al., 2015), adding further weight to the notion that it also plays a role in oncogenesis. Similarly, DGCR5 is located in the DiGeorge critical locus and has been linked to neurodevelopment and neurodegeneration (Johnson et al., 2009), and was recently implicated as a tumour suppressor in prostate cancer (Li et al., 2019). These results raise the possibility that developmental lncRNAs can also play roles in cancer.
In summary, we anticipate that CLC2 will lay the foundation for understanding how lncRNAs are integrated into the molecular events underlying tumorigenesis, and provide targets for a new generation of anti-cancer therapies.
Material and Methods
Screening lncRNAs for inclusion in the CLC2
If not stated otherwise, GENCODE v28 gene IDs (gencode.v28.annotation.gtf) were used. Literature search. PubMed was searched for publications linking lncRNA and cancer using the keywords: long noncoding RNA cancer, lncRNA cancer. Manual curation and assignment of evidence levels to each lncRNA were performed in the same way as previously (Carlevaro-Fita et al., 2020) and included reports until December 2018.
CLIO-TIM Insertional mutagenesis screen analysis
A table with all CIS elements was downloaded from the CCGD website (http://ccgd-starrlab.oit.umn.edu/about.php, May 2018 (Abbott et al., 2015)). These mouse genomic regions (mm10) were converted to homologous regions in the human genome assembly hg38 using the LiftOver tool (https://genome.ucsc.edu/cgi-bin/hgLiftOver). Settings: original genome Mouse GRCm38/mm10, new genome Human GRCh38/hg38, minMatch 0.1 and minBlocks 0.1. IntersectBed from bedtools was used to assign human insertion sites to GENCODE IDs, requiring an overlap of at least 1 nt, and each site was classed as protein-coding or lncRNA. Insertion sites overlapping both protein-coding and lncRNA genes were always assigned to the protein-coding genes. If an insertion site overlapped multiple genes (ENSG IDs), including several lncRNA genes, all genes were reported. Insertion sites overlapping neither protein-coding nor lncRNA genes were classed as intergenic. To compare CLIO-TIM results with CCGD results for each gene set, CCGD human Entrez gene results were converted to GENCODE IDs using the “Entrez gene ids” metadata file from https://www.gencodegenes.org/human/.
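The assignment rule described above (at least 1 nt overlap, protein-coding genes taking priority over lncRNAs, all overlapping genes reported, otherwise intergenic) can be sketched in plain Python as follows. This is an illustration of the logic only, not the bedtools-based pipeline itself; it assumes half-open BED-style coordinates, and the gene names, biotype labels and coordinates in the usage example are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least 1 nt."""
    return a_start < b_end and b_start < a_end

def assign_site(site, genes):
    """Assign one human insertion site to a gene class.

    site:  (chrom, start, end)
    genes: list of (chrom, start, end, gene_id, biotype)
    Returns (class_label, list_of_gene_ids).
    """
    chrom, start, end = site
    hits = [g for g in genes
            if g[0] == chrom and overlaps(start, end, g[1], g[2])]
    # Protein-coding genes take priority over lncRNAs.
    pc = [g[3] for g in hits if g[4] == "protein_coding"]
    if pc:
        return "pc-gene", pc
    lnc = [g[3] for g in hits if g[4] == "lncRNA"]
    if lnc:
        return "lncRNA", lnc
    return "intergenic", []

# Hypothetical annotation and insertion sites.
genes = [
    ("chr1", 100, 200, "PC_A", "protein_coding"),
    ("chr1", 150, 300, "LNC_A", "lncRNA"),
    ("chr1", 250, 400, "LNC_B", "lncRNA"),
]
print(assign_site(("chr1", 180, 190), genes))  # ('pc-gene', ['PC_A'])
print(assign_site(("chr1", 260, 270), genes))  # ('lncRNA', ['LNC_A', 'LNC_B'])
print(assign_site(("chr1", 500, 510), genes))  # ('intergenic', [])
```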
Cancer-associated MiTranscriptome
The cancer associated MiTranscriptome IDs (Iyer et al., 2015) previously used in Bergada et al. (Bergadà-Pijuan et al., 2019) were intersected with intergenic insertion sites using IntersectBed. With ShuffleBed the intergenic insertions were randomly shuffled 1000x and assigned to MiTranscriptome IDs.
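The shuffling test can be illustrated with a simplified permutation sketch. The version below repositions sites uniformly on a single hypothetical region and derives an empirical p-value; it is an assumption-laden stand-in for ShuffleBed, which operates on real chromosomes and would be restricted to intergenic space. All names and coordinates are illustrative.

```python
import random

def count_overlaps(sites, transcripts):
    """Number of sites overlapping at least one transcript (half-open intervals)."""
    return sum(
        any(s < t_end and t_start < e for t_start, t_end in transcripts)
        for s, e in sites
    )

def empirical_p(sites, transcripts, region_len, n_iter=1000, seed=0):
    """Fraction of shuffles with at least as many overlaps as observed."""
    rng = random.Random(seed)
    observed = count_overlaps(sites, transcripts)
    at_least = 0
    for _ in range(n_iter):
        shuffled = []
        for s, e in sites:
            width = e - s
            # Keep each site's width, place it uniformly within the region.
            new_start = rng.randrange(0, region_len - width + 1)
            shuffled.append((new_start, new_start + width))
        if count_overlaps(shuffled, transcripts) >= observed:
            at_least += 1
    # Add-one correction avoids reporting an empirical p-value of exactly zero.
    return observed, (at_least + 1) / (n_iter + 1)

# Hypothetical region of 10 kb with two transcripts and three insertion sites.
observed, p = empirical_p(
    sites=[(100, 110), (5000, 5010), (9000, 9010)],
    transcripts=[(0, 500), (4000, 6000)],
    region_len=10_000,
)
```

A large empirical p-value, as reported for the miTranscriptome overlap, means that the observed number of overlapping sites is readily reached by random placement.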
CRISPRi screen analysis
We used Supp Table 1 (Liu et al., 2017) to extract ENST IDs and gene names, which were then converted to GENCODE IDs to match each guide (LH identifier in the screen). From supplementary table S4 (Liu_et_al_aah7111-TableS4) (Liu et al., 2017) we extracted genes with “hit” (validated as a hit in the screen), “LH” (unique identifiers correlating to a gene in the screen) and “lncRNA” (referring to a lncRNA gene and to exclude lncRNA hits close to a protein-coding gene (“Neighbor hit”)), resulting in 499 hits. Of these, 322 hits contain a GENCODE ID and were used for enrichment analysis, tested by one-sided Fisher’s test.
We included n=21 CRISPRi genes from Liu et al. Supp Fig 8A in the CLC2. For each lncRNA, the cancer cell line and the effect of CRISPRi on the growth phenotype (either promoting (tumor suppressor) or inhibiting (oncogene)) were reported.
Gene family grouping
For downstream analysis protein-coding (pc) genes (GENCODE IDs) are grouped in cancer-associated pc-genes (CGC genes) and non cancer-associated pc-genes (nonCGC n=19,174). The TSV file containing the CGC data was downloaded from https://cancer.sanger.ac.uk/census with 700 ENSGs with 698 ENSG IDs detected in GENCODE v28 of which 696 are unique (CGC n=696). The same is done for lncRNAs, into CLC2 (n=492) and nonCLC genes (n= 15,314).
Matched expression analysis
Based on an in house script used for Survival analysis (section below), TCGA survival expression data for each GENCODE ID is reported and the average FPKM across all tumor samples is calculated. The count distribution of nonCGC and nonCLC gene expression to CGC and CLC2 expression, respectively, is matched using the matchDistribution.pl script (https://github.com/julienlag/matchDistribution).
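The idea behind expression matching can be sketched with a simple binning approach: background genes are sampled so that their expression distribution follows that of the target set. The code below is an assumed, simplified illustration and not a reimplementation of matchDistribution.pl, whose exact algorithm and options are not described here. The log2(FPKM + 1) transform, the number of bins and all names are assumptions.

```python
import math
import random
from collections import defaultdict

def match_by_expression(target, background, n_bins=10, ratio=1, seed=0):
    """Sample background genes to mirror the target expression distribution.

    target, background: dict mapping gene_id -> mean FPKM.
    ratio: number of background genes requested per target gene.
    """
    rng = random.Random(seed)
    t_log = {g: math.log2(v + 1) for g, v in target.items()}
    b_log = {g: math.log2(v + 1) for g, v in background.items()}
    lo, hi = min(t_log.values()), max(t_log.values())
    width = (hi - lo) / n_bins or 1.0   # guard against a zero-width range

    def bin_of(x):
        return min(int((x - lo) / width), n_bins - 1)

    # How many background genes are wanted in each expression bin.
    wanted = defaultdict(int)
    for x in t_log.values():
        wanted[bin_of(x)] += ratio

    # Background genes available per bin (within the target range only).
    pool = defaultdict(list)
    for g, x in sorted(b_log.items()):
        if lo <= x <= hi:
            pool[bin_of(x)].append(g)

    matched = []
    for b, n in sorted(wanted.items()):
        candidates = pool.get(b, [])
        matched.extend(rng.sample(candidates, min(n, len(candidates))))
    return matched

# Hypothetical usage: "b3" lies outside the target range and is never selected.
matched = match_by_expression(
    {"t1": 1.0, "t2": 100.0},
    {"b1": 1.0, "b2": 100.0, "b3": 1e6},
)
```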
Cancer lncRNA databases
The tested databases were first filtered for lncRNAs in the GENCODE v28 long noncoding annotation (n=15,767).
Lnc2cancer GENCODE IDs from data table (http://www.bio-bigdata.com/lnc2cancer/download.html) were evaluated (n=512) (Gao et al., 2019).
CRlncRNA gene names (http://crlnc.xtbg.ac.cn/download/) were converted to GENCODE IDs (n=146) (Wang et al., 2018).
EVlncRNAs gene names (http://biophy.dzu.edu.cn/EVLncRNAs/) were converted to GENCODE IDs (n=187) (Zhou et al., 2018).
lncRNADisease gene names were downloaded from http://www.rnanut.net/lncrnadisease/index.php/home/info/download, and only cancer-associated transcripts (carcinoma, lymphoma, cancer, leukemia, tumor, glioma, sarcoma, blastoma, astrocytoma, melanoma, meningioma) were extracted. Names were converted to GENCODE IDs (n=137) (Bao et al., 2019).
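The keyword-based extraction of cancer-associated entries can be sketched as follows. The keyword list is taken from the text above; the record layout, the function names and the use of case-insensitive substring matching are assumptions.

```python
CANCER_KEYWORDS = (
    "carcinoma", "lymphoma", "cancer", "leukemia", "tumor", "glioma",
    "sarcoma", "blastoma", "astrocytoma", "melanoma", "meningioma",
)

def is_cancer_term(disease_name):
    """True if the disease name contains any cancer keyword
    (assumed: case-insensitive substring match)."""
    name = disease_name.lower()
    return any(keyword in name for keyword in CANCER_KEYWORDS)

def cancer_lncrnas(records):
    """records: iterable of (gene_name, disease_name) pairs (hypothetical layout).
    Returns the sorted, de-duplicated gene names with a cancer annotation."""
    return sorted({gene for gene, disease in records if is_cancer_term(disease)})

# Toy usage with made-up records
example = [
    ("GENE_A", "Hepatocellular carcinoma"),
    ("GENE_B", "Alzheimer's disease"),
    ("GENE_C", "Glioblastoma"),
]
selected = cancer_lncrnas(example)
```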
Features of CLC2 genes
Genomic classification
The genomic classification was performed as described previously (Carlevaro-Fita et al., 2020) using an in-house script (https://github.com/gold-lab/shared_scripts/tree/master/lncRNA.annotator).
Small RNA analysis
For this analysis, “snoRNA”, “snRNA”, “miRNA” and “miscRNA” coordinates were extracted from the GENCODE v28 annotation file and intersected with the genomic regions of the genes (intronic and exonic regions).
Repeat elements
In total, 452 CLC2 lncRNAs were compared to 1693 expression-matched nonCLC lncRNAs using the LnCompare Categorical analysis (http://www.rnanut.net/lncompare/) (Carlevaro-Fita et al., 2019).
Feature analysis
In total, 452 CLC2 lncRNAs and 120 mutagenesis lncRNAs were compared to the GENCODE v24 reference using LnCompare (http://www.rnanut.net/lncompare/) (Carlevaro-Fita et al., 2019).
Cancer characteristic analysis
Differential gene expression analysis (DEA)
DEA was performed using TCGA data and TCGAbiolinks. The analysis followed the TCGAbiolinks manual for matched tumor and normal tissue samples using the HTSeq analysis pipeline, as described previously (https://www.bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/analysis.html) (Colaprico et al., 2016). Only matched samples were used, and the TCGA data were presorted for tumor tissue samples (TP, with 01 in sample name) and solid tissue normal samples (NT, with 11 in sample name). Settings used for DEA: fdr.cut = 0.05 and logFC.cut = 1 for the DGE output between matched TP and NT samples for 20 cancer types. CLC2 cancer types had to be converted to TCGA cancer types (Supp Fig 6A). Cancer types and number of samples used in the analysis can be found in Supp Fig 6B. DEA enrichment was tested with one-sided Fisher’s test. For each CLC2 gene reported exclusively as an oncogene (n=275) or tumor suppressor (n=95), i.e. excluding genes for which both functions are reported (n=22), the positive and negative fold change (FC) values were counted and compared to those of expression-matched lncRNA genes found in the DEA.
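The thresholding and direction counting described above can be sketched in a few lines. This is an illustration on an assumed result table, not the TCGAbiolinks implementation; in particular, treating the cut-offs as inclusive (FDR <= 0.05, |logFC| >= 1) is an assumption, as are all names.

```python
def significant_genes(results, fdr_cut=0.05, logfc_cut=1.0):
    """results: dict gene_id -> (logFC, FDR) for one cancer type (hypothetical layout).
    Keeps genes passing both cut-offs (assumed inclusive) and returns
    a dict gene_id -> logFC."""
    return {gene: logfc
            for gene, (logfc, fdr) in results.items()
            if fdr <= fdr_cut and abs(logfc) >= logfc_cut}

def count_directions(significant, gene_set):
    """Count genes of gene_set that are significantly up (logFC > 0)
    or down (logFC < 0) in tumor versus normal."""
    up = sum(1 for g in gene_set if g in significant and significant[g] > 0)
    down = sum(1 for g in gene_set if g in significant and significant[g] < 0)
    return up, down

# Toy usage with made-up values
toy = {"A": (2.0, 0.01), "B": (-1.5, 0.001), "C": (0.5, 0.001), "D": (3.0, 0.2)}
sig = significant_genes(toy)
up, down = count_directions(sig, ["A", "B", "C", "E"])
```

The resulting up and down counts for oncogenes, tumor suppressors and expression-matched controls would then form the contingency tables for the Fisher’s test.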
Survival analysis
An in-house script for extracting TCGA survival data was used to generate p values for the association with survival for each gene. Expression and clinical data from 33 TCGA cohorts were downloaded with the “TCGAbiolinks” R package (https://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html) (Colaprico et al., 2016). P value and hazard ratio were calculated with the Cox proportional hazards regression model from the “survival” R package (https://cran.r-project.org/web/packages/survival/survival.pdf). All scripts were adapted from https://www.biostars.org/p/153013/ and are available upon request. For downstream analysis, only groups with at least 20 patient samples in the high or low expression group were used. The plot comprises only the most significant cancer survival p value per gene and was assessed by the Kolmogorov-Smirnov test (KS test).
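The post-processing of the survival p values (group-size filter, best p value per gene, comparison of distributions) can be sketched as below. The Cox regression itself is not reproduced. The row layout and names are hypothetical, and requiring at least 20 samples in both the high and the low expression group is one reading of the filter, stated here as an assumption.

```python
def best_p_per_gene(rows, min_group=20):
    """rows: iterable of (gene, cancer_type, n_high, n_low, p_value).
    Keeps, per gene, the cancer type with the smallest p value among
    cohorts in which both groups have >= min_group samples (assumed reading)."""
    best = {}
    for gene, cancer, n_high, n_low, p in rows:
        if n_high < min_group or n_low < min_group:
            continue
        if gene not in best or p < best[gene][1]:
            best[gene] = (cancer, p)
    return best

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic D: the maximum distance
    between the two empirical cumulative distribution functions."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d
```

Only the statistic D is computed here; obtaining the corresponding p value would require the KS distribution, for example from a statistics library.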
Cancer-associated SNP analysis
SNP data linked to tumor/cancer/tumour were extracted from the GWAS Catalog download page (https://www.ebi.ac.uk/gwas/docs/file-downloads) (n=5,331) and intersected with the whole exon body of the genes. SNPs were intersected with the transcript bed file, plotted per nt in each subset (SNP/nt on the y axis) and tested using one-sided Fisher’s test.
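The SNP density per nucleotide plotted for each gene subset can be illustrated as follows. This sketch assumes BED-like half-open, non-overlapping exon intervals and hypothetical input structures; it stands in for the BED intersection and is not the code used here.

```python
def snps_per_nt(exons, snps):
    """exons: list of (chrom, start, end), half-open and non-overlapping (assumed).
    snps:  list of (chrom, position).
    Returns the number of SNPs inside exons divided by total exonic length."""
    total_nt = sum(end - start for _, start, end in exons)
    hits = sum(
        1 for chrom, pos in snps
        if any(c == chrom and start <= pos < end for c, start, end in exons)
    )
    return hits / total_nt

# Toy usage: 2 of 4 SNPs fall within 200 nt of exonic sequence
exons = [("chr1", 100, 200), ("chr1", 300, 400)]
snps = [("chr1", 150), ("chr1", 250), ("chr2", 150), ("chr1", 399)]
density = snps_per_nt(exons, snps)
```

Normalising by exonic length makes subsets with different gene numbers and gene lengths comparable.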
CNV analysis
Human CNVs in lncRNAs were downloaded from http://bioinfo.ibp.ac.cn/LncVar/download.php (Chen et al., 2017). NONCODE IDs were converted to GENCODE IDs using NONCODEv5_hg38.lncAndGene.bed.gz. CLC2 and nonCLC ENSGs were matched to NONHSAT IDs with a significant p value (0.05 cut-off, n=733) in the LncVar table and tested using one-sided Fisher’s test.
Code availability
Custom code is available from the corresponding author upon request.
In vitro validation
Cell culture
HeLa and HCT116 cells were cultured in Dulbecco’s Modified Eagle’s Medium (DMEM) (Sigma-Aldrich, D5671) supplemented with 10% Fetal Bovine Serum (FBS) (ThermoFisher Scientific, 10500064), 1% L-Glutamine (ThermoFisher Scientific, 25030024) and 1% Penicillin-Streptomycin (ThermoFisher Scientific, 15140122). Cells were grown at 37°C and 5% CO2 and passaged every two days at 1:5 dilution.
Generation of Cas9 stable cell lines
HeLa cells were infected with lentivirus carrying the Cas9-BFP (blue fluorescent protein) vector (Addgene 52962). HCT116 cells were transfected with the same vector using Lipofectamine 2000 (ThermoFisher Scientific, 11668019). Both cell types were selected with blasticidin (4 µg/ml) for at least five days and sorted twice for BFP-positive cells by fluorescence-activated cell sorting.
CRISPR inhibition sgRNA pair design and cloning
sgRNA pairs targeting LINC00570 were designed using the GPP sgRNA designer (https://portals.broadinstitute.org/gpp/). The sgRNA pairs were manually selected from the output list and cloned into the pGECKO backbone (CRISPRi.1: 5’ GTTACTTCCAACGTACCATG 3’, CRISPRi.2: 5’ CCTGTACCCCCATGGTACGT 3’) (Addgene 78534; (Aparicio-Prat et al., 2015)).
Antisense LNA GapmeR design
Antisense LNA GapmeR Control (5’ AACACGTCTATACGC 3’) and three Antisense LNA GapmeR Standard targeting LINC00570 (LNA1: 5’ GGAAATTGCTCTGATG 3’, LNA2: 5’ GATTGGCATTGGGATA 3’, LNA3: 5’ GAAGTGGCCTGAGAAA 3’) were designed and purchased from Qiagen.
RT-qPCR
For each time point, total RNA was extracted and reverse transcribed (Promega). Transcript levels of LINC00570 (FP: 5’ TAGGAGTGCTGGAGACTGAG 3’, RP: 5’ GTCGCCATCTTGGTTGTCTG 3’) and the housekeeping gene HPRT1 (FP: 5’ ATGACCAGTCAACAGGGGACAT 3’, RP: 5’ CAACACTTCGTGGGGTCCTTTTCA 3’) were measured using GoTaq qPCR Master Mix (Promega, A6001) on a TaqMan Viia 7 Real-Time PCR System. Data were normalized using the ΔΔCt method (Schmittgen & Livak, 2008).
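For reference, the ΔΔCt normalisation reduces to a short calculation. The sketch below assumes 100% amplification efficiency (a doubling per cycle), as in the standard form of the method; the function name and the Ct values in the usage example are made up for illustration.

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Relative expression by the ddCt method: 2 ** -(dCt_treated - dCt_control),
    where dCt = Ct(target gene) - Ct(reference gene).
    Assumes a doubling of product per PCR cycle."""
    d_ct_treated = ct_target_treated - ct_ref_treated
    d_ct_control = ct_target_control - ct_ref_control
    dd_ct = d_ct_treated - d_ct_control
    return 2 ** (-dd_ct)

# Made-up Ct values: the target appears 2 cycles later after knockdown,
# while the reference gene is unchanged
fold = relative_expression(26.0, 20.0, 24.0, 20.0)   # 0.25, i.e. 75% knockdown
```

Here the target would correspond to LINC00570 and the reference to HPRT1.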
Gene-specific RT-PCR and cDNA amplification
From the extracted total RNA, we performed a gene-specific reverse transcription using the reverse primers for LINC00570 and HPRT1 to enrich for their cDNA. Presence or absence of the transcript was detected by a regular PCR using GoTaq® G2 DNA Polymerase (Promega, M7841) from 100 ng cDNA and visualized on an agarose gel.
Viability assay
HeLa and HCT116 cells were transfected with Antisense LNA GapmeRs at a concentration of 50 nM using Lipofectamine 2000 (ThermoFisher Scientific) according to the manufacturer’s protocol. One day later, transfected cells were plated in a white, flat 96-well plate (3000 cells/well). Viability was measured in technical replicates using the CellTiter-Glo 2D Kit (Promega) following the manufacturer’s recommendations at 0, 24, 48 and 72 hours after seeding. Luminescence was detected with a Tecan Reader Infinite 200. Statistical significance was calculated by t-test.
For CRISPR inhibition experiments, HeLa-Cas9 and HCT116-Cas9 cells were transfected with a control sgRNA plasmid and two LINC00570-targeting plasmids. Cells were selected with puromycin (2 µg/ml) for 48 h. The viability assay was performed as described above.
Contributions
RJ conceived the project. RJ and AV performed manual annotation of CLC2. AV performed the feature analysis, evolutionary analysis, mutation analysis, differential expression, GWAS SNP, CNV analysis and data integration. AL performed survival analysis. NB performed the ASO and CRISPRi KD experiments. AV, NB and MT performed the qPCR experiments. RJ, AV, AL, NB, MT and SH drafted the manuscript and prepared the figures and supplementary material. All authors read and approved the final draft.
Acknowledgements
We gratefully acknowledge administrative support from Ana Radovanovic and Silvia Roesselet (DBMR, University of Bern). We also acknowledge Juna Carlevaro-Fita (EPFL, Lausanne) and Judith Bergada (University of Zurich) for the helpful advice and discussions as well as Roberta Esposito, Panagiotis Chouvardas, Hugo Guillen Ramirez and the other members of the Laboratory for Genomics of LncRNA and Disease for their valuable input. This work was funded by the Swiss National Science Foundation through the National Center of Competence in Research (NCCR) “RNA & Disease”, by the Medical Faculty of the University and University Hospital of Bern, by the Helmut Horten Stiftung and Krebsliga Schweiz (4534-08-2018).