Abstract
Cancer cells exhibit rewired transcriptional regulatory networks that promote tumor growth and survival. However, the processes that configure these pathological networks remain poorly understood. Through a pan-cancer epigenomic analysis, we found that primate-specific endogenous retroviruses (ERVs) are an abundant source of enhancers that show cancer-specific activity. In colorectal cancer and other epithelial tumors, oncogenic MAPK/AP1 signaling drives activation of enhancers derived from the primate-specific ERV family LTR10. Functional studies in colorectal cancer cells revealed that LTR10 elements are involved in regulating multiple genes associated with tumorigenesis, including ATG12. Within the human population, individual LTR10 elements show frequent structural variation resulting from a highly mutable internal tandem repeat region, which affects AP1 binding activity. Our findings reveal that ERV-derived enhancers mediate transcriptional dysregulation in response to oncogenic signaling, and shape the evolution of cancer-specific regulatory networks.
INTRODUCTION
Cancer cells undergo global transcriptional changes resulting from genetic and epigenetic alterations during tumorigenesis (You and Jones 2012). While regulatory remodeling can arise from somatic non-coding mutations (Rheinbay et al. 2020), epigenomic studies have revealed that transformation is associated with aberrant epigenetic activation of enhancer sequences that are typically silenced in normal tissues (Cohen et al. 2017; Chapuy et al. 2013; Roe et al. 2017). Pathological enhancer activity is an established mechanism underlying tumorigenesis and therapy resistance, and therapeutic modulation of enhancer activity is an active area of investigation (Hnisz et al. 2015; Bradner, Hnisz, and Young 2017; Sur and Taipale 2016; Flavahan, Gaskell, and Bernstein 2017). However, we have a limited understanding of the molecular processes that shape and establish the enhancer landscapes of cancer cells.
Transposable elements (TEs) including endogenous retroviruses (ERVs) represent a potentially significant source of enhancers that could shape cancer-specific gene regulation (Jansz and Faulkner 2021). Many cancers exhibit genome-wide transcriptional reactivation of TEs, which can directly impact cells by promoting oncogenic mutations and stimulating immune signaling (Kong et al. 2019; Shukla et al. 2013; Rodriguez-Martin et al. 2020; Rodić et al. 2014). In addition, the reactivation of TEs is increasingly recognized to have gene regulatory consequences in cancer cells (Artem Babaian and Mager 2016). Several transcriptomic studies have uncovered TEs as a source of cancer-specific alternative promoters across many types of cancer, with some examples shown to drive oncogene expression (A. Babaian et al. 2016; Lamprecht et al. 2010; Edginton-White et al. 2019; Jang et al. 2019). TEs also show chromatin signatures of enhancer activity in cancer cell lines (Wang et al. 2007; Sundaram et al. 2014; Jacques, Jeyakani, and Bourque 2013), yet their functional relevance in patient tumors has remained largely unexplored. A recent study identified and characterized ERV-derived enhancers with oncogenic effects in acute myeloid leukemia (Deniz et al. 2020), but the prevalence and mechanisms of TE-derived enhancer activity are unknown for most cancer types.
Here, we leveraged the availability of published cancer epigenome datasets to understand how TEs influence enhancer landscapes and gene regulation across cancer types. Our pan-cancer analysis revealed that elements from a primate-specific ERV named LTR10 show enhancer activity in many epithelial tumors, and this activity is regulated by MAPK/AP1 signaling. We conducted functional studies in HCT116 colorectal cancer cells, and found that LTR10 elements regulate AP1-dependent gene expression at multiple loci that include genes with established roles in tumorigenesis. Finally, we discovered that LTR10 elements contain highly mutable sequences that potentially contribute genomic variation affecting cancer-specific gene expression. Our work implicates ERVs as a source of pathological regulatory variants that facilitate transcriptional rewiring in cancer.
RESULTS
To assess the contribution of TEs to cancer cell epigenomes, we analyzed aggregate chromatin accessibility maps from 21 human cancers generated by The Cancer Genome Atlas project (Corces et al. 2018). We defined cancer-specific subsets of accessible regions by removing regions that show evidence of regulatory activity in any healthy adult tissue profiled by the Roadmap Consortium (Fig 1A, Methods) (Roadmap Epigenomics Consortium et al. 2015). Out of 1315 total repeat families annotated in the human genome, we found 23 families that showed significant enrichment within the accessible chromatin in at least one cancer type (Fig 1B), of which 19 correspond to long terminal repeat (LTR) regions of primate-specific ERVs (Supp Table 1). These observations from chromatin accessibility data generated from primary tumors confirm previous reports of LTR-derived regulatory activity in cancer cell lines (Jacques, Jeyakani, and Bourque 2013; Wang et al. 2007; Deniz et al. 2020), and support a role for ERVs in shaping patient tumor epigenomes.
LTR10 elements exhibit cancer-specific regulatory activity
To further investigate the cancer-specific regulatory activity of ERVs, we focused on LTR10 elements, which were enriched within cancer-specific accessible chromatin for several types of epithelial tumors including colorectal, stomach, prostate, and lung tumors (Fig 1C, Supp Fig S1A). LTR10 elements (including LTR10A-G, n=2331) are derived from the LTR of the gammaretrovirus HERV-I, which integrated into the anthropoid genome 30 million years ago (Fig 1D, 1E) (Wang et al. 2007). As our initial TCGA analysis was conducted using aggregate data for each tumor type, we first confirmed that LTR10 elements showed recurrent chromatin accessibility across colorectal tumors from multiple individual patients (Fig 1F, Supp Fig S1). We then analyzed epigenomic datasets from the HCT116 colorectal cancer cell line (Akhtar-Zaidi et al. 2012; Baranello et al. 2016; Cohen et al. 2017; Zheng et al. 2019) and found that LTR10A and LTR10F elements exhibit canonical chromatin hallmarks of enhancer activity, including enrichment of histone modifications H3K27ac and H3K4me1, the transcriptional coactivator p300, and RNA Polymerase II occupancy (Fig 1G). We did not observe enhancer-like profiles at LTR10C elements, which have previously been identified as a source of p53 binding sites (Grow et al. 2021; Wang et al. 2007) (Supp Fig S1C). While most LTR10A and LTR10F elements are not transcribed, some show evidence of transcription as promoters for full-length non-coding HERV-I insertions or cellular transcripts (Supp Fig S1D). Therefore, elements derived from the LTR10A and LTR10F subfamilies (hereafter referred to as LTR10 elements) show robust epigenomic signatures associated with enhancer activity in colorectal cancer cells.
We next compared the epigenetic status of LTR10 elements between colorectal cancer cells and normal tissues. In multiple patient-matched epigenomic datasets (Orouji et al. 2021; Bujold et al. 2016), LTR10 elements show globally increased levels of enhancer-associated histone modifications H3K27ac and H3K4me1 in tumor samples compared to adjacent normal colorectal tissues (Fig 2A, 2B). In additional datasets generated from healthy colorectal tissue samples (Cohen et al. 2017; Bernstein et al. 2010; ENCODE Project Consortium 2012; Lister et al. 2009), LTR10 elements do not show any evidence of enhancer activity (Supp Fig S2A). Although we did not examine rare or transient normal cell populations, these observations suggest that LTR10 regulatory activity is generally restricted to tumor cells.
We expanded our analysis to include epigenomic states from all adult tissues (Roadmap Epigenomics Consortium et al. 2015). We found no evidence for LTR10 enhancer activity in normal tissues, but instead observed general enrichment of H3K9me3-associated heterochromatin marks (Fig 2C, Supp Figure S2B, Supp Table 2). To identify factors that directly bind to and potentially repress LTR10 elements, we analyzed the Cistrome database (Zheng et al. 2019) of published human ChIP-Seq datasets to identify transcriptional repressors with evidence for enriched binding within LTR10 elements. Considering all cell types, we found that LTR10 elements are significantly enriched for binding by ZNF562, TRIM28, and SETDB1 (Fig 2D, 2E, Supp Table 3), which are components of the KRAB-ZNF transposon silencing pathway (Imbeault, Helleboid, and Trono 2017). Our analysis suggests that, as expected for most primate-specific TEs (Bruno, Mahgoub, and Macfarlan 2019), LTR10 elements are normally subject to H3K9me3-mediated epigenetic silencing in somatic tissues.
LTR10-derived enhancers are regulated by AP1
To identify which pathways are responsible for cancer-specific reactivation of LTR10 elements, we focused our Cistrome enrichment analysis on activating transcription factors in colorectal cancer cell lines. LTR10 elements were significantly enriched for binding by AP1 complex members (Fig 2F, 2G, Supp Table 3) including the JUND, FOSL1, and ATF3 transcription factors. The LTR10A and LTR10F consensus sequences harbor multiple predicted AP1 binding motifs, which are enriched within LTR10 elements marked by H3K27ac in HCT116 cells (Fig 2H). Moreover, the AP1 motifs are largely absent in other LTR10 subfamilies (Fig 2I). These analyses indicate that the cancer-specific enhancer activity of LTR10 elements is likely driven by sequence-specific recruitment of the AP1 complex.
Dysregulation of AP1 signaling occurs in many cancers, driven by mutations that cause oncogenic activation of the MAPK signaling pathway (Wagner and Nebreda 2009). We first used luciferase reporter assays to test whether LTR10 enhancer activity is affected by modulation of the MAPK/AP1 signaling pathway. We synthesized the LTR10A and LTR10F consensus sequences as well as variants where the AP1 motifs were disrupted, and cloned the sequences into an enhancer reporter construct. We measured reporter activity in HCT116 cells that were treated for 24 hrs with either TNFa to stimulate signaling or cobimetinib (a MEK1 inhibitor) to inhibit signaling. Consistent with regulation by AP1, cobimetinib treatment caused a decrease in LTR10-driven reporter activity, and TNFa caused an increase (Fig 3A). Overall regulatory activity was greatly reduced in sequences where the AP1 motif was disrupted (Fig 3A). These results show that LTR10 enhancer activity can be directly regulated by modulation of the MAPK/AP1 signaling pathway in cancer cells.
To test the role of the AP1 complex in endogenous LTR10 regulation, we silenced the AP1 component FOSL1 using CRISPRi to determine the impact on LTR10 transcriptional activity. Using HCT116 cells expressing dCas9-KRAB-MeCP2 (Yeo et al. 2018), we transfected a guide RNA (gRNA) targeting the FOSL1 transcription start site (TSS) and used RNA-seq to compare gene and TE expression to control cells transfected with a non-targeting gRNA. We first confirmed silencing of FOSL1 (Supp Fig S3A, S3B), then analyzed TE transcript expression summarized at the family level to account for reads mapping to multiple insertions of the same TE (Jin et al. 2015). This analysis revealed that full-length LTR10/HERV-I elements were significantly downregulated upon silencing FOSL1 (Fig 3B), supporting a direct role for the AP1 complex in regulating LTR10 activity.
Next, we investigated how endogenous LTR10 elements respond to modulation of MAPK/AP1 signaling at the chromatin level. We treated HCT116 cells with either cobimetinib or TNFa for 24 hrs and profiled each response using RNA-seq and H3K27ac CUT&RUN (Supp Fig S3C, S3D, S3E, S3F). Consistent with our reporter assay results, our RNA-seq analysis showed that full-length LTR10/HERV-I transcripts were significantly downregulated upon cobimetinib treatment, and upregulated upon TNFa treatment (Fig 3B). LTR10 elements showed similar responses based on H3K27ac CUT&RUN signal, exhibiting significant enrichment within the genome-wide set of predicted enhancers downregulated by cobimetinib or upregulated by TNFa (Fig 3C, 3D). We also observed clear TNFa-induced H3K27ac signal over LTR10 elements in a published dataset of SW480 colorectal cancer cells (Rahnamoun et al. 2017) (Supp Fig S3G). These results indicate that LTR10 elements represent a significant subset of genome-wide enhancers and transcripts in HCT116 cells that are directly modulated by MAPK/AP1 signaling.
LTR10 elements regulate cancer-specific pathological gene expression
To determine whether any LTR10-derived enhancers have a functional effect on MAPK-dependent gene expression in colorectal cancer cells, we used our RNA-seq and CUT&RUN data from HCT116 cells to identify elements predicted to have gene regulatory activity. We defined a set of 620 MAPK-regulated genes that were upregulated by TNFa and downregulated by cobimetinib (Fig 3E, Supp Table 4). We identified LTR10 elements predicted to regulate these genes using the activity by contact model (Fulco et al. 2019) to assign enhancer-gene targets based on H3K27ac signal and chromatin interaction data (Supp Table 5). This yielded 57 LTR10-derived enhancers predicted to regulate 74 MAPK-regulated genes, including many with established roles in cancer pathophysiology (Fig 3F, 3G, Supp Table 6).
We next used CRISPRi to experimentally silence predicted LTR10-derived enhancers in HCT116 cells. We first focused on a predicted tumor-specific enhancer (LTR10.ATG12) formed by two LTR10F elements on chromosome 5, located 87 kb from predicted target genes ATG12 and AP3S1 (Fig 4A). The LTR10.ATG12 element shows characteristic epigenomic signatures of tumor-specific enhancer activity and binding by AP1 transcription factors. To silence the element, we stably transfected cells with a pool of four gRNAs targeting unique sequences within the element. As a positive control, we separately transfected cells with a single guide RNA targeting the ATG12 TSS. We determined the genome-wide transcriptional effects of CRISPRi silencing by using RNA-seq to compare gene expression against cells transfected with a non-targeting gRNA.
We verified that silencing ATG12 resulted in the expected specific downregulation of ATG12 (Supp Fig S4A, S4B). In cells where the LTR10.ATG12 element was silenced, the most significantly downregulated genes were ATG12 and AP3S1 (Fig 4B, 4C, Supp Table 7), confirming the predicted enhancer function of the element. In addition, we observed downregulation of other nearby genes including ARL14EPL (Fig 4B, 4C), consistent with enhancer activity that affects multiple genes in the locus. Genome-wide, we observed differential regulation of other genes, possibly due to indirect effects from target gene knockdown or off-target silencing of other LTR10 elements (Supp Fig 4C). Notably, we observed that multiple genes regulated by LTR10.ATG12 showed similar patterns of transcriptional downregulation in response to FOSL silencing and cobimetinib treatment (Fig 4D). These results indicate that LTR10.ATG12 acts as an enhancer that controls AP1-dependent transcriptional activation of multiple genes in the ATG12/AP3S1 locus in HCT116 cells.
The ATG12 gene encodes a ubiquitin-like modifier required for macroautophagy as well as mitochondrial homeostasis and apoptosis (Haller et al. 2014; Rubinstein et al. 2011; Radoshevich et al. 2010; Hanada et al. 2007). Expression of ATG12 is associated with tumorigenesis and therapy resistance in colorectal and gastric cancer (Hu et al. 2018; YiRen et al. 2017), but the mechanism of cancer-specific regulation of ATG12 has not been characterized. Therefore, we aimed to determine whether the LTR10.ATG12 enhancer was responsible for regulating ATG12 expression and activity in HCT116 cells. First, we validated that silencing the enhancer resulted in decreased ATG12 protein levels by immunoblotting (Fig 4E). In cells where either ATG12 or the enhancer was silenced, there was a clear reduction in protein levels of both free ATG12 and the ATG3-ATG12 conjugate. There was minimal knockdown effect on the levels of the ATG5-ATG12 conjugate, which has previously been observed in ATG12 silencing experiments and is due to the high stability of the ATG5-ATG12 complex (Haller et al. 2014).
Next, we tested whether ATG12-dependent functions require the activity of the LTR10.ATG12 enhancer. We treated each cell line with staurosporine (STS) to trigger mitochondrial apoptosis, which is dependent on free ATG12 binding to Bcl-2 (Rubinstein et al. 2011). In cells where either ATG12 or the enhancer was silenced, we observed significantly reduced caspase 3/7 activity, indicating defective mitochondrial apoptosis (Fig 4F). We did not detect differences in macroautophagy in cells treated with bafilomycin (Supp Fig S4D), consistent with the lack of knockdown of the ATG5-ATG12 conjugate (Hanada et al. 2007). Our experimental results from silencing both ATG12 and the enhancer are concordant with previous studies directly silencing ATG12 using siRNAs in other cancer cell lines (Rubinstein et al. 2011; Haller et al. 2014). Together, these experiments demonstrate that the LTR10.ATG12 enhancer is functionally important for ATG12-dependent activity in HCT116 cells.
Having established the regulatory activity of LTR10.ATG12, we selected 5 additional LTR10 elements for characterization based on their proximity to predicted target genes with established cancer relevance (Fig 5). We separately silenced each LTR10 element using CRISPRi and selected one element (LTR10.KDM6A) to delete using CRISPR/Cas9, due to its intronic location (Supp Fig S5). We used RNA-seq to determine the transcriptional consequences of perturbing each element. As we observed with LTR10.ATG12, we observed local downregulation of multiple genes within 1.5 MB of each targeted element, confirming their activity as functional enhancers in HCT116 cells. These included XRCC4, TMEM167A, VCAN, NES, FGF2, AGPAT5, MAOB, and MIR222HG (Fig 5). For three elements (LTR10.MEF2D, LTR10.MCPH1 and LTR10.KDM6A), the predicted proximal target gene did not show significant expression changes, but we instead observed downregulation of other MAPK-regulated genes near the element (Fig 5G, 5I, 5J). Collectively, our characterization of six LTR10 elements identified 21 genes that are regulated by LTR10 elements, most (17/21) of which are regulated by MAPK signaling based on our RNA-seq data. These experiments suggest that LTR10-derived enhancer elements may exert a genome-wide impact on MAPK-dependent gene expression in cancer cells.
LTR10 elements contain highly mutable VNTRs
Lastly, we investigated variation at LTR10 elements across 15,708 human genomes profiled by the Genome Aggregation database (gnomAD) (Karczewski et al. 2020). All LTR10 insertions are fixed, but we observed an unexpected enrichment of >10bp indel structural variants affecting the AP1 motif region specific to LTR10A and LTR10F, but not other LTR10 subfamilies such as LTR10C (Fig 6A). Further sequence inspection revealed that LTR10A and LTR10F elements contain an internal variable number of tandem repeats (VNTR) region, composed of a 28-30 bp sequence that includes the AP1 motif (Fig 6B, Supp Fig S6A). Individual elements annotated in the reference genome show extensive variation in tandem repeat length, with up to 33 copies of the AP1 motif (Supp Fig S6B). The number of motifs strongly correlated with H3K27ac and FOSL1 binding activity in HCT116 cells (Supp Fig S6B), suggesting that tandem repeat length affects AP1-dependent regulation of individual elements. Across the human population, LTR10A and LTR10F elements harbor many rare and common indel structural variants of lengths that follow a 28-30 bp periodicity, and this pattern is absent in LTR10C elements which lack the tandem repeat region (Fig 6C, 6D). These elevated levels of polymorphism across copies and individuals are characteristic of unstable tandem repeat regions (Gymrek 2017), and suggest that LTR10 VNTR regions may be a common source of genomic regulatory variation.
Accurately genotyping tandem repeat length polymorphisms remains a major challenge using short-read data, therefore we validated the presence of LTR10 VNTR polymorphisms using long-read structural variant calls generated from 15 individuals (Audano et al. 2019). We recovered indel structural variants within 24 distinct LTR10A and LTR10F elements, which also showed 28-30 bp periodicity (Fig 6D, Supp Fig S6C). We confirmed the presence of additional LTR10 VNTR indels using a separate long-read dataset from 25 Asian individuals (Fig 6E, Supp Fig S6D-I) (Quan et al. 2021). At the LTR10.ATG12 locus, we observed multiple indels supported by both short-read and long-read data that are predicted to affect AP1 motif copy number (Fig 6E, Supp Fig S6E). At a genome-wide level, LTR10 elements were a significantly enriched source of long-read indels, despite being fixed in the population (Fig 6F). Finally, in addition to finding extensive germline genetic variation affecting LTR10 elements, we also found evidence for tumor-specific somatic mutations within LTR10 elements, based on tumor-specific variants called by the International Cancer Genome Consortium (Supp Fig S6J) (ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium 2020). Together, these analyses reveal that LTR10 VNTR regions are a source of non-coding polymorphisms, which potentially contribute to both germline and somatic regulatory variation in cancer cells.
DISCUSSION
Our findings build on a growing body of evidence implicating TEs as an important source of pathological regulatory activity in cancer (Jang et al. 2019; Deniz et al. 2020; Lamprecht et al. 2010; A. Babaian et al. 2016). Although TEs have been widely documented to show elevated transcriptional activity in cancer cells, the mechanisms underlying the regulatory reactivation of TEs in cancer have remained poorly understood. Our study demonstrates that oncogenic MAPK/AP1 signaling drives regulatory activation of LTR10 and other ERVs in epithelial tumors, which act as enhancers that mediate AP1-dependent transcriptional dysregulation. While our functional experiments were conducted in a cell line model, our analysis indicates that LTR10 elements show epigenomic signatures of enhancer activity in many patient tumors. The specific consequences of ERV-derived enhancer activity in cancer cells are likely to vary depending on genomic context and tumor type, and individual elements may exert a spectrum of pathological, anti-tumorigenic, or nonfunctional regulatory effects. Nonetheless, our findings demonstrate how ERV-derived enhancers act as a regulatory link between oncogenic signaling and genome-wide transcriptional dysregulation in cancer cells.
Our study used CRISPR to functionally validate multiple LTR10 elements with predicted enhancer activity, revealing their involvement in regulating genes with cancer relevance. As a case example, we demonstrated that a LTR10 element was required for ATG12-dependent mitochondrial apoptosis in HCT116 colorectal cancer cells. ATG12 expression has been implicated in tumorigenesis and therapy resistance in several cancer types, making it a candidate for therapeutic targeting (Rubinsztein, Codogno, and Levine 2012; Hu et al. 2018). Notably, MAPK inhibitor treatment results in silencing LTR10 regulatory activity, suggesting that ERV enhancer inactivation may be an underlying mechanism contributing to therapeutic MAPK inhibition.
Finally, we discovered that LTR10 elements are frequently affected by indels that could influence their regulatory activity. Although all LTR10 insertions are fixed in the human population, they contain internal tandem repeats that show high levels of length polymorphism associated with repeat instability. Germline or somatic variation in AP1 motif copy number within these elements may alter cancer-specific enhancer landscapes, and we speculate that LTR10 VNTRs may be subject to further mutation in cancer cells with microsatellite instability (Hung et al. 2019). Our study of LTR10 highlights how TEs that are normally silenced can become reactivated in cancer and cause aberrant gene expression. For elements that promote pathogenesis, their restricted activity in age-associated diseases like cancer may result in reduced or nearly neutral fitness consequences. Therefore, the accumulation of TEs subject to epigenetic silencing may be a fundamental process shaping the evolution of cancer-specific gene regulatory networks.
METHODS
Cell culture
The HCT116 cell line was purchased from ATCC and cultured in McCoy’s 5A medium supplemented with 10% FBS and 1% penicillin/streptomycin (Gibco). Cells were cultured at 37°C in 5% carbon dioxide. Transfections were performed using FuGENE (Promega). For treatments modulating MAPK signaling, HCT116 cells were untreated or treated for 24 hrs with 1 uM Cobimetinib, 100 ng/mL TNF alpha, or DMSO.
CRISPR-mediated silencing and knockout of LTR10s
For CRISPR-mediated silencing (e.g., CRISPRi) of select LTR10 elements and gene transcription start sites (TSS), a HCT116 dCas9-KRAB-MeCP2 stable line was first generated using the PiggyBac system (System Bioscience). The PiggyBac Donor plasmid, PB-CAGGS-dCas9-KRAB-MeCP2 was co-transfected with the Super PiggyBac transposase expression vector (SPBT) into HCT116 cells. The pB-CAGGS-dCas9-KRAB-MeCP2 construct was a gift from Alejandro Chavez & George Church (Addgene plasmid # 110824). 24 hours post-transfection, cells were treated with Blasticidin to select for integration of the dCas9 expression cassette, and selection was maintained for 10 days. CRISPR gRNAs specific to the DNA elements of interest (i.e., 0 predicted off target sequences) were selected using pre-computed CRISPR target guides available on the UCSC Genome Browser hg38 assembly, and complementary oligos were synthesized by Integrated DNA Technologies. Complementary oligos were designed to generate BstXI and BlpI overhangs for cloning into PB-CRISPRia, a custom PiggyBac CRISPR gRNA expression plasmid based on the lentiviral construct pCRISPRia (a gift from Jonathan Weissman, Addgene plasmid # 84832). Complementary gRNA-containing oligos were hybridized and phosphorylated in a single reaction, then ligated into a PB-CRISPRia expression plasmid linearized with BstXI and BlpI (New England Biolabs). Chemically competent Stable E. Coli (New England Biolabs) was transformed with 2 uL of each ligation reaction and resulting colonies were selected for plasmid DNA isolation using the ZymoPure Plasmid miniprep kit (Zymo Research). Each cloned gRNA sequence-containing PB-CRISPRia plasmid was verified by Sanger sequencing (Quintara Bio).
To generate CRISPRi stable lines, PB-CRISPRia gRNA plasmids were co-transfected with the PiggyBac transposase vector into the HCT116 dCas9-KRAB-MeCP2 polyclonal stable line. The following number of uniquely-mapping gRNA plasmids were designed per target based on the pre-computed UCSC hg38 CRISPR target track: ATG12 (1), GFP (1), FOSL1 (1), LTR10.ATG12 (4), LTR10.FGF2 (2), LTR10.MCPH1 (3), LTR10.MEF2D (2), LTR10.XRCC4 (2). The same total amount of gRNA plasmid was used for transfections involving one or multiple gRNAs. 24 hours post-transfection, cells were treated with Puromycin to select for integration of the sgRNA expression cassette(s). Selection was maintained for 5 days prior to transcriptional analyses.
For CRISPR-mediated knockout of LTR10.KDM6A, 2 gRNAs (1 specific to each flank of the element) were identified and synthesized as sgRNAs by IDT. Using IDT’s AltR technology, RNP complexes were generated in vitro, and electroporated into HCT116 cells using the Neon system (ThermoFisher Scientific). Clonal lines were isolated using the array dilution method in a 96-well plate format, and single clones were identified and screened for homozygous deletions by PCR using both flanking and internal primer pairs at the expected deletion site.
Cell autophagy and apoptosis assays
For assaying mitochondrial apoptosis, HCT116 CRISPRi cell lines were treated for 12 hours with Staurosporine (STS) at 0.5 uM or DMSO (vehicle) followed by measurement of Caspase activity via the Caspase-Glo 3/7 assay (Promega). Results are representative of at least 3 independent experiments. For assaying autophagy, HCT116 CRISPRi cell lines were untreated or treated with Bafilomycin A at 10 nM or 100 nM for 6 hrs and 18 hrs, followed by LC3B Western blotting. Results are representative of at least 3 independent experiments.
Western blots
For ATG12 Western blots, cell lysates were prepared with MPER buffer (ThermoFisher Scientific). For LC3B Western blots, cell lysates were prepared with RIPA buffer. All cell Lysates were resuspended in 4x NuPage LDS Sample buffer containing reducing agent (ThermoFisher Scientific). For ATG12 Western blots, total protein was concentrated and size-selected by passing through an Amicon Ultra 10K column (Millipore), retaining the high molecular weight fraction, and 40 ug of protein was loaded per lane. For LC3B Western blots, 2 ug total protein was loaded per lane. Antibodies used were as follows: ATG12 (cat #4180T, Cell Signaling Technologies), Beta-Actin (cat # 3700T, Cell Signaling Technologies), LC3B (cat # NB100-2220, Novus Biologicals). Results are representative of at least 3 independent experiments.
Luciferase assay
Reporter assays were conducted in HCT116 cells using the secreted NanoLuc enhancer activity reporter pNL3.3 (Promega) and normalized against a constitutively active firefly luciferase reporter vector, pGL4.50 (Promega). LTR10 consensus sequences for subfamilies LTR10A and LTR10F were downloaded from Dfam. AP1 motifs within LTR10A and LTR10F were shuffled as follows: LTR10A (first two motifs): cctgagtcacc to cagccccgtta; LTR10A (third motif): cttagtcacc to cagtttaccc; LTR10F (all three motifs): cctgactcatt to cgtatccttac. Sequences are available in Supp Table 8. Due to their high repeat content, consensus sequences were synthesized as multiple fragments (Integrated DNA Technologies, Twist BioScience) and then assembled into pNL3.3 enhancer reporter plasmids using Gibson Assembly (New England Biolabs). Each cloned reporter plasmid was verified by Sanger sequencing (Quintara Bio). To assay reporter activity, HCT116 cells were transfected with a reporter construct as well as the pGL4.50 construct constitutively expressing firefly luciferase. 24 hrs after transfection, media was replaced with media containing 1 uM Cobimetinib, 100 ng/mL TNF alpha, or DMSO (vehicle). 24 hours following treatment, luminescence was measured using the NanoGlo Dual Luciferase Reporter Assay System (Promega). All experiments were performed with 3 treatment replicates per condition in a 96-well plate format, and all reported values were normalize to firefly co-transfection controls. Results are representative of at least 3 independent experiments. Barplots are presented as mean +/- s.d.
RNA-seq
Sequencing libraries were prepared from RNA harvested from treatment or transfection replicates. Total RNA was extracted using the Quick-RNA Miniprep Plus Kit (Zymo Research). PolyA enrichment and library preparation was performed using the KAPA BioSystems mRNA HyperPrep Kit according to the manufacturer’s protocols. Briefly, 500 ng of RNA was used as input, and KAPA BioSystems single-index or unique dual-index adapters were added at a final concentration of 7 nM. Purified, adapter-ligated library was amplified for a total of 11 cycles following the manufacturer’s protocol. The final libraries were pooled and sequenced on an Illumina NovaSeq 6000 (University of Colorado Genomics Core) as 150 bp paired-end reads.
CUT&RUN
Libraries were prepared from treatment replicates. Approximately 5x105 viable cells were used for each CUT&RUN reaction, and pulldowns were generated following the protocol from (Meers et al. 2019). All buffers were prepared according to the “High Ca2+/Low Salt” method using digitonin at a final concentration of 0.05%. The following antibodies were used at the noted dilutions: rabbit anti-mouse IgG (1:100), rabbit anti-H3K27ac (1:100). pA-MNase (gift from Steven Henikoff) was added to each sample following primary antibody incubation at a final concentration of 700 ng/mL. Chromatin digestion, release, and extraction was carried out according to the standard protocol. Sequencing libraries were generated using the KAPA BioSystems HyperPrep Kit according to the manufacturer’s protocol with the following modifications: Freshly diluted KAPA BioSystems single-index adapters were added to each library at a final concentration of 9 nM. Adapter-ligated libraries underwent a double-sided 0.8X/1.0X cleanup using KAPA BioSystems Pure Beads. Purified, adapter-ligated libraries were amplified using the following PCR cycling conditions: 45 s at 98°C, 14x(15 s at 98°C, 10 s at 60°C), 60 s at 72°C. Amplified libraries underwent two 1X cleanups using Pure Beads. The final libraries were quantified using Qubit dsDNA High Sensitivity and TapeStation 4200 HSD5000. Libraries were pooled and sequenced on an Illumina NovaSeq 6000 (University of Colorado Genomics Core) as 150 bp paired-end reads.
Processing of sequencing data
Reads obtained from our own datasets and from published studies were reprocessed using a uniform analysis pipeline. FASTQ reads were evaluated using FastQC (v0.11.8) and MultiQC (v1.7), then trimmed using BBDuk/BBMap (v38.05). For ATAC-seq, ChIP-seq, and CUT&RUN datasets, reads were aligned to the hg38 human genome using BWA (v0.7.15) and filtered for uniquely mapping reads (MAPQ > 10) with samtools (v1.10). ChIP-Seq and ATAC-seq peak calls and normalized signal coverage bigwig plots were generated using MACS2 (v2.1.1) (with setting --SPMR). CUT&RUN peak calls were generated using MACS2 in paired-end mode using a relaxed p-value threshold without background normalization (--format BAMPE --pvalue 0.01 --SPMR -B --call-summits). MACS2 was also run in single-end mode with additional parameters --shift -75 and --extsize 150, and Bedtools (v2.28.0) was used to merge peaks across the two modes of peak calling for each sample (bedtools merge with options -c 5 -o max).
RNA-seq and PRO-seq reads were aligned to hg38 using hisat2 (v2.1.0) with option --no-softclip and filtered for uniquely mapping reads with samtools for MAPQ > 10. Bigwig tracks were generated using the bamCoverage function of deepTools (v3.0.1), with CPM normalisation (ignoring chrX and chrM) and bin size 1bp. Some datasets from TCGA, ENCODE, Cistrome DB and the CEMT Canadian Epigenomes Project were downloaded as post-processed peaks and bigwig files.
TE colocalization analysis
To determine TE family enrichment within regulatory regions, we used GIGGLE (v0.6.3) (Layer et al. 2018) to generate a genomic interval index of all TE families in the hg38 human genome, based on Dfam v2.0 repeat annotation (n=1315 TE families). Regulatory regions (e.g., ATAC, ChIP-Seq, or CUT&RUN peaks) were queried against the TE interval index using the GIGGLE search function (-g 3209286105 -s). Results were ranked by GIGGLE enrichment score, which is a composite of the product of −log10(P value) and log2(odds ratio). Significantly enriched TE families were defined as those with at least 25 overlaps between TE copies and a set of peak regions, odds ratio over 10, and GIGGLE score over 100 in at least one cancer type.
Defining cancer-specific regulatory elements
To define cancer-specific regulatory elements, we obtained aggregate ATAC-seq regions associated with each tumor type profiled by The Cancer Genome Atlas (Weinstein et al. 2013), which represent a union of recurrent ATAC-seq regions associated with each tumor type. To filter predicted regulatory regions in healthy adult tissues, we used chromHMM regulatory regions defined by the Roadmap project, using healthy adult tissues from categories 1_TssA, 6_EnhG & 7_Enh (e.g. filtering out placental tissues, embryonic stem cells and trophoblast stem cells due to high numbers of active TEs). Cancer-specific regulatory regions were defined using the subtract function of bedtools (option -A) to subtract Roadmap regulatory regions from each cancer peak set.
Transcription factor motif analyses
Motif analysis of LTR10 elements was performed using MEME suite (v5.4.1) in differential enrichment mode (Bailey et al. 2009). HCT116 CUT&RUN H3K27ac-marked LTR10A/F sequences (n=144) were used as input against a background set of unmarked LTR10A/F sequences (n=561), with default settings other than the number of motif repetitions (Any) and the number of motifs to find (5). Each discovered motif was searched for similarity to known motifs using the JASPAR 2018 non-redundant DNA database with TomTom (v5.4.1). FIMO (v5.1.0) was then used to extract motif frequency and hg38 genomic coordinates, with p-value threshold set to 1e-4.
Differential analysis using DESeq2
For RNA-seq samples, gene count tables were generated using featureCounts from the subread (v1.6.2) package with the GENCODE v34 annotation gtf to estimate counts at the gene level, over each exon (including -p to count fragments instead of reads for paired-end reads; -O to assign reads to their overlapping meta-features; -s 2 to specify reverse-strandedness; -t exon to specify the feature type; -g gene_id to specify the attribute type).
To quantify TE expression at the family level, RNA-seq samples were first re-aligned to hg38 using hisat2 with -k 100 to allow multi-mapping reads and --no-softclip to disable soft-clipping of reads. TEtranscripts (v2.1.4) was then used in multi-mapping mode with the GENCODE v34 annotation gtf and hg38 GENCODE TE gtf to assign count values to both genes and TE elements.
For H3K27ac CUT&RUN samples, the bedtools multicov was used to generate a count table of the number of aligned reads that overlap MACS2-defined peak regions. The top 20,000 peaks were extracted from each sample and merged (using bedtools merge with -d 100) to produce the peak file used as input to bedtools multicov.
All count tables were processed with DEseq2 (v1.32.0). Normalized count values were calculated using the default DEseq2 transformation. R packages ggplot2 (v3.3.2), ggrepel (v0.8.2) and apeglm (v1.8.0) were used to visualize differentially expressed genes or TEs. The same DEseq2 analyses were used to identify differentially enriched peak regions between H3K27ac CUT&RUN samples (e.g. in response to MAPK treatment). Significantly differentially enriched regions were queried against the Giggle index of human repeats to identify over-represented TE families.
Prediction of LTR10 enhancer gene targets
LTR10 elements were initially prioritized for CRISPR silencing or deletion based on enhancer predictions from the Activity-by-Contact (ABC) model (Fulco et al. 2019). Publicly available HCT116 ATACseq (GEO accession GSM3593802) and in-house HCT116 H3K27ac CUT&RUN were used as input to the ABC pipeline, as well as the provided averaged human cell line HiC file. Predicted enhancer regions with an ABC interaction score over 0.001 were intersected with H3K27ac-marked LTR10A/F elements. Putative LTR10 enhancers were then checked for proximity (e.g. within 1.5Mb) to MAPK-regulated genes (i.e. genes that were significantly affected by MAPK treatments Cobimetinib and TNF-alpha, based on inhouse RNAseq).
Evolutionary analysis of LTR10 sequences
Genomic coordinates of LTR10 elements in the hg38 human genome were obtained from Dfam v2.0, based on RepeatMasker v4.0.6 repeat annotation. The nucleotide sequence of each LTR10 element was extracted using the getfasta function from bedtools (using -name+ to include coordinates in the header and -s for strand specificity). VSEARCH v2.14.1 was used to set a minimum length threshold of 200bp for LTR10 sequences (--sortbylength -minseqlength 200), prior to alignment. MUSCLE (v3.8.1551) was used to align the remaining sequences. Jalview (v2.11.1.4) was used to perform a principal component analysis on pairwise similarity scores derived from the multiple sequence alignment.
LTR10 consensus sequences representing each LTR10 subfamily (A-G) were likewise downloaded from Dfam. Sequences were concatenated into one FASTA file and aligned using MUSCLE. FastTree was used to infer a maximum likelihood phylogeny representing the LTR10 subfamily relationships.
The phylogeny of known primate relationships was obtained from TimeTree (Kumar et al. 2017) and the HERV-I insertion estimate was confirmed based on the presence or absence of LTR10 sequences across mammals, using BLAST (v2.7.1+) (Altschul et al. 1990).
VNTR identification
gnomAD (v3.1) VCF files for each hg38 chromosome were filtered for high-confidence indels (FILTER=PASS) using the query function of bcftools (v1.8) with format parameter -f’%CHROM\t%POS0\t%END\t%ID\t%REF\t%ALT\t%AF\t%TYPE\tFILTER=%FILTER\n’. The remaining indels were then subsetted by size to retain insertions or deletions between 10 to 300bp in length, and chromosome vcfs were concatenated into one whole genome bed file. Bedtools (v2.28.0) was used to intersect the indel bed file with LTR10 elements from each subfamily, based on Dfam (v2.0) repeat annotation. Indels from long-read datasets were likewise filtered by variant type (INS or DEL) and length (50-300bp), then intersected with LTR10 elements. Deletion length versus allele frequency was plotted for each subfamily, from each separate dataset.
DATA AVAILABILITY
Inhouse high-throughput sequencing data has been deposited in the Gene Expression Omnibus (GEO) with the accession code GSE186619. GSE IDs of public datasets used in this study are listed in the figure legends and GitHub repository. The following large-scale databases were also used: Cistrome DB (downloaded Feb 2019), Roadmap/ENCODE (downloaded Feb 2019), The Cancer Genome Atlas (downloaded Sept 2019), CEMT Canadian Epigenome Project (downloaded July 2020), Dfam 2.0 and gnomAD v3.1.
CODE AVAILABILITY
Source code and workflows are available on GitHub: https://github.com/atmaivancevic/ERV_cancer_enhancers
COMPETING INTEREST STATEMENT
The authors declare no competing interests.
AUTHOR CONTRIBUTIONS
A.I. designed and conducted the bioinformatics analysis. D.M.S. designed and conducted the experiments. A.I. and E.B.C. conceived the study, analyzed the data, and wrote the paper.
SUPP TABLE LEGENDS
Supplementary Table 1: TEs enriched in tumor accessible chromatin from TCGA. The top 23 TE families (based on GIGGLE enrichment score) are shown.
Supplementary Table 2: LTR10 elements enriched in Roadmap regulatory regions, based on GIGGLE enrichment score. The merged set of 649 LTR10A/F elements was searched against all Roadmap tissues (n=127).
Supplementary Table 3: LTR10 elements enriched in Cistrome Transcription Factors, based on GIGGLE enrichment score. The merged set of 649 LTR10A/F elements was searched against all Cistrome human transcription factors (n=11348).
Supplementary Table 4: 620 MAPK-regulated genes based on DEseq2 results for HCT116 RNAseq with untreated, 24hr cobimetinib-treated and 24hr TNFa-treated cells.
Supplementary Table 5: All predicted LTR10 enhancer regions based on the activity by contact model (41), which assigns enhancer-gene targets based on H3K27ac signal and chromatin interaction data.
Supplementary Table 6: 57 LTR10 enhancers predicted to regulate MAPK genes. LTR10 enhancers predicted by the activity by contact model (41) were searched for proximity (within 1.5Mb) to genes significantly affected by both cobimetinib and TNF-alpha treatment.
Supplementary Table 7: DEseq2 results comparing the LTR10.ATG12 element silencing to control (GFP). Genes are sorted from lowest to highest adjusted p-value.
Supplementary Table 8: Primers, gRNAs and LTR10 sequences used for CRISPR experiments, functional validation and luciferase assays.
ACKNOWLEDGEMENTS
We thank the University of Colorado Genomics Shared Resource and BioFrontiers Computing core for technical support during this study. A.I. was supported by the National Cancer Center. E.B.C. was supported by the National Institutes of Health (1R35GM128822), the Alfred P. Sloan Foundation, the David and Lucile Packard Foundation, and the Boettcher foundation.
Footnotes
Supplemental files updated, minor errors fixed