Scans of the MYC mRNA reveal multiple stable secondary structures—including a 3′ UTR motif, conserved across vertebrates, that can affect gene expression

The MYC gene encodes a human transcription factor and proto-oncogene that is dysregulated in over half of all known cancers. To better understand potential post-transcriptional regulatory features affecting MYC expression, we analyzed secondary structure in the MYC mRNA using a program that is optimized for finding small locally-folded motifs with a high propensity for function. This was accomplished by calculating folding metrics across the MYC sequence using a sliding analysis window and generating unique consensus base pairing models weighted by their lower-than-random predicted folding energy. A series of 30 motifs were identified, primarily in the 5’ and 3’ untranslated regions, which show evidence of structural conservation and compensating mutations across vertebrate MYC homologs. This analysis was able to recapitulate known elements found within an internal ribosomal entry site, as well as discover a novel element in the 3’ UTR that is unusually stable and conserved. This novel motif was shown to affect MYC expression: likely via modulation of miRNA target accessibility. In addition to providing basic insights into mechanisms that regulate MYC expression, this study provides numerous, potentially druggable RNA targets for the MYC gene, which is considered “undruggable” at the protein level.


31
The MYC proto-oncogene is an important transcription factor that is required for programmed 32 cell death (apoptosis) and cell proliferation [1]. It is a key component of oncogenesis [2] and, 33 indeed, MYC is dysregulated in >50% of all cancers [3]. Post-transcriptional control plays 34 significant roles in the regulation of many genes including MYC. Within the 5' untranslated 35 region (UTR) of the MYC mRNA is a structured internal ribosomal entry site (IRES) that 36 stimulates cap-independent translation under conditions where cap-dependent translation is 37 inhibited: e.g. during apoptosis [4]. Consistent with other IRESs [5] the MYC IRES secondary 38 structure, deduced from in vitro chemical probing data [6], is complex and contains two 39 pseudoknots-motifs comprised of "non-nested" base pairing between looped out regions of 40 RNA [7]. In addition to the IRES, other post-transcriptional regulatory mechanisms affect MYC 41 expression-e.g. microRNAs (miRs) [8]-that may be affected by RNA structure [9]. 42 To determine if other structured RNA regulatory elements can be playing roles in MYC 43 expression, we applied a methodological pipeline for RNA motif discovery that was optimized 44 from studies of the Xist lncRNA [10], as well as the Human [11], Zika and HIV genomes [12]. 45 There are two major steps in this pipeline: (1) a scanning step, where the RNA is examined 46 using a sliding analysis window to record predicted metrics important for analyzing RNA 47 secondary structure (e.g. the thermodynamic stability); and, (2) an analysis step where unique 48 local motifs are defined then evaluated vs. comparative sequence/structure and/or experimental 49 probing data. Each step is achieved using the programs ScanFold-Scan and ScanFold-Fold, respectively. Used together these programs define the potential RNA structural properties 51 of long sequences and identify motifs likely to be ordered to form, presumably functional, 52 defined structures. This is accomplished by generating consensus structure models across all 53 scanning windows, where base pairs are weighted by their thermodynamic z-score: a measure 54 of the unusual stability of a sequence that is calculated by comparison to the folding energy of 55 matched randomized control sequences. Here, negative values indicate sequences that are 56 ordered to fold and that may be functional [13]. While the primary goal is to deduce what 57 nucleotides may be functionally significant, ScanFold-Fold models can also increase 58 prediction accuracy [12] 59 In this report, ScanFold-Scan and ScanFold-Fold were applied to the longest MYC RefSeq 60 localized to the 3' UTR. Refolding the mRNA with -1 ScanFold-Fold bp as constraints added 99 153 bp to the discovered motifs: e.g. by extending helices or closing unpaired bases in the 100 consensus prediction (Document S2). These 507 bp are divided into 30 motifs that span the 101 MYC mRNA (Fig. 1). Motif locations, as expected, correspond to negative dips in z-score; 102 however, dips in ΔG° and ED are also observed at motif sites. The most prominent regions with 103 dips in metrics occur at Motifs 17 and 18 ( Fig. 1), which contain very low z-score base pairs 104 (cutoff < -2) deduced by ScanFold-Fold. These two motifs, particularly Motif 17, had the most 105 favorable ScanFold metrics of any region/motif predicted for MYC. 106 All motif bp were analyzed versus an alignment of 15 vertebrate mRNA sequences (Document 107 S3). Motif 17 had the highest conservation of structure and was supported by the greatest 108 number of consistent and compensatory mutations (Fig. 1). In general, Motifs 7-19 showed 109 evidence of conservation, however, little conservation data was found outside these regions: UTR sequences (Document S4). The alternative models for Doman 2/Motif 9 are roughly 119 equally well supported by comparative data. Both are comprised of base pairs conserved across 120 vertebrates and that show evidence of possible compensatory mutations: e.g. C279-G284 in 121 Domain 2 vs. A307-U334 and G309-C332 in Motif 9 (Fig. S3). Neither model can be discarded 122 based on these data. Nucleotides within Motif 9 were found to be highly reactive to chemicals in the previous in vitro analysis of the MYC UTR [6], thus their modeling as single stranded RNA. 124 When overlaid on Motif 9, however, only 4 out of 21 modification sites were inconsistent with the 125 ScanFold-Fold generated model (Fig. S3); additionally, sites of AMV reverse transcriptase 126 pausing suggest that this region is structured. sequence data (implicitly) in prediction; the resulting consensus model ( Figure S5) predicts 148 conserved structures that correspond to the two highly-conserved Motif 17 hairpins predicted by 149

ScanFold-Fold. 150
Functional analyses of the MYC 3' UTR 151 As the most significant motifs predicted in MYC occurred in the 3' UTR, a known site of miRNA 152 targeting, the locations of MYC-targeting miRNA binding sites were queried vs. predictions of 153 structure. Of nine miRNAs with known interaction sites [8,14,[16][17][18][19][20], seven occurred within 154 Motif 17 ( Fig. 3A and B). miR-34a/b/c, miR-449c and let-7a have overlapping seed binding sites 155 in the unstructured region between the two highly-conserved Motif 17 hairpins ( Fig. 3A and B). respectively. This is consistent with previous analyses of the MYC 3′ UTR, where the entire short UTR isoform was incorporated into pLSV (an analogous Luciferase vector) and, using a 174 similar analysis pipeline, was shown to lead to gene repression [14]. Similarly, ablation of the 175 miR-34a-c seed (and also, seed regions for miRs 449c and let-7a) showed that miRNA targeting 176 was responsible for the repressive effects of this region. 177 To determine if RNA structure present in Motif 17 influences miRNA binding/repression, two 178 mutant constructs, pIS2-AS1 (ablate stem 1) and pIS2-LS1 (lock stem 1), were designed to 179 increase or decrease miRNA site accessibility respectively ( Fig. 4) according to the ΔΔG metric 180 of Kurtesz et. al [21]. This metric accounts for both the energy needed to break native mRNA 181 secondary structure and the energy gained by miRNA binding, and it was used to predict 182 miRNA site accessibility for the WT and mutant constructs. The WT sequence, pIS2-M17, has a 183 predicted ΔΔG of -4.67, whereas pIS2-AS1 and pIS2-LS1 have values of -13.56 (more 184 accessible) and +2.55 (less accessible) respectively. When assayed, pIS2-AS1 shows (Fig. 4) a 185 ~11% decrease in its RRR and an increase in TE of ~20% compared to pIS2-M17, however, 186 these values are not significantly different from each other (Fig. 4). pIS2-LS1 showed a 187 substantial increase in both RRR and TE (~68% and ~70% increase respectively) when 188 compared to pIS2-M17. where RNA structure thermodynamic stability decreases going 5' to 3', with marked "jumps" in 194 instability observed at the UTR/coding-region junctions (Figs. S1 and S2). Likewise shifts toward 195 more positive ED and z-score values were observed in junction-spanning windows: most 196 dramatically at the 5' junction, which includes both the CUG (non-canonical) and AUG (canonical) translation initiation sites. These results indicate a lack of stable structure here, 198 reiterating previous observations that indicate inhibitory roles for thermodynamically stable RNA 199 secondary structure at initiation sites [22]. We additionally find evidence that evolution may be 200 specifically selecting for MYC initiation site sequences that are ordered to be less-stable than 201 that predicted for sequences of similar composition (thus the positive z-scores); as well, the 202 junction sequence is expected to have a volatile conformational ensemble, where no particular 203 structure dominates (high ED). 204 The high and low respective thermodynamic stabilities of the 5' and 3' UTRs (Figs. S1 and S2) 205 indicate differing roles for RNA folding in these regions. The highly-stable 5' UTR would be 206 expected to inhibit canonical translation by obstructing scanning ribosomes; thus, the presence 207 of an IRES in the MYC mRNA. This can provide mechanisms for fine-tuning the post-208 transcriptional regulation of the MYC gene by allowing it to be translated in a cap-independent 209 manner. The MYC IRES was shown to be active in some, but not all tissue types and the 210 variability of activity is attributed to the presence, or lack of, trans-regulatory elements (e.g. 211 RBPs; [4]). This demonstrates how cis-elements of the mRNA can interact with trans-regulatory 212 elements to diversify (i.e. regulate) the cellular levels of a protein.

213
In contrast, the low stability of the 3' UTR suggests a need for increased accessibility of the 214 mRNA sequence: e.g. for intermolecular interactions with post-transcriptional regulatory factors 215 such as miRNAs and regulatory proteins. Counterintuitively, the sites with the greatest evidence 216 of having been ordered to fold into a specific structure are in the 3' UTR (e.g. Motifs 17 and 18 217 in Fig. 1). Motif 17, for example, is the most well-conserved structured region in MYC-even 218 more so than the IRES domain (Figs. 2 and S3)-and is supported via multiple compensatory 219 and consistent base mutations. The highly favorable metrics and deep conservation of this motif 220 throughout vertebrates indicated its biological importance, which was borne out by the analysis 221 of miRNA binding sites ( Fig. 3A and B) and Motif 17 function (Fig. 4). This motif acts as a hub for miRNA interactions and may organize the MYC miRNA target sites for interactions: e.g. the 223 two highly-conserved hairpins may act as a "structure cassette" for maintaining the single-224 strandedness of the miR-34, -449c, and -let-7a seed binding regions, while modulating the 225 accessibility of additional bases for miRNA pairing that can affect the outcome of miRNA 226 targeting: e.g. translational repression. 227 The second most favorable motif (Motif 18; Fig. 1) contains a miR-24 interacting region (Fig.  228   3C). Notably, this interaction is "seedless" [17]-only three of the miR-24 seed nt are base 229 paired to MYC (Fig. 3C). Most of the miR-24-interacting nt on MYC are predicted to be bound 230 up in structure. Here, as in other interaction sites, RNA folding may be modulating accessibility. 231 We observed that reducing the accessibility of miRNA binding in Motif 17 via pIS2-LS1 232 Both Motif 17 and 18 occur in a particularly GC-poor and thermodynamically unstable region of 240 the 3' UTR (Fig. 1). This is due to long stretches of highly-conserved poly(U) (uridine) tracts that 241 occur between and around these structural motifs (Documents S3 and S5). Interestingly, 242 poly(U) tracts have been found to stabilize mRNA sequences via structural interactions with the 243 poly(A) (adenosine) tail [23]. Thus, an additional function of MYC 3' UTR structural motifs may 244 be to organize poly(U) tracts to facilitate interactions with the poly(A) tail. For example, in Figure  245 e.g. in an analogous way to poly(U)-poly(A) interactions in viral and ncRNA stabilizing elements 248 [24,25]. Notably, our qPCR data does not indicate degradation of miRNA targeted RL 249 transcripts. Instead, slight transcript accumulation is observed in experimental samples (which 250 contain the poly(U) tract) compared to unregulated pIS2 samples (Table S1). Motif 17, in 251 addition to regulating miRNA target accessibility, may be affecting transcript stability by 252 modulating interactions of the poly(U) tract with the poly(A) tail of the mRNA. More investigation 253 is needed to parse out these interactions. 254 Additional functional motifs are predicted beyond Motif 18 (Fig. 1) (Fig. 3D), which falls within the MYC coding region (Fig. 1). 261

Awareness of the importance of miRNA targeting in coding regions is growing [29] and, 262
presumably, additional MYC miRNA interactors remain to be discovered. 263 To conclude, this report provides valuable information on MYC mRNA secondary structure that 264 has implications toward a better understanding of post-transcriptional gene regulation. In 265 addition to providing global structural data, discreet local motifs with a high propensity for 266 function are proposed, including a particularly interesting motif in the 3' UTR that has been 267 folding. Additionally, these results generate a large list of structural motifs that may be 276 druggable targets [30,31] for MYC, which is considered undruggable at the protein level [3]. 277

In silico analyses 279
The Homo sapiens MYC RefSeq mRNA sequence was downloaded from the NCBI nt database 280 (GenBank Accession: NM_002467.5). ScanFold-Scan was run using a single nt step size and 281 window sizes of 70 (Document S1) and 120 nt (results with the longer window size were 282 Here, ΔG° is the native sequence minimum free energy (MFE) predicted by RNAfold. 288

ΔG°� ����
is the average MFE predicted for 100X mononucleotide randomized sequences. The 289 standard deviation, σ, is calculated across all sequences. The other calculated data are: the P-290 value, which measures the fraction of random sequences that are more stable than native in the 291 z-score calculation (this acts as a quality control measure for the z-score); the MFE ΔG°, which 292 measures the thermodynamic stability of RNA secondary structure formation; the MFE base 293 pairs that generate the MFE ΔG°, which are output in "dot-bracket" notation; the ensemble 294 diversity (ED), which provides an estimate of the structural diversity in the RNA conformational 295 ensemble (e.g. low ED indicates a single dominant conformation); the fraction of the (f)MFE in 296 the ensemble, which estimates the contribution of the MFE conformation to the ensemble; the 297 ensemble centroid structure, which is the conformation most similar to others in the ensemble; 298 and the nt frequencies and GC percentages. 299 ScanFold-Scan prediction windows were next analyzed using the program ScanFold-Fold 300 to deduce consensus motifs weighted by the z-score. The ScanFold-Fold method is detailed was generated by constraining base pairs from Motif 17 and refolding the remaining sequence using RNAfold [32]. A consensus secondary structure for the short MYC 3' UTR was predicted 319 (Fig. S3) using RNAalifold [15] with the 3' UTR alignment (Document S5) as input. 320

Experimental analyses 321
Cell Culture. HeLa cells were incubated at 37°C and 5% CO2 and maintained in DMEM 322 supplemented with 10% FBS, penicillin and streptomycin, and L-glutamine. Cells were 323 passaged at 60-80% confluence and used between 3-40 passages. To test the post-transcriptional regulation of ScanFold-Fold predicted motifs, the Motif 17 332 sequence, along with 27 nt upstream and 11 nt downstream were incorporated into the 3'-UTR 333 of pIS2 to generate pIS2-M17. Mutants that destabilize, pIS2-AS1, or stabilize, pIS2-LS1, the 334 structure present in pIS2-M17 were generated. For pIS2-AS1, 6 mutations were incorporated 335 that disrupt canonical base pairing in the first conserved hairpin. To generate pIS2-LS1, 3 336 mutations and 1 base deletion were introduced in the bulge on the upstream side of the first 337 conserved hairpin. Mutations that destabilize or stabilize Motif 17 were predicted using the ΔΔG 338 metric as a measure of miRNA site accessibility (Kertesz et al., 2007). 70 nt upstream and 70 nt 339 downstream of the miRNA target site were included in our ΔΔG calculations. 340 The sequences for pIS2-M17, pIS2-AS1, and pIS2-LS1 were ordered as gBlocks from IDT and 341 cloned using AgeI (5') and Spe1 (3') restriction sites (sequences in Supplemental Table S2).
Insertion of experimental sequences into the 3'-UTR of pIS2 required double restriction enzyme 343 digest (using AgeI and SpeI from NEB) of both the gBlock and pIS2, following digestion, 344 fragment and vector DNA were purified (Zymo DNA Clean and Concentrator kit), ligated (T4 345 Ligase from ThermoFischer), and transformed into DH5α-T1 competent cells using standard 346 procedures. Carbenicillin selected colonies were cultured and plasmids were extracted (Qiaprep 347 kit) and sequenced using an Applied Biosystems 3730xl DNA Analyzer. 348 Dual Luciferase Assay. Dual luciferase assays followed recommendations of an established 349 method (Etten et al., 2013). In brief, the pIS0 vector (FF) is transfected at constant levels across 350 all samples to serve as an internal control to which RL luciferase expression is normalized. All 351 samples were run as biological triplicates. HeLa cells were counted using a hemocytometer and 352 plated in a 24-well dish at a density of 50,000 cells per well. After 48 hours, cells were 353 transfected using Lipofectamine 3000 (ThermoFischer) with 500 nanograms total plasmid DNA 354 at a 1:1:8 ratio (pIS0:pIS2-based:pcDNA3.1-miR34a). Twenty-four hours later, cells were 355 trypsinized, resuspended, and split into each of a 24-well plate for RNA analysis (1 ml) and a 356 96-well plate for the dual luciferase assay (0.2 ml). After another 24-hour incubation, cells in the 357 96-well dish were lysed, and luciferase activity was measured using Promega's Dual Luciferase 358 Reagent Assay kit on a Biotek Synergy 2 plate reader with a collection time of 10 seconds. 359 Relative response ratios (RRR), the ratio of RL to FF relative light units (RLUs), were calculated 360 for each sample and then normalized to the empty, unregulated pIS2 RRR. Cells from the 24-361 well plate were placed in TRIzol (ThermoFisher) and either stored at -80°C or immediately 362 processed as below. 363 RNA Processing and qPCR Analysis. Cellular RNA was purified from samples in TRIzol using 364 Zymo's Direct-Zol RNA Miniprep kit. Purified RNA was then Dnase I treated (NEB) for 2 hours at 365 37°C and the resulting DNase-treated RNA was purified with Zymo's RNA Clean and hexamers, and Superscript III (ThermoFisher). 368 Relative abundance of RL transcripts across samples were measured by qPCR, performed 369 using PowerUp SYBR Green Master Mix on 1% cDNA input on an Applied Biosystems  boxes which depict the location and extent of ScanFold-Fold predicted motifs shaded red 391 based on the average z-score of windows in which motif base pairs occurred. Below these are 392 RNA secondary structure arc diagrams which depict the most favorable base pairs predicted via 393 ScanFold-Fold, colored according the average z-scores of windows in which they appear 394 (with blue, green and yellow corresponding to less than -2, -1 and 0 z-scores averages 395 respectively). Below these, are refolded models of the motifs built with -1 average z-score bps 396 as constraints. Each is annotated with their bp conservation as determined from an alignment of 397 15 representative mRNAs (Document S3) indicated by shading on the base pair (see key). 398 Circled bases are sites of putative structure-preserving consistent and compensatory mutations. 399   M17, pIS2-AS1, and pIS2-LS1 all display activity that differs from pIS2. pIS2-M17 and pIS2-AS1 412 both show decreased RRRs while pIS2-LS1, which was designed to have a more stable and 413 less accessible 3'-UTR, shows an increase in RRR compared to pIS2. pIS2-MSC displays a large decrease in TE and pIS2-AS1, which is predicted to have a more accessible miRNA site, 415 displays a TE which is slightly increased compared to pIS2-M17. The TE of pIS2-LS1 is 416 markedly greater than pIS2-M17, possibly reflecting the decrease in miRNA target site 417 accessibility. The ΔΔG values for pIS2-M17, pIS2-AS1, and pIS2-LS1 are also shown along with 418 destabilizing pIS2-AS1 mutations (displayed in red) and stabilizing pIS2-LS1 mutations 419 (displayed in green). 420

421
Supplemental material is available for this report. 422