Splicing buffers suboptimal codon usage in human cells

Although multiple studies have addressed the effects of codon usage on gene expression, such studies were typically performed in unspliced model genes. In the human genome, most genes undergo splicing and patterns of codon usage are splicing-dependent: guanine and cytosine (GC) content is highest within single-exon genes and within first exons of multi-exon genes. Intrigued by this observation, we measured the effects of splicing on expression in a panel of synonymous variants of GFP and mKate2 reporter genes that varied in nucleotide composition. We found that splicing promotes the expression of adenine and thymine (AT)-rich variants by increasing their steady-state protein and mRNA levels, in part through promoting cytoplasmic localization of mRNA. Splicing had little or no effect on the expression of GC-rich variants. In the absence of splicing, high GC content at the 5’ end, but not at the 3’ end of the coding sequence positively correlated with expression. Among endogenous human protein-coding transcripts, GC content has a more positive effect on various expression measures of unspliced, relative to spliced mRNAs. We propose that splicing promotes the expression of AT-rich genes, leading to selective pressure for the retention of introns in the human genome.


Introduction 21
Mammalian genomes are characterised by large regional variation in base 23 composition (Bernardi, 1993). Regions with a high density of G and C nucleotides 24 (GC-rich regions) are in an open, transcriptionally active state, are gene-dense, 25 and replicate early. In contrast, AT-rich regions are enriched with 26 heterochromatin, contain large gene deserts and replicate late (Arhondakis et al., 27 2011;Lander et al., 2001;Vinogradov, 2003). The mechanisms that give rise to 28 this compositional heterogeneity have been under debate for years and many 29 researchers believe that the pattern originates from the process of GC-biased 30 gene conversion (Duret and Galtier, 2009), though other neutral and selective 31 mechanisms have been proposed as well (Eyre-Walker, 1991;Galtier et al., 2018;32 Plotkin and Kudla, 2011;Sharp and Li, 1987b). 33

34
The sequence composition of mammalian genes correlates with the GC-content 35 of their genomic location. Thus, introns and exons of genes located in GC-rich 36 parts of the genome are themselves GC-rich. This can potentially influence gene 37 expression in multiple ways: nucleotide composition affects the physical 38 properties of DNA, the thermodynamic stability of RNA folding, the propensity of 39 RNA to interact with other RNAs and proteins, the codon adaptation of mRNA to 40 tRNA pools, and the propensity for RNA modifications, such as m6A (Dominissini 41 et al., 2012) and ac4C (Arango et al., 2018). Strikingly, studies of the effects of 42 nucleotide composition on gene expression in human cells have led to opposing 43 conclusions. On the one hand, heterologous expression experiments typically 44 report large positive effects of GC content on protein production in a wide 45 variety of transgenes, including fluorescent reporter genes, human cDNAs, and 46 viral genes (Bauer et al., 2010;Kosovac et al., 2011;Kotsopoulou et al., 2000;47 natural evolutionary experiment of what happens when a previously spliced 87 gene suddenly loses its introns. We first analysed a set of 49 parent-retrogene 88 pairs for which both the parent and the retrocopy ORFs have been retained in 89 human and mouse. Strikingly, we found that the retrocopies had a significantly 90 higher GC4 content than their parents (median GC4retrocopy -GC4parent = 11.5%; p 91 = 2.1×10 -4 from one-tailed Wilcoxon test; Figure 1D). It thus appears that after 92 retrotransposition, newly integrated intronless genes come under selective 93 pressure for increased GC content. In a comparison of 31 parent-retrogene pairs 94 retained between human and macaque, the median GC4 difference is not 95 significant (0.09%; p = 0.13, Wilcoxon test), but this may be explained by 96 duplication events in macaques being more recent (dS ~0.06) than in mouse (dS 97 ~ 0.5) and therefore less evolutionary time has passed to allow changes in GC 98 composition to have occurred. As a control, we analysed the GC4 content of 99 retrocopies classed as pseudogenes ( Figure S1A) and found it to be significantly 100 lower compared to their parental genes (-2.963%; p < 2.2×10 -16 , Wilcoxon test). 101 Furthermore, the genomic neighbourhood of functional retrocopies and 102 pseudogenes had significantly lower GC content than the neighbourhood of their 103 respective parental genes ( Figure S1B). These observations suggest that 104 increased GC content is not intrinsically connected with retrotransposition, but 105 is required for maintaining long-term functionality of retrogenes. Taken 106 together, these results support a splicing-dependent mechanism shaping 107 conserved patterns of nucleotide composition across functional protein-coding 108 genes. 109 110 GC-content is a strong predictor of expression of unspliced reporter genes 111 The above analyses show a connection between splicing and genomic GC content 112 of endogenous human genes. To test whether splicing differentially affects the 113 expression of genes depending on their GC content, we designed 22 synonymous 114 variants of GFP that span a broad range of GC3 content (GC content at the third 115 positions of codons) (Mittal et al., 2018) ( Figure S2). The collection encompasses 116 most of the variation in GC3 content found among human genes. All variants 117 were independently designed by randomly drawing each codon from an 118 appropriate probability distribution, to ensure uniform GC content and statistical 6 independence between sequences. We cloned these variants into two 120 mammalian expression vectors: an intronless vector with a CMV promoter 121 (pCM3) and a version of the same vector with a synthetic intron located in the 5' 122 UTR (pCM4). The GC content profiles of the 5' UTRs were similar in both vectors 123 ( Figure S2E,F). The vectors also encoded a far-red fluorescent protein, mKate2, 124 which we used to normalize GFP protein abundance (normalization reduced 125 measurement noise, but similar results were obtained with and without 126 normalization). Transient transfections of HeLa cells with three independent 127 preparations of each plasmid showed reproducible expression with a large 128 dynamic range: synonymous variants differed in GFP protein production 46-fold. 129 Consistent with previous studies, GFP fluorescence was strongly correlated with 130 GC3 content, both in spliced and unspliced genes (Figure 2A exact test revealed that the percentage of variants whose expression was 140 increased by splicing significantly depended on GC3 content (p=0.02, N=47, GFP 141 and mKate variants combined). These experiments show that many AT-rich 142 genetic variants are expressed inefficiently in human cells, but low expression 143 can be partially rescued by splicing. Notably, the average GC content of the 144 human genome is 41% (Li, 2011). In our experiments, genes with GC content 145 below 41% are expressed extremely inefficiently, unless they contain an intron 146 ( Figure 2). This may provide a strong selective pressure for the retention of 147 introns in many human genes. 148

149
To establish which stages of expression are responsible for this phenomenon, we 150 first measured mRNA abundance of GFP variants in transiently transfected HeLa 151 cells by quantitative RT-PCR (qRT-PCR). High GC content may introduce 7 unwanted bias in RT-PCR, so to allow fair comparison of all variants irrespective 153 of their GC content, qRT-PCR primers were placed in the untranslated regions, 154 whose sequence did not vary. Similar to protein levels, mRNA abundance varied 155 widely between synonymous variants of GFP. GC-poor variants experienced a 156 large increase of expression in the presence of an intron, whereas GC-rich 157 variants were less affected ( Figure 2G-I). The range of variation in mRNA 158 abundance was much smaller in constructs with an intron than without intron 159 ( Figure 2I), indicating that splicing buffers the effects of GC content on 160 expression. 161

162
We then asked if changes in mRNA abundance arose at transcriptional or post-163 transcriptional levels. As a proxy for transcriptional efficiency, we measured the 164 abundance of intronic RNA for GFP variants expressed from the intron-165 containing plasmid. GC content did not correlate with intronic RNA abundance 166 ( Figure 2J), suggesting that the rate of transcription does not depend on GC 167 content of the coding sequence found downstream of the intron. Conversely, high 168 GC content was associated with stabilization in unspliced constructs ( Figure 2K). 169 Taken together, these experiments show that splicing preferentially increases 170 the expression of GC-poor synonymous variants at a post-transcriptional level. 171 172 High GC content at the 5' end correlates with efficient expression 173 To further explore the sequence determinants of expression, we assembled a 174 pool of 217 synonymous variants of GFP that included the 22 variants studied 175 above, 137 variants from our earlier study (Kudla et al., 2009), and 58 additional 176 variants. We cloned the collection into plasmids with and without a 5' UTR 177 intron. We then established pools of HeLa Flp-In T-REx cells that stably express 178 these constructs from a single genomic locus under a doxycycline-inducible 179 promoter and measured the protein levels of all variants by Flow-Seq (Kosuri et 180 al., 2013). We also performed Flow-Seq in HEK293 cells using the intronless 181 constructs only. In Flow-Seq, a pool of cells is sorted by FACS into bins of 182 increasing fluorescence and the distribution of variants in each bin is probed by 183 amplicon sequencing to quantify protein abundance ( Figure 3A). All variants 184 could be quantified with good technical and biological reproducibility and high 185 correlation was found between Flow-Seq and spectrofluorometric measurement 186 of individual constructs ( Figure S4). Most variants showed the expected 187 unimodal distribution across fluorescence bins, but some variants showed 188 bimodal distributions, possibly indicative of gene silencing in a fraction of cells. 189 190 All Flow-Seq experiments showed substantial variation of expression between 191 synonymous variants of GFP ( Figure 3B). GFP protein levels in HeLa cells (with 192 intron), HeLa cells (without intron), and HEK293 cells (without intron) were all 193 correlated with each other, but the moderate degree of correlation (r=0.51 194 HEK293 (without intron) vs HeLa (without intron); r=0.36 Hela (with intron) vs 195 HeLa (without intron)) suggests that the effects of codon usage on expression are 196 modulated by splicing and by cell line identity -in agreement with prior 197 observations of tissue-specific codon usage (Burow et al., 2018;Gingold et al., 198 2014;Plotkin et al., 2004;Rudolph et al., 2016). Flow-Seq of unspliced variants in 199 HeLa and HEK cells confirms the positive correlation of synonymous site GC-200 content with expression ( Figure 3C). In contrast to the results reported by us and 201 others in bacteria and yeast Kudla et al., 2009;Shah et al., 202 2013), but consistently with the positive correlation between GC content and 203 expression, strong mRNA folding near the beginning of the coding sequence 204 correlated with increased expression (Spearman's ρ = 0.27 in HeLa cells; ρ = 0.4 205 in HEK293 cells). Expression was positively correlated with CpG content and 206 codon adaptation index (CAI), and negatively correlated with the estimated 207 density of ARE elements or cryptic splice sites. Because of the strong correlation 208 between GC content, CpG content, CAI and mRNA folding energy, a multiple 209 regression analysis could not resolve which of these properties was causally 210 related to expression. Interestingly, for intron-containing variants, there was no 211 correlation, or a weak negative correlation, between expression and GC content, 212 CpG content, CAI, and mRNA folding energy ( Figure 3C). 213 214 Some of the variants analysed by Flow-Seq featured large regional variation in 215 GC content ( Figure S5A) and we asked whether the localization of low-GC and 216 high-GC regions within the coding sequence influences expression. We found 217 that the GC3 content in the first half of the coding sequence (nt 1-360), but not in the second half (nt 361-720), was positively correlated with expression of 219 intronless GFP variants in the HeLa and HEK293 cells ( Figure 3D). The GC3 220 content in the first half of the gene showed no correlation with expression in the 221 intron-containing constructs.   Figure 4B). A value of 0 therefore 247 indicates 100% nuclear retention, whereas a value of 1 indicates 100% 248 cytoplasmic localisation. In the absence of splicing, RCC scores ranged from 0.09 249 to 0.64 and RCC correlated significantly with GC content (r=0.51, p=3.85×10 -13 , 250 Figure 4C). In the presence of a 5' UTR intron, we observed a significant increase 251 in RCC score for GFP variants with low GC content, but no increase in RCC for GC-252 rich variants ( Figure 4D). GC3 content at the beginning of the coding sequence 253 was significantly correlated with RCC in the absence of splicing (r=0.5, p=2.0×10 -254 11 ), but not in the presence of splicing (R<0.01, p=0.48; Figure S5). Thus, high GC 255 content at the 5' end of genes increases gene expression in part through 256 facilitating the cytoplasmic localization of mRNA. 257

258
To assess whether GC content also affects translational dynamics, we performed 259 polysome profiling on HEK293 GFP pool cells using sucrose gradient 260 fractionation ( Figure 5A). qRT-PCR analyses of RNA extracted from all collected 261 fractions showed a broad distribution of GFP across fractions, with enrichment 262 within polysome-associated fractions. In order to determine distribution 263 patterns of individual GFP variants, RNA from several fractions was pooled (as 264 indicated in Figure 5B) and subjected to high-throughput sequencing. The 265 resulting read distribution indicates that GC-rich variants are associated with 266 denser polysomal fractions (ribosome density, Figure 5C, left panel; R 2 =0.55, p < 267 2.2×10 -16 ) and are more likely to be translated (ribosome association, Figure 5C To test whether splicing-and position-dependent effects of codon usage can also 275 be observed among human genes, we turned to genome-wide measurements of 276 expression at endogenous human loci and related these measurements to codon 277 usage and splicing. Although the correlations between GC content and expression 278 depended on the experimental measure and type of cells under study, we find 279 that GC4 content usually has a more positive effect on gene expression in 280 unspliced genes relative to spliced ones ( Figure 6, Table S1). In particular, 281 unspliced mRNAs show a more positive/less negative correlation of GC4 with 282 transcription initiation (GRO-cap data); cytoplasmic stability (exosome mutant); 283 translation rate (ribosome profiling vs whole cell RNA-seq); and protein amount 285 (mass-spec). These analyses suggest that GC4 content has an effect on the RNA 286 abundance of intronless mRNA molecules, which is carried through to the 287 protein expression. Taken together, these genome-wide analyses support our 288 observation of a splicing-dependent relationship between codon usage and 289 expression in human cells. Specific patterns of codon usage have previously been found at the 5' ends of 304 genes in bacteria, yeast and other species (Gu et al., 2010;Kudla et al., 2009;305 Tuller et al., 2010). In bacteria and yeast, strong mRNA folding near the start 306 codon prevents ribosome binding and reduces translation efficiency, resulting in 307 selection against strongly folded 5' mRNA regions (Kudla et al., 2009;Shah et al., 308 2013). In addition a "ramp" of rare codons has been observed near the 5' end of 309 RNAs in multiple species, with a possible role in preventing a wasteful 310 accumulation of ribosomes on mRNAs (Tuller et al., 2010) or reducing the 311 strength of mRNA folding (Bentele et al., 2013). These phenomena cannot 312 explain our results in human, because both the folding energy and codon ramp 313 models predict low GC content near the start codon, whereas we observe high GC 314 content within first exons of human protein-coding genes ( Figure 1B). 315 Furthermore, our experiments show that high GC content near the start codon 316 increases expression, whereas the folding energy and codon ramp models would 317 predict low expression. 318 319 We propose instead that splicing-and position-dependent effects of GC content 320 are explained by co-transcriptional or early post-transcriptional events in the 321 lifetime of an mRNA. Using matched reporter gene libraries, we show that most, 322 but not all, variants show an increase in expression when spliced. Splicing 323 typically increases the expression of AT-rich variants, but it does not further 324 increase the expression of GC-rich transcripts, which suggests that splicing and 325 high GC content influence expression through at least one common mechanism. 326 Splicing increases transcription (Kwek et al., 2002), prevents nuclear 327 degradation (Nott et al., 2003), facilitates nuclear-cytoplasmic mRNA export 328 through the Aly/REF-TREX pathway (Muller-McNicoll et al., 2016), and 329 stimulates translation (Nott et al., 2004). High GC content might increase RNA 330 polymerase processivity (Bauer et al., 2010;Zhou et al., 2016); GC-rich variants 331 are less likely to contain cryptic polyadenylation sites (consensus sequence: 332 AAUAAA) or destabilizing AU-Rich Elements (AREs) (Higgs et al., 1983); and high 333 GC content near the 5' end may also facilitate cytoplasmic localisation of mRNA. 334 GC-rich sequence elements of endogenous unspliced genes were previously 335 shown to route transcripts into the splicing-independent ALREX nuclear export 336 pathway, allowing efficient cytoplasmic accumulation (Palazzo et al., 2007). In 337 agreement with this, low expression caused by inhibitory sequence features 338 (such as low GC-content) can be rescued by extending the mRNA at the 5'end 339 with a GC-rich sequence ( Figure 3D,E, Figure S5). This may act as a 340 compensatory mechanism when gene expression cannot rely on the positive 341 regulatory effects of splicing (Palazzo and Akef, 2012). In contrast, it was 342 recently shown that binding of HNRNPK to the GC-rich SIRLOIN motif leads to 343 nuclear enrichment of lncRNAs (and also some mRNAs) (Lubelsky and Ulitsky, 344 2018 Recent studies also highlight patterns of codon usage as major determinants of 352 RNA stability in yeast (Presnyak et al., 2015), zebrafish (Mishima and Tomari, 353 2016) and other species (Bazzini et al., 2016). The usage of less common, 'non-354 optimal' codons within transcripts was shown to control poly-A tail length and 355 RNA half-life in a translation-dependent manner through the coupled activity of 356 different CCR4-NOT nucleases (Radhakrishnan et al., 2016;Webster et al., 2018). 357 Consistent with these findings, we observed that CAI is positively correlated with 358 mRNA expression levels in human cells. However, it remains to be seen whether 359 the correlation of CAI with mRNA expression depends on translation. Because of 360 the strong correlation between GC content and CAI, it is difficult to disentangle 361 independent contributions of these variables. Additionally, we find that the 362 correlation between GC content (or CAI) and expression is position-and splicing-363 dependent, whereas no evidence for such context-dependence has been reported 364 for the CCR4-NOT-mediated mechanism. 365 366 Other instances in which the effects of codon usage are context-dependent have 367 been described. Most notably, tRNA populations and transcriptome codon usage 368 patterns were shown to differ between mammalian tissues (Dittmar et al., 2006;369 Gingold et al., 2014;Plotkin et al., 2004;Rudolph et al., 2016). Intriguingly, genes 370 preferentially expressed in proliferating cells and tissue-specific genes tend to be 371 AT-rich, whereas genes expressed in differentiated cell types and housekeeping 372 genes are more GC-rich (Gingold et al., 2014;Vinogradov, 2003). Although these 373 differences have been interpreted in terms of the match between codon usage 374 and cellular tRNA pools, it is plausible that translation-independent mechanisms 375 contribute to context-dependent effects of codon usage. Accordingly, in 376 Drosophila, codon optimality determines mRNA stability in whole cell embryos, 377 but not in the nervous system, independent of tRNA abundance (Burow et al., 378 2018). Recently, it was shown that Zinc-finger Antiviral Protein (ZAP) selectively 379 recognises high CpG-containing viral transcripts as a mechanism to distinguish 380 proteins and mechanisms exist for cellular expressed genes. The cell lines used in 382 the present study, HeLa and HEK293, are both rapidly proliferating and 383 experimental results are correlated (r=0.36, Flow-Seq data), but divergent 384 expression of some GFP variants was also observed. Similarly, the effect size of 385 GC content on the expression of endogenously expressed genes varies with cell 386 type. It would be interesting to compare the expression of our variants in other 387 cell types to further address the question of tissue-specific codon usage and 388 adaptation to tRNA pools. 389

Implications for the evolution of protein-coding genes 391
The fact that long, multi-exon genes are often found in GC-poor regions of the 392 genome might result from regional mutation bias. However, an alternative 393 explanation is possible: GC-poor genes may be under selective pressure to retain 394 their introns, and intronless genes may experience selective pressure to increase 395 their GC content. These possibilities are supported by multiple observations: 396 Firstly, endogenous intronless genes are on average more GC-rich than intron-397 containing genes. Secondly, the GC content of functional (but not non-functional) 398 retrogenes is higher compared to their respective intron-containing parental 399 genes, which cannot be explained by a systematic integration bias. Thirdly, in 400 genome-wide analysis, correlations between GC-content and expression are 401 generally more positive (or less negative) for unspliced compared to spliced 402 genes. Taken together, this suggests that for the long-term success of an 403 unspliced gene (i.e. stable conservation of expression and functionality) an 404 increase in GC content is essential. By contrast, splicing allows genes to remain 405 functional even when mutation bias or other mechanisms lead to a decrease of 406 their GC content. 407

Materials and Methods 409
Genes and plasmids 411 The library of 217 synonymous GFP variants used here consists of 138 variants 412 from an earlier study (Kudla et al., 2009), 59 new variants assembled using the 413 same PCR-based method as in (Kudla et al., 2009), and 22 variants that were 414 designed in silico and ordered as synthetic gene fragments (gBlocks) from 415 Integrated DNA Technologies (IDT) (Mittal et al., 2018). Each of the 22 variants 416 was designed by setting a target GC3 content (between 25 and 95%) and 417 randomly replacing each codon with one of its synonymous codons, such that the 418 expected GC3 content at each codon position corresponded to the target GC3 419 content. For example, to design a GFP variant with GC3 content of 25%, each 420 glycine codon was replaced with one of the four synonymous glycine codons 421 with the following probabilities: GGA, 37.5%; GGC, 12.5%, GGG, 12.5%; GGT, 422 37.5%. We also generated 23 mKate2 sequences using an analogous procedure 423 and ordered the variants as gBlocks from IDT. All the genes were cloned into the 424 Gateway Entry vector pGK3 (Kudla et al., 2009).  set of experiments, the same 3 genes were transfected on every plate to account 477 for technical variability. In the screen of individual GFP variants (see Figure 2), 478 GFP measurements were divided by mKate2 measurements from same wells to 479 reduce noise caused by well-to-well variation in transfection efficiency, but 480 similar results were obtained without normalisation. primers. Samples were analysed in triplicate as 20ul reactions, using 2ul of 500 diluted cDNA. Cycling settings: DNA was first denatured for 5min at 95°C before 501 entering a cycle (50-60x) of denaturing for 10sec at 95°C, annealing for 7sec at 502 55-60°C (depending on primers used), extension for 10sec at 72°C and data 503 acquisition. DNA was then gradually heated up by 2.20 °C/s from 65 to 95°C for 504 5sec each and data continuously collected (Melting curve analysis). Data was evaluated using the comparative Ct method (Livak and Schmittgen, 2001). RNA 506 measurements from transient transfection experiments were normalised to the 507 abundance of neomycin RNA, which is expressed from the same plasmid, to 508 control for differences in transfection efficiency (primers 'Neo_F' and 'Neo_R'). 509

Subcellular fractionation 511
This protocol is based on the cellular fractionation protocol published by 512 (Gagnon et al., 2014) but includes a further clean-up step using a sucrose cushion 513 as described by (Zaghlool et al., 2013) and a second lysis step as described by 514 (Wang et al., 2006). Cell lysis and nuclear integrity was monitored throughout by 515 light microscopy following Trypan blue staining (Sigma). Cells were grown in 516 10cm plates for 24h to about 90% confluency. Cells were then washed with PBS 517 and trypsinised briefly using 1ml of 1xTrypsin/EDTA. After stopping the reaction 518 with 5ml DMEM, cells were transferred into 15ml falcon tubes and collected by 519 spinning at 100g for 5min. Resulting cell pellets were resuspended in 500ul ice-520 cold PBS, transferred into 1.5ml reaction tubes and spun at 500g for 5min, 4°C. 521 The supernatant was discarded and cells resuspended in 250ul HLB (10mM Tris 522 (pH 7.5), 10mM NaCl, 3mM MgCl2, 0.5% (v/v) NP40, 10% (v/v) Glycerol, 0.32M 523 sucrose) containing 10% RNase inhibitors (RNasin Plus, Life Tech) by gently 524 vortexing. Samples were then incubated on ice for 10min. After incubation, 525 samples were vortexed gently, spun at 1000g for 3min, 4°C, and supernatants 526 and pellets were processed separately as indicated in a) and b) below. 527 a) Cytoplasmic extract: 528 The supernatant was carefully layered over 250ul of a 1.6M sucrose cushion and 529 spun at 21,000g for 5min. The supernatant was then transferred into a fresh 530 1.5ml tube and 1ml Trizol was added and mixed by vortexing. 531 b) Nuclear extract: 532 The pellets were washed 3 times with HLB containing RNase inhibitors by gently 533 pipetting up and down 10 times followed by a spin at 300g for 2min. After the 534 3rd wash, nuclei were resuspended in 250ul HLB and 25ul (10%) of detergent 535 mix (3.3% (wt/wt) sodium deoxycholate/6.6% (vol/vol) Tween 40) dropwise 536 added while vortexing slowly (600rpm). Nuclei were then incubated for 5min on 537 ice before spinning at 500g for 2min. The supernatant was discarded and pellets resuspended in 1ml Trizol (Ambion) by vortexing. 10ul 0.5M EDTA are added to 539 each nuclear sample in Trizol and tubes heated to 65°C for 10min to disrupt very 540 strong Protein-RNA and DNA-RNA interactions. Tubes were then left to reach 541 room temperature and RNA was extracted following the manufacturer's 542 instructions. 543 544

Transcription inhibition assay 545
HeLa T-Rex Flp-in cell lines were grown to 80-90% confluency in 6 well for 24h 546 before treatment with 500nM Triptolide (Sigma). Cells were harvested at vector, a separate Gateway LR reaction was set-up in a total volume of 45ul using 574 500ng destination vector, 5ul LR Clonase enzyme mix, 38ul of the mixed 217 575 pGK3-GFP plasmids and TE (pH 8). The reactions were incubated at 25C 576 overnight followed by Proteinase K digest (5ul, LR Clonase kit) for 10min at 37C. 577 The total 50ul reaction mix was transformed into 2.5ml highly competent 578 DH5alpha in a 15ml Falcon tube by heat-shocking cells for 2min 30s at 42C, 579 followed by cooling on ice for 3min, before adding 10ml SOC medium and 580 incubating while shaking for 1h at 37C. After incubation, cells were spun down at 581 3000g for 3min and resulting bacterial pellets resuspended in 1ml fresh SOC. were decanted into 15ml tubes and cells collected by spinning 5min at 500g. The 628 supernatant was transferred into fresh 15ml tubes and precipitated using 2 629 volumes of 100% EtOH/0.1 volume Sodium Acetate (pH 5.3) and 10ul Glycoblue 630 (Ambion). Tubes were shaken vigorously for 10s before incubating at -20C for 631 15min, followed by spinning at 3000g for 20min. Resulting pellets were air-632 dried, resuspended in 1ml digest buffer (100mM Tris pH 8.5, 5mM EDTA, 0.2% 633 SDS, 200mM NaCl) and then combined with the respective cell pellet. 10ul RNAse 634 A (Qiagen, 70U) was added and samples gently rotated at 37C. After 1h, 1ul/ml 635 Proteinase K (20mg/ml, Roche) was added to the samples before rotating a 636 further 2h at 55C. Genomic DNA was purified 3 times by using 1 volume Phenol:Chloroform:Isoamyl alcohol (PCI,25:24:1,Sigma). After each addition of 638 PCI, samples were shaken vigorously for 10s before spinning at 3000g for 20min 639 (first extraction) or 5min (all following). The resulting bottom layers including 640 the interphase were removed before each PCI addition. After the last PCI 641 extraction, the upper layer was transferred into a fresh 15ml tube and 1 642 extraction performed using 1 volume chloroform:isoamyl alcohol (CI,24:1, 643 Sigma). After a 5min spin at 3000g, the upper layer was transferred into a fresh 644 15ml tube and DNA precipitated using EtOH/Sodium Acetate as before. After a 645 5min incubation on ice, DNA was collected by spinning for 30min at 3000g. The were visible. 250ul of polysome extraction buffer was then added (1ml 10x RSB 659 + 50ul NP-40 (Sigma) + 9ml H2O + 1 complete mini EDTA-free protease inhibitor 660 pill (Roche)) and lysate passed 5x through a 25G needle avoiding bubble 661 formation. The lysate was then incubated on ice for 10min before spinning 662 10min at 10,000g, 4°C. The supernatant was then transferred into a fresh 1.5ml 663 tube and the RNA concentration estimated by measuring the OD at 260nm. 664 Sucrose gradients (10-45%) containing 20 mM Tris, pH 7.5, 10 mM MgCl2, and 665 100 mM KCl were made using the BioComp gradient master. 100ug of Lysate 666 were loaded on sucrose gradients and spun at 41,000rpm for 2.5h in a Sorvall 667 centrifuge with a SW41Ti rotor. Following centrifugation, gradients were 668 fractionated using a BioComp gradient station model 153 (BioComp 23  669 Instruments, New Brunswick, Canada) by measuring cytosolic RNA at 254 nm 670 and collecting 18 fractions. 671 RNA from all fractions was precipitated using 1 volume of 100% EtOH and 1ul 672 Glycoblue (Ambion), before extracting RNA using the Trizol method (Life 673 Technologies). Equal volumes of RNA of each fraction was run on a 1.3% 674 Agarose/TBE gel to assess the quality of fractionation and RNA integrity. 675 Additionally, equal volumes of RNA of each fraction were used in cDNA synthesis 676 using SuperScript III (ThermoFisher) and 2uM gene-specific primers for GFP 677 ('pcDNA5-UTR_R') and GAPDH ('GAPDH_R') followed by qRT-PCR analysis. For 678 high-throughput sequencing, total RNA from collected fractions was combined in 679 equal volumes into 4 pools (as indicated in Figure 5B; free ribonucleoprotein 680 (RNP) complexes, monosomes, light polysomes (2-4) and heavy polysomes (5+)) 681 before amplicon library preparation (as described below). 682 683

High-throughput library preparation and sequencing 684
Sequencing libraries were generated by PCR using primers specific for GFP 685 amplification (Supplementary Table 2) which carry the required adaptor 686 sequences for paired-end MiSeq sequencing, as well as 6nt indices for library 687 multiplexing. Between 6-10ug of total genomic DNA were used in multiple PCR 688 reactions (200ng per 50ul reaction). All PCRs were performed using Accuprime 689 Pfx (NEB) according to manufacturer's recommendations using 0.4ul Accuprime 690 Pfx Polymerase and 0.3uM of each primer ('PE_PCR_left' and 691 'S_indexX_right_PEPCR'). The cycling conditions were as follows: Initial 692 denaturation at 95C for 2min, followed by 30 cycles of denaturation at 95C for 693 15sec, annealing at 51C for 30sec, extension at 68C for 1min. The final extension 694 was performed at 68C for 2min. After PCR, all reactions of the same template 695 were pooled and 1/3 of the reaction purified using the Qiagen PCR purification 696 kit according to the manufacturer's instructions. DNA was eluted in 50ul H2O. 697 Library size selection was performed using the Invitrogen E-gel system 698 (Clonewell gels, 0.8% agarose) followed by Qiagen MinElute PCR purification. 699 Correct fragment sizes were confirmed and quantified using the Agilent 700 Bioanalyzer 2100 system. 701 For library preparation of RNA samples, 500ng RNA was first converted into 702 cDNA using 2nmol GFP-specific primers ('S_indexX_right_PEPCR') using 703 SuperScript III (Life technologies) according to manufacturer's protocol, using 704 50C as extension temperature. Resulting cDNA was then treated with 1ul 705 RNaseH (NEB) for 20min at 37C, followed by heat inactivation at 65C for 5min. 706 Samples were diluted 1:2.5 before using 2ul as template in PCR for library 707 preparation. A minimum of 8x50ul PCR reactions were set up and pooled for 708 each sample before PCR purification, followed by E-gel purification as described 709 above. 710 High-throughput sequencing was conducted by Edinburgh Genomics (The 711 University of Edinburgh) and Imperial BRC Genomics facility (Imperial College 712 London) using the Illumina MiSeq platform (2x300nt paired-end reads). 713 714

Regression modelling 817
A pseudocount of 0.0001 was added to each measurement of gene expression 818 and, excluding the nuclear export data, these values were then log2-transformed 819 to generate a normal distribution of expression for subsequent analysis. 820 Transcripts with an expression value of 0 were removed from downstream 821 analysis and the resulting distributions used for regression analysis are 822 displayed in Supplementary Figure 6. Transcripts were separated into unspliced 823 and spliced, where splicing was defined as containing more than one exon in the 824 GENCODE transcript model. Expression measurements were then linearly 825 regressed against the GC4 content separately for each class of transcript and the 826 coefficients along with their associated standard errors. These data were then 827 bootstrapped by sampling with replacement and recalculating the regression 828 coefficients for spliced and unspliced transcripts. The 95% confidence interval of 829 these coefficients (discounting the standard error in these estimations) obtained 830 by 1,000 samplings of this type was used to draw the ellipses shown in Figure 6. 831 832

Analysis of GC content variation in the human genome 833
The GRCh38 sequence of the human genome, as well as the corresponding gene 834 annotations (Ensembl release 85), was retrieved from the Ensembl FTP site 835 (Zerbino et al., 2018). The full coding sequences (CDSs) of protein-coding genes 836 were extracted, filtered for quality and clustered into putative paralogous 837 families (see (Savisaar and Hurst, 2016) for full details). For all analyses, a 838 random member was picked from each putative paralogous cluster. In addition, 839 only one transcript isoform (the longest) was considered from each gene. Note 840 that exon rank was always counted from the first exon of the gene, even if it was 841 not coding. For Figure 1C, GC4 was averaged across all sites that were at the 842 same nucleotide distance to the TSS and within an exon of the same rank. For the 843 functional retrocopies analysis, the parent-retrocopy genes derived in (Parmley 844 et al., 2007) were used. Pseudogenic retrocopies were retrieved from 845 RetrogeneDB (Rosikiewicz et al., 2017). Retrocopy annotations were filtered to 846 only leave human genes with a one-to-one ortholog in Macaca mulatta. Next, 847 only ortholog pairs where both the human and the macaque copy were 848 annotated as not having an intact reading frame and where the human copy was 849 annotated as KNOWN_PSEUDOGENE were retained. For the analyses reported in 850 Supplementary Figure 1, the functional retrocopies were also retrieved from 851 RetrogeneDB, as we could not access genomic locations for the (Parmley et al., 852 2007)     replicates + SEM (see also Figure S5). 937 Human RNA and protein expression data were extracted from various databases, 1056 filtered and normalized as described in Supplementary Table 1 and in the