Translational readthrough goes unseen by natural selection

Occasionally during protein synthesis, the ribosome bypasses the stop codon and continues translation to the next in-frame stop codon. This error is called translational readthrough (TR). Earlier research suggests that TR is a relatively common error in several taxa, yet the evolutionary relevance of this translational error is still unclear. By analysing ribosome profiling data, we have conducted species comparisons between yeasts to infer conservation of TR between orthologs. Moreover, we infer the evolutionary rate of error-prone and canonically translated proteins to deduce differential selective pressure. We find that about 40% of error-prone proteins in Schizosaccharomyces pombe do not have any orthologs in Saccharomyces cerevisiae, but that 60% of error-prone proteins in S. pombe are undergoing canonical translation in S. cerevisiae. Error-prone proteins tend to have a higher GC-content in the 3'-UTR, unlike their canonically translated orthologs. We do not find the same trends for GC-content of the CDS. We discuss the role of 3'-UTR and GC-content regarding translational readthrough. Moreover, we find that there is selective pressure neither against nor for TR. We suggest that TR is a near-neutral error that goes unseen by natural selection. We speculate that TR yields neutral protein isoforms that are not being purged. We suggest that isoforms yielded by TR increase proteomic diversity in the cell, which is readily available upon sudden environmental shifts and which therefore may become adaptive.

Author Summary
There is an evolutionary balancing act between adaptation and selection against change. Any system needs to be able to adapt when facing novel environmental conditions. Simultaneously, biological systems are under selection to maintain fitness and thus undergo selection against mutations.
Phenotypic mutations, i.e. translational errors during protein synthesis, have been suggested to play a role in protein evolvability by enabling quick assessment of viable phenotypes and thus quick adaptation. Here we test this hypothesis by inferring the evolutionary rate of proteins prone to a specific case of phenotypic mutation: translational readthrough (TR). By making use of publicly available data from yeasts, we find that TR goes unseen by natural selection and appears to be a neutral event. We suggest that TR isoforms persist as “permissive wallflowers”, which may later become relevant and yield adaptive benefits. This work highlights that stochastic processes are not necessarily under stringent selection but may prevail. In conclusion, we suggest that TR is a neutral non-adaptive process that can yield adaptive benefits.


Introduction
During protein synthesis, ribosomes translate mRNA into proteins. Translation is terminated when the ribosome encounters a stop codon in an mRNA. Occasionally, stop codons are ignored by the ribosome, and translation continues beyond the coding sequence (CDS) into the untranslated region. This process, called translational readthrough (TR), yields a protein chain that is longer than one would predict from the DNA sequence alone. Alterations to the molecular phenotype of a protein that are not encoded in the DNA, but are introduced during transcription or translation, are referred to as phenotypic mutations. Phenotypic mutations are also known as translational and transcriptional errors. The translational and transcriptional error rate is manifold higher than the genotypic error rate [1][2][3], i.e. the error rate during DNA replication, which leads to heritable variation. Phenotypic mutations have been suggested to act as intermediate stepping stones for traits not yet encoded at the DNA level [4]. In the context of two correlated point mutations, which are jointly required for the formation of stabilising interactions in a protein, this has led to the formulation of "the look-ahead effect" [4]. Assuming that only one of the two point mutations is present at the DNA level, and that the second is expressed through a phenotypic mutation, the look-ahead effect hypothesis explores the probability of fixation of a novel phenotype prior to genotypic encoding. If the novel phenotype has a significantly higher fitness than the already encoded protein, both the phenotypic mutation and the first point mutation will become fixed in the population. Whitehead et al. calculated that, once the phenotypic mutation is fixed, the probability of the phenotypic alteration becoming encoded at the DNA level is high.
Accordingly, phenotypic mutations are suggested to enable exploration of protein sequences, which enables quick assessment of viable phenotypes. Quick assessment of viable phenotypes will in turn yield a higher evolvability. Analogously, erroneously translated proteins are predicted to have a high evolutionary rate. Prior analysis of the look-ahead effect was strictly theoretical and computational, while controlling for population size. However, in recent years experiments have indicated that phenotypic mutations may become adaptive. A study in bacteria by Bratulic et al. [5] found that phenotypic mutations are foremost selected to be tolerated rather than purged, which supports the assumption that phenotypic mutations may be tolerated and at some point become adaptive. Studies in fungi provided experimental evidence that phenotypic mutations may not only yield a fitness effect [6][7][8], but also appear prior to gene duplication and then become encoded [6]. In light of these studies, we hypothesize that proteins prone to TR should have a higher evolutionary rate if they are yielding beneficial phenotypes. Previously, we have investigated TR in yeast to identify protein features that correlate with TR rate. We identified high gene expression to be the most prevalent feature, next to highly disordered C-termini [9]. However, whether these features are a consistent trait in the presence of TR remains untested. Here, we wanted to test our previous speculations regarding what features may buffer or coincide with TR, and whether one can trace TR, as a phenotypic mutation, to high evolvability. By analysing publicly available ribosome profiling data, we identify proteins that undergo TR and cluster these proteins as "leaky". Proteins undergoing canonical translation are clustered as "non-leaky".
This is done for both Schizosaccharomyces pombe and Saccharomyces cerevisiae, and we have here analysed structural features of leaky proteins and their orthologs. Furthermore, we treat TR as a phenotypic mutation and

Inferring translational readthrough
We analysed ribosome profiling data for S. cerevisiae [10] and for S. pombe [11]. By analysing expression data of wild-type cells grown in enriched medium from [10] and [11], we inferred translational readthrough (TR) that occurs unrelated to external stress factors. We identified proteins that undergo TR and clustered these proteins into what we refer to as the leaky set. Proteins that undergo canonical translation were clustered as the non-leaky set. The clustering of leaky and non-leaky sets was conducted individually for each investigated species, as in [9], and is described in detail in Materials and Methods. We found 80 leaky proteins and 3200 non-leaky proteins for S. cerevisiae. For S. pombe, we identified 240 leaky proteins and 2830 non-leaky proteins. We compared our sets to the data from our previous inference of TR in S. cerevisiae [9]. We found an overlap of 30 proteins between the leaky sets. No leaky protein from either inference was found in the non-leaky set of the other. We infer that the incomplete overlap between the leaky sets is due to variance in gene expression across the ribosome profiling studies.

Ortholog comparisons
We wanted to investigate TR across species. The aim was primarily to explore whether leaky proteins have orthologs, and whether such orthologous proteins are leaky too. A consistent pattern of TR across orthologs could indicate that TR is an intrinsic feature of the protein-coding sequence. Alternatively, if leaky proteins have non-leaky orthologs, comparative analyses of sequence features may reveal what enables or prevents TR.

Moreover, identifying sequence features may elucidate the evolutionary path between error-prone and canonical translation. In order to analyse homologs of leaky proteins, we inferred orthologs between S. cerevisiae and S. pombe, as ribosome profiling data are accessible for both species and both have well-annotated genomes. We compared leaky and non-leaky orthologs between S. cerevisiae and S. pombe. We find that only 3% of leaky proteins in S. pombe have a leaky ortholog in S. cerevisiae. In total, 58% of leaky proteins and 64% of non-leaky proteins in S. pombe have at least one non-leaky ortholog in S. cerevisiae. This suggests that TR is not conserved, at least not between these two species. With a few exceptions (see Table S1), the remaining leaky proteins in S. pombe do not have any orthologs in S. cerevisiae.

In a previous study in yeasts, Yanagida et al. [6] found that TR yielded functional protein isoforms, which maintained organismal fitness under stressed conditions. These isoforms were near-identical to paralogs in closely related species that had undergone whole genome duplication (WGD). Yanagida et al. [6] suggested that TR yields functional isoforms prior to these being encoded in the DNA, in support of the look-ahead effect hypothesis. Given the study of Yanagida et al. [6], we wondered whether we would find functionally encoded 3'-UTRs in S. pombe, which did not undergo a WGD. In order to infer whether leaky proteins of S. pombe yield an isoform similar to their non-leaky ortholog in S. cerevisiae, we conducted two kinds of alignments. First, the protein sequences of the non-leaky orthologs in S. cerevisiae were aligned to the CDS of their respective leaky orthologs in S. pombe. We used an exhaustive bestfit model in exonerate [12] to align the sequences (see Materials and Methods).

Secondly, we conducted the same alignment, but with the 3'-UTR appended to the CDS of the leaky sequence. This allowed us to infer whether the alignment score of orthologs genes (27%). In other words, a portion of the isoforms yielded by TR in S. pombe is encoded in the CDS of S. cerevisiae. This result may imply that an ancestral stop codon was lost in S. cerevisiae and that the sequences have undergone convergent evolution [13][14][15].

Alternatively, a stop codon may have been introduced in S. pombe, and the 3'-UTR has been under purifying selection in S. pombe. Regardless of the evolutionary path responsible for the homology between the given 3'-UTRs and CDS termini, the homology is only present in a subset of genes. The majority of ortholog alignments do not reveal 3'-UTR homology to the CDS, which suggests that 3'-UTRs are generally not conserved past species divergence. We therefore find it unlikely that the given subset of isoforms would be homologous by chance. We speculate that the isoforms in question may be both functional and under selection. Moreover, our results suggest that the finding of Yanagida et al. [6] is probably not an isolated case restricted to the particular protein they investigated.
In conclusion, we have demonstrated that TR is not conserved across orthologs in these two species, and also that proteins undergoing TR are not purged.

investigate gene expression for a protein in S. pombe, we subtracted the gene expression of the respective ortholog from S. cerevisiae. As such, we retrieved the difference, for the investigated features, between 1-to-1 orthologous pairs. From analysing leaky and non-leaky sets within same-species analyses, we find that TR rate correlates most strongly with gene expression, codon usage (CAI) and translational efficiency, whereas TR correlates with GC-content only in S. cerevisiae (see Fig S1 and Table S2). We also find that these correlations disappear when comparing leaky and non-leaky orthologs, but that 3'-UTR GC-content remains a significantly differentiated feature between sets (see Fig S2, Table S4 and S5). Comparisons of leaky and non-leaky genes, both within species and between orthologs, support the notion that leaky genes have an overall higher 3'-UTR GC-content than non-leaky genes (see Fig S3). These results beg the question of whether and how the GC-content of the 3'-UTR is associated with TR. The connection between expression and GC-content could be an indication of mRNA stability, which was previously explored in the context of TR, but without verification [9].
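One simple way in which 3'-UTR GC-content could bear on TR is through the chance frequency of in-frame stop codons, since all three stop codons (TAA, TAG, TGA) are AT-rich. The sketch below is only an illustration and not part of the analyses: it assumes that nucleotides are independent, that G and C are equally frequent, and that A and T are equally frequent. The function names are hypothetical.

```python
def stop_codon_probability(gc: float) -> float:
    """Chance that a random codon is TAA, TAG or TGA at a given GC fraction.

    Assumes independent positions with P(G) = P(C) = gc / 2 and
    P(A) = P(T) = (1 - gc) / 2 (an illustrative simplification).
    """
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be between 0 and 1")
    at_base = (1.0 - gc) / 2.0  # probability of A, and likewise of T
    gc_base = gc / 2.0          # probability of G, and likewise of C
    taa = at_base ** 3
    tag = at_base ** 2 * gc_base
    tga = at_base ** 2 * gc_base
    return taa + tag + tga


def expected_extension_codons(gc: float) -> float:
    """Mean number of codons read up to and including the first chance stop."""
    p = stop_codon_probability(gc)
    if p == 0.0:
        return float("inf")  # no A or T available, so no stop codon can occur
    return 1.0 / p


if __name__ == "__main__":
    for gc in (0.3, 0.4, 0.5):
        print(gc, stop_codon_probability(gc), expected_extension_codons(gc))
```

Under these assumptions the chance of a stop codon falls as GC-content rises, so a GC-rich 3'-UTR would on average allow a longer extension before the next in-frame stop codon. Real 3'-UTRs are not random sequences, so this is a back-of-the-envelope argument only.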

High GC-content in the 3'-UTR would by chance lower the probability of random stop codons, and allow translation to continue once the initial stop codon is bypassed. In most species, the mutational bias tends to be towards GC sites mutating into TA sites [16]. However, Long et al. [16] found that there is also a substantial contribution to GC composition by either natural selection or biased gene conversion, possibly both. Biased gene conversion that elevates GC-content has previously been suggested to drive the emergence of evolutionary novelty in yeasts [17]. However, according to a study that investigated mutation bias in S. pombe and S. cerevisiae, biased gene conversion does not contribute to the nucleotide composition in S. pombe [18]. In light of this, we cannot exclude that the difference in 3'-UTR GC-content between leaky and non-leaky proteins may be a result of selection. However, it is ambiguous whether the GC-content of the 3'-UTR would be a cause or a consequence of TR. There is an upper and a lower threshold for the tolerated load of phenotypic mutations, where the lower threshold is suggested to be regulated by translational cost-efficiency [1]. Prior to eliminating mutations, selection may primarily act to minimize deleterious effects of phenotypic mutations, as previously found by [5]. Codons that are GC-rich tend to encode amino acids that are disorder-promoting [19] and less prone to aggregate, which may make the extended isoform produced by TR effectively neutral [9].

non-leaky proteins from S. cerevisiae, and compared the two sets for selective pressure. We inferred polymorphism data for S. cerevisiae by making use of publicly available data [21] (1102 yeast project). Our focal gene was the lab strain of S. cerevisiae, and S. paradoxus was used as an outgroup to infer diverging polymorphism (see Materials and Methods).

Initially, we had 80 genes in the leaky set and 3200 genes in the non-leaky set. After excluding some proteins due to insufficient annotation (see Materials and Methods), we retained 64 leaky genes. We found that the majority of leaky proteins undergo purifying selection, but so do genes from the non-leaky set (see Fig 1). Moreover, the fraction of proteins undergoing positive selection did not differ between the two sets (see Table 1), and we cannot reject the null hypothesis. We investigated whether any features besides TR rate correlated with purifying selection. We find that gene expression is the strongest correlate of evolutionary rate (see Fig S4), which is in accordance with previous studies [22][23][24]. When we exclusively investigated the leaky set (see Fig S5), we saw the same trend. Complete data for the MKT can be found in S2File.txt.

Whether a protein undergoes neutral, positive, or purifying selection seems to be unrelated to TR. As TR is selected neither for nor against, we find that TR is neutral with respect to selective pressure. However, the fact that TR appears to be neutral may

Beyond inferring differential evolutionary rate by comparing the sets, we speculated that one may see a diverging trend by comparing leaky and non-leaky homologous pairs. Ohno suggested that gene duplication may enable genes to evolve a new function, while the native function is maintained by a gene copy [25]. With two gene copies, one copy can be exploratory without the cellular system losing the vital gene function [25]. Since function is strongly related to the structure of the encoded protein, it can also be assumed that the initial protein fold would be maintained in either of the two copies (i.e. paralogs). Accordingly, the leaky gene would be free to explore the phenotypic landscape while a non-leaky paralog would maintain the biological function. In such a scenario, we hypothesized that a high proportion of leaky proteins would have non-leaky paralogs. Paralogs were retrieved for the leaky and non-leaky sets for S. cerevisiae, using annotations provided by SGD (see Materials and Methods). From the leaky set, 35 genes (39%) have an annotated paralog. In the non-leaky set, 594 genes (17%) have an annotated paralog. Amongst the non-leaky set, the majority of genes (57%) have a non-leaky paralog and only a minority (2%) have a leaky paralog (see Table 2). As the majority of paralogs are unassigned for the leaky set, it is not possible to deduce whether the paralogs of the leaky set are primarily non-leaky or leaky. The current data suggest that a higher fraction of paralogous pairs are found within the non-leaky set. However, the results are biased by the fact that the non-leaky set contains more genes.
In conclusion, there is no support for the idea that leaky genes are exploratory under the assurance of a non-leaky gene copy.

NL stands for the non-leaky set, whereas L stands for the leaky set. Unassigned denotes genes for which we did not know whether they were leaky or non-leaky, as they were not part of our expressed data set.

In order to infer the evolutionary role of TR, we have here conducted comparative analyses between proteins that undergo TR and canonically translated proteins. Our inter-species comparison revealed that TR is not conserved between orthologs, and that genes are not purged on account of TR. Inference of paralogs in S. cerevisiae suggests that TR occurs without assurance of functional maintenance by a non-leaky paralog.

Moreover, inference of selective pressure in S. cerevisiae suggests that there is no difference between leaky and non-leaky proteins. Our prediction from the look-ahead effect, that leaky proteins would have an elevated evolutionary rate, was not confirmed.

Ultimately, our results suggest that TR is a neutral event that prevails without natural selection acting upon it. Taken together with the inference of presence and rate of TR in S. cerevisiae and S. pombe, our analyses suggest that TR is continuously present at a low rate and not purged. This is in accordance with other studies that find continuous TR to be heterogeneously present [26].

Recently, TR has been suggested to be non-adaptive as its presence appears stochastic [27]. However, biological heterogeneity and stochastic processes are raw material for novel adaptations [28]. Moreover, the lack of strong adaptation is not an argument against evolutionary relevance. Neutrally evolving proteins have previously been suggested to provide a basis for evolutionary novelty [29]. We suggest that drift-barrier hypothesis [30]. Any phenotypic mutation is initially caused by imperfect purging by natural selection. However, damage control is necessary once TR is present in the cellular environment. Yet, damage control is not needed if the yielded isoform is neutral or near-neutral, and TR can thus persist. However, deleterious isoforms are less likely to persist, as misfolded proteins are degraded by the proteasome [31]. Given the abundant findings of TR, by us and others [7,8,32,33,34], it is doubtful that TR is deleterious, as the proteasome would then constantly degrade TR isoforms at a high cellular energy cost. Rather than direct purging of deleterious mutations, mitigation of deleterious mutations has been found to be the first evolutionary response [5]. Buffering deleterious isoforms by making them neutral would allow TR to persist and explore functional interactions [9]. We suggest that TR-yielded isoforms persist, either because they are initially neutral or because they have become neutral through selection to mitigate detrimental effects.
Like wallflowers, TR-yielded isoforms go unseen and persist. Moreover, in alternating cellular environments, unseen "wallflowers" may become relevant. We found that a subset of TR isoforms in S. pombe are homologous to functional encoded proteins in S. cerevisiae. This is in accordance with studies that suggest that TR is biologically functional and not harmful [7,8,32,33,34,35].

Overall transcriptional heterogeneity has been found to be beneficial in stressful conditions [36,37], and has been suggested to be an adaptive trait under selection [38]. Protein synthesis has also been found to be heterogeneous and noisy [8,26], but also harmful [39]. However, TR has been found to yield positive fitness effects for microorganisms under stress, and to be plainly neutral in the absence of stress [6,26]. We suggest that TR-yielded isoforms enrich proteomic diversity by offering slight variants of encoded proteins that are readily available upon sudden environmental shifts, as has been suggested previously by others [8,34]. However, the evolutionary trajectory by which phenotypic mutations like TR would move from noise to adaptation remains unclear.

November 12, 2019

To investigate whether there is an adaptive advantage of TR overall and whether noise from phenotypic mutations elevates protein evolvability, we believe one needs to infer the effect over shorter time scales than we have done here. By comparing error-prone strains to canonical strains, in both "stressful" and "controlled" environments, one should be able to test whether phenotypic mutations allow for rapid adaptation.

In conclusion, we find that phenotypic mutations yielded by TR do not yield rapid adaptation or increase protein evolvability. We suggest that TR goes unseen by natural selection, like permissive "wallflowers". Given that a subset of TR isoforms are highly homologous to encoded orthologs, we speculate that TR isoforms may be, or may become, adaptive. However, expanded methodology is needed to infer whether 3'-UTRs undergo selection for error mitigation as a consequence of TR. Future studies should also infer experimentally whether phenotypic mutations are adaptive, but on short evolutionary time frames.

Handling of ribosome profiling data
The ribosome profiling data for S. cerevisiae were obtained from [10]. The data are accessible at the NCBI GEO database under accession ID GSE52119, where the individual files have the accession IDs SRR948553, SRR948555, SRR948552 and SRR948551. Ribosome profiling data for S. pombe were obtained from [11], and were accessed at the NCBI GEO database under accession ID GSE98934. The individual files have the accession IDs SRR5564114 and SRR5564124.

Prior to alignment to the genome, the reads were trimmed and aligned to rRNA. Reads that did not align to rRNA were aligned to the genome with bowtie [40] (-S -y -a -m 1 --best --strata -p 22). When aligning reads to the 3'-UTR, we used extra stringent alignment in which multimapping was not allowed and only one mismatch was allowed (-S -y -a -m 1 --best --strata -n 0 -e 1 -p 22). To sort the output we used samtools [41,42].

Only expressed genes that have an annotated 3'-UTR were included in further analyses.

Annotations by Yassour et al. [45] were used for mapping reads to the 3'-UTR. HTSeq [46] was used to retrieve the number of reads mapped to genes and their respective 3'-UTRs, using the strict mode that excludes overlapping reads. Genes that consistently showed translational readthrough (TR) in all replicates were grouped as "leaky genes". Genes displaying TR in some but not all replicates were grouped as "semi-leaky genes". Genes with an annotated 3'-UTR without any read counts, consistently between replicates, were grouped as "non-leaky genes".
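The grouping rule described above can be sketched as follows. This is a minimal illustration, not the script used in the study: it assumes one 3'-UTR read count per replicate and treats any count above zero as evidence of TR, leaving out the additional filters applied elsewhere in the pipeline. The function names and the example gene names are hypothetical.

```python
from typing import Dict, List


def classify_gene(utr_counts: List[int]) -> str:
    """Assign a gene to a set from its 3'-UTR read counts (one per replicate)."""
    if not utr_counts:
        raise ValueError("need at least one replicate")
    replicates_with_reads = sum(1 for count in utr_counts if count > 0)
    if replicates_with_reads == len(utr_counts):
        return "leaky"       # 3'-UTR reads in every replicate
    if replicates_with_reads == 0:
        return "non-leaky"   # no 3'-UTR reads in any replicate
    return "semi-leaky"      # reads in some, but not all, replicates


def classify_all(counts: Dict[str, List[int]]) -> Dict[str, str]:
    """Classify every gene in a mapping of gene name to per-replicate counts."""
    return {gene: classify_gene(reps) for gene, reps in counts.items()}


if __name__ == "__main__":
    example = {"geneA": [12, 9], "geneB": [0, 0], "geneC": [7, 0]}
    print(classify_all(example))
```

Requiring agreement across all replicates keeps the leaky and non-leaky sets conservative, at the price of moving inconsistent genes into the semi-leaky group.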

Reads mapped to the 3'-UTR can indicate continued translation of the mRNA beyond the first stop codon, but these reads can also be mere noise. Several measures were taken to ensure that reads mapped to the 3'-UTR were justifiably counted as translational readthrough (TR). Firstly, annotated 3'-UTRs that overlap with a gene on the same strand were excluded. Secondly, 3'-UTRs shorter than 30 nucleotides were excluded, as they introduce high stochasticity when calculating coverage.

Before TR rate was estimated, an initial threshold was set requiring at least 5 reads mapped to the 3'-UTR in each replicate. This is a common lower threshold when considering gene expression [47]. TR rate was calculated as follows. The sequence hit count (obtained by HTSeq) was normalised by read length and sequence length, as done by [48]. The normalised hit count for the 3'-UTR was divided by the normalised hit count of the protein-coding sequence (CDS), yielding the relative expression of the 3'-UTR. Genes displaying spurious translation, with a relative expression of one or above, were excluded. A relative expression near or above one effectively implies that the 3'-UTR is expressed as highly as the CDS.
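The TR rate calculation can be illustrated with the short sketch below. It is not the original script. The exact form of the normalisation is an assumption here (read count multiplied by read length and divided by sequence length), as is the default read length of 30 nucleotides; the function names are hypothetical. Because the same read length is used for both regions, it cancels in the ratio.

```python
from typing import Optional

MIN_UTR_READS = 5      # lower threshold stated in the text
MAX_RELATIVE_EXPR = 1  # genes at or above this ratio are treated as spurious


def normalised_coverage(read_count: int, seq_length: int, read_length: int = 30) -> float:
    """Read count scaled by read length and sequence length (assumed form)."""
    if seq_length <= 0:
        raise ValueError("sequence length must be positive")
    return read_count * read_length / seq_length


def tr_rate(utr_reads: int, utr_length: int,
            cds_reads: int, cds_length: int,
            read_length: int = 30) -> Optional[float]:
    """Relative expression of the 3'-UTR versus the CDS for one replicate.

    Returns None when the gene fails the read threshold, has no CDS reads,
    or shows a ratio of one or above.
    """
    if utr_reads < MIN_UTR_READS or cds_reads <= 0:
        return None
    utr_cov = normalised_coverage(utr_reads, utr_length, read_length)
    cds_cov = normalised_coverage(cds_reads, cds_length, read_length)
    rate = utr_cov / cds_cov
    if rate >= MAX_RELATIVE_EXPR:
        return None
    return rate


if __name__ == "__main__":
    # 10 reads on a 100 nt 3'-UTR versus 1000 reads on a 1000 nt CDS.
    print(tr_rate(10, 100, 1000, 1000))
```

Returning None for filtered genes, rather than zero, keeps excluded genes distinguishable from genes with a genuinely low TR rate.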

To ensure that the reads accurately indicated TR, we controlled for background noise and checked that the TR followed the appropriate open reading frame (ORF). We estimated background noise by quantifying the coverage of riboreads that aligned to tRNA, using the same stringent alignment criteria as for the 3'-UTR. tRNA is not translated by the ribosome. We therefore interpret riboreads aligned to tRNA as noise, caused either by ribosomes that spuriously land on RNA or by imperfect alignment. By dividing the read count by sequence length we retrieved the normalised coverage for tRNAs. The highest value among the replicates, not the mean of the replicates, was used as a threshold for noise: all genes that had a read coverage in the 3'-UTR equal to or lower than the tRNA coverage (our threshold) were excluded from our analyses. Lastly, we checked that the indicated TR followed the appropriate ORF. Using an in-house script, we checked that the reads aligned with the ORF up until the next in-frame stop codon in the 3'-UTR. If the coverage beyond the first stop codon encountered in the 3'-UTR was equal to or higher than the coverage before it, the gene was dismissed from further analyses as ambiguous.
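The two controls described above can be sketched as follows. This is an illustration under stated assumptions, not the in-house script: the 3'-UTR sequence is assumed to start directly after the annotated stop codon, so that position 0 is in frame with the CDS, and coverage values are assumed to be already normalised by sequence length. All function names are hypothetical.

```python
from typing import List, Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}


def next_in_frame_stop(utr_seq: str) -> Optional[int]:
    """Return the 0-based offset of the first in-frame stop codon in a 3'-UTR.

    Returns None if the 3'-UTR contains no in-frame stop codon.
    """
    seq = utr_seq.upper().replace("U", "T")
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def noise_threshold(trna_coverage_per_replicate: List[float]) -> float:
    """Use the highest tRNA coverage among replicates as the noise threshold."""
    return max(trna_coverage_per_replicate)


def passes_noise_filter(utr_coverage: float, threshold: float) -> bool:
    """Keep a gene only if its 3'-UTR coverage is strictly above the threshold."""
    return utr_coverage > threshold


if __name__ == "__main__":
    utr = "GCTTAAGGGTGA"
    print(next_in_frame_stop(utr))
    threshold = noise_threshold([0.01, 0.03])
    print(passes_noise_filter(0.05, threshold))
```

Taking the maximum rather than the mean tRNA coverage makes the filter stricter, in line with the choice described in the text.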

Protein feature analyses
For each replicate of both footprints and RNA-seq, gene expression was calculated as Transcripts Per Million (TPM) and then checked for significant differences in distribution by a Kolmogorov-Smirnov test, which was non-significant. Translational efficiency (TE) was calculated as described by Ingolia et al. [49], dividing the TPM of the ribosome profiling reads by the TPM of the RNA-seq reads. Sequence length was measured in nucleotides of the CDS (not including UTR).

We used the IUPred short algorithm to predict intrinsic disorder in the protein sequences based on the frequency of disorder-promoting amino acids [50], which uses 0.5 as the threshold for a sequence to be disordered.

We analysed codon usage for the proteins within all three sets. We made use of

McDonald-Kreitman test analysis
We inferred polymorphism data for S. cerevisiae by making use of publicly available data from [21] (1102 yeast project). The allele sequences were aligned using mafft v7.397 [52,53]. Using an in-house script written in Python 2.3, we retrieved synonymous (Ps) and non-synonymous (Pn) codon substitutions within species. By aligning the focal gene of S. cerevisiae with its ortholog from S. paradoxus, we then inferred synonymous (Ds) and non-synonymous (Dn) substitutions between species. Data for S. paradoxus were retrieved from [54]. Our focal gene was the lab strain YGD of S. cerevisiae, with annotations from the Saccharomyces Genome Database (SGD) [44].
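Counting synonymous and non-synonymous differences between two aligned coding sequences can be sketched as below. This is a simplified illustration, not the in-house script: it classifies each differing codon as a whole, using the standard genetic code, and does not resolve codons that differ at more than one position into single-site steps. Codons containing gaps or ambiguous nucleotides are skipped, which is an assumption made here for illustration. The function name is hypothetical.

```python
from itertools import product
from typing import Tuple

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(seq_a: str, seq_b: str) -> Tuple[int, int]:
    """Return (synonymous, non_synonymous) codon differences.

    Both sequences must be in frame and aligned to the same length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    synonymous = 0
    non_synonymous = 0
    for i in range(0, len(seq_a) - 2, 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        # Skip codons with gaps or ambiguous nucleotides (not in the table).
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            non_synonymous += 1
    return synonymous, non_synonymous


if __name__ == "__main__":
    # AAA -> AAG keeps lysine; TTT -> TTA changes phenylalanine to leucine.
    print(count_codon_differences("ATGAAATTT", "ATGAAGTTA"))
```

Applied within species the two counts correspond to Ps and Pn, and applied between the focal sequence and the outgroup they correspond to Ds and Dn.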

Some of the sequences from the 1102 yeast project have ambiguous nucleotides (e.g. X instead of A, G, T, C), and distinguishing synonymous from non-synonymous substitutions was therefore not possible. These sequences were excluded from further analyses, which diminished our data set.

We applied two McDonald-Kreitman tests. An extension of the original McDonald-Kreitman test [20] is inference of the proportion of residues under selection, known as alpha [55]. We applied Fisher's exact test to infer the significance of alpha. The significant data is what we report on with respect to protein undergoing test, which is more robust to sampling bias, is Direction of Selection (DoS) test [56].

Both of these tests indicate that the sets do not differ with respect to selection pressure. The resulting values from these analyses can be found in S1File.csv.
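For reference, the quantities used in the two tests can be computed from the four counts Dn, Ds, Pn and Ps as sketched below. This is an illustration using the commonly used definitions, alpha = 1 - (Ds · Pn) / (Dn · Ps) and DoS = Dn / (Dn + Ds) - Pn / (Pn + Ps), together with a plain two-sided Fisher's exact test on the 2x2 table. It is not the script used in the study; the function names and the example counts are hypothetical.

```python
from math import comb
from typing import Optional


def alpha(dn: int, ds: int, pn: int, ps: int) -> Optional[float]:
    """Estimate alpha; None when the estimate is undefined."""
    if dn == 0 or ps == 0:
        return None
    return 1.0 - (ds * pn) / (dn * ps)


def direction_of_selection(dn: int, ds: int, pn: int, ps: int) -> Optional[float]:
    """Direction of Selection: positive values point towards adaptive evolution."""
    if dn + ds == 0 or pn + ps == 0:
        return None
    return dn / (dn + ds) - pn / (pn + ps)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Unnormalised hypergeometric weights for every table with the same margins.
    weights = {
        x: comb(row1, x) * comb(row2, col1 - x)
        for x in range(lowest, highest + 1)
    }
    observed = weights[a]
    # Sum all tables that are at most as probable as the observed one.
    extreme = sum(w for w in weights.values() if w <= observed)
    return extreme / comb(row1 + row2, col1)


if __name__ == "__main__":
    dn, ds, pn, ps = 7, 17, 2, 42  # illustrative counts only
    print(alpha(dn, ds, pn, ps))
    print(direction_of_selection(dn, ds, pn, ps))
    print(fisher_exact_two_sided(dn, ds, pn, ps))
```

DoS stays defined when single cells of the table are zero, which is one reason it is considered more robust than alpha for genes with few substitutions.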

Alignments of orthologs
We used curated annotation of orthologs between S. cerevisiae and S. pombe [57], retrieved from PomBase [58]. We aligned the leaky proteins in S. pombe with their non-leaky orthologs in S. cerevisiae, using exonerate version 2.4.7 [12]. We used the protein2dna model with the "exhaustive" and "bestfit" parameters. All other parameters were set to default. The non-leaky protein sequence from S. cerevisiae was used as the query sequence, and the respective orthologous nucleotide sequence from S. pombe as the target sequence.

Retrieval of paralogs
We accessed curated paralogs from the Saccharomyces Genome Database [43] by using yeastmine [59]. Accessing yeastmine through intermine [60,61], the paralogs of proteins in the leaky and non-leaky sets were retrieved.

Percentage of leaky and non-leaky proteins in S. pombe that have orthologs in S. cerevisiae that are either leaky (L) or non-leaky (NL). For example, 35% of leaky proteins of S. pombe have exactly one non-leaky ortholog, while 23% of leaky S. pombe proteins have more than one non-leaky ortholog in S. cerevisiae. In total, 58% of leaky proteins in S. pombe have one or more non-leaky orthologs, while only 3% have a leaky ortholog in S. cerevisiae.
Protein feature analysis
Unless stated otherwise, the features are data for the CDS or full protein sequence. 'TR rate' stands for translational readthrough rate, 'TPM' (Transcripts Per Million) stands for gene expression, 'GC' stands for GC-content, 'disorder' stands for intrinsic disorder, 'TE' stands for Translational Efficiency, 'CAI' stands for Codon Adaptation Index and 'CAI end' represents CAI data only for the last 30 nucleotides of the CDS.

Data include differences between non-leaky orthologs and leaky versus non-leaky orthologs. Asterisk (*) includes data only for the difference between leaky proteins in S. pombe and their S. cerevisiae non-leaky orthologs. Unless stated otherwise, the features are data for the CDS or full protein sequence. 'TPM' is gene expression (Transcripts Per Million), 'TE' is Translational Efficiency, 'CAI' is Codon Adaptation Index, 'CAI end' is CAI data estimated for the last 30 nucleotides of the CDS.