Readthrough errors purge deleterious cryptic sequences, facilitating the birth of coding sequences

De novo protein-coding innovations sometimes emerge from ancestrally non-coding DNA, despite the expectation that translating random sequences is overwhelmingly likely to be deleterious. The “pre-adapting selection” hypothesis claims that emergence is facilitated by prior, low-level translation of non-coding sequences via molecular errors. It predicts that selection on polypeptides translated only in error is strong enough to matter, and is strongest when erroneous expression is high. To test this hypothesis, we examined non-coding sequences located downstream of stop codons (i.e. those potentially translated by readthrough errors) in Saccharomyces cerevisiae genes. We identified a class of ‘fragile’ proteins under strong selection to reduce readthrough, which are unlikely substrates for co-option. Among the remainder, sequences showing evidence of readthrough translation, as assessed by ribosome profiling, encoded C-terminal extensions with higher intrinsic structural disorder, supporting the pre-adapting selection hypothesis. The cryptic sequences beyond the stop codon, rather than spillover effects from the regular C-termini, are primarily responsible for the higher disorder. Results are robust to controlling for the fact that stronger selection also reduces the length of C-terminal extensions. These findings indicate that selection acts on 3’ UTRs in S. cerevisiae to purge potentially deleterious variants of cryptic polypeptides, acting more strongly in genes that experience more readthrough errors.


Introduction
One of the more surprising recent developments in molecular evolution is that adaptive protein-coding innovations sometimes arise out of DNA sequences that were previously non-coding. Sometimes only part of a protein arises in this way, via new or expanded coding exons (Sorek 2007), or via annexation of 3′ untranslated regions (UTRs) (Giacomelli, et al. 2007;Vakhrusheva, et al. 2011;Andreatta, et al. 2015) or 5′ UTRs (Wilder, et al. 2009) into an ORF. More dramatically, completely new protein-coding genes can also arise de novo (reviewed in McLysaght and Guerzoni 2015;Van Oss and Carvunis 2019). This contradicts the classic stance that it is implausible under modern conditions of life for a useful protein to appear de novo from a random sequence of nucleotides with no history of selection (Zuckerkandl 1975;Jacob 1977).
As an example of how important a history of selection is for protein function, consider amino acid composition. Hydrophilic residues promote intrinsic structural disorder (ISD), i.e. a propensity to avoid stable conformational structures and instead exist as a dynamic ensemble of many conformational substates separated by low energy barriers (Guharoy, et al. 2015). Disordered regions are a key component of many proteins and perform important functions such as scaffolding and DNA binding (Uversky 2013;Habchi, et al. 2014;Guharoy, et al. 2015). Disordered proteins are also less prone to toxic aggregation (Linding, et al. 2004;Angyan, et al. 2012). Proteins are more disordered than would be expected from the translation of intergenic (Wilson, et al. 2017) or frameshifted (Willis and Masel 2018) sequences, as a consequence of a history of selection on their protein products.
The de novo evolution of a protein-coding sequence occurs when a "co-option" mutation converts a non-coding sequence into a coding sequence, e.g. a point mutation destroys a stop codon, or a mutation makes an intergenic ORF more prone to translation. Prior to such a co-option mutation taking place, we call the hypothetical gene product a "potential polypeptide." Naïvely, we do not expect potential polypeptides to have been previously exposed to the selection needed to purge dangerous (e.g. excessively hydrophobic) sequence variants. However, this is not a foregone conclusion, because potential polypeptides are translated at low levels, as a consequence of widespread spurious transcription (Clark, et al. 2011;Tisseur, et al. 2011;Palazzo and Lee 2015;Neme and Tautz 2016;Blevins, et al. 2019) and translation (Wilson and Masel 2011;Ruiz-Orera, et al. 2014;Ji, et al. 2015;Ruiz-Orera, et al. 2018;Blevins, et al. 2019;Durand, et al. 2019).
The products of this spurious translation of non-coding sequences provide a preview of the effects of possible future co-option mutations. To show how, consider the case of reading beyond a stop codon.
When a ribosome skips a stop codon, the same sequence beyond the stop codon is translated as would become the norm were a mutation to destroy or frameshift that stop codon. When this 3′ UTR sequence contains a backup stop codon, a stop codon readthrough error and a stop codon mutation will append the same C-terminal extension to a protein. If no backup stop codon exists, both will trigger non-stop mediated mRNA decay (Vasudevan, et al. 2002). If a backup stop codon exists, but the extension sequence is unfavorable, the protein may be targeted for degradation (Arribere, et al. 2016). In all three cases, we refer to the sequence beyond the stop codon as a "cryptic sequence." The cryptic sequence encodes both the consequences of expressing the potential polypeptide via readthrough, and the consequences of a future stop codon loss mutation. Normally, we consider selection as always occurring after mutation. But because a stop codon readthrough error previews a future stop codon mutation, selection on the consequences of a mutation can act before mutation occurs.
This similarity between the consequences of a gene expression "error" in the present and a mutation in the future is not restricted to stop codon loss. For example, splicing errors/splice site mutations can include an intron in an existing ORF, and near-cognate start codons can easily mutate to become constitutive start codons. More broadly, mutations might also convert a promiscuous enzymatic or binding interaction into a constitutive one (Khersonsky and Tawfik 2010;Pal and Papp 2017). While we focus here on stop codon readthrough, the principles we illustrate are broadly applicable.
We expect the distribution of fitness effects (DFE) of errors to be a dampened version of the corresponding DFE of new co-option mutations, because translation errors lead to the same expression or degradation events, at lower levels. In general, the DFE of new mutations is bimodal, with strongly deleterious mutations forming one mode and relatively benign mutations forming the other (Keightley and Eyre-Walker 2007;Lind, et al. 2017). Intuitively, these two modes correspond, in the case of changes to an existing protein, to mutations that break their associated protein and are strongly deleterious, versus those that make a minor tweak and are relatively "benign." We consider potential polypeptides to be "benign" or "deleterious" according to the DFE mode of their corresponding co-option mutation. In order for a co-option mutation to contribute to evolutionary innovation, it must not cause too much collateral harm, i.e. it must co-opt a benign potential polypeptide (Masel 2006;Rajon and Masel 2011). Some proteins may be resilient to mutations adding amino acids to their C-terminus, while other proteins may be "fragile," corresponding to significant vs. negligible frequencies for the benign mode of the DFE of possible C-terminal extensions. For example, proteins with C-terminal ends that are ordered and buried within the folds of the protein structure, making an extension likely to disrupt folding, may be particularly fragile to stop codon mutations.
If the DFE of translation errors is a dampened version of the DFE of new co-option mutations, then erroneous translation of benign potential polypeptides will be effectively neutral, and translation of strongly deleterious potential polypeptides will be only weakly deleterious, with the magnitude of deleterious-ness dependent on the error rate (Masel 2006;Rajon and Masel 2011). A mutation to a cryptic sequence that converts its erroneous expression from being benign to being deleterious will be effectively purged if and only if the deleterious effect of the corresponding potential polypeptide is strong enough and its translation by error is frequent enough (Xiong, et al. 2017). In other words, if errors occur frequently enough, potential polypeptides of strong effect will be exposed to enough selection to remove cryptic sequences with dangerous effects, leaving behind benign cryptic sequences that form better raw material for de novo innovations (Rajon and Masel 2011;Wilson and Masel 2011;Xiong, et al. 2017).
We use the term "pre-adapting" selection to describe selection that purges deleterious potential polypeptides before a co-option mutation occurs. The goal of this paper is to investigate whether pre-adapting selection occurs in nature. The pre-adapting selection hypothesis has the potential to explain how de novo protein-coding sequences avoid toxicity and hence can sometimes be adaptive, by showing that they have a history of exposure to and purging by selection (Wilson and Masel 2011).
Pre-adapting selection theory predicts that selective pressures on potential polypeptides are sometimes strong enough to purge deleterious variants, especially in populations of large effective population size.
As a more testable corollary within a single genome, potential polypeptides expressed at higher levels are predicted to be more likely to be benign (Xiong, et al. 2017). To test this, we need both to quantify relative levels of erroneous translation of different potential polypeptides, and to identify correlates of polypeptide benign-ness, in order to show an association between the two.
To measure erroneous translation, we focus on stop codon readthrough errors that preview de novo C-terminal extensions to existing genes. While the translation of intergenic ORFs is perhaps of more interest because it previews complete de novo genes, de novo C-terminal extensions are more tractable while still being relevant to de novo protein-coding evolution. Erroneous translation products, whether from an intergenic ORF or from part of a 3′ UTR, are difficult to detect by proteomic methods because they tend to be short and sparse, and may be targeted by degradation machinery (Arribere, et al. 2016).
We therefore quantify translation through ribosomal association with transcripts. This is not straightforward for intergenic ORFs (Ruiz-Orera, et al. 2018;Durand, et al. 2019), e.g. transcriptional analysis must confirm that translation is contiguous, and unannotated functional proteins must be excluded. In contrast, cryptic sequences past the stop codon are not part of the functional protein (see paragraph below) and are already known to be contiguously transcribed. It is therefore straightforward to count ribosome profiling hits ("ribohits") beyond the stop codon as a direct measure of readthrough. This is better than using protein abundance as an indirect measure of opportunities for readthrough, because highly abundant proteins tend to have lower readthrough rates per translation event (Bonetti, et al. 1995;Li and Zhang 2019). Our ability to directly quantify the amount of erroneous translation using ribohits makes stop codon readthrough a perfect test case for assessing whether pre-adapting selection can ever occur.
One possible concern with using ribohits is that not all stop codon readthrough is necessarily in error.
For example, Drosophila melanogaster has been shown to have significant functional ("programmed") readthrough (Jungreis, et al. 2011). While this is disputed (Li and Zhang 2019), pervasive programmed readthrough, if real, seems to be phylogenetically restricted to Pancrustaceae (Jungreis, et al. 2016), and does not occur in our focal species of S. cerevisiae (Jungreis, et al. 2016;Li and Zhang 2019). We can therefore assume that ribohits to 3' UTR sequences indicate translational errors, rather than alternative already-functional protein products. Some ribohits might also indicate non-translating ribosomes, especially in mutant yeast (Guydosh and Green 2014;Young, et al. 2015;Yordanova, et al. 2018); appropriate ribosome profiling methodology (Gerashchenko and Gladyshev 2014;Miettinen and Björklund 2015) in wild-type yeast can reduce the scale of this problem.
We expect benign polypeptides to share some traits, such as a tendency towards high ISD, with functional proteins that have been shaped by selection. Once such traits are identified, they can be used to differentiate potential polypeptides that are likely to be benign (and hence co-optable, as has been documented to occur (Giacomelli, et al. 2007;Vakhrusheva, et al. 2011;Andreatta, et al. 2015)) from those that are likely to be deleterious. We call these traits "preadaptations" because they are found prior to co-option and facilitate future adaptation. Longer extensions are more likely to do harm; this is reflected in their tendency to be degraded (Arribere, et al. 2016). Selection on cryptic sequences should therefore shorten potential polypeptides. However, while documenting selection for short extension length can demonstrate that selection is powerful enough to act, shortness is not the most interesting benign trait with respect to potential to contribute to substantial (i.e. longer) protein-coding innovations. Our primary focus is therefore on high ISD as a preadaptation.
The term "preadaptation" has a complicated history, so some discussion is warranted to avoid conflating our use with the various past uses of the term. Cuénot (1914) discussed "preadapted" characters that are either neutral or adaptive in one environment, then fortuitously become useful in a later environment (are co-opted) (Casinos 2017). Gould and Vrba (1982) took issue with this term, claiming it was implicitly teleological despite Cuénot's explicit rejection of orthogenesis, and attempted to replace "preadaptation" with the mostly-synonymous term "exaptation" (Casinos 2017). Gould and Vrba (1982) distinguished between two types of exaptation, depending on whether co-option is of a character previously shaped by natural selection for a different function (an adaptation), versus a character not shaped by natural selection for any particular function (a nonaptation). Potential polypeptides are nonaptations. Gould and Vrba (1982) did not consider systematic variation among nonaptations in their probabilities that future co-option might be beneficial. In contrast, Eshel and Matessi (1998) argued that there may be systematic variation in suitability for future adaptive co-option. Such variation justifies the use of the prefix "pre-", even in the absence of a teleological claim. Specifically, Eshel and Matessi (1998) used the term "preadaptation" to refer to the presence of cryptic genetic variants (i.e., co-optable nonaptations) whose effects are enriched for trait values with an elevated probability of being adaptive after an environmental shift. Populations become preadapted in this sense because future environmental changes tend to resemble existing marginal environments (Eshel and Matessi 1998). Masel (2006) similarly justified the use of the term "preadaptation" to describe a somewhat different process that also leads to systematic variation in suitability for co-option. 
When a stock of cryptic variants is expressed at low levels, it can be purged of variants in the deleterious mode of the DFE (which have no chance of becoming adaptive exaptations). As discussed above, this enriches, via a process of elimination, for variants in the benign mode, which have more adaptive potential. This process, which we here call "pre-adapting selection," occurs when cryptic variants are not completely phenotypically silent, but are instead expressed at low levels (Masel 2006;Rajon and Masel 2011).
Here we follow Wilson et al. (2017) in using the term "preadaptation" to refer not to the process described by Masel (2006), but to a trait that systematically makes a protein less likely to be harmful.
Our usage here thus distinguishes clearly for the first time between the process of "pre-adapting selection" and the existence of traits that are "preadaptations." This distinction is necessary, because pre-adapting selection may be occurring even if the nature of preadaptations is unknown, while preadaptations might be identifiable even if pre-adapting selection is not in force. Preadaptation can simply refer to backward-time conditional probability; given that de novo gene birth occurred, new genes are more likely than the average non-coding sequence to have properties that facilitate gene birth.
We hypothesize that cryptic sequences are subject to pre-adapting selection. Cryptic sequences with a history of higher levels of pre-adapting selection (e.g. due to higher expression) should therefore display more preadaptation, i.e. they should be more likely to be benign. High ISD/hydrophilicity is a preadaptation for sequences beyond stop codons (Arribere, et al. 2016), just as it is for the de novo birth of complete genes (Wilson, et al. 2017;Willis and Masel 2018). We test whether sequences beyond yeast stop codons that have a ribohit encode higher ISD.

Results
"Fragile" proteins lack backup stop codons and experience negligible readthrough
Fragile proteins are those which are prone to breaking when a C-terminal extension is added. Their C-terminal extensions are therefore unlikely to contribute to de novo evolution. Our ability to detect evolvability-relevant pre-adapting selection will be enhanced if we are able to identify and exclude the most fragile proteins.
Proteins that are fragile to readthrough are expected to be under strong selection for low readthrough rates. In the rare cases where a fragile protein is read through, degrading the readthrough product can mitigate its harms. If a 3′ UTR has no backup stop codon, readthrough should trigger non-stop mRNA decay, degrading the readthrough product and the potentially faulty mRNA that produced it (Vasudevan, et al. 2002). We therefore expect selection on fragile proteins to remove backup stop codons from their 3'UTRs, as well as to lower their readthrough rates.
We tested whether lacking a backup stop codon is an indicator of low readthrough. There are three distinct frames that might have or lack a backup stop codon. Readthrough can either be in-frame (where the stop codon is read by a near-cognate tRNA) or result from an upstream frameshift to one of two different frames. Genes lacking a backup stop codon in at least one frame do not contain any ribohits in their 3′ UTRs (Table 1, P-value near zero, Pearson's Chi-squared test on contingency table).
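The frame-by-frame check for backup stop codons can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: `extension_length` and `is_fragile` are hypothetical names, the input is assumed to be the DNA sequence immediately downstream of the ORF stop codon, and the two frameshifted frames are approximated by offsets of 1 and 2 nucleotides into that sequence, whereas a real frameshift occurs upstream of the stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extension_length(downstream, offset):
    """Count codons read before the first backup stop codon in one frame.

    `downstream` is the DNA immediately after the ORF stop codon (assumed input).
    `offset` is 0 for in-frame readthrough; 1 and 2 approximate the shifted frames.
    Returns None when that frame contains no backup stop codon.
    """
    for i in range(offset, len(downstream) - 2, 3):
        if downstream[i:i + 3].upper() in STOP_CODONS:
            return (i - offset) // 3
    return None


def is_fragile(downstream):
    """Apply the operational definition: no backup stop in at least one frame."""
    return any(extension_length(downstream, k) is None for k in range(3))
```

For example, `is_fragile("GCTGCTTAAATGATAGC")` is true under these assumptions because the frame at offset 2 never reaches a stop codon, even though the other two frames terminate after 2 and 3 codons.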
The dramatically lower levels of ribohits in genes lacking a backup stop codon cannot be explained by lower abundance, shorter 3′ UTRs, or shorter extensions in frames that still have a backup stop codon. In genes in which at least one frame lacks a backup stop codon, protein abundance is 1.4-fold lower (P = 3 ×  (Table 1).
This lack of ribohits shows both that the absence of at least one backup stop codon is a good way to identify fragile proteins, and that there is selection against readthrough of fragile proteins. We therefore refer to a protein that lacks a backup stop codon in at least one frame as "fragile."

Table 1. Lacking a backup stop codon in at least one frame greatly decreases the probability of observing ribohits in the 3′ UTR (P-value near zero, Chi-squared test).

Fig. 1. Fragile proteins have shorter extension(s) in those frame(s) for which they do have a backup stop codon (P = 3 × 10^-4, 2 × 10^-12, 3 × 10^-13, for in-frame, -1 frame, and -2 frame extensions, respectively, Student's t-test on log-transformed extension length). In-frame extensions are also shorter than both -1 shifted and -2 shifted extensions to non-fragile (P = 8 × 10^-23 and 7 × 10^-23, respectively, Student's t-test on log-transformed extension lengths) and fragile proteins (P = 0.02 and 0.03, respectively). C) Fragile proteins have more ordered C-termini, assessed as the last 10 amino acids of the ORF (P = 5 × 10^-6, Student's t-test on square-root-transformed ISD). D) The first four codon positions after the stop codon create slower in-frame translation in fragile than in non-fragile proteins (P = 0.005, Student's t-test on geometric mean tAI). Error bars represent +/- one standard error. 3′ UTRs, extension lengths, and codon tAI values were log-transformed, and ISD was root-transformed, to calculate means and standard errors, and then back-transformed for the figure.
When a fragile protein is read through in a frame in which it does have a backup stop codon, other decay pathways could be employed. Decay might occur via no-go decay (Simms, et al. 2017) or slowness-mediated decay (Radhakrishnan, et al. 2016;Rak, et al. 2018) if inefficient codons in the extension cause slow or stalled translation, and hence a backup of ribosomes. We therefore hypothesize that fragile proteins, as assessed by lack of backup stop codons following frameshift, employ slow codons in frame immediately after the ORF stop codon. Our prediction that fragile proteins have slow extension codons is restricted to in-frame readthrough because the location of the slow codons can be predicted, in contrast to frameshifts that can occur at many different positions prior to the stop codon.
We assess codon speed using the tRNA adaptation index (tAI).

In the Introduction, we hypothesized that when the C-terminus is highly structured, extensions are likely to be disruptive. In agreement with this, fragile proteins have lower ISD in the last 10 amino acids of the ORF than non-fragile proteins (fig. 1C). We found no significant Gene Ontology enrichments (Ashburner, et al. 2000;Carbon, et al. 2017;Mi, et al. 2017) among fragile proteins compared to our full set of analyzed proteins (P > 0.05, Fisher's exact test with Benjamini-Hochberg false discovery rate correction).
Our central interest in this paper is in pre-adapting selection on sequences beyond stop codons. This interest is motivated by pre-adapting selection's potential to contribute to the de novo evolution of protein-coding sequences. Fragile proteins are unlikely to ever contribute, and so are excluded, as identified by the absence of a backup stop codon, from the following analyses. So far we have demonstrated that selection has minimized the damage from readthrough of fragile proteins. This selection on the readthrough of fragile proteins might be stronger than selection on non-fragile proteins, but it is the latter that we care about because of their potential for de novo evolution.
Similarly, we care most about the consequences of readthrough in the frame in which readthrough is most frequent. In-frame readthrough errors, driven by near-cognate tRNA pairings with stop codons, seem to occur more frequently than frameshift-driven readthrough errors. Specifically, work on nonsense suppression in reporter genes in Escherichia coli and other bacteria (Parker 1989), yeast (Namy, et al. 2001;Williams, et al. 2004), and mammals (Floquet, et al. 2012) has estimated wild-type in-frame readthrough rates of approximately 10^-2 to 10^-4. In contrast, stop-codon-bypassing +1 frameshifts occur on the order of 10^-4 per translation event in E. coli, while +2 events occur on the order of 10^-5 (Curran and Yarus 1986), and these observed frameshifting rates are likely elevated due to ribosome stalling associated with the premature nature of the stop codon. Stronger selection on the consequences of in-frame readthrough is supported in our data by the fact that in-frame extensions are shorter than both types of frameshift extension in non-fragile proteins (fig. 1B). We therefore focus below on in-frame readthrough of non-fragile proteins.

High ISD is a preadaptation
We primarily assess disordered status using IUPred2 (Dosztányi, et al. 2005;Meszaros, et al. 2018), except where noted. This program uses a sliding window to assess each amino acid's interactions with its neighbors. We also sometimes assess or review other analyses of disorder at the level of individual amino acids in isolation, in terms of disorder propensity (Theillet, et al. 2013), relative solvent accessibility (RSA) (Tien, et al. 2013), or simply proportion of hydrophobic amino acids (where e.g. G, A, V, I, L, M, F, Y, and W are considered hydrophobic (Li and Zhang 2019)).
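The simplest of these per-residue measures, the proportion of hydrophobic amino acids, needs no sliding window and can be sketched in a few lines. The residue set below is the one listed in the text (G, A, V, I, L, M, F, Y, W); the function name `hydrophobic_fraction` is a hypothetical choice for illustration.

```python
# Hydrophobic residues as listed in the text, following Li and Zhang (2019).
HYDROPHOBIC = set("GAVILMFYW")


def hydrophobic_fraction(peptide):
    """Return the proportion of residues in `peptide` that are hydrophobic."""
    if not peptide:
        raise ValueError("peptide must be non-empty")
    residues = peptide.upper()
    return sum(aa in HYDROPHOBIC for aa in residues) / len(residues)
```

Because each residue is scored in isolation, this measure is unaffected by neighbouring amino acids, unlike a window-based predictor.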
If high ISD is a preadaptation for C-terminal extensions, we expect coding sequences to display higher ISD than non-coding sequences. This prediction is supported for non-coding sequences beyond the stop codon ( fig. 2A; previously found by Kleppe and Bornberg-Bauer (2018) for a high-readthrough subset of non-coding sequences, and by Li and Zhang (2019) for all genes using the proportion of hydrophobic amino acids), just as it was previously supported for mouse non-coding intergenic sequences (Wilson, et al. 2017) and alternative reading frames of viral coding sequences (Willis and Masel 2018). High ISD appears to be particularly important for C-termini, as disordered regions are most commonly found in the protein C-terminus (Uversky 2013), and adding hydrophobic amino acids to a C-terminus tends to lead to protein degradation (Arribere, et al. 2016). This is evidence that high ISD is indeed a preadaptation for C-terminal extensions.
Genes with detectable readthrough have higher ISD
Pre-adapting selection predicts that extensions that are read through more often will be more preadapted. In agreement with this prediction, when at least one ribohit is present in the 3′ UTR, extensions have higher ISD (fig. 3A, Original). Note that Li and Zhang (2019) did not find low hydrophobicity in a set of 172 yeast genes suspected of having programmed readthrough, but our ribohit detection is more sensitive, i.e. we detect ribohits for many genes that Li and Zhang (2019) classed as non-readthrough.
ISD is driven in part by the interactions amino acids form with their neighbors (Dosztányi, et al. 2005); selection to differentially elevate ISD in protein C-termini could thus contribute to differentially elevated ISD in extensions, giving a false impression of pre-adapting selection on the extensions themselves.
Kleppe and Bornberg-Bauer (2018) found that yeast proteins with more readthrough have more disordered C-termini. However, this relationship is at least partly driven by the low-readthrough, low-disorder fragile proteins described above, which do not contribute to protein innovation; once we exclude fragile proteins, genes with detectable readthrough still have higher ISD in the last 10 amino acids of the ORF, but this relationship is not statistically significant (P = 0.09, 2-tailed Student's t-test on square-root transformed ISD). More convincingly, abundant proteins have higher ISD in their C-termini (fig. 2B, P = 9 × 10^-17, see figure legend). This may be because selection shapes the C-terminus to mitigate the effects of readthrough, as suggested previously (Kleppe and Bornberg-Bauer 2018), or disordered C-termini may be intrinsically beneficial even in the absence of readthrough. This elevated protein ISD appears to be specific to the C-terminus; in a model where log abundance is predicted by the root transform of the ISD of the last 10 amino acids of the ORF, the root transformed ISD of the full ORF is no longer a significant predictor of protein abundance (P = 0.4, likelihood ratio test).

Fig. 2. High ISD is a preadaptation. A) ISD is higher in coding regions than in non-coding sequences beyond the stop codon. Higher ISD in extensions than in complete 3′ UTRs is partly an artifact of the IUPred algorithm, which uses a sliding window in which C-terminal amino acids can affect the assessed ISD of extensions. Error bars represent +/- one standard error. ISD was root-transformed to calculate weighted means and standard errors, weighted by the length of the sequences, and then back-transformed for the figure. B) More abundant non-fragile proteins have higher ISD in their C-termini. The regression line comes from a model where the root transform of the ISD of the last 10 amino acids of the ORF is predicted by log(protein abundance) (P = 9 × 10^-17, likelihood ratio test).
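The summary statistic described in the Fig. 2 legend (root-transform, length-weighted mean, back-transform) can be sketched as below. This is an illustrative reading of that description rather than the code used to produce the figure; `weighted_root_mean` is a hypothetical name, and the standard-error calculation is omitted.

```python
import math


def weighted_root_mean(isd_values, lengths):
    """Length-weighted mean of sqrt(ISD), squared to return to the ISD scale.

    `isd_values` holds one mean ISD score per sequence and `lengths` the
    corresponding sequence lengths, which are used as weights.
    """
    if len(isd_values) != len(lengths) or not lengths:
        raise ValueError("need equal-length, non-empty inputs")
    total_weight = sum(lengths)
    mean_root = sum(w * math.sqrt(x) for x, w in zip(isd_values, lengths)) / total_weight
    return mean_root ** 2
```

Weighting by length means the result reflects the expectation for a randomly sampled amino acid rather than for a randomly sampled sequence.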
In order to test for pre-adapting selection on the extension sequences themselves, the effects of the C-terminal amino acids on the ISD of the extensions need to be controlled for. To do this, we grafted each extension onto the C-terminus of the same forty randomly selected non-fragile proteins, and took the average transformed ISD of the extension across those 40 standardized contexts. Evidence for pre-adapting selection persisted (fig. 3A, Grafted).
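The grafting control can be sketched as follows, under stated assumptions. `grafted_isd` is a hypothetical name, and `score_residues` is a placeholder for any per-residue disorder predictor; the `toy_window_score` function is not IUPred2 but a deliberately crude stand-in (the windowed fraction of non-hydrophobic residues over 21 positions) included only so that the sketch runs and shows how the ORF context leaks into extension scores.

```python
import math
from statistics import mean


def toy_window_score(seq, half_window=10):
    """Toy stand-in for a disorder predictor (NOT IUPred2).

    Scores each position as the fraction of non-hydrophobic residues in a
    window of up to 2 * half_window + 1 residues centred on it.
    """
    flags = [0.0 if aa in "GAVILMFYW" else 1.0 for aa in seq]
    scores = []
    for i in range(len(flags)):
        window = flags[max(0, i - half_window): i + half_window + 1]
        scores.append(sum(window) / len(window))
    return scores


def grafted_isd(extension, context_proteins, score_residues):
    """Average root-transformed extension score across standardized contexts."""
    per_context = []
    for protein in context_proteins:
        scores = score_residues(protein + extension)
        extension_scores = scores[len(protein):]  # keep only the grafted part
        per_context.append(math.sqrt(mean(extension_scores)))
    return mean(per_context)
```

Because every extension is scored against the same set of context proteins, differences between extensions can no longer be attributed to differences between their native C-termini.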
IUPred accounts for nearby amino acid interactions via a 21 amino acid sliding window (Dosztányi, et al. 2005;Meszaros, et al. 2018), meaning that the first 10 amino acids of an extension are influenced by the last 10 amino acids on the ORF. Because proteins have higher ISD than expected from non-coding sequences, shorter extensions have more influence from the ORF and thus higher ISD by reason of short length alone. Thus, the result in fig. 3A could be driven by extension length rather than ISD.
Ribohit presence is also confounded by extension length: there are more opportunities to detect ribohits when extensions are longer. Because high ISD is associated with short extensions which are associated with fewer ribohits, this negative confounding relationship makes our attempts to detect pre-adapting selection on ISD (i.e. higher ISD with more ribohits) conservative. However, selection on short extension length might create a positive confounding relationship between ribohits and ISD (mediated by short extensions), and this might swamp the negative confounding relationship between ribohits and ISD already described. This could lead to high ISD with more ribohits even in the absence of pre-adapting selection on ISD. In this case, we may mistake selection for short length for pre-adapting selection on ISD.
We therefore control for the effects of extension length on ISD as part of a linear regression model, as described in "ISD model" in the Methods. Extensions have higher ISD when at least one ribohit is present in the 3′ UTR even after controlling for length (fig. 3B; P = 3 × 10^-4, grafted ISD model). The effect size including effects of C-termini is about 1.4 times the effect size of extensions alone (original vs. grafted in both fig. 3A and fig. 3B).
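A length-controlled regression of this general kind can be sketched with a small weighted least-squares solver. This is a hedged illustration only: the actual model specification is given in the Methods, and the design used here (intercept, ribohit presence as 0/1, and log extension length as predictors of root-transformed ISD, with extension length as the weight) is an assumption made for the example. `weighted_least_squares` is a hypothetical helper.

```python
import math


def weighted_least_squares(rows, y, weights):
    """Return beta minimizing sum_i w_i * (y_i - rows_i . beta)^2.

    `rows` is a list of predictor vectors (include a 1.0 for the intercept).
    Solves the normal equations by Gauss-Jordan elimination with pivoting.
    """
    k = len(rows[0])
    # Augmented matrix [X^T W X | X^T W y].
    a = [
        [sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(k)]
        + [sum(w * r[i] * yi for r, yi, w in zip(rows, y, weights))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [x - factor * p for x, p in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def design_row(has_ribohit, extension_length):
    """Assumed predictors: intercept, ribohit indicator, log extension length."""
    return [1.0, float(has_ribohit), math.log(extension_length)]
```

In this layout the second coefficient estimates the difference in root-transformed ISD associated with ribohit presence at a fixed extension length, which is the quantity of interest for the pre-adapting selection prediction.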
Amino acids after stop codon are shaped by pre-adapting selection
The ISD results above involve complicated methods to control for confounding factors, in particular extension length. Because IUPred uses a sliding window, the degree to which C-terminal amino acids elevate extension ISD depends on extension length, even after grafting. We therefore also use two straightforward amino acid scores that do not involve a sliding window: RSA, which scores how often amino acids are found in the interior vs. surface of globular proteins (Tien, et al. 2013), and disorder propensity (Theillet, et al. 2013), which scores how often amino acids tend to be found in disordered regions compared to ordered regions. As predicted by pre-adapting selection, extensions with more [...] P = 0.002, disorder propensity model), in agreement with suggestions that they are less likely to lead to protein degradation (Arribere, et al. 2016). Shortness of extension might thus be a better metric of the quantity of readthrough products subjected to pre-adapting selection than the number of ribohits.

Fig. 3. In-frame extension ISD is higher when ribohits are present. A) Elevated ISD with ribohits is partly driven by amino acids beyond the stop codon. B) Ribohits still predict higher ISD after controlling for the large effect of extension length (P = 5 × 10^-5 and 3 × 10^-4 for original and grafted extensions, respectively, ISD models). Lines are from the ISD models. Weighting by the length of the extensions (see Materials and Methods) means that the ISD values represent expectations from sampling an amino acid from the extensions rather than from sampling an extension.

Discussion
The pre-adapting selection hypothesis predicts that the more often an erroneous gene product is produced, the more likely it is to be preadapted. We found that non-fragile proteins that are read through more often have C-terminal extensions with higher disorder. These high disorder extensions constitute a pool of preadapted cryptic sequences that evolution could draw from for de novo evolution.
Higher readthrough could impose pre-adapting selection on a C-terminal extension, or a fortuitously preadapted C-terminal extension could alleviate selective pressure for low readthrough; our methods are not able to determine the causal direction. Observed readthrough reflects a combination of leakiness (probability of readthrough per translation event), protein abundance, and degradation of readthrough products. The consequences of readthrough are unlikely to be the most significant factor shaping the evolution of protein abundance, and so the problematic scenario, in which causation works in the opposite direction to that posited by pre-adapting selection, is presumably restricted to selection on leakiness and degradation. In this paper we focused on ribohits as a better metric of readthrough than protein abundance, because more abundant proteins are believed to have evolved less leaky readthrough. But relaxing this focus in order to exclude causality in the wrong direction, we also found that more abundant proteins tend to have shorter extensions (P = 4 × 10⁻⁶, Pearson's correlation coefficient = -0.083), suggesting that a causal direction from readthrough to selection on extensions is valid. Similarly, a few fortuitously short extensions are unlikely to change the fact that most readthrough occurs in-frame, so the fact that in-frame extensions tend to be shorter than frameshifted extensions (fig. 1B) also indicates causality in the appropriate direction.

Fig. 4. Metrics that do not use a sliding window provide further evidence that pre-adapting selection acts on extension amino acids. A) Mean extension RSA is higher when ribohits are present (P = 0.02, RSA model) and lower for longer extensions (P = 4 × 10⁻⁸). B) Mean extension disorder propensity is also higher when ribohits are present (P = 0.012, disorder propensity model) and lower for longer extensions (P = 0.002). As in fig. 3, weighting by the length of the extensions means that RSA and disorder propensity values represent expectations from sampling an amino acid from the extensions rather than from sampling an extension.
The evidence we have presented is focused on in-frame extensions, because this is the most frequent form of readthrough. Applying the same approach to -1 and -2 frameshifted extensions did not provide convincing evidence of pre-adapting selection. But even if there is no pre-adapting selection in these frames, this does not necessarily imply that they are not preadapted. Specifically, -2 frameshifts are biased towards hydrophilic amino acids, making their extensions more likely to be benign without pre-adapting selection, perhaps even alleviating the need for it. A -2 frameshift is preadapted because the T beginning the stop codon becomes the last nucleotide of the first codon of the extension, and the second codon begins with AA, AG, or GA. The former constrains little, but the latter creates a bias toward hydrophilicity. Specifically, AAY and AGY encode asparagine and serine, both of which are polar, while AAR, AGR, GAY, and GAR encode lysine, arginine, aspartic acid, and glutamic acid, respectively, all of which are charged. This guaranteed hydrophilicity of the second amino acid of -2 frameshift extensions increases the ISD. In contrast, the first codon of a -1 shifted extension must end with TA or TG, while the second must start with A or G. Codons ending in TA code for leucine, isoleucine, and valine, while codons ending in TG code for leucine, methionine, and valine. All of these are strongly hydrophobic, creating a bias toward low ISD. For our ISD calculations, we simply excised stop codons to generate in-frame extensions, introducing no bias, but real extensions usually result from a near-cognate tRNA pairing with a stop codon. Experimentally validated pairings show that the most common near-cognate decodings in yeast are tyrosine or glutamine for UAA, tyrosine for UAG, and tryptophan for UGA (Blanchet, et al. 2014). Glutamine is hydrophilic, tyrosine is amphipathic, and tryptophan is sometimes grouped as hydrophobic and sometimes as amphipathic, collectively suggesting that the bias for in-frame extensions will not be strong. So, -2 shifted extensions may be more naturally preadapted than in-frame extensions, and -1 shifted extensions less so, due to the biases introduced by the guaranteed inclusion of a polar or charged amino acid in the former and of a hydrophobic amino acid in the latter.
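The codon constraints described above can be checked by enumerating the standard genetic code. The following is a minimal sketch rather than part of the published analysis; the function names `minus1_first_residues` and `minus2_second_residues` are hypothetical, and the frame labels follow the convention used in the text.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")  # TAA, TAG, TGA


def minus1_first_residues():
    """Residues encoded by codons ending in the first two stop-codon bases (NTA, NTG)."""
    return {CODON_TABLE[n + stop[:2]] for stop in STOPS for n in BASES}


def minus2_second_residues():
    """Residues encoded by codons starting with the last two stop-codon bases (AAN, AGN, GAN)."""
    return {CODON_TABLE[stop[1:] + n] for stop in STOPS for n in BASES}


print(sorted(minus1_first_residues()))   # ['I', 'L', 'M', 'V']
print(sorted(minus2_second_residues()))  # ['D', 'E', 'K', 'N', 'R', 'S']
```

The first set contains only hydrophobic residues and the second only polar or charged residues, matching the biases described for -1 and -2 shifted extensions.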
Previous studies have also shown evidence of selection for shorter extensions. Backup stop codons are more frequent than expected by chance alone in yeast (Williams, et al. 2004), show evidence of conservation (Liang, et al. 2005), and are overrepresented in genes with a high codon adaptation index (Liang, et al. 2005). However, shorter extensions are unlikely to contribute appreciably to de novo innovations, precisely because of their short length. We have not only confirmed that selection prefers short extensions, but also found for the first time that selection favors high ISD extensions. High ISD increases the potential for de novo innovation (Wilson, et al. 2017;Willis and Masel 2018).
Error rates are not constant throughout the genome. We found that fragile proteins have dramatically lower readthrough than non-fragile proteins. Non-stop mediated mRNA decay triggered by the absence of a backup stop codon (Vasudevan, et al. 2002), and no-go (Simms, et al. 2017) or slowness-mediated decay (Radhakrishnan, et al. 2016;Rak, et al. 2018) triggered by slow codons in extensions, are two of the mechanisms by which this might occur. Another possible mechanism is an mRNA structure that physically blocks ribosome progression into the 3′ UTR, which would stall ribosomes and also lead to no-go decay (Harigaya and Parker 2010).
Many molecular errors are less common in highly expressed genes, including errors in transcription start site in human and mouse (Xu, et al. 2019), mRNA polyadenylation in mammals (Xu and Zhang 2018), post-transcriptional modifications in human and yeast (Liu and Zhang 2018b, a), and mistranscription errors in E. coli (Meer, et al. 2019). Most importantly for our study, readthrough errors in yeast are less common in highly expressed genes (Li and Zhang 2019). However, this does not remove selection for benign extensions that are more disordered.
Here we have provided a proof of principle that pre-adapting selection shapes the sequences beyond stop codons, facilitating later de novo evolution of C-termini. This makes it plausible that pre-adapting selection also works on the precursors of full de novo genes, junk polypeptides (Wilson and Masel 2011;Ruiz-Orera, et al. 2018;Blevins, et al. 2019;Durand, et al. 2019), facilitating later de novo birth of complete proteins. This can reconcile the perceived implausibility of de novo gene birth (Zuckerkandl 1975;Jacob 1977) with recent convincing evidence that de novo gene birth is a real phenomenon.

Data
Pre-processed ribosome profiling data on wild-type [psi-] and [PSI+] strains of S. cerevisiae were provided by Jarosz et al. (manuscript in preparation), derived using flash freezing to inhibit elongation according to a modified protocol of Brar et al. (2012).
We chose Jarosz et al.'s (manuscript in preparation) dataset for three reasons: 1) it has very high coverage, 2) it is methodologically designed to minimize spurious 3′ UTR ribohits, for example by avoiding the use of micrococcal nuclease, which is known to elevate 3′ UTR ribohits (Miettinen and Björklund 2015), and 3) it uses wild-type yeast. Other yeast ribosome profiling datasets are either lower coverage (e.g. Baudin-Baillieu, et al. 2014;Cheng, et al. 2018) or focus on translation defects (e.g. Nedialkova and Leidel 2015; Guydosh and Green 2017). For example, Kleppe and Bornberg-Bauer (2018) used the data of Nedialkova and Leidel (2015), which was high coverage but primarily composed of tRNA mutants; excluding the non-wild-type yeast data would make it lower coverage than the data used here.
Sequences for 6,752 protein coding genes were downloaded from the Saccharomyces Genome Database (SGD, http://www.yeastgenome.org) using the 2011 S288C reference S. cerevisiae genome sequence and associated annotations, matching the strain used for ribosomal profiling and the gene annotations used to analyze that data [Jarosz et al. (manuscript in preparation)]. 3′ UTR sequence annotations were done by Jarosz et al. (manuscript in preparation) using transcript isoform profiling data (Pelechano, Wei, and Steinmetz 2013). We excluded 2,032 genes without a 3′ UTR annotation, and a further 646 whose ORF is not "verified" as protein-coding in SGD, leaving 4,083 genes. Lastly, one gene, YOR031W, has a blocked reading frame with an in-frame stop codon after only eight amino acids; excluding it left 4,082 genes.
We scored whether a ribohit was found within the 3′ UTR region in either a [psi-] [...]

Protein abundance data was downloaded (11/30/15) from PaxDB (Wang, et al. 2012;Wang, et al. 2015) using the integrated S. cerevisiae dataset, expressed as parts per million individual proteins (ppm). Of the 4,082 genes in our analysis, abundance data was available for all but two; these two genes were excluded from analyses that used abundance data. Abundance values were log transformed before analysis; based on manual inspection, this achieved approximately normally distributed values.
All in-house scripts for our analyses can be found at https://github.com/MaselLab/Kosinski-and-Masel-CTerminalExtensions. All figures were made in R (R Core Team 2019) using the "ggplot2" package (Wickham 2016).

Scoring ISD
IUPred2 (Dosztányi, et al. 2005;Meszaros, et al. 2018) returns a score between zero and one for each amino acid in a sequence. These scores represent non-independent quasi-probabilities that the amino acid is in a disordered region. We estimate the ISD of a sequence using a simple average of the IUPred2 quasi-probabilities for the amino acids in the sequence of interest. Because this procedure creates heteroscedasticity, by making the error a function of sequence length, we use sequence length as a weight in later analyses.
To calculate the ISD of the last 10 amino acids of an ORF, we fed the full ORF into IUPred2, then took the average only of the last 10 amino acids. Similarly, to calculate the ISD of extensions, we excised the stop codon, fed the sequence up to the backup stop codon into IUPred2, and took the average score only of the amino acids in the extension.
Square-root-transformed ISD values were approximately normal for ORF, C-terminal, and grafted extension ISD, and were used in subsequent statistical analyses. This corresponds to the biological intuition that an ISD difference of 0.7 vs 0.8 is less important than one between 0 and 0.1. Note that we transformed sequence ISD; we did not transform IUPred2 quasi-probabilities prior to averaging to produce sequence ISD.
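The averaging and transformation steps described above can be sketched as follows. This is a minimal illustration and not the authors' script: it assumes that per-residue IUPred2 scores have already been computed and are supplied as a plain list, and the function names (`region_isd`, `c_terminal_isd`, `extension_isd`) and the example scores are hypothetical.

```python
import math


def region_isd(scores, start, end=None):
    """Mean of the per-residue disorder scores in scores[start:end]."""
    region = scores[start:end]
    if not region:
        raise ValueError("empty region")
    return sum(region) / len(region)


def c_terminal_isd(orf_scores, n=10):
    """ISD of the last n residues, scored in the context of the full ORF."""
    return region_isd(orf_scores, -n)


def extension_isd(readthrough_scores, orf_length):
    """ISD of the extension only; scores cover ORF plus extension (stop codon excised)."""
    return region_isd(readthrough_scores, orf_length)


# Made-up scores: a 12-residue ORF followed by a 4-residue extension.
example_scores = [0.2] * 12 + [0.5, 0.7, 0.9, 0.7]
ext_isd = extension_isd(example_scores, orf_length=12)
ext_length = len(example_scores) - 12   # later used as a regression weight
sqrt_ext_isd = math.sqrt(ext_isd)       # transform the mean, not the per-residue scores
```

Note that the square root is applied to the sequence-level mean, in line with the statement that the quasi-probabilities themselves are not transformed before averaging.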

RSA calculation
Relative solvent accessibility (RSA) is a measurement of how often an amino acid tends to be found in the exterior versus interior of globular proteins (Tien, et al. 2013), calculated from high quality crystal structures using the DSSP program (Kabsch and Sander 1983;Tien, et al. 2013). More hydrophilic residues tend to be found on the exterior.

tAI
tAI was calculated for each codon in a sequence using a script provided by Roni Rak, and tRNA copy numbers for S. cerevisiae were taken from Percudani et al. (1997). Standard practice for calculating sequence tAI is to use the geometric mean (dos Reis, et al. 2004). Equivalently, we calculated sequence tAI as the arithmetic mean of the log-transformed tAI codon scores.
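The equivalence between the two ways of summarizing codon tAI scores can be illustrated with a short sketch: the arithmetic mean of the log-transformed codon scores is the logarithm of their geometric mean, so the two measures rank sequences identically. The function names and the example codon scores below are hypothetical and are not taken from the script mentioned above.

```python
import math


def sequence_tai_geometric(codon_tai):
    """Geometric mean of per-codon tAI scores (all scores must be positive)."""
    return math.prod(codon_tai) ** (1.0 / len(codon_tai))


def sequence_tai_mean_log(codon_tai):
    """Arithmetic mean of the log-transformed per-codon tAI scores."""
    return sum(math.log(w) for w in codon_tai) / len(codon_tai)


# Made-up per-codon scores for a four-codon sequence.
example_codon_tai = [0.9, 0.4, 0.25, 0.6]
geo = sequence_tai_geometric(example_codon_tai)
mean_log = sequence_tai_mean_log(example_codon_tai)

# Exponentiating the mean of logs recovers the geometric mean.
same = math.isclose(math.exp(mean_log), geo)
```

Working on the log scale also avoids multiplying many small numbers together for long sequences.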
"P-value near zero" We use this phrase to indicate that R returned a P-value of zero. This occurs when P is smaller than the smallest positive value that R can represent (less than 10 -300 ).

Regression models
During model building, we used log transforms of our predictors when this gave a better model fit. Tests of statistical significance for all models came from the likelihood ratio test associated with dropping the variable of interest from the most supported regression model among those discussed below. P-values reported from linear models come from models controlling for all predictive factors listed below, not just the predictive factors mentioned in the Results until that point. All listed predictor coefficients (β) come from the model with all significant predictors included.
ISD models: A separate model was built for root-transformed ungrafted ISD values and for average root-transformed grafted ISD values. ISD is a mean across extension amino acids, so we weight each value by the length of the extension. We tested the effects of ribohits (scored as a binary based on presence in the 3′ UTR) while controlling for extension length.

[...] showed that both RSA and disorder propensity were approximately normal, and hence neither was transformed.
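The general shape of a length-weighted regression with a likelihood ratio test for dropping one predictor can be sketched as below. This is an illustration under stated assumptions, not the authors' analysis (which was run in R): the data are invented, the helper names (`solve`, `wls`, `lrt_drop_one`) are hypothetical, extension length is log transformed purely for the example, and errors are assumed Gaussian with variance inversely proportional to the weight.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def wls(rows, y, w):
    """Weighted least squares; returns (coefficients, weighted residual sum of squares)."""
    k = len(rows[0])
    xtwx = [[sum(wi * r[i] * r[j] for r, wi in zip(rows, w)) for j in range(k)]
            for i in range(k)]
    xtwy = [sum(wi * r[i] * yi for r, yi, wi in zip(rows, y, w)) for i in range(k)]
    beta = solve(xtwx, xtwy)
    rss = sum(wi * (yi - sum(b * x for b, x in zip(beta, r))) ** 2
              for r, yi, wi in zip(rows, y, w))
    return beta, rss


def lrt_drop_one(full_rows, reduced_rows, y, w):
    """Likelihood ratio test for dropping a single predictor from a Gaussian model."""
    _, rss_full = wls(full_rows, y, w)
    _, rss_reduced = wls(reduced_rows, y, w)
    stat = max(len(y) * math.log(rss_reduced / rss_full), 0.0)
    p = math.erfc(math.sqrt(stat / 2.0))  # chi-square survival function, 1 d.f.
    return stat, p


# Invented toy data: extension lengths, ribohit presence, root-transformed ISD.
lengths = [3, 8, 15, 22, 40, 5, 12, 30]
ribohit = [1, 1, 0, 1, 0, 0, 1, 0]
sqrt_isd = [0.80, 0.72, 0.55, 0.60, 0.35, 0.70, 0.68, 0.42]

full = [[1.0, math.log(n), float(r)] for n, r in zip(lengths, ribohit)]
reduced = [[1.0, math.log(n)] for n in lengths]
stat, p_value = lrt_drop_one(full, reduced, sqrt_isd, w=lengths)
```

Here the weights are the extension lengths, reflecting that each ISD value is a mean over that many amino acids, and the test compares the model with and without the ribohit term.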