Abstract
The noncoding genome plays an important role in de novo gene birth and in the emergence of genetic novelty. Nevertheless, how noncoding sequences’ properties could promote the birth of novel genes and shape the evolution and the structural diversity of proteins remains unclear. Therefore, by combining different bioinformatic approaches, we characterized the fold potential diversity of the amino acid sequences encoded by all intergenic ORFs (Open Reading Frames) of S. cerevisiae with the aim of (i) exploring whether the large structural diversity observed in proteomes is already present in noncoding sequences, and (ii) estimating the potential of the noncoding genome to produce novel protein bricks that can either give rise to novel genes or be integrated into pre-existing proteins, thus participating in protein structure diversity and evolution. We showed that amino acid sequences encoded by most yeast intergenic ORFs contain the elementary building blocks of protein structures. Moreover, they encompass the large structural diversity of canonical proteins with strikingly the majority predicted as foldable. Then, we investigated the early stages of de novo gene birth by identifying intergenic ORFs with a strong translation signal in ribosome profiling experiments and by reconstructing the ancestral sequences of 70 yeast de novo genes. This enabled us to highlight sequence and structural factors determining de novo gene emergence. Finally, we showed a strong correlation between the fold potential of de novo proteins and the one of their ancestral amino acid sequences, reflecting the relationship between the noncoding genome and the protein structure universe.
Introduction
Many studies give a central role to the noncoding genome in de novo gene birth (Ingolia et al. 2011; Tautz and Domazet-Lošo 2011; Carvunis et al. 2012; Slavoff et al. 2013; Prabakaran et al. 2014; Ruiz-Orera et al. 2014; Zhao et al. 2014; Schlötterer 2015; Ruiz-Orera et al. 2018; Vakirlis et al. 2018; Schmitz et al. 2018; Durand et al. 2019; Blevins et al. 2021). Nevertheless, how noncoding sequences can code for a functional product and consequently give rise to novel genes remains unclear. Indeed, function is intimately related to protein structure and more generally to protein structural properties. All proteomes are characterized by a large diversity of structural states. The structural properties of a protein result from its composition in hydrophobic and hydrophilic residues. Highly disordered proteins display a high hydrophilic residue content. Membrane proteins which fold in lipidic environments, but aggregate in solution, are enriched in hydrophobic residues. Finally, foldable proteins are characterized by a subtle equilibrium of hydrophobic and hydrophilic residues (Talmud and Bresler 1944). The latter are arranged together into specific patterns that dictate the formation of the secondary structures and the outcoming fold. However, noncoding sequences display different nucleotide frequencies from coding sequences, resulting in different amino acid compositions. If and how these amino acid compositions can account for the structural states observed in proteomes is a crucial question to understand the relationship, if any, between the noncoding genome and the protein structure universe. So far, different models of de novo gene emergence have been proposed (Carvunis et al. 2012; Schlötterer 2015; Wilson and Masel 2011). The “preadaptation” model stipulates that only sequences pre-adapted not to be harmful (i.e. with enough disorder not to be subjected to aggregation), will give rise to gene birth (Wilson and Masel 2011). 
This model is supported by the observation that young genes and de novo protein domains display a higher disorder propensity than old genes (Schmitz et al. 2018; Wilson and Masel 2011; Bitard-Feildel et al. 2015; Ekman and Elofsson 2010). On the contrary, several studies conducted on S. cerevisiae indicate that young genes are less prone to disorder (Carvunis et al. 2012; Vakirlis et al. 2018, 2020). Recently, Vakirlis et al. (2020) proposed a TM-first model where the membrane environment provides a safe niche for transmembrane (TM) adaptive emerging peptides which can further evolve toward more soluble peptides. These adaptive peptides have been identified with overexpression which, according to the authors, may not be reached outside the laboratory. Whether such peptides, though beneficial in the experimental conditions, would be produced and would be beneficial in “natural” conditions deserves further investigations.
Overall, all these studies attribute to the fold potential of noncoding ORFs (including the propensities for disorder, folded state, and aggregation) an important role in the emergence of genetic novelty. However, several questions remain open. First, if the sequence and structural properties of de novo genes have been largely investigated in specific species, the raw material for de novo gene birth and the early stages preceding the fixation of the beneficial ORFs are to be further characterized (Schmitz et al. 2018). Second, if the role of the noncoding genome in de novo gene birth has been largely investigated, its role in protein evolution and structural diversity is to be further characterized as well. Indeed, de novo domains may emerge from noncoding regions through ORF extension or exonization of introns (Bornberg-Bauer and Alba 2013; Bornberg-Bauer et al. 2015). On the other hand, we can assume that protein-coding genes, whatever their evolutionary history, have had a noncoding ancestral origin (Nielly-Thibault and Landry 2019). Whether the noncoding ORFs which gave rise to novel genes can account for the structural diversity of proteomes or whether this structural diversity evolved from ancestral genes which all displayed similar structural properties (i.e. disordered, foldable or TM-prone) is a crucial question to better understand the role, if any, of noncoding sequences in the protein structure universe.
Therefore, by combining different bioinformatic approaches, we characterized the diversity of the fold potential (i.e. propensity for disorder, folded state or aggregation) encoded in all intergenic ORFs (IGORFs) of S. cerevisiae with the aim of (i) exploring whether the large structural diversity observed in proteomes is already present in noncoding sequences, thereby investigating the relationship, if any, between the fold potential of the peptides encoded by IGORFs and the protein structural diversity of proteins, and (ii) estimating the potential of the noncoding genome to produce novel protein bricks, that can either give birth to novel genes or be integrated into pre-existing proteins, thus participating in protein structure evolution and diversity. We then characterized the early stages preceding de novo gene emergence with two complementary approaches (i) the systematic reconstruction of the ancestral sequences of 70 de novo genes, in order to identify the sequence and structural features of the peptides encoded by IGORFs that indeed gave birth in the past to de novo genes and (ii) the identification of IGORFs with a translation signal through ribosome profiling experiments, in order to investigate the sequence and structural properties of the peptides encoded by candidate IGORFs that could give birth to future novel genes. In particular, we performed ribosome profiling experiments on S. cerevisiae and used additional ribosome profiling data to precisely analyze their translational behavior. Then we characterized the sequence and structural properties of the peptides encoded by IGORFs that are occasionally translated with a weak translation signal and that may give rise to a novel gene and IGORFs that display a strong translation signature in at least two independent experiments probably reflecting the optimization of their translational activity and the emergence of function.
Results
We extracted 105041 IGORFs of at least 60 nucleotides in S. cerevisiae (see Methods). We probed their fold potential with the Hydrophobic Cluster Analysis (HCA) approach (Faure and Callebaut 2013; Bitard-Feildel and Callebaut 2017; Bitard-Feildel et al. 2018; Bitard-Feildel and Callebaut 2018) and compared it with that of the 6669 CDS of S. cerevisiae. From the sole information of a single amino acid sequence, HCA highlights the building blocks of protein folds that constitute signatures of folded domains. These building blocks consist of clusters of strong hydrophobic amino acids that have been shown to be associated with regular secondary structures (Bitard-Feildel and Callebaut 2017; Bitard-Feildel et al. 2018; Lamiable et al. 2019) (Fig. S1). These clusters are connected by linkers corresponding to loops or disordered regions. The combination of hydrophobic clusters and linkers in a sequence determines its fold potential. The latter can be quantified through a foldability score (HCA score) which covers all the structural diversity of proteins.
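To make the extraction step concrete, the following minimal sketch scans the three forward frames of a nucleotide sequence for stop-free stretches of at least 60 nucleotides. It is an illustration only: the stop-to-stop ORF definition, the function name `find_orfs` and the toy sequences are assumptions, and the actual IGORF definition and annotation filters are those described in Methods.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=60):
    """Return (start, end, frame) tuples for stop-to-stop ORFs.

    Assumption for illustration: an ORF is the stop-free stretch between
    two consecutive in-frame stop codons. Coordinates are 0-based,
    end-exclusive, and exclude the flanking stops. Only the three forward
    frames are scanned; the reverse strand would need the reverse complement.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = None  # set once the first in-frame stop codon is seen
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOPS:
                if orf_start is not None and pos - orf_start >= min_len:
                    orfs.append((orf_start, pos, frame))
                orf_start = pos + 3
    return orfs

# Toy sequences (hypothetical): 20 or 19 sense codons between two stops
long_enough = "TAA" + "GCT" * 20 + "TAG"
too_short = "TAA" + "GCT" * 19 + "TAG"
print(find_orfs(long_enough))
print(find_orfs(too_short))
```

In a genome-wide setting, the same scan would be restricted to intergenic coordinates and applied to both strands.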
IGORFs contain elementary building blocks of proteins
We first investigated the structural and sequence properties of the amino acid sequences encoded by CDS and IGORFs. CDS are longer than IGORFs and contain more HCA clusters (Mann-Whitney U-test, P < 2.2e-16 for both observations) (Figure 1A and B). Strikingly, the HCA clusters of CDS and IGORFs display a remarkably similar size of about 11 residues (Mann-Whitney U-test, P = 0.17) (Figure 1C), and 96.9% of IGORFs harbor at least one HCA cluster. This result shows that the elementary building blocks of proteins are widespread in noncoding sequences. In contrast, CDS are enriched in long linkers, suggesting that linker sizes have increased during evolution (6.3 and 11.5 residues for IGORFs and CDS respectively, Mann-Whitney U-test, P = 2.6e-11) (Figure 1D). This increase in size might favor protein modularity and flexibility, probably reflecting the important role of linkers in protein function optimization and protein structure diversity (Blouin et al. 2004; Espadaler et al. 2006; Tendulkar et al. 2004; Papaleo et al. 2016).
CDS are enriched in polar and charged residues
Although the hydrophobic clusters of CDS and IGORFs display similar sizes, they may not have the same amino acid composition. Therefore, in order to see whether clusters of CDS have evolved toward specific amino acid distributions, we calculated, for each amino acid, its propensity for being in HCA clusters of CDS over HCA clusters of IGORFs. CDS HCA clusters are clearly enriched in polar and charged residues compared to those of IGORFs (Fig. S2A). The same tendency is observed for CDS linkers (Fig. S2B). Strikingly, negatively charged residues are over-represented compared to positively charged ones in both HCA clusters and linkers of CDS (Fig. S2). In fact, it has been shown that the charge distribution in a protein sequence has an impact on its diffusion in the cytosol, where positively charged proteins get caught in nonspecific interactions with the abundant negatively charged ribosomes (Requião et al. 2017; Schavemaker et al. 2017). Interestingly, Figure S3 shows that the frequency of negatively charged residues calculated on the yeast cytoplasmic proteins is positively correlated with the proteins' abundance (Spearman correlation coefficient: Rho = 0.44, P < 2e-16), suggesting that the crowded cellular environment has shaped the charge distribution of abundant proteins. This result recalls the observation made in previous studies which showed that the frequency of “sticky” amino acids on the surface of globular proteins or in disordered proteins decreases as the protein cellular concentration increases (Levy et al. 2012; Macossay-Castillo et al. 2019).
IGORFs encode peptides that display a wide diversity of fold potential, including a substantial number of non-harmful peptides
We next used the HCA score in order to assess the fold potential of the peptides encoded by IGORFs. As reference, we calculated the HCA scores for three sequence datasets consisting of 731 disordered regions, 559 globular proteins and 1269 TM regions extracted from transmembrane proteins, thereby expected to form aggregates in solution while being able to fold in lipidic environments (see Methods for more details) (Figure 2A). Based on their HCA scores, we defined three categories of fold potentials (i.e. disorder prone, foldable, or aggregation prone in solution) (Figure 2A). Here, we define as foldable, proteins that are able to fold into a compact and well-defined 3D structure or partially to an ordered structure in which the secondary structures are however present. Figure 2B shows that CDS and IGORFs belonging to the low HCA score category are indeed presumed to be disordered and display low propensity for aggregation. Interestingly, comparable but small proportions of CDS and IGORFs fall into this group (4.9% and 7.7% respectively) indicating that most coding but also noncoding sequences are not highly prone to disorder in line with Tretyachenko et al. (2017). The high HCA score category corresponds to sequences which exhibit a low propensity for disorder while displaying a high propensity for aggregation in solution. CDS falling into this bin correspond to highly hydrophobic sequences (Table S1) among which 81% are annotated as uncharacterized according to Uniprot (UniProt Consortium 2019) and 60% are predicted as containing at least one TM domain. Finally, the intermediate category concerns sequences which have a high potential for being completely or partially folded in solution as shown by their intermediate HCA scores comparable to those of globular proteins. As anticipated, most CDS (91.4%) and, strikingly, a majority of IGORFs (66.6%) fall into this category. 
Both are characterized by intermediate aggregation and disorder propensities, although IGORFs display a wider range of aggregation propensities (Fig. 2B). The fact that these CDS, though predicted as foldable, exhibit a certain propensity for aggregation, is in line with several studies which report a high aggregation propensity of proteomes across all kingdoms of life (Langenberg et al. 2020; Greenwald and Riek 2012). This observation has been explained as the side effect of the requirement of a hydrophobic core to form globular structures (Langenberg et al. 2020; Ganesan et al. 2016; Rousseau et al. 2006b). In particular, Langenberg et al. (2020), show a strong relationship between protein stability and aggregation propensity with aggregation prone regions mostly buried into the protein and providing stability to the resulting fold. Like for CDS, these regions, under the hydrophobic effect, may facilitate the stabilization of the IGORF encoded peptide structure. Whether peptides encoded by IGORFs in this intermediate category fold to a specific 3D structure, a partially ordered structure or a “rudimentary fold” which stabilizes itself through oligomerization like the Bsc4 de novo protein (Bungard et al. 2017), would deserve further investigations. Finally, it is interesting to note that the proportions of sequences in the different fold potential categories are different between IGORFs and CDS, with CDS mostly falling into the intermediate HCA score category reflecting that being foldable is a trait which has been strongly selected by evolution. In contrast, IGORFs cover a wide range of fold potentials. It is questionable whether de novo genes mainly originate from IGORFs encoding foldable peptides or from IGORFs whose corresponding peptides subsequently evolved toward foldable peptides regardless of their initial fold potential.
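The assignment of sequences to the three fold potential categories, and the proportions reported for each ORF set, can be sketched as follows. This is only an illustration under stated assumptions: the two cutoffs are passed as parameters because their values are defined in Methods and are not restated here, and the cutoff values and scores used in the example call are hypothetical placeholders.

```python
from collections import Counter

CATEGORIES = ("disorder prone", "foldable", "aggregation prone")

def fold_category(hca_score, low_cut, high_cut):
    """Map an HCA score to one of the three fold potential categories.

    low_cut and high_cut are placeholders for the boundaries derived from
    the reference datasets (disordered regions, globular proteins, TM regions).
    """
    if hca_score < low_cut:
        return "disorder prone"
    if hca_score > high_cut:
        return "aggregation prone"
    return "foldable"

def category_proportions(scores, low_cut, high_cut):
    """Fraction of sequences falling into each category."""
    counts = Counter(fold_category(s, low_cut, high_cut) for s in scores)
    return {cat: counts[cat] / len(scores) for cat in CATEGORIES}

# Hypothetical cutoffs and scores, for illustration only
print(category_proportions([-5.0, 0.0, 1.0, 8.0], low_cut=-2.0, high_cut=6.0))
```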
From IGORFs to de novo genes
Therefore, we traced back the evolutionary events preceding the emergence of 70 de novo genes identified in S. cerevisiae by reconstructing their ancestral IGORFs (ancIGORFs), in order to see whether IGORFs that gave birth to de novo genes encode peptides that display a different foldability potential from all other IGORFs and to characterize the steps preceding the emergence of a novel gene (see Methods and Fig. S5 for more details on the protocol and Table S2 for the list of de novo genes). Figure S6 shows the example of the YOR333C de novo gene, which emerged in the lineage of S. cerevisiae. The corresponding noncoding region in the ancestors preceding the emergence of YOR333C consists of two IGORFs separated by a STOP codon. Interestingly, two nucleotide substitutions which occurred specifically in the S. cerevisiae lineage led respectively to the appearance of a start codon (mutation of Isoleucine into Methionine through an Adenine/Guanine substitution) and the mutation of the STOP codon into a Tyrosine through a Guanine/Cytosine substitution, thereby merging the two consecutive IGORFs. Overall, the 70 de novo genes emerged from a total of 167 ancIGORFs. A minority of de novo genes (16 cases) emerged from a single ancIGORF which covers almost all their sequence (i.e. single-ancIGORF de novo genes), while the majority (54 cases) result from the combination of multiple ancIGORFs through frameshift events and/or STOP codon mutations, as observed with the example of YOR333C (i.e. multiple-ancIGORF de novo genes). Interestingly, the multiple-ancIGORF de novo genes exhibit sequence sizes similar to those of the single-ancIGORF ones (Fig. S7A), though the ancestral IGORFs they originate from are shorter than those that led to single-ancIGORF de novo genes (Fig. S7B). They evoke the expression first model (Schlötterer 2015), where a transcribed and selected IGORF is subsequently combined with neighboring IGORFs through multiple frameshift events and STOP codon mutations.
In contrast, single-ancIGORF de novo genes derive from longer ancestral IGORFs and recall the ORF first model (Schlötterer 2015), which stipulates that the emergence of a long de novo ORF precedes that of the promoter region.
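The two kinds of substitution described for YOR333C can be illustrated on a toy sequence. In the standard genetic code, an Adenine-to-Guanine change turns the Isoleucine codon ATA into the start codon ATG, and a Guanine-to-Cytosine change turns the STOP codon TAG into the Tyrosine codon TAC. The sketch below applies both changes to a hypothetical ancestral region; the sequence, the positions and the helper names are assumptions for illustration and are not the actual YOR333C locus.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(seq):
    """Translate in frame 0; '*' marks a STOP codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def substitute(seq, pos, base):
    """Return seq with the nucleotide at 0-based position pos replaced."""
    return seq[:pos] + base + seq[pos + 1:]

# Hypothetical ancestral region: two stop-free stretches separated by TAG
ancestral = "ATA" + "GCTCTG" + "TAG" + "GTTAAA"
derived = substitute(ancestral, 2, "G")   # ATA (Ile) -> ATG (Met): start codon
derived = substitute(derived, 11, "C")    # TAG (STOP) -> TAC (Tyr): ORFs merge

print(translate(ancestral))
print(translate(derived))
```

After the second substitution the two stretches are read as a single continuous ORF, which is the merging event described above.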
Figure 3 shows the HCA scores of the proteins encoded by the 70 de novo genes (i.e. de novo proteins) and of the peptides encoded by their corresponding ancIGORFs. The majority of de novo proteins (78%) are predicted as foldable, whereas peptides encoded by ancIGORFs display a larger range of HCA scores. However, ancIGORFs are not IGORF-like, being enriched in sequences encoding foldable peptides (75.4% and 66.6% for ancIGORFs and IGORFs respectively, one-proportion z-test, P = 9.5e-3) and depleted in sequences encoding aggregation-prone ones (18.6% and 25.7% for ancIGORFs and IGORFs respectively, one-proportion z-test, P = 2.1e-2). This suggests that IGORFs encoding foldable peptides are more likely to give rise to novel genes.
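For readers who wish to see how such a comparison can be set up, here is a minimal sketch of a one-proportion z-test using the normal approximation. It assumes that the IGORF proportion is taken as the null value, that the sample size is the 167 ancIGORFs, and that the test is one-sided; these choices and the function name are assumptions, and since the inputs are the rounded percentages quoted above, the output is not expected to reproduce the reported P values exactly.

```python
import math

def one_proportion_z_test(p_hat, p0, n, alternative="greater"):
    """Normal-approximation test of an observed proportion against p0.

    Returns (z, p_value). alternative is "greater", "less" or "two-sided".
    """
    se = math.sqrt(p0 * (1 - p0) / n)             # standard error under the null
    z = (p_hat - p0) / se
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)
    if alternative == "greater":
        p_value = upper_tail
    elif alternative == "less":
        p_value = 1 - upper_tail
    else:
        p_value = 2 * min(upper_tail, 1 - upper_tail)
    return z, p_value

# Enrichment in foldable sequences, then depletion in aggregation-prone ones
print(one_proportion_z_test(0.754, 0.666, 167, "greater"))
print(one_proportion_z_test(0.186, 0.257, 167, "less"))
```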
Impact of frameshift events and STOP codon mutations on the fold potential of a de novo protein
Interestingly, the overall relationship between the HCA scores of peptides encoded by ancIGORFs and their corresponding de novo proteins is characterized by a funnel shape revealing that most de novo proteins are foldable regardless of the fold potential of the peptides encoded by their IGORF parents (Fig. 3). Two hypotheses can explain this observation: (i) this funnel mostly results from the amino acid substitutions which have occurred since the fixation of the ancIGORF(s) and which led to an increase in foldability of the resulting de novo genes, (ii) this funnel results from the fact that combining at least one IGORF encoding for a foldable peptide with IGORFs encoding peptides with different fold potentials, will lead to a foldable product. Figure 4A shows the amino acid frequencies of IGORFs, ancIGORFs, de novo genes and CDS. Interestingly, de novo genes display amino acid frequencies similar to those of ancIGORFs (blue circles and grey dots respectively) which overall, follow those of all IGORFs (purple line)(see also Table S1 for the frequencies of hydrophobic residues in the different ORF categories). This result shows that the mutations which occurred since the fixation of the ancIGORF did not change the amino acid composition of the resulting de novo genes and thus, cannot explain the funnel shape observed in Figure 3. We then reasoned that since the divergence of the last common ancestor predating the emergence of de novo genes, single-ancIGORF de novo genes were only subjected to nucleotide substitutions, some of which leading to amino acid mutations, while the multiple-ancIGORF ones have undergone frameshift events and/or STOP codon mutations as well. In order to quantify the impact of these different mutational events on the fold potential of the outcoming de novo proteins, we calculated the correlation between the HCA score of each de novo protein and the peptides encoded by its corresponding ancIGORF(s). 
Figure 4B shows that single-ancIGORF de novo proteins display a clear correlation of HCA scores with those of the peptides encoded by their corresponding ancIGORFs (Pearson correlation coefficient: R = 0.94, P < 2.5e-9). This reveals that the amino acid mutations which occurred between the ancestor and the de novo protein did not affect the fold potential of the ancestral sequences. This suggests that the structural properties of the peptides encoded by the single-ancIGORFs were retained in the resulting de novo proteins. Interestingly, the correlation is weaker for multiple-ancIGORF de novo proteins (Pearson correlation coefficient: R = 0.53, P < 1.6e-7). This can be attributed to the fact that 81% (44/54) of the multiple-ancIGORF de novo proteins are predicted as foldable (white dots included in the blue squares in Figure 4B) while being associated with ancIGORFs of different foldability potentials. Interestingly, all foldable de novo genes include at least one foldable ancestral peptide suggesting that in these cases, combining disordered or aggregation-prone peptides with a foldable one, has led to a foldable de novo protein as well. Figure S5E shows the example of the de novo gene YLL020C and its corresponding ancIGORFs. YLL020C results from the combination through a frameshift event, of a long foldable ancIGORF with a short IGORF predicted as aggregation prone. Interestingly, the resulting de novo gene is also predicted as foldable. Whether the foldable IGORF was the first to be selected and whether selection has only retained the combinations of IGORFs that do not affect the foldability of the preexisting selected product are exciting questions that deserve further investigations.
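The correlation reported here is a standard Pearson coefficient computed over (ancIGORF score, de novo protein score) pairs. The sketch below shows the calculation with the standard library only; the function name and the scores in the example are hypothetical, and it does not address how a multiple-ancIGORF de novo protein is paired with each of its ancestral peptides, which follows the protocol in Methods.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical HCA scores: ancestral peptides vs. resulting de novo proteins
anc_scores = [-3.1, 0.4, 1.9, 2.5, 6.8]
denovo_scores = [-2.7, 0.9, 1.6, 2.9, 6.1]
print(round(pearson_r(anc_scores, denovo_scores), 3))
```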
Translated and ancestral IGORFs display intermediate properties between IGORFs and CDS
Next, we performed two ribosome profiling experiments on S. cerevisiae (strain BY4742) and used additional ribosome profiling data from three other experiments to define two types of translated IGORFs (Radhakrishnan et al. 2016; Thiaville et al. 2016). The first type corresponds to IGORFs that are occasionally translated with a weak translation signal (at least 10 reads in one experiment; see Methods for more details). They are mostly expected to be short-lived in evolutionary history, though some of them may give rise to future novel genes. The second type corresponds to IGORFs with a strong translation signal (more than 30 reads in at least two experiments) and whose translation is strongly favored over the overlapping IGORFs in the other phases (i.e. selectively translated IGORFs) (see Methods for more details). This suggests the optimization of their translation activity, which could be related to the emergence of a functional translation product. We identified 1235 occasionally translated IGORFs and 31 selectively translated IGORFs. Figure 5 shows the boxplot distributions of the sizes of the sequences, clusters and linkers of the translated IGORFs and all other ORF categories (i.e. IGORFs, ancIGORFs, de novo genes and CDS) along with their number of clusters per sequence. In line with Carvunis et al. (2012), Figure 5 reveals, for most properties, a continuum from IGORFs to CDS reflecting the successive stages preceding the emergence of a de novo gene until the establishment of a genuine gene. Interestingly, the selectively translated IGORFs and the ancIGORFs are both longer than IGORFs (Mann-Whitney U-test, P = 3.4e-02 and 1.3e-22 respectively) and display longer linkers (Mann-Whitney U-test, P = 2.6e-02 and 1.8e-02 respectively), though the effect is less pronounced for the 31 selectively translated IGORFs (Fig. 5). This can be explained by the fact that among the latter, only a handful will give rise to a de novo gene, whereas all ancIGORFs have indeed given rise to a de novo gene.
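The two read-count criteria can be expressed as a simple decision rule, sketched below. This is an illustration under assumptions: "in one experiment" is read as "in at least one experiment", the phase-selectivity criterion is passed in as a precomputed boolean because its quantitative definition is given in Methods, and the function name and example counts are hypothetical.

```python
def classify_translation(read_counts, selectively_favored,
                         weak_min=10, strong_min=30, min_strong_experiments=2):
    """Classify an IGORF from its read counts, one count per experiment.

    selectively_favored: True if translation of this IGORF is strongly
    favored over the overlapping IGORFs in the other phases (assumed to be
    computed elsewhere).
    """
    n_strong = sum(1 for count in read_counts if count > strong_min)
    if n_strong >= min_strong_experiments and selectively_favored:
        return "selectively translated"
    if any(count >= weak_min for count in read_counts):
        return "occasionally translated"
    return "no translation signal"

# Hypothetical read counts across the five experiments
print(classify_translation([0, 12, 0, 0, 3], selectively_favored=False))
print(classify_translation([45, 31, 2, 0, 0], selectively_favored=True))
```

In this sketch, an IGORF with strong read counts but no phase selectivity falls back to the occasionally translated class; whether the original pipeline treats such cases the same way is not stated here.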
Strikingly, the HCA cluster size remains invariant across all ORF categories except de novo genes, thereby reinforcing the concept of hydrophobic clusters as elementary building blocks of proteins (Fig. 5). The increase in de novo gene cluster size cannot be explained by the hydrophobic content of de novo genes, which is similar to that of IGORFs and ancIGORFs (Fig. 4A). However, we hypothesize that longer clusters mostly result from the fusion of IGORFs through STOP codon mutations or frameshift events, as observed in the example of the YPR126C de novo gene (Figure S8A).
Interestingly, this gene results from the fusion of three ancIGORFs through STOP codon mutations which led to longer clusters in YPR126C. Similarly, the fusion of ancIGORFs can also give rise to longer linkers, as observed with the YMR153C-A de novo gene (Fig. S8B). The fact that CDS are characterized by longer linkers while their cluster size is similar to that of IGORFs suggests that harboring long linkers is a criterion that has been selected during evolution, whereas this is not the case for long clusters. Having shown that CDS are enriched in hydrophilic residues (Fig. S2), we can hypothesize that mutations of hydrophobic residues toward hydrophilic ones can disrupt long clusters or can switch cluster extremities into linker extremities, thereby decreasing cluster size over time.
Discussion
In this work, we showed that the noncoding genome encodes the raw material for making proteins. In particular, we showed the widespread existence in the noncoding genome of the elementary building blocks of protein structures which consist of hydrophobic clusters that have been shown to be associated in protein coding sequences with regular secondary structures (Bitard-Feildel and Callebaut 2017; Bitard-Feildel et al. 2018; Lamiable et al. 2019). We showed that hydrophobic clusters in noncoding sequences display sizes similar to those observed in CDS and that ancestral IGORFs that gave birth to de novo genes are characterized by a larger number of clusters (Fig. 5). In contrast, CDS are enriched in longer linkers which probably contribute to optimize the local arrangements of secondary structures, provide flexibility to proteins, and specificity in protein interactions. This observation is in line with several studies reporting a central role to loops in protein function and structural innovation (Blouin et al. 2004; Espadaler et al. 2006; Tendulkar et al. 2004; Papaleo et al. 2016). Like Schmitz et al. (2018), we stipulate that the increase in intrinsic structural disorder observed for old genes by Carvunis et al. (2012), is related to the fact that CDS are characterized by longer linkers, thereby inducing inevitably an increase in the disorder score. As a matter of fact, most CDS display HCA scores similar to those of globular proteins, with low disorder propensities (Fig. 2). Overall, we showed an enrichment in polar and charged residues for both linkers and clusters of CDS which is likely accompanied by an increase in specificity of protein folds and interactions through the optimization of the folding and assembly processes (Lumb and Kim 1995).
Nevertheless, how a noncoding sequence becomes coding remains unclear. In this work, we propose the IGORFs as elementary modules of protein birth and evolution. IGORFs can serve as starting points for de novo gene emergence or can be combined together, thus increasing protein sizes, contributing to protein modularity, and leading to more complex protein architectures. They recall the short protein fragments, reported so far, that result from different protein structure decompositions (Nepomnyachiy et al. 2017; Papandreou et al. 2004; Alva et al. 2015; Postic et al. 2017; Kolodny et al. 2020; Berezovsky et al. 2000, 2001; Lamarine et al. 2001). Interestingly, we showed that these elementary modules encompass all the protein structural diversity observed in CDS. A majority of IGORFs encode peptides predicted as foldable while an important fraction displays high HCA scores and aggregation propensities. Some of the latter, though not the majority (28%), are predicted with at least one TM domain and may “safely” locate in membranes as proposed in Vakirlis et al. (2020). The impact of the other high HCA score IGORFs on the cell deserves further investigations. Nevertheless, we can hypothesize that if produced, most of the time, their concentration will not be sufficient to be deleterious (Langenberg et al. 2020). Indeed, it seems that for CDS, a certain degree of aggregation is tolerated at low concentration (see Fig. S9). On the other hand, although IGORFs with intermediate HCA scores may exhibit a certain propensity for aggregation, we can hypothesize that these aggregation-prone regions, under the hydrophobic effect, may play a role in their capacity to fold, in line with the hypothesis of an amyloid origin of the globular proteins (Langenberg et al. 2020; Greenwald and Riek 2012). 
Indeed, the balanced equilibrium of hydrophobic and hydrophilic residues observed for these IGORFs (39.1% of hydrophobic residues to be compared with the 50.8% observed for high HCA score IGORFs) may render possible the burying of aggregation-prone regions and the exposure of hydrophilic residues that is accompanied by an increase in foldability. We can hypothesize that, if produced, these IGORFs could form small compact structures and/or could be stabilized through oligomerization or interactions with other proteins. Precisely, we showed that ancestral IGORFs predating de novo gene emergence are not IGORF-like, but rather enriched in sequences with a high propensity for foldability. This reveals that at least for S. cerevisiae, foldable IGORFs are more likely to give rise to novel genes, though, it must be confirmed for other lineages with different GC contents (Foy et al. 2019; Basile et al. 2017; Heames et al. 2020). Nevertheless, we can reasonably hypothesize that de novo peptides struggle to fold to a well-defined and specific 3D structure as shown with the young de novo gene BSC4 identified in the S. cerevisiae lineage (Bungard et al. 2017; Namy et al. 2003). Recently, Bungard et al. (2017) reported that the Bsc4 protein folds partially to an ordered structure that is unlikely to be unfolded according to Circular Dichroism spectra and bioinformatic analyses. Interestingly, despite this “rudimentary” fold, they show through Mass Spectrometry and denaturation experiments that Bsc4 is able to form compact oligomers. Consistently with Bungard et al. (2017), HCA predicts Bsc4 as foldable with an intermediate HCA score of 1.98 (38% of hydrophobic residues), though it cannot predict whether Bsc4 folds completely or partially to an ordered structure. Overall, except the sequence length, the sequence and structural features of Bsc4 are similar to those of ancIGORFs. 
This suggests that similarly to the Bsc4 protein, young de novo proteins can optimize their fold specificity as well as their interactions with their environment through amino acid substitutions toward hydrophilic residues, thereby leading to the amino acid composition and the well-defined structures of most canonical proteins.
Altogether, these results enable us to propose a model (Fig. 6) which gives a central role to IGORFs in de novo gene emergence and, to a lesser extent, in protein evolution, thus completing the large palette of protein evolution mechanisms such as duplication events, horizontal gene transfer and domain shuffling. Once an IGORF is selected, it can elongate through frameshift events and/or STOP codon mutations, thus incorporating a neighboring IGORF (Fig. 6B). Bartonek et al. (2020) showed that the hydrophobicity profiles of protein sequences remain invariant after frameshift events. Consequently, frameshift events are most of the time expected to incorporate an IGORF that encodes a peptide with a hydrophobicity profile similar to that of the preexisting gene. This suggests that the fold potential is a critical feature that needs to be conserved even in noncoding sequences, being preserved in the +1 and −1 phases through the structure of the genetic code. In addition, we showed that combining IGORFs encoding foldable peptides with IGORFs encoding disordered or aggregation-prone peptides has a low impact on the foldability of the resulting de novo protein (Fig. 4B). We can hypothesize that the newly integrated IGORFs will benefit from the structural properties of the preexisting IGORF network. More generally, proteins can be seen as assemblies, on an ancient protein core, whatever its evolutionary history, of either duplicated or shuffled domains or de novo translated products encoded by neighboring IGORFs (Fig. 6D). Our model is supported by previous observations which show that (i) de novo genes are shorter than old ones (Tautz and Domazet-Lošo 2011; Wolf et al. 2009), (ii) the size of de novo gene exons is similar to that of old genes (Schlötterer 2015; Neme and Tautz 2013; Palmieri et al. 2014), and (iii) novel domains are generally observed in the C-terminal regions (Bornberg-Bauer et al. 2015; Klasberg et al. 2018).
Nevertheless, the increase in linker sizes observed between the different ORF categories remains unclear. It is unknown whether harboring long linkers is accompanied by an increase in foldability and is thus a selected criterion as suggested by the observation that ancIGORFs and selectively translated IGORFs display longer linkers than IGORFs in general. Also, mutational events such as amino acid mutations toward hydrophilic residues and frameshift events or STOP codon mutations may result in longer linkers (Fig. 6BC) as observed in the example of the YMR153C-A de novo gene (Fig. S8B).
In this work, we propose a model that covers the genesis of all the structural diversity observed in current proteins. Although we showed that IGORFs encoding foldable peptides are more likely to give rise to novel genes, disordered or aggregation-prone de novo proteins may emerge occasionally (see squares bottom-left and top-right in Fig. 4B). In particular, in line with Vakirlis et al. (2020), we observe an enrichment in TM-prone sequences for ancIGORFs compared to IGORFs (26.3% and 16.1% respectively, one proportion z test, P = 2e-4), although the majority of peptides encoded by ancIGORFs are not TM-prone. More interestingly, we observe a strong correlation between the foldability propensity of the single ancIGORFs and that of their resulting de novo proteins (Fig. 4B). Moreover, disordered or aggregation-prone de novo proteins are most of the time (79%) associated with ancIGORFs expected to encode disordered or aggregation-prone peptides as well, suggesting that the structural properties of de novo proteins are already encoded in the ancestral peptide they originate from (see, in Fig. 3, the blue dots connected by green and pink lines in the low and high HCA score categories respectively). Whether the fold potential of a starting point IGORF conditions the structural properties of the resulting de novo protein is an exciting question that deserves further study. Indeed, we can hypothesize that once an IGORF is selected, it can elongate over time through the incorporation of neighboring IGORFs, provided that the latter do not affect the fold potential of the preexisting protein. In accordance with Vakirlis et al. (2020), we can reason that once a starting point IGORF is selected, it engenders novel selected effects which, in turn, increase the constraints exerted on it and subsequently reduce the possibility of future changes.
It is thus tempting to speculate that the structural properties of the peptide encoded by the starting point IGORF will be retained during evolution through the elimination of the deleterious IGORFs’ combinations. All these observations suggest that the fold diversity observed in current proteins has been originally inherited from the diversity of the fold potential already encoded in the noncoding genome.
Methods
Datasets
CDS and IGORFs
The CDS were extracted from the genome of Saccharomyces cerevisiae S288C according to the genome annotation of the Saccharomyces Genome Database (Cherry et al. 2012). All unannotated ORFs of at least 60 nucleotides, regardless of whether they start with an AUG codon, were extracted from the 16 yeast chromosomes. Sixty nucleotides correspond to 20 amino acids, which is a reasonable minimum size for a peptide to acquire its own fold (Qiu et al. 2002). The hydrophobicity profiles of overlapping sequences in two different frames were shown to be correlated in Bartonek et al. (2020). Therefore, in order to prevent any bias from CDS hydrophobicity profiles, we only retained ORFs that are free from overlap with another gene, or that partially overlap with a gene provided that the non-overlapping region represents more than 70% of the IGORF sequence.
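As an illustration of the overlap criterion described above, the following minimal Python sketch keeps an ORF only if it is at least 60 nucleotides long and more than 70% of its sequence is free from annotated genes. It is not the ORFtrack implementation: the function names, the 0-based half-open coordinates, and the fact that strand is ignored are all assumptions made for this example.

```python
from typing import Iterable, Tuple

# Hypothetical representation: 0-based, half-open [start, end) coordinates
# on a single chromosome; strand is ignored in this sketch.
Interval = Tuple[int, int]


def non_overlapping_fraction(orf: Interval, genes: Iterable[Interval]) -> float:
    """Fraction of the ORF's nucleotides not covered by any annotated gene."""
    start, end = orf
    covered = set()
    for g_start, g_end in genes:
        lo, hi = max(start, g_start), min(end, g_end)
        covered.update(range(lo, hi))  # empty range when there is no overlap
    return 1.0 - len(covered) / (end - start)


def keep_orf(orf: Interval, genes: Iterable[Interval],
             min_length: int = 60, min_free_fraction: float = 0.7) -> bool:
    """Apply the length and overlap criteria stated in the text."""
    if orf[1] - orf[0] < min_length:
        return False
    return non_overlapping_fraction(orf, genes) > min_free_fraction


if __name__ == "__main__":
    genes = [(80, 200)]
    print(keep_orf((0, 100), genes))   # 80% of the ORF is gene-free
    print(keep_orf((30, 130), genes))  # only 50% of the ORF is gene-free
```

Positions are collected in a set so that genes overlapping each other are not counted twice, which keeps the sketch simple at the cost of memory on very long ORFs.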
Datasets of reference
The disorder dataset consists of 731 disordered regions extracted from intrinsically disordered proteins of the Disprot database (Hatos et al. 2020), that were used for the calibration of HCAtk (Bitard-Feildel and Callebaut 2018). The globular dataset consists of 559 globular proteins extracted from the Protein Data Bank (Berman et al. 2000; Burley et al. 2021) that were used for the calibration of IUPred (Dosztanyi et al. 2005; Dosztányi 2018; Mészáros et al. 2018, 2009). The transmembrane regions dataset gathers 1269 transmembrane regions extracted from the transmembrane proteins contained in the PDBTM database (Tusnády et al. 2004, 2005; Kozma et al. 2012), which are thereby expected to form aggregates in solution. We only retained transmembrane segments longer than 20 amino acids, corresponding to the minimum size of an IGORF. These TM regions only match buried regions of TM proteins and are not expected to display the same sequence and structural properties as the complete membrane proteins they were extracted from. Indeed, membrane proteins, including integral membrane proteins, involve TM domains along with extracellular or cytosolic domains of variable sizes.
Estimation of the fold potential, the aggregation, disorder and TM propensities
The foldability potential was estimated using a score derived from the HCA (Hydrophobic Cluster Analysis) approach using HCAtk (Bitard-Feildel et al. 2018; Bitard-Feildel and Callebaut 2018). HCA divides a protein sequence into (i) clusters gathering strong hydrophobic residues (V, I, L, F, M, Y, W) or cysteines, and (ii) linkers composed of at least 4 non-hydrophobic residues (or a proline). As supported by analysis of experimental 3D structures, hydrophobic clusters match regular secondary structures (single ones or more, if separated by short loops) while the linkers indicate flexible regions generally corresponding to loops. The fold potential of a sequence is determined by its density in hydrophobic clusters but also by the density of hydrophobic amino acids within hydrophobic clusters. It is reflected by the HCA score, which ranges from −10 to +10. Low scores indicate sequences depleted in hydrophobic clusters, which are likely to be disordered, whereas high scores are associated with a very high density in hydrophobic clusters and correspond to sequences expected to form aggregates in solution. The aggregation propensity of a sequence was assessed with TANGO (Fernandez-Escamilla et al. 2004; Linding et al. 2004; Rousseau et al. 2006a). Following the criteria presented in Linding et al. (2004), a residue was considered as participating in an aggregation-prone region if it was located in a segment of at least five consecutive residues which were predicted as populating a β-aggregated conformation for more than 5%. Then, the aggregation propensity of each sequence is defined as the fraction of residues predicted in aggregation-prone segments. The disorder propensity was probed with IUPred (Dosztanyi et al. 2005; Dosztányi 2018; Mészáros et al. 2018, 2009) using the short prediction option.
To be consistent with the criteria used for assessing the aggregation propensity, we considered a residue as participating in a disordered region if it is located in a segment of at least five consecutive residues, each presenting a disorder probability higher than 0.5. Then, the disorder propensity of each sequence is defined as the fraction of residues predicted in disorder-prone segments. The presence of at least one TM domain was predicted with TMHMM (Krogh et al. 2001).
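The aggregation and disorder propensities share the same segment rule: a residue counts only if it belongs to a run of at least five consecutive residues whose per-residue score exceeds a threshold. The short Python sketch below illustrates this rule on a list of per-residue scores. The function name and the input format are assumptions for this example; the per-residue values themselves would come from TANGO or IUPred, which are not called here.

```python
from typing import Sequence


def segment_fraction(scores: Sequence[float], threshold: float,
                     min_run: int = 5) -> float:
    """Fraction of residues located in runs of at least `min_run`
    consecutive residues whose score is strictly above `threshold`."""
    n = len(scores)
    if n == 0:
        return 0.0
    in_segments = 0
    run = 0
    # A sentinel below any threshold closes a run that reaches the last residue.
    for score in list(scores) + [float("-inf")]:
        if score > threshold:
            run += 1
        else:
            if run >= min_run:
                in_segments += run
            run = 0
    return in_segments / n


if __name__ == "__main__":
    # Hypothetical per-residue disorder probabilities (threshold 0.5):
    # a run of five residues above the threshold and a run of two.
    disorder = [0.9] * 5 + [0.1] * 3 + [0.9] * 2
    print(segment_fraction(disorder, threshold=0.5))
```

With per-residue β-aggregation percentages, the same function would be called with `threshold=5.0`, following the 5% criterion given above.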
Protein abundances and amino acid propensities
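Protein abundance data were extracted from the PaxDB database (Wang et al. 2012). In order to depict the impact of the avoidance of nonspecific interactions with the ribosome, we only retained cytoplasmic proteins as annotated in UniProt. The propensity of an amino acid i to be found in a CDS is defined by the log ratio of the frequencies of the amino acid i in CDS versus IGORFs as follows:

$$\mathrm{propensity}(i) = \log\left(\frac{f_i^{\mathrm{CDS}}}{f_i^{\mathrm{IGORF}}}\right)$$

where $f_i^{\mathrm{CDS}}$ and $f_i^{\mathrm{IGORF}}$ denote the frequencies of amino acid i in CDS and in IGORFs, respectively.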
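The log-ratio propensity described above can be computed in a few lines. The sketch below is only an illustration: the function names are hypothetical, frequencies are computed over the pooled residues of each dataset, and the natural logarithm is used, whereas the base of the logarithm is not specified in the text.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def aa_frequencies(sequences: Iterable[str]) -> Dict[str, float]:
    """Amino acid frequencies over all residues pooled from `sequences`."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def cds_propensities(cds: Iterable[str], igorfs: Iterable[str]) -> Dict[str, float]:
    """Log ratio of the frequency of each amino acid in CDS versus IGORFs.
    Assumption: natural logarithm; amino acids absent from either set are skipped."""
    f_cds = aa_frequencies(cds)
    f_igorf = aa_frequencies(igorfs)
    return {aa: math.log(f_cds[aa] / f_igorf[aa])
            for aa in f_cds if aa in f_igorf}


if __name__ == "__main__":
    # Toy sequences, not real data.
    print(cds_propensities(["AAK"], ["AKK"]))
```

A positive value indicates an amino acid that is more frequent in CDS than in IGORFs, and a negative value the opposite.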
Reconstruction of Ancestral IGORFs
To reconstruct the ancIGORFs of S. cerevisiae, we used the genomes of the neighboring species S. paradoxus (Durand et al. 2019), S. arboricola (Yue et al. 2017), S. mikatae, S. kudriavzevii, and S. uvarum (Scannell et al. 2011). Based on four independent studies which each listed de novo genes of the S. cerevisiae S288C genome, we retained all de novo genes identified in at least two studies (Carvunis et al. 2012; Vakirlis et al. 2018; Lu et al. 2017; Wu and Knudson 2018). This led to a total of 171 de novo genes among which we retained those (see the list of the 70 de novo genes in Table S2) for which we were able to identify at least two additional homologous sequences in the neighboring species, among which at least one had to be noncoding in order to reconstruct the corresponding nongenic region in the ancestor. Therefore, we searched for the orthologous genes of the 70 de novo genes in the neighboring species using BLAST (e-value < 1e-2) (Fig. S5A). Then, based on the species tree presented in Figure S5A and starting from the branch of S. cerevisiae, we traced back to the root and identified the first node branching with a branch for which no orthologous gene had been detected (yellow circle in Fig. S5A). We can reasonably hypothesize that the corresponding locus in the ancestor was still nongenic. We then searched for the corresponding nongenic regions in the remaining species with tblastn (e-value < 1e-2). Then, following the protocol described by Vakirlis and McLysaght (2019), the resulting homologous nucleotide sequences and orthologous de novo genes were aligned with MACSE v2.05 (Ranwez et al. 2011, 2018) and the corresponding phylogenetic tree was constructed with PHYML (Guindon et al. 2010). The multiple sequence alignment and its corresponding tree were given as input to PRANK (Löytynoja and Goldman 2010) for the reconstruction of the corresponding ancestral nongenic nucleotide sequence (Fig. S5BC).
Finally, the ancestral nucleotide sequences were translated into the three reading frames. The resulting IGORFs were then aligned with the de novo gene of S. cerevisiae with LALIGN (Huang and Miller 1991); those sharing a homology with it were extracted, and the others were eliminated (Fig. S5D).
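The translation step can be sketched as follows in Python. This is a simplified illustration rather than the pipeline used in the study: the function names are hypothetical, the standard genetic code is assumed, only the three forward frames are considered, ORFs are delimited from STOP to STOP (or by the ends of the sequence), and the 20 amino acid minimum is taken from the IGORF definition given in the Datasets section. The alignment with LALIGN is not reproduced.

```python
from itertools import product
from typing import List, Tuple

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str, frame: int) -> str:
    """Translate `dna` in the given forward frame (0, 1 or 2).
    Codons containing unexpected characters are rendered as 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(frame, len(dna) - 2, 3))


def stop_to_stop_orfs(dna: str, min_aa: int = 20) -> List[Tuple[int, str]]:
    """Return (frame, peptide) pairs for every STOP-free stretch of at
    least `min_aa` residues in the three forward frames."""
    orfs = []
    for frame in range(3):
        for peptide in translate(dna, frame).split("*"):
            if len(peptide) >= min_aa:
                orfs.append((frame, peptide))
    return orfs


if __name__ == "__main__":
    # Toy sequence: ATG, twenty GCT codons, a TAA STOP, then three GCT codons.
    toy = "ATG" + "GCT" * 20 + "TAA" + "GCT" * 3
    for frame, peptide in stop_to_stop_orfs(toy):
        print(frame, peptide)
```

Stretches truncated by the ends of the input sequence are reported like any other ORF in this sketch, which may or may not match how sequence boundaries were handled in the original analysis.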
Ribosome Profiling analyses
Ribosome profiling experiments
Cells were grown overnight in 0.5 liter of liquid glucose-YPD until an OD600 of 0.6. Then, 50 µg/µl of cycloheximide were added to the culture, which was incubated for 5 min and kept at +4°C. The pellet of yeast cells was recovered by centrifugation for 5 min at 5000 rpm in a Beckman F10 rotor at +4°C. Total RNA and polysomes were extracted as previously described (Baudin-Baillieu et al. 2014). Briefly, cells were lysed by vortexing for 15 min in 500 µl of polysome buffer (10 mM Tris-acetate pH 7.5, 0.1 M NaCl and 30 mM Mg-acetate) in the presence of glass beads in an Eppendorf tube, followed by 5 min of centrifugation at 16 krcf at +4°C. Ribosome-protected mRNA fragments (RPFs) were generated by treating the extract with RNase I at a ratio of 15 U per 1 OD260nm of extract for 1 h at 25°C. Monosomes were collected by 2 h 15 min of centrifugation on a 24% sucrose cushion at +4°C in a TLA 110 rotor at 110 krpm. The monosomes were resuspended in 500 µl of polysome buffer. RNA was purified by phenol–chloroform extraction and RPFs of 28-34 nucleotides were recovered by electrophoresis in a 17% acrylamide (19/1), 7 M urea, 1× TAE gel. These RPFs were depleted of ribosomal RNA by treatment with the Ribo-Zero Gold rRNA removal kit for yeast (Illumina). RPF libraries were generated with the NEBNext Small RNA Sample Prep Kit, according to the manufacturer's protocol, and were checked with the Bioanalyzer small RNA kit. Sequencing was performed on a HiSeq 2000 (Illumina) with a 75-nucleotide single-read protocol.
Additional ribosome profiling data
We used three additional experiments retrieved from Radhakrishnan et al. (2016) (GEO accession numbers GSM2147982 and GSM2147983) and Thiaville et al. (2016) (GEO accession number GSM1850252).
Selection of RPF (Ribosome Protected Fragments)
Ribosome profiling reads were mapped on the genome of S. cerevisiae S288C using Bowtie (Langmead et al. 2009). For this study we kept only the 28-mers since on average 90% of them were mapped on a CDS in the correct reading frame (i.e. are in frame with the start codon of the CDS) (see Fig. S10-left).
Periodicity
The periodicity is calculated using a metagene profile. It provides the number of footprints relative to all annotated start codons in a selected window. The metagene profile is obtained by pooling together all the annotated CDS and counting the number of RPFs at each nucleotide position (determined by the P-site of each 28-mer). Results presented in Figure S10 show a clear accumulation of signal over the CDS, and a clear periodicity over the first 100 nucleotides.
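To make the metagene computation concrete, the following Python sketch pools P-site positions relative to annotated start codons and summarizes the periodicity as the fraction of footprints falling in each frame. It is a toy version under several assumptions that are not part of the original protocol: a single chromosome, forward-strand CDS only, P-site positions already computed for each 28-mer, and hypothetical function names and window sizes.

```python
from collections import Counter
from typing import Iterable, List


def metagene_profile(p_sites: Iterable[int], cds_starts: Iterable[int],
                     upstream: int = 20, downstream: int = 100) -> Counter:
    """Count footprints at each offset from the start codon, pooled over
    all CDS. Offset 0 is the first nucleotide of the start codon."""
    site_counts = Counter(p_sites)
    profile = Counter()
    for start in cds_starts:
        for offset in range(-upstream, downstream):
            profile[offset] += site_counts.get(start + offset, 0)
    return profile


def frame_fractions(profile: Counter) -> List[float]:
    """Fraction of footprints in frames 0, 1 and 2 downstream of the start."""
    per_frame = [0, 0, 0]
    for offset, count in profile.items():
        if offset >= 0:
            per_frame[offset % 3] += count
    total = sum(per_frame)
    return [count / total if total else 0.0 for count in per_frame]


if __name__ == "__main__":
    # Toy data: two CDS starting at positions 100 and 200.
    toy_p_sites = [100, 103, 106, 107, 200, 203]
    profile = metagene_profile(toy_p_sites, cds_starts=[100, 200])
    print(frame_fractions(profile))
```

A strong enrichment of frame 0 over frames 1 and 2 is what a periodic signal looks like in this summary.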
Identification of the occasionally translated IGORFs
We retained the IGORFs with at least 10 reads in at least one experiment.
Identification of the selectively translated IGORFs
We kept the IGORFs with at least 30 reads in at least two experiments and for which the fraction of reads in the frame of the IGORF was higher than 0.8, reflecting that the translation of the IGORF is favored over the other overlapping ones.
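The two selection rules can be summarized in a small Python function. This is a sketch built on assumptions that the text does not settle: read counts are taken as all reads mapped on the IGORF in a given experiment, the in-frame fraction is computed on reads pooled over all experiments, and an IGORF meeting both rules is reported under the more stringent label. The function name and input format are hypothetical.

```python
from typing import Sequence, Tuple


def classify_igorf(experiments: Sequence[Tuple[int, int]]) -> str:
    """Classify one IGORF from its per-experiment read counts.

    `experiments` holds one (in_frame_reads, total_reads) pair per
    ribosome profiling experiment (assumed input format)."""
    n_supported = sum(1 for _, total in experiments if total >= 30)
    pooled_in_frame = sum(in_frame for in_frame, _ in experiments)
    pooled_total = sum(total for _, total in experiments)
    in_frame_fraction = pooled_in_frame / pooled_total if pooled_total else 0.0

    if n_supported >= 2 and in_frame_fraction > 0.8:
        return "selectively translated"
    if any(total >= 10 for _, total in experiments):
        return "occasionally translated"
    return "no translation signal"


if __name__ == "__main__":
    print(classify_igorf([(40, 45), (35, 36), (0, 0), (1, 2)]))
    print(classify_igorf([(12, 20), (0, 1)]))
    print(classify_igorf([(3, 5)]))
```

If the in-frame fraction were instead required in each supporting experiment separately, only the computation of `in_frame_fraction` would need to change.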
Statistical analyses
All statistical analyses that aimed at comparing distributions were performed in R, using the two-sided Kolmogorov–Smirnov test to assess whether the HCA score distributions are statistically different and the Mann–Whitney U test for the comparison of the median cluster size, linker size, sequence size and cluster number distributions (two-sided test for the comparison of cluster sizes and one-sided test for the other properties). We used the one proportion z test for the comparison of the proportion of disordered, foldable or aggregation-prone sequences between different ORF categories. In order to circumvent the p-value problem inherent to large samples (i.e. extremely large samples such as the one of IGORFs induce low p-values) (Lin et al. 2013), tests were performed iteratively 1000 times on samples of 500 individuals randomly chosen from the initial sample when it was larger than 500 individuals. The averaged p-value over the 1000 iterations was subsequently calculated.
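The subsampling procedure can be illustrated with the Python sketch below. The original analyses were run in R, so this is not the authors' code: the function names are hypothetical, and the two-sample Kolmogorov–Smirnov p-value is obtained here from a standard asymptotic approximation of the Kolmogorov distribution, which may differ slightly from the values R would report. Any other test returning a p-value can be passed through the `test` argument.

```python
import math
import random
from bisect import bisect_right
from typing import Callable, Sequence


def ks_two_sample_p(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided two-sample KS p-value (asymptotic approximation)."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    # Largest gap between the two empirical CDFs, evaluated at every data point.
    d = max(abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m)
            for v in xs + ys)
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:  # the series below converges poorly; the p-value is ~1 here
        return 1.0
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, 101))
    return min(max(p, 0.0), 1.0)


def subsampled_p_value(a: Sequence[float], b: Sequence[float],
                       test: Callable[[Sequence[float], Sequence[float]], float] = ks_two_sample_p,
                       size: int = 500, n_iter: int = 1000, seed: int = 0) -> float:
    """Average the p-value over `n_iter` random subsamples of `size`
    individuals, drawn only from samples larger than `size`."""
    a, b = list(a), list(b)
    if len(a) <= size and len(b) <= size:
        return test(a, b)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_iter):
        sub_a = rng.sample(a, size) if len(a) > size else a
        sub_b = rng.sample(b, size) if len(b) > size else b
        total += test(sub_a, sub_b)
    return total / n_iter


if __name__ == "__main__":
    rng = random.Random(1)
    large = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    shifted = [rng.gauss(0.5, 1.0) for _ in range(300)]
    print(subsampled_p_value(large, shifted))
```

Fixing the random seed makes the averaged p-value reproducible from one run to the next.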
Availability of data
The raw ribosome profiling data are currently being deposited on the NCBI GEO platform but are available for referees at: http://bim.i2bc.paris-saclay.fr/anne-lopes/data_Papadopoulos/Riboseq_data_Papadopoulos/ Raw and calculated data along with custom scripts used in this study are available as Supplemental Data files. The extraction of IGORFs and their structural properties (foldability potential, disorder and aggregation propensities) were calculated using our in-house programs (ORFtrack and ORFold respectively) available in the ORFmine package at: https://github.com/i2bc/ORFmine
Supplemental Material
Supplemental Figures are stored in Papadopoulos_Supplemental_Figures.pdf
Supplemental Tables are stored in Papadopoulos_Supplemental_Tables.pdf
Funding
CP's work was supported by a French government fellowship.
Competing Interest Statement
The authors declare no conflict of interest.
Author contributions
CP, MR, IH performed research. CP, MR, IH, ON, AL analyzed data. CP, AL designed research. CP, IC, JCG, ON, OL, AL wrote the paper. AL conceived the project.