Abstract
Little is known about the rate of emergence of genes de novo, how they spread in populations and what their initial properties are. We examined wild Saccharomyces paradoxus populations to characterize the diversity and turnover of intergenic ORFs over short evolutionary time-scales. We identified ∼34,000 intergenic ORFs per individual genome for a total of ∼64,000 orthogroups, which resulted from an estimated turnover rate smaller than the rate of gene duplication in yeast. Hundreds of intergenic ORFs show translation signatures similar to those of canonical genes, but with lower translation efficiency, which could reduce their potential production cost or simply reflect a lack of optimization. Translated intergenic ORFs tend to display low expression levels with sequence properties that are on average closer to expectations based on intergenic sequences. However, some predicted de novo polypeptides with gene-like properties emerged from ancient as well as recent birth events, illustrating that the raw material for functional innovations may appear even over short evolutionary time-scales. Our results suggest that variation in the mutation rate along the genome impacts the turnover of random polypeptides, which may in turn influence their early evolutionary trajectory. Whereas low mutation rate regions allow more time for random intergenic ORFs to evolve and become functional before being lost, mutation hotspots allow for the rapid exploration of the molecular landscape, thereby increasing the probability of acquiring a polypeptide with immediate gene-like properties and thus functional potential.
Introduction
The emergence of new genes is a driving engine for phenotypic evolution. New genes may arise from pre-existing gene structures through genome rearrangements, such as gene duplication followed by neo-functionalization, gene fusion or horizontal gene transfer, or de novo from previously non-coding regions (Chen et al. 2013). De novo gene birth was long considered unlikely to occur (Jacob 1977) until the last decade, during which comparative genomics approaches shed light on the role of intergenic regions as a regular source of new genes (Tautz and Domazet-Loso 2011; Landry et al. 2015; Schlotterer 2015; McLysaght and Hurst 2016). Compared to other mechanisms, de novo gene origination is a source of complete innovation because the emerging genes come from mutations alone, not from the evolution of preexisting functions (McLysaght and Hurst 2016).
Non-coding regions undergo three major steps to become gene-coding, the first two occurring in any order. The first is the acquisition of an open reading frame (ORF) through mutations conferring a gain of in-frame start and stop codons. The second is the acquisition of regulatory sites that allow the ORF to be transcribed and translated and thus to produce a de novo polypeptide. The third step corresponds to the retention of this structure by natural selection because of its positive effects on fitness (Schlotterer 2015; Nielly-Thibault and Landry 2018). The subsequent maintenance of the structure by purifying selection will lead to the gene being shared among species, as we see for groups of orthologous canonical genes. There are many ORFs associated with ribosomes in non-annotated regions, supporting their translation and the potential to produce de novo polypeptides, which are the raw material necessary for de novo gene birth (Ingolia et al. 2009; Wilson and Masel 2011; Carvunis et al. 2012; Ruiz-Orera et al. 2014; Lu et al. 2017; Vakirlis et al. 2017; Ruiz-Orera et al. 2018). We distinguish de novo polypeptides, encoded by intergenic ORFs, from de novo genes because most de novo polypeptides could be non-functional and seem to evolve neutrally (Ruiz-Orera et al. 2018).
Many putative de novo genes have been identified (McLysaght and Hurst 2016), but there is generally limited information on their translation and few have been functionally characterized (Begun et al. 2006; Levine et al. 2006; Begun et al. 2007; Cai et al. 2008; Zhou et al. 2008; Knowles and McLysaght 2009; Li et al. 2010; Baalsrud et al. 2017). Young de novo genes are generally small with a simple intron-exon structure, they are less expressed on average than canonical genes and they may diverge rapidly compared to older genes (Wolf et al. 2009; Tautz and Domazet-Loso 2011), which makes it more challenging to differentiate young de novo genes from non-functional ORFs (McLysaght and Hurst 2016). The absence of sequence similarity between a gene and known genes in other species is not evidence of de novo origination, because it may also be due to rapid divergence between two orthologs. This confusion has resulted in spurious annotations of de novo origin, especially over longer evolutionary time-scales (Gubala et al. 2017). One way to overcome this problem is to compare closely related populations or species and to identify the homologous non-coding sequences, through synteny for instance, which may give access to the most recent mutation that caused the gene birth (Begun et al. 2006; Levine et al. 2006; Begun et al. 2007; Cai et al. 2008; Zhou et al. 2008; Knowles and McLysaght 2009; Li et al. 2010).
The process of de novo gene birth was framed under hypotheses that consider the role of selection as acting at different time points. The continuum hypothesis involves a gradual change between non-genic to genic characteristics as observed for intergenic ORF sizes for instance (Carvunis et al. 2012). The preadaptation hypothesis predicts extreme levels of gene-like traits in de novo young genes, as observed for the intrinsic structural disorder, which is higher in de novo young genes compared to ancient ones in some species (Wilson et al. 2017). These two models depend on the distribution of random polypeptide properties and the position of the ones with an adaptive potential in this distribution. Under the continuum hypothesis, selection acts on polypeptides with intergenic-like characteristics on average, which will mature progressively towards gene-like properties. Under the pre-adaptation hypothesis, selection acts on polypeptides with gene-like characteristics located at the extremes of the distribution, and which are more favorable for gene birth (i.e. maintenance by natural selection) than the average of non-coding sequences. If they are preferentially maintained, this creates a gap between the distribution of random polypeptides properties and recently emerging de novo genes.
Another question of interest is whether local composition along the genome can accelerate gene birth. The size of intergenic regions, their GC composition and the genomic context (abundant spurious transcription) may affect the birth rate of de novo genes (Vakirlis et al. 2017; Nielly-Thibault and Landry 2018). The comparison of yeast species showed that de novo genes are preferentially found in GC-rich genomic regions, in recombination hotspots and in the proximity of divergent promoters (Vakirlis et al. 2017). It was also demonstrated that the mutation rate varies along chromosomes; for instance, it is lower closer to replication origins, especially those that fire early in S phase (Chuang and Li 2004; Stamatoyannopoulos et al. 2009; Lang and Murray 2011; Agier and Fischer 2012). Genomic regions with an elevated mutation rate may favor the emergence of de novo genes but also their loss in the absence of selection, affecting overall turnover.
It is now accepted that de novo genes continuously emerge (Tautz and Domazet-Loso 2011; Neme and Tautz 2013; Palmieri et al. 2014; Vakirlis et al. 2017). They are also frequently lost, which explains the constant number of genes observed over time (Palmieri et al. 2014). Because most studies focus on inter-species comparisons, the extent of polymorphism within species in the number of de novo genes is largely unknown, except in a few cases (Zhao et al. 2014; Li et al. 2016). The rate at which new putative polypeptides appear from non-coding DNA and spread in populations, and the likelihood that they become functional and are retained by natural selection before they are lost, are of strong interest for understanding the dynamics of the de novo gene birth process. Another pressing question is what the initial properties are of the peptides produced during the neutral exploration period within species. Population data may address this issue and allow precise monitoring of the turnover of recently evolving polypeptides over short evolutionary time-scales.
Here we explore the contribution of intergenic diversity to the emergence and retention of the raw material for de novo gene birth in natural populations of Saccharomyces paradoxus. We characterize the repertoire and turnover of ORFs located in intergenic regions (named hereafter iORFs), as well as the associated putative de novo polypeptides using ribosome profiling, and compare how the properties of putative polypeptides covary with their age and expression. We observe a continuous emergence of de novo polypeptides that are segregating within S. paradoxus. Compared to canonical genes, de novo polypeptides are on average smaller and less expressed and show a lower translation efficiency. Translation efficiency tends to decrease in highly transcribed iORFs, suggesting that regulation acting at the translational level buffers the amount of polypeptide produced, which may reflect a lack of optimization. De novo polypeptides display a high variability for various properties, and some share gene-like characteristics, suggesting that their functional potential arises directly from non-coding DNA.
Results
A large number of intergenic ORFs segregates in wild S. paradoxus populations
We first characterized iORF diversity in wild S. paradoxus populations. We used genomes from 24 strains that are structured in three main lineages named SpA, SpB and SpC (Charron et al. 2014; Leducq et al. 2016). Two S. cerevisiae strains were included as outgroups: the wild isolate YPS128 (Sniegowski et al. 2002; Peter et al. 2018) and the reference strain S288C. These lineages cover different levels of nucleotide divergence, ranging from ∼ 13 % between S. cerevisiae and S. paradoxus to ∼2.27 % between the two closest SpB and SpC lineages (Kellis et al. 2003; Leducq et al. 2016). We used microsynteny to identify and align homologous non-genic regions between pairs of conserved annotated genes (Fig. S1 and Methods). We identified 3,781 orthologous sets of intergenic sequences representing a total of ∼ 2 Mb, with a median size of 381 bp (Fig. S1 and S2). iORFs were annotated on aligned sequences using a method similar to the one employed by Carvunis et al. (2012), that is the first start and stop codons in the same reading frame not overlapping with known features, regardless of the strand, and with no minimum size. We then classified iORFs according to their conservation level among strains (Fig. S1 and Methods). Because the annotation was performed on aligned sequences, we could precisely detect the presence/absence of orthologous iORFs among strains, based on the conservation of an iORF with the same start and stop positions without disruptive mutations in between. We used S. cerevisiae as an outgroup and removed iORFs present only in this species to focus on S. paradoxus diversity. However, we conserved iORFs present both in S. cerevisiae and in at least one S. paradoxus strain to keep the inter-species conservation.
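The annotation rule described above (first start codon to the next in-frame stop codon, on either strand, with no minimum size) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the pipeline used in this study: it works on a single ungapped sequence rather than on alignments, ignores overlap with annotated features, and the function names `find_orfs_one_strand` and `find_iorfs` are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs_one_strand(seq):
    """Return (start, end) pairs, end exclusive and stop codon included.

    In each of the three frames, an ORF opens at the first ATG that follows
    the previous in-frame stop and closes at the next in-frame stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                orfs.append((start, i + 3))
                start = None
    return orfs


def find_iorfs(seq):
    """Annotate ORFs on both strands; coordinates refer to the forward strand."""
    seq = seq.upper()
    n = len(seq)
    found = [(s, e, "+") for s, e in find_orfs_one_strand(seq)]
    for s, e in find_orfs_one_strand(reverse_complement(seq)):
        # Map reverse-strand coordinates back onto the forward strand.
        found.append((n - e, n - s, "-"))
    return sorted(found)
```

Because coordinates are reported on the forward strand, ORFs annotated this way in aligned sequences could be compared between strains by their start and stop positions, which is the idea behind the orthogroup definition used here.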
We annotated 34,216 to 34,503 iORFs per S. paradoxus strain, for a total of 64,225 orthogroups annotated at least in one S. paradoxus strain (Table 1). This represents a density of about 17 iORFs per Kb. The iORFs set shows about 6 % conserved among S. cerevisiae and S. paradoxus strains, and 15 % specific and fixed within S. paradoxus. The remaining 79 % are still segregating within S. paradoxus (Fig. 1A, 1B and Table 1).
To understand how iORF diversity changes over a short evolutionary time scale, we estimated iORFs’ age and turnover using ancestral sequence reconstruction (Fig. 1C) (see Methods). Because polymorphism within lineages (SpA, SpB or SpC) (Leducq et al. 2016) may affect the topology of the phylogeny (although most diversity in this group is due to divergence among lineages), we used only one strain per lineage (YPS128 (S. cerevisiae), YPS744 (SpA), MSH-604 (SpB) and MSH-587-1 (SpC)) to reconstruct ancestral sequences at two divergence nodes that we labeled N1 for SpB-SpC divergence and N2 for SpA-SpB/C divergence. These strains contain 58,952 iORF orthogroups after removing the polymorphic iORFs that are absent from all four selected strains. Reconstructed sequences were included in intergenic alignments of extant strains and were used to detect the presence or absence of ancestral iORFs at each node (Fig. 1C, S1 and Methods).
We estimated the age of the 58,952 iORFs and annotated the 2,291 iORFs detected only in ancestral sequences. 55 % of iORFs were present at N2 (the oldest age category) and are represented in each conservation group depending on iORF loss events occurring after N2 (Fig. 1D and Table 2). We observed a continuous emergence of iORFs with 6,782 iORF gains between N2 and N1 and 5,324 to 8,454 along terminal branches. As expected, the numbers of iORF gains and losses are correlated and increase with branch length (Fig. 1E). We estimated the rates of emergence and loss at 0.28 +/- 0.01 and 0.27 +/- 0.008 ORFs per nucleotide substitution, respectively. An ORF is thus gained or lost on average about every 3.5 substitutions. The de novo ORF gain rate, estimated at around 1.1 × 10^-3 ORFs per genome per cell division, is about one order of magnitude smaller than the gene duplication rate in S. cerevisiae, estimated at 1.9 × 10^-2 genes per genome per cell division (Lynch et al. 2008).
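The arithmetic behind these comparisons can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the point estimates quoted in the text; the variable names are illustrative and the standard errors are ignored.

```python
from math import log10

# Point estimates quoted in the text (events per nucleotide substitution).
gain_rate = 0.28
loss_rate = 0.27

# Expected number of substitutions between two events of the same kind.
subs_per_gain = 1 / gain_rate
subs_per_loss = 1 / loss_rate

# Rates per genome per cell division.
orf_gain_per_division = 1.1e-3
gene_duplication_per_division = 1.9e-2  # Lynch et al. 2008, as cited above

fold_difference = gene_duplication_per_division / orf_gain_per_division
orders_of_magnitude = log10(fold_difference)

print(f"substitutions per gain: {subs_per_gain:.2f}")
print(f"substitutions per loss: {subs_per_loss:.2f}")
print(f"duplication / ORF gain: {fold_difference:.1f}-fold")
```

The reciprocals of the two rates are close to the figure of about 3.5 substitutions per event given in the text, and the fold difference between the two per-division rates is slightly above ten, that is, roughly one order of magnitude.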
We considered that iORFs with no detected ancestors appeared on terminal branches. Among them, 91 to 93 % are present only in one lineage, which is consistent with the expected conservation pattern for recently emerging iORFs (Table 2). The absence of ancestor for the remaining 7 to 9% iORFs present in more than one lineage can be due to convergence on terminal branches, made possible by the relatively high turnover rate. Convergence events may particularly occur if two lineages acquire independently small indels, not necessarily at the same position but in the same iORF, leading to the same frameshift and resulting in stop codon changes. Finally, regions with a higher rate of evolution may more likely lead to ancestral sequence reconstruction errors and to a small overestimation of the gain rate but this effect should be negligible because of the small number of iORFs with ambiguous age estimations.
As previously observed, iORFs tend to be small, with a median size of 43 bp compared to a median gene size of 1,287 bp for known genes in the reference S. cerevisiae (Fig. 2A). Each conservation group of iORFs also contains iORFs longer than the smallest annotated genes in S. cerevisiae, revealing an extended set of iORFs with coding potential. In our study, overlapping iORFs between strains that share the same start but a different stop position (or the reciprocal) were classified as different orthogroups because of their different resulting sizes. We investigated the evolution of iORF sizes along the phylogeny by connecting overlapping iORF orthogroups in extant strains with their ancestors based on the conservation of their start and/or stop positions (Fig. 2B). The majority of iORF orthogroups (65%) were conserved until N2 (Fig. 2B). We identified 19% of iORFs successively connected to N1 and N2 by one or two size changes along the phylogeny (Fig. 2B). Note that a size change is considered as an iORF loss event generally accompanied by the gain of another iORF, which is consistent with the similar iORF gain and loss rates estimated. iORFs detected only on terminal branches with no ‘connected’ ancestor tend to display intermediate iORF size values compared to iORFs of conserved size and iORFs resulting from size changes (Fig. 2C).
Size changes are also mainly small, even if some extreme cases are observed (Fig. 2D-E). Compared to smaller iORFs, longer iORFs are less conserved and more subject to size changes (Chi-square test, p-value < 2.2 × 10^-16, Fig. 2C and 2F), which might be explained by a higher turnover rate of longer iORFs. This suggests a larger target for mutation accumulation between the start and the stop codon. Longer iORFs also tend to decrease in size, which might be due to a higher chance of acquiring a disruptive mutation resulting in a size decrease, and to intergenic size constraints limiting the maximum iORF sizes (Fig. 2E and S2).
Altogether, these analyses show that the iORF repertoire of yeast populations is the result of frequent gain and loss events, and of size changes. 56 % of ancient iORFs detected at N2 are still segregating within S. paradoxus, showing the role of wild populations as a reservoir of iORFs that can be used to address the dynamics of early de novo gene evolution.
Intergenic ORFs frequently show signatures of active translation
We performed ribosome profiling to identify iORFs that are translated and that thus putatively produce polypeptides. Only iORFs with a minimum size of 60 bp were considered for this analysis. Among them, 12 iORFs displayed a significant BLAST hit against the proteomes of 417 species, including 237 fungi, and were removed from the downstream analysis (see Methods). The set examined consists of 19,689 iORFs. We prepared ribosome profiling sequencing libraries for four strains, one belonging to each lineage or species: YPS128 (S. cerevisiae), YPS744 (SpA), MSH-604 (SpB) and MSH-587-1 (SpC), in two biological replicates. All strains were grown in synthetic oak exudate (SOE) medium (Murphy et al. 2006) to be close to the natural conditions in which de novo genes could emerge in wild yeast strains.
Typically, a ribosome profiling density pattern is characterized by a strong initiation peak located at the start codon followed by a trinucleotide periodicity at each codon of protein-coding ORFs. We used this feature to identify a set of translated iORFs for which we compared translation intensity with annotated genes. We first detected peaks of initiation sites in the start codon region. As expected, the number of ribosome profiling reads located at the start codon position is lower for iORFs than for annotated genes (Fig. 3A). However, there is a significant overlap between the two read density distributions, illustrating a similar read density between highly expressed iORFs and lowly expressed genes. We observed an initiation peak for 73.9 to 87.9 % of standard annotated genes depending on the haplotype, and for 1.4 to 6.9 % of iORFs (Table 3 and Fig. 3B). This suggests that at least 20% of translated iORFs could be missed using this approach, because of expression levels that are too low or because of condition-specific expression. Detected peaks were classified using three levels of precision and intensity: ‘p1’ for less precise peaks (+/- 1 nt relative to the first base of the start codon), ‘p2’ for precise peaks (detected at the exact first base of the start codon) and ‘p3’ for precise peaks with strong initiation signals, characterized here by the highest read density in the ORF (see Methods). Among all iORFs with a detected initiation peak, 30, 35 and 34% belong to p1, p2 and p3, respectively. A comparable distribution (Chi-square test, p-value = 0.59) was observed for genes, with 24, 40 and 36% in each precision group, showing that the precision levels used in our analysis were reliable.
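The three precision classes can be made concrete with a small Python sketch. The exact peak-calling procedure is given in Methods and is not reproduced here: the function name `classify_initiation_peak`, the search window and the minimum read threshold are assumptions introduced only for illustration, and the input is taken to be a list of per-nucleotide read counts.

```python
def classify_initiation_peak(counts, start, orf_end, window=3, min_reads=1):
    """Classify the initiation peak of one ORF as 'p1', 'p2', 'p3' or None.

    counts    : per-nucleotide read counts along the region
    start     : index of the first base of the start codon
    orf_end   : index one past the last base of the ORF
    window    : hypothetical search window around the start codon (in nt)
    min_reads : hypothetical minimum height for a peak to be called
    """
    lo = max(0, start - window)
    hi = min(len(counts), start + window + 1)
    local = counts[lo:hi]
    peak_height = max(local)
    if peak_height < min_reads:
        return None
    # Position of the highest local signal; ties resolve to the leftmost one.
    offset = lo + local.index(peak_height) - start
    if offset == 0:
        # Precise peak; 'p3' if it is also the highest density in the ORF.
        if peak_height >= max(counts[start:orf_end]):
            return "p3"
        return "p2"
    if abs(offset) == 1:
        return "p1"
    return None
```

With this convention, a peak on the exact first base is labeled p3 only when no position inside the ORF carries more reads, which mirrors the description of p3 as the strongest initiation signal.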
We measured codon periodicity, which is illustrated by an enrichment of reads at the first nucleotide of each codon in the first 50 nt excluding the start codon. As for the start codon region, the number of ribosome profiling reads is lower for iORFs compared to known genes (Fig. 3C). Among the features with a detected initiation peak, 91.8 to 94.8% of genes and 29.4 to 41 % of iORFs show a significant codon periodicity per haplotype (Table 3 and Fig. 3D). The number of detected translation signals is lower in strain MSH-604, which is most likely due to a lower number of reads obtained for this strain and the use of raw read density in this analysis (see Methods). iORFs with an initiation peak and a significant periodicity in at least one strain were considered as significantly translated and labeled iORFsT1, whereas iORFs with no significant translation signatures were labeled iORFsT0. We performed a metagene analysis on annotated genes and iORFsT1, which revealed a similar ribosome profiling read density pattern between lowly expressed genes and iORFsT1, and confirmed a distinct codon periodicity with a significant translation signature for iORFsT1 (Fig. 3E and S3). The resulting iORFsT1 set contains 418 iORF orthogroups with sizes ranging from 60 to 369 nucleotides. They represent a small fraction (2.12 %) of the 19,689 iORF orthogroups longer than 60 nt. This percentage could be a conservative estimate because the detection depends on the chosen methods and filters and on the ribosome profiling sequencing depth. Also, some iORFs may be expressed under other environmental conditions. Overall, for a genome of about 5,000 genes, the roughly 400 de novo iORFs that show significant translation signatures, and which may produce de novo polypeptides, could be an important contribution to the proteome diversity of these natural populations.
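One simple way to test for an enrichment of reads at the first nucleotide of each codon is sketched below in Python. It is an illustration only: the statistical test actually used is described in Methods, and the one-sided binomial test against a null proportion of 1/3, the function names and the 0.05 threshold are assumptions made for this sketch.

```python
from math import comb


def frame_counts(counts_from_start, n_nt=50):
    """Sum reads per codon position over the first n_nt nt after the start codon.

    counts_from_start[0] is the first base of the start codon, so positions
    0 to 2 (the start codon itself) are skipped.
    """
    frames = [0, 0, 0]
    for pos in range(3, min(3 + n_nt, len(counts_from_start))):
        frames[pos % 3] += counts_from_start[pos]
    return frames


def binomial_upper_tail(k, n, p=1 / 3):
    """P(X >= k) for X ~ Binomial(n, p); adequate for modest read counts."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def has_periodicity(counts_from_start, alpha=0.05):
    """True if reads are significantly enriched at first codon positions."""
    frames = frame_counts(counts_from_start)
    total = sum(frames)
    if total == 0:
        return False
    return binomial_upper_tail(frames[0], total) < alpha
```

For deep coverage, the exact sum above becomes slow and can overflow when large binomial coefficients are converted to floats, so a statistics library would be a better choice than this hand-written tail probability.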
Translation does not affect intergenic ORFs retention
We looked for an association between translation and iORF retention, which could be a sign that de novo polypeptides encoded by iORFsT1 contribute to a fitness increase (or decrease) and therefore have beneficial (or deleterious) biochemical activities. We compared the numbers of iORFsT1 and iORFsT0 with respect to their age and conservation. We observed a similar conservation distribution for iORFsT1 and iORFsT0 per age category (NS Chi-square test, Fig. 3F-G and S3F-G). This observation suggests that iORFs that become translated are not preferentially conserved (or eliminated) compared with supposedly neutral iORFsT0, and that selection acting on them is overall weak or absent.
We compared iORFsT1 and iORFsT0 size distributions, as well as the distribution of their size changes relative to their ancestors, to examine whether translation may influence iORF size evolution. More generally, iORFsT1 tend to be smaller compared to iORFsT0 of the same age, especially for those present at N2 and on terminal branches (Fig. S4C). Note that the absence of an effect at N1 may be attributed to a low detection power due to fewer iORFsT1 detected at N1 in our dataset (Table 2). Translation does not influence the distribution of iORF size changes, which are on average similar for iORFsT0 and iORFsT1 (NS T-test, Fig. S4B). In addition, longer iORFsT1 tend to be less subject to size changes compared to iORFsT0 of the same size range (Chi-square test p-value = 0.005, Fig. S4A). By comparison with the fitness effect distribution of new mutations, characterized by a large number of mutations of neutral or small effects and few mutations of large effect (Bataillon and Bailey 2014), we hypothesized that only a small fraction of iORFsT1 size changes may be of strong effect and could influence the retention pattern compared to most nearly neutral iORFsT0.
We assumed that most polymorphisms located in intergenic regions are neutral, so we used the proportion of polymorphic sites in each syntenic intergenic region as a measure of the SNP density per genomic intergenic region (see Methods). Recent iORFsT1 appearing along terminal branches are located in genomic regions with more polymorphisms compared to iORFsT0 (Fig. S4D), suggesting that recent translated iORFs are more likely to occur in regions with higher substitution rates. We tested for an effect of GC content on the distribution of iORFsT1 in the genome. iORFsT1 are not preferentially located in GC-rich regions compared with iORFsT0 (Fig. S4E). We removed sequences of low complexity in our filtering methods, which may have biased the average GC content in our data.
Some intergenic translated ORFs display strong expression changes between lineages
iORFsT1 came from ancient and recent iORF gains, showing a regular supply of de novo putative polypeptides in intergenic regions (Table 2). We looked for lineage-specific emerging putative polypeptides among iORFsT1, based on significant differences in ribosome profiling coverage between each pair of haplotypes. Note that a translation gain or increase may be due to an iORF gain, to a transcription/translation increase, or to both. 33 iORFsT1 display a significant lineage-specific expression increase, with 20, 5 and 8 iORFsT1 in SpA, SpB and SpC respectively (Fig. 4 and S5). Among them, 24 are accompanied by a lineage-specific presence of the considered iORFsT1, of which 16 were acquired along terminal branches, like the SpB-specific iORF_70680 (Fig. 4). Nearly 70 % of strong lineage-specific expression patterns are correlated with the presence of the iORFT1 in only one lineage, suggesting that iORF turnover explains most translation differences, rather than a lineage-specific expression increase in a region already containing a conserved iORFT1, for instance. Three iORFsT1 are also more expressed in both SpB and SpC strains compared to SpA and Scer, suggesting an event occurring along branch b2 (Fig. 1C, 4 and S5). We also detected older expression gain/increase events specific to S. paradoxus relative to S. cerevisiae for 9 iORFsT1, for instance iORF_69174 (Fig. 4 and S5). This result shows that ancient iORFsT1 may also be conserved over longer evolutionary time-scales, potentially under the action of selection, although there is no evidence for a role of selection.
We observed specific translation patterns resulting from iORFs gains and/or expression increases at different times along the phylogeny. The resulting set of emerging polypeptides of different ages provides key material to examine the properties of de novo polypeptides at the onset of gene birth.
Translational buffering acts on intergenic ORFs
We compared the expression of ancient and recent iORFsT1 with that of known genes to examine whether de novo polypeptides display gene-like expression levels. We looked at the translational and transcriptional levels using ribosome profiling and total RNA sequencing libraries. We estimated translation efficiency (TE) per gene and iORFT1 as the ratio of ribosome profiling reads (named RPFs for ribosome profiling footprints) over total mRNA. This ratio (in log2) is positive when the number of translating ribosomes increases per molecule of mRNA, illustrating a more effective translation per mRNA unit (Ingolia et al. 2009). Note that RPF and total RNA coverages were calculated on the first 60 nt for genes and iORFsT1 to reduce the bias introduced by the higher number of reads at the initiation codon, which tends to increase TEs in short iORFsT1 compared to longer genes. After this correction, TEs remain significantly correlated with gene size, but the effect is small and should not interfere with our analysis (Fig. 5E).
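The TE calculation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that both coverage vectors are already normalized for library size and start at the first base of the start codon; the function name `translation_efficiency` and the handling of zero coverage are assumptions, not details taken from Methods.

```python
from math import log2


def translation_efficiency(rpf_coverage, mrna_coverage, n_nt=60):
    """Return log2(RPF / mRNA) over the first n_nt nucleotides of a feature.

    Both inputs are per-nucleotide coverages, assumed to be normalized for
    sequencing depth. Returns None when either signal is absent, because the
    log ratio is then undefined.
    """
    rpf = sum(rpf_coverage[:n_nt])
    mrna = sum(mrna_coverage[:n_nt])
    if rpf <= 0 or mrna <= 0:
        return None
    return log2(rpf / mrna)


# A feature with twice as many footprints as mRNA reads has TE = 1,
# and one with four times fewer footprints has TE = -2.
high_te = translation_efficiency([2.0] * 100, [1.0] * 100)
low_te = translation_efficiency([1.0] * 60, [4.0] * 60)
```

Restricting both sums to the same 60-nt window is what makes short iORFsT1 and long genes comparable, since the initiation peak then contributes to every feature in the same way.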
As expected for intergenic regions, iORFsT1 were less transcribed and translated than genes (T-test, both p-values < 2.2 × 10^-16, Fig. 5A-B). We also observed a significantly lower TE on average (T-test, p-value = 4.8 × 10^-9, Fig. 5C) for iORFsT1 compared to genes, suggesting that young iORFsT1 are less actively transcribed and translated than genes of the same size, except for longer iORFsT1 appearing on terminal branches, which display higher TE levels. More generally, the most transcribed iORFsT1 display a more reduced TE compared to genes (Fig. 5D, ANCOVA, p-value < 2.2 × 10^-16). The consequence of this buffering effect acting at the post-transcriptional level is a reduction in the number of polypeptides translated per molecule of mRNA. The buffering of highly transcribed iORFsT1 may be due to rapid selection to reduce the production of toxic polypeptides or may simply be a mechanistic consequence of a recent transcription increase without translation optimization. The buffering effect is similar among iORFsT1 of different ages, with no significant pairwise differences between buffering slopes (data not shown), which supports the mechanistic consequence hypothesis. We also noted a significant overlap between expression levels and TEs in iORFsT1 and genes, which means that some iORFsT1 have gene-like expression levels.
Translated intergenic polypeptides display a high variability for gene-like traits
A recent study suggested that selection favors pre-adapted de novo young genes with a high level of protein disorder (ISD) compared to old genes, whereas random polypeptides in intergenic regions are on average less disordered (Wilson et al. 2017). This would suggest that young polypeptides with an adaptive potential would already be biased in terms of protein structural properties compared to the neutral expectations based on random sequences. We examined the properties of polypeptides as a function of timing of emergence in order to follow their evolution during the time before, or at the early beginning of, the action of selection. We compared the level of intrinsic disorder, GC-content and genetic diversity (based on SNP density) in iORFsT1 as a function of age with that of annotated known genes. On average, protein disorder and GC-content are lower in iORFsT1 than in canonical genes regardless of iORFsT1 ages (p-values < 0.001, T-test, Fig. 6B-C). This pattern was confirmed for iORFsT1 and genes sharing the same size range of between 45 and around 100 amino acids (Fig. 6B-C). The lower intrinsic disorder for iORFsT1 was also observed for random intergenic sequences in Wilson’s study (Wilson et al. 2017). However, we observed a subset of iORFsT1 with extreme gene-like disorder values that could correspond to the subset of non-functional peptides expected to be recruited by natural selection if gene-like characteristics increase their functional potential.
iORFsT1 are located in more divergent regions compared to genes, which is in agreement with stronger purifying selection on canonical genes (Fig. 6D). We examined whether SNP density variation along the genome may influence iORFsT1 turnover. Younger iORFsT1, appearing along terminal branches, tend to be in more divergent regions compared to older ones at N2, even when considering the same size ranges (Fig. 6D). This may be due to mutation rate variation or to differences in evolutionary constraints acting on iORFsT1 age subsets. Older iORFsT1 are not preferentially located in the proximity of genes, where selection may be stronger (Fig. 6G), suggesting that the lower diversity observed at N2 is mainly due to a lower mutation rate. A correlation between mutation rate variation and replication timing differences along chromosomes has already been observed in yeast, where origins of replication (ARS) activated late show a higher mutation rate compared to earlier ones (Lang and Murray 2011; Agier and Fischer 2012). We compared the replication timing in regions of iORFsT1 of different ages to examine whether the higher diversity observed in younger iORFsT1 on average is correlated with late replicating regions. We used estimates from Muller et al. (2014), which are based on the quantification by deep-sequencing of the amount of DNA during replication, which is higher in genomic regions of early replication than in regions of late replication. We did not see differences in replication timing between genes and iORFsT1, nor between iORFsT1 age categories (Fig. 6E). However, we observed that older iORFsT1 tend to be closer to replication origins compared to younger iORFsT1 (Fig. 6F), which is consistent with the higher genetic diversity observed in recently emerging iORFsT1 locations. These observations suggest that younger iORFsT1 are more likely to occur in rapidly evolving sequences with higher mutation rates.
Because sequences are too similar between strains to test for purifying selection individually on each iORFsT1, we estimated the likelihood of the global dN/dS ratio for two merged sets of iORFsT1, containing ancient iORFsT1 conserved in all S. paradoxus strains (set 1) or iORFsT1 appearing at N1 and conserved between the SpB and SpC lineages (set 2). Both sets seem to evolve neutrally, with no significant purifying selection (non-significant p-values). These results illustrate the continuous emergence of random polypeptides that do not appear to be under significant selection.
The variability observed for expression levels, genetic and structural properties revealed a subset of de novo polypeptides with gene-like characteristics. We performed a multivariate analysis to look for polypeptides with extreme values for multiple traits as an indicator of their functional potential. We observed a subset of iORFsT1 sharing all considered characteristics with genes and resulting from ancient or recently gained iORFsT1 (Fig. 6H). Although iORFs do not appear to be under significant purifying selection, as a neutral pool they provide raw material for selection to act under either the continuum or pre-adaptation models, revealing a rapid potential for molecular innovations.
Discussion
To better understand the early stages of de novo gene birth, we characterized the properties and turnover of recently evolving iORFs and their putative peptides over short evolutionary time-scales using closely related wild yeast populations. The number of iORFs identified almost doubles when considering within-species diversity, which illustrates the role of intergenic diversity in providing potential molecular innovation. The iORF presence/absence diversity comes from ancient iORFs that are still segregating within S. paradoxus and from a continuous supply of de novo iORFs. The turnover and retention of iORFs appear to be mostly guided by mutation rate variation affecting the number of gains and losses, or by size changes, with larger changes being more likely to occur in longer iORFs because of the longer mutational target between start and stop codons. The iORF turnover rate is lower than the rate of gene duplication or loss estimated in yeast (excluding whole-genome duplication; Lynch et al. 2008).
Among the ∼20,000 iORF orthogroups of 60 nt and longer, only a small fraction (∼2%, n=418) shows translation signatures similar to expressed canonical genes. We observed a stronger post-translational buffering in the most transcribed iORFs, reflecting either selection against translation or a lack of selection for optimal translation. This mechanism was also observed in interspecies yeast hybrids, especially for genes with transcriptional divergence, and was hypothesized to be a result of stabilizing selection on the amount of protein produced (McManus et al. 2014). The post-translational buffering effect is similar between older and younger iORFs, suggesting that it results from a lack of translation optimization, which attenuates the amount of de novo polypeptides relative to mRNA molecules, rather than from selection.
Consistent with a model in which most iORFs are neutral, the properties of the corresponding de novo polypeptides are on average close to expectations for random sequences, with some having gene-like properties, suggesting a small set of neutrally evolving polypeptides with a potential for molecular innovations. The conservation distribution of iORFs with translation signatures (iORFsT1) is similar to that of non-translated ones, suggesting that iORF retention is mainly guided by random mutations and genetic drift even when iORFs are translated. The absence of a selection signature is also consistent with the neutral evolution of most intergenic polypeptides observed in rodents (Ruiz-Orera et al. 2018), and with the weak effect of purifying selection acting on younger de novo genes in yeast, Drosophila and Arabidopsis (Carvunis et al. 2012; Palmieri et al. 2014; Zhao et al. 2014; Li et al. 2016; Vakirlis et al. 2017). However, the resemblance to random sequences does not entirely preclude a potential molecular function and effect on fitness, because a recent study showed that a non-negligible fraction of expressed random sequences confers a positive effect on fitness (Neme et al. 2017).
iORFsT1 that recently emerged along terminal branches are more frequent in regions with a higher SNP density, whereas older iORFsT1 tend to be located in slowly evolving regions. This observation suggests that turnover rates vary depending on the local mutation rate. Regions with low mutation rates could act as a reservoir of ancient iORFs that segregate in populations for a longer time before being lost. On the other hand, mutation hotspots may allow the rapid testing of many immediately available molecular combinations, which could be advantageous in a changing environment. A small fraction of the translated ORFs that recently appeared have several gene-like characteristics, suggesting that they are pre-adapted to be biochemically functional, while most have some characteristics but not others, meaning that they would require refinement by natural selection to acquire these traits. These observations could reconcile the two opposing models of de novo gene birth (Carvunis et al. 2012; Wilson et al. 2017). Emerging de novo genes would be more likely to progressively acquire gene-like properties in slowly evolving regions (low mutation rate) before being lost, as in the continuum model. Faster evolution in some regions may increase the chance of acquiring a polypeptide with an immediate functional potential, as in the preadaptation hypothesis.
Material and methods
Characterization of the intergenic ORFs diversity
We investigated intergenic ORF (iORF) diversity in wild Saccharomyces paradoxus populations, which are structured in 3 main lineages named SpA, SpB and SpC (Charron et al. 2014; Leducq et al. 2016). The wild S. cerevisiae strain YPS128 was used in our experiments, and the reference S288C (version R64-2-1) was added to our analysis for functional annotation.
Genome assemblies
New genome assemblies were performed using high-coverage sequencing data from 5, 10 and 9 North American strains belonging to lineages SpA, SpB and SpC respectively (Fig. S1) (Leducq et al. 2016) using IDBA-UD (Peng et al. 2012). For strain YPS128, raw reads were kindly provided by J. Schacherer from the 1002 Yeast Genomes project (Peter et al. 2018). We used the default IDBA-UD parameters: a minimum k-mer size of 20 and a maximum k-mer size of 100, with an increment of 20 in each iteration. Scaffolds were then ordered and oriented along a reference genome using ABACAS (Assefa et al. 2009) with the -p nucmer parameter. S. paradoxus and S. cerevisiae scaffolds were aligned along the reference genomes of the CBS432 (Liti et al. 2009) and S288C (version R64-2-1 from the Saccharomyces Genome Database (https://www.yeastgenome.org/)) strains, respectively. Scaffolds that were not used in the ordering and were longer than 200 bp were also kept in the dataset for further analysis.
Identification of homologous intergenic regions
We detected homologous intergenic regions using synteny. Genes were predicted using Augustus (Stanke et al. 2008) with the complete gene model and the species parameter “saccharomyces_cerevisiae_S288C”. Orthologs were annotated using a reciprocal best hit (RBH) approach implemented in SynChro (Drillon et al. 2014) against the reference S288C (version R64-2-1) using a delta parameter of 3. We used the RBH gene pairs provided by SynChro and the clustering methods implemented in SiLiX (Miele et al. 2011) to identify conserved orthologs among the 26 genomes. We selected orthologs conserved among all strains and with a conserved order to extract orthologous microsyntenic genomic regions ≥ 100 nt between each pair of genes (Fig. S1).
Ancestral reconstructions of intergenic sequences
We reconstructed ancestral genomic sequences of intergenic regions. Because the divergence between strains belonging to the same lineage is low, we chose one strain per lineage to estimate the ancestral intergenic sequences at each divergence node between lineages (Fig. 1C and S1), that is, YPS128 (S. cerevisiae), YPS744 (SpA), MSH-604 (SpB) and MSH-587-1 (SpC). The ancestral sequence reconstruction was done using Historian (Holmes 2017), which allows the reconstruction of ancestral indels in addition to nucleotide sequences. Note that indel reconstruction is essential here to avoid introducing artefactual frameshifts in ancestral iORFs, whose detection (see below) depends on the conservation of the same reading frame between the start and the stop codon. Historian was run with a Jukes-Cantor model and using a phylogenetic tree inferred from aligned intergenic sequences by PhyML version 3.0 (Guindon et al. 2010) with the Smart Model Selection (Lefort et al. 2017) and YPS128 as outgroup.
iORF annotation and conservation level
Orthologous regions identified between each pair of conserved genes in contemporary strains and their ancestral sequence reconstructions were aligned using Muscle (Edgar 2004) with default parameters. Intergenic regions with a global alignment of less than 50% identity among strains (including gaps) were removed. Using a custom Python script, we annotated iORFs, defined as any sequence between canonical start and stop codons in the same reading frame and with a minimum size of 3 codons. Because we worked on homologous aligned regions, the presence/absence pattern does not suffer from the alignment biases that arise when short sequences are aligned on their own. We extracted a presence/absence matrix based on the exact conservation of the start and the stop codon in the same reading frame (Fig. S1). iORF aligned coordinates were then converted to genomic coordinates on the respective genome of each strain, and iORFs were removed if they overlapped a known annotated feature, such as rRNAs, tRNAs, ncRNAs, snoRNAs, non-conserved genes and pseudogenes annotated on the reference S288C (version R64-2-1 https://www.yeastgenome.org/). Additional masking was performed by removing iORFs i) located in a region with more than 0.6 % of sequence identity with an S. cerevisiae ncRNA or gene (including pseudogenes and excluding dubious ORFs) from the reference genome, or with Saccharomyces kudriavzevii and Saccharomyces eubayanus genes (Zerbino et al. 2018), ii) located in a low complexity region identified with RepeatMasker (http://www.repeatmasker.org/) and iii) for which local alignments of the iORF +/- 300 bp displayed less than 60% identity (including gaps). If an iORF overlapped a masked region detected in only one strain, it was removed for all the other strains in order to avoid introducing presence/absence patterns due to strain-specific masking.
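As an illustration only, the annotation rule described above can be sketched in Python. This is not the custom script used in the study. The sketch assumes that ATG is the only start codon, that the first in-frame ATG after a stop defines the iORF, that the 3-codon minimum counts the start codon but not the stop codon, and that the input is an ungapped sequence; the function names are hypothetical.

```python
# Hypothetical sketch of iORF annotation on one ungapped intergenic sequence.
# Assumptions (not from the paper): ATG is the only start codon, the first
# in-frame ATG after a stop defines the iORF, and the minimum size of
# 3 codons counts the start codon but not the stop codon.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs_forward(seq, min_codons=3):
    """Return half-open (start, end) ORF coordinates on the given strand.

    `end` includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # first in-frame ATG seen since the last stop
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon == "ATG" and start is None:
                start = pos
            elif codon in STOPS:
                if start is not None and (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return sorted(orfs)


def find_orfs(seq, min_codons=3):
    """Scan both strands; coordinates are reported on the forward strand."""
    n = len(seq)
    plus = [(s, e, "+") for s, e in find_orfs_forward(seq, min_codons)]
    minus = [(n - e, n - s, "-")
             for s, e in find_orfs_forward(reverse_complement(seq), min_codons)]
    return plus + minus
```

Applied to each contemporary and ancestral sequence of an alignment, with coordinates mapped back to alignment columns, such a scan would give the start and stop positions from which a presence/absence matrix can be built.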
iORFs that do not overlap a known feature were then classified according to their conservation level: 1) conserved in both species, 2) specific to and conserved within S. paradoxus, 3) fixed within lineages and divergent among them, 4) specific to and fixed in one lineage, 5) polymorphic in at least one lineage (Fig. S1).
For iORFs with a minimum size of 60 nt, we also performed a sequence similarity search against the proteome of NCBI RefSeq database (O’Leary et al. 2016) for 417 species in the reference RefSeq category and the representative fungi RefSeq category (containing 237 fungi species). iORFs with a significant hit (e-value < 10^-3) were removed to exclude any risk of including an ancient pseudogene. Among the 19,701 iORFs tested, only 12 displayed a significant hit, illustrating the stringency of our thresholds for the iORF annotation and filtering above.
Evolutionary history of iORFs
Gain and loss events were inferred by comparing presence/absence patterns between ancestral nodes and present-day iORFs. Because the ancestral reconstruction was done using one strain per lineage (see above), polymorphic iORFs absent in all the considered strains were removed from this analysis. iORFs with no detected ancestors were considered as appearing on terminal branches. We estimated the rate of iORF gain per substitution on each branch as the number of iORF gains divided by the number of substitutions (i.e. branch length × sequence size) and calculated the mean of the four branches. The iORF gain rate per cell per division was estimated by calculating the number of expected substitutions per cell per division (from the substitution rate estimated at 0.33 × 10^-9 per site per cell division by Lynch et al. (2008)), multiplied by the iORF gain rate per substitution.
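The rate calculation above can be written out as a short Python sketch. The branch values used in the usage note are hypothetical placeholders, not results from this study, and the function names are ours; only the substitution rate of 0.33 × 10^-9 per site per cell division comes from the text.

```python
# Hypothetical sketch of the iORF gain-rate calculation described above.
SUBSTITUTION_RATE = 0.33e-9  # per site per cell division (Lynch et al. 2008)


def gain_rate_per_substitution(n_gains, branch_length, n_sites):
    """iORF gains per substitution on one branch.

    The number of substitutions is branch length (substitutions per site)
    multiplied by the sequence size (number of sites).
    """
    return n_gains / (branch_length * n_sites)


def gain_rate_per_division(branches, n_sites, sub_rate=SUBSTITUTION_RATE):
    """Return (mean gains per substitution, gains per cell per division).

    `branches` is a list of (n_gains, branch_length) tuples, one per branch.
    """
    per_sub = [gain_rate_per_substitution(g, bl, n_sites) for g, bl in branches]
    mean_per_sub = sum(per_sub) / len(per_sub)
    expected_subs_per_division = sub_rate * n_sites
    return mean_per_sub, expected_subs_per_division * mean_per_sub
```

For example, with two hypothetical branches `[(100, 0.01), (200, 0.02)]` over 1,000,000 intergenic sites, each branch gives 0.01 gains per substitution, and the per-division rate is 0.33 × 10^-9 × 10^6 × 0.01.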
The evolution of iORF sizes was inferred by connecting iORFs with their ancestors along the phylogeny if they shared the same start and/or stop position on aligned intergenic sequences. The size of two connected iORFs may be conserved if there is no change, or may increase or decrease if they are connected only by the same start or the same stop position, because the position of the other extremity of the iORF changed.
Ribosome profiling and mRNA sequencing libraries
Ribosome profiling and mRNA sequencing experiments were conducted with the strains YPS128 (S. cerevisiae) (Sniegowski et al. 2002) and YPS744 (S. paradoxus), MSH604 (S. paradoxus) and MSH587 (S. paradoxus) belonging respectively to groups SpA, SpB and SpC according to Leducq et al. (2016). We prepared two replicates per strain and library type. The protocol is described in supplementary methods. Briefly, strains were grown in SOE (Synthetic Oak Exudate) medium (Murphy et al. 2006). Ribosome profiling footprints were purified using the protocol described in Baudin-Baillieu et al. (2016) with modifications (see supplementary methods). The rRNA was depleted in purified ribosome footprints and total mRNA samples using the Ribo-Zero Gold rRNA Removal Kit for yeast (Illumina) according to the manufacturer’s instructions. Ribosome profiling and total mRNA libraries were constructed using the TruSeq Ribo Profile kit for yeast (Illumina), following the manufacturer’s instructions starting from the fragmentation and end repair step. Libraries were sequenced on Illumina HiSeq 2500 at The Genome Quebec Innovation Center (Montreal, Canada).
Detection of translated iORFs
Both total mRNA and ribosome profiling samples were processed using the same procedure. Raw sequences were trimmed of 3’ adapters using CUTADAPT (Martin 2011). For RPF data, reads with lengths of 27–33 nucleotides were retained for further analysis as this size is most likely to represent footprinted fragments. For mRNA, reads with lengths of 27–40 nucleotides were retained. Adapter-trimmed reads were aligned to the respective genome of each sample using Bowtie version 1.1.2 (Langmead et al. 2009) with parameters --best --chunkmbs 500.
We used ribosome profiling reads to identify translated iORFs. This analysis was performed on iORFs longer than 60 nucleotides. Annotated iORFs may overlap because of the 3 possible reading frames on each strand. Ribosomal speed differences during translation cause an accumulation of ribosome footprints at specific positions within a gene (Ingolia 2016). We used ribosome profiling read density, which is typically characterized by a strong initiation peak located at the start codon followed by a periodicity at each codon, to detect the translated iORF among overlapping ones. For each strain, we performed a metagene analysis at the start codon region of iORFs and annotated conserved genes to detect the P-site offset for each read length between 28 and 33 nt. Because the ribosome profiling density pattern is stronger in highly translated regions, metagene analyses were done using the two replicates of each strain pooled in one coverage file. Ribosome footprints were mapped to their 5’ ends, and the distance between the largest peak upstream of the start codon and the start codon itself was taken to be the P-site offset per read length. When comparing annotated genes and iORFs, we obtained similar P-site offset estimates per read length, which were used for subsequent analyses. We then extracted the aligned read densities, shifted by the P-site offset estimates, per iORF or gene for subsequent analyses. Metagene analyses were performed using the metagene, psite and get_count_vectors scripts from the Plastid package (Dunn and Weissman 2016), and metagene figures were produced using R scripts (R Core Team 2013).
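The P-site offset rule described above (largest 5’-end peak upstream of the start codon, per read length) can be illustrated with a minimal Python sketch. It does not reproduce the Plastid psite script; the input layout, a metagene profile of 5’-end counts keyed by position relative to the first nucleotide of the start codon, and the function name are assumptions made for the example.

```python
# Hypothetical sketch of P-site offset estimation from a metagene profile.
# Input layout (an assumption): for each read length, a dict mapping the
# position of read 5' ends relative to the first nucleotide of the start
# codon (position 0; negative values are upstream) to a summed read count.


def psite_offsets(five_prime_counts):
    """Return {read_length: offset}, where offset is the distance between
    the largest upstream 5'-end peak and the start codon."""
    offsets = {}
    for read_length, profile in five_prime_counts.items():
        upstream = {pos: n for pos, n in profile.items() if pos < 0}
        if not upstream:
            continue  # no upstream signal for this read length
        peak_pos = max(upstream, key=upstream.get)
        offsets[read_length] = -peak_pos
    return offsets


def shift_to_psite(five_prime_pos, read_length, offsets):
    """Assign a read to its P-site position (forward-strand coordinates)."""
    return five_prime_pos + offsets[read_length]
```

With such offsets, each read contributes one count at its inferred P-site position, which is the density used for peak and periodicity detection.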
We identified translation initiation signals from ribosome profiling densities by detecting peaks at the start codon. We defined 3 precision levels of initiation peak: ‘p3’ if the highest peak is located at the first nucleotide of the start codon, ‘p2’ if there is a peak at the first position of the start codon and ‘p1’ if there is a peak at the first position of the start codon +/- 1 nucleotide, because the peak position is less precise in lowly expressed features. A minimum of 5 reads was required for peak detection. Read phasing was estimated by counting the number of aligned reads at the first, second or third position of all codons of the considered iORF or gene, and a binomial test was used to detect a significant deviation from the ratio expected with no periodicity, that is 1/3 at each position. We applied an FDR correction for multiple testing.
iORF families or genes with an initiation peak and a significant periodicity, i.e. an FDR-corrected p-value < 0.05, in at least one haplotype were considered as translated.
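The translation criteria above can be sketched as follows. Several details are assumptions rather than the procedure used in the study: a ‘peak’ is taken simply as a position with at least 5 reads, the binomial test compares the count at the first codon position against a probability of 1/3 with an exact two-sided test, and the FDR correction is Benjamini-Hochberg. All function names are hypothetical.

```python
# Hypothetical sketch of the initiation-peak and periodicity criteria.
from math import exp, lgamma, log


def initiation_peak_level(profile, min_reads=5):
    """Classify the initiation peak from P-site counts keyed by position
    relative to the first nucleotide of the start codon (position 0).
    Assumption: a 'peak' is any position with at least `min_reads` reads."""
    at_start = profile.get(0, 0)
    if at_start >= min_reads:
        return "p3" if at_start == max(profile.values()) else "p2"
    if max(profile.get(-1, 0), profile.get(1, 0)) >= min_reads:
        return "p1"
    return None


def binomial_two_sided_p(k, n, p=1 / 3):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count k."""
    def log_pmf(i):
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(p) + (n - i) * log(1 - p))
    cutoff = log_pmf(k) + 1e-9  # small tolerance for floating-point ties
    total = sum(exp(log_pmf(i)) for i in range(n + 1) if log_pmf(i) <= cutoff)
    return min(1.0, total)


def periodicity_p(frame_counts):
    """Test the read count at the first codon position against 1/3."""
    n = sum(frame_counts)
    return binomial_two_sided_p(frame_counts[0], n) if n else 1.0


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_translated(peak_level, adjusted_p, alpha=0.05):
    """An iORF is called translated with a peak and significant phasing."""
    return peak_level is not None and adjusted_p < alpha
```

A chi-square test on the three frame counts would be a reasonable alternative to the single-frame binomial test used here; the choice mainly affects power for iORFs with few reads.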
Differential expression analysis
Reads were strand-specifically mapped to iORFsT1 and conserved genes using the coverageBed command from the bedTools package version 2.26.0 (Quinlan and Hall 2010), with parameter -s. We then examined significant expression changes of iORFsT1 between strains. Differential expression analysis was performed using DESeq2 (Love et al. 2014). Significant differences were identified using a 5% FDR and a 2-fold magnitude. We identified a lineage-specific expression increase when the expression of the iORFsT1 in the considered lineage was significantly higher than in the other strains in all pairwise comparisons. For a SpB-SpC increase, we selected iORFsT1 that were more expressed in both SpB and SpC strains than in YPS128 and SpA, and for a S. paradoxus increase, iORFsT1 that were more expressed in all S. paradoxus lineages than in YPS128.
For the visualization of iORFsT1 coverage, we extracted the per-base coverage on the same strand using the genomecov command from the bedTools package version 2.26.0 (Quinlan and Hall 2010). The normalization was performed by dividing the per-base coverage of each library by the size factors estimated with DESeq2 (Love et al. 2014).
Expression and sequence properties
Normalized read counts for ribosome profiling and total mRNA samples were extracted with DESeq2 (Love et al. 2014) and we calculated the mean of the two replicates per library type. Translation efficiency (TE) was calculated as the ratio of RPF over total mRNA normalized read counts on the first 60 nt. We excluded iORFsT1 and genes with fewer than 10 total RNA reads in the first 60 nt from the TE calculation. Slope differences between genes and iORFsT1 were tested with an ANCOVA. We confirmed the buffering effect on iORFsT1 annotated on the S. cerevisiae reference strain S288C with the ribosome profiling and RNA sequencing data obtained in the McManus et al. (2014) study (Fig. S6).
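The TE calculation above amounts to a filtered ratio, sketched here in Python. The input layout (one normalized count per feature for the first 60 nt, already averaged over replicates) and the function name are assumptions; whether the 10-read threshold applies to raw or normalized counts is not specified in the text, and the sketch applies it to the counts it is given.

```python
# Hypothetical sketch of the translation efficiency (TE) calculation.
# Inputs (an assumption): dicts mapping a feature ID to its read count in
# the first 60 nt, averaged over the two replicates of each library type.


def translation_efficiency(rpf_counts, mrna_counts, min_mrna_reads=10):
    """Return {feature_id: TE}, where TE = RPF count / total mRNA count.

    Features with fewer than `min_mrna_reads` mRNA reads are excluded.
    """
    te = {}
    for feature, mrna in mrna_counts.items():
        if mrna < min_mrna_reads:
            continue  # too little mRNA signal for a reliable ratio
        te[feature] = rpf_counts.get(feature, 0.0) / mrna
    return te
```

For instance, a feature with 20 RPF reads and 40 mRNA reads gets a TE of 0.5, while a feature with 5 mRNA reads is dropped.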
The intrinsic disorder was calculated for genes and iORFsT1 using IUPRED (Dosztanyi et al. 2005). The SNP rate was calculated for each syntenic intergenic region by dividing the total number of intergenic SNPs in S. paradoxus alignments by the total number of nucleotides in the region, as done for intergenic sequences by Agier and Fischer (2012). Replication timing data per 1 kb bin come from Muller et al. (2014) and were converted to the version R64-2-1 of the reference genome using liftOver (http://genome.ucsc.edu/cgi-bin/hgLiftOver). We used the codeml program from the PAML package version 4.7 (Yang 2007) to estimate the likelihood of the dN/dS ratio, using the same procedure as employed by Carvunis et al. (2012) with codon model 0.
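The SNP rate defined above can be illustrated with a short Python sketch on an aligned intergenic region. The treatment of gaps is an assumption: columns containing a gap or an ambiguous base in any strain are ignored in both the SNP count and the region length. The function name is hypothetical.

```python
# Hypothetical sketch of the per-region SNP rate from a multiple alignment.
# Assumption: alignment columns with a gap or an ambiguous base in any
# strain are skipped, so they count neither as SNPs nor as sites.


def snp_rate(aligned_seqs):
    """Return (number of SNP columns) / (number of comparable sites)."""
    if not aligned_seqs:
        raise ValueError("at least one aligned sequence is required")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must have the same length")
    n_sites = 0
    n_snps = 0
    for column in zip(*(s.upper() for s in aligned_seqs)):
        if any(base not in "ACGT" for base in column):
            continue  # gap or ambiguous base: column not comparable
        n_sites += 1
        if len(set(column)) > 1:
            n_snps += 1
    return n_snps / n_sites if n_sites else 0.0
```

In the three-strain toy alignment `["ACGT-A", "ACTT-A", "ACGTCA"]`, one of the five comparable columns is polymorphic.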
All analyses and figures were produced with Python and R scripts (R Core Team 2013).
Data access
High-throughput sequencing data have been submitted to the NCBI Sequence Read Archive (SRA; http://www.ncbi.nlm.nih.gov/sra) and can be accessed under NCBI BioProject number PRJNA400476. De novo assemblies and annotations have been submitted to the NCBI nucleotide database (https://www.ncbi.nlm.nih.gov/nucleotide) under NCBI BioProject number PRJNA400476.
Author contributions
E.D. and C.R.L. conceived the project. E.D., O.N., I.H. and I.G.A. designed ribosome profiling experiments. E.D., I.G.A. and I.H. performed the experiments. E.D. performed bioinformatics analyses with helpful advice from L.N.T., C.R.L. and O.N. E.D. wrote the manuscript with revisions from all authors.
Disclosure declaration
The authors have no conflict of interest to declare.
Acknowledgments
We thank A. K. Dubé, G. Charron and the IBIS sequencing platform (B. Boyle) for technical help and Landry lab members for comments on the manuscript. This project was funded by a FRQNT Team grant to C.R.L. and Xavier Roucou and an NSERC discovery grant to C.R.L. C.R.L. holds the Canada Research Chair in Evolutionary Cell and Systems Biology.