## Abstract

Over evolutionary timescales, genomic loci switch between functional and non-functional states through processes such as pseudogenization and *de novo* gene birth. Here we ask about the likelihood and rate of functionalization of non-functional loci. We simulate an evolutionary model to look at the contributions of mutations and structural variation using biologically reasonable distributions of mutational effects. We find that a wide range of mutational effects are conducive to functionalization, thus indicating the ubiquity of this process. During functionalization, loci transition from a mutation dominated ‘learning’ phase to a selection dominated adaptation phase. Interestingly, in the special case of *de novo* gene birth, whereby non-functional loci begin to express a functional product, we find that expression level changes lead to rare, extreme jumps in fitness, whereas sustained adaptation is driven by product functionality. Our work supports the idea that the potential for adaptation is spread widely across the genome, and our results offer mechanistic insights into the process of *de novo* gene birth.

## 1 Introduction

At a very coarse level, a genome consists of multiple genomic loci, which can be non-functional or can carry functions such as genes, gene regulatory loci, sequences maintaining chromosome structure, etc. Consortium et al. [2012]. Currently, genome annotation remains a formidable challenge for both prokaryotes Dimonaco et al. [2022] and eukaryotes Salzberg [2019]. Nevertheless, it is reasonable to assume that on an evolutionary timescale, most genomic loci are in flux across functional and non-functional categories. For example, genes can lose their functionality through pseudogenization Albalat and Cañestro [2016]. In the other direction, there are also many examples across eukaryotes of *de novo* gene birth Van Oss and Carvunis [2019]. In this work, we ask about the fate of non-functional genomic loci.

We approach this question using an evolutionary model and explore how the fitness contribution of a non-functional genomic locus might increase over time due to the effects of accumulating mutations. The distribution of mutational fitness effects (DFE) has been experimentally measured in mutation accumulation studies for various organisms Katju and Bergthorsson [2019]. In our model, we sample biologically reasonable DFEs, using recently measured DFE parameters for *Chlamydomonas* Böndel et al. [2019] as reference. Notably, observations in Böndel et al. [2019] indicate that the DFE of specific regions of the *Chlamydomonas* genome, such as exons, introns or intergenic sequences, are similar to each other and to the DFE of the whole genome. In general, the DFE is known to vary across different regions of the genome Racimo and Schraiber [2014], and across different species Huber et al. [2017]. We accommodate this diversity by sampling a wide range of DFEs.

Now, over a time scale of millions of years, in addition to small mutations (< 50 bp), one can also expect large structural variations (from 50 bp up to several megabases) to impact the evolution of genomic loci Mérot et al. [2020]. While the rate of structural variation is estimated to be hundreds of times slower than the rate of small mutations, its effect is likely to be much larger Trost et al. [2021]. Of particular importance to our question is the possibility that the entire genomic locus under consideration gets deleted. Therefore, we test in our model whether sustained fitness increase can occur in the face of locus deletion.

Finally, we consider the particular case of *de novo* gene birth. Recent studies report how new genes gain expression Majic and Payne [2020] and functionality Zhang et al. [2015] over time. Measurements from mutational scans of protein encoding genes indicate that the overall fitness contribution of a gene is a combination of the adaptive value of the expression product, and its expression level Shen et al. [2022]. We envisage that equivalently, during the process of *de novo* gene birth, mutational fitness effects can be decomposed into the effect on adaptive value and the effect on expression level. In the model, we use the DFE, together with empirical measurements of mutational effects on expression to extract a scenario of the evolution of the adaptive value of the expression product.

Overall, we find that a wide range of biologically reasonable DFEs allow functionalization of genomic loci, indicating the ubiquity of this process. Moreover, this gain of functionality occurred despite the antagonistic effects of locus deletion, particularly for the *Chlamydomonas* DFE parameters. In the special case of *de novo* gene birth, our model reveals a short-tailed distribution for mutational effects on adaptive value, thus implying that the rare, extreme mutations that are characteristic of DFEs are instead driven solely by mutational effects on expression level. In contrast, we find that mutations in adaptive value are the major drivers for the sustained fitness increase over evolutionary time. Our results can be tested experimentally using high throughput mutational scans on random initial sequences; such experiments stand to offer quantitative insights into the process of *de novo* gene birth.

## 2 Model of nonfunctional locus adaptation

We set up a population genetic framework to model well-mixed populations of fixed size *N*, composed of asexually reproducing haploid individuals. Fitness of an individual represents exponential growth rate, which is equivalent to the quantities considered in experiments that measure DFEs (e.g., Böndel et al. [2019]). In this work, for any individual *i*, we consider the evolution of the fitness contribution *F*(*i*) of a single locus in its genome. We are interested in the probability that the locus persists in the population, and that its fitness contribution increases above some predetermined threshold. In particular, we examine the special case of *de novo* gene birth, where the fitness contribution can be decomposed into two quantities: functionality, or adaptive value of the expression product (*A*(*i*)), and its expression level (*E*(*i*)). Our definition of fitness is not tied to any specific function, and we assume that *F*(*i*) = *A*(*i*) × *E*(*i*).

The locus of interest is non-genic, with initial fitness *F*_{0}(*i*) = 0, and an initial expression level *E*_{0} (*i*) = 0.001 for all individuals. The initial expression level captures leaky expression of intergenic regions Clark et al. [2011], which is estimated to be 1000-fold smaller than the level of highly expressed genes Hebenstreit et al. [2011].

Generations in the model are non-overlapping, and the population at time-step *t* + 1 is composed entirely of the offspring of individuals in the time-step *t* (Fig1(A)). Offspring incur mutations at each time-step, which affect the locus fitness (Δ*F*(*i*)). In the case of *de novo* gene birth, Δ*F*(*i*) can be decomposed into mutational effects on adaptive value (Δ*A*(*i*)) and expression level (Δ*E*(*i*)):
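Under the assumption *F*(*i*) = *A*(*i*) × *E*(*i*), one way to write this decomposition is sketched below, with the individual index *i* and the time index suppressed and with *F*, *A* and *E* denoting the parent's values. It is an algebraic expansion consistent with the update rules given in the Methods, offered as an illustration rather than as the model's numbered equations:

$$
F + \Delta F = (A + \Delta A)(E + \Delta E)
\quad\Longrightarrow\quad
\Delta F = E\,\Delta A + A\,\Delta E + \Delta A\,\Delta E .
$$

Given draws of Δ*F* and Δ*E*, the mutational effect on adaptive value then follows as

$$
\Delta A = \frac{F + \Delta F}{E + \Delta E} - A .
$$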

The mutation rate sets the timescale of the model: a single time-step is roughly the time it takes for one mutation to occur in the locus. For a locus of 100 base pairs, a single model time-step can range from 100 years to 100 000 years for different organisms (Fig1(B), see also Supplementary Information: Table.1). Offspring can also incur structural variations, which in the model lead to the deletion of the locus in that individual. The probability of locus deletion *d* represents the rate of structural variation relative to the mutation rate.

In the model, the probability that an individual leaves an offspring is proportional to the fitness *F*(*i*) of the locus (see Supplementary Information: Section.2 for a discussion of the genomic background). Whenever *F*(*i*) ≤ −1, we consider the locus lethal and such individuals cannot produce offspring. We update populations for 1000 time-steps, equivalent to 0.1 to 100 million years, depending on the organism and size of the locus (Supplementary Information: Table.1, Method to update population fitness).

Fitness effects of mutations (Δ*F*(*i*)) are drawn from the characteristic DFE of the locus (Fig1(C)). Multiple studies indicate that long tails are important features of DFEs, which can be captured by the general form of long-tailed gamma distributions Eyre-Walker and Keightley [2007]. Therefore, we choose to follow Böndel et al. [2019], and represent DFEs as two-sided gamma distributions, and characterize them using four parameters: (i) average effect of beneficial mutations *p*, (ii) fraction of beneficial mutations *f*, (iii) average effect of deleterious mutations *n*, and (iv) the shape parameter *s*, where distributions with lower *s* are more long-tailed. Mutations in the model represent the mutation types included in Böndel et al. [2019], which were single-nucleotide mutations and short indels (insertions or deletions of average length ≤ 10 bp) Ness et al. [2015]. Note that here we assume the DFEs of single loci are similar to the DFE across the whole genome, which is the quantity reported in experimental studies. We account for differences in DFEs across species and locations along the genome by sampling across biologically reasonable values of these four parameters *p, f, n, s*.

We also use empirical measurements to estimate the distribution of mutational effects on expression. Studies indicate that mutational effects on expression from established promoters follow a heavy-tailed distribution Hodgins-Davis et al. [2019]. More relevant to our study of *de novo* gene birth are the recent measurements of mutational effects on expression from *random* sequences Vaishnav et al. [2022], which follow a power law distribution, Pr(Δ*E*) ~ Δ*E*^{−2.25}. At each time-step, we use the above power law distribution to draw Δ*E*(*i*). We then calculate values of mutational effects Δ*A*(*i*) using equations (1) and (2), given distributions of mutational effects on fitness and on expression level (see Method to update expression level and adaptive value; see also Supplementary Information: FigS.9 for possible deviations from the power-law ΔE distribution due to the very small initial values *E*(*i*)).

In all, we survey 324 parameter sets – *p, f, n, s*, the DFE parameters, and *d*, the probability of locus deletion – (Fig.1(C)). We run 100 replicate populations for each set of parameters for population sizes *N* = 100, 1000 (Surveying the space of DFE and locus deletion parameters in populations of various sizes). At the end of each simulation, we trace the ancestry of each locus in each individual (Tracing ancestry to find fixed mutations) in order to track *fixation events*: a mutant is said to have fixed in the population if the ancestry of all individuals at some time-step *t* can be traced back to a single individual at some previous time-step *t* – *t*_{fix}. During the course of a simulation, populations undergo multiple fixation events. We count the number of replicate populations in which the locus is still retained at time-step *t* = 1000, and the most recent mutant that gets fixed is fitter than a predetermined fitness threshold of 0.1 (Fig1(A)).

## Results

### Most of the genome is fertile for adaptation

In the absence of locus deletion (*d* = 0), fitness of the last common ancestor crossed the threshold of 0.1 in at least 50% of the replicate populations for a majority (55 out of 81) of DFE parameter sets in *N* = 1000 populations (Fig.2(A), see Supplementary Information: FigS.2 for *N* = 100). The bimodality of the histogram in Fig.2(A) indicates that DFE parameters tend to either be highly conducive, or highly repressive to adaptation. As one can anticipate, the conducive DFE parameter sets tend to have high values for the magnitude (*p*) and the frequency of beneficial mutations (*f*), and low values for the magnitude of deleterious mutations (*n*) and the shape parameter (*s*) (Fig.2(A),inset and Supplementary Information: FigS.2(A),inset). Particularly, for the *Chlamydomonas* DFE parameters, 97% of the N=1000 replicate populations (52% of N=100 replicate populations) crossed the fitness threshold.

Notably, the four DFE parameters appear to act independently in determining the probability of crossing the fitness threshold. This allows fitness to increase even in populations with small values of parameters *f* and *p*, provided the DFE of mutations is long-tailed (i.e., small values of s) (see Supplementary Information: FigS.3). That is, large-effect beneficial mutations are sufficient for adaptive evolution, even when they are rare.

### Fitness trajectories involve a transition from a mutation dominated to a selection dominated phase

The fitness trajectories of populations where the fitness threshold is crossed have a typical shape: the population average fitness is initially dominated by the effects of new mutations, which are mostly deleterious, and lead to a decrease in fitness (see Supplementary Information: FigS.4). This is followed by a phase where the effects of selection become visible and average fitness increases roughly linearly. These fitness trajectories are reminiscent of the dynamics of learning through adaptive strategies in gambling problems, where an initial phase of loss of capital due to the cost of learning is followed by recovery Despons et al. [2022].

Two numbers indicate the point in the trajectory at which selection leads to consistent improvement in fitness: *minimum average fitness* and *time at which minimum fitness is achieved* (Fig. 2(B), Supplementary Information: FigS.5(A)). The DFE parameters, notably *p* and *f*, are significantly correlated with these quantities (Supplementary Information: Table.2). Moreover, as expected, populations with lower minimum fitness achieve it at later times (Pearson correlation coefficient between *minimum fitness* and *time of minimum fitness* = −0.73) (Fig.2(D)).
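As a minimal sketch of how these two summary numbers could be read off a trajectory, the helper below takes a list of population-averaged fitness values indexed by time-step. The function name and the example trajectory are hypothetical and serve only to illustrate the definitions.

```python
def minimum_fitness_point(avg_fitness):
    """Return (minimum average fitness, time-step at which it is first reached)."""
    # min over indices, keyed by the fitness value; ties resolve to the earliest time-step
    t_min = min(range(len(avg_fitness)), key=avg_fitness.__getitem__)
    return avg_fitness[t_min], t_min


# Hypothetical trajectory: an initial decline followed by recovery.
trajectory = [0.0, -0.01, -0.03, -0.02, 0.01]
min_fit, t_min = minimum_fitness_point(trajectory)
```

Here `min_fit` is −0.03 and `t_min` is 2, the point after which the average fitness starts to recover.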

### Mutations can drive adaptation despite the effect of locus deletion

When *d* > 0, the effect of locus deletion can be understood in terms of a competition between two sub-populations: the sub-population that has lost the locus, and therefore lacks any fitness contribution from it, and the sub-population that retains it (Supplementary Information: FigS.6).

The probability that the sub-population that has lost the locus takes over increases with *time of minimum fitness* as calculated for the case where *d* = 0: the longer the average fitness remains negative, the more probable is the loss of the locus from the whole population. Therefore, fewer replicate populations with DFEs such that minimum fitness is reached later go on to cross the fitness threshold of 0.1 (Fig. 2(C), Supplementary Information: FigS.5(B)). As a consequence, the number of parameter sets for which the fitness threshold was crossed in at least 50% of *N* = 1000 replicate populations reduces from 55 at *d* = 0 to 51 at the plausible value of *d* = 0.005, and to 48 and 34 at the inordinately high values of *d* = 0.01 and 0.05, respectively. Particularly, for the *Chlamydomonas* DFE, for which *minimum fitness* and *time of minimum fitness* averaged across all replicate populations are −0.035 and 55.72 respectively, > 50% of the replicate populations crossed the threshold for *d* = 0.005 (Fig. 2(C), red dotted line).

### Functionality drives sustained adaptation, while expression drives extreme mutational events

Our decomposition of fitness into expression level and adaptive value yielded short-tailed exponentially distributed mutational effects on adaptive value (Fig.3(A), Supplementary Information: FigS.7). This indicates that most mutations have little effect on functionality, and mutations with large effects are extremely rare.

We also looked at correlations between the population averaged fitness trajectories and the average trajectories of expression level and adaptive value. These correlations indicate the contributions of expression level and adaptive value towards the increase of fitness over evolutionary time. We find that in most cases where fitness crosses the 0.1 threshold, the increase in fitness was driven more by the adaptive value than by expression level: the distribution of Pearson’s correlation coefficients for adaptive value is sharply peaked at 1, whereas that of expression level is spread broadly (Fig.3(B), Supplementary Information: FigS.8).

As an interesting aside, the empirical measurements that we base our study on do not indicate the level of correlation between the fitness effect and changes in expression level due to mutations; therefore, we proceed with the assumption that Δ*A*(*i*) and Δ*E*(*i*) are independent of each other. In spite of this, over evolutionary time, selection and heritability effectively link fitness and expression level, and impose correlations between their evolutionary trajectories (Supplementary Information: FigS.9).

Overall, we find that sustained adaptation during gene birth is driven more by the product’s adaptive value than by its expression level. At the same time, the extreme mutational effects on fitness, which underlie the long tails of DFEs, are not driven by changes in the adaptive value of the product, and are instead likely to be entirely driven by changes in expression level. As noted earlier, extreme mutational events become important in facilitating adaptation in cases where beneficial mutations are small and infrequent on average (i.e. small *f* and *p*).

## Discussion

A majority of studies in genomics and genetics are concerned with the function and evolutionary course of known genes and their regulation. Recent discoveries have attracted focus towards the evolution of non-genic loci; particularly, experimental studies that demonstrate the adaptive potential of random sequences Hayashi et al. [2003], Yona et al. [2018], Lagator et al. [2022]. Furthermore, genomics studies that indicate the frequent occurrence of *de novo* gene birth demonstrate a need for general, theoretical investigations of the evolution of non-genic loci Tautz and Domazet-Lošo [2011].

In this work, we attempt to describe the process of functionalization of non-genic genomic loci in a simple population genetic model. We make use of experimentally measured effects of spontaneous mutations in order to obtain biologically reasonable estimates for the frequency of locus functionalization.

Our model suggests that a wide range of parameters that govern mutational fitness effects (DFE) are conducive to locus functionalization. We find this to be the case despite the antagonistic effects of structural variation that leads to locus deletion. Although the extent of diversity of DFEs across genomic loci and different organisms is not well-known, the range of DFEs surveyed in this work indicates that large swathes of the genome are conducive to adaptation on evolutionary timescales. This result is in line with observations that 80% of the human genome is likely to be functional, while only 3% of the genome contains well-known protein coding exons Consortium et al. [2012]. Our result also supports the proposed prevalence of orphan genes born through *de novo* gene birth Vakirlis et al. [2020].

The fitness contribution of a gene is a composite function of various molecular mechanisms, for example the accessibility and affinity of the locus to polymerases, the stability, foldability, and interactions of its expression products, etc. Our study of the adaptive value and expression level of *de novo* genes exemplifies how the fitness effects of mutations can be resolved into contributions from underlying mechanisms. Our result also shows how the process of adaptation can be different for *de novo* genes and established genes: we find that in the case of *de novo* gene birth, the increase in fitness was driven more by the adaptive value than by expression level. This effect is likely to be a special feature of *de novo* gene birth, where initially both adaptive value and expression levels are very low. In contrast, in the case of established genes, evolution of expression level is known to play a role in adaptation Fraser [2013], Nourmohammad et al. [2017], Blanc et al. [2021].

We built our model to represent naturally evolving populations, where the timescale varies across different organisms, and is set by their respective mutation rates. We assume here that the genomic background, being much larger, evolves at a much faster rate, allowing selection to be solely based on the fitness contribution of the locus of interest. Alternatively, our model can also be used to represent mutation scan experiments such as in Vaishnav et al. [2022], where the genomic background is kept constant. In this case, the generations in the model represent rounds of experiments involving mutagenesis and artificial selection.

The generality of our results is likely to be limited due to a dearth of relevant data. Most importantly, we use experimental measurements of DFE and mutational effects on expression that are taken from different organisms: in different organisms, distinct mechanisms produce mutations, and therefore the frequencies of different mutation types and their effects may vary across organisms. Although the leptokurtic nature of DFEs Eyre-Walker and Keightley [2007] and the long-tailed nature of mutational effects on expression Hodgins-Davis et al. [2019], Vaishnav et al. [2022] have both been observed in independent studies, measurements performed in the same organism could provide important details, for instance the correlations between the effect of a mutation on expression and on fitness. Secondly, the DFE of loci remains constant in our model, while mutational fitness effects are known to vary over evolutionary time due to various causes, such as change of environment, diminishing returns epistasis, etc. Sane et al. [2020], Wünsche et al. [2017], Aggeli et al. [2020]. An extended model that includes a consideration of DFE variability would provide valuable insight into the robustness of our results.

We anticipate that our results can be tested and the shortcomings of our model can be addressed through experiments, especially mutational scans such as those in Vaishnav et al. [2022]. For example, one could design experiments that monitor the fitness effects of mutations on random sequences and concomitantly detect expression from these random sequences. Alternatively, the evolution of adaptive value of expression products can be directly examined in experiments where random sequences are placed under constitutive, high expression promoters (such as in Hayashi et al. [2003]); in this case the fitness effects of mutations directly correspond to the adaptive value of the product. These experiments, together with theoretical approaches like ours, provide us with means to test and compare the adaptive potential of non-functional genomic sequences, and the general mechanisms of *de novo* gene birth across various organisms.

## Methods

### 2.1 Surveying the space of DFE and locus deletion parameters in populations of various sizes

We scan across DFEs with *p* = [0.001, 0.003, 0.005], *f* = [0.25, 0.5, 0.75], *n* = [0.001, 0.005, 0.01] and *s* = [0.3, 0.6, 0.9]. We consider locus deletion probabilities *d* = [0, 0.001, 0.005, 0.01] and population sizes *N* = [100, 1000]. For each parameter set, we simulate 100 replicate systems. In all, we look at 64 800 systems. All code used to generate and analyze data is written in Python 3.6.
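The size of this survey can be checked with a short enumeration. The sketch below only reproduces the counting from the parameter values listed above; the variable names are illustrative and not taken from the simulation code.

```python
from itertools import product

p_values = [0.001, 0.003, 0.005]    # average effect of beneficial mutations
f_values = [0.25, 0.5, 0.75]        # fraction of beneficial mutations
n_values = [0.001, 0.005, 0.01]     # average effect of deleterious mutations
s_values = [0.3, 0.6, 0.9]          # shape parameter
d_values = [0, 0.001, 0.005, 0.01]  # probability of locus deletion
N_values = [100, 1000]              # population sizes
replicates = 100

# All combinations of the four DFE parameters, then crossed with d.
dfe_sets = list(product(p_values, f_values, n_values, s_values))
parameter_sets = list(product(dfe_sets, d_values))
systems = len(parameter_sets) * len(N_values) * replicates

print(len(dfe_sets), len(parameter_sets), systems)  # prints: 81 324 64800
```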

### 2.2 Method to update population fitness

For a population of size *N*, the fitness values of individuals at time-step *t* are stored in the vector *F*_{t} ∈ ℝ^{N×1}, where the fitness of any individual *i* is *F*_{t}(*i*). We also keep track of the individuals that have lost the locus due to deletion in the vector *L*_{t} ∈ {0, 1}^{N×1}, such that *L*_{t}(*i*) = 1 implies that individual *i* contains the locus at time-step *t*, and *L*_{t}(*i*) = 0 implies that individual *i* has lost the locus. Note that *L*_{t}(*i*) = 0 automatically implies *F*_{t}(*i*) = 0.

In the model, only individuals with fitness > −1 are viable and capable of producing progeny. Individuals in the current population that produce progeny are chosen on the basis of their relative fitness. Let minfit_{t} be the minimum fitness among viable individuals in *F*_{t}.

We define , for *j* such that *F*_{t}(*j*) > −1. The normalized relative fitness of individuals is then given by relfit_{t} ∈ [0, 1]^{N×1}, where

Therefore, even if *F*_{t}(*i*) = 0, relfit_{t}(*i*) can be non-zero if minfit_{t} < 0.

Let Anc_{t+1} be the list of individuals chosen from the current time-step *t* to leave progeny. In other words, Anc_{t+1} is the list of ancestors of the population at time-step *t* + 1. Here, *Pr*(Anc_{t+1}(*j*) = *i*) ∝ relfit_{t}(*i*), ∀*i*, *j* ≤ *N*.

Progeny of the current population incur mutations. The mutation effects are drawn from two-sided gamma distributions governed by the parameters *p* (average effect of beneficial mutations), *f* (fraction of beneficial mutations), *n* (average effect of deleterious mutations), and *s* (shape parameter). The values of fitness effects of mutations incurred by each individual at time-step *t* are stored in the vector mut_{t} ∈ ℝ^{N×1}, where

$$
\mathrm{mut}_t(i) =
\begin{cases}
\Gamma(s,\, p/s), & \text{if } \mathrm{Ber}(f) = 1,\\
-\Gamma(s,\, n/s), & \text{otherwise.}
\end{cases}
$$

Here Γ (*κ*, *θ*) represents a number drawn from the gamma distribution with shape parameter *κ* and scale parameter *θ*, and Ber(*f*) is the Bernoulli random variable which equals 1 with probability *f*.
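A draw from such a two-sided gamma distribution can be sketched with the Python standard library. The sketch assumes that the average effects *p* and *n* are the means of the two gamma branches, so that the scale parameter is the mean divided by the shape; the function name and the example parameter values are hypothetical.

```python
import random


def draw_fitness_effect(p, f, n, s, rng=random):
    """Draw one mutational fitness effect from a two-sided gamma DFE.

    ASSUMPTION: p and n are the means of the beneficial and deleterious
    branches, so each branch uses scale = mean / shape.
    """
    if rng.random() < f:                    # Bernoulli(f): beneficial mutation
        return rng.gammavariate(s, p / s)   # gammavariate(shape, scale)
    return -rng.gammavariate(s, n / s)      # deleterious mutation


# Illustrative parameter values taken from the surveyed ranges.
rng = random.Random(0)
effects = [draw_fitness_effect(0.003, 0.5, 0.005, 0.6, rng) for _ in range(100000)]
beneficial = [x for x in effects if x > 0]
deleterious = [x for x in effects if x < 0]
```

With a large number of draws, the fraction of positive effects approaches *f* and the mean magnitudes of the two branches approach *p* and *n*.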

Progeny can also lose the locus with probability *d*. Thus, the updated fitness levels of the population are given by *F*_{t+1}(*i*) = 0, if Anc_{t+1}(*i*) did not contain the locus, or if the individual loses the locus in the current time step. Otherwise, *F*_{t+1}(*i*) = *F*_{t}(Anc_{t+1}(*i*)) + mut_{t}(*i*).
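One round of selection, mutation and deletion can be sketched as follows. This is a minimal illustration rather than the simulation code itself: the function and argument names are hypothetical, the relative fitness weights `relfit` are supplied by the caller (their exact normalization is outside the scope of this sketch), and `draw_effect` stands for any function that returns one mutational fitness effect.

```python
import random


def next_generation(F, L, relfit, draw_effect, d, rng=random):
    """Return (F_next, L_next, anc) after one model time-step.

    F, L        : fitness values and locus indicators (1 = present) at time-step t
    relfit      : non-negative selection weights, one per individual (caller-supplied)
    draw_effect : callable returning one mutational fitness effect
    d           : probability that a progeny loses the locus
    """
    N = len(F)
    # Ancestors are sampled with replacement, with probability proportional to relfit.
    anc = rng.choices(range(N), weights=relfit, k=N)
    F_next, L_next = [], []
    for parent in anc:
        keeps_locus = L[parent] == 1 and rng.random() >= d
        if keeps_locus:
            F_next.append(F[parent] + draw_effect())
            L_next.append(1)
        else:
            # Locus absent in the parent, or deleted in this time-step.
            F_next.append(0.0)
            L_next.append(0)
    return F_next, L_next, anc
```

Passing the weights in as an argument keeps the sketch agnostic about how viability and relative fitness are computed; individuals with zero weight are never chosen as ancestors.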

### 2.3 Method to update expression level and adaptive value

In the model, we assume *F*(*i*) = *A*(*i*) × *E*(*i*) for any individual *i*. For a population of size *N*, expression levels of the locus at time-step *t* are stored in the vector *E*_{t} ∈ ℝ^{N×1}, where the expression level of some individual *i* is *E*_{t}(*i*). For an individual that has lost the locus due to deletion, *L*_{t}(*i*) = 0, which automatically implies *E*_{t}(*i*) = 0.

Initially, the expression level of the locus across the population is distributed around 0.001, and reflects leaky expression. At each time step, the expression levels across the population change as individuals are selected and their progeny incur mutations.

The effect of mutations on expression level incurred by each individual at time-step *t* is stored in the vector Δ*E*_{t} ∈ ℝ^{N×1}. The magnitudes of Δ*E*_{t}(*i*) are drawn from a power law distribution such that *Pr*(|Δ*E*_{t}(*i*)| = *x*) ∝ *x*^{−2.25} for *x* ≥ 0. We assume that Δ*E*_{t}(*i*) is negative with probability 0.5.

The updated expression levels of the population are therefore given by *E*_{t+1}(*i*) = 0, if Anc_{t+1}(*i*) did not contain the locus, or if the individual loses the locus in the current time step. If the individual does contain the locus, *E*_{t+1}(*i*) = *E*_{t}(Anc_{t+1}(*i*)) + Δ*E*_{t}(*i*).

Note that the values of expression level in the model are bounded within [0.001, 1] corresponding to leaky expression and maximal possible expression respectively. In the simulation, whenever *E*_{t+1}(*i*) < 0.001 or *E*_{t+1}(*i*) > 1, we reset it to 0.001 and 1, respectively. Since the initial expression levels are very low, *E*_{t+1}(*i*) never crossed 1 in any simulation. In a run of 1000 time steps, *E*_{t+1}(*i*) crosses 0.001 on average 40 times (Supplementary Information: FigS.10).
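The expression update for an individual that retains the locus can be sketched as below, using inverse-transform sampling for the power law. The exponent 2.25 and the bounds [0.001, 1] come from the text. The lower cutoff `X_MIN` is an assumption introduced here only so that the distribution can be normalized, and all names are hypothetical.

```python
import random

ALPHA = 2.25                 # power-law exponent quoted in the text
X_MIN = 1e-4                 # ASSUMED lower cutoff on |dE|; not taken from the source
E_LOW, E_HIGH = 0.001, 1.0   # leaky and maximal expression levels


def draw_expression_effect(rng=random):
    """Draw dE with |dE| following a power law above X_MIN and a random sign."""
    u = rng.random()  # uniform in [0, 1)
    # Inverse CDF of a density proportional to x**(-ALPHA) on [X_MIN, infinity)
    magnitude = X_MIN * (1.0 - u) ** (-1.0 / (ALPHA - 1.0))
    sign = -1.0 if rng.random() < 0.5 else 1.0
    return sign * magnitude


def update_expression(E_parent, rng=random):
    """Add a mutational effect to the parent's expression level and clip to bounds."""
    E_new = E_parent + draw_expression_effect(rng)
    return min(max(E_new, E_LOW), E_HIGH)
```

The clipping step mirrors the reset to 0.001 or 1 described above; the choice of `X_MIN` changes the typical size of Δ*E* and would need to be matched to data.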

We then calculate the corresponding changes in the adaptive value of the locus at each time step: *A*_{t}(*i*) = *F*_{t}(*i*)/*E*_{t}(*i*). From this, we can calculate the change in adaptive value due to mutation as Δ*A*_{t}(*i*) = *A*_{t+1}(*i*) – *A*_{t}(Anc_{t+1}(*i*)).
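A small worked example of this calculation, for an individual and its ancestor that both retain the locus (so that expression levels are non-zero), might look as follows. The function names and numerical values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def adaptive_value(F, E):
    """Adaptive value A = F / E; only defined while the locus is present (E > 0)."""
    return F / E


def delta_adaptive_value(F_parent, E_parent, F_child, E_child):
    """Change in adaptive value between an ancestor and its progeny."""
    return adaptive_value(F_child, E_child) - adaptive_value(F_parent, E_parent)


# Hypothetical values: the ancestor has F = 0 at leaky expression (E = 0.001);
# the progeny gains fitness 0.002 and its expression level rises to 0.004.
dA = delta_adaptive_value(0.0, 0.001, 0.002, 0.004)
```

In this example the adaptive value rises from 0 to 0.5, so `dA` is 0.5.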

### 2.4 Tracing ancestry to find fixed mutations

In order to find the fitness value of the mutant fixed in the population at time-step *t*, we start with the list of ancestors of individuals Anc_{t} at time-step *t*.

Let *X*_{t} = {*i*, ∀*i* ∈ Anc_{t}} be the set of unique ancestor identities. We then recursively find *X*_{t–n} = {*i*, ∀*i* ∈ {Anc_{t–n}(*j*), ∀*j* ∈ *X*_{t–n+1}}} as the set of unique ancestor identities for *n* = 1, 2, 3 … *t*_{0}, where *X*_{t–t_{0}} is the first singleton set encountered. This set contains a single individual at time-step *t* – *t*_{0} – 1, whose mutations are inherited by every individual at time-step *t*. The fitness value of the mutant fixed in the population at time-step *t* is then *F*_{t–t_{0}–1}(*i*), where *i* ∈ *X*_{t–t_{0}}.
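The recursion above can be sketched as a short backward trace. The data layout is an assumption made for illustration: `anc_history[k][j]` holds the index, at time-step *k* − 1, of the ancestor of individual *j* at time-step *k*, and the function name is hypothetical.

```python
def trace_fixed_ancestor(anc_history, t):
    """Trace ancestry back from time-step t until a single common ancestor remains.

    anc_history[k][j] is the index (at time-step k - 1) of the ancestor of
    individual j at time-step k; entry 0 is unused.
    Returns (time_step, individual) of the common ancestor, or None if the
    lineages have not coalesced within the recorded history.
    """
    X = set(anc_history[t])   # X_t: unique ancestors (living at t - 1) of everyone at t
    k = t
    while len(X) > 1:
        k -= 1
        if k < 1:
            return None       # ran out of recorded generations
        X = {anc_history[k][j] for j in X}   # X_{t-n} built from X_{t-n+1}
    return k - 1, next(iter(X))


# Hypothetical history for N = 3 individuals over three time-steps.
history = [None, [0, 1, 2], [1, 1, 2], [0, 0, 1]]
fixed = trace_fixed_ancestor(history, 3)
```

In this example the individuals at time-step 3 descend from individuals 0 and 1 at time-step 2, both of which descend from individual 1 at time-step 1, so `fixed` is `(1, 1)`. The fitness of the fixed mutant would then be looked up as the fitness of that individual at that time-step.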

## Author contributions

SM: conceived the project, designed research, developed models, performed simulations, analysed data and wrote the manuscript. TT: conceived the project, supervised research, and wrote the manuscript.

## Competing interests

The authors declare they have no competing interests.

## Acknowledgements

We thank John McBride and Luca Peliti for very helpful discussions. This work was funded by the Institute for Basic Science, Grant IBS-R020.