## Abstract

Exchanging genetic material with another individual seems risky from an evolutionary standpoint, and yet living things across all scales and phyla do so quite regularly. The pervasiveness of such genetic exchange, or recombination, in nature has defied explanation since the time of Darwin^{1–4}. Conditions that favor recombination, however, are well understood: recombination is advantageous when the genomes of individuals in a population contain more selectively mismatched combinations of alleles than can be explained by chance alone. Recombination remedies this imbalance by shuffling alleles across individuals. The great difficulty in explaining the ubiquity of recombination in nature lies in identifying a source of this imbalance that is comparably ubiquitous. Intuitively, it would seem that natural selection should reduce the imbalance by favoring selectively matched combinations of high-fitness alleles, thereby opposing the evolution of recombination. We show, however, that this intuition is wrong; to the contrary, we find that natural selection has an encompassing tendency to assemble selectively mismatched combinations of alleles (the *products* of natural selection), thereby increasing the imbalance and promoting the evolution of recombination. We further show that population dynamics that lead to the fixation of these selectively mismatched genotypes (the *process* of natural selection) themselves produce an average imbalance that promotes the evolution of recombination. This fact is completely independent of the distribution of allelic fitness effects and is primarily due to the additive component of those effects. Our findings provide a novel vantage point from which the enormous body of established work on the evolution of sex and recombination may be viewed anew. They further suggest that recombination evolved and is maintained more as an unavoidable byproduct of natural selection than as a catalyst.

The ability to exchange genetic material through recombination (and sex) is a heritable trait^{5, 6} that is influenced by many different evolutionary and ecological factors, both direct and indirect, both positive and negative. Evidence from nature clearly indicates that the net effect of these factors must be positive: recombination across all levels of organismal size and complexity is undeniably the rule rather than the exception^{2–4, 7}. Theoretical studies, on the other hand, have revealed a variety of different mechanisms and circumstances that can promote the evolution of recombination, but each one by itself is of limited scope^{2, 4, 8}. These studies would thus predict that the absence of recombination is the rule and its presence an exception. The sheer abundance of these exceptions, however, can be seen as amounting to a rule in its own right – a “pluralist” view that has been adopted by some authors to explain the ubiquity of recombination^{3, 7, 9}. The necessity of this pluralist view, however, may be seen as pointing toward a fundamental shortcoming in existing theory: perhaps some very general factor that would favor recombination has been missing^{3, 4, 8, 10}.

Existing theories of the evolution and maintenance of sex and recombination can be divided into those that invoke *direct* vs *indirect* selection on recombination. Theories invoking direct selection propose that recombination evolved and is maintained by some physiological effect that mechanisms of recombination themselves have on survival or on replication efficiency. Such theories might speak to the origins of sex and recombination but they falter when applied to their maintenance^{1}. Most theories invoke indirect selection: they assume that any direct or immediate effect of recombination mechanisms is small compared to the trans-generational consequences of recombination.

While differing on the causal factors involved, established theoretical approaches that invoke indirect selection are unanimous in their identification of the fundamental selective environment required for sex and recombination to evolve: a population must harbor an excess of selectively mismatched combinations of alleles across loci and a deficit of selectively matched combinations. Recombination is favoured under these conditions because on average it breaks up the mismatched combinations and assembles matched combinations. Assembling selectively matched combinations increases the efficiency of natural selection: putting high-fitness alleles together can expedite their fixation^{11–15}, and putting low-fitness alleles together can expedite their elimination^{16, 17}. Under these conditions therefore, populations with recombination have an evolutionary advantage over populations without. Furthermore, competition among recombination-rate variants at a *modifier* locus under these conditions will increase recombination within a single population^{18}. But it is also true that recombinants themselves formed from randomly-chosen parents within such a population have an advantage that is relatively immediate, and this will be the measure we use to assess the evolutionary potential of recombination.

The great challenge has been to identify an evolutionary source of the aforementioned imbalance whose prevalence in nature is comparable to the prevalence of sex and recombination in nature. One feature of living things whose prevalence approximates that of sex and recombination is right under our noses, namely, evolution by natural selection. This observation inspired the aim of the present study, which is to assess the effects of pure natural selection on the evolutionary potential of recombination.

We preface our developments with an essential technical point. In much of the relevant literature, the measure of selective mismatch across loci affecting the evolution of recombination is linkage disequilibrium (LD)^{8, 12, 13, 19–21}, which measures bias in allelic frequencies across loci but does not retain information about the selective value of those alleles. Here, the objectives of our study require a slight departure from tradition: our measure of selective mismatch will be covariance between genic fitnesses. This departure is necessary because covariance retains information about both the frequencies and selective value of alleles, and it is convenient because the mean selective advantage accrued by recombinants over the course of a single generation is equal to minus the covariance (Methods and Extended Data Fig. 1). Our results will thus be given in terms of covariance and we recall: negative covariance means positive selection for recombinants.

## Setting

We reduce the problem to what we believe is its most essential form: we ask how the selective value of haploid recombinants is affected when natural selection simply acts on standing heritable variation. We ask this: 1) when recombination occurs across the *products* of natural selection (e.g., fixed genotypes), and 2) when recombination occurs within a population during the *process* of natural selection.

To eliminate potentially confounding factors and study the effects of natural selection in isolation, we consider large (effectively infinite) populations, each of which consists of just two competing genotypes that differ in both of two genes (or two *loci*). This simple setting permits clean presentation and mathematical tractability and, more importantly, is biologically motivated by the observation that large clonal populations tend to be overwhelmingly dominated by one or two genotypes^{22}. It further provides a connection to foundational evolution-of-sex studies^{23–25}: Fisher^{24} considered the case of a single beneficial mutation arising on a variable background, thereby effectively giving rise to two competing genotypes – wildtype and beneficial mutant – that differ in both the gene with the beneficial mutation (call it the *x* gene) and its genetic background (call it the *y* gene); Muller^{25} considered the case of two competing genotypes, one carrying a beneficial mutation in the *x* gene and the other in the *y* gene. Both of these approaches consider two competing genotypes that differ in both of two loci, and these foundational models are thus subsumed by our framework. Simulations further confirm the adequacy of this two-genotype setting: increasing the number of genotypes only accentuates the effects we describe.

We consider a clonal haploid organism whose genome consists of just two fitness-related loci labeled *x* and *y*. Genetically-encoded phenotypes at these two loci are quantified by random variables *X* and *Y*, both of which are positively correlated with fitness. In each large population of such organisms, two genotypes exist: one encodes the phenotype (*X*_{1}, *Y*_{1}), has fitness *Z*_{1} = *ϕ*(*X*_{1}, *Y*_{1}) and exists at some arbitrary initial frequency *p*; the other encodes phenotype (*X*_{2}, *Y*_{2}), has fitness *Z*_{2} = *ϕ*(*X*_{2}, *Y*_{2}) and exists at initial frequency 1 − *p*. We note that, in the absence of epistasis or dominance, the scenario we describe is formally equivalent to considering a diploid organism whose genome consists of one locus and two alleles available to each haploid copy. The question we ask is this: Does the action of natural selection, by itself, affect covariance between *X* and *Y*, denoted *σ _{XY}*, and if so, how?

### The *products* of natural selection promote recombination

Figure 1 illustrates the problem by analogy to a set of canoe races. Figure 2 shows how the problem is posed analytically. On the surface, one might suspect that natural selection would promote well-matched combinations in which large values of *X* are linked to large values of *Y*, thereby creating a positive association between *X* and *Y*. In fact, this notion is so intuitive that it is considered self-evident, explicitly or implicitly, in much of the literature^{1–3, 7, 9, 14, 26, 27}. If this notion were true, recombination would break up good allelic combinations, on average, and should thus be selectively suppressed. Such allele shuffling has been called “genome dilution”, a label that betrays its assumed costliness. We find, however, that the foregoing intuition is wrong. To the contrary, we find that natural selection will, on average, promote an excess of mismatched combinations in which large values of *X* are linked to small values of *Y*, or vice versa, thereby creating a negative association between *X* and *Y*. Recombination will on average break up the mismatched combinations created by natural selection, assemble well-matched combinations, and should thus be favoured.

Figure 3 illustrates why our initial intuition was wrong and why natural selection instead tends to create negative fitness associations among genes. For simplicity of presentation, we assume here that an individual’s fitness is *Z* = *ϕ*(*X, Y*) = *X* + *Y*, i.e., that *X* and *Y* are simply additive genic fitness contributions, and that *X* and *Y* are independent. In the absence of recombination, selection does not act independently on *X* and *Y* but on their sum, *Z* = *X* + *Y*. Perhaps counter-intuitively, this fact alone creates negative associations. To illustrate, we suppose that we know the fitness of successful genotypes – the *products* of natural selection – to be some constant, *z*, such that *X* + *Y* = *z*; here, we have the situation illustrated in Fig 3a and we see that *X* and *Y* are negatively associated; indeed, covariance is immediate: *σ _{XY}* = −*σ _{X}σ _{Y}* ≤ 0. Of course, in reality the fitnesses of successful genotypes will not be known *a priori*, nor will they be equal to a constant; instead, they will follow a distribution of maxima of *Z* as illustrated in Fig 3b. If populations consist of *n* contending genotypes, then *X*_{(n)} + *Y*_{(n)} = *Z*^{[n]}, the *n*^{th} order statistic of *Z* with genic components *X*_{(n)} and *Y*_{(n)} (called *concomitants* in the probability literature^{28}). In general, *Z*^{[n]} will have smaller variance than *Z*. Components *X*_{(n)} and *Y*_{(n)}, therefore, while not exactly following a line as in Fig 3a, will instead be constrained to a comparatively narrow distribution about that straight line, illustrated by Fig 3b, thereby creating a negative association. Figure 3c plots ten thousand simulated populations evolving from their initial (green dots) to final (black dots) mean fitness components; this panel confirms the predicted negative association.
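The argument above can be illustrated with a minimal simulation sketch. It is not the authors' simulation code: it assumes independent standard normal genic fitnesses, additive fitness *Z* = *X* + *Y*, and that the fittest of *n* contending genotypes is the one that fixes. The names `covariance` and `winner_covariance` are hypothetical helpers introduced only for this illustration.

```python
import random


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def winner_covariance(n_genotypes, n_populations=20000, seed=1):
    """Covariance between X and Y among the winners of many populations.

    Assumptions (illustrative only): X and Y are independent standard
    normal draws and fitness is additive, Z = X + Y.
    """
    rng = random.Random(seed)
    win_x, win_y = [], []
    for _ in range(n_populations):
        genotypes = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
                     for _ in range(n_genotypes)]
        # The "product" of selection is the genotype with the largest X + Y.
        x, y = max(genotypes, key=lambda g: g[0] + g[1])
        win_x.append(x)
        win_y.append(y)
    return covariance(win_x, win_y)


if __name__ == "__main__":
    for n in (2, 10):
        print(n, winner_covariance(n))
```

Under these assumptions the covariance among winners comes out negative even though *X* and *Y* are drawn independently, and it becomes more negative when more genotypes contend.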

What we have shown so far is that, if recombination occurs across the *products* of natural selection, such as fixed genotypes in different populations, subpopulations, demes or competing clones, the resulting offspring should be more fit than their parents, on average. This effect provides novel insight into established observations that population structure can favor recombination^{29–33} and may even speak to notions that out-crossing can create hybrid vigour (heterosis).

### The *process* of natural selection promotes recombination

Here, our focus is on recombinants formed from two randomly-chosen parents within the same unstructured evolving population. Here again, Fig 2 shows how the problem is posed analytically. Natural selection will cause the two competing genotypes to change in frequency, causing *σ _{XY}* to change over time (*σ _{XY}* = *σ _{XY}*(*t*)). If the time-integrated covariance, ∫_{0}^{∞} *σ _{XY}*(*t*) *dt*, is positive (negative) on average, this indicates that natural selection creates conditions that oppose (favour) recombination on average.

In the Methods, we show that, in expectation, covariance over the long run is unconditionally non-positive, 𝔼[∫_{0}^{∞} *σ _{XY}*(*t*) *dt*] ≤ 0, implying that the *process* of natural selection, on average, always creates conditions that favour recombination. Remarkably, this finding requires no assumptions about the distribution of *X* and *Y*; in fact, a smooth density is not required. Indeed, this distribution can have strongly positive covariance, and yet the net effect of natural selection is still to create negative time-integrated covariance. Monte Carlo integration shows that time-integrated covariance is indeed negative under a wide range of different distributions (Extended Data Fig. 2), and recombinant advantage increases as the number of alleles increases (Extended Data Fig. 3).

We further show that it is primarily the additive component of fitness that causes time-integrated covariance to be negative. This fact stands in contrast with previous notions that non-additive effects, specifically negative or fluctuating epistasis, are an essential ingredient in the evolution of recombination^{19, 21, 34–37}.
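A rough Monte Carlo sketch of the within-population result follows. It is an illustration under stated assumptions, not the authors' procedure: fitness is additive (*Z* = *X* + *Y*), the four genic values are drawn independently from a common distribution, both genotypes start at frequency 1/2, and selection acts in continuous time as d*p*/d*t* = *p*(1 − *p*)(*Z*_{2} − *Z*_{1}). Under those assumptions the integral of *p*(1 − *p*) up to a finite horizon equals (*p*(horizon) − *p*_{0})/(*Z*_{2} − *Z*_{1}), which the code uses directly. A finite horizon is used so that sample means stay stable; the function names are hypothetical.

```python
import math
import random


def integrated_covariance(x1, y1, x2, y2, p0=0.5, horizon=20.0):
    """Integral of sigma_XY(t) over [0, horizon] for two competing genotypes.

    p0 is the initial frequency of genotype 2. With two genotypes,
    sigma_XY(t) = p (1 - p) (X2 - X1) (Y2 - Y1).
    """
    dx, dy = x2 - x1, y2 - y1
    dz = dx + dy  # fitness difference under additive fitness
    if abs(dz) < 1e-9:
        # Neutral limit: frequencies do not change over time.
        return dx * dy * p0 * (1.0 - p0) * horizon
    exponent = -dz * horizon
    if exponent > 700.0:
        p_end = 0.0  # genotype 2 is effectively lost; avoids overflow in exp
    else:
        p_end = 1.0 / (1.0 + (1.0 - p0) / p0 * math.exp(exponent))
    return dx * dy * (p_end - p0) / dz


def mean_integrated_covariance(draw, n_pairs=50000, seed=1):
    """Average the integrated covariance over many random genotype pairs.

    `draw(rng)` returns one genic fitness value; X1, Y1, X2, Y2 are i.i.d.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        x1, y1, x2, y2 = draw(rng), draw(rng), draw(rng), draw(rng)
        total += integrated_covariance(x1, y1, x2, y2)
    return total / n_pairs


if __name__ == "__main__":
    samplers = {
        "normal": lambda rng: rng.gauss(0.0, 1.0),
        "exponential": lambda rng: rng.expovariate(1.0),
        "discrete": lambda rng: rng.choice((0.0, 1.0, 2.0)),
    }
    for name, draw in samplers.items():
        print(name, mean_integrated_covariance(draw))
```

The three samplers are arbitrary choices meant to show that the sign of the average does not hinge on a smooth or symmetric density; other distributions can be substituted by passing a different `draw` function.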

## Discussion

Some authors^{2, 38} have argued that negative associations build up within a population because positive associations, in which alleles at different loci are selectively matched, are either removed efficiently (when they are both similarly deleterious), or fixed efficiently (when they are both similarly beneficial), thereby contributing little to overall within-population associations. Genotypes that are selectively mismatched, on the other hand, have longer sojourn times, as the less-fit loci effectively shield linked higher-fitness loci from selection. The net effect, it is argued, should be that alleles across loci will on average be selectively mismatched within a population. Our findings differ from these arguments in one respect, namely, we find that even genotypes that are ultimately fixed carry selectively mismatched alleles. In another respect, however, these arguments are entirely consistent with our findings: Equation (1) in Proposition 6 gives time-integrated covariance; it is intuitively more likely that the numerator of that equation will be negative when the denominator is small, i.e., when *Z*_{1} and *Z*_{2} are close to each other. Negative values are thus amplified because they tend to occur when the total fitnesses of the two genotypes are close to each other, so that the genotypes coexist for a longer period of time before one displaces the other. Indeed, this is the intuitive way to understand Proposition 7.

We have identified a phenomenon that is an inherent consequence of natural selection and gives rise to selectively mismatched combinations of alleles across loci. Generally speaking, this pervasive phenomenon is an example of counter-intuitive effects caused by probabilistic conditioning. For example, “Berkson’s bias”^{39, 40} arises when a biased observational procedure produces spurious negative correlations. In the original context, among those admitted to hospital due to illness, a negative correlation among potentially causative factors was observed because those with no illness (who tended to have no causative factors) were not admitted to the hospital and hence not observed. Similarly, negative correlations arise across genic fitnesses in part because genotypes in which both loci have low genic fitness are purged by selection; here, however, the bias is not observational but actual, as these low-fitness genotypes no longer exist in the population.
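The conditioning effect described above can be reproduced in a few lines. The sketch below is only an illustration under assumed conditions (independent standard normal genic fitnesses, additive fitness, and a hard fitness threshold standing in for purging by selection); `covariance_after_selection` is a hypothetical helper, not part of the study.

```python
import random


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def covariance_after_selection(threshold=0.0, n_genotypes=20000, seed=1):
    """Return (covariance before, covariance after) truncation on X + Y.

    Genotypes whose total fitness X + Y falls at or below `threshold` are
    removed, mimicking the purging of genotypes with low fitness at both loci.
    """
    rng = random.Random(seed)
    pairs = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
             for _ in range(n_genotypes)]
    survivors = [(x, y) for x, y in pairs if x + y > threshold]
    before = covariance([x for x, _ in pairs], [y for _, y in pairs])
    after = covariance([x for x, _ in survivors], [y for _, y in survivors])
    return before, after


if __name__ == "__main__":
    print(covariance_after_selection())
```

Before truncation the covariance is close to zero by construction; among survivors it is clearly negative, which is the same mechanism as Berkson's bias except that the excluded genotypes are truly absent rather than merely unobserved.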

Several relevant issues are beyond the scope of this study. For example, our approach implicitly assumes that recombination will evolve as a result of persistent recombinant advantage, but we have not explicitly shown this. In simulations (Extended Data Figs. 4-7), we show that our reductionist approach does indeed cause recombination modifiers to increase in frequency. These simulations are perhaps superfluous, however, given our finding that recombinants themselves are chronically advantageous. Also, we have omitted mutation, which was necessary in order to study the effects of natural selection in isolation.

Many previous studies, in one way or another, point to the increase in agility and efficiency of adaptation that recombination confers as the primary cause of its evolution. Here, we invert the perspective of those earlier studies, asking not whether recombination speeds adaptation, but whether adaptation via natural selection generally creates selective conditions that make recombinants directly and immediately advantageous. If so, as our findings indicate, then: 1) the ubiquity of recombination in nature might be less enigmatic than previously thought, and 2) perhaps recombination arose and is maintained more as an unavoidable byproduct than as a catalyst of natural selection.

## Methods

### Notes

In the main text, we employ the shorthand *σ _{XY}* to denote covariance. In what follows, however, we use *σ _{XY}* and Cov(*X, Y*) interchangeably (for clarity). Several of the proofs here are abridged; full proofs are in the SI, as well as alternative and supplemental proofs.

### Covariance and recombinant advantage

Much work on the evolution of recombination employs linkage disequilibrium (LD) as the measure of across-loci associations. It is straightforward to estimate LD from genomic sequence data, which likely explains the popularity of this measure. LD, however, contains no information about the selective cost of such associations. Covariance, on the other hand, retains all of the information regarding both the prevalence of linkage and its selective cost (i.e., recombinant advantage), and is thus the measure we employ. We note that when the fitness function is a bivariate Bernoulli distribution (*ϕ*(*X, Y*) = ℙ{*X* = *i, Y* = *j*} = *p _{i,j}, i, j* ∈ {0, 1}) then covariance and disequilibrium are equivalent (*σ _{XY}* = *D* = *p*_{1,1} − *p*_{1,•}*p*_{•,1}).

Recombinants are formed from two randomly-chosen contemporaneous parents such that their genetic makeup is simply an unbiased random sampling of the pool of available alleles at the *x* and *y* loci. As such, their instantaneous advantage is zero on average: 𝔼[*Z _{R}*] − 𝔼[*Z*] = 0, where subscript *R* denotes recombinant and no subscript denotes wildtype. Recombinants and wildtype, however, gain fitness at different rates: because mean fitness increases at a rate equal to the variance in fitness, and because alleles at the two loci are uncorrelated among recombinants, d𝔼[*Z _{R}*]/d*t* = *σ _{X}*^{2} + *σ _{Y}*^{2} and d𝔼[*Z*]/d*t* = *σ _{X}*^{2} + 2*σ _{XY}* + *σ _{Y}*^{2}. A first-order expansion thus reveals that the selective advantage of recombinants after a single generation of growth is −2*σ _{XY}*. A single-generation Moran model (Extended Data Fig. 1) shows this prediction to be accurate and that covariance increases linearly in the first generation, implying that the mean selective advantage of recombinants over that first generation is −*σ _{XY}*. We note that a generalized linear relationship between fitness and phenotypes *X* and *Y*, i.e., *Z* = *k*_{0} + *k _{X}X* + *k _{Y}Y*, yields a recombinant advantage of −*k _{X}k _{Y}σ _{XY}*. A full treatment of the relation between covariance and recombinant advantage is found in the SI, as well as the relation between our approach and classical population genetics.
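The relation between covariance and recombinant advantage can be checked numerically in a toy setting. The sketch below is a hedged illustration, not the Moran model of Extended Data Fig. 1: it assumes two wildtype genotypes with additive fitness, a recombinant pool in which alleles at the two loci are paired independently, and continuous-time growth in which each type grows as exp(fitness × *t*) within its own pool. All function names are hypothetical.

```python
import math


def mean_fitness(types, t):
    """Mean fitness at time t of a pool of (frequency, fitness) types,
    where each type grows as exp(fitness * t)."""
    weights = [f * math.exp(z * t) for f, z in types]
    total = sum(weights)
    return sum(w * z for w, (_, z) in zip(weights, types)) / total


def covariance_two_genotypes(x1, y1, x2, y2, p):
    """sigma_XY for genotypes (x1, y1) at frequency p and (x2, y2) at 1 - p."""
    return p * (1.0 - p) * (x2 - x1) * (y2 - y1)


def recombinant_advantage(x1, y1, x2, y2, p, t):
    """Mean fitness of the recombinant pool minus that of the wildtype pool."""
    q = 1.0 - p
    wildtype = [(p, x1 + y1), (q, x2 + y2)]
    # Recombinants sample the x allele and the y allele independently.
    recombinant = [(p * p, x1 + y1), (p * q, x1 + y2),
                   (q * p, x2 + y1), (q * q, x2 + y2)]
    return mean_fitness(recombinant, t) - mean_fitness(wildtype, t)


if __name__ == "__main__":
    args = (1.0, 0.2, 0.4, 0.9, 0.4)  # x1, y1, x2, y2, p (arbitrary values)
    sigma = covariance_two_genotypes(*args)
    h = 1e-4
    print("sigma_XY:", sigma)
    print("initial advantage:", recombinant_advantage(*args, 0.0))
    print("rate of advantage gain:", recombinant_advantage(*args, h) / h)
```

In this toy setting the advantage starts at zero and initially grows at a rate of about −2*σ _{XY}*, so its average over a short first interval is about −*σ _{XY}* times the interval length.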

### The *products* of natural selection promote recombination

The setting for this problem is shown in Fig 2. No hypothesis on the fitness function *ϕ* is made at this point, apart from being measurable. For the sake of compact presentation we assume here that (*X*_{1}, *Y*_{1}, *X*_{2}, *Y*_{2}) are i.i.d.; departures from this and other simplifying assumptions are dealt with in the SI. As defined in Fig 2, *Z _{i}* = *ϕ*(*X _{i}, Y _{i}*), *Z*^{[i]} = *ϕ*(*X*_{(i)}, *Y*_{(i)}), and *Z*^{[2]} > *Z*^{[1]}.

Let *Ψ* be any measurable function from ℝ^{2} *into* ℝ. Then:. In particular, the arithmetic mean of *and* *is* .

Proof: Consider a random index *I* ∈ {1, 2}, and for now ℙ(*I* = 1) = ℙ(*I* = 2) = 1*/*2, and *I* is independent of (*X*_{1}, *Y*_{1}, *X*_{2}, *Y*_{2}). The couple (*X _{I}, Y _{I}*) is distributed as (*X*_{1}, *Y*_{1}). Hence, however,

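*We have*:

$$\mathrm{Cov}(X_{(1)}, Y_{(1)}) + \mathrm{Cov}(X_{(2)}, Y_{(2)}) = -\tfrac{1}{2}\,\big(\mathbb{E}[X_{(2)}] - \mathbb{E}[X_{(1)}]\big)\big(\mathbb{E}[Y_{(2)}] - \mathbb{E}[Y_{(1)}]\big) = -\big(\mathrm{Cov}(X_{(1)}, Y_{(2)}) + \mathrm{Cov}(X_{(2)}, Y_{(1)})\big).$$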

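Proof: The couples (*X*_{(I)}, *Y*_{(I)}) and (*X*_{(I)}, *Y*_{(3−I)}) are both distributed as (*X*_{1}, *Y*_{1}). Therefore their covariances are null. These covariances can also be computed by conditioning on *I* (see *e.g.* formula (1.1) in ^{41}). For (*X*_{(I)}, *Y*_{(I)}) we have:

$$\mathrm{Cov}(X_{(I)}, Y_{(I)}) = \mathbb{E}\big[\mathrm{Cov}(X_{(I)}, Y_{(I)} \mid I)\big] + \mathrm{Cov}\big(\mathbb{E}[X_{(I)} \mid I],\, \mathbb{E}[Y_{(I)} \mid I]\big).$$

On the right-hand side, the first term is:

$$\mathbb{E}\big[\mathrm{Cov}(X_{(I)}, Y_{(I)} \mid I)\big] = \tfrac{1}{2}\,\mathrm{Cov}(X_{(1)}, Y_{(1)}) + \tfrac{1}{2}\,\mathrm{Cov}(X_{(2)}, Y_{(2)}).$$

The second term is:

$$\mathrm{Cov}\big(\mathbb{E}[X_{(I)} \mid I],\, \mathbb{E}[Y_{(I)} \mid I]\big) = \tfrac{1}{4}\,\big(\mathbb{E}[X_{(2)}] - \mathbb{E}[X_{(1)}]\big)\big(\mathbb{E}[Y_{(2)}] - \mathbb{E}[Y_{(1)}]\big).$$

Similarly, we have:

$$\mathrm{Cov}(X_{(I)}, Y_{(3-I)}) = \mathbb{E}\big[\mathrm{Cov}(X_{(I)}, Y_{(3-I)} \mid I)\big] + \mathrm{Cov}\big(\mathbb{E}[X_{(I)} \mid I],\, \mathbb{E}[Y_{(3-I)} \mid I]\big).$$

The first term in the right-hand side is:

$$\mathbb{E}\big[\mathrm{Cov}(X_{(I)}, Y_{(3-I)} \mid I)\big] = \tfrac{1}{2}\,\mathrm{Cov}(X_{(1)}, Y_{(2)}) + \tfrac{1}{2}\,\mathrm{Cov}(X_{(2)}, Y_{(1)}).$$

The second term in the right-hand side is:

$$\mathrm{Cov}\big(\mathbb{E}[X_{(I)} \mid I],\, \mathbb{E}[Y_{(3-I)} \mid I]\big) = -\tfrac{1}{4}\,\big(\mathbb{E}[X_{(2)}] - \mathbb{E}[X_{(1)}]\big)\big(\mathbb{E}[Y_{(2)}] - \mathbb{E}[Y_{(1)}]\big).$$

Hence the result.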

*Assume that the fitness function ϕ is symmetric: ϕ*(*x, y*) = *ϕ*(*y, x*)*. Then the couple* (*X*_{(1)}, *Y*_{(2)}) *has the same distribution as the couple* (*Y*_{(1)}, *X*_{(2)}).

As a consequence, *X*_{(1)} and *Y*_{(1)} have the same distribution, so do *X*_{(2)} and *Y*_{(2)}. Thus: . Another consequence is that: Cov(*X*_{(1)}, *Y*_{(2)}) = Cov(*X*_{(2)}, *Y*_{(1)}). Thus by Proposition 2: .

Proof: Since *ϕ* is symmetric, the change of variable (*X*_{1}, *Y*_{1}, *X*_{2}, *Y*_{2}) ↦ (*Y*_{1}, *X*_{1}, *Y*_{2}, *X*_{2}) leaves unchanged the couple (*Z*_{1}, *Z*_{2}).

*Assume that the ranking function ϕ is the sum: ϕ* (*x, y*) = *x*+*y*. Then: , and .

Proof: The first two equalities come from Proposition 3. By definition, . Hence the inequality.

*Assume that the ranking function ϕ is the sum, and that the common distribution of X*_{1}, *Y*_{1}, *X*_{2}, *Y*_{2} *is symmetric: there exists a constant a such that f*(*x* − *a*) = *f*(*a* − *x*)*. Then* (*a* − *X*_{(1)}, *a* − *Y*_{(1)}) *has the same distribution as* (*X*_{(2)} − *a, Y*_{(2)} − *a*)*. As a consequence,* Cov(*X*_{(1)}, *Y*_{(1)}) = Cov(*X*_{(2)}, *Y*_{(2)}).

Proof: The change of variable (*X*_{1}, *Y*_{1}, *X*_{2}, *Y*_{2}) ↦ (2*a* − *X*_{1}, 2*a* − *Y*_{1}, 2*a* − *X*_{2}, 2*a* − *Y*_{2}) leaves the distribution unchanged. It only swaps the indices *i* and *s* of minimal and maximal sum.

If we summarize Propositions 1, 2, 3, 4, 5 for the case where the ranking function is the sum, and the distribution is symmetric, one gets:

### The *process* of natural selection promotes recombination

We recall that recombinant advantage is −*σ _{XY}*. Here, we study how the selection-driven changes in types (*X*_{1}, *Y*_{1}) and (*X*_{2}, *Y*_{2}) *within a single unstructured population* change *σ _{XY}* = *σ _{XY}*(*t*) over time. We are interested in the net effect of these changes, given by the time integral ∫_{0}^{∞} *σ _{XY}*(*t*) *dt*; in particular, we are interested in knowing whether this quantity is positive (net recombinant disadvantage) or negative (net recombinant advantage).

*Within-population covariance integrated over time is*:

$$\int_0^\infty \sigma_{XY}(t)\,dt = q\,\frac{(X_2 - X_1)(Y_2 - Y_1)}{|Z_2 - Z_1|} \tag{1}$$

*where q is the initial frequency of the inferior genotype and Z _{i}* = *ϕ*(*X _{i}, Y _{i}*)*, where the fitness function ϕ can be any function. No assumption about the distribution of* (*X, Y*) *is required.*

Proof: We let *p* denote initial frequency of the superior of the two genotypes, and we let *q* = 1−*p* denote initial frequency of the inferior genotype. Time-integrated covariance is:

Integration by parts yields:
where *q* in Prop 6 is written as 1 − *p*_{0}. We observe that:
and that
from which we have:

*We define spacings* Δ*X* = *X*_{2} − *X*_{1}, Δ*Y* = *Y*_{2} − *Y*_{1}, and Δ*Z* = *Z*_{2} − *Z*_{1} = Δ*X* + Δ*Y*. *If the pairs* (*X _{i}, Y _{i}*) *are independently drawn from any distribution, then* Δ*X and* Δ*Y are symmetric about zero, and time-integrated covariance is unconditionally non-positive*:

$$\mathbb{E}\left[\int_0^\infty \sigma_{XY}(t)\,dt\right] \le 0.$$

Proof: There is no need to assume that (Δ*X*, Δ*Y*) has a density. This proof also reveals that the result also holds for discrete random variables. Let Δ*X*, Δ*Y* be two real-valued random variables such that: (−Δ*X*, Δ*Y*) has the same distribution as (Δ*X*, Δ*Y*). We have:

When Δ*X* and Δ*Y* have the same sign as imposed by the indicator function in the last expectation, we have |Δ*X* + Δ*Y* | *>* |Δ*Y* − Δ*X*|, from which the inequality derives.

*Proposition 7 holds for divergent expectations.*

Proof: Set *U* = |Δ*X*| and *V* = |Δ*Y* |; *M* = Max(*U, V*), *m* = Min(*U, V*). Then you can rewrite the expectation as:

Indeed, if the expectation is divergent, then it is always −∞. This approach removes the need to make the argument that *U* + *V >* |*U* − *V* | and avoids the need to take a difference of expectations. An alternative approach is given in an expanded statement and proof of Proposition 7 in the SI.
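As a numerical sanity check on the time-integrated covariance of two competing genotypes, the sketch below integrates the dynamics directly and compares the result with the closed form *q* Δ*X* Δ*Y* / |Δ*Z*|. This is an illustration under assumptions that go beyond the text: continuous-time selection d*p*/d*t* = *p*(1 − *p*)Δ*Z* with *p* the frequency of genotype 2, additive fitness so that Δ*Z* = Δ*X* + Δ*Y*, and a fourth-order Runge-Kutta integrator with an arbitrarily chosen step and horizon. The function names are hypothetical.

```python
def time_integrated_covariance_numeric(x1, y1, x2, y2, p0,
                                       horizon=100.0, dt=0.01):
    """Integrate sigma_XY(t) = p (1 - p) dX dY along dp/dt = p (1 - p) dZ
    with a fourth-order Runge-Kutta scheme; p0 is the initial frequency
    of genotype 2."""
    dx, dy = x2 - x1, y2 - y1
    dz = dx + dy

    def dp_dt(p):
        return p * (1.0 - p) * dz

    def cov(p):
        return p * (1.0 - p) * dx * dy

    p, integral = p0, 0.0
    for _ in range(int(round(horizon / dt))):
        k1 = dp_dt(p)
        k2 = dp_dt(p + 0.5 * dt * k1)
        k3 = dp_dt(p + 0.5 * dt * k2)
        k4 = dp_dt(p + dt * k3)
        # Same stage points are used for the running integral of covariance.
        integral += dt / 6.0 * (cov(p) + 2.0 * cov(p + 0.5 * dt * k1)
                                + 2.0 * cov(p + 0.5 * dt * k2)
                                + cov(p + dt * k3))
        p += dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return integral


def time_integrated_covariance_closed_form(x1, y1, x2, y2, p0):
    """Closed form q * dX * dY / |dZ|, with q the initial frequency of the
    inferior genotype (assumes dZ != 0)."""
    dx, dy = x2 - x1, y2 - y1
    dz = dx + dy
    if dz == 0:
        raise ValueError("closed form is undefined when the fitnesses are equal")
    q = 1.0 - p0 if dz > 0 else p0
    return q * dx * dy / abs(dz)


if __name__ == "__main__":
    args = (1.0, 0.2, 0.6, 0.9, 0.3)  # x1, y1, x2, y2, p0 (arbitrary values)
    print(time_integrated_covariance_numeric(*args))
    print(time_integrated_covariance_closed_form(*args))
```

The horizon must be long enough for the superior genotype to approach fixation; when the two fitnesses are nearly equal, a much longer horizon is needed, which mirrors the long coexistence times discussed in the main text.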

## Author contributions

P.G. conceived the theory conceptually; P.G., P.S., B.S. and A.C. developed the theory verbally and with simulation; P.G., B.Y. and J.C. developed the theory mathematically; B.Y. and J.C. provided mathematical proofs for the across-population part; P.G., V.V., F.C. and N.H. provided mathematical proofs for the within-population part. P.G. wrote the paper with critical help and guidance from B.S., P.S. and B.Y.

## Competing interests

The authors declare no competing interests.

## Additional information

**Supplementary information** is available for this paper at https://doi.org/10.1038/s….

**Correspondence and requests for materials** should be addressed to P.G.

## Acknowledgements

The authors thank S. Otto and N. Barton for their thoughts on early stages of this work. Special thanks go to E. Baake for her thoughts on later stages of this work and help with key mathematical aspects. Much of this work was performed during a CNRS-funded visit (P.G.) to the Laboratoire Jean Kuntzmann, University of Grenoble Alpes, France, and during a visit to Bielefeld University (P.G.) funded by Deutsche Forschungsgemeinschaft (German Research Foundation, DFG) via Priority Programme SPP 1590 Probabilistic Structures in Evolution, grants BA 2469/5-2 and WA 967/4-2. The authors thank D. Chencha, J. Streelman, R. Rosenzweig and the Biology Department at Georgia Institute of Technology for critical infrastructure and computational support. P.G. and A.C. received financial support from the USA/Brazil Fulbright scholar program. P.G. and P.S. received financial support from National Aeronautics and Space Administration grant NNA15BB04A.