The creation-mutation-selection model:

,


Introduction 20
The creation-mutation-selection model is a mathematical model for the fitness of asexual and 21 sexual populations in the presence of positive and negative selection [1]. Unfortunately, for an 22 asexual population, the model only predicts the fitness of the population if the variance or non-23 standardized skewness of the population log fitness distribution is known. Thus to compare the 24 fitness of asexual and sexual populations it is necessary to resort to simulation. 25 Computer simulations of the asexual and sexual models was performed, and compared to the 26 predictions of the creation-mutation-selection model. The model depends on several parameters. 27 Reasonable values for these parameters have been estimated previously, and were used for the 28 simulations [2]. 29 Reviewing the creation-mutation-selection model, consider a population of N haploid organisms in a 30 Wright-Fisher like discrete time model of non-overlapping generations. Suppose positive mutational 31 opportunities occur with mean rate Γ p per generation across all organisms, and negative mutational 32 opportunities occur with mean rate Γ n per organism. Let µ ss be the spontaneous mutation rate, 33 that is the rate of satisfaction of mutational opportunity sites per sexual or asexual generation. 34 Let s be the selection coefficient associated with the satisfaction of a mutational opportunity site. 35 For sexual organisms selection is assumed to occur in haploids, or equivalently in diploids with 36 a selection coefficient of roughly 2s and a heterozygous effect of 1 2 . When N 1 s the sites are 37 termed neutral sites, and the positive or negative site creation rate is denoted Γ 0 . An organism 38 with no mutational opportunity sites has a fitness of 1. Fitness when multiple sites are present 39 is assumed to combine additively in log space, and log fitness will be a negative value. Natural 40 selection involves selection with replacement in accordance with fitness. Sex involves drawing a 41 new population of organisms by mating two randomly selected organisms at a time, and randomly 42 selecting mutational opportunities from each of them without linkage. 43 For presently existing eukaryotes, genomic analysis suggests Γ p is typically in the range 10 −3 to 44 10 −2 population wide adaptive mutational opportunities per sexual generation [2]. Smaller and 45 larger vales for Γ p are considered, partially to gain understanding, and partially to speculate that 46 prokaryotes experience smaller rates of positive selection. The range for Γ p of 10 −4 to 10 −1 will be 47 considered. 48 For eukaryotes, Γ n is typically in the range 10 −1 to 10 1 negative sites per organism per sexual 49 generation [2]. Values two orders of magnitude smaller than this are also considered so as to start 50 to encompass the range spanned by prokaryotes. Prokaryotes have spontaneous mutation rates per 51 genome in the range 10 −4 to 10 −2 per generation [3][ Figure 1, visual inspection assuming all sites 52 are under negative selection]. The range for Γ n of 10 −3 to 10 1 will be considered. 53 For eukaryotes, µ ss is typically in the range 10 −9 to 10 −8 spontaneous mutations per site per sexual 54 generation [2]. For bacteria the spontaneous mutation rate per base pair, µ bs , ranges from 1.4×10 −10 55 to 4×10 −9 , placing µ ss roughly in the range 10 −10 to 10 −9 [3][ Supplement 1]. Rates for archaea are 56 even smaller, at around 10 −11 per generation [3][Supplement 1]. The range for µ ss of 10 −10 to 10 −8 57 will be considered. 58 Let x be the log fitness of an organism, and let x be the mean log fitness of the population. So 59 they may be referred to in this paper, in the creation-mutation-selection model the key formulas 60 for x are as given below [1].

61
For asexual organisms, fitness expressed in terms of σ 2 x , the variance of x, For asexual organisms, fitness expressed in terms of µ 3,x , the third central moment of x, that is 63 non-standardized skewness, For sexual organisms: where 66 a s,0 = 1 For positive and negative selection, or for multiple selection coefficients, log fitness values may be 67 added in the case of sexual populations. For asexual populations the situation is more complex.

68
As will be shown, the results of computer simulation agree with the mathematical creation-69 mutation-selection model, thus validating the model.

70
Having computed the fitness in the asexual case it becomes possible to determine the advantage 71 of sex. This is done for prokaryotes, present day eukaryotes, the first eukaryotes, and the asexual 72 mitochondrion. Sex appears to have an advantage where it occurs, and not to have an advantage 73 where it doesn't.

75
The selection coefficients for sites as a result of negative selection are likely to be larger than for 76 positive selection. There are many ways a protein could mutate under negative selection rendering 77 it almost non-functional, but the likelihood of a positive mutation of similar magnitude is likely to 78 be rare.

79
Results for the asexual model 80 A computer simulation of the asexual case was developed. A default population size of 10 5 is used. 81 This is for ease of simulation. Values ranging from 10 4 to 10 8 are also considered. Values any larger 82 than this would require an excessive amount of simulation time.

83
Positive selection and asexual mean log fitness 84 The observed mean log fitness of the asexual model under positive selection is shown in Table 1. It 85 should be compared to Table 2 which shows the predicted log fitness calculated based on the mean 86 non-standardized skewness of the log fitness distribution (using equation 2). We could have equally 87 well have chosen to predict log fitness based on the mean variance of the log fitness distribution 88 (using equation 1). To estimate the average actual fitness values raise e to the power of these mean 89 log fitness values.

90
While correct according to our mathematical model, it should be noted that a few of the values 91 reported in Table 1 are so large as to be biologically invalid. E.g. for µ ss = 10 −10 , N = 10 5 , 92  Table 1: Observed mean log fitness for the asexual case under positive selection. Log fitness values are relative to a perfectly adapted organism having a log fitness of zero. µ ss is the rate of mutation per site per generation. N is the population size. Γ p is the rate of positive site creation per generation. Γ n is the rate of deleterious site creation per organism per generation. s is the constant selection coefficient for each site per generation.

97
As N increases, mean log fitness becomes smaller. However the rate at which it shrinks is not 98 excessive. This means that most of the conclusions to be reached based on smaller population sizes 99 should still be valid for large populations, such as 10 9 or more, which were not simulated.

100
The reported values for mean log fitness, x, have a standard error of 2.5% or less, apart from 101 seventeen small values of x all of which were computed by direct roll-out and for which standard 102 error values are variable. Predictions for x include standard error estimates indicated by the 103 plus/minus symbol based on the observed variability of µ 3,x . Predictions whose plus or minus two 104 standard error confidence interval fails to overlap a plus or minus 5% confidence interval for x are 105 indicated by italics. Under the hypothesis that the mathematical model is accurate this should 106 happen less than 5% of the time.

107
Despite typically running for 1,000,000 generations the mean value of µ 3,x often involves consid-108 erable standard error. Referring to equation 2, when µ 3,x is small this directly produces a large 109 percentage standard error in x. If expressed in absolute terms, the standard error would be no 110 larger for small values of x than for large values.

111
Predictions based on µ 3,x correctly predicted the value of x except primarily when N = 10 4 . It is 112 reasonable to assume that the discrete nature of small populations causes a loss in fidelity of µ 3,x , 113 causing predictions based on it to be inaccurate.

114
Negative selection and asexual mean log fitness 115 The observed mean log fitness of the asexual model under negative selection is shown in Table 3. 116 It should be compared to Table 4 which shows the predicted log fitness calculated based on the 117 mean non-standardized skewness or the rate of deleterious site creation (using equation 2).

118
For the most part, predictions based on negative skewness or the rate of deleterious site creation 119 match the observed values. The predictions for Γ n = 10 −1 and s = 10 −2 , and Γ n ≥ 1 and s = 10 −1 120 do not match. This appears to be the increasingly broad transition region from the existence of 121 zero or one mutational opportunity site to a continuous fitness distribution. Two of the measured 122 values for Γ n = 10 −3 , namely µ ss = 10 −9 and s = 10 −2 , and µ ss = 10 −10 and s = 10 −1 are small as 123 predicted, but still anomalous, possibly because the simulation wasn't run for long enough in these 124 cases. Running the first of these scenarios a second time with a different random seed produced 125 -0.000997, a value in line with predictions. The originally reported values are used in the subsequent 126 analysis.   Table 3: Observed mean log fitness for the asexual case under negative selection. Symbols are as described in Table 1.   Results for the sexual model For positive selection we are interested in assessing the accuracy of both the queuing formula and 132 the recurrence formula. However the recurrence formula requires that N 1 µss , which is larger than 133 the N values which can reasonably be simulated. Consequently to aid in verifying the recurrence 134 formula we also use a non-physical µ ss value of 10 −4 .

135
Tables 5 show the observed mean log fitness for the simulated sexual model. Table 6 shows mean 136 log fitness calculated using the larger of the queuing formula or recurrence formula (equation 3). 137 This means for µ ss = 10 −4 the recurrence formula was used to compute log fitness, while for smaller 138 values of µ ss the queuing formula was used.

139
Predictions that differed from observations by more than 2 standard errors are indicated by italics. 140 The two tables are in reasonably close agreement throughout. The major discrepancy being the 141 result for N = 10 4 and s = 10 −4 , but here N s = 1, which is a boundary case, and none of the 142  formulas in equation 3 are expected to apply. Other values for s = 10 −4 appear slightly more 143 negative than predicted. This might because N s is starting to approach 1, and random drift is 144 starting to play a role.

145
Sexual mean log fitness conjecture: For the sexual model with µ ss s 1 the mean log fitness 146 calculated using the recurrence formula is given by, Status: Unproven. Agrees with recurrence formula results shown in Table 6 and additional recur-148 rence formula values calculated but not shown.  Table 7 demonstrate 152 this to be correct, but it has not been tested exhaustively.  Comparison of the sexual and asexual models 154 Evaluating the advantage of sex under positive selection could be thought of as an evaluation of 155 the Fisher-Muller hypothesis. Subtracting mean log fitness for positive selection in the asexual case 156 (Table 1) from mean log fitness for positive selection in the sexual case (Table 5) produces Table 8. 157 Omitted sexual values were filled in with the more negative of the recurrence formula or queuing 158 formula (equation 3). This means the recurrence formula was used for N = 10 8 , and the queuing 159 formula elsewhere.

160
In the real world sex likely incurs additional costs not captured in the model. The increased 161 complexity, time, and resource consumption of meiosis. The cost of finding a mate. And so on. 162 It seems likely that these costs are going to incur no more than a small multiplicative reduction 163 factor to fitness. Or put another way, the log fitness cost is likely to be no more than, say, 2. 164 Everywhere in table 8 that exceeds this value, sexual reproduction can be expected to outperform 165 asexual reproduction. Empirically, apart from for Γ p = 10 −4 this is nearly always the case. E.g. 166 for µ ss = 10 −9 , N = 10 5 , Γ p = 10 −2 , and s = 10 −2 , the log advantage of sex is 3,760. This is a log 167 advantage, raising e to this amount produces an approximate advantage of sex of 10 1,633 . Raising 168 e to the values in table 8 typically produces large values, so the conclusions are insensitive to the 169 precise value of any additional costs of sex.

170
It might be expected that sex would always confer some positive or zero advantage. The negative 171 values in table 8 show this is not the case, at least for the model used here. Both the asexual 172 and sexual models involve a weighted sampling with replacement to simulate selection. The act of 173 recombination then involves a second unweighted sampling with replacement to select each parent. 174 No similar second sampling occurs in the asexual case. Sampling each parent in this way for the 175 sexual case reduces the number of organisms that are effectively parents of the next generation. 176 This reduces the odds that a newly occurring mutation will establish itself in the population for the 177 sexual case when compared to the asexual case. The effect of this stands out the most when selection 178 is strong and mutation is weak, In this case no more than one single mutation typically occurs in the 179 population at a time, and sex might initially have been expected to have had no impact. That this 180 additional sexual sampling causes worse performance for the sexual case was verified by picking the 181 worst offender of asexual outperformance, and testing the asexual model with an extra sampling 182 step. The worst offender is µ ss = 10 −10 , N = 10 5 , Γ p = 10 −3 , and s = 10     Table 9: Estimated advantage of sex for negative selection. Observed or predicted mean log fitness for the sexual case less observed mean log fitness in the asexual case. Symbols are as described in Table 1.
Evaluating the advantage of sex under negative selection might be thought of as an evaluation of 187 Muller's ratchet. Subtracting Table 3 from Table 7, or from equation 3 for parameter values that 188 were not simulated, produces Table 9 showing the log advantage of sex under negative selection.

189
For negative selection the advantage of sex is usually less than for positive selection for the same 190 parameter values. There is frequently no advantage when N is large, Γ n is small, and s is large.  Table 8. To get such a low value for Γ p requires a short generation time.

195
For negative selection under the same µ ss values, provided Γ n ≤ 10 −2 when s = 10 −2 , there 196 wouldn't be any disadvantage to asexuality. As mentioned in the introduction, prokaryotes have 197 Γ n ≤ 10 −2 mutations per genome per generation. To get this small value for Γ n requires a small 198 genome and a small spontaneous mutation rate.

199
Prokaryotes thrive when the rate of positive and negative site creation per generation is low. They 200 do so by having short generation times, small genomes, and small spontaneous mutation rates.

Present day eukaryotes 202
With Γ p in the range 10 −3 to 10 −2 , based on Table 8, there is typically a large benefit to sex as 203 a result of positive mutation for a wide range of µ ss , N , and s ≤ 10 −2 . A new positive mutation 204 with an s of 10 −1 is likely to be a very rare event.

205
With Γ n in the range 10 −1 to 10 1 , consulting Table 9, there would be a large advantage to sex 206 except for Γ n = 10 −1 and s = 10 −1 where the benefits are equivocal.

207
Thus eukaryotic species today obtain a significant benefit from sex, and do so through both positive 208 and negative selection.

209
Early eukaryotic evolution 210 Sex might make sense for eukaryotes today, but we also need to ask whether it would have made 211 sense for the first eukaryotes, and for that we need to turn to the archaea. Eukaryotes are believed 212 to have evolved from the merger of Asgard archaea and an alphaproteobacterial endosymbiont, 213 possibly in a geothermal environment [4]. The changing nature of geothermal environments may 214 have created a high rate of environmental positive site creation, making it an ideal environment 215 within which for sex to develop.

222
The known Asgard archaeal genomes range in size from 1.4-5.7M bp, so a typical value for the 223 genome length might be 3.6×10 6 base pairs. For archaea the fraction of the genome that is coding 224 sequence is typically around 0.85[7, Table S1]. If, say, 3 4 of sites are nonsynonymous sites, then 225 the number of nonsynonymous sites, A is 2.3×10 6 . Let a * n be the fraction of nonsynonymous sites 226 maintained by negative selection, and n a be the fraction of nonsynonymous sites in the fraction of 227 the genome under the control of negative selection. Then, 228 Γ n = a * n Aµ bs n a = 1.3×10 −3 a * n n a deleterious mutations per organism per generation For prokaryotes n a will be close to 1, making Γ n small no matter what the value of a * n . Thus from 229 Table 9 negative selection will have little effect, and so for early eukaryotes Γ p would likely be the 230 primary driving force for sex.

231
Whether Γ p exceeded, of the order of 1 site every 1,000 generations, isn't known for certain. In a 232 changing geothermal environment it seems quite likely, especially considering the many mutations 233 that might be associated with a single simple change in temperature.

234
Sex frees the genome from the constraints of asexuality. The generation time can be much larger 235 without incurring large costs from positive selection. And the genome can be much larger without 236 incurring high costs associated with negative selection. The possibility of a larger generation time 237 and a larger genome size might explain two important properties of eukaryotes: eukaryotes are both 238 physically much larger and more complex than prokaryotes. The prokaryote typically needs a small 239 generation time and small genome size. The eukaryote can take its time and have a large genome 240 size. The organism can grow larger and be more complex if this offers an evolutionary advantage. 241 The mitochondrion as an asexual organism

242
Comparing human and chimpanzee mtDNA, the non-synonymous substitution rate of protein cod-243 ing genes is 2×10 −9 substitutions per site per year [8]. If we estimate mtDNA contains about 13,500 244 sites other than synonymous sites, this gives a rate of adaptive mutational opportunity site cre-245 ation of 2.7×10 −5 sites per year, or once every 40,000 years. The time taken for a site to fully fix is 246 approximately 2 log N s s [9], which even for a large population size of 10 6 , a relatively small s value of 247 10 −3 per generation, and a generation time of 25 years, amounts to a full fixation time of 350,000 248 years. Consequently we can expect the mitochondrial population to contain several haplotypes that 249 all descended from a a single mitochondrial "Eve" that existed perhaps 350,000 years ago. The 250 available evidence suggests Eve actually existed maybe 200,000 years ago [10], implying a larger 251 value for s than was imagined.

252
From the above, for the mitochondrial genome, Γ p = 7×10 −4 sites per generation. For synonymous 253 sites the mtDNA substitution rate is 3×10 −8 per site per year [8]. Thus µ ss = 2.5×10 −7 mutations 254 per generation. Using the estimate of 13,500 sites that are not synonymous, Γ n = 1.0×10 −2 sites 255 per generation. Extrapolating Table 8, there is likely to be neither a significant cost, nor benefit, 256 to asexuality under positive selection. Recall s values for negative selection are likely to be larger 257 than those for positive selection. Based on Table 9, there is likely to be no cost for asexuality as a 258 result of negative selection for s = 10 −2 or larger.

259
The genome size and mutation rate parameters of the mitochondrial genome are such that its 260 asexuality appears to incur little or no cost.

262
Computer simulation was used to validate the mathematical creation-mutation-selection model.
the good of the population, only for the good of the gene.

266
For both sexual and asexual organisms, increasing the spontaneous mutation rate reduces the cost 267 associated with positive selection, but increases the cost associated with negative selection. This is 268 because for finite genomes Γ n will scale with the spontaneous mutation rate, µ ss . A balance thus 269 exists between positive and negative selection at which the spontaneous mutation rate is optimal. 270 Prokaryotes thrive under small rates of environmental change, with short generation times, small 271 genomes, and small spontaneous mutation rates. Sexual organisms thrive under larger rates of 272 environmental change. They are able to have longer generation times and larger rates of negative 273 site creation per genome per generation, and thus larger genomes, and higher spontaneous mutation 274 rates.

275
For eukaryotes today, and probably the first eukaryotes, the cost of asexuality appears likely to 276 significatly exceed the cost of sex.

278
Simulation of the asexual model

279
Simplified pseudo-code for the asexual simulation is shown in Figure 1. See Supplement 1 for the 280 actual simulation code [11].

281
The implementation was sped up by keeping track of not the number of mutations for each organism, 282 but the number of organisms with each different mutation count. The overall simulation was 283 implemented in Python with the performance critical mutation and selection steps implemented as 284 multi-threaded C code.

285
Best efforts were made to determine the steady state log fitness mean and variance using a linear 286 regression of the direction in which the mean log fitness was drifting as a function of the initial 287 value for x. Regression helps overcome the problem of stochastic noise associated with any single 288 parameter run. A variant of equation 1 was used to determine the initial value for σ 2 x with the initial 289 fitness distribution being log-normally distributed. Regression data points were gathered until the 290 estimated mean log fitness value for zero drift was accurate to within 5% with 95% confidence 291 as determined through bootstrapping. Each regression data point was generated by simulating 292 warm-up time which was not assessed, followed normally by 10,000 generations of measurement 293 time. The measured drift was adjusted for stochastic variation in site creation. The warm-up time 294 lasted until σ 2 x averaged out over preceding, normally 2,000 time-step intervals, had been seen to 295 both increase and decrease.

296
Once the regression was complete, a simulation was performed starting with a random population 297 with log-normally distributed fitness generated from the estimated steady state log fitness mean 298 and variance computed as described above. Warm-up time was simulated during which no results 299 were collected, followed normally by 10,000 generations of measurement time. The process was 300 usually repeated 100 times to account for long time period fluctuations in σ 2 x , and averages taken. 301 To reduce the compute cost for N = 10 8 , just 20 samples of 5,000 generations were generated.

302
The regression approach fails if x changes significantly from its initial value during the course of 303 the simulation. Consequently, letting Γ represent either Γ p or Γ n , if the regression predicted the 304 expected value of x was smaller in magnitude than 50,000Γ log(1+s) a direct roll-out was performed. 305 No results were collected until the roll-out appeared stable. The roll-out was considered stable when 306 σ 2 x had been observed to both increase and decrease when averaged out over normally 50,000 time-307 steps, followed by x adjusted for stochastic fluctuations in the rate of site creation having been 308 seen to both increase and decrease when averaged over normally 50,000 generations. Once the 309 simulation was stable results were normally collected for 1,000,000 generations divided into 10,000 310 generation intervals. Direct roll-outs were started from the value of x predicted by the partially 311 complete regression.

312
On account of the low frequency of adaptive site creation for Γ p = 10 −4 the number of generations 313 was increased in this case. For the regression warm-up time was averaged using 20,000 generations, 314 followed by 100,000 generations of measurement time, and the final simulation consisted of 20 315 samples of 100,000 generations. A regression was performed if the expected value for x was smaller 316 in magnitude than 500,000Γ p log(1 + s), with stability assessed over 500,000 generations, and then 317 samples collected for 2,000,000 generations divided into 100,000 generation intervals.

318
Reported steady state values of x are based on the regression or the roll-out. Estimates of σ 2 x and 319 skewness are based on averages from the final simulation or the roll-out. Standard error values for 320 the roll-outs were adjusted for any correlation between consecutive samples.

321
Simulation of the sexual model 322 Simplified pseudo-code for the sexual simulation is shown in Figure 2. See Supplement 1 for the 323 actual simulation code [11]. 324 In the asexual model it is possible to track just the total number of unsatisfied sites in each 325 organism. In the sexual model this is not possible. The identity of each site must be tracked in 326 each organism so that recombination can be correctly computed. Multiplying the number of sites, 327 by the number of organisms, by the number of time-steps, quickly adds up to an excessive amount 328 of computer resources for the simulation. In an attempt to minimize these costs the simulation 329 maintains multiple lists: a global list of sites present in nearly all organisms; a per organism list of 330 exclusions from the global list (for newly satisfied positive sites); and a per organism list of direct 331 sites present in just a few organisms (for new negative sites and sites close to fixation  these lists were rebalanced: frequently occurring exclusions were replaced by direct list entries and 333 removed from the global list. The simulation was implemented in a mixture of Python, for the 334 non-performance critical regions, and multi-threaded C, for the performance critical regions.

335
For positive selection two approaches were used for determining the steady state. For expected 336 mean log fitness result values, x, smaller in magnitude than 50,000Γ p log(1 + s) a direct roll-out 337 was performed starting from the x predicted by the recurrence relationship with σ x = 0. In this 338 case a simulation was judged to have reached a steady state once σ 2 x was considered to be stable 339 followed by x being considered to be stable. σ 2 x was considered to be stable once its value averaged 340 over the 20,000 preceding time-steps was seen to be both larger and smaller than its value for 341 20,000 time-steps immediately prior to that. Once σ 2 x was stable, x was considered to be stable if 342 both positive and negative changes to its value averaged over the 20,000 preceding time-steps were 343 observed. In making this latter determination x was corrected for stochastic fluctuations in the 344 rate of site creation. An issue with determining the accuracy of the then measured x value is that 345 consecutive time-steps will be highly correlated. Results were thus gathered until 100 samples each 346 separated by a sufficiently large stride had an auto-correlation of 0.6 or less. This is equivalent 347 to having about 25 fully independent samples. The accuracy of the resulting x values was then 348 determined by computing their mean and standard error. Finally the simulation was run for an 349 additional 20,000 time-steps and additional results gathered.

350
For positive selection, running the simulation until a steady state was reached for larger expected 351 values of x was judged as too time consuming. Instead a regression approach was used. At least 20 352 starting values for x with σ x = 0 were chosen. For each the system was run until σ 2 x was considered 353 to be stable when averaged, this time, over 200 time-steps. The simulation was then run for 4 times 354 as many time-steps as had occurred up until this point, and the average rate of drift in x per time-355 step computed. Over a broad range of starting values, drift was observed to be a linear function 356 of x. A linear regression was performed of drift as a function of x, and the value of x for which 357 drift was expected to be zero computed. Bootstrapping of the linear regression was performed to 358 determine the accuracy of the results. A final simulation at this point of expected zero drift was 359 performed, waiting until σ 2 x was stable, and then results gathered over 20,000 time-steps.

360
For negative selection a direct roll-out was always performed. The procedure was the same as for 361 positive selection except that the system was considered stable when both positive and negative 362 changes in value were seen when averaged out over 2,000 time-steps instead of 20,000 time-steps, 363 and the final simulation was only run for 2,000 time-steps instead of 20,000 time-steps to gather 364 additional results.

365
Reported steady state x values were based on the mean value of the partially auto-correlated 366 samples for the direct roll-outs, or the zero drift point of the regressions. These values are intended 367 to be accurate to within 5% with 95% confidence. Other attributes of the sexual simulation, such 368 as variance and skewness, were measured based on the final 20,000 time-steps, but are not reported 369 in this paper.