Profiling the initial burst of beneficial genetic diversity to anticipate evolution of a cell population

Clonal populations of cells continuously evolve new genetic diversity, but it takes a significant amount of time for the progeny of a single cell with a new beneficial mutation to outstrip both its ancestor and competitors to fully dominate a population. If genotypes with these driver mutations can be discovered earlier—while they are still extremely rare—it may be possible to anticipate the future evolution of these populations. For example, one could diagnose the likely course of incipient diseases, such as cancer and bacterial infections, and better judge which treatments will be effective, by tracking rare drug-resistant variants. To test this approach, we replayed the first 500 generations of a >70,000-generation Escherichia coli experiment and examined the trajectories of new mutations in eight genes known to be under positive selection in this environment in six populations. By employing a deep sequencing procedure using molecular indexes and target enrichment we were able to track 236 beneficial mutations at frequencies as low as 0.01% and infer selection coefficients for 180 of these. Distinct molecular signatures of selection on protein structure and function were evident for the three genes in which beneficial mutations were most common (nadR, pykF, and topA). We detected mutations hundreds of generations before they became dominant and tracked beneficial alleles in genes that were not mutated in the long-term experiment until thousands of generations had passed. Therefore, this targeted adaptome sequencing approach can function as an early warning system to inform interventions that aim to prevent undesirable evolution.


44
New genetic variation naturally arises in lineages of cells and organisms during genome 45 replication and repair. These de novo mutations are the main drivers of adaptive evolution in 46 many populations, particularly those with little or no recombination or standing genetic variation. 47 coli populations were initiated from equal mixtures of two variants of the ancestral strain that 105 differ in a neutral genetic marker for arabinose utilization (Ara). We observed the evolutionary 106 dynamics of these populations over ~500 generations of regrowth from 75 daily 1:100 serial 107 transfers by periodically plating dilutions of each population on indicator agar. The ratio of Ara + 108 cells (pink colonies) to Aracells (red colonies) diverges from 1:1 when descendants of one 109 ancestor type accumulate enough of a fitness advantage due to de novo beneficial mutations that 110 they take over. We focused further analysis on six of the nine populations (thick lines). 111 112

Reconstructing the trajectories of new beneficial mutations 113
We next performed deep sequencing of eight genes at ~25 generation increments over all 500 114 generations of the evolution experiment for four of the nine populations. These eight genes 115 (nadR, pykF, topA, spoT, fabR, ybaL, hslU, and iclR) are known to be targets of selection in the 116 LTEE [3,31]. Illumina libraries containing molecular indexes [32] were prepared for sequencing 117 and enriched for the regions of interest using solution based hybridization [33]. Consensus 118 sequence reads were generated based on groups of reads with identical molecular indexes and 119 aligned to the E. coli genome to predict mutations, including using split-read mapping to identify 120 transposon insertions and large deletions ( Fig. 2A). The enrichment procedure was effective. In 121 the sample with the median number of total consensus reads, the average coverage depth across 122 each of the eight genes of interest exceeded 5,000 (Fig. 2B). After analyzing patterns in mutation 123 frequencies over time to eliminate other systematic biases (see Methods), we were able to track 124 the evolution and competition of 180 de novo mutations, including when many were present in 125 less than 0.1% of the cells in a population (Fig. 2C, Fig. 3). 126 index, which were amplified from the same original genomic DNA fragment, are used to 134 construct a consensus read to eliminate sequencing errors. Consensus reads are mapped to the 135 reference genome to call sequence variants. (B) Enrichment of reads mapping to eight genes 136 known to be early targets of selection in this environment from the long-term evolution 137 experiment. The final coverage depth of consensus reads in and around these genes is shown for 138 a typical sample (population A7 at 500 generations  described in Figure 2C-E for population A1 are shown for populations A2, A3, and A7 (top three 153 sets of panels). For populations A6 and A9, sequencing was only performed at time points during 154 the selective sweep window so only the plots corresponding to Figure 2D-E are shown (bottom 155 two sets of panels). 156 targeted genes followed by loss of this diversity. The initial dynamics are expected to be largely 158 driven by new genotypes that each evolve a single beneficial mutation very early in the 159 experiment. If their descendants escape stochastic loss, they will gradually increase in frequency 160 over the first few hundred generations as they outcompete the ancestral genotype. Once the 161 population becomes dominated by these first-step mutants, their frequency trajectories plateau 162 because of clonal interference: they are now mainly competing against one another and are 163 relatively evenly matched. In populations A1, A2, and A7, the total frequencies of the mutations 164 we identified sums to 50-62% at generation 270, indicating that each population is mostly 165 composed of genotypes with a single mutation in one of the focal genes. We recovered less of 166 the initial beneficial mutation diversity in population A3 where this sum was only 13%. 167 After around 300 generations, there is a steady decline in the frequencies of most mutations 168 in the eight targeted genes. At this point, new more-fit genotypes that have evolved from the 169 single-step mutants begin to exert their influence and displace them. Many of the most successful 170 second-step genotypes are descended from cells that already have a mutation in one of the 171 targeted genes. The original mutations serve as markers for the further expansion of these 172 subpopulations after a period during which their frequencies stagnate or decrease, but the new 173 beneficial mutations responsible for this further increase in fitness are outside of the genomic 174 regions we surveilled. The converse situation, in which a beneficial mutation in one of the eight 175 focal genes appears in a cell with an untracked beneficial mutation elsewhere in the genome, also 176 occurs in a few cases. Most strikingly, a new mutation in pykF that only appears after 300 177 generations in population A3 rapidly increases in frequency and becomes dominant, strongly 178 suggesting that it appeared in a genetic background with a prior, unknown beneficial mutation. 179

Selection coefficients can be inferred from initial mutation trajectories 181
We next sought to calculate the fitness benefits of individual mutations by tracking how rapidly 182 their frequencies rose early in the experiment when they were largely competing versus the 183 ancestral genotype because all new mutations in the population were still rare. To that end, we 184 performed additional sequencing on six populations (including the four already sequenced at 25 185 generation increments) at ~13-generation increments in a time window from 133 to 213 186 generations (Fig. 2D, Fig. 3). We were able to track a total of 161 mutations during this time, 187 including 56 that were not identified in the complete time courses. More than 95% of these 188 mutations occurred in just three of the targeted genes: nadR, pykF, and topA (Fig. 4A). 189 coincides with the onset of clonal interference (Fig. 2E, Fig. 3), At this point, genotypes with 202 beneficial mutations begin to make up an appreciable fraction of the population and compete 203 against one another rather than effectively only versus their ancestor. After correcting for this 204 increase in overall population fitness (see Methods), the mean selection coefficient that we 205 inferred for the tracked mutations in all six populations was 6.32% with a standard deviation of 206 0.74%. Although the distributions of selection coefficients estimated for mutations in nadR, 207 pykF, and topA overlap (Fig. 4B), there was a significant stratification among these genes. 208 Mutations in nadR were 0.27% more beneficial than mutations in topA, on average, and this 209 difference was significant (p = 0.024, Kolmogorov-Smirnov test). In turn, mutations in topA 210 were 0.39% more beneficial than those in pykF (p = 0.00035, Kolmogorov-Smirnov test). The 211 six mutations in other genes (spoT, yijC, and ybaL) for which we were able to estimate selection 212 coefficients were roughly as beneficial as mutations in nadR, pykF, and topA. 213 214

Beneficial mutations reveal different signatures of selection on gene function 215
Of the 236 mutations that we were able to track in complete or window time courses, 218 were 216 in the nadR, pykF, or topA genes. The large sets of beneficial mutations in these genes gave us 217 the statistical power to test for several signatures of molecular evolution to predict what types of 218 changes in the function of each gene improved E. coli fitness in this environment. Each of the 219 three genes exhibited a distinct spectrum of beneficial mutations (Fig. 4C). In some cases, different types of mutations were also unevenly distributed throughout the sequences of these 221 three commonly hit genes and had noticeably different effects on bacterial fitness (Fig. 5A). Pyruvate kinase 1 (pykF) catalyzes the final step of glycolysis, transferring a phosphate group 256 from phophoenolpyruvate (PEP) to ADP to generate pyruvate and ATP. It is a key enzyme in regulating glycolytic flux [37,38]. We observed an intermediate representation of disruptive 258 mutations in pykF, fewer than in nadR but more than in topA (Fig. 4C). Interestingly, 259 nonsynonymous base substitutions in pykF tend to have a larger selection coefficient than 260 disruptive mutations (p = 0.00390, Kolmogorov-Smirnov test) (Fig. 5A). This finding is in 261 agreement with a recent study of various pykF alleles that arose in the LTEE which found that 262 nearly all pykF point mutations were more beneficial than deletion of the pykF gene, both in the 0.0050, one-tailed binomial test) (Fig. 5B). Overall, these results suggest that complete 269 inactivation of pykF is highly beneficial in the environment of our evolution experiment, but 270 mutations that alter its activity-likely in ways that reduce glycolytic flux-are even more so. It 271 has been suggested that reducing pykF activity is beneficial in the similar glucose-limited 272 conditions of the LTEE because this allows more PEP to be used for import of glucose into cells 273 by the phosphotransfer system [42]. and begin to displace the ancestral genotype. We focused on eight genes known to accumulate 296 adaptive mutations in the >70,000 generation Lenski long-term evolution experiment (LTEE) 297 with E. coli that used nearly the same environment as our experiments. The only difference was 298 that we added four times as much of the limiting nutrient (glucose). By combining Illumina 299 sequencing of reads that incorporate molecular indexes for error correction, hybridization-based 300 capture of DNA encoding these genes, and dense temporal sampling, we were able to identify 301 more beneficial mutations and track them at much lower frequencies than is possible with 302 standard metagenomic sequencing. We detected a total of 236 mutations in the focal genes: 180 303 in the complete time courses of four populations and 161 in the window time courses of these 304 populations and two others, with 105 mutations overlapping between the two sets. 305 By densely sampling and deeply sequencing E. coli populations, we were able to characterize 306 many beneficial mutations that never reach the detection limits of standard Illumina sequencing 307 before they become casualties of clonal interference. Only 13 of the 180 mutations we detected 308 in the complete time courses ever achieved a frequency of 5% or more, which can be reliably 309 distinguished from noise by standard metagenomic sequencing, and only seven were this 310 common for 100 or more generations, such that they were likely to be detected by a typical time- four (hslU, nadR, ybaL, and iclR) are also common in the LTEE, but they typically occur later 322 (often within the first 2,000 to 10,000 generations) [3,31]. Nearly all mutations in these genes in 323 our evolution experiment were in topA, pykF, and nadR, but we also found multiple mutations 324 that were similarly beneficial in spoT, fabR, and ybaL. Mutations in nadR were more widespread 325 than expected in our experiment and may be more likely to completely disrupt its function than beneficial alleles that evolve in the LTEE [51]. Mutations in spoT and fabR were rarer than 327 expected from the LTEE. One possible explanation for these slight differences is the increased 328 concentration of glucose in our experiment compared to the LTEE. These minor deviations are 329 also reminiscent of how changing a different aspect of the environment (temperature) re-focuses 330 the mutations of largest benefit that succeed early onto different subsets of genes, nearly all of 331 which eventually accumulate beneficial mutations later in the LTEE environment, in related 332 evolution experiments [53,54]. Despite these subtle differences, we were still able to account for 333 majority of the genetic variation present in three of four of the four populations that we profiled 334 over the entire 500 generations by analyzing evolution in the eight candidate genes. 335 We also wanted to understand to what extent we gained early warning of driver mutations by 336 deeply profiling evolution in genes we expected to be under strong selection. In general, we were 337 able to begin tracking most mutations when they were above a frequency of 0.01%. This level of 338 profiling enabled us to first detect mutations an average of 75, 152, and 290 generations before 339 they surpassed frequencies of 0.1%, 1%, and 5%, respectively. Under the conditions of our 340 experiment these intervals take roughly 11, 23, and 44 days, respectively; so, even though we 341 made these predictions retrospectively, there would have been sufficient time to complete the 342 DNA isolation, library preparation, sequencing, and analysis steps quickly enough for this 343 approach to give early warning of specific genetic variants driving evolution of these 344 populations. The amount of lead time becomes disproportionately longer at higher frequencies 345 due to clonal interference between beneficial mutations. The chances and timescales of earlier The nature of epistasis and the limits that it imposes on predicting the future evolution of a 373 cell population could be further probed using our approach in several ways. One could repeat the 374 evolution experiment beginning with genotypes containing different first-step beneficial 375 mutations as starting points to more finely map the fitness landscape. One could also interrogate 376 the diverse collections of cells containing different beneficial alleles that we have evolved, by 377 taking the 150-generation populations and further evolving them under different conditions to 378 map genotype by environment effects, for example. Such experiments might also reveal latent 379 beneficial mutation in other genes (e.g., iclR and hslU) that were able to outcompete the ancestor 380 in our experiment but remained below the detection limit because they were not as beneficial as 381 mutations in topA, pykF, and nadR in this environment. There is precedent for changes in the 382 environment deflecting selection to different subsets of the same genes. In an offshoot of the 383 LTEE that began with a clone that had spoT, topA, and pykF mutations, selection was focused on 384 hslU, iclR, or nadR depending on changes in temperature [54]. Exhaustively mapping paths that clonal evolution is likely to follow is of particular interest 405 and utility in systems that evolve repeatedly from a defined starting point. These range from 406 bioreactors that are seeded with the same strain in different production runs to lung infections in 407 cystic fibrosis patients that start from similar, but not identical, opportunistic pathogens. The 408 ability to identify mutations in key genes while they are still very rare may also be used to 409 improve the early detection and predicting drug resistance in other human infections and cancer. 410 The evolutionary dynamics will be more complex in many of these systems, but they may also 411 unfold more slowly. For example, biofilm formation and the necessity of invading already 412 colonized niches will slow the dynamics of competition. This potentially makes the therapeutic 413 window for detecting incipient evolution by profiling the adaptome even greater.  Table S1.  custom Python scripts that carried out the following steps. First, the beginning of each read was 473 evaluated for the presence of the expected 5′-end tag, consisting of the random twelve-base 474 molecular index (MI) followed by four fixed bases (5′-N12CAGT). Read pairs lacking the correct 475 5′-end tag on either read were discarded. For remaining read pairs, the MIs from each read were 476 concatenated to create a 24-base dual-MI that uniquely identifies the original DNA fragment that 477 was amplified and sequenced. To group all reads corresponding to the same initial DNA 478 molecule, a FASTA file of all dual-MIs was used as input into the ustacks program from the 479 Stacks software pipeline (Version 1.48) [71] with the following options: a single read was 480 sufficient to seed a stack, a single mismatch within the 24 base MI was allowed in assigning a 481 read to a stack, secondary reads and haplotypes were disabled, and stacks with high coverage 482 were preserved. Then, CSRs were generated for all MI groups sequenced at least twice by taking 483 the straight consensus of all reads that were merged into that stack. If no base exceeded 50% 484 frequency at a given position in this set of reads, then that base was set as unknown (N). 485 also had frequency trajectories that tracked together. To identify these candidates for merging, 509 we analyzed each of the six window (generation 133 to 213) and four complete (generation 0 to 510 500) time courses separately. We only considered mutations that exceeded a threshold frequency 511 of 0.03% at some time during each time course as candidates for merging. Read alignment (RA) 512 evidence items were merged when they were located within 6 base pairs of one another and 513 within a normalized Canberra distance of 0.1 between the vectors of their frequency observations 514 across all of the time points in a dataset. All RA evidence pairs of this kind were found to co-515 occur in the same sequencing reads. For these cases, the read counts for the first linked mutation 516 were used to represent the entire event. For example, if a deletion of three base pairs was 517 predicted by missing bases at positions x, y, and z; then the frequency of missing the first base 518 (x) was assigned to the entire three-base deletion mutation. For new junction (JC) evidence we 519 performed the same merging procedure but allowed linked mutations to be within a larger 520 window of 20 base pairs and within a normalized Canberra distance of 0.5. JC pairs passing 521 these criteria were only merged if they were also consistent with an IS-element insertion in terms 522 of their relative orientation and spacing. In this case the variant and total read counts were added 523 together for the two different junctions, as the junctions on each side of the inserted IS element 524 provide independent information for estimating the frequency of this type of mutation. 525 526

Time course filtering and selection coefficient estimation 527
After merging evidence of genetic variants into lists of putative mutations, we further eliminated 528 putative evolved alleles from consideration using several filtering steps. For the complete time 529 courses, we first required that non-zero frequencies be observed for a mutation in samples from 530 at least two different time points. We next applied a filter to eliminate spurious variants that can 531 be recognized as arising from systematic sequencing or alignment errors because they do not 532 exhibit the correlated changes in frequency over time expected for the frequency trajectories of 533 real mutations [1]. Specifically, we required that the time-series of estimated frequencies for a 534 mutation over all analyzed time points have an autocorrelation value ≥ 0.55. 535 For the window time courses, we further required that the estimated frequency of a putative 536 mutation be ≥ 10 -4 at each of the last three time points that were sequenced (generations 193, 537 206, and 213). Then, we fit a binomial logistic model with slope and x-intercept terms to the time 538 courses of counts of variant and reference (total minus variant) observations for each mutation. 539 We filtered out any mutations for which this fit had an AIC < 200, Bonferroni-corrected p-value 540 for the slope differing from zero of > 0.05, or an x-intercept < -15. The slope fit from the 541 frequency trajectory of each mutation is an estimate of the selection coefficient of each mutation, 542 assuming the trajectories reflect competition purely against the ancestral strain. However, in the 543 latter half of the window time courses we detected a significant deviation from linearity for all 544 mutation trajectories, indicating that the overall population fitness had increased to a degree that 545 it decreased the rate at which all newly evolved genotypes with beneficial mutations increased in 546 frequency. The figures show the best models for a stepwise increase in population fitness 547 between the sequenced time points that improved the fits for all mutations in each population 548 considered separately. Because there was significant uncertainty in these estimates and the 549 fitness trajectories are expected to be highly similar between different populations, we used a 550 consensus model with one step-wise increase in fitness over time that best improved the fits for 551 all mutations from all populations to correct the estimated selection coefficients for this effect. 552 553

Mutation statistics and plots 554
One nadR mutation from population A2 was a noticeable outlier in terms of its large apparent 555 fitness benefit of 9.2%. Given that the next-highest observed selection coefficient for a mutation 556 was 8.0%, it is likely that the lineage with this nadR mutation also sustained a secondary 557 beneficial mutation early enough that they rose to detectable frequencies together. Therefore, we 558 removed this mutation before analyzing or graphing the characteristics of the set of likely single-559 step mutations. Graphs were generated in R using the ggplot2 package [