Abstract
The structure of fitness landscapes is critical for understanding adaptive protein evolution (e.g. antimicrobial resistance, affinity maturation, etc.). Due to limited throughput in fitness measurements, previous empirical studies on fitness landscapes were confined to either the neighborhood around the wild type sequence, involving mostly single and double mutants, or a combinatorially complete subgraph involving only two amino acids at each site. In reality, however, the dimensionality of protein sequence space is higher (20^L, L being the length of the relevant sequence) and there may be higher-order interactions among more than two sites. To study how these features impact the course of protein evolution, we experimentally characterized the fitness landscape of four sites in the IgG-binding domain of protein G, containing 20^4 = 160,000 variants. We found that the fitness landscape was rugged and direct paths of adaptation were often constrained by pairwise epistasis. However, while direct paths were blocked by reciprocal sign epistasis, we found systematic evidence that such evolutionary traps could be circumvented by “extra-dimensional bypass”. Extra dimensions in sequence space (with a different amino acid at the site of interest or an additional interacting site) open up indirect paths of adaptation via gain and subsequent loss of mutations. These indirect paths alleviate the constraint on reaching high fitness genotypes via selectively accessible trajectories, suggesting that the heretofore neglected dimensions of sequence space may completely change our views on how proteins evolve.
The fitness landscape is a fundamental concept in evolutionary biology [1–6]. Large-scale datasets combined with quantitative analysis have successfully unraveled important features of empirical fitness landscapes [7–9]. Nevertheless, there is a huge gap between the limited throughput of fitness measurements (usually on the order of 10^2 variants) and the vast size of sequence space. Recently, the bottleneck in experimental throughput has been improved substantially by coupling saturation mutagenesis with deep sequencing [10–16], which opens up unprecedented opportunities to understand the structure of high-dimensional fitness landscapes [17–19].
Previous empirical studies on combinatorially complete fitness landscapes have been limited to subgraphs of the sequence space consisting of only two amino acids at each site (2^L genotypes) [20–25]. Adaptive walks in these subgraphs can only follow “direct paths”, where each mutational step reduces the Hamming distance from the starting point to the destination. In sequence space with higher dimensionality (20^L, for a protein sequence with L amino acid residues), however, the extra dimensions may provide additional routes for adaptation. For example, some evolutionary dead ends (i.e. local maxima) may become saddle points and allow for further increase in fitness [26]. In this case, adaptation may proceed via “indirect paths” in sequence space, which involve extra mutations and reversions. The existence of indirect paths has been implied in different contexts [27, 28], but has not been studied systematically, so its influence on protein adaptation remains unclear. Another underappreciated property of fitness landscapes is the influence of higher-order interactions. Empirical evidence suggests that pairwise epistasis is prevalent in fitness landscapes [7, 22, 23, 29]. Specifically, sign epistasis between two loci is known to constrain adaptation by limiting the number of selectively accessible paths [20]. Higher-order epistasis (i.e. interactions among more than two loci) has received much less attention and its role in adaptation is yet to be elucidated [28,30].
In this study, we investigated the fitness landscape of all variants (20^4 = 160,000) at four amino acid sites (V39, D40, G41 and V54) in an epistatic region of protein G domain B1 (GB1, 56 amino acids in total) (Supplementary Fig. 1), an immunoglobulin-binding protein expressed in Streptococcal bacteria [31, 32]. The four chosen sites contain 12 of the top 20 positively epistatic interactions among all pairwise interactions in protein GB1, as we previously characterized [33] (Supplementary Fig. 2). Thus the sequence space is expected to cover highly beneficial variants, which presents an ideal scenario for studying adaptive evolution. Briefly, a mutant library containing all amino acid combinations at these four sites was generated by codon randomization. The “fitness” of protein GB1 variants, as determined by both stability (i.e. the fraction of folded proteins) and function (i.e. binding affinity to IgG-Fc), was measured in a high-throughput manner by coupling mRNA display with Illumina sequencing (Methods, Supplementary Fig. 3A) [34,35]. The relative frequency of mutant sequences before and after selection allowed us to compute the fitness of each variant relative to the wild type protein (WT).
To understand the impact of epistasis on protein adaptation, we first analyzed subgraphs of sequence space including only two amino acids at each site (Fig. 1A). Each subgraph represented a classical adaptive landscape connecting WT to a beneficial quadruple mutant, analogous to previously studied protein fitness landscapes [9, 20]. Each variant is denoted by the single letter code of amino acids across sites 39, 40, 41 and 54 (for example, WT sequence is VDGV). Each subgraph is combinatorially complete with 2^4 = 16 variants, including WT, the quadruple mutant, and all intermediate variants. We identified a total of 29 subgraphs in which the quadruple mutant was the only fitness peak. By focusing on these subgraphs, we essentially limited the analysis to direct paths of adaptation, where each step would reduce the Hamming distance from the starting point (WT) to the destination (quadruple mutant). Out of 24 possible direct paths, the number of selectively accessible paths (i.e. with monotonically increasing fitness) varied from 12 to 1 among the 29 subgraphs (Fig. 1B). In the most extreme case, only one path was accessible from WT to the quadruple mutant WLFA (Fig. 1A). We also observed a substantial skew in the computed probability of realization among accessible direct paths (Supplementary Fig. 4), suggesting that most of the realizations in adaptation were captured by a small fraction of possible trajectories [20]. These results indicated the existence of sign epistasis and reciprocal sign epistasis, both of which may constrain the accessibility of direct paths [20, 36]. Indeed, we found that these two types of epistasis were prevalent in our fitness landscape (Fig. 1C). Furthermore, we classified the types of all 24 pairwise epistatic interactions in each subgraph and computed the level of ruggedness as f_sign + 2f_reciprocal, where f_type was the fraction of each type of pairwise epistasis. As expected, the number of selectively inaccessible direct paths, i.e. paths that involve fitness declines, was found to be positively correlated with the ruggedness induced by pairwise epistasis (Fig. 1D, Pearson correlation = 0.66, p = 1.0 × 10^−4) [2].
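The counting of selectively accessible direct paths described above can be illustrated with a minimal sketch. It assumes a subgraph is stored as a dictionary mapping binary genotypes (0 = WT residue, 1 = mutant residue at each of the four sites) to fitness; the function name and the toy landscapes are hypothetical and are not taken from our analysis scripts.

```python
from itertools import permutations, product

def count_accessible_direct_paths(fitness, n_sites=4):
    """Count direct paths from the all-0 genotype to the all-1 genotype
    along which fitness increases strictly at every step.

    fitness: dict mapping tuples of 0/1 (length n_sites) to fitness values.
    Each ordering of the n_sites mutations defines one direct path,
    so there are n_sites! candidate paths (24 for four sites).
    """
    start = (0,) * n_sites
    accessible = 0
    for order in permutations(range(n_sites)):
        genotype = list(start)
        previous = fitness[start]
        monotonic = True
        for site in order:
            genotype[site] = 1  # gain one mutation per step
            current = fitness[tuple(genotype)]
            if current <= previous:  # a fitness decline blocks this path
                monotonic = False
                break
            previous = current
        accessible += monotonic
    return accessible

# Toy example (hypothetical values): a purely additive landscape,
# in which every direct path is accessible.
additive = {g: 1 + sum(g) for g in product((0, 1), repeat=4)}
print(count_accessible_direct_paths(additive))
```

Making a single intermediate deleterious (for example, lowering the fitness of genotype (1, 0, 0, 0) below WT) removes every path that passes through it, which is how sign epistasis reduces the number of accessible direct paths.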
Our findings support the view that direct paths of protein adaptation are often constrained by pairwise epistasis on a rugged fitness landscape [5,37]. In particular, adaptation can be trapped when direct paths are blocked by reciprocal sign epistasis. However, crucially, this analysis was limited to mutational trajectories within a subgraph of the sequence space. In reality, the dimensionality of protein sequence space is higher. Intuitively, when an extra dimension is introduced, a local maximum may become a saddle point and allow for further adaptation, a phenomenon recently proposed under the name “extra-dimensional bypass” [38]. We discovered two distinct mechanisms of bypass, either using an extra amino acid at the same site or using an additional site, that allow proteins to continue adaptation when no direct paths were accessible due to reciprocal sign epistasis (Fig. 2). The first mechanism of bypass, which we termed “conversion bypass”, works by converting to an extra amino acid at one of the interacting sites [28]. Consider a simple scenario with only two interacting sites. If the sequence space is limited to 2 amino acids at each site, as in past analyses of adaptive trajectories, the number of neighbors is 2; however, if all 20 possible amino acids were considered, the total number of neighbors would be 38. Some of these 36 extra neighbors may lead to potential routes that circumvent the reciprocal sign epistasis (Fig. 2A). In this case, a successful bypass would require a conversion step that substitutes one of the two interacting sites with an extra amino acid (00 → 20), followed by the loss of this mutation (21 → 11). This bypass is feasible only if the original reciprocal sign epistasis is changed to sign epistasis after the conversion. To test whether such bypasses were present in our system, we randomly sampled 10^5 pairwise interactions from the sequence space and analyzed the ~20,000 instances of reciprocal sign epistasis among them (Methods). More than 40% of the time there was at least one successful conversion bypass and in many cases multiple bypasses were available (Fig. 2B).
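A minimal sketch of how conversion bypasses could be counted for one pair of sites is given below. It assumes that a bypass is "successful" when fitness increases strictly along the three-step path 00 → 20 → 21 → 11; this is one plausible operationalization for illustration, not necessarily the exact criterion used in our analysis. The function name, the two-letter genotype encoding and the toy values are hypothetical.

```python
def count_conversion_bypasses(w, start, end, alphabet):
    """Count conversion bypasses between two genotypes that differ at both sites.

    w: dict mapping two-letter genotype strings to fitness.
    start, end: two-letter genotypes (e.g. "AA" and "BB") differing at both sites.
    alphabet: all residues allowed at each site.

    For each site and each extra residue x (neither the start nor the end
    residue), the candidate path is:
      start -> convert that site to x -> mutate the other site -> revert x to end.
    Assumption: a bypass counts only if fitness rises at every step.
    """
    count = 0
    for site in (0, 1):
        other = 1 - site
        for residue in alphabet:
            if residue in (start[site], end[site]):
                continue
            converted = list(start)
            converted[site] = residue          # e.g. 00 -> 20
            advanced = list(converted)
            advanced[other] = end[other]       # e.g. 20 -> 21
            path = [start, "".join(converted), "".join(advanced), end]
            fits = [w[g] for g in path]
            if all(a < b for a, b in zip(fits, fits[1:])):
                count += 1
    return count

# Toy three-letter alphabet (hypothetical values). AA -> BB is blocked by
# reciprocal sign epistasis because both single mutants are less fit than AA.
w = {"AA": 1.0, "BA": 0.5, "AB": 0.5, "BB": 2.0,
     "CA": 1.2, "CB": 1.5, "AC": 0.8, "BC": 1.0, "CC": 0.1}
print(count_conversion_bypasses(w, "AA", "BB", "ABC"))
```

With 20 amino acids there are 18 extra residues per site, so the loop examines at most 36 candidate conversion bypasses for each pair of sites.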
The second mechanism of bypass, which we termed “detour bypass”, involves an additional site (Fig. 2C). In this case, adaptation can proceed by taking a detour step to gain a mutation at the third site (000 → 100), followed by the later loss of this mutation (111 → 011) [27,28]. Detour bypass was observed in our system (Fig. 2D), but was not as prevalent and had a lower probability of success than conversion bypass. Out of a maximum of 36 possible conversion bypasses and 38 possible detour bypasses for a chosen reciprocal sign epistasis, we found that there were on average 1.2 conversion bypasses and 0.27 detour bypasses available. We note, however, that the lower prevalence of detour bypass in our fitness landscape (L = 4) does not necessarily mean that it should be expected to be less frequent than conversion bypass in other systems. While the maximum number of possible conversion bypasses is always fixed (19 × 2 − 2 = 36), the maximum number of possible detour bypasses (19 × (L − 2)) is proportional to the sequence length L of the entire protein (whereas our study uses a subset L = 4). The pervasiveness of extra-dimensional bypasses in our system contrasts with the prevailing view that adaptive evolution is often blocked by reciprocal sign epistasis, when only direct paths of adaptation are considered. The two distinct mechanisms of bypass both require the use of indirect paths, where the Hamming distance to the destination is either unchanged (conversion) or increased (detour).
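The two upper bounds quoted above follow directly from the number of alternative amino acids per site. The short sketch below reproduces the arithmetic; the function names are hypothetical.

```python
N_AMINO_ACIDS = 20

def max_conversion_bypasses():
    """Each of the 2 interacting sites has 19 alternative residues;
    one per site is the residue already present in the destination
    genotype, leaving 19 * 2 - 2 extra residues."""
    return 2 * (N_AMINO_ACIDS - 1) - 2

def max_detour_bypasses(seq_length):
    """Each of the (L - 2) remaining sites can gain any of 19 residues."""
    return (N_AMINO_ACIDS - 1) * (seq_length - 2)

print(max_conversion_bypasses())   # fixed, independent of L
print(max_detour_bypasses(4))      # the four-site landscape studied here
print(max_detour_bypasses(56))     # full-length GB1 (56 residues)
```

The conversion bound does not depend on L, whereas the detour bound grows linearly with L, which is why detour bypass may be relatively more available in longer sequences.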
In order to circumvent the inaccessible direct paths via extra dimensions, reciprocal sign epistasis must be changed into other types of pairwise epistasis. For detour bypass, this means that the original reciprocal sign epistasis is changed to either magnitude epistasis or sign epistasis in the presence of a third mutation (Supplementary Fig. 5A). There are three possible scenarios where detour bypass can occur (Supplementary Fig. 5B-D). We proved that higher-order epistasis is necessary for the scenario that reciprocal sign epistasis is changed to magnitude epistasis, as well as for one of the two scenarios that reciprocal sign epistasis is changed to sign epistasis (Supplementary Text). This suggests a critical role of higher-order epistasis in mediating detour bypass.
To confirm the presence of higher-order epistasis, we decomposed the fitness landscape by Fourier analysis (Fig. 3A, Methods) [9, 30]. The Fourier coefficients can be interpreted as epistatic interactions of different orders [6, 30], including the main effects of single mutations (the 1st order), pairwise epistasis (the 2nd order), and higher-order epistasis (the 3rd and the 4th order). The fitness of variants can be reconstructed by expansion of Fourier coefficients up to a certain order (Supplementary Fig. 6). In our system with four sites, the 4th order Fourier expansion will always reproduce the measured fitness (i.e. Pearson correlation equals 1). When the 2nd order Fourier expansion does not reproduce the measured fitness (i.e. Pearson correlation less than 1), it indicates the presence of higher-order epistasis. In this way, we identified the 0.1% of subgraphs with greatest fitness contribution from higher-order epistasis (Fig. 3A, red lines) and visualized the corresponding quadruple mutants by the sequence logo plot (Fig. 3B). The skewed composition of amino acids in these subgraphs indicates that higher-order interactions are enriched among specific amino acid combinations of sites 39, 41 and 54. This interaction among 3 sites is consistent with our knowledge of the protein structure, where the side chains of sites 39, 41, and 54 can physically interact with each other at the core (Supplementary Fig. 1A) and destabilize the protein due to steric effects (Supplementary Fig. 7).
In the presence of higher-order epistasis, epistasis between any two sites would vary across different genetic backgrounds. We computed the magnitude of pairwise epistasis (ε) between each pair of amino acid substitutions (Methods) [39], and observed numerous instances where the sign of pairwise epistasis depended on genetic background. For example, G41L and V54H were positively epistatic when site 39 was isoleucine [I], but the interaction changed to negative epistasis when site 39 carried a tyrosine [Y] or a tryptophan [W] (Fig. 3C-D). Similar patterns were observed in other pairwise interactions among sites 39, 41 and 54, such as G41F/V54A and V39W/V54H (Supplementary Fig. 8). The observed pattern of higher-order epistasis was consistent with the results of the Fourier analysis (Fig. 3B). For example, site 40 was mostly excluded from higher-order epistasis; tyrosine [Y] or tryptophan [W] at site 39 were involved in the most significant higher-order interactions, as they often changed the sign of pairwise epistasis. Higher-order epistasis can also switch the type of pairwise epistasis, such as shifting from reciprocal sign epistasis to magnitude or sign epistasis (Supplementary Fig. 9), which in turn is important for the existence of detour bypass.
Our analysis on circumventing reciprocal sign epistasis revealed how indirect paths could open up new avenues of adaptation. To study the impact of indirect paths at a global scale, we performed simulated adaptation in the entire sequence space of 160,000 variants. The fitness landscape was completed by imputing fitness values of the 10,639 missing variants (i.e. 6.6% of the sequence space) that had fewer than 10 sequencing read counts in the input library. Our model of protein fitness incorporated main effects of single mutations, pairwise interactions, and three-way interactions among sites 39, 41 and 54 (Methods, Supplementary Fig. 10). We used predictor selection based on biological knowledge, followed by regularized regression, which has been demonstrated to ameliorate possible bias in the inferred fitness landscape [40]. In the complete sequence space, we identified a total of 30 fitness peaks (i.e. local maxima); among them 15 peaks had fitness larger than WT and their combined basins of attraction covered 99% of the sequence space (Fig. 4A).
We then simulated adaptation on the fitness landscape using three different models of adaptive walks (Methods), namely the Greedy Model [6], Correlated Fixation Model [41], and Equal Fixation Model [20]. In the Greedy Model, adaptation proceeds by sequential fixation of mutations that render the largest fitness gain at each step. The other two models assign a nonzero fixation probability to all beneficial mutations, either weighted by (Correlated Fixation Model) or independent of (Equal Fixation Model) the relative fitness gain. Among all the possible adaptive paths to fitness peaks, many of them involved indirect paths, i.e. they employed mechanisms of extra-dimensional bypass (Fig. 4B, Supplementary Fig. 11). We classified each step on the adaptive paths into three categories based on the change of Hamming distance to the destination (a fitness peak, in this case): “towards (-1)”, “conversion (0)”, and “detour (+1)” (Fig. 4C). Conversion was found to be pervasive during adaptation in our fitness landscape (17% of mutational steps for Greedy Model, 41% for Correlated Fixation Model, 59% for Equal Fixation Model). The use of detour was less frequent (0.1% of mutational steps for Greedy Model, 1.3% for Correlated Fixation Model, 3.7% for Equal Fixation Model), in accordance with the previous observation that detour bypass was less available than conversion bypass in our fitness landscape with L = 4. A conversion step would increase the length of an adaptive path by 1, while a detour step would increase the length by 2. As a result, an indirect path can be substantially longer than a direct path consisting of only “towards” steps. We found that many of the adaptive paths required more than 4 steps, which was the maximal length of a direct path between any variants in this landscape (Fig. 4D). Interestingly, because indirect adaptive paths involved more variants of intermediate fitness, the use of conversion and detour steps depended on the strength of selection. When mutations conferring larger fitness gains were more likely to fix (e.g. Greedy Model and Correlated Fixation Model), adaptation favored direct moves toward the destination, thus leading to shorter adaptive paths (Fig. 4C-D). This suggests that the strength of selection interacts with the topological structure of fitness landscapes to determine the length and directness of evolutionary trajectories.
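The three adaptive walk models and the step classification can be sketched as follows. This is an illustrative simplification rather than our simulation code: the Correlated Fixation Model is assumed here to weight each beneficial neighbor by its absolute fitness gain, and the function names, model labels and toy landscape are hypothetical.

```python
import random

def neighbors(genotype, alphabet):
    """Yield all genotypes one amino acid substitution away."""
    for i, current in enumerate(genotype):
        for residue in alphabet:
            if residue != current:
                yield genotype[:i] + residue + genotype[i + 1:]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def adaptive_walk(start, w, alphabet, model="greedy", rng=None):
    """Walk uphill until no neighbor is fitter (a local fitness peak).

    w: dict mapping every genotype string to its fitness.
    model: "greedy" (largest gain), "equal" (uniform over beneficial
    neighbors) or "correlated" (ASSUMPTION: probability proportional
    to the fitness gain).
    """
    rng = rng or random.Random()
    path = [start]
    while True:
        here = path[-1]
        gains = [(n, w[n] - w[here]) for n in neighbors(here, alphabet)
                 if w[n] > w[here]]
        if not gains:
            return path
        if model == "greedy":
            step = max(gains, key=lambda item: item[1])[0]
        elif model == "equal":
            step = rng.choice(gains)[0]
        elif model == "correlated":
            step = rng.choices([n for n, _ in gains],
                               weights=[gain for _, gain in gains])[0]
        else:
            raise ValueError(f"unknown model: {model}")
        path.append(step)

def classify_steps(path):
    """Label each step by the change in Hamming distance to the end point."""
    labels = {-1: "towards", 0: "conversion", 1: "detour"}
    destination = path[-1]
    return [labels[hamming(b, destination) - hamming(a, destination)]
            for a, b in zip(path, path[1:])]

# Toy two-site landscape over a three-letter alphabet (hypothetical values).
w = {"AA": 1.0, "BA": 0.5, "AB": 0.5, "BB": 3.0,
     "CA": 1.5, "CB": 2.0, "AC": 0.2, "BC": 0.3, "CC": 0.1}
walk = adaptive_walk("AA", w, "ABC", model="greedy")
print(walk, classify_steps(walk))
```

Because every accepted step strictly increases fitness, each walk terminates at a local maximum; the walk in the toy example reaches the peak only through a conversion step, since both direct single mutants are less fit than the starting genotype.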
Given that extra-dimensional bypasses can help proteins avoid evolutionary traps, we expect that their existence would facilitate adaptation in rugged fitness landscapes. Indeed, we found that indirect paths increased the number of genotypes with access to each fitness peak (Fig. 4E). In addition, the fraction of genotypes with accessible paths to all 15 fitness peaks increased from 34% to 93% when indirect adaptive paths were allowed (Supplementary Fig. 11C). We also found that a substantial fraction of beneficial variants (fitness > 1) in the sequence space were accessible from WT only if indirect paths were used (Fig. 4F). Taken together, these results suggest that indirect paths promote evolutionary accessibility in rugged fitness landscapes. This enhanced accessibility would allow proteins to explore more sequence space and lead to delayed commitment to evolutionary fates (i.e. fitness peaks) [28]. Consistent with this expectation, our simulations showed that many mutational trajectories involving extra-dimensional bypass did not fully commit to a fitness peak until the last two steps (Supplementary Fig. 12).
In our analysis, we have limited adaptation to the regime where fitness is monotonically increasing via sequential fixation of one-step beneficial mutants. When this assumption is relaxed, adaptation can sometimes proceed by crossing fitness valleys [2, 6, 42, 43]. Another simplification in our analysis is to treat all sequences in a “protein space” [44], where two sequences are considered as neighbors if they differ by a single amino-acid substitution. In practice, amino acid substitutions occurring via a single nucleotide mutation are limited by the genetic code, so the total number of one-step neighbors would be reduced from 19L to approximately 6L. We also expect fitness landscapes of different systems to have different topological structure. Even in our system (with >93% coverage of the genotype space), the global structure of the fitness landscape is influenced by the imputed fitness values of missing variants, which can vary when different fitness models or fitting methods are used. Our analysis also ignored measurement errors, but the measurement errors are expected to be very small due to the high reproducibility in the data (Supplementary Fig. 3B). Both imputation of missing variants and measurement errors can lead to slight mis-specification of the topological structure of the fitness landscape. Nevertheless, specific details of a certain fitness landscape do not undermine the generality of our findings on extra-dimensional bypass, higher-order epistasis, and their roles in protein evolution.
Higher-order epistasis has been reported in a few biological systems [28,45,46], and is likely to be common in nature [30]. In this study, we uncovered the presence of higher-order epistasis and systematically quantified its contribution to protein fitness. We also revealed the importance of higher-order epistasis in mediating detour bypass, which could promote evolutionary accessibility in rugged fitness landscapes. As we pointed out, the possible number of detour bypasses scales up with sequence length, so it will be interesting to study how extra-dimensional bypass influences adaptation in sequence space of even higher dimensionality. For example, it is plausible that the sequence of a large protein may never be trapped in adaptation [47], so that adaptive accessibility becomes a quantitative rather than qualitative problem. Given the continuing development of sequencing technology, we anticipate that the scale of experimentally determined fitness landscapes will further increase, yet the full protein sequence space is too huge to be mapped exhaustively. Does this mean that we will never be able to understand the full complexity of fitness landscapes? Or perhaps big data from high-throughput measurements will guide us to find general rules? By coupling state-of-the-art experimental techniques with novel quantitative analysis of fitness landscapes, this work takes the optimistic view that we can push the boundary further and discover new mechanisms underlying evolution [9,48,49].
Methods
Mutant library construction
Two oligonucleotides (Integrated DNA Technologies, Coralville, IA), 5’-AGT CTA GTA TCC AAC GGC NNS NNS NNK GAA TGG ACC TAC GAC GAC GCT ACC AAA ACC TT-3’ and 5’-TTG TAA TCG GAT CCT CCG GAT TCG GTM NNC GTG AAG GTT TTG GTA GCG TCG TCG T-3’ were annealed by heating to 95°C for 5 minutes and cooling to room temperature over 1 hour. The annealed nucleotide was extended in a reaction containing 0.5 uM of each oligonucleotide, 50 mM NaCl, 10 mM Tris-HCl pH 7.9, 10 mM MgCl2, 1 mM DTT, 250 uM each dNTP, and 50 units Klenow exo- (New England Biolabs, Ipswich, MA) for 30 mins at 37°C. The product (cassette I) was purified by PureLink PCR Purification Kit (Life Technologies, Carlsbad, CA) according to manufacturer’s instructions.
A constant region was generated by PCR amplification using KOD DNA polymerase (EMD Millipore, Billerica, MA) with 1.5 mM MgSO4, 0.2 mM of each dNTP (dATP, dCTP, dGTP, and dTTP), 0.05 ng protein GB1 wild type (WT) template, and 0.5 uM each of 5’-TTC TAA TAC GAC TCA CTA TAG GGA CAA TTA CTA TTT ACA TAT CCA CCA TG-3’ and 5’-AGT CTA GTA TCC TCG ACG CCG TTG TCG TTA GCG TAC TGC-3’. The sequence of the WT template consisted of a T7 promoter, 5’ UTR, the coding sequence of Protein GB1, 3’ poly-GS linkers, and a FLAG-tag (Supplementary Fig. 1B) [33]. The thermocycler was set as follows: 2 minutes at 95°C, then 18 three-step cycles of 20 seconds at 95°C, 15 seconds at 58°C, and 20 seconds at 68°C, and 1 minute final extension at 68°C. The product (constant region) was purified by PureLink PCR Purification Kit (Life Technologies) according to manufacturer’s instructions. Both the purified constant region and cassette I were digested with BciVI (New England Biolabs) and purified by PureLink PCR Purification Kit (Life Technologies) according to manufacturer’s instructions.
Ligation between the constant region and cassette I (molar ratio of 1:1) was performed using T4 DNA ligase (New England Biolabs). Agarose gel electrophoresis was performed to separate the ligated product from the reactants. The ligated product was purified from the agarose gel using Zymoclean Gel DNA Recovery Kit (Zymo Research, Irvine, CA) according to manufacturer’s instructions. PCR amplification was then performed using KOD DNA polymerase (EMD Millipore) with 1.5 mM MgSO4, 0.2 mM of each dNTP (dATP, dCTP, dGTP, and dTTP), 4 ng of the ligated product, and 0.5 uM each of 5’-TTC TAA TAC GAC TCA CTA TAG GGA CAA TTA CTA TTT ACA TAT CCA CCA TG-3’ and 5’-GGA GCC GCT ACC CTT ATC GTC GTC ATC CTT GTA ATC GGA TCC TCC GGA TTC-3’. The thermocycler was set as follows: 2 minutes at 95°C, then 10 three-step cycles of 20 seconds at 95°C, 15 seconds at 56°C, and 20 seconds at 68°C, and 1 minute final extension at 68°C. The product, which is referred to as “DNA library”, was purified by PureLink PCR Purification Kit (Life Technologies) according to manufacturer’s instructions.
Affinity selection by mRNA display
Affinity selection by mRNA display [34, 35] was performed as described (Supplementary Fig. 3A) [33]. Briefly, the DNA library was transcribed by T7 RNA polymerase (Life Technologies) according to manufacturer’s instructions. Ligation was performed using 1 nmol of mRNA, 1.1 nmol of 5’-TTT TTT TTT TTT GGA GCC GCT ACC-3’, and 1.2 nmol of 5-/5Phos/-d(A)21-(9)3-ACC-Puromycin by T4 DNA ligase (New England Biolabs) in a 100 uL reaction. The ligated product was purified by urea PAGE and translated in a 100 uL reaction volume using Retic Lysate IVT Kit (Life Technologies) according to manufacturer’s instructions, followed by incubation with 500 mM final concentration of KCl and 60 mM final concentration of MgCl2 for at least 30 minutes at room temperature to increase the efficiency of fusion formation [52]. The mRNA-protein fusion was then purified using ANTI-FLAG M2 Affinity Gel (Sigma-Aldrich, St. Louis, MO). Elution was performed using 3X FLAG peptide (Sigma-Aldrich). The purified mRNA-protein fusion was reverse transcribed using SuperScript III Reverse Transcriptase (Life Technologies). This reverse transcribed product, which was referred to as “input library”, was incubated with Pierce streptavidin agarose (SA) beads (Life Technologies) that were conjugated with biotinylated human IgG-Fc (Rockland Immunochemicals, Limerick, PA). After washing, the immobilized mRNA-protein fusion was eluted by heating to 95°C. The eluted sample was referred to as “selected library”.
Sequencing library preparation
PCR amplification was performed using KOD DNA polymerase (EMD Millipore) with 1.5 mM MgSO4, 0.2 mM of each dNTP (dATP, dCTP, dGTP, and dTTP), the selected library, and 0.5 uM each of 5’-CTA CAC GAC GCT CTT CCG ATC TNN NAG CAG TAC GCT AAC GAC AAC G-3’ and 5’-TGC TGA ACC GCT CTT CCG ATC TNN NTA ATC GGA TCC TCC GGA TTC G-3’. The underlined “NNN” indicated the position of the multiplex identifier, GTG for input library and TGT for post-selection library. The thermocycler was set as follows: 2 minutes at 95°C, then 10 to 12 three-step cycles of 20 seconds at 95°C, 15 seconds at 56°C, and 20 seconds at 68°C, and 1 minute final extension at 68°C. The product was then PCR amplified again using KOD DNA polymerase (EMD Millipore) with 1.5 mM MgSO4, 0.2 mM of each dNTP (dATP, dCTP, dGTP, and dTTP), the eluted product from mRNA display, and 0.5 uM each of 5’-AAT GAT ACG GCG ACC ACC GAG ATC TA CAC TCT TTC CCT ACA CGA CGC TCT TCC G-3’ and 5’-CAA GCA GAA GAC GGC ATA CGA GAT CGG TCT CGG CAT TCC TGC TGA ACC GCT CTT CCG-3’. The thermocycler was set as follows: 2 minutes at 95°C, then 10 to 12 three-step cycles of 20 seconds at 95°C, 15 seconds at 56°C, and 20 seconds at 68°C, and 1 minute final extension at 68°C. The PCR product was then subjected to 2 × 100 bp paired-end sequencing using Illumina HiSeq 2500 platform. Raw sequencing data have been submitted to the NIH Short Read Archive under accession number: BioProject PRJNA278685.
We were able to compute the fitness for 93.4% of all variants from the sequencing data. The fitness measurements in this study were highly consistent with our previous study on fitness of single and double mutants in protein GB1 (Pearson correlation = 0.97, Supplementary Fig. 3B) [33].
Sequencing data analysis
The first three nucleotides of both the forward read and the reverse read were used for demultiplexing. If the first three nucleotides of the forward read were different from those of the reverse read, the given paired-end read would be discarded. For both the forward read and the reverse read, the nucleotides corresponding to the codons of protein GB1 sites 39, 40, 41, and 54 were extracted. If the coding sequences of sites 39, 40, 41, and 54 in the forward read and in the reverse read were not reverse complements of each other, the paired-end read would be discarded. Subsequently, the occurrences of individual variants at the amino acid level for sites 39, 40, 41, and 54 in both the input library and the selected library were counted, with each paired-end read representing 1 count. Custom Python scripts and bash scripts were used for sequencing data processing. All scripts are available upon request.
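The filtering logic described above can be sketched as follows. This is an illustration only and is not the custom script used in this study: it assumes that the three-nucleotide multiplex identifiers and the 12 nucleotides covering the four codons have already been extracted from each read (the read offsets are not reproduced here), and the function names are hypothetical. Translation uses the standard genetic code.

```python
from collections import Counter

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def call_variant(fwd_tag, rev_tag, fwd_codons, rev_codons):
    """Return the four-residue variant for one read pair, or None if discarded.

    fwd_tag, rev_tag: first three nucleotides of each read.
    fwd_codons: 12 nt covering sites 39, 40, 41, 54 in coding orientation.
    rev_codons: the same region as sequenced on the reverse read
    (ASSUMPTION: already extracted and concatenated in read orientation).
    """
    if fwd_tag != rev_tag:
        return None  # inconsistent multiplex identifiers
    if reverse_complement(rev_codons) != fwd_codons:
        return None  # the two reads disagree on the codons
    residues = [CODON_TABLE.get(fwd_codons[i:i + 3]) for i in range(0, 12, 3)]
    if None in residues:
        return None  # ambiguous base call
    return "".join(residues)

# Toy read pairs (hypothetical): WT twice, then a pair with mismatched tags.
wt = "GTTGACGGTGTT"  # codons for V, D, G, V
pairs = [("GTG", "GTG", wt, reverse_complement(wt)),
         ("GTG", "GTG", wt, reverse_complement(wt)),
         ("GTG", "TGT", wt, reverse_complement(wt))]
counts = Counter(v for v in (call_variant(*p) for p in pairs) if v is not None)
print(counts)
```

Requiring agreement between the two reads means that a sequencing error must occur at the same position on both reads to produce a miscalled variant.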
Calculation of fitness
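The fitness (w) for a given variant i was computed as:

$$w_i = \frac{\mathrm{count}_{i,\mathrm{selected}} / \mathrm{count}_{i,\mathrm{input}}}{\mathrm{count}_{\mathrm{WT},\mathrm{selected}} / \mathrm{count}_{\mathrm{WT},\mathrm{input}}}$$

where $\mathrm{count}_{i,\mathrm{selected}}$ represented the count of variant i in the selected library, $\mathrm{count}_{i,\mathrm{input}}$ represented the count of variant i in the input library, $\mathrm{count}_{\mathrm{WT},\mathrm{selected}}$ represented the count of WT (VDGV) in the selected library, and $\mathrm{count}_{\mathrm{WT},\mathrm{input}}$ represented the count of WT (VDGV) in the input library.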
Therefore, the fitness of each variant, $w_i$, could be viewed as the fitness relative to WT (VDGV), such that $w_{\mathrm{WT}} = 1$. Variants with $\mathrm{count}_{\mathrm{input}} < 10$ were filtered to reduce noise. The fraction of all possible variants that passed this filter was 93.4% (149,361 out of 160,000 possible variants).
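As a minimal sketch of this calculation, assuming that fitness is the enrichment ratio of a variant (selected count over input count) normalized by the same ratio for WT, the computation could look as follows. The function name and the toy counts are hypothetical; variants absent from the selected library are assumed to have a count of zero.

```python
def compute_fitness(input_counts, selected_counts, wt="VDGV", min_input=10):
    """Fitness of each variant relative to WT from read counts.

    input_counts, selected_counts: dicts mapping variant -> read count.
    Variants with fewer than min_input reads in the input library are
    dropped because their enrichment ratios are too noisy.
    """
    wt_ratio = selected_counts[wt] / input_counts[wt]
    fitness = {}
    for variant, n_input in input_counts.items():
        if n_input < min_input:
            continue
        n_selected = selected_counts.get(variant, 0)  # assumed 0 if unseen
        fitness[variant] = (n_selected / n_input) / wt_ratio
    return fitness

# Toy counts (hypothetical values).
input_counts = {"VDGV": 100, "WLFA": 50, "AAAA": 5}
selected_counts = {"VDGV": 200, "WLFA": 300}
print(compute_fitness(input_counts, selected_counts))
```

By construction the WT fitness equals 1, and the variant with only 5 input reads is excluded by the read-count filter.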
The fitness of each single substitution variant was referenced to our previous study [33], because the sequencing coverage of single substitution variants in our previous study was much higher than in this study (∼100 fold higher). Hence, our confidence in computing fitness for a single substitution variant should also be much higher in our previous study than this study. Subsequently, the fitness of each single substitution in this study was calculated by multiplying a factor of 1.159 by the fitness of that single substitution computed from our previous study [33]. This is based on the linear regression analysis between the single substitution fitness as measured in our previous study and in this study, which had a slope of 1.159 and a y-intercept of ∼0.
Magnitude and type of pairwise epistasis
The three types of pairwise epistasis (magnitude, sign and reciprocal sign) were classified by ranking the fitness of the four variants involved [53].
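A minimal sketch of this classification is given below. Instead of ranking the four fitness values explicitly, it checks whether the sign of each mutation's effect flips between the two backgrounds, which is intended to yield the same three categories; cases without any epistasis fall into the "magnitude" category in this simplified version. The function names are hypothetical, and the ruggedness function assumes a binary four-site subgraph stored as a dictionary keyed by 0/1 tuples.

```python
from itertools import combinations, product

def classify_epistasis(w00, w10, w01, w11):
    """Classify pairwise epistasis from the fitness of the four variants
    (background, single a, single b, double)."""
    a_flips = (w10 - w00) * (w11 - w01) < 0  # effect of a changes sign with b
    b_flips = (w01 - w00) * (w11 - w10) < 0  # effect of b changes sign with a
    if a_flips and b_flips:
        return "reciprocal_sign"
    if a_flips or b_flips:
        return "sign"
    return "magnitude"

def ruggedness(fitness, n_sites=4):
    """Return (f_sign + 2 * f_reciprocal, number of pairwise squares).

    Every pair of sites is examined on every background of the other
    sites: C(4, 2) * 2^2 = 24 squares for a four-site binary subgraph.
    """
    counts = {"magnitude": 0, "sign": 0, "reciprocal_sign": 0}
    for i, j in combinations(range(n_sites), 2):
        others = [s for s in range(n_sites) if s not in (i, j)]
        for background in product((0, 1), repeat=len(others)):
            def w(a, b):
                genotype = [0] * n_sites
                for site, state in zip(others, background):
                    genotype[site] = state
                genotype[i], genotype[j] = a, b
                return fitness[tuple(genotype)]
            counts[classify_epistasis(w(0, 0), w(1, 0), w(0, 1), w(1, 1))] += 1
    total = sum(counts.values())
    return (counts["sign"] + 2 * counts["reciprocal_sign"]) / total, total

print(classify_epistasis(1.0, 0.5, 0.5, 2.0))
```

In the printed example both single mutants are deleterious on the background but beneficial in the presence of the other mutation, which is the defining pattern of reciprocal sign epistasis.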
To quantify the magnitude of epistasis (ε) between substitutions a and b on a given background variant BG, the relative epistasis model [39] was employed as follows:

$$\varepsilon_{ab,BG} = \ln(w_{ab}) - \ln(w_a) - \ln(w_b) + \ln(w_{BG})$$

where $w_{ab}$ represents the fitness of the double substitution, $w_a$ and $w_b$ represent the fitness of each of the single substitutions respectively, and $w_{BG}$ represents the fitness of the background variant.
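As an illustration, the sketch below computes ε under the assumption that the relative epistasis model takes the log-additive form ε = ln(w_ab) − ln(w_a) − ln(w_b) + ln(w_BG). The function name is hypothetical, and the adjustments for low-fitness variants are not implemented.

```python
import math

def relative_epistasis(w_ab, w_a, w_b, w_bg=1.0):
    """Epistasis between substitutions a and b on background BG.

    ASSUMED form: deviation of the double mutant's log fitness from the
    sum of the single-mutant log effects, all measured relative to BG.
    Positive values indicate positive epistasis, negative values
    negative epistasis. All fitness values must be > 0.
    """
    return math.log(w_ab) - math.log(w_a) - math.log(w_b) + math.log(w_bg)

# Toy values (hypothetical): single effects of 2x and 3x on a background
# of fitness 1; a double mutant at 6x is exactly multiplicative.
print(relative_epistasis(6.0, 2.0, 3.0))
print(relative_epistasis(12.0, 2.0, 3.0))
```

Evaluating the same pair of substitutions on different backgrounds (different values of w_BG, w_a, w_b and w_ab) is how background dependence of the sign of ε can be detected.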
As described previously [33], there exists a limitation in determining the exact fitness for very low-fitness variants in this system. To account for this limitation, several rules were adapted from our previous study to minimize potential artifacts in determining ε [33]. We previously determined that the detection limit of fitness (w) in this system is ∼0.01 [33]. Rule 1 prevents epistasis being artificially estimated from low-fitness variants. Rule 2 prevents overestimation of epistasis due to low fitness of one of the two single substitutions. Rule 3 prevents underestimation of epistasis due to low fitness of the double substitution. To compute the epistasis between two substitutions, a and b, on a given background variant BG, $\varepsilon_{ab,BG,\mathrm{adjusted}}$ would be used if one of the above three rules was satisfied. Otherwise, $\varepsilon_{ab,BG}$ would be used.
Fourier analysis
Fitness decomposition was performed on all subgraphs without missing variants (109,235 subgraphs in total). We decomposed the fitness landscape into epistatic interactions of different orders by Fourier analysis [9, 54]. The Fourier coefficients given by the transform can be interpreted as epistasis of different orders [6,30].
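For a binary sequence $\vec{z}$ with dimension L ($z_i$ equals 1 if mutation is present at position i, or 0 otherwise), the Fourier decomposition theorem states that the fitness function $f(\vec{z})$ can be expressed as [51]:

$$f(\vec{z}) = \sum_{\vec{k}} \hat{f}_{\vec{k}} \, (-1)^{\vec{z}\cdot\vec{k}}$$

The formula for the Fourier coefficients $\hat{f}_{\vec{k}}$ is then:

$$\hat{f}_{\vec{k}} = \frac{1}{2^L} \sum_{\vec{z}} f(\vec{z}) \, (-1)^{\vec{z}\cdot\vec{k}}$$

For example, we can expand the fitness landscape up to the second order, i.e. with linear and quadratic terms

$$f(\vec{z}) \approx \hat{f}_{\vec{0}} + \sum_{i=1}^{L} \hat{f}_i \sigma_i + \sum_{i<j} \hat{f}_{ij} \sigma_i \sigma_j$$

where $\sigma_i \equiv (-1)^{z_i} \in \{+1, -1\}$, $\hat{f}_i \equiv \hat{f}_{\vec{e}_i}$, $\hat{f}_{ij} \equiv \hat{f}_{\vec{e}_i + \vec{e}_j}$, and $\vec{e}_i$ is a unit vector along the ith direction. In our analysis of subgraphs, there are a total of $2^4 = 16$ terms in the Fourier decomposition, with $\binom{4}{i}$ terms for the ith order (i = 0, 1, 2, 3, 4). We can expand the fitness landscape up to a given order by ignoring all higher-order terms in the full decomposition (Equation 18). In this paper, we refer to higher-order epistasis as non-zero contribution to fitness from the 3rd order terms and beyond.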
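As an illustration of the decomposition on a four-site binary subgraph, the following sketch computes the coefficients by direct summation, assuming the Walsh-Hadamard convention in which each coefficient is the average of f(z)·(−1)^(z·k) over all 2^L genotypes. The function names and the toy landscape are hypothetical and are not taken from the measured data.

```python
from itertools import product

def dot(z, k):
    """Inner product of two binary tuples."""
    return sum(zi * ki for zi, ki in zip(z, k))

def fourier_coefficients(f, L):
    """Return {k: coefficient} for a landscape f given as {binary tuple: fitness}."""
    genotypes = list(product((0, 1), repeat=L))
    return {
        k: sum(f[z] * (-1) ** dot(z, k) for z in genotypes) / 2 ** L
        for k in genotypes
    }

def reconstruct(coeffs, z, max_order):
    """Evaluate the expansion at z, keeping only terms up to max_order."""
    return sum(
        c * (-1) ** dot(z, k) for k, c in coeffs.items() if sum(k) <= max_order
    )

# Toy landscape on L = 4 sites with additive effects at sites 0 and 1
# and a single pairwise interaction between them (hypothetical values).
L = 4
toy = {
    z: 1.0 + 0.5 * z[0] + 0.25 * z[1] + 0.125 * z[0] * z[1]
    for z in product((0, 1), repeat=L)
}
coeffs = fourier_coefficients(toy, L)

# Number of terms at each order: C(4, i) for i = 0..4.
terms_per_order = [sum(1 for k in coeffs if sum(k) == i) for i in range(L + 1)]
# Largest absolute coefficient of order 3 or 4 (higher-order epistasis).
higher_order = max(abs(c) for k, c in coeffs.items() if sum(k) >= 3)
```

Because the toy landscape contains no interaction beyond pairs, all third- and fourth-order coefficients vanish and the second-order expansion reproduces every fitness value exactly. A measured subgraph would generally leave a non-zero residual at this order.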
Imputing the fitness of missing variants
The fitness values for 10,639 variants (6.6% of the entire sequence space) were not directly measured (read count in the input pool = 0) or were filtered out because of low read counts in the input pool (see section “Calculation of fitness”). To impute the fitness of these missing variants, we performed regularized regression on fitness values of observed variants using the following model [40,55]:

$$\log f = \alpha_0 + \sum_{i=1}^{N_M} \beta_i M_i + \sum_{j=1}^{N_P} \gamma_j P_j + \sum_{k=1}^{N_T} \delta_k T_k$$

Here, f is the protein fitness. $\alpha_0$ is the intercept that represents the log fitness of WT; $\beta_i$ represents the main effect of a single mutation, i; $M_i$ is a dummy variable that equals 1 if the single mutation i is present in the sequence, or 0 if the single mutation is absent; and $N_M = 19 \times 4 = 76$ is the total number of single mutations. Similarly, $\gamma_j$ represents the effect of interaction between a pair of mutations; $P_j$ is the dummy variable that equals either 1 or 0 depending on the presence of those two mutations; and $N_P = 19^2 \times 6 = 2{,}166$ is the total number of possible pairwise interactions. In addition to the main effects of single mutations and pairwise interactions, the three-way interactions among sites 39, 41 and 54 are included in the model, based on our knowledge of higher-order epistasis (Fig. 3). $\delta_k$ represents the effect of three-way interactions among sites 39, 41 and 54; $T_k$ is the dummy variable that equals either 1 or 0 depending on the presence of that three-way interaction; and $N_T = 19^3 = 6{,}859$ is the total number of three-way interactions. Thus, the total number of coefficients in this model is 9,102, including the intercept, main effects of each site (i.e. additive effects), interactions between pairs of sites (i.e. pairwise epistasis), and a subset of three-way interactions (i.e. higher-order epistasis).
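The coefficient count can be checked with a short sketch that also lists which dummy variables are switched on for a given variant. The function name `active_terms` and the WT string passed to it are illustrative assumptions. The counts themselves follow from 19 non-WT amino acids at each of the four sites.

```python
from itertools import combinations
from math import comb

SITES = (39, 40, 41, 54)
TRIPLE_SITES = {39, 41, 54}   # the only three-way interactions in the model
N_ALT = 19                    # non-WT amino acids per site

n_single = N_ALT * len(SITES)                  # main effects
n_pair = N_ALT ** 2 * comb(len(SITES), 2)      # pairwise interactions
n_triple = N_ALT ** 3                          # three-way, sites 39/41/54 only
n_total = 1 + n_single + n_pair + n_triple     # plus the intercept

def active_terms(variant, wt):
    """Return the single, pairwise and three-way terms present in `variant`.

    `variant` and `wt` are 4-letter strings ordered as SITES; each term is a
    tuple of (site, amino acid) pairs whose dummy variable equals 1.
    """
    muts = [(s, aa) for s, aa, w in zip(SITES, variant, wt) if aa != w]
    singles = [(m,) for m in muts]
    pairs = list(combinations(muts, 2))
    triples = [
        t for t in combinations(muts, 3) if {s for s, _ in t} == TRIPLE_SITES
    ]
    return singles, pairs, triples

# Hypothetical example: a variant mutated at all four sites relative to an
# assumed WT string.
singles, pairs, triples = active_terms("AAAA", "VDGV")
```

A variant that differs from WT at all four sites therefore activates 4 main effects, 6 pairwise terms, and 1 three-way term in this encoding.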
Out of the 149,361 variants with experimentally measured fitness values, 119,884 variants were non-lethal (f > 0) and were used to fit the model coefficients using lasso regression (Matlab R2014b). Lasso regression adds a penalty term $\lambda \sum |\theta|$ (θ stands for any coefficient in the model) when minimizing the least squares, and thus favors sparse solutions of coefficients (Supplementary Fig. 10B). We calculated the 10-fold cross-validation MSE (mean squared error) of the lasso regression for a wide range of the penalty parameter λ (Supplementary Fig. 10A), and λ = $10^{-4}$ was chosen. For measured variants, the model-predicted fitness values were highly correlated with the actual fitness values (Pearson correlation = 0.93, Supplementary Fig. 10C). We then used the fitted model to impute the fitness of the 10,639 missing variants and complete the entire fitness landscape.
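To see why an absolute-value penalty favors sparse coefficients, consider the special case of an orthonormal design matrix with the objective ½‖y − Xθ‖² + λ Σ|θ|. In that case the lasso solution is the soft-thresholded least-squares estimate, as sketched below. This is a textbook simplification and not the Matlab routine used in the study, whose scaling of λ may differ. The coefficient values are hypothetical.

```python
def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become exactly 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

# Hypothetical least-squares coefficients: two sizeable effects and two
# that are smaller than the penalty.
ols = [0.5, -0.2, 0.00005, -0.00003]
lam = 1e-4
shrunk = [soft_threshold(c, lam) for c in ols]
n_nonzero = sum(1 for c in shrunk if c != 0.0)
```

Coefficients smaller in magnitude than λ are set exactly to zero, while larger ones are shrunk by λ, which is the source of the sparsity.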
Simulating adaptation using three models for fixation
The Python package “networkx” was employed to construct a directed graph that represented the entire fitness landscape for sites 39, 40, 41, and 54. A total of $20^4$ = 160,000 nodes were present in the directed graph, where each node represented a 4-site variant. For all pairs of variants separated by a Hamming distance of 1, a directed edge was generated from the variant with a lower fitness to the variant with a higher fitness. Therefore, all successors of a given node had a higher fitness than the given node. A fitness peak was defined as a node that had 0 out-degree.

Three models, namely the Greedy Model [6], the Correlated Fixation Model [41], and the Equal Fixation Model [20], were employed in this study to simulate the mutational steps in adaptive trajectories. The Greedy Model represents adaptive evolution of a large population with pervasive clonal interference [6]. The Correlated Fixation Model represents adaptive evolution of a population under the scheme of strong-selection/weak-mutation (SSWM), which assumes that the time to fixation is much shorter than the time between mutations, and the fixation probability of a given mutation is proportional to the improvement in fitness. The Equal Fixation Model represents a simplified scenario of adaptation where all beneficial mutations fix with equal probability [20]. Under all three models, the probability of fixation of a deleterious or neutral mutation is 0.

Consider a mutational trajectory initiated at a node $n_i$ with a fitness value of $w_i$, where $n_i$ has M successors $(n_1, n_2, \ldots, n_M)$ with fitness values of $(w_1, w_2, \ldots, w_M)$. The probability that the next mutational step is from $n_i$ to $n_k$, where $k \in \{1, 2, \ldots, M\}$, is denoted $P_{i \to k}$ and called the probability of fixation. It can be computed for each model as follows.
For the Greedy Model (deterministic model),

$$P_{i \to k} = \begin{cases} 1 & \text{if } w_k = \max(w_1, w_2, \ldots, w_M) \\ 0 & \text{otherwise} \end{cases}$$

For the Correlated Fixation Model (non-deterministic model),

$$P_{i \to k} = \frac{w_k - w_i}{\sum_{m=1}^{M} (w_m - w_i)}$$

For the Equal Fixation Model (non-deterministic model),

$$P_{i \to k} = \frac{1}{M}$$

To compute the shortest path from a given variant to all reachable variants, the function “single_source_shortest_path” in “networkx” was used. If no path exists from a low-fitness variant to a high-fitness variant, the high-fitness variant is inaccessible. If the shortest path is longer than the Hamming distance between the two variants, adaptation requires indirect paths.
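The graph construction, the three fixation models, and the accessibility test can be sketched on a toy landscape of two sites with three alleles each. The sketch below uses only the standard library, with a breadth-first search standing in for the “networkx” shortest-path call. The fitness values and function names are hypothetical, and the fixation probabilities assume that the Greedy Model always takes the fittest successor, the Correlated Fixation Model weights successors by fitness gain, and the Equal Fixation Model weights them uniformly.

```python
from collections import deque

# Toy landscape (hypothetical): AA and BB are separated by reciprocal sign
# epistasis, but a third allele C at site 1 opens a detour.
fitness = {
    "AA": 1.0, "AB": 0.5, "AC": 0.1,
    "BA": 0.5, "BB": 2.0, "BC": 0.1,
    "CA": 1.5, "CB": 1.8, "CC": 0.1,
}
ALPHABET = "ABC"

def neighbors(v):
    """All variants at Hamming distance 1 from v."""
    return [v[:i] + aa + v[i + 1:]
            for i in range(len(v)) for aa in ALPHABET if aa != v[i]]

# Directed edges point from lower to strictly higher fitness.
successors = {v: [n for n in neighbors(v) if fitness[n] > fitness[v]]
              for v in fitness}
peaks = [v for v in fitness if not successors[v]]  # nodes with 0 out-degree

def fixation_probs(v, model):
    """Probability of each possible next step from v under the given model."""
    succ = successors[v]
    if not succ:
        return {}
    if model == "greedy":
        best = max(succ, key=fitness.get)
        return {n: float(n == best) for n in succ}
    if model == "correlated":
        total = sum(fitness[n] - fitness[v] for n in succ)
        return {n: (fitness[n] - fitness[v]) / total for n in succ}
    if model == "equal":
        return {n: 1.0 / len(succ) for n in succ}
    raise ValueError(f"unknown model: {model}")

def shortest_path_lengths(source):
    """Breadth-first search along uphill edges; unreachable nodes are absent."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for n in successors[v]:
            if n not in dist:
                dist[n] = dist[v] + 1
                queue.append(n)
    return dist

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

dist_from_AA = shortest_path_lengths("AA")
needs_indirect_path = dist_from_AA["BB"] > hamming("AA", "BB")
```

In this toy example both direct paths from AA to BB pass through a less fit intermediate, so the only accessible route is the three-step detour AA → CA → CB → BB. This is the kind of path that is flagged when the shortest path exceeds the Hamming distance.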
Analysis of direct paths within a subgraph
In the subgraph analysis shown in Supplementary Fig. 4, the fitness landscape was restricted to 2 amino acids at each of the 4 sites (the WT and adapted alleles). There was a total of $2^4 = 16$ variants, hence nodes, in a given subgraph. Only those subgraphs where the fitness of all variants was measured directly were used (i.e. any subgraph with missing variants was excluded from this analysis). Mutational trajectories were generated in the same manner as in the analysis of the entire fitness landscape (see subsection “Simulating adaptation using three models for fixation”). In a subgraph with only one fitness peak, the probability of a mutational trajectory from node i to node j via intermediates a, b, and c was as follows:

$$P_{i \to a \to b \to c \to j} = P_{i \to a} \times P_{a \to b} \times P_{b \to c} \times P_{c \to j}$$

To compute the Gini index for a given set of mutational trajectories from node i to node j, the probabilities of all possible mutational trajectories were sorted from large to small. Inaccessible trajectories were also included in this sorted list with a probability of 0. This sorted list with t trajectories was denoted as $(P_1, P_2, \ldots, P_t)$, where $P_1$ was the largest and $P_t$ was the smallest. This sorted list was converted into a list of cumulative probabilities, which is denoted as $(A_1, A_2, \ldots, A_t)$, where $A_n = \sum_{m=1}^{n} P_m$.
The Gini index for the given subgraph was then computed as follows:
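The steps of this calculation can be sketched as follows: multiply the step probabilities along each trajectory, sort the trajectory probabilities, accumulate them, and summarize their inequality. The Gini function below uses the standard discrete Gini coefficient, which is an assumption and may differ in normalization from the formula used in the study. All names and numbers are hypothetical.

```python
from math import prod

def trajectory_probability(step_probs):
    """Probability of a trajectory as the product of its step probabilities."""
    return prod(step_probs)

def cumulative(probs):
    """Cumulative sums of the probabilities sorted from large to small."""
    out, running = [], 0.0
    for p in sorted(probs, reverse=True):
        running += p
        out.append(running)
    return out

def gini(probs):
    """Standard discrete Gini coefficient of non-negative values.

    Returns 0 when all values are equal and (n - 1) / n when a single value
    carries the whole total. This normalization is an assumption.
    """
    xs = sorted(probs)  # ascending order for the rank-weighted sum
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n

# Hypothetical set of 4 trajectories, one of them inaccessible (probability 0).
traj_probs = [
    trajectory_probability([0.5, 1.0, 1.0, 1.0]),
    trajectory_probability([0.5, 0.5, 1.0, 1.0]),
    trajectory_probability([0.5, 0.5, 1.0, 1.0]),
    0.0,
]
cum = cumulative(traj_probs)
g = gini(traj_probs)
```

Between WT and the quadruple mutant of a four-site subgraph there are 4! = 24 direct trajectories, so the list would hold 24 entries. A Gini index near 0 then indicates that the trajectories are used roughly equally, while a value near its maximum indicates that adaptation is funnelled through a few of them.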
Visualization
The sequence logo was generated using WebLogo (http://weblogo.berkeley.edu/logo.cgi) [56].
The visualization of basins of attraction (Fig. 4A) was generated using Graphviz with “fdp” as the option for layout.
Competing Interests
The authors declare that they have no competing interests.
Contributions
N.C.W., C.A.O., and R.S. designed the experiment; N.C.W. and C.A.O. performed the experiments; N.C.W. processed the sequencing data; L.D. and N.C.W. analyzed the fitness landscape; J.O.L.S. provided important intellectual input; L.D. and N.C.W. wrote the manuscript; J.O.L.S. and R.S. revised the manuscript.
Acknowledgments
We would like to thank Jesse Bloom and Joshua Plotkin for helpful comments on early versions of the manuscript. N.C.W. was supported by Philip Whitcome Pre-Doctoral Fellowship, Audree Fowler Fellowship in Protein Science, and UCLA Dissertation Year Fellowship. L.D. was supported by HHMI Postdoctoral Fellowship from Jane Coffin Childs Memorial Fund for Medical Research. R.S. was supported by NIH R01 DE023591. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.