## Abstract

The enormous size and complexity of genotypic sequence space frequently requires consideration of coarse-grained sequences in empirical models. We develop scaling relations to quantify the effect of this coarse-graining on properties of fitness landscapes and evolutionary paths. We first consider evolution on a simple Mount Fuji fitness landscape, focusing on how the length and predictability of evolutionary paths scale with the coarse-grained sequence length and alphabet. We obtain simple scaling relations for both the weak- and strong-selection limits, with a non-trivial crossover regime at intermediate selection strengths. We apply these results to evolution on a biophysical fitness landscape that describes how proteins evolve new binding interactions while maintaining their folding stability. We combine the scaling relations with numerical calculations for coarse-grained protein sequences to obtain quantitative properties of the model for realistic binding interfaces and a full amino acid alphabet.

## 1. Introduction

The enormous size and dimensionality of genotypic sequence space are among the most salient features of molecular evolution. These features not only present technical challenges for experiments and computation, but raise major conceptual questions as well: how can populations efficiently find high-fitness states in such a large space? John Maynard Smith famously tackled this issue [1], arguing that positive selection acting on individual mutations is key to efficiently evolving functional protein sequences. However, this argument depends crucially on the structure of the fitness landscape and the underlying evolutionary dynamics. One expects a large population to ascend a steep and perfectly-smooth landscape quickly, while substantial landscape ruggedness or genetic drift will slow down adaptation.

The effect of ruggedness (due to epistatic interactions among genetic loci) on evolutionary paths has been a major focus of previous work. These studies have investigated both simple models of fitness landscapes — especially the uncorrelated random landscape [2–5] (also known as the “House of Cards” [6]) and the rough Mount Fuji model [5, 7, 8] — as well as landscapes empirically measured in specific organisms [9, 10]. Populations in these studies are generally assumed to be under strong selection, so that evolutionary paths proceed strictly upward in fitness; a major goal is to determine the number and length of the accessible paths for different landscape topographies. More recent work has begun to consider the effect of population dynamics (e.g., clonal interference) on evolutionary predictability [11], a topic of central importance in evolutionary biology [12, 13].

In most cases the computational and experimental cost of analyzing empirical models has required simplified sequence spaces, especially binary sequences (indicating only the presence or absence of a mutation at each site) [3,5,8,9], genomes or proteins with reduced lengths [14–16], and reduced alphabets of amino acids [16, 17] or protein structural components [18]. However, it is not clear how properties of landscapes and evolutionary paths change under these implicit coarse-graining schemes. Understanding their scaling behavior is essential for extending these models to more realistic biological systems. Specifically, we must determine how properties of a model scale with both the coarse-grained sequence length *L* and the coarse-grained alphabet size *k* (number of possible alleles at each site), the latter being important when multiple mutations at a single site are likely.

We first carry out this approach in a simple model of monomorphic populations undergoing substitutions on a smooth Mount Fuji landscape, showing how the scaling properties of the model depend crucially on the strength of selection relative to genetic drift. We then consider evolution on a fitness landscape based on the biophysics of protein folding and binding, describing how proteins evolve new binding interactions while maintaining folding stability [17]. Using scaling relations, we are able to extend numerical calculations carried out for coarse-grained representations of proteins, obtaining quantitative evolutionary properties for realistic binding interface sizes and a full amino acid alphabet.

## 2. Evolutionary paths on a smooth Mount Fuji landscape

We first consider a simple fitness landscape model, the smooth “Mount Fuji” (i.e., single-peaked) landscape [19]. Consider genotypic sequences of length *L* with *k* possible alleles {A_{1}, A_{2}, …, A_{k}} at each site, resulting in *n*_{seq} = *k*^{L} possible genotypes. We assume the alleles {A_{1}, A_{2}, …, A_{k}} are in increasing order of fitness rank. The sites could be residues in a protein, nucleotides in a DNA sequence, or larger genomic loci such as whole genes. In general we will interpret the sequences in the model as coarse-grained versions of actual biological sequences. For example, a 12-residue binding interface on a protein with 20 possible amino acids at each site could be coarse-grained into *L* = 6 pairs of sites with *k* = 5 alleles at each site, where each allele represents a class of amino acids grouped by physico-chemical properties (e.g., negative, positive, polar, hydrophobic, and other). This is analogous to block spin renormalization in Ising models [20].

Let the occupation number *n*_{j}(*σ*) of a sequence *σ* be the number of A_{j} alleles in the sequence, so that $\sum_{j=1}^{k} n_j(\sigma) = L$. We define the fitness of a sequence *σ* to be

$$\mathcal{F}(\sigma) = f^{\sum_{j=1}^{k} (j-1)\, n_j(\sigma)}, \tag{1}$$

where *f* ≥ 1 is the minimum multiplicative fitness change from a single mutation: a mutation A_{i} → A_{j} at a single site changes fitness by a factor of *f*^{j−i}. If *f* = 1, the fitness landscape is flat and evolution is neutral, while if *f* > 1, the landscape has a minimum at *σ* = A_{1}A_{1} … A_{1} (*𝓕* = 1) and a maximum at *σ* = A_{k}A_{k} … A_{k} (*𝓕* = *f*^{L(k−1)}). The model is non-epistatic since the fitness function factorizes over sites; thus each mutation has the same fitness effect regardless of the genetic background on which it occurs. A more general Mount Fuji model could allow mutations at different sites and between different alleles to have different fitness effects, although this will not affect the scaling properties of the model that are of primary interest here.
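As a minimal sketch of this definition, the fitness can be computed directly from the occupation numbers. The function name `mount_fuji_fitness` and the encoding of allele A_{j} as the integer *j* are illustrative assumptions, not part of the original model description.

```python
from collections import Counter

def mount_fuji_fitness(seq, f):
    """Fitness of a sequence given as allele indices 1..k (A_j -> j).

    Evaluates f ** sum_j (j - 1) * n_j, where n_j is the number of
    sites carrying allele A_j (the occupation number).
    """
    occupation = Counter(seq)  # n_j for each allele index j present
    exponent = sum((j - 1) * n_j for j, n_j in occupation.items())
    return f ** exponent

# Illustrative parameter values (assumed, not taken from the paper).
L, k, f = 4, 3, 1.5
print(mount_fuji_fitness((1,) * L, f))  # least fit genotype
print(mount_fuji_fitness((k,) * L, f))  # most fit genotype, f ** (L * (k - 1))
```

Because the exponent is a sum over sites, changing one site from A_{i} to A_{j} multiplies the result by *f*^{j−i} whatever the other sites carry, which is the non-epistatic property described above.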

We assume that the population is monomorphic: all organisms have the same genotype at any given time. This approximation holds when *u* ≪ (*LN* log *N*)^{−1}, where *u* is the per-site probability of mutation per generation and *N* is the population size [21]. In this regime the population evolves through a series of substitutions, in which single mutants arise and fix one at a time. A substitution from genotype *σ* to *σ*′ occurs at the rate [22]

$$W(\sigma'|\sigma) = N u\, \phi(s), \tag{2}$$

where *ϕ*(*s*) is the fixation probability of a single mutant with selection coefficient *s* = *𝓕*(*σ*′)/*𝓕*(*σ*) − 1. We use the diffusion approximation to the Wright-Fisher model for the fixation probability [23]:

$$\phi(s) = \frac{1 - e^{-2s}}{1 - e^{-2Ns}}. \tag{3}$$

Note that when *N*|*s*| ≫ 1 this can be approximated by

$$\phi(s) \approx \begin{cases} 1 - e^{-2s} & \text{for } s > 0, \\ 0 & \text{for } s < 0. \end{cases} \tag{4}$$

That is, when selection is much stronger than genetic drift, deleterious mutations never fix, while beneficial mutations fix with a probability commensurate with their selective advantage. This is often referred to as the “strong-selection weak-mutation” (SSWM) limit [24].
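The following sketch evaluates the diffusion-approximation fixation probability, (1 − e^{−2s})/(1 − e^{−2Ns}), together with its strong-selection form. The function names and the overflow guard are assumptions made for illustration; the neutral case *s* = 0 is handled by its limiting value 1/*N*.

```python
import math

def fixation_probability(s, N):
    """Diffusion-approximation fixation probability of a single mutant."""
    if abs(s) < 1e-12:
        return 1.0 / N               # limit of the formula as s -> 0
    if -2.0 * N * s > 700.0:
        return 0.0                   # strongly deleterious; avoids float overflow
    # expm1(x) = exp(x) - 1 keeps precision when |s| is small
    return math.expm1(-2.0 * s) / math.expm1(-2.0 * N * s)

def fixation_probability_sswm(s):
    """Strong-selection approximation, intended for N|s| >> 1."""
    return 1.0 - math.exp(-2.0 * s) if s > 0 else 0.0

def substitution_rate(fit_old, fit_new, N, u):
    """Rate N * u * phi(s) with s = fit_new / fit_old - 1."""
    s = fit_new / fit_old - 1.0
    return N * u * fixation_probability(s, N)

# Compare the two forms for an assumed population size N = 10**4.
for s in (-0.05, 0.0, 0.05):
    print(s, fixation_probability(s, 10**4), fixation_probability_sswm(s))
```

For *s* = 0 the two forms differ (1/*N* versus 0), which is a reminder that the strong-selection form is only meant for *N*|*s*| ≫ 1.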

### 2.1 The ensemble of evolutionary paths

For concreteness we consider the following evolutionary process: the population begins at the least fit genotype, A_{1}A_{1} … A_{1}, and evolves according to (2) until it reaches the most fit genotype, A_{k}A_{k} … A_{k}, for the first time. Define an evolutionary path *φ* as the ordered sequence of genotypes *φ* = (*σ*_{0}, *σ*_{1}, …, *σ*_{ℓ}) traversed by the population during this process, where *σ*_{0} = A_{1}A_{1} … A_{1} and *σ*_{ℓ} = A_{k}A_{k} … A_{k}. The probability of making a single substitution *σ* → *σ*′, given that a substitution out of *σ* occurs, is

$$Q(\sigma'|\sigma) = W(\sigma'|\sigma)\, \theta(\sigma), \tag{5}$$

where $\theta(\sigma) = \left(\sum_{\sigma'} W(\sigma'|\sigma)\right)^{-1}$ is the mean waiting time in *σ* before a substitution occurs. Thus the probability of taking a path *φ* is

$$\Pi[\varphi] = \prod_{i=0}^{\ell-1} Q(\sigma_{i+1}|\sigma_{i}). \tag{6}$$

Since the population is guaranteed to reach the final state eventually, $\sum_{\varphi} \Pi[\varphi] = 1$, where the sum is over all first-passage paths *φ* between the initial and final states.

We are interested in statistical properties of this evolutionary path ensemble. We can calculate many such properties using an exact numerical algorithm described in Appendix A [25, 26]. Here we are especially interested in the distribution of path lengths *ℓ*, i.e., the number of substitutions experienced by the population before it first reaches the fitness maximum. The path length distribution *ρ*(*ℓ*) is defined as

$$\rho(\ell) = \sum_{\varphi} \delta_{\ell,\,\mathcal{L}[\varphi]}\, \Pi[\varphi], \tag{7}$$

where Π[*φ*] is the probability of path *φ*, *𝓛*[*φ*] is the length of path *φ*, and *δ* is the Kronecker delta. We can similarly express the mean $\bar{\ell}$ and variance *ℓ*_{var} of path length. We also consider the path entropy *S*_{path}, defined as

$$S_{\text{path}} = -\sum_{\varphi} \Pi[\varphi] \log \Pi[\varphi]. \tag{8}$$

This quantity measures the predictability of evolution in sequence space: if only a single path is accessible, then *S*_{path} = 0, and evolution is perfectly predictable. Larger values of *S*_{path}, on the other hand, indicate a more diverse ensemble of accessible pathways, and thus less predictable evolution.
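For very small sequence spaces, the mean path length and path entropy can be checked by brute force. The sketch below is not the algorithm of Appendix A; it treats the substitution process as an absorbing Markov chain, solves for the expected number of visits to each genotype, and uses the fact that the path entropy equals the expected sum of next-jump entropies along a path. All function names are hypothetical, alleles are labelled 0..*k*−1, and memory grows as *k*^{2L}.

```python
import itertools
import math
import numpy as np

def fixation(s, N):
    """Diffusion-approximation fixation probability of a single mutant."""
    if abs(s) < 1e-12:
        return 1.0 / N                 # neutral limit
    if -2.0 * N * s > 700.0:
        return 0.0                     # strongly deleterious; avoids overflow
    return math.expm1(-2.0 * s) / math.expm1(-2.0 * N * s)

def path_statistics(L, k, f, N):
    """Return (mean path length, path entropy) for first-passage paths from
    the least fit to the most fit genotype on the Mount Fuji landscape."""
    states = list(itertools.product(range(k), repeat=L))
    index = {seq: i for i, seq in enumerate(states)}
    n = len(states)
    Q = np.zeros((n, n))
    for seq, i in index.items():
        for site, old in enumerate(seq):
            for new in range(k):
                if new != old:
                    mutant = seq[:site] + (new,) + seq[site + 1:]
                    s = f ** (new - old) - 1.0
                    Q[i, index[mutant]] = fixation(s, N)  # N*u cancels below
    start, target = index[(0,) * L], index[(k - 1,) * L]
    transient = [i for i in range(n) if i != target]
    # Normalise rates into jump probabilities for every non-final genotype.
    Q[transient] /= Q[transient].sum(axis=1, keepdims=True)
    # Expected number of visits to each genotype before first reaching the target.
    Qtt = Q[np.ix_(transient, transient)]
    e_start = np.zeros(len(transient))
    e_start[transient.index(start)] = 1.0
    visits = np.linalg.solve((np.eye(len(transient)) - Qtt).T, e_start)
    # Entropy of the next-jump distribution out of each genotype (0 log 0 = 0).
    logQ = np.log(Q, where=Q > 0, out=np.zeros_like(Q))
    h = -(Q * logQ).sum(axis=1)
    return float(visits.sum()), float(visits @ h[transient])

mean_len, s_path = path_statistics(L=2, k=2, f=1.0, N=100)
print(mean_len, s_path)
```

The example call uses neutral evolution (*f* = 1) on the smallest nontrivial space, where the four genotypes form a square and the result can be checked by hand.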

### 2.2 Neutral limit

We first consider properties of the evolutionary path ensemble in the case of neutral evolution (*f* = 1 in (1)). For simple random walks on finite discrete spaces, previous work has shown that the mean first passage path length scales with the total number of states [27, 28], while the distribution of path lengths will be approximately exponential [27]. Thus for neutral evolution, the mean $\bar{\ell}$ and variance *ℓ*_{var} of path length scale as

$$\bar{\ell} \sim k^{L}, \qquad \ell_{\text{var}} \sim k^{2L}. \tag{9}$$

Conceptually, this means the population on average must explore the entire sequence space before reaching a particular point for the first time, and thus the average number of substitutions grows exponentially with the length of the sequence. Moreover, since the standard deviation is of the same order as the mean, paths much longer than the mean are likely.

Let *γ* be the average connectivity, defined as the average number of single-mutant substitutions accessible from each sequence; in neutral evolution all single-mutant substitutions are accessible, so *γ* = *L*(*k* − 1). Since all substitutions are equally likely, *Q*(*σ′*∣*σ*) = *γ*^{−1} for *σ* and *σ′* separated by a single mutation. Every path of length *ℓ* therefore has probability *γ*^{−ℓ}, and the entropy of the neutral path ensemble is [26]

$$S_{\text{path}} = \bar{\ell} \log \gamma \sim k^{L} \log\left(L(k-1)\right), \tag{10}$$

where $\bar{\ell}$ is the mean path length. The path entropy consists of two distinct components: the average path length and the average connectivity. The factor of log *γ* is the average entropy contribution from each jump in the path. It is worth noting that mean path length (and the distribution of path lengths in general) does *not* have explicit dependence on connectivity: it only depends on the size of the space. So it is the enormous size, not the connectivity, of sequence space that causes neutral evolution to require so many steps to reach a particular point. In contrast, path entropy, and thus evolutionary predictability, depends on *both* the size and connectivity of sequence space.

### 2.3 Strong-selection limit

We now consider evolutionary paths in the strong-selection limit. Here all beneficial mutations are selected so strongly (*f* ≫ 1 in (1)) that their fixation probabilities are all approximately 1, while deleterious mutations never fix. Thus evolutionary paths proceed strictly upward on the fitness landscape. This is sometimes called the “adaptive walk” [2] or “random adaptation” [29] scenario; it is identical to zero-temperature Metropolis Monte Carlo dynamics with energy replaced by negative fitness [3]. Since the fitness landscape is non-epistatic and reverse mutations are impossible, each site can be considered to evolve independently. In particular, we can decompose the total path length into a sum of path lengths for individual sites, so that the path length cumulants for the entire sequence are simply sums of the cumulants for individual sites. (Note that the restriction to first-passage paths effectively couples all the sites because they must all reach their final states simultaneously, and so site independence is only valid when reverse mutations are prohibited.)

In Appendix B we show that the mean number of substitutions for a single site in this limit is $H_{k-1} = \sum_{j=1}^{k-1} 1/j$ (the (*k* − 1)th harmonic number), consistent with previous results [29, 30]. Hence the mean length for *L* sites is *LH*_{k−1}, and since *H*_{k−1} = log *k* + *b* + *𝓞*(*k*^{−1}), the mean length scales as

$$\bar{\ell} \sim L(\log k + b). \tag{11}$$

We explicitly include the *𝓞*(1) constant *b* here since it may be comparable to log *k* if *k* is not too large. For the harmonic numbers, *b* is equal to the Euler-Mascheroni constant *γ*_{EM} ≈ 0.5772, but we use generic notation here as this same scaling form will be fit to an empirical model in the next section. Equation (11) implies that $\bar{\ell}$ scales approximately logarithmically with the size *n*_{seq} of sequence space, compared with the linear scaling seen in the neutral case (9). Moreover, Appendix C shows that *ρ*(*ℓ*) is approximately Poisson, and thus the variance *ℓ*_{var} should obey the same scaling as $\bar{\ell}$.
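The harmonic-number result can be reproduced with a short exact recursion. This is a sketch under the stated strong-selection assumption that every beneficial mutation is equally likely to be the next to fix, so from allele A_{i} the next allele is uniform over A_{i+1}, …, A_{k}. The function names are illustrative.

```python
from fractions import Fraction
import math

def mean_steps_single_site(k):
    """Exact mean number of substitutions for one site to go from A_1 to A_k,
    assuming the next allele is uniform over all fitter alleles."""
    t = {k: Fraction(0)}                       # already at the fittest allele
    for i in range(k - 1, 0, -1):
        t[i] = 1 + sum(t[j] for j in range(i + 1, k + 1)) / (k - i)
    return t[1]

def harmonic(n):
    """n-th harmonic number as an exact fraction."""
    return sum(Fraction(1, j) for j in range(1, n + 1))

k, L = 20, 12                                  # full alphabet, 12-site interface
exact = L * float(mean_steps_single_site(k))   # L * H_{k-1}
approx = L * (math.log(k) + 0.5772)            # L * (log k + b), b = Euler-Mascheroni
print(exact, approx)
```

The two printed values differ only by the *𝓞*(*k*^{−1}) correction per site, which illustrates why the constant *b* is worth keeping when *k* is modest.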

The average connectivity of sequence space is reduced compared to the neutral case, since only beneficial substitutions are allowed. The connectivity averaged over all sequences is *L*(*k* − 1)/2 (Appendix D); the reduction by a factor of 2 is intuitively explained by the fact that every allowed beneficial substitution has a corresponding prohibited deleterious substitution in the reverse direction. For the path entropy under strong selection, we take as an ansatz the same dependence on $\bar{\ell}$ and *γ* as in (10), albeit with different *L*, *k* scaling:

$$S_{\text{path}} \sim \bar{\ell} \log \gamma \sim L(\log k + b)\, \log\left(\frac{L(k-1)}{2}\right). \tag{12}$$

We numerically verify this ansatz in the next section (figure 1).
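The average connectivity *L*(*k* − 1)/2 can be confirmed by direct enumeration for small *L* and *k*. In this sketch (names are illustrative) a site carrying allele A_{a} has *k* − *a* fitter alleles available, and the count is averaged over all *k*^{L} sequences.

```python
import itertools

def mean_beneficial_connectivity(L, k):
    """Average number of beneficial single-site substitutions per sequence."""
    total = 0
    for seq in itertools.product(range(1, k + 1), repeat=L):
        total += sum(k - a for a in seq)   # fitter alleles available at each site
    return total / k ** L

for L, k in [(2, 2), (3, 4), (4, 5)]:
    print(L, k, mean_beneficial_connectivity(L, k), L * (k - 1) / 2)
```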

### 2.4 Coarse-graining and landscape-dependence of scaling relations

The path scaling relations depend qualitatively on whether the fitness landscape is flat (neutral evolution) or very steep (strong selection). How does the transition between these two limits occur at intermediate selection strengths, where selection and stochastic fluctuations (genetic drift) compete more equally? We now implement a concrete scheme for coarse-graining sequence space on the fitness landscape of (1). Let *s* be the total selection coefficient between the minimum and maximum fitness points on the landscape; this corresponds to the actual selection strength between two distinct biological genotypes. For example, the minimum and maximum fitnesses might correspond to wild-type and antibiotic-resistant genotypes in bacteria [9, 31], or to two functionally distinct protein sequences [17]. As we coarse-grain the sequence space into smaller *L* and *k*, we must therefore hold fixed this true overall selection strength. Since *s* = *f*^{L(k−1)} − 1 in the Mount Fuji model (1), we renormalize the minimum fitness benefit *f* accordingly:

$$f = (1 + s)^{1/(L(k-1))}. \tag{13}$$

Thus the fitness benefit of each individual mutation increases as we coarse-grain the sequence space (decrease *L* and *k*), since each mutation in the model corresponds to several mutations on the true biological sequences.
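A small numerical sketch of this renormalization, obtained by inverting *s* = *f*^{L(k−1)} − 1 for *f*. The function name and the value *s* = 0.1 are assumptions for illustration; the two (*L*, *k*) pairs are the 12-residue, 20-letter interface and its coarse-graining into 6 sites with 5 classes mentioned earlier.

```python
def renormalized_f(s_total, L, k):
    """Per-mutation fitness factor f that keeps the total selection
    coefficient between the landscape minimum and maximum fixed."""
    return (1.0 + s_total) ** (1.0 / (L * (k - 1)))

s_total = 0.1                                   # illustrative value
f_fine = renormalized_f(s_total, L=12, k=20)    # fine-grained sequence space
f_coarse = renormalized_f(s_total, L=6, k=5)    # coarse-grained sequence space
print(f_fine, f_coarse)

# Both choices reproduce the same overall selection strength.
print(f_fine ** (12 * 19) - 1.0, f_coarse ** (6 * 4) - 1.0)
```

The coarse-grained *f* is the larger of the two, since each coarse-grained mutation stands in for several mutations on the underlying sequence.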

We consider a range of total *s* values and numerically calculate path statistics for each *L* and *k* using the method of Appendix A. In figure 1 we show the scaling of $\bar{\ell}$, *ℓ*_{var}, and *S*_{path} calculated in this manner for several values of relative selection strength *Ns*. For *Ns* = 0, we not only confirm the neutral scaling relations (9) but also observe that any proportionality factors and additive constants are so negligible that the scaling relations are actually approximate equalities (figure 1a). The predicted relation for the path entropy (10) also holds exactly. Moreover, weak selection appears to preserve these scaling relations: they still hold even at *Ns* = 0.1 (figure 1b). When selection becomes comparable to genetic drift (*Ns* = 1, figure 1c), the neutral scaling relations still hold qualitatively, although the slopes of $\bar{\ell}$ ∼ *k*^{L} and *ℓ*_{var} ∼ *k*^{2L} are no longer close to 1, indicating different proportionality factors.

At the other extreme (*Ns* = ∞, figure 1f), the strong-selection scaling relations (11) for path length hold as expected. We also verify that *S*_{path} ∼ $\bar{\ell}$ log *γ* even for strong selection, albeit with a proportionality factor less than 1. This scaling persists at finite but large selection strengths of *Ns* = 100 (figure 1e). At intermediate selection strengths (*Ns* = 10, figure 1d), however, neither set of scaling relations for $\bar{\ell}$ and *ℓ*_{var} holds, indicating that they are no longer simple functions of sequence space size *k*^{L}.

## 3. Evolutionary paths in a biophysical model of protein adaptation

Simple model landscapes defined in genotype space, such as (1), have produced many theoretical results and guided analysis of some data [2–5, 8, 10]. However, their purely phenomenological nature allows for little interpretation of their parameters and includes no basis in the underlying molecular processes — interactions among proteins, DNA, RNA, and other biomolecules — that govern cells. Thus a promising alternative is to develop models of fitness that explicitly account for these molecular properties [14, 16, 31–33]. We now consider the scaling properties of evolutionary paths in such a model based on the biophysics of protein folding and binding [17, 25, 26].

### 3.1 Protein energetics and coarse-graining

Consider a protein with two-state folding kinetics [34]. In the folded state, the protein has an interface that binds a target molecule. Because the protein can bind *only* when it is folded, the protein has three possible structural states: folded and bound, folded and unbound, and unfolded and unbound. Let the free energy of folding be *E*_{f} (sometimes known as Δ*G*), so that an intrinsically-stable protein has *E*_{f} < 0. Let the free energy of binding, relative to the chemical potential of the target molecule, be *E*_{b}, so that *E*_{b} < 0 indicates a favorable binding interaction. Note that *E*_{b} becomes more favorable as the chemical potential of the target molecule is increased.

The folding and binding energies depend on the protein’s genotype (amino acid sequence) *σ*. We assume adaptation only affects the *L* residues at the binding interface, which, to a first approximation, make additive contributions to the total folding and binding free energies [35]:

$$E_{\text{f}}(\sigma) = E_{\text{f}}^{0} + \sum_{i=1}^{L} \epsilon_{\text{f}}(i, \sigma^{i}), \qquad E_{\text{b}}(\sigma) = E_{\text{b}}^{0} + \sum_{i=1}^{L} \epsilon_{\text{b}}(i, \sigma^{i}), \tag{14}$$

where *ϵ*_{f}(*i*, *σ*^{i}) and *ϵ*_{b}(*i*, *σ*^{i}) are entries of energy matrices that capture the energetic contributions of amino acid *σ*^{i} at position *i*. Folding and binding energetics are probed experimentally and computationally by measuring the changes (often denoted by ΔΔ*G*) in *E*_{f} or *E*_{b} resulting from single-point mutations [36–38]. These studies generally indicate that each position makes an energetic contribution of order 1 kcal/mol to the total energy. As a simple approximation, we sample each energy contribution *ϵ*_{f,b}(*i*, *σ*^{i}) independently from a Gaussian distribution with zero mean and standard deviation 1 kcal/mol. The offsets $E_{\text{f}}^{0}$ and $E_{\text{b}}^{0}$ therefore correspond to the average folding and binding energies of the protein with a random sequence at the binding interface; $E_{\text{f}}^{0}$ includes the folding stability contribution from all residues in the protein away from the binding interface. As long as it produces a physically realistic range of total energies, the exact shape of the distributions for *ϵ*_{f,b}(*i*, *σ*^{i}) is unimportant for large enough *L* due to the central limit theorem.
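A minimal sketch of the additive energy model: each energy matrix is an *L* × *k* array of independent Gaussian entries, and the total energy of a sequence is an offset plus one entry per position. The function names, the random seed, and the offset values (−5 and −3 kcal/mol) are purely illustrative assumptions.

```python
import numpy as np

def sample_energy_matrix(L, k, sd=1.0, rng=None):
    """L x k matrix of site energies drawn from a zero-mean Gaussian (kcal/mol)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(0.0, sd, size=(L, k))

def total_energy(seq, eps, offset):
    """offset + sum_i eps[i, seq[i]] for a sequence of allele indices 0..k-1."""
    positions = np.arange(len(seq))
    return offset + eps[positions, seq].sum()

rng = np.random.default_rng(0)                 # fixed seed for reproducibility
L, k = 6, 5
eps_f = sample_energy_matrix(L, k, rng=rng)    # folding contributions
eps_b = sample_energy_matrix(L, k, rng=rng)    # binding contributions
seq = rng.integers(0, k, size=L)               # a random interface sequence
E_f = total_energy(seq, eps_f, offset=-5.0)    # hypothetical folding offset
E_b = total_energy(seq, eps_b, offset=-3.0)    # hypothetical binding offset
print(seq, E_f, E_b)
```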

Numerical calculations over all *k*^{L} sequences are not possible for large *L* and a full amino acid alphabet (*k* = 20). However, we can consider coarse-grained versions of the model by grouping positions and amino acids into classes, resulting in some effective sequence parameters *L*_{eff} and *k*_{eff} that are smaller than their physical counterparts *L*_{phys} and *k*_{phys} = 20. If we then determine how properties of the model scale with *L*_{eff} and *k*_{eff} under such a coarse-graining procedure, we can extrapolate these properties to the physical values *L*_{phys} and *k*_{phys}.

As we vary *L*_{eff} and *k*_{eff}, we must renormalize the distributions of energetic contributions *ϵ*_{f,b}(*i*, *σ*^{i}) for the effective sequences such that the distribution of total sequence energies remains constant, similar to our coarse-graining scheme in the previous section. Since the total sequence energies are sums of Gaussian contributions from each site (14), coarse-graining the sites amounts to sampling the effective *ϵ*_{f,b}(*i*, *σ*^{i}) values from a Gaussian distribution with standard deviation rescaled by a factor of $\sqrt{L_{\text{phys}}/L_{\text{eff}}}$. For example, if *L*_{phys} = 12 and *L*_{eff} = 6 (grouping positions into pairs), then each effective *ϵ*_{f,b}(*i*, *σ*^{i}) is the sum of two physical *ϵ*_{f,b}(*i*, *σ*^{i}) values, and hence the effective *ϵ*_{f,b}(*i*, *σ*^{i}) should have zero mean and standard deviation $\sqrt{2}$ kcal/mol. Note that we analytically continue this rescaling to consider values of *L*_{eff} and *k*_{eff} that do not evenly divide *L*_{phys} and *k*_{phys}. For simplicity we will drop the “eff” labels and hereafter interpret *L*, *k*, *ϵ*_{f}(*i*, *σ*^{i}), and *ϵ*_{b}(*i*, *σ*^{i}) as these effective, coarse-grained parameters unless indicated otherwise.
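The rescaling of the site-energy standard deviation by the square root of *L*_{phys}/*L*_{eff} can be checked with a few lines of arithmetic. The function names are hypothetical; the check relies only on the fact that a sum of *L* independent Gaussian terms has standard deviation √*L* times the per-site value.

```python
import math

def effective_sd(L_phys, L_eff, sd_phys=1.0):
    """Per-site standard deviation (kcal/mol) after coarse-graining positions."""
    return sd_phys * math.sqrt(L_phys / L_eff)

def total_energy_sd(L, sd_site):
    """Standard deviation of a sum of L independent Gaussian site energies."""
    return math.sqrt(L) * sd_site

L_phys = 12
for L_eff in (12, 6, 5, 3):                    # 5 does not divide 12 evenly
    sd = effective_sd(L_phys, L_eff)
    print(L_eff, sd, total_energy_sd(L_eff, sd))
```

The last column is the same for every *L*_{eff}, including values that do not divide *L*_{phys}, which is the sense in which the spread of total energies is held fixed.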

### 3.2 Evolutionary model

Without loss of generality, we assume the protein contributes fitness 1 to the organism when it is both folded and bound. Let *f*_{ub}, *f*_{uf} ∈ [0, 1] be the multiplicative fitness penalties for being unbound and unfolded, respectively: the fitness is *f*_{ub} if the protein is unbound but folded, and *f*_{ub} *f*_{uf} if the protein is both unbound and unfolded. Then the fitness of the protein averaged over all three possible structural states is given by [17]

$$\mathcal{F}(E_{\text{f}}, E_{\text{b}}) = \frac{e^{-\beta(E_{\text{f}} + E_{\text{b}})} + f_{\text{ub}}\, e^{-\beta E_{\text{f}}} + f_{\text{ub}} f_{\text{uf}}}{e^{-\beta(E_{\text{f}} + E_{\text{b}})} + e^{-\beta E_{\text{f}}} + 1}, \tag{15}$$

where *β* = 1.7 (kcal/mol)^{−1} is inverse room temperature and the structural states are assumed to be in thermodynamic equilibrium.
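The equilibrium average over the three structural states can be sketched as a Boltzmann-weighted mean of the state fitnesses. This sketch assumes the unfolded, unbound state is the zero of free energy, so that the folded-unbound state has energy *E*_{f} and the folded-bound state has energy *E*_{f} + *E*_{b}; the function name and the test energies are illustrative.

```python
import math

BETA = 1.7  # inverse room temperature in (kcal/mol)^-1

def protein_fitness(E_f, E_b, f_ub, f_uf, beta=BETA):
    """Fitness averaged over folded-bound, folded-unbound and unfolded states."""
    w_bound = math.exp(-beta * (E_f + E_b))    # folded and bound, fitness 1
    w_folded = math.exp(-beta * E_f)           # folded and unbound, fitness f_ub
    w_unfolded = 1.0                           # unfolded and unbound, fitness f_ub * f_uf
    numerator = w_bound + f_ub * w_folded + f_ub * f_uf * w_unfolded
    return numerator / (w_bound + w_folded + w_unfolded)

# Limiting behaviour for illustrative penalties f_ub = f_uf = 0.5.
print(protein_fitness(-10.0, -10.0, 0.5, 0.5))  # stable and strongly bound
print(protein_fitness(-10.0, 10.0, 0.5, 0.5))   # stable but unbound
print(protein_fitness(10.0, 10.0, 0.5, 0.5))    # unstable
```

The three calls approach 1, *f*_{ub}, and *f*_{ub}*f*_{uf} respectively, matching the fitness assigned to each pure structural state.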

We assume that the population begins as perfectly adapted to binding a target molecule characterized by energy matrix *ϵ*_{b1} with offset $E_{\text{b}1}^{0}$ (defining a fitness landscape *𝓕*_{1}). The population is then subjected to a selection pressure which favors binding a new target, with energy matrix *ϵ*_{b2} and offset $E_{\text{b}2}^{0}$ (fitness landscape *𝓕*_{2}). We assume that the binding energy matrices for the new and old targets are uncorrelated, although this assumption is not essential. The population evolves in the monomorphic limit with the SSWM dynamics in (2) and (4). Thus the evolutionary paths are first-passage paths leading from the genotype corresponding to the global maximum on *𝓕*_{1} to a local or global maximum on *𝓕*_{2}, with fitness increasing monotonically along each path.

### 3.3 Case 1: selection for binding strength

There are three qualitatively distinct cases of the fitness landscape in (15), depending on the values of the parameters *f*_{ub} and *f*_{uf} [17]. These cases correspond to different biological scenarios for the selection pressures on binding and folding. In the simplest scenario (“case 1”), proteins are selected for their binding function (*f*_{ub}<1), but misfolding carries no additional fitness penalty (e.g. due to toxicity of misfolded proteins) beyond loss of function (*f*_{uf} = 1). Thus we say there is direct selection for binding only. Three examples of adaptation in this regime are shown in figure 2; the main determinant of the qualitative nature of adaptation is the overall folding stability *E*_{f}.

Although the model is non-epistatic at the level of the energy traits (since (14) is additive), there is epistasis at the level of fitness (15) due to its nonlinear dependence on energy. Indeed, there is widespread magnitude epistasis, which occurs when the fitness effect of a mutation has different magnitude on different genetic backgrounds, although it is always beneficial or always deleterious. Sign epistasis, which occurs when a mutation can be beneficial on one background but deleterious on another, manifests itself as curvature in the fitness contours in energy space, as shown in figure 2. However, we see that the landscape is largely free of sign epistasis except near *E*_{f} = 0, where there is a higher probability of multiple local fitness maxima (figure 2b). Overall, this suggests that the scaling relations from the non-epistatic Mount Fuji model may provide a reasonable approximation for this model of protein adaptation; the approximately additive nature of protein traits as in (14) has led to applications of the Mount Fuji model to proteins previously [7, 31, 39, 40].

In figure 3 we show scaling properties of the genotypic fitness landscape for the three *E*_{f} regimes of the model for case 1 (corresponding to the examples in figure 2). The minimum path length *ℓ*_{min} is the Hamming distance between the initial and final states for adaptation; for a randomly-chosen initial sequence, *ℓ*_{min} = *L*(1 − 1/*k*) on average. Indeed, this relation accurately describes the stable proteins (figure 3a). For proteins that are already sufficiently stable, there is no selection pressure to improve stability further, so the global fitness maximum is almost always the best-binding sequence. Since the binding energetics for the old and new targets are uncorrelated, the initial and final states are uncorrelated as well, which explains the *ℓ*_{min} scaling. For marginally-stable and unstable proteins, *ℓ*_{min} still scales with *L*(1 − 1/*k*), but with a reduced slope. This is due to the fact that the initial and final states become correlated in these two cases. We can think of this effect as a reduction in the effective length *L*, since more beneficial mutations are already present in the initial state. We see similar behavior in the average connectivity *γ* and accessible size *n*_{seq} of sequence space (figure 3b,c). Note that a random initial state reduces the average connectivity of the accessible sequence space by an additional factor of 2, yielding *γ* = *L*(*k* − 1)/4 (see Appendix D).
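The average minimum path length *L*(1 − 1/*k*) for uncorrelated initial and final states is simply the mean Hamming distance between a uniformly random sequence and a fixed target. A brute-force sketch for small *L* and *k* (the function name is illustrative):

```python
import itertools

def mean_hamming_to_target(L, k):
    """Mean Hamming distance from a uniformly random sequence to a fixed target."""
    target = (0,) * L
    total = 0
    for seq in itertools.product(range(k), repeat=L):
        total += sum(a != b for a, b in zip(seq, target))
    return total / k ** L

for L, k in [(2, 2), (3, 4), (4, 5)]:
    print(L, k, mean_hamming_to_target(L, k), L * (1 - 1 / k))
```

Each site differs from the target with probability 1 − 1/*k*, so the same formula gives 12 × 0.95 = 11.4 for *L* = 12 and *k* = 20 without enumeration.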

Whereas stable and unstable proteins almost always have a single fitness maximum, marginally-stable proteins have a sizable probability of multiple maxima owing to greater sign epistasis (figure 2b). In a purely random, uncorrelated fitness landscape, the average number of local maxima is *m*=*k*^{L}/(*L*(*k −* 1) + 1) [2]. This has the form *n*_{seq}*/*(*γ* + 1): the number of maxima increases with the total size of the space and decreases with the connectivity. We empirically test this scaling for the average number of maxima for a marginally-stable protein, and we find good agreement (figure 3d). By fitting numerically-calculated values of *m* as a power law of *n*_{seq}/(*γ* + 1), we obtain an anomalous scaling exponent of ≈ 0.09; the fact this is less than 1 reflects the correlated nature of our fitness landscape. The fitted scaling relation allows us to accurately determine the average number of local maxima for binding interfaces and amino acid alphabets much larger than we can directly calculate. By also fitting *γ* as a linear function of *L*(*k* − 1)/4 (figure 3b) and *n*_{seq} as a power law of ((*k* + 1)/2)^{L} (Appendix D, figure 3c), we estimate the number of local maxima to be ≈ 11 for a marginally-stable protein with *L*_{phys} = 12 binding interface residues and an amino acid alphabet of size *k*_{phys} = 20. This number of maxima is much smaller than the total number of sequences (*k*^{L} ≈ 4 × 10^{15}) and the expected number of maxima on an uncorrelated random landscape of the same size (*k*^{L}/(*L*(*k* − 1) + 1) ≈ 1.8 × 10^{13}).
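The reference numbers quoted for *L* = 12 and *k* = 20 follow from simple arithmetic, sketched below. The function name is an assumption; the fitted scaling exponent and prefactor from figure 3d are not reproduced here, so this only evaluates the uncorrelated-landscape expectation and the two sequence-space sizes.

```python
def expected_maxima_uncorrelated(L, k):
    """Average number of local maxima on an uncorrelated random landscape."""
    return k ** L / (L * (k - 1) + 1)

L, k = 12, 20
total_sequences = k ** L                        # size of the full sequence space
accessible_scale = ((k + 1) / 2) ** L           # scaling variable used for n_seq
print(f"total sequences:       {total_sequences:.2e}")
print(f"uncorrelated maxima:   {expected_maxima_uncorrelated(L, k):.2e}")
print(f"((k + 1) / 2) ** L:    {accessible_scale:.2e}")
```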

In figure 4 we show the scaling of the path statistics $\bar{\ell}$, *ℓ*_{var}, and *S*_{path}. We find that the strong-selection scaling relations describe these cases of the protein model very well, despite the complexities of the energy and fitness model relative to the simple Mount Fuji case. The main discrepancy is in the path length variance, indicating that the distributions *ρ*(*ℓ*) are not as close to Poisson as in the Mount Fuji model. We expect this is mainly due to the small amount of epistasis present in the protein model. Nevertheless, the scaling is accurate enough to extend the model to larger binding interfaces and a full amino acid alphabet. For example, using the fitted coefficients *a* and *b* (figure 4a,b), we estimate $\bar{\ell}$ ≈ … and *ℓ*_{var} ≈ 9.6 for a marginally-stable protein with *L*_{phys} = 12 and *k*_{phys} = 20. Comparing these against the estimated *ℓ*_{min} ≈ 10 (fitted as a linear function of *L*(1 − 1/*k*); figure 3a), we see that many more substitutions than the minimum are likely.

### 3.4 Cases 2 and 3: selection for folding stability

The fitness landscape changes qualitatively when there are additional selection pressures against misfolding beyond loss of function [17], e.g., for proteins that form toxic aggregates when misfolded [41–43]. The first possibility is that the protein has a non-functional binding interaction (*f*_{ub} = 1) but is deleterious when misfolded (*f*_{uf} < 1; “case 2”). Here the relative binding strengths of the old and new targets lead to different patterns of adaptation. In figure 5a, we show an example of adaptation when both the old and new targets have potentially strong (but non-functional) binding affinity, while figure 5b shows an example when the old target has weak affinity while the new one has strong affinity. Figure 5c shows the case when the old target has strong affinity and the new target has little to no affinity.

Finally, the most general case is to have distinct selection pressures on both binding and folding (0 < *f*_{ub} < 1 and *f*_{uf} < 1; “case 3”). Adaptation in this scenario often resembles binding-only selection (figure 2), except when both binding and folding are of marginal strength (i.e., *E*_{f} ≃ 0 and *E*_{b} ≃ 0). In this case, the distribution of genotypes in energy space straddles a straight diagonal fitness contour, leading to a distinct pattern of evolutionary paths that gain extra folding stability first, only to lose it later as binding improves (figure 5d).

We show the scaling properties of the evolutionary paths for cases 2 and 3 in figure 6. In general, the predicted scaling relations are less accurate than for binding-only selection (case 1, figure 4). This is due to the increased sign epistasis in these regimes (note the significant curvature in the fitness contours in figure 5). Selection for both binding and folding (case 3) is particularly epistatic in the *E*_{f} ≃ 0, *E*_{b} ≃ 0 regime, leading to the largest deviations from the Mount Fuji scaling (figure 6). On the other hand, the degree of epistasis here is still far from the maximally-epistatic, uncorrelated random landscape [2, 6]; in that model we should have $\bar{\ell}$ ∼ log *L* [3], which is clearly not the case in our biophysical model.

## 4. Discussion

Developing models of fitness landscapes based on the physics of proteins and other biomolecules has emerged as a powerful approach for understanding molecular evolution [14, 16, 17, 30, 31]. However, the empirical nature of these models often makes explicit analytic treatments impossible, while the enormous size of sequence space often restricts numerical calculations or simulations to short sequences *L* or reduced alphabet sizes *k*. While analyses with small *L* and *k* may preserve qualitative properties of the models, quantitatively extending these results to more realistic parameter values is essential for comparison with experimental data. Here we have developed a scaling approach in which we empirically fit small *L* and *k* calculations to scaling relations in order to obtain precise quantitative properties of the model for arbitrarily large *L* and *k*. The scaling analysis moreover confirms that small *L* and *k* calculations largely preserve qualitative properties of the model expected for realistic sequence spaces. Although the scaling relations are derived for a much simpler, purely non-epistatic Mount Fuji model, they are surprisingly robust to the widespread magnitude epistasis and limited sign epistasis observed in the more realistic biophysical model of protein evolution.

We also gain important conceptual insights from the scaling analysis. In particular, we find that the neutral evolution scaling holds even when selection is present, provided that it is not too strong (*Ns* ≤ 1, figure 1a,b,c). This means that the average number of substitutions to a global fitness maximum, even in the presence of weak selection, grows exponentially with *L*. On the other hand, strong selection enables populations to find the global maximum much faster: the mean path length scales with the logarithm of sequence space size, and the distribution of path lengths is approximately Poisson rather than exponential. However, extremely strong selection (*Ns* ≈ 100, figure 1e) is required for this more efficient behavior to take over. Selection of this magnitude may be produced by sudden environmental changes, as in our model of protein adaptation [17]. When selection is of more moderate strength (*Ns* ≈ 10), path length statistics are not a simple function of sequence space size (figure 1d). We expect the more complex relation in this case to depend on the specific details of the landscape and evolutionary dynamics.

Moreover, these insights are valuable for other types of random walks on complex landscapes, e.g., spin models where *L* is the number of spins and *k* is the number of individual spin states. The scaling properties of first-passage paths have been well-studied for random walks in the absence of an energy or fitness landscape [28,44], but the effects of a landscape on scaling are less well-known. Although the substitution dynamics of (2) considered here are different from the typical dynamics used in spin models and other random walks (e.g., Metropolis Monte Carlo) [20], we expect our qualitative findings to remain valid. Thus we expect the pure random walk scaling (*T* = ∞) to hold even for temperatures down to the size of the largest energy differences on the landscape. There should be a non-trivial crossover regime at temperatures around the size of these landscape features, and then at small *T* the *T* = 0 scaling takes over. Investigating the nature of this crossover in both evolutionary and physical models is an important topic for future work.

## Appendix A. Numerical algorithm for statistics of the path ensemble

We calculate statistical properties of the evolutionary paths using an exact algorithm based on transfer matrices [25, 26]. Let *Q*(*σ*′|*σ*) be the jump probability defined by a rate matrix as in (5). For each number of substitutions ℓ and intermediate genotype *σ*, we calculate *P*_{ℓ}(*σ*), the total probability of all paths that end at *σ* in ℓ substitutions, as well as Γ_{ℓ}(*σ*), the total entropy of such paths. These quantities obey the following recursion relations:

$$
P_\ell(\sigma') = \sum_{\sigma} Q(\sigma'|\sigma)\, P_{\ell-1}(\sigma), \qquad
\Gamma_\ell(\sigma') = \sum_{\sigma} Q(\sigma'|\sigma) \left[ \Gamma_{\ell-1}(\sigma) - \log Q(\sigma'|\sigma)\, P_{\ell-1}(\sigma) \right],
$$
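where *P*_{0}(*σ*) = 1 if *σ* is the initial state and *P*_{0}(*σ*) = 0 otherwise, and Γ_{0}(*σ*) = 0 for all *σ*. Final states are treated as absorbing to ensure that only first-passage paths are counted. We use these transfer-matrix objects to calculate the path ensemble quantities described in the text:

$$
\rho(\ell) = \sum_{\sigma \in S_{\text{final}}} P_\ell(\sigma), \qquad
\bar{\ell} = \sum_{\ell=1}^{\Lambda} \ell\, \rho(\ell), \qquad
S_{\text{path}} = \sum_{\ell=1}^{\Lambda} \sum_{\sigma \in S_{\text{final}}} \Gamma_\ell(\sigma),
$$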
where *S*_{final} is the set of final states. The sums are calculated up to a path length cutoff Λ, which we choose large enough that the truncated sums have converged. The time complexity of the algorithm scales as *𝓞*(*γn*Λ) [25], where *γ* is the average connectivity and *n* is the total size of the state space.
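
As an illustration, the recursion can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation used for the calculations in the text: the function name `path_ensemble`, the dense matrix `Q[s][t]` (jump probability from state `s` to state `t`), and the three-state example are assumptions made for the example, and the entropy update assumes that the entropy of a set of paths is −Σ Π log Π over the path probabilities Π.

```python
import math

def path_ensemble(Q, start, final, cutoff):
    """Propagate P_l and Gamma_l for l = 1..cutoff on a small state space.

    Q[s][t] is the jump probability from state s to state t (hypothetical
    dense layout); `final` is a set of absorbing final states.
    """
    n = len(Q)
    P = [0.0] * n            # P_0: all probability on the initial state
    G = [0.0] * n            # Gamma_0 = 0 everywhere
    P[start] = 1.0
    rho = []                 # rho[l-1]: probability that first passage takes l jumps
    entropy = 0.0
    for _ in range(cutoff):
        P_new = [0.0] * n
        G_new = [0.0] * n
        for s in range(n):
            # Final states are absorbing: probability that arrived there
            # is recorded once and never propagated further.
            if s in final or P[s] == 0.0:
                continue
            for t in range(n):
                q = Q[s][t]
                if q > 0.0:
                    P_new[t] += q * P[s]
                    # Assumed entropy update: -sum(Pi * log Pi) over paths.
                    G_new[t] += q * (G[s] - math.log(q) * P[s])
        rho.append(sum(P_new[t] for t in final))
        entropy += sum(G_new[t] for t in final)
        P, G = P_new, G_new
    mean_length = sum((i + 1) * r for i, r in enumerate(rho))
    return rho, mean_length, entropy

# Toy example: one site with three ranked alleles in the strong-selection
# limit, where every fitter allele is reached with equal probability.
Q = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]
rho, mean_length, entropy = path_ensemble(Q, start=0, final={2}, cutoff=10)
print(rho[:3], mean_length, entropy)
```

In this toy chain there are only two first-passage paths, each with probability 1/2, so the mean path length is 3/2 and the path entropy is log 2.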

## Appendix B. Mean path length in the strong-selection limit

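Since sites can be considered independent in the strong-selection limit, we need only calculate the mean path length for a single site with *k* possible alleles. A path begins at A_{1}, and initially all *k* alleles are of equal or higher fitness and are therefore accessible. The first substitution can go to any A_{j} ∈ {A_{2}, A_{3}, …, A_{k}} with equal probability (*k* − 1)^{−1}, after which there are *k* − *j* + 1 remaining alleles. Thus the mean path length for *k* alleles, $\bar{\ell}_k$, must satisfy the recursion relation

$$
\bar{\ell}_k = 1 + \frac{1}{k-1} \sum_{j=2}^{k} \bar{\ell}_{k-j+1}, \tag{B.1}
$$

where $\bar{\ell}_1 = 0$. This is satisfied by

$$
\bar{\ell}_k = H_{k-1}, \tag{B.2}
$$

where *H*_{n} is the *n*th harmonic number defined by

$$
H_n = \sum_{i=1}^{n} \frac{1}{i}. \tag{B.3}
$$

To prove this, we first note that

$$
\sum_{j=1}^{n} H_j = \sum_{i=1}^{n} \frac{n-i+1}{i} = (n+1) H_n - n = (n+1)\left(H_{n+1} - 1\right), \tag{B.4}
$$

where we have used the property *H*_{n+1} = *H*_{n} + (*n* + 1)^{-1}. Now we substitute (B.2) on the right side of (B.1) and invoke (B.4), using $H_0 = 0$, to obtain

$$
1 + \frac{1}{k-1} \sum_{j=2}^{k} H_{k-j} = 1 + \frac{1}{k-1} \sum_{j=1}^{k-2} H_j = 1 + \frac{1}{k-1}\,(k-1)\left(H_{k-1} - 1\right) = H_{k-1}. \tag{B.5}
$$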

This proves (B.2) is the solution to the recursion relation.
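
The result can also be checked numerically. The sketch below iterates the single-site recursion described above (one substitution to a uniformly chosen fitter allele A_{j}, followed by a path among the remaining *k* − *j* + 1 alleles) in exact rational arithmetic and compares it with the harmonic number *H*_{k−1}. The function names and the range of *k* are illustrative choices.

```python
from fractions import Fraction

def mean_path_lengths(k_max):
    """Iterate the single-site recursion exactly for k = 1..k_max."""
    ell = {1: Fraction(0)}   # a single allele needs no substitutions
    for k in range(2, k_max + 1):
        # First jump lands on A_j (j = 2..k) with probability 1/(k - 1),
        # leaving a sub-problem with k - j + 1 alleles.
        tail = sum(ell[k - j + 1] for j in range(2, k + 1))
        ell[k] = 1 + Fraction(1, k - 1) * tail
    return ell

def harmonic(n):
    """n-th harmonic number as an exact fraction (H_0 = 0)."""
    return sum((Fraction(1, i) for i in range(1, n + 1)), Fraction(0))

ell = mean_path_lengths(20)
matches = all(ell[k] == harmonic(k - 1) for k in range(1, 21))
print(matches, ell[3], float(ell[20]))
```

Using `Fraction` avoids floating-point rounding, so the comparison tests exact equality rather than agreement within a tolerance.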

## Appendix C. Distribution of path lengths in the strong-selection limit

Here we address the whole path length distribution *ρ*(ℓ) for a single site in the strong-selection limit. With alleles ordered by fitness rank, a path of ℓ substitutions is of the form A_{1} → A_{j_1} → … → A_{j_{ℓ−1}} → A_{k}, where 1 < *j*_{1} < … < *j*_{ℓ−1} < *k*. Since all beneficial substitutions are equally likely in this limit, each jump probability out of allele A_{j} is (*k* − *j*)^{−1}. Therefore the probability of taking a path of length ℓ is

$$
\rho(\ell) = \frac{1}{k-1} \sum_{1 < j_1 < \cdots < j_{\ell-1} < k} \; \prod_{i=1}^{\ell-1} \frac{1}{k - j_i}.
$$

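The mean of this distribution is *H*_{k−1}, as shown in Appendix B. Here we obtain an approximate form for the whole distribution. Define *ϵ* = *k*^{−1} and *x*_{i} = *j*_{i}/*k*. For *k* ≫ 1 (*ϵ* ≪ 1) we can take the continuum limit of the exact expression to obtain

$$
\rho(\ell) \approx \epsilon \int_{2\epsilon}^{1-(\ell-1)\epsilon} \frac{dx_1}{1-x_1} \int_{x_1+\epsilon}^{1-(\ell-2)\epsilon} \frac{dx_2}{1-x_2} \cdots \int_{x_{\ell-2}+\epsilon}^{1-\epsilon} \frac{dx_{\ell-1}}{1-x_{\ell-1}}.
$$

By changing variables to *y*_{i} = *x*_{i} − (*i* + 1)*ϵ*, we rewrite this as

$$
\rho(\ell) \approx \epsilon \int_{0}^{1-(\ell+1)\epsilon} \frac{dy_1}{1-2\epsilon-y_1} \int_{y_1}^{1-(\ell+1)\epsilon} \frac{dy_2}{1-3\epsilon-y_2} \cdots \int_{y_{\ell-2}}^{1-(\ell+1)\epsilon} \frac{dy_{\ell-1}}{1-\ell\epsilon-y_{\ell-1}}.
$$

Each integral is dominated by its integrand’s value near the upper limit. However, because the domain of integration requires ordering of the *y*_{i} variables (0 < *y*_{1} < *y*_{2} < … < *y*_{ℓ−1} < 1 − (ℓ + 1)*ϵ*), the integrand for *y*_{ℓ−1} has the greatest support near its upper limit. Since the integrands are all similar near their lower limits, we thus approximate each integrand by the one for *y*_{ℓ−1}:

$$
\rho(\ell) \approx \epsilon \int_{0 < y_1 < \cdots < y_{\ell-1} < 1-(\ell+1)\epsilon} \; \prod_{i=1}^{\ell-1} \frac{dy_i}{1-\ell\epsilon-y_i}.
$$

This approximation allows us to use the identity

$$
\int_{0 < y_1 < \cdots < y_n < a} \; \prod_{i=1}^{n} f(y_i)\, dy_i = \frac{1}{n!} \left( \int_0^a f(y)\, dy \right)^{n}.
$$

Therefore,

$$
\rho(\ell) \approx \frac{\epsilon}{(\ell-1)!} \left( \int_0^{1-(\ell+1)\epsilon} \frac{dy}{1-\ell\epsilon-y} \right)^{\ell-1} = \frac{\epsilon}{(\ell-1)!} \left[ \log\left( \frac{1-\ell\epsilon}{\epsilon} \right) \right]^{\ell-1} = \frac{1}{k}\, \frac{\left[ \log(k-\ell) \right]^{\ell-1}}{(\ell-1)!}.
$$

In the limit of *k* ≫ 1 and ℓ/*k* ≪ 1,

$$
\rho(\ell) \approx \frac{(\log k)^{\ell-1}}{(\ell-1)!}\, e^{-\log k}.
$$

Thus *ρ*(ℓ) is approximately a Poisson distribution (in ℓ − 1) with mean and variance log *k*. This is consistent with the exact solution since *H*_{k−1} ≈ log *k* for large *k*.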
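
A short numerical sketch makes the comparison concrete. Assuming only the jump rule stated above (from A_{j}, each fitter allele is reached with probability 1/(*k* − *j*)), it computes the exact single-site path length distribution by propagating probability one substitution at a time. The function names and the choice *k* = 100 are illustrative, and the Poisson form is written for ℓ − 1 with mean log *k*, which is an assumption about how the approximation is parameterized.

```python
import math

def path_length_distribution(k):
    """Exact rho[l], l = 0..k-1, for one site with k fitness-ranked alleles."""
    prob = [0.0] * (k + 1)   # prob[j]: probability of sitting at A_j (index 0 unused)
    prob[1] = 1.0            # every path starts at the lowest-ranked allele
    rho = [0.0] * k
    for length in range(1, k):
        new = [0.0] * (k + 1)
        for j in range(1, k):
            if prob[j] > 0.0:
                share = prob[j] / (k - j)      # uniform over the k - j fitter alleles
                for m in range(j + 1, k + 1):
                    new[m] += share
        rho[length] = new[k]   # paths that reach A_k at exactly this step
        new[k] = 0.0           # A_k is absorbing: do not propagate further
        prob = new
    return rho

def poisson_form(length, k):
    """Assumed approximation: Poisson in (length - 1) with mean log k."""
    lam = math.log(k)
    return math.exp(-lam) * lam ** (length - 1) / math.factorial(length - 1)

k = 100
rho = path_length_distribution(k)
mean_exact = sum(l * r for l, r in enumerate(rho))
harmonic = sum(1.0 / i for i in range(1, k))    # H_{k-1}

for l in range(1, 11):
    print(l, round(rho[l], 4), round(poisson_form(l, k), 4))
print(mean_exact, harmonic, math.log(k))
```

The exact mean reproduces *H*_{k−1}, while the printed columns show how closely the Poisson form tracks the exact distribution at a finite *k*.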

## Appendix D. Size and connectivity of sequence space in the strong-selection limit

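Each sequence *σ* has $\sum_{i=1}^{L} (k - j_i)$ possible beneficial mutations in the Mount Fuji model (1), where A_{j_i} is the allele of *σ* at site *i*. Thus the connectivity averaged over all sequences is

$$
\frac{1}{k^L} \sum_{\sigma} \sum_{i=1}^{L} (k - j_i) = \frac{L}{k} \sum_{j=1}^{k} (k - j) = \frac{L(k-1)}{2}.
$$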

We can also determine the average connectivity of the accessible sequences starting from a random initial sequence. We first consider a single site. The initial allele A_{j} is chosen with probability 1/*k*, leaving *k* − *j* + 1 accessible alleles. Thus the average connectivity of this accessible space is

$$
\frac{1}{k} \sum_{j=1}^{k} \frac{1}{k-j+1} \sum_{m=j}^{k} (k - m) = \frac{1}{k} \sum_{j=1}^{k} \frac{k-j}{2} = \frac{k-1}{4}.
$$

Since multiple sites contribute additively to the connectivity, the total average connectivity of the accessible space is *L*(*k* − 1)/4.

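Starting from the sequence with minimum fitness, all *k*^{L} sequences are accessible in the strong-selection limit. More generally, if the population begins at sequence *σ* there are $\prod_{i=1}^{L} (k - j_i + 1)$ accessible sequences, including *σ* itself, where A_{j_i} is the allele of *σ* at site *i*. If the initial sequence is chosen at random, then the average number of accessible sequences is

$$
\frac{1}{k^L} \sum_{\sigma} \prod_{i=1}^{L} (k - j_i + 1) = \left( \frac{1}{k} \sum_{j=1}^{k} (k - j + 1) \right)^{L} = \left( \frac{k+1}{2} \right)^{L}.
$$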
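
These counting results can be checked by brute-force enumeration for small *L* and *k*. The sketch below assumes that each site carries a fitness rank from 1 to *k* and that a mutation is beneficial exactly when it raises the rank at one site, which is how the Mount Fuji model is used in this appendix. The function names and the values *L* = 3, *k* = 4 are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import prod

def n_beneficial(seq, k):
    """Number of beneficial single-site mutations of a sequence of ranks."""
    return sum(k - j for j in seq)

def accessible(seq, k):
    """All sequences reachable by rank-increasing substitutions (including seq)."""
    return product(*(range(j, k + 1) for j in seq))

def averages(L, k):
    seqs = list(product(range(1, k + 1), repeat=L))
    n = len(seqs)
    # Connectivity averaged over all sequences.
    conn_all = Fraction(sum(n_beneficial(s, k) for s in seqs), n)
    # Number of accessible sequences averaged over random starting points.
    n_access = Fraction(sum(prod(k - j + 1 for j in s) for s in seqs), n)
    # Connectivity averaged within each accessible space, then over starts.
    conn_access = Fraction(0)
    for s in seqs:
        acc = list(accessible(s, k))
        conn_access += Fraction(sum(n_beneficial(a, k) for a in acc), len(acc))
    conn_access /= n
    return conn_all, conn_access, n_access

L, k = 3, 4
conn_all, conn_access, n_access = averages(L, k)
print(conn_all, conn_access, n_access)
```

For *L* = 3 and *k* = 4 the enumeration gives 9/2, 9/4, and 125/8, in agreement with *L*(*k* − 1)/2, *L*(*k* − 1)/4, and ((*k* + 1)/2)^{L}.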

## Acknowledgments

AVM was supported by an Alfred P. Sloan Research Fellowship.

## Footnotes

E-mail: morozov{at}physics.rutgers.edu