Abstract
How DNA is mapped to functional proteins is a basic question of living matter. We introduce and study a physical model of protein evolution which suggests a mechanical basis for this map. Many proteins rely on large-scale motion to function. We therefore treat protein as learning amorphous matter that evolves towards such a mechanical function: Genes are binary sequences that encode the connectivity of the amino acid network that makes a protein. The gene is evolved until the network forms a shear band across the protein, which allows for long-range, soft modes required for protein function. The evolution reduces the high-dimensional sequence space to a low-dimensional space of mechanical modes, in accord with the observed dimensional reduction between genotype and phenotype of proteins. Spectral analysis of the space of $10^6$ solutions shows a strong correspondence between localization around the shear band of both mechanical modes and the sequence structure. Specifically, our model shows how mutations are correlated among amino acids whose interactions determine the functional mode.
PACS numbers: 87.14.E-, 87.15.-v, 87.10.-e
I. INTRODUCTION: PROTEINS AND THE QUESTION OF THE GENOTYPE-TO-PHENOTYPE MAP
DNA genes code for the three-dimensional configurations of amino acids that make functional proteins. This sequence-to-function map is hard to decrypt since it links the collective physical interactions inside the protein to the corresponding evolutionary forces acting on the gene [1–5]. Furthermore, evolution has to select the tiny fraction of functional sequences in an enormous, high-dimensional space [6–8], which implies that protein is a non-generic, information-rich matter, outside the scope of standard statistical methods. Therefore, although the structure and physical forces within a protein have been extensively studied, the fundamental question as to how a functional protein originates from a linear DNA sequence is still open, in particular, how the functionality constrains the accessible DNA sequences.
To examine the geometry of the sequence-to-function map, we devise a mechanical model of proteins as amorphous learning matter. Rather than simulating concrete proteins, we construct a model which captures the hallmarks of the genotype-to-phenotype map. The model is simple enough to be efficiently simulated to gain statistics and insight into the geometry of the map. We base our model on the growing evidence that large-scale conformational changes – where big chunks of the protein move with respect to each other – are central to function [9–15]. In particular, allosteric proteins can be viewed as ‘mechanical transducers’ that transmit regulatory signals between distant sites [16–19].
Dynamics is essential to protein function, but it is hard to measure and simulate due to the challenging spatial and temporal scales. Nevertheless, recent studies suggest a physical picture of the functionally-relevant conformational changes within the protein: Nanorheological measurements showed low-frequency viscoelastic flow within enzymes [20], with mechanical stress affecting catalysis [21]. Computation of amino acid displacement, by analysis of structural data, demonstrated that the strain is localized in 2D bands across allosteric enzymes [22]. We therefore take as a target function to be evolved in our protein such a large-scale dynamical mode. Other important functional constraints, such as specific chemical interactions at binding sites, are disregarded here because they are confined to a small fraction of the protein. We focus on this mechanical function whose large scale, collective nature leads to long-range correlation patterns in the gene.
Our model includes essential elements of the genotype-to-phenotype map: the target mechanical mode is evolved by mutating the ‘gene’ that determines the connectivity in the amino acid network. During the simulated ‘evolution’, mutations eventually divide the protein into rigid and ‘floppy’ domains, and this division enables large-scale motion in the protein [23]. This provides a concrete map between sequence, configuration, and function of the protein. The computational simplicity allows for a massive survey of the sequence universe, which reveals a strong signature of the protein’s structure and function within correlation ‘ripples’ that appear in the space of DNA sequences.
II. MODEL AND RESULTS
We give here a summary and interpretation of our results. The appendix contains further details and explains the choices we made in designing the model to be as close as possible to real proteins.
A. Mechanical model of protein evolution
Our model is based on two structures: a gene, and a protein, which are coupled by the genotype-to-phenotype map. The coarse-grained protein is an aggregate of amino acids (AAs), modeled as beads, with short-range interactions given as bonds (Fig. 1). A typical protein is made of several hundred AAs, and we take N = 540. We layer the AAs on a cylinder, 18 high and 30 wide, similar to the dimensions of globular proteins. The cylindrical configuration allows for fast calculation of the low energy modes, and thereby fast evolution of the protein. Each AA may connect to the nearest five AAs in the layer below, so that we get $2^5 = 32$ effective AA species, which are encoded as 5-letter binary codons 1. These codons specify the bonds in the protein in a 2550-long sequence of the gene ($5 \times 30 \times (18 - 1)$, because the lowest layer is connected only upwards).
To become functional, we want the protein to evolve to a configuration of AAs and bonds that can transduce a mechanical signal from a prescribed input at the bottom of the cylinder to a prescribed output at its top 2. The solution we search turns out to be a large-scale, low-energy deformation where one domain moves rigidly with respect to another in a shear or hinge motion, which is facilitated by the presence of a fluidized, ‘floppy’ channel separating the rigid domains [25–27].
These large-scale deformations are governed by the rigidity pattern of the configuration, which is determined by the connectivity of the AA network via a simple majority rule (Fig. 1) which we detail in Sect. A3. The basic idea is that each AA can be either rigid or fluidized and that this rigidity state propagates upwards: Depending on the number of bonds and the state of other AAs in its immediate neighborhood, an AA will be rigidly connected, ‘shearable’, i.e., loosely connected, or in a pocket of less connected AAs within a rigid neighborhood 3. As the sequence and hence the connections mutate, the model protein adapts to the desired input-output relation specified by the extremities of the separating fluid channel (Fig. 1(right)). (The sought-after mechanical shear motion is examined in Sections II E and B3; in [24] we take the mechanical modes themselves as the target function.)
The model is easy to simulate: We start from a random gene of 2550 bits, and at each time step we flip a randomly drawn bit, thus adding or deleting a bond. In a zero-temperature Metropolis fashion, we keep only mutations which do not increase the distance from the target function, i.e., the number of errors between the state in the top row and the prescribed outcome. Note that, following the logic of biological evolution, the ‘fitness’ of the protein is only measured at its functional surface (e.g., where a substrate binds to an enzyme) but not in its interior.
Typically, after $10^3$-$10^5$ mutations this input-output problem is solved (Fig. 2). Although the functional sequences are extremely sparse among the $2^{2550}$ possible sequences, the small bias for getting closer to the target in configuration space directs the search rather quickly. Therefore, we could carry out as many as $10^6$ runs of the simulation, which gave $10^6$ independent solutions of the evolutionary task.
B. Dimensional reduction in the phenotype-to-genotype map
Thanks to the large number of simulations, we can explore vast regions of the genetic universe. That the sampling is well-distributed can be seen from the typical inter-sequence distance, which is comparable with the universe diameter (Fig. 4). This also indicates that the dimension of the solution set is high. Indeed, the observed dimension of sequence space, as estimated following [28, 29], is practically infinite (∼ 150) 4. This shows that the bonds are chosen basically at random, although we only consider functional sequences.
On the other hand, very few among the $2^{540}$ configurations are solutions, owing to the physical constraints of contiguous rigid and shearable domains. As a result, when mapped to the configuration space, the solutions exhibit a dramatic reduction to a dimension of about 8-10 [30]. This reduction between ‘genotype’ (sequence) and ‘phenotype’ (configuration, function) [31, 32] is the outcome of physical constraints on the mechanical transduction problem. In the nearly random background of sequence space, these constraints are also manifested in long-range correlations among AAs on the boundary of the shearable region (Fig. 5 and Sect. B4).
C. Spectral analysis reveals correspondence of genotype and phenotype spaces
Spectral analysis of the solution set in both sequence and configuration spaces provides further information on the sequence-to-function map (Fig. 6). The sequence spectrum is obtained by singular value decomposition (SVD) of a $10^6 \times 2550$ matrix, whose rows are the binary genes of the solution set. The first few eigenvectors (EVs), those with the largest eigenvalues, capture most of the genetic variation among the solutions, and are therefore the collective degrees-of-freedom of protein evolution (Fig. 6B). The 1st EV is the average sequence, and the next EVs highlight positions in the gene that tend to mutate together to create the fluid channel.
The spectrum of the configuration space is calculated in a similar fashion by the SVD of a $10^6 \times 540$ matrix, whose rows are the configurations of the solution set (Fig. 6A). In the configuration spectrum, there are 8-10 EVs which stand out from the continuous spectrum, corresponding to the dimension 8 shown in Fig. 3. Although the dimension of the sequence space is high (∼ 150), there are again only 8-9 eigenvalues outside the continuous random spectrum.
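The spectral procedure can be illustrated on synthetic data. The following is a minimal sketch, not the analysis of the actual solution set: the matrix is much smaller than the real one, its rows are random bits, and a planted group of ten identical bits (a hypothetical stand-in for positions that mutate together) supplies the only non-random component.

```python
import numpy as np

rng = np.random.default_rng(0)
n_solutions, n_bits = 2000, 200      # small stand-ins for the real matrix size

# Synthetic "solution set": random bits (assumption, not evolved genes)
X = rng.integers(0, 2, size=(n_solutions, n_bits)).astype(float)
X[:, 1:10] = X[:, [0]]               # planted group: bits 0..9 always agree

# Rows are sequences, so the right-singular vectors live in sequence space
U, sv, Vt = np.linalg.svd(X, full_matrices=False)

# The first right-singular vector is close to the normalized average sequence
mean_dir = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))
overlap = abs(Vt[0] @ mean_dir)

# Weight of the second vector on the planted group of correlated bits
group_weight = float(np.sum(Vt[1, :10] ** 2))

print(overlap, group_weight)
print(sv[:4])                        # two isolated values, then the random bulk
```

In this toy example the first singular vector tracks the average sequence, the second is localized on the planted group, and the remaining singular values form a nearly continuous band, which mirrors the qualitative structure described above.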
These isolated EVs distill beautifully the non-random components within the mostly-random functional sequences. The EVs of both sequence and configuration are localized around the interface between the shearable and rigid domains. The similarity in number and in spatial localization of the EVs reveals the tight correspondence between the configuration and sequence spaces.
This duality is the outcome of the sequence-to-function map defined by our simple model: The geometric constraints of forming a shearable band within a rigid shell, required for inducing long-range modes, are mirrored in long-range correlations among the codons (bits) in sequence space. The corresponding sequence EVs may be viewed as weak ‘ripples’ of information over a sea of random sequences, as only about 8 out of 2550 modes are non-random (0.3%). These information ripples also reflect the self-reference of proteins and DNA via the feedback loops of the cell circuitry [34].
It is instructive to note similarities and differences between the spectra. While the spectra of the configuration space and of the sequence space have a similar form — with a continuous, more or less random, part and a few isolated eigenvalues above it — the location of the random part is different: In the configuration case it is close to zero while in the sequence case it is concentrated at large values around 500.
The geometric interpretation is that the cloud of solution points looks like an 8-9 dimensional flat disk in the configuration case, while in the sequence space, it looks like a high-dimensional almost-spherical ellipsoid. The few slightly more pronounced directions of this ellipsoid correspond to the non-random components of the sequence. The slight eccentricity of the ellipsoid corresponds to the weak non-random signal above the random background. This also illustrates that the dimension of the sequence space is practically infinite, while in the configuration space it is comparable to the number of isolated eigenvalues.
We verified that the dimensional reduction and the spectral correspondence depend very little on the details of the models. For example, we examined a model with 16 AA species
D. Stability of the mechanical phenotype under mutations
First, we determine how many mutations lead to a destruction of the solution (Fig. 8A). About 10% of all solutions are destroyed by just one random mutation. The exponentially decaying probability of surviving m mutations signals that these mutations act quite independently. Fig. 8B shows the location of these destructive mutations around the shearable channel 6.
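As an illustrative estimate, assume that each random mutation destroys the solution independently and with the same probability p ≈ 0.1 (the single-mutation value quoted above; the exact constancy of p is an assumption). The survival probability after m mutations and the mean number of mutations needed to destroy a solution would then be:

```latex
P_{\mathrm{surv}}(m) = (1-p)^{m} \approx e^{-p m}, \qquad
\langle m \rangle = \sum_{m \ge 1} m \, p \, (1-p)^{m-1} = \frac{1}{p} \approx 10 .
```

This is of the same order as the average of about 11.2 mutations per destruction reported in Sect. A5.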
We have also studied the loci where two interacting mutations will destroy a solution (i.e., neither of the two is by itself destructive). In most cases, the two mutations are close to each other, acting on the same site. The channel is less vulnerable to such mutations, but the twin mutations are evenly distributed over the whole rigid network (Fig. 8C).
E. Fluid channel supports low-energy shear modes
The evolved rigidity pattern supports low-energy modes with strain localized in the floppy, fluid channel. We tested whether the evolved AA network indeed induces such modes (Fig. 9), by calculating the mechanical spectrum of a spring network in which bonds are substituted by harmonic springs. The shear motion of the network is characterized by the modes of $\mathcal{H}$, its elastic tensor. $\mathcal{H}$ is the $2N \times 2N$ curvature matrix in the harmonic expansion of the elastic energy $E \simeq \tfrac{1}{2}\, \delta\mathbf{r}^{T} \mathcal{H}\, \delta\mathbf{r}$, where $\delta\mathbf{r}$ is the $2N$-vector of the 2D displacements of the $N$ AAs. $\mathcal{H}$ has the structure of the network Laplacian multiplied by the $2 \times 2$ tensors of directional derivatives (see Sect. B3, which is derived from [35, pp. 618–9]).
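The structure of the curvature matrix can be made concrete with a small sketch. It assumes unstressed springs with a common spring constant and ignores the cylindrical periodicity; the function names `elastic_hessian` and `mode_energies` are illustrative and not taken from the paper. Each spring contributes a 2×2 block built from the unit bond vector, arranged as in a graph Laplacian.

```python
import numpy as np

def elastic_hessian(positions, bonds, k_spring=1.0):
    """2N x 2N curvature matrix of a 2D network of harmonic springs.

    positions: array (N, 2) of rest positions; bonds: iterable of (i, j) pairs.
    Each spring contributes k * n n^T blocks, n being the unit bond vector.
    """
    n_nodes = len(positions)
    hess = np.zeros((2 * n_nodes, 2 * n_nodes))
    for i, j in bonds:
        d = positions[j] - positions[i]
        n = d / np.linalg.norm(d)
        block = k_spring * np.outer(n, n)
        si, sj = slice(2 * i, 2 * i + 2), slice(2 * j, 2 * j + 2)
        hess[si, si] += block            # diagonal (degree-like) blocks
        hess[sj, sj] += block
        hess[si, sj] -= block            # off-diagonal (adjacency-like) blocks
        hess[sj, si] -= block
    return hess

def mode_energies(hess):
    """Eigenvalues of the curvature matrix in ascending order."""
    return np.linalg.eigvalsh(hess)

# A rigid triangle has only the 3 trivial zero modes (2 translations, 1 rotation).
tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
rigid = elastic_hessian(tri, [(0, 1), (1, 2), (0, 2)])
print(np.sum(mode_energies(rigid) < 1e-9))    # 3

# Removing one bond leaves a hinge: one additional zero-energy (floppy) mode.
hinge = elastic_hessian(tri, [(0, 1), (1, 2)])
print(np.sum(mode_energies(hinge) < 1e-9))    # 4
```

The triangle example shows, on the smallest possible network, how cutting a bond creates a floppy mode; in the model the same effect arises collectively when a continuous channel of less connected AAs forms.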
We traced the mechanical spectrum of the protein during the evolution of the fluidized channel (a shear band). We found that the formation of a continuous channel of less connected amino acids indeed facilitates the emergence of low-energy modes of shear or hinge deformations (Fig. 9). The energy of such low modes nearly vanishes as the channel is close to completion. Similar deformations, where the strain is localized in a rather narrow channel, occur in real proteins, as shown in recent analysis of structural data [22].
F. Proteins can adapt simultaneously to multiple tasks
Our models were designed to trace the evolution of a mechanical function and show how it constrains the genotype-to-phenotype map, as shown above. Real proteins also evolve towards other essential functions, such as binding affinity and biochemical catalysis at specific binding sites. Here, we examine another important molecular trait, stability.
Many studies examine the energetic stability of the protein, as measured by its overall free energy (ΔG) [4, 5, 8]. In the present model, this free energy is given by the number of bonds, which represent chemical and physical interactions among the amino acids. The higher the number of bonds the more stable and less flexible is the protein. By tuning stability, organisms adapt to their environment. Thermophiles that live in hotter places, such as hydrothermal vents, evolve more stable proteins to withstand the heat. Cryophiles that reside in colder niches have more flexible proteins [36].
We simulated the evolution of the two phenotypes, our specific dynamical mode together with an energetic state (i.e., a given bond density). We find that the large solution set of the mechanical problem allows the protein to select a subset with a specific energetic state. Thus, the evolutionary dynamics could find solutions to the same mechanical function when we imposed extreme values of bond density (Fig. 10). This demonstrates the capacity of the protein to search in parallel for the solutions of several biological tasks. Evolving a specific binding site is expected to be an easier task, since such sites are confined to a small fraction of the protein.
G. Amino acid interactions
In the model described so far, the bonds were determined by the AA species alone, while in real proteins, it is the interaction between pairs of AA which determines the formation of bonds 7. This raises the question as to how much our results are sensitive to the fine details of the interaction model. As we show, a more realistic interaction model does not change the main results, which demonstrates the robustness of our approach.
To model two-body AA interactions we consider a set of three AA species, which we call A0, A1 and A2. Whether a bond is formed or not is determined by a symmetric binary relation b(Ai, Aj), which we write as a 3×3 interaction matrix (Table I).
This variant of the model is reminiscent of the HP model with its two species of AAs [37]. The interaction range is kept identical to that of our standard model, namely an AA can form a bond with the 5 nearest neighbors in the adjacent rows.
The ‘gene’ in this variant of the model is a sequence of 18 × 30 = 540 two-letter binary codons, gi, each representing an AA, such that the overall length of the gene is 1080 bits. The genetic code is a map C from codons to AAs, C: gi → Ai. Since there are four codons and only three AAs, there is a 25% redundancy in the ‘genetic code’. This is reminiscent of the (higher) redundancy of the natural genetic code in which 20 AAs are encoded by 61 codons [38–40] (out of the $4^3 = 64$ codons, 3 are ‘stop’ codons). In our 4-codon genetic code, the redundant AA is chosen to be A0, C(00) = C(01) = A0, and the two other AAs are encoded as C(10) = A1, C(11) = A2. For a given gene, the bond pattern is determined by looking at all AA pairs within the interaction range and calculating their coupling according to the interaction matrix (Table I), b(C(gi), C(gj)) = b(Ai, Aj). Once the bond network is determined from the gene, the rigidity pattern, rigid, fluid or ‘trapped’, is calculated as in the standard model (Sect. II A).
In the simulations, at each step we flip one letter in a randomly selected codon. A quarter of the mutations are synonymous, since they exchange ‘00’ and ‘01’. The other three quarters add or cut bonds, and we check, as before, whether the connectivity change moves the rigidity pattern closer to a pattern that allows for a low-energy floppy mode. A small number of beneficial mutations eventually resolve the mechanical transduction problem, typically after $10^3$-$10^4$ mutations.
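The four-codon genetic code and the fraction of synonymous mutations can be checked by direct enumeration. This is a small sketch; the helper names `translate` and `flip` are illustrative, and the count assumes that all four codons are equally likely, which need not hold exactly in evolved genes.

```python
from itertools import product

# Genetic code of the three-species variant, as given in the text.
CODE = {"00": "A0", "01": "A0", "10": "A1", "11": "A2"}

def translate(gene_bits):
    """Translate a bit string into AAs, reading two-letter codons."""
    return [CODE[gene_bits[i:i + 2]] for i in range(0, len(gene_bits), 2)]

def flip(codon, pos):
    """Flip the letter at position pos (0 or 1) of a two-letter codon."""
    letters = list(codon)
    letters[pos] = "1" if letters[pos] == "0" else "0"
    return "".join(letters)

# Enumerate all single-letter mutations (assumption: uniform codon usage).
mutations = [(c, flip(c, p)) for c, p in product(CODE, range(2))]
synonymous = [m for m in mutations if CODE[m[0]] == CODE[m[1]]]

print(translate("00011011"))              # ['A0', 'A0', 'A1', 'A2']
print(len(synonymous) / len(mutations))   # 0.25
```

Of the eight possible single-letter flips, only the two that exchange ‘00’ and ‘01’ leave the AA unchanged, which reproduces the quarter quoted above.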
In Fig. 11 we present some data (obtained from $4 \cdot 10^5$ solutions) to illustrate the robustness of the results relative to model changes. We find that, despite having changed the connectivity model, our main conclusions regarding the geometry of the phenotype-to-genotype map remain intact: A huge reduction from a high-dimensional genotype space (dim > 100) to a low-dimensional phenotype space (dim ∼ 10), similar to the dimensions in Fig. 3. It is noteworthy that the configuration eigenvectors are very similar to those of the simpler model (as in Fig. 6), although they are determined by very different bonding interactions. This is evident in the (non-random) bond eigenvectors which are similar in number to those of the previous model but differ in pattern owing to the different bonding rules of Table I. The robustness of the results manifests the universality of the dimensional reduction which originates from the continuity of the mechanical transduction.
III. CONCLUSIONS
Our models of the genotype-to-phenotype map put forward a new physical picture of protein evolution. Our thesis is that rather than structure itself, it is the dynamics that governs protein fitness. Our method considers proteins as evolving amorphous matter with a mechanical function, a specific low-energy conformational change. The rigidity/shearability pattern of the protein, and hence its dynamical modes, are determined by the connectivity of the amino acid interaction network. The model explains how the spatially-extended modes appear as the gene mutates and changes the amino acid network. These modes are shear and hinge motions where the strain is localized in the shearable channel and where the surrounding domains translate or rotate as rigid bodies (Fig. 9).
A main insight from our model is that requiring the protein to have ‘floppy’ modes puts strong constraints on the space of mechanical phenotypes. As a consequence there is a huge dimensional reduction when mapping genotypes to phenotypes. We find that the collective mechanical interactions among the amino acids are mirrored in corresponding modes of sequence correlation in the genes. These main results do not depend on details of the model and have been reproduced in versions with (i) a different number of AA species (16 instead of 32), (ii) bonds that depend on pairwise interactions, and (iii) a harmonic spring network [24]. All these suggest that the results are generic and apply to a wide range of realizations.
Our models are distilled to their simplest physical-mathematical schemes, but have concrete, experimentally testable predictions. In the functional protein, the least random, strongly correlated sites are concentrated in a rigid shell that envelops the shearable channel [22]. Our model therefore predicts that these sites are also the most vulnerable to mutations (Fig. 8B), which distort the low-frequency modes and thus hamper the biological function. These effects can be examined by combining mutation surveys, biochemical assays of the function, and physical measurements of the low-frequency spectrum, especially in allosteric proteins.
To that end, one may take an enzyme with a known shear band (via analysis similar to [22]) and mutate amino acids within and around the band. We expect the mutation of these amino acids to have a significant impact on the dynamics and biochemical function of the protein, as compared to other mutations in the rigid subdomains. By sequence alignment methods [33, 41–43], it is possible to test whether these sensitive positions in the protein exhibit strong correlations in the gene, as predicted by the model. One may also search for the dimensional reduction predicted by the model in high resolution maps of molecular fitness landscapes [44–47].
Past studies have shown that the motion of proteins [48–51] and their hydrophobicity patterns [52] may often be approximated by a few normal modes, while others have demonstrated that the variation in aligned sequences may be characterized by a few correlation modes [33, 41–43]. The present study links the genotype and phenotype spaces, and explains the dimensional reduction as the outcome of a non-linear mapping between genes and patterns of mechanical forces: We characterize the emergent functional mode to be a soft, ‘floppy’ mode, localized around a fluidized channel (a shear band), a region of lower connectivity which is therefore easier to deform. The contiguity of this rigidity pattern implies that it can be described by a few collective degrees of freedom, implying a vast dimensional reduction of configuration space.
The concrete genotype-to-phenotype map in our simple models demonstrates that most of the gene records random evolution, while only a small non-random fraction is constrained by the biophysical function. This drastic dimensional reduction is the origin of the flexibility and evolvability in the functional solution set.
ACKNOWLEDGMENTS
We thank Stanislas Leibler, Michael R. Mitchell, Elisha Moses and Giovanni Zocchi for essential discussions and encouragement. JPE is supported by an ERC advanced grant ‘Bridges’, and TT by the Institute for Basic Science IBS-R020 and the Simons Center for Systems Biology of the Institute for Advanced Study, Princeton.
Appendix A The protein evolution model
1. The cylindrical amino acid network
We model the protein as an aggregate of amino acids (AAs) with short range interactions. In our coarse grained model, beads represent the AAs and bonds their interactions with neighboring AAs (Fig. 1). We consider a simplified cylindrical geometry, where the AAs are layered on the surface of a cylinder at randomized positions, to represent the non-crystalline packing of this amorphous matter. Throughout this study, we examine a geometry with height h (= 18), i.e., the number of layers in the z direction, and width w (= 30), i.e., the circumference of the cylinder. When the cylinder is shown as a flat 2D surface (such as in Fig. 2), there are still periodic boundary conditions in the horizontal w direction. The row and column coordinates of an AA are (r, c), with r for the row (1,…, h) and c for the column (1,…, w). The cylindrical periodicity is accounted for by taking the horizontal coordinate c modulo w = 30, $c \to \mathrm{mod}_w(c - 1) + 1$.
Each AA in row r can connect to any of its five nearest neighbors in the next row below, r – 1. This defines $2^5 = 32$ effective species of amino acids that differ by their ‘chemistry’, i.e., by the pattern of their bonds. Therefore, in the gene, each AA at (r, c) is encoded as a 5-letter binary codon, $\ell_{rck}$, where the k-th letter denotes the existence (= 1) or absence (= 0) of the k-th bond. The gene is the sequence of $N_{AA} = w \cdot h = 540$ codons which represent the AAs of the protein. This means that each codon just specifies which of the 5 bonds are present or absent. Therefore, the codons are a genetic sequence of $2700 = w \cdot h \cdot 5$ digits 0 or 1. Each of these numbers determines whether or not a bond connects two positions of the grid. Since the bonds from the bottom row do not affect the configuration of the protein and the resulting dynamical modes, the relevant length of the gene is somewhat smaller, $N_S = 2550 = w \cdot (h - 1) \cdot 5$.
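The bookkeeping of the gene can be sketched in a few lines. The array layout `(h - 1, w, 5)`, the zero-based indices, and the helper names below are choices made for this illustration only; the dimensions are those given in the text.

```python
import numpy as np

H, W, K = 18, 30, 5          # rows, columns, bonds per AA (values from the text)

def random_gene(rng, bond_density=0.5):
    """Random gene for rows 2..h: one 5-bit codon per AA, flattened to 2550 bits."""
    return (rng.random((H - 1) * W * K) < bond_density).astype(np.uint8)

def codons(gene):
    """View the flat gene as codons[r, c, k] (zero-based; r = 0 is the second row)."""
    return gene.reshape(H - 1, W, K)

def neighbor_column(c, k):
    """Zero-based column of the k-th neighbor (k = -2..2) in the row below,
    wrapped around the cylinder."""
    return (c + k) % W

rng = np.random.default_rng(0)
gene = random_gene(rng)
print(gene.size)                 # 2550 relevant bits
print(2 ** K)                    # 32 effective AA species
print(neighbor_column(0, -2))    # 28: wraps around the cylinder
```

With zero-based columns the periodic wrap reduces to a plain modulo operation, which is equivalent to the one-based rule stated above.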
2. Evolution searches for a mechanical function
We now define the target of evolution as finding a functional protein, in the following specific sense: To become functional, the protein has to evolve a configuration of AAs and bonds that can transduce a mechanical signal from a prescribed input at the bottom of the cylinder to a prescribed output at its top. This signal is a large-scale, low-energy deformation where one domain moves rigidly with respect to another in a shear or hinge motion, which is facilitated by the presence of a fluidized, ‘floppy’ channel separating the rigid domains [25–27].
3. Rigidity propagation algorithm
The large-scale deformations are governed by the rigidity pattern of the configuration, which is determined by the connectivity of the AA network via a simple majority rule (Fig. 1). The details of this majority rule are as follows (Fig. 12): Each AA position will have two binary properties, which define its state:
The rigidity σ: This property can be rigid (σ = 1) or fluid (σ = 0).
The shearability s: This property can be shearable (s = 1) or non-shearable (s = 0). As shown below, a non-shearable AA can be either rigid or fluid within a rigid domain of the protein. Non-shearable domains tend to move as a rigid body (i.e., via translation or rotation), whereas shearable regions are easy to deform.
Only 3 of the 4 possible combinations are allowed:
Non-shearable and solid AA (yellow): (σ = 1; s = 0).
Non-shearable and fluid AA (red): (σ = 0; s = 0).
Shearable and fluid AA (blue): (σ = 0; s = 1).
Shearable solid is forbidden.
Given a fixed sequence, and an input state in the bottom row of the cylinder, $\{\sigma_{1,c}, s_{1,c}\}$, the state of the cylinder is completely determined as follows: The three states percolate through the network, from row r to row r + 1 (see Fig. 12). This propagation is directed by the presence of bonds, with a maximum of 5 bonds ending in each AA (of rows r = 2 to h; the state of the first row is given as input). These bonds can be present (= 1) or absent (= 0), according to the codon $\ell_{rck}$, k = −2,…, 2, when they point to the AA with coordinate (r, c) coming from the AA (r − 1, c + k).
In a first sweep through the rows, we deal with the rigidity property σ. In row r = 1 each of the w AAs is in a rigidity state rigid (σ = 1) or fluid (σ = 0). In all other rows, r = 2 to h, the 5 bonds determine the value of the rigidity of (r, c) through a majority rule:
$$\sigma_{r,c} = \theta\!\left( \sum_{k=-2}^{2} \ell_{rck}\, \sigma_{r-1,c+k} - \sigma_0 \right), \tag{A1}$$
where θ is the step function (θ(x ≥ 0) = 1, θ(x < 0) = 0). The parameter σ0 = 2 is the minimum number of rigid AAs from the r – 1 row that are required to rigidly support an AA: In 2D each AA has two coordinates, which are constrained if it is connected to two or more static AAs. In this way, the rigidity property of being pinned in place propagates through the lattice, as a function of the initial row and the choice of the bonds which are present as encoded in the gene.
We next address the shearability property. It is determined by the rigidity of AAs as follows: We assume that all fluid AAs in row r = 1 are also shearable (blue: (σ = 0; s = 1)). A fluid node (r, c) in row r will become shearable exactly if at least one of its neighbors (r − 1, c) or (r − 1, c ± 1) is shearable:
$$s_{r,c} = \left( 1 - \sigma_{r,c} \right) \theta\!\left( \sum_{k=-1}^{1} s_{r-1,c+k} - s_0 \right), \tag{A2}$$
where s0 = 1. The first factor on the right-hand side ensures that a solid AA can never become shearable. This completes the definition of the map from the sequence to the state.
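The two sweeps can be sketched as a single upward pass. This is a minimal illustration of the rules as stated in the text (a rigid AA needs at least σ0 = 2 bonded rigid supporters; a fluid AA is shearable if at least s0 = 1 of its three nearest neighbors below is shearable), with zero-based indices and an array layout chosen for this example.

```python
import numpy as np

SIGMA0 = 2   # minimum number of bonded rigid supporters (from the text)
S0 = 1       # minimum number of shearable neighbors (from the text)

def propagate(bonds, sigma_in):
    """Propagate rigidity and shearability upward from the input row.

    bonds: array (h-1, w, 5); bonds[r-1, c, k+2] = 1 if AA (r, c) is bonded
           to AA (r-1, c+k), k = -2..2 (zero-based rows, periodic columns).
    sigma_in: array (w,), rigidity of the bottom row (1 rigid, 0 fluid).
    Returns (sigma, s), each of shape (h, w).
    """
    h, w = bonds.shape[0] + 1, bonds.shape[1]
    sigma = np.zeros((h, w), dtype=np.int64)
    s = np.zeros((h, w), dtype=np.int64)
    sigma[0] = sigma_in
    s[0] = 1 - sigma[0]                  # fluid AAs in the input row are shearable
    for r in range(1, h):
        # np.roll(a, -k)[c] == a[(c + k) % w]: neighbor c + k in the row below
        support = sum(bonds[r - 1, :, k + 2] * np.roll(sigma[r - 1], -k)
                      for k in range(-2, 3))
        sigma[r] = (support >= SIGMA0).astype(np.int64)
        near = sum(np.roll(s[r - 1], -k) for k in (-1, 0, 1))
        s[r] = (1 - sigma[r]) * (near >= S0)   # rigid AAs are never shearable
    return sigma, s

bonds = np.ones((17, 30, 5), dtype=np.uint8)   # fully bonded test network
sigma_in = np.ones(30, dtype=np.int64)
sigma_in[10:15] = 0                            # 5 consecutive fluid beads
sigma, s = propagate(bonds, sigma_in)
print(np.flatnonzero(sigma[1] == 0))           # [11 12 13]
```

With every bond present, the fluid region narrows from five to three AAs in the second row and vanishes in the third, so a channel that reaches the top can only form where the gene removes bonds.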
4. Fitness and mutations
As we explained before, the aim is to find a functional protein which can transfer forces. To find such a protein, we start from a random sequence (of 2550 bits), and from an initial state (input) in the bottom row of the cylinder. This initial state is just made from rigid and fluid beads, as shown e.g., in Fig. 2. For most simulations, we just took 5 consecutive fluid beads among the remaining solid beads.
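We next define the target. It is a chain of w values, fluid and shearable (σ = 0; s = 1) or solid (σ = 1; s = 0), in the top row, which the protein should yield as an output. Given (i) a gene sequence, which determines the connectivity $\ell_{rck}$, and (ii) the input state, $\{\sigma_{1,c}, s_{1,c}\}_{c=1,\dots,w}$, the algorithm described above uniquely defines the output state in the top row, $\{\sigma_{h,c}, s_{h,c}\}_{c=1,\dots,w}$. At each step of evolution, the output state is compared to the fixed, given target, by measuring the Hamming distance, the number of positions where the output differs from the target:
$$F = \sum_{c=1}^{w} \left[ 1 - \delta\!\left( (\sigma_{h,c}, s_{h,c}),\, (\sigma^{*}_{c}, s^{*}_{c}) \right) \right], \tag{A3}$$
where $(\sigma^{*}_{c}, s^{*}_{c})$ denotes the target state in column c, and δ equals 1 if its two arguments coincide and 0 otherwise.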
In the biological convention − F is the fitness that should increase towards a maximum value of − F = 0, when the input-output problem is solved.
Solutions are found by mutations. At each iteration, a randomly drawn digit in the gene is flipped, that is, the values of 0 and 1 are exchanged. This corresponds to erasing or creating a randomly chosen link of a randomly chosen AA. After each flip, a sweep is performed, and the new output at the top row is again compared to the target. A mutation is kept only if the Hamming distance is not increased as compared to the value before the mutation (equivalently, the fitness is not allowed to decrease). This procedure is repeated until a solution (F = 0) is found. This will happen with probability 1, perhaps after very many flips, if the problem has a solution at all. This is really the Metropolis algorithm [53] at zero temperature.
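The accept/reject loop can be sketched independently of the rigidity sweep. In the sketch below the function `evolve` takes the distance as an arbitrary callable; the `toy_distance` used in the demonstration is a stand-in chosen only to keep the example short, not the Hamming distance of the top row, and all names are illustrative.

```python
import numpy as np

def evolve(gene, distance, rng, max_steps=100_000):
    """Zero-temperature Metropolis search: keep a bit flip unless it increases
    the distance to the target.

    gene: 1D array of 0/1; distance: callable gene -> non-negative int (0 = solved).
    Returns (gene, number of attempted flips, history of accepted distances).
    """
    gene = gene.copy()
    f = distance(gene)
    history = [f]
    steps = 0
    while f > 0 and steps < max_steps:
        i = rng.integers(gene.size)
        gene[i] ^= 1                     # mutate: flip one randomly drawn bit
        f_new = distance(gene)
        if f_new <= f:                   # neutral or beneficial: keep the flip
            f = f_new
            history.append(f)
        else:                            # deleterious: revert the flip
            gene[i] ^= 1
        steps += 1
    return gene, steps, history

# Stand-in distance for illustration only: mismatches in a small readout window.
target = np.array([1, 0, 0, 1, 1, 0, 1, 0], dtype=np.uint8)

def toy_distance(gene):
    return int(np.count_nonzero(gene[-target.size:] != target))

rng = np.random.default_rng(1)
start = rng.integers(0, 2, size=64, dtype=np.uint8)
solved, steps, history = evolve(start, toy_distance, rng)
print(toy_distance(solved), steps)
```

Because neutral flips are accepted, the bits outside the readout window drift freely while the distance never increases, which is the same mechanism that leaves most of the gene random in the full model.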
Remark
It is an important feature of our model that the quality of a network is only measured at the target line. This corresponds to the biological fact that the protein can only interact with the outside world through its surface (in our case, the ends of the cylinder). One of the surprising outcomes of our study is that this requirement has a strong influence on what happens in the interior of the protein. Also, the propagation of fluidity should not be confused with learning in neural networks, but is rather of the percolation type.
5. Simulation of evolutionary dynamics
All simulations are done on the 30×18 = 540 playground, as described above. We have done simulations for many variants of the model, and many targets, but we present only two specific problems, for which the most extensive study was done: In the first, the fluid regions of the input and the target are opposite and of length 6 at the bottom and length 5 at the top. In the second run, top and bottom are the same, but the top is shifted sideways by 5 units. We will call these two examples straight and tilted, denoted as S and T. We have also studied examples in which the position of the target (relative to the input) is left free, but here we only discuss the results for the ‘S’ and ‘T’ cases. This serves to illustrate that the results are largely independent of the details of the model. We have studied many other variants, and in all cases, the main results are qualitatively unchanged.
Remark
We view this as an important outcome of our theory, namely that it illustrates a close connection between gene and protein which goes way beyond the simple model we consider here.
For both S and T, we study 200 independent branches, starting from a random sequence with about 90% of the bonds present at the start. Given any fixed sequence, we sweep through the net according to the rules of Eqs. (A1)-(A2), and measure the Hamming distance F (Eq. (A3)) between the last row and the desired target. When this Hamming distance is 0, we consider the problem as solved. If not, we flip a randomly chosen bond (exchanging 0 with 1) and recalculate the Hamming distance. We view this flip as a mutation of the sequence, equivalent to mutating one nucleic base in a gene. If the Hamming distance decreases or remains unchanged, we keep the flip, otherwise we backtrack and flip another randomly chosen bond. This is repeated until a solution is found. (This is really a Metropolis algorithm [53] at zero temperature.) Typically, after $10^3$ to $10^5$ mutations this input-output problem is solved. Although the functional sequences are extremely sparse among the $2^{2550}$ possible sequences, the small bias for getting closer to the target in configuration space directs the search rather quickly.
Once a solution is found, we destroy it by further mutations and then look for a new solution, as before, starting from the destroyed state. This we call a generation. For each of the 200 branches, we followed 5000 generations, leading to a total of $10^6$ solutions. The time to recover from a destroyed state is about 1500 flips per error in that state, which is similar to the time it takes to find a solution starting from a random gene. A destruction takes around 11.2 mutations on average.
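One way to picture a branch of generations is the following sketch. It assumes that destruction means applying unconditional random flips until F > 0. The helper names (`fitness`, `recover`, `destroy`) and the toy fitness are hypothetical stand-ins for the sweep-based F of the model.

```python
import random

def fitness(gene, target):
    # Hypothetical stand-in for F: mismatches between gene prefix and target.
    return sum(g != t for g, t in zip(gene, target))

def recover(gene, target, rng):
    """Flip random bits, rejecting flips that increase F, until F == 0."""
    f, flips = fitness(gene, target), 0
    while f > 0:
        i = rng.randrange(len(gene))
        gene[i] ^= 1
        f_new = fitness(gene, target)
        if f_new <= f:
            f = f_new
        else:
            gene[i] ^= 1             # undo a deleterious flip
        flips += 1
    return flips

def destroy(gene, target, rng):
    """Apply unconditional random flips until the solution is lost (F > 0)."""
    flips = 0
    while fitness(gene, target) == 0:
        gene[rng.randrange(len(gene))] ^= 1
        flips += 1
    return flips

rng = random.Random(1)
target = [1, 0, 1, 1, 0, 1]
gene = [rng.randint(0, 1) for _ in range(40)]
recover(gene, target, rng)           # first solution of the branch

solutions, destroy_counts = [], []
for generation in range(50):         # one branch, 50 generations
    destroy_counts.append(destroy(gene, target, rng))
    recover(gene, target, rng)
    solutions.append(list(gene))
```

Each stored solution starts from the destroyed state of the previous one, so consecutive entries of `solutions` are correlated, as in the destruction-reconstruction runs.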
We also did another $10^6$ simulations starting each time from another random configuration. The statistics in both cases are very similar, but the destruction-reconstruction simulations obviously show some correlations between a generation and the next. This effect disappears after about 4 generations.
Appendix B Results, analysis and interpretation
1. Dimension of solution set
The dimension of a space measures the number of directions in which one can move from a point. In the case of our model, since from any sequence in sequence space one can move along $N_S = 2550$ axes by flipping just one bit, we see that the sequence space has dimension 2550, and that this space is a hypercube with $2^{2550} \sim 10^{768}$ elements.
The set of solutions which we find has, however, a much smaller dimension, as we show in Fig. 3 for the straight and tilted examples. In the case of experimental data, such as ours, the dimension is most conveniently determined by the box-counting (Grassberger-Procaccia [30]) algorithm. This is obtained by just counting the number N(ϱ) of pairs at distances ≤ ϱ, and then finding the slope in a log-log plot. This is indicated by the black lines in Fig. 3. We see that, clearly, the dimension in the space of configurations is about 8-9, while, in the space of sequences, the dimension is basically ‘infinite’, namely just limited by the maximal slope one can obtain [28].
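The pair-counting estimate can be illustrated on a toy point set. The sketch below assumes Euclidean distances and a flat two-dimensional sheet embedded in three dimensions, neither of which is taken from the paper. The function names and the choice of radii are also illustrative. The slope of log N(ϱ) against log ϱ should then come out close to 2, slightly reduced by boundary effects.

```python
import math
import random

def pair_counts(points, radii):
    """N(rho) for each rho: number of point pairs at distance <= rho."""
    counts = [0] * len(radii)
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            d = math.dist(points[a], points[b])
            for k, rho in enumerate(radii):
                if d <= rho:
                    counts[k] += 1
    return counts

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

rng = random.Random(0)
# Toy data: a 2D sheet sitting in 3D, so the expected dimension is about 2.
points = [(rng.random(), rng.random(), 0.0) for _ in range(400)]
radii = [0.05, 0.1, 0.2]
dimension = loglog_slope(radii, pair_counts(points, radii))
```

For sequence data one would replace `math.dist` by the Hamming distance; the counting and the fit stay the same.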
2. Spectrum in phenotype and genotype spaces
We compute spectra for both the sequences and the configurations, for the $10^6$ solutions. Let us detail this for the case of sequences: We have $10^6$ binary vectors with $N_S = 2550$ components each, and we want to know the ‘typical’ spectrum of such vectors. This is conveniently found with the Singular Value Decomposition (SVD), in which one forms a matrix W of size $m \times n = 10^6 \times 2550$. This matrix can be written as $W = U D V^*$, where U is m×m, V is n×n and D is an m×n matrix which is diagonal in the sense that only the elements $D_{ii}$ with i = 1,…, n are nonzero. (We assume here that we are in the case m > n.) The $D_{ii}$ are in general > 0 and in this case the singular value decomposition is unique. We call the set of the $D_{ii}$ the spectrum of the sequences, and the vectors in V the eigenvectors of the SVD. It is the first few of those which are shown in Fig. 6.
Note that the SVD eigenvalues are the square roots of the spectrum of the covariance matrix $W^T W$, which has the same eigenvectors as W. Therefore the high SVD eigenvalues correspond to the principal components, the directions with maximal variation in the solution set.
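The relation between the SVD and $W^T W$ can be checked on a small example. The sketch below does not call an SVD routine. It swaps in plain power iteration on $W^T W$, which only yields the largest singular value and its right singular vector. The function name, matrix size and bit density are illustrative assumptions. For unbiased random bits the leading vector is close to uniform, that is, proportional to the mean occupancy.

```python
import math
import random

def top_singular_pair(W, iters=200):
    """Power iteration on W^T W.
    Returns the largest singular value of W and its right singular vector."""
    m, n = len(W), len(W[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        Wv = [sum(w * x for w, x in zip(row, v)) for row in W]         # W v
        u = [sum(W[i][j] * Wv[i] for i in range(m)) for j in range(n)]  # W^T W v
        norm = math.sqrt(sum(x * x for x in u))
        v = [x / norm for x in u]
    Wv = [sum(w * x for w, x in zip(row, v)) for row in W]
    sigma = math.sqrt(sum(x * x for x in Wv))   # |W v| for a unit vector v
    return sigma, v

rng = random.Random(0)
m, n = 200, 12                                   # toy sizes, not 10^6 x 2550
W = [[rng.randint(0, 1) for _ in range(n)] for _ in range(m)]
sigma, v = top_singular_pair(W)

# Eigenvalue of W^T W in direction v (Rayleigh quotient); equals sigma**2.
WtWv = [sum(W[i][j] * sum(w * x for w, x in zip(W[i], v)) for i in range(m))
        for j in range(n)]
rayleigh = sum(a * b for a, b in zip(v, WtWv))
```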
Mutatis mutandis, we perform the same SVD for the case of the configurations, using the vectors of s-values (that is, of the shearability) of the configurations. (This is reasonable, because, in general, there are very few non-shearable and fluid AAs.)
Apart from the numerical findings, which are shown in Fig. 6 for the straight (S) example and in Fig. 7 for the tilted (T) one, some comments are in order:
Configuration space
(The eight figures on the bottom left): The first mode is proportional to the average configuration. The next modes reflect the basic deviations of the solution around this average. For example, the second mode is a left-to-right shift, the third mode is an expansion-contraction, etc. Since the shearable/non-shearable interface can move at most one AA sideways between consecutive rows, the modes are constrained to diamond-shaped areas in the center of the protein. This is the joint effect of the ‘influence zones’ of the input and output rows.
Sequence space
(The eight figures on the bottom right): The first eigenvector is the average bond occupancy in the $10^6$ solutions. The higher eigenvectors reflect the structure in the many-body correlations among the bonds. The typical pattern is that of ‘diffraction’ or ‘oscillations’ around the fluid channel. This pattern mirrors the biophysical constraint of constructing a rigid shell around the shearable region. Higher modes exhibit more stripes, until they become noisy, after about the tenth eigenvalue.
The bond-spectrum, top right in Figs. 6 and 7, has some outliers, which correspond to the localized modes shown in the eight panels below. Apart from that, the majority of the eigenvalues seem to obey the Marčenko-Pastur formula, see [54]. If the matrix is m × n, m > n, then the support of the spectrum is. In our case, since we have a $10^6 \times 2550$ matrix, one expects (if they were really random) to find the spectrum at which is close to the experiment, and confirms that most of the bonds are just randomly present or absent. We attribute the slight enlargement of the spectrum to memory effects between generations in the same branch. This corresponds to the well-known phylogenetic correlations among descendants in the same tree.
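For orientation, the expected random-matrix band can be evaluated numerically. The sketch below uses the standard asymptotic result that the singular values of an m × n matrix with independent, zero-mean entries of standard deviation σ fill the interval σ(√m − √n) to σ(√m + √n). The choice σ = 1/2, which corresponds to mean-subtracted bonds that are present with probability 1/2, is an assumption made here and is not taken from the text. The function name is illustrative.

```python
import math

def mp_singular_value_edges(m, n, sigma):
    """Asymptotic edges of the singular-value band of an m x n matrix
    (m > n) with i.i.d. zero-mean entries of standard deviation sigma."""
    root_m, root_n = math.sqrt(m), math.sqrt(n)
    return sigma * (root_m - root_n), sigma * (root_m + root_n)

# Assumed: bonds are independent and present with probability 1/2, so the
# mean-subtracted entries have standard deviation 1/2.
lower, upper = mp_singular_value_edges(10**6, 2550, 0.5)
center, half_width = (lower + upper) / 2, (upper - lower) / 2
```

Under these assumptions the band is centered at 500 with a half-width of about 25. A different normalization of W rescales both edges by the same factor.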
It is tempting to also study the continuous part of this spectrum, which is not quite of the standard form. While in principle this could be done by taking into account the known correlations, even the techniques of [55] seem difficult to implement. We thank T. Guhr for helpful discussions on this issue.
3. Shear modes in the amino acid network
Consider now either of the two examples, straight or tilted (S and T). A solution of such an example is given by a set of bonds, and this set of bonds defines a graph on the $N_{AA} = h \cdot w = 540$ AAs. This graph is embedded in 2D, with the AAs at their positions in the plane, connected by straight bonds. We now extend the scope of our study somewhat, by assuming that the bonds are not totally rigid, but given by harmonic springs (see also [24]). This allows us to study mechanical properties which would be too stiff if we only worked with bonds which are rigid sticks.
In this case, the calculations are straightforward, if somewhat complex, and they are, e.g., well explained in [35, pp. 618–619]. We thus consider the elastic tensor, which is the tensor product of the network Laplacian with the 2 by 2 tensor of directional derivatives.
For the reader who is unfamiliar with [35], we describe what this means component-wise. The playground $\Omega \subset \mathbb{Z}^2$ has size h in the z direction and size w in the x direction, with periodic boundary conditions in the x direction. All bonds go from some (r, c) to (r+1, c), (r+1, c±1), or (r+1, c±2), again with periodic boundary conditions in the c direction. Each such bond defines a direction vector (dz, dx) in $\mathbb{R}^2$, which we normalize to unit length. Note that this vector depends on both the origin and the target of the bond.
If we imagine harmonic springs between the nodes connected by bonds (all with the same spring constant), then we can define the (symmetric) tensor matrix of deformation energies in the x and z directions, where each element is, when k and m are connected by a bond, the 2 by 2 matrix M(k, m), indexed by i, j ∈ {1, 2}.
If k and m are not connected, then M(k, m) is the 0 matrix. The elements of M(k, m) are denoted $M(k, m)_{ij}$.
Finally, we complete the 2N × 2N matrix to a ‘Laplacian’ by adding diagonal elements to it, so that the row (and column) sums are 0. In components, this means that we require, for each k ∈ Ω and each i, j ∈ {1, 2}, the sums $\sum_{m \in \Omega} M(k, m)_{ij}$ to vanish. Other properties of A are described in [35].
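The construction can be made concrete on a small lattice. The sketch below assumes the common convention for central-force springs of unit stiffness, in which the off-diagonal block for a bond with unit vector d is −d dᵀ and the diagonal blocks collect +d dᵀ. This sign and normalization convention is an assumption, as are the function names and the lattice size. The final check confirms that rigid translations cost no energy, which is equivalent to the vanishing row sums.

```python
import math
import random

def elastic_matrix(h, w, bonds):
    """2N x 2N spring-network 'Laplacian' on an h x w lattice, periodic in x.
    bonds: list of (r, c, dc), a bond from (r, c) to (r + 1, (c + dc) % w)."""
    N = h * w
    A = [[0.0] * (2 * N) for _ in range(2 * N)]
    for r, c, dc in bonds:
        k = r * w + c
        m = (r + 1) * w + (c + dc) % w
        norm = math.hypot(1.0, dc)
        d = (1.0 / norm, dc / norm)          # unit bond vector (dz, dx)
        for i in range(2):
            for j in range(2):
                t = d[i] * d[j]
                A[2 * k + i][2 * m + j] -= t  # off-diagonal blocks
                A[2 * m + i][2 * k + j] -= t
                A[2 * k + i][2 * k + j] += t  # diagonal blocks: zero row sums
                A[2 * m + i][2 * m + j] += t
    return A

def apply(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

rng = random.Random(0)
h, w = 3, 6                                   # toy lattice, not 18 x 30
bonds = [(r, c, dc) for r in range(h - 1) for c in range(w)
         for dc in (-2, -1, 0, 1, 2) if rng.random() < 0.5]
A = elastic_matrix(h, w, bonds)

shift_z = [1.0, 0.0] * (h * w)                # rigid translation along z
shift_x = [0.0, 1.0] * (h * w)                # rigid translation along x
residual = max(abs(y) for y in apply(A, shift_z) + apply(A, shift_x))
```

The low-lying modes discussed next are the eigenvectors of such a matrix with the smallest nonzero eigenvalues; computing them requires an eigensolver, which is left out of this sketch.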
Since we take periodic boundary conditions in the x direction, there will always be a (simple) 0 eigenvalue of the elastic tensor in this direction. Other 0 eigenvalues correspond to translation in the x-z direction or rotation in the x-z plane. Another type of (double) 0 eigenvalues are associated with any patch of nodes which is totally disconnected from the rest of the lattice. Since the density ϱ of bonds is about 1/2, the bonds are otherwise quite random, and there are twice 5 bonds at each interior node, we expect (assuming random distribution of bonds) there to be about $N \cdot 2^{-10} \sim 0.001 N$ isolated nodes, i.e., isolated singletons, and even fewer patches of greater size.
Further zero modes come from nodes which can oscillate sideways without first-order effects. This will happen if a node is connected by only one bond. Since ϱ ∼ 1/2, the probability of finding such a node is about $10 \cdot 2^{-10} \sim 0.01$.
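These counting estimates are simple binomial probabilities and can be checked directly. The sketch assumes 10 independent bonds per interior node, each present with probability ϱ = 1/2, and ignores the boundary rows, which have only 5 bonds. The variable names are illustrative.

```python
from math import comb

bonds_per_node = 10      # twice 5 bonds at each interior node
rho = 0.5                # assumed bond density
n_nodes = 540            # h * w

# Probability that a node has no bond at all (isolated singleton).
p_isolated = (1 - rho) ** bonds_per_node

# Probability that a node has exactly one bond (free sideways oscillation).
p_single = comb(bonds_per_node, 1) * rho * (1 - rho) ** (bonds_per_node - 1)

expected_isolated = n_nodes * p_isolated
expected_single = n_nodes * p_single
```

With these numbers one expects roughly 0.5 isolated nodes and about 5 singly connected nodes per solution, so the additional zero modes are few compared with the 1080 degrees of freedom.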
Thus, we show in Figures 6 and 7 the eigenfunctions only for the first eigenvalues after the trivial ones. Due to the tensorial nature of the problem, the eigenvectors have two components, which we show as 2D shear-flow.
4. Genetic correlation matrix
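In Fig. 5, we study the correlations among the $10^6$ solutions in sequence space. Given the matrix $W_{ij}$ of all sequences (of binary digits), with $i = 1, \dots, N = 10^6$ and $j = 1, \dots, 2550$, we compute the means $\langle W_j \rangle = N^{-1} \sum_i W_{ij}$ and the standard deviations $\mathrm{std}_j = \left(\sum_i |W_{ij} - \langle W_j \rangle|^2\right)^{1/2}$. Then, in the usual way, we form $M_{ij} = W_{ij} - \langle W_j \rangle$ and $C_{jj'} = \sum_i M_{ij} M_{ij'} / (\mathrm{std}_j \, \mathrm{std}_{j'})$. Fig. 5 then shows $\log(|C_{jj'}|)$, with the autocorrelation $C_{jj}$ omitted.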
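A small worked example may help to fix the normalization. The sketch assumes that the correlation is the sum over solutions of the products of centered entries, divided by the two unnormalized standard deviations, so that the diagonal equals 1. The toy data, in which the second bit copies the first with probability 0.9, and the function name are illustrative only.

```python
import math
import random

def correlation_matrix(W):
    """Column-column correlation matrix of W (rows: solutions, columns: bits)."""
    N, n = len(W), len(W[0])
    mean = [sum(row[j] for row in W) / N for j in range(n)]
    M = [[row[j] - mean[j] for j in range(n)] for row in W]
    # Unnormalized standard deviations, std_j = sqrt(sum_i M_ij^2).
    std = [math.sqrt(sum(M[i][j] ** 2 for i in range(N))) for j in range(n)]
    return [[sum(M[i][a] * M[i][b] for i in range(N)) / (std[a] * std[b])
             for b in range(n)] for a in range(n)]

rng = random.Random(0)
W = []
for _ in range(500):
    row = [rng.randint(0, 1) for _ in range(4)]
    # Toy coupling: bit 1 copies bit 0 with probability 0.9.
    row[1] = row[0] if rng.random() < 0.9 else 1 - row[0]
    W.append(row)

C = correlation_matrix(W)
```

Here `C[0][1]` comes out near 0.8, while entries between independent bits scatter around 0 with a spread of order $N^{-1/2}$.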
Note that both the means and the variances depend very weakly on j. Fig. 5 reveals and reinforces several observations also made in other calculations of this paper. First, looking along the axis j = j′ in the figure, one sees a periodicity of the patterns corresponding to the 17 gaps between the 18 rows of the configuration space. This reflects the necessity to maintain a connected liquid channel. Also, as seen in Fig. 5, the correlations grow somewhat towards the ends, especially toward the upper (j = 2550) end. This is because of the mechanical constraint which forces the channel to become more precise towards the ends, in analogy with Fig. 8B.
The periodic patterns all over the square reflect not only the natural periodicity of 150 (= 5 · w) elements in the sequence, but also show that the boundaries of the channel form a special shell (with two peaks per row).
5. Survival under mutations
Here, we ask how robust the solutions are as further mutations take place. First, we determine how many mutations lead to a destruction of the solution. The statistics of this are shown in Fig. 8. We note that about 10% of all solutions are destroyed by just one mutation, while the probability of surviving m mutations decays exponentially with m. This signals that the mutations act independently.
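The link between independence and exponential decay can be illustrated with a toy simulation. It assumes that every mutation destroys the solution independently with the same probability, taken here as 0.1 for illustration. The survival probability after m mutations is then $(1 - p)^m$ and the mean number of mutations needed for a destruction is 1/p. The function and variable names are hypothetical.

```python
import random

def mutations_until_destroyed(p_kill, rng):
    """Number of mutations up to and including the first destructive one,
    when each mutation kills independently with probability p_kill."""
    m = 1
    while rng.random() >= p_kill:
        m += 1
    return m

rng = random.Random(0)
p_kill = 0.1                                   # assumed per-mutation kill rate
samples = [mutations_until_destroyed(p_kill, rng) for _ in range(20000)]

mean_m = sum(samples) / len(samples)           # compare with 1 / p_kill
frac_surviving_5 = sum(s > 5 for s in samples) / len(samples)
predicted_surviving_5 = (1 - p_kill) ** 5      # exponential decay in m
```

A kill rate slightly below 10% per mutation would give a mean of about 11 mutations per destruction, of the same order as the value quoted for the simulations.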
One can also ask where the critical mutations take place. This is illustrated in Fig. 8B, and was discussed in the main text. We have also studied the places where exactly two mutations will kill a solution (and neither of the two is a single-site ‘killer’) (Fig. 8C), and in these cases one finds that the two mutations are generally close to each other, acting on the same site. Again, the channel is less vulnerable to mutations, but now the mutations are evenly distributed over the rest of the network.
6. Expansion of the protein universe
Let us explain in further detail how Fig. 4 was obtained. Here, we test our model against the ideas of [6]. Our results will give some insight into the nature of the graph of solutions. First, we describe the question as it is found in [6]. Take any two solutions and consider their gene sequences $s_1$ and $s_2$. They will have a Hamming distance $d(s_1, s_2)$, which we normalize by dividing by 2550 (the number of elements in $s_i$, i = 1, 2), which we call the protein universe diameter; we denote the normalized distance by D. The question is how much the solution following one generation after $s_2$ differs from $s_1$. If we call that solution $s_3$, then the observed quantity is defined as follows: Let $w_i = 1$ if $s_{1,i} = 1$ and $w_i = -1$ if $s_{1,i} = 0$, for i = 1,…, 2550. Then for each i let $x_i = w_i \cdot (s_{3,i} - s_{2,i})$. Note that $x_i > 0$ if the change between $s_{2,i}$ and $s_{3,i}$ is towards $s_1$ and $x_i < 0$ if it is away from $s_1$. Finally, $N_{\rm away} = \sum_{i:\, x_i < 0} 1$ and $N_{\rm towards} = \sum_{i:\, x_i > 0} 1$, and we plot in Fig. 4 $N_{\rm towards}/N_{\rm away}$ as a function of D.
In Fig. 4 we show the results for data set S (the plot for set T looks similar). The black curve is nothing but D/(1 – D), where D is the normalized Hamming distance, i.e., the proportion of sites which are different between $s_1$ and $s_2$. The fit to this curve tells us an important aspect about the set of possible solutions. Note that the set of all possible s forms a hypercube of dimension 2550 with $2^{2550}$ corners. The set of solutions is a very small subset of this hypercube, where all corners which are not solutions have been taken away, including the bonds leading to these corners. This leads to a very complicated sub-graph of the hypercube. While we do not have a good mathematical description of how it looks, the good fit shows that the comparisons between $s_1$, $s_2$, and $s_3$ are as if one performed a random walk on the full cube. (Note that such a result must be intimately connected to the high dimension of the problem, since for low-dimensional hypercubes it does not hold.) Almost all solutions are at the edge of the universe, where the typical Hamming distances among the sequences are close to the typical distance between random sequences.
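The random-walk reference curve D/(1 − D) can be reproduced with an unconstrained walk on the full hypercube. The sketch below is such a null model, not the evolutionary simulation: $s_2$ is placed at a chosen normalized distance D from $s_1$, and $s_3$ is obtained from $s_2$ by a fixed number of random flips. The function name, the number of flips per step and the number of trials are illustrative assumptions.

```python
import random

def towards_away(s1, s2, s3):
    """Count sites where the step s2 -> s3 moves towards or away from s1."""
    towards = away = 0
    for a, b, c in zip(s1, s2, s3):
        w = 1 if a == 1 else -1
        x = w * (c - b)              # > 0: towards s1, < 0: away from s1
        if x > 0:
            towards += 1
        elif x < 0:
            away += 1
    return towards, away

rng = random.Random(0)
n, flips_per_step, trials = 2550, 10, 1000
n_diff = round(0.25 * n)             # sites where s2 differs from s1
D = n_diff / n

s1 = [rng.randint(0, 1) for _ in range(n)]
n_towards = n_away = 0
for _ in range(trials):
    s2 = list(s1)
    for i in rng.sample(range(n), n_diff):
        s2[i] ^= 1
    s3 = list(s2)
    for i in rng.sample(range(n), flips_per_step):
        s3[i] ^= 1                   # unconstrained random mutations
    t, a = towards_away(s1, s2, s3)
    n_towards += t
    n_away += a

ratio = n_towards / n_away           # compare with D / (1 - D)
```

A flip lands on a site where $s_2$ differs from $s_1$ with probability D and then moves towards $s_1$; otherwise it moves away, which gives the ratio D/(1 − D).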
7. Flexibility of solutions: thermal stability
The histogram of the density of links for the $10^6$ solutions is shown in Fig. 13. These distributions are obtained for simulations in which links are flipped randomly in a symmetric fashion. One can easily push these densities somewhat up or down, by favoring or restricting the flips of links towards 1. However, much more extreme solutions can be found by deterministic procedures which turn as many links as possible to 1 or to 0, respectively. In these cases, we have obtained densities as high as 0.96 and as low as 0.14, that is, 2452/2550 links and 372/2550 links, respectively. Two such extreme cases are illustrated in Fig. 10. This shows that the model, if needed, can be adapted to questions of temperature dependence of the protein, for example, by giving more or less weight to the number of bonds, something like a chemical potential in statistical mechanics.
Footnotes
1 In our model, the AA species is determined by the bonds, while in real proteins the bonds are determined by the chemical nature and position of the AA (see also Sect. II G).
2 Note that in this simulation, we do not take as evolutionary criterion the mechanical signal itself, but require that the protein forms a fluid channel with a prescribed configuration. We show that this configuration facilitates the sought-after mechanical shear motion in Sections II E and B3. (In [24] we take the mechanical modes themselves as the target function.)
3 The propagation of rigidity is effectively a “double” percolation problem in which both fluid (blue) and rigid (gray) regions are continuous (see Sect. A3).
4 We lack sufficient data to determine such high dimensions precisely, and 150 is a lower bound.
5 The natural genetic code with its 20 AAs is therefore an intermediate case.
6 The natural genetic code is redundant, i.e., several codons encode the same AA and are therefore synonymous. Such redundancy reduces the fraction of destructive mutations, since mutations that exchange synonymous codons do not change the encoded AA and are therefore bound to be neutral. A case of redundant code is examined in Sect. II G.
7 At least two AAs. There may also be higher-order terms of three-body interactions, etc.
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].