Could network structures generated with simple rules imposed on a cubic lattice reproduce the structural descriptors of globular proteins?

A direct way to spot structural features that are universally shared among proteins is to find proper analogues from simpler condensed matter systems. In most cases, sphere-packing arguments provide a straightforward route for structural comparison, as they successfully characterize a wide array of materials such as close packed crystals, dense liquids, and structural glasses. In the current study, the feasibility of creating ensembles of artificial structures that can automatically reproduce a large number of geometrical and topological descriptors of globular proteins is investigated. Towards this aim, a simple cubic (SC) arrangement is shown to provide the best background lattice after a careful analysis of the residue packing trends from 210 proteins. It is shown that a minimalistic set of ground rules imposed on this lattice is sufficient to generate structures that can mimic real proteins. In the proposed method, 210 such structures are generated by randomly removing residues (beads) from clusters that have a SC lattice arrangement until a predetermined residue concentration is achieved. All generated structures are checked for residue connectivity such that a path exists between any two residues. Two additional sets are prepared from the initial structures via random relaxation and a reverse Monte Carlo simulated annealing (RMC-SA) algorithm, which targets the average radial distribution function (RDF) of 210 globular proteins. The initial and relaxed structures are compared to real proteins via RDF, bond orientational order parameters, and several descriptors of network topology. Based on these features, results indicate that the structures generated with 40% occupancy via the proposed method closely resemble real residue networks. The broad correspondence established this way indicates a non-superficial link between the residue networks and the defect laden cubic crystalline order. The presented approach of identifying a minimalistic set of operations performed on a target lattice such that each resulting cluster possess structural characteristics largely indistinguishable from that of a coarse-grained globular protein opens up new venues in structural characterization, native state recognition, and rational design of proteins.


81
The presented hard sphere model for structure generation is novel in the sense that it  137 The latter is terminated once the average displacement of the beads reaches 0.5 Å. This particular 138 choice ensures that the average displacement is much smaller than the nearest neighbor spacing 139 and is adopted in light of average displacements attained by the RMC-SA method described below.

140
The reverse Monte Carlo (RMC) procedure is a reconstruction method that aims to evolve 141 a structure until the selected property (or properties) of the evolving structure mimics or best 142 represents the selected experimental property (43,44). Simulated annealing (SA) is a heuristic 143 global optimization method where the probability of accepting solutions is continuously decreased 144 as the solution space is explored. In the current study, the target property of the RMC-SA 145 procedure is chosen to be the average RDF of the 210 basis proteins up to 15 Å distance.

146
The RMC optimization proceeds with Boltzmann sampling using a simple Metropolis 147 scheme (45). Each attempted move is accepted with a probability (p), see Eq.
(1), where is the 148 energy difference between the attempted and current configurations of the relaxing structure. The 149 energy function, E, is defined as the integrated squared difference between the RDF of the 150 generated structure, ( , ), and the average RDF of the 210 basis proteins, (r), up to 15 Å, see 151 Eq. (2). (2) 154 In Eq.
(2), the index represents current or attempted structures at step , and the Boltzmann factor 155 is defined as follows: 157 where = 10 -6 is used as a scaling constant. The simulated annealing is implemented by updating 158 β -1 by 0.9 at every 10 3 steps. The combined RMC-SA procedure is performed for 5x10 5 steps or 159 until β∆ ≤ 10 -7 for at least 2x10 3 consecutive steps.

177
Originally BOO (50) is defined as a global order parameter, but we will use the local 178 definition that accounts for the bond vectors in the coordination shell for each constituent residue: Here denotes average over all bonds emanating from a residue i and is defined as follows: 182 where , (θ ,φ ) are spherical harmonics † computed at the polar, θ , and azimuthal, φ , angular 183 coordinates of the unit vector directed from residue i to its neighbor j. Note that the final 184 expression of the order parameter , in Eq. (4) does not depend on m for rotational invariance.

185
It is possible to systematically define higher order invariants with spherical harmonics (50).
186 The simplest and physically the most interesting higher order invariants are the properly 187 normalized third order invariants given as follows:  217 Network view provides a convenient framework for comparing two systems in terms of inter-unit 218 interactions. In the current study, networks are constructed from the residues of the generated 219 structures by connecting residues that were within a 7 Å cut-off distance of each other. The choice 220 of cut-off distance is motivated in part because 7 Å allows non-bonded interactions within the two 221 nearest coordination shells to be included (53-55). This cut-off value is also consistent with 222 previous Delaunay tessellation and cut-off scanning studies (56,57).

223
The contact topology of a residue network is stored in the N×N adjacency matrix , where 224 the , term is either equal to one or zero according to the presence or absence of interactions 225 between the i th and j th residues, respectively (54). Analysis of the adjacency matrix and its variants 226 (such as the Laplacian or the normalized Laplacian) leads to a wealth of information. For example, 227 a statistical mechanical treatment yields information on residue auto-and cross-correlations (58), 228 and the neighborhood structure of a residue and how information propagates to further neighbors 229 (55). Network models were also used to identify adaptive mechanisms in response to perturbations 230 (59), predict collective domain motions, hot spots, and conserved sites (58,60-62).

231
In terms of network characteristics, proteins share common structural similarities with 232 other self-organizing condensed matter systems (53). A quantity termed "relative contact order", 233 which may be derived from the adjacency matrix, is shown to highly correlate with the folding 234 rates measured for many two-state folders (63).
l In the current study, distributions and averages of local network parameters such as degree 236 (k i , number of neighbors of residue i, Eq. (7)), the nearest neighbor degree (k nn,i , the number of 237 neighbors of the neighbors of residue i, Eq. (8)), and the clustering coefficient ( ) of residue i, Eq.
238 (9), were monitored. In addition, a global parameter, the shortest path length ( ) of residue i is 239 also monitored (Eq. (10)). These network parameters are defined as follows:

253
The Laplacian ( ) is used extensively in the graph theory literature (64,65) and is defined 254 as follows: 256 where is a diagonal matrix with , = . The Laplacian is a positive-semidefinite matrix and 257 its eigenvalue spectrum is routinely used for quantitative characterization of networks. In the 258 current study, the normalized Laplacian ( * ) is used.
The main utility of the normalized Laplacian comes from the fact that all of its eigenvalues exist

3 Results and Discussion
273 Throughout this study, we will show that at least three simple requirements are necessary to 274 generate an artificial structure (from a crystalline template) that mimics the behavior of average

282
As mentioned earlier, instead of targeting a specific protein to mimic (or identifying) its 283 structure, we target the average behavior of many proteins based on the experimentally determined 284 coordinates. Therefore, as a first step, it is necessary to define a set of proteins ("the basis set") 285 from which the average structural features of real proteins can be defined and the generated 286 structures can be compared against.

3.1 The Average Protein Structure and the Selection of Lattice Template
288 To define an average protein structure, a set of 210 globular proteins was selected from 595 289 proteins, which are representative of four major protein folds with less than 25% sequence  308 Figure 1 The average BOO parameters, (dark grey) , and , (light grey), for l = 2, 4, and 6 for 309 the 210 proteins used as the basis set.

310
The RDF is an important metric of any packing consideration; therefore, any attempt to

315
The average RDF is marked by a sharp first coordination peak at 3.8 Å and is the due to 316 the bonds between nearest neighbors on the main chain. In addition, a less prominent secondary 317 peak is observed at ~5.6 Å. The ratio of the first and second peak locations puts fundamental 318 constraints on the placement of neighboring residues on a lattice. Fortunately, for the FCC and SC 319 lattices, the ratio of the first and second nearest neighbor distances is equal to (= 1.4142), 320 which is quite close to the ratio obtained from the average protein RDF (5.6/3.8 = 1.4737).
321 Therefore, it is plausible to accept FCC and SC lattices as candidate template lattices. However, 322 when the average number of nearest and next-to-nearest neighbors of nodes in proteins is 323 calculated from the area underneath each peak, it becomes evident that the first coordination shell 324 needs to be less crowded than the second coordination shell. However, in FCC, which is a close 325 packed structure, the opposite is true and therefore, the FCC lattice cannot be considered as a 326 template lattice. This leaves the SC lattice as the only candidate. Although this selection might be 327 viewed as counterintuitive, it has been shown that the bending and torsional angles of proteins  348 This finding suggests that in order to create structures similar to proteins, the SC clusters should 349 be emptied down to ~40% occupancy. We also note that 40% occupancy as well as the 60% void 350 concentration are both above the percolation threshold of 25% of the SC lattice (72). This suggests 351 that once the SC cluster is emptied down to 40% occupancy, the remaining residues could form a 352 connected network of points, which is an absolute requirement for any protein-like structure to be 353 considered viable. However, random removal of lattice sites does not guarantee that the remaining 354 lattice sites form a connected network; therefore, an additional constraint is imposed during the 355 random removal procedure to maintain the connectedness of the remaining lattice points. In the 356 current study, the connectedness is defined such that there must be at least one path between any 357 two remaining beads on the cluster through nearest neighbor contacts. Finally, three-dimensional  ). The sub-range from 7.6 Å to 388 9.3 Å corresponds to three on-lattice peaks that are representative of the 4 th , 5 th and 6 th closest 389 neighbors on the SC lattice ( Figure 5(a)), and it is clear that the RMC-SA relaxation procedure is 390 not able to reduce the high number of initial contacts that were present in the on-lattice structure 391 within this distance range. The reasons for this outcome are not known. However, one possible 392 explanation might be that the connectedness constraint imposed during the introduction of voids 393 is not strict enough to produce a linear chain, and therefore, side branches materialize after the 394 introduction of voids. Because the relaxation procedure does not remove branches, their existence 395 leads to a more compact structure and a slightly higher density within the core of the cluster. From 396 9.3 Å to 10.7 Å, the source of discrepancy is more straightforward. This range is the empty region 397 between the 6 th and 7 th nearest neighbors on SC lattice (see Figure 5(a)) and it is larger than the 398 average displacements introduced with the two relaxation methods. Therefore, RDF functions 399 pertaining to the generated structures attain lower values in comparison to the proteins.

400
The radial displacements of all residues and the average radial displacement as a function 401 of distance from the center of mass are shown in Figure 6 for the two relaxation procedures 440 because proteins have a slightly higher average contact number, the protein distributions might 441 translate to slightly lower average BOO parameters although this is not observed for , for l = 6.
442 In general the correspondence between the average protein and the relaxed structures is better for 443 , than it is for , for l = 4 and 6. The (unrelaxed) on-lattice clusters show a strong peak at 2, 444 = 0 (Figure 7(d)) because 2, vanishes for l = 2 in a cubic cluster (50) and the starting structure 445 will have points that sit at an ideal cubic environment. Although this peak quickly disappears once 446 the structure is relaxed through random moves and RMC-SA algorithm, the relaxed probability 447 distribution functions show the opposite variation in 2, than that is observed for the average 448 protein structure. This discrepancy seems to be the most prominent disagreement between the 449 proteins and relaxed structures.   471 However, because the current study aims to identify the simplest set of rules to generate structures 472 similar to folded proteins, a more rigorous approach is not pursued at this time.

473
To quantify the extent of similarity between the generated structures and the average 482 In the two-sample Kolmogorov-Smirnov test, the null hypothesis is rejected with a confidence 483 level () if the following inequality holds: 485 where and are the number of data points making up the respective CDFs. The 486 value of d  (tabulated in (73)) depends on  (a value of 0.05 is used generally) and sample size. In 487 the current study, the null hypothesis is that both groups were sampled from populations with 488 identical distributions; therefore, we would like to see that this null hypothesis is not rejected. To 489 provide a confidence level to (accepting or rejecting) the null hypothesis, we calculated p-values, 490 which indicate the probability of obtaining a result equal to or more extreme than what is actually 491 observed assuming that the null hypothesis is true. In general, the null hypothesis is rejected if the 492 p-value is less than a predetermined confidence level (generally 0.05 or 5%). The results of the 493 two-sample Kolmogorov-Smirnov tests are shown in Table 2. It is clearly seen from the p-values 494 that the null hypothesis (that the two samples, proteins vs. various generated structures, are drawn 495 from identical distributions) is not rejected for any of the generated structures at 40% occupancy.
496 As a result, it can be stated that the correspondence obtained between proteins and various 497 generated structures is a statistically significant observation.
498 The probability distributions of various network parameters of the average protein and 504 various generated structures are presented in Figure 8. 522 remarkably well. In addition, the region where λ > 1 is marked by a wealth of local motifs that are 523 realized by secondary structural elements, and therefore, it is expected to cause the biggest 524 discrepancies. However, the generated clusters successfully mimic the behavior of the average 525 protein structure in the region λ > 1. Finally, the overall distribution of the generated clusters is 526 observed to be typical of the molecular structure networks (53,74).

527
The results of the Kolmogorov-Smirnov tests on various network parameters are also listed 528 in Table 2. Based on the test results, the null hypothesis is once again not rejected, and therefore, 529 the possibility that the network parameter distributions of the real proteins and various generated 530 structures come from the same distribution cannot be overruled. 536 template used to create these generated clusters. One of the rules imposed on the generation 537 procedure is to empty the clusters such that the occupancy would be 40%. This leads to an excellent 538 agreement between the average degree (contact number) of the on-lattice clusters and real proteins 539 (see Figure 3). However, it is important to understand the effect of occupancy on the structure of 540 the generated clusters and their similarity to real protein structure.

541
To investigate the effect of occupancy, we use the random relaxation procedure because 542 there is very little difference between the random move algorithm and the more computationally 543 demanding RMC-SA algorithm. Three occupancies are studied: 20, 40 and 60%. In the case of 544 BOO parameters, increasing the occupancy led to the narrowing of the , distributions and shifted 545 them towards lower mean values (Figure 9 and Table 3). For l = 2 and 6, , distributions of the 546 generated structures at 40% occupancy matched the protein distribution more closely than those 547 at 20 and 60% occupancies. However, for l = 4, 60% occupancy distribution is the closest to that 548 of the protein. These observations are also reflected in the Kolmogorov-Smirnov statistics (Table   549 4). The BOO parameter , is not sensitive to occupancy especially for l = 4 and 6. For example, 557 occupancies are statistically identical to each other and to the average protein (Table 4).

Figure 9
559 Figure 9. The influence of occupancy on various BOO parameters for l =2, 4, and 6.
560 The degree and average neighbor degree show similar trends with respect to occupancy; 572 the distributions of the generated structures become wider and shift towards greater mean values 573 Figure 10(a) and (b)). In both cases, 40% occupancy is the closest to the average protein. In fact, 574 at 20 and 60% occupancy, the degree and nearest neighbor degree distributions of the generated 575 structures show no statistical similarity to that of the average protein as suggested by the 576 Kolmogorov-Smirnov analysis ( Table 4). The relationship between the degree and occupancy is 577 quite straightforward; the number of neighbors for any lattice site increases as a result of increasing 578 occupancy. The same effect is also reflected in the nearest neighbor degree. It is obvious that a 579 larger occupancy than 40%, but less than 60% is required to match the degree and average neighbor 580 degree distributions of real proteins.

Figure 10
582 Figure 10. Comparison of the distributions of (a) degree (contact number), (b) nearest neighbor 583 degree, (c) clustering coefficient, (d) the shortest path length, and (e) the eigenvalue of the 584 normalized Laplacian for the average protein and generated clusters with respect to cluster 585 occupancy.

586
The clustering coefficient distributions show slight variations with respect to different 587 percentage of occupancy; the distributions become slightly narrower with lower mean values upon 588 increasing occupancy (Figure 10(c)). The clustering coefficient distribution at 60% occupancy is 589 closest to that of the average protein, although the Kolmogorov-Smirnov analysis did not reveal 590 any statistical difference with respect to occupancy. This observation shows that a protein residue