Scaling laws of graphs of 3D protein structures

The application of graph theory in structural biology offers an alternative means of studying 3D models of large macromolecules such as proteins. However, basic structural parameters still play an important role in the description of macromolecules. For example, the radius of gyration, which scales with exponent ~0.4, provides quantitative information about the compactness of the protein structure. In this study, we combine two established approaches, graph theory and fundamental scaling laws, to study 3D protein models. We show that the mean node degree of the protein graphs, which scales with exponent 0.038, is nearly scale-invariant. In addition, proteins that differ in size have a highly similar node degree distribution, which peaks at node degree 7 and conforms to the same statistical properties at any scale. Linear regression analysis showed that the graph parameters (radius, diameter and mean eccentricity) can explain up to 90% of the total variance of the radius of gyration. Moreover, the graph parameters of radius, diameter and mean eccentricity scale with the same exponent as the radius of gyration. The main advantage of graph eccentricity over the radius of gyration is that it can be used to analyse the distribution of the central and peripheral amino acids (nodes) of the macromolecular structure. The central nodes are hydrophobic amino acids (Val, Leu, Ile, Phe), which tend to be buried, while the peripheral nodes are more hydrophilic residues (Asp, Glu, Lys). Furthermore, it is shown that the number of central and peripheral nodes is related more to the fold of the protein than to the protein length.

[…] between a pair of residues was less than (or equal to) 7 Å. It follows that the number of nodes was equal to the number of residues (Cα atoms) in the protein. Ligands, water molecules, and other hetero-compounds were discarded during graph construction. Thus, if a protein has $n$ residues, then a protein graph $G = G(V, E)$ consists of a set of vertices (nodes) $V = \{v_1, v_2, \ldots, v_n\}$ and a set of edges $E = \{e_1, e_2, \ldots, e_m\}$.

Meanwhile, the diameter is defined as the maximum eccentricity among all vertices in the graph. The center of a graph, or central node, has eccentricity equal to the radius. A vertex is said to be a peripheral node if its eccentricity is equal to the diameter.

Next, an R script (igraph package) was used to calculate mean node degree, eccentricity, mean eccentricity, radius and diameter:

    library(igraph)
    MND <- mean(degree(G))  # mean node degree of graph G

Amino acid       Total solvent-exposed area (Å²)
Arginine         265
Asparagine       187
Aspartate        187
Cysteine         148
Glutamate        214
Glutamine        214
Glycine          97
Histidine        216
Isoleucine       195
Leucine          191
Lysine           230
Methionine       203
Phenylalanine    228
Proline          154
Serine           143
Threonine        163
Tryptophan       264
Average          192

Results and Discussion

Node degree: nearly scale-invariant

One of the fundamental global graph parameters in graph theory is the mean node degree (MND).
The mean node degree shows how many edges each node has on average. In previous work, Pražnikar and co-workers (33) have shown that protein models that deviate from the expected MND by approximately two standard deviations or more are likely to be incorrect. Furthermore, the scaling exponent calculated in that study is close to zero, which indicates that the mean node degree is nearly scale-invariant.
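The graph construction and the mean node degree can be sketched in R with igraph as follows. This is a minimal illustration and not the authors' script: it assumes that the 7 Å cutoff applies to Cα-Cα distances, and the function name `build_protein_graph` and the toy coordinates are hypothetical.

```r
library(igraph)

# Build an undirected residue contact graph from C-alpha coordinates.
# `ca_xyz` is an n x 3 numeric matrix (one row per residue);
# `cutoff` is the distance threshold in angstroms (assumed C-alpha to C-alpha).
build_protein_graph <- function(ca_xyz, cutoff = 7) {
  d <- as.matrix(dist(ca_xyz))   # pairwise Euclidean distances
  adj <- (d <= cutoff) * 1       # 1 where two residues are in contact
  diag(adj) <- 0                 # a residue is not its own neighbour
  graph_from_adjacency_matrix(adj, mode = "undirected")
}

# Hypothetical toy coordinates: five points on a line, 3.8 A apart
ca_xyz <- cbind(x = seq(0, by = 3.8, length.out = 5), y = 0, z = 0)
G <- build_protein_graph(ca_xyz)

MND <- mean(degree(G))   # mean node degree of graph G
```

With these toy coordinates only consecutive points lie within 7 Å of each other, so the graph is a simple chain. Real proteins are folded, which is why their node degrees are much higher than in this example.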
In this study, a large non-redundant database of biological units (31,571) was used, rather than crystal asymmetric units. Figure 1A shows that MND is not strongly dependent on protein size and that the distribution is rather narrow. Upon closer examination, however, the value determined in our study (0.038) differs slightly from the value (0.024) determined in the previous study. The reason for the different scaling exponents is probably that the datasets are not the same.
An analysis performed on two large but different datasets shows that MND is nearly scale-invariant, i.e., the scaling exponent is close to zero (0.024 and 0.038). Thus, MND is not strongly dependent on protein length. Because the sum of all node degrees equals twice the number of edges, the number of edges is $m = n \cdot \mathrm{MND} / 2$, and it can be concluded that the number of edges in the protein graph is approximately linearly related to the number of nodes (amino acids).

Figure 1. […] (B) Probability of node degree of protein graphs for three size bins. The first size bin (black line) encompasses protein structures with length between 100 and 200 residues, the second size bin (blue line) encompasses protein structures with length between 500 and 600 residues, and the third size bin encompasses structures with length between 900 and 1000 residues.
We could expect that larger proteins would have a higher average node degree because of a higher number of core residues and a relatively lower number of surface residues, which are expected to have fewer edges. Thus, to further analyse the node degree of protein nodes (residues), we calculated the node degree distribution for three different size bins. The first size bin contains all protein structures from the database with lengths between 100 and 200 residues, the second contains proteins with lengths between 500 and 600 residues, and the third contains structures with lengths between 900 and 1,000 residues. Figure 1B shows that all three distributions are very similar and that there is a peak at node degree 7. The comparison of all three peak values shows that the first size bin, which contains the smallest proteins, has the highest probability density value. The lowest probability density value at node degree 7 is seen for the third size bin, which contains the largest proteins of the three selected size bins. A closer look at the left (degree 2) and right (degree 14) tails shows high similarity for all three distributions. Visually, the most significant differences lie on the left and right sides of the peak, at approximately 5 to 9 node degrees. The first size bin has a higher probability at node degrees 3 to 6 than size bins two and three. The order is somewhat reversed on the right side of the displayed distribution. For node degrees 8, 9 and 10, size bin three exhibits higher values than size bins one and two.
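A pooled node degree distribution for one size bin can be sketched as follows. This is an illustrative sketch only: the function name `degree_distribution_bin` is hypothetical, and the toy graphs stand in for real protein graphs, so the resulting peak has no biological meaning.

```r
library(igraph)

# Pooled node degree distribution for all graphs whose number of nodes
# lies in [min_len, max_len]. Returns probabilities named by node degree.
degree_distribution_bin <- function(graphs, min_len, max_len) {
  sizes <- vapply(graphs, vcount, numeric(1))
  in_bin <- graphs[sizes >= min_len & sizes <= max_len]
  degs <- unlist(lapply(in_bin, degree))   # degrees of every node in the bin
  counts <- table(degs)
  counts / sum(counts)
}

# Hypothetical toy "database" in place of protein graphs
graphs <- list(make_ring(6), make_star(5, mode = "undirected"), make_ring(20))

p <- degree_distribution_bin(graphs, min_len = 5, max_len = 6)
peak_degree <- as.integer(names(p)[which.max(p)])
```

Applied to the three bins in the text (100 to 200, 500 to 600 and 900 to 1,000 residues), the same function would give the three curves that are compared in Figure 1B.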
This analysis shows that, despite their different sizes, proteins have a very similar node degree distribution, which peaks at node degree 7. A simple way to explain the presented results is that buried residues in small or large proteins form approximately the same number of links. This is a direct consequence of the fact that amino acids are physical objects and cannot be arbitrarily close to each other. The marginal difference in node degree distributions, a slight shift to higher node degrees in larger proteins, explains the low positive scaling exponent (0.038), which indicates that MND is nearly scale-invariant.

Protein graph eccentricity: an alternative method for analysis of radius of gyration

A natural question arises: the radius of gyration and the radius as a graph parameter share a name, but do they follow the same power law? To answer this question, linear regression analysis was performed and scaling exponents were calculated for three graph parameters: radius, diameter and mean eccentricity. The eccentricity of a node is defined as the maximum shortest-path distance between that node and any other node. The radius of the graph is the minimum eccentricity, and the diameter is the maximum eccentricity.
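The relation between eccentricity, radius, diameter and mean eccentricity can be illustrated on a small graph. The five-node chain below is a hypothetical toy example, not a protein graph.

```r
library(igraph)

# Hypothetical toy graph: a simple chain A-B-C-D-E
G <- graph_from_literal(A - B - C - D - E)

ecc <- eccentricity(G)        # longest shortest path from each node
graph_radius <- min(ecc)      # minimum eccentricity, same as radius(G)
graph_diameter <- max(ecc)    # maximum eccentricity, same as diameter(G) here
mean_ecc <- mean(ecc)         # mean eccentricity, may be non-integer

# Central nodes have eccentricity equal to the radius,
# peripheral nodes have eccentricity equal to the diameter.
central_nodes <- V(G)$name[ecc == graph_radius]
peripheral_nodes <- V(G)$name[ecc == graph_diameter]
```

In this chain the middle node C is the only central node and the two end nodes A and E are peripheral. The radius and diameter are integers (2 and 4), while the mean eccentricity is 3.2.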
Figure 2A shows the scaling exponent of the radius of gyration for 31,571 selected protein structures.
The non-linear fitting function can be written as

$R_{\mathrm{gyr}} = R_0 N^{\nu}$,

where $R_{\mathrm{gyr}}$ is the radius of gyration, $N$ is the number of residues, $R_0$ is the pre-factor and $\nu$ is the scaling exponent. The pre-factor $R_0$ can be obtained experimentally and used as a restrained value during non-linear fitting (34-36).
Thus, when restrained fitting was performed, the pre-factor ($R_0 = 2$ Å) was fixed. In the case of restrained fitting, the scaling exponent is 0.405, which is consistent with other studies.

Similar to the analysis of the radius of gyration, the power exponent was fitted with and without restraint. The pre-factor for restrained fitting was derived from linear regression analysis, as shown in Figure 3. The linear regression analysis between the radius of gyration and graph parameters reveals that $R^2$ is close to 0.90 for all three cases (Fig. 3). The highest $R^2$ (0.91) is observed between mean eccentricity and radius of gyration (Fig. 3C). The reason for this is probably that the values of radius and diameter are discrete, while mean eccentricity values are not. For example, the radius can be 7 or 8, but cannot take a non-integer value between 7 and 8. Mean eccentricity, in contrast, is the average of the eccentricities of all nodes and can therefore take non-integer values. Accordingly, the distribution of mean eccentricity is smoother than the discrete values of radius and diameter on the y-axis.

If we use the pre-factor $R_0$ of the radius of gyration, which was obtained from experimental data, then we can calculate the pre-factors for radius, diameter, and mean eccentricity using the slope $k$ from a regression analysis. The slope of the linear regression model was 0.42 Å⁻¹ between $R_{\mathrm{gyr}}$ and radius, 0.77 Å⁻¹ between $R_{\mathrm{gyr}}$ and diameter, and 0.59 Å⁻¹ between $R_{\mathrm{gyr}}$ and mean eccentricity (see Fig. 3A, B and C). Using the pre-factor $R_0$ and the slope $k$ of the linear fit, the pre-factors for radius, diameter and mean eccentricity can be calculated using the following expression:

$R_x = R_0 \, k$,

where $R_x$ is the newly calculated pre-factor, $R_0$ is the pre-factor of the radius of gyration and $k$ is the slope of the linear fit. Figure 2B, C and D show the calculated pre-factors for radius (2.0 Å × 0.42 Å⁻¹ = 0.84), diameter (2.0 Å × 0.77 Å⁻¹ = 1.54) and mean eccentricity (2.0 Å × 0.59 Å⁻¹ = 1.18). The restrained scaling exponent is higher than the non-restrained scaling exponent in all three cases. Furthermore, the restrained graph parameters all have very similar scaling exponents […] (0.405).
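The difference between restrained and unrestrained fitting, and the conversion of pre-factors, can be sketched as follows. This is not the authors' procedure: the text describes non-linear fitting, whereas this sketch swaps in ordinary linear regression on log-transformed values, which fits the same power law because $\log R_{\mathrm{gyr}} = \log R_0 + \nu \log N$. The data are synthetic and noise-free, and all variable names are hypothetical; only $R_0 = 2$ Å and the three slopes are taken from the text.

```r
# Synthetic data (hypothetical): an exact power law R_gyr = 2 * N^0.4
N <- c(100, 200, 400, 800, 1600)
R_gyr <- 2 * N^0.4

# Unrestrained fit: both the pre-factor and the exponent are free
free_fit <- lm(log(R_gyr) ~ log(N))
nu_free <- unname(coef(free_fit)[2])
R0_free <- unname(exp(coef(free_fit)[1]))

# Restrained fit: the pre-factor is fixed at R0 = 2 A, so
# log(R_gyr / R0) = nu * log(N) is a regression through the origin
R0 <- 2
restrained_fit <- lm(log(R_gyr / R0) ~ 0 + log(N))
nu_restrained <- unname(coef(restrained_fit)[1])

# Pre-factors of the graph parameters, R_x = R0 * k, using the slopes from the text
k <- c(radius = 0.42, diameter = 0.77, mean_eccentricity = 0.59)
R_x <- R0 * k
```

On noise-free data both fits return the same exponent. On real, scattered data the two exponents generally differ, because fixing the pre-factor removes one degree of freedom from the fit.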
Thus, this study shows that the radius of gyration, which is calculated from the atomic coordinates, and the radius of the graph follow the same scaling exponent. From this we can conclude that when analysing 3D models of macromolecules using a graph-theoretical approach, the eccentricity of the graph can be used to estimate the radius of gyration. Thus, graph parameter eccentricity […]

[…] kind of analysis has some common points with the analysis of solvent-exposed residues, which is directly related to the arrangement of residues in 3D space. Buried residues constitute the core of the protein, whereas residues exposed to the solvent represent the outer part of the protein in 3D space (40). The molecular mass of a protein is related to the total solvent-exposed surface by the following expression: […] where $M$ is the molecular mass of the protein (41). Similarly, we can introduce the relation between protein length and total solvent-exposed surface. Given that the average amino acid has a total solvent-exposed area of 192 Å² (32), we can use this value as a restraint during data fitting. Figure 4 shows the protein length against the total exposed area. The scaling exponent of the fitted curve is […]

Furthermore, an additional comparison between the expected node degree and the node degree of a candidate could be used to explore and interpret large deviations. For example, intrinsically disordered proteins are expected to have a considerably lower mean node degree than globular proteins of the same size.
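A comparison between expected and observed mean node degree could be sketched as follows. This is a hedged illustration, not a published procedure: the function `mnd_deviation`, the use of the residual standard deviation as the yardstick, the reference data and the pre-factor 6.5 are all assumptions made for the example. Only the exponent 0.038 and the threshold of two standard deviations come from the text.

```r
# Compare a candidate's mean node degree (MND) with the value expected for its
# length, using a power law fitted to a reference set on a log-log scale.
mnd_deviation <- function(ref_length, ref_mnd, cand_length, cand_mnd, n_sd = 2) {
  fit <- lm(log(ref_mnd) ~ log(ref_length))
  expected <- unname(exp(predict(fit, newdata = data.frame(ref_length = cand_length))))
  res_sd <- sd(ref_mnd - exp(fitted(fit)))   # spread of the reference MND around the fit
  z <- (cand_mnd - expected) / res_sd
  list(expected = expected, z = z, flagged = abs(z) >= n_sd)
}

# Hypothetical reference set: a power law with exponent 0.038 and small scatter
ref_length <- c(100, 200, 400, 800, 1600, 3200)
ref_mnd <- 6.5 * ref_length^0.038 * c(1.02, 0.98, 1.01, 0.99, 1.02, 0.98)

# Hypothetical candidate with an unusually low MND for its length
res <- mnd_deviation(ref_length, ref_mnd, cand_length = 400, cand_mnd = 4)
```

A strongly negative deviation of this kind would be in line with the low mean node degree expected for an intrinsically disordered protein, but it could equally point to an incorrect model, so the flag should be read as a prompt for closer inspection.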
The comparison between the mean eccentricity of the graph and the radius of gyration revealed a high $R^2$. Consistent with this, the mean eccentricity and the radius of gyration follow the same scaling exponent (~0.4). The eccentricity of the graph, in addition to providing an estimate of the radius of gyration, also allows us to study the distribution of central (buried) and peripheral (non-buried) amino acids. We should be aware that the mean eccentricity (or the radius of gyration) alone, when used as a constraint in molecular dynamics simulations or in manual model building, does not by itself establish the correctness of the protein model. It is also crucial to determine how the amino acids are distributed in real space, and this can be elucidated by studying peripheral and central nodes. Thus, a single graph parameter (eccentricity) can be used to control both the compactness of the macromolecule and the distribution of amino acids in 3D space, which makes it a valuable tool for analysing protein models.