Folding kinetics of an entangled protein

The possibility of the protein backbone adopting lasso-like entangled motifs has attracted increasing attention. After discovering the surprising abundance of natively entangled protein domain structures, it was shown that misfolded entangled subpopulations might become thermosensitive or escape the homeostasis network just after translation. To investigate the role of entanglement in shaping folding kinetics, we introduce a novel indicator and analyze simulations of a coarse-grained, structure-based model for two small single-domain proteins. The model recapitulates the well-known two-state folding mechanism of a non-entangled SH3 domain. However, despite its small size, a natively entangled antifreeze RD1 protein displays a rich refolding behavior, populating two distinct kinetic intermediates: a short-lived, entangled, near-unfolded state and a longer-lived, non-entangled, near-native state. The former directs refolding along a fast pathway, whereas the latter is a kinetic trap, consistently with known experimental evidence of two different characteristic times. Upon trapping, the natively entangled loop folds without being threaded by the N-terminal residues. After trapping, the native entangled structure emerges by either backtracking to the unfolded state or threading through the already formed but not yet entangled loop. Along the fast pathway, trapping does not occur because the native contacts at the closure of the lasso-like loop fold after those involved in the N-terminal thread, confirming previous predictions. Despite this, entanglement may appear already in unfolded configurations. Remarkably, a longer-lived, near-native intermediate, with non-native entanglement properties, recalls what was observed in cotranslational folding.

The biological functionalities of proteins are determined by the properties of their native 2 states. The well defined compact native structure of a globular protein is achieved, in 3 the aqueous cytoplasm medium, through the physical folding of the protein chain after 4 assembly and release at the ribosome, where several chaperons assist the folding of the 5 nascent chain. According to Anfinsen's thermodynamic hypothesis [1], small globular 6 proteins are able to fold "in vitro" into the correct native structure in a reproducible 7 manner, without the help of any cellular machinery. Despite the daunting complexity of 8 the protein folding process, e.g. the important role played by the solvent degrees of 9 freedom, a large body of research activity showed in recent decades that its kinetic and 10 thermodynamic properties are encoded in simple descriptors of the native state 11 structure, even within a coarse-grained description [2][3][4]. 12 For instance, the list of residue pairs that are in close spatial contact with each other 13 in the native structure, the contact map [5][6][7], defines the corresponding native 14 interaction network. Together with features characterizing the local native geometry of 15 the protein chain, the contact map can be used to define an implicit solvent 16 structure-based energy function, whose global minimum is attained for the native 17 structure. This simple Go-like approach can be set up in several different flavours [8], 18 and typically allows to predict the folding nucleus, i.e. the subset of residues whose 19 interaction network needs to be formed for folding to proceed correctly, in good 20 agreement with experimental data [9]. Notably, the same approach allows to predict 21 successfully folding mechanisms more involved with respect to the typical two-state 22 scenario, detecting the presence of thermodynamic and/or kinetic intermediates [9]. 23 Coarse-grained structure-based models proved very insightful also in the study of the 24 cotranslational folding process by means of numerical simulations [10]. 25 Similarly, the loops formed between residues in contact with each other in the native 26 structure have an average chemical length, the contact order [11], which is strongly 27 correlated with the folding time for two-state folders [12][13][14][15]. The contact order is just 28 one basic feature of the interaction network of native contacts. In general, the simpler 29 the network, the faster the predicted folding [16]. The organization of the network of 30 contacts, however, does not necessarily capture the topology of the protein backbone 31 described as a curve in the three-dimensional space, as well as the possible formation of 32 knots and other entangled motifs. The discovery of knots in a few proteins [17] was a 33 surprise because they seem to make the folding process unnecessarily hard. Their integrals for discretized and possibly open curves [35,[39][40][41]. A crucial question is 47 whether and how these topologically entangled motifs affect the protein energy 48 landscape and the folding process. 49 Very recently, a striking connection was established between the presence of 50 misfolded protein sub-populations during and just after protein translation and their 51 entanglement properties. For example, an abundance of entangled motifs characterizes a 52 subset of proteins prone to misfolding and aggregation under heat stress when newly 53 synthesized, but not once matured [42]. This may suggest that those proteins rely more 54 on the protein homeostasis machinery to reach their native states. Along the same lines, 55 using coarse-grained structure-based models of protein translation, it was predicted that 56 one-third of proteins can misfold into soluble less-functional states, that bypass the 57 protein homeostasis network, avoiding aggregation and rapid degradation [43]. Such 58 misfolded species were characterized as long-lived kinetic traps, native-like in several 59 respects, with their misfolding due to a change in the entanglement properties of the 60 native structure. Moreover, the shift in the competition between differently entangled 61 misfolded subpopulations, along cotranslational folding pathways, was shown to 62 determine the functional impact of synonymous mutations [44]. 63 The Gaussian entanglement was found to be significantly correlated with the "in 64 vitro" folding rate [15], so that a higher degree of entanglement slows down the folding 65 process. Interestingly, the Gaussian entanglement and the contact order can be 66 combined to improve the predictions of folding rates [15], showing that the 3d topology 67 adds some extra information over the underlying interaction network. Entangled loops 68 are looped lasso-like segments of a protein chain displaying large Gaussian entanglement 69 when threaded by another segment, possibly not looped, of the same protein chain. 70 Recently, it was discovered that they are present in roughly one third of known single 71 domain proteins [37], much more than knots [18]. Moreover, the amino acids at the end 72 of a loop interact with each other with an energy significantly weaker, on average, if the 73 loop is entangled [37]. 74 According to the well established paradigm of minimal frustration [45,46], energetic 75 interactions in proteins are optimized in order to avoid as much as possible the presence 76 of unfavorable interactions in the native state. Although non optimized interactions 77 may result in kinetic traps along the folding pathway, some amount of residual 78 frustration has been detected and related to functionality and allosteric transitions [47]. 79 The above observation shows that non optimized interactions are present also in relation 80 to topologically entangled motifs, an example of residual topological frustration. As a 81 possible mechanistic explanation for this, it was hypothesized that the premature 82 formation of entangled loops could cause the subsequent threading by another segment 83 of the protein to become a kinetic bottleneck [37]. Therefore, it is likely better for 84 entangled loops to be established in the later stages of the folding process. This 85 hypothesis was further corroborated by another observation: entangled loops are found 86 in an asymmetric location with respect to the chain portion they are entangled with, 87 the thread, such that the latter is found more frequently on the N-terminal side of the 88 loop [37]. In the context of co-translational folding [48], this implies that entangled 89 loops are synthesized at the ribosome, and hence folded, on average later than the 90 thread. This "late entanglement avoids kinetic traps" hypothesis was verified within a 91 toy lattice model for protein evolution targeting the folding speed [49]. Protein 92 sequences optimized to fold as quickly as possible into an entangled native structure 93 were indeed characterized by weak interactions at the ends of entangled loops.

94
In this contribution, we study the folding behaviour of the small antifreeze RD1 95 protein, which is natively entangled, by means of molecular dynamics simulation within 96 a coarse-grained structure-based implicit solvent approach. Our aim is to verify, in exactly the same way as in [9]. We model the long range pairwise interactions through a 114 Lennard-Jones 12/6 potential, whereas a 12/10 Lennard-Jones potential was used 115 instead in [9]. Attractive long range interactions are considered only for residue pairs 116 found in contact with each other in the native structure. We define native contacts 117 based on heavy atom positions (see the Structure-based energy function subsection for 118 details), although in a different way from [9]. The weights of the different terms in the 119 energy function are the same as in [9]. We implement Langevin thermostatted 120 molecular dynamics through the LAMMPS software [53] (see the Langevin dynamics 121 subsection for details). In this contribution we will present results from two different 122 sets of Langevin molecular dynamics simulations: equilibrium-sampling ones at/around 123 the folding transition temperature and refolding simulations below the folding transition 124 temperature.

125
Topological properties of the SH3 domain and the RD1 protein 126 In this work we study and compare the folding processes of two short protein chains defined for a pairwise contact within a protein chain). In the rest of this work we will 133 evaluate the overall entanglement of a protein configuration by using a weighted average 134 ⟨G ′ ⟩ of the Gaussian entanglements for all the native contacts which are formed in that 135 configuration. The weight function used in the average is shown in both lower panels of 136 Fig 1. It is defined through a soft activation threshold g 0 = 0.5 in such a way that the 137 more their values of |G ′ | are above (respectively below) threshold, the more 138 (respectively the less) the native contacts contribute to the average (see the Gaussian 139 entanglement subsection for details). Note that the entanglement can appear with 140 either positive or negative chirality, depending on the sign of G ′ or ⟨G ′ ⟩. 141 src SH3 is a 56-residue fragment from the tyrosine protein kinase domain, well 142 characterized in the literature as a prototypical two-state folder both experimentally [54] 143 and in computer simulations [9,[50][51][52]. As seen in the lower right panel of Fig 1,  The submicrosecond protein folding kinetics of RD1 was studied using a photolabile 149 caging strategy with time-resolved photoacoustic calorimetry [55]. As seen in the lower 150 left panel of Fig 1, 30  complexity. The largest Gaussian entanglement G ′ = −1.03 is associated to the native 154 contact between residues N14 and K61 (in green in Fig 1a). The loop joining the two 155 residues is entangled with the thread spanning the residues from N1 to I11. The set of 156 the other 29 entangled native contacts, with −1 < G ′ < −0.75, is formed between 157 residue pairs in the sequence portions P12-A24 and A34-L55. All these loops can be 158 grouped together according to the clustering procedure performed in [37]. For 159 illustration purposes, one such loop is highlighted in red in Fig 1a,    increased cooperativity of the 12/10 flavour of the Lennard-Jones potential is likely to 196 be a general property due to its faster decay at long distances. The shape of the free 197 energy profile is now more similar to the one reported in [9] for the SH3 domain, 198 although the height of the barrier is still lower and the positions that we find on the Q 199 axis for the unfolded and the folded states are less separated. This differences are then 200 due to the dissimilar definition of contact map that we use. similar to what found in [9]. The interactions between the 3 β-strands at the centre of 205 the sequence are already partially formed, while the contacts involving the 20 N-terminal 206 residues contribute less to the transition state structure, a description in agreement with 207 experimental results [54]. The properties of the transition state ensemble are much more 208 robust than the shape of the free energy profile, when changes in the definition of native 209 pairwise attractive interactions are made within a structure-based energy function.     The entangled RD1 protein exhibits a two-state behaviour at the 251 folding transition 252 The results of 8 long simulation runs at equilibrium (sample time series are shown in S5 253 Fig), carried out at different temperatures for the RD1 protein, were analyzed using the 254 WHAM method [56][57][58] in order to estimate the configurational entropy S (Q) as a 255 function of the fraction of native contacts Q, allowing to obtain the free energy profile 256 F (Q) as a function of Q and the specific heat (see the Weighted histogram method 257 subsection for details).

258
As shown in S2 Figa, the specific heat exhibits a sharp peak as a function of 259 temperature, whose position is used to locate the folding temperature T F . The  The height of the free energy barrier between the unfolded state and the folded one 264 is ∆F ≃ 3.2κ B T , a much higher value than the one found for the SH3 domain when

312
The different types of observed refolding trajectories are summarized in Table 1 Table 1. Different refolding pathways for the entangled RD1 protein.
Folding pathway Number of instances Mean folding time The number of times a given folding pathway type is observed in the refolding of the RD1 protein is reported, together with the average folding time observed for each subset of trajectories. Horizontal dots represent possible transitions among meta-stable states, whereas tilted dots stress that trajectories classified in the fast folding pathway do not explore the IT ensemble. The folding time of a given trajectory is defined as the first time for which Q ≥ 0.8 (verified using the block average of Q, see the Logistic fit of contact formation curves subsection). Time is measured in MD steps (see the Langevin dynamics subsection for details).    The early formation of contacts involving the natively entangled 377 thread is crucial to select the fast folding channel for the RD1 378 protein 379 In order to gain a mechanistic insight into the difference between the fast (U → IE − → 380 F) and the slow (U → IT → · · · → F) refolding channels, we focus on the different 381 properties of the intermediate states IE − and IT.

382
In the previous subsection we already discussed how the threading refolding channel 383 involves the N -terminus to thread an already formed loop in order to establish the 384 native topological complexity (see S2 Video and the configuration snapshots IT and F 385 in Fig 6).    Overall, this analysis provides evidence that, for efficient and fast folding of the RD1 410 protein to occur, the thread needs to fold correctly prior to the closure of the natively 411 entangled loop. When this does not happen, the contacts that would be entangled in 412 the correctly folded structure, form without originating the native entanglement, 413 trapping the RD1 protein in the IT intermediate.  Taken together, these results thoroughly confirm the hypothesis that, for a fast and 449 efficient folding process of the RD1 protein, entangled contacts need to be formed at the 450 later stages, after the contacts involving the thread have been already established. Non covalent lasso-like entanglement motifs were recently found in a large fraction of 473 protein native structures [37]. These motifs are present even in small proteins, such as 474 the 64-residues Type III antifreeze RD1 protein studied in this work, as shown in Fig 1a.  To investigate how entangled motifs are formed and how the topological frustration 482 due to entanglement is dealt with during the folding process, we adopted one of the 483 earliest and most popular structure-based models [9], where the protein chain is 484 coarse-grained at the level of one CA atom per residue, with two minor adjustments. 485 We expect the obtained results not to change qualitatively with different definitions of 486 the energy function or different coarse-graining schemes, as long as the topological 487 complexity is properly described.

488
Our model validation on the non-entangled SH3 domain reproduced its well known 489 two-state folding behaviour (see Fig 2a inset), although with a much reduced 490 cooperativity of the folding transition with respect to the original model [9]. Crucially, 491 however, the probability of native contact formation in the transition state ensemble is 492 not affected by changing the flavour of energy function (see Fig 2). We thus expect the 493 results presented for the natively entangled RD1 protein, based on the presence and  Table 1).

502
Moreover, we find that entangled unfolded configurations IE, with a low fraction of 503 native contacts, are visited dynamically with relative ease, even with an opposite 504 chirality (IE + ) with respect to the native RD1 structure (see Fig 5 and Fig 6). Note 505 that entangled motifs with flipped chiralities are observed for misfolded species of large 506 proteins in structure-based coarse-grained simulations of cotranslational folding [43].

507
Such misfolded structures were predicted to be stable for at least hours by means of 508 physics based all atom simulations [60], suggesting that they are not an artifact due to 509 using a structure-based energy function with an entangled native structure. More 510 generally, our results highlight the possible role played by entangled yet essentially 511 unfolded configurations in determining which folding pathway is chosen. Different 512 entangled motifs could indeed originate already in the early stages of the folding process, 513 with a low fraction of native contacts, to be then possibly "frozen" when the fraction of 514 native contacts is increased. It will be interesting to study whether and how the kinetic 515 partitioning between the many differently entangled misfolded subpopulations observed 516 in cotranslational folding simulations [43,44] can arise through differently entangled 517 unfolded configurations/subpopulations. The entanglement parameter ⟨G ′ ⟩ introduced 518 here, should be an essential indicator for characterizing the partitioning of states. proteins [43,44]. In the latter case, entangled motifs are typically deep (far away from 522 sequence termini), whereas for the small RD1 protein the entanglement due to the 523 threading of the N-terminal portion (see Fig 1a) is shallow. As a result, both the direct 524 folding from IT to the native structure upon threading (see S2 Video and Fig 4b) and 525 the backtracking from IT to the unfolded state (see Fig 4c)  The correct early folding of the N-terminal thread, forming a set of "trap-avoding" 543 contacts, is crucial for fast and efficient folding (see Fig 7). On average, if the contacts 544 that are entangled in the native state (blue shaded region in Fig 7) form before the 545 latter ones, trapping in IT without formation of the native-like entanglement follows.

546
Notably, only few specific "first-entangling" contacts are enough to originate the shows that the contacts, whose formation fluctuates more, form on average earlier (see 554 Fig 9). To the best of our knowledge, this is a novel observation.

555
Remarkably, the rich refolding behaviour observed for the natively entangled RD1 556 protein in Fig 5 is a kinetic effect, occurring below the folding transition temperature, 557 at T = 0.9T f . On the other hand, a more typical (for small proteins) two-state folding 558 behaviour is observed at equilibrium at T = T f (see Fig 3). The presence of kinetic 559 transient intermediates, that are not populated at equilibrium, is well known in general 560 in protein folding [59,62], generating a rich and complex folding behaviour. Interestingly, 561 both possible roles postulated in general for folding intermediates are observed in our 562 simulations of the RD1 protein refolding kinetics: a productive intermediate, IE − , that 563 speeds up folding, and a trap intermediate, IT, that slows it down. This observation for 564 a protein as small as RD1 is striking, pointing again to the crucial role of entanglement. 565 All the results discussed for the RD1 protein confirm the "late entanglement" 566 hypothesis made on general grounds [37]; entangled contacts need to form on average in 567 the latter stages of a fast and efficient folding process and their premature formation 568 (before the thread they entangle with in the native structure is already in place) leads to 569 trapping into kinetic bottlenecks without formation of the native entanglement. Yet, simulation of structure-based models, highlighting the presence of the knotted topology 574 in the early stages of the folding process [63]. On the other hand, the role of non-native 575 interactions, not considered in the model used here, should also be taken into 576 account [64].

577
An interesting question is whether and how the complex behaviour exhibited by the 578 RD1 protein is modified for cotranslational folding. It is plausible that trapping in IT is 579 |G ′ | ≃ 3) [37]; coarse-grained models such as the one we used here are clearly needed to 587 study their folding mechanisms in details.

588
The few experimental results available for the "in vitro" folding of the RD1 protein 589 are consistent with our findings. RD1 is a Type III antifreeze protein from the 590 Antarctic eel pout, Lycodichthys dearborni [65]. The folding equilibrium upon chemical 591 denaturation was shown to be a reversible two-state process with no populated 592 intermediates, for the HPLC-12 type III antifreeze protein from the North-Atlantic 593 ocean pout Macrozoarces americanus [66], a close isoform of the RD1 protein. The   594 folding kinetics cannot be monitored by general detection methods, since RD1 lacks an 595 intrinsic fluorescence probe and any clear spectroscopic differences between the folded 596 and unfolded states [55]. However, a photolabile caging strategy followed by 597 time-resolved photoacoustic calorimetry allowed to gather evidence of two distinct 598 refolding events with different characteristic times [55]. Significantly, this result is folding from U to F in our interpretation), whereas the slow event to a much smaller 602 volume change (the threading from IT to F in our interpretation). IT and F share in 603 fact a similar compactness (see Fig 6).

604
The RD1 "antifreeze" protein domain can have other functions; it is found included 605 in the multi-domain E. coli sialic acid synthase [67] and its deletion causes the loss of 606 enzyme activity [68]. The antifreeze function is due to the presence of a flat ice-binding 607 surface, which include most of the residues in the V9-M21 chain portion and Q44 [65] 608 and is highly conserved across different isoforms [65,69]. The N-terminal S4-A7 residues 609 and several residues in the V26-E35 stretch are also highly conserved [65], although they 610 do not participate into the ice-binding surface, consistently with the crucial role their 611 interactions play in the folding mechanisms highlighted by our simulations (see Fig 7).  [44].

619
The predictions obtained with our coarse-grained simulations need further 620 confirmations by more refined models, as well as by experiments. Future studies would 621 greatly benefit from the ability to predict and engineer RD1 mutants that either 622 stabilize or abrogate the IT intermediate, as done with folding intermediates for other 623 proteins [70]. More generally, as a cold-adapted protein with a specific antifreeze 624 function, the RD1 protein provides a much interesting subject for both molecular 625 evolution and biotechnology processes [71]. We believe the presence of a non covalent 626 lasso entangled motif highlighted here for the RD1 protein, together with the prediction 627 of its major impact on the folding mechanisms, could be of great interest in the above 628 fields.

630
Structure-based energy function 631 For our Molecular Dynamics (MD) simulations, we implemented an alpha carbon, coarse-grained model with a structure-based energy function. The functional form of the potential energy is based on the one used by Clementi and colleagues [9]. A protein composed of N residues is modelled as a virtual polymer chain where each monomer is characterized by its alpha carbon atom. The vectors are the pseudo-bond angles of three subsequent residues, and are the pseudo-dihedral angles corresponding to four residues. The potential energy takes the form: where "0" subscripts specify the values that are extracted from the protein PDB map analysis. A cut-off is imposed for r ij > 2.5σ ij and no tail correction is applied. For 640 residue pairs not in contact with each other in the native structure, the repulsive-only 641 term V NN r ij models excluded volume effects through a Lennard-Jones 12/6 potential, 642 with parameters ε NN ij and σ NN = 4Å cut at 2 1/6 σ NN and shifted such that the 643 potential is zero at the cut-off. The energy parameters are chosen to be uniform for all 644 residues, ε i r = 100ε, ε i θ = 20ε, ε i ϕ = ε ij C = ε NN ij = ε, so that ε sets the overall energy 645 scale.

646
Consistently with previous work on entangled contacts [37], two residues are said to 647 form a native contact in the coarse-grained description if, in the all-atom representation, 648 any two non-Hydrogen atoms in different residues are closer than 4.5Å. The native 649 contact length r ij 0 is then set to the distance between the two alpha carbon atoms of the 650 corresponding residues. Native contacts, and hence non-bonded attractive interactions, 651 are considered only for residue pairs not involved in the same pseudo-bond or 652 pseudo-dihedral angle. In MD simulations, a contact between residues i and j is formed 653 if r ij < g r ij 0 , where the choice of g does not strongly affect thermodynamic and kinetic 654 observables [72]. In the present work, g = 1.2.

655
Langevin dynamics 656 We perform MD by simulating the Langevin equation of motion: where the mass m i = m ∀i. On the right-hand side the force terms are: a drag force 657 with drag coefficient γ, the conservative force −∇V (see the Structure-based energy 658 function subsection for details), and the Gaussian stochastic force R i , which satisfies where T is the temperature of the heat bath. The 660 drag and the random force mimic the interaction between the solvent and the system, 661 consequently thermostatting the latter to the average temperature T . κ B = 1 sets 662 temperature units to be the same as energy. We use reduced units, rescaling mass by m 663 and energy by ε, whereas a = 1Å is used to rescale lengths. The time rescaling factor 664 should then be a m/ε; however, mapping to real time units is not trivial and other The WHAM Method [56][57][58] uses histograms collecting system configurations, with 678 respect to reaction coordinates, to compute thermodynamic observables. Here, we use 679 an approximation that employs the fraction of native contacts Q in place of the energy 680 as a reaction coordinate. Although Q is not a deterministic function of the energy, the 681 minimally frustrated structure-based model reduces the energy fluctuations for 682 configurations with the same value of Q, resulting in a good approximation [9]. Q has 683 the advantage to naturally bin histograms.

684
Let R be the number of simulations at equilibrium with inverse temperatures β i , i = 1, . . . , R (κ B = 1). Let N i (Q) be the number of configurations with fraction of native contact equal to Q in the i-th simulation, with n i = Q N i (Q) total entries. The configuration partition function in the microcanonical ensemble W (Q) can be estimated as a linear combination of these histograms: where f i = β i F i are dimensionless free energies and p i are normalized weights to be determined. The average energy given a Q value is: April 16, 2023 16/34 The weights p i can be derived by imposing the W (Q) variance to be minimum, obtaining: for the entropy S(Q) = exp [W (Q)]. This is solved self-consistently with: The self-consistent equations can be solved iteratively with initialization In the latter, 692 averages are evaluated using WHAM results and N is the number of residues. Hence,

693
T f = arg max T C v (T ). For each protein, the final folding temperature is the mean of at 694 least two heat capacity profiles, each of which comes from a WHAM calculation using 8 695 independent simulations at equilibrium. These run for 2 · 10 9 integration steps, or 696 8.3 · 10 4 MD steps. The first 10% of the simulations is discarded from histogram 697 counting, to allow for thermalization to take place.

698
Gaussian entanglement 699 The linking number G between two closed oriented curves γ i = {r i } and γ j = {r j } in R 3 can be defined through Gauss' double integrals [38]: This number is an integer and a topological invariant. A generalization for discrete and open curves is the Gaussian entanglement G ′ [15,35,37]. For a chain with N monomers, i=i1 and γ j = {r j } j2 j=j1 be two non-overlapping portions of γ. We require j 2 − j 1 ≥ m j and i 2 − i 1 ≥ m i . In the present work, we choose m i = 4 and m j = 10 [37]. In coarse-grained protein chains, r k represents the alpha carbon position vector. Let R i = (r i+1 + r i )/2 be the mean position between two subsequent alpha carbons and ∆R i = r i+1 − r i be the virtual bond vector. The Gaussian entanglement, actually describing the self-entanglement between two different portions of the same chain, is defined as: Crucially, G ′ (γ i , γ j ) is a real number.

700
In the present contribution we consider γ i as a "loop", a chain portion "closed" by a 701 non-covalent interaction, with residues i 1 and i 2 in contact with each other in the native 702 configuration (see the Structure-based energy function subsection for details). No 703 similar constraints are imposed on γ j , which we call a "thread", due to its possible 704 intertwining with a loop, if involved in an entangled motif.

705
For any loop γ i , or equivalently for any native contact, we select the thread most likely entangling with γ i by maximizing |G ′ (γ i , γ j )| over all threads γ j which do not overlap with γ i . This yields also the Gaussian entanglement G ′ (where the argument γ i has been dropped for clarity) associated to that loop, in combination with the selected thread, written formally as: where Ω(γ i ) is the set of G ′ (γ i , γ j ) computed for all threads γ j that do not overlap with 706 γ i and satisfy j 2 − j 1 ≥ m j .

707
An entanglement indicator for the whole protein configuration, given the distribution of Gaussian entanglement values selected for each loop with the latter strategy, can be defined in many different ways. The Gaussian entanglement that is maximum in modulus, and the corresponding loop-thread pair, were previously considered [37]. In the present contribution, we introduce as entanglement indicator, for the whole chain configuration, the weighted average of G ′ over all loops: The unnormalized weights are defined through an activation Hill function: where g 0 is the activation threshold, such that h g 0 g 0 , m = 1/2, and m is the  ⟨σ ij ⟩(t) are affected by strong fluctuations, similarly to what observed in the Q time 773 series. Therefore, to perform further analysis, we smooth these curves through a block 774 average procedure. This is applied to the Q(t) time series to estimate the folding time 775 as well as to the contact formation probabilities ⟨σ ij ⟩(t), with time windows of 250 and 776 100 MD steps, respectively.

777
Using the block average, each ⟨σ ij ⟩(t) curve is fitted to the logistic function 778 L/ (1 + exp [−k (t − t * )]) using nonlinear least squares implemented in the Python 779 library scipy [75]. From the fit, we extract two observables: the average contact 780 formation time t * , defined by ⟨σ ij ⟩(t * ) = L/2, where L is the logistic asymptotic limit 781 for infinite time, and the maximal rate k of contact formation along the trajectory, 782 which is proportional to the slope Lk/4 of the logistic function evaluated at t = t * .

783
Unfolded configurations used in folding events have, on average, a few native 784 contacts already formed (Q > 0), typically those related to residues close to each other 785 along the chain. Therefore, some ⟨σ ij ⟩(t) curves might start already above L/2, causing 786 t * to be negative. Most of these contacts are shown for completeness, apart from a 787 restricted subset that has t * < t thr < 0 which is removed for visualization clarity. For 788 both proteins, this subset represents less than 1.2% of the total number of contacts and 789 their removal does not change the t * − k correlation parameters appreciably. Regarding 790 the final refolded configurations, a few native contacts do not form on average (Q < 1). 791 In these cases, a logistic fit is not possible and, therefore, such contacts are discarded 792 from the analysis (representing less than 3.7% of the total number of native contacts for 793 both proteins).   Cartoon native structures and corresponding Gaussian entanglement histograms of the proteins studied in this work. a: Cartoon native structure of the Type III antifreeze protein RD1 (Protein Data Bank (PDB) code 1ucs), which exhibits a mixed α/β structure. To illustrate the native topological complexity of RD1, we show in red one entangled loop (G ′ = −0.97) between the contacting residues L17 and M41, in yellow, with the associated threading fragment in blue, the first 14 N-terminal residues (see the Gaussian entanglement subsection for the definition of entangled loops and threads). Whilst, the most entangled contact (G ′ = −1.03) is shown in green, between residues N14 and K61. b: Cartoon native structure of the SH3 domain (PDB code 1srl), which has a mainly-β structure, with five β strands. c: Histogram of the Gaussian entanglement values for the native loops in the RD1 protein.
The unnormalized weight function used to evaluate the entanglement indicator ⟨G ′ ⟩ = −0.68 for the whole RD1 protein native structure is also shown. d: Histogram of the Gaussian entanglement values for the native loops in the SH3 domain. The unnormalized weight function used to evaluate the entanglement indicator ⟨G ′ ⟩ = 0.08 for the whole SH3 domain native structure is also shown. The equation defining the unnormalized weight function is reported in the legends of both panels c and d. It is a Hill activation function with threshold g 0 = 0.5 and cooperativity index m = 3. The weight function needs to be properly normalized to compute the weighted average ⟨G ′ ⟩. While we use the same unnormalized weight function in all cases, the normalization is specific to each distinct protein configuration (see the Gaussian entanglement subsection for details).     The red arrows mark the location of the native contacts involving the first 6 N-terminal residues, which are likely to be formed in IE − whereas they are less likely to be formed in IT. Among the former ones, the "trap-avoiding" native contacts framed in red are those whose formation probabilities differ most in the two ensembles (see the Definition of ensembles subsection for details). The shaded area framed in blue contains 28 out of the 30 natively entangled contacts, with G ′ < −0.75. The native contacts framed in magenta are the "first-entangler" ones more likely to be formed in IE − (see the Definition of ensembles subsection for details). Two examples of native contact formation probabilities as a function of time, averaged over the 52 refolding trajectories that achieve correct folding to the RD1 protein native structure at T = 0.9T f , and of the corresponding logistic fits. Time is measured in MD steps. Both the average time series and the corresponding block averages (see the Logistic fit of contact formation curves subsection for details) are plotted (see legend). The block averages are fitted to the logistic function reported in the legend and the resulting fits are shown in the plots. The fit parameters are L, the asymptotic saturation value at t → ∞, t * , the time at which the contact formation probability is L/2 (as highlighted in the plots; it may thus be negative for contacts which are already likely formed in the unfolded state), and k, determining the rate at which contact formation proceeds at t = t * (the slope of the logistic function is Lk/4 at t = t * , as highlighted in the plots). Top row: native contact between V6 and E25; one of the "trap-avoiding" N-terminal thread contacts framed in red in Figures 7 and 9a. Bottom row: native contact between K23 and I37; one of the "first-entangler" contacts framed in magenta in Figures 7 and 9a.