Characteristics of Interactions at Protein Segments External to Globular Domains in the Protein Data Bank

The principle of three-dimensional protein structure formation is a long-standing conundrum in structural biology. A globular domain of a soluble protein is formed by a network of atomic contacts among amino acid residues, but regions external to globular domains, such as loops and linkers, often do not have intramolecular contacts with globular domains. Although these regions can play key roles in protein function as interfaces for intermolecular interactions, their nature remains unclear. Here, we termed protein segments external to globular domains floating segments and searched for them in tens of thousands of entries in the Protein Data Bank. We found that 0.72 % of residues are in floating segments. Regarding secondary structural elements, coil structures are enriched in floating segments, especially in long segments. Interactions with polypeptides and polynucleotides, but not small compounds, are enriched in floating segments. The amino acid preferences of floating segments are similar to those of surface residues, with some exceptions: the small side chain amino acids, Gly and Ala, are preferred, and some charged side chains, Arg and His, are disfavored in floating segments compared to surface residues. Our comprehensive characterization of floating segments may provide insights into protein sequence-structure-function relationships.


bioRxiv preprint doi: https://doi.org/10.1101/423087; this version posted September 20, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

INTRODUCTION
Elucidating the principles of the three-dimensional (3D) structure formation of proteins is a longstanding conundrum in the field of structural biology [1]. How a sequence of 20 types of amino acid residues in a polypeptide determines its 3D structure remains largely unclear. To illuminate this issue, extensive efforts in structural biology have accumulated a massive amount of structural data for proteins in the Protein Data Bank (PDB) [2]. Taking advantage of this wealth of structural data, the field of so-called "structural bioinformatics", which extracts biological knowledge from structural databases by using techniques of information science, has arisen [3][4][5][6]. Statistical analyses of structural elements in the PDB have provided a bird's eye view of the 3D structures of proteins. For example, statistical analyses revealed amino acid propensities associated with several features, e.g., formation of secondary structural elements [7][8][9][10] and loop regions [11]. In addition to secondary structure formation, another key feature in establishing protein folding is intramolecular contact between amino acid residues distant in the primary structure; in this paper, we refer to this type of intramolecular contact as non-local, whereas contacts between residues neighboring in the sequence are termed local contacts. Statistical analyses of the vicinity of amino acid residues in 3D space have been extensively performed to predict and recognize protein folds [12,13]. For example, the residue-wise contact order, defined as the summation over the distance along the sequence between contacting residues, contains significant information regarding 3D structures (Kinjo and Nishikawa, 2005; Kurt et al., 2008). Propensities of non-local contacts play pivotal roles in establishing folds of globular domains.
On the other hand, proteins also have regions without non-local contacts. A typical example is the linker region, which is a flexible segment linking two globular domains. The structural element termed the loop region also tends to have no or only a few non-local contacts. These regions are also key elements in the 3D structures of proteins. However, in spite of their importance, the nature of the regions without non-local contacts is not well understood.
Here, we performed statistical analyses on PDB entries to investigate the nature of regions without non-local contacts. Tens of thousands of PDB entries were processed, and two types of regions consisting of consecutive amino acid residues were defined: (i) regions with non-local contacts, and (ii) those without non-local contacts. We refer to these regions as (i) supported segments and (ii) floating segments, respectively (Fig 1). We aim to characterize the floating segments in proteins. On the basis of non-redundant PDB entries, the frequency of these floating segments was analyzed with respect to many features of the segments, e.g., length, secondary structures, accessible surface area (ASA), and intermolecular interactions. We show that a considerable number of residues are floating in protein structures deposited in the PDB. While amino acid preferences of floating segments are similar to those of exposed residues, some amino acids exhibit a unique propensity for floating segments.

Dataset Construction
In this study, we constructed two kinds of datasets, a primary dataset and non-redundant datasets; the latter are subsets of the former. The primary dataset was constructed by extracting entries from a snapshot of the PDB on June 14, 2017, with the following criteria: (i) the entry contains at least one polypeptide, (ii) the number of atoms is less than one million, and (iii) the structure was solved by X-ray crystallography with a resolution better than, or equal to, 3.0 Å. Information was extracted from the PDB by parsing PDBML [14] with in-house scripts (S1 Data).
Non-redundant datasets were constructed by picking non-redundant entries from the primary dataset. Single-linkage clustering with 40 % sequence identity was performed using the CD-HIT program [15]. In cases where an entry had more than one chain, the sequence identity of the most similar pair of chains was considered. A non-redundant dataset was constructed by randomly picking one entry from each cluster. We constructed 100 non-redundant datasets with different random seeds, and statistics over them were analyzed.
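The sampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the in-house scripts of S1 Data: the function name `sample_nonredundant`, the dictionary layout of the clusters, and the entry identifiers are all hypothetical, and the clustering itself (performed with CD-HIT in the text) is assumed to have been done beforehand.

```python
import random


def sample_nonredundant(clusters, seed):
    """Pick one entry at random from each cluster (hypothetical helper).

    clusters: dict mapping a cluster ID to a list of entry IDs.
    seed: random seed; a different seed gives a different dataset.
    """
    rng = random.Random(seed)
    # Iterate over sorted cluster IDs so the result depends only on the seed.
    return [rng.choice(clusters[cid]) for cid in sorted(clusters)]


# Toy clusters with made-up entry identifiers (not real PDB IDs).
clusters = {
    "c1": ["entry_a1", "entry_a2", "entry_a3"],
    "c2": ["entry_b1"],  # a singleton cluster
    "c3": ["entry_c1", "entry_c2"],
}

# One non-redundant dataset per seed, 100 datasets in total as in the text.
datasets = [sample_nonredundant(clusters, seed) for seed in range(100)]
```

Statistics such as the means and standard deviations reported over the 100 datasets would then be computed across the elements of `datasets`.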

Detecting Atomic Contacts and Definition of Structural Units
In this study, we assessed interatomic contacts between heavy atoms with the threshold that the interatomic distance is less than 5 Å. When a pair of amino acid residues has at least one interatomic contact, we considered that this pair of residues has an inter-residue contact. An inter-residue contact formed between residues of the same molecule is termed an intramolecular contact, which can be grouped into two classes: (i) non-local contact for the cases in which the two contacting residues are separated by more than five residues in the sequence order, and (ii) local contact for other cases. On the other hand, an intermolecular contact is defined as a contact between two different molecules. On the basis of inter-residue contacts, we defined a structural unit named segment. A unit consisting of three or more successive amino acid residues with non-local contacts is defined as a supported segment, and that without non-local contacts is defined as a floating segment (Fig 1).
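The definitions above can be illustrated with a short sketch. It is an assumption-laden illustration rather than the authors' implementation: the function names are hypothetical, residues are represented simply as lists of heavy-atom coordinates in one gap-free chain (so that the list index serves as the sequence position), and runs shorter than three residues are assumed to be left unassigned.

```python
import math
from itertools import groupby

CONTACT_CUTOFF = 5.0  # Å; heavy-atom distance threshold given in the text
MIN_SEQ_SEP = 5       # non-local: residues more than five apart in sequence
MIN_SEG_LEN = 3       # minimum number of residues in a segment


def in_contact(atoms_a, atoms_b, cutoff=CONTACT_CUTOFF):
    """True if any pair of heavy atoms is closer than the cutoff."""
    return any(math.dist(a, b) < cutoff for a in atoms_a for b in atoms_b)


def has_nonlocal_contact(residues):
    """Flag each residue that has at least one non-local contact."""
    n = len(residues)
    flags = [False] * n
    for i in range(n):
        # Only pairs separated by more than MIN_SEQ_SEP residues are non-local.
        for j in range(i + MIN_SEQ_SEP + 1, n):
            if in_contact(residues[i], residues[j]):
                flags[i] = flags[j] = True
    return flags


def find_segments(flags):
    """Return (type, first index, last index) for runs of >= 3 residues."""
    segments = []
    start = 0
    for supported, run in groupby(flags):
        length = len(list(run))
        if length >= MIN_SEG_LEN:
            seg_type = "sup" if supported else "flo"
            segments.append((seg_type, start, start + length - 1))
        start += length
    return segments


# Toy chain: one atom per residue, spaced 3.8 Å along the x axis,
# with the last residue folded back next to the first one.
chain = [[(3.8 * i, 0.0, 0.0)] for i in range(9)] + [[(0.0, 4.0, 0.0)]]
flags = has_nonlocal_contact(chain)
segments = find_segments(flags)
```

In this toy chain only the first and last residues contact each other across the sequence, so the eight residues in between form a single floating segment. A real analysis would use a spatial index instead of the quadratic pair loop.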

Analyses
We characterized segments in polypeptide chains by focusing on the following points: (i) segment type, defined as floating or supported, (ii) segment length, (iii) secondary structural elements (SSEs), (iv) accessible surface area (ASA), and (v) types of intermolecular contact partners.
(i) The segment type is denoted as T_seg ∈ {flo, sup}; flo and sup mean floating and supported segments, respectively.
(ii) The segment length L_seg is the number of consecutive amino acid residues composing the segment. For simplicity, segment lengths were grouped into three classes, T_len ∈ {short, middle, long}, defined as 2 < L_seg ≤ 4, 4 < L_seg ≤ 9, and 9 < L_seg, respectively.
(iii) The type of SSE (T_SSE) was assessed by using the DSSP program [16]. In this manuscript, we applied three categories of SSE, T_SSE ∈ {helix, beta, coil}; helix was G, H, or I in the DSSP classification, beta was B, E, T, or S, and coil was the others. The representative T_SSE of each segment was the most frequent SSE among the residues composing the segment.
(iv) The solvent accessibility of each segment was defined at two levels, surface (sur) and buried. For each amino acid residue, when the ratio of the ASA of the residue to that of the same amino acid in Gly-X-Gly motifs is greater than 0.2, the residue is assumed to be surface exposed. Otherwise, it is assumed to be buried. ASA was calculated with the DSSP program [16].
(v) Partners of intermolecular contacts were grouped into three classes: polypeptides, polynucleotides, and small compounds. These were defined by the entity type as described in the PDB annotation. We considered non-polymer entities ≥ 300 Da as small compounds. The type of interaction partner of a segment is denoted as T_int ∈ {pep, nuc, sc} for polypeptide, polynucleotide, and small compound, respectively.
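The class assignments described above can be written compactly as follows. This is a sketch with hypothetical function names; the DSSP letter groupings and the thresholds are taken from the text, while the reference ASA of the Gly-X-Gly motif is passed in as a parameter because its values are not given here. Tie-breaking for the representative SSE is not specified in the text, so the first-encountered class is used as an assumption.

```python
from collections import Counter

HELIX_CODES = set("GHI")   # DSSP codes grouped as helix in the text
BETA_CODES = set("BETS")   # DSSP codes grouped as beta in the text
SURFACE_RATIO = 0.2        # relative-ASA threshold for surface residues


def length_class(l_seg):
    """Map a segment length to short (3-4), middle (5-9), or long (>= 10)."""
    if l_seg <= 2:
        raise ValueError("a segment has at least three residues")
    if l_seg <= 4:
        return "short"
    if l_seg <= 9:
        return "middle"
    return "long"


def sse_class(dssp_code):
    """Map one DSSP code to helix, beta, or coil."""
    if dssp_code in HELIX_CODES:
        return "helix"
    if dssp_code in BETA_CODES:
        return "beta"
    return "coil"


def representative_sse(dssp_codes):
    """Most frequent SSE class among the residues of a segment.

    Ties are resolved in favor of the class seen first (an assumption).
    """
    counts = Counter(sse_class(code) for code in dssp_codes)
    return counts.most_common(1)[0][0]


def is_surface(asa, reference_asa):
    """True if the relative ASA exceeds the surface threshold."""
    return asa / reference_asa > SURFACE_RATIO
```

For example, `representative_sse("HHHE-")` gives `"helix"`, and a residue with an ASA of 30 Å² against a reference of 100 Å² is classified as surface exposed.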
Relative frequencies of various types of segments were analyzed. F(x) indicates the relative frequency of segments with the condition x,

$$F(x) = \frac{N(x)}{N_{\mathrm{total}}} \qquad (1)$$

where N_total denotes the total number of segments in a dataset, and N(x) denotes the number of segments with the condition x. For example, the relative frequency of short segments and that of helix segments are represented as F(T_len = short) and F(T_SSE = helix), respectively. For simplicity, they can also be denoted as F(short) and F(helix). The conditional relative frequency, the ratio of the number of segments with types x and y to that with type y, is denoted as F(x|y).
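As a small worked illustration of these frequencies, the sketch below counts segments in a toy list. The function names and the dictionary representation of a segment are hypothetical choices made for this example only.

```python
def rel_freq(segments, cond_x):
    """F(x): fraction of all segments that satisfy condition x."""
    return sum(1 for s in segments if cond_x(s)) / len(segments)


def cond_freq(segments, cond_x, cond_y):
    """F(x|y): fraction of the segments satisfying y that also satisfy x."""
    subset = [s for s in segments if cond_y(s)]
    return sum(1 for s in subset if cond_x(s)) / len(subset)


# Toy data: three floating segments and one supported segment.
toy_segments = [
    {"seg": "flo", "len": "short"},
    {"seg": "flo", "len": "short"},
    {"seg": "flo", "len": "long"},
    {"seg": "sup", "len": "middle"},
]


def is_short(s):
    return s["len"] == "short"


def is_floating(s):
    return s["seg"] == "flo"


f_short = rel_freq(toy_segments, is_short)                         # F(short)
f_short_given_flo = cond_freq(toy_segments, is_short, is_floating)  # F(short|flo)
```

Here F(short) is 2/4 and F(short|flo) is 2/3, since two of the three floating segments are short.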
In addition, we also analyzed characteristics of residues. The relative frequency of residues with the condition x is presented as F_res(x). The amino acid type of each residue is denoted as T_aa ∈ {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. The propensity score of each amino acid was assessed with the log-odds score.

The number of segments decreases exponentially with increasing segment length. In particular, there is a steep decrease in the frequency of floating segments for L_seg < 9. A majority of floating segments consist of only three or four residues (F(short|flo) = 0.850), and the ratio of floating segments that are longer than nine residues is only F(long|flo) = 0.014. On the other hand, many supported segments have more than nine residues (F(long|sup) = 0.103; F(short|sup) = 0.568). Floating segments tend to be shorter than supported ones. This may reflect the fact that longer protein regions without molecular

Segments as interfaces of intermolecular interactions
The relative frequencies of segments for intermolecular interactions are summarized in Fig 4. The interaction partners were categorized into one of three types: polypeptides (T_int = pep), polynucleotides (T_int = nuc), and small compounds (T_int = sc). For interactions with polypeptides and polynucleotides, floating segments appear in intermolecular interfaces more frequently than supported segments; the ratios of interacting floating segments are F(pep|flo) = 0.424 and F(nuc|flo) = 0.0121, and those for supported segments are F(pep|sup) = 0.282 and F(nuc|sup) = 0.00659, respectively. This is because intermolecular interactions require high solvent accessibility, and almost all floating segments are on the surface (F(sur|flo) = 0.953), whereas the corresponding ratio for supported segments is F(sur|sup) = 0.397. However, although floating segments are enriched at the surface, binding sites for small compounds prefer supported segments rather than floating ones; the ratios of interacting floating and interacting supported segments are F(sc|flo) = 0.0190 and F(sc|sup) = 0.0710, respectively. Since binding sites are usually formed as a concave surface (called a cavity or pocket) with a certain size and depth [17], they should be formed by supported segments rather than by floating ones. As an example of binding sites with floating segments, 3-chlorocatechol 1,2-dioxygenase binds its ligand with a floating helix-turn-helix conformation (S4A Figure; PDB ID: 2BOY). In addition, many entries in this category do not have biologically relevant ligand-binding sites but have contacts with other non-specific small molecules such as lipids. For instance, a light harvesting complex is surrounded by chlorophyll molecules (S4B Figure; PDB ID: 3PL9).
Regarding the segment length, longer segments show higher frequencies for interactions since longer segments have larger surface areas, which simply elevates the probability of interactions.

Propensities of Amino Acids
The propensity of each amino acid for floating and supported segments was assessed based on the log-odds score (Fig 5B; Eq. 2). In general, bulky non-polar amino acids, e.g., Cys, Phe, Ile, Leu, Met, Val, Trp, and Tyr, are disfavored for floating segments. This tendency is similar for surface residues (Fig 6; the Pearson correlation coefficient (PCC) of propensity scores between floating segments and surface residues is 0.929). This is due to the fact that a majority of floating segments are at the surface; F(sur|flo) = 0.953. However, there are some unique features in the amino acid propensity for floating segments compared to surface residues. (i) Arg and His are disfavored in floating segments, although they are highly enriched as surface residues due to their high polarity. In many cases, they are involved in interfaces of intermolecular contacts. Arg is enriched for contacts with polynucleotides. For example, a 20-residue segment in the Fos-Jun complex has seven Arg residues (S6A Figure; PDB ID: 1A02). At an interface with polypeptides, Arg can stabilize the interactions through formation of salt bridges (S6B Figure; PDB ID: 2E7S). Many His residues in floating segments are observed in His-tag sequences. (ii) Gly is preferred for floating segments, although it is not so favored in surface residues. This is due to its high flexibility, which makes it possible to form unstructured regions, including loops and linkers. For example, a 12-residue segment in a loop region of MHC molecules has five Gly residues (S6C Figure; PDB ID: 1LNU). (iii) Ala is not so disfavored for floating segments, in spite of its negative score in surface residues due to its hydrophobic side chain. One possible explanation is that α-helices favor Ala residues [7,18]. Floating segments show higher ratios of helix conformation than supported segments (Fig 3). An example of a floating helix with many Ala residues is shown in S6D Figure (PDB ID: 4KE2).
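The log-odds propensity score used in this section can be sketched numerically. The sketch assumes that the score compares the odds of observing an amino acid among residues satisfying conditions x and y with the odds of observing it among residues satisfying y alone; the natural logarithm and the function names are assumptions made for illustration, since the base of the logarithm is not stated in the text.

```python
import math


def odds(p):
    """Odds p / (1 - p) of a relative frequency p with 0 < p < 1."""
    return p / (1.0 - p)


def log_odds_score(f_res_xy, f_res_y):
    """Log-odds propensity of an amino acid (natural log, an assumption).

    f_res_xy: relative frequency of the amino acid among residues
              satisfying both conditions x and y.
    f_res_y:  relative frequency of the amino acid among residues
              satisfying condition y only (the background).
    """
    return math.log(odds(f_res_xy) / odds(f_res_y))


# Toy frequencies: an amino acid making up 20 % of the residues in the
# subset of interest but only 10 % of the background residues.
score_enriched = log_odds_score(0.2, 0.1)
score_neutral = log_odds_score(0.1, 0.1)
```

A positive score indicates that the amino acid is favored under the condition of interest, a negative score that it is disfavored, and a score of zero that its frequency matches the background.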
We also assessed amino acid propensities for intermolecular interactions with the three categories of molecules: polypeptides, polynucleotides, and small molecules (Figs 5C, D, and E).

The major differences are as follows: Gly and Pro are favored, and Gln is disfavored for floating small-molecule binding sites. Since Gly and Pro are enriched in flexible regions, they are observed in loop regions composing a binding site (examples are shown in S6G and H Figures). For the propensity for floating segments in interfaces with polynucleotides, there is no clear correlation with other propensities.
In a comparison with the trend in supported segments, floating segments disfavor some hydrophobic amino acids, e.g., Cys, Leu, Pro, and Trp. In addition, while Asp and Glu are disfavored, Asn and Gln are not. They sometimes have direct contacts with a base of polynucleotides (S6I Figure).

CONCLUSIONS
In this study, we defined floating and supported segments involved in the 3D structure of proteins (Fig 1).

Regarding the amino acid composition, while floating segments are basically similar to surface-exposed residues, they have some unique features: a higher preference for small side chains (Gly and Ala) and a disfavoring of some charged side chains (Arg and His) compared to surface residues (Fig 5A).
Interestingly, the propensity scores for polypeptide interactions of floating segments show a trend opposite to that for all floating segments (Figs 5A and B). Residues disfavored for floating segments tend to be found at interfaces for protein-protein interactions at floating segments, except for Gln residues.

ACKNOWLEDGEMENT
Supercomputer resources were provided by the National Institute of Genetics, Research Organization of Information and Systems, Japan.
Figure; PDB ID: 2B58). In contrast to the floating segments, the β-structures are not disfavored in long supported segments (Fig 3). Since β-strands are usually formed as a part of a β-sheet, supported β-strands with a broad range of lengths can be found in protein structures. Since the usual size of β-sheets corresponds to the category of T_len = middle, a higher ratio of beta structures in the middle-supported

For floating segments, typical interactions of long helix and long beta segments are the formation of coiled-coil structures and intermolecular β-sheets, respectively, as shown above (S2C Figure and S2A Figure, respectively). As an exception to the tendency of the segment length, longer supported segments are not enriched in polynucleotide binding sites. Many typical double-stranded DNA binding sites include floating segments with positively charged amino acid residues. For example, bZIP heterodimeric complexes recognize DNA with two floating helices (S5A Figure; PDB ID: 2WT7). A floating-helix segment consisting of 18 residues in the NC2-TBP-DNA ternary complex structure recognizes the DNA with its six acidic residues (S5B Figure; PDB ID: 1JFI). A linker loop with two acidic residues in a replication terminator protein is buried in the major groove of DNA (S5C Figure; PDB ID: 1ECR). In contrast, it is sterically difficult for supported segments to fit into the grooves of double-stranded DNA. A majority of supported segments at the polynucleotide binding sites touch the DNA backbone rather than being buried in the grooves (for example PDB ID: 3E3Y; S5D Figure). In addition, many single-stranded DNA binding sites are composed of supported segments (for example S5E and F Figures; PDB ID: 2KFN and 3CMW, respectively).
The propensity score of floating segments for polypeptide interactions (black in Fig 5C) shows the opposite trend from that of the propensity to form floating segments (Fig 6B; their PCC is -0.872). Although hydrophobic amino acids are disfavored for floating segments, they are favored for intermolecular interaction interfaces with other polypeptides. This implies that when disfavored amino acids exist in a floating segment, they are expected to perform some function in recognizing other proteins. The exception is Gln, which has a positive propensity score for both conditions; S_res(Q; flo) and S_res(Q; pep|flo) shown in Figs 5B and C, respectively. Gln is often observed at terminal or kinking regions of a helix (S6E and F Figures for examples). The propensity for floating segments in small-molecule binding sites showed a weak correlation with that for polypeptides (Fig 6C; the PCC is 0.614).
We characterized floating and supported segments on the basis of statistical analyses of the PDB. We found considerable numbers of floating segments in known protein structures (0.72 % of residues are in floating segments). The frequency distribution of segment length shows exponential decay with increasing segment length, in both floating and supported segments. The length distribution of floating segments is more biased toward shorter regions than that of supported segments, and most of the floating segments are composed of three or four residues (Fig 2). Three is the minimum length of a segment in the definition; the segment length largely impacts its characteristics. Shorter floating segments tend to form secondary structures (Fig 3). Longer floating segments are enriched in intermolecular interaction interfaces. In particular, beta structures are favored for long floating segments at the interfaces (Fig 4). Although floating segments are enriched at interfaces for polypeptides and polynucleotides, they are disfavored at interfaces for small compounds (Fig 4).

Figure 1. Schematic illustration of floating and supported segments. Open and shaded arrows indicate intermolecular and intramolecular contacts, respectively. The shaded regions in polypeptides are floating segments.

Figure 2. Histogram of floating (black) and supported (grey) segments regarding segment lengths. Error bars are the standard deviations in 100 non-redundant datasets.

Figure 3. Ratio of the SSE for each combination of classes: T_len and T_seg. Black, dark grey, and light grey denote the ratios of beta, helix, and coil structures, respectively. Error bars are the standard deviations in 100 non-redundant datasets.

Figure 4. The ratios of interacting segments for each combination of classes: T_SSE, T_len, and T_seg. (A) Ratios for interactions with polypeptides. (B) Ratios for interactions with polynucleotides. (C) Ratios for interactions with small compounds. Black, dark grey, and light grey denote the ratios of beta, helix, and coil structures, respectively. Error bars are the standard deviations in 100 non-redundant datasets.

Figure 5. Amino acid propensity scores. (A) The propensities for surface or buried residues; S_res(T_aa | x), where x is the accessibility class (surface or buried). Black and grey bars denote surface and buried residues for the panel, respectively. (B) The propensities to form floating or supported segments; S_res(T_aa | T_seg). (C) The propensities to interact with polypeptides; S_res(T_aa | T_int = pep; T_seg). (D) The propensities to interact with polynucleotides; S_res(T_aa | T_int = nuc; T_seg). (E) The propensities to interact with small compounds; S_res(T_aa | T_int = sc; T_seg).

Figure 6. Comparisons of the amino acid propensity scores shown in Fig 5. (A) Comparison between floating and surface segments. (B) Comparison between floating segments and those with peptide interactions. (C) Comparison between floating segments with small compound interactions and those with peptide interactions.
On average [with standard deviation (SD)], a non-redundant dataset consisted of 17,259.01 [37.44] entities, 38,601.25 [140.30] chains, and 8,967,882 [34,917.46] residues. For the PDB entries, we defined floating segments as having at least three consecutive residues without non-local contacts, and defined supported segments as those with non-local

$$S_{\mathrm{res}}(T_{\mathrm{aa}} \mid x; y) = \log\left(\frac{F_{\mathrm{res}}(T_{\mathrm{aa}} \mid x, y)\,/\,\bigl(1 - F_{\mathrm{res}}(T_{\mathrm{aa}} \mid x, y)\bigr)}{F_{\mathrm{res}}(T_{\mathrm{aa}} \mid y)\,/\,\bigl(1 - F_{\mathrm{res}}(T_{\mathrm{aa}} \mid y)\bigr)}\right) \qquad (2)$$

where x and y indicate conditions. We evaluated the amino acid propensities for segment types, S_res(T_aa | T_seg), those for interaction partners in each segment type, S_res(T_aa | T_int; T_seg), and those for surface or buried residues, conditioned on the accessibility class.

(44.60 %) of these were singleton clusters (S1 Figure). A non-redundant dataset was constructed by randomly picking one entry from each cluster, and statistical analyses were performed on 100 non-redundant datasets.

Segment Length and Secondary Structure Elements
Distributions of the segment length L_seg for each segment type are shown in Fig 2.