Cellular labelling favours unfolded proteins

Folded enzymes are essential for life, but there is limited in vivo information about how locally unfolded protein regions contribute to biological functions. Intrinsically Disordered Regions (IDRs) are enriched in disease-linked and multiply post-translationally modified proteins. The extent of foldability of predicted IDRs is difficult to measure due to significant technical challenges to survey in vivo protein conformations on a proteome-wide scale. We reasoned that IDRs should be more accessible to targeted in vivo biotinylation than more ordered protein regions, if they retain their flexibility in vivo. Indeed, we observed a positive correlation of predicted IDRs and biotinylation density across four independent large-scale proximity proteomics studies that together report >20 000 biotinylation sites. We show that biotin ‘painting’ is a promising approach to fill gaps in knowledge between static in vitro protein structures, in silico disorder predictions and in vivo condition-dependent subcellular plasticity using the 80S ribosome as an example.

Despite increasing community interest, it has remained challenging to define the phenomenon of "intrinsic disorder" as clearly as the ordered complement of the structural proteome. Rigidly folded proteins can be solved in high-resolution crystal, cryo-EM or NMR structures that can be described by a simplified hierarchy of elements of increasing length from primary structure (sequence of single amino acids) over secondary structure elements (α-helices and β-strands of ~10 residues) to tertiary structure (folded domains of ~100 residues) and quaternary structures (i.e. assemblies of several folded proteins). Intrinsically disordered proteins (IDPs) cannot be as straightforwardly classified in a simple hierarchy of modules of increasing length because the "minimal unit", a single IDR, can vary in length from a few residues to thousands. Accordingly, IDRs can vary significantly in their properties and functions and the need for further differentiation of sub-classes of disorder was recognized early in the development of the field 14.
While the structure-function paradigm is fully established and has been highly successful, a complementary "disorder-function" paradigm is still emerging 15 .
Co-evolutionary inference suggests that many predicted disordered regions have the capacity to fold and are selected in evolution by contact constraints imposed by their folded conformation in the presence of cellular binding partners 16. In other words, such binding-coupled folding IDPs look similar to folded proteins as determined by (co)evolution statistical analysis. Interfaces of foldable IDRs tend to be larger than contacts between two ordered proteins and the exposed hydrophobic surface area is often larger, which in some cases limits the solubility of IDPs and requires tighter subcellular regulation of IDPs compared to ordered proteins [17][18][19].
One of the least characterized aspects in IDR research is in vivo malleability leading to multiple structural forms that disordered regions can adopt in a given compartment in a given cellular state. According to in vitro experiments, it can be expected that subtle variations in pH, salt concentrations, and PTMs can have very significant effects on the conformational ensembles of IDPs. For instance, nuclear pore proteins can form extremely tight complexes (dissociation constant (Kd) in the low pM range) near physiological salt concentration (~100 mM), which become very weak (Kd in the mM range) at 200 mM salt concentration 20. Indeed, a recent large-scale multidimensional proteomics study that investigated temperature-dependent solubility and abundance changes across cell cycle phases demonstrated that large subsets of the human proteome dramatically change their solubility, stability, subcellular organization and protein partners in patterns resembling differential phosphorylation during the cell cycle 21.
Early reports suggested that phosphorylation predictions can become significantly more accurate if local intrinsic disorder tendency is taken into consideration 22 . Many single-protein examples illustrate that IDRs can be phosphorylated or hyper-phosphorylated within disordered residues, often at highly soluble and intrinsically disorder-promoting serine and threonine residues 10,11,[23][24][25] . The correlations of IDRs with acetylation, ubiquitination and sumoylations at lysines, and phosphorylations at residues such as tyrosine and histidine are more challenging to detect, however, and hence frequently underreported in scientific studies [26][27][28][29][30] . Finally, there are very few studies reporting possible interactions between IDRs and multiple types of PTMs.
Biotinylation-based proximity proteomics methods are traditionally used to map transient interactions and subcellular neighbours [31][32][33][34][35]. The common principle of the various proximity proteomics approaches is that biotinylation is highest in proximity to the biotin-activating enzyme that is fused to the protein of interest. This localised biotinylation enhances biotin incorporation in protein interactors and/or subcellular neighbours of targets fused to the biotin-activating enzyme, which can be quantified in mass spectrometric experiments combined with stringent statistical filtering of background proteins to remove endogenously and non-specifically biotinylated proteins. Several recent technological improvements enable the direct detection of thousands of biotin sites in hundreds of proteins in a single study [36][37][38][39]. We therefore reasoned that these novel large-scale in vivo biotin site data could be repurposed to gain insights into possible cellular conformations of proteins.
The most frequently used enzymes in proximity proteomics are variants of BirA biotin-protein ligase and Ascorbate Peroxidase (APX) 33,34,40,41. A promiscuous mutant of BirA ("BioID") as well as a thermophilic homologue (BioID2) biotinylate nearby lysines through the formation of activated biotinoyl-5'-AMP, which forms a covalent attachment to the nucleophilic ε-amino side chain group of lysine (K). APX or accelerated versions like APEX2 can convert biotin-phenol to activated radicals that can readily react for a short period of time with nearby tyrosines (Y). Interestingly, these two amino acid types are on opposite ends of the disorder-promoting amino acid scale: lysine promotes disorder while tyrosine is on average depleted in IDRs 42.
We hypothesize that sites of cellular biotinylations in proximity labelling studies could favour biotinylation within predicted IDRs if these can retain greater accessibility in their cellular context compared to predicted ordered regions. We perform a comprehensive analysis using representative data from multiple, independent and orthogonal large-scale proximity tagging datasets as well as diverse IDR predictions to test our hypothesis. We demonstrate the enrichment of cellular biotinylation events in predicted IDRs and show that these regions show higher biochemical reactivity compared to ordered regions in all targeted cellular niches and especially in the nucleus of HEK293 cells.

Results:
Concept of the study
Predicted IDRs can be reshaped by interactions in cells (Fig. 1A). Often, specific functions of IDRs are linked to their potential to fold upon interacting with specific native partner proteins or ligands 43. Alternatively, short "linear motifs" within IDRs could mediate a multitude of local interactions of other folded proteins that could constrain and compact IDRs in specific partially folded or ordered conformations 44. In the most extreme scenario, IDRs could remain entirely unfolded and fully accessible 20,45. We expected to observe more in vivo biotinylations within predicted IDRs if they remain at least transiently and locally unfolded and accessible in their cellular context. If present, such a correlation can be used for in vivo structural proteomics studies.

Brief introduction to selected proximity proteomics studies
To test our hypothesis of possible links between structural features of proteins and biotinylation, we selected four recent, independent and orthogonal, large-scale studies by the following criteria: (1) a large number of directly identified biotinylation sites, (2) orthogonality in targeted subcellular niches and (3) independence of biotin-peptide enrichment strategies (Fig. 1B). "DiDBiT" targeted the whole cell and is therefore agnostic of subcellular localisation. It identified ~20 000 biotinylation sites on lysine residues upon extensive biotinylation by applying 1 mM NHS-biotin, a chemically activated form of biotin, to cultured HEK293 cells, followed by complete digestion by trypsin and streptavidin-affinity purification of biotinylated peptides 38. "SpotBioID" targeted rapamycin-dependent interactions of the human mTOR kinase using its FK506-rapamycin binding (FRB) domain fused to BioID 39. Immunofluorescence data within SpotBioID and previous literature conflict concerning the main subcellular localisation of FRB-BioID, which appears to be cytoplasmic in fluorescence experiments and nuclear in previous literature and biotin-protein enrichments 39, with most evidence suggesting mainly nuclear localisation of the FRB-BioID fusion. The remaining two data sets come from recent, tyrosine-targeting APEX2 studies. Both successfully explored an alternative enrichment strategy based on polyclonal biotin-antibody from goat and rabbit that facilitated gentle elution while retaining explicit biotin site information, unlike other strategies involving gentle elution of cleavable biotin derivatives 36,37,46. They comprise an antibody-based APEX2 study (within this paper termed "Ab-APEX") that targeted the mitochondrial matrix using mito-APEX2 37, and a study called BioSITe 36, which uses a cytoplasmic APEX2 fusion construct to Nestin (NES) protein.

Orthogonality of tyrosine and lysine as molecular targets of proximity proteomics
How different are tyrosine and lysine residues, the most frequent molecular targets in proximity proteomics? Tyrosine is a partly hydrophobic and bulky amino acid and predominantly partitions to the hydrophobic core of proteins and near the interface of intrinsic membrane proteins. Its solvent accessible surface area (SASA) shrinks by some 90% during folding reactions (Fig. 1Ci) 47 . Lysine residues, by contrast, tend to orient to the surface of folded proteins and stay in contact with surrounding water molecules, i.e. retain a large fraction of their SASA (Fig. 1Cii). Nevertheless, through their intramolecular and intermolecular contacts, for instance, in protein-protein interactions, lysine residues have a large spectrum of accessibilities with an average near 50% of remaining SASA in folded proteins 47 .

Proximity proteomics studies can specifically target subcellular locations
As expected, because the four studies targeted different subcellular niches, there was a very small overlap in proteins across the 4 studies, with only 29 proteins in common (Fig. S1A). Many of these 29 proteins had multiple cellular locations, predominantly the nucleus (Fig. S1B, blue), the cytosol (Fig. S1B, red) and the extracellular region (Fig. S1B, yellow). Given the small size of this subset of the whole dataset, these locations are not statistically enriched despite being frequently seen. However, we could confirm the location for each of the studies above (n > 500) using a functional enrichment analysis against a set of Gene Ontology (GO) terms aimed at describing cellular location (GO:CC; Fig. S1C). Our data show that, as expected, Ab-APEX proteins strongly target the mitochondrion, with high fold enrichment for the mitochondrial matrix and the mitochondrial inner membrane (Fig. S1C, first column). Then, we checked the BioSITe data, which, also as expected based on the NES-APEX fusion, enriched GO terms of the cytoplasm and the cytosol (Fig. S1C, second column). The DiDBiT study, which lacks specific targets, seems enriched for nuclear, mitochondrial and cytosolic proteins (Fig. S1C, column 3). Finally, SpotBioID, where the authors state that FRB-BioID is cytoplasmic, is enriched for mostly nuclear and some cytoskeletal proteins 39. Briefly, all four studies showed expected enrichments consistent with the targeted cellular compartment and previous literature.
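The fold enrichment of a single GO term, and a rough measure of its significance, can be illustrated with a minimal Python sketch. It assumes a one-sided hypergeometric test against a background proteome, which is a common choice for this kind of analysis but is not stated in the text as the method used here; the function names and all counts are hypothetical.

```python
from math import comb


def fold_enrichment(hits_in_study, study_size, hits_in_background, background_size):
    """Frequency of a GO term in the study set divided by its background frequency."""
    study_rate = hits_in_study / study_size
    background_rate = hits_in_background / background_size
    return study_rate / background_rate


def hypergeom_upper_tail(hits_in_study, study_size, hits_in_background, background_size):
    """P(X >= hits_in_study) when study_size proteins are drawn without replacement
    from a background containing hits_in_background annotated proteins."""
    total = comb(background_size, study_size)
    max_hits = min(study_size, hits_in_background)
    tail = sum(
        comb(hits_in_background, k)
        * comb(background_size - hits_in_background, study_size - k)
        for k in range(hits_in_study, max_hits + 1)
    )
    return tail / total


# Hypothetical toy counts: 3 of 4 study proteins carry the term,
# against 5 of 20 proteins in the background.
enrichment = fold_enrichment(3, 4, 5, 20)
p_value = hypergeom_upper_tail(3, 4, 5, 20)
```

In this toy case the term is three-fold enriched. A real analysis would also correct the p-values for testing many GO terms at once.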

Illustrative examples of proteins that are biotinylated across all four independent studies
We started by exploring our datasets combining all biotinylation sites and PTMs by initially focusing on structural features of the limited subset of 29 proteins that were common to all studies. While not statistically significant, we noticed that the list contained many RNA-binding proteins. Elevated IDR content among these proteins is consistent with previous reports of high IDR content among nucleotide-binding proteins 48, but a larger set will have to be explored to firmly establish a statistical correlation. The first example, Emerin, is an integral membrane protein that is often found at the inner nuclear membrane or at adherens junctions. Emerin mutations cause X-linked recessive Emery-Dreifuss muscular dystrophy. Biotinylation sites from all four studies cluster in a large predicted IDR in the first half of the protein sequence and avoid the transmembrane-spanning domain (Fig. 2Ai-iv), consistent with our hypothesis that predicted IDRs might be more biotinylated in vivo if they remain highly accessible. A very large number of other PTMs in this IDR further illustrates that this membrane protein is indeed often subjected to intracellular modifications. Surprisingly, Emerin is found in all four studies despite the fact that some targeted different subcellular locations. Emerin is one of ~400 identified integral membrane proteins, suggesting that detailed intracellular structural insights can be gleaned from re-purposed proximity proteomics studies.
Next, we analysed the predicted fully disordered RNA-interacting plasminogen activator inhibitor protein SERBP1 (Fig. 2Bi). Four sites of biotinylation, across the four studies, cluster around the central region of this protein (residues 200-260), where previously reported unique PTM sites also cluster (Fig. 2Bii). DiDBiT identifies many additional sites scattered over the entire protein sequence, five of which are common with the nuclear-targeted SpotBioID study. SERBP1 was previously found in multiple subcellular locations, consistent with its identification in four studies. Ribosomal proteins are typically predicted to be disordered or non-globular 49. We see SERBP1 attaching at the periphery of the 80S ribosome RNA-protein complex and mostly lacking (in ~80% of its sequence) unique electron density (Fig. 2Biii); the remaining small visible fractions form elongated structures that are detected in random coil or α-helical conformations.
Finally, we selected FKBP3 as a protein of average (predicted) disorder content for the human proteome, around 40% according to VSL2b 2. FKBP3 is a cis-trans prolyl isomerase that is involved in cellular protein folding and tightly binds to the immunosuppressant rapamycin. Biotinylations are enriched in its predicted IDRs (72%) or localise to local coil structure and short, highly accessible α-helical segments in the NMR structure. FKBP3 was previously annotated as a nuclear protein. We conclude that detailed inspection of common examples across four studies suggests an enrichment of biotinylations in IDRs and regions lacking defined secondary structure in otherwise folded proteins.

Predicted IDRs are more frequently and densely biotinylated in vivo
Encouraged by observing enhanced biotinylation in predicted IDRs in the small pool of proteins common to all four studies, we next wondered whether this trend might still hold globally for the biotinylated proteome (referred to as "biotinome" hereafter) comprising nearly 4000 proteins. We first checked if proteins with a higher predicted fraction of IDRs contain higher numbers of unique sites of biotinylation by comparing the predicted IDR fraction for proteins in each biotinome to the number of biotinylated sites they contained (Fig. 3A). Within each dataset, there were only a small number of proteins with 5 or more biotinylation sites and hence these have been collectively binned into the "5+" category (Fig. 3A, last violin). For both the SpotBioID and the BioSITe studies, we observed an increase in the frequency of sites of in vivo biotinylation per protein from 1 to 4 with increasing IDR fractions, while DiDBiT and Ab-APEX did not show this trend (Fig. 3A, left panels). Both the cytosol and the nucleus, which are the target compartments in BioSITe and SpotBioID, have been previously suggested to contain many IDRs 48. Mitochondria, by contrast, are predicted to be low in IDRs, especially their subset of proteins with bacterial homologues 50. DiDBiT, lacking compartmental preference, contains both highly disordered and fully folded proteins, which might mask any possible weak correlation. We conclude that in vivo observed biotinylation frequency per protein and predicted IDR fractions can be correlated in IDR-rich compartments such as the nucleus and cytosol in HEK293 cells (Suppl. Table "Biotins").
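The binning of proteins by their number of biotinylation sites, with all proteins carrying five or more sites pooled into a "5+" category, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names and the (site count, IDR fraction) input format are hypothetical and are not taken from the study's own code.

```python
from collections import defaultdict


def site_count_bin(n_sites, cap=5):
    """Label a protein by its number of biotinylation sites; counts >= cap are pooled."""
    if n_sites < 1:
        raise ValueError("a biotinome protein carries at least one biotinylation site")
    return f"{cap}+" if n_sites >= cap else str(n_sites)


def idr_fractions_by_bin(proteins):
    """Group predicted IDR fractions by site-count bin.

    proteins: iterable of (n_sites, idr_fraction) pairs, one pair per protein.
    """
    groups = defaultdict(list)
    for n_sites, idr_fraction in proteins:
        groups[site_count_bin(n_sites)].append(idr_fraction)
    return dict(groups)


# Hypothetical toy input: three proteins with 1, 7 and 5 sites.
bins = idr_fractions_by_bin([(1, 0.1), (7, 0.8), (5, 0.6)])
```

Each list in the returned dictionary holds the IDR fractions that would feed one violin of a plot such as Fig. 3A.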
To overcome limitations of averaging over IDRs and ordered regions that might have masked structural trends in the DiDBiT and Ab-APEX studies, we next refined our analysis by distinguishing between biotinylations inside and outside of IDRs while accounting for the density of potentially modifiable residues. To establish an "expected" rate of biotins, we calculated the number of lysine residues (K; for SpotBioID and DiDBiT) or tyrosine residues (Y; for Ab-APEX and BioSITe), both within the predicted IDRs (as determined by VSL2b) and across the entire protein body. The ratio of all K/Y residues within IDRs to all K/Y residues across the protein body gave us an expected rate of biotinylation in IDRs. We then performed a similar calculation using the numbers of biotins we actually observed within IDRs and across the whole protein for each of our 4 studies (Fig. 3B). Consistently across all 4 studies, irrespective of the prediction algorithms used, we observed a significantly greater number of biotins within IDRs (orange bars; Fig. 3B) than expected (blue bars; Fig. 3B). Once again, this observation was more significant in the nuclear proteins (SpotBioID) than in the mitochondrial proteins (Ab-APEX) (Fig. S2A).
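The expected and observed rates described above can be illustrated for a single protein with a minimal Python sketch. The function names, the 1-based inclusive region format and the toy sequence are assumptions made for illustration; whether the rates are computed per protein or pooled over a whole dataset is not specified in the text, and this sketch shows only the per-protein case.

```python
def in_any_region(position, regions):
    """True if a 1-based residue position lies inside any inclusive (start, end) region."""
    return any(start <= position <= end for start, end in regions)


def expected_idr_fraction(sequence, idr_regions, target_residue):
    """Fraction of all target residues (e.g. 'K' or 'Y') that lie inside predicted IDRs."""
    positions = [i for i, aa in enumerate(sequence, start=1) if aa == target_residue]
    if not positions:
        return None  # the protein has no modifiable residue of this type
    inside = sum(in_any_region(p, idr_regions) for p in positions)
    return inside / len(positions)


def observed_idr_fraction(biotin_sites, idr_regions):
    """Fraction of observed biotinylation sites that lie inside predicted IDRs."""
    if not biotin_sites:
        return None
    inside = sum(in_any_region(p, idr_regions) for p in biotin_sites)
    return inside / len(biotin_sites)


# Hypothetical toy protein: lysines at positions 2, 4, 7 and 9; IDR spans residues 6-10.
sequence = "MKAKYGKSKY"
idr_regions = [(6, 10)]
expected = expected_idr_fraction(sequence, idr_regions, "K")  # 2 of 4 lysines in the IDR
observed = observed_idr_fraction([4, 7, 9], idr_regions)      # 2 of 3 biotins in the IDR
```

An observed fraction above the expected fraction, as in this toy case, corresponds to the enrichment of biotins within IDRs reported in Fig. 3B.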
Convinced that we are seeing a true positive correlation between local predicted IDRs and biotinylation density, we sought to see if similar trends can also be observed at the protein level after sorting all proteins into classes ranging from most to least folded. To this end, we labelled a protein as Folded (F) if it had predictions of <10% IDR, Partially Folded (P) if it had 10-30% IDR and Unfolded (U) if it had >30% IDR in its protein body, similar to a strategy in Gsponer et al. 18. We then looked at the overall distribution of proteins in these IDR classes for each of our 4 studies (Fig. 3C). We display the results for just the VSL2b and IUPred-L algorithms, as "VSL2b_IUPred-L" mimics the trend of VSL2b alone while the "D2P2 consensus" mimics IUPred-L. We observed that all studies contain proteins that can be classified as F, P and U, thus enabling pairwise comparisons. The predictors that are better at predicting long IDRs or the absence of folded domains, IUPred-L and D2P2 consensus predictors 51,52, classified more proteins as F than VSL2b, which has a wider definition of IDRs that also includes short IDRs. Consistent with our previous observations and claims in the literature 20,48, the nuclear protein enriched SpotBioID dataset shows the highest proportion of U proteins while the mitochondria-targeting Ab-APEX study shows the highest proportion of F proteins (Fig. 3C, S2B).
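The three-class scheme above can be written down as a short Python sketch. The thresholds follow the text (<10% F, 10-30% P, >30% U); treating exactly 10% and exactly 30% as Partially Folded is an assumption about the boundary cases, and the function names are hypothetical.

```python
from collections import Counter


def idr_class(idr_fraction):
    """Classify a protein by its predicted IDR fraction (0.0 to 1.0)."""
    if not 0.0 <= idr_fraction <= 1.0:
        raise ValueError("idr_fraction must lie between 0 and 1")
    if idr_fraction < 0.10:
        return "F"  # Folded: less than 10% predicted IDR
    if idr_fraction <= 0.30:
        return "P"  # Partially Folded: 10-30% predicted IDR (boundaries assumed inclusive)
    return "U"      # Unfolded: more than 30% predicted IDR


def class_distribution(idr_fractions):
    """Count how many proteins of a dataset fall into each class."""
    return dict(Counter(idr_class(f) for f in idr_fractions))


# Hypothetical toy dataset of four proteins.
distribution = class_distribution([0.0, 0.2, 0.5, 0.9])
```

Running the classifier with fractions from different predictors on the same proteins would show how a predictor with a wider IDR definition shifts proteins from F towards P and U.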
Given these three categories of proteins, we wondered whether there would be an association between IDR-associated biotins and the various categories of IDPs. To assess this, we performed both pairwise t-tests between the groups (F-P, U-P, U-F; Fig. S2Ci) and an ANOVA across all groups followed by a Tukey's Honestly Significant Differences post-hoc test (Fig. S2Cii). In all studies except SpotBioID, there were significant differences between biotin numbers in the F and U groups, with more biotinylation events occurring in the U group. Additionally, the differences were significant for all studies between the U and P groups, once again showing higher numbers of biotins in the U group (Fig. 3D; S2C).
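As a rough illustration of the group comparison, the one-way ANOVA F statistic for three groups can be computed with the Python standard library alone. This sketch covers only the F statistic and its degrees of freedom; the p-value and the Tukey post-hoc comparisons require a statistics package and are left out. The function name and the toy values are hypothetical, and the sketch is not the authors' implementation.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")
    grand_mean = fmean([x for g in groups for x in g])
    group_means = [fmean(g) for g in groups]
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical biotin counts per protein in the F, P and U classes.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F value indicates that the class means differ by more than the spread within classes would suggest; the post-hoc test then identifies which pairs of classes differ.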

Post Translational Modifications (PTMs) enriched in biotinome-IDRs
Having discovered a strong correlation between IDRs and increased biotinylation, we wondered whether an up-to-date comparison with ~305,000 PTMs in PhosphoSitePlus, comprising the small phosphorylation and acetylation marks as well as the larger protein-sized sumoylations and ubiquitinations, would parallel these trends or show enrichment in other proteins that do not overlap with the biotinome. To this end we downloaded all experimentally reported phosphorylation, acetylation, ubiquitination and sumoylation data from PhosphoSitePlus and mapped them to two datasets: (1) the "biotinome" for HEK293, which is the collection of all proteins across our 4 studies, and (2) the HEK293 proteome, which was published by Geiger et al. in 2012 and contains 7650 proteins 53. Additionally, we also mapped IDRs to the Geiger et al. proteome using the VSL2b algorithm.
As a simple starting point, we looked at the direct correlation between the number of IDRs and the number of PTMs in both datasets (Fig. 4A, Suppl. Table "PTM list"). In both cases, there is a modest positive correlation of 0.37 which was supported by a highly significant p-value (p = 2.2e-16) indicating that the probability of seeing this correlation by chance is extremely low. We thus conclude that there is a small but significant correlation between IDRs and PTMs in the overall proteome as well as the "biotinome".
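A correlation coefficient of this kind can be computed with a few lines of Python. The text does not state whether the reported value is a Pearson or a rank-based coefficient; the sketch below assumes Pearson, and the function name and the toy counts are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-protein counts: number of predicted IDRs and number of PTMs.
idr_counts = [0, 1, 2, 3, 5]
ptm_counts = [1, 0, 4, 3, 9]
r = pearson_r(idr_counts, ptm_counts)
```

With thousands of proteins, even a modest coefficient such as the one reported can yield a very small p-value, which is why the effect size and the significance are best read together.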
Despite the similarity in correlation, we wanted to know if there was an overall enrichment of PTMs in the HEK293 biotinome relative to the HEK293 proteome. We looked for a difference in the mean number of PTMs in the two groups of proteins, across each of the 4 post-translational marks. Median frequencies followed the expected pattern of higher rates for frequently reported phosphorylations and lower rates for the less frequently studied and likely under-published ubiquitinations, acetylations and sumoylations (Fig. 4B). Furthermore, on average, there are significantly more phosphorylation, acetylation and ubiquitination marks in the HEK293 biotinome relative to the HEK293 proteome (p << 0.05; Fig. 4B, i-iii), suggesting a trend in co-occurrence of PTMs with biotins. This trend was not observed for sumoylation, possibly due to very low reported numbers and technical under-detection of this specific modification (Fig. 4B, iv).
Having observed a positive correlation between IDRs and PTMs, and that PTMs are more frequent in the biotinome, we wanted to see if there was a preponderance of biotinome PTMs within regions of disorder (IDRs). We calculated the expected rate of seeing serine (S), threonine (T) and tyrosine (Y) in IDRs vs the rest of the protein to work out the baseline probability of phosphorylation marks. Similarly, we calculated the expected rate for lysine in IDRs vs the rest of the protein to establish a baseline for ubiquitination, acetylation and sumoylation marks. We then calculated the observed frequency for all 4 PTM types within IDRs in our biotinome datasets. Our observations were even more remarkable than those seen in the biotin context, with all 4 marks being enhanced within regions of intrinsic disorder significantly more than expected (IDRs predicted by VSL2b; Fig. 4C, Fig. S3A).
Knowing that there was a significant enhancement of PTMs in IDRs, we sought to determine if this would be even stronger if we looked at the proteins in the 3 previously discussed categories of Folded (F), Partially Folded (P) and Unfolded (U). We confirmed that in all 4 studies, PTMs occurred at a significantly higher rate in U proteins than in F or P proteins (Fig. 4D, S3B, S3C). This analysis also showed that all proteins in the nuclear SpotBioID study (Fig. 4D (ii)) and most of the proteins in the cytoplasmic BioSITe study (Fig. 4D (iv)) contain one or more PTMs (y-axis > -3), while this was not true for the Ab-APEX and DiDBiT studies. Given our previous observations that IDRs are more frequent in the cytosol and nucleus, this provides another line of evidence that PTMs, like biotins, prefer IDR-rich proteins.

Application of biotin 'painting' to investigate the in vivo plasticity of the 80S ribosome
To investigate whether large structured complexes can also be analysed with this method, we filtered the DiDBiT dataset for ribosomal proteins and visualised all biotinylated subunits in an "exploded" version of the 80S ribosome (Fig. 5). Virtually all biotinylated subunits are non-spherical and multiply biotinylated as evident from large fractions of yellow marked biotinylation sites, many of which are inaccessible to water or larger molecules such as biotin in the fully assembled 80S ribosomal complex as they are contacting ribosomal RNA (supplementary video). We observed a high density of biotinylation sites in this ~3 megadalton large complex which suggests that biotin 'painting' has no fundamental size limitation. High biotinylation density in the 80S ribosome is consistent with an earlier suggestion that eukaryotic ribosomes are rich in predicted IDRs that can be functionally essential 49 .

Conclusions
We have shown using several orthogonal analyses that in vivo biotinylation occurs at a greater rate within predicted IDRs and following on from this observation, highly disordered proteins are more likely to be biotinylated than those that are mostly folded. Furthermore, this trend of increased biotinylation in IDRs is not dependent on the algorithm we use to predict IDRs. However, the greater sensitivity of VSL2b enables the establishment of the trend also in short regions of local disorder and leads to a greater IDR fraction and more biotinylations assigned to IDRs. Finally, we have consistently observed that the SpotBioID study has more proteins that are highly disordered than the other 3 studies thereby validating previous predictions of large fractions of IDRs in nuclear proteins in vivo 48 .
Moreover, we have interrogated the frequency of post translational modification within IDRs, and have provided an up-to-date analysis of the relationship between in vivo observed PTMs across ~2000 independent experimental studies and predicted IDRs confirming that PTMs are enriched in predicted IDRs. Furthermore, we have shown that the biotinome we have analysed in our study is enriched for PTMs relative to the whole HEK293 proteome despite both groups showing a similar positive correlation between the number of PTMs and number of IDRs. Finally, similar to biotinylation, we find that PTMs too are enriched in nuclear and cytosolic proteins relative to mitochondrial proteins.

Discussion
We describe here the first in vivo evidence for preferential biotinylation of predicted IDRs across four independent proximity proteomics studies. This adds a new type of (exogenous) PTM to a list of other PTMs (phosphorylation, ubiquitination and acetylation) that have previously been suggested to be enriched in IDRs 22,30 and is validated by our comprehensive analysis. Ubiquitination and acetylation, which share the same target amino acid (i.e. lysine) with most proximity proteomics studies, show higher median numbers of modification sites per protein than the deep proteome reference (Fig 3C).
We envisage many possible benefits from re-purposing proximity proteomics data for in vivo structure-functional questions: (i) To complement very detailed kinetic in vitro studies that can resolve conformational dynamics at high spatial and temporal resolution using hydrogen deuterium exchange (HDX). Biotin 'painting' could enable complementary in vivo comparisons of the same target proteins and thereby increase the scope of HDX or related protein surface accessibility-based structural proteomics techniques 54,55 .
(ii) To acquire dynamic snapshots of biological pathways and determine by which mechanism these rewire biomolecular interaction networks and modulate subcellular conformations of proteins. Recent technological advances both in biotinylation enzymes and multiplexed mass analysis will accelerate sampling of more biological timepoints [56][57][58] .
(iii) To study dynamic in vivo drug effects. Many new drug candidates are failing in the later stages of development due to our incomplete understanding of cellular biology. If we can re-purpose BioID or other biotinylation methods for elucidating subcellular protein interactions, we might achieve earlier insights into drug (in)efficiency in relevant biological contexts.
Structural biotin analysis requires identification of the precise sites of biotinylation, which are typically not captured in more widely used protein-level enrichment in BioID experiments. We therefore briefly summarize here possible limitations and benefits for the biotin-peptide enrichment.
An obvious limitation of peptide-level enrichment is that non-biotinylated peptides cannot contribute to the mass spectrometric signal, which can mean that more biological input material may be required in some cases. While peptide-level enrichment increases the specificity and analytical efficiency for detecting biotinylated peptides [36][37][38][39] , it comes at the expense of not being able to detect very short proteins that lack lysines or detectable peptides with one missed cleavage (due to a modified lysine). Sequence coverage might be improved by including additional proteases in future biotin-based proximity experiments 59 .
How does biotin painting compare to other recently established proteome-wide structural assays? Similar to (in vitro/ex vivo) Limited Proteolysis (LiP)-MS, biotin painting can reveal local structural features of proteins and additionally enables in vivo and in vitro comparisons while being intrinsically limited by the need for biotin-peptide enrichment 60,61 . (LiP)-MS might, however, be less sensitive for conformational transitions that occur in IDRs that are depleted in hydrophobic amino acids and therefore lack the molecular targets of common LiP enzymes. Thermal proteome profiling (TPP) using quantitative comparisons of soluble fractions upon heating, is also compatible with in vivo structural comparisons but lacks local resolution while adding complementary information on protein-protein interactions based on co-precipitation of tightly interacting complex partner proteins at increasing temperatures 21 . Multi-span integral membrane proteins are under-represented in published TPP experiments, and biotin 'painting' might have useful complementary applications to biomedically relevant multi-span membrane proteins such as GPCRs. In summary, subcellular biotin painting can complement the already very powerful toolbox of structural proteomics.
Are short IDRs functionally relevant? Several lines of recent independent experimental evidence suggest so. Local flexibility has been identified as crucial factor for the evolution of novel enzymatic functions 62 , and for tuning the activity of enzymes to enable efficient catalysis at low temperatures in biological niches of psychrophilic organisms 63 . Local unfolding, incomplete folding or delayed folding can be helpful for cellular transport of proteins that must not fold prematurely before reaching their cellular destinations 64,65 . High-density biotin painting appears to be useful to characterize the in vivo reactivity of both predicted long IDRs (using the IDR predictor IUPred-L) and local or transiently unfolded or disordered IDRs (i.e. IDRs uniquely predicted by VSL2b).
How can we use the insights gleaned from this study to design novel, potentially better, proximity proteomics experiments? A key assumption in classical proximity proteomics studies is that biotinylation is enhanced near the biotin-activating enzyme. Our study shows that unfolded regions can be more readily biotinylated than folded regions. This could mean that proteins that in reality never change their cellular distribution can be perceived as farther away from or closer to a BirA fusion due to condition-dependent local folding or unfolding, respectively. We do not currently have definitive answers on how to unambiguously distinguish condition-dependent local (un)folding from subcellular redistribution. It appears worthwhile to consider the possibility that transient changes in protein folding can be important modulators of cellular dynamics that should be more broadly factored into experimental designs of proteomics studies (Fig. 6).
In conclusion, we believe that biotin 'painting' adds new layers of insight to proximity proteomics approaches by providing in vivo validation for computational IDR predictions, by highlighting multi-modification hotspots that are often disease-linked 30, and by enabling condition- and compartment-specific in vivo structural comparisons.

Source data description:
Four independent in vivo biotinylation studies have been used for our exploration of structural specificity of biotinylation sites. Their details are provided in Table 1 and they can be accessed as input files on the Github repository https://github.com/ComputationalProteomicsUnit/biotinIDR.

[Table 1. Column headers: Study, Ref, Target]

Assigning disorder predictions
Some 60 published disorder prediction algorithms feature balanced accuracies of around 70% to 80%; some are designed and validated to predict short IDRs (<30 residues), while others are better at determining long or both long and short IDRs 66. The majority of these predictors are trained on a limited set of in vitro structural data, mainly X-ray crystallography data and Nuclear Magnetic Resonance (NMR) mobility data in the DisProt database (http://www.disprot.org/) 67.
A subset of more frequently used prediction algorithms has pre-computed predictions in the web resource D2P2 (http://d2p2.pro/ 51). D2P2 also offers the option to select a consensus call for IDRs in a given protein that is predicted by most of the 9 different compound predictors. Of the 9 callers included in D2P2, we focused our interest initially on the 2 most orthogonal callers: (1) VSL2b, which has high sensitivity for calling both short and long IDRs 68, and (2) IUPred-L, which has been trained to predict long disordered regions with high confidence 52. As additional comparisons, we also predicted IDRs using (3) a combination of VSL2b and IUPred-L, where an IDR was accepted if called by one or both predictors, and (4) a consensus of (at least 75% of) the 9 predictors included in D2P2.
For all versions of IDR calling, we did not set any restrictions on the length of an IDR. This means that an IDR can be as short as a single residue. While this might yield many false positives, we wanted to prioritise sensitivity over specificity of IDR calling. Having tested the 4 different versions of IDR calling with D2P2, we observed consistent trends across all prediction approaches, while the higher local sensitivity of VSL2b enabled more insights into local disorder. We therefore performed more detailed biotin site and PTM analyses using VSL2b. The IDR assignment uses an Application Programming Interface (API) to the D2P2 website, and code to use this API was kindly provided by Dr. Tom Smith. The scripts 'd2p2.py' and 'protinfo.py' are necessary for the final analysis and can be accessed through the Github repository https://github.com/TomSmithCGAT/CamProt/tree/master/camprot/proteomics . The Python script for the final IDR analysis and output is called "Get_IDRs-DM-v2.py" and can be accessed via the repository https://github.com/ComputationalProteomicsUnit/biotinIDR.
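The combination and consensus calls described above can be illustrated with a minimal Python sketch. It is not taken from 'd2p2.py' or "Get_IDRs-DM-v2.py"; the function names and the representation of each predictor's output as a per-residue list of booleans are assumptions made only for illustration.

```python
def union_calls(calls_a, calls_b):
    """Residue is disordered if called by one or both predictors."""
    return [a or b for a, b in zip(calls_a, calls_b)]


def consensus_calls(all_calls, min_fraction=0.75):
    """Residue is disordered if at least min_fraction of predictors call it."""
    n_predictors = len(all_calls)
    return [sum(column) / n_predictors >= min_fraction
            for column in zip(*all_calls)]


def to_ranges(calls):
    """Turn per-residue calls into 1-based inclusive (start, end) ranges.

    No minimum length is enforced, so single-residue IDRs are kept.
    """
    ranges = []
    start = None
    for position, called in enumerate(calls, start=1):
        if called and start is None:
            start = position
        elif not called and start is not None:
            ranges.append((start, position - 1))
            start = None
    if start is not None:
        ranges.append((start, len(calls)))
    return ranges


# Hypothetical per-residue calls for a 10-residue protein.
vsl2b = [True, True, False, False, True, False, False, True, True, True]
iupred_l = [False, True, False, False, False, False, False, False, True, True]

combined = union_calls(vsl2b, iupred_l)
print(to_ranges(combined))
```

Because no length filter is applied, the isolated call at residue 5 is reported as its own IDR, which mirrors the sensitivity-over-specificity choice described above.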

Mapping post-Translational Modifications (PTMs)
A full repertoire of PTMs was downloaded on 10th April 2018 from the "Downloads" section of the PhosphoSitePlus website (https://www.phosphosite.org/homeAction.action), particularly sites for phosphorylation, acetylation, ubiquitination and sumoylation (Suppl. Tables "PhosphositePlus"). These were then mapped onto the proteins for each of our 4 studies and used for generating protein-wise images.

Protein sequence modification or proteoform images
Images summarising the location of IDRs and PTMs were produced using Protter (http://wlab.ethz.ch/protter/) and protein structure images were generated using Pymol. For Protter images, scripts were written to generate an appropriate URL and then batch download it from the server. These scripts 'printUrl.py' and 'runUrl.sh' are also available via the Github repository https://github.com/ComputationalProteomicsUnit/biotinIDR (Suppl. Table "Protter-List").

Code availability for Statistical analysis of PTMs and biotinylations in IDRs
Once IDR and PTM information were mapped, data were analysed for correlations and plots were generated using the R statistical framework (https://cran.r-project.org/ ) and several Bioconductor packages (https://bioconductor.org/). All code and input data can be accessed via the Github repository https://github.com/ComputationalProteomicsUnit/biotinIDR with the main files being 'biopep.pub.Rmd' and 'biopepFunctions.R'. An extension to GO mapping, 'GO.R', was also kindly provided by Dr Tom Smith and can be obtained here https://github.com/TomSmithCGAT/CamProt_R .

Statistical tests used
To compare expected rates of biotin/PTMs and observed counts, we used a standard binomial test in R (binom.test). For estimating the background rate of biotins, we counted all the lysines (K; BirA based studies) or tyrosines (Y; APX based studies) in the protein sequence and within predicted IDR regions. For estimating the background rate of PTMs, we counted all the lysines (K; ubiquitination, acetylation, sumoylation) or serines, threonines and tyrosines (S, T, Y; phosphorylation) in the protein sequence and within predicted IDR regions. We defined the "probability of success" as the number of these target residues in IDRs divided by the total number of target residues, a "success" as a biotin or PTM within an IDR, and the "number of trials" as the number of biotins or PTMs observed in that study.
To look for differences in PTMs and biotins between the three protein groups (Folded, Partially Folded and Unfolded), we used pairwise t-tests, or ANOVA followed by a post-hoc correction of family-wise error rates using Tukey's Honestly Significant Differences test. The former yields a p-value, while the latter yields a confidence interval for the effect size as well as a p-value. To compare the number of biotins and IDRs, we used a standard Pearson's correlation test. To compare mean PTMs between the HEK293 biotinome and the HEK293 proteome, we used a standard t-test for means.
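The Pearson comparison of biotin counts and IDR content can be sketched as follows. This is a minimal standard-library illustration rather than the R code used for the analysis; the function names and the per-protein values are hypothetical, and the p-value step (a t distribution with n - 2 degrees of freedom) is left out.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need two equal-length samples with at least 3 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)


def correlation_t_statistic(r, n):
    """t statistic for the null hypothesis r = 0 (n - 2 degrees of freedom).

    Undefined for |r| = 1, where the denominator becomes zero.
    """
    return r * sqrt((n - 2) / (1 - r * r))


# Hypothetical per-protein values: IDR residues and biotinylation sites.
idr_residues = [1, 2, 3, 4]
biotin_sites = [1, 3, 2, 4]

r = pearson_r(idr_residues, biotin_sites)
t = correlation_t_statistic(r, len(idr_residues))
print(r, t)
```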
To perform a GO enrichment analysis, we used the package 'goseq' 69 which is based on a Hypergeometric test with a Wallenius' correction which accounts for any biases in the data such as gene length, protein expression etc. In our study, we used protein expression from Geiger et al. 53 , as the bias factor prior to calculating GO enrichment.

Biomolecular structure visualisation
The cryo-EM structure of the human 80S ribosome (PDB ID 4v6x 70) was visualised using ChimeraX 71. SERBP1 and its biotinylated sites were highlighted using the sel function in its command line interface. RNA was coloured purple and protein subunits (except SERBP1) blue. All ribosomal macromolecules were visualised in surface representation. The FKBP3 NMR structure (PDB ID 2mph) was visualised as a cartoon representation of the first low-energy model; surfaces were kept 90% transparent except around biotinylation sites, which were highlighted in yellow.

Code availability
Code to process biotinylation datasets and reproduce the analysis has been deposited in the Github repository https://github.com/ComputationalProteomicsUnit/biotinIDR. Please request access to the code by emailing the corresponding authors, as it will be made public following journal acceptance.

Some 50% of the average lysine's SASA stays exposed in folded proteins 47.

Significance: ***p < 0.0005; ****p close to 0, using a binomial test where the "probability of success" is the (number of lysine residues or tyrosine residues in IDRs/Total number of lysine residues or tyrosine residues), a "success" is a biotin within an IDR and "number of trials" is the number of biotins observed in that study.

A t-test for differences in means between the two groups was conducted and the p-value is embedded at the top of each panel. All except sumoylation are significantly different and greater in the HEK293 "Biotinome" relative to the whole proteome. (C) Barplots showing the Expected (pale blue) and Observed (pale orange) distribution of PTMs within regions of IDR across the 4 post-translational marks using the IDR caller VSL2b. All pairs except [...] are significantly different between Observed and Expected values using a binomial test where the "probability of success" is the (number of K/S/T/Y in IDRs/Total number of K/S/T/Y), a "success" is a PTM within an IDR and "number of trials" is the number of (each type of) PTMs observed in that study (Fig. S3A).

We can see that nearly all ribosomal proteins have some yellow "paint" on them.

Fig. 6 Summarizing model. Biotinylation and other PTMs (including phosphorylation, acetylation, sumoylation, ubiquitination) are more likely in predicted IDRs, suggesting that they are more (at least transiently) accessible for biochemical modifications compared to folded proteins. Fully folded proteins can also be modified but show lower fractions of modified residues compared to IDRs, except for ubiquitination. This positive correlation of biotinylation density and IDRs, i.e. biotin painting of IDRs, can be used to re-purpose biotinylation-based proximity proteomics studies to monitor protein plasticity in vivo.

Fig. S1 (A) A Venn diagram showing the overlap of proteins across the four studies used in this analysis.
There is a very small number of proteins (29) common to all 4 studies (blue box). DiDBIT has the largest number of exclusive proteins as it targets the entire cell. Ab-APEX targets the mitochondrial matrix and inner mitochondrial membrane. SpotBioID targets the nucleus and BioSITE targets the cytoplasm. (B) A connectivity plot showing published and validated interactions between the 29 common proteins identified in S1A. This image was generated using the online program STRING (https://stringdb.org/cgi/input.pl?sessionId=QqkSTNv1EVki&input_page_show_search=on). An enrichment analysis was run on the 29 proteins using Gene Ontology (GO) categories and the colours represent some of the most significant terms from this analysis. Blue indicates that these proteins are known to localise to the nucleus, red indicates localisation to the cytosol and yellow indicates localisation to the extracellular region. Multiple colours in a single circle indicate that the given protein has been found in multiple locations in different studies. (C) GO Cellular Component enrichment analysis for the 4 studies. The proteins in each study were mapped to GO:CC categories and compared to a background of the published HEK293 proteome, which was also mapped to GO:CC categories. The size of the dot represents the fold enrichment over the background, i.e. the fraction of proteins in the input list that are annotated by a given GO term divided by the fraction of proteins in the background list that could be mapped to the same GO term. The colour of the dot represents the significance of the enrichment, with grey being most and orange being least significant. Note that all terms displayed in this figure are significant at the adjusted p-value cut-off of 0.05.

Fig. S2 Statistics for Biotin and IDR comparisons. (A)
A table of values and percentages used to look at the Expected (Blue) and Observed (Orange) rate of biotin occurrence within IDRs across all studies and all callers. The last column "Binom.pval" denotes the p-value using a binomial test where the "probability of success" is the (Lysine residues in IDRs/Total Lysine residues), a "success" is a biotin within an IDR (Biotins.in.IDR) and "number of trials" is the total number of biotins (Total.biotins) observed in that study. All tests are significant at the p = 0.05 level. (B) A table displaying the frequencies of proteins in each of the IDR categories in each of the 4 studies using the IDR predictor VSL2b. (C) To test whether there were significant differences in the number of biotins found in regions of IDR across the 3 IDR categories, two tests were performed. (i) A pair-wise t-test with multiple testing correction between the three groups (F, P and U). The table shows the p-value of these pairwise t-tests across the four studies. Light blue denotes comparisons that are not significant. Light orange denotes significant (p<0.05) and dark orange denotes comparisons that are highly significant (p << 0.05). (ii) An analysis of variance (ANOVA) followed by a Tukey Honestly Significant Differences (THSD) test to correct for family-wise error. The table shows the p-value of the THSD test across the four studies. Light blue denotes comparisons that are not significant. Light orange denotes significant (p<0.05) and dark orange denotes comparisons that are highly significant (p << 0.05).

Fig. S3 Statistics for PTM and IDR comparisons (A)
A table of values and percentages used to look at the Expected (Blue) and Observed (Orange) rate of PTM occurrence within IDRs across all studies, all callers and all PTM types. The "Binom.pval" column denotes p-values obtained using a binomial test where the "probability of success" is the (STKY.in.IDR/Total.STKY), a "success" is a PTM within an IDR (PTMS.in.IDRs) and "number of trials" is the number of (each type of) PTMs (Total.PTMs) observed in that study (Fig. S3A). (B) A pair-wise t-test with multiple testing correction between the three groups (F, P and U). The table shows the p-value of these pairwise t-tests across the four studies. Light blue denotes comparisons that are not significant. Light orange denotes significant (p<0.05) and dark orange denotes comparisons that are highly significant (p << 0.05). (C) An analysis of variance (ANOVA) followed by a Tukey Honestly Significant Differences (THSD) test to correct for family-wise error. The table shows the p-value of the THSD test across the four studies. Light blue denotes comparisons that are not significant. Light orange denotes significant (p<0.05) and dark orange denotes comparisons that are highly significant (p << 0.05).