## Abstract

Intrinsically disordered proteins (IDPs) populate a range of conformations that are best described by a heterogeneous ensemble. Grouping an IDP ensemble into “structurally similar” clusters for visualization, interpretation, and analysis is a much-desired but formidable task, as the conformational space of IDPs is inherently high-dimensional and reduction techniques often result in ambiguous classifications. Here, we employ the t-distributed stochastic neighbor embedding (t-SNE) technique to generate homogeneous clusters of IDP conformations from the full heterogeneous ensemble. We illustrate the utility of t-SNE by clustering conformations of two disordered proteins, A*β*42 and a C-terminal fragment of *α*-synuclein, in their APO states and when bound to small molecule ligands. Our results shed light on ordered sub-states within disordered ensembles and provide structural and mechanistic insights into binding modes that confer specificity and affinity in IDP ligand binding. t-SNE projections preserve the local neighborhood information, provide interpretable visualizations of the conformational heterogeneity within each ensemble, and enable the quantification of cluster populations and their relative shifts upon ligand binding. Our approach provides a new framework for detailed investigations of the thermodynamics and kinetics of IDP ligand binding and will aid rational drug design for IDPs.

**Significance** Grouping heterogeneous conformations of IDPs into “structurally similar” clusters facilitates a clearer understanding of the properties of IDP conformational ensembles and provides insights into “structural ensemble: function” relationships. In this work, we provide an approach for clustering IDP ensembles efficiently using a non-linear dimensionality reduction method, t-distributed stochastic neighbor embedding (t-SNE), to create clusters of structurally similar IDP conformations. We show how this can be used for meaningful biophysical analyses, such as understanding the binding mechanisms of the IDPs *α*-synuclein and Amyloid *β*-42 with small drug molecules.

## I. Introduction

In general, knowledge of the 3-dimensional structure of a protein is the first step toward a molecular-level mechanistic understanding of its biological function. This knowledge is also central to activities such as the rational design of drugs, inhibitors, and vaccines and to the broad area of protein engineering and biomolecular recognition^{1–7}. With the advances made in structure determination techniques^{8–16} and recent transformative leaps made in computationally predicting structure from sequence^{17–19}, structural biology is going through a paradigm shift in which knowledge of the structure is no longer the biggest bottleneck^{20}. However, outside the realm of these structured proteins exists a “dark” proteome of intrinsically disordered proteins (IDPs) that constitute more than 40% of all known proteins and play important roles in cellular physiology and diseases^{21–27}. An IDP can populate a heterogeneous ensemble of conformations and is functional without adopting a unique structure. In essence, IDPs are expanding the classical sequence-structure-function hypothesis to a sequence-disordered ensemble-function(s) paradigm. Though solution-based experiments like NMR, FRET, and SAXS do provide structural information for IDPs, they generally report time- and ensemble-averaged properties of IDP conformations^{28–30}. In the absence of computational models, solution experiments are challenging to interpret in terms of the individual atomic-resolution structures that constitute IDP ensembles. In other words, IDPs are not directly amenable to conventional high-resolution structure determination, structure-based functional correlation, protein engineering, and drug-design strategies that hinge upon the knowledge of a reference 3-dimensional structure.

Computational tools, particularly those that incorporate the available experimental information, can be effectively used to generate high-resolution ensemble structures of IDPs. Of late, several broad classes of approaches have been developed for this purpose. Methods based on pre-existing random coil libraries and simple volume exclusion (examples: Flexible Meccano^{31}, TraDES^{32}, BEGR^{33}) are often used to create an initial exhaustive pool of conformations, which is further processed to produce refined ensembles by combining it with experimental constraints^{30, 34–39}. These methods, though purely statistical in nature, provide a computationally efficient approach to calculating IDP conformational ensembles that are consistent with experimental data. The second set of approaches utilizes physics-based molecular simulations, either in a coarse-grained representation (examples: SIRAH^{40}, ABSINTH^{41}, AWSEM-IDP^{42}, SOP-IDP^{43}, HPS^{44} and others) or at all-atom resolution^{45–48}, to generate initial Boltzmann-weighted conformational ensembles that can be further refined with experimental restraints using various reweighting approaches^{49–51}. Recently developed molecular mechanics force fields for IDPs^{45–48}, used in combination with parallel tempering-based enhanced sampling approaches such as replica exchange solute tempering (REST)^{52–55} and hybrid tempering (REHT)^{56}, have also shown promise in producing accurate atomic-resolution IDP ensembles consistent with experimental solution data without any added bias in the simulations.

While significant advances have been made in generating high-resolution IDP conformational ensembles that are consistent with experimental data, the subsequent interpretation of these ensembles to address key biological questions related to the interactions of IDPs remains extremely challenging. IDP conformational ensembles are inherently extremely high-dimensional. That is, the phase space of IDPs consists of several thousands of features, which may vary relatively independently, making it extremely challenging to uncover correlations in conformational features among conformations contained in IDP ensembles. This often makes sequence-ensemble-function relationships of IDPs very difficult to understand, even when aided by relatively accurate IDP conformational ensembles. If one could efficiently identify representative conformational sub-states in IDP ensembles, and quantify their relative populations in different molecular and cellular contexts, it would become significantly easier to identify conformational features of IDPs that may be associated with specific functional roles or disease states^{57–59}. Therefore, parsing the heterogeneous ensemble data into representative conformational states can be as critical as the generation of the ensemble itself as it allows one to leverage conventional structural-biology analysis tools for IDPs.

The process of dividing a large abstract data set into a number of subsets (or groups) based on certain common relations, such that the data points within a group are more similar to each other and the points belonging to different groups are dissimilar, is called clustering. Due to its ability to provide better visualization and statistical insights, clustering is ubiquitous in the analyses of big-data biological systems, with wide-ranging applications such as profiling gene expression patterns^{60, 61}, de novo structure prediction of proteins^{62, 63}, quantitative structure-activity relationships of chemical entities^{64}, docking and binding geometry scoring^{65}, and analyses of protein ensembles from molecular dynamics (MD) trajectories^{66}. In the last case, the similarity or dissimilarity between a pair of conformations is often defined by classifiers or collective variables (CVs) such as the root mean square deviation (RMSD), differences in native contacts, dihedral angle distributions, the radius of gyration (*R*_{g}), and the end-to-end distance. The choice of CV is generally empirical, but CVs chosen based on domain knowledge do yield reliable clustering in structured protein ensembles. However, they do not sufficiently describe the highly heterogeneous IDP ensemble. This is primarily because conformations with the same CV values can differ substantially: low-dimensional CV projections are unable to capture the finer details in coordinate space that separate one conformation from another. This has a severe impact on clustering, since different IDP conformations often have similar projected collective variables and are placed together in one group. To illustrate this, we present a set of conformations from a simulated IDP ensemble with the same value of *R*_{g} (Fig. S1 in Supplementary Material (SM)). It is evident from this illustration how this could lead to ambiguous classification. The problem can be circumvented to some extent by considering multi-dimensional projections. But this often leads to too many sparsely populated clusters, which defeats the purpose of effective clustering.

Theoretically well-grounded dimensionality reduction (DR) techniques are now commonly used in protein conformation analysis to extract latent low-dimensional features, and the amount of information lost during the projection depends heavily on the kind of data set under consideration^{67–72}. For example, a highly heterogeneous data set that lies on a high-dimensional manifold, as in the case of IDPs, is best handled with non-linear dimensionality reduction (NLDR) techniques, which generally attempt to keep nearest neighbors close together. While methods such as ISOMAP and Locally Linear Embedding are best suited to unroll or unfold a single continuous manifold, a more recent method, t-distributed stochastic neighbor embedding (t-SNE), is more appealing as it helps to disentangle multiple manifolds in the high-dimensional data concurrently by focusing on the local structure of the data, and it tends to extract clustered local groups of samples. Consequently, t-SNE tends to perform better in separating clusters and avoiding crowding. We will show that t-SNE is particularly well-suited for clustering seemingly disparate IDP conformations into homogeneous subgroups, since it is designed to conserve the local neighborhood when reducing the dimension, which ensures that similar data points remain similar and dissimilar points remain dissimilar in the low-dimensional space as in the high-dimensional space^{73}. Due to its ability to provide a very informative visualization of heterogeneity in the data, t-SNE is being increasingly employed in several applications, such as clustering data from single-cell transcriptomics^{74–77}, mass spectrometry imaging^{78}, and mass cytometry^{79, 80}. Lately, t-SNE has also been used for depicting the MD trajectories of folded proteins^{81–87} and for interpreting mass spectrometry-based experimental data on IDPs by juxtaposing them with classical GROMOS-based conformation clusters from the corresponding molecular simulation trajectories of the IDP under consideration^{88}.
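For readers unfamiliar with the method, the neighborhood-preserving behavior described above can be summarized by the standard t-SNE formulation. The block is a reference sketch only: the symbols ($d_{ij}$ for the distance between conformations $i$ and $j$, $\sigma_i$ for the per-point kernel width, $y_i$ for the low-dimensional coordinates, and $N$ for the number of conformations) are notation introduced here for illustration and are not used elsewhere in the text.

```latex
\begin{aligned}
p_{j|i} &= \frac{\exp\!\left(-d_{ij}^{2} / 2\sigma_i^{2}\right)}
                {\sum_{k \neq i} \exp\!\left(-d_{ik}^{2} / 2\sigma_i^{2}\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}, \\[4pt]
q_{ij} &= \frac{\left(1 + \lVert y_i - y_j \rVert^{2}\right)^{-1}}
               {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^{2}\right)^{-1}}, \\[4pt]
C &= \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \\[4pt]
\mathrm{Perp}(P_i) &= 2^{H(P_i)},
\qquad
H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}.
\end{aligned}
```

Each $\sigma_i$ is tuned so that $\mathrm{Perp}(P_i)$ matches the user-chosen perplexity, which can loosely be read as the effective number of neighbors each conformation is asked to keep. The heavy-tailed Student-t kernel in $q_{ij}$ is what lets moderately distant points spread apart in the map and reduces crowding.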

In this paper, we demonstrate the effectiveness of t-SNE (in combination with K-means clustering) for identifying and visualizing representative conformational substates in IDP ensembles. We investigate the small molecule binding properties of Amyloid *β*-42 (A*β*42) and *α*-synuclein (*α*S), proteins involved in neurodegenerative proteinopathies such as Alzheimer’s and Parkinson’s diseases, respectively. Therapeutic interventions that sequester the monomeric state of these IDPs have recently been explored using state-of-the-art biophysical experiments and long timescale molecular simulations^{89, 90}. A set of repurposed small molecules such as the c-Myc inhibitor-G5 (benzofurazan N-([1,1-biphenyl]-2-yl)-7-nitrobenzo[c][1,2,5]oxadiazol-4-amine (10074-G5)) and a Rho kinase inhibitor, Fasudil (along with the high-affinity Fasudil variant Ligand-47), have been identified as promising agents against the monomers of A*β*42 and *α*S, respectively. Since the monomeric states of these IDPs are extremely heterogeneous, it is not fully understood how the different conformations form viable complexes with these small molecules and what the molecular origin of their affinity and specificity is. This is mainly owing to the inefficient clustering of the IDP structures by classical clustering tools. Here we revisit the molecular trajectories of A*β*42 (a total of 56 *µ*s) and *α*S (a total of 573 *µ*s) using t-SNE (in combination with K-means clustering). This exercise has substantially improved our knowledge of the binding mechanism of small molecules to such large IDPs and also provides us with strategies for designing specific inhibitors with high-affinity binding. It also helps in understanding the conformational landscapes of APO and ligand-bound IDPs, which are otherwise hard to obtain. We believe that the method presented here is general in nature and can be used to cluster and visualize IDP ensembles across systems and assist in detailed structural, thermodynamic, and kinetic analyses of IDP conformations in APO and bound states.

## II. Results and Discussion

We aim to cluster the heterogeneous mixture of disordered protein conformations into a set of unique and homogeneous groups of conformations. To do this, as a first step, we employ t-SNE to project the high-dimensional data into lower dimensions. We then apply K-means clustering on the projections to identify the clusters in the reduced space. Before we illustrate the power of this algorithm as a faithful clustering tool for realistic IDP ensembles, we use a simple alanine-dipeptide (ADP) toy model in SM to provide physical intuition into how t-SNE works. Please see Fig. S2 and the subsection titled *“Physical intuition into t-SNE-based clustering algorithm using alanine dipeptide”* in SM. We also use this toy model system to familiarize the reader with a critical hyper-parameter of the t-SNE algorithm called perplexity, and prescribe a strategy to choose its value for effective clustering. Going forward, we apply the method to the analyses of IDP ensembles of complex systems such as A*β*42 and *α*S, each in the presence and absence of small-molecule inhibitors. We have listed all the systems under consideration in Table I below. We represent the conformations within the A*β*42 and *α*S ensembles by the inter-residue Lennard-Jones contact energies and the Cartesian coordinates of heavy atoms, respectively. These measures are kept consistent with the original works from which the trajectory data are taken for A*β*42^{89} and *α*S^{91} to enable faithful comparisons. t-SNE was performed on the distance/dissimilarity calculated between all pairs of conformations as the RMSD of the above data.
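The pairwise dissimilarity input described above can be sketched as follows. This is a minimal illustration, not the code used in this work: it assumes each conformation is already reduced to a fixed-length feature vector (for example, flattened inter-residue contact energies), performs no structural superposition (which an RMSD on Cartesian coordinates would normally require), and uses a hypothetical four-frame toy array.

```python
import numpy as np

def pairwise_rmsd(features):
    """RMSD between every pair of rows of `features` (n_frames x n_features).

    Assumption: rows are directly comparable feature vectors; no alignment
    or superposition is performed here.
    """
    x = np.asarray(features, dtype=float)
    sq_norms = np.sum(x * x, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    np.maximum(sq_dist, 0.0, out=sq_dist)  # remove tiny negative round-off
    np.fill_diagonal(sq_dist, 0.0)
    return np.sqrt(sq_dist / x.shape[1])

# Hypothetical toy input: 4 frames, 3 features per frame.
toy = np.array([[0.0, 0.0, 0.0],
                [3.0, 0.0, 0.0],
                [0.0, 0.0, 0.0],
                [1.0, 1.0, 1.0]])
dist = pairwise_rmsd(toy)
print(np.round(dist, 3))
```

A matrix of this kind can be handed to any t-SNE implementation that accepts precomputed distances; in scikit-learn, for instance, `TSNE(metric="precomputed", init="random")` takes such an input, and K-means is then run on the resulting two-dimensional coordinates.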

### A. Prescription for choosing optimal parameters for t-SNE clustering of IDPs

The results of t-SNE depend largely on the choice of perplexity. Since the objective here is to maximize clusterability, we adopt the well-known Silhouette score^{92}, commonly used for optimizing the number of clusters (K) in K-means clustering, for tuning the perplexity as well. As shown through the formulation in the Methods section below, the Silhouette score compares how close every point is to its own cluster (cohesiveness) with how close it is to the other clusters (separateness), averaged over all points, and is defined such that its value lies in the range of −1 to 1. A score of 1 is most desirable, indicating perfectly separated clusters with clearly distinguishable features. A positive value generally indicates acceptable clustering, while negative values are unacceptable for distinguishable clustering. The cohesiveness and separateness of clusters are generally measured based on Euclidean distance. Since the clusters here are identified in the reduced t-SNE space, computing the score in this space (*S*_{ld}) alone may be misleading. This is particularly true when using sub-optimal parameters, which often clump the points randomly during the dimensionality reduction step by t-SNE. Therefore, it is important to measure the goodness of clustering with respect to the original distance in the high-dimensional space (*S*_{hd}), in addition to that in the low-dimensional space. The integrated score (*S*_{ld} ∗ *S*_{hd}), therefore, adds value to the estimated clustering efficiency in terms of reliability.

### B. t-SNE for clustering A*β*42 conformational ensembles

#### 1. t-SNE identifies the clustering pattern intrinsic to the A*β*42 ensemble

We apply our algorithm to the APO and G5-bound A*β*42 all-atom MD simulation trajectories obtained from the Vendruscolo group^{89}. We have used the same set of representative frames for clustering as in the original work (35000 frames from each ensemble), where each system was run for 27.8 *µ*s. Furthermore, to be consistent, we represent the conformations in the same way, by inter-residue Lennard-Jones contact energies. We computed the distance between all pairs of conformations as the RMSD of the contact energies and fed that into our t-SNE pipeline. In the case of A*β*42 (APO and G5-bound), the Silhouette scores calculated for a range of K and perplexities are positive with respect to the distances both in the low-dimensional space (*S*_{ld}) and in the high-dimensional space (*S*_{hd}) (Tables S2, S3), suggesting reliable clustering. This can be compared against the large negative score (−0.6) with respect to the high-dimensional distance obtained for the classical GROMOS-based clustering, which indicates that the conformations are grouped into wrong clusters. In Fig. 1(a,b), we report the integrated score (*S*_{ld} ∗ *S*_{hd}) as measured for the clusters in the APO and G5-bound ensembles of A*β*42. The figure shows the highest Silhouette score for a range of small perplexities but only with a single optimal number of clusters (30 in the case of the APO trajectory and 40 in the G5-bound trajectory). This consistent behavior indicates that, at the optimal parameters, the identified clustering pattern is intrinsic to the underlying ensemble structure and corresponds roughly to the possible number of metastable structures, rather than just random noise. At these optimal values, we find that the low-dimensional t-SNE map shows discrete clusters in both the APO and G5-bound ensembles. At sub-optimal values, in contrast, the identified clusters either lump different pieces together in a single cluster (for example at P=50; K=20 in APO) or break similar conformations into multiple clusters (at P=350, K=100 in the APO system).
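The scoring used to scan K and perplexity can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation used here: it takes two precomputed distance matrices (one from the t-SNE map, one from the original high-dimensional data) plus cluster labels, requires at least two clusters, and uses hypothetical one-dimensional toy points. The function names are illustrative.

```python
import numpy as np

def silhouette_from_distances(dist, labels):
    """Mean Silhouette score computed from a precomputed distance matrix."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton clusters are scored 0 by convention
        # a: mean distance to the other members of the same cluster
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())

def integrated_score(dist_ld, dist_hd, labels):
    """Product of the low- and high-dimensional Silhouette scores."""
    return (silhouette_from_distances(dist_ld, labels)
            * silhouette_from_distances(dist_hd, labels))

# Hypothetical toy data: two well-separated groups on a line.
x_hd = np.array([0.0, 1.0, 10.0, 11.0])
x_ld = 0.1 * x_hd  # stand-in for a low-dimensional map
dist_hd = np.abs(x_hd[:, None] - x_hd[None, :])
dist_ld = np.abs(x_ld[:, None] - x_ld[None, :])

good = np.array([0, 0, 1, 1])  # labels that follow the groups
bad = np.array([0, 1, 0, 1])   # labels that mix the groups
s_good = integrated_score(dist_ld, dist_hd, good)
s_hd_bad = silhouette_from_distances(dist_hd, bad)
print(s_good, s_hd_bad)
```

Because the product of two negative scores is positive, the integrated value is only meaningful when both factors are positive, so it is worth inspecting *S*_{ld} and *S*_{hd} individually as well. For large ensembles, a library routine such as scikit-learn's `silhouette_score` with `metric="precomputed"` would be a faster alternative to the explicit loop.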

#### 2. Clustering reveals ordered sub-states within disordered A*β*42 ensemble

Once the optimal number of clusters for a given data set is decided using the prescription described above, we inspect the uniqueness and homogeneity of individual clusters by back-mapping to the conformations in the bound and unbound ensembles. Fig. 2 shows the conformations within each cluster of the A*β*42 ensemble, indicating unique topology and secondary structural architecture. To understand this quantitatively, we reordered the conformations based on the cluster indices and plotted their pairwise similarities/distances in Fig. S3 and Fig. S4 for the APO and G5-bound A*β* ensembles, respectively, and compared them with the distance maps before clustering. The results show that the clusters obtained with optimal parameters indeed yield better homogeneity than those obtained with sub-optimal parameters. The conformational distance is measured based on the RMSD of inter-residue LJ energies. To further illustrate the homogeneity, we have also plotted the respective RMSD of Cartesian coordinates. Please note that for the clustering with t-SNE, only the RMSD of LJ energies was used. For the input maps (Fig. S3a and S4a), we ordered the conformations sequentially along the X and Y axes. For the maps generated after clustering, the frames are sorted based on the cluster indices and placed from the 0^{th} cluster to the N^{th} cluster (panels b-d in both the upper and lower rows of Fig. S3 and Fig. S4). As indicated by the figures, the input distance maps of these ensembles show a certain level of conformational memory across contiguous frames (the Red blocks/grids along the diagonal band), as the trajectories are generated from a history-dependent metadynamics approach. Nevertheless, the clustering obtained with sub-optimal parameters adversely affects even this intrinsically clustered data, and several off-diagonal Red patches appear in the plot, indicating either wrong groupings or broken clusters. On the other hand, with the optimal parameters, the algorithm yields better clustering, as indicated by the thickening of the diagonal Red blocks and the reduction of off-diagonal Red patches.

More interestingly, though the G5-bound conformational ensemble was clustered based only on the similarities of protein conformations, the ligand shows a specific binding orientation with the protein within each cluster (Fig. 2(b)). This is an exciting result that sheds light on the hidden ordered features in a disordered IDP ensemble, which can confer specificity for ligand binding. The ability of t-SNE to cluster a seemingly disordered ensemble into substates with distinct structural features can vastly reduce the library for conformational screening from a large ensemble to a handful of structures. This will aid high-throughput structural and statistical analysis of IDP ensemble data and can have a large impact on our fundamental understanding of disorder-function relationships and on the design of therapeutic drugs for IDP molecules.
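The cluster-sorted distance maps used in this subsection to judge homogeneity can be sketched as follows. This is an illustrative sketch only: it assumes a precomputed pairwise distance matrix and integer cluster labels, and the summary statistic (mean intra-cluster versus mean inter-cluster distance) is a simple numerical companion to the visual maps rather than a quantity reported in this work.

```python
import numpy as np

def sort_by_cluster(dist, labels):
    """Reorder a distance matrix so frames of the same cluster are adjacent."""
    order = np.argsort(labels, kind="stable")
    return dist[np.ix_(order, order)], order

def intra_inter_means(dist, labels):
    """Mean distance within clusters and between clusters."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = dist[same & off_diag].mean()
    inter = dist[~same].mean()
    return float(intra), float(inter)

# Hypothetical toy trajectory that alternates between two states.
x = np.array([0.0, 10.0, 1.0, 11.0])
dist = np.abs(x[:, None] - x[None, :])
labels = np.array([0, 1, 0, 1])

sorted_dist, order = sort_by_cluster(dist, labels)
intra, inter = intra_inter_means(dist, labels)
print(order, intra, inter)
```

After sorting, homogeneous clusters appear as low-distance blocks along the diagonal, and a large gap between the inter-cluster and intra-cluster means is the numerical counterpart of clean diagonal blocks with few off-diagonal patches.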

#### 3. Insights into the binding properties of A*β*42 with G5

The cluster-based population statistics of the different metastable conformations have been analyzed and are shown in Fig. 2(c,d). The results indicate that the cluster populations are more evenly distributed in the ligand-bound ensemble than in the APO state. From this population distribution of different metastable conformations, we have estimated that the Gibbs conformational entropy (-Σ(*p ln p*)) of the G5-bound ensemble is larger than that of the APO ensemble (Fig. S5 in SM). The optimal number of unique conformations (30 in APO versus 40 in G5-bound) and their respective Silhouette scores (in the high-dimensional space, 0.21 versus 0.15) (Table S1 and S2) are also consistent with this observation. Taken together, these results further corroborate the entropic expansion on ligand binding deduced in the earlier studies^{89}. Though the ligand has a very specific binding geometry within each cluster, the geometries vary significantly across the different clusters. We show the contact probabilities of G5 with individual protein residues in Fig. 3(a) for individual clusters. We also plot the residue-wise contact probabilities using the total trajectory, which provides averages without clusters (Fig. S6 in SM). As indicated by the figures, G5 preferentially binds to aromatic residues such as Tyr/Phe (residue numbers 10, 19, 20) and hydrophobic residues such as Ile/Val/Met (residue numbers 31, 32, 35, 36). The screening of such large aromatic and hydrophobic residues by the ligand would possibly limit the transient order stabilization and likely increase the heterogeneity/entropy. To further quantify how the contacts of G5 at diverse locations affect the interaction strength, we applied a high-throughput numerical technique called molecular mechanics with generalized Born and surface area solvation (MM/GBSA) to loosely estimate the free energy of binding of ligands to proteins^{93, 94}. Our MM/GBSA-derived binding scores are shown in Fig. 3(b). We see that G5 binds with relatively equal strength in multiple clusters. But interestingly, we also noted a few clusters (cluster numbers 14, 29, and 30) that show statistically stronger binding than the others. More interestingly, these same clusters have a relatively larger population in the ensemble than the other conformers. The protein residues involved in binding in these selected clusters, along with their contributions to the total energy, are plotted in Fig. 3(c), and the binding geometry for the cluster that exhibits the most favorable MM/GBSA binding is shown in Fig. 3(d,e). In Fig. S7 in SM, we also show the same data (binding geometries and residue-wise interactions) for the two other clusters, which show the second and third best MM/GBSA scores. Our analyses reveal that the ligand interacts with multiple favorable sites simultaneously, which indicates that even a partially collapsed or ordered state of an IDP can provide a specific binding pocket for small molecule interactions. These insights, gained as a result of high-fidelity clustering, can be leveraged for future IDP drug design with time-tested strategies similar to the ones applied to folded proteins.
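The population-based entropy estimate mentioned above, -Σ(*p ln p*) over cluster populations, can be sketched as follows. This is a minimal illustration with hypothetical label arrays; it assumes every frame carries equal statistical weight, which may not hold for reweighted or biased ensembles.

```python
import numpy as np

def cluster_populations(labels):
    """Fraction of frames assigned to each cluster (equal frame weights assumed)."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def conformational_entropy(populations):
    """Gibbs entropy -sum(p ln p) in units of k_B; empty clusters are skipped."""
    p = np.asarray(populations, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Hypothetical cluster labels for a short trajectory.
labels = np.array([0, 0, 1, 2])
pops = cluster_populations(labels)        # populations 0.5, 0.25, 0.25
s_toy = conformational_entropy(pops)

# A flatter population distribution has the larger entropy.
s_uniform = conformational_entropy([0.25, 0.25, 0.25, 0.25])
s_skewed = conformational_entropy([0.7, 0.1, 0.1, 0.1])
print(s_toy, s_uniform, s_skewed)
```

The comparison between the uniform and skewed cases mirrors the argument in the text: more evenly populated clusters, and more of them, translate into a larger conformational entropy.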

### C. t-SNE for clustering *α*S conformational ensembles

#### 1. t-SNE reveals distinct conformation sub-states despite extreme plasticity

Next, we examine our clustering framework for understanding the metastable conformations of *α*S. This is a longer IDP consisting of 140 amino acids, and it has a fuzzier conformational landscape than the A*β*42 peptide, with no persistent residual secondary structure propensity. The shorter construct consisting of only the C-terminal residues, which directly bind to Fasudil and Ligand-47, has also been shown to have extremely heterogeneous conformations. The t-SNE projection produces a fully crowded map for the full-length *α*S, except for a few well-segregated dense clusters (Fig. S8a). For the c-ter APO system, there is an almost complete overlap of the conformation space (Fig. S8b), suggesting extreme heterogeneity in its conformational landscape. This is very unlike what is seen in Fig. 1 for A*β*42.

In order to get a better sense of the conformational diversity, we plotted the distance map as done before for the A*β*42 peptide. Here again, due to the extreme heterogeneity, the conformational states rapidly exchange among themselves, which in turn creates a very cluttered distance map of the original trajectory. This is shown in the first subplot for full-length *α*S in Fig. S9(a) and for the c-terminal (c-ter) peptide in Fig. S9(b). This suggests that there are very few intrinsic groupings of these conformations in the high-dimensional space, which is consistent with the t-SNE projections seen in Fig. S8. As a result, the Silhouette score is very low, with *S*_{hd} taking a positive value close to zero, indicating poor clusterability of these ensembles. However, we find that in highly fuzzy IDP systems with complex heterogeneous conformations, obtaining a positive clustering coefficient is itself an encouraging sign. A small positive score can indicate that the sample is on or very close to the decision boundary between two neighboring clusters. In such cases, the ideal number of clusters would be either many small clusters (for a scattered map, as in the case of full-length *α*S) or a single large cluster (for a continuous map, as in the case of c-ter), and hence must be chosen based on how fine or coarse the captured details need to be. Therefore, we have chosen the maps generated with K=50 and perplexity=400 for the full-length system and K=20 and perplexity=1800 for the C-terminus system for further analyses (Fig. S10). We divide the continuous blob of t-SNE projection space into 20 contiguous regions for the APO c-ter system and 50 clusters for the full-length protein, with the idea of obtaining a cluster-wise visualization that is more informative than the raw ensemble data. This choice of hyperparameters is further justified by measuring the extent of homogeneity among intra-cluster conformations in comparison to inter-cluster conformations. As shown by the distance maps generated after the clustering exercise in Fig. S9, we obtained clusters with balanced intra-cluster homogeneity and inter-cluster diversity. This is characterized by deeper Red along the diagonal and Blue in the off-diagonal region of the distance map and is particularly distinct for the clustering with K=50 and K=20 for the full-length and c-terminal ensembles, respectively. The other choices of K do not give such clear distinctions between clusters, as the number of clusters ends up being either too high or too low. This further underlines the ability of the clustering algorithm to preserve local neighborhoods.
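The partitioning of a continuous two-dimensional map into K contiguous regions can be sketched with a plain Lloyd-style K-means. This is a stand-in written for illustration, not the implementation used in this work: the deterministic farthest-first initialization is an assumption made here so that the toy example is reproducible, and the two-blob data set is hypothetical. A library K-means would normally be used instead.

```python
import numpy as np

def farthest_first_centers(x, k):
    """Deterministic initialization: start at x[0], then repeatedly pick the
    point farthest from all centers chosen so far (an assumption of this sketch)."""
    centers = [x[0]]
    d2 = ((x - centers[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        idx = int(d2.argmax())
        centers.append(x[idx])
        d2 = np.minimum(d2, ((x - x[idx]) ** 2).sum(axis=1))
    return np.array(centers)

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm on low-dimensional map coordinates."""
    x = np.asarray(points, dtype=float)
    centers = farthest_first_centers(x, k)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its members (keep it if empty).
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical 2D map with two compact groups of points.
blob_a = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
blob_b = blob_a + 10.0
points = np.vstack([blob_a, blob_b])
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

On a continuous map without gaps, K-means simply tiles the space into compact cells, which is why K there controls the coarseness of the description rather than recovering an intrinsic number of states.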

Visual representations of the conformations in the 50 subgroups of the APO full-length *α*S system and the 20 subgroups of the APO c-ter *α*S system are shown in Fig. 4(a,b). In spite of the extreme heterogeneity of the original space in the *α*S ensemble, the subspace represented by t-SNE captures distinct clusters with unique and uniform conformations in both systems. The conformations neither have any secondary structure nor collapse to form rigid pockets as in the case of the A*β* system. Yet, there seems to be an order within each of the clusters. Interestingly, we find that the conformations of the c-ter *α*S peptide range from a fully extended rod-like shape to acutely bent hairpin conformations (Fig. 4(b)), with different degrees of bending between these two extremes. To bring out this feature clearly, we have presented the clusters in a sequence arranged by the average bend angle measured between the C-alpha atoms of residues 121, 131, and 140 (Fig. 5(a)), which make up the N-terminal end, the middle, and the C-terminal end of the peptide, respectively. We have plotted the distribution of the bend angles in Fig. 5(b). As evident from the figure, the one-dimensional bend angle separates the fuzzy disordered ensemble well. An interesting and desirable by-product of this high-fidelity clustering is that it seems to suggest a single collective variable (the bend angle in this case) that distinguishes the various conformations across clusters. This could be useful for further thermodynamic and kinetic studies, which are often explored with a limited number of well-defined collective variables due to algorithmic and computational limitations.
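The bend angle used above to order the clusters can be sketched as the angle at the middle C-alpha atom between the vectors pointing to the two terminal C-alpha atoms. This geometric definition is an assumption of the sketch (the text specifies only the three residues involved), and the coordinates below are hypothetical.

```python
import numpy as np

def bend_angle(p_first, p_middle, p_last):
    """Angle in degrees at p_middle between the vectors to p_first and p_last.

    Assumed convention: 180 degrees is a fully extended rod, small angles
    correspond to hairpin-like conformations.
    """
    v1 = np.asarray(p_first, dtype=float) - np.asarray(p_middle, dtype=float)
    v2 = np.asarray(p_last, dtype=float) - np.asarray(p_middle, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Hypothetical C-alpha positions (arbitrary units) for the three residues.
rod = bend_angle([0, 0, 0], [1, 0, 0], [2, 0, 0])        # collinear atoms
corner = bend_angle([1, 0, 0], [0, 0, 0], [0, 1, 0])     # right angle
hairpin = bend_angle([1, 0, 0], [0, 0, 0], [1, 0.1, 0])  # ends close together
print(rod, corner, hairpin)
```

Averaging this angle over the frames of each cluster gives one number per cluster, which is enough to sort the clusters from extended to bent.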

#### 2. Insights into conformation kinetics and binding of c-terminal *α*S

Next, we compare the Fasudil- and Lig47-bound geometries of c-ter *α*S with the APO state in order to elucidate their differential ligand binding behavior. The t-SNE maps for the Fasudil- and Lig47-bound c-ter *α*S are shown in Fig. S11 for different values of perplexity. Like the APO c-ter projections, the t-SNE maps for the Fasudil-bound (Fig. S11(a)) and Lig47-bound (Fig. S11(b)) systems are spread out and continuous with no distinct groups, and we therefore used the same value of K = 20 to cluster them. Also, much like the APO c-ter *α*S, the bound peptides exhibit a conformational space that ranges from acutely bent hairpin-like conformations to rod-like conformations, as shown in Fig. 5(d,g).

Unlike the localized binding of G5 to individual metastable states of A*β*42, the binding of ligands to *α*S seems to be highly non-localized and occurs at scattered sites of the protein (Fig. 5(d,g)). This is mainly because, as reported in one of our recent papers^{91}, only weak specificity is exerted between the small molecules and the protein. However, when comparing the conformations of the Fasudil- versus Lig-47-bound protein, we noticed that the latter exerts better localization/specificity than the former, particularly with the bent states. Yet the binding does not greatly change the bending profile or the population of states, indicating that the ligands could not sufficiently lock the protein in specific conformation(s) or induce any change in the conformational space. To show this clearly, we have plotted the accessible t-SNE conformational space of the collated APO and ligand-bound trajectories. As shown in Fig. S12(a), the conformational spaces overlap almost fully, suggesting that the binding does not favor one structure over another and is ubiquitous. This is in stark contrast to the behavior exhibited by the A*β*42 APO and ligand-bound t-SNE projections shown in Fig. S12(b). The map in Fig. S12(b) clearly shows that the APO and bound A*β*42 ensembles have distinct clusters, with only a few regions showing overlapping projections.

Though the different extended and bent conformations are almost equiprobable in both the APO and ligand-bound *α*S systems (Fig. 5(c,f,i)), the bent states exhibit better kinetic stability than the extended states, as shown by their transition probabilities in Fig. 6(a-c). As noted above, the same bent states localize Lig47 at specific sites more than they do Fasudil. Hence, to understand the differential binding of Fasudil and Lig47 across the different states, we looked at the contact probability of each residue of c-ter *α*S (Fig. 6(d,e)). We find that the contact probabilities for Lig47 are much higher than those for Fasudil. In particular, the bent states favor strong and simultaneous binding of three Tyr residues (residues 5, 13, and 16) to Lig47 (Fig. 6(d-h)).

Taken together, the results of the t-SNE-based clustering exercise on *α*S reveal an interesting and non-obvious finding: local peptide curvature may act as a site of ligand binding. We see that, besides the weak chemical specificity arising from aromatic-π interactions, the small molecules recognize certain local curvatures in the protein. Such physical complementarity is more prominent for Lig47, which clutches the bent conformations and, as a result, possibly has a stronger affinity for the protein than Fasudil. This result may explain the experimental observation that Lig47 binds *α*S more strongly than Fasudil does. In addition, the study hints at a novel strategy of targeting sites on an IDP that can adopt high local curvature (bends) as a potential therapeutic intervention against the surging IDP-driven pathologies.

## III. Conclusion

Despite the well-established knowledge of the inherent conformational heterogeneity of IDP ensembles, and despite advances in accurately determining ensemble conformations using integrative approaches, the successful application of IDPs to drug targeting remains limited. The main reason behind this is the lack of accurate classification of the conformational ensemble. Our algorithm provides such a tool: several thousand structures can be grouped into representative sets that form a distinct and tractable conformational library, which will aid *in silico* functional and drug-screening studies carried out in a rational manner. The tool also makes it convenient to generate sub-groups of similar conformations for long IDPs, whose full conformational ensemble is otherwise intractable for structural biophysical analyses. For example, we applied our algorithm to study how the long disordered regions of the FUS protein interact with RNA molecules^{95}, and the t-SNE tool allowed us to illustrate the complex binding behavior of FUS with RNA in an interpretable manner.

While highlighting the advantages of t-SNE, we acknowledge that, unlike projection techniques such as PCA and MDS, the interpretation of a t-SNE projection is non-trivial. One needs to process the data set extensively and run the algorithm repeatedly to arrive at hyperparameters that provide a clustering solution that can be meaningfully interpreted and visualized. This limitation arises because t-SNE optimization is non-convex and starts from a random initialization, so different runs produce different sub-optimal visual representations. However, this mainly affects the global geometry and hierarchical positioning of the clusters, not the local clustering pattern. We illustrate the consistency of local clustering across runs with different random initializations by quantifying the silhouette score and the mutual information of clusters in Table S4. Moreover, finding a single optimal global geometry for an IDP dataset is often not possible owing to the extreme heterogeneity of IDPs, with almost equal transition probabilities between different clusters. If global preservation is nevertheless required, tuning the perplexity^{96} and other parameters such as the early exaggeration and learning rate, initializing with PCA, and using multi-scale similarities will be helpful^{75, 97}. In addition, variants of t-SNE such as h-SNE can also be helpful^{98}.

Another factor to consider when using t-SNE on ultra-large datasets is the associated computational cost. Analyzing large data sets (*n* *≫* 10^{6}) with t-SNE is not only computationally expensive (the cost scales as *O*(*n*^{2})) but also suffers from slow convergence and fragmented clusters. If the computational cost becomes prohibitive, one could use methods such as the Barnes-Hut approximation^{99} or the FIt-SNE method to accelerate the computation. In short, the Barnes-Hut approximation considers only a subset of nearest neighbors when modeling the attractive forces, and the FIt-SNE method relies on a fast Fourier transform; these reduce the computational complexity to *O*(*n* log *n*) and *O*(*n*), respectively. To mitigate slow convergence and fragmentation of clusters, it is often desirable to run t-SNE on a sub-sample of the trajectory that includes all unique populations and then project the rest of the points onto the existing map.

In discussing the possible pitfalls of t-SNE, it should also be noted that adding a new data point to an existing t-SNE map can lead to erroneous interpretation, as the method is essentially non-parametric and does not directly construct a mapping function between the high-dimensional and low-dimensional spaces. Recent extensions that combine the method with deep neural networks allow for parametric mapping^{75,100} and could be tried if such a situation cannot be avoided. Moreover, out-of-sample mapping with parametric t-SNE can be explored further for driving simulations from one state to another and for matching experimentally known values. In such cases, for instance, the similarities can be obtained from NMR chemical shifts or from SAXS intensities. From that perspective, t-SNE looks very promising as an integrative modeling tool.

## IV. Materials and Method

### A. Input for t-SNE analysis

The system details for the trajectories of the alanine dipeptide, A*β*42, and *α*S ensembles are reported in Table 1. The conformations in the trajectories were represented by backbone dihedral angles, inter-residue LJ interaction potentials, and atomic coordinates of heavy atoms for the alanine dipeptide, A*β*42, and *α*S ensembles, respectively.

#### 1. t-SNE based dimensional reduction

Given $n$ observations (conformations), each with $d$-dimensional input features in the original space, $X = \{x_1, x_2, \ldots, x_n\} \in \mathbb{R}^{d}$, t-SNE maps a smaller $s$-dimensional embedding of the data that we denote here by $Y = \{y_1, y_2, \ldots, y_n\} \in \mathbb{R}^{s}$. Here $s \ll d$ and typically $s = 2$ or 3. This projection is based on the similarity and dissimilarity between conformations. The similarity or dissimilarity between the conformations in the high-dimensional space is computed based on Euclidean or RMS distances. t-SNE aims to preserve the local neighborhood such that points that are close together in the original space remain close in the embedded space. In the original space, the likelihood of a point $x_j$ being a neighbor of $x_i$, rather than any other point $x_k$, is modeled as a conditional probability $p_{j|i}$, assuming a Gaussian distribution centered at point $x_i$ with a standard deviation of $\sigma_i$:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^{2} / 2\sigma_i^{2}\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^{2} / 2\sigma_i^{2}\right)} \tag{1}$$

Similarly, the conditional probability in the embedded space, $q_{j|i}$, with the same $n$ points initialized randomly, is computed, but now based on a t-distribution. Having a longer tail than the Gaussian, the t-distribution moves dissimilar points farther apart to ensure less crowding in the reduced space. To ensure symmetry in the pairwise similarities, the joint probability is calculated from the conditional probability as follows:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

Finally, the difference between the two probability distributions, calculated as the Kullback-Leibler (KL) divergence, is minimized by iteratively rearranging the points in the low-dimensional space using gradient descent optimization:

$$KL(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

where $P$ and $Q$ are the joint probability distributions in the high- and low-dimensional spaces over all the data points.
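The quantities described above can be illustrated with a short NumPy sketch. It is a minimal illustration under stated assumptions, not the implementation in the repository referenced in this work: the function names are hypothetical, a fixed Gaussian width is used for every point, and the embedded-space similarities use the standard t-SNE normalization over all pairs, which is assumed here.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all rows of Z."""
    sq = np.sum(Z * Z, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    np.fill_diagonal(D, 0.0)
    return np.maximum(D, 0.0)          # clip tiny negative round-off values

def high_dim_joint(X, sigma):
    """Symmetrized joint probabilities p_ij built from Gaussian conditionals p_{j|i}."""
    n = X.shape[0]
    logits = -pairwise_sq_dists(X) / (2.0 * sigma[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)               # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)         # row i holds p_{j|i}
    return (cond + cond.T) / (2.0 * n)

def low_dim_joint(Y):
    """Student-t similarities q_ij (assumed: normalized over all pairs)."""
    W = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

def kl_divergence(P, Q, eps=1e-300):
    """KL(P || Q), summed over pairs with p_ij > 0; eps only guards the division."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))          # toy stand-in for conformational features
Y = rng.normal(size=(50, 2))           # random initial embedding
P = high_dim_joint(X, sigma=np.ones(50))
Q = low_dim_joint(Y)
cost = kl_divergence(P, Q)             # the quantity that gradient descent would minimize
```

In an actual run the width $\sigma_i$ differs per point and is set by the perplexity, and the embedding $Y$ is updated iteratively to lower `cost`.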

The major tunable hyperparameters in t-SNE are the perplexity, the learning rate, and the number of iterations. The perplexity value, $Perp$, defines the Gaussian width, $\sigma_i$, in Equation 1 above such that

$$\log_2 Perp = H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$$

for all $i$. Loosely, this parameter controls the number of nearest neighbors each point is attracted to and therefore balances the preservation of similarities at a local versus global scale. Typically, low perplexity values tend to preserve the finer local scale and high perplexity values project a global view. To optimize perplexity, we ran the algorithm with varying perplexity values and chose the one that yields a high silhouette score. The other two parameters, the learning rate and the number of iterations, control the gradient descent optimization. While we chose the default value of 200 for the learning rate, the number of iterations was chosen to be 3500, which is large enough to avoid random fragmentation of clusters, as suggested in the literature^{97}.
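The relation between the perplexity and the Gaussian width can be made concrete with a small sketch. The version below is a minimal illustration, assuming a bisection search on $\sigma_i$ for a single point (a common way to solve this relation); the function names and the toy data are hypothetical and are not taken from our implementation.

```python
import numpy as np

def conditional_row(sq_dists, sigma):
    """p_{j|i} for one point i, given its squared distances to all other points."""
    logits = -sq_dists / (2.0 * sigma ** 2)
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def row_perplexity(p):
    """2 ** H(P_i), with H the Shannon entropy in bits."""
    nz = p[p > 0]
    return 2.0 ** (-np.sum(nz * np.log2(nz)))

def sigma_for_perplexity(sq_dists, target, tol=1e-4, max_steps=100):
    """Bisection on sigma_i until the row perplexity matches the target value."""
    lo, hi = 1e-10, 1e10
    sigma = 1.0
    for _ in range(max_steps):
        perp = row_perplexity(conditional_row(sq_dists, sigma))
        if abs(perp - target) < tol:
            break
        if perp > target:
            hi = sigma                     # Gaussian too wide: too many effective neighbors
        else:
            lo = sigma                     # Gaussian too narrow: too few effective neighbors
        sigma = np.sqrt(lo * hi)           # geometric midpoint of the bracket
    return sigma

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))              # toy data, not an IDP ensemble
d0 = np.sum((X[1:] - X[0]) ** 2, axis=1)   # squared distances from point 0 to the rest
sigma0 = sigma_for_perplexity(d0, target=30.0)
p0 = conditional_row(d0, sigma0)
```

Because the perplexity grows monotonically with $\sigma_i$, the bracket always contains the solution as long as the target lies between 1 and the number of other points.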

### B. K-means clustering of data on the reduced space obtained from t-SNE

K-means is one of the simplest unsupervised clustering algorithms; it partitions the data into non-overlapping clusters. The algorithm starts by grouping the data points randomly into K clusters, where K is specified by the user. It then iterates between computing the cluster centroids and reassigning each data point to the nearest centroid until no further improvement is possible. The parameter K is optimized by running the algorithm at various values and choosing the one that maximizes the clustering efficiency.
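The iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used in this work: the function name `kmeans`, the number of restarts, and the re-seeding of empty clusters are assumptions made for the example, and the toy points merely stand in for a t-SNE embedding.

```python
import numpy as np

def kmeans(points, k, n_restarts=5, max_iter=100, seed=0):
    """Lloyd's iteration from a random initial grouping; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    n = len(points)
    best = (np.inf, None, None)
    for _ in range(n_restarts):
        labels = rng.integers(k, size=n)           # random initial grouping into K clusters
        for _ in range(max_iter):
            # Centroid of each cluster; an empty cluster is re-seeded with a random point.
            centroids = np.stack([
                points[labels == c].mean(axis=0) if np.any(labels == c)
                else points[rng.integers(n)]
                for c in range(k)
            ])
            d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            new_labels = d2.argmin(axis=1)         # reassign to the nearest centroid
            if np.array_equal(new_labels, labels):
                break                              # assignments are stable: converged
            labels = new_labels
        inertia = d2[np.arange(n), labels].sum()   # within-cluster sum of squares
        if inertia < best[0]:
            best = (inertia, labels, centroids)    # keep the best of the restarts
    return best[1], best[2]

rng = np.random.default_rng(42)
embedding = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                       rng.normal(10.0, 0.5, size=(50, 2))])   # toy 2D map with two groups
labels, centroids = kmeans(embedding, k=2)
```

Several restarts are used because a single random start can converge to a poor local optimum; the run with the lowest within-cluster sum of squares is kept.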

### C. Optimizing the hyperparameters (*Perp* in t-SNE and *K* in k-means) using Silhouette score

The silhouette score for a data point $i$ is measured by

$$S(i) = \frac{b_i - a_i}{\max(a_i, b_i)}$$

where $a_i$ is the intra-cluster distance, defined as the average distance from $i$ to all other points in the cluster to which it belongs, and $b_i$ is the inter-cluster distance, measured as the average distance from $i$ to the points of the closest cluster other than its own. The silhouette score ranges between 1 and −1, where a high value indicates good clustering and values closer to 0 indicate poor clustering. A negative value indicates that the clustering configuration is wrong/inappropriate.

The distance between points is usually measured with the Euclidean distance metric. Since the clusters in our case are identified in a reduced representation obtained with t-SNE, computing the score based only on the distances in the reduced space ($S_{ld}$) may be misleading if points are wrongly put together during the dimensional reduction step. Therefore, it is important to measure the goodness of clustering with respect to the original distances in the high-dimensional space ($S_{hd}$), in addition to those in the low-dimensional space. The integrated score ($S_{ld} \ast S_{hd}$) therefore adds value to the estimated clustering efficiency in terms of reliability.
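A minimal sketch of the scoring follows, assuming Euclidean distances in both spaces and the mean silhouette over all points as the per-space score. The function names are hypothetical, and the toy data (two well-separated groups, with the first two coordinates standing in for an embedding) are for illustration only.

```python
import numpy as np

def pairwise_dists(Z):
    """Euclidean distance matrix; memory grows as n^2 * d, so toy sizes only."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def mean_silhouette(D, labels):
    """Mean of (b_i - a_i) / max(a_i, b_i) over all points, from a distance matrix D."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue                                   # singleton cluster: score left at 0
        a = D[i, same].sum() / (same.sum() - 1)        # D[i, i] = 0 is excluded by the count
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

def integrated_score(X_high, Y_low, labels):
    """Product of the silhouette scores in the low- and high-dimensional spaces."""
    s_hd = mean_silhouette(pairwise_dists(X_high), labels)
    s_ld = mean_silhouette(pairwise_dists(Y_low), labels)
    return s_ld * s_hd, s_ld, s_hd

rng = np.random.default_rng(0)
X_high = np.vstack([rng.normal(0.0, 0.5, size=(30, 5)),
                    rng.normal(10.0, 0.5, size=(30, 5))])
Y_low = X_high[:, :2]                  # stand-in for a t-SNE embedding
labels = np.repeat([0, 1], 30)
score, s_ld, s_hd = integrated_score(X_high, Y_low, labels)
```

The individual scores are returned alongside the product because two negative scores would also multiply to a positive value.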

### D. Cluster-wise conformational analysis and visualization

The conformations corresponding to each cluster are extracted using Gromacs based on the cluster indices. All conformations were used for estimating the contact probability, binding energy, and homogeneity within individual clusters. For visualization purposes, we extracted from each cluster the ten representative conformations closest to the corresponding cluster centroid (identified using a KD-tree-based nearest-neighbor search algorithm). The conformations are rendered using VMD.
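The selection of representative conformations can be sketched as follows. For brevity, this illustration replaces the KD-tree search with a brute-force distance sort, which returns the same nearest points on small data; the function name and the toy coordinates are hypothetical, and cluster labels are assumed to run from 0 to K − 1 in the same order as the centroid rows.

```python
import numpy as np

def representative_frames(embedding, labels, centroids, n_rep=10):
    """Indices of the n_rep frames closest to each cluster centroid in the embedding."""
    reps = {}
    for c, center in enumerate(centroids):
        members = np.flatnonzero(labels == c)                  # frame indices in cluster c
        dist = np.linalg.norm(embedding[members] - center, axis=1)
        reps[c] = members[np.argsort(dist)[:n_rep]]            # nearest first
    return reps

# Toy embedding with two clusters of three frames each.
embedding = np.array([[0.0, 0.0], [1.0, 0.0], [2.5, 0.0],
                      [10.0, 0.0], [11.0, 0.0], [13.0, 0.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
centroids = np.array([embedding[labels == c].mean(axis=0) for c in range(2)])
reps = representative_frames(embedding, labels, centroids, n_rep=2)
```

The returned frame indices can then be passed to a trajectory tool to write out the corresponding structures.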

Our current implementation of the model is available on the GitHub repository: https://github.com/codesrivastavalab/tSNE-IDPClustering.

## V. Author Contributions

R.A. and A.S. conceived and designed the research; A.R. performed the calculations with help from J.K.; R.A., J.K., M.B., P.R., and A.S. analyzed data; R.A. and A.S. wrote the paper together with inputs from J.K., M.B., and P.R.

## Supplemental Material

### A. Physical intuition into tSNE-based clustering algorithm using alanine dipeptide

We first apply the t-SNE method to the alanine dipeptide (ADP) trajectory and compare the results with the well-known 2D Ramachandran plot, using the dihedral distance as the dissimilarity score. The Ramachandran plot is also called the *ϕ − ψ* map, after the backbone dihedral angles along the peptide bond^{1,2}. Here, we can fix the number of clusters to four (K = 4) based on the four known sub-regions of the Ramachandran plot, namely beta-sheet, PPII, right-handed *α* helix, and left-handed *α* helix (Fig. S2(a)). The left-handed *α* helix region lies separately in the second half of the *ϕ* dihedral axis, whereas the beta-sheet and right-handed *α* helical regions occupy the first half of the *ϕ* dihedral axis. To quantify the goodness of clustering, we calculate the Silhouette score^{3} on the raw data and arrive at a score of 0.55 for the 2D map. Of note, when the number of clusters is not known a priori, unlike for the ADP system, we have a prescription that uses the silhouette score together with the t-SNE perplexity values to find the optimum number of clusters.
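Because dihedral angles are periodic, a dissimilarity between two (*ϕ*, *ψ*) pairs has to account for wrap-around at ±180°. The sketch below shows one possible form, assumed here for illustration only (the exact definition used in this work is not reproduced): each angular difference is reduced to its shortest arc, and the differences are combined as a Euclidean norm. The function name is hypothetical.

```python
import numpy as np

def dihedral_distance(angles_a, angles_b):
    """Periodic distance between two sets of dihedral angles given in degrees.

    Assumed form: shortest-arc difference per angle, combined as a Euclidean norm.
    """
    delta = np.abs(np.asarray(angles_a, dtype=float) - np.asarray(angles_b, dtype=float)) % 360.0
    delta = np.minimum(delta, 360.0 - delta)       # wrap around the +/-180 degree boundary
    return float(np.sqrt(np.sum(delta ** 2)))

# Two conformations on opposite sides of the phi = +/-180 boundary are close, not far apart.
d_wrapped = dihedral_distance([-179.0, 0.0], [179.0, 0.0])
d_same = dihedral_distance([-60.0, -45.0], [-60.0, -45.0])
```

Without the wrap-around step, the first pair would appear 358° apart along *ϕ*, which would split a single basin of the Ramachandran plot into two.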

A key feature of t-SNE is a tuneable parameter called “perplexity,” which (loosely) dictates how to balance attention between local and global aspects of the data. The parameter is, in a sense, a guess about the number of close neighbors each point has. The perplexity value has a complex effect on the resulting maps. The original paper says, “The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50.” But the story is more nuanced than that. Getting the most from t-SNE may mean analyzing multiple plots with different perplexities.

In the Ramachandran plot, projecting the data onto either single axis (*ϕ* or *ψ*) causes different conformations to overlap. We show this at the bottom and left of Fig. S2(a) for projections along *ϕ* and *ψ*, respectively. PCA, the most common dimensionality reduction method, fails to achieve a clear separation and has a very low Silhouette score of 0.154 (see Fig. S2(b)). This is because PCA linearly transforms the data along the axis of maximal variation, which is the *ψ* axis in the Ramachandran plot, and hence cannot capture the distinction between the L-helix and the other conformers. The t-SNE projections, on the other hand, provide more faithful representations of the clusters. In Fig. S2(c), we plot the t-SNE projections for a range of perplexity values. For a certain perplexity value (*Perp* = 400), t-SNE clearly separates the four sub-regions as in the original space, with a much improved Silhouette score of 0.50. At low perplexities, t-SNE focuses on local variations and tries to preserve the closest neighbors in the original space as much as possible. However, very low perplexity yields too many clusters with single or very few conformations per cluster, which nullifies the advantages of clustering in the first place. Conversely, t-SNE essentially degrades to PCA at very high perplexities and leads to overcrowding, as greater variations are tolerated at high perplexity values. With perplexity as a tuneable parameter that balances the degree of local preservation against overcrowding, t-SNE offers an exciting possibility to meaningfully cluster and visualize complex and heterogeneous high-dimensional IDP datasets.

## VI. Acknowledgments

A.R. thanks the Wellcome Trust DBT India Alliance for the Early Career Fellowship (Grant number: IA/E/18/1/504308). A.S. thanks the Department of Science and Technology (DST) of India for the early career grant (SERB-ECR/2016/001702). A.S. also thanks the DST for the National Supercomputing Mission grant (DST/NSM/R&D HPC Applications/2021/03.10). Computational support from the high-performance computing facility “Beagle”, set up with grants from a partnership between the Department of Biotechnology of India and the Indian Institute of Science (IISc-DBT partnership program), is gratefully acknowledged. A.R. and A.S. are grateful to the SciNet HPC Consortium, ComputeCanada for their generous computational support.

## References

- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
- 29.
- 30.
- 31.
- 32.
- 33.
- 34.
- 35.
- 36.
- 37.
- 38.
- 39.
- 40.
- 41.
- 42.
- 43.
- 44.
- 45.
- 46.
- 47.
- 48.
- 49.
- 50.
- 51.
- 52.
- 53.
- 54.
- 55.
- 56.
- 57.
- 58.
- 59.
- 60.
- 61.
- 62.
- 63.
- 64.
- 65.
- 66.
- 67.
- 68.
- 69.
- 70.
- 71.
- 72.
- 73.
- 74.
- 75.
- 76.
- 77.
- 78.
- 79.
- 80.
- 81.
- 82.
- 83.
- 84.
- 85.
- 86.
- 87.
- 88.
- 89.
- 90.
- 91.
- 92.
- 93.
- 94.
- 95.
- 96.
- 97.
- 98.
- 99.
- 100.