Classification of protein binding ligands using structural dispersion of binding site atoms from principal axes

Many researchers have studied the relationship between the biological functions of proteins and the structures of both their overall backbones of amino acids and their binding sites. A large amount of the work has focused on summarizing structural features of binding sites as scalar quantities, which can result in a great deal of information loss since the structures are three-dimensional. Additionally, a common way of comparing binding sites is via aligning their atoms, which is a computationally intensive procedure that substantially limits the types of analysis and modeling that can be done. In this work, we develop a novel encoding of binding sites as covariance matrices of the distances of atoms to the principal axes of the structures. This representation is invariant to the chosen coordinate system for the atoms in the binding sites, which removes the need to align the sites to a common coordinate system, is computationally efficient, and permits the development of probability models. These can then be used to both better understand groups of binding sites that bind to the same ligand and perform classification for these ligand groups. We demonstrate the effectiveness of our method through classification studies with two benchmark datasets using nearest mean and polytomous logistic regression classifiers.

Proteins are molecules consisting of chains of amino acids that fold into a 3-dimensional structure and perform biological functions by binding to various chemicals. In protein-ligand binding, the ligand is usually a signal-triggering molecule that binds to a site near the surface of a target protein. A common hypothesis is that proteins which perform similar functions should bind to the same ligands and, as such, have binding sites with similar shapes. Therefore, analyzing structures of protein-ligand binding sites can be considered an initial step towards protein function prediction. While studies typically focus initially on binding sites with known binding activity, the common goal is to then utilize these categories of binding sites to predict the binding activity of proteins with unknown function. This problem has many useful applications, such as effective drug discovery with fewer side effects, development of structure-based drug designs, and disease diagnosis.

[12] introduced two web servers and software packages named SiteEngine and Interface-to-Interface (I2I)-SiteEngine for recognizing similarity between binding sites and interfaces. [13] describes a database named SiteBase, which holds information about structural similarity between known ligand-binding sites. For the comparison of these binding sites, geometric hashing was used and the equivalent atom constellations between pairs of binding sites were identified. [14] assesses similarity between pockets in protein binding sites by aligning them in 3D space and comparing the results with a convolution kernel. Then [15] discusses the TIPSA algorithm, based on the iterative closest point (ICP) algorithm ([17]). While many more alignment-based methods exist, these are the key studies that led directly to this current work.
While these methods show promising results, they are often computationally expensive to perform and the results can be difficult to analyze since the methods are restricted to utilizing just pairs of binding sites. Using pairwise comparisons of similarity scores rather than structural characteristics of individual binding sites greatly impairs the ability of researchers to develop probability models and machine learning methods for understanding and modeling binding site activity.

On the other hand, when researchers characterize features of solitary binding sites directly, they typically seek to summarize various structural features using scalar quantities, such as a site's volume. While such quantities can be useful for characterizing the sites, there can be problems with using such an approach. First, for binding sites that are relatively flat, slight changes in the coordinates of even one or two atoms can substantially change a site's volume, which can, in turn, make the characteristic unstable for use in characterization and modeling. Furthermore, the volume only directly describes information about the size of the region enclosed by the surface of the binding site. Information about the shape and interior structure of the binding site is lost. To combat this type of information loss, researchers can use additional descriptors to quantify these characteristics. For instance, one measure that can describe the shape of a binding site is its sphericity, a measure of similarity to a sphere that is proportional to a ratio of a function of the volume of a surface to its surface area. However, this again fails to directly account for the interior structure of the binding site. A standard measure with which to quantify information about both the size of a binding site and its interior structure is, as used in [15], the radius of gyration, which is the standard deviation of the distances of the atoms to the center of mass of the binding site.
Unfortunately, all of these quantities result in a loss of a large portion of information about the structure of the binding sites. While the radius of gyration does at least describe information about the structure of the entire binding site and not just its surface, much information is still lost about the variability within the structure of the binding site because it reduces all of the variability, which occurs in three dimensions, to a single dimension.

These types of univariate measures can certainly be combined and analyzed either using traditional multivariate analysis methods or various regression analyses to improve our understanding of the structural information of binding sites. However, the relationships between these univariate characteristics are often quite complicated, which can make it more difficult to gain a complete view of a binding site's structure.

As such, motivated by the principles of object data analysis (ODA) (see [16]), a more ideal approach is to consider a higher-level representation of a binding site's structural information that directly incorporates the types of information found in univariate descriptors while also preserving information that can be lost by condensing such complex information to scalar quantities. In this paper, we encode the structural information in binding sites as a covariance matrix in a novel way that eliminates the need to align binding sites to place them in a common coordinate system.

The remainder of the paper is organized as follows. In Section 2, we describe the data sets that we use throughout our study. In Section 3, we present our methodology, including a description of our novel representation of protein binding sites, how we quantify differences between these representations, and a description of the classification procedures we use to evaluate the effectiveness of our representation. In Section 4, we present a detailed analysis of our results, including a discussion of computational costs. Finally, Section 5 contains our conclusions and a discussion of potential areas for future work.

Motivated by the classification studies of [14] and [15], we decided to focus our attention on two datasets from the literature that consist of a variety of binding sites with varying size and chemical and structural characteristics, all known to bind to just a handful of ligands, so that our eventual classification study could be well-formed and our results could be compared to other methods whose creators had similar goals to ours.

Our first data set, from [18], is known in the literature as the Kahraman dataset. It consists of 100 protein binding sites which bind to one of 10 ligands (AMP, ATP, FAD, FMN, GLC, HEM, NAD, PO4, EST, AND). These ligands vary in size and flexibility. PO4 is the smallest ligand and the most rigid molecule; FAD is the largest and the most flexible. Despite this set's small size, it provides a carefully crafted benchmark that can be used effectively to demonstrate whether a method can link structure and function via a classification study. The second dataset is called the extended Kahraman dataset in [14]. It consists of 972 protein binding sites, of which the Kahraman dataset is a subset. These sites also bind to one of the same 10 ligands. Summaries of the data sets are shown in Tables 1 and 2. The relative proportions of each ligand group in the extended data set differ considerably from the original Kahraman set. Most notably, there is a substantially higher proportion of proteins that bind to the ligand PO4 in this dataset. We obtained information about the 3D structures of these proteins from the PDB.

To consistently decide what atoms in each protein should be included in binding sites for our study, we adopted the convention of [14], which experimentally determined that all atoms within 5.3 Å of the binding ligand in the crystal structure should be included in a binding site. This definition also facilitates comparisons to [15] and [14].

In this research, we approach the ligand-binding protein prediction problem by taking a higher-level approach that encodes the structural information found in protein binding sites as a data object in a manner that reduces the amount of information lost compared to using univariate descriptors or structural characteristics. Our approach consists of three main parts: (1) developing a novel representation of binding site structural information as a 3 × 3 covariance matrix that eliminates the need to perform computationally expensive alignment procedures, (2) using properties of covariance matrices to quantify differences between binding sites, and (3) developing nonparametrically defined, empirical probability distributions that both provide insight into the relationship between binding site structure and biological function and allow us to perform classification studies.
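As a rough sketch, the 5.3 Å atom-selection convention can be implemented as follows; the function name and array layout are our own illustration, not code from the study:

```python
import numpy as np

def binding_site_atoms(protein_coords, ligand_coords, cutoff=5.3):
    # Keep every protein atom lying within `cutoff` angstroms of ANY ligand atom,
    # following the 5.3 A convention described above.
    protein_coords = np.asarray(protein_coords, dtype=float)  # (n, 3)
    ligand_coords = np.asarray(ligand_coords, dtype=float)    # (m, 3)
    diffs = protein_coords[:, None, :] - ligand_coords[None, :, :]  # (n, m, 3)
    dists = np.linalg.norm(diffs, axis=2)                           # (n, m)
    return protein_coords[(dists <= cutoff).any(axis=1)]

# Toy example: one ligand atom at the origin, two protein atoms; only the
# atom 1 angstrom away is retained.
site = binding_site_atoms([[1.0, 0.0, 0.0], [10.0, 0.0, 0.0]],
                          [[0.0, 0.0, 0.0]])
```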

Representation via covariance of distances to principal axes

A natural way to encode information about the structure and size of a point cloud is through the covariance matrix of the coordinates of the points in the cloud. The covariance matrix of the coordinates is

S_x = (1/n) \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top,

where x_i is a vector containing the 3-dimensional coordinates of the ith atom, n is the number of atoms in the binding site, \bar{x} is the 3-dimensional vector of the coordinates of the center of mass of the atoms in the binding site, and \top denotes the matrix transpose. For a 3-dimensional point cloud, S_x is a 3 × 3 symmetric positive definite (SPD) matrix.
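The coordinate covariance above can be sketched in a few lines of numpy; the 1/n normalization is our assumption, as the text does not specify the divisor:

```python
import numpy as np

def coordinate_covariance(coords):
    # 3x3 covariance matrix S_x of an (n, 3) array of atom coordinates,
    # centered at the center of mass x-bar (1/n normalization assumed).
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)
    return centered.T @ centered / len(coords)

# Four atoms forming a flat cross in the xy-plane: all variance lies in x and y.
S_x = coordinate_covariance([[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0]])
```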

Since the set of all of these matrices is a metric space, we could utilize a distance defined on this space to quantify the degree of dissimilarity between two point clouds, regardless of the number of points present in each cloud. In this application, the points are the atoms in a binding site. Unfortunately, we immediately run into a critical problem with using this as a representation of binding site structure. The coordinates for the locations of the atoms are provided with respect to their locations in the crystals of the full proteins using arbitrary vector bases, so no two binding sites share the same x, y, and z axes. As a result, the covariance matrices characterize completely different directions of variability, which prevents us from using the distance between two covariance matrices as a useful measure of dissimilarity. Because of this, it would initially seem that we would need to appeal to alignment-based methods to obtain common coordinate systems.

However, motivated by the PrincAxis similarity measure used by [14], which quantifies the differences in the lengths of the principal axes of two binding sites, we instead base our representation on the principal axes of each binding site. Since all binding sites have three principal axes, they provide a common coordinate system for encoding the structural variability that we can subsequently use to compare binding sites. To do this, we focus on the distances of each atom to the three principal axes, where the distance of an atom to a principal axis is defined to be the Euclidean distance between the atom's coordinates and its projection onto the principal axis.

Once we have the distances for all atoms, we can then construct the covariance matrix of these distances, which we use as our data object for encoding the structural information from binding sites. For convenience, we will, from here on out, refer to this representation as the Covariance of Distances to Principal Axes (CDPA).

We will now describe the details of this process mathematically. Let d_{kj} denote the distance of the kth atom to the jth principal axis, where k = 1, 2, ..., n_i, j = 1, 2, 3, and n_i is the number of atoms for the ith binding site, for i = 1, 2, ..., 100 or i = 1, 2, ..., 972, depending on the data set we use. Then, the distance matrix d_i for the ith binding site can be represented as

d_i = [d_{.1}  d_{.2}  d_{.3}],

where d_{.j} is the vector of the distances of all n_i atoms to the jth principal axis. Our final data objects are then defined as

S_i = cov(d_i),

the 3 × 3 covariance matrix of the columns of d_i, for i = 1, 2, ..., 100 or i = 1, 2, ..., 972.
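The whole CDPA construction can be sketched as below. The eigh-based choice of principal axes and the 1/n covariance normalization are our assumptions rather than conventions stated by the text:

```python
import numpy as np

def cdpa(coords):
    # Covariance of Distances to Principal Axes (CDPA) for an (n, 3) atom cloud.
    # Sketch only: principal axes are taken as eigenvectors of the coordinate
    # covariance, and a 1/n covariance normalization is assumed.
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)
    cov = centered.T @ centered / len(coords)   # coordinate covariance S_x
    _, axes = np.linalg.eigh(cov)               # columns = unit principal axes
    proj = centered @ axes                      # scalar projection onto each axis
    # Distance of atom k to axis j: sqrt(|c_k|^2 - (c_k . v_j)^2), since each
    # axis passes through the center of mass.
    sq = np.sum(centered ** 2, axis=1)[:, None] - proj ** 2
    d = np.sqrt(np.maximum(sq, 0.0))            # (n, 3) distance matrix d_i
    dc = d - d.mean(axis=0)
    return dc.T @ dc / len(d)                   # 3x3 data object S_i

# An 8-atom toy cloud (not the real 1cbq-PO4 coordinates).
S_i = cdpa(np.array([[0.0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3],
                     [1, 1, 0], [0, 1, 2], [2, 0, 1], [1, 2, 2]]))
```

The resulting S_i is symmetric positive semi-definite by construction and invariant to rigid motions of the input coordinates, which is the point of the representation.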

As an illustrative example, we consider the PO4 binding site of the protein 1cbq, which is found in the extended Kahraman dataset and consists of only 8 atoms. The distances of each atom to all three principal axes for this binding site form the 8 × 3 distance matrix d_i, from which the 3 × 3 covariance matrix S_i for the ligand-binding site 1cbq-PO4 is then computed.

Quantifying differences between binding sites

In order to classify binding sites according to their binding ligand using CDPA, we need to be able to quantify differences between binding sites. To do so, we need to utilize properties of the space of 3 × 3 SPD matrices. While this space, itself, is not a vector space, it is a submanifold of the space of 3 × 3 symmetric matrices, which is a Euclidean vector space. As such, it inherits the Euclidean distance from that space. However, since the Euclidean distance alone fails to take the covariance structure of the data into account, we must instead use the Mahalanobis distance between two matrices A and B to get a more meaningful measure of dissimilarity between binding sites. In order to calculate the Mahalanobis distances, means and covariances of the S_i are required, but since each S_i is already a matrix, we must first vectorize the observations. To do so, we utilize the vectorized form vecd(A) described in [20] and [21], which is calculated as

vecd(A) = (a_{11}, a_{22}, a_{33}, \sqrt{2} a_{12}, \sqrt{2} a_{13}, \sqrt{2} a_{23})^\top.   (1)

The last three entries of Eq (1) are multiplied by \sqrt{2} so that the Euclidean distance between any two observations remains the same whether in matrix or vector form. In other words, the Frobenius norm of the matrix will be equal to the norm of the vectorized form of the matrix. That is, trace(S_i^\top S_i) = (Svec_i)^\top (Svec_i), where Svec_i = vecd(S_i).
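A minimal sketch of the vecd vectorization, with a numerical check that the Frobenius norm of the matrix matches the Euclidean norm of its vectorized form:

```python
import numpy as np

def vecd(S):
    # Vectorize a symmetric 3x3 matrix as in Eq (1): diagonal entries first,
    # then the off-diagonal entries scaled by sqrt(2) so that the Euclidean
    # norm of the vector equals the Frobenius norm of the matrix.
    r2 = np.sqrt(2.0)
    return np.array([S[0, 0], S[1, 1], S[2, 2],
                     r2 * S[0, 1], r2 * S[0, 2], r2 * S[1, 2]])

S = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 1.0]])
v = vecd(S)
# np.linalg.norm(S) is the Frobenius norm; it equals np.linalg.norm(v).
```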

With this form, we can obtain the mean vector and covariance matrix for each group of binding ligands in the standard way to obtain \bar{Svec}_j and \hat{\Sigma}_j, respectively, for j = 1, ..., 9. This allows us to calculate the Mahalanobis distance from each binding site i to the mean of ligand group j as

D_{i,j} = \sqrt{(Svec_i - \bar{Svec}_j)^\top \hat{\Sigma}_j^{-1} (Svec_i - \bar{Svec}_j)},   (2)

where i = 1, 2, ..., 100 or i = 1, 2, ..., 972 and j = 1, 2, ..., 9.

Using these calculated Mahalanobis distances, we can directly examine the relationships between the binding sites and the means of each group by comparing the distributions of the Mahalanobis distances for binding sites in group j to the mean of group j with the Mahalanobis distances for binding sites in other groups to the mean of group j. We show an example of histograms for these distributions in Fig 2 and Fig 3.

Classification and validation methods

We use two procedures to classify the binding ligand of each binding site. The first is the nearest mean classifier. It is a form of nearest neighbor classification, which was utilized by [15] and [14]. However, while the latter methods require all observations to be compared to each other in a pairwise fashion, the nearest mean classifier requires only that each observation be compared to the mean of each group. For a fixed observation i, we compute D_{i,j}, as calculated in Eq (2), for j = 1, ..., 9 and assign observation i to the group j for which D_{i,j} is smallest. Unlike nearest neighbor classification, this method allows us to explicitly utilize the mean and covariance structure of each ligand group and is based upon models like those shown in Fig 2 and 3. An advantage of this classifier is that it is simple, so we can nearly directly determine the utility of CDPA and the Mahalanobis distance models with minimal impact from the classification scheme.
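The group-wise Mahalanobis distance of Eq (2) can be sketched as follows; the helper name is ours, and the group mean and covariance are assumed to be estimated elsewhere from the vectorized observations:

```python
import numpy as np

def mahalanobis_to_group(svec_i, group_mean, group_cov):
    # Mahalanobis distance D_{i,j} from one vectorized binding site to the
    # mean of ligand group j, per Eq (2). group_cov is the 6x6 sample
    # covariance of the group's vectorized observations.
    diff = np.asarray(svec_i, dtype=float) - np.asarray(group_mean, dtype=float)
    # Solve group_cov @ y = diff instead of forming an explicit inverse.
    return float(np.sqrt(diff @ np.linalg.solve(group_cov, diff)))

# With an identity covariance the distance reduces to the Euclidean one.
D = mahalanobis_to_group([1.0, 0, 0, 0, 0, 0], np.zeros(6), np.eye(6))
```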

The second method we use is polytomous/multinomial logistic regression, which allows us to model the probability that binding site i binds to ligand group j. As predictor variables, we use the D_{i,j} so that the probability depends not just on which group is closest to observation i, but on how close observation i is to every group. This allows us to use both the variability within groups and between groups to classify each binding site. If we denote the predicted probability that observation i belongs to group j as P_{i,j}, then we assign observation i to the group j for which P_{i,j} is largest. An additional advantage of logistic regression is that these predicted probabilities also give us a measure of certainty regarding the classifications. If P_{i,j} is high, then we should feel more certain that the classification is correct, whereas P_{i,j} taking a low value means that the logistic regression model has trouble distinguishing group j from the others for the observation.
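The prediction step of such a model can be sketched with a softmax over linear predictors. The array shapes and the choice of the first group as the reference category are our illustrative assumptions, not details fixed by the text:

```python
import numpy as np

def predict_probs(D, coefs, intercepts):
    # Predicted probabilities P_{i,j} from a fitted polytomous logistic model
    # with the first group as the reference category (linear predictor 0).
    # D: (n, J) Mahalanobis distances used as predictors; coefs: (J-1, J);
    # intercepts: (J-1,). These shapes are assumptions for this sketch.
    eta = np.column_stack([np.zeros(len(D)), D @ coefs.T + intercepts])
    eta -= eta.max(axis=1, keepdims=True)       # numerical stability
    expd = np.exp(eta)
    return expd / expd.sum(axis=1, keepdims=True)

# With all coefficients zero, every group is equally probable.
P = predict_probs(np.ones((2, 3)), np.zeros((2, 3)), np.zeros(2))
labels = P.argmax(axis=1)   # assign each observation to its most probable group
```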

To validate the classification scheme, we initially considered the classical leave-one-out cross validation scheme, but, unfortunately, many of the ligand groups in both data sets contain very few binding sites, so leaving out even one observation is problematic: if even a single observation is removed, the mean and covariance structure for the entire group will be radically altered, which makes leave-one-out cross validation fail.
Instead, we can simulate a validation dataset, allowing us to use the entire original dataset as a training set, leaving the means and covariances unaltered. To do this, we simulate testing data by adding noise to each atom coordinate. First, the initial data are read in, and the atom coordinates for the given binding sites are recorded. For instance, suppose the coordinates of atom j are given as (x_j^0, y_j^0, z_j^0). Then, a small amount of 3-dimensional Gaussian noise with level \tau is added to the coordinates of each atom to perturb the data, which can be represented as

(x_j, y_j, z_j) = (x_j^0 + w_1, y_j^0 + w_2, z_j^0 + w_3),

where w_1, w_2, and w_3 are independent N(0, \tau) random variables. CDPA and classification are then repeated using the noise-altered data with respect to the models constructed from the original, unperturbed data.
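The perturbation step can be sketched as follows; we read the noise level \tau as the standard deviation of the Gaussian noise, which is our interpretation of the notation:

```python
import numpy as np

def perturb_coords(coords, tau, seed=0):
    # Simulate one validation copy of a binding site by adding independent
    # N(0, tau) noise to every coordinate; tau is treated as the standard
    # deviation of the noise (our reading of the paper's noise level).
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, dtype=float)
    return coords + rng.normal(0.0, tau, size=coords.shape)

atoms = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
noisy = perturb_coords(atoms, tau=0.1)
clean = perturb_coords(atoms, tau=0.0)   # tau = 0 leaves the data unchanged
```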

We now present the results of performing classification studies for both datasets using CDPA and discuss the particular challenges involved in working with each dataset.

Ligand classification for Kahraman dataset

We first consider the smaller Kahraman dataset so that we can compare our results to those of both TIPSA and sup-CK. First, we visualize the entire data set in 3 dimensions using MDS in Fig 5. The plot shows a clear separation between most of the groups, and we can see that the covariance structures of the groups differ from each other considerably. This further supports why it is imperative to use Mahalanobis distances rather than just Euclidean distances between the binding sites when analyzing the data.

Fig 5. MDS plot in 3 dimensions for the Kahraman dataset.
A summary of classification results is shown in Table 3. The classification errors are the proportions of binding sites that were classified to the incorrect ligand group for each method used. Note that we only report the results for nearest mean classification for CDPA for this dataset in the table. This is because, since we must use one ligand group as a baseline category, we have 8 logistic models, each of which contains 10 parameters, which gives us a total of 80 parameters that must be estimated. This leaves us with just 20 degrees of freedom out of the 100 observations. As a result, even though the fitted model produced a classification error of 0%, this speaks only to drastic overfitting rather than to any benefit of the methodology.

While Table 3 indicates that CDPA outperforms the leading methods for this dataset, the most important takeaway from these results is that it is clear that the CDPA representation is able to effectively encode useful information about the structures of the binding sites. We also include results for a few variations from the TIPSA study to further highlight the importance of taking the covariance structure of the binding sites into account.

TIPSA:TI uses the Tanimoto Index (also known as the Jaccard score), which references only the proportion of atoms common to both binding sites in a pair, as a similarity measure. Gyr is the difference between the radii of gyration for two binding sites, so it utilizes a one-dimensional measure of the variability of atoms within binding sites as a dissimilarity measure. While TIPSA:TI performs better than Gyr on its own, Ellingson and Zhang (2013) showed that using Gyr in combination with the Tanimoto Index improved the results considerably. This provides further evidence that it is important to utilize information about the covariance structure of the binding sites.

A confusion matrix showing a detailed breakdown of the results for CDPA is shown in Table 4. From this, we can see that 9 of the 15 misclassified binding sites are assigned to the HEM group. This makes sense based on Fig 5, since we see that, while most of the variation in the HEM group is along the vertical axis, there are two binding sites in the group that are outliers, though still within the scope of the variation exhibited by the entire dataset, that significantly affect the covariance matrix for HEM. This results in some binding sites from other groups being misclassified as HEM sites.

To perform model validation, we added noise to the data, as described in the previous section. The average classification error is plotted across replications, while the shaded region surrounding it provides 95% confidence bands. As expected, as τ increases, the performance of the method degrades.

We looked more closely at the data to determine why the method produced such a high error rate and calculated the covariance matrix \hat{\Sigma}_j for each group. We found that the FMN and PO4 groups have extremely high levels of variation in at least some variables. Since these results suggest problems with data quality, we did not include these outlying observations in the construction of our models and instead built the models using the remaining 906 binding sites. We will, however, present classification results both with and without the removed observations.

To address this clustering problem, we broke the HEM group into two subsets, which we refer to as HEM-I and HEM-II. The MDS plots in 3 dimensions for these groups, shown in Fig 11 and 12, respectively, show that this subgrouping will have means and covariance matrices that more accurately capture the distributions of these binding sites. Please note that we could not use the same approach with the PO4 group, since the space spanned by the cleaned observations is so large that Mahalanobis distances to the mean of those observations would be extremely small compared to the distances to the means of the other groups. Here, though, we see clear and distinct patterns that allow us to partition the HEM group for constructing the classification models. For evaluating the classification results, though, we still treat HEM as one entire group. Note that some axes in these plots appear flipped, as commonly happens with MDS.
While it was not our intention to obtain a similar distribution of observations for this dataset, this does provide support that the cleaning of the data was indeed necessary. From this, it is also clear that we did not over-clean the PO4 group, since we still see far more variation in the group for this dataset than we did for the smaller set, and that we simply removed observations with clear quality issues. Additionally, the separation between the two HEM subgroups is even clearer when viewed in combination with the rest of the data.

Since some of the set's proteins are no longer listed in the PDB and we performed light data cleaning, we cannot perfectly compare the classification results of CDPA with those of [14]. Despite this, though, we still present their results to provide a rough baseline for performance and to highlight differences between the methodologies of their study and ours. For outlying binding sites, pairwise nearest neighbor methods can still classify many sites correctly; this would also be true of many of the PO4 binding sites shown in this plot as well as the PO4 observations that were cleaned from the data. The use of the Mahalanobis distances, though, allows CDPA to correctly classify many other binding sites for which k-nearest neighbor methods may fail.

Table 5. Results for classification for the extended Kahraman (5.3 Å) dataset.
  Method: Classification Error
  Sup-CK_L: 0.19
  CDPA with nearest mean classification (trained and tested with 906 obs.): 0.32
  CDPA with nearest mean classification (trained with 906 obs., tested with 965 obs.): 0.3451
  CDPA with polytomous logistic regression: 0.25

We now present a detailed analysis of the results for CDPA for both classification methods, beginning with the nearest mean classifier. The confusion matrix for this method is shown in Table 6. A number of the misclassifications are sensible due to the similarity of the binding ligands. For instance, AMP and ATP are quite similar structurally, so it is not surprising to see that many binding sites from the AMP and ATP groups are incorrectly classified as belonging to the other group. The same holds for FAD and NAD, which are likewise similar. While the structures of GLC and PO4 are not so similar, they are both by far the smallest ligands in this data set, so it is not surprising that sites from these two groups may be mistaken for each other.
Even though we didn't see this behavior with the Kahraman data set, it is easy to see from Fig 13 why these groups were misclassified for each other here; the two groups are close neighbors, often overlapping, especially in the long tail of the PO4 group that extends to the left side of the plot.

We now consider the logistic regression classification scheme, for which we used the ligand group AMP as the reference category for calculating log-odds ratios. We then calculated predicted probabilities that each observation belongs to the ligand groups based on these. As shown in Table 5, the classification error for this method is closer to that of Sup-CK than to CDPA with the nearest mean classifier, which provides support for the idea that it is important to utilize Mahalanobis distances to the means of all groups rather than just consider which mean is closest. Table 7 shows the estimated coefficients for every parameter in the polytomous logistic regression model. The confusion matrix for the logistic regression classifier is shown in Table 8, which shows that most of the GLC sites were misclassified as PO4. Unlike for the Kahraman set, these procedures are both far less sensitive to even small amounts of noise, since the original dataset is much larger here. The average classification error for the nearest mean classifier increases roughly linearly with the noise level, but for the logistic regression classifier, the average classification error first increases quite slowly for noise levels up to roughly 0.3 and then increases roughly linearly for noise levels above that. These results further confirm the superiority of the logistic regression classifier over the nearest mean classifier, since the simulated classification error for the former method with noise levels near 1 is comparable to the classification error of the nearest mean method even for extremely low levels of noise.
Furthermore, the confidence bands for the logistic regression classifier are considerably narrower at all levels of noise.

Comparison of computational costs
The model-based approach used in this research is very fast compared to the alignment-based approaches used by [15] and [14]. For instance, in [15], performing a pairwise alignment for the Kahraman dataset took between 2 and 6 seconds per alignment. Since there are 100 binding sites in this dataset, there are 4950 pairs, requiring a minimum of 9900 seconds to compare all of the pairs. For [14], the algorithm's running time per pocket pair varied between 0.2 and 1.3 seconds. This means that, at a minimum, it will take 990 seconds to compare all the pairs using that methodology.

With CDPA, calculating the similarity measure for all binding sites and classifying them to their ligand groups took only 2.4 seconds for the Kahraman dataset. The extreme difference in computational costs is due to both CDPA being alignment-free and our reliance on comparisons to group means rather than on pairwise comparisons between all binding sites.

In this study, we developed a novel representation called CDPA for encoding the structural information from a protein binding site as a 3 × 3 covariance matrix. This representation allowed us to develop nonparametric probability models for groups of sites that all bind to the same ligand using the Mahalanobis distance from the covariance matrix of each binding site to the mean of the binding group. We then showed that these distributions of distances are useful for classifying the sites by binding ligand. CDPA with the nearest mean classifier outperformed other methods for the Kahraman dataset. While it is improper to compare directly to the results of [14] for the extended Kahraman dataset, since some of the set's proteins are no longer listed in the PDB and we had to perform some light data cleaning, it is clear that CDPA with the logistic regression classifier still performed comparably to [14] for that dataset. At the very least, this suggests that CDPA is able to discriminate between the ligand groups. At the same time, CDPA is orders of magnitude faster than the alignment-based methods since it is invariant to coordinate changes for the atoms and does not rely on pairwise comparisons, but instead on comparisons to the mean covariance matrix for each ligand group.

The differences between pairwise and model-based comparisons also extend to classification performance. For instance, sup-CK with the nearest neighbor classifier performs well for the extended Kahraman dataset, despite some data quality issues, because the outlying observations only influence classification performance for each other, since no models for the groups are constructed. This is in sharp contrast to CDPA models, where the outlying observations also impact the construction of the models when the data are not cleaned first. On the other hand, though, nearest neighbor methods are overly sensitive to noise and cannot take variation within and across binding site groups into consideration, whereas our model-based approaches do so explicitly.

To explore how our CDPA classifiers performed with different data, we simulated testing sets by adding increasing amounts of Gaussian noise to the coordinates of the atoms in the binding sites. Figs 6 and 17 showed the average classification error per replication with confidence bands as a function of the amount of noise for the two datasets we used to form the models. While the classification error rate was lower for the original Kahraman dataset than for the extended Kahraman dataset, the classification performance degraded more quickly as noise was added to the data. Furthermore, in Fig 18, we can see that the lengths of the confidence intervals for the extended Kahraman data are much narrower than those of the Kahraman set. As a result, we can be more confident that the CDPA procedures are, in fact, more stable with more observations, even though the performance for the larger dataset is not quite as good as it is for the smaller set. This provides further support for the ability of CDPA to capture differences between groups of binding sites.

However, there is plenty of work that remains to be done to build upon this initial study of CDPA. As we mentioned above, the CDPA models appear to work better for binding sites having covariance structures similar to the average of their group, but pairwise comparisons appear to be more advantageous for outlying sites. This suggests that future approaches could seek to leverage both types of classification methods: model-based for binding sites with relatively low distances to at least one group and pairwise comparisons for those observations that differ considerably from the means of the groups.
Furthermore, we wish to consider additional classification procedures, such as classification trees, parametric probability models that would permit Bayesian procedures to be used, and ensemble methods that could combine many or all of the approaches.

Ideally, these methods would also permit us to utilize alternative and/or additional dissimilarity measures beyond the ones we considered here. For instance, in this paper, we utilized the fact that the space of 3 × 3 covariance matrices is a submanifold of the space of 3 × 3 symmetric matrices to justify using the Euclidean distance between observations. However, there are many additional metrics that can be placed on the space of 3 × 3 covariance matrices that would define other distances between observations. These other distances may be able to further improve the performance of CDPA by more explicitly taking the geometry of the sample space into consideration.

Additionally, we seek to enhance CDPA by bringing other characteristics of the binding sites, especially chemical attributes, into the analysis, since that information is not explicitly taken into account by CDPA.