Protein structure featurization via standard image classification neural networks

Many applications in the biomedical domain involve the detailed molecular and functional characterization of macro-molecules such as proteins. Where possible, this involves the knowledge of detailed 3D coordinates of every atom within a protein. At the same time, machine learning has become the basis of much innovation within this domain in recent years. There are, however, a few challenges in applying machine learning to 3D protein structures, such as variability in size and high dimensionality of the data. It would therefore be beneficial to be able to map every protein structure to a smaller fixed-dimensional representation that is directly learned from the structure without manual curation. In addition, it would be valuable for biomedical researchers if such approaches would require little method development and instead draw from cutting-edge research such as image classification via deep neural networks. Here, such an approach is outlined that first re-formats protein structures as 2D color images and then applies off-the-shelf neural networks for image classification. It is shown that such neural networks can be trained to effectively encode the CATH protein classification database and that feature vectors extracted from such networks, once trained, can be transferred to a completely new task that is likely to benefit from molecular protein information, namely that of small molecule binding.


INTRODUCTION
Understanding protein structure and function is of interest in a wide range of Biological applications. One application that is of most direct relevance to human health is the implication of various proteins in diseases and the role that these proteins may play in finding a therapy.
While many experimental and computational techniques exist to study proteins at detail there is typically a trade-off with the throughput of that same method. High-throughput methods typically come at the cost of molecular detail. This is certainly true for all methods that are used to understand the molecular structures of proteins, which are complex polymer chains of typically 20 amino acid building blocks that fold up on themselves to produce often highly specific three-dimensional arrangements or folds. These folds are then used to endow the protein with a specific biochemical function as part of healthy Biology, but also in the context of disease. Computationally, proteins have been studied at the amino acid sequence level through comparative Bioinformatics (high throughput) or at the biophysical level through Molecular Dynamics simulations of their 3D structures (low throughput). A middle ground between these two approaches is reached in Structural Bioinformatics where the known protein structures are compared based on a combination of sequence and structure-based descriptors so that highly similar proteins receive a high score. These methods calculate a similarity or difference score of pairs of proteins. This score requires that the proteins are assigned to a set of descriptors that allow for a comparison in some proxy feature space that is used to capture actual protein differences. All such descriptors are typically manually selected by the method developer and thus are limited to preconceived notions of what information is important for representing and comparing proteins.
Machine learning builds models by processing large amounts of training samples and optimizing a set of internal parameters. These parameters are fine-tuned so that the model can make predictions on a provided ground truth data label, based on the input features or descriptors of the modeled entities.
Proteins are challenging for machine learning because of their complex 3D structures of variable sizes and the complex relationship between features and prediction targets. However, promising new approaches can now be developed due to the advances in Deep Learning where multi-layer artificial neural networks can be used to extract such complex relationships.
The amino acid sequence of the protein has been used for machine learning by exploiting the similarities to Natural Language Processing. One approach, called ProtVec, has applied word embeddings to amino acid triplets as the "words" and the protein sequence as the "sentence" [1]. However, the sequence of a protein does not specify the biologically active 3D structure -a well-known problem in biology [2].
The most common approach of applying Deep Learning to protein structures is through the discretization of 3D space into bins fed into Deep Convolutional Neural Networks (CNNs) as 3D arrays [3]- [5]. Typically, this requires the definition of a 3D box with a given size and granularity (size of bins) centered on a region of interest such as a known ligand binding site. While this representation of protein structures is very intuitive and interpretable, it comes with some down-sides such as a high degree of sparseness and dimensionality as well as computational costs resulting from 3D input data. Additionally, the development of methods for 2D images generally proceeds more rapidly than that for 3D approaches, so that the adoption of existing deep learning techniques is easier in 2D.
A different route has been taken here, where the protein structure is not interpreted as a set of absolute 3D atomic positions, but rather as a set of pairwise distances. This effectively allows to treat the protein structures as flat, image-like 2D arrays. Such distance matrices have two identical axes corresponding to an ordered list of atoms starting from amino acids at the N-terminal end of the protein backbone and ending at the C-terminal end. Atoms of the same amino acid are typically listed in a standard order, for example in the PDB format of the Protein Data Bank, allowing the formation of predictable patterns.
Recently, a few studies have been published that also follow this general approach, but only consider a single atom per amino acid and use their methods for protein search [6] and protein design [7] applications. For the application of drug discovery, all-atom molecular detail will be needed in the future, although the prototype presented here does not yet achieve that level of detail.
Here, we focus on the case where the category that we want to predict is the ability of a protein to bind a small molecule such as a drug, or not. Such drug molecules are designed to bind to certain proteins, typically to alter the behavior of that protein in a disease context and with the aim of a therapeutic effect.
In the first part of this work, the focus lies on training a neural network on protein structure classification in general, and without a specific application in mind. This is to maximize generalizability of the protein features and avoid hidden biases of small molecule activity datasets. The results clearly show that protein structure classification can be solved by Deep Learning without much difficulty.
In the second part, protein feature vectors (fingerprints) generated based on the trained model from the first part are applied to an activity data set where the strength of an interaction between a small molecule and a protein was measured experimentally.

Dataset preparation CATH
The CATH [8] non-redundant data set was used as a training set for protein fingerprint generation. These include 21090 3D domain structures from across the entire CATH classification system and constitute a small but less biased subset of the full 308999 domain structures. Data labels for supervised learning were taken from the 'Class', 'Architecture', 'Topology', and 'Homologous superfamily' levels with 4, 40, 1364, and 2714 sub-categories, respectively. Note that these numbers are partially slightly lower than the ones published on CATH, due to loss of some protein structures during processing as outlined further below.

ChEMBL
ChEMBL [9] version 23 was used as a starting point to select activity values of small-molecule ligands and their single protein targets for which a pChEMBL value was available. pChEMBL is defined as: -Log(molar IC50, XC50, EC50, AC50, Ki, Kd or Potency), depending on the assay. Based on the pChEMBL value, three activity categories were defined that approximately balanced the number of low and high activity observations, while keeping the intermediate activity bin relatively small. This was obtained by choosing pChEMBL values of 5.2 and 6.0 as the lower and upper bounds of that intermediate bin, respectively. Furthermore, the data set was restricted to cases where each compound and each target was associated with at least one high-activity and one low-activity observation, thus arriving at a final dataset of 35076 compound and 871 targets, with a total of 218635 activity measurements. ChEMBL single protein targets are associated with a Uniprot identifier corresponding to the underlying gene/protein of that target. For each Uniprot identifier, multiple 3D structures (mostly from X-ray crystallography and obtained from the Protein Data Bank, PDB) can be found that may correspond to different structural subunits of that protein.
To automatically match each ChEMBL target with a protein structure, the target protein sequence reported by ChEMBL is compared to the sequence of all PDB structures (chains) linked to the same Uniprot ID. The PDB structure with the highest pairwise alignment score, calculated with BioPython (https://biopython.org/), was used as that target's structure -as long as a minimum score of 50 was obtained (parameter-free Bio.pairwise2.globalxx function). Otherwise, the target was discarded. Among alternative structures for the same target with identical alignment score, the one with the lower resolution value was picked. Any remaining ties were resolved by picking the structure with smaller ligand size, to reduce the conformational bias towards the crystallized ligand.

Protein structure to image conversion
All protein structures used were converted to a format that could be used directly as input for state-ofthe-art Deep Convolutional Neural Networks. The three channels typically reserved for the Red, Green, and Blue (RGB) color contributions of images, were populated with three different 2D representations of a single protein structure: 1) the pairwise atomic distances of heavy atoms in the protein structure (D); 2) a pairwise non-bonded energy term like the one used in the Amber molecular mechanics force field [10] (NB); 3) the atomic cross-correlation matrix from an anisotropic network model (ANM) [11], [12]. ANMs are used to calculate the normal modes of protein structures, i.e. their intrinsic motions and dynamics. The equation for the NB channel was of the standard Amber [13] form where Eij is the potential energy contribution of interactions between atoms i and j, with partial charges qi and qj, as well as actual distance rij and equilibrium distance r 0 ij : Distances D were cut off at an upper limit of 51.2 Å whereas no lower limit was applied. ANM correlation matrices range between -1 and 1. The ProDy [14] python library (http://prody.csb.pitt.edu/) was used to calculate D and ANM. NB was calculated based on the partial charges assigned by the PDB2PQR tool (http://www.poissonboltzmann.org/pdb2pqr/) using the 'Amber' force field option and cut off outside a range of (-0.1, 0.1).
After these processing steps, 20798 domain structures remained in the final CATH data set, where some PDB structures caused errors and were thus discarded.

Deep learning on protein structures
We used the VGG16 [15], Inception V3 [16], ResNet50 [17], and DenseNet121 [18] architectures provided by Keras (https://keras.io) (GPU optimized version 2.1.5, using Tensorflow 1.7 as backend) and with pretrained weights from ImageNet [19], but without their original softmax output layers. Instead, each network was equipped with one new dense (fully connected) layer using batch normalization, ReLU activation, and Dropout (with a fraction of 0.25 neurons randomly turned off at each training iteration). This dense layer either had 128 or 512 neurons as described in Results, depending on which length feature vector was desired. This layer was then connected to four separate softmax classification outputs for each of the four CATH levels (see above). Networks were trained with a total batch size of 128 split across four NVIDIA GeForce GTX 1080 Ti GPUs, using the Adam optimizer with a learning rate of 0.001 and categorical cross-entropy as loss function for all four outputs (each contributing equally to the final loss). Learning rates were reduced on plateaus of 5 epochs of stagnating validation loss by a factor of 0.2 towards a minimum learning rate of 0.0001 and early stopping was applied whenever the change in validation loss was lower than 0.001 for 20 epochs. 40% of the data were used for validation during training and training data is shuffled before each epoch. In addition to these supervised classification models, the U-Net [20] architecture was modified to be trained on reconstruction loss, thus effectively serving as an autoencoder with skip connections -as opposed to its original application of providing binary segmentation masks for biomedical images. The outputs of the bottleneck layer of 1024 convolutional filters was passed through global average 2D pooling, thus resulting in a protein feature vector of length 1024. Note, that hyperparameters were not systematically optimized and thus only represent one working solution.

ChEMBL activity prediction
Small molecule activities were predicted as a binary classification task where each pair of compound and target was assigned to a low activity ('low') or high activity ('high') class. The classifier was a Random Forest with an ensemble size of 100 and without restricting the depth of trees.
Various strategies for splitting the data set into training and validation data and cross-validation were investigated. Random splits randomly assigned 40% of the data to the validation set, whereas stratified approaches balanced data across the two prediction classes (high vs low). Five-fold cross-validation was used throughout. Group-based cross validation schemes trained on n-1 sub-groups and validated on the remaining group that was not used during training. Groups were defined by k=5 clusters obtained through K-means clustering of the protein fingerprints.
Protein fingerprints were obtained as described above for the 871 protein targets in the final ChEMBL data set. Fingerprints were always obtained as the last ReLU activations of the second fully connected layer, just before the softmax output layers. These activations were obtained by feeding the RGB images of ChEMBL-target protein structures into the convolutional neural network trained on CATH data.

Protein structures across folds can be encoded as fixed-dimensional fingerprints
Protein structures can be represented as sets of 3D atomic coordinates in Euclidean space that are connected by covalent bonds to form the universal backbone and side chain organization which is often described as a sequence of amino acid letters. Furthermore, atoms can form close physical connections via non-covalent bonds (H-bond, van der Waals, and electrostatic interactions) even when non-adjacent along the sequence and it is these spatial structural arrangements that give proteins their specific functional properties.
To train a neural network to consistently encode any protein structure into a fixed-length latent space representation, it will need to perform either unsupervised or supervised learning. The former will likely take the form of an autoencoder or related algorithms and condenses the information contained in its inputs automatically without any further input and relying on the reconstruction error for learning. Supervised learning on the other hand requires expert labeled data and performs either classification or regression where the error between predictions and ground truth labels is minimized. The availability of labeled data can be a limitation in many machine learning applications, however, in the case of protein structures, abundant labels are already available in databases such as CATH.
The CATH database curates the structural relationships among all known protein structures and does so by breaking proteins down into subunits called domain which roughly correspond to evolutionarily reused building blocks of larger proteins that may have a high likelihood of folding independently. While the entire dataset consists of around 300,000 domain structures, there is also a much smaller non-redundant dataset of around 21,000 structures that have been selected to cover all varieties of structure without being biased towards those that are highly abundant in PDB. It is therefore a suitable dataset for teaching a deep neural network about what proteins look like, including those that are in principle interesting as drug targets due to their readiness to form stable structures during crystallization and subsequent X-ray diffraction analysis -the source of most structures in PDB. These domains are then categorized at various levels of structural detail, starting at the four 'Classes' of proteins with mostly alpha-helical secondary structure content, mostly beta-sheet content, or proteins with both alpha-helix and beta-sheet. The fourth Class is reserved for proteins without clear assignment such as intrinsically disordered proteins. Below this top level follows 'Architecture', where the four Classes are further subdivided into a total of 40 categories, followed by the 'Topology' level (1373 categories), and 'Homologous Superfamily' level (2737 categories).
In order to make use of the state-of-the-art power of deep learning models, each CATH domain was converted to a 2D color image by plotting the inter-atomic distances between heavy atoms that typically make up proteins (i.e. Carbon, Oxygen, Nitrogen, Sulphur -but omitting Hydrogen) and augmenting those distances with additional information stemming from pairwise interactions of partial electric charges between atoms as is commonly used in Molecular Mechanics force fields such as Amber, as well as correlated atomic motions predicted by a normal mode calculation (see Methods and Figure 1).
In its current version, the presented method, will scale down the 2D protein images from their individual native resolutions (the number of heavy atoms squared) to a common size such as 224x224 -which was used as the default size in many of the original publications of the deep learning models used here. The implications of this practice will be discussed in the Discussion section.
With slight modifications, successful image classification networks such as VGG16 can be used almost outof-the-box, by simply changing the number of outputs to match the training data set at hand. In addition, the presented method works by connecting four parallel output layers to a newly added densely connected layer at the end of the network, one for each level of classification detail provided by CATH. This makes it effectively a multi-task learning network which has been shown to offer improvements in prediction [21]. Protein fingerprints (PFP) are simply the outputs of the last ReLU activation layer that are used as features by the connected softmax outputs to make a prediction. The length of a PFP is thus determined by the number of neurons in the last pre-output layer. Since changing this PFP length effectively also reduced network capacity, two different sizes of PFPs were tested to see any potential effect on training performance. There is considerable overfitting, the implications and potential remedies are discussed further below.
As an alternative measure of performance, PFPs were clustered in their original dimension (128 or 512) using K-means where k was set to the four different numbers of classes per task in the outputs. The resulting cluster labels were then compared to the original CATH labels and the Homogeneity score as implemented in Scikit-Learn [22] was calculated where a 1.0 corresponds to a perfect overlap. This score allows to observe further performance differences across network architectures, but also between multiple instances of the same architecture (three models were trained for each architecture). Inception V3 showed considerable differences between replicates, showing some of the best and worst Homogeneity scores. DenseNet121 was the best-performing architecture across all metrics.
Larger PFPs of length 512 only provided slight improvements in training metrics potentially owed to the added network capacity but did not perform better in the homogeneity score than the architectures with PFPs of length 128. Figure 2 provides a visualization of a 512-dimensional PFP space with PFPs predicted for all protein domains in the CATH data set. This distribution was reduced to two dimensions using t-SNE (t-distributed stochastic neighbor embedding) [23] . Colors indicate the CATH labels at first two classification levels (Class and Architecture). The model used to produce these fingerprints are indicated by an asterisk (*) in Table 1 and corresponds to the model with lowest overall validation loss. Overall, visual inspection of Figure 2 reveals a clustering of same-color data points which is consistent with the high Homogeneity scores. Some visual imperfections can be identified in the left panel of Figure 2 where four large clusters are expected to be separated, but visually some discontinuities can be seen. Given the high Homogeneity scores across CATH levels, this imperfection could be due to the stochasticity of t-SNE. Another DenseNet121 model (with PFP length 128) had very slightly higher score at the Class-level and also has a more homogeneous appearance in the t-SNE plot (not shown).
In summary, a 128-dimensional fingerprint can be sufficient to achieve a near perfect encoding of the CATH data set. This is a stark contrast to the amount of information used as input, i.e. the 224x224x3 = 150528 color intensity values of the 2D color images. This number of features is an impediment to any useful analysis and t-SNE plots on the raw image values convey the inability to identify meaningful clusters in the same data set of CATH domains (not shown).
While Table 1 only lists supervised deep learning approaches, there are also successful other approaches used on images, such as the U-net architecture for image segmentation, i.e. the task of outlining objects belonging to a known category within an image. The U-net architecture essentially follows the structure of a deep, multi-layer auto-encoder with skip connections from the encoder to the decoder arm of the network [20]. The U-net architecture was used here with the modification of reconstructing the input image instead of a binary / multi-label segmentation mask. This version of the U-net is therefore truly unsupervised in the sense that no additional training labels were needed. Figure 3 shows the same t-SNE map as Figure 2, but based on fingerprints extracted from the activations of the 1024-dimensional bottleneck layer of U-net (after mapping each of the 1024 2D feature maps to single averages). In the absence of CATH training labels, the fingerprints do not cluster visually when coloring data points according to CATH labels. Nevertheless, the U-net fingerprints performed at the same level as other fingerprints in the activity prediction task (see below).

Protein fingerprints applied to ChEMBL activity prediction
After encoding the CATH dataset into 128 and 512-dimensional fingerprints (1024 in the case of U-net), the general applicability of such fingerprints was demonstrated on the ChEMBL (version 23) dataset to predict chemical activity of small molecule compounds against specific proteins measured experimentally and recorded as the half-maximal concentration of the compound at which the anticipated effect was observed, e.g. inhibition of protein activity. ChEMBL provides a standardized value for various types of activity and calls it the p-ChEMBL value. The Methods section describes how the ChEMBL database was used to generate an approximately balanced binary classification dataset without trivial class assignment (e.g. one target always inactive) and how the single-domain target protein was assigned to the PDB code of the best-matching structure which in turn was then converted to a distance-based representation ("image") and encoded to a fixed-length fingerprint using the CATH-trained classifier described above.
Each activity entry in this ChEMBL dataset thus provides structural information of both the compound and the protein, but independently of each other, i.e. without knowing the exact binding position or mechanism. Compounds were encoded using two alternative fingerprint representations commonly used and implemented in the chemo-informatics software RDkit [24], MACCS (167 bits) and Morgan (1024 bits, radius=3), see [25] for a comparison of different fingerprints. With this compound fingerprint (CFP) and the corresponding protein fingerprint (PFP) concatenated, the final feature vector was used as input to another machine learning model. Here, a random forest ensemble classifier was chosen due to its good performance, generalization metrics, and interpretability. However, any other classifier may be used in its stead. A binary classification setup was chosen to predict whether a CFP and PFP combination falls into the 'high' or 'low' activity bin -as this task is more attainable than an exact regression predictor of actual activity (p-ChEMBL) values, especially since these measurements come from very different types of assays.
Since activity datasets like ChEMBL can be biased towards certain compound structures or targetsproviding a lot of data for some, but little for others -two alternative feature vectors were created for comparison. First, the CFP alone was used as input for activity prediction. In this case, only the compound structure itself can provide any information that would lead to a higher-than-chance accuracy of predicting activity and would indicate that certain compound structural features are over-represented in the datasets. Second, the CFP was concatenated with a one-hot encoded array indicating which of the targets the compound was measured against. In this case, predictions will be improved if certain compounds or compound classes have been tested on particular targets more than others and if certain targets can be associated with either high or low activity. Third, CFP and PFP are concatenated, providing structure-level details of both. Improvements to the prediction scores would mean that the structural information of the protein contains additional clues on activity when combined with the CFP. PFPs were generated using the DenseNet121 architectures with lowest validation loss and with PFP lengths 128 and 512. Table 2 shows the prediction results of the ChEMBL activity classifier. Looking at the mean performance in a 5-fold stratified cross-validation (CV) setup, all metrics indicate only a slight improvement over a random baseline of 0.5 when just using the CFP with no protein features. Adding the one-hot encoded index of the target on the other hand, provides an enormous performance boost. Such an approach can serve as a baseline model that mostly utilizes hidden biases in the way certain types of molecules are associated with certain types of targets and/or activity patterns -rather than making use of molecular data of each individual target protein. Adding structure-based PFPs provides improved prediction metrics over the CFP+index approach and at a much lower number of features. PFPs also perform at the same level as sequence-derived word embeddings via ProtVec.
Group-based 5-fold CV requires the model to learn from 4 out of 5 clusters of features while predicting the 5 th cluster. Two different group splits were explored here, one where clusters were formed based on the compound features, and one based on the PFPs. Feature importances were summed up for all compound features and all protein features, separately. It was found that compound-based group splits resulted in lower F1 accuracy than stratified splits, but compound features contributed around 60% to overall feature importance in both types of splits. Splitting according to PFP groups, on the other hand, dropped compound feature importance to around 10%, i.e. 90% of feature importance came from the proteins. At the same time, the features making use of molecular-level information (either sequence or structure) performed at a much higher level than the one-hot encoded target index, whereas one-hot encoding was sufficient to achieve relatively high cross-validation scores in the other two CV strategies. A proper test set of data not involved in either training or validation, however, was not used here.
Scikit-Learn's GridSearchCV function was used for CV-based hyper-parameter search, finding better performing settings for some of the parameters specific to random forest models, namely the number of estimators in the ensemble, the maximum number of input features and the maximum tree depth. Results are only shown from models that achieved the highest F1 score for a given set of input features. The F1 score is a balanced accuracy that combines both precision and recall. The OOB (out of bag) score provided an additional metric of generalization for random forest models, where randomly excluded samples from the training set are used to validate predictors.

DISCUSSION
An efficient reduction of protein structures to a small, consistent set of features (fingerprint) is of high importance in Structural Bioinformatics and allows for an accurate estimation of protein relationships and may be useful in activity prediction of small molecules.
Here, a new machine learning approach showed that existing image classification neural networks can be adopted to encode a protein fingerprint and that such a fingerprint seems to provide molecular detail that is comparable to the protein sequence level.
Models trained with CV on held-out protein feature clusters seem to make better use of the molecular detail of the proteins in the training set and use those features for enhanced generalization to unseen targets in the validation set. This is encouraging, especially in the light of reported cases where biases in the compound structures were sufficient for high model performance and detailed 3D structure information was not utilized by the model at all [26]. However, it seems that at the level of image resolution used here, sequence-derived features (ProtVec) performed just as well.
The approach outlined here still has several limitations to overcome. Most importantly, the practical requirement of current deep learning frameworks to provide fixed-dimensional data batches discourages the use of full atomic resolution of proteins with very different sizes. Also, the limitation of GPU RAM currently necessitates to work with scaled-down images. However, time is likely to alleviate such limitations in the future in combination with more sophisticated neural network architectures and more powerful computer hardware. There is also no established method for data augmentation of proteins -a technique that is frequently used on images where small perturbations such as rotation, translation, and rescaling are used to artificially provide more training data and that allows the model to learn invariances over such perturbations.
The degree of over-fitting observed in Table 1 still suggests that this work would benefit from more data in the future, such as the full CATH domain set. The capacity of the neural networks used may also have been higher than required for the classification task, and indeed smaller PFPs (i.e. the last dense layer) performed as well or better than larger ones.
CATH protein classification may also not be the most challenging task that makes use of atomic detail. Future models should directly train neural networks on the input images while predicting molecular binding. A more thorough comparison between 1D, 2D, 3D and graph representations of proteins should be compared on the same benchmark. Most recently, graph convolutional networks have been applied on both small molecules and proteins and may be a more efficient way of representing both small and large molecules [27]. Such future benchmarks should also investigate the merits of including the nonbonded and dynamics-based inputs channels explored here, and for various tasks.
Activity prediction remains a challenging task, not least because of the likely issues present in the commonly used data sets such as ChEMBL, which for the most part are collections of experiments conducted independently of each other and inconsistently. The mapping from assay to protein sequence/structure is also not always possible with certainty so that it is very likely that false labels were present in the data set used here. 3D structures are also not available nearly as abundantly as protein sequences.
Another limitation of all current machine learning models is that they can either use the protein sequence or the 3D structure, but not both. Ideally, models that featurize proteins should work with just the