PickPocket: Pocket binding prediction for specific ligand families using neural networks

Most biological functions of proteins occur through contacts with other proteins or ligands. The residues that constitute the contact surface of a ligand-binding pocket are usually located far apart in the protein sequence. Therefore, the identification of such motifs is more challenging than the identification of linear protein domains. To discover new binding sites, we developed a tool called PickPocket that focuses on a small set of user-defined ligands and uses neural networks to train a ligand-binding prediction model. We tested PickPocket on fatty acid-like ligands due to their structural similarities and their under-representation in the ligand-pocket binding literature. Our results show that for fatty acid-like molecules, pocket descriptors and secondary structures are enough to obtain predictions with accuracy >90% using a dataset of 1740 manually curated ligand-binding pockets. The trained model could also successfully predict the ligand-binding pockets using unseen structural data of two recently reported fatty acid-binding proteins. We think that the PickPocket tool can help to discover new protein functions by investigating the binding sites of specific ligand families. The source code and all datasets contained in this work are freely available at https://github.com/benjaminviart/PickPocket.

Author Summary
Most biological functions of a protein are defined by its interactions with other proteins or ligands. The cavity of the protein structure that receives a ligand, also called a pocket, is made of residues that are usually located far apart in the protein sequence. Therefore, understanding the complementarity of pocket and ligand is a real challenge. To discover new binding sites, we developed a tool called PickPocket that focuses on a small set of user-defined ligands to train a prediction model. Our results show that for fatty acid-like molecules, pocket descriptors (such as volume, shape, hydrophobicity…) and secondary structures are enough to obtain predictions with accuracy >90% using a dataset of 1740 manually curated ligand-binding pockets. The trained model could also successfully predict the ligand-binding pockets using unseen structural data of two recently reported fatty acid-binding proteins. We think that the PickPocket tool can help to discover new protein functions by investigating the binding sites of specific ligand families.


Introduction
One of the main tasks of bioinformatics is to associate biological roles with proteins using the ever-increasing volume of biological data (1,2). To predict the function of a protein based on its sequence, computational methods look for sequence patterns in biological databases of known and already annotated proteins. Homology search (3,4), motif search (5) and functional […] FRSite (22), DeepSite (23) and DeepDrug3D (24) or of tools based on the random forest algorithm, such as P2Rank (25).
With the progressive increase in structure availability, the need to store and compare ligand-binding pocket data has led to the creation of […]

We obtained the best results using a multilayer perceptron neural network classifier with an architecture of (15, 10, 5). To limit overfitting, we trained the model with 5-fold cross-validation. Furthermore, to reduce problems associated with unbalanced classes, we downsampled the larger groups to the size of the smallest one. The model reached an area under the ROC curve (AUC) of 97.2% (Figure 1) and an accuracy of 93.4%.
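The class balancing and 5-fold splitting described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the toy labels and the random seed are hypothetical, and the round-robin stratification is one simple way to keep classes balanced across folds, not necessarily the procedure used in PickPocket. The classifier itself is left out.

```python
import random
from collections import defaultdict


def downsample_to_smallest(samples, labels, seed=0):
    """Randomly keep as many samples per class as the smallest class contains."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for label in sorted(by_class):
        for sample in rng.sample(by_class[label], target):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced


def stratified_folds(pairs, k=5):
    """Deal (sample, label) pairs into k folds, class by class, round-robin."""
    by_class = defaultdict(list)
    for pair in pairs:
        by_class[pair[1]].append(pair)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        for pair in by_class[label]:
            folds[position % k].append(pair)
            position += 1
    return folds


def cross_validation_splits(folds):
    """Yield (train, test) lists, holding out each fold exactly once."""
    for i, test in enumerate(folds):
        train = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        yield train, test


# Toy data (hypothetical): 30 pockets of one class versus 100 of another.
samples = list(range(130))
labels = ["fatty_acid"] * 30 + ["other"] * 100

balanced = downsample_to_smallest(samples, labels)
folds = stratified_folds(balanced, k=5)
splits = list(cross_validation_splits(folds))
```

Each `(train, test)` pair in `splits` would then be used to fit and evaluate one model, and the five evaluation scores averaged.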
[…] structure showed that the fatty acid cavity occupied two pockets (red and blue in Figure 2A). The red pocket, which is the deep part inside the protein and contained the carboxyl part, had a score of 0.78. The blue pocket, which is at the opening of the cavity and contained the fatty acid tail, had a score of 0.03. The fatty acid-binding cavities were large and, as illustrated in this case, fpocket tended to split them into more than one pocket. As both pockets corresponded to the same cavity and the red pocket had a score well above the threshold, we considered that PickPocket discovered the fatty acid-binding cavity of the structure. For 6U1U (Figure 2B), the red pocket, corresponding to the palmitic acid cavity, received a score of 0.69. Complete prediction results can be found in the supplementary information (SI […]

[…] narrow. This can be corrected by changing the alpha-sphere radius or the maximum distance between pockets. One of the challenges we faced was to tune the fpocket parameters in order to obtain pockets that are large enough without merging different cavities. Because these parameters are easy to tune in PickPocket, users can adapt the input to the ligand of interest.
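The cavity-level reading of the scores used for the split cavity of Figure 2A can be sketched as follows: a cavity counts as detected when at least one of its pockets scores above the threshold. The threshold value of 0.5, the function names and the label `structure_A` are assumptions made for this illustration, since the threshold and the identifier of the first structure are not stated in this section. The scores are the ones reported above.

```python
from typing import Dict, List

# Hypothetical value: the decision threshold is not stated in this section.
ASSUMED_THRESHOLD = 0.5


def cavity_score(pocket_scores: List[float]) -> float:
    """Score a cavity by the best-scoring pocket that belongs to it."""
    return max(pocket_scores)


def detected_cavities(
    cavities: Dict[str, List[float]], threshold: float = ASSUMED_THRESHOLD
) -> Dict[str, bool]:
    """Flag each cavity whose best pocket scores above the threshold."""
    return {
        name: cavity_score(scores) > threshold
        for name, scores in cavities.items()
    }


cavities = {
    "structure_A": [0.78, 0.03],  # red and blue pockets of Figure 2A
    "6U1U": [0.69],               # red pocket of Figure 2B
}
result = detected_cavities(cavities)
```

Taking the maximum reflects the reasoning in the text: the low score of the pocket at the cavity opening does not cancel the high score of the deep pocket, because both belong to the same cavity.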
To cluster ligand-protein complexes, DeepDrug3D and FRSite use an atom-based voxelization. This step also generates an input that is compatible with convolutional neural networks. In contrast, our methodology uses a matrix of pocket properties that is faster to generate but contains less information. We also chose to use the fpocket software, although DeepSite is more accurate against the sc-PDB database of binding sites (40). However, fpocket is fast, and pocket descriptor data are easily retrieved from its output files.
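As an illustration of how pocket descriptors might be read from fpocket output, the sketch below parses text laid out like an fpocket `*_info.txt` file, assuming `Pocket N :` header lines followed by `name : value` lines. The function name is hypothetical, the sample values are made up for the example, and this is not the parser used in PickPocket.

```python
import re
from typing import Dict

# Matches header lines such as "Pocket 1 :" (assumed layout).
POCKET_HEADER = re.compile(r"^Pocket\s+(\d+)\s*:")


def parse_pocket_info(text: str) -> Dict[int, Dict[str, float]]:
    """Return {pocket number: {descriptor name: numeric value}}."""
    pockets: Dict[int, Dict[str, float]] = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        header = POCKET_HEADER.match(line)
        if header:
            current = pockets.setdefault(int(header.group(1)), {})
            continue
        if current is None or ":" not in line:
            continue
        name, _, value = line.partition(":")
        try:
            current[name.strip()] = float(value)
        except ValueError:
            pass  # ignore fields that are not numeric
    return pockets


# Made-up sample in the assumed layout; values are illustrative only.
SAMPLE = """Pocket 1 :
\tScore : \t0.373
\tVolume : \t512.4
\tHydrophobicity score: \t41.2

Pocket 2 :
\tScore : \t0.120
\tVolume : \t230.0
"""

parsed = parse_pocket_info(SAMPLE)
```

Each inner dictionary can then be turned into one row of the property matrix, with one column per descriptor.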
PickPocket can help to discover new protein functions by investigating the binding sites of a specific ligand family. The results we obtained show that for fatty acid-like molecules, pocket descriptors and secondary structure are enough to obtain predictions with >90% accuracy.
Thanks to its high prediction accuracy, PickPocket can be used as a tool for in silico screens and may support new research on ligand-binding proteins.

Material and Methods
The PickPocket methodology can be divided into five steps (Figure 3).