
3D deep convolutional neural networks for amino acid environment similarity analysis

Abstract

Background

Central to protein biology is the understanding of how structural elements give rise to observed function. The surfeit of protein structural data enables development of computational methods to systematically derive rules governing structural-functional relationships. However, performance of these methods depends critically on the choice of protein structural representation. Most current methods rely on features that are manually selected based on knowledge about protein structures. These are often general-purpose but not optimized for the specific application of interest.

In this paper, we present a general framework that applies 3D convolutional neural network (3DCNN) technology to structure-based protein analysis. The framework automatically extracts task-specific features from the raw atom distribution, driven by supervised labels. As a pilot study, we use our network to analyze local protein microenvironments surrounding the 20 amino acids, and predict the amino acids most compatible with environments within a protein structure. To further validate the power of our method, we construct two amino acid substitution matrices from the prediction statistics and use them to predict effects of mutations in T4 lysozyme structures.

Results

Our deep 3DCNN achieves a two-fold increase in prediction accuracy compared to models that employ conventional hand-engineered features and successfully recapitulates known information about similar and different microenvironments. Models built from our predictions and substitution matrices achieve 85% accuracy in predicting outcomes of the T4 lysozyme mutation variants. Compared with well-established substitution matrices, our substitution matrices contain rich information relevant to mutation analysis. Finally, we present a visualization method to inspect the individual contributions of each atom to the classification decisions.

Conclusions

End-to-end trained deep learning networks consistently outperform methods using hand-engineered features, suggesting that the 3DCNN framework is well suited for analysis of protein microenvironments and may be useful for other protein structural analyses.

Background

Protein sites are microenvironments within a protein structure, distinguished by their structural or functional role. A site can be defined by a three-dimensional location and a local neighborhood around this location in which the structure or function exists. Central to rational protein engineering is the understanding of how the structural arrangement of amino acids creates functional characteristics within protein sites.

Determination of the structural and functional roles of individual amino acids within a protein provides information to help engineer and alter protein functions. Identifying functionally or structurally important amino acids allows focused engineering efforts such as site-directed mutagenesis for altering targeted protein functional properties [1]. Alternatively, this knowledge can help avoid engineering designs that would abolish a desired function. Traditionally, experimental mutation analysis is used to determine the effect of changing individual amino acids. For example, in alanine scanning, each amino acid in a protein is mutated to alanine, and the corresponding functional or structural effects are recorded to identify the amino acids that are critical [2]. This technique is often used in protein-protein interaction hot spot detection for identifying potential interacting residues [3]. However, these experimental approaches are time-consuming and labor-intensive. Furthermore, they provide no information about which other amino acids would be tolerated at these positions.

The increase in protein structural data provides an opportunity to systematically study the underlying patterns governing such relationships using data-driven approaches. A fundamental aspect of any computational protein analysis is how protein structural information is represented [4, 5]. The performance of machine learning methods often depends more on the choice of data representation than on the machine learning algorithm employed. Good representations efficiently capture the most critical information, while poor representations create a noisy distribution that obscures the underlying patterns.

Most methods rely on features that have been manually selected based on understanding sources of protein stability and chemical composition. For example, property-based representations describe physicochemical properties associated with local protein environments in protein structures using biochemical features at different levels of detail [6,7,8,9]. Zvelebil et al. have shown that properties including residue type, mobility, polarity, and sequence conservation are useful to characterize the neighborhood of catalytic residues [9]. The FEATURE program [6], developed by our group, represents protein microenvironments using 80 physicochemical properties. FEATURE divides the local environment around a point of interest into six concentric shells, each 1.25 Å thick, and evaluates the 80 physicochemical properties within each shell. The properties range from low-level features such as atom type or the presence of residues to higher-level features such as secondary structure, hydrophobicity and solvent accessibility. We have successfully applied the FEATURE program to several important biological problems, including the identification of functional sites [10], characterization of protein pockets [11], and prediction of interactions between protein pockets and small molecules [12].

However, designing hand-engineered features is labor-intensive, time-consuming, and not optimal for some tasks. For example, although robust and useful, the FEATURE program has several limitations [6, 11, 13]. To begin with, each biological question depends on a different set of protein properties, and no single set encodes all the critical information for every application. Second, FEATURE employs 80 physicochemical features at different levels of detail; some attributes have discrete values, while others are real-valued. The high dimensionality together with the inhomogeneity among the attributes can be challenging for machine learning algorithms [14]. Finally, FEATURE uses concentric shells to describe local microenvironments. The statistics of biochemical features within each shell are collected, but information about the relative position within each shell is lost. The system is therefore rotationally invariant but can fail in cases where orientation-specific interactions are crucial.

The surfeit of protein structures [15] and the recent success of deep learning algorithms provide an opportunity to develop tools for automatically extracting task-specific representations of protein structures. Deep learning networks have achieved great success in computer vision and natural language processing [16,17,18,19], and have been used in small molecule representation [20, 21], transcription factor binding prediction [22], prediction of chromatin effects of sequence alterations [23], and prediction of patient outcome from electronic health records [24]. The power of deep learning lies in its ability to extract useful features from raw data [16]. Deep convolutional neural networks (CNN) [17, 25] comprise a subclass of deep learning networks. Local filters in CNNs scan through the input space and search for recurring local patterns that are useful for classification performance. By stacking multiple CNN layers, deep CNNs hierarchically compose simple local spatial features into complex features. Biochemical interactions occur locally, and can be aggregated over space to form complicated and abstract interactions. The success of CNNs at extracting features from 2D images suggests that the convolution concept can be extended to 3D and applied to proteins represented as 3D “images”. In fact, Wallach et al. [26] applied 3D convolutional neural networks to protein-small molecule bioactivity predictions and showed that the performance of the deep learning framework surpasses that of conventional docking algorithms.

In this paper, we develop a general framework that applies 3D convolutional neural networks to protein structural analysis. The strength of our method lies in its ability to automatically extract task-specific features, driven by supervised labels that define the classification goal. Importantly, unlike conventional engineered biochemical descriptors, our 3DCNN requires neither prior knowledge nor assumptions about the features critical to the problem. Protein microenvironments are represented as four atom “channels” (analogous to the red, green, and blue channels in images) in a 20 Å box around a central location within a protein microenvironment. The algorithm does not depend on pre-specified features and can discover arbitrary features that are most useful for solving the problem of interest. To demonstrate the utility of our framework, we applied the system to characterize the microenvironments of the 20 amino acids. Specifically, we present the following:

  1. To study how the 20 amino acids interact with their neighboring microenvironments, we train our network to predict the amino acids most compatible with a specific location within a protein structure. We perform head-to-head comparisons of prediction performance between our 3DCNN and models using the FEATURE descriptors and show that our 3DCNN achieves superior performance over models using conventional features.

  2. We demonstrate that the features captured by our network are useful for protein engineering applications. We apply results of our network to predicting effects of mutations in T4 lysozyme structures. We evaluate the extent to which an amino acid “fits” its surrounding protein environment and show that mutations that disrupt strong amino acid preferences are more likely to be deleterious. The prediction statistics over millions of training and test examples provide information about the propensity of each amino acid to be substituted for another. We therefore construct two substitution matrices from the prediction statistics and combine information from the class predictions and the substitution matrices to predict effects of mutation in T4 lysozyme structures.

  3. We present a new visualization technique, the “atom importance map”, to inspect the individual contribution of each atom within an input example to the final decision. The importance map helps us intuitively visualize the features our network has captured.

Our 3DCNN achieves a two-fold increase in microenvironment prediction accuracy compared to models that employ conventional structure-based hand-engineered biochemical features. Hierarchical clustering of our amino acid prediction statistics confirms that our network successfully recapitulates hierarchical similarities and differences among the 20 amino acid microenvironments. When used to predict effects of mutations in T4 lysozyme structures, our models demonstrate strong ability to predict outcomes of the mutation variants, with 85% accuracy in separating the destabilizing mutations from the neutral ones. We show that substitution matrices built from our prediction statistics encode rich information relevant to mutation analysis. When no structural information is provided, models built from our matrices on average outperform those built from BLOSUM62 [27], PAM250 [28] and WAC [29] by 25.4%. Furthermore, given the wild type structure, our network predictions enable the BLOSUM62, PAM250 and WAC models to achieve an average 35.8% increase in prediction accuracy. Finally, the atom importance visualization confirms that our network recognizes meaningful biochemical interactions between amino acids.

Methods

Datasets

T4-lysozyme free, protein-family-based training and test protein structure sets

For the 20 amino acid microenvironment classification problem, we construct our dataset based on the SCOP [30] and ASTRAL [31] classification framework (version 1.75). To avoid prediction biases derived from similar proteins within the same protein families, we ensure that no structure in the training set belongs to the same protein family as any structure in the test set. Specifically, we first retrieved representative SCOP domains from the ASTRAL database. We excluded multi-chain domains and identified the protein families of the representative domains using the SCOP classification framework, resulting in 3890 protein families. We randomly selected 5% of the identified protein families (194 protein families) to form the test family set, with the remaining 3696 protein families forming the training family set. Member domains of a given protein family were assigned either entirely to the training set or entirely to the test set. In addition, we removed PDB IDs present in both the training and test sets to ensure that no test chain belonged to a family used in training. To enforce a strict sequence-level similarity criterion between our training and test sets, we used CD-HIT-2D [32] to identify any test chain with a sequence similarity above 40% to any chain in the training structure set, and removed the identified structures from the test set.

Furthermore, to obtain a fair evaluation of our downstream application that characterizes T4 lysozyme mutant structures, we removed T4 lysozyme structures from both datasets. Specifically, the PDB IDs of the wild-type and mutant T4 lysozyme structures were first obtained from the UniProt [33] database. We then excluded structures containing domains in the same family as any wild type or mutant T4 lysozyme structure from both the training and test datasets. We obtained the final selected protein structures from the PDB as of Oct 19, 2016.

Input featurization and processing

To facilitate comparison between deep learning and conventional algorithms built with hand-engineered biochemical features, we created two datasets from the same training and test protein structure sets described in the T4-lysozyme Free, Protein-Family-Based Training and Test Protein Structure Sets section.

(A) Atom-Channel Dataset

Local box extraction and labeling

For each structure in the training and test structure sets, we placed a 3D grid with 10 Å spacing to sample positions in the protein for local box extraction. Specifically, we first identify the minimum Cartesian x, y and z coordinates of the structure, and define the (xmin, ymin, zmin) position as the origin of our 3D grid. We then construct a 3D grid with 10 Å spacing that covers the whole structure (Fig. 1a). For each sampled position, a local box is extracted using the following procedure: the nearest atom to the sampled position is first identified (Fig. 1b), and the amino acid to which this atom belongs is assigned as the central amino acid (Fig. 1c). To achieve consistent orientation, each box is oriented in a standard manner using the backbone geometry of the central amino acid (Fig. 1d). Specifically, each box is oriented such that the plane formed by the N-CA and C-CA bonds forms the x-y plane, and the orthogonal direction with which the CA-Cβ bond has a positive dot product serves as the positive z-axis (Fig. 1e). A 20 Å box is then extracted around the Cβ atom of the central amino acid using the defined orientation (Fig. 1f). We chose the Cβ atom of each amino acid as the center to maximize the observable effects of the side chain while still maintaining a comparable site across all 20 amino acids. The Cβ atom position of glycine was estimated from the average position of superimposed Cβ atoms from all other amino acids. Side-chain atoms of the central amino acid are removed, and the extracted box is labeled with the removed amino acid side-chain type (Fig. 1g).
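For concreteness, here is a minimal numpy sketch of the orientation and extraction step. The function and variable names are ours, and placing the x-axis along the N-CA bond is our assumption; the text fixes only the x-y plane and the z direction.

```python
import numpy as np

def box_frame(n, ca, c, cb):
    """Orthonormal box axes from the central residue's backbone, returned
    as a 3x3 matrix whose rows are the x, y, z axes. The N-CA/C-CA plane
    is the x-y plane; z is the plane normal with a positive dot product
    onto the CA-CB bond."""
    u, v = n - ca, c - ca
    x = u / np.linalg.norm(u)        # assumed: x along the N-CA bond
    z = np.cross(u, v)               # normal to the N-CA/C-CA plane
    z /= np.linalg.norm(z)
    if np.dot(z, cb - ca) < 0:       # pick the normal pointing toward CB
        z = -z
    y = np.cross(z, x)               # completes a right-handed frame
    return np.stack([x, y, z])

def atoms_in_box(coords, cb, frame, half_width=10.0):
    """Rotate atom coordinates into the box frame centered on CB and
    keep those that fall inside the 20 A box."""
    local = (coords - cb) @ frame.T
    keep = np.all(np.abs(local) < half_width, axis=1)
    return local[keep]
```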

Fig. 1

Local box sampling and extraction. a For each structure in the training and test structure sets, we placed a 3D grid with 10 Å spacing to sample positions in the protein for local box extraction. The teal spheres represent the sampled grid positions. (For illustration purposes, a grid size of 25 Å instead of 10 Å is shown here.) b For each sampled position, the nearest atom (pink sphere) to the sampled position (teal sphere) is first identified. c The amino acid to which this atom belongs is then assigned as the central amino acid. The selected amino acids are highlighted in red and their atoms are shown as dotted spheres. d A local box of 20 Å is then defined around the central amino acid, centered on the Cβ. For each amino acid microenvironment, a local box is extracted around the amino acid using the following procedure: e Backbone atoms of the central amino acid are first used to calculate the orthogonal axes for box extraction. f A 20 Å box is extracted around the Cβ atom of the central amino acid using the defined orientation. g Side-chain atoms of the central amino acid are removed. The extracted box is then labeled with the removed amino acid side-chain type

Local box featurization

Each local 20 Å box is further divided into 1-Å 3D voxels, within which the presence of carbon, oxygen, sulfur, and nitrogen atoms is recorded in a corresponding atom-type channel (Fig. 2). Although including hydrogen atoms would provide more information, we did not include them because their positions are almost always deterministically set by the positions of the other heavy atoms, and so they are implicitly represented in our networks (and many other computational representations). We believe that our model is able to infer the impact of these implicit hydrogens. The 1-Å voxel size ensures that each voxel can accommodate only a single atom, which allows our network to achieve better spatial resolution. Given an atom within a voxel, one of the four atom channels will have a value of 1 at the corresponding voxel position, and the other three channels will have the value 0.
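A minimal sketch of this voxelization step, assuming atoms are already expressed in box coordinates centered at the origin; the function name and the channel ordering are ours, not from the paper's code.

```python
import numpy as np

# Assumed channel order for the four recorded heavy-atom types.
CHANNELS = {'C': 0, 'O': 1, 'S': 2, 'N': 3}

def voxelize(local_coords, elements, box=20, voxel=1.0):
    """Bin atoms into a (4, 20, 20, 20) occupancy grid with 1 A voxels.
    `local_coords` are (N, 3) box-frame coordinates; `elements` are the
    matching one-letter element symbols."""
    grid = np.zeros((len(CHANNELS), box, box, box), dtype=np.float32)
    for xyz, elem in zip(local_coords, elements):
        if elem not in CHANNELS:                 # hydrogens etc. are skipped
            continue
        i, j, k = np.floor(xyz / voxel + box / 2).astype(int)
        if 0 <= i < box and 0 <= j < box and 0 <= k < box:
            grid[CHANNELS[elem], i, j, k] = 1.0  # one atom per 1-A voxel
    return grid
```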

Fig. 2

Local box featurization. a Local structure in each 20 Å box is first decomposed into oxygen, carbon, nitrogen, and sulfur channels. b Each atom-type channel structure is divided into 3D 1-Å voxels, within which the presence of an atom of the corresponding type is recorded. Within each channel, Gaussian filters are applied to the discrete counts to approximate atom connectivity and electron delocalization. c The resulting numerical 3D matrices of the atom-type channels are then stacked together as different input channels, resulting in a (4, 20, 20, 20) 4D tensor, which serves as an input example to our 3DCNN

We then apply Gaussian filters to the discrete counts to approximate atom connectivity and electron delocalization. The standard deviation of each Gaussian filter is calibrated to the average van der Waals radius of the corresponding atom type. The local box extraction and featurization steps are performed on both the training and test protein structure sets to form the training and test datasets.
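A sketch of this smoothing step with scipy; the per-element sigmas below are standard van der Waals radii used as assumed stand-ins for the paper's calibrated values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

CHANNELS = {'C': 0, 'O': 1, 'S': 2, 'N': 3}
# Assumed sigmas (A); with 1-A voxels, sigma in voxel units equals A.
SIGMA = {'C': 1.7, 'O': 1.52, 'S': 1.8, 'N': 1.55}

def smooth(grid):
    """Blur each atom-type channel of a (4, 20, 20, 20) occupancy grid
    with a Gaussian whose width tracks that element's van der Waals
    radius, approximating connectivity and electron delocalization."""
    out = np.empty_like(grid)
    for elem, c in CHANNELS.items():
        out[c] = gaussian_filter(grid[c].astype(float), sigma=SIGMA[elem])
    return out
```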

Dataset balancing

Different amino acids have strikingly different frequencies of occurrence within natural proteins. To ensure that useful features can be extracted for all 20 amino acid microenvironment types, we construct balanced training and test datasets by applying the following procedure to each dataset: (1) the least abundant amino acid microenvironment in the original dataset is first identified; (2) all examples of the identified microenvironment type are included in the balanced dataset; (3) the number of examples of the least abundant microenvironment is used to randomly sample an equal number of examples from each of the other 19 microenvironment types. Validation examples are randomly drawn from the balanced training set using a 1:19 ratio. This ensures an approximately equal number of examples from all 20 amino acid microenvironment types in the balanced training, validation and test datasets.
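A minimal sketch of this balancing procedure; the helper name and random seed are ours.

```python
import numpy as np

def balance_indices(labels, seed=0):
    """Indices of a class-balanced subset: keep every example of the
    rarest class and sample that many from each of the other classes."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    n_min = counts.min()
    picks = [rng.choice(np.flatnonzero(labels == c), n_min, replace=False)
             for c in classes]
    idx = np.concatenate(picks)
    rng.shuffle(idx)
    # A 1:19 validation split can then be drawn, e.g.:
    #   val, train = idx[:len(idx) // 20], idx[len(idx) // 20:]
    return idx
```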

Data normalization

Prior to being fed into the deep learning network, input examples are zero-mean normalized. Specifically, mean values of each channel at each position across the training dataset are calculated and subtracted from the training, validation, and test examples.

(B) FEATURE Dataset

FEATURE microenvironments

FEATURE, a software program previously developed in our lab, is used as a baseline method to demonstrate the performance of conventional hand-engineered structure-based features [6]. The FEATURE program captures the physicochemical information around a point of interest in a protein structure by segmenting the local environment into six concentric shells, each 1.25 Å thick (Fig. 3). Within each shell, FEATURE evaluates 80 physicochemical properties including atom type, residue class, hydrophobicity, and secondary structure (see Table 1 for a full list of the properties). This enables conversion of a local structural environment into a numeric vector of length 480.

Fig. 3

The FEATURE program. FEATURE captures the physicochemical information around a point of interest in a protein structure by segmenting the local environment into six concentric shells, each 1.25 Å thick. Within each shell, FEATURE evaluates 80 physicochemical properties including atom type, residue class, hydrophobicity, and secondary structure. This enables conversion of a local structural environment into a numeric vector of length 480

Table 1 Full list of the 80 biochemical properties used in the FEATURE program

Dataset construction

Following the sampling procedure described in the (A) Atom-Channel Dataset section, we placed a 3D grid with 10 Å spacing to sample positions for featurization in each structure in the training and test structure sets (Fig. 1a), where the 3D grid is constructed using the same procedure as in the (A) Atom-Channel Dataset section. For each sampled position within a structure, the central residue is determined by identifying the nearest residue (Fig. 1b and c). A modified structure with the central residue removed is subsequently generated. The FEATURE software is then applied to the modified structure at the Cβ atom position of the central residue, generating a feature vector of length 480 that characterizes the microenvironment. The generated training and test datasets are balanced and zero-mean normalized as described in the (A) Atom-Channel Dataset section. Validation examples were randomly drawn from the balanced training set using a 1:19 ratio.

Network architecture

To perform head-to-head comparisons between an end-to-end trained deep learning framework that takes in raw input representations and machine learning models built on top of conventional hand-engineered features, we design two models: (A) a Deep 3D Convolutional Neural Network and (B) a FEATURE Softmax Classifier. Both models comprise three component modules: (1) a feature extraction stage, (2) an information integration stage, and (3) a classification stage, as shown in Fig. 4. To evaluate the advantages of a deep convolutional architecture over a simple flat neural network, we also built a third model, (C) a Multi-Layer Perceptron with 2 hidden layers.

Fig. 4

Schematic diagram of the Deep 3D Convolutional Neural Network and FEATURE-Softmax Classifier models. a Deep 3D Convolutional Neural Network. The feature extraction stage includes 3D convolutional and max-pooling layers. 3D filters in the 3D convolutional layers search for recurrent spatial patterns that best capture the local biochemical features to separate the 20 amino acid microenvironments. Max pooling layers down-sample the input to increase the translational invariance of the network. By following the 3DCNN and 3D max-pooling layers with fully connected layers, the pooled filter responses of all filters across all positions in the protein box can be integrated. The integrated information is then fed to the Softmax classifier layer to calculate class probabilities and to make the final predictions. Prediction error drives updates of the trainable parameters in the classifier, fully connected layers, and convolutional filters to learn the best features for optimal performance. b The FEATURE Softmax Classifier. The FEATURE Softmax model begins with an input layer, which takes in FEATURE vectors, followed by two fully-connected layers, and ends with a Softmax classifier layer. In this case, the input layer is equivalent to the feature extraction stage. In contrast to the 3DCNN, the prediction error only drives parameter learning of the fully connected layers and classifier. The input features are fixed during the whole training process

(A) Deep 3D Convolutional neural network

Our deep 3D convolutional neural network is composed of the following modules: (1) 3D Convolutional Layer, (2) 3D Max Pooling Layer [34], (3) Fully Connected Layer, and (4) Softmax Classifier [35]. In brief, our network begins with three sequential alternating 3D convolutional layers and 3D max pooling layers, which extract 3D biochemical features at different spatial scales, followed by two fully-connected layers which integrate information from the pooled responses across the whole input box, and ends with a Softmax classifier layer, which calculates class scores and class probabilities for each of the 20 amino acid classes. A schematic diagram of the network architecture is shown in Fig. 4. The operation and function of each module are briefly described below. All modules in the network were implemented in Theano [36].

  • 3D Convolutional Layer

The 3D Convolution layer consists of a set of learnable 3D filters, each of which has a small local receptive field that extends across all input channels. During the forward pass, each filter moves across the width, height and depth of the input space with a fixed stride, convolving with its local receptive field at each position to generate filter responses. The rectified linear (ReLU) [37] activation function then applies a non-linear transformation to the filter responses to generate the activation values. More formally, the activation value \( a_{i,j,k}^L \) at output position (i, j, k) of the Lth filter when convolving with the input X is given by Eqs. (1) and (2).

$$ a_{i,j,k}^{L}=\mathrm{ReLU}\left[\sum_{m=i}^{i+(F-1)}\ \sum_{n=j}^{j+(F-1)}\ \sum_{d=k}^{k+(F-1)}\ \sum_{c=0}^{C-1} W_{c,\,m-i,\,n-j,\,d-k}^{L}\, X_{c,m,n,d}+b^{L}\right] $$
(1)
$$ \mathrm{ReLU}(x)=\begin{cases} x, & \text{if } x\ge 0\\ 0, & \text{if } x<0 \end{cases} $$
(2)

where F is the filter size (the filter is assumed to have equal width, height and depth), C is the number of input channels, W is a weight matrix of size (C, F, F, F) whose spatial indices are taken relative to the output position, X is the input, i, j, k are the indices of the output position, and m, n, d are the indices of the input position.

Our 3D convolution module takes in a 5D tensor of shape [batch size, number of input channels, input width, input height, input depth], convolves it with 3D filters of shape [number of input channels, filter width, filter height, filter depth] with stride 1, and outputs a 5D tensor of shape [batch size, number of 3D filters, (input width - filter width) + 1, (input height - filter height) + 1, (input depth - filter depth) + 1]. During training, the weights of each 3D convolutional filter are optimized to detect the local spatial patterns that best capture the local biochemical features separating the 20 amino acid microenvironments. After training, filters in the 3D convolution layer are activated when the desired features are present at some spatial position in the input.
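As an illustration, here is a direct but unoptimized numpy realization of Eq. (1) and the shape bookkeeping described above; this is a sketch of the operation, not the paper's Theano implementation.

```python
import numpy as np

def conv3d_naive(x, weights, bias):
    """Valid 3D convolution with stride 1 followed by a ReLU, per Eq. (1).
    x:       (B, C, W, H, D) input tensor
    weights: (L, C, F, F, F) for L filters over C channels
    bias:    (L,) one bias per filter
    returns: (B, L, W-F+1, H-F+1, D-F+1)"""
    L, C, F, _, _ = weights.shape
    B, _, W, H, D = x.shape
    out = np.zeros((B, L, W - F + 1, H - F + 1, D - F + 1))
    for i in range(W - F + 1):
        for j in range(H - F + 1):
            for k in range(D - F + 1):
                patch = x[:, :, i:i+F, j:j+F, k:k+F]   # (B, C, F, F, F)
                # Contract channel and spatial axes against each filter.
                out[:, :, i, j, k] = np.tensordot(
                    patch, weights, axes=([1, 2, 3, 4], [1, 2, 3, 4]))
    return np.maximum(out + bias.reshape(1, -1, 1, 1, 1), 0)  # ReLU
```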

  • 3D Max Pooling Layer

The 3D max pooling module takes in an input 5D tensor of shape [batch size, number of input channels, input width, input height, input depth], performs down-sampling of the input tensor with a stride of 2, and outputs a 5D tensor of shape [batch size, number of input channels, input width/2, input height/2, input depth/2]. For each channel, the max pooling operation identifies the maximum response value within each 2*2*2 subregion and reduces that region to a single 1*1*1 cube holding the representative maximum value. The operation is described by Eq. (3).

$$ \mathrm{MP}_{c,l,m,n}=\max\left(\left\{X_{c,i,j,k},\ X_{c,i+1,j,k},\ X_{c,i,j+1,k},\ X_{c,i,j,k+1},\ X_{c,i+1,j+1,k},\ X_{c,i,j+1,k+1},\ X_{c,i+1,j,k+1},\ X_{c,i+1,j+1,k+1}\right\}\right) $$
(3)

where i = 2l, j = 2m, k = 2n; MP denotes the output of the max-pooling operation on X; l, m, n are the indices of the output position; c denotes the input channel; and i, j, k are the indices of the input position.
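Equation (3) amounts to the following compact numpy operation, a sketch assuming even spatial dimensions:

```python
import numpy as np

def max_pool3d(x):
    """Non-overlapping 2*2*2 max pooling with stride 2 over a
    (B, C, W, H, D) tensor; W, H, D are assumed even."""
    b, c, w, h, d = x.shape
    # Split each spatial axis into (output index, within-cube index) ...
    x = x.reshape(b, c, w // 2, 2, h // 2, 2, d // 2, 2)
    # ... and take the max over the three within-cube axes.
    return x.max(axis=(3, 5, 7))
```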

  • Fully Connected Layer and the Softmax Classifier

The fully-connected layer integrates information across all neurons within a layer using a weight matrix that connects all neurons in the layer to all neurons in the subsequent layer. A ReLU function follows to perform a non-linear transformation. The operation is described by Eq. (4). By following the 3DCNN and 3D max-pooling layers with fully connected layers, the pooled filter responses of all filters across all positions in the protein box can be integrated. The integrated information is then fed to the Softmax classifier layer to calculate class probabilities and make the final predictions.

$$ {h}_n=\mathrm{ReLU}\left(\sum_{m=0}^{M-1}{W}_{m, n}{X}_m+{b}_n\ \right) $$
(4)

where h_n denotes the activation value of the nth neuron in the output layer, M denotes the number of neurons in the input layer, N denotes the number of neurons in the output layer, and W is a weight matrix of size [M, N].

(B) FEATURE Softmax classifier

The FEATURE Softmax Classifier model comprises the same three feature extraction, information integration and classification stages. The model begins with an input layer, which takes in FEATURE vectors generated in (B) FEATURE Dataset section. In this case, the input layer is equivalent to the feature extraction stage since the biochemical features are extracted from the protein structures by the FEATURE program prior to being fed into the model. The input layer is then followed by two fully-connected layers, which integrate information from the input features. Finally, the model ends with a Softmax classifier layer, which performs the classification.

(C) Multi-Layer Perceptron

Our Multi-Layer Perceptron model takes in the same local box input as the 3DCNN model, flattens the 5D tensor of shape (batch size, number of input channels, input width, input height, input depth) into a 2D matrix of shape (batch size, number of input channels * input width * input height * input depth), and has just two fully-connected layers which integrate information across the whole input box, ending with a Softmax classifier layer.

We trained our 3DCNN, MLP, and FEATURE Softmax Classifier using stochastic gradient descent [38] with the back-propagation algorithm [39]. Gradients were computed by the automatic differentiation function implemented in Theano. A batch size of 20 examples was used. To avoid over-fitting, we used L2 regularization for all the models, and employed dropout [40] (p = 0.3) when training the 3DCNN, FEATURE Softmax Classifier and MLP. We tested different L2 regularization constants and dropout rates, selecting them based on validation-set performance; we did not attempt to optimize the other meta-parameters. We trained the 3DCNN for 9 epochs over 6 days using GPUs on the Stanford XStream cluster. The MLP model was trained to convergence for 20 epochs using GPUs on the Stanford XStream cluster. The FEATURE Softmax classifier took 3 days on the Stanford Sherlock cluster to reach convergence. The Stanford XStream GPU cluster comprises 65 compute nodes with a total of 520 Nvidia K80 GPU cards (1040 logical graphical processing units). The Stanford Sherlock cluster includes 6 GPU nodes with dual-socket Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz, 256 GB RAM, and 200 GB local storage.

Classification accuracies and confusion matrix

Individual and knowledge-based amino acid group accuracy

Prediction accuracies of the models are evaluated using two different metrics: individual class accuracy and knowledge-based group accuracy. Individual class accuracy measures the probability that the network predicts the exact amino acid of the true class. Since chemically similar amino acids tend to substitute for each other in naturally occurring proteins, we also calculate a knowledge-based group accuracy metric based on predefined amino acid groupings [41] to further evaluate the ability of the network to capture known amino acid biochemical similarity. For group accuracy, a prediction is considered correct if it falls within the same knowledge-based amino acid group as the true class.
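A sketch of the group-accuracy computation; the groups below are illustrative common amino acid classes, not necessarily the exact groupings of Ref. [41].

```python
import numpy as np

# Assumed illustrative groups over one-letter amino acid codes.
GROUPS = [set('FWY'), set('VILM'), set('HKR'), set('DE'),
          set('STNQ'), set('A'), set('G'), set('C'), set('P')]

def group_accuracy(y_true, y_pred):
    """A prediction counts as correct if it falls in the same
    knowledge-based amino acid group as the true class."""
    def same_group(t, p):
        return any(t in g and p in g for g in GROUPS)
    return np.mean([same_group(t, p) for t, p in zip(y_true, y_pred)])
```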

Confusion matrix

Upon completion of model training, the model weights can be used to perform prediction for any input local protein box. For a given set of input examples, the number of examples that have true label i and are predicted as label j is recorded in position [i, j] of the raw-count confusion matrix M. To obtain the probability of examples of true label i being predicted as label j, each row i of M is then normalized by the total number of examples having true label i, generating the row-normalized confusion matrix N_row, in which each entry has a value between 0 and 1 and each row sums to 1.

$$ {N}_{row}\left[ i, j\right]= M\left[ i, j\right]/{\sum}_j M\left[ i, j\right] $$
(5)

The above process is applied to the training and test datasets to generate two separate row-normalized confusion matrices. The matrices are then plotted as heat maps using the Matplotlib package.
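A minimal numpy sketch of Eq. (5), assuming integer-encoded labels:

```python
import numpy as np

def confusion(y_true, y_pred, n_classes=20):
    """M[i, j] counts examples with true label i predicted as j;
    N_row is M with each row normalized to sum to 1 (Eq. 5)."""
    M = np.zeros((n_classes, n_classes))
    np.add.at(M, (np.asarray(y_true), np.asarray(y_pred)), 1)
    N_row = M / M.sum(axis=1, keepdims=True)   # assumes every class occurs
    return M, N_row

# Heat maps as in the paper, e.g. with Matplotlib:
#   import matplotlib.pyplot as plt
#   plt.imshow(N_row); plt.colorbar(); plt.show()
```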

Clustering

To identify amino acid environment groups discovered by the network, we performed hierarchical clustering [42] on the row-normalized confusion matrices of both the training and test datasets. Hierarchical clustering with the Ward linkage method was performed using the scipy.cluster.hierarchy package [43].
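A sketch of this step with scipy; the random matrix is a stand-in for the real row-normalized confusion matrix, and cutting the tree into six clusters mirrors the Results but is our choice.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in for the real row-normalized confusion matrix from Eq. (5):
# 20 rows, each a probability profile over the 20 predicted classes.
N_row = np.random.default_rng(0).dirichlet(np.ones(20), size=20)

# Ward-linkage hierarchical clustering of the 20 row profiles.
Z = linkage(N_row, method='ward')

# Example cut into six clusters (assumed cut level).
groups = fcluster(Z, t=6, criterion='maxclust')
```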

Structure-based substitution matrix

Conventional sequence-based substitution matrices such as BLOSUM62 and PAM250 are calculated from the log-odds ratio of substitution frequencies among multiple sequence alignments within defined sequence databases. Using an analogous concept, we construct a frequency-based, structure-based substitution matrix from our raw-count confusion matrix M. We also generated a second matrix that treats the score as a measure of similarity between any two amino acid types; it is derived from dot-product similarities between the confusion-matrix profiles of amino acid microenvironment pairs. The two score matrices are denoted S_freq and S_dot, respectively, and are calculated using the following equations.

Score matrix I: Frequency-based score

The frequency-based substitution scores were calculated using the following equations:

$$ \begin{aligned} p(i,j) &= M[i,j]\Big/\sum_i\sum_j M[i,j] \\ q_{row}(i) &= \sum_j M[i,j]\Big/\sum_i\sum_j M[i,j] \\ q_{col}(j) &= \sum_i M[i,j]\Big/\sum_i\sum_j M[i,j] \\ S_{freq'}[i,j] &= \log\left\{p(i,j)\Big/\left(q_{row}(i)\,q_{col}(j)\right)\right\} \end{aligned} $$

To enable straightforward comparison to other substitution matrices, we create a symmetric substitution matrix by averaging the original and transposed S_freq′ as below.

$$ S_{freq}=\left(S_{freq'}+S_{freq'}^{\,T}\right)/2 $$

Score matrix II: Dot-product-based score

The dot-product based scores were calculated using the following equations

$$ \begin{aligned} N_{row}[i,j] &= M[i,j]\Big/\sum_j M[i,j] \\ N_{col}[i,j] &= M[i,j]\Big/\sum_i M[i,j] \\ Row_i &= N_{row}[i,:]\Big/\sqrt{\sum_k\left(N_{row}[i,k]\right)^2} \\ Col_i &= N_{col}[:,i]\Big/\sqrt{\sum_k\left(N_{col}[k,i]\right)^2} \\ S_{dot}[i,j] &= \log\left\{dot\left(Row_i,Row_j\right)+dot\left(Col_i,Col_j\right)\right\} \end{aligned} $$

The two score matrices are calculated for both the training and test predictions and are denoted S_freq-train, S_freq-test, S_dot-train, and S_dot-test, respectively. Because similar scores were obtained for the training and test predictions, S_freq-train and S_dot-train are used as the representative matrices and are denoted S_freq and S_dot. Comparisons of these matrices to BLOSUM62, PAM250, and WAC were performed using linear least-squares regression with the scipy.stats.linregress module.
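Putting the two definitions together, a numpy sketch that computes both matrices from a raw-count confusion matrix M; the small epsilon is our numerical guard against log of zero, not part of the paper's equations.

```python
import numpy as np

def substitution_scores(M, eps=1e-12):
    """Frequency-based (S_freq) and dot-product-based (S_dot) score
    matrices from a raw-count confusion matrix M."""
    total = M.sum()
    p = M / total                                  # joint confusion freq.
    q_row = M.sum(axis=1) / total                  # marginal over true labels
    q_col = M.sum(axis=0) / total                  # marginal over predictions
    S = np.log(p / np.outer(q_row, q_col) + eps)   # asymmetric log-odds
    S_freq = (S + S.T) / 2                         # symmetrize

    N_row = M / M.sum(axis=1, keepdims=True)
    N_col = M / M.sum(axis=0, keepdims=True)
    R = N_row / np.linalg.norm(N_row, axis=1, keepdims=True)  # unit rows
    C = N_col / np.linalg.norm(N_col, axis=0, keepdims=True)  # unit columns
    S_dot = np.log(R @ R.T + C.T @ C + eps)        # row + column similarities
    return S_freq, S_dot
```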

T4 mutant classification

T4 lysozyme mutant and wild type structures

The PDB IDs of 40 T4 lysozyme mutant structures were obtained from the SCOPe 2.6 database [44] and the corresponding 3D structures were downloaded from the PDB. We categorized the effects of the mutants based on their associated literature, where a stabilizing mutation is categorized as “neutral” and a destabilizing mutation as “destabilizing”. Table 2 summarizes the 40 mutant structures employed in this study. To compare the microenvironments surrounding the wild type and mutated amino acids, the wild type T4 lysozyme structure (PDB ID: 2lzm [45]) is also employed.

Table 2 Summary of the 40 T4 mutant structures

T4 wild type and mutant structure microenvironment prediction

For each of the selected 40 T4 lysozyme mutant structures, we extract a local box centered on the Cβ atom of the mutated residue, removing the side-chain atoms of the mutated residue. The same labeling and featurization procedures described in the (A) Atom-Channel Dataset section are applied to the extracted box. The wild type counterparts of these 40 mutated residues can be found by mapping the mutated residue numbers onto the wild type structure. Local boxes surrounding the wild type amino acids are then similarly extracted and featurized. Each pair of wild type and mutant boxes is then fed into the trained 3DCNN for prediction. The predicted labels for the wild type and mutant boxes are denoted WP (wild type predicted) and MP (mutant predicted), respectively.

T4 mutation classifier

We built Lasso [46] and SVM [47] classifiers with 4-fold cross-validation using the following three sets of features for five different scoring matrices (BLOSUM62, PAM250, WAC, S_freq and S_dot), resulting in fifteen different models.

Input Features for the T4 mutation classifiers

$$ \begin{aligned} 6\text{-Feature} &= \left[S(\mathrm{WT},\mathrm{WP}),\ S(\mathrm{WT},\mathrm{MT}),\ S(\mathrm{WT},\mathrm{MP}),\ S(\mathrm{WP},\mathrm{MT}),\ S(\mathrm{WP},\mathrm{MP}),\ S(\mathrm{MT},\mathrm{MP})\right] \\ 3\text{-Feature} &= \left[S(\mathrm{WT},\mathrm{WP}),\ S(\mathrm{WT},\mathrm{MT}),\ S(\mathrm{WP},\mathrm{MT})\right] \\ 1\text{-Feature} &= \left[S(\mathrm{WT},\mathrm{MT})\right] \end{aligned} $$

*S(i,j) is the similarity score taken from the (i,j) element of a score matrix

*WT, WP, MT and MP denote the wild type true label, wild type predicted label, mutant true label, and mutant predicted label, respectively.

The SVM models were constructed with the sklearn.svm package using the Radial Basis Function (RBF) kernel, and the Lasso models were built using the sklearn.linear_model.Lasso module.
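A sketch of the feature construction and model fitting with scikit-learn. The score matrix, sites, and labels below are deterministic stand-ins for the real inputs from Tables 2 and 6, and note that sklearn's Lasso is a regressor whose continuous outputs must be thresholded to produce class calls.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Label-pair index sets; a site is encoded as (WT, WP, MT, MP) = (0, 1, 2, 3).
PAIRS_6 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]  # 6-Feature
PAIRS_3 = [(0, 1), (0, 2), (1, 2)]                          # 3-Feature
PAIRS_1 = [(0, 2)]                                          # 1-Feature

def make_features(S, sites, pairs=PAIRS_6):
    """Each feature is S[label_a, label_b] for the chosen label pairs;
    `sites` is an array of (WT, WP, MT, MP) amino acid indices."""
    return np.array([[S[s[a], s[b]] for a, b in pairs] for s in sites])

# Illustrative run with stand-in data (40 sites, 20 amino acid types).
rng = np.random.default_rng(0)
S = rng.normal(size=(20, 20))                 # stand-in score matrix
sites = rng.integers(0, 20, (40, 4))          # stand-in (WT, WP, MT, MP)
y = np.tile([0, 1], 20)                       # stand-in neutral/destabilizing
X = make_features(S, sites)
svm_acc = cross_val_score(SVC(kernel='rbf'), X, y, cv=4).mean()
lasso = Lasso(alpha=0.1).fit(X, y)            # threshold lasso.predict(X)
```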

Network visualization: Atom importance map

Our input importance map shows the contribution of each atom to the final classification decision by displaying the importance score of each atom in heat map colors. Importance scores are calculated by first deriving the saliency map described in [48]. Briefly, the saliency map is the derivative of the true-class score with respect to the input variable I, evaluated at the input value I0. The saliency map is then multiplied by I0 to obtain an importance score for each input voxel in each atom channel. By first-order Taylor approximation, the importance score of each atom approximates the effect on the true-class score of removing the corresponding atom from the input. Absolute values of the importance scores are recorded, normalized to the range (0, 100) within each input example across all positions and channels, and assigned to the corresponding atoms in the local protein box. We visualized the results in PyMOL [49] by setting the b-factor field of the atoms to the normalized absolute importance scores. Gradients of the score function with respect to the input variables are calculated by the Theano auto-differentiation function.
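A minimal gradient-times-input sketch of this computation, written with PyTorch autograd as a stand-in for the paper's Theano implementation; the function and argument names are ours.

```python
import torch

def atom_importance(model, box, true_class):
    """Gradient-times-input importance scores for one example.
    `box` is a (4, 20, 20, 20) float tensor; `model` maps a batch to
    per-class scores."""
    x = box.clone().unsqueeze(0).requires_grad_(True)  # add batch dim
    score = model(x)[0, true_class]                    # true-class score
    score.backward()                                   # d(score)/d(input)
    imp = (x.grad * x).detach().abs().squeeze(0)       # first-order effect
    return 100.0 * imp / imp.max()                     # scale to 0-100
```

Each atom then inherits the score of its voxel, which can be written into the PDB b-factor field for PyMOL rendering.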

Results

Datasets

Following the procedure described in the T4-lysozyme Free, Protein-Family-Based Training and Test Protein Structure Sets section, we generate a protein structure set that contains 3696 training and 194 test protein families. This results in 32,760 training and 1601 test structures. The Atom-Channel Dataset and the FEATURE Dataset are built from this protein structure set to enable comparisons between deep-learning-based features and conventional hand-engineered features. The Atom-Channel Dataset is constructed as described in the (A) Atom-Channel Dataset section. The final dataset contains 722,000 training, 38,000 validation and 36,000 test examples, each split comprising an approximately equal number of examples from all 20 amino acid microenvironment types. The FEATURE Dataset is constructed as described in the (B) FEATURE Dataset section. The resulting datasets are similarly balanced and zero-mean normalized, and the final dataset contains 718,200 training, 37,800 validation and 36,000 test examples.

Network architecture

Our resulting networks are summarized in Table 3. The deep 3D convolutional neural network begins with a 3D convolutional layer, followed by two sequential alternating 3D convolutional and 3D max pooling layers, continues with two fully-connected layers, and ends with a Softmax classifier layer. In this framework, the 3D convolution/max pooling layers, the fully connected layers, and the Softmax classifier correspond to the feature extraction, information integration, and classification stages, respectively. In the FEATURE Softmax classifier, the feature extraction stage is completed by the FEATURE program in advance. The FEATURE Softmax model similarly continues with two fully-connected layers and ends with a Softmax classifier layer. To verify that a deep convolutional architecture provides an advantage over a simple flat neural network with the same input, we also built a Multi-Layer Perceptron with 2 hidden layers. The resulting network architecture is summarized in Additional file 1: Table S1.

Table 3 3DCNN and FEATURE Softmax Classifier Network Architecture

20 amino acid classification accuracies and confusion matrix

To classify the 20 amino acid microenvironments, we trained the deep 3DCNN and the MLP on the Atom-Channel Dataset, and the FEATURE Softmax classifier on the FEATURE Dataset. Results of the individual and knowledge-based group classification accuracies of the 3DCNN and the FEATURE Softmax classifier are reported in Table 4. Comparisons between the performances of the 3DCNN and the MLP are reported in Additional file 2: Table S2. To inspect the propensity of each microenvironment type to be predicted as the other 19 microenvironment types, Fig. 5 shows heat maps of the confusion matrices generated from predictions on the training and test datasets using the 3DCNN and the FEATURE Softmax classifier, where the ith, jth element of each matrix contains the probability of examples of true label i being predicted as label j.

Table 4 Individual and knowledge-based group classification accuracies of 3DCNN and the FEATURE Softmax classifier
Fig. 5

Confusion matrices for predictions of the 20 amino acid microenvironments. Predictions on the training and test datasets using the 3DCNN and the FEATURE Softmax Classifier are summarized into confusion matrices to inspect the propensity of each microenvironment type to be predicted as the others. The 20 amino acids are arranged according to knowledge-based amino acid groups, such that amino acids known to be biochemically similar are adjacent. The ith, jth element of each matrix shows the probability of examples of true label i being predicted as label j. The probability is represented in heat map colors. a 3DCNN-Train. b 3DCNN-Test. Local block structures in the confusion matrices for the 3DCNN demonstrate similarities and differences between amino acid microenvironments. For example, phenylalanine (F), tryptophan (W), and tyrosine (Y) form a hydrophobic and aromatic block. Similar block structure is observed for the test predictions; the captured features are robust across protein families. c FEATURE-Train. d FEATURE-Test. Block structures are less evident in the confusion matrices for the FEATURE Softmax classifier

Amino acid clustering

In the 20 Amino Acid Classification Accuracies and Confusion Matrix section, we inspected the group prediction accuracy based on knowledge-based amino acid groups. To identify the amino acid microenvironment groups automatically discovered by the network, hierarchical clustering was performed on the row-normalized confusion matrices. The results are shown in Fig. 6.

Fig. 6

Hierarchical clustering of normalized confusion matrices. The ith, jth element of the row-normalized matrices shows the probability of examples of true label i being predicted as label j. The probability is represented in heat map colors. Hierarchical clustering reveals similarities between amino acid microenvironments in terms of their propensities to be assigned to the 20 amino acid types. a 3DCNN-Training. Amino acid groupings discovered by our 3DCNN generally agree with known amino acid similarities. Six clusters were discovered by our network. The first cluster includes phenylalanine, tryptophan, and tyrosine, the three amino acids known to be hydrophobic and aromatic. The second and third clusters comprise valine and isoleucine, and leucine and methionine, respectively, all of which are non-polar and aliphatic. The polar amino acids form the fourth cluster. The amino acids with known distinct properties, glycine and cysteine, do not form local blocks with the other amino acids. b 3DCNN-Test. Groupings generated for the test examples are consistent with the training counterparts. c FEATURE-Softmax-Training. d FEATURE-Softmax-Test. Clustering on the FEATURE Softmax classifier generates much coarser amino acid groupings than those discovered by the 3DCNN. The two major groups, hydrophobic and polar amino acids, are separated

Structure-based substitution matrix

We derived the 3DCNN-frequency-based (S_freq) and the 3DCNN-dot-product-based (S_dot) substitution matrices from our raw-count confusion matrix following the procedure described in the Structure-based Substitution Matrix section. Comparisons of the two matrices to BLOSUM62, PAM250, and WAC were performed using linear least-squares regression. We also calculated correlations among BLOSUM62, PAM250, and WAC for benchmarking purposes. The least-squares coefficients are summarized in Table 5 and the scatter plots are shown in Fig. 7.

Table 5 Correlation between deep learning derived substitution matrices and benchmarking matrices
Fig. 7

Scatter plots of similarity scores between S_freq, S_dot and the benchmarking matrices BLOSUM62, PAM250, and WAC. a S_freq - BLOSUM62. b S_freq - PAM250. c S_freq - WAC. d S_dot - BLOSUM62. e S_dot - PAM250. f S_dot - WAC. g WAC - BLOSUM62. h WAC - PAM250. i PAM250 - BLOSUM62. S_freq shows generally good correlations with BLOSUM62 and PAM250; S_dot shows strong correlations with BLOSUM62 and PAM250; neither S_freq nor S_dot shows a significant correlation with WAC

T4 mutant classification

Forty available T4 lysozyme mutant structures were collected and categorized by their effects. Each mutant is categorized as either destabilizing or neutral. We use our network to predict the optimal residue type for both the wild type and mutant structures at the corresponding variant sites. Each site can be summarized by its true labels and prediction results in the following form: [wild type true (WT), wild type prediction (WP), mutant true (MT), mutant prediction (MP)]. The results for the 40 sites are summarized in Table 6.

Table 6 Predicted and true residue type for the wild type and mutant structures at the variant sites

We subsequently built classifiers to predict whether a mutation has a destabilizing or neutral effect. Specifically, we first used the S_freq, S_dot, BLOSUM62, PAM250, and WAC similarity matrices to generate the 6-Feature, 3-Feature, and 1-Feature sets, as described in the T4 Mutation Classifier section. Lasso and SVM classifiers using the 15 sets of features were trained with 4-fold cross-validation. The results are summarized in Table 7.

Table 7 Prediction accuracies of T4 mutant classifiers

Network visualization

To gain insight into what the network has learned, we calculate an importance map to inspect the contribution of each atom to the final classification decision. The importance scores are calculated as described in the Network Visualization: Atom Importance Map section. Atoms within the local box are shown as sticks, and the importance score of each input atom is displayed in heat map colors. Example visualizations are shown in Fig. 8. The color demonstrates how each atom within the local box contributes to the decision. Atoms with the lowest importance scores (<20) are shown in white, and the red-to-blue heat map spectrum highlights the most important to the least important atoms. Transparent light pink spheres are drawn to show the space previously occupied by the removed residue.

Fig. 8

Importance visualization of local amino acid microenvironments. Visualizations of the importance scores of each input atom are displayed as heat maps. The color demonstrates how each atom within the local box contributes to the decision. The importance scores range from 0 to 100. Atoms with the lowest importance scores (<20) are shown in white, and the red-to-blue heat map spectrum highlights the most important to the least important atoms. Transparent light pink spheres are drawn to show the space previously occupied by the removed residue. a Microenvironment surrounding a key ASP residue at the EF_HAND calcium binding site (PDB: 1A2X, ASP 63). Our importance map indicates that the correct prediction relies on the two nitrogen atoms that are in close proximity to the electronegative oxygen atoms of the removed aspartic acid residue. b Microenvironment surrounding a PROSITE INSULIN motif, with the key CYS residue removed (PDB: 1IZA, CYS 7). The 3DCNN network made the correct prediction of CYS primarily based on the SG atom from a nearby cysteine residue. This SG atom originally forms a disulfide bond with the SG atom of the removed cysteine residue; this unique disulfide-bond pattern was implicitly captured by our network to facilitate the classification. c Microenvironment surrounding a phenylalanine residue (PDB: 1ZPL, PHE 52). The three highlighted regions in the heat map are the side-chain atoms of VAL 54, ILE 48, and VAL 2, all of which consist of non-polar carbon atoms. d Microenvironment surrounding a valine residue (PDB: 1VJ9 [64], VAL 200). The highlighted atom groups are the side-chains of MET 207, TRP 29 and ILE 121, which are similarly non-polar

Discussion

Classification accuracies

The deep 3DCNN achieves superior prediction performance compared to models that employ conventional structure-based hand-engineered biochemical features. As can be seen in Table 4, we achieve a two-fold increase in prediction accuracy using the 3DCNN compared to the FEATURE Softmax Classifier. Importantly, the 3DCNN can correctly predict amino acid types for structures in protein families different from those in the training dataset; the features learned by the 3DCNN describe fundamental bio-physicochemical properties and generalize across proteins. The significant gap between the prediction performance of our 3DCNN and that of the MLP model, reported in Additional file 2: Table S2, shows that with the same training data the deep 3D convolutional architecture offers advantages over simple flat neural networks.

Among the 20 amino acid microenvironments, our 3DCNN network has the highest prediction accuracies for the C, P and G microenvironments. This is likely due to their distinct and conserved structural properties. Non-polar amino acids tend to be predicted more accurately than polar amino acids by our 3DCNN. Hydrophobic interactions, such as pi-stacking, are often formed over shorter spatial distances than electrostatic interactions between polar amino acids [50, 51]. The bottom-up nature of convolutional layers makes 3DCNNs better at extracting features describing local interactions than longer-range ones. In this study, we employ a network with three convolutional layers of 3*3*3 Å filters and alternating pooling layers. This creates receptive fields of 12*12*12 Å for each neuron at the final pooling layer. Our predictions therefore depend on combinations of local features that span at most 12 Å within the 20 Å input boxes.
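The 12 Å figure follows from standard receptive-field arithmetic over the conv-conv-pool-conv-pool stack (layer order taken from the architecture described in the Results), as the short check below illustrates.

```python
# (kernel size, stride) per layer: conv3, conv3, pool2, conv3, pool2.
layers = [(3, 1), (3, 1), (2, 2), (3, 1), (2, 2)]

rf, jump = 1, 1
for k, s in layers:
    rf += (k - 1) * jump   # each layer widens the field by (k-1) x stride-so-far
    jump *= s              # accumulated effective stride
print(rf)                  # -> 12 voxels = 12 A at 1-A resolution
```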

Confusion matrices and amino acid groupings

Our network captures the similarities and differences between amino acid microenvironments. Figure 5 arranges the 20 amino acids according to knowledge-based amino acid groups, where amino acids known to be biochemically similar are adjacent. Local block structure in the confusion matrices in Fig. 5a and b demonstrates amino acid environment similarities captured by the network. For example, phenylalanine (F), tryptophan (W), and tyrosine (Y) form a hydrophobic and aromatic block. Similar block structures are less evident in the confusion matrices for the FEATURE Softmax classifier although confusion between neighboring amino acids can still be observed.

Hierarchical clustering further demonstrates the extent of similarity captured by our networks. Clustering on the row-normalized confusion matrix reveals similarities between amino acid microenvironments in terms of their propensities to be assigned to the 20 amino acid microenvironment types. Figure 6 shows that the training and test performances are consistent.

Amino acid groupings “discovered” by our 3DCNN generally agree with known amino acid similarities. Hierarchical clustering divides the amino acids into six distinct clusters, as shown in Fig. 6. It is visible in Fig. 6a and b that the polar amino acids histidine (H), lysine (K), arginine (R), aspartic acid (D), glutamic acid (E), serine (S), threonine (T), asparagine (N), and glutamine (Q), together with the non-polar aliphatic amino acid alanine (A), form a large, weak block, within which K and R; D and N; S and T; and E and Q form smaller, distinct blocks. Similarly, the non-polar amino acids F, W, Y, V, I, L, and M together form a large, weak block, within which three clusters separate. The amino acids with known distinct properties, glycine (G) and cysteine (C), do not form local blocks with the other amino acids. Clustering on the FEATURE Softmax classifier generates much coarser amino acid groupings: the two major groups, hydrophobic and polar amino acids, are separated, but finer groupings within the two groups are less evident.

Results from clustering also reveal interesting aspects of amino acid similarity learned de novo from structural data. Interestingly, alanine is not grouped with valine, isoleucine, leucine, and methionine, as in many classifications based on hydrophobicity. Instead, serine, threonine, and alanine are substituted for one another frequently by our 3DCNN, likely due to their small sizes [52]. However, size and molecular volume do not seem to dominate the biochemical similarities; cysteine and glycine are well separated from serine, threonine, and alanine despite their similar sizes. Glutamine, methionine, and lysine have similar molecular weights but do not cluster together, while lysine and arginine differ in size but are grouped together. Isoleucine, valine and threonine are all Cβ-branched and are substantially bulkier near the protein backbone; it is likely that isoleucine is grouped with valine instead of leucine for this reason. As expected, threonine is considered much more similar to serine than to isoleucine and valine. Thus, size, molecular weight, geometry and biochemical properties all contribute to the groupings.

Structure-based substitution matrix

The prediction statistics over our millions of training and test examples provide information about the general propensity of one amino acid to be substituted for another. We used our prediction statistics to construct two amino acid substitution matrices, S freq and S dot, and compared them to BLOSUM62 and PAM250 as benchmarks. BLOSUM62 and PAM250 are calculated from the log-odds ratios of substitution frequencies among multiple sequence alignments within defined sequence databases and are symmetric. We derived S freq using an analogous frequency-based concept. However, our matrix is not symmetric: substitutability from amino acid microenvironment i to j differs from substitutability from j to i. The expected frequency of confusion from i to j depends on the fraction of examples with true label i and the propensity of the network to predict j, rather than on the fractions of examples with true labels i and j. As a result, the indices i and j are not exchangeable in the numerator or the denominator of our odds-ratio equation, and the resulting matrix is non-symmetric. To enable straightforward comparison with the benchmark matrices, we created a symmetric substitution matrix by averaging S freq with its transpose. As shown in Table 5 and Fig. 7, S freq correlates well with BLOSUM62 and PAM250.
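Under this description, a minimal sketch of the S freq construction might look as follows; the log base and the pseudocount are our choices, not taken from the paper.

```python
import numpy as np

def s_freq(conf, eps=1e-12):
    """conf: 20x20 confusion counts, true labels on rows, predictions on columns."""
    p_joint = conf / conf.sum()                  # observed frequency of (true i, predicted j)
    p_true = p_joint.sum(axis=1, keepdims=True)  # fraction of examples with true label i
    p_pred = p_joint.sum(axis=0, keepdims=True)  # network's propensity to predict label j
    # log-odds of observed vs. expected-under-independence confusion; non-symmetric
    return np.log2((p_joint + eps) / (p_true * p_pred + eps))

conf = np.random.rand(20, 20) + 0.05  # placeholder counts
s = s_freq(conf)
s_sym = 0.5 * (s + s.T)               # symmetrized for comparison with BLOSUM62/PAM250
```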

We also compared our matrices to WAC, a substitution matrix derived using the FEATURE program [29]. WAC is similarly constructed from the biochemical, biophysical, and structural features around the 20 amino acids, but from a human-engineered-features perspective: the more statistically similar the FEATURE profiles of two amino acids, the higher their similarity score. We built our S dot matrix analogously, from similarities between the prediction profiles of amino acid pairs. S dot shows strong correlation with BLOSUM62 and PAM250. Interestingly, S dot shows no significant correlation with WAC, suggesting that the 3DCNN and WAC capture different information.
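A corresponding sketch for S dot, under our reading that each amino acid's prediction profile is a row of the row-normalized confusion matrix and pairwise similarity is a dot product; any further profile normalization is an assumption on our part.

```python
import numpy as np

def s_dot(conf):
    """Pairwise dot products between prediction profiles; symmetric by construction."""
    profiles = conf / conf.sum(axis=1, keepdims=True)  # row i: prediction profile of amino acid i
    return profiles @ profiles.T
```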

T4 mutant classification

The tolerance of a protein to mutation depends on the critical interactions between the lost amino acid and its environment, and on the ability of the new amino acid to re-establish these interactions. We reasoned that the higher the prediction probability the network assigns to the original amino acid class, the less probable that any mutation would be tolerated at the position. Conversely, the higher the similarity score between the wild type and mutant amino acids, the more likely the mutation will be accepted. A destabilizing mutation may combine a very strong preference of the wild type microenvironment for the wild type amino acid with a low similarity score between the wild type and mutant amino acids. On the other hand, the microenvironment of a neutral variant site may not show a strong preference for the wild type amino acid, and might not distinguish between the wild type and mutant amino acids. We tested these ideas by predicting the effects of mutations on T4 lysozyme, chosen because mutations in T4 lysozyme have been deeply investigated [53, 54] and many mutant structures are available in public databases.

We first used the 3DCNN to predict the optimal residue type at the variant sites for both the wild type and mutant structures. Table 6 shows distinct prediction patterns for the destabilizing and neutral variants. For the destabilizing variants, our network makes correct predictions with very high confidence on the wild type microenvironments. Strikingly, instead of predicting amino acid classes similar to the mutant amino acid type, predictions on the mutant microenvironments often resemble the true wild type residues. The microenvironments of these variant sites likely have special structural features uniquely satisfied by the wild type amino acids, as reflected by the high-confidence predictions. Even after the mutation, with local structural perturbation to accommodate the mutant amino acid, the network still recognizes the microenvironment and predicts the original amino acid. For the neutral variants, by contrast, this reversion to the wild type is weaker. Predictions on the wild type environments vary significantly; in some cases, our network even predicts the mutant amino acid as the optimal choice for the wild type microenvironment. Likewise, predictions on the mutant microenvironments do not resemble the wild type amino acid, but are more similar to the mutant amino acid type.

These findings are consistent with the idea that destabilizing mutations occur in microenvironments where the wild type amino acid is strongly preferred, while neutral ones tend to be observed where amino acids other than the wild type are tolerated or even preferred. We constructed Lasso and SVM classifiers to quantitatively evaluate the ability of our predictions to separate the destabilizing variants from the neutral ones. The input features were created from similarity scores in the substitution matrices, indexed by the wild type and predicted class labels. Table 7 shows the ability of our models to predict the outcomes of the mutation variants. Notably, our S freq and S dot matrices show a significant advantage with the 1-Feature set, when only the wild type and mutant true labels are known. Models built from our substitution matrices outperform those built from BLOSUM62, PAM250 and WAC by 25.4% on average. The 1-Feature set uses only the similarity score between the wild type and mutant amino acids, and does not rely on the predicted class labels of the wild type and mutant microenvironments. The significant performance gap between our 3DCNN-derived matrices and the other matrices suggests that structural data provide information that cannot be derived from sequence substitution frequencies alone. Although the WAC matrix is also structure-based and derived from microenvironment information, its performance was noticeably worse than that of the deep-learning-derived matrices.

As expected, models using the 3-Feature and 6-Feature sets achieve better prediction accuracies than the 1-Feature set alone. Importantly, the predicted labels for the wild type structures provide key information for models employing the BLOSUM and PAM matrices: when our class prediction for the wild type microenvironment is available, the performances of the BLOSUM62, PAM250 and WAC models increase by 35.8% on average. This is not surprising, since the predicted label for the wild type structure directly indicates the extent to which the lost amino acid fits the microenvironment. The more “similar” the prediction is to the lost amino acid, where similarity is evaluated by the substitution score between the true and predicted amino acids, the more probable it is that the lost amino acid forms strong interactions with its environment, and the higher the chance that a substitution is harmful. For the 3-Feature S freq and 3-Feature S dot models, this additional information did not provide as large a boost as observed for the other matrices. Interestingly, including features from the mutant structures provided little improvement over the 3-Feature set, suggesting that the wild type environment matters most and that our models can provide useful information without the mutant structures. A minimal sketch of this classifier setup is given below.
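In the sketch, the substitution matrix, the toy variants, and the exact composition of the 3-Feature set are illustrative assumptions based on our reading of the last two paragraphs.

```python
import numpy as np
from sklearn.svm import SVC

AA = 'ACDEFGHIKLMNPQRSTVWY'
IDX = {a: i for i, a in enumerate(AA)}

S = np.random.randn(20, 20)  # placeholder for a real substitution matrix (e.g. S_freq)
S = 0.5 * (S + S.T)          # symmetrized, as described for S_freq

def three_features(wt, mut, pred_wt):
    """Scores indexed by wild type, mutant, and the network's predicted label
    for the wild type structure; the 1-Feature set keeps only the first entry."""
    i, j, k = IDX[wt], IDX[mut], IDX[pred_wt]
    return [S[i, j], S[i, k], S[j, k]]

# toy variants: (wild type, mutant, predicted wild type label, destabilizing?)
variants = [('T', 'I', 'T', 1), ('S', 'A', 'A', 0)]
X = [three_features(w, m, p) for w, m, p, _ in variants]
y = [d for *_, d in variants]
clf = SVC(kernel='linear').fit(X, y)
```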

Network visualization

We present four examples of local amino acid microenvironments, covering charged, polar, and non-polar amino acids. Figure 8a depicts the local microenvironment surrounding a key aspartic acid residue at the EF_HAND calcium binding site [55, 56] (PDB: 1A2X [57], ASP 63). Transparent light pink spheres show the space previously occupied by the removed key functional aspartic acid residue. Our network correctly predicts aspartic acid as the most suitable choice for this microenvironment. Our importance score map indicates that the decision relies on the two nitrogen atoms in close proximity to the electronegative oxygen atoms of the removed aspartic acid residue. Figure 8b shows a microenvironment surrounding a PROSITE [58] INSULIN motif [59], with the key CYS residue removed (PDB: 1IZA [60], CYS 7). The 3DCNN correctly predicts cysteine as the most suitable amino acid for this microenvironment, and the decision is made primarily on the basis of the SG atom of a nearby cysteine residue, which originally formed a disulfide bond with the SG atom of the removed cysteine. Figure 8c shows a microenvironment surrounding a phenylalanine residue (PDB: 1ZPL [61], PHE 52). Phenylalanine belongs to the non-polar, aromatic group, and its six-membered ring can form favorable interactions with other non-polar groups. Our network correctly predicts phenylalanine as the most suitable residue for this microenvironment. The three highlighted regions in the heat map are the side-chain atoms of VAL 54, ILE 48 and VAL 2, all of which consist of non-polar carbon atoms; valine belongs to the non-polar, aliphatic group. Figure 8d shows a microenvironment surrounding a valine residue (PDB: 1VJ9 [62], VAL 200). The highlighted atom groups are the side chains of MET 207, TRP 29 and ILE 121, which are similarly non-polar.
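The importance score maps referred to above follow the gradient-based saliency idea of [48]. Below is one way such a map could be computed, assuming a trained, differentiable TensorFlow/Keras model; the paper's own implementation used Theano [36], and its exact per-atom scoring is not reproduced here.

```python
import tensorflow as tf

def importance_map(model, box, class_idx):
    """box: voxelized microenvironment, shape (20, 20, 20, channels).
    Returns a per-voxel saliency map for the chosen amino acid class."""
    x = tf.convert_to_tensor(box[None], dtype=tf.float32)  # add batch dimension
    with tf.GradientTape() as tape:
        tape.watch(x)
        score = model(x)[0, class_idx]  # class score for the chosen amino acid
    grad = tape.gradient(score, x)[0]
    return tf.reduce_max(tf.abs(grad), axis=-1)  # max absolute gradient over channels
```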

Network architecture design

Our network architecture consists of three 3D convolutional layers, each using an increasing number of 3*3*3 filters. We did not experiment extensively with different numbers of convolutional layers or different filter sizes. Small 3*3*3 filters are generally preferable given sufficient computational power, because larger features can be composed from smaller ones in a hierarchical manner. More layers increase model capacity and the spatial receptive field of each final-layer neuron, since filters in successively higher layers look at successively larger sections of the original atomic input space. It would be interesting to see whether additional layers or larger filters could help recapitulate longer-range electrostatic interactions, as discussed in the Classification Accuracies section. Max-pooling layers reduce the dimension of the input and therefore the computational expense. More importantly, they reduce the sensitivity of the network to the absolute position of each biochemical feature, increasing its translational and rotational invariance. To avoid losing important relative geometry at the atomic level, 3D max pooling is not applied immediately after the first convolutional layer; we delay it until the second convolutional stage, where the features are more abstract and less dependent on absolute spatial orientation.
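One way to express this design in code is shown below. The paper's implementation used Theano [36], so this tf.keras sketch is ours, and the filter counts, channel count, and dense-layer width are illustrative assumptions.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # 20 Å box at 1 Å voxels; the number of atom-type channels is assumed
    layers.Conv3D(32, 3, activation='relu', input_shape=(20, 20, 20, 4)),
    layers.Conv3D(64, 3, activation='relu'),   # second conv before any pooling,
    layers.MaxPooling3D(2),                    # pooling delayed to preserve atomic geometry
    layers.Conv3D(128, 3, activation='relu'),  # increasing filter count with depth
    layers.MaxPooling3D(2),
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(20, activation='softmax'),    # one output per amino acid class
])
```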

Comparing our deep 3DCNN to a simple flat neural network, our results show that the convolutional architecture offers advantages. The convolutional architecture forces the local filters to share weights across different locations in the input space, significantly reducing the number of trainable parameters. The local filters also constrain the learned features to be built from recurring local spatial patterns that are important for classification, rather than allowing the model to memorize arbitrary combinations of inputs. Together, these properties reduce the tendency of our 3DCNN to overfit and allow better performance.

Input featurization

Two additional considerations for our 3DCNN performance are the dimensions of the local box and the input representation. The local box size defines the information accessible to the network and is therefore a hyper-parameter in our framework. Here we extract local protein boxes of 20 Å based on our previous experience with the FEATURE program. FEATURE uses a sphere 16 Å in diameter to define the local microenvironment around functional atoms of each of the 20 amino acids, because beyond a 16–20 Å cutoff the atomic details do not provide significant additional information. We centered our box on the Cβ atom regardless of amino acid type and enlarged the box by 4 Å to include equivalent surrounding information. For the input representation, we divide the boxes into grid voxels. The biggest limitation of this design choice is that a grid voxel system is not rotationally invariant; we therefore aligned all local boxes in a standard manner, using the backbone atoms of the central residue, to ensure similar orientations. This fine-grained feature extraction procedure enabled our network to achieve good performance in characterizing amino acid environments. A minimal voxelization sketch is given below.
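The sketch assumes 1 Å voxels and one channel per heavy-atom element type; the paper's exact discretization and atom typing may differ.

```python
import numpy as np

CHANNELS = {'C': 0, 'O': 1, 'N': 2, 'S': 3}  # assumed heavy-atom channels

def voxelize(atoms, box=20.0, voxel=1.0):
    """atoms: (element, xyz) pairs already translated so the central residue's
    C-beta sits at the origin and rotated into the backbone-defined frame."""
    n = int(box / voxel)
    grid = np.zeros((n, n, n, len(CHANNELS)), dtype=np.float32)
    for element, xyz in atoms:
        idx = np.floor((np.asarray(xyz) + box / 2) / voxel).astype(int)
        if element in CHANNELS and np.all((idx >= 0) & (idx < n)):
            grid[idx[0], idx[1], idx[2], CHANNELS[element]] += 1.0
        # atoms outside the box, or of unmodeled elements, are dropped
    return grid
```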

Conclusion

To our knowledge, this is the first paper to perform head-to-head comparisons between models using hand-engineered features and end-to-end trained deep learning networks in the context of protein engineering. The consistent success of our deep 3DCNN over methods using human-engineered features suggests that the freedom to discover arbitrary features from raw data provides advantages over pre-defined features. Our results suggest that the 3DCNN framework is well suited to the analysis of protein microenvironments, and that many of the benefits of CNNs for 2D image analysis carry over to 3D protein analysis. The deep learning framework may hold promise for more advanced protein analyses, such as pocket similarity evaluation or protein-protein interaction prediction, as more structural data become available.

Abbreviations

3DCNN: Three dimensional convolutional neural network
A: Alanine
BLOSUM: BLOcks SUbstitution Matrix
C, CYS: Cysteine
CNN: Convolutional neural network
D: Aspartate
E: Glutamate
F, PHE: Phenylalanine
G: Glycine
GPU: Graphics processing unit
H: Histidine
I, ILE: Isoleucine
K: Lysine
L: Leucine
M, MET: Methionine
MLP: Multi-Layer Perceptron
N: Asparagine
P: Proline
PAM: Point Accepted Mutation substitution matrices
PDB: Protein Data Bank
Q: Glutamine
R: Arginine
ReLU: Rectified Linear Unit
S: Serine
SVM: Support vector machine
T: Threonine
V, VAL: Valine
W, TRP: Tryptophan
Y: Tyrosine

References

1. Antikainen NM, Martin SF. Altering protein specificity: techniques and applications.
2. Lefèvre F, Rémy MH, Masson JM. Alanine-stretch scanning mutagenesis: a simple and efficient method to probe protein structure and function. Nucleic Acids Res. 1997;25(2):447–8.
3. Thorn KS, Bogan AA. ASEdb: a database of alanine mutations and their effects on the free energy of binding in protein interactions. Bioinformatics. 2001;17(3):284–5.
4. Brachman RJ, Levesque HJ. Readings in knowledge representation. Burlington, MA: M. Kaufmann Publishers; 1985.
5. Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell. 2013;35(8):1798–828.
6. Bagley SC, Altman RB. Characterizing the microenvironment surrounding protein sites. Protein Sci. 1995;4(4):622–35.
7. Neshich G, et al. STING report: convenient web-based application for graphic and tabular presentations of protein sequence, structure and function descriptors from the STING database. Nucleic Acids Res. 2005;33(Database issue):269–74.
8. Block P, Paern J, Hüllermeier E, Sanschagrin P, Sotriffer CA, Klebe G. Physicochemical descriptors to discriminate protein-protein interactions in permanent and transient complexes selected by means of machine learning algorithms. Proteins Struct Funct Genet. 2006;65(3):607–22.
9. Zvelebil MJJM, Sternberg MJE. Analysis and prediction of the location of catalytic residues in enzymes. Protein Eng Des Sel. 1988;2(2):127–38.
10. Buturovic L, Wong M, Tang GW, Altman RB, Petkovic D. High precision prediction of functional sites in protein structures. PLoS One. 2014;9(3):1–8.
11. Liu T, Altman RB. Using multiple microenvironments to find similar ligand-binding sites: application to kinase inhibitor binding. PLoS Comput Biol. 2011;7(12):e1002326.
12. Tang GW, Altman RB. Knowledge-based fragment binding prediction. PLoS Comput Biol. 2014;10(4):e1003589.
13. Liang MP, Brutlag DL, Altman RB. Automated construction of structural motifs for predicting functional sites on protein structures. Pac Symp Biocomput. 2003:204–15.
14. Bishop CM. Pattern recognition. 2006.
15. Grabowski M, Chruszcz M, Zimmerman MD, Kirillova O, Minor W. Benefits of structural genomics for drug discovery research. Infect Disord Drug Targets. 2009;9(5):459–74.
16. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44.
17. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst. 2012:1–9.
18. Szegedy C, et al. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. p. 1–9.
19. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. Int Conf Learn Represent. 2015. arXiv:1409.0473.
20. Kearnes S, McCloskey K, Berndl M, Pande V, Riley P. Molecular graph convolutions: moving beyond fingerprints. J Comput Aided Mol Des. 2016;30(8):595–608.
21. Duvenaud D, et al. Convolutional networks on graphs for learning molecular fingerprints. Adv Neural Inf Process Syst. 2015;28:2215–23.
22. Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33(8):831–8.
23. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods. 2015;12(10):931–4.
24. Miotto R, Li L, Kidd BA, Dudley JT. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci Rep. 2016;6:26094.
25. LeCun Y, et al. Handwritten digit recognition with a back-propagation network. In: Proceedings of the 2nd International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press; 1989. p. 396–404.
26. Wallach I, Dzamba M, Heifets A. AtomNet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv:1510.02855. 2015. p. 1–11.
27. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992;89(22):10915–9.
28. Dayhoff MO, Schwartz R, Orcutt BC. A model of evolutionary change in proteins. In: Atlas of protein sequence and structure, vol. 5. Silver Spring, MD: National Biomedical Research Foundation; 1978. p. 345–58.
29. Wei L, Altman RB, Chang JT. Using the radial distributions of physical features to compare amino acid environments and align amino acid sequences. Pac Symp Biocomput. 1997:465–76.
30. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995;247(4):536–40.
31. Brenner SE, Koehl P, Levitt M. The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res. 2000;28(1):254–6.
32. Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26(5):680–2.
33. Bateman A, et al. UniProt: a hub for protein information. Nucleic Acids Res. 2015;43(D1):D204–12.
34. Scherer D, Müller A, Behnke S. Evaluation of pooling operations in convolutional architectures for object recognition. In: International Conference on Artificial Neural Networks. Springer Berlin Heidelberg; 2010. LNCS vol. 6354. p. 92–101.
35. Bridle JS. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In: Neurocomputing. Berlin: Springer Berlin Heidelberg; 1990. p. 227–36.
36. Theano Development Team. Theano: a Python framework for fast computation of mathematical expressions. arXiv:1605.02688. 2016.
37. Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks. AISTATS '11 Proc 14th Int Conf Artif Intell Stat. 2011;15:315–23.
38. Bottou L. Large-scale machine learning with stochastic gradient descent. In: Proc COMPSTAT'2010. 2010. p. 177–86.
39. Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323(6088):533–6.
40. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15:1929–58.
41. Gu J, Bourne PE. Structural bioinformatics. Hoboken, NJ: Wiley-Blackwell; 2009.
42. Ward JH. Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 1963;58(301):236–44.
43. Oliphant TE. Python for scientific computing. Comput Sci Eng. 2007;9(3):10–20.
44. Chandonia J-M, Fox NK, Brenner SE. SCOPe: manual curation and artifact removal in the Structural Classification of Proteins – extended database. J Mol Biol. 2016;429(3):348–55.
45. Weaver LH, Matthews BW. Structure of bacteriophage T4 lysozyme refined at 1.7 Å resolution. J Mol Biol. 1987;193(1):189–99.
46. Tibshirani R. Regression selection and shrinkage via the Lasso. J R Stat Soc B. 1996;58(1):267–88.
47. Cortes C, Vapnik V. Support vector networks. Mach Learn. 1995;20(3):273–97.
48. Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. Proc Int Conf Learn Represent. 2014.
49. The PyMOL Molecular Graphics System, Version 1.8. Schrödinger, LLC.
50. Janiak C. A critical account on π–π stacking in metal complexes with aromatic nitrogen-containing ligands. Dalton Trans. 2000;21:3885–96.
51. Alvarez S. A cartography of the van der Waals territories. Dalton Trans. 2013;42(24):8617–36.
52. Betts MJ, Russell RB. Amino acid properties and consequences of substitutions. In: Bioinformatics for geneticists. Chichester: John Wiley & Sons, Ltd. p. 289–316.
53. Baase WA, Liu L, Tronrud DE, Matthews BW. Lessons from the lysozyme of phage T4. Protein Sci. 2010;19(4):631–41.
54. Rennell D, Bouvier SE, Hardy LW, Poteete AR. Systematic mutation of bacteriophage T4 lysozyme. J Mol Biol. 1991;222(1):67–88.
55. Kawasaki H, Kretsinger RH. Calcium-binding proteins 1: EF-hands. Protein Profile. 1995;2(4):297–490.
56. Moncrief ND, Kretsinger RH, Goodman M. Evolution of EF-hand calcium-modulated proteins. I. Relationships based on amino acid sequences. J Mol Evol. 1990;30(6):522–62.
57. Vassylyev DG, Takeda S, Wakatsuki S, Maeda K, Maéda Y. Crystal structure of troponin C in complex with troponin I fragment at 2.3-Å resolution. Proc Natl Acad Sci U S A. 1998;95(9):4847–52.
58. Sigrist CJA, et al. PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res. 2010;38(Database issue):D161–6.
59. Blundell TL, Humbel RE. Hormone families: pancreatic hormones and homologous growth factors. Nature. 1980;287(5785):781–7.
60. Bentley GA, et al. Role of B13 Glu in insulin assembly: the hexamer structure of recombinant mutant (B13 Glu → Gln) insulin. J Mol Biol. 1992;228(4):1163–76.
61. Buts L, et al. Impact of natural variation in bacterial F17G adhesins on crystallization behaviour. Acta Crystallogr Sect D Biol Crystallogr. 2005;61(8):1149–59.
62. Schweinitz A, et al. Design of novel and selective inhibitors of urokinase-type plasminogen activator with improved pharmacokinetic properties for use as antimetastatic agents. J Biol Chem. 2004;279(32):33613–22.
63. Halperin I, Glazer DS, Wu S, Altman RB. The FEATURE framework for protein function annotation: modeling new functions, improving performance, and extending to novel applications. BMC Genomics. 2008;9(Suppl 2):S2.
64. Matthews BW, Nicholson H, Becktel WJ. Enhanced protein thermostability from site-directed mutations that decrease the entropy of unfolding. Proc Natl Acad Sci U S A. 1987;84:6663–7.
65. Pjura P, Matthews BW. Structures of randomly generated mutants of T4 lysozyme show that protein stability can be enhanced by relaxation of strain and by improved hydrogen bonding via bound solvent. Protein Sci. 1993;2(12):2226–32.
66. Gassner NC, Baase WA, Lindstrom JD, Lu J, Dahlquist FW, Matthews BW. Methionine and alanine substitutions show that the formation of wild-type-like structure in the carboxy-terminal domain of T4 lysozyme is a rate-limiting step in folding. Biochemistry. 1999;38(44):14451–60.
67. Nicholson H, Anderson DE, Dao-Pin S, Matthews BW. Analysis of the interaction between charged side chains and the α-helix dipole using designed thermostable mutants of phage T4 lysozyme. Biochemistry. 1991;30(41):9816–28.
68. Nicholson H, Becktel WJ, Matthews BW. Enhanced protein thermostability from designed mutations that interact with α-helix dipoles. Nature. 1988;336(6200):651–6.
69. Mooers BHM, Datta D, Baase WA, Zollars ES, Mayo SL, Matthews BW. Repacking the core of T4 lysozyme by automated design. J Mol Biol. 2003;332(3):741–56.
70. Xu J, Baase WA, Quillin ML, Baldwin EP, Matthews BW. Structural and thermodynamic analysis of the binding of solvent at internal sites in T4 lysozyme. Protein Sci. 2001;10(5):1067–78.
71. Wray JW, Baase WA, Lindstrom JD, Weaver LH, Poteete AR, Matthews BW. Structural analysis of a non-contiguous second-site revertant in T4 lysozyme shows that increasing the rigidity of a protein can enhance its stability. J Mol Biol. 1999;292(5):1111–20.
72. Anderson DE, Hurley JH, Nicholson H, Baase WA, Matthews BW. Hydrophobic core repacking and aromatic-aromatic interaction in the thermostable mutant of T4 lysozyme Ser 117 → Phe. Protein Sci. 1993;2(8):1285–90.
73. Lipscomb LA, et al. Context-dependent protein stabilization by methionine-to-leucine substitution shown in T4 lysozyme. Protein Sci. 1998;7(3):765–73.
74. Matsumura M, Becktel WJ, Matthews BW. Hydrophobic stabilization in T4 lysozyme determined directly by multiple substitutions of Ile 3. Nature. 1988;334(6181):406–10.
75. Dao-Pin S, Anderson DE, Baase WA, Dahlquist FW, Matthews BW. Structural and thermodynamic consequences of burying a charged residue within the hydrophobic core of T4 lysozyme. Biochemistry. 1991;30(49):11521–9.
76. Grütter MG, Gray TM, Weaver LH, Alber T, Wilson K, Matthews BW. Structural studies of mutants of the lysozyme of bacteriophage T4: the temperature-sensitive mutant protein Thr157 → Ile. J Mol Biol. 1987;197(2):315–29.
77. Gray TM, Matthews BW. Structural analysis of the temperature-sensitive mutant of bacteriophage T4 lysozyme, glycine 156 → aspartic acid. J Biol Chem. 1987;262(35):16858–64.
78. Weaver LH, et al. High-resolution structure of the temperature-sensitive mutant of phage lysozyme, Arg 96 → His. Biochemistry. 1989;28(9):3793–7.
79. Dixon MM, Nicholson H, Shewchuk L, Baase WA, Matthews BW. Structure of a hinge-bending bacteriophage T4 lysozyme mutant, Ile3 → Pro. J Mol Biol. 1992;227(3):917–33.
80. Mooers BHM, Baase WA, Wray JW, Matthews BW. Contributions of all 20 amino acids at site 96 to the stability and structure of T4 lysozyme. Protein Sci. 2009;18(5):871–80.
81. Hurley JH, Baase WA, Matthews BW. Design and structural analysis of alternative hydrophobic core packing arrangements in bacteriophage T4 lysozyme. J Mol Biol. 1992;224(4):1143–59.
82. Berman HM, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28(1):235–42.

Acknowledgements

Not applicable

Funding

This work is supported in part by NIH GM102365, LM05652 and HL117798.

Availability of data and materials

The datasets used during the current study are generated from publicly available data in the PDB repository, http://www.rcsb.org/pdb/home/home.do [82]. The datasets generated and analyzed during the current study are available in the SIMTK repository, https://simtk.org/projects/aascnn.

Authors’ contributions

WT gathered the protein structural data, constructed the datasets, designed and built the prediction models, and analyzed the prediction statistics. WT and RBA contributed equally to the interpretation of the prediction results and the design of the T4 lysozyme mutation analysis experiment. Both authors contributed to the writing of the manuscript and have read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Corresponding author

Correspondence to Russ B. Altman.

Additional files

Additional file 1: Table S1.

3DCNN and MLP Network Architecture. Table summarizing the network architectures of 3DCNN and MLP. (DOCX 15 kb)

Additional file 2: Table S2.

Individual and knowledge-based group classification accuracies of 3DCNN and MLP. Summary of the individual and knowledge-based group classification accuracies of 3DCNN and MLP. The deep 3DCNN achieves superior prediction performance compared to the MLP model, demonstrating the advantage of the deep 3D convolutional architecture over a simple flat neural network with the same input. (DOCX 12 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Torng, W., Altman, R.B. 3D deep convolutional neural networks for amino acid environment similarity analysis. BMC Bioinformatics 18, 302 (2017). https://doi.org/10.1186/s12859-017-1702-0
