Deep learning regression model for antimicrobial peptide design

Antimicrobial peptides (AMPs) are naturally occurring or synthetic peptides that show promise for treating antibiotic-resistant pathogens. Machine learning techniques are increasingly used to identify naturally occurring AMPs, but there is a dearth of purely computational methods to design novel effective AMPs, which would speed AMP development. We collected a large database, the Giant Repository of AMP Activities (GRAMPA), containing AMP sequences and associated MICs. We designed a convolutional neural network to perform combined classification and regression on peptide sequences to quantitatively predict AMP activity against Escherichia coli. Our model outperformed the state of the art at AMP classification and was also effective at regression, for which there were no publicly available comparisons. We then used our model to design novel AMPs and experimentally demonstrated activity of these AMPs against the pathogens E. coli, Pseudomonas aeruginosa, and Staphylococcus aureus. Data, code, and neural network architecture and parameters are available at https://github.com/zswitten/Antimicrobial-Peptides.

sequences will be antimicrobial or not, which allows sequenced genomes to be scanned for antimicrobial peptide discovery. Such algorithms include AMP Scanner v2 (Veltri et al., 2018), iAMPpred (Meher et al., 2017), and a variety of algorithms available from the CAMP (Cationic AMP) database (Thomas et al., 2009). Other groups have used regression approaches, based on peptide structure and biophysical properties, to quantitatively predict antimicrobial activity. These approaches are often used for local sequence optimization around a specific known AMP scaffold (Yoshida et al., 2018; Hilpert et al., 2006). Beyond identifying and optimizing existing AMPs, several groups have used variational autoencoders for AMP generation (Das et al.).

Our goal was to improve on these approaches by combining a large dataset with a regression model to design AMPs with a low predicted minimum inhibitory concentration (MIC). MIC is a standard measure of antibiotic activity: a lower MIC means a lower drug concentration is required to inhibit bacterial growth. We first assembled a large dataset of MIC measurements by combining data from multiple databases into GRAMPA (Giant Repository of AMP Activity). Examination of this dataset yielded experimental corroboration that MIC is more correlated among bacteria in the same gram class. Next, we used GRAMPA to train convolutional neural network (CNN) models for AMP activity prediction. As a benchmark, we showed that converting our model to a classifier yields classification performance that improves on the state of the art.

k-nearest neighbors model

To predict the class of a query peptide, we had its k nearest neighbors (evaluated using Levenshtein distance, as it was found to be superior to the other approaches in the regression analysis) "vote" based on whether they were AMPs or negatives. We had the best results with k = 7.
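The k-NN voting scheme above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation; the toy training peptides are invented, and a simple dynamic-programming edit distance stands in for whatever Levenshtein routine was actually used.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def knn_classify(query: str, labeled: list, k: int = 7) -> bool:
    # Each of the k nearest training peptides "votes" with its label
    # (True = AMP, False = negative).
    nearest = sorted(labeled, key=lambda sl: levenshtein(query, sl[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes[True] > votes[False]
```

A usage example with hypothetical sequences: `knn_classify("KKLLKA", [("KKLLKK", True), ("KKLAKK", True), ("AAAAAA", False)], k=3)` returns `True`, because the two nearest neighbors are labeled AMPs.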

Neural network model
For our architecture, after zero-padding, we begin with two one-dimensional convolutional layers of 64 neurons each, with a kernel size of 5 letters and a stride of 1 letter, paired with a max-pooling layer with a stride of 2 and a pooling size of 2. We then use a flattening layer. Next, we add a Dropout(0.5) layer to regularize. Finally, we add two dense layers of 100 and 20 neurons (with ReLU activation), and then a single neuron to transform the output into a single scalar value: the predicted log MIC for the peptide. A diagram of our architecture is given in Figure 1. The model was trained to minimize mean squared error using the Adam optimizer. Different network depths, kernel sizes, dropout rates, and learning rates were explored as described in Table S1; the CNN proved largely insensitive to these hyperparameters. We also explored replacing the convolutional layers with vanilla RNN layers, LSTM layers, and bidirectional LSTM layers.

Figure 1. Architecture of the neural network. Peptides are encoded as one-hot vectors and then fed to either convolutional or recurrent layers, followed by two dense layers and an output layer that outputs a predicted MIC for the peptide.

The vast majority of random peptide sequences are not active antimicrobial peptides. A model trained only on experimental data from existing peptide databases would have no inkling of this fact. We added negative training data to our model to reflect this prior, taking random sequences of amino acids and "labeling" them as having very low activity (log MIC = 4). We found that doing so increased the classification accuracy of the model, while slightly decreasing regression accuracy.
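As a sanity check on the layer stack described above, the intermediate tensor shapes can be traced by hand. The sketch below does this in plain Python; the padded input length of 50 residues and the use of "valid" (no-padding) convolutions are assumptions for illustration, as the text does not state them.

```python
def conv1d_out(length: int, kernel: int = 5, stride: int = 1) -> int:
    # Output length of a 1-D "valid" convolution.
    return (length - kernel) // stride + 1

def maxpool_out(length: int, pool: int = 2, stride: int = 2) -> int:
    # Output length of 1-D max pooling.
    return (length - pool) // stride + 1

def trace_shapes(padded_len: int = 50, alphabet: int = 20):
    # Walk the stack: 2x Conv1D(64, k=5, s=1) -> MaxPool(2, s=2)
    # -> Flatten -> Dropout(0.5) -> Dense(100) -> Dense(20) -> Dense(1).
    shapes = [("one-hot input", (padded_len, alphabet))]
    n = conv1d_out(padded_len)
    shapes.append(("conv1 (64 filters)", (n, 64)))
    n = conv1d_out(n)
    shapes.append(("conv2 (64 filters)", (n, 64)))
    n = maxpool_out(n)
    shapes.append(("maxpool", (n, 64)))
    shapes.append(("flatten", (n * 64,)))   # Dropout(0.5) leaves shape unchanged
    shapes.append(("dense-100", (100,)))
    shapes.append(("dense-20", (20,)))
    shapes.append(("output (log MIC)", (1,)))
    return shapes
```

Under these assumptions, a 50-residue input shrinks to 46 and then 42 positions after the two convolutions, 21 after pooling, and 1,344 features after flattening.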

Ensemble model
To make our final model, we trained an ensemble of models with identical architecture on slightly different datasets. For positive datasets, we used the training-set AMPs, and the training-set AMPs with the cysteine-containing AMPs filtered out. We varied the amount of negative training data between 1, 3, and 10 times the size of the positive data, yielding a total of 2 × 3 = 6 different datasets. The random peptides in the negative data were allowed to include cysteine if and only if the positive dataset included cysteine. The networks were trained 5 different times for each negative dataset to average over the stochasticity inherent in training neural networks, meaning that the full ensemble model contained 6 × 5 = 30 neural networks.

While analyzing our model output, we noticed that individual models gave extremely bimodal predictions, which was expected: the prediction was either very close to 4 (a predicted inactive peptide) or somewhere between -1 and 3.5 (a predicted active peptide). Therefore, for the purposes of classification (Section 3.3), instead of averaging over the ensemble model predictions, we had each model in the ensemble "vote." If more than half of the models predicted log MIC > 3.9, we classified the peptide as inactive and predicted log MIC = 4. Otherwise, we classified the peptide as active, and the predicted log MIC (used for generation of the ROC curves in place of a probabilistic prediction) was the average over all predictions that were < 3.9. Finally, in our classification test sets we set all C-terminal amidation to "False" because the other algorithms did not have access to this information.

For regression and peptide design by simulated annealing, we simply averaged over the ensemble. Particularly for simulated annealing, it was important to have a smoother prediction landscape.
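The ensemble voting rule can be sketched directly from the description above. This is an illustrative reimplementation of the stated logic, not the authors' code; the input is simply the list of per-model log MIC predictions for one peptide.

```python
def ensemble_classify(predictions: list, threshold: float = 3.9):
    # Each model "votes" inactive if its predicted log MIC > 3.9.
    inactive_votes = sum(1 for p in predictions if p > threshold)
    if inactive_votes > len(predictions) / 2:
        # Majority inactive: classify as inactive, predict log MIC = 4.
        return False, 4.0
    # Otherwise: active, with log MIC averaged over the "active" votes only.
    active = [p for p in predictions if p < threshold]
    return True, sum(active) / len(active)
```

For example, with predictions `[3.95, 3.98, 1.0, 0.5, 1.5]` only two of five models vote inactive, so the peptide is classified active with predicted log MIC 1.0 (the mean of 1.0, 0.5, and 1.5).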
Simulated annealing

Proposed transitions were accepted with probability

P_accept = min(1, exp((m_old − m_new) / T)),

where m_old and m_new are the predicted log MIC values of the current peptide and the newly proposed peptide, respectively, and T is the temperature. A transition was also rejected if it did not satisfy the constraints we imposed, such as length (between 10 and 25) or charge density (see Section 3.5). The initial temperature T_0 was set to 4/ln(2), corresponding to a transition probability of ½ for moving from an excellent peptide (log MIC = 0, or MIC = 1 µM) to an inactive peptide (log MIC = 4), and the final temperature T_f was 0.00001/ln(2). N_steps = 100,000 steps (transition proposals) were used per simulated annealing run, and the temperature at step n varied according to the power law T_n = T_0 (T_f / T_0)^(n / N_steps). All generated peptides were a) cysteine-free (the random initializations and transitions used only the other 19 amino acids), and b) set to be C-terminally amidated, because they were later chemically synthesized with C-terminal amidation.
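A minimal sketch of this search loop follows. The scoring function here is a toy stand-in for the ensemble CNN (it just rewards lysine/leucine content and is purely illustrative), the run is shortened for speed, and the geometric schedule is the power law stated above.

```python
import math
import random

AMINO_ACIDS = "ADEFGHIKLMNPQRSTVWY"  # 19 letters, cysteine excluded

def toy_predicted_log_mic(peptide: str) -> float:
    # Stand-in for the ensemble CNN; NOT the model from the paper.
    good = sum(peptide.count(aa) for aa in "KL") / len(peptide)
    return 4.0 - 4.0 * good

def anneal(n_steps: int = 2000, length: int = 15, seed: int = 0):
    rng = random.Random(seed)
    t0, tf = 4 / math.log(2), 1e-5 / math.log(2)
    peptide = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    m_old = toy_predicted_log_mic(peptide)
    best, best_m = peptide, m_old
    for n in range(n_steps):
        t = t0 * (tf / t0) ** (n / n_steps)   # temperature schedule
        pos = rng.randrange(length)            # propose a point mutation
        proposal = peptide[:pos] + rng.choice(AMINO_ACIDS) + peptide[pos + 1:]
        m_new = toy_predicted_log_mic(proposal)
        # Metropolis criterion: accept improvements always, worsenings
        # with probability exp((m_old - m_new) / T).
        if m_new <= m_old or rng.random() < math.exp((m_old - m_new) / t):
            peptide, m_old = proposal, m_new
        if m_old < best_m:
            best, best_m = peptide, m_old
    return best, best_m
```

In a real run, the constraint checks (length, charge density) would be applied to each proposal before the acceptance test, and `toy_predicted_log_mic` would be replaced by the averaged ensemble prediction.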

Hydrophobic moment analysis
Hydrophobic moments were calculated using the "normalized consensus" residue hydrophobicities (Eisenberg et al., 1984). Each peptide was compared to 1,000 randomly shuffled versions of itself to generate an "HM percentile," defined as the frequency with which the peptide sequence has a greater HM than a randomly shuffled version of itself.
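This calculation can be sketched as below. The hydrophobicity values are transcribed from common tabulations of the Eisenberg normalized-consensus scale and should be checked against the original reference; the 100° twist per residue corresponds to an alpha helix.

```python
import math
import random

# Eisenberg et al. (1984) normalized-consensus hydrophobicities,
# as commonly tabulated (verify against the original paper).
HYDRO = {"A": 0.62, "R": -2.53, "N": -0.78, "D": -0.90, "C": 0.29,
         "Q": -0.85, "E": -0.74, "G": 0.48, "H": -0.40, "I": 1.38,
         "L": 1.06, "K": -1.50, "M": 0.64, "F": 1.19, "P": 0.12,
         "S": -0.18, "T": -0.05, "W": 0.81, "Y": 0.26, "V": 1.08}

def hydrophobic_moment(seq: str, twist_deg: float = 100.0) -> float:
    # Magnitude of the vector sum of residue hydrophobicities
    # around the helical wheel, per residue.
    delta = math.radians(twist_deg)
    x = sum(HYDRO[aa] * math.cos(i * delta) for i, aa in enumerate(seq))
    y = sum(HYDRO[aa] * math.sin(i * delta) for i, aa in enumerate(seq))
    return math.hypot(x, y) / len(seq)

def hm_percentile(seq: str, n_shuffles: int = 1000, seed: int = 0) -> float:
    # Fraction of random shuffles with a smaller moment than the sequence.
    rng = random.Random(seed)
    mu = hydrophobic_moment(seq)
    wins = 0
    for _ in range(n_shuffles):
        shuffled = list(seq)
        rng.shuffle(shuffled)
        wins += mu > hydrophobic_moment("".join(shuffled))
    return wins / n_shuffles
```

Because shuffling preserves composition, the HM percentile isolates the contribution of residue *ordering* (amphipathic periodicity) from overall hydrophobicity.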

Dataset characterization
Our dataset contained at least 700 MIC measurements for each of 10 different microbes, with the most measurements (4,559) for E. coli (Table S2). To maximize our training set size, we selected E. coli as the organism against which we would train our model. Many AMPs were measured for their activity against multiple bacteria, which allowed us to consider the question of how tightly correlated the log MICs were between different microbial species. Since the primary mechanism of action of CAMPs is generally membrane disruption, and gram-negative bacteria have highly different membrane structures from gram-positive bacteria, we predicted that gram type would be the primary factor determining correlations. This trend was indeed observed in the data (Figure 2). We also included Candida albicans, an opportunistic pathogenic yeast, in our analysis. Antibiotic activity against C. albicans, a eukaryote, correlated poorly with antibiotic activity against all bacterial species (Figure 2). To our knowledge, this is the first large-scale report of AMP activity across multiple species. These results also confirm that an AMP designed for effectiveness against E. coli would likely also be effective against multiple gram-negative pathogens.
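The cross-species comparison above amounts to correlating log MIC values for peptides measured against pairs of organisms. A minimal sketch, using hypothetical toy values rather than data from GRAMPA:

```python
import math

def pearson(xs, ys):
    # Pearson correlation of paired log MIC measurements.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical log MICs for peptides measured against two gram-negative
# species (illustrative numbers only).
ecoli = [0.5, 1.2, 2.0, 3.1, 0.9]
paeruginosa = [0.7, 1.0, 2.4, 2.9, 1.1]
r = pearson(ecoli, paeruginosa)
```

In the paper's analysis, high values of such correlations within a gram class (and low values across the bacteria/yeast divide) support using activity against E. coli as a proxy for broader gram-negative activity.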

Design of machine learning architecture
After splitting the data into test and training sets, we considered multiple machine learning techniques and architectures. For the purposes of this section, we excluded data with cysteine and included no negative data, for a simplified and streamlined comparison.

Before training the full ensemble, we optimized our model architecture by training a variety of networks with convolutional or recurrent layers. A performance comparison of the NN-based models on the validation set is given in Table S1. While many recent and state-of-the-art networks for AMP analysis are recurrent (Veltri et al., 2018; Müller et al., 2018), our convolutional neural network (CNN) model performed better than the recurrent models we tried. Additionally, model performance was not significantly altered by changing parameters such as dropout or convolutional kernel size (Table S1).

The superior performance of the CNN in comparison to recurrent architectures could reflect the fact that many CAMPs are alpha helical in their active conformation (Lee et al., 2015). Because alpha helices do not have long-range interactions, recurrent models may not be necessary for this problem. That said, it is possible that the recurrent models were simply slower to train, and that with a larger dataset, network architectures that capture longer dependencies might start to show advantages.

We also compared our model against ridge regression (RR) models using different peptide featurizations (see Methods), and a k-nearest neighbors regression approach (Table 1). For the k-NN approach, we used the same training-validation split used to select a neural network architecture to select the best k and similarity measure (edit distance, or alignment type and matrix; see Methods and Figure S1).

Our model was substantially better than both RR models, which suggested that it was successfully taking nonlinear and amino-acid-order effects into account and not simply basing predictions on amino acid composition.
k-NN outperformed linear regression, but the deep learning model was the clear winner (Table 1). We then trained the large ensemble model described in Methods for the results described below. We did so because early investigation showed that, while the ensemble was not much better than individual models against held-out test sets, it was substantially better for sequence design: our SA algorithm found spurious minima in the predicted log MIC landscape of single models, in which some sequences were predicted to be highly active by some models but much less active by other models trained on the same data. The large ensemble model eliminated this issue and was thus used in all subsequent analysis.

Classification performance
While classification was not the primary purpose of this work, it allows for some benchmarking of our model against other machine learning-based AMP classifiers. We emphasize that the comparison is not specifically between the different machine learning algorithms so much as between the classification capabilities of each data-algorithm combination. We found that our classifier substantially outperformed other classifiers at this task, including AMP Scanner v2, which was previously shown to improve on the state of the art. This remained true when we restricted the test set to peptides with no greater than 90% or 70% identity to the training set (Table 2 and Figure S2). We also tuned a k-NN classifier using the same training-validation split applied in Section 3.2 (Figure S3). As with regression, this classifier performed only slightly worse than our CNN (Table 2, Figure S2). It also performed better than all of the other classifiers except CAMP-RF in the 70% identity filter case.

One important caveat is that the other classifiers were trained using non-antimicrobial protein sequences from UniProt as their negative data, not random peptides. This puts the other classifiers at a disadvantage in the comparison above. UniProt sequences are more appropriate negative data if the goal is to scan genomes for antimicrobial sequences, as protein sequences likely have different statistical properties from purely random peptides. While genome scanning was not our primary goal, we also compared classification performance for our model and the others using UniProt-derived sequences as negative data. Despite this handicap, our ensemble model had the second-best performance of all of the tested models (after AMP Scanner v2) by Matthews correlation coefficient (MCC; Table S3 d-f) and area under the receiver operating characteristic curve (AUC; Figure S4).
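MCC, used for the comparisons above, is a standard summary of a binary confusion matrix that is robust to class imbalance. A minimal sketch of the formula:

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    # Matthews correlation coefficient from confusion-matrix counts:
    # (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), which is why it is a common choice when negative and positive sets differ in size, as they do here.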
Notably, when we used a 10:1 negative:positive data ratio, the resulting CNNs had the overall best classification performance, exceeding that of AMP Scanner v2 (Table S3 d-f). Nevertheless, we mixed these models with more balanced models to make the ensemble model used for analyzing AMP candidates, as that ensemble demonstrated better regression performance. The k-NN approach performed in the middle of the pack by AUC (Figure S4) and MCC, except for particularly poor performance in the 70% identity case (Table S3 d-f). While it is difficult to draw firm conclusions given the different goals and negative data of our model versus the others, these comparisons, along with k-NN's good performance, suggest that the improved performance of our model derives mostly from our relatively large dataset. However, since k-NN performed particularly poorly on more unique AMPs, more complex predictive models should be better at designing novel, interesting sequences.

Regression performance
We next turned back to the problem of regression. Figure 3 depicts the predictions of our ensemble model on the AMP-only test set, and Table S4 contains fit statistics. Our model's predictions span more than three orders of magnitude in MIC, which lent confidence that peptide design using this model would be likely to yield particularly active AMPs. We do note that some active peptides were predicted to be inactive (log MIC ~ 4) by our model, which suggests that at least some of AMP sequence space will be inaccessible to our design algorithm. This is due to the inclusion of negative data in our training set and accounts for the lower correlation coefficients observed here compared to the results in Table 1. Regardless, these false negatives are only a minor problem, as we are not attempting an exhaustive search of all possible AMPs, just trying to find some that work well. More important is that there are very few peptides in the top left corner of the plots, meaning that the peptides predicted to be highly effective were, in fact, highly effective.

To avoid generating excessively cationic AMPs, we imposed one of two different constraints on our sequence search to reduce the charge: a positive-charge constraint and a positive-charge-density constraint. In the first, we limited the total number of R's and K's to 6; in the second, we permitted no more than 40% of the residues to be R or K. The generated peptides were predicted to have low MIC values, frequently below 1 μM (Figure S5).

To determine the extent to which our sequence design was capturing typical AMP structure, we analyzed the hydrophobic moment of the peptides assuming an alpha-helical conformation. The distribution of hydrophobic moments of our designed peptides did indeed show an elevated hydrophobic moment compared to shuffled versions of themselves (Figure 4).
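The two charge constraints above can be expressed as simple validity checks applied to each candidate sequence. Function names here are illustrative, not from the paper's code:

```python
def cationic_count(peptide: str) -> int:
    # Number of positively charged residues (arginine and lysine).
    return sum(peptide.count(aa) for aa in "RK")

def satisfies_charge_constraint(peptide: str) -> bool:
    # Constraint 1: at most 6 R/K residues in total.
    return cationic_count(peptide) <= 6

def satisfies_charge_density_constraint(peptide: str) -> bool:
    # Constraint 2: at most 40% of residues may be R or K.
    return cationic_count(peptide) / len(peptide) <= 0.40

def valid_length(peptide: str) -> bool:
    # Length constraint from the simulated annealing search.
    return 10 <= len(peptide) <= 25
```

In the simulated annealing search, a proposed transition failing any applicable check is rejected outright, before the acceptance probability is evaluated.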
This elevated hydrophobic moment was not present in a negative control in which we assumed a 140° turn per residue, as opposed to the 100° turn of an alpha helix (Figure S6). Thus, our model learned to incorporate alpha helical structure into its sequence design.

Figure S7 shows some representative sequence alignments of our peptides with peptides in the database, showing strong similarity in many cases. This similarity, and specifically the frequent repetition of "LAK," is likely due to the presence in the training data of several low-MIC peptides with LAK repeats. Nevertheless, several peptides did represent new designs. By manually sorting through all of the peptides with predicted log MIC < 0, we identified two peptides with fairly low sequence similarity to any one peptide in particular, which we termed CNN-SA1 and CNN-SA2. CNN-SA1 was in some ways a modified fusion of two previously identified AMPs, while CNN-SA2 was a mostly new design (Figure 5). Furthermore, each of these peptides had a high hydrophobic moment, as can be observed on helical wheel plots generated using HeliQuest (Gautier et al., 2008) (Figure S8; HM percentiles 99.0 and 97.8, respectively), and predicted log MICs of -0.2 and -0.11, respectively. We selected these two peptides for solid-phase synthesis and experimental testing.

Figure 5. a) CNN-SA1 sequence with representative alignment to a LAK-containing peptide. b) CNN-SA2 sequence with best alignment.

Experimental validation

Table 3 gives the antimicrobial activity of the two peptides. They have good activity against E. coli, as predicted: their MICs were lower than those of 79% of the active AMPs in our dataset. They also showed activity against S. aureus and P. aeruginosa, in line with what we would predict given that activities are usually correlated between different species of bacteria (Figure 2).
Furthermore, we note that these peptides have not undergone experimental local sequence optimization, which frequently reduces MICs by an order of magnitude or more (López-Pérez et al., 2017). Thus, these sequences may be promising as lead compounds rather than as treatments in and of themselves.