PhageAI - Bacteriophage Life Cycle Recognition with Machine Learning and Natural Language Processing

Background As antibiotic resistance becomes a major problem in the treatment of infections, bacteriophages (also known as phages) appear to be a promising alternative. However, to be used in therapy, their life cycle should be strictly lytic. With the growing popularity of Next Generation Sequencing (NGS) technology, it is possible to obtain such information from the genome sequence. A number of tools are available that help to determine the phage life cycle. However, there is still no agreed way to deal with this problem, especially in the absence of well-defined open reading frames. A new tool is needed to overcome this limitation. Results We developed a novel tool, called PhageAI, that gives access to more than 10 000 publicly available bacteriophages and differentiates between their major types of life cycle: lytic and lysogenic. The tool includes a life cycle classifier that achieved 98.90% accuracy on a validation set and 97.18% average accuracy on a test set. We adopted nucleotide sequence embedding based on Word2Vec with the Skip-gram model and a linear Support Vector Machine with 10-fold cross-validation for supervised classification. PhageAI is free of charge and is available at https://phage.ai/ as a REST web service and as a Python package. Conclusions Machine Learning and Natural Language Processing make it possible to extract information from bacteriophage nucleotide sequences for life cycle prediction tasks. The PhageAI tool classifies phages as either virulent or temperate with higher accuracy than any existing method and provides an interactive 3D visualization to help interpret the model's classification results.


Background
We might soon be living in a post-antibiotic era, and there is a need to find an alternative way to treat microbial diseases, especially because of the growing resistance of pathogens. One solution that has attracted much scientific attention in recent years is phage therapy [1]. Bacteriophages are viruses that target, infect, and replicate within bacteria, with high specificity restricted to one bacterial genus or even to certain strains. They are among the most abundant entities on Earth: it is estimated that there are 10^31 phages worldwide [2,3].
After their discovery at the beginning of the 20th century, phages soon lost popularity because antibiotics were discovered at around the same time. Therefore, there are still many gaps in the knowledge of their biology. One of the problems that still needs to be addressed is the differentiation between the phage life cycles: lytic or lysogenic [3]. A virulent phage exhibits a strictly lytic life cycle: after the phage attaches to a host cell, its nucleic acid is injected in order to use the bacterial metabolism, replicate the phage genome and synthesize new virions. As a result, the bacterium is lysed and bacteriophages are released into the environment. In contrast, a temperate phage follows a lysogenic cycle in which its genome may be inserted into the bacterial chromosome and form a prophage, a state in which it can persist for many generations. However, when such a phage is induced by a certain stress factor, it can also enter a lytic cycle [3,4].
The life cycle of a phage needs to be determined especially when choosing phages for therapeutic purposes, as temperate phages are known to take part in horizontal gene transfer (HGT). Since they can integrate into bacterial genomes, they can transfer undesirable features such as virulence factors or antibiotic resistance genes into subsequent bacterial generations. On the other hand, virulent phages are considered safe and are approved for use in phage therapy [2,5].
So far, there is no unambiguous and indisputable way to determine a bacteriophage life cycle. A traditional experimental method based on the clearance or turbidity of plaques exists; however, it is not of much use nowadays [5]. As NGS becomes less and less expensive, research units can increasingly obtain information from bacteriophage sequences. The Andrew Millard lab webpage presents a plot of the cumulative number of phage genomes over the years (Figure 1).
However, the analysis of phage sequences is still a struggle for the scientific community because of the low availability of reference genomes, errors in Open Reading Frame (ORF) sizes introduced by automatic annotation programs, and limited knowledge of the protein functions of the analysed phage.
Therefore, there are various approaches to determining a phage life cycle [6,7]. The process often starts with a search for reference sequences in the Basic Local Alignment Search Tool (BLAST) and both automatic and manual annotation of genomes (e.g. in DNAMaster, University of Pittsburgh). Then, the ORFs are carefully analyzed, looking not only for sequence homology but also for structural homology (e.g. in HHpred), and a search for domains is performed in InterPro [8,9]. As a complement, a phylogenetic analysis (e.g. in MEGAX [10]) and an analysis of phage termini in PhageTerm [11] are prepared.
Currently, there is only one automatic tool, PHACTS, which predicts a phage life cycle from the amino acid sequences of the analyzed phage. However, it requires amino acid sequences derived from an annotation, which can be imperfect; moreover, it quite often returns averaged probabilities of around 0.5-0.6, which is not satisfactory [12].
Consequently, there is a need for a fast and reliable tool that is based on analysis of the phage genome itself and does not depend on the hypothetical functions of potential ORFs, which is very often the situation for bacteriophage genomes with no reference. This is why we decided to apply solutions from the Artificial Intelligence (AI) domain, which focuses on statistical models and algorithms that allow computer systems to solve a particular problem or perform a specific task with or without explicit expert rules and programmed instructions. While historically somewhat niche, increasingly better hardware and recent developments in Deep Learning algorithms enable Machine Learning (ML) models and Natural Language Processing (NLP) to achieve human-like performance on various tasks in multiple fields, such as computer vision or knowledge extraction from data. ML is applied to an increasingly wide range of domains. Every scientist has an opportunity to integrate it into their work to become more competitive by gaining predictive insights and the potential to automate numerous tasks. Today's AI frameworks are already mature and effective enough to be powerful tools not only for researchers but also for developers of practical applications.
In this paper, we present a novel approach based on Machine Learning and Natural Language Processing that classifies phages as virulent or temperate based on their nucleotide sequences. Our tool is available online at https://phage.ai/. Published with author permission [13].

This work focused on constructing a novel Machine Learning and Natural Language Processing pipeline for bacteriophage life cycle classification. For this research we used 278 virulent and 174 temperate phage genomes in FASTA format.
We applied common NLP techniques for efficient DNA word embedding, representing each genome by its k-mers (contiguous subsequences of k letters) extracted with a sliding window approach using a constant k = 6. The resulting number of features allowed us to train and tune 11 supervised ML classifiers (see Table 1). To tune the models' hyperparameters, Bayesian optimization was applied with 10-fold cross-validation and F1-weighted scoring. The best result was achieved with a Support Vector Machine classifier with a linear kernel, which reached an average accuracy of 98.90% on the validation sets.
Interpretation of the SVM model is difficult given the multidimensional context of the data as well as the embedding used to obtain the numeric vector space. Therefore, in the PhageAI tool we have prepared an interactive 3D visualization to help interpret the model's classification results (Figure 7). The application of Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) [15] allowed us to separate virulent from temperate phages and to group a significant part of them into clusters and subgroups represented by the same life cycle. Therefore, as the next step of our research we intend to investigate these clusters and look for correlations between them.
The PhageAI tool is in active development. We have also shared a dedicated Python package.

Review of existing solutions
Before developing the PhageAI tool, we have reviewed the existing solutions for bacteriophage life cycle classification. Namely, we focused on the tool that is currently the most popular and widely used for phage research: PHACTS [12]. It uses an ensemble of Random Forest classifiers trained on samples from the PHANTOME database [17]. The models use protein-based features representing calculated similarities to the analyzed genomes. However, since the proteins used by the classifiers are chosen mostly at random, the results vary greatly during practical tests, with the same phages classified as both virulent and temperate on multiple runs. By analyzing the entire nucleotide sequences and using techniques such as the reverse complement described in the methodology section, our approach yields much more stable results.
Another tool, VIBRANT, utilizes a hybrid Machine Learning and protein similarity approach that is not reliant on sequence features for automated recovery and annotation of viruses, determination of genome quality and completeness, and characterization of virome function from metagenomic assemblies.
VIBRANT uses a supervised neural network classifier, a Multi-layer Perceptron (MLP), with protein signatures and a custom v-score metric that circumvents traditional boundaries to maximize identification of lytic viral genomes and integrated proviruses, including highly diverse viruses.
Surprisingly, when tested on Enterobacteria phage Mu (NC_000929.1), a model example of a phage that integrates into the host genome by transposition [19], the program indicated that it has a virulent character. Therefore, one has to be cautious when relying on the life cycle assessment presented by the program. At the same time, the program is an ideal tool for rapid annotation of a viral genome, which allows manual review of the program's indications.

Conclusions
In this paper, we have shown that it is possible to capture and extract knowledge from hundreds of bacteriophage sequences in order to classify their life cycles with high accuracy and an immediate result.
The PhageAI tool needs only a DNA nucleotide sequence in FASTA format to make a prediction, which is, to the best of our knowledge, a novel approach.
The application of Machine Learning and Natural Language Processing to bacteriophage research could be extended to other issues such as predicting protein features, distinguishing bacteriophage taxonomy or identifying phage host range. A Deep Learning approach is becoming more justified as the next step because the PhageAI repository has already collected more than 10,000 phage sequences.
PhageAI was released as a free web platform, a REST API service and an open-source Python package, which should allow other researchers to include our tool in their pipelines.

Methods
In our study we used more than 600 genomic sequences of bacteriophages from ACLAME [20] and PhagesDB [21] together with information about their life cycle. We manually verified the life cycle predictions for the purpose of this study. To standardize the annotation, all phage genomes were annotated with DNAMaster (a tool developed by Dr. Jeffrey Lawrence, University of Pittsburgh, v5.23.3) using its auto-annotation option, which combines the Glimmer [22] and GeneMarkS [23] algorithms. Then, all detected ORFs were analyzed to find proteins that may be involved in bacteriophage integration into the host genome. For this purpose, HMMscan [24] and InterPro ([9], accessed 07.2019) were used, which allow detection of characteristic domains in a protein sequence. Additionally, HHpred ([8], accessed 07.2019) was run to find remote homologs based on the modeled 3D structure. When no lysogenic factor was found, or a phage was unable to maintain lysogeny, the phage sequence was marked as virulent; otherwise it was marked as temperate. Phages for which the life cycle could not be predicted or was unclear were discarded from further processing.
Moreover, the amino acid sequences of the phages were analyzed in the PHACTS tool for comparison with the manually obtained predictions.

Datasets
The final training dataset after manual editing consisted of 278 virulent and 174 temperate phages.
Additionally, we selected a testing dataset of 54 virulent and 30 temperate phages from different species and families (Figure 8, Figure 9, Figure 10).

Train-validation-test split
To control how well an ML model generalizes beyond the data it learns from, a well-established practice is to split the samples so that the classifier is evaluated on bacteriophages it has not seen during training. To train and evaluate the models we chose the following approaches:
Cross-validation: a stratified shuffle split with 10 folds and 80%-20% train-validation proportions was used to find the optimal hyperparameters and evaluate the results during training. For data stratification we used both the life cycle and the bacteriophage family, to preserve the percentage of samples for each class.
Holdout validation: a dataset of 84 unseen samples was designated as the testing set. It was not used directly in the training process, but was employed to compare the models' metrics after training.
Additional holdout validation: a second dataset, provided by Proteon Pharmaceuticals S.A. and containing 61 samples unavailable to the models during training, was used to estimate the final metrics.
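The cross-validation scheme can be illustrated with a minimal sketch using only the Python standard library. It is not the authors' implementation: the function name, the random seed and the rounding rule are assumptions, and for brevity it stratifies on the life cycle label only, whereas the study also stratified on bacteriophage family (a combined label such as a (life cycle, family) tuple could be passed instead).

```python
import random
from collections import defaultdict


def stratified_shuffle_splits(labels, n_splits=10, val_fraction=0.2, seed=0):
    """Yield (train_idx, val_idx) pairs that preserve class proportions.

    Hypothetical helper for illustration; labels may be any hashable values.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    for _ in range(n_splits):
        train_idx, val_idx = [], []
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Take the same fraction of every class for validation.
            n_val = round(len(shuffled) * val_fraction)
            val_idx.extend(shuffled[:n_val])
            train_idx.extend(shuffled[n_val:])
        yield sorted(train_idx), sorted(val_idx)


# Class counts taken from the training dataset described in the text.
labels = ["virulent"] * 278 + ["temperate"] * 174
splits = list(stratified_shuffle_splits(labels))
first_train, first_val = splits[0]
```

Each of the 10 splits reshuffles every class independently, so the validation part keeps roughly the same virulent-to-temperate ratio as the full training dataset.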

Reverse complement augmentation
After the train-validation-test split, the reverse complements of the bacteriophage sequences (Figure 11) were treated as additional samples. This enabled the ML model to automatically learn the complex relationships between the two strands of double-stranded DNA. Previous studies [34,35] confirm the importance of using reverse complement DNA sequences as a form of data augmentation. This step also doubled the size of the datasets, which were then ready to be vectorized.
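The augmentation step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and characters outside A, C, G, T (for example N) are assumed to be left unchanged.

```python
# Translation table mapping each nucleotide to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def augment_with_reverse_complement(sequences, labels):
    """Append the reverse complement of every sequence, keeping its label."""
    aug_sequences = list(sequences) + [reverse_complement(s) for s in sequences]
    aug_labels = list(labels) + list(labels)
    return aug_sequences, aug_labels


seqs, labs = augment_with_reverse_complement(["ATGCC", "TTAGC"], ["virulent", "temperate"])
```

Applying the augmentation only after the split, as described above, keeps a genome and its reverse complement in the same subset, so the validation and test sets do not contain strands of genomes seen in training.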

Efficient DNA word embedding
Bacteriophage genomic sequences in FASTA format are represented as relatively long strings, which can be problematic when applying classification models to DNA sequence analysis, especially since most ML algorithms prefer lower-dimensional continuous vectors as input. Therefore, we tested and compared three methods of k-mer extraction (sliding window, non-overlapping and variable-length) for lengths 3 <= k <= 12 and their impact on our experiment.
Additionally, all k-mers containing characters outside the nucleotide alphabet {A, C, G, T} were removed before vectorization. This includes characters used to signify uncertainties within the sequenced genome.
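Two of the three extraction methods and the alphabet filter can be sketched in a few lines. The function names are hypothetical and the sequence is assumed to be upper case; the variable-length method is not shown because the text does not specify how the lengths were chosen.

```python
def sliding_kmers(seq, k=6):
    """Overlapping k-mers: the window moves one nucleotide at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def non_overlapping_kmers(seq, k=6):
    """Consecutive k-mers: the window moves k nucleotides at a time."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def keep_unambiguous(kmers):
    """Drop k-mers containing characters outside {A, C, G, T}."""
    allowed = set("ACGT")
    return [kmer for kmer in kmers if set(kmer) <= allowed]


# Toy sequence with one ambiguous base (N); k = 3 keeps the example short.
toy = "ATGCGNACGT"
sliding = keep_unambiguous(sliding_kmers(toy, k=3))
blocks = keep_unambiguous(non_overlapping_kmers(toy, k=3))
```

With the sliding window, a single ambiguous base removes up to k neighbouring k-mers, whereas with non-overlapping extraction it removes only the one block that contains it.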
One of the key ideas in NLP is how to efficiently convert sequences of characters or words into numeric vectors, which can then be fed into various ML models. To obtain feature vectors of fixed size representing the genomes, we adopted word embedding based on Word2Vec with the Skip-gram model, which leverages a shallow neural network with a projection layer. The Skip-gram model is an efficient method trained to predict the probability that a word is a context word for a given target word. Here, the "context" is the set of adjacent subsequences surrounding the target k-mer.
Using fixed-length vectors to represent the sequences, the similarity between bacteriophages can be measured even though each sequence can be of a different length (bp).
Finally, each bacteriophage genome was represented by the average of the embedding vectors of the k-mers (words) that compose its sequence, which means that each genome was described by averaged numeric values in the vector space. The idea of averaged word embeddings was adopted from earlier work in which they were used to represent document paragraphs [36].
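The averaging step can be sketched as below. The embedding table here is a hand-written toy dictionary used only for illustration; in the study the vectors come from the trained Skip-gram model. The function name and the choice to skip k-mers without a vector are assumptions.

```python
def average_embedding(kmers, embedding, dim):
    """Average the embedding vectors of the k-mers that make up one genome.

    k-mers missing from the embedding table are skipped (an assumption);
    a genome with no known k-mers maps to the zero vector.
    """
    total = [0.0] * dim
    count = 0
    for kmer in kmers:
        vector = embedding.get(kmer)
        if vector is None:
            continue
        for j, value in enumerate(vector):
            total[j] += value
        count += 1
    if count == 0:
        return total
    return [value / count for value in total]


# Toy 2-dimensional embedding table (hypothetical values).
toy_embedding = {"ATG": [1.0, 0.0], "TGC": [0.0, 1.0], "GCA": [1.0, 1.0]}
genome_vector = average_embedding(["ATG", "TGC", "GCA", "CAN"], toy_embedding, dim=2)
```

Because the result always has `dim` components, genomes of very different lengths end up in the same vector space and can be compared or passed to a classifier directly.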

Efficient feature selection
Heterogeneous features extracted from the averaged k-mer embedding vectors might better reflect the pattern information that characterizes the bacteriophage life cycle. For this purpose, we applied recursive feature elimination with cross-validation (RFECV), an efficient feature selection method, to remove irrelevant attributes and improve the generalization ability of the model trained in the next step.

Supervised learning
For this study we trained and compared the results of 11 implementations of supervised ML algorithms (see Table 1). To tune the models' hyperparameters we discarded techniques such as Grid Search and Randomized Search, which explore the space of available parameter combinations in an isolated way, without using past results to guide later trials. Instead, we applied Bayesian Optimisation [37], which reduces the time needed to obtain an optimized set of model parameters. We measured accuracy, precision, recall, F1-score and the Area Under the Receiver Operating Characteristic curve (AUC).
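The label-based metrics listed above can be computed as in the following sketch, which also includes the support-weighted F1 used as the tuning score. The function name and the toy labels are hypothetical; AUC is omitted because it requires continuous decision scores rather than predicted labels.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Return accuracy, per-class (precision, recall, F1) and weighted F1."""
    n = len(y_true)
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n

    per_class = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class[label] = (precision, recall, f1)

    # Weighted F1: per-class F1 averaged with class frequencies as weights.
    weighted_f1 = sum(per_class[label][2] * support[label] for label in per_class) / n
    return accuracy, per_class, weighted_f1


y_true = ["virulent", "virulent", "virulent", "temperate", "temperate"]
y_pred = ["virulent", "virulent", "temperate", "temperate", "temperate"]
accuracy, per_class, weighted_f1 = evaluate(y_true, y_pred)
```

Weighting by class frequency matters here because the training data are imbalanced (278 virulent versus 174 temperate phages), so an unweighted average would overstate the influence of the smaller class.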
To increase the performance of gradient-based classifiers we trained them on multiple NVIDIA GPUs.

Consent for publication
Not applicable.

Availability of Data and Materials
The datasets used and analysed during the current study are available in the NCBI repository: https://pmlegacy.ncbi.nlm.nih.gov/sites/myncbi/1Xu-9lbsnfN1Wi/collections/59768502/public/

Competing interests
The authors report that the PhageAI platform was developed by the authors and that Proteon Pharmaceuticals is the owner of the platform.