The First Implication of Image Processing Techniques on Influenza A Virus Sub-Typing Based on HA/NA Protein Sequences, using Convolutional Deep Neural Network

The expansion of the influenza A virus host range throughout its evolution has raised major concerns worldwide. Although the mechanism by which the virus broadens its host range is largely unknown, persistent genetic mutations have been implicated as a key factor in reorganizing the host response and the host range. To uncover the underlying core bases of the two important antigenic proteins of influenza virus (HA and NA), functional data mining and image processing analyses of over 8000 protein sequences of different HA and NA subtypes were performed. Each amino acid position in the HA or NA proteins served as a feature (variable), and two polynomial datasets were created and subjected to conventional prediction models. The average accuracies of these predictive models for HA subtype classification ranged from 38.9% for SVM to 87.2% for Decision Tree models. For NA subtype classification, the accuracies of conventional prediction models ranged from 81.3% for SVM to 99.87% for KNN. The amino acid sequence datasets were then converted to binary images; the subtype served as the target variable, and the target label was determined by an image processing convolutional neural network (CNN). The CNN models reached 100% accuracy on the binary images of both the HA and NA datasets. Applying the Gabor2 filter decreased the time the model needed to reach its best performance for HA subtype classification, whereas it increased the training time (in epochs) for NA subtype classification. For the first time, converting the HA and NA amino acid sequences of influenza virus into binary image datasets and classifying them with a convolutional neural network raised prediction accuracy and performance to the highest possible level. The findings of this paper open new avenues for virus classification based on antigenic HA and NA amino acid profiles, and for readily classifying and predicting possible future emerging strains of pandemic influenza.


Introduction:
The ability of influenza virus to expand its host range is a major concern worldwide, because it can lead to human infections with high mortality and to widespread fear of a pandemic with even higher morbidity and mortality rates [1]. The most recent influenza outbreak, caused by a novel avian-origin influenza A virus (H7N9), caused up to 400 000 deaths globally in the first year and has raised concerns over its pandemic potential in the near future [2]. The emergence of a new influenza strain with a broad host range, against which hosts lack prior immunity, and with the ability to transmit from human to human could result in a true pandemic with millions of fatalities. The high frequency of genetic reassortment and antigenic drift, the availability of host environments, and the co-circulation of different influenza virus subtypes provide the virus with a suitable setting for generating new, highly infectious strains [3].
Based on the two surface glycoproteins, haemagglutinin (HA) and neuraminidase (NA), 16 HA subtypes and 9 NA subtypes of influenza virus have been identified. Fewer than thirty percent of HA and NA amino acids are conserved among all virus subtypes, and the HA and NA segments are extremely variable in their genetic sequences [4]. The surface HA protein plays the key role in the specificity of influenza virus infection, whereas NA cleaves the linkage between the terminal sialic acid and the adjacent galactose during viral release from host cells. Influenza A virus expands its host range through alterations of its surface proteins. It has been shown that a few amino acid substitutions enabled the virus to be transmitted between ferrets via respiratory droplets, and that a single amino acid substitution converted a nonlethal influenza strain into one lethal to humans; these findings show the importance of amino acid profiling of influenza surface proteins for monitoring host specificity [5].
The development of algorithms that allow computers to extract patterns among data variables is a subfield of artificial intelligence. Machine learning proceeds by inference, fitting the best model or learning algorithm from examples. These approaches have been widely used in many applications (e.g. pattern recognition, stock market prediction, text and language processing, and the development of search engines) [6-10]. Multiple sequence alignment, protein structure prediction, gene expression analysis, gene ontology prediction, and molecular classification are other areas in which prediction models are applied. Machine learning techniques have achieved great success in biological classification and evolutionary pattern recognition, including influenza virus host and subtype identification [11,12].
Deep learning is a branch of machine learning: a set of algorithms that attempt to model high-level abstract concepts by learning at different levels and layers [13,14]. Depth, that is, the number of node layers through which data pass to recognize a pattern, is the feature that most distinguishes deep learning from conventional neural networks. The most important characteristic of deep neural networks is that the nodes of each layer are trained on a distinct set of features based on the output of the previous layer [15,16].
Historically, image processing and face recognition tasks have been performed successfully by deep neural network algorithms [17]. These models can be trained to detect objects in pictures better than humans do [18]. The most commonly used deep learning architecture for image analysis is the convolutional neural network (CNN). Pattern matching (convolution) and aggregation (pooling) operations are the core of a CNN. The image is scanned with a given pattern, and a match is calculated for every position at the pixel level. The presence of the pattern in a region is determined by pooling (max-pooling), which aggregates the region information into a single number [19,20]. Most network architectures used in image processing and image analysis are built from convolution and pooling operations [21].
In this research, for the first time, the protein sequences of two important influenza A virus surface proteins (HA and NA) were converted into binary images, and their subtypes were predicted by developing, training and validating image processing convolutional neural network algorithms; finally, the performances of these algorithms were compared with those of conventional predictive models.

Materials and Methods
The following steps were undertaken, as shown in Figure 1: (i) constructing polynomial datasets of HA and NA amino acid sequences; (ii) constructing binary image datasets of HA and NA amino acid sequences; (iii) training and testing conventional predictive models on the polynomial datasets.

HA Binary Image Dataset (HABID):
Because there are 20 standard amino acids, each single-letter amino acid code was assigned an integer from 1 to 20 (see Table 1). The assignment was based on the function SeqInt = aa2int(SeqChar), which converts a character sequence of single-letter amino acid codes into integers according to the values in Table 1.
The numeric HA sequences were then converted to binary images, in which each residue is represented by a single 1 and nineteen 0s. For example, the number 2 was assigned to the amino acid arginine (R); its binary representation is therefore 01000000000000000000. The final image was a 20 × 576 binary matrix.
NA Binary Image Dataset (NABID):
Again, each amino acid in the NA protein sequences was assigned an integer between 1 and 20 (as explained above). Each integer was converted to a binary column: a 1 was placed at the position corresponding to the integer, and 0s were placed at the remaining nineteen positions. The final image was a 20 × 475 binary matrix.
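The encoding described above can be illustrated with a minimal Python sketch. Several points are assumptions rather than details taken from this paper: Table 1 is not reproduced here, so the residue order is assumed to follow the standard aa2int numbering (A = 1, R = 2, ..., V = 20), which agrees with the arginine example in the text; sequences shorter than the image width are assumed to be zero-padded; and the function name and the test fragment are hypothetical.

```python
# Residue order assumed to follow the standard aa2int numbering
# (A=1, R=2, ..., V=20); Table 1 of the paper is not reproduced here.
AA_ORDER = "ARNDCQEGHILKMFPSTWYV"
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AA_ORDER)}


def sequence_to_binary_image(seq, width):
    """Return a 20 x width matrix (list of rows) of 0/1 values.

    Column j contains a single 1 in the row that matches residue j.
    Columns beyond the end of the sequence, and residues outside the
    20 standard codes, are left all-zero (an assumption of this sketch).
    """
    image = [[0] * width for _ in range(20)]
    for col, aa in enumerate(seq.upper()[:width]):
        code = AA_TO_INT.get(aa)
        if code is not None:
            image[code - 1][col] = 1
    return image


# Hypothetical fragment, used only to show the encoding of one residue.
hypothetical_fragment = "MKTRIA"
img = sequence_to_binary_image(hypothetical_fragment, width=576)

# The fourth residue is arginine (R), so its column reads 0100...0.
column_for_R = "".join(str(row[3]) for row in img)
print(column_for_R)  # 01000000000000000000
```

With width=576 the result has the 20 × 576 shape reported for HABID; width=475 would give the NABID shape.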

Conventional Machine Learning Predictive Models
The polynomial datasets (HAPD and NAPD) were imported into MATLAB (R2018b, 9.4 [...] To make the results comparable, no cross-validation approach was selected; the data were divided into 90% and 10% parts, and each model was trained on nine parts and validated (tested) on the tenth part.
The performance (accuracy) of each model in predicting the correct class of HA or NA subtypes was computed.
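The hold-out evaluation described above can be sketched as follows. This is an illustration in Python rather than the MATLAB workflow used in the study; the function names, the random seed and the toy labels are hypothetical, and the shuffling step is an assumption because the text does not state how the 90% and 10% parts were drawn.

```python
import random


def holdout_split(samples, labels, test_fraction=0.1, seed=0):
    """Shuffle once, then hold out test_fraction of the data for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx, data):
        return [data[i] for i in idx]

    return (pick(train_idx, samples), pick(train_idx, labels),
            pick(test_idx, samples), pick(test_idx, labels))


def accuracy(predicted, actual):
    """Fraction of test items whose predicted subtype equals the true one."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy data: 100 placeholder samples with made-up subtype labels.
toy_samples = [f"seq{i}" for i in range(100)]
toy_labels = ["H1" if i % 2 == 0 else "H3" for i in range(100)]
x_train, y_train, x_test, y_test = holdout_split(toy_samples, toy_labels)

# A trained model would supply the predictions; two are hard-coded here.
example_accuracy = accuracy(["H1", "H3"], ["H1", "H5"])
```

Any classifier (SVM, KNN, Decision Tree) would be fitted on x_train and y_train, and its predictions on x_test would be passed to accuracy together with y_test.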

Image Processing Convolution Neural Network (CNN)
A CNN is made of three major types of layers: convolution, pooling, and fully connected layers; each layer performs a specific task. Training a CNN has two stages: feed-forward and backpropagation. In the first stage, the image is fed into the network; this step is simply a point-wise multiplication between the input image matrix and the filter matrix in each convolution layer. The layer searches for high-level features extracted from the raw data, looking for meaningful objects; no decision is made at this stage. Flattening these features at the end of the network and connecting them to two fully connected layers is usually a computationally cheap way of learning the nonlinear combinations of these features. [...] subtypes, respectively. Softmax and classification layers were used to produce the probability distribution and to present the network output class. Figure 1 shows the architecture of the convolutional network and of its layers.
To enhance the classification accuracy for each class, a Gabor2 filter with a 90-degree orientation was applied to the horizontal binary images at the preprocessing stage.
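The feed-forward stage described above can be illustrated with a small pure-Python sketch of one convolution, one max-pooling step, a fully connected layer and a softmax output. It is not the network used in this study: the toy image, the filter, the weights and the single fully connected layer (instead of two) are all illustrative assumptions, and no training (backpropagation) is shown.

```python
import math


def conv2d_valid(image, kernel):
    """Slide the kernel over the image; each output is a sum of point-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(fmap):
    """Set negative activations to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Keep the largest value of each non-overlapping size x size region."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


def dense(vector, weights, biases):
    """Fully connected layer: one weighted sum per output class."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Turn class scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 4 x 4 binary image with a diagonal pattern and a diagonal detector.
toy_image = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
toy_kernel = [[1, 0], [0, 1]]

feature_map = relu(conv2d_valid(toy_image, toy_kernel))
pooled = max_pool(feature_map, size=2)
flattened = [v for row in pooled for v in row]

# Made-up weights for two hypothetical subtype classes.
probabilities = softmax(dense(flattened, weights=[[1.0], [-1.0]], biases=[0.0, 0.0]))
```

The class with the largest entry in probabilities would be reported as the predicted subtype.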
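A Gabor preprocessing step of this kind can be sketched as below. The sketch uses the standard real-valued Gabor kernel (a Gaussian envelope multiplied by a cosine carrier) oriented at 90 degrees; whether this matches the "Gabor2" filter used in the study is an assumption, and the kernel size, wavelength and sigma are illustrative values that are not reported in the text. The function names are hypothetical.

```python
import math


def gabor_kernel(size, wavelength, theta_deg, sigma, gamma=1.0, psi=0.0):
    """Real-valued Gabor kernel as a size x size list of rows (size must be odd)."""
    theta = math.radians(theta_deg)
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the cosine carrier runs along direction theta.
            x_rot = x * math.cos(theta) + y * math.sin(theta)
            y_rot = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(x_rot ** 2 + (gamma * y_rot) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * x_rot / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def filter_same(image, kernel):
    """Correlate image with kernel, zero-padding so the output keeps the input size."""
    half = len(kernel) // 2
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for i, kernel_row in enumerate(kernel):
                for j, weight in enumerate(kernel_row):
                    rr, cc = r + i - half, c + j - half
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += image[rr][cc] * weight
            out[r][c] = total
    return out


# Illustrative parameter values only; the paper does not report them.
kernel = gabor_kernel(size=5, wavelength=4.0, theta_deg=90, sigma=2.0)

# Toy 20-row binary image with a single "on" pixel (row 2, column 4).
toy_binary_image = [[0] * 6 for _ in range(20)]
toy_binary_image[1][3] = 1
filtered = filter_same(toy_binary_image, kernel)
```

The filtered matrix has the same shape as the input, so it could replace the raw binary image as the CNN input.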

Prediction Model
Conventional machine-learning-based predictive models were trained and validated on polynomial [...] Detection of lung nodules and classification of radiograms and bone fractures have attracted considerable research attention [32,33]. This is the first successful attempt at modeling and predicting influenza A virus subtypes based on image processing algorithms. Highly accurate models were developed, trained and validated by analyzing images made of the amino acid sequences of the antigenic proteins of the influenza virus. The models were able to convert the raw protein sequences into images within a few seconds and then apply image processing techniques to identify the prominent key differences between the images of the virus subtypes.
The performances (accuracies) of the conventional predictive models (as they are named in this paper) were compared with those of the newly developed image processing CNN algorithms of protein [...] Although the number of amino acid sequences in the HA or NA subtype classes of the prepared datasets was not equal (some HA or NA classes were twice as large as others), the model performed well in predicting the correct virus subtype class. Therefore, the developed algorithm can be used to predict influenza A virus subtypes with any number of protein sequences, which makes the model suitable for asymmetric (imbalanced) datasets.
This research approach will serve as a fundamental base for future studies on clustering influenza

Conclusion
Computer-based predictive models have opened new vistas in medical analyses and diagnostic tests, and their applications in these fields are growing rapidly. In this research, we developed and applied image recognition convolutional deep learning neural network algorithms to distinguish between five different HA and four NA subtypes of influenza A virus.
For the first time, we converted protein sequences into images and, by optimizing the image processing filters, were able to classify the virus subtypes with 100% accuracy.
Comparing the results of the developed method with those of the conventional predictive models showed that this approach was more efficient, more accurate and less time-consuming.
As this method can be used to quickly convert whole genome and proteome sequences of healthy and unhealthy people into images and compare them, it opens new analytical approaches.
The method can be used easily to compare any genome, transcriptome or proteome of organisms in different situations, [...]

Figure 1. Overview of the CNN network topology (A) and the steps taken to convert protein sequences into a binary image and apply the Gabor filter to find the right subtype classes (B); the arrangement of the layers in the CNN is presented in part (C).