Abstract
The expansion of the influenza A virus host range throughout its evolution has raised major concerns worldwide. Although the mechanism behind this expansion is largely unknown, persistent genetic mutations have been implicated as a key factor in reorganizing the host response and the host range. To uncover the core features underlying the two important antigenic proteins of influenza virus (HA and NA), functional data mining and image processing analyses were performed on over 8000 protein sequences of different HA and NA subtypes. Each amino acid in the HA or NA protein sequences was treated as a feature (variable), and two polynomial datasets were created and submitted to conventional prediction models. The average accuracies of these models for HA subtype classification ranged from 38.9% for SVM to 87.2% for Decision Tree models. For NA subtype classification, the accuracies of the conventional models ranged from 81.3% for SVM to 99.87% for KNN. The amino acid sequence datasets were then converted to binary images, the subtype was set as the target variable, and the target label was predicted by a convolutional neural network. The performance of the convolutional neural network models on the binary images reached 100% for both the HA and NA datasets. Applying the Gabor2 filter decreased the time the model needed to reach its best performance for HA subtype classification, whereas it increased the number of epochs needed for NA subtype classification.
For the first time, converting the HA and NA amino acid sequences of influenza virus into binary image datasets and classifying them with a convolutional neural network raised prediction accuracy to 100%. These findings open new avenues for virus classification based on antigenic HA and NA amino acid profiles, and for readily classifying and predicting possible future emerging strains of pandemic influenza.