## Abstract

Influenza viruses remain a formidable threat to global public health due to their high mutability and infectivity. Accurate prediction of influenza virus subtypes is crucial for clinical treatment and disease prevention. In recent years, machine learning methods have played an important role in the study of influenza viruses. This study proposes a new alignment-free method based on the correlation of k-grams, called the Subsequence Correlation Coefficient Feature Vector (SCCFV), to subtype the hemagglutinin (HA) and neuraminidase (NA) of influenza viruses. In this method, each influenza virus sequence is converted into four time series, and the correlation coefficients of these time series are used to extract sequence features. Supervised learning methods are then used for subtype classification. We compare the effectiveness of random forest, decision tree and support vector machine classifiers. Experimental results show that the random forest method achieves the best performance, with an accuracy of 0.99979, a precision of 0.99996 and a recall of 0.99997. All prediction indicators of our method are significantly higher than those of traditional methods.

## 1. Introduction

Influenza is one of the key respiratory infectious diseases affecting humans [1]. Influenza viruses are acute and highly infectious, and they belong to the Orthomyxoviridae family [2]. The virus was first identified in 1932 and has been a major focus of virus research in recent years [3]. According to differences in the nucleocapsid protein (NP) and matrix protein (MP), influenza viruses can be divided into three types: type A, type B and type C [4]. Influenza C virus contains 7 RNA segments [5, 6], whereas influenza A and B viruses [7] both have 8 RNA segments. Influenza A virus has attracted widespread attention due to its capacity for cross-species transmission and its high variability. It also has a broad host range that includes humans, wild birds, domestic poultry and a variety of mammals [8, 9]. The virus poses a serious threat to human and animal health mainly because of the antigenic drift and antigenic shift of its surface glycoproteins, which allow it to mutate readily across different hosts [10, 11].

Hemagglutinin (HA) and neuraminidase (NA) are glycoproteins on the surface of the influenza virus, and the subtype of an influenza virus is determined by the types of these two proteins. For influenza A viruses, 16 HA subtypes (H1-H16) and 9 NA subtypes (N1-N9) have been identified based on antigenic differences [12]. To determine the subtype of influenza virus HA and NA fragments, the methods most commonly used by the WHO are the hemagglutination inhibition (HI) test and the neuraminidase inhibition (NI) test. Although these subtype identification methods are relatively economical compared with molecular detection, they still require many experiments when dealing with an unknown virus subtype, which consumes considerable time and materials [13]. Machine learning methods offer a feasible and resource-efficient alternative for subtype classification, with potentially higher efficiency and significantly lower cost than traditional methods.

Over the past decades of influenza virus research, machine learning methods have been used to infer certain pathogenic markers in the hemagglutinin (HA) gene [14]. To classify the pathogenicity of H5Nx avian influenza strains, Akshay Chadha et al. [15] compared the performance of different machine learning classifiers, including logistic regression (LR) with lasso and ridge regularization, random forest (RF), K-nearest neighbor (KNN), naive Bayes (NB), support vector machine (SVM) and convolutional neural networks (CNN). Edyta Swiezton et al. [16] used Bayesian methods for phylogenetic and molecular analysis of H5N8 and H5N5 viruses. Fahad Humayun et al. [17] proposed a statistical method for predicting avian influenza virus subtypes through feature extraction and machine learning. However, few methods exist for classifying the subtypes of avian influenza A virus, and the prediction performance of existing methods is limited by the way the sequences are encoded.

To improve the accuracy of influenza virus subtype prediction, we adopted a novel feature extraction method inspired by the correlation coefficients of time series derived from DNA sequences. Based on the extracted features, we then performed influenza virus subtype classification with machine learning methods. After comparing three different methods (random forest, decision tree and SVM), we found that random forest showed the highest accuracy in predicting influenza virus subtypes.

## 2. Materials and methods

### 2.1. Hemagglutinin and Neuraminidase Subtype sequences

All hemagglutinin (HA) subtype sequences from 2002 to 2022 stored in NCBI were downloaded using the Influenza Virus Tool (http://www.ncbi.nlm.nih.gov/genomes/FLU). We retained nucleotide sequences with a length of at least 1600 bp. Virus subtypes containing only a few sequences (fewer than 10) were also removed. The resulting dataset included 95,585 influenza virus sequences with subtypes H1 to H14 and H16. The distribution of sequence lengths is listed in Table 6. Of these sequences, 70% were used for model training and the remaining 30% for model testing.

All neuraminidase (NA) subtype sequences from 2002 to 2022 in NCBI were also downloaded. DNA sequences of at least 1300 bp were kept, and virus subtypes containing only a few sequences were again removed. The resulting dataset included the N1 to N9 subtypes of avian influenza virus, with a total of 88,638 sequences. The distribution of sequence lengths is listed in Table 7. Of these sequences, 70% were used for model training and the remaining 30% for model testing.
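The filtering and splitting steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record format (subtype, sequence), the function names `filter_records` and `split_train_test`, the order in which the two filters are applied, and the use of a seeded random shuffle for the 70/30 split are not specified in the text.

```python
import random
from collections import Counter

def filter_records(records, min_length, min_per_subtype=10):
    """Keep sequences of at least min_length bp, then drop rare subtypes.

    records is a list of (subtype, sequence) pairs (hypothetical format).
    """
    long_enough = [(s, seq) for s, seq in records if len(seq) >= min_length]
    counts = Counter(s for s, _ in long_enough)
    return [(s, seq) for s, seq in long_enough if counts[s] >= min_per_subtype]

def split_train_test(records, train_fraction=0.7, seed=0):
    """Shuffle with a fixed seed and split into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Toy records for illustration only (not real influenza data).
toy = ([("H1", "A" * 1600)] * 12      # kept: long enough, 12 sequences
       + [("H1", "A" * 100)] * 5      # dropped: shorter than 1600 bp
       + [("H15", "A" * 1700)] * 3)   # dropped: fewer than 10 sequences
kept = filter_records(toy, min_length=1600)
train, test = split_train_test(kept)
print(len(kept), len(train), len(test))
```

For the NA data the same sketch would be called with `min_length=1300`; the rare-subtype threshold for NA is left as a parameter because it is not stated explicitly.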

### 2.2. Subsequence Correlation Coefficient Feature Vector (SCCFV)

To examine the distribution of nucleotides within a DNA sequence, we first transform the sequence into a feature vector in a multi-dimensional space. For a virus DNA sequence *X* = (*x*_{1}, *x*_{2}, …, *x*_{n}), where each *x*_{i} belongs to the set *Φ* = {*G, T, A, C*}, *i* = 1, 2, …, *n*, and *n* denotes the length of the sequence, we transform the sequence into four time series, one per nucleotide, as follows: let *V*_{t} = *v*_{t}(1)*v*_{t}(2) … *v*_{t}(*n*), *t* ∈ {*G, T, A, C*}.

We define the *M*-step delay *V*_{t+M} of *V*_{t} as *V*_{t+M} = *v*_{t}(*M* + 1)*v*_{t}(*M* + 2) … *v*_{t}(*M* + *n*), where *v*_{t}(*n* + 1) = … = *v*_{t}(*M* + *n*) = 0; in particular, *V*_{t} = *V*_{t+0}. *M* is a preset integer (*M* ≪ *n*) that can be tuned for each dataset.
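The conversion to time series and the zero-padded delay can be illustrated with a minimal Python sketch. It assumes the usual indicator encoding, in which *v*_{t}(*i*) is 1 when *x*_{i} = *t* and 0 otherwise; the function names `indicator_series` and `delay` are illustrative and not taken from the original implementation.

```python
BASES = "GTAC"

def indicator_series(seq):
    """Return one 0/1 series per nucleotide: series[t][i] is 1 when seq[i] == t.

    The indicator encoding is an assumption made for this sketch.
    """
    return {t: [1 if x == t else 0 for x in seq] for t in BASES}

def delay(series, m):
    """M-step delay: drop the first m entries and pad zeros so the length stays n."""
    return series[m:] + [0] * min(m, len(series))

series = indicator_series("AGCTGTACCTG")
print(series["G"])            # [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1]
print(delay(series["G"], 2))  # [0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
```

With `m = 0` the delay returns the series unchanged, matching *V*_{t} = *V*_{t+0}.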

Thus the average value, the auto-correlation of each time series and the cross-correlation between time series can be computed for the four time series. First, we define the average occurrence number of *t* ∈ {*G, T, A, C*} as:

We illustrate the above formula with the sequence “AGCTGTACCTG”:
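Assuming that the average occurrence number of a nucleotide is the mean of its 0/1 time series, that is, its count divided by the sequence length *n*, the values for this example sequence can be checked with a short Python sketch. The function name `average_occurrence` is illustrative.

```python
from fractions import Fraction

def average_occurrence(seq):
    """Mean of each nucleotide's 0/1 series: count(t) / len(seq) (assumed definition)."""
    n = len(seq)
    return {t: Fraction(seq.count(t), n) for t in "GTAC"}

averages = average_occurrence("AGCTGTACCTG")
for t, value in averages.items():
    print(t, value)
# G 3/11
# T 3/11
# A 2/11
# C 3/11
```

The four averages sum to 1 because every position contributes to exactly one of the four series.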
Please refer to the detailed computation in Figure 6.

To assess the correlation of nucleotides within a sequence and between sequences, we computed the autocorrelation coefficient and the cross-correlation coefficient. The *M*-lag autocorrelation *ρ*_{cc}(*M*) in a sequence is defined as:

The *M*-step delay correlation *ρ*_{xc}(*M*) of nucleotides *x* and *c* in a sequence can be defined using the cross-correlation between the two time series *v*_{x} and *v*_{c}:

The standardized cross-correlation, also called the cross-correlation coefficient *τ*_{xc}(*M*), between nucleotides *x* and *c* is defined as:

Specifically, the autocorrelation coefficient *τ*_{cc}(*M*) of nucleotide *c* itself can be defined as:

Subsequently, we obtain a feature vector of nucleotides, named the correlation coefficient feature vector [18], with dimension 16 × *M*:
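One way to compute such a vector is sketched below in Python. Several details are assumptions made for the sake of a runnable example: the 0/1 indicator encoding of each nucleotide, a mean-centered cross-correlation normalized by *n* with the delayed series padded with zeros, standardization by the lag-0 autocorrelations, and the ordering of the 16 × *M* entries (lags 1 to *M*, all 16 ordered nucleotide pairs per lag). The function names are illustrative.

```python
from math import sqrt

BASES = "GTAC"

def indicator(seq, t):
    """0/1 time series for nucleotide t (assumed encoding)."""
    return [1.0 if x == t else 0.0 for x in seq]

def cross_correlation(vx, vc, lag):
    """Assumed form: (1/n) * sum_i (vx[i] - mean_x) * (vc[i + lag] - mean_c),
    where the delayed series is padded with zeros past the end."""
    n = len(vx)
    if n == 0:
        return 0.0
    mean_x, mean_c = sum(vx) / n, sum(vc) / n
    delayed = vc[lag:] + [0.0] * min(lag, n)
    return sum((a - mean_x) * (b - mean_c) for a, b in zip(vx, delayed)) / n

def correlation_coefficient(vx, vc, lag):
    """Cross-correlation standardized by the lag-0 autocorrelations (assumed)."""
    scale = sqrt(cross_correlation(vx, vx, 0) * cross_correlation(vc, vc, 0))
    return cross_correlation(vx, vc, lag) / scale if scale > 0 else 0.0

def ccfv(seq, max_lag):
    """All 16 nucleotide pairs for lags 1..max_lag, giving 16 * max_lag values."""
    series = {t: indicator(seq, t) for t in BASES}
    return [correlation_coefficient(series[x], series[c], lag)
            for lag in range(1, max_lag + 1)
            for x in BASES
            for c in BASES]

vector = ccfv("AGCTGTACCTG", max_lag=2)
print(len(vector))  # 32
```

When a nucleotide is absent from the sequence, or makes up the whole sequence, its series has zero variance; the sketch returns 0.0 for the affected coefficients rather than dividing by zero.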
Drawing on the method introduced by He et al. [19], we divide each sequence into segments of nearly equal length. Given the virus DNA sequence *X* = (*x*_{1}, *x*_{2}, …, *x*_{n}), the sequence is split into *k* non-overlapping subsequences, where the first *r* segments (*Substr*_{1}, *Substr*_{2}, …, *Substr*_{r}) each contain *z* + 1 nucleotides and the remaining *k − r* segments (*Substr*_{r+1}, *Substr*_{r+2}, …, *Substr*_{k}) each contain *z* nucleotides. Here *z* is the quotient and *r* is the remainder when *n* is divided by *k*.

For each segment, a correlation coefficient feature vector is computed. These vectors are then concatenated into a new vector, the Subsequence Correlation Coefficient Feature Vector (SCCFV), of length 16 × *M* × *k*:
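The segmentation and concatenation steps can be sketched as follows. The function names are illustrative, and `base_frequencies` is only a hypothetical stand-in for the per-segment feature function; in the method described here that function would return the 16 × *M* correlation coefficient feature vector of the segment, so the concatenated result would have length 16 × *M* × *k*.

```python
def split_segments(seq, k):
    """Split seq into k consecutive segments: the first r have z + 1 symbols,
    the remaining k - r have z, where z, r = divmod(len(seq), k)."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    z, r = divmod(len(seq), k)
    segments, start = [], 0
    for j in range(k):
        length = z + 1 if j < r else z
        segments.append(seq[start:start + length])
        start += length
    return segments

def concatenate_segment_features(seq, k, segment_features):
    """Apply segment_features to each of the k segments and join the results."""
    vector = []
    for segment in split_segments(seq, k):
        vector.extend(segment_features(segment))
    return vector

def base_frequencies(segment):
    """Placeholder feature function (4 values per segment), for illustration only."""
    n = len(segment) or 1
    return [segment.count(t) / n for t in "GTAC"]

segments = split_segments("AGCTGTACCTG", 3)
print(segments)  # ['AGCT', 'GTAC', 'CTG']
features = concatenate_segment_features("AGCTGTACCTG", 3, base_frequencies)
print(len(features))  # 12
```

Here *n* = 11 and *k* = 3 give *z* = 3 and *r* = 2, so the first two segments have four nucleotides and the last has three.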

### 2.3. Model Definition and Evaluation

We used three machine learning methods (decision tree, random forest and SVM) to train our models on the above feature vectors. A decision tree is a machine learning model based on a tree structure. It is built by recursively partitioning the dataset: each internal node represents a test on a feature, each branch represents a possible outcome of that test, and each leaf node represents a final classification result. A decision tree therefore classifies a sample according to a set of rules learned from the features of the input data [20]. Random forest uses multiple decision trees to perform classification or regression and combines their results through voting or averaging [21]. The goal of SVM is to find a hyperplane that separates the classes in the feature space while maximizing the margin between the hyperplane and the sample points closest to it (the support vectors) [22]. Fig 2 illustrates the pipeline of our method. We first used the SCCFV method to convert sequence information into feature vectors, and then trained three classifiers: random forest, decision tree and support vector machine. Once the models were trained, they were used to predict influenza virus subtypes.
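A minimal sketch of the training and comparison step is given below. It assumes scikit-learn, which the text does not name, uses default hyperparameters because none are reported, and replaces the real SCCFV vectors with synthetic data from `make_classification` purely so that the example runs on its own. The stratified split is likewise an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for SCCFV feature vectors (160 features, 3 classes).
X, y = make_classification(n_samples=300, n_features=160, n_informative=20,
                           n_classes=3, random_state=0)

# 70% training / 30% test; stratification is an assumption of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

classifiers = {
    "random forest": RandomForestClassifier(random_state=0),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                      # train on the 70% split
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

The accuracies printed for the synthetic data say nothing about performance on influenza sequences; the sketch only shows how the three classifiers can be trained and compared on a common split.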

To evaluate the effectiveness of the three machine learning methods in classifying influenza A viruses, this study used four evaluation indicators: Accuracy, Precision, Recall and F1-score. Accuracy is the proportion of correctly predicted samples (true positives and true negatives) among all samples. Precision is the proportion of true positives among all samples predicted as positive. Recall is the proportion of true positives among all samples that are actually positive. F1-score is the harmonic mean of Precision and Recall and serves as a combined measure of the two. Together, these indicators give a comprehensive view of classifier performance: Accuracy reflects overall correctness, Precision reflects the reliability of positive predictions, Recall reflects the coverage of positive samples, and F1-score balances Precision and Recall. The four indicators are calculated as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP},$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$

where *TP*, *TN*, *FP* and *FN* denote the numbers of true positives, true negatives, false positives and false negatives, respectively.

The Python source code used in this paper is freely available upon request. All calculations were performed on a Dell Inspiron 15 3511 with an Intel Core i5-1135G7 processor and 16 GB of RAM, running Microsoft Windows 11 (Chinese version).
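The four evaluation indicators can be computed directly from the true and predicted labels. Because subtype classification is a multi-class problem, per-class precision, recall and F1-score have to be combined in some way; the sketch below uses macro-averaging, which is an assumption since the averaging scheme is not stated. The function name and the toy labels are illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 (one-vs-rest per class)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / len(labels),
        "recall": sum(recalls) / len(labels),
        "f1": sum(f1s) / len(labels),
    }

# Toy labels for illustration only: one H3 sample is misclassified as H5.
y_true = ["H1", "H1", "H3", "H3", "H5", "H5"]
y_pred = ["H1", "H1", "H3", "H5", "H5", "H5"]
metrics = classification_metrics(y_true, y_pred)
print({name: round(value, 4) for name, value in metrics.items()})
```

In this toy case five of six samples are correct, so the accuracy is 5/6, while the single error lowers the recall of H3 and the precision of H5.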

## 3. Results

In this work, we classified the HA and NA fragments of influenza A viruses into subtypes separately. We first extracted features from the virus DNA sequences with the SCCFV method and then applied three classifiers to obtain our results. After testing a range of parameter values, we selected *k* = 5 and *M* = 2, which gives a feature vector of length 16 × 2 × 5 = 160. The detailed test results are shown in Fig 7.

### 3.1. HA subtype classification

The classifiers were applied to classify the HA subtypes, and the results are listed in Table 8. Random forest (RF) achieves the highest performance, with an accuracy of 0.99979, a precision of 0.99996, a recall of 0.99997 and an F1 score of 0.99996. The decision tree and SVM methods obtain almost the same performance as each other. All three methods achieve almost 100% accuracy, precision, recall and F1 score, which suggests that the SCCFV algorithm captures the sequence information of HA subtypes comprehensively.

To demonstrate the accuracy and reliability of our SCCFV approach, we also constructed a phylogenetic tree of 16 HA subtypes from 160 HA DNA sequences. The resulting tree is shown in Fig 8, where viruses from the same subtype are labeled with the same color. Notably, viruses from the same subtype cluster together in this tree, indicating that our method can accurately reconstruct the phylogeny of HA subtype sequences.

### 3.2. NA subtype classification

For the classification of NA subtypes, the random forest method again yields the best result, with an accuracy of 0.99979, a precision of 0.99996, a recall of 0.99997 and an F1 score of 0.99996. The accuracy, precision, recall and F1 score of the decision tree and SVM are also close to 100%. The detailed results are listed in Table 9. From the HA and NA results, we conclude that all three methods classify the HA and NA subtypes with almost 100% accuracy and that the RF method performs best.

We also used 90 NA sequences from 9 subtypes to construct a phylogenetic tree; the resulting tree is shown in Fig 6.

By observing the phylogenetic tree, we note that taxa of the same subtype tend to cluster together, suggesting that our method may reconstruct the phylogeny of NA subtypes.

Previously, Fahad Humayun et al. [17] used three methods, k-gram, discrete wavelet transform (DWT) and multivariate mutual information (MMI), to extract sequence features, and then used four classifiers to predict the subtype of type A influenza virus. In their study, the decision tree gave the best prediction performance. For comparison, we applied our method with the RF classifier to the same dataset of type A avian influenza viruses. The classification results are detailed in Table 5. They indicate that our prediction accuracy is significantly higher than the best classification results reported by Fahad Humayun et al.

## 4. Discussion and conclusions

Since its initial identification, the type A influenza virus has undergone continuous evolution and widespread transmission, maintaining a significant impact on both human and animal health. Based on the different biological, physical and chemical properties of the type A influenza virus subtypes, more targeted prevention and control strategies and vaccines can be developed. In this work, we studied the subtype classification of type A influenza virus using 95,585 HA and 88,638 NA DNA sequences. A novel alignment-free method, SCCFV, was proposed to extract the information contained in the sequences. The approach converts each DNA sequence into four numerical time series and uses their averages and correlation coefficients to extract features. To capture local correlation, the DNA sequence is divided into *k* segments and the features of each segment are computed. A 16 × *M* × *k* dimensional numerical vector is thus formed to represent each type A influenza virus sequence.

Based on the feature vectors of the viruses, three machine learning methods (random forest, decision tree and SVM) were used for the subtype classification of HA and NA. All three methods achieve very high prediction performance, with almost 100% accuracy, precision, recall and F1 score. Among them, random forest performs best. To validate the SCCFV approach further, phylogenetic trees of the HA and NA subtypes were also constructed. In these trees, viruses from the same subtype cluster together, indicating that the method has the potential to reconstruct the evolutionary history of HA and NA subtypes.

In addition, our method was compared with the work of Fahad Humayun et al. on the subtype classification of type A influenza virus. The prediction results obtained with the random forest classifier on SCCFV features were all superior to those reported in that work. Our method is able to achieve higher classification accuracy in a shorter time.

Our method also has certain limitations. During preliminary sequence processing, we excluded subtypes containing fewer than 10 sequences, which may lead to inaccurate predictions when classifying less common subtypes of the type A influenza virus. However, subtypes with fewer than 10 sequences can still be clustered by phylogenetic trees. In the future, few-shot learning methods could be studied to handle the classification of these less frequently occurring subtypes.

## 5. Conflict of interest statement

The authors declare no competing financial interests.

## 6. Acknowledgements

This study is supported by the National Natural Science Foundation of China (Z23146) and the Beijing Municipal Education Commission (Z22053 and Z22055), which provided an excellent research environment while part of this research was done.