Accurate and fast clade assignment via deep learning and frequency chaos game representation

Abstract Background Since the beginning of the coronavirus disease 2019 pandemic, there has been an explosion of sequencing of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus, making it the most widely sequenced virus in the history. Several databases and tools have been created to keep track of genome sequences and variants of the virus; most notably, the GISAID platform hosts millions of complete genome sequences, and it is continuously expanding every day. A challenging task is the development of fast and accurate tools that are able to distinguish between the different SARS-CoV-2 variants and assign them to a clade. Results In this article, we leverage the frequency chaos game representation (FCGR) and convolutional neural networks (CNNs) to develop an original method that learns how to classify genome sequences that we implement into CouGaR-g, a tool for the clade assignment problem on SARS-CoV-2 sequences. On a testing subset of the GISAID, CouGaR-g achieved an \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$96.29\%$\end{document} overall accuracy, while a similar tool, Covidex, obtained a \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$77,12\%$\end{document} overall accuracy. As far as we know, our method is the first using deep learning and FCGR for intraspecies classification. Furthermore, by using some feature importance methods, CouGaR-g allows to identify k-mers that match SARS-CoV-2 marker variants. Conclusions By combining FCGR and CNNs, we develop a method that achieves a better accuracy than Covidex (which is based on random forest) for clade assignment of SARS-CoV-2 genome sequences, also thanks to our training on a much larger dataset, with comparable running times. Our method implemented in CouGaR-g is able to detect k-mers that capture relevant biological information that distinguishes the clades, known as marker variants. Availability The trained models can be tested online providing a FASTA file (with 1 or multiple sequences) at https://huggingface.co/spaces/BIASLab/sars-cov-2-classification-fcgr. CouGaR-g is also available at https://github.com/AlgoLab/CouGaR-g under the GPL.


Introduction
The global coordination in combating the COVID-19 pandemic has led to the sequencing of one of the largest amount of viral genomic data ever produced. All this data is stored in publicly available archives, such as the European Nucleotide Archive (ENA) and GISAID [15], currently having more than 9.6 million sequenced genomes, classified in variants, clades, and lineages.
The SARS-CoV-2 virus has evolved since its discovery, and the currently available phylogenies describing its evolutionary history [11] show more than 2000 different genomes, divided into lineages. Since the phylogeny is fairly stable and the main (existing) lineages, i.e. the lines of descent, have been identified, a natural and interesting problem is to quickly find, given a sequence, the clade to which it belongs, i.e. a group of descendants sharing a common ancestor [1]. Fast and efficient solutions to the clade assignment problem would help in tracking current and evolving strains and it is crucial for the surveillance of the pathogen. This classification problem has been attacked with machine learning approaches [3,4,5] using the Spike protein amino acid sequence to drive the classification step.
In this paper we propose a method for classifying SARS-CoV-2 genome sequences based on Chaos Game Representation (CGR) [14]: a deterministic bidimensional representation of a DNA sequence, also called CGR encoding, that can be easily obtained from the genome sequences. The CGR encoding of a sequence has two fundamental properties: it is deterministic, that is there is a unique CGR encoding of each sequence, and reversible, hence the original sequence can be recovered from its representation [6].
A strongly related approach, known as Frequency matrix of Chaos Game Representation (FCGR) [9,6], starts from the k-mers (the substring of length k) of the string we want to represent resulting in the the notion of k-th order FCGR [36]. The k-th order FGCR of a sequence s is a 2 k × 2 k matrix whose elements are the number of occurrences, i.e. the frequencies, of each k-mer in s, where each frequency is stored in the specific and distinct position for each k-mer. Note that the matrix shape depends on the fact that the sequence s is on a 4-symbol alphabet. In essence, the FCGR is an alternative ordering of the histogram for all the k-mers (for a fixed integer k). Deep Learning and FCGR have been used to evaluate the drug resistance for protein sequences of HIV [23] and for multi-class classification task to identify the source organism for a given protein [10] -in this case the FCGR has been extended to encode sequences in the protein alphabet. The FCGR has also been used for unsupervised clustering of DNA sequences of several species [24] by using dense neural networks, where the input of these networks must be a 1-dimensional vector. In this case, the 2-dimensional FCGR representation of the sequences must be flattened and cannot be fully exploited. For an extensive review on CGR and its applications in bioinformatics, we refer the reader to [21].
Subtyping Sars-Cov-2 sequences has been addressed in the literature with bioinformatics pipelines that require the alignment to a reference genome [11] [35], and also with machine learning approaches aiming to skip the alignment step [26]. Convolutional Neural Networks (CNNs) [19,20], showed outstanding results in the well-known Imagenet classification problem [18]. To the best of our knowledge, only two works have used CNNs and FCGR for the classification of DNA sequences. In [28] a simplification of the network reported in [20]) was used to classify different taxonomic categories with a dataset of 3,000 sequences (1200-1400 long). A comparison with Support Vector Machines (SVM), showed that CNNs improve over SVM when using a fragment (500bp) of the sequences. In [30], a CNN was proposed for the classification of a dataset of ≈ 660 sequences from eleven phylogenetic families reporting a test accuracy of 87%.
In this paper we leverage the FCGR representation of genomic sequences and CNN power to perform intra-species classification of viral DNA genome sequences, using SARS-CoV2 as our case of study and GISAID clades as our labels. Observe that in this problem the CNN classifies a dataset that is at least two order of magnitude larger than the one considered in the above mentioned papers. Another work that has tackled the clade assignment problem is Covidex [7], a web app tool based on Random Forest and k-mer frequencies: to the best of our knowledge this is the most recent work facing our problem. Notice that almost the entire phylogenetics literature deals with inter-species classification, where the distance between possible cluster centroids is larger, hence the classification problem is easier. We propose to use a residual neural network [13] (ResNet50) for the classification of DNA sequences into 11 GISAID clades, using a dataset of two orders of magnitude larger (153K sequences for training) than those analyzed in the above cited works (about 3000 sequences in [28]).
Classification metrics (precision, recall and f1-score) and analysis of the separability of the embeddings generated by the classification layer (Silhouette Coefficient [29], Calinski-Harabasz Score [8], and Generalized Discrimination Value (GDV) [31]) are analyzed for each model. Using the fact that each feature in the FCGR is uniquely related to a k-mer, we aim to analyze if the most relevant k-mers identified by feature importance methods (Saliency Maps [34] and Shap Values [22]) are related to mutations defining each clade.
We trained four models, one for each value of k ∈ {6, 7, 8, 9}. All models performed very similarly, with k = 8 being the best one, achieving an overall accuracy of 97.15% in the test set, and the best classification metrics (0.971 for Silhouette Coefficient, 90687.025 for Calinski-Harabasz and −0.734 for GDV). All but two clades (GR and GRY) reported an f1-score above 97% for all the trained models. Since GR is a close ancestor of GRY and these two clades share many mutations, they are confused with each other.
Using the 20 most relevant k-mers identified by Saliency Maps, we were able to achieve a similar performance than our CNNs models using SVM for k ∈ {6, 7, 8}. Finally, to access the performance of our models w.r.t. other approaches, we compare our results with Covidex [7] the only recent tool that we found in the literature solving the clade assignment problem. Our results show that our models outperform Covidex in all clades and reported metrics (accuracy, precision, recall and f1-score).

Background
The Chaos Game Representation for encoding DNA/RNA sequences is formally defined as: Definition 1 (Chaos Game Representation (CGR)) Let s = s 1 . . . s n ∈ {A, C, G, T } * be a sequence. Then the CGR encoding of the sequence s is the bidimensional representation of the ordered pair (x n , y n ) which is defined iteratively as where (x 0 , y 0 ) = (0, 0) and, Note that each point (x i , y i ) obtained with the above encoding represents the i-long prefix of the sequence s. Also, all the CGR encodings are points inside the square with vertices given by the values of the function g. In particular, the encoding of all prefixes that shares the last character will be placed in the same quadrant, all prefixes that shares the two last characters, will be placed in the same sub-quadrant, and so on. This property results in a fractal structure of the representation.
Missing bases can be problematic to encode, since the g(·) function is not defined in that case, we used the notion of frequence matrix CGR [9,6], which has the added benefit of allowing us to manage k-mers instead of strings of arbitrary length.
Definition 2 (Frequency matrix of Chaos Game Representation) Let s = s 1 . . . s n ∈ {A, C, G, T, N } * be a sequence, and let k be an integer. Then the frequency matrix of CGR, in short FCGR, of the sequence s is a 2 k × 2 k bidimensional matrix F = (a i,j ), 1 ≤ i, j ≤ 2 k , i, j ∈ N. For each k-mer b ∈ {A, C, G, T } k , we have an element a i,j in the matrix F , that is equal to the number of occurrences of b as a substring of s. Moreover, the position (i, j) of such element is computed as follows: where (x, y) is the CGR encoding for the k-mer b.
Note that the FCGR is defined for a DNA sequence with unknown nucleotide, denoted by N -where k-mers with an N are simply excluded in the counting processwhile the CGR encoding is well-defined only when all nucleotides are known. To explicitly mention the dimension of the FCGR, we will refer to this as the k-th order FCGR.

Classification of viral sequences of DNA
We are given a phylogeny over the possible viral strains, partitioned into classes: each class c of such partition C is a clade of the tree. More precisely, a clade is a group of related organisms descended from a common ancestor [1], in other words a clade is a subtree of a phylogeny that consist of an ancestral lineage and all its descendants.
Given a genome sequence, that is a string s ∈ {A, C, G, T, N } * , we determine the original clade in C from which the genome sequence is originated; however the genome sequence s might not have been previously observed. In any case the sequence will be assigned to a putative clade. To solve this problem, we propose a supervised learning model based on Convolutional Neural Networks (CNN) [20], using FCGR as inputs.

Data Description
The dataset for this experiment was downloaded from GISAID. By the time of our access to GISAID 1 there were around 10 million sequences.
In order to undersample the available data, we first dropped all the rows in the metadata without information in the columns Virus name, Collection Date, Submission Date, clade, Host and Is complete?, then we built a fasta_id identifier from the metadata as a concatenation of the columns Virus name, Collection Date and Submission Date.
For each clade, we randomly selected 20, 000 sequences considering only those rows where the Host column has value "Human" -clades L, V, and S have less than 20, 000 sequences available, in these cases all sequences have been selected.
As a result of the above procedure, we obtained 191, 456 sequences among the 11 GISAID clades (S, L, G, V, GR, GH, GV, GK, GRY, O, and GRA) over the 12 availables, we excluded the clade GKA from our study since there were only 81 sequences reported in the metadata. The undersampled dataset was randomly split into train, validation and test sets in 80 : 10 : 10 proportion, preserving the same proportion of clades (labels) in each set. The distribution of the clades over the datasets is given in Table 1 4 Analyses In this section we present the experimental setup, the dataset used to train and test each model, and clustering and classification metrics. We train one model for each k ∈ {6, 7, 8, 9} and we complement the study of the accuracy of each model (compared against Covidex [7]) with an analysis of the most relevant k-mers for the classification of each clade using Saliency Maps and Shap.

Experimental setup
All experiments are conducted using a Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz, x86 64, 32 GB RAM and a graphic card NVIDIA GeForce RTX 3060. The implementation is done in Python 3.9. Tensorflow 2.7.0 [2] was used for training the CNN and scikit-learn 1.0 [25] to compute classification metrics and clustering evaluation (except for Generalized Discrimination Value that was implemented). All code is available online for reproducibility 2 .

Model training
Each model was set to be trained for 50 epochs with a batch size of 32 (for k = 9 we used a batch size of 16 for memory constraints) using Adam optimizer [16] with learning rate 0.001 (the default parameters in keras). The validation loss was monitored after each epoch to save the best trained weights, reducing the learning rate with a patience of 8 epochs and a factor of 0.1, and by an early stopping in case the metrics do not improve after 12 epochs.
Based on the comparison of loss and accuracy of the training and validation sets (see Figures 1 for accuracy and 2 for loss), we observe that training is more stable for k = 8 and k = 9 in the earlier epochs, i.e. training and validation metrics are similar, while for k = 6 and k = 7 it takes 33 and 17 epochs to obtain the same behaviour, respectively.
The architecture used in this experiment is the same for all k (ResNet50 [13]), we only changed the input size. Originally, this architecture was designed for inputs of size (224 × 224 × 3), which led us to the assumption that this architecture could be more suitable for k = 8. Notice that our sequences are ≈ 29, 000bp long, which means that our input FCGR for k = 8 is very sparse, since from an n-long sequence we can count n − k + 1 k-mers, it means that (in the case where all k-mers are different) we have at most 29, 000 k-mers, at least 55% of the elements of the FCGR are 0 for k = 8, and a 88% for k = 9. In Table 2

Classification results
After each model is trained the precision and recall for the test set are computed for each clade using the best trained weights (lowest loss in the validation set), achieved at epochs 34, 19, 21 and 27 for k = 6, k = 7, k = 8 and k = 9, respectively. In our case,  we assign each sequence to the clade with highest score. Precision, recall and f1-score are shown in Table 4. Precision and recall are very similar among all the trained models, with small improvements when k increases, 6 out of 11 clades have f1-score greater than 99% in our best model (k = 8). Most notable differences in the performance can be seen in clades GR and GRY, which present the lowest reported recall and precision in each model, respectively. Moreover, from the confusion matrices (see Fig. 3) we can see that misclassified sequences that belong to clades GR and GRY, are confused between

Comparison with the literature
We compare our results against Covidex [7], a tool that classify Sars-CoV-2 sequences into three nomenclatures: GISAID, Nextrain and Pango lineages. Using a different model for each task, all based on Random Forest and 6-mers as input. The reported accuracy are 97, 77%, 99, 52% and 96, 56% for GISAID, Nextstrain and Pango models, respectively. They also trained the models using 7-mers, but they claim that it only produced slightly better results in terms of accuracy but with more than doubling the computation time [7]. The input for Covidex is a vector with the normalized counting of the frequencies for all 4 k k-mers. Our input, the FCGR also considers all k-mers but in a bidimensional matrix. The main difference between both approaches is the model behind it, while Covidex uses Random Forest to perform the classification, we take advantage of the CNNs and use a 2-dimensional input, the FCGR. Notice that using the FCGR with any other classical Machine Learning method implies to convert the FCGR into a vector, and hence, the loss of the 2-dimensional structure.
Since our model is trained using GISAID clades, we only compare to those results. In Covidex, they used 10 clades: S, L, G, V, GR, GH, GV, GK, GRY and O. In our case, we included GRA since there were enough available sequences by the time of our experiments, but this is not considered in the comparison.
For Covidex, the model for the GISAID nomenclature was trained with 66, 126 sequences and tested on 13, 230. revisionexplain why we did not re-trained Covidex Since Covidex is made available as an user app for any SARS-Cov2 sequence, we used the app over our test dataset to compare the results. We tested Covidex on our test dataset of 17, 146 sequences (excluding the 2000 sequences from GRA clade). Achieving a 77, 12% of accuracy, more than a 18% lower than all our trained models and 20, 65%  Table 4: Precision, recall, and f1-score on the test set (19,146 sequences). Each of our models is represented by length of the the k-mers used to generate the FCGR. 6 clades out of 11 (S,L,V,GH,GV,GRA) exhibit f1-score above 97.5% in all models. For each clade, precision does not differ more than 2% among the different models (clade GRY between 6-mer and 8-mer). For recall, the largest difference is 3.5% (clade O between 6-mer and 8-mer). Metrics in bold are those where Covidex achieves better performance than our models in the same dataset.
lower than their reported accuracy. The reported precision, recall and f1-score, as well as the test results over our selected dataset can be seen in Table 5. We found that the reported f1-score of Covidex is quite distant for the one we obtained in our test dataset for clades L (−8. we can observe a decrement on the reported f1-score ranging from 0.8% to 2.9%. Our models (see Table 4) exhibit better performance than Covidex in all clades and metrics on our test set, with similar results only on clades S and GV. We did not perform an extensive comparison of the running times since both tools classify a genome sequence in less than a second (on k = 8, our tool took 0.15 seconds in average).

Clustering results
We evaluate the embeddings of the last layer of each trained model using the Silhouette Coefficient, Calinski-Harabasz score and Generalized Discrimination Value (GDV). These results are shown in Table 6. We can observe that the model for k = 9 is the best one among all metrics, however, all trained models exhibit a very similar separability based on Silhouette and GDV.  Table 5: Precision, recall, and f1-score for Covidex. The Report part is taken from the Supplementary material of [7]. The Test part has the precision, recall, and f1-score obtained by Covidex on our test set, restricted to the 10 clades (17,146 sequences) analyzed in [7]. We found significant differences between Covidex and our trained models in the Test metrics (see Table 4). In particular, the most notorious differences w.r.t f1-score, ranging from 8.4%-42.   Table 6: Clustering metrics for our trained models. Each metric is computed using the output of each model and the predicted clade (that is, the clade that achieves the highest score by our model) in the test set. Each model is represented by the length of the k-mers used to generate the FCGR. For the Silhouette score, the closest to 1 the better. For the Calinski-Harabasz score, larger values are better. For the GDV score, the closest to −1, the better. The model with best separability is the one for k = 9.

Relevant k-mers for the classification of each clade.
similarly than the trained CNNs (that uses FCGR as input, and hence all the 4 k possible k-mers).
The same training and test sets used for the CNNs were used for the SVM. The results of the accuracy in the test set for the different values of N are shown in Fig. 7 6. We can observe that k-mers identified by Saliency Maps are more informative than those identified by Shap Values, since for N = 20, we obtain similar accuracy in the test set for k = 6, 7, 8 compared to CNN (96-97%), while in the case of Shap Values, this accuracy is only achieved by k = 7 with N = 35. Notice that using N = 20, we are considering a small number of all possible k-mers (0.49% for k = 6, 0.12% for k = 7 and 0.03% for k = 8).

Matching relevant k-mers to mutations.
Using the reference genome employed by GISAID (EPI ISL 402124) 3 and the list of marker variants 4 for each GISAID clade with respect to this reference, we evaluated how many k-mers among the 50 chosen ones by Saliency Maps and Shap Values actually matched any of the reported marker variants. A summary is shown in Table 7.
The results shown that the most relevant k-mers selected using Saliency Maps match   Values   6  46  3  7  51  0  8  11  0  9 4 0   When k increases, the accuracy always decreases (for the same number of relevant k-mers), which can be explained since when k increases the total number of possible k-mers increases exponentially.

Discussion
In this work we have shown that FCGR can be used to classify DNA sequences. Most notably, we have used FCGR to assign SARS-CoV-2 genome sequences to its GISAID strain by running a CNN on 191, 456 genome sequences (80% training set, 10% validation set, and 10% test set). In particular, the 8-th order FCGR achieved a test accuracy of 96.29%. The majority of missclassified sequences are shared between two strongly related strains, GR and GRY (GR is a close ancestor of GRY).
We have assessed the influence of the length k of the substrings (k-mers) used to build the FCGR, showing that values between 6 and 9 lead to very similar results, with less than 1% of accuracy on the same test set among them. However, when increasing the value of k, the training time for the model and the memory required to save the FCGRs increases exponentially. For k = 6 each epoch required 3 minutes and 6.6GB of memory, while for k = 9 it required 1:16 hour and 374.7GB. However, FCGRs show a fractal structures; this suggests that we might couple increasing k with using only a portion of the FCGR.
We compare our results with Covidex, a Random Forest based tool that classify sequences on GISAID clades based on k-mers frequencies. Under the same test set, our results show that our models outperform Covidex in all clades and reported metrics (accuracy, precision, recall and f1-score). Moreover, we found that the reported precision, recall and f1-score of Covidex are quite different for all clades but S and GV in our test set, exhibiting a decreasing in the f1-score metric up to 42.4%. We have used Saliency Maps and Shap to identify relevant k-mers, looking for matches with the marker variants reported for each strain. Using the k-mers obtained by Saliency Map, we found 46, 51, 11 and 4 matches for k = 6, 7, 8 and 9, respectively. While, for the k-mers identified by Shap, only 3 matches were found for k = 6. A possible direction for future works is to explore other existing methods (e.g. Lime [27], GradCAM [32], DeepLIFT [33]) that might be suitable in explaining the decisions of the model.
Classifying genome sequences introducing the assembly bias, since any classification depends on the specific assembly pipeline that has been used. To lessen this possible problem, we should study a related problem, where we classify read samples instead of fully assembled genomes. This new problem is more complex, since different regions of the viral genome can have different coverage -hence impacting the frequenciesand reads needs to be cleaned from both errors and contamination artifacts (the latter might be attacked with specialized tools like KMC3 [17].
We did not perform an extensive comparison of the running times since both tools classify a genome sequence in less than a second.

Methods
We use the k-th order FCGR representation for each sequence. In order to obtain this representation, we need to count the k-mers in each sequence and to put those frequencies in the FCGR based on the CGR encoding.
Before feeding the FCGR to the model, we rescale its elements to values between 0 and 1 for stability of the learning process. To do so, we divide each FCGR element-wise by the maximum value in the FCGR. It is worth mentioning that other preprocessing steps were taken into consideration but were ultimately excluded because found empirically worse.

Model architecture
We choose a residual neural network, Resnet50 [12] as our CNN, adapted for k-th order FCGR, i.e. with input size equal to (2 k × 2 k × 1), and output size equal to the number of clades: |C|, with softmax activation function in the last layer and categorical crossentropy as loss function, since we want to assign only one clade to each DNA sequence.

Model evaluation
To assess the performance of our trained model, we perform a classification evaluation of the predictions and also a clustering evaluation for the embeddings in order to evaluate the class separability.

Classification metrics
We report two classification metrics for each clade in the test set: precision and recall. Given a clade c, the correct predictions of the model can be compared to all the sequences with ground truth c (recall), and to all the sequences predicted by the model into the clade c (precision).
Formally, given a clade c, the positive class P consists of the set of genome sequences that are assigned to c, while all other genome sequences are the negative class N . Consequently, the true positive consist of the sequences that originate from the clade c and have been assigned to c, the false positive consist of the sequences that do not originate from the clade c and have been assigned to c, the false negative consist of the sequences that originate from the clade c and have not been assigned to c. The precision and recall are computed as follows: We also report the f1-score, defined as,

Clustering measures
In order to assess the quality of the class separability given by the CNN, we evaluate the embeddings of the last layer (the one used to perform the classification) in the network with three clustering evaluation measures. These embeddings are the output from the final layer of the network for each FCGR.
1. Silhouette Coefficient [29] Given an embedding v belonging to a cluster A, the silhouette coefficient s(v) of v compares the mean intra-cluster distance in A (a) with the mean nearest-cluster distance for v (b), that is, the closest cluster to v different from A. The value of s(v) ranges between −1 (wrongly assigned) and 1 (perfect separability). For a cluster A, the mean silhouette coefficient of A is computed as the average of s(v) over all embeddings v ∈ A.

2.
Calinski-Harabasz Score [8] Given a set of embeddings E of size n E that has been clustered into k clusters, the Calinski-Harabasz Score s, also known as the Variance Ratio Criterion, is defined as the ratio of the between-clusters dispersion and the inter-cluster dispersion for all clusters (the dispersion of a group of n points is measured by the sum of the squared distances of the points from their centroid).
where tr(B k ) is the trace of the between-cluster dispersion matrix and tr(W k ) is the trace (the sum of all elements in the diagonal of W k ) of the within-cluster dispersion matrix, defined as follow: where C q is the set of embeddings in the cluster q, c q is the centroid of the cluster q, c E is the centroid of E and n q = |C q |.
The higher the score s means that the clusters are dense and well separated.
where the mean intra-class for each class C l is defined as and the mean inter-class for each pair of classes C l and C m is defined as follows, here N k correspond to the number of points in class k, and s (k) i is the ith point of class k. The quantity d(a, b) is the distance between a and b, for our case, we considered the Euclidean distance. The value ∆ range between −1 (perfect separability) and 0 (wrongly assigned),

Feature importance
After the model is trained, we can perform feature importance methods (also known as pixel attribution in case of images) to analyze the impact of each element of the FCGR in our prediction. We selected Saliency Maps [34] and Shap Values [22]. Saliency Maps calculate the gradient of the loss function for a specific desired class with respect to the input (FCGR) elements, the gradients are rescaled to [0, 1], where elements with values closer to 1 represent the more influential features for the input FCGR over the predicted class. Shap (Shapley Additive Explanations) Values is a game theoretic approach to explain the output of any machine learning model. It aims to explain the influence of each feature compared to the average model output over the dataset the model was trained on, it outputs positive and negative values, where positive values push the prediction higher, and negative values push the prediction lower. Using the most relevant features from both methods over the FCGR, we aim to identify the most relevant k-mers for the classification of each clade.
Using these methods we aim to analyze the most relevant k-mers for the classification of each clade in the trained models.

Funding
This project has received funding from the European Union's Horizon 2020 Innovative Training Networks programme under the Marie Sk lodowska-Curie grant agreement No.

956229.
This project has received funding from the European Union's Horizon 2020 Research and Innovation Staff Exchange programme under the Marie Sk lodowska-Curie grant agreement No. 872539.