Analysis of SARS-CoV-2 RNA-Sequences by Interpretable Machine Learning Models

We present an approach to investigate SARS-CoV-2 virus sequences based on alignment-free methods for RNA sequence comparison. In particular, we verify a given clustering result for the GISAID data set, which was obtained by analyzing the molecular differences in coronavirus populations with phylogenetic trees. For this purpose, we use alignment-free dissimilarity measures for sequences and combine them with learning vector quantization classifiers for virus type discriminant analysis and classification. These vector quantizers belong to the class of interpretable machine learning methods, which, on the one hand, provide additional knowledge about the classification decisions, such as discriminant feature correlations, and, on the other hand, can be equipped with a reject option. This option gives the model the property of self-controlled evidence when applied to new data, i.e., the model refuses to make a classification decision if the model evidence for the presented data is not given. After training such a classifier for the GISAID data set, we apply the obtained classifier model to another, unlabeled SARS-CoV-2 virus data set. On the one hand, this allows us to assign new sequences to already known virus types; on the other hand, the rejected sequences allow speculations about new virus types with respect to nucleotide base mutations in the viral sequences.

Author summary

The currently emerging global disease COVID-19, caused by novel SARS-CoV-2 viruses, requires all scientific effort to investigate the development of the viral epidemic, the properties of the virus, and its types. Investigations of the virus sequences are of special interest. Frequently, these are based on mathematical/statistical analysis. However, machine learning methods represent a promising alternative if one focuses on interpretable models, i.e., those that do not act as black boxes. Accordingly, we apply variants of learning vector quantizers to analyze the SARS-CoV-2 sequences.
We encoded the sequences and compared them in their numerical representations to avoid the computationally costly comparison based on sequence alignments. Our resulting model is interpretable, robust, and efficient, and has a self-controlling mechanism regarding its applicability to data. This framework was applied to two data sets concerning SARS-CoV-2. We were able to verify previously published virus type findings for one of the data sets by training our model to accurately identify the virus type of the sequences. For sequences without virus type information (the second data set), our trained model can predict the virus type. Thereby, we observe a new scattered spreading of the sequences in the data space, which is probably caused by mutations in the viral sequences.

Introduction

The coronavirus disease 2019 (COVID-19), caused by SARS-CoV-2 viruses, whose origin probably lies in Wuhan (China), is a severe respiratory disease [1]. Currently (May 3, 2020), it is spreading rapidly all over the world [2]. Yet there are several indicators that the molecular characteristics evolve over time [3,4]. This evolution is mainly driven by mutations, which play an essential role and may be accompanied by mechanisms of stabilization [5,6].

The analysis of the genomic structure by sequencing is currently a topic of ongoing research to better understand the molecular dynamics [7]. Obviously, changing the genomic structure may cause new properties and, hence, could increase the difficulties in finding drugs for treatment. For example, changes may lead to behavioral changes, such as the increased binding of the SARS-CoV-2 surface glycoprotein to human ACE2 receptors [8].

Viruses of the family Coronaviridae possess a single-stranded, positive-sense RNA genome ranging from 26 to 32 kilobases in length and frequently are extremely similar [9]. Therefore, the analysis of those sequences to understand the genetic evolution in time and space is very difficult. This problem is magnified by incorrect or inaccurate sequencing [10]. Further, mutations are not equally distributed across the genome; overviews of related alignment-free analysis methods are given in [36,37].

In the present publication, we investigate whether alignment-free dissimilarities are suitable for the identification of SARS-CoV-2 clusters/classes in combination with interpretable machine learning methods for clustering and classification [38,39]. We do this for two data sets: GISAID data and NCBI data. For the first one, virus classes (types) were identified by phylogenetic tree analysis in [12], whereas the second one is without class information.
Although deep neural network approaches provide impressive results in sequence classification [16,40-42], deep architectures are at least difficult to interpret [43]. Therefore, we focus on applying prototype-based methods using alignment-free dissimilarity measures for sequence comparison. In fact, prototype-based machine learning models for data classification and representation are known to be interpretable and robust [44-46]. Using such methods for the SARS-CoV-2 sequence data, we first verify the classification results for the GISAID data. In particular, we classify the sequences by a learning vector quantizer, which is proven to be robust and interpretable [45,47]. Thereafter, we use this model to classify the new data from the NCBI. Moreover, this interpretable classifier provides correlation information regarding data features contributing to the class discrimination. This additional knowledge allows a further characterization of the virus classes.

Materials and methods

In order to investigate SARS-CoV-2 viruses in terms of sub-type spreading, two virus sequence data sets were considered.

The GISAID Dataset D_G

The first one, abbreviated by D_G, is from the GISAID coronavirus repository (GISAID - Global Initiative on Sharing Avian Influenza Data). As of March 4, 2020, it consists of 254 coronavirus genomes, isolated from 244 humans, nine Chinese pangolins, and one bat (Rhinolophus affinis). After preprocessing, 160 complete human sequences are obtained as described in [12], where these genomes of SARS-CoV-2 have been used to create a phylogenetic network. The resulting network analysis distinguished three types (clusters) of the virus, A, B, and C: A is most similar to the bat virus, whereas B comprises sequences obtained from A by two mutations: the synonymous mutation T8782C and the non-synonymous mutation C28144T, changing a leucine to a serine.
A further non-synonymous mutation, G26144T, changing a glycine to a valine, leads from B to type C. In this sense, the classes (virus types) implicitly encode the evolution of the virus in time.

In our data analysis, we removed two sequences whose accession numbers occur twice in the data record, and another two which we identified as not human, resulting in 156 final sequences. Additionally, we take the type/class information as labels for the virus genome sequences and, hence, as reference. A detailed data description as well as a complete list of sequences can be found in [12]. The virus type assignments and additional data (country, collection date), as well as accession numbers for all 156 sequences in use, are additionally provided in the supplementary material.

The complete data information can be found in supplementary files S12 Data.

The NCBI Dataset D_N

The second data set, including 892 complete genomes, has been selected from the National Center for Biotechnology Information (NCBI) Viral Genome database [48] and GenBank [49] by April 19th, 2020; see Tab. 1. These data are human-based sequences and additionally provide the country information from which the sequences originate, as well as their collection date. For each sequence, we have also derived a more general assignment to regions based on the country information, which includes the following values: USA, China, Europe, and Others. The accession numbers and the additional data used in the analysis have been included in the supplementary material. We refer to this data set by D_N.

Remark: Although the SARS-CoV-2 virus is an RNA virus, the sequences provided by databases are given using the DNA coding. In the following, we adopt this convention and do not explicitly refer to it later.
Again, the complete data information can be found in supplementary files S12 Data.

Representation of RNA Sequences for Alignment-free Data Analysis

Several approaches have been published to represent sequences adequately for alignment-free comparison. These methods range from chaos game representation to standard unary coding or matrix representations. An overview is given in [27] and [36,37]. Here we focus only on two of the most promising approaches: Natural Vectors and Bag-of-Words.

For the Natural Vector (NV) representation, each nucleotide L in a sequence s is described by its number of occurrences, the mean position µ^1_L of its occurrences, and the central moments of these positions, where the k-th central moment for k ≥ 2 is given as µ^k_L = E_{p_L}[(· − µ^1_L)^k]. Then, the natural vector of order K for a sequence s is given by collecting these quantities for all nucleotides, whereby we again drop the dependencies on K and s if it is not misleading. Natural vectors are usually compared in terms of the l_p-metric, giving the Euclidean distance for p = 2. Kendall statistics, as a kind of correlation measure, was applied in [50].
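As a concrete illustration, the per-nucleotide counts, mean positions, and central moments can be computed as in the following sketch (normalization conventions for the moments vary in the literature; this minimal version simply uses the empirical central moments of the occurrence positions):

```python
import numpy as np

ALPHABET = "ACGT"

def natural_vector(seq, K=2):
    """Sketch of a natural vector: for each nucleotide, the count,
    the mean occurrence position, and the central position moments
    up to order K (one possible normalization choice)."""
    seq = seq.upper()
    features = []
    for L in ALPHABET:
        # 1-based positions of nucleotide L in the sequence
        pos = np.array([i + 1 for i, c in enumerate(seq) if c == L], dtype=float)
        n_L = len(pos)
        if n_L == 0:
            features.extend([0.0] * (K + 1))
            continue
        mu = pos.mean()
        feats = [float(n_L), mu]
        for k in range(2, K + 1):
            feats.append(float(((pos - mu) ** k).mean()))
        features.extend(feats)
    return np.array(features)
```

For K = 2 this yields a 12-dimensional vector (count, mean position, and position variance for each of A, C, G, T), which can then be compared by the Euclidean distance.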

The NV description of sequences can also be applied to nucleotide sequences containing ambiguous characters (degenerate bases) [35,51]. This yields an extended alphabet Ā = A ∪ E. In that case, weights 0 ≤ w_L(s_i) ≤ 1 are introduced for each L ∈ A, based on the probability p_{L,s_i} that the detected ambiguous character s_i ∈ E should be the character L. These weights have to be taken into account during the expectation value calculations [35].

For the Bag-of-Words (BoW) approach, each sequence is represented by a histogram vector over the nucleotide triplets occurring in the sequence; after normalization, such a histogram can be read as a discrete density. Obviously, comparison of those histogram vectors can be done using the usual Euclidean distance. However, motivated by the density property, an alternative choice is to compare them by means of divergence measures [52]. In the investigations presented later, we applied the Kullback-Leibler divergence [53]

D_KL(h, m) = Σ_j h_j · log(h_j) − Σ_j h_j · log(m_j)   (3)

for sequence histograms h and m. Note that the first term in (3) is the negative Shannon entropy of h, whereas the second term −Σ_j h_j · log(m_j) is the Shannon cross-entropy. Yet other divergences, like Rényi divergences, could be used [54]. We refer to [55] for a general overview regarding divergences in the context of machine learning.
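A minimal sketch of the triplet histogram coding and of the Kullback-Leibler divergence (3) might look as follows (the assignment of triplets to histogram dimensions here is simply lexicographic and need not match the coding in S13; the eps smoothing is an implementation detail to avoid log(0)):

```python
from itertools import product
import numpy as np

# 64 triplet bins in lexicographic order (an illustrative choice)
TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}

def triplet_histogram(seq):
    """Normalized bag-of-words histogram over overlapping triplets."""
    h = np.zeros(len(TRIPLETS))
    for i in range(len(seq) - 2):
        idx = INDEX.get(seq[i:i + 3].upper())
        if idx is not None:  # skip windows containing ambiguous characters
            h[idx] += 1
    total = h.sum()
    return h / total if total > 0 else h

def kl_divergence(h, m, eps=1e-12):
    """D_KL(h || m) = sum_j h_j log(h_j / m_j), with eps smoothing."""
    h = h + eps
    m = m + eps
    return float(np.sum(h * np.log(h / m)))
```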

The assignment of the nucleotide triplets to the histogram dimensions can be found in the supplementary material S13 Histogram Coding of Nucleotide Triplets.

Median Neural Gas

The Median Neural Gas algorithm (MNG) is a neural data quantization algorithm for data compression based on dissimilarities [56,57]. It is a stable variant of the k-median centroid method, improved by neighborhood-cooperative learning, where k is the predefined number of representatives [58,59]. In this context, median approaches only assume a dissimilarity matrix for the data and restrict the centroids to be data points. Thus, after training, MNG provides k data points serving as representatives of the data. Thereby, the data space is implicitly sampled according to the underlying data density, in consequence of the so-called magnification property of neural gas quantizers [60,61].

It should be emphasized that, despite the weak assumption of a given dissimilarity matrix, MNG always delivers exact data objects as representatives. Hence, any averaging for prototype generation, as in standard vector quantizers, is avoided here. This is essential if averaged data objects are meaningless, as for texts, music data, or RNA/DNA sequences, for example.
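The core restriction of median approaches, namely that prototypes must be data points selected using only a dissimilarity matrix, can be illustrated by a plain k-medoids sketch (MNG additionally applies neighborhood-cooperative, rank-based updates, which are omitted here):

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Minimal k-medoids on a precomputed dissimilarity matrix D (n x n).
    Prototypes (medoids) are always data points, as in median methods."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    assign = np.argmin(D[:, medoids], axis=1)
    for _ in range(n_iter):
        new = medoids.copy()
        for j in range(k):
            members = np.where(assign == j)[0]
            if len(members) > 0:
                # medoid = member minimizing summed dissimilarity to its cluster
                new[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        new_assign = np.argmin(D[:, new], axis=1)
        if np.array_equal(new, medoids):
            break
        medoids, assign = new, new_assign
    return medoids, assign
```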

Affinity Propagation for Clustering with Cluster Number Control

Affinity Propagation (AP), introduced by Frey & Dueck in [62], is an iterative cluster algorithm based on message passing, where the current cluster nodes, denoted as prototypes or exemplars in the AP setting, interact by exchanging real-valued messages.

Contrary to methods like c-means or neural maps, where the number c of prototypes has to be chosen beforehand, AP starts by assuming that all N data points are potential exemplars and reduces the number of valid prototypes (cluster centroids) iteratively.

More precisely, AP realizes an exemplar-dependent probability model where the given similarities ς(i, k) between data points x_i and x_k (potential exemplars) are identified as log-likelihoods of the probability that the data points assume each other as a prototype. For example, the similarities ς(i, k) could simply be negative dissimilarities, like the negative Euclidean distance.

The cost function C_AP(I) minimized by AP is given by

C_AP(I) = −Σ_i ς(i, I(i)) + Σ_k δ_k(I)   (4)

where I : N → N is the mapping function in (4) determining the prototype for each data point, and δ_k(I) is a penalty function for invalid configurations. During the optimization, two kinds of messages are iteratively exchanged between the data points until convergence: the responsibilities r(i, k) and the availabilities a(i, k). The responsibilities reflect the accumulated evidence that point k serves as prototype for data point i. The availabilities describe the accumulated evidence of how appropriate data point k is seen as a potential prototype by the points i. Finally, the prototypes are determined by I(i) = argmax_k (a(i, k) + r(i, k)). Hence, a(i, k) and r(i, k) can be taken as log-probability ratios [62]. The iterative alternating calculation of a(i, k) and r(i, k) is derived from the max-sum algorithm applied to factor graphs [63], which can further be related to spectral clustering [64].
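The alternating message updates can be sketched directly from the update rules in [62] (a bare-bones illustration with a fixed number of damped iterations; a production implementation such as sklearn.cluster.AffinityPropagation adds convergence checks):

```python
import numpy as np

def affinity_propagation(S, n_iter=200, damping=0.5):
    """Bare-bones AP on a similarity matrix S whose diagonal holds the
    preferences. Returns the indices of the identified exemplars."""
    n = S.shape[0]
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    rows = np.arange(n)
    for _ in range(n_iter):
        # responsibilities: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        M = A + S
        idx = np.argmax(M, axis=1)
        first = M[rows, idx].copy()
        M[rows, idx] = -np.inf
        second = M.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, idx] = S[rows, idx] - second
        R = damping * R + (1.0 - damping) * R_new
        # availabilities: a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0.0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        dA = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0.0)
        np.fill_diagonal(A_new, dA)
        A = damping * A + (1.0 - damping) * A_new
    # exemplars: points with positive self-evidence a(k,k) + r(k,k)
    return np.where(R.diagonal() + A.diagonal() > 0)[0]
```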

The number of resulting clusters is implicitly determined by the self-similarities ς(k, k), also denoted as preferences. The larger the self-similarities, the finer is the granularity of the clustering [62]. Common choices are the median or the minimum of the similarities between all inputs. Otherwise, the self-similarities can be seen as a control parameter for the cluster granularity.

Learning Vector Quantization

Learning vector quantization (LVQ) was introduced by T. Kohonen [65]. A cost-function-based variant is known as generalized LVQ (GLVQ) [66]. This cost function approximates the classification error [67]. In particular, an LVQ classifier requires training data X = {(x_i, c(x_i))}, where c(x_i) ∈ C = {1, . . . , C} is the set of available class labels. Further, the model assumes a set of prototypes W = {w_k ∈ R^n, k = 1 . . . M} with class labels c(w_k) such that at least one prototype is assigned to each class. Hence, we have a partitioning of the prototype set according to the classes. Moreover, a dissimilarity measure d(x, w) is supposed, which has to be differentiable with respect to the second argument. For a given LVQ configuration, a new data point x is assigned to a class by the mapping

x ↦ c(w_ω)   (5)

where

ω = argmin_k d(x, w_k)   (6)

is known as the winner-takes-all rule (WTA) in prototype-based vector quantization.

The prototype w_ω is denoted as the winner of the competition.

During learning, the cost-based LVQ minimizes the expected classification error E_X[E(x, W)], where

E(x, W) = f(μ(x))   (7)

is the local classification error, depending on the choice of the monotonically increasing function f, and the classifier function is

μ(x) = (d^+(x) − d^−(x)) / (d^+(x) + d^−(x))   (8)

where d^+(x) = d(x, w^+) with the so-called best matching correct prototype w^+, and d^−(x) = d(x, w^−) with the corresponding best matching incorrect prototype w^−. Frequently, the squashing function f is chosen as the sigmoid f(z) = 1/(1 + e^(−z)). Learning takes place as stochastic gradient descent learning (SGDL) [68,69] of E_X[E(x, W)] with respect to the prototype set W to obtain an optimum prototype configuration in the data space.
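A minimal GLVQ training loop for the squared Euclidean distance, using the sigmoid squashing function and the classifier function μ, might look as follows (hypothetical helper names; one prototype per class, as in our workflow):

```python
import numpy as np

def glvq_train(X, y, prototypes, proto_labels, lr=0.01, epochs=30, seed=0):
    """GLVQ sketch: SGD on f(mu(x)) with the sigmoid f(z) = 1/(1+exp(-z)),
    mu(x) = (d+ - d-)/(d+ + d-), and squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    W = prototypes.astype(float).copy()
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            x, c = X[i], y[i]
            d = ((W - x) ** 2).sum(axis=1)
            correct = proto_labels == c
            jp = np.where(correct)[0][np.argmin(d[correct])]    # best correct
            jm = np.where(~correct)[0][np.argmin(d[~correct])]  # best incorrect
            dp, dm = d[jp], d[jm]
            denom = (dp + dm + 1e-12) ** 2
            mu = (dp - dm) / (dp + dm + 1e-12)
            fprime = np.exp(-mu) / (1.0 + np.exp(-mu)) ** 2  # sigmoid derivative
            # attract the best correct prototype, repel the best incorrect one
            W[jp] += lr * fprime * (4.0 * dm / denom) * (x - W[jp])
            W[jm] -= lr * fprime * (4.0 * dp / denom) * (x - W[jm])
    return W

def glvq_predict(X, W, proto_labels):
    """Winner-takes-all classification."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return proto_labels[np.argmin(d, axis=1)]
```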

The dissimilarity d(x, w) can be chosen arbitrarily, supposing differentiability with respect to w to ensure SGDL. Frequently, the squared Euclidean distance is used. If both x and w are assumed to be discrete representations of density functions, divergences like the Kullback-Leibler divergence D_KL(x, w) from (3) come into play instead [70].

Another popular choice is the squared Euclidean mapping distance

d_Ω(x, w) = ||Ω(x − w)||²   (9)

proposed in [71], with the mapping matrix Ω ∈ R^{m×n} and m being the projection dimension, usually chosen as m ≤ n [72]. Here, the data are first mapped linearly by the mapping matrix, and then the Euclidean distance is calculated in the mapping space R^m. The mapping matrix can be optimized again by SGDL to achieve a good separation of the classes in the mapping space. The respective algorithm is known as Generalized Matrix LVQ (GMLVQ).
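The mapping distance itself is a one-liner; the following sketch also shows the classification correlation matrix Λ = Ω^T Ω obtained from the projection:

```python
import numpy as np

def mapping_distance(x, w, Omega):
    """Squared Euclidean mapping distance d_Omega(x, w) = ||Omega (x - w)||^2
    for a projection matrix Omega of shape (m, n) with m <= n."""
    diff = Omega @ (np.asarray(x) - np.asarray(w))
    return float(diff @ diff)

def classification_correlation(Omega):
    """Lambda = Omega^T Omega, interpretable as feature correlations
    contributing to the class discrimination."""
    return Omega.T @ Omega
```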

After training, the adapted projection matrix Ω provides additional information. The resulting matrix Λ = Ω^T Ω ∈ R^{n×n} allows an interpretation as a classification correlation matrix, i.e., the matrix entries Λ_ij give the correlations between data features i and j which contribute to the class discrimination [39,75]. A non-linear mapping could be realized by applying kernel distances instead [76,77].

A trained LVQ model can be applied to newly incoming data of unknown distribution. However, care must be taken to ensure that the model remains applicable and that there is no inconsistency with the new data. Therefore, each LVQ can be equipped with a reject option for the application phase [78,79]: if the dissimilarity of the best matching prototype to a data point is greater than a given threshold τ, the data point is rejected for classification. This optional tool equips the LVQ with a so-called self-controlled evidence (SCE) [45]. The threshold τ is determined during model training for each prototype individually, e.g., as the 95%-percentile of the dissimilarity values of those data which are assigned to the considered prototype by the WTA rule (6) together with the class assignment (5).
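The percentile rule can be sketched as follows (hypothetical helper names; winner_idx and d_winner denote the WTA assignments and winner distances of the training data):

```python
import numpy as np

def reject_thresholds(winner_idx, d_winner, n_prototypes, q=95.0):
    """Per-prototype reject threshold tau_k: the q-th percentile of the
    winner distances of the training data assigned to prototype k."""
    taus = np.full(n_prototypes, np.inf)
    for k in range(n_prototypes):
        dk = d_winner[winner_idx == k]
        if len(dk) > 0:
            taus[k] = np.percentile(dk, q)
    return taus

def classify_with_reject(d, taus, proto_labels):
    """Winner-takes-all with reject option: None means 'no evidence'."""
    k = int(np.argmin(d))
    return proto_labels[k] if d[k] <= taus[k] else None
```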

Stochastic Neighbor Embedding for Visualization

The method of Stochastic Neighbor Embedding (SNE) was developed to visualize high-dimensional data in a typically two-dimensional visualization space [80]. For this purpose, each data point x_k in the data space is associated with a visualization vector v_k ∈ R². The objective of the respective embedding algorithm is to distribute the visualization vectors such that the density of the original data distances in the high-dimensional data space is preserved as well as possible by the respective density of the distances in the visualization space (embedding space). The quality criterion is the Kullback-Leibler divergence between them, which is minimized by SGDL with respect to the visualization vectors v_k.

Yet, SNE suffers from the fact that the distance densities in the original data space are frequently heavy-tailed [81], which leads to inaccurate visualizations. To overcome this problem, the so-called t-distributed SNE (t-SNE) was developed [82].
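The idea can be condensed into a toy implementation: Gaussian affinities in the data space, heavy-tailed Student-t affinities in the embedding, and gradient descent on the KL divergence between them (real t-SNE additionally calibrates per-point bandwidths to a target perplexity and uses momentum and early exaggeration):

```python
import numpy as np

def tsne_sketch(X, n_iter=200, lr=1.0, seed=0):
    """Toy t-SNE with a fixed Gaussian bandwidth in the data space and
    Student-t affinities in the 2-D embedding; for illustration only."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # joint Gaussian affinities P in the data space (fixed sigma = 1)
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    P = np.exp(-D)
    np.fill_diagonal(P, 0.0)
    P = np.maximum(P / P.sum(), 1e-12)
    Y = rng.normal(scale=1e-2, size=(n, 2))
    for _ in range(n_iter):
        # Student-t affinities Q in the embedding
        d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        num = 1.0 / (1.0 + d2)
        np.fill_diagonal(num, 0.0)
        Q = np.maximum(num / num.sum(), 1e-12)
        # gradient of KL(P || Q) with respect to the embedding vectors
        G = 4.0 * (((P - Q) * num)[:, :, None]
                   * (Y[:, None, :] - Y[None, :, :])).sum(axis=1)
        Y -= lr * G
    return Y
```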

Data Processing Workflow

In the following, we describe and motivate the steps of data processing and analysis.

- For all LVQ variants, we take only one prototype per class.

- For GMLVQ, the projection matrix is chosen as Ω ∈ R^{2×n}, i.e. the mapping dimension is m = 2.

- SGDL training as 10-fold cross-validation to determine the best LVQ architecture for the given problem:

- Training of W using the GLVQ for the NV representation.

* GMLVQ is not applicable for this sequence representation due to mathematical reasons.

- Training of W using the GLVQ for the BoW representation.

- Training of W and Ω using the GMLVQ for the BoW representation.

- Final training of the best LVQ architecture with the optimum training schedule to achieve the best prototype configuration W.

- If the GMLVQ architecture is selected for final training: training of both W and Ω, and determination of the classification correlation matrix Λ = Ω^T Ω.
- Determination of the reject thresholds for each prototype for self-controlled evidence (SCE) use, based on the 95%-percentile rule.
- Evaluation of the data rejected by the SCE rule.
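The 10-fold splitting underlying the architecture selection can be sketched as follows (a hypothetical helper; the authors' exact splitting and training schedule are not specified here):

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """Disjoint index folds for k-fold cross-validation: shuffle the
    indices once and deal them round-robin into k folds."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    return [perm[i::k] for i in range(k)]
```

Each candidate architecture is then trained k times, holding out one fold for validation each time, and the accuracies are averaged to select the best model.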

Results

According to the processing workflow, we trained several LVQ classifier variants for the D_G data. By 10-fold cross-validation, we achieved the averaged accuracies depicted in Tab. 2, together with their respective standard deviations. According to these results, GMLVQ performs best using the BoW coding of the sequences together with the Euclidean mapping distance d_Ω(x, w) from (9). Thus, we finally trained a GMLVQ model with this configuration.

Table 2. Classification results of trained LVQ variants for the D_G data set obtained by 10-fold cross-validation.
Applying the trained GMLVQ classifier to the D_NB data set leads to the classification of 37 data points to class A, 95 data points to class B, and 2 data points to class C.

The classification analysis of the D_G data by means of the machine learning model GMLVQ verifies the class determination suggested in [12]. Only 7 data samples are not classified accordingly, due to the model's self-controlled evidence decision. Thereby, the GMLVQ model shows a stable performance in learning (see Tab. 2), which underlines its well-known robustness [47]. Thus, we observe an overall precise agreement supporting the findings in [12].

This agreement, however, is obtained by alignment-free sequence comparisons. Moreover, the class representatives show only the smallest deviations in their histograms. Yet, we cannot expect greater deviations, because the sequences differ only in a few characters according to the specific mutations [11,12].

The AP centroids differ slightly more than the GMLVQ prototypes; see S6 Fig. This can be attributed to the larger overall scattering of the D_NB data.

Further, the GMLVQ prototypes serve as class 'detectors': if the encoded sequences are most similar to them with respect to the mapping distance, the sequences are assigned to the respective classes according to the WTA rule (6). However, in general, the prototypes are not identical with the mean vectors of the class distributions, as emphasized in [83]. A possible explanation could be that both t-SNE and AP implicitly reflect data densities in the data space. Class densities, however, do not have to coincide with the overall data density.

Thus, the Ω-mapping, which is optimized during GMLVQ training for best classification performance, offers the better visualization option and, hence, discloses the class distribution more appropriately.

[...] samples from December/January are available. We observe that class C for the D_G data is mainly represented in January for European samples, which confirms the findings in [12]. Thus, the small number of class C samples in the D_NB classification may be attributed to this peculiarity in Europe. Further, the GMLVQ, which was trained on D_G data, rejects a large amount of data from D_NB, particularly from March. We suspect an accumulation of mutations, which could explain the scattering. Accordingly, the GMLVQ is able to detect this behavior by means of the SCE decision rule. We observe from the visualization S11 [...] the given virus type information, which was obtained by phylogenetic tree analysis [12].

GMLVQ supposes vectorial data representations and compares vectors in terms of a well-defined dissimilarity measure. In this application, the GMLVQ training is based on the Bag-of-Words-coded sequences and yields class-specific prototype vectors as well as an optimum class/type-separating dissimilarity measure in the data space of encoded sequences. Compared to phylogenetic trees, which require high computational costs due to the involved sequence alignment process, the GMLVQ approach has lower complexity and allows an easy out-of-training generalization.

Visualization of GMLVQ classification for the full D_N data by the Ω-mapping.

The data as well as the GMLVQ prototypes are depicted. The class coloring is as in S2. [...]

S13 Histogram Coding of Nucleotide Triplets. For the nucleotide triplet combinations, the coding by the histogram dimensions is given in the file 'S13 Histogram Coding.xlsx'.

S14 GMLVQ Mapping for D_N. Virus type assignments for the D_N sequences obtained by GMLVQ. For each sequence in D_N, the class/type assignment obtained by the GMLVQ model is given, as well as whether a rejection decision was made according to the SCE of the GMLVQ model. Additionally, we provide the sequences from D_G which were rejected by the SCE decision of the GMLVQ model.