Assessing SARS-CoV-2 evolution through the analysis of emerging mutations

Intro: The number of studies on SARS-CoV-2 published on a daily basis is constantly increasing, in an attempt to better understand and address the challenges posed by the pandemic. Most of these studies include a phylogeny of SARS-CoV-2 as background context, taking the latest data into consideration in order to construct an up-to-date tree. However, some of these studies have also revealed the difficulties of inferring a reliable phylogeny. [13] have shown that reliable phylogenetic inference is an inherently complex task because of the large number of highly similar sequences, given the relatively low number of mutations evident in each sequence. Motivation: From this viewpoint, there is both a challenge and an opportunity in identifying the evolutionary history of the SARS-CoV-2 virus: to assist the phylogenetic analysis process, to support researchers in keeping track of the virus and the course of its characteristic mutations, and to find patterns in the emerging mutations themselves and the interactions between them. The research question is formulated as follows: detecting new patterns of co-occurring mutations, beyond the strain-specific / strain-defining ones, in SARS-CoV-2 data, through the application of ML methods. Aim: Going beyond traditional phylogenetic approaches, we design and implement a clustering method that effectively creates a dendrogram of the involved sequences, based on a feature space defined on the present mutations rather than on the entire sequence. This ML method is then tested on sequences retrieved from public databases and validated using the available metadata as labels. The main goal of the project is to design, implement, and evaluate software that automatically detects and clusters relevant mutations, which could potentially be used to identify trends in emerging variants. Contact: tasos1109@gmail.com


Introduction
The fight of the 21st century against SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), the debilitating disease agent discovered in Wuhan, China, in December 2019, has prompted an unprecedented use of bioinformatics tools for deciphering the molecular characteristics of infectious pathogens. With the viral genome data of SARS-CoV-2 made available barely weeks after the reported outbreak, bioinformatics platforms have become a critical tool for gaining time in the fight against the pandemic.
At the time of writing (11 March 2022), more than 9.2 million publicly available complete or near-complete genome sequences of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) have been submitted to GISAID from all over the world, and the number continues to grow. This remarkable attainment has been achieved through inexpensive, rapid genome sequencing and the online sharing of SARS-CoV-2 genomes by public health experts and research teams around the globe. These genomes play a pivotal role and provide invaluable insights into the ongoing evolution of the virus during the pandemic.
Understanding the evolution of not only SARS-CoV-2 but also other coronaviruses is crucial for creating more robust health infrastructures that can prevent major outbreaks in a pandemic setting. Despite the existence of this "treasure trove" of data, the transmission history of the virus, and hence "who infected whom" [22], is very difficult to determine. Previous studies have shown that pinpointing individuals who share the same or almost the same viral genome could help solve this problem [22], but further difficulties then arise: within-host genetic variants need to be identified with high precision, and their evolution needs to be understood. Needless to say, the phrase "almost the same viral genome" can be poorly defined, as there are many ways to define similarity between two or more sequences, and one with physical significance must be preferred.

TSNE
TSNE, or t-distributed stochastic neighbor embedding, is a stochastic statistical method for reducing the dimensionality of a set of points, embedding them in a low-dimensional space [8,23]. TSNE expresses similarity and dissimilarity in the reduced dimensions by placing similar points nearby and dissimilar points far apart with high probability.
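As an illustration, below is a minimal sketch of reducing a binary feature matrix to two dimensions with the scikit-learn implementation of TSNE; the input here is random dummy data, not the actual mutation matrix used in this study.

```python
# Minimal TSNE sketch: embed a small binary feature matrix in 2D.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# 100 dummy samples with 50 binary features each
X = rng.integers(0, 2, size=(100, 50)).astype(float)

# Two components give a 2-dimensional representation suitable for plotting
embedding = TSNE(n_components=2, random_state=0, perplexity=30.0).fit_transform(X)
print(embedding.shape)  # (100, 2)
```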

DBSCAN
DBSCAN, or density-based spatial clustering of applications with noise, is a non-parametric, density-based clustering algorithm used in the model building, belonging to the broad group of unsupervised machine learning methods. Given a set of points in an n-dimensional space, DBSCAN groups together those points which belong to a high-density area, or, in other words, those that have many neighboring points. Points that do not have many neighbors, or that lie in low-density regions, are marked as outliers or noise [7,20].

MCL
MCL, or the Markov Cluster Algorithm, is an unsupervised clustering method that applies to graphs in the broad mathematical sense of the term. The algorithm makes use of a simulation of stochastic flow in graphs. In the field of cluster analysis on graphs, graph partitioning is the process of reducing a graph to smaller graphs by separating its nodes into mutually exclusive groups [3]. The MCL algorithm was developed by Stijn van Dongen at the Centre for Mathematics and Computer Science (also known as CWI) in the Netherlands [6], and its purpose is to find the optimal partition of a connected graph in an unsupervised way, without any a priori knowledge of the partition sizes.
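To make the flow-simulation mechanics concrete, below is a from-scratch sketch of the MCL iteration on a toy adjacency matrix (two triangles joined by a single bridge edge). This is a simplified illustration of the expansion/inflation idea, not van Dongen's reference implementation.

```python
import numpy as np

def mcl(adj, expansion=2, inflation=2.0, iters=100, tol=1e-6):
    """Minimal MCL sketch: add self-loops, make the matrix column-stochastic,
    then alternate expansion (matrix power) and inflation (elementwise power
    plus renormalization) until the matrix stops changing."""
    M = adj.astype(float) + np.eye(len(adj))      # self-loops
    M /= M.sum(axis=0, keepdims=True)             # column-stochastic
    for _ in range(iters):
        last = M.copy()
        M = np.linalg.matrix_power(M, expansion)  # expansion step
        M = M ** inflation                        # inflation step
        M /= M.sum(axis=0, keepdims=True)
        if np.allclose(M, last, atol=tol):
            break
    # At convergence, each non-zero row spans one cluster's members.
    clusters = {}
    for row in M:
        members = tuple(np.nonzero(row > 1e-6)[0])
        if members:
            clusters[members] = True
    return list(clusters)

# Two triangles (0,1,2) and (3,4,5) joined by the weak edge 2-3
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
print(mcl(A))
```

With the default inflation of 2.0, the bridge edge is pruned and the two triangles end up in separate clusters, which is the behavior exploited later in this study.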

Approach
We will approach the problem stated by the research question with the following steps:
• Data gathering.
• Data pre-processing.
• Data modeling.
• Combining the binary model with the corresponding metadata into a single file.
• Dimensionality reduction of the binary model, one label at a time: two different dimensionality reductions (TSNE), one for the non-characteristic ("1") and one for the characteristic ("2") mutations, respectively.
• In each of the two spaces, the one constructed using only "1"s and the one constructed using only "2"s, performing unsupervised clustering with DBSCAN, in an attempt to find high-density areas representing closely related samples.
• Up to this point, two completely independent 2-dimensional spaces have been constructed, and in each one, distinct clusters have been identified based on high-density areas. Each cluster represents a group of strongly related samples, relying on the characteristic mutations in one space and the non-characteristic mutations in the other. In other words, each sample has been assigned to a cluster considering only characteristic and only non-characteristic mutations, respectively.
• Undirected graph construction based on co-occurring mutations over groups of samples, as indicated by the clusters of the previous step.
• Finding highly interconnected parts of the network through MCL analysis.
• Directed graph construction based on mutations occurring from one node to others. This graph should indicate evolutionary paths of the virus.
• Validating some possible evolutionary paths by using the MSA of the samples that take part in those paths and finding correlations among their co-occurring mutations.
• Using these correlations as a measure of distance and running hierarchical clustering.

Materials and methods
In this study, we present a novel method for detecting patterns of co-occurring mutations beyond the strain-specific / strain-defining ones, and for making use of those patterns in an endeavor to group samples in such a way that they could indicate evolutionary paths of the virus. In other words, mutational occurrence patterns might suggest different ways of grouping samples, revealing the evolutionary history of SARS-CoV-2.

Data collection
For this study, 5411 VCF files and their metadata were deposited in the European Nucleotide Archive (ENA) at EMBL-EBI under accession number PRJEB44141. The severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1 complete genome (NCBI Reference Sequence: NC_045512.2) was obtained from the NCBI database and used for the alignment of the raw data, the variant calling, the consensus sequence generation (alternative genomes), and the remaining algorithmic procedures described in the rest of this study. It is worth mentioning that the raw sequence alignment and the variant calling were not part of this study and had been performed previously; we obtained the output (VCF files [5]) of the variant calling pipeline. SnpEff [4], a genetic variant annotation and functional effect prediction toolbox, was then used to annotate all the VCF files at the amino acid level [19]. In Table 1, the second column shows the WHO labels, the third gives a description, mostly regarding the area of dominance of those lineages, and the fourth column gives the number of occurrences of each lineage in the entire dataset [14].

Data summary
The dataset consists of 5411 VCF files and their metadata. Files whose sequences could not be classified were removed; to achieve that, the pangolin tool was employed, and only the sequences that could be classified were kept. A summary of the lineages assigned to the sequences used is given in Table 1. The 4,500 sequences kept were then aligned to the SARS-CoV-2 genome using MAFFT [10] and used for the rest of the algorithmic procedures, as explained in detail in the next chapters. (Figure caption: mutated sites are represented with "1" or "2" according to whether they are non-characteristic or characteristic mutations, respectively; unmutated sites are represented with "0".)
A vast amount of invaluable information, such as the sample id, collection date, geographic location and/or geographic region, taxonomy, host sex, a small description of how the viral sequences were retrieved from human hosts, and other fields, is provided in the form of metadata in XML format by the ENA database. A Python script was then used to clean and transform the XML format into a human-readable CSV format. In addition to those fields, the dynamic nomenclature assignment for SARS-CoV-2 given by the pangolin tool is provided for each sample as a completely separate CSV file. We merged those CSVs for simplicity; a detailed list with the exact column names of the CSV metadata file is provided in Table 2, which must be followed exactly should the reader wish to reproduce the results.
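The cleaning step can be sketched as follows with Python's standard library; the XML tag names and field names below are illustrative placeholders, not the exact ENA metadata schema.

```python
# Sketch of the XML-to-CSV metadata cleaning step (hypothetical schema).
import csv
import io
import xml.etree.ElementTree as ET

xml_data = """<SAMPLE_SET>
  <SAMPLE accession="ERS0000001">
    <COLLECTION_DATE>2021-03-01</COLLECTION_DATE>
    <LOCATION>Greece</LOCATION>
  </SAMPLE>
</SAMPLE_SET>"""

rows = []
for sample in ET.fromstring(xml_data):
    rows.append({
        "sample_id": sample.attrib["accession"],
        "collection_date": sample.findtext("COLLECTION_DATE"),
        "geographic_location": sample.findtext("LOCATION"),
    })

# Write the flattened records as human-readable CSV
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

A merge with the separate pangolin CSV would then simply join the two tables on the sample id column.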

Modeling
A data model is an abstract mathematical representation that assembles the elements of a dataset according to a standard or template format. Data models are also used to describe the data themselves, how the elements relate to one another, and the properties of a real-world problem. As described in the sections above, our dataset consists of raw SARS-CoV-2 genomes. We can access these raw sequences either through the FASTA file, which contains the actual sequences, or through the VCF files, where only the specific mutations and their characteristics are stated relative to the reference genome. The reason why we need a model to represent our data should now be clear: given that the data are unstructured and quite complex to analyze with different algorithms, we needed a very simple model, a well-structured and easy-to-manipulate mathematical representation of them. Preferably a model that consumes a small amount of computer memory, not only because we have a few thousand genomes, but also because our initial goal was to provide a tool as generic as possible, capable of analyzing large datasets on a personal computer or even a laptop.
At this point, we will be explaining the model that we chose to work with, its advantages and disadvantages over other models that exist in the bibliography, and most importantly what are the necessary assumptions one has to take for granted to make use of such a model.
Given that the mutations' locations are stated relative to the reference genome, one can model the raw mutated sequences using a binary model of the same length as the reference genome. Simply put, one uses a string representation per sequence, with the same length as the reference genome. This string must be capable of encoding all the different states in which a site can physically exist. In our case, three different symbols are used, zeros, ones, and twos, one for each of the three states: unmutated sites, non-characteristic mutations, and characteristic mutations as reported by the pangolin tool (Fig. 4). Although the abovementioned model is not strictly binary, because of the use of three symbols instead of two, from now on we will call it binary, as one could use a combination of only the first two symbols, zeros and ones, to describe the third state. The decision to use three symbols was made simply to ensure a crystal-clear distinction among the states. It is worth mentioning that by using such a model, and by keeping the length of the string representations constant relative to the reference genome, only deletions and substitutions can be taken into consideration, while insertions must be discarded; otherwise the length of the model would not be the same for each sequence. As a result, our binary model lacks the ability to model insertions, and this is indeed its main drawback. Despite the loss of some information, this type of modeling seems sufficient, at least at the time of writing, since a careful examination of the characteristic mutations reported by the pangolin tool showed that not many insertions had been reported. Hence, we can safely use a model that is only able to represent deletions and substitutions.
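A minimal sketch of this encoding, with hypothetical 0-based mutation positions:

```python
# Sketch of the three-symbol "binary" model: given substitution/deletion
# positions relative to the reference, emit a 0/1/2 string of reference length.
def encode(ref_len, noncharacteristic, characteristic):
    """noncharacteristic / characteristic: sets of 0-based mutated positions."""
    states = ["0"] * ref_len
    for pos in noncharacteristic:
        states[pos] = "1"
    for pos in characteristic:  # characteristic mutations are marked "2"
        states[pos] = "2"
    return "".join(states)

print(encode(10, noncharacteristic={2, 7}, characteristic={5}))  # 0010020100
```

For the real data, ref_len would be the ~30 kb length of NC_045512.2 and the positions would come from the annotated VCF files.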

Dimensionality reduction and clustering on the binary model
Due to the nature of the problem of modeling and analyzing raw sequences represented by a binary model, there are many ways to manipulate the data, treating either the columns of the binary table as features and the rows as records, or vice versa. In an attempt to identify correlations among different coordinates, or simply put sites, of the virus genome, one could, completely intuitively, define each site of the genome as a variable and the samples (rows of the table) as records. We then end up with a dataset of almost 30k variables, as many as there are sites, which without further modification is quite difficult to manipulate because of its high dimensionality. On the other hand, one could consider each site as a variable that changes as we move along the samples. In other words, each variable could be considered a time series that changes as we move from one sample to the next, under the assumption that the samples are in sequential temporal order (Fig. 6A), which, as we have already discussed, is impossible to know in a pandemic setting. In order to visualize the dataset and make it more understandable, a variety of dimensionality reduction methods were tried. A comparison of them is not presented here, as that would be beyond the scope of this report. TSNE with two components was employed for the task of lowering the dimensionality of the data, providing a 2-dimensional representation and making it easier for us to visualize, understand, and form clusters of closely related samples. Besides, TSNE performed best in comparison with the other methods, in both its ability to separate distant samples and its execution time.
In an attempt to produce two completely different and independent 2D spaces, one able to separate samples based on the characteristic mutations and the other based on the non-characteristic mutations, TSNE was used twice: once taking into account only the non-characteristic mutations and once taking into account only the characteristic mutations. As a result, two different 2D spaces were calculated, the first based on the character "1" (Fig. 7) and the second based on the character "2" (Fig. 8). From now on, we will use the notation "based on 1s" or "based on 2s" when referring to the clustering made by considering only the non-characteristic or the characteristic mutations, respectively.
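The separation into the two feature spaces can be sketched as follows, using dummy encoded strings; each resulting matrix would then be fed to its own TSNE run.

```python
import numpy as np

# Two hypothetical encoded samples over a 10-site genome (dummy values)
encoded = np.array([list("0010020100"),
                    list("0110020000")], dtype=int)

X1 = (encoded == 1).astype(float)  # feature matrix "based on 1s"
X2 = (encoded == 2).astype(float)  # feature matrix "based on 2s"

print(X1.sum(), X2.sum())  # counts of non-characteristic vs characteristic sites
```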
Regarding the 2D space of the data considering only the characteristic mutations (Fig. 8), one can clearly distinguish how well separated the clusters of different lineages are. The fact that the groups of lineages are almost perfectly separated from each other is striking but not completely unexpected. This is proof not only of how powerful TSNE can be in separating sequences in some cases, but also of how precisely the mutations were selected as the main features of the pangolin classifier, since they separate the sequences into groups so well.

Table 3. Samples in long format. Each record in this table corresponds to a sample, the name of which is shown in the first column. The second and third columns are the maps M_1, M_2, which for a given sample name return its cluster based on "1"s and "2"s, respectively. The fourth column is the map M_3, which given a sample name returns its assigned lineage.
Since TSNE keeps closely related samples nearby within the 2D space, the next logical step is to use an unsupervised machine learning technique to identify clusters of interest in the reduced-dimensional spaces. It is necessary to build an automated method of identifying high-density clusters, even if those clusters consist entirely of samples classified as belonging to the same lineage. The reason, as already mentioned, is that it is very challenging to put the samples in sequential temporal order; therefore, groups of closely related samples must be formed under the assumption that a "jump" between nearby groups can indicate a possible underlying evolutionary path of the virus.
For that purpose, the Scikit-learn implementation of the DBSCAN algorithm was used with the default interface parameters except for the "eps" value. Eps is defined as "the maximum distance between two samples for one to be considered as in the neighborhood of the other" [16] and was set equal to 5 using a simple trial-and-error approach. DBSCAN performed almost perfectly in the case of the characteristic mutations and satisfactorily in the case of the non-characteristic mutations, as shown in Fig. 7, since most of the large distinct groups of the same lineage were identified and high-density sub-areas within groups of the same lineage were caught.
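A minimal sketch of this step on synthetic 2D points standing in for the TSNE embedding; the blob coordinates are dummy values, while eps = 5 follows the text and min_samples = 5 is scikit-learn's default.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two well-separated dense blobs plus one far-away point (noise)
blob_a = rng.normal(loc=(0, 0), scale=1.0, size=(30, 2))
blob_b = rng.normal(loc=(50, 50), scale=1.0, size=(30, 2))
outlier = np.array([[200.0, 200.0]])
points = np.vstack([blob_a, blob_b, outlier])

labels = DBSCAN(eps=5, min_samples=5).fit_predict(points)
print(set(labels.tolist()))  # two clusters plus the noise label -1
```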

Graph construction
Positions of concern (PoCs) As a result, each cluster, regardless of whether it was formed based on "1"s or "2"s, consists of a set of samples and their assigned lineages (see section 4.4).
We define a PoC as the position of a nucleotide that appears to be mutated with non-zero frequency along the samples of the same lineage of one particular cluster based on "1"s or "2"s. As a result, we construct a table whose column names follow the notation below, where:
• lineage is the lineage;
• x is the unique identifier of the particular cluster;
• y is an index that shows based on which symbol the clustering has been done.
The columns named by equation 1 provide the mutated sites, while the columns named by equation 2 provide the counts for those sites. Once again, only the sites with non-zero counts, i.e. frequency > 0, are stored in this table.
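A sketch of how the two columns of the PoCs table could be populated for one (lineage, cluster, symbol) group, using dummy encoded samples:

```python
from collections import Counter

# Hypothetical samples of one lineage within one cluster, each encoded
# as a 0/1/2 string. Positions mutated with frequency > 0 become PoCs.
samples = ["0010020100", "0010020000", "0000020100"]

counts = Counter()
for s in samples:
    for pos, state in enumerate(s):
        if state != "0":
            counts[pos] += 1

poc_positions = sorted(counts)                   # the "mutated sites" column
poc_counts = [counts[p] for p in poc_positions]  # the matching counts column
print(poc_positions, poc_counts)  # [2, 5, 7] [2, 3, 2]
```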
Example 4.1. Let us assume lineage = B.1.1.7, x = 1, and y = 2, which means that we are looking for the PoCs of the samples with lineage B.1.1.7 that belong to cluster 1 based on "2"s. Given equations 1 and 2, the two column names and the resulting PoCs table are shown below. All data mentioned in this example are dummy values and do not represent any real information.

Let S be the set of samples, A_1 and A_2 the sets of cluster identifiers based on "1"s and "2"s, and L the set of assigned lineages, where n(X) denotes the number of elements in a set X. Let also M_1 : S → A_1, M_2 : S → A_2, and M_3 : S → L be three maps that, given a particular sample s_i, return the cluster the sample belongs to based on "1"s and "2"s, and its assigned lineage, respectively. Table 3 shows how the abovementioned sets are structured. We define, for each (a_1, a_2) ∈ A_1 × A_2, the following subset of S:

S_{a_1,a_2} = { s ∈ S : M_1(s) = a_1 and M_2(s) = a_2 }   (3)

Nodes
and the following subset of L:

L_{a_1,a_2} = { M_3(s) : s ∈ S_{a_1,a_2} }   (4)

S_{a_1,a_2}, as defined by equation 3, is the subset of S whose elements are all the samples assigned to the particular pair of clusters (a_1, a_2), while L_{a_1,a_2}, as defined by equation 4, is the subset of L whose elements are the assigned lineages of the corresponding samples S_{a_1,a_2}. Given each combination of clusters (a_1, a_2) and the lineages L_{a_1,a_2} of the included samples, one can easily use a_1, a_2, L_{a_1,a_2}, equations 1 and 2, and the PoCs table in order to determine PoCs_{a_1} and PoCs_{a_2} for each lineage ∈ L_{a_1,a_2}. After that, for each lineage ∈ L_{a_1,a_2}, we used the PoCs of a_1 and a_2 separately (clarification: saying a_1 and a_2 means cluster a_1 based on "1"s and cluster a_2 based on "2"s) in an attempt to find out which of these positions of concern appear as SNPs in the samples S_{a_1,a_2}, by searching for SNPs at PoCs_{a_1} and PoCs_{a_2} along the samples S_{a_1,a_2}. When an SNP whose position belongs to PoCs_{a_1} or PoCs_{a_2} is found among the samples S_{a_1,a_2}, a new record is added to the data structure (Table 4), in the "based on '1's SNPs" or "based on '2's SNPs" column, according to whether the position of that SNP belongs to PoCs_{a_1} or PoCs_{a_2}.

Edges
We define:

Ē = { {x, y} : x, y ∈ S_{a_1,a_2}, x ≠ y }   (5)

While the cartesian product of S_{a_1,a_2} with itself, denoted S²_{a_1,a_2} = S_{a_1,a_2} × S_{a_1,a_2}, is the set of all ordered pairs (x, y), the set Ē is defined above (equation 5) as the cartesian product of S_{a_1,a_2} with itself using only unordered pairs, under the constraint x ≠ y. E is simply a subset of Ē, the purpose of which will become clear later in this chapter.
Example 4.2. Supposing S_{a_1,a_2} = {a, b}, the cartesian product of S_{a_1,a_2} with itself will be S²_{a_1,a_2} = {(a, a), (a, b), (b, a), (b, b)}. According to equation 5, Ē, which once again is the cartesian product of S_{a_1,a_2} with itself using only unordered pairs under the constraint x ≠ y, will be Ē = {{a, b}}, and E ⊆ Ē is simply a subset of Ē.
In that way, we have constructed the set Ē, whose elements are the combinations of elements of S_{a_1,a_2} as unordered pairs, excluding pairs of the same element. To put it simply, given that S_{a_1,a_2} is composed of the records of Table 4, Ē gives us all the unique combinations among them, in pairs of two, without taking into consideration either the order or the pairing of an element with itself.
As we will see in detail in the following section, assuming that the elements of S_{a_1,a_2} are nodes in an undirected simple graph, E is a set of edges, each of which is associated with two distinct nodes. Now, let us come back to the definition of the edges. Even though the set Ē has already been expressed in set-builder notation and we have a well-defined mathematical description of this set to work with, we have not yet mentioned how the elements of the subset E are selected from the set Ē of equation 5. For that purpose, algorithm 1 is used.
Given the combinations of nodes Ē, as defined by equation 5, algorithm 1, for each pair of nodes {x, y} ∈ Ē (for loop in line 1), retrieves the lineages L_x, L_y (lines 3, 4) and, for each pair of lineages l_x ∈ L_x, l_y ∈ L_y, calculates the common positions of concern (for loops in lines 5, 6). Once these loops over the lineages L_x and L_y are done, the algorithm checks whether there has been at least one position of concern in common (line 12). If one or more common positions of concern are found within the pair of nodes {x, y}, then an edge between those two is considered and stored in the subset E (line 13). The weight of that edge is defined as the number of common positions of concern (line 14). It now becomes clear why E is a subset of Ē: the algorithm connects with edges only those pairs of nodes {x, y} that have at least one position of concern in common.
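The edge-selection rule of algorithm 1 can be sketched as follows, with hypothetical node names, lineages, and PoC sets:

```python
from itertools import combinations

# Hypothetical PoC sets keyed by (node, lineage)
pocs = {
    ("n1", "B.1.1.7"): {100, 250, 300},
    ("n2", "B.1.1.7"): {250, 300, 777},
    ("n3", "AY.7"):    {5000},
}
# Lineages present in each node
nodes = {"n1": ["B.1.1.7"], "n2": ["B.1.1.7"], "n3": ["AY.7"]}

edges = {}
for x, y in combinations(nodes, 2):      # unordered pairs, x != y
    common = set()
    for lx in nodes[x]:
        for ly in nodes[y]:
            common |= pocs[(x, lx)] & pocs[(y, ly)]
    if common:                           # keep the edge only if PoCs overlap
        edges[(x, y)] = len(common)      # weight = number of common PoCs
print(edges)  # {('n1', 'n2'): 2}
```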
Miscellaneous calculations are not shown in algorithm 1 for simplicity. In the actual implementation of the algorithm, additional data beyond the weight are calculated for each edge {x, y}. For instance, a small table accompanies each edge, containing, for each given pair l_x, l_y, the common PoCs_{x,y} of the connected nodes and their minimum occurrence frequency.

Example 4.3. Given an element {x, y} of E, which represents an edge connecting nodes x and y, the following PoCs_{x,y} have been found in common: All data mentioned in this example are dummy values and do not represent any real information.

Graph definition
In the previous chapters, S_{a_1,a_2} was defined, which for a given tuple (a_1, a_2) represents a node of a graph. Each of those nodes is composed of a number of samples of some given lineages; these samples belong to the corresponding tuple of clusters (a_1, a_2) based on "1"s and "2"s, respectively. In addition, we have defined E, a set whose elements are carefully selected edges, each of which connects two distinct nodes, as described in the previous chapter. The edges in E have been selected in such a way that they connect a pair of nodes if and only if those nodes have at least one position of concern in common.
Let

G = (S_{a_1,a_2}, E)   (7)

be an undirected graph, or in other words a tuple comprising:
• S_{a_1,a_2}, a set whose elements are called vertices (or nodes);
• E, a set of paired vertices, whose elements are called edges.
Despite the fact that S_{a_1,a_2} and E are used to define the graph (see eq. 7) and, by definition, these two sets consist of a group of samples for each element of S_{a_1,a_2} and an edge connecting two elements of S_{a_1,a_2} for each element of E, in reality, as has already been mentioned, both the nodes and the edges are accompanied by additional supplementary information.

Fig. 9. Undirected graph. MCL clustering has not yet been applied to this graph. One can easily see that it is almost fully connected. Underneath each node are its unique id and the lineages assigned to its samples. The nodes are connected to each other according to algorithm 1.
More specifically, for each node the following are calculated and shown in a small table that accompanies it:
• the samples S_{a_1,a_2} of which the node is made up;
• the lineages L_{a_1,a_2} assigned to those samples;
• the percentage content of the constituent lineages (e.g., if in a particular node 50 out of 100 samples are B.1.1.7 and the remaining 50 are AY.7, the percentage content will be 50-50%);
• the positions of concern PoCs_{a_1} and PoCs_{a_2} for each of the assigned lineages;
• the frequencies or counts of the positions of concern, as they appear from the point of view of the entire clusters a_1 and a_2 (not only the samples of this particular node);
• the SNPs found at some or all of the positions of concern, as well as how many at each position.
As for the edges, they are accompanied by a corresponding table such as the one already shown in example 4.3. The connecting power of an edge is given, by definition, by its weight, and for visualization purposes it is shown as the thickness of the line connecting two distinct nodes {x, y}.
A very convenient graph UI might be used, so that when a user hovers the mouse over a node/edge, all the necessary information appears in a pop-up table. To better represent the constructed graph, Pyvis, a Python library for visualizing networks, is used. That library is capable of using small icons instead of plain circles as nodes, and of drawing edges with variable thickness, which makes it extremely useful for our purposes.

MCL
By visually examining the structure of the produced network, we end up with a few rather obvious conclusions. First, the graph appears to be almost fully connected. This should be expected if one considers that in our dataset the mutation rate is relatively low compared with the number of samples, and the majority of the non-characteristic mutations circulate among them. This brings us to the conclusion that between two groups of different samples (graph nodes) there should be at least some mutations in common, and thus we end up with an edge connecting those two nodes. As seems to be the case, the high interconnectivity of nodes leads to a highly interconnected graph. Second, because the graph appears to be fully connected, it is very intricate to draw conclusions, and even harder to intuitively recognize high-density clusters of interconnected nodes.

Fig. 10. The entire range of values of the discrete variable w_{x,y} is divided into a series of non-overlapping intervals (bins) of arbitrary length for better visualization. One can see that the vast majority of the weights is gathered between 0 and roughly 32. Note that the reason why it seems there are no weights (counts = 0) with values greater than 32 is that the number of such weights is negligible.
For the above-mentioned reasons, a computational method is needed to formulate clusters of high-density groups of nodes in the graph; MCL is employed to do so. We ran the MCL algorithm with the default input parameters on our model graph, in an attempt to identify clusters and prune some edges between them. The ultimate goal was to end up with a network that is no longer fully connected. One could say that the aim is to "relax" the bonds among these groups of nodes, ending up with strongly interconnected groups of nodes that are weakly connected to each other.
MCL takes into consideration the graph G, as defined by equation 7, and the weights w_{x,y} of each pair of nodes {x, y}. According to the documentation, MCL is able to process weighted undirected graphs; the weights must be non-negative numbers, and it is explicitly stated that they have to represent similarities rather than dissimilarities. With that being said, and since our weights, as defined in the previous chapter, are positive integers, MCL was applied. In the first few attempts, the results were unexpectedly bad, since MCL was incapable of distinguishing any clusters at all. It soon became clear that the underlying problem lay in how the weights were distributed rather than in MCL itself.
Since an edge's weight represents the number of common mutations, for each pair of lineages, between the nodes {x, y}, one can consider w_{x,y} as a variable whose distribution is given by Fig. 10. Let

ŵ_{x,y} = (w_{x,y} - x̄) / σ   (8)

be the standardized weights, where x̄ is the arithmetic mean of the weights and

σ = sqrt((1/N) Σ_{i=1..N} (x_i - x̄)²)

is the population standard deviation of a discrete random variable, written in summation notation. Given the Normal Distribution Probability Density Function (PDF) with mean µ and variance σ²,

f_{µ,σ²}(x) = (1 / (σ sqrt(2π))) exp(-(x - µ)² / (2σ²))   (9)

the Normal Distribution Cumulative Density Function (CDF), Φ_{µ,σ²}, is calculated as the integral of equation 9. Such integrals cannot be expressed in terms of elementary functions and are often treated as special functions; many numerical approximations are known, but they are beyond the scope of this thesis and are therefore not shown. Let w̃_{x,y} = Φ_{µ,σ²}(ŵ_{x,y}) be the normalized weights, in the sense of transforming the standardized weights so that they follow the Normal CDF. The distribution of these transformed weights w̃_{x,y} is shown in Fig. 11. Two effects are worth noting. Firstly, all weights are now more spread out than they were before the transformation: those that used to be closer to 0 have collapsed towards 0, while those that used to be closer to the maximum value have collapsed towards the new maximum value. Secondly, all weights are now bounded, from a minimum allowed value of 0 up to 400, which according to the MCL documentation is an appropriate range.

We ended up using w̃_{x,y}, accompanying each pair of nodes {x, y}, as the new input to the MCL algorithm. The decision to inflate the gaps and stretch the distribution of the weights in figure 10, through the method discussed in this chapter, dramatically increased MCL's ability to form highly interconnected clusters of nodes. Since MCL is an unsupervised clustering technique, there is no direct way to validate the results. On the other hand, one can use the percentage content of the constituent lineages of each node of the graph in order to partially validate the MCL clusters.
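The weight transformation can be sketched as follows; the standardization and Normal-CDF mapping follow the text, while the final scaling to the quoted 0-400 range is an assumption about how that bound was obtained.

```python
import numpy as np
from math import erf, sqrt

def transform(weights, scale=400):
    """Standardize the edge weights (population std, ddof=0), map them
    through the standard Normal CDF (via erf), and scale to 0..`scale`.
    `scale=400` follows the range quoted in the text; it is an assumption."""
    w = np.asarray(weights, dtype=float)
    z = (w - w.mean()) / w.std()                      # standardized weights
    cdf = 0.5 * (1 + np.vectorize(erf)(z / sqrt(2)))  # Phi(z) in [0, 1]
    return scale * cdf

print(transform([1, 2, 2, 3, 30]).round(1))
```

Small weights collapse towards 0 and large ones towards the upper bound, which is exactly the stretching effect described above.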
To keep things as simple as possible we took into consideration only those nodes that consist of a dominant lineage with a percentage content of at least 90% or more. What we concluded is that nodes consisting of the same dominant lineages are gathered into distinct MCL clusters, which strongly indicates that the MCL algorithm is performing well on our dataset after the discussed transformation of the weights.
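As a sketch, the weight transformation described above can be reproduced with NumPy and SciPy; the helper name and the toy weights are our own, and the final rescaling to [0, 400] follows the range mentioned above:

```python
import numpy as np
from scipy.stats import norm

def transform_weights(weights):
    """Standardize the edge weights, push them through the Normal CDF,
    and rescale to the 0-400 range suggested by the MCL documentation."""
    w = np.asarray(weights, dtype=float)
    z = (w - w.mean()) / w.std()   # population standard deviation (ddof=0)
    u = norm.cdf(z)                # values in (0, 1), gaps inflated
    return 400.0 * u               # bounded in [0, 400]

# toy weights: most pairs share few mutations, a few share many
print(transform_weights([1, 2, 2, 3, 3, 3, 40, 45, 50]).round(1))
```

The transformation is monotone, so the ordering of the weights is preserved while small differences near the mean are stretched apart.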

Arrow edges and co-occurring mutations
Co-occurring mutations, in terms of shared positions of concern, have already been determined through the methodology of chapter 4.5. Although these positions of concern provide insightful information for grouping different nodes of samples, they do not provide any information regarding the evolution of the virus. On the other hand, an algorithmic procedure that identifies, between two different nodes, an actual mutation appearing in only one node and not in the other at the same position of concern could be used to draw an arrow from the node where the mutation is missing to the node where the mutation pops up. As explained in detail later in this chapter, these arrows can be used to construct a directed graph indicating evolutionary paths of the virus.

Nodes
Given A_1 and A_2, the collections of the unique cluster identifiers based on "1s" (non-characteristic mutations) and on "2s" (characteristic mutations) respectively, we define, for each a_1 ∈ A_1 and for each a_2 ∈ A_2, the following sets:

S_{a_1} = { S_{a_1,a_2} : a_2 ∈ A_2, S_{a_1,a_2} ≠ ∅ }    (11)

S_{a_2} = { S_{a_1,a_2} : a_1 ∈ A_1, S_{a_1,a_2} ≠ ∅ }    (12)

S_{a_1} and S_{a_2} are sets of nodes for a fixed a_1 or a_2, respectively. We remind the reader that each node is the set of samples S_{a_1,a_2}, as defined by equation 3, for a given pair (a_1, a_2) ∈ A_1 × A_2.
Example 4.4. Given a collection of nodes {S_{0,2}, S_{1,2}, S_{3,0}}, each of which is a subset of S as equation 3 defines, and assuming that none of those nodes is an empty set, equation 11 will give:

S_{a_1=0} = {S_{0,2}}    (13)
S_{a_1=1} = {S_{1,2}}    (14)
S_{a_1=3} = {S_{3,0}}    (15)

and equation 12 will give:

S_{a_2=2} = {S_{0,2}, S_{1,2}}    (16)
S_{a_2=0} = {S_{3,0}}    (17)

In other words, keeping the cluster fixed, either based on "1s" or on "2s", equation 11 gives the group of nodes that belong to a fixed cluster a_1 based on "1s", regardless of which clusters those nodes belong to based on "2s"; vice versa, equation 12 gives the group of nodes that belong to a fixed cluster a_2 based on "2s", regardless of which clusters those nodes belong to based on "1s". As we move forward in this chapter, we will refer to those sets as the S_{a_1} and S_{a_2} of equations 11, 12, implying that we have already fixed the cluster a_1 or a_2, respectively. Example 4.4 demonstrates exactly this behavior.
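A minimal sketch of equations 11 and 12, using the nodes of example 4.4 as dummy data (the dictionary layout and variable names are our own assumptions):

```python
from collections import defaultdict

# Nodes keyed by their pair of cluster identifiers (a1, a2); the values
# stand in for the sample sets S_{a1,a2} (dummy data).
nodes = {(0, 2): {"s1"}, (1, 2): {"s2", "s3"}, (3, 0): {"s4"}}

S_a1 = defaultdict(set)   # groups with a1 fixed (equation 11)
S_a2 = defaultdict(set)   # groups with a2 fixed (equation 12)
for (a1, a2) in nodes:
    S_a1[a1].add((a1, a2))
    S_a2[a2].add((a1, a2))

# e.g. S_a1[1] == {(1, 2)} while S_a2[2] == {(0, 2), (1, 2)}
print(dict(S_a1))
print(dict(S_a2))
```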

Directed edges
Algorithm 2 is used to calculate the arrows between pairs of nodes. An arrow is drawn from node x to node y if they have at least one PoC in common which is unmutated along the samples of the former node and mutated along the samples of the latter. In that case, node x becomes the starting point of the arrow and node y its ending point.
Not all possible pairs of nodes are candidates for this method. More specifically, only those nodes that belong to a group S_{a_1} ≠ ∅, for a fixed a_1 ∈ A_1, are allowed to be connected to each other; respectively, only those nodes that belong to a group S_{a_2} ≠ ∅, for a fixed a_2 ∈ A_2, are allowed to be connected to each other. If an arrow is drawn between a pair of nodes belonging to a group given by equation 11, it is marked in blue and corresponds to a non-characteristic mutation popping up while moving from one node to the other. Alternatively, if an arrow is drawn between a pair of nodes belonging to a group given by equation 12, it is marked in red and corresponds to a characteristic mutation popping up while moving from one node to the other.
Having a closer look at example 4.4, it becomes clear that only nodes belonging to a specific group, given by one of the equations 13-17, are allowed to be connected to each other. For instance, nodes S_{0,2} and S_{1,2} of equation 16 may be connected to each other with a red arrow, yet they may not be connected to any of the nodes of equations 13, 14, 15, 17. In this particular example, only the nodes of equation 16 can be connected to each other, simply because there are two of them; none of the remaining groups can form arrows among their nodes, since each consists of a single node. It becomes quite obvious that an arrow can be drawn if and only if there are enough nodes inside a group to be connected to each other. In other words, the following constraint must be imposed:

n(S_{a_1}) ≥ 2, respectively n(S_{a_2}) ≥ 2    (18)

We define:

E_{a_1} = { (x, y) : x, y ∈ S_{a_1}, x ≠ y }    (19)

E_{a_2} = { (x, y) : x, y ∈ S_{a_2}, x ≠ y }    (20)

two sets of all plausible edges (also called directed edges, directed links, directed lines, arrows or arcs), which are ordered pairs of nodes (that is, each edge is associated with two distinct nodes). E_{a_1} consists of all possible blue edges, taking into consideration only non-characteristic mutations ("1"s), while E_{a_2} consists of all possible red edges, taking into consideration only characteristic mutations ("2"s). Each ordered pair (x, y) represents a directed edge, or arrow, from node x to node y. Nodes x and y are called the endpoints of the edge, x the tail of the edge and y the head. The edge (y, x) is called the inverted edge of (x, y), and it is permitted. Multiple edges, that is two or more edges with both the same tail and the same head, are not allowed under the definition of equations 19, 20. Loops, i.e. edges joining a node to itself, are not permitted either, since a directed loop would be of the form (x, x), which is in neither E_{a_1} nor E_{a_2} under the constraint x ≠ y. Lines 1-22 and 24-45 of algorithm 2 are responsible for selecting two subsets of blue and red directed edges E′_{a_1}, E′_{a_2}, based on "1s" and "2s" respectively.
In practice, these two blocks of code were implemented inside a for loop running twice, once for the "1s"-based implementation and once for the "2s"-based one. The selected blue directed edges are stored in the subset E′_{a_1} ⊆ E_{a_1} and the selected red directed edges in the subset E′_{a_2} ⊆ E_{a_2}, as records in the form of ordered pairs (x, y).
The key points of algorithm 2 are lines 11, 13 for blue edges and lines 34, 36 for red edges, respectively. Iterating through all combinations of nodes (x, y) in S_{a_1} or S_{a_2} and, consequently, through each PoC of nodes x and y, we check whether: • the number of SNPs in those PoCs of x is zero (lines 11, 34); • the number of SNPs in the PoCs of y that are in common with those of node x is greater than zero (lines 13, 36).
If the above-mentioned criteria are fulfilled, a new ordered pair (x, y) is added to E′_{a_1} or E′_{a_2} (blue or red directed edge, respectively).
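The edge-selection criterion can be sketched as follows, for a single pair of nodes and with dummy SNP counts mirroring example 4.5 (the function and data names are our own):

```python
from itertools import permutations

# Dummy SNP counts per position of concern:
# poc_snps[node][position] = number of SNPs observed across the node's samples.
poc_snps = {
    "x": {209: 0, 1002: 0},
    "y": {209: 2, 1002: 4},
}

def directed_edges(nodes, poc_snps):
    """Yield ordered pairs (tail, head): an arrow is drawn from x to y when
    some PoC they share is unmutated in x (zero SNPs) but mutated in y."""
    for x, y in permutations(nodes, 2):
        common = poc_snps[x].keys() & poc_snps[y].keys()
        if any(poc_snps[x][p] == 0 and poc_snps[y][p] > 0 for p in common):
            yield (x, y)

print(list(directed_edges(["x", "y"], poc_snps)))  # [('x', 'y')]
```

With these counts, only the arrow x → y is produced, since node y has no PoC that is unmutated in y but mutated in x.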
Miscellaneous calculations are not shown in algorithm 2. In practice, additional data accompanying each directed edge in E′_{a_1} or E′_{a_2} are calculated as well. For instance, the specific SNPs and the number of each SNP's copies in each PoC of the pointed-to node y are composed into a table unique to each directed edge (x, y), as shown in example 4.5.

Directed graph definition
Having defined S_{a_1}, S_{a_2} (see eq. 11, 12) and constructed E′_{a_1}, E′_{a_2} through algorithm 2, as subsets of E_{a_1}, E_{a_2} (see eq. 19, 20), let

DG_1 = (S_{a_1,a_2}, E′_{a_1})    (21)

be a directed graph making use of blue edges, and

DG_2 = (S_{a_1,a_2}, E′_{a_2})    (22)

be a directed graph making use of red edges. In other words, they are tuples comprising: • S_{a_1,a_2}, a set whose elements are called vertices (or nodes); • E′_{a_1} or E′_{a_2}, sets of ordered pairs of nodes, called directed edges.
Although S_{a_1,a_2} and E′_{a_1}, E′_{a_2} are used to define these two graphs (see eq. 21, 22), and, by definition, these sets consist of groups of samples (in the case of S_{a_1,a_2}) and of directed edges connecting two elements of S_{a_1,a_2} at a time (in the case of E′_{a_1}, E′_{a_2}), in practice, as already mentioned, both the nodes and the edges are accompanied by additional supplementary information.
A small table accompanying each node is calculated, including exactly the same supplementary information as described in section "Graph definition" of chapter 4.5.
As for the directed edges, each is accompanied by a corresponding table such as those mentioned in example 4.5. For visualization purposes, the directed edges are drawn as colored arrows: blue in the case of the DG_1 graph and red in the case of DG_2.
There are two different directed graphs, DG_1 and DG_2. The reason two graphs were defined is that, with a single directed graph making use of both blue and red edges, there could be multiple edges from one node to another, and multiple edges are not permitted by the definition of a simple directed graph. In a more general sense, a graph allowing multiple edges could have been defined, yet things have been kept as simple as possible: two simple directed graphs were used instead, one with the blue directed edges and one with the red ones. Since the exact same nodes were used to define both graphs, we were able to visualize them together, one on top of the other, so that they appear as a single directed multigraph (figure 12).
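Assuming a graph library such as NetworkX, overlaying the two simple directed graphs into what looks like a single multigraph can be sketched as follows (node names and edges are dummy values):

```python
import networkx as nx

nodes = ["S_0_2", "S_1_2"]           # dummy node identifiers
blue_edges = [("S_0_2", "S_1_2")]    # from the "1"s-based analysis
red_edges = [("S_1_2", "S_0_2")]     # from the "2"s-based analysis

# Two simple directed graphs over the same node set...
DG1 = nx.DiGraph()
DG1.add_nodes_from(nodes)
DG1.add_edges_from(blue_edges)
DG2 = nx.DiGraph()
DG2.add_nodes_from(nodes)
DG2.add_edges_from(red_edges)

# ...overlaid as a multigraph for visualization, each edge tagged by color.
MG = nx.MultiDiGraph()
MG.add_nodes_from(nodes)
MG.add_edges_from(blue_edges, color="blue")
MG.add_edges_from(red_edges, color="red")
print(MG.number_of_edges())  # 2
```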

Multiple sequence alignment (MSA)
In section 4.7, we defined a graph whose edges connect nodes to each other in a useful way, as a tool for finding groups of samples closely related to each other and for identifying mutations popping up from one group of samples to another, in an attempt to gain an understanding of the evolution of SARS-CoV-2. As a matter of course, it is extremely difficult, if not impossible, to determine the evolutionary history of the virus or "who infected whom" [22]. For that reason, and because, as far as we are aware, at the time of writing there is no existing methodology capable of addressing that problem, we will try to validate the results obtained by the graph analysis and its directed edges using a multiple sequence alignment, as explained below.
Example 4.5. Given an ordered pair (x, y) ∈ E_{a_1} or E_{a_2}, which represents a possible directed edge connecting nodes x and y, their content is given by the following table [table not recovered], where the first group of samples corresponds to node x and the second to node y. It is worth mentioning that more than one SNP may be found in a single PoC; in that case the corresponding record would read, for example, "G/C * 1, A/T * 3". The PoCs in common between nodes x and y, based on "1"s and on "2"s, are:

PoCs^{x,y}_{a_1} = {209, 1002} and PoCs^{x,y}_{a_2} = {1002, 2763}

The numbers of SNPs found in the common PoCs based on "1"s (209 and 1002) are zero and zero for x, and 2 and 4 for y, respectively. Since there is at least one SNP popping up from node x → y in the same PoCs under the "1"s-based analysis, a directed edge will be drawn from x to y, accompanied by a table [table not recovered].

The numbers of SNPs found in the common PoCs based on "2"s (1002 and 2763) are zero and zero for y, and 0 and 8 for x, respectively. Since there is at least one SNP popping up from node y → x in the same PoCs under the "2"s-based analysis, a directed edge will be drawn from y to x, accompanied by the following table:

SNPs | counts | position
A/T  | 8      | 2763

All data mentioned in this example are dummy values and do not represent any real information.

Obtaining an accurate multiple sequence alignment of approximately 4,500 SARS-CoV-2 genomes of roughly 30,000 bases each soon made it clear that this is a tremendously computationally expensive task. It turns out that, by
employing the rapid calculation of full-length MSAs of closely related viral genomes, a mode of the MAFFT aligner (cite MAFFT), a full-length MSA of our SARS-CoV-2 genomes becomes a relatively easy computational task, taking only a few minutes on a regular PC. It is worth mentioning that, according to the MAFFT documentation, it is sometimes useful, and vastly faster, to align all sequences against just the reference genome in order to build a full MSA. However, by doing so, it is very likely that gaps will be inserted into the reference sequence where deemed necessary, hence altering its initial length. To avoid this behavior, the keeplength setting was used, since it ensures that no gaps are inserted into the reference sequence. Furthermore, maxambiguous was set to 1 to prevent removing sequences that contain ambiguous letters, while the ep (offset) value was set to 0.1, as recommended when no long gaps are expected.

Algorithm 2 Calculate directed edges

Require: ...
     ...
     if rest_of_nodes ≠ ∅ then
 5:      for y ∈ rest_of_nodes do
 6:          PoCs_x ← PoCs_x.a_1    ▷ positions of concern of node x based on "1s"
 7:          PoCs_y ← PoCs_y.a_1    ▷ positions of concern of node y based on "1s"
 8:          SNPs_x ← SNPs_x.a_1    ▷ SNPs of node x based on "1s"; they correspond to PoCs_x
 9:          SNPs_y ← SNPs_y.a_1    ▷ SNPs of node y based on "1s"; they correspond to PoCs_y
10:          for each (snp_x, PoC_x) in (SNPs_x, PoCs_x) do
11:              if n(snp_x) = 0 then    ▷ if zero SNPs are found at the corresponding PoC along the samples of node x
12:                  for each (snp_y, PoC_y) in (SNPs_y, PoCs_y) do
13:                      if n(snp_y) > 0 and PoC_x = PoC_y then    ▷ if more than zero SNPs are found at the same PoC along the samples of node y
14:                          ...
     ...
     if rest_of_nodes ≠ ∅ then
28:      for y ∈ rest_of_nodes do
29:          PoCs_x ← PoCs_x.a_2    ▷ positions of concern of node x based on "2s"
30:          PoCs_y ← PoCs_y.a_2    ▷ positions of concern of node y based on "2s"
31:          SNPs_x ← SNPs_x.a_2    ▷ SNPs of node x based on "2s"; they correspond to PoCs_x
32:          SNPs_y ← SNPs_y.a_2    ▷ SNPs of node y based on "2s"; they correspond to PoCs_y
33:          for each (snp_x, PoC_x) in (SNPs_x, PoCs_x) do
34:              if n(snp_x) = 0 then    ▷ if zero SNPs are found at the corresponding PoC along the samples of node x
35:                  for each (snp_y, PoC_y) in (SNPs_y, PoCs_y) do
36:                      if n(snp_y) > 0 and PoC_x = PoC_y then    ▷ if more than zero SNPs are found at the same PoC along the samples of node y
37:                          ...
     ...
45:  end for
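For reference, the MAFFT invocation with the settings described above can be assembled as follows; the file names are placeholders of our own, and the exact flag spellings should be checked against the MAFFT documentation:

```python
# Hypothetical input files; flags follow the settings discussed in the text:
# --keeplength keeps the reference length fixed (no gaps inserted into it),
# --maxambiguous 1 keeps sequences even if they contain ambiguous letters,
# --ep 0.1 sets the offset value recommended when no long gaps are expected.
cmd = [
    "mafft", "--auto",
    "--keeplength",
    "--maxambiguous", "1",
    "--ep", "0.1",
    "--addfragments", "genomes.fasta",   # the ~4,500 SARS-CoV-2 genomes
    "reference.fasta",                   # the reference genome
]
print(" ".join(cmd))
```

The list form can be passed to `subprocess.run` with the output redirected to the alignment file.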
The rapid calculation of full-length MSAs of closely related viral genomes mode of the MAFFT aligner was originally developed for other purposes, yet the SARS-CoV-2 outbreak made it clear that this method can be used efficiently and effectively on SARS-CoV-2 genomes, and it has become widely known in the community ever since. Placing trust in the community's wide usage of it on SARS-CoV-2 data, we used this method without further validating the resulting MSA.

Entropies -Mutual information gain
Having a closer look at graph 12, one can identify strongly connected groups of nodes with blue, red, or even both types of edges. One can even move/rearrange the nodes through the drag-and-drop mode of the GUI used, making it easier for the human eye to point out a few groups of interest, resulting in graph 15. Once a particular group of nodes has been identified (see the selected areas of graph 15), the node ids of that specific group are needed as input for the next step, and the samples that each node of interest consists of are then obtained. Following the same notation, we will be calling those nodes nodes of interest and their samples samples of interest. Given the full MSA comprising all samples, and the samples of interest, we slice the former, obtaining only the part that consists of the samples of interest, while maintaining all its required properties. At the same time, we obtain the positions reported by all directed edges, blue and red, that connect the nodes of interest to each other, and we store them in a suitable data structure. From now on, we will refer to those positions as sites of interest.

Figure 12 caption: A single directed multigraph with two types of edges: blue for mutations found by analyzing clusters formed by non-characteristic mutations, and red for mutations found by analyzing clusters formed by characteristic mutations. Each node's outer color (the circular ring) corresponds to the high-interconnectivity group that the node has been assigned to by MCL. The pie chart inside the ring shows the percentage content of the compound lineages the node consists of, while the compound lineages themselves are listed on the first line underneath each node. The second line underneath each node is its unique identifier.
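Slicing the MSA down to the samples of interest and pooling the edge-reported positions, as described above, can be sketched as follows (all structures and values are dummy placeholders):

```python
# Dummy full MSA: sample id -> aligned sequence.
msa = {
    "sample1": "ACGTACGT",
    "sample2": "ACGAACGT",
    "sample3": "TCGTACGT",
}
samples_of_interest = {"sample1", "sample2"}
edge_positions = {"blue": {3}, "red": {0, 3}}   # 0-based sites of interest

# Slice of the full MSA restricted to the samples of interest.
sub_msa = {s: msa[s] for s in samples_of_interest}
# Sites of interest pooled from all blue and red edges.
sites_of_interest = set().union(*edge_positions.values())
print(sorted(sites_of_interest))  # [0, 3]
```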
Possessing an MSA of only the samples of interest, and their sites of interest provided by the edges that connect the nodes of interest, we calculated the mutual information (MI) of each possible pair of the above-mentioned sites of interest. Let

MSA = [x_{i,j}], i = 1, ..., m

be the slice of the full MSA, with each x_{i,j} representing the j-th base/site/position of the i-th sample. Let also PoCs_blue, PoCs_red be the sites of interest that have been reported by all the blue and red edges, respectively, connecting nodes of interest. (In example 4.5 there are only two edges, one blue and one red; PoCs_blue and PoCs_red would be the position columns of the two accompanying tables. In the case of a group of nodes of interest involving more blue and red edges, PoCs_blue would be the union of all position columns of the tables reported by blue edges, and PoCs_red the union of all position columns of the tables reported by red edges.)

Figure 13 caption: The darker the color, the more closely related the corresponding sites are. All elements on the main diagonal are one, since they are the NMI calculated between a site and itself. Only the upper triangular part needed to be calculated, because the heatmap is square and completely symmetrical, saving a little computational time.
In order to obtain the total positions reported by all blue and red edges in the group of nodes of interest, we take the union of the positions reported by blue edges and those reported by red edges:

PoCs_{blue,red} = PoCs_blue ∪ PoCs_red    (24)

Equation 24 gives the total positions of interest or, in other words, the columns of the MSA that we are interested in. Then

Pairs_of_columns = { (k, l) | (k, l) ∈ PoCs²_{blue,red} }    (25)

consists of all possible combinations of two columns, given by the Cartesian product of PoCs_{blue,red} with itself. Considering the columns indexed by PoCs_{blue,red} as random variables, the probability of each character in a random variable (column) x_j, ∀j ∈ PoCs_{blue,red}, along the m samples of the MSA is given by the count of that character over the total count of all characters in the column. Since, in our case, the length of each column is constant and equal to m, and the value of the random variable can be one of A, T, G, C, the list of probabilities for a particular random variable/column is

P_j = [ #A/m, #T/m, #G/m, #C/m ], ∀j ∈ PoCs_{blue,red}    (26)

where the symbol # denotes the number of occurrences of the corresponding character in column j. Since the MSA is a multiple sequence alignment, there can be gaps among the sites of different samples, and thus the "-" symbol should, in principle, be included in the counts; for simplicity, though, we have not included it in our analysis.
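Equation 26 can be sketched as a small helper (the naming is our own; gap characters are skipped, as discussed above):

```python
from collections import Counter

def column_probabilities(column, alphabet="ATGC"):
    """Per-column character frequencies over the samples of the sliced MSA;
    characters outside the alphabet (e.g. the gap symbol '-') are ignored."""
    counts = Counter(b for b in column if b in alphabet)
    total = sum(counts.values())
    return {b: counts[b] / total for b in alphabet}

col = ["A", "A", "T", "G", "A", "-"]   # one site across m = 6 samples (dummy)
print(column_probabilities(col))       # {'A': 0.6, 'T': 0.2, 'G': 0.2, 'C': 0.0}
```

Note that skipping gaps makes the denominator the count of unambiguous bases rather than m; with no gaps present, the two coincide.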
Having obtained the part of the full MSA that relates to the group of nodes of interest, the next logical step is to use a metric to measure the dependency of the sites of interest in that sub-MSA. For that purpose, the mutual information is employed. Mutual information (MI) is a non-negative value that quantifies the dependency between two variables; in our case, the variables are discrete random variables. The higher the MI value, the more dependent the variables are. The normalized MI (NMI), a normalization of the MI score, is then calculated using the "normalized_mutual_info_score" method of the scikit-learn library, between all pairs of discrete random variables/columns given by equation 25. The result is a square d × d matrix, where d = n(PoCs_{blue,red}) is the cardinality of PoCs_{blue,red}, with values scaled from 0 to 1: 1 indicates perfect correlation, while 0 means no correlation at all. The matrix is d × d because the NMI is calculated between all pairs of the sites of interest given by PoCs_{blue,red}, and the number of elements of that set is the number of unique variables. Figure 13 shows a heatmap of the calculated NMI between all pairs of sites of interest in a group of nodes of interest.
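A sketch of the pairwise NMI computation with scikit-learn, on three dummy columns (two identical, one independent of them):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

# Dummy columns of the sliced MSA: site -> bases across the samples.
columns = {
    209:  list("AATTAA"),
    1002: list("AATTAA"),   # identical to site 209
    2763: list("AGAGAG"),   # independent of the other two
}
sites = sorted(columns)
d = len(sites)
nmi = np.ones((d, d))       # diagonal is 1: NMI of a site with itself
for i in range(d):
    for j in range(i + 1, d):   # upper triangle only; matrix is symmetric
        v = normalized_mutual_info_score(columns[sites[i]], columns[sites[j]])
        nmi[i, j] = nmi[j, i] = v
print(nmi.round(2))
```

Identical columns give an NMI of 1, while the statistically independent column yields an NMI of 0, reproducing the extremes of the heatmap's color scale.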
Hierarchical clustering on the sites of interest is then performed (figures 17, 16, 14, 19, 18, 20). Specifically, the "AgglomerativeClustering" method of the scikit-learn library is used with the default parameters, except for linkage, which is set to "complete". Agglomerative clustering needs a distance metric, where, by definition, a lower distance means more closely related items. The NMI values do not fulfill this criterion; on the contrary, NMI behaves in exactly the opposite way: a higher NMI value corresponds to a stronger dependency and a lower NMI to a weaker one. For that reason, we reverse the NMI values using the formula

dist = 1 − NMI

transforming them into a distance metric, where a low value, close to 0, means that the compared variables are highly correlated and hence closely related to each other, while a high distance, close to 1, means that the variables are weakly correlated and thus very distant from each other. It is worth mentioning that, due to rounding errors, negative distances may occur; in that case, they are set to zero, since 0 is the lowest possible value.

Legend of the site figures (item 1 not recovered): 2) A percentage colored in red indicates the proportion of samples in which the mutation has been identified as characteristic, over the total number of occurrences of that mutation along the samples of concern. 3) A percentage colored in blue indicates the proportion of samples in which the mutation has been identified as non-characteristic, over the total number of occurrences of that mutation along those samples of concern. Notice that the red percentage is not shown when the mutation has not been identified as characteristic, and the blue percentage is not shown when it has not been identified as non-characteristic. For instance, a mutated position found as a characteristic mutation in half of the samples of concern and as non-characteristic in the other half would show 50-50% red and blue percentages. 4) The green symbols indicate the lineages of the samples in which the mutation has been found (see the upper-left corner for the available lineages). Black or faded positions/locations/sites are considered insignificant and correspond to mutations identified as non-characteristic at less than 5% and as characteristic at 0%.

Results
Conducted studies have shown that many mutations can occur randomly during virus replication inside the host's body [25, 21]. An enormous number of factors, both environmental and internal to the host's body, could potentially cause the necessary selective pressure, resulting in some random mutations being selected. Going through our methodology, we obtained graph 15, which could potentially indicate evolutionary paths of the virus among well-known lineages, revealing randomly co-occurring mutations beyond the strain-specific / strain-defining ones. Graph 15 is a small part of the whole graph 12, which is the result of algorithm 2. After careful consideration, we identified 6 distinct groups of nodes of interest, shown as either colored areas or colored closed curves in graph 15. For each of these groups of nodes, a dendrogram is provided as the result of the hierarchical clustering of the reversed Normalized Mutual Information (see chapters 4.8 and 4.9). For instance, nodes 32, 33, 38, 24, 23, 19, and 18 inside the orange curve constitute a homogeneous network consisting of samples that have been classified as of lineage B.1.1.7; important and strongly correlated sites among these samples are shown in figure 14. The small sub-network inside the blue closed loop, comprised of nodes 12, 13, and 14, is a small homogeneous network also consisting of samples classified as of lineage B.1.1.7; its sites of interest are shown in figure 16. The big sub-network inside the yellow and black closed loops has been split into two different parts, maintaining an overlap of one node, since the sites/positions of concern were so many that it would have been very difficult to interpret the final dendrogram.

Figure 15 caption: Part of graph 12 that could potentially indicate interesting evolutionary paths of the virus. The human choice of groups is shown using either colored closed loops or shadowed areas, the latter for simplicity due to overlapping areas.
The sub-network inside the yellow area, consisting of nodes 39, 91, 48, 47, 50, 66, 68, 10, 11, 7, 8, and 9, is highly inhomogeneous, containing samples of the European lineage and the Alpha, Delta, and Omicron variants. The hierarchical clustering of the important and strongly correlated sites found among its samples is shown in figure 18. By disassembling the sub-network inside the yellow area further into its constituent pieces, we conclude that the sub-network in the green shadowed area is also of high interest. It is assembled of nodes 7, 8, 9, 10, and 11, and it consists mainly of B.1.1.7 (Alpha variant) samples and a small number of AY.43 and European-lineage samples (figure 20). Furthermore, taking apart the sub-network inside the black closed loop into its constituent pieces, we figured out that the sub-network within the blue shadowed area needs further investigation. It is comprised of nodes 45, 37, 40, 69, and 67, with high rates of EU and UK lineages and Delta variants (figure 19).

Discussion & Conclusions
In this study, we first attempted to cluster the available SARS-CoV-2 samples into groups and to validate those groups based on the characteristic mutations reported by the pangolin tool [15]. Then, co-occurring mutations among these groups were counted, further revealing the phylogenetic relationships among groups of samples, either through coexisting mutations or by counting the occurrences of important mutations moving from one group to the other. Our methodology was validated by calculating the Mutual Information between pairs of sites in a multiple sequence alignment, along samples indicated by the previously mentioned groups. We then used the mutual information to represent the distance of those sites from each other, and ended up performing hierarchical clustering on those distances, obtaining potentially useful insights regarding the evolutionary history of the virus and validating the phylogeny among our groups.
Specifically, figure 14 shows that identifying evolutionary patterns among B.1.1.7 samples with known labels could be possible. Based on the non-characteristic mutated sites at medium to high rates in group A, and the fact that this group is close to the root of the tree, we conclude that evolutionary pressure could lead to new B.1.1.7 sub-lineages, forcing those mutations to prevail. The same conclusions can be drawn from group B, yet as a weaker indication, since group B's sites appear to be mutated at lower percentages. On the other hand, the high […]

In figure 17, the majority of the reported sites appear insignificant, being non-characteristic mutations at very low rates, and thus figure 17 is of less importance. However, the first 3 branches, starting from the root of the tree, are the most remote groups of mutations and hence worth investigating. The mutation at position 27751 is a non-characteristic mutation along the samples used to construct this tree, appearing at very low rates; yet it is a characteristic one when studied in lineages B.1.617.2, AY.12, AY.9, AY.4, AY.43, and AY.122. It is worth mentioning that samples of the above-mentioned lineages appear at higher frequencies as we move outside the green shadowed area to the rest of the nodes in the yellow area of figure 15, from node 10 to 91 and beyond. As a result, based on the existing labels, we draw the conclusion that this mutation is very likely to pop up in future lineages.
In figure 18, one can clearly identify multiple non-characteristic and characteristic mutations belonging to the Alpha, Beta, and Delta variants at high rates, making this tree highly inhomogeneous. Surprisingly, there are some sites that, while being characteristic mutations of the compound lineages of this system, appear as non-characteristic at low rates. More specifically, 22029-22033 and 22226, as well as 5183, 11513, and 11417, are reported as non-characteristic mutations while being characteristic mutations of lineages B.1.617.2 and AY.9 respectively, lineages that are indeed included in the yellow area of figure 15. Additionally, some reported sites, such as 16465, 22314, and 25915, do not belong to any known lineage, at least at the time of writing, yet they were found as non-characteristic mutations along that system's samples at significant rates. These facts could again be indications of emerging mutations in future lineages, or of patterns between non-characteristic and characteristic mutations revealing the selective pressure of the former towards the latter.
The tree in figure 16 is made of nodes 12, 13, and 14 (graph 15) and is highly homogeneous, since all of its samples are assigned to lineage B.1.1.7. Interestingly, not even one site reported by this tree appears as a characteristic mutation. This tree could potentially be used to identify emerging mutations of the B.1.1.7 lineage, or even to classify new samples into new sub-B.1.1.7 lineages. In groups A, B, and C, one can observe non-characteristically mutated sites at middle to relatively high rates, accompanied by a few non-characteristically mutated sites at significantly low rates, such as 6025 in group A, 23624 in group B, and 2154, 798, 17133, and 3337 in group C. This co-existence of non-characteristic mutations at middle-to-high and very low rates could be an indication of either new sub-variants emerging or mutations fading out through the evolutionary history of the virus.
On the other hand, figures 19 and 20 are representative examples of how complex and indistinguishable the situation can become if one includes multiple nodes with a large number of samples in the analysis. There are numerous characteristic and non-characteristic mutations in groups A and B of both trees, since a tremendous number of samples have been included. It is worth mentioning that in group C of figure 19, non-characteristic mutations at insignificant rates pop up which, surprisingly, characterize lineages such as BA.1.1 and BA.1 that do not belong to the used part of the graph (the blue shadowed area of graph 15). Unfortunately, not many conclusions can be drawn from these figures due to their complexity; further analysis of these nodes must be done to safely draw conclusions regarding patterns between their characteristic and non-characteristic mutations.
Several studies attempting to either classify or cluster SARS-CoV-2 populations have been proposed in the past. The majority of those methods depend on phylogenetic inference, bringing limitations regarding statistical errors, computation, and visualization. To avoid the excessively complex and computationally expensive phylogenetic construction, many novel sub-typing methods have been proposed, yet they are not able to determine the phylogenetic relationships among different sub-types [26]. Our proposed method therefore replaces the classical phylogenetic approaches, reducing complexity while maintaining the ability to infer evolutionary paths of the virus. In addition, our method could update the phylogenetic model based on newly arriving samples in a significantly more efficient and effective way, in a constantly evolving environment.
In our exploratory analysis, we used the Pangolin-assigned lineages as ground-truth labels to validate the clustering analysis. Labels from other nomenclature systems should additionally be considered in future studies. It is also well known that the dimensionally reduced space produced by t-SNE should not, in general, be used for clustering, since t-SNE distorts the distances among points to such an extent that the results can be misleading. However, because we validated the clusters against the existing labels, we could safely cluster the samples in the t-SNE space despite these limitations; in other words, the procedure was closer to supervised clustering than to a fully unsupervised method.
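To illustrate the kind of label-validated clustering described above, the following sketch embeds a binary mutation-presence matrix with t-SNE, applies a density-based clustering step, and scores the resulting clusters against known lineage labels. The data, parameter values, and use of scikit-learn's TSNE/DBSCAN here are illustrative assumptions, not our actual pipeline:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

# Toy binary mutation-presence matrix: rows = samples, columns = mutated sites.
# Two artificial "lineages" carrying mostly disjoint mutation sets (hypothetical data).
rng = np.random.default_rng(0)
p_a = np.r_[np.full(15, 0.9), np.full(15, 0.05)]   # per-site mutation rates, lineage A
p_b = np.r_[np.full(15, 0.05), np.full(15, 0.9)]   # per-site mutation rates, lineage B
X = np.vstack([(rng.random((40, 30)) < p_a).astype(int),
               (rng.random((40, 30)) < p_b).astype(int)])
labels = np.array([0] * 40 + [1] * 40)  # stand-in for Pangolin lineage assignments

# Embed into 2D with t-SNE, then cluster the embedding with a density-based method.
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
pred = DBSCAN(eps=5.0, min_samples=5).fit_predict(emb)

# Supervised validation step: compare clusters to the known labels.
ari = adjusted_rand_score(labels, pred)
print("Adjusted Rand index:", ari)
```

The validation against labels is what makes clustering on the distorted t-SNE space defensible: a high adjusted Rand index signals that the recovered clusters agree with the lineage assignments despite the distortion.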
Different modeling methods for the SARS-CoV-2 genomes could also be considered. For instance, instead of the binary model we developed, one could use more sophisticated methods to encode the genomes into abstract mathematical representations. A particular example of such embeddings is BioVec [1], a novel approach for representing biological sequences as vectors through unsupervised machine learning, which has been evaluated on classifying proteins into protein families with intriguing results. One could use such mathematical representations (embedding vectors) to perform either direct clustering in the multidimensional space or dimensionality reduction followed by clustering. It is worth mentioning that we attempted to use BioVec in this study, yet due to computational limitations we concluded that it is beyond the scope of this manuscript; further research is needed to examine whether such a model could serve this purpose or lead to better results. Different dimensionality reduction methods or clustering techniques could also be used, provided that the clustering methods are density-based, since that is a prerequisite of our methodology.
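For concreteness, a binary genome model of the kind mentioned above can be sketched as a presence/absence matrix over mutated sites. The helper function, sample names, and site numbers below are hypothetical illustrations, not our exact implementation:

```python
from itertools import chain

def binary_mutation_matrix(sample_mutations):
    """Encode per-sample mutation sets as a binary presence/absence matrix.

    sample_mutations: dict mapping sample id -> set of mutated genomic sites.
    Returns (matrix, samples, sites), where matrix[i][j] == 1 iff sample i
    carries a mutation at site j (sites sorted in ascending genomic order).
    """
    samples = sorted(sample_mutations)
    sites = sorted(set(chain.from_iterable(sample_mutations.values())))
    col = {site: j for j, site in enumerate(sites)}
    matrix = [[0] * len(sites) for _ in samples]
    for i, sid in enumerate(samples):
        for site in sample_mutations[sid]:
            matrix[i][col[site]] = 1
    return matrix, samples, sites

# Hypothetical samples and site numbers, for illustration only.
muts = {"S1": {23403, 3037}, "S2": {23403, 241}, "S3": {3037}}
M, rows, cols = binary_mutation_matrix(muts)
# cols -> [241, 3037, 23403]; each row of M flags which sites a sample carries.
```

A sparse matrix of this form is what the feature space for dimensionality reduction and clustering is built on: each sample becomes a point whose coordinates are mutation presences rather than the full sequence.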
Here, we presented a computational method for detecting patterns of co-occurring mutations that potentially reveal the evolution of SARS-CoV-2. Not only did we identify emerging mutations that could lead to new sub-lineages, but we also found non-characteristic mutations, beyond the strain-specific / strain-defining ones, strongly correlated with strain-characteristic mutations. Although these correlations might indicate an underlying "association" between different lineages, correlation does not necessarily imply causation. To identify causal relationships, more research is needed on the drivers of evolution and the emergence of new mutations, ideally on worldwide data. As SARS-CoV-2 continues to spread globally, we hope our software will help detect new evolutionary events in the fight against the pandemic.

Code & Data Availability Statement
All SARS-CoV-2 sample data were retrieved from ENA (Project ID: PRJEB44141). All code developed for the analysis is available at https://github.com/BiodataAnalysisGroup/BEEMUS .