Fast Whole-Genome Phylogeny of the COVID-19 Virus SARS-CoV-2 by Compression

Abstract We analyze the whole genome phylogeny and taxonomy of the SARS-CoV-2 virus using compression. This is a new fast alignment-free method called the “normalized compression distance” (NCD) method. It discovers all effective similarities based on Kolmogorov complexity. The latter being incomputable we approximate it by a good compressor such as the modern zpaq. The results comprise that the SARS-CoV-2 virus is closest to the RaTG13 virus and similar to two bat SARS-like coronaviruses bat-SL-CoVZXC21 and bat-SL-CoVZC4. The similarity is quantified and compared with the same quantified similarities among the mtDNA of certain species. We treat the question whether Pangolins are involved in the SARS-CoV-2 virus. The compression method is simpler and possibly faster than any other whole genome method, which makes it the ideal tool to explore phylogeny.


I. INTRODUCTION
In the 2019 and 2020 pandemic of the COVID-19 illness many studies use essentially two methods, alignment-based phylogenetic analyses e.g. [19], and an alignment-free machine learning approach [23]. These pointed to the origin of the SARS-CoV-2 virus which causes the COVID-19 pandemic as being from bats. It is thought to belong to lineage B (Sarbecovirus) of Betacoronavirus. From phylogenetic analysis and genome organization it was identified as a SARS-like coronavirus, and to have the highest similarity to the SARS bat coronavirus III. DATA AND DATA CLEANING We downloaded the data broadly in two parts. The original input data are stored by Rudi Cilibrasi at: -rw-rw-r-1 rudi rudi 2.0G Jul 17 15:46 incoming/gisaid hcov-19 2020 07 17 22.fasta -rw-rw-r-1 rudi rudi 29M Jul 17 15:45 incoming/gurjit-data.zip We downloaded from the GISAID [11] database on July 17th, 2020 in total 66,899 sequences. A part of at least 99.9% of the GISAID data are SARS-CoV-2 viruses. The remainder are not technically SARS-CoV-2 viruses but are related and of interest. One of the authors of the machine learning approach study [23] was so kind as to supply us with the 6,751 sequences used in that paper. There is one SARS-CoV-2 virus among them. Thus, at least 99.9% of the data is not SARS-CoV-2 viruses but other viruses. Therefore we can compare our results about the phylogeny of SARS-CoV-2 directly with those of [23].
We removed a single sequence with a special exception part in the code because it had a bad name: gisaid hcov-19 2020 07 17 22.fasta.fai hCoV-19 29868 1713295574 80 82.
That name is too short to be useful: "hCoV- 19." Altogether therefore we imported 73,649 sequences total into the raw sequence database. Each viral sequence is an RNA sequence and seems to be around 30,000 RNA base pairs (A,T,G,C) in size. The total size of all sequence data together is in the order of two Gigabyte. Of the sequences initially downloaded from GISAID, we applied a lowercase transformation to each to reduce pointless variability. After that, we computed a histogram of all the characters in the sequence and counted the size of each group. Many sequences contained the base pairs A, C, G, N, T or other letters. The letters A,C,G, and T signify the basic DNA nucleotides. the meaning of 'N' and the other letters involved can be found on a FASTA file format reference webpage. For example, "N" means "unknown nucleotide residue." We retained the deduplicated unique 15,578 sequences with the known nucleotides A,C,G, and T from the GISAID download. The 6,751 sequences obtained from the authors of [23] were over the letters A,C,G, and T already.
As a representative of the SARS-CoV-2 virus we selected the most common one in the dataset as the basis for the NCD against all others. It appeared in the sorted virus list with 105 multiples. In Section IV it appears at the top with an NCD of 0.003621 and has the official name of gisaid hcov-19 2020 07 17 22.fasta<hCoV-19/USA/WI-WSLH-200082/2020-EPI ISL 471246-2020-04-08 .
We then looked at whether there was much variation among the SARS-CoV-2 viruses themselves since this may invalidate the NCD distance between the inspected viruses and the selected SARS-CoV-2 virus. We retained the viruses in the list after deduplication and filtering for A,C,G,T. The worst NCD against the selected SARS-CoV-2 virus was 0.874027 namely gisaid hcov-19 2020 07 17 22.fasta <hCoV-19/pangolin/Guangxi/P1E/2017EPI ISL 4105392017 from a Pangolin. Removing that one sequence from the list we got a worst NCD of 0.873367 also from a Pangolin in 2017.
(To clarify why there is the discrepancy below in that we obtain 15,578 imported sequences but when counting lose 100-200 in the next phase: There are some exact name duplicates in the GISAID data. When the "imported sequence" count is reported, then we count identically named identical sequences separately because they did both get imported (but one would overwrite the other). When counting the different names for sequences without /2017, /2018, /2019 in the name, these exact-name duplicates would collapse into a single one causing just one count.) Initially there are 15,430 sequences from GISAID that contain "hCov" in the name. We then removed all sequences that contained /2017 in the name. After this we were left with 15,428 sequences and obtained a worst NCD of 0.738175 also from a Pangolin.
Removing all sequences that contained /2017, /2018, or /2019 in the name, left 15,409 viruses in the list. The worst-case NCD across all the remaining sequences of the SARS-CoV-2 virus to the selected SARS-CoV-2 virus is 0.044986 and the average is 0.009879. This shows that the 15,428 SARS-CoV-2 viruses retrieved from GI-SAID on July 17 in 2020 all have sequences that hardly differ from one another. The worst-case sequence [20] can be found at gisaid hcov-19 2020 07 17 22.fasta<hCoV-19/USA/OH-OSUP0019/2020-EPI ISL 427291-2020-03-31 and its registration code is EPI ISL 427291 [20].
With respect to the figures: Figures 1 and 2 were done with the PHYLIP (PHYLogeny Inference Package) [29] and Figure 3 with the Complearn Package [5]. The programming is easy. Essentially one only has to compute Equation A.4 for all pairs of viruses to be compared.
The comparisons between the NCD's of viruses and the NCD's of the mtDNA's of species are made on basis of Table II. Since the NCD is a metric (satisfies the triangle property) [6, Theorem 6.2] the above variation of NCD's of different SARS-CoV-2 viruses will not unduly upset the greater NCD distances (only by ±0.044986) between the other viruses (the 6,751 non-SARS-CoV-2 viruses) and the selected SARS-CoV-2 virus. The analytic software which August 24, 2020 DRAFT . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made   (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The following virus has an NCD of 0.444846 with the selected SARS-CoV-2 virus and is the closest (apart from the above SARS-CoV-2 virus). It is Sarbecovirus/EPI ISL 402131.fasta/ <BetaCoV/bat/Yunnan/RaTG13/2013-EPI ISL 402131. The first part is the classification of the subfamily of viruses, the code EPI ISL 402131 is that of the virus itself which one can use with Google to obtain further information.
In this case the virus is sampled from a bat in Yunnan Province in in the PRC (China) in 2013 and is 29855 bp RNA linear VRL 24-MAR-2020, and its final registration code is MN996532. It is known as Bat coronavirus RaTG13 of the family Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Cornidovirineae; Coronaviridae; Orthocoronavirinae; Betacoronavirus; Sarbecovirus. The virus is found in the Rhinolophus affinis, a medium-size Asian bat of the Yunnan Province (China). The human coronavirus genome shares at least 96.2% of its identity with its bat relative, while its similarity rate with the human strain of the SARS virus (Severe Acute Respiratory Syndrome) is much lower, only 80.3% [8]. The NCD distance between the selected RSA-CoV-2 virus and this virus is about the same as that between the mtDNA's of the Chimpansee and the PigmyChimpansee according to the Table II. The next three viruses have respectively NCD=0.788416 for Coronaviridae/CoVZC45.fasta/ < MG772933.1, and NCD=0.791082 for Coronaviridae/CoVZXC21.fasta/ < MG772934.1 and at a larger distance NCD=0.917493 for Coronaviridae/Coronaviridae 783.fasta/ < KC881006 Here the part "fasta/ < KC881006" means that the registration code is KC881086.
The first of these two viruses is the Bat SARS-like coronavirus isolate bat-SL-CoVZC45, complete genome at 29802 bp RNA linear VRL 05-FEB-2020 of the above family of viruses.
Its NCD with the selected SARS-CoV-2 virus is slightly larger than the mtDNA distance between the Human and the Gorilla at 0.737 and slightly lower than the mtDNA distance between the Human and the Orangutan at 0.834.
The second of these two viruses is the Bat SARS-like coronavirus isolate bat-SL-CoVZXC21, complete genome at 29732 bp RNA linear VRL 05-FEB-2020 of the same family. The same comparison of the NCD distance between this virus and the selected SARS-August 24, 2020 DRAFT . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted August 26, 2020. ; https://doi.org/10.1101/2020.07.22.216242 doi: bioRxiv preprint  Table I. These viruses have the least NCD distance with the SARS-CoV-2 virus. S(T ) = 0.998948. We use human mtDNA as outgroup; it is about the same size as the viruses and it can be assumed to be completely different. Where it joins the tree there is the root. The labeling of the items is as follows. All sequences are labeled as they occur in the data of [23] together with their registration code. The most interesting are the 11th to 15th (inclusive) sequences of the tree from the top of the page. The 13th to 15th virus sequences (inclusive) are the most interesting for us. The 13th is EPI ISL 402131 which is the bat/Yunnan/RaTG13/2013 that is the RaTG13 Bat Coronavirus sampled in Yunnan in 2013. The 14th label is selected SARS-CoV-2 virus which occurs 105 times in the virus database of GISAID. It is the top one in the sorted list Table I  CoV-2 virus with the NCD of mtDNA's between (the same) species holds also for this second virus.
The third virus in the list is the Bat SARS-like coronavirus Rs3367, complete genome at 29792 bp again of the same family. Comparing the NCD between this virus and the selected SARS-CoV-2 virus yields that it is slightly smaller than the NCD between the Human mtDNA August 24, 2020 DRAFT . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

A. The Pangolin connection
As we saw in determining the worst-case NCD between the selected SARS-CoV-2 virus and other GISAID SARS-CoV-2 database viruses the database is littered with Pangolin viruses from 2017, 2018, and 2019. Several studies e.g. [16], [31] hold that while the SARS-CoV-2 virus probably originates from bats it may have been transmitted to another animal and/or August 24, 2020 DRAFT . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted August 26, 2020. distance with the human SARS-CoV-2 virus that is close to one half of the Pangolin distance.
As noted before, this is comparable with the NCD distance between the Chimpansee and the PigmyChimpansee according to the Table II. Also in the tree of Figure 2 the Pangolin viruses are generally far from the selected SARS-CoV-2 virus. Hence the hypothesis that the Pangolin species is an intermediary between the bat viruses above and the human SARS-CoV-2 virus is perhaps unlikely.

V. DISCUSSION
Previously most studies into the COVID-19 pandemic suggested that the virus involved originated from bats. Bats are a known reservoir of viruses that can zoonotic transmit to humans [18]. Virtually all those mentioned studies use alignment-based methods. Analyzing the involved SARS-CoV-2 virus with the NCD is in this setting a novel alignment-free method based on compression. It places the RaTG13 bat virus the closest to the SARS-CoV-2 virus followed by the two SARS-like coronaviruses bat-SL-CoVZXC21 and bat-SL-CoVZC4. The similarity is quantified and compared with the same quantified similarities among the mtDNA of certain species. A possible involved other animal is the Pangolin. The method used here, the normalized compression distance (NCD) method is based on Kolmogorov complexity analysis and compression [15], [6]. It is domain-independent and requires no parameters to be set, apart from the used compression algorithm. Earlier studies using alignment-based methods have suggested that the the SARS-CoV-2 virus originated from bats before being zoonotical transferred to humans. The machine-learning approach, an alignment-free method, in [23] came to the same conclusion. The current, completely different, alignment-free method with August 24, 2020 DRAFT . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted August 26, 2020.

A. Kolmogorov Complexity
Informally, the Kolmogorov complexity of a string is the length of a shortest string from which the original string can be losslessly reconstructed by an effective general-purpose computer such as a particular universal Turing machine U. Hence it constitutes a lower bound on how far a lossless compression program can compress. For details see the text [17]. In this paper we require that the set of programs of U is prefix free (no program is a August 24, 2020 DRAFT . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted August 26, 2020. ; https://doi.org/10.1101/2020.07.22.216242 doi: bioRxiv preprint proper prefix of another program), that is, we deal with the prefix Kolmogorov complexity. We call U the reference universal prefix machine. Formally, the conditional prefix Kolmogorov complexity K(x|y) is the length of a shortest input z such that the reference universal prefix machine U on input z with auxiliary information y outputs x. The unconditional prefix Kolmogorov complexity K(x) is defined by K(x|ǫ) where ǫ is the empty word of length 0.. The functions K(·) and K(·|·), though defined in terms of a particular machine model, are machine-independent up to an additive constant and acquire an asymptotically universal and absolute character through Church's thesis, and from the ability of universal machines to simulate one another and execute any effective process.
The Kolmogorov complexity of an individual finite object was introduced by Kolmogorov [12] as an absolute and objective quantification of the amount of information in it. It is sometimes confused with the information theory of Shannon [21], which deals with average information to communicate objects produced by a random source. They are quite different.

B. Information Distance
The information distance D(x, y) between strings x and y is defined as where U is the reference universal prefix machine above. Like the Kolmogorov complexity K, the distance function D is upper semicomputable. Define E(x, y) = max{K(x|y), K(y|x)}.
In [1] it is shown that the function E is upper semicomputable, (Here and elsewhere in this paper "logarithm" or "log" refers to the binary logarithm.) D(x, y) = E(x, y)+O(log E(x, y)), the function E is a metric (more precisely, that it satisfies the metric (in)equalities up to a constant), and that E is minimal (up to a constant) among all upper semicomputable distance functions D ′ satisfying the normalization conditions y:y =x 2 −D ′ (x,y) ≤ 1 and x:x =y 2 −D ′ (x,y) ≤ 1 (to exclude bogus distances which state, for example, that every y is in distance 1 2 of a given x). We call this metric E universal. Thus, for every pair of finite files x, y we have that E(x, y) is at least as small as the smallest D ′ (x, y). This means that E(x, y) is at least as small as the distance engendered by August 24, 2020 DRAFT . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted August 26, 2020. ; https://doi.org/10.1101/2020.07.22.216242 doi: bioRxiv preprint the dominant feature shared between x and y. The normalized information distance (NID) is defined by It is straightforward that 0 ≤ NID(x, y) ≤ 1 (up to an O(1/ max{K(x), K(y)}) additive term). It is is a metric [15] (and so is the NCD we will meet in (A.4) see [6]. As an aside, a nonoptimal precursor to the NID/NCD was given in [14].) Since by the symmetry of information law [10] or see [17], rewriting the NID using (A.2) yields up to some small aditive terms that we ignore. For more details on this derivation see [17] or [26].
In this way, in [1], [15] we and others developed theoretical approaches to the similarity of finite objects. We proved that these theories based on Kolmogorov complexity are perfect.
By approximating the Kolmogorov complexities involved by real-world compressors we transformed these theoretical notions into applications that work better than we could expect [6], [7]. It turns out that on natural data the above process gives adequate results. The resulting similarity measure is a parameter-free and alignment-free method. (In bioinformatics the computation of the similarity between genetic strings commonly involves the so-called "alignment method." This method incurs often a high or even forbidding computational cost. for certain problems biologists look for alignment-free methods.) It is a non feature-based similarity. That is, it captures every effective distance: effective versions of Hamming distance, Euclidean distance, edit distances, alignment distance, Lempel-Ziv distance, and so on. It can simultaneously detect all similarities between pieces that other effective distances can detect separately.
Let us give an intuitive explanation. Two objects are deemed close if we can significantly "compress" one given the information in the other, the intuition being that if two pieces are more similar, then we can more succinctly describe one given the other. The NID discovers August 24, 2020 DRAFT . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted August 26, 2020. ; https://doi.org/10.1101/2020.07.22.216242 doi: bioRxiv preprint all effective similarities in the sense that if two objects are close according to some effective similarity then they are also close according to the NID. One of the advantages of this similarity is that it works for noisy objects [4]. It is a quantity between 0 (identical) and 1 (completely dissimilar).

C. Normalized Compression Distance
Unfortunately, the universality of the NID comes at the price of incomputability. In fact, it is not even semicomputable and there is no semicomputable function at a computable distance of it [24]. One uses real data-compression programs to approximate the Kolmogorov complexity. The length of the compressed version of a finite object is obviously computable.
Usually the computation process is fast. For the natural data we are dealing with we assume that the length of the compressed version is not too far from its Kolmogorov complexity. We substitute the Kolmogorov complexity in the NID by its approximation. If Z is a compressor and we use Z(x) to denote the length of the compressed version of a string x, then we arrive at the Normalized Compression Distance: where we have replaced the pair (x, y) in the formula by the concatenation xy (file y appended to file x) and we ignore logarithmic terms in the numerator and denominator, see [6]. REMARK 1. In [6] there are axioms to capture the real-world setting, and show that (A.4) approximates optimality. Actually, the NCD is a family of compression functions parameterized by the given data compressor Z. Common compressors are gzip (a Lempel-Ziv compressor with small window size of 32kB), bzip2 (a block-sorting compressor based on the Burrows-Wheeler transform with a larger window size of 256kB), and PPMZ (prediction by partial matching (PPM) which is an adaptive statistical data compression technique based on context modeling and prediction [9].
The objects being compared for similarity must fit in, say, one-half the window size. Gzip is by far the poorest compressor while PPMZ the best (although slowest) in this lineup.
For example, the ideal compressor Z takes care that NCD Z (x, x) equals 0. With Z =gzip usually it is between 1 2 and 1 (very bad). With Z =bzip2 it is lower but nowhere near 0, and NCD P P M Z (x, x) in the genomic experiment of Table II below was between 0.002 and 0.006. For more experimental evidence see [3]. ✸ August 24, 2020 DRAFT . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted August 26, 2020. ; https://doi.org/10.1101/2020.07.22.216242 doi: bioRxiv preprint REMARK 2. Because of the normalization it does not matter for the NCD whether the length of data set x is different from the length of y. In practice, this difference should not be too great. ✸ To visualise the n 2 pairwise distances between n data we construct a dendrogram. To do so we take the n × n distance matrix as input, and construct a dendrogram with the n objects as leaves (so the dendrogram contains n external nodes or leaves and n − 2 internal nodes like in the figures below). We assume n For details see the cited reference. The method is available as an open-source software tool [5].

D. Phylogeny
A DNA sequence is a finite string over a 4-letter alphabet {A, C, G, T }. We used the mitochondrial genomes (mtDNA) of 21 mammals, each of at most 18,000 base pairs, obtained from the GenBank Database. Hence, the mtDNA of every species involved is a string of at most 36,000 bits. Since we use the entire mtDNA of every species involved we do "wholegenome" phylogeny. [2]. But using the whole genome gives a single tree. For the 21 original species used we do not give the Latin names; they can be found in [6].
For every pair of mitochondrial genome sequences x and y, we evaluated the formula in Equation A.4 using a good compressor like PPMZ The resulting distances are the entries in an 21 × 21 distance matrix Table II. Constructing a phylogeny tree from the distance matrix, for example using our quartet tree method [7] or PHILIP [29] as tree-reconstruction software, gives the corresponding tree.

E. SARS Virus
We clustered the SARS virus directly after its sequenced genome was made publicly available, in relation to potential similar viruses.  Figure 3 are very similar to the definitive tree based on medicalmacrobio-genomics analysis, appearing later in the New England Journal of Medicine [13].
August 24, 2020 DRAFT . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made We depicted the figure in the ternary tree style, rather than the genomics-dendrogram style, since the former is more precise for visual inspection of proximity relations.

ACKNOWLEDGEMENT
The authors thank Gurjit Randhawa for the database of viruses used in [23].
August 24, 2020 DRAFT . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made