PT - JOURNAL ARTICLE
AU - Andrew Melnyk
AU - Sergey Knyazev
AU - Yury Khudyakov
AU - Fredrik Vannberg
AU - Leonid Bunimovich
AU - Pavel Skums
AU - Alex Zelikovsky
TI - Using Earth Mover’s Distance for Viral Outbreak Investigations
AID - 10.1101/628859
DP - 2019 Jan 01
TA - bioRxiv
PG - 628859
4099 - http://biorxiv.org/content/early/2019/05/06/628859.short
4100 - http://biorxiv.org/content/early/2019/05/06/628859.full
AB - RNA viruses mutate at extremely high rates, forming an intra-host viral population of closely related variants (or quasi-species) [4]. The high variability of human immunodeficiency virus (HIV) and hepatitis C virus (HCV) makes them particularly dangerous by allowing them to evade the host’s immune system. HIV and HCV outbreaks pose a significant public health problem; addressing it requires inferring transmission clusters, i.e., deciding whether two viral samples belong to the same outbreak. The initial approach [10] estimated relatedness between two samples as the distance between the consensus sequences of the corresponding viral populations. The distance between the closest pair of representatives from two populations, MinDist, has been shown to be significantly more accurate [2]. Unfortunately, computing MinDist requires cumbersome assembly of RNA-seq data and identification of all viral sequences from a given project. We present a novel approach that bypasses read assembly and estimates the distance between viral samples based on the distribution of k-mers (i.e., substrings of length k) in RNA-seq reads. Experimental validation using sequencing data from HCV outbreaks shows that the proposed algorithms can successfully identify genetic relatedness between viral populations, infer transmission clusters and outbreak sources, and decide whether the primary spreader is present in the sequenced outbreak sample.
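
The abstract describes comparing samples through the distribution of k-mers in their reads, without assembly. As a minimal sketch of that first step only, the Python below builds a normalized k-mer frequency table from a list of reads. The function name `kmer_distribution`, the toy reads, and the choice to skip k-mers containing the ambiguous base `N` are assumptions made for illustration; they are not taken from the paper, and the sketch does not implement the Earth Mover's Distance itself.

```python
from collections import Counter


def kmer_distribution(reads, k):
    """Return the relative frequency of each k-mer observed in the reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        # Slide a window of width k along the read; reads shorter than k add nothing.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # assumption: ignore k-mers with ambiguous bases
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalize counts so that the frequencies sum to 1.
    return {kmer: n / total for kmer, n in counts.items()}


# Hypothetical toy reads, not real sequencing data.
sample_reads = ["ACGTAC", "CGTACG"]
dist = kmer_distribution(sample_reads, k=3)
```

Two such tables, one per sample, would then be the inputs to a distribution distance such as the Earth Mover's Distance named in the title; the ground distance between k-mers needed for that step is left unspecified here.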