HamHeat: A fast and simple package for calculating Hamming distance from multiple sequence data for heatmap visualization

Alexey V. Rakov; Dieter M. Schifferli; Shu-Lin Liu; Emilio Mastriani

doi:10.1101/2020.03.26.009258

Abstract

Fast calculation of Hamming distances across large sets of sequences remains a nontrivial task. Here we present HamHeat, a new software package that efficiently calculates Hamming distances for hundreds of aligned protein or DNA sequences, each comprising a large number of residues or nucleotides, respectively. HamHeat uses its own algorithm, whose advantages include ease of use and fast runs on large amounts of data. The package consists of three consecutive modules. The first module ranks the sequences from the most to the least frequent variant. The second module takes the most common variant as the reference sequence and calculates the Hamming distance of every other sequence from it, that is, the number of residue or nucleotide positions at which the two sequences differ. The final module formats the results in a single table that lists the sequence ranks and Hamming distances.
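The three steps described above can be illustrated with a minimal Python sketch. This is not HamHeat's actual code: the function names (`hamming`, `rank_and_measure`, `format_table`), the tab-separated output layout, and the toy input are all assumptions made for illustration, and the sketch assumes the sequences are already aligned to equal length.

```python
from collections import Counter


def hamming(a, b):
    """Count the positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def rank_and_measure(sequences):
    """Rank variants by frequency, then measure each against the top variant.

    Returns a list of (rank, sequence, count, distance) tuples.
    Hypothetical helper, not part of HamHeat.
    """
    # Step 1: rank distinct variants from most to least frequent.
    ranked = Counter(sequences).most_common()
    # Step 2: the most common variant serves as the reference sequence.
    reference = ranked[0][0]
    return [
        (rank, seq, count, hamming(reference, seq))
        for rank, (seq, count) in enumerate(ranked, start=1)
    ]


def format_table(rows):
    """Step 3: format ranks and distances as a tab-separated table (assumed layout)."""
    lines = ["rank\tcount\thamming\tsequence"]
    for rank, seq, count, dist in rows:
        lines.append(f"{rank}\t{count}\t{dist}\t{seq}")
    return "\n".join(lines)


# Toy aligned DNA sequences, invented for this example.
aligned = ["ACGT", "ACGT", "ACGA", "TCGA", "ACGT", "ACGA"]
rows = rank_and_measure(aligned)
print(format_table(rows))
```

In this toy input, `ACGT` occurs three times and becomes the reference, so its distance is 0; `ACGA` differs from it at one position and `TCGA` at two. A table of this kind, with one distance per ranked variant, is the sort of input a heatmap tool could then consume.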

Availability and implementation: HamHeat is based on Python 3 and AWK, runs on Linux systems, and is available under the MIT License at https://github.com/alexeyrakov/HamHeat.

Contact: rakovalexey{at}gmail.com

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.