TY - JOUR
T1 - A New approximate matching compression algorithm for DNA sequences
JF - bioRxiv
DO - 10.1101/853358
SP - 853358
AU - Lázaro-Guevara, J.M.
AU - Garrido, K.M.
Y1 - 2019/01/01
UR - http://biorxiv.org/content/early/2019/11/25/853358.abstract
N2 - In undeveloped countries like Guatemala, where access to high-speed internet connections is limited, downloading and sharing biological information of thousands of megabits is a huge problem for the beginning and development of bioinformatics. There is therefore an urgent need to find a better way to share this biological data, and this is where compression algorithms become relevant. With this in mind, the idea was born of creating a new algorithm using redundancy and approximate selection. Methods: The algorithm uses the probability given by the transition matrix of the three-word tuple and relative frequencies. The relative and total frequencies are calculated with the permutation formula (n^r), and 6 bits of information are compressed into 1 using the ASCII table code (0…255 characters, 2^8), with clusters of 102 DNA bases compacted into 17 string sequences. For decompression, the inverse process is applied, except that the triplets must be selected randomly (or with a matrix dictionary, 4^102). Conclusion: The compression algorithm has a better compression ratio than the LZW and Huffman algorithms. However, the time needed for decompression makes this algorithm unsuitable for massive data. Its functionality as an MD5sum needs more research, but it is a promising and helpful tool for DNA checking.
ER -