Abstract
We propose a polynomial-time algorithm that computes a minimum plain-text representation of k-mer sets, as well as an efficient greedy heuristic that computes a near-minimum representation. When compressing read sets of large model organisms or bacterial pangenomes, at the cost of only a minor runtime increase, we shrink the representation by up to 60% over unitigs and 27% over previous work. Additionally, the number of strings decreases by up to 97% over unitigs and 91% over previous work. Finally, a small representation has advantages in downstream applications: it speeds up SSHash-Lite queries by up to 4.26× over unitigs and 2.10× over previous work.
Availability
matchtigs: https://github.com/algbio/matchtigs
SSHash-Lite: https://github.com/jermp/sshash-lite
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
[16] Downloaded in February 2022.