RT Journal Article
SR Electronic
T1 Skip-mers: increasing entropy and sensitivity to detect conserved genic regions with simple cyclic q-grams
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 179960
DO 10.1101/179960
A1 Bernardo J. Clavijo
A1 Gonzalo Garcia Accinelli
A1 Luis Yanes
A1 Katie Barr
A1 Jonathan Wright
YR 2017
UL http://biorxiv.org/content/early/2017/08/23/179960.abstract
AB Bioinformatic analyses and tools make extensive use of k-mers (fixed contiguous strings of k nucleotides) as an informational unit. K-mer analyses are both useful and fast, but are strongly affected by single-nucleotide polymorphisms or sequencing errors, effectively hindering direct analyses of whole regions and decreasing their usability between evolutionarily distant samples. We introduce a concept of skip-mers, a cyclic pattern of used-and-skipped positions of k nucleotides spanning a region of size S ≥ k, and show how analyses are improved compared to using k-mers. The entropy of skip-mers increases with the larger span, capturing information from more distant positions and increasing the specificity, and uniqueness, of larger span skip-mers within a genome. In addition, skip-mers constructed in cycles of 1 or 2 nucleotides in every 3 (or a multiple of 3) lead to increased sensitivity in the coding regions of genes, by grouping together the more conserved nucleotides of the protein-coding regions. We implemented a set of tools to count and intersect skip-mers between different datasets. We used these tools to show how skip-mers have advantages over k-mers in terms of entropy and increased sensitivity to detect conserved coding sequence, allowing better identification of genic matches between evolutionarily distant species.
We also highlight potential applications to problems such as whole-genome alignment and multi-genome evolutionary analyses. Software availability: the skm-tools implementing the methods described in this manuscript are available under the MIT license at http://github.com/bioinfologics/skm-tools/
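
The cyclic used-and-skipped pattern described in the abstract can be illustrated with a minimal sketch. This is not the skm-tools implementation: the parameter names m (bases used per cycle) and n (cycle length), and the function names skipmer_offsets and skipmers, are assumptions introduced here for illustration only, and the sketch ignores strand handling and any other detail of the actual tools. It assumes the used bases are the first m of every n-base cycle.

```python
def skipmer_offsets(m, n, k):
    """Return the offsets of the k used positions within one skip-mer.

    Assumed pattern (hypothetical naming): the first m bases of every
    n-base cycle are used, the remaining n - m are skipped.
    """
    if not (1 <= m <= n) or k < 1:
        raise ValueError("require 1 <= m <= n and k >= 1")
    return [(i // m) * n + (i % m) for i in range(k)]


def skipmers(seq, m, n, k):
    """List every skip-mer of seq, one per valid start position."""
    offsets = skipmer_offsets(m, n, k)
    span = offsets[-1] + 1  # region size S covered by one skip-mer, S >= k
    return [
        "".join(seq[start + o] for o in offsets)
        for start in range(len(seq) - span + 1)
    ]


if __name__ == "__main__":
    # 2 bases used in every cycle of 3, 4 bases in total: span is 5.
    print(skipmer_offsets(2, 3, 4))
    print(skipmers("ACGTACGTAC", 2, 3, 4))
```

With m equal to n the pattern has no skipped positions, so the sketch reduces to ordinary contiguous k-mers, which is why the span S is always at least k.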