Skip-mers: increasing entropy and sensitivity to detect conserved genic regions with simple cyclic q-grams

Bioinformatic analyses and tools make extensive use of k-mers (fixed contiguous strings of k nucleotides) as an informational unit. K-mer analyses are both useful and fast, but are strongly affected by single nucleotide polymorphisms or sequencing errors, effectively hindering direct analyses of whole regions and decreasing their usability between evolutionarily distant samples. Q-grams or spaced seeds, subsequences generated with a pattern of used-and-skipped nucleotides, overcome many of these limitations but introduce greater complexity, which hinders their wider adoption. We introduce the concept of skip-mers, a cyclic pattern of used-and-skipped positions of k nucleotides spanning a region of size S ≥ k, and show how analyses are improved by using this simple subset of q-grams as a replacement for k-mers. The entropy of skip-mers increases with the larger span, capturing information from more distant positions and increasing the specificity, and uniqueness, of larger span skip-mers within a genome. In addition, skip-mers constructed in cycles of 1 or 2 nucleotides in every 3 (or a multiple of 3) lead to increased sensitivity in the coding regions of genes, by grouping together the more conserved nucleotides of the protein-coding regions. We implemented a set of tools to count and intersect skip-mers between different datasets, a simple task given that the properties of skip-mers make them a direct substitute for k-mers. We used these tools to show how skip-mers have advantages over k-mers in terms of entropy and increased sensitivity to detect conserved coding sequence, allowing better identification of genic matches between evolutionarily distant species. We then show benefits for multi-genome analyses provided by increased and better correlated coverage of conserved skip-mers across multiple samples.

Software availability: the skm-tools implementing the methods described in this manuscript are available under the MIT license at http://github.com/bioinfologics/skm-tools/


Genomes are not random strings, but are the product of millions of years of evolution and selection pressure which imparts unique characteristics to the sequence of nucleotides. These characteristics need to be considered in order to better analyse genomic datasets. Here we exploit the increase in entropy [...] 1 and the next section). Skip-mers preserve many of the elegant properties of k-mers such as reverse complementability and existence of a canonical representation which allows strand agnostic analyses (Darling et al., 2006). Also, using cycles of three greatly increases the power of direct intersection between the genomes of different organisms by grouping together the more conserved nucleotides of the protein-coding regions, a property already used by the short 11011011 seeds of the WABA algorithm (Kent and Zahler, 2000). Skip-mers can then be described as a sub-set of q-grams or spaced seeds or a generalisation of the 11011011 seeds first described in WABA: a set of simple cyclic q-grams that increase entropy and sensitivity when analysing divergent coding sequence.

Figure 1. Different SkipMer(m, n, k) cycles defined over the same sequence region, resulting in different combinations of bases. The shape of the underlying cyclic q-gram is defined by the variables m (used bases per cycle), n (cycle length), and k (total number of bases).
A skip-mer is a simple cyclic q-gram that includes m out of every n bases until a total of k bases is [...]

It is important to note that k-mers are a sub-class of skip-mers. A skip-mer with m = n will use all contiguous k nucleotides, which makes it a k-mer. Throughout this manuscript we often use m = 1 ∧ n = 1, or the shorter form notations 1-1 or 1-1-k to refer to k-mers.
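The SkipMer(m, n, k) shape described above can be illustrated with a minimal Python sketch. It is not the skm-tools implementation. It assumes that the used bases are the first m positions of each cycle of n, that k is a multiple of m (which makes the mask symmetric, so the reverse complement of a skip-mer is itself a skip-mer of the same shape), and that sequences contain only upper-case A, C, G and T. All function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def skipmer_offsets(m, n, k):
    """Offsets of the k used bases: the first m positions of each cycle of n.

    Assumption: k is a multiple of m, so the mask reads the same in both
    directions and reverse-complementing a skip-mer gives a skip-mer.
    """
    if not (1 <= m <= n) or k < m or k % m != 0:
        raise ValueError("expected 1 <= m <= n and k a positive multiple of m")
    return [(i // m) * n + (i % m) for i in range(k)]


def reverse_complement(seq):
    """Reverse complement of an upper-case ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def skipmers(seq, m, n, k):
    """All forward-strand SkipMer(m, n, k) strings of seq, in order."""
    offsets = skipmer_offsets(m, n, k)
    span = offsets[-1] + 1  # total region covered by one skip-mer
    return ["".join(seq[start + o] for o in offsets)
            for start in range(len(seq) - span + 1)]


def canonical(skipmer):
    """Strand-agnostic form: the smaller of a skip-mer and its reverse complement."""
    return min(skipmer, reverse_complement(skipmer))


print(skipmer_offsets(2, 3, 4))        # [0, 1, 3, 4]
print(skipmers("ACGTACG", 2, 3, 4))    # ['ACTA', 'CGAC', 'GTCG']
print(skipmers("ACGTACGT", 1, 1, 3))   # m = n: plain 3-mers
```

With m = n the offsets are contiguous, so the last call returns ordinary k-mers, matching the statement that k-mers are a sub-class of skip-mers.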

In particular, the following tools have been used in the preparation of this manuscript: will be produced.

The current implementation of the skm-multiway-coverage tool includes a coverage cut-off that defaults to 1, as this is appropriate for the current study. All skip-mers that occur at a higher frequency than the cut-off are eliminated before any analysis. To consider candidate matches for alignment of conserved sequence, it is appropriate to discard skip-mers with a higher copy number than the expected number of matches, as this will filter repetitive matches including background noise. While our current choice of a cut-off of 1 makes sense in a general analysis such as the one presented in this manuscript, care needs to be taken to make reasonable choices for future applications.

The coverage score is used as a proxy for sequence conservation. To approximate a measure of conserved nucleotides, the coverage is projected over individual nucleotides rather than directly counting shared skip-mers, which would introduce redundancy from phased matches. An equivalent coverage metric for spaced seeds can be found in Noé and Martin (2014), where it is also used to estimate distances. The score for each feature (i.e. gene) versus each genome in the multi-way analyses is calculated as the total number of bases that are included in matching skip-mers from that genome divided by the total number of bases that are covered by valid (i.e. copy number below threshold in the reference) skip-mers from the reference:

$$\text{Coverage score} = \frac{\text{Bases covered by matching skip-mers}}{\text{Bases covered by valid reference skip-mers}}$$

The coverage cut-off is applied before any analyses are performed. When using the default cut-off of 1, skip-mers that have a higher copy number in the reference will not be evaluated for scoring and skip-mers that have a single copy in the reference but more than one copy in the scoring genome will not be counted as covered.

We analysed a genome assembly of Triticum aestivum to investigate the effect of the cycle size n in the multiplicity of the skip-mers in a genome. Figure 2 shows how increasing n, and thus the total span of a [...]

To explore the advantages of the SkipMer(1, 3, k) analyses across divergent genomes we compared the [...]

We computed correlations between the gene scores for D. melanogaster from every pair of the other [...]
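As an illustration of the coverage score and cut-off defined earlier in this section, the following minimal Python sketch computes the score for one feature against one genome. It is not the skm-multiway-coverage implementation. It assumes that only the used bases of each skip-mer are projected as covered, that skip-mers are compared in canonical form, that the feature is given as a half-open interval on the reference, and that sequences contain only upper-case A, C, G and T. All names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_skipmers(seq, m, n, k):
    """Yield (canonical skip-mer, used positions) for every window of seq.

    Assumption: the used bases are the first m of each cycle of n and
    k is a multiple of m.
    """
    offsets = [(i // m) * n + (i % m) for i in range(k)]
    span = offsets[-1] + 1
    for start in range(len(seq) - span + 1):
        positions = [start + o for o in offsets]
        fwd = "".join(seq[p] for p in positions)
        rev = fwd.translate(COMPLEMENT)[::-1]
        yield min(fwd, rev), positions


def coverage_score(reference, genome, feature_start, feature_end,
                   m, n, k, cutoff=1):
    """Bases covered by matching skip-mers / bases covered by valid skip-mers."""
    ref_counts = Counter(sk for sk, _ in canonical_skipmers(reference, m, n, k))
    genome_counts = Counter(sk for sk, _ in canonical_skipmers(genome, m, n, k))
    valid, matched = set(), set()
    for sk, positions in canonical_skipmers(reference, m, n, k):
        # Keep only skip-mers that lie entirely inside the feature.
        if positions[0] < feature_start or positions[-1] >= feature_end:
            continue
        # Skip-mers above the cut-off in the reference are not evaluated.
        if ref_counts[sk] > cutoff:
            continue
        valid.update(positions)
        # Present in the genome, but not above the cut-off there either.
        if 1 <= genome_counts[sk] <= cutoff:
            matched.update(positions)
    return len(matched) / len(valid) if valid else 0.0


reference = "ACGTTGCAAGGCTA"
genome = "ACGTTGCAAGGCGA"  # one substitution near the end
print(coverage_score(reference, genome, 0, len(reference), 1, 1, 4))
```

Projecting positions into sets, rather than counting shared skip-mers, means a base covered by several overlapping (phased) matches is counted once.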