Abstract
Graphical representation of DNA sequences is one of the most popular techniques of alignment-free sequence comparison. In this article, we propose a new method for extracting features of DNA sequences represented by binary images, in which we estimate the similarity between DNA sequences by the frequency histograms of local bitmap patterns on the images. Our method has linear time complexity for the length of DNA sequences, which is practical even for comparison of long sequences. We tested five distance measures to estimate sequence similarities and found that histogram intersection and Manhattan distance are most appropriate for our method among them.
1 Introduction
Sequence alignment1,2 is generally used to estimate similarities between relatively short sequences such as nucleotide sequences of genes or amino acid sequences of proteins. However, the time complexity of the alignment is O(L2) for sequences of length L, which requires an enormous amount of computation time when L is large. Therefore, it is necessary to develop so called alignment-free methods, which are independent of alignment, to compare long sequences such as whole genome sequences in practical time.
One of the most popular methods for the alignment-free sequence comparison is graphical representation of biological sequences3. So far various methods based on the graphical representation have been introduced by many authors. The basic procedure is common to almost all the methods, which is outlined as follows: first, each type of bases in a DNA sequence is replaced by an individual vector on a certain dimensional expression space, two-dimension4–22, three-dimension23–36, or more37–40; next, the vectors are connected successively, drawing a trajectory on the expression space; lastly, the distances between the trajectories, or graphs, are calculated based on a pre-defined distance measure.
In this article, we propose a new method for sequence comparison categorized in the graphical representation. We express a DNA sequence as a binary image—each pixel of a binary image is plotted in either black or white—on a 2-dimensional space and count the occurrence frequencies of 3×3 bitmap patterns on the binary image. The distance between the binary images is measured based on the occurrence frequency histograms of the bitmap patterns. As for distance measures between histograms, we selected five frequently used measures: histogram intersection41, Manhattan distance, Bhattacharyya distance42, Jensen-Shannon divergence43, and Kendall’s rank correlation coefficient44. Based on phylogeny of 31 mitochondrial genome sequences, we seek for the most appropriate distance measure for our method among the five.
2 Methods
2.1 Generating a binary image from a DNA sequence
Here, we describe, step-by-step, the procedure to generate a binary image from a DNA sequence.
2.1.1 Graphical representation of a DNA sequence
At first, we assign two-dimensional vectors that are perpendicular to each other to individual types of bases, A, T, G, and C. The number of the independent variations of the assignment is 3!/2 = 3, when we identify the assignments that can be transformed from each other by the rotations of 90-degree or the inversion with respect to the vertical or horizontal axis (Fig. 1). We chose the left most assignment in Fig. 1, where nucleotides A and T are placed on the upper quadrants, and G and C on the lower ones, so that the GC content of a DNA sequence can be grasped easily from the resultant graphical representation. Next, connecting consecutively the vectors assigned to the bases of the DNA sequence from the origin to the end one by one, we can draw a 2-dimensional graph. Fig. 2(a) shows an example—a graphical representation of sequ
2.1.2 Multiplying weighting factors
In order to extract potential information conveyed by individual bases, we introduced weighting factors based on a Markov chain model into the process of generating binary images45. As the weighting factor, we used self-information I(E), which is the amount of information that we will receive when a certain event E occurs. Let P(E) be the probability that event E occurs, then I(E) is defined as I(E) = - log2 P(E) in bits. Here, we define P(E) according to the second order Markov chain, concerning about codons in the coding regions of DNA sequences, which is calculated by the occurrence frequencies of triplets of bases. The probability that nucleotide z occurs after a pair of nucleotides xy (x, y,z ∈ {A, T, G, C}) is calculated by where Nxys(s ∈ {A, T, G, C}) is the number of occurrence of triplet xys, which is measured in all the DNA sequences analyzed.
The individual vectors corresponding to the bases on a DNA sequence are multiplied by the weighting factors according to the preceding two bases. For example, when P(A|AC), P(T|CA), P(A|AT), P(T|TA), and P(G|AT) are .20, .66, .41, .31, and .44, respectively, the weighting factors are calculated to be 2.3, 0.60, 1.3, 1.7, and 1.2, respectively. The first two of a series of vectors in Fig. 2(a) are drawn without weightings because the corresponding weighting factors can not be available. The third vector A is multiplied by 2.3 because the preceding doublet of bases is AC and, therefore, the weighting factor corresponding to P(A|AC) is chosen. The remaining vectors are similarly multiplied by the weighting factors. As a result, the graphical representation of sequence “ACATATG” is modified as shown in Fig. 2(b).
2.1.3 Generating a binary image
A binary image is a digitized image that has pixels of only two possible values 0 and 1, which are typically plotted in white and black, respectively. From the graphical representation of a DNA sequence, we generate a binary image by the following manner; we set value 1 for the pixels that include at least a part of a vector, and 0 otherwise, in the graphical representation (Fig. 2(c)).
2.2 Local patterns
We define a local pattern as a bitmap of a set of adjacent pixels of a certain size of window. Because each pixel of a binary image has two pixel values, the number of local patterns is 2n, where n is the number of pixels belonging to a window. Windows of too large size are dominated by white pixels, and, on the other hand, those of too small size can not have enough variations to express a DNA sequence. In this study, therefore, we chose the window size of 3 × 3, where the number of the local patterns is 29 = 512. Note that we do not include the local pattern whose pixels are all white into the local pattern histograms because it represents the empty background of the images.
Fig. 3 shows the five examples of local patterns of window size 3×3 with their serial numbers below. We derive the serial numbers by lining up the pixels from the upper left corner to the lower right and interpreting them as a binary number with zeros and ones for white and black pixels, respectively, with the upper left corner being the highest bit.
2.3 Counting the occurrence frequencies of local patterns
Sliding a 3 × 3 window by one pixel per move over the binary image, we counted the occurrence frequencies of the local patterns. If the trajectory of the graphical representation of a DNA sequence is like a random walk, the average distance between the origin and the terminus of the trajectory would be O(L1/2), where L is the sequence length. In that case, the rectangle area covering the whole trajectory is proportional to L; hence the computational time to count the occurrence frequencies would become O(L). In reality, however, the trajectory is not a random walk but a curved line of length O(L) in most cases (see Fig. 4 in Section 3), in which the computational time to count the occurrence frequencies becomes O(L2)—this is equivalent to that for pair-wise sequence alignment.
Accordingly we devised the method of counting the occurrence frequencies as follows. We divided the binary image into square regions of 10 × 10 pixel, and marked the regions having at least one black pixel, simultaneously when generating the binary image. Then, we counted the occurrence frequencies of the local patterns only in the marked regions. Thus we can reduce the computational time of counting to O(L), although there remains room for improvement in reducing the numerical coefficients.
2.4 Distance measures between local pattern histograms
There are several measures to estimate similarity/dissimilarity between two histograms. Among them, we chose commonly used five measures as the distance measure between the local pattern histograms, and compared them, seeking for the appropriate measure for our method. We briefly describe them below. In the following formulas, pi and qi are the occurrence frequencies of the local pattern of serial number i in histograms P and Q, respectively, and N is the largest serial number of the local patterns (i.e., N = 511 for 3×3 pixel local patterns). Note that, in the calculation of the distances, the occurrence frequencies are normalized to be .
Histogram intersection (HI) Histogram intersection was proposed by Swain et al.41 for color indexing of images, which is defined as
It ranges from 0 to 1, with 1 for P and Q being identical. To calculate distances, we convert it to DHI (P, Q) = 1 – HI (P, Q).
Manhattan distance (MD) Manhattan distance, also known as City block distance or L1 -norm, is defined as which ranges from 0 to 2, with 0 for P and Q being identical.
Bhattacharyya distance (BD) Bhattacharyya distance 42 is defined between two probability distributions from a divergence which ranges from 0 to 1, with 1 for P and Q being identical. The Bhattacharyya distance is defined from the divergence as DBD (P, Q) = – ln BD (P, Q).
Jensen-Shannon divergence (JS) Jensen-Shannon divergence43 is a symmetrized and smoothed version of Kullback-Leibler divergence46, which is defined as where M = (P + Q)/2 and KL ( ·, M) is the Kullback-Leibler divergence calculated by
Here, mi = (pi + qi)/2. The Jensen-Shannon divergence ranges from 0 to 1, with 0 for P and Q being identical.
Kendall’s rank correlation coefficient (τ) Kendall’s rank correlation coefficient44, also known as Kendall’s τ, is defined as where X is the number of concordant i, j (i > j) pairs in which (pi - pj)(qi - qj) > 0 is satisfied; Y is the number of the discordant pairs in which (pi – pj)(qi – qj) < 0 is satisfied; r is the number of one kind of the tie pairs in which pi = pj and qi ≠ qj are satisfied; and s is the number of the other kind of the tie pairs in which pi ≠ pj and qi = qj are satisfied. If both pi = pj and qi = qj are satisfied, the corresponding i, j pairs are excluded from the computation. Kendall’s τ lies between −1 and 1, with 1 for the rank orders of pis and qis being completely in agreement with each other, and with −1 for them being completely reversal with each other. We re-scaled the Kendall’s τ as so that Dτ (P, Q) ranges from 0 to 1, with 0 for the rank orders of P and Q being identical.
3 Results and Discussion
3.1 Genome sequences analyzed
We downloaded mitochondrial genome sequences of 31 mammalian species from GenBank47 and analyzed them. Table 1 summarizes the genome sequences. Mitochondrial genomes are widely used to study genome evolution and phylogenetic inference due to, for example, a high mutation rate relative to nuclear genomes and a nearly uniform size for mammalian species.
3.2 Weighting factors
We counted the number of occurrences of every tri-nucleotides in all the genome sequences listed in Table 1, sliding a window of length three by one nucleotide per move, and calculated the weighting factors by the method described in Section 2.1. The calculated weighting factors are shown in Table 2. The large (small) weighting factors indicate that the corresponding triplets rarely (often) occur in the genome sequences.
3.3 Graphical representations
Fig. 4 shows the graphical representations of the 31 mammals. It can be recognized that the trajectories among closely related species, such as primates, cats, elephants, bears, and so on, look similar to each other; on the other hand, those between the species in different orders look different. This observation indicates that our method of graphical representation is effective in visual inspection of the sequence similarities. The usefulness of introducing the weighting factors in the graphical representation is argued by one of the authors45,48.
3.4 Local pattern histograms
We counted the numbers of occurrences of the local patterns for the 31 mammalian species and constructed the local pattern histograms. Fig. 5 shows the (un-normalized) local pattern histograms of the 31 mammals. Local patterns 1, 4, 10, 34, 64, 84, 136, 160, 256, and 273 are detected more than 1000 counts in all the genome sequences. Those frequent local patterns are depicted in Fig. 6.
3.5 Construction of phylogenetic trees
Fig. 7 shows the phylogenetic trees constructed from the calculated distance matrices using Unweighted Pair Group Method with Arithmetic mean (UPGMA). The trees are drawn by statistical analysis software R49. The trees of the top panel, the middle one, and the bottom one are constructed based on distance measures HI and MD, BD and JS, and τ, respectively. The upper two trees seem to be well reconstructed in that primates, elephants, cats, bears, and so on, are located in their respective clades, although sheep is separated from buffalo-cow pair in the middle tree. In the bottom tree, on the other hand, some species are located on inadequate places; for instance, pig and white rhinoceros are included in primates, and leopard is separated from cat-tiger pair.
Table 3 shows the Pearson’s correlation coefficients calculated from the distance matrices among the five distance measures. We can find that HI-MD pair and BD-JS pair are strongly correlated with each other, which confirms the topological aspects of the resultant phylogenetic trees in Fig. 7.
To evaluate our phylogenetic trees quantitatively, we measured the Robinson-Foulds (R-F) distances50 between our trees and a reference tree (Fig. 8), which was constructed by ClustalW51 based on the multiple sequence alignment of the mitochondrial genome sequences. The R-F distances were calculated by treedist program in Phylogeny Inference Package (PHYLIP)52. Table 4 summarizes the measured R-F distances. Among the five distance measures, HI and MD show the best performance.
At this point, it is worth discussing the position of hedgehog. We compared our tree constructed by HI and MD (the top panel of Fig. 7) with those given by Huang et al.21 and Yu et al.53 Overall configuration of our tree approximately agrees with their trees except for the position of hedgehog. Krettek et al.54 performed a phylogenetic analysis using concatenated sequences of 13 protein-coding genes of mitochondrial genomes of nine mammals including human, harbor seal, cow, and hedgehog, and identified the position of hedgehog as basal relative to the other species included. Our result is consistent with Krettek et al.54 rather than Huang et al.21 and Yu et al.53
4 Conclusion
In this study, we proposed a novel method for estimating similarities between DNA sequences. In this method, we express DNA sequences as binary images by replacing individual bases with 2-dimensional vectors and connecting them successively. We counted the occurrence frequencies of 3 × 3 bitmap patterns on the binary images and measured the distances between them based on the frequency histograms of the bitmap patterns. We tried five frequently used distance measures to estimate similarity/dissimilarity between two histograms: histogram intersection, Manhattan distance, Bhattacharyya distance, Jensen-Shannon divergence, and Kendall’s rank correlation coefficient.
We compared our phylogenetic trees with that constructed by ClustalW for mito-chondrial genomes of 31 mammalian species. Among the five distance measures, histogram intersection and Manhattan distance showed the best performance in terms of Robinson-Foulds distance between the phylogenetic trees, as well as having the more appropriate position of hedgehog on the phylogenetic tree than those given by Huang et al.21 and Yu et al.53.
The most time consuming step in our method is counting the occurrence frequencies of local patterns. Its time complexity is O(L) for sequence length L, which is practical even for very long sequences, although the proportionality constant is large; it is thought to be possible to make the constant smaller to reduce the actual computation time by improving the counting method.