The Homo-Edit Distance Problem

We consider the homo-edit distance problem, which asks for the minimum number of homo-deletions or homo-insertions needed to convert one string into another. A homo-insertion is the insertion of a string of equal characters into another string, while a homo-deletion is the inverse operation. We show how to compute the homo-edit distance of two strings in polynomial time: We first demonstrate that the problem is equivalent to computing a common subsequence of the two input strings with a minimum number of homo-deletions and then present a dynamic programming solution for the reformulated problem.

2012 ACM Subject Classification: Applied computing → Bioinformatics; Applied computing → Molecular sequence analysis; Theory of computation → Dynamic programming


Introduction
A homo-insertion is an insertion of a string of equal characters, which we also call a block, into another string. A homo-deletion is the inverse operation, that is, the deletion of such a block. We consider the following problem: Given two strings, what is the minimum number of homo-insertions or homo-deletions needed to convert one into the other? We refer to this number as the homo-edit distance.
This distance generalizes the variant of the edit distance in which only insertions and deletions are allowed, also known as the longest common subsequence distance [4,1]. Unlike in this classic special case, where blocks consist of single characters only, two blocks may merge into one after a homo-deletion. For example, the homo-edit distance of ATA and the empty string is 2 and is achieved by first deleting T and then the block AA. This property makes the homo-edit distance more difficult to compute than the longest common subsequence distance.
We became aware of this problem as an exercise (6.40) in the classic textbook on bioinformatics algorithms by Jones and Pevzner [3]. As this exercise caused our students a lot of trouble, we decided to look at it more closely within a thesis project [2]. We show how to compute the homo-edit distance of two strings in polynomial time: We first demonstrate that the problem is equivalent to computing a common subsequence of the two input strings with a minimum number of homo-deletions and then present a dynamic programming solution for the reformulated problem.

Problem Formulation
Let $\Sigma$ be a finite alphabet. A string of length $n \in \mathbb{N}_0$ is defined as $s = s_1 s_2 \ldots s_n \in \Sigma^n$. The empty string is denoted as $\varepsilon$. We also write the length of a string $s$ as $|s|$. A block is a string consisting of identical characters, and we write $a^k$ for a block of length $k$ and some $a \in \Sigma$.
We refer to substrings of $s$ by $s(i, j) = s_i s_{i+1} \ldots s_j$ for $1 \le i \le j \le n$, with the convention that $s(i, j) = \varepsilon$ if $i > j$. Subsequences $s_{k_1} s_{k_2} \ldots s_{k_l}$ of $s$ are characterized by their indices $k_1, k_2, \ldots, k_l$, where $1 \le k_1 < k_2 < \ldots < k_l \le n$.
We define two string operations, which we subsume as homo-operations. The first operation inserts a block of length $k$ into a string at a certain position. Let $a \in \Sigma$ and let $u = a^k$ be the block that is to be inserted into string $s$ at position $i$, where $1 \le i \le n + 1$. We define this homo-insertion as the string
$$I_{i,u}(s) = s(1, i-1)\, u\, s(i, n).$$
The second operation deletes a block $s(i, j) = a \ldots a$ with $a \in \Sigma$ and $1 \le i \le j \le n$. We define this homo-deletion as the string
$$D_{i,j}(s) = s(1, i-1)\, s(j+1, n).$$
Note that both operations are reversible, that is, for each homo-insertion there is a homo-deletion that can be applied to obtain the original string, and vice versa. For an operation $O$ we denote the corresponding reverse operation by $\overline{O}$. Reversibility also holds for chains of operations as the following lemma shows.

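Lemma 1. Let $s$ and $t$ be strings such that $t = O_k(\ldots O_2(O_1(s)) \ldots)$ for a series of homo-operations $O_1, O_2, \ldots, O_k$. Then $s = \overline{O_1}(\overline{O_2}(\ldots \overline{O_k}(t) \ldots))$.
Proof by induction.
Base case: $k = 1$.
Case 1: $t = O_1(s) = I_{i,u}(s)$ is obtained by a homo-insertion of a string $u = a^{j-i+1}$ into $s$ at position $i$, where $a \in \Sigma$ and $j \ge i$. We reverse this homo-insertion by a homo-deletion of the substring $t(i, j) = u$ from $t$, i.e., $D_{i,j}(t) = s$.
Case 2: $t = O_1(s) = D_{i,j}(s)$ is obtained by a homo-deletion of a substring $u = s(i, j)$ from $s$. We reverse this homo-deletion by a homo-insertion of $u$ into $t$ at position $i$, i.e., $I_{i,u}(t) = s$.
Induction step: $k \to k + 1$. Let $s' = O_k(\ldots O_1(s) \ldots)$, so that $t = O_{k+1}(s')$. By the induction hypothesis, $s$ can be recovered from $s'$, so it suffices to reverse $O_{k+1}$.
Case 1: $t = I_{i,u}(s')$ is obtained by a homo-insertion of a string $u = a^{j-i+1}$ into $s'$ at position $i$, where $a \in \Sigma$ and $j \ge i$. We reverse this homo-insertion by a homo-deletion of the substring $t(i, j) = u$ from $t$, i.e., $D_{i,j}(t) = s'$.
Case 2: $t = D_{i,j}(s')$ is obtained by a homo-deletion of a substring $u = s'(i, j)$ from $s'$. We reverse this homo-deletion by a homo-insertion of $u$ into $t$ at position $i$, i.e., $I_{i,u}(t) = s'$.
We define the homo-edit distance $H(s, t)$ between two strings $s$ and $t$ as the minimum number of homo-operations needed to convert $s$ into $t$. From Lemma 1 it follows that the homo-edit distance is symmetric, that is, $H(s, t) = H(t, s)$. We can now define the homo-edit distance problem formally as follows:
Problem 1 (Homo-Edit Distance Problem). Given two strings $s$ and $t$, compute their homo-edit distance $H(s, t)$.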
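The two homo-operations of this section can be made concrete with a minimal Python sketch. The function names `homo_insert` and `homo_delete` and their argument order are illustrative assumptions, not notation from the paper; positions are 1-based as in the text.

```python
def homo_insert(s: str, i: int, a: str, k: int) -> str:
    """Insert the block a^k into s so that it starts at 1-based position i."""
    if len(a) != 1 or k < 1:
        raise ValueError("a block is k >= 1 copies of a single character")
    if not 1 <= i <= len(s) + 1:
        raise ValueError("position i must satisfy 1 <= i <= |s| + 1")
    return s[:i - 1] + a * k + s[i - 1:]


def homo_delete(s: str, i: int, j: int) -> str:
    """Delete the block s(i, j), given by 1-based inclusive indices."""
    if not 1 <= i <= j <= len(s):
        raise ValueError("indices must satisfy 1 <= i <= j <= |s|")
    if len(set(s[i - 1:j])) != 1:
        raise ValueError("s(i, j) is not a block of equal characters")
    return s[:i - 1] + s[j:]


# ATA -> AA -> empty string: the two homo-deletions from the introduction.
step1 = homo_delete("ATA", 2, 2)
step2 = homo_delete(step1, 1, 2)

# A homo-deletion is undone by the matching homo-insertion.
restored = homo_insert(step1, 2, "T", 1)
```

Here `restored` equals the original string ATA, which illustrates the reversibility that the symmetry of the distance relies on.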

Problem Reformulation
In this section we point out that the homo-edit distance between two strings s and t can be computed by considering homo-deletions only. For this we show that there exists a common subsequence v of both strings such that converting both s and t into v needs a total of H(s, t) homo-deletions.
$\ldots, O_k$ must exist as well, because we can repeatedly replace each homo-insertion followed by a homo-deletion with a homo-deletion followed by a homo-insertion, resulting in the same string.
Let $u$ be the string that we want to insert by applying $O_i$ and let $w$ be the string that we want to delete by applying $O_{i+1}$. We consider two cases: Let $a \in \Sigma$, let $u = a^{c_1}$, and let $w = a^{c_2}$, such that after applying both homo-operations we have either inserted $a^{c_1 - c_2}$ (if $c_1 > c_2$) or deleted $a^{c_2 - c_1}$ (if $c_1 \le c_2$). This means we could use one instead of two homo-operations for this insertion or deletion, or even zero if $c_1 = c_2$. Thus, the series $O_1, O_2, \ldots, O_k$ would not be optimal, which is a contradiction.
Let $v$ be the string that we obtain by performing these homo-deletions on $s$. Then $v$ is a subsequence of $s$ by definition. From Lemma 1 we know that we can reverse the homo-insertions of the series, which convert $v$ into $t$, to obtain homo-deletions that convert $t$ into $v$. Thus, $v$ is also a subsequence of $t$, and we can obtain $v$ from $s$ and $t$ by a total of $k$ homo-deletions.
Lemma 3 implies that we can safely disregard homo-insertions for computing homo-edit distances. In the next section we present an algorithm that computes the homo-edit distance of two strings by finding the minimum number of homo-deletions to convert both into a common subsequence.
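The deletion-only reformulation can be checked on small inputs with a brute-force sketch: enumerate every string reachable from each input by homo-deletions and take the cheapest common target. This is an exponential-time illustration under our own naming (`one_deletion`, `deletion_costs`, `brute_force_distance` are hypothetical helpers), not the polynomial algorithm of the next section.

```python
from __future__ import annotations


def one_deletion(s: str) -> set[str]:
    """All strings obtainable from s by exactly one homo-deletion."""
    out = set()
    for i in range(len(s)):
        j = i
        # Extend the block s[i..j] while the characters stay equal.
        while j < len(s) and s[j] == s[i]:
            out.add(s[:i] + s[j + 1:])
            j += 1
    return out


def deletion_costs(s: str) -> dict[str, int]:
    """Breadth-first search: fewest homo-deletions from s to each reachable string."""
    cost = {s: 0}
    frontier = [s]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in one_deletion(u):
                if v not in cost:
                    cost[v] = cost[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier
    return cost


def brute_force_distance(s: str, t: str) -> int:
    """Minimum total number of homo-deletions that turn s and t into a common string."""
    cost_s, cost_t = deletion_costs(s), deletion_costs(t)
    # The empty string is reachable from both inputs, so the intersection is non-empty.
    return min(cost_s[v] + cost_t[v] for v in cost_s.keys() & cost_t.keys())
```

For instance, `brute_force_distance("ATA", "")` evaluates to 2, matching the example from the introduction. The search is only practical for very short strings, since the number of reachable subsequences grows exponentially.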

Dynamic Programming Algorithm
This section contains our algorithmic contributions to the problem, their correctness proofs, a note on backtracking and a running time analysis.

Algorithms
We compute the homo-edit distance between two strings $s$ and $t$ with a two-part dynamic programming (DP) algorithm: The first part is a precomputation step that computes and stores the homo-edit distance between every substring of both $s$ and $t$ and the empty string $\varepsilon$. The second part is the main algorithm that, similar to classic textbook approaches for sequence alignment, computes a DP matrix containing the homo-edit distances between all prefixes of $s$ and $t$. For better understanding we explain the main algorithm first.
Given two strings $s$ and $t$, let $v$ be an optimal common subsequence, that is, $v$ satisfies the conditions of Lemma 3. Let $m = |s|$ and $n = |t|$. We compute an $(m + 1) \times (n + 1)$ matrix $d$, where each entry $d_{i,j}$ corresponds to the homo-edit distance between the prefixes $s(1, i)$ and $t(1, j)$, with the following recurrence:
$$d_{0,0} = 0, \qquad d_{i,j} = \min \begin{cases} d_{i-1,j-1} & \text{if } i, j \ge 1 \text{ and } s_i = t_j, \\ \min_{0 \le k < i} \{ d_{k,j} + H(s(k+1, i), \varepsilon) \}, \\ \min_{0 \le l < j} \{ d_{i,l} + H(t(l+1, j), \varepsilon) \}. \end{cases} \tag{1}$$
We start by initializing $d_{0,0}$ with 0. For all other entries we proceed, e.g., from top to bottom ($i = 0, 1, \ldots, m$) and from left to right ($j = 0, 1, \ldots, n$), and consider three cases for the homo-edit distance between $s(1, i)$ and $t(1, j)$, among which we pick the minimum:
1. The first case is given if we have a match, i.e., $s_i = t_j$. In this case, the common character could be part of an optimal common subsequence $v$. As we would neither delete $s_i$ nor $t_j$ by a homo-deletion, we have $d_{i,j} = d_{i-1,j-1}$.
2. The next case comprises all possibilities that involve deleting $s_i$ from $s$, meaning that this character would not be part of an optimal subsequence $v$. More precisely, for $d_{i,j}$ we consider each entry $d_{k,j}$ of the same column $j$ in a row $k$ above, plus the cost of deleting $s(k + 1, i)$. We will show how to compute the homo-edit distances between all substrings of a string and the empty string $\varepsilon$ later.
3. The last case consists of all possibilities where we delete $t_j$ from $t$. More precisely, for $d_{i,j}$ we consider each entry $d_{i,l}$ of the same row $i$ in a column $l$ to the left, plus the cost of deleting $t(l + 1, j)$.
Eventually, $d_{m,n}$ contains the homo-edit distance between $s$ and $t$. We can obtain an optimal subsequence $v$, and thereby an optimal series of operations to obtain $s$ from $t$ or vice versa, by backtracking the cases from $d_{m,n}$ to $d_{0,0}$. Note that there can be multiple possibilities for $v$. See Algorithm 1 and the paragraph about backtracking below for more details.
Algorithm 1 Main dynamic programming algorithm to compute the homo-edit distance between two strings s and t.
function int homoEditDistance(s, t)
    let H be a dictionary, which holds all entries for distancesToEmptyString of s and t, with substrings as keys and the corresponding homo-edit distances to ε as values
    ...

It remains to compute the homo-edit distances $H(s(i, j), \varepsilon)$ between all substrings of a string $s$ of length $n$ and the empty string. Again, we use dynamic programming, filling an $n \times n$ matrix $h(s)$, with the following recurrence:
$$h_{i,i}(s) = 1, \qquad h_{i,j}(s) = \min_{i \le k < j} \{ h_{i,k}(s) + h_{k+1,j}(s) \} - [s_i = s_j] \quad \text{for } i < j. \tag{2}$$
We start by initializing all homo-edit distances between $\varepsilon$ and every substring $s(i, i)$ of length one to $h_{i,i}(s) = H(s(i, i), \varepsilon) = 1$ for all $1 \le i \le n$. Then we loop over all substrings of length two and compute their homo-edit distances to $\varepsilon$, and repeat the same procedure for all substrings of increasing length up to length $n$: To compute $h_{i,j}(s)$, we partition substring $s(i, j)$ into all possible pairs of shorter substrings $s(i, k)$ and $s(k + 1, j)$, where $i \le k < j$. For each partition we compute the cost to delete it, and choose the minimum of these costs. If $s_i \ne s_j$, the cost of deleting a partition is the sum of the costs to delete either substring. If, however, $s_i = s_j$, the cost decreases by one, which we notate using the Iverson bracket. The reason is that all partitions delete $s_i$ in $s(i, k)$ and $s_j$ in $s(k + 1, j)$ separately by two homo-deletions, but it is always possible to delete the characters at the first and last index together with one homo-deletion. In the end, $h_{i,j}(s) = H(s(i, j), \varepsilon)$ for all $1 \le i < j \le n$. See Algorithm 2 and the correctness proof below for more details.
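The precomputation described above (minimum over all splits, minus one if the outer characters agree) can be sketched bottom-up in Python. The function name `distances_to_empty` and the 0-based table layout are our assumptions; the sketch follows the prose description rather than the original listing of Algorithm 2.

```python
from __future__ import annotations


def distances_to_empty(s: str) -> list[list[int]]:
    """Return h with h[i][j] = H(s[i..j], empty string) for 0-based i <= j."""
    n = len(s)
    h = [[0] * n for _ in range(n)]
    for i in range(n):
        h[i][i] = 1  # a single character needs exactly one homo-deletion
    for length in range(2, n + 1):  # substrings of increasing length
        for i in range(n - length + 1):
            j = i + length - 1
            # Cheapest way to split s[i..j] into s[i..k] and s[k+1..j].
            best = min(h[i][k] + h[k + 1][j] for k in range(i, j))
            # Iverson bracket: equal outer characters can share one homo-deletion.
            h[i][j] = best - (1 if s[i] == s[j] else 0)
    return h


h = distances_to_empty("ATA")
cost_of_whole_string = h[0][2]
```

For ATA the entry for the whole string is 2, in line with the example from the introduction. The three nested loops give cubic time in the length of the string, and the entries below the diagonal are simply left unused.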
Algorithm 2 Auxiliary dynamic programming algorithm to compute the homo-edit distance between every substring of a string s and the empty string.

return H
The example in Fig. 1 illustrates how the algorithms compute the homo-edit distance for the input strings s = CTGCA and t = AGAAC.
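To see how the main algorithm and the precomputation interact on these two strings, the following Python sketch combines both steps. It is a compact illustration under our own naming (`homo_edit_distance`, `deletion_cost`), with the substring costs computed by a memoized recursion instead of an explicit matrix so that the block is self-contained; it is not the authors' implementation.

```python
from functools import lru_cache


def homo_edit_distance(s: str, t: str) -> int:
    """Homo-edit distance via deletions to a common subsequence."""

    def deletion_cost(x: str):
        @lru_cache(maxsize=None)
        def h(i: int, j: int) -> int:
            # Cost of deleting x[i..j] (0-based, inclusive) down to the empty string.
            if i == j:
                return 1
            best = min(h(i, k) + h(k + 1, j) for k in range(i, j))
            return best - (1 if x[i] == x[j] else 0)
        return h

    hs, ht = deletion_cost(s), deletion_cost(t)
    m, n = len(s), len(t)
    inf = float("inf")
    d = [[inf] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = inf
            if i > 0 and j > 0 and s[i - 1] == t[j - 1]:
                best = d[i - 1][j - 1]  # match: keep the common character
            for k in range(i):  # delete s(k+1, i), i.e. s[k..i-1] in 0-based terms
                best = min(best, d[k][j] + hs(k, i - 1))
            for l in range(j):  # delete t(l+1, j), i.e. t[l..j-1] in 0-based terms
                best = min(best, d[i][l] + ht(l, j - 1))
            d[i][j] = best
    return d[m][n]


distance = homo_edit_distance("CTGCA", "AGAAC")
```

With this sketch, `distance` evaluates to 5: the common subsequence GC is reached with three homo-deletions from CTGCA (C, T and A) and two from AGAAC (A and AA). The recursive helper is only suitable for short inputs because of Python's recursion limit.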

Correctness
Lemma 4. Given a string $s = s_1 s_2 \ldots s_n$, Recurrence (2) computes the homo-edit distance between every substring of $s$ and the empty string $\varepsilon$.

Proof by induction.
Base case: $n = 1$. We need exactly one homo-deletion for one character; thus we have a homo-edit distance of 1, which is consistent with Recurrence (2).
Induction step: $n \to n + 1$. We consider two cases:
Case 1: There exists an index $k$ with $1 \le k < n + 1$ such that $s(1, k)$ and $s(k + 1, n + 1)$ can be deleted independently of one another, i.e., we do not perform a homo-deletion that involves both substrings at once. Then the induction holds since we can reduce this problem to two subproblems.
Case 2: There exists no such index $k$. Then $s_1$ and $s_{n+1}$ are the same character and must be deleted together because otherwise Case 1 would apply. That is, the cost of deleting $s(1, n + 1)$ is the same as for $s(1, n)$ because we can always delete $s_1$ and $s_{n+1}$ (and perhaps other equal characters in between) together with the last homo-deletion before reaching $\varepsilon$.
Proof. The edit distance problem is to convert a string into another such that the sum of individual costs of the editing operations insertion, deletion, and substitution is minimized, where each of these editing operations acts on exactly one character. Ukkonen [6] describes a generalization of this problem: Given two strings $s = s_1 s_2 \ldots s_n$ and $t = t_1 t_2 \ldots t_m$, we want to convert $s$ into $t$ such that the sum of individual costs of editing operations is minimized. We can show that a problem is also a generalized edit distance problem by giving an editing operation set $E \subset \Sigma^* \times \Sigma^*$, where an element $(x, y)$, $x \ne y$, represents an editing operation that replaces $x$ with $y$, and a recurrence that defines a matrix $d$ with cost function $\delta : E \to \mathbb{N}$ as follows: (Note that we rewrote Ukkonen's recurrence to fit our notation.) Hence, if the homo-edit distance problem is a generalized edit distance problem, Recurrence (1) works correctly.
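For the homo-edit distance problem we can represent the editing operation set as
$$E = \{ (x, \varepsilon) \mid x \in \Sigma^+ \} \cup \{ (\varepsilon, y) \mid y \in \Sigma^+ \}.$$
The cost function can be defined as
$$\delta(x, \varepsilon) = H(x, \varepsilon), \qquad \delta(\varepsilon, y) = H(y, \varepsilon).$$
As a result, Algorithm 1 works correctly as we can rewrite Recurrence (1) as Ukkonen's recurrence.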

Backtracking
From Lemma 1 and Lemma 3 we can deduce that an optimal series of operations needed for transforming $s$ into $t$ can be inferred from an optimal series of homo-deletions needed to transform both $s$ and $t$ into a common subsequence $v$ with the property described in Lemma 3. Therefore, we disregard homo-insertions. Besides, we focus on backtracking one optimal series of homo-deletions that transform each input string into $v$. Note, however, that there might be multiple possible optimal series and subsequences. In order to backtrack and thus generate an optimal series of homo-deletions as well as $v$, we augment our matrices $d$ and $h$ as follows: For each entry $d_{i,j}$, we additionally store the indices of an entry $d_{i',j'}$ from which we came. For each entry $h_{i,j}(s)$, we additionally store the smallest index $k$ that led to $h_{i,j}(s)$. We proceed analogously for $h(t)$.
Next, we backtrack a path from $d_{m,n}$ to $d_{0,0}$. Let $d_{i_1,i_2}$ be the entry from which we obtained our current entry $d_{j_1,j_2}$. Let $v$ be initialized as the empty string $\varepsilon$; it will eventually hold the desired common subsequence. Let $L_s$ and $L_t$ be initially empty lists in which we store the indices denoting an optimal series of homo-deletions from $s$ and $t$, respectively. Note that homo-deletions cause indices to shift, so the indices stored in $L_s$ and $L_t$ might need to be adjusted accordingly. We consider three cases:
1. If we obtained $d_{j_1,j_2}$ from a match, we prepend $s_{j_1}$ to $v$.
2. If we obtained $d_{j_1,j_2}$ from an entry above that deletes $s(i_1, j_1)$ from $s$, we recursively split the deletion of $s(i_1, j_1)$ into the deletion of the two substrings $s(i_1, k)$ and $s(k + 1, j_1)$, where $k$ is the respective index obtained from backtracking $h(s)$. We abuse notation by using the same notation for any lower level of the recursion. The recursion adds a tuple $(k, k)$ (or $(k + 1, k + 1)$) to $L_s$ if a substring $s(k, k)$ (or $s(k + 1, k + 1)$) consists of one character only. Every time we move up one recursion level, we check whether the outer characters $s_{i_1}$ and $s_{j_1}$ are equal. If so, we remove from $L_s$ the tuple that is returned first by $s(i_1, k)$, which contains $i_1$, as well as the tuple that is returned last by $s(k + 1, j_1)$, which contains $j_1$. We then append $(i_1, j_1)$ to $L_s$.
3. If we obtained $d_{j_1,j_2}$ from an entry to the left that deletes $t(i_2, j_2)$ from $t$, we proceed analogously to the second case.
Proof. Follows from Lemmas 4 and 5 and the above running time analysis.

Conclusions
The focus of this paper is to introduce the homo-edit distance problem and to present a solution to compute this distance in polynomial time. We have not yet considered applications of this distance to specific problems in bioinformatics and leave this as future work. We can, for example, imagine applications to sequence analysis problems that involve tandem repeats, in a similar way as done by Sammeth and Stoye [5] who analyzed coding regions of the Staphylococcus aureus protein A gene (spa). S. aureus is a major human pathogen, and the analysis of relations between antibiotics-resistant strains can have important implications for clinical practice. Here, the homo-edit distance could be a good starting point for an all-against-all comparison of the spa-regions of different strains with an alphabet given by the repeats or higher order repeat structures.
Another possible application is the analysis of homopolymer-rich DNA regions. Basecalling in these regions is particularly difficult for pyro- and ion torrent-based sequencing technologies, where over- and undercalling are common errors. The challenge is to distinguish these sequencing artifacts from true genetic content, where a homo-edit distance-based analysis of the reads falling in such regions may provide some help.
In general, we envision also more theoretical work on extensions of the homo-edit distance. For which combinations of additional biologically meaningful operations like, e.g., duplications or mutations, can the distance still be computed in polynomial time and which versions become intractable? These and related open questions provide challenging opportunities for the theoretical bioinformatics community.