## Abstract

In computational biology, mapping a sequence *s* onto a sequence graph *G* is a significant challenge. One possible approach is to identify a walk *p* in *G* that spells a sequence most similar to *s*. This problem is known as the Graph Sequence Mapping Problem (`GSMP`). In this paper, we study an alternative formulation, namely the De Bruijn Graph Sequence Mapping Problem (`BSMP`), and focus on the variant in which changes are made to the graph. We reformulate the problem to take into account the arcs induced in the De Bruijn graph. This reformulation modifies the problem definition and allows a polynomial-time algorithm to be applied to solve it.

## I. Introduction

A very relevant task in computational biology is to map a sequence onto another for comparison purposes. Typically, one sequence is compared to a reference sequence, which is considered a high-quality sequence representing a set of sequences [9], [10]. However, the reference sequence is biased: it represents only a limited set of sequences and cannot account for all possible variations. One way to overcome this bias is to represent multiple sequences in a more robust structure [6], such as a sequence graph or a De Bruijn graph [7], [13], [14].

The *sequence graph* is a graph in which each node is labeled with one or more characters, and the *simple sequence graph* is one where each node is labeled with exactly one character [7]. In the *De Bruijn graph* [13], [14] of order *k*, each node is labeled with a distinct sequence of length *k*, and there is an arc from one node to another if and only if there is an overlap of length *k −* 1 from the suffix of the first to the prefix of the second.

Informally, a walk *p* in a graph *G* is a sequence of nodes connected by arcs. Given a sequence graph *G*, a walk *p* in *G* can spell a sequence *s*^{′} by concatenating the characters associated with each node of *p*. The *Graph Sequence Mapping Problem* – `GSMP` consists of finding a walk *p* in a sequence graph *G* that spells a sequence as similar as possible to a given sequence *s*. One of the first articles to address the `GSMP` in more detail was written by Amir *et al.*, entitled *Pattern Matching in Hypertext* [1].

Supported by CAPES and UFMS.

Navarro improved the results of this article and detailed these improvements in the article entitled *Improved Approximate Pattern Matching on Hypertext* [8]. For the approximate mapping, Amir *et al.* were the first authors in the literature to identify an asymmetry in the location of the changes, showing the importance of understanding whether changes happen only in the pattern, only in the hypertext, or in both. Considering the asymmetry identified by Amir *et al.*, the `GSMP` allows three variants:

1. allows changes only in the pattern when mapping the pattern in hypertext;
2. allows changes only in the hypertext when mapping the pattern in hypertext;
3. allows changes in the pattern and hypertext when mapping the pattern in hypertext.

For variant 1, Amir *et al.* proposed an algorithm that runs in *O*(|*V*| + *m*|*A*|) time, which was improved by Navarro to run in *O*(*m*(|*V*| + |*A*|)) time. Here, |*V*| is the number of nodes in the graph, |*A*| is the number of arcs, and *m* is the length of the mapped pattern. For variants 2 and 3, Amir *et al.* proved that the respective problems are NP-complete considering the Hamming and edit distances when the alphabet Σ has |Σ| *≥* |*V*|. In the work entitled *On the Complexity of Sequence-to-Graph Alignment* [3], Jain *et al.* prove that variants 2 and 3 are NP-complete when the alphabet Σ has |Σ| *≥* 2.

The first time the `GSMP` was addressed using a De Bruijn graph as input was in the article entitled *Read Mapping on De Bruijn Graphs* [2]. In this work, Limasset *et al.* propose the following problem, called here the *De Bruijn Graph Sequence Mapping Problem* – `BSMP`: given a De Bruijn graph *G*_{k}, a sequence *s*, and *d ∈* ℕ, the goal is to find a path *p* in *G*_{k} such that the sequence *s*^{′} spelled by *p* has at most *d* differences from *s*. The `BSMP` was proved to be NP-complete considering the Hamming distance, leading to the development of a seed-and-extend heuristic by the mentioned authors. Note that for the `BSMP` it does not make sense to talk about the three variants mentioned above since there are no node repetitions.

Recently, the `BSMP` was addressed for walks in the article entitled *On the Hardness of Sequence Alignment on De Bruijn Graphs* [4]. Gibney *et al.* proved that the problem is NP-complete when the changes occur in the graph, and that there is no algorithm faster than *O*(|*A*| *m*) for De Bruijn graphs when changes occur only in *s*, where |*A*| is the number of arcs and *m* is the length of *s*.

In a recently published article entitled *Heuristic for the De Bruijn Mapping Problem* [15], we developed heuristics for Variant 1 (sequence changes) of the `BSMP`. In this article, we focus on Variant 2 (graph changes) of the `BSMP`. This work is motivated by the study by Gibney *et al.*, who demonstrated that the problem is NP-complete for Variant 2. Nevertheless, we present a reformulation of the `BSMP` that enables the application of a polynomial-time algorithm to solve the problem.

## II. Preliminaries

In this section, we describe some necessary concepts such as computational definitions and problem definition that are used in this paper.

### A. Sequence, distance, graphs and matching

Let Σ be an **alphabet** with a finite number of characters. We denote a sequence (or string) *s* over Σ by *s*[1]*s*[2] … *s*[*n*] in which each character *s*[*i*] *∈* Σ. We say that the **length** of *s*, denoted by |*s*|, is *n* and that *s* is an *n***-length** sequence.

We say that the sequence *s*[*i*]*s*[*i*+1] … *s*[*j*] is a **substring** of *s* and we denote it by *s*[*i, j*]. A substring of *s* with length *k* is a *k*-length sequence and is also called a *k***-mer** of *s*. For 1 *≤ j ≤ n*, in which *n* = |*s*|, a substring *s*[1, *j*] is called a **prefix** of *s* and a substring *s*[*j, n*] is called a **suffix** of *s*.

Given five sequences *s, t, x, w, z*, we define *st* as the **concatenation** of *s* and *t*; this concatenation contains all the characters of *s* followed by the characters of *t*. If *s* and *t* are *n*- and *m*-length sequences respectively, *st* is an (*n*+*m*)-length sequence. If *s* = *xw* (*x* is a prefix and *w* is a suffix of *s*) and *t* = *wz* (*w* is a prefix and *z* is a suffix of *t*), we say the substring *w* is an **overlap** of *s* and *t*.
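As an illustration of these definitions, the following Python sketch extracts the *k*-mers of a sequence and computes the longest overlap of two sequences. The function names `kmers` and `longest_overlap` are hypothetical and chosen only for this example; note that Python strings are 0-indexed, unlike the 1-indexed notation used above.

```python
def kmers(s: str, k: int) -> list[str]:
    """Return all k-length substrings (k-mers) of s, in order of occurrence."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def longest_overlap(s: str, t: str) -> str:
    """Return the longest non-empty w with s = xw and t = wz, or "" if none exists."""
    for length in range(min(len(s), len(t)), 0, -1):
        # A suffix of s of this length must equal the prefix of t of the same length.
        if s.endswith(t[:length]):
            return t[:length]
    return ""
```

For instance, `longest_overlap("ACT", "CTG")` yields `"CT"`, an overlap of length *k −* 1 for *k* = 3.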

The **Hamming distance** *d*_{H} of two *n*-length sequences *s* and *t* is defined as

$$d_H(s, t) = \sum_{i=1}^{n} \mathbb{1}(s[i] \neq t[i]),$$

where the indicator 𝟙(*s*[*i*] ≠ *t*[*i*]) is equal to 1 if *s*[*i*] ≠ *t*[*i*], and 0 otherwise. In this context, we also say that *s* and *t* have *d*_{H}(*s, t*) differences.
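The definition translates directly into code. The following is a minimal Python sketch; the function name `hamming_distance` is an assumption made for this example.

```python
def hamming_distance(s: str, t: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance is defined only for equal-length sequences")
    return sum(1 for a, b in zip(s, t) if a != b)
```

For the 3-mers used later in the paper, `hamming_distance("ACG", "TGC")` is 3 and `hamming_distance("ACG", "GCG")` is 1.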

A **graph** is an ordered pair (*V, A*) of two sets in which *V* is a set of elements called **nodes** (of the graph) and *A* is a set of ordered pairs of nodes, called **arcs** (of the graph). Given a graph *G*, a **walk** in *G* is a sequence of nodes *p* = *v*_{1}, …, *v*_{k}, such that for each pair *v*_{i}, *v*_{i+1} of consecutive nodes in *p* there is an arc (*v*_{i}, *v*_{i+1}) *∈ A*. A **path** in *G* is a walk with no repeated nodes. Given a walk *p* = *v*_{1}, …, *v*_{k}, |*p*| = *k −* 1 is the **length** of *p*. For graphs with costs associated with their arcs, the **cost** of a walk *p* is the sum of the costs of the arcs (*v*_{i}, *v*_{i+1}) between all consecutive pairs of nodes in *p*. A **shortest path** from *v*_{1} to *v*_{k} is one whose cost is minimum (a path of **minimum cost**).

An undirected graph *G* = (*V, A*) is called a **bipartite graph** *H* = (*V*_{1} *∪ V*_{2}, *A*) if there is a partition of its vertices *V* into two disjoint sets *V*_{1} and *V*_{2}, such that for each edge (*u, v*) *∈ A, u ∈ V*_{1} and *v ∈ V*_{2}. A bipartite graph is called a **complete bipartite graph** if there is an edge from every vertex *u ∈ V*_{1} to every vertex *v ∈ V*_{2}.

A **matching** *E* in *H* consists of a set of *m* edges of *H* that do not share common vertices. The **cost of the matching** *E* is *C*(*E*) = Σ_{e ∈ E} *z*(*e*), where *z*(*e*) is the cost of an edge *e* = {*x, y*} *∈ A*. A matching is said to be of **minimum cost** if there is no matching with a lower cost. A matching *E* is **maximum** if there is no other matching larger than *E*, i.e., if there is no matching *E*^{′} such that |*E*^{′}| *>* |*E*|. Given the set ℰ of all maximum matchings, a matching *E ∈* ℰ is said to be **maximum and of minimum cost** if there is no matching *E*^{′} *∈* ℰ such that *C*(*E*^{′}) *< C*(*E*).

The problem of finding a minimum-cost matching *E* in a given bipartite graph *H* = (*V*_{1} *∪ V*_{2}, *A*) is extensively studied in the literature and can be solved using, for example, Edmonds’ algorithm (with a time complexity of *O*((|*V*_{1}| *×* |*V*_{2}|)^{3})), the Hopcroft-Karp algorithm (with a time complexity of *O*(|*A*| √(|*V*_{1}| + |*V*_{2}|))), and the Hungarian algorithm (with a time complexity of *O*((|*V*_{1}| *×* |*V*_{2}|)^{3})) [11], [12], [17].

A **sequence graph** is a graph (*V, A*) with a sequence of characters, built on an alphabet Σ, associated with each of its nodes. A **simple sequence graph** is a graph in which each node is labeled with only one character. Given a set *S* = {*r*_{1}, …, *r*_{m} } of sequences and an integer *k ≥* 2, a **De Bruijn graph** is a graph *G*_{k} = (*V, A*) such that:

- *V* = {*d ∈* Σ^{k} | *d* is a substring of length *k* (*k*-mer) of some *r ∈ S* and *d* labels only one node};
- *A* = {(*d, d*^{′}) | the suffix of length *k −* 1 of *d* is a prefix of *d*^{′}}.

Note that we can define the De Bruijn graph just by its set of vertices: the arcs are induced, one for each pair of *k*-mers *d* and *d*^{′} such that the (*k −* 1)-length suffix of *d* is equal to the (*k −* 1)-length prefix of *d*^{′}.
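To make the induced-arc view concrete, the following Python sketch builds the vertex set from a set of sequences and derives the arcs from the vertices alone. The names `de_bruijn_nodes` and `induced_arcs` are hypothetical, and the quadratic pairwise comparison is chosen for clarity rather than efficiency.

```python
def de_bruijn_nodes(sequences: list[str], k: int) -> set[str]:
    """Vertex set V: every distinct k-mer occurring in some sequence of S."""
    return {r[i:i + k] for r in sequences for i in range(len(r) - k + 1)}


def induced_arcs(nodes: set[str]) -> set[tuple[str, str]]:
    """Arc set A: (d, e) whenever the (k-1)-suffix of d equals the (k-1)-prefix of e."""
    return {(d, e) for d in nodes for e in nodes if d[1:] == e[:-1]}
```

For the vertex set {`ACT`, `CTG`, `ACG`, `TTT`}, the only induced arcs are (`ACT`, `CTG`) and the self-loop (`TTT`, `TTT`), so changing a vertex label can create or remove arcs.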

In this paper, informally for readability, we consider the node label as node. Given a walk *p* = *v*_{1}, *v*_{2}, …, *v*_{n} in a De Bruijn graph *G*_{k}, the **sequence spelled** by *p* is the sequence *v*_{1}*v*_{2}[*k*] … *v*_{n}[*k*], given by the concatenation of the *k*-mer *v*_{1} with the last character of each *k*-mer *v*_{2}, …, *v*_{n}. For a walk *p* = *v*_{1}, *v*_{2}, …, *v*_{n} in a simple sequence graph *G*, the sequence spelled by *p* is *v*_{1}*v*_{2} … *v*_{n}. A **mapping** of a sequence *s* onto a simple sequence graph or a De Bruijn graph *G* is a walk *p* in *G* whose editing cost between *s* and the sequence spelled by *p* is minimum.
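The spelling of a walk in a De Bruijn graph can be sketched in Python as follows. The helper names `is_walk` and `spell` are assumptions made for this example, and nodes are identified with their *k*-mer labels as in the text.

```python
def is_walk(walk: list[str]) -> bool:
    """Check that consecutive k-mers overlap by k-1 characters, i.e. an arc is induced."""
    return all(a[1:] == b[:-1] for a, b in zip(walk, walk[1:]))


def spell(walk: list[str]) -> str:
    """Sequence spelled by a walk: the first k-mer plus the last character of each later k-mer."""
    if not walk:
        return ""
    return walk[0] + "".join(v[-1] for v in walk[1:])
```

For example, the walk `ACT`, `CTG`, `TGC`, `GCG` spells `ACTGCG`.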

Given the definitions above, we state the following problems for simple sequence graphs (`GSMP`) and for De Bruijn graphs (`BSMP`) when we have changes only in the sequence, respectively:

(*Graph Sequence Mapping Problem* – `GSMP`): Given an *m*-length sequence *s* and a simple sequence graph *G*, find a mapping of *s* onto *G*.

(*De Bruijn Graph Sequence Mapping Problem* – `BSMP`): Given an *m*-length sequence *s* and a De Bruijn graph *G*_{k} of order *k*, find a mapping of *s* onto *G*_{k}.

## III. Reformulating the De Bruijn Mapping Problem

Given a De Bruijn graph over an alphabet Σ with |Σ| *≥* 4, a pattern *P*, and an integer *δ ≥* 0, the `BSMP` studied by Gibney *et al.* [4] asks whether there exists a walk *p* in the De Bruijn graph such that, after at most *δ* substitutions in the graph, the Hamming distance between the sequence induced by *p* and the pattern *P* is zero. In this context, the authors use a simple sequence graph to represent the De Bruijn graph. It is essential to note that, when the De Bruijn graph is viewed as a simple sequence graph, the *k*-mers are represented by walks of length *k*. In other words, each walk of length *k* induces a distinct *k*-mer.

It is important to highlight that the NP-completeness of this problem for a simple sequence graph was already known, as explored earlier by Amir *et al.* and Jain *et al.* The reduction studied by the authors starts from a directed graph *D* that contains a Hamiltonian cycle. From *D*, they construct the De Bruijn graph as a simple sequence graph *G* and demonstrate that this graph is indeed a De Bruijn graph. Finally, the authors’ reduction proves that if there is a Hamiltonian cycle in *D*, then there exists a walk *p* in *G* with at most *δ* modifications in *G*, such that the sequence induced by *p* and the pattern *P* have a Hamming distance of zero. It is also demonstrated that if there is a walk *p* in *G* with at most *δ* modifications to *G*, then there exists a Hamiltonian cycle in *D*.

When exclusively considering modifications in the graph, they approach the problem in a way that does not compromise the fundamental structure of the graph, i.e., vertices and arcs are not modified. This characteristic is crucial when analyzing the NP-completeness of the problem since, as mentioned earlier, the problem has been proven to be NP-complete on simple sequence graphs, where modifications naturally preserve the structure of vertices and arcs.

On the other hand, when specifically addressing De Bruijn graphs, it is possible to explore the induction of new arcs after a modification in the graph. Given this characteristic, in this work, we explore modifications in the graph allowing for a change that can induce new arcs, implying a change in its underlying structure. This flexibility allows us, when the induction of new arcs is considered, to find solutions in polynomial time.

The main distinction lies in the approach to graph modifications. For Gibney *et al.*, the impossibility of altering the structure contributes to the NP-completeness of the problem. Here, we admit changes that affect not only the labels of the vertices but also the topology of the graph, and this permissiveness makes polynomial-time strategies possible when the introduction of new arcs is allowed. In summary, the difference between this work and that of Gibney *et al.* lies in the manipulation of the De Bruijn graph.

Given that the De Bruijn graph allows the induction of arcs through its *k*-mers, we consider the definitions in the next two sections to redefine the problem taking into account this characteristic of induced arcs.

### A. The s-transformation of a De Bruijn graph

Let *s* be a sequence and *G*_{k} a De Bruijn graph, where *V* = {*v*_{1}, …, *v*_{n}} is the set of vertices of *G*_{k}. We denote by ***K*(*s*)** = {*w*_{1}, …, *w*_{m}} the set of *m* unique substrings of length *k* extracted from *s*, with *n ≥ m*. An ***s*-transformation** *S* of *G*_{k} is the substitution of each vertex (*k*-mer) *v*_{i} by *w*_{i} (denoted as *v*_{i} *→ w*_{i}), resulting in a new graph where there exists a walk that induces *s*. A cost function ***c*(*S*)** is defined as the sum of Hamming distances between the corresponding pairs *v*_{i} and *w*_{i}, i.e., *c*(*S*) = Σ_{i} *d*_{H}(*v*_{i}, *w*_{i}). The set ***𝒮*** is defined as the set of all possible *s*-transformations of *G*_{k}, and *D*_{H}(*s, G*_{k}) = min_{S ∈ 𝒮} *c*(*S*).

Based on the previously presented definitions, consider, for example, the sequence *s* = `ACTGCG` and the De Bruijn graph *G*_{3} represented only by its vertices *V* = {`ACT`, `CTG`, `ACG`, `TTT`}. In this case, we have *K*(*s*) = {`ACT`, `CTG`, `TGC`, `GCG`}. Figure 1 shows a graphical representation of *G*_{3} with its induced arcs, together with the graph obtained after applying an *s*-transformation *S* to *G*_{3} considering the *k*-mers of *K*(*s*). This *S* replaces, in *G*_{3}, the *k*-mer `ACG` with `TGC` and the *k*-mer `TTT` with `GCG`, so *c*(*S*) = 3 + 3 = 6. In the transformed graph, the distance between *s* and the sequence induced by *p* = *v*_{1}, *v*_{2}, *v*_{3}, *v*_{4} is zero. Note that this *S* is not of minimum cost: replacing `ACG` with `GCG` and `TTT` with `TGC` instead costs 1 + 2 = 3.
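The cost of an *s*-transformation for this example can be checked with a short Python sketch. The representation of a transformation as a dictionary of substitutions *v → w* (unchanged vertices omitted, since they contribute zero cost) and all function names are assumptions made for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length k-mers differ."""
    return sum(x != y for x, y in zip(a, b))


def unique_kmers(s: str, k: int) -> set[str]:
    """K(s): the set of distinct k-mers of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def transformation_cost(substitutions: dict[str, str]) -> int:
    """c(S): sum of Hamming distances over the substitutions v -> w."""
    return sum(hamming(v, w) for v, w in substitutions.items())


def apply_transformation(vertices: set[str], substitutions: dict[str, str]) -> set[str]:
    """Vertex set obtained by replacing each v with substitutions[v], if present."""
    return {substitutions.get(v, v) for v in vertices}


V = {"ACT", "CTG", "ACG", "TTT"}
S_a = {"ACG": "TGC", "TTT": "GCG"}  # two substitutions of cost 3 each
S_b = {"ACG": "GCG", "TTT": "TGC"}  # substitutions of cost 1 and 2
```

Both candidate transformations turn *V* into *K*(*s*), so a walk spelling *s* exists in either transformed graph, but `transformation_cost(S_a)` is 6 while `transformation_cost(S_b)` is 3.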

### B. Bipartition and pairing between two sets of k-mers

Given a sequence *s* and a De Bruijn graph *G*_{k}, to determine an *s*-transformation of *G*_{k}, we construct a complete bipartite graph *H* = (*V ∪ W, A*) where *V* corresponds to the set of vertices (*k*-mers from *G*_{k}), and *W* is the set *K*(*s*) associated with *s*. For each vertex (*k*-mer) *v ∈ V* and *w ∈ W*, there exists an edge *e* = {*v, w*} *∈ A* with a cost *z*(*e*) = *d*_{H}(*v, w*) (the Hamming distance between the *k*-mer of *v* and the *k*-mer of *w*). To determine an *s*-transformation of *G*_{k}, we use a maximum cardinality matching of minimum cost *E* in *H*. Each edge *e* = {*v, w*} *∈ E* represents the replacement in *G*_{k} of the vertex *v* (from *V*) with *w* (from *W*), at cost *z*(*e*).

An example of this complete bipartite graph is shown in Figure 2, corresponding to the De Bruijn graph *G*_{3} with *V* = {`ACT`, `CTG`, `ACG`, `TTT`} and *W* = *K*(*s*) = {`ACT`, `CTG`, `TGC`, `GCG`} obtained from the sequence *s* = `ACTGCG`. In this example, the cost of each edge *e* = {*v, w*}, where *v ∈ V* and *w ∈ W*, is determined by *z*(*e*) = *d*_{H}(*v, w*). For the same bipartite graph presented in Figure 2, an example of a maximum cardinality matching of minimum cost is *E* = {{*v*_{1}, *w*_{1}}, {*v*_{2}, *w*_{2}}, {*v*_{3}, *w*_{4}}, {*v*_{4}, *w*_{3}}} (dashed edges), with a total cost equal to 3.
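For an instance this small, the matching can be verified by exhaustive search. The Python sketch below enumerates every way of matching all of *W* into *V*; it is a brute-force check under the assumption |*W*| ≤ |*V*|, not the polynomial-time method used by the algorithm, and the function names are hypothetical.

```python
from itertools import permutations


def hamming(a: str, b: str) -> int:
    """Edge cost z({v, w}) = d_H(v, w)."""
    return sum(x != y for x, y in zip(a, b))


def min_cost_maximum_matching(V: list[str], W: list[str]):
    """Try every injective assignment of W into V and keep a cheapest one."""
    best_cost, best_matching = None, None
    for chosen in permutations(V, len(W)):
        # chosen[i] is the vertex of V matched with W[i].
        cost = sum(hamming(v, w) for v, w in zip(chosen, W))
        if best_cost is None or cost < best_cost:
            best_cost, best_matching = cost, list(zip(chosen, W))
    return best_cost, best_matching


V = ["ACT", "CTG", "ACG", "TTT"]
W = ["ACT", "CTG", "TGC", "GCG"]
cost, matching = min_cost_maximum_matching(V, W)
```

On this instance the search returns cost 3 with `ACG` matched to `GCG` and `TTT` matched to `TGC`, in agreement with the matching shown above. The enumeration grows factorially and is only meant as a sanity check.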

### C. Reformulating the problem

Based on the previous information about the *s*-transformation, we redefine the De Bruijn Graph Sequence Mapping Problem with graph changes as follows:

(*De Bruijn Graph Sequence Mapping Problem* – `BSMP`): Given as input a sequence *s* and a De Bruijn graph *G*_{k} represented only by its set of vertices, determine an *s*-transformation *S* of *G*_{k} of minimum cost.

## IV. Algorithm for the problem of mapping sequences into De Bruijn graphs with changes in the graph

Given a De Bruijn graph *G*_{k}, represented exclusively by its set of vertices *V*, and a sequence *s*, the proposed algorithm for the version with changes in the labels of the De Bruijn graph vertices, called De Bruijn Graph Mapping Tool with Graph Changes – `BMTC`, follows the steps below:

1. Initially, a set *W* is created containing all unique *k*-mers from the sequence *s*;
2. Next, a bipartite graph *H* is created. In this context, connections are established from each *k*-mer *v ∈ V* to all *k*-mers *w ∈ W*, with the cost of the edge {*v, w*} given by *d*_{H}(*v, w*);
3. The Hungarian algorithm is applied to the graph *H* to find a maximum matching with minimum cost;
4. Based on the obtained matching, the *k*-mers in *V* that have undergone changes are identified, resulting in the generation of a new set of vertices *V*^{′}, representing the modified De Bruijn graph.

### A. Proof

To prove that the previous steps work, consider the following lemmas.

**Lemma 1**. *K*(*s*) *⊆ V*(*G*_{k}) if and only if there exists a walk in *G*_{k} that spells out *s*.
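The constructive direction of this statement can be illustrated in Python: when every *k*-mer of *s* is a vertex, the consecutive *k*-mers of *s* overlap by *k −* 1 characters, so they form a walk over induced arcs. The function name `spelling_walk` is an assumption made for this sketch.

```python
from typing import Optional


def spelling_walk(s: str, nodes: set[str], k: int) -> Optional[list[str]]:
    """Return the walk of consecutive k-mers of s if all of them are vertices, else None."""
    walk = [s[i:i + k] for i in range(len(s) - k + 1)]
    # Consecutive k-mers of s always overlap by k-1 characters, so the arcs
    # between them are induced as soon as both endpoints are vertices.
    return walk if all(v in nodes for v in walk) else None
```

With *s* = `ACTGCG` and *k* = 3, the function returns a walk for the vertex set {`ACT`, `CTG`, `TGC`, `GCG`} and `None` for {`ACT`, `CTG`, `ACG`, `TTT`}.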

**Lemma 2**. Let *S* be an *s*-transformation of *G*_{k} resulting in *G*^{′}_{k}. There exists a matching *E* in *H*, called a matching induced by *S*, such that *C*(*E*) *≤ c*(*S*).

**Proof**. Let *S* be an *s*-transformation of *G*_{k}, meaning that, for the vertex set *V* = {*v*_{1}, …, *v*_{n}}, *S* is a substitution *v*_{i} *→ w*_{i} for each *i*, transforming *G*_{k} into *G*^{′}_{k}, where there exists a walk that spells out *s*. Since such a walk exists, by Lemma 1 we have in *G*^{′}_{k} a *K*(*s*) = {*u*_{1}, …, *u*_{m}} such that *u*_{1} = *w*_{1}, …, *u*_{m} = *w*_{m}, and *c*(*S*) = Σ_{i} *d*_{H}(*v*_{i}, *w*_{i}). Let *E* = {*u*_{i}*v*_{i} : 1 *≤ i ≤ m*}, then

$$C(E) = \sum_{i=1}^{m} d_H(v_i, u_i) = \sum_{i=1}^{m} d_H(v_i, w_i) \le c(S).$$

Given the two previous lemmas, consider the following theorem:

**Theorem 1**. Let *s* be a sequence, *G*_{k} a De Bruijn graph, and *E*^{∗} a maximum matching of minimum cost in the bipartite graph *H* = (*V*(*G*_{k}) *∪ K*(*s*), *A*). Then, *D*_{H}(*s, G*_{k}) = *C*(*E*^{∗}).

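**Proof**. For each edge {*x, y*} *∈ E*^{∗}, with *x ∈ V*(*G*_{k}) and *y ∈ K*(*s*), replace *x* with *y* to obtain *G*^{′}_{k} and, as a consequence of Lemma 1, an *s*-transformation *S*. Note that *c*(*S*) = *C*(*E*^{∗}). Therefore,

$$D_H(s, G_k) \le c(S) = C(E^{*}).$$

Conversely, let *S* be an *s*-transformation of minimum cost, so that *c*(*S*) = *D*_{H}(*s, G*_{k}), and let *E* be a matching induced by *S* such that *C*(*E*) *≤ c*(*S*), which Lemma 2 guarantees to exist. Therefore,

$$C(E^{*}) \le C(E) \le c(S) = D_H(s, G_k).$$

Combining the two inequalities gives *D*_{H}(*s, G*_{k}) = *C*(*E*^{∗}).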

### B. Algorithm

The steps for the algorithm for the problem of sequence mapping onto De Bruijn graphs with graph changes are described in Algorithm 1.
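The four steps of `BMTC` can be sketched in Python as follows. This is an illustrative sketch under stated assumptions and not the authors' Algorithm 1: the function names (`unique_kmers`, `assign_rows`, `bmtc`) are hypothetical, the Hungarian algorithm is a standard potential-based variant written out in plain Python, and the sketch assumes |*K*(*s*)| ≤ |*V*|, as required by the definition of an *s*-transformation.

```python
def hamming(a: str, b: str) -> int:
    """Edge cost d_H between two equal-length k-mers."""
    return sum(x != y for x, y in zip(a, b))


def unique_kmers(s: str, k: int) -> list[str]:
    """Step 1: K(s), the distinct k-mers of s in order of first occurrence."""
    return list(dict.fromkeys(s[i:i + k] for i in range(len(s) - k + 1)))


def assign_rows(cost: list[list[int]]) -> list[int]:
    """Step 3: Hungarian algorithm for an n x m cost matrix with n <= m.

    Returns col_of_row, where row i is matched to column col_of_row[i].
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0] * (n + 1)    # row potentials (1-based)
    v = [0] * (m + 1)    # column potentials (1-based)
    p = [0] * (m + 1)    # p[j]: row matched to column j, 0 if column j is free
    way = [0] * (m + 1)  # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the augmenting path back to the artificial column 0
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    col_of_row = [0] * n
    for j in range(1, m + 1):
        if p[j]:
            col_of_row[p[j] - 1] = j - 1
    return col_of_row


def bmtc(vertices: list[str], s: str, k: int) -> tuple[list[str], int]:
    """Return the modified vertex list V' and the cost of the matching."""
    W = unique_kmers(s, k)
    if not W:
        return list(vertices), 0
    if len(W) > len(vertices):
        raise ValueError("this sketch assumes |K(s)| <= |V|")
    # Step 2: complete bipartite graph as a |W| x |V| matrix of Hamming costs.
    cost = [[hamming(w, v) for v in vertices] for w in W]
    col_of_row = assign_rows(cost)
    # Step 4: replace each matched vertex of V by its partner in W.
    new_vertices, total = list(vertices), 0
    for i, j in enumerate(col_of_row):
        total += cost[i][j]
        new_vertices[j] = W[i]
    return new_vertices, total
```

Calling `bmtc(["ACT", "CTG", "ACG", "TTT"], "ACTGCG", 3)` on the running example returns the vertex list `["ACT", "CTG", "GCG", "TGC"]` with cost 3. The *k*-mers of *s* index the rows because the routine requires the smaller side of the bipartition to be matched completely.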

### C. Complexity

For Algorithm 1, consider the following costs:

- *O*(|*s*|) to identify all unique *k*-mers;
- *O*((|*V*| + |*W*|)*k*) to create the bipartite graph with edge costs;
- *O*((|*V*| + |*W*|)^{3}) to run the Hungarian algorithm;
- *O*(|*L*|) to iterate through all edges in the matching and replace *k*-mers in *V*^{′}.

Therefore, we have the following time complexity:

$$O(|s|) + O((|V| + |W|)k) + O((|V| + |W|)^{3}) + O(|L|),$$

and the time complexity of Algorithm 1 is *O*((|*V*| + |*W*|)^{3}).

## V. Discussion, perspectives and conclusions

In this paper, we delve into the De Bruijn Graph Sequence Mapping Problem with changes in the graph. We build on the work of Gibney *et al.* [4], which demonstrated the NP-completeness of the problem for De Bruijn graphs. In this work, we allow a change in the graph to modify not only the labels of the vertices but also the topology of the graph. The fundamental distinction lies in the flexibility allowed, exploring the induction of new arcs after a modification in the De Bruijn graph. While Gibney *et al.* focus on modifications that preserve the fundamental structure, this work considers changes that can induce new arcs, enabling polynomial-time solutions.

Considering this feature of inducing new arcs, we redefine the problem, introducing concepts such as the *s-transformation* of a De Bruijn graph and the *Bipartition and matching between two sets of k-mers*. We develop an algorithm called `BMTC`, which uses the Hungarian algorithm to find a maximum matching of minimum cost in a bipartite graph, resulting in a modified set of vertices for the De Bruijn graph.

The proof that the algorithm works is based on lemmas establishing the relationship between the graph transformation and matchings in the bipartite graph. The theorem demonstrates that the cost of the minimum-cost maximum matching found in the bipartite graph is equal to the Hamming distance between the given sequence and the original graph, i.e., the minimum cost of an *s*-transformation. `BMTC` allows changes in the De Bruijn graph that may result in new arcs, which is what makes polynomial-time solutions possible.

It is noteworthy that the literature still lacks depth in this aspect. The most recent work by Gibney *et al.* [4] addresses the problem but in a different manner. By introducing the possibility of inducing new arcs in case of changes in the De Bruijn graph, this work proposes a polynomial-time solution to the problem. However, further exploration of the potential practical applications of this solution is still needed.

## Acknowledgment

## Footnotes

lucas.lb.rocha@gmail.com, said.sadique@ufms.br, francisco.araujo@ufms.br

Correcting the title of the paper and changing the model used in relation to the previous one.