Abstract
Recent assemblies by the T2T and VGP consortia have achieved significant accuracy but required a tremendous amount of effort and resources. More typical assembly efforts, on the other hand, still suffer both from mis-assemblies (joining sequences that should not be adjacent) and from under-assemblies (not joining sequences that should be adjacent). To better understand the common algorithm-driven causes of these limitations, we investigated the unitig algorithm, which is a core algorithm at the heart of most assemblers. We prove that, contrary to popular belief, even when there are no sequencing errors, unitigs are not always safe (i.e. they are not guaranteed to be substrings of the sequenced genome). We also prove that the unitigs of a bidirected de Bruijn graph are different from those of a doubled de Bruijn graph and, contrary to our expectations, result in under-assembly. Using experimental simulations, we then confirm that these two artifacts exist not only in theory but also in the output of widely used assemblers. In particular, when coverage is low then even error-free data results in unsafe unitigs; also, unitigs may unnecessarily split palindromes in half if special care is not taken. To the best of our knowledge, this paper is the first to theoretically predict the existence of these assembler artifacts and confirm and measure the extent of their occurrence in practice.
1 Introduction
Reconstructing the full sequence of a genome from its sequencing data remains one of the most challenging problems in bioinformatics. Assemblers have suffered both from mis-assemblies (putting together sequences that should not be adjacent) and under-assemblies (not putting together sequences whose adjacency should be apparent from the data) [4, 35]. Recent efforts by the Telomere-to-Telomere consortium [25, 24] and the Vertebrate Genome Project [30] demonstrated how long read technologies, long-range contact mapping, and manual curation can alleviate these errors. However, the time and cost of those efforts remain prohibitive for most biology labs. In such cases, mis- and under-assemblies continue to be a major limitation (e.g. [38]).
Understanding the common algorithm-driven causes of these limitations is made complicated by the diversity and complexity of assembly algorithms. We can start by focusing on assemblers that use de Bruijn graphs (dBGs) [16], which continue to be popular even for long-read data [5]. But even dBG-based assemblers differ on how they handle complexities arising from sequencing errors, heterogeneity, or DNA double strandedness. Nevertheless, most assemblers are built on top of the unitig algorithm, which returns all the maximal unitigs in an assembly graph [35]; a unitig is a path whose vertices have exactly one incoming and one outgoing edge, with the exception that the first and last vertex can have any number of incoming and outgoing edges, respectively. Being a common denominator of most assemblers, the unitig algorithm is a good target for investigating shared sources of mis- and under-assemblies.
It is already known that the unitig algorithm contributes to under-assembly (e.g. see the safe and complete framework of [37, 9]) and can trivially create mis-assemblies when there are sequencing errors. The effect of sequencing errors on assembly errors has even been theoretically studied more broadly in [32, 33, 34]. However, it is widely assumed that if it were not for sequencing errors, unitigs would always be safe (i.e. substrings of the sequenced genome). In an earlier work [19], we attempted to formally prove this but could only do so by assuming perfect coverage. This assumption was also necessary in another earlier work [37], where it was suggested that without it, unitigs may not be safe. Unitigs were also implied to be unsafe in certain models of the assembly problem [9]. We therefore hypothesize that, contrary to popular belief, there are non-contrived conditions which lead to unsafe unitigs on error-free data.
The unitig algorithm also needs to account for the fact that the strand from which a read is sequenced is unknown. Most assemblers do so via one of two common approaches to constructing the dBG. In one, every k-mer is “doubled” prior to constructing the dBG, i.e. for every k-mer in the input, both it and its reverse complement are added to the dBG (e.g. SPAdes [6]). In the other approach, edges are given two orientations instead of one, thereby capturing the way that double-stranded strings can overlap. This results in a bidirected dBG [22], used in assemblers such as ABySS [36, 17]. Since this is a more elegant construction for capturing the double-stranded nature of the data, one would intuitively expect that it should not hurt assembly accuracy.
In this paper, we perform a theoretical and empirical study to validate our two hypotheses about common algorithm-driven sources of mis- and under-assemblies. First, despite widespread belief to the contrary, we show that even on error-free data, unitigs do not always appear in the sequenced genome (i.e. they are unsafe). Our experimental results confirm that at least two different assemblers exhibit this behavior in practice. Second, we establish that there is a bijection between maximal unitigs in the doubled and bidirected dBGs, except that palindromic unitigs in the doubled dBG are split in half in the bidirected dBG. This shows that, contrary to intuition, naively using the bidirected graph actually contributes to under-assembly compared to the doubled graph. Our experimental results confirm that this artifact appears in some assemblers but not in others. Nevertheless, we also find that the extent of these two artifacts is limited. To the best of our knowledge, this paper is the first to theoretically predict the existence of these assembler artifacts and confirm and measure the extent of their occurrence in practice.
2 Preliminaries
In this section we give the formal definitions for our paper. The reader may wish to delay reading the last three paragraphs (relating to bidirected graphs) until they are used in Section 4.
Strings
In this paper, we assume all strings are over the four-letter DNA alphabet. A string of length k is called a k-mer. We define sufk(x) (respectively, prek(x)) to be the last (respectively, first) k characters of x. When the subscript is omitted from pre and suf, we assume it is k − 1. For x and y with suf(x) = pre(y), we define gluing x and y, denoted by x ⊙ y, as x concatenated with the last |y| − k + 1 characters of y. Given two strings x and y, we define occy(x) as the number of times that x occurs in y. The reverse complement of x is denoted as $\overline{x}$. For a set of strings 𝒮, $\overline{\mathcal{S}}$ denotes the set of the reverse complements of all strings of 𝒮. A string x is a palindrome iff $x = \overline{x}$. A string x is canonical if it is the lexicographically smaller of x and $\overline{x}$. For s ∈ {0, 1}, we define orient(x, s) to be x if s = 0 and to be $\overline{x}$ if s = 1. To canonize x is to replace it by its canonical version, canon(x) = min_i(orient(x, i)). We say that x0 and x1 have an (s0, s1)-oriented-overlap if suf(orient(x0, 1 − s0)) = pre(orient(x1, s1)). Intuitively, such an overlap exists between two strings if we can orient them in such a way that they are glueable. For example, GTT and TTG have a (1, 0)-oriented-overlap, and AAC and TTG have a (0, 0)-oriented-overlap. We define the non-canonical k-spectrum spk(x) as the set of all k-mer substrings of x; for a set of strings 𝒮, spk(𝒮) is the union of spk(x) over all x ∈ 𝒮.
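The string operations defined above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`revcomp`, `glue`, `has_oriented_overlap`, `spectrum`) are our own choices for this example, and the code assumes k ≥ 2 and upper-case input over A, C, G, T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(x: str) -> str:
    """Reverse complement of a DNA string."""
    return x.translate(COMPLEMENT)[::-1]

def orient(x: str, s: int) -> str:
    """orient(x, 0) = x and orient(x, 1) = reverse complement of x."""
    return x if s == 0 else revcomp(x)

def canon(x: str) -> str:
    """Lexicographically smaller of x and its reverse complement."""
    return min(x, revcomp(x))

def suf(x: str, k: int) -> str:
    """Last k - 1 characters of x."""
    return x[len(x) - (k - 1):]

def pre(x: str, k: int) -> str:
    """First k - 1 characters of x."""
    return x[:k - 1]

def glue(x: str, y: str, k: int) -> str:
    """x ⊙ y: x followed by the last |y| - k + 1 characters of y."""
    if suf(x, k) != pre(y, k):
        raise ValueError("x and y are not glueable")
    return x + y[k - 1:]

def has_oriented_overlap(x0: str, x1: str, s0: int, s1: int, k: int) -> bool:
    """True iff x0 and x1 have an (s0, s1)-oriented-overlap."""
    return suf(orient(x0, 1 - s0), k) == pre(orient(x1, s1), k)

def spectrum(x: str, k: int) -> set:
    """Non-canonical k-spectrum: the set of all k-mer substrings of x."""
    return {x[i:i + k] for i in range(len(x) - k + 1)}
```

With k = 3, `has_oriented_overlap("GTT", "TTG", 1, 0, 3)` and `has_oriented_overlap("AAC", "TTG", 0, 0, 3)` reproduce the two examples given in the text.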
Directed de Bruijn graphs
Given a set of k-mers K, the basic node-centric directed de Bruijn graph, Gbasic(K), is a directed graph where nodes are the k-mers of K, and an edge exists from k-mer x to k-mer y iff suf(x) = pre(y). A doubled directed de Bruijn graph on K, Gdbl(K), is a basic de Bruijn graph on the set of k-mers $K \cup \overline{K}$, i.e. $G_{dbl}(K) = G_{basic}(K \cup \overline{K})$. Observe that for any k-mer x such that $suf(x) \neq \overline{suf(x)}$, the existence of the edge from x to y in Gdbl(K) implies the existence of a different edge from $\overline{y}$ to $\overline{x}$. We refer to such a pair of edges as mirrors. For a k-mer x such that $suf(x) = \overline{suf(x)}$, Gdbl(K) will contain an edge from x to $\overline{x}$; we call this edge a self-mirror.
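As a rough illustration of these two constructions, the following sketch builds both graphs as adjacency dictionaries. The names `build_basic`, `build_doubled`, and `is_self_mirror` are hypothetical helpers introduced here, not part of any assembler; all k-mers are assumed to have the same length.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(x):
    return x.translate(COMPLEMENT)[::-1]

def build_basic(kmers):
    """G_basic(K) as an adjacency dict: edge x -> y iff suf(x) == pre(y)."""
    kmers = set(kmers)
    by_prefix = {}
    for y in kmers:
        by_prefix.setdefault(y[:-1], []).append(y)
    # x[1:] is the (k-1)-suffix of x; y[:-1] is the (k-1)-prefix of y.
    return {x: sorted(by_prefix.get(x[1:], [])) for x in kmers}

def build_doubled(kmers):
    """G_dbl(K): the basic graph on K together with all reverse complements."""
    kmers = set(kmers)
    return build_basic(kmers | {revcomp(x) for x in kmers})

def is_self_mirror(x, y):
    """An edge (x, y) is its own mirror exactly when y is the reverse complement of x."""
    return y == revcomp(x)
```

For K = {AAC, ACG}, the doubled graph contains the mirror pair AAC → ACG and CGT → GTT, plus the self-mirror ACG → CGT, which arises because suf(ACG) = CG is its own reverse complement.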
Walks and unitigs in directed graphs
For a vertex x in a directed graph, the in-degree d−(x) (respectively, out-degree d+(x)) is the number of edges incoming to (respectively, outgoing from) it. The sequence of vertices w = (x0, …, xn), for n ≥ 0, is a walk iff for all 1 ≤ i ≤ n there exists an edge from xi−1 to xi. Vertices x0 and xn are called endpoints; a walk has a single endpoint when x0 = xn (e.g. when n = 0). The spelling of a walk is defined as spell(w) = x0 ⊙ · · · ⊙ xn. A walk is said to be circular iff n ≥ 1 and x0 = xn, and a simple cycle if it is circular and, for all i and j such that 0 ≤ i < j ≤ n, xi = xj implies i = 0 and j = n. A simple periodic cycle is a walk that starts with a simple cycle and then keeps on looping around it without ever exiting; formally, a walk is a simple periodic cycle if there exists 0 ≤ i ≤ n − 1 such that (x0, …, xi) is a simple cycle and xi+1, …, xn is a repetition of x0, …, xi, except the last repetition may be partial. A walk is a unitig if it is not a simple periodic cycle and for all 1 ≤ i ≤ n, d−(xi) = 1 and for all 0 ≤ i ≤ n − 1, d+(xi) = 1. A unitig is maximal if it is not a proper subwalk of another unitig.
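The definition of a maximal unitig translates fairly directly into a traversal. The sketch below is a simplified version written for this exposition, not the implementation used by any particular assembler: it starts a walk at every vertex that cannot be extended backwards and extends it forwards while the degree conditions hold. As a stated simplification, it does not report unitigs that lie on isolated cycles.

```python
def build_basic(kmers):
    """Adjacency dict of G_basic(K): edge x -> y iff suf(x) == pre(y)."""
    kmers = set(kmers)
    by_prefix = {}
    for y in kmers:
        by_prefix.setdefault(y[:-1], []).append(y)
    return {x: sorted(by_prefix.get(x[1:], [])) for x in kmers}

def maximal_unitigs(graph):
    """Maximal unitigs of a directed graph given as {vertex: [successors]}."""
    indeg = {v: 0 for v in graph}
    pred = {}
    for u, outs in graph.items():
        for v in outs:
            indeg[v] += 1
            pred[v] = u  # only consulted when indeg[v] == 1

    def extends_back(v):
        # v can be preceded inside a unitig iff d-(v) = 1 and its
        # unique predecessor has d+ = 1.
        return indeg[v] == 1 and len(graph[pred[v]]) == 1

    unitigs = []
    for start in sorted(graph):
        if extends_back(start):
            continue  # start is interior to (or the end of) a longer unitig
        walk = [start]
        # Extend while d+(last) = 1 and d-(next) = 1.
        while len(graph[walk[-1]]) == 1 and indeg[graph[walk[-1]][0]] == 1:
            walk.append(graph[walk[-1]][0])
        unitigs.append(walk)
    return unitigs

def spell(walk):
    """Spelling of a walk of k-mers: glue consecutive vertices."""
    return walk[0] + "".join(x[-1] for x in walk[1:])
```

For the 3-mers of AACGT and ACGG, the vertex ACG has two out-neighbors, so the sketch returns three maximal unitigs spelling AACG, CGG, and CGT.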
Bidirected de Bruijn graph
A bidirected graph G is a pair (V, E) where the elements of V are called vertices and E is a set of edges. Intuitively, every vertex has two sides; formally, a vertex-side is a pair (u, s), where u ∈ V and s ∈ {0, 1}. An edge e is a set of two vertex-sides {(u0, s0), (u1, s1)}, where ui ∈ V and si ∈ {0, 1}, for i ∈ {0, 1}. Intuitively, an edge is an undirected connection between two (not necessarily distinct) vertex-sides. We say that an edge e is incident to each of the two vertex-sides. Note that there can be multiple edges between two vertices, but only one edge once the sides are fixed. A labeled bidirected graph is a bidirected graph G where every vertex u has a string label lab(u), and for every edge e = {(u0, s0), (u1, s1)}, there is an (s0, s1)-oriented-overlap between lab(u0) and lab(u1). G is said to be overlap-closed if there is an edge for every such overlap. Let K be a set of canonical k-mers. The node-centric bidirected de Bruijn graph, denoted by Gbid(K), is the overlap-closed labeled bidirected graph where the vertices and their labels correspond to K. Figure S1A shows an example of a bidirected graph.
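To make the construction concrete, the following sketch enumerates the edges of Gbid(K) by brute force over all pairs of vertex-sides. It is quadratic in |K| and meant only for small examples; representing an edge as a `frozenset` of vertex-sides is a choice made for this illustration, under which an edge whose two vertex-sides coincide becomes a one-element set.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(x):
    return x.translate(COMPLEMENT)[::-1]

def orient(x, s):
    return x if s == 0 else revcomp(x)

def build_bidirected(canonical_kmers):
    """Edge set of G_bid(K); each edge is a frozenset of vertex-sides (u, s)."""
    K = sorted(set(canonical_kmers))
    if any(x != min(x, revcomp(x)) for x in K):
        raise ValueError("all k-mers must be canonical")
    edges = set()
    for u0 in K:
        for u1 in K:
            for s0 in (0, 1):
                for s1 in (0, 1):
                    # (s0, s1)-oriented-overlap:
                    # suf(orient(u0, 1 - s0)) == pre(orient(u1, s1))
                    if orient(u0, 1 - s0)[1:] == orient(u1, s1)[:-1]:
                        edges.add(frozenset({(u0, s0), (u1, s1)}))
    return edges
```

For example, K = {AAC, CAA} yields the single edge {(CAA, 1), (AAC, 0)}, while K = {GTA} yields a single edge whose two vertex-sides are both (GTA, 1).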
Walks and unitigs in bidirected graphs
An edge in a bidirected graph is an inverted loop if its two vertex-sides are equal. An inverted loop {(u, s), (u, s)} is lonely if it is the only edge incident to (u, s). We define the degree of a vertex-side d(u, s) to be the number of edges incident to it, but with an inverted loop contributing two to the degree. A sequence t = (u0, s0, u1, s1, …, un, sn) with n ≥ 0 is a walk if for all 1 ≤ i ≤ n, there exists an edge ei = {(ui−1, 1 − si−1), (ui, si)}. The vertex-sides (u0, s0) and (un, 1 − sn) are called the first and last endpoint sides, respectively. Note that even when n = 0, there are two endpoint sides. The spelling of a walk is defined as spell(t) = orient(lab(u0), s0) ⊙ · · · ⊙ orient(lab(un), sn). The reverse of t is rev(t) = (un, 1 − sn, …, u0, 1 − s0). Note that, as expected, $spell(rev(t)) = \overline{spell(t)}$. Note that if t′ is a subwalk of t, then rev(t′) is a subwalk of rev(t) and spell(t′) is a substring of spell(t) (the converse is not necessarily true when k is even). Figure S1BC gives an example illustrating a walk in a bidirected graph and Figure S1D shows a corresponding walk in a doubled directed dBG.
A walk w = (u0, s0, …, un, sn) is said to be circular iff n ≥ 1 and (u0, s0) = (un, sn), and a simple cycle if it is circular and, for all i and j such that 0 ≤ i < j ≤ n, (ui, si) = (uj, sj) implies i = 0 and j = n. A simple periodic cycle is a walk that starts with a simple cycle and then keeps on looping around it without ever exiting; formally, w is a simple periodic cycle if there exists 0 ≤ i ≤ n − 1 such that (u0, s0, …, ui, si) is a simple cycle and (ui+1, si+1, …, un, sn) is a repetition of (u0, s0, …, ui, si), except the last repetition may be partial. A walk (u0, s0, …, un, sn) is a unitig if it is not a simple periodic cycle and for all 0 ≤ i < n, d(ui, 1 − si) = 1 and, for all 0 < i ≤ n, d(ui, si) = 1. A unitig is said to be maximal if it is not a proper subwalk of another unitig. Note that all the subwalks of a unitig must themselves be unitigs.
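The degree convention for inverted loops is what prevents a unitig from passing through one. The sketch below checks the walk and unitig conditions on a hand-written edge set; it represents a walk as a list of (u, s) pairs rather than a flat sequence, stores an inverted loop as a one-element `frozenset`, and, as a simplification, omits the periodic-cycle test. All function names are illustrative.

```python
def degree(edges, side):
    """Number of edges incident to a vertex-side; an inverted loop counts twice."""
    return sum(2 if len(e) == 1 else 1 for e in edges if side in e)

def is_walk(edges, walk):
    """walk is a list of (u, s) pairs; consecutive pairs need a connecting edge."""
    return all(
        frozenset({(u0, 1 - s0), (u1, s1)}) in edges
        for (u0, s0), (u1, s1) in zip(walk, walk[1:])
    )

def is_unitig(edges, walk):
    """Degree conditions of a bidirected unitig (periodic-cycle test omitted)."""
    n = len(walk) - 1
    if not is_walk(edges, walk):
        return False
    exits_ok = all(degree(edges, (u, 1 - s)) == 1 for u, s in walk[:n])
    entries_ok = all(degree(edges, (u, s)) == 1 for u, s in walk[1:])
    return exits_ok and entries_ok

# G_bid({ATC}) has one edge: an inverted loop at side (ATC, 0), because
# suf(GAT) = AT = pre(ATC).
loop_graph = {frozenset({("ATC", 0)})}
# G_bid({AAC, CAA}) has one ordinary edge.
plain_graph = {frozenset({("CAA", 1), ("AAC", 0)})}
```

In `loop_graph`, the single-vertex walk `[("ATC", 1)]` (spelling GAT) is a unitig, but the two-step walk `[("ATC", 1), ("ATC", 0)]` (spelling GATC) is a valid walk that fails the degree test, since the inverted loop gives side (ATC, 0) degree two.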
3 Safety of unitigs
In this section, we will give necessary and sufficient conditions for a unitig to be unsafe in the basic dBG constructed from error-free reads. To properly formulate this question, we define a sequenced read interval as a genomic interval that generated a read, i.e. from which a read was sequenced. A sequencing experiment then corresponds to a set of sequenced read intervals. A sequenced interval is then defined as a maximal interval which is covered by sequenced read intervals, with the additional constraint that any two consecutive sequenced read intervals within it overlap by at least k − 1 bases. We define a sequenced segment as the string corresponding to a sequenced interval. Observe that two sequenced intervals do not overlap by more than k − 2 bases (otherwise they would not be maximal), but the corresponding sequenced segments may share longer overlaps due to repeats. A set of reads then induces a set 𝒮 = {S1, …, S|𝒮|} of sequenced segments. Figure 1 illustrates this formulation. In this section, we do not explicitly account for reverse complements, since they will be considered in Section 4.
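One way to compute sequenced segments from read coordinates is a sweep over the sorted read intervals, merging a read into the current interval whenever the two overlap by at least k − 1 bases. The sketch below follows that idea; the half-open (start, end) coordinate convention and the function name are assumptions made for this example.

```python
def sequenced_segments(genome, read_intervals, k):
    """Merge half-open read intervals (start, end) that overlap by >= k - 1 bases.

    Returns a list of (start, end, segment_string) tuples, sorted by start.
    """
    merged = []
    for start, end in sorted(read_intervals):
        if merged and merged[-1][1] - start >= k - 1:
            # The read shares at least one (k-1)-mer with the current interval.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(a, b, genome[a:b]) for a, b in merged]
```

For example, with k = 3 the reads (0, 5), (3, 8), and (7, 10) on a genome of length 10 yield two sequenced intervals, (0, 8) and (7, 10): the last read overlaps the merged interval by only one base, which is fewer than k − 1.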
Given a set of sequenced segments 𝒮, we say that a unitig w in Gbasic(spk(𝒮)) is unsafe iff spell(w) is not a substring of a string in 𝒮. Equivalently, w is unsafe iff it is not a subwalk of a walk that corresponds to a string in 𝒮. Our definition of unsafe captures the notion of a potential mis-assembly, as the unitig is not present in the sequenced part of the genome. Observe that in formulating the problem, we start with the set of sequenced segments themselves; the read set that induced them is irrelevant. We can now state the main result of this section, which gives the necessary and sufficient conditions for a unitig to be unsafe. The proof of this theorem, along with the necessary lemmas, is left for Appendix A due to space constraints.
Theorem 1. Let 𝒮 be a set of sequenced segments and let w = (x0, …, xm) be a unitig in Gbasic(spk(𝒮)). Then w is unsafe if and only if for all S ∈ 𝒮, one of the following holds:
(i) S does not contain any k-mer of w,
(ii) occS(prek(S)) = 1 and prek(S) = xi for some 1 ≤ i ≤ m,
(iii) occS(sufk(S)) = 1 and sufk(S) = xj for some 0 ≤ j ≤ m − 1, or
(iv) occS(prek(S)) = occS(sufk(S)) = 2 and there exists 1 ≤ i ≤ j ≤ m − 1 such that prek(S) = xi and sufk(S) = xj.
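The four conditions can be checked mechanically. The sketch below transcribes them as stated and compares the result with the direct substring definition of unsafe on a toy example with k = 3. The function names are ours, each segment is assumed to have length at least k, and the toy segments are invented for illustration.

```python
def occ(y, x):
    """Number of (possibly overlapping) occurrences of x in y."""
    return sum(1 for i in range(len(y) - len(x) + 1) if y.startswith(x, i))

def spectrum(x, k):
    return {x[i:i + k] for i in range(len(x) - k + 1)}

def spell(walk):
    return walk[0] + "".join(x[-1] for x in walk[1:])

def theorem1_unsafe(walk, segments, k):
    """True iff every segment satisfies one of the four conditions."""
    m = len(walk) - 1
    for S in segments:
        p, s = S[:k], S[-k:]
        c1 = not (spectrum(S, k) & set(walk))
        c2 = occ(S, p) == 1 and p in walk[1:]
        c3 = occ(S, s) == 1 and s in walk[:m]
        c4 = occ(S, p) == 2 and occ(S, s) == 2 and any(
            walk[i] == p and walk[j] == s
            for i in range(1, m) for j in range(i, m)
        )
        if not (c1 or c2 or c3 or c4):
            return False
    return True

def unsafe_by_definition(walk, segments):
    """Unsafe iff the spelling is not a substring of any segment."""
    return not any(spell(walk) in S for S in segments)

# Toy unitig: ACG -> CGT -> GTC -> TCA, spelling ACGTCA.
w = ["ACG", "CGT", "GTC", "TCA"]
```

With segments ACGT and GTCA, the first segment ends in the middle of `w` (case (iii)) and the second starts in the middle of `w` (case (ii)), so both functions report `w` as unsafe; with the single segment ACGTCA, both report it as safe.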
The cases of Theorem 1 are illustrated in Figure 2 and can be understood intuitively as follows. Since every k-mer of Gbasic(spk(𝒮)) is in 𝒮, every k-mer of w must be touched by some S ∈ 𝒮. Then, consider a walk g corresponding to such a string S. If g starts in the middle of w and does not visit its own starting vertex again, then g does not fully contain w (case (ii)). Similarly, if g ends in the middle of w and did not visit its own ending vertex previously, then g does not fully contain w (case (iii)). If g starts and ends in the middle of w, with the ending vertex to the right of the starting vertex, and contains each of those vertices exactly twice, then g does not fully contain w (case (iv)). This is the “if” direction of Theorem 1, with the “only if” direction further stating that under all other conditions, g fully contains w.
When the genome is a single chromosome and the coverage is high enough so that every k-mer is sequenced, the whole genome becomes one sequenced segment. In this case, Theorem 1 simplifies because the genome has only one starting and ending vertex and, for a unitig w to be unsafe, the genome must somehow contain every vertex of w without containing w as a subwalk.
Corollary 1. Let X be a string and let w = (x0, …, xm) be a unitig in Gbasic(spk(X)). Then spell(w) is not a substring of X iff one of the following holds:
(i) occX(prek(X)) = occX(sufk(X)) = 1, prek(X) = xi, and sufk(X) = xi−1 for some 1 ≤ i ≤ m.
(ii) occX(prek(X)) = occX(sufk(X)) = 2, prek(X) = xi, and sufk(X) = xj for some 0 < i ≤ j < m.
Moreover, this can hold for at most one unitig in Gbasic(spk(X)).
This corollary tells us that with perfect coverage, all unitigs, except possibly one, are safe. Note that this is a stronger version of the perfect coverage case than the one given in [19], which made an assumption that the starting vertex of X is a source and the ending vertex of X is a sink.
A natural question is what a scenario that gives an unsafe unitig looks like in terms of the original genome. Figure 3 visualizes the following natural possibility. Suppose that the sequenced genome X has a repeat that appears as a maximal unitig ψ in Gbasic(spk(X)). Then, suppose that the region encompassing the start of one copy and the region encompassing the end of the other copy are not sequenced. Then ψ loses its maximality in Gbasic(spk(𝒮)) and becomes a subwalk of a bigger unitig w. Though w is a unitig in the graph from the sequencing data, it would not be a unitig if all the k-mers of X were included in the graph. In Section 5, we will show that this situation accounts for the majority of our experimental observations.
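This scenario can be reproduced on a toy genome. In the sketch below (k = 3, with a genome and segments invented purely for illustration), the repeat CGT occurs twice in ACGTTGCGTA; the k-mers following the first copy and preceding the second copy are left unsequenced, so the sequenced segments are ACGT and CGTA. The unitig routine is a simplified version that ignores isolated cycles.

```python
def spectrum(x, k):
    return {x[i:i + k] for i in range(len(x) - k + 1)}

def build_basic(kmers):
    by_prefix = {}
    for y in kmers:
        by_prefix.setdefault(y[:-1], []).append(y)
    return {x: sorted(by_prefix.get(x[1:], [])) for x in kmers}

def maximal_unitig_spellings(kmers):
    """Spellings of maximal unitigs of G_basic(kmers), ignoring isolated cycles."""
    graph = build_basic(set(kmers))
    indeg = {v: 0 for v in graph}
    pred = {}
    for u, outs in graph.items():
        for v in outs:
            indeg[v] += 1
            pred[v] = u
    spellings = []
    for start in sorted(graph):
        if indeg[start] == 1 and len(graph[pred[start]]) == 1:
            continue  # start can be extended backwards
        walk = [start]
        while len(graph[walk[-1]]) == 1 and indeg[graph[walk[-1]][0]] == 1:
            walk.append(graph[walk[-1]][0])
        spellings.append(walk[0] + "".join(x[-1] for x in walk[1:]))
    return spellings

k = 3
genome = "ACGTTGCGTA"           # the repeat CGT occurs at positions 1 and 6
segments = ["ACGT", "CGTA"]     # genome[0:4] and genome[6:10]

full = maximal_unitig_spellings(spectrum(genome, k))
sampled_kmers = set().union(*(spectrum(s, k) for s in segments))
sampled = maximal_unitig_spellings(sampled_kmers)
unsafe = [u for u in sampled if not any(u in s for s in segments)]
misassembled = [u for u in sampled if u not in genome]
```

In the full graph, CGT is a maximal unitig on its own because it has two predecessors and two successors. In the sampled graph it is absorbed into the single unitig ACGTA, which occurs in neither segment nor in the genome.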
4 The relationship between the doubled dBG (Gdbl(K)) and the bidirected dBG (Gbid(K))
In this section, we will characterize the relationship between the maximal unitigs of Gdbl(K) and the maximal unitigs of Gbid(K) (Theorem 2). Due to space constraints, the lemmas and proofs needed to prove Theorem 2 are in Appendix B. Here, we will instead give an intuitive walk-through to elucidate the relationship between the two graphs. We will incrementally show the relationship between objects in the doubled graph and the bidirected graph — first between vertices and vertex-sides, then between edges, then between walks, and finally between maximal unitigs.
Let K be a set of canonical k-mers, with k odd. We only consider the case of odd k; when k is even, there may be palindrome k-mers, which create special cases to handle both in the practical assembler implementation and in the theoretical analysis. Since most assemblers anyway restrict k to be odd, we limit ourselves to this case as well.
There is a natural mapping between vertices of Gdbl(K) and vertex-sides of Gbid(K). For a vertex x in Gdbl(K), define FV(x) = (u, s), where u is a vertex in Gbid(K) and s ∈ {0, 1} such that lab(u) = orient(x, s). By the definition of Gbid(K), there exists a unique u and unique s that satisfy this condition. The uniqueness of s is guaranteed by the fact that x cannot be a palindrome. Formally, FV is a bijection between vertices of Gdbl(K) and vertex-sides of Gbid(K) (Lemma B.10). There is also a natural mapping between edges in Gdbl(K) and Gbid(K). Let x1 and x2 be two k-mers in Gdbl(K) and let (u1, s1) = FV(x1) and (u2, s2) = FV(x2). We define the mapping FE(x1, x2) = {(u1, 1 − s1), (u2, s2)}. Then (x1, x2) is an edge in Gdbl(K) if and only if FE(x1, x2) is an edge in Gbid(K) (Lemma B.11). Note, however, that FE is not a bijection, since a pair of mirror edges (x, y) and $(\overline{y}, \overline{x})$ map to the same bidirected edge, i.e. $F_E(x, y) = F_E(\overline{y}, \overline{x})$.
The FV and FE mappings allow us to naturally define a mapping from walks in Gdbl(K) to walks in Gbid(K). Let w = (x0, …, xn) be a walk in Gdbl(K). For each 0 ≤ i ≤ n, let (ui, si) = FV (xi) and define FW (w) ≜ (u0, s0, …, un, sn). FW is a spell-preserving bijection between the set of walks in Gdbl(K) and the set of walks in Gbid(K) (Lemma B.12).
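The mappings FV and FW, together with rev and the bidirected spelling, can be sketched as follows. Vertices of Gbid(K) are identified with their canonical labels, and a bidirected walk is represented as a list of (u, s) pairs instead of a flat sequence; both are conventions of this example only, and k is assumed to be odd so that no k-mer is a palindrome.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(x):
    return x.translate(COMPLEMENT)[::-1]

def orient(x, s):
    return x if s == 0 else revcomp(x)

def F_V(x):
    """Map a k-mer of G_dbl(K) to the vertex-side (u, s) with lab(u) = orient(x, s)."""
    u = min(x, revcomp(x))
    return (u, 0 if u == x else 1)

def F_W(walk):
    """Map a walk (list of k-mers) in G_dbl(K) to a walk in G_bid(K)."""
    return [F_V(x) for x in walk]

def rev(t):
    """Reverse of a bidirected walk: reverse the order and flip every side."""
    return [(u, 1 - s) for u, s in reversed(t)]

def spell_directed(walk):
    return walk[0] + "".join(x[-1] for x in walk[1:])

def spell_bidirected(t):
    oriented = [orient(u, s) for u, s in t]
    return oriented[0] + "".join(x[-1] for x in oriented[1:])
```

For the mirror walks (CAA, AAC) and (GTT, TTG) in Gdbl({AAC, CAA}), FW preserves the spellings CAAC and GTTG, and applying rev to the image of the first walk gives the image of the second.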
One might hypothesize that FW is also a bijection between the maximal unitigs of Gdbl(K) and the maximal unitigs of Gbid(K). Surprisingly, it turns out to not be the case, though the following more careful analysis reveals a close relationship. For Gdbl(K), let us partition the set of maximal unitigs into non-palindromic strings Dnon-pal and palindromic strings Dpal. For Gbid(K), let Bno-loop be the set of maximal unitigs where neither endpoint side has an incident lonely inverted loop, let Bfirst-loop be the set of maximal unitigs where the only endpoint side with a lonely inverted loop is the first one, and let Blast-loop be the set of maximal unitigs where the only endpoint side with a lonely inverted loop is the last one. To avoid corner cases, let us further assume that there are no circular unitigs in Gdbl(K), which eliminates the possibility of a maximal unitig having lonely inverted loops at both endpoint sides and implies that Bno-loop, Bfirst-loop, and Blast-loop are a partition of the maximal unitigs of Gbid(K) (Lemma B.16). Figure 4 shows an example.
We also need to define a function head which, informally, takes a maximal palindromic unitig in Gdbl(K), extracts the first half of it, and maps it to Gbid(K). Formally, head(w) maps a walk w = (x0, …, xn) in Dpal to the walk $F_W((x_0, \ldots, x_{(n-1)/2}))$ in Gbid(K). Note that (n − 1)/2 is necessarily an integer since w is a palindrome and hence n must be odd (Lemma B.1). We can now state the main theorem of this section.
Theorem 2. Let K be a set of canonical k-mers where k is odd and Gdbl(K) does not contain a circular unitig. Then:
(i) The function FW is a bijection from Dnon-pal to Bno-loop.
(ii) The function rev is a bijection between Blast-loop and Bfirst-loop.
(iii) The function head is a bijection from Dpal to Blast-loop.
Figure 5 schematically illustrates the relationship captured by Theorem 2. The theorem says that for maximal unitigs that are non-palindromic in Gdbl(K) and do not have lonely inverted loops incident to their endpoint sides in Gbid(K), FW is in fact a bijection. However, every maximal unitig w that is palindromic in Gdbl(K) is split into two maximal unitigs in Gbid(K): one that spells the first half of w and has a lonely inverted loop incident to its last endpoint side, and one that spells the second half of w and has a lonely inverted loop incident to its first endpoint side. These are necessarily reverses of each other.
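The splitting can be traced on a small palindromic unitig. In the sketch below, the walk (AGA, GAT, ATC, TCT) in Gdbl({AGA, ATC}) with k = 3 spells the palindrome AGATCT; `head` follows the definition of the head function, and walks in Gbid(K) are written as lists of (u, s) pairs. The example k-mers and helper names are ours.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(x):
    return x.translate(COMPLEMENT)[::-1]

def orient(x, s):
    return x if s == 0 else revcomp(x)

def F_V(x):
    u = min(x, revcomp(x))
    return (u, 0 if u == x else 1)

def head(walk):
    """First half of a palindromic walk (x_0, ..., x_n), mapped into G_bid(K)."""
    n = len(walk) - 1
    if n % 2 != 1:
        raise ValueError("a palindromic unitig must have an even number of vertices")
    return [F_V(x) for x in walk[: (n - 1) // 2 + 1]]

def rev(t):
    return [(u, 1 - s) for u, s in reversed(t)]

def spell_bidirected(t):
    oriented = [orient(u, s) for u, s in t]
    return oriented[0] + "".join(x[-1] for x in oriented[1:])

w = ["AGA", "GAT", "ATC", "TCT"]      # spells the palindrome AGATCT
first_half = head(w)                  # element of B_last-loop
second_half = rev(first_half)         # element of B_first-loop
```

Here `first_half` spells AGAT and `second_half` spells ATCT. The two spellings are reverse complements of each other and overlap by k − 1 = 2 characters, so gluing them recovers AGATCT, the unitig that the doubled graph reports in one piece.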
Inverted loops are caused by k-mers x where $suf(x) = \overline{suf(x)}$ (e.g. GTA). When these types of k-mers are not present in K, there are no inverted loops in Gbid(K) or palindromic unitigs in Gdbl(K). Hence, Dpal = Bfirst-loop = Blast-loop = ∅, and Theorem 2 immediately simplifies.
Corollary 2. Let K be a set of k-mers, with odd k, which does not contain any x such that $suf(x) = \overline{suf(x)}$. Then FW is a bijection from the maximal unitigs in Gdbl(K) to the maximal unitigs in Gbid(K).
5 Empirical results
Occurrence of unsafe unitigs in real genomes
Theorem 1 predicts the possibility of unsafe unitigs. To verify the extent to which this happens with real genomes, we use T2T human reference chromosome 1 [25]. We simulated error-free reads of length 100 with varying target coverages and varying k. Note that for this experiment, we want to test if mis-assemblies occur even when the data is perfect, so making the reads error-free is necessary. The sequenced read intervals correspond to the source location of each simulated read, and the sequenced segments are defined as in Section 3. From these reads, we constructed the basic de Bruijn graph and output its maximal unitigs, using a version of BCALM [10, 11] modified to ignore reverse complements. We confirmed that the unitigs that were unsafe (i.e. not a substring of the sequenced segments) were exactly the unitigs that satisfied the conditions of Theorem 1.
Table 1 shows the number of unsafe unitigs, as a function of the coverage and of k. There are as many as 17,635 unsafe unitigs (at coverage 2x and k = 71). The best indicator for the number of unsafe unitigs is the percent of k-mers sampled (or the number of sequenced segments), i.e. the number of unsafe unitigs goes down as the percent of sampled k-mers goes up. This trend is in line with the prediction of Corollary 1, which states that once the coverage is perfect, we expect to see at most one unsafe unitig. Our results indicate that the artifacts identified by Theorem 1 do occur in real genomes, though they become less common as more of the genomic k-mers are sampled.
An unsafe unitig is not necessarily a mis-assembly, as it may be a substring of the unsequenced genome by luck. We define a unitig to be mis-assembled if its spelling is not a substring of the reference. Table 1 shows that the number of mis-assembled unitigs is substantially lower than the number of unsafe unitigs, e.g. with 708 mis-assembled unitigs at 2x coverage and k = 71. Thus the potential for mis-assembly does not usually translate into a real mis-assembly, though many mis-assemblies remain.
We further check how many of these mis-assembled unitigs fit the example in Fig. 3. A formal definition to capture this example is included in Appendix A for reference. Table 1 shows that the vast majority of mis-assembled cases are in fact caused by this situation, where a repeat has an occurrence in which its start is unsequenced and another occurrence in which its end is unsequenced.
The simulations in Table 1 suggest that the mis-assembly artifact can be removed by simply increasing coverage. In a metagenome experiment, however, this is not always possible. Even when one increases the number of reads, there will continue to be genomes in the sample whose abundance is low enough that their coverage is low. To verify this intuition, we used a standard benchmark dataset generated by the CAMI competition [31], containing 70 million synthetic reads from 30 genomes. Table 2 shows there are 33-37 mis-assembled unitigs, indicating that this artifact remains under realistic coverage of a metagenomic dataset. The section “CAMI dataset” in Appendix C contains more details about the experiment, including Table S1 which shows the details of the dataset.
Presence of unsafe unitigs in the contig output of real assemblers
We investigated the extent to which the artifact predicted by Theorem 1 appears in the output of real assemblers. Assemblers do not simply output the unitigs of a graph but perform many other steps, hence it was not clear if this artifact would appear in the output contigs. Unfortunately, it is not clear how to verify this artifact with real data, as sequencing errors make it difficult to know which of the mis-assembled contigs are caused by the conditions of Theorem 1. We therefore again used a simulated error-free dataset from the T2T chromosome 1, using the ART simulator [15], with a read length of 250 and varying coverages. This time, we simulated reads from either strand, since assemblers are not typically run in single-stranded mode. We also used the CAMI dataset, but simulating reads in double-stranded mode. We then constructed the doubled de Bruijn graph using k = 74 and output its maximal unitigs (note that Theorem 1 holds for even k). We also ran SPAdes [6] and MEGAHIT [18] to assemble the reads (see Appendix C for parameter details). We then identified the unitigs and the assembler contigs that were mis-assembled, allowing for reverse complements. We will say that a string x matches a string y with a threshold of t if at least a fraction t of the k-mers of x occur in y.
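The matching criterion can be sketched as follows. This is an illustration rather than the evaluation script itself: it counts every k-mer position of x (not distinct k-mers), treats a k-mer and its reverse complement as the same by canonizing both sides, and interprets the threshold as "at least a fraction t". These three choices are assumptions of this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(x):
    return x.translate(COMPLEMENT)[::-1]

def canon(x):
    return min(x, revcomp(x))

def canonical_kmers(x, k):
    """Canonical k-mers of x, one per position."""
    return [canon(x[i:i + k]) for i in range(len(x) - k + 1)]

def matches(x, y, k, t):
    """True iff at least a fraction t of the k-mers of x occur in y."""
    x_kmers = canonical_kmers(x, k)
    if not x_kmers:
        return False
    y_kmers = set(canonical_kmers(y, k))
    hits = sum(1 for kmer in x_kmers if kmer in y_kmers)
    return hits >= t * len(x_kmers)
```

For instance, with k = 3, two of the three k-mers of ACGTA occur in ACGTT, so the two strings match at a threshold of 0.5 but not at a threshold of 1.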
Tables 3 and 4 show that nearly all of the mis-assembled unitigs matched at least one mis-assembled SPAdes contig with a threshold of 1. For MEGAHIT, the threshold of 1 turned out to be too stringent; this is not surprising, since assemblers have many steps that may add or remove k-mers from the graph; additionally, MEGAHIT varies the value of k internally and may therefore join k-mers that do not have an overlap of length k − 1. Using a threshold of 0.5, however, we found that, similarly to SPAdes, most mis-assembled unitigs matched a mis-assembled contig of MEGAHIT. These results indicate that the artifact predicted by Theorem 1 not only appears in unitigs of the raw graph but also in the output of widely used assemblers like SPAdes and MEGAHIT.
Presence of palindrome splitting in a real genome
To measure the extent of the “palindrome splitting” artifact predicted by Theorem 2, we let K be the set of all constituent k-mers in human chromosome 21 (grch38.p13), after excising the Ns. We confirmed the correctness of Theorem 2 by verifying that the spellings of Dnon-pal are equal to the spellings of Bno-loop and that the spellings of Blast-loop are equal to the spellings of Dpal and are the reverse complements of the spellings of Bfirst-loop. Table 5 shows that the splitting artifact is present but rare, e.g. for k = 15, there were 186 palindromic maximal unitigs in Gdbl(K) which were split in Gbid(K). The artifact becomes rarer with increasing k (e.g. for k = 43, there were only 3 split palindromes), which is expected since palindrome frequency in real genomes decreases with length.
Presence of palindrome splitting in real assemblers
Most assembler papers do not contain enough detail to ascertain what kind of de Bruijn graph they use to handle reverse complements nor what modifications, if any, they make to the unitig algorithm used for the final output. Looking at MEGAHIT [18], SPAdes [6], ABySS [36, 17], and minia [12], only the SPAdes paper is unambiguously clear in saying how it handled reverse complements (it used the doubled dBG). Furthermore, since these assemblers implement many heuristics, the splitting artifact may be absent (respectively, present) even if they did (respectively, did not) use bidirected graphs. We therefore tested the behavior of these assemblers by looking for evidence of palindrome splitting in their output, rather than in their technical descriptions.
Since large exact palindromes are uncommon in typical genomes, we created a synthetic genome by modifying an approximately 7 million bp long contig from human chromosome 4 (grch38.p13) as follows. We randomly sampled a 1,000 bp long region and replaced the last 500 bp by the reverse complement of the first 500 bp; we then repeated the sampling process 700,000 times. We then simulated error-free Illumina reads with ART. We used a read length of 100 bp so that assemblers will not be able to supplement the dBG with read information in a way that hides the palindrome splitting artifact. We used 10x coverage so that most k-mers would be sampled.
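The palindrome-planting step can be sketched as below. This is not the script used for the experiment; the function name, the use of a seeded `random.Random`, and the decision to return the sampled start positions are choices made for this illustration. Note that later regions may overwrite earlier ones when many sites are sampled.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(x):
    return x.translate(COMPLEMENT)[::-1]

def plant_palindromes(genome, n_sites, region_len=1000, seed=0):
    """Repeatedly pick a region and overwrite its second half with the
    reverse complement of its first half. Returns (new_genome, starts)."""
    if region_len % 2 != 0 or region_len > len(genome):
        raise ValueError("region_len must be even and no longer than the genome")
    rng = random.Random(seed)
    g = list(genome)
    half = region_len // 2
    starts = []
    for _ in range(n_sites):
        start = rng.randrange(len(g) - region_len + 1)
        first_half = "".join(g[start:start + half])
        # Same-length slice assignment, so the genome length is unchanged.
        g[start + half:start + region_len] = revcomp(first_half)
        starts.append(start)
    return "".join(g), starts
```

After a single call with `n_sites=1`, the sampled region equals its own reverse complement, i.e. it is an exact palindrome of length `region_len`.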
First, we find the reference location of each unitig w in Dpal. Then, we find all exact alignments of the assembler contigs to the reference. We say that w is fully-covered if there exists a contig whose alignment spans w’s. Otherwise, we say w is split if one half of w’s region does not overlap with any contig alignments while the other half has a contig aligned that ends precisely in the middle of w at one end and extends past w at the other end. A unitig is ambiguous if it does not fall into either category. Appendix C contains a more precise definition of these cases.
Table 6 shows that ABySS clearly exhibits the palindrome splitting artifact, with all non-ambiguous unitigs being split and none fully-covered. In fact, this is due to a heuristic that breaks unitigs at any palindromic edges or vertices [1, 2]. The opposite was true for SPAdes and MEGAHIT, with all non-ambiguous unitigs being fully-covered and none split. minia on the other hand exhibited mixed behavior. Of the 417 non-ambiguous cases, 34 were split and 383 were fully-covered. These results indicate that the palindrome splitting artifact of Theorem 2 does persist all the way to the contig output stage in some assemblers. However, this artifact requires the presence of long exact palindromes in the reference, which is uncommon in most genomes.
6 Discussion
Our theoretical study uncovered two artifacts of the unitig algorithm for genome assembly. The first is that even without sequencing errors, it can create mis-assemblies in places of imperfect coverage. The second is that when the bidirected graph is used to model double-strandedness, the unitig algorithm under-assembles by failing to merge the two halves of palindromes. Our experiments confirmed the presence of these theoretically-predicted artifacts in real genomes and popular assemblers. Fortunately, the impact of these artifacts is not large and can be addressed. Mis-assembly issues due to the first artifact can be resolved by increasing coverage or, potentially, breaking unitigs at places where the coverage along them is uneven. Under-assembly issues due to the palindrome artifact are rare in real genomes and, moreover, can sometimes be fixed by forcing the unitigs to “push their way through” lonely inverted loops (however, this is not always possible, e.g. [23, 8]).
One of the tangential outcomes of this paper is that we have given proper definitions for concepts such as walks and unitigs in the context of bidirected graphs. Previous papers used these concepts somewhat informally; when definitions were given, they worked in the context of that paper but failed to have more general desired properties. For example, our previous work had an inconsistency in the way that a walk was defined on a single vertex versus on many vertices [28]. One key takeaway is that, as a rule of thumb, when working with bidirected graphs one should think in terms of vertex-sides rather than vertices. The definitions we have provided in this paper generalize further than previous ones and are able to form the basis for the type of analysis we have done in this paper. For example, we are the first to prove the bijection between walks in the doubled and bidirected dBGs. We hope that these definitions will facilitate future attempts to formally study questions in bidirected graphs.
Bidirected graphs give an elegant way to capture the double-stranded nature of DNA in a dBG, but our results here indicate that, for the unitig algorithm, they do not give any theoretical advantage. One of the claimed advantages of using the bidirected graph framework in assembly is that it allows one to take advantage of results from graph theory that may otherwise be hidden. The primary example of this is a result (involving one of the authors) in [22] where a variant of the assembly problem was theoretically solved in polynomial time by relying on a reduction to the flow problem in bidirected graphs [14]. When viewed in retrospect, however, it is not clear that this connection was necessary. The algorithm being reduced to [14] was too cumbersome to implement and, when the assembly problem later necessitated a software solution, an approximation algorithm was used instead [20, 21]. But the approximation algorithm worked on the doubled graph, erasing the advantage of having initially formulated the problem on bidirected graphs. Therefore, it remains to be seen if there are situations where the connection to graph theoretical results on bidirected graphs can prove useful for genome assembly. Alternatively, using a different setting may better help identify the advantages of bidirected graphs, e.g. pangenomics [27], rearrangement analysis [7], or compression [29]. Quantifying these advantages would be an exciting future direction.
Reproducibility
Scripts for the experimental evaluations are available on GitHub [3].
A Safety of unitigs: full exposition
Proof of Theorem 1 and Corollary 1
In this subsection, we will prove Theorem 1 and Corollary 1. In the following, we will always have 𝒮 be a set of sequenced segments and w = (x0, …, xm) be a unitig in Gbasic(spk(𝒮)). We start with a lemma that, roughly speaking, says that if a walk corresponding to some S ∈ 𝒮 touches w, it must contain all of w except that it may begin or end somewhere along the way.
Lemma A.1. Let 𝒮 be a set of sequenced segments and let w = (x0, …, xm) be a unitig in Gbasic(spk(𝒮)). Let S ∈ 𝒮 and let g = (g0, …, g|S|) be the walk corresponding to S. Suppose there exist i and j such that xi = gj. Then,
(i) If sufk(S) ∉ {xi, …, xm −1}, then gj+δ = xi+δ for all δ ∈ [0, m − i].
(ii) If prek(S) ∉ {x1, …, xi}, then gj −δ = xi −δ for all δ ∈ [0, i].
Proof. We will only prove (i), since the argument for (ii) is symmetric. We use induction on δ. For δ = 0, the conclusion of (i) reduces to gj = xi, which is trivially true because it is a hypothesis of the lemma. Now we assume that (i) holds for δ − 1, i.e. gj+δ −1 = xi+δ −1. Since xi+δ −1 ≠ sufk(S), gj+δ −1 is not the last vertex of g. Because xi+δ −1 is a non-last vertex of a unitig, it has only one out-neighbor, which is xi+δ. Therefore, gj+δ = xi+δ, which shows that (i) holds for δ. □
Using this lemma, we can now prove some general properties of unsafe unitigs.
Lemma A.2. Let 𝒮 be a set of sequenced segments and let w = (x0, …, xm) be a unitig in Gbasic(spk(𝒮)). If w is unsafe then
(i) m ≥ 1,
(ii) there exists S ∈ 𝒮 such that prek(S) ∈ {x1, …, xm},
(iii) there exists S ∈ 𝒮 such that sufk(S) ∈ {x0, …, xm −1},
(iv) for all S ∈ 𝒮 and their corresponding walks g, either g and w do not share a vertex, or prek(S) ∈ {x1, …, xm}, or sufk(S) ∈ {x0, …, xm −1}, and
(v) for all S ∈ 𝒮 and all i, occS(xi) ≤ 2.
Proof. For (i), consider a unitig that has just one vertex x. Since each k-mer in Gbasic(spk(𝒮)) occurs in at least one segment, there must be at least one S ∈ 𝒮 whose walk contains x. Hence, the unitig that is composed of only x is safe. For (ii), assume for sake of contradiction that for all S ∈ 𝒮, prek(S) ∉ {x1, …, xm}. Since every vertex of the graph must be contained in at least one string, let S′ ∈ 𝒮 be a string that contains xm. Applying Lemma A.1(ii) with i = m, we get that the walk corresponding to S′ must contain w, contradicting that w is unsafe. The case of (iii) is symmetric to (ii), using x0 instead of xm and applying Lemma A.1(i) with i = 0. For (iv), assume for sake of contradiction that there exists an S ∈ 𝒮 with corresponding walk g = (g0, …, g|S|) such that g shares a vertex with w and prek(S) ∉ {x1, …, xm} and sufk(S) ∉ {x0, …, xm −1}. Let xi and gj be the vertices of w and g, respectively, that are equal. We can apply Lemma A.1 to get that (gj, …, gj+m −i) = (xi, …, xm) and (gj −i, …, gj) = (x0, …, xi). This means that w is a subwalk of g, which is a contradiction. For (v), let S ∈ 𝒮 and let g = (g0, …, g|S|) be its corresponding walk. If g and w do not share any vertices, then occS(xi) = 0 ≤ 2 for all i and we are done. Otherwise, we can apply (iv) to get that either (1) prek(S) ∈ {x1, …, xm} or (2) sufk(S) ∈ {x0, …, xm −1}. Let us consider (1); we omit the argument for (2) since it is symmetrical. Then g0 = xi for some 1 ≤ i ≤ m. Note that g0 is the first occurrence of xi in g. Assume for the sake of contradiction that occS(xi) > 2. To get the second occurrence of xi, g must first visit x0. After this second visit to xi, g must continue all the way until xm if it is to visit xi for a third time. Therefore, at the second visit to xi, g must in fact visit (x0, …, xm), which contradicts that w is unsafe. □
The case when a sequenced segment contains its first and/or last k-mer more than once puts additional constraints on how it can contain a unitig.
Lemma A.3. Let 𝒮 be a set of sequenced segments and let w = (x0, …, xm) be a unitig in Gbasic(spk(𝒮)). Let S ∈ 𝒮 such that at least one of the following holds:
(i) occS(prek(S)) = 2 and there exists an integer i ∈ [1, m] such that xi = prek(S), or
(ii) occS(sufk(S)) = 2 and there exists an integer j ∈ [i, m − 1] such that xj = sufk(S).
Then, spell(w) is not a substring of S iff both (i) and (ii) hold.
Proof. We only prove case (i) since case (ii) is symmetrical. Let g be the walk corresponding to S. In the first phase, g starts from xi and, since it must visit xi a second time, continues until xm. Then at some point it enters w through x0 and proceeds to visit xi for the second and last time. We will refer to the time from the end of the first phase to the point it enters x0 as the second phase, and the rest of the walk as the third phase. Observe that g does not contain w as a subwalk in either the first or second phase.
Now we prove the if direction. During phase 1, g visits xj exactly once. During phase 2, g does not visit xj. During phase 3, g proceeds from x0 forward along the unitig until it hits xj for the second time. Since xj occurs exactly twice and is the last vertex of g, this is the end of g. Since j < m, g does not contain w as a subwalk during the third phase.
Now we prove the only if direction. Assume w is not a subwalk of g. Therefore, during the third phase g cannot go until xm and must stop earlier at some xj = sufk(S), for some integer j ∈ [i, m − 1]. This xj was visited once during phase 1 and not visited during phase 2 and now visited a second and final time during phase 3. □
These lemmas are all the pieces we need to prove Theorem 1.
Theorem 1. Let 𝒮 be a set of sequenced segments and let w = (x0, …, xm) be a unitig in Gbasic(spk(𝒮)). Then w is unsafe if and only if for all S ∈ 𝒮, one of the following holds:
(i) S does not contain any k-mer of w,
(ii) occS(prek(S)) = 1 and prek(S) = xi for some 1 ≤ i ≤ m,
(iii) occS(sufk(S)) = 1 and sufk(S) = xj for some 0 ≤ j ≤ m − 1, or
(iv) occS(prek(S)) = occS(sufk(S)) = 2 and there exists 1 ≤ i ≤ j ≤ m − 1 such that prek(S) = xi and sufk(S) = xj.
Proof. First we prove the if direction. We will show that for all S ∈ 𝒮 and its corresponding walk g, if one of the four conditions holds, then w is not a subwalk of g. If (i) holds, then w is trivially not a subwalk of g. If (ii) holds, then g starts with xi but never visits xi again, therefore (x0, …, xi) is not a subwalk of g. Hence, w is not a subwalk of g. Similarly, if (iii) holds, then (xj, …, xm) is not a subwalk of g and hence w is not a subwalk of g. If (iv) holds, then Lemma A.3 implies that w is not a subwalk of g.
Now we prove the only if direction. We will show that for all S ∈ 𝒮 and their corresponding walk g, if w is not a subwalk of g, then one of the four conditions holds. By Lemma A.2.(iv), either (1) g does not contain any k-mer from w, (2) prek(S) ∈ {x1, …, xm}, or (3) sufk(S) ∈ {x0, …, xm −1}. In case of (1), g trivially does not contain w, and condition (i) is satisfied. In case of (2), let i ∈ [1, m] be an integer such that xi = prek(S). By Lemma A.2.(v), occS(xi) is either 1 or 2. If occS(xi) = 1, then condition (ii) immediately holds. If occS(xi) = 2, then Lemma A.3 implies that there exists an integer j ∈ [i, m − 1] that satisfies condition (iv). In case of (3), let j ∈ [0, m − 1] be an integer such that xj = sufk(S). Again, by Lemma A.2.(v), occS(xj) is either 1 or 2. If occS(xj) = 1, then condition (iii) immediately holds. If occS(xj) = 2, then Lemma A.3 implies that there exists an integer i ∈ [1, j] that satisfies condition (iv). □
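A small worked instance may help make the theorem concrete. The sketch below assumes that Gbasic is the node-centric de Bruijn graph whose vertices are the k-mers of the segments and whose edges are all (k−1)-overlaps; the helper names are hypothetical, and the unitig routine is deliberately simplified (it ignores isolated cycles). Safety is tested directly, by checking whether the spelled unitig is a substring of some segment.

```python
def kmers(s, k):
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def build_graph(segments, k):
    # Assumption: vertices are the k-mers of the segments, and there is an
    # edge u -> v whenever the (k-1)-suffix of u equals the (k-1)-prefix of v.
    nodes = sorted({x for s in segments for x in kmers(s, k)})
    node_set = set(nodes)
    out = {v: [] for v in nodes}
    inn = {v: [] for v in nodes}
    for v in nodes:
        for c in "ACGT":
            u = v[1:] + c
            if u in node_set:
                out[v].append(u)
                inn[u].append(v)
    return nodes, out, inn


def maximal_unitigs(nodes, out, inn):
    # Simplified: isolated cycles have no starting vertex and are skipped.
    def starts_unitig(v):
        return len(inn[v]) != 1 or len(out[inn[v][0]]) != 1

    unitigs = []
    for v in nodes:
        if not starts_unitig(v):
            continue
        walk = [v]
        while len(out[walk[-1]]) == 1:
            nxt = out[walk[-1]][0]
            if len(inn[nxt]) != 1 or nxt in walk:
                break
            walk.append(nxt)
        unitigs.append(walk)
    return unitigs


def spell(walk):
    return walk[0] + "".join(v[-1] for v in walk[1:])


def is_unsafe(walk, segments):
    # Unsafe: the spelled unitig is a substring of no sequenced segment.
    return all(spell(walk) not in s for s in segments)


segments = ["TTACG", "ACGGA"]
nodes, out, inn = build_graph(segments, 3)
unsafe = [spell(w) for w in maximal_unitigs(nodes, out, inn) if is_unsafe(w, segments)]
print(unsafe)  # ['TTACGGA']
```

Here the single maximal unitig is w = (TTA, TAC, ACG, CGG, GGA). The first segment ends at x2 = ACG, which it contains once (condition (iii)), and the second segment starts at x2, which it contains once (condition (ii)), so every segment satisfies one of the four conditions and w is unsafe.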
Corollary 1. Let X be a string and let w = (x0, …, xm) be a unitig in Gbasic(spk(X)). Then spell(w) is not a substring of X iff one of the following holds:
1. occX(prek(X)) = occX(sufk(X)) = 1, prek(X) = xi, sufk(X) = xi −1 for some 1 ≤ i ≤ m.
2. occX(prek(X)) = occX(sufk(X)) = 2, prek(X) = xi, sufk(X) = xj for some 0 < i ≤ j < m.
Moreover, this can hold for at most one unitig in Gbasic(spk(X)).
Proof. We can apply Theorem 1 to the set 𝒮 = {X}. Since X must contain all k-mers of w, w is unsafe if and only if condition (ii), (iii) or (iv) from Theorem 1 holds for S = X. First, assume Condition (ii) is true for X. Then by Theorem 1, w is unsafe. Consider the walk g corresponding to X. Because g begins at xi and all vertices in Gbasic(spk(X)) must be in g at least once, (xi, …, xm) is a subwalk of g. This is the one and only occurrence of xi in g. Since xi is the first vertex in g and occurs only once, xi −1 cannot precede xi. Hence, xi −1 must be the end of g, i.e., sufk(X) = xi −1. Note that this is the one and only occurrence of xi −1 in g. Thus, Condition (iii) is also true for X. With a symmetric argument, we can show that if Condition (iii) is satisfied, then Condition (ii) is satisfied. Combining both gives us the first condition of the corollary. Finally, observe that Condition (iv) is identical to Condition 2 in the corollary. The fact that these conditions can hold for at most one unitig follows directly from the fact that there is only one vertex for prek(X) in the graph. □
Formal definition of the case of Figure 3
In Section 5, we quantify the number of unsafe unitigs that fall into the case of Figure 3. To make this precise, we give a formal classification for this case. Let X be a genome and let 𝒮 be a set of its sequenced segments. We say that an unsafe walk w = (x0, …, xm) satisfies the case of Figure 3 if
there exists 0 < i ≤ j < m such that ψ = (xi, …, xj) is a unitig in Gbasic(spk(X)),
spell(ψ) occurs at least twice in X,
in one of the occurrences, the k-mer preceding spell(ψ) is not in 𝒮,
in another of the occurrences, the k-mer following spell(ψ) is not in 𝒮,
there exists S ∈ 𝒮 and an integer i′ ∈ [i, j] such that spell((xi′, …, xj)) is a suffix of S, and
there exists S ∈ 𝒮 and an integer j′ ∈ [i′ − 1, j] such that spell((xi, …, xj′)) is a suffix of S.
B The relationship of Gdbl(K) and Gbid(K): full exposition
In this section, we will prove Theorem 2. We start by providing additional definitions that are necessary to understand the proofs in this section.
Let K be a set of k-mers. A unitig in a directed graph that is not a proper subwalk of another unitig that ends at the same vertex is said to be prefix-maximal; a unitig that is not a proper subwalk of another unitig that starts with the same vertex is said to be suffix-maximal. Notice that a unitig is maximal iff it is both prefix- and suffix-maximal.
Let (u, s) be a vertex-side in Gbid(K). We define dil(u, s) to indicate the presence of an inverted loop, i.e. dil(u, s) = 1 if there is an inverted loop incident to side (u, s) and dil(u, s) = 0 otherwise. A unitig t in Gbid(K) is prefix-maximal if it is not a proper subwalk of another unitig that ends at the same vertex-side as t. A unitig is suffix-maximal if it is not a proper subwalk of another unitig that starts with the same vertex-side as t. Note that a unitig is maximal iff it is both prefix- and suffix-maximal.
We will prove Theorem 2 by first building a collection of Lemmas. First, we make a simple observation. A palindrome must have an even number of characters, otherwise there is a middle character that would need to be equal to its own reverse complement. Hence, a palindromic walk, in either the doubled or the bidirected graph must have an even number of nucleotides.
Lemma B.1. Let K be a set of k-mers.
(i) For all palindromic walks w = (x0, …, xn) in Gdbl(K), k and n have the same parity.
(ii) For all palindromic walks t = (u0, s0, …, un, sn) in Gbid(K), k and n have the same parity.
Proof. A palindromic string must have an even number of nucleotides. The number of nucleotides in spell(w) and in spell(t) is k + n. Hence the parity of k and n must be the same. □
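The observation that a palindrome (a string equal to its own reverse complement) must have even length can be checked exhaustively for short strings. The sketch below enumerates all DNA strings of each length up to 6; the helper names are illustrative only.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(s):
    return s.translate(COMPLEMENT)[::-1]


def is_palindrome(s):
    # Palindrome in the DNA sense: equal to its own reverse complement.
    return s == revcomp(s)


def count_palindromes(length):
    return sum(is_palindrome("".join(p)) for p in product("ACGT", repeat=length))


for length in range(1, 7):
    print(length, count_palindromes(length))
```

Every odd length yields a count of zero, because the middle character would have to be its own complement. For example, with k = 3 the palindrome AACGTT has 6 = k + n nucleotides, so n = 3 has the same parity as k.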
From here on out, we proceed by first proving Lemmas for the directed de Bruijn graphs (both the regular one and the doubled one) (Appendix B.1), then proving Lemmas for the bidirected graph (Appendix B.2), then proving Lemmas which connect the two types of graphs (Appendix B.3), and, finally, proving Theorem 2 (Appendix B.4).
B.1 Directed graph
First, we make the observation that unitigs cannot repeat vertices unless they are a simple cycle. This is generally stated without proof, but the statement is actually not true when unitigs are allowed to be periodic cycles. In our definition of unitig, we forbid this case, allowing us to prove the observation.
Lemma B.2. For all unitigs w in a directed graph, either w is a simple cycle or it does not repeat any vertices.
Proof. Let w = (x0, …, xn) be a unitig. Suppose that w repeats a vertex. Let 0 ≤ j ≤ n be the smallest value for which there exists 0 ≤ i < j such that xi = xj. If i > 0, then xi has xi −1 and xj −1 as in-neighbors. By the minimality of our choice of j, xi −1 ≠ xj −1, and hence d−(xi) ≥ 2. This contradicts that w is a unitig. If i = 0, then let j + 1 ≤ 𝓁 ≤ n − 1 be the largest index greater than j such that x𝓁 = x𝓁 mod (j+1). In other words, 𝓁 is the first place after xj where the unitig is about to “fall off the cycle”. If such an 𝓁 does not exist, then either j = n and w is a simple cycle, or w is a simple periodic cycle, contradicting the definition of a unitig. Otherwise, the vertex x𝓁 has as out-neighbors both x𝓁+1 and x𝓁+1 mod (j+1). By the choice of 𝓁, these out-neighbors are distinct and hence d+(x𝓁) ≥ 2. This contradicts that w is a unitig. □
A very simple property in the doubled graph is that the in-degree (respectively, out-degree) of a vertex is equal to the out-degree (respectively, in-degree) of its reverse complement.
Lemma B.3. Let K be a set of k-mers and let x be a vertex in Gdbl(K). Then d+(x) = d−(x̄) and d−(x) = d+(x̄).
Proof. Observe that for all vertices y in Gdbl(K), there is an edge from x to y in Gdbl(K) iff there is an edge from ȳ to x̄. This is true even if y = x̄ and these two edges are identical. Hence d+(x) = d−(x̄) and d−(x) = d+(x̄). □
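The degree symmetry can be verified numerically on a small example. The sketch below assumes that Gdbl(K) has as vertices the k-mers of K together with their reverse complements and has an edge for every (k−1)-overlap; the function names and the example set K are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(x):
    return x.translate(COMPLEMENT)[::-1]


def doubled_graph(kmer_set):
    # Assumption: vertex set is K closed under reverse complement,
    # with an edge x -> y whenever suffix_{k-1}(x) == prefix_{k-1}(y).
    nodes = set(kmer_set) | {revcomp(x) for x in kmer_set}
    out = {v: {v[1:] + c for c in "ACGT" if v[1:] + c in nodes} for v in nodes}
    inn = {v: {c + v[:-1] for c in "ACGT" if c + v[:-1] in nodes} for v in nodes}
    return nodes, out, inn


def degree_symmetry_holds(kmer_set):
    nodes, out, inn = doubled_graph(kmer_set)
    return all(
        len(out[x]) == len(inn[revcomp(x)]) and len(inn[x]) == len(out[revcomp(x)])
        for x in nodes
    )


print(degree_symmetry_holds({"AAC", "ACG", "ACC", "CGA"}))
```

The check succeeds for any input set, since reverse-complementing both endpoints maps each edge x → y to the mirror edge ȳ → x̄.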
We defined maximal unitigs as those that are not proper sub-walks of other unitigs. We can give an equivalent definition for directed graphs, in terms of vertex degrees. Since it is widely known, we state it without proof.
Lemma B.4. Let G be a directed graph and let w = (x0, …, xn) be a unitig in G. Then
(i) w is prefix-maximal if and only if d−(x0) ≠ 1 or there exists a vertex x′ that has an edge to x0 and d+(x′) > 1.
(ii) w is suffix-maximal if and only if d+(xn) ≠ 1 or there exists a vertex x′ that has an edge from xn and d−(x′) > 1.
Palindromic unitigs play a special role in Theorem 2. We observe that in a palindromic unitig of the doubled graph, the only edge from a k-mer to its reverse complement is the middle one.
Lemma B.5. Let K be a set of k-mers with odd k. Let w = (x0, …, xn) be a palindromic unitig in Gdbl(K) that is not a simple cycle. Then for all 0 ≤ i ≤ n − 1, we have that xi+1 = x̄i iff i = (n − 1)/2.
Proof. First note that by Lemma B.1, n is odd and n ≥ 1. Let m = (n − 1)/2. Because spell(w) is a palindrome, xi = x̄n−i for all 0 ≤ i ≤ n. The only if direction of the Lemma statement follows immediately by plugging in i = m and getting xm+1 = x̄m. For the if direction, assume that xi+1 = x̄i for some 0 ≤ i ≤ n − 1. Then xi+1 = x̄i = xn−i. By the fact that w is not a simple cycle and Lemma B.2, it cannot have any repeated vertices. Hence, i = n − i − 1 which only happens when i = m. □
We also observe that a maximal unitig that is not a palindrome cannot contain within it a palindrome of length ≥ k.
Lemma B.6. A non-palindromic maximal unitig w in Gdbl(K) cannot contain a proper sub-unitig that is palindromic.
Proof. For the sake of contradiction, let z be a proper sub-unitig of w that is a palindrome. First suppose that there exists a k-mer y such that y precedes z in w and ȳ follows z in w. In that case, observe that the walk (y, z, ȳ) is also a sub-unitig of w and also a palindrome. We can then extend z in this way until no longer possible, i.e. there does not exist a k-mer y such that y precedes z in w and ȳ follows z in w. Let w′ be this maximally extended walk. Note that by construction, w′ is a sub-unitig of w and it is proper because w′ is palindromic and w is not. Let the first vertex of w′ be x, and, hence, the last one is x̄.
Consider the case when w starts with x. Because w ≠ w′, there must exist an out-neighbor u of x̄ in w. Its mirror must also exist, i.e. an edge from ū to x. Since x is the first vertex of a maximal unitig, Lemma B.4 states that either (a) x must have at least one other in-neighbor besides ū or (b) ū must have at least one other out-neighbor besides x. For case (a), Lemma B.3 implies that d+(x̄) ≥ 2. For case (b), Lemma B.3 implies that d−(u) ≥ 2. In either case, the degrees of x̄ or of u contradict the definition of being part of a unitig. The case when w ends with x̄ is symmetric and omitted.
Now consider the case when w does not start with x and does not end with x̄. Let a be the vertex preceding x in w, and let b be the vertex following x̄ in w. There exists a mirror edge from b̄ to x. Since w′ was chosen so that it cannot be extended, ā ≠ b. Hence x has two distinct in-neighbors, a and b̄. Since w contains x as a non-first vertex, this contradicts that w is a unitig. □
B.2 Bidirected graph
As is the case with directed graphs (Lemma B.4), there is a definition of maximality for bidirected unitigs that has to do with degrees rather than sub-unitigs. We are not aware of this equivalence being explicitly proven, so we do so here:
Lemma B.7. Let K be a set of canonical k-mers. Let t = (u0, s0, …, un, sn) be a unitig in Gbid(K). Then
(i) t is prefix-maximal if and only if d(u0, s0) ≠ 1 or there is an edge {(u0, s0), (u′, s′)} such that d(u′, s′) > 1, and
(ii) t is suffix-maximal if and only if d(un, 1 − sn) ≠ 1 or there is an edge {(un, 1 − sn), (u′, s′)} such that d(u′, s′) > 1.
Proof. We will only prove (i) since the proof of (ii) is symmetric. First, we prove the only if direction. We need to consider three cases. The first case is when d(u0, s0) = 0. Then t is prefix-maximal because there is no other walk of which it is a subwalk with the same last vertex-side (un, sn). The second case is when d(u0, s0) > 1. Consider any walk t′ that ends in (un, sn) and of which t is a proper subwalk. Observe that (u0, s0) would not be the first vertex-side of t′. Therefore, since d(u0, s0) > 1, t′ cannot be a unitig and t must be prefix-maximal. The third case is when (u0, s0) has degree one and (u′, s′) is its only neighbor. Again, consider any walk t′ that ends in (un, sn) and of which t is a proper subwalk. Observe that (u′, 1 − s′) belongs to t′ but is not the last vertex-side of t′. Therefore, since we assumed that d(u′, s′) > 1, t′ cannot be a unitig and t must be prefix-maximal.
To prove the if direction we prove the contrapositive. In other words, we will show that if the degree of (u0, s0) is one and its sole neighbor (u′, s′) also has degree at most 1, then t is not prefix-maximal. First, observe that t′ = (u′, 1 − s′, u0, s0, …, un, sn) is a valid walk, since the edge {(u′, s′), (u0, s0)} exists. Then, observe that the degree of (u′, s′) is exactly one because it has degree at most one (by our assumption) and also has a neighbor (i.e. (u0, s0)). Therefore, the degree requirements for t′ being a unitig are fulfilled. Finally, observe that t is a proper subwalk of t′ ending in the same vertex-side, (un, sn). Therefore, t is not prefix-maximal. □
In a bidirected graph, a walk and its reverse are either both unitigs or not and, if they are, are either both maximal or not.
Lemma B.8. Let K be a set of canonical k-mers and let w be a unitig in Gbid(K).
(i) rev(w) is a unitig in Gbid(K).
(ii) w is prefix-maximal iff rev(w) is suffix-maximal.
(iii) w is suffix-maximal iff rev(w) is prefix-maximal.
Proof. Let (u0, s0, …, un, sn) = w and (u′0, s′0, …, u′n, s′n) = rev(w). For (i), by definition of rev, we have that u′i = un−i and s′i = 1 − sn−i for all 0 ≤ i ≤ n. Applying the definition of unitig to w, we get that
d(ui, si) = 1 for all 1 ≤ i ≤ n, and d(ui, 1 − si) = 1 for all 0 ≤ i ≤ n − 1.
These can be equivalently stated as
d(u′n−i, 1 − s′n−i) = 1 for all 1 ≤ i ≤ n, and d(u′n−i, s′n−i) = 1 for all 0 ≤ i ≤ n − 1.
If we change the index variables, these can be equivalently restated as
d(u′i, 1 − s′i) = 1 for all 0 ≤ i ≤ n − 1, and d(u′i, s′i) = 1 for all 1 ≤ i ≤ n.
This is precisely the definition of rev(w) being a unitig.
For (ii) and (iii), first observe that Lemma B.7 gives an alternate, equivalent, definition for prefix- and suffix-maximal. For (ii), observe that if we apply the alternate definition of suffix-maximal to rev(w) and plug in that u′n = u0 and s′n = 1 − s0, we get precisely the alternate definition of w being prefix-maximal. For (iii), observe that if we apply the alternate definition of prefix-maximal to rev(w) and plug in that u′0 = un and s′0 = 1 − sn, we get precisely the alternate definition of w being suffix-maximal. □
While we showed that it is natural for the doubled graph to have a palindromic unitig, this is impossible in a bidirected graph.
Lemma B.9. Let K be a set of canonical k-mers, with k odd. Then a unitig of Gbid(K) cannot be a palindrome.
Proof. Let t = (u0, s0, …, un, sn) be a palindromic walk. By Lemma B.1, n is odd, and so n ≥ 1. For convenience, let m = (n − 1)/2. By definition, spell(t) is equal to its own reverse complement. In particular, the two “central” k-mers of spell(t) must be reverse complements of each other. Formally, orient(lab(um), sm) is the reverse complement of orient(lab(um+1), sm+1). Since the labels of vertices in a bidirected graph are distinct, um = um+1 and hence sm = 1 − sm+1. Applying the definition of a bidirected walk to t, we get that {(um, 1 − sm), (um+1, sm+1)} is an edge. The fact that um = um+1 and sm = 1 − sm+1 implies that this edge is an inverted loop incident to (um, 1 − sm). Thus d(um, 1 − sm) ≥ 2, implying that t does not satisfy the definition of being a unitig. □
B.3 Connecting the directed and bidirected graphs
So far, we have proven properties of the doubled graph and of the bidirected graph separately; in this section, we prove lemmas about the relationship between the two graphs, when k is odd. Recall that for a k-mer x ∈ K, we defined FV (x) = (u, s), where (u, s) is the unique vertex-side in Gbid(K) such that lab(u) = orient(x, s).
Lemma B.10. Let K be a set of canonical k-mers where k is odd. FV is a bijection between vertices of Gdbl(K) and vertex-sides of Gbid(K).
Proof. To show that FV is a bijection, we will show that for all vertex-sides (u, s) in Gbid(K), there exists a unique k-mer x in Gdbl(K) such that FV(x) = (u, s). Consider a value of x such that FV(x) = (u, s). By definition, lab(u) = orient(x, s). Since k is odd and x is not a palindrome, the value of x satisfying this must be unique. By construction of Gdbl(K) and Gbid(K), x must be a vertex in Gdbl(K). Further, if x = orient(lab(u), s), then orient(x, s) = orient(orient(lab(u), s), s) = lab(u) and so x satisfies the condition that FV(x) = (u, s). □
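The bijection can be illustrated by brute force over all k-mers of a given length. The sketch below rests on conventions that are assumptions of this example rather than statements of the paper: orient(x, 0) = x and orient(x, 1) is the reverse complement of x, each bidirected vertex is identified with its label, and the label is the lexicographically smaller of a k-mer and its reverse complement.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(x):
    return x.translate(COMPLEMENT)[::-1]


def orient(x, s):
    # Assumed convention: side 0 keeps the string, side 1 reverse-complements it.
    return x if s == 0 else revcomp(x)


def FV(x):
    # The vertex u is represented by its (assumed canonical) label.
    u = min(x, revcomp(x))
    s = 0 if u == x else 1
    return (u, s)


def FV_inverse(u, s):
    return orient(u, s)


def bijection_report(k):
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    vertex_sides = {(min(x, revcomp(x)), s) for x in all_kmers for s in (0, 1)}
    images = {FV(x) for x in all_kmers}
    return len(all_kmers), len(vertex_sides), len(images)


print(bijection_report(3))  # (64, 64, 64): every vertex-side is hit exactly once
print(bijection_report(2))  # (16, 20, 16): palindromic 2-mers leave sides unmatched
```

For odd k the three counts agree, as the lemma states. For even k, a palindromic k-mer such as AT satisfies orient(x, 0) = orient(x, 1), so two vertex-sides correspond to a single k-mer and the mapping is no longer a bijection.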
We will use FV⁻¹ to denote the inverse of FV, which by Lemma B.10 is FV⁻¹(u, s) = orient(lab(u), s). We say that a vertex x of Gdbl(K) and a vertex-side (u, s) in Gbid(K) are associated with each other by FV when FV(x) = (u, s).
Recall that for two Gdbl(K) k-mers x1 and x2, we define the mapping FE(x1, x2) = {(u1, 1 − s1), (u2, s2)}, where (u1, s1) = FV(x1) and (u2, s2) = FV(x2). Though the mapping is not a bijection, it preserves the property of being an edge in the respective graph:
Lemma B.11. Let K be a set of canonical k-mers where k is odd. Let x1 and x2 be vertices in Gdbl(K). We have that (x1, x2) is an edge in Gdbl(K) if and only if FE(x1, x2) is an edge in Gbid(K).
Proof. By the definition of bidirected edges, FE(x1, x2) = {(u1, 1 − s1), (u2, s2)} is an edge iff
sufk−1(orient(lab(u1), s1)) = prek−1(orient(lab(u2), s2)). (1)
Recall that by the definition of FE, lab(u1) = orient(x1, s1) and lab(u2) = orient(x2, s2). We can therefore rewrite Equation (1) equivalently as
sufk−1(orient(orient(x1, s1), s1)) = prek−1(orient(orient(x2, s2), s2)). (2)
Now, using the fact that orient(orient(y, s), s) = y, for all y and s, we can rewrite Equation (2) as
sufk−1(x1) = prek−1(x2). (3)
Equation (3) is precisely the condition for (x1, x2) to be an edge in Gdbl(K). Since we obtained Equation (3) from Equation (1) using equivalent transformations, the two statements are equivalent, which completes the proof. □
One particular case of Lemma B.11 that we will often invoke is that, for a vertex x with FV(x) = (u, s), there is an edge from x to x̄ in Gdbl(K) if and only if there is an inverted loop incident to (u, 1 − s) in Gbid(K).
Now recall that FW is defined as a function that maps a walk w = (x0, …, xn) in Gdbl(K) to a sequence FW (w) = (u0, s0, …, un, sn), with (ui, si) = FV (xi) for all 0 ≤ i ≤ n. We show that FW (w) is in fact a walk in Gbid(K) and, moreover, FW is a bijection from the set of walks in Gdbl(K) to the set of walks in Gbid(K).
Lemma B.12. Let K be a set of canonical k-mers where k is odd. FW is a spell-preserving bijection from the set of walks in Gdbl(K) to the set of walks in Gbid(K).
Proof. Let w = (x0, …, xn) be a walk in Gdbl(K) and let (ui, si) = FV(xi) for all 0 ≤ i ≤ n. We will first show that FW(w) = (u0, s0, …, un, sn) is a walk in Gbid(K). By definition of FV, FW(w) is a sequence of vertex-sides. Consider the edge from xi −1 to xi, for all 1 ≤ i ≤ n. By Lemma B.11, there is an edge {(ui −1, 1 − si −1), (ui, si)} in Gbid(K). This shows that every two consecutive vertex-sides in FW(w) are connected by an edge, thus completing the proof that FW(w) is a walk. The fact that it is spell-preserving follows from its definition.
To show that FW is a bijection, we need to show that for all walks t = (u0, s0, …, un, sn) in Gbid(K), there exists a unique walk w in Gdbl(K) such that t = FW(w). Let w = (x0, …, xn) be an arbitrary walk in Gdbl(K). In order for FW(w) = t, we need that FV(xi) = (ui, si) for all 0 ≤ i ≤ n. Because FV is a bijection (Lemma B.10), there is exactly one value of xi to satisfy this, and that is xi = FV⁻¹(ui, si) = orient(lab(ui), si). Therefore, w = (orient(lab(u0), s0), …, orient(lab(un), sn)) is the unique walk in Gdbl(K) to satisfy FW(w) = t. □
Given the above proof, we can write the inverse of FW as FW⁻¹(u0, s0, …, un, sn) = (orient(lab(u0), s0), …, orient(lab(un), sn)). We say that a walk w in Gdbl(K) and a walk t in Gbid(K) are associated with each other by FW when t = FW(w).
Notice that if k were to be even, then Lemma B.12 would not hold. In particular, let x ∈ K be a palindromic k-mer and let u be the vertex in Gbid(K) such that lab(u) = x. Then both of the walks (u, 0) and (u, 1) would spell x, while in Gdbl(K) there would only be one walk that spells x.
Since unitigs are defined in terms of degrees, it is useful to first understand how the degrees of vertices in Gdbl(K) relate to the degrees of vertex sides in Gbid(K).
Lemma B.13. Let K be a set of canonical k-mers where k is odd. Let x be a vertex in Gdbl(K) and let (u, s) be a vertex-side in Gbid(K) such that (u, s) = FV(x). Then,
(i) d+(x) = d(u, 1 − s) − dil(u, 1 − s)
(ii) d−(x) = d(u, s) − dil(u, s).
Proof. For proving part (i), we will first prove an upper bound and then a matching lower bound. We start with the upper bound. Let Y be the set of all out-neighbors of x which are not equal to x̄. Note that Y may contain x. Let Y′ = {FV(y) | y ∈ Y} and observe that since FV is injective (Lemma B.10), |Y′| = |Y|. By Lemma B.11, for each vertex-side (u′, s′) ∈ Y′, there is an edge {(u, 1 − s), (u′, s′)} and so d(u, 1 − s) ≥ |Y′|.
We show that d+(x) ≤ d(u, 1 − s) − dil(u, 1 − s) by considering two cases. In the first case, assume that there does not exist an edge from x to x̄. Then d+(x) = |Y|. Moreover, by Lemma B.11, the edge {(u, 1 − s), (u, 1 − s)} does not exist, so dil(u, 1 − s) = 0. Putting these facts together, d+(x) = |Y| = |Y′| ≤ d(u, 1 − s) = d(u, 1 − s) − dil(u, 1 − s).
In the second case, assume that there exists an edge from x to x̄. Lemma B.11 says that there is an inverted loop incident to side (u, 1 − s), so dil(u, 1 − s) = 1. An inverted loop adds 2 to the degree of (u, 1 − s), i.e. d(u, 1 − s) ≥ |Y′| + 2; it also contributes 1 to the out-degree of x, i.e. d+(x) = |Y| + 1. Putting these together, we get d+(x) = |Y| + 1 = |Y′| + 1 ≤ d(u, 1 − s) − 1 = d(u, 1 − s) − dil(u, 1 − s).
For the lower bound, let Z′ be the set of all vertex-sides (u′, s′) such that (u′, s′)≠ (u, 1 − s) and there is an edge {(u, 1 − s), (u′, s′)}. Let Z = {z | FV (z) ∈ Z′}. By Lemma B.10, |Z| = |Z′|. By Lemma B.11, for every z ∈ Z, there is an edge from x to z in Gdbl(K) and therefore d+(x) ≥ |Z| = |Z′|.
Now we show that d(u, 1 − s) ≤ d+(x) + dil(u, 1 − s) by considering two cases. In the first case, assume that there is no inverted loop touching (u, 1 − s). Then, d(u, 1 − s) = |Z′| and dil(u, 1 − s) = 0. We can therefore write d(u, 1 − s) = |Z′| + dil(u, 1 − s) ≤ d+(x) + dil(u, 1 − s). In the second case, assume there exists an inverted loop touching (u, 1 − s). In this case, d(u, 1 − s) = |Z′| + 2. By Lemma B.11, there is an edge from x to x̄ and x̄ ∉ Z. Thus, d+(x) ≥ |Z| + 1. Putting this together, d(u, 1 − s) = |Z′| + 2 = |Z| + 2 ≤ d+(x) + 1 = d+(x) + dil(u, 1 − s).
For part (ii), observe that FV(x̄) = (u, 1 − s). We can then apply part (i) of this lemma to x̄, u, and 1 − s, and get that d+(x̄) = d(u, s) − dil(u, s). By Lemma B.3, d−(x) = d+(x̄), and hence d−(x) = d(u, s) − dil(u, s). □
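The degree identities can be checked on a small instance. The sketch below makes several assumptions that are not spelled out in this section: orient(x, 1) is the reverse complement, vertices are identified with canonical (lexicographically smaller) labels, Gdbl(K) has an edge for every (k−1)-overlap, and the bidirected edge set is obtained as the image of the doubled edges under FE (justified by Lemma B.11). As in the proof, an inverted loop contributes 2 to the degree of its side.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(x):
    return x.translate(COMPLEMENT)[::-1]


def FV(x):
    u = min(x, revcomp(x))  # assumed canonical label
    return (u, 0 if u == x else 1)


def doubled_edges(K):
    nodes = set(K) | {revcomp(x) for x in K}
    edges = {(x, x[1:] + c) for x in nodes for c in "ACGT" if x[1:] + c in nodes}
    return nodes, edges


def bidirected_edges(dbl_edges):
    # Image of the doubled edges under FE; mirror edges collapse to one edge.
    edges = set()
    for x1, x2 in dbl_edges:
        (u1, s1), (u2, s2) = FV(x1), FV(x2)
        edges.add(tuple(sorted([(u1, 1 - s1), (u2, s2)])))
    return edges


def d(side, bid_edges):
    # An inverted loop (side, side) is counted twice by tuple.count.
    return sum(e.count(side) for e in bid_edges)


def dil(side, bid_edges):
    return int((side, side) in bid_edges)


def check_degree_lemma(K):
    nodes, E = doubled_edges(K)
    B = bidirected_edges(E)
    for x in nodes:
        u, s = FV(x)
        d_out = sum(1 for a, _ in E if a == x)
        d_in = sum(1 for _, b in E if b == x)
        if d_out != d((u, 1 - s), B) - dil((u, 1 - s), B):
            return False
        if d_in != d((u, s), B) - dil((u, s), B):
            return False
    return True


K = {"AAC", "ACG", "CGT", "GTT", "TTC"}  # the 3-mers of AACGTTC
print(check_degree_lemma(K))
```

In this example the edge from ACG to CGT maps to an inverted loop at side (ACG, 1), so d(ACG, 1) = 2 and dil(ACG, 1) = 1, matching d+(ACG) = 1.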
An immediate consequence of the degree-preserving lemma is that if FW(w) is a unitig, then so is w. The converse is not always true, however.
Lemma B.14. Let K be a set of canonical k-mers where k is odd. Let w = (x0, …, xn) and t = (u0, s0, …, un, sn) be two walks related by t = FW(w).
(i) If t is a unitig, then w is a unitig.
(ii) If w is a unitig and xi+1 ≠ x̄i for all 0 ≤ i ≤ n − 1, then t is a unitig.
Proof. For (i), when n = 0, w is trivially a unitig because it has only one vertex. For n > 0, since t is a unitig, d(ui, si) = 1 for 0 < i ≤ n. Moreover, since an inverted loop would make a degree ≥ 2, we have dil(ui, si) = 0. Using Lemma B.13, d−(xi) = 1. Similarly, for all 0 ≤ i < n, d(ui, 1 − si) = 1, dil(ui, 1 − si) = 0, and Lemma B.13 gives that d+(xi) = 1. Hence w is a unitig.
For (ii), first observe that there is no inverted loop incident to (ui, si), for 1 ≤ i ≤ n. If there were one, then Lemma B.11 implies that there is an edge from x̄i to xi. Since w is a unitig, the only in-neighbor of xi is xi −1. Hence, xi −1 = x̄i, which contradicts the conditions of the Lemma. Now, since dil(ui, si) = 0, Lemma B.13 implies that d(ui, si) = d−(xi) + dil(ui, si) = d−(xi) = 1. Using a symmetrical argument (omitted), d(uj, 1 − sj) = 1 for all 0 ≤ j < n. Therefore, t is a unitig. □
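The failure of the converse is exactly the palindrome splitting artifact, and a minimal instance can be written down directly. The sketch below reuses the same assumed conventions as before (canonical labels as vertices, edges of Gbid obtained through FE, inverted loops counted twice) and checks only the interior degree conditions used in the proof above; it does not test the other parts of the unitig definition, such as the exclusion of periodic cycles.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(x):
    return x.translate(COMPLEMENT)[::-1]


def FV(x):
    u = min(x, revcomp(x))  # assumed canonical label
    return (u, 0 if u == x else 1)


def kmers(s, k):
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def build(K):
    nodes = set(K) | {revcomp(x) for x in K}
    dbl = {(x, x[1:] + c) for x in nodes for c in "ACGT" if x[1:] + c in nodes}
    bid = set()
    for x1, x2 in dbl:
        (u1, s1), (u2, s2) = FV(x1), FV(x2)
        bid.add(tuple(sorted([(u1, 1 - s1), (u2, s2)])))
    return dbl, bid


def is_dbl_unitig(w, dbl):
    # Degree conditions only; w is assumed to be a walk.
    def d_in(x):
        return sum(1 for _, b in dbl if b == x)

    def d_out(x):
        return sum(1 for a, _ in dbl if a == x)

    return all(d_in(x) == 1 for x in w[1:]) and all(d_out(x) == 1 for x in w[:-1])


def is_bid_unitig(t, bid):
    def d(side):
        return sum(e.count(side) for e in bid)  # inverted loop counts twice

    return (all(d((u, s)) == 1 for u, s in t[1:])
            and all(d((u, 1 - s)) == 1 for u, s in t[:-1]))


w = kmers("AACGTT", 3)          # palindromic walk in the doubled graph
dbl, bid = build(w)
t = [FV(x) for x in w]          # its image under FW

print(is_dbl_unitig(w, dbl))    # True
print(is_bid_unitig(t, bid))    # False: inverted loop at the middle
print(is_bid_unitig(t[:2], bid), is_bid_unitig(t[2:], bid))  # True True
```

The doubled graph contains the single unitig spelling AACGTT, whereas in the bidirected graph the inverted loop at side (ACG, 1) forces a break, leaving the two halves AACG and CGTT as separate unitigs.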
Similarly, we can relate the maximality of unitigs in Gdbl(K) and Gbid(K). A maximal unitig in Gdbl(K) is maximal in Gbid(K), on the condition that is a unitig in Gbid(K); however, the other direction only holds with a restrictive condition.
Let K be a set of canonical k-mers where k is odd. Let w = (x0, …, xn) and t = (u0, s0, …, un, sn) be two walks related by w . Suppose that both w and t are unitigs.
(i) If t is prefix-maximal and has no lonely inverted loop at the first endpoint side, then w is prefix-maximal.
(ii) If w is prefix-maximal, then t is prefix-maximal.
(iii) If t is suffix-maximal and has no lonely inverted loop at the last endpoint side, then w is suffix-maximal.
(iv) If w is suffix-maximal, then t is suffix-maximal.
Proof. We will prove (i) and (ii) only, since the proofs of (iii) and (iv) are symmetric. For (i), if there is more than one edge incident to (u0, s0), then d(u0, s0) ≥ 2. If there are no edges incident to (u0, s0), then d(u0, s0) = 0. In both cases, Lemma B.13 implies that d−(x0) = d(u0, s0) ≠ 1 and Lemma B.4 implies that w is prefix-maximal.
Now consider the case that d(u0, s0) = 1. By the conditions of the Lemma, there is no inverted loop incident at (u0, s0), and Lemma B.13 implies d−(x0) = 1. Since t is prefix-maximal, by Lemma B.7, there is a vertex side (u′, s′) and an edge e = {(u′, s′), (u0, s0)} such that d(u′, s′) > 1. Let and Lemma B.11 implies that there is an edge from x′ to x0 in Gdbl(K). Observe that because d(u0, s0) < 2, e is not an inverted loop. Therefore, (u′, s′) has at least one incident edge that is not an inverted loop. Because an inverted loop adds at least two to the degree, d(u′, s′) − dil(u′, s′) > 1. Thus, Lemma B.13 implies that d+(x′) > 1. By Lemma B.4, w is a prefix-maximal unitig.
For (ii), suppose for the sake of contradiction that t is not prefix-maximal. Then Lemma B.7 implies that d(u0, s0) = 1 and there exists a vertex-side (u′, s′) with d(u′, s′) = 1 and an edge e = {(u′, s′), (u0, s0)}. Let Note that dil(u0, s0) = dil(u′, s′) = 0 because vertex-sides with degree 1 cannot have an inverted loop incident to them. Lemma B.13 then implies that d−(x0) = d(u0, s0) = 1 and d+(x′) = d(u′, s′) = 1. In addition, Lemma B.11 applied to e says that there is an edge from x′ to x0. By Lemma B.4, these facts imply that w is not prefix-maximal, which is a contradiction. □
Theorem 2 has a condition that there are no circular unitigs. We now show that this implies that a unitig in Gbid(K) cannot have lonely inverted loops incident to both of the endpoint sides.
Let K be a set of canonical k-mers where k is odd. Let w = (x0, …, xn) be a walk in Gdbl(K) such that FW (w) is a unitig. If the two endpoint sides of FW (w) have lonely inverted loops incident on them, then is a circular unitig in Gdbl(K).
Proof. First, to show that w′ is a walk in Gdbl(K), we need to show that there exist edges and . This follows by applying Lemma B.11 to the inverted loop edges at the endpoints of FW (w), i.e. to {(u0, s0), (u0, s0)} and {(un, 1 − sn), (un, 1 − sn)}.
Second, to show that w′ is a unitig, we will show that all the necessary vertex degrees are 1. By Lemma B.14, w is a unitig, and hence d+(xi) = 1 for all 0 ≤ i < n and d−(xi) = 1 for all 0 < i ≤ n. Let (ui, si) = FV (xi) for all 0 ≤ i ≤ n. Because the endpoint sides of FW (w) each have a lonely inverted loop, d(u0, s0) = 2 and d(un, 1 − sn) = 2. Applying Lemma B.13, d−(x0) = d(u0, s0) − dil(u0, s0) = 2 − 1 = 1 and d+(xn) = d(un, 1 − sn) − dil(un, 1 − sn) = 1. Applying Lemma B.3 to all these, we get that for all 0 ≤ i ≤ n and for all 0 ≤ i ≤ n. □
B.4 Proof of Theorem 2
Let K be a set of canonical k-mers where k is odd and Gdbl(K) does not contain a circular unitig.
The function FW is a bijection from Dnon-pal to Bno-loop.
The function rev is a bijection between Blast-loop and Bfirst-loop.
The function head is a bijection from Dpal to Blast-loop.
Proof.
We already know from Lemma B.12 that FW is a bijection between walks in Gdbl(K) and Gbid(K). It remains to show that
(1) for a unitig w that is maximal and non-palindromic in Gdbl(K), FW (w) ∈ Bno-loop, and
(2) for a unitig t ∈ Bno-loop, F −1(t) is a maximal and non-palindromic unitig in Gdbl(K).
First, we prove (1). Because w is a non-palindromic maximal unitig, by Lemma B.6, there is no edge 0 ≤ i < n such that , because then (xi, xi+1) would be a palindromic sub-unitig of w. Hence we can apply Lemma B.14 to say that FW (w) is a unitig and we can apply Lemma B.15 to say that FW (w) is maximal. Hence FW (w) ∈ B. To show that FW (w) ∉ B2 ∪ B3, first assume for the sake of contradiction that there is a lonely inverted loop at the last endpoint side of FW (w). Then by Lemma B.11 there is an edge from xn to . By Lemma B.13, d+(xn) = 2 − 1 = 1. By Lemma B.3, . Because w is maximal, if d+(xn) = 1, then . This is a contradiction. The argument that there is no lonely inverted loop at the first endpoint side of FW (w) is symmetric and omitted.
Now, we prove (2). Let w = F −1(t). Since t is a unitig, Lemma B.14 implies that w is a unitig also. Moreover, Lemma B.9 implies that t is non-palindromic; since FW is spelling preserving (Lemma B.12), w is also non-palindromic. Since the Theorem assumes that Gdbl(K) does not have circular unitigs, Lemma B.16 implies that t cannot have a lonely inverted loop at both endpoints. Since t ∉ B2 ∪ B3, it also cannot have an inverted loop at exactly one endpoint. We can therefore apply Lemma B.15 to get that w is maximal.
Observe that rev is by definition a function that is its own inverse and is a bijection on the set of walks in Gbid(K). Furthermore, Lemma B.8 implies that rev remains a bijection when restricted to maximal unitigs in Gbid(K). Finally, observe that for a walk t, the first (respectively, last) endpoint side of t is the last (respectively, first) endpoint side of rev(t). These facts together imply that rev is a bijection between Bfirst-loop and Blast-loop.
To show that head is a bijection, we show that
(1) for all w ∈ Dpal, head(w) ∈ Blast-loop,
(2) for all t ∈ Blast-loop, there exists a w ∈ Dpal such that head(w) = t, and
(3) the above w is unique.
First, we prove (1). Let w = (x0, …, xn). By Lemma B.1, n is odd and at least 1. Let m = (n − 1)/2 and let h ≜ (x0, …, xm). Since w is a palindromic unitig and, by the conditions of the Theorem, non-circular, Lemma B.5 implies that for all . Then by Lemma B.14, head(w) = FW (h) is a unitig. Simultaneously, because w is a maximal unitig, h is a prefix-maximal unitig. Lemma B.15 then implies that FW (h) is prefix-maximal.
Now we show that FW (h) is suffix-maximal and has a lonely inverted loop at the last endpoint. Let (u0, s0, …, um, sm) ≜ FW (h). Since w is palindromic, Lemma B.5 implies that , and, hence, um = um+1. By Lemma B.11, there is an inverted loop incident to (um, 1 − sm), i.e. the last endpoint of FW (h). Because w is a unitig, d+(xm) = d−(xm+1) = 1. Lemma B.13 then implies that d(um, 1 − sm) = d+(xm) + dil(um, 1 − sm) = 2. By Lemma B.7, FW (h) is suffix-maximal and therefore we have shown that FW (h) ∈ Blast-loop.
Next we prove (2). Let (u0, s0, …, un, sn) = t and let . Let be a sequence of vertices in Gdbl(K). We will first show that w is a walk, then that it is palindromic, then that it is a unitig, and finally that it is maximal. Note that w is equivalently defined to be the concatenation of with . Applying Lemma B.12, the sequences (x0, …, xn) and are walks. Since t is in Blast-loop, there is an inverted loop incident to (un, 1 − sn). By Lemma B.11, this implies there is an edge from xn to in Gdbl(K). Therefore, w is a walk. It is palindromic by its definition. Since t is a unitig, by Lemma B.8, rev(t) is a unitig. Now applying Lemma B.14, w and rev(w) are both unitigs. Because the inverted loop is lonely, d(un, 1 − sn) = 2, and by Lemma B.13, d+(xn) = 1. Applying Lemma B.3, . Hence w is a unitig.
As t is in Blast-loop, this implies that no lonely inverted loop is incident to (u0, s0). We can apply Lemma B.15 to get that F −1(t) is prefix-maximal. Because w starts with F −1(t), w is also prefix-maximal. By Lemma B.8, F −1(rev(w)) is suffix-maximal. Because w ends with F −1(rev(w)), w is also suffix-maximal. Hence, w is maximal.
For (3), let (u0, s0, …, un, sn) = t and let ). Let w′ be a walk in Dpal such that head(w′) ∈ Blast-loop. We will show that . Since head(w′) has n + 1 vertices, w must have 2n + 2 vertices. Hence we can write . Since w is a palindrome, we have that for all 0 ≤ i ≤ 2n + 1. We can therefore rewrite w as . Next, observe that . Since this must be equal to t and FW is a bijection (Lemma B.12), we get that . We can therefore rewrite w as , which is the same as w. □
C Experimental details
Choice of k parameter for the assemblers
To ensure that the results are comparable across assemblers, we set the k parameter so that the sets of unitigs constructed are as close as possible. Ideally, k would be set such that the underlying k-mer set K is the same for all assemblers. However, there is a practical limitation. Both SPAdes and MEGAHIT are multi-k assemblers, so the k parameter is just the maximum allowed k-mer size. When we pass the value k to the assemblers, both SPAdes and MEGAHIT use the k-mer set and the (k + 1)-mer set to construct unitigs, whereas bcalm, ABySS, and minia use a node-centric de Bruijn graph with only the k-mers as vertices. As such, we found that the output unitigs of SPAdes and MEGAHIT with a value of k are more similar to the unitigs of bcalm and ABySS created with k + 1. We also note that SPAdes and MEGAHIT only allow odd k, which is why we needed to use an even k for Gdbl.
In Table 3, we therefore passed k = 74 to bcalm and k = 73 to SPAdes and MEGAHIT. Since Theorem 1 is valid for all k, this was not an issue for Table 3. We used the default parameter for minimum k-mer coverage for both assemblers.
For Table 6, we passed k = 31 to all assemblers, since Theorem 2 only applies when the vertex length k is odd. Since SPAdes and MEGAHIT by default use both the k-mer and (k + 1)-mer sets to construct unitigs, the number of palindromic unitigs (433) differs from the number in minia and ABySS (440). However, this is not a problem because we compare the numbers only within each assembler, not between assemblers.
Detection of palindrome splitting artifact
In this section, we use the notation S[i : j] to denote the substring of string S starting at index i and ending at index j. Let w = (x0, …, xn) be a palindromic unitig in Dpal and let p be its spelling. We say w is fully-covered if there exists some contig that aligns to an interval which contains p’s interval in the reference. Let k′ ≜ (k − 1)/2. We say w is split if there exists at least one contig c such that either
c aligns to an interval that starts before p’s interval and ends exactly at position |p|/2 + k′ of p’s interval and there are no other contigs with alignments intersecting p[|p|/2 + k′ + 1 : |p|], or
c aligns to an interval that ends after p’s interval and starts exactly at position |p|/2 − k′ + 1 of p’s interval and there are no other contigs with alignments intersecting p[1 : |p|/2 + k′].
We say w is ambiguous if it is neither fully-covered nor split.
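The three categories above can be sketched as a small classifier. The following Python sketch is illustrative only: it assumes 1-based inclusive reference coordinates, represents each contig alignment as a hypothetical (start, end) pair, checks fully-covered before split (an ordering the text does not specify), and uses the index ranges exactly as they are written in the two split cases. The function name and input format are assumptions.

```python
def classify_palindromic_unitig(p_start, p_end, k, alignments):
    """Classify a palindromic unitig whose spelling p occupies the reference
    interval [p_start, p_end] (1-based, inclusive).

    alignments: list of (start, end) contig alignment intervals, same
    coordinate system. This input format is an assumption of the sketch.
    """
    p_len = p_end - p_start + 1
    half = p_len // 2
    k_prime = (k - 1) // 2

    def intersects(aln, lo, hi):
        return aln[0] <= hi and aln[1] >= lo

    # Fully-covered: some contig's interval contains p's interval.
    if any(s <= p_start and e >= p_end for s, e in alignments):
        return "fully-covered"

    left_end = p_start + half + k_prime - 1   # position |p|/2 + k' of p
    right_start = p_start + half - k_prime    # position |p|/2 - k' + 1 of p

    for i, (s, e) in enumerate(alignments):
        others = alignments[:i] + alignments[i + 1:]
        # Case 1: c starts before p and ends exactly at |p|/2 + k',
        # and no other contig touches p[|p|/2 + k' + 1 : |p|].
        if s < p_start and e == left_end:
            if not any(intersects(o, left_end + 1, p_end) for o in others):
                return "split"
        # Case 2: c ends after p and starts exactly at |p|/2 - k' + 1,
        # and no other contig touches p[1 : |p|/2 + k'].
        if e > p_end and s == right_start:
            if not any(intersects(o, p_start, left_end) for o in others):
                return "split"
    return "ambiguous"


# p occupies positions 101..110 (|p| = 10) and k = 3, so k' = 1.
print(classify_palindromic_unitig(101, 110, 3, [(50, 200)]))  # fully-covered
print(classify_palindromic_unitig(101, 110, 3, [(90, 106)]))  # split
print(classify_palindromic_unitig(101, 110, 3, [(90, 104)]))  # ambiguous
```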
To motivate these cases, observe that the length of p is n + k and, because p is a palindrome and k is odd, n must be odd. Let be the first half of the walk w and let p′ be its spelling. By Theorem 2, head(w) ∈ Blast-loop and rev(head(w)) ∈ Bfirst-loop. Then, . Then,
spell(head(w)) = spell(FW (w′)) = spell(w′) = p[1 : |p|/2 + k′], and
spell(rev(head(w))) = p[|p|/2 − k′ + 1 : |p|].
The cases we describe therefore correspond to observing the alignments of head(w) and rev(head(w)) to the corresponding places of p and not observing any other bidirected unitigs aligning across the middle boundaries.
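The index arithmetic behind these cases can be checked on a toy palindrome. The sketch below assumes that "palindrome" means a string equal to its own reverse complement and that reversing a bidirected unitig reverse-complements its spelling; both are assumptions of the sketch, as are the function names. It computes the prefix p[1 : |p|/2 + k′] and the suffix starting at position |p|/2 − k′ + 1, and confirms that each is the reverse complement of the other and that they overlap in k − 1 characters.

```python
# Sketch only: helper names are hypothetical.
COMP = str.maketrans("ACGT", "TGCA")


def rc(s):
    """Reverse complement of a DNA string."""
    return s.translate(COMP)[::-1]


def split_halves(p, k):
    """Return (p[1 : |p|/2 + k'], p[|p|/2 - k' + 1 : |p|]) in the 1-based,
    inclusive notation of the text, for a reverse-complement palindrome p."""
    if rc(p) != p:
        raise ValueError("p must equal its own reverse complement")
    if k % 2 != 1:
        raise ValueError("k must be odd")
    k_prime = (k - 1) // 2
    half = len(p) // 2
    first = p[:half + k_prime]    # 1-based positions 1 .. |p|/2 + k'
    second = p[half - k_prime:]   # 1-based positions |p|/2 - k' + 1 .. |p|
    return first, second


first, second = split_halves("AACGTT", 3)
print(first, second)             # AACG CGTT
print(rc(first) == second)       # True
print(first[-2:] == second[:2])  # True: the halves share k - 1 = 2 characters
```

The shared k − 1 characters in the middle are why the two half-unitigs end at |p|/2 + k′ and start at |p|/2 − k′ + 1 rather than meeting exactly at |p|/2.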
CAMI dataset
We used the benchmark called “low complexity dataset” in [31]. Since our analysis requires error-free reads, we re-simulated the reads using identical genomes and abundances (as detailed in the supplementary materials of [26]). Table S1 shows the properties and relative abundances of the genomes. We used CAMISIM [13] for the simulations, with a read length of 150 nt and an insert size of 150.
Acknowledgements
PM thanks Rayan Chikhi, Alexandru Tomescu, and Mihai Pop for useful discussions. This material is based upon work supported by the National Science Foundation under Grant Nos. 1453527 and 1931531. AR was supported by the NIH Computation, Bioinformatics, and Statistics training program.
Footnotes
1 The safety of unitigs has been previously studied for other notions of “safety” by [9]. While the authors did not make the explicit conclusion and did not verify it in practice, their Theorem 6.1(d) implies that unitigs are not guaranteed to be safe in the model of assembly they consider. Concretely, while a suffix or prefix of the unitig may be present at the starts and ends of parts of the genome, the whole unitig might never be contained as a contiguous sequence.
2 As an aside, we mention how one would obtain a bijection. This is not necessary for the proofs of this paper, but may be a useful observation in its own right. Let E be the set of edges in Gdbl(K), let α ⊆ E be all the self-mirror edges, and let β be the partition of E \ α into mirror edge-pairs. For example, if E = {(AGG, GGA), (TCC, CCT), (TTA, TAA)}, then α = {(TTA, TAA)} and β = {{(AGG, GGA), (TCC, CCT)}}. For an element , we define . For a self-mirror edge (x, y) ∈ α, we define FEG({(x, y)}) = FE(x, y). One can then show that FEG is a bijection between α ∪ β and edges in Gbid(K).
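The partition in footnote 2 can be reproduced on its own example. The following Python sketch assumes that the mirror of an edge (x, y) is (rc(y), rc(x)), where rc denotes the reverse complement; this definition is an assumption chosen to be consistent with the example in the footnote, and the function names are hypothetical.

```python
# Sketch only: the mirror definition is an assumption, see the lead-in.
COMP = str.maketrans("ACGT", "TGCA")


def rc(s):
    """Reverse complement of a DNA string."""
    return s.translate(COMP)[::-1]


def mirror(edge):
    """Assumed mirror of the edge (x, y): the edge (rc(y), rc(x))."""
    x, y = edge
    return (rc(y), rc(x))


def partition_edges(edges):
    """Split edges into self-mirror edges (alpha) and mirror pairs (beta).

    Assumes the mirror of every edge is also in the edge set; if it is not,
    the pair in beta will still list the missing mirror.
    """
    edges = set(edges)
    alpha = {e for e in edges if mirror(e) == e}
    beta = {frozenset((e, mirror(e))) for e in edges - alpha}
    return alpha, beta


E = {("AGG", "GGA"), ("TCC", "CCT"), ("TTA", "TAA")}
alpha, beta = partition_edges(E)
print(alpha)  # {('TTA', 'TAA')}
print(beta == {frozenset({("AGG", "GGA"), ("TCC", "CCT")})})  # True
```

Under this assumption, α ∪ β has two elements for the example, one per self-mirror edge and one per mirror pair.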