Describing the Local Structure of Sequence Graphs

Analysis of genetic variation using graph structures is an emerging paradigm of genomics. However, defining genetic sites on sequence graphs remains an open problem. Paten's invention of the ultrabubble and snarl, special subgraphs of sequence graphs which can be identified with efficient algorithms, represents an important first step towards segregating graphs into genetic sites. We extend the theory of ultrabubbles to a special subclass where every detail of the ultrabubble can be described as a series and parallel arrangement of genetic sites. We furthermore introduce the concept of bundle structures, which allows us to recognize the graph motifs created by additional combinations of variation in the graph, including but not limited to runs of abutting single nucleotide variants. We demonstrate linear-time identification of bundles in a bidirected graph. These two advances build on initial work on ultrabubbles in bidirected graphs, and define a more granular concept of genetic site.


Background
The concept of the genetic site underpins both classical genetics and modern genomics. From a biological perspective, a site is a position at which mutations have occurred in different samples' histories, leading to genetic variation. From an engineering perspective, a site is a subgraph with left and right endpoints where traversals by paths correspond to alleles. This is useful for indexing and querying variants in paths and for describing variants in a consistent and granular manner.
Against a linear reference, it is trivial to define sites, provided that we disallow variants spanning overlapping positions. This is clearly demonstrated by VCF structure [1]. VCF sites, consisting of any number of possible alleles, are identified by their endpoints with respect to the linear reference.
If we wish to analyze a set of variants containing structural variation, highly divergent sequences or nonlinear reference structures, then a linear reference with only non-overlapping variants is no longer a sufficient model. Datasets with one or more of these properties are becoming more common [2,3], and sequence graphs [4] have been developed as a method of representing them. However, defining sites on graphs is considerably more difficult than on linear reference structures, and the creation of methods to fully decompose sequence graphs into sites remains an unsolved problem.

The Challenges of Defining Sites on Graphs
On a graph-based reference, the linear reference definition of a site as a position along the reference and a set of alleles fails to work for several reasons:
1. Sequences which are at the same location in linear position may not have comparable contexts. This is a consequence of having variants which cannot be represented as edits to the linear reference but rather as edits to another variant. We illustrate this with an example from 1000 Genomes polymorphism data, visualized using Sequence Tube Maps [5].
Fig. 1. Sequence Tube Map of the example (node labels: A, CTCA, T, C, GTTCGAAGG, CTCACTAG)
2. Elements of sequence may not be linearly ordered. Parallel structure of the graph (see 3.) is one sort of non-linearity. Graphs also allow repetitive, inverted or transposed elements of sequence. These all prevent linear ordering.

Fig. 2. A cycle and an inversion in a graph
3. The positions spanned by different elements of variation may partially overlap. Therefore, multiple mutually exclusive segments of sequence in a region of the graph cannot be considered to be alternates to each other at a well-defined position without having to include extraneous sequence that is shared between some but not all of the "alleles." We can expect that the density of these graph structures will increase with increasing population sizes included in datasets.
Our aim will be to recognize and fully decompose subgraphs resembling Example 1 into a notion of site, and isolate these from elements of the graph resembling Examples 2 and 3.

Directed and Bidirected Sequence Graphs
The graphs used to represent genetic information consist of labelled nodes and edges. Nodes are labelled with sequence fragments. Edges form paths whose labels spell out allowed sequences. Two types of graph are used. The simpler type is the directed graph. A directed graph (or "digraph") G consists of a set V of nodes and a set E of directed edges. A directed edge is an ordered tuple (x, y), consisting of a head x ∈ V and tail y ∈ V. A directed path is a sequence of nodes joined by edges, followed head to tail. G is a directed acyclic graph (DAG) if it admits no directed path which revisits any node.
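As an illustration of the DAG condition, the following minimal sketch tests acyclicity by repeatedly removing nodes with no remaining incoming edges (Kahn's algorithm, a standard technique that is not taken from this paper). The function name `is_dag` and the list-of-tuples edge representation are assumptions made for the example; an edge (x, y) is read as pointing from x to y.

```python
from collections import deque


def is_dag(nodes, edges):
    """Return True if the directed graph admits no directed cycle.

    nodes: iterable of hashable node names (hypothetical representation).
    edges: iterable of ordered pairs (x, y), read as an edge from x to y.
    """
    nodes = list(nodes)
    indegree = {v: 0 for v in nodes}
    successors = {v: [] for v in nodes}
    for x, y in edges:
        successors[x].append(y)
        indegree[y] += 1

    # Start from every node that has no incoming edge.
    queue = deque(v for v in nodes if indegree[v] == 0)
    removed = 0
    while queue:
        v = queue.popleft()
        removed += 1
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)

    # Nodes on a directed cycle are never removed.
    return removed == len(nodes)
```

Every node and edge is handled a constant number of times, so the check runs in O(|V| + |E|) time.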
A bidirected graph G [6] consists of a set V of vertices and a set E of edges. Each vertex v ∈ V consists of a pair of node-sides {v_left, v_right} and each edge is an unordered tuple of node-sides. Bidirected graphs have the advantage of being able to represent inversion events.
We write N for the set of node-sides in the bidirected graph G. The opposite n̂ of a node-side n is the other node-side at the same vertex as n.
A sequence p = x_1, x_2, ..., x_k of node-sides is a path if, for every consecutive pair (x_i, x_{i+1}), either x_{i+1} = x̂_i or x_i and x_{i+1} are joined by an edge, and any contiguous subsequence of p consisting of a node-side x alternating with its opposite x̂ is either even in length or is a prefix or suffix of p. Informally, this means that in a path, consecutive pairs forming edges alternate with pairs of opposite node-sides or, equivalently, that paths visit both node-sides of the vertices they pass through. They can, however, begin or end on an isolated node-side.
A bidirected graph G is cyclic if it admits a path visiting a node-side twice. Therefore the self-incident hairpin motif (below, right) is considered a cycle. A bidirected graph G is properly cyclic if it admits a path which visits a pair {n, n̂} twice in the same order.
Some publications refer to biedged graphs. These are {black, grey}-edge-colored undirected graphs, where every node is paired with precisely one other by sharing a grey edge, and paths in the graph must alternate between traversing black and grey edges. Paten elaborates on this construction in [7] and shows that it is equivalent to a bidirected graph. We will restrict our language to that of bidirected graphs, recognizing that these are equivalent to biedged graphs.
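The informal reading of the path condition (steps along edges alternate with steps between opposite node-sides) can be checked mechanically. The sketch below is illustrative only: it assumes node-sides are represented as (vertex, "left") or (vertex, "right") tuples and edges as frozensets of node-sides, and the names `opposite` and `is_path` are hypothetical.

```python
def opposite(side):
    """Return the other node-side at the same vertex."""
    vertex, which = side
    return (vertex, "right" if which == "left" else "left")


def is_path(sides, edges):
    """Check that consecutive steps alternate between edges and vertex crossings.

    sides: non-empty list of node-sides, each a (vertex, "left"/"right") tuple.
    edges: set of frozensets of node-sides; a hairpin edge joining a node-side
           to itself is the one-element frozenset.
    """
    if not sides:
        return False
    possible = None  # kinds of step the previous step may have been
    for a, b in zip(sides, sides[1:]):
        kinds = set()
        if b == opposite(a):
            kinds.add("cross")  # step across a vertex
        if frozenset((a, b)) in edges:
            kinds.add("edge")  # step along an edge
        if possible is not None:
            # Keep only kinds that differ from some admissible previous kind.
            kinds = {k for k in kinds if possible - {k}}
        if not kinds:
            return False
        possible = kinds
    return True
```

Tracking a set of admissible step kinds, rather than a single kind, handles the ambiguous case of an edge that joins a node-side to its own opposite.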
Acyclic bidirected graphs are structurally equivalent to directed graphs in the following sense.
Lemma 1. If G is a bidirected acyclic graph, there exists an isomorphic directed acyclic graph D(G).
Proof. See [7].

Bubbles, Superbubbles, Ultrabubbles and Snarls
The first use of local graph structure to identify variation was the detection of bubbles [8] in order to detect and remove sequencing errors from assembly graphs. A bubble in this sense is the graph motif consisting of two paths which share a source and a sink but are disjoint in between.
The general concept of bubbles was extended by Onodera et al., who defined superbubbles in directed graphs [9]. Brankovic demonstrates an O(|V| + |E|) algorithm to identify them [10], building on the work of Sung [11].
We restate the Onodera definition, modified slightly so as to be subgraph-centric rather than boundary-centric. A subgraph S ⊆ G of a directed graph is a superbubble with boundaries (s, t) if
1. (reachability) t is reachable from s by a directed path in S
2. (matching) the set of vertices reachable from s without passing through t is equal to the set of vertices from which t is reachable without passing through s, and both are equal to S
3. (acyclicity) S is acyclic
4. (minimality) there exists no t′ ∈ S such that boundaries (s, t′) fulfil 1, 2 and 3, and there exists no s′ ∈ S such that (s′, t) fulfil 1, 2 and 3.
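To make the first three conditions concrete, the sketch below tests reachability, matching and acyclicity for a proposed boundary pair (s, t). It is a direct check rather than the linear-time enumeration of [10], it does not test minimality, and the name `superbubble_candidate` and the edge-list input are assumptions for illustration. An edge (x, y) is read as pointing from x to y.

```python
def _reach(start, stop, adjacency):
    """Vertices reachable from start without continuing past stop."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        if v == stop:
            continue  # arriving at stop is allowed, passing through it is not
        for w in adjacency.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def _is_acyclic(vertices, succ):
    """Kahn-style acyclicity test on the subgraph induced by vertices."""
    indegree = {v: 0 for v in vertices}
    for v in vertices:
        for w in succ.get(v, ()):
            if w in indegree:
                indegree[w] += 1
    ready = [v for v in vertices if indegree[v] == 0]
    removed = 0
    while ready:
        v = ready.pop()
        removed += 1
        for w in succ.get(v, ()):
            if w in indegree:
                indegree[w] -= 1
                if indegree[w] == 0:
                    ready.append(w)
    return removed == len(vertices)


def superbubble_candidate(s, t, edges):
    """Return the vertex set S if (s, t) passes conditions 1 to 3, else None."""
    succ, pred = {}, {}
    for x, y in edges:
        succ.setdefault(x, set()).add(y)
        pred.setdefault(y, set()).add(x)
    forward = _reach(s, t, succ)   # reachable from s without passing through t
    backward = _reach(t, s, pred)  # can reach t without passing through s
    if t not in forward or forward != backward:
        return None  # reachability or matching fails
    if not _is_acyclic(forward, succ):
        return None
    return forward
```

For a diamond s → {a, b} → t the function returns {s, a, b, t}; adding a dangling edge a → x breaks matching, since x is reachable from s but cannot reach t.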
To motivate our definition of a superbubble equivalent on bidirected graphs, we prove some consequences of the matching property.
Proposition 2. Let S ⊆ G be a subgraph of a directed graph. If S possesses the matching property relative to a pair (s, t), then it possesses the following three properties:
1. (2-node separability) Deletion of all incoming edges of s and all outgoing edges of t disconnects S from the remainder of the graph.
2. (tiplessness) There exists no node n ∈ S\{s, t} such that n has either only incoming or only outgoing edges.
3. S is weakly connected.
Proof. (matching ⇒ separability) Suppose ∃x ∉ S, y ∈ S\{s, t} such that there exists either an edge x → y or an edge y → x. Suppose w.l.o.g. that there exists an edge x → y. By matching, there exists a path y → ··· → t which does not pass through s. We can then construct the path x → y → ··· → t, which does not pass through s. But by matching this implies that x ∈ S, which is a contradiction.
The converse need not be true on directed graphs. We define two structures on bidirected graphs. The first is the ultrabubble, which, given Proposition 2, can be thought of as an analogue to a superbubble. The second, the snarl, is a more general object which preserves the property of 2-node separability from the larger graph without having strong guarantees on its internal structure. The following definitions are due to Paten [7]. A connected subgraph S ⊆ G of a bidirected graph G is a snarl (S, s, t) with boundaries (s, t) if
1. s ≠ t̂
2. (2-node separability) every path between a pair of node-sides x ∈ S, y ∈ G\S contains either s → ŝ or t → t̂ as a subpath
3. (minimality) there exists no t′ ∈ S such that boundaries (s, t′) fulfil 1 and 2, and there exists no s′ ∈ S such that (s′, t) fulfil 1 and 2.
The class of ultrabubbles is the subclass of snarls (S, s, t) furthermore fulfilling
4. S is acyclic
5. S contains no tips (vertices having one node-side involved in no edges).
Three examples of ultrabubbles are shown below. The following is an important property of snarls.
Proposition 3 (Non-overlapping property). If two distinct snarls share a vertex (node-side pair) then either they share a boundary node or one snarl is included in the other's interior.
Proof. Let S be a snarl with boundaries s, t. Let T be another snarl, with boundaries u, v. Suppose that u ∈ S\{s, t} but v ∉ S, and s ∉ T. Consider the set S ∩ T. It is nonempty since it contains u. Let x ∈ S ∩ T. Let y ∉ S ∩ T. Suppose that there exists a path p = x ↔ ··· ↔ y which passes through neither u nor t.
Since y ∉ S ∩ T, either y ∉ S or y ∉ T. W.l.o.g., assume y ∉ T. Then, due to the separability of T, since the path p does not pass through u, it must pass through v before leaving T to visit y. But v ∉ S, so p must also pass through s before leaving S to visit v, since it does not pass through t. But it must pass through v before leaving T to visit s, which leads to an impossible sequence of events. Therefore any path x ↔ ··· ↔ y for x ∈ S ∩ T, y ∉ S ∩ T must pass through either u or t. This contradicts the minimality of both S and T.
This non-overlapping property is also a nesting property. Observe that, due to Proposition 3, the relation U ≤ V on snarls U, V, defined such that U ≤ V if U is entirely contained in V, has the property that if U ≤ V and U ≤ W, then either V ≤ W or W ≤ V. Therefore the partial order on the snarls of G defined by the relation ≤ will always be equivalent to a tree diagram. A bottom-level snarl is one which forms a leaf node of this tree. The equivalent of Proposition 3 for superbubbles was stated without proof by Onodera in [9]. Our proof also constitutes a proof of the statement for superbubbles, due to the following proposition, proven by Paten in [7]:
Proposition 4. Every superbubble in a directed graph corresponds to an ultrabubble in the equivalent (see Lemma 1) bidirected graph.
Identifying all superbubbles in a directed graph or all snarls in a bidirected graph introduces a method of compartmentalizing a graph into partitions whose contents are all in some sense at the same position in the graph, and for which the possible internal paths are independent of what path they continue on beyond their boundaries. We will use this concept to define sites for certain specialized classes of graphs.
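The tree implied by the containment relation ≤ can be built directly once each snarl is known as a set of vertices. The following sketch assumes the family of sets is nested in the sense described above (any two snarls containing a common snarl are comparable); the function name `nesting_forest` and the dictionary input are hypothetical, and the quadratic search is chosen for brevity rather than speed.

```python
def nesting_forest(snarls):
    """Map each snarl name to the name of its smallest strict superset.

    snarls: dict from a snarl name to a frozenset of its vertices.
    Returns a dict name -> parent name, with None for top-level snarls.
    """
    # Visit candidates from smallest to largest, so that the first strict
    # superset found for a snarl is also its smallest one.
    order = sorted(snarls, key=lambda name: len(snarls[name]))
    parent = {}
    for i, name in enumerate(order):
        parent[name] = next(
            (other for other in order[i + 1:] if snarls[name] < snarls[other]),
            None,
        )
    return parent


def bottom_level(parent):
    """Names of snarls that are leaves of the nesting forest."""
    return set(parent) - set(parent.values())
```

Bottom-level snarls are exactly the names that never occur as a parent.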

Graphs which are Decomposable into Nested Simple Sites
We will extend the theory of ultrabubbles to a theory of nested sites where the structure of certain graphs can be fully described in terms of combinations of linear ordering and ultrabubble nesting relationships. This is important for
1. Identifying nested variation
2. Indexing traversals

Traversals and Subpaths
An (s, t)-traversal of S is a path in S beginning with s and ending with t. An (s, s)-traversal and a (t, t)-traversal are analogously defined. Presence of an (s, s)- or (t, t)-traversal implies cyclicity. Two traversals of a snarl are disjoint if they are disjoint on S\{s, t}. Paten's [7] snarls and ultrabubbles are 2-node separable subgraphs whose paired boundary nodes isolate their traversals from the larger graph. We can state this with more mathematical rigor:
Claim. Consider a snarl (S, s, t) in a bidirected graph G. The set of all paths in G which contain a single (s, t)-traversal as a contiguous subpath is isomorphic to the set-theoretic product P(s) × Trav(s, t) × P(t) consisting of the three sets
1. P(s) := {paths in G\S terminating in s}
2. Trav(s, t) := {(s, t)-traversals of S}
3. P(t) := {paths in G\S beginning with t}
The isomorphism is the function mapping p_1 ∈ P(s), p_2 ∈ Trav(s, t), p_3 ∈ P(t) to their concatenation p_1 p_2 p_3.
This property is important because it allows us to express the set of all haplotypes traversing a given linear sequence of snarls in terms of combinations of alleles for which we do not need to check if certain combinations are valid.

Simple Bubbles and Nested Simple Bubbles
Simple bubbles are structurally equivalent to (multiallelic) sites consisting of disjoint substitutions, insertions or deletions, with all alleles spanning the same boundaries. Proposition 7 below demonstrates that we can identify simple bubbles in O(|V|) time given that we have found all snarl boundaries. Paten has shown [7] that identification of snarl boundaries is achieved in O(|E| + |V|) time. To find the ultrabubbles among these, note that checking for acyclicity is O(|E| + |V|) on account of the unbranching nature of these snarls' interiors.
Given a node-side n, write Nb(n) for the set of all neighbors of n.
Lemma 6 (Nodes in an ultrabubble are orientable with respect to the ultrabubble boundaries). Given an ultrabubble (S, s, t) and given n ∈ S\{s, t}, consider the set T of all (s, t)-traversals of S passing through n. Then either
In the former case we call n s-sided, otherwise we call it t-sided.
Proof. This is a corollary to Lemma 1.
Proposition 7 (Simple bubbles have unbranching interiors). Let (S, s, t) be an ultrabubble. Then all traversals are disjoint iff every interior node-side has precisely one neighbor.
Proof. (⇒) Suppose that some interior node-side n has more than one neighbor. Since n is orientable with respect to (s, t), suppose, without loss of generality, that it is s-sided. Then there exist distinct paths from s to n passing through each of its neighbors. Continuing these with a path from n̂ to t produces two nondisjoint traversals of S.
(⇐) Suppose that every interior node-side has precisely one neighbor. Suppose that there exist two distinct nondisjoint traversals of S. For no node-side to have multiple neighbors, they must coincide at every node-side, contradicting the assumption that they are not the same traversal.
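The criterion of Proposition 7 reduces to a degree check on interior node-sides. The sketch below assumes that the subgraph is already known to be an ultrabubble, that node-sides are (vertex, "left"/"right") tuples, and that "interior" means both node-sides of every vertex other than the two boundary vertices; the names `neighbors_from_edges` and `is_simple_bubble` are hypothetical.

```python
def neighbors_from_edges(edges):
    """Build the map Nb from an iterable of unordered node-side pairs."""
    nb = {}
    for a, b in edges:
        nb.setdefault(a, set()).add(b)
        nb.setdefault(b, set()).add(a)
    return nb


def is_simple_bubble(interior_vertices, nb):
    """True if every interior node-side has precisely one neighbor.

    interior_vertices: vertices of the ultrabubble other than its boundaries.
    nb: map from node-side to the set of its neighbors.
    """
    for v in interior_vertices:
        for side in ((v, "left"), (v, "right")):
            if len(nb.get(side, ())) != 1:
                return False
    return True
```

The loop touches each interior node-side once, in keeping with the O(|V|) bound stated for simple bubbles.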
We seek to extend this simple property to more complex graph structures. We will take advantage of the nesting of nondisjoint ultrabubbles proven in Proposition 3 to define another structure in which nondisjoint traversals are easily indexed.
Definition 8. An ultrabubble (S, s, t) ⊆ G is decomposable into nested simple sites if either:
1. S is a simple bubble
2. S becomes a simple bubble when every ultrabubble (S′, s′, t′) contained in the interior of S which is itself decomposable into nested simple sites is replaced with a single edge s′ − t′.
The following figure demonstrates decomposability into nested simple sites.
Proposition 9. If an ultrabubble (U, s, t) is decomposable into nested simple sites, then the complete node sequence of any (s, t)-traversal can be determined only by specifying the path it takes inside those nested ultrabubbles within which the traversal does not visit any further nested ultrabubble.
Proof. Let p be an (s, t)-traversal of an ultrabubble U which is decomposable into nested simple sites. Let V be a nested ultrabubble inside U. If p traverses V, write p|_V for the traversal p restricted to V.
Figure. Left: A nesting of four ultrabubbles. Right: The tree structure to index traversals of U implied by Proposition 9.
Suppose that p|_V intersects no nested ultrabubbles within V. Then p|_V is disjoint from all other traversals within V due to U being decomposable into nested simple sites. Therefore specifying any node of p|_V uniquely identifies it.
Suppose that p|_V intersects some set of ultrabubbles nested within V. Since U is decomposable into nested simple sites, the nodes of p|_V must be linear and disjoint from all other paths if we replace all ultrabubbles nested in V with edges joining their boundaries. Therefore specifying which ultrabubbles are crossed uniquely determines the nodes included in p|_V which lie outside of the nested ultrabubbles in V.
The statement of the proposition follows from the two arguments above by induction.
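The indexing described by Proposition 9 can be illustrated with a small recursive data structure. In this sketch, which uses a representation invented for the example, a site is a list of alleles, an allele is a tuple of items in series, and an item is either a node label (a string) or a nested site (a list). The function `expand` enumerates the sequences spelled by all traversals.

```python
from itertools import product


def expand(site):
    """List the sequences spelled by every traversal of a nested simple site.

    site: list of alleles; each allele is a tuple whose items are either
    node labels (str) or nested sites (list). An empty tuple is a deletion.
    """
    sequences = []
    for allele in site:
        # Each item contributes either its own label or any traversal of
        # the nested site it stands for.
        options = [
            expand(item) if isinstance(item, list) else [item]
            for item in allele
        ]
        sequences.extend("".join(choice) for choice in product(*options))
    return sequences
```

For example, `expand([("A",), ("C", [("G",), ("T",)], "A")])` returns `["A", "CGA", "CTA"]`: the second allele contains a nested site, and choosing a path inside that innermost site fixes the whole traversal.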
Proposition 10. An ultrabubble is decomposable into nested simple sites iff every node-side is either the boundary of an interior ultrabubble or has precisely one neighbor.
Proof. This can be established using Proposition 7.
This property allows O(|V | + |E|) evaluation of whether a graph is decomposable into nested simple sites, by arguments analogous to those for simple bubbles.

A Partial Taxonomy of Graph Motifs which do not Admit Decomposition into Sites
In section 4.3, we will show that we can decompose a graph into nested simple sites, as defined in the previous section, if it lacks a certain forbidden motif. We begin with three examples of graph features which prevent decomposition into nested sites, together with the sets of mutations which might have produced them.
1. Two (or more) substitutions or deletions against a linear sequence which overlap, but not completely.
2. A substitution (or deletion) which spans elements of sequence on the interior of two disjoint ultrabubbles. Addition of such an edge joining two ultrabubbles which were decomposable into nested simple sites will consolidate the two into a single ultrabubble which is not decomposable into nested simple sites.
3. Two SNVs or other simple elements of variation at adjacent positions. This will be the focus of our Section 5.

The Relationship Between Nested Simple Sites and Series Parallel Graphs
The structure of ultrabubbles decomposable into nested simple sites, and their tree representation (see Fig 7), may be familiar to graph theorists who have worked with series-parallel digraphs. The fact that the digraphs equivalent to such ultrabubbles form a subclass of the two-terminal series-parallel digraphs is interesting due to the computational properties of the latter class of graphs.
Definition 11. A directed graph G is two-terminal series parallel (TTSP) with source s and sink t if either
1. G is the two-element graph with a single directed edge s → t
2. (Parallel addition) There exist TTSP graphs G_1, G_2 with sources s_1, s_2 and sinks t_1, t_2 such that G is formed from G_1, G_2 by identification of s_1 with s_2 as s and identification of t_1 with t_2 as t
3. (Series addition) There exist TTSP graphs G_1, G_2 with sources s_1, s_2 and sinks t_1, t_2 such that G is formed from G_1, G_2 by identification of t_1 with s_2
Fig. 12. Top: parallel addition. Bottom: series addition
Two-terminal series parallel digraphs have a useful forbidden subgraph characterization.
Proposition 12 (From [12]). A directed graph G is two-terminal series parallel if and only if it contains no subgraph homeomorphic to the graph W shown below.
Proof. Refer to Valdes [12] and Duffin [13].
Proposition 13. Let U be an ultrabubble which is decomposable into nested simple sites. Then the equivalent directed graph D(U) is two-terminal series parallel.
Proof. Suppose that the directed graph D(U) equivalent to U (which exists by Lemma 1) contains a subgraph homeomorphic to W. Then there must be a node-side u in U with two neighbours a_1, a_2 which are the beginnings of disjoint paths p_1, p_2 ending on node-sides b_1, b_2 which are neighbours of a node-side v. By Proposition 10, u and v must be ultrabubble boundaries. Since p_1, p_2 are disjoint, u and v must be opposing boundaries of the same ultrabubble. But the presence of a subgraph homeomorphic to W also implies that there exists a pair q_1, q_2 of disjoint paths, one from a node x to û and the other from x to v, both not passing through u or v̂. But this is not possible since it would contradict 2-node separability of (u, v).
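Definition 11 builds TTSP graphs up by series and parallel addition; recognition can run the same operations in reverse. The sketch below uses the standard reduction approach for two-terminal series-parallel digraphs (it is not an algorithm given in this paper): parallel edges are merged and interior vertices with one incoming and one outgoing edge are contracted until nothing more can be reduced. The name `is_ttsp` is hypothetical, and the simple quadratic loop is chosen for clarity.

```python
def is_ttsp(edges, s, t):
    """Test whether a digraph with terminals s and t reduces to one edge s -> t.

    edges: iterable of ordered pairs (x, y), read as an edge from x to y.
    Storing edges in a set merges parallel edges automatically, which plays
    the role of the parallel reduction.
    """
    edge_set = set(edges)
    while True:
        preds, succs = {}, {}
        for x, y in edge_set:
            succs.setdefault(x, []).append(y)
            preds.setdefault(y, []).append(x)
        for v in list(preds):
            if v in (s, t):
                continue
            if len(preds[v]) == 1 and len(succs.get(v, [])) == 1:
                u, w = preds[v][0], succs[v][0]
                if u == v or w == v:
                    continue  # self-loop at v: not a series vertex
                # Series reduction: replace u -> v -> w by u -> w.
                edge_set -= {(u, v), (v, w)}
                edge_set.add((u, w))
                break
        else:
            # No further reduction applies.
            return edge_set == {(s, t)}
```

A diamond s → {a, b} → t reduces to the single edge s → t, whereas adding the cross edge a → b, which yields the shape of the forbidden graph W, leaves nothing to reduce and the test fails.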
We highlight the middle "Z-arm" of the W -motif in our first two examples of ultrabubbles which are not decomposable into nested simple sites.

Abutting Variants
We wish to decompose the graph structure of sets of variants lying at adjacent positions such that there is no conserved sequence between them able to form an ultrabubble boundary. We will define a graph motif called the balanced recombination bundle which corresponds to this graph structure, and can be rapidly detected.
We observe examples of abutting single nucleotide variants (SNVs) in the 1000 Genomes polymorphism data. It is a reasonable hypothesis that these should become more common as the population sizes of sequencing datasets increase, since, statistically, the distribution of variation across the genome should grow less sparse as the population increases.

Bundles
Definition 14. An internal chain n_1 → n_2 → ··· → n_k is a sequence of node-sides such that ∀i, 2 ≤ i ≤ k, n_i ∈ Nb(n_{i−1}).
Definition 15. We say that a tuple (L, R) of sets of node-sides is a bundle if
1. (Matching) ∀ℓ ∈ L, Nb(ℓ) ⊆ R and Nb(ℓ) ≠ ∅; ∀r ∈ R, Nb(r) ⊆ L and Nb(r) ≠ ∅
2. (Connectedness) ∀ℓ ∈ L, r ∈ R, there exists an internal chain ℓ → r_1 → ℓ_1 → ··· → r_k → ℓ_k → r such that ∀i, 1 ≤ i ≤ k, r_i ∈ R and ℓ_i ∈ L
Definition 16. We say that a tuple (L, R) of sets of node-sides is a balanced recombination bundle (R-bundle for short) if
1. (Complete matching) ∀ℓ ∈ L, Nb(ℓ) = R; ∀r ∈ R, Nb(r) = L
2. (Acyclicity) L ∩ R = ∅
Every balanced recombination bundle is a bundle.
Proof. Complete matching ⇒ matching. Complete matching ⇒ connectedness by the chain ℓ → r for all ℓ ∈ L, r ∈ R.
Definition 18. An unbalanced bundle is a bundle which is not a balanced recombination bundle. An unbalanced bundle is acyclic if L ∩ R = ∅.
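The two conditions of Definition 15 can be verified directly for a proposed pair (L, R). The sketch below is a straightforward, non-optimized check: matching is tested by set inclusion, and connectedness by searching along internal chains that alternate between the two sides. The neighbor map `nb` (node-side to set of neighbors) and the name `is_bundle` are assumptions made for the example.

```python
def is_bundle(left, right, nb):
    """Check matching and connectedness for the tuple (left, right).

    left, right: non-empty sets of node-sides.
    nb: map from each node-side to the set of its neighbors.
    """
    # Matching: neighbors are non-empty and lie on the other side.
    for l in left:
        if not nb[l] or not nb[l] <= right:
            return False
    for r in right:
        if not nb[r] or not nb[r] <= left:
            return False

    # Connectedness: from every l, alternating chains must reach all of right.
    for start in left:
        seen_left, seen_right = {start}, set()
        stack = [(start, True)]  # (node-side, currently playing the L role?)
        while stack:
            n, on_left = stack.pop()
            target = seen_right if on_left else seen_left
            for m in nb[n]:
                if m not in target:
                    target.add(m)
                    stack.append((m, not on_left))
        if seen_right != right:
            return False
    return True
```

Two parallel, unconnected edges satisfy matching but fail connectedness, so they form two bundles rather than one.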
We will describe an O(|V| + |E|) algorithm to detect and categorize bundles exhaustively for all node-sides in a bidirected graph. To establish the validity of this algorithm, we need several preliminary results.
Lemma 20. Every q ∈ N is either a tip or an element of a bundle.
Proof. Suppose that q is not a tip. Define a function W that maps a tuple (L, R) of nonempty sets of node-sides to the tuple (W(L), W(R)), where W(L) := L ∪ ⋃_{r ∈ R} Nb(r) and W(R) := R ∪ ⋃_{ℓ ∈ L} Nb(ℓ). Define W^∞((L, R)) := W^k((L, R)) for k such that W^{k+i}((L, R)) = W^k((L, R)) ∀i ∈ ℕ. W^∞ exists since W^n is nondecreasing with respect to set inclusion and our graphs are finite. Now define W(q) := W^∞(({q}, Nb(q))), noting that Nb(q) ≠ ∅ since q is not a tip. Let us write L^{W^∞} and R^{W^∞} for the respective elements of W(q). We claim that W(q) is a bundle.
Proof of matching: let ℓ ∈ L^{W^∞}, r ∈ R^{W^∞}. By construction of W, Nb(ℓ) ⊆ R^{W^∞} and Nb(r) ⊆ L^{W^∞}.
Proof of connectedness: let ℓ ∈ L^{W^∞}, r ∈ R^{W^∞}. We will show that for any r ∈ R^{W^∞}, there exists an internal chain q → r_1 → ℓ_1 → ··· → r_k → ℓ_k → r such that ∀i, 1 ≤ i ≤ k, r_i ∈ R^{W^∞} and ℓ_i ∈ L^{W^∞}. Suppose that r ∈ Nb(q); then we are done. Otherwise, since r ∈ R^{W^∞}, there exists some minimal n ∈ ℕ such that r belongs to the R-set R^{W^n} of W^n(({q}, Nb(q))). It is straightforward to see that we can then construct an internal chain q → r_0 → ℓ_1 → r_1 → ··· → r_k → ℓ_k → r.
Proposition 21. If q ∈ L for a bundle (L, R), then (L, R) = W(q).
Proof. Suppose that W(q) ≠ (L, R). Then either L ≠ L^{W^∞} or R ≠ R^{W^∞}. First, suppose the latter. Suppose that ∃r ∈ R such that r ∉ R^{W^∞}. Since (L, R) is a bundle, we know that there is an internal chain q → r_0 → ℓ_1 → r_1 → ··· → r_k → ℓ_k → r with all r_i ∈ R, ℓ_i ∈ L. But, using the same shorthand as before, it is also evident that r_i ∈ R^{W^i}, ℓ_i ∈ L^{W^i} ∀i, 1 ≤ i ≤ k. But since ℓ_k ∈ Nb(r), we can deduce that r ∈ R^{W^{k+1}}, which leads to a contradiction since r ∉ R^{W^∞}. Suppose otherwise that ∃r ∈ R^{W^∞} such that r ∉ R. Consider an internal chain c = q → r_0 → ℓ_1 → r_1 → ··· → r_k → ℓ_k → r fulfilling the conditions needed to prove connectedness of W(q). Note that q ∈ L and by matching r_0 ∈ Nb(q). But r ∉ R, which leads to a contradiction since it means that there must exist two consecutive members somewhere in the chain c which cannot be neighbors.
We say that a node-side n is involved in a bundle (L, R) if n ∈ L or n ∈ R.
Corollary 22 (To Proposition 21). Every non-tip node-side is involved in precisely one bundle.

An Algorithm for Bundle-Finding
The diagram in Fig 16 demonstrates our algorithm for finding the balanced recombination bundle containing a query node-side q if it is contained in one, and discovering that it is not if it is not. The algorithm is written in pseudocode below, with an illustration following.
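The following Python sketch reconstructs the test from the steps named in the proof of Proposition 24 (take Nb(q) as the candidate R, take the neighbors of one of its elements as the candidate L, then compare neighbor sets in two loops and check that the candidates are disjoint). It should be read as an illustration under those assumptions, not as the authors' pseudocode; the name `balanced_bundle` and the neighbor map `nb` are hypothetical, and whole-set comparisons are used for brevity where a per-edge membership query would give the stated running time.

```python
def balanced_bundle(q, nb):
    """Return (L, R) if q lies in a balanced recombination bundle, else None.

    q: query node-side, placed on the L side without loss of generality.
    nb: map from each node-side to the set of its neighbors.
    """
    right = nb[q]
    if not right:
        return None  # q is a tip
    # Under complete matching, any element of R has all of L as neighbors.
    left = nb[next(iter(right))]

    # First loop: every r in R must see the same neighbor set L.
    if any(nb[r] != left for r in right):
        return None
    # Second loop: every l in L must see exactly R.
    if any(nb[l] != right for l in left):
        return None
    # Acyclicity: the two sides must not share a node-side.
    if left & right:
        return None
    return left, right
```

A hairpin edge joining a node-side to itself yields L = R and is rejected by the final disjointness check.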
In order to prove that this is a valid algorithm for detection of balanced recombination bundles, we need the following lemma.
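Lemma 23. Let (L, R) be a tuple of sets of node-sides. If ∃q ∈ L such that ∀a ∈ Nb(q), ∀b ∈ Nb(a), Nb(b) ⊆ Nb(q) but Nb(q) ⊂ R, then (L, R) cannot be connected (in the sense of Definition 15).
Proof. Let r ∈ R\Nb(q). Under the hypothesis, every internal chain which begins at q and alternates between R and L visits only node-sides of Nb(q) on its R side, so no sequence of node-sides c is both a valid internal chain and ends with r. Therefore (L, R) cannot be connected.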
Proposition 24 (Validity of Algorithm 1). This algorithm detects all balanced recombination bundles, and rejects all unbalanced bundles.
Proof. Suppose q is involved in a balanced recombination bundle (L, R). W.l.o.g. suppose that q ∈ L. Due to the complete matching property, the set Nb(q) in the algorithm is guaranteed to be equal to R. Again by complete matching, the set Nb(R[0]) in the algorithm is guaranteed to be equal to L. It is evident that the algorithm directly verifies complete matching and acyclicity.
Suppose otherwise. Assuming we have eliminated all tips, which can be done in O(|V|) time, Lemma 20 proves that q is involved in an unbalanced bundle B. If B fails acyclicity but not complete matching, then checking that A ∩ B = ∅ will correctly detect that L ∩ R ≠ ∅.
Otherwise, suppose that B fails complete matching. Suppose first that Nb(q) ⊂ R. We assert that ∃a ∈ Nb(q) such that ∃b ∈ Nb(a) such that ∃c ∈ Nb(b) such that c ∉ Nb(q). This follows from the connectedness of B and Lemma 23. This event will be detected by the second loop of the algorithm.
Suppose otherwise that Nb(q) = R but ∃r ∈ R such that Nb(r) ⊂ L. Let c ∈ L\Nb(r). By matching, ∃r′ ∈ R such that r′ ∈ Nb(c). Therefore Nb(r) and Nb(r′) will be found to be unequal in the first loop of the algorithm.
Suppose otherwise that Nb(q) = R, Nb(r) = L ∀r ∈ R, but ∃ℓ ∈ L such that Nb(ℓ) ⊂ R. Then we will find in the second loop that Nb(ℓ) ≠ Nb(q).
Proposition 25 (Speed of Algorithm 1). We can identify all balanced recombination bundles, all unbalanced bundles and all tips in O(|E| + |V|) time.
Proof. We begin by looping over all node-sides and identifying all tips, which is achieved in O(|V|) time. We then loop again over all remaining node-sides. At each node-side q, we run the function described above, which, if q is involved in a balanced recombination bundle, will return the bundle B = W(q). It is evident that this function runs in O(|E_B|) time, seeing as it loops over each edge of B twice, once from each side, each time making an O(1) set inclusion query. After B is built, all nodes are marked such that they are skipped when they are encountered in the global loop. This gives overall O(|E_B| + |V_B|) exploration of B.
If q is involved in an unbalanced bundle B = W(q), this fact is detected by the same function in O(|E_B|) time. In this case, we can find all nodes of B by performing a breadth-first search. Examination of the W-function will convince the reader that a breadth-first search will find all node-sides of B in O(|E_B| + |V_B|) time. We follow the same procedure of marking all these node-sides to be skipped in the global loop.
This proves that, after eliminating tips in O(|V|) time, we can build the set ℬ of all non-isomorphic bundles B, and decide whether they are balanced recombination bundles, in time proportional to Σ_{B ∈ ℬ} (|E_B| + |V_B|). But Lemma 20 and Corollary 22 tell us that V = {v : v is a tip} ∪ ⋃_{B ∈ ℬ} V_B, and that all elements of this union of node-sides are disjoint. Furthermore, due to the matching property of bundles, E = ⋃_{B ∈ ℬ} E_B, and all elements of this union of edges are disjoint. Therefore, our method is O(|V| + |E|).
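The global loop of this argument can be sketched as a single pass that marks node-sides as it goes. The version below is an illustration, not the paper's algorithm: it uses one breadth-first search per bundle for both the balanced and unbalanced cases, and it decides balance by comparing neighbor counts, which is equivalent to complete matching once matching holds by construction. The name `classify_node_sides` and the neighbor map `nb` are assumptions.

```python
from collections import deque


def classify_node_sides(nb):
    """Partition node-sides into tips and bundles.

    nb: map from each node-side to the set of its neighbors.
    Returns (tips, bundles), where bundles is a list of (L, R, is_balanced)
    with the first unmarked node-side encountered placed on the L side.
    """
    tips = {n for n, neighbors in nb.items() if not neighbors}
    marked = set(tips)
    bundles = []
    for q in nb:
        if q in marked:
            continue  # already assigned to an earlier bundle
        left, right = {q}, set()
        visited = {(q, True)}  # (node-side, playing the L role?)
        queue = deque([(q, True)])
        while queue:
            n, on_left = queue.popleft()
            for m in nb[n]:
                state = (m, not on_left)
                if state not in visited:
                    visited.add(state)
                    (right if on_left else left).add(m)
                    queue.append(state)
        marked |= left | right
        # Every l already satisfies Nb(l) ⊆ R, so equal sizes force Nb(l) = R,
        # and then every r has all of L as neighbors.
        balanced = not (left & right) and all(
            len(nb[l]) == len(right) for l in left
        )
        bundles.append((left, right, balanced))
    return tips, bundles
```

Each node-side is visited in at most two roles and each edge is examined a bounded number of times, so the pass stays linear in |V| + |E|.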

Bundles and Snarl Boundaries
Definition 26. Given a "boundary" node-side b = s or t of a snarl (S, s, t), we call the tuple ({b}, Nb(b)) a snarl comb. A snarl comb is called proper if ∀n ∈ Nb(b), Nb(n) = {b}.
It is easy to verify that a proper snarl comb is a balanced recombination bundle. It is also easy to see that an improper snarl comb is, according to set inclusion of tuples, a proper subset of a unique bundle.
Proposition 27 (Bundles do not cross snarl boundaries). Let (S, s, t) be a snarl. Suppose that B = (L, R) is a bundle. Then either all node-sides involved in B are members of S, or no node-side involved in B is a member of S.
Proof. Suppose that there exists a bundle B = (L, R) with node-sides both within S and not within S. Let x, y be involved in B, with x ∈ S, y ∉ S. W.l.o.g., suppose x ∈ L, y ∈ R. This implies that there exists an internal chain p = x → ··· → y. But then this implies that there exist a ∈ S, b ∉ S such that a ∈ Nb(b), which would allow us to use the edge a → b to create a path violating the 2-node separability of S.
Definition 29. If two recombination bundles are compatible, we define the set p(x) to be a bundled simple site P .
Claim. Consider a bundled simple site P in a graph G, lying between compatible balanced recombination bundles B_1, B_2. The set of all paths in G which contain paths p ∈ P as contiguous subpaths is isomorphic to the set-theoretic product P(L_1) × P × P(R_2) consisting of the three sets
1. P(L_1) := {paths in G\S terminating in x, for some x ∈ L_1}
2. P
3. P(R_2) := {paths in G\S beginning with y, for some y ∈ R_2}
under the function mapping p_1 ∈ P(L_1), p ∈ P, p_2 ∈ P(R_2) to their concatenation.
We will call a balanced recombination bundle B = (L, R) trivial if both L and R are singleton sets.
An ultrabubble (U, s, t) is decomposable into nested generalized sites if either
1. It is a generalized simple bubble
2. When each ultrabubble (V, u, v) nested in U which is decomposable into nested generalized sites is replaced with a single edge spanning u and v, U becomes a generalized simple bubble
We sketch a linear-time method of building sites from a tree diagram of nested ultrabubbles. We run Algorithms 2 and 3 starting at bottom-level nested ultrabubbles. If an ultrabubble has all of its nontrivial balanced recombination bundles paired, then, when we evaluate the ultrabubble containing it, we represent it as a single edge from its source to sink.
In Algorithm 3, which follows below, we refer to the individual sets of node-sides forming the tuples (L, R) of a bundle as bundle-sides.

Bundles Containing Deletions
Our bundles, and therefore our sites, fail to detect the graph motifs formed by deletions spanning otherwise well-behaved variants. We define a special, well-behaved subclass of unbalanced bundle to address this.

Fig. 18. Two examples of deletion bundle-pairs
These structures occur when two balanced recombination bundles on either side of some span of graph are bridged by deletions. It remains necessary to check that there is graph structure joining the nodes of R_A to L_B for this to be the case.
Algorithm 4 below will detect deletion bundle pairs from among the set of unbalanced bundles in linear time.
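The neighbor-set conditions that the proof of Proposition 33 checks for a deletion bundle pair (L_A, R_A, L_B, R_B) can be written as a direct verifier. The sketch below only confirms a proposed four-tuple; it does not search for one as Algorithm 4 does. The conditions are read off from that proof, pairwise disjointness of the four sets is an added assumption reflecting the acyclic setting, and the name `is_deletion_bundle_pair` is hypothetical.

```python
def is_deletion_bundle_pair(l_a, r_a, l_b, r_b, nb):
    """Verify the neighbor-set pattern of a deletion bundle pair.

    l_a, r_a, l_b, r_b: non-empty sets of node-sides.
    nb: map from each node-side to the set of its neighbors.
    """
    sides = (l_a, r_a, l_b, r_b)
    if not all(sides):
        return False
    # Assumption: the four bundle-sides share no node-side.
    if any(x & y for i, x in enumerate(sides) for y in sides[i + 1:]):
        return False
    return (
        all(nb[n] == r_a | r_b for n in l_a)      # L_A sees R_A and, via deletions, R_B
        and all(nb[n] == r_b for n in l_b)        # L_B sees only R_B
        and all(nb[n] == l_a for n in r_a)        # R_A sees only L_A
        and all(nb[n] == l_a | l_b for n in r_b)  # R_B sees L_B and, via deletions, L_A
    )
```

In the smallest case, a single deletion edge bypasses one spanned region: the node-side u on L_A neighbors both v1 on R_A and w on R_B, while v2 on L_B neighbors only w.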
Proposition 33. Given a set of acyclic unbalanced bundles, this algorithm finds those among them which are deletion bundle pairs.
Proof. Suppose that q is involved in a deletion bundle pair (L A , R A , L B , R B ). W.l.o.g, either q ∈ L A or q ∈ L B .
Suppose first that q ∈ L_B. In this case, Nb(q) = R_B. We then know that ∀a ∈ R_B, Nb(a) = L_A ∪ L_B. This will trigger the condition L_2 = ∅. The elements a ∈ L_1 will segregate into precisely two groups: one such that Nb(a) = R_B (the elements a ∈ L_B), and another such that Nb(a) = R_A ∪ R_B (the elements a ∈ L_A). If these conditions are fulfilled, we then build R_A and R_B. It remains to verify that ∀b ∈ R_A, Nb(b) = L_A, and ∀b ∈ R_B, Nb(b) = L_A ∪ L_B.
Suppose otherwise that q ∈ L_A. In this case, Nb(q) = R_A ∪ R_B. This will trigger the condition L_2 = ∅ since the elements b ∈ Nb(q) will segregate into two groups:
Suppose otherwise that q is not involved in a deletion bundle pair. Suppose that Algorithm 4 does not fail, returning ∅. There are two possibilities then for the nature of the unbalanced bundle (L, R) for which q ∈ L.
First, suppose the condition L_2 = ∅ was triggered. Then ∃q ∈ L such that, where L′ := {ℓ ∈ L | ℓ ∈ Nb(a) for some a ∈ Nb(q)}, Nb(ℓ) ⊆ Nb(q) ∀ℓ ∈ L′. Then by Lemma 23, Nb(ℓ) ≥ Nb(q) ∀ℓ ∈ L. Therefore it must be that Nb(q) = R. Furthermore, to pass the search for R_A, there must ∃R_A such that if ℓ ∈ L and Nb(ℓ) ≠ R, then Nb(ℓ) = R_A. Furthermore, to pass the conditions of the subsequent two loops, it must be that ∀r ∈ R\R_A, all Nb(r) are the same, and ∀r′ ∈ R_A, all Nb(r′) are the same. Furthermore, to pass the last condition checked, it must be that Nb(r′) from the latter group ⊂ Nb(r). And since L
Algorithm 4: Deletion bundle pair finding
Data: Node-side q known to be in an acyclic unbalanced bundle
Result: Deletion bundle-pair containing q if it is in a deletion bundle-pair, ∅ otherwise
Proof. Note that a deletion bundle-pair is a special type of unbalanced bundle. Therefore, if, given an unbalanced bundle B, we can check whether it is a deletion bundle-pair in O(|E_B| + |V_B|) time, by the arguments of Proposition 21, we can find all deletion bundle-pairs in O(|E| + |V|) time.
Inspection of the algorithm shows that, like the algorithm for identifying balanced recombination bundles, it performs two O(1) set-inclusion queries per edge, making it O(|E B |) overall.

Discussion
Graph formalism has the potential to revolutionize the discourse on genetic variations by creating a model and lexicon that more fully embraces the complexity of sequence change. This is vital: the current linear genome model of a reference sequence interval and alternates is insufficient. It fails to express nested variation and cannot properly describe information about the breakpoints that comprise structural variations.
The introduction, in order, of bubbles, superbubbles, ultrabubbles and snarls progressively generalizes the concept of a genetic site to accommodate more general types of variation using progressively more general graph types. In this paper we both review and build on these developments, showing how the recently introduced ultrabubbles can be further subclassified using concepts from circuit theory. This expands the simple notion of proper nesting described in the original ultrabubble paper. Furthermore, we describe how we can extend the theory of ultrabubbles by generalizing ultrabubble boundaries to another sort of boundary structure, the bundle, which allows us to describe regions where variants are packed too closely to be segregated into separate ultrabubbles.
Our methods are powerful in decomposing dense collections of nested or closely packed variation into meaningful genetic sites. We anticipate that these structures will become increasingly common in the analysis of variation using graph methods, as sequencing datasets containing variation from increasing numbers of individuals become available.