## Abstract

**Motivation** Pangenome graphs provide a complete representation of the mutual alignment of collections of genomes. These models offer the opportunity to study the entire genomic diversity of a population, including structurally complex regions. Nevertheless, analyzing hundreds of gigabase-scale genomes using pangenome graphs is difficult, as existing tools do not support it well. Hence, fast and versatile software is required to ask advanced questions of such data efficiently.

**Results** We wrote ODGI, a novel suite of tools that implements scalable algorithms and has an efficient in-memory representation of DNA variation graphs. ODGI includes tools for detecting complex regions, extracting *loci*, removing artifacts, exploratory analysis, manipulation, validation, and visualization. Its fast parallel execution facilitates routine pangenomic tasks, as well as pipelines that can quickly answer complex biological questions of gigabase-scale pangenome graphs.

**Availability** ODGI is published as free software under the MIT open source license. Source code can be downloaded from https://github.com/pangenome/odgi and documentation is available at https://odgi.readthedocs.io. ODGI can be installed via Bioconda https://bioconda.github.io/recipes/odgi/README.html or GNU Guix https://github.com/ekg/guix-genomics/blob/master/odgi.scm.

**Contact** egarris5{at}uthsc.edu

## 1 Introduction

A pangenome models the full set of genomic elements in a given species or clade (Consortium, 2018; Eizenga *et al*., 2020b). In contrast to reference-based approaches, which relate sequences to a single genome, these data structures encode the mutual relationships between all the genomes represented. In pangenome graphs (Paten *et al*., 2017), homologous regions between genomes are compressed into a single representative of all alleles present in the pangenome. These flexible models let us encode any kind of variation, allowing the generation of comprehensive data systems that form the basis for analyses of genome evolution. Although these data structures are of utility to researchers (Consortium, 2018; Garrison *et al*., 2018; Baaijens *et al*., 2019; Hickey *et al*., 2020; Sibbesen *et al*., 2021), the scientific community still lacks a toolset specifically focused on graph manipulation and interrogation.

The Human Pangenome Reference Consortium (HPRC) and Telomere-to-Telomere (T2T) consortium (Miga *et al*., 2020; Logsdon *et al*., 2021; Nurk *et al*., 2021) have recently demonstrated that high-quality *de novo* assemblies can be routinely generated from third-generation long read sequencing data. We anticipate that *de novo* assemblies of similar quality will become common, leading to demand for methods that allow us to create and explore pangenomes.

Here, we present the Optimized Dynamic Genome/Graph Implementation (ODGI) toolkit, a pangenome graph interrogation and transformation system specifically implemented to handle the data scales encountered when working with pangenomes built from hundreds of haplotype-resolved genomes. ODGI provides a set of standard operations on the variation graph data model, generalizing “genome arithmetic” concepts like those found in BEDTools (Quinlan and Hall, 2010) to work on pangenome graphs, and providing a variety of operations, such as visualization, sorting, and liftover projections, all critical to understand and exploit pangenome graphs.

## 2 Model

A pangenome graph is a sequence model that encodes the mutual alignment of many genomes (Garrison, 2019; Eizenga *et al*., 2020b). In the variation graph, *V* = (*N, E, P*), nodes *N* = *n*_{1} … *n*_{|N|} contain sequences of DNA. Each node *n*_{i} has an identifier *i* and an implicit reverse complement *n̄*_{i}, and a node strand *s* corresponds to one such orientation. Edges *E* = *e*_{1} … *e*_{|E|} represent ordered pairs of node strands: *e*_{i} = (*s*_{a}, *s*_{b}). Paths *P* = *p*_{1} … *p*_{|P|} describe walks over node strands: *p*_{i} = *s*_{1} … *s*_{|*p*_{i}|}. When used as a pangenome graph, *V* expresses sequences, haplotypes, contigs, and annotations as paths. By containing both the sequences and information about their relative variations, the variation graph provides a complete and powerful foundation for many bioinformatic applications.
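As an illustration of this model, the following minimal Python sketch encodes a toy variation graph with four nodes and three paths. The names `Strand`, `reverse_complement`, and `path_sequence` are hypothetical and chosen for this example only; they are not part of ODGI.

```python
from typing import NamedTuple

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

class Strand(NamedTuple):
    """A node strand: a node identifier plus an orientation."""
    node_id: int
    is_reverse: bool = False

# N: node identifier -> forward-strand sequence.
nodes = {1: "GATT", 2: "A", 3: "C", 4: "ACA"}

# E: ordered pairs of node strands (a small bubble between nodes 1 and 4).
edges = {
    (Strand(1), Strand(2)), (Strand(1), Strand(3)),
    (Strand(2), Strand(4)), (Strand(3), Strand(4)),
}

# P: named walks over node strands.
paths = {
    "hap1": [Strand(1), Strand(2), Strand(4)],
    "hap2": [Strand(1), Strand(3), Strand(4)],
    # The same molecule as hap1, read from the opposite strand.
    "hap1_rc": [Strand(4, True), Strand(2, True), Strand(1, True)],
}

def strand_sequence(s: Strand) -> str:
    """Sequence of a node in the orientation given by the strand."""
    seq = nodes[s.node_id]
    return reverse_complement(seq) if s.is_reverse else seq

def path_sequence(name: str) -> str:
    """Spell out a path by concatenating the sequences of its strands."""
    return "".join(strand_sequence(s) for s in paths[name])
```

Walking `hap1_rc` spells the reverse complement of `hap1`, which is why a single node can serve both strands of the DNA.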

Pangenome graphs can be constructed by multiple sequence alignment (Lee *et al*., 2002; Grasso and Lee, 2004) or by transitively reducing an alignment between sequences to an equivalent, labeled sequence graph (Kehr *et al*., 2014; Garrison, 2019). Current methods to build these graphs are still under active development (Li *et al*., 2020; Armstrong *et al*., 2020; Garrison *et al*., 2021), but they have largely settled on a common data model, represented in the Graphical Fragment Assembly (GFA) format (GFA Working Group, 2016). This standardization supports the development of a reference set of tools that operate on the pangenome graph model. Such an effort began with the VG toolkit (Garrison *et al*., 2018). Here we refocus it with ODGI, a compatible, but independent set of algorithms focused on visualization, interrogation, and transformation of pangenome graphs.

## 3 Implementation

The ODGI toolkit builds on existing approaches to efficiently store and manipulate variation graphs (Garrison *et al*., 2018). Similar to other efficient libraries presenting the HandleGraph model (Eizenga *et al*., 2020a), the implementation of ODGI’s tools rests on three key properties which hold for most pangenome variation graphs:

1. They are relatively sparse, with low average node degree.

2. They can be sorted so that most edges go between nodes that are close together in the sort order.

3. Their embedded paths are locally similar to each other.

These properties are used to build efficient dynamic variation graph data structures (Siren *et al*., 2020; Eizenga *et al*., 2020a). Sparsity (1) allows us to encode edges *E* using adjacency lists rather than matrices or hash tables. The local linear structure of the graph (2) lets us assign node identifiers that increase along the linear components of the graph, which supports a compact storage of edges and path steps as relative (usually small) differences rather than absolute (always large) integer identifiers. Path similarity (3) allows us to write local compressors that reduce the storage cost of collections of path steps.

ODGI improves on prior efforts, based on issues that arose during our work with high-quality *de novo* assemblies that cover almost all parts of the human genome (Logsdon *et al*., 2021; Nurk *et al*., 2021). In particular, we find that it is necessary to support graphs with regions of very high numbers of path traversals (high path depth). Such motifs can occur in collapsed structures generated by ambiguous sequence homology relationships in repeats found in the centromeres and other segmental duplications. If we cannot process such regions, there are only two options: 1) remove such regions, or 2) leave them unaligned. However, neither of these solutions allows us to investigate their biological features. To seamlessly represent such difficult regions, we followed an approach implemented in the dynamic version of the Graph BWT (GBWT) (Siren *et al*., 2020) and built a node-centric, dynamic, compressed model of the paths. This design supports node-local modification and update of the graph, which lets us operate on paths in parallel.

We store the graph in a vector of node structures, each of which presents a node-local view of the graph sequence, topology, and path layout. Expressed in terms of the variation graph *V*, ODGI’s core *Node* structure includes a decoder that maps the neighbors of each node to a dense range of integers. For a given *Node*_{i} and neighbor *Node*_{j}, the decoder itself does not store the *id* of *Node*_{j}, but rather a compact representation of the relative difference between the node ids: *δ* = *Node*_{i}.*id* − *Node*_{j}.*id*. This keeps the size of the encoding small, per common variation graph property (2). We define the edges and path steps traversing the node in terms of this alphabet of *δ*’s. The structures in Algorithm 1 describe our encoding.
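The idea of a per-node decoder can be sketched in a few lines of Python. This is a simplified illustration, not ODGI's implementation: the class name `DeltaDecoder` and its methods are hypothetical, and a plain list stands in for the succinct integer vector used in practice.

```python
class DeltaDecoder:
    """Maps the neighbors of one node to a dense range of small integers."""

    def __init__(self, node_id: int):
        self.node_id = node_id
        self.deltas = []  # dense index -> relative id difference (delta)

    def encode(self, neighbor_id: int) -> int:
        """Return the dense index of a neighbor, registering it if new."""
        delta = self.node_id - neighbor_id  # delta = Node_i.id - Node_j.id
        if delta not in self.deltas:
            self.deltas.append(delta)
        return self.deltas.index(delta)

    def decode(self, index: int) -> int:
        """Recover the absolute neighbor id from its dense index."""
        return self.node_id - self.deltas[index]

# In a well-sorted graph, neighbors have nearby ids, so the stored deltas
# are small even when the absolute ids are large.
decoder = DeltaDecoder(1_000_000)
next_idx = decoder.encode(1_000_001)
prev_idx = decoder.encode(999_998)
```

Edges and path steps on the node can then refer to `next_idx` and `prev_idx` (here 0 and 1) instead of seven-digit identifiers.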

**Algorithm 1.** ODGI’s relatively packed *Node* structure and the *Step* structure used to represent the paths as doubly-linked lists.

Each structure contains the sequence of the node (*Node*_{i}.*sequence*), its edges in both directions (*Node*_{i}.*edges*), and a vector of path steps that describes the previous and next steps in paths that walk across the node (*Node*_{i}.*path*_*steps*). For efficiency, *Node*_{i}.*sequence* is stored as a plain string, while the *edges* and *path*_*steps* are stored using a dynamic succinct integer vector that requires *O*(2*nw*) bits for the edges and *O*(5*nw*) bits for the path steps, where *n* is the number of steps on the node and *w* is ≈ *log*_{2}(*n*) (Prezza, 2017).

To allow edit operations in parallel, each node structure includes a byte-width mutex *lock*. All changes on the graph can involve at most two *Node* structs at a time (both edge and path step representations are doubly-linked). To avoid deadlocks, we acquire the node locks in ascending *Node*.*id* order and release them in descending order. In addition to node-local features of the graph, we must maintain some global information. Specifically, we record the start and end of paths, as well as a name to path id mapping, in lock-free hash tables. The use of lock-free hash tables lets us avoid a global lock when looking up path or graph metadata, which would quickly become a bottleneck during parallel operations on the graph. By avoiding global locks, we implement many of the operations in ODGI using the maximum available parallelism. This approach is key to enabling our methods to scale to the largest pangenome graphs that we can currently build (with hundreds of vertebrate genomes).
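The lock-ordering discipline described above can be illustrated with a small Python sketch. It assumes a simplified `Node` class and a hypothetical `link_step` helper, and it uses `threading.Lock` in place of the byte-width mutex; it shows only the ordering rule, not ODGI's actual data layout.

```python
import threading

class Node:
    def __init__(self, node_id: int):
        self.id = node_id
        self.lock = threading.Lock()
        self.path_steps = []  # (path name, id of the linked node)

def link_step(a: Node, b: Node, path_name: str) -> None:
    """Record one path step between a and b, touching exactly two nodes."""
    first, second = (a, b) if a.id <= b.id else (b, a)
    if first is second:  # self-loop: only one lock to take
        with first.lock:
            a.path_steps.append((path_name, b.id))
        return
    first.lock.acquire()       # acquire in ascending id order
    second.lock.acquire()
    try:
        a.path_steps.append((path_name, b.id))
        b.path_steps.append((path_name, a.id))
    finally:
        second.lock.release()  # release in descending id order
        first.lock.release()

nodes = [Node(i) for i in range(2)]

def worker(name: str) -> None:
    # Threads request the same pair in both directions; without a fixed
    # acquisition order this pattern could deadlock.
    for _ in range(1000):
        link_step(nodes[0], nodes[1], name)
        link_step(nodes[1], nodes[0], name)

threads = [threading.Thread(target=worker, args=(f"path{t}",)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every thread takes the lower-id lock first, no two threads can each hold one lock while waiting for the other.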

## 4 Results

ODGI provides a set of interrogative and manipulative operations on pangenome graphs. We have established these tools to support our exploration of graphs built from hundreds of large eukaryotic genomes. ODGI’s tools are practical and able to work with high levels of graph complexity, even with regions where paths present very high depth nodes (10^{5} to 10^{6}-fold depth).

ODGI covers common operations that we have found to be essential when working with complex pangenome graphs:

- *odgi build* constructs the ODGI data model from a GFA file (§4.1).
- *odgi view* converts the ODGI data model into a GFA file (§4.1).
- *odgi viz* provides a linear visualization of the graph (§4.2).
- *odgi draw* renders a 2D image of the graph (§4.2).
- *odgi extract* excerpts subsets of the graph based on path ranges (§4.3).
- *odgi explode* breaks the graph into connected components (§4.3).
- *odgi squeeze* unifies disjoint graphs (§4.3).
- *odgi chop* breaks long nodes into shorter ones (§4.4).
- *odgi unchop* combines unitig nodes (§4.4).
- *odgi break* removes cycles in the graph (§4.4).
- *odgi prune* removes complex regions (§4.4).
- *odgi groom* resolves spurious inverting links (§4.4).
- *odgi position* lifts coordinates between path and graph positions (§4.5).
- *odgi untangle* deconvolutes paths relative to a reference (§4.5).
- *odgi tips* finds path end points relative to a reference (§4.5).
- *odgi sort* orders the graph nodes (§4.6).
- *odgi layout* establishes a 2D layout (§4.6).
- *odgi matrix* derives the pangenome matrix (§4.7).
- *odgi paths* lists and extracts paths in FASTA (§4.7).
- *odgi flatten* converts the graph to FASTA and BED (§4.7).
- *odgi stats* provides numerical properties of the graph (§4.7).
- *odgi bin* generates a summarized view of the graph (§4.7).
- *odgi depth* describes node depth over graph and path positions (§4.7).
- *odgi degree* describes node degree over graph and path positions (§4.7).

Each tool focuses on a small set of related operations. Most read or write the native ODGI format (‘og’ extension) (Figure 1) and work with standard text-based data formats common to bioinformatics. This supports the implementation of flexible and composable graph processing pipelines based on graphs (GFA/ODGI) and standard bioinformatic data types representing positions, genomic ranges (BED), and pairwise mappings (PAF). We use variation graph paths to provide a universal coordinate system, representing annotations and pairwise sequence relationships using the paths as reference and query sequences. Thus, ODGI provides a set of interfaces that let us approach these graphs from the perspective of standard reference- and sequence-based data models. Indeed, by considering all paths in the graph as potential reference or query sequences, we make graphs invisible to downstream tools that operate on collections of sequences or rely on a reference sequence (*e*.*g*. SAMtools (Li *et al*., 2009)), enabling interoperability. This approach benefits from the information in the graph without strongly embedding our methods in this difficult new research context.

### 4.1 Building the ODGI model

ODGI maintains its own efficient binary format for storing graphs on disk. We begin by transforming the storage model of the standard GFAv1 (GFA Working Group, 2016) format (in which nodes, edges, and paths are described independently) into the ODGI node-centric encoding with *odgi build*. This construction step can be a significant bottleneck, in particular as the size of the path set of the graph increases.

The ODGI data structure (Algorithm 1) allows algorithms that build and modify the graph to operate in parallel, without any global locks. In *odgi build*, we initially construct the node vector in a serial operation that scans across the input GFA file. Then, we serially add edges in the *Node*.*edges* vectors of pairs of nodes. Finally, we create paths in serial, and extend them in parallel by obtaining the mutex *Node*.*lock* for pairs of nodes and by adding the path step in their *Node*.*path*_*steps* vectors. This parallelism speeds ODGI model construction by many-fold when testing against graphs made from assemblies produced by the HPRC (Figure 2).

To support interchange with other pangenome tools or text-based processing, *odgi view* converts a graph in ODGI binary format to GFAv1.
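The change of storage model behind *odgi build* can be illustrated with a short Python sketch that reads GFAv1 `S`, `L`, and `P` records and regroups them by node. This is a serial toy version under stated assumptions: segment names are integers, overlaps and optional tags are ignored, and the function names `parse_gfa` and `node_centric` are hypothetical.

```python
GFA = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tGATT",
    "S\t2\tA",
    "S\t3\tC",
    "S\t4\tACA",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
    "P\thap1\t1+,2+,4+\t*",
    "P\thap2\t1+,3+,4+\t*",
])

def parse_gfa(text):
    """Read segments, links, and paths as three independent collections."""
    nodes, edges, paths = {}, [], {}
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":    # S <name> <sequence>
            nodes[int(fields[1])] = fields[2]
        elif fields[0] == "L":  # L <from> <orient> <to> <orient> <overlap>
            edges.append(((int(fields[1]), fields[2]),
                          (int(fields[3]), fields[4])))
        elif fields[0] == "P":  # P <name> <steps> <overlaps>
            paths[fields[1]] = [(int(s[:-1]), s[-1])
                                for s in fields[2].split(",")]
    return nodes, edges, paths

def node_centric(nodes, edges, paths):
    """Regroup the independent records so each node holds its own view."""
    view = {i: {"sequence": seq, "edges": [], "path_steps": []}
            for i, seq in nodes.items()}
    for a, b in edges:
        view[a[0]]["edges"].append((a, b))
        if b[0] != a[0]:
            view[b[0]]["edges"].append((a, b))
    for name, steps in paths.items():
        for rank, (node_id, orient) in enumerate(steps):
            view[node_id]["path_steps"].append((name, rank, orient))
    return view

view = node_centric(*parse_gfa(GFA))
```

After regrouping, everything needed to edit node 4 (its sequence, its edges, and the steps of both paths crossing it) is found in `view[4]`, which is what makes node-local updates possible.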

### 4.2 Visualizing pangenome graphs

Pangenome graph visualization is one of the first steps to gain insight into the mutual relationship between the sequences in the graph and their variation. We pursue a novel approach to visualization with *odgi viz* and *odgi draw*, two tools which provide scalable ways of generating pictures of the high-level structure of large pangenome graphs.

*odgi viz* supports a binned, linearized rendering in 1 dimension (1D) (that is, all graph nodes lie on the same axis). This visualization is computed in linear time and offers a human-interpretable format suitable for understanding the topology and genome relationships in the pangenome graph (Fig. 4). Graph nodes are arranged on a single axis, from left to right, with the colored bar indicating the paths and the nodes they cross. White spaces indicate where paths do not traverse the nodes. The meaning of the colors depends on how *odgi viz* is executed. By default, path colors are derived from a hash of the path name (Fig. 4**b**). Path names are displayed on the left of the paths. The black lines on the bottom indicate the edges connecting the nodes and, therefore, represent the graph topology.

Nevertheless, complex, nonlinear graph structures are difficult to display and interpret in a low number of dimensions. To overcome such a limitation, *odgi viz* supports multiple visualization modalities (Fig. 4**c-e**), making it easy to grasp the properties and shape of the graph. Graph node order can affect downstream analyses on pangenome graphs. With *odgi viz* we can color the paths by path position (Fig. 4**c**), with light grey indicating where paths begin and dark grey where they end. This visualization is suitable for understanding graph node order, as smooth color gradients indicate that the graph order respects the linear paths’ coordinate systems. Pangenome graphs represent both strands of DNA sequences. *odgi viz* also supports coloring the paths by orientation, with paths colored where their sequence is reverse-complemented (red) or in direct orientation (black) with respect to the sequences of the graph nodes (Fig. 4**d**). Eukaryotic genomes experience gains and losses of genetic material, resulting in copy number variation (CNV) across the population. With *odgi viz*, we can use multiple color palettes to color the paths by path depth, highlighting the different copy number statuses in the genomes represented in the pangenome graph (Fig. 4**e**).

*odgi draw* extends the visualization in 2 dimensions (2D) (Fig. 4**a**) by rendering the layout built by *odgi layout* (§4.6). A 2D rendering is more costly to compute, but we similarly provide an implementation that scales linearly with pangenome sequence size, allowing us to apply it to large pangenome graphs.

### 4.3 Extracting or joining regions of interest

Pangenome graphs built from hundreds of haplotype-resolved *de novo* genome assemblies are very large, but it is often necessary to work with only a small portion of the genomes represented, such as a specific *locus* (Fig. 4**a**) or a smaller region (Fig. 4**b-g**), or even a single gene (Fig. 3). Working only with the extracted graphs simplifies downstream analyses and reduces the resources they require. Graph portions can be extracted by using the paths in the graph as coordinate systems to guide the process. For this operation, *odgi extract* allows users to extract specific regions of the graph as defined by query criteria. Regions of interest can be specified by graph nodes or path range(s), also in BED format. Furthermore, it is possible to indicate a list of paths to be preserved completely in the extracted graph.

In *odgi extract*, we begin by collecting all graph nodes that fall within the ranges to extract (and the paths to preserve, if requested). Users can specify the number of steps or nucleotides to expand the selection and include neighboring nodes. Then, edges connecting all selected nodes are added to the subgraph under construction. Finally, the portions of the paths (*i*.*e*., the subpaths) walking through the selected nodes are extracted and added to the new subgraph. Subpaths are searched in parallel, created serially, and extended in parallel again thanks to the parallelism enabled by the ODGI data structure (see §4.1), making *odgi extract* a scalable solution that can also extract complex subregions containing nodes with very high path depth.
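The three steps above can be sketched serially in Python. This is a simplified illustration that ignores node orientation, selection expansion, and parallelism; the function name `extract` and the `name:start-end` naming of subpaths are conventions assumed for this example.

```python
nodes = {1: "GATT", 2: "A", 3: "C", 4: "ACA"}
edges = [(1, 2), (1, 3), (2, 4), (3, 4)]
paths = {"hap1": [1, 2, 4], "hap2": [1, 3, 4]}

def extract(nodes, edges, paths, path_name, start, end):
    """Extract the subgraph touched by path_name in [start, end)."""
    # 1. Collect the nodes overlapping the requested path range.
    selected, offset = set(), 0
    for node_id in paths[path_name]:
        node_end = offset + len(nodes[node_id])
        if offset < end and node_end > start:
            selected.add(node_id)
        offset = node_end
    # 2. Keep the edges whose endpoints are both selected.
    sub_edges = [(a, b) for a, b in edges if a in selected and b in selected]
    # 3. Cut every path into maximal subpaths over the selected nodes.
    sub_paths = {}
    for name, steps in paths.items():
        run, offset, run_start = [], 0, 0
        for node_id in steps + [None]:  # None flushes the last run
            if node_id in selected:
                if not run:
                    run_start = offset
                run.append(node_id)
            elif run:
                sub_paths[f"{name}:{run_start}-{offset}"] = run
                run = []
            if node_id is not None:
                offset += len(nodes[node_id])
    return {n: nodes[n] for n in selected}, sub_edges, sub_paths

sub_nodes, sub_edges, sub_paths = extract(nodes, edges, paths, "hap1", 3, 6)
```

In this toy case the range `hap1:3-6` selects nodes 1, 2, and 4; `hap2` skips node 2, so it contributes two separate subpaths rather than one.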

Pangenome graphs can embed multiple chromosomes as separated connected components (inter-chromosomal structural variants would join the components into bigger ones). *odgi explode* separates the connected components in different ODGI format files, while *odgi squeeze* allows merging multiple graphs into the same ODGI format file, preventing node ID collisions.
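Both operations can be illustrated on bare node identifiers and edges. The sketch below is a minimal Python version under simplifying assumptions: node ids are positive integers, sequences and paths are omitted, and the function names `squeeze` and `connected_components` are borrowed from the tool names for readability only.

```python
def connected_components(node_ids, edges):
    """Group node ids into connected components with union-find."""
    parent = {n: n for n in node_ids}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for n in node_ids:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())

def squeeze(graphs):
    """Merge (node_ids, edges) graphs, shifting ids to avoid collisions."""
    merged_nodes, merged_edges, offset = [], [], 0
    for node_ids, edges in graphs:
        merged_nodes.extend(n + offset for n in node_ids)
        merged_edges.extend((a + offset, b + offset) for a, b in edges)
        offset += max(node_ids)
    return merged_nodes, merged_edges

# Two per-chromosome graphs that reuse the same node ids.
chr1 = ([1, 2, 3], [(1, 2), (2, 3)])
chr2 = ([1, 2], [(1, 2)])
all_nodes, all_edges = squeeze([chr1, chr2])
components = connected_components(all_nodes, all_edges)
```

Merging followed by component detection recovers the two original chromosomes, now with non-overlapping ids.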

### 4.4 Editing the graph structure

Pangenome graphs can be used in a variety of applications, ranging from read mapping to variant identification and genotyping (Eizenga *et al*., 2020b). However, graphs presenting complex topology can increase the computational overhead of many downstream analyses. ODGI offers multiple commonly-needed basic operations on the topology of the graph and its nodes.

For simplifying the graph structure, users can use *odgi prune* to take away complex parts as defined by query criteria, while with *odgi break* they can remove cycles in the graph, reducing the complexity of the graph topology. Furthermore, *odgi groom* allows removing spurious inverting links by exploring the graph from the orientation supported by most paths.

To enable efficient sequence alignment against the graph, long nodes can be divided into shorter nodes at a maximum requested size using *odgi chop*. Partial order alignment, which consists of aligning sequences against a directed acyclic graph (DAG), is frequently used in pangenome building pipelines (Garrison *et al*., 2021), but the current implementations return DAGs with 1-bp long nodes. *odgi unchop* allows joining nodes that can be merged without changing the graph topology, nor the embedded sequences, obtaining an equivalent, but more compact, representation of the graph.
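The effect of chopping can be shown with a small Python sketch. It covers forward-strand paths only and leaves out edge rewiring, and the function `chop` is a hypothetical stand-in for illustration rather than the tool's implementation.

```python
def chop(nodes, paths, max_len):
    """Split nodes longer than max_len and rewrite paths accordingly."""
    new_nodes, pieces, next_id = {}, {}, 1
    for node_id in sorted(nodes):
        seq = nodes[node_id]
        pieces[node_id] = []
        for i in range(0, len(seq), max_len):
            new_nodes[next_id] = seq[i:i + max_len]
            pieces[node_id].append(next_id)
            next_id += 1
    # Each step over an old node becomes consecutive steps over its pieces.
    new_paths = {name: [p for n in steps for p in pieces[n]]
                 for name, steps in paths.items()}
    return new_nodes, new_paths

def spell(nodes, steps):
    """Concatenate node sequences along a path."""
    return "".join(nodes[n] for n in steps)

nodes = {1: "GATTACA", 2: "CC"}
paths = {"p": [1, 2]}
chopped_nodes, chopped_paths = chop(nodes, paths, max_len=3)
```

The chopped graph has more nodes, but each path still spells the same sequence; unchopping is the inverse operation, merging such runs back into single nodes.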

### 4.5 Untangling and navigating the pangenome

The key data in a pangenome graph is a representation of the alignment (or homology) relationships between the sequences represented. Navigating and understanding the graph requires coordinate systems that we can use to link other data into the graph model, and thus to all genomes in the pangenome. ODGI’s tools use the embedded sequences to provide a universal coordinate space that is graph-independent, thereby remaining stable across different graphs built with the same genomes.

A universal coordinate system allows us to support several kinds of “lift-over” of coordinates between different genomes in the same or different graphs. *odgi position* translates graph and path positions between or within graphs, emitting the liftovers in BED format. Coordinates can be specified in BED format, but users can even specify a GFF/GTF file to project the annotations into the pangenome graph (Fig. 4**f-g**).

To translate a query position to a reference position precisely within a repeat region, we apply the *path Jaccard* context mapping concept. The reference node that is found may be visited several times by the reference path. To ensure a precise translation, we select the reference position whose context (the multiset of *Node*.*id*s reached within a given distance, e.g. 10 kbp) has the best Jaccard metric when compared to the query context.
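A minimal Python sketch of this selection follows. It assumes the usual multiset Jaccard definition (sum of minimum counts over sum of maximum counts) and a simple way of collecting the context by walking along the path in both directions; the function names and the toy radius are illustrative, not taken from ODGI.

```python
from collections import Counter

def multiset_jaccard(a, b):
    """Jaccard similarity of two multisets of node ids."""
    ca, cb = Counter(a), Counter(b)
    union = sum((ca | cb).values())
    return sum((ca & cb).values()) / union if union else 0.0

def context(path, node_len, index, radius):
    """Node ids reached within `radius` bp on either side of path[index]."""
    ids = [path[index]]
    for direction in (-1, 1):
        dist, i = 0, index + direction
        while 0 <= i < len(path) and dist < radius:
            ids.append(path[i])
            dist += node_len[path[i]]
            i += direction
    return ids

def best_reference_step(ref, query, q_index, node_len, radius):
    """Pick the visit of the reference to the query node whose context
    is most similar to the context of the query step."""
    q_ctx = context(query, node_len, q_index, radius)
    candidates = [i for i, n in enumerate(ref) if n == query[q_index]]
    return max(candidates, key=lambda i: multiset_jaccard(
        context(ref, node_len, i, radius), q_ctx))

# Node 5 is a collapsed repeat that the reference visits three times.
ref = [1, 5, 2, 5, 3, 5, 4]
query = [2, 5, 3]
node_len = {n: 1 for n in range(1, 6)}
best = best_reference_step(ref, query, 1, node_len, radius=1)
```

Of the three reference visits to node 5, only the middle one is flanked by nodes 2 and 3, as in the query, so it gets the highest similarity.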

Pangenome graphs model alignments of many genomes. With *odgi untangle*, users can extract pairwise alignment information between a given set of “query” sequences and a given set of “target” sequences (used as references). While pangenome graphs may contain looping structures that imply many-to-many alignments between the pangenome sequences, these untangled alignments map each segment of the queries to a single segment in the set of targets. Being able to work with any sets of reference sequences lets us convert the graph to lift-over maps compatible with standard software for projecting annotations and alignments from one genome to another. As an example, by untangling the graph we can study the variation that lies in regions collapsed due to ambiguous alignments over sequence repeats (as shown in Fig. 4**f**). Indeed, to obtain a more precise overview of the *locus* in Fig. 4**b-e**, we can apply *odgi untangle* to segment paths into linear segments by breaking these segments where the paths loop back on themselves. We first discover segment boundaries using standard approaches for detecting repeats in sequence graphs (Pevzner, 2004). We finally “untangle” by finding the target segment that best matches each query segment using the *path Jaccard* context mapping model. In this way, we obtain information on the position and copy number status of the sequences in the collapsed *locus* (Fig. 4**h**). Moreover, to obtain base-level precise information on the relationships between the repeated sequences, we can align them by using the pairs of regions that came from the untangling to guide the alignment (Guarracino *et al*., 2021).

*odgi tips* identifies the break point positions of the contigs relative to the reference(s) in the graph by walking from the ends of each contig until a reference node is found. The reference may visit that node several times. Therefore, for each contig range (a tip), *odgi tips* examines each possible reference window and finds the most similar one using the *path Jaccard* concept. The output is a BED file with the best reference hit and position for each of the contigs’ ends.

### 4.6 Sorting the pangenome graph

Pangenome graphs can hide their underlying latent structures, introducing difficulties in the analysis and interpretation. One cause is the difficulty of finding a suitable ordering of the graph nodes in a convenient number of dimensions. ODGI provides a variety of sorting algorithms to find the best graph node order in 1 or 2 dimensions, allowing us to understand the sparse structures typically found in pangenome graphs and the genetic variation they represent.

To find the best order of graph nodes in 1D, *odgi sort* provides multiple sorting algorithms which can be combined into a sorting pipeline to take advantage of the strength of each. Most notably, nodes can be sorted topologically, randomly, by breaking cycles in the graph (§4.4), by grooming (§4.4), and/or by using a novel path-guided (PG) stochastic gradient descent (SGD) algorithm: PG-SGD. This exploits path information to order the graph nodes. PG-SGD learns a 1D or 2D organization of the graph nodes that matches distances in graph paths. To scale to large graphs, we learn this projection in parallel via a HOGWILD! approach (Niu *et al*., 2011). Our approach can be seen as an adaptation of SGD-based drawing to pangenome graphs (Zheng *et al*., 2018). In parallel, each HOGWILD! thread updates the relative positions of nodes to best match their nucleotide distance in the paths running through the graph. Following standard SGD approaches, a learning rate is reduced as the algorithm progresses, and execution continues until the adjustments to the model fall below a target threshold *ε*. ODGI can project vector (in 1D) and matrix (2D) representations of the graph relative to these learned coordinate spaces. Based on this projection, we can trivially sort graph nodes in 1D. Moreover, we support the same concept in 2D in *odgi layout* by providing a 2D implementation of the PG-SGD algorithm.
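The core update of a path-guided SGD in 1D can be sketched as follows. This is a single-threaded simplification written in the style of SGD graph drawing (Zheng *et al*., 2018), not the PG-SGD implementation itself: it visits every pair of steps in each path instead of sampling them, omits the HOGWILD! parallelism, and the learning-rate schedule, the stopping threshold `eps`, and all function names are assumptions made for illustration.

```python
import random

def path_pairs(paths, node_len):
    """Yield (node_a, node_b, d) for every pair of steps in every path,
    where d is the nucleotide distance between the two step starts."""
    for path in paths:
        offsets, pos = [], 0
        for node in path:
            offsets.append(pos)
            pos += node_len[node]
        for i in range(len(path)):
            for j in range(i + 1, len(path)):
                if path[i] != path[j]:
                    yield path[i], path[j], offsets[j] - offsets[i]

def stress(x, pairs):
    """Weighted squared mismatch between layout and path distances."""
    return sum(((abs(x[a] - x[b]) - d) / d) ** 2 for a, b, d in pairs)

def pg_sgd_1d(x, pairs, iterations=30, eps=0.01, seed=0):
    """Move nodes so that layout distances approach path distances."""
    rng = random.Random(seed)
    x = dict(x)
    d_min = min(d for _, _, d in pairs)
    d_max = max(d for _, _, d in pairs)
    eta_max, eta_min = d_max ** 2, 0.1 * d_min ** 2
    for t in range(iterations):
        # Learning rate decays exponentially from eta_max to eta_min.
        eta = eta_max * (eta_min / eta_max) ** (t / (iterations - 1))
        largest = 0.0
        for a, b, d in rng.sample(pairs, len(pairs)):
            mu = min(eta / d ** 2, 1.0)  # step size, capped at 1
            diff = x[a] - x[b]
            sign = 1.0 if diff >= 0 else -1.0
            r = mu * (abs(diff) - d) / 2
            x[a] -= sign * r             # move both nodes symmetrically
            x[b] += sign * r
            largest = max(largest, abs(r))
        if largest < eps:                # adjustments below the threshold
            break
    return x

node_len = {n: 10 for n in range(1, 7)}
paths = [[1, 2, 3, 4, 5, 6], [1, 2, 4, 5, 6], [1, 3, 4, 6]]
pairs = list(path_pairs(paths, node_len))
init_rng = random.Random(1)
initial = {n: init_rng.random() for n in node_len}
layout = pg_sgd_1d(initial, pairs)
order = sorted(layout, key=layout.get)  # a 1D node order from the layout
```

Sorting the nodes by their learned coordinate gives the 1D order; the same update applied to 2D coordinates would give a layout.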

### 4.7 Obtaining metrics of the pangenome graph

Graph statistics provide alternative ways to gain insight into pangenome complexity, revealing the overall structure, size, and features of a graph and its sequences.

Pangenome graph topology can be derived by applying *odgi matrix*, obtaining information on the connections between graph nodes in textual sparse matrix format. To investigate the genomes encoded in the graph, *odgi paths* allows users to calculate pairwise overlap statistics of groupings of paths and emit all path sequences in FASTA format, and it also allows the generation of a “pangenome matrix” that reports the copy number (presence/absence) of each path over each node. *odgi flatten* generates a linearization of the graph by emitting the pangenome sequence (the concatenation of all node sequences) in FASTA format, and the projection of all paths on the linearized sequence in BED format.
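The “pangenome matrix” can be illustrated with a short Python sketch that counts how often each path visits each node. The function name `pangenome_matrix` and its `binary` switch are hypothetical, and the output layout is a plain dictionary rather than the tool's file format.

```python
from collections import Counter

def pangenome_matrix(paths, node_ids, binary=False):
    """One row per path, one column per node: copy number, or 0/1
    presence/absence when binary is True."""
    rows = {}
    for name, steps in paths.items():
        counts = Counter(steps)
        rows[name] = [min(counts[n], 1) if binary else counts[n]
                      for n in node_ids]
    return rows

# hap1 traverses node 2 twice (a duplication); hap2 skips it.
paths = {"hap1": [1, 2, 2, 4], "hap2": [1, 3, 4]}
node_ids = [1, 2, 3, 4]
copy_number = pangenome_matrix(paths, node_ids)
presence = pangenome_matrix(paths, node_ids, binary=True)
```

Such a matrix can be passed directly to standard statistical tools, for example to cluster paths by shared content.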

Applying *odgi stats*, users can retrieve metrics describing the graph properties, such as the number of nodes, edges, paths, and graph length. It outputs pangenome statistics in tab-separated values (TSV) or YAML textual file formats. MultiQC’s (Ewels *et al*., 2016) ODGI module provides an interactive way to comparatively explore such statistics of an arbitrary number of graphs.

ODGI also offers more advanced tools for the interrogation of the graphs. To study very large pangenomes, users can use *odgi bin* to summarize the path information into bins of a specified size, generating a summarized view of gigabase scale graphs in TSV or JSON file formats.

Genomes presenting sequences with highly identical repeats result in pangenome graphs with complex motifs that can be detected by *odgi depth* and *odgi degree*, which return the node depth and node degree, respectively, as defined by query criteria. Both tools emit the output in BED format, allowing users to assess the complexity of the graph and detect intricate regions. Indeed, high depth/degree nodes can reflect genetic variation (Fig. 3), but also misassemblies or problems in pangenome building, which makes these tools useful for graph validation as well.
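Both metrics are simple to state in code. The Python sketch below takes node depth as the number of path steps on a node and node degree as the number of edge ends attached to it (so a self-loop counts twice); the BED-like tuples produced by the hypothetical `depth_over_path` helper are an assumed layout, not the exact output of the tools.

```python
from collections import Counter

node_len = {1: 4, 2: 1, 3: 1, 4: 3}
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (2, 2)]
paths = {"hap1": [1, 2, 2, 2, 4], "hap2": [1, 3, 4]}

def node_depth(paths):
    """Number of path steps that traverse each node."""
    return Counter(n for steps in paths.values() for n in steps)

def node_degree(edges):
    """Number of edge ends attached to each node."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree

def depth_over_path(name, paths, node_len):
    """Project node depth onto the coordinates of one path."""
    depth, rows, pos = node_depth(paths), [], 0
    for n in paths[name]:
        rows.append((name, pos, pos + node_len[n], depth[n]))
        pos += node_len[n]
    return rows

depth = node_depth(paths)
degree = node_degree(edges)
rows = depth_over_path("hap2", paths, node_len)
```

Here node 2, which carries a self-loop traversed three times by `hap1`, stands out with both the highest depth and the highest degree.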

## 5 Discussion

Pangenome graphs stand to become a ubiquitous model in genomics thanks to their capability to represent any genetic variant without being affected by reference bias (Eizenga *et al*., 2020b). However, despite this great potential, their spread is impeded by the lack of tools capable of managing and analyzing pangenome graphs easily and efficiently.

ODGI is a state-of-the-art tool suite that enables users to explore and discover the underlying biology in pangenome graphs, filling the gap that made pangenomic analyses difficult. It provides tools to easily transform, analyze, simplify, validate, and visualize pangenome graphs at large scale. In particular, lifting over annotations and linearizing nested graph structures place the suite as the bridge between traditional linear reference genome analysis and pangenome graphs. With the increased adoption of long read sequencing we expect pangenomic tools to become increasingly common in biomedical research. Particularly for targets that involve complex variation, such as cancer, plant genomics and metagenomics, ODGI will facilitate disentangling, describing and analyzing a much larger set of variation than was previously possible with tools that depend on short reads and reference genomes. Furthermore, users can even consider ODGI as a framework, taking advantage of its algorithms to develop new and more advanced tools that work on pangenome graphs, thus expanding the type of possible pangenomic analyses available to the scientific community.

ODGI is already the backbone of the Pangenome Graph Builder pipeline (Garrison *et al*., 2021). Its static, large-scale 1D and 2D visualizations of the pangenome graphs allow an unprecedented high-level perspective on variation in pangenomes, and have also been critical in the development of pangenome graph building methods. However, an interactive solution that combines the 1D and 2D layout of a graph with annotation and read mapping information across different zoom levels is still missing. Recent interactive browsers are reference-centric (Beyer *et al*., 2019; Yokoyama *et al*., 2019; Durant *et al*., 2021; Liang and Lonardi, 2021) or focus primarily on 2D (Wick *et al*., 2015; Gonnella *et al*., 2018). Our graph sorting and layout algorithms can provide the foundation for future tools of this type. We plan to focus on using these learned models to detect structural variation and assembly errors.

ODGI has allowed us to explore *context mapping* deconvolution of pangenome graph structures via the path Jaccard metric. This resolves a major conceptual issue that has strongly guided existing algorithms to construct pangenome graphs. Previously, great efforts have been made to prevent the “collapse” of non-orthologous sequences in the graph topology itself (Li *et al*., 2020). This has been seen as essential to making these new bioinformatic models interpretable. While our presentation is primarily qualitative, our work demonstrates that we can mitigate this issue by exploiting the pangenome graph not as a static reference, but as a dynamic model of the mutual alignment of many genomes. Because pangenome graphs can contain complete genomes, we are able to query them to polarize the information they contain in easily-interpretable and reusable pairwise formats that are widely supported in bioinformatics. ODGI also projects variation graphs into vector and matrix representations that allow the direct application of machine learning and statistical models to the pangenome. We expect that ODGI will provide a reference interface between pangenomic and genomic approaches to understanding genome variation.

## Funding

We gratefully acknowledge support from NIH/NIDA U01DA047638 (EG) and NIH/NIGMS R01GM123489 (EG and PP). SH acknowledges funding from the Central Innovation Programme (ZIM) for SMEs of the Federal Ministry for Economic Affairs and Energy of Germany. SN acknowledges Germany’s Excellence Strategy (CMFI), EXC-2124 and (iFIT) - EXC 2180 – 390900677. This work was supported by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A537B, 031A533A, 031A538A, 031A533B, 031A535A, 031A537C, 031A534A, 031A532B).

## Data availability

Code and links to data resources used to build this manuscript and its figures can be found in the paper’s public repository: https://github.com/pangenome/odgi-paper.

## Acknowledgements

We thank members of the HPRC Pangenome Working Group for their insightful discussion and feedback, and members of the HPRC production teams for their development of resources used in our exposition.