GBZ File Format for Pangenome Graphs

Motivation: Pangenome graphs representing aligned genome assemblies are being shared in the text-based Graphical Fragment Assembly format. As the number of assemblies grows, there is a need for a file format that can store the highly repetitive data space-efficiently.
Results: We propose the GBZ file format based on data structures used in the Giraffe short-read aligner. The format provides good compression, and the files can be efficiently loaded into in-memory data structures. We provide compression and decompression tools and libraries for using GBZ graphs, and we show that they can be efficiently used on a variety of systems.
Availability: C++ and Rust implementations are available at https://github.com/jltsiren/gbwtgraph and https://github.com/jltsiren/gbwt-rs, respectively.
Contact: jouni.siren@iki.fi
Supplementary information: Supplementary data are available online.


Introduction
Pangenome graphs (or genome graphs) (Eizenga et al., 2020b) are graphs representing sequence variation. Each node (or sometimes each edge) in the graph is labeled with a sequence, and each path represents the concatenation of the labels on it. Pangenome graphs are often used as reference genomes. Initial efforts have focused on using graphs as technical artifacts to improve accuracy in applications such as read alignment (Garrison et al., 2018;Kim et al., 2019;Rautiainen and Marschall, 2020;Li et al., 2020;Sirén et al., 2021) and genotyping (Eggertsson et al., 2017;Kim et al., 2019;Chen et al., 2019;Hickey et al., 2020;Sirén et al., 2021;Ebler et al., 2022).
There is increasing interest in using graphs as biologically meaningful references that represent sequence variation in the relevant population. The Human Pangenome Reference Consortium (HPRC) (Wang et al., 2022) aims to create a human reference genome based on over 350 high quality haplotype-resolved assemblies. The reference will consist of the assembled genomes and pangenome graphs representing their alignments.
Software tools for pangenome graphs use the text-based Graphical Fragment Assembly (GFA) format as their data interchange format. GFA was originally intended for assembly graphs, but it is also viable for pangenome graphs with some restrictions and extensions. A GFA file representing only the graph itself is not too large, and it is often compressed further with standard data compressors such as gzip. The format becomes inadequate if we also want to represent the original sequences as paths. While the paths are often highly similar and should compress well, common general-purpose data compressors cannot handle long-distance repetitions. Decompressing a large text file and reading the graph and the paths into in-memory data structures can also be expensive.
We propose the GBZ file format for pangenome graphs representing aligned sequences. The format is based on the data structures used in the Giraffe aligner (Sirén et al., 2021), and it uses a GBWT index (Sirén et al., 2020) for storing a large collection of similar paths space-efficiently. GBZ is a path-based format. Paths representing the original sequences are the primary objects, while the existence of nodes and edges is inferred from usage. We provide a standalone C++ library and compressor for creating and using GBZ graphs, as well as a Rust library for using existing graphs. Both libraries can load GBZ graphs quickly into versatile data structures that are space-efficient enough to enable working with human pangenome graphs on a desktop or a laptop.
Our libraries use Elias-Fano encoded bitvectors for storing increasing integer sequences. We describe various improvements to their semantics and the query interface that make them more practical and may be of independent interest.
Pangenome graph construction pipelines may sometimes create highly collapsed regions that break the assumptions the GBWT makes to ensure fast and space-efficient queries. We show that decompressing and caching parts of the GBWT index is enough to restore performance in specific tasks when such regions are present. Based on this, we suggest changes to in-memory GBWT data structures to make them more robust with highly collapsed graph regions.

Strings
In the following, we refer to semiopen and closed integer intervals as [a . . . b) and [a . . . b], respectively. A string S = s_0 · · · s_{n−1} of length |S| = n is a sequence of characters from an ordered alphabet Σ. For convenience, we often assume that the alphabet is the set of integers [0 . . . |Σ|). We write S[i] = s_i and S[i . . . j) = s_i · · · s_{j−1} to access individual characters and substrings of any sequence S. Prefixes and suffixes are substrings of the form S[0 . . . j) and S[i . . . n), respectively, and we write them as S[. . . j) and S[i . . . ). A text string T ends with an endmarker T[|T| − 1] = $ that does not occur in any other text position. The endmarker is the smallest character in the alphabet, and we often assume that $ = 0.
Space-efficient data structures rely on rank and select queries on various sequences. For any string S, position i ≤ |S|, and character c ∈ Σ, we define S.rank(i, c) as the number of occurrences of character c in the prefix S[. . . i). If S[i] = c, we refer to it as the occurrence of rank S.rank(i, c) of character c. We also define the inverse operation S.select(i, c) = j, where S[j] is the occurrence of rank i of character c. (Note that we use 0-based select, where the first occurrence is of rank 0.)
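As a concrete illustration of these 0-based semantics, rank and select over a plain string can be computed by scanning. This is a naive sketch, not how the libraries implement the queries:

```rust
/// S.rank(i, c): the number of occurrences of `c` in the prefix S[. . . i).
fn rank(s: &str, i: usize, c: char) -> usize {
    s.chars().take(i).filter(|&x| x == c).count()
}

/// S.select(i, c): the position of the occurrence of rank `i` of `c` (0-based).
fn select(s: &str, i: usize, c: char) -> Option<usize> {
    s.char_indices().filter(|&(_, x)| x == c).map(|(j, _)| j).nth(i)
}
```

For example, rank("GATTACA", 4, 'T') = 2 and select("GATTACA", 1, 'A') = 4, so the two queries are inverses of each other in the sense defined above.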

Bitvectors
A bitvector is a data structure that encodes a binary sequence (a string over alphabet {0, 1}) and supports efficient rank/select queries. We call characters 0 and 1 unset and set bits, respectively. If A is a strictly increasing sequence of m non-negative integers, we can represent it as a binary sequence B with B[A[i]] = 1 for all i and all other bits unset. Then A[i] = B.select(i, 1), and B.rank(j, 1) is the number of values A[i] < j. Plain bitvectors store the binary sequence without any further encoding. There are many efficient structures that support rank and select on plain bitvectors. We use the ones from SDSL (Gog and Petri, 2014).
A binary sequence is sparse if the number of set bits is much smaller than the number of unset bits. If a binary sequence is sparse, |A| ≪ |B| in the corresponding bitvector. In this paper, a sparse bitvector is a data structure that uses Elias-Fano encoding to store a sparse binary sequence space-efficiently (Okanohara and Sadakane, 2007).
Elias-Fano encoding stores the low and high parts of each value A[i] separately. The lowest w bits are stored in an integer array low by setting low[i] = A[i] mod 2^w. For the high part, we assign each value x to a bucket using bucket(x) = ⌊x/2^w⌋ and encode the buckets in unary. If a bucket contains k values, we encode it as a binary sequence 1^k 0. We concatenate the binary sequences of all buckets to form bitvector high. The encoding works best with w ≈ log_2 |B| − log_2 |A|, which makes the number of buckets similar to the number of values. Bitvector high then contains similar numbers of set and unset bits.
If high[j] is the set bit of rank i, we can compute bucket(A[i]) = j − i, because each preceding bucket contributes one unset bit. Given i, we can compute j = high.select(i, 1). Hence B.select(i, 1) = A[i] = 2^w · (j − i) + low[i]. For B.rank(i, 1), we first find the end of the bucket that would contain value i using h = high.select(bucket(i), 0). The number of values in buckets up to bucket(i) is l = h − bucket(i). If high[h − 1] = 1, it is the set bit of rank l − 1. If also low[l − 1] ≥ i mod 2^w, we know that A[l − 1] ≥ i and iterate with h ← h − 1 and l ← l − 1. If either of the conditions fails, we return B.rank(i, 1) = l.
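The encoding and the two queries above can be sketched as follows. This is a minimal, unoptimized illustration in which linear scans stand in for the rank/select structures on high; the names are ours, not from SDSL or Simple-SDS:

```rust
// Minimal Elias-Fano sketch: `low` stores the lowest w bits of each value,
// `high` stores the buckets in unary (1^k 0 per bucket).
struct SparseVector {
    w: u32,
    low: Vec<u64>,
    high: Vec<bool>,
}

impl SparseVector {
    /// Encode a strictly increasing sequence `a` of values from [0 . . . n).
    fn new(a: &[u64], n: u64) -> Self {
        let m = a.len().max(1) as u64;
        // w ~ log2 |B| - log2 |A|, so `high` has about as many 0s as 1s.
        let w = if n / m >= 2 { 63 - (n / m).leading_zeros() } else { 1 };
        let buckets = (n >> w) as usize + 1;
        let mut high = vec![false; buckets + a.len()];
        for (i, &x) in a.iter().enumerate() {
            // The set bit of rank i is preceded by one unset bit per earlier bucket.
            high[(x >> w) as usize + i] = true;
        }
        let low = a.iter().map(|&x| x & ((1u64 << w) - 1)).collect();
        SparseVector { w, low, high }
    }

    /// B.select(i, 1) = A[i] = 2^w * (j - i) + low[i], where j is the
    /// position of the set bit of rank i in `high`.
    fn select_one(&self, i: usize) -> u64 {
        let j = self.high.iter().enumerate().filter(|p| *p.1).nth(i).unwrap().0;
        (((j - i) as u64) << self.w) | self.low[i]
    }

    /// B.rank(i, 1): find the end of bucket(i), then scan backward over
    /// values A[l - 1] >= i, exactly as described in the text.
    fn rank_one(&self, i: u64) -> usize {
        let bucket = (i >> self.w) as usize;
        let mut h = self.high.iter().enumerate().filter(|p| !*p.1).nth(bucket).unwrap().0;
        let mut l = h - bucket;
        while l > 0 && self.high[h - 1] && self.low[l - 1] >= (i & ((1u64 << self.w) - 1)) {
            h -= 1;
            l -= 1;
        }
        l
    }
}
```

For A = (3, 7, 21, 30) and |B| = 32, we get w = 3, so high encodes buckets {3, 7}, {}, {21}, {30} as 110 0 10 10 0, and, for example, select_one(2) recovers 2^3 · 2 + 5 = 21.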

Burrows-Wheeler transform
Let T be a text string of length n. The suffix array of text T is an array SA of pointers to the suffixes of T in lexicographic order. For any i < j, we have T[SA[i] . . . ) < T[SA[j] . . . ). Because the text ends with an endmarker, we always have SA[0] = n − 1.
We can generalize the suffix array for an ordered collection T_0, . . . , T_{m−1} of m texts. Each element of the suffix array is now a pair SA[i] = (j, j′) that refers to suffix T_j[j′ . . . ) of text T_j. To make the lexicographic order unique, we assume that the endmarker of T_i is smaller than that of T_j for any i < j. Then SA[i] = (i, |T_i| − 1) for all i < m.

The (multi-string) Burrows-Wheeler transform (BWT) (Burrows and Wheeler, 1994) of a collection of texts is a permutation BWT of character occurrences in the texts. For each suffix in lexicographic order, the permutation lists the character preceding the suffix. We define the BWT using the suffix array: if SA[i] = (j, j′) with j′ > 0, then BWT[i] = T_j[j′ − 1]; otherwise BWT[i] = $.

Let C be an array of length |Σ| + 1 such that C[c] is the total number of occurrences of all characters c′ < c in the texts. For any character c, if C[c] ≤ i < C[c + 1], the suffix SA[i] starts with character c. Given a text collection T_0, . . . , T_{m−1}, the lexicographic rank of string X is the number of suffixes of the collection that are smaller than X. The key operation on the BWT is the LF-mapping that maps the lexicographic rank of a suffix to the lexicographic rank of the preceding suffix. In other words, it is a function such that if SA[i] = (j, j′), then SA[LF(i)] = (j, j′ − 1). We leave this form of LF-mapping undefined when j′ = 0. A more general form of LF-mapping takes a character c ∈ Σ as a second argument. If the lexicographic rank of string X is i, the lexicographic rank of string cX is LF(i, c). In particular, LF(i) = LF(i, BWT[i]). We leave the function undefined with c = $.
The FM-index (Ferragina and Manzini, 2005) is a space-efficient text index based on the BWT. It computes LF-mapping as LF(i, c) = C[c] + BWT.rank(i, c). Let SA[i . . . j) be the range of suffixes starting with pattern (string) P. Then the range of suffixes starting with pattern cP, for any character c ∈ Σ, is SA[LF(i, c) . . . LF(j, c)). We can also use LF-mapping for extracting individual texts from the BWT. The last character of text T_i before the endmarker is BWT[i]. As long as BWT[j] ≠ $ at the current position j, there may be other characters in the text before the ones we have already extracted. We therefore move to j ← LF(j) and find the possible previous character at BWT[j].
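LF-mapping and backward search can be sketched over a byte BWT, with rank computed by scanning for clarity (a real FM-index uses rank structures). The test data is the single-text BWT of GATTACA$, an assumption made only for this example:

```rust
/// C[c]: the number of occurrences of characters smaller than c.
fn c_array(bwt: &[u8]) -> Vec<usize> {
    let mut c = vec![0usize; 257];
    for &b in bwt {
        c[b as usize + 1] += 1;
    }
    for x in 1..257 {
        c[x] += c[x - 1];
    }
    c
}

/// LF(i, c) = C[c] + BWT.rank(i, c), with rank computed by scanning.
fn lf(bwt: &[u8], c: &[usize], i: usize, ch: u8) -> usize {
    c[ch as usize] + bwt[..i].iter().filter(|&&b| b == ch).count()
}

/// Backward search: the suffix range SA[start . . . end) of suffixes that
/// start with `pattern`, extending the pattern one character at a time.
fn find(bwt: &[u8], c: &[usize], pattern: &[u8]) -> (usize, usize) {
    let (mut start, mut end) = (0, bwt.len());
    for &ch in pattern.iter().rev() {
        start = lf(bwt, c, start, ch);
        end = lf(bwt, c, end, ch);
        if start >= end {
            break;
        }
    }
    (start, end)
}
```

With BWT "ACTGA$TA" (the BWT of GATTACA$), searching for "TA" first narrows the range to the suffixes starting with A and then, via LF-mapping with T, to the single suffix TACA$.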

Graphs
Genome graphs are often represented as bidirected sequence graphs G = (V, E, ℓ). Because we often use value 0 for technical purposes and store some information for all integers between node identifiers, we assume that the set of nodes V is a reasonably dense interval of positive integers. Each node v ∈ V has a non-empty string label ℓ(v), and it can be seen in two orientations o ∈ {+, −}. If o is an orientation, we use ō to refer to the other orientation. When we visit node v in forward orientation +, we read the label ℓ(v, +) = ℓ(v); in reverse orientation −, we read the reverse complement ℓ(v, −) of the label. If P is a path in graph G, the reverse path P̄ visits the same nodes in reverse order and in the other orientation. That is, if P = (v_0, o_0) · · · (v_{k−1}, o_{k−1}), then P̄ = (v_{k−1}, ō_{k−1}) · · · (v_0, ō_0). Graph representations often store the topology of a bidirected sequence graph as an equivalent directed graph G′ = (V′, E′), where each node v ∈ V corresponds to the two nodes (v, +), (v, −) ∈ V′.

The libhandlegraph library (Eizenga et al., 2020a) provides a common C++ interface for various bidirected sequence graph implementations. A HandleGraph is an immutable graph. PathHandleGraph extends it with a set of named paths, where each path has a unique non-empty string name.

GFA file format
We use a subset of the GFA file format version 1.1. The subset is mostly compatible with the bidirected sequence graph data model:
• S-lines or segments are the nodes of the graph. They consist of a unique non-empty string name and a non-empty string label.
• L-lines or links are the edges of the graph. They connect two segment visits. We do not allow any overlaps between the connected segments.
• P-lines or paths are named paths in the graph. They consist of a unique non-empty string name and a non-empty sequence of segment visits.
• W-lines or walks are paths with metadata suitable for representing assembled genomes. They consist of a sample name (non-empty string), a haplotype identifier (integer), a sequence name (non-empty string), an optional interval of positions in the named sequence, and a non-empty sequence of segment visits.
In particular, we do not support overlapping segments and annotations with optional tags, as they are more useful with assembly graphs.

GBWT
The GBWT (Sirén et al., 2020) is a run-length encoded FM-index that stores a collection of paths in a directed graph G′ = (V′, E′) as strings over alphabet V′. Run-length encoding allows it to compress collections of similar paths well. Because we build the FM-index for reverse strings, LF-mapping with character w after character v follows edge (v, w) instead of going backwards on edge (w, v). We break the BWT into substrings BWT_v for each v ∈ V′ and store the substrings in the nodes. Each node v ∈ V′ also stores a list of outgoing edges (v, w) and some rank information that allows computing LF-mapping using locally stored information. We compress the nodes as byte sequences. Queries on the GBWT involve decompressing entire nodes. Hence we assume that nodes do not have too many neighbors and that paths do not visit them too many times.
We use the GBWT for storing paths in a bidirected sequence graph G = (V, E, ℓ), using the approach described in Section 2.1.4. The GBWT index is usually bidirectional (Lam et al., 2009). For each path P , we store both P and P in the index, which enables extending the pattern in both directions.
Each path in the graph (which may be stored as two paths in the GBWT index) may be associated with a structured name. The name consists of four integer fields: sample identifier, contig identifier, haplotype/phase identifier, and fragment identifier / running count. Each name is assumed to be unique. GBWT metadata may also contain lists of string names corresponding to sample and contig identifiers.
The GBWTGraph (Sirén et al., 2021) is the graph representation used in the Giraffe aligner. It uses a bidirectional GBWT index for graph topology and stores the labels of both orientations of all nodes as a single concatenated string. The graph also caches some information that can be computed from the GBWT for faster access. A GBWTGraph is always a subgraph induced by the paths. Paths representing DNA sequences are the primary objects, while the existence of nodes and edges is inferred from usage. This is not a fundamental limitation of the data structures but a deliberate choice made due to the path-centric algorithms used in Giraffe.

Improved sparse bitvectors
Elias-Fano encoded sparse bitvectors are often used for partitioning integer intervals [a . . . b) into subintervals [select(i, 1) . . . select(i + 1, 1)). For example, we may want to store an ordered collection of strings S_0, . . . , S_{m−1} as a single concatenated string X for faster disk I/O. A sparse bitvector B of length |X| can then serve as an index. If we set B[j] = 1 at the starting position of each string, we can access the original strings as S_i = X[B.select(i, 1) . . . B.select(i + 1, 1)) (with B.select(m, 1) = |X| for convenience). However, bitvector semantics and the rank/select query interface limit the practicality of this approach.
One common issue is that some of the strings S_i may be empty. The sequence of starting positions is then non-decreasing rather than strictly increasing, and it no longer corresponds to a valid binary sequence with one set bit per string.
However, Elias-Fano encoding is not limited to storing strictly increasing sequences. The encoding and select( · , 1) queries work correctly as long as all values in bucket i occur before the values in bucket j in the integer sequence A, for all buckets i < j. For rank( · , 1) queries, the algorithm works correctly as long as the sequence is non-decreasing. The semantics of rank( · , 0) and select( · , 0) queries become unclear when duplicate values are allowed.
When we want to retrieve string S_i, we have to compute both B.select(i, 1) and B.select(i + 1, 1). Computing them directly is inefficient, as they involve relatively expensive high.select( · , 1) queries. To avoid this, select queries should return an iterator with internal state (i, high.select(i, 1)). We can compute B.select(i, 1) efficiently from those values. When we want to advance the iterator, we set i ← i + 1 and scan for the position of the next set bit in high. This is efficient on average, as we have chosen a width w such that high contains similar numbers of set and unset bits. The iterator can also be used for decompressing the integer sequence A efficiently.
Given a value j ∈ [a . . . b), we may want to retrieve the subinterval [select(i, 1) . . . select(i + 1, 1)) containing it, as well as the rank i of the subinterval. This can be achieved using a predecessor query B.pred(j) = (i, B.select(i, 1)), where i is the last position with A[i] ≤ j. We can compute it naively by first finding the rank i = B.rank(j + 1, 1) − 1 and then computing B.select(i, 1). This is inefficient, as rank queries on B involve computing high.select( · , 0), while select queries on B require high.select( · , 1). A more efficient implementation is similar to a rank query. We first find the end of the bucket that would contain value j, and then we iterate backward until we find a value A[i] ≤ j or run out of values.
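The semantics of the predecessor query can be sketched over a plain sorted sequence; the sparse bitvector reaches the same answer by scanning one high bucket backward, as described above. The name `pred` is ours:

```rust
/// B.pred(j) = (i, A[i]), where i is the last position with A[i] <= j.
fn pred(a: &[u64], j: u64) -> Option<(usize, u64)> {
    // partition_point returns the number of values <= j in a sorted slice.
    let i = a.partition_point(|&x| x <= j);
    if i == 0 {
        None
    } else {
        Some((i - 1, a[i - 1]))
    }
}
```

For A = (3, 7, 21, 30), pred(10) returns (1, 7): value 10 falls in the subinterval of rank 1, which starts at position 7.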

General
The GBZ file format is based on the serialization format used in the Simple-SDS library. A file is an array of elements: unsigned little-endian 64-bit integers. If the file is memory-mapped, all objects with alignment of 8 bytes or less are properly aligned in memory. A vector is serialized by writing the number of items as an element, followed by the concatenated items. If the size of an item is not a multiple of 64 bits, the vector must be padded with 0-bits. Optional or implementation-dependent structures may be serialized by writing the size of the serialized object in elements as an element, followed by the object itself. A reader can then easily skip the object or pass it through as a vector of elements.
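The element and padding rules above can be sketched for a byte payload. The helper name is ours, and the actual libraries serialize typed items rather than raw bytes:

```rust
/// Serialize a byte payload as an element (the number of items, as an
/// unsigned little-endian 64-bit integer), followed by the concatenated
/// items padded with 0-bits to a multiple of 64 bits.
fn serialize_bytes(data: &[u8]) -> Vec<u8> {
    let mut out = (data.len() as u64).to_le_bytes().to_vec(); // item count
    out.extend_from_slice(data);
    let padded = (data.len() + 7) / 8 * 8; // round up to full elements
    out.resize(8 + padded, 0);
    out
}
```

A 7-byte payload thus serializes to 16 bytes: one element for the count and one element holding the payload plus one byte of padding.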
Two general principles have guided the development of the GBZ file format: determinism and backward compatibility with the in-memory representation of existing GBWT indexes. The former means that the same data should always be compressed in the same way, without any variation that depends on the implementation or the execution environment. When the two principles are in conflict, compatibility has prevailed in the first version of the file format. Compatibility also imposed some complex requirements for supporting old GBWT indexes. Future versions of the file format may choose simplicity and determinism over compatibility.
See Supplement 1 for further details.

Building blocks
Simple-SDS provides a number of basic succinct data structures. They are serialized by encoding some header information as elements and storing the data in lower-level structures. The fundamental structure is the raw bitvector that stores a binary sequence of length n in a vector of elements.
An integer vector stores a sequence of n integers of width w each as a raw bitvector. Plain bitvectors store the data as a raw bitvector, followed by rank/select support structures. The support structures are optional and unspecified, as they are implementation-dependent and can be rebuilt quickly from the bitvector itself. Sparse bitvectors use a plain bitvector for high and an integer vector for low. A string array stores an ordered collection of strings as a single concatenated string, as described in Section 2.2. The serialized structure uses a sparse bitvector as an index and compresses the strings using alphabet compaction. For example, if the strings only contain characters in {A, C, G, N, T}, they are stored as an integer vector with values in [0 . . . 5). In-memory structures may decompress the index as an integer vector and the strings as a byte vector for faster access.
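The string array encoding can be sketched as follows. The names are illustrative: the real structure stores the index as a sparse bitvector and the compacted strings as a packed integer vector.

```rust
/// Concatenate the strings, record starting positions (with |X| appended
/// for convenience), and compact the alphabet to values in [0 . . . sigma).
fn string_array(strings: &[&str]) -> (Vec<usize>, Vec<char>, Vec<usize>) {
    let concat: String = strings.concat();
    let mut starts = vec![0];
    for s in strings {
        starts.push(starts.last().unwrap() + s.len());
    }
    // Alphabet compaction: map each distinct character to a small integer.
    let mut alphabet: Vec<char> = concat.chars().collect();
    alphabet.sort();
    alphabet.dedup();
    let packed = concat
        .chars()
        .map(|c| alphabet.binary_search(&c).unwrap())
        .collect();
    (starts, alphabet, packed)
}
```

Note that an empty string produces a repeated starting position, which is exactly the non-decreasing case the improved sparse bitvectors of Section 2.2 are designed to handle.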
Dictionary encodes a bidirectional mapping between an ordered collection of distinct strings S 0 , . . . , S m−1 and their identifiers [0 . . . m). In the serialized structure, the strings are stored as a string array. An integer vector stores the identifiers in lexicographic order. In-memory structures may decompress the dictionary into a more appropriate representation.
Tags are key-value pairs used for annotation. Keys are case-insensitive strings, and they are assumed to be distinct. Values can be arbitrary strings. Serialized tags are stored as a string array, while in-memory structures may use more appropriate representations. Key source identifies the library used for serializing the data. The reader may use it for determining if it can understand any optional structures that are present.

GBWT
The serialization format for the GBWT resembles the compressed in-memory structure. It starts with a header and tags. Node records are compressed as byte sequences using the original encoding, and the sequences are concatenated in a byte vector. A sparse bitvector is used as an index over the nodes.
The GBWT uses document array samples for mapping BWT positions to the identifiers of the corresponding paths. They are serialized as an optional structure in an unspecified format. The structure is optional, because many applications do not need that functionality. The format is left unspecified, as the original structure is complex and not particularly efficient. Future versions of the file format may specify a better structure based on the r-index (Gagie et al., 2020).
GBWT metadata is stored in an optional structure. The structure is optional, because the GBWT may also be useful in applications outside genomics. Metadata starts with a header, followed by a vector of path names. Sample and contig names are serialized as dictionaries.

GBWTGraph
As the GBWTGraph uses a GBWT index for graph topology, it only needs to store a header and node labels. While the in-memory data structure used in Giraffe stores the labels in both orientations for faster access, serializing the reverse orientation is clearly unnecessary. The forward labels are concatenated and stored as a string array.
While GFA graphs have segments with string names, bidirected sequence graphs have nodes with integer identifiers. And while the original graph may have segments with long labels, it often makes sense to limit the length of the labels. For example, the minimizer index used in Giraffe supports nodes of length 1024 bp or less. Some algorithms may create temporary copies of node labels, and long labels may also be inconvenient to visualize.
For this purpose, the GBWTGraph includes a node-to-segment translation. The translation consists of a string array storing segment names S 0 , . . . , S m−1 and a sparse bitvector B mapping node ranges to segments. If the translation is in use, the set of nodes is assumed to be V ′ = [1 . . . |V ′ |]. Segment S i is then the concatenation of nodes v ∈ [B.select(i, 1) . . . B.select(i + 1, 1)).
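The translation lookup can be sketched with an explicit array of starting node identifiers standing in for the sparse bitvector B, so that starts[i] plays the role of B.select(i, 1):

```rust
/// Segment i covers the node range [starts[i] . . . starts[i + 1]); the
/// final entry of `starts` is one past the largest node identifier.
fn segment_of(starts: &[u64], node: u64) -> usize {
    // The predecessor of `node` among the starting positions.
    starts.partition_point(|&s| s <= node) - 1
}
```

For example, with starts = (1, 4, 5, 9), segment 0 is the concatenation of nodes 1-3, segment 1 is node 4, and segment 2 covers nodes 5-8.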

GBZ
GBZ is a container that stores a bidirectional GBWT index and a GBWTGraph. It also includes a header and tags. The format interprets GBWT metadata in a specific way.
Paths corresponding to sample name _gbwt_ref are named paths. They correspond to paths in a PathHandleGraph and P-lines in GFA. The name of the path is stored as a contig name.
Other paths correspond to GFA W-lines. GBWT path names map to GFA walk metadata in the following way. Sample names and haplotype identifiers can be used directly. GBWT contig names become GFA sequence names. GBWT fragment identifier is used as the start of GFA sequence interval.

Compression algorithm
We propose an algorithm that compresses a GFA file in the GBZ format in several passes over a memory-mapped file. The algorithm works best when, for each weakly connected component in the graph and each line type, all lines from that component with that type are consecutive in the file.
In the first pass, the algorithm determines the starting position and type of each line in the file. It determines whether a node-to-segment translation is necessary. A translation is created if a segment contains a sequence longer than the user-defined threshold (1024 bp by default) or if segment names cannot be interpreted as integer identifiers. The algorithm also chooses an appropriate size for GBWT construction buffers based on the length of the longest path.
In the second pass, the algorithm processes all segments. It builds the node-to-segment translation if necessary and stores node labels in a temporary structure. The third pass processes links and creates a temporary graph. The algorithm finds the weakly connected components of the graph and sorts them by the minimum node identifier. It then determines GBWT construction jobs by using large components directly as jobs and combining consecutive small components into larger jobs.
The fourth pass builds GBWT metadata from P-lines and W-lines. If there are W-lines present, the algorithm interprets P-lines as named paths, as described in Section 2.3.5. Otherwise the algorithm breaks P-line names into fields using a user-provided regex and maps the fields to the components of GBWT path names.
GBWT construction happens in the fifth and final pass that builds a separate GBWT index for each job. The jobs are started from the largest to the smallest (by the number of nodes in the components), and multiple jobs can be run in parallel. Each job first processes P-lines and then W-lines in the corresponding components and appends the paths into a buffer. When the buffer is full, the paths are inserted into the GBWT index using the direct construction algorithm (Sirén et al., 2020).
After all construction jobs have finished, the algorithm merges the partial GBWT indexes into a final index. Because the jobs are based on weakly connected components, the partial indexes do not overlap, and node records from them can be used directly in the merged index (Sirén et al., 2020). The algorithm builds a GBWTGraph from the final GBWT index, node labels, and node-to-segment translation, and then it serializes everything in the GBZ format.

Decompression algorithm
Decompressing GFA from a GBZ file is straightforward. We first iterate over the node-to-segment translation (or over the nodes if there is no translation) and output the S-lines in that order. Then we iterate over all edges in an arbitrary order, select the ones that connect segments, and output them as L-lines. Because each edge ((v, o), (w, o′)) of a bidirected sequence graph implies the existence of the reverse edge ((w, ō′), (v, ō)), we only write the smaller of the two (in the usual tuple ordering) as an L-line. Writing L-lines is slower than writing S-lines, because it requires partial decompression of the node records. Hence it may be useful to write them using multiple threads.
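Choosing which of the two equivalent edges to write can be sketched as follows, with false standing for the forward orientation + and true for the reverse orientation −. This encoding is illustrative, not the one used in the libraries:

```rust
type Visit = (u64, bool); // (node identifier, orientation)

/// An edge and its reverse describe the same link in a bidirected graph;
/// emit only the smaller of the two in the usual tuple ordering.
fn canonical(e: (Visit, Visit)) -> (Visit, Visit) {
    let ((v, o), (w, op)) = e;
    let rev = ((w, !op), (v, !o));
    if e <= rev { e } else { rev }
}
```

Both orderings of a link map to the same canonical edge, so each link is written exactly once.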
Then we write named paths as P-lines and finally other paths as W-lines, both in an arbitrary order. We interpret GBWT metadata as described in Section 2.3.5. As extracting the paths from the GBWT index is the most expensive part of decompression, this should be done using multiple threads.

Experimental setup
The main GBZ library is written in C++. It builds on the original GBWT implementation. The graph supports the PathHandleGraph interface from libhandlegraph, exposing named paths through the interface. Other paths can be accessed using a custom interface. There is also a standalone Rust library that can use existing GBWT and GBZ structures but not build new ones.
There are some differences between the implementations. Because the C++ implementation uses Giraffe data structures, it decompresses and stores node labels in both orientations. The Rust implementation only stores the forward orientation. The GFA decompression algorithm in the C++ implementation uses additional memory to speed up decompression. It decompresses the node-to-segment translation and large (>1024-byte) GBWT node records for faster access. The Rust implementation uses the query interface directly without caching. Finally, the C++ implementation writes both edges and paths using multiple threads, while the Rust implementation only uses multiple threads for paths.
Both implementations use an external library for basic succinct data structures. The C++ implementation uses the vgteam fork of SDSL, which includes the sparse bitvector improvements and supports the Simple-SDS serialization format. The Rust implementation uses Simple-SDS.
As our test data, we used two 90-haplotype human graphs based on year 1 data from HPRC (Liao et al., 2022). Cactus was built using the Minigraph-Cactus pipeline (Hickey et al., 2022) with GRCh38 as the reference. PGGB was built using the Pangenome Graph Builder pipeline (Garrison et al., 2022). We also created a 1000GP dataset with 5008 human haplotypes by combining the full 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015) indexes from the Giraffe paper (Sirén et al., 2021). See Table 1 for details.
Chromosome 16 of the PGGB graph contains a highly collapsed repetitive region, where the average haplotype visits a few nodes tens of thousands or even hundreds of thousands of times. These nodes also have a large number of neighbors. This breaks the assumptions made by the GBWT, making compression much slower than expected. The C++ implementation successfully mitigates the issue during decompression by caching large node records. Hence a better in-memory representation for large records should be enough to fix this and other similar issues.
We used four systems for the experiments: Desktop (iMac 2020), Laptop (MacBook Air 2020), Intel Server (AWS i3.8xlarge), and ARM Server (AWS r6gd.8xlarge). See Table 2 for details. All data was stored on a local SSD. While we used different C++ compilers on different systems, the Rust compiler was always rustc version 1.58.1.
See Supplement 2-4 for further details.

Compression
We compressed the Cactus and PGGB datasets in the GBZ format using the compressor included in the C++ implementation. The resulting files were 3.6 times and 2.5 times smaller, respectively, than those produced by the general-purpose gzip compressor (see Table 1). The compression ratio improves further as the number of haplotypes increases, as seen with the 1000GP dataset. We measured the time and memory usage of GBZ compression and the time usage of gzip compression. (The memory usage of gzip compression is negligible.) The results can be seen in Table 3 and Table 4 for Cactus and PGGB, respectively. Because the input GFA is memory-mapped, any parts of the input stored in the disk cache are included in the memory usage of the compressor. We did not use Laptop for this benchmark due to its limited amount of memory.
With the Cactus dataset, multi-threaded GBZ compression required similar time to single-threaded gzip compression. Intel Server and ARM Server were faster than Desktop, both with individual GBWT construction jobs and by the total running time multiplied by the number of parallel jobs. This was likely caused by the higher memory bandwidth of the server processors.
Compression speed with the PGGB dataset was dominated by the highly collapsed region of chromosome 16. Without chromosome 16, compression would have finished in 60-90 minutes (depending on the system), which is slightly longer than the gzip compression time. The GBWT construction job for chromosome 16 alone took an order of magnitude longer. Dealing with such collapsed regions will require changes to the in-memory data structures used in the GBWT (see Section 4). Desktop was faster than the servers due to the higher single-core speed of its processor. Its memory usage was also lower, because there was not enough memory for caching the entire input.

In-memory structures
The GBZ file format has been designed for fast loading into in-memory data structures. In the C++ implementation, these are the same GBWT and GBWTGraph structures that are used in the Giraffe aligner. The structures in the Rust implementation are similar. While some structures must be rebuilt when loading the file and other parts are decompressed for faster access, most of the GBWT can be copied from the file to the data structures.
We measured the time and memory usage of loading GBZ files. The results can be seen in Table 3 and Table 4. Loading the structures took tens of seconds. Desktop and Laptop were faster than Intel Server and ARM Server due to their higher single-core performance. The C++ implementation used more memory than the Rust implementation, because it stores node labels in both orientations.
Laptop used less memory than the other systems, especially with the PGGB dataset. This is because we defined memory usage as the peak resident set size, which cannot exceed the amount of physical memory.

Table 1. Datasets and their properties. We list the size of the file in uncompressed and gzip-compressed GFA format as well as in the GBZ format. We also list the total length (in nodes) of the forward and reverse paths stored in the GBWT index and the number of lines of each type in the file.

Decompression
We decompressed the Cactus and PGGB datasets from GBZ format to GFA format on all four systems using the C++ and Rust decompressors. Time and memory usage can be seen in Table 3 and Table 4. We also measured the time used by gzip decompression for comparison.
With the Cactus dataset, the multi-threaded C++ decompressor was about as fast as the single-threaded gzip decompressor on macOS. The Linux version of gzip was several times slower. The Rust decompressor was also slower, because it uses the query interface directly without caching. ARM Server was faster than Intel Server due to having more CPU cores.
Chromosome 16 in the PGGB dataset caused issues again. The C++ decompressor managed to decompress it in a reasonable time, as it caches large GBWT node records. Decompression time was reasonable even on Laptop, which had only half the memory required for the in-memory data structures. This is because the memory access patterns during decompression are mostly sequential, so swapping does not slow it down too much. The Rust implementation was more than an order of magnitude slower than the C++ implementation. Gzip decompression was again slower in Linux than in macOS.

Scalability
The 1000GP dataset contains 55.6 times more haplotypes than the Cactus and PGGB datasets, and the total length of the paths is 240.7 times and 130.7 times higher, respectively. However, the GBZ file is only 5.41 times and 2.94 times larger, respectively. The GFA file would be too large to store explicitly on any of our systems, and even the gzip-compressed file would likely be too large for all systems except Intel Server.
We decompressed the GBZ file using the C++ implementation and piped the output to wc to determine the size of the GFA file. This took 12.8 hours / 45.6 GiB on Intel Server and 18.2 hours / 51.2 GiB on ARM Server. The wc tool was the bottleneck in decompression. On average, only about 8 parallel decompression threads were active on Intel Server, while on ARM Server the average was approximately 5.5.

Discussion
We have proposed the GBZ file format for pangenome graphs representing aligned genomes. The file format is based on data structures used in the Giraffe aligner, and it is the preferred graph format for the aligner. GBZ graphs are widely supported in vg (Garrison et al., 2018), and we also provide standalone libraries for using them in other software tools.
GBZ compresses GFA files with many similar paths well. The compression speed is competitive as long as the graph does not contain certain degenerate structures. A GBZ file can be decompressed quickly into a GFA file for tools that do not support GBZ. The graph and the paths in the file can also be loaded quickly into in-memory data structures.
There are two main data models for representing path metadata in pangenome graphs. The W-line / GBWT model is based on structured names and is appropriate for assembled genomes. The P-line / libhandlegraph model uses string names, which are harder to interpret but suitable for more diverse applications. GBZ supports both models. Some applications of pangenome graphs require specifying a set of paths that acts as a linear reference sequence. Neither model has a way of specifying whether the graph contains a preferred reference sequence or whether any sample is an equally valid choice. If such a convention emerges, GBZ can support it using tags.
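The contrast between the two models can be made concrete with a small sketch of parsing path metadata from GFA lines. The W-line fields (sample identifier, haplotype index, sequence name) follow the GFA 1.1 specification; the type and function names here are illustrative, not part of any library.

```rust
// Hypothetical sketch contrasting the two path metadata models.
#[derive(Debug, PartialEq)]
enum PathName {
    // W-line / GBWT model: structured name components.
    Structured { sample: String, haplotype: u32, contig: String },
    // P-line / libhandlegraph model: an opaque string name.
    Plain(String),
}

fn parse_path_name(line: &str) -> Option<PathName> {
    let fields: Vec<&str> = line.split('\t').collect();
    match *fields.first()? {
        // W <SampleId> <HapIndex> <SeqId> <SeqStart> <SeqEnd> <Walk>
        "W" if fields.len() == 7 => Some(PathName::Structured {
            sample: fields[1].to_string(),
            haplotype: fields[2].parse().ok()?,
            contig: fields[3].to_string(),
        }),
        // P <PathName> <SegmentNames> <Overlaps>
        "P" if fields.len() >= 3 => Some(PathName::Plain(fields[1].to_string())),
        _ => None,
    }
}
```

A structured name makes queries such as "all haplotypes of sample HG002 on chr1" trivial, while a plain string name must be interpreted by convention.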
The GBWT assumes that the nodes of the graph do not have too many neighbors and the paths do not visit the same nodes too many times. Pangenome graph construction pipelines sometimes produce graphs that violate these assumptions. Because the GBWT has to decompress the node record every time it computes LF-mapping from that node, this can cause significant performance loss. However, as we saw in the decompression benchmarks (Section 3.4), caching large records in a more suitable format helps to avoid the issue. While the format we used was only appropriate for extracting paths from the GBWT, a proper solution such as a run-length encoded wavelet tree should work in most situations. The ones from the DYNAMIC library (Prezza, 2017) look promising. However, using them in the GBWT may require significant engineering effort and changes to the query interface.
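The performance issue can be illustrated with a simplified model of a GBWT node record. Real records are bit-packed and store an outgoing edge table; in this sketch a record is just a sequence of (successor, length) runs, and the rank computation at the heart of LF-mapping scans them linearly. Records with very many runs, as in highly collapsed regions, therefore become expensive to query repeatedly.

```rust
// Simplified model of LF-mapping rank on a run-length encoded GBWT
// node record. The scan is linear in the number of runs, which is why
// huge records in collapsed regions cause significant slowdowns.
fn lf_rank(runs: &[(usize, usize)], offset: usize, successor: usize) -> usize {
    let mut rank = 0; // occurrences of `successor` before `offset`
    let mut pos = 0;  // current position in the decompressed record
    for &(succ, len) in runs {
        if pos >= offset {
            break;
        }
        if succ == successor {
            rank += len.min(offset - pos);
        }
        pos += len;
    }
    rank
}
```

A run-length encoded wavelet tree over the same runs would answer this rank query without scanning every run, which is what makes it an attractive replacement for degenerate records.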