## Abstract

Much of the DNA and RNA sequencing data available is in the form of high-throughput sequencing (HTS) reads and is currently unindexed by established sequence search databases. Recent succinct data structures for indexing both reference sequences and HTS data, along with associated metadata, have been based on either hashing or graph models, but many of these structures are static in nature, and thus, not well-suited as backends for dynamic databases.

We propose a parallel construction method for and novel application of the *wavelet trie* as a dynamic data structure for compressing and indexing graph metadata. By developing an algorithm for merging wavelet tries, we are able to construct large tries in parallel by merging smaller tries constructed concurrently from batches of data.

When compared against general compression algorithms and those developed specifically for graph colors (VARI and Rainbowfish), our method achieves compression ratios superior to gzip and VARI, converging to compression ratios of 6.5% to 2% on data sets constructed from over 600 virus genomes.

While marginally worse than compression by bzip2 or Rainbowfish, this structure allows for both fast extension and query. We also found that additionally encoding graph topology metadata improved compression ratios, particularly on data sets consisting of several mutually-exclusive reference genomes.

It was also observed that the compression ratio of wavelet tries grew sublinearly with the density of the annotation matrices.

This work is a significant step towards implementing a dynamic data structure for indexing large annotated sequence data sets that supports fast query and update operations. At the time of writing, no established standard tool has filled this niche.

## 1 Introduction

### 1.1 Background

The ever-decreasing cost of high-throughput sequencing (HTS) has led to massive growth in the availability of DNA and RNA sequencing data to researchers and the greater scientific community [21]. Several large-scale projects, such as the 1000 Genomes Project [4], UK10K [33], and many others [32, 35], have enabled us to much more extensively sample the genetic variation among humans and other organisms of interest. In addition to providing raw sequencing reads, follow-up projects such as ExAC and gnomAD have consolidated some of this data into large variant call sets to facilitate subsequent analysis [16]. However, the rate of sequencing data generation continues to exceed the rate at which these data can be indexed, processed, and analyzed [21].

Traditional sequence indexing and search methods, such as hash table-based seed-and-extend [2] or Burrows-Wheeler transform (BWT)-based [6] read-to-reference alignment [18], are optimized for relatively small^{1} databases of long reference sequences, and thus, HTS data have remained largely unindexed and not efficiently searchable. The development of metagenomics has complicated the issue even further, since sequence information from millions of as-of-yet uncharacterized organisms is available only in HTS data sets [32, 9]. Without sufficient sequencing data or reference genomes to properly assemble the genomes of individual species from these samples, much of this valuable data is currently difficult to process for specialists and inaccessible to non-specialists from the greater research community.

### 1.2 Recent models for metagenome indexing

Recent models for sequence indexing can be divided into two main groups: *hashing-based* and *graph-based*.

Hashing-based methods use probabilistic data structures for lossy or lossless compression of sequences, graph elements, or metadata, and allow for fast approximation of various queries, such as similarity between pairs of sequences [23], membership in a set of sequences [24], or subsequence counting [36]. Recently, sequence Bloom trees [29] and split-sequence Bloom trees [30] have been introduced for indexing HTS data. However, due to their use of Bloom filters for sequence matching, they require recomputation with different parameters as they saturate to maintain a given false positive rate. Each hashing method is typically optimized for performing a narrow range of queries, and thus, a separate copy must be stored for every query type supported.

Graph-based methods were first used for assembling short read sequencing data into long contiguous sequences (contigs) [27]. Most of these can be described as variants of de Bruijn graphs [34], overlap graphs [22], cactus graphs [25], and others [26]. The succinct representation of the uncompacted de Bruijn graph by Bowe, Onodera, Sadakane, and Shibuya [5] (henceforth referred to as BOSS) has acted as the basis for sequencing projects where the sheer sizes of the input data, such as metagenomics data sets, have necessitated trading increased running times for dramatically decreased storage [15, 17, 24, 20].

In order to use a de Bruijn graph as the backend for a sequence search method, however, an additional method must be developed to encode and compress associated metadata (which we refer to as *graph colorings*). When query sequences are mapped to paths on the graph, these paths induce sequences of colors annotating sections along the paths. Colors on edges can be used as indicators for various metadata categories (given some ordering of categories), such as their presence in certain samples, genetic structures, or their implications in diseases [20]. These are encoded as a large bit matrix (which we refer to as the *annotation matrix*), with one row for each edge and one column for each metadata category. One of the early methods for color encoding is the positional BWT [8], where sample haplotypes are encoded as bit vectors on a reference sequence and a BWT is applied before they are compressed. This method has also been extended to work with positions on string graphs [22].

Recent methods using the BOSS representation for colored de Bruijn graphs, such as Bloom filter tries [14], VARI [20], Rainbowfish [1], and deBGR [24], have proposed methods for succinct compression of graph colors. The VARI pipeline concatenates the rows of its bit matrix and compresses by Elias-Fano [11, 10] or Raman-Raman-Rao (RRR) [28] coding depending on the proportion of set bits [20]. The method in Rainbowfish builds on this by computing Huffman codes for the edge colors and compressing the concatenations of the codes by RRR coding [1]. Bloom filter tries are a probabilistic data structure for storing the edge labels and colors of a colored de Bruijn graph [14], while deBGR encodes these in a quotient filter and uses the colors of neighboring edges for error correction [24]. Although Rainbowfish achieves the best compression ratios among the lossless methods, its use of Huffman codes does not take full advantage of correlations between columns in the annotation matrix. In addition, it requires the distribution of the edge color frequencies to be known beforehand.

These methods rely on static data structures for optimal compression, requiring full decompression, extension, and recompression steps to perform edits. In the case of Rainbowfish, extensions can potentially reduce the efficiency of the coding if novel colors follow a different distribution from those which were used to compute the codes. In the worst case, a full recomputation of all Huffman codes would be required at regular intervals to maintain a desired compression efficiency. The static nature of these methods renders them inadequate for use as backends in dynamic sequence databases.

We have recently been developing methods for the fast construction and storage of the BOSS representation of de Bruijn graphs in both static (for fast querying) and dynamic (for fast updates) data structures, with the ability to convert between internal representations depending on the desired types of user interaction. In this work, we further extend these methods by proposing a dynamic data structure for graph coloring compression that takes advantage of correlations between columns of the annotation matrix and can also be combined with other models for sequence indexing.

### 1.3 Wavelet tries: a dynamic data structure for annotation compression

For the compression of graph colorings, we propose a novel application of the *wavelet trie* data structure [13]. Briefly, a wavelet trie is an extension of the concept of a wavelet tree and takes the shape of a compact prefix tree (a binary radix trie). Instead of compressing strings over a fixed alphabet, wavelet tries compress tuples of bit vectors, where each vector is the binary encoding of a string over an alphabet of arbitrary size. This allows the structure to compress dynamic strings over arbitrary alphabets by finding common contiguous subsequences (or *segments*) among the bit vector encodings of its characters. In the context of genome graph coloring, the tuple of bit vectors can be defined as the rows of the annotation matrix in which the number of columns grows to encode new categories of metadata. In the worst case, the height of a wavelet trie is equal to the length of the longest bit vector being compressed (the case where the common prefix is empty at every internal node).

To the authors’ knowledge, no implementation of this data structure has been reported. In this manuscript, we present an implementation employing a parallel construction strategy via wavelet trie merging. The merging algorithm presented is a generalization of the algorithm provided by the original manuscript for appending novel bit vectors to an existing wavelet trie [13].

## 2 Methods

### 2.1 Succinct de Bruijn graph construction

The Bowe, Onodera, Sadakane, and Shibuya (BOSS) representation of the de Bruijn graph was chosen as the underlying genome model for this study [5]. Let *k* be some fixed positive integer and *G* be a de Bruijn graph of order *k*. When the edges of *G* are sorted by the reverse lexicographical ordering of the *k*-mer labels of their respective source nodes (using their own edge labels as tie breakers), only the last character of each *k*-mer (vector *F*), the edge labels (vector *W*), and two auxiliary bit vectors (vectors *ℓ* and *W*^{−}) need to be stored to represent the graph. *ℓ* is an indicator for the last outgoing edge of a node, while *W*^{−} is an indicator for all but the first edge leading to a node with an in-degree greater than one. In this representation, there is a one-to-one correspondence between *F* and *W*, where the *i*th occurrence of a character *c* in *W* with *W*^{−} value 0 corresponds to the *i*th occurrence of *c* in *F* with *ℓ* value 1 [5]. Construction of the BOSS representation of de Bruijn graphs is done using a binned parallel approach [5, 17].

### 2.2 Graph coloring during de Bruijn graph construction

Colors are computed for each edge of the de Bruijn graph during construction based on the metadata of the input sequences from which they are derived. During *k*-mer enumeration, assign each unique metadata category a positive integer ID and use these IDs to assign each *k*-mer a list of category IDs corresponding to its associated metadata categories. Then, convert the list of IDs to a bit vector (called an *edge color*) such that the IDs determine which bits in the vector are set to 1. When duplicate edges are removed during graph construction, combine their respective bit vectors via bitwise OR operations to define the new color of the remaining edge. Alongside the succinct de Bruijn graph, this process results in an auxiliary *annotation matrix* with *n* rows corresponding to the edges of the graph and *m* columns corresponding to the total number of unique metadata strings observed during construction. The resulting graph-matrix pair is a *colored de Bruijn graph*. When this graph is queried, sequences are mapped to a path (a sequence of edges) and a corresponding sequence of annotation matrix rows.
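The coloring procedure above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the use of Python integers as bit masks, and the plain lexicographic row order (instead of the BOSS edge ordering) are assumptions made for illustration only.

```python
from collections import defaultdict

def color_edges(kmer_records):
    """Assign category IDs and OR together the colors of duplicate k-mers.

    kmer_records: iterable of (kmer, category) pairs (hypothetical input format).
    Returns (category -> positive integer ID, kmer -> color as an int bit mask).
    """
    category_ids = {}
    colors = defaultdict(int)
    for kmer, category in kmer_records:
        # A new category receives the next unused positive integer ID.
        cat_id = category_ids.setdefault(category, len(category_ids) + 1)
        # Duplicate k-mers are collapsed by a bitwise OR of their colors.
        colors[kmer] |= 1 << (cat_id - 1)
    return category_ids, dict(colors)

def annotation_matrix(colors, m):
    """One row per distinct k-mer (sorted lexicographically here), one column per ID."""
    return [[(colors[kmer] >> j) & 1 for j in range(m)] for kmer in sorted(colors)]

records = [("ACG", "s1"), ("CGT", "s1"), ("ACG", "s2"), ("GTA", "s3")]
ids, colors = color_edges(records)
matrix = annotation_matrix(colors, len(ids))
```

In this toy input, the *k*-mer `ACG` occurs in two categories, so its row has two set bits after the OR step.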

### 2.3 Graph color compression with wavelet tries

To greatly reduce the required storage space of the annotation matrix, while allowing for dynamic extension and random access to matrix rows, we chose to employ the *wavelet trie* data structure.

**Definitions and notation** Given a bit vector *b* ∈ {0,1}* (a finite string over the binary alphabet {0,1}), we use the notation |*b*| to refer to its length, *b*[*i*] to refer to its *i*th character, 1 ≤ *i* ≤ |*b*|, *b*[*j : k*] to refer to the bit vector *b*[*j*] ⋯ *b*[*k*], *b*[: *k*] to refer to its prefix *b*[1 : *k*], and *b*[*j* :] to refer to its suffix *b*[*j*] ⋯ *b*[|*b*|]. The empty vector is denoted *ε*.

The function **rank**_{0}(*b, j*) (**rank**_{1}(*b, j*)) counts the number of 0 (1) characters in *b*[: *j*], while **select**_{0}(*b, j*) (**select**_{1}(*b, j*)) returns the index of the *j*^{th} 0 (1) in *b*. Also, we will use the notation 2^{A} to denote the power set of a set *A* and abuse the notation |·| to refer to both set cardinalities and bit vector lengths.
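As a concrete reading of these definitions, the following Python sketch implements **rank** and **select** naively on a list of bits, using the 1-based indexing of the notation above. The linear-time loops are an illustrative assumption; succinct implementations answer these queries far faster on compressed bit vectors.

```python
def rank(b, j, bit=1):
    """Number of occurrences of `bit` among the first j characters of b, i.e. in b[:j]."""
    return sum(1 for x in b[:j] if x == bit)

def select(b, j, bit=1):
    """1-based index of the j-th occurrence of `bit` in b."""
    seen = 0
    for idx, x in enumerate(b, start=1):
        if x == bit:
            seen += 1
            if seen == j:
                return idx
    raise ValueError("b contains fewer than j occurrences of the requested bit")

b = [1, 0, 0, 1, 1, 0]
```

The two functions are inverse to each other in the sense that `rank(b, select(b, j, bit), bit) == j` whenever the *j*th occurrence exists.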

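**Construction** The wavelet trie encoding the annotation matrix *A* ∈ {0,1}^{n×m} is constructed recursively and is a binary tree of the form *T* = (*V,E*) (see Figure 1), where its nodes *n*_{j} ∈ *V*, *j* ∈ {1,…, |*V*|} are of the form

$$n_j = (\alpha_j, \beta_j), \qquad \alpha_j, \beta_j \in \{0,1\}^*.$$

The *α*_{j} are referred to as the *longest common prefixes* (LCPs) and the *β*_{j} are referred to as the *assignment vectors*.

The algorithm starts with the *root* node *n*_{1}. We define the initial set of input bit vectors to be the rows of *A*, *B*_{1} = (*b*^{1},…, *b*^{n}), where *b*^{i} is the *i*th row of *A* for all *i* ∈ {1,…, *n*}.

On the *j*th iteration, for a list of input bit vectors *B*_{j} = (*b*^{1},…, *b*^{ℓ}), *b*^{i} ∈ {0,1}*, ∀*i* ∈ {1,…,*ℓ*}, compute *n*_{j} as follows: Compute the longest common prefix *α*_{j} ← **LCP**(*B*_{j}) for the bit vectors in *B*_{j}. Formally, this function is defined as follows, **LCP** : 2^{{0,1}*} → {0,1}*,

$$\mathrm{LCP}(\{b^1,\dots,b^\ell\}) = b^1[:k], \qquad k = \max\left\{k' \in \{0,\dots,\min_i |b^i|\} \;:\; b^1[:k'] = \dots = b^\ell[:k']\right\},$$

with the convention *b*[: 0] = *ε*.

If the computed *α*_{j} matches all the input bit vectors, *n*_{j} is referred to as a *leaf* and we let the assignment vector be *β*_{j} ← *ε*. Then terminate this branch. Otherwise, set the assignment vector to be the concatenation of the next significant bits in each of the *b*^{i}, *i* ∈ {1,…, *ℓ*} after removing the common prefix *α*_{j},

$$\beta_j \leftarrow b^1[|\alpha_j|+1] \cdots b^\ell[|\alpha_j|+1].$$

Continue the recursion on the child nodes *n*_{2j} and *n*_{2j+1} with the new sets of bit vectors *B*_{2j} and *B*_{2j+1}, respectively, which are defined by partitioning *B*_{j} based on *β*_{j} and removing the first |*α*_{j}|+1 bits (the common prefix and the assigned bit), so that each remaining vector is the suffix starting at index |*α*_{j}|+2,

$$B_{2j} = \left(b^i[|\alpha_j|+2:] \;\middle|\; \beta_j[i] = 0\right), \qquad B_{2j+1} = \left(b^i[|\alpha_j|+2:] \;\middle|\; \beta_j[i] = 1\right).$$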
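The recursive construction can be sketched in a few lines of Python. This is a simplified illustration and not the implementation described in this manuscript: it assumes all rows have the same length, stores each node as an uncompressed `(alpha, beta)` pair in a dictionary keyed by the node index *j* (with children at 2*j* and 2*j*+1), and uses 0-based row indices. The helper names `lcp`, `build`, and `access` are hypothetical.

```python
def lcp(vectors):
    """Longest common prefix of a non-empty list of bit lists."""
    prefix = []
    for bits in zip(*vectors):
        if any(bit != bits[0] for bit in bits):
            break
        prefix.append(bits[0])
    return prefix

def build(rows, j=1, trie=None):
    """Build a wavelet trie as a dict: node index j -> (alpha_j, beta_j)."""
    if trie is None:
        trie = {}
    alpha = lcp(rows)
    k = len(alpha)
    if all(len(r) == k for r in rows):
        trie[j] = (alpha, [])             # alpha matches every vector: leaf
        return trie
    beta = [r[k] for r in rows]           # next bit of each vector after the prefix
    trie[j] = (alpha, beta)
    # Partition by the assigned bit and drop the prefix plus that bit.
    build([r[k + 1:] for r in rows if r[k] == 0], 2 * j, trie)
    build([r[k + 1:] for r in rows if r[k] == 1], 2 * j + 1, trie)
    return trie

def access(trie, i):
    """Decode row i (0-based) by walking from the root to a leaf."""
    j, row = 1, []
    while True:
        alpha, beta = trie[j]
        row += alpha
        if not beta:
            return row
        bit = beta[i]
        row.append(bit)
        i = beta[:i].count(bit)           # rank of `bit` before position i
        j = 2 * j + bit

A = [[1, 0, 1, 0], [1, 0, 0, 1], [1, 0, 1, 0], [0, 1, 1, 1]]
trie = build(A)
```

Row retrieval only needs one rank query per visited node, which is why compressing the assignment vectors with a rank-supporting encoding keeps queries fast.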

### 2.4 Parallel construction via wavelet trie merging

To allow for parallel construction of wavelet tries, we developed an algorithm to merge wavelet tries as a generalization of the wavelet trie extension method [13]. Merging proceeds by performing an *align* and a *merge* step on each node, starting from the root (see Figure 2 for an illustration of the process). Given two wavelet tries *T*′ and *T*″ with node sets *V*′ and *V*″ that we want to merge into a new trie *T*, the merging process can be summarized as:

- **Align**: for the pair of nodes at index *j* in *T*′ and *T*″, compute the longest common prefix of their *α* vectors and make new nodes with this value and appropriate *β* vectors, setting these to be the parents of the current nodes,
- **Merge**: once the two prefixes are equal, concatenate the two assignment vectors,
- **Repeat**: move down to *j*’s children and apply the same function until all leaves are reached.

In the context of compressing the edge colors of a de Bruijn graph, this method assumes that the columns of two wavelet tries being merged are indicators for matching metadata categories.

For this method, we define the *descendants* function **D** : {1,…, |*V*|} → 2^{{1,…,|V|}} for the wavelet trie *T* = (*V, E*) with nodes by the recurrence

The three steps in the merging operations are as follows:

**Align** Given nodes and , we compute their longest common prefix

If , we let
and update the indices in *T*′ by applying the transformation and updating all nodes *k* ∈ **D**(*j*) accordingly. We then let and and truncate the prefix in the newly created child nodes,

If , the second trie is processed accordingly.

**Merge** If and , then terminate. Otherwise, if , set (of length ). (similarly for ). Then, merge the two assignment vectors

**Repeat** The merging algorithm is then performed on nodes *n*_{2j} and *n*_{2j+1} depth-first to continue the recursion.

If two wavelet tries constructed from bit vectors of different lengths are merged, this merging algorithm leads to the decoding of bit vectors with trailing zeros. Since we intend to use these vectors as indicators for various metadata, the presence of extra trailing zeros in the decoded bit vector does not represent false information.
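To make the align, merge, and repeat steps concrete, the following Python sketch merges two small tries and uses the merge to build a trie from batches. It is an illustration under several assumptions and not the implementation described here: nodes are pointer-based objects that additionally store their row count `n` (an assumption needed to pad leaves), alignment creates new parent nodes rather than re-indexing descendants, assignment vectors are uncompressed lists, and the names `Node`, `push_down`, `pad_leaf`, `merge`, and `build_parallel` are hypothetical. Threads are used only to show the batch structure; in CPython they do not provide true parallelism for this workload.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

class Node:
    """Trie node: prefix alpha, assignment vector beta, [left, right] children, row count n."""
    def __init__(self, alpha, beta, children, n):
        self.alpha, self.beta, self.children, self.n = alpha, beta, children, n

def lcp(u, v):
    k = 0
    while k < len(u) and k < len(v) and u[k] == v[k]:
        k += 1
    return u[:k]

def build(rows):
    """Build a trie from a non-empty list of equal-length bit lists."""
    alpha = reduce(lcp, rows)
    k = len(alpha)
    if all(len(r) == k for r in rows):
        return Node(alpha, [], [None, None], len(rows))
    kids = [build([r[k + 1:] for r in rows if r[k] == bit]) for bit in (0, 1)]
    return Node(alpha, [r[k] for r in rows], kids, len(rows))

def push_down(node, k):
    """Align: keep alpha[:k] in a new parent and move the remainder into a child."""
    bit = node.alpha[k]
    kids = [None, None]
    kids[bit] = Node(node.alpha[k + 1:], node.beta, node.children, node.n)
    return Node(node.alpha[:k], [bit] * node.n, kids, node.n)

def pad_leaf(node):
    """Turn a leaf into an internal node whose rows continue with a 0 bit."""
    return Node(node.alpha, [0] * node.n, [Node([], [], [None, None], node.n), None], node.n)

def merge(a, b):
    """Return a trie encoding the rows of `a` followed by the rows of `b`."""
    if a is None or b is None:
        return a if b is None else b
    k = len(lcp(a.alpha, b.alpha))
    a = push_down(a, k) if len(a.alpha) > k else a
    b = push_down(b, k) if len(b.alpha) > k else b
    if not a.beta and not b.beta:              # two leaves with equal prefixes
        return Node(a.alpha, [], [None, None], a.n + b.n)
    a = a if a.beta else pad_leaf(a)
    b = b if b.beta else pad_leaf(b)
    kids = [merge(x, y) for x, y in zip(a.children, b.children)]
    return Node(a.alpha, a.beta + b.beta, kids, a.n + b.n)

def access(node, i):
    """Decode row i (0-based)."""
    row = []
    while True:
        row += node.alpha
        if not node.beta:
            return row
        bit = node.beta[i]
        row.append(bit)
        i = node.beta[:i].count(bit)
        node = node.children[bit]

def build_parallel(rows, batch_size):
    """Build one trie per batch concurrently, then merge them in input order."""
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    with ThreadPoolExecutor() as pool:
        return reduce(merge, pool.map(build, batches))

A1 = [[1, 0, 1, 0], [1, 0, 0, 1]]
A2 = [[1, 0, 1, 0], [0, 1, 1, 1], [1, 0, 1, 1]]
merged = merge(build(A1), build(A2))
```

In this sketch, a leaf that meets an internal node, which can only happen when the tries were built from rows of different lengths, is padded with a 0 bit; this corresponds to the trailing zeros discussed above.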

### 2.5 Computational complexity of wavelet trie operations

Let *A* ∈ {0,1}^{n×m} be an annotation matrix. The height of a constructed wavelet trie *T* = (*V, E*) depends on the degree to which the bit vectors share common segments, with the worst-case value being *h* ≤ min(*n, m*) when no segments are shared. Since there can be at most *n* leaves, and the maximum height of the tree is at most *m*, the number of nodes can be at most |*V*| ≤ min(*n*, 2^{m}).

Given two wavelet tries *T*_{1} and *T*_{2}, merging is performed in time. Once a tree is constructed, queries can be performed in time. To achieve this value, the *β _{j}* are compressed with RRR coding [28] to support rank operations in time.

### 2.6 Improving compression ratios using graph topology

One of the advantages of maintaining a graph-based model for genome storage is its ability to efficiently represent an ordering on the *k*-mers. On the other hand, the ordering provided by the BOSS representation is rarely optimal for compressing adjacent edge colors by run-length encoding, since adjacent edges frequently come from different samples. There is, however, an additional degree of freedom in that the ordering of annotation matrix columns is not fixed. Since the compression ratio of a wavelet trie depends on both the ordering of the rows (defining the compressibility of the *β*s) and the similarity of the bit vectors after each assignment during construction (which defines the height and balance of the tree), we explored whether graph structure could be used to provide additional prefix bits that help optimize row segregation.

If we assume that a certain set of paths in the de Bruijn graph (i.e., those corresponding to reference genomes) act as *backbones* (whose indicator columns in the annotation matrix we refer to as *backbone bits*), while other paths represent sequence variation, then it is expected that the edge colors of backbone paths are highly correlated with those of the variation paths. The edge colors of variation paths can be supplemented by setting the columns of their corresponding backbones to 1.

We now describe this process more precisely. Let *b*^{1},…, *b*^{n} be the rows of the input annotation matrix and let *C* = {1,…, *m*}, where *m* is the number of columns of the annotation matrix. Let *R* ⊂ *C* be the set of indices/IDs generated from backbone paths/genomes and let *P* map elements of *R* to their corresponding paths in *G*. The user provides a map **B** : *C* → *C* such that *R* = {*j* | **B**(*j*) = *j*, *j* ∈ *C*} is the set of fixed points of **B**. Then, for each *j* ∈ *C* and each *b*^{i} s.t. *b*^{i}[*j*] = 1, we set *b*^{i}[**B**(*j*)] ← 1.

When this process is not followed, we say that the backbone bits are *unset*, whereas applying this process results in the backbone bits being *set*. For example, given an index *i* corresponding to a backbone and *j* corresponding to a variant, we say that the backbone bit is set if *i* = *j* = 1 and unset if *i* = 1 and *j* = 0.
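The backbone-bit procedure amounts to a single pass over the annotation matrix, sketched below in Python. The function name, the dictionary representation of the map **B**, and the 0-based column indices are assumptions made for this illustration.

```python
def set_backbone_bits(rows, backbone_of):
    """Return a copy of `rows` in which each set bit j also sets column backbone_of[j].

    rows: list of bit lists (annotation matrix rows).
    backbone_of: dict mapping every column j to its backbone column B(j);
                 backbone columns map to themselves (fixed points of B).
    """
    out = [list(r) for r in rows]
    for original, updated in zip(rows, out):
        for j, bit in enumerate(original):
            if bit:
                updated[backbone_of[j]] = 1
    return out

# Hypothetical layout: columns 0 and 3 are backbones; 1 and 2 are variants of 0, 4 of 3.
B = {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}
unset_rows = [[0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [1, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
set_rows = set_backbone_bits(unset_rows, B)
```

Rows that belong to variants of the same backbone now share a set bit in the backbone column, which is the property the wavelet trie can exploit when partitioning rows.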

## 3 Results

The following section covers our evaluation of the wavelet trie data structure on a variety of data sets. This includes a comparison of its compression ratio against general compression algorithms and to those developed specifically for graph colors. In addition, we evaluate the hypothesis that setting backbone bits using prior knowledge improves compression ratios. Finally, we study how the compression ratio of wavelet tries behaves as a function of the number of metadata categories *m* and the density of the annotation matrix (the ratio of the number of bits set to 1 and *nm*) on a linear hierarchy (called a *chain*) of models ranging from 50 to 1000 virus genomes.

### 3.1 Data sets

Data sets originating from viruses (Virus100 and Virus1000), bacteria (simply Bacteria), and humans (chr22+gnomAD and hg19+gnomAD) are used in this study to construct graphs with varying topologies to study their effects on the wavelet trie’s compression ratios. See Appendix Section A.1 for a precise description of the data sets used.

### 3.2 Wavelet trie compression ratios similar to gzip and bzip2, and better than previous methods

As baseline comparisons, the compression ratio of wavelet tries was compared to those of the standard UNIX compression utilities gzip and bzip2 (see Table 1). gzip implements the DEFLATE algorithm, which combines LZ77 with Huffman coding and encodes blocks of text, while bzip2 performs a sequence of transformations, including run-length encoding, BWT, move-to-front transforms, and Huffman coding. In addition, the compression performance of wavelet tries was compared to other methods developed specifically for annotation matrices on succinct de Bruijn graphs, such as the methods presented in VARI [20] and Rainbowfish [1]. We measured *compression performance* as the number of bits stored by the structures (which we denote *s*) divided by the total number of bits in the matrices (*nm*).
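The compression performance measure *s*/(*nm*) and the matrix density can be computed as in the following Python sketch. It uses the `zlib` (DEFLATE) and `bz2` modules of the Python standard library as stand-ins for the gzip and bzip2 utilities. The row-major bit packing and the toy matrix are assumptions; how the matrices were serialized for the baseline measurements reported here is not specified in the text.

```python
import bz2
import zlib

def pack_bits(rows):
    """Concatenate the rows of a bit matrix and pack them into bytes (zero-padded)."""
    flat = [bit for row in rows for bit in row]
    flat += [0] * (-len(flat) % 8)
    return bytes(
        sum(bit << (7 - k) for k, bit in enumerate(flat[i:i + 8]))
        for i in range(0, len(flat), 8)
    )

def compression_performance(rows, compress):
    """s / (n * m): compressed size in bits divided by the number of matrix entries."""
    n, m = len(rows), len(rows[0])
    s = 8 * len(compress(pack_bits(rows)))
    return s / (n * m)

def density(rows):
    """Fraction of matrix entries that are set to 1."""
    n, m = len(rows), len(rows[0])
    return sum(map(sum, rows)) / (n * m)

toy = [[1, 0, 0, 0, 1, 0, 0, 0]] * 4000   # hypothetical, highly redundant matrix
for name, compress in (("zlib", zlib.compress), ("bz2", bz2.compress)):
    print(name, compression_performance(toy, compress))
print("density", density(toy))
```

Smaller values indicate better compression under this measure, so a value of 0.02 corresponds to the 2% ratio quoted for the largest virus graph.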

The results indicate that wavelet trie compression outperforms gzip and the VARI method. bzip2 and the Rainbowfish method achieve similar compression ratios and slightly outperform our method. The Virus1000 data set is notable in that wavelet tries exhibit the worst compression performance among the methods tested, though much better results were achieved when backbone bits were set. At the time of writing, the VARI method was unable to compress the annotations for the Virus100, Virus1000, and hg19 data sets. Setting backbone bits led to a three-fold improvement in the compression performance on the Virus1000 data set (from 67.174 bits per edge to 22.756), marginal improvements in the compression performance on the bacterial and chr22 data sets, and a marginal decrease in performance on the Virus100 and hg19 data sets.

### 3.3 Setting backbone bits improves compression ratios

To test the hypothesis that setting backbone bits (which by definition tend to occur in columns with lower indices) reduces compression ratios, 100 random shufflings of the column ordering in the Bacteria and Virus100 data sets were generated and the resulting data compressed to approximate the null distribution of compression ratios (see Figure 3). The results indicate that the original ordering of columns was optimal with respect to the defined null distribution when backbone bits were set. As a negative control, when the backbone bits were unset, the resulting compressed file sizes did not significantly differ from the means of the null distributions.

### 3.4 Wavelet trie size grows linearly with increased unique compression size

To test the scalability of wavelet trie compression, we generated a *chain* (a linear hierarchy) of virus graphs ranging from 50 to 1000 random genomes in steps of 50 (i.e., *G*_{1} ⊂ ⋯ ⊂ *G*_{20}) and measured the compression ratios of the annotations for each graph. The compression ratios for the Virus50 to Virus1000 graphs exhibit an exponential drop from 12.5% to 6.5% when backbone bits are not set and a steeper drop from 12% to 2% when backbone bits are set (see Figure 4). The compression ratios grew sublinearly as the density of the annotation matrices grew, with little difference in the growth characteristics with and without the backbone bits set. By definition, the matrix densities tended to be marginally greater when backbone bits were set.

## 4 Discussion and Conclusions

In this study, we have tackled the problem of encoding sequence metadata as edge colors to construct succinct colored de Bruijn graphs. Given a binary matrix encoding of metadata (with one row per edge and one column per metadata category), we have presented a parallel construction method and novel application of the wavelet trie data structure for matrix compression. The construction method builds smaller wavelet tries on batches of data and merges them to form the full trie, performing every step in a multithreaded fashion. The resulting structure is dynamic in that novel edge colors of arbitrary size can be appended. In addition, we have demonstrated that when using indicators for the backbone regions of the de Bruijn graph positioned in low-index columns of the annotation matrix, we are able to improve compression ratios by assisting edge color segregation during wavelet trie construction. Thus, we are able to take advantage of graph topology to improve compression performance.

The results on the Virus1000 set of graphs indicate that our implementation of wavelet tries, for sufficiently large graphs, stabilizes at compression ratios of 6.5% and 2% when backbone bits are unset and set, respectively. The data structure is less efficient on smaller graphs due to the greater significance of the employed data structures’ overhead. The high variability among viral sequences led to each added batch of sequences, on average, adding a constant amount of information to the graph and its compressed annotation. With regard to their effect on trie structure, each new batch creates a split in a node close to the root, forming a large separate subtrie, and thus, reducing the chances of such splits occurring with each subsequent extension. While these graphs represent cases of relatively few sequences with modest metadata category counts, larger graphs, such as those constructed on larger collections of eukaryotic genomes, will need to be constructed to further study the wavelet trie’s growth characteristics.

One significant limitation of wavelet tries is their reliance on shared segments (contiguous subsequences), especially in the first few columns of the annotation matrix, to effectively partition the rows for optimal compression. While this is partially addressed by setting backbone bits in the annotation matrix, a more principled approach with less user input will become necessary in future releases. This would involve an analysis of the de Bruijn graph topology to algorithmically determine paths to use for backbone bits.

An alternate approach for compressing graph colors which does not rely on this analysis would be to use a directed acyclic graph as a model, as is typically applied for compressing dictionaries [3]. With this structure, matching nodes in different branches are merged into single nodes to further reduce redundancy. However, the computational complexity of dynamically maintaining such a structure, and the added complexity of preserving the ordering of the annotation rows (to support row index queries) are challenges that must be addressed for such an approach to be appropriate for compressing graph colors.

With regard to the original stated purpose of developing an annotated sequence graph for indexing reference sequences and sequencing reads, an ideal implementation of such a database would (1) employ a dynamic data structure which (2) supports fast queries and updates, and is (3) optimal with regard to storage. At the time of writing, no database solution fulfilling all three criteria on this type of data has been published. One observation, however, is that in real world applications, database updates are much less frequent than queries. Thus, a dynamic data structure with slower query times may, on average, underperform compared to one involving a static data structure with fast query times that is updated periodically. We propose a solution in which the backend is able to switch between static and dynamic states efficiently. This way, the average time complexity of query and update operations is kept low through amortization.

## 5 Acknowledgements

We would like to thank Torsten Hoefler, and the Biomedical Informatics Group at ETH Zurich, in particular Amir Joudaki, Viktor Gal, and Gideon Dresdner for helpful discussions and criticism. This project was funded by the Swiss National Science Foundation (SNF) grant #407540-167331 “Scalable Genome Graph Data Structures for Metagenomics and Genome Annotation” as part of Swiss National Research Programme (NRP) 75 “Big Data”. The authors declare no conflicts of interest.

## A Appendix

## A.1 Data sets

### Bacteria

This data set is composed of 45 strains of bacteria in GenBank [7] from the *Lactobacillus* species *acidophilus, amylovorus, brevis, buchneri*, and *casei*. The columns in this graph’s annotation matrix indicate presence of an edge in each of the strains. Because of the low variability in the input sequences, they are represented as a graph with a predominantly linear topology and short variant paths (called *bubbles*). One genome from each of the species was chosen as a backbone path. The resulting graph had 18,669,398 unique *k*-mers, 18,713,013 edges, and 536 unique edge colors (i.e., bit combinations). See Appendix Section A.3 for a list of the bacterial strains used.

### Virus1000

This data set is composed of 1000 virus genomes randomly selected from GenBank, meant to study a graph whose topology is a series of almost mutually-exclusive loops with slight variation. The columns in this graph’s annotation matrix indicate presence of edges in each of the virus genomes. Similar to the Bacteria data set, the viruses were grouped by the first word of their names and the first species in each group was assigned as a backbone path. The resulting annotation bit matrix is very sparse and adjacent rows are either almost identical or almost mutually exclusive. This graph contains 15,342,369 unique *k*-mers, 15,360,442 edges, and 10,585 unique edge colors.

### Virus100

This is a subset of the Virus1000 set containing only 100 virus strains used to facilitate the permutation tests in Section 3.3. This graph contains 2,051,777 unique *k*-mers, 2,052,501 edges, and 284 unique edge colors.

### chr22+gnomAD

This graph consists of chromosome 22 from the `hg19` assembly of the human reference genome as the main reference backbone. To provide genetic variability, the set of exome variants from the gnomAD data set were incorporated into the graph [16]. This larger data set is meant to analyze the properties of the trie when the underlying graph is large, but with little variability. The columns in this graph’s annotation matrix are defined as indicators for its edges’ presence in 9 ethnic groups defined in the data set. The first column in the matrix is used to indicate edges which are present in the reference genome and serves as the backbone bit. The graph contains 54,386,415 unique *k*-mers, 54,723,569 edges, and 595 unique edge colors.

### hg19+gnomAD

This graph was constructed from the same data sets as the one described above, using data from the full human autosome. The same definition is used for the annotation matrix columns, with 9 columns being used to indicate edges observed in the defined ethnic groups and 22 prefix columns being used to indicate presence in the first 22 reference chromosomes as the backbone bits. This graph’s topology was designed to be analogous to the Virus1000 data set, but with 1000 × the number of rows and one-tenth of the number of annotation columns. It contains 2,880,005,212 unique *k*-mers, 2,886,801,846 edges, and 320,856 unique edge colors.

## A.2 Implementation and source code availability

All algorithms were implemented in C++14 using the `boost` (arbitrary precision integers), `htslib` (VCF parsing) [19], `sdsl-lite` (static succinct data structures) [12], and `libmaus2` (dynamic succinct data structures) [31] libraries. Wavelet tries are stored in memory in a fashion similar to linked lists, with `Node` objects containing pointers to the objects that define their children. For serialization, this structure is packed into a `std::unordered_map` data structure mapping node indices to `Node` objects.

Our implementation is provided as a header-only library and a standalone executable at `http://www.github.com/ratschlab/metannot`.

## A.3 List of bacterial strains used

- *Lactobacillus acidophilus*
  - 30SC (uid 63605)
  - La 14 (uid 201479)
  - NCFM (uid 57685)
- *Lactobacillus amylovorus*
  - GRL 1112 (uid 61179)
  - GRL1118 (uid 160233)
- *Lactobacillus brevis*
  - ATCC 367 (uid 57989)
  - KB290 (uid 195560)
- *Lactobacillus buchneri*
  - NRRL B 30929 (uid 66205)
  - uid 73657
- *Lactobacillus casei*
  - ATCC 334 (uid 57985)
  - BD II (uid 162119)
  - BL23 (uid 59237)
  - LC2W (uid 162121)
  - LOCK919 (uid 210959)
  - W56 (uid 178736)
  - Zhang (uid 50673)

## A.4 List of virus strains used

Due to their large number, the lists of virus strains used are made available in the previously-linked GitHub repository.

## Footnotes

harun.mustafa{at}inf.ethz.ch, andre.kahles{at}inf.ethz.ch, mikhaika{at}student.ethz.ch, raetsch{at}inf.ethz.ch

^{1} In terms of the number of sequences.