Abstract
High-throughput DNA sequencing data is accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and allow for efficient query of sequences. In particular, the concept of colored de Bruijn graphs has been explored by several groups. While there has been good progress towards representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the labels. In this work, we present a systematic analysis of five different state-of-the-art annotation compression schemes that evaluates key metrics on both artificial and real-world data and discusses how different data characteristics influence the compression performance. In addition, we present a new approach, Multi-BRWT, that shows an up to 50% improvement in compression performance over the current state-of-the-art and is adaptive to different kinds of input data. Using our comprehensive test datasets, we show that this improvement can be robustly reproduced for different representative real-world datasets.
1 Introduction
Over the past decade, there has been an astronomical growth in the global capacity for generating DNA sequencing data [29]. Various sequencing efforts have started to amass data from populations of humans [30] and other organisms [31, 32]. For these well studied organisms, already assembled reference sequences are the common starting point for comparative and functional analyses. Unfortunately, a large proportion of DNA sequencing data, in particular data originating from non-model organisms or collected in metagenomics studies, are lacking a genome reference. Whereas general guidelines exist for the first [10], genome assembly for metagenomics is much less well defined. In addition to genome sequencing, a large amount of data is created from cDNA sequencing-based transcriptome analyses (and comparable techniques). Such data is often generated for a specific purpose but not analyzed further for questions outside of the scope of the studies that generated them. A consequence of this is that a large amount of raw sequencing data is publicly available but remains untouched. Its vastness and the currently lacking standards for indexing such data make an integrated analysis daunting even for field experts.
To make this host of data efficiently searchable, it is necessary to employ a search index. However, indexing techniques traditionally used for linear reference sequences, such as methods based on the FM-index [17], cannot be readily applied to solve this task. To address this problem, different graph data structures for the representation of multiple genomes have been proposed. One class of structures are generalizations of the FM-index, such as generalized compressed suffix arrays on a genome graph [27, 11] or succinct de Bruijn graphs [7, 18]. Other possibilities include hashing-based implicit representations of de Bruijn graphs using Bloom filters [9, 26] or related representations [14, 28].
An index built on the sequences alone merely allows for querying the presence of sequences in the constructed index. To support relating queries to information such as source genomes, haplotypes, or functional annotations, additional labels must be associated with the index. To facilitate this, approaches for storing additional data on an indexed graph have been suggested, such as the gPBWT [21] for genome graphs or succinct representations of colored (i.e. labeled) de Bruijn graphs [18, 15, 3] for the representation of sets of sequences. In this context, recent developments towards a dynamic representation of such data should also be mentioned [19, 24].
The problem of efficiently representing these types of relations is also addressed in other fields. Commonly referred to as compressed binary relations, a growing body of theoretical work addresses such approaches [6]. Successful applications of similar techniques include the efficient representation of large web-graphs [8] and RDF data sets [4]. We will provide a more detailed description of some of these approaches in Section 2.
In this work, we present a new method for compressing binary relations. Providing as background a comprehensive benchmark of existing compression schemes, we show that our approach has superior performance on both artificial and real world data sets. Our paper has the following structure. After introducing our notation, we begin by defining the abstract graph and annotation structures that we wish to compress (Section 2.1). We then provide descriptions of our proposed compression technique and competing methods (Section 2.2). Finally, we compare the compression performance of these methods on different types of graph annotations (Section 3) and close with a brief discussion of our results and an outlook on future work (Section 4).
2 Methods
After introducing notation, we will give an overview of all methods implemented for this work and provide a description of our methodological contributions.
2.1 Preliminaries
We will operate in the following setting: We are given a k-dimensional de Bruijn graph G = (V, E) over a set of given input sequences S. The node set V is defined as the set of all consecutive sub-sequences of length k (k-mers) of the sequences in S,

$$V = \{\, s_{i:i+k-1} \mid s \in S,\ i = 1, \dots, |s| - k + 1 \,\},$$

where $s_{i:j}$ denotes the sub-sequence of s from position i up to and including position j, and |s| is the length of s. Directed edges exist between any two nodes that overlap by k − 1 characters,

$$E = \{\, (u, v) \in V \times V \mid u_{2:k} = v_{1:k-1} \,\},$$

where $u_{2:k}$ and $v_{1:k-1}$ describe the k-mers $u = u_{1:k}$ and $v = v_{1:k}$ without their first and last characters, respectively.
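The definitions of V and E can be illustrated with a minimal sketch. The function name `build_de_bruijn_graph` and the use of plain Python sets are assumptions made for illustration only; this is not the hash-table or succinct representation used in the actual implementation.

```python
def build_de_bruijn_graph(sequences, k):
    """Return (nodes, edges) of the de Bruijn graph of order k.

    Hypothetical helper for illustration: nodes are all k-mers of the
    input sequences, and an edge (u, v) exists whenever the last k-1
    characters of u equal the first k-1 characters of v.
    """
    nodes = set()
    for s in sequences:
        for i in range(len(s) - k + 1):
            nodes.add(s[i:i + k])

    # Index the nodes by their (k-1)-prefix so that successors can be
    # looked up without comparing all pairs of nodes.
    by_prefix = {}
    for v in nodes:
        by_prefix.setdefault(v[:-1], []).append(v)

    edges = set()
    for u in nodes:
        for v in by_prefix.get(u[1:], []):
            edges.add((u, v))
    return nodes, edges


nodes, edges = build_de_bruijn_graph(["ACGTA", "CGTT"], k=3)
```

Note that, following the definition, edges are induced by the overlap of any two nodes, so an edge may connect k-mers that originate from different input sequences.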
In order to represent relations between sources of the input sequences S and the nodes V, we now define the concept of a labeled de Bruijn graph and proceed by discussing the more general problem of representing a graph labeling.
Each node $v \in V$, which we refer to as an object, is assigned a finite set of labels $\ell(v) \subset L$. We represent this graph labeling as a binary relation $\mathcal{R} \subset V \times L$. A trivial representation of $\mathcal{R}$ taking $|V| \cdot |L|$ bits of space is a binary matrix $A \in \{0, 1\}^{|V| \times |L|}$. We will use $A_i$ and $A^j$ to denote its rows and columns, respectively.
In the following sections, we will discuss various methods described in the recent literature and present our improvements in efficiently representing 𝓡. In addition to minimal space, we also require that the following set of operations can be carried out efficiently on the compressed representation of 𝓡:
query labels(v) = {l ∈ L| (v, l) ∈ 𝓡} Given an object v ∈ V (a k-mer in the underlying de Bruijn graph), return the set of labels ℓ(v) assigned to it.
query objects(l) = {v ∈ V| (v, l) ∈ 𝓡} Given a label l ∈ L (e.g., a genome or sample ID), return the set of objects assigned to that label.
query relation(v, l) Given an object v ∈ V and a label l ∈ L, check whether (v, l) is in the relation 𝓡, i.e. query relation(v, l) = $\mathbb{1}_{\{(v, l) \in \mathcal{R}\}}$.
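As a point of reference, the three operations can be sketched on the trivial uncompressed matrix A. The function names mirror the operations above, and the representation of A as a list of 0/1 rows is an assumption for illustration; the compressed schemes discussed below must answer the same queries without materializing A.

```python
def query_labels(A, v):
    """Return the set of column indices (labels) set in row v."""
    return {j for j, bit in enumerate(A[v]) if bit}


def query_objects(A, l):
    """Return the set of row indices (objects) set in column l."""
    return {i for i, row in enumerate(A) if row[l]}


def query_relation(A, v, l):
    """Return True if the pair (v, l) is in the relation."""
    return A[v][l] == 1


# Toy relation with 3 objects (rows) and 3 labels (columns).
A = [
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
]
```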
2.2 Binary Relation Representation Schemes
For compressing the binary relation 𝓡, we consider the following representations suggested in recent literature. As an abstraction, we will use the representation of 𝓡 as a binary matrix $A \in \{0, 1\}^{|V| \times |L|}$ (referred to as the binary relation matrix) to illustrate the individual methods.
Column-major Sparse Matrix Representation
As a simple baseline technique, we compress the positions of the non-zero indices in each column independently using Elias-Fano encoding [22]. While this method does not take into account correlations between columns for compression, it allows for parallel construction of the binary relation matrix by mapping the k-mers in each input sequence to the constructed graph in a separate process. This acts as an initial representation from which all other compression techniques are constructed.
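To make the baseline concrete, the following is a minimal sketch of Elias-Fano encoding for the sorted positions of the set bits in one column. It uses plain Python lists instead of succinct bit vectors, and the function names are hypothetical; it is meant to show the split into low and high bits, not to reproduce any particular library implementation.

```python
def elias_fano_encode(positions, universe):
    """Encode sorted, distinct integers from [0, universe).

    Each value is split into `low_width` low bits, stored verbatim, and
    the remaining high bits, stored in a bit vector in which the i-th
    value sets the bit at index (high part + i).
    """
    n = len(positions)
    low_width = max(0, (universe // n).bit_length() - 1) if n else 0
    low_mask = (1 << low_width) - 1
    lows = [x & low_mask for x in positions]
    high_bits = [0] * (n + (universe >> low_width) + 1)
    for i, x in enumerate(positions):
        high_bits[(x >> low_width) + i] = 1
    return low_width, lows, high_bits


def elias_fano_decode(low_width, lows, high_bits):
    """Recover the original sorted positions."""
    out = []
    i = 0  # number of values decoded so far
    for p, bit in enumerate(high_bits):
        if bit:
            out.append(((p - i) << low_width) | lows[i])
            i += 1
    return out


# Positions of the set bits in one column with 16 rows.
positions = [1, 4, 9, 15]
low_width, lows, high_bits = elias_fano_encode(positions, 16)
decoded = elias_fano_decode(low_width, lows, high_bits)
```

Since every column is encoded independently, columns can be built and compressed in parallel, which is the property exploited for the initial construction.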
Flat Row-major Representation
As a second baseline method, this representation concatenates all rows of A into a joint vector that is subsequently compressed using Elias-Fano encoding. This approach, for instance, is used by VARI [18] and its extensions [2].
Rainbowfish
The current state-of-the-art for genome graph labeling is a row-major representation of the binary relation matrix A in which an optimal coding is constructed for the set of rows in A [3]. More precisely, let $A_{i_1}, \dots, A_{i_r} \in \{0, 1\}^{|L|}$ be all the unique rows in A, where $r \le |V|$. Let $i : V \to \{1, \dots, r\}$ be the mapping assigning each row in A to the index of the equal unique row. We then define the count vector $C \in \mathbb{N}^r$ such that $c_j = \#\{v \in V \mid i(v) = j\}$. We sort the values in C in non-increasing order to construct the permutation map $\sigma : \{1, \dots, r\} \to \{1, \dots, r\}$ (i.e. $c_i > c_j \implies \sigma(i) \le \sigma(j)$).
To encode A, we start by permuting the unique rows according to σ to form a matrix $A' \in \{0, 1\}^{r \times |L|}$. Then we compress the resulting matrix with the flat row-major representation using an RRR vector [25] as the underlying storage technique and construct a coding vector $(\sigma(i(v)) - 1)_{v \in V}$ mapping the rows of the initial matrix A to the rows of the encoded matrix A′. The coding vector is represented in a variable-length packed binary coding with a delimiter vector [3]. The delimiter vector is compressed with Elias-Fano encoding [22].
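The row deduplication and frequency-ordered coding described above can be sketched as follows. The function name and the tie-breaking rule are assumptions made for illustration, and the sketch stops before the bit-level packing of the coding vector and the RRR compression of A′.

```python
from collections import Counter


def rainbowfish_style_encode(A):
    """Return (unique_rows, coding_vector) for a 0/1 matrix A.

    unique_rows plays the role of A': the distinct rows of A sorted by
    non-increasing frequency. coding_vector[v] is the zero-based index
    of row v in unique_rows, i.e. sigma(i(v)) - 1 in the notation above.
    """
    rows = [tuple(row) for row in A]
    counts = Counter(rows)
    # Ties are broken by row content only to make the output deterministic.
    unique_rows = sorted(counts, key=lambda r: (-counts[r], r))
    code_of = {row: idx for idx, row in enumerate(unique_rows)}
    coding_vector = [code_of[row] for row in rows]
    return unique_rows, coding_vector


A = [[1, 0], [0, 1], [1, 0], [1, 1], [1, 0], [0, 1]]
unique_rows, coding_vector = rainbowfish_style_encode(A)
# Decoding is a simple lookup of each code in the matrix of unique rows.
restored = [list(unique_rows[c]) for c in coding_vector]
```

Because the most frequent rows receive the smallest codes, a variable-length coding of the codes spends the fewest bits on the rows that occur most often.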
Binary Relation Wavelet Trees (BinRel-WT)
This method involves a translation of the |𝓡| non-zero elements of A into a string, which is then represented using a conventional wavelet tree [6]. Given the binary relation matrix, its set bits are iterated in row-major order and their respective column indices are stored contiguously in a string s over the alphabet {1, …, |L|}. The number of set bits in each row of A is then stored in a delimiter vector d with unary coding. Finally, we represent s with a wavelet tree whose underlying binary vector is compressed into an RRR vector; the delimiter vector is compressed into an RRR vector as well.
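The translation of A into the string s and the delimiter vector d can be sketched as below. The concrete unary convention (one 0 per set bit, terminated by a 1 for each row) and the function names are assumptions for illustration; the wavelet tree over s and the RRR compression are omitted, so the row lookup here scans d linearly instead of using rank and select.

```python
def matrix_to_string(A):
    """Return (s, d): column indices of set bits in row-major order and
    a unary-coded delimiter vector with one 0 per set bit and one 1
    closing each row."""
    s, d = [], []
    for row in A:
        for j, bit in enumerate(row, start=1):  # alphabet {1, ..., |L|}
            if bit:
                s.append(j)
                d.append(0)
        d.append(1)
    return s, d


def row_labels(s, d, v):
    """Return the labels of row v by locating its range in s via d."""
    start = zeros = row = 0
    for bit in d:
        if bit == 0:
            zeros += 1
        else:
            if row == v:
                return set(s[start:zeros])
            start, row = zeros, row + 1
    raise IndexError(v)


A = [[1, 0, 1], [0, 0, 0], [0, 1, 0]]
s, d = matrix_to_string(A)
```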
Hierarchical Compressed Column-major Representation (BRWT)
In contrast to BinRel-WT, BRWT acts directly on binary matrices without the translation to sequence [6]. First, an index vector I is computed by merging all matrix columns through bitwise-OR operations on its rows, and stored to represent the root of the tree. Then, the rows composed entirely of 0s are discarded from A and two equal-sized submatrices A′ and A″ (which may contain rows composed entirely of 0s) of the binary relation matrix A are passed to the left and right children of the root. The compression proceeds recursively. Construction terminates when a node is assigned a single column, which is stored as its index column (see Figure 1 a)). For reconstruction of matrix elements, it is sufficient to only store the index vectors associated with each node of the BRWT tree and the leaves.
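The recursive construction and a row query can be sketched for the binary case as follows. The class name `BRWT`, the halving of the column list, and the use of uncompressed Python lists are assumptions for illustration; in particular, the rank of a row is computed here by a linear scan, whereas compressed bit vectors provide this operation efficiently.

```python
class BRWT:
    """Toy binary BRWT over the given columns of a 0/1 matrix."""

    def __init__(self, rows, columns):
        # rows: list of full-width 0/1 rows passed down to this node.
        # columns: list of column ids covered by this node (at least one).
        self.columns = columns
        # Index vector: bitwise OR of the covered columns, per row.
        self.index = [int(any(row[c] for c in columns)) for row in rows]
        self.children = []
        if len(columns) > 1:
            # Rows that are all-zero in the covered columns are dropped.
            kept = [row for row, bit in zip(rows, self.index) if bit]
            mid = len(columns) // 2
            self.children = [BRWT(kept, columns[:mid]),
                             BRWT(kept, columns[mid:])]

    def query_labels(self, i):
        """Return the set of column ids set in row i of this node."""
        if not self.index[i]:
            return set()
        if not self.children:
            return {self.columns[0]}
        # Position of row i among the rows passed to the children.
        rank = sum(self.index[:i])
        labels = set()
        for child in self.children:
            labels |= child.query_labels(rank)
        return labels


A = [
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
]
tree = BRWT(A, [0, 1, 2, 3])
```

In this toy example the all-zero row 1 is resolved at the root and never reaches any child, which is the source of the space savings on sparse matrices with correlated columns.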
In the next section we consider the problem of topology optimization in the BRWT tree construction. Afterwards, we propose two modifications to the BRWT construction scheme for improving its compression performance.
2.3 Multiary, topology-optimized BRWTs
Our first extension to the BRWT scheme is the introduction of an n-ary tree topology (BRWTn), allowing for matrices to be split into more than two submatrices (see Figure 1 b)). The construction and querying for BRWTn is analogous to the case of binary BRWT. In computational experiments on artificial and real data we show that in most cases, BRWTn with arity greater than two but smaller than |L| provides a higher compression ratio than the simple binary BRWT scheme (see Section 3). Note that BRWTn with the maximal allowed arity n = |L| is equivalent to the baseline column-major sparse matrix representation.
To proceed with our second extension, let us consider binary relations with the number of labels much less than the number of objects, |L| ≪ |V|. This condition is actually met in real annotated genome graphs, where the number of k-mers is usually in the billions and the number of labels is in the order of thousands (see Section 3.2).
Our second extension consists in introducing arbitrary assignments of columns from the matrices encoded in the nodes of the BRWT to their children. These assignments are represented by dictionaries stored in the BRWT nodes, but the |L| ≪ |V| constraint makes the space overhead from storing these negligible compared to the space needed to encode the index vectors. Thus, we exclude the problem of representing these assignments from further consideration and leave that as a small technical detail. Here and further, we refer to this proposed method as Multi-BRWT.
It is easy to see that all Multi-BRWT trees form a space where the root is defined for the initial matrix and the leaves correspond to the single columns of the initial binary matrix. Here, we leave out the question of enumerating these trees and counting the equivalence classes, but focus on the problem of choosing a tree structure that optimizes the compression ratio of Multi-BRWT.
Let us state this problem formally. Given a binary matrix A, let 𝓣 be the space of all Multi-BRWT trees representing A. Let Size(I) denote the size in bits of a compressed binary vector I (for instance, if I is of length n with m set bits, Size(I) = n for an uncompressed bit vector, and $\mathrm{Size}(I) = \lceil \log_2 \binom{n}{m} \rceil + o(n)$ for RRR vectors [25]). We neglect the space required for the dictionaries defining the column assignments and define the size of the Multi-BRWT tree T ∈ 𝓣 as the space required to store all its index vectors including the vectors in leaves:

$$\mathrm{Size}(T) = \sum_{v \in N} \mathrm{Size}(I(v)), \qquad (1)$$

where N is the set of all nodes of the Multi-BRWT tree T and I(v) is the index vector stored in node v. Thus, we wish to find an optimal Multi-BRWT tree by minimizing the storage space,

$$T^{*} = \operatorname*{arg\,min}_{T \in \mathcal{T}} \mathrm{Size}(T). \qquad (2)$$
We will refer to this as the Multi-BRWT problem.
By analogy to the NoSQL table compaction problem [12], it can be easily shown that the Multi-BRWT problem, constrained to the space of binary trees with the uncompressed bit vector representation as the underlying structure for storing the index vectors, is NP-hard. Thus, we consider a two-step heuristic approach for finding a good Multi-BRWT structure. First, we build a binary Multi-BRWT tree by greedily matching pairs of index columns according to their similarity, measured as the number of shared set bits. Then, we optimize the arity of the chosen Multi-BRWT by selecting a subset N′ of the node set N that contains the root r and all the leaves $v_1, \dots, v_{|L|}$, i.e. $\{r\} \cup \{v_1, \dots, v_{|L|}\} \subset N' \subset N$. To keep the resulting Multi-BRWT tree valid (allowing for reconstruction of the initial matrix A), we reassign every node in N′ whose parent has been removed to its grandparent, and repeat recursively if the grandparent is not present in N′ either.

This naturally leads us to a greedy algorithm for optimizing the arity of the BRWT tree. Namely, starting the procedure at the parents of the leaves and applying it recursively to each node except for the root, we estimate the cost of removing the current node v by the following formula, in which the sum runs over the children c of v:

$$\mathrm{CostRem}(v) = \sum_{c \in \mathrm{Children}(v)} \bigl( \mathrm{Size}(I'(c)) - \mathrm{Size}(I(c)) \bigr) - \mathrm{Size}(I(v)), \qquad (3)$$

where I(v) denotes the index vector stored in the node v and I′(c) denotes the index vector that will be stored in the node c after removing its parent v and reassigning the node c to its grandparent.

Now we simplify the formula for estimating the cost of removing a node in BRWT by introducing the assumption that the size of a bit vector I of length n with m set bits is fully defined by these two parameters, i.e. Size(I) = Size(n, m). It is easy to see that after reassigning the node c with the index vector I(c) of length $n_c$ with $m_c$ set bits to the parent of its parent v, whose index vector I(v) has length $n_v$ and $m_v$ set bits, the node c replaces its index vector I(c) with a vector I′(c) of length $n_v$ with $m_c$ set bits.
This provides us with the following simplified formula for estimating the cost of removing a node from the BRWT tree, where the sum again runs over the children c of v:

$$\mathrm{CostRem}(v) = \sum_{c \in \mathrm{Children}(v)} \bigl( \mathrm{Size}(n_v, m_c) - \mathrm{Size}(n_c, m_c) \bigr) - \mathrm{Size}(n_v, m_v). \qquad (4)$$
Formula 4 can be efficiently computed without the need to actually rebuild the current structure of the BRWT. Node v is removed from the BRWT whenever the estimated cost is negative, CostRem(v) < 0, since the removal then decreases the size of the BRWT. In our practical implementation we use the following formula for approximating the size required for storing an RRR bit vector [20] with the block size t: Size(n, m) =
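The removal criterion can be illustrated with a small sketch of the simplified cost, which depends only on vector lengths and numbers of set bits. The size function used here, the information-theoretic bound $\lceil \log_2 \binom{n}{m} \rceil$, is an assumption chosen as a stand-in and is not the RRR approximation with block size t used in the implementation; the function names are hypothetical as well.

```python
import math


def size_bits(n, m):
    """Stand-in for Size(n, m): ceil(log2(C(n, m))) bits.

    This ignores any per-block overhead of a real RRR vector.
    """
    return (math.comb(n, m) - 1).bit_length()


def cost_rem(n_v, m_v, children):
    """Estimated change in total size if node v is removed.

    n_v, m_v: length and number of set bits of the index vector of v.
    children: list of (n_c, m_c) pairs for the children of v.
    After removal, each child vector grows from length n_c to n_v while
    keeping its m_c set bits, and the index vector of v is saved.
    """
    growth = sum(size_bits(n_v, m_c) - size_bits(n_c, m_c)
                 for n_c, m_c in children)
    return growth - size_bits(n_v, m_v)


# A dense index vector filters out few rows, so removing the node pays off.
dense_parent = cost_rem(1000, 900, [(900, 450), (900, 450)])
# A sparse index vector shared by two highly similar children is worth keeping.
sparse_parent = cost_rem(10000, 100, [(100, 95), (100, 95)])
```

Under this stand-in, a node is worth keeping mainly when its children share most of their set bits, so that the cost of the parent's index vector is amortized over several much shorter child vectors.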
Greedy algorithm for finding an initial tree approximation
As an approximate solution to the first step, we propose a greedy algorithm in which an initial greedy pairwise matching step is performed on the columns to optimize their initial order prior to construction. Given the input columns $A^1, \dots, A^{|L|}$ and their corresponding object queries $o_i$ = query objects(i), we first compute the cardinalities of their pairwise intersections, $s_{ij} = |o_i \cap o_j|$. Then, we sort all computed similarities in decreasing order and match pairs of columns greedily.
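The matching step can be sketched as follows. The function name `greedy_pairs` and the representation of each column as a Python set of row indices are assumptions for illustration; applying the same step repeatedly to the merged index columns would yield the binary tree topology.

```python
from itertools import combinations


def greedy_pairs(columns):
    """Greedily pair columns by decreasing number of shared set bits.

    columns: list of sets, where columns[i] holds the row indices of the
    set bits of column i. Returns (pairs, unmatched), where pairs is a
    list of index pairs and unmatched lists the leftover column indices.
    """
    similarities = sorted(
        ((len(columns[i] & columns[j]), i, j)
         for i, j in combinations(range(len(columns)), 2)),
        key=lambda t: -t[0],
    )
    matched, pairs = set(), []
    for _, i, j in similarities:
        if i not in matched and j not in matched:
            pairs.append((i, j))
            matched.update((i, j))
    unmatched = [i for i in range(len(columns)) if i not in matched]
    return pairs, unmatched


columns = [{0, 1, 2, 3}, {7, 8}, {0, 1, 2}, {7, 8, 9}, {5}]
pairs, unmatched = greedy_pairs(columns)
```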
Efficient pair-wise distance estimation
The proposed greedy approximation method takes as input the matrix of the pairwise column similarities. For m columns of length n, computing each pairwise similarity costs $\mathcal{O}(n)$, and thus, computing the full similarity matrix takes $\mathcal{O}(nm^2)$, which is a considerable overhead for datasets with m > 1000 and n ~ $10^9$. To make the procedure cheaper, we estimate the pairwise similarities from a random sample of rows of the matrix A. Moreover, we prove the following lemma to show that a random subset of rows whose size does not depend on n is sufficient for approximating the pairwise similarities with a small error ε with high probability.
Subsampling lemma: Suppose we are given sets $o_1, \dots, o_m \subseteq \{1, \dots, n\}$, with the minimal cardinality d defined as $d = \min_i |o_i|$. We sample the elements of {1, …, n} independently with the same probability p to form a sampled set of objects $S \subset \{1, \dots, n\}$ and subsampled sets designated by $\tilde{o}_i = o_i \cap S$. Now if we define $u_{ij} = |o_i \cup o_j|$ and its approximator $\tilde{u}_{ij} = |\tilde{o}_i \cup \tilde{o}_j| / p$, then for a sufficiently large sampling probability p we claim that, with high probability, every $\tilde{u}_{ij}$ deviates from $u_{ij}$ by at most an ε-fraction.
For a full proof, see Supplementary Section 2.
According to Lemma 1, with the subsampling technique we can approximate the cardinality of unions up to an ε-fraction with high probability, and therefore, approximate the column similarities required in the greedy matching algorithm as well. Note that similarly to the Johnson–Lindenstrauss lemma [5], the required number of sampled objects does not depend on the total number of objects n.
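The subsampling estimator can be sketched as below, assuming each row is kept independently with probability p and the union cardinality on the sample is rescaled by 1/p. The function name, the seed parameter, and the toy data are assumptions for illustration. Intersection sizes follow from the estimated unions via $|o_i \cap o_j| = |o_i| + |o_j| - |o_i \cup o_j|$.

```python
import random


def estimate_union_sizes(columns, n, p, seed=0):
    """Estimate |o_i ∪ o_j| for all pairs from a random sample of rows.

    columns: list of sets of row indices from range(n).
    p: probability with which each row is included in the sample.
    Returns a dict mapping (i, j) with i < j to the estimate.
    """
    rng = random.Random(seed)
    sample = {row for row in range(n) if rng.random() < p}
    sub = [o & sample for o in columns]
    return {
        (i, j): len(sub[i] | sub[j]) / p
        for i in range(len(sub))
        for j in range(i + 1, len(sub))
    }


# Two overlapping columns whose true union has 15,000 elements.
cols = [set(range(0, 10000)), set(range(5000, 15000))]
exact = estimate_union_sizes(cols, n=20000, p=1.0)
approx = estimate_union_sizes(cols, n=20000, p=0.5, seed=1)
```

With p = 1 every row is sampled and the estimate is exact; smaller values of p trade accuracy for a proportionally smaller number of rows to scan.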
2.4 Implementation Details
We implement the underlying de Bruijn graph as a hash table storing k-mers packed into 64-bit integers with 64-bit indexes assigned to the k-mers, or as a complete de Bruijn graph represented by a mapping of k-mers to $4^k$ row indexes of the binary relation matrix.
In the column-major representation, the columns of the binary relation matrix are stored using SD vectors implemented in sdsl-lite [13]. The single long vector of the row flat representation is also compressed using SD vectors from sdsl-lite.
BinRel-WT (sdsl) compressor uses the implementation of wavelet tree from the sdsl-lite library, using an RRR vector to store its underlying bit vector. The delimiter vector uses the RRR vector implementation from sdsl-lite.
The BinRel-WT compressor uses the binary relation implementation from https://github.com/dieram3/binrel_wt.
Our BRWT is implemented as a tree in memory, compressing the index and leaf vectors as RRR vectors. To avoid multiple passes through the matrix rows, we construct the BRWT using a bottom-up approach. Given a fixed clustering of the matrix columns, the leaves of the BRWT are constructed first, followed by their parents, which are constructed from the index vectors propagated from the children nodes. To speed up the greedy matching algorithm, we randomly sample $10^6$ rows in each experiment and use those to approximate the number of shared bits in the input columns and the index vectors during the Multi-BRWT construction. For optimizing the tree arity, we use the RRR size approximation from Section 2.3 when estimating sizes of bit vectors of length n with m set bits in the RRR representation with a block size of t = 127.
All SD vectors are constructed with default template parameters, while all RRR vectors are constructed with a block size of 127.
Code Availability
All methods implemented and evaluated in this paper are available at https://github.com/ratschlab/genome_graph_annotation.
2.5 Data
Simulated data
To profile our compressors, we generated several different series of synthetic binary matrices of varying densities (see Supplementary Section 1 for a more detailed description). In total we generated three different kinds of series: i) random matrices with uniformly distributed set bits, ii) initially generated random matrix rows duplicated and permuted randomly, iii) initially generated random matrix columns duplicated and permuted randomly. The motivation behind these series is as follows: The best performing state-of-the-art compressors exploit correlation between rows of the binary relation matrix [24]. However, the usual structure of annotated de Bruijn graphs often implies a correlation structure on the columns rather than on rows. While for a small number of columns and sufficient label redundancy this effect translates to row redundancy, for larger label sets this is not necessarily the case and approaches exploiting correlation structure on the columns might fare better. To test this hypothesis, we generated the different kinds of synthetic data, reflecting uncorrelated rows/columns, correlated rows, and correlated columns for series i), ii), and iii), respectively.
Real-world data
For evaluating all approaches in a real-world setting, we have chosen two data sets well-known in the community and representative of typical applications.
3 Results and Discussion
3.1 Experiments on artificial data
Based on the artificial dataset described in Section 2.5, we evaluated the relation between characteristics of the binary relation matrix A and the compression performance.
Dependency of compression ratio on matrix structure
We measure the density of A as the number of set bits divided by the total size of A. For reference, the labels of sequencing-based de Bruijn graphs typically exhibit very low densities, commonly < 0.5%.
Especially in this low-density region, we find that the properties of the binary relation matrix have a strong effect on the compression ratio of individual methods. A second determinant of performance is whether any assumptions are made on the properties of the data.
On sparse, fully random data, the baseline compressors fare very well (Figure 2a), as no assumptions can be made about relationships. Notably, Rainbowfish, which exploits redundancy among rows, generates considerable overhead for very low densities. Among the BRWT methods, the Multi-BRWT is closest to the best performing choices.
In the setting of correlated rows (data set ii, Figure 2b), as expected, Rainbowfish shows the strongest performance, clearly exploiting row redundancy. Again, among the BRWT methods the Multi-BRWT performs best.
Finally, in the setting that comes closest to a typical labeling task occurring on de Bruijn graphs derived from sequencing data (Figure 2c), the Multi-BRWT approach shows superior performance. Exploiting the correlated columns of the matrix, Multi-BRWT achieves a 5-fold improvement in compression ratio compared to Rainbowfish and more than 2-fold compared to the closest competitor. Notably, the baseline binary BRWT has no advantage over other baseline methods. Further, we observe that this performance gain increases with the total number of columns in the matrix (Supplemental Figures 1 and 2).
3.2 Experiments on real data
To compare the compression performance of our chosen methods under a variety of conditions, we have constructed two test datasets that exhibit different matrix sparsity characteristics.
Kingsford Human RNASeq (2,652 read sets)
We filtered the 2,652 raw sequencing read sets with the KMC [16] tool to extract frequent unique canonical k-mers from each (k = 20). We used the same thresholds for the k-mer frequency level as in [24]. Using the k-mers extracted, we constructed a de Bruijn graph containing 3,693,178,415 nodes and annotated these with their source read sets, which resulted in 2,586 labels (66 filtered read sets were empty). As a baseline for comparison, we used the straightforward column compressed annotation, which required a total of 36.56 gigabytes (G) of space. We used this as a starting point to convert the annotation into other formats.
The results are summarized in Table 1. As expected, the simple row- and column-based compressors but also BinRel-WT require more than 30G in total. The current state-of-the-art, Rainbowfish, reduces this by one third to 19.22G, exploiting correlated rows in the input matrix. The recently suggested BRWT benefits from column correlation through iterative grouping of columns and drastically improves on Rainbowfish, showing a 30% lower size. We further reduce this size through our generalized approach using multiary tree representations for BRWT. While some increase in arity reduces size compared to the binary case, a higher arity does not necessarily translate into lower space, as certain sub-matrices do not benefit from grouping. The smallest representation with a fixed arity is BRWT7 with 12.13G.
We can improve compression performance of a binary BRWT through greedy pairing of node index vectors (see Section 2.3). This strategy further decreases size by another 25% to 9.68G. Finally, optimizing the tree topology using the Multi-BRWT method and selectively removing inner nodes (reassigning children to their grandparents) while maintaining a constraint on each node’s maximum number of children, leads to the smallest space achieved in our experiments. Applying this technique, we decrease the required space to 9.19G (Multi-BRWT5, with at most 5 children for each node), an almost 50% improvement over Rainbowfish.
RefSeq reference genomes
Compression of the complete RefSeq genome annotation (release 88) resulted in a de Bruijn graph of dimension k = 15 containing n = 1,073,685,700 nodes, leading to a binary relation matrix of n rows and m = 3,173 columns with a density of ~3.8%, which is relatively high for a genome graph annotation and can be explained by the small k-mer size used.
This is a substantially larger dataset with less dependency between labels (columns). Due to the size of the graph annotation, we computed only the column compressed baseline, flat row major representation, our implementation of the BinRel-WT method (BinRel-WT (sdsl)), binary BRWT baseline, the greedily optimized binary BRWT representation, and the proposed Multi-BRWTz for a subset of z values (cf. Table 1). We were able to achieve a compressed storage size of only 42.28G. Also here, the BRWT approach improves drastically over the column compressed baseline, and the Multi-BRWT method considerably surpasses the baseline BRWT approach.
For our implementations, we would like to note that the other competing methods either exceeded our available memory (512G) or computed longer than 24h.
To reduce our memory footprint while constructing the Rainbowfish compression of the RefSeq annotation data, we attempted to use the annotation of the Mantis pipeline as a surrogate [24]. We would like to note that this pipeline does not natively support construction from FASTA files, and construction of a Mantis index from conversions of the RefSeq FASTA files to FASTQ failed due to inherent size limits of Mantis’ data structures.
4 Conclusion
We have presented a series of compressed representation methods for binary relations, building upon and improving on the existing literature. By generalizing BRWTs to multiary trees with improved partitioning schemes and adaptive arity to reduce data representation overhead, we have improved on state-of-the-art compression techniques for both simulated and real-world biological datasets.
We have shown that the structure of the input data has a strong influence on the compression performance. Methods such as Rainbowfish or the Flat compressor benefit from correlated rows (objects with a similar set of labels). Note that in a real-world setting, where more and more labels are added to the set, the number of highly correlated rows decreases (ultimately leading to a set of mostly independent rows) and these methods work less well. Interestingly, it is especially this setting that regularly occurs in the labeling of genome graphs, where an underlying set of (related) sequences is assigned a growing set of different labels.
We have presented a method that copes very well with an increasing number of related columns and we could show that this results in considerable performance gains not only on synthetic, but also on typical real-world data. Our method, Multi-BRWT, led to an 80% decrease in compressed size compared to the baseline method and a 50% decrease compared to the closest competitor, Rainbowfish.
A natural extension of this work will involve the utilization of dynamic binary vectors in the underlying storage of BRWTs to allow for their use in dynamic database contexts. Of particular interest are the use of dynamic compressed structures to avoid expensive decompression and recompression steps when performing updates, and the prospects for post-construction column rearrangements to accommodate dynamic data.
By the nature of BRWT, frequent rows (i.e. those with high row density) would be present in more subtrees, and thus, exhibit greater row query times. One way to resolve this would be through hybrid row and column partitioning, in which rows exhibiting certain properties (such as high frequency) are excluded and compressed through techniques better suited to those submatrices.
Overall, we conclude that, despite the advancements in compression over the recent years, there is still much room and many degrees of freedom in compressor design for further improvement.
5 Acknowledgements
We would like to thank the members of the Biomedical Informatics group for fruitful discussions and critical questions, and Torsten Hoefler and Mario Stanke for constructive feedback on the graph setup. Harun Mustafa and Mikhail Karasikov are funded by the Swiss National Science Foundation grant #407540 167331 “Scalable Genome Graph Data Structures for Metagenomics and Genome Annotation” as part of Swiss National Research Programme (NRP) 75 “Big Data”.