Abstract
Motivation De Bruijn graphs play an essential role in computational biology, facilitating rapid alignment-free comparison of genomic datasets as well as reconstruction of underlying genomic sequences. Subsequently, an important question is how to efficiently represent, compress, and transmit de Bruijn graphs of the most common types of genomic data sets, such as sequencing reads, genomes, and pan-genomes.
Results We introduce simplitigs, an effective representation of de Bruijn graphs for alignment-free applications. Simplitigs are a generalization of unitigs and correspond to spellings of vertex-disjoint paths in a de Bruijn graph. We present an easy-to-plug-in greedy heuristic for their computation and provide a reference implementation in a program called ProphAsm. We use ProphAsm to compare the scaling of simplitigs and unitigs on a range of genomic datasets. We demonstrate that simplitigs are superior to unitigs in terms of the cumulative sequence length as well as of the number of sequences, and that they are sufficiently close to the theoretical bounds for practical applications. Finally, we demonstrate that, when combined with standard full-text indexes, simplitigs provide a scalable solution for k-mer search in pan-genomes.
Availability ProphAsm is written in C++ and is available under the MIT license from http://github.com/prophyle/prophasm.
Introduction
The advance of DNA sequencing started the golden age of biology in which phenomena previously unobservable can be studied on an unprecedented scale. However, sequencing capacity has been growing faster than computer performance and memory, and also faster than available human resources. Nowadays large amounts of sequencing data are available, but with decreasing completeness and quality. In consequence, the traditional sequence-based representations and sequence alignment-based techniques [1,2] become less suitable for real-life scenarios due to the space and time complexities they impose as well as due to their sequence-oriented nature in the world of datasets exhibiting graph structure.
An example is given by bacterial genomics. Modern large-scale studies of bacterial species comprise tens of thousands of sequenced isolates (see, e.g., [3], [4], or the Global Pneumococcal Project (pneumogen.net/gps/)). However, the information about isolates’ genomes is almost always incomplete, as sequencing provides only partial observations of the genomes. While it is relatively straightforward to compute draft assemblies of bacterial genomes, the completion of the genomes is difficult. Due to repetitive regions, a full reconstruction from short reads is mathematically impossible even if sequencing errors were perfectly random. Long reads are often unavailable and reference sequences are of limited applicability due to the high variability and even unclear borders between species. While draft assemblies may be sufficient for many analyses, they are often not an ideal universal representation for a multitude of reasons. Most importantly, draft assemblies created using different assemblers are not directly comparable and this can introduce false differential signals into studies. In many scenarios it is therefore desirable to work with a graph representation obtained directly from a sequencing experiment without trying to assemble the genome.
De Bruijn graphs belong to the most popular graph representations of genomic datasets. They are defined as directed graphs G = (V, E) where V is the set of all k-mers (i.e., subwords of a fixed length k) occurring in the dataset, with an edge connecting a vertex v to a vertex w if the suffix of v of length k − 1 equals the prefix of w of length k − 1. As follows from the definition, we can associate a de Bruijn graph with the underlying k-mer set and edges can be defined implicitly (unlike the edge-centric definition where k-mer sets are associated with edges [5]). In this paper, we consider only vertex-centric graphs.
De Bruijn graphs feature remarkable properties. First, their computation from data is easy and deterministic. Algorithms for enumerating and counting k-mers have been extensively studied and many programs are available [6–9]. If the datasets contain sequencing errors, the computation may also involve graph cleaning. This aims at removing those k-mers that result from sequencing errors and that, because such errors are assumed to be random, are expected to be rare. Second, if k is chosen appropriately, de Bruijn graphs can capture substantial information about the entire molecules under sequencing as these correspond to (some of the) walks in the graphs, provided that sequencing was sufficiently deep. Third, de Bruijn graphs can be handled easily, which simplifies software development as well as dataset analysis and interpretation. These properties have led to a large variety of applications of de Bruijn graphs.
De Bruijn graphs have been widely studied in the context of sequence assembly [10–12]. Here, their construction is typically the first step towards the reconstruction of the genomes and transcriptomes under sequencing from the retrieved sequencing reads. Many modern assemblers (e.g., SPAdes [13], ABySS [14], Velvet [15], Minia [16], and MEGAHIT [17]) follow the de Bruijn graph paradigm.
Alignment-free sequence comparison [18] is another major application of de Bruijn graphs, following the idea that similar sequences share similar k-mers, and comparing de Bruijn graphs thus provides a good measure of sequence or dataset similarity. This involves the use of de Bruijn graphs for variant calling and genotyping [19–23], transcript abundance estimation [24], and metagenomic classification [25–28]. The latter also demonstrates another particularity of de Bruijn graphs: their high ability to approximate the graph structure of pan-genomes. Indeed, reference databases of bacterial strains are often highly incomplete and noisy; nevertheless, k-mer-based classifiers perform best in inferring the abundance profiles among all classifiers [29], which also suggests that de Bruijn graphs can be used as a proxy for variation graphs. Conversely, de Bruijn graphs with a large k can be used for indexing variation graphs [30,31].
The importance of de Bruijn graphs leads us to an important problem: their space-efficient representation. While general de Bruijn graphs may impose large space requirements, it has been shown that those of real datasets can be highly compressible. Indeed, given the linearity of DNA and RNA molecules and the nature of sequencing, genomic datasets exhibit the spectrum-like property; i.e., there exist long strings of which most of the dataset’s k-mers are substrings [5].
In this paper, we study the problem of representation of de Bruijn graphs for alignment-free data analysis. Building on previous works [32,33], we propose simplitigs as an effective representation of de Bruijn graphs. Simplitigs provide a textual representation of the graph, representing each k-mer exactly once and facilitating easy indexing with standard full-text indexes. Simplitigs use the observation that in practical applications such graphs typically contain long paths. In contrast to unitigs, which are paths that do not contain any branching nodes, simplitigs can contain branching nodes.
Finally, we present ProphAsm, a tool for computing simplitigs for a given dataset, such as reads, genomes, pan-genomes or metagenomes. ProphAsm proceeds by building the associated de Bruijn graph in memory, followed by a greedy enumeration of maximal vertex-disjoint paths. We use ProphAsm to demonstrate that simplitigs are superior to unitigs in terms of the cumulative sequence length and of the number of sequences, and that they are sufficiently close to the theoretical bounds for practical applications. The employed heuristic can be easily integrated in any software producing de Bruijn graphs.
Results
Simplitigs as an efficient representation of de Bruijn graphs
We developed the concept of simplitigs to efficiently represent de Bruijn graphs for alignment-free applications (Figure 1). Simplitigs are a generalization of unitigs and correspond to spellings of vertex-disjoint paths covering the de Bruijn graph; consequently, maximal simplitigs are those simplitigs that cannot be further compacted by merging (Methods). Note that unitigs and k-mers are also simplitigs but, in general, not maximal ones. The main conceptual difference between maximal simplitigs and maximal unitigs is that unitigs are limited by branching nodes (which are crucial for genome assembly), whereas simplitigs are not limited by this constraint. This allows for further compactification, with a benefit increasing proportionally to the number of branching nodes in the graph.
We designed a greedy heuristic for the computation of simplitigs (Algorithm 1, Methods). At every step, it selects a k-mer from the current k-mer set and keeps extending it forward and then backward as long as possible while removing the already used k-mers from the set. This process is repeated until all k-mers are covered. We provide a reference implementation in a program called ProphAsm (github.com/prophyle/prophasm). The heuristic can be easily applied by any other software that outputs de Bruijn graphs or k-mer sets.
In the following sections, we use ProphAsm to compare maximal simplitigs with maximal unitigs on different types of data sets.
Greedy computation of maximal simplitigs for a k-mer set
In an iterative fashion, the algorithm draws an arbitrary k-mer from the set of k-mers K as a new simplitig, and then keeps extending it forwards and backwards as long as possible, while removing the already used k-mers from K.
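The heuristic described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the ProphAsm source code: the function names (`revcomp`, `canonical`, `greedy_simplitigs`) are hypothetical, the k-mer set is held in a plain Python set, and the bi-directed setting is emulated by storing canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement).

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Canonical form: the smaller of the k-mer and its reverse complement."""
    return min(kmer, revcomp(kmer))


def _extend_forward(simplitig, K, k):
    """Append nucleotides while some unused k-mer continues the simplitig."""
    while True:
        for c in "ACGT":
            candidate = canonical(simplitig[len(simplitig) - k + 1:] + c)
            if candidate in K:
                K.remove(candidate)      # every k-mer is used exactly once
                simplitig += c
                break
        else:                            # no nucleotide extends the simplitig
            return simplitig


def greedy_simplitigs(kmers):
    """Greedily cover a k-mer set by simplitigs (arbitrary seed order)."""
    K = {canonical(x) for x in kmers}
    simplitigs = []
    while K:
        seed = K.pop()                   # arbitrary k-mer becomes a new simplitig
        k = len(seed)
        s = _extend_forward(seed, K, k)
        # Backward extension is forward extension of the reverse complement.
        s = revcomp(_extend_forward(revcomp(s), K, k))
        simplitigs.append(s)
    return simplitigs
```

Because the seed is drawn with `set.pop()`, the output may differ between runs, which mirrors the dependence on the order in which k-mers are drawn; the invariant that every canonical k-mer is spelled exactly once holds regardless.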
Simplitigs of selected model organisms
We evaluated the simplitig representation on individual genomes of five model organisms for a range of k-mer lengths (Figure 2, Methods). The range was selected based on values that are most commonly used for alignment-free sequence comparison (see, e.g., [24,25,34]). For each organism and a k-mer length, we computed maximal simplitigs and unitigs, and compared them in terms of two basic characteristics: the number of sequences produced and their cumulative length. Whereas the former defines the number of records to be kept, the latter determines the total memory need. Note that both numbers are tightly connected (Methods, (eq 1)).
First, we analyzed the number of sequences produced (Figure 2, upper plots). We observe that for all datasets, as the k-mer size increases, the number of simplitigs grows and later decreases slowly while the number of unitigs grows rapidly at the beginning, and subsequently drops substantially, approaching the number of simplitigs. The maxima of functions corresponding to simplitigs and unitigs may (e.g., C. elegans) or may not (e.g., B. mori) be reached at the same k-mer size.
We then analyzed the cumulative sequence length (Figure 2, lower plots). The cumulative length is bounded from below by the number of k-mers in the genome plus k − 1, corresponding to the theoretical maximum degree of compactification. In such a case, all k-mers would occur on the same simplitig; however, this is not attainable for most datasets. As we can observe and (eq 1) explains, the shapes of the curves in the lower plots copy those in the upper plots, being only shifted up by a multiple of the theoretical lower bound function. When comparing the simplitig and unitig curves, we can observe the same patterns as for the number of sequences.
Overall, we observed that simplitigs always provide better performance than unitigs. In particular, they quickly approach the theoretical lower bounds for both characteristics tested. Every data set has a range of k-mer lengths where the difference between simplitigs and unitigs is striking, and after a certain threshold, the difference almost vanishes. While for short genomes this threshold is located for smaller k’s (e.g., k ≈ 17 for E. coli) than those typically used in alignment-free applications, in long genomes this threshold has not been attained on the tested range and seems to be substantially shifted towards large k-mers (e.g., B. mori). All this suggests that in practical applications, simplitigs are preferable for indexing individual genomes and the benefit is likely to increase with the genome size.
Simplitigs of bacterial pan-genomes
Computational pan-genomics has recently emerged as an important sub-branch of bioinformatics [35]. One of the motivations is the analysis of sequencing data in the context of whole species. Species are then represented using so-called pan-genome representations, i.e., reference structures including all within-species variation. De Bruijn graphs are particularly useful as pan-genomic references as they can be easily constructed from a variety of different data types, ranging from assembled reference sequences to the original sequencing reads. We sought to evaluate the usefulness of simplitigs for bacterial pan-genomes, which are particularly challenging due to their high diversity and variability.
We compared simplitig and unitig representations of the Neisseria gonorrhoeae pan-genome, as a function of the number of genomes included (Figure 3, Methods). We used 1,102 clinical isolates collected from 2000 to 2013 by the Centers for Disease Control and Prevention’s Gonococcal Isolate Surveillance Project [36]; the data set comprises draft assemblies from Illumina HiSeq reads. As expected, as the number of isolates and the associated variance grow, the number of sequences and their cumulative length grow as well, both for maximal unitigs and simplitigs. While simplitigs and unitigs perform comparably well when one bacterial genome is included (consistent with Figure 2), the improvement of simplitigs over unitigs in the cumulative sequence length grows and eventually stabilizes at approximately 1.5 as more genomes are included (Figure 3, bottom plot). On the other hand, the improvement in the number of sequences steadily increases along the whole range, reaching approximately 3.0.
To verify the generalizability of our findings, we repeated the experiment with the same dataset but for k=18 and also with 616 pneumococcal genomes from a carriage study of children in Massachusetts [37,38] with k=18 and k=31 (Methods). In all the cases, the results were qualitatively the same, although with small changes in the resulting relative improvements.
Application of simplitigs for k-mer search in bacterial pan-genomes
Importantly, the simplitig representation can be easily applied for k-mer lookup. As simplitigs and unitigs reduce k-mer lookup in a de Bruijn graph to k-mer search in a set of sequences, any full-text search engine can be used. This includes several available highly optimized BWT indexes of read mappers [39–41]. BWT indexes [42] (i.e., indexes based on the Burrows-Wheeler Transform [43], sometimes also referred to as FM-indexes) are powerful compact representations of sequence data supporting fast string search. Although technical aspects of their implementation may vary (as well as their resulting performance), they are interchangeable in our applications. We used the index provided by BWA [40] to evaluate the performance of k-mer lookup in bacterial pan-genomes.
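The reduction of k-mer lookup to string search can be illustrated with a toy example. The sketch below deliberately replaces the BWT index with Python's built-in substring search, so it shows only the logic of the query and not the performance of a real index; the helper names `build_text` and `contains_kmer` are hypothetical, and the `$` separator is an assumption used to prevent matches spanning two sequences.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(_COMPLEMENT)[::-1]


def build_text(sequences):
    """Concatenate simplitigs (or unitigs) with a separator outside the alphabet."""
    return "$".join(sequences)


def contains_kmer(text, kmer):
    """A k-mer is present if it or its reverse complement occurs in the text."""
    return kmer in text or revcomp(kmer) in text


text = build_text(["ACG", "CAAC", "GCA"])
print(contains_kmer(text, "GTT"))  # reverse complement AAC occurs in CAAC
print(contains_kmer(text, "TTT"))  # neither TTT nor AAA occurs
```

A full-text index answers the same two substring questions without scanning the text, which is why any such index can be substituted here.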
Single pan-genome
We first evaluated the performance of k-mer presence/absence queries on a single pan-genome (Table 1, Methods). We used the same N. gonorrhoeae draft genomes as previously to build a gonococcal k-mer pan-genome for five different k-mer sizes using three strategies: by merging the draft assemblies, by computing unitigs, and by computing simplitigs (Table 1a). For all of them, we constructed BWT indexes using BWA [40], queried ten million k-mers using BWA fastmap [44], and evaluated the resulting memory footprint and query performance (Table 1b).
Consistent with the previous experiments, simplitigs provided a clear improvement over unitigs (Table 1a). Maximal simplitigs improved the number of sequences by 3.0x–4.9x and the cumulative sequence length by 1.5x–2.1x. Intuitively, we might assume that the resulting memory footprint of BWA is proportional to the cumulative sequence length, so the improvement in memory footprint was expected to be similar to that of the cumulative sequence length. Surprisingly, the memory footprint improved substantially more with simplitigs (2.7x–5.6x) (Table 1b). To explain this phenomenon, it is important to understand that the underlying full-text engine has to keep information about individual sequences in memory as separate records, and that standard read mappers are optimized for low numbers of references. As the number of reference sequences grows, it has a negative impact on both the memory footprint and query speed. However, since simplitigs provided a 3.0x–4.9x improvement in the number of sequences over unitigs, they helped to alleviate this overhead. Overall, the comparatively high number of maximal unitigs observed throughout our experiments (Figures 2 and 3) provides a further argument for using simplitigs as the preferable representation of k-mer sets.
Multiple pan-genomes
Finally, we evaluated the performance of simplitigs for simultaneous indexing of multiple bacterial pan-genomes (Table 2, Methods). We downloaded all complete bacterial genomes from GenBank (as of December 2019; 10,502 genomes out of which we managed to download 9,570; Methods). We restricted ourselves to the complete genomes as the draft genomes in GenBank are known to be impacted by contamination [45]. We grouped individual genomes per species, which resulted in 719 bacterial pan-genomes. We then computed simplitigs and unitigs for k=18 and k=31 for every species, merged the obtained representations, and calculated the same statistics as previously (Table 2a). Finally, we constructed BWT indexes using BWA, and measured the resulting k-mer lookup performance using the same ten million k-mers as previously (Table 2b).
In this case, the improvement of simplitigs over unitigs was 4.2x and 3.1x in the sequence count for k=18 and k=31, respectively, and 1.6x and 1.3x in the cumulative sequence length (Table 2a). For k=31 simplitigs provided 1.2x speedup and 1.8x improvement in memory consumption (Table 2b); for k=18, the speedup could not be evaluated (Methods). These results are consistent with the previous sections and provide further evidence that simplitigs are useful not only for storage, but also for fast k-mer lookup.
Discussion
We introduced the concept of simplitigs, a generalization of unitigs, and demonstrated that simplitigs constitute a compact, efficient and scalable representation of de Bruijn graphs for commonly used genomic datasets. The two representations share many similarities. Both represent de Bruijn graphs in a lossless fashion, correspond to spellings of vertex-disjoint paths, and preserve k-mer sets. Being text-based and stored as FASTA files, both can be easily manipulated using standard Unix tools and indexed using full-text indexes. On the other hand, unlike unitigs, general simplitigs are not expected to have direct biological significance as neighboring segments of the same simplitig may correspond to distant parts of the same DNA molecule or even to different ones. Therefore, simplitigs are not applicable in many situations where unitigs are used, but they show much better compression properties.
We provided ProphAsm, a tool implementing a greedy heuristic to compute maximal simplitigs. This heuristic is easy to implement in any software, which suggests its further use as a generic method for serialization of k-mer sets. The simplicity is in contrast to BCALM 2, the reference software for unitigs, where the complexity of the bi-directed de Bruijn graph model may complicate debugging; for instance, BCALM 2 does not support k-mer lengths that are divisible by four (as of December 2019, unsupported since 2017). As a downside, the naive implementation of the heuristic using a standard hash table may run into memory issues. In our work, we have not encountered this, but memory consumption can be readily improved using more advanced data structures, similarly to what has been done for unitig computation tools [33,46,47]. We note that ProphAsm is a spin-off of the ProPhyle software (https://prophyle.github.io/, [27]) for phylogeny-based metagenomic classification. Simplitig computation was already implemented in ProPhyle [48] for the purpose of efficient indexing of k-mers assigned to nodes of the phylogenetic tree.
The data presented in this paper are informative for understanding the scaling of computational resources as more sequencing data become available [49]. The studied gonococcal dataset constitutes a relatively complete snapshot of a bacterial population in a geographical region and at a given time scale. As such, it can be used to model the “state of completion” of k-mer pan-genomes. On the other hand, the multiple pan-genomes experiment provided insights about the resulting performance when a large number of pan-genomes is queried simultaneously using a BWT index. This allows predictions to be made about the scaling for species where at present only a limited number of assemblies are available, but more data are likely to be generated in the future. Overall, with more data available, the comparative benefits of simplitigs over unitigs grow.
Besides the presented advantages, simplitigs also introduce several technical challenges related to the solution ambiguity (as illustrated in Figure 1). Whereas maximal unitigs are uniquely defined (up to the order and reverse complementing), this is not the case for maximal simplitigs. In the presented heuristic, the resulting maximal simplitigs and their characteristics depend on the order in which the initial k-mers are drawn from the underlying set. At every iteration, once a maximal simplitig is built, a new k-mer is drawn from the set as the new initial k-mer. In the case of ProphAsm, the k-mers are drawn from an unordered set of the C++ standard library, whose iteration order is not specified by the standard; this makes the output difficult to reproduce across platforms.
Modern bioinformatics applications of de Bruijn graphs often require multiple graphs considered simultaneously. The resulting structure is usually referred to as a colored de Bruijn graph [19] and its representations have been widely studied ([50–61]). Even though we touched this setting in the section Multiple pan-genomes, exploiting the similarity between individual de Bruijn graphs for further compression in simplitig-based approaches is to be addressed in future work.
With the growing interest in k-mer indexing of all genomic datasets [60], we anticipate the simplitig representation to be useful as a generic representation of de Bruijn graphs.
Methods
De Bruijn graphs
All strings are assumed to be over the alphabet {A, C, G, T}. A k-mer is a string of length k. For a string s = s1…sn, denote s[i..j] = si…sj. We also define prefk(s) = s1…sk and sufk(s) = sn−k+1…sn. For two strings s and t of length at least k, we define the binary connectivity relation s →k t iff sufk(s) = prefk(t).
Given a set K of k-mers, the de Bruijn graph of K is the directed graph G = (V, E) with V = K and E = {(u, v) | u →k−1 v}. This definition of de Bruijn graphs is node-centric, as nodes are identified with k-mers and edges are implicit. Therefore, we can use the terms “k-mer set” and “de Bruijn graph” interchangeably.
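The node-centric definition can be made concrete with a short Python sketch, in which the graph is never materialized and edges are recomputed on demand from the k-mer set. The function names (`kmers_of`, `successors`, `predecessors`) are hypothetical and serve only as an illustration of the definitions in the uni-directed model.

```python
def kmers_of(sequences, k):
    """Vertex set V = K: all k-mers occurring in the given strings."""
    return {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}


def successors(kmer, K):
    """Out-neighbours v of u = kmer, i.e. suf_{k-1}(u) = pref_{k-1}(v)."""
    return [kmer[1:] + c for c in "ACGT" if kmer[1:] + c in K]


def predecessors(kmer, K):
    """In-neighbours u of v = kmer, i.e. suf_{k-1}(u) = pref_{k-1}(v)."""
    return [c + kmer[:-1] for c in "ACGT" if c + kmer[:-1] in K]


K = kmers_of(["ACGTT", "CGTA"], 3)   # {"ACG", "CGT", "GTT", "GTA"}
print(successors("CGT", K))          # ['GTA', 'GTT'], so CGT is a branching node
print(predecessors("CGT", K))        # ['ACG']
```

Since every vertex has at most four candidate neighbours in each direction, a hash set of k-mers is enough to traverse the graph, which is what makes the terms “k-mer set” and “de Bruijn graph” interchangeable in practice.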
Simplitigs
Consider a set K of k-mers and the corresponding de Bruijn graph G = (V, E). A simplitig graph G′ = (V′, E′) is an acyclic subgraph of G in which the in-degree and out-degree of every node are at most one. It follows from this definition that a simplitig graph is a vertex-disjoint union of paths called simplitigs. A simplitig is called maximal if it cannot be extended forward or backward without breaking the definition of a simplitig graph. In more detail, a simplitig u1 →k−1 u2 →k−1 … →k−1 un is maximal if the following conditions hold:
either u1 has no incoming edges in G, or for any edge (v, u1) ∈ E, v belongs to another simplitig,
either un has no outgoing edges in G, or for any edge (un, v) ∈ E, v belongs to another simplitig.
A unitig is a simplitig u1 →k−1 u2 →k−1 … →k−1 un such that each of the nodes u1, …, un−1 has out-degree 1 and each of the nodes u2, …, un has in-degree 1 in graph G. A maximal unitig is defined similarly.
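To contrast the two definitions, the following Python sketch computes maximal unitigs in the uni-directed model. It assumes the usual convention that an edge can lie inside a unitig only if it is the sole outgoing edge of its source and the sole incoming edge of its target; the function name `maximal_unitigs` is hypothetical, and the code is neither the BCALM 2 algorithm nor a bi-directed implementation.

```python
def maximal_unitigs(kmers):
    """Maximal unitigs of a k-mer set in the uni-directed de Bruijn graph."""
    K = set(kmers)

    def succ(x):
        return [x[1:] + c for c in "ACGT" if x[1:] + c in K]

    def pred(x):
        return [c + x[:-1] for c in "ACGT" if c + x[:-1] in K]

    def compactable(u, v):
        # u -> v is the only edge leaving u and the only edge entering v.
        return succ(u) == [v] and pred(v) == [u]

    used, unitigs = set(), []
    for start in sorted(K):              # sorted only to make the output stable
        if start in used:
            continue
        first = start                    # walk back to the start of the unitig
        while True:
            p = pred(first)
            if len(p) == 1 and compactable(p[0], first) and p[0] != start:
                first = p[0]
            else:
                break
        path, cur = first, first         # then spell the unitig forwards
        used.add(cur)
        while True:
            n = succ(cur)
            if len(n) == 1 and compactable(cur, n[0]) and n[0] not in used:
                cur = n[0]
                used.add(cur)
                path += cur[-1]
            else:
                break
        unitigs.append(path)
    return unitigs


print(maximal_unitigs({"ACG", "CGT", "GTT", "GTA"}))
```

For this k-mer set the unitigs stop at the branching node CGT, giving three sequences (ACGT, GTA, GTT), whereas simplitigs may continue through it (for example ACGTA and GTT), giving two sequences and a shorter cumulative length.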
Greedy computation of simplitigs
The problem of computing maximal simplitigs that would be optimal in the cumulative sequence length corresponds to the maximum disjoint path problem, which is known to be NP-hard [62]. Throughout this paper, a greedy approach was used for the simplitig computation (Algorithm 1). Simplitigs were constructed iteratively, starting from a random k-mer and being extended greedily forwards and backwards as long as possible. Note that Algorithm 1 works in the bi-directed setting, in which canonical k-mers are used instead of standard k-mers. A formal definition of bi-directed de Bruijn graphs requires complex formalism (see, e.g., https://github.com/GATB/bcalm/tree/master/bidirected-graphs-in-bcalm2); however, the greedy heuristic works similarly in both setups and does not require this formalism. Therefore, we resorted to the uni-directed model for the explanation of the concepts and to the bi-directed model for the greedy heuristic.
Comparing simplitigs with unitigs
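Simplitigs and unitigs were always compared in terms of the number of sequences produced and their cumulative length. Note that these numbers are related: assuming that the frequency of every k-mer is 1, a sequence of length ℓ contains ℓ − k + 1 k-mers, and summing over all sequences gives
cumulative length = #kmers + (k − 1) · (number of sequences).   (1)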
As the maximum disjoint path problem is known to be NP-hard [62], finding the optimal solutions can be computationally very demanding. However, we can provide the lower bound #kmers + k − 1 corresponding to the maximum possible degree of compactification (i.e., a single simplitig covering all k-mers). In the situations where the cumulative sequence length of simplitigs approaches this bound, the greedy heuristic presented above is sufficient.
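The relation between the two characteristics and the lower bound can be checked numerically. Each sequence of length ℓ spells ℓ − k + 1 k-mers, so the cumulative length equals the number of k-mers plus (k − 1) times the number of sequences, provided that every k-mer occurs once. The Python sketch below uses hypothetical helper names and a toy k-mer set in the uni-directed model.

```python
def sequence_stats(sequences, k):
    """Return (number of sequences, cumulative length, number of k-mers spelled)."""
    ns = len(sequences)
    cl = sum(len(s) for s in sequences)
    n_kmers = sum(len(s) - k + 1 for s in sequences)
    return ns, cl, n_kmers


def cumulative_length(n_kmers, ns, k):
    """Cumulative length implied by the k-mer count and the sequence count."""
    return n_kmers + (k - 1) * ns


def lower_bound(n_kmers, k):
    """Bound attained only if a single simplitig covers all k-mers."""
    return cumulative_length(n_kmers, 1, k)


k = 3
unitigs = ["ACGT", "GTA", "GTT"]      # one representation of {ACG, CGT, GTA, GTT}
simplitigs = ["ACGTA", "GTT"]         # another representation of the same set
for name, seqs in (("unitigs", unitigs), ("simplitigs", simplitigs)):
    ns, cl, n = sequence_stats(seqs, k)
    print(name, ns, cl, cumulative_length(n, ns, k), lower_bound(n, k))
```

In this toy case the unitigs have cumulative length 10 and the simplitigs 8, against a lower bound of 6; the difference between the representations is exactly (k − 1) times the difference in the number of sequences.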
Correctness evaluation
The correctness of simplitigs can be verified using an arbitrary k-mer counter. Simplitigs are correct if and only if every k-mer is present exactly once and the number of distinct k-mers is the same as in the original dataset. To verify the correctness of ProphAsm outputs, we used Jellyfish 2 [6].
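The same check can be expressed without an external k-mer counter. The Python sketch below is a small in-memory stand-in for that procedure, not the Jellyfish-based workflow itself: it counts canonical k-mers (an assumption matching the bi-directed setting) in the simplitigs and compares them with those of the original sequences. The function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def count_kmers(sequences, k):
    """Multiset of canonical k-mers spelled by the sequences."""
    return Counter(
        canonical(s[i:i + k]) for s in sequences for i in range(len(s) - k + 1)
    )


def simplitigs_are_correct(simplitigs, original_sequences, k):
    """True iff each k-mer occurs exactly once and the k-mer sets coincide."""
    got = count_kmers(simplitigs, k)
    expected = count_kmers(original_sequences, k)
    return all(c == 1 for c in got.values()) and set(got) == set(expected)


print(simplitigs_are_correct(["ACG", "CAAC", "GCA"], ["ACGTTGCA"], 3))
print(simplitigs_are_correct(["ACGTTGCA"], ["ACGTTGCA"], 3))
```

The first call succeeds because the three sequences spell the four canonical 3-mers of the input exactly once; the second fails because the input itself spells ACG and GCA twice once reverse complements are identified.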
Experimental evaluation – model organisms
Reference sequences for five selected model organisms were downloaded from RefSeq: Escherichia coli str. K-12 (accession: NC_000913.3, length: 4.64 Mbp), Saccharomyces cerevisiae (accession: NC_001133.9, length: 12.2 Mbp), Caenorhabditis elegans (accession: GCF_000002985.6, length: 100 Mbp), Bombyx mori (accession: GCF_000151625.1, length: 482 Mbp), and Homo sapiens (HG38, length: 3.21 Gbp). For each of them, simplitigs and unitigs were computed using ProphAsm and BCALM 2, respectively.
Experimental evaluation – pan-genomic scaling
First, 1,102 draft assemblies of N. gonorrhoeae clinical isolates (collected from 2000 to 2013 by the Centers for Disease Control and Prevention’s Gonococcal Isolate Surveillance Project [36], and sequenced using Illumina HiSeq) were downloaded from Zenodo (Grad 2019). Second, 616 draft assemblies of S. pneumoniae isolates (collected from 2001 to 2007 for a carriage study of children in Massachusetts, USA [37,38], and sequenced using Illumina HiSeq) were downloaded from the SRA FTP server using the accession codes provided in Table 1 in [38]. For each of these datasets, an increasing number of genomes was taken and merged, and simplitigs and unitigs were computed using ProphAsm and BCALM 2, respectively, for k=18 and k=31. To avoid excessive resource usage, the functions were evaluated at points with increasing spacing (for the intervals [10, 100] and [100, +∞), only multiples of 5 and 20 were evaluated, respectively).
Experimental evaluation – fulltext k-mer queries
In the single pan-genome experiment, the same 1,102 assemblies of N. gonorrhoeae were merged into a single file. ProphAsm and BCALM 2 were then used to compute simplitigs and unitigs from this file for k=15, 19, 23, 27, 31. All three obtained FASTA files were used to construct a BWA index, which was then queried for k-mers using ‘bwa fastmap -l {kmer-size}’. The k-mers were previously generated from the same pan-genome using DWGsim [63] (version 0.1.11, with the parameters ‘-z 0 -1 {kmer-size} -2 0 -N 10000000’).
For the multiple pan-genome experiment, a list of available bacterial assemblies was downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt. For all assemblies marked as complete, accessions were extracted and used for their download using RSync (files matching ‘*v?_genomic.fna.gz’). The obtained assemblies were then merged into a master file, which was then used for computing simplitigs and unitigs using ProphAsm and BCALM 2. The obtained simplitig and unitig files were used to construct a BWA index and queried for the same k-mers as in the previous section using ‘bwa fastmap -l {kmer-size}’. The times of loading the indexes into memory were measured separately and subtracted from the query times. With unitigs for k=18, bwa repeatedly crashed in the middle of k-mer matching for an unspecified reason.
Computational setup
All computation was performed on an iMac 4.2 GHz Quad-Core Intel Core i7 with 40 GB RAM. The reproducibility of computation was ensured using BioConda [64]. All benchmarking was performed using ProphAsm v0.1.0 and BCALM 2 v2.2.1. Times and memory footprint were measured using GNU time.
Implementation
ProphAsm is written in C++ and available under the MIT license from http://github.com/prophyle/prophasm.
Acknowledgements
This work was supported by the David and Lucile Packard Foundation.