Abstract
The de Bruijn graph is a key data structure in modern computational genomics, and construction of its compacted variant resides upstream of many genomic analyses. As the quantity of genomic data grows rapidly, this construction step often becomes a computational bottleneck.
We present Cuttlefish 2, which significantly advances the state of the art for this problem. On a commodity server, it reduces the graph construction time for 661K bacterial genomes, of size 2.58 Tbp, from 4.5 days to 17–23 hours; and it constructs the graph for 1.52 Tbp of white spruce reads in ∼10 hours, whereas the closest competitor requires 54–58 hours and uses considerably more memory.
Competing Interest Statement
RP is a co-founder of Ocean Genomics, Inc.
Footnotes
Improved exposition and clarified details of filtering and implications of vertex-centric vs. edge-centric de Bruijn graphs. Added experiments on a new data type (RNA-seq). Added experiments using unitig and maximal path construction upstream of associative k-mer indexing. Revised results using updated versions of several tools. Fixed various typos.