RT Journal Article
SR Electronic
T1 ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 306902
DO 10.1101/306902
A1 Lauren Coombe
A1 Jessica Zhang
A1 Benjamin P Vandervalk
A1 Justin Chu
A1 Shaun D Jackman
A1 Inanc Birol
A1 René L Warren
YR 2018
UL http://biorxiv.org/content/early/2018/04/25/306902.abstract
AB Background The long-range sequencing information captured by linked reads, such as those available from 10x Genomics (10xG), helps resolve genome sequence repeats, and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodology that uses linked reads to organize genome assemblies further into contiguous drafts. Our approach departs from other read alignment-dependent linked read scaffolders, including our own (ARCS), and uses a kmer-based mapping approach. The kmer mapping strategy has several advantages over read alignment methods, including better usability and faster processing, as it precludes the need for input sequence formatting and draft sequence assembly indexing. The reliance on kmers instead of read alignments for pairing sequences relaxes the workflow requirements, and drastically reduces the run time.Results Here, we show how linked reads, when used in conjunction with Hi-C data for scaffolding, improve a draft human genome assembly of PacBio long-read data five-fold (baseline vs. ARKS NG50=4.6 vs. 23.1 Mbp, respectively). We also demonstrate how the method provides further improvements of a megabase-scale Supernova human genome assembly, which itself exclusively uses linked read data for assembly, with an execution speed six to nine times faster than competitive linked read scaffolders. Following ARKS scaffolding of a human genome 10xG Supernova assembly (of cell line NA12878), fewer than 9 scaffolds cover each chromosome, except the largest (chromosome 1, n=13).Conclusions ARKS uses a kmer mapping strategy instead of linked read alignments to record and associate the barcode information needed to order and orient draft assembly sequences. The simplified workflow, when compared to that of our initial implementation, ARCS, markedly improves run time performances on experimental human genome datasets. Furthermore, ARKS utilizes barcoding information from linked reads to estimate gap size. It accomplishes this by modeling the relationship between known distances of a region within contigs and calculating associated Jaccard indices. ARKS has the potential to provide correct, chromosome-scale, genome assemblies, promptly. We expect ARKS to have broad utility in helping refine draft genomes.