Abstract
Emerging Linked-Read technologies (aka Read-Cloud or barcoded short-reads) have revived interest in standard short-read technology as a viable way to understand large-scale structure in genomes and metagenomes. Linked-Read technologies, such as the 10X Chromium system, use a microfluidic system and a set of specially designed 3’ barcodes (aka UIDs) to tag short DNA reads that were originally sourced from the same long fragment of DNA; these barcoded reads are then sequenced on standard short-read platforms. This approach involves several compromises. Each long fragment of DNA is covered only sparsely by short reads, no information about the relative ordering of reads from the same fragment is preserved, and each 3’ barcode typically matches reads from 2-20 long fragments of DNA. However, compared to long-read platforms such as those produced by Pacific Biosciences and Oxford Nanopore, the sequencing cost per base is far lower, far less input DNA is required, and the per-base error rate is that of Illumina short-reads.
The use of Linked-Reads presents a new set of algorithmic challenges. In this paper, we formally describe one particular issue common to all applications of Linked-Read technology: the deconvolution of reads with a single 3’ barcode into clusters that each correspond to a single long fragment of DNA. We introduce Minerva, a graph-based algorithm that approximately solves the barcode deconvolution problem for metagenomic data (where reference genomes may be incomplete or unavailable). Additionally, we develop two demonstrations in which the deconvolution of barcoded reads improves downstream results: improving the specificity of taxonomic assignments and improving the clustering of related sequences. To the best of our knowledge, we are the first to address the problem of barcode deconvolution in metagenomics.
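To make the barcode deconvolution problem concrete, the following toy sketch partitions the reads of one barcode into clusters. It is an illustration only and should not be read as Minerva's actual procedure: it assumes that two reads with the same barcode belong to the same fragment if both share at least one k-mer with reads from a common other barcode, and it takes connected components of the resulting graph as clusters. The function names (`kmers`, `deconvolve`), the parameters `k` and `min_shared`, and the example sequences are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def deconvolve(reads, target, k=4, min_shared=1):
    """Cluster the reads of one barcode (toy sketch, not Minerva itself).

    reads: list of (barcode, sequence) pairs.
    Returns a sorted list of clusters, each a sorted list of indices
    into `reads` for reads carrying the `target` barcode.
    """
    # Index every k-mer seen under a barcode other than the target.
    kmer_to_barcodes = defaultdict(set)
    for barcode, seq in reads:
        if barcode != target:
            for km in kmers(seq, k):
                kmer_to_barcodes[km].add(barcode)

    # For each target read, collect the other barcodes it shares a k-mer with.
    linked = {}
    for idx, (barcode, seq) in enumerate(reads):
        if barcode == target:
            others = set()
            for km in kmers(seq, k):
                others |= kmer_to_barcodes.get(km, set())
            linked[idx] = others

    # Union-find over target reads; an edge joins two reads whose sets of
    # linked barcodes overlap in at least `min_shared` barcodes (assumption).
    parent = {idx: idx for idx in linked}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(linked, 2):
        if len(linked[a] & linked[b]) >= min_shared:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for idx in linked:
        clusters[find(idx)].append(idx)
    return sorted(sorted(c) for c in clusters.values())


# Made-up example: barcode "A" tags reads from two unrelated fragments.
reads = [
    ("A", "ACGTACGT"),  # index 0, overlaps the read under barcode "B"
    ("A", "TTGGCCAA"),  # index 1, overlaps the read under barcode "C"
    ("B", "GTACGTAA"),
    ("C", "GGCCAATT"),
    ("A", "CGTACGTT"),  # index 4, also overlaps the read under barcode "B"
]
clusters = deconvolve(reads, "A")
print(clusters)
```

In this example, reads 0 and 4 are grouped because both share k-mers with barcode "B", while read 1 is left in its own cluster. The all-pairs comparison is quadratic in the number of reads per barcode, which is acceptable for a sketch but would need replacing at realistic data sizes.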