Abstract
Quantitative analysis of next-generation sequencing data requires distinguishing duplicate reads generated by PCR from identical reads that derive from independent molecules. Typically, PCR duplicates are defined as sequence reads that align to the same genomic coordinates using reference-based alignment. However, identical molecules can be independently generated during library preparation. The false positive rate of coordinate-based deduplication has not been well characterized, and false positives may introduce unforeseen biases during analyses. We developed a cost-effective sequencing adapter design by modifying Illumina TruSeq adapters to incorporate a unique molecular identifier (UMI) while maintaining the capacity to undertake multiplexed sequencing. Incorporation of UMIs enables identification of bona fide PCR duplicates as identically mapped reads with identical UMIs. Using TruSeq adapters containing UMIs (TrUMIseq adapters), we find that accurate removal of PCR duplicates results in enhanced data quality for quantitative analysis of allele frequencies in heterogeneous populations and of gene expression.
Method Summary
TrUMIseq adapters incorporate unique molecular identifiers in TruSeq adapters while maintaining the capacity to multiplex sequencing libraries using existing workflows. The use of UMIs increases the accuracy of quantitative sequencing assays, including RNAseq and allele frequency estimation, by enabling accurate detection of PCR duplicates.
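The distinction between coordinate-based deduplication and UMI-aware deduplication can be illustrated with a minimal sketch. It is not the pipeline used in this work. The read tuples, UMI sequences, and function names below are hypothetical, and UMIs are compared by exact match only, with no tolerance for sequencing errors in the UMI.

```python
# Illustrative sketch only: reads are (chromosome, position, strand, umi) tuples.
# All coordinates and UMI sequences are made-up values, not data from the study.
reads = [
    ("chrI", 1000, "+", "ACGTAC"),
    ("chrI", 1000, "+", "ACGTAC"),  # same coordinates, same UMI: PCR duplicate
    ("chrI", 1000, "+", "TTGACA"),  # same coordinates, different UMI: independent molecule
    ("chrI", 2500, "-", "ACGTAC"),  # different coordinates: independent molecule
]


def dedup_by_coordinate(reads):
    """Keep the first read seen at each (chromosome, position, strand), ignoring UMIs."""
    seen = set()
    kept = []
    for chrom, pos, strand, umi in reads:
        key = (chrom, pos, strand)
        if key not in seen:
            seen.add(key)
            kept.append((chrom, pos, strand, umi))
    return kept


def dedup_by_coordinate_and_umi(reads):
    """Keep the first read seen for each (chromosome, position, strand, UMI)."""
    seen = set()
    kept = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            kept.append(read)
    return kept


coordinate_only = dedup_by_coordinate(reads)
umi_aware = dedup_by_coordinate_and_umi(reads)

# Reads discarded by coordinate-only deduplication but retained when UMIs are
# considered correspond to the false positives discussed in the abstract.
false_positives = len(umi_aware) - len(coordinate_only)
print(len(coordinate_only), len(umi_aware), false_positives)
```

In this toy input, coordinate-only deduplication discards the read carrying the second UMI at position 1000 even though it represents an independent molecule, whereas the UMI-aware version retains it.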
Footnotes
2 Present address: Memorial Sloan Kettering Cancer Center, 1275 York Avenue, Box 20, New York, NY 10065 USA.