RT Journal Article
SR Electronic
T1 Extracting allelic read counts from 250,000 human sequencing runs in Sequence Read Archive
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 386441
DO 10.1101/386441
A1 Brian Tsui
A1 Michelle Dow
A1 Dylan Skola
A1 Hannah Carter
YR 2018
UL http://biorxiv.org/content/early/2018/08/07/386441.abstract
AB The Sequence Read Archive (SRA) contains over one million publicly available sequencing runs from various studies using a variety of sequencing library strategies. These data inherently contain information about underlying genomic sequence variants, which we exploit to extract allelic read counts on an unprecedented scale. We reprocessed over 250,000 human sequencing runs (>1000 TB of raw sequence data) into a single unified dataset of allelic read counts for nearly 300,000 variants of biomedical relevance curated by NCBI dbSNP. Germline variants were detected in a median of 912 sequencing runs and somatic variants in a median of 4,876 sequencing runs, suggesting that this dataset facilitates identification of sequencing runs that harbor variants of interest. Allelic read counts obtained using a targeted alignment were very similar to read counts obtained from whole genome alignment. Analyzing allelic read count data for matched DNA and RNA samples from tumors, we find that RNA-seq can also recover variants identified by WXS, suggesting that reprocessed allelic read counts can support variant detection across different library strategies in SRA. This study provides a rich database of known human variants across SRA samples that can support future meta-analyses of human sequence variation.