Abstract
Identification of sequence variation from short-read sequence data is subject to common-yet-intermittent miscalling that occurs in a sequence-intrinsic manner. We show that recurrent false positive single nucleotide variants are prevalent in databases of human sequence variation and demonstrate that each individual sample generates a unique set of recurrent false positive variants. These recurrent miscalls result from known difficulties in aligning short-read sequence data between redundant genomic regions. We could replicate, catalogue and remove three quarters of these recurrent miscalls for any given exome with as few as ten rounds of read resampling, realignment and recalling. Removing such misleading variants reduces the search space for identifying disease-causing variants.
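The catalogue-and-remove step can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Variant` tuple, both function names, and the `min_rounds` threshold are hypothetical. It also assumes that each resampling round yields a set of called variants together with the set of variants known to be present in the resampled reads, so that any call outside the known set counts as a miscall.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Set, Tuple


class Variant(NamedTuple):
    """Hypothetical minimal SNV record: chromosome, position, ref and alt allele."""
    chrom: str
    pos: int
    ref: str
    alt: str


def build_miscall_catalogue(
    rounds: Iterable[Tuple[Set[Variant], Set[Variant]]],
    min_rounds: int = 1,
) -> Set[Variant]:
    """Collect miscalls seen in at least `min_rounds` resampling rounds.

    Each round is a (called, expected) pair. Assumption: `expected` holds the
    variants known to be present in the resampled reads, so any call outside
    it is treated as a false positive.
    """
    counts: Counter = Counter()
    for called, expected in rounds:
        # A set difference counts each miscall at most once per round.
        counts.update(called - expected)
    return {variant for variant, n in counts.items() if n >= min_rounds}


def remove_catalogued(calls: Set[Variant], catalogue: Set[Variant]) -> Set[Variant]:
    """Drop every call that appears in the miscall catalogue."""
    return calls - catalogue
```

Raising `min_rounds` keeps only miscalls that recur across rounds, at the cost of missing those that appear intermittently. The threshold used here is an illustrative choice, not one taken from the study.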
List of Abbreviations
- SNV – single nucleotide variant
- RFP – recurrent false positive
- ENU – N-ethyl-N-nitrosourea