Abstract
Accurate detection and genotyping of structural variations (SVs) from short-read data is a long-standing area of development in genomics research and clinical sequencing pipelines. We introduce Paragraph, an accurate genotyper that models SVs using sequence graphs and SV annotations. We demonstrate the accuracy of Paragraph on whole-genome sequence data from three samples using long-read SV calls as the truth set, and then apply Paragraph at scale to a cohort of 100 short-read sequenced samples of diverse ancestry. Our analysis shows that Paragraph has better accuracy than other existing genotypers and can be applied to population-scale studies.
Footnotes
schen6{at}illumina.com
pkrusche{at}gmail.com
edolzhenko{at}illumina.com
rsherman{at}jhu.edu
RPetrovski{at}illumina.com
fschlesinger{at}illumina.com
mkirsche{at}jhu.edu
DBentley{at}illumina.com
mschatz{at}cs.jhu.edu
fritz.sedlazeck{at}bcm.edu
meberle{at}illumina.com
1. We now use an expanded truth dataset from three samples sequenced using highly accurate CCS reads to provide a better balance between TPs and TNs for our recall and precision calculations.
2. There are no longer separate recall and precision sections; both metrics are now analyzed simultaneously.
3. All of the software methods are assessed for recall, precision, and ability to handle deviations in breakpoint accuracy. Related figures and tables (including supplementary) were updated based on the new truth data.
List of abbreviations
- SV: structural variation
- bp: base pair
- TR: tandem repeat