ABSTRACT
Motivation Our aim was to simplify and speed up joint-genotyping from the sequence-based variation data of individual samples, while maintaining sensitivity and specificity as high as possible.
Results We have leveraged versatile GOR data structures to store biallelic representations of variants and sequence read coverage efficiently, enabling joint-genotyping that is an order of magnitude faster than any joint-genotyping method published to date. Furthermore, the approach is easily extended and runs much faster still when executed incrementally. Concordance analysis based on the Genome in a Bottle (GIAB) samples shows favorable results compared with the de facto standard approach, which uses gVCF files and GATK joint-calling. Additionally, we have developed variant quality classification using XGBoost and variant training sets derived from the GIAB samples. The entire business logic is implemented efficiently and concisely in SparkGOR.
Availability SparkGOR is open-source and freely available at https://github.com/gorpipe.
Contact hakon{at}genuitysci.com
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Minor edits; eliminating a deprecated URL and an incorrect citation.