PT  - JOURNAL ARTICLE
AU  - Jana Ebler
AU  - Marina Haukness
AU  - Trevor Pesout
AU  - Tobias Marschall
AU  - Benedict Paten
TI  - Haplotype-aware genotyping from noisy long reads
AID  - 10.1101/293944
DP  - 2018 Jan 01
TA  - bioRxiv
PG  - 293944
4099  - http://biorxiv.org/content/early/2018/04/03/293944.short
4100  - http://biorxiv.org/content/early/2018/04/03/293944.full
AB  - Motivation: Current genotyping approaches for single nucleotide variations (SNVs) rely on short, relatively accurate reads from second generation sequencing devices. Presently, third generation sequencing platforms able to generate much longer reads are becoming more widespread. These platforms come with the significant drawback of higher sequencing error rates, which make them ill-suited to current genotyping algorithms. However, the longer reads make more of the genome unambiguously mappable and typically provide linkage information between neighboring variants. Results: In this paper we introduce a novel approach for haplotype-aware genotyping from noisy long reads. We do this by considering bipartitions of the sequencing reads, corresponding to the two haplotypes. We formalize the computational problem in terms of a Hidden Markov Model and compute posterior genotype probabilities using the forward-backward algorithm. Genotype predictions can then be made by picking the most likely genotype at each site. Our experiments indicate that longer reads allow significantly more of the genome to potentially be accurately genotyped. Further, we are able to independently validate with both Oxford Nanopore and Pacific Biosciences sequencing data millions of variants previously identified by short-read technologies in the reference NA12878 sample, including hundreds of thousands of variants that were not previously included in the high-confidence reference set.