Improve homology search sensitivity of PacBio data by correcting frameshifts

Bioinformatics. 2016 Sep 1;32(17):i529-i537. doi: 10.1093/bioinformatics/btw458.

Abstract

Motivation: Single-molecule, real-time (SMRT) sequencing, developed by Pacific Biosciences, produces longer reads than second-generation sequencing technologies such as Illumina. The long read length enables PacBio sequencing to close gaps in genome assembly, reveal structural variations, and identify gene isoforms with higher accuracy in transcriptomic sequencing. However, PacBio data have a high sequencing error rate, and most of the errors are insertions or deletions. During alignment-based homology search, insertion or deletion errors in genes cause frameshifts and may yield only marginal alignment scores and short alignments. As a result, it is hard to distinguish true alignments from random alignments, and the ambiguity leads to errors in structural and functional annotation. Existing frameshift correction tools are designed for data with much lower error rates and are not optimized for PacBio data. As an increasing number of groups are using SMRT sequencing, there is an urgent need for dedicated homology search tools for PacBio data.
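
To make the frameshift problem concrete, the following minimal Python sketch translates a short coding sequence before and after a single-base deletion. The sequence, the function names (translate, delete_base) and the placeholder-based repair are illustrative assumptions and are not taken from Frame-Pro; the sketch only shows why one indel changes every downstream codon, not how the tool detects or corrects such errors.

```python
# Standard genetic code (NCBI translation table 1), built from the usual
# compact representation with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((a, b, c) for a in BASES for b in BASES for c in BASES), AMINO
    )
}


def translate(dna: str) -> str:
    """Translate DNA in frame 0; unknown codons become 'X', stops become '*'."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


def delete_base(dna: str, index: int) -> str:
    """Simulate a single-base deletion error at the given 0-based index."""
    return dna[:index] + dna[index + 1:]


# Hypothetical 21-nt coding fragment (7 codons), chosen only for illustration.
original = "ATGGCCAAAGGTGAACTGACC"
# A read that has lost the first base of the second codon.
read_with_deletion = delete_base(original, 3)
# Illustrative repair: re-insert an unknown base 'N' at the error position
# to restore the reading frame (an assumption, not Frame-Pro's algorithm).
repaired = read_with_deletion[:3] + "N" + read_with_deletion[3:]

print(translate(original))
print(translate(read_with_deletion))
print(translate(repaired))
```

In this example the intact fragment translates to MAKGELT, while the read with one missing base translates to MPKVN*, so the protein diverges right after the error and ends at a premature stop codon. Restoring the frame with a placeholder base recovers every residue except the one spanning the error (MXKGELT), which is why frame correction can lengthen alignments and raise alignment scores.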

Results: In this work, we introduce Frame-Pro, a profile homology search tool for PacBio reads. Our tool corrects sequencing errors and also outputs the profile alignments of the corrected sequences against characterized protein families. We applied our tool to both simulated and real PacBio data. The results showed that our method enables more sensitive homology search, especially for PacBio data sets with low sequencing coverage. In addition, our method corrects more errors than a popular error correction tool that does not rely on hybrid sequencing.

Availability and implementation: The source code is freely available at https://sourceforge.net/projects/frame-pro/

Contact: yannisun@msu.edu.

MeSH terms

  • Databases, Genetic
  • Frameshift Mutation*
  • Molecular Sequence Annotation
  • Protein Isoforms
  • Sequence Alignment*
  • Sequence Analysis, DNA*
  • Sequence Deletion
  • Sequence Homology
  • Software

Substances

  • Protein Isoforms