The accuracy of DNA sequences: estimating sequence quality

Genomics. 1992 Sep;14(1):89-98. doi: 10.1016/s0888-7543(05)80288-5.

Abstract

In this paper we describe a method for the statistical reconstruction of a large DNA sequence from a set of sequenced fragments. We assume that the fragments have been assembled and address the problem of determining the degree to which the reconstructed sequence is free from errors, i.e., its accuracy. A consensus distribution is derived from the assembled fragment configuration based upon the rates of sequencing errors in the individual fragments. The consensus distribution can be used to find a minimally redundant consensus sequence that meets a prespecified confidence level, either base by base or across any region of the sequence. A likelihood-based procedure for the estimation of the sequencing error rates, which utilizes an iterative EM algorithm, is described. Prior knowledge of the error rates is easily incorporated into the estimation procedure. The methods are applied to a set of assembled sequence fragments from the human G6PD locus. We close the paper with a brief discussion of the relevance and practical implications of this work.
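The abstract does not give the error model itself, so the following is only a minimal sketch of the general idea under simplifying assumptions that are not taken from the paper: substitution errors only, a single error rate shared by all fragments, a uniform prior over the four bases, and independent reads within each aligned column. The function names (`consensus_distribution`, `minimal_base_set`, `estimate_error_rate`) are hypothetical. The sketch computes a per-column posterior over the true base, picks the smallest set of bases whose posterior mass reaches a chosen confidence level (one possible reading of "minimally redundant"), and re-estimates the error rate with a simple EM loop.

```python
import math

BASES = "ACGT"

def consensus_distribution(column, error_rate):
    """Posterior over the true base for one aligned column of read bases.

    Assumes (for illustration only) a uniform prior and that each read
    shows the true base with probability 1 - error_rate, and each of the
    three other bases with probability error_rate / 3.
    """
    log_post = {}
    for b in BASES:
        lp = math.log(0.25)
        for r in column:
            lp += math.log(1.0 - error_rate) if r == b else math.log(error_rate / 3.0)
        log_post[b] = lp
    # Normalize in log space to avoid underflow on deep columns.
    m = max(log_post.values())
    weights = {b: math.exp(lp - m) for b, lp in log_post.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}

def minimal_base_set(dist, confidence=0.99):
    """Smallest set of bases whose combined posterior reaches `confidence`."""
    chosen, mass = [], 0.0
    for b in sorted(dist, key=lambda x: -dist[x]):
        chosen.append(b)
        mass += dist[b]
        if mass >= confidence:
            break
    return "".join(sorted(chosen))

def estimate_error_rate(columns, initial_rate=0.05, iterations=50):
    """EM estimate of a single shared substitution error rate.

    E-step: posterior over the true base in every column.
    M-step: expected number of erroneous read bases / total read bases.
    """
    rate = initial_rate
    n_obs = sum(len(c) for c in columns)
    for _ in range(iterations):
        expected_errors = 0.0
        for column in columns:
            post = consensus_distribution(column, rate)
            for r in column:
                # Probability that this read base differs from the true base.
                expected_errors += 1.0 - post[r]
        # Clamp to keep the logarithms in the next E-step finite.
        rate = min(max(expected_errors / n_obs, 1e-6), 0.5)
    return rate

if __name__ == "__main__":
    columns = ["AAAA", "CCCT", "GGGG", "TTTT"]  # toy data, not from the paper
    rate = estimate_error_rate(columns)
    for col in columns:
        dist = consensus_distribution(col, rate)
        print(col, minimal_base_set(dist, confidence=0.99))
```

In this toy model, prior knowledge of the error rate could be folded in by adding pseudo-counts to the numerator and denominator of the M-step; per-fragment rates, insertions, and deletions would need a richer model than the one assumed here.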

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.
  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Algorithms
  • Base Sequence
  • DNA / analysis*
  • Glucosephosphate Dehydrogenase / genetics
  • Humans
  • Models, Genetic
  • Models, Theoretical
  • Molecular Sequence Data
  • Polymorphism, Restriction Fragment Length
  • Reproducibility of Results

Substances

  • DNA
  • Glucosephosphate Dehydrogenase