The accuracy of DNA sequences: estimating sequence quality

G A Churchill; M S Waterman

doi:10.1016/s0888-7543(05)80288-5

Genomics. 1992 Sep;14(1):89-98. doi: 10.1016/s0888-7543(05)80288-5.

Authors

G A Churchill¹, M S Waterman

Affiliation

¹ Biometrics Unit, Cornell University, Ithaca, New York 14853.

PMID: 1358801

Abstract

In this paper we describe a method for the statistical reconstruction of a large DNA sequence from a set of sequenced fragments. We assume that the fragments have been assembled and address the problem of determining the degree to which the reconstructed sequence is free from errors, i.e., its accuracy. A consensus distribution is derived from the assembled fragment configuration based upon the rates of sequencing errors in the individual fragments. The consensus distribution can be used to find a minimally redundant consensus sequence that meets a prespecified confidence level, either base by base or across any region of the sequence. A likelihood-based procedure for the estimation of the sequencing error rates, which utilizes an iterative EM algorithm, is described. Prior knowledge of the error rates is easily incorporated into the estimation procedure. The methods are applied to a set of assembled sequence fragments from the human G6PD locus. We close the paper with a brief discussion of the relevance and practical implications of this work.
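
The ideas in the abstract can be illustrated with a minimal sketch. The abstract does not specify the error model, so everything below is an assumption made for illustration, not the authors' method: a single substitution error rate shared by all fragments, errors spread evenly over the three wrong bases, a uniform prior over the true base, and aligned columns containing only A, C, G, and T (no gaps). The function names `consensus_posterior`, `minimal_base_set`, and `estimate_error_rate` are hypothetical.

```python
BASES = "ACGT"


def consensus_posterior(column, error_rate):
    """Posterior over the true base at one aligned column.

    Assumed model: uniform prior, independent reads, and one
    substitution error rate split evenly over the three wrong bases.
    """
    weights = {}
    for true_base in BASES:
        w = 1.0 / len(BASES)  # uniform prior (assumption)
        for observed in column:
            if observed == true_base:
                w *= 1.0 - error_rate
            else:
                w *= error_rate / 3.0
        weights[true_base] = w
    total = sum(weights.values())
    return {base: w / total for base, w in weights.items()}


def minimal_base_set(posterior, confidence):
    """Smallest set of bases whose posterior mass reaches `confidence`."""
    chosen, mass = set(), 0.0
    for base in sorted(posterior, key=posterior.get, reverse=True):
        chosen.add(base)
        mass += posterior[base]
        if mass >= confidence:
            break
    return chosen


def estimate_error_rate(columns, start=0.1, iterations=50):
    """EM estimate of the single error rate under the assumed model."""
    rate = start
    total_reads = sum(len(column) for column in columns)
    for _ in range(iterations):
        # E-step: expected number of miscalled bases given the current rate.
        expected_errors = 0.0
        for column in columns:
            posterior = consensus_posterior(column, rate)
            for true_base, p in posterior.items():
                mismatches = sum(1 for obs in column if obs != true_base)
                expected_errors += p * mismatches
        # M-step: new rate is expected errors per observed base,
        # clamped away from 0 so the posterior stays well defined.
        rate = min(max(expected_errors / total_reads, 1e-6), 0.5)
    return rate


if __name__ == "__main__":
    columns = ["AAAA"] * 9 + ["AAAC"]
    rate = estimate_error_rate(columns)
    for column in ("AAAA", "AC"):
        posterior = consensus_posterior(column, rate)
        print(column, sorted(minimal_base_set(posterior, 0.9)))
```

Under these assumptions, a column where the reads agree yields a single base, while a column with conflicting reads yields a larger set of candidate bases. This is one reading of the abstract's "minimally redundant consensus" at a given confidence level. A model closer to the abstract would allow more than one error rate and would fold prior knowledge of those rates into the M-step.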

Publication types

Research Support, U.S. Gov't, Non-P.H.S.
Research Support, U.S. Gov't, P.H.S.

MeSH terms

Algorithms
Base Sequence
DNA / analysis*
Glucosephosphate Dehydrogenase / genetics
Humans
Models, Genetic
Models, Theoretical
Molecular Sequence Data
Polymorphism, Restriction Fragment Length
Reproducibility of Results

Substances

DNA
Glucosephosphate Dehydrogenase

Grants and funding

2R01GM36230-06/GM/NIGMS NIH HHS/United States