Combinatorial algorithms for DNA sequence assembly

Algorithmica

Abstract

The trend toward very large DNA sequencing projects, such as those being undertaken as part of the Human Genome Program, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, we present a four-phase approach based on rigorous design criteria that has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates and can list a series of alternative solutions in the event that several appear equally good. Moreover, it uses a limited form of multiple sequence alignment to detect, and often correct, errors in the data. Our combined algorithm has successfully reconstructed nonrepetitive sequences of length 50,000 sampled at error rates as high as 10%.
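
To make the underlying problem concrete, the following is a minimal sketch of the simpler, error-free shortest common superstring problem mentioned above, solved with a textbook greedy merge heuristic. It is not the four-phase method of the paper: it assumes exact overlaps, no sequencing errors, and fragments that all come from the same strand. The function names `reverse_complement`, `overlap`, and `greedy_superstring` are illustrative choices, not names taken from the article.

```python
# Illustrative sketch only: greedy heuristic for the error-free
# shortest common superstring problem (not the paper's algorithm).

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(fragment):
    """Return the fragment as it would read on the opposite strand."""
    return "".join(COMPLEMENT[base] for base in reversed(fragment))


def overlap(a, b):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Repeatedly merge the pair of fragments with the largest exact overlap."""
    unique = sorted(set(fragments))  # sorted for deterministic tie-breaking
    # Fragments contained in another fragment add nothing to the superstring.
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    if not frags:
        return ""
    while len(frags) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags[0]


if __name__ == "__main__":
    reads = ["ACGTAC", "TACGGA", "GGATTC"]
    print(greedy_superstring(reads))     # ACGTACGGATTC
    print(reverse_complement("ACGTAC"))  # GTACGT
```

In the formulation described in the abstract, each fragment may also have been read from the opposite strand, so a realistic assembler would have to consider both a fragment and its reverse complement and would need approximate rather than exact overlap detection to tolerate sequencing errors. The sketch above leaves both complications out.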

Additional information

Communicated by E. W. Myers.

This research was supported by the National Library of Medicine under Grant R01-LM4960, by a postdoctoral fellowship from the Program in Mathematics and Molecular Biology of the University of California at Berkeley under National Science Foundation Grant DMS-8720208, and by a fellowship from the Centre de recherches mathématiques of the Université de Montréal.

About this article

Cite this article

Kececioglu, J.D., Myers, E.W. Combinatorial algorithms for DNA sequence assembly. Algorithmica 13, 7–51 (1995). https://doi.org/10.1007/BF01188580

  • DOI: https://doi.org/10.1007/BF01188580
