Large-Scale Multiple Sequence Alignment and Tree Estimation Using SATé

  • Protocol

Part of the book series: Methods in Molecular Biology ((MIMB,volume 1079))

Abstract

SATé is a method for estimating multiple sequence alignments and trees that has been shown to produce highly accurate results for datasets with large numbers of sequences. Running SATé using its default settings is very simple, but improved accuracy can be obtained by modifying its algorithmic parameters. We provide a detailed introduction to the algorithmic approach used by SATé, and instructions for running a SATé analysis using the GUI under default settings. We also provide a discussion of how to modify these settings to obtain improved results, and how to use SATé in a phylogenetic analysis pipeline.

Acknowledgments

This work was supported by a training fellowship to KL from the Keck Center of the Gulf Coast Consortia, on the NLM Training Program in Biomedical Informatics, National Library of Medicine (NLM) T15LM007093. This work was also partially supported by NSF grant DEB 0733029 to TW.

Copyright information

© 2014 Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Liu, K., Warnow, T. (2014). Large-Scale Multiple Sequence Alignment and Tree Estimation Using SATé. In: Russell, D. (ed) Multiple Sequence Alignment Methods. Methods in Molecular Biology, vol 1079. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-646-7_15

  • DOI: https://doi.org/10.1007/978-1-62703-646-7_15

  • Publisher Name: Humana Press, Totowa, NJ

  • Print ISBN: 978-1-62703-645-0

  • Online ISBN: 978-1-62703-646-7

  • eBook Packages: Springer Protocols
