GMAP and GSNAP for Genomic Sequence Alignment: Enhancements to Speed, Accuracy, and Functionality

Wu, Thomas D.; Reeder, Jens; Lawrence, Michael; Becker, Gabe; Brauer, Matthew J.

doi:10.1007/978-1-4939-3578-9_15

Protocol
First Online: 24 March 2016

Part of the book series: Methods in Molecular Biology (MIMB, volume 1418)

Abstract

The programs GMAP and GSNAP, for aligning RNA-Seq and DNA-Seq datasets to genomes, have evolved along with advances in biological methodology to handle longer reads, larger volumes of data, and new types of biological assays. The genomic representation has been improved to include linear genomes that can compare sequences using single-instruction multiple-data (SIMD) instructions, compressed genomic hash tables with fast access using SIMD instructions, handling of large genomes with more than four billion bp, and enhanced suffix arrays (ESAs) with novel data structures for fast access. Improvements to the algorithms have included a greedy match-and-extend algorithm using suffix arrays, segment chaining using genomic hash tables, diagonalization using segmental hash tables, and nucleotide-level dynamic programming procedures that use SIMD instructions and eliminate the need for F-loop calculations. Enhancements to the functionality of the programs include standardization of indel positions, handling of ambiguous splicing, clipping and merging of overlapping paired-end reads, and alignments to circular chromosomes and alternate scaffolds. The programs have been adapted for use in pipelines by integrating their usage into R/Bioconductor packages such as gmapR and HTSeqGenie, and these pipelines have facilitated the discovery of numerous biological phenomena.
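As an illustration of one idea named in the abstract, the following is a minimal sketch of greedy match-and-extend over a suffix array. It is not the GMAP/GSNAP implementation: the function names, the naive suffix array construction, and the toy sequences are all assumptions made for this example, and the enhanced suffix array structures and SIMD techniques described in the chapter are not modeled here.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order.
    Fine for toy inputs; real aligners use far more efficient construction."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_match(text, sa, query):
    """Return (text_pos, length) for the longest prefix of `query` found in
    `text`, or (-1, 0) if not even the first character matches."""
    # Binary search for the position where `query` would be inserted
    # among the sorted suffixes.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    # The suffix sharing the longest prefix with `query` is adjacent
    # to the insertion point, so only two candidates need checking.
    best = (-1, 0)
    for idx in (lo - 1, lo):
        if 0 <= idx < len(sa):
            n = common_prefix_len(text[sa[idx]:], query)
            if n > best[1]:
                best = (sa[idx], n)
    return best


def greedy_match_and_extend(text, sa, read, min_len=1):
    """Cover `read` left to right with maximal exact matches.
    Returns a list of (read_pos, text_pos, length) segments."""
    segments = []
    i = 0
    while i < len(read):
        pos, n = longest_match(text, sa, read[i:])
        if n >= min_len:
            segments.append((i, pos, n))
            i += n          # jump past the matched stretch
        else:
            i += 1          # skip a base that cannot be matched
    return segments


# Hypothetical toy data: a short "genome" and a read with one unmatched base.
genome = "GATTACAGGCCTTAGC"
read = "TTACAGNCCTTAG"
sa = build_suffix_array(genome)
segments = greedy_match_and_extend(genome, sa, read)
print(segments)  # [(0, 2, 6), (7, 9, 6)]
```

In this toy case the read is covered by two exact segments separated by the unmatched `N`. A real aligner would then have to reconcile such segments into a single alignment, for example by chaining them and filling gaps with dynamic programming, which the abstract lists as separate components.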

Author information

Authors and Affiliations

Genentech, South San Francisco, CA, USA
Thomas D. Wu, Jens Reeder, Michael Lawrence, Gabe Becker & Matthew J. Brauer

Editor information

Editors and Affiliations

Ohio State University, Biomed Informatics, College of Medicine, Columbus, Ohio, USA
Ewy Mathé
National Cancer Institute, National Institutes of Health, Columbia, Maryland, USA
Sean Davis

About this protocol

Cite this protocol

Wu, T.D., Reeder, J., Lawrence, M., Becker, G., Brauer, M.J. (2016). GMAP and GSNAP for Genomic Sequence Alignment: Enhancements to Speed, Accuracy, and Functionality. In: Mathé, E., Davis, S. (eds) Statistical Genomics. Methods in Molecular Biology, vol 1418. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3578-9_15

DOI: https://doi.org/10.1007/978-1-4939-3578-9_15
Published: 24 March 2016
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-3576-5
Online ISBN: 978-1-4939-3578-9
eBook Packages: Springer Protocols
