PyRAD: assembly of de novo RADseq loci for phylogenetic analyses

Deren A R Eaton

doi:10.1093/bioinformatics/btu121

Bioinformatics. 2014 Jul 1;30(13):1844-9. doi: 10.1093/bioinformatics/btu121. Epub 2014 Mar 5.

Author

Deren A R Eaton¹

Affiliation

¹ Committee on Evolutionary Biology, University of Chicago, 1025 E. 57th St., Chicago, IL 60637, USA and Botany Department, Field Museum of Natural History, 1400 S. Lake Shore Dr., Chicago, IL 60605, USA.

PMID: 24603985
DOI: 10.1093/bioinformatics/btu121

Abstract

Motivation: Restriction-site-associated genomic markers are a powerful tool for investigating evolutionary questions at the population level, but are limited in their utility at deeper phylogenetic scales where fewer orthologous loci are typically recovered across disparate taxa. While this limitation stems in part from mutations to restriction recognition sites that disrupt data generation, an additional source of data loss comes from the failure to identify homology during bioinformatic analyses. Clustering methods that allow for lower similarity thresholds and the inclusion of indel variation will perform better at assembling RADseq loci at the phylogenetic scale.

Results: PyRAD is a pipeline to assemble de novo RADseq loci with the aim of optimizing coverage across phylogenetic datasets. It uses a wrapper around an alignment-clustering algorithm, which allows for indel variation within and between samples, as well as for incomplete overlap among reads (e.g. paired-end). Here I compare PyRAD with the program Stacks in their performance analyzing a simulated RADseq dataset that includes indel variation. Indels disrupt clustering of homologous loci in Stacks but not in PyRAD, such that the latter recovers more shared loci across disparate taxa. I show through reanalysis of an empirical RADseq dataset that indels are a common feature of such data, even at shallow phylogenetic scales. PyRAD uses parallel processing as well as an optional hierarchical clustering method, which allows it to rapidly assemble phylogenetic datasets with hundreds of sampled individuals.
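Why indels matter for clustering can be illustrated with a minimal Python sketch. This is not PyRAD's or Stacks' actual code: the function names, the two example reads, and the 0.85 similarity threshold are all assumptions chosen for illustration, and a plain edit distance stands in for a real alignment-clustering algorithm. It compares two reads that differ by a single deleted base, first position by position (no gaps allowed) and then with a gapped comparison.

```python
def ungapped_identity(a, b):
    """Share of positions that match when reads are compared column by column."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def edit_distance(a, b):
    """Levenshtein distance: each substitution, insertion, or deletion costs 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion from a
                            curr[j - 1] + 1,           # insertion into a
                            prev[j - 1] + (x != y)))   # match or substitution
        prev = curr
    return prev[-1]


def gapped_identity(a, b):
    """Similarity that tolerates indels, scaled by the longer read."""
    longest = max(len(a), len(b))
    return (longest - edit_distance(a, b)) / longest


# Hypothetical reads: read_b is read_a with one base deleted.
read_a = "ACGTTGCAAGCTTAGGCTAA"
read_b = read_a[:5] + read_a[6:]
threshold = 0.85  # illustrative clustering threshold, not a recommended value

for name, score in [("ungapped", ungapped_identity(read_a, read_b)),
                    ("gapped", gapped_identity(read_a, read_b))]:
    verdict = "same cluster" if score >= threshold else "split"
    print(f"{name}: {score:.2f} -> {verdict}")
```

In this toy case the single deletion shifts every downstream base out of register, so the ungapped score falls well below the threshold while the gapped score stays above it. That is the kind of homology loss the abstract attributes to clustering without indel support.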

Availability: Software is written in Python and freely available at http://www.dereneaton.com/software/.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Cluster Analysis
Genetic Loci
High-Throughput Nucleotide Sequencing / methods*
Phylogeny*
Sequence Analysis, DNA / methods*
Software