PT - JOURNAL ARTICLE
AU - Timothy J. Hackmann
TI - Correcting values of DNA sequence similarity for errors in sequencing
AID - 10.1101/237990
DP - 2017 Jan 01
TA - bioRxiv
PG - 237990
4099 - http://biorxiv.org/content/early/2017/12/22/237990.short
4100 - http://biorxiv.org/content/early/2017/12/22/237990.full
AB - The similarity between two DNA sequences is one of the most important measures in bioinformatics, but errors introduced during sequencing make values of similarity lower than they should be. Here we develop a method to correct raw sequence similarity for sequencing errors and estimate the original sequence similarity. Our method is simple and consists of a single equation with terms for 1) raw sequence similarity and 2) error rates (e.g., from Phred quality scores). We show the importance of this correction for 16S ribosomal DNA sequences from bacterial communities, where 97% similarity is a frequent threshold for clustering sequences for analysis. At that threshold and a typical error rate of 0.2%, correcting for error increases similarity by 0.36 percentage points. This result shows that, if uncorrected, sequencing error would increase similarity thresholds and generate false clusters for analysis. Our method could be used to adjust thresholds for cluster-based analyses. Alternatively, because it requires no clustering to correct sequence similarity, it could usher in a new age of analyzing ribosomal DNA sequences without clustering.
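
The abstract names the two inputs of the correction (raw similarity and per-sequence error rates, e.g., from Phred quality scores) but does not state the equation itself. The Python sketch below is therefore only an illustration of the idea, not the author's method. The Phred conversion, p = 10^(-Q/10), is the standard definition of the score. The correction step is an assumption made here for illustration: it treats errors in the two sequences as independent and ignores the rare case where an error creates a match by chance, so a truly matching position is observed as a match only when both base calls are correct. The function names `phred_to_error_prob`, `mean_error_rate`, and `corrected_similarity` are hypothetical.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability, p = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def mean_error_rate(quals):
    """Average per-base error probability over a list of Phred scores."""
    if not quals:
        raise ValueError("quals must not be empty")
    probs = [phred_to_error_prob(q) for q in quals]
    return sum(probs) / len(probs)


def corrected_similarity(raw, e1, e2):
    """Estimate original similarity from raw similarity (all values as fractions).

    ASSUMPTION (not taken from the paper): errors in the two sequences are
    independent, and an error never turns a mismatch into a match. Under that
    simplification, raw = original * (1 - e1) * (1 - e2).
    """
    if not 0.0 <= raw <= 1.0:
        raise ValueError("raw similarity must be in [0, 1]")
    for e in (e1, e2):
        if not 0.0 <= e < 1.0:
            raise ValueError("error rates must be in [0, 1)")
    p_both_correct = (1.0 - e1) * (1.0 - e2)
    # Cap at 1.0 because similarity cannot exceed 100%.
    return min(1.0, raw / p_both_correct)


if __name__ == "__main__":
    # Two reads whose bases are all Q27 (roughly a 0.2% error rate each).
    e = mean_error_rate([27] * 250)
    print(f"error rate: {e:.4%}")
    print(f"corrected similarity: {corrected_similarity(0.97, e, e):.4%}")
```

Because this is a simplified first-order model rather than the equation developed in the paper, its output should not be expected to reproduce the 0.36 percentage-point figure reported in the abstract.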