Abstract
Background As next-generation sequencing comes to dominate in output capacity and sequence length, adapters attached to reads and low-quality bases degrade downstream analysis both directly and indirectly, for example by producing false-positive single nucleotide polymorphisms (SNPs) and fragmented assemblies. A fast trimming algorithm is needed to remove adapters precisely, especially in read tails, where base quality is relatively low.
Findings We present a trimming program named Atria. Atria locates adapters in paired reads and identifies candidate overlapping regions using a fast byte-based matching algorithm that runs in O(n) time and O(1) space. Atria also supports single-end reads and uses multi-threading for both sequence processing and file compression.
Conclusions Atria performs favorably against other cutting-edge trimmers in trimming and runtime benchmarks on both simulated and real data. Its lightweight byte-based matching algorithm can also be applied to a broad range of short-sequence matching tasks, such as primer search and seed scanning before alignment.
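To make the linear-time, constant-space claim concrete, the following is a minimal illustrative sketch of seed scanning with a rolling, bit-packed window. It is not Atria's implementation: it uses a 2-bit-per-base packing rather than the byte-based encoding described here, and the names `pack`, `mismatches`, and `find_seed` are hypothetical. It assumes sequences contain only A, C, G, and T and that the seed is at most 32 bases long, so each comparison takes constant time.

```python
# Illustrative sketch only; NOT Atria's actual algorithm or API.
# Assumption: sequences contain only A/C/G/T (no N), seed length <= 32.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def mismatches(x, y, k):
    """Count differing bases between two packed sequences of k bases."""
    diff = x ^ y
    # Collapse each 2-bit pair to one bit that is set if the bases differ.
    low_mask = int("01" * k, 2)
    collapsed = (diff | (diff >> 1)) & low_mask
    return bin(collapsed).count("1")


def find_seed(read, seed, max_mismatch=1):
    """Return the first 0-based position where seed matches read with
    at most max_mismatch substitutions, or -1 if there is none."""
    k = len(seed)
    if k == 0 or k > 32 or len(read) < k:
        return -1
    mask = (1 << (2 * k)) - 1
    target = pack(seed)
    window = pack(read[:k - 1])  # pre-load the first k-1 bases
    for i in range(k - 1, len(read)):
        # Shift in one base and drop the oldest: constant work per position.
        window = ((window << 2) | CODE[read[i]]) & mask
        if mismatches(window, target, k) <= max_mismatch:
            return i - k + 1
    return -1


adapter = "AGATCGGAAGAGC"
read = "ACGTACGT" + adapter + "TTTT"
position = find_seed(read, adapter)
```

Each read position is visited once and only a fixed number of integers is kept, so for a bounded seed length the scan is O(n) in the read length with O(1) extra space.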
Availability & Implementation The Atria executables, source code, and benchmark scripts are available at https://github.com/cihga39871/Atria under the MIT license.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
E-mail: jiacheng.chuan{at}inspection.gc.ca (Chuan J)
aiguozhou{at}scau.edu.cn (Zhou A)
lhale{at}upei.ca (Hale L)
lsshem{at}mail.sysu.edu.cn (He M)
sean.li{at}inspection.gc.ca (Li X)
Research Area: Software and Workflows
Abbreviations
- CPU: Central processing unit
- DNA: Deoxyribonucleic acid
- GB: Gigabyte
- MCC: Matthews correlation coefficient
- NGS: Next-generation sequencing
- PPV: Positive predictive value
- RAM: Random-access memory
- RNA: Ribonucleic acid
- SNP: Single nucleotide polymorphism
- SSD: Solid-state drive
- TB: Terabyte
- UInt: Unsigned integer
- UInt64: Unsigned 64-bit integer
- WGS: Whole-genome sequencing