PT  - JOURNAL ARTICLE
AU  - Tanveer Ahmad
AU  - Johan Peltenburg
AU  - Nauman Ahmed
AU  - Zaid Al-Ars
TI  - ArrowSAM: In-Memory Genomics Data Processing through Apache Arrow Framework
AID  - 10.1101/741843
DP  - 2019 Jan 01
TA  - bioRxiv
PG  - 741843
4099  - http://biorxiv.org/content/early/2019/08/22/741843.short
4100  - http://biorxiv.org/content/early/2019/08/22/741843.full
AB  - The rapid growth of human genomics data, driven by advances in sequencing technologies, demands fast and cost-effective processing. However, processing this data poses challenges, particularly in selecting appropriate algorithms and computing platforms. Computing systems need data close to the processor for fast processing. Previously, due to the cost, volatility and other physical constraints of DRAM, it was not feasible to keep large working data sets in memory. However, emerging storage-class memories allow big data to be stored and processed closer to the processor. In this work, we show how a commonly used genomics data format, Sequence Alignment/Map (SAM), can be represented in the Apache Arrow in-memory data format to benefit from in-memory processing and to improve scalability through the shared-memory Plasma Object Store, which avoids large (de)serialization overheads in cross-language interoperability. To demonstrate the benefits of such a system, we present an in-memory SAM representation, which we call ArrowSAM, and integrate the Apache Arrow framework into genome pre-processing applications, including BWA-MEM, sorting and Picard, as use cases. Our implementation comprises three components. First, we integrated Apache Arrow into BWA-MEM to write output SAM data in ArrowSAM. Second, we sorted all the ArrowSAM data by coordinate in parallel through pandas dataframes. Finally, we integrated Apache Arrow into the HTSJDK library (used in Picard for disk I/O handling), where all ArrowSAM data is processed in parallel for duplicate removal. This implementation gives promising performance improvements for genome data pre-processing in terms of both speedup and system resource utilization. Owing to the columnar data format, better cache locality is exploited in both applications, and shared memory objects enable parallel processing.
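
The columnar layout and coordinate sort described in the abstract can be illustrated with a minimal sketch. This is not the ArrowSAM implementation: plain Python lists stand in for Arrow columns and for pandas dataframes, the four-field record layout is a simplified subset of SAM, and the names `SAM_RECORDS`, `FIELDS`, `to_columns` and `sort_by_coordinate` are hypothetical, chosen only for illustration.

```python
# Hypothetical sketch: plain Python lists stand in for Arrow columns.
# Only four SAM fields are kept; real SAM records have eleven mandatory fields.
SAM_RECORDS = [
    # (QNAME, FLAG, RNAME, POS)
    ("read3", 0, "chr2", 150),
    ("read1", 0, "chr1", 300),
    ("read2", 16, "chr1", 100),
    ("read4", 0, "chr2", 50),
]
FIELDS = ("qname", "flag", "rname", "pos")


def to_columns(records):
    """Turn row-oriented records into one list per field (columnar layout)."""
    return {name: [rec[i] for rec in records] for i, name in enumerate(FIELDS)}


def sort_by_coordinate(columns):
    """Reorder every column by (rname, pos).

    Assumption: reference names are compared as strings here; real tools
    order references as listed in the SAM header.
    """
    n = len(columns["pos"])
    # Only the two key columns are read while computing the sort order.
    order = sorted(range(n), key=lambda i: (columns["rname"][i], columns["pos"][i]))
    return {name: [values[i] for i in order] for name, values in columns.items()}


columns = to_columns(SAM_RECORDS)
sorted_columns = sort_by_coordinate(columns)
print(sorted_columns["qname"])  # ['read2', 'read1', 'read4', 'read3']
```

Because the sort order is computed from the `rname` and `pos` columns alone, the remaining fields are only touched once, when they are reordered; this is the kind of access pattern a columnar format is suited to.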