PT - JOURNAL ARTICLE
AU - Tanveer Ahmad
AU - Johan Peltenburg
AU - Nauman Ahmed
AU - Zaid Al-Ars
TI - ArrowSAM: In-Memory Genomics Data Processing through Apache Arrow Framework
AID - 10.1101/741843
DP - 2019 Jan 01
TA - bioRxiv
PG - 741843
4099 - http://biorxiv.org/content/early/2019/08/22/741843.short
4100 - http://biorxiv.org/content/early/2019/08/22/741843.full
AB - The rapid growth of human genomics data, driven by advances in sequencing technologies, demands fast and cost-effective processing. Processing this data poses challenges, particularly in selecting appropriate algorithms and computing platforms. Computing systems need data close to the processor for fast processing. Previously, the cost, volatility and other physical constraints of DRAM made it infeasible to keep large working data sets in memory. Emerging storage class memories, however, allow big data to be stored and processed closer to the processor. In this work, we show how a commonly used genomics data format, Sequence Alignment/Map (SAM), can be represented in the Apache Arrow in-memory data format to benefit from in-memory processing and to scale better through the shared-memory Plasma Object Store, which avoids large (de)serialization overheads in cross-language interoperability. To demonstrate the benefits of such a system, we present an in-memory SAM representation, which we call ArrowSAM, and integrate the Apache Arrow framework into genome pre-processing applications, including BWA-MEM, sorting and Picard, as use cases. Our implementation comprises three components. First, we integrated Apache Arrow into BWA-MEM to write output SAM data in ArrowSAM. Second, we sorted all the ArrowSAM data by coordinate in parallel through pandas dataframes.
Finally, we integrated Apache Arrow into the HTSJDK library (used in Picard for disk I/O handling), where all ArrowSAM data is processed in parallel for duplicate removal. This implementation gives promising performance improvements for genome data pre-processing in terms of both speedup and system resource utilization. Thanks to the columnar data format, both applications exploit better cache locality, and shared memory objects enable parallel processing.
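
The columnar layout and coordinate sort described in the abstract can be illustrated with a minimal sketch. This is not the ArrowSAM code: plain Python lists stand in for Arrow arrays and pandas dataframes so that the example runs without dependencies, and the function names `sam_to_columns` and `sort_by_coordinate` are hypothetical. Only the 11 mandatory SAM fields are kept, and reference names are ordered lexicographically, which is a simplifying assumption.

```python
from typing import Dict, List

# The 11 mandatory SAM fields, in file order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def sam_to_columns(lines: List[str]) -> Dict[str, list]:
    """Turn row-oriented SAM records into one list per field (columnar layout)."""
    columns: Dict[str, list] = {name: [] for name in SAM_FIELDS}
    for line in lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # zip stops after the 11 mandatory fields; optional tags are dropped here.
        for name, value in zip(SAM_FIELDS, fields):
            columns[name].append(int(value) if name in INT_FIELDS else value)
    return columns


def sort_by_coordinate(columns: Dict[str, list]) -> Dict[str, list]:
    """Reorder every column by (RNAME, POS) using a single shared permutation."""
    n = len(columns["POS"])
    # Assumption: lexicographic reference order; real tools typically follow
    # the order of the reference sequences in the SAM header.
    order = sorted(range(n), key=lambda i: (columns["RNAME"][i], columns["POS"][i]))
    return {name: [values[i] for i in order] for name, values in columns.items()}


# Three made-up alignment records, deliberately out of coordinate order.
reads = [
    "r1\t0\tchr2\t150\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t300\t60\t4M\t*\t0\t0\tTTGA\tIIII",
    "r3\t16\tchr1\t100\t60\t4M\t*\t0\t0\tGGCA\tIIII",
]
sorted_columns = sort_by_coordinate(sam_to_columns(reads))
print(sorted_columns["QNAME"])
print(sorted_columns["POS"])
```

Because each field lives in its own contiguous sequence, the sort only reads the `RNAME` and `POS` columns to compute the ordering, which hints at why a columnar layout can improve cache locality for this kind of operation.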