Abstract
Background Depth of coverage calculation is an important and computationally intensive preprocessing step in a variety of next generation sequencing pipelines, including RNA-seq analysis, copy-number variant detection, and quality control procedures.
Results Building upon big data technologies, we have developed SeQuiLa-cov, an extension to the recently released SeQuiLa platform, which provides efficient depth of coverage calculations, reaching a more than 100x speedup over state-of-the-art tools. The performance and scalability of our solution allow for exome- and genome-wide calculations running locally or on a cluster, while hiding the complexity of distributed computing behind a Structured Query Language Application Programming Interface.
Conclusions SeQuiLa-cov provides a significant performance gain in depth of coverage calculations, streamlining widely used bioinformatic processing pipelines.
Introduction
Given a set of sequencing reads and a genomic contig, the depth of coverage for a given position is defined as the total number of reads overlapping that locus.
The coverage calculation is a frequently performed but time-consuming step in the analysis of Next Generation Sequencing (NGS) data. In particular, copy-number variant detection pipelines require sufficient read depth in the analyzed samples [1, 2, 3]. In other applications, the coverage is computed to assess the quality of the sequencing data (e.g., to calculate the percentage of the genome with at least 30X read depth) or to identify genomic regions covered by an insufficient number of reads for reliable variant calling [4]. Finally, depth of coverage is one of the most computationally intensive parts of differential expression analysis using RNA-seq data at single-base resolution [5, 6, 7].
A number of tools supporting this operation have been developed, 22 of which are listed in the Omictools catalog [8]. Well-known, state-of-the-art solutions include samtools depth [9], bedtools genomecov [10], GATK DepthOfCoverage [11], sambamba [12], and mosdepth [13] (see the comparison presented in Table 1).
Traditionally, these methods calculate the depth of coverage using a pileup-based approach (introduced in samtools [9] and used in GATK [11]), which is inefficient because it iterates through each nucleotide position of every read in a Binary Alignment Map (BAM) file. An optimized, event-based approach has been proposed in bedtools [10] and mosdepth [13]. These algorithms use only specific ‘events’, i.e., the start and end of the alignment blocks within each read (Figure 1A), instead of analyzing every base of every read, which significantly reduces the overall computational complexity.
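To make the event-based idea concrete, here is a minimal Python sketch (illustrative only; the tools above implement it in C or Nim, and SeQuiLa-cov in Scala). The function name and the example alignment blocks are hypothetical.

```python
# Illustrative sketch of event-based depth of coverage over a single contig.
def event_based_coverage(blocks, contig_len):
    """blocks: list of (start, length) alignment blocks, 1-based starts."""
    events = [0] * (contig_len + 2)
    for start, length in blocks:
        events[start] += 1            # an alignment block begins here
        events[start + length] -= 1   # and no longer covers this position
    coverage, running = [], 0
    for pos in range(1, contig_len + 1):
        running += events[pos]        # prefix sum turns events into depth
        coverage.append(running)
    return coverage

# Example: two overlapping alignment blocks on a 10 bp contig
print(event_based_coverage([(2, 4), (4, 3)], 10))
# → [0, 1, 1, 2, 2, 1, 0, 0, 0, 0]
```

Only two event updates per alignment block are needed, regardless of block length, which is the source of the speedup over pileup-based iteration.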
The samtools and bedtools depth of coverage modules do not provide any support for multi-core execution. Mosdepth implements parallel BAM decompression, but its main algorithm remains sequential. Sambamba, on the other hand, promotes itself as a highly parallel tool, implementing depth of coverage calculations in a map-reduce fashion using multiple threads on a single node. Regardless of the degree of parallelization, all of the above-mentioned tools share a common bottleneck: a single thread is used for returning results. Finally, GATK was the first genomic framework to support distributed computations; however, the DepthOfCoverage method has not yet been ported to the current software release of the toolkit.
Key Points
SeQuiLa-cov allows for high-coverage (~60x) genome-wide depth of coverage calculations in less than one minute.
SeQuiLa-cov provides an ANSI-SQL-compliant API for accessing and analyzing aligned sequencing read data.
We present the first fully scalable, distributed, SQL-oriented solution designed for depth of coverage calculations. SeQuiLa-cov, an extension to the recently released SeQuiLa [14] platform, runs an event-based algorithm redesigned for the distributed environment and provides a convenient, SQL-compliant interface. The tool can be easily integrated with other applications implemented in Scala, R, or Python.
Algorithm and implementation
Algorithm
Consider an input data set, read_set, of aligned sequencing reads sorted by genomic position, coming from a BAM file partitioned into n data slices (read_set1, read_set2, …, read_setn) (Figure 1B).
In the most general case, the algorithm runs in a distributed environment where each cluster node computes the coverage for a subset of data slices using the event-based method. Specifically, for the i-th partition containing the set of reads (read_seti), the set of eventsi,chr vectors (where chr is the index of a genomic contig represented in read_set) is allocated and updated based on the items from read_seti. For each read, the algorithm parses the CIGAR string and, for each contiguous alignment block characterized by a start position start and a length len, increments eventsi,chr(start) by one and decrements eventsi,chr(start + len) by one. To compute the partial coverage vector for partition i and contig chr, the vector value at index j is calculated as the prefix sum of the events:

partial_coveragei,chr(j) = Σk=1..j eventsi,chr(k)
The result of this stage is a set of partial_coveragei,chr vectors distributed among the computation nodes. To calculate the final coverage for the whole read_set, an additional step correcting for overlaps between the partitions is required. An overlap overlapi,chr of length l between vectors partial_coveragei,chr and partial_coveragei+1,chr may occur at the partition boundaries, where the l trailing genomic positions of partial_coveragei,chr are the same as the l leading genomic positions of partial_coveragei+1,chr (see Figure 1C).
If an overlap is identified, the coverage values from the l-length tail of partial_coveragei,chr are added to the head of partial_coveragei+1,chr, and the last l elements of partial_coveragei,chr are subsequently removed. Once this correction step is completed, the non-overlapping coveragei,chr vectors are collected and yield the final coverage values for the whole input read_set.
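The overlap correction can be sketched in plain Python (a single-process illustration, not the actual distributed Scala implementation; the function and variable names are hypothetical):

```python
# Sketch: merging partial coverage vectors from consecutive partitions of one
# contig. Each partial is (start_pos, vector); consecutive partials may overlap
# by l positions, whose depths must be summed rather than duplicated.
def merge_partials(partials):
    """partials: list of (start_pos, coverage_vector), sorted by start_pos."""
    merged_start, merged = partials[0]
    for start, vec in partials[1:]:
        l = merged_start + len(merged) - start   # length of the overlap (may be 0)
        if l > 0:
            tail = merged[-l:]
            merged = merged[:-l]                 # trim the l-length tail of partition i
            # add the tail into the head of partition i+1
            vec = [a + b for a, b in zip(tail, vec)] + vec[l:]
        merged += vec
    return merged_start, merged

# Two partitions covering positions 1-6 and 5-10, overlapping at positions 5-6
print(merge_partials([(1, [1, 1, 2, 2, 1, 1]), (5, [1, 2, 1, 1, 0, 0])]))
# → (1, [1, 1, 2, 2, 2, 3, 1, 1, 0, 0])
```

Note that only the l-element tails ever need to move between partitions, which is what keeps data shuffling small in the distributed setting.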
The main characteristic of the described algorithm is its ability to distribute data and calculations (such as BAM decompression and the main coverage procedure) among the available computation nodes. Moreover, instead of performing a full reduction of the partial coverage vectors, our solution minimizes data shuffling among cluster nodes by limiting it to the overlapping parts of the coverage vectors. Importantly, the SeQuiLa-cov computation model supports fine-grained parallelism with a user-defined partition size, in contrast to traditional, coarse-grained parallelization strategies that split input data at the contig level.
Implementation
We have implemented SeQuiLa-cov in the Scala programming language using the Apache Spark framework. To efficiently access data in a BAM file, we have prepared a custom data source using the Data Source Application Programming Interface (API) exposed by Spark SQL. The performance of the read operation benefits from the Intel Genomics Kernel Library (GKL) [15], used for decompressing BAM file chunks, and from the predicate pushdown mechanism, which filters out data at the earliest possible stage.
The implementation of the core coverage calculation algorithm aims to minimize the memory footprint whenever possible, by using parsimonious data types (e.g., Short instead of Integer) and an efficient memory allocation strategy for large data structures (e.g., favoring statically sized Arrays over dynamically sized ArrayBuffers). Additionally, to reduce the overhead of data shuffling between the worker nodes in the overlap correction stage, we used Spark's shared variables [16]: accumulators and broadcast variables (Figure 1C). An accumulator gathers the worker nodes' coverage vector ranges and coverage vector tail values, which are subsequently read and processed by the driver. This information is then used to construct a broadcast variable distributed to the worker nodes, which perform the appropriate trimming and summing operations on the partial coverage vectors.
Functionality
Supported coverage result types
SeQuiLa-cov features three distinct result types: per-base, blocks, and fixed-length windows coverage (Figure 1A). For per-base, the depth of coverage is calculated and returned for each genomic position, making it the most verbose output option. The method producing block-level coverage (blocks) merges adjacent genomic positions with equal coverage values into genomic intervals. As a consequence, fewer records than in the per-base output are generated, with no information loss. For fixed-length windows, the algorithm generates a set of fixed-length, tiling, non-overlapping genomic intervals and returns the arithmetic mean of the coverage values at the positions within each window.
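As an illustration of how the three result types relate, here is a Python sketch deriving blocks and windows from a per-base coverage vector (function names are hypothetical; this is not SeQuiLa-cov's implementation):

```python
# Sketch: deriving the blocks and fixed-length windows result types from a
# per-base coverage vector.
def to_blocks(per_base):
    """Merge runs of equal depth into (start, end, depth) blocks, 1-based inclusive."""
    blocks, run_start = [], 0
    for i in range(1, len(per_base) + 1):
        if i == len(per_base) or per_base[i] != per_base[run_start]:
            blocks.append((run_start + 1, i, per_base[run_start]))
            run_start = i
    return blocks

def to_windows(per_base, size):
    """Mean depth over fixed-length, tiling, non-overlapping windows."""
    return [sum(per_base[i:i + size]) / len(per_base[i:i + size])
            for i in range(0, len(per_base), size)]

cov = [0, 1, 1, 2, 2, 1]
print(to_blocks(cov))      # → [(1, 1, 0), (2, 3, 1), (4, 5, 2), (6, 6, 1)]
print(to_windows(cov, 3))  # means of [0, 1, 1] and [2, 2, 1]
```

The blocks form is lossless (the per-base vector can be reconstructed from it), while the windows form trades resolution for an even more compact output.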
ANSI SQL compliance
SeQuiLa-cov promotes SQL as a data query and manipulation language for genomic analysis. Data flows are expressed in an SQL-like manner through the custom data source, which supports convenient Create Table As Select and Insert As Select statements. SeQuiLa-cov provides a table abstraction over existing BAM/CRAM files, with no need for data conversion, which can then be conveniently queried and manipulated in a declarative way. The coverage calculation function bdg_coverage, described in the Algorithm subsection, has been implemented as a table-valued function (Figure 1D).
Benchmarking
We have benchmarked SeQuiLa-cov against leading software for depth of coverage calculations, specifically samtools depth, bedtools genomecov, sambamba depth, and mosdepth (results for DepthOfCoverage from an outdated GATK version are available in the supplementary data). The tests were performed on aligned WES and WGS reads from the NA12878 sample (see Methods for details) and aimed at calculating blocks and window coverage. To compare the performance and scalability of each solution, we executed the calculations on 1, 5, and 10 cores of a single computation node (see Table 2).
Samtools depth and bedtools genomecov are both natively non-scalable and were run on a single thread only. For samtools, exome-wide calculations exceeded 10 minutes and genome-wide analyses took over two hours, while bedtools' performance was significantly worse: ~1.9x slower for WES and ~4.75x slower for WGS. Sambamba depth claims fully parallelized data processing through multithreading; however, our results revealed that even when additional threads were used, the total execution time of the coverage calculations remained nearly constant and greater than that of samtools. Mosdepth shows a significant speedup (~1.3x) over samtools when using a single thread. This performance gain increases to ~3.7x when using 5 decompression threads; however, it does not benefit from additional CPU power beyond that. In the case of fixed-length window coverage, mosdepth achieves over a ~1.3x speedup over sambamba.
SeQuiLa-cov achieves performance similar to mosdepth when run on a single core. However, SeQuiLa-cov is ~1.3x and ~2.5x as fast as mosdepth when using 5 and 10 CPU cores, respectively, demonstrating its better scalability. A similar performance characteristic is observed for both the blocks and fixed-length windows methods.
To fully assess the scalability profile of our solution, we performed additional tests in a cluster environment (see Methods for details). Our results show that when utilizing additional resources (i.e., more than 10 CPU cores), SeQuiLa-cov is able to reduce the total computation time to 15 seconds for WES data and to less than one minute for WGS data (Figure 2). The scalability limit is reached at 200 and ~500 CPU cores for WES and WGS data, respectively.
To evaluate the impact of the Intel GKL library on the deflate operation (BAM BGZF block decompression), we performed blocks coverage calculations on WES data using 50 CPU cores. The results showed on average a ~1.18x speedup when running with the Intel GKL deflate implementation.
Finally, our comprehensive functional unit testing showed that results calculated by SeQuiLa-cov and samtools depth are identical.
Conclusions
The application of recent advancements in big data technologies and distributed computing can speed up both genomic data processing and management. The analysis of large genomic data sets requires efficient, accurate, and scalable algorithms that utilize the computing power of multiple cluster nodes. In this work, we show that with a sufficiently large cluster, genome-wide coverage calculations can take less than a minute while being over 100x faster than the best single-threaded solution.
SeQuiLa-cov is one of the building blocks of the SeQuiLa [14] ecosystem, which initiated the move towards efficient, distributed processing of genomic data and provides an SQL-oriented API for convenient, flexible querying. We foresee that following this direction will enable the evolution of genomic data analysis from file-oriented to table-oriented processing.
Methods
Test data
We tested our solution using reads from the NA12878 sample aligned to the hg18 reference genome. The WES data contain over 161 million reads and weigh 17 GB; the WGS data contain over 2.6 billion reads and take 272 GB of disk space. Both BAM files were compressed at the default BAM compression level (5).
Testing environment
To perform a comprehensive performance evaluation, we set up a test cluster consisting of 28 Hadoop nodes (1 edge node, 3 master nodes, and 24 data nodes) with Hortonworks Data Platform 3.0.1 installed. Each data node has 28 cores (56 with hyper-threading) and 512 GB of RAM; the Yet Another Resource Negotiator (YARN) resource pool has been configured with 2640 virtual cores and 9671 GB of RAM.
Investigated solutions
In our benchmark we used the most recent versions of the investigated tools, i.e., samtools 1.9, bedtools 2.27.0, sambamba 0.6.8, mosdepth 0.2.3, and SeQuiLa-cov 0.5.1.
Availability of source code and requirements
Project name: SeQuiLa-cov
Project home page: http://biodatageeks.org/sequila/
Source code repository: https://github.com/zsi-Bio/bdg-sequila
Operating system: Platform independent
Programming language: Scala
Other requirements: Docker
License: Apache License 2.0
Availability of supporting data and materials
The Docker image is available at https://hub.docker.com/r/biodatageeks/. Supplementary information on benchmarking procedure as well as test data is publicly accessible at http://biodatageeks.org/sequila/benchmarking/benchmarking.html#depth-of-coverage.
Declarations
List of Abbreviations
- API – Application Programming Interface
- BAM – Binary Alignment Map
- GKL – Genomics Kernel Library
- NGS – Next Generation Sequencing
- SQL – Structured Query Language
- YARN – Yet Another Resource Negotiator
- WES – Whole Exome Sequencing
- WGS – Whole Genome Sequencing
Consent for publication
Not applicable
Funding
This work has been supported by the Polish budget funds for science in years 2016-2019 (Iuventus Plus grant IP2015 019874), as well as Polish National Science Center grant Preludium 2014/13/N/ST6/01843.
Author’s Contributions
MW – conceptualization, formal analysis, investigation, software and writing. AS – data curation, formal analysis, investigation, software, visualization and writing. WK – formal analysis, investigation, writing. TG – formal analysis, supervision, investigation, visualization and writing. All authors approved the final manuscript.