Abstract
Motivation Modern genomic research relies heavily on next-generation sequencing experiments such as ChIP-seq and ChIA-PET that generate coverage files for transcription factor binding, as well as DHS and ATAC-seq that yield coverage files for chromatin accessibility. Such files are in a bedGraph text format or a bigWig binary format. Obtaining summary statistics in a given region is a fundamental task in analyzing protein binding intensity or chromatin accessibility. However, the existing Python package for operating on coverage files is not optimized for speed.
Results We developed pyBedGraph, a Python package to quickly obtain summary statistics for a given interval in a bedGraph file. When tested on 8 ChIP-seq and ATAC-seq datasets, pyBedGraph is on average 245 times faster than the existing program. Notably, pyBedGraph can look up the exact mean signal of 1 million regions in ~0.26 second on a conventional laptop. An approximate mean for 10,000 regions can be computed in ~0.0012 second with an error rate of less than 5 percent.
Availability pyBedGraph is publicly available at https://github.com/TheJacksonLaboratory/pyBedGraph under the MIT license.
1 Introduction
The advancement of next-generation sequencing technologies allowed researchers to measure various biological signals in the genome. For example, one can probe gene expression (RNA-seq) (Mortazavi et al., 2008), protein binding intensity (ChIP-seq) (Robertson et al., 2007), chromatin accessibility (DHS and ATAC-seq) (Buenrostro et al., 2015), and protein-mediated long-range chromatin interactions (ChIA-PET) (Fullwood et al., 2009). Members of the ENCODE consortium (ENCODE Project Consortium, 2012) have collectively generated these datasets in diverse organisms, tissues, and cell types. The 1-dimensional (1-D) signal tracks of the datasets are generally stored in a bigWig compressed binary format or in a bedGraph text format. Although bigWig is a space-efficient standard format for visualizing data on genome browsers, the bedGraph format is often used for text processing and downstream analyses.
A common task in analyzing 1-D signals is extracting summary statistics of a given genomic region. For instance, it is useful to compare an average binding intensity in a peak region of the ChIP-seq signal track to that in a non-peak region. When analyzing new assays with unknown background null distributions, one may need to randomly sample as many as 10 billion regions to obtain sufficient statistical power to assess the significance of observed data for de-noising (Zheng et al., 2019). Thus, a fast algorithm is highly desirable. To accommodate this feature in the widely used Python language, we developed a package pyBedGraph and demonstrate its ability to quickly obtain summary statistics directly from a bedGraph file without the need to convert it to bigWig. The features of pyBedGraph include finding: 1) exact mean, minimum, maximum, coverage, and standard deviations; 2) approximate solutions to the mean.
2 Methods
Searching for a given interval in a large bedGraph file is a computationally expensive job. To overcome this problem, pyBedGraph creates an array that contains an index to an entry of data corresponding to a bedGraph line for every base pair in a chromosome. Therefore, when searching for a statistic, pyBedGraph can then simply use the array indices to rapidly access the bedGraph values, thereby avoiding the need to search.
In addition to finding the exact mean, pyBedGraph offers the option to approximate it with a reduced calculation time. The program can pre-calculate and store bins containing values over non-overlapping windows to substantially decrease the number of values indexed and hence the runtime. In this method, pyBedGraph looks up the two bins containing the start and end of the interval and inclusively extracts all bins between the two. When the first and last bin do not exactly match the start and end of the interval, respectively, an estimate is made for each bin by taking the (value of the bin) × (proportion of the bin overlapping the interval). This method trades off the speed with accuracy.
pyBedGraph is implemented in Python3 using Cython to further optimize speed. Detailed methods are provided in Supplementary data.
3 Results
We benchmarked the performance of pyBedGraph and its bigWig counterpart pyBigWig (Ramírez et al., 2016) on 6 ChIP-seq and 2 ATAC-seq mammalian datasets (Supplementary data) downloaded from the ENCODE portal (Sloan et al., 2016) (https://www.encodeproject.org). All runs were on a Intel(R) Core(TM) i5-7300HQ CPU @ 2.50GHz with 16 GB of RAM using a single thread.
3.1 Speed
Using an interval size of 500 bp and bin sizes of 100, 50, or 25 bp, we measured the runtime of looking up 0.1 to 1 million intervals from chr1. The results are illustrated for POLR2A ChIP-seq data (‘ENCFF376VCU’), where pyBedGraph takes 0.26 second (cf. 56 seconds for pyBigWig) to obtain an exact mean in 1 million intervals (Figure 1a). Our approximate computation takes 0.09, 0.11, and 0.14 seconds for bin sizes 100, 500, and 25 bp, respectively, while pyBigWig takes 56 seconds. As the size of the query intervals get larger, the run time gradually decreases for pyBedGraph’s approximate mean while it increases for the calculation of the exact mean (Supplementary data).
3.2 Accuracy
We next measured the amount of error resulting from the approximation. For each interval size from 100 bp to 5,000 bp, the percentage error was defined as , where n = 10,000 is the number of regions to look up (test case intervals) in chr1. A test case interval was excluded from the error calculation when its actual value was ‘None’ or 0 while the predicted was not, occurring in less than 2.5 percent of test cases. Mean squared errors and absolute errors were also computed (Supplementary data). On the ‘ENCFF376VCU’ dataset, the error was around 6%, 3%, and 1% for pyBedGraph with bin sizes equal to the interval size divided by 5, 10, and 20, respectively (Figure 1b). By contrast, pyBigWig utilizes ‘zoom levels’ in the bigWig file and its approximation error peaked at 11% and 9% for interval sizes of 1,000 bps and 4,000 bps, respectively.
3.2 Memory
The memory usage of pyBedGraph depends on the number of lines in the bedGraph file and sizes of each chromosome in the reference genome. Reading in the bedGraph file stores two 32-bit integers and a 64-bit float for each line; “loading” each chromosome creates an array of 32-bit integers to store the location of the corresponding line from the bedGraph file. Furthermore, the pre-calculated bins are stored in an array of 64-bit floats of length equal to the genome size divided by the bin size. For example, a bedGraph file for POLR2A ChIP-seq data in mouse spleen (‘ENCFF376VCU’) uses ~1.6GB (=100,620,515 lines × (4 + 4 + 8) bytes) of memory to load the whole genome and an additional 0.8GB (= 2.0 × 108 base pair × 4 bytes) to load chromosome 1 (chr1). Storing bins of size 100 base pairs (bps) requires only 16MB . The memory usage of pyBedGraph is nontrivial, yet still reasonable for most laptops. By contrast, pyBigWig does not load the bigwig file and consequently uses no memory.
4 Discussion
We developed pyBedGraph and demonstrated its ability to quickly obtain summary statistics from 1-dimensional genomic signals in bedGraph format. Specifically, obtaining the exact mean for 10 billion intervals is estimated to take 43 minutes with pyBedGraph and 7.4 days with pyBigWig. However, one drawback of pyBedGraph is that it can take up to a minute to load files whereas pyBigWig allows instant computation. Therefore, we recommend users to choose pyBedGraph if they prefer working with bedGraph file instead of bigWig, or if they have to search within more than 1 million intervals. For more than 1 billion intervals with limited compute time, our approximate solution with a small bin size may be a viable option. As genomics researchers continue to develop novel technologies ranging from bulk cells to single-cell and single-molecule experiments, it will be imperative to distinguish true signal from technical noise. Particularly, some ChIP-seq, ChIA-PET, and ChIA-Drop experiments yield only 10-20% enrichment rates due to weak antibody, resulting in noisy tracks. We envision pyBedGraph to play a vital role in quickly sampling null distributions to help researchers to de-noise the data.
Funding
This work has been supported by a Jackson Laboratory Director’s Innovation Fund (DIF19000-18-02), 4DN (U54 DK107967) and ENCODE (UM1 HG009409) consortia; Human Frontier Science Program (RGP0039/2017), and Florine Roux Endowment to Y.R.
Conflict of Interest
none declared.
Supplementary Method, Tables and Figures
Supplementary Method S1. Example mean statistics and error calculation
Supplementary Method S2. pyBedGraph usage and commands
Supplementary Method S1. Example mean statistics and error calculation
An example bedGraph file is given by the following:
After reading the file, the bedGraph object stores the following table as three NumPy arrays for ‘Start’, ‘End’, and ‘Value’.
Loading “chr1” creates the following array used to skip the searching process.
Exact statistics:
If the user searches the interval [‘chr1’, 8, 12], pyBedGraph begins by looking up the 8th index and obtains the value ‘2’. It then goes to the 2nd index in the arrays stored in the bedGraph object. The value ‘0.1’ and length of the interval (9 – 8 = 1) at the 2nd index is used to calculate the relevant statistic. Moving on to the next interval at the 3rd index, pyBedGraph sees that the “Start” (‘13’) is outside our query interval and stops looking further.
Approximate mean:
If bins of size 5 are used for storing the above values, the following bin array is created:
If the user searches the interval [‘chr1’, 0, 6], pyBedGraph notes that the starting index in the bin array is 0 / 5 = 0 and the end index is floor(6/5) = floor(1.2) = 1. Since the 0th bin is completely inside the interval, the value is stored with the weight of 5 (=bin size). The next bin with index 1 is only partly in the interval so the value is stored with the weight of 1 (=6 mod 5). The calculation found is then (4.1 + 0.1 / 5) / (5 + 5 * 1/5) = 0.69 The correct exact mean is (0.7×2 + 0.9×3) / (2 + 3) = 0.82. As a result, the percentage error in this example is 100*|0.69 – 0.82| / 0.82 = 15.9%.
Supplementary Method S2. pyBedGraph usage and commands