Augmented Interval List: a novel data structure for efficient genomic interval search

Motivation Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary. Results We present a new data structure, the augmented interval list (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is O(log2N + n + m), where n is the number of overlaps between R and q, N is the number of intervals in the set R, and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5 - 18 times faster than standard high-performance code based on augmented interval-trees (AITree), nested containment lists (NCList), or R-trees (BEDTools). For large datasets, the memory-usage for AIList is 4% - 60% of other methods. The AIList data structure, therefore, provides a significantly improved fundamental operation for highly scalable genomic data analysis. Availability An implementation of the AIList data structure with both construction and search algorithms is available at code.databio.org/AIList.


Introduction
A genomic interval r is defined by the two coordinates that represent the start and end locations of a feature on a chromosome. The general interval search problem is defined as follows: Given a set of N intervals R = {r 1 , r 2 , ..., r N } for N 1, and a query interval q, find the subset S of R that intersect q. If we define all intervals to be half-open, S can be represented as: S(q) = {r ∈ R|(r.start < q.end ∧ r.end > q.start)} If the order of intervals in R remains the same when the elements are sorted based on interval start or based on interval end, then S can be computed using a single binary search followed by another n + 1 comparisons, where n is the number of overlaps between R and q. We refer to R in this case as being f lat. The strategy of a binary search followed by sequential comparisons becomes sub-optimal when intervals in R possess a coverage or containment relationship, i.e., one interval covers or contains another interval in R. Such non-f lat interval lists require extra comparisons beyond n + 1 after the binary search.
The interval search problem is fundamental to genomic data analysis (Giardine, 2005;Li and Durbin, 2009;Layer et al., 2018;Jalili et al., 2018) and several approaches have been developed to do this efficiently (Cormen et al., 2001;Kent et al., 2002;Richardson, 2006;Alekseyenko and Lee, 2007;Quinlan and Hall, 2010;Neph et al., 2012). Currently, the most popular data structures are the nested containment list (NCList) by Alekseyenko and Lee (2007), the augmented interval tree (AITree) by Cormen et al. (2001), and the R-tree by Kent et al. (2002). The design of these data structures can be conceptualized as minimizing the additional comparisons in the search strategy we described earlier. NCList and AITree require O(N log 2 N ) time to build the data-structures, and their query time complexities are O(log 2 N + n + m), where m is the average number of extra comparisons (comparisons that do not yield overlapping results) required to find the n overlaps. The R-tree has a complexity of O(N ) in construction and O(m + n) in query. For genomic interval datasets, N is usually several orders of magnitudes larger than n or m.  NCLists, AITrees and R-trees differ both in time to build the data structure and to search it. As shown in Alekseyenko and Lee (2007), both average construction time and search time can vary significantly among the methods in practice. The search time differences are determined by the extra comparison value m, which differs based on the different approaches of each algorithm: For NCList, many sublists may be involved in a query, which requires many extra binary searches; for AITree, one must compare all interval nodes marked by the augmenting value, and not all of them intersect the query; and for R-tree, all intervals in an indicated bin are scanned, although many of these may not overlap the query.
In this paper, we present a new data structure, the Augmented Interval List (AIList). The AIList search algorithm reduces the number of extra comparisons (m) and thereby achieves better performance.

A simple augmented interval list and query algorithm
To begin, we sort R based on the interval start to create an interval list. We then augment the list with the running maximum end value, M axE, which thus stores the maximum end value among all preceding intervals ( Figure 1a) to create the AIList. M axE reflects the containment relationship among the intervals: because the 2nd interval [3,8) contains the 3rd interval [5, 7), they have the same M axE value (8); similarly the 4th interval [7, 20) contains the next three intervals and they all have the the same M axE value 20. M axE therefore indicates a list of containment groups. The coverage length of a containment group can be defined as the number of sequential identical M axE values minus one. A variation of this approach instead augments the list with the sorted end value, SortedE, which we detail in the Supplementary Information.
Once an AIList is constructed, we seek to search it with a query interval. To find the overlaps of a query [q.start, q.end) with the AIList, we first do a binary search using the query interval end against the sorted list of database interval starts to find the right-most interval in the AIList where r.start < q.end. We then step backwards to test for each interval whether M axE > q.start. Since M axE is in ascending order, and M axE ≥ r.end, when we find the first interval with M axE ≤ q.start, we are guaranteed that all remaining intervals do not overlap the query and can be skipped. For example, if our query were [9, 12), we first binary search to find the index of the last interval I E that has r.start < 12, which is the 5th interval [9, 10), M axE = 20. We know all intervals on the right side will not overlap the query since their r.start ≥ 12. We then step backwards to the 5th, then the 4th interval, but we can stop at the 3rd interval, [5, 7) with M axE = 8, since M axE < 9, which ensures that no further intervals on the left side will overlap the query. This is very efficient since we only checked one extra interval.
However, this strategy becomes less efficient in the presence of long coverage groups. For example, if the query were [17, 21), then we need to check 6 intervals: from the 8th, [19,30) with M axE = 30, back to the 3rd, [5, 7) with M axE = 8, which requires 4 extra comparisons. This example query is the worst case for this list. To mitigate the issue of containment, we next extend the AIList by decomposing the list.

Decomposition of an interval list
Algorithm 1 AIList Construction Algorithm 8: Because long coverage groups lead to extra comparisons, we seek to reduce the length of these groups. We achieve this by extracting containing intervals into a second list. In the simple case shown in Figure 1a, we can define the coverage length len of a list interval i as the number of the immediately following intervals (i + 1, i + 2, . . . ) that are covered by interval i; so in Figure 1a len [ Selection of M inL determines the extent of decomposition, with greater values leading to less decomposition. The optimal M inL differs mildly among datasets (Fig. 2). It is clear from Figure 2 that M inL = 20 is near optimal for datasets we tested, and more importantly, that the runtime is relatively robust to substantial variation in MinL choice, so it is not necessary to find the optimal M inL for each dataset. We have set the default M inL to 20, which results in the number of components nSub being less than 10 for all the datasets we have tested (see Table 1).
Algorithm 1 lists a simplified O(N logN ) algorithm for constructing an AIList, including both decomposition and augmentation. The function AddRunningM ax is a simple linear scan to determine the maxE value.

13: end procedure
Queries against the decomposed list structure are similar to the original case, but now done independently on each sublist. The decomposition process has divided the original list into 2 or more flattened or nearly flattened sublists, so queries in each sublist are close to optimal. The cost of this improvement is that we now require additional binary searches to get the indices I E for each sublist. This approach thus implements a tradeoff between number of binary search comparisons and number of extra interval comparisons due to interval containment. Similar to other algorithms mentioned above, the query time complexity for AIList is O(log 2 N + n + m), but the average number of extra comparisons m is minimized. The search algorithm is listed in Algorithm 2.

Results
AIList is implemented in C. To evaluate the efficiency of AIList, we compared its performance with interval trees (AITree), nested containment lists (NCList), and R-trees. For AITree we used the rbtree-based interval tree from Linux kernel (see Supplementary Information); for NCList we used the C-code intervaldb.c implemented by Alekseyenko and Lee (2007); and for R-tree we used the popular C++ implementation BEDTools v2.25.0 by Quinlan and Hall (2010). AIList outperforms other approaches on all datasets that we have tested so far. Table 1 lists seven real and representative genomic datasets ranging in size from 199,000 to 128 million intervals. In each of the experiments we describe below, we used Dataset 0 as the query set, with the other six used as the database interval set. All experiments were run on a computer with 2.8GHz CPU and 16GB memory, and the reported runtime for each method includes the time required for data loading, data structure construction, searching, and result output.
For all datasets, AIList outperformed all other methods ( Table 2). The improvement was more dramatic for the datasets with more complex containment structure: For f lat dataset 1 and near f lat dataset 2,  Table 1. Genomic interval datasets used as database or query for performance evaluation. N is the number of intervals, dcRatio is the ratio of the number of intervals that cover their immediate next over N , nSub is the number of total sublists for AIList, nHits is the number of overlaps with query Dataset 0. Dataset 1 and 2 are from BEDTools, and others are from UCSC.
AIList is 120-150% faster than AITree and NCList and 2 times faster than BEDTools. For datasets with greater containment, AIList is up to 5 times faster than AITree and NCList, and up to 18 times faster than BEDTools. Furthermore, AIList consumed substantially less memory than the other algorithms (  Table 3. Memory usage (%, memory used by a program divided by the total machine memory) of AIList, AITree, NCList and BEDTools for large datasets on a computer with a total memory of 16 GB. BEDTools was terminated for Dataset 6 because it took all machine memory.
To evaluate how AIList, AITree, NCList and R-tree scale in practice for differing sizes of input dataset, we selected a single comparison (Dataset 0 queried against Dataset 5 as input dataset) and then downsampled the input dataset in increments of 1,000,000 intervals. Fig. 3a shows four curves of runtime vs input dataset size. Fig. 3b shows the ratio of the runtime compared to AIList. As input dataset size increases from 1 million to 10 million, the runtime ratio increases from 2.8 to 8.3 for AITree, from 2.8 to 4.5 for NCList, and from 6.9 to 20 for BEDTools, demonstrating the superior scaling of AIList.
We also performed a similar simulation experiment by subsampling the query, which demonstrates the practically lower scaling of the AIList algorithm ( Figure 4). As expected, all algorithms scale linearly but AIList has the lowest coefficient.

Conclusion
We defined a novel data structure that dramatically improves on the enumeration of interval overlaps. Our method pairs the techniques of list augmentation and list decomposition to provide tighter terminal conditions for overlap checking, which reduces the number of extra comparisons that must be made. We demonstrated that this method outperforms existing methods across a series of datasets that range from flat structure to highly nested interval containment.
Similar to the NCList, the AIList data structure contains only one extra data element M axE, so it is more efficient than the AITree (3 extra elements) and the R-tree (contains duplicate elements); but the AIList header size is negligible, while the NCList header size can be comparable to database size (see Supplementary Information for details). Another advantage of the AIList is that inserting a new interval element requires simply checking the few sublists to find roughly where it belongs, which is much more flexible than NCList, which can require reconstructing the whole data structure. Because of its simple data structure, AIList is also the simplest to implement (see Algorithm 1 and 2, and source code). Finally, as shown in Table 3, AIList takes the least memory.
Taken all these together, AIList provides a significantly improved fundamental operation for highly scalable genomic data analysis.