Abstract
Motivation Algorithmic solutions to index and search biological databases are a fundamental part of bioinformatics, providing underlying components to many end-user tools. Inexpensive next generation sequencing has filled publicly available databases such as the Sequence Read Archive beyond the capacity of traditional indexing methods. Recently, the Sequence Bloom Tree (SBT) and its derivatives were proposed as a way to efficiently index such data for queries about transcript presence.
Results We build on the SBT framework to construct the HowDe-SBT data structure, which uses a novel partitioning of information to reduce the construction and query time as well as the size of the index. Compared to previous SBT methods, on real RNA-seq data, HowDe-SBT can construct the index in less than 36% of the time, and with 39% less space, and can answer small-batch queries at least five times faster. We also develop a theoretical framework in which we can analyze and bound the space and query performance of HowDe-SBT compared to other SBT methods.
Availability and implementation HowDe-SBT is available as a free open source program on https://github.com/medvedevgroup/HowDeSBT.
Contact Paul Medvedev, pzm11{at}psu.edu
Supplementary information Supplementary text and figures available as single Supplementary file.
1 Introduction
Public read databases such as the Sequence Read Archive (SRA) contain a treasure trove of biological information and have the potential to become a ubiquitous community resource by enabling broad exploratory analyses. For example, given a long nucleotide sequence, which experiments in the database contain reads matching it? More concretely, which human RNA-seq experiments from the SRA contain a transcript of interest? Unfortunately, there does not exist a way for today’s biologist to answer such a question in a reasonable amount of time. Tapping into the potential of these databases is hampered by scalability challenges and will require novel approaches from the algorithm community.
The computational problem falls into the widely-studied category of string alignment problems (Gusfield, 1997; Mäkinen et al., 2015). However, it differs in several regards. The strings to be matched are rarely present in the database in their entirety. Instead, sequencers produce many highly fragmented copies of the desired string, each subjected to potential sequencing error. Furthermore, the scale of the databases makes traditional sequence alignment methods, such as SRA-BLAST (Camacho et al., 2009), inadequate (Solomon and Kingsford, 2016).
The seminal paper of Solomon and Kingsford (2016) demonstrated that the above transcript question can be simplified to a question of approximate k-mer membership. Each experiment can be viewed as a collection of its constituent k-mers, and the biological question can be answered by finding all experiments which contain a high percentage of the k-mers in the query transcript. They demonstrated that this approach, even if only done approximately, is a good proxy for the answer to the transcript question. It also lends itself to answering a broader range of questions, such as SNP presence, viral contamination, or gene fusion (Bradley et al., 2017). Their work has opened the door to a slew of data structures implementing various k-mer indices, roughly falling into two categories.
The first category of approaches is based on the Bloofi (Crainiceanu and Lemire, 2015) data structure. Each experiment’s k-mers are first stored in a Bloom filter (Bloom, 1970), an efficient but lossy data structure for storing sets. The experiments are then grouped into a hierarchical structure (i.e. tree) based on their similarity, where each leaf corresponds to the k-mers in an experiment. Each internal node represents the union of the k-mers in its descendants. The tree allows an efficient search for experiments matching a given k-mer profile by pruning non-promising branches. Bloofi was first adapted to the sequencing context by the Sequence Bloom Tree (SBT) data structure (Solomon and Kingsford, 2016), and further work improved the representation of the internal nodes (Sun et al., 2018; Solomon and Kingsford, 2017) and the clustering of the tree topology (Sun et al., 2018).
The SBT approaches aggregate k-mer information at the level of an experiment. The second category of approaches aggregate experiment information at the level of the k-mers. In such an approach, each query k-mer is independently looked up in an index to retrieve information about which experiments contain the k-mer (Holley et al., 2015; Muggli et al., 2017; Almodaresi et al., 2017; Mustafa et al., 2018; Pandey et al., 2018; Yu et al., 2018; Bradley et al., 2017; Almodaresi et al., 2018; Holley and Melsted, 2019). In this context, experiments are referred to as colors, and such a data structure is sometimes called a colored de Bruijn graph. These approaches are complementary to the SBT and the best choice depends on the particular properties of the queries and the dataset, such as the sharedness of k-mers between experiments.
In this paper, we make two main contributions. First, we develop an alternative way to partition and organize the data in an SBT such that it becomes more compressible and faster to query. To demonstrate the performance advantages of our method, called HowDe-SBT, we compare it on real data to the previous SBT methods. We also propose and explore a culling procedure to remove non-informative nodes from the tree and create a non-binary forest (see Supplementary).
Second, we introduce a theoretical framework which allows us to prove bounds on the performance of HowDe-SBT in comparison with Split-SBT (abbreviated as SSBT)(Solomon and Kingsford, 2017). Previous papers in the field have focused on experimental metrics for comparison, but, while these are very valuable and necessary, they can vary greatly depending on the dataset or the system used. Theoretical bounds can deepen our understanding of why algorithms perform well and can drive the development of better methods. In this paper, we derive an information theoretic bound on the space used by an SBT (Theorem 1) and quantify the number of bit lookups necessary for a query (Theorem 2).
2 Preliminaries
Let x and y be two bitvectors of the same length. A bitvector can be viewed as a set of the positions that are set to 1, and in this view, the set union (intersection, respectively) of x and y is equivalent to bitwise OR (AND, respectively). We write the bitwise AND between x and y as x ∩ y, and the bitwise OR as x ∪ y. The bitwise NOT operation on x is written as $\bar{x}$. The set difference of x and y is written as x \ y and can be defined as $x \cap \bar{y}$. The empty set is represented as a bitvector of zeros. The universal set, denoted by ξ, is represented as a bitvector of ones. Given that the fraction of 1s in x is p, the empirical entropy of x is defined as H(p) = −(p log2 p + (1 − p) log2(1 − p)).
A Bloom filter (BF) is a bitvector of length b, together with q hash functions, h1,…, hq, where b and q are parameters. Each hash function maps a k-mer to an integer between 0 and b − 1. To add a k-mer x to the set, we set the position hi(x) to 1, for all i. To check if a k-mer x is in the set, we check that the position hi(x) is 1, for all i. Note that a false positive may occur, i.e. x may have never been added but all its corresponding positions were still set to 1 and it is considered to be contained in the BF. In this paper, we restrict the number of hash functions to be q = 1 (as is done in other SBT approaches).
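As an illustration of the single-hash Bloom filter described above, the following is a minimal Python sketch. The class name, the `kmers` helper, and the choice of SHA-256 as the hash function are assumptions made for this example only; they are not taken from HowDe-SBT or any other SBT implementation.

```python
import hashlib


class SingleHashBloomFilter:
    """Bloom filter with q = 1 hash function over b bit positions (illustrative)."""

    def __init__(self, b):
        self.b = b
        self.bits = 0  # Python int used as a bitvector; bit i is position i

    def position(self, kmer):
        # Hypothetical hash choice: first 8 bytes of SHA-256, reduced modulo b.
        digest = hashlib.sha256(kmer.encode("ascii")).digest()
        return int.from_bytes(digest[:8], "big") % self.b

    def add(self, kmer):
        self.bits |= 1 << self.position(kmer)

    def __contains__(self, kmer):
        # May return True for a k-mer that was never added (false positive).
        return ((self.bits >> self.position(kmer)) & 1) == 1


def kmers(sequence, k):
    """All substrings of length k, in order of occurrence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


bf = SingleHashBloomFilter(b=1024)
for km in kmers("ACGTACGTTGCA", k=4):
    bf.add(km)
```

With q = 1, every k-mer maps to exactly one position, which is the property the query algorithm in Section 3 relies on.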
Next, consider a rooted binary tree T. The parent of a non-root node u is denoted as parent(u), and the set of all the leaves of the subtree rooted at a node u is denoted by leaves(u). Let children(u) refer to the child nodes of a non-leaf node u. Define ancestors(u) as ∅ if node u is the root, and parent(u) ∪ ancestors(parent(u)) otherwise. That is, ancestors(u) are all the nodes on the path from u to the root, except u itself.
Suppose that there is a Bloom filter associated with each leaf of T. Then, define B∪(u) for a leaf node u as its associated BF and B∪(u) for an internal node as $\bigcup_{i \in \text{leaves}(u)} B_\cup(i)$. Note that B∪(u) of an internal node u can be equivalently defined as $\bigcup_{c \in \text{children}(u)} B_\cup(c)$. Define the intersection of leaf BFs in the subtree rooted at a node u as $B_\cap(u) = \bigcap_{i \in \text{leaves}(u)} B_\cup(i)$. Equivalently, at internal nodes, $B_\cap(u) = \bigcap_{c \in \text{children}(u)} B_\cap(c)$.
In this paper, we solve the following problem:
Database input
A database D = {D1,…, Dn}, where each Di is a BF of size b that represents experiment i. Typically, Di contains all the non-erroneous k-mers that appear in the reads of experiment i.
Query input
A multi-set of k-mers Q (called the query), and a threshold 0 < θ ≤ 1.
Query output
The set of experiments whose Di contains at least a fraction θ of the query k-mers, i.e. {i: |{x ∈ Q: x exists in Di}| ≥ θ · |Q|}
Note that in this formulation, we assume that D (along with parameter b) is already given to us. How to choose b and construct D from raw reads was already described in Solomon and Kingsford (2016).
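To make the problem statement concrete, here is a brute-force baseline in Python that checks every experiment independently. It is only a reference point for what the index must return, not how any SBT answers a query. The representation (each Di as a set of 1-positions, the query as a list of already-hashed positions) and the function name are assumptions for this sketch.

```python
def brute_force_query(database, query_positions, theta):
    """Return indices i such that D_i contains at least theta * |Q| query positions.

    database:        list of sets; database[i] holds the positions set to 1 in D_i
    query_positions: list (multiset) of Bloom filter positions, one per query k-mer
    theta:           threshold with 0 < theta <= 1
    """
    needed = theta * len(query_positions)
    hits = []
    for i, bloom in enumerate(database):
        present = sum(1 for pos in query_positions if pos in bloom)
        if present >= needed:
            hits.append(i)
    return hits


database = [{0, 1, 2, 3}, {0, 1}, {5}]
query_positions = [0, 1, 2, 3, 0]  # position 0 occurs twice (Q is a multiset)
```

This scan costs time proportional to n · |Q|; the tree-based search in Section 3 aims to avoid visiting most experiments.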
3 Representation and querying
Initially, we determine a tree topology T using the clustering algorithm of Sun et al. (2018) as a black-box. A topology is a binary tree with a bijection between its leaves and the experiment BFs in our database; a topology does not yet have bitvectors assigned to the internal nodes. We now show how to assign these bitvectors — Bdet and Bhow— and how they can be used to answer a query.
Conceptually, each node u represents the set of experiment BFs corresponding to leaves(u). We observe that some positions are determined in u: they have the same value in each of leaves(u). Moreover, for the positions that are determined, we know exactly how they are determined. Formally, we define

$$B_{\text{det}}(u) = B_\cap(u) \cup \overline{B_\cup(u)}, \qquad B_{\text{how}}(u) = B_\cap(u). \tag{1}$$
Note that Bhow is intended to only be informative for those positions that are in Bdet. Figure 1 shows an example of a tree and its Bdet and Bhow representation.
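The following Python sketch computes Bdet and Bhow for a node directly from the Bloom filters of its leaves, reading the definitions as Bdet(u) = B∩(u) ∪ NOT B∪(u) and Bhow(u) = B∩(u). Bitvectors are stored as Python integers of width b; this encoding and the function name are assumptions for illustration.

```python
def det_how(leaf_filters, b):
    """Return (Bdet, Bhow) for a node whose subtree contains the given leaf filters.

    leaf_filters: list of ints, each a bitvector of width b (bit i is position i)
    """
    universe = (1 << b) - 1           # the all-ones bitvector (xi)
    b_union, b_inter = 0, universe
    for bloom in leaf_filters:
        b_union |= bloom              # B_union: 1 in at least one leaf
        b_inter &= bloom              # B_inter: 1 in every leaf
    # Determined positions: 1 in every leaf, or 0 in every leaf.
    b_det = b_inter | (universe & ~b_union)
    # How they are determined: 1 only where every leaf has a 1.
    b_how = b_inter
    return b_det, b_how


# Two leaves over b = 4 positions.
det, how = det_how([0b1100, 0b1010], b=4)
```

Here position 3 is 1 in both leaves and position 0 is 0 in both, so both are determined; positions 1 and 2 differ between the leaves and stay undetermined.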
Having Bdet and Bhow at each node can enable the following efficient query search; it is essentially the same strategy as in Sun et al. (2018) and Solomon and Kingsford (2017), but adapted to the HowDe-SBT tree. When we receive the set of query k-mers Q, we hash each one to determine the list of BF positions corresponding to Q (recall that our BF uses only one hash function, hence each k-mer corresponds to just one position). We call this list the unresolved positions. We also maintain two counters: the number of positions that have been determined to be 1 (present), and the number of positions determined to be 0 (absent). These counters are both initially 0. We then proceed in a recursive manner, starting at the root of the tree. When comparing Q against a node u, each unresolved position that is 1 in Bdet(u) is removed from the unresolved list, and the corresponding bit in Bhow(u) determines which counter, present or absent, is incremented. If the present counter is at least θ|Q|, we add leaves(u) to the list of matches and terminate the search of u’s subtree. If the absent counter exceeds (1 – θ) |Q|, Q cannot match any of the descendant leaves so we terminate the search of u’s subtree. If neither of these holds, we recursively pass the two counters and the list of unresolved positions down to u’s children. When we reach a leaf, the query unresolved list will become empty because Bdet is all ones at a leaf, and the algorithm will necessarily terminate. Figure 1 shows an example of a query execution.
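The search just described can be sketched in Python as follows. The sketch works on full-length, uncompressed Bdet and Bhow vectors stored as integers, so it omits the rank/select index translation needed once inactive bits are removed. The `Node` class and all function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    det: int                      # Bdet(u); bit i is position i
    how: int                      # Bhow(u)
    children: list = field(default_factory=list)


def leaf_names(node):
    if not node.children:
        return [node.name]
    return [n for child in node.children for n in leaf_names(child)]


def search(node, unresolved, present, absent, theta, q_size, matches):
    remaining = []
    for pos in unresolved:
        if (node.det >> pos) & 1:         # position is determined at this node
            if (node.how >> pos) & 1:
                present += 1              # determined as 1 in all leaves below
            else:
                absent += 1               # determined as 0 in all leaves below
        else:
            remaining.append(pos)
    if present >= theta * q_size:
        matches.extend(leaf_names(node))  # every leaf below is a match
    elif absent > (1 - theta) * q_size:
        return                            # no leaf below can match
    else:
        for child in node.children:
            search(child, remaining, present, absent, theta, q_size, matches)


def query(root, positions, theta):
    matches = []
    search(root, list(positions), 0, 0, theta, len(positions), matches)
    return matches


leaf_a = Node("A", det=0b1111, how=0b1100)
leaf_b = Node("B", det=0b1111, how=0b1010)
root = Node("root", det=0b1001, how=0b1000, children=[leaf_a, leaf_b])
```

The counters are passed by value, so each branch of the recursion continues from the counts accumulated along its own root-to-node path.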
We observe that some bit positions will never be looked at during a search, as follows. First, if a position is determined at a node v, it will be removed from the query unresolved list (if it was even there) after node v is processed. We say that this position is inactive in v’s descendants, since a search will never query that position. Formally, a position is active in Bdet(u) if it is not determined at its parent (equivalently, at any of its ancestors). Second, the only positions that are queried in Bhow(u) are those that are active and set to one in Bdet(u). We say these positions are active in Bhow(u). Formally, we can define

$$B_{\text{det}}^{\text{active}}(u) = \begin{cases} \xi & \text{if } u \text{ is the root} \\ \overline{B_{\text{det}}(\text{parent}(u))} & \text{otherwise} \end{cases} \qquad B_{\text{how}}^{\text{active}}(u) = B_{\text{det}}^{\text{active}}(u) \cap B_{\text{det}}(u)$$

Bits that are inactive are wasteful, since they take space to store but are never queried. We remove these bits, forming (usually shorter) bitvectors consisting only of the active bits. Formally, let $\tilde{B}_{\text{det}}(u)$ be Bdet(u) with all the inactive bits removed, and let $\tilde{B}_{\text{how}}(u)$ be Bhow(u) with all the inactive bits removed. $\tilde{B}_{\text{det}}(u)$ and $\tilde{B}_{\text{how}}(u)$ are further compressed with the general purpose RRR compression algorithm (Raman et al., 2007), and these compressed bitvectors are what, in the end, constitute our index. We note that since the removal of inactive bits changes the indices into the bitvectors, the query algorithm has to be modified accordingly by using rank and select. Since this is done in essentially the same way as in Solomon and Kingsford (2017), we omit the details here.
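A small Python sketch of how removing inactive bits shifts indices, and how a rank operation recovers the new index, is given below. It uses plain lists and a linear-time rank purely for illustration; an actual index would use rank/select support on the compressed bitvectors. The function names are assumptions.

```python
def squeeze(bits, active):
    """Keep only the entries of `bits` whose position is marked active."""
    return [bit for bit, is_active in zip(bits, active) if is_active]


def rank1(vector, pos):
    """Number of 1s in vector[0:pos] (naive linear-time version)."""
    return sum(vector[:pos])


# Full-length Bdet of a node and the mask of positions that are active in it.
b_det = [1, 0, 0, 1, 1, 0]
active = [0, 1, 1, 0, 1, 1]

squeezed = squeeze(b_det, active)

# An active position `pos` of the full vector lives at index
# rank1(active, pos) in the squeezed vector.
pos = 4
squeezed_index = rank1(active, pos)
```

The same translation applies one level down: the index of a position in the squeezed Bhow is the rank of that position in the mask of positions active in Bhow.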
4 Analysis of savings compared to previous work
In this section, we show the connection of our representation to previous approaches and analyze the theoretical improvements. The structuring of our bitvectors can be viewed as an extension of the approach in Solomon and Kingsford (2017), which is called SSBT. The SSBT representation approach subsumes the representation of Sun et al. (2018), so we focus our comparison on SSBT. Briefly, SSBT uses the same approach of having a tree where the bitvectors at a node u represent, for each bit position x, whether x is 1 in all, none, or some of the leaves(u). It also marks a bit position as inactive if it can never be reached during a query. The SSBT bitvectors are called usim and urem. The topology of the SSBT is computed differently from HowDe-SBT, and, as was shown in Sun et al. (2018), it results in a poorly clustered tree. However, since we want to focus on the improvements solely due to bitvector representation, we will assume that the SSBT is constructed using the same topology as HowDe-SBT. In this case, the relationship of SSBT to HowDe-SBT can be summarized as a one-to-one relationship between all possible bitvector states, shown in Figure 2.
The intuition which guided our design of HowDe-SBT, relative to SSBT, was 1) minimizing the number of active positions and 1 bits (to improve space), and 2) minimizing the number of bit lookups performed during a query (to improve speed). To try to theoretically quantify this improvement, we derive a savings rate per bit position x, in terms of 1) nx, the number of nodes where x is active in Bdet, and 2) sx, the percentage of nodes where x is active in Bhow that have a 1.
Let T be the tree topology. We are not able to directly derive the savings rate for T after RRR compression, but we instead rely on Shannon’s information compression bound. For a bitvector a that is generated by a 0th order Markov model and that has a fraction p of 1s, the best that a lossless compression algorithm can achieve is |a|H(p) bits. While in practice this bound might be beaten because our bitvectors are not generated by a 0th order Markov chain, it is still a useful proxy for the compressibility (and in any case the RRR compression that we use does not compress beyond the 0th order bound (Raman et al., 2007)). Let $I_x^{H}$ (respectively, $I_x^{S}$) denote Shannon’s information bound for storing all the bit values at x in T using HowDe-SBT’s representation (respectively, SSBT’s representation).
Theorem 1. Let 0 ≤ x < b be a bit position. Then,

$$I_x^{H} = \frac{3n_x + 1}{2}\, H\!\left(\frac{(n_x + 1)(1 + s_x)}{3n_x + 1}\right), \qquad I_x^{S} = \frac{n_x(4 - s_x) - s_x}{2}\, H\!\left(\frac{n_x(1 + s_x) + s_x - 1}{n_x(4 - s_x) - s_x}\right).$$
Proof. Let Tx be the subtree of T containing all nodes where x is active in Bdet. Observe that Tx is indeed a subtree and it contains exactly the nodes where x is determined and their ancestors. The total number of nodes in Tx is nx. Let ℓx be the number of leaves of Tx. Note that since every internal node of T has two children, ℓx = (nx + 1)/2, for all x. Note that the nodes at which the Bhow bits are active are exactly the leaves of Tx, and sx is the percentage of these that are 1.
Now, the nodes for which x is active in Bdet are exactly the nodes of Tx, and thus x contributes nx active Bdet bits. The only time Bhow is active at x is when Bdet is set to 1, which is exactly at the leaves of Tx. Hence, the total number of active bits is nx + ℓx = (3nx + 1)/2. Next, we count the number of active bits that are set to 1. The Bdet bits set to one are exactly at the leaves of Tx. There are ℓx active Bhow bits, of which a fraction sx are set to one. Hence, the number of active bits set to one is ℓx(1 + sx) = (nx + 1)(1 + sx)/2.
To prove the statement about $I_x^{S}$, we use the equivalences in Figure 2 as a guide. The number of active positions in the usim vectors is nx and in the urem vectors is nx − sxℓx. In sum, the number of active positions is $2n_x - s_x\ell_x = (n_x(4 - s_x) - s_x)/2$. The number of active bits in usim that are set to 1 is ℓxsx and the number of active bits in urem that are set to 1 is nx − ℓx. Hence the number of 1 bits in SSBT is $n_x - \ell_x(1 - s_x) = (n_x(1 + s_x) + s_x - 1)/2$.
Theorem 2. Suppose both the SSBT and HowDe-SBT are built from the same tree topology T. Then the number of bit lookups necessary to resolve a bit position x is (3nx + 1)/2 in HowDe-SBT, and (nx(4 − sx) − sx)/2 in SSBT.
Proof. Consider HowDe-SBT. For every internal node of Tx, we only make one lookup to Bdet, resulting in nx − ℓx lookups. At the leaves of Tx, we must also look at Bhow, resulting in 2ℓx lookups. The total number of lookups is then nx + ℓx = (3nx + 1)/2. For SSBT, we must always check both usim and urem at every node, with one exception: at a leaf of Tx, we do not need to check urem if usim is 1. Hence, the number of lookups is 2nx − ℓxsx = (nx(4 − sx) − sx)/2.
We can measure the percent improvement in the space bound as $\Delta I_x = (I_x^{S} - I_x^{H}) / I_x^{S}$, shown in Figure 3(a). In the limit of large nx, ΔIx approaches between 9% and 14% for sx ≤ 0.75. Similarly, we can measure the percentage improvement in the number of lookups as $1 - \frac{3n_x + 1}{n_x(4 - s_x) - s_x}$, which in the limit goes to $\frac{1 - s_x}{4 - s_x}$. The improvement in lookups is thus between 0 (for sx = 1) and 25% (for sx = 0); for sx = 0.5, it is 14%.
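The per-position quantities above can be evaluated numerically with a short Python sketch. It uses the active-bit and 1-bit counts stated in the proofs of this section, together with the bound |a|H(p). The function names are chosen for this example, and the space improvement is taken here as 1 − (HowDe-SBT bound)/(SSBT bound), which is an assumption about how the relative saving is normalised.

```python
from math import log2


def entropy(p):
    """Empirical entropy H(p) of a bitvector with a fraction p of 1s."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def howde_bound(n, s):
    """Space bound at one position for HowDe-SBT, from its active and 1-bit counts."""
    active = (3 * n + 1) / 2
    ones = (n + 1) * (1 + s) / 2
    return active * entropy(ones / active)


def ssbt_bound(n, s):
    """Space bound at one position for SSBT, from its active and 1-bit counts."""
    active = (n * (4 - s) - s) / 2
    ones = (n * (1 + s) + s - 1) / 2
    return active * entropy(ones / active)


def space_improvement(n, s):
    return 1 - howde_bound(n, s) / ssbt_bound(n, s)


def lookup_improvement(n, s):
    howde_lookups = (3 * n + 1) / 2
    ssbt_lookups = (n * (4 - s) - s) / 2
    return 1 - howde_lookups / ssbt_lookups


# Example: evaluate both improvements for a chosen (n_x, s_x) pair.
example = (space_improvement(50, 0.5), lookup_improvement(50, 0.5))
```

Evaluating these functions over a grid of nx values is one way to see where the finite-size behaviour settles toward the limiting values quoted above.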
In large scale applications, our theoretical analysis can be simplified by assuming that nx goes to the limit. However, HowDe-SBT can be applied in different settings, and some of these settings (like a private patient cohort) may in fact not be very large. Our detailed analysis can be used to determine at which point a dataset is large enough for the asymptotic effects to kick in; e.g. Figure 3(a) indicates that the space savings reaches a stable point roughly at nx = 50.
We caution, however, that our analysis does not automatically translate to total improvements when all the positions are considered jointly. The total data structure size depends a lot on the structure of the input. That is reflected by the distribution of nx and sx in the real data, whether or not they are correlated, the entropy of all the bits together, rather than separated by position, and the higher-order entropy of the bitvectors. These effects can be analyzed using real data, which we do in Section 6.
5 Construction algorithm
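In this section, we describe an algorithm to compute $\tilde{B}_{\text{det}}$ and $\tilde{B}_{\text{how}}$ (Bdet and Bhow with the inactive bits removed) for a given topology and prove its correctness. Our algorithm makes a pass through the tree using a postorder traversal. At each node u, it 1) computes Bdet and Bhow of u, and 2) computes $\tilde{B}_{\text{det}}$ and $\tilde{B}_{\text{how}}$ of each of its children using their Bdet and Bhow. As the final step, $\tilde{B}_{\text{det}}$ and $\tilde{B}_{\text{how}}$ are computed for the root node.
The base case of the algorithm is, for each leaf u, to set Bdet(u) ← ξ and Bhow(u) ← B∪(u). For an internal node u, we first construct B∩(u) and B∪(u) using the operations:

$$B_\cap(u) \leftarrow \bigcap_{c \in \text{children}(u)} B_{\text{how}}(c), \qquad B_\cup(u) \leftarrow \bigcup_{c \in \text{children}(u)} \overline{B_{\text{det}}(c) \setminus B_{\text{how}}(c)}$$

Then, Bdet(u) and Bhow(u) are constructed by directly applying their definitions (Equation (1)). Once Bdet(u) and Bhow(u) have been computed, we compute the active bits in Bdet(c) and Bhow(c), for each child c of u:

$$B_{\text{det}}^{\text{active}}(c) \leftarrow \overline{B_{\text{det}}(u)}, \qquad B_{\text{how}}^{\text{active}}(c) \leftarrow B_{\text{det}}^{\text{active}}(c) \cap B_{\text{det}}(c)$$

Next, we make a linear scan through Bdet(c) and Bhow(c) and copy over only the bits that are set in $B_{\text{det}}^{\text{active}}(c)$ and $B_{\text{how}}^{\text{active}}(c)$, respectively, thus generating $\tilde{B}_{\text{det}}(c)$ and $\tilde{B}_{\text{how}}(c)$.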
These are finally RRR compressed and written to disk.
After the postorder traversal completes, we make the final step to compute $\tilde{B}_{\text{det}}(u)$ and $\tilde{B}_{\text{how}}(u)$ for the root u. It is essentially the same process as for the internal nodes, but the active bits are set as:

$$B_{\text{det}}^{\text{active}}(u) \leftarrow \xi, \qquad B_{\text{how}}^{\text{active}}(u) \leftarrow B_{\text{det}}(u)$$

$\tilde{B}_{\text{det}}(u)$ and $\tilde{B}_{\text{how}}(u)$ are then computed as before, removing the inactive bits during a linear scan followed by RRR compression.
We note that the B∩, B∪, and active-bit bitvectors are not stored after construction is complete. To save memory, these can be discarded right after their use; however, it is necessary to save some of the Bdet and Bhow vectors. Specifically, we must save Bdet(u) and Bhow(u) between the visit to u and the visit to u’s parent. During this interval they can be saved in memory or, in order to save memory, they can be written to disk and reloaded to memory as needed (the default behavior). When u’s parent is visited, $\tilde{B}_{\text{det}}(u)$ and $\tilde{B}_{\text{how}}(u)$ are computed and Bdet(u) and Bhow(u) are no longer needed.
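The construction pass can be sketched in Python as below. The sketch keeps everything in memory, represents bitvectors as integers of width b, and stops before RRR compression. It takes the active bits of a child to be the positions not determined at its parent, and the active bits of the root to be all positions, following the description in Section 3. The dictionary-based tree encoding and all names are assumptions for illustration.

```python
def squeeze(vec, active, b):
    """Bits of `vec` at the active positions, in position order, as a list."""
    return [(vec >> i) & 1 for i in range(b) if (active >> i) & 1]


def finish(name, det, how, det_active, b, index):
    """Remove inactive bits from Bdet and Bhow of one node and store the result."""
    how_active = det_active & det
    index[name] = (squeeze(det, det_active, b), squeeze(how, how_active, b))


def build(node, b, index, is_root=True):
    """Postorder pass; returns (Bdet, Bhow) of `node` and fills `index`.

    Leaves are {'name': str, 'bf': int}; internal nodes are
    {'name': str, 'children': [...]}.
    """
    xi = (1 << b) - 1
    if "children" not in node:
        det, how = xi, node["bf"]                  # base case at a leaf
    else:
        kids = [(c, *build(c, b, index, is_root=False)) for c in node["children"]]
        b_inter, b_union = xi, 0
        for _, c_det, c_how in kids:
            b_inter &= c_how                       # intersection via Bhow(c)
            b_union |= xi & ~(c_det & ~c_how)      # union via NOT(Bdet(c) \ Bhow(c))
        det = b_inter | (xi & ~b_union)
        how = b_inter
        for c, c_det, c_how in kids:
            # A child's active positions are those not determined at this node.
            finish(c["name"], c_det, c_how, xi & ~det, b, index)
    if is_root:
        finish(node["name"], det, how, xi, b, index)
    return det, how


tree = {"name": "root", "children": [{"name": "A", "bf": 0b1100},
                                      {"name": "B", "bf": 0b1010}]}
index = {}
root_det, root_how = build(tree, b=4, index=index)
```

Each node is touched twice, once when its own Bdet and Bhow are computed and once when its parent finishes it, which mirrors the two-stage accounting used in the runtime argument.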
The following theorem shows that our algorithm is correct and that its runtime is linear in the total size of the input bitvectors.
Theorem 3. Given a database of n Bloom filters, each of size b, and a tree topology, our method constructs $\tilde{B}_{\text{det}}$ and $\tilde{B}_{\text{how}}$ for all nodes in O(nb) time.
Proof. We first prove the runtime. Our bitvector construction algorithm operates on each node twice: once when the node is visited, and a second time to finish the node when its parent is visited. (The root is a special case, as its finishing stage occurs at the end of the algorithm.) At each stage, it performs a constant number of bitwise operations which each take O(b) time. Similarly, the linear scan to remove inactive bits can be done in O(b) time by simply copying the active bits to a new vector. Since a binary tree with n leaves has 2n − 1 nodes, the total time is O(nb).
Next, we prove correctness. We will need the following technical lemma:
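Lemma 1. The following properties are true:

$$B_\cap(u) = B_{\text{how}}(u) \tag{4}$$

$$B_\cup(u) = \overline{B_{\text{det}}(u) \setminus B_{\text{how}}(u)} \tag{5}$$

Proof. Equation (4) trivially follows from the definition of Bhow (Equation (1)). For Equation (5), first observe that B∩(u) and $\overline{B_\cup(u)}$ are disjoint (since $B_\cap(u) \subseteq B_\cup(u)$). Combine this with the definition of Bdet(u), and we get $B_{\text{det}}(u) \setminus B_\cap(u) = \overline{B_\cup(u)}$. Negating both sides, we get $B_\cup(u) = \overline{B_{\text{det}}(u) \setminus B_\cap(u)} = \overline{B_{\text{det}}(u) \setminus B_{\text{how}}(u)}$.
We first show the correctness of computing Bdet and Bhow when we visit u. For a leaf u, B∩(u) = B∪(u), and therefore

$$B_{\text{det}}(u) = B_\cap(u) \cup \overline{B_\cup(u)} = B_\cup(u) \cup \overline{B_\cup(u)} = \xi, \qquad B_{\text{how}}(u) = B_\cap(u) = B_\cup(u).$$

If u is an internal node, then it is enough to show that B∩ and B∪ are computed correctly. By Equations (4) and (5),

$$B_\cap(u) = \bigcap_{c \in \text{children}(u)} B_\cap(c) = \bigcap_{c \in \text{children}(u)} B_{\text{how}}(c), \qquad B_\cup(u) = \bigcup_{c \in \text{children}(u)} B_\cup(c) = \bigcup_{c \in \text{children}(u)} \overline{B_{\text{det}}(c) \setminus B_{\text{how}}(c)}.$$

To show the correctness of computing $B_{\text{det}}^{\text{active}}$ and $B_{\text{how}}^{\text{active}}$, recall their definitions from Section 3. It is straightforward to see that our algorithm computes $B_{\text{det}}^{\text{active}}$ and $B_{\text{how}}^{\text{active}}$ exactly according to these definitions, both for the case of children and for the case of the root. Finally, the algorithm correctly computes $\tilde{B}_{\text{det}}$ and $\tilde{B}_{\text{how}}$ because it simply applies their definitions.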
We note that the complexity of constructing the tree topology was not studied in Sun et al. (2018), but the obvious implementation would take Ω(n^2) time. Though it takes negligible time on currently tested datasets, it may become the bottleneck in the future if n increases by several orders of magnitude.
6 Results
6.1 Experimental setup
To evaluate HowDe-SBT, we compared it against its two predecessors, AllSome-SBT (Sun et al., 2018) and SSBT (Solomon and Kingsford, 2017). All experiments were run on an Intel Xeon CPU with 512 GB of RAM and 64 cores (at 2.10 GHz). Our tool is open source and available for free through https://github.com/medvedevgroup/HowDeSBT. All details about how the tools were run, including parameters, together with datasets needed to reproduce our results, are available on https://github.com/medvedevgroup/HowDeSBT/tree/master/reproduce.
We used the same data for evaluation as Solomon and Kingsford (2016), except that we removed experiments that did not have reads longer than k. There were 66 experiments removed in this way. A 67th experiment was removed because the corresponding BF in Solomon and Kingsford (2016) was empty. Together, the removed experiments had a total of 675 million reads. The resulting dataset contained 2,585 human RNA-seq runs from blood, brain, and breast tissues, comprising all relevant human RNA-seq datasets in the SRA at the time of Solomon and Kingsford (2016). For each file, we filtered out any k-mers that occurred fewer times than a file-dependent threshold. We used the thresholds from Pandey et al. (2018), for consistency purposes. We used k = 20 and a Bloom filter size of b = 2 · 10^9 (as in previous work), and k-mer counting was done using Jellyfish (Marçais and Kingsford, 2011).
To study query performance, we created four types of queries: a single transcript, a batch of ten transcripts, a batch of 100, and a batch of 1000. Transcripts were picked arbitrarily from Gencode (ver. 25) transcripts that are at least k nt long. We created 100, 10, 3, and 3 replicates for each type of query, respectively. We include batches of multiple transcripts in our tests because SBT performance is known to depend on batch size (Solomon and Kingsford, 2016). The idea of combining multiple queries in a batch and then processing them at each node simultaneously was first described by Solomon and Kingsford (2016) and is implemented in all SBT methods, including HowDe-SBT. We use a value of θ = 0.9 for all experiments. Note that because the output of HowDe-SBT is identical to SSBT and AllSome-SBT, we do not need to compare their accuracy; moreover, a comparison of SBT accuracy relative to exact methods like Mantis was also already explored in Solomon and Kingsford (2016).
6.2 Performance comparison
Table 1 shows the time and space taken to construct the index. The index could be created in less than 36% of the time and with 39% less space for HowDe-SBT than for all other approaches. The faster construction time of HowDe-SBT over AllSome-SBT was due to a combination of having to handle much smaller bitvectors during construction (as reflected by the smaller index size) and software engineering improvements. Otherwise, the construction algorithms of the two methods differ only by the specific bitvector operations applied during the tree traversal.
It is important to note that the construction times in Table 1 do not include the pre-processing time of converting the SRA read files to the initial experiment Bloom filters. This is a time-consuming process, which took us several days using multiple threads. We did not obtain reliable timing results since we did the conversion on the fly while streaming the data from the SRA over a network connection.
To measure the query time, we first note that there are different use cases that affect how to best measure time performance. SBT approaches are designed to scale to a very large number of experiments or to machines with limited memory (e.g. a desktop computer) because the memory required for a query is not dependent on the number of experiments; only one node of the tree needs to be loaded into memory at any given time. Therefore, we focused our analysis on a setting where the index is loaded into memory with each new query.
Table 2 shows the query speed for all tools. HowDe-SBT was faster than other SBT approaches, with over a 5x speedup on single-transcript batches. Peak RAM usage was < 1.3 GiB for all batches for all tools (Table 3).
We also tested the effect of warming the cache prior to querying, where we ran each query two consecutive times and then reported the run-time for the second run (Supplementary Table S1).
Warming the cache prior to querying led to improved query times for all tools, but their relative performance remained mostly similar.
6.3 Bitvector properties
First, we investigate how many positions are active and how saturated the active bits are. The fraction of active Bdet positions decreases going down the tree (by definition), with a median of 0.006 at internal nodes and 0.002 at leaves. The saturation of $\tilde{B}_{\text{det}}$ (i.e. the percentage of the active bits that are determined) is 100% at the leaves (by definition) and has a median of 41% at the internal nodes. The saturation of $\tilde{B}_{\text{how}}$ (i.e. the percentage of the active Bhow bits that are 1) has a median of 51% at the leaves and 12% at the internal nodes. It decreases with the height of a node (i.e. maximum distance to a leaf), meaning that at higher levels of the tree, the vast majority of positions that are determined are found to be absent rather than present. Figure 4 shows the saturation distributions. In terms of final space on disk, after RRR compression, the leaves account for only 18% of the total index size.
In Section 4, we derived the reduction in space and query time for a bit position x in terms of the number of nodes where it is active in Bdet (denoted by nx) and the fraction of nodes where it is set to one in Bhow (denoted by sx). Figure 3(b) shows the distribution of these values on our tree. The median value for sx is 0.14 and for nx is 31, which corresponds to in Figure 3(a). However, we also see a correlation between nx and sx (Spearman coefficient of r = 0.67). Any future extension of our theoretical analysis per bit position to one of all bit positions jointly should take this complexity into account. We note that the total reduction in index size of HowDe-SBT over SSBT on our data (38%) is due not only to the improved bit representation but also due to the better tree topology that HowDe-SBT constructs; thus, a direct comparison to our theoretical predictions is challenging.
7 Conclusion
In this paper, we presented a novel approach for the representation of Sequence Bloom Trees and studied its performance from both a theoretical and an experimental perspective. The main intuition behind our representation is that it reduces the number and entropy of the active bits. Compared to previous SBT approaches, HowDe-SBT is an improvement on all fronts: it constructs the index in less than 36% of the time and with 39% less space, and can answer small-batch queries five times faster. Compared specifically against AllSome-SBT, the biggest advantage is that the size of the index is an order of magnitude smaller. In comparison against SSBT, the biggest advantage is in the construction time and query times, across all batch sizes.
With the improvements in this paper, the SBT can already be deployed for small and mid-size databases, such as a private patient cohort or all the sequencing data in flybase.org. Such a deployment will need to provide an automated way to update the database; while our method naturally supports insertions in time (we have omitted the details, but Sun et al. (2018) give the basic overview of how insertions can be handled), a quality assurance step will be necessary prior to automating database updates. It will also be necessary to provide a front-end interface (e.g. Galaxy or similar to Bradley et al. (2017)) for easy access. Such front-ends should also provide wrappers for more biologically-oriented queries (i.e. to convert “which sample has a SNP” into a k-mer query).
Supplementary Section 1 Culling of nodes
In this section, we propose and explore a culling technique for allowing the tree to be non-binary, something that was explored in the original Bloofi paper but was not supported by subsequent SBT approaches. The technique is applied during construction, right after the baseline binary tree topology is constructed but before the bitvector representations are computed.
For a node u, let saturation(u) be the number of bits set to one and active in Bdet(u) divided by the number of bits that are active in Bdet(u). For example, in Figure 1, the left child of the root has saturation of 9/15. Since at this stage Bdet and Bhow are not yet computed, we estimate the saturation by a sub-sampling technique, similar to that used to estimate node similarity. First, we identify the culling threshold as two standard deviations below the mean of the saturation value in the internal nodes. This was 20% on our dataset. Second, we scan the baseline topology to identify internal nodes u for which saturation(u) is below the threshold. These nodes are then removed from the topology, with their children reassigned as children of the removed node’s parent; if no parent exists, these nodes become roots of a new tree in the forest. We call this process culling. The result of culling is that the binary tree potentially becomes a non-binary forest. This does not change the query algorithm in any substantial way.
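The culling step can be sketched as a short recursive function in Python. The tree encoding (a node as a `(name, children)` tuple), the saturation lookup table, and the function name are assumptions for this example; the saturation estimates and the threshold are taken as given inputs rather than computed by sub-sampling.

```python
def cull(node, saturation, threshold):
    """Return the list of subtrees that replace `node` after culling.

    node:       (name, children), where children is a list of nodes;
                leaves have an empty child list and are never removed
    saturation: dict mapping internal node names to estimated saturation
    threshold:  internal nodes with saturation below this value are removed
    """
    name, children = node
    new_children = []
    for child in children:
        new_children.extend(cull(child, saturation, threshold))
    if children and saturation[name] < threshold:
        return new_children          # node removed; its children move up one level
    return [(name, new_children)]


tree = ("r", [("u", [("A", []), ("B", [])]), ("C", [])])

# Calling cull on the root returns a list of roots, i.e. a forest.
forest = cull(tree, {"r": 0.5, "u": 0.1}, threshold=0.2)
```

Because children are processed before their parent, a chain of low-saturation nodes collapses in a single pass, and a removed root simply yields several trees.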
We investigate the effect of culling on the tree, as a function of the culling threshold (Table S2). At the threshold of 20%, we remove about 4% of the nodes, with a negligible decrease in total index size. The query times fluctuate (Table S3), showing a slight increase or decrease depending on the batch size.
Overall, we conclude that culling did not have a substantial effect on the SBT. It is interesting to observe that when the threshold is high (40%), 45% of the nodes are removed but the index size actually increases by 62%. This is due to the fact that a single active bit in Bdet(u) is replaced by two active bits in u’s children, if u is removed.
Acknowledgements
We are thankful to Ayaan Hossain and Natasha Stopa for helping prototype some aspects of this project. This work has been supported in part by NSF awards DBI-1356529, CCF-551439057, and IIS-1453527 to PM and partially supported by PSU’s College of Engineering Multidisciplinary Seed Grant Program. Research reported in this publication was supported by the National Institute Of General Medical Sciences of the National Institutes of Health under Award Number R01GM130691. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.