BionetBF: A Novel Bloom Filter for Faster Membership Identification of Large Biological Graph

Big Graph is a graph having thousands of vertices and hundreds of thousands of edges. The study of graphs is crucial because the interlinkage among the vertices provides various insights and uncovers hidden relationships. Graph processing has non-linear time complexity, and the overwhelming number of vertices and edges in a Big Graph increases the processing complexity many fold. One of the significant challenges is searching for an edge in a Big Graph. In this article, we propose a novel Bloom Filter called Biological network Bloom Filter (BionetBF) for fast membership identification of biological network edges or paired biological data. BionetBF is capable of executing millions of operations within a second while occupying a tiny main memory footprint. We have conducted rigorous experiments to evaluate the performance of BionetBF with large datasets, using 12 synthetic datasets and three biological network datasets. Insertion and query of 40 million biological edges each take less than 8 sec, while BionetBF maintains a false positive probability of 0.001. BionetBF is compared with two other filters, Cuckoo Filter and Libbloom, where a small-sized BionetBF exhibits higher performance than a large-sized Cuckoo Filter and Libbloom. The source code is available at https://github.com/patgiri/BionetBF. The code is written in the C programming language. All data are available at the given link.

Highlights
Proposed a novel Bloom Filter, BionetBF, for faster boolean query on Big Graph.
BionetBF has a low memory footprint and a low false positive probability.
It has high performance with constant searching time complexity.
BionetBF has potential applications in Big Graph, de Bruijn Graph, and Drug Discovery.


Introduction
The graph has always been the most crucial research area for years due to its diverse application fields. The graph helps in the representation of relationships among the entities. For example, social networks represent the connectivity of users based on friendship. The second attractive feature of graphs is non-linear time complexity. Every operation in a graph, for instance, insertion, query and deletion, has non-linear time complexity. It excites the researcher to develop an algorithm that performs the operations in linear time. Today's digital era has converted the graph into Big Graph. The exponential generation of data has led to Big Graphs. For example, the number of social network users is millions, and each user is connected to thousands of other users. The social network graph has millions of vertices with a hundred million edges. The number of monthly active users on Facebook is 2.98 billion in the year 2022 [1]. Big Graph has vast applications. Social network analysis helps in sociology. Sociology aims to understand society, patterns of social relationships, human social behaviour, social interaction, etc. Knowledge graph [2,3] presents the data and their relationship, for example, google knowledge graph [4]. Big Graph is big, complex, uncertain, and dynamic [5]. These features influence the design of a Big Graph searching algorithm. The colossal number of vertices and edges increases the complexity and uncertainty in the Big Graph. It increases the processing complexity by many folds. Due to the huge size, the searching algorithm has to balance the time and space complexity. The algorithm design should be a reasonable model to capture the uncertainties of the Big Graph. Many Big Graphs are dynamic. New edges or vertices are added or deleted from the graph. For example, a user frequently adds friends on social networks and changes the graph structure. 
The searching algorithm needs to be fast to provide a result before the input graph structure becomes invalid. Another challenge associated with the Big Graph is the inability to store the entire graph in RAM. One solution is distributed graph processing, for example, Pregel [6]. Pregel has scalability issues such as superstep barriers [7] and network traffic issues in case of dense graphs. Bloom Filter [8] is a probabilistic data structure which is not a complete system but can reduce the burden of the state-of-the-art techniques in graph searching to a large extent.
A biological network is a complex network that represents the relationships between biological entities such as molecules, proteins, ions, and metabolites [9]. Examples of biological networks are protein-protein interaction networks [10], genetic regulatory networks, cell-cell communication [11], and genetic interaction networks [12]. Protein-protein interaction networks present transient and stable interactions between proteins, where the interactions are undirected by nature [13]. The gene regulatory network represents the complete set of gene products and their interactions. Genetic interaction networks represent the interactions between the genes within an organism [14,15]. Moreover, Bloom Filter is used to construct the de Bruijn Graph, where the de Bruijn Graph of the human genome requires more than 30GB of RAM [16,17]. Therefore, Bloom Filter is an efficient option for the de Bruijn graph problem. Representing a biological system as a network improves the comprehension of complex biological processes, particularly at the molecular level [18]. Biological networks reveal prominent relationships, patterns, and properties of the whole system, which is difficult in a univariate analysis [19]. The study of protein-protein interaction networks helps determine the proteins related to diseases [20,21]. The interactions among proteins result in cellular and molecular mechanisms that determine an organism's health. A biological network is a great help in drug discovery. Typically, the commercial availability of a drug takes more than 7 years: the development of a drug takes approximately 5 years, and clinical trials take a few more years. Thus, the process of drug discovery is expensive and time-consuming.
A biological network helps in discovering valuable information related to a drug and its target, such as (a) whether any secondary effects interfere with the target, (b) whether the target protein has many connections, (c) whether the proteins targeted by the drug are dangerous, and (d) which molecules are affected by the side effects of the drug. Furthermore, a biological network is used for predicting drug combinations because the network is suited to computational and combinatorial analysis. The study of genetic interaction networks helps in understanding the relationship between phenotype and genotype. Phenotype is an organism's trait resulting from the interaction of the genotype and its environment. The genotype is the collection of genes. The biological network also helps in determining how viruses and bacteria infect, persist, and cause diseases [22,23]. In addition, it helps in understanding the pharmacokinetic and pharmacogenomic actions of antibacterial drugs. Furthermore, a biological network helps establish the relationships among organisms and species [24].

Motivation
Analysis of large networks is predominantly computationally intractable [25]. Computationally intractable problems are either NP-hard or NP-complete problems that are solved approximately or heuristically. The factors contributing to the complexity of biological networks are chemical kinetics, a vast number of interacting variables, feedback loops, dynamic behaviour caused by various linear or nonlinear relationships, and stochasticity [26]. The biological network is dynamic in nature [27] because it is highly influenced by external factors; however, the information provided by a biological network is static at a given instant. Moreover, an available biological network furnishes only a small part of the whole biological network; hence, it is missing much information. The biological network has thousands of millions of vertices with billions of interactions/edges. The huge amount of information the biological network generates often yields dubious interactions. The construction of a biological network is challenging because the biological data silo is large and the data are noisy [28]. Thus, it demands highly efficient data structures with low memory footprints.

Contribution
This article proposes a novel Bloom Filter, called BionetBF, for rapid membership identification of edges in Big Graph. To the best of our knowledge, no Bloom Filter or other filter has been proposed for storing a Big Graph in a single data structure. We have considered biological networks for our experimentation. BionetBF can perform insertion and query operations on millions of biological edges within a few seconds. We have conducted extensive experiments on BionetBF using 12 synthetic biological datasets and 3 real datasets. In the case of the synthetic biological datasets, BionetBF takes 7.6 sec for the insertion of 40 million biological edges. Similarly, it takes 7.6 sec for the query of 40 million biological edges. BionetBF maintains a low false positive probability of 0.001. One important point to notice is that BionetBF has zero false positive probability in many datasets. The performance of BionetBF is measured using multiple parameters: Million operations per second (MOPS), Seconds per operation (SPO), and Megabytes per second (MBPS). In the insertion operation, the highest MOPS, SPO, and MBPS are 7.06, 1.87, and 162.78, respectively. Similarly, in the query operation (Disjoint Set), the highest MOPS, SPO, and MBPS are 6.95, 1.9, and 160.6, respectively. BionetBF demonstrates an accuracy of more than 99.5%. BionetBF is compared with two other filters: Cuckoo Filter [29] and Libbloom (standard Bloom Filter) [30]. A small-sized BionetBF demonstrates faster operations and higher performance than Cuckoo Filter and Libbloom. Cuckoo Filter requires 6.6×, and Libbloom requires 4.7× larger memory than BionetBF. Furthermore, the performance of BionetBF is measured using three biological network datasets: Drug-Gene, Gene-Disease, and Gene-Gene. Drug-Gene is the dataset of interactions between molecules and genes. Gene-Disease is the dataset of associations of diseases with genes. Gene-Gene is the dataset of interactions among genes.
BionetBF also presents a higher performance in the biological network datasets. BionetBF can be used for quick identification of a specific biological edge in many biological networks. BionetBF is also an excellent choice for filtering repetitive insertion or query of biological edge.

Bloom Filter
Definition 1. Bloom Filter is an approximate data structure to test whether a queried item is a member of a set or not. It returns either "definitely not in the set" or "possibly in the set". An empty Bloom Filter is a bit array of $m$ bits, all set to 0. It requires $k$ independent hash functions to map an input item into the Bloom Filter. In insertion, the $k$ bit positions given by the hash functions are set to 1. In a query operation, all $k$ bit positions must be 1 to return true; otherwise, Bloom Filter returns false, which means that the queried item is not in the set [31].
Bloom Filter [8] is an approximate set membership filtering data structure defined in Definition 1. It is extremely popular in diverse domains, particularly Big Data [32], IoT, Cloud Computing, Networking [33], Security [34,35], Database, Bioinformatics [36], and Biometrics. Bloom Filter is applied to reduce the main memory footprint. Bloom Filter uses a tiny amount of main memory to filter mammoth-sized data. It can significantly boost the performance of any system, but it has a false positive probability issue. Nevertheless, designing a new Bloom Filter can mitigate the false positive probability.
The standard Bloom Filter is a bit array of $m$ bits that uses $k$ hash functions. Initially, each bit (cell) is set to 0. Bloom Filter performs two operations: insertion and query. In the insertion operation, the $k$ hash functions hash a single item. Each hash value gives the location of a cell, and these cells are set to 1. Similarly, in the query operation, the $k$ hash functions hash the queried item. If all the cells are 1, then Bloom Filter returns true. If at least one cell is 0, then Bloom Filter returns false. Formally, let $B$ be a Bloom Filter of size $m$, $S = \{x_1, x_2, x_3, \ldots, x_n\}$ be the set of words inserted into the Bloom Filter $B$, $U$ be the universe where $S \subset U$, and $k$ be the number of hash functions. The Bloom Filter size (i.e., $m$) is a prime number to reduce collisions [31]. Bloom Filter calculates the optimal number of hash functions using Equation 1, which for the standard Bloom Filter is

$$k = \frac{m}{n} \ln 2 \qquad (1)$$
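A large $k$ value increases the time complexity of the operations, whereas a small $k$ value increases collision probability. Bloom Filter returns either true or false in a query operation, but these responses are further classified based on the presence or absence of the words in the Bloom Filter. True is classified into true positive and false positive, and false is classified into true negative and false negative. Let $x$ be a random query item queried to the Bloom Filter $B$; then the definitions are as follows.
Definition 2. If $x \in S$ and $x \in B$, then the result of the query is a true positive.
Definition 3. If $x \notin S$ and $x \in B$, then the result of the query is a false positive.
Definition 4. If $x \notin S$ and $x \notin B$, then the result of the query is a true negative.
Definition 5. If $x \in S$ and $x \notin B$, then the result of the query is a false negative.
Figure 1: [...] if at least one bit location is 0, then Bloom Filter returns False. The query of X gives a True response, whereas the query of Z gives a False response. In the case of the query of Z, out of three bit locations, one bit location is 0; hence, Bloom Filter returns False.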
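The trade-off around the number of hash functions can be checked numerically. The following is a minimal sketch, assuming the textbook formulas for a standard Bloom Filter (optimal $k = (m/n)\ln 2$, required size $m = -n \ln p / (\ln 2)^2$, and expected false positive probability $(1 - e^{-kn/m})^k$); the function names are hypothetical, and Python is used only for brevity since the authors' implementation is in C.

```python
import math

def required_bits(n_items: int, fpp: float) -> int:
    """Bits needed by a standard Bloom Filter for n_items at the target FPP."""
    return math.ceil(-n_items * math.log(fpp) / (math.log(2) ** 2))

def optimal_num_hashes(m_bits: int, n_items: int) -> int:
    """Optimal number of hash functions, k = (m / n) * ln 2, rounded."""
    return max(1, round((m_bits / n_items) * math.log(2)))

def expected_fpp(m_bits: int, n_items: int, k: int) -> float:
    """Textbook estimate of the false positive probability."""
    return (1.0 - math.exp(-k * n_items / m_bits)) ** k

# Example: 40 million items at a desired FPP of 0.001.
m = required_bits(40_000_000, 0.001)
k = optimal_num_hashes(m, 40_000_000)
print(m / 8 / 1e6, "MB,", k, "hash functions,", expected_fpp(m, 40_000_000, k))
```

For 40 million items and a desired FPP of 0.001, these formulas give roughly 72 MB and 10 hash functions, which agrees with the Libbloom configuration reported in the comparison experiments.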
Bloom Filter has mainly two operations: insertion and query. Algorithms 1 and 2 represent the insertion and query operations, respectively. As an example, the algorithms are written using the murmur hash function [37]. There is a prominent issue in Bloom Filter, namely false positives. Assume that only two words are inserted into the Bloom Filter, and that a third word, which was never inserted, is queried (as illustrated in Figure 1). The Bloom Filter hashes the queried word with three hash functions, which yield three bit locations. All three locations have bit value 1. Hence, the Bloom Filter returns True, which is an incorrect response. This happens because the locations obtained were already set to 1 by the two inserted words.
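The insertion and query operations, and the way a false positive arises, can be sketched as follows. This is an illustrative sketch, not the authors' Algorithms 1 and 2: the class name and sizes are hypothetical, and seeded BLAKE2b from the Python standard library stands in for the murmur hash function.

```python
import hashlib

def seeded_hash(word: str, seed: int) -> int:
    """64-bit seeded hash (BLAKE2b used here as a stand-in for murmur hash)."""
    digest = hashlib.blake2b(word.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)  # all bits start at 0

    def _positions(self, word: str):
        # One bit position per hash function (one seed per function).
        return [seeded_hash(word, seed) % self.m for seed in range(self.k)]

    def insert(self, word: str) -> None:
        for pos in self._positions(word):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def query(self, word: str) -> bool:
        # True only if every position is 1; a single 0 means "definitely absent".
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(word))

bf = BloomFilter(m=1009, k=3)
bf.insert("X")
bf.insert("Y")

# A deliberately tiny filter saturates quickly, so a word that was never
# inserted finds all of its bits already set: a false positive.
tiny = BloomFilter(m=8, k=3)
for i in range(100):
    tiny.insert(f"word{i}")
```

Inserted words are always reported as present, since a Bloom Filter has no false negatives. The saturated `tiny` filter returns True for any word, which is the false positive case described above.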
Cuckoo Filter [29] is a data structure used for membership inspection, and it is not a Bloom Filter; instead, it is a filter performing the same task as Bloom Filter, i.e., membership identification. It uses a modified cuckoo hashing [43], called partial-key Cuckoo hashing. This hashing improves the dynamic insertion of words into the Cuckoo Filter. Instead of hash values, the fingerprint of the word is stored in the Cuckoo Filter to reduce memory requirements. The fingerprint size is less compared to the hashed value of the word. Cuckoo Filter generates two locations by hashing the fingerprint. Randomly any one location is selected for the storage of the fingerprint. In case the location is occupied, then select the other location. If both locations have fingerprints, then among the two locations, randomly select any one location. Then kick the fingerprint in the selected location and store the new fingerprint. Again, check the kicked fingerprint's alternate location; if empty, store the kicked fingerprint; otherwise, repeat the kicking process.
The kicking process continues until a threshold value. Crossing the kicking threshold confirms that the Cuckoo Filter is saturated. In query operation, hash the queried word to obtain the fingerprint. Again hash the fingerprint to generate two locations; if it is present in any one location, then the Cuckoo Filter returns True; otherwise, False.
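The kicking procedure can be sketched as follows. This is a minimal sketch under stated assumptions rather than the compared implementation [49]: it follows the common partial-key formulation in which the first location comes from hashing the word and the alternate location is the first location XORed with a hash of the fingerprint. The bucket count, bucket size, 16-bit fingerprint, and kick threshold are all illustrative choices, and BLAKE2b is a stand-in hash function.

```python
import hashlib
import random

def h64(data: bytes, seed: int) -> int:
    d = hashlib.blake2b(data, digest_size=8, salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(d, "little")

class CuckooFilter:
    def __init__(self, num_buckets=64, bucket_size=4, max_kicks=500, rng_seed=0):
        # A power-of-two bucket count keeps the XOR-derived index in range.
        assert num_buckets & (num_buckets - 1) == 0
        self.n, self.bucket_size, self.max_kicks = num_buckets, bucket_size, max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(rng_seed)

    def _fingerprint(self, word: str) -> int:
        return h64(word.encode(), 1) & 0xFFFF  # assumed 16-bit fingerprint

    def _index(self, word: str) -> int:
        return h64(word.encode(), 2) % self.n

    def _alt(self, i: int, fp: int) -> int:
        # Symmetric: _alt(_alt(i, fp), fp) == i, so a kicked fingerprint
        # can find its other location without the original word.
        return (i ^ h64(fp.to_bytes(2, "little"), 3)) % self.n

    def insert(self, word: str) -> bool:
        fp = self._fingerprint(word)
        i1 = self._index(word)
        i2 = self._alt(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp  # kick a resident
            i = self._alt(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # kick threshold crossed: the filter is treated as saturated

    def query(self, word: str) -> bool:
        fp = self._fingerprint(word)
        i1 = self._index(word)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt(i1, fp)]

cf = CuckooFilter()
words = [f"edge{i}" for i in range(100)]
inserted_ok = [cf.insert(w) for w in words]
```

Because the kicked fingerprint is chosen at random, repeated runs with different random seeds can place fingerprints differently, which is consistent with the run-to-run variation in false positives noted in the comparison experiments.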

BionetBF: The Proposed System
BionetBF is a novel Bloom Filter for membership identification of edges in Big Graph. It is a two-dimensional Bloom Filter (2DBF) with fewer arithmetic operations. BionetBF is a two-dimensional array where the dimensions are different prime numbers. Prime numbers reduce collision probability. Usually, Bloom Filters take a single word as input, but the edge has two data/words (two vertices). Hence, the vertices of an edge are concatenated and given as input to BionetBF. This conversion of two words into a single word helps maintain a single data structure for the network dataset.
BionetBF implements two operations: insertion and query. Let a directed edge in a Big Graph be $(v_1, v_2)$, where the edge direction is from $v_1$ to $v_2$. Our desired FPP is 0.001. BionetBF with one hash function has an FPP above 0.001. Hence, BionetBF uses two hash functions; this is illustrated experimentally in the supplementary document. Let $B$ be a BionetBF, $(v_1, v_2)$ be a directed edge, and $H()$ be a hash function. BionetBF uses the murmur hash function [37]. Different seed values in the murmur hash function create different hash values for the same word. Initially, all cells of BionetBF are set to 0.
Algorithm 3 demonstrates the insertion operation of BionetBF. First, $v_1$ and $v_2$ are concatenated into a single word $v_1v_2$, where $v_1v_2 \neq v_2v_1$. The word $v_1v_2$ is given as input to the murmur hash function. Then, modular operations are performed on the hashed value with the dimensions of the BionetBF. The result of the modular operations gives a location in BionetBF, which stores the information indicating the insertion of $v_1v_2$ into BionetBF. An OR operation sets one bit to 1 without losing the previous information of that cell. This procedure is repeated twice because $k = 2$; in other words, each $v_1v_2$ sets two bits to 1 in BionetBF. In the case of an undirected edge, an order-independent operation between $v_1$ and $v_2$ is performed instead of concatenation, and the result is inserted into BionetBF. However, the undirected edge has very few applications; hence, it is not explored in this article.
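The insertion steps can be sketched as follows. This is a hedged illustration rather than the authors' Algorithm 3: the 64-bit cell width, the way the bit position inside a cell is chosen, the tiny 7 × 11 array, and the sorted-concatenation key for undirected edges are all assumptions made for the example, and BLAKE2b stands in for the murmur hash function.

```python
import hashlib

CELL_BITS = 64   # assumed width of one cell
NUM_HASHES = 2   # k = 2, as stated in the text

def seeded_hash(word: str, seed: int) -> int:
    d = hashlib.blake2b(word.encode(), digest_size=8,
                        salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(d, "little")

def edge_key(v1: str, v2: str, directed: bool = True) -> str:
    if directed:
        return v1 + v2           # v1v2 differs from v2v1
    a, b = sorted((v1, v2))      # assumption: one possible order-independent key
    return a + b

def insert_edge(cells, v1: str, v2: str, directed: bool = True) -> None:
    rows, cols = len(cells), len(cells[0])   # two different primes
    key = edge_key(v1, v2, directed)
    for seed in range(NUM_HASHES):
        h = seeded_hash(key, seed)
        # Modular operations on the hash value select the cell and the bit;
        # OR sets the bit without disturbing the rest of the cell.
        cells[h % rows][h % cols] |= 1 << (h % CELL_BITS)

cells = [[0] * 11 for _ in range(7)]
insert_edge(cells, "10000000", "CTGGGCTA")
set_bits = sum(bin(c).count("1") for row in cells for c in row)
```

Each inserted edge sets at most two bits, and inserting the same edge again leaves the array unchanged because OR is idempotent.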
Algorithm 4 demonstrates the query operation of BionetBF for the queried item $(v_1, v_2)$. The steps up to obtaining a location in BionetBF are the same for both insertion and query operations. The XOR operation retrieves the related information from the cell, i.e., it gives the bit value of the queried edge. A bitwise test then determines whether the location bit is 1. The query obtains two locations because $k = 2$. If both location bits are 1, then BionetBF returns True; otherwise, False.
The presence of a path can also be verified in a Big Graph by performing multiple query operations in BionetBF. The AND operation is performed among the individual queries of all edges in the path. Let us demonstrate the path query operation with an example. Let $(v_1, v_2, v_3, v_4)$ be four vertices that form a path from $v_1$ to $v_4$ such that $v_1 \rightarrow v_2$, $v_2 \rightarrow v_3$, and $v_3 \rightarrow v_4$. Let us assume a query asking whether there is a path from $v_1$ to $v_4$. The path can be queried into the Bloom Filter as

$$P = \mathrm{QBIONETBF}(B, v_1, v_2) \wedge \mathrm{QBIONETBF}(B, v_2, v_3) \wedge \mathrm{QBIONETBF}(B, v_3, v_4)$$

If $P = 1$, then there exists a path; otherwise, not.
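The edge query and the path query can be sketched together. This is an illustrative sketch under assumptions, not the authors' Algorithm 4: the class name, the 1009 × 1013 dimensions, the 64-bit cell width, and the use of a mask-and-test to read a bit are choices made for the example, and BLAKE2b stands in for the murmur hash function.

```python
import hashlib

CELL_BITS = 64   # assumed cell width
K = 2            # two hash functions

class BionetBFSketch:
    def __init__(self, rows: int = 1009, cols: int = 1013):  # two different primes
        self.rows, self.cols = rows, cols
        self.cells = [[0] * cols for _ in range(rows)]

    def _locations(self, v1: str, v2: str):
        key = (v1 + v2).encode()  # directed edge: concatenate the two vertices
        for seed in range(K):
            d = hashlib.blake2b(key, digest_size=8,
                                salt=seed.to_bytes(8, "little")).digest()
            h = int.from_bytes(d, "little")
            yield h % self.rows, h % self.cols, h % CELL_BITS

    def insert(self, v1: str, v2: str) -> None:
        for i, j, b in self._locations(v1, v2):
            self.cells[i][j] |= 1 << b

    def query(self, v1: str, v2: str) -> bool:
        # True only if both location bits are 1.
        return all(self.cells[i][j] & (1 << b)
                   for i, j, b in self._locations(v1, v2))

def path_exists(bf: BionetBFSketch, vertices) -> bool:
    """AND of the individual edge queries along the path."""
    return all(bf.query(a, b) for a, b in zip(vertices, vertices[1:]))

bf = BionetBFSketch()
for a, b in [("v1", "v2"), ("v2", "v3"), ("v3", "v4")]:
    bf.insert(a, b)
forward = path_exists(bf, ["v1", "v2", "v3", "v4"])
backward = path_exists(bf, ["v4", "v3", "v2", "v1"])
```

As with any Bloom Filter, a True answer for a path is probabilistic: each edge query can be a false positive, so a path query over several edges can be one as well, while a False answer is definite.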

Data Description
We have used two types of datasets for the BionetBF experiments: synthetic and real-world datasets. The synthetic dataset is IDKmer, and the real-world dataset is the biological network dataset. We have used the synthetic dataset to verify the correctness of the experimental results. The real-world dataset is used to present the efficiency of BionetBF on real-world data. This section provides details of the datasets. To determine the FPP, we have generated three different datasets for the query operation: Same Set $S$, Mixed Set $M$ and Disjoint Set $D$. Let the data set inserted into the Bloom Filter be $I$ (Inserted Set). The Same Set and Disjoint Set can be defined as $S = I$ and $D \cap I = \emptyset$, respectively. The Mixed Set consists of two halves, $M = \{M_1, M_2\}$, where one half is taken from the Inserted Set and the other half is taken from the Disjoint Set. The Same Set is not generated separately; instead, $I$ is queried again to BionetBF. To avoid confusion between the insertion and query sets, the Same Set reference is used to indicate the queried set. The Disjoint Set is a set having no common edge with the Inserted Set, as depicted in Figure 2. The Mixed Set comprises half the number of edges of the Inserted Set and half of the Disjoint Set. The Same Set is used to verify the correct execution of the operation, whereas the Mixed Set and Disjoint Set are used to determine the FPP of BionetBF.
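The construction of the three query sets and the FPP measurement can be sketched as follows. This is a minimal sketch under assumptions: a small generic two-hash Bloom Filter (not BionetBF itself) is used as the filter under test, the FPP is taken as the number of false positives divided by the number of queries, and all names, sizes, and ID ranges are hypothetical.

```python
import hashlib

def make_filter(num_bits: int, k: int = 2):
    """Small generic Bloom Filter used only as the filter under test."""
    bits = bytearray((num_bits + 7) // 8)

    def positions(edge: str):
        for seed in range(k):
            d = hashlib.blake2b(edge.encode(), digest_size=8,
                                salt=seed.to_bytes(8, "little")).digest()
            yield int.from_bytes(d, "little") % num_bits

    def insert(edge: str) -> None:
        for p in positions(edge):
            bits[p >> 3] |= 1 << (p & 7)

    def query(edge: str) -> bool:
        return all(bits[p >> 3] & (1 << (p & 7)) for p in positions(edge))

    return insert, query

def build_query_sets(inserted, disjoint):
    half = len(inserted) // 2
    same = list(inserted)                        # Same Set: the Inserted Set again
    mixed = inserted[:half] + disjoint[:half]    # half from each
    return same, mixed, disjoint

def false_positive_probability(query, queried, inserted_lookup) -> float:
    false_positives = sum(1 for e in queried
                          if e not in inserted_lookup and query(e))
    return false_positives / len(queried)

inserted = [f"{10_000_000 + i}ACGTACGT" for i in range(2000)]
disjoint = [f"{90_000_000 + i}ACGTACGT" for i in range(2000)]  # different IDs
insert, query = make_filter(10007)   # deliberately small so false positives occur
for edge in inserted:
    insert(edge)

same, mixed, disjoint_set = build_query_sets(inserted, disjoint)
lookup = set(inserted)
fpp_same = false_positive_probability(query, same, lookup)
fpp_mixed = false_positive_probability(query, mixed, lookup)
fpp_disjoint = false_positive_probability(query, disjoint_set, lookup)
```

Under this definition the Same Set can never contribute false positives, and the Mixed Set can produce at most as many false positives as the Disjoint Set, because only its disjoint half can be misreported.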

IDKmer Dataset
The K-mer is defined as a DNA sequence or string of length $K$. Our synthetic dataset, IDKmer, consists of biological edges of the form (ID, K-mer). The K-mer is a K-mer of a fixed length, and the ID is the ID of the K-mer. For the K-mer, a genomic sequence of a fixed length is read from a DNA sequence dataset consisting of human DNA sequences (downloaded from [dataset] [44]). The K-mer lengths considered for the experiment are 8, 15 and 20. The K-mers are read from a single sequence continuously, i.e., the reading of the first sequence starts from the first nucleotide of the DNA sequence dataset. The sequence is read for the required length. Then, for the second sequence, the reading starts from the nucleotide where the first sequence stopped. To assign a unique ID, an 8-digit number is incremented and assigned to each sequence. The Disjoint Set is generated by taking 8-digit numbers different from those of the original set. There are four datasets with 10, 20, 30 and 40 million biological edges having K-mer length 8. Similarly, the IDKmer datasets with K-mer lengths 15 and 20 are generated; each has four datasets with 10, 20, 30 and 40 million biological edges. Hence, there are 12 datasets. Each of the 12 datasets has its own Mixed and Disjoint sets. Table 1 lists the notation used to refer to the datasets and the file size. In our synthetic dataset, the first biological edge with K-mer length 8 is "10000000 CTGGGCTA", with K-mer length 15 is "10000000 CTGGGCTAAAAGGTC" and with K-mer length 20 is "10000000 CTGGGCTAAAAGGTCCCTTA".
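The generation procedure can be sketched in a few lines. This is a minimal sketch, assuming that consecutive, non-overlapping K-mers are read from one sequence and that IDs start at 10000000, as the first edges quoted above suggest; the function name and the starting ID used for the Disjoint Set are hypothetical.

```python
def make_idkmers(sequence: str, k: int, first_id: int = 10_000_000):
    """Read consecutive K-mers of length k and pair each with an incrementing ID."""
    edges = []
    # Each K-mer starts where the previous one stopped; a trailing fragment
    # shorter than k is dropped.
    for n, start in enumerate(range(0, len(sequence) - k + 1, k)):
        edges.append(f"{first_id + n} {sequence[start:start + k]}")
    return edges

sequence = "CTGGGCTAAAAGGTCCCTTA"   # first 20 nucleotides quoted in the text
edges_8 = make_idkmers(sequence, 8)
edges_15 = make_idkmers(sequence, 15)
edges_20 = make_idkmers(sequence, 20)

# Disjoint Set: same K-mers with IDs from a different range (assumed start value).
disjoint_8 = make_idkmers(sequence, 8, first_id=90_000_000)

# One line of the K-mer length 8 dataset, including the newline, is 18 bytes.
bytes_per_line_8 = len(edges_8[0]) + 1
```

At 18 bytes per line, 10 million IDKmers of K-mer length 8 occupy 180 MB, which matches the file size listed for that dataset.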

Biological Network Dataset
Three biological network datasets are considered for real-world data for experimentation: Drug-Gene [dataset] [45], Gene-Disease (Downloaded from [dataset] [46]), and Gene-Gene (Downloaded from [dataset] [46]). In the Drug-Gene dataset, one biological vertex is chemical ID, and another vertex is gene ID. The information provided by the dataset is the Drug-Gene interaction network, i.e., the interaction between genes (proteins encoded by genes) and small molecules. In the Gene-Disease dataset, one biological vertex is gene ID, and the other vertex is disease ID. The dataset provides information regarding the association of disease with genes. This is known as the disorder-gene association [47]. In the Gene-Gene dataset, both the biological vertices are gene IDs. The interaction between gene and gene is called Epistasis [48].
A Mixed Set and a Disjoint Set are generated for each biological network dataset. In the Disjoint Set, the first vertex is a unique string, and the other vertex is the Gene vertex read from the Drug-Gene dataset. The string of the first edge is "aaaaaaa", which is incremented alphabetically to have a unique string in each line. Similarly, a unique string is the first vertex of the Disjoint Sets of the Gene-Disease and Gene-Gene datasets. The other vertex is the Disease vertex and the Gene vertex, read from the Gene-Disease and Gene-Gene datasets, respectively.

Table 1: Notation and file size of the IDKmer datasets.
Notation    Description                                   File size (MB)
8 10        Kmer length is 8 with 10 million IDKmers      180
8 20        Kmer length is 8 with 20 million IDKmers      360
8 30        Kmer length is 8 with 30 million IDKmers      540
8 40        Kmer length is 8 with 40 million IDKmers      720
15 10       Kmer length is 15 with 10 million IDKmers     250
15 20       Kmer length is 15 with 20 million IDKmers     500
15 30       Kmer length is 15 with 30 million IDKmers     750
15 40       Kmer length is 15 with 40 million IDKmers     1000
20 10       Kmer length is 20 with 10 million IDKmers     300
20 20       Kmer length is 20 with 20 million IDKmers     600
20 30       Kmer length is 20 with 30 million IDKmers     900
20 40       Kmer length is 20 with 40 million IDKmers     1200

Performance Parameters
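Let $F$, $T$, $O$, $D$, $m$, and $n$ be the false positive probability, time taken (in seconds), number of operations (in million), file size in megabyte (MB), Bloom Filter's size (in bits), and number of input sequences, respectively. The performance parameters reported in the results, namely accuracy, MOPS, SPO, MBPS, and bits per sequence (BPS), are computed from these quantities.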
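The performance parameters can be sketched as simple ratios of these quantities. The definitions below are assumptions inferred from the parameter names and from the reported results (for example, an FPP of 0.00086 corresponds to the reported accuracy of 99.91%, and the bits per sequence depend only on the filter size and the number of input sequences); they are not copied from the authors' equations, and the function names are hypothetical.

```python
def accuracy(fpp: float) -> float:
    """Accuracy in percent, assumed to be (1 - FPP) * 100."""
    return (1.0 - fpp) * 100.0

def mops(ops_million: float, seconds: float) -> float:
    """Million operations per second."""
    return ops_million / seconds

def spo(ops_million: float, seconds: float) -> float:
    """Seconds per operation."""
    return seconds / (ops_million * 1e6)

def mbps(file_size_mb: float, seconds: float) -> float:
    """Megabytes of input processed per second."""
    return file_size_mb / seconds

def bps(filter_bits: int, num_sequences: int) -> float:
    """Bits of filter memory per input sequence."""
    return filter_bits / num_sequences

# Worked example with figures quoted in the text:
# 40 million edges in 7.6 sec, and an FPP of 0.00086.
print(mops(40, 7.6), spo(40, 7.6), accuracy(0.00086))
```

With 40 million operations in 7.6 sec this gives about 5.26 MOPS and about 1.9 × 10^-7 seconds per operation. A filter of 108.5 million bits (an illustrative size chosen to match the reported figures) yields 10.85 bits per sequence for 10 million sequences and about 2.71 for 40 million.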

RESULTS
We have conducted an extensive series of tests to validate the performance of BionetBF using diverse datasets: 12 synthetic datasets (Section 4.2 provides the detailed procedure for generating the datasets, and Table 1 provides the notation and file size details) and the biological network datasets (downloaded from [46]; Table 2 provides the details of file size and the number of lines). We conducted rigorous experiments to establish two points concerning biological networks: (1) BionetBF performs fast processing of overwhelmingly large datasets using a tiny amount of memory, and (2) it has the lowest error. BionetBF drastically reduces memory requirements, and we present its analysis and results in this section. We have conducted the experiments on a low-cost Ubuntu desktop computer with 4GB RAM and a Core-i7 processor.

BionetBF
This section presents the analysis of BionetBF with $k = 2$ using the IDKmer dataset. Figure 3a represents [...] gives FPP of 0.00086 and 0.0017 for Mixed Set and Disjoint Set, respectively. Figure 5 illuminates the accuracy of BionetBF, which is 100% in the case of the K-mer length 8 and 15 datasets for both Mixed Set and Disjoint Set. Moreover, the accuracy of the Same Set is 100%. Figure 5a and Figure 5b illustrate the accuracy of the Mixed Set and Disjoint Set, respectively. The accuracy of the Mixed Set and Disjoint Set for the K-mer length 20 datasets is 99.91% and 99.83%, respectively. Thus, BionetBF has more than 99% accuracy for a large data file. Figure 6 represents the performance of the insertion operation. Figure 6a exhibits MOPS, where the highest MOPS is 7.06 [...] Table 3. The bits are calculated from the BionetBF data structure size. The BPS is based on two fixed parameters: the BionetBF size and the number of input sequences. Hence, BPS is the same for all IDKmer datasets with the same number of IDKmers. The highest BPS is 10.85, exhibited by the 10 million IDKmers, while the lowest is 2.71, exhibited by the 40 million IDKmers.

Comparison with Other filters
In this section, BionetBF is compared with Cuckoo Filter [29] (code available at [49]) and Libbloom (code available at [50]). The filter size of the Cuckoo Filter is based on the total number of input sequences. In the experiments, the highest number of sequences inserted is 40 million; based on this, the filter size of the Cuckoo Filter is 100MB. If the configured total number of sequences is reduced, the Cuckoo Filter does not insert all sequences in the case of the K-mer length 20 datasets. Cuckoo Filter takes numbers as input; hence, a hash function is added to convert the sequences to numbers. Hence, the total number of hash functions used by the Cuckoo Filter is 3 if no kicking occurs. Another important point is that if the Cuckoo Filter is executed with the same dataset multiple times, it gives different false positives. We have considered the least number of false positives after executing the Cuckoo Filter multiple times for a single dataset. Libbloom is the standard Bloom Filter [8]. Similar to the Cuckoo Filter, in the case of Libbloom, the total number of sequences considered is 40 million. The memory size is 71 MB, and to achieve an FPP of 0.001, Libbloom takes 10 hash functions. In the experiment, the filter size of BionetBF is 15MB. Similar to BionetBF, the Kmer ID and sequence of the IDKmer dataset are concatenated and inserted as a single sequence into Cuckoo Filter and Libbloom. Table 4 highlights the filter size and the number of hash functions used by Cuckoo Filter, Libbloom, and BionetBF in the experiment. Figure 8 and Figure 9 show the comparison among Cuckoo Filter, Libbloom and BionetBF based on the time taken for insertion and query operations, respectively. Figure 8 and Figure [...], and K-mer length 20 [...], respectively, with every increment of 10 million IDKmer queries. Figure 10 and Figure 11 delineate the comparison among Cuckoo Filter, Libbloom and BionetBF based on the FPP for the Mixed Set and Disjoint Set.
Clearly, the Cuckoo Filter has the highest FPP, which is more than the desired FPP, i.e., 0.001, in all IDKmer datasets. Libbloom exhibits less than the desired FPP in all IDKmer datasets. In the Mixed Set (Figure 10), Cuckoo Filter has 0.003 (approx.) on average more FPP than BionetBF, and Cuckoo Filter has 0.0029 (approx.) on average more FPP than Libbloom. Compared with Libbloom, BionetBF has 0.0000144 (approx.) on average more FPP. This is an extremely small difference, and considering the size of the filters, BionetBF has better performance. BionetBF has zero FPP in the majority of the IDKmer datasets, whereas Libbloom has some FPP except for the 10 million IDKmer datasets. In the Disjoint Set (Figure 11), the Cuckoo Filter has 0.0059 (approx.) on average more FPP than BionetBF, and the Cuckoo Filter has 0.006 (approx.) on average more FPP than Libbloom. Libbloom has 0.0000128 (approx.) on average less FPP than BionetBF. In the case of the Disjoint Set, BionetBF, with its small filter size, outperforms Libbloom. BionetBF has zero FPP in the K-mer length 8 and 15 datasets. Cuckoo Filter has non-zero FPP in all IDKmer datasets, and Libbloom has zero FPP only in the datasets having 10 million IDKmers. Therefore, BionetBF exhibits superior performance with regard to insertion time, query time and FPP with a small filter size for a high number of IDKmers. Figure 12 highlights the comparison among Cuckoo Filter, Libbloom and BionetBF based on accuracy. In all IDKmer datasets, Libbloom and BionetBF have 100% accuracy, whereas Cuckoo Filter does not have 100% accuracy, but it is still more than 99%. This result is expected because the data structures of Cuckoo Filter and Libbloom are constructed to achieve the desired FPP. Figure 13 delineates the comparison among Cuckoo Filter, Libbloom and BionetBF based on the performance of the insertion operation. Figure 13a [...] Figure 15 shows the comparison of performance based on SPO for the query operation.
Figure 15a, Figure 15b and Figure 15c illustrate the query SPO for the Same Set, Mixed Set and Disjoint Set, respectively. In the Same Set (Figure 15a) [...] respectively. Libbloom occupies approximately 5.3× more BPS than BionetBF. These experiments demonstrate the high efficiency and performance of BionetBF compared to the other filters, namely, Cuckoo Filter and Libbloom. Another important point is that a small-sized BionetBF performs better than a big-sized Cuckoo Filter and Libbloom. Another filter is also considered for comparison with BionetBF, namely, the Xor filter [42] (code available at [51]). For faster processing, the Xor filter saves all the sequences in an array, which is given as input for the construction of the filter. This reduces the time duration of the whole operation, i.e., insertion and query. Usually, genomic data is huge; inserting the whole genomic data in a single array is not possible. Moreover, the Xor filter has a fixed upper limit on the amount of repeated data allowed in the dataset. Genomic data are highly repetitive in nature, so this condition of the Xor filter is a huge constraint for genomic data. Therefore, the Xor filter is not appropriate for genomic data.

BionetBF with Biological network
This section provides the analysis and results of the experimentation performed on BionetBF using three biological network datasets: (a) Drug-Gene, (b) Gene-Disease, and (c) Gene-Gene. Some details regarding the datasets are mentioned in Section 4.2.2. While experimenting with the Drug-Gene dataset, the vertices of each biological edge (i.e., chemical and gene) are concatenated and inserted or queried to BionetBF. Similarly, in the Gene-Disease dataset, the vertices of each biological edge (i.e., gene and disease) are concatenated. In the case of Gene-Gene, the interacting genes are concatenated to insert or query to BionetBF. An important point to highlight is that the size of the Drug-Gene dataset is 13.4 GB (approx.), Gene-Disease is 1.5GB (approx.), and Gene-Gene is 154MB. Figure 17 represents the insertion and query time of BionetBF with the biological network datasets. Drug-Gene, Gene-Disease, and Gene-Gene take 76.6 sec, 11.64 sec and 0.88 sec, respectively, to insert the biological edges. Similarly, for the Disjoint Set, the query time taken by Drug-Gene, Gene-Disease, and Gene-Gene is 76.75 sec, 11.93 sec, and 0.74 sec, respectively. Figure 18 illustrates the FPP of BionetBF with the biological network datasets: Drug-Gene and Gene-Gene. The Gene-Gene dataset has zero FPP. Both Drug-Gene and Gene-Disease are big-sized datasets; however, the FPP is very low. Figure 19 illuminates the accuracy of BionetBF with the biological network datasets: Drug-Gene (Figure 19a), Gene-Disease (Figure 19b) and Gene-Gene (Figure 19c). The Gene-Gene dataset has 100% accuracy; Drug-Gene and Gene-Disease have an accuracy of more than 99.9%. Figure 20 elucidates the performance of BionetBF with the biological network datasets for both insertion and query operations. Considering both insertion and query operations, BionetBF with the biological network datasets exhibits more than 5 MOPS, below $2 \times 10^{-7}$ SPO, and more than 130 MBPS. Table 6 illustrates the bits per sequence of the three biological network datasets.

Drug-Gene Interaction: A Case Study
Boolean query, or membership identification, is an important operation in Big Data: it determines whether the queried data is a member of a huge dataset. In the case of Big Graph, a boolean query determines whether the queried edge is present in the Big Graph. The huge size of the Big Graph increases the complexity of this simple operation, which returns either true or false. This section presents a case study of BionetBF applied to faster boolean queries. We consider two real datasets: the Drug-Gene dataset and the Chemical-Gene dataset (downloaded from [dataset] [46]). Both contain biological data on the interaction between chemicals and genes; to avoid confusion, one is called Drug-Gene and the other Chemical-Gene. The size of BionetBF is kept the same as in the above experimentation. The details of the Drug-Gene dataset are mentioned in Section 4.2. The Chemical-Gene dataset is inserted into BionetBF, and the Drug-Gene dataset is queried against it. The inserted dataset is 13.4 GB with 400 million interactions, and the queried dataset is 1.14 GB with 62,816,502 interactions. The insertion time is 77.44 sec, and the query time is 9.54 sec. For every query, BionetBF returned false. In other words, within 10 sec BionetBF correctly reported around 62.8 million queried interactions as absent from a biological network of 400 million interactions.
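The query-phase figures of this case study can be converted into throughput metrics with a few lines of C. This is a minimal sketch that assumes MOPS means million operations per second, SPO means seconds per operation, and MBPS means megabytes per second with 1 GB taken as 1000 MB; the function names mops, spo, and mbps are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Million operations per second. */
double mops(uint64_t ops, double seconds) {
    assert(seconds > 0.0);
    return (double)ops / seconds / 1e6;
}

/* Seconds per operation. */
double spo(uint64_t ops, double seconds) {
    assert(ops > 0);
    return seconds / (double)ops;
}

/* Megabytes per second, taking 1 GB = 1000 MB (an assumption). */
double mbps(double gigabytes, double seconds) {
    assert(seconds > 0.0);
    return gigabytes * 1000.0 / seconds;
}
```

Under these assumptions, 62,816,502 queries over 1.14 GB in 9.54 sec correspond to roughly 6.58 MOPS, about 1.5 × 10⁻⁷ SPO, and about 119.5 MBPS.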

Discussion
This section discusses some applications of BionetBF in the biological network field. BionetBF can help in drug discovery, and another application is the deduplication of biological edges.
One application of BionetBF is faster identification of biological edges in biological networks involving a particular target protein. The biological networks are stored in a database, and a BionetBF is constructed for each biological network. The size of each BionetBF is based on the size of its biological network. A large biological network with thousands or millions of nodes can be stored in a small-sized BionetBF, as illustrated in the Result section; hence, constructing a BionetBF is not much of an overhead. When a target protein and its interacting protein are determined, the biological edge is checked in the BionetBFs in parallel. If a BionetBF returns true, its related biological network in the database is checked for further processing. Searching the biological network directly in the database is difficult because searching a Big Graph is computationally expensive. However, BionetBF can confirm membership in a Big Graph within seconds.
Another application of BionetBF in drug discovery is security. A biological network can be stored in a BionetBF and provided to others without disclosing the main structure of the network. Suppose a pharmaceutical company wants to study the compatibility of its drug with a drug belonging to another pharmaceutical company. The latter company does not want to fully disclose the information about its drug (say, Drug A). However, it also wants to encourage the development of the former company's drug (say, Drug B). The former company aims to determine whether consumption of Drug B along with Drug A is safe. The company that owns Drug A can store the protein-protein interactions of Drug A in a BionetBF. Other information, such as chemical-protein interactions, can be stored in the same BionetBF. This BionetBF is sent to the company developing Drug B. A BionetBF does not store the edges themselves, so it acts like encrypted data; extracting complete information from it is difficult. Even a brute-force attack on the BionetBF, i.e., continuously querying the possible protein-protein interactions, is very difficult given the huge number of protein possibilities. The company developing Drug B can determine its own protein-protein interactions and check in the BionetBF whether they are present in Drug A.

Bloom Filters are widely used for deduplication, and BionetBF is likewise an excellent choice for deduplicating repeated insertions or queries of biological edges. While determining the biological network of a biological process, the same biological edge may be generated repeatedly and forwarded to the database for storage; such repetitions can number in the thousands. Storing a huge amount of repetitive information in databases is a major issue in Bioinformatics, and searching for and removing such repetitive biological data is another difficult task. BionetBF can be used in this situation. After a biological edge is generated, it is checked in the BionetBF. If it is present, it is not stored in the database; otherwise, it is first inserted into the BionetBF and then forwarded to the database for storage. Another advantage of using BionetBF is that slow database storage does not halt processing. Databases reside in secondary storage; hence, storing data in them takes longer. The application does not have to wait for the database to complete the insertion of the biological data; rather, BionetBF continues membership checking while each new biological edge is stored in a buffer and gradually inserted into the database.
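The check-then-store workflow described above can be sketched in C as follows. This is a sketch under stated assumptions rather than the authors' implementation: a small fixed-size Bloom filter with FNV-1a double hashing stands in for BionetBF, the database writer is a caller-supplied callback, and every name and constant (filter_t, write_buffer_t, dedup_submit, dedup_flush, FILTER_BITS, NUM_PROBES, BUF_CAP) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FILTER_BITS (1u << 20)  /* illustrative size, not a BionetBF parameter */
#define NUM_PROBES  4
#define BUF_CAP     1024

typedef struct { uint8_t bits[FILTER_BITS / 8]; } filter_t;  /* must start zeroed */

typedef struct {
    char *edges[BUF_CAP];  /* edges waiting for the slow database write */
    size_t count;
} write_buffer_t;

/* 64-bit FNV-1a over a C string, starting from a caller-supplied basis. */
static uint64_t fnv1a(const char *s, uint64_t h) {
    while (*s) { h ^= (uint8_t)*s++; h *= 1099511628211ULL; }
    return h;
}

/* i-th probe position by double hashing. */
static uint64_t probe_pos(const char *key, int i) {
    uint64_t h1 = fnv1a(key, 14695981039346656037ULL);
    uint64_t h2 = fnv1a(key, 0x9e3779b97f4a7c15ULL) | 1;
    return (h1 + (uint64_t)i * h2) % FILTER_BITS;
}

/* Returns 1 if the key is possibly present, 0 if it is definitely absent. */
int filter_query(const filter_t *f, const char *key) {
    for (int i = 0; i < NUM_PROBES; i++) {
        uint64_t p = probe_pos(key, i);
        if (!(f->bits[p / 8] & (1u << (p % 8)))) return 0;
    }
    return 1;
}

void filter_insert(filter_t *f, const char *key) {
    for (int i = 0; i < NUM_PROBES; i++) {
        uint64_t p = probe_pos(key, i);
        f->bits[p / 8] |= (uint8_t)(1u << (p % 8));
    }
}

/* Returns 1 if the edge was new and queued, 0 if it was skipped as a
   probable duplicate, -1 if the buffer is full or memory ran out. */
int dedup_submit(filter_t *f, write_buffer_t *wb, const char *edge) {
    assert(f && wb && edge);
    if (filter_query(f, edge)) return 0;   /* probably stored already: skip */
    if (wb->count == BUF_CAP) return -1;   /* caller should flush first */
    char *copy = malloc(strlen(edge) + 1);
    if (!copy) return -1;
    strcpy(copy, edge);
    filter_insert(f, edge);                /* record in the filter first */
    wb->edges[wb->count++] = copy;         /* then queue for the database */
    return 1;
}

/* Pass each queued edge to the writer (if any), free it, and empty the
   buffer. Returns the number of edges flushed. */
size_t dedup_flush(write_buffer_t *wb, void (*write_edge)(const char *)) {
    size_t n = wb->count;
    for (size_t i = 0; i < n; i++) {
        if (write_edge) write_edge(wb->edges[i]);
        free(wb->edges[i]);
    }
    wb->count = 0;
    return n;
}
```

One design consequence worth noting: because a Bloom filter can return a false positive, a genuinely new edge is occasionally skipped as a duplicate, so this pattern suits workloads that can tolerate that small loss rate.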

Conclusion
Our proposed system, BionetBF, identifies the membership of biological edges effectively and efficiently. It can store a huge volume of biological edges using a tiny amount of memory. Biological data are significantly large and require a large main memory (RAM) to process; BionetBF offers an effective and efficient solution to this problem. Our proposed solution requires 2.71 bits per biological edge in RAM, which shows that it can store mammoth-sized biological data in a few MB of main memory. BionetBF can process biological data at 162.78 MBPS on a low-cost computer and is therefore expected to achieve higher MBPS on a higher-end computing system. We conducted a series of experiments using 12 synthetic datasets and three real biological network datasets to validate our proposed method. The experiments show that the execution time of all operations is low while the false positive probability is kept very low. Furthermore, BionetBF has a zero error rate on a few datasets. We compared BionetBF with two other filters, Cuckoo Filter and Libbloom (standard Bloom Filter), and a smaller-sized BionetBF exhibits higher performance than both. Therefore, BionetBF can enhance the overall system performance of Drug Discovery, Genome coding, etc., due to its low memory footprint, nearly 100% accuracy, and high performance. Notably, BionetBF can boost the performance of systems that depend on membership queries. Our experimental results show that biological data can be processed on low-cost hardware; therefore, BionetBF can substantially reduce the computing cost of the entire application system.
BionetBF can be used to store an extensive biological network with thousands or millions of nodes. It can be implemented for quick membership checking of the neighbouring nodes of a specific protein in a biological network. Moreover, its small memory footprint can hold information about a huge biological network; hence, an application can maintain multiple BionetBFs that store nodes of biological networks for different proteins. In drug discovery, BionetBF can be used for passing information securely, since it saves the biological network without disclosing the main structure. BionetBF is also an outstanding choice for the deduplication of repetitive biological edges, and it offers constant time complexity, which is ideal for the repetitive insertion or query of biological edges. Thus, it can be a game-changer for genomics, proteomics, and Bioinformatics.