PT - JOURNAL ARTICLE
AU - Schilken, Ingo
AU - Mustafa, Harun
AU - Rätsch, Gunnar
AU - Eickhoff, Carsten
AU - Kahles, Andre
TI - Efficient graph-color compression with neighborhood-informed Bloom filters
AID - 10.1101/239806
DP - 2017 Jan 01
TA - bioRxiv
PG - 239806
4099 - http://biorxiv.org/content/early/2017/12/26/239806.short
4100 - http://biorxiv.org/content/early/2017/12/26/239806.full
AB - Motivation: Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains inaccessible to the research community through a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data on a more abstract level is its transformation into an assembly graph. Although the sequence information is then accessible, any contextual annotation and metadata are lost. Results: We present a new approach for a compressed representation of a graph coloring based on a set of Bloom filters. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph to decide on false positives, we can reduce the memory requirements for a given set of colors per edge by three orders of magnitude. As insertion and query on a Bloom filter are constant-time operations, the complexity to compress and decompress an edge color is linear in the number of color bits. Because individual colors are represented as independent filters, our approach is fully dynamic and can be easily parallelized. These properties allow for an easy upscaling to the problem sizes common in the biomedical domain. Availability: A prototype implementation of our method is available in Java. Contact: andre.kahles{at}inf.ethz.ch, carsten.eickhoff{at}inf.ethz.ch, Gunnar.Ratsch{at}ratschlab.org
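The abstract describes the idea only at a high level, so the following is a minimal Python sketch of one way it might look, not the authors' Java prototype. It keeps one independent Bloom filter per color, so decoding an edge's colors takes one constant-time query per color bit. The names `BloomFilter` and `ColorIndex`, the string edge keys, the filter sizes, and in particular the correction rule (keep a color only if at least one neighboring edge also reports it) are all assumptions made for illustration; the abstract does not specify how the neighborhood is actually used.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over string keys (illustrative, not the authors' code)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive the bit positions by double hashing two parts of a SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


class ColorIndex:
    """One independent Bloom filter per color (hypothetical helper class)."""

    def __init__(self, num_colors, num_bits=1024, num_hashes=3):
        self.filters = [BloomFilter(num_bits, num_hashes) for _ in range(num_colors)]

    def insert(self, edge, colors):
        # Compress: add the edge to the filter of every color it carries.
        for color in colors:
            self.filters[color].add(edge)

    def raw_colors(self, edge):
        # Decompress: one query per color bit; may contain false positives.
        return {c for c, f in enumerate(self.filters) if edge in f}

    def colors(self, edge, neighbors):
        # ASSUMED correction rule: keep a color only if some neighboring edge
        # also reports it. Colors carried by no neighbor would be dropped.
        raw = self.raw_colors(edge)
        if not neighbors:
            return raw
        support = set()
        for neighbor in neighbors:
            support |= self.raw_colors(neighbor)
        return raw & support


# Usage: three consecutive edges of a hypothetical path sharing colors 0 and 2.
index = ColorIndex(num_colors=4)
path = ["ACG>CGT", "CGT>GTA", "GTA>TAC"]
for edge in path:
    index.insert(edge, [0, 2])
decoded = index.colors(path[1], neighbors=[path[0], path[2]])
```

Because a Bloom filter never produces false negatives, `decoded` always contains colors 0 and 2; any extra color would have to be a false positive in the queried edge and in at least one neighbor at the same time. Since each color has its own filter, adding a color or querying colors in parallel does not touch the other filters.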