TY - JOUR
T1 - Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit
JF - bioRxiv
DO - 10.1101/2020.09.13.274779
SP - 2020.09.13.274779
AU - Ben Blamey
AU - Salman Toor
AU - Martin Dahlö
AU - Håkan Wieslander
AU - Philip J Harrison
AU - Ida-Maria Sintorn
AU - Alan Sabirsh
AU - Carolina Wählby
AU - Ola Spjuth
AU - Andreas Hellander
Y1 - 2020/01/01
UR - http://biorxiv.org/content/early/2020/09/14/2020.09.13.274779.abstract
N2 - This paper introduces the HASTE Toolkit, a cloud-native software toolkit capable of partitioning data streams in order to prioritize usage of limited resources. This in turn enables more efficient data-intensive experiments. We propose a model that introduces automated, autonomous decision making in data pipelines, such that a stream of data can be partitioned into a tiered or ordered data hierarchy. Importantly, the partitioning is online and based on data content rather than a priori metadata. At the core of the model are interestingness functions and policies. Interestingness functions assign a quantitative measure of interestingness to a single data object in the stream, an interestingness score. Based on this score, a policy guides decisions on how to prioritize computational resource usage for a given object. The HASTE Toolkit is a collection of tools to adapt data stream processing to this pipeline model. The result is smart data pipelines capable of effective or even optimal use of, e.g., storage, compute, and network bandwidth, to support experiments involving rapid processing of scientific data characterized by large individual data object sizes. We demonstrate the proposed model and our toolkit through two microscopy imaging case studies, each with their own interestingness functions, policies, and data hierarchies. The first deals with a high-content screening experiment, where images are analyzed in an on-premise container cloud with the goal of prioritizing the images for storage and subsequent computation. The second considers edge processing of images for upload into the public cloud for a real-time control loop for a transmission electron microscope. Key Points: (1) We propose a pipeline model for building intelligent pipelines for streams, accounting for actual information content in data rather than a priori metadata, and present the HASTE Toolkit, a cloud-native software toolkit for supporting rapid development according to the proposed model. (2) We demonstrate how the HASTE Toolkit enables intelligent resource optimization in two image analysis case studies based on a) high-content imaging and b) transmission electron microscopy. (3) We highlight the challenges of storage, processing, and transfer in streamed high-volume, high-velocity scientific data for both cloud and cloud-edge use cases. Competing Interest Statement: The authors have declared no competing interest. Abbreviations: DH: Data Hierarchy, conceptual structures in datasets, realized as, e.g., tiered storage systems. HASTE: Hierarchical Analysis of Spatial (TE)mporal data. HSC: HASTE Storage Client, a core HASTE component for managing data hierarchies. IF: Interestingness Function, applied to a document in HASTE to compute an interestingness score. PLLS: Power Log Log Slope.
ER - 