bioRxiv
New Results

GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets

Miroslav Kratochvíl, Oliver Hunewald, Laurent Heirendt, Vasco Verissimo, Jiří Vondrášek, Venkata P. Satagopam, Reinhard Schneider, Christophe Trefois, Markus Ollert
doi: https://doi.org/10.1101/2020.08.03.234187
Miroslav Kratochvíl
1 Institute of Organic Chemistry and Biochemistry, Prague, Czech Republic
2 Department of Software Engineering, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
  • For correspondence: exa.exa@gmail.com
Oliver Hunewald
3 Department of Infection and Immunity, Luxembourg Institute of Health, Esch-sur-Alzette, Luxembourg
Laurent Heirendt
4 Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Campus Belval, Belvaux, Luxembourg
Vasco Verissimo
4 Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Campus Belval, Belvaux, Luxembourg
Jiří Vondrášek
1 Institute of Organic Chemistry and Biochemistry, Prague, Czech Republic
Venkata P. Satagopam
4 Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Campus Belval, Belvaux, Luxembourg
5 ELIXIR Luxembourg, University of Luxembourg, Campus Belval, Belvaux, Luxembourg
Reinhard Schneider
4 Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Campus Belval, Belvaux, Luxembourg
5 ELIXIR Luxembourg, University of Luxembourg, Campus Belval, Belvaux, Luxembourg
Christophe Trefois
4 Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Campus Belval, Belvaux, Luxembourg
5 ELIXIR Luxembourg, University of Luxembourg, Campus Belval, Belvaux, Luxembourg
Markus Ollert
3 Department of Infection and Immunity, Luxembourg Institute of Health, Esch-sur-Alzette, Luxembourg
6 Department of Dermatology and Allergy Center, Odense Research Center for Anaphylaxis, Odense University Hospital, University of Southern Denmark, Odense, Denmark

Abstract

Background The amount of data generated in large clinical and phenotyping studies that use single-cell cytometry is constantly growing. Recent technological advances make it easy to generate datasets with hundreds of millions of single-cell data points and more than 40 parameters, originating from thousands of individual samples. Analyzing that amount of high-dimensional data places heavy demands on both the hardware and the software of high-performance computing resources. Current software tools often do not scale to datasets of this size; users are thus forced to down-sample the data to manageable sizes, which in turn costs accuracy and the ability to detect many underlying complex phenomena.

Results We present GigaSOM.jl, a fast and scalable implementation of clustering and dimensionality reduction for flow and mass cytometry data. Implementing GigaSOM.jl in Julia, a high-level and high-performance programming language, makes it accessible to the scientific community and allows efficient handling and processing of datasets with billions of data points on distributed computing infrastructures. We describe the design of GigaSOM.jl, measure its performance and horizontal scaling capability, and showcase its functionality on a large dataset from a recent study.

Conclusions GigaSOM.jl facilitates the use of commonly available high-performance computing resources to process the largest available datasets within minutes, while producing results of the same quality as the current state-of-the-art software. Measurements indicate that the performance scales to much larger datasets. The example use on data from a massive mouse phenotyping effort confirms the applicability of GigaSOM.jl to huge-scale studies.

Key points

  • GigaSOM.jl improves the applicability of FlowSOM-style single-cell cytometry data analysis by increasing the acceptable dataset size to billions of single cells.

  • Significant speedup over current methods is achieved by distributed processing and the use of efficient algorithms.

  • The GigaSOM.jl package includes support for fast visualization of multidimensional data.
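
To illustrate why FlowSOM-style clustering lends itself to distributed processing, the following is a minimal sketch of one batch self-organizing map (SOM) epoch written as a map-and-reduce over data chunks. It is not the GigaSOM.jl API and is not taken from the preprint: it is written in Python rather than Julia, all function names (`chunk_stats`, `merge_stats`, `batch_epoch`) are hypothetical, and the Gaussian neighborhood and the batch update rule are assumptions chosen for the illustration. The point it shows is that each chunk only has to contribute per-node sums and counts, so chunks can be processed independently and merged by simple addition.

```python
import math


def neighborhood(grid, radius):
    """Gaussian neighborhood weights between all pairs of SOM grid nodes (assumed kernel)."""
    return [[math.exp(-((ax - bx) ** 2 + (ay - by) ** 2) / (2.0 * radius ** 2))
             for (bx, by) in grid]
            for (ax, ay) in grid]


def nearest(codebook, point):
    """Index of the codebook vector closest to `point` (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((c - p) ** 2 for c, p in zip(codebook[k], point)))


def chunk_stats(codebook, chunk):
    """Map step: per-node coordinate sums and point counts for one data chunk."""
    dim = len(codebook[0])
    sums = [[0.0] * dim for _ in codebook]
    counts = [0] * len(codebook)
    for point in chunk:
        k = nearest(codebook, point)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return sums, counts


def merge_stats(a, b):
    """Reduce step: element-wise addition of two (sums, counts) pairs."""
    sums = [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a[0], b[0])]
    counts = [x + y for x, y in zip(a[1], b[1])]
    return sums, counts


def batch_epoch(codebook, chunks, grid, radius):
    """One batch-SOM epoch; each chunk could be handled by a separate worker."""
    stats = chunk_stats(codebook, [])  # all-zero accumulator
    for chunk in chunks:
        stats = merge_stats(stats, chunk_stats(codebook, chunk))
    sums, counts = stats
    h = neighborhood(grid, radius)
    nodes = range(len(codebook))
    dim = len(codebook[0])
    updated = []
    for j in nodes:
        weight = sum(h[j][k] * counts[k] for k in nodes)
        if weight == 0:  # no data at all: keep the old vector
            updated.append(list(codebook[j]))
            continue
        updated.append([sum(h[j][k] * sums[k][d] for k in nodes) / weight
                        for d in range(dim)])
    return updated
```

Because the merged sums and counts do not depend on how the data are split, calling `batch_epoch` with one large chunk or with many small chunks yields the same updated codebook (up to floating-point rounding); only the per-chunk statistics would need to travel between workers.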

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • https://github.com/LCSB-BioCore/GigaSOM.jl

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted August 04, 2020.

Subject Area

  • Bioinformatics