bioRxiv
New Results

scMontage: Fast and Robust Gene Expression Similarity Search for Massive Single-cell Data

Tomoya Mori, Naila Shinwari, Wataru Fujibuchi
doi: https://doi.org/10.1101/2020.08.30.271395
Tomoya Mori
1Center for iPS Cell Research and Application (CiRA), Kyoto University, 53 Kawahara-cho, Sho-goin, Sakyo-ku, Kyoto 606-8507, Japan
#Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
Naila Shinwari
1Center for iPS Cell Research and Application (CiRA), Kyoto University, 53 Kawahara-cho, Sho-goin, Sakyo-ku, Kyoto 606-8507, Japan
Wataru Fujibuchi
1Center for iPS Cell Research and Application (CiRA), Kyoto University, 53 Kawahara-cho, Sho-goin, Sakyo-ku, Kyoto 606-8507, Japan
  • For correspondence: fujibuchi-g@cira.kyoto-u.ac.jp

Abstract

Single-cell RNA-seq (scRNA-seq) analysis is widely used to characterize cell types and to detect heterogeneity in cell states at much higher resolution than ever before. Here we introduce scMontage (https://scmontage.stemcellinformatics.org), a gene expression similarity search server dedicated to scRNA-seq data, which can compare a query with thousands of samples within a few seconds. The scMontage search is based on Spearman’s rank correlation coefficient, and its robustness is ensured by introducing Fisher’s Z-transformation and a Z-test. Furthermore, search results are linked to the human cell database SHOGoiN (http://shogoin.stemcellinformatics.org), which enables users to quickly access additional cell-type-specific information. scMontage is available not only as a web server but also as a stand-alone application for users’ own data; it thus enhances the reliability and throughput of cell analysis and helps users gain new insights into their research.

Introduction

Technology for single-cell analysis has evolved to reveal cell profiles at much higher resolutions than ever before. As an example, several studies have demonstrated that the computational analysis of single-cell RNA-seq (scRNA-seq) data can discover novel cells or cell subtypes. The recently launched Human Cell Atlas (HCA) project [1] is expected to further accelerate the production of single-cell data on an extraordinary scale. These unprecedented massive-scale data will be available to the public through International Nucleotide Sequence Database Collaboration (INSDC) sites such as the Gene Expression Omnibus (GEO) [2] and the Sequence Read Archive (SRA) [3]. Thus, data mining by very fast gene expression profile similarity searches has become increasingly important in terms of screening, clustering, and finding cells.

The concept of similarity searches for gene expression profiles was proposed nearly 20 years ago [4]. CellMontage [6] was the first practical, large-scale implementation; it provides users with quick searches against a large-scale microarray database for similar gene expression profiles based on Spearman’s rank correlation.

Here, we propose scMontage, a renovated gene expression similarity search server developed for analyzing massive-scale scRNA-seq data. It is based on the SHOGoiN human cell type database (http://shogoin.stemcellinformatics.org) and uses a statistically robust Fisher’s Z-transformed correlation coefficient. Currently, the scMontage server provides human and mouse scRNA-seq data and allows users to quickly access cell-type-specific biological information, such as cell taxonomy, lineage maps, and cell markers. scMontage enhances the throughput and reliability of single-cell analysis and helps users gain new insights into massive scRNA-seq data.

Results

A profile search in scMontage is performed by selecting a database and inputting a query profile. After selecting the database by specifying the organism and the platform, the user can limit the genes used for the calculation to particular types according to Gene Ontology [7]. As a query, the user can either upload a gene expression profile or directly paste gene expression data in CM format.

Figure 1 shows an example screenshot in which a human pancreatic alpha cell (GEO id: GSM1901473) is queried against the database, with ‘H. sapiens’, ‘HiSeq2000/2500’, and ‘MF:transcription factor activity, protein binding’ selected. As expected, the first hit is the query itself, and the top hits come from pancreatic alpha cells (Table 1, Table S1). The description column contains SHOGoiN Cell IDs (in parentheses), from which a user can access integrated cell type information in the SHOGoiN database. Similarly, when a human pancreatic islet cell (GEO id: GSM1901455) is queried under the same database setting as the previous search, the pancreatic islet cell sample is found among the top hits with high statistical significance, even though the database contains fewer pancreatic islet cell samples than other pancreatic cell samples (Table S2). The reliability of the scMontage search is not limited to human cell samples. Table S3 shows the search result when a mouse Reg4-positive intestinal cell is queried against the database of “M. musculus, SINGLECELL: all” with “MF:transcription factor activity, protein binding” genes. The top hit is the Reg4-positive intestinal cell, and most of the top hits are small intestinal cells. Therefore, the scMontage search is robust across both cell types and species.

Figure 1 Example of CM format and input screen of scMontage

This example screenshot shows the case in which a human pancreatic alpha cell is queried against the database. ‘H. sapiens’, ‘HiSeq2000/2500’, and ‘MF:transcription factor activity, protein binding’ are selected as database settings. The search results are shown in Table 1.

Table 1 Example search result

In addition, Figure 2 compares the statistical evaluation in CellMontage and scMontage for a mouse lung cell sample (GEO id: GSM1271921) under the database setting of “SINGLECELL: all” and “MF:transcription factor activity, protein binding”. The histograms indicate the distributions of Spearman’s rank correlation coefficient $r$, the t-statistic $t_r = r\sqrt{(n-2)/(1-r^2)}$ (where $n$ is the number of genes used for the calculation), the Z-transformed sample correlation coefficient $z_r$, and the standardized Z-transformed sample correlation coefficient $z$, respectively. In CellMontage, the distribution of $t_r$ does not follow the t-distribution when the population correlation coefficient between query and database profiles is non-zero. In scMontage, however, the distribution of $z_r$ approximately follows a normal distribution whose mean is $z_\rho = 0.42$. Consequently, the standardized Z-transformed sample correlation coefficient $z$ follows the standard normal distribution.
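The contrast between the two statistics can be illustrated numerically. The following is a minimal sketch, not the scMontage implementation: it assumes the standard correlation t-statistic $t_r = r\sqrt{(n-2)/(1-r^2)}$, a Fisher Z standard deviation of $1/\sqrt{n-3}$, and a hypothetical gene count `n_genes = 1000` (the text does not state the actual number). Only the background mean of 0.42 is taken from the text.

```python
import math

def t_statistic(r, n):
    """Standard t-statistic for testing a correlation against rho = 0."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

def standardized_z(r, z_rho, n):
    """Fisher Z-transform of r, centered on the background mean z_rho."""
    return (math.atanh(r) - z_rho) * math.sqrt(n - 3)

n_genes = 1000   # hypothetical number of genes; not given in the text
z_rho = 0.42     # background mean of z_r reported in the text

# A "typical" database profile sits exactly at the background mean.
typical_r = math.tanh(z_rho)
t_typical = t_statistic(typical_r, n_genes)
z_typical = standardized_z(typical_r, z_rho, n_genes)

# A genuinely similar profile (hypothetical r = 0.8) stands out in both,
# but only z measures it relative to the shared background.
t_similar = t_statistic(0.8, n_genes)
z_similar = standardized_z(0.8, z_rho, n_genes)

print(f"typical hit: t_r = {t_typical:.2f}, z = {z_typical:.2f}")
print(f"similar hit: t_r = {t_similar:.2f}, z = {z_similar:.2f}")
```

Under these assumptions, a profile that is merely average for the database already receives a large $t_r$, so a t-test against zero correlation would call almost every entry significant, whereas its standardized $z$ is zero.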

Figure 2 Statistical evaluations of search results from CellMontage and scMontage approaches

The histograms indicate the distributions of the sample correlation coefficient $r$, the t-statistic $t_r$, the Z-transformed sample correlation coefficient $z_r$, and the standardized Z-transformed sample correlation coefficient $z$ when a mouse lung cell sample (GSM1271921) is queried against the database under the database setting of “SINGLECELL: all” and “MF:transcription factor activity, protein binding”.

Discussion

We developed scMontage, which can be used for the validation and functional prediction of unknown cell types obtained from tissues or derived from stem cells at the single-cell level. scMontage also provides quick access from the search results to additional information on various cell types in the SHOGoiN database. A vast number of single-cell gene expression profiles are expected to be produced by the HCA project and other research groups in the future. scMontage will therefore become an important tool, providing a very fast and powerful environment that can accelerate massive single-cell data analysis by extracting gene expression similarity relationships between known and unknown cell types as well as within known or unknown cell types.

Materials and methods

scMontage uses Spearman’s rank correlation coefficient as the similarity metric between gene expression profiles and computes it with a very fast algorithm, RaPiDS [8], which enables a search in time linear in the size of the database with a small constant. As a result, scMontage can compare a query with tens of thousands of samples in the database within a minute. Spearman’s rank correlation coefficient $r$ between two rank-transformed profiles is defined as

$$r = 1 - \frac{6\sum_{i=1}^{n} D_i^2}{n(n^2-1)},$$

where $D_i$ is the rank difference for gene $i$ and $n$ is the number of genes used for the calculation.

As output, scMontage ranks the profiles most similar to the query by their statistical significance, computed from Fisher’s Z-transformation of the rank correlation coefficient, which is a substantial improvement over the CellMontage approach. The distribution of Fisher’s Z-transformed sample correlation coefficient $z_r$ approximately follows a normal distribution with mean $z_\rho$ and standard deviation $1/\sqrt{n-3}$ regardless of the size of $n$, where $z_\rho$ is approximated by the mean of $z_r$ when it appears in the standardization, as in the following equations:

$$z_r = \frac{1}{2}\ln\frac{1+r}{1-r},$$

$$z = \frac{z_r - z_\rho}{1/\sqrt{n-3}} = \sqrt{n-3}\,(z_r - z_\rho).$$
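As an illustration of the similarity metric, the following minimal sketch computes Spearman’s rank correlation from rank differences and applies Fisher’s Z-transformation to one pair of profiles. It assumes the textbook forms $r = 1 - 6\sum D_i^2 / (n(n^2-1))$ and $z_r = \frac{1}{2}\ln\frac{1+r}{1-r}$, assumes no tied expression values, and uses made-up expression values. It is a plain quadratic-free reference computation and does not reproduce the RaPiDS algorithm.

```python
import math

def ranks(values):
    """Return 1-based ranks. Assumes no tied values (ties are not handled)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_r(query, target):
    """Spearman's r from squared rank differences D_i over n genes."""
    n = len(query)
    if n != len(target) or n < 4:
        raise ValueError("profiles must have equal length of at least 4")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(query), ranks(target)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

def fisher_z(r):
    """Fisher's Z-transformation; undefined for |r| = 1 (e.g. a self-hit)."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

# Hypothetical expression values for six genes in two cells.
query = [5.1, 0.0, 2.3, 8.7, 1.2, 3.3]
target = [4.0, 0.5, 3.9, 9.1, 0.1, 2.2]

r = spearman_r(query, target)
z_r = fisher_z(r)
print(f"r = {r:.4f}, z_r = {z_r:.4f}")
```

Real expression profiles contain many tied values (for example, zero counts), so a practical implementation would need average ranks and the general Pearson-on-ranks form rather than this shortcut formula.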

Thus, scMontage can correct the significance for cases in which the population correlation coefficient between query and database profiles is non-zero, which often occurs because of common cell properties, such as cell cycle states, that are observed at the single-cell level regardless of cell type.
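The correction can be sketched at the level of a whole search result. The example below is an assumption-laden illustration rather than the server’s code: it takes a hypothetical dictionary of correlation coefficients for database hits, estimates the background $z_\rho$ as the mean of the transformed values, standardizes with an assumed standard deviation of $1/\sqrt{n-3}$, and reports a one-sided upper-tail normal p-value. The sample names, the gene count, and the choice of a one-sided test are all assumptions.

```python
import math
from statistics import mean

def z_test_hits(correlations, n_genes):
    """Rank hits by standardized Fisher Z score, highest similarity first.

    correlations: mapping of sample name -> Spearman r (must satisfy |r| < 1,
    so an exact self-hit with r = 1 has to be excluded beforehand).
    """
    z_r = {name: math.atanh(r) for name, r in correlations.items()}
    z_rho = mean(z_r.values())        # background estimated as the mean of z_r
    scale = math.sqrt(n_genes - 3)    # reciprocal of the standard deviation
    hits = []
    for name, value in z_r.items():
        z = (value - z_rho) * scale
        p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal
        hits.append((name, z, p_value))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical search result: every sample correlates positively with the
# query, reflecting properties shared by all cells.
correlations = {
    "alpha_cell_A": 0.82,
    "alpha_cell_B": 0.78,
    "beta_cell": 0.45,
    "fibroblast": 0.38,
    "neuron": 0.36,
}

ranked = z_test_hits(correlations, n_genes=500)
for name, z, p_value in ranked:
    print(f"{name:14s} z = {z:8.2f}  p = {p_value:.3g}")
```

Because the scores are centered on the background mean, samples that share only generic cell properties with the query end up near or below zero, and only samples well above the background receive small p-values.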

The scMontage server currently provides 5,035 single-cell transcriptome data sets (1,424 human and 3,611 mouse cell samples as of 23 August 2018) whose cell types were provided by the original submitters. Raw sequence data are acquired from SRA, and read counts are computed by mapping the reads to human or mouse reference genome sequences downloaded from Ensembl [9] using Bowtie2 [10] and counting the mapped reads with HTSeq [11].
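The mapping and counting steps could be scripted roughly as follows. This is a sketch under stated assumptions, not the authors’ pipeline: the text does not give the parameters used, so the example assumes single-end reads, a prebuilt Bowtie2 index, a GTF annotation, and default options for both tools. All file names and the helper functions `build_commands` and `run_pipeline` are hypothetical.

```python
import subprocess
from pathlib import Path

def build_commands(fastq, index_prefix, annotation_gtf, sam_path):
    """Build the Bowtie2 alignment and htseq-count command lines."""
    # bowtie2: -x index prefix, -U unpaired reads, -S SAM output file
    align = ["bowtie2", "-x", index_prefix, "-U", fastq, "-S", sam_path]
    # htseq-count: -f input format, then alignment file and annotation
    count = ["htseq-count", "-f", "sam", sam_path, annotation_gtf]
    return align, count

def run_pipeline(fastq, index_prefix, annotation_gtf, out_dir):
    """Align reads and write per-gene read counts to counts.tsv."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    sam_path = str(out / "aligned.sam")
    align, count = build_commands(fastq, index_prefix, annotation_gtf, sam_path)
    subprocess.run(align, check=True)
    counts_path = out / "counts.tsv"
    with open(counts_path, "w") as handle:
        # htseq-count prints the count table to standard output
        subprocess.run(count, check=True, stdout=handle)
    return counts_path

# Only the command lines are built here; run_pipeline needs both tools
# installed and real input files.
align_cmd, count_cmd = build_commands(
    "sample.fastq", "reference_index", "genes.gtf", "aligned.sam"
)
print(" ".join(align_cmd))
print(" ".join(count_cmd))
```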

Furthermore, scMontage results are linked to the SHOGoiN database, a repository for accumulating, integrating, and providing cell information for human and other model organisms. This gives users fast access to additional cell-type-specific information, such as cell taxonomy, lineage maps, cell markers, DNA methylation, and morphological images.

Authors’ contributions

WF conceptualized and designed the study. TM, NS, and WF developed the server and drafted the paper. All authors have read and approved the final manuscript.

Competing Interests

The authors declare that they have no competing interests.

Supplementary material

Supplementary Table S1 Search result when human pancreatic alpha cell (GEO id: GSM1901473) is queried

Supplementary Table S2 Search result when human pancreatic islet cell (GEO id: GSM1901455) is queried

Supplementary Table S3 Search result when mouse Reg4-positive intestinal cell (GEO id: GSM1524296) is queried

Acknowledgements

This work was partially supported by the Core Center for iPS Cell Research, Research Center Network for Realization of Regenerative Medicine, Japan Agency for Medical Research and Development, Grant-in-Aid for Scientific Research on Innovative Areas, The Ministry of Education, Culture, Sports, Science and Technology, and the iPS Cell Research Fund. The authors deeply appreciate Dr. Peter Karagiannis for kindly reviewing the manuscript.

References

[1] Rozenblatt-Rosen O, Stubbington MJT, Regev A, Teichmann SA. The Human Cell Atlas: from vision to reality. Nature 2017;550:451–3.
[2] Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, et al. NCBI GEO: mining tens of millions of expression profiles - database and tools update. Nucleic Acids Res 2007;35:D760–5.
[3] Leinonen R, Sugawara H, Shumway M on behalf of the International Nucleotide Sequence Database Collaboration. The Sequence Read Archive. Nucleic Acids Res 2011;39:D19–21.
[4] Bassett DE Jr., Eisen MB, Boguski MS. Gene expression informatics - it’s all in your time. Nat Genet 1999;21:51–5.
[5] Hunter L, Taylor RC, Leach SM, Simon R. GEST: a gene expression search tool based on a novel Bayesian similarity metric. Bioinformatics 2001;17:S115–22.
[6] Fujibuchi W, Kiseleva L, Taniguchi T, Harada H, Horton P. CellMontage: similar expression profile search server. Bioinformatics 2007;23:3103–4.
[7] The Gene Ontology Consortium. Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res 2017;45:D331–8.
[8] Horton PB, Kiseleva L, Fujibuchi W. RaPiDS: an algorithm for rapid expression profile database search. Genome Inform 2006;17:67–76.
[9] Zerbino DR, Achuthan P, Akanni W, Amode MR, Barrell D, Bhai J, et al. Ensembl 2018. Nucleic Acids Res 2017;46:D754–61.
[10] Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9:357–9.
[11] Anders S, Pyl PT, Huber W. HTSeq - a Python framework to work with high-throughput sequencing data. Bioinformatics 2015;31:166–9.
Posted August 31, 2020.