Abstract
Background Selecting highly variable features is a crucial step in most analysis pipelines of single-cell RNA-sequencing (scRNA-seq) data. Although numerous methods have been proposed in recent years, a systematic understanding of which performs best is still lacking.
Results Here, we systematically evaluate 47 highly variable gene (HVG) selection methods: 21 baseline methods built on different data transformations and mean-variance adjustment techniques, and 26 hybrid methods built as mixtures of baseline methods. Across 19 diverse benchmark datasets, 18 objective evaluation criteria per method, and 5,358 analysis settings, we observe that no single baseline method consistently outperforms the others across all datasets and criteria. However, hybrid methods as a group robustly outperform individual baseline methods. Based on these findings, we propose a new HVG selection approach, mixture HVG selection (mixHVG), which incorporates top-ranked features from multiple baseline methods, as a better solution to HVG selection. An open-source R package, mixhvg, enables convenient use of mixHVG and its integration into users’ data analysis pipelines.
Conclusion Our benchmark study not only provides a systematic comparison of existing methods, leading to a better HVG selection solution, but also creates a pipeline and resource consisting of diverse benchmark data and criteria for evaluating new methods in the future.
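The mixture idea described in Results, taking top-ranked features from several baseline methods, can be illustrated with a minimal sketch. This is not the mixhvg package's implementation: the function names `rank_descending` and `mix_top_ranked`, the toy scores, and the rule of ordering genes by their best rank across methods (with mean rank as a tie-breaker) are all assumptions made here for illustration.

```python
import numpy as np


def rank_descending(scores):
    """Return 1-based ranks, where rank 1 goes to the highest score."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks


def mix_top_ranked(score_table, n_select):
    """Select gene indices by mixing rankings from several methods.

    score_table: dict mapping a (hypothetical) baseline method name to a
        sequence of per-gene variability scores, all of the same length.
    n_select: number of genes to keep.

    Assumed mixing rule: order genes by their best (minimum) rank across
    methods, breaking ties by mean rank.
    """
    rank_matrix = np.vstack([rank_descending(s) for s in score_table.values()])
    best_rank = rank_matrix.min(axis=0)
    mean_rank = rank_matrix.mean(axis=0)
    # np.lexsort treats the LAST key as the primary sort key.
    order = np.lexsort((mean_rank, best_rank))
    return order[:n_select]


# Toy example: two hypothetical baseline methods that disagree on five genes.
toy_scores = {
    "method_1": [9.0, 7.0, 5.0, 3.0, 1.0],
    "method_2": [1.0, 2.0, 3.0, 8.0, 9.0],
}
selected = mix_top_ranked(toy_scores, n_select=2)
```

In this toy case each method favors a different gene, and the mixed selection keeps the top pick of both rather than a gene that is only moderately ranked by each.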
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Abbreviations
- ADT: Antibody-Derived Tags
- ARI: Adjusted Rand Index
- ASW: Average Silhouette Width
- CITE-seq: Cellular Indexing of Transcriptomes and Epitopes by Sequencing
- HVG: Highly Variable Gene
- LISI: Local Inverse Simpson Index
- LOESS: Locally weighted (or estimated) scatterplot smoother
- LSI: Latent Semantic Indexing
- NMI: Normalized Mutual Information
- PBMC: Peripheral Blood Mononuclear Cell
- PC: Principal Component
- PCA: Principal Component Analysis
- scATAC-seq: Single-cell Assay for Transposase-Accessible Chromatin using sequencing
- scRNA-seq: Single-cell RNA sequencing
- SCT: sctransform
- SVD: Singular Value Decomposition
- TFIDF: Term Frequency–Inverse Document Frequency