Using coexpression to explore cell-type diversity with the fcoex package

Here, we present the fcoex package, which infers coexpression from scRNA-seq data and yields multiple, overlapping classes of cells based on coexpression modules. The tool extends the current scRNA-seq toolbox, providing a multi-hierarchy view of cell functionality and enabling the development of more complete cell atlases. Single-cell RNA sequencing (scRNA-seq) captures details of the cellular landscape, providing a fine-grained view of biological processes. Current pipelines, however, are restricted to single-label perspectives and miss details of the classification landscape. In the pbmc3k blood cell dataset, fcoex detects known classes, such as antigen-presenting cells, and a new theoretical group of cells marked by the expression of FCGR3A (CD16).
Availability and Implementation: Fcoex is written in R and openly available in Bioconductor (https://bioconductor.org/packages/fcoex/).
Supplementary information: Supplementary data are available at the end of the manuscript. Source code for the analysis is available at https://github.com/csbl-inovausp/fcoex_analysis.


Introduction
Since the 17th century, science has increasingly appreciated how cells act as building blocks of life (Mazzarello, 1999). In the past decades, the advent of molecular profiling has offered new opportunities to tap into the diversity of cell behaviours. Among the molecular methods, single-cell RNA sequencing (scRNA-seq) in particular has been rapidly growing as a method for deep characterisation of cell types and states, and its use is central to the Human Cell Atlas Project (Regev et al., 2017).
Current scRNA-seq data analyses often rely on unsupervised clustering of cells. For that, bioinformaticians tailor parameter sets to a target resolution, i.e., the level of detail used to detect cell identities (Luecken and Theis, 2019; Regev et al., 2017). When the clustering is finished, the groups of cells are annotated with class labels, representing the underlying biology in a language we can understand (Clarke et al., 2021). Single clusters, however, are not limited to a single biological function. Even single cells simultaneously display genetic programs that dictate both identity and activity.
Here, we focus on the challenge of identifying the multiple functional classes of cells present in scRNA-seq data. The fcoex framework uses gene co-expression to infer parallel cell clusters, emphasising the multifunctional nature of cells. By applying it to a well-known dataset of blood cells, pbmc3k, we demonstrate that fcoex recovers biologically relevant superclasses and seamlessly adds value to standard scRNA-seq data analyses.

The fcoex method
The fcoex method is designed for application right after a standard scRNA-seq clustering step (Fig. 1A). The cluster assignments convey information about the relations between cells to the algorithm and help to guide feature selection. Then, the package selects global marker genes specific to one, two, or more previously defined clusters. It ranks markers by symmetrical uncertainty, a non-linear correlation metric based on classical Shannon entropy. To find co-expression modules, fcoex inverts the FCBF (Fast Correlation-Based Filter) feature selection algorithm: instead of removing redundancy, it selects redundant (co-expressed) gene expression patterns (see Supplementary Data for details). The default gene coexpression modules yielded by the pipeline are small by design (tens of genes per module) to facilitate manual exploration of the coexpression landscape. Each module has one "header" gene, whose expression pattern best represents all the genes in the module.
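The ranking and module-building steps can be illustrated with a minimal Python sketch. It is not the package's implementation (fcoex is written in R): the binarised toy expression values, the gene names, the `min_su` threshold, and the rule that assigns a gene to a header when its symmetrical uncertainty with the header is at least its symmetrical uncertainty with the cluster labels are all assumptions made for illustration. Only the formula SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)) is the standard definition.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), ranging from 0 to 1."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


def find_modules(expr, labels, min_su=0.1):
    """Group co-expressed genes around header genes (illustrative rule).

    expr:   dict mapping gene name -> discretised expression per cell
    labels: cluster label per cell, from a previous clustering step
    min_su: hypothetical cutoff on SU(gene, labels) for marker selection
    """
    # Rank candidate markers by their SU with the cluster labels.
    su_class = {g: symmetrical_uncertainty(v, labels) for g, v in expr.items()}
    ranked = sorted((g for g in expr if su_class[g] >= min_su),
                    key=lambda g: -su_class[g])
    modules, assigned = {}, set()
    for header in ranked:
        if header in assigned:
            continue
        # FCBF would discard these redundant genes; here they are kept
        # as members of the header's module instead.
        members = [g for g in ranked
                   if g != header and g not in assigned
                   and symmetrical_uncertainty(expr[header], expr[g]) >= su_class[g]]
        modules[header] = [header] + members
        assigned.update(modules[header])
    return modules


# Toy data: 8 cells, 3 clusters, binarised (on/off) expression.
labels_toy = ["c1", "c1", "c1", "c2", "c2", "c3", "c3", "c3"]
expr_toy = {
    "geneA": [1, 1, 1, 0, 0, 0, 0, 0],
    "geneB": [1, 1, 1, 0, 0, 0, 0, 0],
    "geneC": [0, 0, 0, 0, 0, 1, 1, 1],
    "geneD": [0, 0, 0, 0, 0, 1, 1, 1],
    "noise": [1, 0, 1, 0, 1, 0, 1, 0],
}
modules = find_modules(expr_toy, labels_toy)
print(modules)
```

In this toy input, geneA and geneB form one module, geneC and geneD form another, and the uninformative "noise" gene is filtered out at the ranking step.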
Fcoex treats each module as a gene set to find cell populations, using only the expression of the module's genes to re-cluster cells. The new classifications are thus based on the genes (and functions) captured by each co-expression module. The multiple module-based clusters serve as a platform for exploring the diversity of the dataset and identifying upper cell classes, grouping cells by common functions.
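The re-clustering idea can be sketched as follows, in Python and with toy numbers. This is an illustration under stated assumptions rather than the package's procedure: the two-group split, the small deterministic 2-means routine, the "high"/"low" labels, and the module names M1 and M2 are hypothetical choices made to keep the example self-contained.

```python
def recluster_by_module(expr, module_genes, n_iter=10):
    """Split cells into two groups using only the genes of one module."""
    n_cells = len(next(iter(expr.values())))
    # One vector per cell, restricted to the module genes.
    cells = [[expr[g][i] for g in module_genes] for i in range(n_cells)]
    # Deterministic start: the cells with the lowest and highest totals.
    totals = [sum(c) for c in cells]
    centroids = [cells[totals.index(min(totals))],
                 cells[totals.index(max(totals))]]
    labels = [0] * n_cells
    for _ in range(n_iter):
        # Assign each cell to its nearest centroid (squared Euclidean).
        labels = [min((0, 1),
                      key=lambda k: sum((a - b) ** 2
                                        for a, b in zip(c, centroids[k])))
                  for c in cells]
        # Move each centroid to the mean of its assigned cells.
        for k in (0, 1):
            members = [c for c, lab in zip(cells, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members)
                                for col in zip(*members)]
    return ["high" if lab == 1 else "low" for lab in labels]


# Toy normalised expression for 6 cells (made-up values).
expr_cells = {
    "geneA": [2.1, 1.8, 0.0, 0.1, 2.4, 0.0],
    "geneB": [1.9, 2.2, 0.2, 0.0, 2.0, 0.1],
    "geneC": [0.0, 1.5, 1.7, 0.0, 0.1, 1.6],
    "geneD": [0.1, 1.4, 1.9, 0.2, 0.0, 1.8],
}
toy_modules = {"M1": ["geneA", "geneB"], "M2": ["geneC", "geneD"]}

# One independent classification of the same cells per module.
per_module = {name: recluster_by_module(expr_cells, genes)
              for name, genes in toy_modules.items()}
for name, labs in per_module.items():
    print(name, labs)
```

Each module yields its own labelling of the same cells, so a single cell can belong to the "high" group of several modules at once (the second toy cell does here). That overlap is what the multi-label view relies on.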

Multi-hierarchies of blood cell types
To validate the fcoex pipeline, we selected the well-known pbmc3k dataset from the SeuratData R package, which contains around 2700 peripheral blood mononuclear cells (PBMC) with previously defined cluster labels.
The standard fcoex pipeline detected nine modules that capture different parts of the cellular diversity in the dataset. For example, module M8 contains cytotoxicity genes, such as PRF1 and GZMA, and splits the dataset into cytotoxic (NK and CD8 T cells) and non-cytotoxic cells. M2 (CD3D) splits the dataset into T cells and non-T cells. M5 (HLA-DRB1) groups together monocytes, B cells, and dendritic cells, all known antigen-presenting cells (APC) (Fig. 1B-E). The classifications provided by fcoex are easily reintegrated into Seurat to power visualisations and to obtain differentially expressed markers, providing more genes for the analysis, if desired.
In general, fcoex clusters combined biologically similar cell types of the original dataset. The clusterings help to explore and classify cells by function (Fig. 1E). Even in a well-studied dataset, fcoex shed new light on the shared functionality of some NK cells and macrophages: both markedly express the CD16-coding gene FCGR3A, whose product is a key player in antibody-dependent cellular cytotoxicity (ADCC) (Yeap et al., 2016). Thus, a complete functional classification of blood cells might want to include a "professional ADCC cells" class.
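One way to see which original cell types a module-based class brings together is to cross-tabulate the two labelings. The Python sketch below uses made-up toy labels, not pbmc3k results; the "high"/"low" labels and the `min_fraction` cutoff are hypothetical choices for illustration.

```python
from collections import Counter, defaultdict


def crosstab(original_labels, module_labels):
    """Count cells per (original cluster, module-based class) pair."""
    table = defaultdict(Counter)
    for orig, mod in zip(original_labels, module_labels):
        table[orig][mod] += 1
    return {orig: dict(counts) for orig, counts in table.items()}


def upper_class_members(table, positive="high", min_fraction=0.5):
    """Original clusters whose cells fall mostly in the positive class."""
    return sorted(orig for orig, counts in table.items()
                  if counts.get(positive, 0) / sum(counts.values()) >= min_fraction)


# Toy labels for 9 cells (illustrative only).
original = ["NK", "NK", "CD8 T", "CD8 T", "CD8 T", "B", "B", "CD4 T", "CD4 T"]
module_class = ["high", "high", "high", "high", "low", "low", "low", "low", "low"]

table = crosstab(original, module_class)
print(table)
print(upper_class_members(table))
```

Original clusters that land together in the positive class of a module are candidates for a shared upper class, to be checked against the genes of that module.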

Conclusion
Here we presented fcoex, an R/Bioconductor package for co-expression-based reclustering of single-cell RNA-seq data. We note that other methods are increasingly available for co-expression analysis of single cells. The monocle R package (Cao et al., 2019), widely used for pseudotime analysis, has implemented algorithms for detecting co-expression modules, and WGCNA, widely used in bulk transcriptomics, has also been applied to scRNA-seq (Cardozo et al., 2019). In principle, the modules produced by any of those algorithms could be used as input for our framework. Of note, fcoex modules are generally smaller and simpler to explore, making fcoex a sensible first-pass approach to explore the multi-layered diversity in single-cell transcriptomics datasets.
Regardless of the algorithm used to identify coexpression, the main goal of the fcoex pipeline is to use the modules to find biologically relevant populations. This framework allows a multi-hierarchy view of cell types, in which each cell type is related to multiple, overlapping cell classes, as seen in the Cell Ontology.