Multi-batch cytometry data integration for optimal immunophenotyping

We describe the integration of multi-batch cytometry datasets (iMUBAC), a flexible, robust, and scalable computational framework for unsupervised cell-type identification across multiple batches of high-dimensional cytometry datasets. After overlaying cells from healthy controls across multiple batches, iMUBAC learns batch-specific cell-type classification boundaries and identifies aberrant immunophenotypes in patient samples. We illustrate unbiased and streamlined immunophenotyping, using both in-house and public mass and flow cytometry datasets.

correlated well between the two batches tested (Fig. S4). Moreover, the estimated and manually determined frequencies of both abundant (e.g., α β T cells) and rare (e.g., innate lymphoid cells) cell subsets correlated reasonably well across a wide range of percentages (from ~0.01% to ~70%), not only among local controls but also among travel/family controls and patients (Fig. S5). Collectively, iMUBAC rationally integrates multibatch cytometry datasets, to identify cell types consistent with state-of-the-art gating in both controls and patients.
We then applied iMUBAC to in-house multi-batch spectral flow cytometry datasets for PBMCs (14 individuals, 2 batches, 14 cell-surface and 4 intracellular markers). The datasets included data from patients with two monogenic forms of autoimmunity: FAS deficiency 15,16 and STAT3 gain-of-function [17][18][19] , both of which are known to be accompanied by lymphoproliferation and high counts of circulating CD4 -CD8doublenegative α β T (DN T) cells. After manually gating out dead cells and doublets, we defined 50 metaclusters with FlowSOM/ConsensusClusterPlus, which we merged and identified manually (Fig. 2a, 2b, and S6). As expected, we observed an expansion of a cluster representing DN T cells in patients with FAS deficiency and, to a lesser extent, in those with STAT3 gain-of-function (Fig. 2c). Conventional flow cytometry with a different antibody panel validated this expansion of DN T cells (Fig. S7). iMUBAC also revealed a decrease in the levels of Vδ2 γ δ T, CD8 + mucosal-associated invariant T (MAIT), and CD16 + natural killer (NK) cells in both groups, possibly reflecting a previously unappreciated level of immunopathological homogeneity between these two inborn errors of immunity (Fig. 2c). iMUBAC can be applied to spectral flow cytometry datasets, as well as CyTOF datasets, and readily identifies both known and unappreciated immunophenotypes in patients with rare immunological disorders.
Finally, we applied iMUBAC to previously published CyTOF datasets for PBMCs from 10 healthy controls and 20 patients with stage IV melanoma before and after PD-1 blockade immunotherapy (11 responders and nine non-responders, respectively) that had been stained with three antibody panels 20 . We analyzed two batches simultaneously (i.e., discovery and validation cohorts). Healthy controls were used for batch correction, clustering with FlowSOM/ConsensusClusterPlus, and classifier training ( Fig. 2d and S8-10).
We tested the differential abundance (DA) between responders and non-responders using the quasi-likelihood F-test (QLF) in edgeR 21 , as previously applied to CyTOF analysis 8,22 . Eight DA subsets were identified, with statistical significance after adjustment for treatment status and batch effects. In particular, the "CD14 Mono 2" cluster, corresponding to CD14 + CD16 -HLA-DR hi CD86 hi PD-L1 hi monocytes, was expanded in responders consistent with the findings of the original report 20 (Fig. 2e). Moreover, four DA subsets, all T-cell subsets, were reproducibly identified even if only pretreatment datasets were used (Fig. 2f). These subsets could potentially be used as biomarkers of a better prognosis, before the initiation of PD-1 blockade immunotherapy.
iMUBAC can be used to streamline an exploratory immunophenotyping analysis in clinical pilot studies of common disease conditions.
High-dimensional cytometry has considerably improved human immunological studies, despite limited sample availability. Moreover, platforms such as FlowRepository or Cytobank facilitate the sharing of cytometry data, making it possible to foster discoveries through meta-analysis. We anticipate that this flexible, robust, and scalable workflow, available on GitHub (https://github.com/casanova-lab/iMUBAC), will expedite rational comparative analyses of multi-batch cytometry datasets and facilitate novel discoveries through unbiased and streamlined immunophenotyping.

Cells
Peripheral blood mononuclear cells (PBMCs) were isolated from heparinized venous blood samples by Ficoll-Hypaque density gradient centrifugation (GE Healthcare). Cells were cryopreserved in bovine fetal calf serum supplemented with 10% dimethyl sulfoxide (DMSO) and stored at -150°C until use. Patients with various severe infectious diseases were included in this study. When blood samples from patients were collected at distant sites, the blood samples were transported to either the Paris or the New York branch of our laboratory overnight and processed. We accounted for the effect of blood sample transportation, by collecting samples from healthy volunteers or healthy family members and transporting and processing them simultaneously with the transported samples (travel/family controls). Cells from healthy local volunteers collected and processed locally were also used (local controls). We included multiple local controls in each cytometry experiment.

Spectral flow cytometry
Two experiments were performed on separate dates. We studied frozen PBMCs from 11 locally recruited healthy adult controls and three patients (two patients with homozygous loss-of-function mutations of

Public CyTOF datasets
CyTOF datasets for PBMCs from patients on PD-1 blockade 20 were downloaded from FlowRepository (https://flowrepository.org/experiments/1124). The study from which these data were taken analyzed PBMCs from patients with melanoma before and about 12 weeks after treatment with either nivolumab or pembrolizumab. Patients were classified as responders or non-responders based on treatment outcomes for the first 15 weeks of treatment. The datasets also contained data for PBMCs from healthy donors at two corresponding time points. The datasets consisted of two batches (i.e., experiments processed at two separate dates) used as discovery and validation cohorts in the original study. The first batch contained data for five healthy donors, five responders, and five non-responders, whereas the second batch contained data for five healthy donors, six responders, and four non-responders. In this study, these two batches were analyzed simultaneously in an exploratory analysis.
Full scripts are available upon request.

Integration of multi-batch cytometry datasets (iMUBAC)
The iMUBAC workflow consists of four steps: i) preprocessing, ii) batch correction, iii) unsupervised clustering and cell-type annotation, and iv) batch-specific cell-type prediction.
Preprocessing. Batch-specific preprocessing was performed as follows. First, CyTOF data files in the FCS format were imported into R with the ncdfFlow package. The ncdfFlow package can be used for the memory-efficient HDF5-based storage of cytometry data. The truncate_max_range option was disabled. For CyTOF data, the transformation option was also disabled, as the transformation implemented in the underlying read.FCS package is optimized for flow cytometry. Second, channel names were organized. This step resolves batch-to-batch differences in the panel design such that identical markers measured using different channels (i.e., fluorochrome-or metal-conjugated antibodies) are aligned. Third, for CyTOF data, doublets and dead cells were excluded in a data-adaptive manner. In this step, all cells from all samples in a single batch of an experiment were pooled, such that identical gates were applied to all samples in a given batch. For DNA-based gating, the dnaGate function in the cydar package was used, and outliers on both the higher and lower sides (considered to be doublets and debris, respectively) were excluded. For event length and dead cell exclusion dye-based gating, the outlierGate function in the cydar package was used, and outliers on the higher side were excluded. In our in-house CyTOF datasets, the intercalator Rh103 was used to exclude dead cells, whereas the intercalators Ir191 and Ir193 were used to exclude doublets and debris. In the PD-1 blockade CyTOF datasets, the intercalator Pt198 was used to exclude dead cells, whereas the intercalators Ir191 and Ir193 were used to exclude doublets and debris. For our in-house pre-gated spectral flow cytometry datasets, the channel for the Zombie NIR Fixable Viability dye was used to exclude dead cells, and automated gating for doublets and DNA content was disabled. The expression values were then transformed. For CyTOF data, a hyperbolic arcsintransformation was applied, with a cofactor of five. For spectral flow cytometry data, Logicle transformation was applied, with parameters estimated in a data-adaptive manner with the estimateLogicle function implemented in the flowCore package. Finally, any event with zeros for all markers was discarded.
After the batch-specific preprocessing, the outputs were concatenated, with only markers common to all batches retained, to form a single SingleCellExperiment object in R.
Batch correction. The goal of this step is to enable the system to learn batch-to-batch deviations due to technical effects but not biological variabilities. We, therefore, used only data from healthy controls for batch correction, excluding data for patients and travel/family controls. Data were first down-sampled to 200,000 cells Unsupervised clustering and cell-type annotation. The goal of this step was to identify cell types in an unsupervised manner. We implemented two methods: i) FlowSOM-guided clustering followed by ConsensusClusterPlus-guided metaclustering and ii) UMAP-based dimension reduction followed by shared nearest neighbor (SNN) graph-based clustering 14 , as described below.
The FlowSOM/ConsensusClusterPlus method. This approach was inspired by the workflow described by Nowicka et al. 5 Briefly, batch-corrected expression values were subjected to unsupervised clustering with FlowSOM 4 , using the FlowSOM package, followed by metaclustering with the ConsensusClusterPlus package.
Euclidean distance was used for metaclustering. For the in-house spectral flow cytometry and CyTOF datasets, we generated 50 and 60 metaclusters, respectively, to improve the resolution of cell-type identification. For the PD-1 blockade CyTOF datasets, we generated 40 metaclusters. In addition to this clustering, we generated a heatmap for the median expression levels of all markers for each of the clusters, to facilitate the manual determination of cell identity.
SNN graph method. This approach was inspired by the workflow for single-cell RNA sequencing datasets implemented in the scran package. Batch-corrected expression values were first dimension-reduced via UMAP into ten dimensions. These dimensions were then used to construct an SNN graph with the buildSNNGraph function in scran, using the default settings. Finally, the graph was divided into communities, or clusters, with the Louvain algorithm, implemented as the cluster_louvain function in the igraph package 24 .
Batch-specific cell-type prediction. The goal of this step is to allow the system to learn, automatically, the cell-type classification rules, or cell-type boundaries, in a batch-specific manner, and to propagate the boundaries to all the cells in a given batch, including cells from travel/family controls and patients. Importantly, we used non-batch-corrected expression values tied to cluster labels defined in the unsupervised clustering section. First, cells from the healthy controls used for unsupervised clustering were further downsampled, retaining a maximum of 100 cells per cluster from a given batch. This step reduces the computational burden and also alleviates the class imbalance problem during machine learning, as there are both highly abundant cell subsets (e.g., CD14 + monocytes) and rare cell subsets (e.g., plasmacytoid dendritic cells) among human PBMCs.
We tested several conditions and found that cell-type classification rules can be learned successfully from 100 events per cell-type. A classifier was then trained with the caret package 25 . We selected the extremely randomized trees 26 algorithm implemented in the extraTrees package. After centering and scaling the nonbatch-corrected expression values, we performed five-times repeated 10-fold cross-validations with internal upsampling to maximize the Kappa statistic. Hyperparameters were tuned for each batch; mtry was tuned from five to 15, whereas numRandomCuts was tuned from one to two, the ranges being empirically determined. The trained batch-specific classifier was then applied to predict clusters for all cells in a given batch, including the cells of patients and travel/family controls. This approach assumes that all cells of patients and travel/family controls fall into one of the clusters defined from the cells of healthy controls.

Differential abundance analysis
For the PD-1 blockade CyTOF datasets, the raw counts of cell subsets identified from the three panels were tested simultaneously for differential abundance (DA) with the quasi-likelihood F-test (QLF) framework of the package edgeR 21 . When both pre-and post-treatment datasets were used, the DA between groups (i.e., responders and non-responders) was assessed with adjustment for both treatment and batch effects. When only pretreatment datasets were used, the DA between groups was assessed with adjustment for batch effects only.
The DA values for subsets with an absolute log2 fold-change of at least 0.5 and an adjusted P-value below 0.05 were considered statistically significant.