ImmCellTyper: an integrated computational pipeline for systematic mining of 1 Mass Cytometry data to assist deep immune profiling

33


Introduction
Mass cytometry or cytometry by time-of-flight (CyTOF) is a powerful high-throughput singlecell technology which employs stable elemental isotopes, as the same manner of fluorophores, to detect cellular proteins of interest.This approach successfully tackles the panel multiplex challenges faced by traditional flow cytometry due to spectral overlap and permits simultaneous measurement of over 40 parameters on millions of cells.To date, CyTOF has been widely applied in basic and translational medical research, such as deep immunophenotyping and characterisation of novel refined cell subsets, immune monitoring of cell-adoptive therapy, and dissecting cell subpopulations from heterogeneous tumour samples 1 .Nonetheless, the advantages of CyTOF come with the problem of handling highdimensional dataset.Traditional gating strategy for flow cytometry, while still serving as the 'gold standard' for cell population identification in cytometry data, may not be an optimal option for CyTOF data due to its high-dimensional settings.The high-parametric resolution in CyTOF, aimed at revealing previously undiscovered cell subpopulations, leads to a significant increase in the complexity of gating schemas and hierarchical depth, which makes manual gating extremely labour intensive and time-consuming 2 .Therefore, effective computational toolkits and pipelines are entailed for CyTOF data mining and analysis.
Efforts have been directed toward developing means of clustering algorithms to deconvolute a pool of live cell mixture into distinct cell populations, which facilitates CyTOF data analysis.
For instance, unsupervised methods, which include flowSOM 3 , Phenograph 4 , X-shift 5 , spade 6 , and DensVM 7 etc., often combine with dimension reduction techniques like t-SNE 8 , UMAP 9 , PCA etc., and require manual annotation of each cluster based on the marker expression patterns indicated by heatmaps.This approach works well to analyse populations in a datadriven manner, with all files concatenated and analysed all at once.Compared to manual gating, it has advantages in terms of convenience, efficiency, and relative unbiasedness from biological preconception, facilitating the detection of novel cell phenotypes for deep phenotyping.However, the unsupervised approaches are not always ideal and suitable for cytometry data.Mathematical clustering does not necessarily have biological meaning, leading to occasional inaccuracies.And several benchmarking studies suggested that the accuracy for these unsupervised tools may not be optimal 10 .Additionally, the technical uncertainty of results produced by different clustering approaches remains a conundrum.
Even for the same unsupervised method, the discrepancy among different runs without setting a seed, reduces the reproducibility and may cause confusion, particularly for biologists with limited computational knowledge.Moreover, manual validation is also essential, as biological annotation is the inevitable step to provide biological relevant labels for the clusters.
However, this process can be time-consuming and subjective, hindering the automation of pipelines.This problem is particularly pronounced for a large marker panel and samples with high heterogeneity, resulting in a higher number of clusters that need to be annotated.
Advances in artificial intelligence have accelerated the development of alternative clustering methods in a supervised manner for cell type inference.These methods consider the 'ground truth' or prior knowledge about the marker expression of each given cell types to automatically label each cell.Currently, a couple of semi-automatic methods have been developed, such as Linear Discriminant Analysis(LDA) 11 , DGCyTOF 12 , CyAnno 13 , DeepCyTOF 14 , Automated Cell-type Discovery and Classification (ACDC) 15 and Semi-supervised Category Identification and Assignment(SCINA) 16 etc.ACDC 15 and SCINA 16 use a matrix of pre-defined markers for each cell type to annotate the clusters showing the same signature.These methods assume that markers are either expressed or not expressed (binary), which limits their ability to distinguish cell subtypes with similar phenotypes, particularly non-canonical cell types that cannot be easily separated linearly.Alternatively, DeepCyTOF and LDA use a marker expression matrix extracted from manually gated cell types as a training dataset to build a machine learning model for cell type prediction.This approach has higher precision and accuracy compared with aforementioned methods 10 .Nonetheless, it can be labour intensive for preparation of the training set manually.Also, these methods are limited in their ability to predict novel cell subsets beyond the pre-gated set of cell types and lack a systematic and comprehensive way to assign the cells which were not identifiable under any of the gated cell types.New solutions have emerged with algorithms such as DGCyTOF 12 and CyAnno 13 , the former adopts a deep learning classification combined with hierarchical stableclustering methods and an iteration calibration system to identify known cell types and assign novel subsets; while CyAnno is based on a machine-learning framework which allows the integrative modelling of both 'gated' and 'ungated' cells.Both methods have demonstrated high accuracy in their test datasets, but unfortunately are not widely used by the research community, possibly due to the issue for the hassle of training data preparation, the lack of user-friendliness and implementation challenges for bench researchers.
To address the common drawbacks of current semi-supervised and unsupervised clustering algorithms and preserve their strengths in discovering both canonical and non-canonical cell subsets respectively, we propose a strategy implemented in ImmCellTyper for cell classification named BinaryClust.By considering biologists' prior biological knowledge and interpretation for canonical cell clusters in a semi-supervised manner, BinaryClust first automatically characterises the main cell lineages in a fast and accurate way.Subsequently, it extracts specific cell types of interest for further clustering using unsupervised algorithms to identify cell subsets including previously unreported non-conventional population.In addition, this R-implemented pipeline takes advantage of SingleCellExperiment class for data management, providing an easy-to-use and organized systematic workflow of CyTOF data handling.The whole pipeline includes quality control and batch effect correction, which helps to effectively pool datasets from different batches, and ensures the robustness for downstream analysis.Meanwhile, modules like dimension reduction, semi-supervised and unsupervised clustering (flowSOM and Phenograph), interactive data visualisation, and statistical testing for complex study design were also incorporated in this pipeline.Compared with existing integrated computational workflow, such as CATALYST 17 , CapX 18 , Cytofkit 19 and ImmunoCluster 20 which was developed and maintained by our team, this recently developed toolkit advanced further on coherence, functionality and user-friendliness.Overall, this approach has the potential to facilitate and smooth the investigation of CyTOF-based research.

Results
BinaryClust is comparable with manual gating in quantifying the abundances of main cell lineages.
The core concept for BinaryClust is depicted in Figure1, where a simple user-designed cell type marker expression matrix is required and serves as a reference to characterise the main cell lineages, with positive markers indicated in '+', negative markers in '-', and irrelevant markers in 'A'.K-means(k=2) will be applied to divide the positive and negative cell population of each marker, then align it to the reference table to infer main cell types.This is followed by the extraction of population-of-interest for downstream clustering using unsupervised methods for subpopulation discovery.
To assess the performance of the automated cell type classification and prediction, we generated a test CyTOF dataset using peripheral blood mononuclear cell (PBMC) samples from seven patients with myeloproliferative neoplasm (MPN) and two healthy donors, employing a 37-marker deep immunophenotyping panel (Supplementary Table 1).Manual gating was performed by two independent experts which identified seven main cell lineages.
These results act as the reference for evaluating the computer-aided methods.The hierarchical sequential gating strategy was explicitly illustrated in Supplementary Figure 1.We evaluated the agreement between manual gating results and BinaryClust results regarding cell frequencies of each population.The mean value was calculated from manual gating results of two experts to compare with BinaryClust-generated results using Pearson correlation analysis.As shown in Figure2A, the two methods exhibit a strong correlation with coefficient(R 2 ) equals to 1,1,0.99,0.96,0.96,0.99,0.99 for CD4 T cells, CD8 T cells, dendritic cells, NK cells, monocytes, and gamma delta T cells, respectively (all P <0.0001).And most of the data points remain close to the line of equality (red line, R 2 =1), indicating a high degree of agreement.Meanwhile, the Bland-Altman plots in Figure 2B also suggest no consistent bias of manual gating versus BinaryClust across all the identified cell types.The good performance of BinaryClust was further validated in the influenza dataset published by our group 21 , which contains FCS files from 11 individuals with 6 main immune cell types detected, the average Fmeasure and ARI reached 0.98 and 0.91 respectively (Supplementary Figure2A-B).

To further 200 evaluate 201
BinaryClust's performance, we compared it with the well-performing unsupervised algorithm flowSOM and supervised classifier Linear discriminant analysis (LDA).FlowSOM was run on the same MPN dataset with k value set as 20, followed by manual annotation and cluster merging to identify the same cell populations as in manual gating and BinaryClust.The cell frequencies derived from BinaryClust, flowSOM, and manual gating (expert1 and expert2) were compared using interaction plot (Figure 3).We observed that BinaryClust remains consistent with manual gating (all P>0.05), whereas flowSOM identified significantly less Gamma Delta T cells and dendritic cells (P<0.001 and P=0.006, respectively) compared with the other three measurements.We also increased the initial k value to 40 to improve accuracy by over clustering then merging clusters 3 .We found that the average frequency for Gamma Delta T cells and dendritic cells increased to 1.5% and 4.65% respectively but still remained significantly different (P<0.05) from the other two methods.It is also interesting to see that although the same gating strategy was applied, different experts obtained results with slight variation despite no statistical significance.The same findings were confirmed using boxplot as shown in Supplementary Figure 3. Subsequently, LDA was also tested on the same dataset.Since this method requires a training dataset from manual gating to build a model, we exported the cell events and cluster assignment from Cytobank, then equally partitioned all cells into training and test dataset.
Due to the difference in method implementation, we did not include LDA for cell abundance quantification comparison with other methods, since only half of the cells were used for prediction.To ensure an equitable comparison and further evaluate the accuracy, F-measure, and Adjusted Rand Index (ARI) were calculated using manual gating cluster IDs as reference.
As indicated in Table 1, for the MPN dataset, BinaryClust achieved high F-measure and ARI in all 7 cell types with an average of 0.94 and 0.91 respectively.The performance remains excellent in influenza dataset as well (average F-measure = 0.98, Supplementary Figure 2C).
LDA has equivalent prediction accuracy as BinaryClust with an average F-measure of 0.93 and ARI of 0.90.Both supervised methods outperformed flowSOM (average F-measure=0.75,average ARI=0.66),which is in line with previous benchmarking study on evaluating supervised and unsupervised clustering algorithms 10 .Notably, for cell types that constitute a substantial proportion in the pool like CD4 T cells, CD8 T cells, NK cells, monocytes, and B cells, flowSOM can identify them with high precision and sensitivity, while this is not the case for dendritic cells and gamma delta T cells which account for less than 3% of the whole population (F-measure = 0.51 and 0.15, respectively).Even so, the overall F-measure and ARI for flowSOM considering cell proportion reached 0.81 and 0.80.

ImmCellTyper pipeline supports interactive data visualisation and comparison among study groups.
BinaryClust is an important component of ImmCellTyper, a comprehensive integrated pipeline designed for systematic CyTOF data mining.Hence, we utilised the visualisation functions to further prove the reliability and robustness of BinaryClust.
BinaryClust inherently considers CyTOF makers as binary distributed which is in most of the case, but not always.Therefore, it is crucial to check the marker behaviour before running the pipeline.As shown in Figure 5A, all markers selected for the classification matrix (Figure 5B) displayed a binary distribution implying the suitability for this pipeline.After clustering (Figure 5C), median marker expression heatmap (Figure 5D) can check the reliability of the results before proceeding for downstream analysis.In MPN dataset, we projected the cluster assignment resulted from BinaryClust and manual gating to UMAP (Figure 5E), along with the expression of the phenotypic markers (Figure 5F and supplementary Figure 4A).High similarity was observed between BinaryClust and manual gating, whereas slight difference was found on flowSOM results coloured UMAP on islands of CD8 T cells and CD4 T cells (Supplementary Figure 4A).FlowSOM appeared to classify cells that were close on spatial distance into the same cluster, in contrast to prior knowledge-based methods: BinaryClust and manual gating.

Application of ImmCellTyper pipeline to the COVID-19 dataset demonstrates its versatile functionalities for comprehensive data analysis
To showcase the analytical and visualisation functions of ImmCellTyper pipeline, we used the dataset published by Chevrier et al. 22 , which described the immune signature of mild and severe COVID-19 patients in comparison with healthy individuals.There are a total of 82 FCS files with a 40-plex marker panel focusing on innate immunity in this dataset.An initial marker expression check was carried out based on the user-defined matrix (Figure 6B) and displayed in Figure 6A.BinaryClust was then performed and identified 12 cell populations as expected.We used t-distributed stochastic neighbour embedding(t-SNE) in this dataset for dimension reduction, coloured by cell types and faceted by disease conditions, which exhibited substantial immune alteration among healthy control, mild and severe COVID-19 patients (Figure 6C).From the heatmap in Figure 6D, we can have an overview of the marker expression of each population, which remains consistent with our initial definition indicated in the expression matrix.The abundances of the identified cell types were quantified for each individual study participant in the format of stacked histogram (Figure 7A) and summarized in boxplot (Figure 7B).Compared with healthy volunteers, there are significant immune alterations in COVID-19 patients in B cells, Basophils, cDCs, Monocytes, NK cells, CD8 T cells, Neutrophils and pDCs (all P<0.05), which depicts a similar trend to the original paper, despite slight discrepancies caused by different statistical methods used.Kruskal Wallis test was conducted for this dataset followed by multiple testing correction (Benjamini-Hochberg procedure) and Dunn's test for post hoc analysis, while Chevirer et al. used Mann-Whitney Wilcoxon test corrected by Holm method.We then selected 9 markers (IL-6, PD-L1, VISTA, IDO, TIM-3, TMEM173, GranzymeB, PPARg, Ki-67) as the state markers which can reflect the functional or proliferative status of the immune cell types.Notably, Granzyme B was observed to be significantly highly expressed in COVID-19 patients versus healthy control across all identified immune cell types(all P<0.05), TMEM173 was substantially up-regulated in COVID-19 patients particularly in monocytes, neutrophils, and CD4 T cells; and PD-L1 expression remained low for all cells, indicating the immune system of COVID-19 patients were highly activated without exhaustion regardless of disease severity, which was not explored previously.Since monocytes and neutrophils were of particular interest in the original paper, and the panel also included specific markers for in-depth interrogation, we then extracted the two population and carried out Phenograph(k=60) to investigate the subclusters.As shown in Figure 7D, Phenograph returned 16 and 14 subclusters respectively for monocytes and neutrophils, with heatmap provided in Supplementary Figure 5.

Reanalysis of the COVID-19 dataset demonstrated the concordance of ImmCellTyper pipeline
with original reports with additional findings, as well as its versatile interactive data visualisation functionalities.

Discussion
In this study, we present an analytical pipeline named ImmCellTyper for systematic exploration of CyTOF data.This pipeline addresses a comprehensive range of analytical needs encompassing data quality check, batch effects examination/correction, cell type identification, and downstream differential analysis accompanied by high-quality, publishable data visualisations.Furthermore, ImmCellTyper includes an in-house developed, knowledgebased, semi-supervised classifier BinaryClust with high accuracy and speed, and also integrates the well-performing and state-of-the-art unsupervised algorithms (Phenograph and flowSOM).By adopting the strategy of first obtaining the main cell types and then select specific cell types for in-depth interrogation with higher clustering resolution, ImmCellTyper combines the advantages of both supervised and unsupervised clustering algorithms for discovery of both major populations and refined subpopulations.Designed for ease of use, this pipeline features a clear and user-friendly workflow, and is compatible to the widely used pipeline CATALYST, aiming to provide a one-stop solution for CyTOF users.
For validating the robustness of the automated cell characterization function of ImmCellTyper: BinaryClust, we involved two independent datasets (MPN dataset and Influenza dataset) and compared with existing clustering tools flowSOM and LDA.Previous Benchmarking studies indicated that flowSOM was one of the best-performing unsupervised tools in precision, speed and stability, with F-measure ranges from 0.58~0.90 in various datasets 10 , and was widely used in high-impact research publications 10,23 .However, the results of flowSOM vary upon the k values set by the users which causes some uncertainty and impairs reproducibility of the results.In contrast, semi-supervised clustering methods like ACDC, DeepCyTOF 14 and LDA outperformed the unsupervised methods in the same benchmarking study.The Fmeasure obtained by ACDC and LDA ranged from 0.78~0.99,which was significantly higher than unsupervised approaches including Accense, PhenoGraph, Xshift, k-means, flowMeans 24 , FlowSOM and DEPECHE 25 (F-measure: 0.28~0.93) 23.In this study, we chose LDA because it has proven to be superior to ACDC and DeepCyTOF in terms of precision and speed 10,11 .The excellent performance of LDA was demonstrated in MPN dataset of this study with F-measure reached 0.93 for around 4 million cells.BinaryClust was comparable in accuracy(F-measure=0.94) but much faster in speed.Both semi-supervised approaches outperformed flowSOM as expected.Although semi-supervised methods seem to perform better, the mainstream for CyTOF data analysis still relies on unsupervised methods.The reason for this might be that most of the semi-supervised methods rely on manual gating as reference and requires either the manually gated expression matrix or user-defined binary matrix as training data.This process takes extra time and efforts especially for defining subpopulations, and may introduce end-user bias.Additionally, the common restriction for all the supervised methods is that they have limited capability to reveal novel subsets as those are not predefined in the training pool, which poses a critical challenge for the pursuit of novel discoveries.Considering all these strength and limitations, as a semi-supervised method, BinaryClust possesses the advantages of incorporating prior biological knowledge, being reproducible across runs and accurate on cell type identification.Meanwhile, the input requirement is a simple matrix of cell population definition based on marker positivity, avoiding the hassle to manually gate example FCS files.With a deep understanding that certain CyTOF markers are expressed on a continuum rather than binary manner, this automated supervised approach will only be used for classification of main cell lineages.The test datasets in this study contained data from human PBMC samples which represented a heterogenous pool of immune cells, covering the major types of various immune populations, and demonstrated excellent performance of BinaryClust on classifying CD4 T cells, CD8 T cells, gamma delta T cells, NK cells, monocytes, dendritic cells, and B cells.We also recommend checking marker expression distribution before running the pipeline to ensure accurate results.After the above step, this pipeline supports cell population extraction for further dissection into subclusters using unsupervised approaches, which excel in detecting rare and refined subpopulations.Phenograph is a particularly effective tool for this purpose.By applying this strategy, our previous work with BinaryClust facilitated the evaluation of the impact of systematic anti-cancer agents on lymphocyte population in non-small cell lung cancer patients, as well as the characteristics of autologous T cell products after manufacturing 26 .
Another purpose behind the development of the ImmCellTyper framework is to cater to general analysis needs, spanning from pre-processing to downstream differential analysis.
There are existing well-established pipelines like CATALYST 17 , diffcyt 27 , Cytofkit 19 , and ImmunoCluster 20 etc., each of them has its own strength and limitations.One of the challenges existed in CyTOF data pre-processing is effectively integrating multiple batches, given the technical differences arising from experiments and instrumental acquisition, which could result in the separation of clustering across batches and potentially confounding the signal of interest.In ImmCellTyper pipeline, we provide the functionality for batch effects examination and incorporated two well-performing batch correction algorithms CytoNorm 28 and CytofRUV 29 into our framework prior to downstream analysis, which ensures the quality of the data and, to the best of our knowledge, makes it the first integrated pipeline with this module.For the downstream analysis, tools like diffcyt and CATALYST group markers into phenotypic and state makers, enabling the detection of differentially abundant cell clusters and differential expression of functional markers within each cell population 27 .This wellrecognised strategy has been extensively applied in various studies.But one limitation is that it directly classifies cells into high-resolution clusters, and cannot automatically merge subclusters into one major population with similar phenotypes 27 .Our pipeline addresses this problem via a precise and convenient solution: using the semi-supervised classifier (BinaryClust) of ImmCellTyper.The two pipelines (CATALYST and ImmCellTyper) are well compatible, allowing users to leverage their functions together and providing a broader range of analytical options.On the other hand, Cytofkit has the advantages of the graphical user interface(GUI) for non-specialists, integration of a variety of clustering(DensVM 7 , ClusterX, FlowSOM, Phenograph), dimension reduction(PCA, ISOMAP, t-SNE) methods, and inference of the relatedness among cell populations, but it does not support complex study design, group comparison and statistical testing 19 .Several methods have been developed to fit with the high-dimensional features of CyTOF data for differential analysis including diffcyt, CellCnn 30 , cydar 31 , citrus 32 , cyEMD 33 etc.Nonetheless, none of them is applicable to compare more than two study groups at once which involves multiple testing.Therefore, to accommodate study designs involving multiple groups, ImmCellTyper does statistics via first applying Kruskal Wallis test, followed by Benjamin-Hochberg (BH) procedure for multiple testing correction and post hoc analysis via Dunn's test or pairwise-Wilcoxon test.In cases where the comparison involves only two groups, Mann-Whitney test will be applied in terms of cell frequency, as the distribution of CyTOF data generally does not fit for normal distribution.
One limitation of this computational pipeline is that its semi-supervised cluster identification may not be well-suited for the direct identification of specific subpopulations which are defined by continuum markers.This is due to the method's reliance on the presumption that makers are binary distributed.In addition, for users with general interest in charactering subclusters of each main cell lineage, it can be laborious to first perform BinaryClust then extract every population for unsupervised clustering.In such scenarios, BinaryClust can be used in parallel with other clustering methods or pipelines like CATALYST.This allows users to quickly obtain additional information about main cell types with the accuracy comparable to manual gating.
In summary, we introduce a novel open-source R-implemented strategy and a versatile toolbox for CyTOF data analysis.The future direction is to automate the data clean-up, compensation, and bead normalisation steps.Additionally, we also aim to implement the whole pipeline into Python, a more accessible programming language for bioinformatics novices and biologists who wish to perform high-dimensional data analysis independently.

BinaryClust
Most of the CyTOF markers exhibit log-normal or bi-modal distribution with zero inflation after arcsinh transformation.Thus, we employ binary classification using k-means to group cells into negative and positive populations(k=2) for each marker indicated in the userdefined classification matrix.Here, k-means is an unsupervised clustering algorithm to cluster the data based on the Euclidean distance among points, which is calculated by the formula below: Then assign each data point into a cluster centroid which is denoted by c i , and dist()is the Euclidean distance: By aligning the k-means results with the user-designed classification matrix, cell populations can be subsequently classified and annotated.

R package ImmCellTyper
This computational pipeline is implemented in the R package ImmCellTyper and publicly available on Github (https://github.com/JingAnyaSun/ImmCellTyper).Instructions for package installation and function usage can be found in the README file on the Github page.It is recommended to first examine the unwanted non-biological variation across batches and perform additional batch normalisation if necessary.Afterwards, import the data into the second part of the pipeline for downstream analysis.Users are required to prepare all the FCS files, sample metadata containing the details and grouping information of each sample, panel metadata with the information of the antibody panel with metal tags used in the experiments, and cell type classification matrix with phenotypic marker expression in a binary manner of each cell lineage.All files need to be in the right format to use the pipeline.

ImmCellTyper workflow overview
ImmCellTyper pipeline is composed of seven steps corresponding to two separate subpipelines, as described in Figure 8 and Github vignettes (https://github.com/JingAnyaSun/ImmCellTyper/tree/main/vignettes): (1) Sub-pipeline1 (corresponds to Figure 8, workflow step 2): Batch effects evaluation and correction.Batch effects occur when samples were collected and measured at different sites or time points, especially for large-scale studies.It is crucial to remove the unwanted variation which might interfere the true biological differences.After bead normalisation and data clean-up (Figure 8, workflow step 1), which can be performed using third-party platforms, such as CyTOF v7.0 system control software and Cytobank, users can systematically examine batch effects on two levels including marker behaviours and clustering results based on the method introduced by Trussart et al. 29 If needed, CytoNorm and CytofRUV, which are wellperforming correction algorithms, can be used to align the existing batch effects.
(2) Sub-pipeline2 (corresponds to Figure 8, workflow step 3-7): semi-supervised classification, differential analysis, and in-depth investigation.When data are cleaned and normalised, they can be imported into the second part of the pipeline, constructed into a SingleCellExperiment (SCE) object, and undergo semi-supervised classification to identify the major cell types and test the differential frequencies or state marker expression among study groups.After that, if the users have certain interests of specific cell types and pre-design the panel for that, or in another circumstance, the initial statistics draw the user's attention into a certain cell type, institutively, further clustering using unsupervised tools like flowSOM or Phenograph should be conducted with an increased cluster resolution and deeper investigation for cell subsets, after extracting the cell population of interest.ImmCellTyper pipeline has the same data storage and infrastructure as CATALYST, therefore all the functions in CATALYST can be seamlessly used in ImmCellTyper, tailored to user's analytical needs.We do not elaborate on the basic functions of CATALYST, which can be found in the tutorial vignettes of the package(https://github.com/HelenaLC/CATALYST).

Batch correction algorithms
The batch effect correction algorithms embedded in the function of 'batchNorm' comprises CytoNorm, as described by Van Gassen et al. 28 and CytofRUV by Trussart et al. 29 .Both algorithms rely on anchors (reference samples/technical replicates) across batches to perform normalisation.CytoNorm uses flowSOM clustering to first identify clusters prior to a population-specific transformation on the reference samples by computing the quantile values and aligning them with splines, whereas CytofRUV applies remove unwanted variation III (RUV-III) to CyTOF data by estimating and eliminating the non-biological variation of the pseudo-replicates.

Agreement evaluation
To assess the agreement among manual gating, BinaryClust classification and flowSOM clustering, we used correlation, interaction plot and Bland-Altman analysis.Correlation evaluates the relationship between two variables which doesn't mean concordance, but if two methods agree, surely, they should be highly correlated.We compared the results of cell population frequency generated by experts and BinaryClust using Pearson correlation, calculated the correlation coefficient and P value, with the line of equality indicating perfect agreement (red solid line, R 2 =1) in the plots to help gauge the degree of agreement between the two methods; Bland-Altman plot refers to a dot plot of the difference between two variables(y-axis) against the mean of them(x-axis), as described by J.Martian Bland and Douglas G. Altman in 1986, which represents a graphical magnitude of bias(average of difference) with 95% confidence interval.The math formula for the limit of agreement is as below: Limits of agreement = mean difference observed ± 1.96 × standard deviation We also used Interaction plot to display the interaction effects of the three methods including manual gating, BinaryClust classification and flowSOM clustering on the measurement of cell frequencies to evaluate the agreement.

F-measure
F-measure is a method to assess the accuracy of the clustering method compared to gold standard, which is manual gating results in this study.It stands for the harmonic mean of the precision and recall values, which can be calculated using below formula: Here, precision represents the positive predictive value: the proportion of true positive instances divided by the instances classified as positive by the clustering algorithm; recall evaluates the sensitivity, which is the number of true positive events correctly identified by the algorithm among all events that belong to the cluster.F-measure ranges from 0 to 1, where 1 suggests perfect performance and 0 indicates poor precision and recall.

Adjusted Rand Index
Adjusted rand index is a widely used method to measure the similarity between two clustering results.We used it to assess the agreement between test clustering algorithm and the gold standard labels which is derived from manual gating.ARI is defined as the following formula based on a contingency matrix (where n ij , a i , b j are values from the contingency table): ARI ranges from -1 to 1, with 1 indicates a perfect match, 0 means not relevant, and -1 indicates complete mismatch between two clustering results.

Differential analysis
In order to perform the differential analysis of cell abundances and state marker expression amongst study groups, we consider the number of study groups into two conditions in function 'StatTest': 1) n=2, Mann-Whitney Wilcoxon analysis will be applied as cell frequency/marker expression doesn't fit normal distribution; 2) n=3, Kruskal-Wallis test will be first performed followed by multiple testing correction using BH procedure and post-hoc analysis(Dunn's test or pairwise Wilcoxon test).

Sample collection and preparation for CyTOF
PBMC samples from MPN patients were requested and obtained from biobank at Guy's Hospital, under a protocol approved by the KCL Biobank Access Committee (REC18/EE/0025).
Healthy volunteers were recruited at Guy's hospital with informed consent and ethical approval by King's College Research Ethics Committee (HR-17/18-5960 MOD-20/21-5960) in accordance with the Declaration of Helsinki.All identifiable information of study participants were securely stored in a trusted research environment managed by members of the team.
Venous blood samples from healthy volunteers were collected in BD vacutainer EDTA tubes, and PBMCs were isolated and purified using Ficoll-Hypaque density gradient centrifugation.In brief, blood was carefully layered onto Ficoll and spined at 460g for 20 minutes at room temperature (RT) without brake; Upon observing a clear separation of blood components, PBMCs were then carefully isolated and washed three times with RPMI 1640 medium to remove other contaminants.Subsequently, the purified PBMCs were cryopreserved in liquid nitrogen at a density of 1X10 7 cells per vial.

CyTOF Antibody staining
Cryopreserved cells were quickly thawed at 37°C in water bath, suspended in pre-warmed RPMI 1640 medium, washed three times, and underwent Fc-blocking using human TruStain FcX (Biolegend) for 10 mins at RT.Given that the antibody panel employs indirect detection of CD95 using anti-CD95-APC and anti-APC-106Cd, cells were initially stained with anti-CD95-APC for 30 minutes in a dark place, followed by two washes with 2ml cell staining buffer (CSB).
After the removal of supernatant, cells were resuspended in 300µl CSB and transferred into the dip tube containing lyophilized antibody mix of Maxpar direct immune profiling assay (MDIPA, Standard BioTools).Additional antibodies used to study T cell activation, migration, and exhaustion status (anti-TIM3, anti-PD-1, anti-ICOS, anti-TIGIT, anti-OX40) were also added.
Cells were incubated with the antibody mix for 30 mins at RT in compliance with manufacturer's instructions, washed twice and fixed using a freshly prepared 1.6% paraformaldehyde (PFA, Thermo Fisher Scientific) solution for 10 mins at RT.In the end, cells were washed with CSB twice to remove residual PFA and stained with 125nM Cell-ID intercalator-Iridium in 1 ml Maxpar Fix and Perm Buffer (Standard BioTools) overnight at 4°C.
The stained cells were frozen down using freezing media (fetal bovine serum containing 10% DMSO) and stored at -80°C freezer before data acquisition.
On the day for CyTOF acquisition, cryopreserved samples were thawed, washed twice with 1ml CSB, followed by additional two washes with 1ml cell acquisition solution (CAS, Standard BioTools), and centrifuged at 800g for 5 minutes.Cells were then filtered through a 40 μm cell strainer to avoid blockage and cell count was determined by Countess automated cell counter (Invitrogen).EQ four-element calibration beads (Standard BioTools) were added to a final concentration of 0.5x10 6 cells/ml to adjust signal fluctuation of the instrument.CyTOF acquisition was performed on Helios mass cytometer system.For each batch of cell staining and run, technical replicates from the same healthy donor were included to evaluate and correct batch effects.
The full antibody panel including the metal tag, clone, and supplier is listed in Supplementary Table 1.

CyTOF data pre-processing
FCS files were first processed for bead normalisation using CyTOF v7.0 system control software (Standard BioTools) to correct signal drift during acquisition.Subsequently, the files were imported into Cytobank (Cytobank Inc.) for data cleaning, with a detailed procedure illustrated in Supplementary Figure 6.The aim is to remove non-events including debris, doublets, normalisation (EQ) beads and other undesired events like dead cells.The CD45+ cell population was pre-gated using Cytobank for down-stream analysis.

Manual gating
Manual gating for major cell populations was performed using Cytobank platform, and gatingML files were exported and converted into gatingSet object to extract labels of each cell via the R packages flowWorkspace, openCyto and CytoML.These data were used as reference for benchmarking computer-aided algorithms.

Figure 1
Figure 1 Schematic diagram of BinaryClust framework.Semi-supervised classification is first performed on selected markers in the user-defined marker expression matrix to classify and annotate major cell types.Population-of-interest can be further extracted and explored using unsupervised clustering methods followed by differential analysis.Figure was generated by BioRender (https://www.biorender.com/).

Figure 2
Figure 2 Agreement evaluation comparing manual gating and BinaryClust in MPN cohort(n=9).Manual gating of B cells, CD4 T cells, CD8 T cells, dendritic cells, NK cells, monocytes and gamma delta T cells were performed by two independent experts using Cytobank, and mean values of the population percentages were calculated to compare with BinaryClust results.Each dot represents one patient sample.(A) Scatter plot showing the correlation between the two methods, with the red line indicating perfect agreement (correlation coefficient = 1); (B) Bland-Altman plots of the two measurement methods among all the cell populations, with the black line suggesting the mean observed difference and red dotted lines indicating limits of agreement (1.96 x standard deviations).

Figure 3
Figure 3 Comparison of manual gating (manual1 and manual2), BinaryClust and flowSOM clustering results in MPN cohort(n=9).Interaction plots showing the individual measurement (percentage) of each study participant with indicated colors by different methods across main cell lineages (B cells, CD8 T cells, Gamma Delta T cells, NK cells, Dendritic cells, monocytes and CD4 T cells); ANOVA was used for statistical testing, and significance was marked by asterisk.* P<0.05, **P<0.01,***P<0.001,****P<0.0001.
ARI, Adjusted Rand Index; LDA, Linear discriminant analysis It has been demonstrated that flowSOM and LDA are among the fastest clustering algorithms without compromising their performance.Here, we compared the speed amongst the three approaches in MPN, Influenza and COVID-19 datasets 22 which contain 2,231,053, 210,933 and 3,862,628 cells respectively.Since LDA requires training data to be executed, we did not run it on COVID-19 dataset due to the absence of manual gating results.As shown in Figure 4, BinaryClust exhibited the highest speed in both MPN and Influenza dataset but fell behind flowSOM slightly in COVID-19 dataset.

Figure 4
Figure 4 Comparison of BinaryClust, flowSOM and LDA on speed.Bar chart showing runtime (in seconds) of the three methods in three different datasets.

277 278Figure 5
Figure 5 Cell type characterization and visualisation using ImmCellTyper pipeline in MPN dataset(n=9).(A)Intensity distribution of selected phenotypic markers used for BinaryClust classification, coloured by sample_id;(B) Pre-defined expression classification matrix for the MPN dataset, '+' indicates positive, '-' indicates negative and 'A 'suggests 'any';(C) Proportion of the main cell lineages of all cells in the concatenated FCS files after classification; (D) Median marker expression heatmap of BinaryClust classification results; (E)UMAP plot of random downsample of 2000 cells per patient coloured by main cell types based on BinaryClust classification (left) and manual gating results (right).(F) UMAP plots coloured by normalized expression of indicated markers (CD3, CD4, CD8a, CD20, CD19, CD14, and CD56) across 2000 cells per sample. 304

Figure 6
Figure 6 Applying ImmCellTyper pipeline on COVID-19 patient dataset(n=82) published by Chevrier et.al.(A) Marker intensity distribution of selected phenotypic markers used for BinaryClust classification, coloured by disease severity(n=22 healthy individuals, 28 mild COVID-19 patients and 38 severe COVID-19 patients);(B)Pre-defined marker expression classification matrix used for BinaryClust; (C) T-SNE plots, with 1000 cells per sample, were coloured by the main cell types generated by

Figure 7
Figure 7 Quantification and statistical analysis comparing the study conditions in COVID-19 dataset(n=82).(A)Stacked histogram of main cell type composition per individual generated by BinaryClust, and grouped by study conditions (healthy, mild and severe); (B)Boxplots representing cell abundance frequencies among the study conditions, faceted by different main cell types;(C) State marker expression intensities with comparison of the study groups across the main cell types; (D)Clusters of monocytes and neutrophils were extracted from the whole cells for downstream interrogation.T-SNE plots with random downsample of 1000 monocyte cells and (E) neutrophils per sample were coloured by study conditions and Phenograph clustering results(k=60), respectively.Statistical significance was marked by asterisk.* P<0.05, **P<0.01,***P<0.001,****P<0.0001

Figure 8
Figure 8 Overall schematic outline of the ImmCellTyper workflow with description for each step.

Table 1
Precision, recall, F-measure, and ARI of indicated clustering methods.