Abstract
Cytometry analysis has grown in recent years with the expansion in the maximum number of parameters that can be acquired in a single experiment. In response, there has been an increased effort to develop computational methodologies for handling high-dimensional single cell data acquired by flow or mass cytometry. Despite the success of numerous algorithms and published packages in replicating and outperforming traditional manual analysis, widespread adoption of these techniques has yet to be realised in the field of cytometry. Here we present CytoPy, a Python framework for automated analysis of high-dimensional cytometry data that integrates a document-based database for a data-centric and iterative analytical environment. The capability of supervised classification algorithms in CytoPy to identify cell subsets was confirmed using the FlowCAP-I competition data. The applicability of the complete analytical pipeline to real world datasets was validated by immunophenotyping the local inflammatory infiltrate in individuals with and without acute bacterial infection. CytoPy is open-source and licensed under the MIT license. Source code is available online at https://github.com/burtonrj/CytoPy, and software documentation can be found at https://cytopy.readthedocs.io/.
1. Introduction
Cytometry data analysis has undergone a paradigm shift in response to the growing number of parameters that can be observed in any one experiment. As the field evolves, the traditional method of manual gating by sub-setting single cell data into populations and encircling data points in hand-drawn polygons in two-dimensional space has proven laborious, subjective, and difficult to standardise. In response to these shortcomings, a cross-disciplinary effort has given birth to a new approach often termed 'cytometry bioinformatics', to leverage complex computer algorithms and machine learning to automate analysis and improve the investigator's ability to extract meaning from high dimensional data.
Where cytometry is used for data acquisition, the typical objective is to discern differences between groups of subjects or experimental conditions, or to identify a phenotype that correlates with an experimental or clinical endpoint. To this end, a computational approach to analysis of cytometry data can take one of two strategies: to separate single cell data into groups or classifications, which then form the variables (often descriptive statistics of the obtained groups) the investigator uses to test their hypothesis, or to directly model the acquired distribution of single cell data with respect to a chosen endpoint. Classification strategies can be further subdivided: autonomous gating replicates traditional gating through the use of algorithms such as clustering analysis (flowDensity (1), OpenCyto (2)); high-dimensional clustering groups cells according to their individual phenotypes (FlowSOM (3), PhenoGraph (4), X-shift (5), SPADE (6)); and supervised classification trains on an example of manually gated data to produce a classifier capable of distinguishing cell populations (FlowLearn (7) and DeepCyTOF (8)). Modelling strategies have been successfully adopted in applications such as ACCENSE (9), CellCNN (10), and CytoDX (11), although this approach requires pooling of sample data and is therefore sensitive to batch effects.
In addition, various pieces of software have been developed for data handling, transformation, normalisation and cleaning (e.g. flowCore, flowIO, flowUtils, flowTrans, reFlow, flowAI), visualisation (e.g. ggCyto, t-SNE, UMAP, PHATE), and pipelines for specific applications (e.g. Citrus, MetaCyto, flowType/RchyOptimyx). To date, there are over 30 different contributions to automated analysis (12; 13; 14; 15). However, there is no widespread adoption of these methods as yet, nor is there a consensus on how to adopt such techniques, with much of the analysis pipeline left to the individual investigator to establish. This inconsistency results in projects amassing collections of custom scripts and data management that are not standardised or centralised, which not only makes reproducing results difficult but also makes for a daunting landscape for newcomers to the field.
We here introduce 'CytoPy', a novel analysis framework that aims to address these issues whilst granting access to state-of-the-art machine learning algorithms and techniques widely adopted in cytometry bioinformatics. CytoPy is developed and maintained in the Python programming language, which prides itself on readability and is becoming the language of choice amongst the open source data science community (16). CytoPy introduces a central data source for all single cell data, clinical/experimental metadata, and analysis results, and provides a 'low code' interface that is both powerful and beginner friendly.
We demonstrate the performance of the supervised classification techniques housed within CytoPy on the Flow Cytometry: Critical Assessment of Population Identification Methods (FlowCAP) data set (17), which has been created for comparing the performance of automated analytical techniques for flow cytometry data. As the FlowCAP data underwent extensive pre-processing prior to their publication and hence do not reflect the challenges encountered with primary data generated by individual users, an in-house dataset of local immune cells in samples collected from patients undergoing peritoneal dialysis and who presented with and without acute bacterial infection was generated to demonstrate the applicability of CytoPy as a complete analytical pipeline for complex and unprocessed data. We believe that CytoPy provides a powerful and user-friendly framework to interrogate high dimensional data originating from investigations using flow cytometry or mass cytometry as readout, and has the potential to facilitate automated data analysis in a multitude of experimental and clinical contexts.
2. Design and Implementation
2.1. Building a framework that is data-centric
Reliable data management is a cornerstone of successful analysis because it improves reproducibility and collaboration. A typical cytometry project consists of many Flow Cytometry Standard (FCS) files, clinical or experimental metadata, and additional information generated throughout the analysis (e.g. gating, clustering results, cell classification, sample specific metadata). A further complication is that any analysis is not static but an iterative process. We therefore deemed it necessary to anchor a robust database at the centre of our software. In CytoPy, projects are instantiated and housed within this database, which serves as a single dynamic data repository that is then accessed continuously throughout the subsequent analysis. For the architecture of this database we chose a document-orientated database, MongoDB (18), where data are stored in JSON-like documents in a tree structure. Document-based databases carry many advantages, including simplified design, dynamic structure (i.e. database fields are not 'fixed', so the schema can adapt to unforeseen future requirements) and ease of horizontal scaling, thereby improving integration into web applications and collaboration. In this respect, CytoPy depends upon MongoDB being deployed either locally or via a cloud service, and MongoEngine, a Document-Object Mapper based on the PyMongo driver (19).
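To make the idea of tree-structured, JSON-like storage concrete, the following minimal sketch represents one subject and its samples as a nested document. It uses plain Python dictionaries and the standard `json` module in place of MongoDB and MongoEngine, and every field name (`subject_id`, `fcs_groups`, `populations` and so on) is hypothetical and chosen for illustration; it is not CytoPy's actual schema.

```python
import json

# Hypothetical document for one subject; field names are illustrative only
# and are not taken from CytoPy's actual schema.
subject = {
    "subject_id": "patient_001",
    "metadata": {"infection": True, "age": 63},
    "fcs_groups": [
        {
            "sample_id": "patient_001_drain_fluid",
            "files": ["primary.fcs", "fmo_cd27.fcs"],
            "populations": [{"name": "root", "n_events": 250000}],
        }
    ],
}

# Fields are not fixed: a new key can be added later without altering
# any other document or declaring the field in advance.
subject["metadata"]["culture_result"] = "Gram negative"

# The nested structure round-trips through JSON unchanged.
restored = json.loads(json.dumps(subject))
```

The point of the sketch is only that nested, per-subject records can grow new fields as an analysis evolves, which is the property the text attributes to document-based databases.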
2.2. Framework overview
An overview of the CytoPy framework is given in Figure 1 including a recommended pathway for analysis, although individual elements of CytoPy can be used independently. CytoPy follows an object-orientated design with a document-object mapper for both commitment to, and collection from, the underlying database. The user interacts with the database using an interface of several CytoPy classes, each designed for one or more tasks. CytoPy is algorithm agnostic, meaning new autonomous gating, supervised classification, clustering or dimensionality reduction algorithms can be introduced to this infrastructure and applied to cytometric data using one of the appropriate classes. CytoPy makes extensive use of the Scikit-learn (20) and SciPy (21) ecosystems. Throughout an analysis, whenever single cell data are retrieved from the database, they are stored in memory as Pandas DataFrames that are accessible for custom scripting at any stage.
Overview of the CytoPy framework and list of primary dependencies. Single cell data and experimental/clinical metadata (1) are used to populate a project within the CytoPy database (2). The CytoPy database models analytical data in MongoDB documents (cylinder) and an interface of CytoPy classes retrieves and commits data to this database (dotted rounded rectangle). The components of this interface are used to complete the following tasks. (3) Semi-autonomous gating identifies a clean 'root' population for analysis. (4) Inter-sample variation is visualised to assess the degree of batch effect and samples are grouped according to their similarity in high-dimensional space. (5) Cells are classified by supervised and unsupervised methodologies and visualised for exploratory data analysis. (6) Finally, single cell data can be summarised and feature selection techniques employed to find variables of interest.
Following the steps in Figure 1, a typical analysis in CytoPy would be performed as follows (functions are shown in italics and class names are shown in italics and title-case).
Single cell data are generated and exported from the flow cytometer in FCS 2.0 or 3.0 format. Experimental and clinical metadata are collected in tabular format either as a Microsoft Excel document or a csv file, with the only requirement being that metadata be in 'tidy' format (22).
A Project is defined and populated with the single cell data and accompanying metadata. Each subject (e.g. a patient, a cell line, or an animal) has a Subject document containing metadata that are dynamic, with no restriction on the data stored within, and is associated with one or several FCSGroup documents. Each FCSGroup document contains one or more FCS files associated with a single biological sample collected from the subject. This document contains all single cell data, 'gated' populations, clusters and meta-information that pertain to a single 'sample'. This also includes any isotype or Fluorescence-Minus-One (FMO) controls. Compensation is applied to single cell data at the point of entry using either an embedded spillover matrix or a provided csv file. The FCSGroup is associated with an FCSExperiment, containing all samples collected under one particular set of staining conditions. There must always be a Panel document associated with an FCSExperiment. For this, the investigator must provide a 'panel design' in the form of a simple Excel document (see CytoPy documentation https://cytopy.readthedocs.io/). CytoPy then uses regular expressions to match FCS metadata such as channel names to the expected panel and offers error handling for when discrepancies arise.
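The panel-matching step can be illustrated with a short sketch. It assumes a hypothetical panel expressed as a dictionary of marker names and regular expressions, and a hypothetical helper `match_channels`; neither is CytoPy's actual panel format or API. The sketch only shows how regular expressions can standardise inconsistently spelled channel names and raise an error when a channel or marker cannot be resolved.

```python
import re

# Hypothetical panel: marker name -> regular expression accepting common spellings.
# This is an illustrative structure, not CytoPy's panel design format.
PANEL = {
    "CD3": r"^cd[\s\-_]*3$",
    "CD4": r"^cd[\s\-_]*4$",
    "CD45RA": r"^cd[\s\-_]*45[\s\-_]*ra$",
}


def match_channels(fcs_names, panel=PANEL):
    """Map each raw FCS channel name to a panel marker, raising on discrepancies."""
    mapping = {}
    for raw in fcs_names:
        hits = [marker for marker, pattern in panel.items()
                if re.match(pattern, raw.strip(), flags=re.IGNORECASE)]
        # A channel must resolve to exactly one marker in the panel.
        if len(hits) != 1:
            raise ValueError(f"Channel {raw!r} matched {len(hits)} panel markers")
        mapping[raw] = hits[0]
    # Every marker in the panel must be present in the file.
    missing = set(panel) - set(mapping.values())
    if missing:
        raise ValueError(f"Panel markers not found in file: {sorted(missing)}")
    return mapping


mapping = match_channels(["cd3", "CD-4", "Cd45 RA"])
```

Failing loudly on an unmatched channel, rather than silently dropping it, is one reasonable design choice for the kind of error handling the text describes.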
Any cytometry analysis will require that single cell data be cleaned of debris and artefacts. Semi-autonomous gating is employed to select what is known as a 'root' population. This is a ground truth for every subject and the point from which analysis starts; for instance in a mixture of immune cells this could be the T cell population (CD3+ live single lymphocytes).
Batch effects are common and must always be evaluated prior to analysis, as they can influence subsequent steps. If the batch effect is minimal the investigator can consider pooling data and modelling the distribution of single cell data directly. If batch effects are considerable, the investigator should choose their training data accordingly and take caution when interpreting clustering results. CytoPy offers a class called EvaluateBatchEffects with methods for generating univariate and multivariate comparisons of the single cell feature space.
Multiple strategies can be employed to classify cells based on a common phenotype. Strategies such as autonomous gating and supervised classification are biased by the training data provided (and the gating strategy used to label those data) whereas high-dimensional clustering is an unsupervised method that groups cell populations according to their phenotype. CytoPy offers both supervised classification through the CellClassifier class and high dimensional clustering through the Clustering class, so that variables can be generated from either or both strategies. Importantly, the results of either strategy can be committed to the database and then visually interpreted using a class called Explorer. The Explorer class also facilitates exploratory data analysis with interactive plots of embedded space using multiple dimensionality reduction techniques.
Once cells have been classified, we can test our hypothesis. The single cell data are summarised into a 'feature space', summary statistics that describe the cell populations. This generates a large number of variables, many of which will be either uninformative or redundant. Filter and wrapper methods are applied to perform feature selection, finding only those variables important for predicting a clinical/experimental endpoint. In addition, there are multiple methods available for visualising extracted features, allowing the investigator to quickly determine whether certain patterns exist in the dataset.
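As a minimal sketch of what summarising single cell data into a feature space can look like, the example below converts per-cell population labels into one row of summary statistics per sample, here simply the proportion of cells in each population. The function name `summarise` and the toy labels are hypothetical, and CytoPy's own summaries may include other statistics.

```python
from collections import Counter


def summarise(labels_by_sample):
    """Return, per sample, the proportion of cells assigned to each population."""
    # Use the union of populations so every sample has the same set of features.
    populations = sorted({lab for labels in labels_by_sample.values() for lab in labels})
    features = {}
    for sample, labels in labels_by_sample.items():
        counts = Counter(labels)
        total = len(labels)
        features[sample] = {pop: counts[pop] / total for pop in populations}
    return features


# Toy per-cell labels for two samples (illustrative values only).
labels_by_sample = {
    "s1": ["CD4", "CD4", "CD8", "gd"],
    "s2": ["CD4", "CD8", "CD8", "CD8"],
}
features = summarise(labels_by_sample)
```

Each sample is now described by the same fixed set of variables, which is the form that filter and wrapper feature selection methods expect.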
3. Results
3.1. CytoPy provides accurate cell classification using supervised machine learning algorithms
The nature of cytometry data lends itself well to supervised classification, given that a typical biological sample yields hundreds of thousands of events but we are limited to measuring up to 40 variables for each cell, resulting in an abundance of observations. CytoPy offers the CellClassifier class as a blueprint for supervised classification in a cytometry framework. One of the most popular libraries for implementing machine learning techniques in Python is Scikit-Learn (20). Scikit-Learn has been used in over 95,000 applications and provides a robust infrastructure of objects that handle pre-processing, training methodology, and interpretation of machine learning algorithms. The CellClassifier class follows the conventions of Scikit-Learn by providing a familiar application programming interface (API) and the apparatus for any classification algorithm to be integrated into the CytoPy framework. In CytoPy version 0.0.1, the following algorithms have been implemented: XGBoost, Feed-Forward Neural Network, Linear Discriminant Analysis, Support Vector Machines and K-Nearest Neighbours. The choice of algorithms to include at this stage was based on prior experience with classification tasks (23), examples in the literature of supervised classifiers in this domain (8; 17; 23), and the relevance of including classifiers from multiple families (24). In order to test the performance of each algorithm, we utilised the FlowCAP-I classification challenge (see Supplementary Methods). As shown in Table 1, XGBoost gave the best performance as judged from the weighted F1 scores for each algorithm, and was therefore deemed the method of choice for the remainder of this study.
Performance of 5 different supervised classifiers on FlowCAP-I data.
3.2. Semi-autonomous gating can standardise the cleaning of single cell data for rapid analysis
The pre-processed FlowCAP data are helpful for critically assessing the performance of supervised classification algorithms but do not reflect the challenges associated with a real world cytometry project generating complex, primary data. As validation of its performance we here applied CytoPy to the characterisation of immune cells in peritoneal drain fluid and whole blood of peritoneal dialysis (PD) patients with and without acute bacterial infection. We chose this dataset based on a wealth of previous experience in the field (25; 26; 27), the clinical relevance of acute peritonitis in those patients (28), and because of the technical challenges presented by the sample type. Samples were stained with a comprehensive panel of monoclonal antibodies to identify T lymphocytes, monocytes, dendritic cells, eosinophils and neutrophils as the major constituents of peritoneal immune cells, together with activation and differentiation markers on those populations (Supplementary Tables S2 and S3).
Cytometry data are highly variable and surface marker expression must often be identified amongst a backdrop of cellular debris and staining artefacts. This is particularly relevant when studying complex samples such as local specimens taken from the site of acute infection. In the case of individuals receiving PD, bacterial infection leads to the influx of billions of inflammatory cells, predominantly neutrophils, into the peritoneal cavity within a few hours (27). Considerable pre-processing is required to uncover biological material, and traditionally this task would be performed by laborious and time-consuming manual gating. CytoPy replicates and expands upon autonomous gating algorithms to provide a semi-autonomous approach that standardises and improves the efficiency of pre-processing. Gates (polygons in two-dimensional space that encapsulate a population of interest) are associated with a sample using the Gating class. The Gating class is central to CytoPy as it is the means by which gates and populations are created, edited, and visualised throughout the analysis.
In CytoPy, the investigator decides upon a 'root' population; a ground truth present in every sample and the point at which fully automated analysis will begin. Semi-autonomous gates are then applied in sequence to extract the root population for each biological sample. This is exemplified in Figure 2, showing the identification of T lymphocytes from local immune cells in the peritoneal effluent of PD patients. This example utilises density-driven threshold gates, where a threshold is determined based on properties of the Probability Density Function as estimated using Gaussian Kernel Density Estimation, and mixture models shown as elliptical gates in Figure 2. Multiple methodologies for autonomous gating are available to choose from (see CytoPy documentation for a detailed description https://cytopy.readthedocs.io/en/latest/gating.html). The 'low code' interface and object orientated design of CytoPy make the generation of gates simple. Establishing an effective gating strategy is achieved using the Template class, which inherits from the Gating class. Once a gating strategy has been determined and the autonomous gates chosen, the Template class allows this strategy to be committed to the database so that it can be applied to subsequent data, replicating analysis (see Supplementary Data S1 Appendix).
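The idea behind a density-driven threshold gate can be sketched in a few lines. The example below estimates a one-dimensional probability density with a Gaussian kernel and places the threshold at the point of minimal density between the two tallest peaks. This is one simple rule under stated assumptions (a hand-picked bandwidth, a regular evaluation grid, and a clearly bimodal marker); the function names are hypothetical and CytoPy's own gates may determine the threshold differently.

```python
import math


def gaussian_kde(data, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `data` at each grid point."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)
            for x in grid]


def density_threshold(data, bandwidth, n_grid=200):
    """Return the grid value of minimal density between the two tallest peaks."""
    lo, hi = min(data), max(data)
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    dens = gaussian_kde(data, grid, bandwidth)
    # Interior local maxima of the estimated density.
    peaks = [i for i in range(1, n_grid - 1)
             if dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]]
    if len(peaks) < 2:
        raise ValueError("fewer than two density peaks; no threshold found")
    # Keep the two tallest peaks and search for the trough between them.
    left, right = sorted(sorted(peaks, key=lambda i: dens[i])[-2:])
    trough = min(range(left, right + 1), key=lambda i: dens[i])
    return grid[trough]


# Toy marker intensities: a larger negative and a smaller positive population.
negatives = [-0.5 + 0.1 * i for i in range(11)]
positives = [4.5 + 0.25 * i for i in range(5)]
threshold = density_threshold(negatives + positives, bandwidth=0.5)
gated = [x for x in negatives + positives if x > threshold]
```

Because the threshold is derived from each sample's own density, the same gate definition can adapt to shifts in staining intensity between samples, which is what makes the approach semi-autonomous rather than static.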
Examples of using semi-autonomous gates for identification of immune cells in a biological sample, as exemplified by the identification of T lymphocytes in peritoneal drain fluid from a patient with acute peritonitis. Algorithm-driven gates are applied on each two-dimensional plot in succession according to a user defined gating template and population hierarchy. The first gate (A) in the sequence filters out the majority of debris using a static rectangular boundary applied to forward scattered light area (FSC-A) and sideward scattered light area (SSC-A). Those events positive for the pan-T cell marker CD3 are identified with a density-dependent autonomous gate (B) that finds a threshold at the point of minimal density using properties of a probability density function. (C) Density-dependent autonomous gating then identifies live cells within the CD3+ cell population; those below the threshold found for live/dead stain. Live single CD3+ cells are further discriminated from other events by applying Gaussian mixture models to create an elliptical gate (D) using FSC-A and forward scattered height (FSC-H), and a density-dependent gate (E) using sideward scattered light width (SSC-W). Finally, the T cell population is identified using FSC-A and SSC-A and encapsulated by an elliptical gate generated by a Gaussian mixture model.
3.3. CytoPy provides visual and quantitative tools for evaluating batch effect
Batch effect is an unavoidable obstacle in any cytometry experiment. CytoPy is thus designed to provide methods for the evaluation of batch effect as an important step in the analysis. For comparison, a reference sample can be identified using the calculate_ref_sample function. Following the method presented by Li et al. (8), CytoPy computes, for every pair of samples, the Euclidean norm of the difference between their covariance matrices, and selects the sample with the smallest average distance to all others as reference. This reference sample can then be used for univariate comparison of each channel or multivariate comparison using a dimensionality reduction technique such as Principal Component Analysis (PCA). This is achieved using the EvaluateBatchEffects class that offers a low-code interface to produce the aforementioned plots (see Supplementary Data S2 Appendix).
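A minimal sketch of reference sample selection follows. It assumes the distance between two samples is the Euclidean (Frobenius) norm of the difference between their covariance matrices, and that the reference is the sample with the smallest mean distance to all others. The helper names (`covariance_matrix`, `matrix_distance`, `pick_reference`) are hypothetical and this is not the implementation of calculate_ref_sample.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length observations."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(d)] for a in range(d)]


def matrix_distance(m1, m2):
    """Euclidean (Frobenius) norm of the element-wise difference of two matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2)))


def pick_reference(samples):
    """Return the sample id whose covariance matrix is closest, on average, to all others."""
    covs = {sid: covariance_matrix(rows) for sid, rows in samples.items()}

    def mean_distance(sid):
        others = [o for o in covs if o != sid]
        return sum(matrix_distance(covs[sid], covs[o]) for o in others) / len(others)

    return min(covs, key=mean_distance)


# Toy data: three samples, two channels, with increasing spread.
samples = {
    "a": [(0, 0), (1, 1), (2, 2)],
    "b": [(0, 0), (2, 2), (4, 4)],
    "c": [(0, 0), (3, 3), (6, 6)],
}
reference = pick_reference(samples)
```

In this toy example the middle sample is selected, since its covariance structure lies between the other two.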
In Figure 3, the reference sample is shown in blue and compared to randomly selected samples shown in red; ten such samples are depicted to ease visual interpretation but there is no limit to the number of comparisons that can be made in a single plot. While Figure 3A shows the degree of inter-sample variance for individual fluorochromes and highlights abnormalities in a single channel, Figure 3B shows the same ten randomly selected samples, individually plotted to overlay the reference sample, thus illustrating the multivariate drift of a sample compared to the chosen reference. This allows for identification of samples that are explicit outliers and gives a general sense of the inter-sample variance in the complete immunological landscape measured.
Variance in cell marker abundance as measured by flow cytometry for T cells in peritoneal drain fluid (CD3+ lymphocytes). A reference subject (325-01) is shown in blue and 9 other randomly selected subjects are overlaid for comparison in red. (A) Variation in individual parameters can be shown by kernel density estimation as shown here for 6 common parameters of interest in T cell biology, identifying all T cells (CD3) or the helper T cell (CD4) and cytotoxic T cell (CD8) populations, as well as surface markers associated with specific effector and memory subsets within these populations (CD45RA, CD27, CCR7). (B) Multivariate drift can be visualised using dimensionality reduction techniques such as PCA. The same reference subject 325-01 as shown in (A) is given in blue and in each plot a different subject in red is overlaid.
The approach illustrated in Figure 3 defines methods that are helpful for visually critiquing the quality of the dataset and that can identify anomalies that should be addressed by changing technical procedures in data acquisition. To proceed with classifying cells into known phenotypical subsets we must take into account this technical variation. This is achieved in traditional manual gating by laboriously adjusting gates on a per-sample basis, with considerable variation depending on the investigator. For automated classification by supervised methods, we instead choose our training data in such a way that inter-sample variation is accounted for. CytoPy provides the similarity_matrix function and the output is shown for each sample type in Figure 4. Unlike the visualisation techniques depicted in Figure 3, the similarity_matrix function quantifies the inter-sample variation by computing a pairwise statistical distance for each possible combination of samples. The statistical distance shown in Figure 4 is the square root of the Jensen-Shannon divergence (the default choice for this function), given by:
$$\sqrt{\mathrm{JSD}(p, q)} = \sqrt{\tfrac{1}{2}\,\mathrm{KL}(p \parallel m) + \tfrac{1}{2}\,\mathrm{KL}(q \parallel m)}$$
Heatmap display of pairwise Jensen-Shannon distances for all leukocyte subsets present in peritoneal drain fluid and subsets within the T cell compartment present in peritoneal drain fluid and whole blood. Jensen-Shannon distance is given as √JSD(p, q) where p and q are the PDFs of each given pair as estimated using a Gaussian kernel and JSD is a function for Jensen-Shannon divergence. Single linkage clustering is applied to each matrix to reveal groups of broad similarity.
Here, m is the pointwise mean of the left probability vector p (PDF of the first sample) and the right probability vector q (PDF of the second sample), and KL is the Kullback-Leibler divergence. The Jensen-Shannon distance returns a value between 0 and 1 (when the base-2 logarithm is used), where 0 indicates that the distributions p and q are identical, and 1 that they are maximally dissimilar (29; 30). Any statistical distance (a function taking two probability vectors and outputting a metric distance) can be used, but by default the Jensen-Shannon distance is applied, chosen for its properties of symmetry and finite output (30; 31). The similarity_matrix function outputs a heatmap where the colour of each cell corresponds to the Jensen-Shannon distance of the x, y axis pair that overlaps on the given cell. The axes of the heatmap are clustered using single linkage clustering. Clustering on the pairwise Jensen-Shannon distance reveals groups of samples that are similar in the distribution of their single cell subsets in high dimensional space. Classification of cell populations in these groups can be performed independently per group but with the same objective of identifying phenotypically distinct cell populations. For each group the investigator chooses a reference sample (e.g. a uniform sample of cells from each member of the group) and manually labels this reference for the cell phenotypes of interest (e.g. for T lymphocytes this might be CD4+ and CD8+ T cell subsets), then trains a classifier using the labelled reference and subsequently predicts the cell populations for the remaining members of the group. This approach accounts for the inter-sample variation, and therefore improves the classifier's ability to generalise.
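The Jensen-Shannon distance can be computed directly from its definition. The sketch below works on two discrete probability vectors of equal length and uses the base-2 logarithm, so that identical distributions give 0 and distributions with no overlap give 1. The function names are hypothetical, and in practice the probability vectors would come from kernel density estimates evaluated on a common grid rather than the toy vectors used here.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between two probability vectors."""
    # m is the pointwise mean of p and q; it is non-zero wherever p or q is non-zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding error


same = js_distance([0.5, 0.5], [0.5, 0.5])
disjoint = js_distance([1.0, 0.0], [0.0, 1.0])
partial = js_distance([0.7, 0.3], [0.4, 0.6])
```

Unlike the Kullback-Leibler divergence on its own, the result is symmetric in p and q and remains finite even when one distribution has zero mass where the other does not.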
3.4. Supervised classification algorithms can reliably identify cell subsets in complex sample types whilst providing tools to inspect and diagnose anomalies
In Figure 4, biological samples were clustered on pairwise Jensen-Shannon distances to reveal groups of samples of relatively high similarity; clustering results are shown as a dendrogram on the axis of each two-dimensional heatmap matrix. Groups are derived by cutting the dendrogram at a level that was heuristically chosen through visual inspection of the dendrogram. This process was repeated for each sample type and set of staining conditions to generate the groups shown in Figure 5A where each group was treated independently during supervised classification.
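Cutting a single linkage dendrogram at a chosen height can be sketched without building the tree explicitly, because the resulting groups are the connected components of the graph that links every pair of samples whose distance is at or below the cut. The function name, the toy distances and the cut heights below are hypothetical; in the analysis described here the cut was chosen by visual inspection.

```python
def single_linkage_groups(ids, dist, cut):
    """Group ids by cutting a single linkage dendrogram at height `cut`.

    `dist` maps each pair (a, b) with a < b to its pairwise distance.
    """
    parent = {i: i for i in ids}

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of samples that lies within the cut distance.
    for a in ids:
        for b in ids:
            if a < b and dist[(a, b)] <= cut:
                parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


# Toy pairwise distances between four samples (illustrative values only).
ids = ["s1", "s2", "s3", "s4"]
dist = {
    ("s1", "s2"): 0.10, ("s1", "s3"): 0.60, ("s1", "s4"): 0.70,
    ("s2", "s3"): 0.50, ("s2", "s4"): 0.65, ("s3", "s4"): 0.15,
}
tight = single_linkage_groups(ids, dist, cut=0.3)
loose = single_linkage_groups(ids, dist, cut=0.55)
```

Raising the cut merges groups, so the choice of height trades the number of classifiers to train against the homogeneity of each group.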
Performance of XGBoost for cell classification of CD45+ leukocytes from peritoneal drain fluid and T cells from peritoneal drain fluid and whole blood. Groups are generated from all patients (infected and non-infected) as described in Figure 4. XGBoost performance was assessed by weighted F1 score on 5 randomly chosen validation samples within each independent group (A); where groups represent samples clustered on pairwise JSD and an independent classifier is trained for each group. (B) Classification performance of individual classes for an obvious outlier in T lymphocytes from drain fluid, group 1 (weighted F1 score equal to 0.6) is shown visually as a confusion matrix. The values in each row are normalised according to class support (the number of events in a given class). The diagonal of the confusion matrix is equivalent to the accuracy of classification for a particular class. (C) Back-gating functionality allows for close inspection of supervised classification results and comparison to manual gates, semi-autonomous gates, or clustering results. The classification of γδ T cells in this example is compared to a manual gate.
Figure 5A shows the performance of XGBoost classification of all leukocyte subsets in peritoneal drain fluid and more detailed subsets of the T cell compartment in peritoneal drain fluid and in PBMCs from whole blood. Performance is given as the weighted F1 score, a metric that captures the harmonic mean of precision and sensitivity, and is weighted by class support (the number of true instances for each label), which provides a value between 0 and 1, where 1 is the best possible score. This metric was captured by monitoring the performance of XGBoost on five randomly chosen validation samples from each classification group of each experimental condition and/or sample type. The validation samples were labelled by manual gating. Performance was best for PBMCs from whole blood where the weighted F1 score on average was above 0.95. Performance was worst for identifying leukocyte subsets in peritoneal effluent, which reflects the complex nature of the sample type and the diversity of cell subsets we intend to describe. The situation for T cell subsets classified from drain fluid was more complicated. For groups 2 and 3 performance was optimal (average weighted F1 score ≥ 0.95) yet for group 1 there was one notable outlier; one validation sample gave a weighted F1 score of 0.6, outside the interquartile range for this group. Of note, CytoPy provides functionality to easily visualise and explore the results of CellClassifier objects. For the particular outlier mentioned, Figure 5B and 5C show detailed results of the classification of T cell subsets. Figure 5B is a heatmap representation of a confusion matrix, produced when the user passes a value of True to the print_report_card argument of the manual_validation method of CellClassifier. The confusion matrix in Figure 5B shows 'predicted labels' versus the 'true label'; the ground truth being the results of manual gates.
The values shown in the confusion matrix were normalised across each row (true label) meaning the values on the diagonal were equivalent to the accuracy for each class. The confusion matrix revealed that although this sample scored poorly in terms of weighted F1 score, the classification accuracy was greater than 95% for all but two classes: γδ T cells and unclassified cells, i.e. those that would not fall into any 'gate'. 52% of cells that had been classed as γδ T cells by the manual gate in this particular sample were instead left unclassified and a large majority of unclassified cells from manual gating were classified into other categories by the XGBoost algorithm. The inclusion of unclassified cells into one or more other subsets was least concerning as it likely reflected the subjective nature of manual gating; the close fit of a gate to its chosen population being one common subjective property of manual gates. The classification of γδ T cells was of greater concern, as this is a T cell subset that is relatively rare in many individuals and hence challenging to assess yet of significant importance especially in Gram-negative infections (32).
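The two quantities discussed above, the row-normalised confusion matrix and the weighted F1 score, can be computed from true and predicted labels as in the following sketch. The function names and the toy labels are hypothetical; the sketch is written with the standard library only and is not the code behind the manual_validation method. Note that the diagonal of a row-normalised confusion matrix is the per-class recall (sensitivity), which is what the text refers to as the accuracy for each class.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, values are event counts."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def row_normalise(matrix):
    """Divide each row by its support so the diagonal holds per-class recall."""
    return [[v / sum(row) if sum(row) else 0.0 for v in row] for row in matrix]


def weighted_f1(matrix):
    """Per-class F1 averaged with weights proportional to class support."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    score = 0.0
    for k in range(n):
        tp = matrix[k][k]
        support = sum(matrix[k])                         # events truly in class k
        predicted = sum(matrix[r][k] for r in range(n))  # events assigned to class k
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * support / total
    return score


# Toy labels: manual gates as ground truth versus classifier predictions.
labels = ["CD4", "CD8", "gd"]
y_true = ["CD4"] * 4 + ["CD8"] * 4 + ["gd"] * 2
y_pred = ["CD4"] * 4 + ["CD8", "CD8", "CD8", "CD4"] + ["gd", "CD8"]

cm = confusion_matrix(y_true, y_pred, labels)
cm_norm = row_normalise(cm)
score = weighted_f1(cm)
```

In the toy data the rare class is recovered only half of the time, yet the weighted score is dominated by the two larger classes; for highly imbalanced samples this is why per-class inspection of the confusion matrix is informative alongside the single summary score.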
The CellClassifier of CytoPy converts its classification results to population data that can be handled and visualised using the Gating class. This makes comparison of supervised classification to the results of manual gating, semi-autonomous gating or clustering analysis straightforward. In addition, the back_gating method allows the investigator to plot the results of multiple methods on familiar bi-axial plots for comparison. As illustration, Figure 5C shows the interrogation of data likely to represent an outlier in the analysis. Overlaid is the result of the XGBoost classification for Vδ2+ γδ T lymphocytes (red points) and the manual gate for the same subset (yellow line). Vδ2+ γδ T cells were unusually sparse in this particular patient sample, which explains the poor classification performance in this instance. Of note, upon visual inspection the XGBoost algorithm was as well suited to identifying rare cell types as manual gating; and classification of γδ T cells was performed correctly by the XGBoost algorithm in all other samples (data not shown).
3.5. Unbiased cell classification by high dimensional clustering
Although supervised classification provides us with one methodology for identifying cell subsets, it is biased by the gating strategy used in labelling training data. In recent years, numerous clustering algorithms have been proposed for high-dimensional clustering of single cell data. Two popular solutions are PhenoGraph (4) and FlowSOM (3; 33), both of which are available in CytoPy through the Clustering class. As with the CellClassifier class, Clustering is agnostic to the clustering algorithm of choice. Semi-automated gating, XGBoost classification, and PhenoGraph clustering are comparable in their identification of major cell subsets (Supplementary Figure S1), but using methods in unison (i.e. XGBoost classification and PhenoGraph clusters) provides many benefits and is encouraged in the CytoPy framework; high dimensional clustering offers the opportunity for exploratory data analysis, and the clusters obtained can be contrasted with populations identified by supervised classification to improve confidence in the reported results.
Exploratory data analysis in CytoPy is facilitated by the Explore class, which encapsulates the single cell data of one or multiple patients after clustering and supervised classification have been performed, and houses the data within a Pandas DataFrame. Operations can be performed on the DataFrame independently, allowing custom scripting, but the Explore class carries many utility functions that are designed for exploratory data analysis. Examples include methods for associating metadata to clusters (e.g. the patient phenotype), dimensionality reduction techniques, and interactive plotting tools.
Clustering is performed on a per-sample basis, but to explore the immune landscape of the entire cohort, a consensus must be found such that similar clusters between patients can be grouped. This consensus allows cell abundance and phenotype to be compared between clinical phenotypes. To achieve this, CytoPy uses meta-clustering. In brief, each subject is independently min-max normalised, and the centroid of each cluster calculated. The centroids of clusters for each subject are then merged to form a dataframe that describes the clustering results of all subjects. Finally, a clustering algorithm of choice is applied to this dataframe (see Supplementary Methods). As an example of the successful use of PhenoGraph, Figure 6A shows the results of meta-clustering for total leukocytes in the peritoneal drain fluid of individuals receiving PD. The Uniform Manifold Approximation and Projection (UMAP) (34) plot shows all clusters (solid filled circles) from all patients displayed in two-dimensional space. The colour of a cluster corresponds to the associated meta-cluster, while the size of a cluster represents the proportion of cells within the cluster (relative to the total CD45+ single immune cells in each individual patient). The nature of the UMAP plot is such that clusters of similar phenotype are arranged closer to one another. However, CytoPy allows any dimensionality reduction technique to be used (e.g. PCA, Isomap, PHATE (35) etc.), depending on the preference of the investigator and the specific question to be addressed. Meta-clusters are manually labelled according to their phenotype, as displayed in the heatmap of Figure 6A. Clusters can be colour-coded using any desired metadata. For instance, given an instance of Explore named explorer, one could associate the clinical phenotype of a patient to their clusters using the following single line of code:
PhenoGraph meta-clustering results for CD45+ leukocytes present in peritoneal drain fluid from all available patient samples. (A) The heatmap shows the phenotype of meta-clusters. Individual clusters from all patients are shown in a UMAP plot where each colour filled circle is a unique cluster from an individual subject. Its colour corresponds to its meta-cluster enrolment and its size the proportion of cells relative to the number of CD45+ leukocytes. (B) Patient phenotype (stable control or acute peritonitis) is categorised by colour in a UMAP plot, showing individual clusters from all patients, and box plots show the difference in the proportion of cells as a percentage of CD45+ leukocytes; the difference in distribution of population proportions was tested by Mann-Whitney U test; **** p ≤ 0.001
For each patient in this example, the database is queried for the variable named ‘peritonitis’ (as in “does this patient have acute peritonitis?”) and the result is used to populate the Pandas DataFrame stored in the explorer object. The UMAP plot is then repeated with colour-coding according to the metadata, as shown in Figure 6B. The distribution of clusters of different clinical phenotypes in the UMAP plot reveals changes in the immunological response. Subsets of cell compartments (e.g. ‘Monocytes_0’, ‘Monocytes_1’ etc.) can be consolidated and the proportion of cells within these consolidated groups (as percentage of all CD45+ immune cells) is shown in the boxplots of Figure 6B. Applying this cluster analysis to a cohort of PD patients, CytoPy found that acute bacterial peritonitis resulted in a dramatic shift in the composition of local immune cells, with a significant increase in the proportion of neutrophils and a parallel drop in the relative proportion of monocytes/macrophages, dendritic cells (DCs), B cells and T cells. These findings corresponded well with previous studies showing a significant influx of inflammatory cells into the peritoneal cavity on the first day of presenting with acute symptoms, compared to stable individuals in the absence of peritoneal inflammation (25; 26; 27; 36).
Figure 7 shows the same set of analytical techniques applied to the local T cell populations in individuals receiving PD. Figure 7A shows a UMAP plot of clusters, coloured according to their associated meta-cluster and revealing clean separation not only of CD4+ and CD8+ T cells as the major T cell populations but also of unconventional T cell populations such as Vα7.2+ CD161+ mucosal-associated invariant T (MAIT) cells and Vδ2+ γδ T cells. Figure 7B shows the same clusters as in Figure 7A but now colour-coded by the metadata regarding the presence or absence of bacterial infection. The differences in T cell subsets between stable controls and those with acute peritonitis were subtle and, due to the small size of this cohort, not statistically significant. Of note, CytoPy allows the composition of the T cell compartment to be explored in even more detail, as illustrated for the CD8+ T cell subset (Figure 7C). Here, PhenoGraph was capable of discerning distinct memory and effector subsets based on the expression of the surface markers CD45RA, CD27 and CCR7 (Figure 7C), further validating CytoPy as a reliable method for exploring changes in immune response in large flow cytometry data.
PhenoGraph meta-clustering results for T cells in peritoneal drain fluid from all available patient samples. (A) The heatmap shows the phenotype of meta-clusters. Individual clusters from all subjects are shown in a UMAP plot where each colour filled circle is a unique cluster from an individual subject. Its colour corresponds to its meta-cluster enrolment, and its size the proportion of cells relative to the number of T cells. (B) Meta-cluster results can be coloured by patient phenotype to reveal regions that distinguish clinical endpoints. Patient phenotype is contrasted by colour in clusters on a UMAP plot and in box plots of major T cell subsets as a percentage of total T cells. (C) Close inspection of CD8+ T cells shows that functionally distinct effector/memory subsets can be identified by PhenoGraph clustering.
3.6. Feature extraction and feature selection reveal variables that differentiate the immune response during acute peritonitis compared to stable controls
Following cell classification by both biased and unbiased methodology, the immunological landscape of the observed subjects can be summarised in CytoPy into a ‘feature matrix’. This includes the relative abundance of populations as identified by supervised classification and clusters produced by techniques such as PhenoGraph. There will be significant overlap here, and therefore the user may choose to generate a consensus between the results of supervised classification and clustering by way of an average of the two methods. Supervised classification is more robust towards underlying batch effects but biased by the gating strategy imposed upon the training data, whereas clustering is unbiased but not stable to batch effects. By combining both methods the investigator can overcome the limitations that they present individually.
The methods described are implemented in the feature_extraction module of CytoPy. Once a feature matrix has been generated, dimensionality reduction techniques can be employed to reveal immediately whether subjects separate according to the experimental or clinical endpoint of interest.
Figure 8A shows a PCA plot where peritonitis patients and stable controls clearly separated across two components, as expected from earlier studies by us (25; 27) and from the analysis shown in Figure 6.
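To make the idea concrete, the following is a minimal sketch of PCA applied to a feature matrix, written with NumPy only. It does not use the feature_extraction module; the function name, the simulated proportions and the group sizes are all assumptions made for illustration.

```python
import numpy as np

def pca_project(features, n_components=2):
    """Project the rows of `features` onto their leading principal components."""
    centred = features - features.mean(axis=0)
    # SVD of the centred matrix: the rows of vt are the principal axes.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return centred @ vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
# Hypothetical feature matrix: 6 subjects x 3 population proportions.
controls = rng.normal([0.2, 0.5, 0.3], 0.02, size=(3, 3))
peritonitis = rng.normal([0.8, 0.1, 0.1], 0.02, size=(3, 3))
feature_matrix = np.vstack([controls, peritonitis])

embedding, explained = pca_project(feature_matrix)
# embedding[:, 0] places the two simulated groups on opposite sides of zero.
```

When the groups differ strongly in a few proportions, as simulated here, the first component captures most of the variance and the subjects separate along it.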
(A) Principal component analysis of all identified cell populations shows separation of patients with acute peritonitis from stable controls. (B) Radial plot of major cell subsets given as the proportion of their derived parent population (MAIT cells, γδ T cells, CD8 T cells, CD4 T cells: proportion of total T cells; all others: proportion of CD45+ immune cells). Values shown are the consensus of XGBoost classification and PhenoGraph clustering.
Filtering techniques can be employed within CytoPy to remove variables of low variance or identify high multicollinearity (Supplementary Figure S2). This is often necessary to remove redundant variables. The immunological pattern that differentiates a clinical state or experimental end-point can then be visualised in a radial plot as shown in Figure 8B. In this example, cell populations are marked on the axis and the internal value is the proportion of cells relative to their respective parent, after consolidating the results of both PhenoGraph clustering and XGBoost classification. Figure 8B confirms the observations made in the exploratory data analysis of clustering results (Figures 6 and 7): although subtle differences exist in the T cell compartments, it is the stark differences in the proportion of myeloid cells that differentiate those with peritonitis from stable controls. Where further feature selection is necessary, CytoPy offers embedded methods in the form of L1-regularised linear models, where variables can be selected according to whether their coefficient remains non-zero as the regularisation parameter decreases (Supplementary Figure S2).
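The two filtering steps can be illustrated with a short NumPy sketch. The function, thresholds and feature names below are hypothetical and are not taken from CytoPy; the sketch simply drops near-constant columns and then keeps one member of each highly correlated pair.

```python
import numpy as np

def filter_features(matrix, names, min_variance=1e-4, max_abs_corr=0.95):
    """Drop near-constant columns, then one column of each highly correlated pair."""
    variances = matrix.var(axis=0)
    keep = [i for i in range(matrix.shape[1]) if variances[i] >= min_variance]
    corr = np.atleast_2d(np.corrcoef(matrix[:, keep], rowvar=False))
    selected = []
    for j, col in enumerate(keep):
        # Keep a column only if it is not strongly correlated with one already kept.
        if all(abs(corr[j, keep.index(k)]) < max_abs_corr for k in selected):
            selected.append(col)
    return [names[i] for i in selected]

names = ["neutrophils", "monocytes", "constant", "neutrophils_copy"]
neutrophils = np.array([0.1, 0.2, 0.8, 0.9, 0.5])
matrix = np.column_stack([
    neutrophils,
    [0.5, 0.1, 0.4, 0.2, 0.3],  # weakly correlated with neutrophils
    np.full(5, 0.3),            # zero variance, carries no information
    2 * neutrophils,            # perfectly collinear with neutrophils
])
kept = filter_features(matrix, names)
```

The order of columns decides which member of a correlated pair survives, so in practice the investigator would place the more interpretable variable first.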
4. Availability and Future Directions
CytoPy represents a framework for the analysis of cytometry data that facilitates automated analysis whilst introducing robust data management and an iterative analytical environment. The present study shows the ability of CytoPy to characterise the FlowCAP-I dataset with high precision and identified XGBoost as the optimal classification algorithm for gating with supervised methods. To demonstrate the capabilities of CytoPy on real-world data, we chose to analyse samples from patients with and without acute peritonitis, taking advantage of our extensive experience with this type of sample over more than a decade. Having initially acquired such samples on a four colour BD FACSCalibur flow cytometer with two lasers and simple FSC/SSC settings (37), we later utilised an eight colour BD FACSCanto with three lasers and FSC/SSC area/height channels (24; 34; 35), and in the present study took advantage of a 16 colour BD LSR Fortessa with four lasers and FSC/SSC area, height, width, and time, thus illustrating the technological advance in the field but also the increasing complexity of the data acquired. Analysis with CytoPy confirmed a striking increase in total neutrophils at the site of infection and a parallel decrease in the proportion of monocytes/macrophages, dendritic cells and T cells, in agreement with previous findings (26; 27), thereby validating the utility of CytoPy.
We have chosen to develop and maintain CytoPy in Python, a programming language with growing popularity in the bioscience domain. The application of popular Python deep learning frameworks such as Tensorflow (38) and Keras (39) offers potential for the autonomous analysis of cytometry data (8; 10; 38). Despite their successful application, the cited methods do not provide the robust data management and exploratory analytical tools that CytoPy offers. It is our intention to incorporate these methodologies in a future release. The agnostic object orientated design of CytoPy facilitates such additional implementations in a straightforward manner. It is this agnostic design and the introduction of a document-based database as central repository for cytometry analysis that sets CytoPy apart from alternative solutions.
In addition to providing a new data-centric framework for applying existing methods of single cell classification and clustering, CytoPy offers novel tools to aid the analytical pipeline. In this study we highlight the difficulties presented in complex cytometry data and demonstrate autonomous methods that improve the efficiency of pre-processing. We show how CytoPy can visualise and quantify the inter-sample variation resulting from batch effects. Prior attempts to mitigate or remove batch effects have either been tied to the application of gates in two-dimensional space (40; 41), involved manipulation of the input space in such a way that biological signals could be lost or distorted (42; 43), or required some technical intervention during data acquisition (44). Here we introduce an alternative strategy: instead of removing batch effects by transforming or aligning the data, we propose that a statistical measure be used to group data and that supervised classification be performed on each group individually. However, we appreciate the impact that a reliable method for mitigating or removing batch effects prior to analysis might have, and are open to the integration of data normalisation or transformation methodologies that would achieve this and would ensure that any such method fits the data-centric design of CytoPy.
As high-dimensional cytometry analysis continues to grow in popularity there will be increasing demand for an analytical framework that is friendly for those who are new to programming, provides a database that directly relates metadata to single cell data, and scales in a fashion that encourages collaboration and expansion. CytoPy meets all these criteria whilst remaining open-source and freely available on GitHub (https://github.com/burtonrj/CytoPy). Those wishing to collaborate with us or extend our software capabilities should consult the documentation (https://cytopy.readthedocs.io/) and make a pull request on our GitHub repository.
5. Supplementary Methods
5.1. FlowCAP
To assess the ability of CytoPy to classify cells we used the datasets provided in the Flow Cytometry: Critical Assessment of Population Identification Methods (FlowCAP) challenge (21), where the challenge is to accurately separate cells into subsets based on single cell phenotype. The FlowCAP-I data consist of four human studies (graft-versus-host disease, diffuse large B-cell lymphoma, symptomatic West Nile virus infection, and normal donors) and one mouse study (hematopoietic stem cell transplant). Data were labelled and pre-processed (removal of debris and dead material, with fluorescence compensation applied) at source by the laboratory responsible for acquiring the original data. Here, classifiers were trained on 25% of data and classification performance tested on the remaining 75%. Performance was reported as the average of weighted F1 scores across all five datasets, where the F1 score for data with |C| set of possible classes is given as:
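As a minimal sketch, assuming the standard support-weighted definition (the F1 score of each class, 2PR/(P + R) with precision P and recall R, weighted by the fraction of true events belonging to that class), the score can be computed as follows. The function name and the toy labels are illustrative and are not taken from CytoPy.

```python
import numpy as np

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = 0.0
    for c in np.unique(y_true):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Weight each class by its share of the true labels.
        total += f1 * np.mean(y_true == c)
    return float(total)

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
score = weighted_f1(y_true, y_pred)
```

Because each class is weighted by its support, a rare population that is classified poorly lowers the score less than an abundant one would, which is why per-class inspection of a confusion matrix remains useful alongside this summary.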
Run time was determined as the number of seconds elapsed for training and classification, as an average across every sample classified. Five supervised machine learning algorithms, housed within CytoPy, were compared without hyperparameter tuning:
Feed-forward neural network with three layers of size 12, 6 and 3 nodes, L2 penalty of 1×10^-4, ReLU activation function on the hidden layers, softmax activation function on the outermost layer, and categorical cross-entropy as the loss function; implemented in Keras v2.3.
XGBoost with default hyperparameters; implemented in XGBoost v0.9.
Linear Discriminant Analysis with singular value decomposition, no shrinkage, and number of components equal to min(n classes − 1, n features); implemented in Scikit-Learn v0.22.
K-Nearest Neighbours with number of neighbours used in constructing tree equalling 5 and ‘ball tree’ algorithm to compute nearest neighbour for classification; implemented in Scikit-Learn v0.22.
Support Vector Machine with radial basis function kernel; implemented in Scikit-Learn v0.22.
In each instance, data were standardised by removing the mean and scaling to unit variance; the standard score z for each observation x is given as $z = (x - u) / s$, where u is the mean and s the standard deviation.
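The standardisation step can be sketched in NumPy as below. This is an illustration under the assumption that each channel (column) is scaled independently using the population standard deviation; the function name and the handling of constant channels are choices made for this example.

```python
import numpy as np

def standardise(events):
    """Column-wise standard scores: subtract the mean, divide by the standard deviation."""
    u = events.mean(axis=0)
    s = events.std(axis=0)
    s[s == 0] = 1.0  # constant channels are left at zero instead of dividing by zero
    return (events - u) / s

# Toy events: 3 cells x 3 channels, with the middle channel constant.
events = np.array([[1.0, 5.0, 7.0],
                   [3.0, 5.0, 9.0],
                   [5.0, 5.0, 11.0]])
z = standardise(events)
```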
5.2. Patients
The study cohort comprised 37 adult individuals receiving peritoneal dialysis (PD) who were admitted between October 2016 and October 2018 to the University Hospital of Wales, Cardiff, on day 1 of acute peritonitis, before commencing antibiotic treatment (34.6% female; median age 68 years, range 22-91 years). Twenty age- and gender-matched individuals receiving PD and with no previous infections for at least 3 months served as stable, non-infected controls (35.0% female; median age 69.5 years, range 28-93 years). Subjects known to be positive for HIV or hepatitis C virus were excluded. Clinical diagnosis of acute peritonitis was based on the presence of abdominal pain and cloudy peritoneal effluent with >100 white blood cells/mm3. According to the microbiological analysis of the effluent by the routine Microbiology Laboratory, Public Health Wales, episodes of peritonitis were defined as infections caused by Gram-positive or Gram-negative organisms. Cases of fungal infection and negative or unclear culture results were excluded from this analysis. Basic patient demographics can be found in the Supplementary Methods and a summary of the bacterial culture results for patients with peritonitis is shown in Supplementary Table S1. All methods were carried out in accordance with relevant guidelines and regulations, and written informed consent was obtained from all subjects. Recruitment of PD patients was approved by the South East Wales Local Ethics Committee under reference number 04WSE04/27, and conducted according to the principles expressed in the Declaration of Helsinki. The study was registered on the UK Clinical Research Network Study Portfolio under reference number #11838 “Patient immune responses to infection in Peritoneal Dialysis” (PERIT-PD).
5.3. Flow cytometry
Peritoneal leukocytes were harvested from overnight dwell effluents and processed as described previously (27; 36); samples were treated with DNase (Sigma; 1:2,500 dilution) when excessive debris was visually apparent. Leukocyte populations in total effluent were stained using monoclonal antibodies against CD1c, CD3, CD14, CD15, CD16, CD19, CD45, CD116, HLA-DR and Siglec-8 (Supplementary Table S2) and identified as CD45+ immune cells, CD3+ T cells, CD19+ B cells, CD15−CD14+ monocytes/macrophages, CD15+ neutrophils, CD15−CD14+/−CD1c+ dendritic cells, and CD15−SIGLEC-8+ eosinophils. T cell subsets in peripheral blood mononuclear cells (PBMCs) and in peritoneal effluent were stained after Ficoll (Ficoll-Paque PLUS; Fisher Scientific) separation of blood and peritoneal leukocytes, respectively, using monoclonal antibodies against CD3, CD4, CD8, CD161, TCR-Vα7.2, TCR-Vδ2, TCR-pan-γδ, CD45RA, CCR7 and CD27 (Supplementary Table S3). Cell acquisition by flow cytometry was performed using a 16 colour BD LSR Fortessa cell analyser (BD Biosciences). Live single cells were gated based on side and forward scatter area/height and live/dead staining (fixable Aqua; Invitrogen).
5.4. Meta-clustering
Meta-clustering was performed to find a consensus amongst the individual clustering results of many individual samples. Each sample was independently normalised; that is, each feature was min-max scaled: $x_{norm} = (x - \min(x)) / (\max(x) - \min(x))$
Where x is the original value for a given feature and $x_{norm}$ is its value scaled between zero and one. Once each sample was individually normalised, the clusters from each sample were extracted and their centroid calculated; by default this was given as the median of their feature vector but other definitions of centre can be used (e.g. mean, geometric mean etc.). Cluster centroids were annotated with the sample they originated from and their original cluster ID, and then concatenated into a single dataframe. This dataframe was then used as the input to a clustering algorithm of the user’s choosing.
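The steps above can be sketched as follows. This is a simplified illustration in NumPy rather than the CytoPy implementation: the function names, the dictionary layout of the input and the toy values are assumptions, and the final clustering of centroids is left to whichever algorithm the user prefers.

```python
import numpy as np

def min_max_normalise(events):
    """Scale each feature (column) of one sample to the range [0, 1]."""
    lo, hi = events.min(axis=0), events.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero
    return (events - lo) / span

def cluster_centroids(samples):
    """samples maps sample_id -> (events array, cluster label array).

    Returns a list of (sample_id, cluster_id) annotations and the matrix of
    median centroids, one row per cluster, concatenated over all samples.
    """
    index, centroids = [], []
    for sample_id, (events, labels) in samples.items():
        scaled = min_max_normalise(events)  # each sample normalised on its own
        for cluster_id in np.unique(labels):
            index.append((sample_id, int(cluster_id)))
            centroids.append(np.median(scaled[labels == cluster_id], axis=0))
    return index, np.vstack(centroids)

# Toy input: two samples, two markers, per-sample cluster labels.
samples = {
    "patient_1": (np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0], [10.0, 110.0]]),
                  np.array([0, 0, 0, 1])),
    "patient_2": (np.array([[1.0, 1.0], [3.0, 5.0], [5.0, 9.0]]),
                  np.array([0, 1, 1])),
}
index, centroids = cluster_centroids(samples)
# `centroids` is the table that a clustering algorithm would then group
# into meta-clusters; `index` records where each row came from.
```

Normalising within each sample before taking centroids means that clusters are compared by their relative position in each sample's own expression range, which is what allows centroids from different patients to be pooled.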
9. Supplementary data
PHATE plots showing the classification of T lymphocyte subsets in whole blood by (A) semi-autonomous gating, (B) XGBoost classification, and (C) PhenoGraph clustering.
Visualisation of feature selection techniques. (A) Variance of population proportions for all classified populations and clusters for cells isolated from peritoneal drain fluid (local) and whole blood (PBMCs). (B) Cell populations and clusters are summarised into common compartments and variation in proportion relative to parent population shown. (C) Support Vector Machine with a linear kernel was used to classify patient phenotype. The coefficient (y-axis) associated with each variable included in the feature space of this classifier is shown as the L1 regularisation parameter (x-axis) decreases. Variables of increasing importance to accurate classification of patient phenotype will take longer to converge to 0 as the regularisation parameter decreases.
Summary of microbiological culture results for peritoneal dialysis patients with acute peritonitis
Staining panel for leukocytes
Staining panel for T lymphocytes
ACKNOWLEDGMENTS
We are grateful to all peritoneal dialysis patients for participating in this study, and to the clinicians and nurses for their cooperation. We also thank Sarah Baker, Chantal Colmont, Donald Fraser, Alexander Greenshields-Watson, Ann Kift-Morgan, Kristin Ladell, Oliwia Michalak and John Pulford for their help and advice. This research received support from the Wales Kidney Research Unit (WKRU), UK Clinical Research Network (UKCRN) Study Portfolio, Medical Research Council (MRC) grant MR/N023145/1, and a School of Medicine PhD Studentship (to R.J.B.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
S1 Appendix
Example Python code for generating semi-autonomous gating templates and applying to multiple samples.
S2 Appendix
Example code for generating plots that visualise univariate and multi-variate inter-sample variation, and generating a ‘similarity matrix’ using Jensen-Shannon distance. The similarity_matrix function outputs a ‘linkage matrix’, sample IDs in an order that corresponds to the linkage matrix, and the similarity matrix plot. The linkage matrix and sample IDs can be given to the function generate_groups along with a desired number of groups (heuristically chosen using the plotted dendrogram; see Figure 4), producing a Pandas DataFrame of sample IDs and corresponding group ID.
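For readers unfamiliar with the measure, the sketch below computes a pairwise Jensen-Shannon distance matrix from pre-binned histograms using NumPy. It is not the appendix code and does not reproduce the internals of similarity_matrix or generate_groups; the function names, the use of simple histograms as density estimates and the toy counts are all assumptions.

```python
import numpy as np

def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (base 2): 0 for identical, 1 for disjoint histograms."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # bins where a is zero contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))

def distance_matrix(histograms):
    """Pairwise distances between samples; returns the sample order and the matrix."""
    ids = sorted(histograms)
    matrix = np.zeros((len(ids), len(ids)))
    for i, a in enumerate(ids):
        for j, b in enumerate(ids):
            matrix[i, j] = jensen_shannon_distance(histograms[a], histograms[b])
    return ids, matrix

# Toy histograms of one marker in three samples (counts per bin).
histograms = {
    "sample_1": [40, 30, 20, 10],
    "sample_2": [38, 32, 19, 11],
    "sample_3": [5, 10, 25, 60],
}
ids, matrix = distance_matrix(histograms)
```

A matrix of this kind can be passed to a hierarchical clustering routine to obtain a linkage matrix and dendrogram, from which the number of groups is chosen.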