Single-cell Bayesian deconvolution

Individual cells exhibit substantial heterogeneity in protein abundance and activity, which is frequently reflected in broad distributions of fluorescently labeled reporters. Since all cellular components are intrinsically fluorescent to some extent, the observed distributions contain background noise that masks the natural heterogeneity of cellular populations. This limits our ability to characterize cell-fate decision processes that are key for development, immune response, tissue homeostasis, and many other biological functions. It is therefore important to separate the contributions from signal and noise in single-cell measurements. Addressing this issue rigorously requires deconvolving the noise distribution from the signal, but approaches in that direction are still limited. Here we present a non-parametric Bayesian formalism that performs such a deconvolution efficiently on multidimensional measurements, in a way that allows estimating confidence intervals precisely. We use the approach to study the expression of the mesodermal transcription factor Brachyury in mouse embryonic stem cells undergoing differentiation.


INTRODUCTION
The inherent stochasticity of biological processes leads to substantial heterogeneity even among genetically identical cells in the same environment [1][2][3] . The degree to which this heterogeneity affects, or even dictates, cellular decision-making in most situations is still an open question. This issue is of paramount importance in processes such as mammalian development, where hundreds (if not thousands 4 ) of distinct cell types (cell states) emerge from a small number of identical undifferentiated cells [5][6][7] . Identifying the molecular mechanisms underlying these cell-fate decision programs and their interplay with cellular heterogeneity 8,9 requires a rigorous quantification of cellular states across large numbers of cells.
Flow cytometry enables monitoring the distributions of abundances and activities of selected proteins for thousands of cells at a time, using fluorescently labeled markers. Cells, however, have a non-negligible amount of autofluorescence in the emission spectrum of most fluorescent probes (500 to 700 nm). This leads to a background noise that must be subtracted from the total signal emitted by fluorescently labeled cells 10 , in order to adequately relate the signal distribution provided by the cytometer to the mechanisms regulating the expression and/or activity of the protein of interest. Several standard methods exist for addressing this issue on a cellby-cell basis. One can use, for instance, additional fluorescence channels outside the emission spectrum of the fluorophores serving as regressor variables 11,12 . Alternatively, a second laser system can provide an independent measurement of both signal and autofluorescence 13 . Beyond issues of cost or accessibility, the effectivity of such solutions is limited, because there is no guarantee that autofluorescence from another channel, or from another excitation source, is a good proxy for autofluorescence in our channel of interest.
Commonly, rather than dedicating measurement resources to assess autofluorescence, control measurements of unlabeled cells are used to set a baseline of the signal coming from the naturally present autofluorescent components in the cell. This procedure, however, does not lead to a quantitative determination of the distribution of the signal coming exclusively from the fluorescent probe. Such quantitative assessment would require deconvolving the fluorescence distribution obtained in labeled cells from the one produced by unlabeled cells. Here, we propose a non-parametric Bayesian approach to this deconvolution problem, applicable to multichannel measurements. The method is robust and efficient, requiring cell numbers not larger than those typically considered in standard flow cytometry runs, and gives natural confidence intervals of the target distributions, which makes it attractive for a variety of applications.
There is an extensive statistics literature addressing the additive deconvolution problem 14 . A common set of deconvolution methods are kernel-based approaches, such as those relying on Fourier transforms 15-21 , which use the fact that in Fourier space, a deconvolution is simply the product of two functions. Two problems arise from such methods that limit their applicability in practical cases. First, Fourier transforms (and other methods that use orthogonal local bases, such as wavelets 22 ) are not positive defined. Consequently, kernel-based methods lead to deconvolved pseudo-distributions with artificial features, which are hardly interpretable for practical applications. Second, these methods usually lead to point estimates, and therefore do not provide native confidence intervals (i.e. without applying additional statistical approximations) that allow us to assess the quality of the inferred target distribution.
A second class of deconvolution approaches are likelihood-based methods 23 , which estimate the unknown target distribution using maximum likelihood approaches. As in the case of kernel-based methods, these approaches provide us with point estimates, and usually assume exact knowledge of the noise distribution. Finally, a third class of methods involve Bayesian inference 24-26 , which does not require complete knowledge of the noise distribution and naturally provides confidence intervals of the estimates obtained. So far, however, these Bayesian methods have been applied to repeated measurements of the same individual entities that are being monitored (in our case, cells), which is not a realistic possibility in standard flow cytometry. Our nonparametric Bayesian approach does not require repeated measurements and retains all the above-mentioned advantages of Bayesian methods. We have implemented the procedure in a Julia package available in GitHub (https://github.com/dsb-lab/scBayesDeconv.jl).
The work is structured as follows. First, we introduce the mathematical description of the additive convolution problem in the context of flow cytometry data, and explain our proposed deconvolution method based on non-parametric Bayesian models. Second, we validate our method using synthetic datasets with known target distributions, and compare its results to other existing methods. Third, we further test our method in real flowcytometry data of mouse embryonic stem cells undergoing differentiation. We test the dataset in two conditions. Initially, we treat our cells with a low concentration of a fluorescent dye (which masks the real flow-cytometry signal and acts as an external noise), and validate our method by deconvolving the noise coming from the dye and comparing it to the control case where the dye was not added. This offers evidence of the robustness of the method in real applications while having a ground truth. Additionally, we perform the deconvolution using several channels from the dataset, to show the effectiveness of the method in multichannel measurements, and comment on the complementarity of our approach to other processing steps common in analysis pipelines of flow cytometry data. We conclude by discussing the limitations of the method and possible ways to improve it in future work.

Theoretical definition of the problem
Consider a population of cells containing fluorescent markers that label the abundances or activities (e.g. phosphorylation states) of certain proteins of interest. Flow cytometry measurements provide us with the distribution p c (C) of total fluorescence signal C emitted by each individual cell in the population, as measured by the flow cytometer detectors. These signals have two components: the fluorescence T emitted exclusively by the target fluorophores that report on the proteins of interest, and the autofluorescence ξ emitted by cellular components other than our fluorescent labels: If these two components are independent of one another, the distribution p c (C) of the measured signal takes the form of a convolution of the distributions of T and ξ: Similarly to p c , the distribution p ξ of the autofluorescence ξ can be measured in the flow cytometer by using unlabeled cells that are otherwise identical to the labeled ones. On the other hand, the probability distribution of T , p T , cannot be measured directly. Our goal is to extract (deconvolve) the distribution p T from the measured distributions p c and p ξ , considering that we only have a finite set of samples (cells) of C and ξ.
In what follows, we first introduce the way in which we describe the distributions involved in the problem. Next we define the posterior distribution given by the model and the data, and finally we discuss the methods used to explore the parameter space for the deconvolution problem.
Mixture model. The vast majority of real datasets (including those generated in flow cytometry) result from complex combinations of variables that cannot be explained in general with simple distributions. In order to adapt flexibly to such conditions, we use mixtures of probability distributions as our basis set. Any function can be arbitrarily well approximated using an adequate choice of basis functions, provided enough components are included in the mixture. We can describe both the target and the noise distributions by independent mixtures of K components: where x is multivariate in the case of multichannel measurements. ω η denotes the weight of each base B(x|ψ η ) in the mixture, and ψ η represents its parameters (the means and covariance matrices in the case of multivariate normal distributions) 27 . The sets of ω η and ψ η are in turn represented by the vector ϕ in what follows. Using these basis functions, the distribution of the observed signal C can be described as a superposition of all the convolutions between them: where (B * B)(c|ψ T η , ψ ξ λ ) represents the convolution of two basis distributions with parameters ψ T η and ψ ξ λ , respectively, and K T and K ξ denote the number of bases used in each of the two mixtures.
The choice of basis functions to describe our data is crucial. We propose to use multivariate normal distributions to describe the data. Normal distributions with unknown mean and covariance have been shown to be flexible enough to represent datasets with high quality, requiring less components than more rigid methods. They have the additional advantage that its convolution turns out to be another normal distribution with modified parameters 28 . This last property is especially interesting for deriving exact analytic results for the Bayesian sampling process.
Posterior distribution and likelihood. Our goal is to extract, starting from samples of the distributions of total signal C and noise ξ, the parameters of the distribution of the target signal T . Defining the problem in terms of probability distributions leads very naturally to work with Bayesian methods. According to Bayes' rule, the posterior distribution that represents the probability of the parameters given the data is are the sets of observed samples of the total signal and the noise, respectively, with N c and N ξ representing the number of samples in each case.
The first bracket on the right-hand side of Eq. (5) is the likelihood function, which corresponds to the probability of the data given the parameters. Under the assumption of independent and identically distributed (iid) observations, this function is given by As can be seen, the information about the noise parameters is contained in both datasets, while the information about the target parameters only appears in the convolved data. As shown in Eq. (5) above, the posterior distribution can be estimated by multiplying the likelihood by the prior distribution of parameter values. Since the datasets required for a good deconvolution are large, the impact of the prior distribution should be negligible. Therefore, the posterior landscape is effectively described by (6), and the prior distribution is only present in our approach for formal and computational reasons. In any case, since no prior information exists on the parameters that describe the noise and target mixture components, we impose vague priors over the plausible set of parameters (see Supplementary Sec. S4).
Sampling. The posterior distribution is a complex multimodal (multi-peaked) object containing all the configurations of the target and noise distributions that are consistent with the observed data. A correct and efficient exploration of this distribution is essential to set bounds on the candidate models that explain the data. The Bayesian model, as stated in Eq. (5), has no direct closed form. Sampling from this model as stated would require the introduction of complex sampling algorithms such as Markov Chain Monte Carlo (MCMC) 28 and nested sampling 29 . These sampling processes are computationally very expensive and scale linearly with the sample size, which in the case of flow cytometry datasets is of the orders of tens to hundreds of thousands samples per dataset.
In order to overcome these shortcomings, we propose two approximations to the posterior distribution. The first one decouples the inference of the noise and target distributions parameters as two independent problems (see Supplementary Section S2.1). The parameters of the noise distribution can be fitted using well-known mixtures of normal distributions, with the target being described by a convolved mixture model. The second approximation simplifies the sampling of the convolved mixture model to enable sampling using standard Gibbs sampling algorithms (see Supplementary Section S4.2). The combination of both approximations allows the posterior to be sampled using Gibbs sampling, which is an efficient MCMC method of sampling from the posterior.
In addition to these approximations, we offer two alternative Bayesian mixtures, namely finite or infinite normal mixture models 30 . The latter approach has the added advantage over the former of not requiring fixing the number of basis functions to describe the data. However, for most practical purposes, finite normal mixture models are able to fit the data with high accuracy and in a more efficient manner regarding how the corresponding algorithms work.
Analysis pipeline. Given the concepts and tools described above, the analysis of the data is performed as follows (Fig. 1). First, the distributions of the target and autofluorescence signals are assumed to be described by mixtures (top left panel in Fig. 1), as defined by Eqs. (3) and (4). The data measured experimentally correspond to the autofluorescence signal (noise) and the total signal of the labeled cells (which includes the autofluorescence), as shown in the top right panel of Fig. 1. The mixture assumption and the observed data samples allow us to construct the posterior distribution (Eq. (5) and middle panel in Fig. 1) from the likelihood function defined by Eq. (6) and the prior distributions discussed in the Supplementary Sec. S4. The posterior distribution is a function of the model parameters (weights of the mixtures and parameters of the basis functions). We next explore (sample) the posterior distribution in parameter space using the Gibbs sampling approximation (bottom left panels in Fig. 1). This algorithm provides us with a representative sampling of the parameters of the target distribution, which allows us to compute an average of this distribution and its confidence interval (bottom right panel in the figure).

Application to synthetic data
First, we benchmark the ability of our method to recover the target distribution by using collections of synthetic datasets. In those datasets, the target and noise distributions are known, so we can compare the result of the deconvolution against a ground truth. To do this, we applied our method to the synthetic data described in Sec. . To prove the robustness of the implementation, we ran the test using four components, both for the noise and target distributions. We also avoided checking the full convergence of the algorithm manually. We ran the algorithm in this suboptimal conditions to avoid having to fine tune the specific parameters, which could lead to positive bias favoring our method in comparison with FFT-based approaches. We contrasted the results of our method with those of a specific FFT approach that does not require knowledge of the autofluorescence distribution 19 . Figure 2 shows a typical instance of the deconvolution performance of the two methods. Panel (a) shows in green the total (convolved) signal mimicking the output of a flow cytometry experiment. In this synthetic case, the signal is obtained by forward convolving a target distribution with the characteristics given above, shown in light blue in panels (b) and (c), with a noise (autofluorescence) distribution, shown in magenta in the inset of panel (a). The goal in this case is to recover (deconvolve) the ground-truth target distribution (panels b and c in Fig. 2) from the total and noise distributions (panel a). Figures 2(b,c) show that our Bayesian deconvolution method recovers the target distribution well. In comparison, the FFT method leads to oscillatory components in the deconvolution, and correspondingly to artefactual negative values in the probability distribution. Additionally, the Bayesian deconvolution method naturally provides a confidence interval, shown by the different samples (red lines) in panel (b) of Fig. 2.
For benchmarking purposes, we compared the deconvolved distribution with the real target distribution, which is known in this case, using the mean integrated overlap (MIO), as defined in Supplementary Sec. S6. We preferred this measure to the mean integrated squared error (MISE), which is commonly used in the deconvolution literature for theoretical reasons 16,19,20,22 , since the MIO measure is easier to interpret, as it corresponds directly to the absolute overlap of two probability distributions. We also avoid more common measures of distribution dissimilarity such as the Kolmogorov-Smirnov test, since such methods would underestimate the ability of FFT-based methods to converge to the ground-truth deconvolution, given that they lead to artifacts in the resulting distributions, as shown above. Figure 3 compares the deconvolution efficiency of the FFT-based method and our single-cell Bayesian deconvolution approach, in terms of the MIO measure that quantifies the similarity between the convolved distribution and the real one. According to its definition (see Supplementary Sec. S6), a MIO value of 1 corresponds to a perfect overlap, while the measure is 0 when two distributions do not overlap at all. As can be seen in Fig. 3, the single-cell Bayesian Deconvolution method outperforms the Fourier-based method in almost all the cases considered, the difference being more substantial for high levels of noise (small SNR, circles). In particular, the overlap is never below 0.7 for the Bayesian methods, while it can reach values near 0 in the FFT case depending on the type of distributions involved, particularly for low sampling numbers.
Additionally, it is worth noting that while the singlecell Bayesian deconvolution method is able to reproduce almost perfectly the target distribution (MIO close to 1) for large enough dataset sizes, the Fourier-based method always saturates to a suboptimal level even as the number of samples increases, even for low noise levels (large SNR, crosses in Fig. 3). Qualitatively, this is due to the oscillatory features in the distribution produced by the noise during the deconvolution with the FFT method, as a consequence of the oscillatory nature of the Fourier basis functions. In general, it has been proved that in FFT methods the convergence to the actual distribution grows sublinearly with the sample size, with scaling dependencies that make the method unfeasible for practical purposes 16,19,25 . Since Bayesian deconvolution makes use of local basis functions, more parsimonious solutions can be obtained, preventing the degradation of the target distribution due to the noise.
In addition to the direct comparison performed between FFT and Bayesian deconvolutions, we compared the results of the deconvolution against a model that simply fits the convolved data ignoring the convolution problem. A condition to impose on any practical deconvolution algorithm is that the result of the deconvolution is not worse than not doing the decomposition at all. The results of this comparison can be observed in Supplementary Figure S2. The results show that, while the Fourier-based method generally tends to make worse predictions than the null model (fitting without deconvolution), the Bayesian approach consistently generates results virtually as good as when the noise is negligible (SNS=10), and improves systematically the distribution in the high-noise case (SNS=1) over the null model. In conclusion, the Bayesian method that we propose has combined benefits, both qualitatively and quantitatively, over other basis methods as FFT and consistently improves the result (or at least not degrades the quality of the distribution).

An experimental dataset with external noise
Next we applied our method to an experimental dataset in which the noise distribution is unknown. To generate a ground truth, we use cells with the Brachyury reporter T/Bra::GFP in two media conditions (N2B27 supplemented with 3 µM CHIR99 and DMSO as control). At day 3, for each condition, one of the replicas was treated with 20 nM Green CMFDA dye and incubated for 3 min prior to flow cytometry, and the second one was incubated for 3 min with N2B27 (see Methods). Given that the dye incorporates in the cell cytoplasm and its emission spectrum is similar to the one of GFP, the dye acts as a noise source to the GFP signal coming from the Brachyury reporter. Consequently, we have the following four conditions, with their potential outcomes in terms of the signal measured in the FITC-A channel: (c1) Control: Brachyury expression is minimal, and the signal comes mainly from the intrinsic autofluorescence of the cells.
(c2) Control+CMFDA: Brachyury expression is minimal, and the signal comes mainly from the CMFDA dye, plus intrinsic autofluorescence of the cells.
(c3) CHIR99: Brachyury expression is upregulated, and thus the signal contains both the T/Bra::GFP reporter and the intrinsic autofluorescence of the cells.
(c4) CHIR99+CMFDA: Brachyury expression is upregulated, and thus the signal contains both the T/Bra::GFP reporter, the signal from the CMFDA dye and the intrinsic autofluorescence of the cells.
The four signal distributions corresponding to these four conditions are shown in Supplementary Fig. S3. We note that the distribution of the signal coming only from the dye cannot be measured independently, given the intrinsic autofluorescence of the cells. Moreover, the dye might be absorbed differently depending on the state of the cells (CHIR99 treatment), and the low concentration of the dye (∼ nM) might lead to significant cell-to-cell variabilty on the absorbance of the dye. These features make it a suitable dataset to test the robustness of our method. We first used our method to deconvolve the signal of the dye from the total signal observed in the experimental conditions (c2), using condition (c1) to define the 'noise' distribution. The results of the deconvolution are shown in Supplementary Fig. 4, which presents the dye distribution deconvolved from experimental conditions (c1) and (c2). Once the dye distribution was obtained from the deconvolution of conditions (c1) and (c2), we tested the consistency of our approach by considering the dye signal as the noise in condition (c4). Deconvolving the dye distribution from the one measured in (c4) should lead to the distribution obtained in condition (c3), which we can measure experimentally and thus serves as the ground truth in this case. The result can be seen in Fig. 4. Even in this case, where we use a deconvolved dataset for a second deconvolution, we observe that the inferred distribution resulting from our method is in good agreement with the experimentally measured distribution in condition (c3) (MIO = 0.81 ± 0.04) and improves over the null model that ignores the noise (MIO = 0.46 ± 0.01).
Overall, this experiment shows that our method is robust even when our conditions (target and noise independence) are not fully met, and over other fluctuations present in real datasets.

Multidimensional distributions
Finally, we applied our method in a real flow cytometry dataset with multichannel detection. To that end, we studied the expression of the mesodermal gene Brachyury through a GFP reporter (T/Bra::GFP) in mouse embryonic stem cells (see Methods). In this dataset, two populations are expected to appear: a Brachyury negative population of pluripotent cells, and a positive population capturing the cells exiting from pluripotency towards mesodermal fates. Cells were treated for 24h with 3 µM CHIR99 and 25 ng/mL Activin A to upregulate Brachyury prior to flow cytometry on day 2 (see Methods). The signal of the GFP reporter was our target signal, and it was acquired through the FITC-A channel.
In a normal flow cytometry dataset, the signal contains the fluorescence emitted by GFP together with the autofluorescence emitted by the cells at the GFP frequency. The effect of the autofluorescence over the T::GFP signal (target) gives the erroneous impression that Brachyury is expressed in the two populations, when looking directly at the convolved data (light green dots in Figure 5). In other words, it seems there is a non-zero basal expression of the gene in the pluripotent population. When deconvolving the data through the Bayes method, however, the result shows that the pluripotent population peaks actually around zero, as expected since these cells are expected not to show any emission of Brachuyry, which is confirmed by the absence of mRNA in the pluripotent states 31 . Similar problems of non-negative detection have been reported in other highly autofluorescent populations, as in the case of the false positive detection of MHCII in microglia 32 , and the false discovery of Foxp3 expressing cell types that are not T cell lineages 33 .
As can be observed from the other channels (See SSC-A in Figure 5 and all the channels in Figure S5), the deconvolution has an effect in removing the expected variance from all the other channels. The left variance may be an effect of the spillover energy that the GFP fluorophore has in the emission spectrum of other channels as the contribution to broad-band channels as FSC and SSC channels (see for example the effect in the FSC-A in Figure 5). The use of deconvolved distributions in single fluorophore samples could improve the estimation of spillover coefficients for sample correction 34 .
In general, the Bayes deconvolution is straightforwardly applicable to real multichannel datasets and its results are consistent with the expected biological observations, showing a clear correction of defects coming from the presence of autofluorescence in the detection channels.

DISCUSSION
In this paper, we propose a Bayesian approach to obtain flow-cytometry distributions of protein abundance or activity convolved with a known source of noise. The method, which relies on non-parametric Bayesian techniques, is freely available as a Julia package (https://github.com/dsb-lab/scBayesDeconv.jl) and can be used in a straightforward manner in a purely computational way, without the need of dedicated measurement channels or additional laser sources. It only requires measuring the fluorescently labeled and unlabeled cells and, unlike previously proposed deconvolution methods, it provides well-defined and robust probability distributions, described by mixtures of basis functions.
We measure the quality of the results obtained with our method by comparing the deconvolved distributions with known (ground-truth) target distributions using synthetic data and ad hoc experiments with mouse embryonic stem cells. We argue that the use of local basis distributions to describe all the distributions involved in the problem (both measured and unknown), and the corresponding use of a relatively small number of degrees of freedom, leads to an efficient inference. Finally, the Bayesian nature of the method gives rise to a set of candidate target probability distributions. This reflects the natural indeterminacy present in any process corrupted by noise. The ability to express indeterminacy in the solutions is crucial for any real application of a deconvolution algorithm. We further show that the method is robust and consistent even under strong noise in real datasets.
The approach that we propose here is applicable to flow cytometry datasets with multiple channels, which is the current outcome of modern flow cytometers. Furthermore, the method is consistent with already wellestablished methods for flow cytometry processing, such as the calculation of spillover matrices and the reduction of autofluorescence using additional channels for autofluorescence regression, as discussed in the Supplementary text (Section S1.2). In this sense, our deconvolution approach could be applied to calculate corrected onefluorophore samples for the calculation of the spillover coefficients, or it could be used after calculating the spillover uncoupling with another method, in order to remove the remaining sources of noise in the dataset.

LIMITATIONS OF THE STUDY
A potential limitation of the algorithm is the computing time requirements. For K n mixture components of the noise and K t components of the target, a single evaluation of the likelihood function scales as O(K n ×K t ×N ), where N is the number of measures in the Bayesian model. The current finite mixture implementation is capable of dealing with datasets of 100000 cells with K n = 4 and K t = 6 in about 10 minutes in a desktop with an i7 processor. This efficiency will be sufficient for most part of the analysis, but can be a drawback for the analysis of very large datasets when compared with non-Bayesian methods. The computational time required by the infinite mixture model depends strongly on the parameter α, which governs the probability of generating new basis functions, but in general this model is more than an order of magnitude slower than its finite counterpart for the same degree of complexity. Handling large datasets is a well-established problem in Bayesian statistics, and some lines of work have explored solutions to reduce this computational cost, which might be worth exploring in the context of the single-cell Bayesian deconvolution method proposed here [35][36][37] .
A second potential improvement of the method presented here would be to address the degeneracy under label exchange characteristic of mixture models. Such degeneracy may confound the convergence of the samples to a stationary distribution and the use of the samples for the discovery of clusters, as the same cluster may be assigned to different basis in different samples. A potential solution to this problem would be the implementation of label reassignment methods 38 .

ACKNOWLEDGEMENTS
We thank the CRG Tissue Engineering Unit and the UPF/CRG Flow Cytometry Unit for continuous support. This work was supported by the Spanish Ministry of Science and Innovation and FEDER, under projects FIS2017-92551-EXP and PID2021-127311NB-I00, by the "Maria de Maeztu" Programme for Units of Excellence in R&D (grant CEX2018-000792-M), and by the Generalitat de Catalunya (ICREA Academia programme). GT is supported by a FPU doctoral fellowship from the Spanish Ministry of Education and Universities (reference FPU18/05091). D.O. acknowledges funding from Juan de la Cierva Incorporación with Project no. IJC2018-035298-I, from the Spanish Ministry of Science, Innovation and Universities (MCIU/AEI).

Synthetic data
For the target distribution we generated samples from three distributions: symmetric bimodal, asymmetric bimodal and skew symmetric distributions ( Supplementary  Fig. S1). This choice of distributions intends to capture the features present in real datasets, such as the presence of multiple peaks, different cluster sizes, and the generally non-Gaussian character of the data 25 . As for the noise, we generated a set of three different noise datasets containing normal, skewed, and Student's t distributions (the last two of which exhibit fat tails) ( Supplementary  Fig. S1), in order to test the flexibility of the method against very dissimilar autofluorescence profiles. In order to check the impact of the noise strength, the convolutions between target and noise were generated at two signal-to-noise ratios (SNR). In our context, we define the SNR as the ratio between target and signal variances for the whole dataset. We chose a case with negligible noise (SNR = 10), and a difficult case where the noise is of the same magnitude as the signal (SNR = 1). We generated sample datasets of different sizes, with 100, 1000 and, 10000 samples. The high range of these values is representative of typical single-cell flow cytometry experiments. The combination of the different target distribution types, noise distribution types, SNRs, and sample sizes generates a collection of 54 datasets with known ground truth.

Flow cytometry
E14 mouse embryonic stem cells containing a knock-in fluorescence reporter for the mesodermal transcription factor Brachyury, T/Bra::GFP were used 39 . Cells containing the T/Bra::GFP reporter were cultured in ES-Lif (ESL) medium (KnockOut Dulbecco's Modified Eagle Medium (DMEM) supplemented with 10% fetal bovine serum (FBS), 1x Non-essential aminoacids (NEEA), 50 U/mL Pen/Strep, 1x GlutaMax, 1x Sodium Pyruvate, 50 µM 2-Mercaptoethanol and leukemia inhibitory factor (LIF)). Cells adhered to 0.1% gelatin-coated (Millipore, ES-006-B) tissue culture-treated 25 cm 2 (T25 Corning 353108) plates, and were passaged every second day, as previously described 40 . Cells were kept at 37 • C with 5% CO 2 , and were routinely tested and confirmed to be free from mycoplasma. Flow cytometry experiments were performed as follows: On day 1, cells were trypsinized and seeded into gelatin-coated 6-well plates (Corning, 353224) to a final density of ∼ 10 5 cells/well in 3 mL ESL media. On day 2, the media was replaced by first washing twice with 3 mL DPBS +/+ (Phosphate buffered saline containing Mg ++ and Ca ++ , Sigma, D8662) and then adding NDiff227 media (N2B27) (Takara Bio, #Y40002) 40 with the appropriate combination of Brachyury activators (3 µM CHIR99, Sigma, SML1046 and/or 25 ng/mL Activin A, Bio-Techne, 338-AC-010). The same procedure was followed on day 3. For the control conditions, the corresponding volume of dimethyl sulfoxide (DMSO) was added to the medium. Flow cytometry data was acquired on day 2 or day 3, depending on the type of experiment. For the extrinsic noise experiment, cells were incubated with 1 mL of 20 nM CellTracker Green (CMFDA, Thermofisher) for 3 min prior to trypsinization and flow cytometry. The data was acquired using an LSRIIb flow cytometer, with 2 × 10 4 cells being analyzed per condition. DAPI labelling was used to discard death cells and debris. Furthermore, cell doublets were discarded from the analysis. The readout of protein expression (peak of the signal) was obtained through the FITC-A channel. Measurements were extracted using the FACSDiva software and exported in a Python format for subsequent analysis.

Software
The software pipeline presented here, including MCMC for finite and infinite normal mixtures, has been implemented as a Julia package (scBayesDeconv.jl). The source code, with manual compilation and installation instructions, as well as full documentation and a notebook with examples of use, can be obtained publicly from GitHub (https://github.com/dsblab/scBayesDeconv.jl). The target signal is emitted by a fluorophore that reports on the abundance (or activity) of a protein of interest within each cell. This signal is affected by autofluorescence (which can be considered a source of noise) produced by elements of the cell other than the fluorophore. Due to the noise, the total signal C measured by the device is not directly T , but

S1.1 Deconvolving signal from noise
We are interested in the case in which we cannot measure both the signal C and the noise ξ independently in the same cell, and thus T in that cell cannot be calculated trivially via Eq. (S1). This limitation is typical of flow cytometry experiments, in which cells can only be measured once. In this case, the only information that can be extracted from the device consists of the distributions of the measured signal, p c , and of the background noise by itself, p ξ (by measuring cells without fluorophore), over large populations of cells (tens of thousands in a typical flow cytometry run). We can assume the samples to be independent and identically distributed (iid).
If T and ξ and independent of each other, the distributions defined above are related to one another by means of a convolution: In what follows, we describe a method to extract the distribution p T of the target variable T from the observed distributions of C and ξ via a deconvolution of Eq. (S2). The method is applicable to any measurement technique that provides distributions of a signal affected by noise.

S1.2 Generalization to multichannel measurements
where c k , t j and ξ k are realizations of the random variables C, T and ξ defined in Eq. (S1) above. The subindex j runs over the N T fluorophores (one per target), and the subindex k runs over the N ch channels (so that ξ k represents the contribution of the autofluorescence to channel k). The terms S jk define the spillover matrix S, that quantifies how the fluorophore signals are spread over all the measuring channels.
The spillover matrix can be estimated from single fluorophore controls by using regression methods [1].
To that end, one can perform control experiments in which only one fluorophore is present, and measure the resulting signal in all the channels: where the subindex j now corresponds to the only fluorophore present in the system, the superindex (j) indicates that the experiments were performed in the single-fluorophore condition, and c (j) ik denotes a total signal in channel k coming from cell i when only the fluorophore j is present in the system. This set of equations is underdetermined, as we do not know neither the real target signal t (j) ij nor the components S jk of the spillover matrix. However, if we focus for the moment on the channel that corresponds to our single fluorophore (k = j) and impose S jj = 1, we can write down that c ij (ignoring for now the autofluorescence of that channel). This allows us to establish a set of linear regression problems whose solution enables the estimation of the spillover matrix components for which we have single-fluorophore controls: The spillover coefficients S jk can then be estimated using robust regression techniques that remove the noise coming from outliers [1].
A similar method can be used to reduce the noise coming from the autofluorescence. To that end, one can define an effective "fluorophore signal" t in coming from the autofluorescence, with an associated additional channel n not linked to a real fluorophore used in the sample. If the signal in this channel is correlated with the emission of the autofluorescence ξ over the other channels, we can decompose the autofluorescence signal at every channel k into a regression term and a noise term: The coefficients S nk indicate how the autofluorescence signal is spread over the other channels. The fluorophore signals and the autofluorescence "signal" can be grouped in a single term: where now the sum over j also includes the autofluorescence signal. We can then calculate the spillover coefficients with the method above, by using the additional channel n as an additional "fluorophore control" for the autofluorescence signal. We also note that writing the autofluorescence as a regression problem in Eq. (S6) will lead in general to a reduction of the noise in all channels: Finally, to obtain the original signal, we can multiply Eq. (S7) by the inverse of the spillover matrix: We emphasize that this method requires that the autofluorescence signal at the "noise channel" is sufficiently correlated with its contribution at the other channels, which is not necessarily true a priori. Also, Eq. (S9) shows that the need to deconvolve the noise stands even after the application of the spillover and autofluorescence corrections proposed in the literature. In fact, the deconvolution method that we propose in this article is compatible with existing methods of autofluorescence correction such as the regression method reviewed above, which can be performed before applying our method to the corrected system (S9).

S2 Mathematical statement of the problem
Since we do not know the exact underlying distributions, we model them as potentially infinite mixtures of normal basis functions. The Gaussian mixture models for the target and the noise distributions can be represented as where N (x|µ i , Σ i ) denotes a normal distribution on the variable vector x (whose components are the different measurement channels), with mean µ i and variance Σ i . The mixture decomposition defined in Eqs. (S10) allows for a flexible and robust representation of unknown and generic distributions. Moreover, exploiting the fact that the convolution of two Gaussian distributions is Gaussian, we have an analytical expression for the distribution of the total variable C: where ϕ T = {ω T , µ T , Σ T } and ϕ ξ = {ω ξ , µ ξ , Σ ξ } represent all the parameters of the target and noise distributions, from which we can parametrize the distribution of the total measured signal. We note that the combination of normal basis function N in Eq. (S11) above is invariant under changes in the individual basis functions of T and ξ, provided the sums of means, µ T + µ ξ , and variances, Σ T + Σ ξ , are constant.
As discussed in the main text, according to Bayes' rule, the posterior distribution that represents the probability of the parameters given the data is the joint probability of observing the data given the parameters, which we can redefine as The form of the likelihood (S13) breaks the symmetry between the target and noise signals, and thus lifts the above-mentioned degeneracy between their parameters exhibited by Eq. (S11). The second bracket in the right-hand side of Eq. (S12), in turn, corresponds to the prior distributions of all the parameters of the problem, which we define in what follows.

S2.1 Approximated decomposition of the posterior into two separate problems
It is worth noting that the posterior distribution (S12) can be decomposed as follows: where we have removed the dependency of the noise signal in the first term of the right-hand side, since only the total signal c defines the parameters of the target distribution, as can be seen from the likelihood 6 (S13). The second term, on the other hand, is conditioned by both the noise and the total signal. Since we usually have as much data for the noise signal {ξ} as for the signal of interest {c}, we can approximately consider that the posterior distribution of the noise mixture parameter ϕ ξ is well represented by the noise data alone: With this approximation, the problem can be decomposed in two separate subproblems: first, finding the probability distribution of the parameters ϕ ξ of the noise mixture; and second, finding the distribution of the parameters ϕ T of the convolution mixture, conditioned on the noise mixture parameters.
In the following section, we go over the mathematical details of the probability distributions that will allow us to sample from the posterior distribution.

S3 Relevant probability distributions
In this section, we derive the main expressions that we need to sample our model. Our aim is to have a self-contained derivation of the sampling process of the posterior of a Gaussian normal mixture.

S3.1 Multivariate normal distribution with unknown mean and error
Here we derive the main results of interest on multivariate normal distributions, which we will use when sampling the posterior distribution using normal bases.

S3.1.1 Likelihood
The multivariate normal distribution for a set of independent identical samples (iid) {x} has the form, up to a scaling parameter, where µ is the mean and τ is the precision parameter and n is the number of cells being measured. The precision parameter relates to the covariance matrix as It is convenient to rearrange the distribution to make explicit the dependency of the summary statistics: where the summary statistics are the mean,x and the covariance,

S3.1.2 Conjugate prior
A convenient prior for the multivariate distribution is a conjugate prior that allows us to obtain the analytic form of the other distributions: where W −1 is the inverse Wishart distribution. The hyperparameter κ 0 represents our confidence in the estimation of the mean: the higher κ 0 is, the closer to the mean we will be.
The inverse Wishart distribution has the following shape, up to scaling terms: where p is the number of dimensions. The parameter Σ is the correlation matrix and ν represents our confidence in the estimation of the Σ correlation matrix. 8

S3.1.3 Posterior distribution
The posterior distribution will have the following shape: We can use the posterior distribution to obtain different distributions of relevance.

S3.1.4 Conditional distribution: mean
Retaining the terms involving the mean µ, the conditional distribution of the mean takes the form of a multivariate distribution: where in the last step we completed the squares. This multivariate normal distribution has effective whereμ is a weighted version between the mean statistic and the prior mean, andτ is an effective precision parameter. The κ 0 factor indicates how close we are from the prior mean.

S3.1.5 Conditional distribution: covariance
Retaining the terms involving the covariance matrix (and its inverse, the precision matrix), the conditional distribution can be shown to take the form of an inverse Wishart distribution: where the effective parameters areñ In this last expression, it is worth noting that each term represents a different kind of uncertainty: the first term is the uncertainty coming from the mean statistic, the second corresponds to the uncertainty with which the mean statistic represents the actual mean, the third term is the uncertainty coming from the prior mean, and the last term is the uncertainty coming from the prior itself.

S3.1.6 Marginal distribution: covariance
One additional distribution that we will need is the marginal distribution of the variance: Reorganizing the elements in ( * ), where we have defined,μ we can further regroup the part in ( * * ).
Inserting these terms now we can calculate the integral as a multivariate normal: where in the last step we use the fact that detτ ∝ det τ and the effective parameters of the inverse

S3.1.7 Posterior predictive distribution
One last distribution of interest is the probability of a new data point given the already observed set of observations: We can rearrange the elements in ( * ), completing squares as we did when calculating the mean conditional distribution (sectionS3.1.4): where we have retained all the terms involving τ and µ this time, as they are necessary for integrating out. The effective parameters are nowμ µ = nx + κ 0 µ 0 + y n + κ 0 + 1 (S42) We can further rearrange the terms in ( * * ) to obtain an expression with the y in quadrature: where in the second to the third step we group by quadrature and complete squares to group the terms with y, and then group by quadrature the terms on the left. The effective parameters arẽ We can now insert the obtained term in the original expression: where the effective covariance has the form, The mean parameter in (S46) only appears in the term in brackets, so we can integrate it in a straightforward manner as a multivariate normal integral: The last integral is an inverse Wishart that we can integrate directly, leading to We can now reorganize the effective covariance. If we define the matrix we can rewrite the effective covarianceΣ as 13 which is a multivariate T-distribution with

S3.2 Finite mixture distributions
We derive in this section the statistics relevant to mixture models.

S3.2.1 Likelihood
The likelihood of a finite mixture model with K components has the form It is straightforward to see that, if we take the marginal distribution over the hidden variables, we recover the original distribution: where only when the indicator variable is one of the corresponding term survives.
The use of indicator variables makes it possible to compute analytically the posterior distribution of the mixture model, as the base distributions are now in product form. The hidden indicator variables are not known, thus we will have to sample from them as well.

S3.2.2 Conjugate prior
A conjugate prior for the mixture model is where the Dirichlet distribution has the form The hyperprior parameters α are usually set to be symmetrical and to scale with the number of mixture components: In this way, the prior distribution only depends on one hyperparameter α that indicates the strength from the uniform weights.

S3.2.3 Posterior distribution
Putting together the likelihood and the prior distribution, the posterior of the mixture model is where the last term is the set of priors for each base distribution.

S3.2.4 Conditional distribution: weights
Taking the terms from the posterior that involve the weights: whereñ j = i z ij + α.

S3.2.5 Conditional distribution: indicator variables
As we already mentioned, the indicator variables are not known, so we have to sample from them too.
As we are considering identically independent samples, we can obtain the conditional distribution from each indicator variable independently as It is worth noting that the indicator will be sampled from a particular base distribution for the weights of that base, but also by how well that sample lies inside the base distribution.

S3.2.6 Conditional distribution: base distribution parameters
Finally, because the base distributions are in product form due to the introduction of the indicator variables, the parameters of each base can be computed independently as where {x} j represents the subsample of cells whose indicator variable belongs to the corresponding mixture component.

S3.3 Infinite mixture distributions
In this section we define basic results from infinite Dirichlet processes (for an insightful tutorial see [2]) that allow us to consider infinite mixtures.

S3.3.1 Taking the infinite limit
In order to take the limit to infinite clusters, we need to remove the dependence from the number of clusters in a mixture model. For that, we need to remove the dependence on the weights of our probability distribution. Consider for the moment a basis distribution that is uniform in space. The joint probability distribution conditioned on the priors would be where the first term in the right-hand side is the likelihood of the indicator variables as in (S58), also given in (S65), which is a multinomial distribution. The second term is the prior distribution of the mixture distribution (S61), which is a Dirichlet distribution. From this expression, we can calculate the marginal distribution of the indicator variables conditioned to the prior parameter: where the third equality makes use of the statistic n j = i z ij . The integral can be identified as a multinomial distribution without the scaling factor.
The expression above still contains explicitly the dependence on the number of components in the mixture K, and thus its limit K → ∞ cannot be computed in a straightforward manner. To have a more amenable expression, let us consider that we take out the sample l and reassign it to a new cluster. The probability of this sample to be assigned to any of the clusters, conditioned on all the other indicator variables, is where have made explicit the range of cells over which the samples {z} are taken in each case. To that end, we define {z} ¬l ≡ {z} i∈(1,...,l−1,l+1,...n) and n ¬l ≡ i∈(1,...,l−1,l+1,...n) z ij to refer to the subsets that contain all elements except l. With this expression, it is straightforward to take the limit to infinite components: We now have the probability that a sample is assigned to a base that has other indicator variables. Let us now consider for the moment that we have K bases with assigned samples. The probability that a sample is assigned to a new base will be: There is a non-zero probability that the cell will be assigned to a new base that was not populated before, and the probability of populating this new cluster will depend on the hyperparameter α.

S3.3.2 Adding a non-uniform basis
The results above have been derived considering a uniform basis distribution. In the most general case, the basis will be non-uniform. From Eq. (S58) it is very easy to see that if we had a non-uniform distribution, the corresponding term will drop out of the integral (S68). Proceeding in the same way as before, all the terms in (S69) will cancel out, except p(x l |ϕ k ).
The probabilities of a new assignation will be for a basis with a populated sample and for the creation of a new basis.

S3.3.3 Using the predictive posterior distribution
Similar way to the approach used in Sec. S3.1.7, we can directly predict the new outcomes in the case of an infinite mixture as a function of already observed data.
The probabilities of a new assignation will be for a basis with a populated sample and for a new basis.

S3.4 Modifications of the convolution distribution
Introducing indicator variables as indicated in S3.2.1, the convolution of two multivariate mixture distributions like the one described by (S11) takes the form where the indicator variable is now 1 if the sample i belongs to the noise base j and target base k.
Considering the approximation described in S2.1, the only sampling parameters that we have to go over will be {ϕ T }, the weights ω T and the indicator variables z. The main challenge is that there is no close form to group all the terms involving Σ T in single effective distributions as the ones derived in S3.1. In order to get a tractable expression, we would like to transform the convoluted covariances in such a way that the following expression follows: for given matrices A, M . Isolating, A we obtain that Now, in most practical cases we can consider that the convoluted covariance will be close to the expected covariance matrix, and we can approximate the expression above by where the effective expected covariance iŝ All the results derived in the preceding sections take into account this approximation: The weights of the target distribution (S3.2.4) will be computed using the sum over the samples for all the noise samples Sampling from the mean and covariance parameters in the distribution needs, however, a closer look.

S3.4.1 Prior distribution
In order for the prior to be conjugated, we have to scale the prior distribution in terms of the prior of the noise distribution.

S3.4.2 Conditional distribution: covariance
If we focus the analysis in a single set of target parameters ϕ k , the posterior probability of the variance and the mean has the form In contrast with the case without convolution, the normal distributions cannot be grouped together in a simple distribution that can be sampled by standard procedures. We can group the data coming from a specific dataset, where the variance is Within the proposed approximation, we can now group all the different distributions: and we can sample new covariance matrices for the target with the inverse Wishart distribution.

S3.4.3 Conditional distribution: mean
The mean distribution can be easily computed, regrouping all the terms in the following summary statis- and sampling from a multivariate normal,

S4 Hyperparameter selection
Our approach the has five hyperparameters: • α: Determines how close we are from a uniform distribution of weights (finite mixtures), or the potential of generating a new basis (infinite mixture).
• µ 0 : The center of the prior normal distribution µ 0 .
• κ 0 : The confidence we have of being close to µ 0 .
• Σ 0 : The covariance of the prior normal distribution.
• ν 0 : The confidence we have of being close to σ 0 .
Flow cytometry applications have in general have large datasets, making the approach quite insensitive to the choice of α, κ 0 and ν 0 , as the statistics will dominate the model. The choice of mean and covariance matrix, however, depend on the data distribution. Appropriate estimators of these parameters are the mean and the covariance matrix of the whole dataset. This imposes a soft-informative prior over the region where the density should lie. In cases where the distribution has fat tails, the density prior can be very flat because of the outliers, making it difficult for the algorithm to find the correct distribution as the prior spreads over the basis components. In these cases, narrower prior covariance matrices can help to fit the model correctly.

S5 Gibbs sampling algorithms
In this section, we describe the algorithms to sample the noise distribution according to the approximation described in S2.1.

S5.1 Efficient updating of summary statistics
Once we have computed the mean or the variance over a set of samples {x} and we add or remove a single sample, it is possible to update the statistic without having to recompute the metric fully over the new set, which in general will be more time-consuming as we will have to go over a sum over all the size of the set.
If we remove a sample: On the other hand, if we add a sample: where p true is the real target distribution and p est is the point estimate of the deconvolved distribution.
This is the traditional measure of convergence, but it is hard to interpret as it only has a lower bound. To address this issue, we introduce the mean integrated overlap (MIO):    Figure S5: Scatter plot sections of the multichannel dataset. The top right triangle shows the autofluorescence distribution (orange) and samples from the fitting process (red). In the low left triangle we show the convolution data (light green) and samples from fitted convolution (dark green) and the deconvolution (light blue). In the diagonal, we show the 1D distributions of the convolved and deconvolved results.