A semi-supervised Bayesian mixture modelling approach for joint batch correction and classification

Systematic differences between batches of samples present significant challenges when analysing biological data. Such batch effects are well-studied and are liable to occur in any setting where multiple batches are assayed. Many existing methods for accounting for these effects have focused on high-dimensional data such as RNA-seq and have assumptions that reflect this. Here we focus on batch-correction in low-dimensional classification problems. We propose a semi-supervised Bayesian generative classifier based on mixture models that jointly predicts class labels and models batch effects. Our model allows observations to be probabilistically assigned to classes in a way that incorporates uncertainty arising from batch effects. By simultaneously inferring the classification and the batch-correction, our method is more robust to dependence between batch and class than pre-processing steps such as ComBat. We explore two choices for the within-class densities: the multivariate normal and the multivariate t. A simulation study demonstrates that our method performs well compared to popular off-the-shelf machine learning methods and is also quick, performing 15,000 iterations on a dataset of 750 samples with 2 measurements each in 11.7 seconds for the MVN mixture model and 14.7 seconds for the MVT mixture model. We further validate our model on gene expression data where cell type (class) is known and simulate batch effects. We apply our model to two datasets generated using the enzyme-linked immunosorbent assay (ELISA), a spectrophotometric assay often used to screen for antibodies. The examples we consider were collected in 2020 and measure seropositivity for SARS-CoV-2. We use our model to estimate seroprevalence in the populations studied. We implement the models in C++ using a Metropolis-within-Gibbs algorithm, available in the R package batchmix. Scripts to recreate our analysis are at https://github.com/stcolema/BatchClassifierPaper.

and we write b = [b_1, . . . , b_N] for the collection of all N batch labels. Note that as each individual belongs to a single batch, we assume that all P measurements for each individual are part of the same batch. We wish to predict class labels for each individual, and write c = [c_1, . . . , c_N] for the collection of all class labels. We assume that the number of classes, K, is known, so that each c_n ∈ {1, . . . , K}. We assume that a subset of labels are observed and that each class is represented in this subset.

2.2 Model specification

We use a K-component mixture model to describe the data X. The mixture model can be written

p(X_n | π, θ) = Σ_{k=1}^K π_k f(X_n | θ_k),

independently for each n = 1, . . . , N, where π = [π_1, . . . , π_K] is the vector of component weights, f(·) is a parametric density function, and θ_k are the parameters of the k-th component. We assume each component describes a single and distinct class in the population and use the class labels to rewrite the model as p(X_n | c_n = k) = f(X_n | θ_k).
Then conditioning on the observed batch label we have p(X n |c n = k, b n = b) = f (X n |θ k , z b ).
We focus on continuous data where each measurement has support across the entire real line. We consider the multivariate t density (MVT, density denoted f_t(·)) and the multivariate normal (MVN, density denoted f_N(·)) as choices for f, but depending on the situation other choices could be more relevant and our model is not inherently restricted to these. We use z_b = (m_b, S_b), choosing m_b to be a P-vector representing the shift in location due to the batch effects and S_b to be a scaling matrix. We assume the observed location of X_n is composed of a class-specific effect, µ_k, and a batch-specific effect, m_b, so (X_n | c_n = k, b_n = b) = µ_k + m_b + ε_n. Similarly we assume that the random noise, ε_n, is subject to class and batch specific effects Σ_k and S_b respectively.
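As a concrete illustration of this decomposition, the sketch below draws a single observation under the MVN choice. All parameter values are invented, and the way the batch scale S_b combines with Σ_k (written Σ_k ⊕ S_b in the paper) is assumed additive here purely for illustration; the exact combination is defined in the supplementary material.

```python
import numpy as np

rng = np.random.default_rng(1)
P = 2  # measurements per sample

# Class-specific effects (illustrative values, not from the paper)
mu_k = np.array([1.0, -0.5])          # class mean
Sigma_k = np.array([[1.0, 0.3],
                    [0.3, 1.0]])      # class covariance

# Batch-specific effects: a location shift and a diagonal scaling
m_b = np.array([0.4, 0.4])            # batch shift
S_b = np.diag([0.5, 0.5])             # batch scaling (diagonal by assumption)

# Observed X_n = mu_k + m_b + eps_n, with the noise subject to both
# class and batch effects; the additive combination below is an
# illustrative assumption standing in for Sigma_k (+) S_b.
cov = Sigma_k + S_b
X_n = rng.multivariate_normal(mu_k + m_b, cov)
print(X_n.shape)  # (2,)
```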

In the likelihood function, only the combinations of the class and batch parameters, µ_k + m_b and Σ_k ⊕ S_b, are identifiable; the individual class and batch specific effects are not. However, we assume that we have some prior information about the relative orders of magnitude of the class and batch effects and encode this in an informative prior, reducing the problem of identifiability with this additional constraint. If the magnitude of the between-batch variability is similar to or greater than the true biological effect, then we suspect that any analysis of such a dataset is untenable, or at least that the data are not appropriate for our model.

The full hierarchical model can be found in section 1 of the supplementary material. Here we include the choice of prior distributions for the class and batch effects:

m_{b,p} | λ, δ² ∼ N(0, (λδ)²),
(S_b)_{p,p} | α, β, S_loc ∼ IG(α, β, S_loc),
η_k | ϵ, ζ ∼ G(ϵ, ζ) (if the MVT density is being used).
IW denotes the inverse-Wishart distribution, IG denotes the inverse-Gamma distribution with shape α, rate β and location S_loc, N signifies the Gaussian distribution parameterised by a mean vector and a covariance matrix, and G denotes the Gamma distribution parameterised by a shape and rate. An empirical Bayes approach is used to set the hyperparameters for the class mean and covariance (details are included in the supplementary material).

Sampling the batch and class parameters allows us to derive a batch-corrected dataset, Y, in each iteration.
We define the p-th measurement for the n-th sample in Y as

Y_{n,p} = µ_{c_n,p} + (X_{n,p} − µ_{c_n,p} − m_{b_n,p}) / √((S_{b_n})_{p,p}).

To achieve this, we generate 100 datasets in each of 9 different scenarios. In six of these, the data are generated from a mixture of MVN distributions and in one scenario the data are generated from a mixture of MVT distributions. In two scenarios, data are sampled from a mixture of Poisson distributions. This count data is log-transformed and then Gaussian noise is added to ensure our model is strongly misspecified in the density choice. For each simulation we generate both a "batch-free" and an observed dataset. In seven of the scenarios, the data contain P = 2 measurements for each of N = 750 samples; in two scenarios we consider a higher feature space of P = 15. In all scenarios we consider B = 5 batches and K = 2 classes.
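The batch-correction step can be sketched as below. The exact formula used by batchmix is given in the supplementary material; the form here is an illustrative assumption consistent with the generative model X_n = µ_k + m_b + ε_n (subtract the batch shift and rescale the residual noise about the class mean).

```python
import numpy as np

def batch_correct(X, c, b, mu, m, S):
    """Batch-correct X given one draw of the class/batch parameters.

    X  : (N, P) observed data
    c  : (N,) class labels;  mu : (K, P) class means
    b  : (N,) batch labels;  m  : (B, P) batch shifts
    S  : (B, P) diagonal batch scales

    Returns the corrected dataset Y, with the batch shift removed and
    the residual rescaled around the class mean.
    """
    return mu[c] + (X - mu[c] - m[b]) / np.sqrt(S[b])
```

Because the class labels c are themselves sampled, Y is recomputed in each iteration of the Markov chain, propagating classification uncertainty into the corrected dataset.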

The classes are not evenly represented: in seven scenarios the first class is expected to contribute 75% of the samples with the remainder drawn from the second class, and class is independent of batch (and therefore a pre-processing step is expected to match our model). In two scenarios the class labels and batch of origin are dependent and pre-processing is expected to skew the inference. The default for a classification SVM in this package uses a Gaussian radial basis kernel function.

We use the data with observed labels as the training set for each of these methods and those with unobserved labels as a test set. We record the time taken to train the model and to predict the outcome for the test set.

We assessed within-chain convergence by calculating the Geweke statistic (16), and removed chains which failed the diagnostic test. We then selected chains with the highest median complete log-likelihood as representative for the simulation. We compared the models using the Brier score between the allocation probability across classes and the true class (figure 1 A). We found that our mixture model performed better than or at least as well as the ML methods across all two-dimensional scenarios. When the data were generated from Gaussian [...] scenario. However, this shows that batchmix should not be casually applied to high-dimensional data.
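The Brier score used for this comparison can be computed as below. This is the standard multiclass form (mean squared distance between the allocation probabilities and the one-hot true class); the paper does not spell out the exact variant, so treat this as an assumption.

```python
import numpy as np

def brier_score(probs, truth):
    """Multiclass Brier score.

    probs : (N, K) class allocation probabilities
    truth : (N,) true class labels in {0, ..., K-1}
    Lower is better; a perfect, confident classifier scores 0.
    """
    onehot = np.eye(probs.shape[1])[truth]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

# Example: confident-correct, uncertain, and confident-wrong allocations
p = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
y = np.array([0, 0, 0])
print(brier_score(p, y))
```

Unlike accuracy, this score rewards well-calibrated probabilities, which suits a generative classifier that reports allocation uncertainty.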

We also wanted a sense of how well our models would estimate the proportion of the classes in our simulations as an analogue to seroprevalence in our motivating example. We recorded the predicted proportion of each class.

We use the same set of models as in the simulation study and aim to infer cell type for the data with hidden labels.

We hide a random 80% of labels in each of ten folds with the restriction that each class has at least one known member. We compare model performance using the Brier score. The results are similar to the Base case scenario in the simulation study, with the semi-supervised Bayesian mixture models performing the best. When there is no dependency between class and batch, ComBat and batchmix are equivalent, but when this dependency is present ComBat suffers significantly. In this case doing no batch correction is the best strategy, followed closely by batchmix. In both scenarios the off-the-shelf supervised methods perform less well than the semi-supervised mixture models.

Figure 3: The data and model performance for the cell gene expression data. A) Facet on the left shows the original data, annotated by class and batch. The facet on the right shows the data after batch effects are introduced for class independent of batch. B) As A) but class is dependent on batch. C) Model performance across 10 folds under the Brier score for both scenarios.

ELISA is an immunological assay used to measure antibodies, antigens, proteins and glycoproteins, and normally involves a reaction that converts the substrate into a coloured product, the optical density (OD) of which can be measured and is then used to determine the antigen concentration. One application is to assess seroprevalence of a disease within a population by measuring seropositivity of antibodies. It has a history of application to a wide range of diseases (e.g., 38, 3, 17, 30) and was used extensively to study seropositivity during the SARS-CoV-2 pandemic.

In the ELISA datasets we do not know the true seropositive status for the non-control data and cannot evaluate the model accuracy. Rather, we present these to demonstrate application of our model and highlight how diagnostic plots and results may be interpreted. In each case we run multiple chains and then use the sampled log-likelihood to assess within- and across-chain convergence.

The expectation that seropositivity should increase with time as more of the population were exposed to SARS-CoV-2 suggests that the batch and seropositivity frequency are dependent. Based on our simulation study, we would expect that a pre-processing batch normalisation would therefore produce misleading results. [...] the non-control data. This was highly consistent across hyperparameter choices and was contained within the confidence interval of the estimate provided by Castro Dopico et al. (7). However, our seroprevalence point estimates, particularly at later dates, were higher than those from Castro Dopico et al. (7). Table 1 of the Supplementary material shows the point estimates from the ML methods used in the simulation study, our MVT mixture model and that from the original paper. This shows that while our method provides higher point estimates than those from Castro Dopico et al. (7), the other ML methods (barring the SVM) provide estimates much closer to or even exceeding that from the MVT.

The seroprevalence estimates and their credible intervals were almost identical across hyperparameter choices. The Bayesian learner (9) is designed to estimate seroprevalence during an epidemic and provides a smooth, nondecreasing estimate across time. Its assumptions ensure a more consistent increase across time, whereas the SVM-LDA and MVT mixture models do not incorporate any explicit temporal information. The estimates from the mixture model have been moved 3 days to the right on the x-axis to reduce overlap.

We performed a similar analysis to our original simulation study on these datasets, comparing our models to a range of off-the-shelf machine learning methods. Across all of the simulations, we found that our mixture models outperformed other methods under both the F1 score and the squared distance (figures 7 A, 7 B).

The semi-supervised Bayesian mixture model also performed well here; this is due to the large proportion [...]

The results of our simulation study show that our mixture model consistently matches or outperforms several alternatives when applied to data with batch effects, across a range of data generating models. We believe [...] batch effects scenario), this is significantly worse than including a batch correction. As the relationship between batch and class and the magnitude of the batch effects is rarely known, we believe our model offers a better guarantee of meaningful inference. In the more specific scenario where data were generated from a converged chain that had been applied to the ELISA data from Castro Dopico et al. (7), we obtained the same findings, with our model again performing better than the off-the-shelf machine learning methods and ComBat over-correcting for batch due to the imbalance of classes. We also see from our simulation study [...] of the MVT mixture model is the approximate 50% increase in runtime, but as our implementation is quite fast we believe that this is not a significant detractor. Based on these results we recommend the use of our MVT mixture model when the analyst suspects the classes in the data may be non-Gaussian.

In terms of estimating seroprevalence, our mixture model performed very well in our simulation study.

Using the results shown in figure 1 B, we can try to gauge how well our method is performing in the ELISA data. We would argue that the most pertinent scenarios are the MVT generated (the ELISA data are non-Gaussian), the Varying batch effects and the Varying class representation scenarios. Our method estimates seroprevalence close to the truth, or slightly smaller, in these simulations. Based on this, we suspect that the high estimates of seroprevalence provided by our model (relative to those from the original papers) in the ELISA analyses are plausible.

In the Swedish dataset, we are reassured that the batch-correction is reasonable by our analysis of the patient 4 samples: these samples were used across several batches as positive controls; after applying the correction learnt on the dataset excluding these extreme samples, they are no longer separable by batch and have moved towards the class mean. The data generated from our converged model also appears very similar to the observed data, suggesting that the model assumptions are reasonable and that meaningful estimates of the parameters were obtained.

In the analysis using the data from Dingens et al. (12), the unrepresentative negative controls presented a problem. We believe that the preceding analyses show the potential advantages of our model over existing methods, but this dataset is a good example to show that our method is not a panacea that may overcome all problems: it remains vital to have useful and relevant data in order to perform meaningful inference (13).

Any analysis that uses training data that appear to be drawn from a different population than the test data is unlikely to produce meaningful results. Furthermore, the data are not well-described by a pair of MVT distributions (even allowing for our additional flexibility with the batch parameters). This combination of model misspecification and misleading training data makes us skeptical of the inferred parameters.

We note, however, that in the simulation of pseudo-ELISA data, our method still performed strongly despite the positive controls not being representative of the general seropositive sample. In this case our model was correctly specified (the data are generated from a MVT mixture model). In general, we suspect that our method is useful if either the assumption that the labelled data represent their class well or that the model [...] specific values is not too important, but we suspect that, if the sample size is smaller, having λ close to one could exacerbate the identifiability problem for the batch shift effect and the class mean. Therefore, we suggest setting λ ≤ 0.1 to encourage these parameters to converge in the small sample setting (although note that their sum, µ_k + m_b, should converge regardless).

We have developed a Bayesian method to predict class membership and perform batch-correction simultaneously.

Model
Our data X = (X_1, . . . , X_N) is generated across B batches where the origin batch of each point is known and represented by the vector b = [b_1, . . . , b_N]. We are interested in classifying X into K disjoint classes. We model X using a K component mixture model:

p(X_n | b_n = b, π, θ, z) = Σ_{k=1}^K π_k f(X_n | θ_k, z_b). (1)

Here f(·) is the density function, π = [π_1, . . . , π_K] are the component or class weights, θ = (θ_1, . . . , θ_K) are the parameters describing the classes and z = (z_1, . . . , z_B) are the parameters associated with the batches. We introduce an allocation variable, c = [c_1, . . . , c_N], to represent the class membership and assume that each class is represented by a single component of the mixture. Conditioning on c, our model is then

p(X_n | b_n = b, c_n = k, θ, z) = f(X_n | θ_k, z_b). (2)

In our motivating example, c contains some observed values (alternatively, c contains missing values); this enables supervised or semi-supervised methods to infer the missing values. We introduce a binary vector, φ = [φ_1, . . . , φ_N], indicating if the label of the n-th individual is observed or not. If we separate our dataset into subsets of labelled and unlabelled data and use X^l to train some classifier which predicts the labels of X^u (as we do with the off-the-shelf ML models in our simulation study), our method would be a supervised classifier. However, the Bayesian framework enables us to integrate these steps in a semi-supervised model, using the inferred allocations to include X^u in modelling the class and batch parameters.
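A minimal numpy sketch of the mixture density in this section, assuming the MVN choice for f. The combination of Σ_k with the diagonal batch scale (written Σ_k ⊕ S_b in the paper) is assumed additive here purely for illustration.

```python
import numpy as np

def mvn_pdf(x, mean, cov):
    """Multivariate normal density, implemented directly with numpy."""
    P = len(mean)
    diff = x - mean
    quad = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt(((2 * np.pi) ** P) * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def mixture_density(x, b, pi, mu, Sigma, m, S):
    """p(x | b) = sum_k pi_k f(x | mu_k + m_b, Sigma_k (+) S_b).

    pi : K class weights; mu : (K, P) class means; Sigma : (K, P, P)
    class covariances; m : (B, P) batch shifts; S : (B, P) diagonal
    batch scales. The additive covariance combination is an assumption.
    """
    return sum(pi[k] * mvn_pdf(x, mu[k] + m[b], Sigma[k] + np.diag(S[b]))
               for k in range(len(pi)))
```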

Multivariate Normal
Let f be the density function for the multivariate normal distribution, parametrised by a mean vector µ and a covariance matrix Σ.
We assume (X_n | c_n = k, b_n = b) ∼ N(µ_k + m_b, Σ_k ⊕ S_b). We also assume that the batch effects have no correlation across dimensions: we restrict the covariance matrix, S_b, to being diagonal and assume independence between the entries of m_b.
Our hierarchical model is

π | γ ∼ Dir(γ/K, . . . , γ/K),
c_n | π ∼ Cat(π),
µ_k | ξ, κ, Σ_k ∼ N(ξ, Σ_k / κ),
Σ_k | ν, Ψ ∼ IW(ν, Ψ),
m_{b,p} | λ, δ² ∼ N(0, (λδ)²),
(S_b)_{p,p} | α, β, S_loc ∼ IG(α, β, S_loc).

IW denotes the inverse-Wishart distribution, IG denotes the inverse-Gamma distribution with shape α, rate β and location S_loc. N is the Gaussian distribution, Dir is the Dirichlet distribution and Cat is the categorical distribution. As we assume independence of batch effects across dimensions, we model each entry of the b-th batch mean vector, m_{b,p}, and the b-th batch covariance matrix, (S_b)_{p,p}, using one dimensional distributions.

Figure 1: Directed acyclic graph for the mixture of multivariate normal distributions with random effects. Squares indicate observed or known quantities. Note that a subset of c is observed in our application.
The total joint probability is

p(X, c, π, µ, Σ, m, S) = p(π | γ) ∏_{k=1}^K p(µ_k | Σ_k) p(Σ_k) ∏_{b=1}^B p(m_b) p(S_b) ∏_{n=1}^N p(c_n | π) f(X_n | µ_{c_n} + m_{b_n}, Σ_{c_n} ⊕ S_{b_n}).

Multivariate t
If we let f be the density function for the multivariate t (MVT) distribution, parametrised by a mean vector µ, a covariance matrix Σ and degrees of freedom, η, then the model remains as described in section 1.1 and equation 5, except the model likelihood changes and we introduce a prior distribution over η:

η_k | ϵ, ζ ∼ G(ϵ, ζ).

Here G denotes the Gamma distribution parametrised by a shape and rate.

Parameter interpretation
Note that the "batch" parameters should not be interpreted as direct estimates of the effect the batches have on the true measures. As we are essentially performing a classification on the inferred batch-free dataset, and only the combinations µ_k + m_b and Σ_k ⊕ S_b are constrained in the likelihood, we recommend that users focus on the relative change in the measurements for batches, the inferred dataset and the inferred classification rather than the direct meaning of individual parameters.

Empirical Bayes
We use the suggestions of Fraley and Raftery (2007) for our choices of prior hyperparameters on the class parameters.
The choice of ξ is self-explanatory. κ can be viewed as the number of observations contributing to the prior. Fraley and Raftery (2007) choose a value based on experiments to acquire a BIC curve that is a smooth extension of the counterpart without a prior. The marginal prior distribution of µ_k is a Student's t distribution centred at ξ with ν − P + 1 degrees of freedom. ν is the smallest integer value for the degrees of freedom that gives a finite variance. We set Ψ as a diagonal matrix. Let σ²_0 be the empirical variance of the data divided by K^{2/P}; then Ψ_{p,p} = σ²_0. The logic is that the mixture components are expected, a priori, to each fill a common fraction of the total volume of the space the data occupies. For the concentration on the class weights, we use a flat prior with γ = 1. In our motivating example of ELISA data, we cannot use more information (such as the ratio of class members in the known data), as the negative controls are historical samples, the number of which is chosen before the experiment and is not related to the expected seroprevalence in the dataset.
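The empirical Bayes choices above can be sketched as follows. The value of κ is an illustrative assumption (the text only says it is chosen experimentally); ν = P + 2 follows from requiring the marginal Student's t on µ_k (ν − P + 1 degrees of freedom) to have finite variance.

```python
import numpy as np

def empirical_bayes_hyperparams(X, K):
    """Sketch of Fraley and Raftery (2007) style defaults.

    xi  : prior class mean, set to the data mean.
    Psi : diagonal prior scale, with each of the K components expected
          to fill an equal share of the data volume (variance / K^(2/P)).
    nu  : smallest integer degrees of freedom giving finite variance.
    kappa : illustrative small value; the paper tunes this experimentally.
    """
    N, P = X.shape
    xi = X.mean(axis=0)
    sigma2_0 = X.var(axis=0).mean() / K ** (2.0 / P)
    Psi = sigma2_0 * np.eye(P)
    kappa = 0.01   # assumption, not the paper's value
    nu = P + 2     # (nu - P + 1) > 2, so the marginal t has finite variance
    return xi, kappa, nu, Psi
```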
For the degrees of freedom of the MVT, η_k, we use an uninformative prior that offers a range of plausible values, ϵ = 2.0, ζ = 0.1 (Juárez and Steel, 2010).

Proposal distributions
For our batch and class parameters, we choose proposal densities that have an expectation of the current value and the correct support. The class and batch means have support (−∞, ∞); this allows use of a Gaussian proposal distribution with a mean of the current value.
This density is symmetric, and the relationship between the acceptance rate and the choice of the proposal window (σ²_m and σ²_µ) is relatively intuitive: the acceptance rate will decrease as the window increases. The batch standard deviations have a support of (S_loc, ∞). To ensure that proposed values remain in this range we use a Gamma proposal distribution with its shape set so that the expected value equals the current value, the rate set to some constant and a location of S_loc.
This proposal has an expected value of (S_b)_{p,p}. However, it is asymmetric, and the acceptance rate increases as β_S increases. We propose all P members of S_b in each sampling step. The class covariance matrices are the most difficult to sample: there are P² values to propose and the proposed matrix must be positive semi-definite. We use a Wishart proposal to satisfy this. All of the proposal windows, (σ²_µ, σ²_m, β_S, ν_Σ), are tuned aiming to achieve acceptance rates in the range [0.1, 0.5] (Roberts and Rosenthal, 2001); if this is not possible we prioritise keeping acceptance rates above 0.1. This can involve multiple tuning runs of the sampler on each dataset.
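The shifted-Gamma proposal for the batch scale can be sketched as below. The parameterisation shape = (current − S_loc) · β_S, rate = β_S is an assumption chosen to satisfy the two stated properties (support (S_loc, ∞) and expectation equal to the current value); the package may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_batch_scale(current, beta_S, S_loc=1.0):
    """Shifted-Gamma proposal for one batch scale entry (S_b)_{p,p}.

    Support is (S_loc, inf) and the mean is the current value:
    E[S_loc + Gamma(shape, rate)] = S_loc + shape / rate = current.
    """
    shape = (current - S_loc) * beta_S
    return S_loc + rng.gamma(shape, 1.0 / beta_S)

# Larger beta_S concentrates the proposal around the current value,
# which raises the acceptance rate (at the cost of slower exploration).
samples = [propose_batch_scale(1.5, beta_S=200.0) for _ in range(2000)]
print(np.mean(samples))  # close to 1.5
```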

Simulation study
We use a simulation study to test the model behaviour in examples where the generating model and the true labelling are known. We aim to explore

• the batch effects inferred by the model when none are present.
• the sampled distributions of the degree of freedom parameters in the mixture of multivariate t distributions.
• how the model behaves when there is some sort of inequality in the batches, e.g., different batch sizes, different class representation in each batch, and large differences in the magnitude of batch effects.
• how the model handles misspecification.

Design
Our study uses nine different scenarios to test and benchmark behaviour. We use a Base case as the default scenario that each other scenario is a variation of. For example, the No batch effects scenario is the Base case with the batch means set to 0 and the batch standard deviations set to 1.0. We define each scenario by a set of parameters. We use the distance between cluster means in a single dimension, as this is the quantity of interest, rather than specific values of µ_k.
To generate the datasets, we first sample batch and class labels based on w_b and π_k respectively. The measurements for each point are then generated from a Gaussian distribution defined by these labels (except in the Multivariate t generated scenario, where the generating distribution is the eponymous distribution). We use a diagonal covariance matrix for simplicity. Each column is generated by randomly permuting the parameters associated with each class and batch; this means that the different columns can contain different information.
X n ∼ N (Y n + m bn , S bn ).
All the scenarios use these same parameters unless explicitly stated otherwise.

Figure 3: Example of a generated dataset from the No batch effects scenario. Note that the dataset is identical before and after batch-correction.
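The generation procedure above can be sketched as follows. The parameter values are illustrative, not those of any specific scenario; the batch-free dataset Y is drawn first, and the observed X then adds the batch shift and scale.

```python
import numpy as np

rng = np.random.default_rng(42)

N, P, K, B = 750, 2, 2, 5
pi = np.array([0.75, 0.25])                      # class weights
w = np.full(B, 1.0 / B)                          # batch weights
mu = np.array([[0.0, 0.0], [2.0, 2.0]])          # class means
sigma2 = np.array([1.0, 1.0])                    # class variances (diagonal)
m = rng.normal(0.0, 0.5, size=(B, P))            # batch shifts (illustrative)
s2 = rng.uniform(1.0, 1.5, size=B)               # batch scales (illustrative)

# Sample labels, then the batch-free data Y, then the observed data X:
# X_n ~ N(Y_n + m_{b_n}, S_{b_n})
c = rng.choice(K, size=N, p=pi)
b = rng.choice(B, size=N, p=w)
Y = rng.normal(mu[c], np.sqrt(sigma2[c])[:, None])
X = rng.normal(Y + m[b], np.sqrt(s2[b])[:, None])
```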

No batch effects
This scenario is aimed at measuring the bias of the inferred batch effects. We remove the batch effects from the generating model by setting the batch means to 0 and the batch standard deviations to 1.0. Note that the inferred values of S are restricted to the open interval (1, ∞) in our sampler. Because of this we hope that the sampled batch scaling effect has a similar distribution across all batches, rather than sampling a distribution centred on 1.0.

Varying batch size
This scenario investigates the behaviour of the model when the batch sizes are very different.

Varying batch effects
This scenario tests how successfully the model infers differing batch effects in each batch, different magnitudes of batch effects (with some in the tails of the prior distribution) and the direction of the batch mean shift.

Varying class representation across batches

In each batch, one column of this matrix is used to sample the class membership. This introduces a dependency for c_n on b_n, i.e., (c_n | b_n = b, π) ∼ Cat(π_b).

Multivariate t generated
This scenario generates the data for each class from a multivariate t (MVT) distribution. This type of data is believed to be common in biology and we wish to investigate how well the model learns the degrees of freedom parameter and to compare the performance of the mixture of Gaussians model to the mixture of MVTs model.

Log-poisson generated
This scenario generates the data for each class from a Poisson distribution, which is then log-transformed and has some Gaussian noise added. This type of data is believed to be common in biology, where count data is prolific, and we wish to investigate how well the model behaves when it is strongly misspecified.

High-dimensional
The data are generated from a mixture of MVN distributions as in the Base case, but more features (P = 15) are generated. To compensate for the additional information these contain, the distance between the class means in each feature is reduced.

Hardest
This scenario uses the class / batch dependency of Varying class representation across batches scenario, the data generating mechanism of the Log-poisson generated scenario and the dimensionality of the High-dimensional scenario.

Gene expression data
The batch labels, b = [b 1 , . . . , b N ] , N = 242, are generated according to one of two scenarios described below, and then used to add batch-effects to the gene expression data.

Scenario a) No dependency between class and batch
w ∼ Dirichlet(10), b n ∼ Categorical(w).

Scenario b) Class and batch are dependent
(b n |c n = k) ∼ Categorical(w k ).
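Scenario b) can be sketched as below. The concrete per-class batch weights w_k are illustrative assumptions, not the values used in the paper; the point is that the batch label is drawn conditionally on the class, inducing the class/batch dependence.

```python
import numpy as np

rng = np.random.default_rng(7)
N, K, B = 242, 2, 5

c = rng.choice(K, size=N, p=[0.5, 0.5])      # class labels (illustrative split)

# Each class k has its own batch weights w_k, so
# (b_n | c_n = k) ~ Categorical(w_k).
w = rng.dirichlet(np.full(B, 10.0), size=K)  # (K, B) per-class batch weights
b = np.array([rng.choice(B, p=w[k]) for k in c])
```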
Figure 10: Example of a generated dataset from the Hardest scenario. Samples are shown in rows, measurements in columns.

Model convergence
For the simulated data we use the Geweke diagnostic for the complete log-likelihood after burn-in to assess within-chain convergence. We obtain a p-value by transforming the absolute value of the Z-scores with the Gaussian cumulative distribution function. We then discard all chains which have p-values below a threshold of 0.05. From the remaining chains we use that which has the highest median complete log-likelihood. For the real data we visually inspect the complete log-likelihood trace plots and manually select which chains have converged to the same mode in the posterior distribution (possibly the global mode). Performing the entire process manually is feasible for the real datasets as there are fewer chains. An example of this process is shown in figure 12.
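The Geweke check described above can be sketched as below. The full diagnostic uses spectral-density variance estimates; simple sample variances are assumed here for brevity, and the window fractions are the conventional defaults rather than the paper's.

```python
import numpy as np
from math import erf, sqrt

def geweke_p_value(chain, first=0.1, last=0.5):
    """Two-sided p-value for the Geweke convergence diagnostic.

    Compares the mean of the first 10% of the chain to the mean of the
    last 50% via a Z-score, then transforms |Z| with the Gaussian CDF.
    Small p-values indicate the chain has not reached stationarity.
    """
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[int((1 - last) * n):]
    z = (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a)
                                        + b.var(ddof=1) / len(b))
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

rng = np.random.default_rng(0)
drifting = np.linspace(0.0, 5.0, 5000) + rng.normal(size=5000)
print(geweke_p_value(drifting) < 0.05)  # a drifting chain fails the check
```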