bioRxiv
New Results

A semi-supervised Bayesian mixture modelling approach for joint batch correction and classification

Stephen Coleman, Kath Nicholls, Xaquin Castro Dopico, Gunilla B. Karlsson Hedestam, Paul D.W. Kirk, Chris Wallace
doi: https://doi.org/10.1101/2022.01.14.476352
Stephen Coleman
1MRC Biostatistics Unit, University of Cambridge, U.K.
  • For correspondence: stephen.coleman@mrc-bsu.cam.ac.uk
Kath Nicholls
1MRC Biostatistics Unit, University of Cambridge, U.K.
2Cambridge Institute of Therapeutic Immunology & Infectious Disease, University of Cambridge, U.K.
Xaquin Castro Dopico
3Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Sweden
Gunilla B. Karlsson Hedestam
3Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Sweden
Paul D.W. Kirk
1MRC Biostatistics Unit, University of Cambridge, U.K.
2Cambridge Institute of Therapeutic Immunology & Infectious Disease, University of Cambridge, U.K.
4Cancer Research U.K. Cambridge Centre, Ovarian Cancer Programme, University of Cambridge, U.K.
Chris Wallace
1MRC Biostatistics Unit, University of Cambridge, U.K.
2Cambridge Institute of Therapeutic Immunology & Infectious Disease, University of Cambridge, U.K.

Abstract

Systematic differences between batches of samples present significant challenges when analysing biological data. Such batch effects are well studied and are liable to occur in any setting where multiple batches are assayed. Many existing methods that account for them have focused on high-dimensional data such as RNA-seq and make assumptions that reflect this. Here we focus on batch correction in low-dimensional classification problems. We propose a semi-supervised Bayesian generative classifier based on mixture models that jointly predicts class labels and models batch effects. Our model allows observations to be probabilistically assigned to classes in a way that incorporates uncertainty arising from batch effects. By simultaneously inferring the classification and the batch correction, our method is more robust to dependence between batch and class than pre-processing steps such as ComBat. We explore two choices for the within-class densities: the multivariate normal (MVN) and the multivariate t (MVT). A simulation study demonstrates that our method performs well compared to popular off-the-shelf machine learning methods and is also quick: it performs 15,000 iterations on a dataset of 750 samples with 2 measurements each in 11.7 seconds for the MVN mixture model and 14.7 seconds for the MVT mixture model. We further validate our model on gene expression data in which cell type (class) is known and batch effects are simulated. We apply our model to two datasets generated using the enzyme-linked immunosorbent assay (ELISA), a spectrophotometric assay often used to screen for antibodies. The examples we consider were collected in 2020 and measure seropositivity for SARS-CoV-2. We use our model to estimate seroprevalence in the populations studied. We implement the models in C++ using a Metropolis-within-Gibbs algorithm, available in the R package batchmix. Scripts to recreate our analysis are at https://github.com/stcolema/BatchClassifierPaper.
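To make the idea of a class density perturbed by a batch effect concrete, the following is a minimal Python sketch and not the batchmix implementation. It assumes, purely for illustration, that each batch applies an additive shift and a multiplicative scale to a diagonal-covariance MVN class density; the abstract does not state the parametrisation used in the paper. The function names, parameter values and equal class weights are all hypothetical. The sketch only simulates data and evaluates class probabilities under known parameters; it does not perform the Metropolis-within-Gibbs inference described above.

```python
import numpy as np

def simulate_batches(n=750, p=2, n_classes=2, n_batches=3, seed=0):
    """Draw class-specific Gaussian data, then apply a per-batch shift and scale.

    All parameter values here are hypothetical and chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    classes = rng.integers(n_classes, size=n)
    batches = rng.integers(n_batches, size=n)
    class_means = rng.normal(0.0, 3.0, size=(n_classes, p))
    batch_shift = rng.normal(0.0, 0.5, size=(n_batches, p))   # assumed additive effect
    batch_scale = rng.uniform(1.0, 1.5, size=(n_batches, p))  # assumed multiplicative effect
    noise = rng.normal(size=(n, p))
    x = class_means[classes] + batch_shift[batches] + batch_scale[batches] * noise
    params = {"class_means": class_means,
              "batch_shift": batch_shift,
              "batch_scale": batch_scale}
    return x, classes, batches, params

def class_probabilities(x, batches, params, weights):
    """Class probabilities for each sample, given known parameters and batch labels."""
    # Batch-adjusted mean for every (sample, class) pair: shape (n, K, p).
    mu = params["class_means"][None, :, :] + params["batch_shift"][batches][:, None, :]
    # Batch-specific standard deviation for every sample: shape (n, 1, p).
    sd = params["batch_scale"][batches][:, None, :]
    z = (x[:, None, :] - mu) / sd
    log_lik = -0.5 * (z ** 2 + np.log(2.0 * np.pi * sd ** 2)).sum(axis=2)
    log_post = np.log(weights)[None, :] + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)  # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

x, classes, batches, params = simulate_batches()
probs = class_probabilities(x, batches, params, weights=np.array([0.5, 0.5]))
```

Each row of `probs` sums to one and gives a probabilistic class assignment that accounts for the sample's batch. Under these assumptions, averaging the column for a "positive" class would give a simple plug-in estimate of the class proportion; a full Bayesian treatment would instead average over posterior samples of the parameters.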

Competing Interest Statement

CW receives research funding from GSK and MSD for an unrelated project and is a part-time employee of GSK. These companies had no input into this study.

Footnotes

  • stephen.coleman{at}mrc-bsu.cam.ac.uk, kcn25{at}cam.ac.uk, xaquin.castro.dopico{at}ki.se, gunilla.karlsson.hedestam{at}ki.se, paul.kirk{at}mrc-bsu.cam.ac.uk, cew54{at}cam.ac.uk

  • The simulation study has been expanded to cover more scenarios and include more simulated datasets. An additional analysis of some gene expression data has been done. Some figures have been updated in other sections to be easier to interpret.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted November 29, 2022.
Subject Area

  • Bioinformatics