bioRxiv
New Results

Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification

Chakravarthi Kanduri, Milena Pavlović, Lonneke Scheffer, Keshav Motwani, Maria Chernigovskaya, Victor Greiff, Geir K. Sandve
doi: https://doi.org/10.1101/2021.05.23.445346
Chakravarthi Kanduri
1Centre for Bioinformatics, Department of Informatics, University of Oslo
  • For correspondence: skanduri@uio.no geirksa@ifi.uio.no
Milena Pavlović
1Centre for Bioinformatics, Department of Informatics, University of Oslo
Lonneke Scheffer
1Centre for Bioinformatics, Department of Informatics, University of Oslo
Keshav Motwani
2Department of Pathology, Immunology and Laboratory Medicine, University of Florida
Maria Chernigovskaya
3Department of Immunology, University of Oslo
Victor Greiff
3Department of Immunology, University of Oslo
Geir K. Sandve
1Centre for Bioinformatics, Department of Informatics, University of Oslo
  • For correspondence: skanduri@uio.no geirksa@ifi.uio.no

Abstract

Background Machine learning (ML) methodology development for classification of immune states in adaptive immune receptor repertoires (AIRR) has seen a recent surge of interest. However, there is so far no systematic evaluation of the scenarios where classical ML methods (such as penalized logistic regression) already perform adequately for AIRR classification. This makes it difficult to redirect research effort toward those scenarios where further development of more sophisticated ML approaches may be required.

Results To identify those scenarios where a baseline method is able to perform well for AIRR classification, we generated a collection of synthetic benchmark datasets encompassing a wide range of dataset architecture-associated and immune state-associated sequence pattern (signal) complexity. We trained ≈1300 ML models with varying assumptions regarding immune signal on ≈850 datasets with a total of ≈210’000 repertoires containing ≈42 billion TCRβ CDR3 amino acid sequences, thereby surpassing the sample sizes of current state-of-the-art AIRR ML setups by two orders of magnitude. We found that L1-penalized logistic regression achieved high prediction accuracy even when the immune signal occurs only in 1 out of 50’000 AIR sequences.

Conclusions We provide a reference benchmark to guide new AIRR ML classification methodology by: (i) identifying those scenarios characterised by immune signal and dataset complexity, where baseline methods already achieve high prediction accuracy and (ii) facilitating realistic expectations of the performance of AIRR ML models given training dataset properties and assumptions. Our study serves as a template for defining specialized AIRR benchmark datasets for comprehensive benchmarking of AIRR ML methods.
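The scale figures quoted in the Results can be put in perspective with a short back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes, purely for illustration, that the ≈42 billion sequences are spread evenly over the ≈210’000 repertoires, and it treats "1 out of 50’000" as a uniform per-sequence rate. The function names are hypothetical.

```python
# Back-of-the-envelope check of the approximate figures in the abstract.
# ASSUMPTION: sequences are spread evenly across repertoires; the abstract
# does not state this, so the results are rough averages only.

TOTAL_SEQUENCES = 42_000_000_000   # ≈42 billion TCRβ CDR3 sequences
TOTAL_REPERTOIRES = 210_000        # ≈210'000 repertoires
SIGNAL_ONE_IN = 50_000             # signal in 1 out of 50'000 sequences


def average_repertoire_size(total_sequences: int, total_repertoires: int) -> float:
    """Mean number of sequences per repertoire under the even-spread assumption."""
    return total_sequences / total_repertoires


def expected_signal_sequences(repertoire_size: float, one_in: int) -> float:
    """Expected count of signal-carrying sequences in one repertoire."""
    return repertoire_size / one_in


avg_size = average_repertoire_size(TOTAL_SEQUENCES, TOTAL_REPERTOIRES)
signal_count = expected_signal_sequences(avg_size, SIGNAL_ONE_IN)

print(f"average sequences per repertoire: {avg_size:,.0f}")
print(f"expected signal-carrying sequences per repertoire: {signal_count:g}")
```

Under these assumptions, an average repertoire would hold about 200,000 sequences, of which only about 4 would carry the immune signal at the lowest rate mentioned in the abstract.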

Competing Interest Statement

VG declares advisory board positions in aiNET GmbH and Enpicom B.V.

Footnotes

  • In this revised version, we have containerized our computational environment in a publicly hosted Docker image and provided a detailed demo analysis, run through the containerized workflow, of each category of experiment performed in the original manuscript (https://github.com/KanduriC/demo_reproducibility_kanduricetal2021.git). We added a new subsection under Methods titled "Docker container to improve reproducibility". We also publicly hosted ~2 TB of input data (doi.org/10.11582/2021.00064).

  • https://doi.org/10.11582/2021.00038

  • https://doi.org/10.11582/2021.00064

  • https://github.com/KanduriC/demo_reproducibility_kanduricetal2021.git

  • List of abbreviations

    ML: Machine Learning
    AIR: Adaptive Immune Receptors
    AIRR: Adaptive Immune Receptor Repertoires
    TCR: T Cell Receptors
    TCRβ: T Cell Receptor beta chain
    CDR3: Complementarity Determining Region 3
    SVC: Support Vector Classifier
    RF: Random Forests
    CV: Cross Validation
    IMGT: ImMunoGeneTics
  • Copyright 
    The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
    Posted September 03, 2021.
    Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification
    Chakravarthi Kanduri, Milena Pavlović, Lonneke Scheffer, Keshav Motwani, Maria Chernigovskaya, Victor Greiff, Geir K. Sandve
    bioRxiv 2021.05.23.445346; doi: https://doi.org/10.1101/2021.05.23.445346

    Subject Area

    • Bioinformatics