bioRxiv
New Results

A composite method to infer drug resistance with mixed genomic data

Gargi Datta, Nabeeh A Hasan, Michael Strong, Sonia M Leach
doi: https://doi.org/10.1101/2020.07.30.194266
Gargi Datta
1 Computational Bioscience Program, University of Colorado School of Medicine, Aurora, CO, USA, 80045
2 Center for Genes, Environment and Health, National Jewish Health, Denver, CO, USA, 80206
3 Institute for Behavioral Genetics, University of Colorado Boulder, CO, USA, 80303
  • For correspondence: datta.gargi@gmail.com
Nabeeh A Hasan
2 Center for Genes, Environment and Health, National Jewish Health, Denver, CO, USA, 80206
Michael Strong
1 Computational Bioscience Program, University of Colorado School of Medicine, Aurora, CO, USA, 80045
2 Center for Genes, Environment and Health, National Jewish Health, Denver, CO, USA, 80206
Sonia M Leach
1 Computational Bioscience Program, University of Colorado School of Medicine, Aurora, CO, USA, 80045
2 Center for Genes, Environment and Health, National Jewish Health, Denver, CO, USA, 80206

Abstract

Background The increasing incidence of drug resistance in tuberculosis and other infectious diseases is a growing cause for concern and underscores the urgent need for robust computational and molecular methods to identify drug-resistant strains. Although machine learning-based approaches using whole-genome sequence data can facilitate the inference of drug resistance, current implementations do not take full advantage of information in public databases and are not robust to small sample sizes and mixed attribute types.

Results In this paper we introduce the Composite MetaDistance method, an approach for feature selection and classification of high-dimensional, unbalanced datasets with mixed attribute features from various data sources. We introduce a mixed-attribute, multi-view distance function to calculate distances between samples, with optimal handling of nominal features and different feature views. We also introduce a novel feature set for drug resistance prediction in Mycobacterium tuberculosis, using data from diverse sources. We compare the performance of Composite MetaDistance to multiple machine learning algorithms for Mycobacterium tuberculosis drug resistance prediction for three drugs. Composite MetaDistance consistently outperforms existing algorithms for small sample training sets, and performs as well as other algorithms for training sets with larger sample sizes.
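To make the idea of a mixed-attribute, multi-view distance concrete, the following is a minimal sketch only. The abstract does not specify the Composite MetaDistance formula, so this is not the authors' function: it assumes a simple HEOM-style scheme (overlap distance for nominal features, range-normalized absolute difference for numeric features) and a weighted average across views. The view names, feature values, ranges, and weights are all hypothetical.

```python
from math import sqrt
from typing import Dict, Sequence, Union

Value = Union[float, int, str]


def view_distance(a: Sequence[Value], b: Sequence[Value],
                  is_nominal: Sequence[bool], ranges: Sequence[float]) -> float:
    """HEOM-style distance between two samples within one feature view.

    Assumption: nominal features contribute 0 if equal and 1 otherwise;
    numeric features contribute |x - y| / range, capped at 1.
    """
    if not (len(a) == len(b) == len(is_nominal) == len(ranges)):
        raise ValueError("all inputs for a view must have the same length")
    if len(a) == 0:
        return 0.0
    squared = 0.0
    for x, y, nominal, value_range in zip(a, b, is_nominal, ranges):
        if nominal:
            d = 0.0 if x == y else 1.0
        elif value_range > 0:
            d = min(abs(float(x) - float(y)) / value_range, 1.0)
        else:
            d = 0.0  # constant numeric feature carries no information
        squared += d * d
    # Normalize by feature count so every view yields a value in [0, 1].
    return sqrt(squared / len(a))


def multi_view_distance(sample_a: Dict[str, Sequence[Value]],
                        sample_b: Dict[str, Sequence[Value]],
                        specs: Dict[str, Dict[str, list]],
                        weights: Dict[str, float]) -> float:
    """Weighted average of per-view distances (hypothetical combination rule)."""
    total_weight = sum(weights[v] for v in specs)
    if total_weight <= 0:
        raise ValueError("view weights must sum to a positive value")
    combined = sum(
        weights[v] * view_distance(sample_a[v], sample_b[v],
                                   specs[v]["is_nominal"], specs[v]["ranges"])
        for v in specs
    )
    return combined / total_weight


# Hypothetical example: one numeric view and one nominal view.
specs = {
    "mutations": {"is_nominal": [False, False], "ranges": [4.0, 1.0]},
    "annotation": {"is_nominal": [True, True], "ranges": [0.0, 0.0]},
}
weights = {"mutations": 0.5, "annotation": 0.5}
a = {"mutations": [3, 0.5], "annotation": ["membrane", "essential"]}
b = {"mutations": [1, 0.5], "annotation": ["membrane", "non-essential"]}

dist = multi_view_distance(a, b, specs, weights)
print(round(dist, 4))
```

Keeping each view's distance on the same [0, 1] scale before weighting is one way to stop a view with many features from dominating the result; other normalizations would be equally plausible.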

Conclusion The feature set formulation introduced in this paper utilizes mutational and publicly available information for each gene, and is richer than previously devised feature sets. The prediction algorithm, Composite MetaDistance, is sample size agnostic and is especially robust for small sample sizes. Proper handling of nominal features improves performance even when the dataset contains very few nominal features. We expect Composite MetaDistance to be even more robust for datasets with a higher percentage of nominal features. The algorithm is application independent and can be used for any mixed attribute dataset.

Competing Interest Statement

The authors have declared no competing interest.

  • List of abbreviations

    M. tb: Mycobacterium tuberculosis
    TB: Tuberculosis
    ML: Machine Learning
    WHO: World Health Organization
    WGS: Whole Genome Sequencing
    RF: Random Forest
    KNN: K-Nearest Neighbor
    SVM: Support Vector Machine
    AUC: Area Under the receiver operating characteristic Curve
    SNV: Single Nucleotide Variant
    rbf: Radial Basis Function
    MD: MetaDistance
    CM: Composite MetaDistance
  • Copyright 
    The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
    Posted July 31, 2020.
    Subject Area

    • Bioinformatics