New Results

cazy_webscraper: local compilation and interrogation of comprehensive CAZyme datasets

Emma E. M. Hobbs, Tracey M. Gloster, Leighton Pritchard
doi: https://doi.org/10.1101/2022.12.02.518825
Emma E. M. Hobbs
(a) School of Biology and Biomedical Sciences Research Complex, University of St Andrews, North Haugh, St Andrews, Fife, KY16 9ST, UK
(b) Strathclyde Institute of Pharmacy and Biomedical Sciences, University of Strathclyde, Glasgow, G4 0RE, UK
(c) Cell and Molecular Sciences, James Hutton Institute, Invergowrie, Dundee, DD2 5DA, UK
  • For correspondence: eemh1@standrews.ac.uk leighton.pritchard@strath.ac.uk
Tracey M. Gloster
(a) School of Biology and Biomedical Sciences Research Complex, University of St Andrews, North Haugh, St Andrews, Fife, KY16 9ST, UK
Leighton Pritchard
(b) Strathclyde Institute of Pharmacy and Biomedical Sciences, University of Strathclyde, Glasgow, G4 0RE, UK

Abstract

Carbohydrate Active enZymes (CAZymes) are pivotal in biological processes including energy metabolism, cell structure maintenance, signalling and pathogen recognition. Bioinformatic prediction and mining of CAZymes improve our understanding of these activities, and enable discovery of candidates of interest for industrial biotechnology, particularly the processing of organic waste for biofuel production. CAZy (www.cazy.org) is a high-quality, manually curated and authoritative database of CAZymes that is often the starting point for these analyses. Automated querying and integration of CAZy data with other public datasets would constitute a powerful resource for mining and exploring CAZyme diversity. However, CAZy does not itself provide methods to automate queries, or to integrate annotation data from other sources (except by following hyperlinks) to support further analysis.

To overcome these limitations we developed cazy_webscraper, a command-line tool that retrieves data from CAZy and other online resources to build a local, shareable, and reproducible database that augments and extends the authoritative CAZy database. cazy_webscraper’s integration of curated CAZyme annotations with their corresponding protein sequences, up-to-date taxonomy assignments, and protein structure data facilitates automated large-scale and targeted bioinformatic CAZyme family analysis and candidate screening. This tool has found widespread uptake in the community, with over 20,000 downloads.
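
To illustrate the kind of interrogation a local database makes possible, the following is a minimal sketch using Python's standard sqlite3 module. The table name, column names, and rows are hypothetical and invented for illustration; they are not the schema or data that cazy_webscraper produces. The query lists families in which no protein has a recorded structure identifier.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only; it is NOT the
# schema that cazy_webscraper actually creates.
SCHEMA = """
CREATE TABLE proteins (
    accession TEXT PRIMARY KEY,
    family    TEXT NOT NULL,
    kingdom   TEXT NOT NULL,
    pdb_id    TEXT
);
"""

# Invented example rows, not real CAZy records.
ROWS = [
    ("PROT_0001", "GH5", "Bacteria", "XXXX"),
    ("PROT_0002", "GH5", "Bacteria", None),
    ("PROT_0003", "AA9", "Eukaryota", None),
    ("PROT_0004", "PL1", "Bacteria", None),
]


def build_database(rows):
    """Create an in-memory SQLite database holding the example rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO proteins VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def families_without_structure(conn):
    """Return families in which no protein has a structure identifier."""
    query = """
        SELECT family
        FROM proteins
        GROUP BY family
        HAVING COUNT(pdb_id) = 0
        ORDER BY family
    """
    # COUNT(pdb_id) ignores NULLs, so 0 means no structure in the family.
    return [row[0] for row in conn.execute(query)]


if __name__ == "__main__":
    connection = build_database(ROWS)
    print(families_without_structure(connection))
```

Because the data sit in a single local file (or, here, in memory), the same query can be rerun and shared without contacting a remote server.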

We demonstrate the use and application of cazy_webscraper to: (i) augment, update and correct CAZy database accessions; (ii) explore taxonomic distribution of CAZymes recorded in CAZy, identifying underrepresented taxa and unusual CAZy class distributions; and (iii) investigate three CAZymes having potential biotechnological application for degradation of biomass, but lacking a representative structure in the PDB database. We describe in general how cazy_webscraper facilitates functional, structural and evolutionary studies to aid identification of candidate enzymes for further characterisation, and specifically note that CAZy provides supporting evidence for recent expansion of the Auxiliary Activities (AA) CAZy family in eukaryotes, consistent with functions potentially specific to eukaryotic lifestyles.
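
As a rough sketch of what exploring the taxonomic distribution of CAZy classes could look like, the snippet below computes the fraction of each class within each kingdom from (kingdom, class) pairs. The function name and the input records are hypothetical and invented for illustration; they are not taken from CAZy or from the analysis reported here.

```python
from collections import Counter, defaultdict


def class_distribution(records):
    """Map each kingdom to the fraction of its records in each CAZy class.

    `records` is an iterable of (kingdom, cazy_class) pairs.
    """
    counts = defaultdict(Counter)
    for kingdom, cazy_class in records:
        counts[kingdom][cazy_class] += 1

    distribution = {}
    for kingdom, per_class in counts.items():
        total = sum(per_class.values())
        distribution[kingdom] = {
            cazy_class: n / total for cazy_class, n in per_class.items()
        }
    return distribution


# Invented example input, not real CAZy data.
EXAMPLE_RECORDS = [
    ("Bacteria", "GH"),
    ("Bacteria", "GH"),
    ("Bacteria", "GT"),
    ("Bacteria", "PL"),
    ("Eukaryota", "AA"),
    ("Eukaryota", "GH"),
]

if __name__ == "__main__":
    for kingdom, fractions in class_distribution(EXAMPLE_RECORDS).items():
        print(kingdom, fractions)
```

Comparing these per-kingdom fractions side by side is one simple way to spot a class that is unusually common or rare in a given group.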

Supplementary information: cazy_webscraper source code is available at https://github.com/HobnobMancer/cazy_webscraper, and online documentation is provided at https://cazywebscraper.readthedocs.io.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • https://zenodo.org/record/7389830#.Y4nXypb7S70

  • Abbreviations

    AA: Auxiliary activity
    BSR: BLAST Score Ratio
    CBM: Carbohydrate binding module
    CE: Carbohydrate esterase
    EC: Enzyme Commission
    GH: Glycoside hydrolase
    GT: Glycosyl transferase
    LPMO: Lytic polysaccharide monooxygenase
    NCBI: National Center for Biotechnology Information
    nr: NCBI non-redundant database
    ORM: Object relationship model
    PDB: Protein Data Bank
    PL: Polysaccharide lyase
    RCSB: Research Collaboratory for Structural Bioinformatics
    SQL: Structured Query Language
  • Copyright 
    The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
    Posted December 03, 2022.
    Subject Area

    • Bioinformatics