New Results

EUGENe: A Python toolkit for predictive analyses of regulatory sequences

Adam Klie, Hayden Stites, Tobias Jores, Joe J Solvason, Emma K Farley, Hannah Carter
doi: https://doi.org/10.1101/2022.10.24.513593
Adam Klie
1Department of Medicine, University of California San Diego, La Jolla, CA 92093
2Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093
Hayden Stites
3Daniel Hand High School, Madison, CT 06443
Tobias Jores
4Department of Genome Sciences, University of Washington, Seattle, WA 98195
Joe J Solvason
1Department of Medicine, University of California San Diego, La Jolla, CA 92093
2Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093
5Department of Biological Sciences, University of California San Diego, La Jolla, CA 92093
Emma K Farley
1Department of Medicine, University of California San Diego, La Jolla, CA 92093
2Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093
5Department of Biological Sciences, University of California San Diego, La Jolla, CA 92093
Hannah Carter
1Department of Medicine, University of California San Diego, La Jolla, CA 92093
2Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093
  • For correspondence: hkcarter@ucsd.edu

Abstract

Deep learning (DL) has become a popular tool to study cis-regulatory element function. Yet efforts to design software for DL analyses in genomics that is Findable, Accessible, Interoperable and Reusable (FAIR) have fallen short of fully meeting these criteria. Here we present EUGENe (Elucidating the Utility of Genomic Elements with Neural Nets), a FAIR toolkit for the analysis of labeled sets of nucleotide sequences with DL. EUGENe consists of a set of modules that empower users to execute the key functionality of a DL workflow: 1) extracting, transforming and loading sequence data from many common file formats, 2) instantiating, initializing and training diverse model architectures, and 3) evaluating and interpreting model behavior. We designed EUGENe to be simple; users can develop workflows on new or existing datasets with two customizable Python objects, annotated sequence data (SeqData) and PyTorch models (BaseModel). The modularity and simplicity of EUGENe also make it highly extensible, and we illustrate these principles through application of the toolkit to three predictive modeling tasks. First, we train and compare a set of built-in models along with a custom architecture for the accurate prediction of activities of plant promoters from STARR-seq data. Next, we apply EUGENe to an RNA binding prediction task and showcase how seminal model architectures can be retrained in EUGENe or imported from Kipoi. Finally, we train models to classify transcription factor binding by wrapping functionality from Janggu, which can efficiently extract sequences in BED file format from the human genome. We emphasize that the code used in each use case is simple, readable, and well documented (https://eugene-tools.readthedocs.io/en/latest/index.html). We believe that EUGENe represents a springboard toward a collaborative ecosystem for DL applications in genomics research.
EUGENe is available for download on GitHub (https://github.com/cartercompbio/EUGENe) along with several introductory tutorials and for installation on PyPI (https://pypi.org/project/eugene-tools/).
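
As a rough illustration of the first workflow step named in the abstract (extracting, transforming and loading sequence data), the sketch below parses FASTA-formatted text and one-hot encodes each nucleotide sequence using only the Python standard library. It does not use or reproduce EUGENe's API: the function names `parse_fasta`, `one_hot_encode` and `reverse_complement` are hypothetical and chosen for this example only, and the choice to encode unknown bases as all-zero rows is an assumption.

```python
# Illustrative sketch only. This is NOT EUGENe's API; all names are hypothetical.

ALPHABET = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start a new record named by the text after '>'.
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            # Sequence line: append to the current record.
            records[name] += line
    return records


def one_hot_encode(seq):
    """Return a len(seq) x 4 matrix (list of lists) over the alphabet ACGT.

    Assumption: bases outside ACGT (for example N) become all-zero rows.
    """
    return [
        [1 if base == letter else 0 for letter in ALPHABET]
        for base in seq.upper()
    ]


def reverse_complement(seq):
    """Return the reverse complement; bases outside ACGT map to N."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(seq.upper()))


if __name__ == "__main__":
    fasta_text = ">seq1\nACG\nTN\n>seq2\nGG\n"
    for record_name, sequence in parse_fasta(fasta_text).items():
        print(record_name, sequence, one_hot_encode(sequence))
```

A matrix of this shape is the typical input to the convolutional and recurrent architectures used for regulatory sequence modeling; reverse-complement copies are a common augmentation for double-stranded DNA.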

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • Included further references to model interpretation methods in "The EUGENe workflow" section and in the discussion. Removed artifacts in figures and improved panel consistency. Authors and affiliations updated.

  • https://zenodo.org/record/7140083#.Y1b18-zML2U

  • https://github.com/cartercompbio/EUGENe

  • https://eugene-tools.readthedocs.io/

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted November 09, 2022.

Subject Area

  • Bioinformatics