bioRxiv
New Results

Teaching data science fundamentals through realistic synthetic clinical cardiovascular data

Ted Laderas, Nicole Vasilevsky, Bjorn Pederson, Melissa Haendel, Shannon McWeeney, David Dorr
doi: https://doi.org/10.1101/232611
Ted Laderas
Oregon Health & Science University
  • For correspondence: tedladeras@gmail.com
Nicole Vasilevsky
Oregon Health & Science University
Bjorn Pederson
Oregon Health & Science University
Melissa Haendel
Oregon Health & Science University
Shannon McWeeney
Oregon Health & Science University
David Dorr
Oregon Health & Science University

Abstract

Objective: Our goal was to create a synthetic dataset and curricular materials to assist in teaching the fundamentals of translational data science.

Materials and Methods: A literature review was conducted to extract current cardiovascular risk score logic, data elements, and population characteristics. The clinical data elements used in the models were then extracted from clinical data and transformed to the Observational Medical Outcomes Partnership (OMOP) common data model; genetic data elements were added based on population rates. A hybrid Bayesian network was used to create synthetic data from the logical elements of the risk scores and the underlying population frequencies of the clinical data.

Results: A synthetic dataset of 446,000 patients was created. A two-day curriculum with exploratory data analysis and machine learning components was built on this synthetic data. The curriculum was offered on two separate occasions; both groups of learners were given the curriculum and data, and their results were tallied, summarized, and compared. Students' ability to complete the challenge was mixed: more experienced students achieved balanced accuracies of 70%-85%, but many others did not perform better than the baseline model.

Discussion: Overall, students enjoyed the course and dataset, but some struggled to consistently apply machine learning techniques. The curriculum, dataset, generation techniques, and results are available for others to use in their own training.

Conclusion: A realistic synthetic dataset with clinical and genetic components helps students learn issues in cardiovascular risk scoring, practice data science skills, and compete in a challenge to improve identification of risk.
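
The Results report student performance as balanced accuracy relative to a baseline model, but the abstract does not define either. As a minimal sketch, assuming the standard binary definition (the mean of sensitivity and specificity) and a hypothetical majority-class baseline, the example below shows why a model can have high plain accuracy yet fail to beat the baseline on balanced accuracy. The labels and prevalence are invented for illustration and are not drawn from the synthetic dataset.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("y_true must contain both classes")
    sensitivity = tp / (tp + fn)  # fraction of high-risk patients found
    specificity = tn / (tn + fp)  # fraction of low-risk patients cleared
    return (sensitivity + specificity) / 2


def accuracy(y_true, y_pred):
    """Plain fraction of correct predictions."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Invented toy labels: 1 = high cardiovascular risk, 0 = low risk (10% prevalence).
y_true = [1] * 10 + [0] * 90

# Hypothetical baseline: always predict the majority class (low risk).
baseline_pred = [0] * 100

# Hypothetical student model: finds 8 of 10 high-risk patients,
# but mislabels 18 of 90 low-risk patients as high risk.
model_pred = [1] * 8 + [0] * 2 + [1] * 18 + [0] * 72

print("baseline accuracy:", accuracy(y_true, baseline_pred))
print("baseline balanced accuracy:", balanced_accuracy(y_true, baseline_pred))
print("model accuracy:", accuracy(y_true, model_pred))
print("model balanced accuracy:", balanced_accuracy(y_true, model_pred))
```

In this toy case the baseline has the higher plain accuracy, yet its balanced accuracy sits at chance level because it never identifies a high-risk patient. That is the usual reason for preferring balanced accuracy when the outcome classes are imbalanced.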

Footnotes

  • I mistakenly included a reference about synthetic data generation that did not use real clinical data. This reference has been removed.

Copyright 
The copyright holder for this preprint is the author/funder. It is made available under a CC-BY-NC 4.0 International license.
  • Posted April 21, 2018.


Subject Area

  • Scientific Communication and Education