bioRxiv
New Results

A Machine Learning-based Framework to Identify Type 2 Diabetes through Electronic Health Records

Tao Zheng, Wei Xie, Liling Xu, Xiaoying He, Ya Zhang, Mingrong You, Gong Yang, You Chen
doi: https://doi.org/10.1101/078634
Tao Zheng
1Institute of Image Communication and Networking, Shanghai Jiao Tong University, Shanghai, China
2Tongren Hospital Shanghai Jiao Tong University, Shanghai, China
Wei Xie
3Department of Electrical Engineering & Computer Science, Vanderbilt University, Nashville, TN, USA
Liling Xu
2Tongren Hospital Shanghai Jiao Tong University, Shanghai, China
Xiaoying He
4Department of Endocrinology, the First Affiliated Hospital of Sun Yat-Sen University, Guangzhou, China
Ya Zhang
1Institute of Image Communication and Networking, Shanghai Jiao Tong University, Shanghai, China
Mingrong You
5Division of Epidemiology, Vanderbilt University, Nashville, TN, USA
Gong Yang
5Division of Epidemiology, Vanderbilt University, Nashville, TN, USA
You Chen
6Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA

Abstract

Objective Discovering diverse genotype-phenotype associations linked to Type 2 Diabetes Mellitus (T2DM) through genome-wide association studies (GWAS) and phenome-wide association studies (PheWAS) requires identifying more cases (T2DM subjects) and controls (subjects without T2DM), for example from Electronic Health Records (EHR). However, existing expert-based identification algorithms often suffer from low recall and can miss a large number of valuable samples because of their conservative filtering standards. As a pilot study, this work develops a semi-automated, machine learning-based framework that relaxes the filtering criteria to improve recall while keeping the false positive rate low.

Materials and Methods We propose a data-informed framework for identifying subjects with and without T2DM from EHR via feature engineering and machine learning. We evaluate and compare the identification performance of widely used machine learning models within our framework, including k-Nearest-Neighbors, Naïve Bayes, Decision Tree, Random Forest, Support Vector Machine and Logistic Regression. The framework was evaluated on 300 patient samples (161 cases, 60 controls and 79 unconfirmed subjects) randomly selected from a diabetes-related cohort of 23,281 patients retrieved from a regional distributed EHR repository covering 2012 to 2014.
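
As a rough illustration of what comparing models on engineered features can look like, the sketch below evaluates two of the listed model families (k-Nearest-Neighbors and a Gaussian Naïve Bayes) with leave-one-out validation. Everything specific here is an assumption made for illustration: the two features (HbA1c and fasting glucose), the toy values, the choice of k and the leave-one-out protocol do not come from the abstract, which does not state the features or the validation scheme used.

```python
import math
from collections import Counter

# Hypothetical toy data: ((hba1c_percent, fasting_glucose_mmol_l), label).
# Label 1 = case (T2DM), 0 = control. Features and values are invented
# for illustration and are NOT taken from the paper.
SAMPLES = [
    ((7.9, 8.8), 1), ((8.4, 9.6), 1), ((7.2, 7.9), 1), ((9.1, 10.4), 1),
    ((6.9, 7.4), 1), ((8.0, 9.1), 1),
    ((5.2, 4.9), 0), ((5.6, 5.3), 0), ((5.0, 4.6), 0), ((5.8, 5.7), 0),
    ((5.4, 5.1), 0), ((6.0, 5.9), 0),
]

def knn_predict(train, x, k=3):
    """Majority vote among the k training points closest to x (Euclidean)."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def gaussian_nb_predict(train, x):
    """Gaussian Naive Bayes: log prior plus per-feature normal log-likelihoods."""
    best_label, best_score = None, -math.inf
    for label in {lab for _, lab in train}:
        rows = [feats for feats, lab in train if lab == label]
        score = math.log(len(rows) / len(train))  # log prior of the class
        for j, value in enumerate(x):
            col = [r[j] for r in rows]
            mean = sum(col) / len(col)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-6
            score += -0.5 * math.log(2 * math.pi * var)
            score += -((value - mean) ** 2) / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def leave_one_out_accuracy(predict, samples):
    """Hold out each sample once, train on the rest, count correct predictions."""
    hits = 0
    for i, (x, y) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        hits += predict(train, x) == y
    return hits / len(samples)

results = {
    "kNN (k=3)": leave_one_out_accuracy(knn_predict, SAMPLES),
    "Gaussian NB": leave_one_out_accuracy(gaussian_nb_predict, SAMPLES),
}
for name, acc in results.items():
    print(f"{name}: leave-one-out accuracy = {acc:.2f}")
```

The same harness accepts any function with the signature `predict(train, x)`, so further model families could be compared by adding entries to `results`. The toy data are deliberately well separated and say nothing about performance on real EHR data.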

Results We apply the top-performing machine learning algorithms to the engineered features and benchmark their accuracy, precision, AUC, sensitivity and specificity against the state-of-the-art expert algorithm for identifying T2DM subjects. The framework achieved high identification performance (average AUC of ~0.98), well above that of the state-of-the-art algorithm (AUC of 0.71).
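
For readers less familiar with the five reported metrics, the following minimal sketch computes them from true labels and classifier scores using their standard definitions. The labels, scores and the 0.5 decision threshold are hypothetical and are not data from the study; AUC is computed as the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = case, 0 = control."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def ratio(numerator, denominator):
    """Safe division: an undefined metric is reported as NaN."""
    return numerator / denominator if denominator else float("nan")

def classification_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),  # also called recall
        "specificity": ratio(tn, tn + fp),  # 1 - false positive rate
    }

def auc(y_true, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return ratio(wins, len(pos) * len(neg))

# Hypothetical labels and scores, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # assumed decision threshold of 0.5

metrics = classification_metrics(y_true, y_pred)
metrics["auc"] = auc(y_true, scores)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On this toy input, accuracy, precision, sensitivity and specificity all come out at 0.75 and AUC at 0.875. Unlike the other four metrics, AUC does not depend on the chosen threshold, which is one reason it is commonly used as a headline figure when comparing classifiers.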

Discussion Expert algorithm-based identification of T2DM subjects from EHR is often hampered by high miss rates caused by conservative selection criteria. Our framework uses machine learning and feature engineering to loosen these criteria and achieve a high identification rate for both cases and controls.

Conclusions Our proposed framework demonstrates a more accurate and efficient approach for identifying subjects with and without T2DM from EHR.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Posted September 30, 2016.

Subject Area

  • Bioinformatics