TY  - JOUR
T1  - Learning from Longitudinal Data in Electronic Health Record and Genetic Data to Improve Cardiovascular Event Prediction
JF  - bioRxiv
DO  - 10.1101/366682
SP  - 366682
AU  - Juan Zhao
AU  - QiPing Feng
AU  - Patrick Wu
AU  - Roxana Lupu
AU  - Russell A. Wilke
AU  - Quinn S. Wells
AU  - Joshua C. Denny
AU  - Wei-Qi Wei
Y1  - 2018/01/01
UR  - http://biorxiv.org/content/early/2018/07/11/366682.abstract
N2  - Background: Current approaches to predicting cardiovascular disease (CVD) rely on conventional risk factors and cross-sectional data. In this study, we asked whether i) machine learning and deep learning models with longitudinal electronic health record (EHR) information can improve the prediction of 10-year CVD risk, and ii) incorporating genetic data can add value to predictability. Methods: We conducted two experiments. In the first experiment, we modeled longitudinal EHR data with aggregated features and temporal features. We applied logistic regression (LR), random forests (RF), gradient boosting trees (GBT), Convolutional Neural Networks (CNN), and Recurrent Neural Networks using Long Short-Term Memory (LSTM) units. In the second experiment, we proposed a late-fusion framework to incorporate genetic features. Results: Our study cohort included 109,490 individuals (9,824 cases and 99,666 controls) from Vanderbilt University Medical Center’s (VUMC) de-identified EHRs. The American College of Cardiology and American Heart Association (ACC/AHA) Pooled Cohort Risk Equations had an area under the receiver operating characteristic curve (AUROC) of 0.732 and an area under the precision-recall curve (AUPRC) of 0.187. LSTM, CNN, and GBT with temporal features achieved the best results, with AUROCs of 0.789, 0.790, and 0.791 and AUPRCs of 0.282, 0.280, and 0.285, respectively. The late-fusion approach achieved a significant improvement in prediction performance. Conclusions: Machine learning and deep learning with longitudinal features improved 10-year CVD risk prediction. Incorporating genetic features further enhanced 10-year CVD prediction performance, underscoring the importance of integrating relevant genetic data whenever available in the context of routine care.
ER  - 