Prediction performance of a cardiovascular risk assessment tool using Stanford EHR data repository

Mehrdad Rezaee; Arsia Takeh; Igor Putrenko; Andrea Ganna; Erik Ingelsson

doi:10.1101/648956

Abstract

Background Stratification of individuals for their risk to develop cardiovascular diseases can be used for effective prevention and intervention. A significant amount of information for risk assessment can be obtained through repurposing electronic health records (EHR). The objective of this study is to derive and assess the performance of prediction models for cardiovascular outcomes by using EHR-derived data.

Methods We used the Stanford Medicine Research Data Repository (STARR) data from 2000-2017, containing over 2.1 million patients. A subset of 762,372 individuals with complete International Classification of Diseases (ICD) data was used to fit Cox proportional hazard models for prediction of six cardiovascular-related diseases and type 2 diabetes.

Results The derived prediction models indicated consistent high discrimination performance (C-index) for all diseases examined: coronary artery disease (0.85), hypertension (0.82), type 2 diabetes (0.77), stroke (0.76), atrial fibrillation (0.82) and abdominal aortic aneurysm (0.77). Lower prediction abilities were observed for deep vein thrombosis (0.67). These results were consistent across age groups and maintained good prediction abilities among individuals with pre-existing diabetes or hypertension. Assessment of model calibration is ongoing.

Conclusions We proposed new prediction models for the seven diseases using ICD codes derived from EHR data. EHR data can be used for health risk assessment, but challenges related to data quality and model generalizability and calibration remain to be solved.

Introduction

Cardiovascular diseases (CVD), including diabetes, are the leading causes of global morbidity, mortality and health care costs [1–3]. Appropriate early risk assessment can identify high-risk individuals, spare those at lower risk from intensive interventions, educate patients, provide information on population outcomes, and help in resource allocation [4–6]. In order to accurately determine the risk of developing cardiovascular events, the multifactorial nature of these chronic diseases needs to be considered.

Tools for evaluating cardiovascular risk have been available since the Framingham investigators developed the original algorithms in 1960s [7–10]. Since then, multitudes of other risk assessment approaches have been developed that incorporate an increasing number of risk factors, biomarkers, and comorbidities [10–14]. Clinical application of these models, either for individual patient care or for the purpose of population disease management, requires validation across different patient populations and data sources [15, 16]. There are considerable factors that hinder the performance analysis of any cardiovascular risk model; this is primarily due to the difficulty in obtaining standardized, pertinent large data sets that can be used for calibration and validation studies [17–19].

EHR-based information represents a valuable source of data that can be used for evaluation of prediction modeling. EHRs have the distinct advantage of containing extensive biomedical information on large numbers of individuals across multiple data points. The purpose of the current study is to use EHR information from a large population of patients (STARR) to evaluate the performance of newly developed risk assessment models using International Classification of Diseases (ICD) outcomes derived from the EHR data.

Material and Methods

Data source

We obtained de-identified patient data from the STARR dataset. The STARR population (2.1 million) includes patients from all ages (0 to 91 years old) who have attended Stanford Hospital or any of its clinics from 2001 to 2017. Demographic information, encounters, lab results and pharmacy orders are recorded in the database. However, conventional cardiovascular risk factors (biomarkers, diabetes status and family history) had a high frequency of missing data (80-100 percent). For this reason (i.e. the complete data was limited) we looked into alternative types of data. Specifically, we focused on using only ICD-derived (International Classification of Diseases, Tenth Revision, Clinical Modification) risk factors for prediction of the cardiovascular-related events. Thus, we included all individuals who were 18 years-old or older, who had at least one ICD code for the relevant cardiovascular risk factors and had non-missing information on age, BMI and sex. This decreased the population size to 762,372.

Risk factors and outcome definitions

In addition to age, BMI and sex, the presence of cardiovascular-related risk factors was defined using the following ICD-10 codes: elevated triglycerides (E78.1), elevated cholesterol (E78.00), depressed HDL cholesterol (E78.6), elevated LDL Cholesterol (E78.5), elevated creatinine (R79.89) and vitamin D deficiency (E55.9).

Disease outcomes were defined based on the following ICD codes: coronary artery disease (CAD): I20–I25 and T82 codes; hypertension (HTN): I10, I15, and R03.0 codes; type 2 diabetes mellitus (DM); E11, E13, and E14 codes, stroke: G46.3, G46.4, I63, I66, I67, and I693 codes; deep vein Thrombosis (DVT): H34.8, H40.8, I23.6, I24.0, I63, I67.6, I74, I81, I82, I87.2, I87.3, K64.5, N48.8, N52.0, O03.3, O03.8, O04.8, O07.3, O08.7, 022, O87, Q26, T82.8, T83.8, T84.8, T85.8, and Z86.7 codes; abdominal aortic aneurysm (AAA): I71 and I79.0 codes; and for atrial fibrillation (AF): I48-49 codes. For each disease endpoint, we also obtained the date when the patient was diagnosed with the endpoint.

Statistical Analysis

We consider risk factors recorded from 2001 to 2012 and use these risk factors to predict five-year risk for each disease outcome between 2012 and 2017. For each disease outcome, we also excluded individuals who were diagnosed with that disease before the baseline year of 2012. That is, we considered only incident outcomes. End-of-follow up is defined as the diagnosis of the disease, death or end of the study (December 2017).

Linear Cox proportional hazard (PH) models were developed using lifelines 0.13.0 Python library.

Since the risk factors were not measured at baseline, but during an 11 year period, they were modeled as time-varying covariates. If multiple instances of the same risk factors were measured for the same individual we considered the last instance, as this is the closest to the baseline.

The discrimination was assessed on the basis of five-fold cross-validated Harrell’s concordance index (C-index) [20–22]. The cross-validated C-index was used as a main metric for assessing discriminative ability of the Cox PH-based models.

Principal component analysis (PCA) was used for validating the selection of variables and to avoid overfitting through comparison of the number of selected variables and optimal number of principal components [23]. The number of components to be retained was determined by using maximum-likelihood density estimation and full singular value decomposition as parameters of the PCA function, which applies Bayesian model selection to probabilistic PCA in this configuration.

Sensitivity Analysis

To evaluate the prediction performances of the derived models across different subsets of the population we performed sensitivity analyses in the following subpopulations: 1) healthy participants without any of the cardiovascular-related diseases or type 2 diabetes at the baseline, 2) the following age groups, with age measured at baseline: <45, 45-55, 56-65, 66-75, >75, 3) individuals with at least one non-target disease (DM, HTN, Stroke, DVT).

Results

The study population included 762,372 adult patients ages 18 years old and above, which visited Stanford Hospital and clinics during the period spanning 2000-2017, had at least one cardiovascular-related ICD code and information on age, sex and BMI (Table 1). We notice that this approach reduces the generalizability of our results. Nonetheless, this study represents a proof of concept that prediction models can be derived from ICD data. Table 2 reports the number of individuals included in each disease-specific analysis and the incidence of the main disease outcome for general population and age stratified sub-populations. Individuals reported in Table 2 were used to derive the final predictions. As presented in Table 3, the cross-validated discrimination metrics (C-index) across all disease was high (>0.75) except for DVT with a C-index of 0.67.

View this table:

Table 1:

Characteristics of the population included for the final analysis (762, 372). Number of individuals in each group, on brackets, percentage of individuals in each group.

View this table:

Table 2:

Population size and disease incidence. Incidence (number of incident cases/total population) for each target disease is presented for the total population as well as age groups for each disease. Numbers inside the parenthesis are the Incidence percentages.

View this table:

Table 3:

Five-fold cross validated C-index [95% CI] as a measure of prediction performance for each disease.

Modeling performance was further evaluated across clinically-distinct sub-populations (Table 4). Prediction models for each disease had consistent behavior across all the subpopulations examined.

View this table:

Table 4:

Discriminative ability of risk prediction models across different subgroups.

Discussion

The extensive availability of EHR data offers a unique and promising opportunity for the application of prediction models at the point of care or for population planning. EHR data have the advantage of including large numbers of individuals and recording many interactions that an individual has with the health system. This aspect allows us to model longitudinal changes in risk factors. The main disadvantage of the EHR-based data is low data quality, poor generalizability outside the specific EHR system and loss at follow-up (i.e. inability to track people that change health systems or emigrate).

The models derived in this paper focus on the same set of diseases that have been the focus of a companion paper developed on a large epidemiological cohort study (UK Biobank) [24]. Despite different data sources, we observed comparable prediction abilities across the studies. However, one advantage of large epidemiological studies as compared to EHR-based data is the limited number of missing data that allows for direct modeling of biomarkers as risk factors. In the current study this was not possible, and we had to rely on ICD-based diagnosis. This is suboptimal as the lack of ICD codes does not necessarily imply that the risk factor is not present in the individual. Moreover, this binarizes the underlying continuous variables (e.g. cholesterol), which limits the predictive value of such risk factors. Nonetheless, we still observe good prediction abilities indicating that the current, pragmatic approach has some value.

In this study we didn’t evaluate model calibration. This is a limitation that will be addressed in a future revision of the manuscript.

Conclusions

In this report, we present the development and validation of risk prediction models for cardiovascular-related diseases and type 2 diabetes using EHR data. Due to the large number of missing data for traditional risk factors, ICD codes were used to define the model predictors. The developed models had good prediction abilities in the entire study population as well in specific clinically-relevant subgroup populations. Future research will focus on including additional risk factors, such as biomarkers and genetic information, and on evaluating the generalizability outside this specific study population. Prediction models derived from EHR data have the potential to be used for primary prevention of cardiovascular-related disease.

Funding Source

The funder, Precision Wellness, Inc., provided support in the form of salaries for authors AT and IP, consultancy fees to AG, and as an unrestricted research grant to Stanford University (led by EI), but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘Author contributions’ section.

Author Contributions

Conceptualization: MR AT
Data curation: AT
Formal analysis: AT, IP
Funding acquisition: MR
Investigation: MR AT EI
Methodology: MR AT IP AG EI
Project administration: MR AT
Resources: MR
Software: AT
Supervision: MR AT
Validation: MR EI AG
Visualization: AT, IP
Writing – original draft: AT IP MR
Writing – review & editing: MR EI AG

Acknowledgement

Precision wellness, Inc. was solely responsible for the conception and development of the risk analysis algorithms described in this report.

EI served as the PI for this study, which was performed, in part, through a sponsored research (SPO 134382) agreement with Stanford University.

References

1.↵
Ritchey MD, Wall HK, Owens PL, Wright JS. Vital signs: state-level variation in nonfatal and fatal cardiovascular events targeted for prevention by Million Hearts 2022. Morbidity and Mortality Weekly Report. 2018 Sep 7;67(35):974.
OpenUrl
2.
Giedrimiene D, King R. Burden of Cardiovascular Disease (CVD) on Economic Cost. Comparison of Outcomes in US and Europe. Circulation 2017 Mar; Cardiovascular Quality and Outcomes: 10 (3): A207.
OpenUrl
3.↵
Benjamin EJ, Blaha MJ, Chiuve SE, Cushman M, Das SR, Deo R, Floyd J, Fornage M, Gillespie C, Isasi CR, Jiménez MC. Heart disease and stroke statistics-2017 update: a report from the American Heart Association. Circulation 2017 Mar;135(10):e146–603.
OpenUrl FREE Full Text
4.↵
Curry SJ, Krist AH, Owens DK, Barry MJ, Caughey AB, Davidson KW, Doubeni CA, Epling JW, Kemper AR, Kubik M, Landefeld CS. Risk assessment for cardiovascular disease with nontraditional risk factors: US preventive services task force recommendation statement. Jama. 2018 Jul 17;320(3):272–80.
OpenUrl
5.
Cooney MT, Dudina A, D’Agostino R, Graham IM. Cardiovascular risk-estimation systems in primary prevention: do they differ? Do they make a difference? Can we see the future?. Circulation. 2010 Jul 20;122(3):300–10.
OpenUrl FREE Full Text
6.↵
Lloyd-Jones DM, Braun LT, Ndumele CE, Smith SC, Sperling LS, Virani SS, Blumenthal RS. Use of risk assessment tools to guide decision-making in the primary prevention of atherosclerotic cardiovascular disease: a special report from the American Heart Association and American College of Cardiology. Journal of the American College of Cardiology. 2018 Nov 10:25711.
OpenUrl
7.↵
Damon A, Damon ST, Harpending HC, Kannel WB. Predicting coronary heart disease from body measurements of Framingham males. Journal of Chronic Diseases. 1969 Apr 1;21(11-12):781–802.
OpenUrl CrossRef PubMed Web of Science
8.
Brand RJ, Rosenman RH, Sholtz RI, Friedman ME. Multivariate prediction of coronary heart disease in the Western Collaborative Group Study compared to the findings of the Framingham study. Circulation. 1976 Feb;53(2):348–55.
OpenUrl Abstract/FREE Full Text
9.
Truett J, Cornfield J, Kannel W. A multivariate analysis of the risk of coronary heart disease in Framingham. Journal of Clinical Epidemiology. 1967 Jul 1;20(7):511–24.
OpenUrl
10.↵
D’Agostino Sr RB, Pencina MJ, Massaro JM, Coady S. Cardiovascular disease risk assessment: insights from Framingham. Global heart. 2013 Mar 1;8(1):11–23.
OpenUrl
11.
Conroy RM, Pyörälä K, Fitzgerald AE, Sans S, Menotti A, De Backer G, De Bacquer D, Ducimetiere P, Jousilahti P, Keil U, Njølstad I. Estimation of ten-year risk of fatal cardiovascular disease in Europe: the SCORE project. European heart journal. 2003 Jun 1;24(11):987–1003.
OpenUrl CrossRef PubMed Web of Science
12.
Goff DC, Lloyd-Jones DM, Bennett G, Coady S, D’Agostino RB, Gibbons R, Greenland P, Lackland DT, Levy D, O’Donnell CJ, Robinson JG. 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. Journal of the American College of Cardiology. 2014 Jul 1;63(25 Part B):2935–59.
OpenUrl CrossRef PubMed
13.
Kavousi M, Leening MJ, Nanchen D, Greenland P, Graham IM, Steyerberg EW, Ikram MA, Stricker BH, Hofman A, Franco OH. Comparison of application of the ACC/AHA guidelines, Adult Treatment Panel III guidelines, and European Society of Cardiology guidelines for cardiovascular disease prevention in a European cohort. Jama. 2014 Apr 9;311(14):1416–23.
OpenUrl CrossRef PubMed Web of Science
14.↵
Wald NJ, Law MR. A strategy to reduce cardiovascular disease by more than 80%. Bmj. 2003 Jun 26;326(7404):1419.
OpenUrl Abstract/FREE Full Text
15.↵
DeFilippis AP, Young R, Carrubba CJ, McEvoy JW, Budoff MJ, Blumenthal RS, Kronmal RA, McClelland RL, Nasir K, Blaha MJ. An analysis of calibration and discrimination among multiple cardiovascular risk scores in a modern multiethnic cohort. Annals of internal medicine. 2015 Feb 17;162(4):266–75.
OpenUrl CrossRef PubMed
16.↵
Hippisley-Cox J, Coupland C, Vinogradova Y, Robson J, May M, Brindle P. Derivation and validation of QRISK, a new cardiovascular disease risk score for the United Kingdom: prospective open cohort study. Bmj. 2007 Jul 19;335(7611):136.
OpenUrl Abstract/FREE Full Text
17.↵
Read SH, van Diepen M, Colhoun HM, Halbesma N, Lindsay RS, McKnight JA, McAllister DA, Pearson ER, Petrie JR, Philip S, Sattar N. Performance of cardiovascular disease risk scores in people diagnosed with type 2 diabetes: external validation using data from the national Scottish diabetes register. Diabetes care. 2018 Sep 1;41(9):2010–8.
OpenUrl Abstract/FREE Full Text
18.
Damen JA, Hooft L, Schuit E, Debray TP, Collins GS, Tzoulaki I, Lassale CM, Siontis GC, Chiocchia V, Roberts C, Schlüssel MM. Prediction models for cardiovascular disease risk in the general population: systematic review. bmj. 2016 May 16;353:i2416.
OpenUrl Abstract/FREE Full Text
19.↵
Rumsfeld JS, Joynt KE, Maddox TM. Big data analytics to improve cardiovascular care: promise and challenges. Nature Reviews Cardiology. 2016 Jun;13(6):350.
OpenUrl
20.↵
Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. Jama. 1982 May 14;247(18):2543–6.
OpenUrl CrossRef PubMed Web of Science
21.
Harrell Jr FE, Lee KL, Califf RM, Pryor DB, Rosati RA. Regression modelling strategies for improved prognostic prediction. Statistics in medicine. 1984 Apr;3(2):143–52.
OpenUrl CrossRef PubMed Web of Science
22.↵
Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in medicine. 1996 Feb 29;15(4):361–87.
OpenUrl CrossRef PubMed Web of Science
23.↵
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J. Scikit-learn: Machine learning in Python. Journal of machine learning research. 2011;12(Oct):2825–30.
OpenUrl Web of Science
24.↵
Rezaee M, Takeh A, Putrenko I, Ganna A, Ingelsson E. Development and validation of a next-gen health stratification engine to determine risk for multiple cardiovascular diseases, bioRxiv, doi: 10.1101/562900
OpenUrl Abstract/FREE Full Text