Development and validation of a next-gen health stratification engine to determine risk for multiple cardiovascular diseases

Cardiometabolic diseases (CMD) have a greater impact on every aspect of health care than any other disease group. Accurate and timely assessment of an individual's propensity to develop CMD events is one of the most critical steps in preventing these conditions. The principal objective of the present study is to report the development and validation of a next-generation risk engine to predict CMD. UK Biobank population data were used to derive predictive models for six CMD. Missing data were imputed. Cox proportional hazards models were used to estimate annual absolute risk and the relative risk associated with different risk factors for these conditions. In addition to conventional risk factors, the applied models included socioeconomic data, lifestyle factors and comorbidities as predictors of outcomes. In total, 416,936 individuals were included in the analysis. The derived prediction models achieved consistent, moderate-to-high discrimination performance (C-index) for all diseases: coronary artery disease (0.79), hypertension (0.82), type 2 diabetes mellitus (0.87), stroke (0.79), deep vein thrombosis (0.75), and abdominal aortic aneurysm (0.90). These results were consistent across age groups (37-73 years) and showed similar predictive ability among those with pre-existing diabetes or hypertension. Calibration of risk scores showed moderate overestimation of CMD-related conditions only in the highest decile of risk scores for all models. In summary, the newly developed algorithms, based on Cox proportional hazards models, achieved high discrimination and good calibration for several CMD. The integration of these algorithms on a single platform may have direct clinical impact.


Introduction

Cardiometabolic diseases (CMD) have been the leading causes of death in the United States since the 1920s, and 45% of the U.S. population is projected to suffer from at least one of these diseases by 2035 [1]. The healthcare cost associated with these diseases represents one of the greatest global economic burdens [2]. As with any chronic condition, appropriate prevention and selective treatment for CMD are the most effective approaches to defer their clinical and financial impact on individuals and across populations.

Primary prevention of chronic diseases is resource-intensive, costly, and ineffective if applied non-selectively [3]. Therefore, accurate population and individual stratification is needed to provide individualized, as well as population-specific, care. To achieve clinically relevant risk stratification, established risk factors and novel population-specific data should be considered when deriving clinically applicable prediction algorithms.

For over 20 years, the concept of cardiovascular risk assessment has been tested through prediction models that are utilized in the clinical setting [4][5][6]. Current prediction models have good discrimination abilities to identify individuals who will develop CMD. However, there are opportunities to address the limitations of current models, such as by including contemporary risk factors, biomarkers and genetic information as part of the algorithms [7]. Also, the current systems are limited to only a few diseases, such as coronary artery disease and stroke, without consideration of major comorbidities. Moreover, current models do not allow for imputation of missing data; and finally, they are primarily directed to prevention of disease over a 10-year span.
In this study, we present the development and validation of a next-generation stratification platform that integrates conventional clinical risk factors and biomarkers with socioeconomic, lifestyle and comorbidity data for six cardiometabolic diseases (CMD). To derive these new prediction models, we used data provided by the UK Biobank (UKBB) project [8], including over 400,000 men and women aged 37-73 years, with 6.1 years of median longitudinal follow-up.

Materials and methods

Baseline data preparation

Baseline data on 502,616 UKBB participants, collected at assessment centers, were used to derive the prediction models. Overall, 95% of the UKBB participants self-described as white, and women comprised 54.4% of the total. CMD outcomes were determined based on International Classification of Diseases, 10th revision (ICD-10) codes, as well as self-reports for coronary artery disease (CAD), hypertension (HTN), type 2 diabetes mellitus

The UKBB data were subsequently linked to hospital episode statistics (HES) data from hospitals in England, Scotland and Wales. The age and date of a CMD event were determined from the primary or secondary ICD-10 codes in the HES data corresponding to the event, using the earliest hospital record. The date of inclusion into the UKBB was defined as baseline and was used as the starting point for time-to-event calculations. The exit date was defined as the date of death, the end of follow-up (February 29, 2016), or the date of a CMD event, whichever happened first. Only those CMD-positive cases that were identified by ICD-10 codes, self-reports, or medication as described above and had the date of the event determined from the HES data were included in the analyses, reducing the number of participants to 416,936. In addition, participants with prior CMD events (before baseline) were excluded from analyses of that specific event; e.g., those with a prior CAD event were excluded from the CAD analyses, and so on.

The datasets created for each CMD were split into training and testing sets at an 80%/20% ratio. Testing sets were used for model validation and calibration. Age- and CMD-specific testing sets were created by applying the corresponding age and disease filters to the general test datasets (without reusing any data from the training sets, to avoid overfitting).
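The follow-up bookkeeping and the 80%/20% split described above can be sketched as follows. This is a minimal illustration in Python rather than the pipeline used in the study: the function names, the 365.25-day year, the fixed random seed and the participant-level random shuffle are all assumptions. Only the end-of-follow-up date, the "whichever happened first" rule, the exclusion of prior events and the split ratio come from the text.

```python
import random
from datetime import date

END_OF_FOLLOW_UP = date(2016, 2, 29)  # end of follow-up stated in the text


def has_prior_event(baseline, event_date):
    """True if the target event predates baseline (such cases are excluded)."""
    return event_date is not None and event_date < baseline


def time_to_event(baseline, event_date=None, death_date=None,
                  end=END_OF_FOLLOW_UP):
    """Return (years_of_follow_up, event_observed) for one participant.

    The exit date is the earliest of the CMD event, death, or the end of
    follow-up. The 365.25-day year is an assumption of this sketch.
    """
    candidates = [end]
    if event_date is not None:
        candidates.append(event_date)
    if death_date is not None:
        candidates.append(death_date)
    exit_date = min(candidates)
    # The event counts as observed only if it is what ended follow-up.
    observed = event_date is not None and event_date == exit_date
    years = (exit_date - baseline).days / 365.25
    return years, observed


def split_train_test(participant_ids, test_fraction=0.2, seed=0):
    """Shuffle participant IDs and split them into disjoint train/test lists."""
    shuffled = list(participant_ids)
    random.Random(seed).shuffle(shuffled)  # hypothetical seed, for repeatability
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy usage with made-up dates: an event in 2012 ends follow-up early.
years, observed = time_to_event(date(2010, 1, 1), event_date=date(2012, 1, 1))
train_ids, test_ids = split_train_test(range(100))
```

Splitting on participant IDs before any age or disease filter is applied keeps the subgroup test sets inside the general test set, which is one way to honour the requirement that no training data be reused during validation.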

To develop highly predictive CMD risk models, in addition to using already available UKBB data fields, we derived new variables that captured sociodemographic and socioeconomic factors, laboratory test results, physiological measurements, physical activity, nutrition, alcohol consumption and family history of CMD, as well as the presence of diseases, disorders, or previous surgeries, as shown in Table 1.

Table 1.

The alcohol score was calculated according to Alternative Healthy Eating Index (AHEI) guidelines [10]. One alcohol serving corresponded to 11.4 grams of alcohol. Further, a nutrition AHEI score was calculated as the sum of scores for the following nutrition categories: vegetables, fruits, grains, sugar-sweetened beverages and fruit juices, nuts, meat, fish, polyunsaturated fatty acids (PUFA), and alcohol. The nutrition scores were calculated according to AHEI guidelines [10].

In addition to the predicted CMD (target CMD), participants could experience other, competing CMD outcomes. We used the age at which these non-target diseases occurred as an additional risk factor. For participants who did not experience a CMD event before baseline (CMD-negative cases), the age of CMD was set to 100. This approach allowed for

The resulting chi-square statistic was assessed using 8 degrees of freedom and was reported with its p-value. A calibration plot was created by plotting the predicted risk probabilities against the observed risks for each group.
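The calibration check can be illustrated with a short sketch. It assumes a Hosmer-Lemeshow-style test over ten equal-sized risk groups; this is consistent with the 8 degrees of freedom (ten groups minus two) and with the decile language used elsewhere in the paper, but the surviving text does not confirm it. It also treats the outcome as a binary event indicator and ignores censoring. The function names are hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (exact for even df)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total


def hosmer_lemeshow(predicted, observed, groups=10):
    """Return (chi-square statistic, p-value, per-group calibration points).

    predicted: predicted risk probabilities, each strictly between 0 and 1
    observed:  0/1 event indicators (censoring is ignored in this sketch)
    Needs at least `groups` participants so that no group is empty.
    """
    pairs = sorted(zip(predicted, observed))  # order participants by risk
    n = len(pairs)
    stat = 0.0
    points = []  # (mean predicted risk, observed event rate) per group
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)
        events = sum(y for _, y in chunk)
        mean_p = expected / size
        stat += (events - expected) ** 2 / (size * mean_p * (1.0 - mean_p))
        points.append((mean_p, events / size))
    return stat, chi2_sf_even_df(stat, groups - 2), points
```

The third return value holds the points of a calibration plot: one (mean predicted risk, observed event rate) pair per risk group, where perfect calibration lies on the diagonal.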

The study characteristics and the prevalence of the six CMD at baseline for the 416,936 UKBB participants (including CMD-positive cases that were identified by ICD-10 codes, self-reports, or medication and had the date of the event determined from the HES data) are shown in Tables 1 and 2. The average age of men and women in this population was 56.3 ± 8.3 and 56 ± 8.1 years, respectively. During follow-up (median 6.1 years), 98,254 incident CMD events occurred in 67,785 participants who were free from the disease at baseline (Table 2).

The prevalence of CMD at baseline and the incidence of CMD during follow-up are shown in parentheses.

Imputation of missing data

Initial data quality evaluation showed that the proportion of missing values for the examined variables (Table 1) varied from 0 to ~52%, with a mean of 6.3%, resulting in complete-case (no-null) dataset sizes of ~78K-81K (vs. the initial ~380K-416K). As discussed in the methods, imputation of missing values for all continuous variables (Table 1)

(Table 3). Cox PH models were further applied to calculate the risk probabilities of occurrence of a CMD event at 5 years following the initial observation. This time-to-event prediction was evaluated by determining the statistical 'distance' between the risk scores of the CMD-positive and CMD-negative test subgroups (Table 3). F-statistic values were highest for the models with high discriminative ability, except for the AAA model, owing to the low prevalence of this disease.

(Table 3) and visualized by the calibration plot (Fig 2) showed adequate overall calibration, but moderate overestimation of CMD risk in the highest decile of risk scores.

(Table 4), indicating the degree of the association between the predictor and the outcome. Predictors presented in Table 4 are only those with absolute coefficient values larger than 0.8 and p-values less than 0.001 (see S1 Table for

Positive and negative signs indicate that the corresponding factors increase or decrease the risk of CMD, respectively. For clarity of presentation, only coefficients with absolute values larger than 0.8 and p-values less than 0.001 are presented.

Across all disease models, age and low forced expiratory volume (FEV1) ranked as the most important predictors. Higher body mass index (BMI) and hypercholesterolemia medication were also among the strongest predictors for several models. Sex ranked high only for CAD and AAA, in good agreement with our observation that the prevalence of these diseases was higher in men than in women. Family history ranked high only in predicting CAD and DM2. Nutrition was among the most important predictors for DM2, stroke, and AAA, which is likely explained by a healthier diet among individuals with certain risk factors and predispositions. Similarly, coffee consumption was an important predictor of HTN and DM2, possibly due to lower consumption in individuals with specific risk factor profiles. Physical activity was an important predictor only for DM2, and a younger age at first occurrence of CAD, DVT and DM2 was among the most important predictors for HTN and stroke.

Validation

C-indexes for the corresponding benchmark risk prediction models, with age and sex as the only predictors, were lower (delta, 0.04-0.2) than those of our newly developed models. The broad applicability and consistent performance of the developed risk prediction models for each disease were further assessed by measuring discriminative ability across subpopulations (Table 5). These subpopulations included (1) 'healthy' participants without any of the six target CMD at baseline; (2) participants with at least one pre-existing non-target CMD at baseline; and (3) various age categories. The performance of the models was highest in the younger age groups and the healthy subgroup, while it dropped significantly in the subpopulation with pre-existing CMD.

The performance of the CMD models was tested on four different age group subpopulations. The healthy subpopulation included individuals without any CMD at baseline. The unhealthy subpopulation included cases with any non-target CMD at baseline.
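The C-index used throughout this validation can be computed as in the following sketch of Harrell's concordance index for right-censored data. This is the standard textbook definition, not necessarily the exact implementation used in the study. The function name and the handling of ties (half credit for tied risk scores, tied follow-up times treated as not comparable) are assumptions.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs ranked correctly by risk.

    times:       follow-up time for each participant
    events:      True/1 if the CMD event was observed, False/0 if censored
    risk_scores: model output, where a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored participant cannot anchor a comparable pair
        for j in range(n):
            if times[j] > times[i]:  # j was still event-free when i had the event
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable
```

Calling the same function once with the scores of a full model and once with the scores of an age-and-sex-only benchmark, on the same test subset, gives the kind of delta reported above. Restricting the inputs to one age band or to the 'healthy' subgroup gives a subgroup C-index. The double loop is quadratic in the number of participants, so a cohort of this size would need a faster implementation in practice.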

Principal findings

In this study, we present the development and validation of a risk assessment platform applicable to six CMD. The population-specific modeling for this platform was done using a dataset from the UK Biobank, a very large, longitudinal cohort study. This allowed us to derive prediction models and identify the most important contributing risk factors even for diseases with low incidence. Inclusion of a broad spectrum of risk factors allowed the array of input variables for the CMD risk prediction models in the platform to be modified without a significant decrease in their predictive performance. The models performed with high discriminative ability, as demonstrated through extensive validation across different disease and age group subpopulations. Accordingly, this platform can accommodate different types of data sets and is applicable to population analysis, as well as individual assessment.

There is an abundance of risk predictors for CMD, and there have been multiple prior attempts to combine them into risk calculators [17][18][19]. One of the major impediments to widespread application of these risk predictors is the lack of uniform validation through large population analyses. A comprehensive review found 363 models for cardiovascular risk stratification that have been developed and reported [20]. Only a small fraction of these models had sufficient evaluation according to contemporaneous analysis standards for either development or validation. For example, 39% of the 363 models analyzed utilized C-statistics for their development, and just over 60% for their validation. An even smaller number of the models utilized calibration as any part of the performance measures. However, the more recent models (since 2009) were more consistent in providing performance reports: 76% as part of their development, and up to 90% as part of validation [20].
In the current study, the discriminative ability of the developed models was similar to or exceeded that of established models, where available. For example, the discriminative ability of the Framingham Risk Score for coronary artery disease has been determined to be close to 0.76 and 0.79 for men and women, respectively [21]; these reported results were obtained only in the presence of all of the laboratory data and for a pre-selected small population. The modeling described for the platform in this report allows for incorporation of contemporary risk information. This is becoming increasingly important, since more limited risk calculators may fail to express the true risk for a significant part of the population. As demonstrated previously, either 50% of patients with CMD lack conventional risk factors or the conventional risk factors fail to explain more than 15-50% of the incidence of coronary heart disease (CHD) [22][23][24][25][26].

The ability to incorporate socioeconomic data and nutritional information collectively can complement the basic information that is equivalent to conventional biomarkers. This is demonstrated in this study, as the performance of the current platform was achieved without the use of blood laboratory information, such as lipid levels or blood glucose levels (as those were not available in UKBB at the time of this study). Incorporation of polygenic scoring is underway and may reveal populations at increased risk of, or protected from, developing CMD [27][28][29]. It is expected that incorporation of polygenic scoring will further increase the predictive performance of the current platform.

Because the UKBB population is not fully representative of the UK or US populations, the main limitation of this study is that the developed models may need to be evaluated in more diverse populations. Predictive performance of the models was higher when tested on healthier and younger subpopulations. At the same time, training and calibration on CMD-specific datasets are required to improve the discriminative ability of the models across CMD subpopulations. Because the datasets used in predictive modeling were almost identical for the different CMD, the varying predictive performance of the CMD models implies that, despite overlapping pathophysiological pathways for various CMD, there are predictors specific to individual CMD.

In this report, we present the development and validation of a new generation of disease risk prediction models. The differentiating features of this platform include: a) assessment of multiple related diseases according to their associated outcomes (not just coronary artery disease); b) inclusion of contemporary risk factors; c) variable engineering and processing that allows for inclusion of data from different sources and for handling missing data points; d) population-specific stratification to assess risk prediction in different subgroups; e) a modular design that allows for inclusion of other risk determinants, such as genetic information; and f) applicability at the individual, as well as the population, level. These features were designed into the platform to make risk prediction applicable to managing and changing the course of cardiometabolic diseases.