Tensor Factorization-based Prediction with an Application to Estimating the Risk of Chronic Diseases

Tensor factorization has emerged as a powerful method to address the challenges of high dimensionality regarding disease development and comorbidity. Chronic diseases are highly likely to co-occur, so patients suffering from one chronic disease have an elevated risk of developing others as they age. Individualized prediction of chronic diseases can help patients prevent new diseases and reduce healthcare costs. Despite the rich body of risk assessment models for chronic diseases, individualized risk prediction that considers the complex mechanisms of disease development and comorbidity remains under-researched. This research aims to develop tensor factorization-based machine learning models that predict the onset of new chronic diseases for individual patients by incorporating comorbidity patterns together with the clinical and sequential factors revealed in electronic health record (EHR) data. We propose two tensor factorization-based methods that incorporate the clinical and sequential factors to reveal the latent patterns of co-occurring chronic diseases. The efficacy of the proposed methods was validated by predicting the onset of new chronic diseases for individual patients using 23 years of EHR data from a major hospital in Hong Kong. The proposed methods consistently outperform benchmark predictive models. The top 10 predictions of new chronic diseases have approximately 60% recall. Tensor factorization is an appropriate method for predicting the onset of chronic diseases at the individual level. The proposed predictive models could inform proactive health management programs for at-risk patients with different chronic conditions at discharge.

Author summary

The existing risk assessment models mainly focus on the prediction of single diseases at the population level. Chronic disease risk prediction that considers the complex mechanisms of disease development and comorbidity is under-researched. To support and inform clinical decision making for healthcare professionals in an aging society, this study provides an innovative approach to mapping an interconnected web of chronic illnesses and investigates the performance of chronic disease prediction using 2 years' worth of patient assessment records and 23 years' admission history data from a major hospital in Hong Kong. We propose matrix- and tensor-based methods to represent the high-order interrelations among patients, chronic diseases, and additional features, which can reveal the latent patterns of co-occurring chronic diseases and enable more effective prediction. The proposed methods exhibit state-of-the-art performance in predicting the onset of new chronic diseases for individual patients.

Tensor factorization, the higher-order extension of two-dimensional matrix factorization, has emerged as a promising method to address the challenges posed by the high dimensionality of EHR data, with good interpretability and scalability [1,2]. Tensor factorization has been widely used in recommender systems, social network analysis, process monitoring, etc. [3][4][5]. Recently, tensor-based models have also been applied to healthcare problems, including phenotype generation, medical information retrieval, image-based diagnosis, and precision medicine [6][7][8][9][10][11]. Compared with traditional machine learning methods, tensor-based models have three distinctive advantages: (a) the capability to utilize multi-aspect features in multiple dimensions; (b) the versatility to incorporate domain knowledge from physicians or medical knowledge bases; and (c) the capability to handle the sparsity problem, a major challenge for many data mining tasks and particularly for EHR datasets. These advantages make tensor factorization a promising modeling approach to disease prediction using EHR data, which are usually high-dimensional and sparse and need domain expertise to ensure model validity [1,12,13].

Chronic diseases are a major cause of morbidity and mortality worldwide [14]. As of 2012, approximately half of all adults in the US had one or more chronic health conditions, and one in four adults had at least two chronic conditions [15]. The reasons for the rapid rise in chronic illness include population aging, longer life expectancies due to improvements in medical care, and advances in diagnostic technology and treatment options for many chronic diseases. Among older adults in the US, 77% have at least two chronic illnesses [16] and 43% of Medicare beneficiaries have three or more [17]. For instance, chronic conditions such as hypertension, heart diseases, diabetes, and chronic obstructive pulmonary disease (COPD) are highly likely to co-occur, so a patient suffering from any one of these conditions has an elevated risk of the others in the course of aging [18][19][20]. In addition, even acute conditions that are frequently considered causes of hospitalization among the elderly, such as sepsis, peritonitis, or falls, could nonetheless be manifestations of underlying chronic conditions. It has also been recognized that early risk identification can facilitate early prevention and disease management in the community, thereby reducing the number of people suffering from chronic diseases and their acute presentations [21]. However, only with the widespread adoption of electronic health records (EHRs) could predictive analytics be applied to shed light on the evolution and comorbidity of chronic diseases [22][23][24].

Harnessing EHR data to predict diseases has emerged as an important topic for medical informatics and precision medicine [25][26][27]. Predictive models based on EHR data can help physicians assess the future risk of an individual for a certain chronic condition or multiple conditions and identify the patients who are likely to acquire new diseases. Moreover, predictive models enable the comparison of the benefits and the costs/risks of alternative treatments and prevention strategies, and allow for personalized disease management for individuals [28,29].
proposed as the first diabetes prediction (regression-based) algorithm for type-2 diabetes [31]. In another study, the logistic regression-based DiaRem score was proposed to predict the remission of type-2 diabetes after Roux-en-Y gastric bypass surgery [32]. Similar research on developing regression models for diabetes prediction is rich and has been validated with various datasets [33][34][35]. In addition to diabetes, successful regression-based risk assessment models exist for other chronic diseases, such as cardiovascular disease and chronic kidney disease [36][37][38]. Refer to a recent review for details [12].

During the past decade, machine learning models have been recognized as an effective method for chronic disease prediction. For instance, Himes et al. developed a Bayesian network model to predict COPD in asthma patients and demonstrated the good accuracy of the model using 15 years of EHR data [39]. Kurosaki et al. constructed decision tree models that can identify patients at a high risk of hepatocellular carcinoma development among different datasets [40,41]. In another study, artificial neural networks and C5.0 classifiers were merged with decision trees to form a hybrid model to predict type-1 diabetes mellitus [42]. A recent study evaluated the performance of three classic models (naïve Bayesian classifier, Bayesian network, and support vector machines) in leveraging daily self-monitoring reports to predict asthma exacerbation and demonstrated the potential of machine learning models in providing personalized monitoring decision support [28].

Despite the rich results of risk assessment models for chronic diseases, the existing literature mainly focuses on the prediction of single diseases at the population level, rather than on individualized risk prediction that considers the complex mechanisms of disease development and comorbidity, which usually have a high dimensionality [12,43].

(patient assessment dataset) contains the patient assessment information (e.g. heart rate, blood pressure, smoking

Matrix factorization decomposes a matrix into the product of matrices [45]. Tensor factorization is the higher-order extension of matrix factorization and enables the modeling of heterogeneous and multidimensional data. Matrix and tensor factorizations can extract the latent components to enhance data mining tasks [1]. The most widely used tensor factorization methods are CANDECOMP/PARAFAC (CP) factorization and Tucker factorization [5]. The Tucker factorization decomposes a tensor into a core tensor multiplied by a matrix along each mode. Meanwhile, CP factorization decomposes the tensor into a sum of rank-one tensors. CP factorization is also a special case of Tucker factorization in which the core is superdiagonal. CP factorization is unique under mild assumptions, making it suitable for uncovering and interpreting the actual latent factors, because no equivalent rotated factorization yields the same fit [1]. The rest of this paper adopts CP factorization to perform prediction.

To illustrate matrix and tensor factorizations, we present the factorizations of a second-order matrix $\chi \in \mathbb{R}^{I \times J}$ and a third-order tensor $\chi \in \mathbb{R}^{I \times J \times K}$ in Fig 1. The matrix and the tensor can be expressed as Eq (1) and Eq (2), respectively, where $\circ$ denotes the outer product. Each data entry in the second-order matrix ($x_{ij}$) could be interpreted as the inner product of two latent feature vectors, as shown by Eq (3). Similarly, each data entry in the third-order tensor ($x_{ijk}$) could be interpreted as the inner product of three latent feature vectors, as shown by Eq (4). Such factorization is highly interpretable for multidimensional data mining applications, as we can interpret the decomposed components as high-order grouping patterns [1].
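The two views described above (a sum of rank-one outer products, and an entrywise inner product of latent feature vectors) can be checked numerically. The following is a minimal NumPy sketch, not the authors' code; the factor matrices `A`, `B`, `C`, the helper `cp_reconstruct`, and the toy sizes are hypothetical names and values chosen for illustration.

```python
import numpy as np

def cp_reconstruct(factors):
    """Sum of rank-one tensors built from matching columns of the factor matrices."""
    rank = factors[0].shape[1]
    out = np.zeros(tuple(f.shape[0] for f in factors))
    for r in range(rank):
        component = factors[0][:, r]
        for f in factors[1:]:
            # Outer product adds one mode per factor matrix.
            component = np.multiply.outer(component, f[:, r])
        out += component
    return out

rng = np.random.default_rng(0)
A = rng.random((4, 2))  # hypothetical I x R factor
B = rng.random((3, 2))  # hypothetical J x R factor
C = rng.random((5, 2))  # hypothetical K x R factor

X2 = cp_reconstruct([A, B])     # second-order case
X3 = cp_reconstruct([A, B, C])  # third-order case

# Entrywise view: each entry is an inner product of latent feature vectors.
i, j, k = 1, 2, 3
entry2 = np.sum(A[i] * B[j])
entry3 = np.sum(A[i] * B[j] * C[k])
```

In the second-order case the reconstruction coincides with the ordinary matrix product `A @ B.T`, which is why CP factorization is often described as the higher-order analogue of matrix factorization.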
Matrix and tensor factorization methods differ in their assumptions regarding the factors and the underlying structures. In particular, nonnegative matrix and tensor factorizations, both of which incorporate nonnegativity constraints, have proven successful in many applications [46]. Such nonnegativity constraints are suitable for this study because the values of EHR data entries are mostly nonnegative. The most popular cost functions are (a) the least squares error, which corresponds to an assumption of independent and identically distributed normal noise, and (b) the Kullback-Leibler (KL) divergence, which corresponds to maximum likelihood estimation under an independent Poisson assumption [3,47]. As the count of admissions is used to construct tensors in the predictive models, we adopt the tensor factorization methods using the generalized KL divergence and the multiplicative update rules [48,49]. Refer to a recent review for the details of tensor factorizations [1].

In this study, we use matrices to represent the two-dimensional relationship between patients and diseases. Third-order tensors are constructed to represent the high-dimensional interrelations among patients, chronic diseases, and additional features. The matrices and tensors constructed from the EHR data are usually sparse, with numerous "Nil" values. For example, there are many possible chronic diseases, and a patient usually has only a few of them upon discharge. There is no value for the rest of the diseases, meaning that the patient has not acquired these diseases yet. Estimating which disease this patient is likely to acquire is difficult. Matrix and tensor factorizations have been demonstrated to be effective in estimating the values of such "Nil" data by exploring the latent grouping patterns in the observed tensors [50].
In the context of disease prediction, we can use matrix/tensor factorization to extract the

We decompose the matrix as shown in Fig 1(a) and obtain the feature factors of patients ($p_i$) and chronic diseases ($d_j$). The predicted risk score for patient i to acquire chronic disease j is the inner product of the extracted latent feature vectors $p_i$ and $d_j$.

Nonnegative Tensor Factorization with Clinical information (NTF-C)

Matrix-based data mining methods lack the capability to capture the characteristics and patterns in multi-aspect data. Medical research has recognized that clinical attributes could indicate the patients' various health trajectories, which could lead to acquiring different diseases in the future [51]. For instance, hypertension could lead to several chronic diseases, depending on the specific clinical attributes (e.g. certain symptoms) the patient currently has. Patients who smoke are more likely to acquire a respiratory disease such as COPD, whereas patients with an irregular heartbeat could be on the path to cardiovascular disease. Tensor factorizations provide a powerful framework to model such multi-aspect data by explicitly exploiting the multi-aspect structure to identify the latent clusters of data [1]. We extend the second-order NMF model into a third-order nonnegative tensor factorization (NTF) method to capture the clinical attributes.
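For reference, the second-order NMF baseline that NTF-C extends can be sketched with the standard Lee-Seung multiplicative updates for the generalized KL divergence. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the toy patient-by-disease count matrix `X`, the iteration count, and the random initialization are all hypothetical.

```python
import numpy as np

def generalized_kl(X, Xhat, eps=1e-12):
    """Generalized KL divergence D(X || Xhat), with the convention 0 * log(0) = 0."""
    pos = X > 0
    return float(np.sum(X[pos] * np.log(X[pos] / (Xhat[pos] + eps))) - X.sum() + Xhat.sum())

def nmf_kl(X, rank=2, n_iter=200, seed=0, eps=1e-12):
    """Factorize X (patients x diseases) as P @ D.T with nonnegative P and D."""
    rng = np.random.default_rng(seed)
    n_patients, n_diseases = X.shape
    P = rng.random((n_patients, rank)) + 0.1  # patient feature vectors (rows)
    D = rng.random((n_diseases, rank)) + 0.1  # disease feature vectors (rows)
    history = []
    for _ in range(n_iter):
        ratio = X / (P @ D.T + eps)
        P *= (ratio @ D) / (D.sum(axis=0) + eps)
        ratio = X / (P @ D.T + eps)
        D *= (ratio.T @ P) / (P.sum(axis=0) + eps)
        history.append(generalized_kl(X, P @ D.T))
    return P, D, history

# Hypothetical admission counts: 5 patients x 4 chronic diseases, 0 standing in for "Nil".
X = np.array([[2, 0, 1, 0],
              [1, 0, 2, 0],
              [0, 3, 0, 1],
              [0, 1, 0, 2],
              [1, 0, 0, 0]], dtype=float)

P, D, history = nmf_kl(X, rank=2)
scores = P @ D.T  # scores[i, j] is the inner product of p_i and d_j
```

Multiplicative updates keep the factors nonnegative without any explicit projection step, because each update only rescales the current entries by a nonnegative ratio.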

To model the multi-aspect interrelations among patients, chronic diseases, and clinical attributes, we construct a third-order observed tensor $\chi$ (as shown in Fig 2(b)), in which $x_{ijk}$ represents the frequency of admissions that comprise the corresponding

decomposes observed tensor $\chi$ to the sum of rank-one tensors as illustrated in Fig 1(b). $x_{ijk}$ can then be estimated through the inner product of three latent feature vectors:

$$x_{ijk} \approx \sum_{r} p_{ir}\, d_{jr}\, c_{kr} \qquad (6)$$

where $p_i$, $d_j$, and $c_k$ are the feature vectors of patients, chronic diseases, and clinical attributes, respectively.
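A third-order nonnegative CP factorization fitted with generalized KL divergence and multiplicative updates might look like the sketch below. It is a generic illustration of the technique, not the authors' code: the names `cp3`, `ntf_kl`, the toy tensor, and the hyperparameters other than rank 2 are assumptions made for the example.

```python
import numpy as np

def cp3(P, D, C):
    """Entry (i, j, k) is sum_r P[i, r] * D[j, r] * C[k, r]."""
    return np.einsum("ir,jr,kr->ijk", P, D, C)

def generalized_kl(X, Xhat, eps=1e-12):
    """Generalized KL divergence D(X || Xhat), with the convention 0 * log(0) = 0."""
    pos = X > 0
    return float(np.sum(X[pos] * np.log(X[pos] / (Xhat[pos] + eps))) - X.sum() + Xhat.sum())

def ntf_kl(X, rank=2, n_iter=300, seed=0, eps=1e-12):
    """Nonnegative CP factorization of a patients x diseases x attributes tensor."""
    rng = np.random.default_rng(seed)
    P, D, C = (rng.random((n, rank)) + 0.1 for n in X.shape)
    history = []
    for _ in range(n_iter):
        # Update one factor at a time, recomputing the elementwise ratio each time.
        ratio = X / (cp3(P, D, C) + eps)
        P *= np.einsum("ijk,jr,kr->ir", ratio, D, C) / (D.sum(0) * C.sum(0) + eps)
        ratio = X / (cp3(P, D, C) + eps)
        D *= np.einsum("ijk,ir,kr->jr", ratio, P, C) / (P.sum(0) * C.sum(0) + eps)
        ratio = X / (cp3(P, D, C) + eps)
        C *= np.einsum("ijk,ir,jr->kr", ratio, P, D) / (P.sum(0) * D.sum(0) + eps)
        history.append(generalized_kl(X, cp3(P, D, C)))
    return P, D, C, history

# Hypothetical toy tensor: 4 patients x 3 diseases x 2 clinical attributes.
X = np.zeros((4, 3, 2))
X[0, 0, 0], X[0, 1, 0], X[1, 0, 0], X[1, 1, 0] = 2, 1, 1, 2
X[2, 2, 1], X[3, 2, 1], X[3, 1, 1] = 3, 1, 1

P, D, C, history = ntf_kl(X, rank=2)
R = cp3(P, D, C)  # low-rank reconstruction of X
```

Entries that are zero in `X` generally receive small positive values in `R`, which is the property the prediction step relies on.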

Then, R is defined as the reconstructed tensor. Specifically, R is obtained by summing the outer products of the factorized components. The values of the data entries $r_{ijk}$ in the reconstructed tensor are the estimated values for $x_{ijk}$ obtained by Eq (6). This reconstructed tensor R is not equal to the observed tensor $\chi$; instead, R is a low-rank approximation of $\chi$. This reconstruction process can capture the latent high-dimensional grouping patterns of $\chi$. Thus, R provides evidence for predicting the values of the entries that are "Nil" in the observed tensor $\chi$. The risk factors are determined by the values of the entries in R. If an entry is "Nil" in the observed tensor $\chi$ but positive in the reconstructed tensor R, it indicates a risk of acquiring this disease for the corresponding patient. In the tensor, a tube represents the risk factors embedded in different clinical attributes for patient i and disease j. This set of risk factors is integrated across clinical attributes to generate a risk score for patient i in acquiring disease j, as given by Eq (7).

Nonnegative Tensor Factorization with Sequential Information (NTF-S)

The chronic diseases of patients evolve over time, with a complicated comorbidity relationship among different diseases [20]. Acquiring one chronic disease may lead to the risk of subsequently acquiring another chronic disease. Another third-order tensor-based model is proposed to model the sequence of chronic diseases. As shown in Fig 3,

First, an appropriate set of data is sampled as the cohort for this study (introduced in the next section). Second, the tensors based on the clinical attributes (for NTF-C) and the sequential information (for NTF-S) are constructed. Then, we use the sampled EHR data to train the model and evaluate the prediction performance.
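The NTF-C risk score described above, which integrates the reconstructed entries along each patient-disease tube, might be computed as in the following sketch. Two points are assumptions rather than details taken from the text: the integration is implemented as a plain sum over the clinical-attribute mode, and diseases already observed for a patient are excluded from the ranking. The arrays `X` and `R` are hypothetical toy values.

```python
import numpy as np

def ntf_c_risk_scores(R, X):
    """Aggregate each (patient, disease) tube of R over the clinical-attribute mode."""
    scores = R.sum(axis=2)        # assumed aggregation: sum over attributes
    observed = X.sum(axis=2) > 0  # diseases the patient already has
    return np.where(observed, -np.inf, scores)

def top_k_diseases(scores, k):
    """Indices of the k highest-scoring candidate diseases per patient."""
    return np.argsort(-scores, axis=1, kind="stable")[:, :k]

# Hypothetical observed tensor: 2 patients x 3 diseases x 2 clinical attributes.
X = np.zeros((2, 3, 2))
X[0, 0, 0] = 1  # patient 0 already has disease 0
X[1, 2, 1] = 2  # patient 1 already has disease 2

# Hypothetical reconstructed tensor with the same shape.
R = np.array([[[0.9, 0.1], [0.2, 0.3], [0.05, 0.05]],
              [[0.1, 0.1], [0.3, 0.4], [0.2, 1.5]]])

scores = ntf_c_risk_scores(R, X)
top2 = top_k_diseases(scores, k=2)
```

Masking the observed diseases with negative infinity pushes them to the end of the ranking, so the top-k list contains only diseases the patient has not yet acquired whenever enough candidates remain.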
To demonstrate the performance of the proposed tensor-based methods, five benchmark machine learning methods, including logistic regression (LR), multinomial Bayesian classifier (MB), CART decision tree (DT), truncated singular value decomposition (SVD), and the previously introduced NMF, were used to perform the same tasks. In CP factorization, the number of rank-one components, named the rank of the tensor, captures the number of potential sub-groups in the tensor. However, determining the value of the rank is difficult [5]. Thus, we empirically evaluated different values of the rank and found that the best experimental result could be obtained with rank = 2. The low value of the rank is expected due to the sparse nature of the EHR data.

The widely used top-k recall was adopted as the evaluation metric. For each patient, the risk score is calculated following Eq (7) or Eq (8) for each potential chronic disease, and then those diseases are sorted in descending order of the score.

$$\text{recall@}k = \frac{\#\ \text{of recalled diseases in the top-}k\ \text{predictions}}{N} \qquad (9)$$

where N is the number of future diagnosed diseases in the dataset.
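The top-k recall metric can be illustrated with a short sketch. It assumes that recall is computed per patient and then averaged, which is one reading of the "mean recall@k" reported in the results; pooling hits over all patients before dividing would be another. The function names and the toy scores are hypothetical.

```python
import numpy as np

def recall_at_k(scores, future_diseases, k):
    """Share of a patient's future diseases that appear among the k highest scores."""
    ranked = np.argsort(-np.asarray(scores, dtype=float), kind="stable")[:k]
    hits = len(set(ranked.tolist()) & set(future_diseases))
    return hits / len(future_diseases)

def mean_recall_at_k(score_matrix, future_by_patient, k):
    """Average recall@k over patients with at least one future disease (assumed aggregation)."""
    recalls = [recall_at_k(s, f, k)
               for s, f in zip(score_matrix, future_by_patient) if len(f) > 0]
    return float(np.mean(recalls))

# Hypothetical risk scores for 2 patients over 4 candidate diseases.
scores = [[0.1, 0.8, 0.3, 0.5],
          [0.7, 0.2, 0.6, 0.1]]
# Hypothetical diseases diagnosed later, given as column indices.
future = [{1, 2}, {3}]

recall_1 = mean_recall_at_k(scores, future, k=1)
recall_3 = mean_recall_at_k(scores, future, k=3)
recall_4 = mean_recall_at_k(scores, future, k=4)
```

With k equal to the number of candidate diseases, every future disease is necessarily covered, so the metric reaches its maximum of 1.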

As k increases, the top-k recall cannot decrease. If k is equal to the total number of diseases, then the top-k recall is 100%, as all possible diseases are covered by the "prediction." Table 3(a) presents the mean recall@k using the 2014-2015 dataset.

Table 3(b) presents the results for the 1993-2013 dataset (without patient assessment information). The newly proposed NTF-S method consistently outperformed the other methods, except for the top-9 prediction (only 0.6% lower than SVD). The recall of the top-1 prediction is over 60% higher than that of the commonly used machine learning methods for risk prediction, such as DT, MB, and LR. The recall of the NTF-S method for the top-5 prediction is higher than 50%. These experimental results demonstrate that modeling the latent high-dimensional associations among patients, diseases, and sequences can help capture useful latent features for prediction.

The performance of NTF-S is better than that of NTF-C, largely because NTF-S takes advantage of the 23-year admission history dataset, which contains rich sequential comorbidity patterns of chronic diseases, such as disease A usually occurring earlier than disease B in a patient. On the other hand, the NTF-C model contains additional clinical attribute information and, thus, performed better than the matrix-based models.

However, the smaller 2-year patient assessment record dataset limited the performance of the NTF-C model. In our future research, we plan to request a highly comprehensive 23-year EHR dataset (with both sequence and clinical attribute information) to construct an integrated fourth-order ⟨patient, disease, clinical attributes, sequence⟩ model that takes advantage of both models.

To support and inform clinical decision making for healthcare professionals in this aging society, the current study provides an innovative approach to the mapping of an interconnected web of chronic illnesses over the course of aging. With 2 years' worth of patient assessment records and 23 years' admission history data from a major hospital in Hong Kong, we demonstrate that tensor factorization is an appropriate approach to capture the latent high-dimensional associations among patients, diseases, clinical attributes, and sequential patterns. The proposed predictive models can predict the onset of new chronic diseases at the individual level with superior accuracy as compared with benchmark machine learning methods.

Benefiting from the capability of tensor to model heterogeneous and multi-aspect

proactive strategy to identify at-risk patients with different chronic conditions at discharge into various health management programs, such as telemedicine support with community call centers. In the long run, the implementation of accurate prediction models can prevent readmissions and, thus, reduce the hospital occupancy rate in this aging society.

The study has several limitations. First, the performance of the NTF-C model depends heavily on the selection of clinical attributes. However, we were not able to access typical EHR data with complete clinical information, such as lab test results, because of privacy concerns in the absence of sufficient authorization from the patients. In this research, the six adopted clinical attributes helped to slightly improve the prediction accuracy. These clinical attributes are common risk factors for all chronic diseases and, thus, are not sensitive to specific diseases. How to better select clinical attributes for the model is the focus of our future work. Second, the two proposed third-order tensor-based models incorporate the information of clinical attributes and sequential patterns, respectively. We did not propose a fourth-order tensor to incorporate both types of information in a single model because of the sparsity problem that arises at higher orders. In our future work, we plan to address this problem by augmenting the observed tensor using the semantic information in knowledge bases. We will also explore other factorization methods, such as coupled tensor factorizations, to solve the sparsity problem. Third, the prediction was made by analyzing the existing EHR data, which could be biased towards the selected cohort or miss critical medical information, such as the associations between certain diseases.