Patients with different cancer types are stratified by CBC data

The search for informative indices that indicate cancer type and the cause of the disease is of great importance. Here we tried to identify such indices from complete blood count data only. We studied the inhomogeneity in the mutual distribution of oncology patients with various types of tumors in the space of quantitative data provided by the complete blood count. Patients with cancers of the hematological system were excluded. The ultimate goal is to reveal the relation between such inhomogeneities and the cause of a disease. We used a database of complete blood counts comprising oncology patients with various causes of tumor development. The analysis was carried out by both linear (K-means) and non-linear (elastic map technique) methods. No linear clustering was found. On the contrary, the elastic map technique yields stable clustering that identifies no fewer than three clusters in the set of patients. No relation of those clusters to the sex or age of patients was found. Four indices (namely, BAS, EOS, WBC and IG) exhibit no relation to the cluster structure, while all the others do. Thus, the patients are stratified according to their response to the stress caused by a cancer tumor. Complete blood count data may be used for preliminary diagnostics of a tumor and its cause in oncology patients. This type of analysis is cheap, standard and available at any medical organization.

Hence, the development and implementation of various techniques and approaches to detect a malignant neoplasm, as well as to treat it effectively, is a core issue of up-to-date medicine and medical biology. The most advanced methods contribute a lot to both scientific oncology and cancer treatment; at the same time, various data banks on cancer and related problems are currently available to researchers, and the big data approach may bring a lot here [8]. Such data banks are full of hidden or concealed relations between the numerous records, figures and characters; a researcher may gain a lot from careful knowledge extraction and examination of such data banks.

The key idea here is a search for inhomogeneities in the distribution of the points representing the data in multidimensional space, as well as for newly identified interdependencies between the variables or their combinations, rather than a verification of a priori formulated hypotheses. Of course, one should seek the most informative or significant variables, or their combinations, that yield the inhomogeneities mentioned above; these may be of high informative value both for a researcher and for a doctor. The idea is the following: first, a researcher identifies inhomogeneities in the distribution of the data points and checks whether the resulting clusters are distinguishable. At the second step, the interplay between the composition of those clusters and the characteristics of a disease (or a patient) is investigated. This approach is very promising for the investigation of data banks containing abundant clinical records collected in hospitals or other medical services.

The approach mentioned above might be applied to the analysis of data records representing routine, standardized and widespread medical analyses, thus bringing some new understanding of the problem. Here we analyzed the data bank of complete blood counts supported by the A. Křižanovski Krasnoyarsk Oncology Hospital, Krasnoyarsk, Russia. The dataset contains the complete blood count records of patients suffering from various oncological diseases, except for cancers of the blood system itself. Again, this analysis is mandatory in any medical practice, and its low cost and availability, alongside the long history of its use, make the data highly reliable and consistent nationwide.

The idea of collating complete blood count records with oncological pathology is not entirely new. Paper [9] shows the efficiency of the complete blood count (CBC) in predicting and correcting the treatment course of patients with two types of cancer: breast cancer and colon cancer. CBC data were used to trace a quite specific treatment procedure for those pathologies, namely intraoperative chemotherapy provided by automedia. A relation between CBC records and ovarian cancer is discussed in [10]; the paper shows the possibility of improving early diagnosis of the disease through a complex analysis of CBC records. The relations between CBC, biochemical and immunological data obtained from the blood of patients with breast cancer are presented in [11]. The authors believe that the absence of significant intragroup scatter may reflect the stability of metabolic processes and the general sustainability of a patient. Differential diagnostics of carcinoid tumors of the gastrointestinal tract including CBC data is discussed in [12], and CBC is shown to be an efficient tool for the differential diagnostics of some types of gastrointestinal cancer.

CBC could be used as a prognostic marker of the probability of lymph node metastasis caused by endometrial cancer [13]. Patients who had completed the full course of treatment for this disease could be reliably classified into a risk group vs. a group with minimal probability of metastasis. The classification was carried out by linear logistic regression; the latter revealed a hidden relation between the growth of neutrophil counts and lymph node damage. Similarly, paper [14] provides an estimation of the endometrial cancer probability based on CBC records.

They used the characteristics of the erythrocyte distribution width over cell volume as biomarkers related to the inflammatory process. These distribution parameters are also prognostically very informative for patients with endometrial cancer. Again, the authors stress the ease and very low cost of CBC recording.

Paper [15] reports on the feasibility of CBC in colon cancer examination for the purposes of early diagnostics. Some CBC figures unambiguously indicate an increased risk of colorectal cancer, which in turn results in the assignment of an additional special examination for the patient. The approach is based on machine learning methods [16]; in this respect, the paper resembles, to some extent, our approach and results. A correlation analysis between pre-surgery and post-surgery thrombocyte content associated with some surgery factors for patients with cervical cancer is reported in [13]. This paper claims that some CBC figures could be used as a supplementary tool for pre-surgery diagnostics in cervical cancer patients. Finally, the papers [17,18]

Thus, the aim of our research is to identify the inhomogeneities in the distribution of patients suffering from oncological tumors of different localization, origin and type, and to reveal the relations between some features of the diseases and the composition of the clusters identified in the multidimensional data space. Each patient record is a string of figures obtained from the complete blood count; together the records comprise a data set in a multidimensional metric space. The interplay between the structure revealed in this space and the peculiarities of the disease characteristics is the ultimate goal of this work.

We studied the database of CBC records obtained from patients with different oncological diseases (excluding the various types of leukemia), together with the sex and age of each patient. All data have been

• each variable has been tested for normality of its distribution;

• finally, a histogram has been built for each variable, to evaluate the distribution pattern.
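
The per-variable checks listed above can be illustrated with a minimal sketch. The text does not say which normality test was used, so the Jarque-Bera test here is an assumption chosen because it needs only the standard library; the function names `jarque_bera` and `histogram` and the synthetic columns are hypothetical and are not patient data.

```python
import math
import random


def jarque_bera(values):
    """Return (JB statistic, p-value) for the null hypothesis of normality."""
    n = len(values)
    mean = sum(values) / n
    # Central moments (biased estimators, as in the classical JB statistic).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # JB is asymptotically chi-square with 2 degrees of freedom,
    # whose survival function is exactly exp(-x / 2).
    return jb, math.exp(-jb / 2.0)


def histogram(values, bins=10):
    """Count values in `bins` equal-width intervals spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        # Degenerate column: every value falls into the first bin.
        return [len(values)] + [0] * (bins - 1)
    counts = [0] * bins
    for v in values:
        # The maximum value is assigned to the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# Synthetic stand-ins for two CBC variables (NOT real patient data).
rng = random.Random(0)
symmetric = [rng.gauss(140.0, 15.0) for _ in range(500)]  # bell-shaped
skewed = [rng.expovariate(1.0) for _ in range(500)]       # right-skewed

for name, column in (("symmetric", symmetric), ("skewed", skewed)):
    jb, p = jarque_bera(column)
    print(f"{name}: JB = {jb:.2f}, p = {p:.4f}, histogram = {histogram(column)}")
```

A small p-value suggests that the variable departs from normality; the histogram counts give a quick view of the shape of the distribution.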

Principal component analysis (PCA) [19] was carried out at the next stage, together with the calculation of the correlation matrix. PCA was applied to determine the effective (linear) dimension of the data, and the correlation matrix was calculated to identify the pairs of variables with a strong linear relation. The latter was used to select the CBC variables to be eliminated from further analysis; the point is that numerous linear constraints may bias the detection of the inner structure.

Hence, each patient is represented by a point in 21-dimensional metric space.
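
The correlation-based screening of variables described above might look roughly like the sketch below. It covers only the pairwise correlation step, not PCA. The greedy rule (keep the first variable of each strongly correlated pair), the 0.95 threshold, the function names and the toy values are all assumptions made for illustration; the text does not state the selection rule or the cut-off actually used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_collinear(columns, threshold=0.95):
    """Greedily keep variables that are not strongly correlated with any kept one.

    `columns` maps a variable name to its list of values. Returns the list of
    kept names and a dict {dropped name: kept name it duplicates}.
    """
    kept = []
    dropped = {}
    for name, values in columns.items():
        partner = next(
            (k for k in kept if abs(pearson(columns[k], values)) >= threshold),
            None,
        )
        if partner is None:
            kept.append(name)
        else:
            dropped[name] = partner
    return kept, dropped


# Toy values (NOT real patient data): HCT is almost proportional to HGB.
columns = {
    "HGB": [120, 135, 150, 110, 142, 128],
    "HCT": [36.1, 40.3, 45.2, 33.0, 42.5, 38.6],
    "PLT": [250, 310, 190, 280, 220, 305],
}
kept, dropped = drop_collinear(columns)
print("kept:", kept)
print("dropped:", dropped)
```

In this toy input the second variable is discarded because it is almost a linear function of the first, while the third one is retained.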

Everywhere below we used the Euclidean metric to measure distance in this space. Table 1 shows the abundances of neoplasms of various types found in the patients included in the database. So, we sought inhomogeneities in the distribution of the patients enrolled in the database, in the 21-dimensional Euclidean space of CBC figures only. Next, if such clusters are found, what is their composition in terms of pathology, patient features or disease characteristics? Reciprocally, the dual question

well-known traditional approach that is PCA [19]. Geometrically, the first principal

images of the data set points (these were orthogonal projections). The projected images located on a convex part of the map get closer as the map straightens; vice versa, the projected images located on a concave part of the map get more distant. This is intuitively clear, since the first case means that the original points were "strong enough" to pull the elastic map towards them; reciprocally, the second case means that the points were "weak enough" to keep the elastic map close to them.

To visualize the distribution of the newly obtained image points over the map, one should introduce some local density clustering procedure. There are a number of methods of this kind (see, e.g., [21] for some details), and we shall implement probably the simplest one, based on a local density function. To do that, one must supply each image point on

where r_j is the coordinate vector (in inner coordinates) of the j-th point, r is the radius originating at r_j, A is a factor usually equal to 1, and σ (similar to the standard deviation of a normal distribution) is the contrasting parameter: it determines the width of the "hat" covering a point. It should be stressed that function (1) is rotationally symmetric, for the sake of simplicity. Finally, the sum function

where N is the total number of points in the data set. Function (2) is the local density function, which represents the cluster structure in the data set, if any.
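
A local density of this kind can be sketched as follows. The exact form of the bell-shaped function is not reproduced in the text available here, so the Gaussian-type expression A·exp(−|r − r_j|²/σ²) used below is an assumption that merely matches the verbal description (rotationally symmetric, factor A, width parameter σ). The image coordinates, the grid and the name `local_density` are hypothetical.

```python
import math


def local_density(point, images, sigma=0.3, amplitude=1.0):
    """Sum of rotationally symmetric bells centred at every image point.

    Assumed bell shape: amplitude * exp(-|r - r_j|^2 / sigma^2).
    """
    x, y = point
    total = 0.0
    for xj, yj in images:
        sq_dist = (x - xj) ** 2 + (y - yj) ** 2
        total += amplitude * math.exp(-sq_dist / sigma ** 2)
    return total


# Hypothetical image points in the inner coordinates of a unit-square map:
# three points in the lower-left corner, two in the upper-right corner.
images = [(0.1, 0.1), (0.2, 0.15), (0.15, 0.2), (0.9, 0.85), (0.85, 0.9)]

# Evaluate the density on an 11 x 11 grid of map nodes.
grid = [(i / 10, k / 10) for i in range(11) for k in range(11)]
density = {node: local_density(node, images) for node in grid}

peak = max(density, key=density.get)
print("densest grid node:", peak, "density:", round(density[peak], 3))
```

A smaller σ makes each bell narrower and resolves finer structure; a larger σ merges neighbouring groups of points into a single peak.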

To begin with, let's consider PCA results and correlation analysis of the original data

We also studied the distribution of various types of tumors over the map; the types are shown in Table 1. To do so, the points on the map were labeled according to the localization or the type of tumor. No specific prevalence in cluster occupation was observed, either for localization or for the type (malignant or benign) of tumor. Such an effect may result from the significant bias in the tumor type distribution in the original database: it comprises 720 records with a malignant tumor diagnosis and only 44 entries with benign ones. A similar absence of prevalence in the distribution was observed for the sex and age of the patients.

At the next stage, we studied the distribution of the patients over the elastic map depending on the specific values of the CBC parameters used. The lowest and the highest values over the database were found for each of the 21 CBC parameters. It should be stressed that some patients exhibit figures falling beyond the normal range of a parameter, and some do not. The span between the lowest and the highest value was split into 10 equal intervals, and we traced on the map the patients with different values of the CBC characteristics. Step by step, we labeled the patients with the level of a characteristic not exceeding λ_j; here λ denotes a characteristic, and j denotes the interval number, 1 ≤ j ≤ 10.
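
The step-by-step labelling described above can be sketched as follows, under the assumption that the thresholds λ_j are the upper ends of ten equal intervals between the lowest and the highest observed value. The function names and the toy values are hypothetical and are not patient data.

```python
def interval_thresholds(values, n_intervals=10):
    """Upper bounds lambda_1..lambda_n of equal intervals spanning [min, max]."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_intervals
    thresholds = [lo + step * j for j in range(1, n_intervals + 1)]
    thresholds[-1] = hi  # guard against rounding so the maximum is always covered
    return thresholds


def patients_up_to(values, threshold):
    """Indices of patients whose value does not exceed the threshold."""
    return [i for i, v in enumerate(values) if v <= threshold]


# Toy values of one CBC characteristic for eight patients (NOT real data).
values = [90, 105, 120, 150, 190, 98, 133, 171]

thresholds = interval_thresholds(values)
counts = [len(patients_up_to(values, t)) for t in thresholds]
for j, (t, c) in enumerate(zip(thresholds, counts), start=1):
    print(f"j = {j:2d}, lambda_j = {t:6.1f}, labelled patients = {c}")
```

Plotting the labelled patients on the map for j = 1, 2, ..., 10 in turn gives the sequence of pictures from which a filling pattern can be read.
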
Surprisingly, all 21 characteristics were found to fall into two groups in terms of the pattern in which the patients are distributed over the elastic map as the values of those characteristics increase. The pattern called starry sky was observed for BAS, EOS, WBC and IG. This pattern manifests itself as a fairly homogeneous occurrence of points over the elastic map as the level of a CBC characteristic grows. Naturally, there are some minor variations in the details of the starry sky appearance for different CBC characteristics. Fig. 4 shows this starry sky pattern observed for the distribution of eosinophil content over the map.

The opposite type of distribution of points over the elastic map, called wave, was observed for the greater part of the CBC characteristics. As a characteristic (from this list) grows from its lowest figure to its highest one, the points corresponding to the patients tend to occupy the map quite regularly, layer by layer. The pattern indeed looks like a moving wave. Evidently, various characteristics have different starting locations on the map: it means that there is (almost) nobody in the database who has the minimal figures of two characteristics simultaneously (see the detailed discussion below). Fig. 3 shows this wave-type pattern observed for the distribution of monocyte content over the map.

The occurrence of malignant neoplasms is a complicated process involving a number of factors. Currently, there is no universal and indisputable theory to classify these factors [22][23][24][25]. A number of papers provide evidence of the extreme variety and diversity both of cancer tumors and of the factors standing behind them. Thus, the elaboration of logically clear and informative predictors of malignant neoplasm development is still an acute problem, and sound progress is reported in this direction [1,4,8,18,25]. This progress relies mainly on sophisticated molecular biology techniques.

The implementation of up-to-date advanced methods of nonlinear statistical analysis brings an opportunity to extract knowledge from classical standard procedures and data records. Some methods and approaches may look quite complicated and unconventional, thus requiring from a researcher some special effort and qualification in data processing; the very low cost, availability and routine standard procedures of data gathering, introduced into the everyday practice of any hospital, are the pay-off for these difficulties.

There are two fundamental issues in the modern approach to medical data treatment based on big data methodology:

• the simultaneous and combined analysis of a number of different characteristics that originally had not been brought into a common theory is the first issue, and
• "modelling" of data, that is, the approximation of bulky multidimensional data with a manifold of low dimension, is the second issue.

Here we pursued both these ideas. Let's focus on Table 1 once again. Neither the elastic map technique (see Fig. 1) nor K-means identified a relation between a cluster and a type of malignant neoplasm; the same is true for benign neoplasms. Such a failure may result from excessive detail in the specification of localization and cancer type. It might be that, by grouping together the different tumors having the same ontogenetic origin, one gets a clearer and more apparent correlation between the clustering and the disease.

The features of the database mentioned above impose some constraints on the analysis and experiment design (if any). The impossibility of gathering a reference group is the key point here, unless a specific factor is identified to arrange a randomization against it. In practice, it means that there is no way to collect a group of healthy blood donors to serve as a reference group for comparison with the patients with malignant neoplasms. The main reason for this is the number of factors falling beyond any control in a tentative reference group. For example, one is never able to provide a similar geographic distribution of the reference group members, or a similar distribution of occupations, etc. Meanwhile, all these factors may significantly affect the distribution pattern. Moreover, one would have to gather a separate reference group for each specific malignant neoplasm included in the database. To overcome the problem, we analyzed the patient cohorts. The identified features of tumors with different localization may provide a good tool for alternative diagnostics and patient investigation.

Let us now focus on Fig. 3; we have examined all the CBC characteristics in terms of the pattern of map filling. It was found that various characteristics start to fill the map from different places, and do so in a specific way. This fact seems to be a manifestation of the Liebig principle: any cancer tumor is a severe stress factor. Thus, the stress causes a mobilization of the resources of a sick organism, concentrating them into a kind of "bottleneck". The most striking fact here is that the database contains a number of patients with different pathologies in different organs and tissues. And regardless of the specific type or localization of a tumor, we see through the clustering provided by the elastic map technique that a kind of specialization in the pathways and contours of resource mobilization takes place.

Indeed, there is only one pair of variables that yields a closely similar pattern of map filling as the values grow: these are haematocrit and haemoglobin. This situation seems quite natural: the correlation coefficient between these two variables is very high. These two CBC characteristics start to fill the map from the left border and go smoothly to the right, keeping quite a straight line. On the contrary,

The patterns of map filling described above unambiguously prove the stratification of the patients over the space determined by the CBC characteristics. There is no stratification of the patients in terms of a correspondence between the clusters identified by the elastic map technique and age, sex or disease description (type and/or localization of a tumor). On the contrary, the patients are stratified by their type of response to the cancer tumor attack. There is a rather evident correlation between cluster composition and the pattern of map filling. The latter reflects the type and way of mobilization of the resources of an organism caused by a malignant neoplasm, and the stratification follows the Liebig principle. This is a fundamentally new stratification observed in cancer patients, and further studies of the detailed relations between the response type and the related essential clinical aspects are still required.