Structure and dynamics of human disease-complication network

A complication is an unanticipated disease that arises following, or is induced by, a disease, a treatment or a procedure. We compile the Human Disease-Complication Network (HDCN) from medical data and investigate the characteristics of the network. We observe that the modules of the network are dominated by the classes of diseases, and we unveil the relations between modules in detail. Three nontrivial motifs are identified in the network. We further simulate the dynamics of the motifs with the Boolean dynamic model. Each motif represents a specific dynamic behavior that is potentially functional in the disease system, such as generating temporal progressions and governing the responses to fluctuating external stimuli.

Author summary
Advances in molecular biology have led to the new discipline of network medicine, which investigates human diseases from a network perspective. Recently, clinical records have been introduced to the research of complex networks of diseases. An important available medical dataset that has been overlooked so far is the complications of diseases, which are vital for patients. We compile the Human Disease-Complication Network, representing the causality between upstream diseases and their downstream complications. This work not only helps us to comprehend why certain groups of diseases appear collectively, but also provides a new paradigm to investigate the dynamics of disease progression. For clinical applications, the investigation of complications may yield new approaches to disease prevention, diagnosis and treatment.


Introduction
Advances in molecular biology lead to a new discipline of network medicine, which has [...] between disease genes and disease phenotypic features has brought the concept of the diseasome [1,29-31].

To go beyond the global features, the motif is utilized to characterize the local properties of those biological networks. The pioneering work by Milo et al. identifies motifs in networks from biochemistry, neurobiology, ecology, and engineering [32]. Motifs have since been observed in many biological networks [33-35], and their dynamics have been investigated further [34,36].

A critical medical dataset that has been overlooked so far is the complications of diseases, which are vital for patients in clinical practice. For example, complications may affect the clinical decisions of physicians in certain circumstances [37]. A complication is an unanticipated disease that arises following, or is induced by, a disease, a treatment or a procedure. Diseases form a networked structure with their complications. For example, COVID-19 generates numerous complications, such as venous thromboembolism and acute kidney injury [38]. We compile the Human Disease-Complication Network (HDCN), representing the causality between upstream diseases and their downstream complications. We systematically investigate the structure of the network and pay particular attention to the disease modules. Taking into account the complexity of network dynamics, we further study the motifs, with which the dynamics of the disease system can be depicted. This work not only helps us understand how different medical subdisciplines are organized, but also provides a comprehensive understanding of why certain groups of diseases appear collectively.

Materials
We collected data from the Clinical Medicine Knowledge Database, including 6715 diseases. The complications of diseases are extracted from the descriptions of diseases in the database.
A node denotes a disease, and a directed link from the i-th node to the j-th node is drawn if the i-th disease generates the j-th complication. Then, we [...]

The k-core analysis is able to uncover a nucleus set of nodes, i.e., a set of high-degree nodes connected to each other. It has been widely used in networks to identify the kernels more robustly than simply through the ranking of centrality measures [39,40]. The k-core of a network consists of nodes i with degree $k_i \geq k$, and it can be extracted by iteratively removing all nodes i with degree $k_i < k$ when $k > 0$.

The motifs are defined as local patterns occurring in the real network significantly more frequently than in randomized networks with the same degree sequence [32,34].
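The iterative removal that defines the k-core can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used for the HDCN: it assumes that links are treated as undirected and that $k_i$ is the total degree, and the names `k_core`, `max_coreness` and `toy_edges` are hypothetical.

```python
from collections import defaultdict


def k_core(edges, k):
    """Return the node set of the k-core by iterative pruning.

    Assumption: links are treated as undirected and k_i is the total degree
    of node i counted among the nodes that have not been removed yet.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    alive = set(neighbors)
    changed = True
    while changed:  # repeat until no node has degree below k
        changed = False
        for node in list(alive):
            if len(neighbors[node] & alive) < k:
                alive.discard(node)
                changed = True
    return alive


def max_coreness(edges):
    """Largest k for which the k-core is non-empty."""
    k = 0
    while k_core(edges, k + 1):
        k += 1
    return k


# Toy network: a triangle a-b-c with a pendant node d attached to c.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
print(sorted(k_core(toy_edges, 2)), max_coreness(toy_edges))
```

In the toy network the pendant node is pruned at $k = 2$, leaving the triangle as the nucleus.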

For any given network, the number of occurrences $N_g$ of the g-th connected subset depends on the network size and the degree distribution. To measure the statistical significance, a randomized ensemble of networks is generated as a null model [32].

The statistical significance, the Z score, is defined as

$$Z = \frac{N_g - \langle N_g \rangle}{\sigma},$$

where $N_g$ is the number of appearances of the g-th subset in the real HDCN, and $\langle N_g \rangle$ and $\sigma$ are the average number of appearances and its standard deviation in the randomized ensemble of networks. The motifs are those subsets with a significantly higher frequency in the real network than in the randomized ones, as measured by the Z score. In this paper, we mainly focus on the 3-node and 4-node connected subsets.
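To make the significance test concrete, the following Python sketch counts one 3-node subset, the feedforward loop, and compares it with a degree-preserving randomized ensemble. It is an illustration under stated assumptions: the ensemble is generated by random target swaps between pairs of links, occurrences are counted without excluding triples that carry additional links, the population standard deviation is used, and all function names and the toy network are hypothetical.

```python
import random
from statistics import mean, pstdev


def count_ffl(edges):
    """Count triples (a, b, c) with links a->b, b->c and a->c."""
    successors = {}
    for u, v in set(edges):
        successors.setdefault(u, set()).add(v)
    total = 0
    for a, outs in successors.items():
        for b in outs:
            for c in successors.get(b, ()):
                if c != a and c in outs:
                    total += 1
    return total


def randomize(edges, n_swaps, rng):
    """Swap link targets (a->b, c->d) => (a->d, c->b).

    Every accepted swap keeps the in-degree and out-degree of each node;
    swaps creating self-loops or duplicate links are rejected.
    """
    edge_list = list(set(edges))
    edge_set = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if a == d or c == b or (a, d) in edge_set or (c, b) in edge_set:
            continue
        edge_set -= {(a, b), (c, d)}
        edge_set |= {(a, d), (c, b)}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


def z_score(n_real, n_random):
    """Z = (N_g - <N_g>) / sigma over the randomized ensemble."""
    sigma = pstdev(n_random)
    if sigma == 0:
        return float("nan")  # undefined when the ensemble has no spread
    return (n_real - mean(n_random)) / sigma


def motif_z(edges, n_networks=200, seed=0):
    rng = random.Random(seed)
    counts = [count_ffl(randomize(edges, 10 * len(edges), rng))
              for _ in range(n_networks)]
    return z_score(count_ffl(edges), counts)


# Hypothetical toy network, only to exercise the functions.
toy = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 0), (1, 4)]
print(count_ffl(toy), motif_z(toy))
```

A real analysis would enumerate all 3-node and 4-node connected subsets and use a much larger ensemble; the sketch only shows how $N_g$, $\langle N_g \rangle$ and $\sigma$ enter the score.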

Boolean dynamics
The Boolean dynamics is a common model in gene regulatory networks and signal transduction networks [41,42]. Each node in a Boolean network represents a sub-cellular component such as a protein, gene, transcription factor or metabolite. The state of an input node i is described by a binary value $X_i$: $X_i = 1$ represents that the component i is active or expressed, and $X_i = 0$ means that it is inactive or not expressed.

The state of node j at time $t + 1$, $X_j(t + 1)$, is determined by a logic operation on the current states of its upstream regulators $X_i(t)$. The logic operation is the Boolean update function, expressed through logic operators or by comparing a weighted sum of the inputs to an activation threshold. The output nodal dynamics is described by a differential equation like [...] where $F(X_1, T_{1j}; \cdots; X_i, T_{ij})$ is the Boolean update function.

Here, we introduce the Boolean dynamic model to simulate the complication progression in the HDCN. The disease progression in the HDCN is comparable to other biological processes, such as those in gene regulatory networks. A complication is generated by the collective effect of upstream diseases, and it will presumably self-cure after the upstream diseases are cured. It is therefore reasonable to utilize the Boolean dynamic model to investigate the complication dynamics. Here, $X_i$ represents the activation of the i-th upstream disease, $T_{ij}$ is the activation threshold of the i-th disease for the j-th complication, $\alpha$ is the lifetime of the cured disease, and $F(X_1, T_{1j}; \cdots; X_i, T_{ij})$ denotes the combined effect of all i upstream diseases on the j-th complication.
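As a minimal sketch of how such dynamics can be integrated numerically, the Python snippet below follows one complication driven by a single upstream disease. It assumes a production-minus-decay form $dX_j/dt = F(X_i, T_{ij}) - \alpha X_j$, with the saturating function $F(u, T) = (u/T)/(1 + (u/T))$ that appears in the motif analysis. Treating $\alpha$ as a decay rate, the parameter values, the forward Euler scheme and the name `simulate_link` are all assumptions made for illustration.

```python
def F(u, T):
    """Saturating activation (u/T) / (1 + u/T)."""
    x = u / T
    return x / (1.0 + x)


def simulate_link(upstream, T=0.5, alpha=1.0, dt=0.01):
    """Euler-integrate dX_j/dt = F(X_i, T) - alpha * X_j for one link i -> j.

    upstream: activation X_i of the upstream disease, one value per time step.
    Assumed form; alpha is used here as a decay rate.
    """
    x_j = 0.0
    trajectory = []
    for x_i in upstream:
        x_j += dt * (F(x_i, T) - alpha * x_j)
        trajectory.append(x_j)
    return trajectory


# Upstream disease active for 10 time units, then cured for 10 time units.
steps = 1000
pulse = [1.0] * steps + [0.0] * steps
traj = simulate_link(pulse)
print(traj[steps - 1], traj[-1])
```

With these settings the complication level saturates near $F(1, 0.5)/\alpha = 2/3$ while the upstream disease is active and relaxes back toward zero after it is cured, which mirrors the self-curing behavior described above.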

Qualitatively, the HDCN is formed by very few disconnected components and a giant connected component (see Fig 1) [...] diseases. The in-degree distribution decays faster than the out-degree distribution, but still deviates significantly from the Poisson distribution expected for a random graph.

In the HDCN, a disease sits in a stream-like structure: it generates complications, and it can itself be a complication caused by other diseases. Therefore, we introduce the in-degree $k_{in}$ and out-degree $k_{out}$ to quantitatively categorize the diseases into three structural levels, i.e., upstream, intermediate and downstream. The diseases with the highest $k_{out}$ and $k_{in}$ are listed in Table 1. A node with high $k_{out}$ and low $k_{in}$ is an upstream disease such as Acute Lymphoblastic Leukemia ($k_{out} = 16$, $k_{in} = 1$), since it can lead to other diseases yet is hardly produced by others. On the contrary, a node with low $k_{out}$ and high $k_{in}$ is a downstream disease such as Pneumonia ($k_{out} = 0$, $k_{in} = 181$), [...]

To quantify the correlation between the in-degree and out-degree of nodes, we compute the Spearman rank correlation coefficient SC between these two rankings [43]. A negative value of SC = −0.26 indicates that the in-degree and out-degree of the same node are significantly asymmetric, i.e., a disease connected with more upstream diseases usually results in fewer downstream complications, and vice versa.
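The rank correlation between in-degree and out-degree can be illustrated with a short, dependency-free Python sketch. It computes the Spearman coefficient as the Pearson correlation of average ranks, so that ties are handled. The three-node toy network and all function names are hypothetical, and the result for the toy network is not related to the value reported for the HDCN.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def degree_sequences(edges):
    """Return aligned lists of out-degrees and in-degrees."""
    nodes = sorted({n for edge in edges for n in edge})
    k_out = {n: 0 for n in nodes}
    k_in = {n: 0 for n in nodes}
    for u, v in edges:
        k_out[u] += 1
        k_in[v] += 1
    return [k_out[n] for n in nodes], [k_in[n] for n in nodes]


# Toy chain: upstream U causes M and D, intermediate M causes D.
toy_edges = [("U", "M"), ("U", "D"), ("M", "D")]
k_out, k_in = degree_sequences(toy_edges)
sc = spearman(k_out, k_in)
print(sc)
```

In the toy chain the upstream node has only outgoing links and the downstream node only incoming ones, so the two rankings are exactly reversed.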

With the k-core method [39,40], we identify a small well-connected nucleus, consisting of 98 diseases with the maximal coreness of 7 (see S1 Table). The [...]

Although the HDCN layout is generated without any a priori knowledge of disease classes, it is naturally and visibly clustered according to major disease classes. To understand this clustering quantitatively, we identify the community structure of the HDCN. In a complex network, the community structure is the grouping of nodes into clusters with a high density of internal links and a relatively low density of links between clusters [44,45].

The out-degree and in-degree distributions $P(k)$ of the HDCN. In-degree is the number of edges pointing to a vertex, and out-degree is the number of edges pointing away from it. The probability distribution of the out-degree exhibits an approximate power law, while that of the in-degree decays exponentially.
[...] distributed in the related organs, rather than forming a single module as in the gene-disease network [30]. This difference is rooted in the fact that cancers in the former usually cause complications in the related organs, while in the latter they may share the same genes and generate dense connections between each other. Gynecology, Nephrology and Urology are grouped into a single module, since all three belong to the genitourinary system, in which the related organs are physiologically close and interlinked by the blood supply and some meatuses. In a similar vein, the diseases in Endocrine, Metabolic, Hematology and Rheumatology form a single module, which has a global effect on the whole human disease system.

To quantify the influence of modules in the progression of complications, we define the complicating ability of the p-th module as [...] where $\kappa^{out}_p$ and $\kappa^{in}_p$ are the total out-links and in-links of the p-th module connecting to other modules. A high value of $C_p$ suggests that the module can trigger downstream complications with high probability, while a high absolute value of negative $C_p$ means that the module is prone to be caused by upstream diseases. As shown in Table 2, Orthopedics, Dermatology, and Cancer are the modules with the highest $C_p$. Therefore, these three are the most influential disease modules, which can generate downstream diseases with great capability. Meanwhile, General Disease, [...]

[...] Table 3. [...] disease layer. The $k_{out}$ and $k_{in}$ of the four nodes in the bi-fan are shown in Table 4. [...] has not been observed in other networks. As shown in Table 3, the OFFL consists of [...] The frequency of motifs appeared significantly higher than in a random network, [...]

For the AND-gate, the differential equations are written as [...] and [...] where $F(u, T) = (u/T) / (1 + (u/T))$. In the remainder of this paper we will adopt [...] gene regulation networks and neuronal connectivity networks [34]. Therefore, the AND-gate of the FFL is probably a general mechanism to protect biological functions.

For the OR-gate, Eq 5 is substituted by [...] If not stated otherwise, the function will be adopted for all OR-gates. The results are exhibited in [...]

Bi-fan
In the bi-fan motif, upstream diseases $U_{1,2}$ have two affecting pathways, $U_1 \to D_{1,2}$ and $U_2 \to D_{1,2}$. Likewise, $U_1$ and $U_2$ may act in an AND-gate or OR-gate manner to control $D_{1,2}$.

For the OR-gate, both $D_{1,2}$ could be activated by $U_1$ and $U_2$ independently, i.e., $U_1 \to D_{1,2}$ and $U_2 \to D_{1,2}$. The differential equations of the bi-fan are written as [...] and [...] As shown in Fig 5, [...] when $U_1$ and $U_2$ co-occur simultaneously will the morbidity of $D_1$ and $D_2$ be high. To some extent, the FFL motif provides a temporal mechanism to prevent the wild [...]

For the AND-gate, both $D_{1,2}$ should be activated by $U_1$ and $U_2$ jointly. Eqs 6 and 7 are substituted by [...] simultaneously. The response rates of $D_{1,2}$ in the OR-gate are higher than those in the AND-gate, while the relaxation rates of $D_{1,2}$ in the OR-gate are slower than those in the AND-gate.
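The qualitative difference between the two gates can be explored with a small simulation. The Python sketch below is an illustration only: it assumes the form $dD_k/dt = G(U_1, U_2) - \alpha D_k$ for $k = 1, 2$, takes the AND-gate as the product $F(U_1, T) F(U_2, T)$ and the OR-gate as the maximum of the two terms, and uses equal thresholds for both downstream diseases. These gate functions, the parameter values and all names are assumptions, not the equations of the paper.

```python
def F(u, T):
    """Saturating activation (u/T) / (1 + u/T)."""
    x = u / T
    return x / (1.0 + x)


def gate(u1, u2, T, mode):
    """Assumed gate functions: product for AND, maximum for OR."""
    a, b = F(u1, T), F(u2, T)
    if mode == "AND":
        return a * b
    if mode == "OR":
        return max(a, b)
    raise ValueError(f"unknown gate: {mode}")


def simulate_bifan(u1, u2, mode, T=0.5, alpha=1.0, dt=0.01):
    """Euler-integrate dD_k/dt = gate(U1, U2) - alpha * D_k for k = 1, 2."""
    d1 = d2 = 0.0
    out = []
    for a, b in zip(u1, u2):
        drive = gate(a, b, T, mode)
        d1 += dt * (drive - alpha * d1)
        d2 += dt * (drive - alpha * d2)
        out.append((d1, d2))
    return out


def pulse(start, stop, n_steps, dt=0.01):
    """Input that equals 1 for start <= t < stop and 0 otherwise."""
    return [1.0 if start <= i * dt < stop else 0.0 for i in range(n_steps)]


# Staggered upstream diseases: U1 active on [0, 10), U2 active on [5, 15).
n = 2000
u1 = pulse(0.0, 10.0, n)
u2 = pulse(5.0, 15.0, n)
and_run = simulate_bifan(u1, u2, "AND")
or_run = simulate_bifan(u1, u2, "OR")
print(and_run[499], or_run[499])
```

Under these assumptions the AND-gate keeps $D_{1,2}$ at zero until both upstream diseases are present, whereas the OR-gate responds as soon as either of them appears and stays active until both are gone.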

Overlapping feedforward loop
The OFFL has a more complicated structure than the other two. For simplicity, we assume that $U_1$ and $U_2$ always act in an OR-gate manner to control M and D. However, M still acts with $U_1$ and $U_2$ in an AND-gate or OR-gate manner to control D. We will discuss this AND-gate or OR-gate below.

For a mixed model of AND-gate and OR-gate, the differential equations of the OFFL are written as [...] and [...] The results are shown in Fig 6. The progression level of disease D in the OFFL is [...]

For the OR-gate, Eq 9 is written as [...]

[...] vaginal versus a cesarean delivery) are influenced by such a heuristic [37]. If the prior patient had complications in one delivery mode, the physician will be more likely to switch to the other, and likely inappropriate, delivery mode for the subsequent patient, regardless of the patient's indicators. More importantly, this strategy has small but significantly negative effects on patient health outcomes and increases resource use. [...] which have widespread effects on the whole disease system.

In parallel, many methods have been introduced for the classification of human diseases, such as machine learning [46], integration of phenotypic similarity with genomics [47], pathway-based classification [48] and consensus-based techniques [49].

However, contemporary approaches usually do not consider the interactions among diseases [50]. This failure partly comes from the focused nature of medical training and the reductionist paradigm in modern medicine. To overcome this shortcoming, the network framework is applied to define human disease [29,51]. In our work, the disease modules clustered in the HDCN may further provide complementary information to classify human diseases more accurately.

In this paper, the HDCN is constructed from medical data. We investigate its topological characteristics, including the degree distribution, clustering coefficient and k-core. Further, we identify the disease modules, which are dominated by the classes of diseases. The relations between modules are unveiled in detail. The