Immune state networks of wild and laboratory mice

The mammalian immune system protects individuals from infection and disease. It is a complex system of interacting cells and molecules and extensive work, principally with laboratory mice, has investigated its function. Wild and laboratory animals lead very different lives, and this is reflected in there being substantial immunological differences between them. Here we use network analyses to study a unique data set of 120 immune measures of wild and laboratory mice, where immune measures define nodes and correlations of immune measures across individual mice define edges between immune measures. To date, there has only been very limited network analyses of the immune system, which is surprising because such analyses may be important to better understand its organisation and functionality. We found that the immunological networks of wild and laboratory mice were similar in some aspects of their mesoscale structure, particularly concerning cytokine response communities. However, we also identified notable differences in node membership of network communities between the wild and laboratory networks, pointing to how the same immune system acts and interacts differently in wild and in laboratory mice. These results show the utility of network analysis in understanding immune responses and also the importance of studying wild animals in additional to laboratory animals. Author summary The mammalian immune system is a complex system that protects individuals from infection and disease. Most of our understanding of the immune system comes from studies of laboratory animals, particularly mice. However, wild and laboratory animals lead very different lives, potentially leading to substantial immunological differences between them and so possibly limiting the utility of laboratory animals as informative model systems. As a complex interacting set of cells and molecules, the immune system is a biological network. Therefore, we used network analyses to study the immune system, specifically a unique data set of immune measures of wild and laboratory mice, where 120 different immune measures define nodes of the network. We found that the networks of wild and laboratory mice were similar in some aspects of their grouping structure, particularly concerning communities of nodes of cytokine responses. However, we also identified notable differences in node membership of communities between the wild and laboratory networks, pointing to how the same immune system behaves differently in wild and in laboratory mice. These results show the utility of network analysis in understanding immune responses and also the importance of studying wild animals in addition to laboratory animals.

The vertebrate immune system defends animals against infection and disease. 2 Understanding how the immune system functions has been a long-standing area of 3 study, not only to understand the basic biology of this important system, but also to be 4 able to manipulate it for therapeutic benefit. Almost all that we know about the 5 mammalian immune system has been discovered by studying laboratory animals, mainly 6 mice. Remarkably little is known about the immune responses of wild mammals, nor 7 how well the immune responses of laboratory animals reflect those of wild animals. The 8 operation of the immune system is affected by the environment -especially infections - 9 and wild animals are continuously exposed to a myriad of infections stimulating immune 10 responses, but in contrast, laboratory animals do not have this high-level exposure. 11 Indeed, the few studies of the immune state of wild animals have shown quantitative 12 differences in measures of immune state between wild and laboratory animals [1]. 13 The vertebrate immune system is a complex of interacting cells and soluble 14 molecules whose function, at its heart, is to recognise foreign antigen molecules, to then 15 remove, destroy or nullify them, while usually retaining a molecular memory of the 16 antigen in question. The multiple components of the vertebrate immune system are 17 typically categorised into cellular and humoral components, both of which contribute to 18 the innate and adaptive parts of the immune response. The immune system is dynamic, 19 so that different components of the immune system respond depending on the type of 20 antigen and its location in the individual. For example, immune responses to viruses 21 infecting lung cells are qualitatively very different to the immune responses directed 22 against macroscopic worms living in the gut [2]. Pathogens are not passive partners in 23 the face of the immune responses directed against them, and so combat the immune 24 response via a number of strategies. These strategies include molecularly hiding from 25 the immune system, actively immunomodulating the immune system for their own 26 benefit, or by changing their antigenic repertoire, so staying-ahead of an immunological 27 memory already acquired against them [3]. Effector cells and molecules of the immune 28 system act against invading pathogens, but these effects can also cause harm to the 29 individual, so-called immunopathology [4]. Therefore, a tightly-controlled, well 30 regulated immune system is needed to protect an individual from pathogen-induced 31 harm, without causing direct harm to the individual in the process. 32 The immune system is a regulated, homeostatic system, typically responding to 33 multiple antigenic stimulations at once. The principal actors of the immune system are 34 populations of cells, including T cells (including CD4 + T helper cells that facilitate 35 immune function and CD8 + T cytotoxic cells that act against infected cells), B cells  Because it is a complex interacting set of cells and molecules, the immune system is 48 a biological network. Applying network analysis to immunological data may therefore 49 provide new insights into the function and organisation of the immune system. Because 50 of the different immunological state of wild and laboratory animals [1,5], their immune 51 systems may also differ in the structure of interactions among different immune 52 components, which will be revealed by network analysis.

53
Here we apply network analysis to a data set that has comprehensively measured the 54 immune state of wild mice and laboratory mice, affording us the opportunity to 55 examine the commonalities and the differences between the behaviour of the immune 56 system in these animals. In particular, we analyse the community structure [6], which 57 aims to reveal groups (i.e., communities) of densely connected nodes (which here are 58 measures of the immune response) and the connectivity among the detected 59 communities. We then compare the wild and laboratory mice in terms of community 60 structure and also compare the biological groupings of immune measures and the 61 grouping derived from community detection. As such, this analysis provides a 62 comprehensive network analysis of the immune system, and gives new insights into the 63 functional immunological differences between wild and laboratory animals. We used a data set of the immune state of 460 wild mice (Mus musculus domesticus), 67 collected from 12 sites in the southern UK, and 102 laboratory mice. We assembled this 68 data set in previous work, which is available within the publication and associated 69 supplementary materials [5]. There were 126 immune measures, from which we removed 70 six as is explained in the next section. Therefore, there are 120 immune measures  The first category is antibodies, which has two measures: the serum concentration of 74 immunoglobulins G (IgG) and E (IgE). The second category is serum proteins, which 75 has three measures: concentration of the acute phase proteins serum amyloid P (SAP), 76 Haptoglobin, and alpha-1 antitrypsin (AAT). The third category has three

96
Data preprocessing 97 We preprocessed the data as follows. The concentrations of cytokines were measured 98 using Bioplex Pro kits (M60-009RDPD & MD0-00000EL, Bio-Rad, UK). Therefore, the 99 range of cytokine concentrations for which data were robust is defined by empirically 100 derived standard curves. When the observed concentration of a cytokine fell outside the 101 standard range of these assays [5], the readings were classified as out of range (OOR), 102 being either below ("OOR<") or above ("OOR>") the standard range. We set "OOR 103 <" measures to 0.001 [5], and treated "OOR >" as a missing value. 104 We removed the age and IgA measures [5] because they were absent for all the 105 laboratory mice. In the FACS category we removed the %D+G2Hi+G2Lo+H+, 106 %D+G2Hi+G2Lo+H-, %D-G2Hi+G2Lo+H+ and %D-G2Hi+G2Lo+H-measures 107 because they were absent for all wild and laboratory mice. As a result of removing these 108 six measures, we had a final total of 120 immune measures.

109
Many of the 120 immune measures had missing values, i.e., they were not observed 110 for some mice. In particular, in the CR category there were 223 wild mice (48%) and 75 111 laboratory mice (74%) without any of the 45 CR measures. Because of the central 112 importance of cytokines in immune responses, we decided to remove these mice instead 113 of removing the CR measures. Before removing these mice, the percentage of missing 114 values in the whole data set was 34% for the wild mice and 37% for the laboratory mice. 115 After removing the mice that had all of the CR measures missing, the data set consisted 116 of 237 wild mice with 12% of missing values and 27 laboratory mice with 12% of missing 117 values. In the CR category, the percentage of missing values dropped from 49% to 1.3% 118 for the wild mice and from 74% to 1.5% for the laboratory mice after this removal. 119 Finally, we imputed the remaining missing values. For each of the immune measures 120 that had missing values, we calculated the average value from the available observations 121 and replaced the missing values by this calculated average. We carried out this 122 imputation procedure separately for the wild and for the laboratory mice.

123
Correlation networks 124 We built correlation networks separately for the wild mice and for the laboratory mice 125 as follows. We used the N = 120 immune measures as nodes of the network. We then 126 calculated the Pearson correlation coefficient between each pair of immune measures by 127 regarding the mice as samples. We placed an edge if, and only if, the correlation value 128 exceeded a prescribed threshold. The threshold was the largest possible value with 129 which the remaining network was connected (i.e., when the network consisted of a single 130 connected component).

131
Community detection 132 We combined a stochastic block model (SBM) [7] and a consensus clustering approach 133 to uncover the mesoscopic block structure of the correlation networks. In the 134 microcanonical formulation of the SBM that we used, one minimizes the description 135 length of the network [8], such that one partitions the nodes in the network G into B by a partition b is given by the following Bayesian posterior probability: The numerator of Eq (1) can be written as is the description length of network G. Maximizing Eq (1) is equivalent to minimizing 143 the description length. We used the non-degree corrected version of the SBM algorithm 144 because it yielded smaller description lengths. Specifically, after 10 3 minimizing 145 attempts, the average and standard deviation of the description length for the corrected 146 and the non-corrected version was equal to 2575±3 and 2526±4, respectively, for the 147 wild mice and 2118±5 and 2047±5, respectively, for the laboratory mice.

148
To infer the best partition given by the SBM, we employed a Markov Chain Monte 149 Carlo (MCMC) algorithm [9] using the graph-tool library [10]. We start from an initial 150 partition b 0 , which is obtained from an agglomerative heuristic method [9]. Then, in 151 each step, we propose a move by selecting a node i and choosing a new tentative group 152 membership b i for node i with a specific probability that imposes no bias and preserves 153 the ergodicity [9]. The proposed move b → b , where b j = b j for all j = i, is accepted 154 or rejected according to the Metropolis-Hastings criterion [11,12]. This procedure 155 preserves the detailed balance and minimizes the final description length. Each attempt 156 to move a node is applied sequentially, such that nodes are visited one by one in a 157 random order.

158
When all the N nodes have attempted to move once, we say that a sweep has been 159 completed. Because the method is stochastic, a different block structure may appear in 160 each run of the algorithm. To cope with this stochasticity, we implemented the 161 consensus clustering procedure composed of the following three phases. In the first 162 phase, we identify the most probable number of communities of the network. To this 163 end, we ran the algorithm for 10 2 different initial partitions of the nodes. Then, for each 164 initial condition, we carried out MCMC sweeps and recorded the maximum and initial condition, we selected the number of communities that appeared the most times 172 in the 10 4 observations. Finally, we determined the number of communities of the 173 network as the most probable number of communities among the 10 2 initial conditions. 174 In the second phase, we fixed the number of communities to the value determined in the 175 first phase and inferred the probability that each node belongs to each community as 176 follows. For each of another 10 2 different initial partitions of nodes, we underwent the 177 same transient period of MCMC sweeps as in the first phase. Then, for each of the 10 2 178 initial conditions, we collected 10 5 configurations at an interval of 10 2 MCMC sweeps 179 and calculated the fraction of the 10 5 configurations in which each node belonged to 180 each community. Finally, for each initial condition, we defined the community of each 181 node as the community to which the node belonged the most times. Therefore, at the 182 end of the second phase, we had 10 2 node partitions. In the third phase, we performed 183 a consensus clustering procedure. To this end, we counted the number of times that 184 each pair of nodes belonged to the same community among the 10 2 node partitions 185 found in the second phase. We then concluded that any pair of nodes belonged to the 186 same community if they did so in at least 90% of the 10 2 node partitions. If a node did 187 May 9, 2019 6/19 not belong to the same community with any other node in at least 90% of the 10 2 node 188 partitions, then this node was judged to form a single-node community.

189
Comparing two communities 190 We estimated the similarity between a single community in the wild mouse network and 191 a single community in the laboratory mouse network in two ways. First, we counted the 192 number of nodes shared by the two communities. Second, we computed the Jaccard 193 index J between the two communities, defined by where A and B are the set of nodes in the two communities being compared, and |A|, 195 for example, is the number of nodes in A. information-theoretic methods [13][14][15]. We used five types of similarity measures, four 204 of which are pair counting methods, with the other one being an information-theoretic 205 method.

206
The pair counting methods first classify all the possible N (N − 1)/2 pairs of nodes 207 into four categories. We denote by w 11 the number of node pairs that belong to the 208 same community in both networks, by w 10 the number of node pairs that are in the 209 same community in the first network but not in the second network, by w 01 the number 210 of node pairs that are in the same community in the second network but not in the first 211 network and by w 00 the number of node pairs that are not in the same community in 212 either network. The total number of pairs of nodes is given by 213 w 11 + w 10 + w 01 + w 00 = N (N − 1)/2. The four similarity measures based on 214 pair-counting are as follows [13]: the Jaccard index J = w 11 /(w 11 + w 10 + w 01 ); the 215 Rand similarity coefficient R = 2(w 11 + w 00 )/N (N − 1); the Fowlkes-Mallows similarity 216 coefficient F M = w 11 / (w 11 + w 10 )(w 11 + w 01 ); the Minkowski coefficient 217 M = (w 10 + w 01 )/(w 01 + w 11 ).

218
The other similarity measure based on information theory is the variation of 219 information (VI) [14], which is defined as follows. Denote the two node partitions to be 220 compared by Similarity measures, including the ones described here, strongly depend on the 225 number and size of communities [13]. It is therefore difficult to assert whether the value 226 obtained for each similarity measure is large or small. To circumvent this problem, we 227 calculated the similarity between two partitions relative to that obtained from random 228 partitions. Given a similarity measure S calculated from the original data, the Z score 229 is defined as Z S = (S − µ)/σ, where µ and σ are the average and standard deviation, where H(t) is the description length after t sweeps; H and σ H are the average and the 265 standard deviation of the description length, respectively. For each network, we   To investigate quantitatively the similarity between the correlation matrices for the 292 wild and laboratory mice, we compared the strength for each node between the wild and 293 laboratory mice, which is shown in Fig 3A. The CR nodes tend to have the largest   (Fig 3B), where 300 each point represents a pair of nodes. Figure 3 suggests that the correlation matrices of 301 wild and laboratory mice show some similarity, but not identity.  D+G2Hi-G2Lo+H-) that did not robustly belong to any other community. Both networks have three communities that are almost exclusively composed of CR 321 nodes, which we refer to as the CR communities. Of the 45 CR nodes, 42 and 41 nodes 322 belong to CR communities in the wild and lab networks, respectively. Moreover, the 323 interconnections between the three CR communities is qualitatively the same for both 324 networks. Specifically, there is one CR community that is densely connected to the 325 other two CR communities, which are in turn sparsely connected to each other. In the 326 wild network, community W7 has edge densities of 0.72 and 0.82 with communities W5 327 and W6, respectively, whereas the edge density between W5 and W6 is equal to 0.06 328 (Fig 5A). Analogously, in the lab network, L7 has edge densities of 0.60 and 0.83 with 329 L5 and L6, respectively, whereas the edge density between L5 and L6 is equal to 0.03 330 (Fig 5B). Despite these structural similarities, the wild and lab CR communities consist of 332 different sets of nodes. We examined this further by computing the number of common 333 nodes and the Jaccard index between each pair of a community in the wild and lab 334 networks (Fig 6). This shows that nodes in the CR communities L5 and L6 of the lab 335 network are scattered among different CR communities in the wild network. Particularly, 336 in the wild network, all of the IL-1β, IL-12p70 and IL-13 nodes (with each cytokine 337 type having five nodes) are each within one community, whereas in the lab network only 338 the IL-12p70 nodes are all within one community (Table S1). Moreover, communities 339 W7 and L7, which are strongly connected to the other two CR communities in the wild 340 and lab networks, respectively, have only two CR nodes in common with a Jaccard 341 index of 0.07 (Fig 6). In fact, L7 better corresponds to W6 rather than W7, and W7 342 better corresponds to L6 than L7. Therefore, although the mesoscale structure of the 343 CR communities is similar between the wild and lab networks, the composition of each 344 CR community is substantially different between the two networks. A majority of NK cell nodes (i.e., FACS and MFI nodes) belong to two communities 346 in both the wild network (i.e., communities W1 and W2) and the lab network (i.e., 347 communities L1 and L2). Communities W1 and L1 are predominately composed of NK 348 cell nodes and all the eight nodes that they share are FACS NK cell nodes. Among the 349 seven nodes shared by communities W2 and L2, five are NK cell nodes (three FACS NK 350 cells and two MFI NK cells). In both networks, the edges within each of these two NK 351 cell communities are denser than between the two NK cell communities, which seems to 352 be the main reason why a majority of NK cell nodes are divided between two 353 communities. Consistent with this observation, the pairwise correlations between nodes 354 within L1 and L2 are mostly strongly positive whereas those between L1 and L2 are 355 mostly strongly negative (Fig 4B). This result may imply negative biological regulation 356 among NK cell nodes.

357
In the wild network, T cells are confined in a single community, i.e., community W2 358 (Fig 4C), whereas they are dispersed among four communities in the lab network. This 359 result is likely due to greater functional interaction among these cells in wild, 360 antigen-experienced mice, compared with laboratory mice [17].

361
Communities W2 and L3 are the largest communities in each network and share 15 362 nodes (Fig 6A). Moreover, these communities contain nodes of different types of 363 immune measures, with this mix being approximately consistent between the wild and 364 lab networks. 365 We have not considered negative pairwise correlations because the thresholds for 366 creating the networks from the correlation matrices were positive. Mindful that in 367 biological systems negative regulation is common, we also briefly analysed how the 368 immune measures are connected through negative correlations. In these so-called 369 negative networks, node pairs that are strongly negatively correlated were assumed to 370 form the edges. The negative networks have substantially more communities such that 371 their community structure is more difficult to interpret than in the case of the positive 372 networks (Fig S3). However, the CR nodes are predominately grouped together within 373 several communities in the negative networks for both wild and laboratory mice, similar 374 to the situation in the positive networks. In the wild negative network, which had nine 375 communities, 43 out of the 45 CR nodes belonged to one of the two communities that 376 were almost exclusively composed of CR nodes (Fig S3). The lab negative network had 377 17 communities and 39 CR nodes belonged to one of the four communities 378 predominately composed of CR nodes. Because CR nodes tend to be strongly positively 379 correlated with each other, there is no direct connection between pairs of CR nodes in 380 the negative networks. However, in the SBM, a pair of nodes tend to belong to the 381 same community if they have similar patterns of connectivity to the other nodes in the 382 network. This is the case even if the direct connectivity between the two nodes is not where the wild mice were collected [5,17].

395
In the wild vs. lab comparison, there were 237 wild mice and 21 laboratory mice. In 396 the male vs. female sex comparison, there were 133 males and 104 females. In the age 397 comparison, we compared 120 old mice, which we defined to be more than 8.5 weeks old, 398 and 117 young mice, which were less than 8.5 weeks old. We selected the age threshold 399 of 8.5 weeks to make the size of the two groups approximately equal in size. In the site 400 comparison, there were 71 mice from the HW site (a mixed arable and beef farm near 401 Bristol, UK) and 166 mice from the other sites [5]. We selected this particular site 402 because it was the single site where most wild mice were sampled. 403 We compared each pair of networks in terms of the five similarity measures for 404 community structure. For each similarity measure, as a reference, we also randomised 405 the mice to create pairs of networks whose sizes matched those of the original networks. 406 On the basis of the values of the similarity measure for the randomised pairs we 407 calculated the Z score. A negative Z score implies that the pair of original networks are 408 more different than are the pairs of random networks.

409
The Z score for each of the four comparison types and each of the five similarity 410 measures is shown in Table 1. The largest absolute value of the Z score was obtained 411 when the wild and lab networks were compared with the Rand coefficient. In this case, 412 wild and laboratory community structures are significantly different from each other as 413 compared to pairs of randomised networks (Z = −2.93). This result is consistent with 414 those in the previous section, where we have shown that the set of nodes belonging to a 415 community differs between the wild and lab networks, although there is a resemblance 416 in the inter-community connectivity between the wild and lab networks. The Rand 417 coefficient is highly sensitive to the size of communities in the partitions being 418 compared [13]. Both the wild and lab networks have large communities, whereas the 419 randomly generated networks in general do not have similarly large communities. This 420 factor seems to account for this large negative Z score.

421
However, Table 1  strength of association between each pair of nodes has been quantified, for example, by 484 the Pearson correlation coefficient [27,28], as in the present study, or mutual 485 information [29,30]. These previous studies analysed data obtained from healthy human 486 donors [28], patients in preterm labour [27], individuals with Chronic Fatigue 487 Syndrome [29] or Gulf War Illness [30].  Therefore, our networks can be regarded as co-expression networks, though the scale of 495 these networks is larger than previous work.

496
The present study has some limitations. First, we necessarily discarded a 497 considerable portion of data during our pre-processing steps due to missing values. The 498 number of lab mice with sufficient immune measures after pre-processing was small (i.e., 499 27). However, it is of note that these laboratory animals are genetically inbred and 500 maintained under standard conditions, all aiming to substantially reduce the 501 inter-mouse variability. Among the wild mice, many animals, particularly the young 502 ones, were small, which limited the amount of serum and the number of splenocytes 503 that could be obtained, thereby limiting the amount of immunological data that could 504 be obtained from them. Studies of wild animals will routinely confront such limitations. 505 Therefore, the preprocessing of data that we have undertaken may provide a model for 506 how this matter can be addressed in similar analyses of analogous data of other wild 507 populations. A larger dataset would significantly enhance the precision of the present 508 findings. Second, we did not find consistent patterns in the statistical comparison of the 509 four pairs of groups (wild vs. lab; male vs. female; young vs. old; HW site vs. other 510 sites). Rather, the results depended on the similarity measure we used and only a few 511 combinations of the similarity measure and the comparisons that we made yielded 512 significant Z scores. However, adjusting for these multiple comparisons means that none 513 of these results remains formally significant. As above, this may be due to the relatively 514 small size of the data set, compared to the number of nodes. Specifically, to compare 515 two groups of wild mice, such as male and females, we divided the 237 wild mice into 516 two groups. The number of nodes, 120, may be too large for this quantity of data. The 517 same limitation was acknowledged in a study of co-expression immune response networks in which the number of participants in each group was relatively small [30]. 519 However, this is the best that we could achieve with the available data. Sampling wild 520 mice and making these multiple immune measures is a substantial piece of work and we 521 note that this data set is the largest existing immunological data set of wild rodents.

522
Third, we applied a simple thresholding to the correlation matrices to create the 523 correlation networks. The choice of the threshold value was not determined a priori, but 524 rather determined functionally by using the highest value that resulted in a single, 525 connected component network [31,32]. A method to select the threshold value based on 526 an optimisation [32] and the use of network quantities directly applied to correlation 527 matrices [33][34][35] are alternative options that could be used in the future. We did not 528 opt to use these methods because the optimisation of the trade-off with the wiring 529 cost [32] is not relevant to the present data and stochastic block models directly 530 applicable to correlation matrices are not known.

531
The network analysis of wild mouse populations presented here is a novel analysis of 532 a novel dataset. It generates a holistic view of the mammalian immune system and the 533 dynamics of its function as the animals themselves interact with diverse and dynamic 534 environments. Humans are, arguably, immunologically more akin to wild mice than to 535 laboratory mice, and the diversity of immune state among people could also be analysed 536 using network analyses. Such an approach may usefully be used to ever better 537 understand the vertebrate immune system and its function in protecting individuals 538 from infection and disease.