Reconstructing heterogeneous pathogen interactions from co-occurrence data via statistical network inference

Infectious diseases often involve multiple pathogen species or multiple strains of the same pathogen. As such, knowledge of how different pathogen species or pathogen strains interact is key to understand and predict the outcome of interventions that target only a single pathogen or subset of strains involved in disease. While population-level data have been used to infer pathogen strain interactions, most previously used inference methods only consider uniform interactions between all strains, or focus on marginal interactions between pairs of strains (without correction for indirect interactions through other strains). Here, we evaluate whether statistical network inference could be useful for reconstructing heterogeneous interaction networks from cross-sectional surveys tracking co-occurrence of multi-strain pathogens. To this end, we applied a suite of network models to data simulating endemic infection states of pathogen strains. Satisfactory performance was demonstrated by unbiased estimation of interaction parameters for large sample size. Accurate reconstruction of networks may require regularization or penalizing for sample size. Of note, performance deteriorated in the presence of host heterogeneity, but this could be overcome by correcting for individual-level risk factors. Our work demonstrates how statistical network inference could prove useful for detecting pathogen interactions and may have implications beyond epidemiology.


60
In order to evaluate the performance of various network inference methods, we simulated 61 random interaction networks with up to 10 pathogen strains, which describe whether 62 and to what extent the strains interact in an epidemiological model. Within each 63 network, the presence of interaction between each pair of strains i and j was established 64 with connection probability σ. Given that interaction between a pair was established, correspond to competitive and mutualistic interactions, respectively, and the greater the 68 value diverges from zero, the stronger the interaction becomes. 69 Each network generated as such was then used to parametrize a 70 Susceptible-Infected-Susceptible (SIS) epidemiological model [8]. In the model, exp(x ij ) 71 was used to as the hazard ratio that determines how the presence of strain i in a host  The set of network inference methods that we considered were the Ising model, 80 graphical modelling approaches, and generalized estimating equations (GEE) [24][25][26]. Two random networks generated with connection probability σ = 0.25 and interaction strengths x ij drawn from [−θ, θ] = [− log 3, log 3] (A and D); Estimated networks from co-occurrence in 100,000 observations according to the regularized Ising model (B and E); and according to a graphical model using backward model selection by BIC (C and F). Strength and type of interactions (mutualistic versus competitive) are indicated by the thickness and colour (green versus red) of edges, respectively. host heterogeneity, and type-specific basic reproduction numbers R 0,i in the range 91 [1.5, 2]. 92 In the base-case setting, visual inspection showed satisfactory concordance between 93 the estimated networks and true networks. This is illustrated in Fig 1, which also shows 94 some differences between network inference methods. In order to compare the different 95 methods systematically, we assessed various performance measures, including sensitivity 96 and specificity for indicating the presence (or absence) of pairwise interactions, i.e. the 97 proportions of truly present (or truly absent) interactions that are correctly identified in 98 the reconstructed networks. In addition, we assessed the F1-score, which summarizes   Table 1).

104
Interactions were recovered with almost 100% sensitivity at 100,000 observations 105 (Fig 2A). Optimal specificity was attained already at small sample size in graphical 106 modelling approaches with BIC-based, whereas specificity remained at a more or less 107 constant suboptimal level in graphical modelling approaches with AIC-based selection 108 ( Fig 2D). GEE was good at small sample size, but was the only method considered with 109 slightly deteriorating performance with increasing sample size notwithstanding 110 correction for multiple testing (Fig 2B-D). Overall, methods with less optimal specificity 111 showed better sensitivity, showcasing trade-off between sensitivity and specificity.

112
Finally, the Ising model and the graphical modelling approaches with BIC-based 113 selection were able to achieve relatively high values of sensitivity and specificity at high 114 sample size, which resulted in near-optimal F1-scores. (Fig 2B,C). Table 1. Performance measures of statistical network inference in the base-case analysis. In the base-case analysis, interactions between strains were generated with connection probability σ = 0.25, interaction strengths x ij indicating interaction in acquisition drawn from [− log 3, log 3], and basic reproduction numbers randomly drawn from [1.5, 2]. All methods were evaluated at sample sizes of 100, 1,000, 10,000 and 100,000 observations. Abbreviations: GEE, generalized estimating equations; PPV, positive predictive value; RMSE, root-mean-square error.

Method
Sample size § Specificity (%) ± Sensitivity (%) ± PPV (%) ± F1-score †± Pearson's r * ± RMSE ± .793) .987 (.938; 1.04) § Number of sampled individuals ± 95% confidence intervals (bootstrap percentile estimates) in brackets † F1-score denotes the harmonic mean between sensitivity (also termed recall) and PPV (also termed precision) * Linear correlation coefficient between generated and estimated interaction strengths (non-zero estimates only) Fig 2. Performance measures of statistical network inference in the base-case analysis. Several measures have been calculated to assess the performance of the statistical network inference as a function of the sample size for the base-case analysis. A) Sensitivity; B) Positive predictive value (PPV); C) F1 score; D) Specificity; E) Pearson correlation; F) Root-mean-square deviation. In the base-case analysis, interactions between strains were generated with connection probability σ = 0.25, interaction strengths x ij indicating interaction in acquisition drawn from [− log 3, log 3], and basic reproduction numbers randomly drawn from [1.5, 2]. All methods were evaluated at sample sizes of 100, 1,000, 10,000 and 100,000 observations, but x-axis coordinates are slightly jittered to improve visualization. Abbreviations: bw.aic  (Table 1). Moreover, while correlations between true and 125 estimated interactions reached a plateau above 1,000 observations, accuracy continued 126 to improve with increasing sample size, as verified by steadily decreasing  As for the AIC-based graphical modelling approaches, the performance diverged 152 more from that of the base-case setting (Fig 3D,F). The performance improved 153 substantially under higher connection probabilities (compare red dots to black lines).

154
However, it suffered also more when networks were larger (compare green and yellow 155 dots to black lines). These patterns are linked to the high sensitivity and poor 156 specificity of the AIC-based approaches (as shown in the base-case analysis). Average contact rate is the same as in the base-case analysis, but 80% of hosts is assumed to have below-average contacts and 20% above-average contacts (coefficient of variation: 80%). Mixing between sub-populations occurred pseudo-assortatively (assortivity fraction: 50%). Performance is investigated in following ways: uncorrected: based on representatively sampled individuals from the total population without correction for contact rate; corrected: idem but with correction for contact rate; low-risk: (stratified) analysis on individuals sampled from sub-population with low contact rate only; high-risk: (stratified) analysis on individuals sampled from sub-population with high contact rate only. remained high at large sample size (S1 Table). These findings show a consistent bias   Fig 4). Graphically, the 175 risk variable can be represented as a central node equally connected to all other nodes 176 representing carriage of strains ( Fig 5E). This dependency illustrates that elevated 177 infection risk is associated with positivity for each strain. After filtering out the risk variable, the strain-specific interaction network is retained (Fig 5F)

Discussion
In this paper, we evaluated whether heterogeneous pathogen strain interactions can be 181 recovered from cross-sectional surveys tracking co-occurrence of multi-strain pathogens 182 via statistical network inference. Using simulated data, we demonstrated unbiased genotype combinations [20,21]. The importance of accounting for host heterogeneity is 202 reiterated in our analysis, but we also demonstrate that network inference is well suited 203 to detect small interactions at large enough sample size, provided that the mechanisms 204 of interaction can be inferred from conditional independence structures. In our simulations, sparsity of the networks used to generate co-occurrence data was 268 controlled by fixing the rate of true zeroes in the network. The extent to which true 269 independence holds in reality should determine which class of network inference 270 methods is to be preferred [31]. From a practical point of view, however, the important 271 interactions were almost always correctly identified, also at limited sample size.

272
Finally, we show that a regression framework familiar to most epidemiologists, i.e. framework has the additional benefit that it offers a flexible way of separating 275 individual risk factors common to all strains, from interactions between any two strains 276 [23]. This framework has been employed before to study clustering patterns of HPV 277 genotypes across risk populations, with correction for known risk factors [18]. While  X ∈ S is generally given by The four groups of terms on the right-hand side, moving from left to right, 309 correspond to flows of individuals into or out of state X due to 1) acquisition of a strain 310 i ∈ X, 2) clearance of a strain i ∈ X, 3) acquisition of a strain i ∈ X, and 4) clearance 311 of a strain i ∈ X, respectively. Each term is a product of the proportion of the 312 population in a particular state and the corresponding acquisition or clearance hazards 313 (per capita rates) denoted by q. In an example with three strains and X = {1}, the 314 corresponding terms are 1) The baseline hazard of clearance for an individual only infected with strain i is 317 denoted by q {i}→∅ and the baseline hazard for a completely susceptible individual to acquire strain i is given by Here, c is the per capita rate at which hosts make contacts that are relevant for 320 pathogen spread, β i is the probability of successfully acquiring strain i given contact interactions, but we only consider the pairwise-symmetric multiplicative structure (see 326 [8] for the alternative structures). To be exact, we assume each strain that is carried to 327 contribute multiplicatively to the hazard of acquiring (or clearing) the incoming (or where the involved interaction parameters are pairwise-symmetric, i.e. k ji = k ij and In the interaction networks we consider in this paper, nodes represent pathogen strains 349 and edges the presence of a pairwise interaction between two strains. By connecting 350 each pair of strains with fixed connection probability σ, independent from other edges, 351 the strain interaction networks resemble Erdös-Rènyi random graphs that become more 352 saturated with higher connection probability [34]. For each individual l, Y l = [Y 1,l , Y 2,l , · · · , Y n,l ] denotes the sequence of Bernoulli random 365 variables [35], with Y i,l denoting the presence of strain i in individual l. It follows that 366 corresponds to the product of marginal prevalence of both strains, provided that host death in age groups susceptible to infection is negligible [21], and there are no 373 unobserved common risk factors [28]. Moreover, the odds ratio of co-occurrence in the 374 two-strain model corresponds to the ratio of interaction parameters in acquisition and 375 clearance k 12 /h 12 [28]. When there are more than two strains, however, the 376 correspondence between pairwise association measures and corresponding pairwise 377 composite interaction parameters k ij /h ij might no longer hold, and deviation from 378 independence between two non-interacting strains could be induced by a third strain 379 that interacts with both. The objective here is to capture these (possibly heterogeneous) 380 pairwise interaction parameters from cross-sectional prevalence surveys for more than 381 two strains. For this purpose, we make use of various network inference methods (S1 382 Appendix).

383
The first statistical network inference method we applied are graphical modelling 384 approaches based on log-linear analysis [36]. This technique is generally used to 385 examine the relationship between more than two categorical variables, here denoting 386 presence-absence of each pathogen strain in a host population. To reconstruct networks 387 of strain-specific interactions from co-occurrence data with up to 10 circulating strains, 388 we searched through the subset of decomposable log-linear models, i.e. models whose 389 dependence graph is triangulated [25]. Selection was applied both in a forward 390 (focussed on adding edges) and backward (focussed on removing edges) fashion, using 391 AIC or BIC as selection criterion. Strengths of interaction were quantified a posteriori 392 by calculating odds ratios using conditional maximum likelihood estimation from the 393 contingency tables implied by the selected model.

394
Alternatively, we used the Ising model to reconstruct the simulated networks [24,37]. 395 Here, the probability of a certain pathogen strain being present is modelled as a iteratively, one variable is regressed onto all others, with a penalty imposed on the 401 regression coefficients to obtain a sparse network representation [38], and with model 402 selection based on the extended BIC [24]. ratios between all strains under consideration. In modelling the associations between multiple strains concomitantly, we used the alternating logistic regression algorithm of 406 the GENMOD procedure in SAS statistical software [26,32]. The regression framework 407 facilitates assessment of strain-specific interactions on the basis of Wald tests. To correct 408 for multiple hypothesis testing, we made use of the Benjamini-Hochberg procedure [39]. 409 Host heterogeneity

410
In the sensitivity analysis with host heterogeneity we considered a version of the types, that takes the value 0 when mixing between hosts is random and 1 when fully 418 assortative [33]. Here, we fixed φ = 50%. The contact rate c sr between a 'transmitting' 419 individual from sub-population s and a 'receiving' individual from sub-population r is 420 given by (see S2 Appendix for derivation) 421 c sr = φc s δ sr + (1 − φ) p r c r p 1 c 1 + p 2 c 2 , for s, r ∈ {1, 2}.
Here, δ sr is the Kronecker delta (with δ sr = 1 if s = r and zero otherwise). Defined are not shown in set notation with brackets, unlike the formulas in the main text. basic reproduction numbers randomly drawn from [1.5, 2]. In this sensitivity analysis, 453 the average contact rate relevant for pathogen spread is the same as in the base-case 454 simulations, but 80% of hosts is assumed to have below-average contacts and 20% with an assortment fraction of 50%. Abbreviations: GEE, generalized estimating 457 equations; PPV, positive predictive value; RMSE, root-mean-square error. 458 S1 Appendix. Statistical network inference.