## Abstract

If pathogen species, strains or clones do not interact, intuition suggests the proportion of co-infected hosts should be the product of the individual prevalences. Independence consequently underpins the wide range of methods for detecting pathogen interactions from cross-sectional survey data. Surprisingly, however, the results of the very simplest of epidemiological models challenge the underlying assumption of statistical independence. Even if pathogens do not interact, death and/or clearance of co-infected hosts causes net prevalences of individual pathogens to decrease simultaneously. The induced positive correlation between prevalences means the proportion of co-infected hosts is expected to be higher than multiplication would suggest. By modeling the dynamics of multiple non-interacting pathogens, we develop a pair of novel tests of interaction that properly account for this hitherto overlooked coupling. Which test is appropriate for any application depends on the form of the available data, as well as the extent to which the co-infecting pathogens are epidemiologically similar. Our tests allow us to reinterpret data from a number of previous studies, including pathogens of plants and animals, as well as papillomavirus and malaria in humans. We find certain reports of interactions have been overstated, and in some cases are not supported by the available data. Our work demonstrates how methods to identify interactions between pathogens can be updated to account for epidemiological dynamics.

## 1 Introduction

It is increasingly recognized that infections often involve multiple pathogen species or strains/clones of the same species (Balmer and Tanner, 2011; Vaumourin et al., 2015). Infection by one pathogen can affect susceptibility to subsequent infection by others (Griffiths et al., 2011; Petney and Andrews, 1998). Co-infection can also affect the severity and/or duration of infection, as well as the extent of symptoms and the level of infectiousness (Graham et al., 2005). Antagonistic, neutral and facilitative interactions are possible (Karvonen et al., 2018; Rigaud et al., 2010). Co-infection therefore potentially has significant epidemiological, clinical and evolutionary implications (Susi et al., 2015; Hilker et al., 2017; Alizon et al., 2013).

However, detecting and quantifying biological interactions between pathogens is notoriously challenging (Johnson and Buller, 2011; Hellard et al., 2015). In pathogens of some host taxa, most notably plant pathogens, biological interactions can be quantified by direct experimentation (Mascia and Gallitelli, 2016). However, often ethical considerations mean this is impossible, and so any signal of interaction must be extracted from population scale data. Analysis of longitudinal data remains the gold standard (Fenton et al., 2014), although the associated methods are not infallible (Telfer et al., 2010). However, collecting longitudinal data requires a dedicated and intensive sampling campaign, meaning in practice cross-sectional data are often all that are available. Methods for cross-sectional data typically concentrate on identifying deviation from statistical independence, using standard methods such as *χ*^{2} tests or log-linear modelling to test whether the observed probability of co-infection differs from the product of the prevalences of the individual pathogens (Booth and Bundy, 1995; Howard et al., 2001; Bogaert et al., 2004; Raso et al., 2004; Regev-Yochay et al., 2004; Nielsen et al., 2006; Chaturvedi et al., 2011; Rositch et al., 2012; Degarege et al., 2012; Malagón et al., 2016; Teweldemedhin et al., 2018). Detecting such a non-random statistical association between pathogens is then taken to signal a biological interaction. The underlying mechanism can range, for example, from individual-scale direct effects on within-host pathogen dynamics (Tollenaere et al., 2016; Mascia and Gallitelli, 2016), to indirect within-host immune-mediated interactions (de Roode et al., 2005), to indirect population-scale “ecological interference” caused by competition for susceptible hosts (Rohani et al., 1998, 2003).

A well-known difficulty is that factors other than biological interactions between pathogens can drive statistical associations. For instance, host heterogeneity – that some hosts are simply more likely than others to become infected – can generate positive statistical associations, since co-infection is more common in the most vulnerable hosts. Heterogeneity in host age can also generate statistical associations, as infections accumulate in older individuals (Lord et al., 1999). Methods aimed at disentangling such confounding factors have been developed, but show mixed results in detecting biological interactions (Hellard et al., 2012; Vaumourin et al., 2014). Methods using dynamic epidemiological models to track co-infections are also emerging, although more often than not require longitudinal data (Shrestha et al., 2011, 2013; Reich et al., 2013; Alizon et al., 2018; Man et al., 2018).

More fundamentally, however, the underpinning and long standing assumption that non-interaction implies statistical independence (Forbes, 1907; Cohen, 1973) has not been challenged. Here we confront the intuition that biological interactions can be detected via statistical associations, demonstrating how simple epidemiological models can change the way we think about biological interactions. In particular, we show that non-interacting pathogens should not be expected to have prevalences that are statistically independent. Co-infection by non-interacting pathogens is more probable than multiplication would suggest, invalidating any test invoking statistical independence.

The paper is organized as follows. First, we use a simple epidemiological model to show that the probability that a host is co-infected by both of a pair of noninteracting pathogens is greater than the product of the net prevalences of the individual pathogens. Second, we extend this result to an arbitrary number of noninteracting pathogens. This allows us to construct a novel test for biological interaction, based on testing the extent to which co-infection data can be explained by our epidemiological models in which pathogens do not interact. Different versions of this test, conditioned on the form of available data and whether or not co-infections are caused by different pathogen species, allow us to reinterpret a number of previous reports (Chaturvedi et al., 2011; López-Villavicencio et al., 2007; Andersson et al., 2013; Koepfli et al., 2011; Nickbakhsh et al., 2016; Seabloom et al., 2009; Moutailler et al., 2016; Howard et al., 2001; Molineaux et al., 1980). Our examples include plant, animal and human pathogens, and the methodology can potentially be applied to any cross-sectional survey data tracking co-infection.

## 2 Results

### 2.1 Two non-interacting pathogens

#### 2.1.1 Dynamics of the individual pathogens

We consider two distinct pathogen species, strains or clones (henceforth pathogens), which we assume do not interact, i.e. the interaction between the host and one of the pathogens is entirely unaffected by its infection status with respect to the other. Epidemiological properties that are therefore unaffected by the presence or absence of the other pathogen include initial susceptibility, within-host dynamics including rates of accumulation and/or movement within tissues, host responses to infection, as well as onward transmission. Assuming a fixed-size host population and Susceptible-Infected-Susceptible (S-I-S) dynamics (Keeling and Rohani, 2007), the proportion of the host population infected by pathogen *i* ∈ {1, 2} follows
in which the dot denotes differentiation with respect to time, *β _{i}* is a pathogen-specific infection rate, and

*μ*is the host’s renewal rate.

Specifically, the renewal rate *μ* is the sum of the host’s death rate and the pathogens’ clearance rate (Box 1). For simplicity of presentation in the main text we assume clearance is unspecific, meaning it simultaneously clears all pathogens from an infected host (Nowak et al., 1990; Hellriegel, 1992; Antia et al., 1996; Lion, 2013; Murall et al., 2014). Infected individuals are therefore assumed to die or to clear pathogens at net rate μ, independent of the pathogens hosted by each individual. However our model can be extended to handle pathogen-specific rates of clearance (Box 1; Supplementary Information: Sections S1.4, S1.5 and S2.3).

Unspecific clearance (i.e. simultaneous purging of all pathogens from a single host) is possible even when pathogens do not interact, so long as clearance occurs at the same rate in singly- and co-infected hosts. For clearance caused by host immune responses, this corresponds to assuming that no additional host defences are induced by co-infection. For clearance following treatment, this corresponds to application to all hosts regardless of infection status. Such preventive treatments are common in agriculture, in which context systematic and repeated treatments with antibiotics or fungicides are routine in both animal and plant populations. In these cases, unspecific clearance is entirely orthogonal to any pathogen interactions, but can occur at a significant rate.

The death rate of the host in our model may include disease-induced death and natural death. Box 1 shows how disease-induced mortality (virulence) is also compatible with non-interacting pathogen dynamics in certain cases. Yet, pathogens do not necessarily increase the death rate of their host. Similarly, unspecific clearance may not occur. Therefore, the parameter *μ* may reduce to the natural death rate of the host. While natural mortality may be negligible for acute infections, it cannot be neglected for chronic (i.e. long-lasting) infections, which are responsible for a large fraction of co-infections in humans and animals (Griffiths et al., 2011; Gorsich et al., 2018). Likewise, plants remain infected over their entire lifetime following infection by most pathogens, including almost all plant viruses, as well as the anther smut fungus, which drives one of our examples here (López-Villavicencio et al., 2007).

We assume that host population size is a constant. This makes death and unspecific pathogen clearance epidemiologically equivalent, since both replace an infected host with an uninfected one. The simplification facilitates later extension to an arbitrary number of co-infecting pathogens. Focusing on a generic model also allows us to reasonably use it for pathogens of any host species ranging from humans to animals to plants.

#### 2.1.2 Tracking co-infection

Making identical assumptions, but instead distinguishing hosts infected by different combinations of pathogens, leads to an alternate representation of the dynamics. We denote the proportion of hosts infected by only one of the two pathogens by *J _{i}*, with

*J*

_{1,2}representing the proportion co-infected. Pathogen-specific net forces of infection are and so in which

*J*

_{⌀}= 1 −

*J*

_{1}−

*J*

_{2}−

*J*

_{1,2}is the proportion of hosts uninfected by either pathogen (Fig. 1).

#### 2.1.3 Prevalence of co-infected hosts

We assume the basic reproduction number, *R*_{0,i} = *β _{i}/μ* > 1 for both pathogens. Solving Eq. (3) numerically for arbitrary but representative parameters (Fig. 2A) shows the proportion of co-infected hosts (

*J*

_{1,2}) to be larger than the product of the individual prevalences (

*P*=

*I*

_{1}

*I*

_{2}from Eq. (1)). That

*J*

_{1,2}(

*t*) ≥

*P*(

*t*) for large

*t*(for all parameters) can be proved analytically (Supplementary Information: Section S1.1).

Simulations of a stochastic analogue of the model (Fig. 2B) reveal the key driver of this behavior. The net prevalences of the pathogens considered in isolation, *I _{1}* and

*I*

_{2}, are positively correlated (Fig. 2C; Eq. (27) in Methods: Section 4.1.4, “Stochastic models”), due to simultaneous reductions whenever co-infected hosts die or clear pathogens.

### Is a non-interacting model with non-zero virulence an oxymoron?

Here we show that it is only possible to build a model in which two virulent pathogens do not interact at the population scale (i.e., the net prevalence of each pathogen is unaffected by the presence or absence of the other) if: i) both pathogens have the same disease-induced death rate (virulence) *α _{i}*; ii) co-infected hosts do not suffer any additional disease-induced mortality. The model is restricted to two pathogen species for simplicity (

*i*= 1,2). The specific clearance rate of each pathogen is

*γ*. The disease-induced death rate of co-infected hosts is

_{i}*γ*

_{1,2}. The natural death rate of the host is

*ν*. The host population size is assumed to be a constant, with no vertical transmission. Therefore, each death implies the birth of an uninfected individual. Unspecific clearance (purging all pathogens at once) occurs at rate ξ. As in the main text,

*I*+

_{i}= J_{i}*J*

_{1,2}denotes the prevalence of hosts singly infected or doubly infected by pathogen

*i*(

*i*= 1,2). The full model is: with

*F*(

_{i}= β_{i}I_{i}= β_{i}*J + J*

_{1,2}) and

*J*

_{⌀}= 1 −

*J*

_{1}−

*J*

_{2}−

*J*

_{1,2}. The full model implies:

Non-interacting pathogens necessarily have independent dynamics, i.e.,

This requires *α*_{1} = *α*_{1,2} = *α*_{2}. Therefore, non-interacting pathogens must have the same virulence (*α*_{1} = *α*_{2}) and beinginfected byone orboth pathogens must leave the disease-induced death rate unchanged *α*_{1,2} = *α*. These assumptions may fit similar pathogens such as different genotypes of the same species. However, the absence of additional disease-induced mortality may imply (indirect) interactions at the within-host scale. This does not compromise non-interaction at the host population scale though.

Otherwise (*α*_{1} ≠ *α*_{2} or *α*_{1} ≠ *α*_{2} ≠ *α*_{1,2}), it is simply not possible to build a non-interacting model with nonzero virulence, since pathogens interact at the host population scale. This is a form of “ecological interference” (Rohani et al., 2003). Consequently, there is no reason to expect such virulent pathogens to be statistically independent.

In conclusion, the renewal rate *μ* in the main text can be interpreted as *μ = ν + ξ* for pathogens which do not increase their host’s death rate (*α _{i}* = 0 for

*i*= 1,2) or (more cautiously) as

*μ = ν + ξ + α*for similar pathogens (restricting “non-interaction” to its population-scale definition: the prevalence of each pathogen is unaffected by the other). The answer to the question posed in the title of this box is therefore “No.”

#### 2.1.4 Deviation from statistical independence

For *R*_{0,i} > 1 the relative deviation of the equilibrium prevalence of co-infection from that required by statistical independence is
(Eq. (9) in Methods: Section 4.1.1, “Equilibria of the two-pathogen model”). The deviation is therefore zero if and only if the renewal rate *μ* = 0.

The observed outcome would conform with statistical independence only for non-interacting pathogens where there is no unspecific clearance or host death (at the timescale of an epidemic). This reiterates the role of host renewal in causing deviation from a statistical association pattern. Note that our result is not related to aging (infections accumulating in older individuals) as the deviation increases with the renewal rate *μ*.

### 2.2 Testing for interactions between pathogens

Eq. (3) can be straightforwardly extended to track *n* pathogens which do not interact in any way (including pairwise and three-way interactions). Equilibria of this model are prevalences of different classes of infected or co-infected hosts carrying different combinations of non-interacting pathogens. These can be used to derive a test for interaction between pathogens which properly accounts for the lack of statistical independence revealed by our analysis of the simple two-pathogen model.

#### 2.2.1 Modelling co-infection by *n* non-interacting pathogens

We denote the proportion of hosts simultaneously co-infected by the (non-empty) set of pathogens Γ to be *J*_{Γ}, and use Ω_{i} = Γ \ {*i*} (for *i* ∈ Γ) to represent combinations with one fewer pathogen.

The dynamics of the 2^{n} − 1 distinct values of *J*_{Γ} follow
in which the net force of infection of pathogen *i* is
and ∇_{i} is the set of all subsets of {1,…, *n*} containing *i* as an element. Eq. (5) can be interpreted by noting

the first term tracks inflow due to hosts carrying one fewer pathogen becoming infected;

the second term tracks the outflows due to hosts becoming infected by an additional pathogen, or death/clearance.

##### Predicted prevalences

If *R*_{0,i} = *β _{i}/μ* > 1 for all

*i*= 1

*n*, the equilibrium prevalence of hosts infected by any given combination of pathogens, , can be obtained by (recursively) solving a system of 2

^{n}linear equations (Eq. (16) in Methods: Section 4.1.2, “Equilibria of the

*n*-pathogen model”).

These equilibrium prevalences are the prediction of our “Non-interacting Distinct Pathogens” (NiDP) model, which in dimensionless form has *n* parameters (the *R*_{0,i}’s. *i* = 1,…, *n*; Methods: Section 4.2.2, “Fitting the models”).

##### Epidemiologically interchangeable pathogens

If we simplify the model by assuming that all pathogen infection rates are equal (i.e. *β _{i}* =

*β*for all

*i*), then if

*R*

_{0}=

*β/μ*> 1, the proportion of hosts infected by

*k*distinct pathogens can be obtained by (recursively) solving

*n*+ 1 linear equations (Eq. (22) in Methods: Section 4.1.3, “Deriving the NiSP model from the NiDP model”). This constitutes the prediction of our “Non-interacting Similar Pathogens” (NiSP) model, a simplified form of the NiDP model requiring only a single parameter (

*R*

_{0}).

#### 2.2.2 Using the models to test for interactions

If either the NiSP or NiDP model adequately explain co-infection data, those data are consistent with the underpinning assumption that pathogens do not interact. Which model is fitted depends on the form of the available data.

##### Numbers of distinct pathogens

Studies often quantify only the number of distinct pathogens carried by individual hosts, without necessarily specifying the combinations involved (López-Villavicencio et al., 2007; Seabloom et al., 2009; Chaturvedi et al., 2011; Koepfli et al., 2011; Andersson et al., 2013; Moutailler et al., 2016; Nick-bakhsh et al., 2016). There are insufficient degrees of freedom in such data to fit the NiDP model, and so we fall back upon the NiSP model. In using the NiSP model, we additionally assume all pathogens within a given study are epidemiologically interchangeable.

We identified four suitable studies reporting data concerning strains/clones of a single pathogen, and tested whether these data are consistent with no interaction. For all four studies (Fig. 3), the best-fitting NiSP model is a better fit to the data than the corresponding binomial model assuming statistical independence (Eq. (28) in Methods: Section 4.2.1, “Models corresponding to assuming statistical independence”). Three additional examples for data-sets considering distinct pathogens, which deviate more markedly from the epidemiological equivalence assumption, are in the Supplementary Information (Section S2.1).

In one case – co-infection by different strains of human papillomavirus (Chaturvedi et al., 2011) (Fig. 3A) – we find no evidence that the reported data cannot be explained by the NiSP model. These data therefore support the hypothesis of no interaction – and indeed no epidemiological differences – between the pathogen strains in question.

In the three other cases we considered – strains of anther smut (*Microbotryum violaceum*) on the white campion (*Silene latifolia*) (López-Villavicencio et al., 2007) (Fig. 3B); strains of the tick-transmitted bacterium *Borrelia afzelii* on bank voles (*Myodes glareolus*) (Andersson et al., 2013) (Fig. 3C); and clones of a single malaria parasite (*Plasmodium vivax*) infecting children (Koepfli et al., 2011) (Fig. 3D) – despite outperforming the model corresponding to statistical independence, the best-fitting NiSP model does not adequately explain the data. We therefore reject the hypotheses of no interaction in all three cases, noting that our use of the NiSP model means it might be epidemiological differences between pathogen strains/clones that have in fact been revealed.

##### Combinations of pathogens

Other studies report the proportion of hosts infected by particular combinations (rather than counts) of pathogens, although many of those concentrate on helminth macroparasites for which our underlying S-I-S model is well-known to be inappropriate (Anderson and May, 1991).

However, a methodological article by Howard et al. (2001) introduces the use of log-linear modeling to test for statistical associations. Conveniently, that article reports the results of that methodology as applied to a large number of studies focusing on *Plasmodium* spp. causing malaria.

By interrogating the original data sources (Methods: Section 4.3.2, “Combinations of pathogens (NiDP model)”), we found a total of 41 studies of malaria reporting the disease status of at least *N =* 100 individuals, and in which three of

*P. falciparum, P. malariae, P. ovale*and

*P. vivax*were considered. Data therefore consist of counts of the number of individuals infected with different combinations of three of these four pathogens, a total of eight classes. There were sufficient degrees of freedom to fit the NiDP model, which here has three parameters, each corresponding to the infection rate of a single

*Plasmodium*spp. Fig. 4A shows the example of fitting the NiDP model to data from a study of malaria in Nigeria (Molineaux et al., 1980).

Fitting the NiDP model allows us to test for interactions between *Plasmodium* spp., without assuming they are epidemiologically interchangeable. In 18 of the 41 cases we considered, our methods suggest the data are consistent with no interaction (Fig. 4B). We note that in 12 of these 18 cases the methodology based on statistical independence of Howard et al. (2001) instead suggests the *Plasmodium* spp. interact.

## 3 Discussion

We have shown that pathogens which do not interact and so have uncoupled prevalence dynamics (Eq. (1)) are not statistically independent. For two pathogens, the prevalence of co-infection is always greater than the product of the prevalences (Eq. (4)), unless host renewal (death or unspecific clearance) does not occur. Pathogens share a single host in co-infections, and so when a co-infected host is renewed, net prevalences of both pathogens decrease simultaneously. The prevalences of individual pathogens, regarded as random variables, therefore co-vary positively. As a side result, we note our analysis indicates a high-profile model of May and Nowak (1995) is based on a faulty assumption of probabilistic independence (Supplementary Information: Section S1.3). More importantly, our analysis shows that statistically independent pathogens may well be interacting (Supplementary Information: Section S1.5) which confirms that statistical independence is far from equivalent to the absence of biological interaction between pathogens.

We extended our model to an arbitrary number of pathogens to develop a novel test for interaction that properly accounts for statistical non-independence. Many data sets summarize co-infections in terms of multiplicity of infection, regardless of which pathogens are involved. Since there would be as many epidemiological parameters as pathogens in our default NiDP model, and so as many parameters as data-points, the full model would be over-parameterised. We therefore introduced the additional assumption that all pathogens are epidemiologically interchangeable. This formed the basis of the parsimonious NiSP model, which is most appropriate for testing for interactions between strains or clones of a single pathogen species.

Despite the strong and perhaps even unrealistic assumption that strains/clones are interchangeable, the NiSP model outperformed the binomial model assuming statistical independence for all four data sets we considered. In particular, the NiSP model successfully captured the fat tails characteristic of observed multiplicity of infection distributions. All four data sets therefore support the idea that co-infection is far more frequent than statistical independence would imply.

For the data set concerning co-infection by different strains of human papillomavirus (Chaturvedi et al., 2011), the NiSP model also passed the goodness-of-fit test, allowing us to conclude strains of this pathogen do not interact. Goodness-of-fit for such a simple model is a particularly conservative test, especially for the NiSP model, when we assume pathogens clones/strains are epidemiologically interchangeable.

We illustrated our methods via case studies for which suitable data are readily available, and our purpose was not to come to definitive conclusions concerning any particular system. That would require dedicated studies. However, by fitting even a highly-simplified version of our model to data, we have demonstrated how results of simple epidemiological models challenge previous methods based on statistical independence.

To explore further the implications of our findings, we analyzed available data sets tracking combinations of pathogens involved in each occurrence of co-infection. For methodological comparison purposes, we restricted ourselves to data referenced in Howard et al. (2001), concerning interactions between *Plasmodium* spp. causing malaria. Relaxing the assumption of epidemiological interchangeability (i.e. using the NiDP model), we found that 43.9% (i.e. 18/41) of data sets considered in Howard et al. (2001) are consistent with no interaction.

Again we do not intend to conclusively demonstrate interactions – or lack of interactions – for malaria. Instead what is important is that our results very often diverge from those originally reported in Howard et al. (2001) using a method based on statistical associations, namely log-linear regression. Log-linear regression suffers from well-acknowledged difficulties in cases in which there are zero counts (i.e. certain combinations of pathogens are not observed) (Fienberg and Rinaldo, 2012). Such cases often arise in epidemiology. Methods based on epidemiological models therefore offer a twofold advantage: biological interactions are not confounded with statistical associations, and parameter estimation is well-posed, irrespective of zero counts.

We focused here on the simple S-I-S model, since it is sufficiently generic to be applicable to a number of systems. In future work focusing on particular pathosystems, models including additional system-specific detail would of course be appropriate. We leave further analysis of more complex underpinning epidemiological models to such future research.

Lastly, we speculate our results may have implications beyond epidemiology. After all, pathogens are species which form meta-populations occupying discrete patches (hosts) (Seabloom et al., 2015). Meta-community ecology has long been concerned with whether interactions between species can be detected from co-occurrence data (Forbes, 1907; Caswell, 1976; Connor and Simberloff, 1979) and most existing methods are based on detecting statistical associations (Gotelli, 2000; Gotelli and Ulrich, 2012), but see Hastings (1987). Our dynamical modeling approach may also provide a new perspective in this area.

## 4 Methods

### 4.1 Mathematical analyses

#### 4.1.1 Equilibria of the two-pathogen model

The 2-pathogen model is given by Eq. (1–2–3). Since the population size is constant, *J*_{⌀} = 1 − *J*_{1} − *J*_{2} − *J*_{1,2}, and so it follows that

It is well-known (Keeling and Rohani, 2007) that if *R _{0,i}* =

*β*> 1 and

_{i}/μ*I*(0) > 0, the prevalence of pathogen

_{i}*i*will tend to an equilibrium .

Since *F _{i} = β_{i}* and

*J*=

_{i}*I*−

_{i}*J*

_{1,2}, the rate of change of co-infected hosts in Eq. (3) can be recast as

The equilibrium prevalence of co-infected hosts can therefore be written in terms of the individual net prevalences at equilibrium ( and ),

This immediately leads to the result concerning the deviation of from (i.e., the expected prevalence of co-infected hosts given statistical independence) quoted in Eq. (4).

#### 4.1.2 Equilibria of the *n*-pathogen model

The *n*-pathogen model is given by Eq. (1–5–6). Since the host population size is constant, *J*_{⌀} = 1 − ∑_{Γ∈∇}*J*_{Γ}, where V is the set of all 2^{n} − 1 sets with infected or co-infected hosts. It is also true that

At equilibrium, Eq. (5) becomes
in which and are equilibrium prevalences, and is the force of infection of pathogen *i* at equilibrium, i.e.

Since these forces of infection are constant and do not depend on the equilibrium prevalences, the set of 2^{n} − 1 equations partially characterizing the equilibrium is linear, with

Similarly, Eq. (10) is linear

The equilibrium prevalences can be written very conveniently in a recursive form (i.e. using the first equation to fix , using to independently calculate all values of for |Γ| = 1, then using the set of values of when |Γ| = 1 to independently calculate all values of for |Γ| = 2, and so on). A recurrence relation to find all equilibrium prevalences can therefore be initiated with the following expression for the density of uninfected hosts:

Then, one may recursively use the following equation, equivalent to Eq. (13):

Since the densities in Eq. (16) are entirely in terms of the equilibrium densities of hosts carrying one fewer pathogen , this allows us to recursively find the densities of all pathogens given pathogen-by-pathogen values of *R*_{0,i}.

#### 4.1.3 Deriving the NiSP model from the NiDP model

If all pathogens are interchangeable, and so have identical values of *R*_{0,i} = *R*_{0} ∀*i*, then for any pair of combinations of infecting pathogens, Γ_{1} and Γ_{2}, it must be the case that whenever |Γ_{1}| = |Γ_{2}|. This means the equilibrium prevalences of hosts infected by the same number of distinct pathogens must all be equal, irrespective of the particular combination of pathogens that is carried. In this case solving the system is much simpler. First, Eq. (11) can be rewritten as
in which . The net prevalence of hosts infected by *k* distinct pathogens is
in which ∇(*k*) is the set of combinations of {1,…, *n*} with *k* elements. Since the form of Eq. (17) depends only on |Γ|, all individual prevalences involved in *M̄ _{k}* are identical, and so
in which is a combinatorial coefficient, and is any of the individual prevalences for which |Γ| =

*k*. The ratio between successive values of is given by

From Eq. (15), it follows that
in which *R*_{0} = *β/μ*. For 1 ≤ *k* ≤ *n*, Eq. (17) and (20) together imply
a form which admits a simple recursive solution.

#### 4.1.4 Stochastic models

Figs. 2B and 2C were generated by simulating the stochastic differential equation corresponding to Eq. (3); simulating a continuous time Markov chain model using Gillespie’s algorithm gave consistent results. Confidence ellipses were obtained from an approximate expression for the covariance matrix at equilibrium (see below).

##### Continuous-time Markov chain

The continuous-time Markov chain model corresponding to the unscaled version of Eq. (3–7) tracks a vector of integer-valued random variables *X*(*t*) = (*J*_{⌀}(*t*),*J*_{1}(*t*),*J*_{2}(*t*),*J*_{12}(*t*)). Defining Δ*X* = X(*t* + Δ*t*) − *X*(*t*) = (Δ*J*_{⌀},Δ*J*_{1},Δ*J*_{2},Δ*J*_{1,2}), changes of ± 1 to each element of *X*(*t*) occur in small periods of time Δ*t* at the rates given in Table 1. Stochastic trajectories from this model can conveniently be simulated via the Gillespie algorithm (Gillespie, 1977). Note that the numeric values of the infection rates and the host birth rate must be altered to account for the scaling by population size.

##### Stochastic differential equations

The model can also be written as a system of stochastic differential equations (SDEs), an approximation to the continuous-time Markov chain that is valid for sufficiently large *N* (Kurtz, 1970) and which is particularly well-suited for simulation of the stochastic model when the population size is large. This form of the model again tracks the seven events in Table 1, although in the SDE formulation the random variables in *X*(*t*) are continuous-valued. A heuristic derivation is based on a normal approximation described below. Alternately, the forward Kolmogorov differential equations in the continuous-time Markov chain model are closely related to the Fokker Planck equation for the probability density function of the SDE model (Allen et al., 2008).

The expected change and covariance of the changes can be computed from Table 1 to order Δ*t* via
where is the unscaled version of the deterministic model as specified in Eq. (3–7) with *N = J*_{⌀} + *J*_{1} + *J*_{2} + *J*_{1,2} (a constant) and *F _{i} = β_{i}*(

*J*+

*J*

_{1,2})/

*N*. In addition, the matrix Σ is given by

The changes in a small time interval Δ*t* are approximated by a normal distribution via the Central Limit Theorem: , where 0 = zero vector. The covariance matrix Σ can be written as Σ = *GG ^{T}*. Letting Δ

*t*→ 0, the SDE model can therefore be expressed as

The matrix *G* is not unique but a simple form with dimension 4 × 7 accounts for each event in Table 1 (Allen et al., 2008). Each entry in matrix *G* involves a square root and *W* is a vector of seven independent standard Wiener processes, where d*W _{i}* ≈ Δ

*W*(

_{i}*t*) =

*W*(

_{i}*t*+ Δ

*t*) −

*W*(

_{i}*t*) ∈ Normal(0,Δ

*t*). An explicit form for the SDE model in Eq. (25) is

##### Covariance matrix at the endemic equilibrium

In Supplementary Information (Section S1.2) we show that the covariance between the prevalences of pathogen 1 and pathogen 2 as they fluctuate in the vicinity of their equilibrium values is approximately
with equality if and only if *μ* = 0 (assuming *β _{i}* >

*μ, i*= 1, 2). Only in the specific case

*μ*= 0, is the deviation from statistical independence equal to zero (Eq. (4)).

### 4.2 Statistical methods

#### 4.2.1 Models corresponding to assuming statistical independence

If data are observations of numbers of individuals infected with *k* distinct pathogens, *O _{k}*, for

*k*∈ [0,

*n*], statistical independence corresponds to assuming the infection load of a single individual follows the one-parameter, binomial model

*Bin*(

*n,p*), in which

*p*is the pathogen prevalence (assumed identical for each pathogen, and fitted appropriately to the data), and

*n*is the maximum number of infections that is possible (i.e. the total number of distinct pathogens under consideration). Model predictions are then simply

*N*samples from this binomial distribution, where

*N = ∑*is the total number of individuals observed in the data. One interpretation is as a multinomial model in which

_{k}O_{k}For the data for malaria corresponding to numbers of individuals, *O*_{Γ}, infected by different sets of pathogens, Γ, statistical independence corresponds to an *n*-parameter multinomial model, parameterised by the prevalences of the individual pathogens *p _{i}* (again fitted to the data), i.e.

#### 4.2.2 Fitting the models

The host renewal rate, *μ*, can be scaled out of the equilibrium prevalences by rescaling time. Fitting the models therefore corresponds to finding value(s) for scaled infection rate(s) *β _{i}*, i.e.

*R*

_{0,i}=

*β*(all are equal for the NiSP model).

_{i}/μThe method used to fit the model does not depend on whether the data are numbers of hosts infected by a particular combination of pathogens, or numbers of hosts carrying particular numbers of distinct pathogens, since both can be viewed as *N* samples drawn from a multinomial distribution, with *q _{j}* observations of the

*j*class. If the corresponding probabilities generated by the model being fitted are

^{th}*p*, then the log-likelihood is

_{j}The models were fitted by maximizing *L* via `optim()` in R (R Core Team, 2016). Convergence to a plausible global maximum was checked by repeatedly refitting the model from randomly chosen starting sets of parameters. All models were fitted in a transformed form to allow only biologically-meaningful values of parameters; that is, the basic reproduction numbers were estimated after transformation with log (*R*_{0,i} − 1) to ensure *R*_{0,i} > 1.

#### 4.2.3 Model comparison

To compare the best-fitting NiSP or NiDP model and an appropriate model assuming statistical independence (binomial or multinomial), we use the Akaike Information Criterion , in which is the log-likelihood of the best-fitting version of each model and *k* is the number of model parameters. This is necessary since these comparisons involve pairs of models that are not nested.

#### 4.2.4 Goodness-of-fit

We use a Monte-Carlo technique to estimate *p*-values for model goodness-of-fit, generating 1, 000, 000 independent sets of samples of total size *N* from the multinomial distribution corresponding to the best-fitting model, calculating the likelihood (Eq. (30)) of each of these synthetic data sets, and recording the proportion with a smaller value of *L* than the value calculated for the data (Sokal and Rohlf, 2012). This was done using the function `xmonte()` in the R package `XNomial` (Engels, 2015).

### 4.3 Sources of data and results of model fitting

#### 4.3.1 Numbers of distinct pathogens (NiSP model)

Results of fitting the NiSP model to data from four publications for strains of a single pathogen are presented in Figure 3. Error bars are 95% confidence intervals using exact methods for binomial proportions via `binconf()` in the R package `Hmisc` (Harrell Jr et al., 2016). Results for three further data sets concerning different pathogens of a single host (Andersson et al., 2013; Moutailler et al., 2016; Nickbakhsh et al., 2016) are provided as Supplementary Information (Section S2.1).

For convenience the raw data as extracted for use in model fitting are re-tabulated in Table 2. Results of model fitting are summarized in Table 3. We used the value *n* = 102 for the number of distinct strains in López-Villavicencio et al. (2007) following personal communication with the authors; there might be undetected genetic differences due to missing data – which would require a larger value of *n* in our model fitting procedure – but we confirmed that our inferences are unaffected by taking any value of *n* ∈ [100, 200].

#### 4.3.2 Combinations of pathogens (NiDP model)

Howard et al. (2001) report results of analyzing 73 data sets concerning multiple *Plasmodium* spp. causing malaria (rows 68–140 of Table 1 in that paper). We reanalyzed the subset of these studies satisfying certain additional constraints as detailed in the main text (see Supplementary Information: Section S2.2 for a full description of how the studies were filtered). This left a final total of 41 data sets taken from 35 distinct papers: 24 data sets considering the three-way interaction between *P. falciparum, P. malariae* and *P. vivax* and 17 data sets considering the three-way interaction between *P. falciparum, P. malariae* and *P. ovale*.

We used our method based on the NiDP model to test whether any of these data sets were consistent with no interaction between the *Plasmodium* spp. considered (Table 4). We found 15 data sets for which the NiDP model was: i) a better fit than the multinomial model as indicated by ΔAIC ≥ 2; ii) sufficient to explain the data as revealed by our goodness-of-fit test. In these 15 cases our methods therefore support the hypothesis of no interaction. For 11 of these 15 data sets (76, 109, 118, 130, 132, 68, 69, 70, 79, 95, 97, 98, 99, 100, 102) the results as reported in (Howard et al., 2001) instead suggest the strains interact.

## 5 Data availability

The cross-sectional survey data extracted from previous publications which we have used to test our methodology are tabulated in Table 2 in the main text, and Tables S2 and S4 in the Supplementary Information. These data are available in electronic format as .csv files from the corresponding author upon reasonable request.

## 6 Code availability

Code illustrating all statistical methods is freely available at: https://github.com/nikcunniffe/Coinfection.

## Author contributions

Designed research: All authors. Performed mathematical analyses: F.M.Ha., L.J.A., N.J.C. Explored and analyzed data: F.M.Ha., F.M.Hi., M.A.R., N.J.C. Wrote the paper: F.M.Ha., L.J.A., N.J.C.

## 7 Acknowledgments

This work was initiated during the Multiscale Vectored Plant Viruses Working Group at the National Institute for Mathematical and Biological Synthesis, supported by the National Science Foundation through NSF Award #DBI-1300426, with additional support from The University of Tennessee, Knoxville. This material is based upon research supported by the Thomas Jefferson Fund of the Embassy of France in the United States and the FACE Foundation. We thank S. Alizon, T. Berrett, E. Bussell, V. Calcagno, C. Donnelly, R. Donnelly, T. Giraud, M. López-Villavicencio, T. Obadia, M. Parry, M. Plantegenest, O. Restif, E. Seabloom, J. Shykoff, R. Thompson and C. Trotter for helpful discussions or provision of data.