Predictors of virus prevalence and diversity across a wild bumblebee community

Viruses are key regulators of natural populations. Despite this, we have limited knowledge of the diversity and ecology of viruses that lack obvious fitness effects on their host. This is even the case in wild host populations that provide ecosystem services, where small fitness effects may have major ecological and financial impacts in aggregate. One such group of hosts are the bumblebees, which have a major role in the pollination of food crops and have suffered population declines and range contractions in recent decades. In this study, we used a multivariate generalised linear mixed model to investigate the ecological factors that determine the prevalence of four recently discovered bumblebee viruses (Mayfield virus 1, Mayfield virus 2, River Liunaeg virus and Loch Morlich virus), and two previously known viruses that infect both wild bumblebees and managed honeybees (Acute bee paralysis virus and Slow bee paralysis virus). We show that the recently discovered bumblebee viruses were more genetically diverse than the viruses shared with honeybees, potentially due to spillover dynamics of shared viruses. We found evidence for ecological drivers of prevalence in our samples, with relatively weak evidence for a positive effect of precipitation on the prevalence of River Luinaeg virus. Coinfection is potentially important in shaping prevalence: we found a strong positive association between River Liunaeg virus and Loch Morlich virus presence after controlling for host species, location and other relevant ecological variables. This study represents a first step in the description of predictors of bumblebee infection in the wild not driven by spillover from honeybees.


Community Similarity
To estimate host community similarity between sampling sites, we estimated the underlying sampling probability of each bumblebee species at each site by treating the observed samples as being drawn from a multinomial distribution with 24 categories, corresponding to the 24 bumblebee species in the United Kingdom. We use a Dirichlet prior with these 24 categories and a concentration parameter of 1 for each category, implying complete uncertainty about the underlying probability. This has the advantage that the posterior has a known analytical form. Probabilities were estimated independently for each site. Ten thousand simulations were taken from the posterior distributions generated for each site to generate possible values of the underlying sampling probabilities of each bumblebee species at each site, which we assume to be roughly equivalent to the frequency of that bumblebee species at that site. For each of the 10,000 simulations from the posteriors at the sites, we generated estimates of the community dissimilarity using the Morisita-Horn index (Horn, 1966), implemented in the R package vegan (Oksanen et al., 2017). The posterior mode and 90% shortest probability intervals for the dissimilarity index were then reported.

Prevalence and Climatic Association
Climatic data for each of the nine sites at which bees were collected was taken from the WorldClim database at 1km resolution (Fick & Hijmans, 2017). Predictions for July and August derived from data from 1960-2010 were extracted for mean daily maximum temperature, mean precipitation, mean solar radiation and mean wind speed at the grid reference for the sites with a buffer area of 2km to account for the fact that bumblebees forage over approximately that distance (Wolf & Moritz, 2008). All values were averaged to generate a consensus value for that site and then mean-centred and scaled to unit standard deviation. At this point, we tested the correlation between the variables. Mean solar radiation and mean daily maximum temperature were highly correlated (Pearson correlation: 0.78), so only mean maximum temperature was carried forward. All remaining variables had low Pearson correlations between them in the range of -0.4 to 0.4.
We tested associations between individual prevalence and climate data using Stan version 2.18.2 (Carpenter et al., 2017) via the rstan interface (Stan Development Team, 2016) in R version 3.6 (R Core Development Team, 2016). A multivariate probit model was fitted, with random host, location and host-location effects, and mean maximum temperature, precipitation, and wind speed as fixed effects for each virus. The usage of the multivariate probit allows us to test for excess co-infection between viruses in the study, i.e. coinfection beyond random expectations after the effect of shared covariates has been removed. As the number of sampling locations was small, we expected our ability to accurately determine the size and direction of effects caused by ecological covariates would be limited. To reduce the effect of drawing spurious conclusions due to our small number of sites, we applied regularisation as recommended by Lemoine et al. (2016), using a regularising prior distribution. The global intercept for each virus was given a Gaussian (mu=0, sigma=10) prior, which does not substantially penalise low probabilities. Each fixed effect coefficient was given a Gaussian (mu=0, sigma=1) prior, which, given that the fixed effects act at the site level, should dominate the likelihood if the effect is small. Random effects were drawn from normal distributions centred at 0 with estimated standard deviations. In all cases, the standard deviations were given Exponential (lambda=2) hyperpriors, which are only weakly informative on the logit scale when the data is informative for the standard deviation. The correlation in residuals for the multivariate normal was given a near flat prior using a Lewandoski, Kurowicka and Joe (eta=1) prior. While the Stan code used can regularly give outputs with divergent transitions, the presented model had no divergent transitions over 24000 samples, tail and bulk effective sample sizes of over 400 for all parameters and no Bayesian fraction of missing information warning. We did not perform model selection given our regularising priors, and statements are made based on estimates from the full model.

Diversity Analysis
To analyse sequence diversity, we used the raw reads from the RNA sequencing described in Pascall et al. (2018). Briefly, these consist of 100bp-paired end RNA-Seq data from pools of B. terrestris, Bombus pascuorum and Bombus lucorum, each sequenced twice, once using duplex specific normalisation and once using poly-A selection, and a pool of mixed Overlapping mate reads were combined by FLASH version 1.2.11 using the default settings (Magoč & Salzberg, 2011). Reads were aligned to the RdRp sequences generated above using MOSAIK version 2.1.73 (Lee et al., 2014). Both merged reads and singletons from the sickle run were aligned together in the single end setting. Unmerged paired end reads were separately aligned using the paired end setting. In both cases, a quality threshold of 30 was used to remove ambiguously mapping reads. SAM files were recombined after the fact using SAMtools version 1.5 (Li et al., 2009). Given the high coverage of SBPV, MV1 and MV2 duplicate sequences were not marked, as per best practice (McKenna et al., 2010). Variants were called using the default settings in LoFreq* version 1.2.1 (Wilm et al., 2012). Base quality scores were recalibrated using the outputted vcf file in GATK (DePristo et al., 2011).
Variant calling and recalibration were repeatedly performed until the base quality scores converged to a stable distribution (a total of four recalibrations). Once the score distribution stabilised, variant calling was performed to generate a set of variants for the entire sample.
These variants were used to recalibrate the scores of each species-specific mapping and generate species level variant calls. If the median depth over called differences from the consensus was less than 20, species-virus combinations were removed from the variant analysis. B. lucorum was analysed for SBPV, ABPV and MV1. B. terrestris was analysed for SBPV, MV1 and MV2. B. pascuorum was analysed for ABPV, SBPV and MV2. The mixed Bombus pool was analysed for all six viruses.
The number of polymorphic sites was calculated for each virus. Variants with allele frequencies of 1 were removed as these represent fixed differences from the underlying reference sequence. To measure genetic diversity, we used these counts to approximate Watterson's theta (Watterson, 1975) for each host-virus combination. We had to account for the fact that the make-up of the pools was not precisely the same as the samples that were tested by PCR. As such, we predicted the status of each of the untested individuals in the pools from the model discussed above and took the median and a 90% central credible interval of the number of additional positives (over those confirmed by PCR). We then used these numbers to give bounds on the approximation to Watterson's theta estimator, by looking at the value of the estimator at those three estimates of the number of infected individuals. The equation used was: where is the approximation to Watterson's theta, n is the number of variants, l is the length of the sequence and i is the number of PCR positives. This method makes two strong assumptions: 1) there is no mixed infection of viral variants in individuals (i.e. that one extra individual represents a single extra count for the harmonic partial sum in the denominator of Watterson's theta) and 2) all variants present are detectable. The impact of deviation from the first assumption is likely to be small. The marginal change in the partial sum in the denominator decreases with every extra count, so a few missed counts will result in little change to the resultant estimate. The second assumption is more influential, given the larger impact that a missing variant has on the generated number. Given this, we acknowledge that our presented estimates may be conservative.

Bumblebee Community Similarity
There were obvious differences in bumblebee community structure between our Scottish sampling sites. The locations we sampled in the south had B. terrestris, B. pascuorum and B. lucorum dominated communities, whereas those further north had Bombus jonellus and Bombus hortorum dominated communities ( Figure 1, Table 2). Even with its small sample size of 13 bees, the Pentlands, a range of hills in southern Scotland (Figure 1), appeared to represent a third type of community: the presence of Bombus monticola, otherwise only found in the highland sites, and an equivalent frequency of B. pascuorum and B. lucorum makes the community look like a blend of the other community types. This is potentially due to its higher elevation and habitat being similar to the north with large numbers of heathland plants while being situated in the south.

Prevalence
There were large differences in prevalence of the viruses, all of which are +ssRNA picornalike viruses, between sites ( Figure 2). When broken down to the specific host-location level, sample sizes for many species become small, so the uncertainty around the modal prevalences is correspondingly large. lucorum. The prevalence of SBPV was universally high.

Factors Influencing Infection
Viral prevalence was related to environmental covariates in some cases (Table 3; Figure 3).
Higher levels of precipitation had a high posterior probability of being associated with higher prevalences of River Luinaeg virus (posterior probability: 95%), There was some evidence that more precipitation, higher maximum temperatures (which were highly correlated with solar radiation) and higher wind speeds were associated with higher prevalences of Mayfield virus 1 (posterior probability: 92%, 94% and 93% respectively). For most covariates, however, the bulk of the posterior distribution lay close to zero and did not shift considerably from the prior indicating a lack of between site resolution.
We also tested for excess co-infection beyond random between viruses using multivariate probit models that allowed us to calculate the correlation in the error terms of the multivariate normal latent variable. This measures the degree to which, after accounting for the predictors, there is still shared error, as caused by unobserved factors affecting infection risk.
In this case, these measure the extent to which there is excess coinfection after accounting for the location of sampling, the species of which the bees belong and the various locationlevel environmental variables. Some viruses exhibited excess coinfection (Table 4). RLV and LMV showed strong and positive correlation (mean correlation: 0.73), consistent with the high levels of coinfection noted above; the error correlation between these two viruses was the only one where the bulk of the posterior was not close to zero. There was also some indication of a negative association between MV2 and SBPV, with the bulk of the posterior supporting a correlation of below zero (posterior probability: 93%), but the variance of the posterior was such that this cannot be stated with great certainty.

Diversity
Over homologous genomic regions within the RdRp gene, there were large differences between viruses in our approximation to Watterson's theta (see methods; Table 4). RLV, LMV, MV1, and MV2 all exhibited more diversity than SBPV and ABPV, with SPBV itself being considerably more diverse than ABPV. The same genotypes of MV1 and MV2 are observed in both 2009 when Dalwhinnie, the Ochils and Iona were sampled and 2011 when Edinburgh, Gorebridge and the Pentlands were sampled, implying that the variants present in an area are stable over short periods. Additionally, as would be expected, most variation was in 3 rd codon positions leading to either no amino-acid replacements or replacements with similarly charged amino acids, and thus unlikely to affect protein function.

Discussion
In this study, we explored the ecological factors influencing the genetic diversity and distribution of the viruses of wild bumblebees. We found that all the viruses detected only in bumblebees have considerably higher genetic diversity than the viruses shared with honeybees. Additionally, we found evidence of a positive association between River Luinaeg virus and Loch Morlich virus. Finally, there is some evidence for an effect of environmental variables on viral prevalence with the strongest evidence being that higher levels of precipitation increase the prevalence of River Luinaeg virus. Bumblebee-limited viruses necessarily undergo multiple independent bottlenecking events when the viruses can only survive in overwintering queens, potentially maintaining diversity through reduced selection efficiency. This would initially seem to apply to SBPV, as McMahon et al. (2015) and Manley et al. (2020) report that SBPV is often found at higher prevalences in bumblebees than in sympatric honeybees during in summer. However, over winter, bumblebee populations are reduced to individual queens while honeybee colonies are maintained at population sizes in the thousands. This can maintain a level of virus in the honeybees that can then spill over into the bumblebees in the spring, which may lead to honeybee-derived SBPV variants dominating even in bumblebees. This could represent a more general effect where the fitness landscapes of viruses infecting managed species are systematically different from those infecting wild species. The hypothesised mechanism is that in a genetically homogenous, densely-packed managed population, one optimal viral genotype could easily achieve dominance as the fitness landscape will be relatively constant.

Diversity
On the other hand, in more genetically heterogeneous wild populations with fluctuating population sizes, the fitness landscape will be less constant, and thus selection of variants on this changing fitness landscape may lead to the maintenance of more genotypes.

Factors Influencing Infection
We found evidence that the prevalence of River Luinaeg virus was positively associated with increased precipitation, though our ability to estimate the precise size of this effect was limited. The direction of this effect is contrary to our hypothesis that higher rainfall would decrease prevalence by reducing the contact rate between the host and virus through mechanical cleaning of contaminated floral structures and decreased bumblebee flight frequency. However, an alternative explanation consistent with the results is that high precipitation reduces foraging time and therefore condition, mediated through starvation.

Starvation increases the severity of bumblebee virus infections (Manley et al., 2017) and
has been shown to increase infection risk in mammals (reviewed in (França et al., 2009;Pedersen et al., 2002;Schaible & Kaufmann, 2007)). River Luinaeg virus is predominantly associated with Bombus jonellus, a bee species found at significantly higher frequencies in our sampling locations with higher precipitation, while the inclusion of our host random effect should account for this somewhat, it is possible that this is also driving the effect. We also found weak evidence of a negative effect of precipitation on MV1. Given that only nine sites were sampled in this study, we are limited in the between-site conclusions we can draw.
Despite using a regularising prior, the risk of erroneously identifying effects can never be fully excluded in studies with small sample sizes. Thus, given the importance of wild pollinators, a study with more sites would be valuable future work to more accurately assess the effect of the environment on viral infection risk.

Coinfection
River Luinaeg virus and Loch Morlich virus were rarely found separately in this study. They are distinct, not different segments of the same virus, as, while whole genomes are available for neither, both partial genomes include an RdRp sequence (Pascall et al., 2018). The mechanism of their transmission is unknown, but we assume that, as with most other reported bee viruses, transmission occurs at flowers.
One potential explanation for this strong association is that one of the viruses is a satellite of the other, as occurs in Chronic bee paralysis virus with Chronic bee paralysis virus satellite virus (Bailey et al., 1980). However, this seems unlikely as both virus species are observed separately, though the possibility of false negatives in the PCR reactions cannot be ruled out. Another possibility is that both viruses circulate in the population, but infection with one causes damage to the host in such a way that susceptibility to the second is dramatically increased, perhaps in a manner analogous to HIV's synergism with TB though immune suppression (Kwan & Ernst, 2011) or influenza virus' changing of the environment of the nasopharynx, allowing secondary bacterial invasion (Joseph et al., 2013). Viral coinfections are ubiquitously reported in prevalence studies in bees (Anderson & Gibbs, 1988;Bacandritsos et al., 2010;Blažytė-Čereškienė et al., 2016;Chen et al., 2004;Choe et al., 2012;Evans, 2001;Gajger et al., 2014;Manley et al., 2020;McMahon et al., 2015;Mouret et al., 2013;Nielsen et al., 2008;Roberts et al., 2017;Thu et al., 2016),

Conclusion
Here we describe the ecological and genetic characteristics of a set of viruses of bumblebees. Of the six viral species studied, genetic diversity is higher in the viruses we detected only in bumblebees. This could be explained if viruses in species managed for food production, such as honeybees, are less diverse than those in wild species, and outbreaks of these viruses in wild species are predominantly due to spillover. Further studies in this and other systems would be valuable to answer the question of whether there is a significant difference in diversity in viruses in managed species, those shared between managed and wild species, and those limited to wild species. Viral genetic diversity could be a factor in determining the risk of both disease emergence and spillover.
Supporting Table 1 The species-site breakdown of the counts of individual bumblebees used in the study.      Table 3 The posterior correlations of the errors of each virus from the multivariate probit model, measuring the degree of co-occurance. Positive numbers represent excess co-occurrence beyond that predicted by covariates and negative numbers represent a dearth of double infections. 90% shortest posterior intervals for each correlation are shown in brackets.