Abstract
While many viruses have a single natural host, host restriction can be incomplete, thereby leading to spillovers to other host species. However, such spillover risks are difficult to quantify. As climate change is rapidly transforming environments, it is becoming critical to quantify the potential for spillovers. To address this issue, we used an unbiased metagenomics approach and focused on two environments, soil and lake sediments, from Lake Hazen, the largest High Arctic freshwater lake in the world. We used DNA and RNA sequencing to reconstruct the lake’s virosphere and its range of eukaryotic hosts, and estimated the spillover risk by measuring the congruence between the viral and the eukaryotic host phylogenetic trees. We show that spillover risk is higher in lake sediments than in soil and increases with runoff from glacier melt, a proxy for climate change. Should climate change also shift the species ranges of potential viral vectors and reservoirs northwards, the High Arctic could become fertile ground for emerging pandemics.
1 Introduction
Viruses are ubiquitous and are often described as the most abundant replicators on Earth [1–3]. In spite of having highly diverse genomes, viruses are not independent “organisms” or replicators [4], as they need to infect a host’s cell in order to replicate. These virus/host relationships seem relatively stable within superkingdoms, and viruses can hence be classified as archaeal, bacterial (also known as bacteriophages), and eukaryotic viruses [5–7]. However, below this rank, viruses may infect a novel host from a reservoir host by being able to transmit sustainably in this new host, a process known as viral spillover [8, 9]. Indeed, in recent years, many viruses such as the Influenza A [10], Ebola [11], and SARS-CoV-2 [12] viruses spilled over to humans and caused significant disease. While these three viruses have non-human wild animal reservoirs as natural hosts, others have a broader host range, or their reservoir is more challenging to identify. For instance, iridoviruses are known to infect both invertebrates and vertebrates [13], and Picornavirales are found in vertebrates, insects, plants, and protists [2]. Such host restrictions (or alternatively, spillover risks) are to date poorly defined and hence difficult to assess without resorting to expert opinion [14].
Numerous factors can influence such a viral spillover risk. For instance, viral particles need to attach themselves to specific receptors on their host’s cell to invade it [15–17]. The conservation of those receptors across multiple species allows these hosts to be more predisposed to becoming infected by the same virus [17, 18]. Indeed, from an evolutionary standpoint, viruses are more prone to infecting hosts that are phylogenetically close to their natural host [15, 19], potentially because it is easier for them to infect and colonize species that are genetically similar [20]. Alternatively, but not exclusively, high mutation rates might explain why RNA viruses spill over more often than other viruses [15], as most lack proofreading mechanisms, making them more variable and likely to adapt to a new host [17].
While more studies are starting to characterize the communities and genomes of viruses in extreme environments [21–23], few, if any, describe their spillover risk. The High Arctic is of special interest as it is particularly affected by climate change, warming faster than the rest of the world [24–27]. Warming climate and rapid transitions of the environment increase the risks of spillover events by varying the global distributions and dynamics of viruses, and their reservoirs and vectors [28, 29], as shown for arboviruses [30] and the Hendra virus [31]. Furthermore, as the climate changes, the metabolic activity of the Arctic’s microbiosphere also shifts, which in turn affects numerous ecosystem processes such as the emergence of new pathogens [32]. It has now become critical to quantify the risk of these spillovers. An intuitive approach is to focus on the cophylogenetic relationships between viruses and their hosts [33–37]. Conceptually, if both viruses and their hosts cospeciate, the topologies of their respective phylogenetic trees should be identical or congruent. On the other hand, the occurrence of spillovers would result in incongruent virus/host phylogenies, so it can be postulated that measuring phylogenetic congruency can be used to assess spillover risk.
To test this hypothesis in the context of a changing High Arctic environment, we resorted to a combination of metagenomics and of cophylogenetic modelling by sampling, in an unbiased manner, both the virosphere and its range of hosts [3], focusing on eukaryotes, which are critically affected by viral spillovers [38]. We contrasted two local environments, lake sediments and soil samples of Lake Hazen, to test how viral spillover risk is affected by glacier runoff, and hence potentially by global warming, which is expected to increase runoff with increasing glacier melt at this specific lake [24, 25]. While microbial eukaryotes have been identified in Lake Hazen and other Arctic freshwater ecosystems [39–42], the Arctic multicellular macro-eukaryotes have yet to be sufficiently characterized. We show here that the risk of spillovers increases with warming climate, but is likely to remain low in the absence of “bridge vectors” and reservoirs.
2 Methods
(a) Data acquisition
An overview of the data acquisition and analytical pipeline is shown in figure S1. Between the 10th of May and the 10th of June, 2017, sediment and soil cores were collected from Lake Hazen (82°N, 71°W; Quttinirpaaq National Park, northern Ellesmere Island, Nunavut, Canada), the largest High Arctic lake by volume in the world, and the largest freshwater ecosystem in the High Arctic [25]. Sampling took place while the lake was still completely covered in ice (table S1), as previously described [24]. The sediment accumulation at the bottom of the lake is caused by both allochthonous and autochthonous processes. The former are characterised by meltwaters that flow between late June and the end of August, and run from the outlet glaciers along the northwestern shoreline through poorly consolidated river valleys, while the latter refer to the sedimentation process within the lake.
To contrast soil and sediment sites, core samples were paired, whenever possible, between these two environments. Soil samples were taken at three locations (figure S2; C-Soil, L-Soil, and H-Soil) in the dried streambeds of the tributaries, on the northern shore, upstream of the lake and its sediments. The corresponding paired lake sediment samples were also cored at three locations, separated into hydrological regimes by seasonal runoff volume: negligible, low, and high runoff (figure S2; C-Sed, L-Sed, and H-Sed). Specifically, the C (for Control) sites were both far from the direct influence of glacial inflows, while L sites were at a variable distance from Blister Creek, a small glacial inflow, and the H sites were located adjacent to several larger glacial inflows (Abbé River and Snow Goose). The water depth at L-Sed and H-Sed was respectively 50 m and 21 m, and the overlying water depth for site C-Sed was 50 m.
Before sample collection, all equipment was sterilised with 10% bleach and 90% ethanol, and non-powdered latex gloves were worn to minimise contamination. Three cores of ∼30 cm length were sampled at each location, and the top 5 and 10 cm of each sediment and soil core, respectively, were then collected and homogenized for genetic analysis. DNA was extracted from each core using the DNeasy PowerSoil Pro Kit, and RNA with the RNeasy PowerSoil Total RNA Kit (MO BIO Laboratories Inc, Carlsbad, CA, USA), following the kit guidelines, except that the elution volume was 30 μL. DNA and RNA were thereby extracted three times per sampling site, and elution volumes were combined for a total volume of 90 μL instead of 100 μL.
To sequence both DNA and RNA, a total of 12 metagenomic libraries were prepared (n = 6 for DNA, n = 6 for RNA), two for each sampling site, and run on an Illumina HiSeq 2500 platform (Illumina, San Diego, CA, USA) at Génome Québec, using Illumina’s TruSeq LT adapters (forward: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC, and backward: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT) in a paired-end 125 bp configuration. Each library was replicated (n = 2 for DNA, n = 3 for RNA) for each sample. Further details, such as DNA and RNA yields following extractions, can be found in Colby et al. [24].
(b) Data preprocessing and taxonomic assignments
A first quality assessment of the raw sequencing data was made using FastQC v0.11.8 [43]. Trimmomatic v0.36 [44] was then employed to trim adapters and low-quality reads and bases using the following parameters: phred33, ILLUMINACLIP:adapters/TruSeq3-PE-2.fa:3:26:10, LEADING:3, TRAILING:3, SLIDINGWINDOW:4:20, CROP:105, HEADCROP:15, AVGQUAL:20, MINLEN:36. A second round of quality checks was performed with FastQC to ensure that Illumina’s adapter sequences and unpaired reads were properly removed. Read assembly into contigs was done de novo with both SPAdes v3.13.1 [45] and metaSPAdes v3.13.1 [46] for DNA, and with Trinity v2.9.0 [47], rnaSPAdes v3.13.1 [48], and metaSPAdes for RNA. The choice of an assembly tool was based on (i) the number of contigs generated, (ii) the taxonomic annotations, (iii) the time of assembly, and (iv) the contig lengths (see electronic supplementary material). In all cases, the pipelines were used with their default settings.
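As an illustration of how the trimming parameters listed above fit together, the following sketch assembles them into a Trimmomatic paired-end command line. The jar location, the input and output file names, and the helper name `trimmomatic_cmd` are hypothetical; the order of the trimming steps simply follows the order in which they are listed in the text, which is an assumption about how they were actually run.

```python
def trimmomatic_cmd(r1, r2, prefix, jar="trimmomatic-0.36.jar"):
    """Build a Trimmomatic paired-end (PE) command as an argument list.

    r1, r2 : paths to forward and reverse FASTQ files (hypothetical names).
    prefix : prefix used to name the four output files (hypothetical scheme).
    """
    # PE mode writes paired and unpaired reads for each mate, in this order.
    outputs = [
        f"{prefix}_R1.paired.fastq.gz",
        f"{prefix}_R1.unpaired.fastq.gz",
        f"{prefix}_R2.paired.fastq.gz",
        f"{prefix}_R2.unpaired.fastq.gz",
    ]
    # Trimming steps, copied from the parameters reported in the Methods.
    steps = [
        "ILLUMINACLIP:adapters/TruSeq3-PE-2.fa:3:26:10",
        "LEADING:3",
        "TRAILING:3",
        "SLIDINGWINDOW:4:20",
        "CROP:105",
        "HEADCROP:15",
        "AVGQUAL:20",
        "MINLEN:36",
    ]
    return ["java", "-jar", jar, "PE", "-phred33", r1, r2, *outputs, *steps]


if __name__ == "__main__":
    # Print the command; it could be passed to subprocess.run() to execute it.
    print(" ".join(trimmomatic_cmd("sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample")))
```

Returning a list rather than a single string keeps the command safe to hand to `subprocess.run` without shell quoting.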
Once assembled, a high-level (superkingdom) taxonomic assignment was determined based on BLASTn v2.10.0 [49] searches. Those were performed at a stringent 10⁻¹⁹ E-value threshold against the partially non-redundant nucleotide (nr/nt) database from NCBI v5 [50] (ftp.ncbi.nlm.nih.gov/blast/db/nt*tar.gz; downloaded on June 17, 2020). We chose this threshold to increase the significance of our hits, as our preliminary results showed less ambiguity with smaller E-values, starting at a 10⁻¹⁹ cut-off. The proportions of taxonomic annotations (“Archaea,” “Bacteria,” “Eukaryota,” or “Viruses”) were calculated, and a 95% consensus was taken to assign a superkingdom rank for each contig. When no such 95% consensus could be determined, the contigs were classified as “Other.”
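The consensus rule described above can be sketched as follows. This is a minimal illustration rather than the code used in this study: the function name is hypothetical, the input is assumed to be the list of superkingdom labels of all hits retained for one contig, and we assume that “95% consensus” means that at least 95% of those hits share the same label.

```python
from collections import Counter


def assign_superkingdom(hit_labels, threshold=0.95):
    """Return the consensus superkingdom of one contig, or "Other".

    hit_labels : superkingdom labels of the BLASTn hits for the contig,
                 e.g. ["Bacteria", "Bacteria", "Eukaryota"].
    threshold  : minimum fraction of hits that must agree (assumed inclusive).
    """
    if not hit_labels:
        # Assumption: a contig without any hit cannot reach a consensus.
        return "Other"
    label, count = Counter(hit_labels).most_common(1)[0]
    return label if count / len(hit_labels) >= threshold else "Other"


if __name__ == "__main__":
    print(assign_superkingdom(["Bacteria"] * 96 + ["Eukaryota"] * 4))
    print(assign_superkingdom(["Bacteria"] * 9 + ["Viruses"]))
```

In this toy example, the first contig reaches a 96% agreement and is labelled “Bacteria,” while the second only reaches 90% and falls into “Other.”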
To refine the taxonomic assignment of “viruses,” GenBank’s viral nucleotide sequences v238.0 [51] were retrieved (ftp.ncbi.nlm.nih.gov/genbank/gbvrl*seq.gz; downloaded on 23rd of July, 2020), concatenated, converted into FASTA with seqret v6.6.0 [52], and used to create a local database for BLASTn alignments. For each sampling location, after combining the DNA and RNA contigs classified as viral in the previous step, BLASTn searches were again conducted at the same stringent 10⁻¹⁹ E-value threshold, and the accession numbers of all the High-scoring Segment Pairs (HSPs) were used to retrieve their corresponding taxonomy identifiers (IDs) and their full taxonomic lineages with the R package taxonomizr v0.5.3 [53]. The viral contigs were also mapped with Bowtie2 v2.3.5.1 [54], using default settings, to compare BLASTn and Bowtie2 efficiencies in refining these taxonomic annotations. As searches were found to be more sensitive with BLASTn than with Bowtie2 (see electronic supplementary material), only BLASTn results are shown hereafter, as our goal was to find as many similar sequences as possible in more than one species to eventually infer the virosphere from the virome. Eukaryotic contigs were processed as above, based on the nr/nt database. To increase specificity, considering that > 100 hits were found per contig, results were filtered by keeping a maximum of 12 HSPs with an E-value < 10⁻¹⁰⁰ per contig, for which lineages were obtained.
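The filtering of eukaryotic hits can be pictured with the short sketch below. It assumes that HSPs are available as (contig, accession, E-value) tuples, for instance parsed from tabular BLASTn output, and that when more than 12 HSPs pass the cut-off, those with the smallest E-values are the ones retained; the text does not specify the tie-breaking rule, so this ordering and the function name are assumptions.

```python
def filter_hsps(hsps, max_evalue=1e-100, max_per_contig=12):
    """Keep at most `max_per_contig` HSPs per contig with E-value < max_evalue.

    hsps : iterable of (contig_id, accession, evalue) tuples.
    Returns a dict mapping each contig to its retained (accession, evalue) pairs.
    """
    kept = {}
    # Visit HSPs from the most to the least significant (assumed ordering).
    for contig, accession, evalue in sorted(hsps, key=lambda h: h[2]):
        if evalue >= max_evalue:
            continue  # not significant enough
        retained = kept.setdefault(contig, [])
        if len(retained) < max_per_contig:
            retained.append((accession, evalue))
    return kept


if __name__ == "__main__":
    # Toy data: 15 strong hits for contig "c1", one weak hit for contig "c2".
    toy = [("c1", f"acc{i}", float(f"1e-{101 + i}")) for i in range(15)]
    toy.append(("c2", "accX", 1e-50))
    result = filter_hsps(toy)
    print({contig: len(hits) for contig, hits in result.items()})
```

Contigs without any HSP below the cut-off are simply absent from the returned dictionary.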
All samples were filtered to remove non-eukaryotic and uncultured hosts as well as viral and eukaryotic sequences with no taxonomy information. The ViralZone [55] and International Committee on Taxonomy of Viruses (ICTV) [56] databases were consulted to obtain host range information on each viral family. These taxonomic assignments were then used to retrieve their phylogenetic placements according to the Tree of Life (ToL) (tolweb.org), hence generating two trees: one for known viruses and one for known eukaryotes. For this, we used the classification and class2tree functions from the R package taxize v0.9.99 [57, 58]. In each environment, vertices of the viral and eukaryotic trees were then related to each other according to the Virus-Host DB (downloaded on the 29th of September, 2020) [59]. These relations were saved in a binary association matrix (0: no infection; 1: infection), one for each environment. To simplify downstream computations without losing any information, only eukaryotic hosts associated with at least one virus were kept in the non-viral tree.
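The construction of the binary association matrix can be illustrated as follows. This is a sketch with made-up taxon names: the orientation of the matrix (hosts as rows, viruses as columns) and the function name are assumptions, and the list of known virus/host pairs stands in for the relations retrieved from the Virus-Host DB.

```python
def association_matrix(viruses, hosts, known_pairs):
    """Build a host-by-virus 0/1 matrix and drop hosts with no known virus.

    viruses, hosts : lists of taxon names found in one environment.
    known_pairs    : iterable of (virus, host) pairs with a recorded infection.
    Returns (retained_hosts, matrix), with one matrix row per retained host.
    """
    pairs = set(known_pairs)
    matrix = [[1 if (v, h) in pairs else 0 for v in viruses] for h in hosts]
    # Keep only hosts associated with at least one virus.
    keep = [i for i, row in enumerate(matrix) if any(row)]
    return [hosts[i] for i in keep], [matrix[i] for i in keep]


if __name__ == "__main__":
    viruses = ["virus_A", "virus_B"]           # hypothetical names
    hosts = ["host_X", "host_Y", "host_Z"]     # hypothetical names
    pairs = [("virus_A", "host_X"), ("virus_B", "host_X"), ("virus_B", "host_Z")]
    kept_hosts, m = association_matrix(viruses, hosts, pairs)
    for name, row in zip(kept_hosts, m):
        print(name, row)
```

Here `host_Y` has no recorded virus and is therefore removed, mirroring the pruning of the eukaryotic tree described above.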
(c) Spillover quantification
To quantify viral spillovers based on the viral and eukaryotic hosts identified, we employed the Random Tanglegram Partitions algorithm (Random TaPas) [60]. This algorithm computes the cophylogenetic signal or congruence between two phylogenetic trees, the viral and the host trees, with the normalised Gini coefficient (G⋆). When congruence is large, or “perfect,” the two trees are identical and hence, there is a strong cophylogenetic signal and an absence of spillover. On the other hand, weak congruence is evidence for the existence of spillovers. Random TaPas quantifies congruence in two ways: a geodesic distance (GD) [61], or a Procrustes distance (Procrustes Approach to Cophylogeny: PACo) [62], the latter measuring the distance between two trees geometrically transformed to make them as identical as possible. To partially account for phylogenetic non-independence when measuring congruence, Random TaPas further implements a resampling scheme where N = 10⁴ subtrees of about 20% of the total number of virus/host links are randomly selected. This selection is used to generate a distribution of the empirical frequency of each association, measured by either GD or PACo.
Each empirical frequency is then regressed against a uniform distribution, and the residuals are used in two ways: (i) to quantify co-speciation, which is inversely proportional to spillover risk; and (ii) to identify those virus/host pairs that contributed the least to the cophylogenetic signal, i.e., the most to spillover risk. This risk is finally quantified by the shape of the distribution of residuals (for GD or PACo), with G⋆, which takes its values between 0 (perfect congruence, no spillover) and 1 (maximal spillover risk), with a defined threshold of 2/3 indicating a “large” value of G⋆ or large incongruence. To account for phylogenetic uncertainty, the process is repeated n = 1,000 times, each replicate being a random resolution of the multifurcating virus/host trees of life into a fully bifurcating tree.
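To give some intuition for how a Gini coefficient summarises the shape of a distribution, the sketch below computes the conventional Gini coefficient of a set of non-negative values. It is not the normalised G⋆ implemented in Random TaPas, which is designed to accommodate the negative residuals described above; the function name and the toy inputs are illustrative assumptions only.

```python
def gini(values):
    """Conventional Gini coefficient of non-negative values.

    Returns 0 when all values are equal and (n - 1) / n when a single
    value carries the whole total. This is NOT the normalised G* of
    Random TaPas, which also handles negative residuals.
    """
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a non-zero total")
    if xs[0] < 0:
        raise ValueError("the conventional Gini assumes non-negative values")
    # Rank-weighted sum over the sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


if __name__ == "__main__":
    even = gini([1, 1, 1, 1])      # perfectly even distribution
    skewed = gini([0, 0, 0, 1])    # one value carries everything
    print(even, skewed, skewed > 2 / 3)
```

In this toy case, the evenly distributed values yield a coefficient of 0, whereas the highly uneven ones exceed the 2/3 threshold used above to flag a “large” value.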
3 Results and Discussion
(a) Plant and fungal viruses are overrepresented
Based on our most sensitive annotation pipeline (see electronic supplementary material), viruses represented less than 1% of all contigs, and our samples were dominated by bacteria, with low proportions of eukaryotes (proportions of bacterial and eukaryotic contigs being respectively > 89.2% and < 6.4%, in 11 out of 12 samples) (see electronic supplementary material). These results could be due to our extraction process, which might have been biased towards microbial nucleic acids. For instance, an overrepresentation of bacteria was also found in a shotgun-metagenomics based study that also used soil extraction kits [63]. To assess the impact of this potential bias, the extraction process should be taken into consideration by future studies.
RNA viral contigs of all kinds (i.e., dsRNA, +ssRNA, and -ssRNA viruses) were found to be significantly more abundant than DNA viral contigs in all samples, as 70.5% to 87.9% of viral families had an RNA genome (binomial tests, P < 2.48 × 10⁻⁷; figure 1, table 1). This dominance of RNA viruses is not unexpected, as fungal biomass, for instance, surpasses that of bacteria in Arctic environments by one to two orders of magnitude [64], and eukaryotes are known to be the main targets of RNA viruses [2,5–7].
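For readers unfamiliar with the binomial test, the following sketch shows how an exact two-sided test can be computed. It assumes a null proportion of 0.5 (equal representation of RNA and DNA genomes), which is not stated explicitly above, and the function name is hypothetical; the two-sided P-value is obtained by summing the probabilities of all outcomes that are no more likely than the observed one.

```python
from math import comb


def binom_test_two_sided(k, n, p=0.5):
    """Exact two-sided binomial test.

    k : number of "successes" (e.g. counts attributed to RNA viruses).
    n : total number of trials.
    p : success probability under the null hypothesis (assumed 0.5 here).
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Sum outcomes at most as probable as the observed one; the small
    # relative tolerance guards against floating-point ties.
    p_value = sum(q for q in pmf if q <= observed * (1 + 1e-7))
    return min(1.0, p_value)


if __name__ == "__main__":
    print(binom_test_two_sided(0, 10))   # extreme outcome, small P-value
    print(binom_test_two_sided(5, 10))   # outcome at the null expectation
```

With 10 trials, observing no success at all gives P = 2/1024, while observing exactly 5 successes gives a P-value of 1.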
Figure 1. Abundance count of the viral families. (a) C-Soil; (b) L-Soil; (c) H-Soil; (d) C-Sed; (e) L-Sed; and (f) H-Sed sites. Abundances were log10-transformed. Viruses with a missing family were excluded from this analysis. The data used for this figure can be found in table 1.
Table 1. Abundance of the viral families of the viral HSPs. The host range information was obtained from the ViralZone and International Committee on Taxonomy of Viruses (ICTV) databases. Viruses with no or unknown family were excluded from this table.
Our results are, however, difficult to compare with previous studies in the High Arctic, as most were based solely on DNA metagenomic sequencing [22, 65, 66], probably because RNA viruses are thought to be unstable [23], or because sampling strategies were inadequate for extracting RNA viruses [67]. Two studies have been able to recover RNA viruses. One had not intended to characterise the RNA viral community, and found sequences related to ssRNA viruses only incidentally [68]. The other also identified RNA and DNA viruses from RNA-seq, but its environments were slightly different: although they included a freshwater lake, more abundant in ssDNA phages, the Baltic Sea contains varying levels of salinity [69], unlike Lake Hazen. Nonetheless, our results and those of this previous study [69] both show that it is possible to recover RNA viruses from RNA-seq metagenomics.
Across all viral genome types and in all samples, viral families known to infect plants and/or fungi were overrepresented compared to those infecting animals and protists, as proportions of the former ranged from 69.8% to 87.1% (binomial tests, P < 2.48 × 10⁻⁷; table 1). This overrepresentation might reflect a preservation bias, due to the constitutive defences found in plants and fungi offered by their waxy epidermal cuticles and cell walls [70], even if most plant viruses lack a protective lipoprotein envelope as found in animal viruses [71]. But irrespective of such a preservation bias, this imbalance could imply a high spillover potential among plants and fungi in the High Arctic for two reasons. First, RNA viruses are the most likely pathogens to switch hosts, due to their high rates of evolution [15, 72]. Second, plant biomass has been increasing over the past two decades in the High Arctic due to regional warming [73], and is likely to keep doing so as warming continues.
(b) Spillover risk increases with glacier runoff
Given these viral and eukaryotic host representations, can spillover risk be assessed in these environments? To address this question, we resorted to the novel global-fit model Random TaPas, which computes the congruence between the virus and the eukaryotic host trees, with strong and weak congruence indicating low and high spillover risk, respectively. The stability of its results was assessed by running this algorithm three times, and by combining the results for the normalised Gini coefficients (G⋆ ∈ [0, 1]), a direct measure of spillover risk (see Methods).
When the runoff volume was negligible (the C sites; figure 2a), the median G⋆ ranged between 0.675 and 0.725, thus exceeding the 2/3 threshold, and spillover risk was significantly higher in soil than in lake sediments for both GD and PACo (Dunn test, Benjamini-Hochberg [BH] correction, P < 0.001). However, in the presence of a low runoff volume (the L sites), spillover risk was higher in lake sediments than in soil for GD, but lower for PACo, with G⋆ ∈ [0.70, 0.75] (Dunn test, BH correction, P < 0.001; figure 2b). Finally, in the high runoff regime (the H sites), for both GD and PACo, spillover risk was higher in lake sediments than in soil, with values of G⋆ > 0.75 (Dunn test, BH correction, P < 0.001; figure 2c). Altogether, these results show that as runoff volume increases from almost non-existent to high, spillover risk increases and shifts from being higher in soil to being higher in lake sediments.
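The Benjamini-Hochberg (BH) correction applied to the pairwise Dunn tests can be sketched as below. Only the multiple-testing adjustment is shown, not the Dunn test itself; the function name and the example P-values are hypothetical and do not correspond to the values obtained in this study.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted P-values, in the same order as the input.

    Each raw P-value is multiplied by m / rank (m tests, rank of the
    P-value in ascending order), and monotonicity is enforced by taking
    a running minimum from the largest P-value downwards.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw P-values
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"raw = {p:.3f}  adjusted = {q:.3f}")
```

A comparison is then declared significant when its adjusted P-value falls below the chosen α level.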
Figure 2. Normalised Gini coefficients (G⋆) obtained with Random TaPas (n = 3 runs). The values are separated by runoff volume: (a) control; (b) low runoff; and (c) high runoff. The two global-fit models used were GD (geodesic distances in tree space) and PACo (Procrustes Approach to Cophylogeny). Significant results (Dunn test, BH correction) are marked with letters from a to j (α = 0.05). Blue represents the soil and yellow, the lake sediments.
This pattern is consistent with the predictions of the Coevolution Effect hypothesis [74], and provides us with a mechanism explaining the observed increase in spillover risk with runoff. Lake Hazen was recently found to have undergone a dramatic change in sedimentation rates since 2007 compared to the previous 300 years: an increase in glacial runoff drives sediment delivery to the lake, leading to increased turbidity that perturbs anoxic bottom water known from the historical record [25]. Not only this, but turbidity also varies within the water column throughout the season [75], hence fragmenting the lake habitat every year, and more so since 2007. This fragmentation of the aquatic habitat creates conditions that are, under the Coevolution Effect, favourable to spillover. Fragmentation creates barriers to gene flow, which increases genetic drift within finite populations and accelerates the coevolution of viruses and their hosts. This acceleration leads to viral diversification which, should it be combined with “bridge vectors” (such as mosquitoes in terrestrial systems) and/or invasive reservoir species, increases spillover risk [74]. Lake sediments are environmental archives: over time, they can preserve genetic material from aquatic organisms but also, and probably to a lesser extent, genetic material from the lake’s drainage basin. The coevolutionary signal detected in lake sediments reflects interactions that may have happened in the fragmented aquatic habitat but also elsewhere in the drainage basin. Regardless of where the interaction occurred, our results show that spillover risk increases with runoff, a proxy of climate warming (figure 2).
To our knowledge, this is the first attempt to assess the complete virosphere of both DNA and RNA viruses, and their spillover capacity in any given environment, leading us to show that increased glacier runoff, a direct consequence of climate change, is expected to increase viral spillover risk in Lake Hazen. However, as this is the first study applying the Random TaPas algorithm, we do not yet have any comparators with which to gauge the efficacy of G⋆ in assessing spillover capacity, both qualitatively and quantitatively. Additional studies including more runs of the algorithm and multiple environmental settings of the High Arctic would be necessary to further reinforce our results, and to calibrate the “true” risk of viral spillovers.
(c) Spillovers might already be happening
To go one step further and identify the viruses most at risk of spillover, we focused on the model predictions made by Random TaPas. Under the null model, the occurrence of each virus/host association is evenly distributed on their cophylogeny (when sub-cophylogenies are drawn randomly, from a uniform law). Departures from an even distribution are measured by the residuals of the linear fit. Positive residuals indicate a more frequent association than expected, that is pairs of host/virus species that contribute the most to the cophylogenetic signal. On the other hand, negative residuals indicate a less frequent association than expected, and hence pairs of host/virus species that contribute little to the cophylogenetic signal, because they tend to create incongruent phylogenies, a signature of spillover risk.
With increasing runoff, the magnitude of the largest residuals tended either to decrease (Soil; figure 3a) or to stay the same (Sediment; figure 3b). This means that the strength of the cophylogenetic signal may remain steady, or may even weaken, as runoff increases. On the other hand, the magnitude of the most negative residuals either remained globally unchanged (Soil; figure 3a, S6a), or tended to become more negative (Sediment; figure 3b, S6b). This latter pattern indicates that as runoff increases, the strength of the cophylogenetic signal deteriorates, potentially implying a higher spillover risk in lake sediments.
Figure 3. Largest and smallest residuals per sampling site for (a) soil and (b) lake sediment samples. Residuals were computed by Random TaPas (n = 3 runs) using GD (geodesic distances in tree space). Significant results (Dunn test, BH correction) are marked with an asterisk (*) (α = 0.05). Red represents the largest and blue, the smallest residuals. Figure S6 further shows these results to be robust to the distance used to compare trees.
With this, Random TaPas can identify the viruses driving the spillover signal. For both GD (figure 4) and PACo (figure S7), the 5 most negative residuals of each sample (n = 60) suggest that viruses are most likely to spill over into fungi (n = 19), plants (n = 16), and protists (n = 15; including different species of microalgae), the other 10 species being mostly insects (animals: n = 8; oomycetes: n = 2).
Figure 4. Distribution of the residuals computed by Random TaPas (n = 1 run) using GD (geodesic distances in tree space). (a) C-Soil; (b) L-Soil; (c) H-Soil; (d) C-Sed; (e) L-Sed; and (f) H-Sed sites. Blue residuals represent the soil, and yellow the lake sediments.
Altogether, we provide here a novel and unbiased approach to assessing spillover risk. This is not the same as predicting spillovers or even pandemics, because as long as “bridge vectors” and/or invasive reservoir species [74] are not present in the environment, the likelihood of dramatic events probably remains low. But as climate change leads to shifts in species ranges and distributions, new interactions can emerge [76], bringing in vectors that can mediate viral spillovers [77]. This twofold effect of climate change, both increasing spillover risk and leading to a northward shift in species ranges [78], could have a dramatic effect in the High Arctic. Disentangling this risk from actual spillovers and pandemics will be a critical endeavour to pursue in parallel with surveillance activities, in order to mitigate the impact of spillovers on the economy and health-related aspects of human life, or on other species [9].
Data accessibility
The raw data used in this study can be found at www.ncbi.nlm.nih.gov/bioproject/556841 (DNA-Seq) and at www.ncbi.nlm.nih.gov/bioproject/PRJNA746497/ (RNA-Seq). The code developed for this work is available from github.com/sarisbro/data.
Authors’ contributions
S.A.B. and A.J.P. designed research; G.A.C. collected and processed the samples; A.L. performed all analyses and wrote the original draft; A.L. and S.A.B. wrote the manuscript with contributions and suggestions from G.A.C. and A.J.P.; and S.A.B. and A.J.P. supervised this study and acquired funding.
Competing interests
We declare we have no competing interests.
Funding
This work was supported by the Natural Sciences and Engineering Research Council of Canada and by the University of Ottawa.
Acknowledgements
We thank Frances Pick for her helpful comments on an early version of this work.