Abstract
The discovery of highly divergent RNA viruses is compromised by their limited sequence similarity to known viruses. Evolutionary information obtained from protein structural modelling offers a powerful approach to detect distantly related viruses based on the conservation of tertiary structures in key proteins such as the viral RNA-dependent RNA polymerase (RdRp). We utilised a template-based approach for protein structure prediction from amino acid sequences to identify distant evolutionary relationships among viruses detected in meta-transcriptomic sequencing data from Australian wildlife. The best predicted protein structural model was compared with the results of similarity searches against protein databases based on amino acid sequence data. Using this combination of meta-transcriptomics and protein structure prediction we identified the RdRp (PB1) gene segment of a divergent negative-sense RNA virus in a native Australian gecko (Gehyra lauta) that was confirmed by PCR and Sanger sequencing. Phylogenetic analysis identified the Gecko articulavirus (GECV) as a newly described genus within the family Amnoonviridae, order Articulavirales, that is most closely related to the fish virus Tilapia tilapinevirus (TiLV). These findings provide important insights into the evolution of negative-sense RNA viruses and structural conservation of the viral replicase among members of the order Articulavirales.
Introduction
The development of next-generation sequencing (NGS) technologies, including total RNA sequencing (meta-transcriptomics), has revolutionized studies of virome diversity and evolution [1–3]. Despite this, the discovery of highly divergent viruses remains challenging because of the often limited (or no) primary sequence similarity between putative novel viruses and those for which genome sequences are already available [4–6]. For example, it is possible that the small number of families of RNA viruses found in bacteria, as well as their effective absence in archaea, in reality reflects the difficulties in detecting highly divergent sequences rather than their true absence from these taxa [3].
The conservation of protein structures in evolution and the limited number of protein folds (fold space) in nature form the basis of template-based protein structure prediction [7], providing a powerful way to reveal the origins and evolutionary history of viruses [8,9]. Indeed, the utility of protein structural similarity in revealing key aspects of virus evolution is well known [9,10]. For instance, double-stranded (ds) DNA viruses including the thermophilic archaeal virus STIV, enterobacteria phage PRD1, and human adenovirus exhibit conserved viral capsids, suggesting a deep common ancestry [11]. Thus, protein structure prediction utilising comparisons to solved protein structures can assist in the identification of potentially novel viruses [7,12]. Herein, we use this method as an alternative approach to virus discovery.
There is a growing availability of three-dimensional structural data in curated databases such as the Protein Data Bank (PDB), with approximately 11,000 solved viral protein structures that can be used in comparative studies. Importantly, these include structures of the RNA-dependent RNA polymerase (RdRp), which exhibits the highest level of sequence similarity among RNA viruses, including a number of key conserved motifs, and hence is expected to contain relatively well conserved protein structures. Exploiting such structural features in combination with metagenomic data will undoubtedly improve our ability to detect divergent viruses in nature, particularly in combination with wildlife surveillance [2,4,13].
The International Committee on Taxonomy of Viruses (ICTV) recently introduced the Amnoonviridae as a newly recognized family of negative-strand RNA viruses present in fish (ICTV Master Species List 2018b.v2). Together with the Orthomyxoviridae, the Amnoonviridae are classified in the order Articulavirales, describing a set of negative-sense RNA viruses with segmented genomes. While the Orthomyxoviridae includes seven genera, four of these comprise influenza viruses (FLUV), and to date the family Amnoonviridae comprises a single genus, Tilapinevirus, which in turn includes only a single species, Tilapia tilapinevirus or Tilapia Lake virus (TiLV).
TiLV was originally identified in farmed tilapine populations (Oreochromis niloticus) in Israel and Ecuador [14]. The virus has now been described in wild and hybrid tilapia across several countries in the Americas, Africa, Asia, and Southeast Asia [15–17]. TiLV has been associated with high morbidity and mortality in infected animals. Pathological manifestations include syncytial hepatitis, skin erosion and encephalitis [15,18]. TiLV was initially classified as a putative orthomyxo-like virus based on weak sequence resemblance (~17% amino acid identity) in the PB1 segment that contains the RdRp, as well as the presence of conserved 5′ and 3′ termini [14]. While both the Orthomyxoviridae and Amnoonviridae have negative-sense, segmented genomes, the genomic organization of the Amnoonviridae comprises 10 instead of 7-8 segments [14,18,19], and their genomes are shorter (~10 kb) than those of the Orthomyxoviridae (~12-15 kb). To date, however, only the RdRp (encoded by a 1641 bp PB1 sequence) has been reliably defined, and most segments encode proteins of unknown function. Importantly, comparisons of TiLV RdRp with sequences from members of the Orthomyxoviridae revealed the presence of four conserved amino acid motifs (I-IV) of 4-9 amino acid residues each [14] that effectively comprise a “molecular fingerprint” for the order.
Unlike other members of the Articulavirales [20], TiLV appears to have a limited host range and has only been documented in tilapia (O. niloticus, O. sp.) and hybrid tilapia (O. niloticus x O. aureus). Herein, we report the discovery of a divergent virus from an Australian gecko (Gehyra lauta) using a combination of meta-transcriptomic and structure-based approaches, and employ a phylogenetic approach to reveal its relationship to TiLV. Our work suggests that this gecko virus likely represents a novel genus within the Amnoonviridae.
Materials and Methods
Sample collection
A total of seven individuals corresponding to the reptile species Carlia amax, Carlia gracilis, Carlia munda, Gehyra lauta, Gehyra nana, Heteronotia binoei, and Heteronotia planiceps were collected alive in 2013 from Queensland, Australia. Specimens were identified by mtDNA typing and/or morphological data. Livers were harvested and stored in RNAlater at −80°C before downstream processing. All sampling was conducted in accordance with animal ethics approval (#A2012/14) from the Australian National University and collection permits from the Parks and Wildlife Commission of the Northern Territory (#45090), the Australian Government (#AU-COM2013-192), and the Department of Environment and Conservation (#SF009270).
Sample processing and sequencing
RNA extraction was performed using the RNeasy Plus minikit (Qiagen) following the manufacturer’s instructions. RNA was extracted from each of the seven livers individually and then pooled in equal amounts. For RNA sequencing, ribosomal RNA (rRNA) was depleted using the RiboZero (epidemiology) depletion kit and libraries were prepared with the TruSeq stranded RNA library prep kit before sequencing on an Illumina HiSeq 2500 platform (100 bp paired end reads). Library preparation and sequencing were performed by the Australian Genome Research Facility (AGRF), generating a total of 22,394,787 paired end reads for the pooled liver RNA library.
De novo assembly and sequence annotation
Raw Illumina reads were trimmed of sequencing adapters and low-quality bases with Trimmomatic v0.38 [21]. The trimmed reads were then de novo assembled into contigs (transcripts) using Trinity v2.8.6 [22]. Contig abundance was estimated with RSEM [23] and expressed as the number of transcripts per million (TPM). For sequence annotation, contigs were compared against the NCBI nucleotide (nt) and non-redundant protein (nr) databases using BLASTn [24] and DIAMOND [25], respectively.
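For readers unfamiliar with the TPM unit, the following minimal sketch shows how it is conventionally defined from per-contig read counts and effective lengths. It illustrates the standard definition only and is not the RSEM implementation; the function name `tpm` and the toy counts and lengths are assumptions made for illustration.

```python
def tpm(counts, effective_lengths):
    """Convert per-contig read counts to transcripts per million (TPM).

    counts            -- estimated number of reads assigned to each contig
    effective_lengths -- effective length (in nt) of each contig
    """
    if len(counts) != len(effective_lengths):
        raise ValueError("counts and effective_lengths must have equal length")
    # Reads per nucleotide for each contig
    rates = [c / length for c, length in zip(counts, effective_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Rescale so that the values sum to one million
    return [rate / total * 1e6 for rate in rates]


# Toy example with invented numbers (not data from this study)
example = tpm([100, 200, 50], [1000.0, 4000.0, 500.0])
```

Because TPM values are normalised to sum to one million within a library, they describe relative abundance within the pooled sample rather than absolute copy number.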
Protein structure prediction for virus detection
To further screen the meta-transcriptomic data, all assembled contigs without a significant database hit (e-value ≥ 10⁻⁵) were designated “orphan” contigs (n = 293,586). These were then analysed using a protein structure-informed approach. Specifically, orphan contigs were translated in all six reading frames using the getorf program [26] to identify continuous open reading frames (ORFs) of at least 1000 nt in length between two stop codons (n = 57). To detect distant sequence homologies and predict viral protein structures, this subset of translated ORFs was then analysed using a template-based modelling approach as implemented in Phyre2 (http://www.sbg.bio.ic.ac.uk/phyre2) [27]. In brief, target proteins were compared against proteins of known structure via homology modelling and fold recognition, followed by loop modelling and sidechain fitting [27]. Confident matches (confidence >90%) to known viral structures were selected for downstream analyses. Annotations from the predicted model were used as preliminary data for tentative taxonomic assignment and protein classification.
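The ORF screening step can be illustrated with a short sketch. This is not a reimplementation of getorf and makes several assumptions: the standard genetic code, ORFs defined as stop-free stretches delimited by stop codons or contig ends, and hypothetical function names (`stop_to_stop_orfs`, `translate`, `reverse_complement`).

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def translate(orf_nt):
    """Translate a nucleotide ORF; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(orf_nt[i:i + 3], "X")
                   for i in range(0, len(orf_nt) - 2, 3))


def stop_to_stop_orfs(seq, min_nt=1000):
    """Find stop-free stretches of at least min_nt in all six frames.

    Returns tuples (strand, frame, start, end, orf_nt); start and end are
    0-based, end-exclusive coordinates on the scanned strand.
    """
    seq = seq.upper().replace("U", "T")
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = frame
            # Position just past the last complete codon in this frame
            last = frame + ((len(s) - frame) // 3) * 3
            for pos in range(frame, last, 3):
                if s[pos:pos + 3] in STOP_CODONS:
                    if pos - start >= min_nt:
                        results.append((strand, frame, start, pos, s[start:pos]))
                    start = pos + 3
            # Stretch running to the end of the contig
            if last - start >= min_nt:
                results.append((strand, frame, start, last, s[start:last]))
    return results
```

In such a workflow, each retained ORF would be passed through `translate` and the resulting protein sequence submitted for structure prediction; the 1000 nt threshold matches the minimum length stated above.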
Annotation of the newly discovered virus
To further corroborate the viral origin of the predicted protein structure and gain insights into its taxonomic classification, we conducted parallel comparisons using DIAMOND [25] against the GenBank non-redundant (nr) database (https://www.ncbi.nlm.nih.gov/) and the HMMER web server (http://www.ebi.ac.uk/Tools/hmmer) against the following profile databases: (i) reference proteomes (https://proteininformationresource.org/rps/), (ii) Uniprot (https://www.uniprot.org/) and (iii) Pfam (https://pfam.xfam.org/). In addition, conserved domains were annotated using the Conserved Domain Database (CDD) and the CD-search tool (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml). To detect additional contigs and better characterize the entire genome of the novel virus, we aligned the assembled contigs against custom databases using DIAMOND [25], including (i) reference RdRp sequences from the order Articulavirales, and (ii) reference sequences corresponding to all the segments of TiLV (Table S1). Given the divergent nature of the viruses, we considered all hits with E-value >10⁻⁴.
Phylogenetic analysis
The predicted contig encoding the RdRp of the newly discovered virus was aligned with reference protein sequences of the order Articulavirales (Table S2). A multiple amino acid sequence alignment was performed using the E-INS-i algorithm as implemented in the MAFFT v7.450 program [28]. Selection of the best-fit model of amino acid substitution was carried out using the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) with the standard model selection option (-m TEST) in IQ-TREE [29]. Phylogenetic analysis of these data was then performed using the Maximum Likelihood (ML) method available in IQ-TREE, with node support estimated with the ultra-fast bootstrap (UFBoot) approximation (1000 replicates) and the Shimodaira-Hasegawa approximate likelihood ratio test (SH-aLRT). Sequencing reads are available at the NCBI Sequence Read Archive (SRA) under the Bioproject PRJNA626677 (BioSample: SAMN14647831; Sample name: VERT7; SRA: SRS6507258). The assembled sequence for GECV was deposited in GenBank under the accession number MT386081.
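As a reminder of what the two criteria measure, the sketch below computes AIC and BIC from a model's log-likelihood and number of free parameters and picks the lowest-scoring model. It is an illustration of the textbook formulas, not of IQ-TREE's internal procedure; the sample size is assumed here to be the number of alignment sites, and the model names and likelihood values are invented placeholders rather than results from this study.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def select_model(fits, n_sites, criterion="BIC"):
    """Return the name of the model with the lowest criterion score.

    fits -- dict mapping model name to (log_likelihood, n_params)
    """
    def score(item):
        log_l, k = item[1]
        if criterion == "AIC":
            return aic(log_l, k)
        return bic(log_l, k, n_sites)

    return min(fits.items(), key=score)[0]


# Invented placeholder values for two hypothetical candidate models
candidate_fits = {"model_A": (-12050.0, 40), "model_B": (-11900.0, 41)}
best = select_model(candidate_fits, n_sites=400, criterion="BIC")
```

Both criteria reward a higher likelihood and penalise additional parameters; BIC applies the heavier penalty once the alignment exceeds about seven sites, so the two can disagree for parameter-rich models.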
PCR validation
To validate the presence of the novel gecko amnoonvirus, and to identify the putative host species, we screened the individual liver RNA samples using RT-PCR. Briefly, cDNA was prepared using Superscript IV VILO master mix and RT-PCR was performed with the Platinum SuperFi Green PCR master mix and two primer sets targeting the gecko RdRp contig, F2V7 and F3V7 (Table S3). The resultant RT-PCR products were analysed by agarose gel electrophoresis and validated by Sanger sequencing.
Results
Virus discovery using meta-transcriptomics and protein structural features
We used a meta-transcriptomic approach to screen a single pooled library containing liver RNA of seven Australian native reptile species (Gehyra lauta, Carlia amax, Heteronotia binoei, Gehyra nana, Carlia gracilis, Carlia munda, and Heteronotia planiceps; see Methods). We focused on the de novo assembled contigs that had no significant hits using initial searches against the NCBI nucleotide and non-redundant databases. Accordingly, of 293,586 orphan contigs, 57 contained translatable ORFs of more than 1000 nt in length, and because we hypothesized that some may correspond to undetected virus sequences, we interrogated them using a protein structure prediction approach with template-based modelling (TBM) in Phyre2 [27]. From the 57 queried contigs, we obtained a 3D model of a 407 amino acid (1227 bp) contig with a high confidence hit (98.3%) to the RdRp catalytic subunit of a bat influenza A virus (family Orthomyxoviridae) (Table 1, Figure 1a-b). The confidence level obtained is indicative of high probability of modelling success between putative homologs. In addition, the alignment coverage between our query and the viral template corresponded to 52% (213 residues) of the query sequence, while the proportion of identical amino acids (i.e. sequence identity) was 19% (Table 1).
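The reported alignment coverage follows directly from the two figures given above, as the following short check shows. The helper name `alignment_coverage` is hypothetical; the inputs are the 213 aligned residues and the 407 amino acid query length stated in the text.

```python
def alignment_coverage(aligned_residues, query_length):
    """Percentage of the query sequence covered by the template alignment."""
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return 100.0 * aligned_residues / query_length


# Values taken from the text: 213 aligned residues of a 407-residue query
coverage = alignment_coverage(213, 407)
print(round(coverage, 1))  # 52.3
```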
To corroborate these findings, the structural results were compared with those obtained from other analyses based on primary sequence similarity searches against public databases (see Methods) (Table 1). This revealed matches to the RdRp subunit (PB1 gene segment) of different members of the order Articulavirales, including influenza A virus (FLUAV), TiLV, and Infectious salmon anaemia virus (ISAV). Comparisons of the assembled contigs against a custom database containing only members of the Articulavirales were then performed to improve sequence alignments. Accordingly, the best hits were obtained to TiLV (e-values <10⁻¹⁵) (Table 1). To identify additional viral segments, the assembled contigs were aligned to the ten segments of TiLV using DIAMOND. A total of 87 contigs were scored across the ten TiLV segments, although we did not recover any significant hits for segments 2-10, likely because they are so divergent in sequence (Table S1).
Sequence alignment and phylogenetic relationships
We tentatively name the new virus identified here as Gecko articulavirus (GECV). Multiple sequence alignment of the RdRp between GECV and other members of the order Articulavirales identified a number of well conserved amino acid motifs (I-IV) ranging from 5 to 11 amino acids in length (Figure 2). Phylogenetic analysis of the aligned RdRp region revealed that GECV falls within the order Articulavirales and, along with TiLV (family Amnoonviridae), comprises a distinct monophyletic group. The close relationship between GECV and TiLV was supported by high UFBoot/SH-aLRT values (99%/99%) (Figure 1c). Likewise, estimates of the amino acid identity in the RdRp showed a closer (but still distant) sequence similarity (15.35%) with TiLV than with other members of the order Articulavirales (Table 2).
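Pairwise amino acid identity can be computed from two rows of a multiple sequence alignment in several ways, differing mainly in the denominator. The sketch below uses one common convention (identical residues divided by the number of columns in which neither sequence has a gap). The text does not state which convention was used for Table 2, so this choice, the function name `percent_identity`, and the toy sequences are assumptions for illustration.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two aligned sequences of equal length.

    Columns where either sequence has a gap are excluded from the count.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy alignment (invented): 3 identical residues over 4 ungapped columns
identity = percent_identity("MK-LV", "MKALI")
```

Using the full alignment length or the shorter sequence length as the denominator would give lower values for the same alignment, which matters when comparing identities as low as those reported here.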
Host association and in vitro validation
GECV was initially identified in the pooled sequencing library comprising a mix of several Australian reptile species. To identify the exact host species, we screened each individual species sample separately using RT-PCR and Sanger sequencing. As a result, we detected the presence of the novel GECV RdRp sequence in liver tissue of G. lauta (paratype QM J96622) (Figure S1), a gecko species native to north-western Queensland and the north-eastern Northern Territory in Australia [30].
Discussion
Advances in protein modelling and sequence analysis based on structural comparisons with well-characterized protein templates constitute an attractive approach for the identification of highly divergent RNA viruses [27]. As viral proteins such as the RdRp play a central role in the transcription and replication of RNA viruses, it is expected that structures and key motifs for catalytic functionality will be relatively well conserved throughout evolutionary history [31,32]. Based on this premise, it is expected that template-based protein structure modelling could be a powerful tool in the identification of highly divergent viruses [7,27,33]. Accordingly, we used protein structural similarity in combination with sequence and profile similarity to identify a novel and divergent RNA virus in an Australian gecko (G. lauta).
We obtained a confident predicted 3D model for the RdRp of GECV based on its structural similarity with the RdRp subunit PB1 of influenza virus (family Orthomyxoviridae) (Figure 1a-b; Table 1). Although the structural data suggested that GECV belonged to the family Orthomyxoviridae (order Articulavirales) [27], additional sequence analysis revealed a closer relationship to members of the family Amnoonviridae (Figure 1c). In this context it is important to recall that biases in taxonomic assignment can occur because of the limited number of available proteins with known structures in the PDB. Although this is clearly a limitation, template-based approaches offer a tractable starting point for virus discovery and its taxonomic classification.
Although compromised by the large evolutionary distances involved, phylogenetic analysis among members of the order Articulavirales revealed that GECV was most closely related to TiLV, in turn suggesting that GECV is a novel and divergent genus within the Amnoonviridae. To date, the family Amnoonviridae has only been detected in fish [14], such that the discovery of GECV expands the host range of this family. Indeed, given the distance between the TiLV and GECV viruses, we can expect that further uncharacterised diversity exists in the family Amnoonviridae especially in fish and reptiles, and that more studies using the form of genomic surveillance performed here will reveal a far greater diversity of negative-sense RNA viruses [6,34].
Comparisons of the RdRp subunit PB1 from different articulaviruses revealed the presence of four well conserved motifs in GECV, broadly consistent with observations made for TiLV [14]. As suggested by several studies, motifs I-IV are critically implicated in the catalytic activity of PB1 [35,36]. Despite minor variations, we identified the SDD (serine-aspartic acid-aspartic acid) sequence in motif III that is presumed to be essential for protein functionality in FLUV [35,36]. Hence, the well conserved motifs I-IV found across the order Articulavirales may constitute effective molecular fingerprints for these viruses. Unfortunately, the marked lack of sequence similarity meant we did not recover any conclusive evidence regarding the presence of other genome segments in GECV. Further studies that include sequencing, microscopy, and cell culture techniques are therefore required to fully characterize the genome of this novel virus.
The identification of a novel virus in an Australian gecko (G. lauta) highlights the importance of virus surveillance in native species. Although GECV was detected in liver tissue, we currently cannot draw any conclusions regarding its pathogenic potential and impact on the health of G. lauta, particularly since a limited number of individuals were collected and all were apparently healthy. Additional research is therefore needed to establish the type of biological interaction between GECV and G. lauta. While a previous study reported the isolation of the arbovirus Charleville virus (family Rhabdoviridae) in G. australis (possibly G. dubia based on its distribution) collected in Queensland [36,37], this is the first report of a divergent articulavirus in reptiles. Taken together, these findings hint at a hidden diversity of RNA viruses in reptiles that remains to be characterized.
Supplementary Materials.
Figure S1. PCR detection and host association of GECV. (a-b) Agarose gel electrophoresis showing PCR products from two sets of primers that target a region in the PB1 gene segment (RdRp). Samples correspond to (c) liver tissue from seven different reptile species. A 355 bp PCR product was only amplified in G. lauta.
Table S1. Summary of the contig alignment to genomic segments of TiLV using DIAMOND. The relative abundance of each transcript was also calculated (see Methods).
Table S2. List of virus sequences used in the phylogenetic analysis. All sequences correspond to the PB1 protein.
Table S3. Set of primers used for PCR and Sanger sequencing reactions.
Author Contributions
Conceptualization, E.C.H.; methodology, A.S.O.-B., E.C.H., and J.-S.E.; formal analysis, A.S.O.-B.; investigation, A.S.O.-B., E.C.H., and J.-S.E.; resources, C.M., J.-S.E and E.C.H.; writing—original draft preparation A.S.O.-B.; writing—review and editing E.C.H., J.-S.E. and C.M.; visualization, A.S.O.-B.; supervision, E.C.H. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Australian Research Council, grant number FL170100022.
Conflicts of Interest
The authors declare no conflict of interest.
Acknowledgments
None.