Abstract
Saffron (Crocus sativus) is a spice with immense economic and medicinal relevance, due to its anticancer and chemopreventive properties. Although the genomic sequence of saffron is not publicly available, the RNA-seq based transcriptome of saffron from Jammu and Kashmir provides several, as yet unexplored, insights into the metagenome of the plant from that region. In the current work, sequence databases were created in the YeATS suite from the NCBI and Ensembl databases to enable faster comparisons. These were used to determine the metagenome of saffron. Soybean mosaic virus, a potyvirus, was found to be abundantly expressed in all five tissues analyzed. Recent studies have highlighted that issues arising from latent potyvirus infections in saffron are severely underestimated. Bacterial and fungal identification is made complex due to symbiogenesis, especially in the absence of the endogenous genome. Symbiogenesis results in transcripts having significant homology to both bacterial and eukaryotic genomes. A stringent criterion based on homology comparison was used to identify bacterial and fungal transcripts, and inferences were constrained to the genus level. Leifsonia, Elizabethkingia and Staphylococcus were some of the identified bacteria, while Mycosphaerella and Pyrenophora were among the fungi detected. Among these bacteria, L. xyli is the causal agent of ratoon stunting disease in sugarcane, while E. meningoseptica and S. haemolyticus, having acquired multiresistance against available antimicrobial agents, are important in clinical settings. Mycosphaerella and Pyrenophora incorporate several pathogenic species. It is shown that a transcript encoding a heat shock protein of the fungus Cladosporium cladosporioides has been erroneously annotated as a saffron gene. The detection of these pathogens should enable proper strategies for ensuring better yields.
The functional annotation of proteins in the absence of a genome is subject to errors due to the existence of significantly homologous proteins in organisms from different branches of life.
Introduction
Saffron, possibly one of the most expensive spices, is an anticancer and chemopreventive agent [1, 2], and is also effective for treating major depressive disorder symptoms [3, 4]. The genome of saffron is not publicly available, although http://www.crocusgenome.org/ provides an online interface to query individual sequences. The RNA-seq transcriptome [5, 6] of saffron was independently sequenced by two research groups in order to gain deeper insights into the genes involved in apocarotenoid biosynthesis [7, 8]. Previously, the YeATS suite identified several artifacts arising from RNA-seq assembly [9, 10], and has been used to analyze the walnut transcriptome, revealing the biodiversity and plant-microbe interactions in twenty different tissues from walnut in California [11].
A BLAST to the complete 'nt/nr' database, as done previously, is not time efficient [9–11]. In the current work, the NCBI and Ensembl databases were used to create smaller, yet comprehensive, databases for viruses (V-DB), bacteria (B-DB), fungi (F-DB) and plants (mitochondria, chloroplasts and ribosomes: CMR-DB), which were added to the YeATS suite for a time-efficient query of the metagenome. In the absence of a genome, identification of saffron-derived transcripts is not straightforward. Using these databases, the metagenome of saffron from Jammu and Kashmir was derived from the transcriptome of saffron. http://nipgr.res.in/mjain.html?page=saffron provided the ABySS [12] assembled transcriptome with 105269 contigs. Mosaic virus derived transcripts were found to be abundantly expressed in all five tissues analyzed [13]. Mosaic viruses can be responsible for significant yield losses [14]. Key organelles of eukaryotes, like the mitochondrion and chloroplast, possibly originated as a symbiosis between distinct single-celled organisms [15]. Symbiogenesis results in transcripts that have significant homology to both bacterial and eukaryotic genomes. Consequently, a stringent constraint that compared homology scores was used to choose bacterial and fungal transcripts. Leifsonia, Elizabethkingia and Staphylococcus were some of the identified bacteria, while Mycosphaerella and Pyrenophora were among the fungi detected. These bacterial and fungal genera incorporate several pathogenic species. A proper disease-management strategy can be devised based on this knowledge. It is also shown that protein-based annotation, based on open reading frames of transcripts, can be error prone due to the existence of significantly homologous proteins in organisms from different branches of life, like fungi, plants and viruses.
Materials and methods
Obtaining transcriptome and expression matrix from different tissues of saffron
http://nipgr.res.in/mjain.html?page=saffron provided the ABySS [12] assembled transcriptome with 105269 contigs. The site provides the sequences (Saffron transcriptome assembly.fa), annotation (Saffron transcriptome assembly.fa) and expression values for different tissues (Saffron expression matrix).
Obtaining the viral sequences:
“http://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=10239” provides a retrieve option to get all virus genomes (viral.nt.fa:n=7331 in Dataset1). For viral protein sequences, the NCBI database was queried for “Viruses, RefSeq” (viral.nr.refseq.fa, n=50021 in Dataset1).
Obtaining the bacterial genomes:
“ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assemblysummary.txt” provides details of the available bacterial genomes. This file was parsed, and the first occurring species of each genus was chosen as an arbitrary representative of that genus (getBacterialGenomes.csh:n=1355 in Dataset1).
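The genus-level selection described above can be sketched as follows. This is a minimal illustration and not the getBacterialGenomes.csh script itself: it assumes the summary file is tab-separated, that its last comment line (starting with "#") is the column header, and that it contains columns named organism_name and ftp_path. The function name pick_one_per_genus is hypothetical.

```python
def pick_one_per_genus(summary_text):
    """Return {genus: (organism_name, ftp_path)}, keeping the first row seen per genus.

    Assumptions (not taken from the paper): tab-separated input, the last
    '#' line is the column header, and the columns 'organism_name' and
    'ftp_path' are present.
    """
    header = None
    rows = []
    for line in summary_text.splitlines():
        if line.startswith("#"):
            # Later comment lines overwrite earlier ones, so the last wins.
            header = line.lstrip("#").strip().split("\t")
        elif line.strip():
            rows.append(line.split("\t"))
    if header is None:
        raise ValueError("no header line found")
    name_col = header.index("organism_name")
    path_col = header.index("ftp_path")

    chosen = {}
    for row in rows:
        # The genus is taken to be the first word of the organism name.
        genus = row[name_col].split()[0]
        # setdefault keeps the first occurrence and ignores later ones.
        chosen.setdefault(genus, (row[name_col], row[path_col]))
    return chosen
```

Taking the first word as the genus is a simplification: names beginning with a qualifier such as "Candidatus" would need extra handling.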
Obtaining plant mitochondrial, chloroplast and ribosomal sequences:
The NCBI database was queried for : Plants, RefSeq, Mitochondrion/Chloroplast/Ribosomal rRNA/Ribosomal mRNA. These were combined in a single file (DB.list CHLORO MITO MRIBO RRIBO, n=40049 in Dataset1).
Obtaining the fungal sequences:
The fungal sequences were obtained from the Ensembl site [16]. A random species was chosen for each genus (list.ensembl.fungi.txt in Dataset1, n=222).
The YeATS suite was used extensively to query these databases using the BLAST command-line interface [17]. The BLAST bitscore was used as a comparison metric instead of the Evalue since it allows differentiation for high homologies where Evalue goes to zero.
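The reason the Evalue stops discriminating between strong hits can be illustrated with the standard Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bitscore and m and n are the query and database lengths. The sketch below is only an illustration of that relation; the lengths used are arbitrary assumptions, and the function name is hypothetical.

```python
def evalue_from_bitscore(bitscore, query_len, db_len):
    """Expected number of chance hits for a given bitscore: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bitscore)


# Arbitrary, assumed search-space sizes for illustration only.
QUERY_LEN = 1_000
DB_LEN = 1_000_000_000

for score in (60, 150, 1200, 2400):
    e = evalue_from_bitscore(score, QUERY_LEN, DB_LEN)
    print(f"bitscore={score:5d}  Evalue={e:g}")
```

For the two largest bitscores the computed Evalue underflows to zero in double precision, so the two hits become indistinguishable by Evalue even though one bitscore is twice the other.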
Results and discussion
Viral transcripts:
Transcripts that map to virus genomes (V-DB) clearly demonstrate the presence of several mosaic virus (MV) species (Table 1). MVs are not a single taxon, although they are all ssRNA positive-strand viruses [13]. For example, tobacco MV is a Tobamovirus, while all other MVs in Table 1 are Potyviruses. Previously, saffron has been shown to be infected with bean yellow MV [18] and Narcissus MV [19]. Iris, which belongs to the same Iridaceae family as saffron, was shown to be infected with Iris severe MV [20,21] and Narcissus MV [22]. A recent study highlights that the threat from these MVs is seriously underestimated [23].
The complete list of virus genome matches with a BLAST bitscore (BBS) cutoff=150 (Evalue=1E-30) is provided in viral.anno.nucleotide.txt in Dataset1 (n=8). Transcripts with the highest significance (Table 1) were BLAST'ed through the online interface to the saffron genome. Only CSTC028935 had a significant match, to "gb EX144186.1 EX144186 cr28 F19 Saffron (Crocus sativus) mature stigma lambda Unizap", with a 97% identity, although it is unlikely that this is part of the saffron genome. CSTC028935 encodes a 1587 long open reading frame (ORF) that is 79% identical to the soybean MV polyprotein (Accid:ADK60773.1). Expectedly, none of these viral transcripts were annotated previously in Saffron functional annotation [7].
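A bitscore cutoff of this kind can be applied to BLAST results as sketched below. This is not the YeATS implementation. It assumes the results are in the default 12-column tabular layout of BLAST+ ("-outfmt 6"), in which the first two columns are the query and subject identifiers and the twelfth is the bitscore; the function name best_hits is hypothetical.

```python
def best_hits(tabular_text, min_bitscore):
    """Return {query_id: (subject_id, bitscore)} for the top hit of each query.

    Assumes the default 12-column BLAST+ tabular layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if bitscore < min_bitscore:
            continue  # below the cutoff, e.g. BBS=150 for the viral database
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best
```

Calling best_hits(text, 150) on such output keeps one subject per transcript, which is the form needed when matches against different databases are later compared by bitscore.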
Bacterial transcripts:
The endosymbiont theory posits that key organelles of eukaryotes, like the mitochondrion and chloroplast, originated as a symbiosis between distinct single-celled organisms [15]. While the origin of this symbiosis remains a subject of debate, the fact that prokaryotes and eukaryotes share significant homology in their genomes is established beyond doubt [24]. This presents a certain degree of uncertainty in identifying the metagenome through nucleotide sequencing, especially when the genome of the endogenous organism is not known. This uncertainty is demonstrated in the analysis below.
First, the saffron transcripts were BLAST'ed to the locally created bacterial genome database (B-DB) (1355 bacterial genomes, see Methods). There were 430 matching transcripts (BBS=200, Evalue cutoff=~1E-50, see BACT.200.anno.sort in Dataset1). Querying a combined database comprising chloroplast, mitochondrial and ribosomal genes (CMR-DB) with a more relaxed Evalue constraint gave 573 matching transcripts (BBS=60, Evalue=1E-10, CHLOROMITORIBO.60.anno.sort in Dataset1). Since the organism under focus here is a plant, the Evalue is more relaxed for these matches. In case a transcript matched both B-DB and CMR-DB, the percentage difference in the BBS was used as a metric to select bacterial genes. There were about 150 transcripts where the B-DB BBS was 60% greater than the CMR-DB BBS. About 80 transcripts had no match in CMR-DB. Given the high cutoff required for bacterial identification and the low threshold for CMR-DB assignment, these 230 transcripts can be assigned to bacterial species (BACT.MITOCHLORO.anno.sort in Dataset1).
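The selection rule described above can be written as a small decision function. This is a sketch under stated assumptions rather than the code used in the study: "60% greater" is interpreted here as (B-DB BBS - CMR-DB BBS) / CMR-DB BBS > 0.6, a missing match is represented by None, and the function and parameter names are hypothetical.

```python
def is_bacterial(bdb_bbs, cmr_bbs, bdb_cutoff=200.0, cmr_cutoff=60.0,
                 min_excess_pct=60.0):
    """Decide whether a transcript is assigned to bacteria.

    bdb_bbs: best bitscore against the bacterial database, or None if no hit.
    cmr_bbs: best bitscore against the chloroplast/mitochondrial/ribosomal
             database, or None if no hit.
    """
    # A bacterial assignment needs a strong hit in the bacterial database.
    if bdb_bbs is None or bdb_bbs < bdb_cutoff:
        return False
    # No organelle or ribosomal match above the relaxed cutoff: accept.
    if cmr_bbs is None or cmr_bbs < cmr_cutoff:
        return True
    # Both match: require the bacterial score to exceed the CMR-DB score
    # by more than the given percentage (assumed interpretation).
    excess_pct = 100.0 * (bdb_bbs - cmr_bbs) / cmr_bbs
    return excess_pct > min_excess_pct
```

For example, a transcript scoring 400 against B-DB and 200 against CMR-DB exceeds the CMR-DB score by 100% and is accepted, while one scoring 300 and 250 exceeds it by only 20% and is rejected.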
Bacterial transcripts sorted based on BBS revealed the presence of several bacterial genera (Table 2). Enterococcus is a large genus of lactic acid bacteria, of which E. faecalis and E. faecium are common commensal organisms in the intestines of humans. E. silesiacus was found in two water isolates [25]. Similarly, Staphylococcus haemolyticus is one of the most frequent aetiological factors of staphylococcal infections, attributed to its ability to acquire multiresistance against available antimicrobial agents [26]. Leifsonia xyli is the causal agent of ratoon stunting disease in sugarcane [27], and is surprisingly genetically conserved in isolates from different countries [28]. The commonly occurring gram-negative bacillus Elizabethkingia is found in soil, river water and reservoirs. Elizabethkingia meningoseptica is resistant to most antibiotics used in the intensive care setting [29], and has been responsible for an outbreak arising from tap contamination [30]. A recent case of bacteremia has been attributed to Elizabethkingia meningoseptica [31].
An example of a transcript with significant homology to bacterial and plant genomes is CSTC008113. Its best match in the 'nt' database is to Vicia americana, a legume known commonly as American vetch and purple vetch, with BBS=957. Among the bacterial genomes, it matches Tenacibaculum mesophilum with BBS=776. It has no match in the 'nt' database when the search is constrained to bacteria, demonstrating that not all bacterial genomes are included in the 'nt' database.
Fungal transcripts:
Fungal transcripts (BBS=150, annoEnsFungi.anno.sort in Dataset1, n=795) reveal several commonly occurring genera, detected in small traces in different tissues (Table 6). The exact species cannot be identified using the available data, since these genera incorporate many species. http://www.speciesfungorum.org/Names/names.asp provides the enumeration of fungal species. For example, Mycosphaerella, a genus of ascomycetes, is one of the largest genera of plant pathogenic fungi with more than 1500 species [32]. Similarly, Pyrenophora has ~160 species, several of them pathogenic [33].
Caveats and advantages of ORF based annotation:
Annotation of transcripts can also be done by analyzing ORFs. Although the nucleotide sequence of CSTC042320 has no match in the complete 'nt' database, it encodes a 479 long ORF homologous (57% identity) to a cytochrome P450 protein from pineapple (Ananas comosus). The saffron transcriptome has been annotated accordingly (Saffron functional annotation [7]), and this transcript has a significant match in the online interface to the saffron genome (http://www.crocusgenome.org/?pageid=20). However, such functional annotation of proteins in the absence of a genome is subject to errors due to the existence of significantly homologous proteins in organisms from different branches of life, like fungi, plants and viruses. CSTC018609 is one such mis-annotated transcript: it encodes a 213 long ORF that is homologous to a heat shock protein from the fungus Cladosporium cladosporioides with 99% identity, but is annotated as a plant gene (Saffron functional annotation [7]). The nucleotide sequence of CSTC018609 also matches the genome of Cladosporium cladosporioides (Accid:HQ693482.1) with BBS=1094, 97% identity. Although the nucleotide sequence also matches a plant sequence (Nicotiana tabacum, Accid:AY372070.1) with lesser significance (87% identity), it has no significant match through the online interface provided. Interestingly, the ORF of this gene is also homologous (76% identity) to a viral HSP70 protein (Micromonas pusilla virus, Accid:AET43623.1).
ORF based annotation provides coverage in some cases where nucleotide homology is not detected. The viral protein RefSeq database identified about 20 additional putative viral transcripts (see viral.anno.protein.txt in Dataset1) not detected through the nucleotide comparison. One reason for such an exclusion is the smaller size of the local viral nucleotide database, which does not include all strains. For example, CSTC027157 has significant matches in the full 'nt' virus database (bean common mosaic virus complete genome, isolate PV 0915, Accid:HG792064.1). However, some sequences (such as CSTC004726) have no match even in the complete 'nt' database, although they encode ORFs with significant matches to the local viral protein database. CSTC004726 encodes a polyprotein (Accid:NP_660175.1, Evalue=9e-37) from bean common mosaic necrosis virus. Thus, an annotation methodology should include comparisons to both nucleotide and amino acid sequences.
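Extracting an ORF from a transcript for such protein-level comparison can be sketched as below. This is a simplified illustration and not the ORF detection used by YeATS: it assumes the standard genetic code, takes an ORF to run from the first ATG in a stop-free stretch to the next stop codon (or to the end of the transcript, since assembled transcripts can be partial), and searches all six reading frames. The function names are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_orf(seq):
    """Return the longest peptide starting with M over all six frames ('' if none)."""
    seq = seq.upper()
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            # Translate the frame; codons with ambiguous bases become 'X'.
            protein = "".join(
                CODON_TABLE.get(strand[i:i + 3], "X")
                for i in range(frame, len(strand) - 2, 3)
            )
            # Each stop-free segment may contain one candidate ORF.
            for segment in protein.split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best):
                    best = segment[start:]
    return best
```

The returned peptide can then be compared against a protein database, which is how a transcript with no nucleotide-level match can still be annotated.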
Conclusions and future work
Previous application of the YeATS suite to genomes [9, 10] and metagenomes [11] required a BLAST to the complete 'nt' and 'nr' databases. In the current work, smaller, yet comprehensive databases have been created to accelerate the identification of the saffron metagenome from the transcriptome obtained from saffron in Jammu and Kashmir [7]. Transcripts from the soybean mosaic virus were found to be abundantly expressed in all five tissues analyzed. Although these viruses have been isolated from saffron (and the related genus Iris) previously [18–22], recent studies have shown that the threat posed by these viruses to saffron cultivation is seriously underestimated. Several bacterial and fungal genera incorporating known pathogenic species have also been identified.
A search through the complete set of known sequences (through a program like BLAST) is ideal, but not time efficient. The search can be accelerated by creating smaller, but comprehensive, sequence databases. Such a 'divide and conquer' strategy needs to account for the existence of homologous nucleotide sequences (and proteins) across the different branches of life. There are two methods for annotating a transcript: (a) through the nucleotide sequence or (b) through the amino acid sequence (assuming it is not a non-coding transcript). Annotation based on ORFs identifies transcripts that have no nucleotide sequence homology in the entire 'nt' database, which occurs due to the redundancy in the codon table. Finally, this work also highlights that the standard practice of excluding transcripts homologous to ribosomal RNA might exclude bona fide bacterial transcripts, in the absence of a comparative analysis of the homology to different databases. While the current strategy of creating smaller, comprehensive subsets reduces the annotation time compared to the earlier YeATS workflow, the existing flow still uses command-line BLAST to query the entire transcriptome against the sequence databases. A kmer based approach could accelerate this further, and will be added in a future release [34].
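The general idea behind a kmer based comparison can be sketched as follows. This is only a generic illustration of kmer sharing between two sequences; it is not the method of [34] or a planned YeATS feature, and the function names and the choice of k are assumptions.

```python
def kmers(seq, k):
    """Set of all overlapping substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_containment(query, reference, k=8):
    """Fraction of the query's distinct kmers that also occur in the reference."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0  # query shorter than k
    return len(query_kmers & kmers(reference, k)) / len(query_kmers)
```

In such a scheme the kmer sets of the reference databases would be computed once and reused, so that each transcript is screened by set lookups, with alignment reserved for transcripts that share enough kmers with a reference.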
Competing interests
No competing interests were disclosed.
Grant information
The author(s) declared that no grants were involved in supporting this work.
Acknowledgements
I gratefully acknowledge Mridul Bhattacharjee and Nitin Salaye for logistic support.