Abstract
Background Although the reference genome of the Solanum tuberosum group Phureja double-monoploid (DM) clone is available, the genetic diversity of the highly heterozygous tetraploid group Tuberosum, which represents most cultivated varieties, remains largely unexplored. This lack of knowledge hinders further progress in potato research and its subsequent applications in breeding.
Results For the DM genome assembly, two only partially overlapping gene models exist, which differ in their sets of genes and in intron/exon structure predictions. As a first step, we merged the two gene models and manually curated the result, creating a union of the genes in the Phureja scaffold. We next compiled available RNA-Seq datasets (approximately 1.5 billion reads) for three tetraploid potato genotypes (cultivar Désirée, cultivar Rywal, and breeding clone PW363) with diverse breeding pedigrees. Short-read transcriptomes were assembled using the CLC, Trinity, Velvet, and rnaSPAdes de novo assemblers with different settings to test for the optimal outcome. In addition, for cultivar Rywal, PacBio Iso-Seq full-length transcriptome sequencing was performed. A revised EvidentialGene redundancy-reducing pipeline was employed to produce accurate and complete cultivar-specific transcriptomes from the assembler output, as well as to attain the pan-transcriptome. Cv. Désirée, which had the most diverse dataset in terms of tissues (stem, seedlings and roots) and experimental conditions, yielded the most complete transcriptome (95.8% BUSCO completeness). For cv. Rywal and breeding clone PW363, data were available for leaf samples only, and the resulting transcriptomes were less complete than that of cv. Désirée (89.8% and 89.3% BUSCO completeness, respectively). Cross-comparison of these cultivar-specific transcriptomes and the merged DM gene model suggests that the core potato transcriptome comprises 16,339 genes. The pan-transcriptome contains a total of 95,779 transcripts, of which 54,614 transcripts are not present in the Phureja genome. These represent the variants of the novel genes found in the potato pan-genome.
Conclusions Our analysis shows that the available gene model of the double-monoploid potato from group Phureja is incomplete to some degree. The generated transcriptomes and pan-transcriptome represent a valuable resource for potato gene variability exploration, high-throughput –omics analyses, and future breeding programmes.
Background
At the species level, genomes of individuals can differ in single nucleotide polymorphisms (SNPs), short insertions and deletions (INDELs), gene copy numbers, and presence or absence of genes [1]. The latter leads to the concept of species-specific pan-genomes, which consist of the core genome present in most individuals and the dispensable genome composed of genes present only in a subset of individuals, which results in the emergence of particular subgroup-specific phenotypes. This concept has been extended to pan-transcriptomes, where presence or absence variation is bound not only to gene content but also to genetic and epigenetic regulatory elements. Pan-genomes and pan-transcriptomes have been described in the model plant species Arabidopsis thaliana [2] and in several crop species, including maize [3, 4], rice [5], wheat [6] and soybean [7].
While the genome of a double-monoploid clone of Solanum tuberosum group Phureja (DM) is available [8], this diploid potato group differs from the tetraploid group Tuberosum, which includes most varieties of cultivated potato. Through domestication and modern breeding efforts, different potato cultivars also acquired genes from other closely related Solanum species or lost some ancestral genes [1]. Different breeding programmes have resulted in the accumulation of different smaller genome modifications, e.g. SNPs and INDELs. Consequently, each distinct potato cultivar harbours a unique set of transcripts, resulting in physiological and developmental differences and different responses to biotic and abiotic stress. Differences in SNP and INDEL profiles and novel gene variants in the anthocyanin pathway were identified in a comparative transcriptome analysis of two Chinese potato cultivars [9]. Unfortunately, we could not include these transcriptomes in our pan-transcriptome because the transcriptome assemblies were not publicly accessible.
Based on the DM genome, the PGSC and ITAG annotation consortia [8, 10] have each independently produced potato gene models. For practical reasons, most potato researchers use only one genome annotation, either PGSC or ITAG, especially when conducting high-throughput analyses. Using an incomplete gene set can lead to false conclusions on gene presence or gene family diversity in potato. Using a computational pipeline followed by manual curation, we have consolidated the two publicly available group Phureja DM gene model sets to produce a unified one.
While a combined DM gene set is useful, it is still less useful than a pan-transcriptome that includes assemblies from cultivated potatoes. However, obtaining an optimal transcriptome from short-read RNA-Seq data is not a trivial task. Each de novo assembler suffers from different intrinsic errors, and no single assembler performs best on all datasets [11]. To maximise the diversity and completeness of potato cultivar transcriptomes, we ran multiple de novo transcriptome assemblers with various parameter combinations over the same input data. Following this “over-assembly” step, we used the tr2aacds pipeline from EvidentialGene [12] to reduce redundancy across assemblies and obtain cultivar-specific transcriptomes. Finally, we consolidated representative cultivar-specific sequences to generate the potato pan-transcriptome (StPanTr). These transcriptomes will improve high-throughput sequencing analyses, from RNA-Seq and sRNA-Seq to more specific ones like ATAC-Seq, by providing a more comprehensive and accurate mapping reference. The translated protein sequences can enhance the sensitivity of high-throughput mass spectrometry-based proteomics. The resource is also valuable for the design of PCR assays, e.g. quantitative PCR, where exact sequence information is required. Additionally, the knowledge generated regarding variations in transcript sequences between cultivars, such as SNPs, insertions and deletions, will be a key instrument to assist breeding programmes.
Data description
Transcriptomic sequences of three potato genotypes, cv. Désirée, cv. Rywal and breeding clone PW363, were obtained from in-house RNA-Seq projects and supplemented by publicly available datasets of the same genotypes retrieved from SRA (Table 1).
The largest quantity of reads, approximately 739 million reads of various lengths, was obtained for cv. Désirée using the Illumina and SOLiD short-read sequencing platforms. For cv. Rywal and breeding clone PW363, only mature leaf samples were available. For cv. Désirée, leaf samples were augmented with samples from stems, seedlings and roots. For cv. Rywal, short-read sequencing was complemented with full-length PacBio Iso-Seq sequencing of independent samples. Detailed sample information is provided in Supplementary Table 1.
Methods
Merging PGSC and ITAG gene models of reference genome group Phureja
GFF files corresponding to the two gene models (PGSC v4.04, ITAG v1.0) were retrieved from the Spud DB potato genomics resource [13]. The two models (39,431 PGSC and 34,004 ITAG genes) were then compared on the basis of their exact chromosomal location and orientation. Genes were considered equivalent when the shorter sequence covered at least 70% of the longer sequence. Where both models predicted overlapping genes, ITAG IDs were kept as primary. All nontrivial merge cases (e.g. multiple genes in one model corresponding to a single gene in the other, overlapping genes in the two models, or genes with non-matching orientation) were resolved manually (example in Figure 1). This resulted in a merged DM genome GFF file with 49,322 chromosome-position-specific sequences, of which 31,442 were assigned ITAG gene IDs and 17,880 PGSC gene IDs (Supplementary File 1).
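The equivalence criterion can be illustrated with a minimal Python sketch. It is not the pipeline used for the merge. It assumes that "covered at least 70% of the longer sequence" means the shared interval, on the same chromosome and strand, spans at least 70% of the longer gene. The `Gene` class, its field names and the example identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical minimal gene record (coordinates 1-based, inclusive, as in GFF)."""
    gene_id: str
    chrom: str
    strand: str  # "+" or "-"
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlap_length(a: Gene, b: Gene) -> int:
    """Bases shared by two genes; 0 if they lie on different chromosomes or strands."""
    if a.chrom != b.chrom or a.strand != b.strand:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def are_equivalent(a: Gene, b: Gene, min_fraction: float = 0.70) -> bool:
    """Assumed reading of the criterion: overlap >= min_fraction of the longer gene."""
    longer = max(a.length, b.length)
    return overlap_length(a, b) >= min_fraction * longer


if __name__ == "__main__":
    itag = Gene("itag_example", "chr01", "+", 1000, 1999)  # 1,000 bp
    pgsc = Gene("pgsc_example", "chr01", "+", 1100, 1899)  # 800 bp, fully contained
    print(are_equivalent(itag, pgsc))  # overlap 800 of 1,000 bp, so True
```

Pairs that fail this test but still overlap, or that match in position but not in orientation, correspond to the nontrivial cases that were resolved manually.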
Data pre-processing
The complete bioinformatic pipeline is outlined in Figure 2. Sequence quality assessment of raw RNA-Seq data, quality trimming, and removal of adapter sequences and polyA tails were performed using CLC Genomics Workbench v6.5-v10.0.1 (Qiagen), with the maximum error probability threshold set to 0.01 (Phred quality score 20) and no ambiguous nucleotides allowed. The minimum length of trimmed sequences was set to 15 bp and the maximum to 1 kb. Orphaned reads were re-assigned as single-end (SE) reads. Processed reads were pooled into cultivar datasets as properly paired-end (PE) reads or SE reads per cultivar per sequencing platform. For the Velvet assembler, SOLiD reads were converted into double-encoding reads using the Perl script “denovo_preprocessor_solid_v2.2.1.pl” [14]. To reduce the size of the cv. Désirée and cv. Rywal datasets, digital normalization was performed using khmer from bbmap suite v37.68 [15] prior to de novo assembly with Velvet and rnaSPAdes.
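The relationship between the two trimming thresholds quoted above (error probability 0.01 and Phred score 20) follows from the standard Phred definition, Q = -10 log10(p). The short Python sketch below shows the conversion together with the stated length limits. The function names are illustrative, and the code does not reproduce the trimming algorithm implemented in CLC Genomics Workbench.

```python
import math

MIN_LENGTH_BP = 15    # minimum trimmed read length stated in the text
MAX_LENGTH_BP = 1000  # maximum trimmed read length stated in the text (1 kb)


def phred_from_error_probability(p: float) -> float:
    """Standard Phred definition: Q = -10 * log10(p)."""
    return -10.0 * math.log10(p)


def error_probability_from_phred(q: float) -> float:
    """Inverse of the Phred definition: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def within_length_limits(read_length: int) -> bool:
    """True if a trimmed read length falls inside the stated 15 bp to 1 kb range."""
    return MIN_LENGTH_BP <= read_length <= MAX_LENGTH_BP


if __name__ == "__main__":
    print(phred_from_error_probability(0.01))  # approximately 20
    print(within_length_limits(14), within_length_limits(150))
```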
PacBio long reads were processed for each sample independently using the Iso-Seq 3 analysis software (Pacific Biosciences). Briefly, the pipeline included Circular Consensus Sequence (CCS) generation, identification of full-length reads (“classify” step), clustering of isoforms (“cluster” step) and a “polishing” step using the Arrow consensus algorithm. Only high-quality full-length isoforms were used as input for further steps.
PacBio Cupcake ToFU pipeline
Cupcake ToFU scripts [16] were used to further refine the Iso-Seq transcript set. Redundant isoforms were collapsed with “collapse_isoforms_by_sam.py” and counts were obtained with “get_abundance_post_collapse.py”. Isoforms with fewer than two supporting counts were filtered out using “filter_by_count.py”, and 5’-degraded isoforms were filtered out using “filter_away_subset.py”. Isoforms from the two samples were combined into one non-redundant Iso-Seq transcript set using “chain_samples.py”.
De Bruijn graph based de novo assembly of short reads
Short reads were de novo assembled using Trinity v.r2013-02-25 [17], Velvet/Oases v. 1.2.10 [18], rnaSPAdes v.3.11.1 [19] and CLC Genomics Workbench v8.5.4-v10.1.1 (Qiagen). Illumina and SOLiD reads were assembled separately. For CLC Genomics de novo assemblies, combinations of three bubble sizes and 14 k-mer sizes were tested on the PW363 Illumina dataset. Varying the bubble size had little influence on the assembly statistics; we therefore used a bubble size of 85 bp for the Illumina datasets of the other two cultivars (Supplementary Figure 1). The k-mer lengths and bubble sizes used for Velvet and CLC are given in Table 2. The scaffolding option was disabled in CLC and Velvet. More detailed information per assembly is provided in Supplementary Table 2.
Decreasing redundancy of assemblies and annotation
In order to obtain one clean and non-redundant transcriptome per cultivar, assemblies were first subjected to the tr2aacds v2016.07.11 pipeline, which grouped and classified transcripts, coding sequences and polypeptide sequences into main (non-redundant), alternative, or discarded (drop) sets. Each assembly contributed some proportion of transcripts to the cultivar-specific transcriptome sets (Figure 3, Supplementary Figure 1 and Supplementary Figure 2). The main and alternative sets were merged into initial cultivar reference transcriptomes and used in the subsequent external-evidence-based assembly validation, filtering and annotation steps (Figure 2). De novo cultivar-specific transcripts were first mapped to the DM reference genome with STARlong 2.6.1d [20] using parameters optimized for de novo transcriptome datasets (all scripts are deposited at the FAIRDOMHub project home page). Aligned transcripts were analysed with MatchAnnot to identify transcripts that match the PGSC or ITAG gene models. These transcripts were functionally annotated using the corresponding gene product information in GoMapMan [21]. Domains were assigned to the polypeptide dataset using the InterProScan software package v5.37-71.0 [22]. For all transcripts and coding sequences, annotations were generated with DIAMOND v0.9.24.125 [23] by querying databases retrieved from UniProt (E-value cut-off 10^-5 and alignment coverage of both the query transcript/CDS and the target sequence of at least 50%). The initial transcriptomes were also screened for nucleic acid sequences that may be of vector origin (vector segment contamination) using VecScreen plus taxonomy program v.0.16 [24] against the NCBI UniVec database. Potential biological and artificial contamination was identified for approximately 3% of sequences per cultivar. To remove artefacts and contaminants, results from MatchAnnot, InterProScan and DIAMOND were used as biological evidence in a further filtering step performed with in-house R scripts (Supplementary scripts 1).
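The DIAMOND hit thresholds described above can be expressed as a simple predicate. The following Python sketch is only an illustration and is not taken from the in-house R scripts. The `Hit` record is hypothetical, with field names borrowed from BLAST-style tabular columns. Coverage is assumed to be the aligned span divided by the full sequence length, and the E-value cut-off is assumed to be inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical alignment record; fields mirror BLAST-style tabular columns."""
    query_id: str
    target_id: str
    evalue: float
    qstart: int
    qend: int
    qlen: int
    sstart: int
    send: int
    slen: int


def coverage(start: int, end: int, length: int) -> float:
    """Fraction of a sequence spanned by the alignment (coordinates 1-based, inclusive)."""
    return (abs(end - start) + 1) / length


def passes_thresholds(hit: Hit, max_evalue: float = 1e-5, min_coverage: float = 0.5) -> bool:
    """Keep hits with E-value <= 1e-5 and >= 50% coverage of both query and target."""
    return (
        hit.evalue <= max_evalue
        and coverage(hit.qstart, hit.qend, hit.qlen) >= min_coverage
        and coverage(hit.sstart, hit.send, hit.slen) >= min_coverage
    )


if __name__ == "__main__":
    hit = Hit("transcript_1", "protein_1", 1e-20, 1, 600, 900, 1, 200, 300)
    print(passes_thresholds(hit))  # both coverages are 2/3, so True
```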
Transcripts that did not map to the genome or had no significant hit in either InterPro or UniProt were eliminated from further analysis (Supplementary Table 3, Supplementary Table 4 and Supplementary Table 5). Pajek v5.08 [27], in-house scripts, and cdhit-2d from the CD-HIT package v4.6 [25] were used to re-assign the post-filtering main and alternative classes and to obtain the finalised cultivar-specific transcriptomes (Supplementary scripts 2).
The whole redundancy removal procedure reduced the initial transcriptome assemblies 18-fold for Désirée, 38-fold for Rywal, and 24-fold for PW363. The completeness of each initial de novo assembly contributing to a cultivar-specific transcriptome was estimated with BUSCO (Figure 3, Supplementary Figure 1 and Supplementary Figure 2) to identify optimal parameters for the short-read assemblers. SOLiD assemblies (Figure 3: CLCdnDe1, CLCdnDe8, VdnDe8-10), produced by either the CLC or the Velvet/Oases pipeline, contributed least to the transcriptomes, which can mostly be attributed to the short length of the input sequences. Interestingly, for Illumina assemblies, increasing the k-mer size in the CLC pipeline produced more complete assemblies according to the BUSCO score, and more transcripts were selected for the initial transcriptome (Figure 3: CLCdnDe1-7, CLCdnDe9-14). In contrast, increasing the k-mer length in the Velvet/Oases pipeline led to transcripts that were less favoured by the redundancy removal procedure (Figure 3: VdnDe1-7). The Trinity assembly was comparable to the high k-mer CLC assemblies in transcriptome contribution and BUSCO score (Figure 3).
Potato pan-transcriptome construction
Cultivar-specific representative sequences were combined with sequences from the merged DM gene models (non-redundant PGSC and ITAG genes) and subjected to the EvidentialGene traa2cds v2018.06.18 pipeline. Pajek v5.08 [27] and in-house scripts (Supplementary scripts 3) were used to retrieve the appropriate main and alternative classification of transcripts while also taking discarded sequences into consideration.
In total, 138,162 cultivar-specific (57,943 Désirée, 43,883 PW363 and 36,336 Rywal) and 49,322 DM representative sequences were classified into 95,779 main and 91,705 alternative StPanTr transcript sequences. Of the main sequences, 16,339 are shared among all three cultivars and the DM clone, while 43,882 are cultivar-specific (i.e. found only in a single cultivar). A further 17,601 representative sequences from the DM clone did not have any match in cv. Désirée, breeding clone PW363 or cv. Rywal (Figure 4, Supplementary Figure 3).
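The counts reported above are internally consistent, which can be checked with a few lines of Python. The variable names are illustrative, and only numbers stated in the text are used.

```python
# Representative sequences per genotype, as reported in the text.
cultivar_representatives = {"Désirée": 57_943, "PW363": 43_883, "Rywal": 36_336}
dm_representatives = 49_322

# StPanTr classification outcome, as reported in the text.
main_sequences = 95_779
alternative_sequences = 91_705
core_sequences = 16_339  # main sequences shared by all three cultivars and DM

total_cultivar = sum(cultivar_representatives.values())  # 138,162
total_input = total_cultivar + dm_representatives        # 187,484

# Every input sequence ends up as either a main or an alternative sequence.
assert total_input == main_sequences + alternative_sequences

# Share of main sequences that belong to the core set.
core_fraction = core_sequences / main_sequences
print(f"{total_cultivar=:,} {total_input=:,} core share={core_fraction:.1%}")
```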
Quality assessment and completeness analysis
As a measure of assembly accuracy, the percentage of correctly assembled bases was obtained by mapping Illumina reads back to the cultivar-specific initial transcripts using the STAR v.2.6.1d RNA-seq aligner with default parameters (Table 3). To assess the quality of the transcriptomes via size-based and reference-based metrics, we ran TransRate v1.0.1 [28] on the cultivar-specific transcriptomes before and after filtering (Table 4). Comparative metrics for cultivar-specific coding sequences (CDS) were obtained using Conditional Reciprocal Best BLAST (CRBB) [29] against the merged DM gene model coding sequences.
To estimate completeness and determine the duplicated fraction of the assembled transcriptomes (cultivar-specific before and after filtering, and the pan-transcriptome), BUSCO v3 [30] scores were calculated using the embryophyta_odb9 [26] lineage data (Table 5). At the cultivar-specific transcriptome level, the most diverse dataset in terms of tissues and experimental conditions (cv. Désirée) resulted in the highest BUSCO score, as expected. The success of the classification into main (representative) and alternative transcripts is evident from the pan-transcriptome BUSCO scores (i.e. the differences in single-copy and duplicated BUSCOs between the representative and alternative datasets). The highest number of fragmented BUSCOs is observed for breeding clone PW363, which can probably be attributed to it having the highest number of short-contig assemblies. Furthermore, we presume that the long-read assembly contributed to the shift in favour of single-copy BUSCOs for cv. Rywal (Table 5), as it did in favour of uniquely mapped reads (Table 3).
To inspect the quality of paralogue cluster assignments, multiple sequence alignments were conducted using MAFFT v7.271 [31] on representative and alternative sequences from paralogue clusters that contained 8-16 sequences and included sequences from each of the four genotypes (Désirée, PW363, Rywal and DM). Alignments were visualized using MView v1.65 [32] (Figure 5, Supplementary File 2). These alignments served as our final quality check of the de novo transcriptome assemblies and also helped us optimise the pipeline. The alignments within groups showed differences that can be attributed to biological diversity, e.g. SNPs and INDELs as well as alternative splicing (Figure 5). We nevertheless advise users to check the available MSA for their gene of interest, as some misassemblies might still occur (Supplementary File 2, Supplementary scripts 4).
Re-use potential
Insights into variability of potato transcriptomes
Based on the comparison of cultivar-specific transcriptomes, we identified approximately 23,000, 13,000, and 7,500 paralogue groups of transcripts in cv. Désirée, breeding clone PW363 and cv. Rywal, respectively, that are not present in the merged Phureja DM gene model. The addition of the Iso-Seq dataset in the case of cv. Rywal confirms that long reads reduce fragmentation of the de novo transcriptome. We therefore recommend generating at least a subset of the data with a long-read technology to complement short-read RNA-Seq. As the reduction rates show (24-fold for PW363 versus 38-fold for Rywal), producing additional short-read assemblies does not contribute as much to the quality of a transcriptome as having several tissues or a combination of second- and third-generation sequencing.
Of all four genotypes, cv. Désirée has the highest number of cultivar-specific representative transcripts, which can be attributed to it having the most diverse input dataset for the de novo assemblies in terms of tissues sequenced (stem, seedlings and roots) and experimental conditions covered. Cv. Désirée also benefitted from the inclusion of a DSN Illumina library to capture transcripts expressed at low levels. However, even the leaf-specific reference transcriptomes of cv. Rywal and breeding clone PW363 include thousands of specific genes, indicating that cultivar-specific gene content is common. When inspecting paralogue groups of transcripts, we identified several interesting features that demonstrate the variability of sequences in potato haplotypes and the presence of alternative splicing variants that contribute to the pan-transcriptome (Figure 5, Supplementary File 2).
It should be noted that the reconstructed transcriptomes also include the meta-transcriptome stemming from microbial communities present in the sampled potato tissues. We decided not to filter out these transcripts. Inclusion of meta-transcripts makes it possible to also investigate the diversity of plant-associated endo- and epiphytes. The majority of these microbial transcripts will have microbial annotations, facilitating their future removal when necessary for other experiments.
Cultivar-specific transcriptomes can improve high-throughput sequencing analyses
Most gene expression studies have been based on either potato UniGenes assembled from a variety of potato expressed sequence tags (e.g. StGI, POCI) or the reference DM genome transcript models. Studies based on any of these resources have provided useful information on potato gene expression, but each has major drawbacks.
When using the DM genome as a reference for mapping RNA-Seq reads, the potato research community faces the existence of two overlapping, but not identical, gene model predictions. When using either of the available GFFs, we were missing some of the genes known to be encoded in the assembled scaffold. The newly generated merged GFF helps to circumvent this problem. But even when using the merged DM-based GFF, cultivar-specific genes and variations are not considered. Differences in expression and important marker transcripts can therefore be missed. In addition, the computational prediction of DM transcript isoforms is incomplete and, in some cases, gene models are incorrectly predicted. On the other hand, the inherent heterogeneity and redundancy of UniGenes or similar combined transcript sets cause short reads to map to multiple transcripts and thus make the interpretation of results more difficult. The cultivar-specific transcriptomes presented here are an improvement, as they include some expressed transcripts that are not present in the reference genome and are less redundant than UniGene sets. This holds even more for other applications of high-throughput sequencing, such as sRNA-Seq, Degradome-Seq or ATAC-Seq, because these require detailed information on the variability of transcripts within a locus, which is now available.
Cultivar-specific transcriptomes may also help improve mass spectrometry-based proteomics. A more comprehensive database of expressed proteins gives peptide-spectrum matching algorithms a better chance of obtaining a significant match, thus enhancing the detection and sensitivity of protein abundance measurements [33].
Using transcriptomes to inform qPCR amplicon design
Alignments of transcript coding sequences from a StPanTr paralogue cluster can inform qPCR primer design: to study the expression of specific isoforms or cultivars, primers can be placed in variable regions of the transcripts (Figure 5). Conversely, when qPCR assays need to cover multiple cultivars, the nucleotide alignments can be inspected for conserved regions suitable for primer design.
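As an illustration of how an alignment could be screened for candidate primer sites, the following Python sketch marks fully conserved columns of a multiple sequence alignment and reports conserved runs above a minimum length. It is a simplified example and not part of the published pipeline. The function names and the default window of 18 nt are assumptions, and real primer design would also need to consider melting temperature, GC content and specificity.

```python
from typing import List, Tuple


def conserved_columns(alignment: List[str]) -> List[bool]:
    """True for each column where all sequences carry the same non-gap base."""
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    flags = []
    for column in zip(*alignment):
        bases = {base.upper() for base in column}
        flags.append(len(bases) == 1 and "-" not in bases)
    return flags


def conserved_windows(alignment: List[str], min_length: int = 18) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) spans of conserved runs >= min_length."""
    spans = []
    run_start = None
    # The trailing False acts as a sentinel that closes a run ending at the last column.
    for i, conserved in enumerate(conserved_columns(alignment) + [False]):
        if conserved and run_start is None:
            run_start = i
        elif not conserved and run_start is not None:
            if i - run_start >= min_length:
                spans.append((run_start, i))
            run_start = None
    return spans


if __name__ == "__main__":
    toy_alignment = ["ACGTACGTAA", "ACGTACGTCA", "ACGTACGT-A"]
    print(conserved_windows(toy_alignment, min_length=4))  # [(0, 8)]
```

Columns outside the reported spans are the variable positions, which are the natural candidates when isoform- or cultivar-specific amplicons are needed.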
Conclusion
The transcriptomes present a valuable resource for different applications of high-throughput sequencing analyses as well as for proteomics studies. They will also be a crucial tool to support the breeding programmes of cultivated potato. In future, when new potato RNA-seq datasets become available, the cultivar-specific transcriptomes can be improved and expanded.
Availability of source code and requirements
Project ID: _p_stRT
Project home page: https://fairdomhub.org/projects/161;
Local project data management using “pISA-tree: Standard project directory tree” (ISA-tab compliant) https://github.com/NIB-SI/pISA
pISA-tree – FAIRDOMHub API usage in R: https://github.com/NIB-SI/pisar
Operating systems: Fedora v23, Linux Mint v18.2, Windows 7/8/10
Programming languages: Bash, Perl, Python, R/Markdown
License: GPLv3
Availability of supporting data and materials
The GFF file with merged ITAG and PGSC gene models for the S. tuberosum group Phureja DM genome v4.04 is available at the project home page, as are the cultivar-specific and pan-transcriptome assembly FASTA and annotation files, the custom code described in the manuscript, intermediate and processed data, and all other supporting information that enables reproduction and re-use.
Supplementary tables
Supplementary Table S1 - Detailed sample information table used to generate the de novo transcriptome assemblies. Raw and processed reads summary. Layer _p_stRT/_I_STRT/_S_01_sequences, DOI: 10.15490/fairdomhub.1.datafile.3090.1
Supplementary Table S2 - Detailed de novo assemblies information table. Primary potato transcriptome assemblies summary listing parameters used for short reads de novo assembly generation. Layer _p_stRT/_I_STRT/_S_02_denovo, DOI: 10.15490/fairdomhub.1.datafile.3091.1
Supplementary Table S3 - Désirée biological evidence filtering results. Output of 1st filtering step by biological evidence for cv. Désirée. Layer _p_stRT/_I_STRT/_S_03_stCuSTr/_A_03.1_filtering, DOI: 10.15490/fairdomhub.1.datafile.3110.1
Supplementary Table S4 - PW363 biological evidence filtering results. Output of 1st filtering step by biological evidence for breeding clone PW363. Layer _p_stRT/_I_STRT/_S_03_stCuSTr/_A_03.1_filtering, DOI: 10.15490/fairdomhub.1.datafile.3111.1
Supplementary Table S5 - Rywal biological evidence filtering results. Output of 1st filtering step by biological evidence for cv. Rywal. Layer _p_stRT/_I_STRT/_S_03_stCuSTr/_A_03.1_filtering, DOI: 10.15490/fairdomhub.1.datafile.3112.1
Supplementary figures
Supplementary Figure 1 - Number of transcripts from de novo assemblies contributing to the breeding clone PW363 transcriptome and number of complete BUSCOs found in the assemblies. Proportion of all contigs in de novo assembly (blue bars) and proportion of EvidentialGene okay set (green bars), and the number of complete BUSCOs (dots) using embryophyta_odb9 set are shown. Assembly software abbreviations: CLCdn - CLC Genomics Workbench, Vdn - Velvet, Sdn - SPAdes. Layer _p_stRT/_I_STRT/_S_03_stCuSTr/_A_02.2_assembly-contribution-count, DOI: 10.15490/fairdomhub.1.datafile.3108.1
Supplementary Figure 2 - Number of transcripts from de novo assemblies contributing to the cultivar Rywal transcriptome and number of complete BUSCOs found in the assemblies. Proportion of all contigs in de novo assembly (blue bars) and proportion of EvidentialGene okay set (green bars), and the number of complete BUSCOs (dots) using embryophyta_odb9 set are shown. Assembly software abbreviations: CLCdn - CLC Genomics Workbench, Vdn - Velvet, Sdn - SPAdes, PBdn - PacBio. Layer _p_stRT/_I_STRT/_S_03_stCuSTr/_A_02.2_assembly-contribution-count, DOI: 10.15490/fairdomhub.1.datafile.3109.1
Supplementary Figure 3 - Venn diagram showing the overlap of paralogue clusters in cultivar-specific transcriptomes and merged DM gene model. Representatives and alternatives of the StPanTr paralogue cluster are counted. For Phureja, the merged ITAG and PGSC DM gene models were counted. Layer _p_stRT/_I_STRT/_S_04_stPanTr/_A_04-MSA, DOI: 10.15490/fairdomhub.1.datafile.3096.1
Supplementary files
GFF - merged GFF. The GFF file with merged ITAG and PGSC gene models for S. tuberosum group Phureja DM genome v4.04. Layer _p_stRT/_I_STRT/_S_04_stPanTr/_A_01_evigene_1_3cvs-gffmerged, DOI: 10.15490/fairdomhub.1.datafile.3114.1
Supplementary HTML 1 - Multiple sequence alignments using MAFFT v7.271 and MView v1.65. Paralogue cluster on representative and alternative sequences, at least one from each of the four genotypes. Clusters containing 8-16 sequences. Layer _p_stRT/_I_STRT/_S_04_stPanTr/_A_04_MSA, DOI: 10.15490/fairdomhub.1.datafile.3116.1
Supplementary scripts
in-house scripts 1 - Evidence filtering. Corresponding scripts for biological evidence filtering step. Layer _p_stRT/_I_STRT/_S_03_stCuSTr/_A_03.1_filtering, DOI: 10.15490/fairdomhub.1.datafile.3117.1
in-house scripts 2 - stCuSTr paralogue clusters. Corresponding scripts for StCuSTr post-filtering main (non-redundant) and alternative classes reassignment step. Layer _p_stRT/_I_STRT/_S_03_stCuSTr/_A_03.2_components, DOI: 10.15490/fairdomhub.1.datafile.3118.1
in-house scripts 3 - stPanTr paralogue clusters. Corresponding scripts for StPanTr main and alternative transcripts classification step. Layer _p_stRT/_I_STRT/_S_04_stPanTr/_A_02_components_1_3cvs-gffmerged, DOI: 10.15490/fairdomhub.1.datafile.3119.1
in-house scripts 4 - MSA. Corresponding scripts for MSA step. Layer _p_stRT/_I_STRT/_S_04_stPanTr/_A_04_MSA, DOI: 10.15490/fairdomhub.1.datafile.3120.1
Declarations
List of abbreviations
- BUSCO: Benchmarking universal single-copy orthologs;
- CDS: Coding sequence;
- CLC: CLC Genomics Workbench;
- CRBB: Conditional Reciprocal Best BLAST;
- cv.: cultivar;
- DSN: Duplex-specific nuclease;
- EST: Expressed sequence tag;
- Iso-Seq: Isoform sequencing;
- ITAG: International Tomato Annotation Group;
- main: representative, non-redundant;
- NR: Non-redundant;
- ORF: Open reading frame;
- PacBio: Pacific Biosciences Iso-Seq sequencing;
- PE: paired-end;
- PGSC: Potato Genome Sequencing Consortium;
- qPCR: Quantitative polymerase chain reaction;
- RNA-Seq: RNA-Sequencing;
- SE: single-end;
- SRA: NCBI Sequence Read Archive;
- StGI: Solanum tuberosum gene indices;
- StPanTr: Solanum tuberosum pantranscriptome;
- Tr: transcriptome;
- tr2aacds: “transcript to amino acid coding sequence” Perl script from EvidentialGene pipeline;
Competing Interests
The authors declare no competing interests.
Funding
This project was supported by the Slovenian Research Agency (grants P4-0165, J4-4165, J4-7636, J4-8228 and J4-9302), COST actions CA15110 (CHARME) and CA15109 (COSTNET).
Authors’ Contributions
M.P., K.G., Ž.R. and M. Zagorščak participated in study design and evaluation of transcriptomes. A.C. provided Rywal Illumina-sequenced samples. M.P. collected and pre-processed RNA-Seq datasets and produced CLC, Velvet/Oases and rnaSPAdes de novo assemblies. M. Zouine produced Trinity assemblies. E.T. processed Iso-Seq data. M.P. and M. Zagorščak ran tr2aacds scripts and transcriptome annotation, filtering and evaluation software. Ž.R. merged PGSC and ITAG gene models of the reference potato genome. Ž.R. and M. Zagorščak generated the pan-transcriptome. K.G. secured funding and managed the project. M.P., M. Zagorščak, Ž.R. and K.G. wrote and edited the manuscript. S.S. helped with interpretation of EvidentialGene results and provided advice on filtering of transcriptomes and language editing. All authors have read and commented on the manuscript and approved the final submission.
Authors’ information
A.C.: Anna Coll, Anna.Coll{at}nib.si
K.G.: Kristina Gruden, Kristina.Gruden{at}nib.si
M.P.: Marko Petek, Marko.Petek{at}nib.si
Ž.R.: Živa Ramšak, Ziva.Ramsak{at}nib.si
S.S.: Sheri Sanders, ss93{at}iu.edu
E.T.: Elizabeth Tseng, etseng{at}pacificbiosciences.com
M. Zagorščak: Maja Zagorščak, Maja.Zagorscak{at}nib.si
M. Zouine: Mohamed Zouine, mohamed.zouine{at}ensat.fr
Acknowledgements
We thank Robin Buell for potato PGSC-ITAG gene model reference table, Thomas Doak and Don Gilbert for advice on using EvidentialGene scripts, Henrik Krnec for BLAST output parser, Andrej Blejec for FAIRDOMhub API usage in R and Špela Baebler for provided DSN Illumina-sequenced samples.
Footnotes
* maja.zagorscak{at}nib.si