Evaluation of rubber tree transcriptome and discovery of SNP and SSR from candidate genes involved in cellulose and lignin biosynthesis

Hevea brasiliensis (the rubber tree) is a well-known species with high economic value, and it is the primary source of natural rubber globally. Increasing demand for furniture and related industries has made rubberwood production as important as latex production. Molecular markers such as Single Nucleotide Polymorphisms (SNPs) and Simple Sequence Repeats (SSRs) are widely used for Marker Assisted Selection (MAS) which can be detected in large quantity by transcriptome sequencing. MAS is thought to be a useful method for the development of new rubberwood clones for its shorter breeding cycle compared to a conventional breeding procedure. In this study we performed RNA sequencing (RNA-seq) on four H. brasiliensis clones (RRIM 712, RRIM 2025, RRIM 3001 and PB 314) from three tissues including bark, latex and leaf samples to identify SSRs and SNPs associated with wood-formation related genes. The RNA sequencing using the Illumina NextSeq 500 v2 platform, generated 1,697,491,922 raw reads. A total of 101,269 transcripts over 400 bp in size were obtained and similarity search of the non-redundant (nr) protein database returned 83,748 (83%) positive BLASTx hits. The transcriptome analysis was annotated using the NCBI NR (National Center for Biotechnology Information Non-Redundant), UniProtKB/Swiss-Prot, gene ontology (GO), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. Differential expression analysis between later-timber rubber clone and non-later-timber rubber clone on wood-formation related genes, showed genes encoding phenylalanine ammonia-lyase (PAL), caffeic acid O-methyltransferase (COMT) and cinnamoyl-CoA reductase (CCR) were highly up-regulated in a latex-timber rubber clone. In total, about 3,210,629 SNPs and 14,956 SSRs were detected with 1,786 SNPs and 31 SSRs were found for wood-formation biosynthesis of H. brasilensis from 11 lignin and cellulose gene toolboxes. After filtering and primer selection, 103 SNPs and 18 SSR markers were successfully amplified and could be useful as molecular tool for marker assisted breeding to produce new timber rubber clones.


Introduction
Hevea brasiliensis (Wild.) Muell-Arg. is a tree native to the Brazilian Amazon region and from the Euphorbiaceae family [1]. This species is the most abundant species of the genus, with the largest production capacity, accounting for about 99% of all natural rubber produced globally and with the greatest genetic variability [2]. In the past, a greater emphasis was put on production of only high-latex-yielding clones, giving rise to a spectacular increase in yield. However, the increasing demand for rubberwood furniture has given rise to a new identity for the rubber industry [3]. Rubberwood furniture (advertised as 'environmentally friendly') [4], which can be produced from latextimber clones, requires considerable growth time using conventional breeding [5]. Due to its relatively low cost and light colour, rubberwood furniture has gained wide acceptance amongst domestic and foreign consumers and is now accepted as being a viable alternative to other timber species [6].

cDNA library construction and sequencing
The high quality of extracted RNAs with OD A260/A230 value of more than 2.0, OD A260/A280 value of between 1.8 to 2.0 and concentration of more than 100ng/uL were used to construct paired-end Illumina mRNA libraries using the Illumina NextSeq 500 v2 Kit. Each sample was sequenced in multiple HiSeq2000 lanes using the NextSeq 500 v2 Cycle Kit (Illumina, San Diego, CA) in order to obtain 2 x 75-bp reads.

Raw data filtering
Raw reads generated in FASTQ format (obtained from Illumina platforms) were analysed using FastQC software [9] resulting in clean reads which will be used for mapping and assembly. Clean reads were obtained by filtering the raw reads based on these steps that include: 1) removing reads containing adaptors; 2) removing reads containing ambiguous nucleotides; 3) trimmed low-quality reads to exclude poor-quality reads. FastQC produces end results as clean reads, and these were to be used for further analysis.

Transcriptome mapping and assembly
Once clean reads were obtained, they were mapped onto the draft H. brasiliensis genome [10] (accession: PRJDB4387) by Bowtie2, via the DDBJ/EMBL/GenBank BioProject database, using TopHat (version 2.1.0) software [11]. The reads were mapped onto the genome with default parameters. The mapped reads were then assembled using Cufflink v2.2.1, with default parameters. To obtain isoforms with a high degree of confidence, only those assembled transcripts with a value of FPKM > 0 were retained for further analyses. The assembled transcript sequence generated by Cufflink from the rubber-genome sequence was then extracted. Unmapped reads were extracted from the unmapped BAM file. De novo assembly of the reference reads was performed using Trinity v2.6.5 [12] with default parameters in order to generate the reference transcriptome. Unmapped reads were then mapped to the reference transcriptome using TopHat (version 2.1.0) software [11]. The assembled sequences generated by Cufflink from genome mapping and reference-transcriptome mapping using TopHat were combined. Finally, Cd-hit software was used [13] to combine assembled sequences from both sources and to reduce redundancy of the sequences. The output transcripts were considered as Hevea reference sequences.

Gene annotations
Basic annotations were executed that include protein functional annotation following searches of several databases, namely the NCBI NR (National Center for Biotechnology Information Non-Redundant) database, UniProtKB/Swiss-Prot (Universal Protein Resources) database and KEGG (Kyoto Encyclopedia of Genes and Genomes) database. Subsequently we performed GO (Gene Ontology) functional annotation of unigene using Blast2GO software [14], set up with an e-value cut-off at 1 e-10 . The functional classification of unigene was done by WEGO (version 2.0) software [15].

SSR markers detection and primer designing
We identified total SSRs and SSRs from lignin and cellulose genes toolboxes within each rubber genotypes using the microsatellite prediction (MISA) program (http://pgrc.ipk-gatersleben.de/misa/). Once SSRs were identified, the corresponding primers were designed by using the following parameters: (1) motif ranged from dinucleotides to pentanucleotides (2) minimum repeat unit was five for all nucleotide motifs (3) the maximum interruption lengths between two SSR markers was set up as 100bp. We used the program Primer3 (http:/primer3.ut.ee/) to design primers using the criteria of GC contents for primer sequences ranging from 40% to 60%.

SNP identification and primer designing
For SNP detection, we used SAMtools v1.3.1 [16] first to generate mpileup for one or multiple BAM files. Subsequently VarScan v2.3.9 [17] was then used to perform total SNPs and SNPs from lignin and cellulose genes toolboxes detection using default parameters. Reads were mapped against the reference transcriptome and filtered by ascertaining which DNA bases were different from the reference. Selected SNPs were required to have a read depth equal to or greater than 10, an SNP reads/total reads ratio equal to or greater than 0.25, a minimum Phred score of 20 and an SNP quality of 50. Similarly, we used Primer3 (http:/primer3.ut.ee/) to design primers as mentioned in section above.

Molecular-marker amplification and sequencing
The SSRs and SNPs loci associated with lignin and cellulose gene toolboxes were amplified. Amplified SSRs were sequenced to ascertain their workability. PCR amplifications for SSRs and SNPs loci were performed in 25-ul reactions that includes buffer solution, dNTP, MgCl and Taq polymerase using Mytaq Master Mix (Bioline). The PCR reaction condition is as follows: initial denaturation at 95ºC for 2 min, followed by 35 amplification cycles (30 s at 95ºC, 30 s at the specific annealing temperature and 1 min at 72ºC), and a final extension at 72ºC for 10 min. The PCR products were resolved via electrophoresis in 1.5% agarose gels prior to identifying the specific amplification SSRs and SNPs loci. The DNA sequencing reaction was performed by the service provider First Base company. The sequencing data were then visually inspected using Molecular Evolutionary Genetics Analysis (MEGA) X software [18].

Phylogenetic Analysis
The determination of genes relationships related to lignin biosynthesis between rubber clones was done by Polymerase Chain Reaction (PCR) process with two replicate from each clones. The DNA sequence were then sent for sequencing. The sequence data were aligned using MEGA X software [18]. The aligned sequences were then used as an input to construct phylogenetic tree using Maximum Parsimony (MP) approach. MP approach was selected due to the criterion ability to weighted equally and unordered the character state changes where the changes occur as early in relation to the root as possible. The evaluation of confidence values of the clades was carried out through ten thousand replicates of the bootstrap analysis. The phylogenetic tree analysis was done using MEGA X software [18]. The phylogenies were rooted using outgroup species from Mahihot esculenta available on NCBI database.

Transcriptome Analysis and Functional Annotation
In total, 1,697,491,922 raw reads were generated and trimmed to exclude low-quality reads. Of the 1,697,491,922 raw reads, 1,641,112,636 high-quality reads could be recovered after the trimming process. High-quality reads were then mapped onto the available draft Hevea genome (accession: PRJDB4387) and assembled to generate a total of 101,269 transcripts. The transcripts were considered as the reference transcriptome. The library of raw and clean reads is shown in S1  The 97,454 transcripts were also annotated using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, which is a database of resources representing the functions and interactions of genes, and biological system at molecular level [19]. The KEGG database presents information as collections of manually drawn pathway maps [20]. pathways were successfully mapped, including metabolism, genetic information processing, environmental information processing, cellular processes and organismal systems. Of these five categories, metabolism was the most highly represented (with 82.2%), followed by organismal systems, genetic information processing, environmental information and cellular processes (S3 Fig).

Identification of genes related to wood biosynthesis
In this study, only transcripts representing genes related to wood-formation biosynthesis were selected for SSRs  Table 2 shows the number of transcripts related to genes involved in lignin and cellulose biosynthesis.  (Table 3).

Table 3. Overview of SNPs identified from three tissues and four rubber clones.
A more detail search was conducted to identify SNPs from genes related to cellulose and lignin biosynthesis.
The lignin-biosynthesis pathway has seven gene toolboxes, while the cellulose-biosynthesis pathway has four gene toolboxes. A total of 1,786 SNPs was found from 11 lignin and cellulose gene toolboxes.  to be the most abundant type of SSR in plant genomes. There were 12,767 di-, 2,172 tri-, 14 tetra-and 3 pentanucleotide potential SSRs. "AG/CT" motif was found to be the highest proportion, followed by "AT/AT" motif and "AAG/CTT" motif within rubber transcriptome dataset (S2 Table).
A detail search was conducted to discover SSRs from genes related to lignin and cellulose biosynthesis. A total of 31 SSRs have been found related to lignin and cellulose biosynthesis genes. However, after filtering the redundancy of SSRs within sequences covered by primers, the number of working primers was reduced to 18. The 18 SSR primers were selected from several genes including PAL, 4CL, COMT, CCR, CesA, COBRA and KOR (Table 5).

SNP primers design and screening
SNPs found earlier from transcriptome datasets of genes related to lignin and cellulose biosynthesis were used to develop 764 primer pairs. However, after filtering the redundancy of SNPs within sequences covered by primers, the number of working primers was reduced to 151 primer pairs. All 151 SNP markers were amplified using conventional PCR with a set of rubber clones before they could be used as a marker for selection of a high-quality rubber clones. The validation process showed some positives results with approximately 68% primer pairs successfully amplified and amplicons were found producing expected band size on agarose gel. Of the 151 of SNP primers, 103 were successfully amplified SNP in the DNA sequence of rubber. Table 6 shows the total number of SNP primers generated and number of primers successfully amplified from genes related to lignin-and cellulose-biosynthesis pathways. A details list of SNPs primers can be found in S3 Table.

SSR primer design and screening
The 18 SSR markers related to lignin and cellulose-biosynthesis gene toolbox were then selected for amplification using conventional PCR with a set of rubber clones before they could be used as a marker for selection of a high-quality rubber clone. All 18 SSR markers were successfully amplified and amplicons obtained were of expected size visualized on agarose gel. A details list of SSRs primers can be found in S4 Table.

Phylogenetic analysis of genes related to lignin biosynthesis
In this study, phylogenetic analysis using DNA sequence can provide the evolutionary relationship between genes related to lignin biosynthesis among different rubber clones. Initially, the sequence alignment analysis was done to identify similarity regions of same genes from different taxa. The DNA sequence was initially compared to the online

Discussion
Investigation of the genes related to rubberwood biosynthesis involves gathering information through transcriptome datasets at a molecular level. Understanding natural rubber production at a molecular level can bring new insights to programmes for rubber breeding. H. brasiliensis transcriptome data showed the mean length of transcripts to be 1,046 bp (N50 was 1,313 bp), and 101,269 transcripts were recovered in total. The transcriptome data can be accepted since similarities were found with previous studies in terms of H. brasiliensis transcriptome analysis. In our study, the mean length of transcripts was found to be far greater than that reported by [21], who used RRIM600 as their specimens and found the mean length of 19,708 transcripts to be 676 bp. In contrast, we found that N50 had 837 bp. Moreover, Mantello et al. [20] reported a higher number of transcripts (152,416), with lengths ranging from 97 bp to 13,266 bp within GT1 and PR255 clones. However, the mean length of transcripts (536 bp) and N50 were lower (720 bp) than our data. In addition, [22] reported that about 93.85% of their clean reads were successfully recovered, generating 98,697 transcripts with a mean length of 1,437 and 2,127 bp for N50. Recently, [23] reported that their study of CATAS93-114 generated 100,100 transcripts, with N50 having 2,048 bp. in the UniProtKB/Swiss-Prot database. Our annotation analysis showed higher percentages than those of various other studies, as follows: [20]: 63% (NR database) and 47% (UniProtKB/Swiss-Prot database; [22]: 65% (NR database) and 46.9% (UniProtKB/Swiss-Prot database); [24]: 72.6% (NR database) and 55.2% (UniProtKB/Swiss-Prot database; and [25]: 66.3% (NR database) and 42.1% (UniProtKB/Swiss-Prot database). Moreover, annotation analysis using the GO database found that amongst the three main subontologies, biological processes were the most highly represented, followed by molecular functions and cellular components. Our results are also similar to those described by [21] and [26]. However, [24] suggested that cellular components were the most highly represented, followed by biological processes and molecular functions, while [25] reported that biological processes were the most highly represented ontology followed by cellular components and molecular functions. Furthermore, annotation analysis using the KEGG database showed that about 18% of the transcriptome dataset was successfully mapped onto 153 pathways. Most of the transcripts were found to be involved in metabolic processes, followed by organismal systems, genetic information processing, environmental information, and cellular processes. The percentage mapped using the KEGG database was similar to that described by [20], with 17.1% (137 KEGG pathways), and [22], with 15.4%. However, [24] and [25] reported that the total transcriptome dataset mapped to the KEGG database was slightly higher than our data, with 53.1% (123 pathways) and 39.9% (128 pathways) being successfully mapped.
A total of 3,210,629 SNPs and 14,956 SSRs associated with potentially important genes were identified from the transcriptome dataset. The high number of SNPs from transcriptome data indicated the high allelic variability present in rubber trees that is able to be detected using this NGS technology. Moreover, about 73% from all SNP markers developed from lignin and cellulose biosynthesis pathway were successfully amplified on the random rubber clones.
Our results have a higher density than those described by [20] (1 SNP per 125 bp), [27] (1 SNP per 178 bp) and [28] (1 SNP per 1.5 kb) for the rubber trees studied. Apart from that, the high percentage of working primers showed that SNPs are present in various rubber clones and may be involved in rubber tree physiologies. Furthermore, the 18 SSR primers which were successfully generated and amplified could differentiate and showed a high frequency of polymorphisms between rubber clones. Although SNP markers are the most abundant type of DNA polymorphism in genomic sequences, SSR together with SNP could be the best candidate markers for genetic breeding and to investigate the variability of genes related to rubberwood biosynthesis in rubber trees.
Usage of biomarkers in breeding programmes could help increase the chances of producing new high-yield rubber clones without a high cost and in a shorter time. These primers can be used as a marker to get information about whether new rubber clones have potentially high wood volumes and can be grown and supplied to rubberwood industries. These markers can also be used by rubber breeders to obtain information about newly bred trees as young as one or two years old, and this could play a vital role in rubber breeding programmes.

Conclusions
The use of biomarkers in rubber breeding programmes (including using single-nucleotide polymorphisms (SNPs) or simple-sequence repeats (SSRs) could help rubber breeders determine the characteristics of new rubber clones in less time, without having to wait 30 years for a clone to be released. Using conventional polymerase chain reaction (PCR) in the validation process could help achieve targets faster than ever. Successfully developed 103 SNPs and 18 SSR primer sets could be used in genetic diversity studies and rubber breeding programmes focusing on rubberwood formation. Future studies could, therefore, provide an insight into how rubberwood quality and production processes could be improved.
Supporting information S1  Table. SNP primers for lignin and cellulose-biosynthesis related genes.
S4 Table. SSR primers for lignin and cellulose-biosynthesis related genes.