Introduction

Quantification of transcript abundance, usually performed by real-time PCR analysis2, underpins much of current plant molecular biology1. The technique relies heavily on a normalization step based on the transcript abundance of a so-called “reference” gene (or genes), which is expected a priori to be stably transcribed2,3. The transcription level of an ideal reference gene is temporally and spatially constant and is unaffected by external factors3,4. These conditions are seldom fulfilled, even by the most frequently used reference sequences, such as those encoding 18S rRNA, GAPDH, EF-1α, ubiquitin, actin, α-tubulin or phosphatase 2A. Various software packages, prominently geNorm5 and NormFinder6, have been developed to identify reference genes appropriate to particular experimental systems.

The diploid angiosperm species Arabidopsis thaliana has long served as the leading genomic model for dicotyledonous plants7,8. The volume of transcriptomic data generated in this species is very large, especially given the development of microarray and next generation sequencing technologies. Such data sets have helped to identify potential replacements for the reference genes conventionally used for the analysis of transcription in A. thaliana3.

Like most plant genera, Arabidopsis includes a number of both diploid and polyploid species. The appropriateness of key reference genes, chosen on the basis of their transcription behavior in A. thaliana, has yet to be tested in other Arabidopsis accessions, especially those with different ploidy levels. A comparison has recently been published between the transcriptomes of 19 diploid A. thaliana accessions9, chosen to maximize the range of intraspecific phenotypic variation, along with some tetraploid Arabidopsis spp. entries (A. thaliana 1001 Genomes Project)10. Here, we have exploited these data to identify an appropriate set of reference genes for normalization of transcriptomic data derived from both A. thaliana and some tetraploid Arabidopsis spp.

Results

Reference gene transcription among A. thaliana accessions

We downloaded 38 RNA sequencing data sets from the 19 diploid A. thaliana founder accessions of the MAGIC (Multiparent Advanced Generation Inter-Cross) population and all 11 available RNA sequencing data sets from 5 Arabidopsis tetraploids (3 allo- and 2 autotetraploids) from the Gene Expression Omnibus (GEO) database. A set of five conventional (GAPDH, ACT2, UBQ10, UBC, EF-1α) and 16 potentially informative novel genes, previously derived from microarray data in diploid A. thaliana Col-0, was assembled; only genes whose transcription was recorded in at least 80% of the entries were selected (Table 1)3. The transcript abundance of these genes was measured as RPKM (reads per kilobase of exon model per million mapped reads). Among the 21 genes assessed in the 19 A. thaliana accessions, the four most abundant were EF-1α (AT5G60390, RPKM: 901.2 ± 18.4, mean ± SD, the same below), GAPDH (AT1G13440, 876.6 ± 24.2), UBQ10 (AT4G05320, 242.7 ± 7.4) and ACT2 (AT3G18780, 210.3 ± 5.4). Transcript abundance of 11 of the 21 genes was only moderate (RPKM ranging from 14.8 ± 0.4 to 97.9 ± 2.4) and that of the remaining six genes was relatively low (RPKM ranging from 0.84 ± 0.05 to 8.19 ± 0.19) (Fig. 1 and Supplementary Data S1, Online Resource). Inter-accession variation in RPKM was typically limited, with a small number of the accessions being largely responsible for the variation present.
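
As an illustration of the RPKM measure used above, the following minimal Python sketch computes it from a raw read count, assuming the standard definition (reads per kilobase of exon model per million mapped reads). The function name and the example numbers are hypothetical and are not taken from the data sets analyzed here.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads.

    read_count: reads mapped to the gene's exon model
    exon_length_bp: summed exon length of the gene, in base pairs
    total_mapped_reads: all mapped reads in the library
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and library size must be positive")
    # 1e9 = 1e3 (bp per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical values, for illustration only.
example_rpkm = rpkm(read_count=500, exon_length_bp=2000,
                    total_mapped_reads=10_000_000)
print(example_rpkm)
```

Because both gene length and library size appear in the denominator, RPKM values can be compared between genes of different length and between libraries of different sequencing depth.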

Table 1 Reference gene primer sequences in Arabidopsis3
Figure 1

Transcript abundance (given in RPKM) of a set of 21 potential reference genes in A. thaliana and some tetraploid Arabidopsis accessions.

Note, A. t × A. a = A. thaliana × A. arenosa.

Stability was assessed using the geNorm and NormFinder software. The former associates a stability value (M) with each potential reference gene, where a low M reflects stability and a high M instability5. An M value of < 0.5 is conventionally accepted for a reference gene, whereas genes with a high M value (≥ 0.5) should be avoided11. NormFinder ranks genes according to the similarity of their transcript abundances, applying a model-based approach6. The geNorm analysis showed that the M value of 19 of the 21 candidate genes was < 0.5, suggesting that any one (or more) of them was appropriate as a reference sequence. The most stable sequences were AT5G46630 (encoding a clathrin adaptor complex subunit), AT1G13320 (PP2A subunit), AT4G26410 (uncharacterised conserved protein), AT5G60390 (EF-1α) and AT5G08290 (mitosis protein YLS8) (Fig. 2a). NormFinder analysis resulted in a similar set of stable genes: AT1G13320 and AT5G46630 emerged as the most stable (Table 2). Nevertheless, even these reference genes could be unstable between particular pairs of samples. For example, the RPKM of AT1G13320 in one of the diploid WU and TSU samples was 1.3 times that in the other, whereas AT4G34270, which was identified by both geNorm and NormFinder as a relatively unstable sequence, had almost the same transcript abundance in the two samples (RPKM WU/TSU = 14.59/14.06). The pairwise variation (V) metric generated by geNorm was informative with respect to determining the optimal number of reference genes necessary for accurate normalization. A Vn/Vn+1 value of < 0.15 indicates that an additional reference gene is not required5,12,13. Across the 19 diploid accessions, this metric was below the threshold for the inclusion of a third reference sequence for each initial pair chosen (Fig. 2b).
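
The two geNorm measures referred to above can be sketched in Python as follows, assuming their commonly described definitions: M for a gene is the mean, over all other candidate genes, of the standard deviation across samples of the log2 abundance ratio, and Vn/Vn+1 is the standard deviation of the log2 ratio of successive normalization factors (the geometric means of the n and n + 1 top-ranked genes). This is an illustrative reimplementation, not the geNorm software itself: it omits the stepwise exclusion of the least stable gene, uses the sample standard deviation by assumption, and requires strictly positive abundances. All function names and the toy data are hypothetical.

```python
import math
import statistics


def genorm_m(expr):
    """expr maps gene -> list of positive abundances (same sample order).

    Returns gene -> M, the mean SD of log2 ratios against every other gene.
    """
    m = {}
    for j in expr:
        sds = []
        for k in expr:
            if k == j:
                continue
            log_ratios = [math.log2(a / b) for a, b in zip(expr[j], expr[k])]
            sds.append(statistics.stdev(log_ratios))
        m[j] = sum(sds) / len(sds)
    return m


def normalization_factor(expr, genes):
    """Per-sample geometric mean of the chosen reference genes."""
    n_samples = len(expr[genes[0]])
    return [
        math.exp(sum(math.log(expr[g][i]) for g in genes) / len(genes))
        for i in range(n_samples)
    ]


def pairwise_variation(expr, ranked_genes, n):
    """V(n/n+1): SD across samples of log2(NF_n / NF_{n+1})."""
    nf_n = normalization_factor(expr, ranked_genes[:n])
    nf_next = normalization_factor(expr, ranked_genes[:n + 1])
    return statistics.stdev(
        [math.log2(a / b) for a, b in zip(nf_n, nf_next)]
    )


# Toy data (hypothetical): geneA and geneB keep a constant ratio across
# three samples, whereas geneC does not follow them.
toy = {
    "geneA": [10.0, 20.0, 40.0],
    "geneB": [5.0, 10.0, 20.0],
    "geneC": [8.0, 8.0, 8.0],
}
m_values = genorm_m(toy)
ranked = sorted(m_values, key=m_values.get)  # most stable first
v_2_3 = pairwise_variation(toy, ranked, 2)
print(m_values, v_2_3)
```

In this toy case the gene that departs from the shared pattern receives the highest M, and a V2/3 above 0.15 would argue for considering the third gene under the threshold quoted in the text.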

Table 2 Ranking of reference genes in diploid Arabidopsis and their expression stability values calculated using NormFinder
Figure 2

Average expression stability values (M) and the pairwise variation (V) metric calculated by geNorm.

(a) M values of the 21 reference genes among 19 diploid A. thaliana accessions. A low value of M indicates greater stability. The seven columns shown in black refer to genes not transcribed in at least one tetraploid accession. (b) The pairwise variation (V) metric calculated to indicate the optimal number of reference genes required for normalization. A decrease in V indicates that the addition of the gene in question should improve normalization accuracy.

Differential stability in A. thaliana and tetraploid Arabidopsis

Although the reference genes behaved very consistently among the A. thaliana accessions, this was not the case in the tetraploids. Of the 21 reference genes, seven (AT1G62930, AT5G55840, AT4G38070, AT5G15710, AT3G53090, AT2G32170 and AT5G12240), all of low RPKM in the diploids, were not detectable in the allotetraploids (Fig. 1, Supplementary Data S1), so none of these would be suitable for normalization at the tetraploid level. Among the remaining 14 reference genes, six (AT5G46630, AT4G33380, AT4G26410, AT4G05320, AT3G01150 and AT1G13320) showed a significant difference in transcript abundance between diploid and tetraploid Arabidopsis accessions according to Tukey's test and Student's t test (P < 0.05) (Fig. 3a). In addition, the coefficient of variation (Cv) analysis suggested that the transcript abundance of these 14 genes was more variable in the tetraploids than among the diploid A. thaliana accessions, and that the maximum Cv value in each autotetraploid (A. t, 41.0%; A. a, 45.3%) was lower than those in the allotetraploids (A. s, 77.2%; A. t × A. a F1, 48.3%; A. t × A. a F8, 50.9%) (Fig. 3b). Hence, a separate list of suitable reference genes is needed for tetraploid Arabidopsis accessions. According to geNorm, ten of the 14 genes produced an M value of < 0.5; the most stable gene across all tetraploid accessions was AT5G46630, followed by AT4G26410, AT3G18780, AT3G01150 and AT1G13320 (Fig. 4). A similar stability ranking of the 14 tested genes was obtained using NormFinder (Table 3), i.e., the most stable gene was again AT5G46630, while AT1G13440 was the least stable. The stability of the best-performing genes among the tetraploid accessions, according to both software packages, was sufficient to allow them to be used as references in different tetraploid Arabidopsis accessions.

Table 3 Ranking of reference genes in tetraploid Arabidopsis and their expression stability values calculated using NormFinder
Figure 3

The comparison of transcript abundances and the coefficient of variation (Cv) of each reference gene between A. thaliana and tetraploid Arabidopsis accessions.

(a) The comparison of transcript abundances of 14 reference genes. Boxes indicate the 25th/75th percentiles, the line represents the median, squares represent the means and whiskers (plus/minus values) indicate the ranges of total samples. (b) The coefficient of variation (Cv). Cv values close to 0% indicate little deviation from the mean RPKM.

Figure 4

Average expression stability values (M) of the 14 reference genes among the 5 tetraploid Arabidopsis accessions, as calculated by geNorm.

Discussion

The feasibility of reference gene identification from RNA-seq

Reliable reference gene(s) are an important resource for the analysis of transcriptomic data. Microarray-based transcriptomic data are derived from the hybridization between cDNA and the probe set attached to the chip, and are subject to a degree of noise due to cross-hybridization and uneven hybridization efficiency14,15. In contrast, high throughput sequencing platforms such as RNA-seq generate a direct count of the number of transcript copies16 and this parameter is free of the hybridization artifacts which compromise the quality of microarray-based data17,18.

Reference gene stability in diploid A. thaliana

Based on microarray-acquired transcriptomic data, a set of 25 stably transcribed genes has been described for diploid A. thaliana Col-03,19. Of these, four were not well represented in the transcription profiles of the 19 diploid A. thaliana accessions used here because of their low abundance, which is consistent with the cDNA microarray results3,19. Although the 19 diploid accessions represent a phenotypically diverse set in which nearly 50% of genes showed a measure of differential transcription, the 21 genes assayed in the present study showed only a low degree of accession-to-accession variation in transcript abundance; thus each of them appears to be useful for normalization purposes9.

Nevertheless, the 21 genes differed somewhat in function. For example, AT5G46630, which encodes a clathrin adaptor complex subunit involved mainly in protein transport and cell division3,19, was more stable than the others. The PP2A subunit gene (AT1G13320) and EF-1α (AT5G60390) were also highly stable. All of these genes are highly conserved and play ubiquitous roles in protein translation and degradation, the binding of cytoskeletal proteins and several different signal transduction pathways in the cell3,19. In contrast, the most unstable genes were AT1G62930 and AT5G55840, both of which belong to the PPR (pentatricopeptide repeat) gene family, whose products are RNA-binding proteins that are particularly prevalent in terrestrial plants20. These proteins have a range of essential functions in post-transcriptional processes, including RNA editing, RNA splicing, RNA cleavage and translation within mitochondria and chloroplasts20. AT4G38070 was also unstable; it encodes a bHLH transcription factor, a class that has been functionally characterized in Arabidopsis, with roles including regulation of fruit dehiscence, phytochrome signaling, flavonoid biosynthesis, hormone signaling and stress responses21. Interestingly, most of the stable genes were related to the structure and normal physiological function of cells, whereas the unstable genes were not. Perhaps these stable reference genes play conserved roles in cell structure, cell division and physiological state during the normal growth of the diploid accessions.

We provide a ranking of reference genes that may be suitable for accurate normalization and quantification in gene expression studies of different diploid Arabidopsis accessions. However, no single reference gene, not even AT1G13320, was the most stable in all samples tested. It has been suggested that at least four reference genes should be used for a broad range of tissues or conditions3,19. Thus, although the ranking can guide the selection of reference genes, care needs to be taken when analyzing transcription across a variety of tissues or conditions, and the use of more than one gene is often advisable.

Variation in transcription of reference genes among tetraploid Arabidopsis accessions

It has been recognized that transcription profiles among tetraploid Arabidopsis accessions are more variable than within diploid A. thaliana10,22,23,24. This diversity was reflected in the set of 21 tested reference genes. For seven of the genes, no transcription was detected in the allotetraploid accessions. It must be pointed out, however, that the newly formed interspecific F1 hybrid between autotetraploid A. arenosa and A. thaliana was self-pollinated to generate the A. arenosa × A. thaliana F8, which is A. suecica-like and comparable with the natural allotetraploid A. suecica10,22, and that these seven reference genes all had a low RPKM in the diploids (Fig. 1). These genes may therefore enable better normalization and quantification of genes with low transcript levels in diploid Arabidopsis, but should be disregarded when selecting reference genes for A. suecica or A. suecica-like allopolyploid combinations. We also suppose that highly abundant, stably expressed reference genes (e.g. ACT225,26,27,28,29,30) are more suitable than stably expressed genes of low abundance for the normalization and quantification of highly expressed genes in tetraploids, considering the impact of PCR efficiency, specificity and/or yield in RT-PCR, and of noise and/or probe hybridization efficiency in microarrays2,3.

The statistical analysis also suggested that six of the 14 transcribed reference genes showed a significant difference in transcript abundance between diploid and tetraploid Arabidopsis accessions (Fig. 3a), and that the transcript abundance of the 14 genes was more variable in the tetraploids than among the diploid A. thaliana accessions (Fig. 3b). Mechanisms that differentially regulate the expression of these genes in diploids and polyploids may therefore exist3,30,31, and determining the mode of regulation of these genes in tetraploids will provide further information regarding the effect of polyploidization on gene expression.

Following polyploid formation in the plant kingdom, gene expression patterns have been observed to vary across a ploidy series; however, differences in autotetraploids are usually less pronounced than in allotetraploids32,33,34,35. In the present study, the maximum Cv values all occurred in allotetraploids. Autopolyploidy arises through the duplication of homologous genomes from a single species, whereas allopolyploidy describes the union of diverged genomes from different species31,32. Both allo- and autotetraploids can be affected by gene dosage generated by the interactions between homoeologous genes; the allotetraploids, however, are affected not only by gene dosage but also by the gene divergence associated with interspecific hybridization, which plays a major role in speciation and evolution32,33,36,37,38. These phenomena are likely to underlie the poorer stability of the reference genes in tetraploids in comparison to their highly reliable behavior in diploids32,38,39,40.

Interestingly, the most stable of the 14 reference genes detectable in both allo- and autotetraploids was again AT5G46630, matching its ranking in the diploid accessions. AT4G26410 (second most stable in the geNorm analysis) encodes an uncharacterized conserved protein related to cell structure41, while ACT2 (AT3G18780, third most stable in the geNorm analysis) encodes a structural constituent of the cytoskeleton. The stable reference genes in the allo- and autotetraploids are thus mainly involved in cell structure, the cytoskeleton, protein transport and cell division. Notably, ACT2 is the most commonly used reference gene in polyploid studies26,27,28,29,30, owing not only to its stable expression but also to its similar expression level in diploids and tetraploids (Fig. 3a). Nevertheless, among different tetraploid individuals, phenotypic and genotypic variation and diversity in photosynthetic efficiency (such as heterosis) may destabilize the expression of associated genes (such as GAPDH, the least stable). Hence, studies on the effect of (both allo- and auto-) polyploidy on gene expression are possible; however, the genetic basis of ploidy changes in the analyzed materials cannot be ignored.

Methods

Transcriptomic data

All data sets were obtained from the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/). Transcriptomic sequences of the 19 diploid A. thaliana accessions were acquired from a MAGIC population (GEO accession: GSE30720; data set names: GSM762074-GSM762111; a total of 38 transcriptomic data sets; plant materials: seedlings of Col, Bur, Can, Ct, Edi, Hi, Kn, Ler, Mt, No, Oy, Po, Rsch, Sf, Tsu, Wil, Ws, Wu and Zu; RNA was collected over a three-week period; all plants were grown under a 16/8-h light/dark cycle). The tetraploid data set comprises 5 tetraploids (GEO accession: GSE29687; data set names: GSM736442-GSM736446; a total of 11 transcriptomic data sets): autotetraploid A. thaliana (A. t; ABRC, CS3900) and A. arenosa (A. a; ABRC, CS3901), two hybrids (A. t × A. a F1 and A. t × A. a F8) between autotetraploid A. thaliana and A. arenosa, and the natural allotetraploid A. suecica (A. s; ABRC, CS22508) (total RNA from 3–4-week-old A. thaliana and 6–7-week-old A. arenosa or allotetraploids; all plants were grown under a 16/8-h light/dark cycle).

Selection of candidate reference genes

A set of five conventional (GAPDH, ACT2, UBQ10, UBC, EF-1α) and 16 potentially informative novel genes was assembled, each with transcription recorded in at least 80% of the entries; the selected genes covered a wide range of absolute expression levels (Table 1)3. Notably, these novel reference genes were chosen from hundreds of Arabidopsis genes previously shown to outperform traditional reference genes in terms of expression stability throughout different developmental stages, organs, tissues and genotypes and under a range of environmental conditions in diploid Col-03. Moreover, these novel genes were all amenable to the design of gene-specific PCR primers using a standard set of design criteria (e.g., location in the 3′-untranslated region, primer Tm = 60 ± 1°C, length 18 to 25 bases, GC content between 40 and 60%, and a unique, short PCR product of 60 to 150 bp), which enabled better normalization and quantification of transcript levels in Arabidopsis2.
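
The numerical primer design criteria listed above can be expressed as a simple filter. The Python sketch below is hypothetical: the function names and the example primer are invented for illustration, the melting temperature is taken as an input because the source does not state how Tm was calculated, and the 3′-untranslated region and product uniqueness criteria are not checked.

```python
def gc_percent(seq):
    """GC content of a nucleotide sequence, as a percentage."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def passes_design_criteria(primer, tm_celsius, amplicon_bp):
    """Check one primer against the numerical criteria quoted in the text.

    tm_celsius must be supplied by the caller (calculation method assumed
    to be chosen elsewhere); amplicon_bp is the expected product length.
    """
    return (
        18 <= len(primer) <= 25                 # length 18 to 25 bases
        and 40.0 <= gc_percent(primer) <= 60.0  # GC content 40 to 60%
        and abs(tm_celsius - 60.0) <= 1.0       # Tm = 60 +/- 1 degrees C
        and 60 <= amplicon_bp <= 150            # product of 60 to 150 bp
    )


# Invented 20-nt primer with 50% GC, for illustration only.
demo_primer = "ATGCATGCATGCATGCATGC"
print(passes_design_criteria(demo_primer, tm_celsius=60.2, amplicon_bp=100))
print(passes_design_criteria(demo_primer, tm_celsius=60.2, amplicon_bp=200))
```

Keeping Tm as an argument avoids committing the sketch to any particular melting temperature model.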

Stability was assessed using the geNorm and NormFinder software. The former associates a stability value (M) with each potential reference gene, where a low M reflects stability and a high M instability5. An M value of < 0.5 is conventionally accepted for a reference gene, whereas genes with a high M value (≥ 0.5) should be avoided11. NormFinder ranks genes according to the similarity of their transcript abundances, applying a model-based approach6. The results represent the means ± SD. The statistical differences between diploid and tetraploid Arabidopsis accessions were analyzed according to Tukey's test and Student's t test (P < 0.05). A coefficient of variation (Cv) was calculated according to the following formula:

Cv = (SD / mean RPKM) × 100%

A Cv close to 0 is produced when the RPKM (reads per kilobase of exon model per million mapped reads) value of a given gene deviates non-significantly from the mean RPKM19.
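
The coefficient of variation can be sketched in Python as follows, assuming the usual definition (standard deviation divided by the mean, expressed as a percentage) and the sample standard deviation; whether the sample or population form was used in the analysis is not stated, so this choice is an assumption. The function name and the example values are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Cv in percent: 100 * SD / mean (sample SD, by assumption)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("Cv is undefined when the mean is zero")
    return 100.0 * statistics.stdev(values) / mean


# Hypothetical RPKM values of one gene across three samples.
cv_variable = coefficient_of_variation([10.0, 12.0, 14.0])
cv_constant = coefficient_of_variation([12.0, 12.0, 12.0])
print(cv_variable, cv_constant)
```

A gene with identical RPKM in every sample gives a Cv of 0%, and larger values indicate greater dispersion relative to the mean.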