Shortlisting genes important in seed maturation by principal component analysis of gene expression data

Transcriptome data are widely used for functional analysis of genes. De novo assembly of a transcriptome yields a large number of unigenes, a large proportion of which remain unannotated. Efficient computational methods are required for identifying genes and modeling their regulatory and functional roles. Principal component analysis (PCA) was used in a novel approach to shortlist genes from genome expression data independently of annotation, taking seed development in Arabidopsis thaliana as a representative case. PCA was applied to published genome expression data from four Arabidopsis lines carrying mutations that affect seed development. The PC separating all the developmental stages between a mutant and its respective wild type was selected for shortlisting genes as functionally more important. The genes shortlisted by PCA belong to a number of biological functions. Genes previously reported to confer sensitivity to desiccation were identified by PCA only in the desiccation intolerant lines. With respect to the network of 98 genes targeted by ABI3, a higher number of genes was identified as important in the mutants abi 3-5, fus 3-3 and lec 1-1 than in abi 3-1. Ontological analysis and comparison with earlier studies suggest that PCA of genome expression data is useful for shortlisting functionally important genes.


Introduction
Transcriptome studies require shortlisting potentially important genes from a larger set identified in differential expression analysis. Machine learning methods can be useful in such analysis.
PCA is an unsupervised machine learning method used to increase the interpretability of gene expression data while minimizing the loss of information [1]-[3]. It is especially suitable for time series data and allows filtering of artifactual variation. Either genes or samples can be taken as the variables [4]. PCA transforms the dataset by a linear function to find new variables that maximize the variance, preserving the variability of the dataset, i.e. it identifies new variables that can serve as effective predictors. Given a matrix (X) of m variables and n observations, it reduces the m variables to r new variables, where r < m. These r new variables account for as much as possible of the variance explained by the m variables while remaining mutually uncorrelated and orthogonal [5], [6].
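The reduction from m variables to r uncorrelated components can be illustrated with a minimal sketch. It assumes NumPy and computes PCA through a singular value decomposition of the centred matrix; the function name pca_svd and the random toy matrix are illustrative only and are not taken from the study.

```python
import numpy as np

def pca_svd(X, r):
    """Project n observations of m variables onto the first r PCs (r < m).

    X is an (n, m) array. Returns the scores (n, r), the loadings (m, r)
    and the fraction of total variance explained by each retained PC.
    """
    Xc = X - X.mean(axis=0)                        # centre every variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :r] * s[:r]                      # coordinates of the observations
    loadings = Vt[:r].T                            # weight of each variable in each PC
    explained = (s ** 2) / np.sum(s ** 2)          # variance fraction per PC
    return scores, loadings, explained[:r]

# Toy data (hypothetical): 50 observations of 6 variables, two of them correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=50)

scores, loadings, explained = pca_svd(X, r=3)
```

In this sketch the columns of loadings are orthonormal and the columns of scores are mutually uncorrelated, which corresponds to the two properties stated above. The explained fractions are returned in decreasing order.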

Ontologies of shortlisted genes identified in different mutants of seed maturation
During seed maturation, embryo development is arrested and the seed dehydrates.

Comparison with previous works
We compared the published results on differentially expressed genes associated with seed development in Arabidopsis with the genes shortlisted in our study. In an earlier report, 2,712 genes were identified as differentially expressed between the desiccation tolerant and intolerant lines. These were classified as desiccation related genes [7]. To examine the reliability of the PCA shortlisting method, we compared the distribution of these 2,712 genes among the PCA based shortlisted genes for abi 3-5, lec 1-1 and fus 3-3 with that among the shortlisted genes for abi 3-1.
In the three desiccation intolerant mutants, i.e. abi 3-5, lec 1-1 and fus 3-3, the distribution of these differentially expressed genes was 6 times higher than that in abi 3-1. We identified 908 genes as common between the earlier work and the desiccation related genes identified by PCA (not shortlisted in abi 3-1 and shortlisted in the desiccation intolerant mutants). OBAP1A, oil body-associated protein 1A (AT1G05510), stachyose synthase (AT4G01970), avirulence induced gene 2 like protein (AT5G39720), and DREB2D (AT1G75490), identified as desiccation related genes in the earlier study, were also confirmed by the PCA based shortlisting method. The genes LEC1, FUS3 and ABI3 are known to influence the expression of one another [14]. In our analysis also, ABI3 (AT3G24650) was shortlisted for influencing both FUS3 and LEC1, LEC1 for [...]. However, FUS3 was shortlisted by PCA only in LEC1. In an earlier work [15], a set of 98 genes was suggested as the targets regulated by ABI3.
These constitute the regulatory network expressed during seed maturation. We compared our results to examine which of the targets of ABI3 were shortlisted in the PCA based analysis also.
Combining the three developmental stages, 65, 38, 35 and 24 ABI3-regulated genes were identified in the abi 3-5, fus 3-3, lec 1-1 and abi 3-1 mutant lines, respectively (Figure 3). The abi 3-5 line has the highest percentage of ABI3-regulated genes across all developmental stages. Among the targets of ABI3 identified in the different mutants (combining all three developmental stages), 17 genes were identified in abi 3-5, fus 3-3 and lec 1-1 but were absent in abi 3-1, and had been identified as desiccation related genes in the previous report. Our results also support these 17 genes as candidate genes involved in desiccation tolerance (Additional data 4).

Long non-coding RNA
Long non-coding RNAs are transcripts longer than 200 nucleotides that do not code for proteins. These play an important role in the regulation of seed development [16]. In abi 3-1, abi 3-5, [...] (Table 2).

Discussion
Shortlisting of functionally more important genes is the first step in the interpretation of high throughput gene expression data. Normalization of gene expression data is commonly used to make samples comparable. However, the normalization steps do not correct artifacts such as batch effects in gene expression analysis. Errors in the selection of trait-related genes can be minimized by multiple approaches [4], [17], [18]. One of the promising methods to resolve these artifacts is to examine the PCs of the data.
PCA allows the visualization of multidimensional data by representing it along new dimensions (PCs). It is also used as a clustering method (Figure 1). The first few PCs may not necessarily capture most of the cluster structure [5]. In abi 3-1, PC5 was able to separate all the developmental stages between the mutant and the wild type (Figure 1). PCA is performed by considering either genes [3], [19] or the experimental conditions [20], [21] as the variables.
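Because a later PC (PC5 in abi 3-1) rather than PC1 may be the one that separates the genotypes, every PC has to be checked in turn. The sketch below is one possible way to do this, and it rests on an assumption: "separates" is read here as "the mutant and wild-type samples do not overlap along that PC". The source does not state the exact criterion. The function name separating_pcs and the toy scores are hypothetical.

```python
import numpy as np

def separating_pcs(scores, is_mutant):
    """Return zero-based indices of PCs on which the two genotypes do not overlap.

    scores    : (n_samples, n_pcs) coordinates of the samples on each PC
    is_mutant : boolean array (n_samples,), True for mutant samples
    Assumption: a PC separates the genotypes when every mutant sample lies
    on one side of every wild-type sample along that PC.
    """
    mut, wt = scores[is_mutant], scores[~is_mutant]
    mut_below = wt.min(axis=0) - mut.max(axis=0)   # > 0: all mutants below wild type
    mut_above = mut.min(axis=0) - wt.max(axis=0)   # > 0: all mutants above wild type
    return np.flatnonzero((mut_below > 0) | (mut_above > 0))

# Toy scores (hypothetical): three wild-type stages followed by three mutant stages.
toy_scores = np.array([
    [ 5.0,  1.0,  0.20],
    [ 0.0,  1.2,  0.30],
    [-5.0,  0.9,  0.25],
    [ 5.5, -1.0, -0.20],
    [ 0.4, -1.1,  0.28],
    [-4.6, -0.8,  0.10],
])
toy_is_mutant = np.array([False, False, False, True, True, True])

selected = separating_pcs(toy_scores, toy_is_mutant)
```

In the toy data the first PC orders the samples by developmental stage and the third shows no structure, so only the second PC (index 1) is returned. A stricter or looser criterion, for example a minimum gap between the groups, could be substituted without changing the overall procedure.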
With genes as the variables, the PCs indicate the characteristics of genes that best explain the experimental response they produce. With experimental conditions as the variables, the PCs indicate the characteristics of experimental conditions that explain the gene behavior they elicit. Various analyses have been reported that consider experimental conditions as the variables, as in the case of sporulation time series gene expression data in yeast [4]. The suitability of sample embedded PCA for shortlisting genes is well described for time series gene expression data [21].
The PC that separates all developmental stages of the mutant and wild type is interpreted as capturing the variance that explains the phenotypic differences between the mutant and wild type (Figure 1). These variables (developmental stages) in Figure 1 represent the summation of all the gene products, using scaled gene expression with its loading score. Multiplication of the "scaled CPM" by the "loading score" provides the co-ordinate of a gene along the PC. Genes with a high difference along the identified PC between the developmental stages of wild type and mutants were classified as [...]

An earlier report [7] suggested that the genes down-regulated in desiccation intolerant mutants (lec 1-1, fus 3-3 and abi 3-5) as compared to desiccation tolerant lines (abi 3-1 and lec 2-1) are the desiccation related genes. The distribution of desiccation related genes identified by differential expression analysis [7] supports the reliability of our approach. Shortlisting of OBAP1A, oil body-associated protein 1A (AT1G05510), stachyose synthase (AT4G01970), avirulence induced gene 2 like protein (AT5G39720), and DREB2D (AT1G75490) as desiccation related genes by both differential expression analysis and PCA based shortlisting suggests a high probability of their involvement in acquiring desiccation tolerance in seeds.

The presence of exclusive sets of PCA shortlisted genes among the studied transcription factors agrees with the involvement of different sets of genes regulated by those transcription factors [12], [13]. Genes shortlisted by PCA agree with cross regulation among the studied transcription factors [24], [25]. Genome-wide chromatin immunoprecipitation (ChIP-chip), transcriptome analysis, quantitative reverse transcriptase-polymerase chain reaction and transient promoter activation assay were combined to identify a set of 98 ABI3 target genes [15]. We compared the genes shortlisted by PCA with the ABI3 targeted genes. A large number of ABI3 targets were present among the genes shortlisted by [...]
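The gene-level scoring described earlier in this Discussion (scaled CPM multiplied by the loading score, followed by a comparison of wild type and mutant along the chosen PC) can be sketched as follows. This is one reading of that description, not the authors' implementation. It assumes that each gene carries a single loading on the selected PC, that samples are paired by developmental stage, and that "high difference" means the largest summed absolute difference. The function names, the top_n cut-off and the toy numbers are all hypothetical.

```python
import numpy as np

def gene_differences(scaled, loadings, wt_cols, mut_cols):
    """Per-gene wild-type minus mutant difference along one PC.

    scaled   : (n_genes, n_samples) scaled expression (e.g. scaled CPM)
    loadings : (n_genes,) loading score of each gene on the chosen PC
    wt_cols, mut_cols : column indices of stage-matched wild-type / mutant samples
    """
    coords = scaled * loadings[:, None]            # scaled CPM x loading score
    return coords[:, wt_cols] - coords[:, mut_cols]

def shortlist(diff, top_n):
    """Indices of the top_n genes by summed absolute difference over stages."""
    total = np.abs(diff).sum(axis=1)
    return np.argsort(total)[::-1][:top_n]

# Toy data (hypothetical): 4 genes; columns are wt stage 1, wt stage 2,
# mutant stage 1, mutant stage 2.
toy_scaled = np.array([
    [ 1.0,  1.2,  1.0,  1.1],   # essentially unchanged
    [ 2.0,  2.5, -1.0, -1.5],   # strongly reduced in the mutant
    [ 0.5,  0.4,  0.6,  0.5],   # essentially unchanged
    [-1.0, -1.0,  0.0,  0.2],   # moderately increased in the mutant
])
toy_loadings = np.array([0.1, 0.8, 0.3, 0.5])
wt_cols, mut_cols = [0, 1], [2, 3]

diff = gene_differences(toy_scaled, toy_loadings, wt_cols, mut_cols)
top_genes = shortlist(diff, top_n=2)
```

Under these assumptions the per-gene differences add up to the distance between the wild-type and mutant samples along the PC, so the ranking indicates which genes account for most of that separation. In the toy data the second and fourth genes (indices 1 and 3) are shortlisted.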

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.