Identification and functional analysis of differentially expressed genes in the phloem and xylem of poplar

We preprocessed our laboratory's poplar gene microarray data with the GPDNN model, applied a median adjustment, and then analyzed the phloem and xylem genes, which clustered well. From the significantly differentially expressed genes, one gene related to plant hormones was selected for analysis. The function of this gene was first predicted to be that of an auxin response factor (ARF). All genes of the ARF family were then retrieved from the poplar transcription factor database for phylogenetic and homology analysis, and the relationship between homology and the correlation of expression signal intensities was examined. Finally, using the median as a reference, we counted the up-regulated, down-regulated and fine-tuned genes in bud, root, xylem and phloem tissues, and analyzed the relationship among the number of up-regulated genes, their signal intensities and the auxin content of the tissues.


Gene chip
In the early 1990s, the American company Affymetrix produced the world's first oligonucleotide chip [6] using in situ synthesis of oligonucleotides. The thousands of densely arranged molecular probes integrated on a chip can analyze a large number of biomolecules in a short time, so that the biological information in a sample can be obtained quickly and accurately. This is hundreds to thousands of times more efficient than traditional detection methods and is well suited to modern studies involving large numbers of genes [7].
The design principle of the oligonucleotide chip is as follows. The chip is based on reverse hybridization: oligonucleotides of a dozen to several dozen bases are designed and synthesized in advance and fixed onto a glass slide by spotting, or are fixed on the glass by in situ synthesis. They are then hybridized with fluorescently labeled sample sequences under defined conditions, and after washing the chip is scanned to read out the signal. Oligonucleotide chips usually use in situ synthesis, which combines solid-phase DNA synthesis with photolithography. The technical principle [1] is: 1, chip substrate; 2, activation of the chip surface; 3, illumination through the "A" mask; 4, coupling with the adenosine (A) reagent; 5, repetition of the synthesis cycle, in which a photosensitive protecting group is attached to the 5'-hydroxyl end of the base monomer being added.
Gene chip experiments generally include the following steps [1]:
(1) Chip probe design. According to the purpose of the experiment, probes complementary to the target sequences are synthesized and fixed to the carrier, following the principle of complementary hybridization.
The sensitivity and specificity of hybridization are the core of chip technology, and Affymetrix has developed a distinctive PM-MM probe scheme. The probe set for each gene on the chip consists of 10-20 probe pairs, each made up of two probe units: a perfect match (PM) probe and a mismatch (MM) probe, which differs from the PM probe at the 13th base, in the middle of the sequence.
(2) Sample preparation. Total RNA or purified poly(A) mRNA is extracted from tissues or cells and used to synthesize double-stranded cDNA; cRNA is then synthesized by in vitro transcription and fragmented before hybridization.
(3) Hybridization of sample and probes, washing and staining. The prepared sample is incubated with the probes for 16 hours to ensure complete hybridization. After hybridization, sample sequences are bound to their complementary probes, and unbound sample is rinsed off before scanning to ensure accurate scan data.
(4) Scanning to obtain data. A laser scanner scans the hybridized microarray and records the fluorescence intensity of the labeled hybridized sequences. These fluorescence intensities are the required data.
(5) Data analysis. The data are preprocessed so that they can be relied on in subsequent analysis.
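The PM-MM probe pair described in step (1) can be illustrated with a minimal sketch. It assumes that the mismatch is made by replacing the 13th base with its complement, which is the usual description of the Affymetrix design but is not stated explicitly above; the 25-base sequence and the function name are hypothetical and serve only as an example.

```python
# Complement table for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm_probe: str, position: int = 13) -> str:
    """Return the MM probe: the PM probe with the base at the given
    1-based position replaced by its complement (assumed design)."""
    if len(pm_probe) != 25:
        raise ValueError("expected a 25-base probe")
    i = position - 1  # convert the 1-based position to a 0-based index
    return pm_probe[:i] + COMPLEMENT[pm_probe[i]] + pm_probe[i + 1:]


# Hypothetical 25-mer, used only for illustration.
pm = "ACGTACGTACGTACGTACGTACGTA"
mm = mismatch_probe(pm)
print(pm)
print(mm)
```

The two probes differ only at the central base, so the MM signal can serve as an estimate of non-specific binding for the corresponding PM probe.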

Data processing
After the chip is scanned by laser, the output calculated from the fluorescence intensity becomes the chip data. However, chip data contain considerable noise, such as background signal, which affects the accuracy of data evaluation. For this reason, many bioinformaticians have since worked extensively on the calibration of biochip data. Cheng Li's statistical model [9], Irizarry's RMA model [10] and Li Zhang's PDNN model [11] are all mathematical models for preprocessing chip data. Ning Jiang et al. [12], in BMC Bioinformatics in 2008, compared these preprocessing methods in detail and concluded that PDNN performed best among them. Unfortunately, the PDNN model does not take the mismatch (MM) data into account, although the chip design implies that they carry important information about gene expression level. The GPDNN model established by Wei Wei of Peking University [2] makes up for this defect and makes data processing more reliable. We use the GPDNN model to preprocess the poplar chip data and evaluate the data according to the clustering effect.
The PDNN model divides hybridization into specific and non-specific hybridization. Specific hybridization refers to the binding of a probe to its target sequence, while non-specific hybridization refers to the binding of a probe to other sequences. Many studies have found that hybridization intensity is determined by binding affinity, and that the relationship between the two can be described by the Langmuir adsorption model. Within a 25-base probe, different positions contribute differently to binding: the edges attract and hold the target sequence less strongly than the middle. Binding affinity is governed by free energy. According to the study of Wang [13], probe sequences with high T and C content show higher fluorescence intensity, which indicates that the probe sequence itself also affects the ability to bind target sequences. Li Zhang [11] proposed that the binding energy is determined by two factors, the position and the stacking free energy of the base sequence, and assumed that the stacking energy is determined by adjacent base pairs. The binding energy is then the linear sum of the stacking energies weighted by position. However, the non-specific signal also carries information, and the PDNN model ignores it.
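The position-weighted sum of stacking energies described above can be sketched as follows. This is only an illustration of the form of the calculation: the weights (larger in the middle of the probe than at the edges) and the stacking energies are hypothetical placeholders, not the fitted PDNN parameters, and the function name is an assumption.

```python
from itertools import product


def binding_energy(probe, weights, stacking):
    """Sum of nearest-neighbour stacking energies along the probe,
    each multiplied by the weight of its position."""
    pairs = [probe[k:k + 2] for k in range(len(probe) - 1)]
    if len(weights) != len(pairs):
        raise ValueError("need one weight per adjacent base pair")
    return sum(w * stacking[p] for w, p in zip(weights, pairs))


# Hypothetical parameters: a 25-base probe has 24 adjacent base pairs.
n_pairs = 24
# Weights peak in the middle of the probe and fall off toward both ends.
weights = [1.0 - abs(k - (n_pairs - 1) / 2) / n_pairs for k in range(n_pairs)]
# Placeholder stacking energies: 1.0 for every dinucleotide except "GC".
stacking = {a + b: 1.0 for a, b in product("ACGT", repeat=2)}
stacking["GC"] = 1.5

probe = "ACGTACGTACGTACGTACGTACGTA"  # hypothetical sequence
energy = binding_energy(probe, weights, stacking)
print(round(energy, 3))
```

In the actual model, the weights and stacking energies are estimated from the chip data, with separate parameter sets for specific and non-specific binding.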
Wei Wei improved the original PDNN algorithm as follows: first, a Wilcoxon signed-rank test [4] was used to determine two training data sets. The MM probe information was then added to the non-specific parameter estimation model, and the non-specific and specific binding parameters were estimated separately (see the appendix for the GPDNN algorithm in detail).

Experimental materials
Our poplar chip is an expression microarray made by Affymetrix. The poplar expression profile chip is specially designed to monitor gene expression. It contains more than 61,000 probe sets, representing more than 56,000 transcripts and gene products. Our poplar microarray data set comprises three biological replicates for each season (spring, summer, autumn, winter) and each tissue (root, stem, xylem cambium, phloem cambium, leaf), for a total of 60 chips. Four chips could not be produced successfully, which leaves 56 chips. The following is described in mathematical language:

Methods
Screening of differentially expressed genes
We evaluated the clustering of the preprocessed data, and the results showed that the phloem and xylem genes clustered well [figure]. Phloem and xylem were therefore selected for further analysis.
First, we need to pick out the genes that are differentially expressed between phloem and xylem. Many methods for selecting differential genes have been studied, such as Subramanian's GSEA (Gene Set Enrichment Analysis) [14], but these screening methods require large amounts of data. Our poplar chip data for phloem and xylem are comparatively few, so the main method used to find differential genes is the Wilcoxon non-parametric test [4]. A non-parametric test does not depend on the form of the population distribution.
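The analysis in this work was run with wilcox.test in R. As an illustration of the rank-based idea only, here is a minimal Python sketch of a two-sample rank-sum test. It uses a normal approximation without tie or continuity correction, so its p-values will generally differ somewhat from those of R, which can compute exact p-values for small samples. The expression values and function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided rank-sum test; returns (U statistic of x, approximate p)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                  # rank sum of the first sample
    u = w - n1 * (n1 + 1) / 2            # Mann-Whitney U for the first sample
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return u, p


# Hypothetical expression values of one gene in the two tissues.
phloem = [5.1, 5.3, 5.0, 5.2]
xylem = [6.4, 6.8, 6.1, 6.6]
u, p = rank_sum_test(phloem, xylem)
print(u, round(p, 4))
```

A gene would be kept as differentially expressed when the p-value falls below the chosen threshold, here 0.05.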
The non-parametric test is essentially a test of whether the location (median) of two population distributions is the same. The Wilcoxon non-parametric test is an improvement on the sign test: the observed values are arranged from smallest to largest, ranks are assigned, and the hypothesis is tested on the ranks. We first set the null hypothesis that the median expression in phloem equals the median expression in xylem; the alternative hypothesis is that the two medians are not equal. R [15] was then used to run wilcox.test, genes with significant differences (p < 0.05) were screened out according to their p-values, and finally the corresponding GO annotation [16] was retrieved. Our statistical hypotheses are:
H0: M_phloem = M_xylem
H1: M_phloem ≠ M_xylem
where M denotes the median expression level of a gene in the given tissue.

Results and discussion

Data preprocessing and clustering analysis
After the poplar chip data were preprocessed with the GPDNN model and median leveling was applied, we carried out cluster analysis and evaluation on the data of 45 chips [3], excluding the leaf data. Figure 3.1 shows that the phloem and xylem genes of poplar cluster into distinct groups according to season (spring, summer, autumn and winter); that is, the phloem and xylem genes cluster well. Poplar phloem and xylem genes were therefore selected for the following analysis.
In addition, we used R to draw box plots of differential expression between poplar phloem and xylem:
(1) By season: gene expression in xylem was higher than in phloem in spring, summer and autumn, while in winter there was no significant difference between phloem and xylem. The difference between xylem and phloem was largest in spring, followed by summer, and smallest in autumn.
(2) By tissue: in both xylem and phloem, gene expression levels declined from spring through summer and autumn to winter. In winter, when the trees are dormant and overwintering, gene expression was very low and almost the same in all tissues.
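The median leveling applied to the chips before clustering can be sketched as follows. The exact procedure is not specified in the text, so this sketch assumes a simple multiplicative scaling that brings the median of every chip to a common target; the chip names, values and function name are hypothetical.

```python
from statistics import median


def median_level(chips, target=None):
    """Scale each chip so that its median equals a common target.

    chips maps a chip name to its list of signal intensities. If no
    target is given, the median of the chip medians is used (assumption).
    """
    medians = {name: median(values) for name, values in chips.items()}
    if target is None:
        target = median(medians.values())
    return {
        name: [v * target / medians[name] for v in values]
        for name, values in chips.items()
    }


# Hypothetical intensities for two chips.
chips = {
    "xylem_spring": [100.0, 200.0, 400.0],
    "phloem_spring": [50.0, 100.0, 150.0],
}
leveled = median_level(chips)
for name, values in leveled.items():
    print(name, values, median(values))
```

After leveling, chips share the same median, so differences between chips reflect relative expression rather than overall brightness.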
We list 10 genes with large p_value(

Function prediction of target genes
A total of 1049 bp of matching sequence was found in the data provided by Affymetrix. We chose the poplar linkage group sequence for prediction: the matched region was first located and extended by 1000 bp on each side, giving an extended sequence of 4717 bp (appendix), which was submitted to the online gene prediction software Softberry to obtain the complete gene prediction, as shown in figure 3.5. The protein sequence predicted by Softberry was then searched against the Pfam database for protein function prediction; the results from Pfam-A were the more accurate (Table 3.3). From these results, we selected the function with the highest score and predicted the gene to be an auxin response factor (ARF).
We conducted homology and phylogenetic analysis on the CDS of the 31 ARF genes under study, and the results are shown in figure 3.8. In this analysis, the 31 ARF genes can still be divided into three classes; the homology between class Ia and class Ib increases to 54%, and the homology among the genes is generally higher, with a minimum of 62%, which makes the homology and phylogenetic results credible.

Correlation analysis of ARF gene expression
According to the corresponding probes of ARF genes in the database, correlation analysis was [...] coefficients with other genes, but the homology between these three genes and other genes is not high; that is, they fall into different classes. estExt_fgenesh4_pm.C_LG_X0888 has a high correlation coefficient with the expression levels of other genes, but its homology with them is not high.
In conclusion, when the homology between genes is higher than 80%, the correlation coefficient of their expression levels is higher than 0.8, and their expression may be regulated together. When the correlation coefficient of expression levels is higher than 0.8 but the genes have low homology and belong to different classes, the regulation mechanism of their expression cannot be predicted. In other words, future studies may be able to infer the expression regulation mechanism among different genes from the homology between genes and the correlation of their expression.
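The decision rule summarized above can be sketched as a small calculation: a Pearson correlation coefficient between the expression profiles of two genes, combined with their sequence identity. The cutoff of 0.8 is taken from the text; the expression values, the identity values, the function names and the output labels are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def interpret(identity, r, cutoff=0.8):
    """Apply the rule from the text to one gene pair.

    identity is the sequence homology as a fraction (0 to 1) and r is
    the correlation coefficient of the expression levels.
    """
    if r > cutoff and identity > cutoff:
        return "possible co-regulation"
    if r > cutoff:
        return "co-expressed, mechanism undetermined"
    return "no strong co-expression"


# Hypothetical expression profiles of two ARF genes over four seasons.
gene_a = [8.1, 7.4, 6.9, 5.2]
gene_b = [7.9, 7.5, 6.6, 5.0]
r = pearson(gene_a, gene_b)
print(round(r, 3), interpret(0.85, r))
print(round(r, 3), interpret(0.40, r))
```

With only a few time points per gene, such coefficients are sensitive to individual values, so the labels are best read as candidates for follow-up rather than conclusions.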

Selection of up-regulated and down-regulated ARF genes
We used the median of the signal intensities of the 31 ARF genes as the reference for identifying up-regulated and down-regulated genes. A gene whose expression increased by 80% was defined as up-regulated; a gene whose expression decreased by 60% was defined as down-regulated; a gene in between was considered fine-tuned.
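The classification rule above can be sketched as follows. The sketch assumes that the increase or decrease is measured relative to the median signal of the ARF genes in the same sample, which is one reading of the text; the gene names, signal values and function name are hypothetical.

```python
from statistics import median


def classify(signals, up=0.8, down=0.6):
    """Label each gene relative to the median signal (assumed reference).

    A relative change of at least +80% is up-regulated, of at least -60%
    is down-regulated, and anything in between is fine-tuned.
    """
    ref = median(signals.values())
    labels = {}
    for gene, s in signals.items():
        change = (s - ref) / ref  # relative change against the median
        if change >= up:
            labels[gene] = "up-regulated"
        elif change <= -down:
            labels[gene] = "down-regulated"
        else:
            labels[gene] = "fine-tuned"
    return labels


# Hypothetical signal intensities of five ARF genes in one tissue.
signals = {"ARF_a": 100, "ARF_b": 190, "ARF_c": 35, "ARF_d": 120, "ARF_e": 90}
labels = classify(signals)
for gene, label in labels.items():
    print(gene, label)
```

Counting the labels per tissue and season gives the numbers of up-regulated, down-regulated and fine-tuned genes used in the comparisons that follow.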
Among the different tissues, buds had the most up-regulated genes, followed by roots, while xylem and phloem had the fewest. During plant growth and development, auxin is produced mainly by the terminal bud and carried to other parts of the plant by polar transport. When auxin is produced, auxin response factors can respond to it rapidly, so the number of up-regulated genes is largest in buds, which also have fewer down-regulated genes than other tissues. The number of up-regulated genes was highest in autumn, followed by summer, with the same number in spring and winter.
We examined the distribution of up-regulated, down-regulated and fine-tuned genes in the different tissues in each of the four seasons. The gene estExt_fgenesh4_pm.C_LG_XII0386, located on chromosome LG_XII, was found to be both up-regulated and down-regulated in the same tissue in the same season. Using the chip data provided by Affymetrix, we identified two different probe sequences for this gene. Sequence comparison showed that the homology between the two is less than 50%, from which we infer that estExt_fgenesh4_pm.C_LG_XII0386 comprises different family members with very low homology and different expression levels, some up-regulated and others down-regulated.

Analysis of the relationship among the number of up-regulated ARF genes, their signal intensities and the auxin content in each tissue
Auxin plays an important role in plant growth and development and promotes morphological changes throughout the plant growth cycle, including cell division, differentiation and elongation, flower and vascular tissue development, and directional growth [40]. A potential role of auxin as a morphogenetic promoter has been predicted in the formation of lateral organs. Auxin can promote the formation of secondary axes and of tissue primordia, from leaves to flowers [41][42][43][44]. Auxin accumulation at these sites is accomplished by influx and efflux transporters, whose activities keep auxin in a state of dynamic flow [43,44].
Auxin promotes cell division in the pericycle, giving rise to cambium [45][46][47].
For the four seasons of spring, summer, autumn and winter, the correlations among auxin content, gene expression signal intensity and number of up-regulated genes were calculated in the different tissues. The results show that the correlation coefficients between signal intensity and auxin content, and between signal intensity and the number of up-regulated genes, are low, indicating no clear correlation. However, auxin content was highly correlated with the number of up-regulated genes in bud, root and xylem tissues, though not in phloem. Across the four seasons, auxin content and the number of up-regulated genes show a high positive correlation in bud and root tissues and a high negative correlation in xylem. There is thus a correlation between the number of up-regulated genes and auxin content within the same tissue, which is consistent with our prediction.
Table 3.10 Correlation coefficient between gene expression signal intensity and number of up-regulated genes
In every tissue (Table 3.7, Table 3.8, Table 3.9, Table 3.10), the pairwise correlations among auxin content, number of up-regulated genes and gene expression signal intensity are 1 or -1 in summer and autumn, that is, highly correlated. During poplar growth, growth in summer and autumn is stable, which keeps each measured quantity stable and yields a high correlation. Note, however, that if these coefficients are computed from the summer and autumn values alone, a value of 1 or -1 is expected for any two data points, so this result should be interpreted with caution.
However, once winter data were added, the correlation between gene expression signal intensity and auxin content and the correlation between gene expression signal intensity and the number of up-regulated genes decreased, but the correlation between auxin content and the number of up-regulated genes was not

Outlook
Using our laboratory's poplar gene microarray data, the phloem and xylem genes, which clustered well, were selected for analysis after clustering evaluation. Differentially expressed genes (p < 0.05) were selected with the Wilcoxon non-parametric test, and one differentially expressed gene related to plant hormones was selected for analysis. After prediction with NCBI, Softberry, Pfam and other resources, the function of this gene was determined to be that of an auxin response factor.
Auxin response factor genes were retrieved from the poplar transcription factor database, and phylogenetic tree and relationship analyses were conducted on the poplar ARF family. For genes with less than 100% homology, the correlation coefficient of gene expression may be either higher or lower than 0.8.