Abstract
We have developed a novel methylome analysis procedure, Methyl-IT, based on information thermodynamics and signal detection. Methylation analysis involves a signal detection problem, and the method was designed to discriminate methylation regulatory signal from background noise induced by thermal fluctuations. We include a comparison with three commonly used programs, across several available datasets, to provide a comparative measure of the resolution of each method. To confirm results, methylation analysis was integrated with RNAseq and network enrichment analyses. Methyl-IT enhances resolution of genome methylation behavior to reveal network-associated responses, offering resolution of gene pathway influences not attainable with previous methods.
Background
Most chromatin changes that are associated with epigenetic behavior are reprogrammed each generation, with the apparent exception of cytosine methylation, where parental patterns can be inherited through meiosis [1]. Genome-wide methylome analysis, therefore, provides one avenue for investigation of transgenerational and developmental epigenetic behavior. Complicating such investigations in plants is the dynamic nature of DNA methylation [2, 3] and a presently incomplete understanding of its association with gene expression. In plants, cytosine methylation is generally found in three contexts, CG, CHG and CHH (H=C, A or T), with CG most prominent within gene body regions [4]. Association of CG gene body methylation with changes in gene expression remains in question. There exist ample data associating chromatin behavior with plant response to environmental changes [5], yet, affiliation of genome-wide DNA methylation with these effects, or their inheritance, remains inconclusive [6, 7].
The epigenetic landscape is modulated by thermodynamic fluctuations that influence DNA stability. Most genome-wide methylome studies have relied predominantly on statistical approaches that ignore the subjacent biophysics of cytosine DNA methylation, offering limited resolution of those genomic regions with highest probability of having undergone epigenetic change. Jenkinson and colleagues [8] described the implementation of statistical physics and information theory to the analysis of whole genome methylome data to define sample-specific energy landscapes. Our group [9, 10] has proposed an information thermodynamics approach to investigate genome-wide methylation patterning based on the statistical mechanical effect of methylation on DNA molecules. The information thermodynamics-based approach is postulated to provide greater sensitivity for resolving true signal from thermodynamic background within the methylome [9]. Because the biological signal created within the dynamic methylome environment characteristic of plants is not free from background noise, the approach, designated Methyl-IT, includes application of signal detection theory [11-14].
A basic requirement for the application of signal detection is a probability distribution of the background noise. This probability distribution, modeled as a Weibull distribution, can be deduced on a statistical mechanical/thermodynamic basis for DNA methylation induced by thermal fluctuations [9]. Assuming that this background methylation variation is consistent with a Poisson process, it can be distinguished from variation associated with methylation regulatory machinery, which is non-independent for all genomic regions [9]. An information-theoretic divergence expressing the variation in methylation induced by background thermal fluctuations will follow a Weibull distribution model, provided that it is proportional to the minimum energy dissipated per bit of information from methylation change.
The information thermodynamics model was previously verified with more than 150 Arabidopsis and more than 90 human methylome datasets [9]. To test application of the Methyl-IT method to methylome analysis, and to compare resolution of the Methyl-IT approach to the publicly available programs DSS [15], BiSeq [16] and Methylpy [17], we used three Arabidopsis methylome datasets. Genome-wide methylation data from a Col-0 single-seed descent population [3], maintained over 30 generations under controlled growth conditions, provide a measure of thermodynamic properties within an unperturbed system. To assess resolution of methylation signal during plant development, we included previously reported datasets from various stages of seed development and germination in Arabidopsis ecotypes Col-0 and Ws [18]. Both of these systems have been described for methylome behavior with Methylpy, and direct comparison of the two datasets allowed estimation of developmental epigenetic signal above background. For more detailed study of methylation and gene expression, and to provide empirical testing of Methyl-IT predictions, we focused on the trans-generational ‘memory’ line derived by suppression of the MSH1 (MUTS HOMOLOG 1) gene [19, 20], which has not been previously described for methylome features.
MSH1 is a plant-specific gene that encodes an organelle-localized protein [21, 22]. Plastid-depletion of MSH1 conditions ‘developmental reprogramming’ in the plant [23]. The msh1 mutant is altered in expression of a broad array of environmental and stress response pathways [24], and the mutant phenotype is also produced by MSH1 RNAi knockdown [20]. Differentially expressed gene (DEG) analysis of the msh1 TDNA mutant identifies major components from numerous abiotic and biotic stress, phytohormone, carbohydrate metabolism, protein translation and turnover, oxidative stress and photosynthetic pathways [24]. Subsequent null segregation of the RNAi transgene restores MSH1 expression but leaves a heritably altered phenotype, with delayed flowering, reduced growth rate, delayed maturity transition and pale leaves [20]. This condition is termed msh1 ‘memory’, and provides for direct investigation of transgenerational methylation variation and its association with altered gene expression.
Here, we report on Methyl-IT sensitivity relative to three commonly used methylome analysis programs. We demonstrate resolution of methylome repatterning by Methyl-IT analysis, and empirical validation of gene networks undergoing changes in methylation and gene expression as identified by the Methyl-IT procedure.
Results
The Methyl-IT method
For resolution of DNA methylation signal, we employed Hellinger divergence (H) as a means of quantifying dissimilarity between two probability distributions: that associated with a reference, defining background changes, and that associated with treatment.
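As a minimal sketch of this quantity, the Python function below computes a Hellinger divergence between the methylation-level distributions of a reference and a treatment at a single cytosine site. Each sample is treated as a Bernoulli distribution over (methylated, unmethylated). The function name, the read-count arguments, and the optional coverage weighting are illustrative assumptions, not the Methyl-IT implementation.

```python
import math

def hellinger_divergence(mc_ref, uc_ref, mc_trt, uc_trt, coverage_weighted=False):
    """Hellinger divergence between two methylation-level distributions.

    mc_* / uc_* are methylated / unmethylated read counts at one cytosine
    (hypothetical argument names). Each sample is summarized by the
    Bernoulli distribution (p, 1 - p), with p the observed methylation level.
    """
    n_ref = mc_ref + uc_ref
    n_trt = mc_trt + uc_trt
    if n_ref == 0 or n_trt == 0:
        raise ValueError("each sample needs at least one read")
    p_ref = mc_ref / n_ref
    p_trt = mc_trt / n_trt
    # Sum of squared differences of square-root probabilities (range 0..2).
    h = ((math.sqrt(p_ref) - math.sqrt(p_trt)) ** 2
         + (math.sqrt(1 - p_ref) - math.sqrt(1 - p_trt)) ** 2)
    if coverage_weighted:
        # Assumed scaling by an effective read depth; not taken from the text.
        h *= 2 * n_ref * n_trt / (n_ref + n_trt)
    return h
```

Without the weighting the value lies between 0 (identical methylation levels) and 2 (fully unmethylated versus fully methylated). Cutpoints greater than 1, such as those reported in this section, suggest that the divergence used in practice is scaled, which is why the sketch exposes a weighting switch.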
Signal detection is a critical step to increase sensitivity and resolution of methylation signal by improving the signal-to-noise ratio and objectively controlling the false positive rate and prediction accuracy/risk (Fig. 1). Optimal detection of signals requires knowledge of the noise probability distribution that, from a statistical mechanical basis, can be modeled for each individual sample by a Weibull distribution [9]. The methylation regulatory signal does not follow a Weibull distribution and, consequently, for a given level of significance α (Type I error probability, e.g. α = 0.05), cytosine positions k with $H_k > H_{\alpha=0.05}$ can be selected as sites carrying potential signals (shown as the blue region under the curve in Fig. 1). Laws of statistical physics can account for background methylation, a response to thermal fluctuations that presumably functions in DNA stability [9]. True signal is detected based on the optimal cutpoint [25], which can be estimated from the area under the curve (AUC) of a receiver operating characteristic (ROC) built from a logistic regression performed with the potential signals from controls and treatments. In this context, the AUC is the probability of distinguishing biological regulatory signal naturally generated in the control from that induced by the treatment. The cytosine sites carrying a methylation signal are designated differentially informative methylated positions (DIMPs). The probability that a DIMP is not induced by the treatment is given by the probability of false alarm (PFA, false positive). That is, the biological signal is naturally present in the control as well as in the treatment.
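The selection of potential signals at a significance level α can be sketched as follows, assuming a two-parameter Weibull background model whose shape and scale have already been estimated for the sample. The function names and parameter values are hypothetical and serve only to illustrate the thresholding step.

```python
import math

def weibull_critical_value(shape, scale, alpha=0.05):
    """Value x with P(X > x) = alpha for a two-parameter Weibull.

    Uses the survival function S(x) = exp(-(x / scale) ** shape).
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return scale * (-math.log(alpha)) ** (1.0 / shape)

def potential_signals(divergences, shape, scale, alpha=0.05):
    """Indices of sites whose divergence exceeds the alpha critical value."""
    h_alpha = weibull_critical_value(shape, scale, alpha)
    return [k for k, h in enumerate(divergences) if h > h_alpha]
```

With shape = 1 and scale = 1 the model reduces to an exponential distribution, so the α = 0.05 critical value is −ln(0.05), roughly 3.0. Sites above that value would be passed on to the ROC step as potential signals.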
Estimation of the optimal cutoff from the AUC is an additional step to remove potential methylation background noise that remains with probability α = 0.05 > 0. We define as methylation signal (DIMP) each cytosine site with a Hellinger divergence value above the cutoff ($H \geq H_{\mathrm{cutpoint}}$), as shown in Fig. 1. Each DIMP is a cytosine position carrying a significant methylation signal, which may or may not be represented within a differentially methylated position (DMP) according to Fisher’s exact test (or other current tests, Fig. 1). The difference in resolution between current methods and Methyl-IT is illustrated by positioning the H value sensitivity of Fisher’s exact test (FET) above Hmin for cytosine sites that are simultaneously DMPs and DIMPs. For example, the ROC curve that corresponds to logistic regression for potential signals from the closest wild type control to the msh1 memory line (control 3 and treatment 1 in Fig. 1) has an AUC cutpoint of H = 1.028052.
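To make the cutpoint idea concrete, the sketch below computes an AUC and a cutpoint directly from divergence values in a control and a treatment. It is a simplification under stated assumptions: it scores sites by the raw divergence rather than by a fitted logistic regression, and it uses the Youden index as the optimality criterion, which is one common choice and not necessarily the criterion of reference [25]. All names and values are hypothetical.

```python
def auc(control, treatment):
    """Probability that a random treatment value exceeds a random control
    value (ties count one half); this equals the area under the ROC curve."""
    wins = 0.0
    for t in treatment:
        for c in control:
            if t > c:
                wins += 1.0
            elif t == c:
                wins += 0.5
    return wins / (len(control) * len(treatment))

def youden_cutpoint(control, treatment):
    """Cutpoint maximizing sensitivity + specificity - 1, where sites with
    a value >= cutpoint are called as treatment signal."""
    best_cut, best_j = None, -1.0
    for cut in sorted(set(control) | set(treatment)):
        sensitivity = sum(t >= cut for t in treatment) / len(treatment)
        specificity = sum(c < cut for c in control) / len(control)
        j = sensitivity + specificity - 1.0
        if j > best_j:  # on ties the smallest cutpoint is kept
            best_cut, best_j = cut, j
    return best_cut, best_j

# Hypothetical divergence values for potential signals.
control_h = [0.2, 0.4, 0.5, 0.9]
treatment_h = [0.8, 1.1, 1.3, 1.6]
print(auc(control_h, treatment_h))            # 0.9375
print(youden_cutpoint(control_h, treatment_h))  # (0.8, 0.75)
```

The pairwise AUC loop is quadratic in the number of sites, which is fine for illustration but would need a rank-based formulation for genome-scale data.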
The probability of false alarm (estimated for the best fit found for the Weibull cumulative distribution of H in the mentioned control) for DIMP detection based on the mentioned cutpoint is PFA = 1.466 × 10^−6. Thus, in the msh1 memory line dataset under study, any cytosine position k with Hk ≥ 1.028052 is a DIMP. Although the probability PFA = 1.466 × 10^−6 is small, there is still an average of 44844 CG-DIMPs per wild type sample. The average number of CG-DIMPs in the memory line samples is 225835. We found that the strength of biological regulatory signal (evaluated in terms of AUC) was different for each methylation context. The strongest signal by Hellinger divergence found in our analyses was in CG context. A parsimony decision to reduce the rate of false positives used the cutpoint estimated for the AUC from the strongest signal. A flow chart of Methyl-IT analysis, with integration of the major procedures described above, is shown in Fig. 2.
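Under a two-parameter Weibull background model, the probability of false alarm at a given cutpoint is the survival function evaluated at that cutpoint. The sketch below shows the calculation; the shape and scale arguments stand for the fitted parameters of the control sample, which are not reported in this section, so any values passed in are hypothetical.

```python
import math

def probability_of_false_alarm(cutpoint, shape, scale):
    """P(H >= cutpoint) under a two-parameter Weibull background model,
    i.e. the survival function exp(-(cutpoint / scale) ** shape)."""
    if cutpoint < 0:
        raise ValueError("cutpoint must be non-negative")
    return math.exp(-((cutpoint / scale) ** shape))
```

Raising the cutpoint lowers the PFA monotonically, which is the trade-off behind choosing the cutpoint from the methylation context with the strongest signal.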
Relative sensitivity of the Methyl-IT method versus other procedures
Table 1 provides a critical but nonunique example of a 2×2 contingency table of read counts (values given in Table 1). In this situation there exists a strong methylation signal in the treatment, significantly stronger than in the control, but a 2×2 contingency independence test cannot detect it. Even small genomes like Arabidopsis contain millions of methylated cytosine sites, and situations analogous to the one presented in Table 1 are not rare. If this hypothetical cytosine site were to occur in the memory line then, according to its p-value estimate from the corresponding Weibull distribution, it would be a potential signal included in the logistic regression and, since H = 1.12 in this example and the AUC cutpoint is Hcutpoint = 1.028, it would be a DIMP (Hcutpoint < H).
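The limitation of a contingency-table test at low coverage can be illustrated with a small worked example. The read counts below are hypothetical and are not the values of Table 1. The function is a plain two-sided Fisher's exact test written with the standard library only.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables at most as probable as the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Hypothetical site: control 1 methylated / 4 unmethylated reads,
# treatment 4 methylated / 1 unmethylated reads.
p_value = fisher_exact_two_sided(1, 4, 4, 1)
print(round(p_value, 4))  # 0.2063
```

Here the methylation level shifts from 0.2 to 0.8, yet the test returns p ≈ 0.21 and the site would not be called at α = 0.05. This is the kind of case in which a divergence-based signal detection step can still flag the site.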
In the memory line, 100% of differentially methylated cytosines (TVD > 0.23) in all methylation contexts found by the root-mean-square test (RMST, a bootstrap test of goodness-of-fit [26] implemented in Methylpy [17]), Fisher’s exact test (FET), and HDT (a bootstrap test of goodness-of-fit based on Hellinger divergence, see methods) are also detected by Methyl-IT (Fig. 3). RMST does not detect 17.7% of CG-DIMPs, 47.8% of CHG-DIMPs, and 59.7% of CHH-DIMPs. HDT does not detect 19.7% of CG-DIMPs, 51.5% of CHG-DIMPs, and 66.1% of CHH-DIMPs, while FET does not detect 46.2% of CG-DIMPs, 73.9% of CHG-DIMPs, and 84% of CHH-DIMPs. Together, RMST, HDT and FET do not detect 13.5% of CG-DIMPs, 43.2% of CHG-DIMPs, and 52.5% of CHH-DIMPs. The DIMPs not detected by these alternative approaches come from situations analogous to that presented in Table 1. RMST is a robust test of goodness-of-fit for 2×2 contingency tables. The statistic used in RMST is an information divergence. Results obtained with RMST were very close to those estimated based on Hellinger divergence [26, 27] (see Table 1). Therefore, the differences in outcome between Methyl-IT and Methylpy do not reside in RMST but, rather, in the signal detection limitation, which requires knowledge of the null distribution for methylation background variation. The null distribution of the control sample testing statistic must be taken into account.
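The TVD filter mentioned above can be sketched in a few lines. For two methylation levels treated as Bernoulli distributions, the total variation distance reduces to the absolute difference of the methylation levels. The function and argument names are hypothetical, and the 0.23 threshold is the one quoted in the text.

```python
def total_variation_distance(mc_ref, uc_ref, mc_trt, uc_trt):
    """Absolute difference in methylation level between reference and
    treatment, from methylated (mc) and unmethylated (uc) read counts."""
    p_ref = mc_ref / (mc_ref + uc_ref)
    p_trt = mc_trt / (mc_trt + uc_trt)
    return abs(p_ref - p_trt)

def passes_tvd_filter(mc_ref, uc_ref, mc_trt, uc_trt, threshold=0.23):
    """True when the methylation-level change exceeds the TVD threshold."""
    return total_variation_distance(mc_ref, uc_ref, mc_trt, uc_trt) > threshold
```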
Relative sensitivity and resolution of the Methyl-IT method can also be assessed by parallel analyses of the three datasets: generational, seed development and msh1 memory. Fig. 4 shows a single-scale, direct comparison of differential methylation behavior in these datasets. Rather than total DIMP number, we present relative DIMP number: in Fig. 4, DIMP number is normalized to the corresponding local cytosine context number. The absolute DIMP counts and DIMP counts per genomic region are provided in Additional file 2: Table S1 for the seed development and germination dataset. The signal detection step of Methyl-IT discriminates signal unique to the sample from background patterning changes shared within the control, without regard to DMP density. Consistent with expectations, the generational dataset displays the lowest level of variation across lineages, with greater inter-lineage variation than generational, and highest DIMP signal in CG context. Direct comparison between the generational and seed development studies estimated pattern and magnitude differences between the two datasets. Methylation signal in the seed development dataset taken from the original study by Kawakatsu et al. [18] was greater than that of the generational study, with DIMP signal in all three contexts (CG, CHG and CHH). CHG and CHH changes were associated predominantly with non-genic and TE regions, and CG DIMPs showed higher density within gene regions (Fig. 4). Analysis of msh1 memory, when compared to the generational and seed development data, showed significantly greater magnitude change and prevalent methylation DIMP signal within genic CG context. Genome-wide analysis of methylation in the memory line, enhanced by signal detection, revealed considerable CG, CHG and CHH DIMPs across all chromosomes. Results are shown for data before (Fig. 3) and after (Fig. 4 and Additional file 1: Figure S1) normalization to demonstrate that, while the vast majority of methylation resides in CHH context, when normalized for density, changes in CG context predominated on chromosome arms (Additional file 1: Figure S1).
A hierarchical cluster based on AUC criteria, and built on the set of 7006 selected DIMP-associated genes, permitted the classification of seed developmental stages into two main groups: morphogenesis and maturation phases (Additional file 1: Figure S2a). In this case, the methylation signal was expressed in terms of log2(DIMP counts on gene). Within the 7006-dimensional metric space generated by the 7006 AUC-selected genes, the linear cotyledon (COT) and mature green (MG) stages (morphogenesis-maturation phase) grouped into a cluster quite distant from the cluster of post mature green (PMG) and dry seed (DRY) stages (dormancy phase). The latter cluster was closer to the leaf dataset derived from 4-week-old plants. Similar analysis was performed for the seed germination experiment from the mentioned study, and a hierarchical cluster built on the set of 3864 genes selected by AUC criteria permitted the classification of seed developmental stages into two main groups: 1) dormancy and 2) germination-emerging phases (Additional file 1: Figure S2b).
Differentially methylated genes (DMG)
Here we propose the concept of differentially methylated genes (DMGs) based on the comparison of group DIMP counts by applying a generalized linear regression model (GLM). In particular, DMRs (clusters of DMPs within a specified region) can be tested in a group comparison by applying GLM.
Genes displaying a statistically significant difference in the number of DIMPs relative to control were defined as DMGs. Additional file 3: Table S2 shows the number of DMGs observed in the seed development data, based on Methyl-IT analysis. In this case, the analysis included DIMPs regardless of hypo- or hypermethylation direction, and from all cytosine methylation contexts. Genes were defined as the region covered by the gene body plus 2 kb upstream of the gene start site.
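The text does not specify the GLM family or link function. As a hedged illustration only, the sketch below tests a single gene for a group difference in DIMP counts using a Poisson model with one group factor, for which the maximum-likelihood fit has a closed form (each group's mean count). The function name and the example counts are hypothetical.

```python
import math

def poisson_group_lrt(control_counts, treatment_counts):
    """Likelihood-ratio test for a group effect on per-sample DIMP counts.

    Null model: one Poisson rate shared by all samples.
    Alternative: a separate Poisson rate for each group.
    Returns (statistic, p_value), with the p-value taken from a chi-square
    distribution with 1 degree of freedom.
    """
    def loglik(counts, rate):
        # Poisson log-likelihood without the -log(y!) terms, which cancel
        # in the likelihood ratio. A zero rate implies all counts are zero.
        if rate == 0:
            return 0.0
        return sum(y * math.log(rate) - rate for y in counts)

    pooled = list(control_counts) + list(treatment_counts)
    rate_null = sum(pooled) / len(pooled)
    rate_ctrl = sum(control_counts) / len(control_counts)
    rate_trt = sum(treatment_counts) / len(treatment_counts)

    ll_null = loglik(pooled, rate_null)
    ll_alt = loglik(control_counts, rate_ctrl) + loglik(treatment_counts, rate_trt)
    statistic = max(0.0, 2.0 * (ll_alt - ll_null))
    # Chi-square (1 df) survival function: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical DIMP counts on one gene in three control and three treatment samples.
stat, p = poisson_group_lrt([2, 3, 1], [20, 25, 22])
```

Count data of this kind are often overdispersed, so a quasi-Poisson or negative binomial family may be more appropriate in practice. A genome-wide application would also require multiple-testing adjustment of the per-gene p-values.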
The number of DMGs (1068 genes) is considerably lower than the number of genes associated with DMRs derived in the original study by Kawakatsu et al. (2017) [18]. Methylpy-derived DMR number reflects genomic intervals with a given density of cytosine methylation changes, defined relative to a control. Methyl-IT DMG number reflects gene regions with highest probability of differential methylation distinct from background activity in the control. For example, after combining the embryogenesis CG, CHG, and CHH DMRs reported in Kawakatsu et al. [18] (Table S5 from [18]) into a single set of DMRs, only 468 of 6433 DMR-associated genes (after removing duplicated genes and updating annotation) were Methyl-IT DMGs that met our GLM criteria in the group comparison of maturation phase versus morphogenesis phase (Additional file 1: Figure S3a). DMR-associated gene analysis was also performed with the set of DMRs detected in the germination experiment from the same study [18]. Similarly, 53 of 7638 DMR-associated genes were identified as DMGs that met our GLM criteria in the group comparison of germination-emerging versus dormancy phases (Additional file 1: Figure S3b). In this case, the 7638 DMR-associated genes comprise the set resulting from pooling germin-CHG and germin-CHH DMRs (as reported in Table S5 from reference [18]). Analysis for the set of all genes yielded 136 DMGs (Additional file 1: Figure S3c).
To more generally investigate the relative efficacy of commonly used methylation analysis programs, we applied DSS, BiSeq and Methylpy to the msh1 memory line and corresponding Col-0 control methylome datasets. The control line was acquired as a transgene-null within the same transformation experiment that produced the MSH1-RNAi lines from which the memory line derives, and has been grown in parallel each subsequent generation. The overlaps of DMR-associated genes from DMRs found in the memory line by the methylome analysis pipelines DSS, BiSeq, and Methylpy are presented in Fig. 5a. What is striking is the degree of data non-conformity among the three methods. Because the subjacent algorithms of these programs are based on different statistical and computational approaches and do not define DMRs uniformly, the data output differs in sensitivity and methylation change criteria. The application of GLM to estimate the DMG set by Methyl-IT and its overlap with DMR-associated genes retrieved from DMRs identified by the mentioned programs is shown in Fig. 5b. For the group comparison counting only gene-body DIMPs, a total of 9271 loci (from the entire set of genes) were identified as DMGs in the msh1 memory line (Additional file 4: Table S3), while 8798 DMGs were identified for the group comparison counting DIMPs within gene body plus 2 kb upstream and downstream (with TVD > 0.15). The application of GLM in estimating DMGs is not implemented to identify DMRs, but to evaluate whether or not a statistically significant difference exists between methylation signals observed in two individual groups for an already defined DMR.
Methyl-IT identifies gene networks in seed development and germination dataset
If heightened sensitivity in methylome signal detection imparts added biological information, this should be evident in tests for association of methylome signal with gene expression changes. Observed CG and CHG signal implies that changes in methylation during seed development relate to gene expression and/or developmental transitioning. To investigate this possibility further, we conducted a network enrichment analysis test (NEAT) of the Methyl-IT output from seed development and germination datasets.
Analysis of data from stages of seed development, including cotyledonary, mature green and post-mature green, contrasted to globular as reference, suggested a methylome repatterning following the mature green stage (Additional file 1: Figure S2). Data indicate that methylome patterns are more similar between cotyledonary and mature green stages, transitioning to a distinguishable state for post-mature green and dry seed. This methylome transition may relate to the desiccation and dormancy shift that also occurs with this timing [28, 29]. Further analysis of differentially methylated loci with NEAT detected statistically significant network enrichment of links between genes from the set of DMGs (Ws-0 seed) and the set of GO biological process terms associated with seed functions (Table 2). The list of genes found in networks includes genes known to participate in seed development, for example the transcription factors DPBF2 (AT3G44460), from an abscisic acid-activated signaling pathway expressed during seed maturation in the cotyledons, ABSCISIC ACID BINDING FACTOR (ABF1, AT1G49720), and WRKY22 (AT4G01250), a member of the WRKY transcription factors involved mainly in seed development. Other genes were found to be involved in seed dormancy, like SLY1 (SLEEPY1), and in seedling development, like EIN4 (AT3G04580) and CML16 (AT3G25600) (full gene list in Additional file 5: Table S4). GeneMANIA (http://www.cytoscape.org/) identified interaction networks within the data, indicating that many DMGs in the seed development dataset function together (Additional file 1: Figure S4).
Similar analysis of the seed germination and the Col-0 single-seed descent datasets did not detect DMGs within networks. Results in the single-seed descent generational study are consistent with expectations, since samples were grown under controlled conditions and sampled uniformly over generations. In the case of the seed germination dataset, this outcome may be consistent with the fact that only CHG and CHH DMRs were found in the original seed germination study by Kawakatsu et al. (2017) [18], while the seed developmental experiment showed 60% of CG DMRs overlapping with protein-coding genes. These data suggest that methylome signal may be more prominent under particular developmental transitions, like seed preparation for dormancy and desiccation, than during processes like germination.
The memory line phenotype
Transgene-null plants following segregation of the MSH1-RNAi transgene, termed msh1 ‘memory’ lines, display full penetrance and transgenerational inheritance of the altered phenotype, and the msh1 memory effect recapitulates in tomato [30]. Arabidopsis lines that have undergone silencing of MSH1 segregate for the MSH1-RNAi transgene by self-crossing to produce heritable phenotype changes in ca. 7-25% of the resulting transgene-null progeny (Fig. 6a). The msh1 memory phenotype is milder and more uniform than that observed in msh1 mutants derived by point mutation, T-DNA mutation or RNAi suppression [19, 20, 23] (Fig. 6b). Memory lines show normal MSH1 transcript levels (Fig. 6c), but 100% penetrance and heritability of the altered phenotype in subsequent self-crossed generations. Over 3,000 RNAi-null memory line progeny under greenhouse conditions produced neither visible reversion to wild type nor more severe msh1 phenotypes (Additional file 1: Figure S5). In Arabidopsis, memory lines were stably carried forward four generations and, in tomato, ten generations to date.
Memory line methylome changes detected by Methyl-IT associate with gene expression
The derived transgene-null msh1 memory lines display gene expression changes in ca. 955 genes (Additional file 6: Table S5), approximately 67% of which are shared with the msh1 mutant (Additional file 7: Table S6, Additional file 6: Table S5).
The memory line DEG profile is distinctive. Unlike the mutant, which shows widespread gene ontology enrichment in nearly every stress response pathway (Additional file 7: Table S6), memory line gene ontology enrichment shows skewing toward integrated pathways for circadian clock, starch metabolism, and ethylene and abscisic acid response (Fig. 6d). These studies use the msh1 TDNA insertion mutant rather than transgenic MSH1-RNAi for comparisons to ensure that each plant is msh1-depleted. Transgenic RNAi knockdown lines are variable for MSH1 suppression across plants (Fig. 6c), potentially confounding interpretation, and MSH1-RNAi and msh1 TDNA mutant appear identical in phenotype (Fig. 6b).
Application of Network-Based Enrichment Analysis (NBEA) to the set of 955 DEGs in the memory line detected over-enrichment in five pathways: “circadian rhythm”, “response to red or far red light”, “regulation of circadian rhythm”, “long-day photoperiodism/flowering”, and “regulation of transcription”. The permutation test applied to these data indicates that the probability of observing simultaneous over-enrichment of these pathways by chance is lower than 4×10−5, reflecting a non-random outcome (Additional file 8: Table S7).
The msh1 “memory” is a candidate system for non-genetic methylome reprogramming
Similar to the investigation of methylation changes during seed development and germination, we followed Methyl-IT analysis of msh1 memory line data with NEAT and network-based enrichment analysis (NBEA) to assess biologically meaningful data based on DMGs alone. Additional file 9: Table S8 shows results classifying methylation signal into networks for circadian clock, abscisic acid-activated signaling, and defense response. Approximately 32% of identified DEGs overlap with DMGs in the memory line (Fig. 7a). These differentially methylated and expressed loci are over-enriched for genes contributing to circadian rhythm, plant hormone signal transduction, and the MAPK signaling pathway (Fig. 7b-7d). Network analysis of expression, shown in Fig. 7b-7d, suggests dysregulation of these pathways in msh1 memory.
Integration of independently derived DEG, DMG and NBEA data from the memory lines converged on 16 loci (Fig. 7a and Table 3), of which 10 directly participate in circadian rhythm regulation and the remainder, associated with light, ABA and ethylene response, are directly influenced by circadian clock regulators (Table 3). Principal component (PC) analyses based on the mean of CG Hellinger divergence covering the gene regions delimited by DMGs (Fig. 8a), the DMG/DEG intersection (Fig. 8b) and the mentioned 16 loci (Fig. 8c) suggest a distinctive role of gene-associated CG methylation in the msh1 memory effect. For all analyses, more than 80% of variance among wild type, msh1 memory and msh1 TDNA mutant was explained on the plane PC1-PC2, where the msh1 memory effect is clearly distinguishable from control. Quantitative discriminatory power of CG methylation in the 16 signature loci is reflected in hierarchical clustering based on their PC1-PC2 coordinates (Fig. 8d) and in their strong correlation with the first two components (Fig. 8e). In particular, eight circadian rhythm genes strongly correlate with PC1, which carries 65% of the whole sample variance. Thus, for these genes, CG methylation conveys enough discriminatory power to distinguish individual wild type phenotypes from the msh1 memory effect.
These observations are the first inference of association between CG methylation and gene expression changes in the msh1 memory line. DIMP distribution along the 16 signature loci showed most CG and non-CG DIMPs located within exonic regions in memory lines with little individual CG-DIMP variation (sometimes balanced with non-CG), suggesting that a programmed distribution pattern might exist (Additional file 10: Table S9).
Predicted changes in methylation pattern at core circadian clock genes were subsequently confirmed by sequence-specific bisulfite (BS) PCR analysis (Fig. 9a-9d). DIMPs were confirmed in the memory line at GI, TOC1, LHY and CCA1 genes. BS-PCR primer set BS-GI-P2, designed to bind to a predicted DIMP-rich region, confirmed DIMPs within the region (Fig. 9e), while primer set BS-GI-P7, designed to bind to a DIMP-free region, detected no changes (Fig. 9f). The DNA bisulfite conversion rate in this experiment was confirmed by using DDM1 as control, with a calculated bisulfite conversion rate of 99.47% for WT and 100% for memory line sample (Additional file 1: Figure S6).
Germination of the memory line and isogenic Col-0 wild type on media containing 100 µM 5-azacytidine alleviated the phenotype differences between the two lines, resulting in similar growth rates (Additional file 1: Figure S7). Transfer to potting media to assess later growth showed wild type and memory lines to be similar in phenotype following treatment (Additional file 1: Figure S7). Likewise, RNAseq analysis of the treated and untreated memory and control lines showed that 5-azacytidine treatment had genome-wide effects on the gene expression pattern of both the msh1 memory line and wild type, and brought overall gene expression patterns of the treated msh1 memory line and wild type closer than before treatment (Additional file 1: Figure S8). These observations reflect association between DNA methylation behavior and the altered phenotype.
Wild type and memory line plants treated with 5-azacytidine were also tested for changes in expression of the sixteen identified loci shown in Table 3. Quantitative RT-PCR assays confirmed previous RNAseq results, showing significant differences in steady state transcript levels for 14 of the 16 loci in wild type versus memory line plants growing under no treatment conditions (Additional file 1: Figure S9). Plants germinated in 5-azacytidine prior to transfer to growth media, however, produced no significant differences in gene expression for these loci in memory lines versus wild type (Additional file 1: Figure S9). These data show a relationship between methylation state and gene expression changes in msh1-induced memory, and provide evidence that altering methylation via chemical treatment can return gene expression to nearly wild type steady state levels for these loci within the time period assayed.
The msh1 memory effect is related to circadian rhythm changes
Both gene expression and methylome datasets, analyzed independently, indicated alteration in components of the circadian clock. To test for modified circadian oscillation behavior in msh1 memory, gene expression levels for 4 core circadian clock genes in Arabidopsis and 2 genes in tomato were evaluated over a 48-h time course under constant light (LL) and light-dark cycles (LD). Results confirmed a degree of circadian rhythm dysregulation for all tested loci in both Arabidopsis memory lines, with varying levels of altered expression (Fig. 10). DEG analysis in Arabidopsis showed that the proportion of genes regulated by TOC1/CCA1 and altered in expression increased from 10.4% in the msh1 T-DNA mutant line to 33.1% in the msh1 memory line (Fig. 11a). Memory-associated processes identified in Fig. 6d (starch metabolism and cold, ethylene and abscisic acid response) are circadian clock output pathways [31] (Fig. 11b-e), again signifying that methylome repatterning influences genes that function coordinately. The altered expression of three genes from these pathways was confirmed in Arabidopsis by qRT-PCR (Additional file 1: Figure S10). Data to date suggest that circadian clock dysregulation contributes to the memory line phenotype; it is not yet known whether clock dysregulation acts causally in memory programming.
Comparable memory effects are detected in tomato
The msh1 effect is recapitulated across plant species [23, 30]. We exploited this observation by comparing msh1 memory lines in Arabidopsis and tomato (cv ‘Rutgers’). Genome-wide methylome (BSseq) data were derived from Rutgers wild type and MSH1-RNAi transgene-null lines (fifth generation). Similar to Arabidopsis, tomato memory lines are attenuated and more uniform in phenotype relative to RNAi suppression lines, described by Yang et al. (2015) [30], and display reduced growth rate and delayed flowering.
To test the value of Methyl-IT analysis in a dataset derived from another plant species, and to learn whether signature pathways identified in the Arabidopsis msh1 memory line are shared in tomato msh1 memory, we conducted parallel analysis with the derived tomato memory line methylome dataset. Available gene annotation in tomato is incomplete. Therefore, identified differentially methylated tomato loci were cross-referenced to Arabidopsis orthologs. We identified 7802 tomato DMGs (Additional file 11: Table S10). Of these, 4277 were shared with Arabidopsis, accounting for ca. 55% of tomato DMGs and 46% of Arabidopsis DMGs (Fig. 12a). With NBEA analysis, we identified 147 tomato genes predominantly associated with phytohormone response, including auxin, salicylic acid, ethylene and ABA pathways, together with circadian regulators, abiotic and biotic stress genes, and light response (Additional file 12: Table S11). Arabidopsis homologs for 43% (63) of these 147 genes were found in Arabidopsis DMGs by NBEA (Additional file 13: Table S12). Homologs for 6 of the 16 loci identified in Arabidopsis and listed in Table 3 were present in the list of 147 tomato genes. Similar circadian clock dysregulation was observed in tomato msh1 memory as in its Arabidopsis counterparts. Gene expression levels for 2 core circadian clock genes in tomato, Sl_TOC1 (Solyc06g069690) and Sl_LHY (Solyc10g005080), were evaluated over a 48-h time course under light-dark cycles (LD) to confirm dysregulation (Fig. 12b), along with downstream circadian clock-regulated genes (Fig. 12c). Together, these data reflect cross-species conservation underlying msh1 memory.
Discussion
Methyl-IT draws from the perspective that DNA methylation functions to stabilize DNA [32-34] and, as such, may exist in “activated” versus “maintenance” states with regard to bioenergetics. We have begun to investigate DNA methylation patterning as a “language” of sorts, identifying pattern changes that comprise “signal” in response to treatment, without regard to density of methylation changes within a given interval. While the theoretical premise underlying our approach, which is based on Landauer’s principle, is detailed elsewhere [9, 10], the present study compares the resolution of this methodology with that of current methods for analysis of whole-genome methylation datasets.
Methyl-IT permits methylation analysis as a signal detection problem. Our model predicts that most methylation changes detected, at least in Arabidopsis and tomato, represent methylation “background noise” with respect to methylation regulatory signal, and are explainable within a statistical probability distribution. Implicit in our approach is that DIMPs can be detected in the control sample as well. These DIMPs are located within the region of false alarm in Fig. 1, and correspond to natural methylation signal not induced by treatment. Thus, using the Methyl-IT procedure, methylation signal is not only distinguished from background noise, but can be used to discern natural signal from that induced by the treatment.
Whereas Methylpy, DSS and BiSeq provide essential information about methylation density, context and positional changes on a genome-wide scale, Methyl-IT provides resolution of subtle methylation repatterning signals distinct from background fluctuation. Data derived from analysis with Methylpy, BiSeq or DSS alone could lead to an assumption that gene body methylation plays little or no role in gene expression, or that transposable elements are the primary target of methylation repatterning. Yet ample data suggest that this picture is incomplete [35]. Methyl-IT results show that these conclusions more likely reflect inadequate resolution of the methylome system. GLM analysis applied to the identification of DMR-associated genes by Methylpy, BiSeq and DSS indicates that DMRs (or DMR associated genes) do not provide sufficient resolution to link them with gene expression.
Signal detected by Methyl-IT may reflect gene-associated methylation changes that occur in response to local changes in gene transcriptional activity. Comparative analysis of the msh1 memory line data with msh1 T-DNA mutant, a more extreme phenotype, showed 42.3% of memory line DMGs (3921 out of 5354) to overlap with msh1 T-DNA DEGs. With the memory line DEGs estimated to number only 935, it is possible that methylation repatterning within the memory line serves to stabilize or re-establish gene expression following the extreme, stress-related changes that accompany MSH1 silencing [24]. Similarly, the pathway-associated methylome changes detected in seed development data may reflect participation of methylation in gene expression stage transitions, particularly prominent between green mature and post-green mature stages.
Methyl-IT analysis of various stages in seed development and germination showed evidence of methylation changes. Previous Methylpy output [18] defined predominant changes in non-CG methylation residing within TE-rich regions of the genome, whereas Methyl-IT data resolved statistically significant methylation signal within gene regions. With the complementary resolution provided by Methyl-IT, it becomes possible to investigate the nature of chromatin response within identified genes in greater detail during the various stages of a seed’s development. Several of the identified DMGs in this study involved genes that interact within known development pathways.
There is little detail available in plants of local intragenic methylation behavior during transitions in gene activation, but transcription factor-associated recruitment of methylation machinery has been postulated [35], and supported by data in other systems [36]. A large proportion of the intervals identified by this study are components of signal transduction, so expression effects may be below the detection limits of the assay. Among the 1717 transcription factors reported in PlantTFDB, 340 are identified as DMGs in our list for memory line. Effects of alternative splicing in memory changes, also known to respond to local methylation [37], would similarly have escaped detection in our gene expression analysis. However, for a better comprehension of which genes would be controlled by the regulatory methylation machinery in processes like seed development or the induced msh1 memory effect, the network enrichment analysis of DMGs and DEGs can reduce the number of potential regulators to a minimal number of genes testable under lab conditions, as presented in our study. Analysis produced evidence of a relationship between msh1 memory line gene expression and differential methylation data for at least 16 regulatory loci, 10 of which comprise components of the circadian clock.
Plants have the capacity to respond to a wide array of abiotic and biotic stresses and developmental cues through overlapping gene networks. It is increasingly evident that phytohormone, light response, abiotic and biotic stress response, photosynthesis and carbohydrate metabolism are integrated output pathways of the plant’s circadian clock [31]. A significant proportion of the plant’s gene expression profile is influenced by circadian regulation [38], introducing the concept of a master regulator of adaptation. Numerous reports underscore extensive pathway integration under circadian clock control, with starch metabolism, cold response and abscisic acid-mediated stress response, for example, as particularly prominent pathways altered by msh1 memory. The link between plant response to cold and epigenetic memory involves histone modifications of the FLC locus during vernalization [39]. Cold temperature also influences alternative splicing patterns of clock genes to alter their function [40]. ABA, a stress hormone, shows rhythmic diel levels in plants [41], and associates with TOC1 and an ABA-related gene, ABAR, in a highly regulated feedback loop [42]. Epigenetic modification of circadian clock genes effects changes in starch metabolism [43], and can educe enhanced growth vigor in hybrids and allopolyploids [44]. Studies of classical heterosis in Arabidopsis also show association with changes in circadian clock behavior [45]. Data from this study indicate that MSH1 suppression includes circadian clock, ABA and ethylene dysregulation as components of the associated msh1 global stress condition. Segregation of the MSH1-RNAi transgene only partially reverts the phenotype, revealing loci that have apparently sustained cytosine methylation repatterning, and producing a phenotypic memory effect, presumably methylation-based, that is reproducible and heritable. If correct, the msh1 memory phenomenon comprises a robust medium for addressing epiallelic stability.
Identification of gene networks in both seed development and msh1 memory was based on DNA methylation data analysis with the enhanced resolution of Methyl-IT. In the case of msh1 memory, gene expression, phenotype and cross-species comparison served to confirm the identified networks. While early in the process, these outcomes argue compellingly for the feasibility of genome-wide methylome decoding of the gene space.
Conclusions
Methyl-IT is an alternative and complementary approach to plant methylome analysis that discriminates DNA methylation signal from background and enhances resolution. Analysis of publicly available methylome datasets showed enhanced signal during seed development and germination within genes belonging to related pathways, providing new evidence that DNA methylation changes occur within gene networks. Similarly, msh1 transgenerational memory phenomena in Arabidopsis and tomato identified methylation-altered gene networks involving circadian clock components and linked stress response pathways altered in expression and connected to phenotype. Whereas previous methylome analysis protocols identify changes in methylome density and landscape, predominantly non-CG, Methyl-IT reveals effects within gene space, mostly CG and CHG, for elucidation of methylome linkage to gene effects.
Methods
Methylome analysis
The alignment of BS-Seq sequence data from Arabidopsis thaliana was carried out with Bismark 0.15.0 [46]. BS-Seq sequence data from the tomato experiment were aligned using ERNE 2.1.1 [47]. The basic and theoretical aspects of methylation analysis applied in the current work are based on previously published results [9]. Details on Methyl-IT steps are provided in the next sections.
Methylation level estimation
To estimate methylation levels at each cytosine position, we followed a Bayesian approach. In a Bayesian framework assuming uniform priors, the methylation level $p_i$ can be defined as: $p_i = (n_i^{mC} + 1)/(n_i^{mC} + n_i^{C} + 2)$, where $n_i^{mC}$ and $n_i^{C}$ represent the numbers of methylated and non-methylated read counts observed at the genomic coordinate $i$, respectively. We estimate the shape parameters $\alpha$ and $\beta$ from the beta distribution $P(p \mid \alpha, \beta) = p^{\alpha - 1}(1 - p)^{\beta - 1}/B(\alpha, \beta)$ minimizing the difference between the empirical and theoretical cumulative distribution functions (ECDF and CDF, respectively), where $B(\alpha, \beta)$ is the beta function with shape parameters $\alpha$ and $\beta$. Since the beta distribution is a conjugate prior of the binomial distribution, we consider the $p$ parameter (methylation level $p_i$) in the binomial distribution as randomly drawn from a beta distribution. The hyper-parameters $\alpha$ and $\beta$ are interpreted as pseudo counts. Then, the mean $E[p_i \mid D]$ of methylation levels $p_i$, given the data $D$, is expressed by $E[p_i \mid D] = (n_i^{mC} + \alpha)/(n_i^{mC} + n_i^{C} + \alpha + \beta)$. The methylation levels at the cytosine with genomic coordinate $i$ are estimated according to this equation.
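As an illustration of the estimate described above, the following minimal Python sketch computes the posterior mean of the methylation level from read counts under a beta prior. The function name `methylation_level` and the example counts are hypothetical, and the default pseudo counts of 1 correspond to the uniform prior; in practice the shape parameters would be fitted to the data as described, which this sketch does not attempt.

```python
def methylation_level(n_mc, n_c, alpha=1.0, beta=1.0):
    """Posterior mean of the methylation level at one cytosine.

    n_mc, n_c: methylated and non-methylated read counts.
    alpha, beta: beta-prior pseudo counts (1, 1 is the uniform prior).
    """
    if n_mc < 0 or n_c < 0:
        raise ValueError("read counts must be non-negative")
    return (n_mc + alpha) / (n_mc + n_c + alpha + beta)


# Hypothetical site with 8 methylated and 2 non-methylated reads.
uniform_estimate = methylation_level(8, 2)
# Same site with illustrative (not fitted) pseudo counts.
informed_estimate = methylation_level(8, 2, alpha=0.5, beta=2.5)
print(uniform_estimate, informed_estimate)
```

With no reads at all, the estimate falls back to the prior mean, which is one practical reason for preferring the posterior mean to the raw proportion at low coverage.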
Hellinger and Total Variation divergences of the methylation levels
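The difference between methylation levels from reference and treatment experiments is expressed in terms of information divergences of their corresponding methylation levels, $\hat{p}_i^{r}$ and $\hat{p}_i^{t}$, respectively. The reference sample(s) can be additional experiment(s) fixed at specific conditions, or a virtual sample created by pooling methylation data from a set of control experiments, e.g. wild type individual or group.
Hellinger divergence between the methylation levels from reference and treatment experiments is defined as: $H(\hat{p}_i^{r}, \hat{p}_i^{t}) = w_i \left[ \left(\sqrt{\hat{p}_i^{t}} - \sqrt{\hat{p}_i^{r}}\right)^2 + \left(\sqrt{1 - \hat{p}_i^{t}} - \sqrt{1 - \hat{p}_i^{r}}\right)^2 \right]$ (Eq. 4), where $w_i = 2 m_i^{t} m_i^{r}/(m_i^{t} + m_i^{r})$, $m_i^{t} = n_i^{mC,t} + n_i^{C,t} + 1$ and $m_i^{r} = n_i^{mC,r} + n_i^{C,r} + 1$. The total variation of the methylation levels, $TV = \hat{p}_i^{t} - \hat{p}_i^{r}$, indicates the direction of the methylation change in the treatment, hypo-methylated TV < 0 or hyper-methylated TV > 0. TV is linked to a basic information divergence, the total variation distance, defined as: $TV_d = |\hat{p}_i^{t} - \hat{p}_i^{r}|$. Distance $TV_d$ and Hellinger divergence hold the inequality: $TV_d \le \sqrt{2}\, H_d$, where $H_d = \sqrt{H/w_i}$ is the Hellinger distance [48]. Under the null hypothesis of non-difference between distributions $\hat{p}_i^{r}$ and $\hat{p}_i^{t}$, Eq. 4 asymptotically has a chi-square distribution with one degree of freedom. The term $w_i$ introduces a useful correction for the Hellinger divergence, since the estimations of $\hat{p}_i^{r}$ and $\hat{p}_i^{t}$ are based on counts (see Table 1).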
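The two divergences can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the coverage weight used here, 2·m_t·m_r/(m_t + m_r) with m equal to the read coverage plus one, is an assumption about the form of the count-based correction term rather than a statement of the Methyl-IT implementation.

```python
import math


def hellinger_divergence(p_ref, p_trt, cov_ref, cov_trt):
    """Coverage-weighted Hellinger divergence between two methylation levels.

    p_ref, p_trt: methylation levels in reference and treatment (0..1).
    cov_ref, cov_trt: total read counts at the site in each sample.
    The weight below is an assumed form of the correction term w_i.
    """
    m_ref = cov_ref + 1
    m_trt = cov_trt + 1
    weight = 2.0 * m_ref * m_trt / (m_ref + m_trt)
    core = ((math.sqrt(p_trt) - math.sqrt(p_ref)) ** 2
            + (math.sqrt(1.0 - p_trt) - math.sqrt(1.0 - p_ref)) ** 2)
    return weight * core


def total_variation(p_ref, p_trt):
    """Signed difference: negative is hypo-, positive is hyper-methylation."""
    return p_trt - p_ref


# Hypothetical site: 20% methylated in the reference, 70% in the treatment.
h = hellinger_divergence(0.2, 0.7, cov_ref=12, cov_trt=15)
tv = total_variation(0.2, 0.7)
print(h, tv)
```

The sign of `tv` gives the direction of change, while `h` grows with both the size of the change and the read support behind it.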
Non-linear fit of Weibull distribution
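The cumulative distribution function (CDF) for $H$ can be approached by a Weibull distribution, $F(H) = 1 - \exp\left[-(H/\beta)^{\alpha}\right]$ (Eq. 8) [9]. Parameters $\alpha$ (shape) and $\beta$ (scale) were estimated by non-linear regression analysis of the ECDF $\hat{F}_n(H)$ versus $H$ [9]. The ECDF of the variable $H$ is defined as: $\hat{F}_n(H) = \frac{1}{n} \sum_{k=1}^{n} \mathbf{1}(H_k \le H)$, where $\mathbf{1}(\cdot)$ is the indicator function. Function $\hat{F}_n(H)$ is easily computed (for example, by using function “ecdf” of the statistical computing program “R” [49]).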
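To make the fitting step concrete, the sketch below builds the ECDF and estimates a two-parameter Weibull (shape and scale) from it. Note the substitution: instead of the non-linear regression used in the study, it fits the linearized form log(-log(1 - F)) = shape·log(H) - shape·log(scale) by ordinary least squares, which is a simpler stand-in that may serve to obtain starting values. Function names are hypothetical.

```python
import math


def ecdf(values):
    """Return (x, F_n(x)) pairs for the sorted sample."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def fit_weibull_linearized(values):
    """Estimate Weibull shape and scale by least squares on the
    linearized CDF. This is not the non-linear fit used in the paper."""
    points = [(math.log(x), math.log(-math.log(1.0 - f)))
              for x, f in ecdf(values) if x > 0 and f < 1.0]
    n = len(points)
    mean_u = sum(u for u, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    s_uu = sum((u - mean_u) ** 2 for u, _ in points)
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in points)
    shape = s_uv / s_uu
    # Intercept equals -shape * log(scale).
    scale = math.exp(mean_u - mean_v / shape)
    return shape, scale


# Synthetic divergences placed on exact Weibull quantiles (shape 1.5, scale 2).
n = 50
sample = [2.0 * (-math.log(1.0 - i / n)) ** (1.0 / 1.5) for i in range(1, n)]
sample.append(1e6)  # largest value; its ECDF is 1 and it is skipped in the fit
print(fit_weibull_linearized(sample))
```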
A statistical mechanics-based definition for a potential/putative methylation signal (PMS)
Most methylation changes occurring within cells are likely induced by thermal fluctuations to ensure thermal stability of the DNA molecule, conforming to laws of statistical mechanics [9]. These changes do not constitute biological signals, but methylation background noise induced by thermal fluctuations, and must be discriminated from changes induced by the treatment. Let $P(E_k^{D} \le E_0^{D})$ be the probability that the energy $E_k^{D}$, dissipated to create an observed divergence $D$ between the methylation levels from two different samples at a given genomic position $k$, is less than or equal to the amount of energy $E_0^{D}$. Then, a single genomic position $k$ shall be called a PMS at a level of significance $\alpha$ if, and only if, the probability $P(E_k^{D} > E_0^{D}) = 1 - P(E_k^{D} \le E_0^{D})$ to observe a methylation change with energy dissipation higher than $E_0^{D}$ is less than $\alpha$. The probability $P(E_k^{D} \le E_0^{D})$ can be given by a member of the generalized gamma distribution family and, in most cases, experimental data can be fitted by the Weibull distribution [9]. Based on this dynamic nature of methylation, one cannot expect a genome-wide relationship between methylation and gene expression. A practical definition of PMS based on Hellinger divergence derives provided that $H_k$ is proportional to $E_k^{D}$ and using the estimated Weibull CDF for $H_k$ given by Eq. 8. That is, a single genomic position $k$ shall be called a PMS at a level of significance $\alpha$ if, and only if, the probability to observe a methylation change with Hellinger divergence higher than $H_k$ is less than $\alpha$.
The PMSs reflect cytosine methylation positions that undergo changes without discerning whether they represent biological signal created by the methylation regulatory machinery. The application of signal detection theory is required for robust discrimination of biological signal from physical noise induced by thermal fluctuations, permitting a high signal-to-noise ratio.
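The practical PMS rule based on Hellinger divergence reduces to a tail-probability check against the fitted Weibull CDF. A minimal sketch follows, assuming a two-parameter Weibull with hypothetical `shape` and `scale` values already estimated from the data; the function names are illustrative and are not taken from the Methyl-IT code.

```python
import math


def weibull_cdf(h, shape, scale):
    """Two-parameter Weibull CDF; zero for non-positive arguments."""
    if h <= 0:
        return 0.0
    return 1.0 - math.exp(-((h / scale) ** shape))


def is_pms(h_k, shape, scale, alpha=0.05):
    """Call a PMS when the probability of a divergence larger than
    the observed h_k is below the significance level alpha."""
    tail_probability = 1.0 - weibull_cdf(h_k, shape, scale)
    return tail_probability < alpha


# With shape = scale = 1 the tail probability is exp(-h).
for h_obs in (2.0, 3.0):
    print(h_obs, is_pms(h_obs, shape=1.0, scale=1.0))
```

A position that passes this check is only a candidate; as the text notes, the signal detection step is still needed to decide whether it represents regulatory signal.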
Robust detection of differentially informative methylated positions (DIMPs)
Application of signal detection theory is required to reach a high signal-to-noise ratio [50, 51]. To enhance DIMP detection, the set of PMSs is reduced to the subset of cytosines with TVD > TVD0, where TVD0 is a minimal total variation distance defined by the user, preferably TVD0 > 0.1. If we are interested not only in DIMPs but also in the full spectrum of biological signals, this constraint is not required. Once potential DIMPs are estimated in the treatment and in the control samples, a logistic regression analysis is performed with the prior binary classification of DIMPs, i.e., in terms of PMSs (from treatment versus control), and a receiver operating characteristic (ROC) curve is built to estimate the cutpoint of the Hellinger divergence at which an observed methylation level represents a true DIMP. There are several criteria to estimate the optimal cutpoint, many of which are implemented in the R package OptimalCutpoints [25]. The optimal cutpoint used in Methyl-IT corresponds to the H value that maximizes Sensitivity and Specificity simultaneously [52, 53]. These analyses were performed with the R package Epi [54].
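The cutpoint criterion can be illustrated with a small Python sketch that scans candidate Hellinger divergence values and keeps the one maximizing sensitivity plus specificity (equivalently, Youden's J). This is a simplification under stated assumptions: it thresholds the divergence values directly and omits the logistic regression step, and the function name and example values are hypothetical.

```python
def optimal_cutpoint(control_h, treatment_h):
    """Return the divergence cutpoint maximizing sensitivity + specificity.

    control_h: divergences of PMSs found in control samples (noise class).
    treatment_h: divergences of PMSs found in treatment samples.
    A position is classified as a DIMP when its divergence >= cutpoint.
    """
    best = None
    for cut in sorted(set(control_h) | set(treatment_h)):
        sensitivity = sum(h >= cut for h in treatment_h) / len(treatment_h)
        specificity = sum(h < cut for h in control_h) / len(control_h)
        score = sensitivity + specificity
        if best is None or score > best["score"]:
            best = {"cutpoint": cut, "sensitivity": sensitivity,
                    "specificity": specificity, "score": score}
    return best


# Hypothetical divergences with perfect separation between classes.
control = [0.1, 0.2, 0.3, 0.4]
treatment = [0.45, 0.5, 0.6, 0.7]
print(optimal_cutpoint(control, treatment))
```

When several cutpoints tie, this sketch keeps the lowest one; a different tie-breaking rule may be preferable depending on the relative cost of false positives and false negatives.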
Once all pairwise comparisons are done, a final decision of whether a DFMP is a DIMP is taken based on the highest cutpoint detected in the ROC analyses (Fig. 1). That is, the decision is taken based on the cutpoint estimated in the ROC analysis for the control sample with the closest distribution to treatment samples. The position of the cutpoint determines a final posterior classification, for which we estimate the number of true positives, true negatives, false positives and false negatives. For each cutpoint, we estimate the accuracy and the risk of our predictions. Different cutpoints may be preferable in different situations. For example, if the goal is the early detection of a terminal disease, and high values of the target variable indicate that a patient carries the disease, then to save lives we would prefer the lowest meaningful cutpoint, which reduces the rate of false negatives.
Estimation of differentially methylated genes (DMGs) using Methyl-IT
Our degree of confidence in whether DIMP counts in both control and treatment represent true biological signal was set out in the signal detection step. To estimate DMGs, we followed similar steps to those proposed in the Bioconductor R package DESeq2 [55], but the test looks for statistical difference between the groups based on gene body DIMP counts rather than read counts. Regression analysis with generalized linear models (GLMs) with logarithmic link was applied to test the difference between group counts. The fitting algorithmic approaches provided by the glm and glm.nb functions from the R packages stats and MASS were used for Poisson (PR), Quasi-Poisson (QPR) and Negative Binomial (NBR) linear regression analyses, respectively.
As in DESeq2, we used the linear regression model with design matrix elements $x_{jk}$, coefficients $\beta_{ik}$, and mean $\mu_{kj} = s_j q_{kj}$, where the normalization constants $s_j$ are considered constant within a group. Only two groups were compared at a time. The design matrix elements indicate whether a sample j is treated or not, and the GLM fit returns coefficients indicating the overall methylation strength at the gene and the logarithm base 2 of the fold change (log2FC) between treatment and control [55]. In particular, in the case of NBR, the inverse of the variance was used as prior weight (computed from the data dispersion, disp, estimated by the estimateDispersions function from the DESeq2 R package).
To test the difference between group counts, we applied the following fitting approaches: PR, QPR, NBR and NBR with ‘prior weights’. Next, the best model was selected based on the Akaike information criterion (AIC). The Wald test for significance of the independent variable coefficient indicates whether or not the treatment effect is significant, while the coefficient sign (log2FC) indicates the direction of such an effect.
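The logic of comparing group counts and ranking models by AIC can be sketched for the simplest case. The Python example below handles only the Poisson model with a single two-level factor, for which the maximum-likelihood fitted means are the group means, so no iterative fitting is needed; the Quasi-Poisson and Negative Binomial fits, prior weights, normalization constants and the Wald test are all omitted. Function names and counts are hypothetical.

```python
import math


def poisson_loglik(counts, mu):
    """Poisson log-likelihood of integer counts with common mean mu > 0."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in counts)


def compare_groups(control, treatment):
    """Compare a one-mean (null) and a two-mean (treatment effect)
    Poisson model for DIMP counts of one gene, using AIC = 2k - 2 logL."""
    pooled = control + treatment
    mu_null = sum(pooled) / len(pooled)
    mu_ctrl = sum(control) / len(control)
    mu_trt = sum(treatment) / len(treatment)
    ll_null = poisson_loglik(pooled, mu_null)
    ll_full = (poisson_loglik(control, mu_ctrl)
               + poisson_loglik(treatment, mu_trt))
    return {
        "aic_null": 2 * 1 - 2 * ll_null,
        "aic_full": 2 * 2 - 2 * ll_full,
        "log2fc": math.log2(mu_trt / mu_ctrl),
    }


# Hypothetical gene-body DIMP counts in three control and three treated plants.
result = compare_groups([2, 3, 4], [10, 12, 14])
print(result)
```

A lower `aic_full` than `aic_null` favors the model with a treatment effect, and the sign of `log2fc` gives its direction. The sketch assumes both group means are positive.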
Bootstrap goodness-of-fit test for 2×2 contingency tables
The goodness-of-fit root-mean-square test (RMST) for 2×2 contingency tables, as implemented in Methylpy [17] for the estimation of DMSs (based on the root-mean-square (RMS) statistic), is explained by Perkins et al. [26] (a complementary description is found at arXiv:1108.4126v2). The bootstrap heuristic to perform the test is given in reference [56]. An analogous bootstrap goodness-of-fit test based on Hellinger divergence was also applied to estimate DMPs (HDT). In this case, Hellinger divergence was estimated according to the first statistic given in Theorem 1 from reference [27].
Identification of differentially methylated regions by using BiSeq, DSS and Methylpy
For BiSeq, raw sequence reads were trimmed to remove both poor-quality calls and adapters using Trim Galore! (version 0.4.1) with options --paired --trim1 --gzip --phred33 --fastqc and Cutadapt (version 1.9.1) with cutoff 20. Remaining sequences were mapped to the Arabidopsis TAIR10 genome using Bismark (version 0.15.0) [46] and Bowtie2 (version 2.2.9) [57]. Duplicates were removed using the Bismark deduplicate function, and methylation calls were extracted with the Bismark methylation extractor, reading methylation calls of overlapping parts of the paired reads from the first read (--no_overlap parameter). Differentially methylated regions were detected with BiSeq (version 1.18.0) [16, 58] with clusters of at least 15 methylated sites and 100 bp between clusters.
For DSS, raw sequence reads were trimmed to remove both poor-quality calls and adapters using Trim Galore! (version 0.4.1) with options --paired --trim1 --gzip --phred33 --fastqc and Cutadapt (version 1.9.1) with cutoff 20. Remaining sequences were mapped to the Arabidopsis TAIR10 genome using Bismark (version 0.15.0) [46] and Bowtie2 (version 2.2.9) [57]. Duplicates were removed using the Bismark deduplicate function, and methylation calls were extracted with the Bismark methylation extractor, reading methylation calls of overlapping parts of the paired reads from the first read (--no_overlap parameter). Differentially methylated regions were detected with DSS (Dispersion shrinkage for sequencing data, version 2.26.0) using the default parameters.
For Methylpy, differentially methylated regions (DMRs) were identified using the Methylpy pipeline (version 0.1.0) [17] and Bowtie2 (version 2.3.3) [57]. This pipeline used Cutadapt (version >=1.9) to trim the raw sequence reads to remove both poor-quality calls and adapters. Picard (>=2.10.8) was used for PCR duplicate removal. Chloroplast DNA sequence was used as the unmethylated control; the conversion rate observed was between 0.3% and 0.4%. Cytosine sites with fewer than four reads were discarded. Adjacent differentially methylated sites closer than 100 bp were collapsed into DMRs. CNN DMRs, CGN DMRs, CHG DMRs, and CHH DMRs with fewer than four, eight, four, and four DMSs, respectively, were discarded in the following analyses, and CNN, CGN, CHG, and CHH DMR candidate regions with less than 0.1, 0.4, 0.2, and 0.1 difference between maximum and minimum methylation levels, respectively, were also discarded.
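The collapsing rule described above (adjacent sites closer than 100 bp merged into one region, regions with too few sites discarded) can be sketched as follows. This is an illustrative Python function, not Methylpy code; the name `collapse_sites` and the example coordinates are hypothetical, and the methylation-level difference filter is left out.

```python
def collapse_sites(positions, max_gap=100, min_sites=4):
    """Merge differentially methylated sites into regions.

    Consecutive sites separated by less than max_gap bp join the same
    region; regions with fewer than min_sites sites are dropped.
    Returns (start, end, number_of_sites) tuples for one chromosome.
    """
    regions = []
    current = []
    for pos in sorted(positions):
        if current and pos - current[-1] >= max_gap:
            if len(current) >= min_sites:
                regions.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        regions.append((current[0], current[-1], len(current)))
    return regions


# Hypothetical site coordinates: two dense clusters and one isolated site.
sites = [10, 50, 90, 130, 500, 1000, 1040, 1080, 1120, 1160]
print(collapse_sites(sites))
```

Context-specific thresholds, such as the minimum of eight sites quoted for CGN regions, would be passed through `min_sites`.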
For Methyl-IT, raw sequence reads were trimmed to remove both poor-quality calls and adapters using Trim Galore! (version 0.4.1) with options --paired --trim1 --gzip --phred33 --fastqc and Cutadapt (version 1.9.1) with cutoff 20. Remaining sequences were mapped to the Arabidopsis TAIR10 genome using Bismark (version 0.15.0) [46] and Bowtie2 (version 2.2.9) [57]. Duplicates were removed using the Bismark deduplicate function, and methylation calls were extracted with the Bismark methylation extractor, reading methylation calls of overlapping parts of the paired reads from the first read (--no_overlap parameter). Differentially methylated regions were detected with Methyl-IT, using cytosine sites with at least 4 reads, and with default parameters.
Since DSS, BiSeq and Methylpy do not provide a concept equivalent to DMGs, we adopted the concept of DMR-associated genes (DAGs) introduced in reference [18]. A gene and a DMR are associated if the DMR is located within 2 kb of gene upstream regions, gene bodies or 2 kb of gene downstream regions [18].
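The DAG criterion amounts to an interval test against the gene extended by 2 kb on each side. A minimal Python sketch follows, assuming 1-based inclusive coordinates on the same chromosome and treating any overlap with the extended window as association; whether partial overlap or full containment was required is not stated in the text, so that choice is an assumption, as are the function name and coordinates.

```python
def is_dmr_associated(gene_start, gene_end, dmr_start, dmr_end, flank=2000):
    """True when the DMR overlaps the gene body extended by `flank` bp
    upstream and downstream (strand does not matter for a symmetric flank)."""
    window_start = max(1, gene_start - flank)
    window_end = gene_end + flank
    return dmr_start <= window_end and dmr_end >= window_start


# Hypothetical gene at 10,000-12,000 bp.
print(is_dmr_associated(10000, 12000, 8500, 8700))   # inside upstream flank
print(is_dmr_associated(10000, 12000, 7000, 7900))   # beyond the 2 kb window
```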
Available methylome datasets used in this work
Methylome datasets from Arabidopsis (Ws-0) major seed developmental phases, globular stage (GLOB), linear cotyledon stage (COT), mature green stage (MG), post mature green stage (PMG) and dry seed, and Arabidopsis (Col-0) germination datasets of dry seed and 0-4 days after imbibition were analyzed. Ws-0 seed development and germination datasets were obtained from the Gene Expression Omnibus (GEO) under accession numbers GSE68132 and GSE94710. Both datasets were originally studied by Kawakatsu et al. (2017) [18].
Network enrichment analysis
Network-based enrichment analysis (NBEA) was applied using the EnrichmentBrowser R package [59, 60], and the Network Enrichment Analysis Test (NEAT) was performed by using the R package "neat" version 1.1.1 [60].
These network enrichment approaches permitted identification of main network regulators involved in the msh1 memory transgenerational effect and in seed developmental and germination datasets.
Individual sample gene CG methylation principal component analysis (PCA) and classification
Individual samples were represented as vectors of variables carrying the mean of CG Hellinger divergence covering gene regions delimited by Arabidopsis msh1-memory DMGs. Principal component analysis (PCA) was performed on the individual vector-spaces determined by the gene regions: 1) DMGs, 2) intersection DEGs (msh1-memory)/DMGs, and 3) intersection NBEA-DMG/NBEA-DEG between the subsets derived from independent NBEA on the subsets DMGs and DEGs, respectively. PCA and hierarchical cluster analysis were applied by using prcomp and hclust functions, respectively, from the R package stats.
Specific locus bisulfite sequencing PCR
To confirm our analysis for DIMP calling based on methylome sequencing, PCR-based bisulfite sequencing was performed. Genomic DNA from leaf tissue of 4-week-old plants was isolated by the DNeasy Plant Kit (Qiagen, Germany). 400 ng of genomic DNA was bisulfite-treated using the EpiMark Bisulfite Conversion Kit (New England Biolabs, USA). Bisulfite-treated DNA was used as template for PCR in a 25 μl reaction system by using EpiMark Hot Start Taq DNA Polymerase (New England Biolabs, USA), in the PCR program: initial denaturation 30 sec at 95 °C, 40 cycles of 95 °C for 15 sec, 45 °C for 30 sec, 68 °C for 1 min, and final extension 5 min at 68 °C. PCR product was gel-purified using a kit (Qiagen, Germany) and ligated using the TOPO TA cloning kit (Life, USA) for sequencing. At least 25 independent clones were sequenced. Bisulfite DNA sequence methylation status was analyzed by the online program “Kismeth”. Methylation at locus AT5G66750 was used as a control for bisulfite conversion. Primers used in this experiment are listed in Additional file 14: Table S13.
Plant materials and growth conditions
For Arabidopsis plants used in this study, clean seeds were sown on peat mix in square pots, with stratification at 4 °C for 2 days before moving to growth chamber (22 °C, 120-150 μ mol·m−2·s−1 light). Tomato seeds were germinated on MetroMix 200 medium (SunGro, USA) in square pots and grown in a reach-in chamber (26 °C, 300 μ mol·m−2·s−1 light).
5-azacytidine treatment
The 5-azacytidine treatment protocol was adopted from Griffin et al. [57] and Yang et al. [30]. Col-0 wild type and msh1 memory line seeds were surface-sterilized in 10% (v/v) sodium hypochlorite, rinsed thoroughly with sterile water, and sown in 8-oz clear cups (Fabri-Kal, USA) containing 30 mL 0.5 M Murashige and Skoog medium (Sigma, USA) supplemented with 1% (w/v) agar and 0 (control) or 100 μM 5-azacytidine (Sigma, USA). The 100 μM concentration was derived from a concentration gradient experiment of 4 concentrations (0 μM, 30 μM, 50 μM, 100 μM) where 100 μM showed visible impact on plant growth for both wild type Col-0 and msh1 memory line plants. Seeds were germinated and grown at 24 °C, 18-h day length, and 120-150 μ mol·m−2·s−1 light intensity for 14 days. Ten-day-old seedlings on the MS medium were collected for the RNAseq experiment. For longer observation, the treated plants were transferred to square pots with soil and grown under standard conditions in the growth chamber. The experiment was repeated three times, with at least 18 replicates per treatment in each experiment.
Sample collection for circadian clock gene expression assays
To assess the expression pattern of core circadian clock genes under clock-driven free running conditions, we adopted the protocol of [38]. Plants were entrained at LD condition (12 hr light/12 hr dark) for 4 weeks, then moved to LL (24 hr constant light) for 48 hours before sample collection was initiated. For expression of core circadian clock genes under life-like conditions, plants were entrained at LD (12 hr light/12 hr dark) for 4 weeks before samples were collected. The entire above-ground plant was collected and placed into liquid nitrogen. Samples were taken every 4 hr (ZT6, ZT10, ZT14, ZT18, ZT22, ZT26, ZT30, ZT34, ZT38, ZT42, ZT46, ZT50) in both LD and LL conditions. For each genotype at each time point, at least 3 plants were collected and used in qPCR experiments as biological replicates. An identical sample collection strategy, and LD, LL entrainment conditions, were used for tomato circadian clock gene expression experiments.
Gene Expression Analysis by qPCR
The MIQE guidelines [61] were used as the standard protocol for the qPCR experiments. Briefly, total RNA from each sample was extracted by the NucleoSpin RNA Plant kit (Macherey-Nagel, Germany) following the manufacturer’s protocol, including genomic DNA removal. First-strand cDNA was synthesized from 400 ng total RNA with oligo primers using iScript Reverse Transcription Supermix for RT-PCR (Bio-Rad, USA). The qPCR was performed on the CFX real-time system (Bio-Rad, USA) with 95 °C for 3 min, 40 cycles of 95 °C for 30 sec and 60 °C for 1 min. Three biological replicates were performed. RNA abundance of target genes was calculated from the average of four technical replicates using the ΔΔCq method, where Cq is the cycle number at which the amplification signal crosses the detection threshold in each PCR run. The Cq values of AT4G05320 and AT5G15710 were used as normalization controls in the calculation.
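The ΔΔCq calculation can be written out explicitly. The Python sketch below assumes 100% amplification efficiency (a doubling per cycle) and averages the Cq values of the technical replicates; the function names and Cq values are hypothetical. Cq values from the two normalization genes can be pooled into the reference lists, since an arithmetic mean of Cq values corresponds to a geometric mean of quantities.

```python
def mean(values):
    return sum(values) / len(values)


def relative_expression(target_sample, ref_sample, target_control, ref_control):
    """Fold change of a target gene in a sample relative to a control,
    computed as 2 ** (-ddCq). Each argument is a list of Cq values
    (technical replicates); efficiency is assumed to be 100%."""
    delta_sample = mean(target_sample) - mean(ref_sample)
    delta_control = mean(target_control) - mean(ref_control)
    delta_delta_cq = delta_sample - delta_control
    return 2.0 ** (-delta_delta_cq)


# Hypothetical Cq values, four technical replicates each.
fold = relative_expression(
    target_sample=[22.1, 21.9, 22.0, 22.0],
    ref_sample=[18.0, 18.0, 18.0, 18.0],
    target_control=[24.0, 24.1, 23.9, 24.0],
    ref_control=[18.0, 18.0, 18.0, 18.0],
)
print(fold)
```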
Real-time PCR primers used in this study and their reference are listed in Supplemental Primers Table. The PCR amplification efficiency was calculated based on a calibration standard curve specific for each primer set, and only primers having amplification efficiency greater than 0.97 were used in the study.
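Amplification efficiency is commonly derived from the slope of the standard curve (Cq against log10 of template quantity) as E = 10^(-1/slope) - 1, so that E = 1 corresponds to perfect doubling per cycle. The Python sketch below applies that standard relation; it is not taken from the study's analysis scripts, and the function names and dilution series are hypothetical.

```python
import math


def standard_curve_slope(quantities, cq_values):
    """Least-squares slope of Cq versus log10(template quantity)."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cq_values) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cq_values))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    return s_xy / s_xx


def amplification_efficiency(quantities, cq_values):
    """Efficiency as a fraction; 1.0 means the product doubles every cycle."""
    slope = standard_curve_slope(quantities, cq_values)
    return 10.0 ** (-1.0 / slope) - 1.0


# Hypothetical 10-fold dilution series with ideal doubling per cycle.
quantities = [1000, 100, 10, 1]
cq_values = [20 + math.log2(1000 / q) for q in quantities]
efficiency = amplification_efficiency(quantities, cq_values)
print(efficiency, efficiency > 0.97)
```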
Sample preparation and bisulfite DNA methylome sequencing
For Arabidopsis genome-wide bisulfite methylome sequencing experiments, three individual plants of wild type Arabidopsis thaliana ecotype Col-0 and three isogenic msh1 memory line plants were used. All wild type control plants were selected from negative events of RNAi transformation and were maintained in parallel with their msh1 memory counterparts. Whole plants at early bolting were flash frozen in liquid nitrogen. Tissues were ground by mortar and pestle in liquid nitrogen, and divided in two, with one half processed by DNeasy Plant Kit (Qiagen, Germany) for genomic DNA (RNA removed) and subsequent bisulfite sequencing. The other half was used for RNA extraction by NucleoSpin RNA Plant Kit (Macherey-Nagel, Germany) following the manufacturer’s protocol, including genomic DNA removal, for RNA-seq analysis.
For tomato bisulfite sequencing, wild type tomato (Solanum lycopersicum cv Rutgers) and the corresponding MSH1-RNAi transgene-null segregant (msh1 memory line) were used. Phenotype and line generation details can be found in [30]. The top three leaves from each four-week-old tomato plant were collected and frozen in liquid nitrogen, followed by genomic DNA extraction using DNeasy Plant Kit (Qiagen, Germany). Genomic DNA from three individual plants for both WT and msh1 memory line were used for BSseq.
All BSseq experiments were conducted on the HiSeq 4000 analyzer (Illumina, USA) at BGI-Tech (Shenzhen, China) according to the manufacturer’s instructions. Briefly, genomic DNA was sonicated to 100-300 bp fragments and purified with the MinElute PCR Purification Kit (Qiagen, Germany), and incubated at 20 °C after adding End Repair Mix. DNA was purified, a single ‘A’ nucleotide was added to the 3’ ends of blunt fragments, the DNA was purified again, and Methylated Adapter was added to the 5’ and 3’ ends of each fragment. Fragments of 300-400 bp size range were purified with the QIAquick Gel Extraction Kit (Qiagen, Germany) and subjected to bisulfite treatment by Methylation-Gold Kit (ZYMO). These steps were followed by PCR and gel purification (350-400 bp fragments were selected). Qualified libraries were paired-end sequenced on the HiSeq X-ten system.
RNA sequencing and analysis
RNA libraries were constructed as described in the TruSeq RNA Sample Preparation v2 Guide. These libraries were sequenced with the 150-bp reads option on the HiSeq 4000 analyzer (Illumina, USA) at BGI-Tech (Shenzhen, China). Alignments were performed using RUM 2.0.4 (default parameters) [62], keeping only uniquely mapped reads. The read count data were generated from the SAM files using the QoRTs software package [63]. DESeq2 [55] was used for gene count normalization and to identify differentially expressed genes (FDR < 0.05, |log2FC| > 0.5).
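The two cutoffs used to call differentially expressed genes can be illustrated with a minimal sketch. It is not the analysis script used in this study: the function name `select_degs`, the gene identifiers and the numeric values are hypothetical, and the input is assumed to be a table exported from DESeq2 that carries its standard `log2FoldChange` and `padj` columns. Genes whose adjusted p-value is missing are skipped.

```python
import math


def select_degs(results, fdr_cutoff=0.05, lfc_cutoff=0.5):
    """Return the IDs of genes with padj < fdr_cutoff and |log2FC| > lfc_cutoff.

    `results` is an iterable of dicts with the keys 'gene',
    'log2FoldChange' and 'padj' (column names assumed to follow a
    DESeq2 results table). Rows with missing values are skipped.
    """
    selected = []
    for row in results:
        padj = row["padj"]
        lfc = row["log2FoldChange"]
        # Skip rows where either statistic is missing (None or NaN).
        if padj is None or lfc is None:
            continue
        if math.isnan(padj) or math.isnan(lfc):
            continue
        # Both conditions must hold: significance and effect size.
        if padj < fdr_cutoff and abs(lfc) > lfc_cutoff:
            selected.append(row["gene"])
    return selected


# Hypothetical rows, for illustration only.
example = [
    {"gene": "geneA", "log2FoldChange": 1.20, "padj": 0.001},
    {"gene": "geneB", "log2FoldChange": -0.75, "padj": 0.030},
    {"gene": "geneC", "log2FoldChange": 0.30, "padj": 0.001},   # effect too small
    {"gene": "geneD", "log2FoldChange": 2.00, "padj": 0.200},   # not significant
    {"gene": "geneE", "log2FoldChange": -1.50, "padj": None},   # padj missing
]

print(select_degs(example))  # ['geneA', 'geneB']
```

Both inequalities are strict, matching the thresholds as written, and the absolute value keeps up- and down-regulated genes alike.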
Funding
The work was supported by funding from NSF-SBIR (2015-33610-23428-UNL) and the Bill and Melinda Gates Foundation (OPP1088661).
Availability of data and materials
The Methyl-IT pipeline source code is available on GitLab: https://git.psu.edu/genomath/MethylIT. Seed development methylome data (accession number GSE68132) were obtained from the Gene Expression Omnibus database.
All next-generation sequencing data generated by this study have been deposited in the Gene Expression Omnibus database under the following accession numbers:
Arabidopsis methylome (GSE106309, Secure token for reviewers: epkxcgcelpcbpon), Arabidopsis msh1 memory 4-week-old plant RNAseq (GSE106536, Secure token for reviewers: khezyogstbuvryj), Arabidopsis 10-day-old seedling 5-azacytidine treatment RNAseq (GSE109164, Secure token for reviewers: gfyfqgucdfqhlal), Tomato methylome (GSE105008, Secure token for reviewers: ebglsioentetrif).
Authors’ contributions
RS developed the application of the information thermodynamic theory to cytosine DNA methylation and conducted mathematical and computational biology analyses. XY, HK and YW designed and conducted biological experiments. JRB conducted computation. SM designed experiments, participated in data analysis and wrote the manuscript.
Competing interests
S. Mackenzie has served as co-founder for a company that tests the MSH1 system for possible agricultural commercial value.
Consent for publication
Not applicable
Ethics approval and consent to participate
Not applicable
Additional files
Additional file 1: Figures S1 to S10
Additional file 2: Table S1 Absolute DIMP counts and DIMP counts per genomic region for seed development and germination datasets
Additional file 3: Table S2 DMGs Arabidopsis (ws-0) seed development dataset
Additional file 4: Table S3 DMGs from Arabidopsis memory line
Additional file 5: Table S4 List of seed development DMGs found in networks based on NEAT
Additional file 6: Table S5 Total of 955 DEGs of Arabidopsis msh1 memory line
Additional file 7: Table S6 Total of 9867 DMGs of Arabidopsis TDNA mutant
Additional file 8: Table S7 NBEA analysis of DEGs in Arabidopsis msh1 memory line
Additional file 9: Table S8 NEAT and NBEA analysis of DMGs from Arabidopsis msh1 memory line
Additional file 10: Table S9 DIMPs distribution in 16 regulatory genes in msh1 memory individual plants
Additional file 11: Table S10 DMGs in tomato msh1 memory line
Additional file 12: Table S11 NBEA analysis of DMGs in tomato msh1 memory line
Additional file 13: Table S12 Main intersection between Arabidopsis and tomato DMGs NBEA list
Additional file 14: Table S13 Primers used in this paper
Acknowledgments
We thank Ojus Jain and Kasim Hamo for technical assistance. We also thank Dr. Yingzhi Xu for valuable conversations early in the study. The data presented in this manuscript are tabulated in the main text and supplementary materials.
Footnotes
Robersy Sanchez: rus547{at}psu.edu; Xiaodong Yang: xiaodongy86{at}gmail.com; Hardik Kundariya: kundariyahardik{at}gmail.com; Jose R Barreras: barreras{at}gmail.com; Yashitola Wamboldt: yashitola{at}yahoo.com; Sally Mackenzie: sam795{at}psu.edu
Abbreviations
- AUC: Area under the receiver operating characteristic curve
- MSH1: MUTS HOMOLOG 1
- CDM: Cytosine DNA methylation
- DAGs: DMR associated genes
- DEG: Differentially expressed gene
- DIMPs: Differentially informative methylated positions
- DMGs: Differentially methylated genes
- DMPs: Differentially methylated positions
- DMRs: Differentially methylated regions
- DSS: Dispersion Shrinkage for Sequencing
- FET: Fisher’s exact test
- GLM: Generalized linear regression model
- HD: Hellinger divergence
- HDT: Goodness-of-fit test based on Hellinger divergence
- NEAT: Network Enrichment Analysis Test
- NBEA: Network based enrichment analysis
- RMST: Root-mean-square test
- ROC: Receiver operating characteristic curve
- SD: Signal detection
- TVD: Total variation distance
- PMS: Potential/putative methylation signal