RNA polymerase mapping in plants identifies enhancers enriched in causal variants

Promoter-proximal pausing and divergent transcription at promoters and enhancers, which are prominent features in animals, have been reported to be absent in plants based on a study of Arabidopsis thaliana. Here, our PRO-Seq analysis in cassava (Manihot esculenta) identified peaks of transcriptionally-engaged RNA polymerase II (Pol2) at both 5’ and 3’ ends of genes, consistent with paused or slowly-moving Pol2, and divergent transcription at potential intragenic enhancers. A full genome search for bi-directional transcription using an algorithm for enhancer detection developed in mammals (dREG) identified many enhancer candidates. These sites show distinct patterns of methylation and nucleotide variation based on genomic evolutionary rate profiling characteristic of active enhancers. Maize GRO-Seq data showed RNA polymerase occupancy at promoters and enhancers consistent with cassava but not Arabidopsis. Furthermore, putative enhancers in maize identified by dREG significantly overlapped with sites previously identified on the basis of open chromatin, histone marks, and methylation. We show that SNPs within these divergently transcribed intergenic regions predict significantly more variation in fitness and root composition than SNPs in chromosomal segments randomly ascertained from the same intergenic distribution, suggesting a functional importance of these sites on cassava. The findings shed new light on plant transcription regulation and its impact on development and plasticity.

Initiation Complex is assembled and initiation has occurred 1 . Promoter-proximal pausing has been suggested as a mechanism to tune the expression of specific genes in response to external regulatory signals and might also play a role in stabilizing the open chromatin state around promoter regions 1 . Nascent RNA sequencing also revealed the presence of bidirectional transcription in enhancers, supporting a more unified model of transcription initiation between enhancers and promoters 8 .
In plants, nascent RNAs have only been profiled in Arabidopsis thaliana 9 and Zea mays 10 using GRO-seq. In Arabidopsis, the analysis revealed a lack of bi-directional transcription, promoter-proximal (i.e., 5') pausing or expression of enhancer RNAs. Instead, prominent 3' accumulation of RNA polymerase was observed in both maize and Arabidopsis 9 . These data suggested that gene regulation in plants may have diverged from what is observed in other eukaryotes, and Hetzel et al. 9 suggested that the presence of Pol IV and Pol V (absent in metazoans) 11 reflects a different evolutionary approach to gene regulation within the plant kingdom.
We characterized nascent transcription in cassava (Manihot esculenta) and maize seedlings using PRO-seq 5,11 and re-analyzed the Arabidopsis and maize GRO-seq data from Hetzel et al. 9 and Erhard et al. 10 respectively. For clarity, the four libraries will be referred to as PRO-cassava, PRO-maize, GRO-maize and GRO-arabidopsis. In agreement with previous studies, GROarabidopsis lacked promoter proximal pausing ( Fig 1A) and, instead, showed accumulation of engaged polymerases at the 3' end of each gene ( Fig 1B). Analysis of PRO-maize showed the previously reported 3' pausing 9 ( Fig 1C) and a small accumulation of reads at the Transcription Start Site (TSS) (Fig 1B). This accumulation was consistent with GRO-maize, and thus general across the two techniques and two varieties of maize ( Fig S1). PRO-cassava, in contrast, showed a clear pattern of both 5' (Fig 1E) and 3' pausing ( Fig 1F). Out of the 24,532 genes that were expressed in cassava, 16,605 had a Pausing Index (PI) higher than 2 ( Fig S2, Supplemental methods). While all three plant species demonstrated polymerase accumulation at the 3' end of genes, each displayed a unique accumulation pattern in the promoter-proximal region. Unlike mammals, bi-directional transcription was uncommon among plant promoters (Fig 1). However, several cassava genes showing this behavior were identified (Fig S3).
Enhancers are key eukaryotic regulatory elements that control spatiotemporal gene expression and are especially important during development 12 . Studies in mammals have shown that enhancers produce short unstable RNAs known as eRNAs 13,14 . In the first study of nascent RNA in Arabidopsis, aligning GRO-seq reads to open chromatin sites did not identify transcribed plant enhancers 9 . To validate this observation in another species, we mapped PRO-seq peaks outside coding regions (3kb from the 5'UTR or 3' UTR of any gene) in the cassava genome. We identified ~2,000 peaks in intergenic regions showing clear bi-directional transcription, similar to that observed in mammalian and other metazoan enhancers (Fig S4). Given the resemblance of these elements to mammalian transcriptional regulatory elements, we used discriminative regulatory-element detection (dREG) 15 , a support vector regression algorithm trained to detect enhancers and promoters from GRO-seq mapped reads. We identified 34,000 candidate regulatory regions across the genome of which 16,800 were located in intergenic regions, and 9,665 were at least 1kb away from any gene. This set of 9,665 regions, which we refer to as enhancer candidate regions (Supplemental File S1), showed a clear asymmetric pattern of transcription (Fig 2A-2B).
Two independent lines of evidence supported the biological activity of these enhancer candidate regions. First, using DNA methylation data previously measured in cassava cultivar TME7 16 , we observed profiles in the three DNA contexts (CpG, CHG, and CHH) around the 9,665 cassava enhancer candidate regions distinct from genic and random regions across the genome ( Fig 2C). Second, Genomic Evolutionary Rate Profiling (GERP) 17-19 of the enhancer candidates showed lower conservation of these regions than of random sequences across the genome, whereas coding regions are conserved (Fig 2D and 2E). This low conservation agrees with observed patterns in mammals 20 where enhancers, unlike promoters, are rarely conserved and evolve rapidly.
To test the functional relevance of the plant enhancer candidates, we estimated the percentage of the SNP heritability 21 attributable to the enhancer candidates as compared to randomly ascertained genomic regions with the same intergenic distribution. We set up genomic partitions 22,23 separating a focal partition from a rest-of-genome (ROG) partition, where the focal partition was either the enhancer candidates or the random regions. We used 3,011 cassava clones (i.e., genetically unique but each clonally propagated) of the NextGen Cassava Breeding Project 24 evaluated for four quantitative agronomic traits, dry matter content (DM), fresh yield (FYLD), number of roots (RTNO), and shoot weight (SHTWT), and a disease trait, mean cassava mosaic disease severity (MCMDS) whose genetic architecture is controlled primarily by a single resistance locus 25 . Genomic relationship matrices (GRM) for each partition were calculated using LDAK5, following recommendations made by Speed et al. 21 . For each partition, we fit a model estimating variances of effects distributed according to ROG and focal GRMs. The percentage of phenotypic variance explained by markers inside enhancer regions was higher than random regions for all quantitative agronomic traits but not for the disease resistance trait (Fig.   2F). Plant disease resistance is often conditioned by genes that cause recognition of infection rather than by differential expression 26 . In contrast, fitness-related quantitative traits can be strongly affected by gene regulation 27 . Together, these results suggest that the enhancer candidates causally affect plant phenotypes by modulating gene transcription.
In summary, the nascent transcriptome of cassava, as revealed by PRO-seq, showed features of transcriptional regulation that weren't present or detected in previous plant experiments, including promoter-proximal pausing and the presence of eRNAs. We note that plants, like yeast, lack Negative ELongation Factor (NELF), which is likely required for a kinase-regulated release of paused Pol II 28 . Thus, this enrichment of Pol II may reflect a related maturation checkpoint observed in fission yeast 29 . Enhancer candidate regions were shown to have low levels of evolutionary conservation, a bi-directional pattern of transcription, and a specific DNA methylation profile. These regions also contributed disproportionately to fitness and root composition variation, showing an important new way forward in prioritizing genomic regions for use in crop improvement.
In an attempt to extend our observations to other plants we analyzed the existing GRO-maize data similarly to the PRO-cassava data. We identified 4,135 (Supplemental file S2) enhancer candidate regions using dREG and again observed a clear pattern of bi-directional transcription ( Fig 3A, 3B). The transcription of these candidates was conserved across different genotypes as the PRO-maize data generated in our study showed transcription signals in the same regions ( Fig. S4). We analyzed the transcription pattern of 1,495 (Supplemental file S3) enhancer candidates identified in a separate maize study 30 using an approach that integrated genome-wide methylation data, chromatin accessibility (DNAse-seq), and histone marks (H3K9ac). The GROmaize reads mapped against these candidates also showed bi-directional transcription ( Fig   3C,3E). Moreover, 519 (Supplemental file S4) of the 1,495 enhancer candidates (~35%) showed significant levels of bi-directional transcription and were identified by dREG (Fig 3D, 3F).
Transcription of genomic enhancers was first described in 1992 31 , but the lack of adequate technology prevented further research on the subject until the late 2000's 32,33 . While direct functions have been proposed for eRNAs as regulators of gene expression in metazoans 33,34 there is no evidence of this in plants yet. Previous work in Arabidopsis did not identify eRNAs 9 , leading the authors to state that "if plants have enhancer elements, they rarely, if at all, produce transcripts." Based on supporting research 12,35-37 , however, we believe the existence of plant enhancers is likely commonplace and independent of whether or not they are transcribed. Zhu et al. 37 provided supporting evidence for this statement when they tested a small portion of nearly ten thousand enhancer candidates in A. thaliana: they validated 10 of the 14 enhancer candidates tested using a reporter assay. The GRO-arabidopsis data, however, did not show strong evidence of transcription in the regions identified by Zhu et al. 37 ( Figure S6). Additional data would be needed to fully validate the lack of enhancer transcription in Arabidopsis.
The results reported here suggest that plant transcriptional regulation may be more similar to that of mammals and other metazoans than previously thought. The lack of transcription previously observed in putative Arabidopsis enhancers may have been the result of relatively low read depths in the reported GRO-seq experiment. Further, the genome size of cassava is over six times that of Arabidopsis. Cassava therefore has much greater intergenic space, allowing for the identification of intergenic enhancer candidates whose expression is not confounded by that of  . Reads were aligned to the TSS and the PAS in both sense (yellow) and antisense (green) directions relative to the direction of gene transcription (E,F). Prominent promoter-proximal pausing is shown in Manihot esculenta, and in some degree in maize, but it is not present at all in Arabidopsis as previously reported 9 . Accumulation of RNA polymerase at the 3' end of the genes is a common feature in the three plant species. Relationship matrices were calculated using SNP markers within the enhancer candidate regions using the LDAK5 model and variance components were estimated using EMMREML.

Plant Materials and Nuclei Isolation
Cassava accession "Nase 3" (synonymous with "IITA-TMS-IBA30572" and "Migyera") cuttings were grown in tubes containing enriched medium. Tubes were placed in growth chambers with 12 hours of light at 30°C for 6 weeks before tissue collection. Stem and leaves of approximately 25gr were ground with liquid nitrogen to a fine powder using a mortar and a pestle. The resulting powder was transfer to a coffee grinder containing cold 1X NIB buffer. Then, we used the CelLytica PN Plant nuclei isolation/extraction kit (Sigma-Aldrich) following the instructions for the "Highly-pure preparation of Maize inbred line B73 seeds were put into growth chambers. Shoots were collected 9 days after germination. Around 10 grams of plant tissue were ground with liquid nitrogen to a fine powder. Five grams of ground tissue was transferred into 50 ml fresh SEB extraction buffer (2.0% PVP, 10%TKE, 500mM sucrose, 4mM spermidine, 1mM spermine, 2.5% β-mercaptoethanol). Incubated on ice for 20 minutes, and then filtered through 2 layers of 100um nylon mesh. Triton X-100 was then added to a final concentration of 0.5% and incubated on ice for another 10 minutes. Then the lysate was centrifuged at 2000 rcf for 15 minutes at 4° C and the supernatant was recovered. The pellet was suspended in another 25 ml SEB extraction buffer and centrifuged again at 2000 rcf for 15 minutes. We added 10 ml nuclei storage buffer (50 mM Tris-Cl, 50% glycerol, 5 mM MgCl2, 0.1 mM EDTA, 0.5mM DTT) in the pellet and centrifuged at 2000 rcf for 5 minutes at 4° C. This step was repeated using 1 ml nuclei storage buffer.
Finally, the pellet was resuspended and stored in 105 ul nuclei storage buffer. The protocol was conducted in a cold room (4° C). molecular barcode was removed. Reads were trimmed to a maximum length of 36bp, and the reverse complement was calculated because the Hiseq apparatus starts sequencing from the 5' end. All downstream alignments were performed using Bowtie2 39 . The PRO-seq method described before is not exclusive to transcripts produced by the nuclear DNA Polymerase II, therefore the reads were aligned to the chloroplast genome to eliminate transcripts produced in this organelle. The remaining reads were mapped to the Manihot esculenta reference genome v6.1 (www.phytozome.com). Low-quality alignments were filtered and only reads mapping once to the genome were considered for further analysis (Table S1).
Bedtools 40 was used to get bedgraph files reporting only the number of 3' end reads at each position.

Pro-seq read distribution
The percentage of the cassava genome transcribed was calculated using bedtools ( Figure S7A). A saturation curve, which calculates the number of unique positions covered as a function of read depth was obtained using the bed-metric scripts (https://github.com/corcra/bed-metric.git) ( Figure S7B). Normalized BigWig files representing the mapped reads were used to visualize each strand of the genome separately in the Integrative Genomics Viewer (IGV) 41 . The Metagene plots, histograms, peak scanning and gene expression values were generated using the HOMER software 42 and meta-gene maker (https://github.com/bdo311/metagene-maker). Since the cassava genome is not readily available to work with HOMER, feature annotations were created separately, and the HOMER config files were modified.
We approximated the TSS as the beginning of the 5' UTR because the cassava genome annotation lacked precise TSS sites. The same approach was used for the Arabidopsis and Maize genomes.

Quantifying Pausing and Divergent Transcription
Genes were ranked based on their pausing index. The pausing indices were calculated, as previously

Genomic Partitioning
Genomic Partitioning is a method to explore the genetic architecture of complex traits 21,22 . In this step, we calculated the heritability contribution from the enhancer candidate regions in the cassava genome and compared it with a random set of DNA regions of similar size and occupying a similar distribution across the cassava chromosomes.

Genotype Data
Genotyping-by-sequencing (GBS) libraries were prepared as described previously 44 . Marker genotypes were called using the TASSEL-GBS discovery pipeline 45 using the Manihot esculenta genome assembly v6.1 (www.phytozome.net). The GBS markers were combined with the Cassava HapMap v2.0 variants from 241 deep sequenced cassava accessions 19 to impute variants on all clones to whole genome sequence level in a single step with IMPUTE2 46,47 . The imputation procedure was performed as in Lozano et al. 48 where the number of haplotypes used as the reference panel was set to 400, the effective population size (Ne) to 1000 and the imputation window to 5Mb. The resulting Oxford files where converted into the Plink 49 binary format. In total, 3 million variants with a quality info score higher than 0.3 and Minor Allele Frequency (MAF) > 0.01 were obtained for the 3011 individuals used in this study.

Variance components estimation
Genomic partitioning analyses are imprecise in highly related populations because of high LD between partitions. We mitigated this problem by eliminating markers from the rest of the genome in high LD with the 9,665 cassava enhancer candidate regions (markers were removed that had allelic r 2 > 0.9 and were closer than 100kb to enhancer markers). We also built 10 random sets of 9,665 regions with the same average length and approximate distribution in the genome as the enhancer candidates. As with the enhancer regions, random sets were forced to be outside any annotated gene ( Figure S8), and markers from the rest of the genome in close physical distance and high LD were removed.
Genomic relationship matrices (GRMs) were calculated for focal (i.e., either enhancer or random sets) and ROG genomic partitions using the software LDAK5 following the ideas of Speed et al. 21 . Briefly, LDAK5 relationship matrices control short-range LD by assigning marker weights. Markers residing in low LD regions will have higher weights and are assumed to contribute more than markers in high LD regions. The LDAK5 model also assumes that a SNP's heritability depends on its MAF, using an α value set to -0.25 as suggested in 21 . Finally, LDAK5 considers genotyping uncertainty as higher-quality observed markers should contribute more than poorly imputed markers. Genomic relationship matrices were calculated for the enhancer candidate partition and the rest of the genome partition. Separate analyses used Genomic relationship matrices based on the three random sets. Python scripts for these analyses can be accessed at the GitHub repository associated to this article.
The model fit to calculate the variance components was specified in matrix notation as: We only sampled the null model ten times, creating ten random kernels because the two-GRM one-step model was computationally restrictively slow.
Assuming independence among the four agronomic traits, the probability that all null models across all traits would explain less variance than the alternative, under the null hypothesis that random sets explain the same variance as enhancer candidates, would be (1 / 11) 4 = 6.83e-5. In fact, root number and fresh root yield are strongly correlated (0.65 to 0.80, 51 ) but are both uncorrelated to dry matter content and shoot weight. Thus, a conservative p-value for the hypothesis that enhancer candidates explain more variance in quantitative agronomic traits than random sets would be (1 / 11) 3 = 7.51e-4.