DNA-sequence and epigenomic determinants of local rates of transcription elongation

Across all branches of life, transcription elongation is a crucial, regulated phase in gene expression. Many recent studies in eukaryotes have focused on the regulation of promoter-proximal pausing of RNA Polymerase II (Pol II), but rates of productive elongation also vary substantially throughout the gene body, both within and across genes. Here, we introduce a probabilistic model for systematically evaluating potential determinants of the local elongation rate based on nascent RNA sequencing (NRS) data. Our model is derived from a unified model for both the kinetics of Pol II movement along the DNA template and the generation of NRS read counts at steady state. It allows for a continuously variable elongation rate along the gene body, with the rate at each nucleotide defined by a generalized linear relationship with nearby genomic and epigenomic features. High-dimensional feature vectors are accommodated through a sparse-regression extension. We show with simulations that the model allows accurate detection of associated features and accurate prediction of local elongation rates. In an analysis of public PRO-seq and epigenomic data, we identify several features that are strongly associated with reductions in the local elongation rate, including DNA methylation, splice sites, RNA stem-loops, CTCF binding sites, and several histone marks, including H3K36me3 and H4K20me1. By contrast, low-complexity sequences and H3K79me2 marks are associated with increases in elongation rate. In an analysis of DNA k-mers, we find that cytosine nucleotides are strongly associated with reductions in local elongation rate, particularly when preceded by guanines and followed by adenines or thymines. Increases in elongation rate are associated with thymines and A+T-rich k-mers. These associations are generally shared across cell types, and by considering them our model is effective at predicting features of held-out PRO-seq data. Overall, our analysis is the first to permit genome-wide predictions of relative nucleotide-specific elongation rates based on complex sets of genomic and epigenomic covariates. We have made predictions available for the K562, CD14+, MCF-7, and HeLa-S3 cell types in a UCSC Genome Browser track.

Supplementary Figure S1: A. The distribution of productive initiation rates χ in 10 rounds of simulation sampled from the estimated χ of real K562 PRO-seq data is shown.The dashed line represents the median of 1 event per minute [1].B. The distribution of read depth in 10 rounds of simulation of synthetic data is displayed.The read depth is decreased to match the read depth of real K562 PRO-seq data [1], with a median value of around 0.05 [2].C. The correlation map of selected features from real K562 data.D.
The difference in correlation map of selected features between the sampled covariates and the real K562 covariates.Supplementary Figure S3: A. Distribution of multiple histone marks from TSSs to the end of gene bodies.
To visualize this distribution, we have scaled the ChIP-seq signals for each histone mark from 0 to 1, which allows us to represent their relative heights on the y-axis.

F
Supplementary FigureS2: A& B & C & D.The simulation with an input of baseline ζ that is exactly determined by a GLM, with no other unmodeled source of variation.A. Accuracy of predicted coefficients κ using synthetic data over ten rounds of simulation.Crosses indicate the ground truth.B. Evaluation of the accuracy of predicted per-nucleotide elongation rates ζ i across all TUs by comparing them to the corresponding ground truth, with Pearson's r 2 of 0.991 C. Evaluation of the accuracy of predicted local elongation rates ζ i along an individual TU by comparing them to the corresponding ground truth, with Pearson's r 2 of 0.983.D. The accuracy of per-nucleotide PRO-seq read count X i predictions based on selected features is evaluated by comparing them to the synthetic read counts in the held out data of the total 10 rounds of simulation, with Pearson's r 2 of 0.26.E.The comparison of baseline ζ that is exactly determined by a GLM and the noise ζ after introducing Gaussian noise, with Pearson's r 2 of 0.749.F. The accuracy of per-nucleotide PRO-seq read count X i predictions in the held out data of the total 10 rounds of simulation with an input of noise ζ, with Pearson's r 2 of 0.070.

D
Supplementary FigureS4:The selection of gene body of gene MFSD4B in K562 PRO-seq data.Supplementary FigureS5: A "U-shape" is evident in the selected gene bodies of K562 PRO-seq data.The red line illustrates the general pattern of the relative height of PRO-seq signals across the relative location of gene bodies for 6,000 genes.The blue line represents the relative height after correcting the "U-shape."0.32 0.29 0.16 0.25 0.02 0.04 0.22 0.03 0.02 0.26 0.17 1 0.42 0.36 0.14 0.37 0 −0.01 −0.01 −0.01 0 0.09 0.32 0.42 1 0.26 0.13 0Supplementary FigureS6:A & B.The selection of an appropriate filter and its parameters for the 5' and 3' splicing sites is accomplished by examining metaplots.The average PRO-seq profile at the 5' and 3' splicing sites exhibits discernible effects on the elongation rate within their respective ranges (see left).A generalized filter is applied for the 3 and 5 splice-site features, where the dots indicate the scaling vector (see right).C. A Gaussian filter is applied to smooth the ChIP-seq signals of histone marks.D. The correlation map of all epigenomic and annotated features in the actual analysis of K562 data.

FCE
Supplementary FigureS7: Determination of the hyperparameter of L1 regularization in the simulated data.A. Evaluation of the predictiveness of each hyperparameter through the computation of Akaike Information Criterion (AIC) values.B. The number of non-zero 5-mers by the framework applying L1 regularization.The dashed line represents the number of non-zero 5-mers selected with the hyperparameter corresponding to the lowest AIC value.C. Successful shrinkage of parameters after the application of penalty, contrasted with scenarios without penalty.D. Accuracy assessment of predicted local elongation rates ζ i compared to ground truth, in settings without penalty (left) and with penalty (right), exhibiting Pearson's r 2 values of 0.76 and 0.89.E & F. Comparison of the top 100 5-mers with the most significant coefficients selected by the model and the ground truth in scenarios without and with L1 penalty, respectively.Supplementary FigureS9: The analysis of applying the allmer model to real K562 data. A. The model's performance in predicting local pausing sites along the gene bodies of held-out genes was evaluated.The locations of the pausing sites determined by the predicted read counts X i compared to the true read counts X i , with Pearson's r 2 of 0.64.B. The accuracy of the predicted read counts X i was assessed across 1 kbp intervals for all transcribed units (TUs) by comparing them to the corresponding true X i values, with Pearson's r 2 of 0.65.C. The comparison of predictiveness among the epigenomic model, k-mer model, and the combined model of pausing locations and PRO-seq read counts in 1kbp.Pearson's r 2 between estimation and truth is indicated on the y-axis.D. Comparative analysis of predicted coefficients κ using the original model and a sequence bias model that assumes all genome-wide effects reflect biases.Supplementary FigureS10: A. The distribution of the 3 end base in PRO-seq data from the K562 cell line, acquired through a run-on experiment involving 4 dNTPs with equal concentrations.B. The distribution of the 3 end base in PRO-seq data from the K562 and CD14+ cell lines, acquired through a run-on experiment involving 2 dNTPs.C. Examination utilizing CD14+ PRO-seq data with a UMI design ligated to an insert.An illustration depicting the structure of reads, either with or without inserts.The green arrow indicates the 3 end of UMIs.D. The distribution of the 3 end constitution of UMIs in CD14+ library, with a design to facilitate the examination of potential ligation bias.E. Analysis of the distribution of the 3 end base of UMI, whether with or without successful inserts, indicating an absence of ligation bias towards cytosine.
Supplementary FigureS11: Analysis of K562 cells illustrating the impact of potential internal TSSs on the estimation of DNAm coefficients.In the 'without TSS filter' scenario, the entire gene body is included, and in the 'with TSS filter' scenario, internal TSSs based on GRO-seq signals are excluded within the same gene body.Error bars represent the standard deviation of rounds of sampled genes.The most notable difference lies in the DNAm coefficient: 'without TSS filter' shows a positive correlation between DNAm and elongation rate, while 'with TSS filter' shows a negative correlation.This suggests a switch function for DNAm in new initiation within gene bodies, potentially influencing the true correlations between DNAm and elongating RNAP.

A
Supplementary FigureS13: A. The distribution of the 3 end base in PRO-seq data across for four cell lines.Supplementary FigureS14: A & B.Comparison of enrichment logos illustrating the top positive or negative 5-mers in the upstream 10-nucleotide and 15-nucleotide regions along gene bodies for four cell lines.Supplementary FigureS15: A & B.Comparison of enrichment logos illustrating the top positive or negative 5-mers in the downstream 5-nucleotide and 10-nucleotide regions along gene bodies for four cell lines.