RNase H genes cause distinct impacts on RNA:DNA hybrid formation and mutagenesis genome-wide

RNA:DNA hybrids such as R-loops affect genome integrity and DNA replication fork progression. The overall impacts of naturally occurring RNA:DNA hybrids on genome integrity, and the relative contributions of ribonucleases H to mitigating the negative effects of hybrids, remain unknown. Here, we investigate the contributions of RNases HII (RnhB) and HIII (RnhC) to hybrid removal, DNA replication, and mutagenesis genome-wide. Deletion of either rnhB or rnhC causes RNA:DNA hybrids to accumulate, but with distinct genomic distributions and mutagenic consequences. In all strains, hybrids accumulate most strongly in non-coding RNAs and the 5′-UTRs of coding sequences. In ΔrnhB cells, hybrids accumulate preferentially in untranslated regions and early in coding sequences. In ΔrnhC cells, hybrid accumulation is particularly sensitive to gene expression, and DNA replication is disrupted, leading to transversions and structural variation. Our results resolve the outstanding question of how hybrids in native genomic contexts interact with replication to cause mutagenesis and shape genome organization.

from PY79 gDNA using oJR260 and oJR261 and cloning it into the pDR110 vector via Gibson assembly.

SM3.1: Purification of RnhB and RnhC
The purification of RnhB and RnhC was performed as described (10). Briefly, each gene was cloned into pE-SUMO and overexpressed as a SUMO fusion in E. coli BL21(DE3) cells grown in LB at 37°C; expression was induced at an OD of ~1 with 200 µM IPTG, and cells were grown for a further 3 hours at 37°C. Cells were lysed via sonication and cell debris was pelleted via centrifugation. The cleared lysate was passed over a Ni²⁺-agarose column, washed, and eluted with 400 mM imidazole. The eluate was then dialyzed overnight at 4°C into a buffer (50 mM Tris-HCl pH 8.0, 50 mM NaCl, and 2 mM beta-mercaptoethanol) containing SUMO protease. The SUMO-cleaved protein was passed back over a Ni²⁺-agarose column, and the flow-through was applied to a Q anion-exchange column (GE: 17115301) and eluted with a 50 to 1000 mM NaCl gradient. Fractions were analyzed via SDS-PAGE; fractions containing the desired protein were pooled, exchanged into protein storage buffer (50 mM Tris-HCl pH 8.0, 100 mM NaCl, 25% glycerol), and flash frozen in liquid nitrogen.

SM3.2: Polyacrylamide gel extraction of the R-loop substrate.
The R-loop-containing band was identified using a LI-COR Odyssey imager and excised. DNA was recovered from the band by the "crush and soak" method: the gel slice was incubated in 100 µl of elution buffer (300 mM magnesium acetate and 1 mM EDTA pH 8.0) with rotation overnight at 4°C.
The tube was centrifuged for 1 minute at 13,000 rpm and the supernatant containing the R-loop was removed. The yield was quantified by the intensity of the signal on the LI-COR Odyssey imager.

SM3.3: Mung bean nuclease digestion
Mung bean nuclease (MBN) was purchased from NEB (#M0250S). oJR336, oJR332, or the R-loop substrate (50 nM each) was digested with MBN for 1 minute in a 10 µl reaction and quenched with stop buffer (95% formamide, 5 mM EDTA, 0.01% bromophenol blue). oJR336 treated with 0.3 M NaOH was used as a ladder. Reactions were heated to 100°C for 2 minutes and immediately snap-cooled in an ice-water bath. All samples were resolved by 20% urea-PAGE, and digestion products were imaged using a LI-COR Odyssey imager.

SM3.4: RNase H digestion
RNase H digestions were performed in 20 µL of RNase H buffer (10 mM Tris-HCl pH 8, 50 mM NaCl, 1 mM MgCl2, 10 µM MnCl2). Each reaction contained 0.5 µM of the R-loop substrate and either 50 nM RnhB, 50 nM RnhC DE/AA, or 4 nM RnhC. Reactions were performed for the indicated times and quenched with stop buffer (95% formamide, 5 mM EDTA, 0.01% bromophenol blue). Reactions were heated to 100°C for 2 minutes and immediately snap-cooled in an ice bath. All samples were resolved by 20% urea-PAGE, and digestion products were imaged using a LI-COR Odyssey imager. 0.3 M NaOH digestion of the R-loop was used as a ladder and to map the sites of cleavage.

SM4: RNA-seq experiments and data analysis
Three independent cultures each of JWS105, JWS207, JWS267, and JWS268 were grown to mid-exponential phase in LB at 37°C. Cells were harvested by centrifugation immediately after addition of one volume of ice-cold methanol. Total RNA was purified according to the manufacturer's instructions using the RiboPure RNA purification kit (Life Technologies). Ribosomal RNA depletion was performed by the University of Michigan DNA Sequencing Core using the Ribo-Zero Magnetic Kit, Bacteria (Illumina), in accordance with the manufacturer's instructions. cDNA preparation, sequencing library preparation, and sequencing of 50-base single-end reads on an Illumina HiSeq instrument were performed by the University of Michigan DNA Sequencing Core. Reads were aligned to the B. subtilis PY79 reference genome (CP006881.1) using bwa, version 0.7.8-r455. RNA-seq data from wild-type B. subtilis PY79 were taken from previously published data (31) (SRA accession SRP067020).
Analysis of differential expression was performed using the statistical software R.
Overlaps between coding sequences and alignments were counted using the summarizeOverlaps function from the GenomicAlignments package (54), with the arguments ignore.strand set to TRUE and mode set to IntersectionNotEmpty. Voom normalization was performed using the packages edgeR and limma (55, 56). P-values were adjusted for multiple testing by the method of Benjamini and Hochberg (57). The probability of differential expression was calculated by applying the logistic transform to the log-odds of differential expression for each gene returned by the topTable function in the package limma.

B. subtilis strains were streaked on LB plates and grown overnight at 37°C. In the morning, 3 mL of LB was used to resuspend cells from the plate and the optical density at 600 nm (OD) was determined. A larger LB culture was then inoculated to an OD of ~0.009 and grown at 37°C in a baffle-bottom flask until reaching an OD of ~0.6.
Cells harvested for the RNA:DNA hybrid extraction and recovery were placed on ice and then split into 50 mL portions. Cells were pelleted via centrifugation for 4 min at 5500 x g at 4°C.
Medium was decanted and cells were resuspended in 10 mL of ice-cold 1X PBS. Cells were pelleted via centrifugation for 4 min at 5500 x g at 4°C. The PBS was decanted, and cells were resuspended in the residual buffer and transferred to one pre-chilled 2 mL microfuge tube per aliquot. Cells were collected via centrifugation (4°C; 2 min; 16,000 x g). After buffer removal via micropipette, cells were flash frozen in an ethanol/dry ice bath. Cell pellet samples were stored at -80°C until processed further.
Caulobacter crescentus (LS101, obtained from Prof. Lucy Shapiro) was grown overnight in PYE medium at 30°C and then used to inoculate 160 mL of PYE medium to a density of 0.003 A600. Cells were grown at 30°C in a baffle-bottom flask and harvested at A600 = 0.618, when two 50 mL aliquots were subjected to centrifugation (5500 rcf, 4°C, 4 min).
After decanting, the two pellets were resuspended in 0.5 mL of ice-cold 1x PBS, combined, and brought to 2 mL total. 50 µL aliquots were snap frozen for use as spike-ins with the B. subtilis samples that underwent HBD pull-down. The same batch of aliquots was used as spike-ins for all experiments described here.
Soluble material was decanted and placed on ice. Elution buffer (15.8 mL of 50 mM Tris pH 7.5, 300 mM KCl, 250 mM imidazole) was added to the soluble material to achieve a final concentration of 12.5 mM imidazole. The nickel column was pre-washed with 50 mL of equilibration buffer (50 mM Tris pH 7.5, 300 mM KCl, 12.5 mM imidazole). Half of the buffered soluble material was loaded onto the column, followed by a gravity-driven gradient formed from equilibration buffer (50 mM Tris pH 7.5, 300 mM KCl, 12.5 mM imidazole) and elution buffer (50 mM Tris pH 7.5, 300 mM KCl, 250 mM imidazole). The pre-wash, sample loading, and elution were repeated with the second half of the soluble material. Fractions containing significant amounts of the HBD protein were pooled, then concentrated using four 3 kDa molecular-weight cut-off (MWCO) centrifugal columns (3200 rcf, 4°C, 10 min per spin). All nickel-purified material was pooled and diluted with Q-sepharose equilibration buffer [50 mM Tris pH 8, 1 mM EDTA pH 8, 0.1 mM 1,4-dithiothreitol (DTT)] to bring the KCl to 30 mM. The 50 mL Q-sepharose column was pre-washed with 250 mL of the same equilibration buffer. The buffered nickel-purified material was loaded, followed by a gravity-driven gradient formed from Q-sepharose equilibration buffer (50 mM Tris pH 8, 1 mM EDTA pH 8, 0.1 mM DTT) and Q-sepharose elution buffer (50 mM Tris pH 8, 1 mM EDTA pH 8, 0.1 mM DTT, 0.5 M KCl). Fractions containing significant amounts of the HBD protein were pooled, then concentrated using 3 kDa MWCO centrifugal columns (3200 rcf, 4°C, 10 min per spin) to a final volume of 3 mL. 10X concentrated storage buffer (529.4 µL of 200 mM HEPES pH 7, 1 M KCl, 5 mM EDTA, 10 mM DTT) and 60% glycerol (1.765 mL) were added to the concentrated Q-sepharose-purified protein.
An additional 6.25 mL of 1x storage buffer with 20% glycerol was added to the buffered Q-sepharose-purified protein and gently mixed. The protein was further concentrated in the same 3 kDa MWCO centrifugal columns (3200 rcf, 4°C, 10 min per spin), aliquoted, and stored at -80°C.

SM5.3: Preparation and sequencing library production for HBD pull-down
Each sample was brought to 50 µL with 1x TE. Then 2 µL of RNase A (10 mg/mL) was added and mixed into each sample. After a 30 min incubation at 37°C, samples were cleaned using the Oligo Clean & Concentrator kit (Zymo Research). Briefly: to each sample, add 100 µL binding buffer and 400 µL 100% ethanol and vortex. Load onto columns. Centrifuge columns (12k rcf, 30 s) and empty the collection tube. Add 750 µL of wash buffer to columns, centrifuge (12k rcf, 30 s), and empty the collection tube. Dry the column by centrifugation (12k rcf, 30 s). Transfer the column to a DNA low-bind tube, add 20 µL of TE to the membrane, wait 30 s, centrifuge (12k rcf, 60 s), and retain the sample. Heat samples to 95°C for 3 min, place on ice, then use the QuantiFluor ssDNA System (Promega) to determine ng/µL. Calculate the volume (and dilutions required) to bring 10 ng of DNA (plus water) to 14.5 µL for ssDNA sequencing preparation.
Sequencing preparation was performed as described in (59), following the STAR Methods section starting at "MNase-SSP", using 10 ng of DNA. The exact sequences of the "PCR_AMP_P7" and "PCR_AMP_P5" oligos used for our samples are detailed in Data File S1. Libraries were cleaned using 1.8x AxyPrep Mag PCR beads and quantified using the QuantiFluor dsDNA System (Promega). Prepared libraries were stored at -20°C.

SM5.4: Computational processing of RNA:DNA pull-down sequencing data
The raw sequencing data were preprocessed by removal of PCR duplicate reads, adapter sequences, and low-quality bases and reads. To facilitate removal of PCR duplicates we included 6 bp unique molecular identifiers (UMIs) at the 5′ end of our R2 reads. We removed duplicate reads using ParDRe (command line flags: -l 8 -c 15) (60). We next trimmed the UMIs from the R2 reads and then trimmed adapters from the de-duplicated reads using CutAdapt (command line flags: --quality-base=33 -a AGATCGGAAGAGC -A AGATCGGAAGAGC --mask-adapter -n 3 --match-read-wildcards). We then removed low-quality bases and reads with TrimmomaticPE (command line flags: -phred33 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:10). All samples were aligned to an in-house version of the PY79 genome in which repetitive regions were hard-masked and consensus 5S, 16S, and 23S rRNA sequences were included as separate records within the reference fasta file. Alignment was performed using Bowtie 2 (command line flags: -q --end-to-end -X 2000 --very-sensitive --no-unal --phred33 --dovetail --fr). All code used to create our custom reference genome and its associated annotations can be found in the following github repository: https://github.com/jwschroeder3/hbd_enrichment_code.git. The above preprocessing and alignment were performed separately for each of two sample types for each biological replicate: a sample taken just prior to the pull-down (input) and the material resulting from the HBD pull-down (extract). There were two biological replicates per genotype (PY79, ΔrnhB, ΔrnhC). After alignment, the number of fragments aligning to each position of the genome was counted at 5 bp resolution. To infer strand-specific pull-down enrichment at each genome position, we used a newly developed Bayesian analysis method implemented in Stan (61), which we call "Enricherator" (code can be downloaded at https://github.com/jwschroeder3/enricherator.git).
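The final step above (counting aligned fragments per 5 bp genome window) can be sketched as follows; the function name and the toy fragment coordinates are illustrative, not taken from the pipeline:

```python
import numpy as np

def bin_fragment_counts(fragment_midpoints, genome_length, resolution=5):
    """Count aligned fragments per fixed-width genome window.

    fragment_midpoints: 0-based midpoint coordinate of each aligned fragment.
    Returns an array with one count per `resolution`-bp window.
    """
    n_bins = -(-genome_length // resolution)  # ceiling division
    counts = np.zeros(n_bins, dtype=int)
    bins = np.asarray(fragment_midpoints) // resolution
    np.add.at(counts, bins, 1)  # accumulate, handling repeated bins correctly
    return counts

# Toy example: three fragments, two landing in the first 5 bp window
counts = bin_fragment_counts([1, 3, 12], genome_length=20, resolution=5)
print(counts)  # [2 0 1 0]
```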

SM5.5: Bayesian inference of genome-wide RNA:DNA hybrid enrichment scores
Enricherator uses a variational Bayesian method to fit a statistical model with a negative-binomial likelihood to genome-wide input and extracted DNA sequencing coverage, estimating the enrichment of extracted coverage over input coverage. Here, "extracted" can mean any type of experiment in which the user would like to determine enrichment of their target of interest; in the present work "extracted" refers to our HBD pull-down sequencing data, but it could be ChIP-seq, etc. for other use cases. By default, Enricherator produces point estimates and quantile intervals (a 90% highest posterior density interval) for input DNA abundance and extract enrichment, and it can optionally produce similar point estimates and quantile intervals for contrasts between strands or between genotypes of interest, providing direct inference of the biological differences between samples.
Facets of Enricherator that make it preferable to most current methods of calculating genome-wide enrichment of extracted DNA signal over input DNA signal are:
1) The use of negative-binomial statistics handles count data appropriately without the need for pseudocounts.
2) Enricherator applies a regularized horseshoe prior to enrichment scores, explicitly handling the Bayesian analog of multiple hypothesis testing. This shrinkage prior allows users to use the enrichment estimates arising from Enricherator runs directly, without the need to apply shrinkage post hoc.
3) Enricherator uses information from nearby genomic positions to inform its estimate of enrichment at any given genome position. It achieves this by using a smoothing kernel to mix nearby enrichment estimates.
4) The precision term of the negative-binomial likelihood is fit separately for input sequencing counts and extract (HBD pull-down, ChIP-seq, etc.) sequencing counts, providing the model flexibility to appropriately handle biological data.
5) Enricherator uses variational inference to approximate the posterior distribution of enrichment and input DNA scores. Variational inference allows the model to fit in roughly 7 to 8 hours of wall time for bacterial genomes, whereas sampling from the posterior distribution using conventional Monte Carlo methods would take prohibitively long.
A description of the statistical model follows; the full Stan source code and helper R scripts can be found at the Enricherator github repository referenced above.
We define subscripts in our model specification as follows: p indexes genome position, g indexes genotype, j indexes sequencing library, and s indexes strand. We calculate the expected number of fragments, μ_{p,g,j,s}, for each combination of position, genotype, sequencing library, and strand as follows:

ln(μ_{p,g,j,s}) = β_{p,g,s} + t_j (γ_g + δ_{p,g,s}) + c_j

where β_{p,g,s} represents the expected input DNA coverage for the given combination of position, genotype, and strand; γ_g + δ_{p,g,s} represents the enrichment of HBD pull-down signal for the given combination of position, genotype, and strand; t_j is an indicator variable for whether sample j is of the "extract" type; and c_j is the ln(library size) for sequencing library j, centered on the mean ln(library size) across all sequencing libraries (see section "Library size calculation" below for details on calculation of c_j). The term c_j is therefore an exposure term which performs library size normalization. The symbol γ represents a vector with a single element for each genotype analyzed. Each element of γ is a global offset for the enrichment scores within a given genotype. We set a weak prior on γ using a Student-T distribution with 3 degrees of freedom, mean of 0, and standard deviation of 5:

γ_g ~ T(3, 0, 5)

Above, δ_{p,g,s} represents the deviation from the global extract enrichment in genotype g at genome position p for each strand s. Our use of a global genotype-specific enrichment offset and a position-wise vector of offsets for each genotype enabled our use of a shrinkage prior on δ, described in greater detail below. The vectors β and δ are at the genomic resolution of the original input count data, and the values in these vectors are not inferred parameters. Rather, values in β and δ are calculated from inferred values in lower-dimensional, i.e., lower-resolution, vectors β′ and δ′ (see section "Accounting for covariance between nearby genome positions" below for details on calculating β and δ).
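To illustrate the structure of this linear predictor, the expected count at a single position can be computed directly; the parameter values below are arbitrary and purely for demonstration:

```python
import numpy as np

def expected_fragments(beta, gamma, delta, is_extract, c_j):
    """mu = exp(beta + t*(gamma + delta) + c_j).

    beta: expected ln input coverage at this position/genotype/strand
    gamma: genotype-level global enrichment offset
    delta: position-wise deviation from the global enrichment
    is_extract: True for a pull-down (extract) library, False for input
    c_j: centered ln(library size) exposure term for library j
    """
    t = 1 if is_extract else 0
    return np.exp(beta + t * (gamma + delta) + c_j)

# Input library: the enrichment terms drop out entirely (t = 0)
mu_input = expected_fragments(beta=3.0, gamma=0.5, delta=1.0, is_extract=False, c_j=0.0)
# Extract library at the same position: enrichment now contributes (t = 1)
mu_extract = expected_fragments(beta=3.0, gamma=0.5, delta=1.0, is_extract=True, c_j=0.0)
print(mu_extract / mu_input)  # the ratio equals exp(gamma + delta)
```

Note how the extract-to-input ratio isolates exp(γ + δ), which is exactly the enrichment quantity the model reports.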

Library size calculation.
The centered ln(library size) for each sequencing library was calculated by first taking a trimmed mean fragment count for each sequencing library, inspired by a similar approach implemented in the popular RNA-seq analysis R package edgeR (62). We trimmed the bottom 2.5% and the top 2.5% of coverage data from the observed position-wise counts, k, in each library and calculated the mean of the remaining counts. Letting n represent the number of genome positions analyzed at 5 bp resolution after the lower and upper quantiles described above have been excluded, the trimmed-mean ln(library size) for library j is:

LLS_j = ln( (1/n) Σ_i k_i )

LLS_j was next centered on the mean of the vector of LLS values across all libraries to calculate the exposure term, c_j, for each sequencing library.
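A minimal Python sketch of this calculation, assuming counts are held in a (libraries × positions) array; the quantile-based trimming here is our approximation of the percentile trimming described above, and the names are ours:

```python
import numpy as np

def centered_log_library_sizes(count_matrix, trim=0.025):
    """Centered ln(library size) exposure terms, one per library.

    count_matrix: shape (n_libraries, n_positions) of fragment counts.
    For each library, counts outside the [trim, 1 - trim] quantiles are
    excluded before taking the mean (a trimmed mean); the ln(trimmed mean)
    values are then centered on their overall mean.
    """
    lls = []
    for k in count_matrix:
        lo, hi = np.quantile(k, [trim, 1 - trim])
        kept = k[(k >= lo) & (k <= hi)]
        lls.append(np.log(kept.mean()))
    lls = np.array(lls)
    return lls - lls.mean()

# Toy data: the second library has 2x the coverage of the first, plus one
# extreme position per library that the trimming discards.
counts = np.array([[10, 12, 11, 9, 1000],
                   [20, 24, 22, 18, 2000]])
c = centered_log_library_sizes(counts)
print(c)  # roughly [-ln(2)/2, +ln(2)/2]: centered offsets differing by ln(2)
```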

Accounting for covariance between nearby genome positions
Sequencing data is generated in such a way that the number of observed fragments (i.e., the "counts") at nearby genomic positions have a high degree of mutual information, in that knowing the coverage k at position i will give high confidence that coverage at position i+1 will be closer to k than the coverage at a randomly-selected genome position. This covariance between coverage at nearby positions exists primarily due to the fact that entire fragments of DNA sequence are aligned to a reference genome, such that a single alignment will increase counts at several nearby genome positions simultaneously. We wrote Enricherator to account for covariance between nearby genome positions in both input sequencing libraries and extract sequencing libraries. In this section we use the terms IFL and EFL as variables to indicate "mean input fragment length" and "mean extract fragment length", respectively. IFL is used to define the degree to which estimates of input sequencing coverage are smoothed, i.e., the strength of covariance between nearby positions, and EFL is used to define the degree to which estimates of enrichment are smoothed. Higher values for each will yield greater smoothing. Details of how IFL and EFL are used to account for covariance between nearby genomic positions' input coverage and enrichment estimates, respectively, are below.
Consider that for a given sample type, genotype, and strand, β and δ are each a vector of genome positions of size (genome length)/resolution. Let us call the mean input fragment length IFL and the mean extract fragment length EFL. We define two new vectors, β′ and δ′, each element of which represents a fitted value at resolution IFL/2 and EFL/2, respectively. These reduced-dimensionality vectors of fitted values are then used to calculate the vectors β and δ by deconvolving nearby values of β′ and δ′, respectively, using an exponential kernel. The decay rate of the exponential kernel is IFL/2 for deconvolving β′ to β and EFL/2 for deconvolving δ′ to δ (Figure S2).
The above description is defined in the equations that follow, with W_β representing the sparse matrix of deconvolution weights for calculating β from β′, and W_δ representing the sparse matrix of deconvolution weights for calculating δ from δ′:

β = W_β β′
δ = W_δ δ′

Matrix W_β has shape (size(β), size(β′)) and W_δ has shape (size(δ), size(δ′)), and for each matrix, row values are normalized to sum to 1.0. The lower-dimension vector β′ was assigned a normal prior with mean 0 and standard deviation 4:

β′ ~ N(0, 4)

The lower-dimension vector δ′ was assigned the regularized horseshoe shrinkage prior (63):

δ′ ~ HS(1)

We used the "make_stancode" function in the R package brms to generate the Stan code implementing the regularized horseshoe prior, using the default regularized horseshoe prior implemented by brms (64). Our use of a regularized horseshoe prior is justified by the fact that, in the absence of shrinkage on the δ′ prior, many false-positive enrichments would arise. The regularized horseshoe avoids such problems by applying strong shrinkage of the posterior distribution toward zero while allowing a small subset of parameter values to explore space further from zero, thus providing a Bayesian analog of multiple hypothesis testing correction directly within the structure of the statistical model.
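The kernel-weighting scheme can be illustrated with a small dense example (Enricherator itself builds a sparse matrix inside Stan); the sizes, knot spacing, and values below are arbitrary:

```python
import numpy as np

def exp_kernel_weights(n_hires, n_lores, stride, decay):
    """Weight matrix mapping a low-resolution fitted vector to
    high-resolution values via an exponential kernel.

    Entry (i, j) weights low-resolution knot j (located at position
    j*stride) for high-resolution position i; each row is normalized
    to sum to 1.0, as described in the text.
    """
    hires_pos = np.arange(n_hires)[:, None]
    knot_pos = (np.arange(n_lores) * stride)[None, :]
    w = np.exp(-np.abs(hires_pos - knot_pos) / decay)
    return w / w.sum(axis=1, keepdims=True)

# e.g. EFL = 200 bp: knots every EFL/2 = 100 positions, decay rate EFL/2
W = exp_kernel_weights(n_hires=300, n_lores=3, stride=100, decay=100)
delta_prime = np.array([0.0, 2.0, 0.0])  # low-resolution fitted values
delta = W @ delta_prime                  # smoothed high-resolution offsets
print(np.allclose(W.sum(axis=1), 1.0))   # True: rows are normalized
```

The resulting high-resolution vector peaks at the middle knot (position 100) and decays smoothly away from it, which is the covariance-sharing behavior described above.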
Finally, the ln(precision) term for each sample type was assigned a normal prior with mean 2 and standard deviation 1.
The enrichment scores and input abundance estimates were converted from the natural-log scale to a log2 scale prior to interpretation and plotting, and the bedgraph files generated by Enricherator contain these scores on the log2 scale.

Statistical modeling of gene expression and orientation's association with RDH enrichment
For the statistical model used in Figure 2B, we used the R package brms to fit a Bayesian linear model using the formula syntax "hbd ~ rep*exp". The model was fit separately for each genotype. The variable hbd refers to the gene-wise z-scores for HBD enrichment as determined by Enricherator. The variable rep indicates whether a gene is head-on or co-directional, and the variable exp indicates the gene-wise z-scores for transcript abundance. The formula syntax above fits a linear model with rep, exp, and an interaction term between rep and exp as covariates.

Annotations of ncRNAs and UTRs were taken from Table S5 of (66), filtered for features for which the "Locus_tag" column begins with "BSU_misc" and removing features for which the "classif" column is "-". The work performed in (66) was done in the B. subtilis 168 strain, so we next mapped the ncRNA and UTR features from B. subtilis 168 to the B. subtilis PY79 genome using BLAST. We converted the feature records mentioned above to the bed format and retrieved each RNA feature's sequence using "bedtools getfasta".
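To make the "hbd ~ rep*exp" formula from the model above concrete, its design matrix can be sketched with simulated data; the ordinary-least-squares fit below is only an illustrative frequentist stand-in for the brms Bayesian model, and all variable values are made up:

```python
import numpy as np

def design_matrix(rep, exp):
    """Expand the formula hbd ~ rep*exp into its design matrix columns:
    intercept, rep, exp, and the rep:exp interaction."""
    rep = np.asarray(rep, dtype=float)
    exp = np.asarray(exp, dtype=float)
    return np.column_stack([np.ones_like(rep), rep, exp, rep * exp])

rng = np.random.default_rng(1)
rep = rng.integers(0, 2, 500).astype(float)  # 1 = head-on, 0 = co-directional
exp = rng.normal(size=500)                   # expression z-score
# Simulated HBD z-scores with a known orientation/expression interaction
hbd = 0.2 * rep + 0.5 * exp + 0.3 * rep * exp + rng.normal(scale=0.1, size=500)

X = design_matrix(rep, exp)
coef, *_ = np.linalg.lstsq(X, hbd, rcond=None)
print(coef)  # recovers roughly (0, 0.2, 0.5, 0.3)
```

The interaction column (rep * exp) is what lets the expression slope differ between head-on and co-directional genes.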

Comparison
Each resulting fasta sequence was mapped to the B. subtilis PY79 reference genome using BLAST, the BLAST results were converted to bed format, and the assignment of each RNA as ncRNA, 5′-UTR, or 3′-UTR performed by (66) was joined to the appropriate features in the resulting bed file, which was finally converted to gff format. See section SM5.4 above for the URL of the github repository containing the code used to make the custom annotations.
Feature-level robust z-scores were taken as the median of the position-wise robust z-scores for all positions in that feature. To determine the statistical significance of differences between genotypes, two-sided Wilcoxon rank-sum tests were performed using the wilcoxon function in the python module scipy.stats, with arguments zero_method="wilcox", correction=False, alternative="two-sided", and nan_policy="propagate". To calculate p-values for comparisons between head-on and co-directional genes within genotypes, we used the ranksums function in the python module scipy.stats, with arguments alternative="two-sided" and nan_policy="propagate". Confidence intervals were calculated via bootstrap sampling. Adjusted p-values (FDR) were generated using the multtest package for R version 3.5.1 (mt.rawp2adjp) with the Benjamini & Hochberg procedure (proc="BH").
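As an illustration, the rank-sum comparison and Benjamini-Hochberg adjustment can be reproduced in Python on simulated data; the hand-rolled BH step below stands in for the R multtest package used in the actual analysis, and the p-values it adjusts are partly made up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Toy feature-level z-scores for head-on vs co-directional genes
head_on = rng.normal(loc=0.5, size=60)
codirectional = rng.normal(loc=0.0, size=60)

# Two-sided rank-sum test between the two orientation classes
stat, p = stats.ranksums(head_on, codirectional, alternative="two-sided")

# Benjamini-Hochberg adjustment across a set of such tests
pvals = np.array([p, 0.04, 0.20, 0.001])
order = np.argsort(pvals)
ranked = pvals[order] * len(pvals) / (np.arange(len(pvals)) + 1)
fdr = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity
adjusted = np.empty_like(fdr)
adjusted[order] = np.minimum(fdr, 1.0)
print(adjusted)  # BH-adjusted p-values in the original order
```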

SM5.6: qPCR of material used for genome-wide RDH enrichment analysis
To determine the enrichment of RNA:DNA hybrids via qPCR for wild type (PY79), ∆rnhB, and ∆rnhC cells, input and extracted material from the HBD pull-down were used to assemble qPCR reactions. Reactions contained 1x iTaq Universal SYBR Green Supermix, 300 nM of each primer (Data File S1), and 0.046875 µL of template per 10 µL reaction. Thermal cycling was 98°C for 5 min, then 45 cycles of [10 s at 98°C, 30 s at 60°C], followed by a melt curve from 65°C to 95°C in a BioRad CFX Opus Real-Time PCR instrument.
Primers were chosen to span a range of HBD pulldown efficacies and targeted the following areas: upstream regions of sdpC, aroK, braB, bsrI, mfd, and cspB, and the coding region of ybfM in B. subtilis and the tRNA Thr locus of C. crescentus. Cq values were determined via regression using CFX Maestro Software (BioRad).
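For intuition, pull-down enrichment at a locus can be summarized from such Cq values with the standard ΔΔCq calculation; this sketch assumes an idealized per-cycle amplification factor of 2 and a hypothetical reference locus, whereas the model in SM5.7 estimates primer and pull-down efficiencies explicitly:

```python
def ddcq_enrichment(cq_extract, cq_input, cq_extract_ref, cq_input_ref,
                    efficiency=2.0):
    """Fold enrichment of a target locus in the pull-down relative to input,
    normalized to a reference locus (e.g. a spike-in), assuming a constant
    per-cycle amplification factor."""
    dcq_target = cq_extract - cq_input       # cycles "lost" in pull-down at target
    dcq_ref = cq_extract_ref - cq_input_ref  # same for the reference locus
    return efficiency ** -(dcq_target - dcq_ref)

# Target comes up 3 cycles earlier than the reference after pull-down:
print(ddcq_enrichment(20.0, 18.0, 25.0, 20.0))  # 2**-(2-5) = 8.0
```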

SM5.7: Analysis of qPCR RDH enrichment
We used the python module pymc, version 5.0.2, to infer RDH enrichment relative to input DNA for each set of primers used in section SM5.6. To directly compare ΔΔCq values across the genomic loci we interrogated, we devised a statistical model that directly incorporates primer efficiency and pull-down efficiency into the model structure. The data used as inputs to the model and the code can be found at https://github.com/jwschroeder3/qpcr_analysis.git.

SM6: Bayesian inference of parameter association with mutation rate
Mutation accumulation data were subset by mutation category, i.e., transition, transversion, or insertion/deletion. A generalized linear model was fit to the data of each mutation category to regress the number of mutations in coding sequences against: coding sequence direction relative to the direction of DNA replication fork progression; expression, in terms of mean-centered, standardized rpkm values; mean-centered, standardized RNA:DNA hybrid binding domain pull-down enrichment; mean-centered, standardized enrichment of input DNA from our RNA:DNA hybrid binding domain sequencing library preparation; and interaction terms between each of the prior-mentioned variables and coding sequence expression. The model fit to the data is summarized as follows:

y_i ~ Poisson(μ_i)
ln(μ_i) = α + ln(L_i × G) + β_O O_i + β_E E_i + β_H H_i + β_I I_i + β_OE O_i E_i + β_HE H_i E_i + β_IE I_i E_i

In the above generalized linear model, y_i is the number of mutations observed in CDS i. α is the intercept, and the term ln(L_i × G) serves as an exposure term to account for varying CDS length (L_i) and the number of generations (G) a given genotype's MA lines underwent. We note that this exposure term causes the fit of the model to our data to infer the association between each covariate and mutation rate per nucleotide per generation, rather than simply mutation count. O_i is a binary variable taking the value 0 if CDS i is co-directionally transcribed with the direction of DNA replication fork progression and 1 if it is head-on. The variable E_i represents the expression level of CDS i. To calculate E_i, we calculated ln(rpkm + 1×10⁻⁵) values for each CDS using RNA-seq data from PY79 (wild type), ΔrnhB, and ΔrnhC strains. Adding 1×10⁻⁵ to rpkm prior to taking the natural logarithm was done to avoid taking the logarithm of zero for the rare CDSs with no alignments. These ln(rpkm + 1×10⁻⁵) values were then standardized for each genotype's RNA-seq data to determine an E_i value for each CDS from each genotype. HBD pull-down enrichment scores were determined as described in section SM5.
For each strain background, HBD pull-down enrichment scores were standardized to determine the value of the hybrid-enrichment covariate for each CDS from each genotype.
Input scores were determined as described in section SM5; for each strain background, input scores were standardized to generate the input-DNA covariate values for each CDS from each genotype. A regularizing prior on α was set as a normal distribution with mean = -23 and standard deviation = 5 to reflect our prior knowledge that the average mutation rate per nucleotide per generation is approximately e⁻²³ ≈ 1×10⁻¹⁰. A normal distribution with mean = 0 and standard deviation = 3 was used as a prior for the main effects of genotype, gene-expression z-score, HBD pull-down z-score, input z-score, and gene orientation. Due to the sparse nature of mutation accumulation line data, we used a horseshoe shrinkage prior with 1 degree of freedom and a scale of 1 for all interaction terms between variables in the model.
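The role of the exposure term can be illustrated with a minimal maximum-likelihood Poisson fit on simulated data (the actual analysis is Bayesian, fit with brms/Stan, with more covariates and shrinkage priors; the per-nucleotide rate, covariate, and all other values below are made up):

```python
import numpy as np
from scipy.optimize import minimize

def fit_poisson_with_exposure(X, y, log_exposure):
    """Maximum-likelihood Poisson regression with a fixed offset:
    y_i ~ Poisson(mu_i), ln(mu_i) = log_exposure_i + X_i @ beta."""
    def nll(beta):
        eta = log_exposure + X @ beta
        return np.sum(np.exp(eta)) - y @ eta  # negative log-likelihood (up to const)
    # Start at the intercept-only MLE so the optimizer is well-scaled
    beta0 = np.zeros(X.shape[1])
    beta0[0] = np.log(y.sum() / np.exp(log_exposure).sum())
    return minimize(nll, beta0, method="BFGS").x

rng = np.random.default_rng(3)
n = 2000
length = rng.integers(300, 3000, n).astype(float)  # CDS lengths (nt)
log_exposure = np.log(length)                      # exposure term, here ln(L_i)
expr = rng.normal(size=n)                          # expression z-score covariate
X = np.column_stack([np.ones(n), expr])
true_beta = np.array([-8.0, 0.5])                  # ln(per-nt rate) and its slope
y = rng.poisson(np.exp(log_exposure + X @ true_beta))

beta_hat = fit_poisson_with_exposure(X, y, log_exposure)
print(beta_hat)  # close to true_beta; the offset keeps alpha on the rate scale
```

Because the offset carries the CDS length, the fitted intercept and slopes describe per-nucleotide rates rather than raw counts, which is the point made in the text.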
We note that because a natural logarithm link function is applied in fitting the Poisson model, each fitted coefficient describes an association with the natural log of the mutation rate. Inference of each variable's association with CDS mutation rate was performed using the R package "brms" (64), which generates code in the programming language Stan to fit a model using Bayesian inference (61). The data used to fit the model are in Data File S2. Posterior samples for each variable were extracted from the resulting fit objects and summarized using the R package "tidybayes". Point estimates (median values for a given posterior distribution), 90% highest continuous probability intervals, and evidence ratios (Bayes factors, or K) for all parameters' associations with mutation rates and hypotheses tested are in Data File S2.
Hypothesis testing was performed using the function "hypothesis" from the R package "brms" (64). We note that the Bayesian interpretation here provides a measure of support for a hypothesis of interest, not simply evidence against a specified null hypothesis. We therefore tested three hypotheses for each parameter's estimated association with mutagenesis: 1) the parameter is positively associated with mutation rates ("Evidence ratio positive" column in Data File S2), 2) the parameter is negatively associated with mutation rates ("Evidence ratio negative" column in Data File S2), and 3) the parameter's association with mutation rates is close to zero ("Evidence ratio zero" column in Data File S2). For the hypothesis that the association is near zero, we defined a region of practical equivalence of (-ln(1.2), ln(1.2)). Directed hypotheses are marked with "*" symbols; the near-zero hypothesis is marked with "X" symbols. The evidence ratios (K) were calculated as the proportion of values from a posterior distribution that met the hypothesis divided by the proportion of values from the posterior distribution that violated the hypothesis. We applied Kass and Raftery's scale for defining strength of evidence, by which a K-value from [3,20) indicates "positive" evidence in support of the hypothesis and the parameter receives a single significance symbol (* or X), a K-value from [20,150) represents "strong" evidence and the parameter receives two significance symbols (** or XX), and a K-value from [150,Inf) indicates "very strong" evidence in support of the hypothesis and the parameter receives three significance symbols (*** or XXX) (33). These ranges can be interpreted as providing similar value to particular p-value ranges in frequentist statistical testing; for example, the K=3 cutoff in the Bayesian framework is analogous to the commonly used p=0.05 cutoff in frequentist tests.
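The evidence-ratio calculation described above can be sketched as follows; the posterior draws are simulated and the function names are ours:

```python
import numpy as np

def evidence_ratios(posterior, rope=np.log(1.2)):
    """Evidence ratios (K) for a vector of posterior samples: positive
    association, negative association, and 'near zero' (within the region
    of practical equivalence (-rope, rope))."""
    def k(meets):
        p = meets.mean()  # proportion of draws meeting the hypothesis
        return p / (1.0 - p) if p < 1.0 else np.inf
    return {
        "positive": k(posterior > 0),
        "negative": k(posterior < 0),
        "zero": k(np.abs(posterior) < rope),
    }

def stars(k):
    """Kass & Raftery scale: K in [3,20) -> *, [20,150) -> **, >=150 -> ***."""
    return "***" if k >= 150 else "**" if k >= 20 else "*" if k >= 3 else ""

rng = np.random.default_rng(4)
samples = rng.normal(loc=0.4, scale=0.1, size=4000)  # toy posterior draws
ks = evidence_ratios(samples)
print(stars(ks["positive"]))  # essentially all mass > 0, so "***"
```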
In addition to the model described above (the "full model"), we fit an additional four modified versions of the model: 1) a model without any term referencing gene expression; 2) a model without any term referencing RDH enrichment; 3) a model with two slopes for the main effect of RDH enrichment and for each interaction term of which RDH enrichment was a member, in which the changepoint between the two slopes was an additional fitted parameter; and 4) a model with a slope of zero to the left of a fitted changepoint and a fitted slope associated with RDH enrichment to the right of the changepoint. We found that for all three mutation types analyzed in our work, the full model described above, without any changepoint of RDH association with mutation rate, was either the best-performing model or was within the standard error of the best-performing model as judged using the "loo_compare" function in version 2.5.1 of the "loo" package (Table S1). We therefore saw no evidence-based reason to use any of these four additional models, and instead used the full model defined above.

Figure 4. Stars indicate the strength of evidence for a parameter's association with the natural log of mutation rate being either greater or less than zero. An "X" indicates the strength of evidence for a parameter's association with the natural log of mutation rate being near zero. Strength of evidence was determined as described in Materials and Methods. The symbol * represents "positive" support with a K-value from [3,20), ** represents a K-value from [20,150) indicating "strong" evidence, and *** represents a K-value from [150,Inf) indicating "very strong" evidence for a hypothesis; equivalently strong K0 values are shown with equal numbers of 'X' characters. (B-G) Fitted values for mean transition, transversion, or indel count for all marginal effects for a theoretical average CDS implied by our model.
Points or lines represent the median estimate, and bars or shaded intervals denote the 90% highest-density continuous interval.

Table S1. Results of quantitative model comparison for the fits of the five indicated models to each of the three indicated mutation types. Values were calculated using version 2.5.2 of the "loo" package in R. a) LOO-IC = LOO information criterion. Lower values of LOO-IC indicate better predictive performance. b) ELPD difference = difference in expected log pointwise predictive density between the model with the best predictive performance and the given model/mutation type. A value of 0.0 is associated with the model with the best predictive performance, and increasingly negative values indicate increasingly poorly performing models. Note that across mutation types the "full model" is either the best-performing model, or the standard error of the ELPD difference for the "full model" overlaps the ELPD difference of the best model, indicating their performances are indistinguishable.