The H3K4me1 histone mark recruits DNA repair to functionally constrained genomic regions in plants

Mutation is the ultimate source of genetic variation. Mutation rate variability has been observed within plant genomes, but the underlying mechanisms have been unclear. We previously found that mutations occur less often in functionally constrained regions of the genome in Arabidopsis thaliana and that this mutation rate reduction is predicted by H3K4me1, a histone modification found in the gene bodies of actively expressed and evolutionarily conserved genes in plants. We reanalyzed de novo germline single base substitutions in fast neutron irradiated mutation accumulation lines in Kitaake rice (Oryza sativa) and found the same reduction in mutations associated with H3K4me1, gene bodies, and constrained genes as in A. thaliana, suggesting conserved mechanisms for mutation reduction in plants. Here, we characterize a model of targeted DNA repair to explain these observations; PDS5C and MSH6 DNA repair-related proteins target H3K4me1 through their Tudor domains, resulting in nearby DNA experiencing elevated repair. Experimental data and in-silico modeling support the high affinity of the Tudor domain for H3K4me1 in both proteins, and that this affinity is conserved between plant species. ChIP-seq data from PDS5C confirms its localization to conserved and low mutation rate genome regions. Somatic and germline mutations observed by deep sequencing of wild-type and MSH6 knockout lines confirm that MSH6 preferentially repairs gene bodies and H3K4me1-enriched regions. These findings inspire further research to characterize the origins of mechanisms of targeted DNA repair in eukaryotes and their consequences on tuning the evolutionary trajectories of genomes.


Introduction
Mutations occur when DNA damage or replication error goes unrepaired. Mechanisms that localize DNA repair proteins to certain genome regions, such as by binding certain histone modifications, reduce local mutation rates. Interactions between DNA repair and histone modifications are predicted to evolve if they promote repair in regions prone to deleterious mutations, such as coding regions of essential genes (Lynch, 2010;Martincorena and Luscombe, 2013;Lynch et al., 2016).
The localization of DNA repair proteins responsible for such mutation biases has been well-established in humans (Supek and Lehner, 2015, 2017Foster et al., 2015;Katju et al., 2022). In vertebrates, H3K36me3 is targeted by PWWP domains in proteins contributing to homology-directed and mismatch repair, with H3K36me3 marking the gene bodies and exons of active genes (Li et al., 2013;Huang et al., 2018;Fang et al., 2021;Aymard et al., 2014;Sun et al., 2020). As predicted, reduced mutation rates in active genes and gene bodies have been observed in humans and other animals (Moore et al., 2021;Akdemir et al., 2020;Li et al., 2021;Katju et al., 2022;Supek and Lehner, 2017).
Recently, the Tudor domain of PDS5C was shown to specifically bind H3K4me1 in A. thaliana (Niu et al., 2021). This gene is also a cohesion cofactor that facilitates homology-directed repair (Pradillo et al., 2015;Phipps and Dubrana, 2022;Morales et al., 2020). Recent studies of CRISPR-mediated mutation efficiency show that H3K4me1 is associated with lower mutation efficacy ( R = -0.64), supporting more efficient repair (Weiss et al., 2022;Schep et al., 2021;Zhu et al., 2021). These findings are consistent with analyses of mutation accumulation lines in A. thaliana, which find H3K4me1 associated with lower mutation rates (Monroe et al., 2022). Still, the extent and consequences of targeted DNA repair remain contentious (Liu and Zhang, 2022).
Here, we reanalyzed de novo mutations from whole-genome-sequenced fast neutron mutation accumulation lines in Kitaake Rice (O. sativa) (Li et al., 2017) showing similar patterns of mutation reduction and H3K4me1 association as A. thaliana. We then characterize a mechanistic model to explain these observations, involving PDS5C and MSH6 proteins targeting H3K4me1 via Tudor domains.

Mutations in O. sativa FN-irradiated lines reflect no selection on SBS mutations
Mutagenesis has been used extensively in the generation and study of plant mutation. With single base substitutions (SBS) in fast-neutron mutation accumulation lines largely reflecting native mutational processes (Wyant et al., 2022;Li et al., 2017), the distribution of these de novo mutations could provide insights into mechanisms underlying intragenomic heterogeneity in mutation rate.
We first reanalyzed de novo mutations in a population of 1,504 O. sativa lines that accumulated mutations upon fast neutron radiation. These data were previously described, and single base-pair substitutions (SBS) were validated with a >99% true positive rate (Li et al., 2017). In total, these data included 43,483 SBS, comprising a combination of indistinguishable fast neutron-related and "spontaneous" mutations ( Fig. 1).
We restricted our analyses to SBS mutations to preclude selection that may have occurred in the generation of these lines on insertions, deletions, and structural variants (i.e., whole gene deletions).
To quantitatively evaluate whether selection on SBS mutations occurred in these lines, we examined non-synonymous and synonymous mutation rates. The ratio of non-synonymous to synonymous mutations in mutation accumulation lines was 2.33 (N/S=5,370/2,155), a 190% increase over this ratio (Pn/Ps = 1.21) observed in polymorphisms of 3,010 sequenced O. sativa accessions (Wang et al., 2018) (X 2 = 670.63, p<2x10 -16 ). The ratio of non-synonymous to synonymous de novo mutations was not higher in transposable elements (TE) (N/S=2.31) than in non-TE protein-coding genes (N/S=2.34) (X 2 = 0.035, p = 0.85) nor was it less in coding genes than neutral expectations (N/S=2.33) based on mutation spectra and nucleotide composition of coding regions in the O. sativa genome (X 2 = 0.029, df = 1, p = 0.86) (Fig. S1, methods). To confirm that this non-significance was not simply due to a lack of power, we tested how many non-synonymous mutations would have to be "missing" because of selection to explain the 29% reduction in mutation rate of coding genes relative to non-genic (intergenic, TE) regions we observed in downstream analyses. Given that coding regions constitute only 37% of gene bodies (the rest being intronic and untranslated exonic regions), we find that 34% of non-synonymous mutations would have to have been missing due to selection, which would be readily detected by a significant deviation from the null expectation (X 2 = 224.4, df = 1, p < 2e-16). In conclusion, we detect no evidence of selection on SBS mutations, providing an opportunity to study de novo mutation rate heterogeneity before selection.
We then compared SBS spectra in trinucleotide contexts from these lines with de novo germline mutations in A. thaliana (Weng et al., 2019;Monroe et al., 2022) and found that they are significantly correlated (Fig. 1A, r=0.8, p<2x10 -16 ). We cannot know how much of the residual difference in SBS spectra is due to the effects of fast neutron mutagenesis versus natural differences between O. sativa and A. thaliana, as variations in the spectra of SBS have been reported between related species and even different genotypes of the same species (Jiang et al., 2021;Sasani et al., 2022;Cagan et al., 2022). Future work will benefit from the evaluation of de novo mutations arising in different species under diverse conditions to understand the environmental and genetic controls of SBS mutational spectra in plants.

H3K4me1 marks low mutation rate regions in O. sativa, including gene bodies, and transcriptionally active and evolutionarily conserved genes
To test whether the genome-wide distribution of mutations in O. sativa is associated with histone modifications, we used data from the riceENCODE epigenomic database, which includes H3K4me1, H3K9me1, H3K4me3, H3K36me3, H3K9me2, H3K27me3, H3K27ac, H3K4ac, H3K12ac, H3K9ac, and RNA polymerase II (PII) measured by chromatin immunoprecipitation sequencing (ChIP-seq) . We then tested whether mutation probabilities in 100bp windows in genic regions (the probability of observing a mutation) were predicted by epigenomic features and found a significant reduction in mutation probabilities in windows that overlapped with H3K4me1 peaks (Fig. 1B). These data are consistent with observations in other plant species where lower mutation rates have been observed in regions marked by H3K4me1 (Monroe et al., 2022;Weiss et al., 2022). To further confirm that these patterns were not due to selection, we also restricted our analyses to only those genes in which loss-of-function mutations were found in the population (Li et al., 2017) and observed similar results (Fig. S2). We considered that mutation rate heterogeneity was possibly caused by GC>AT mutations in transposable elements with elevated cytosine methylation in non-genic sequences rather than histone-mediated mutation reduction. Therefore, we restricted our analyses to exclude all such GC>AT mutations and observed similar results. H3K4me1-associated hypomutation was also the same when analyses were restricted to only homozygous mutations (Fig. S2). We further repeated analyses with other regression approaches: single predictor binomial regression, multiple linear regression, Poisson regression, and model selection based on AIC. In all cases, H3K4me1 was found to have the most significant association with reduced mutation rates ( Fig. S4) We calculated mutation rates in genes and their neighboring sequences and observed a significant reduction in mutation rates in gene bodies (Fig. 1C), consistent with lower mutation rates in gene bodies of other plants. Mutation rates were lower both in and around H3K4me1 peaks, which could indicate the action of local recruitment and targeting of DNA repair to H3K4me1 marked sequences including gene bodies (Fig. 1D). That mutation rates were also lower in sequences immediately neighboring H3K4me1 peaks could indicate a spatially distributed effect on mutation around H3K4me1, or the effect of conservative peak calling (Fig. 1D). Only 8.9% of H3K4me1 peaks were found outside of non-TE protein-coding genes. Nevertheless, we could use these instances of non-genic H3K4me1 to test whether the reduction in mutation rates in H3K4me1 peaks was due simply to selection against coding region mutations affecting results. When considering all H3K4me1 peaks, we observed a 20.1% reduction in mutation rates compared to regions within 2kb outside of peaks (X 2 = 124.38, p < 2.2e-16 -). For non-genic H3K4me1 peaks, we observed the same reduction: -20.2% (X 2 = 9.88, p = 0.00167). Together, these results suggested a role of H3K4me1 in localized hypomutation, which could explain reduced gene body mutation rates observed since gene bodies are enriched for H3K4me1 (Fig. 1E,F).
We then compared mutation rates between different classes of genes, which proved consistent with the expected effects of increased DNA repair in functionally constrained genes caused by H3K4me1-localized repair. Mutation rates were significantly lower in genes that overlapped with H3K4me1 peaks, including when considering only synonymous mutations ( Fig S3). H3K4me1 peaks were enriched in genes annotated as expressed compared with those not expressed (Kawahara et al., 2013) (X 2 = 2550961, p < 2x10 -16 ). And, as predicted by their enrichment for H3K4me1, mutation rates were 22% lower in expressed genes (X 2 = 63.7, p = 1x10 -15 )( Fig. 1G). This result was confirmed in an analysis restricted to synonymous mutations in coding regions only, rejecting the hypothesis that these results are due to selection (Fig. S3). Indeed, the ratio of non-synonymous to synonymous de novo mutations in the data was not different between expressed and non-expressed genes (X 2 = 0.0007, p = 0.98) (Fig. S3). Comparing genes that exhibit different degrees of selection in 3,010 natural accessions of O. sativa (Wang et al., 2018), those under elevated purifying selection with low Pn/Ps (non-synonymous/synonymous polymorphisms), were enriched for H3K4me1 peaks (X 2 = 8045711, p < 2x10 -16 ) and experienced 19% lower mutation rates (X 2 = 188.5, p < 2x10 -16 ) (Fig. 1H). These genes did not have a lower ratio of non-synonymous to synonymous in de novo mutations (X 2 = 0.22, p=0.63) ( Fig S3) and the reduction of mutation rates in conserved genes was confirmed by analysis of synonymous mutations only ( Fig S3). We, therefore, find that mutation rates are significantly lower in functionally constrained genes in O. sativa, which cannot be explained by selection on non-synonymous mutations, consistent with what has been previously shown in A. thaliana (Monroe et al., 2022).

PDS5C targets H3K4me1 and is associated with lower mutation rates
These findings in O. sativa are consistent with reports of reduced mutation rates in gene bodies of expressed and constrained genes in A. thaliana and other species (Krasovec et al., 2017;Moore et al., 2021;Monroe et al., 2022). While in humans, this is known to be mediated by H3K36me3 targeting by DNA repair genes, our results suggest that H3K4me1 may be a target of DNA repair in plants. To examine this further, we considered genes with known H3K4me1 targeting. PDS5C, a gene belonging to a family of cohesion cofactors that facilitate homology-directed repair (HDR), contains a Tudor domain that was recently discovered to specifically bind H3K4me1 (Niu et al., 2021). Analyses of ChIP-seq data of PDS5C-Flag from A. thaliana show PDS5C is targeted to gene bodies (which are enriched for H3K4me1 in both O. sativa and A. thaliana) ( Fig. 2A). We also find that PDS5C is increased in regions of lower germline mutation rates in A. thaliana, consistent with recruitment and its function in facilitating DNA repair (Fig. 2B).
Evolutionary models predict that histone-mediated repair mechanisms should evolve if they facilitate lower mutation rates in sequences under purifying selection (Lynch et al., 2016;Martincorena and Luscombe, 2013). As predicted by this theory, we find that PDS5C targeting (ChIP-seq) is enriched in coding sequences, essential genes (determined by experiments of knockout lines (Lloyd and Meinke, 2012;Lloyd et al., 2015)), genes constitutively expressed (detected in 100% of tissues sampled) (Mergner et al., 2020), and genes under stronger purifying selection in natural populations of A. thaliana (lower Pn/Ps) (1001Genomes Consortium, 2016 (Fig. 2C-F).
Visualizing the A. thaliana PDS5C protein full-length model generated by Alphafold (Jumper et al., 2021) reveals that the PDS5C active domain is separated from the Tudor domain by unstructured, and potentially flexible segments (Fig. 3A), suggesting that the Tudor domain operates as an anchor, localizing PDS5C to H3K4me1 and gene bodies of active genes. PDS5C is a cohesion cofactor linked to multiple DNA repair pathways. In its role in cohesion between sister chromatids, it has been reported to promote HDR (Pradillo et al., 2015). This is consistent with its known interaction with repair proteins along with the direct contribution of cohesion and PDS orthologs to the HDR pathway (Morales et al., 2020;Phipps and Dubrana, 2022;Hill et al., 2016;Ren et al., 2005;Bolaños-Villegas et al., 2013;Schubert et al., 2009). The observation that mutation rates are reduced at H3K4me1 peak regions (Fig. 1) supports the hypothesis that Tudor domain-mediated targeting in PDS5C, its orthologs (PDS5A, B, D, and E), or other repair-related proteins contribute to targeted hypomutation in the functionally important regions of the genome. Still, additional experiments are needed to quantify the precise local effect of PDS5C on mutation rate. We compared the PDS5C Tudor domain sequence between A. thaliana and O. sativa and found that the critical amino acids constituting the aromatic cage, where H3K4me1 binding specificity is determined, are conserved (Fig. 3C), suggesting a potential role of PDS5C in the mutation biases observed here in O. sativa (Fig. 1).

Multiple DNA repair mechanisms with Tudor domains could be influencing mutation biases
The discovery of the PDS5C Tudor domain as an H3K4me1 targeting domain (Niu et al., 2021) provides an opportunity to identify other proteins with potential for H3K4me1-mediated gene body recruitment. We used blastp to search the A. thaliana proteome for other proteins containing Tudor domains similar to that of PDS5C. An analysis of gene ontologies indicated that this gene set is highly enriched for genes with DNA repair functions (9/29 genes, p=1x10 -11 ) (Fig. S5A, Table S1). Five of these were PDS5C homologs. We also found that MSH6, a DNA mismatch repair protein, contains a Tudor domain similar to that of PDS5C, which was an obvious candidate for further consideration. That the MSH6 homologous protein in vertebrates instead contains a PWWP that targets gene bodies via H3K36me3 binding suggests a remarkable example of convergent evolution between plants and vertebrates (Li et al., 2013).
Structural modeling prediction of MSH6 structure using AlphaFold (Jumper et al., 2021) indicates that its Tudor domain may, like that in PDS5C, function as an anchor, tethering it to H3K4me1 leading to local increases in DNA repair ( Fig. 3A-B, Fig. S5 (Niu et al., 2021). Superimposition of the three modeled Tudor domains onto the PDS5C Tudor domain showed remarkably similar structures with backbone root mean square deviation (RMSD) values below 1 Å (Fig. S6).
Subsequently, we modified the K4me1 from the H3 tail peptide bound to the PDS5C Tudor domain in ChimeraX (Goddard et al., 2018) to obtain H3K4, H3K4me2, and H3K4me3. Using Rosetta FlexPepDock (Raveh et al., 2010), we modeled the docking of the different methylation states of H3K4 to the Tudor domains of PDS5C and MSH6 from A. thaliana and O. sativa.
We analyzed the geometry of the binding site in the top 10% of models based on Rosetta total score, by measuring the distances of H3K4 nitrogen to the center of the three aromatic rings forming the aromatic cage in the Tudor domains (Fig 3D). The experimental structure of the PDS5C Tudor domain bound to H3K4me1 (PDB:7DE9) (Niu et al., 2021) provided a reference to analyze the geometry of the aromatic cage. In this structure, the average distance between H3K4 nitrogen and the aromatic rings is 4.6 Å (Fig.  3D). We calculated this average distance in the in silico models generated by FlexPepDock and analyzed its distributions per case (Fig. S7).
We found that by setting a distance threshold of 6Å, we were able to accurately calculate the proportion of generated models with the H3K4 inside the aromatic cage. For the A. thaliana PDS5C Tudor, the proportion of models with H3K4 inside the aromatic cage was a near-perfect correlation (r=0.999) with the dissociation constant values obtained experimentally by ITC ( Figure 3E). Therefore, we reasoned that we could apply this approach to predict the binding preference of the Tudor domains of A. thaliana MSH6 and O. sativa MSH6 and PDS5C to the different epigenetic marks. For all three cases, we observed that H3K4me1 was the mark sampled more often inside the aromatic cage (Fig. 3E), followed by H3K4me2, and H3K4me3, aligning with the experimental results for the Tudor domain of A. thaliana PDS5C. Interestingly, lack of methylation resulted in almost no sampling of geometries within the aromatic cage, which aligns with ITC experimental results that show no detectable binding of H3 tail peptide by the PDS5C Tudor domain when K4 was not methylated (Niu et al., 2021). These results suggest that MSH6 and PDS5C Tudor domains bind to H3K4 with more affinity when it is in the monomethylated state (H3K4me1). While binding to H3K4me2 and me3 is possible, our results suggest a larger energetic barrier to access the aromatic cage for these two states and lower affinity. However, binding experiments must be conducted to obtain dissociation constant values, thus determining the exact selectivity for H3K4me1 over H3K4me2 and H3K4me3 of these Tudor domains.
Based on previous experimental data and our computational analysis, we conclude that MSH6 and PDS5C Tudor domains bind to H3K4 preferentially when it is in the monomethylated state (H3K4me1). Interestingly, in-silico predictions suggest different levels of affinity between the Tudor domains, which could lead to differences in their contribution to the H3K4me1-associated mutation rate reduction. The similar binding preference of PDS5C and MSH6 Tudor domains suggests that multiple repair pathways could have evolved H3K4me1-targeting potential in plants, motivating additional experiments and investigations into the evolutionary origins of these mechanisms. Because MSH6 operates as a dimer with MSH2 to form the MutSα complex, which recognizes and repairs small mismatches, its Tudor domain could explain the previous observation that MSH2 preferentially targets gene bodies to reduce mutation rates therein (Belfield et al., 2018). These findings are consistent with extensive work showing that mutation rates can be lower in gene bodies of active and conserved genes, but suggest that these mechanisms are independent of, while functionally analogous to, similar mechanisms known in vertebrates (H3K36me3 targeted repair described in the introduction). Further experiments are needed -the reduced mutation rates in gene bodies in active genes in plants may be explained by multiple mechanisms collectively targeting H3K4me1 or additional histone states via Tudor domains as well as other histone modifications and readers (Davarinejad et al., 2022;Liu et al., 2022).

de novo mutations in MSH6 knockout lines indicate targeted repair of gene bodies and H3K4me1 marked regions
To experimentally test the effect of MSH6 in targeted DNA repair and their potential consequence in the reduction in mutation rates associated with H3K4me1 and therefore gene bodies, we performed deep sequencing (250x depth coverage) of 3-week old rosettes of A. thaliana of 7 MSH6 knockout line SALK_037557 (msh6/msh6), 1 MSH6 knockout line SALK_089638 (msh6/msh6), 2 heterozygous SALK_089638 lines (WT/msh6), and 9 wildtypes (WT/WT) lines (Fig. 4A). Variant caller Strelka2 was used in TUMOR-NORMAL mode to call SBS mutations for each sample following a strict filtering process to call true mutations. In summary, variants were kept if they were called in only 1 sample, PASS for all 16 "NORMAL" comparisons reported by Strelka2 (Kim et al., 2018), the distance to nearest mutation >1000 bp, total read depth >150, and alt read support > 5. Somatic vs de novo germline mutations were distinguished based on deviation from expected alternative variant frequencies (X 2 p<0.001, alt reads < 50%) (Fig. S8B). As expected, the total number of somatic mutations was higher in the MSH6 knockout lines (mean±SE = 40.25±3.14 per sample) than in wild-type and SALK_089638c heterozygous lines (14.1±2.31 per sample). Mutations from heterozygotes were combined with wild-type samples for further comparisons because heterozygous lines exhibited the wild-type phenotypes (Fig.  S8A).
As expected due to mismatches arising from the tendency for oxidized guanine (8-oxoG) to mispair with adenine ( Fig. 4D) (Kino et al., 2017), C>A and T>G substitutions were proportionally higher in MSH6 knockout lines (Fig. 4C), which is consistent with mutational signatures observed in mismatch repair (MMR) deficient organisms (Sanders et al., 2021;Lujan et al., 2014). This increase in C>A and T>G SBS was also found for the de novo germline mutations (Fig. S6C). This result is consistent with the high affinity of MutSα for 8-oxoG:A mispairings (Mazurek et al., 2002). In light of evidence that MSH6 physically binds MutY (Gu et al., 2002;Hahm et al., 2022) and interacts synergistically with OGG1 (Ni et al., 1999;Pavlov et al., 2003), both proteins responsible for 8-oxoG:A base-excision repair (BER), we cannot exclude the possibility that MSH6 promotes the local activity of multiple (MMR and BER) repair pathways. MSH6 also influences somatic recombination in A. thaliana (Li et al., 2006;Gonzalez and Spampinato, 2020). Future studies should also consider the potential effects of somatic recombination on local mutaiton rate.
To test whether MSH6 contributes to the reduction in mutation rates in gene bodies and H3K4me1 marked regions, we explored changes in the proportion of mutations in MSH6 knockout and wild-type lines. The distribution of mutations in wild-type plants around genes was consistent with what has been previously reported (Monroe et al., 2022), while in MSH6 knockout mutant lines gene body mutation rates were increased, consistent with MSH6 targeting to gene bodies in wild-type plants (Fig. 4E). In gene bodies, mutation rates were more than 4.5X higher in MSH6 knockout lines than the wildtype and heterozygous samples, while the increase of mutations in intergenic regions and TEs was significantly less, 2.4X and 1.7X, respectively (Fig. 4F). We also estimated the mutation ratio in relation to H3K4me1 ChIP-seq enrichment, and found that MSH6 knockout lines contain~5X more mutations in genome regions most highly enriched for H3K4me1, with regions containing less H3K4me1 experiencing significantly lower increase mutation rates in the knockout lines (Fig. 4G).
Experimental results are consistent with the hypothesis that H3K4me1 marked genome regions experience targeted DNA repair in plants, with the Tudor domain of repair proteins providing a mechanistic basis for this recruitment. These results inspire future functional studies of this and other potential candidate genes influencing targeted DNA repair and emergent mutation biases.

Conclusions
We found evidence of mutation bias associated with H3K4me1-mediated DNA repair in O. sativa and examined potential mechanisms conserved in plants (Fig. 5). Our observations here are derived from reanalyses of data generated by independent research groups (Li et al., 2017;Niu et al., 2021;Xie et al., 2021) and prove consistent with previous reports of mutation biases in A. thaliana (Monroe et al., 2022). The mechanisms revealed are aligned with evolutionary models of evolved mutation bias, indicating targeting of mismatch repair and homology-directed repair pathways to regions of the genome functionally sensitive to mutation: coding regions and genes under stronger evolutionary constraints. Furthermore, we find genetic evidence that MSH6 regulates the reduced mutation rate in gene bodies and H3K4me1-marked regions in A. thaliana. Functional characterization of PDS5 proteins is still required to estimate their influence over lower mutation rates in H3K4me1-associated regions. These findings provide a plant-specific and higher-resolution mechanistic model of hypomutation in gene bodies and essential genes, motivating experimental investigations to further elucidate the extent, evolutionary origins, and consequences of chromatin-targeted DNA repair.

Mutation dataset in O. sativa
Germline de novo mutations in 1,504 fast neutron mutagenesis lines were downloaded from Kitbase at kitbase.ucdavis.edu. These were independently called and validated as previously described (Li et al., 2017). We focused specifically on single base substitutions (SBS), which were validated with a >99% accuracy by Li et al. (2017). We annotated each SBS in coding regions as being a synonymous or non-synonymous mutation based on the effect on the amino acid sequence. We compared non-synonymous and synonymous ratios with values from genomes of 3,010 natural accessions (Wang et al., 2018) and neutral expectations based on mutation spectra, coding region nucleotide composition, and codon table with the Null_ns_s function from the polymorphology package in R (https://github.com/greymonroe/polymorphology/blob/main/ R/Null_ns_s.R).

Epigenomic data collection
Epigenome features were accessed from the RiceENCODE database (glab.hzau.edu.cn/RiceENCODE/), which has been previously described . In brief, peaks were called from ChIP-seq data with MACS2 (Zhang et al., 2008) narrow-peak calling settings. We analyzed peak distributions for H3K4me1, H3K9me1, H3K4me3, H3K36me3, H3K9me2, H3K27me3, H3K27ac, H3K4ac, H3K12ac, H3K9ac, and RNA polymerase II (PII) measured in Nipponbare O. sativa plant seedlings, which constituted the most complete set of histone modifications available. We repeated analyses but with H3K4me1 measurements derived from panicles and leaves (rather than seedlings) and found essentially the same results.
For downstream analyses comparing changes in mutation in MSH6 knockout mutant lines as a function of H3K4me1, we accessed a collection of H3K4me1 ChIPseq datasets from A. thaliana maintained by the Plant Chromatin State Database (Liu et al., 2018). From these, we calculated the normalized mean across all datasets in 200 bp sliding (100 bp) windows across the entire genome.

Estimation of the relationship between mutation rates and O. sativa epigenomic features
We divided the genome into 100 bp windows surrounding genes (+-3000 bp of genes). This allowed us to, in later steps, restrict our analyses to only genes known to have accumulated loss-of-function mutations, and thus be less likely to be affected by selection. We also divided the genome into 100bp windows and repeated analyses, to confirm that results were generally the same. We calculated the number of single base-pair substitutions and peaks for each epigenomic feature overlapping within each window. We then estimated the relationships between epigenomic features and mutation rates with a binomial generalized linear model where the response was a binary state defined as whether a substitution occurred in that window, predicted by all features, with predictors defined as whether that window overlapped with an epigenome peak. We also repeated the analyses with a linear regression model where the response was the number of mutations in a window and found essentially the same results, so we show the binomial regression results. To test whether findings were driven simply by GC>AT mutations in transposable elements, we removed all GC>AT and repeated analyses. To further control for any residual selection in the mutation accumulation experiment, we also restricted our analyses to genes harboring loss-of-function mutations in the population and repeated the analyses. Finally, we restricted analyses to homozygous SBS and repeated the analyses. Mutation frequencies were plotted around genes in 100 bp windows. Since gene bodies are different lengths, the position of the window was converted into a percent of gene length. H3K4me1 peaks around gene bodies were plotted similarly. We also visualized mutation frequencies relative to H3K4me1 peaks in the same manner.

Analysis of ChIP-seq data of AtPDS5C
To study the distribution of PDS5C, we used ChIP-seq data as described by Niu et al. (2021). PDS5C enrichment was calculated as described by Niu et al. (2021) among regions as log2[(1 + n_ChIP)/N_ChIP] -log2[(1 + n_Input)/N_Input)], where n_ChIP and n_Input represent the total depth of mapped ChIP and Input fragments in a region, and N_ChIP and N_Input are the numbers total depths of mapped unique fragments. We calculated PDS5C enrichment in genic features (1000 bp upstream and downstream of genes, UTRs, introns, coding regions) and gene bodies (TSS to TTS) across the TAIR10 A. thaliana genome (arabidopsis.org).

Relationship between AtPDS5C and functional constraint
We analyzed the enrichment of the PDS5C ChIP-seq peaks in A. thaliana in genetic features and estimated the relationships between those regions and mutation rates, H4K4me1, Pn/Ps, and tissue expression depth. Tissue expression data are from (Mergner et al., 2020). H3K4me1 in Arabidopsis is from the Plant Chromatin State Database (Liu et al., 2018). Synonymous (Ps) and non-synonymous polymorphism (Pn) data are from the 1001 Genomes project (1001Genomes Consortium, 2016. Essential genes were based on findings from (Lloyd and Meinke, 2012). Germline mutation rates are from (Weng et al., 2019;Monroe et al., 2022).

Blastp and protein structure prediction and visualization
We used blastp on Phytozome (Goodstein et al., 2012) to search the O. sativa proteome for PDS5C and MHS6 orthologs and to search the A. thaliana proteome for genes containing Tudor domains similar to that of PDS5C, which was validated experimentally to bind H3K4me1 (Niu et al., 2021). We submitted the resulting list of 29 genes with putative Tudor domains to gene ontology analysis with ShinyGO (bioinformatics.sdstate.edu/go/)(Ge et al., 2019).
All structures were visualized, processed, and analyzed using UCSF ChimeraX (Goddard et al., 2018).  (Raveh et al., 2010) in refinement mode. We generated 10,000 docked models per case and analyzed the top 10% based on Rosetta's total score. Analysis of the outputs was conducted using PyRosetta home-made scripts (Chaudhury et al., 2010) to measure the average distance of lysine 4 nitrogen in histone 3 to the center of the aromatic rings in the residues forming the aromatic cage in Tudor domains.

Somatic mutations in MSH6 Knockout lines
To study the effect of MSH6 in H3K4me1 targeted repair we performed 250x coverage sequencing analyses of 17 A. thaliana; 9 WT/WT, 7 msh6/msh6 (6 SALK_037557 and 1 SALK_089638) and 2 msh6/WT (2 heterozygous SALK_089638). In brief, 3 week old plants were harvested, stems and roots were removed and rosettes were stored at -80 ºC. Whole rosettes were grinded using a mortar and liquid nitrogen and Qiagen DNeasy Plant Mini Kit (Cat No: 69106) following the manufacturer's instructions. Illumina library preparation and Whole-genome DNA sequencing with NovaSeq (150-bp read length at high genomic coverage~250x) was performed by the UC Davis Genome Center. Reads were trimmed with trimmomatic (Bolger et al., 2014) and duplicates were marked with the samtools markdup function (Li et al., 2009). Reads were mapped to the A. thaliana TAIR10 reference genome with bwa mem (Li and Durbin, 2009). Strelka2 variant caller in TUMOR-NORMAL mode (each sample was compared to all other 16 samples as NORMALs) was used to call SBS in the sequenced samples (Kim et al., 2018). Mutations found were filtered by keeping variants called in only one sample, given a PASS quality score in all 16 "NORMAL" comparisons. In addition to the built-in filters imposed by Strelka2 for variants to receive a PASS, which "include (1) the genotype probability computed by the core variant probability model, (2) root-mean-square mapping quality, (3) strand bias, (4) the fraction of reads consistent with locus haplotype model, and (5) the complexity of the reference context as measured by metrics such as homopolymer length and compressibility" (Kim et al., 2018), we further filtered to only include variants in which the distance to the nearest mutation was >1000 bp, total read depth >150 and alt read support > 5. To distinguish somatic mutations from de novo heterozygous variants, read-depths between alternative and reference alleles were compared with an X 2 test against the expected 50/50 ratio for heterozygous variants. Mutations were considered somatic if the X 2 test statistic p-value was less than 0.001 (to account for multiple testing) and the alt read depth was < 50% Homozygous de novo germline mutations were identified when X 2 test statistic p-value was less than 0.001 and the alt read depth was >50%. Complete code for variant calling, post-processing, and analyses are maintained on https://github.com/greymonroe/Quiroz_H3K4me1_mediated _repair.