Sequence conservation need not imply purifying selection: evidence from mammalian stop codon usage

The assumption that conservation of sequence implies the action of purifying selection is central to diverse methodologies to infer functional importance. In mammals, however, GC-biased gene conversion (gBGC), a meiotic mismatch repair bias strongly favouring GC over AT, can in principle mimic the action of selection. As mutation is GC→AT biased, to demonstrate that gBGC does indeed cause false signals requires confidence that an AT-rich residue is selectively optimal compared to its more GC-rich allele, while showing also that the GC-rich alternative is conserved. We propose that mammalian stop codon evolution provides a robust test case. Although in most taxa TAA is the optimal stop codon, TGA is both abundant and conserved in mammalian genomes. We show that this mammalian exceptionalism is well explained by gBGC mimicking purifying selection and that TAA is the selectively optimal codon. Supportive of gBGC, we observe (i) TGA usage trends are consistent at the focal stop and elsewhere (in UTR sequences), (ii) that higher TGA usage and higher TAA→TGA substitution rates are predicted by high recombination rate and (iii) across species the difference in TAA <-> TGA rates between GC rich and GC poor genes is largest in genomes that possess higher between-gene GC variation. TAA optimality is supported both by enrichment in highly expressed genes and trends associated with effective population size. High TGA usage and high TAA→TGA rates in mammals are thus consistent with gBGC’s predicted ability to “drive” deleterious mutations and supports the hypothesis that sequence conservation need not be indicative of purifying selection. A general trend for GC-rich trinucleotides to reside at frequencies far above their mutational equilibrium in high recombining domains supports generality of these results.

249 broader scale analysis. The disadvantage of the former is that local recombination rates are 250 not stationary over evolution time so current estimates need not reflect the past-history that 251 influences stop codon usage. One problem with the latter is low samples size. Indeed, genome 252 segments with consistently high recombination rates that could make for an ideal test are the 253 pseudoautosomal regions (PAR1 and PAR2). However, there are few pseudoautosomal 254 genes. As predicted by the gBGC model these regions have high GC content relative to the 255 chromosome average, reportedly 48% in PAR1 compared to 39% in the rest of the X 256 chromosome (80). In support of the gBGC model explaining high TGA usage, we also find 257 that TGA is used much more often in PAR1 genes (71.4%, using one candidate transcript per 258 gene annotated in this region) compared to the genome wide average (52.4%). Statistical 259 comparison of TGA usage between these two values is, however, underpowered due to there 260 being a low number of annotated genes which we may extract (n = 14).

262
A better "gross" scale analysis is to consider chromosome size as smaller chromosomes are 263 associated with higher recombination rate per bp (21). As predicted by the gBGC model in the 264 human genome, we find autosomal size (bp length) to be negatively associated with GC

268
To test whether local recombination rate is predictive of stop codon usage in humans we 269 employ logistic regression modelling considering all genes, using local recombination rate as 270 the independent variable. Here we consider the recombination rate which for humans is valid 271 as gBGC associated non-crossover and crossover events are highly correlated (18). We find 272 that high recombination rate is significantly predictive of higher TGA usage (coefficient = 273 0.017, p = 0.023) and lower TAA usage (coefficient = -0.046, p = 1 x 10 -6 ), these being the 274 directions predicted by the gBGC hypothesis. Indeed, we find the same trends in non-CDS 275 sequences when using linear models to predict trinucleotide frequencies as TAA, TGA, and 276 TAG may appear more than once (unlike at the canonical stop). High recombination rate 277 significantly predicts higher TGA trinucleotide frequency in the 5' UTR (coefficient = 0.0032, p 278 = 0.012), in the 3' UTR (coefficient = 0.0053, p < 2.2 x 10 -16 ), and in intronic sequence 279 (coefficient = 0.0054, p < 2.2 x 10 -16 ). It also significantly predicts lower TAA trinucleotide The above considers observed patterns of usage. We can also consider evidence from recent 292 substitution events. Here we consider flux, meaning the substitution rate from state A to state 293 B (e.g. TAA→TGA) per occurrence of state A in the ancestral sequence. To calculate flux 294 rates we consider species trios, assign an ancestral state to the internal node by maximum 295 likelihood and calculate rates of change from this ancestral state to a derived state per 296 incidence of the ancestral state. This is comparable to a prior method (49), excepting for our 297 use of likelihood instead of parsimony.

299
The gBGC hypotheses predicts that TAA→TGA flux in the mammalian lineage should be 300 highest in GC-rich isochores. More generally, it predicts that in species with gBGC strong and 301 regionalised enough to cause high variation between genes in GC content, that the 302 TAA→TGA flux should be especially accentuated in GC-rich domains. By contrast, species 303 less influenced by gBGC should not show similar accentuation of TAA→TGA flux. We thus 304 test whether the intragenomic difference in TAA<→TGA flux between the highest and lowest 305 by GC is greater when the difference between the mean GC of the two partitions (high GC, observed pTGA deviation scores to null simulations that calculate pTGA for two null groups of 321 genes according to the net genomic TAA→TGA and TGA→TAA rates (see methods).

323
Consistent with the hypothesis that gBGC drives high TGA usage in GC rich isochores, pTGA 324 is higher in GC-rich genes than GC-poor genes across the four mammalian lineages. The 325 difference between the gene groups is greater than expected by chance in all four cases 326 (primates: p = 0.014, dog: p = 0.040, cow: p < 0.0001, mouse: p < 0.0001). Of the non-        (11,82). However, small chromosomes and associated high recombination 369 rates probably mean that most genes in birds are subject to considerable gBGC, it being 370 notable that the predicted pTGA is high for both gene groups (Fig 2e). Non-isochore-

380
To assess this, using data from the HapMap2 project, we first define highly recombining genes 381 (HRGs) as the top 50% of genes by recombination rate and lowly recombining genes (LRGs) 382 as the bottom 50%. Adapting our stop codon flux methodology, we then calculated the flux 383 rates for TAA→TGA and TGA→TAA for HRGs and LRGs and used these rates to calculate observed pTGA deviation to those observed in null simulations that assume uniform genomic 386 TAA→TGA and TGA→TAA rates. Consistent with the hypothesis that gBGC drives high TGA 387 usage in highly recombining regions, pTGA is higher in HRGs than LRGs, (p=0.049). The 388 pTGA deviation score between HRGs and LRGs is 0.172, slightly less than observed between    The evidence from non-termination sites supports the hypothesis that, whatever causes 401 unusual TGA usage trends in most mammals, it cannot be explained by selection on the focal 402 termination codon alone. Also, as predicted by the gBGC model, the TAA→TGA flux is 403 stronger in domains of high GC/high recombination. Nonetheless, to have a case that gBGC 404 acts against the direction of selection we need also to be able to confident that selection does 405 not prefer TGA. Outside of the focal termination codon this is hard to assay but at the focal 406 stop codon we can gather further evidence.

408
First, selection on any genic feature is classically assumed to predict that usage of that feature 409 will be most common in highly expressed genes (62, 84, 85) as selection is strongest in highly 410 expressed genes. Over-usage of "optimal" codons in highly expressed genes is a case in point suggests that gBGC is more influential when Ne is high (but see also (93)). However, we know 426 the direction of gBGC, and it must act against TAA. Thus, across eukaryotes our expectation 427 is that if TAA is optimal (and gBGC relatively less important), its usage will increase with Ne.

428
However, if gBGC is unexpectedly important outside of mammals or if TGA is optimal then 429 TGA will increase with Ne. We previously observed this not to be the case, with TAA increasing 430 with Ne (94). However, the possibility remains that for lowly expressed genes TGA might be 431 optimal and causing the focal termination codon trends (despite similar behaviour in 3' UTR).

432
We test this extension. abundance (as a proxy for gene expression, for which we employ the natural log to promote 441 a normal distribution) an independent predictor. We control for GC content by fitting 442 multivariate models that include GC3 content (Table 1). Collinearity between GC content and 443 protein abundance need not be a concern as the computed variance inflation factors are very 444 low (less than 1.1 for all models).

500
We consider the relative rates of human germline de novo mutations derived from family trio these bins and their associated mutational spectra and nucleotide contents, we recalculate 520 GC* and TGA* (Fig 5, orange points). We find our GC* and TGA* predictions for each bin to 521 be consistent between isochores of different GC content, indicating that mutation bias is not  We also apply a model in which we generate null sequences from the equilibrium mutational matrix in a Markov process, hence allowing for within UTR mutational 542 events. We consider the relative frequencies of the three stop codons in such sequence and 543 how they vary by local GC. Consistent with the mononucleotide results, we find dinucleotide-544 derived GC* and TGA* to be lower than observed in the genome (40.9% and 52.4% 545 respectively) and, importantly, flat across GC contents (Fig 5, purple points). While TGA* 546 derived from the dinucleotide matrix exceeds TGA* derived from the mononucleotide matrix 547 this is probably as a consequence of permitting CpG hypermutation generating potentially 548 premature stop codons. We conclude that the absence of evidence for increasing GC* with 549 GC content strongly argues against mutation bias as an explanation for higher TAA→TGA flux

560
We find that observed trinucleotide frequencies from GC poor sequences (the bottom 20% of 561 genes by GC content) are accurately predicted by a GC poor mutational matrix (derived from 562 the bottom 20% of de novo mutations by surrounding 10kb GC content) for all sequence that 563 isn't CDS (r 2 > 0.9; Figure 6). This strongly supports the hypothesis that mutation bias alone 564 may explain trinucleotide trends in GC poor domains outside of the coding context. In addition, 565 while one can always consider more complex k-mer dependent mutational models, our 566 extension from dinucleotides rates appears to be robust. Importantly, in such GC poor   The previous analysis suggests that in low GC domains k-mer trends are well predicted by 589 mutation bias alone ( Figure 6). By contrast, in GC rich domains, there exists a substitutional bias in high GC domains illustrative of a broader pattern? Were gBGC mimicking purifying selection we expect that GC rich trinucleotides should be most deviant from their mutational 593 null in GC-rich domains. We hence extend the above analysis to consider the extent to which selectively deleterious (as with TGA). Moreover, even when optimal codons are known to be 597 GC ending selection at exon ends can commonly be in the opposite direction to enable 598 accurate splicing (103), adding complexity.

600
Using mutational profiles from the relevant isochore, we calculate trinucleotide frequencies 601 that represent our mutational null and compare these to observed trinucleotide frequencies in 602 the genome. To test the hypothesis that a fixation "boost" in GC-rich isochores acts differently 603 on GC-rich trinucleotides, we calculate a fixation boost metric. Specifically, we first calculate 604 a (Observed-Expected)/Expected score for the top 20% of sequences by GC content, where 605 expected is the mutational equilibrium frequency derived from the top 20% of de novo 606 mutations assaying their surrounding 10kb GC content. This metric we term deviation 1, or D1 607 for short. We then repeat this for the bottom 20% of sequences by GC content using their 608 equivalent set of de novo mutations, receiving D2. Given the above results (Fig 6), we expect 609 the bottom 20% to be closest to mutation equilibrium, hence having a low D2 score. By  functionality. We also recall that while high recombination rate favours flux to the GC-rich state CAA→CAG is suffering a similar fate to TAA→TAG this too would be supportive of a general nucleotide context-dependent trend in fixation bias affecting TAG rather than selection for termination efficiency.

698
The assumption that sequence conservation implies purifying selection and hence optimality 699 of the preserved sequence underpins many enterprises, from medical diagnostics to

713
Across species a greater flux of TAA→TGA in the GC richer genes is associated with a greater 714 intragenomic variance in GC content, consistent with the above trends being predicted, 715 broadly speaking, by the extent to which a species is isochoric.

717
Is the TAA/TGA enigma a special case or indicative of a more general trend? We observe that 718 deviation of all trinucleotides from mutational equilibrium in GC-rich domains is strongly 719 predicted by their GC content. The TAA→TGA trend in high GC domains can be considered 720 a special example. More generally then, we have strong reason to suspect the gBGC mediated 721 fixation bias will cause false signals of purifying selection at GC-rich residues in GC-rich 722 isochores that extend beyond the specific context of TAA→TGA flux. This example is however 723 unusual in that we have confidence that the substitutional process at the focal termination 724 codon context forces conservation of a non-optimal codon, a trend that can be partly overcome 725 by stronger selection for optimality in highly expressed genes.

727
There is, however, another possibility to explain deviation from mutational equilibrium in more effective in such domains, causing a fixation bias. One can imagine many possible modes of such selection, for example on DNA structure (105-107) or on nucleosome First, in the current context TGA is not selectively favourable at the focal termination codon 737 but nonetheless conserved. This suggests we must evoke a force other than selection to

808
Given the ability of our complex mutation bias model to predict trinucleotide usage in low GC 809 domains (Fig 6), we assume that our mutation bias estimation in GC rich domains is also 810 largely accurate. If so, complex mutation bias is unable to explain the repeatable boost scores 811 (Fig 8). In principle there could be several remaining classes of explanation. First, selection 812 might act differently on underlying di or trimers. For example, regarding TAG and TGA, 813 selection on TA or AG residues may be different to that on TG or GA ones. We can find no 814 convincing evidence for this that can explain the universality of TAG avoidance (see S1 Text).

815
One also needs to evoke selection that is strong enough throughout the human genome, which 816 appears unlikely for reasons given above.

818
A further possibility is an interaction between complex mutation bias and gBGC making certain 819 trinucleotides more liable to conversion owing to their relative commonality in populations.

820
With a difference in mutational equilibria, the incidence of TAA/TAG meiotic heteroduplex 821 mismatches (or sense/antisense ones to be more precise) is highly likely to be lower than that 822 of TAA/TGA mismatches. Thus, gBGC may more commonly act on TAA/TGA. Overall, 823 however, we see no correlation between our gBGC boost score and mutational equilibrium in evidence for a role of local DNA flexibility has been found in yeast (127,128). The biological response elicited by CTG and CGG repeats in human trinucleotide repeat disorders may be 842 mediated by their increased flexibility indicative of a relationship between local flexibility and to local sequence context (130) suggests that an association between DNA mismatch repair 845 and DNA flexibility may have relevance to understanding fixation biases in GC-rich domains.

846
If flexibility is the core factor, then we might expect that a trinucleotide and its antisense should 847 have similar boost scores as both feature in the same three base pairs of DNA (one on the 848 Crick strand, the other on Watson). In our data, however, we find that the difference in gBGC 849 "boost" between sense and antisense trinucleotides is no smaller than randomised The predicted TGA usage for a given lineage, pTGA, was calculated by adapting the formulae 893 outlined by Long, Sung (44). In their study, given a spectrum of de novo mutations, they 894 propose the equilibrium GC content, Pn, can be calculated from the GC→AT mutation rate 895 divided by the reciprocal rate, m, such that:  In their paper, they suggest the substitution control for TAA>TGA and TAA>TAG is A>G, for 987 TGA>TAA and TAG>TAA is G>A, and for TGA>TAG and TAG>TGA is 2 x A>G x G>A.

989
The full spectrum of 108,778 de novo mutations may also be analysed using a 16x16 al. Fine-scale recombination rate differences between sexes, populations and individuals.