The Correlation of Substitution Effects Across Populations and Generations in the Presence of Non-Additive Functional Gene Action

Allele substitution effects at quantitative trait loci (QTL) are part of the basis of quantitative genetics theory and applications such as association analysis and genomic prediction. In the presence of non-additive functional gene action, substitution effects are not constant across populations. We develop an original approach to model the difference in substitution effects across populations as a first order Taylor series expansion from a “focal” population. This expansion involves the difference in allele frequencies and second-order statistical effects (additive by additive and dominance). The change in allele frequencies is a function of relationships (or genetic distances) across populations. As a result, it is possible to estimate the correlation of substitution effects across two populations using three elements: magnitudes of additive, dominance and additive by additive variances; relationships (Nei’s minimum distances or Fst indexes); and assumed heterozygosities. Similarly, the theory applies as well to distinct generations in a population, in which case the distance across generations is a function of increase of inbreeding. Simulation results confirmed our derivations. Slight biases were observed, depending on the non-additive mechanism and the reference allele. Our derivations are useful to understand and forecast the possibility of prediction across populations and the similarity of GWAS effects.


ABSTRACT 26
Allele substitution effects at quantitative trait loci (QTL) are part of the basis of quantitative 27 genetics theory and applications such as association analysis and genomic prediction. In the 28 presence of non-additive functional gene action, substitution effects are not constant across 29 populations. We develop an original approach to model the difference in substitution effects 30 across populations as a first order Taylor series expansion from a "focal" population. This 31 expansion involves the difference in allele frequencies and second-order statistical effects 32 (additive by additive and dominance). The change in allele frequencies is a function of 33 relationships (or genetic distances) across populations. As a result, it is possible to estimate 34 the correlation of substitution effects across two populations using three elements: One of the aims of quantitative genetics is to provide methods for prediction, for instance 45 genomic prediction (prediction of livestock breeding values or of crop performance) or 46 polygenic risk score (risk of a disease in humans). These predictions would ideally work across 47 a range of populations (different breeds, future generations). Ideally, the prediction goes 48 through a process of identifying causal genes, estimating their effects in some population, and 49 transposing these effects to newly genotyped individuals (Lande and Thompson 1990; 50 Meuwissen et al. 2001). These "gene effects" are substitution effects -the regression of the 51 own phenotype (for polygenic risk scores) or expected progeny phenotypes (for estimated 52 breeding values) on gene content at the locus. Being able to use substitution effects at causal 53 genes across populations and generations is a goal of genomic prediction, QTL detection and 54 also of causal mutation finding (Grisart et al. 2002). 55 56 There are several obstacles for these aims. Finding and validating causal genes and 57 understanding their functional mechanism is extremely difficult (Grobet et al. 1997;Bonifati 58 et al. 2003;Rupp et al. 2015). In practice, predictions are done using SNP markers using 59 statistical genetics techniques. In livestock, use of markers results in very good predictions 60 within populations, but mediocre (at best) predictions across populations, even with very 61 Taylor expansions around one of them, the "focal" population. Using Kojima's method (Kojima 139 1959(Kojima 139 , 1961, we put additive substitution effects as a function of differences in allele 140 frequencies across populations, "focal" additive substitution effects, and second order 141 (dominance, additive by additive epistasis) statistical effects. From here, we show that the 142 correlation of substitution effects across populations is (approximately) a function of their 143 differentiation (or genetic distance), the additive, dominant and additive by additive genetic 144 variances, the average heterozygosities and average squared heterozygosities. All these 145 parameters can be estimated in real populations. In the following, we try to stick to  Tanila and Hill (2014) notation. Many details are given as Appendix in the Supplementary 147 Material. Main notation is presented in Table 1.  148   149   TABLE 1 HERE  150   151 Note that our procedure is general -it does not invoke any particular mechanism for epistasis 152 or dominance, nor knowledge of individual QTL effects and locations. We assume that the 153 population mean is a (possibly complex) function of QTL allele frequencies and QTL 154 functional or biological effects. The latter, albeit unknown, are assumed to be constant across 155 populations -we therefore do not consider genotype by environment interactions. 156 of dominant and epistatic biological interactions, the value of depends on the allele 162 frequencies and biological effects; for instance, ! = ! + ( ! − ! ) ! + ( " − " )[ ] !" for 163 2-loci epistasis (Fuerst et al. 1997), for ! = 1 − ! and " = 1 − " frequencies at loci 1 and 164 2, respectively. Inspired by this, we want to write the correlation / # $ , # $ ! 1 for a locus as a 165 function of vectors of respective allele frequencies, $ and $ ! , but in a general manner, 166 without defining a particular functional or biological gene action. We use a first order Taylor 167 series expansion to approximate additive substitution effects in a population % , as a function 168 of effects in another "focal" population and their distance. By doing this, it can be shown 169 that the difference of substitution effects between two populations is (approximately) a 170 function of the genetic distance of the two populations, and the magnitude of dominance and 171 second order epistatic variances in the focal population . 172 173 174 To derive additive substitution effects as function of allelic frequencies , we use Kojima's 175 definition of statistical effects as first, second… derivatives of the mean of the population ( ) 176 as a function of . We don't invoke any explicit function -we just presume that there is one, 177 in other words, change in allele frequencies of the population implies change in the total 178 average genotypic value. Using Kojima's method, the additive substitution effect at the i-th 179 locus is the first derivative (Kojima 1959(Kojima , 1961: 180 Higher order statistical effects implying locus i (i.e. dominance deviations and epistatic 183 interactions) can be represented by higher order partial derivatives of or equivalently as 184 derivatives of # . The dominance deviation at the i-th locus (that we denote as * to 185 distinguish from the biological or functional effect (Falconer and Mackay 1996)) is: 186 The negative sign comes because the dominance deviation is usually understood as a feature 189 of heterozygosity, in other words, it is of opposite sign than the increase of homozygosity in 190 # " . Last, the epistatic pairwise deviation of locus with is 191 This is positive because it is the effect of increasing both # and ' . Note that interaction of 194 order implies k-ith order derivative with scaling factor 1/2 ( . 195 196 Kojima's method shows, therefore, that higher order effects of one locus are derivatives of 197 lower order effects. With these elements we can make a Taylor order expansion of # around 198 frequencies in the "focal" population, $ , so that $ ! = $ + , such that from values of in 199 and changes in allele frequencies = $ ! − $ we create a function # $ ! ≈ / # $ , 1. In the 200 Appendix (section 1.1), we show that the Taylor linear approximation of the substitution 201 where we use differences in allele frequencies , # * $ is the statistical dominance deviation at 204 the locus and ( ) # $ is a vector containing epistatic substitution effects of locus with the 205 rest of loci. By convention we assign ( ) ## $ = 0. 206 207 From equation 1, the covariance across two populations and % of the two substitution 208 of the locus is: 209 The equality holds because terms given that the different statistical effects ( # $ , # * $ and ( ) # $ ) are mutually orthogonal by 212 construction. 213 )*+,-" # .)*+/-" # ! 0 , and now we need the variance of # $ ! as a function of 214 effects in population , this is (see the Appendix, section 1.2 for details): 215 This expression is unsymmetric and seems to imply that by construction / # $ ! 1 > 217 / # $ 1; the reason for this is that for the "focal" population the variance (or at least its 218 estimators) is better known than for the "approximated" population ′. An alternative 219 derivation in the Appendix (section 1.4) shows that, if the "focal" population is a third one (f) 220 a set of expressions analogous to (2) and (3) is: A locus that may diverge a lot (for instance because it is highly polymorphic) has high ( ); 233 two loci in strong linkage will show non-zero ( ! , " ). The second matrix contains 234 (co)variances of the epistatic effects across all pairs marker i -other markers: 235 For instance, /( ) ! $ 1 contains the variance of the epistatic effect of marker i with marker 237 1, and so on. We assume , off-diagonal elements of ( % ) disappear from the result (even 241 if they are not null). Thus, assuming that all diagonal elements of ( % ) have the same 242 . Note that 243 assuming that all diagonal elements of ( % ) are equal is an approximation -for instance 244 more polymorphic loci vary more. 245

246
The expression 2), and it shows that both the variability of dominance deviations / # * $ 1 = 248 7,$ " and its mean / # * $ 1 = 7,$ which, if not zero, can be understood as the basis of 249 inbreeding depression, enter into the expression. Also, we assume (change in allele 250 frequencies, but not the allele frequencies themselves) and the different non-additive effects 251 # * $ and ( ) # $ to be independent (uncorrelated). This makes sense as loci may be selected 252 for additive effects but not for non-additive effects. 253 254 Our next goal is to relate these results in (3) to quantities that are measured empirically. In 255 particular, we need the variances of the different genetic effects and the variance of changes 256 in allele frequencies. We address these two terms in turn.
(the latter is shown in the Appendix, section 1.5) and 263 for the number of QTL loci and using functions of heterozygosities (more details in the 265 Appendix, section 1.6): 266 Here we have assumed independence of QTL allele frequencies and QTL effects. All variances 271 and effects, as well as heterozygosities $ , refer to the focal population with allele 272 frequencies $ and effects $ . Note that we assume HWE and LE within both and '. Of 273 course, we don't know allele frequencies or even numbers of true causal genes, but the first 274 can be prudently guessed and the latter cancel out in the following. 275 276 Consider the scalar 6 " , the variance of differences of allele frequency. In fact, 6 " = 277 H which corresponds to 279 Nei's "minimum genetic distance" (Nei 1987; Caballero and Toro 2002), and, accordingly, will 280 be called $,$ ! = 6 " . The value of $,$ ! can be estimated from marker data as ^$ ,$ ! = 281 , although it is sensitive to the spectra of polymorphisms (e.g. SNP chips vs 282 sequencing). In addition, $,$ ! is also the numerator of the ;< fixation index, e.g. Hudson's(1992) ; Combining expressions above, equation (3) becomes 290 From here the correlation of across populations is (factorizing out ! 4 ) 292 Expression [5] can also be obtained if we start the Taylor expansion from a third focal 294 population that is neither b nor b', as shown in the Appendix, section 1.4. Factorizing out $ ZZZZ 295 we arrive to the slightly clearer expression 296 which shows well that the correlation is a function of distance across populations / $,$ ! 1 and 297 weights of additive vs. non-additive variances. Note that $,$ ! is always positive, which implies 298 The quantities involved in equation (6) are (1) Nei's "minimum genetic distance" $,$ ! , which 301 describes the similarity of populations and ', (2) the variance of statistical additive, 302 dominant and additive by additive effects at the individual level 8 " , 9 " , 88 " in population 303 (3) first and second moments of heterozygosities [ $ , $ " ZZZZ . All these values can be, in principle, 304 estimated from data or "prudently guessed". For the particular case of heterozygosities, SNP 305 markers are a poor choice and it is likely better to use a guess based on sequence or 306 evolutionary processes (coalescence). 307 From the definition of ;< and assuming Appendix, section 1.7), leading to 310 The advantage of the ;< is that it is more robust to the spectra of allele frequencies used to 311 estimate it (Bhatia et al. 2013). If we further assume that dominance variance 9 " is negligible, 312 we can write a neat expression of the correlation in terms of ;< : 313 The second approximation involving Thus, the algorithm to estimate a priori the correlation of across populations and ', 320

Consideration of directionality of substitution effects In Plant and Animal Breeding the origin 328
of the allele is often overlooked, as a mutation may be evolutionary harmful but of interest 329 for farming, and also because many traits selected for do not have a close relationship to 330 fitness in the wild. However, in Evolutionary Genetics it is reasonable to think that most 331 mutations are deleterious, thus with a negative effect of the mutant allele. In Medical Genetics 332 reports of estimated substitution effects are also often done in terms of "susceptible" alleles. 333 In both cases ( ) ≠ 0 or even > 0 for all loci. In both cases is "oriented" and has no 334 zero mean. In the theory above we have considered that is the effect of a randomly drawn 335 allele, which leads to ( ) = 0 and enormously simplifies the algebra. In order to consider 336 oriented , we propose to transform the estimate of Assuming, like in (8), small values of dominance variance and of ;< , we obtain 362 From the true substitution effects $ and $ ! derived above, we obtained the true value 417 / $ , $ ! 1. Note that because of the coalescent simulation, all reference alleles (i.e. with 418 frequency ) are "mutant" ones, and due to assumed dominance and epistatic actions, are 419 negative by construction (i.e. the mutant allele is deleterious). To conciliate this fact with our 420 derivations, that assume that refers to a random allele and has null means, we did two 421 things: (1) compute a "random allele" version of $ , $ ! in which was changed sign for "odd" 422 loci, and (2) estimate / $ , $ ! 1 using the transformation of normal distribution into folded 423 normal (Kan and Robotti 2017), i.e. the r2rabs() function mentioned above. Thus, we have 424 two estimands, the correlation for the "mutant allele" effect RSH*4H / $ , $ ! 1 (which 425 corresponds e.g. to typical use in Evolutionary and Medical Genetics) and the correlation for 426 the "random allele" effect +*47TR / $ , $ ! 1 (which corresponds, e.g. to genomic selection and 427 some GWAS). We observed that RSH*4H / $ , $ ! 1 < +*47TR / $ , $ ! 1. For instance, in one of 428 the scenarios RSH*4H / $ , $ ! 1 = 0.67 and +*47TR / $ , $ ! 1 = 0.85. 429 430 Now we describe the estimators. We considered either the use of "all" polymorphisms, or of 431 a "SNP" selection in which MAF>0.01 across both populations simultaneously (roughly 40,000 432 polymorphisms), as this corresponds to typical SNP panels in genetic improvement. Then we 433 used the equation (6) either for "all" or for "SNP", using their frequencies to obtain $,$ ! , [ $ 434 and $ " ZZZZ . In "all", the spectra of allele frequencies of polymorphisms and of QTL is the same, 435 but not in "SNP". As "SNP" tends to be biased, we also considered the equation (7) using ;< 436 estimated from SNPs and [ $ = 0.107, and $ " ZZZZ = 0.018 from the "Hill" U-shaped distribution 437 mentioned above. Note that ;< is more robust to the spectra of allele frequencies used to 438 estimate it. Thus, we obtained three estimators for "random allele" +*47TR / $ , $ ! 1: * UU , ̂; VW and ̂; VWFXH , and also the three corresponding estimators for "mutant allele" 440 RSH*4H / $ , $ ! 1 applying function r2rabs() to * UU , ; VW and ; VWFXH . 441 442 Results of the simulations. Simulated variance components (expressed as ratios from total 443 genetic variance " ) with marker genotypes and with QTL genotypes are shown in Table 2 In Figure 1 (for defined for random alleles) it can be observed that our expressions (6-7) tend 455 to slightly over-estimate +*47TR / $ , $ ! 1 for Complete Dominance, and Complementary 456 Epistasis. The estimates are quite good for Multiplicative Epistasis. Estimators using SNP 457 markers are slightly less accurate than using the complete polymorphism spectra, and using 458 ;< from SNP markers, coupled with a guess of heterozygosities, is similar to using SNP alone 459 to infer both distances and heterozygosities. 460 Figure 2 (for defined for mutant alleles) presents values of RSH*4H / $ , $ ! 1 and estimates +*47TR / $ , $ ! 1 : depending on the scenario there is more over-or under-estimation that 463 for +*47TR / $ , $ ! 1. In this case the effect of using all polymorphisms or SNP is more marked. 464 However, overall, we find that our estimators explain well the decay in / $ , $ ! 1. The 465 imperfect disagreement (both in Figures 1 and 2) is probably due to several wrong 466 assumptions; we comment some of them in the Discussion. In this work, we present a general, formal framework that does not depend on specific 546 hypothesis regarding gene action or order of the epistatic interactions. In our derivation, we 547 approached high-order functional epistasis by Taylor expansions, leading to expressions that 548 involve only low-order statistical additive epistasis (actually pairwise). Including one extra term 549 in our Taylor expansion would include three-loci statistical epistasis and so on, but extra terms 550 would lead to more difficult expressions (covariance of allele frequencies across loci and 551 populations), and it is expected that the magnitude of the statistical effects is smaller and 552 smaller with higher orders of interactions (Mäki-Tanila and Hill 2014). We find reasonable 553 agreement with our simulation and also with values from literature. However, our simulations 554 may not be particularly realistic, something that would require considerable thinking on how 555 to simulate biologically meaningful epistasis mechanisms for a variety of traits. We see them 556 as building blocks of non-additive architecture. At any rate, the three scenarios generated 557 sizeable non-additive variation which is a challenging case for our expressions. If the allele frequencies are modelled using symmetric ( , ) distributions (see Appendix, 572 section 1.11), these become The first is bounded 573 between 3 (for U-shaped distributions) and 2 (for peaked distributions), so that dominance 574 variation does not play a big role in the difference between substitution effects across 575 populations unless dominance variation is much larger that additive variation, something that 576 seems unlikely based on theory and estimates. However, the second weight, due to epistasis, 577 is not bounded and is large for small values of (e.g. 27 for Beta(0.04,0.04), and in this case 578 functional epistasis plays a strong role in additive variation for U-shaped distributions (Hill et 579 al. 2008). The spectra of allele frequencies of causal mutations is subject to large debate but 580 there seems to be a consensus that low frequency mutations make non-negligible 581 In the simulations we confirmed that our estimates are reasonably good although not perfect. 628 They are, depending on the scenario and target (random allele or mutant allele), almost 629 unbiased, slightly biased upwards, or biased downwards. There are several reasons for the 630 disagreement. The most obvious one is the inherent approximation of the Taylor series 631 expansion. Secondly, splitting variances such as / # # * $ 1 or / % ( ) # $ 1 into basic 632 components implies either a strong assumption of multivariate normality and independence 633 or a less strong one of "expectation and variance-independence" (Bohrnstedt and Goldberger 634 1969). For instance, it is assumed that the variance of # * $ is not related to the magnitude of 635 the difference of allele frequencies # , but this is not necessarily true. Third, it is further 636 assumed, in the factorization of genetic variances, that QTL effects and allele frequencies are 637 independent. We have also ignored that the change is proportional to heterozygosity and 638 therefore small at extremes values of allele frequencies. 639 Another factor that we did not consider is the empirical evidence of an inverse relationship 640 between heterozygosity and absolute effect at the locus (Park et al. 2011). It is unclear how 641 this would affect our findings. A (rather extreme) functional gene action that generates larger 642 at extreme frequencies is overdominance, which is similar to our "dominance" scenario. 643 644 As for our estimators of the correlation across "negative" alleles RSH*4H / $ , $ ! 1, they are 645 less robust that the estimators for a "random" allele +*47TR / $ , $ ! 1. The reason for this is 646 that obtaining RSH*4H / $ , $ ! 1 from +*47TR / $ , $ ! 1 involves a further approximation, the 647 normality of . 648

CONCLUSIONS 650
We presented a coherent, approximate theory, that does not invoke any particular 651 mechanism of gene action, to explain and appraise the change in magnitude of (additive) QTL 652 substitution effects across populations and generations. The theory gives good approximate 653 estimates of this correlation, that needs to be otherwise explicitly estimated. More 654 importantly, the theory shows that the main sources for the change of effects are 655 relationships across populations, magnitudes of additive and first-order non-additive 656 variances (dominance and additive by additive), and spectra of allele frequencies.