A fast linkage disequilibrium-based statistical test for Genome-Wide Epistatic Selection Scans in structured populations

The quest for genome-wide signatures of selection in populations using SNP data has proven efficient to uncover genes involved in conserved or adaptive molecular functions, but none of the statistical methods were designed to identify interacting genes as targets of selective processes. Here, we propose a straightforward statistical test aimed at detecting epistatic selection, based on a linkage disequilibrium (LD) measure accounting for population structure and heterogeneous relatedness between individuals. SNP-based (Trv) and window-based (TcorPC1v) statistics fit a Student distribution, allowing to easily and quickly test the significance of correlation coefficients in the frame of Genome-Wide Epistatic Selection Scans (GWESS) using candidate genes as baits. As a proof of concept, use of SNP data from the Medicago truncatula symbiotic legume plant uncovered a previously unknown gene coadaptation between the MtSUNN (Super Numeric Nodule) receptor and the MtCLE02 (CLAVATA3-Like) signalling peptide, and experimental evidence accordingly supported a MtSUNN-dependent negative role of MtCLE02 in symbiotic root nodulation. Using human HGDP-CEPH SNP data, our new statistical test uncovered strong LD between SLC24A5 and EDAR worldwide, which persists after correction for population structure and relatedness in Central South Asian populations. This result suggests adaptive genetic interaction or coselection between skin pigmentation and the ectodysplasin pathway involved in the development of ectodermal organs (hairs, teeth, sweat glands), in some human populations. Applying this approach to genome-wide SNP data will foster the identification of evolutionary coadapted gene networks. Author summary Population genomic methods have allowed to identify many genes associated with adaptive processes in populations with complex histories. However, they are not designed to identify gene coadaptation between genes through epistatic selection, in structured populations. To tackle this problem, we developed a straightforward LD-based statistical test accounting for population structure and heterogeneous relatedness between individuals, using SNP-based (Trv) or windows-based (TcorPC1v) statistics. This allows easily and quickly testing for significance of correlation coefficients between polymorphic loci in the frame of Genome Wide Epistatic Selection Scans (GWESS). Following detection of gene coadaptation using SNP data from human and the model plant Medicago truncatula, we report experimental evidence of genetic interaction between two receptors involved in the regulation of root nodule symbiosis in Medicago truncatula. This test opens new avenues for exploring the evolution of genes as interacting units and thus paves the way to infer new networks based on evolutionary coadaptation between genes.


Abstract 22
The quest for genome-wide signatures of selection in populations using SNP data has 23 proven efficient to uncover genes involved in conserved or adaptive molecular functions, but 24 none of the statistical methods were designed to identify interacting genes as targets of 25 selective processes. Here, we propose a straightforward statistical test aimed at detecting 26 epistatic selection, based on a linkage disequilibrium (LD) measure accounting for population 27 structure and heterogeneous relatedness between individuals. SNP-based ( ) and window- structures or particular demographic scenarios (2,4,5). Thanks to high-throughput 71 sequencing technologies, these methods can now be used to perform Genome-Wide Selection 72 Scans (GWSS) using massive Single Nucleotide Polymorphism (SNP) datasets (6-8).

73
Although GWSS have identified cohorts of genes associated with past or ongoing adaptive 74 processes, they are not designed to identify gene coadaptation, that is the adaptive interactions 75 between genes through epistatic selection, because the genetic background on which adaptive 76 mutations occur is not considered in statistical frameworks (9). selection, the D'IS 2 measure of LD -originally described by Ohta (14,15) -was proposed to 99 detect epistatic selection (16). However, this statistic showed strong limitations to distinguish 100 between epistatic selection and single-locus selection at two independent loci. In addition, 101 neutral forces such as population structure, genetic drift and relatedness among individuals,

185
On the other hand, dominance, codominance or recessivity in panmictic populations strongly 186 impact co-fixation dynamics due to more complex fitness patterns in the presence of 187 heterozygotes (S1 Table). However, values of the fixation rates must be interpreted in light of  structure with or without non-random mating generates LD between two independent loci, as 206 measured using r or corPC1, which values can reach 0.25 to 0.5 at the final generation (Fig 1).

207
This background LD is lowered to zero or close to zero on average, when correcting these 208 statistics by the V matrix, as measured using rv or corPC1v. This result makes the rv and corPC1v 209 of necessary use in order to perform statistical tests of the neutral hypothesis of a null 210 correlation between two independent loci, accounting for "noisy" neutral processes (see next 211 result section).

212
Second, we observe that selection models tend to generate more LD than NEUT 213 model, though it depended on the statistics (r/rv or corPC1/corPC1v), the mating system and the 214 selection model in an entangled way. For instance, the COAD, COMP and ADD selection 215 models all tend to generate more LD than the NEUT model in self-mating species, with 216 COAD and COMP generating more LD than the ADD model (Fig 1) focusing on the SNPs targeted by selection.

231
Finally, it can be seen that SNP-based LD measures (r/rv) are more efficient than 232 haplotype-based LD measures (corPC1/corPC1v) to detect epistatic selection (Fig 1). However, 233 they cannot capture any signal once allele fixation at one SNP or co-fixation at the two SNPs 234 has occurred. On the other hand, corPC1/corPC1v rely on SNP polymorphism in the genomic 235 region surrounding SNPs under selection, so that they can benefit from the hitch-hiking effect 236 even after allele fixation at the SNPs targeted by selection.  statistics has a value greater than the defined rejection quantile of the τ ( −2) distribution, for 252 different type I errors: 10%, 5% and 1% (the associated rejection quantile are given in 253 bracket). In our simulations, the sample size n was equal to 500. Since the sign of the  and also at the last generation 300 for window-based measures 1 and 1 .

287
At both time-point, a general trend is that the detection power with r/rv and 288 corPC1/corPC1v is higher for the COMP model than for the COAD or the ADD models (i.e. 25-289 50%, 10-65%, and 10-30%, respectively, for α=5% with rv or corPC1v statistics), especially 290 when considering random mating (Fig 2). In addition, the correction of LD-based measures 291 by the kinship matrix (rv/corPC1v) does not increase detection power of epistatic selection; 292 rather it tends to reduce power especially in the COMP model but not in the COAD model.

293
This is due to the fact that the fixation of the AB allelic combination is more frequent in sub- interesting to note that an increased power to detect additive selection is observed with rv 298 especially in a self-mating regime (Fig 2A). This is because, at the whole population level, the which is the case in our simulations (Fig 2A,C; Fig 2B, sub-population -n=80 individuals - (Fig 3A,C; Fig 3B,D; respectively). A clear inflation 315 towards small p-values can be observed for scans based on 1 (Fig 3A,B) compared with 316 scans based on 1 (Fig 3C,D), and this inflation is higher with data from the whole 317 collection which shows a higher degree of population structure. In the FW sub-population 318 scan, a sharp peak is observed using 1 on the chromosome 6 corresponding to the small p-values when using statistic (Fig 5A,B), compared with scans implemented with 361 statistic (Fig 5C,D). Using SNP 15_46172199 as bait for SLC24A5 gene, a peak 362 corresponding to the EDAR gene was detected, with SNP 2_108946170 as the top significant 363 SNP (Fig 5A,C; -based p-value = 2.29x10 -9 ). Conversely, when the scan was performed 364 with SNP 2_108973688 as bait from the EDAR gene, a peak corresponding to the SCL24A5 365 gene was detected, with SNP 15_46179457 as the top significant SNP (Fig 5B,D 2_108973688 -EDAR - (Fig 6C) correlates substantially with global population structure, as 371 depicted by a phylogenetic tree based on the kinship matrix among individuals (Fig 6A,B). GWESS are envisioned -, the window size can be fixed at a value that fits best the average LD 436 decay in the species studied (even though a standard 10 kbp size can be used by default), and 437 they allow to detect coevolving genes even after putative fixation of coselected SNPs due to 438 surrounding SNPs within genes or windows that also carry a selection signal by hitchhiking.

439
The applications of our method to SNP data from human populations or from  which is more likely because such a pattern seems specific from ethnic groups. Thus, in 488 human, LD between EDAR and SCL24A5 is not only related to world-wide population 489 structure and positive selection but also to local coselection of genes.

497
Genetic models of epistatic selection 498 We follow fitness genotype formalization under epistatic selection models as in (10, genotypes. In a diploid population, the two-locus fitness expression will depend on the level 505 of dominance of the derived alleles (S1 Table). In the compensatory model with 506 codominance, we choose to select against the double heterozygote genotype (Aa/Bb) because 507 A and B are independent loci putatively equally expressing AB / ab or Ab / aB. In the neutral 508 model, all fitness values are set up to 1.

SNP-based and window-based LD measures of epistatic selection 510
In a diploid organism, at a given bi-allelic SNP with alleles coded 0 and 1 the three 511 possible genotypes are (00,01,11), which can be coded as the allelic dose of allele 1 (0, 1, 2).

512
The measure on unphased genotypes between two bi-allelic loci is defined by the correlation  which follows a Student distribution τ ( −2) . However, in the case were observations are not 544 independent; that is when the genotypes at a given locus are correlated within the population 545 due to non-random mating and/or between populations due to structure, then we expect that 546 only the statistics obtained using rv or corPC1v follow the distribution τ ( −2) . In the case 547 where the ancestral/derived allele status is known at the SNPs, a positive sign of r (or rv) 548 strictly reflects both the coadaptation and compensatory epistatic models and a unilateral test 549 can be performed with alternative hypothesis being "r (or rv) is significantly higher than 0".

550
Otherwise, the sign of r is not interpretable, but the p-value of the test can be computed on 551 either side of the null distribution. Likewise, whether the ancestral/derived allele status of the 552 SNPs is known or not, the sign of corPC1 (or corPC1v) is not interpretable since PC1 reflections 553 imply identical ranking of individuals genotypes (or relatedness) in a given genomic region 554 (see for instance (51)). R scripts to implement the statistical test based on , , 1 or  Table). First, a sample from an ancestral population was simulated with the coalescence 567 simulator SCRM (52). The sample consists of 1,000 haploid individuals with each four 568 independent chromosomes of length L = 5 Mbp. The mutation (µ) and recombination (c) rates 569 were fixed to classical values of 10 -9 and 10 -8 per base pair per generation, respectively, and 570 the ancestral population size N0 fixed to 100,000, so that the ρ/θ ratio (i.e. 4N0cL/4N0µL) was   (10)   test was used to assess pairwise statistical differences, as indicated within the graph. which symbiotic nodulation phenotype is analyzed in Figure 4. Two independent transgenic 962 roots were analyzed for each genotype.

985
For Figures S6 to S10, the legend is the same as in Figure S5.