A multilocus association analysis method integrating phenotype and expression data reveals multiple novel associations to flowering time variation in wild‐collected Arabidopsis thaliana

The adaptation to a new habitat often results in a confounding between genomewide genotype and beneficial alleles. When the confounding is strong, or the allelic effects is weak, it is a major statistical challenge to detect the adaptive polymorphisms. We describe a novel approach to dissect polygenic traits in natural populations. First, candidate adaptive loci are identified by screening for loci directly associated with the adaptive trait or the expression of genes known to affect it. Then, a multilocus genetic architecture is inferred using a backward elimination association analysis across all candidate loci with an adaptive false discovery rate‐based threshold. Effects of population stratification are controlled by accounting for genomic kinship in both steps of the analysis and also by simultaneously testing all candidate loci in the multilocus model. We illustrate the method by exploring the polygenic basis of an important adaptive trait, flowering time in Arabidopsis thaliana, using public data from the 1,001 genomes project. We revealed associations between 33 (29) loci and flowering time at 10 (16)°C in this collection of natural accessions, where standard genomewide association analysis methods detected five (3) loci. The 33 (29) loci explained approximately 55.1 (48.7)% of the total phenotypic variance of the respective traits. Our work illustrates how the genetic basis of highly polygenic adaptive traits in natural populations can be explored in much greater detail using new multilocus mapping approaches taking advantage of prior biological information, genome and transcriptome data.


| INTRODUCTION
The genomewide association study (GWAS) is a powerful approach to dissect the genetic basis of phenotypic variation in populations, but its application in natural populations is statistically challenging.
For example, when populations develop local adaptations to their environment, this often results in a confounding between adaptive alleles and the genomewide genotype of the local population. This confounding is particularly challenging when adaptation of highly polygenic traits is a consequence of small changes in the allele frequencies of standing variants across many loci (Burke, Liti, & Long, 2014;Orozco-terWengel et al., 2012;Teot onio, Chelo, Bradi c, Rose, & Long, 2009). The power to detect loci making small contributions to the trait variation is generally low in GWAS due to corrections for multiple testing. The additional burden of corrections for population structure decreases power further. By making combined use of powerful model populations, and new analytical approaches accounting for the polygenic nature of complex adaptive traits, it is possible to dissect a polygenic genetic architecture with greater sensitivity to facilitate deeper insights into the genetic basis of adaptive traits (Sheng, Pettersson, Honaker, Siegel, & Carlborg, 2015;Zan et al., 2017). These powerful statistical methods have not yet been adapted to analyses of natural populations, but could also be useful for identifying the genetic mechanisms that govern adaptation of polygenic traits in nature.
Arabidopsis thaliana is a widely used model species in plant biology. It has colonized a wide range of ecological habitats around the world, and the available genotype, expression and phenotype data on many of these ecotypes provide unique opportunities to study the genetic basis of adaptation. Flowering time is one of the most studied adaptive traits in this species as it has an important role in ecological adaptation and also a potential impact on agronomic production in related species. Studies in the laboratory, often on the reference accession Col-0, have identified many genes that influence flowering time by controlling multiple biological pathways, for example, photoperiod (El-Assal, Alonso-Blanco, Peeters, Raz, & Koornneef, 2001;Filiault et al., 2008), vernalization (Li et al., 2014;Shindo et al., 2005) and plant hormone signalling (Sharma et al., 2016) have been identified. Studies of flowering time variation in nature include both QTL mapping in crosses between divergent accessions and genomewide association (GWA) studies in collections of wild accessions (Alonso-Blanco et al., 2016;Atwell et al., 2010;Brachi et al., 2010;Salom e et al., 2011). The molecular studies suggest that hundreds of genes could potentially influence flowering time whereas the studies of natural populations have only been able to associate genetic polymorphisms in a handful of genes with large variation in worldwide flowering time. Hence, even though flowering time is also potentially highly polygenic in nature, the burden of multipletesting and population stratification correction in GWAS analyses of natural associations is too high to identify adaptive alleles that are either rare, have smaller effects or are confounded with population structure even in collections of >1,000 inbred accessions (Alonso-Blanco et al., 2016).
Here, we describe an analytical approach to dissect the genetic basis of polygenic adaptive traits in natural populations. Its use is illustrated by dissecting flowering time variation in a publicly available data set of wild-collected A. thaliana accessions from the 1,001 genomes project (Alonso-Blanco et al., 2016). By utilizing prior information on known flowering time genes and loci in A.
thaliana and integrating it in our analysis of genotype, expression and trait variation data from the 1,001 genomes project, a highly polygenic basis of the worldwide variation in flowering time was revealed.

| Data
We reanalysed a recently released data set on 1,004 wild-collected A. thaliana accessions from the 1,001 genomes project (Alonso-Blanco et al., 2016). All genotype, phenotype and transcriptome data are publicly available within the Arabidopsis thaliana 1,001 genomes project (Alonso-Blanco et al., 2016;Kawakatsu et al., 2016).
The genotypes for the accessions were downloaded in the form of a whole-genome SNP data matrix from http://1001genomes. org/data/GMIMPI/releases/v3.1/SNP_matrix_imputed_hdf5/1001_ SNP_MATRIX.tar.gz. These genotypes were obtained from wholegenome resequencing data, by calling SNPs using the MPI-SHORE and GMI-GATK pipelines and then imputing the whole-genome SNP genotypes (Alonso-Blanco et al., 2016). We filtered these SNPs for minor allele frequency (MAF > 0.03) and pairwise LD (r 2 > .99) to and Gene Ontology (GO) annotation were extracted from this file for genes listed in supporting tables.

| An analytical framework for inferring polygenic genetic architectures
Described here is an analytical framework for inferring polygenic architectures in natural populations. A multilocus genetic model is ZAN AND CARLBORG developed at a predefined false discovery rate significance threshold using a backward elimination approach with an adaptive model selection criterion (Abramovich, Benjamini, Donoho, & Johnstone, 2006;Gavrilov, Benjamini, & Sarkar, 2009;Sheng et al., 2015;Zan et al., 2017). This analysis is performed across a set of candidate loci that, based on prior biological and statistical evidence, are considered likely to influence the trait in the population. Here, stringent preselection is implemented and loci included should either (i) have genomewide significant effects on the trait in a single-locus genomewide association analysis or (ii) have genomewide significant effects on the expression of a gene known to affect the studied trait in earlier experiments, both in the studied population.

| Genomewide association analysis for flowering time
Previous genomewide association (GWA) analysis in this data set reported five genomewide significant associations for the flowering time at 10°C (16°C) using a mixed model analysis to account for population structure (Alonso-Blanco et al., 2016). We here extended these analyses for these traits by implementing a step-wise conditional GWA mapping. At initiation, all loci already detected in the single-locus GWAS were included as cofactors in the analysis model.
Then, one or several conditional GWA analyses were performed, where in each round all loci that passed the multiple-testing corrected genomewide significance threshold were added to the model.
For FT10, two conditional analyses were performed and for FT16 only one. The significance threshold for each scan was calculated using a Bonferroni correction for the number of SNPs with r 2 < .95, resulting in a threshold of Àlog 10 p = 7.33. The model used in the GWA analyses was Where is the mean flowering time at 10°C (16°C), X the design matrix with the number of columns equal to the number of already selected SNPs plus one for the SNP tested for inclusion (coded as 0, 2 for minor allele homozygous, major allele homozygous genotypes, respectively), b is a vector of the effects of substituting two alleles at the corresponding SNPs in X, and Z is the design matrix obtained from a Cholesky decomposition of the G (identity by state-IBS) kinship matrix estimated from the whole-genome SNP data with the ibs function (weighted by allele frequencies) in the GENABEL package (Aulchenko, Ripke, Isaacs, & van Duijn, 2007). The Z matrix satisfies ZZ 0 = G, thus, the random effect vector u will be normally distributed with variance r 2 g , u $ N 0; r 2 g , and e is the normally distributed residual with variance r 2 e ; e $ N 0; r 2 e À Á . The analyses were performed using the polygenic and mmscore functions in the GENABEL package (Aulchenko et al., 2007). The polygenic function was run only once to estimate the r 2 e and r 2 g , and then mmscore function, implementing iterative scans for each SNP, was performed to obtain the genomewide association statistics. In those iterative scan, these two parameters, r 2 e and r 2 g ; were assumed to be constant. Hence, this approach is essentially a linear model associating SNPs to the phenotypic residuals, which excluded the polygenic effects.

| Expression QTL mapping targeting genes in flowering time-related pathways
A list of 282 flowering time genes was obtained from (Brachi et al., 2010). From the available transcriptome sequence data in the 1,001 genomes project, the expression levels were extracted for those genes as described in (Zan, Shen, Forsberg, & Carlborg, 2016).
Whole-genome eQTL mapping was then performed for these genes across the pruned SNP set (r 2 > .99) described in the Data section above. In the analysis, model 1 was used but with a slightly altered design matrix X. Here, X was a column vector containing only the genotype of the tested SNP, coded as 0, 2 for minor allele homozygous, major allele homozygous genotypes. As in the GWA, a Bonferroni corrected significance threshold was used and calculated based on the number of SNPs in the data set with r 2 < .95, resulting in a multiple-testing corrected threshold of Àlog 10 p = 7.33. These analyses were also performed using the polygenic and mmscore functions in the GENABEL package (Aulchenko et al., 2007).

| Multilocus backward elimination association analysis
We have previously developed a multilocus approach to explore the genetic architecture of highly polygenic traits, and a detailed description of the approach is available in (Brandt, Ahsan, Honaker, & Siegel, 2017;Lillie et al., 2017;Sheng et al., 2015). In short, the method was designed to study traits where earlier data suggest them to be polygenic in the analysed population. In such cases, standard association analysis approaches to detect individual loci using stringent multiple-testing corrected significance thresholds will have low power to infer the true genetic architecture of the traits. This is because the many loci contributing to a polygenic trait will make small individual contributions to the genetic variance. By implementing a multilocus mapping approach in a backward elimination model selection framework, it is possible to simultaneously account for the joint genetic effects of many loci and identify those that together contribute to the trait variance. Not known in advance is how many

| Estimation of the variance explained by the mapped loci
Every accession in the data set was grown in replicate and the aver- by fitting the mixed model twice to the data with the respective design matrixes, X S , including the genotypes of the (i) eight GWAS or (ii) 33 total SNPs as fixed effects. In both cases, Z was the genomic kinship matrix fitted as a random effect to account for population structure. Second, using the respective estimates obtained from the two fitted mixed models (2), the variances explained by these two sets of eight (V 8 ) and 33 (V 33 ) SNPs were obtained as where R 2 S is the variance explained by set S, and X S and b S are the design matrix and estimates for set S from model (2). The variance explained by the set of 25 eQTL SNPs selected in the multilocus analysis (V 25 ) was calculated as V 25 = V 33 À V 8 . We also estimated the variances explained by the sets of markers obtained by incrementally adding each of the 33 SNP markers from the selected multilocus model, in the order from largest to smallest additive effect size obtained when fitting all loci together, to S. In each step, the genotype for the added marker was included in X S in the mixed model (2) and the variance of the set estimated using (3). The obtained estimates for these 33 sets are reported in Figure 4b. All variance estimation analyses were implemented in custom R scripts where the mixed models were fitted using the R package HGLM (R€ onneg ard et al., 2010). Figure 3 was made using the R package CIRCLIZE (Gu, Gu, Eils, Schlesner, & Brors, 2014).

| A genomewide association analysis to detect loci associated with flowering time
A publicly available data set with genotypes and phenotypes mea-  Table S1). In addition to these loci, a third locus (peak on chromosome 4, 10,999,188 bp; Figure 1c (Table S2; inflation factors in Figure S2D). One  have a TAIR10 annotation/GO term related to flowering time via light-mediated developmental, or photoperiod control, pathways (Table S3). The LD block around the top associated SNP (r 2 > 0.5) spans 12 kb on chromosome 5 including five genes (AT5G55830, AT5G55835, AT5G55840, AT5G55850 and AT5G55855). An obvious candidate among those is AT5G55835 as it is known to be expressed as a microRNA regulating many developmental processes, including changes in the vegetative phase, and coping with abiotic stress (Lei, Lin, & An, 2016;Wang, Czech, & Weigel, 2009;Yang, Conway, & Poethig, 2011;Yu et al., 2015).

| Revealing the polygenic basis of flowering time in a multilocus analysis of loci associated with time to flowering at 10°C or the expression of flowering time genes
A backward elimination, model selection analysis with an adaptive false discovery rate (FDR) criterion (Brandt et al., 2017;Lillie et al., 2017;Sheng et al., 2015;Zan et al., 2017) was used to identify a polygenic model of loci associated with FT10 in the 1,001 genomes collection. The inputs for the analysis were the candidate loci selected via their direct associations with FT10 in the genomewide association analysis (Table S1) and/or their associations with the expression of known flowering time genes (Table S2). At 15% FDR, 25 of the eQTL and all eight FT10 associated loci were retained in the final model ( Figure 3; Table S4). As at a 15% FDR, 28 of the 33 loci are expected to be true associations, this illustrates the ability of the multilocus association method to reveal the highly polygenic architecture of FT10 in this population.
The additive genetic effects of the associated loci range from 1.8 to 5.8 days with a median of 2.9 (Figure 4a; Table S4). The proportion of the total genetic variance for FT10 that could be explained

| Large overlap between loci associated with flowering times at 10°C and 16°C
To explore the overlap between the polygenic architectures of FT10 and FT16, we also performed an identical three-step analysis for FT16. The two phenotypes have a high phenotypic correlation (Pearson correlation = .88, p-value <2.2 9 10 À16 ), but the standard GWAS analysis only detects three loci for FT16 at the genomewide significance level. Our conditional forward selection GWA analysis for FT16 failed to reveal any additional loci directly associated with this trait ( Figure S3). All three loci associated with FT16 are mapped within 20 kb of the peak SNPs associated with FT10 (Tables S1 and   S5). The multilocus analysis across the 150 eQTL regulating expression of flowering time genes and the three FT16 associated loci identified 29 loci associated with FT16 at 15% FDR. In total, 15 of the 26 eQTL overlapped with the 25 eQTL associated with FT10. (

| DISCUSSION
Here, we describe a multilocus method to dissect polygenic genetic architectures of adaptive traits in natural populations. It is a framework to simultaneously evaluate how multiple loci, selected based on solid prior knowledge, contribute to an adaptive trait. Here, candidate loci were selected by identifying (i) loci with strong markeradaptive trait associations, in the studied population, and (ii) functional candidate genes with expression under genetic control by segregating eQTL. Specifics on how candidate is selected in our study and a more detailed discussion are included below.

| Utilizing prior information to select candidate loci for multilocus analyses
The model trait in the study, flowering time, is a well-studied adaptive trait. In the standard GWAS, a candidate locus was selected that did not quite reach the stringent multiple-testing corrected threshold. Here, prior biological information, in the form of colocation with a QTL affecting the adaptive trait and a strong candidate gene, motivated its selection as a candidate locus. This locus was ultimately included in the final polygenic model. Second, the genes tested in the eGWAS were preselected based on known effects on flowering time in earlier studies (Brachi et al., 2010). It was hypothesized that if the expression variation in these genes was under genetic control in the population, this might ultimately also influence the plant phenotype. Nearly half of these genes were regulated by eQTL and >20 of them were included in the final polygenic model. These results illustrate the value of using biological information to inform statistical analyses of complex adaptive traits.

| Balancing prior information and statistical significance when selecting candidate loci
Too lenient selection of candidates will decrease the power in the polygenic analysis. This is because the adaptive FDR criterion, which is the basis for the polygenic backward elimination analysis, is most powerful for sets of candidates where many of them actually contribute to the trait. A threshold that is too stringent, however, might remove important loci with weaker effects from the polygenic analysis. Balancing quantitative statistical and experimental biological prior information in an appropriate manner is therefore a key in the candidate selection procedure. Here, the analysed population was large, had genotype, expression and phenotype data of high quality and much prior biological information on the genetics of the adaptive trait from experimental work. A stringent prescreening process was therefore implemented to benefit the power in the polygenic analysis. All preselected candidates from the GWAS and eGWAS were strongly supported by statistics as the LD-pruned Bonferroni corrections used provided very similar selection stringency as genomewide permutation-based thresholds (Text S1). Lower preselection thresholds could be considered in less powerful data sets, by for example using FDR-based thresholds. Here in our study, however, the benefits of including more candidates in preselection analyses are likely to be counteracted by a reduced power in the polygenic analysis. This is because the number of candidates was already large in relation to the number of individuals in the data set.

| Model selection using a backward elimination strategy with an adaptive FDR criteria
Two key features make the backward elimination strategy with an adaptive FDR criteria particularly well suited to analyse this type of data. First, the adaptive model selection criterion makes it possible to infer the polygenic architecture of a trait at a predefined statistical significance level. This distinguishes it from other model selection strategies, including shrinkage algorithms such as LASSO, that are based on fixed evaluation criteria or penalty parameters imposing implicit assumptions about the rate of true positives amongst the evaluated loci. This rate is unknown in real data. Second, in the backward elimination framework, the models are fitted using ordinary least squares regression. Shrinkage-based methods (such as LASSO) aim to, out of several correlated variables, select the one most strongly associated with the response and shrink the effects of others towards zero. In the context of an association study, this results in an implicit assumption that the evaluated loci contributing to the trait are uncorrelated (i.e., not in LD). While this might be a valid assumption in many situations, the correlations between variables (LD between markers) are not always statistical noise to be removed. On the contrary, in some instances, correlations provide valuable information about genetic phenomena including epistasis, multi-allelic, imperfect LD between markers and causal polymorphisms, and patterns of coselection. The value of not excluding correlated SNPs has earlier been illustrated in analyses of experimental populations, where it revealed both multi-allelic and epistatic loci (Sheng et al., 2015;Zan et al., 2017).
A possible caveat of the backward elimination algorithm is that, if the number of candidates selected in the prescreening is large, not all candidates can be evaluated in a single model selection step.
Therefore, sets of candidates were first analysed separately with a lower (20% FDR) threshold. The markers remaining in these analyses were then analysed together to obtain the final polygenic model.
The sensitivity of the analysis to the way the markers were divided was explored empirically (Text S1; Table S7) to reveal that the majority of the loci in the polygenic model were insensitive to how the markers were divided in the pre-analysis. A few loci, however, differed between the models but at a 15% FDR threshold, this is to be expected.
F I G U R E 2 The genomic locations of 123 known flowering time genes (Brachi et al., 2010), whose expression is regulated by eQTL segregating in the 1,001 genomes collection, are given on the y-axis. The locations of the eQTL regulating the expression of the flowering time genes are on the x-axis. The eight loci with direct associations to FT10 (Table S1)  finding is consistent with earlier reports in Arabidopsis thaliana that major loci often make important contributions to adaptive traits (Brachi et al., 2015;Forsberg et al., 2015;Rus et al., 2006;Shen et al., 2014;Shindo et al., 2005). Here, we can also show that in addition to the major loci, many others also contribute to the variation in this adaptive phenotype. These are difficult to detect in the GWA, either due to their small individual contributions, or due to their low minor allele frequency. Together, however, they contribute a considerable fraction of the flowering time variation in the 1,001 genomes collection and especially so for FT16. This finding suggests that polygenic adaptation, in addition to major alleles, is likely to make an important contribution to adaptation in A. thaliana. This is consistent with earlier experimental findings on studies of artificial selection responses, where it has been shown that selection on many loci can lead to extreme adaptations in, for example, seed protein and kernel oil in maize (Laurie et al., 2004;Lucas, Zhao, Schneerman, & Moose, 2013), and body weight in mice (Allan, Eisen, & Pomp, 2005) and chicken (Besnier et al., 2011;Johansson, Pettersson, Siegel, & Carlborg, 2010;Sheng et al., 2015;Wahlberg et al., 2009).

| Selection of candidate adaptive loci using expression data
The optimal design for the eGWAS used to identify candidate adap- are identical and what is evaluated in the polygenic analysis is how this genotype potentially contributes to flowering. Our results show that they were highly valuable for identifying candidate genes, but it is unknown how many contributing genes were missed due to their environmentally dependent differential expression.

| Multilocus mapping in other types of populations
The A. thaliana 1,001 genomes data set is much more comprehensive and powerful than what can be expected for most other natural populations. For example, expression data or information on hundreds of experimentally supported candidate genes will likely not be available. Many natural populations are also outbreeding which will decrease the power in the association analysis. However, it is when the power in each separate analysis is low that the potential gain by integrating all available information from multiple data sources is the greatest. By adapting the stringency in the prescreening step to the specific data set, while being aware that in smaller and less powerful populations the Beavis effect might have a strong impact on the outcome of the analysis (Slate, 2013), new smalleffect loci can also be revealed in outbreeding populations. Results illustrating this include those from our studies of an outbreeding experimental chicken population, where prior information on mapped QTL and selective-sweeps has guided a polygenic analysis to reveal novel loci and complex genetic mechanisms (Sheng et al., 2015; Zan F I G U R E 4 (a) Histogram illustrating the additive effects for the 33 loci associated with FT10 in our multilocus analysis at a 15% FDR. Loci detected as candidates in the standard (conditional) GWAS in light (dark) green and eGWAS in purple, respectively. (b) Cumulative variance explained by the 33 loci (red line). These estimates were calculated by adding one locus at a time in the model in the order determined by the effect sizes in (a). The 25%, 50% and 75% quartiles of the total variance explained by the loci are highlighted using dashed black lines et al., 2017). Further, given the decreasing costs for in-depth characterization of biological systems, it is expected that, for example, RNA sequencing will be more widely used in the future to guide preselection of candidate loci in an increasing number of populations.

| CONCLUSIONS
In conclusion, our study demonstrates the value of integrating prior information on the genetic basis of a trait with multiple types of experimental data to dissect polygenic adaptive traits. Using prior information on known flowering time genes, and experimental data on gene expression, flowering time variation and genomewide SNP variation, a highly polygenic basis was revealed for this trait in the 1,001 genomes collection. The results suggest that a combination of major effect alleles and polygenic adaptation has been important for flowering time adaptation in A. thaliana.

ACKNOWLEDGEMENTS
This work was supported by the Swedish Research Council for Environment, Agricultural Sciences and Spatial Planning (Formas grant ID 2013-450 to € OC).

AUTHOR CONTRIBU TI ONS
€ OC and YZ initiated the study, designed the project and the statistical analyses; YZ wrote the analysis scripts and performed the data analyses; € OC and YZ summarized the results and wrote the manuscript.

DATA ACCESSIBI LITY
All data used in this publication are publically available as part of earlier publications and the 1,001 genomes A. thaliana project.

DISCLOSURE DECLARATION
The authors declare no competing interest.