Association mapping identified novel candidate loci affecting wood formation in Norway spruce

➢ Norway spruce (Picea abies) is an important boreal forest tree species of significant ecological and economic importance. Hence there is a strong imperative to dissect the genetics controlling important wood quality traits in the species. ➢ We performed a functional genome-wide association mapping of 17 wood traits in Norway spruce using 178101 single-nucleotide polymorphisms (SNPs) generated from exome genotyping of 517 mother trees. The wood traits were defined using functional modelling of wood properties across annual growth rings. ➢ Association mapping was performed using a multilocus LASSO penalized regression method and we detected a total of 51 significant SNPs from 39 candidate genes that are involved in wood formation. ➢ Our study represents the first functional multi-locus genome-wide association mapping (AM) in Norway spruce. The results advance our understanding of the genetics influencing wood traits, identify novel candidate genes for further functional studies and support current Norway spruce breeding efforts.

List of the phenotypes, their abbreviations and measurement unit Page 8 47 We performed a functional genome-wide association mapping of 17 wood traits in 61 Norway spruce using 178101 single-nucleotide polymorphisms (SNPs) generated 62 from exome genotyping of 517 mother trees. The wood traits were defined using 63 functional modelling of wood properties across annual growth rings. 64 Association mapping was performed using a multilocus LASSO penalized regression 65 method and we detected a total of 51 significant SNPs from 39 candidate genes that 66 are involved in wood formation. 67

Introduction 76
Norway spruce (Picea abies (L.) Karst.) is a dominant boreal softwood species of significant 77 economic and ecological importance (Hannrup et al., 2004). Long-term Norway spruce 78 breeding programmes for improvement of growth and survival were initiated in the 1940s and 79 recently, wood quality has become one of the priority traits (Bertaud & Holmbom, 2004; 80 Hannrup et al., 2004). Norway spruce breeding in Sweden complete one cycle in about 20 81 years and such long generation times make improvements in growth and wood quality very 82 slow. Among wood quality traits, wood density is considered a key indicator of stability, 83 strength and stiffness of sawn timber (Hauksson et al., 2001). Several studies of wood quality 84 observed that fast growth conflicts with high quality wood, as shown by the negative genetic 85 correlation between wood volume growth and density in Norway spruce ( Olesen, 1977;86 Dutilleul et al., 1998;Chen et al., 2014). In order to combine fast growth and desirable wood 87 properties through breeding, and to shorten the breeding cycle, it is therefore imperative to 88 design effective early selection methods and breeding strategies. In an effort to design optimal 89 breeding and selection strategies for reducing or breaking negative genetic correlations 90 between traits it is essential to identify alleles that are responsible for generating favourable or 91 unfavourable genetic correlations (Hallingbäck et al., 2014). 92 When DNA markers were first introduced in 1980s, tree breeders were provided a 93 possibility to correlate phenotypes with polymorphic DNA markers and to conduct selection 94 using genotypes instead of phenotypes (Lande & Thompson, 1990). Groover et al. (1994) first 95 identified quantitative trait loci (QTL) for wood density variation in loblolly pine using 96 linkage analyses based on segregating family pedigrees. However, maker-aided selection 97 (MAS) based on results from QTL analyses was never implemented in practical tree breeding 98 due to the so-called Beavis effect (e.g. inflated estimates of allelic effects and underestimation 99

Materials and Methods 150
Plant material and phenotype data 151 Plant material and phenotype data used in this study have previously been described in Chen 152 et al. (2014). In brief, two progeny trails were established in 1990 in Southern Sweden 153 (S21F9021146 aka F1146 (trial1) and S21F9021147 aka F1147 (trial2)). These trials were 154 composed of 1373 and 1375 open pollinated families, respectively, and form the basis of our 155 analyses. We selected 517 families in 112 sampling stands to use in the investigation of wood 156 properties. At each site, increment wood cores of 12 mm were collected from six trees of the 157 selected families at breast height (1.3 m) (6 progeny × 2 sites = 12 progenies in total). A total 158 of 5618 trees, 2973 and 2645 trees from the F1146 and F1147 trails respectively, were 159 analysed. The pith to bark profiling of the wood physical attributes was analysed using the 160 SilviScan technology (Evans 1994(Evans , 2006 at Innventia, Stockholm, Sweden, where the initial 161 data evaluations were performed using customized methods. These methods focus on the 162 identification and dating of all annual rings and their compartments of earlywood, 163 transitionwood and latewood. For the current study, Innventia also calculated three additional 164 traits, number of cells per ring (NC), wood percentage (WP), and a trait named Mass Index 165 (MI), introduced to express the relative amount of biomass, all derived from the SilviScan 166 data. MI was then used to identify trees with an uncommon positive correlation between 167 density and growth, that is more biomass. The traits included in the current study are listed in 168 The following univariate linear mixed model for joint-site analysis was fitted as: 204 is the observation on the lth tree from the kth family in The index was then treated as a new dynamic trait in the AM analyses where 219 individuals with an index > 1 indicate a wood mass per length unit than the population 220 average in the cross-section at breast height. The index was calculated for all progeny and 221 were used to calculate BVs for the 517 mother trees. 222 The EBVs were plotted against cambial age (annual ring number) to produce time 223 trajectories for each trait ( Fig. 2 and Fig. S1) and used to estimate latent curve parameters. At 224 the first stage, all the trajectories versus cambial age were fitted with a quadratic spline with 225 multiple knots in order to describe the dynamics of the EBVs across age. In this study, this 226 was done with the values of four parameters obtained from the spline fitting: the intercept, the 227 slope and two knot parameters (K1 and K2). The intercept and slopes were used to evaluate 228 the mean and rate of change for the trait across the annual rings, respectively. K1 and K2 229 represent inflection points in the cambial age trajectories where the development of the EBVs 230 enters new phases. These two points (K1 and K2) are therefore supposed to have biological 231 significance, warranting a closer analysis of the genes imparting these shifts in the EBVs 232 dynamics. The four latent traits show lower correlations compared to the direct measurements 233 on the original scales and they also have constant variances, thereby reducing the need to 234 account for residual dependencies in the model (Li et al., 2014). 235 The general definition of a quadratic spline with multiple knots is as follows: 236 which is continuous and where t i (i=1,…,k; t 1 <t 2 …<t k ) are defined as knots, and (t -t i ) 2 + = (t 238 , and otherwise is equal to zero. The number of knots has to be 239 properly defined in order to provide an accurate description of the data under investigation, as 240 well as functional starting points for the search of their locations (Li et al., 2015). In our case, since the growth pattern of wood property traits were not complex, we choose two knots of 242 the time interval. 243 Hence, the quadratic spline model to describe the growth trajectory of individual i 244 applied in this study was defined as: 245 Then the intercept β 0 , slope β 1 , β 2 , (Knot 1 (k1)) and β 3 (Knot 2 (k2)) are estimated by 247 standard least squares, and their estimates were considered as the latent trait in the subsequent 248 QTL analysis conducted in R-studio (Team, 2015). The latent traits were then analysed using 249 the LASSO model in order to identify SNPs showing significant associations to the traits. derived from the sequence capture data. SNPs with missing values following VQSR were 305 imputed using the nearest neighbour principle in TASSEL (Bradbury et al., 2007). This 306 approach was essential considering that PCA demands no missing data points. The covariate 307 matrix derived from the PCA was then displayed by plotting principal component 1 scores 308 against principal component 2 scores in Figure 2. The PCA plot was used to make inference 309 about the population structure. The first two components of the PCA covariate matrix 310 explaining most of the variation were then applied to the AM to account for population 311 structure and correcting for any stratification within the study. 312 Linkage disequilibrium was calculated using VCFtools v.0.1.13 software using the 313 squared correlation coefficient between genotypes (r2) within scaffolds using the "geno-r2". 314 The trend-line of LD decay with physical distance was fitted using nonlinear regression (Hill 315 & Weir, 1988) and the regression line was displayed using R (Team, 2015). The LASSO model: 321 where y i is the phenotypic value of an individual i (i=1,…,n; n is the total number of 323 individuals) for the latent trait β 0 , β 1 , β 2 or β 3 , α 0 is the population mean parameter, x ij is the 324 genotypic value of individual i and marker j coded as 0, 1 and 2 for three marker genotypes 325 AA, AB and BB, respectively, α j is the effect of marker j (i=1,…,n; n is the total number of 326 markers), and λ (>0) is a shrinkage tuning parameter. A fundamental idea of LASSO is to 327 utilize the penalty function to shrink the SNP effects toward zero, and only keep a small 328 number of important SNPs which are highly associated with the trait in the model. 329 The stability selection probability (SSP) of each SNP being selected to the model was 330 applied as a way to control the false discovery rate and determine significant SNPs (Gao et  Norway spruce SNP identification and mapping population structure 356 All of the 517 Norway spruce mother trees in the study were considered for variant detection 357 and an average of 1.5 million paired end reads were sequenced per individual for the 40019 358 exome capture probes. This resulted in the identification of 178101 high confidence SNPs. In 359 order to account for effects derived from population stratification we performed a PCA and 360 identified two separate main population groups as well as a number of individuals scattered in 361 between these two main groups. Nevertheless, the differences due to population structure 362 were small with the first two principal components cumulatively explaining only 2.18% of the 363 genetic variation observed (Fig. 3). LD was also determined between all the SNPs, within 364 contigs as well as within significant contigs only and LD decay across physical distance is 365 plotted in  (Table 2). 380 Several appreciable QTLs were identified with WD and RW having the highest number of 381 associations, at a total of 13 and 14 QTLs, respectively. This was followed by EP/LP-ratio, 382 which had six QTLs. WD, RW and EP/LP were the only three observed traits that have QTLs 383 detected in all four latent traits. For these three phenotypes, the majority of the QTLs were 384 detected when the average ring phenotype was used to derive the latent traits (Table 2). NC 385 associated with one QTL that was detected for the entire ring, whilst six QTLs were identified 386 when EW, TW and LW were analysed separately, with H 2 QTL ranging from 0.01-4.93% 387 (Table 2)  The Mass Index trait, that is linked to a positive effect of wood volume growth and 515 increased density (growth x density) yielded a total of seven associated SNPs, with two 516 upstream gene variants, two missense variants one intergenic variant, one stop gained variant 517 and one synonymous nucleotide replacement ( Table 2). The slope latent trait had five genes 518 with modest influence on the phenotype ranging from 0.01-1.80%. The genes were 519 homologous to Arabidopsis GRAS transcription factor, Aluminium induced protein, Protein 520 virilizer, ARM repeat superfamily protein and an uncharacterized protein. MA_19222g0010 which is homologous to a Picea sitchensis ADP (NB-ARC domain) and 527 explained the highest H 2 QTL of 1.82% (Table S1). 528 Wood density traits were associated with a total of 12 genes, the largest number of 529 genes identified from the contigs. Percentage of wood was linked to ten putative genes, cell 530 width had nine putative genes and number of cells was associated with six genes. Two genes 531 were shared across multiple traits, PCNA was common across RW and LWD, and 532 phophoadenosine phosphosulfate reductase was shared across WD, EWD and ENC. Genes The gene MA_10694g0010 is homologous to an enzyme involved in cell wall 577 biosynthesis, endoglucanase 11-like, and was associated with RW (intercept latent) (Table  578   S1). The association of this gene with the RW intercept implies that the gene influences the 579 trait throughout the growth period. This enzyme is a vital component of xylogenesis and is 580 involved in the active digestion of the primary cell wall (Goulao et al., 2011). The 581 endoglucanase 11-like, was associated with a synonymous SNP MA_10694g0010_11535 for 582 (RW) suggesting an involvement in cell expansion and cell wall loosening during wood 583 formation. Endoglucanases have been proposed as enzymes involved in controlling cell wall 584 loosening (Cosgrove, 2005). Endoglucanase 11-like gene is part of the endo-1 family in 585 which the eno-1-4-β-glucanase Korrigan gene belongs. Characterisation of the Korrigan gene 586 in P. glauca has identified it to be functionally conserved and essential for cellulose synthesis 587 (Maloney et al., 2012). Hence, MA_10694g0010 is a candidate for the remodelling of cell 588 walls that affects the mechanical and growth properties of wood cells, and consequently 589 annual ring width. 590 The synonymous SNP MA_20322g0010_23808 is associated with RW located on 591 gene MA_20322g0010 which is homologous with a plant specific B3-DNA binding protein 592 domain, that is shared among various plant-specific transcription factors. This includes 593 transcription factors involved in auxin and abscisic acid responsive transcription (Yamasaki et 594 al., 2004). Auxin is one of the central phytohormones in the control of plant growth and 595 development (Abel & Theologis, 1996), and also known to be involved in cell wall loosening 596 and elongation (Cosgrove, 2016).
This suggests a possible functional role for 597 MA_20322g0010 in influencing RW. 598 An intron variant located in the MA_10434805g0020 gene, which is homologous to 599 PCNA was detected across several phenotypes (LWD, RW and MOE) associated with the 600 slope latent trait ( Table 2). The detection of this gene across these phenotypes using the slope 601 latent trait implies that the gene affects the rate of change of these phenotypes. Thus, this 602 would be a good gene to target for further studies. PCNA proteins function as integral 603 enzymes in the regulatory pathways of cell cycle regulation and DNA metabolism (Maga & 604 Hübscher, 2003). PCNA has been associated with chromatin remodelling, DNA repair, sister-605 chromatid cohesion and cell cycle control, which are all vital processes in plant growth 606 (Strzalka & Ziemienowicz, 2010), but it has not been previously associated with wood 607 formation traits. 608 In our study we detected a significant downstream SNP 609 (MA_10434624g0010_164772) associated with LRW on gene MA_10434624g0020, 610 homologous to pectinmethylesterases (PMEs), which are cell wall associated enzymes 611 responsible for demethylation of polygalacturonans (Phan et al., 2007). This enzyme has been 612 shown to be linked with many developmental processes in plants, such as, cellular adhesion 613 and stem elongation (Micheli, 2001). An association study in White spruce identified a 614 sulphate-conjugation, play an important role in plant growth and development (Klein & 627 Papenbrock, 2004). Reduced sulphur is utilized by the sulphate assimilation pathway for the 628 synthesis of essential amino acids cysteine and methionine (Kopriva & Koprivova, 2004). 629 Methione acts as a methyl donor in both lignin, hemicellulose and pectin biosynthesis 630 providing a possible mechanism of how PAPS could influence wood density and number of 631

cells. 632
When analyzing QTLs detected for traits linked to the percentage of cells (EP, LP and 633 EP/LP) we identified three putative candidate genes, DNA-3-methyladenine glycosylase II 634 enzyme, phytochrome kinase substrate 1 and glycosyltransferase. DNA-3-methyladenine 635 glycosylase II enzyme is responsible for carrying out base excision repairs (BER) in the 636 genome in order to maintain genomic integrity. This enzyme has the ability to initiate a broad 637 substrate recognition and provides a wide resistance to DNA damaging agents (Wyatt et al., 638 1999). This DNA repair capacity can be expected to be essential for the process of cell 639 propagation and growth. wood development stages has a significant impact on the overall density. The seasonal 663 changes in EWD to LWD has been speculated to be due to a change in auxin levels leading to 664 the initiation of wall-thickening phase, which has a direct impact on the wood quality traits 665 such as MOE. This phase coincides with the cessation of height growth and where available 666 resources are used for cell-wall thickening (Sewell et al., 2000), which may explain the 667 common QTL between LWD, RW and MOE, as part of the same feedback loop mechanism. 668 We identified two associations to homologous genes related to nucleic acid repair 669 functions, DICER-LIKE3 (DCL3) and DNA mismatch repair protein (MSH5), which are 670 concerned with RNA processing as well as DNA repair, respectively. These genes are 671 involved in ensuring the fidelity of DNA replication and to preserve genomic integrity (Hsieh 672 & Yamane, 2008). These genes are possibly associated with cambial cell division and endo-673 reduplication during wood formation and can conceivably have effects on wood density. 674 An association for TWD with a SNP located upstream of gene MA_212523g0010, is 675 homologous to Kinesin-related protein 13 (gene-L484_021891). Kinesin-related proteins are 676 known to be involved in secondary wall deposition, which can impact wood density (Zhong et 677 al., 2002), cell wall strength and oriented deposition of cellulose microfibrils. 678 Several receptor-like Kinases (TIR/NBS/LRR and Serine/threonine-protein 679 phosphatase) homologs were identified across traits (TRW, NC, EP, EP/LP and EWD) (Table  680   S1). These protein domains control a large range of processes including hormone perception 681 and plant development. Approximately 2.5% of the annotated genes in Arabidopsis genome 682 are RLK homologs (Shiu & Bleecker, 2001), where they among other functions play an 683 important role in the differentiation and separation of xylem and phloem cells (Fisher & 684 Turner, 2007). Similar to our study a synonymous SNP in a RLK gene was associated with 685 early wood proportion (EP) in White spruce (Beaulieu et al., 2011), hence RLKs seem to be 686 involved in modifying a number of different wood properties from density to cell identity and 687 number. 688 Norway spruce trees that possess the ability of fast growth and high wood density are 689 very rare, but such trees and associated SNPs were discovered in our study. Trees combining 690 these traits are of high interest to forest industries and owners, and thus also in focus for 691 breeders. Of the seven genes significantly linked to this phenomenon of particular interest was 692 a synonymous SNP on MA_99004g0100 gene homologous to a transcription factor from the 693 GRAS family (Table S1)