Genome‐wide association links candidate genes to fruit firmness, fruit flesh color, flowering time, and soluble solid content in apricot (Prunus armeniaca L.)

Apricots originated in China, Central Asia and the Near East and later arrived in Anatolia, particularly in Malatya province in Turkey, which is regarded as their second homeland. Apricots are outstanding summer fruits, valued for their attractive color, sweet taste, aroma and high vitamin and mineral content. In the current study, a total of 259 apricot genotypes from different geographical origins in Turkey were used. Significant variations were detected in fruit firmness (FF), fruit flesh color (FFC), flowering time (FT), and soluble solid content (SSC). A total of 11,532 DArT-based SNPs were developed and used in the analyses of population structure and association mapping (AM). According to the STRUCTURE (v.2.2) analysis, the apricot genotypes were divided into three groups. A mixed linear model with Q and K matrices was used to detect the associations between the SNPs and the four traits. A total of 131 SNPs were associated with FF, FFC and SSC. No SNP marker was found to be associated with FT. The results demonstrated that AM has high potential for revealing markers associated with economically important traits in apricot. The SNPs identified in the study can be used in future breeding programs for marker-assisted selection in apricot.


Introduction
Rosaceae is one of the most important fruit tree families of temperate regions and includes economically important fruit species such as apple, peach, strawberry, plum, almond, pear, European plum, and sweet cherry [1]. Apricot (Prunus armeniaca L.) is also a member of this family and an important stone fruit, with a total world production of about 3.8 million tons [2]. The major apricot-producing countries in the world are Turkey (730,000 tons), Uzbekistan (662,123 tons), and Iran (306,115 tons) [2].
Various parameters affect fruit quality in apricot. Consumer preferences are based on fruit quality, which refers to sensorial properties such as appearance, texture, taste and aroma, as well as nutritional components, chemical components, functional properties, and mechanical characteristics [3]. The appearance of the fruit is the main criterion for consumers, while a low level of sweetness and a hard texture are undesired [3]. Thus, fruit firmness and fruit flesh color are important for consumer satisfaction [3]. Another important parameter affecting fruit quality is soluble solid content (SSC), which includes sugars, organic acids, proteins, minerals, lipids, amino acids, and vitamins, and is the main criterion that determines the taste, flavor and nutritional value of the fruit [4]. In addition to being a delicious edible product, apricot fruit is also considered functional due to its chemical ingredients [4]. It makes a significant contribution to human health through its phenolic compounds, which have immune-stimulating, anti-inflammatory and antioxidant properties [4,5]. Flowering time is one of the important traits for apricot producers, as it is a substantial agronomic trait with an impact on fruit and seed growth in temperate fruit tree species. In cold regions, early flowering individuals are damaged by late frost, while in warm regions, late flowering individuals can experience problems with leaf and flower bud break, resulting in a reduced harvest [5]. In addition, early flowering species have economic value in terms of early market prices.
Complex traits, including most fruit quality properties, are controlled by interacting genes and are called quantitative traits [1]. Therefore, clarifying the inheritance of complex traits is one of the main topics in agricultural sciences [6]. Quantitative trait locus (QTL) mapping is a prevalent method for mapping these kinds of traits and is based on biparental mapping populations [6]. Constructing a new cross population is a tedious, time-consuming and expensive process, and the long generation time and juvenile period of fruit trees make QTL mapping even more difficult to apply [1]. To date, a number of QTL maps have been constructed for apricot traits such as flowering time [7,8], resistance to sharka disease [9][10][11][12], fruit quality traits [13,14], tree architectural traits [14] and chilling requirements [15]. Association mapping (AM) is an alternative to pedigree-based QTL mapping that uses natural populations to determine the correlations between phenotypes and genotypes [16]. AM exploits historical recombination and natural variation and provides high map resolution in a shorter time, because no new cross population needs to be developed [16]. This method has previously been employed for different fruit trees, such as peach [17,18], almond [16], and apricot [15,19]. Diversity Arrays Technology (DArT) is a promising choice for meeting the requirements of whole-genome coverage and transferability. It is a complexity-reducing, high-throughput, sequence-independent, DNA hybridization-based procedure that can detect thousands of markers in a single analysis at a low cost per data point. This technology has previously been employed for different trees, such as peach [18] and pistachio [20].
The objective of this study was to investigate the associations between SNPs based on the diversity arrays technology (DArT) and the pomological traits of apricot, namely fruit firmness (FF), fruit flesh color (FFC), flowering time (FT), and SSC using 259 apricot genotypes.

Plant material and DNA isolation
A total of 259 apricot (Prunus armeniaca L.) genotypes, which were grown together at the Malatya Apricot Research Institute in Turkey, were used in this study (Online resource 1). All genotypes had been planted with eight-meter spacing between and within the rows in the experimental station of the institute. Each genotype was represented by three trees arranged randomly, and all trees were 20 years old. Standard management practices concerning chemical fertilization, pruning, and disease control were applied to the trees.
Young leaves were collected from each apricot genotype, frozen in liquid nitrogen, and stored at −80 °C for future analyses. The leaf samples were ground with a tissue lyser (Technogen Co., Turkey). DNA extraction was carried out with 0.1 g samples of each individual following the protocol described by Deshmukh, with the RNase application step omitted [21]. Tris-EDTA (TE) buffer (100 µl) was used to dissolve the extracted DNA. To assess the quality and quantity of the isolated DNA, 1% agarose gel electrophoresis and a spectrophotometer (NanoDrop ND 1000) were used, respectively. After confirmation, the DNA samples were stored at −80 °C until they were used for SNP analyses.

Pomological evaluation
Forty fruits from each replication of the genotype were randomly selected and harvested separately for each tree. Fruits were harvested at the maturity stage, judged by the appearance and taste of each genotype, according to Gecer et al. [22]. For the FFC measurements, the color intensity (chroma) of the flesh (after peeling) was measured with a Minolta chroma meter (CR-400, Minolta), a tristimulus color analyzer calibrated against a white porcelain reference plate. Chroma, $C^* = (a^{*2} + b^{*2})^{1/2}$, was calculated from the a* and b* color values measured at three different positions around the equatorial region of each fruit, with an average of nine readings per fruit, and the chroma values were used for evaluation. For FT, observations were made by experts, and the first day of flowering was noted as FT. The juice of the 40 apricots was measured with a digital refractometer (Model RA-250HE Kyoto Electronics, Kyoto, Japan), and the SSC values were recorded in °Brix. FF was measured with an acoustic firmness sensor (Aweta BV, the Netherlands). These fruit traits were measured for two consecutive years (2016 and 2017).
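The chroma calculation described above can be illustrated with a minimal Python sketch. The function names and the sample a*/b* readings are hypothetical and are not taken from the study; the sketch only shows how chroma is derived from each (a*, b*) pair and then averaged over the readings of one fruit.

```python
import math


def chroma(a_star: float, b_star: float) -> float:
    """Chroma C* = sqrt(a*^2 + b*^2) for one colorimeter reading."""
    return math.hypot(a_star, b_star)


def mean_chroma(readings):
    """Average chroma over several (a*, b*) readings of the same fruit."""
    values = [chroma(a, b) for a, b in readings]
    return sum(values) / len(values)


# Hypothetical (a*, b*) readings, for illustration only.
example_readings = [(3.0, 4.0), (6.0, 8.0), (9.0, 12.0)]
example_mean = mean_chroma(example_readings)
print(example_mean)
```

With real data, the list would hold the nine readings taken around the equatorial region of a fruit.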

Variance analysis
To define the variations in FF, FFC, FT and SSC among the 259 genotypes over two years (2016 and 2017), an analysis of variance (ANOVA) was performed using TOTEMSTAT software [23] at a significance level of P ≤ 0.01. The analysis followed a randomized complete block design (RCBD) with two factors (year and genotype) and three replications. The variations were determined according to year (Y), genotype (G), and Y × G interaction.

DArT analysis
SNP analyses were performed at DArT P/L (Diversity Arrays Technology Pty. Ltd., Canberra, Australia). The PstI-MseI method was selected from four complexity reduction methods that were tested (data not presented). Digestion/ligation reactions were applied to the DNA samples according to Kilian et al. [24], except that the single PstI-compatible adaptor was replaced with two different adaptors corresponding to two different restriction enzyme (RE) overhangs. The PstI-compatible adaptor was designed to include the Illumina flowcell attachment sequence, a sequencing primer sequence and a staggered barcode region of varying length, similar to the sequence reported by Elshire et al. [25]. The second adaptor included the flowcell attachment sequence and the MseI-compatible overhang sequence. The PstI-MseI fragments were amplified in 30 rounds of PCR under the following reaction conditions: 94 °C for 1 min, followed by 29 cycles of 94 °C for 20 s, ramp at 2.4 °C/s to 58 °C, 58 °C for 30 s, ramp at 2.4 °C/s to 72 °C, and 72 °C for 45 s, then 72 °C for 7 min. Following the PCR step, equal amounts of the PCR products of each sample were applied to c-Bot (Illumina) bridge PCR, followed by sequencing on an Illumina HiSeq 2500. The PCR products were sequenced in a single lane for 77 cycles. All generated sequences were processed using DArT analytical pipelines. In the primary pipeline, the FASTQ files were first processed to filter out poor-quality sequences. More stringent selection criteria (Phred pass score ≥ 30) were applied to the barcode regions. For the marker calling step, 2,000,000 sequences per barcode/sample were used. Lastly, the high-quality sequences were used to create the fastqcall files, which were passed to the secondary pipeline for DArT P/L's proprietary SNP and Presence/Absence Marker (PAM) calling algorithms (DArTsoft14).
The polymorphism information content (PIC) values represent the discrimination power of the markers. The PIC value of each marker was calculated according to Lynch and Ritland [26] with the following equation: $\mathrm{PIC} = 1 - \sum_i p_i^2$, where $p_i$ denotes the proportion of the population carrying the ith allele [26]. A dendrogram was drawn in R based on Nei's genetic distance [27].
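As an illustration of the PIC equation above, the following Python sketch computes the value for a single marker. The genotype coding (0, 1 or 2 copies of the alternate allele, with None for a missing call), the function names and the example calls are assumptions made for this sketch and do not reflect the DArT output format.

```python
def allele_freqs_from_genotypes(genotypes):
    """Allele frequencies of a biallelic SNP.

    Assumed coding: 0, 1 or 2 copies of the alternate allele; None = missing.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes for this marker")
    alt_freq = sum(called) / (2 * len(called))
    return [1.0 - alt_freq, alt_freq]


def pic(allele_freqs):
    """PIC = 1 - sum(p_i^2), following the equation given in the text."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


# Hypothetical genotype calls for one marker, for illustration only.
example_calls = [0, 1, 2, 1, None, 0]
example_freqs = allele_freqs_from_genotypes(example_calls)
example_pic = pic(example_freqs)
print(example_freqs, example_pic)
```

Missing calls are dropped before the frequencies are estimated, so each marker is evaluated only on the genotypes that were actually scored.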

Genetic variation analysis
STRUCTURE software (v.2.3.4), which is based on Bayesian modelling, was used to determine the population structure of the 259 apricot genotypes [28]. The software was run with a burn-in period of 10,000 iterations followed by 100,000 Markov Chain Monte Carlo (MCMC) replications. Ten runs were performed for each number of populations (K), ranging from 1 to 10. The best number of subpopulations was determined from the Delta K (ΔK) value using STRUCTURE HARVESTER [29].
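The ΔK criterion can be sketched as follows. This is a simplified illustration of the commonly used formulation, ΔK = |mean L(K+1) − 2 · mean L(K) + mean L(K−1)| / sd L(K), computed from the log-likelihoods of replicate runs. It is not the STRUCTURE HARVESTER code, the use of the sample standard deviation is an assumption, and the log-likelihood values are invented for the example.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Delta K for every K that has both neighbours (K-1 and K+1).

    lnp_by_k maps K to the list of log-likelihoods from replicate runs.
    """
    means = {k: mean(v) for k, v in lnp_by_k.items()}
    sds = {k: stdev(v) for k, v in lnp_by_k.items()}  # sample SD (assumption)
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in means or k + 1 not in means or sds[k] == 0:
            continue  # undefined at the edges or when runs are identical
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_diff / sds[k]
    return result


def best_k(lnp_by_k):
    """K with the largest Delta K."""
    dk = delta_k(lnp_by_k)
    return max(dk, key=dk.get)


# Invented log-likelihoods (three runs per K), for illustration only.
example_lnp = {
    1: [-5000.0, -5010.0, -4990.0],
    2: [-4500.0, -4510.0, -4490.0],
    3: [-4000.0, -4002.0, -3998.0],
    4: [-3990.0, -4010.0, -3970.0],
    5: [-3985.0, -4035.0, -3935.0],
}
print(delta_k(example_lnp), best_k(example_lnp))
```

Because ΔK needs both neighbouring values of K, it cannot be evaluated for the smallest and largest K tested.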

Association mapping analysis
TASSEL (v.5.2.3) software with a mixed linear model [MLM (K + Q) model] was used to detect the associations between DNA markers and pomological traits (FF, FFC, FT, and SSC) [29]. The relative kinship matrix, which shows the genetic relationships between the individuals, was calculated by TASSEL (v.5.0) based on the centered IBS method [30]. The Q matrix was obtained from the STRUCTURE software at K = 3. The associations between the SNP markers and pomological traits were visualized as Manhattan plots in R with the "qqman" package. Significant markers were designated in the same software, with the false discovery rate (FDR) [31] and Bonferroni correction [32] calculated separately for each pomological trait (FF, FFC, FT and SSC). Furthermore, the quantile-quantile (Q-Q) plots were produced with the same software.
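The two multiple-testing corrections mentioned above can be sketched in Python. The FDR step is shown here as the Benjamini-Hochberg step-up procedure, which is an assumption about the variant used; the significance level of 0.05 and the P values are likewise hypothetical and serve only to illustrate the logic.

```python
import math


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test P-value cutoff under the Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of tests declared significant by the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose P value is below its own threshold.
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


def neg_log10(p):
    """-log10(P), the scale used on Manhattan plots."""
    return -math.log10(p)


# Hypothetical marker P values, for illustration only.
example_p = [0.0001, 0.045, 0.011, 0.5, 0.02]
fdr_hits = benjamini_hochberg(example_p)
bonf_cut = bonferroni_threshold(len(example_p))
bonf_hits = [i for i, p in enumerate(example_p) if p <= bonf_cut]
print(fdr_hits, bonf_hits)
```

In this toy example the FDR procedure retains more markers than the Bonferroni cutoff, which illustrates why the two corrections can give different numbers of significant SNPs for the same trait.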

Identification of candidate genes
The sequences of the SNP markers associated with FF, FFC, and SSC were analyzed to determine the functions of the candidate genes using the Phytozome (v12.1) database.

Phenotypic variation
In the present study, FF, FFC, FT, and SSC were measured for two years (2016-2017), and showed normal distribution (these distributions are based on average by accessions) (Online resource 2). These findings indicated the importance of the genetic background of each genotype for the Prunus phenotyping profile. The minimum, maximum and mean values of each year showed high consistency, and no significant differences were observed between the years (2016 and 2017) for the mean values. The minimum, maximum and mean values of all phenotypic traits are presented in Table 1.
The mean values of the four traits (FF, FFC, FT and SSC) differed only slightly between the two years (2016 and 2017). However, there were fourfold differences among the SSC values within each year. Similar differences were also observed in the FFC values for each year (2016 and 2017) (Table 1). FT ranged from 95 to 125 days with a mean value of 114.2 days in 2016, and from 95 to 126 days with a mean value of 112 days in 2017 (Table 1). FF varied between 0.1 N and 9.60 N in 2016 and between 0.03 N and 6.62 N in 2017 (Table 1). There was a nearly 90-fold difference in the ranges obtained for FF from the two years.

Population structure analysis
A total of 24,864 SNP markers were generated from the DArT analysis (Online resource 5), and after filtering [max 5% missing data, Minor Allele Frequency (MAF > 0.5)], 11,532 high-quality SNP markers were obtained (Online resource 6). Because the apricot genome assembly was composed of scaffolds, the detected SNPs were located on scaffolds rather than chromosomes. The mean PIC value was 0.77, with values ranging between 0.05 and 0.99. These markers were assigned to the related scaffolds and were used in the STRUCTURE (v.2.2) analysis. This analysis was performed for K from 1 to 10, and the peak was observed at K = 3 according to the ΔK computation. The STRUCTURE results showed that the 259 genotypes were divided into three main populations, namely POPI (red), POPII (green) and POPIII (blue) (Online resource 7).
All these genotypes were also divided into three groups according to Nei's genetic distance analysis (Online resource 9). The first group consisted of Geno 185 (Nigde-Turkey) and Geno 186 (Malatya-Turkey), the second comprised Geno 38 (Siverek/Urfa-Turkey), Geno 230 (USA) and Geno 255 (Russia), and the third contained the remaining 254 genotypes. These results indicate that the genotypes used in this study were not clustered according to their geographical origin.
The expected heterozygosity and fixation index (Fst) are parameters that describe the heterozygosity level of a population. In this study, the expected heterozygosity was determined as 0.20 for Cluster 1, 0.06 for Cluster 2, and 0.11 for Cluster 3, with a mean value of 0.12. On the other hand, the Fst value varied between 0.14 and 0.81 with a mean value of 0.55, representing a high level of genetic variation in the population.
Three SNPs (−log10 P ≥ 3.27, FDR correction applied) were associated with FFC in 2016 and 13 SNPs (−log10 P ≥ 3.10, FDR correction applied) were associated with FFC in 2017. Three of these SNPs (SNP 4257, SNP 17194 and SNP 22875) were common to both 2016 and 2017 (Online resource 10 and Fig. 2).

Association mapping results
The marker-trait association analysis for FT revealed that it was associated with 10 SNPs (FDR correction applied, −log10 P ≥ 3.28) in 2016 and 22 SNPs (FDR correction applied, −log10 P ≥ 3.06) in 2017. However, none of these SNPs was detected in both years (Online resource 10 and Fig. 3).
For SSC, 167 SNPs (−log10 P ≥ 2.82, FDR correction applied) and 352 SNPs (−log10 P ≥ 2.72, FDR correction applied) were found to be associated in 2016 and 2017, respectively. Of these SNPs, 71 were detected in both years (Fig. 4 and Online resources 10 and 12). The P values representing the significance level of the associations between the markers and pomological traits are given as Q-Q plots in Online resource 12.

Identification of candidate genes
A total of 30 putative candidate genes were found to be related to the SNPs associated with FF and SSC (Online resource 13). The SNPs associated with the FFC trait did not show similarity to any of the putative candidate genes. For the SSC and FF of the apricots, the following proteins and enzymes related to putative candidate genes showed homology with SNPs (given in parentheses): putative 3,4-dihydroxy-2-butanone kinase (SNP526), putative leucine-rich repeat receptor-like protein kinase (SNP1023), pentatricopeptide repeat-containing protein (SNP1482), probable LRR receptor-like serine/threonine-protein kinase (SNP1494), vignain-like (SNP2823), transcription termination factor MTERF6 (SNP3309), LRR receptor-like serine/threonine-protein kinase GSO2 (SNP3842), WAT1-related All analyses conducted in the present study confirmed the results of the methods.

Phenotypic variation
Fruit quality parameters are of prime importance for both consumers and growers. Among these parameters, FT is the most widely studied physical attribute in apricot due to its economic importance related to early market prices [14]. In the present study, the FT of the genotypes ranged from 95 to 126 days (Table 1). Among the genotypes studied, Geno115 was the earliest, flowering at 95 days. Such genotypes can be used in further breeding studies for developing early and late genotypes. In previous studies, FT was reported to range from 59 to 84 days [8] and from 55 to 78 days [14]. The variation detected in the present study for FT was greater than that reported in the literature and can be attributed to the different genetic origins of the cultivars.
SSC is one of the main criteria affecting fruit taste, and an SSC value greater than 12 °Brix indicates good gustative quality [3]. In the present study, SSC ranged between 8.22 and 32.36 °Brix, most genotypes (243 genotypes) had a value over 12 °Brix, and there was nearly a fourfold difference in the range (Table 1). In different studies, SSC was reported as 8.73 to 17.80 °Brix [13], 10.6 to 16.2 °Brix [10] and 6.2 to 19.5 °Brix [14]. In our study, a larger variation was found in SSC compared to previous studies. This wide phenotypic range reflects the level of genotypic variation among the apricot genotypes used in the study.
FFC and FF are two of the most important sensorial properties that affect consumer preferences at the point of purchase. In the present study, FFC ranged between 11.6° and 46.77°, and there was nearly a fourfold difference in the range (Table 1). In previous studies, FFC was measured from 70° to 94.7° [10] and from 67° to 100° [13]. In the current study, FF ranged from 0.03 to 9.60 N. In the literature, FF was reported to vary between 24.9 and 62.2 N [3] and between 15 and 50 N [13]. Both the FFC and FF ranges measured in the current study were quite different from those of previous studies.

Population structure
In genetic mapping studies, associations between DNA markers and traits are affected by the type and number of markers [33]. The use of a high number of markers leads to high genome coverage, and therefore high-throughput systems gain importance. In the current study, DArT, a high-throughput system, was used; a total of 20,264 SNP markers were developed, and after filtering, 11,532 high-quality SNPs were used in further analyses. Forcada et al. [18] developed 8144 SNPs on 94 peach genotypes. Compared with that study, our sequencing results provide high genome coverage.
The STRUCTURE analysis was used to identify the population structure of the 259 apricots investigated in the current study. These genotypes were divided into three main populations (Online resource 7). The structure (Online resource 7) and dendrogram (Online resource 8) analyses show that these genotypes were not divided into populations according to their geographical origin. For example, the genotypes that originated from Turkey, Spain, Italy, Poland, Armenia, France, USA, and Hungary were included in the same (third) group (Online resource 8). The reason for this result could be the complex breeding history of these genotypes. In particular, the use of cultivars with different histories in introgression and intercrossing processes may have led to this situation [33]. In addition, humans move plants from one geographic realm to another, which results in confusion concerning the origin of the plants [33]. Similar to results from the current study, Li et al. [34] also found five different groups in their study. Although fewer accessions were used in their study than in the current study, they generated more markers. They attributed the high number of SNPs to the use of genetically diverse wild genotypes [34]. In a previous AM study on apricot, 72 genotypes were used, and the genotypes were divided into two main groups [19]. The reason for the lower number of subpopulations in that study could be the use of a population with a narrow genetic basis. Although the previous authors also selected the genotypes from different countries, they may have used those of the same origin.
In the present study, the mean Fst value was 0.55 and the mean expected heterozygosity was 0.12, indicating the presence of high genetic variation in the population structure. These findings also support the idea that DArT systems develop a large number of SNP markers distributed along the apricot genome. In previous studies, the Fst value was reported to be 0.51 [35] and 0.16 [36], and the expected heterozygosity value was 0.82 [49] and 0.29 [50]. These differences among studies in terms of diversity may be due to the number and type of markers utilized, and to genotyping being undertaken in distinct locations [37].

Association mapping analysis
AM is a powerful technique that relies on the genetic variability accumulated through evolution in natural populations to identify DNA markers based on the association between genetic markers and phenotypes [38]. AM analyses are used as a rapid and efficient alternative to linkage mapping analyses [16] for detecting the associations between traits and markers; therefore, these analyses are widely employed in the mapping of economically important traits in many crop species [18]. In the present study, the K and Q matrices were used to correct for population structure in the MLM (Q + K) model included in the AM analyses. This model effectively eliminates possible false positives with random and fixed effects according to Henderson's notation [39]. In addition, FDR and Bonferroni corrections were applied to eliminate spurious associations [39]. To date, no AM study has been undertaken to reveal the associations between SNP markers and pomological traits (FF, FFC, FT and SSC). However, an association map was constructed by Mariette et al. [19] to identify the SNP markers that controlled resistance to plum pox virus in apricot. Furthermore, Olukolu [15] conducted an AM study on the chilling requirements of apricot. Apart from these studies, there are only a limited number of association studies on the economically important traits of the other members of the Rosaceae family [16][17][18][40]. In the present study, a total of 131 SNPs were found to be significantly associated with three pomological traits (FF, FFC and SSC) of the apricot genotypes via AM analyses (Online resource 9-11 and Figs. 1, 2, 3, 4). Three SNPs (SNP 4257, SNP 17194 and SNP 22875) were found to be associated with FFC and 57 SNPs with FF (Online resource 9-11 and Figs. 1, 2, 3, 4). A total of 71 SNPs were associated with SSC in two consecutive years (2016 and 2017) (Online resource 11).
In previous studies, Salazar et al. (2013) used a controlled cross population of apricot and found QTLs in one linkage group for SSC and two linkage groups for FFC [13]. Socquet-Juglard et al. [14] also used a controlled cross population and found that one linkage group was related to SSC. Campoy et al. [7] found one linkage group for FF. The number of significant markers identified in the present study was higher than previously reported [13,14], which may be related to the type of population investigated. In addition, these previous studies used SSR markers, which makes it difficult to compare the results directly. In mapping studies, natural populations provide higher genome coverage and mapping resolution owing to their wide genotypic variation [39]. Another reason for our high number of SNPs may be the use of DArT to produce the markers. This technology is known to produce a high number of SNP markers and thus provide high genome coverage.

Identification of candidate genes
In the present study, 30 putative candidate genes showed homology with the sequences of SNPs associated with FF and SSC (Online resource 13). Among these, transcription termination factor MTERF6 plays an important role in plastid development in Arabidopsis thaliana [41]. WAT1-related protein is located on the cell wall and is responsible for transmembrane transporter activity [42]. Long-chain acyl-CoA synthetase 8 is very important for lipid metabolism [43]. UDP-glycosyltransferase TURAN is one of the enzymes responsible for the development of the pollen tube. LRR receptor-like serine/threonine-protein kinase (GSO2) and probable LRR receptor-like serine/threonine-protein kinase (GSO) together play a role in root growth and the growth of the epidermal surface in embryos and cotyledons in Arabidopsis thaliana [44]. Putative pentatricopeptide repeat-containing protein is a member of the pentatricopeptide repeat (PPR) protein family and is involved in organellar RNA metabolism [45]. Calcium-transporting ATPase 12, plasma membrane-type-like, functions in calcium-transporting ATPase activity and calmodulin binding [46]. tRNA (guanine(37)-N1)-methyltransferase 1 is responsible for the methylation of cytoplasmic and mitochondrial tRNAs at the N1 position of guanosine-37 in Arabidopsis thaliana [47]. TMV resistance protein N-like provides resistance to tobacco mosaic virus in plants [48]. Putative 3,4-dihydroxy-2-butanone kinase plays a role in ATP binding [49]. Pentatricopeptide repeat-containing protein is required for the intergenic processing between the chloroplast rps7 and ndhB transcripts [50].

Conclusions
This is the first AM study that presents the associations between SNPs based on the DArT technology and economically important traits (FF, FFC, FT and SSC) in apricot. Large variations were determined for these four traits. The results of this study highlight the importance of using populations with wide variations in AM studies. AM revealed significant associations for FF, FFC and SSC. The SNPs identified in the study can be used in future breeding programs for marker-assisted selection in apricot. On the other hand, the genotypes with the earliest FT can be used as parents in developing early cultivars combined with other desirable traits.