Copy number variations shape the structural diversity of Arabidopsis metabolic gene clusters and are associated with the climatic gradient

Background Metabolic gene clusters (MGCs) encode at least three different enzymes for a common biosynthetic pathway. Comparative genome analyses highlighted the role of duplications, deletions and rearrangements in MGC formation. We hypothesized that these mechanisms also contribute to MGC intraspecies diversity and play a role in adaptation. Results We assessed copy number variations (CNVs) of four Arabidopsis thaliana MGCs in a population of 1,152 accessions, with experimental and bioinformatic approaches. The MGC diversity was lowest in marneral gene cluster (one private deletion CNV) and highest in the arabidiol/baruol gene cluster where 811 accessions had gene gains or losses, however, there were no presence/absence variations of the entire clusters. We found that the compact version of thalianol gene cluster was predominant in Arabidopsis and more conserved than the noncontiguogus version. In arabidiol/baruol cluster we found a large insertion in 35% of analyzed accessions, that contained duplications of the reference genes CYP705A2 and BARS1. The BARS1 paralog, which we named BARS2, encoded a novel oxidosqualene synthase. Unexpectedly, in accessions with the insertion, the arabidiol/baruol gene cluster was expressed not only in roots but also in leaves. Additionally, they presented different root growth dynamics and were associated with warmer climates compared to the reference-like accessions. We also found that paired genes encoding terpene synthases and cytochrome P450 oxidases had higher copy number variability compared to non-paired ones. Conclusions Our study highlights the importance of intraspecies variation and nonreference genomes for dissecting secondary metabolite biosynthesis pathways and understanding their role in adaptation and evolution.

They altogether form four MGCs: three on chromosome 5 and one on chromosome 4. Here we 1 present the extent of intraspecies diversity of these MGCs in a population of over 1,000 natural 2 accessions. We focused on gene duplications and deletions since they may directly alter the genetic 3 composition of MGCs and thus have a strong possible impact on the MGC-encoded metabolic 4 pathways. 5

7
Arabidopsis MGCs are located in highly variable regions, but only two of them display 8 substantial gene copy number polymorphism 9 We started our analysis from searching both the literature and published gene coexpression datasets 10 to include confirmed and putative gene members in each MGCs' boundaries, as briefly summarized 11 below and in Additional file 1: Table S1. Next, we aligned each MGC with the common copy 12 number variants (CNVs), previously identified by the analysis of short read whole genome 13 sequencing (WGS) data, to assess their intraspecies diversity (Fig. 1A, Additional file 1: Table S2) 14 [14]. 15 16 Thalianol gene cluster was initially identified as a set of four "core" genes: thalianol synthase gene 17 THAS1, two CYPs, CYP708A2 and CYP705A5, as well as BAHD acyltransferase gene family 18 member AT5G47980, the latter two being separated from each other by a noncoding transcribed 19 locus AT5G07035 [15,16]. Recently, additional genes required for thalianol conversion to thalianin 20 were identified [11]. One of them, AT5G47950, encodes an acyltransferase and is located close to 21 AT5G47980 (separated by two intervening genes, RABA4C and AT5G47970) and the other two 22 (oxidoreductases SDR4 and AT1G66800) are located on different chromosomes and were not 23 included in our analysis. Accordingly, the estimated size of the entire thalianol gene cluster in the 24 reference (Col-0) genome is ~45 kb (Additional file 2: Fig. S1A). We found that this MGC was 25 overlapped by six CNVs of various sizes (0.6 kb to 240.7 kb). The smallest one was located in the 26 intergenic region. The largest CNV encompassed the whole cluster and additionally extended 1 upstream by 195 kb (regarding plus genome strand). The remaining variants were from 10.1 to 26.6 2 kb in size and each overlapped with some but not all thalianol cluster genes. CNV_18605 entirely 3 overlapped with THAS1 and CYP708A2 and partially with CYP705A5. CNV_18604 completely 4 covered AT5G47980, CYP705A5 and the noncoding locus AT5G07035 located between them. 5 CNV_18606 and CNV_18608 covered single genes, CYP708A2 and THAS1, respectively. 6 7 Marneral gene cluster is ~35 kb in size and is the most compact plant MGCs described to date 8 (Additional file 2: Fig. S1B). It is made up of three genes: MRN1 (encoding marneral synthase), 9 CYP71A16 (encoding marneral oxidase) and CYP705A12 (function unknown) [17][18][19]. MRN1 is 10 separated from the rest of the cluster by intervening noncoding transcribed loci AT5G00580, 11 AT5G06325 and AT5G06335. We found 15 CNVs, each smaller than 10 kb, scattered across this 12 MGC region and together covering nearly 52% of its length. However, the CNV arrangement was 13 strikingly different from any other MGC, since all variants were intergenic and did not overlap with 14 the marneral cluster genes. 15 16 The tirucalladienol gene cluster includes five members. Of them, PEN3 encodes triterpene synthase 17 involved in the production of tirucalla-7,24-dien-3β-ol [20]. Three other genes in this cluster belong 18 to the CYP716 family. Two of them, AT5G36130 and CYP716A2, have been recently postulated to 19 jointly form one locus and encode a single cytochrome protein with unknown function [21]. The 20 third one, CYP716A1, encodes an enzyme that is involved in the hydroxylation of tirucalla-7,24-  Fig. S1C). This MGC overlapped with 17 CNVs, which together covered nearly 80% of its 1 length. The longest variant, CNV_17742 (69.7 kb), overlapped with only the 5' end of the cluster, 2 covering CYP716A1, AT5G36130 and the intervening genes between them. CNV_17765 was 19.1 3 kb in size and overlapped with CYP716A2, PEN3 and a nearby intervening gene. Six CNVs 4 partially overlapped with one cluster gene only (CYP716A1, AT5G36130, CYP716A2 or PEN3) and 5 sometimes extended to adjacent intervening genes. The remaining variants were small (0.6 kb to 3.4 6 kb in length) and entirely located in the intergenic regions. It is worth to notice that upstream of the 7 tirucalladienol gene cluster, a region genetically divergent from the surrounding genomic segments, 8 called a hotspot of rearrangements, was previously discovered [24]. Another, smaller hotspot of 9 rearrangements was found between CYP716A1 and AT5G36130. The presence of these two 10 divergent regions stayed in agreement with the higher frequency of CNVs in this location. On the 11 contrary, the 3' region of the tirucalladienol gene cluster, including SCPL1, was not overlapped by 12 any CNV. 13 14 The arabidiol/baruol gene cluster is the only known MGC located on Arabidopsis chromosome 4. It 15 is a complex region, which encompasses two closely located OSCs (PEN1 and BARS1), sharing 16 91% similarity at the amino acid level. PEN1 encodes a synthase involved in the production of 17 arabidiol [25]. Adjacent to PEN1 is CYP705A1, involved in arabidiol degradation upon jasmonic 18 acid treatment, which leads to releasing a volatile homoterpene, (E)-4,8-dimethyl-1,3,7-nonatriene 19 (DMNT) [26]. When PEN1 and CYP705A1 were coexpressed in N. benthamiana along with 20 distantly located genes AT1G66800, SDR4 and AT5G47950 (which are also involved in the 21 thalianol biosynthetic pathway), they acted together to convert arabidiol to arabidin [11]. BARS1 22 encodes a multifunctional cyclase, which produces baruol as its main product [27]. The genomic 23 region surrounding PEN1 and BARS1 contains seven additional CYPs (CYP702A2, CYP702A3, 24 CYP705A2, CYP705A3, CYP705A4, CYP702A5, CYP702A6) and two acyltransferase genes 25 (AT4G15390, BIA1). Their role in the arabidiol/baruol metabolism has not been determined. 26 However, they displayed coexpression with either PEN1 or BARS1 and are included in the MGC 1 [16,23]. Accordingly, the estimated size of the arabidiol/baruol gene cluster region is 83 kb 2 (Additional file 2: Fig. S1D). There are few intervening loci there, including a protein-coding gene 3 CSLB06, two pseudogenes CYP702A4P and CYP702A7P and one novel transcribed region 4 AT4G06325. Arabidiol/baruol gene cluster overlapped with 24 CNVs (0.6 kb to 21 kb), which 5 covered about 53% of its length. CNVs were grouped in three distinct regions, separated by 6 invariable segments. In the first variable region, which also co-localized with a hotspot of 7 rearrangement [24] there were eight CNVs and three partially overlapped with CYP702A2 or 8 CYP702A3. In the second variable region there were four CNVs, three of which had some overlap 9 with cluster genes (CYP705A2, CYP705A3, BARS1). Notably, only CNV_14660 fully overlapped 10 all three of them. The third variable region was covered by 12 CNVs, but they were mostly 11 intergenic and only three overlapped with genes (CYP702A5 or CYP702A5 and CYP702A6). 12 CYP705A1, PEN1, CYP705A4, AT4G15390 and BIA1 were not covered by any common CNV.

14
A large diversity of CNVs and their overlap with MGC genes indicated there might be considerable 15 differences in triterpene biosynthesis pathways encoded by various Arabidopsis accessions. In order 16 to assess this putative diversity, we analyzed the deletions and duplications of individual genes. We 17 retrieved copy number data for 31 genes (clustered and intervening genes), each from 1,056 18 accessions (listed as RD dataset in Additional file 1: Table S3) [14]. We validated and 19 supplemented these data with multiplex ligation-dependent amplification (MLPA) and droplet 20 digital PCR-based genotyping (ddPCR) assays. MLPA covered 13 genes from thalianol, marneral 21 and arabidiol/baruol clusters and were performed for 232 accessions, of which 136 were in common 22 with RD analysis (MLPA dataset in Additional file 1: Table S4; Additional file 2: Fig. S2-S3) 23 while ddPCR covered four genes from thalianol cluster and were performed for 20 accessions, all in 24 common with RD and/or MLPA analyses (ddPCR dataset in Additional file 1: Table S5). We 25 defined the thresholds for detecting duplications and deletions for each data type. Next, we assigned 26 9 the copy number status of each gene in each accession ("REF", "LOSS" or "GAIN") by combining 1 all three datasets (Additional file 1: Table S6). Out of 1,784 genotypes assigned with two or three 2 approaches, 1,763 were fully concordant (98.8% congruency) and 19 out of 21 discrepancies were 3 resolved manually (see Additional file 1: Table S7, Additional file 2: Fig. S4 and Methods). The 4 combined genotyping data were further used to asses and compare the levels of MGC variation. 5 6 Among the 1,152 accessions, only 28.6% had no gene gains or losses in any MGC (Fig. 1B). This 7 included 65% of assayed accessions from the German genetic group and 39% of accessions from 8 the Central Europe group. In turn, the vast majority (at least 90%) of accessions from groups known 9 to be genetically distant from the reference genome (North Sweden, Spain, Italy-Balkan-Caucasus, 10 and Relict groups) displayed gene CNV in at least one MGC. We note that the real number of 11 invariable accessions could be even lower since for 96 of them, all genes in the tirucalladienol 12 cluster and some in the arabidiol/baruol cluster were not genotyped. Altogether, 19 MGC genes 13 were affected: four in the thalianol cluster, one in the marneral cluster, three in the tirucalladienol 14 cluster and 11 in the arabidiol/baruol cluster (Fig. 1C). The latter was also most variable in terms of 15 the number of accessions carrying CNVs and the diversity of CNV patterns. For two genes we 16 detected only copy gains, six genes were multiallelic (with both gains and losses) and for 11 genes 17 we observed only copy losses. As expected, genes overlapping the previously defined CNV regions 18 showed the highest variability in our assays. Remarkably, we did not observe complete loss or gain 19 of entire MGC in any accession. In the next step, we inspected in more detail the level of diversity 20 of each MGC.

22
The compact version of thalianol gene cluster is more common and has lower variation than 23 the reference-like noncontiguous version 24 Survey in 1,152 accessions with a combination of RD, MLPA and ddPCR approaches revealed 49 25 accessions with gene deletions and 5 accessions with gene duplications in the thalianol gene cluster. 26 Except for AT5G47950, which was invariable, all genes in this cluster were affected by CNVs, 1 which followed five distinct patterns ( Fig. 2A). Three of these patterns (deletions only) were 2 previously identified by browsing the polymorphisms listed in 1001 Genomes Project database 3 [28,29]. These deletions were found in 21 accessions (~2%) and 19 of them were also detected in 4 our study (we did not analyze the other two accessions). The most common variant (variant A) was 5 the deletion of a region encompassing AT5G47980 and CYP705A5, combined with the deletion of 6 THAS1. We detected it in 37 accessions from six countries: Sweden (13), Italy (8), Germany (6), 7 Spain (5), Bulgaria (3) and Portugal (2). We also confirmed the existence of two previously 8 reported rare variants. The first one (variant B) was a large deletion spanning AT5G47980, 9 CYP705A5 and CYP708A2. We found it in two accessions from Germany (Bch-1, Sp-0), one from 10 Italy (Etna-2) and one from Spain (IP-Mon-5). The second rare variant (variant C) was a deletion of 11 a single gene, CYP708A2, which we found in four accessions from Spain (Can-0, Ped-0, IP-Her-12 12 and Nac-0) and one from Portugal (IP-Mos-1). Of note, four accessions in this group represented 13 the genetically differentiated lineage of Relicts. We also found a new type of deletion in two Relicts 14 (IP-Rel-0 and Con-0) and one non-Relict (IP-All-0), all originating from Spain. This deletion and one from Greece (Olympia-2). The duplication was detected with both RD and MLPA assays 20 and we confirmed it by sequence analysis of the de novo genomic assembly for Mitterberg-2-185, 21 which we generated with Nanopore sequencing data (Additional file 2: Fig. S6). The duplicated 22 region was ~3 kb in size. It spanned AT5G47980 and its flanks (0.5 kb upstream and 0.7 kb 23 downstream). In Mitterberg-2-185, the sequence copies were positioned in tandem and differed 24 from each other by only two mismatches and a 1-bp gap. Accordingly, duplicated genes encoded 25 identical proteins with 99% identity to the reference AT5G47980 acyltransferase. Although they 26 were shorter than the reference protein (404 aa versus 443 aa), they possessed a complete 1 transferase domain (pfam02458). AT5G47950 -the gene encoding another acyltransferase in the thalianol gene cluster was stable in 4 copy number. However, we detected a chromosomal inversion (with respect to the reference 5 genome orientation), spanning this gene and two intervening genes, RABA4C and AT5G47970, in 6 Mitterberg-2-185 de novo assembly. This inversion resulted in a more contiguous and compact 7 organization of the entire cluster (Fig. 2B). Similar inversions were previously detected in PacBio- genetic groups as well as from Asia group (83.6% to 88.9%) (Fig. 2C). It was also more frequent 18 in Central and Western Europe groups as well as in Germany group, except for U.S.A accessions, in 19 which we mainly observed the discontiguous version. Apart from USA, the reference-like cluster 20 was slightly more abundant also in Spain genetic group. In Relictsthe group of individuals 21 considered to represent different glacial refugia, genetically diverted from the majority of 22 worldwide accessions [28], there was similar frequency of discontiguous and compact versions (12 23 and 10 accessions, respectively). Interestingly, the CNV frequency substantially differed between the accessions with and without 1 inversion (Fig. 2D, E). Copy number changes affected only 1.1% of accessions with the compact 2 cluster version, including four out of five cases of variant E (AT5G47980 duplication) and three out 3 of four cases of variant B (AT5G47980, CYP705A5 and CYP708A2 deletion). The remaining CNVs, 4 including all deletions spanning THAS1, were found exclusively among the accessions with the 5 reference-like cluster type. Altogether, 12.7% of accessions with the discontiguous cluster type 6 were affected by CNVs.  accessions, all three genes were absent. This included Ty-1, for which we analyzed de novo 25 genomic assembly and confirmed the gene loss, as well as Draha2 and Jl-3, for which we observed 26 13 numerous read pairs spanning the deletion site. In seven other accession, AT5G36130 or CYP716A2 1 or both were missing, including Dolna-1-40, for which we confirmed AT5G36130 loss with the 2 Nanopore sequencing and analysis of de novo genomic assembly (Additional file 2: Fig. S9). 3 According to a recent study, AT5G36130 and CYP716A2 gene models are misannotated and they 4 jointly encode a single full-length protein of the CYP716A subfamily with cytochrome oxidase 5 activity [21]. Therefore, it should be noted that all 15 accessions with CNVs in the tirucalladienol 6 gene cluster, lacked a functional CYP716A2 gene. coverage, there were apparent differences in the variation frequency between the clustered genes. 13 Two CYP702A family members located at the cluster's 5'end were highly variable.  The two OSCs, PEN1 and BARS1, differed in variation between each other but had similar variation 22 frequencies as their neighboring CYPs. CYP705A1 and PEN1, both implicated in the arabidiol 23 biosynthesis pathway, were stable in copy number, in agreement with their localization outside any 24 common CNVs. Both genes were deleted only in IP-Deh-1 accession (Spain), while in Qui-0 (also 25 Spain) PEN1 was deleted and CYP705A1 status was not evaluated. Additionally, in Kyoto both 26 genes were partially deleted, which we confirmed by analyzing available PacBio-based de novo 1 genomic assembly of this accession [24]. 2 3 BARS1, CYP705A2 and CYP705A3 were all located within the central CNV segment. All three 4 genes were deleted in some accessions originating from Sweden. We also observed smaller 5 deletions or duplications in this segment, of which the most remarkable was the duplication of 6 CYP705A2, detected in 433 (37.6%) accessions. However, genotypic data for CYP705A2 and 7 BARS1 were noisy and we reasoned there was more variation in this region than could be revealed 8 by our standard genotyping. Therefore, we manually inspected WGS reads which mapped in this  Fig. S3). Surprisingly, we also observed a mix of reads which mapped to 13 CYP705A2 and BARS1 loci with and without mismatches in a large number of accessions. Thus, we 14 called SNPs in the coding sequences of both genes to get more information on their diversity. As 15 expected, numerous heterozygous SNPs were called in both genes in the above accessions. Because

16
Arabidopsis is a self-pollinating species and therefore highly homozygous, we hypothesized that the 17 reads with the mismatches originated from different loci (gene duplicates), with similarity to 18 CYP705A2 and BARS1, respectively. WGS reads from the accessions with the duplications would 19 then generate pseudo-heterozygous SNP calls when mapped to the reference gene models. 20 Additionally, since we were able to detect copy gain of CYP705A2 in hundreds of accessions but 21 found only one duplication of BARS1, we concluded that the sequence differences between BARS1 22 and its duplicate prevented its detection by RD or MLPA assays. In support of our hypothesis, in 23 accessions with CYP705A2 duplication, we found some heterozygous SNPs both at CYP705A2 and 24 BARS1 loci. Out of 394 accessions with CYP705A2 duplication, 357 (90.6%) had at least one 25 heterozygous SNP called in this locus, compared to only 69 out of 646 (10.7%) accessions without 26 changes in CYP705A2 copy number (Wilcoxon rank sum test with continuity correction, p.value 1 <2.2×10 -16 ) (Additional file 2: Fig. S11A). Moreover, the numbers of heterozygous SNPs at 2 CYP705A2 and BARS1 were strongly correlated (Pearson's r = 0.86) (Additional file 2: Fig. S11B). 3 We also reasoned that the absence of reference CYP705A2 and BARS1 genes and the presence of  To identify the hidden BARS1 duplicate, we analyzed PacBio-based genomic assemblies of seven 10 accessions: An-1, Cvi-0, Kyoto, Ler-0, C24, Eri-1 and Sha [24], four of which have been also 11 genotyped in our study (Fig. 3A). In our assays, An-1 and Kyoto had no changes in CYP705A2 and 12 BARS1 copy number and no heterozygous SNPs were found, while in Cvi-0 and Ler-0 CYP705A2 13 was duplicated and heterozygous SNPs in both genes were present. We re-annotated the entire 14 cluster region in each accession and compared it with the reference (Additional file 1: Table S9). 15 In six accessions, BARS1 lacked the largest intron present in the reference gene model, in agreement 16 with our observations (Additional file 2: Fig. S12). In Cvi-0, Eri-1 and Ler-0 we identified a 17 nonreference gene, encoding a protein with ~91% identity to baruol synthase 1 (Additional file 2: 18 Fig. S13). In C24, this gene was present but interrupted by ATCOPIA52 retrotransposon insertion, 19 resulting in two shorter ORFs, of which only one has been annotated previously. Based on 20 phylogenetic analysis and sequence comparison with BARS1 ortholog from a related species 21 Arabidopsis lyrata, we concluded that the identified gene was a BARS1 duplicate and we named it 22 BARS2 ( Fig. 3B). At the nucleotide level, the differences between BARS1 and BARS2 accumulated 23 in the introns, which likely prevented mapping of many BARS2-derived reads to BARS1 locus. The 24 differences in the exons very well matched the heterozygous SNP positions, which we observed 25 (Additional file 2: Fig. S14). MLPA probe targeting BARS1 was located in a highly divergent 26 region which prevented it from binding to the BARS2 genomic sequence. These observations 1 explained why we did not detect BARS1 duplication in our assays. Full-length proteins encoded by BARS2 in Cvi-0, Eri-1 and Ler-0 possessed both N-terminal and C-4 terminal squalene-hopene cyclase domains, typical for OSCs (Fig. 3C). We performed three-5 dimensional (3D) modelling of baruol synthase 1 from Col-0 (reference) and Cvi-0 (which is a 6 Relict) as well as putative baruol synthase 2 from Cvi-0, using a machine learning approach, 7 implemented in ColabFol software (Additional file 2: Supplementary Methods) and found that all 8 structures were highly similar (Additional file 1: Table S10). Additionally, we superposed our 3D 9 models with the experimental,crystal structure of human OSC, available in a complex with its 10 reaction product lanosterol [30,31]. This approach allowed us to identify potential substrate-binding 11 cavities ( Fig. 3D; Additional file 3). Notably, the catalytic aspartate residue D455 present in 12 human cyclase had its counterparts in plant OSCs: D493 in Col-0 baruol synthase 1 (isoform 13 NP_193272.1) and D490 in Col-0 baruol synthase 1 (isoform NP_001329547.1), Cvi-0 BARS1 14 (ATCVI-4G38020) and Cvi-0 BARS2 (ATCVI-4G38110). This indicated that BARS2 encoded a 15 novel, so far uncharacterized OSC in the Arabidopsis genome. As expected, we also found 16 CYP705A2 duplicates in C24, Cvi-0, Eri-1 and Ler-0 assemblies (84% identity at the nucleotide 17 level, 79% identity and 88% similarity at the protein level) (Additional file 2: Fig. S15). This gene, 18 which we named CYP705A2a, was adjacent to BARS2 and located on the minus strand of the large 19 genomic sequence insertion between CYP702A6 and BIA genes. The inserted sequence also 20 contained a ~5 kb long interspersed nuclear element 1 (LINE-1) retrotransposon and some shorter,  Copy number changes of CYP705A2 and BARS1 are associated with climatic gradient and 25 root growth variation 26 In the next step, we combined the genotyping results with the SNP analysis for all accessions, in 1 order to determine the copy number status of CYP705A2 and BARS1 and their duplicates (see 2 Additional file 1: Table S11 and Methods for details). We distinguished four groups of accessions results confirmed the differences between these groups (Additional file 2: Fig. S16). We noted that 18 we did not detect BARS2-specific products in many samples from the AA-PP group. However, the 19 band for CYP705A2a was clearly present in these accessions. We suppose that the BARS2 sequence 20 might further diverge in this group. Genome sequencing and assembling for representative AA-AA 21 and AA-PP accessions will be needed to further evaluate the structure of arabidiol/baruol cluster 22 region in these two minor groups. The accessions with the nonreference gene pair (AA-PP and/or PP-PP) were present among all 25 genetic groups but with a different frequency (Fig. 4B). They dominated among the Relicts (81%), 26 Spain (60%) and Italy/Balkan/Caucasus (89.6%) groups but constituted the minority of accessions 1 in the groups from north and east margins of the species range (North Sweden 18.6%, South 2 Sweden 16%, Asia 9.4%). They were also mostly absent among U.S.A accessions. The presence of 3 CYP705A2a and BARS2 genes in the majority of Relicts suggested that the gene duplication event 4 preceded the recent massive species migration, which took place in the post-glacial period and 5 shaped the current Arabidopsis population structure [32]. We visualized the four groups in the 6 principal component analysis (PCA) plots, generated with genome-wide bi-allelic SNPs, by 7 applying varying linkage disequilibrium (LD) parameter (Additional file 2: Fig. S17). At low LD, 8 when the contribution of the ancestral alleles to PCA was highest, we observed a clear convergence 9 of the PC1 and PC2 components with the presence/absence of duplicate gene pair (Fig. 4C). We 10 also observed that the accessions possessing CYP705A2a and BARS2 genes were associated with 11 significantly lower latitudes compared to the remaining accessions (one-way rank-based analysis of 12 variance (ANOVA, p.value<0.001) followed by Dunn's test with BH correction (p.value<0.001)) 13 (Fig. 4D). In case of two most abundant groups, namely PP-PP and PP-AA, this difference was  Motivated by the above observations, we decided to evaluate the possible association between 21 arabidiol/baruol gene cluster variation and environmental conditions. Since the climate is a 22 substantial selection factor, we also checked for the phenotypic variability between PP-AA and PP- 23 PP accessions. To this end, we performed two-group comparisons of 516 phenotypic and climatic 24 variables retrieved from the Arapheno database [34] and focused on these which significantly 25 differed between both groups (Wilcoxon rank sum test with continuity correction, p.value <0.05) 26 (Additional file 1: Table S12). The most prominent were 88 variables describing the climatic 1 conditions at the accessions' sites of origin (Fig. 5A), especially maximal and minimal temperature 2 conditions, precipitation and evapotranspiration [35]. These results confirmed that the differential 3 geographical distribution of accessions between PP-AA and PP-PP groups was strongly associated 4 with the climatic gradient. 5 6 Apart from the climate data, 40 diverse phenotypes varied significantly between both groups. At 7 this point, we cannot exclude that these differences might be influenced by another genetic factor, 8 independent from the arabidiol/baruol cluster structure. For example, the different flowering times 9 between both groups could be related to the known latitudinal allelic variation of flowering time-10 associated genes [36]. However, we paid special attention to root growth-related phenotypes since 11 all Arabidopsis MGCs are considered to have root-specific expression. Moreover, the mutation of 12 AT5G47950 acyltransferase gene (involved in the thalianin and arabidin biosynthesis) was 13 previously shown to negatively impact the root growth [37]. We observed significant differences in 14 root growth dynamics between PP-AA and PP-PP groups during the first week after germination 15 [38]. More specifically, roots of PP-PP accessions elongated slower compared to PP-AA 16 accessions (Fig. 5B). Additionally, PP-PP accessions showed a significantly lower rate of root 17 organogenesis from explants, compared to PP-AA accessions, under one of three growth conditions 18 tested in another study (Fig. 5C) [39]. We also checked for the genetic interactions between 19 thalianol and arabidiol/baruol clusters for these phenotypes since the distribution of discontiguous 20 and compact versions of this cluster was also related to geography. However, despite the strong 21 association between the thalianol gene cluster contiguity and the latitude, the variation in thalianol 22 gene structure did not explain the variation in root growth phenotypes, contrary to the presence vs.  In a highly structured population, such as that of Arabidopsis, the correlation of allele frequency 1 with geography and phenotypic traits may not only result from the selection but may be also statistical significance for any variable we tested, still the lowest p-values were obtained for PP-AA 10 and PP-PP groups' association with the climatic data and root organogenesis phenotypes (Fig. 5D,   11 Additional file 1: Table S12). 12 13 In the reference accession Col-0, all genes in the arabidiol/baruol cluster are low-expressed and  Table S13). In both 23 accessions, the clusters were silenced in shoots, except for acyltransferase gene AT4G15390, which 24 displayed low activity in Cvi-0. Also, in both accessions, the clusters were active in roots, where the 25 expression of AT4G15390 was much stronger than the remaining genes. In Cvi-0, genes located in 26 the genomic insertion (CYP705A2a, BARS2 and ATCVI-4G38100 which encodes a protein with 1 partial similarity to acyltransferase) were also expressed, although at lower level compared to the 2 rest of the cluster. Surprisingly, we detected expression of arabidiol/baruol clusters genes in leaves 3 of Cvi-0, which indicated different tissue specificity of its arabidiol/baruol gene cluster compared to 4 Col-0. Outside of the genomic insertion, the clustered gene expression level was much lower 5 compared to the roots, and the transcripts of CYP705A2, CYP705A3 and BARS1 were barely 6 detectable. However, in Cvi-0, the transcription levels of ATCVI-4G38100, CYP705A2a and BARS2 7 genes localized in the insertion were similar in the leaves and roots. The expression of CYP705A2a 8 and BARS2 in leaves of this accession was higher than the expression of CYP705A2 and BARS1. 9 Accordingly, it can not be excluded that the metabolic products of arabidiol/baruol gene cluster 10 activity in the roots and leaves of Cvi-0 accession are not identical.  (Fig. 5F). This observation stayed in agreement with our predictions that some reads from BARS2 22 transcripts mapped to BARS1 locus, elevating the measured expression level in leaves of accessions 23 from these two groups. To confirm that, we additionally re-mapped the raw RNA-Seq reads from 24 four accessions: Ty-1 and Cdm-1 (both representing the PP-PP group) as well as Kn-0 and Sha (the 25 PP-AA group) to their respective genomic assemblies, in which we annotated the genes of interest 26 (Additional file 2: Figure S20B). We analyzed the transcription levels of CYP705A2, BARS1, 1 CYP705A2a and BARS2 and confirmed that in the leaves of PP-PP accessions BARS2 was 2 expressed, while BARS1 was not. 3 4 The proximity of terpenoid synthase and cytochrome P450 genes in the Arabidopsis genome is 5 associated with increased copy number diversity 6 In many plant genomes, genes encoding terpenoid synthases (TSs, this includes OSCs analyzed in 7 our study) are positioned in the vicinity of CYPs more often than expected by chance. Therefore, 8 they frequently exist as TS-CYP pairs [22]. According to our results, TS-CYP pairs located in 9 MGCs had similar (either high or low) diversity in copy number and the paired genes were 10 frequently duplicated or deleted together. We wanted to check whether copy number polymorphism 11 is a common feature of TS-CYP pairs in the Arabidopsis genome. We created a comprehensive list   this cluster origined by duplications in Oryza genus, which was followed by their organizing into a 24 cluster. Moreover, intact DGC7 was was located in a selective sweep region of japonica subspecies 25 and was highly enriched in this subpopulation, while in many indica accessions it was absent or 26 incomplete, suggesting positive selection of DGC7 in japonica but not indica during domestication. 1 These studies suggested that plant MGCs are dynamically evolving and the genetic mechanisms 2 that originally lead to their formation may be captured at the intraspecies genetic variation level. 3 Similar conclusions were drawn from the previous study of a filamentous fungus Aspergillus 4 fumigatus, in which the secondary metabolic pathway genes are commonly organied into clusters. 5 In that species, substantial MGC genetic diversity was observed among 66 strains. The main 6 variants included, apart from point mutations and horizontal gene transfer, also the gain and loss   . By CNV analysis in ~1,000 accessions we found that in 23 Arabidopsis, the physical proximity of CYPs and TSs was associated with increased rates of these 24 genes' duplications and deletions compared to non-paired ones. It might suggest that the occurrence 25 of such a specific gene mix, combined with the structural instability of its genomic neighborhood, 26 boosted the potential to produce novel metabolic pathways [10]. We also revealed that Arabidopsis 1 MGCs had different levels of variation and these differences reflected the phylogeny of clade II 2 OSCs that we analyzed. Of them, MRN1 is most divergent in amino acid sequence. It is also mono-3 functional, i.e., catalyzes the cyclization of 2,3-oxidosqualene into one specific product -marneral  Thalianol gene cluster, which has been most intensively studied so far, was the second most variable 12 MGC in our analysis [11,16,37]. The first evidence for the copy number and structural diversity of 13 this cluster comes from the study of Liu et al. [29], who analyzed the frequency of SNPs, small indels 14 and gene deletions within the thalianol cluster genes. The authors showed that SNPs and indels were 15 the most common polymorphisms, however the changes were mostly benign. They also identified 16 large gene deletions in ~2% of the studied accessions. Since our approach was specifically focused 17 on CNV analysis and was duplication-aware, it was more sensitive to large structural variations. As 18 a result, we observed over two times higher rate of copy number alterations within the thalianol gene 19 cluster (4.7%), with 49 accessions carrying gene deletions and 5 accessions with gene duplications. 20 We identified two new CNV patterns: one of them was a large deletion of CYP705A5, CYP708A2 21 and THAS1 and the other one was AT5G47980 duplication. Apart from that, by the large-scale 22 analysis of chromosomal inversions, performed for nearly 1000 accessions, we validated earlier 23 assumptions, based on only 22 accessions, that the compact (contiguous) cluster version is 24 predominant in Arabidopsis. It is also better conserved compared to the discontiguous version. 25 However, it remains to be investigated whether the tighter clustering of thalianol gene cluster leads 26 to altered gene expression or the diversification of the metabolite profiles and whether it may be 1 advantageous in certain environmental conditions. 2 3 The arabidiol/baruol was the largest and the most variable MGC in Arabidopsis. This region 4 comprises only few gene subfamilies, but it is significantly expanded compared to the sister species 5 A. lyrata, including the presence of two highly similar OSCs, PEN1 and BARS1, orthologous to A. 6 lyrata gene LOC9306317. It is suggestive of recent duplications and stays in agreement with our 7 observations of an exceptionally high rate of intraspecific gene gains and losses within this MGC. 8 Of note, within this cluster there was a clear segmentation into variable and invariable gene blocks. 9 PEN1 and CYP705A1 genes constituted the almost invariable segment, while the adjacent group of 10 CYP705A2, CYP705A3 and BARS1 genes displayed high rate of copy number changes.

11
Segmentation may result from the ongoing process of selection-driven fixation of the arabidiol sub- 12 cluster. The products of PEN1 and CYP705A1 are involved in response to jasmonic acid treatment 13 and infection with the root-rot pathogen Pythium irregulare [26]. Moreover, arabidiol may be 14 further converted to arabidin in the pathway involving acyltransferase encoded by AT5G47950. 15 Crispr mutants with disrupted AT5G47950 gene had significantly shorter roots than the wild-type 16 plants and arabidin did not accumulate in these roots [37]. AT5G47950 is the only member of the 17 thalianol gene cluster which was invariable in copy number, which may indicate the biological 18 significance of the arabidin biosynthesis pathway. Interestingly, A. lyrata is able to convert apo- The diversity of OSCs determines the initial diversity of 2,3-oxidosqualene cyclization products 1 generated by the plant [55]. So far, the Arabidopsis genome was considered to include 13 OSCs and 2 it has been over a decade since the biochemical profile of PEN3 -which was thought to be the last 3 unexamined one -was described [20]. Here we report the discovery of BARS2 gene, which was 4 found in numerous accessions but was absent from Col-0, hence it is also missing from the current 5 reference genome. We performed sequence and expression analyses and applied structural 6 modelling, which indicated that BARS2 encodes a functional clade II OSC. It may be involved in 7 the biosynthesis of one product or a range of them, presumably different from those generated by 8 its closest paralog, BARS1. It is worth to notice that BARS1 has the lowest product specificity  BARS2 is a part of a large ~25kb insertion located at the end of the arabidiol/baruol gene cluster, 19 extending its region significantly. Within this insert, we also found a duplication of CYP705A2, 20 namely CYP705A2a. According to our estimations, BARS2 and CYP705A2a gene pair may be 21 present in nearly one-third of the Arabidopsis population. Interestingly, we observed a significant 22 association between the presence/absence of this segment and the climates of accessions' origin as 23 well as the root growth dynamics. Unfortunately, little is known about the role of baruol and its 24 modification products in Arabidopsis, which currently limits our ability to explain this association.

25
Additional studies will be also required to assess whether it is related to the expression of individual 26 genes or to the differences in transcriptional activity of the entire cluster, e.g. by the epigenetic  has been demonstrated previously [52,53]. Additionally, a strong association was observed between 20 Arabidopsis phyllosphere microbiota and the host genotype, in a recent study by Karasov et al. [61]. 21 In that study, bacterial communities that colonized the leaves of 267 Arabidopsis populations, 22 assessed at various localizations in Europe, formed two distinct groups. These groups were strongly 23 associated with the latititude, in a phylotype-specific manner, namely a significant latitudinal cline 24 was observed for the most abundant strains of Sphingomonas genus. Sphingomonas are commonly 25 associated with both roots and leaves across Arabidopsis populations [62]. The species belonging to 26 29 this genus possess a range of biodegradative and biosynthetic capabilities [63,64]. Plant-colonizing 1 Sphingomonas strains were also implicated in promoting Arabidopsis growth and increasing 2 drought resistance [65] or protecting the plants against the leaf-pathogenic Pseudomonas syringae 3 [66]. Prominently, in the study by Karasov et al., the host plant genotype alone could explain 52% 4 to 68% of the observed variance in the phyllosphere microbial composition [61]. Moreover, the 5 microbiome type was strongly associated with the index of the dryness of the local environment, 6 based on recent precipitation and temperature data. Taking into account that the presence/absence 7 variation of genomic insertion containing CYP705A2a and BARS2 was also associated with these 8 climatic variables and that the geographic distribution of the compact version of thalianol gene 9 cluster was skewed towards higher latitudes, we hypothesize that the genetic diversity of triterpene 10 biosynthesis pathways in Arabidopsis may be interdependent on the diversity of soil bacterial 11 communities present in various environments and this relationship might play some role in 12 Arabidopsis adaptation to climate-driven selective pressures. It is worth to note that triterpenes are 13 high-molecular-weight non-volatile compounds, which are likely to act locally. However, they may 14 be further processed and generate various breakdown products, both volatile and non-volatile, 15 which may be biologically active [26,54]. There is also a growing evidence that intraspecies  clusters. MGC prediction is, however, often complicated by the low expression of genes for 24 specialized metabolite biosynthesis, therefore its accuracy and sensitivity strongly depend on the 25 abundance of data from various tissues, time points, and environmental conditions [23]. Based on the 26 presented results, we suggest that the analysis of intraspecies genetic and transcriptomic variation  In the current study, we created a detailed picture of CNVs and structural rearrangements, driving 10 the diversity of triterpene biosynthesis-related genes in Arabidopsis. We revealed new associations 11 between MGC structure, root growth phenotypes and the geographic distribution of Arabidopsis    MLPA and ddPCR datasets were then combined using the following procedure. For genes and 26 accessions covered by multiple datasets, the final genotype was assigned based on all data. 1 Discordant genotype assignments (21 out of 1,784 covered by multiple datasets) were manually 2 investigated and 19 of them were resolved (Additional file 1: Figure S4; Additional file 2: Table   3 S7). Out of the remaining 32,000, which were assayed with one method only, the genotype was 4 manually corrected in 13 cases with values very close to the arbitrary threshold, based on 5 population data distribution. Final genotype assignments for each gene and each accession are listed 6 in Additional file 2: Table S6. 7 Sanger sequencing 8 The genomic DNA of Mir-0 accession (ID 8337) was used as a template (2 ng) for amplification with GLnexus [80]. Analysis was performed for CYP705A2 and BARS1 genomic loci. The results 5 were further filtered to include only biallelic variants, that were located in the exons of each gene 6 (for BARS1, exon intersections from two transcript models were used). The number of heterozygous    23 The entire set of 516 phenotypes from 26 studies was downloaded from the Arapheno database on 24 26 April 2022 [87]. The above genome-wide SNP dataset, to which we added a biallelic variant 25 representing PP-AA or PP-PP group assignment, was used. The IBS kinship matrix was calculated 26 on 954 accessions. Association analysis was performed for each phenotype using a mixed model   were excluded from the further analysis. Genes were assigned to clans and families according to the 26 above resources. A list of TS genes was created based on a previous study [22] and restricted to 1 genes with valid Araport 11 locus. Genotypes were assigned based on criteria defined for RD 2 dataset: (CN =< 1 as losses, CN >=3.4 as gains, the remaining genotypes were classified as 3 unchanged). Genes from thalianol, tirucalladienol, arabidiol/baruol and marneral gene clusters were 4 already genotyped. Gene coordinates were downloaded from Araport 11. All CYP genes positioned 5 at a distance +/-30 kb from TS gene borders were classified as paired with a given TS gene. 6 Information about predicted secondary metabolism clusters was retrieved from plantiSMASH             Table S6.    continuity correction, *p.value<0.1, **p.value<0.05, ***p.value<0.001. D) Results of a genome-1 wide association study for PP-AA / PP-PP allelic variation. Study with climatic data is in the grey 2 box E) Tissue specificity of arabidiol/baruol gene cluster expression in Col-0 and Cvi-0. F) 3 Population-level differences in gene expression in leaves among the PP-AA, PP-PP and AA-PP 4 groups. Expression levels are shown as log2(TPM+1). Stars denote the significance of one-way 5 rank-based analysis of variance (ANOVA, p.value<0.001), followed by Dunn's test with BH 6 correction, **p.value<0.05, ***p.value<0.001. Source data are available in the Arapheno database 7 (plots A-C), Additional file 1: Table S12 (plot D) and Additional file 1: Table S13 (plots E-F).