Copy number variation introduced by a massive 1 mobile element underpins global thermal 2 adaptation in a fungal wheat pathogen

Abstract


Introduction
Populations occupying heterogeneous environments may evolve locally advantageous traits under divergent selection pressures 1 .How different forms of genetic variation contribute to such environmental adaptation remains unclear.Most broad-scale comparative evolutionary analyses are focused on single nucleotide variants (SNVs) [2][3][4] .However, large-effect structural variants have also been shown to play a role in species range adaptation [5][6][7][8][9] .Adaptative chromosomal inversions are well documented across populations of Drosophila and are linked to seasonal temperature fluctuations 6,7,10 and cold tolerance 8,9 .
Copy number variation (CNV) is a type of unbalanced structural variant defined by the loss or gain of sequence fragments ranging from ~50 bp in length to entire chromosomes.Analyzing CNVs systematically remains challenging due to limits in detection and resolution of the exact sequence rearrangements 11,12 .
CNVs can drive genome evolution 13 , contribute to domestication and speciation events 14,15 , and promote environmental adaptation 16,17 .Population-based studies revealed CNVs associated with environmental adaptation in seabirds, with a large 60 kb CNV likely contributing to plumage and thermal adaptation 18 .In wild lobster populations, CNVs but not SNVs are associated with sea surface temperature adaptation 19 .Hence, elucidating the population genetic context of widely distributed species is necessary to assess how CNVs and dynamic genome compartments contribute to adaptation.The impacts of gene gains and losses mediated by CNVs across the genome vary from local gene dosage effects 20 to reshuffling gene structures 21 , global transcriptional changes and chromatin reconfiguration 22,23 .CNVs mainly arise from inaccurate DNA repair and nonhomologous recombination 24 .Segmental duplications are often triggered by transposable element (TE) activity, and simple repeats are targets for nonallelic homologous recombination (NHR), leading to CNVs 25,26 .Overall, CNVs are linked to replicative or nonreplicative nonhomologous processes based on weak homology 24,27 .
CNVs have been implicated in numerous phenotypic traits, including human disease 28 , life-history traits of crops 29,30 and drug resistance 31,32 .For example, a gene duplication of ACE-1, the target site for organophosphate and carbamate insecticides, confers resistance to the malaria vector Anopheles gambiae 31 .CNVs in fungal plant pathogens are a major concern because such genetic variation is linked to fungicide resistance 33,34 , pathogen virulence 35,36 , and nutrient absorption efficiency 37 .Rapid adaptation in plant pathogens is a threat to global food security 38 and facilitates climate change 39,40 .Zymoseptoria tritici is one of the most destructive pathogens of wheat crops worldwide 41 .This haploid ascomycete underwent global population expansion concurrent with the introduction of wheat cultivation across continents 42 .With global spread, the pathogen accumulated mutations likely involved in adaptation to new climates 42 .The genome is organized in highly dynamic chromosomes, including eight accessory chromosomes and high degrees of structural variation 43,44 .The genome has expanded recently, most likely caused by TE activity and a weakening of genomic defense mediated by repeat-induced point mutations (RIPs) 42,45 .RIP is thought to be active during sexual reproduction and promotes mutations in any duplicated sequences 46 .Hence, the genomic defense mechanism is thought to constrain adaptation through CNVs.The species exhibits high gene set polymorphisms across populations 47 ; however, how the global spread on the wheat host has shaped environmental adaptation remains unknown.
Here, we analyzed a global panel of 1109 Z. tritici genomes covering all major regions linked to the domestication history of the wheat host.We validated a set of high-confidence CNVs to recapitulate the evolution of gene gain and loss across the global population genetic context of the species to assess the impact on gene functions.Finally, we identified how CNVs contributed to chromosomal polymorphism and environmental range adaptation, including an ~68 kb cargo-carrying Starship element.

Sampling
We performed copy number variation analysis on a global collection of Zymoseptoria tritici comprising 1109 Illumina short-read genomes (Figure 1A).The collection covers strains originating from 42 countries representative of the history of wheat domestication and historical expansion of wheat cultivation (Supplementary Table S1) 42 .Additionally, we used high-quality full chromosome assemblies based on PacBio long reads of 19 strains of Z. tritici representative of the global genetic diversity of the species and genomes of four sister species (Z.pseudotritici, Z. ardabilae, Z. brevis and Z. passerinii) 44,48 for CNV call validation.Information about sample origin, sequence quality and accession numbers is provided in Supplementary Table S1.

CNV calling
Illumina raw reads were trimmed with Trimmomatic v. 032 49 and mapped to the Z. tritici (IPO323) reference genome using Bowtie2 v.2.4.0 50 very-sensitive-local parameter.We used GATK CNV caller v.4.1.9.0 51 with recommended parameters on alignment BAM files (n = 1109).The software scans read coverage and models sequencing biases based on negative binomials, taking copy number states and genomic regions of CNV activity into account for a hierarchical hidden Markov model (HHMM).We set the CNV interval to 1000 bp windows with no overlap.Such intervals are recommended to account for variation in sequence coverage (Supplementary Table S1).We filtered for GC content in windows (min=0.1 and max=0.9),as well as extremely low and high read counts (--low-count-filter-count-threshold = 5, -extreme-count-filter-minimum-percentile = 1, --extreme-count-filter-maximum-percentile = 99).We then built a prior table for chromosomal ploidy to assign prior probabilities for each ploidy state.Finally, we called CNV genotypes using the germlineCNVCaller and PostprocessGermlineCNVCalls functions.After genome-wide CNV calling, we filtered and validated gene CNVs.We used bedtools v2.31 52 annotate to overlap gene elements with the CNV calling.

1) CNV coherence validation
To assess CNV calling quality, we analyzed seven sets of independently sequenced sets of the same strain (i.e., independent library and sequencing; Supplementary Table S2).After validation, we retained the strain with the higher mean coverage of each replicate pair in the dataset for further analysis.
2) Gene CNV events CNV calling was performed in 1-kb windows across the genome, allowing for ambiguous calls in polymorphic gene elements.Deletion, duplication and single-copy CNV events were attributed to a gene if the event coverage was >80% of the gene.We defined partial deletions if the deletion covered 50-80% of the gene.Additional combinations were defined as single-copy events.

3) CNV filtering and validation
To reduce false positive calls in the dataset, we used quality scores (CNQ), which are defined as the difference between the two best genotype Phred-scaled log posteriors.We set the CNQ threshold based on the structural variant calling verified using the fully assembled genomes of Z. tritici 53 .We contrasted CNV calls by the GATK calling pipeline to pairwise whole-genome comparisons of chromosome-level assemblies based on the software SyRI v1.3 54 using IPO323 as the reference genome.For a direct comparison of variant genotyping, we analyzed four strains present in both the global dataset (Illumina reads) and the chromosome-level assembly dataset (PacBio reads; samples 3D1, 3D7, 1E4 and 1A5).We first subset gene sets to unambiguous CNV calls (i.e., with more than 60% of event coverage) in both datasets.We then subset the SyRI structural variant calls to single-copy regions, translocation and DNA gain and loss, removing SNV calls (Supplementary Table S3).We compared the call quality based on CNQ to the gCNV GATK output.The levels of matching and discordant calls between tools were used to define thresholds for CNV GATK calls.We filtered for missing data in the global dataset by removing calls with <50% coverage or <80% call frequency.To define CNV segments (i.e., larger regions with consistent CN calls), we binned GATK CNV Caller segment output per strain and calculated the CNV segment quality call QA defined as complementary Phred-scaled probability at all points (i.e., bins) in the segment, which agree with the segment copy-number call.We then subset for CNV segments based on the filtered dataset and kept CNV segments within the upper quartile quality.We retrieved 14 loci to cross-check the validation with a PCR screen performed earlier for deletion polymorphisms 55 .

4) Chromosome number variation
Z. tritici is haploid and carries 13 core chromosomes and up to eight accessory chromosomes (not shared by all members of the species).We used chromosome-wide copy number estimates using the mean coverage (MAPQ>20) of core chromosomes for each strain to define accessory chromosome presence/absence among all strains.Core chromosomes with more than 1.25 times the core coverage were defined as likely duplicated.Accessory chromosomes with less than three times the core chromosome mean coverage were considered absent from the strain.The variable median coverage cutoff for accessory chromosomes is necessary because of the chromosome size variation in the species 44 .Chromosomes with large size variation and uncertain estimates of coverage thresholds were additionally checked using gene presence density cutoffs based on the CNV call unfiltered output.

Single nucleotide variant call and de novo genome assembly
To analyze the genomic context of CNVs in the population, we retrieved SNV calls and de novo genome assembly analyses previously performed for the same global collection (n=1109) 42 .Briefly, reads were mapped to the reference genome (IPO323) using Bowtie2 v.2.4.139 50.SNVs and short indels were assessed using the short-variant pipeline performed with GATK v4.1.4.HaplotypeCaller 56 .Ploidy was set to 1, and hard filtering was performed.The per-site filters included FS > 10, 444 MQ < 20, QD < 20, ReadPosRankSum between -2 and 2, MQRankSum between -2 and 2, and BaseQRankSum between -2 and 2. We filtered the dataset for biallelic SNV genotypes, a minor allele frequency of 0.05 and max missing genotype data of 20%.We further used the dataset to define synonymous and nonsynonymous variant prediction impact on the protein using SnpEff version 57 .We filtered for CNV genes with a ≥80% single copy genotype in the global panel.We built a database based on the reference genome IPO323 annotation and filtered for the top impact effect per variant.De novo assemblies were generated using SPADES v3.14.1 software 58 with the "--careful" parameter to reduce mismatches.

Population structure analysis
The global collection included 1003 genomes grouped into 11 genetic clusters and 106 admixed samples defined as showing less than 75% assignment to any specific cluster 42 .CNV-based population structure and VST outlier analyses were performed on the filtered, biallelic gene CNV dataset with a single copy defined as the reference allele and the most frequently observed CNV event at the locus as the alternative allele.We filtered core chromosome gene CNVs for a minor allele frequency ≥1% and 20% maximum missing genotypes per locus.The SNV-based PCA was generated using the SNV data subset with the BCFtools 59 "M2" option to keep only biallelic SNV and "-q 0.05:minor" for a minor allele frequency filter of 5%.We used vcftools 60 "--thin 1000" to keep only 1 SNV for every 1 kb interval.The PCA was performed with the ade4 R package v1.7-22 61 and visualized with the ggplot2 v3.4.2 R package 62 .CNV fixation indices (VST) were calculated using the hierfstat v0.5-11 R package 63 .

Environmental and life-history trait adaptation analysis
We performed whole-chromosome CNV and gene CNV genotype-environment association (GEA) analyses using bioclimatic factors from the WorldClim database v.2.1 at 10' resolution.The bioclimatic data comprise historic averages (between 1970-2000) of 19 climatic variables related to temperature and precipitation.Geographic coordinates of strain collection sites were used to approximate climatic dataset gridpoints.We used two widely used mixed linear model association mapping tools: GEMMA v.0.98.3 64 and Tassel v5 65 using the Rtassel R package v0.9.29 66 .We used the thinned SNV dataset (see Population Structure) to estimate the kinship matrix.We contrasted our CNV-based GEA with the SNP-based GEA performed by 42 using identical climatic datasets and tools.We used Bonferroni corrections (alpha = 0.05) to account for multiple testing in GEA analyses.We performed phenotype-genotype GWAS analyses on a subset of the global collection (n = 145 strains) using 24 life-history phenotypes 67 .The phenotypic data comprised virulence (i.e., lesion size) and reproduction (i.e., pycnidia production) of individual strains during host infection on 12 different wheat cultivars.Tests were conducted with GEMMA and Tassel software using the same parameters as for the GEA analyses.

Transcriptome profiling
We analyzed transcriptional profiles based on gene expression in minimal medium conditions of 19 Z. tritici chromosome-level assembly strains 44 and a collection of strains from a Swiss field population (Eschikon, Switzerland; n=74) 68 .In summary, 10e5 cells were inoculated using liquid minimum media (MM) with a limited carbon source and grown for 7-10 days to reach the hyphal growth stage.RNA extraction was performed using a NucleoSpin RNA Plant Kit following the manufacturer's instructions 44 .We analyzed the publicly available transcriptome dataset (NCBI SRA accession SRP077418) of four strains also included in this study (3D1, 3D7, 1E4 and 1A5) inoculated on wheat plants and analyzed 7, 12, 14 and 28 days after infection 69 .Illumina raw sequencing reads were trimmed and filtered for adapter contamination using Trimmomatic v. 0.32 (parameters: ILLUMINACLIP:Trueseq3_PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36) 49 .Filtered reads were aligned using HISAT2 v.
2.0.4 with default parameters 70 to the Z. tritici reference genome (IPO323).Mapped transcripts were quantified using HTSeq-count 71 .Read counts were normalized by calculating trimmed means of M-values (TMM) with the calcNormFactors option.To account for gene length, we calculated reads per kilobase per million mapped reads (RPKM) values using the R package edgeR 72 .

Orthology analyses and characterization of gene functions
Orthologs were searched among the chromosome-level assemblies (n=19) and four sister species of Z. tritici with Orthofinder v.2.2 73 .Orthologs shared between Z. tritici and at least one sister species were defined as conserved and used to infer gene losses versus gains in Z. tritici.TE annotations of the chromosome-level assemblies were retrieved from 44,45 .Chromosome-level assemblies of Z. tritici were also analyzed to predict biosynthesis-related gene clusters (BGCs) using antiSMASH v.5.0 74 .Identified gene clusters were further annotated using InterProScan v.5.54 75 .GO term enrichment analyses were performed using Fisher's exact tests based on gene counts with the topGO R package 76 .The GO term treemap was plotted using the treemapify R package 77 .We retrieved publicly available ChIP-seq datasets of histone modifications H3K4me2, H3K9m2 and H3K27me3 from the NCBI SRA (SRP059394) of the Z. tritici IPO323 reference genome isolate grown in rich medium 78 .ChIP-seq reads were trimmed with Trimmomatic v.0.32 49 and mapped to the IPO323 reference genome with Bowtie2 v.2.4 50 .Alignment BAM files were converted using BEDtools v.2.30 52, and peak calling was performed using Homer v.4.11 79 .Gene coverage was analyzed with the BEDtools intersect command.

Starship characterization and annotation
To characterize Starship mobile elements in the Z. tritici global collection, we searched for genes with CNVs for Starship-associated functional domains 80 using the hmmsearch function of HMMER v3.3.2 (Evalue ≤ 0.001) 81 .We focused on genes belonging to a newly described family of tyrosine recombinases with DUF3435 domains (Protein Family accession: PF11917) that are both necessary and sufficient for the movement of entire elements 82 .We defined the boundaries of candidate elements by annotating their putative empty insertion sites.We aligned 25 kb upstream of the candidate tyrosine recombinases to the corresponding homologous region in isolates that were missing the gene to determine the upstream element boundary and then aligned 25 kb downstream of the homologous region back to the contig containing the tyrosine recombinase to determine the downstream element boundary.All alignments and quality filtering were performed with MUMmer4 83 (nucmer settings: -mum; delta-filter settings: -m -l 2000 -i 90) and manually inspected.
To establish phylogenies for Starship cargo genes, we performed pairwise alignments of predicted proteins for each genome using blastP (Camacho 2009).To ensure that phylogenies are not biased by missed gene annotations, we performed pairwise alignments of the predicted protein sequence against the chromosomelevel assemblies and draft assemblies of the global collection using Exonerate v.2.70 (Slater and Birney 2005) with the parameter --model protein2genome -minintron 20 --maxintron 3000.We aligned sequences with MAFFT v. 7.310 84 using the --maxiterate 1000 --localpair options.Phylogenetic trees were built using RAxML v.8 (Stamatakis 2014) with the parameters -m GTRGAMMA for nucleotide sequences and -m PROTGAMMAAUTO for protein sequences with 1000 bootstrap replicates.

Visualizations and statistical analyses
All described statistical tests were performed using R 85 .Analyses of differences among genetic clusters were performed using ANOVAs with the multcomp R package 86 .Heatmaps were generated with the pheatmap R package 87 .Phylogenetic trees were plotted with the ggtree R package v3.8.0 88 .Synteny plots of the region were plotted with the genoplotR 89 and gggenomes 90 R packages.The correlation plot was generated using the corrplot R package 91 .Additional graphics were produced using the ggplot2 R package 62 .To analyze associations of the RIP composite index and gene CNV events, we used a mixed effect linear regression model with the R package nlme 92 .We first used a baseline model of the explanatory variable (i.e., RIP mean per strain) and response variable (i.e., CNV event per strain) and compared it to more complex models adding population and RIP index as random effects.We used the ANOVA function from the package car 93 to assess model fit.We used the function in r.squaredGLMM 94 to calculate the conditional (R 2 c) and marginal coefficients (R 2 m) of the generalized mixed-effect models.

Chromosomal and gene copy number variants in a 1000-genome panel
We used short-read sequencing data generated for a global panel of 1109 genomes covering the global distribution range of Z. tritici (Figure 1A).The collection of genomes covers 42 countries, capturing the spread of the pathogen concurrently with the historic spread of wheat cultivation across continents 42 .The genomes were collected from a broad range of climates from hot and dry Middle Eastern regions to cooler and humid regions at high latitudes.We performed short read mapping along the genome to assess segmental deletions and duplications.We implemented multiple filtering procedures and validation steps to ensure high-confidence CNV calls (Figure 1B).To evaluate CNV call performance, we first compared the gene CNV calling of seven matching pair strains with replicated sequencing data.We found largely congruent calls for duplications and deletions (Figure 1C).Next, we validated the CNV calling independently based on completely assembled genomes and synteny analyses 48 (Supplementary Table S3).
We found high consistency for variant calls between short-read CNV calling and the chromosome-level synteny approach (Figure 1D, Supplementary Figure S1A).We used the empirically assessed confidence threshold (i.e., CNQ) to filter the global short-read dataset.Deletion and single copy event calls were filtered to reduce false positives and retain high-confidence calls (Figure 1D-E, Supplementary Figure S1B-D).
Finally, we evaluated CNV calls for 14 genes confirmed by a PCR assay on 94 strains 47 (Figure 1F).The resulting dataset of high-confidence calls included 1104 strains and 8625 distinct gene loci affected by CNVs (Figure 1B; Supplementary Table S4).The species carries a set of eight highly polymorphic accessory chromosomes with unknown contributions to environmental adaptation (Figure 2A).Accessory chromosomes vary significantly in frequency with evidence for partial chromosomal deletions (Figure 2A; Supplementary Figure 2A).We verified the absence of particular chromosomes using transcriptomics data (Supplementary Figure 2B) 68 .As expected, core chromosomes 1-13 were fixed in the global collection (Figure 2B).We found 17 (0.1%) cases of core chromosome duplications, with chromosomes 5 and 12 accounting for 64% of all cases (Figure 2C).We found that 19% of accessory chromosomes were missing from the global collection (n=1698 out of 8872; Figure 2B).Chromosome 18 was missing in 57% of strains, followed by chromosome 21, which was missing in 26% of samples (Figure 2B).Overall, chromosome 18 accounts for 70% of all complex arrangements (i.e., high degree of duplications and deletions across the chromosome arm; Figure 2B), with most variation being associated with high repeat content.We found that the chromosome 14 structural variation in the population was associated with a large insertion 44,95 of 351 kb encoding ca.40 transcriptionally repressed genes (Figure 2D).The total chromosome number per genome varied up to 56% (13-23 chromosomes), with an average of 20 chromosomes underpinning substantial genetic diversity (Figure 2E).Two strains (0.6% of total) carried only core chromosomes (Figure 2E, Supplementary Figure 3A).We found no evidence supporting yet undescribed accessory chromosomes (Figure 3F).Higher chromosome numbers than found in the reference genome strain were caused by chromosomal duplications.
We found that even strains with the highest chromosome numbers (n ≥ 21) were able to successfully infect wheat leaves and reproduce (Supplementary Figure 3B).The number within the boxplot refers to the sample size.Wilcoxon test, p value < 0.0001.Groups with <5 samples lacked statistical power.F) CNV gene distance to the closest transposable element.G) Gene ontology enrichment analysis of CNV genes.Values in white refer to Fisher's test p values.BP refers to biological process, and MF refers to molecular function.
To identify candidate CNVs associated with climatic adaptation, we analyzed gene CNV variation across the global distribution range of the pathogen.We focused on biallelic CNVs with single-copy, deletion, or duplication of genes (Supplementary Figure 4A).Strains with duplicated chromosomes were removed from further analyses.We found that 3.3%, 5.2% and 17.5% of CNVs were common (>1% frequency), rare (<=1% frequency) and singletons, respectively, across the 1000-genome panel for a total of 8511 loci (Figure 3A).Most genes (73.9%) showed no CNVs and are likely under strong purifying selection.Most CNV genes (85.7%) share an ortholog in at least one sister species (Supplementary Figure S4B).Using parsimony, we infer that most CNV events are likely gene losses.Most genes on accessory chromosomes (90%; n= 203) exhibited CNVs compared to 24% (n = 2021) of core chromosome genes (Figure 3A).Among the core chromosomes, chromosome 5 exhibited the strongest skew toward low-frequency CNVs (i.e., singleton and rare category with CNV frequency ≤1%; Figure 3A).CNV variation was also found across chromosome arms (Figure 3B).We found overall gene duplications to be more abundant (n=1400) but mostly at low frequency (1388 CNVs with ≤1% duplication frequency in panel) compared to gene deletions (n=682, Figure 3C; Supplementary Figure 4B).The distinct frequency patterns of duplications and deletions were consistent across quality filtering stages (Supplementary Figure 4D).Hence, we asked whether gene deletions and duplications would segregate differently among populations (Figure 3D).Rare CNVs (global frequency <1%) showed similar proportions for gene deletions and duplications compared to the common frequency category (>1%).Gene duplications were four times less likely shared between populations compared to deletions (p value < 0.00001, 95% CI 14-41%; Supplementary Figure 5A).We then searched for CNV segment size variation in the global genome panel.We binned 1-kb CNV events into larger contiguous calls of presence or absence (Supplementary Table S5; Supplementary Figure 5B).
In the upper quartile (QA>21), we found duplicated gene segments to be larger in size compared to deletions (p value < 0.00001) and more constrained in size at increasing frequency (Figure 3E; Supplementary Figure 5C).Taken together, our findings show that gene deletions can be readily purged, likely due to deleterious effects.We investigated features shared by CNV-affected genes and found that such genes were on average closer to TEs than conserved genes (Supplementary Figure S6A).Furthermore, gene deletions were closer to TEs than duplications (p value <0.01; Figure 3F).We then asked whether coding sequences of CNV genes carry more high-impact SNVs (i.e., variants with disruptive effects) compared to conserved genes.Genes segregating deletions harbored higher impact variants compared to conserved genes and genes with duplications.Common duplicated genes exhibit the highest number of high-impact variants (median 0.0012; Supplementary Figure S6B), both consistent with functional redundancy (i.e., relaxed selection) and genomic defenses affecting coding sequences.We also found that deleted or partially deleted genes were enriched for H3K27me3 repressive histone methylation marks.In contrast, conserved genes and genes with duplications were enriched for H3K4me2 euchromatin marks (Supplementary Figure S7A).Consistent with these observations, CNV genes show higher transcriptional variation during host infection and possibly higher functional redundancy (Supplementary Figure S6D).Overall, CNV genes were functionally enriched in metabolic processes, including toxic secondary metabolic processes and peptidase activity (Figure 4G; Supplementary Table S6; Supplementary Figure S7C).Gene dispensability and functional enrichment of metabolic processes suggest that gene CNVs facilitate metabolic diversification and local adaptation.

Effects of genome defenses and signatures of local adaptation
Population differentiation across the globe detected at 218 gene CNVs is broadly congruent with SNVbased assessments of population differentiation 42 (Figure 4A, Supplementary Figure S8A).In Europe, CNV-based population structure showed a pattern consistent with recent immigration events (Figure 4A) 42 .In addition, the European population showed higher rates of gene flow with other regions (Supplementary Figure S8B), corroborating the role the continent played in historic pathogen dispersal.
Populations across the globe differed substantially in the rate of CNV events per strain (19-83 with a median of 44; Figure 4B).Interestingly, we found a strong enrichment in duplications in the clusters assigned to Australia, NA -USA and Europe (Figure 4B).An important factor shaping observed CNV rates among populations is the potential activity of RIP genomic defenses.Genomes with a functional copy of dim2 likely express functional RIP machinery 42,96 .Dim2-carrying populations tended to show higher rates of gene deletions (Figure 4C).A mixed linear model accounting for population effects showed a weak but significant association of gene deletions and the strength of RIP (r 2 = 0.023; Supplementary Figure S8C-D).We identified candidates for local adaptation across all genetic clusters.We assessed the 95 th upper quantile fixation index (VST) score per gene CNV among populations (Figure 4D).The top VST CNV gene was predicted to be an effector gene with predicted functions in plant infection (Supplementary Table S7).
Consistent with the predicted function, the gene is highly expressed during the initial stages of wheat infection and shares no homology outside of the species (Figure 4E).The gene was rare and stable over time in the North African population but fixed across all other populations (Figure 4E).Hence, the gene may play a role in adaptation to local host genotypes, favoring gene loss to avoid host recognition.

Structural variation underpinning climate adaptation
The species underwent climatic adaptation over the course of continental colonization 42 .We investigated the contributions of chromosome and gene CNVs to overall climatic adaptation using genome-environment association (GEA) analyses.We analyzed a total of 1099 samples and examined 19 bioclimatic factors based on two mixed-model association approaches for adaptive CNV discovery (Supplementary Table S8; Bonferroni α = 0.05).We identified significant associations for the chromosome CNVs of accessory chromosomes 15, 17 and 20 (Supplementary Figure S9A-B; Supplementary Table S9).Chromosome 20 was the most consistently retrieved by both GEA methods, revealing the mean temperature to the coldest and driest quarters (Supplementary Figure S10A).Most climatic factors showed strong positive correlations (r > 0.8, p value < 0.0001), and associated loci were typically exclusive to individual climatic variables (Supplementary Figure S9).Next, we analyzed phenotypic trait variation for a subset of the global collection of strains (n = 145; Supplementary Table S10) 67, and no trait was significantly associated with CNVs (FDR 5% threshold).We identified 21 gene-level CNV associations with climatic factors spanning 14 different loci (Figure 5A, Supplementary Table S11).Associated loci encoded functions including epigenetic regulation, metabolism and cell signaling functions (Supplementary Table S11).We found the strongest correlations with the climatic factors of maximum temperature of warmest month and mean temperature of warmest quarter (Figure 5 A; r = 0.84; p value < 0.001, Supplementary Figure S9C).The associated CNVs were segregating variable gene deletion frequencies (Figure 5B).We found that 85.7% of the associated genes share an ortholog in at least one Zymoseptoria sister species.We integrated CNV and SNV-based GEA analyses and identified three genes with shared evidence from both marker types (Supplementary Table S11) 42 .In general, gene functions identified by each variant type were only moderately overlapping (r 2 < 0.6, Supplementary Figure S9B; Supplementary Table S11).This suggests largely independent contributions by CNVs and SNVs to climate adaptation.We identified a significant association in the gene CNV of Zt09_9_00561 encoding interferon 6 (Figure 5A), which was supported by both association mapping methods.Gene presence is associated with higher mean temperatures of the wettest quarter (Supplementary Table S11).We also found congruent mean annual temperature associations for the CNVs for the gene pair Zt09_2_00058/60 (Figure 5A) and the SNV association at Zt09_2_00069 (Supplementary Table S11).The associated CNVs flank a biosynthetic gene cluster (BGC19) on chromosome 2 with the second highest fixation index values (VST) of all gene CNVs.
(Figures 4C, 5C; Supplementary Figure S10B).We investigated the nature of BGC19 and found that it is a 63-kb cluster of unknown function (Figure 4C).The BGC is present in the sister species Z. brevis, and the loss of the core biosynthetic gene (Zt09_2_00067) was confirmed by PCR 47 .The presence of BGC19 was negatively correlated with mean annual temperatures (r = -0.61,p value < 2.2e-16; Figure 5D).The loss of the BGC was also associated with higher reproduction on wheat leaves of 12 wheat cultivars (Figure 5E; Supplementary Tables S10, S12).Taken together, the deletion of the BGC19 gene cluster is likely under antagonistic pleiotropy for high-temperature adaptation and host colonization potential.

The climate associated Sir2 locus is occupied by a massive Starship mobile element
Causal factors contributing to environmental adaptation remain poorly understood.We took advantage of the high-quality pangenome resources for the species to investigate the significant association of the annual temperature range with a CNV on chromosome 7 (Figure 5A, 6A).The gene Zt09_7_00034 is predicted to encode a homolog of Sir2, an NAD-dependent deacetylase major protein family associated with lifespan and mating type silencing in yeast, as well as aging in humans [97][98][99] .A phylogenetic analysis of the Sirtuin family showed that Z. tritici Sir2 is an ancient duplication of Sir5 in Dothideomycetes followed by multiple independent gene losses (Supplementary Figure S11).To investigate the chromosomal context of the Sir2 deletion, we analyzed both unfiltered CNV call data and chromosome-level assembled genomes (n = 19).
CNV frequency analyses of the global collection showed that the region segregates a large insertion affecting transcriptionally active genes as well as TEs (Figure 6A).The Sir2 locus was partially present in the center of origin populations (i.e., Middle East, Iran) and fixed in Oceania and the USA (Supplementary Figure S12A).The gene Zt09_7_00033 adjacent to Sir2 encodes a DUF3435 domain characteristic of a newly described family of tyrosine recombinases 100 .This fungal-specific tyrosine domain is required for the mobilization of massive TEs identified as Starships 100,101 .Genes unrelated to DUF3435 encoded inside the Starship are called cargo.The massive mobile element resides close to the major global regulator Velvet 102 , which is linked to sexual reproduction and growth 103 .We found a single insertion in Starship (hereafter identified as Swordfish) present in a single copy (global frequency of 64.5%) or entirely missing (14.3% globally; Supplementary Figure S12B).Swordfish boundaries contained no detectable direct repeats flanking the element, suggesting that the element lost the ability to transpose.The Swordfish region is rich in TEs (Figure 6A), and chromosome-level assembled genomes revealed syntenic Swordfish flanks among sister species (Supplementary Figure S13A).Strains harboring Starship carry ~68k additional sequences (SD=13.5 kb; Supplementary Figure S13B).We used gene synteny and phylogenetic analysis to retrace the evolution of Swordfish.Phylogenies of the flanking genes match the evolutionary history of the genus (Supplementary Figure S14).Swordfish gene cargo underwent a complex sequence of duplications, transposition, and multiple, independent gene losses (Figure 6B, Supplementary Figure S15A).For example, the genes Zt09_7_00039/38 have paralogs in different genetic contexts, suggesting ancestral duplications (Supplementary Figure S15).Chromosomelevel assemblies lacking Swordfish (strains IR01_48b, CNR93, UR95) show the flanking gene Zt09_7_00030 and cargo genes Zt09_7_00037/38 in a different location on chromosome 7.Although the Swordfish is never present in more than one copy, a close homolog of the Starship tyrosine recombinase was found in strains lacking Starship (i.e., strain Zt469, Figure 6B).Distant duplicates of the tyrosine recombinase (identity <60%) were found on chromosomes 1 and 12 (Supplementary Figure S15B).Strain 3D1 carried a tandem duplication of the cargo genes Zt09_7_00033/34/35 (Supplementary Figure S16).Swordfish is rich in TE sequences, including specific retrotransposons (RLX_LARD_Gridr and RII_Philae; Supplementary Figure S16A-B).Some strains lacking Swordfish carry cargo genes (Zt09_7_00037-39) in a different genomic location, suggesting that Swordfish cargo turnover was recent and possibly mediated by TEs (Supplementary Figure S16C).Epigenetic analyses revealed a bi-repressive pattern regulated by H3K27me3 and H3k9me2 modifications (Figure 6C).Flanking regions are marked by H3K4me2 euchromatin.The predicted secreted protein Zt09_07_00037 and the flanking gene Zt09_7_00040 are highly expressed during infection (Figure 6C).We found a high degree of sequence conservation among the triplet Zt09_7_00033/34/35 lacking coding sequence variants, suggesting strong purifying selection and conservation of synteny (Supplementary Figures S13, S17).
The Swordfish cargo gene Sir2 CNV is significantly associated with the annual range of temperature, and the neighboring highly expressed secreted gene Zt09_7_00037 is associated with the mean diurnal range.
Both climatic factors are moderately correlated (r = 0.55, p value < 0.001).Overall, we identified a possible new role for Sir2 as a factor in climate adaptation.Hence, the massive swordfish mobile element likely contributes to thermal adaptation and shapes the species range across highly variable environmental conditions.

Discussion
Climatic factors are strong determinants of pathogen spread and disease severity 39,104 .Genetic variability provides the substrate for rapid adaptation to a changing environment 105 and to cope with environmental stressors [106][107][108] .How copy number variation shapes climate gradients spanned by individual species remains poorly understood.We show that multiple gene deletions contribute to pathogen adaptation across continental climatic gradients.CNVs are also drivers of metabolic diversity, including contributions by Starship mobile elements reshuffling genes carried as cargo.
Approximately a quarter of all genes in the pathogen species were affected by CNV events, likely reflecting an equilibrium between new CNVs being generated and purifying selection acting against CNVs 109 .Genes segregating CNVs were located closer to TEs and were functionally enriched for catalytic activities and secondary metabolic processes compared to more conserved genes.Furthermore, gene duplications were the most abundant CNV events yet remained at low frequency in populations and were rarely shared among populations.The skew in CNV events may stem from a duplication detection bias; however, a more stringent control for call quality and CNV frequencies produced similar outcomes.Gene duplications are a powerful source for gene neofunctionalization [110][111][112] and promote non-allelic homologous recombination due to homology among duplicates 113,114 .Z. tritici is a highly recombinant species 115,116, suggesting that segmental duplications could impact the likelihood of nonallelic homologous recombination 114 .The evolutionary history of the pathogen has likely impacted the rates of duplications on chromosomes, as observed in the more recently colonized Oceanian and American continents.More recent populations exhibited signatures of bottlenecks with reduced genetic diversity 42 .Concurrently, the bottlenecked populations experienced an increase in TE activity, most likely caused by a loss of defense mechanisms against TEs 42,96 .RIP triggers rapid mutation accumulation after duplication, leading to loss-of-function or, more rarely, diversifying selection in populations 117,118 .RIP activity likely acted as a driver of gene loss by rapidly mutating genes and facilitating the purging of nonfunctional copies through excision.Such nonrandom processes underlying the creation and elimination of structural variation illuminate how species gene pools can evolve over short evolutionary time periods.
The role of gene duplications in environmental adaptation is well documented across kingdoms [119][120][121] .How gene loss can contribute to adaptation is less well understood.Here, we showed that gene losses largely explain CNV-driven environmental adaptation.We found signatures for adaptive gene loss ranging from single genes to chromosome copy number variation.Gene loss typically arises from the abrupt rearrangement of coding sequences by repetitive elements, unequal crossing-over events or by the accumulation of mutations leading to a loss of function 122,123 .For example, loss of the Desat2 gene in cosmopolitan D. melanogaster was linked to resistance to cold 124 .Hence, adaptive loss to changing environments shows convergence across kingdoms [125][126][127][128][129] .
We found strong associations indicating that climate adaptation across continents was facilitated by presence/absence variation in a biosynthetic gene cluster.The presence of the BGC19 cluster is positively associated with colder climates, and cluster deletion is associated with an increased reproduction rate in specific wheat cultivars.Fungal BGCs are a major source of chemical diversity, and tight physical organization improves coregulation efficiency and reduces cytotoxicity caused by intermediary products 130 .Secondary metabolites are involved in numerous cellular functions, including virulence, defense, and growth 131 .BGCs can be hotspots generating structural variation [132][133][134] and favor adaptation 135,136 .Adaptive loss of BGCs is thought to be explained by the Black Queen hypothesis, which argues that communities sharing "leaky" resources such as metabolites will favor genome reduction and loss of gene sets necessary to make such metabolites accessible 137 .As we have identified linked antagonistic pleiotropy governing the presence of BGC19, the gene cluster may be under a complex selection regime across populations.
In addition to climate adaptation, virulence factors (i.e., effectors) can also undergo adaptive gene loss to facilitate pathogen adaptation 97,138 .In an arms race between host and pathogens, such virulence factors can be important triggers for host defense mechanisms or serve to manipulate the immune response 139,140 .The sampling in North Africa covered a 30-year timespan and showed consistently low frequencies of the virulence factor in populations.The region predominantly produces durum wheat (Triticum durum) in contrast to the more widely cultivated bread wheat (T.aestivum) 141 .Strong selection in North African populations was likely driven by pathogen-specific recognition mechanisms 142,143 .Overall, we found that gene loss at different scales likely played an important role in environmental adaptation across the historical spread of the pathogen population.
We identified, to our knowledge, the first Starship mobile element associated with climate adaptation in fungal populations.Starships are widespread among fungi and were recently characterized as tyrosinerecombinase-mobilized DNA transposons amassing multiple host genes and TEs 101,144 .Elements can reach up to approximately half a megabase in size and carry species-specific cargo genes.Individual Starships were associated with adaptive functions such as heavy metal tolerance in a strain-specific manner 145 spore killing-mediated meiotic drive 144 and formaldehyde resistance 146 .Starships are powerful agents that reshuffle the core functions of gene clusters 146 .The cargo carried by swordfish expands our understanding of adaptive functions associated with massive selfish elements.The cargo includes gene CNVs associated with thermal range climatic factors that may contribute to the environmental range of pathogen populations 19,147,148 .Signatures of thermal adaptation were previously documented in Z. tritici without identifying a molecular mechanism [149][150][151] .Surprisingly, the Swordfish element resides within a highly conserved region across sister species and neighbors the gene VeltB, which is a key regulator of reproduction, metabolism and growth processes in fungi 102 .The strong epigenetic repression across the entire Swordfish element is consistent with the strong repression of TE activity immediately adjacent to a transcriptionally open and conserved region of the chromosome 152 .We hypothesize that this Starship likely inserted itself at this locus after the emergence of the species, followed by a complex set of duplications and transposition events facilitated by high TE activity in the region.
Swordfish carries an ancient duplication of the Sir5 gene.The sirtuin protein family is a group of evolutionarily conserved NAD+-dependent histone deacetylases involved in regulating epigenetic processes and 153 .Through their enzymatic activity, sirtuins modulate the acetylation status of histones and other proteins, which have been implicated in a diverse range of biological processes in eukaryotes, including cellular aging, diseases, genome stability, metabolism, and stress response 97 .The Zt09_7_00034 sirtuin resides exclusively next to the tyrosine recombinase gene within the Swordfish.The lack of SNV variants in the coding sequence, coupled with transcription activity, suggests a local regulatory role within the epigenetic landscape of the element, partially explaining the successful insertion of the element in a highly conserved genomic region.In fungi, sirtuins are involved in mating-type loci silencing, rDNA stabilization, host defense suppression and secondary metabolism 99,154,155 .Here, we found the first evidence of a sirtuin protein linked to climate adaptation in fungi.

Figure 1 .
Figure 1.High-quality gene CNV calls across a thousand-genome panel of the wheat pathogen Zymoseptoria tritici.A) Geographic distribution of the 1109 samples.B) CNV call dataset after hard filtering across genetic clusters.C) Gene CNV call accuracy comparison between strains sequenced twice.D) Gene CNV call congruence comparison between GATK and chromosome-level assembly (SyRI) based CNV calling methods.The bar plot refers to the number of matching loci between methods.The boxplot refers to the quality call parameter CNQ score between matching and nonmatching calls.E) CNQ score distribution for duplications and deletions per chromosome after filtering.F) Matches of gene CNV calls and PCR validation.

Figure 2 . 302 Figure 3 .
Figure 2. Global survey of CNVs.A) Read coverage ratio between accessory chromosomes and core chromosomes in the global collection.Expected read coverage was defined as a minimum of 1/3 read coverage compared to the mean core chromosome coverage.B) Frequency of chromosome CNVs in the global collection.Complex CNVs are defined as chromosomes exhibiting large deletions and duplications across chromosome arms.C) Distribution of chromosome duplications.D) Distribution of chromosome 14 insertion variants.The boxplot refers to the transcription activity of genes in the insertion sequence compared to other genes under in vitro conditions and during the host infection cycle.Dpi refers to days post infection.E) Total chromosome count including duplicates.F) Samples carrying chromosome duplications bringing the total chromosomal load beyond the canonical 21 euploid chromosomes.

Figure 4 .
Figure 4. Global copy-number polymorphism population structure.A) Principal components (PCs) analyses based on 218 CNV genes or 26,728 single nucleotide variants (SNVs).The color scheme assigns genetic clusters.B) Number of gene CNVs per strain and variant frequencies at CNV sites per population in core and accessory chromosomes.The sample number per population is shown in the last column.Different letters above the boxplot identify significantly different groups based on an ANOVA.C) Relation between mean CNV number per strain and RIP composite index in gene deletions and duplications.Functional RIP machinery refers to genomes sharing an intact copy of the dim2 methyltransferase gene underpinning activation of RIP.D) CNV variant differentiation based on the fixation index VST across core chromosomes.The red dotted line marks the 95 th quantile used as a threshold.E) CNV frequencies of the top VST CNV locus across populations.Frequency shifts at the effector locus in MEA-North Africa genomes.Expression profile of the effector gene during host infection in four strains (dpi: days post infection).

Figure 5 .
Figure 5. CNV contribution to environmental range adaptation.A) Manhattan plot of the genome-environment association (GEA) analyses using two mixed model methods (Tassel and GEMMA).The red dotted line refers to the Bonferroni (alpha=0.05)threshold.The Venn diagram shows the GEA hit overlap between mixed model methods (Tassel and GEMMA).The bar plot shows the number of significant hits per climatic factor.B) Distribution of gene CNV frequency and significant GEA hits in the global collection.The red dotted line refers to the mean frequency.C) Biosynthetic gene cluster (BGC) 19 frequency in the global collection.Circles at the top refer to mean transcriptional activity under in vitro conditions in a subset of the global collection (n=74).D) Presence of BGC19 across geography and correlation with the associated phenotype (annual mean temperature).E) On-host reproduction rate in a subset (n=145) of global collection grouped by presence/absence of BGC19 and three different wheat cultivars.Data displayed the overlapping hits using Tassel and GEMMA.Wilcoxon test, p value<0.0001.

Figure 6 .
Figure 6.Starship mobile element associated with thermal adaptation.A) Temperature variation associated with the CNV of Sir2.Starship gene order in the reference genome IPO323.B) Synteny plot between chromosome-level assemblies of Z. tritici strains and sister species within the Starship region.Diagonal bars refer to different genomic contexts, and dotted arrows refer to transposition events between regions.C) Predicted impact on protein sequences (normalized by transcript size), transcriptional activity during the infection host cycle and posttranslational modifications for Starship and flanking region genes.The dark grey are refers to the Starship cargo genes.