A THOUSAND - GENOME PANEL RETRACES THE 1 GLOBAL SPREAD AND CLIMATIC ADAPTATION 2 OF A MAJOR CROP PATHOGEN 3

we by assembling a global thousand-genome panel of Zymoseptoria tritici a reported in all production areas worldwide. We identify the global invasion routes and ongoing genetic exchange of the pathogen among wheat-growing regions. We find that the global expansion was accompanied by increased activity of transposable elements and weakened genomic defenses. Finally, we find significant standing variation for adaptation to new climates encountered during the global spread. Our work shows how large population genomic panels enable deep insights into the evolutionary trajectory of a major

Zymoseptoria tritici is a major pathogen of bread and durum wheat, causing the disease Septoria tritici 76 blotch, which is now reported in most wheat-growing regions and causes significant damage 10 . The 77 center of origin of Z. tritici is located in the Middle East, where sister species are found to infect wild 78 grasses 11 . The emergence of Z. tritici was concomitant with the domestication of wheat 11 . The timing 79 3 and shared geographic origin of the pathogen and domesticated wheat strongly suggests coevolution 80 between the two species. The pathogen harbors extensive standing variation from individual infected 81 leaves to large agricultural regions 12,13 . As a consequence, the pathogen showed rapid responses across 82 all major wheat-producing areas to overcome host resistance and gain tolerance to fungicides in less 83 than a decade 10 94 We assessed the evolutionary trajectory of the pathogen in conjunction with the history of global wheat 95 cultivation (Fig. 1A). For this, we assembled a worldwide collection of Z. tritici isolates from naturally 96 infected fields (Fig. 1B). We selected isolates covering most wheat production areas, both in the center 97 of origin of the crop (i.e., the Fertile Crescent in the Middle-East), and in areas where wheat was 98 introduced during the last millennia (i.e., Europe and North Africa), or last centuries (i.e., the Americas 99 and Oceania; Fig. 1C). We gathered 1109 high-quality, whole-genome, short-read sequencing datasets 100 (Table S1) covering 42 countries and a broad range of climates. Using a joint genotyping approach, we 101 produced raw variant calls for further inspection. To assess genotyping accuracy, we used eight isolates 102 with replicate sequencing data to analyze discrepancies. We adjusted quality thresholds targeting 103 specifically the type of genotyping errors observed in our data set (Fig. S1)

112
We tested whether global diversity patterns of pathogen populations are likely a consequence of the 113 history of wheat cultivation. We first performed an unsupervised clustering of genotypes and identified 114 eleven well-supported clusters ( Fig. 2A, Fig. S2-3). Over 90% of the genotypes were clearly assigned 115 to a single cluster ( Fig. 2A, Table S3). Two clusters were identified among genotypes originating from 116 the pathogen center of origin, distinguishing collections from Iran and Middle Eastern regions.

117
Genotypes from Africa and Europe split into two distinct genetic clusters without any apparent 118 secondary structure within clusters. This lack of any fine-scale structure is remarkable given the 119 extensive geographic sampling of European genotypes and suggests extensive gene flow within the 120 5 continent. Genotypes from Oceania grouped into three distinct clusters marked by collections from 121 Tasmania, the Australian mainland and New Zealand. Genotypes from North America formed two 122 clusters along a North-South separation. Finally, South American genotypes formed two clusters split 123 along the Andes (Chile versus Argentina and Uruguay). A principal component analysis of all 124 genotypes confirmed the nested genetic structure with differentiation at the continent level, subdivisions 125 within some continents and the existence of admixed genotypes (Fig. 2B, Fig. S4).  (Fig. S5D). However, the migration events did not affect the overall 172 shape of the inferred population tree (Fig. 2C, Fig. S5B-D). To better understand effects of long-distance 173 gene flow, we investigated the relationship between relatedness among genotypes (i.e., identity-by-174 state) and geographic distance. At the continent level, we observed a negative relationship between 175 identity-by-state and geographic distance (Fig. S6). The wide distribution of identity-by-state values 176 shows that although closely related isolates tend to be found at closer geographic distance, distantly 177 related isolates can be found at both far and close geographic distances. In combination, our findings

224
We found that TE insertion polymorphisms are most often specific to a genetic cluster with only 31% 225 of insertions shared by two or more populations (Fig. S7B). We used TE insertion polymorphisms as a 226 genetic marker and found that population differentiation was matching differentiation assessed using 227 short variants (Fig. S7C). Hence, the population history of the pathogen was an important factor shaping 228 TE content. We also found that the per-individual TE content has increased in most areas outside of the 229 center of origin. Accounting for depth of sequencing, the TE content of the Middle Eastern and African 230 genomes was lower than in any other region (Fig. 3D). In contrast, the Oceanian and American genomes 231 contained among the highest TE numbers. We identified multiple-step increases in TE content in

339
We identified 1956 variants significantly associated with at least one climatic variable and 640 variants 340 with a minor allele frequency (MAF) higher than 5%. The number of associated variants per climatic 341 variable ranged between 36 and 541 (for BIO9 and BIO6 respectively), including 1-190 significant 342 SNPs with a MAF > =5% per climatic variable (Fig. 4E). We investigated whether particular variant 343 classes were enriched in the set of significantly associated SNPs. Using permutations, we found that 344 both nonsynonymous and intergenic variants were more frequently associated with climatic variables 345 than expected randomly while synonymous variants were significantly depleted (Fig. S10B). We found 346 187 genes that were in proximity or directly affected by variants associated with bioclimatic variables 347 and with a MAF >= 5%, including 65 containing non-synonymous variants (Table S5). For each GEA, 348 we retrieved significantly associated loci by clustering significant variants within a distance of 10 kb.

349
The significant variants clustered into 5-27 distinct loci per GEA consistent with a polygenic basis for 350 most climate adaptation (Fig. S11-15; Table S4). A large number of associated loci were shared between 351 GEA of different climatic variables. Highly correlated climatic variables including BIO5 and BIO10, 352 the maximum temperature of the warmest month and the mean temperature of the warmest quarter, 353 respectively, shared also higher numbers of significantly associated SNPs (Fig. 4D, Fig, S10A).

354
However, we also identified hotspots of climatic adaptation loci for largely independent climatic 355 variables (Fig. 4F). We identified a large segment of chromosome 7 and a telomeric region of  (Table S6). For example, the temperature-sensitivity QTL on chromosome 1 overlaps with 365 loci associated with multiple temperature-related climatic variables such as the mean temperature of the 366 warmest quarter (Fig. 5A-C). A variant associated with the mean temperature of the warmest quarter To identify whether adaptive mutations tend to arise locally or occur across large geographic areas, we 375 clustered significant loci based on the presence or absence of adaptive variants per country. We 376 identified eight clusters characterized by shared adaptive variants with a similar geographic distribution 377 (Fig. 5E, Fig. S16). Some clusters of adaptive variants are highly geographically localized (i.e., clusters We created de novo draft assemblies using the software SPADES v.3.14.1 37 with the "careful" method.

434
We used only the assemblies (filtered to remove any contigs shorter than 1 kb) with fewer than 1600 435 contigs and a total length between 30 and 4 Mb. We trimmed and filtered the reads with Trimmomatic BaseQRankSum between -2 and 2. We also included a per-genotype filter, removing any genotype with 446 a depth lower than 3.

448
We further assessed the quality of our SNP calling using 8 isolates which were sequenced two times 449 (including in some cases in different sequencing datasets). We made the assumption that the real 450 variation between these pairs should be close to 0, although we cannot completely exclude the 451 possibility of a small number of mutations happening during culturing or maintaining of these isolates 452 in collection. We used the differences, i.e., erroneously called variants, between the resequencing pairs 453 as an estimation of genotyping errors and to identify the causes of genotyping errors. Most of the 454 erroneously called variants remaining after the hard filtering were related to genotypes with near-equal 455 numbers of reads supporting the reference allele and an alternative allele, i.e.., "heterozygous" alleles.

456
Such genotypes were called with a high confidence despite the fact that such a heterozygous-like pattern 457 should be recognized as errors in a haploid organism and could be due to misalignment or repeated 458 sequences in the genomes. We consequently implemented an allelic balance custom script to recognize 459 such positions and to filter out any genotype that had fewer than 90% of reads supporting the called 460 allele. This filter removed 75% of the erroneous variants left between the resequenced pairs after the 461 hard filtering. As the rest of the erroneous variants were related to low sequencing depth, we further 462 implemented a per-sample missing data and low-depth filtering, removing any sample with more than 463 20% of missing data and a mean depth of coverage lower than 6 on the core chromosomes (based on 464 vcftools -missing-indv and --depth options) 41 . In the next filtering step, we removed the samples that 465 were clones or near-clones. To identify these isolates, we created a network of isolates with an identity-466 by-state value superior to 0.99 (as measured by plink v1.9 42 ) and extracted the subgraphs designating

494
For the analyses relying on the comparison of distinct groups, we discretized the populations by using 495 only isolates belonging to one of the populations with a proportion higher than 0.75 at K=11, the inferred 496 best number of clusters. These genotypes were then used as input for treemix which infers splits between 497 populations and creates a population tree 46 . We ensured that the tree shape was consistent regardless of 498 possible migration events by running treemix with a number of possible migration events ranging from 499 0 to 6. We used the assembled genomes of Z. ardabiliae and Z. passerinii as outgroups to root the tree 47 .

500
We used the scikit-allel python package to measure genetic diversity (pi), taking into account only the 501 non-admixed isolates from each cluster, in non-overlapping windows of 1 kb 48 . To remove windows 502 with too much missing data (i.e., those that would artifactually lower the diversity), we selected only 503 windows in which less than 20% of the variants were filtered out. We controlled for the variation in 504 isolate numbers between clusters by subsampling each cluster 10 times to the smallest cluster size (N= 505 16) and averaging the obtained diversity estimates per window over the 10 subsamples.

508
We called the TE insertions with the software ngs-te-mapper2 49 . In this process, the sequencing reads 509 are first queried against a library of TE sequences, for which we used the TE consensus sequences 510 obtained from 20 fully assembled genomes of 19 global isolates 50 . The "junction reads" that align both 511 on a TE consensus sequence and on the flanking genome are used to determine the site of insertion of 512 reference and non-reference TEs. To take into account any inaccuracies in the detection of the insertion 513 sites, the positions of the insertions were rounded to 100 bp, so that insertions of the same element in a 514 short window were considered to be the same insertion for further analyses. The insertions found in 515 more than 10 samples were used to create a PCA (prcomp function from the stats R package), clustering 516 the isolates based on their shared TE insertions. 517 19 We investigated the genomic distribution of variants, in relation to genomics, transcriptomics and 518 epigenomics estimates. We gathered information from several sources and aggregated the data in 10-519 kb windows. We used transcriptomic data produced previously 51 and analyzed to calculate TPM 47 , 520 representing the gene expression during the necrotrophic and the asymptomatic phases of infection for 521 the reference isolate. To include epigenomics data, we used previously published histone mark ChIP-522 seq data 22 and identified the peaks for several histone marks: H3K4me2, H3K27me3, and H3K9me2.

523
We trimmed and mapped the reads using trimmomatic as above and bowtie, and filtered the mapped 524 reads to keep only those aligning with a quality higher than 30 using samtools. The histone mark peaks 525 were called using macs2 (option -no-model) 52 , and only the regions appearing in both repeats were 526 kept (bedtools intersect). To remove windows with low variant counts due to a low mappability, we

539
One of the most important enzymes for RIP is dim2. Based on previous knowledge that the strain Zt10 540 contains a functional copy of dim2 28 , we extracted the sequence of the gene (Zt10_unitig_006_0417) 541 as well as its two flanking genes (Zt10_unitig_006_0416, and Zt10_unitig_006_0418). These were then 542 used as query sequences for the software blast to detect the presence and location of the 3 genes in de 543 novo draft assemblies based on the Illumina resequencing. In many isolates dim2 is found in multiple 544 copies. We thus identified the native copy as the copy found between the two flanking genes or, when 545 the assemblies did not include all three genes on one contig, the copy found within less than 10 kb of 546 one of the flanking genes. We then considered the percentage of identity between the native copies and 547 the functional copy of Zt10. 548 549 Statistical differences between geographical groups were assessed using a one-way ANOVA with 550 blocks, with the sequencing batch considered as the confounding block. The sequencing batch effect 551 was especially strong for the genomes from Hartmann et al. 56 probably due to a strong GC bias in the 552 sequencing. As a post-hoc analysis, we used mean separation tests (least-square means) and displayed 553 the results as letters on the corresponding plots. 554 20 555 Adaptation and selection 556 We obtained geographical coordinates from the metadata attached to the isolates or inferred them based 557 on the most precise sampling location available. Coordinates inferred can be found in Table S1. We 558 downloaded gridded weather and climate data at the 10' resolution from the WorldClim database 559 version 2 (WorldClim.org). The geographical coordinates were used to approximate the environmental 560 conditions of origin for each isolate from all bioclimatic variables, which include for example the mean 561 diurnal range and the annual precipitation as an average between 1970 and 2000. Based on these 562 environmental estimates and the genomic variants, we identified genotype-to-environment associations 563 with the software GEMMA 0.98.3 57 . We used a LOCO (Leave-One-Chromosome-Out) approach to 564 estimate the kinship matrix on the genome excluding the particular chromosome on which we were 565 estimating the associations. Significance threshold was set using the Bonferroni correction method, i.e., 566 by dividing the traditional threshold of significance of 0.05 by the numbers of variants that were tested. 567 Nearby significant SNPs were grouped together in "significant loci" if they were closer than 10 kb.

568
To identify the genes which are potentially causal for adaptation to climatic conditions, we predicted 569 the effect of the significant variants on the genes and proteins using SnpEff 58 . We created a custom 570 SnpEff database so that the predictions would match the gene annotation we are using 59 and setting the 571 upstream/downstream interval length to 1 kb. We also used the SnpEff predictions to compare the 572 distribution of effects (synonymous, non-synonymous and modifier) in the significantly associated 573 variants and in all the variants using 200 random draws.

575
We investigated the geographic distribution of the potentially adaptive alleles we found through k-576 means clustering of each allele's presence-absence per country. Per significant locus and per bioclimatic 577 variable, we selected the variant with the lowest p value (i.e., the top variant in each "peak"), and 578 identified whether the minor allele was present or absent per country/state including more than 5 579 isolates. The matrix of presence/absence was then used for the clustering. This analysis did not reveal