Large-scale analyses reveal the contribution of adaptive evolution in pathogenic and non-pathogenic fungal species

Genome studies of fungal pathogens have presented evidence for exceptionally high rates of evolution. It has been proposed that rapid adaptation is a hallmark of pathogen evolution that facilitates the invasion of new host niches and the overcoming of intervention strategies such as fungicide applications and drug treatments. To which extent high levels of genetic variation within and between species correlate with adaptive protein evolution in fungi more generally has so far not been explored. In this study, we addressed the contribution of adaptive evolution relative to genetic drift in 20 fungal species, hereby exploring genetic variation in 2,478 fungal genomes. We reannotated positions of protein-coding genes to obtain a high-quality dataset of 234,427 full-length core gene and 25,612 accessory gene alignments. We applied an extension of the McDonald-Kreitman test that models the distributions of fitness effects to infer the rate of adaptive (ωA) and non-adaptive (ωNA) non-synonymous substitutions in protein-coding genes. To explore the relevance of recombination on local adaptation rates, we inferred the population genomic recombination rate for all 20 species. Our analyses reveal extensive variation in rates of adaptation and show that high rates of adaptation are not a hallmark of a pathogenic lifestyle. Up to 83% of non-synonymous substitutions are adaptive in the species Parastagonospora nodorum. However, non-synonymous substitutions in other species, including the prominent rice-infecting pathogen Magnaporthe oryzae, are predominantly non-adaptive (neutral or slightly deleterious). Correlating adaptation measures with effective population size and recombination rate, we show that effective population size is a primary determinant of adaptive evolution in fungi. At the genome scale, recombination rate variation explains variation in both ωA and ωNA. Finally, we demonstrate the robustness of our estimates using simulations. We underline the value of population genetic principles in studies of fungal evolution, and we highlight the importance of demographic processes in adaptive evolution of pathogenic and non-pathogenic species.


1 1
Intriguingly, for the parameter ω NA , expressing the rate of fixation of non-adaptive 4 1 2 non-synonymous mutations, in general was slightly higher in genes encoding  Table S11). Together, the 4 1 5 results show that genes encoding non-secreted proteins are generally more 4 1 6 conserved than genes encoding secreted proteins and effectors. Moreover, we also 4 1 7 show that effector proteins typically have higher rates of adaptation.
4 1 8 We further tested these observations using a linear model. Hereby we obtain further 4 1 9 support that effectors generally have significantly higher ω A and ω NA than non-4 2 0 secreted proteins across species (P = 0.023 and P = 0.017, respectively; Table 1).

2 1
However, values of ω A and ω NA in effector proteins are not significantly different from 4 2 2 those in secreted proteins. We also observed a higher standard deviation in ω A and  and supp figures 5, 6 and 7). This pattern likely reflects a higher level of conservation 4 2 5 in housekeeping genes such as essential enzymes, while other proteins secreted to 4 2 6 interact with the host or environment evolve at faster rates. Both median π N /π S and 4 2 7 d N /d S support a less stringent level of purifying selection in effector proteins than non-4 2 8 secreted proteins in most species (Supplementary figures 8,9,10,11,12,13). Overall, we show that effector proteins have higher rates of mutations fixed by 4 3 0 positive selection. However, a relaxation of purifying selection also contributes to 4 3 1 maintaining higher genetic variation in these genes. Genes present in accessory regions have been shown to play a role in pathogen 4 3 5 adaptation in several microbial species (38,39). In some species, these regions  The frequency of gene deletions per species was skewed towards lower frequencies, We next determined the contribution of adaptive and non-adaptive substitutions to 4 4 8 the evolution of core and accessory protein-coding genes. We used datasets of disentangling the adaptive portion of ω, we found a predominance of higher ω A 4 5 4 across the same previous nine species, with significant statistical differences in five other hand, the non-adaptive non-synonymous substitutions rate (ω NA ) was higher in 4 5 8 the core genes for five species out of a total of nine that could be analysed ( Figure 5; 4 5 9 Supplementary Figure 17). ω NA indicates that the core gene portion of P. nodorum 4 6 0 and S. sclerotiorum is subject to higher selective constraints, while in Z. tritici, C.  Together, the observed patterns of ω A and ω NA indicate that genes located in has so far only been addressed for individual species (e.g., 18,9,46) but not across thereby test general hypotheses related to adaptive evolution. In our study, we estimations to confounding effects from demography and population structure. here. At a local scale, a primary determinant of adaptive evolution is recombination. Hereby variation in recombination rate correlates with rates of adaptation as and --emit-ref-confidence GVCF. Next, genome VCF files were combined using 6 4 5 CombineGVCFs. The combined gVCF file was then genotyped using 6 4 6 GenotypeGVCFs with option --max-alternate-alleles 2, and variants selected using 6 4 7 SelectVariants with option --select-type-to-include SNP. Finally, SNPs were removed 6 4 8 if flagged with any of the following filter criteria: QD < 5; QUAL < 1000; MQ < 20; - 2.0. We used vcftools (v0.1.14) to remove SNPs failing the latter filters with -- remove-filtered-all (70). For each of the 17 outgroup species, we performed de novo assemblies using 6 5 6 previously trimmed reads. We used spades assembly toolkit (v3.14.1) with option -- careful to reduce the number of mismatches and short indels (71). We used a range 6 5 8 of pre-determined k-mer sizes per outgroup species according to the read length 6 5 9 obtained. The quality of each genome assembly was assessed with quast (v4.6.3) species genomes with --alternatives-from-evidence=false. After hard filtering, we kept only bi-allelic SNPs without missing information across 6 7 5 isolates of each focal species Supplementary absence of terminal gaps, and a threshold of 5% gaps. The protein sequences of each reference genome used to map the target species 6 9 2 were used to establish the phylogenetic relationship among species. We included 6 9 3 three outgroup species Amanita muscaria (PRJNA207684), Pneumocystis jirovecii 6 9 4 (PRJNA223510), and Rhizophagus irregularis (PRJNA208392). Protein datasets 6 9 5 were used for pairwise BLAST searches and ortholog reconstruction using OrthoFinder (v2.5.2) and default settings. A second run of OrthoFinder was 6 9 7 performed to generate multiple sequence alignments (MSA) of single-copy Next, we overlapped the prediction of deletion events with protein-coding gene 7 1 6 coordinates. Using "read.gff" from the R package ape (v5.6-2) (83) we imported the 7 1 7 coordinates of the protein-coding gene of each focal species using the gff file 7 1 8 provided with the reference genome. We converted both deletion and gene 7 1 9 coordinates into "Genomic Range" object using the R package GenomicRanges with deleted regions were retained (i.e., partial gene deletions were not considered). We constructed a dataset comprising: Core genes (genes present in all isolates in 7 2 5 our dataset for a species), and accessory genes (genes missing from at least one isolate for a species). First, we selected accessory genes at a missingness 7 2 7 frequency ranging between 0.1 and 5%. Next, we selected the isolates harboring the identified genes, from which we extracted sequence information for the accessory 7 2 9 gene group. Finally, from the same isolates, we randomly chose an equal number of 7 3 0 core genes to compose the core gene group. As proteins with an extracellular  Table 4). We included the identification of signal peptides with  θ for the given species. Also, for species with more than 100 individuals, we times to calculate a median value of ρ. If a species has a number of isolates lower 7 6 5 than that available for the pre-computed likelihood tables, we used the option "lkgen" to reduce the pre-computed likelihood table and calculated the median value of ρ. We further used the estimations of ρ across the genome of all species to gather have an equivalent number of genes per ρ category (Supplementary Table 6). The previously generated gene datasets were used to infer the synonymous and non-synonymous unfolded site frequency spectra (SFS), synonymous (Lps) and non-synonymous (Lpn) polymorphic sites, as well as synonymous (d S ) and non- Gamma, and Scaled Beta (Supplementary Table 7). The DFE models were 7 9 4 compared using Akaike's Information Criterion (94). A model averaging procedure of amino-acid substitutions that are adaptive). We used bootstraps to estimate the variation in each gene set's adaptation rates. We performed 100 re-sampling in each gene set. The rates of adaptation were species, and the estimates' distribution was used to visualize confidence intervals. A permutation test was implemented to assess the significance of differences in the 8 0 8 adaptation rates between gene sets, shuffling all genes among predicted function 8 0 9 (e.g., in pairwise combinations of non-secreted, secreted and effector gene sets) and between core and accessory 1,000 times. P-values were computed as by the is the number of replicates for which the absolute difference is 8 1 4 greater than or equal to the observed difference, and n is the number of replicates 8 1 5 for which parameters could be successfully calculated (9). We performed statistical inferences using generalized least square models (GLS) 8 1 9 implemented using the package nlme in R (R Core Team, 2022; Pinheiro et al.,  We used both ω A and ω NA as response variables. As explanatory variables, we used be computed for each single genes but for a set of genes, we could only assess categorical factors independently of each other in separate models. We applied a log fitting the models, we checked for the normality of residuals through visual inspection 8 3 5 of residuals' distribution and formal testing using the Shapiro-Wilk test (99,100).

3 6
Residuals variance homogeneity (homoskedasticity) was also inspected, and the 8 3 7 residual's independence was assessed with a Ljung-Box test (101). Based on the 8 3 8 previously described sanity checks, we performed a Box-Cox transformation on the 8 3 9 response variable when requirements of normality and homoskedasticity were not fulfilled (102). The resulting following model structures were considered: To assess the robustness of ω NA , ω A , and α estimations against the effects of 8 5 8 demography and population structure we performed simulations using different 8 5 9 population dynamics and re-estimated the rates of adaptation (see supplementary recombination rate is uniform along the sequence and set to 1e-7 Mbp (10cM/Mb).

6 5
The mutation rate, also uniform, is set to 2.2e-7 per bp. Only mutations within deleterious, and advantageous. Neutral mutations were used to derive the neutral 8 6 8 "synonymous" SFS, while deleterious and beneficial mutations were used to 8 6 9 generate the "non-synonymous". We ran Grapes based on the SFS dataset of the of the rates of adaptation were performed using the classical estimation (Basic, Grapes (14). We compared the estimated ω NA , ω A , and α to the corresponding true values. The authors are thankful to Nicolas Galtier and Marjolaine Rousselle for helpful 8 8 2 discussions and for sharing simulation scripts, as well as members of the 8 8 3 Environmental Genomics group in Kiel for fruitful discussions. This research was Agriculture, Agricultural Research Service through USDA projects 3060-22000-051- information and does not imply recommendation or endorsement by the U.S.

9 0
Department of Agriculture. USDA is an equal opportunity provider and employer. All sequencing data is available from the NCBI Sequence Read Archive. Individual accession numbers are provided in the Supplementary Table 1. Code availability 8 9 7 All custom scripts used for analysis and data representation are available from 8 9 8 https://github.com/DaniloASP/RatesOfAdaptation (DOI: 10.5281/zenodo.8289768).

9 9
Guided steps, downstream processing and visualization of the data using R, Bash and Python are provided as R markdown tutorials.
2 7 N a t u r e . 1 9 9 1 J u n ; 3 5 1 (  6  3  2  8  ) : 2 9 s p e c i e s a n d  D  a  n  e  c  e  k  P  ,  A  u  t  o  n  A  ,  A  b  e  c  a  s  i  s  G  ,  A  l  b  e  r  s  C  A  ,  B  a  n  k  s  E  ,  D  e  P  r  i  s  t  o  M  A  ,  e  t  a  l  .  T  h  e  v  a  r  i  a  n  t  c  a  l  l  1  0  8  6   f  o  r  m  a  t  a  n  d  V  C  F  t  o  o  l  s  .  B  i  o  i  n  f  o  r  m  a  t  i  c  s  .  2  0  1  1  ;  2  7  (  1  5  )  :  2  1  5  6  -8  .  1  0  8  7   7  1  .  P  r  j  i  b  e  l  s  k  i  A  ,  A  n  t  i  p  o  v  D  ,  M  e  l  e  s  h  k  o  D  ,  L  a  p  i  d  u  s  A  ,  K  o  r  o  b  e  y  n  i  k  o  v  A  .  U  s  i  n  g  S  P  A  d  e  s  D  e  N  o  v  o  1  0  8  8   A  s  s  e  m  b  l  e  r  .  C  u  r  r  e  n  t  P  r  o  t  o  c  o  l  s  i  n  B  i  o  i  n  f  o  r  m  a  t  i  c  s  [  I  n  t  e  r  n  e  t  ]  .  2  0  2  0  J  u  n  [  c  i  t  e  d  2  0  2  2  J  a  n  1  0  8  9   3  0  ] ;  D  o  r  m  a  n  n  C  F  ,  C  a  l  a  b  r  e  s  e  J  M  ,  G  u  i  l  l  e  r  a  -A  r  r  o  i  t  a  G  ,  M  a  t  e  c  h  o  u  E  ,  B  a  h  n  V  ,  B  a  r  t  o  ń  K  ,  e  t  a  l  .  1  1  5  8   M  o  d  e  l  a  v  e  r  a  g  i  n  g  i  n  e  c  o  l  o  g  y  :  a  r  e  v  i  e  w  o  f  B  a  y  e  s  i  a  n  ,  i  n  f  o  r  m  a  t  i  o  n  -t  h  e  o  r  e  t  i  c  ,  a  n  d  t  a  c  t  i  c  a  l  1  1  5  9   a  p  p  r  o  a  c  h  e  s  f  o  r  p  r  e  d  i  c  t  i  v  e  i  n  f  e  r  e  n  c  e  .  E  c  o  l  M  o  n  o  g  r  .  2  0  1  8  N  o  v  ;  8  8  (  4  )  :  4  8  5  -5  0  4  .  1  1  6  0   9  6  .  Q  u  i  n  n  G  P  ,  K  e  o  u  g  h  M  J  .  E  x  p  e  r  i  m  e  n  t  a  l  D  e  s  i  g  n  a  n  d  D  a  t  a  A  n  a  l  y  s  i  s  f  o  r  B  i  o  l  o  g  i  s  t  s  [  I  n  t  e  r  n  e  t  ] .