Taxonomic vs genomic fungi: contrasting evolutionary loss of protistan genomic heritage and emergence of fungal novelties

Fungi are among the most ecologically important heterotrophs that have radiated into most niches on Earth and fulfil key ecological services. However, despite intense interest in their origins, major genomic trends characterising the evolutionary route from a unicellular opisthokont ancestor to derived multicellular fungi remain poorly known. Here, we reconstructed gene family evolution across 123 genomes of fungi and relatives and show that a dominant trend in early fungal evolution has been the gradual shedding of protist genes and highly episodic innovation via gene duplication. We find that the gene content of early-diverging fungi is protist-like in many respects, owing to the conservation of protist genes in early fungi. While gene loss has been constant and gradual during early fungal evolution, our reconstructions show that gene innovation showed two peaks. Gene groups with the largest contribution to genomic change included extracellular proteins, transcription factors, as well as ones linked to the coordination of nutrient uptake with growth, highlighting the transition to a sessile osmotrophic feeding strategy and subsequent lifestyle evolution as important elements of early fungal evolution. Taken together, this work provided a highly resolved genome-wide catalogue of gene family changes across fungal evolution. This suggests that the genome of pre-fungal ancestors may have been transformed into the archetypal fungal genome by a combination of gradual gene loss, turnover and two large duplication events rather than by abrupt changes, and consequently, that the taxonomically defined fungal kingdom does not represent a genomically uniform assemblage of extant species characterized by diagnostic synapomorphies.


32
The evolutionary diversification of lineages into clades of various sizes, some highly 33 successful and species-rich while others less so, is an outcome of complex interactions between 34 ecological opportunity, changing biotic and abiotic environments and the genetic makeup of the 35 organisms representing the lineage. Inferring the genomic footprint of the emergence of kingdom-level 36 clades and, conversely, innovations that underlie their evolutionary success have been hard to assess. 37 Recently, comparative genomic approaches that allow ancestral genome content to be inferred based 38 on extant genomes revealed that complex mechanisms, including novel and co-opted genes as well as 39 gene loss 1-3 contributed to the origins various clades, including pro-4 and eukaryotes 5 , metazoans 1,6 or 40 land plants (Embryophytes 2 ). These patterns are consistent with proposed general trends of genome 41 evolution 7 , however, whether they are universal across the tree of life is unknown. 42 Fungi are one of the evolutionarily most successful groups. They not only exhibit an extreme 43 diversity in morphological traits, but also in ecological functions, which make them key players in the 44 ecosystem as symbionts, parasites or saprobes, among others. Kingdom Fungi, defined in the broad 45 sense to include Rozellomycota and Microsporidia (also known as Opisthosporidia) and 46 Aphelidomycota 8 , encompasses at least 10 phylum-level clades, with most of its phylogenetic 47 diversity found in early-diverging fungi (EDF), an informal group comprising Rozello-, Aphelido-, 48 Chytridio-, Sanchytrio-, Blastocladio-, Olpidio-, Zoopago-and Mucoromycota as well as 49 Microsporidia 9,10 . Irrespective of significant flux in the taxonomic definition of Fungi 11 , how the 50 earliest fungal ancestors are characterised and how they evolved from a unicellular opisthokont 51 ancestor has been the subject of intense research. Reconstructions of the last universal fungal ancestor 52 (LUFA) became more and more nuanced with the inclusion of newly discovered and/or classified 53 early diverging lineages 8,10,12-16 . These results contributed to our current understanding of LUFA as a 54 unicellular phagotrophic parasite of microalgae, possessing motile cell stages with amoeboid and 55 flagellar motility and a chitinous cell wall, at least in part of its life cycle 9,15 . In contrast, derived fungi 56 (i.e., Dikarya) are sessile terrestrial osmotrophs that grow septate hyphae, and their cells are covered 57 by a rigid cell wall throughout their life cycle. However, how LUFA and subsequent ancestors evolved 58 modern fungal traits is poorly known, however, and systematic analyses of the genomic changes 59 during this transformation are missing. 60 In this study we reconstructed gene family evolution in the fungal kingdom using 123 whole 61 genomes and analysed temporal and functional trends in the genomic changes we inferred. We find 62 that early diverging fungi are genetically intermediate between pre-fungal protists and Dikarya and 63 that a gradual shedding of ancient protist gene families happened in parallel with the emergence of 64 fungal novelties and expansion of pre-existing families in Fungi. The tempo and mode of gene 65 turnover and innovation outlines major genomic trends and picture an episodic emergence of modern 66 fungal traits, including multiple waves of gene family expansion and contraction related to key fungal 67 traits. Our results reveal that taxonomically defined fungi don't match with 'genomic fungi' and that 68 early fungal evolution has been highly episodic with continuous gene turnover. 69

71
To obtain a global perspective on gene content differences between fungi and related 72 opisthokonts, we first clustered protein sequences of 123 species into homologous protein groups 73 (HGs, see Methods). For this, we selected representatives of all, but one currently accepted fungal 74 phyla (including Rozellomycota and Microsporidia, except Sanchythriomycota), as well as 17 75 Holozoa, Amoebozoa and Heterolobosea species as outgroup (Supplementary Data 1). We inferred a 76 species phylogeny by maximum likelihood analysis of a supermatrix of 272 single-copy orthologs 77 (Fig. 1a, Supplementary Fig. 1). Our phylogeny is highly supported and is largely congruent with 78 phylogenies from recent studies 9,15,17-19 . 79 Based on the HG membership data, a Principal Coordinate Analysis resolved phyla into 80 distinct groups, indicating that gene content divergence correlates well with phylogeny (Fig. 1b). Early 81 diverging fungi, especially chytrids and aphelids, instead of occupying distinct positions in the space, 82 grouped with non-fungal protists, with the Blastocladio-and Zoopagomycota being transitional 83 towards the Mucoromycota+Dikarya (Fig. 1b) Gradual turnover of protist heritage and synapomorphies in fungi 102 By systematically searching for HGs with >70% conservation in at least one taxonomic group 103 (Supplementary Data 1), present in the Holomycota but absent from Dikarya, we identified 540 104 families (Fig. 2, Supplementary Data 2). This indicates that EDF possesses a considerable number of 105 HGs shared with protists, considerably more than previous anecdotal evidence suggested. For 106 example, LUFA lost 41 of these, but possessed 499 HGs that are conserved in protists but are missing 107 in Dikarya (Fig. 2). Analysis of the extinction dynamics of the 540 HGs showed that they were lost in 108 a stepwise manner, with the largest loss events, 76, 64 and 239 HGs inferred in nodes where 109 Blastocladiomycota, Zoopagomycota and Dikarya, respectively, split from the backbone. 110 The We inferred both a copy number reduction and complete loss of HGs within these pathways 121 ( Supplementary Fig. 2-3). Shared ancestry of Ca 2+ signalling in Fungi and Metazoa was known, 122 despite of the remarkable differences among them 27 , but our results revealed that the full complement 123 of these pathways have been retained in fungi until the Mucoromycota-Dikarya MRCA. Our data set 124 also showed that the gamma-secretase complex, which is involved in regulated intramembrane 125 proteolysis for various developmental and signalling processes and was thought to be absent in fungi 28 , 126 in fact, was lost only in Dikarya, Olpidium, and the RM clade. Interestingly, several HGs associated 127 with ubiquitination have also been lost during fungal evolution. Other functions highlighted by GO 128 correspond to families annotated in extant holozoa as mechano-and voltage-sensitive channels, 129 receptors (EGF, GPCRs) and mechanical or visual perception (Supplementary Data 2). Metazoan 130 annotations of these families are the most precise that exist, but may be hardly informative as to the 131 functions of these families in pre-fungal protists and EDF. 132 We also looked at core fungal novelties, defined here as HGs that evolved in one of the early 133 fungal ancestors and are highly conserved (≥70%) in descendant lineages. Previous studies reported a 134 shortage of fungal synapomorphies 11 , so we here tested if our genome-wide dataset can recover gene 135 families specific to and conserved within fungi. We identified 163 HGs, considerably less than recent 136 studies have, in animals and green plants (using 95% conservation threshold 1 ). This suggests that 137 novel core gene families are less prevalent in fungi, possible because fungi comprise several highly 138 reduced clades (e.g. yeast lineages 30 , Microsporidia). Nevertheless, the 163 families originated across 139 multiple nodes, with a larger grouping at the split of the Chytridiomycota (n=57, 35%, Node149, Fig.  140 2), suggesting that this node has seen key transitions in genome evolution (see 4 th chapter HGs) among core fungal novelties, highlighting the understudied status of fungal-specific genes. 156 Taken together, our analyses revealed that early-diverging fungi possess a large number of 157 HGs shared specifically with protists, and that these were gradually lost during evolution. The broad 158 conservation of protist HGs may explain the morphological and genetic similarities between EDF and 159 protists reported in previous studies 16,23,24,31,32 and also clarifies why the genome content of fungi 160 reflects that of ancestral opisthokonts more than metazoan gene content does 33 . However, our data 161 revealed that the retention of protist genes is not restricted to certain gene families, but is a genome-162 wide trend in early-diverging fungi. At the same time, we identified several conserved gene families 163 that originated in early fungi and were mostly conserved afterwards, as well as 186 less conserved 164 families containing fungal-specific domains. Although most of these can't be considered 165 synapomorphies in the strict sense 11 , due to their imperfect conservation, this indicates that beyond 166 losses, considerable novelty has also emerged in early fungal ancestors. To obtain a detailed picture on gene repertoire changes and test if evolution is gradual or 181 rather episodic, as some theories predict 35 , we reconstructed gene gain and loss dynamics for all HGs 182 containing at least four proteins. Our reconstructions provide information on which genes were 183 duplicated and lost in each of the HGs and at which branches of the phylogenetic tree (Fig. 3). For 184 example, LUFA was inferred to have had 12,761 genes, gained 913 and lost 295 compared to its 185 immediate ancestor, corresponding to a moderate net expansion. Reconstructed ancestral proteome 186 sizes ranged from 12,761 protein-coding genes in LUFA to 14,891 in the MRCA of Zoopagomycota 187 and derived fungi. If we consider net changes (duplications minus losses), it seems that most of the 188 genomes of early fungal ancestors contracted (Fig. 3), with two exceptions. The first is the split of 189 chytrids, while the second expanding ancestor is the MRCA of Zoopagomycota and derived fungi, 190 which is inferred to have expanded by 791 genes (1,114 gains, 323 losses).

191
We find that gene duplication has been highly episodic, with five large burst events, whereas 192 losses were ubiquitous in fungal backbone nodes. Of the five bursts of gene duplication two were 193 inferred along the backbone of the fungal tree (nodes C and D on half-circles indicate the inferred gains and losses at each node, while blue (gain) and red (loss) 224 numbers along the backbone depict the number of gain and loss events. Note that the ratio of blue and 225 red circles is proportional to the amount of net innovation and turnover in a given node. Letters in bold 226 represent the five largest bursts of duplication that comprised more than 2,000 gain events. Pie charts, 227 next to the nodes D and C represent the contribution of four functional categories to gene gains. Clade contained extracellular proteins (mostly CAZymes), transporters or transcription factors (Fig. 3a). The 238 remaining families included diverse functions, such as GPCRs, heterokaryon incompatibility genes, 239 protein kinases and endochitinases, among others (Supplementary Data 6). When we parsed these 240 figures in the context of the two large duplication events, we found that 22.9% of the chytrid and 241 34.7% of the Dikarya duplication bursts were related to extracellular functions, transporters and 242 transcription factors, respectively (Fig. 3b). The importance of these gene families for genomic 243 changes was also confirmed by functional enrichment analyses of HG duplication data 244 (Supplementary Data 5). Based on these observations, we below scrutinise extracellular, transporter 245 and transcription factor families in more detail. 246 Mapping of families with predicted extracellular localisation (hereafter extracellular proteins, 247 EP, for short) revealed a two-stage duplication dynamics, with large expansions in the MRCA of 248 chytrids and rest of the fungi (from 441 to 653 genes) and in that of Dikarya (from 806 to 1,167 genes; 249 Supplementary Fig. 5). In the first event, EP expansion was driven mainly by CAZYmes, whereas 250 other extracellular functions, including proteases, made a more modest contribution. These expansions 251 resulted in a shift in the CAZYme and SSP content of the EP from 34% in LUFA to 51% in the 252 Dikarya MRCA. Within EPs PCWDE families follow similar patterns, but with a larger expansion at 253 the MRCA of Dikarya (Supplementary Fig. 6). In contrast to PCWDEs, FCW families showed 254 proportionally more diversification in the chytrids than in the Dikarya, and large turnover in the 255 Dikarya MRCA (Supplementary Fig. 7). ( Supplementary Fig. 8). Of the three largest transporter families, we found that ABC transporters and 267 P-type ATPases showed contraction, whereas the Major Facilitator Superfamily underwent an extreme 268 expansion in fungi. This makes sense considering that only the latter is able to transport a high variety 269 of small molecules, including sugars, peptides and lipids 40 , whereas ABC transporters and P-type  Fig. 4a). In line with this, extant Mucoromycota contain the 287 most TFs among fungi, with Shannon-based diversities similar to those of EDFs (Fig. 4b). The 288 TFomes of extant Dikarya, especially Ascomycota species show similar sizes but lower diversity, due 289 partly to the expansion of the Fungal trans 2 and Zn cluster TF families, and partly to losses of certain 290 families in the Dikarya. 291 Inferred copy number of TFs in fungal ancestors, outlined three 'epochs', within each of which 292 ancestral TF repertoires seemed to be relatively constant (Fig. 4a). Transitions between these epochs 293 are concominant with remarkable historical events, like emergence of-mostly anucleate-rhizoids 294 and osmotrophy, or that of terrestrialization and of aseptate hyphae ( Fig. 2 and 4a).

295
Ranking of TF families based on the cumulative net change along the fungal backbone 296 (Supplementary Data 8) revealed that The most dynamically changing TF families included both pan-297 eukaryotic (e.g. C2H2-like, bZIP, HLH, HSF, Homeobox, GATA) and predominantly fungal TFs (e.g. 298 Zn cluster, Fungal trans 2, APSES, Velvet and Gti1/Pac2), irrespective of their raw copy numbers. 299 Several families expanded considerably more than the whole proteome, indicating that their copy 300 numbers are decoupled from proteome size. For example, while the proteome size of the MRCA of 301 Chytrids and derived fungi increased by 16.2% relative to the preceding node, the Zn cluster and bZIP 302 families expanded by 738% and 325%, respectively; suggesting expansion driven by adaptation. 303 Interestingly, C2H2 TFs underwent a massive expansion in the Opisthokonta ancestor. This family is 304 highly diverse in both Holozoa and the Holomycota, with clade specific contractions in secondarily 305 simplified clades, such as yeast-like Dikarya ( Supplementary Fig. 9). 306 Taken together, the evolutionary dynamics of transcription factors differs from genome-wide 307 patterns, in that, albeit also episodic, showed expansions in different nodes. Nevertheless, these data 308 suggest that broad rewiring of fungal gene regulatory networks has transcended multiple ancestors in 309 EDF. shown with dots. 321

322
The origins of highly diverse clades across the tree of life are remarkable evolutionary events, 323 but are they remarkable from a genomic perspective or because we attach taxonomic definitions, such 324 as a kingdom, to some of them? Based on systematic analyses we inferred that protein coding gene 325 content has changed drastically during early fungal evolution. However, in contrast to animals and 326 plants 2 , this has not happened abruptly at the taxonomic limit of the kingdom, even if competing 327 taxonomic circumscriptions of Fungi are considered. Rather, we found a significant retention and 328 stepwise loss of protist genes in early diverging fungi, combined with the episodic expansion of novel 329 and ancient gene families. We also identified major functional trends and the most dynamically 330 changing gene families, which allowed us to relate genomic changes to trait evolution during the early 331 evolution of fungi. 332 Our systematic analyses revealed that early-diverging fungi retained hundreds of protist gene 333 families that are missing in the Dikarya. These were lost gradually or replaced by fungal-specific 334 genes (e.g. E2F cell cycle regulator by SBF 24 ) during fungal evolution. We think this explains why 335 EDF gravitate towards protists in gene content-based analyses (See Fig. 1) and provides a genome-336 wide explanation for sporadic evidence for the similarity of EDFs to protists from analyses of single 337 genes 24,25 , individual genomes 33 or TF-omes 45 . At the same time, we detected a limited number of gene 338 families that could be considered synapomorphic for fungi, in agreement with previous conjectures on 339 the lack of synapomorphies in the kingdom 11,22 . 340 The retention of protist genes and the lack of clear synapomorphies for fungi blur lines 341 between fungi and related protists and may explain why the taxonomic limits of fungi have been 342 challenging to define and are still a matter of debate 14 . In other words, taxonomically defined Fungi 343 don't match with any clade we could define, based on gene content, as 'genomic Fungi', a situation 344 that would not change with the exclusion of Opisthosporidia from fungi. Defining 'genomic Fungi' is 345 also complicated by patterns of gene turnover, nevertheless, gene content (Fig. 1), turnover rates (Fig.  346 2) and novel gene families suggest that, if any fungal clade, the Dikarya comprises species with a 347 broader set of signature gene families. 348 Gene loss has been recently named as a dominant mechanism of metazoan 3,46,47 and plant 349 evolution 2 . Similar patterns were often observed in ancestral genome reconstructions, which gave rise 350 to the hypothesis of genome evolution comprising periods of genome expansion resulting in complex 351 ancestral genomes, followed by genome streamlining via gene loss 35 . In contrast, early fungal 352 evolution may be better viewed as a series of turnover events, in which gene loss aided the gradual 353 shedding of protist traits whereas innovation was probably fueled by two bursts and continuous small-354 scale duplication of genes.

355
Functional analysis of gene duplications and losses outlined major functional trends in the 356 evolution of early fungal genomes. Remarkable lost or contracting gene groups were related to 357 phagocytosis, the flagellum, cell cycle regulation and signalling (Fig. 3), among others. For highly 358 duplicated gene groups, a dominant functional signal was related to extracellular proteins (e.g. perhaps related to balancing nutrient assimilation and growth, an important regulatory mechanism for 368 hyphal osmotrophs. Our genome-wide catalogue of changes revealed a large array of other functions 369 as well, many of which cannot currently be easily linked to phenotypes or ecological functions due to 370 the paucity of knowledge on gene function. 371 Taken together, this study reconstructed the history of genomic change in the fungal kingdom 372 at unprecedented detail and provides a resource for further analyses of fungal genome evolution, also 373 at smaller evolutionary time scales. We conclude that, albeit sharp genetic changes at the border of the 374 fungal kingdom were not inferred, a large turnover of protistan genes and a gradual emergence of 375 fungal novelties in early fungal evolution portray a clear genetic roadmap for the emergence of 376 modern fungi from a unicellular algal parasite ancestor. 377

378
Dataset assembly and clustering of proteins 379 To evenly sample clades in the fungal kingdom, we sampled 106 fungal species, consisting of 380 proteomes from all known phylum level clades, with the exception of Sanchytriomycota, which were 381 not published prior to our data collection. In addition, as outgroups, 17 non-fungal species were 382 sampled to represent Heterolobosea, Amoebozoa, Holozoa, and Nucleariids. Proteome sequences were 383 downloaded from the JGI Genome Portal and NCBI/Ensembl (before July 2021; 49,50 ). All-vs-all 384 similarity searches of the 123 proteomes were performed with MMseqs2 51 using three iterations, with 385 sensitivity set to 6.5, max-seqs to 15,000, e-profile set to 0.001, a preliminary coverage threshold set to 386 0.2, and an e-value threshold set to 1e -4 . We then performed an asymmetric coverage filtering 387 (requiring 20% coverage for the longer and 80% for the shorter protein) and Markov clustering with 388 an inflation parameter 2.0 52 as described previously 53 . After clustering, we removed contaminating 389 proteins from gene families following the logic of Richter et al. 47 . Further, to achieve better 390 completeness of clusters without increasing noise, we merged clusters based on similarity, using the 391 all vs all output of MMSeqs and the results of HMM search between the consensus sequence and 392 HMM profiles of clusters. Based on the MMSeqs output file, the connection intensity of clusters (i.e a 393 network) was constructed. These cluster networks were iteratively reduced by excluding the weakest 394 nodes-sorted by the number of connections between the two clusters, normalised to cluster size-until 395 the maximum diameter of a network was three. Of these cluster pairs, only those that achieved an e-396 value cut-off at least 1e -10 in the HMM search (http://hmmer.org/) with an asymmetric coverage of 397 75/20% (HMM profile and consensus, respectively) and whose match score reached at least 75% of 398 the self-match were allowed to be merged. We called these merged filtered clusters of protein as 399 homologous protein groups, briefly HGs. 400

Species tree reconstruction 401
For species tree inference, marker genes were selected from four sources: 1) single-copy gene 402 families based on the clustering mentioned above, 2) clusters that were single-copy after eliminating two, only best hits were used for each species. Subsequently clusters were removed, where the average 406 distance of amino acid (AA) alignment were too high (>=1.5; dist.ml with the model WAG 56 ) and 407 which showed paraphyletic clusters in the hierarchical clustering of AA distances. Multiple sequence 408 alignments were inferred using PRANK v.170427 57 and trimmed using TrimAL v.1.2 58 (-strict). 409 Trimmed MSAs shorter than 60 amino acid residues (AA) and containing < 30 species were discarded, 410 leaving 272 single-copy clusters resulting 68,662 sites that were finally used for tree reconstruction. 411 Phylogenetic inference was performed under maximum likelihood (ML) in IQ-TREE v1.6.12 59 with 412 ultrafast bootstrap 60 (1000 replicates) based on the partitioned dataset of 272 clusters using the 413 substitution model LG+G. More complex models (LG+C60+G) had no effect on the topology of early 414 diverging lineages. We applied a constrained tree topology ((Allma1,Olpbor1),Ganpr1,Partr) in order 415 to separate Aphelidomycota from Blastocladiomycota, which caused no significant change in log-416 likelihood values as assessed by the Shimodaira-Hasegawa (SH) test 61 (deltaL=21.8, p-value = 0.356). 417

Inference of genome-wide duplication and loss history of clusters 418
For gene tree reconstructions, gene families containing at least four proteins were aligned 419 using the L-INS-I or auto (if the former was not applicable) algorithm MAFFT v7.313 62 and trimmed 420 with TrimAL (-gt 0.2). Gene trees were inferred in RAxmlHPC-PTHREADS-AVX2 8.2.12 under the 421 PROTGAMMAWAG model and we estimated branch robustness using the SH-like support 63 . Rooting 422 and gene-tree/species-tree reconciliation was performed with NOTUNG v2.9 64 using an edge-weight 423 threshold of 80.

424
Gene duplication and loss histories were inferred by mapping orthologous groups delimited on 425 the basis of gene trees to the species tree using Dollo parsimony implemented in a modified version of Only those HGs were considered transporters that contained >50% of transporter-specific domains, 463 and if plasma membrane localization (score >15) was the most likely within the HG. 464 Identification of conserved and dynamically changing homologous groups. 465 To obtain an overall picture of similarities and differences in the gene repertoire of the 123 466 species, a principal coordinate analysis (PCoA) was performed. For PCoA, a total of 9,993 467 homologous groups (HG) comprising at least four proteins and reaching ≥50% species from any clade, 468 were used (see clades in Supplementary Data 1). A binary distance coefficient from the gene family 469 presence/absence data was used, and a minimum spanning tree was superimposed on the distribution 470 of species on the two principal coordinates (using packages of stat, ape and tidyverse [66][67][68]  LGN).