Abstract
Human-related environments, including food and clinical settings, present microorganisms with atypical and challenging conditions that necessitate adaptation. Several cases of novel horizontally acquired genetic material associated with adaptive traits have been recently described, contained within giant transposons named Starships. While several Starships have been recently found in domesticated species, the extent of their impact on the evolution of human-associated fungi remains unknown. Here, we investigated whether Starships have shaped the genomes of two major genera of fungi occurring in food and clinical environments, Aspergillus and Penicillium. Using seven independent domestication events, we found in all cases that the domesticated strains or species exhibited significantly greater Starship content compared with close relatives from non-human-related environments. We found a similar pattern in clinical contexts. Our findings have clear implications for agriculture, human health and the food industry as we implicate Starships as a widely recurrent mechanism of gene transfer aiding the rapid adaptation of fungi to novel environments.
Introduction
A central question in evolutionary biology is to understand how species adapt to their environment. Indeed, identifying genomic footprints of adaptation is of interest for predicting how species can face global changes. Horizontal gene transfer (HGT) and transposable elements (TEs) are key contributors to adaptive evolution. First recognized as a major mechanism of adaptation in bacteria, it is now accepted that HGT also occurs in eukaryotes (Fitzpatrick, 2012; Keeling, 2024; Van Etten & Bhattacharya, 2020), facilitating, for example, the acquisition of both virulence genes in fungal plant pathogens (Friesen et al., 2006) and carotenoid production in aphids (Moran & Jarvik, 2010). TEs are mobile genetic elements present in all domains of life, capable of moving within and between genomes through horizontal transfer events (H.-H. Zhang et al., 2020). This movement generates genetic variability and genome plasticity, fostering the emergence of novel adaptive variants, particularly in species coping with new environments (Grandaubert et al., 2014; Schrader & Schmitz, 2019).
Fungi are models for studying adaptation in eukaryotes given their relatively small genomes and their experimental assets (Gladieux et al., 2014). Additionally, some have been domesticated by humans to ferment food (e.g. beer, bread, sake, tamari, cheese), for industrial purposes (enzyme and protein synthesis) and to produce antibiotics (penicillin). Considering domestication is the result of strong and recent selection by humans on some phenotypic traits, these domesticated fungi are prime models for looking at genomic footprints of adaptation as they should be still visible in the genomes. Horizontal gene transfers have been identified as a recurrent genomic mechanism involved in adaptation in domesticated fungi, in the budding yeast Saccharomyces cerevisiae (Legras et al., 2018; Novo et al., 2009), in the dry-cured meat Penicillium species P. nalgiovense and P. salamii (Lo et al., 2023) as well as in the cheese species P. camemberti and P. roqueforti used for soft cheeses and blue cheeses production, respectively (Cheeseman et al., 2014; Ropars et al., 2015). The very same HGTs have been identified in a variety of distantly related Penicillium fungi used for both cheese and meat production giving us cases of convergent adaptation (Cheeseman et al., 2014; Lo et al., 2023; Ropars et al., 2015).
Recently, these HGTs found in domesticated Penicillium fungi have been formally identified as elements from the superfamily of giant transposons called Starships (Gluck-Thaler et al., 2022; Lo et al., 2023). These large TEs, ranging from tens to hundreds of kilobases, contain a structurally conserved tyrosine recombinase with a DUF3435 domain, also called the Captain, and can house an extensive and diverse array of cargo genes not involved in transposition (Gluck-Thaler et al., 2022; Urquhart, Vogan, et al., 2023). Notably, these cargo genes have already been linked, sometimes retroactively, to resistance to formaldehyde (Urquhart, Gluck-Thaler, et al., 2023) and heavy metals (Urquhart et al., 2022; Urquhart, Vogan, et al., 2023), adaptation to food production (Cheeseman et al., 2014; Lo et al., 2023; Ropars et al., 2015) and plant pathogenicity (A. Bucknell et al., 2024; Gourlie et al., 2022; Peck et al., 2023), suggesting that Starships and their associated cargo may facilitate rapid adaptation to challenging environments (A. H. Bucknell & McDonald, 2023). This may be particularly relevant to both newly emerging (Seyedmousavi et al., 2018) and treatment-resistant pathogens (Rivelli Zea & Toyotome, 2022) within the Aspergillus genus.
Due to both their size and the frequent presence of repetitive material, Starships are often fragmented in short-read assembled genomes and the analysis of their structure and impact can be hindered (A. Bucknell et al., 2024; Cheeseman et al., 2014; Gluck-Thaler et al., 2022). Fortunately, long-read assembled genomes have enabled the discovery and description of complete Starship elements along with their surrounding genomic context (A. Bucknell et al., 2024; Gluck-Thaler et al., 2022; Gourlie et al., 2022; Haridas et al., 2023; Vogan et al., 2021). Here, we focused on the fungal genera Penicillium and Aspergillus, each comprising various species of considerable significance to humans, to test whether human-related strains, compared to environmental strains, have accumulated a greater amount of accessory material from Starship insertions. Within these genera we used seven clear, well-described and independent examples of fungi domesticated for food production. These fungi play roles in the production of cheese (e.g. Penicillium camemberti for the production of soft cheeses and P. roqueforti for blue cheeses (Ropars, Caron, et al., 2020)), cured sausages (e.g. P. nalgiovense and P. salamii (Zadravec et al., 2023)) and both soy and rice based products such as soy sauce, miso and alcohols (e.g. Aspergillus oryzae (Gibbons et al., 2012; Machida et al., 2005), A. sojae (Acevedo et al., 2023) and A. luchuensis (Futagami, 2022)). Additionally, we looked at increasingly problematic human pathogens (e.g. A. fumigatus, A. udagawae and A. felis (Barrs et al., 2013; Rivelli Zea & Toyotome, 2022; Seyedmousavi et al., 2018)).
First, we compiled a genome database containing more than 1,600 public and newly-assembled genomes, manually curated information on the isolation origins of each strain and generated phylogenies using >3000 single copy orthologs to ensure accurate species identification. We then used starfish (Gluck-Thaler & Vogan, 2024) to identify tyrosine recombinase/Captain genes within all these genomes, using their abundance as a proxy for the number of Starships present. By coupling these data we discovered a notable trend. Specifically, all strains or species associated with food production or clinical contexts exhibited significantly greater Starship content compared with closely related strains or species from environmental sources. Moreover, we also observed an increase in Starship content among strains isolated from diverse saline environments, unrelated to food production. We hypothesise that the association between increased Starship content and these extreme environments is because growth in these conditions required novel Starship cargo genes in order to facilitate adaptation. Second, we developed a genome-graph based pipeline to extract both complete Starships and large Starship-related regions from long-read genomes. This approach allowed us not only to corroborate our prior findings independently but also to verify that the recent surges in Captain genes corresponded with an increase in Starship-related material. We also found that specific gene functions relevant to adaptation to food were enriched in Starships of domesticated fungi. This includes several functionally relevant cargo clusters, present across multiple Starships, thus greatly extending the list of adaptive convergence cases between distinct lineages. Additionally, we unveiled a novel element present in a subset of Starships, a MYB/SANT gene, often located in Starships at the opposite edge to the Captain gene, therefore likely denoting the downstream border of these Starships.
This combination of a conserved transposase and MYB-like gene notably mirrors the configuration of Harbinger transposons, where both genes are required for transposition (Hancock et al., 2010; Sinzelle et al., 2008). In summary, our findings corroborate the building evidence that Starships play a distinguished role in rapid adaptation of fungi to novel environments, including those with human significance.
Results
Accumulation of Starships in saline and human-associated fungal isolates
Using 1516 (368 Penicillium and 1148 Aspergillus) publicly assembled genomes and 157 newly assembled Penicillium genomes, alongside manually curated isolation origins, we tested whether food and clinical environments have any discernible relationship to Starship content (table S1). We additionally highlighted strains in a variety of Aspergillus species that came from diverse saline environments. Due to the limitations in using short-read generated assemblies, which make up the vast majority of publicly available genomes, we used solely the number of identified DUF3435 domain-containing genes (hereafter named Captains) as a proxy for the number of Starship insertions, considering their conserved and unique position in Starships (Gluck-Thaler et al., 2022; Gluck-Thaler & Vogan, 2024). We tested the accuracy of this proxy by looking for positive correlations between Captain count and genome size within species/species-complexes in which a large number of genomes were available. In all cases we detected strong and significant positive relationships (fig. S1; fig. S2), supporting the use of Captain count as a proxy for Starship insertions.
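The proxy check described above can be illustrated with a minimal sketch. It assumes a rank-based (Spearman) correlation, which is our choice for illustration; the statistic actually used is not stated in this section. The per-genome values are hypothetical placeholders, not data from table S1.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical values for five genomes of one species (illustration only).
captain_counts = [3, 5, 8, 12, 15]
genome_sizes_mb = [30.1, 30.8, 31.5, 32.9, 33.4]
rho = spearman(captain_counts, genome_sizes_mb)
```

A coefficient close to 1 within a species would be consistent with each additional Captain marking an additional insertion of accessory material; a significance test would still be needed on top of the coefficient.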
To ensure the accuracy of the species identification for all genomes used in this study, we also constructed BUSCO-based phylogenetic trees for each genus (Fig. 1; fig. S3; fig. S4; fig. S5; fig. S6; table S1; File S1; File S2). Using these genus wide datasets, we see greater Starship content on average in food-related, clinical and saline-environment strains compared to other environmental strains (Fig. 1B, Fig. 1D).
Aspergillus strains isolated from saline environments are phylogenetically dispersed among the genus and relatively rare (Fig. 1C). The 18 strains identified span 15 different species, within seven sections of the genus (table S1). Furthermore, the saline environments differed widely, including both terrestrial and aquatic origins (table S1). We compared the 18 strains isolated from saline environments to strains from all other environments and found that saline strains have a significantly larger number of Starships compared to all other groups (Fig. 1D).
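As an illustration of this kind of group comparison, the sketch below uses a one-sided permutation test on the difference in mean Captain count. This test is an assumption chosen for its simplicity, not necessarily the test behind Fig. 1D, and the counts are hypothetical.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """One-sided permutation test for mean(group_a) > mean(group_b).

    Returns the observed difference in means and an estimated p-value.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical Captain counts (illustration only).
saline_counts = [20, 22, 25, 30]
other_counts = [5, 6, 7, 8, 9, 10]
observed_diff, p_value = permutation_test(saline_counts, other_counts)
```

Because the saline strains span many species, a comparison of this kind ignores phylogenetic structure; it is meant only to show the shape of the test.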
Accumulation of Starships in moulds domesticated for food-production
We have previously shown that three Penicillium species have been domesticated for cheesemaking, namely P. camemberti used for soft cheeses such as Camembert (Ropars, Didiot, et al., 2020), P. roqueforti used for blue cheeses worldwide such as Roquefort or Stilton (Dumas et al., 2020), and P. biforme present on many natural cheese rinds and also found on some dried sausages (Lo et al., 2023; Ropars, Didiot, et al., 2020), with P. biforme and P. camemberti being sister clades. Our analysis shows that P. biforme and the two varieties of P. camemberti, i.e. var. camemberti and var. caseifulvum, contain significantly more Starships than their wild sister species P. fuscoglaucum and P. palitans (Fig. 2A; fig. S7). Penicillium roqueforti has been domesticated independently at least twice (Dumas et al., 2020), with one population containing strains of Roquefort PDO cheeses (named Roquefort population) and another population containing strains from all blue cheeses produced worldwide (named Non-Roquefort population). A new cheese population has been recently described (Crequer et al., 2023), with strains isolated from an uninoculated blue cheese from the village Termignon in the French Alps (named the Termignon population). The two HGTs identified previously in the Non-Roquefort population (CheesyTer (Ropars et al., 2015) and Wallaby (Cheeseman et al., 2014), retroactively identified as Starships (Gluck-Thaler et al., 2022; Urquhart, Vogan, et al., 2023)), have also been identified in the Termignon population. Our analyses show that strains from the wood/lumber and silage populations have significantly fewer Starships than cheese-related strains from both the Non-Roquefort and Termignon populations (Fig. 2B; fig. S8). We also see that the Roquefort population has significantly fewer Starships than other cheese strains.
We newly defined a ‘Contaminants’ population, comprising strains isolated from contaminated food products, with a Starship count similar to the Non-Roquefort and Termignon clades, suggesting a potentially feral origin.
In dry-cured meats, two Penicillium species are isolated in most cases, namely P. nalgiovense and P. salamii (Iacumin et al., 2009; Leistner, 1990; Lo et al., 2023; Perrone et al., 2015). Other Penicillium species have also been isolated from fermented meat products, including P. biforme (also found in cheese) and P. nordicum. In contrast to Penicillium cheese species, dry-cured meat strains of P. salamii and P. nalgiovense are not genetically distinct from their wild counterparts when considering SNPs, but very few non-dried sausage strains were analysed (Lo et al., 2023). However, dry-cured meat strains carry HGTs, which are likely Starships, that are shared between all three species and absent from non-dry-cured meat strains (Lo et al., 2023). Our analysis found that P. salamii contains significantly more Starships than its wild sister species P. olsonii (Fig. 2C; fig. S9). Furthermore, within the P. salamii species, strains isolated from dry-cured meat exhibit significantly more Starships compared to environmentally isolated strains. For P. nalgiovense we show that it contains significantly more Starships than both sister species P. chrysogenum and P. rubens (Fig. 2D). As in P. salamii, P. nalgiovense strains isolated from dry-cured meats appear to have more Starships compared to their environmentally isolated counterparts; however, there are currently only three of the latter. Notably, the two environmental strains with the lowest Starship count were isolated from soil and do not phylogenetically cluster with sausage strains. The third, with Starship content comparable to sausage strains, was isolated from squash; however, it phylogenetically clusters with the other sausage strains and has been shown to have cured meat-like phenotypes (Lo et al., 2023) (fig. S10). This strain, alongside strains isolated from ‘environment-food’, has been suggested previously to be a potentially feral strain (Lo et al., 2023).
Two distantly related species of Aspergillus, namely A. sojae and A. oryzae, are used for fermenting soybean- and rice-based products such as miso and sake (Acevedo et al., 2023; Gibbons et al., 2012). We show that A. sojae contains a significantly larger number of Starships compared to highly genetically similar A. parasiticus strains isolated from environmental sources (Fig. 3A; fig. S11). Similarly, A. oryzae displays a significant increase in Starship count when compared to closely related A. flavus strains (Fig. 3B; fig. S12). Moreover, within A. oryzae, strains with food-production origins contain significantly more Starships than environmentally isolated strains. Of particular note, the few clinical isolates of A. oryzae contain, on average, a similar number of Starships as the food-production isolates and were phylogenetically distinct (fig. S12). The last food-related Aspergillus species, A. luchuensis, primarily used in alcohol production, contains significantly more Starships than its closely related and primarily environmental sister species A. tubingensis (Fig. 3C; fig. S13).
Similarly, in A. niger, the average number of Captains for the food-production, environment-food and clinical isolates is higher than that of environment and environment-clinical isolates (fig. S14), though few genomes are available (fig. S15). While scarcely any food-related strains have been isolated, two were isolated from a wheat-based fermentation starter (Jeong & Seo, 2022), one of which has the highest number of Starships within the species. Notably, isolation from wheat Qu appears common (K. Zhang et al., 2019). Both their recent isolation and elevated levels of Captains may hint at an under-documented domestication event. Lastly, although A. niger is generally used for enzyme, protein and secondary metabolite production in industry (Cairns et al., 2018), no industrial strains were publicly available. More genomes of clinical, food-related and industrial strains would help clarify their associations with Starships.
Accumulation of Starships in clinical isolates
Aspergillosis encompasses a spectrum of diseases caused by species within the genus Aspergillus and represents a significant health concern globally. Although almost 20 species have been implicated as causative agents (Pfaller et al., 2006), the majority of infections are caused by four species, namely A. fumigatus, A. flavus, A. niger and A. terreus. Aspergillus fumigatus is currently the major air-borne fungal pathogen, with a relatively high incidence and mortality, comparable to Candida and Cryptococcus species (Bongomin et al., 2017; Dagenais & Keller, 2009; Denning, 2024). We found that clinical strains of A. fumigatus contain on average more Starships than environmental strains (Fig. 3D). Notably, the clinical strains are phylogenetically distinct and widely distributed (fig. S16). Aspergillus udagawae and A. felis are two species considered emerging causes of invasive aspergillosis (Barrs et al., 2013; Seyedmousavi et al., 2018; Stewart et al., 2022; Sugui et al., 2010). Although A. udagawae, A. felis and its sister species A. pseudoviridinutans have few genomes available, those at hand hint at a potentially positive relationship between clinical appearance and Starship count (Fig. 3E; fig. S17). These findings suggest a role of Starships in the virulence and/or emergence of fungal pathogens.
Looking at Starship content across the entire Aspergillus genus, the species A. montevidensis/amstelodami stands out with an average of 46 Captains (44-48) across nine genomes, the highest per species with more than one genome, compared to the genus median of 11 (Fig. 1; fig. S18). This species is both considered pathogenic (Fernandez-Pittol et al., 2022; Siqueira et al., 2018) and commonly found in fermented foods such as teas, meju and cocoa beans, the latter of which requires salt tolerance (Ding et al., 2024; Ryu et al., 2021; Takenaka et al., 2020). Additionally, A. montevidensis is closely related to A. cristatus, another species domesticated for tea fermentation and with high salt tolerance (Xie et al., 2024), with an average of 30 Captains. This clade warrants further sampling and investigation.
Accumulation of specific genes functionally relevant for food adaptation in Starships
Using long-read genomes we aimed to, first, confirm our earlier results and secondly, explore both the cargo content and structure of Starships. To do so, we employed a robust genome-graph based pipeline, using pggb and odgi (Garrison et al., 2023; Guarracino et al., 2022), to conduct clade-specific all-vs-all whole genome alignments, exclusively using contiguous long-read assembled genomes. This new approach allowed us to extract large candidate Starship regions from all genomes simultaneously and without reference bias. The criteria for these initial candidate regions were straightforward: they needed to be larger than 30 kb in size and absent in at least one close relative genome. These regions were then classified as a Starship-related region if at least one Starship-related gene was present (See Methods section ‘Starship-like region detection’). We did not aim to further characterise these Starship-related regions, for example by identifying more precisely the Starship borders, nor require the presence of a Captain, considering fragmented assemblies, structural variation and Repeat Induced Point-mutation (RIP) can disrupt the detection of full length elements and/or specific sequences in the latter case (Gluck-Thaler et al., 2022).
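The two-step classification above can be sketched as a simple filter. This is a minimal illustration, not the pipeline itself: the `Region` record, its field names and the gene label set are hypothetical, and the actual list of Starship-related genes is defined in the Methods rather than here.

```python
from dataclasses import dataclass

MIN_REGION_BP = 30_000  # candidate regions must be larger than 30 kb

# Hypothetical placeholder labels; the real gene set is given in the Methods.
STARSHIP_GENE_LABELS = frozenset({"captain"})


@dataclass(frozen=True)
class Region:
    genome: str
    contig: str
    start: int
    end: int
    absent_in: frozenset  # close-relative genomes lacking this region
    genes: tuple          # labels of the genes annotated within the region

    @property
    def length(self):
        return self.end - self.start


def is_candidate(region):
    """Larger than 30 kb and absent in at least one close relative."""
    return region.length > MIN_REGION_BP and len(region.absent_in) >= 1


def is_starship_related(region, gene_labels=STARSHIP_GENE_LABELS):
    """A candidate region carrying at least one Starship-related gene."""
    return is_candidate(region) and any(g in gene_labels for g in region.genes)


# Hypothetical regions (illustration only).
regions = [
    Region("strainA", "chr1", 0, 45_000, frozenset({"strainB"}), ("captain", "geneX")),
    Region("strainA", "chr2", 0, 20_000, frozenset({"strainB"}), ("captain",)),
    Region("strainA", "chr3", 0, 45_000, frozenset(), ("captain",)),
    Region("strainA", "chr4", 0, 45_000, frozenset({"strainB"}), ("geneY",)),
]
starship_related = [r for r in regions if is_starship_related(r)]
```

Keeping the candidate test separate from the gene test mirrors the text: a Captain is not required, so a region qualifies through any gene in the label set.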
Functional cargo convergence within primarily unique Starships
Using 76 long-read assemblies from 36 species, we constructed 11 clade-specific genome-graphs and detected a total of 857 Starship-related regions with a combined total of 202 Mb and 963 Captains (table S2). With these data we were able to recapitulate what we previously discovered, i.e., that domesticated and clinical strains consistently have a larger number of Starships compared to their environmental relatives, and additionally that this equates to a larger sum total of Starship-related material (Fig. 4; fig. S19). Because Starships vary in size, detecting all regions associated with Starship-related genes is important: in certain cases the number of Captains was similar but the amount of Starship-related material differed substantially. For example, the A. oryzae strain KBP3, with a similar number of Captains, contains over twice as much Starship-related material compared to other strains (Fig. 4A). Additionally, we were able to assess which regions were shared between different genomes within each genome graph constructed. This highlighted that some very closely related strains have a large number of recently acquired, unique Starships (Fig. 4A; fig. S19). A clear example is between the two varieties of P. camemberti, var. caseifulvum and var. camemberti, which share 99.9% DNA identity and yet contain primarily unique Starships, suggesting Starships have accumulated independently after the divergence of the two varieties. This result was verified by manually inspecting alignments of all Starship-related regions.
We then investigated whether Starship-cargo was functionally enriched for certain COG categories compared to non-cargo genes. We used our four newly assembled and annotated long-read assemblies from three cheese-related strains (P. caseifulvum ESE00019, P. camemberti LCP06093 and P. biforme LCP05531) and a closely related environmental strain (P. fuscoglaucum ESE00090). Using all COGs, and identifying genes without any COGs, we saw a very strong enrichment in genes without a COG annotation in all three genomes (fig. S20). In our cheese-related strains we also see the depletion of several other COG categories. Only the Z (Cytoskeleton) category was enriched in P. biforme. We then filtered out COGs that were absent, unknown (S), or associated with TEs (L and K) and found, only in cheese strains, several COGs significantly enriched in cargo genes compared to the non-cargo gene dataset (fig. S21). Combining the data from all three genomes showed the same COG differences (Fig. 4B). The enriched COGs are E (Amino acid transport and metabolism), I (Lipid transport and metabolism), M (Cell wall/membrane/envelope biogenesis), P (Inorganic ion transport and metabolism) and Z, whilst there is a depletion of COGs A (RNA processing and modification), C (Energy production and conversion), H (Coenzyme transport and metabolism), J (Translation, ribosomal structure and biogenesis), U (Intracellular trafficking, secretion, and vesicular transport) and Y (Nuclear structure).
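A per-category comparison of cargo and non-cargo genes can be sketched with a 2x2 contingency test. The sketch assumes a two-sided Fisher's exact test, which is our illustrative choice since the test used is not named in this section, and the gene counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    a: cargo genes in the COG category      b: cargo genes outside it
    c: non-cargo genes in the category      d: non-cargo genes outside it
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x category genes among the cargo genes.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as unlikely as the observed one.
    return sum(
        prob(x) for x in range(low, high + 1) if prob(x) <= p_observed * (1 + 1e-9)
    )


def direction(a, b, c, d):
    """Report whether the category is over- or under-represented in cargo."""
    cargo_fraction = a / (a + b)
    background_fraction = c / (c + d)
    if cargo_fraction > background_fraction:
        return "enriched"
    if cargo_fraction < background_fraction:
        return "depleted"
    return "unchanged"


# Hypothetical counts for one COG category (illustration only).
cargo_in, cargo_out = 30, 470
background_in, background_out = 200, 9800
p_value = fisher_exact_two_sided(cargo_in, cargo_out, background_in, background_out)
call = direction(cargo_in, cargo_out, background_in, background_out)
```

Testing many COG categories at once would normally call for a multiple-testing correction on the resulting p-values.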
We also tested cargo COGs depending on the isolation origins of the strain. We compared all cargo from food, clinical and environmental isolations (Fig. 4C). We found no significant differences between clinical and environmental cargo; however, several differences were significant when comparing either to food-related cargo. Cargo from food-related isolates is enriched for B (Chromatin structure and dynamics), D (Cell cycle control, cell division, chromosome partitioning), F (Nucleotide transport and metabolism), I, J (Translation, ribosomal structure and biogenesis), M, O (Post-translational modification, protein turnover, chaperones) and P.
Starships contain cargo-clusters for rapid adaptation to human-related environments
Using the Starship-like regions, we found not only genes of general interest such as those related to drug resistance (CDR1, metallo-beta-lactamase, FCR1, PDR16, ergosterol biosynthesis genes, YOR1, …), pathogenicity (RBT1, SEF1, AGS1, CAT1, …) and metal ion homeostasis (VCX1, ZRT2, SMF3, …) (table S3.6), but also gene clusters likely involved in lactose metabolism, dityrosine biosynthesis, salt tolerance, arsenic resistance and ethanol utilisation (Fig. 5; fig. S22; fig. S23; table S3). In all cases, the rapid loss of synteny at the edges of each cluster suggests that they have all been acquired as cargo-clusters several times by different Starships.
Lactose metabolism and dityrosine synthesis clusters in cheese and cured-meat Penicillium species acquired independently several times in different Starships
We previously discovered CheesyTer, a retroactively identified Starship (Gluck-Thaler et al., 2022; Ropars et al., 2015; Urquhart, Vogan, et al., 2023), shared between species primarily used for cheese production, and containing two genes, a beta-galactosidase and a lactose permease, likely important for lactose consumption (Ropars et al., 2015). Here, we were able to identify that the beta-galactosidase (LAC4) and the lactose permease are actually contained within a larger lactose-related gene cluster. The other genes within this cluster are two C6 zinc-fingers, a glycoside hydrolase of the family 71 (GH71) and a D-lactate dehydrogenase (DLD1) (Fig. 5), all of which were present in CheesyTer (Ropars et al., 2015) and likely play a direct role in lactose metabolism (table S3.1). Although our current sampling of this cluster suggests all genes are core, there is an apparent lower identity and truncated reading frames in both DLD1 and one of the two C6 zinc-fingers. In addition to the previously described species P. roqueforti, P. camemberti, P. biforme and P. nalgiovense, we also found this gene cluster in genomes from P. solitum, P. salamii, P. nordicum and P. crustosum, which are all found in food environments, although P. crustosum is often considered purely a food contaminant (Lund et al., 1995; Pitt & Hocking, 2009). These genomes represent multiple cases of independently domesticated species for both cheese and cured-meat. Additionally, P. biforme (LCP05531) contained two copies of the cluster in different Starship-related regions (Fig. 5).
We also identified another biosynthesis cluster, namely the dityrosine cluster, containing two genes, DIT1 and DIT2 that cluster tail-to-tail as already described across a variety of ascomycetes, particularly the Saccharomycetes (Linder, 2019; Nickles et al., 2023). These genes have been shown in Saccharomyces cerevisiae to be involved in the biosynthesis of dityrosine within the spore cell wall and enhance their tolerance to stresses (Briza et al., 1990, 1994). In Candida albicans, with the same cluster, a dit2 mutant was found to be both more susceptible to the antifungal drug caspofungin and to exhibit hyphal growth in minimal media (Melo et al., 2008).
We describe for the first time a DIT1/DIT2 cluster in two distinct Starship-related regions (Fig. 5; table S3.2). In addition to DIT1 and DIT2, this cluster contains DTR1, a multi-drug resistance protein and putative dityrosine transporter of the major facilitator superfamily (Felder et al., 2002), and ATG22, a vacuolar effluxer for amino acids after autophagy that primarily targets tyrosine (Yang et al., 2006). Notably, this set of four genes has been previously described as a co-regulated dityrosine cluster in Pichia stipitis (Jeffries & Van Vleet, 2009). This cluster has been independently acquired multiple times in different Starships, in the cheese species P. camemberti var. camemberti (LCP06093), P. camemberti var. caseifulvum (ESE00019) and P. biforme (LCP05531). Of note, there are five copies of the cluster within P. camemberti var. caseifulvum (ESE00019), involving four Starships of which one, ESE00019:HTR34, contains a duplicate copy without DTR1 and an apparent splitting of DIT2 into two ORFs (Fig. 5). Like the lactose cluster, the dityrosine cluster is found in cheese and cured-meat related strains.
Expanding upon the Arsenic resistance cluster of Hephaestus
Although arsenic is a naturally occurring toxic compound present ubiquitously in the environment, elevated concentrations are found in certain areas due to human activities such as mining and agriculture (Patel et al., 2023; Tchounwou et al., 2012). Arsenic resistance has evolved many times (William & Magpantay, 2024), including the ACR cluster in S. cerevisiae (Bobrowicz et al., 1997; Stefanini et al., 2022), and there is also evidence for the horizontal transmission of arsenic resistance genes by large ICE elements in bacteria (Arai et al., 2019). An arsenic cluster found within the Starship Hephaestus contains five genes (arsH, arsC, arsB, Pho80 and arsM) and confers arsenic resistance in the environmental fungus Paecilomyces variotii (Urquhart et al., 2022). With a larger sampling of all arsenic clusters in Starships, we describe here the cluster’s variable content, made up of only three core-cluster genes, namely, arsH, arsC and arsB, and additional genes that may be frequent but not required (fig. S22; table S3.3). Among these cluster-accessory genes, several were frequently present, specifically two genes with no functional description available and a basic region leucine zipper (bZIP), potentially playing a similar role to the bZIP transcription factor ARR1/YAP8, which is both required for the transcription of ARR2 and ARR3 in S. cerevisiae (Wysocki et al., 2004) and activated by arsenite binding (Navarro et al., 2022). Additionally, two other cluster-accessory genes were found close to the cluster in several different Starship-related regions: TRX1, a thioredoxin both inhibited by arsenic and shown to confer resistance (Jovanovic et al., 2021; Lu et al., 2007; Park, 2020), and a heavy metal associated p-type ATPase, similar to pcaA, shown to confer resistance to cadmium and lead in Hephaestus, possibly involved in arsenic/arsenite transport (Antonucci et al., 2017; Flores-Iga et al., 2023; H. Yan et al., 2019).
Several genomes had multiple copies of the cargo-cluster genes, whether from several unique Starships and/or duplications within Starships. Aside from the previously described species (P. variotii, A. fumigatus, A. sydowii, A. primulinus and A. varians (Urquhart et al., 2022)), we detected the cluster in six Penicillium species (P. chrysogenum/rubens, P. brevicompactum, P. freii, P. crustosum, P. nordicum and P. biforme) and a further one in A. niger. Notably, this list contains more environmental species, deviating from the other clusters that contain primarily food-related isolates, and a combination of species from both genera.
A highly variable salt tolerance gene cluster in food-related species
We also found more extensive evidence for a cluster briefly described in Urquhart et al. 2022 within A. aureolatus and containing genes involved in salt tolerance (NHA1, ENA1/2 and SAT4) (fig. S23; table S3.4). This cluster contains a more varied array of accessory genes and their structure, however, focussing on ENA2 (involved in Na+ efflux to allow salt tolerance (Ariño et al., 2010; Kvitek et al., 2008; McDaniel et al., 2018)) as a central component, we see other genes commonly cluster within its vicinity. The two other core genes include the previously described SAT4/HAL4, involved in salt tolerance (Mulet et al., 1999; Posas et al., 1995; Urquhart et al., 2022), and a TRK2-like gene involved in potassium transport (Mulet et al., 1999). In the majority of cases, both genes are found clustered with ENA2, however, their variable orientation and position relative to ENA2 exemplify this cluster’s diverse arrangement (fig. S23). Additionally, this cluster contains multiple instances of the accessory genes TRK1 (Potassium transporter, activated by SAT4 and required for SAT4 salt tolerance (Mulet et al., 1999)), pacC/RIM101 (TF involved in pH signal transduction; confers salt tolerance through the regulation of ENA2-like genes (Caracuel et al., 2003; Lamb & Mitchell, 2003)), ERG20 (involved in sterol biosynthesis shown to impact salt-tolerance (Kodedová & Sychrová, 2015)), and other genes (sodium/hydrogen antiporters and exchangers, a chromate transporter, a sodium/phosphate symporter, a potassium transporter, a ring finger protein and an anion exchange family protein) (table S3.4). This gene cluster repertoire clearly highlights its functional role centred around ion homeostasis and salt tolerance.
Additionally, Starship-related regions from A. niger (KJC3) contained two clusters with ENA2, a previously detected anion exchange family protein, a plasma membrane ATPase and, in one of the two cases, a sodium/hydrogen exchanger. Further cementing this cluster’s functional relevance, it was found in multiple configurations and in unique Starships in several strains, all related to food production. For example, three clusters each were found in P. salamii (PN007), P. solitum (strain12) and P. camemberti var. caseifulvum (ESE00019), that is, one cured-meat and two cheese-related strains respectively. Furthermore, these three strains, which belong to distant species, all shared the same Starship with one of these cluster configurations.
An ethanol utilisation cluster in food spoilage species
The model system of ethanol utilisation in Aspergillus nidulans requires both alcA and aldA, encoding alcohol and aldehyde dehydrogenases that convert alcohol into acetate via acetaldehyde, and the transcription factor alcR, which strongly induces the expression of alcA and aldA (Felenbok et al., 2001). Alongside alcR and alcA, three other genes make up an alc gene cluster and are regulated by alcR. One of these three genes, alcS, is likely an acetate transporter (Flipphi et al., 2006; Robellet et al., 2008). We found a Starship gene cluster containing alcR, alcS, alcA (ADH1) and aldA (ALD5) (fig. S24; table S3.5). Additionally, ERT1/AcuK, another ethanol-regulated transcription factor (Gasmi et al., 2014; Hynes et al., 2007), was found in several clusters. This cluster was found in strains of P. nordicum, P. canescens, P. antarcticum, P. bialowiezense, P. digitatum, P. verrucosum and P. expansum, of which the latter four are all related to food spoilage, primarily in fruit (Kim et al., 2007; Koffmann & Penrose, 1987; Luciano-Rosario et al., 2020; Marcet-Houben et al., 2012; Scholtz & Korsten, 2016).
Cargo-clusters within the whole genera
Having leveraged long-read assemblies in order to describe complete cargo-clusters, we then took a broader look at their distribution within all genomes of Penicillium and Aspergillus used in this study (Fig. 5B, D; fig. S25; table S3.7). In most cases, we found each cluster consistently within the same species; however, there were three main exceptions: the lactose cluster was found in 10/39 genomes of Aspergillus sydowii, the arsenic cluster was found in most strains of P. roqueforti (deviating from the other clusters found only in food-related clades) and the salt cluster was found in 2/8 A. tamarii strains. Looking at the isolation origins of all strains with each cluster, more environmental strains are found containing the ethanol and the arsenic cluster, whilst the other three clusters are found primarily in strains from food, most notably the dityrosine cluster (fig. S25). Considering how little is known about the roles of both dityrosine and dityrosine clusters (Nickles et al., 2023), the link between this dityrosine cluster and food-related species warrants further investigation.
A new gene identified at the opposite border to Starship Captains
During preliminary pairwise comparisons and manual inspection of Starship-related regions, we identified genes with a particular MYB/SANT protein domain that were frequently present and Starship-related region specific (fig. S26; Table 1). Additionally we noticed that these genes were not only frequently close to the edge of Starship-related regions but at the opposite edge to the Captain, if present. Using our newly generated list of Starship-related regions, we generated a non-redundant dataset and then compared the distance of each Starship-related gene, including the MYB/SANT, to the closest edge. This showed that DUF3435 and MYB/SANT genes are on average significantly closer to an edge than all other Starship-related genes (Fig. 6A). We also looked at the relative distance of each Starship-related gene from each DUF3435, within the same Starship-related region, and showed that MYB/SANT genes are significantly further away than other Starship-related genes (Fig. 6B). These results affirm the idea that MYB/SANT genes, like Captains (DUF3435), appear to have a structurally conserved position at the edge of Starships opposite to the Captain.
Additionally, we have many examples of Starships found within contigs, therefore more likely to represent complete Starship elements, that are clearly bordered by a Captain at one edge and the MYB/SANT gene at the other (Fig. 6C). The six examples all contained Terminal Inverted Repeats flanking Captain and MYB elements. This configuration was also detected in four of the previously identified complete Starships from Gluck-Thaler et al. 2022 (fig. S27). Notably, we also found genes that contained both the DUF3435 and MYB/SANT protein domains, including in the previously annotated Starship (Starship-Aory1) (Gluck-Thaler et al., 2022) (fig. S27). Using the set of non-redundant Starship-related regions, we also compared their sizes, grouping them by whether they contained both Captain and MYB/SANT genes together, one of the genes without the other (e.g. a Captain but no MYB/SANT gene), or neither. We see that Starship-related regions containing both genes were significantly larger than those containing only the Captain or neither (Fig. 6D). No significant difference in size was observed with Starship-related regions containing only the MYB/SANT gene, although their size is on average smaller. This smaller average size could be accounted for by either assembly breaks, which place the Captain and MYB/SANT on separate contigs, or structural variation, which would separate them physically within the genomes. Lastly, the Starship-related regions containing only the MYB/SANT gene are on average larger than those with only the Captain. Considering that both Captain-only and MYB/SANT-only regions would be subject to the same impact of assembly breaks and structural variations, this suggests that the MYB/SANT-containing regions initially come from Starships that are on average larger. Taken together, these results indicate that there is an association between larger Starships and the presence of MYB/SANT genes.
Lastly, we constructed a MYB/SANT database using all genes identified as MYB/SANT Starship-related genes and used this to replace the Captain database in Starfish. Starfish was then run on the same long and short-read genome datasets for Penicillium and Aspergillus. The MYB/SANT count recapitulated many of the same differences when comparing clades, and we see a strong correlation between MYB/SANT count and Captain count, where the number of Captains is on average slightly higher than the number of MYB/SANT genes (Fig. 6E).
Discussion
Identifying the other edge of large Starships
The only structurally conserved gene within Starships described so far is the putative tyrosine recombinase, also referred to as Captain/DUF3435, positioned close to one edge of Starships (Gluck-Thaler et al., 2022; Urquhart, Vogan, et al., 2023). Here we show that a MYB/SANT transcription factor is present in a subset of Starships and structurally positioned at the opposite edge to the Captain. This combination of genes, a transposase and a MYB/SANT gene, can also be found in Harbinger transposons, where both are required for transposition (Kapitonov & Jurka, 2004; Sinzelle et al., 2008); Harbinger DNA transposons have been identified in diverse phyla of life including protists, plants, insects and vertebrates. Considering this similarity, we could hypothesise that the MYB/SANT gene in Starships plays a similar role, aiding in both the nuclear import and TIR recruitment of the transposase. However, MYB/SANT genes do not appear to be required for Starships, as they are often absent from seemingly complete Starships and are found at a lower frequency than Captains. Therefore, MYB/SANT presence may be only beneficial for transposition, particularly in larger Starships, as our data suggest. Alternatively, the MYB/SANT TF may impact the movement of other Starships also present within the same genome. Much like previous HGT regions retroactively described as Starships, we found MYB/SANT genes in previously described Starships, confirming their presence regardless of the pipeline used to discover the region.
Starships, a convergent mechanism for food and clinical filamentous fungi to rapidly adapt to their environments
Using both inter- and intra-species comparisons, of varying distances, our findings reveal a consistent pattern where all species linked to food production in both the Aspergillus and Penicillium genera exhibit a distinct increase in the number of Starships compared to their environmentally isolated counterparts (Fig. 1; Fig. 2; Fig. 3; Fig. 4; table S1). In total, we identified this relationship within seven independent instances of fungi domesticated for food production, including cheeses, dry-cured meats, and various soy and rice-based foods. Furthermore, many genes within Starships have putative functions relevant for adaptation to food environments, such as genes involved in lactose metabolism and salt tolerance. We found, building upon previous observations (Cheeseman et al., 2014; Lo et al., 2023; Ropars et al., 2015), that Starships and their food relevant cargo are commonly shared among food-related species and in turn highlighting Starships as a generalisable mechanism for these filamentous fungi to rapidly adapt to these new niches.
Fermented food environments are rich in terms of both nutrients and microbes (number and species diversity), including bacteria, yeasts and filamentous fungi, making them ideal media for genetic exchange (Wolfe & Dutton, 2015). For example, several Penicillium and Aspergillus species have been described to co-occur in cheeses alongside many other fungi (Martin & Cotter, 2023). This mimics what has been described in cheese-related bacteria, where HGT events appear to be common, contain genes with functional relevance for cheese adaptation and are in some cases facilitated by ICEs (Bonham et al., 2017). Similar to ICEs, due to a Starship’s relatively large size, these transfer events are certainly costly and should be maintained under strong selection. This is particularly relevant in some cases such as in the cheese-related species P. camemberti var. caseifulvum, where we identified that greater than 15% of the ESE00019 genome is made up of Starship-related material. Furthermore, considering that Starships have been transferred between species from the same food-related environments (Cheeseman et al., 2014; Lo et al., 2023; Ropars et al., 2015), that they can be very numerous in a single host genome and that most Starships identified are unique, even if at times containing similar cargo, this establishes their crucial role in adaptation and their ease of transfer; extrapolated, additional cargo may yet be acquired in domesticated strains. However, we still do not know how often Starships move or whether they can be lost as easily as they are acquired.
Although food environments could encourage the likelihood of cross species Starship transfers (assuming they only require physical contact), this may only be part of the story, particularly due to elevated Starship content in clinical strains. The current prevailing theory is that clinical strains of Aspergillus, with pathogenic adaptations, have acquired these prior to colonisation (Rhodes et al., 2022; Snelders et al., 2009; Verweij et al., 2020). The primary example is azole resistance in A. fumigatus, which is said to develop resistance in agricultural settings due to the use of azole based antifungal treatments. This would therefore likely be the same pressure driving the increase in Starship content, and considering that most environmental strains of A. fumigatus came from agricultural settings, may explain why there is a smaller difference between clinical and environmental strains, compared to other food-related analyses.
Starships, a more general mechanism to rapidly adapt and survive to extreme environments?
Beyond food and clinical isolates, we were also able to identify a relative increase in Starships in several phylogenetically distant fungi occurring in saline environments, unrelated to food. However, food environments linked to Starships are almost all salted, i.e. cheese, cured meat, soy sauce, miso, suggesting salt could play a part in Starship movement. A variety of stresses, including salt, have already been identified as a means of inducing the transfer of plasmids and TE transposition (Beuls et al., 2012; Peng et al., 2019). In this scheme, certain conditions such as high salinity might not only select for Starships that confer resistance but may also increase general Starship mobility and subsequent accumulation. However, we also identified potentially elevated Starships in strains isolated from heavy metal contaminated soil (fig. S28) aligning with the previously described Starship acquired heavy metal resistance (Urquhart et al., 2022; Urquhart, Vogan, et al., 2023). Furthermore, the ethanol utilisation cluster was found primarily in species known to cause post-harvest disease in fruits (Kim et al., 2007; Koffmann & Penrose, 1987; Luciano-Rosario et al., 2020; Marcet-Houben et al., 2012; Scholtz & Korsten, 2016). Ethanol resistance and utilisation may be in response to early colonising yeast that can rapidly produce volatile ethanol in order to deter competition (Dashko et al., 2014; Druvefors et al., 2005; Fredlund et al., 2004; Freimoser et al., 2019; Kwasiborski et al., 2014).
Cargo-clusters drive the functional convergence of Starships
Our data on Starship-related gene clusters highlight that ‘cargo-clusters’ frequently move among different Starships as single elements. This is supported by the recently published results of Urquhart et al. 2023, who describe the same cargo-cluster, responsible for resistance to formaldehyde, in several different Starships, referred to as cargo-swapping (Urquhart, Gluck-Thaler, et al., 2023). Within the formaldehyde cluster, it was experimentally shown that ssfB granted much greater formaldehyde resistance compared to the six other genes and that it was the only gene consistently maintained within the cluster, whilst the other genes’ presence varied (Urquhart, Gluck-Thaler, et al., 2023). The arsenic cluster in the Starship Hephaestus has similarly been shown to have been acquired as a single element including a truncated flanking gene (Urquhart et al., 2022). Here we showed that several cargo-clusters, unlike those containing biosynthetic pathways, have variable gene content with similar functional roles. For example, the arsenic cluster, previously described in the Starship Hephaestus, contains a set of three core genes (ArsC, ARR3 and ArsH) alongside seven accessory cargo-cluster genes (fig. S22). This variable arrangement is similar to that of arsenic gene islands in prokaryotes (Ben Fekih et al., 2018). It is therefore possible that cargo-clusters may only orient themselves around certain more functionally important genes, providing a more malleable adaptive structure whilst conserving the cluster’s function. This is particularly likely considering that cargo is in itself accessory material.
In conclusion, Starships thus appear to be a way of rapidly adapting and surviving to novel conditions, depending on the cargo genes acquired, having been employed heavily within human-related environments. By focusing on recent evolutionary events, domestication and pathogenesis, we were able to clearly detect a relationship between Starships and novel niches whether desired or otherwise. Furthermore, we have described numerous cargo genes involved in resistance to salt, sulfites and even drugs. Many of these are antimicrobial compounds employed by humans for food preservation or against pathogens of both humans and plants. This may become important as fungal pathogens are becoming increasingly resistant to antifungal treatments (Fisher et al., 2022; Vitiello et al., 2023; Yin et al., 2023) and their ability to rapidly acquire and transfer resistance would certainly worsen treatment outcomes. Our findings have thus clear implications for the future of agriculture, human health and the food industry as we shed light on a widely recurrent mechanism of gene transfer between filamentous fungi. This also indicates a need for caution regarding the use of bioengineered fungal strains in our agriculture or food systems, as already proposed in the fermented food species Aspergillus oryzae which was successfully genetically modified using the CRISPR-Cas9 system (Maini Rekdal et al., 2024). We cannot predict how these genes will be spread and whether they can be future threats.
Methods
Sampling
We newly isolated 68 strains from cheese rinds. To do so, we performed successive dilutions of pieces of rinds to obtain isolated spores on solid malt extract agar. We checked the species identity by sequencing the beta-tubulin locus using the bt2a and bt2b primers (seq primers). Table S1 recapitulates strain details, including species identification, origin and country of origin.
DNA extraction, genome sequencing
We used the Nucleospin Soil Kit (Macherey-Nagel, Düren, Germany) for DNA extraction of the 68 newly isolated strains grown for five days on malt extract agar. DNA was sequenced using Illumina technologies (paired-end, 150bp) (table S1).
Illumina genome assembly
We assembled the genomes of the 68 newly isolated strains and of the 49 already published strains (Ropars et al., 2020; table S1). Illumina reads were checked with FastQC (v0.11.9) (Andrews, 2020). Leading or trailing low quality or N bases below a quality score of three were removed. For each read, only parts that had an average quality score higher than 20 on a four base window were kept, and only reads with a length of at least 36 bp were kept. Cleaned reads were assembled with SPAdes (v3.15.3) (Prjibelski et al., 2020) using the --careful parameter and without the unpaired reads.
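The read-cleaning criteria above can be illustrated with a minimal Python sketch. It is not the tool used in this study (which is not named here); the function name `trim_read` and the choice to cut a read at the first four-base window whose mean quality falls below 20 are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

def trim_read(seq: str, quals: List[int], edge_min: int = 3,
              window: int = 4, window_min: float = 20.0,
              min_len: int = 36) -> Optional[Tuple[str, List[int]]]:
    """Hypothetical helper: trim one read given its Phred quality scores."""
    # Step 1: strip leading and trailing bases that are N or below quality 3.
    start, end = 0, len(seq)
    while start < end and (quals[start] < edge_min or seq[start] == "N"):
        start += 1
    while end > start and (quals[end - 1] < edge_min or seq[end - 1] == "N"):
        end -= 1
    seq, quals = seq[start:end], quals[start:end]

    # Step 2 (assumed behaviour): scan four-base windows from the 5' end and
    # cut the read where the window mean quality first drops below 20.
    cut = len(seq)
    for i in range(len(seq) - window + 1):
        if sum(quals[i:i + window]) / window < window_min:
            cut = i
            break
    seq, quals = seq[:cut], quals[:cut]

    # Step 3: discard reads shorter than 36 bp after trimming.
    if len(seq) < min_len:
        return None
    return seq, quals
```

Reads for which the function returns `None` would be dropped before assembly.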
Long-read genome assembly
We newly assembled 3 genomes (ESE00019, ESE00090 and LCP05531) with Oxford Nanopore Technology (ONT), newly assembled 1 genome (LCP05904) with Pacbio and re-assembled 3 genomes (LCP06093, PN007, PN016) with publicly available Pacbio data. We assembled each dataset with a variety of assemblers and with different preprocessed read datasets and settled on two assemblies depending on the technology. Additionally, Illumina data was available for all ONT sequenced strains, and for one re-assembled strain, PN016. For ONT sequenced strains, data was basecalled using Guppy (v5; model r941_min_fast_g507) and assembled using Canu (v2.2) (Koren et al., 2017) with all raw reads and Flye (v2.9-b1768) (Kolmogorov et al., 2019) after using Filtlong (v0.2.1) (github.com/rrwick/Filtlong) to remove all reads shorter than 5kb. These assemblies were merged using Ragtag’s (v2.1.0) (Alonge et al., 2022) patch function with the Canu assembly as the ‘target’ and the Flye assembly as the ‘query’. A second round of merging was done replacing the Flye assembly with the merged assembly from the first round. This merged assembly was then polished with Medaka (v1.6.0) (github.com/nanoporetech/medaka) using all the ONT reads. The assembly was then polished using the Illumina reads with Hapo-G (v1.3.2) (Aury & Istace, 2021). For Pacbio based genomes, both the Canu and Flye assemblies were generated using the Filtlong-filtered reads (all reads shorter than 5kb removed), and merging was performed using Quickmerge (v0.3) (Chakraborty et al., 2016) with the Flye assembly as the ‘reference’ and the Canu assembly as the ‘query’. One round of Hapo-G polishing was performed for PN016. All assemblies were then treated to remove redundant contigs. Contig alignment was performed by BLAST (Camacho et al., 2009) and contigs were removed if 99% or more of their length was covered by another larger contig with 99% or more identity.
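As an illustration of the redundancy criterion, the sketch below flags contigs from a list of alignment hits. The tuple layout, the function names and the way coverage is aggregated (the union of all query intervals from hits at 99% identity or more against a single larger contig) are assumptions for this example, not a description of the script actually used.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical hit layout: (query, subject, percent identity, q_start, q_end),
# with 1-based inclusive query coordinates as in BLAST tabular output.
Hit = Tuple[str, str, float, int, int]

def covered_length(intervals: List[Tuple[int, int]]) -> int:
    """Total number of positions covered by a set of inclusive intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total

def redundant_contigs(hits: Iterable[Hit], lengths: Dict[str, int],
                      min_identity: float = 99.0,
                      min_cover: float = 0.99) -> Set[str]:
    """Contigs covered >= 99% by one larger contig at >= 99% identity."""
    per_pair = defaultdict(list)
    for query, subject, identity, q_start, q_end in hits:
        # Only consider hits against a different, strictly larger contig.
        if query != subject and identity >= min_identity \
                and lengths[subject] > lengths[query]:
            per_pair[(query, subject)].append(
                (min(q_start, q_end), max(q_start, q_end)))
    return {query for (query, _), intervals in per_pair.items()
            if covered_length(intervals) / lengths[query] >= min_cover}
```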
For contig naming, contigs were ordered from largest to smallest and then numbered; mitochondrial contigs were identified by BLASTing all contigs against a manually identified complete mitochondrial genome from ESE00019.
We assessed genome completeness using the BUSCO (v5.3.1) eurotiales_odb10 database (Manni et al., 2021) and Merqury (-k 18) (Rhie et al., 2020) for those with illumina data (table S4). We also compared genomes using assembly statistics (table S4). Additionally we compared our long read assemblies to short-read assemblies if available. This shows that long-read assembly and reassembling of these genomes resulted in significant improvements in contiguity, BUSCO and Merqury completeness compared to previous assemblies both long and short.
Long-read genome annotation
All genomes excluding PN016 and PN007 were annotated. Genomes were first analysed by RepeatModeler (v2.0.2) (Flynn et al., 2020) using the BuildDatabase and RepeatModeler functions (-engine ncbi). Repeat families were further clustered using cd-hit-est (v4.8.1) (-d 0 -aS 0.8 -c 0.8 -G 0 -g 1 -b 500) (Li & Godzik, 2006). This repeat library was then fed to RepeatMasker (v4.1.2-p1) (http://www.repeatmasker.org) to identify and mask all repeats (-xsmall -s). The masked assembly was then annotated by Braker2 (v2.1.6) (--fungus --gff3) (Brůna et al., 2021) using a set of all BUSCOs in fungal taxonomic groups (Manni et al., 2021). Braker was then rerun using the previous run’s Augustus protein hints for 4 rounds. The resulting protein dataset was then analysed using the initial BUSCO database, showing 99.9% of all proteins were annotated for ONT based assemblies and 94.6-94.7% for Pacbio based assemblies. The resulting proteins were annotated using Funannotate (v1.8.11) (Palmer & Stajich, 2020). Firstly, a funannotate database was generated using the setup command (-b fungi). We then used funannotate’s iprscan wrapper (Jones et al., 2014) and emapper (v2.1.8) (Cantalapiedra et al., 2021) to annotate protein domains and Eggnog functions respectively. The masked genome, the braker gff3, the funannotate fungi database, the iprscan output and the eggnog annotations were then combined using funannotate’s annotate function.
Phylogenomics
We placed all genomes used in our study within a phylogenomic framework, separated into the Penicillium and Aspergillus genera, using several genomes of the other genus for rooting (table S1). This served several purposes, such as verifying species identification for public assemblies (which were frequently incorrect and have been tagged in table S1), confirming species relationships including their distances and highlighting complex species naming systems such as in Aspergillus species complexes. This genome dataset excluded any NCBI genomes labelled as abnormal or from metagenomic data. All genomes were analysed by BUSCO using the eurotiales_odb10 database (Manni et al., 2021). Any genome with less than 80% complete single copy BUSCOs was removed from further analysis. We then established a protein dataset consisting of only BUSCOs contained within 99% of all genomes (3154/4190 and 3258/4190 for Penicillium and Aspergillus respectively) which were then used to build each tree. Using BUSCOs allowed us to heavily limit any impact of horizontally transferred genes on the topology. The BUSCO datasets were aligned by mafft (v7.310) (Katoh & Standley, 2013) and trimmed by trimal (v1.4.rev15) (Capella-Gutiérrez et al., 2009). All protein alignments were concatenated using catfasta2phyml/catfasta2phyml.pl (https://github.com/nylander/catfasta2phyml) (-c -f) to produce a single fasta file. We then used this to generate a species tree with VeryFastTree (-double-precision) (v4.0.3) with the default SH-like node support calculations (Supp. files 1 and 2) (Piñeiro et al., 2020; Price et al., 2010). The resulting treefile was analysed in R (R Core Team, 2024). Each tree was rooted using the outgroup genomes with treeio::MRCA (v1.22) (Wang et al., 2020), ape::root.phylo (v5.7.1) (Paradis & Schliep, 2019), TreeTools::Preorder (v1.10.0) (Smith, 2019) and treeio::drop.tip and visualised using the ggtree suite (Yu et al., 2017).
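The two BUSCO-based filters described above (genome completeness and marker occupancy) can be sketched as follows. The input structure, a mapping from genome name to the set of BUSCO identifiers found complete and single-copy, and the function name are hypothetical; only the 80% and 99% thresholds come from the text.

```python
from collections import Counter
from typing import Dict, Set, Tuple

def select_genomes_and_markers(
        complete_single_copy: Dict[str, Set[str]],
        total_buscos: int,
        min_completeness: float = 0.80,
        min_occupancy: float = 0.99) -> Tuple[Set[str], Set[str]]:
    """Return (genomes kept, BUSCO markers kept) for tree building."""
    # Filter 1: drop genomes with < 80% complete single-copy BUSCOs.
    kept_genomes = {
        genome for genome, buscos in complete_single_copy.items()
        if len(buscos) / total_buscos >= min_completeness
    }
    # Filter 2: keep BUSCOs present in at least 99% of the remaining genomes.
    counts = Counter(
        busco
        for genome in kept_genomes
        for busco in complete_single_copy[genome]
    )
    kept_markers = {
        busco for busco, n in counts.items()
        if n / len(kept_genomes) >= min_occupancy
    }
    return kept_genomes, kept_markers
```

In this sketch, occupancy is computed after removing low-completeness genomes; whether the original pipeline applied the filters in exactly this order is an assumption.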
Renaming of species was performed, based on the phylogeny as noted in table S1, with most being supported by other publications, also noted in table S1. It should be kept in mind that renaming is specific to the genome, regardless of the strain name assigned in NCBI. Renaming efforts were mainly targeted towards species with numerous genomes and we do not want to enter the debate surrounding the naming of A. terreus/A. pseudoterreus. For phylogenies with collapsed clades (e.g. Fig. 2 and Fig. 3) we used scaleClade (scale=0.1) then collapse (mode=‘mixed’) from the ggtree suite. Therefore, these species or clade specific nodes display a triangle a tenth the size of the actual branches and with corners extending to both the maximum and minimum of these scaled branch lengths. The phylogenetics pipeline can be found at https://github.com/SAMtoBAM/publicgenomes-to-buscophylogeny.
Strain isolation origins and classification
For each genome assembly we attempted to manually gather information on the strains’ isolation origins. Classifying these origins into simple categories was then also done manually. For environmental origins we used several environmental categories, i.e. environmental, environment-clinical, environment-food and environment (saline). Environment (saline) was used for Aspergillus strains where the environment would be generally considered as extreme in its saline content such as sea-water or saline soils. Environment-food was used primarily for contamination of processed food-products such as unwanted growth on dairy products. Environment-clinical strains mainly come from a single study which isolated fungi from healthy human faecal samples (Q. Yan et al., 2024). Other strains in this latter category were isolated from hospital and/or human samples but unassociated with disease. Table S1 contains the detailed and simplified isolation origins.
Starfish identification on all (short- and long-read) genomes
In order to estimate the Starship count within and between species, we used Starfish to identify DUF3435/Captain genes in both long and short-read assemblies. The Captain count was then used as a proxy for the number of Starships. We ran Starfish annotate with default settings and the default Captain database (YRsuperfamRefs.faa). The number of Captains and the isolation origins per genome were then plotted against the phylogeny in R. To reorder plots of Captain counts against phylogenetic trees, bar graphs were made by ggplot (Wickham, 2016) then combined with ggtree (v3.6.2) (Yu et al., 2017) phylogenies using the R library aplot (Yu, 2023). To identify changes in Starship content, species known to have been domesticated for food production or to cause disease were compared against either wild sister species, or within species split by isolation origins. In all cases, to test for differences in Captain counts between species and/or isolation environments, we performed Wilcoxon rank-sum tests with Bonferroni adjusted p-values for multiple comparisons.
To test the accuracy of the Captain count as a proxy, alongside the Captain’s specificity to Starships, genome sizes were compared against the number of Captains using the Pearson correlation coefficient in R. In Fig. 1, two strains of P. janthinellum were removed due to an extremely high Captain count, which was most likely erroneous as there was no corresponding increase in genome size.
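For readers unfamiliar with the statistic, the Pearson coefficient used for this check can be written out in a few lines. The analysis itself was run in R; the Python function below is only an equivalent sketch, and any values passed to it in an example would be placeholders rather than data from this study.

```python
import math
from typing import Sequence

def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Here `x` would hold the Captain count per genome and `y` the corresponding genome size.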
In A. fumigatus, to remove any bias in how genomes were assembled across projects, we restricted the isolation-Starship comparison to a large subset of 252/336 genomes, containing both environmental (202/252) and clinical (50/252) isolates, that all originated from two bioprojects/studies (PRJNA697844 and PRJNA595552) from the same group and therefore were all assembled and treated with the same pipeline (Barber et al., 2020, 2021). In doing so we still see, as shown with all genomes, that clinical strains contain more Starships (fig. S29).
Pairwise captain identity and Average Amino Acid Identity
To show that in some cases Captains have been horizontally transferred, we looked at the discrepancies between the pairwise identity of Captain proteins and the average amino acid identity (AAI) of their respective genomes, calculated using BUSCO proteins. For Captains, we concatenated all Captain proteins identified by Starfish and used them to construct a Diamond database (Buchfink et al., 2021). We then searched this database using all Captain proteins, all vs all, using diamond blastp, filtering for matches where both the reference and query protein were covered by at least 95% and the identity was >= 95% (--very-sensitive --id 95 --query-cover 95 --subject-cover 95 --no-self-hits).
For the AAI, 50 random BUSCOs used to build both the Penicillium and Aspergillus phylogenies were analysed separately in the same fashion with only a filter for 80% reference and query coverage (--very-sensitive --query-cover 80 --subject-cover 80 --no-self-hits). For all captain matches, we calculated the AAI for the two genomes containing the captains by taking their average BUSCOs match identities.
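The comparison between Captain identity and genome-wide AAI can be sketched as below. The data structures and the function name are hypothetical, and no cut-off on the discrepancy is applied because none is stated in the text; the sketch simply ranks Captain matches by how far their identity exceeds the AAI of the two host genomes.

```python
from statistics import mean
from typing import Dict, FrozenSet, List, Tuple

def captain_vs_aai(
        captain_hits: List[Tuple[str, str, float]],
        busco_identities: Dict[FrozenSet[str], List[float]],
) -> List[Tuple[str, str, float, float, float]]:
    """Pair each Captain match with the AAI of the two genomes involved.

    captain_hits: (genome_a, genome_b, Captain percent identity)
    busco_identities: percent identities of BUSCO matches per genome pair
    Returns rows of (genome_a, genome_b, captain identity, AAI, difference),
    sorted so that the largest discrepancies come first.
    """
    rows = []
    for genome_a, genome_b, captain_identity in captain_hits:
        pair = frozenset((genome_a, genome_b))
        if genome_a == genome_b or pair not in busco_identities:
            continue  # skip within-genome hits and pairs without BUSCO data
        aai = mean(busco_identities[pair])  # AAI = mean BUSCO match identity
        rows.append((genome_a, genome_b, captain_identity, aai,
                     captain_identity - aai))
    return sorted(rows, key=lambda row: row[4], reverse=True)
```

A near-identical Captain shared by two genomes with a much lower AAI would appear at the top of the returned list.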
Genome graph
Using NCBI (Sayers et al., 2022) we identified publicly available, contiguous (relatively high contig N50), long-read technology assemblies within species of interest and combined them with our newly assembled genomes (table S2). We separated them into groups based on their relatedness and generated genome graphs for each using pggb (v0.5.0-hdfd78af) (-p 75 -s 30000 -m -S -k 19 -G 7919,8069 -Y _) (Garrison et al., 2023). We also set the option -n as the number of genomes used per grouping. The maximum distance (-p) was only reduced to 50% (-p 50) for the Brevicompacta graph to help with the larger distances between P. brevicompactum and P. salamii. The script for generating a genome-graph can be found here https://github.com/SAMtoBAM/pggb_starship_pipeline.
Starship-like region detection
To extract presence and absence data we used odgi (v0.8.1-1-ge91b1cd) (Guarracino et al., 2022) paths, extracted both the paths and associated genomes, then fed this into odgi pav alongside 1kb windows for each genome (generated using bedtools (v2.26.0) (Quinlan & Hall, 2010) makewindows). We then extracted all regions absent from at least one genome, merged these windows within 1bp using bedtools merge, removed all remaining regions smaller than 10kb then remerged regions with a max gap of 20 kb. The final resulting regions were then filtered for a minimum size of 30kb. This dataset was then annotated with braker, emapper and funannotate, as with the whole genomes, except using only a single round of braker annotation.
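The merging and filtering steps applied to the absent windows can be sketched as follows for a single contig. This is an illustrative re-implementation using half-open, BED-style coordinates, not the bedtools commands themselves; the function names are hypothetical, while the thresholds (1 bp, 10 kb, 20 kb and 30 kb) are those given above.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open (start, end), BED-style

def merge(intervals: Iterable[Interval], max_gap: int) -> List[Interval]:
    """Merge intervals separated by at most max_gap bases."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]

def candidate_regions(absent_windows: Iterable[Interval],
                      first_gap: int = 1,
                      min_first: int = 10_000,
                      second_gap: int = 20_000,
                      min_final: int = 30_000) -> List[Interval]:
    """Apply the merge / filter / remerge / filter steps to one contig."""
    # 1. Merge absent 1 kb windows lying within 1 bp of each other.
    step1 = merge(absent_windows, first_gap)
    # 2. Remove merged regions smaller than 10 kb.
    step2 = [(s, e) for s, e in step1 if e - s >= min_first]
    # 3. Remerge the remaining regions, allowing gaps of up to 20 kb.
    step3 = merge(step2, second_gap)
    # 4. Keep only regions of at least 30 kb.
    return [(s, e) for s, e in step3 if e - s >= min_final]
```

For multi-contig assemblies, the same function would be applied separately to the windows of each contig.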
Following this, the resulting annotation files were searched for protein domains and Eggnogs indicative of Starship-related genes. Table 1 below shows the list of genes and their respective tags used to find them. These Starship-related genes are made up of previously identified components of Starships from Gluck-Thaler et al. 2022 (DUF3435/Captain, DUF3723, PLPs) alongside newly added conidiophore related genes (CRGs), DUF6066 and the MYB/SANT. These new genes, alongside the previously identified genes, were shown to be highly Starship-specific in all genomes analysed.
The Starship-related gene annotation and identification pipeline was initially tested on the Starships released in Gluck-Thaler et al. 2022, showing that we were able to effectively identify the Captains in 33/39 (85%) of these Starships and, additionally, MYB/SANT genes at the distal end of 4/39 (fig. S27).
Additionally Starfish (Gluck-Thaler & Vogan, 2024) was used on each dataset similarly in order to detect DUF3435/Captain genes using the default database (YRsuperfamRefs.faa) and default settings. Additional Captains, that were not already identified, were added to the set of Starship-related genes within the defined candidate regions. Notably this added an additional 338/986 Captains (34%). Any region containing a Starship-related gene was then considered a Starship-related region (Supp. file 3). The script for detecting Starship-related regions can be found here https://github.com/SAMtoBAM/pggb_starship_pipeline.
Functional enrichment
For functional enrichment we used COG gene annotations, all generated by the same annotation pipeline. Genes were classified by their COG category, or labelled ‘Absent’ when no annotation was available. Our first dataset contained our four newly assembled and annotated long-read assemblies, comprising three strains from cheese-related species (ESE00019, LCP06093, LCP05531) and one strain from an environmental species (ESE00090). For each genome, we separated genes into those inside (cargo) and those outside (non-cargo) the Starship-like regions detected by the genome-graph. Within each genome, all genes were annotated simultaneously, which limits any bias between annotation pipelines or runs that could otherwise explain differences between cargo and non-cargo genes. Hypergeometric tests for enrichment and depletion were performed for each COG category, and p-values were adjusted for multiple comparisons (Benjamini–Hochberg), using phyper and p.adjust respectively (R Core Team, 2024). We also performed an analysis on the combined cargo and non-cargo genes from all three cheese-related strains, which showed the same differences. Second, we performed the same tests on the COG categories of cargo in Starship-like regions, split into groups based on each strain's isolation category: food, clinical or environment. All of the above tests were performed both with and without the COG categories ‘S’, ‘L’, ‘K’ and ‘Absent’. S and Absent correspond to genes of unknown function. L and K are commonly associated with transposable elements, and our Starship-like regions were not masked prior to annotation.
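For readers less familiar with the test, the per-category calculation can be sketched in pure Python. The analysis itself used phyper and p.adjust in R; the code below is an assumed equivalent, in which the upper tail P(X >= k) is taken as the enrichment p-value, the lower tail P(X <= k) as the depletion p-value, and the two sets of p-values are adjusted separately. The function names and the example counts are hypothetical.

```python
from math import comb
from typing import Dict, List, Tuple


def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(X = k): k category genes among n cargo genes, K category genes among N total."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def enrichment_p(k: int, N: int, K: int, n: int) -> float:
    """Upper tail P(X >= k)."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, min(n, K) + 1))


def depletion_p(k: int, N: int, K: int, n: int) -> float:
    """Lower tail P(X <= k)."""
    lowest = max(0, n - (N - K))
    return sum(hypergeom_pmf(i, N, K, n) for i in range(lowest, k + 1))


def benjamini_hochberg(pvals: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for offset, i in enumerate(order):
        rank = m - offset  # rank of this p-value in ascending order
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def cog_enrichment(
    cargo: Dict[str, int], noncargo: Dict[str, int]
) -> Dict[str, Tuple[float, float]]:
    """Map each COG category to (adjusted enrichment p, adjusted depletion p)."""
    categories = sorted(set(cargo) | set(noncargo))
    n = sum(cargo.values())            # number of cargo genes
    N = n + sum(noncargo.values())     # all genes in the genome
    enr, dep = [], []
    for cat in categories:
        k = cargo.get(cat, 0)
        K = k + noncargo.get(cat, 0)
        enr.append(enrichment_p(k, N, K, n))
        dep.append(depletion_p(k, N, K, n))
    # Assumption: enrichment and depletion p-values are adjusted separately.
    return dict(zip(categories, zip(benjamini_hochberg(enr), benjamini_hochberg(dep))))


if __name__ == "__main__":
    # Hypothetical counts per COG category, for illustration only.
    cargo_counts = {"L": 40, "K": 25, "S": 100, "Absent": 300}
    noncargo_counts = {"L": 200, "K": 400, "S": 2500, "Absent": 3000}
    for cat, (p_enr, p_dep) in cog_enrichment(cargo_counts, noncargo_counts).items():
        print(cat, p_enr, p_dep)
```

Repeating the call after dropping ‘S’, ‘L’, ‘K’ and ‘Absent’ from both dictionaries would mirror the "with and without" comparison described above.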
Clinker
A gff3 file for each Starship-related region containing a gene cluster was converted to an EMBL file using EMBLmyGFF3 (v2.3) (Norling et al., 2018). The EMBL file was then converted to a GenBank file using biopython (SeqIO.convert(sys.stdin, 'embl', sys.stdout, 'genbank')) (Cock et al., 2009). This two-step process avoided the formatting issues introduced by other tools that convert gff3 files directly to gbk. We then ran Clinker (v0.0.28) with a minimum sequence alignment identity of 50% (-i 0.5) on all the GenBank files simultaneously for each gene cluster. The html output was manually modified in order to highlight the clusters and shorten the length of regions. The Saccharomyces Genome Database (Engel et al., 2022), Candida Genome Database (Skrzypek et al., 2017), FungiDB (Basenko et al., 2018) and NCBI (Sayers et al., 2022) were indispensable in manually investigating the function and relevance of these genes.
Cargo-clusters in all assemblies
All assemblies used in this study (Table S1) were combined into a single BLAST database and queried with the core genes of each cluster, requiring a minimum query coverage and identity of 95%. A cluster was considered present in an assembly if all of its core genes had a match. The presence/absence plot (Fig. 6) was generated using ggplot2 and geom_tile (Wickham, 2016), then combined with ggtree (v3.6.2) (Yu et al., 2017) phylogenies using the R library aplot (Yu, 2023).
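The presence rule can be illustrated with a short Python sketch. It assumes the BLAST hits have already been parsed into tuples of (core gene, assembly, percent identity, percent query coverage); this tuple layout, the function name and the example identifiers are hypothetical and are not taken from the study's scripts.

```python
from typing import Dict, Iterable, List, Set, Tuple

# (core gene, assembly, percent identity, percent query coverage); assumed layout
Hit = Tuple[str, str, float, float]


def cluster_presence(
    core_genes: Iterable[str],
    hits: Iterable[Hit],
    assemblies: List[str],
    min_identity: float = 95.0,
    min_coverage: float = 95.0,
) -> Dict[str, bool]:
    """A cluster is present in an assembly only if every core gene has a passing hit."""
    core: Set[str] = set(core_genes)
    matched: Dict[str, Set[str]] = {a: set() for a in assemblies}
    for gene, assembly, identity, coverage in hits:
        passes = identity >= min_identity and coverage >= min_coverage
        if gene in core and assembly in matched and passes:
            matched[assembly].add(gene)
    return {a: matched[a] == core for a in assemblies}


if __name__ == "__main__":
    hits = [
        ("geneA", "asm1", 99.1, 100.0),
        ("geneB", "asm1", 96.0, 97.5),
        ("geneA", "asm2", 99.0, 100.0),
        ("geneB", "asm2", 94.0, 100.0),  # identity below the 95% threshold
    ]
    print(cluster_presence(["geneA", "geneB"], hits, ["asm1", "asm2", "asm3"]))
```

The resulting dictionary of booleans, one per assembly, corresponds to one row of tiles in a presence/absence matrix.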
Funding
This work was funded by the ANR grant Artifice (ANR-19-CE20-0006-01) to J.R.
Acknowledgements
We would like to thank Emile Gluck-Thaler for his insights and suggestions on the manuscript.
Footnotes
S. O’Donnell: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Software, Validation, Visualisation, Writing-original draft, Writing-review and editing. G. Rezende: Investigation. J.-P. Vernadet: Data Curation, Investigation, Software. A. Snirc: Investigation. J. Ropars: Conceptualization, Data Curation, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing-original draft, Writing-review and editing.