Abstract
Sexual reproduction in Eukarya consists of genome reduction by meiosis and subsequent gamete fusion. The presence of meiotic genes in Archaea and Bacteria suggests that prokaryotic DNA repair mechanisms evolved towards meiotic recombination1, 2. However, the evolutionary origin of gamete fusion is less clear because fusogenic proteins resembling those found in Eukarya have so far not been identified in prokaryotes3–5. Here, using bioinformatics, we identified archaeal genes encoding candidate fusexins, a superfamily of fusogens mediating somatic and gamete fusion in multiple eukaryotic lineages. Crystallographic structure determination of a candidate archaeal FusexinA reveals an archetypical trimeric fusexin architecture with novel features such as a six-helix bundle and an additional globular domain. We demonstrate that ectopically expressed FusexinA can fuse mammalian cells, and that this process involves the additional domain and a more broadly conserved fusion loop. Genome content analyses reveal that archaeal fusexin genes are within integrated mobile elements. Finally, evolutionary analyses place these archaeal fusogens as the founders of the fusexin superfamily. Based on these findings, we propose a new hypothesis on the origins of eukaryotic sex where an archaeal fusexin, originally used by selfish elements for horizontal transmission, was repurposed to enable gamete fusion.
Introduction
How the earliest eukaryotes developed the capacity for gamete fusion is a central question that is entangled with the origins of the eukaryotic cell itself. The widespread presence of a conserved set of meiotic, gamete and nuclear fusion proteins (fusogens) among extant eukaryotes suggests that meiotic sex emerged once, predating the last eukaryotic common ancestor (LECA)1, 6. The conserved gamete fusogen HAP2/GCS1 belongs to a superfamily of fusion proteins called fusexins3–5. This superfamily encompasses class II viral fusogens (viral fusexins) that fuse the envelope of some animal viruses with the membranes of host cells during infection7–9; EFF-1 and AFF-1 (somatic fusexins) that promote cell fusion during syncytial organ development10–14; and HAP2/GCS1 (sexual fusexins) that mediate gamete fusion15–17. Although it is assumed that sexual fusexins were already present in the LECA8, their shared ancestry with viral fusexins posed a “the virus or the egg” evolutionary dilemma18. In one scenario, fusexins are proper eukaryal innovations that were transferred to some viruses and used for host invasion. Alternatively, a viral fusexin gene was captured by an early eukaryotic cell and then repurposed for gamete fusion.
Here, we identify a fourth family of fusexins in the genomes of Archaea and in the prokaryotic fractions of metagenomes from diverse environments. We provide structural and functional evidence indicating that these proteins are cellular fusogens. Genomic and evolutionary analyses reveal the ancient origins of these archaeal fusexins and their lateral mobility within Archaea, leading us to provide a working model for the emergence of meiotic sex during eukaryogenesis.
Results
Fusexin genes in Archaea
To search for fusexins we used the crystallographic structures of HAP2/GCS1 of C. reinhardtii, A. thaliana, and T. cruzi (Cr/At/TcHAP2)4, 19, 20 to build dedicated Hidden Markov Models (HMMs). These were used to scan the Uniclust30 database with HHblits (see Methods, Supplementary Information). We detected 24 high-confidence candidates in prokaryotes: eight belong to isolated and cultivated archaea, and the remaining sixteen come from metagenome-assembled genomes (MAGs, Extended Data Table 1). We then built HMMs of the candidate ectodomains and compared them to HMMs of sexual, somatic and viral fusexins. Figure 1a shows that the prokaryotic candidates are closely related to HAP2/GCS1 (hereafter referred to as HAP2), with E-values below 0.001 and HHblits-derived probabilities higher than 90% (Supplementary Fig. 1). Since all candidate sequences from pure culture genomes (PCGs) are from Archaea, we decided to name these proteins FusexinA (FsxA). All fsxA genes found in cultivated and isolated prokaryotes are restricted to the Halobacteria class (Euryarchaeota superphylum) whereas MAGs containing FsxAs include all major Archaea superphyla (Extended Data Table 1). Next, we used this set of FsxA sequences to search the Metaclust database, which comprises 1.59 billion clustered proteins from over 2200 metagenomic and metatranscriptomic datasets. Using a search pipeline that combines PSI-BLAST, HMM-HMM comparisons and topology filtering (see Methods), we found 96 high-confidence fsxA genes. The identified fsxA genes come from different environments (with a predominance of saline samples), and from a wide range of temperatures (-35 to 80°C, Source Data Table 1).
FsxA is a structural homologue of HAP2/GCS1
To obtain experimental evidence for the presence of fusexin-like proteins in Archaea, a selection of the candidate genes was screened for expression in mammalian cells. High-level expression was observed for a metagenomic FsxA sequence from a hypersaline environment, predicted to encode a large ectodomain region followed by three transmembrane helices (Supplementary Fig. 2a, b; Source Data Table 1). Although the protein was prone to denaturation on cryo-EM grids, we could grow crystals of its ∼55 kDa ectodomain (FsxAE) in the presence of 2.5 M NaCl, 0.2 M CaCl2 (Extended Data Fig. 1). These yielded data to 2.3 Å resolution (Extended Data Table 2), which however could not be phased experimentally, probably because the high-salt mother liquor composition hindered heavy atom binding. Molecular replacement with HAP2-derived homology models also failed, but we succeeded in solving the structure using a combination of fragments from models generated by AlphaFold221 (Extended Data Fig. 2).
Despite being a monomer in solution (Extended Data Fig. 1c,d), FsxAE crystallized as a homotrimer of 119 × 78 × 75 Å (Fig. 1b and Extended Data Figs. 1f and 3). Each protomer consists of four domains, the first three of which match the approximate dimensions and relative arrangement of domains I-III of fusexins in their post-fusion conformation22 (Fig. 1b and Extended Data Fig. 4a,b); accordingly, fold and interface similarity searches identify HAP2 as the closest structural homologue of FsxAE, followed by viral fusexins and C. elegans EFF-1 (Extended Data Fig. 4c). FsxA domains I and III are relatively sequence-conserved among archaeal homologues (Extended Data Fig. 5a; Supplementary Fig. 3) and closely resemble the corresponding domains of HAP2 (RMSD 2.1 Å over 218 Cα), including the invariant disulfide bond between domain III strands βC and βF4 (C₃389-C₄432; Extended Data Figs. 3d and 4). On the other hand, FsxA domain II shares the same topology as that of HAP2 but differs significantly in terms of secondary structure elements and their relative orientation, as well as disulfide bonds (Extended Data Fig. 4d). In particular, FsxA domain II is characterized by a four-helix hairpin, the N-terminal half of which interacts with the same region of the other two subunits to generate a six-helix bundle around the molecule’s three-fold axis (Figs. 1b and 2a-c; Extended Data Figs. 3a and 5b).
Notably, unlike previously characterized viral and eukaryotic fusexins, FsxA also contains a fourth globular domain conserved among archaeal homologues (Fig. 1b and Extended Data Figs. 3, 4; Supplementary Fig. 3), whose antiparallel β-sandwich, including the two C-terminal disulfides of the protein, resembles the carbohydrate-binding fold of dust mite allergen Der p 23 and related chitin-binding proteins23 (Fig. 2d); accordingly, it is also structurally similar to a high-confidence AlphaFold2 model of the C-terminal domain of acidic mammalian chitinase24. In addition to being coaxially stacked with domain III as a result of a loop/loop interaction stabilized by the C₅457-C₆477 disulfide, domain IV contributes to the quaternary structure of the protein by interacting with domain II of the adjacent subunit to which domain III also binds (Figs. 1b and 2c; Extended Data Fig. 4a, b).
The FsxAE monomer has a net charge of -67 (Fig. 2a), and another important feature stabilizing its homotrimeric assembly is a set of Ca2+ and Na+ ions that interacts with negatively charged residues at the interface between subunits (Fig. 2b; Extended Data Figs. 3a and 5b). Additional metal ions bind to sites located within individual subunits; in particular, a Ca2+ ion shapes the conformation of the domain II c-d loop (S143-V148) so that its uncharged surface protrudes from the rest of the molecule (Fig. 2b, c, e; Extended Data Figs. 3b and 5c). Strikingly, the position of this element matches that of the fusion loops of other fusexins, including the Ca2+-binding fusion surface of rubella virus E1 protein25, 26 (Fig. 2e). Moreover, as previously observed in the case of CrHAP220, the loops of each trimer interact with those of another trimer within the FsxA crystal lattice.
In summary, despite significant differences in the fold of domain II, the unprecedented presence of a domain IV, and extreme electrostatic properties, the overall structural similarity between FsxA and viral or eukaryotic fusexins strongly suggests that this prokaryotic molecule also functions to fuse membranes.
FsxA can fuse eukaryotic cells
To test the fusogenic activities of the candidate archaeal fusexins we studied their fusion activity upon transfection in eukaryotic cells3, 12, 14. For this, we co-cultured two batches of BHK cells independently transfected with FsxA and coexpressing either nuclear H2B-RFP or H2B-GFP 3. Following co-culture of the two batches, we fixed, permeabilized and performed immunofluorescence against a V5 tag fused to the cytoplasmic tail of FsxA (Fig. 3a, b). We observed a five-fold increase in the mixing of the nuclear H2B-GFP and H2B-RFP compared to vector control, showing that FsxA is a bona fide fusogen, comparable in efficiency to the eukaryotic gamete fusexin AtHAP2 (Fig. 3c; Extended Data Fig. 6). To determine whether FsxA expression is required in both fusing cells or, alternatively, it suffices in one of the fusing partners, we mixed BHK-FsxA coexpressing cytoplasmic GFP with BHK cells expressing only nuclear RFP. We found increased multinucleation of GFP+ cells but very low mixing with RFP+ cells not expressing FsxA. In contrast, the vesicular stomatitis virus G-glycoprotein (VSVG) fusogen induced efficient unilateral fusion14 (Fig. 3d-f; Extended Data Fig. 6). Thus, FsxA acts in a bilateral way, similarly to the C. elegans EFF-1 and AFF-1 fusexins14, 27–29. We then performed live-imaging using spinning disk confocal microscopy and observed cell-cell fusion of BHK-FsxA cells (Fig. 3g, h; Supplementary Videos 1 and 2).
Structure-function analysis of FsxA
To compare archaeal FsxA activity with fusexins from eukaryotes and viruses, we introduced mutations into two specific structural domains of FsxA and tested their surface expression and fusogenic activities in mammalian cells.
First, to test whether the putative fusion loop (FL) of FsxA (143-SVTSPV-148) is involved in fusion, we replaced it with a linker of 4G between Y142A and Y149A (Figs. 2e and 3i; Supplementary Figs. 3 and 4; ΔFL→AG4A). This FL replacement did not affect surface expression yet resulted in a reduction in content mixing to levels similar to those of the negative control (Fig. 3j; Extended Data Fig. 7).
Second, we asked whether domain IV, which is present in archaeal fusexins but absent in known eukaryotic and viral fusexins, has a function in the fusion process. For this, we replaced the entire domain with the stem region of C. elegans’ EFF-1 (Fig. 3i; ΔDIV→EFF-1 stem). While this mutant FsxA reaches the cell surface, suggesting that it folds normally, it showed a significantly reduced activity compared to wildtype FsxA (Fig. 3j; Extended Data Fig. 7; Supplementary Fig. 4).
FsxAs are ancestral fusogens associated with integrated mobile elements
The sparse pattern of FsxA presence in Archaea led us to perform genomic comparisons of related species with and without the fsxA gene. These comparisons revealed large DNA insertions (> 50 kbp) in the genomes of species with fsxA genes (Supplementary Fig. 5), which we analysed in more detail. We first performed k-mer spectrum analysis on fsxA-containing genomes of pure cultured species and found divergent regions containing the fsxA ORF (Figure 4a; Supplementary Fig. 6). Then, performing homology searches (Supplementary Fig. 7), we studied the gene content of fsxA-containing regions. We found that they share a portion of their genes (Supplementary Fig. 8) and display conserved synteny (Fig. 4b; Supplementary Fig. 9), suggesting common ancestry. These regions are enriched in ORFs that show homology with proteins involved in DNA mobilization and integration such as the type-IV secretion system VirB4/TrbE and TraG/VirD4 ATPases, the HerA helicase and the XerC/D tyrosine recombinase (Figure 4b; Supplementary Table 1). Thus, our results suggest that fsxA genes are contained in integrated mobile elements (IMEs) that can be mobilized by a conjugative-like, cell fusion-dependent mechanism.
To describe FsxA’s tempo and mode of evolution we built maximum likelihood phylogenies for a set of FsxA sequences derived from isolated species, metagenomic samples and MAGs, and a subset of HAP2s. We found that the branching pattern of FsxA sequences is incompatible with their species tree (Extended Data Fig. 8). This incongruence supports a history of horizontal gene transfer (HGT) events within Archaea, in line with the presence of fsxA genes in IMEs. Moreover, HAP2 monophyly indicates an ancient split before the eukaryal radiation (Extended Data Fig. 8).
To analyze deep homologies with no sequence signal, we built a structural comparison tree between archaeal, eukaryotic and viral fusexins (Fig. 4c). In this minimum evolution tree, both Minimal Ancestor Deviation30 (MAD) and midpoint rooting reveal the position of the root within the archaeal branch. These results are in line with current eukaryogenesis models that point to a history of massive acquisition of prokaryotic genes by HGT during the transition to LECA, suggesting FsxAs are basal, predating the divergence between eukaryotic and viral fusexins. Furthermore, relative acquisition time analysis (Supplementary Fig. 10) suggests that fsxA was an intermediate-to-late acquisition in the transition from First Eukaryotic Common Ancestor (FECA) to LECA.
Discussion
The archaeal fusexins herein identified reveal a broader presence of these fusogens in yet another domain of life and with different types of membranes. We also unveil a wider physicochemical landscape for this protein superfamily, from cold hypersaline lakes to hot springs and hydrothermal vents (Source Data Table 1 and Extended Data Fig. 8).
Our structural and functional analyses show that FsxA has both conserved and diverged properties when compared to eukaryotic and viral fusexins (Extended Data Fig. 4; Fig. 3). Like its viral counterparts, FsxA has an uncharged loop that is essential for fusion. Unlike any other previously known fusexin, FsxA possesses an additional domain (domain IV), that is important for FsxA activity and may bind sugars (Figs. 2d and 3j). Considering that cell surface glycosylation was found to be important for fusion-based mating of halophilic archaea31, this domain may actively promote fusion by interacting with carbohydrates attached to lipids or proteins such as S-layer glycoproteins32. Like somatic and sexual fusexins, FsxA mediates BHK cell fusion in a bilateral fashion (Fig. 3f). Future studies will aim at understanding the importance of the six-helix bundle formed by FsxA domain II, which is unprecedented among fusexins and raises an unexpected structural connection with class I viral fusogens8, 9.
The presence of FsxAs in Halobacteria IMEs is consistent with the evolutionary history and genetic structure of members of this class. Halophilic archaea are notorious for being polyploid33 and undergoing HGT events that overcome species and genera barriers34, 35. The best evidence of archaeal cell fusion comes from studies showing bilateral DNA exchange that correlates with cytoplasmic bridges made up of fused lipid bilayers connecting haloarchaeal cells32, 36, 37. Thus, it is plausible that Halobacteria evolved HGT mechanisms based on conjugative-like DNA mobilization and cell-cell fusion38. FsxAs seem to be absent in some genomes of archaeal species known to undergo fusion-based mating for gene transfer37. Their relative confinement to few archaeal lineages suggests limited fitness advantages to their present bearers indicating they are molecular relics, playing a marginal role in Archaea. Current evidence suggests that major eukaryotic lineages such as chordata and fungi replaced HAP2 with other fusogens during evolution39. Therefore, other unidentified archaeal fusogens may be at play. More broadly, cell fusion-based HGT might have declined during archaeal evolution in favour of conjugation, transduction and natural transformation.
The lateral mobility of fsxA genes and their likely ancestral position (Extended Data Fig. 8, Fig. 4c) prompt us to abandon the “virus or the egg” dilemma of the origin of fusexins18 in favour of a hypothesis in which archaeal fusexins were repurposed (exapted) for gamete fusion. In this hypothesis, which we call “eukaryotic sexaptation”, fusexins paved the way for sexual reproduction and other processes relying on membrane fusion during the FECA to LECA transition (Fig. 4d).
Discovery of the Asgard superphylum40 and the successful recent cultivation of one of its members41 have lent weight to eukaryogenesis models where heterogeneous populations of bacteria and archaea lived in syntrophy transferring metabolites and genetic information42. Lateral transfer of an fsxA gene, presumably at a mid-stage of eukaryogenesis (Supplementary Fig. 10, Fig. 4d), could thus have enabled pre-LECA cells to undergo genome expansion, explore syncytial forms43, and evolve into mononucleated cells fully equipped for meiosis and gamete fusion44, 45. Our findings suggest that today’s eukaryotic sexual reproduction is the result of over two billion years of evolution of this ancient archaeal cell fusion machine.
Methods
No statistical methods were used to predetermine sample size. The experiments were not randomized.
Initial fusexin search using structurally-guided MSAs
HMMs were prepared using structurally guided multiple sequence alignments (MSAs) of known eukaryotic HAP2 sequences. Structural MSAs were derived using I-TASSER46 generated models of HAP2 homologues for Erythranthe guttata (A0A022QRC8), Phytomonas sp. isolate EM1 (W6KUI1), Plasmodium falciparum (A0A1C3KGX6), Chlorella variabilis (E1Z455) and the HAP2 crystal structures for Chlamydomonas reinhardtii (PDB 6E1847 and 6DBS20) and Arabidopsis thaliana (PDB 5OW319).
Searches for fusexin homologues using structurally guided MSAs were performed for 3 iterations on the Uniclust database48 using default HHblits parameters49.
HMM-based distance matrices
A taxonomically representative list of known viral and eukaryotic fusexin homologues, covering major lineages, was manually curated. An MSA was built for each homologue by using the sequence as a query on the Uniclust database with HHblits49 for 3 iterations (see Supplementary Information). This set of MSAs was compiled into an HHsuite database and each MSA was used as a query against this database to establish a profile-based distance matrix using the probability of homology (Supplementary Fig. 1, Fig. 1a).
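The assembly of pairwise probabilities into a matrix can be sketched as follows. This is a minimal illustration, assuming the HHsuite hits have already been parsed into `(query, target, probability)` tuples; the function name, the averaging of the two search directions, the treatment of missing pairs as probability 0 and the `1 - p/100` conversion are illustrative assumptions, not the procedure used in this work.

```python
def probability_distance_matrix(names, hits):
    """Build a symmetric distance matrix from pairwise homology probabilities.

    names: ordered list of profile names.
    hits:  iterable of (query, target, probability_percent) tuples.
    Pairs without a reported hit are treated as probability 0 (assumption).
    """
    index = {name: i for i, name in enumerate(names)}
    n = len(names)
    prob = [[0.0] * n for _ in range(n)]
    for query, target, p in hits:
        i, j = index[query], index[target]
        prob[i][j] = max(prob[i][j], p)  # keep the best hit per ordered pair

    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Average both search directions, then map 0-100% onto 1-0.
                mean_p = (prob[i][j] + prob[j][i]) / 2.0
                dist[i][j] = 1.0 - mean_p / 100.0
    return dist
```

Averaging the two directions guarantees a symmetric matrix even when the query-versus-target and target-versus-query searches report different probabilities.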
Metaclust database search
We searched the Metaclust50 dataset using an HMM made of FsxA sequences found in PCGs and MAGs (Source Data Table 1; Supplementary Information). FsxA sequences were aligned using ClustalO51 on default settings for 3 iterations and the resulting MSA was used as a query with hmmsearch52 against the Metaclust5050 dataset. All returned sequences with an E-value < 0.0001 and a match length greater than 100 residues were selected for further analysis. Manual curation was performed using the membrane protein topology predictor TOPCONS53 and distant homology searches using HHPRED54 against the PDB70.
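The two selection thresholds stated above can be sketched as a simple filter. This is a hedged illustration that assumes hits have already been parsed from the search output into records; the `Hit` type, its field names and `select_hits` are hypothetical, and computing the match length from alignment coordinates is an assumption.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one parsed search hit."""
    target: str
    evalue: float
    ali_from: int  # first aligned residue on the target (1-based)
    ali_to: int    # last aligned residue on the target (inclusive)


def select_hits(hits, max_evalue=1e-4, min_length=100):
    """Keep hits with E-value < max_evalue and match length > min_length."""
    selected = []
    for hit in hits:
        match_length = hit.ali_to - hit.ali_from + 1
        if hit.evalue < max_evalue and match_length > min_length:
            selected.append(hit)
    return selected
```

Hits passing this filter would still require the manual curation step described above.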
DNA constructs
Ten archaeal genes were synthesized (GenScript) and cloned into pGene/V5-His vectors (Supplementary Table 3). Details of nucleotides used for synthesis and protein sequences are described in Source Data 1.
For structural studies, a synthetic gene fragment encoding the extracellular region of a metagenomic FsxA ORF (IMG genome 3300000868, scaffold JGI12330J12834_1000008, ORF 8; Source Data Table 1) (GenScript) was subcloned by PCR in frame with the 5’ chicken Crypα signal peptide- and 3’ 8xHis-tag-encoding sequences of pLJ6, a mammalian expression vector derived from pHLsec355. The protein construct that yielded the final high-resolution dataset included residues D25-S535 and contained a T369C substitution, introduced by PCR mutagenesis with the aim of facilitating heavy atom derivatization for experimental phasing. Oligonucleotides were from Sigma-Aldrich or IDT and all constructs were verified by DNA sequencing (Eurofins Genomics or Macrogen).
To generate the pCI::GFPnes plasmid (see list of plasmids in Supplementary Table 4), a DNA oligonucleotide encoding the nuclear export signal (LQKKLEELELD) was cloned into the C-terminal end of EGFP of the pCAGIG plasmid using the enzyme BsrGI. Then, the GFPnes coding sequence was amplified using the pCAGGS FW and pCAGGS RV primers, cut with BmgBI and BglII and used to replace the H2B-GFP coding sequence of the pCI::H2B-GFP plasmid (see list of primers in Supplementary Table 5). FsxA-V5, AtHAP2-V53, EFF-1-V5, VSV-G14 and other archaeal fusexins (NaFsxA, HQ22FsxA, HnFsxA) were subcloned separately into the corresponding pCI::H2B-RFP/H2B-GFP/GFPnes vectors. For mutagenesis of FsxA: i) FsxA-ΔFL-AG4A: the Y142A and Y149A substitutions, with four glycines inserted between them, were introduced by PCR with overlapping primers. ii) FsxA-ΔDIV-EFF-1 stem: the stem region of EFF-1 (E510-D561) was amplified from pGene::EFF-1-V5 and fused to the sequences upstream and downstream of the FsxA domain IV region using overlapping primers. All mutants were ligated into pCI::H2B-RFP and pCI::GFPnes vectors for the mixing assays. Additional details are found in Supplementary Tables 4 and 5.
Protein expression and purification
HEK293T cells57 were transiently transfected using 25 kDa branched polyethyleneimine and cultured in DMEM media (Invitrogen) supplemented with 2% (v/v) foetal bovine serum (Biological Industries). 90-96 hours after transfection, the conditioned media from HEK293T cells was harvested, 0.2 µm-filtered (Pall) and adjusted to 20 mM Na-HEPES pH 7.8, 2.5 M NaCl, 5 mM imidazole. 10 ml Ni Sepharose excel beads (GE Healthcare) pre-equilibrated with immobilized metal affinity chromatography (IMAC) buffer (20 mM Na-HEPES pH 7.8, 2.5 M NaCl, 10 mM imidazole) were added to 1 L adjusted conditioned media and incubated overnight at 4°C. After washing the beads with 100 column volumes IMAC buffer, captured FsxAE was batch-eluted with 30 mL 20 mM Na-HEPES pH 7.8, 2.5 M NaCl, 500 mM imidazole and concentrated with 30 kDa-cutoff centrifugal filtration devices (Amicon). The material was then further purified by SEC at 4°C, using an ÄKTAfplc chromatography system (GE Healthcare) equipped with a Superdex 200 Increase 10/300 GL column (GE Healthcare) pre-equilibrated with 20 mM Na-HEPES pH 7.8, 2.5 M NaCl. Peak fractions were pooled and concentrated to 5 mg mL-1 (Extended Data Fig. 1a, b).
Size exclusion chromatography-multiangle light scattering (SEC-MALS)
Purified FsxAE (200 µg) was analysed using an Ettan LC high-performance liquid chromatography system with a UV-900 detector (Amersham Pharmacia Biotech; λ = 280 nm), coupled with a miniDawn Treos MALS detector (Wyatt Technology; λ = 658 nm) and an Optilab T-rEX dRI detector (Wyatt Technology; λ = 660 nm). Separation was performed at 20°C using a Superdex 200 Increase 10/300 GL column (GE Healthcare) with a flow rate of 0.5 mL min−1 and a mobile phase consisting of 20 mM Na-HEPES pH 7.8, 150 mM NaCl (Extended Data Fig. 1c). The data processing and weight-averaged molecular mass calculation were performed using the ASTRA 7.1.3 software (Wyatt Technology). BSA (150 µg) was used as a control.
Small-angle X-ray scattering (SAXS)
SAXS experiments were performed at beamline BM29 of the European Synchrotron Radiation Facility (ESRF)58, using FsxAE (4.5 mg mL-1) in 20 mM Na-HEPES pH 7.8, 150 mM NaCl. Sample delivery and measurements were performed using a 1 mm thick quartz capillary, which is part of the BM29 BioSAXS automated sample changer unit59. Data were collected at 1 Å wavelength in 10 frames of 1 s at 20°C, using an estimated beam size of 1 mm x 100 µm; buffer blank measurements were carried out under the same conditions, both before and after sample measurement. Data were averaged and subtracted using PRIMUS60 from the ATSAS package61, which was also used to calculate the pair-distance distribution function, as well as the radius of gyration and the Porod volume. Theoretical scattering curves for monomeric and trimeric FsxAE were calculated and compared with the experimental data using CRYSOL62. Ab initio envelope reconstruction was performed using the program DAMMIF63, resulting in twenty models that were superimposed and averaged with DAMAVER64. Chain A of the refined FsxAE model was fitted into the SAXS reconstruction using UCSF ChimeraX65 (Extended Data Fig. 1d).
Crystallization and X-ray diffraction data collection
Two similar initial hits obtained from extensive screening using a mosquito crystallization robot (TTP Labtech) were manually optimized by setting up vapour diffusion experiments at 20°C in 24-well plates. To grow diffraction-quality crystals, 1 µL purified FsxAE was mixed with 1 µL 23% (w/v) PEG 4000, 0.1 M Tris-HCl pH 8.5, 0.2 M CaCl2 and equilibrated against 1 mL of the same solution. Rhomboidal plates of FsxAE grew in 1-3 months from protein precipitate that appeared after overnight equilibration of the crystallization drops (Extended Data Fig. 1e). For data collection, specimens were freed from the precipitate by micromanipulation with MicroMounts (MiTeGen) and flash frozen in liquid nitrogen. More than a hundred crystals were screened at beamlines ID23-1 of the ESRF66 and I04 of Diamond Light Source, yielding datasets of highly variable quality. The final X-ray diffraction dataset at 2.3 Å resolution was collected at ESRF ID23-1.
Data reduction and non-crystallographic symmetry analysis
Datasets were processed in space group C2 with XDS67 (Extended Data Table 2). By revealing a strong non-origin peak at χ = 120° (Extended Data Fig. 1f), self-rotation functions calculated with MOLREP68 or POLARRFN69 clearly indicated the presence of three-fold non-crystallographic symmetry (NCS) within the asymmetric unit of the centred monoclinic crystals. Combined with Matthews coefficient calculations70, 71, this strongly suggested that FsxAE crystallized as a homotrimer.
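The reasoning behind a Matthews coefficient estimate can be sketched as follows, using the standard relationships V_M = V_cell / (Z · n · M) and solvent fraction ≈ 1 − 1.23 / V_M. The cell dimensions below are hypothetical placeholders (the actual values are those reported in Extended Data Table 2), the mass is the approximate 55 kDa ectodomain size quoted in the text, and four asymmetric units per cell is the general multiplicity of space group C2; function names are illustrative.

```python
import math


def monoclinic_cell_volume(a, b, c, beta_deg):
    """Unit-cell volume (Å^3) of a monoclinic cell with unique angle beta."""
    return a * b * c * math.sin(math.radians(beta_deg))


def matthews_coefficient(cell_volume, asu_per_cell, copies_per_asu, mass_da):
    """Matthews coefficient V_M (Å^3 per Da)."""
    return cell_volume / (asu_per_cell * copies_per_asu * mass_da)


def solvent_fraction(vm):
    """Approximate solvent fraction for a typical protein crystal."""
    return 1.0 - 1.23 / vm


# Hypothetical cell dimensions, for illustration only.
volume = monoclinic_cell_volume(150.0, 90.0, 120.0, 100.0)
for copies in (1, 2, 3, 4):
    vm = matthews_coefficient(volume, 4, copies, 55000.0)
    print(copies, round(vm, 2), round(solvent_fraction(vm), 2))
```

Scanning the number of copies per asymmetric unit in this way shows which oligomeric content gives a physically plausible solvent fraction, which can then be cross-checked against the self-rotation function.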
Structure determination by molecular replacement with AlphaFold2 models
Because multiple attempts to experimentally determine the structure of FsxAE using a variety of heavy atoms failed, we took advantage of the recent significant advances in protein 3D structure prediction using machine learning72 to phase the data by molecular replacement (MR)73 (Extended Data Fig. 2). To do so, we used AlphaFold221 to generate five independent models of FsxA ectodomain residues D25-S535, with per-residue pseudo-B factors corresponding to 100-(per-residue confidence (pLDDT21)). These models had relative root-mean-square deviations (RMSD) of 1.4-3.3 Å, or 0.7-1.9 Å after excluding 26 C-terminal residues predicted with low-confidence. Initial attempts to solve the structure with Phaser74, using an ensemble including these models (further truncated to Q453, the predicted C-terminal end of domain III), yielded 4 solutions (with top Log Likelihood Gain (LLG) 188, final Translation Function Z score (TFZ) 9.6) that were retrospectively correct in terms of domain I/II placement, but completely wrong in the positioning of domain III. Because of the latter, automatic refinement of these solutions did not progress beyond Rfree ∼0.53. On the other hand, a parallel consecutive search for three copies of a domain I/II ensemble (D25-A335; RMSD 0.3-0.9 Å) followed by three copies of domain III (P350-Q453; RMSD 0.1-0.3 Å), using a model RMSD variance of 1 Å, yielded a clear single solution (LLG 876, TFZ 23.1) that could be automatically refined to initial R 0.45, Rfree 0.46.
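The pseudo-B factor assignment described above (100 minus pLDDT) can be sketched as follows. This is a minimal illustration that assumes the predicted models are PDB-format files carrying pLDDT in the B-factor column (columns 61-66), as written by AlphaFold2; the function name is hypothetical and this is not necessarily the script used in this work.

```python
def plddt_to_pseudo_b(pdb_lines):
    """Replace per-atom pLDDT values with pseudo-B factors (100 - pLDDT).

    pdb_lines: iterable of PDB-format lines (without trailing newlines).
    Only ATOM/HETATM records are modified; other lines pass through unchanged.
    """
    converted = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 66:
            plddt = float(line[60:66])  # B-factor field, columns 61-66
            pseudo_b = 100.0 - plddt    # high confidence -> low pseudo-B
            line = f"{line[:60]}{pseudo_b:6.2f}{line[66:]}"
        converted.append(line)
    return converted
```

With this convention, confidently predicted residues receive low pseudo-B factors and are therefore weighted more strongly during the molecular replacement search.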
Remarkably, although a single copy of domain III corresponds to only 7% of the total scattering mass in the asymmetric unit of the FsxAE crystal, the very high accuracy of its AlphaFold2 model (reflected by a posteriori-calculated global RMSD and Distance Test Total Score (GDT_TS) of 0.7 Å and 97.6, respectively) allowed Phaser to also find a correct MR solution using just this part of the structure. Specifically, a consecutive search for three copies of the domain resulted in a trimeric model with LLG 275 and TFZ 15.1, which could be refined to starting R 0.51, Rfree 0.51.
Also worth mentioning is the observation that the same domain I/II + domain III MR strategy used to phase the 2.3 Å resolution data could also be successfully applied to an initial dataset at much lower resolution (3.5 Å, with outer shell mean I/σI 0.6 and CC1/2 0.31); in this case, the Phaser LLG and TFZ values for the solution were 361 and 13.5, respectively, and initial automatic refinement of the corresponding model yielded R 0.44, Rfree 0.48.
Model building, refinement and validation
The initial model of FsxAE was first automatically rebuilt using PHENIX AutoBuild75 (1083 residues; R 0.34, Rfree 0.38) and then significantly improved with the machine-learning-based sequence-docking method of ARP/wARP76, as implemented in CCP469 (1390 residues; REFMAC77 R 0.23). The resulting set of coordinates was subsequently subjected to alternating cycles of manual rebuilding with Coot78/ISOLDE79 and refinement with phenix.refine80, using torsion-based NCS restraints and three Translation-Libration-Screw-rotation groups per chain. Metal ions were assigned based on electron density level, difference Fourier maps generated using alternative atom types, correspondence with peaks in phased anomalous difference maps generated with PHENIX81 or ANODE82 and coordination properties83. Protein geometry was validated using MolProbity84 (Extended Data Table 2).
Sequence-structure analysis
Transmembrane helices were predicted using TMHMM85. GDT_TS scores were calculated using LGA86 and structural similarities were assessed with Dali87 and PDBeFold88. Secondary structure was assigned using DSSP89. Subunit interfaces were analyzed using PDBsum90, PIC91 and PISA92. Molecular charge was calculated using the YASARA2 force field93 and electrostatic surface potential calculations were performed with PDB2PQR94 and APBS95, via the APBS Tools plugin of PyMOL. Mapping of amino acid conservation onto the 3D structure of FsxAE was carried out by analyzing a sequence alignment of archaeal homologues with ConSurf96. Structural figures were generated with PyMOL (Schrödinger, LLC).
IME identification and analyses
K-mer (K = 4) spectrum for each genome was calculated for a sliding window of 1 kb using 500 bp steps and subtracted from the genomic average at each window position97. The absolute value of the difference between the genomic average and window spectra was represented graphically over the entire genome (Supplementary Fig. 6, Fig. 4a). Gaussian mixture models using two distributions were fitted to the K-mer content of all windows to classify windows as belonging to either the core genome or transferred elements97. HMMER52 and Pfam98 were used to assign domains and their associated arCOG99 identifiers to ORFs using default parameters (Source Data Table 3). Synteny conservation plots were made with MCscan tool100 from the JCVI pipeline, creating relevant files by formatting data regarding the inferred homology relationship with homemade Python scripts. For details, see Supplementary Information.
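The sliding-window comparison described above can be sketched as follows. This is a minimal illustration using only the standard library; summarizing each window by the summed absolute difference from the genome-wide spectrum is an assumption about how the per-window deviation is reduced to a single value, the function names are hypothetical, and the Gaussian mixture classification step is not included.

```python
from collections import Counter
from itertools import product


def kmer_spectrum(seq, k=4):
    """Normalized frequencies of all A/C/G/T k-mers in one sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)  # ignores k-mers containing N etc.
    return [counts[km] / total if total else 0.0 for km in kmers]


def window_deviation(genome, k=4, window=1000, step=500):
    """Deviation of each window's k-mer spectrum from the genomic average.

    Returns a list of (window_start, deviation) tuples, where deviation is
    the sum of absolute frequency differences over all k-mers.
    """
    genome = genome.upper()
    average = kmer_spectrum(genome, k)
    deviations = []
    for start in range(0, max(len(genome) - window, 0) + 1, step):
        spectrum = kmer_spectrum(genome[start:start + window], k)
        deviation = sum(abs(a - w) for a, w in zip(average, spectrum))
        deviations.append((start, deviation))
    return deviations
```

Windows with a compositional signature unlike the rest of the genome, as expected for recently integrated elements, yield larger deviation values.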
Phylogenetic analyses
Maximum likelihood phylogenetic trees were generated with IQ-TREE102 from sequences aligned with MAFFT101 (L-INS-i option), selecting the best evolutionary model with ModelFinder103. Trimeric homology models for FsxAE archaeal homologues (Extended Data Fig. 8) were built with MODELLER104 using our crystal structure as template. Stem length and timing of acquisition of FsxA-HAP2 were estimated as described105 so that the results could be compared with the published data, generously made available by the authors106. The root for the structural tree was inferred using both midpoint rooting and the Minimal Ancestor Deviation method30.
Structural alignment and phylogeny
The underlying assumption is the widely accepted one that protein folds and their decorations evolve more slowly than sequences, hence preserving deep evolutionary signals107, 108. FsxA models and crystal structures of FsxAE and eukaryotic and viral fusexins were used in all-vs-all comparisons with FATCAT109 to establish structural distances between them and write PDB files for each superimposed pair. The following experimental crystal structures from other works were used: Flavivirus E: West Nile virus (2I69)110; Dengue virus serotype 1 (4GSX)111; Alphavirus E1: Semliki Forest virus (1RER)112; Chikungunya virus (3N43)113; C. elegans EFF-1 (4OJC)12; Bunyavirus Gc Rift Valley fever virus (6EGU)114; eukaryotic HAP2/GCS1 from A. thaliana (5OW3)19 and C. reinhardtii (6E18)47. The text output of FATCAT was parsed and compiled into pairwise alignments, which were in turn merged iteratively with ClustalO51 to generate a structure-based MSA. Essentially equivalent alignments can be obtained with online servers such as POSA115 or mTM-align116, which derive TM-score matrices and multiple structure alignments. The PDB files produced by flexible alignment with FATCAT were compared with TMalign117 to build a distance matrix filled with TM scores118 between all structures. This distance matrix was used to compute a minimum evolution tree with FastME119 using default parameters.
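As an illustration of the last step, the sketch below turns a table of pairwise TM-scores into a PHYLIP-style distance matrix of the kind a distance-based tree builder takes as input. The text does not state how scores were converted into distances, so the transformation d = 1 - TM, the averaging of the two normalizations, and the function name are all assumptions.

```python
def tm_to_phylip(names, tm):
    """Format pairwise TM-scores as a PHYLIP-style distance matrix.

    tm[i][j] is the TM-score between structures i and j. Averaging the two
    normalizations and using d = 1 - TM as the distance are assumptions.
    """
    n = len(names)
    lines = [str(n)]
    for i in range(n):
        row = []
        for j in range(n):
            if i == j:
                row.append(0.0)
            else:
                score = (tm[i][j] + tm[j][i]) / 2.0  # symmetrize the score
                row.append(1.0 - score)
        # Relaxed PHYLIP layout: name, whitespace, then the distances
        lines.append(names[i] + "  " + " ".join(f"{d:.4f}" for d in row))
    return "\n".join(lines) + "\n"
```

Identical structures (TM-score of 1) map to a distance of 0 under this convention, and unrelated folds approach a distance of 1.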
Cells and reagents
BHK-21 cells (a kind gift from Judith White, University of Virginia) were maintained in DMEM supplemented with 10% FBS (Biological Industries), 100 U/ml penicillin, 100 µg/ml streptomycin (Biological Industries), 2 mM L-glutamine (Biological Industries), 1 mM sodium pyruvate (Gibco), and 30 mM HEPES buffer, pH 7.3, at 37°C with 5% CO2. Transfections were performed using Fugene HD (Promega) or jetPRIME (Polyplus) according to the manufacturer’s instructions.
Immunofluorescence
BHK cells were grown on 24-well tissue-culture plates with glass coverslips. For immunofluorescence with permeabilization, cells were fixed with 4% paraformaldehyde (EM grade, Bar Naor, Israel) in PBS, incubated in 40 mM NH4Cl to block free aldehydes, permeabilized in 0.1% Triton X-100 in PBS and blocked in 1% FBS in PBS. The coverslips were then incubated for 1 h with mouse anti–V5 antibody (Invitrogen, 1:500) and for 1 h with the secondary antibody, donkey anti–mouse coupled to Alexa Fluor 488 (Invitrogen, 1:500). Alternatively, for immunofluorescence without permeabilization, cells were blocked on ice in PBS with 1% FBS for 20 min, and then stained with Monoclonal ANTI-FLAG M2 antibody (Sigma, 1:1000) on ice for 1 h. After anti-FLAG staining, cells were washed and fixed with 4% PFA in PBS. Cells were blocked again and stained with the secondary antibody (donkey anti–mouse coupled to Alexa Fluor 488; Invitrogen) diluted 1:500 in PBS for 1 h. In all cases, nuclei were stained with 1 µg/ml DAPI. Images were captured using a Nikon Eclipse E800 with a 60X/1.40 Plan Apochromat objective and an optical zoom lens (Nikon) using a Hamamatsu ORCA-ER camera controlled by Micro-Manager software120 (Extended Data Fig. 7d).
Western blots
24 h post-transfection, cells were treated with Lysis Buffer (50 mM Tris-HCl pH 8.0, 100 mM NaCl, 5 mM EDTA, 1% Triton X-100 supplemented with chymostatin, leupeptin, antipain and pepstatin) on ice for 10 min. After centrifugation for 10 min at 14,000 rpm and 4 °C, supernatants of lysates were mixed with reducing sample buffer (+ DTT) and incubated for 5 min at 95 °C. Samples were loaded on a 10% SDS-PAGE gel and transferred to PVDF membrane. After blocking, membranes were incubated with primary antibody, either anti–V5 mouse monoclonal antibody (1:5,000; Invitrogen) or anti-actin (1:2,000; MP Biomedicals), at 4 °C overnight and then with HRP-conjugated goat anti-mouse secondary antibody for 1 h at room temperature. Membranes were imaged by the ECL detection system using FUSION-PULSE.6 (VILBER).
Content mixing assays with immunofluorescence
BHK-21 cells at 70% confluence were transfected (using JetPrime; Polyplus at a ratio of 1:2 DNA:transfection reagent) with 1 µg pCI::FsxA-V5::H2B-eGFP, pCI::FsxA-V5::H2B-RFP, pCI::AtHAP2-V5::H2B-eGFP, pCI::AtHAP2-V5::H2B-RFP, respectively. Control cells were co-transfected with pCI::H2B-eGFP and pRFPnes or pCI::H2B-RFP and pRFPnes. 4 h after transfection, the cells were washed 4 times with DMEM with 10% serum (Invitrogen), 4 times with PBS and detached using Trypsin (Biological Industries). The transfected cells were collected in Eppendorf tubes, resuspended in DMEM with 10% serum, and counted. Equal amounts of H2B-RFP and H2B-eGFP cells were mixed and seeded on glass-bottom plates (12-well black, glass-bottom #1.5H; Cellvis) and incubated at 37°C and 5% CO2. 18 h after mixing, 20 µM 5-fluoro-2’-deoxyuridine (FdUrd) was added to the plates to arrest the cell cycle and 24 h later, the cells were fixed with 4% PFA in PBS and processed for immunofluorescence. To assay mixed cells and detect the transfected proteins (FsxA-V5 or AtHAP2-V5), we stained cells with anti-V5 mAb (Life Science). The secondary antibody was Alexa Fluor 488 goat anti-mouse, with 1 µg/ml DAPI3. Micrographs were obtained using wide-field laser illumination using an ELYRA system S.1 microscope (Plan-Apochromat 20X NA 0.8; Zeiss). The GFP + RFP mixing index was calculated as the number of Red and Green nuclei in mixed cells out of the total number of nuclei of fluorescent (green cytoplasm) cells in contact (Fig. 3b).
Cell fusion assay by content mixing with nuclear and cytoplasmic markers
For the unilateral setup, BHK-21 cells were transfected (as explained above) with 1 µg pCI::H2B-RFP; pCI::GFPnes; pCI::FsxA-V5::GFPnes; 0.25 µg pCI::EFF-1-V5::GFPnes; 1 µg pCI::VSV-G::GFPnes in respective 35mm plates. The cells were incubated, washed, and mixed with pCI::H2B-RFP (empty vector) transfected cells. For evaluating the mutants, BHK-21 cells were transfected with 1 µg pCI::FsxA-V5::GFPnes or pCI::FsxA-V5::H2B-RFP or the plasmids encoding for each mutant: ΔFL→AG4A or ΔDIV→EFF-1 stem (Fig. 3i). Empty pCI::GFPnes or pCI::H2B-RFP were used as negative controls. 4 h after transfection, the cells were washed, counted, mixed, and incubated as previously described. In all cases, 18 h after mixing, 20 µM FdUrd was added to the plates, and 24 h later, the cells were fixed with 4% paraformaldehyde diluted in PBS. Nuclei were stained with 1 µg/ml DAPI. Micrographs were obtained using wide-field laser illumination using an ELYRA system S.1 microscope as described above. The GFP + RFP mixing index was calculated as the number of nuclei in mixed cells, green cytoplasm (GFPnes) with red (H2B-RFP) and blue (DAPI) nuclei out of the total number of nuclei in fluorescent cells in contact (Fig. 3e and 3j). For the unilateral assay, multinucleation was determined as the ratio between the number of nuclei in multinucleated green cells and the total number of nuclei in green multinucleated cells and GFPnes expressing cells that were in contact but did not fuse.
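The two ratios defined above reduce to simple arithmetic on nuclei counts. The following sketch is only illustrative: the function and argument names are hypothetical, and reporting the ratios as percentages is an assumption based on the way results are described in the Data analysis section.

```python
def percentage(part, total):
    """Return part/total as a percentage, refusing an empty denominator."""
    if total <= 0:
        raise ValueError("no nuclei were counted")
    return 100.0 * part / total


def mixing_index(nuclei_in_mixed_cells, nuclei_in_contact_cells):
    """GFP + RFP mixing index.

    nuclei_in_mixed_cells: nuclei in cells showing both markers.
    nuclei_in_contact_cells: all nuclei in fluorescent cells in contact
    (assumed here to include the nuclei of the mixed cells).
    """
    return percentage(nuclei_in_mixed_cells, nuclei_in_contact_cells)


def multinucleation(nuclei_in_multinucleated, nuclei_in_unfused_contact):
    """Multinucleation for the unilateral assay.

    Denominator: nuclei in green multinucleated cells plus nuclei in
    GFPnes-expressing cells that were in contact but did not fuse.
    """
    total = nuclei_in_multinucleated + nuclei_in_unfused_contact
    return percentage(nuclei_in_multinucleated, total)
```

For example, 30 nuclei in multinucleated green cells and 70 nuclei in unfused green cells in contact would give a multinucleation value of 30%.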
Live imaging of fusing cells
BHK cells were plated on 15 mm glass bottom plates (Wuxi NEST Biotechnology Co., Ltd.) and transfected with 1 µg pCI::FsxA-V5::H2B-GFP together with 0.5 µg myristoylated-mCherry (myr-palm-mCherry; kindly provided by Valentin Dunsing and Salvatore Chiantia 56). 18 h after transfection, the cells were incubated with 2 μg/ml Hoechst dye for 10 min at 37°C and washed once with fresh medium. Time-lapse microscopy to identify fusing cells was performed using a spinning disc confocal microscope (CSU-X; Yokogawa Electric Corporation) with an Eclipse Ti and a Plan-Apochromat 20X (NA, 0.75; Nikon) objective. Images in differential interference contrast and red channels were recorded every 4 min in different positions of the plate using high gain and minimum laser exposure. Time lapse images were captured with an iXon 3 EMCCD camera (Andor Technology). After 5 h, confocal z-series, including detection of the DAPI channel, were obtained to confirm the formation of multinucleated cells. Image analyses were performed in MetaMorph (Molecular Devices) and ImageJ121 (National Institutes of Health).
Surface biotinylation
Proteins localizing on the surface were detected as previously described3. Briefly, BHK cells were transfected with 1 µg pCAGGS, pCAGGS::EFF-1-V5, pCAGGS::FsxA-V5, pCAGGS::ΔFL→AG4A-V5 or pCAGGS::ΔDIV→EFF-1 stem-V5. 24 h later, cells were washed twice with ice-cold PBS2+ (with Ca2+ and Mg2+) and incubated with 0.5 mg/ml EZ-Link Sulfo NHS-Biotin (Thermo Fisher Scientific) for 30 min on ice. The cells were washed four times with ice-cold PBS2+, once with DMEM with 10% FBS (to quench residual biotin), followed by two more washes with PBS2+. To each plate, 300 µl of Lysis Buffer supplemented with 10 mM iodoacetamide was added and the cells were detached using a scraper. The insoluble debris was separated by centrifugation (10 min at 21,000 g), and the lysate was mixed with NeutrAvidin Agarose Resin (Thermo Fisher Scientific) and 0.3% SDS. After an incubation of 12 h at 4°C, the resin was separated by centrifugation (2 min at 21,000 g), washed three times with lysis buffer and then mixed with SDS-PAGE loading solution with freshly added 5% β-mercaptoethanol and incubated for 5 min at 100°C. After pelleting by centrifugation, the samples were separated by SDS-PAGE and analyzed by Western blotting as described above using anti–V5 mouse monoclonal antibody. Loading was controlled using anti-actin C4 monoclonal (1:2,000; MP Biomedicals).
Data analysis
Counting of content mixing and multinucleation was performed blind for the experiments included in Fig. 3f and 3j. Interobserver error was estimated for counting of multinucleated cells, cells in contact, and content-mixing experiments: the differences in percentages of multinucleation and content mixing obtained by two observers were <10%. Figures were prepared with Photoshop and Illustrator CS (Adobe), BioRender and ImageJ121.
Statistical tests
Results are presented as means ± SEM. For each experiment we performed at least three independent biological repetitions. To evaluate the significance of differences between the averages we used one-way ANOVA as described in the legends (GraphPad Prism).
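For readers who want to see what these summary statistics involve, the sketch below computes a mean ± SEM and the F statistic of a one-way ANOVA using only the Python standard library. It is an illustration of the formulas, not a reproduction of the GraphPad Prism analysis; the function names are hypothetical and the p-value, which requires the F distribution, is deliberately left out.

```python
import math
import statistics


def mean_sem(values):
    """Return (mean, standard error of the mean) for one group of repeats."""
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return statistics.mean(values), sem


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(
        len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups
    )
    # Variation of the observations around their own group mean
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

Each group would hold the values from the independent biological repetitions of one condition, for instance the mixing index of a control and of each construct.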
Data availability
All relevant data are included as Supplementary Information files (see Suppl. Inf. Guide). Crystallographic structure factors and atomic coordinates have been deposited in the Protein Data Bank under accession code 7P4L.
Code availability
All relevant code, notebooks and datasets necessary for HHblits and HMMER searches and comparisons (Fig. 1a, Suppl. Fig. 1, Source Data Table 1); k-mer spectra analyses (Fig. 4a, Suppl. Fig. 6); IME clustering, content and synteny analyses (Source Data Tables 2 and 3, Fig. 4b, Suppl. Figs. 8 and 9); protein sequence and structure-based comparisons (Extended Data Fig. 8, Fig. 4c); and timing of gene acquisition analysis (Suppl. Fig. 10) are available upon request. The GitHub repository URL will be provided upon publication.
Author contributions
B.P. conceived the experiments; performed some imaging work; designed, supervised and analyzed cell fusion experiments. C.D. helped devise analysis strategies for k-mer and phylogenetic analysis. C.V. designed, performed and analyzed cell fusion assays. D.deS. collected X-ray data, took part in structural analysis and validated metal substructure. D.M. carried out deep homology detection of FsxA; designed gene content analysis strategies and phylogenetic analysis, collected sequence data; performed k-mer, functional, structural and phylogenomics analyses, built homology models, coded analysis routine pipelines. H.R. supervised the bioinformatics part of the work, estimated relative acquisition times, designed and performed phylogenetic and phylogenomic surveys. K.F. made constructs of FsxA mutants. K.T. and J.J. generated AlphaFold2 models of FsxA. L.J. supervised the biochemical and structural part of the work; collected X-ray data; solved the FsxA structure, refined and analyzed it. M.G. designed bioinformatic strategies, supervised bioinformatic aspects of the work, analyzed sequence and structural data. M.L. performed IME synteny analyses, phylogenetic and phylogenomic surveys. N.G.B. carried out live imaging and surface biotinylation experiments, assisted with the preparation and design of plasmids; analyzed data. P.S.A. supervised the bioinformatics part of the work, analyzed data. S.N. expressed, purified and crystallized FsxA; performed SEC-MALS experiments; analyzed SAXS data; collected X-ray data and took part in structure determination, model building and structure analysis. X.L. carried out immunofluorescence and western blots for archaeal fusexins in mammalian cells, designed and constructed plasmids, performed imaging work. D.M., S.N., X.L., N.G.B., M.G., H.R., P.S.A., L.J. and B.P. made figures and tables. D.M., S.N., X.L., M.G., H.R., P.S.A., L.J. and B.P. wrote the manuscript. All authors reviewed the manuscript.
Correspondence and requests for materials should be addressed to M.G. or H.R. or P.S.A. or L.J. or B.P.
Competing interests
J.J. has filed provisional patent applications relating to machine learning for predicting protein structures. The other authors declare no competing interests.
Extended data
Supplementary Information
Supplementary Methods
Integrated Mobile Element (IME) identification by k-mer spectra analysis and comparative genomics
Among the different methodologies that rely on DNA composition to identify horizontally transferred genomic regions1, k-mer spectrum analysis is a standard tool2, 3. Normalized k-mer spectra for DNA sequences of arbitrary length were generated by counting occurrences of all k-mers and normalizing by the total number of words counted. k-mer sizes from 3 bp to 8 bp were tested with no effect on the results, and a length of 4 bp was selected. To detect possible horizontally transferred regions, an average spectrum for each genome was calculated. A spectrum was calculated for a sliding window of 1 kb using 500 bp steps and subtracted from the genomic average at each window position (Supplementary Fig. 6). The absolute value of the difference between the genomic average and window spectra is represented over the entire genome. Gaussian mixture models using two distributions were fit4 to the k-mer content of all windows, to classify these as belonging to either the core genome or transferred elements. This deviation in k-mer spectra has been explored in the context of the archaeal mobilome and contains information on the ecological niche and evolutionary history of DNA sequences5.
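The two-distribution classification step can be sketched as a one-dimensional Gaussian mixture fitted by expectation-maximization. This is a simplified stand-in written with the standard library only: treating each window as a single deviation score, the initialization, the fixed number of iterations, and the function names are assumptions, and the fitting software actually used is the one cited above.

```python
import math


def _pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def fit_two_gaussians(values, iterations=200):
    """Fit a two-component 1D Gaussian mixture by EM.

    Returns (weights, means, sds), each a list of two numbers.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    # Initialize the components from the lower and upper halves of the data
    means = [sum(ordered[:half]) / half,
             sum(ordered[half:]) / (len(ordered) - half)]
    spread = (ordered[-1] - ordered[0]) / 4 or 1.0
    sds = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each value
        resp = []
        for x in values:
            dens = [w * _pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: update weight, mean and standard deviation per component
        for c in range(2):
            nk = sum(r[c] for r in resp) or 1e-300
            weights[c] = nk / len(values)
            means[c] = sum(r[c] * x for r, x in zip(resp, values)) / nk
            var = sum(r[c] * (x - means[c]) ** 2
                      for r, x in zip(resp, values)) / nk
            sds[c] = max(math.sqrt(var), 1e-6)  # floor avoids a zero-width peak
    return weights, means, sds


def classify_windows(values):
    """Return True for windows assigned to the higher-deviation component."""
    weights, means, sds = fit_two_gaussians(values)
    high = 0 if means[0] > means[1] else 1
    labels = []
    for x in values:
        dens = [w * _pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
        labels.append(dens[high] > dens[1 - high])
    return labels
```

Under this sketch, windows labelled True would be candidates for transferred elements, and the remainder would be assigned to the core genome.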
Closely related species with (fsxA+) or without (fsxA-) archaeal fusexins were compared to detect insertion sites, by performing sequence similarity searches in complete genomes from the closest relatives available in the PATRIC database6 (Extended Data Table 1 and Supplementary Fig. 5). Coordinates of fsxA-containing IMEs present in pure culture genomes (PCGs) are annotated in Supplementary Table 2.
Homology groups for gene sequences encoded in IMEs from PCGs, metagenome-assembled genomes (MAGs) and metagenome-assembled scaffolds (see Source Data Table 1) were created by means of the pipeline represented in Supplementary Fig. 7. Briefly, after in-house gene re-annotation in each mobile element (ME), successive rounds of similarity searches using Hidden Markov Model (HMM)-protein and HMM-HMM comparisons were used to establish group membership. Finally, we clustered MEs using a Jaccard index-based distance matrix from these homology groups to assess their similarity.
Synteny conservation plots were made with the MCscan tool from the JCVI pipeline7, using input files generated from the inferred homology relationships with in-house Python scripts.
HMMER8 and Pfam9 were used on default parameters to assign domains and their associated arCOG10, 11 identifiers to ORFs (Source Data Table 2).
IME homology analyses
We followed the pipeline depicted in Supplementary Fig. 7. Briefly, PCGs’ IMEs were determined by a combination of k-mer spectra and genomic alignments (see Supplementary Table 2). We initially inspected fsxA-containing scaffolds and kept only sequences that were 20 kb or longer for downstream analyses. We generated an enriched annotation for each IME. Then, we obtained an initial set of groups of homologous sequences, and each of these groups was enriched by means of HMM searches. Subsequently, the enriched homology groups showing similarity between them, as judged by HMM-HMM comparisons, were collapsed into unique groups.
In detail, we first re-annotated the identified mobile elements (see Methods), combining the corresponding segment of the PATRIC6 GFF annotation file with in-house ORF predictions (minimum ORF length of 30 nucleotides, the default option). ORF inference was done by means of getorf of the EMBOSS package v6.6.0.012, specifying the genetic code as Table 11 (Bacteria and Archaea) and running other parameters by default. The similarity of inferred ORFs and annotated features in these mobile elements (i.e. features in their GFF annotation file) was established by means of BLASTP reciprocal searches13. We kept all the predicted ORFs and homologues that were annotated in at least one genome; in this way, we tried to recover misannotated conserved ORFs.
Initial sets of homologues were generated with get_homologues v2021030514. Sequence identity and query coverage thresholds were set to 35% and 70%, respectively. In-paralogues were not allowed within these groups (option ‘-e’), and remaining parameters were run by default.
HMM profiles were constructed for each homologue group. To this aim, homologous sequences were retrieved for members of each group from the UniRef50 database15 with jackhmmer (HMMER package v3.1b2; http://hmmer.org8) running with one iteration (‘-N 1’ parameter). MSAs were then generated for each group and its relevant hits with MAFFT16 (v7.310) running under ‘--auto’ parameter, and HMMs were created with hmmbuild (HMMER8). Homologue groups were enriched by means of HMM searches with hmmsearch (HMMER8), using each HMM as a query against a database comprising all predicted ORFs described above. Hits showing an e-value < 1e-10 and covering at least 50% of the HMM were added to the groups.
Enriched homology groups showing homology were collapsed. For this purpose, HMM-vs-HMM comparisons were performed with HHalign v3.3.0 from the HHsuite17. A graph was created with the Python library networkx v2.5.1, each node being an enriched group of homologues. An edge was established between nodes if their HMM-HMM alignment was significant (i.e. e-value < 1e-10, HMM coverage of longest HMM >= 50%). Groups of interconnected nodes were established with the ‘connected_components()’ routine, creating a collapsed homology group in each case.
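The collapsing step amounts to finding connected components in a graph whose edges are significant HMM-HMM alignments. The sketch below reproduces that logic with the standard library instead of networkx; the input layout (tuples of two group identifiers, an e-value and the coverage of the longest HMM) and the function name are hypothetical, while the thresholds are the ones stated above.

```python
def collapse_groups(groups, alignments, max_evalue=1e-10, min_coverage=0.5):
    """Merge homology groups linked by significant HMM-HMM alignments.

    groups: iterable of group identifiers (graph nodes).
    alignments: iterable of (group_a, group_b, evalue, coverage), where
    coverage is the fraction of the longest HMM that is aligned.
    Returns a list of sets, one per collapsed homology group.
    """
    neighbours = {g: set() for g in groups}
    for a, b, evalue, coverage in alignments:
        if evalue < max_evalue and coverage >= min_coverage:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen = set()
    components = []
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first traversal collects every node reachable from start
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components
```

Groups that are only linked through an intermediate group still end up in the same collapsed group, which is the behaviour expected from a connected-components routine.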
Finally, we assessed the gene content similarity between mobile elements using a Jaccard index based on the homology groups defined above. The Jaccard index of two sets is defined as the size of their intersection divided by the size of their union. In this case, the two sets are the homology groups represented in mobile elements A and B, so that $J_{A,B} = |A \cap B| / |A \cup B|$. We performed a hierarchical clustering of the MEs based on a distance matrix obtained from the pairwise Jaccard indexes (distance(A,B) = $1 - J_{A,B}$). This was done in Python with seaborn v0.11.118, employing the clustermap function. A subset of 11 mobile elements (Supplementary Fig. 9, in red), which included MEs from PCGs and FsxA.11, was selected for synteny conservation analysis. Plots depicting synteny in gene content between homolog groups were generated employing the MCscan tool7. Collapsed clusters can be found in Supplementary Information.
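A minimal sketch of the distance calculation follows, assuming each mobile element is represented by the set of homology group identifiers it contains. The function names, the dictionary layout and the choice to treat two empty elements as identical are illustrative assumptions.

```python
def jaccard_distance(groups_a, groups_b):
    """Return 1 - Jaccard index for two collections of homology groups."""
    a, b = set(groups_a), set(groups_b)
    union = a | b
    if not union:
        return 0.0  # two empty elements are treated as identical (assumption)
    return 1.0 - len(a & b) / len(union)


def distance_matrix(elements):
    """Build a symmetric distance matrix from {element name: homology groups}.

    Returns (names, matrix) with rows and columns in sorted name order.
    """
    names = sorted(elements)
    matrix = [
        [jaccard_distance(elements[x], elements[y]) for y in names]
        for x in names
    ]
    return names, matrix
```

A matrix of this form is the kind of input that a hierarchical clustering routine operates on.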
Relative time of acquisition analysis
Previous work19, 20 has closely examined relative acquisition times of thousands of eukaryotic genes. Briefly, given a set of homologous genes present in Prokaryotes and Eukaryotes, analysis of the corresponding phylogeny may shed light on the relative time of acquisition of this gene. First, we have to determine if this gene was in the LECA. Then, with its phylogeny (Supplementary Fig. 10a) we can define: i) a LECA node indicating the node where this gene started to diversify within the Eukaryotic clade, ii) an acquisition node connecting eukaryotic to prokaryotic lineages, and iii) the ‘stem’, which is the branch connecting the acquisition node with the LECA node. At this point a sister lineage can be identified as the potential donor of the gene and the different stem lengths will inform on how early (or late) this gene was acquired. Long stems are indicative of older genes, thus probably present in the FECA, while short stems suggest a late acquisition during the FECA-to-LECA transition. Given that genes have different evolutionary rates, it is important to normalize the raw stem length (Rsl) dividing it by the median of the eukaryotic branch length (from the LECA node to each tip of the tree) obtaining a normalized stem length (Sl). For a complete description of this approach, see previous reports19, 20. Based on this idea, we decided to perform a similar analysis of the fsxA gene. Some singularities of this family, such as its existence within IMEs in a limited set of completely sequenced genomes, may influence the results. Yet, it meets the general criteria for the analysis and its comparison with the results obtained by these authors should be informative.
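The normalization described above can be written in a few lines. This sketch assumes that the raw stem length and the LECA-to-tip branch lengths have already been extracted from the tree; the function and argument names are hypothetical.

```python
import statistics


def normalized_stem_length(raw_stem_length, leca_to_tip_lengths):
    """Divide the raw stem length by the median eukaryotic branch length.

    raw_stem_length: branch length between the acquisition and LECA nodes.
    leca_to_tip_lengths: path lengths from the LECA node to each
    eukaryotic tip of the tree.
    """
    median_length = statistics.median(leca_to_tip_lengths)
    if median_length <= 0:
        raise ValueError("median eukaryotic branch length must be positive")
    return raw_stem_length / median_length
```

Dividing by the median makes stems comparable between gene families that evolve at different rates, which is the purpose stated in the text.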
Thus, we closely followed the approach described in references 19 and 20 in order to compare our results with these works. We downloaded the same 209 complete eukaryotic proteomes and scanned them with hmmsearch8 with a HAP2 HMM profile built from a curated alignment from our previous work21, obtaining 86 sequences. Then, following the scrollsaw22 method to pick slowly evolving sequences, we generated automatic and manual datasets, discarding fast evolving sequences, clade-specific duplications and possible phylogenetic artifacts. The scrollsaw method has proved useful in certain inferences, as it increases well-supported deep nodes in phylogenetic trees. We next proceeded to reduce the PCGs + MAGs + Metagenomic FsxA dataset with kClust v1.03723 using a clustering threshold of 2.93, which corresponds to a 60% sequence identity. Finally, we combined eukaryotic and prokaryotic results into several datasets in order to perform sensitivity analyses. In the most stringent dataset, we excluded metagenomic sequences outside the PCG clade (n=2), addressing the possibility of tree artifacts or even the accidental inclusion of a eukaryotic sequence from metagenomic data.
Multiple sequence alignments were computed with MAFFT16 v7.310 (‘--auto’ option) and trimmed with trimAl v1.4.rev154424 with a 10% gap threshold (‘-gt 0.1’), i.e. only removing columns with more than 90% gaps. Phylogenetic trees were inferred with IQ-TREE v1.6.12 (LG4X evolutionary model25, 1,000 ultrafast bootstraps26). Trees were analyzed with ETE Toolkit27. All data and scripts are described in the Supplementary Information Guide and available upon request. The stem length distributions and sister clade identification for the eukaryotic genes of references 19 and 20 were obtained from data made available by the authors28.
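To make the effect of the gap threshold concrete, the sketch below applies the same rule to an alignment held in memory: a column is kept when the fraction of sequences without a gap is at least the threshold, so a value of 0.1 only removes columns with more than 90% gaps. This is a simplified illustration and not the trimAl implementation; the function name and the dictionary layout are assumptions.

```python
def trim_columns(alignment, gap_threshold=0.1):
    """Drop alignment columns whose non-gap fraction is below gap_threshold.

    alignment: {sequence name: aligned sequence}, all of equal length.
    Returns a new dictionary with the retained columns only.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    keep = [
        col for col in range(length)
        if sum(s[col] != "-" for s in seqs) / len(seqs) >= gap_threshold
    ]
    return {
        name: "".join(seq[col] for col in keep)
        for name, seq in alignment.items()
    }
```

With such a permissive threshold, nearly all columns survive and only those made almost entirely of gaps are discarded.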
Supplementary Tables
Supplementary Video Legends
Supplementary Video 1 | Time-lapse experiment using spinning disk confocal microscopy reveals merging of two cells expressing myr-mCherry and FsxA. Time in hours:minutes. Merge of the red and DIC channels is shown.
Supplementary Video 2 | Z-series of the binucleated BHK cell from Fig. 3h. Labeled nuclei (blue) and myr-mCherry (white). Each optical section obtained with spinning disc confocal microscopy is 1 μm apart.
Acknowledgements
We thank Sonja-Verena Albers, Dan Cassel, Peter Walter, Alejandro Colman-Lerner, Uri Gophna, Yael Iosilevskii, Shahar Lavid, and members of our laboratories for discussion and comments on the manuscript; Kira Makarova for discussions on archaea; Jose Flores for advice on searches of fusexins in metagenomes. This work was supported by grants from Comisión Sectorial de Investigación Científica, Uruguay (CSIC I+D-2020-682 to H.R. and M.G.); Fondo para la Investigación Científica y Tecnológica, Argentina (PICT-2017-0854 to P.S.A.); Beca de Doctorado Consejo Nacional de Investigaciones Científicas y Técnicas, Argentina (to D.M.); the Knut and Alice Wallenberg Foundation (project grant 2018.0042 to L.J.); the Novartis Foundation for Medical-Biological Research (grant #17B111 to C.D.); the Swedish Research Council (project grants 2016-03999 and 2020-04936 to L.J.); the Swiss Leading House for the Latin American Region (C.D. and P.S.A.); the Swiss National Science Foundation (grant 183723 to C.D.); the Israel Science Foundation (grants 257/17, 2462/18, 2327/19, and 178/20 to B.P.); FOCEM, Fondo para la Convergencia Estructural del Mercosur (COF 03/11, M.G.). H.R., M.G. and M.L. also thank Programa de Desarrollo de las Ciencias Básicas and ANII, Uruguay. The computations for this work were performed at the Vital-IT Center for high-performance computing of the SIB Swiss Institute of Bioinformatics.