BEAN and HABAS: Polyphyletic insertions in RNAP that point to deep-time evolutionary divergence of bacteria

The β and β’ subunits of the RNA polymerase (RNAP) are large proteins with complex multi-domain architectures that include several insertional domains. Here, we analyze the multi-domain organizations of bacterial RNAP-β and RNAP-β’ using sequence, experimentally determined structures and AlphaFold structure predictions. We observe that bacterial lineage-specific domains in RNAP-β belong to a group of domains that we call BEAN (Broadly Embedded ANnex) and that in RNAP-β’, bacterial lineage-specific domains are HAmmerhead/BArrel-Sandwich Hybrid (HABAS) domains. The BEAN domain has a characteristic three-dimensional structure composed of two square bracket-like elements that are antiparallel relative to each other. The HABAS domain contains a four-stranded open β-sheet with a GD-box-like motif in one of the β-strands and the adjoining loop. The BEAN domain is identified not only in the bacterial RNAP-β’, but also in the archaeal version of universal ribosomal protein L10. The HABAS domain is observed as an insertional domain in several metabolic proteins. The phylogenetic distributions of bacterial lineage-specific insertional domains of β and β’ subunits of RNAP follow the Tree of Life. The presence of insertional domains can help establish a relative timeline of events in the evolution of a protein because insertion is inferred to post-date the base domain. We discuss mechanisms that might account for the discovery of homologous insertional domains in non-equivalent locations in bacteria and archaea.


Introduction
Transcription is the Central Dogma process in which RNA polymerase (RNAP) transcribes DNA into RNA (Hurwitz, et al. 1961). mRNA is then translated into protein in the ribosome.
RNAP-β and RNAP-β' have bacterial, archaeal, and eukaryotic orthologs and contain universal sequence motifs and domains (Sweetser, et al. 1987; Jokerst, et al. 1989; Lane and Darst 2010a, b). However, the domain architectures of RNAP-β and RNAP-β' vary significantly between archaea and bacteria, and among bacteria. Archaea-specific domains of RNAP are conserved in eukaryotes (Figures S2-S3). Here, we use sequences and structures to begin to reconstruct an extraordinary succession of events during the deep evolutionary history of RNAP.
We observe that bacterial lineages acquired specific types of insertional domains at multiple locations of RNAP-β and RNAP-β'. Archaeal lineages acquired different insertional domains. The acquisition of these insertions occurred during the deep history of RNAP, billions of years ago. The results here allow us to establish a classification system for bacterial RNAP-β and RNAP-β' subunits, based on type, location and chronology of domain insertion. Some bacteria-specific domains of RNAP are observed only in certain bacterial lineages (Borukhov, et al. 1991; Severinov, et al. 1992; Iyer, et al. 2004; Lane and Darst 2010a, b; Huang, et al. 2015; Qayyum, et al. 2024). We use a naming scheme in which domains of RNAP-β and RNAP-β' that occur in all archaea but not in bacteria are called "a-specific" domains. Domains that occur in all bacteria but not in archaea are called "b-specific". Domains that occur in some bacterial lineages but not others are called "b/lineage-specific".
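The naming scheme above amounts to a small decision rule. The following is a minimal sketch, assuming that domain presence has already been tabulated per sampled genome. The function name `classify_domain` and the input layout are hypothetical illustrations and are not part of the analysis described here.

```python
def classify_domain(in_archaea, in_bacteria):
    """Classify a domain from presence/absence flags.

    in_archaea, in_bacteria: lists of booleans, one per sampled genome
    (hypothetical input layout, assumed for illustration only).
    """
    all_arch, any_arch = all(in_archaea), any(in_archaea)
    all_bact, any_bact = all(in_bacteria), any(in_bacteria)

    if all_arch and all_bact:
        return "universal"
    if all_arch and not any_bact:
        return "a-specific"           # all archaea, no bacteria
    if all_bact and not any_arch:
        return "b-specific"           # all bacteria, no archaea
    if any_bact and not all_bact and not any_arch:
        return "b/lineage-specific"   # some bacterial lineages only
    return "unclassified"             # pattern not covered by the scheme


# Example: a domain found in only some of the sampled bacteria
print(classify_domain([False, False], [True, False, True]))
```

Patterns that the scheme does not name (for example, a domain present in only some archaea) are returned as "unclassified" rather than forced into a category.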
Proteins most commonly acquire new domains by terminal addition (Weiner, et al. 2006; Marsh and Teichmann 2010), generating tandem multidomain architectures. Yet, both RNAP-β and RNAP-β' acquired domains by internal insertion. Insertional domains are less frequent than terminally-added domains (Manriquez-Sandoval and Fried 2022). These accretion processes reveal dependencies that can be particularly useful in establishing timelines of evolutionary events; it is assumed that an insertional domain was acquired more recently than the domain that hosts it. Most of the a-, b- and b/lineage-specific domains of RNAP-β and RNAP-β' are insertional. In multiple sequence alignments (MSAs), a-, b- and b/lineage-specific domains are observed as "blocks" (Vishwanath, et al. 2004). In MSAs of RNAP-β and RNAP-β', universal domains are broken by these blocks (Figure S1).
We observe a broadly distributed b/lineage-specific insertional domain with idiosyncratic positions in RNAP-β. We call this domain BEAN (Broadly Embedded ANnex). We observe a b/lineage-specific insertional domain with an idiosyncratic position in RNAP-β'. We call this domain HABAS (HAmmerhead/BArrel-Sandwich Hybrid). The BEAN domain is also identified in the archaeal version of universal ribosomal protein L10 (uL10). The HABAS domain is observed as an insertional domain in several metabolic proteins.
The locations and phylogenetic distributions of insertional domains in RNAP-β, RNAP-β' and uL10 report on events that occurred in the deep evolutionary past. These insertional domains appear in distinct positions in the most deeply rooted bacterial lineages. We explore possible scenarios by which BEAN was idiosyncratically inserted in bacterial RNAP but not archaeal RNAP and was universally inserted in archaeal uL10 but not bacterial uL10. Here, we observe the results of processes that reshaped the multidomain architecture of bacterial and archaeal orthologs and tapered off after early evolution.

Results
Domain organizations of RNAP β and β'.
The MSAs of RNAP-β and RNAP-β' display block structures indicating universal as well as a-specific, b-specific and b/lineage-specific elements (Figures 1a and 2a). We analyzed the RNAP-β and RNAP-β' domain organizations using orthologous sequences from a subsample of a reference set of evenly sampled bacterial genomes (Zhu, et al. 2019). The subsample used here includes representatives from all known major bacterial species and has been adapted from (Moody, et al. 2022). The blocks within the MSA were annotated using CATH (Sillitoe, et al. 2021) (Tables 1 and 2). The location and type of insertion were verified using experimentally determined structures (Berman, et al. 2000) and AlphaFold structure predictions (Jumper, et al. 2021; Varadi, et al. 2022). We observe small insertions (< 50 residues) that lack sequence similarity to each other or to domain entries in three classification databases: CATH (Sillitoe, et al. 2021), ECOD (Schaeffer, et al. 2017) and SCOPe (Chandonia, et al. 2017). These insertions are omitted from the analysis here.
Our inquiry here is facilitated by our naming system for RNAP domains. In this scheme, the subunit is indicated by β or β', followed by a hyphen and the letter "u" to indicate universal conservation, or "a" to indicate a-specific, or "b" to indicate b-specific. The domains (D) are numbered in order of appearance in the sequence. For example, the N-terminal RNAP β subunit domain, which is universal, is called β-uD1.
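The domain labels described above follow a regular pattern, so they can be built and parsed mechanically. The following is a minimal sketch; the helper names `domain_name` and `parse_domain_name` are hypothetical, and only the three letters given in the text (u, a, b) are handled.

```python
import re

# Subunits and conservation letters as described in the text
SUBUNITS = ("β", "β'")
CLASSES = {"u": "universal", "a": "a-specific", "b": "b-specific"}

# subunit, hyphen, class letter, "D", running number (e.g. β-uD1)
_NAME = re.compile(r"^(β'?)-([uab])D(\d+)$")


def domain_name(subunit, conservation, index):
    """Build a label such as 'β-uD1' (hypothetical helper)."""
    if subunit not in SUBUNITS:
        raise ValueError(f"unknown subunit: {subunit!r}")
    if conservation not in CLASSES:
        raise ValueError(f"unknown conservation class: {conservation!r}")
    if index < 1:
        raise ValueError("domains are numbered from 1")
    return f"{subunit}-{conservation}D{index}"


def parse_domain_name(name):
    """Split a label into (subunit, conservation class, domain number)."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a valid domain label: {name!r}")
    subunit, letter, number = match.groups()
    return subunit, CLASSES[letter], int(number)


print(domain_name("β", "u", 1))
print(parse_domain_name("β'-uD4"))
```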

The BEAN domain
The BEAN domain has a characteristic three-dimensional structure composed of two square bracket-like elements that are anti-parallel relative to each other (Figure 3d). Each bracket-like element is formed by an α-helix and two β-strands. The orientation between consecutive secondary elements is 90° within each bracket.
A cluster analysis of sequences based on BLASTP P-values shows that b/lineage-specific BEAN domains in RNAP-β are more similar to each other than to BEAN domains in other proteins (Figure 3a). Additionally, BEAN domains in RNAP-β show a conserved TGD/E (threonine, glycine, aspartic/glutamic acid) sequence motif (Figure 3d) that is absent from other BEAN domains. Thus, BEAN domains in RNAP-β appear to share more recent ancestry with each other than with BEAN domains of other proteins.
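Grouping sequences by pairwise P-values at a fixed threshold can be illustrated with single-linkage clustering. The following is a minimal sketch under stated assumptions: the clustering procedure is not specified in the text, so single linkage is a stand-in; the function name, sequence labels and P-values are hypothetical; the default threshold of 1×10⁻¹¹ is the value given in the Figure 3 caption.

```python
def cluster_by_pvalue(names, pvalues, threshold=1e-11):
    """Single-linkage clustering of sequences (illustrative stand-in).

    names:   list of sequence labels
    pvalues: dict mapping (label_a, label_b) -> pairwise P-value
    Two sequences are linked when their P-value is <= threshold.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent pointers to the cluster root, halving paths as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), p in pvalues.items():
        if p <= threshold:
            parent[find(a)] = find(b)   # merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical labels and P-values, for illustration only
names = ["rpoB_BEAN_1", "rpoB_BEAN_2", "uL10_BEAN_1"]
pvalues = {
    ("rpoB_BEAN_1", "rpoB_BEAN_2"): 1e-30,
    ("rpoB_BEAN_1", "uL10_BEAN_1"): 1e-5,
    ("rpoB_BEAN_2", "uL10_BEAN_1"): 1e-4,
}
print(cluster_by_pvalue(names, pvalues))
```

With these made-up values, the two RNAP-β sequences fall into one cluster and the uL10 sequence remains separate.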
Note that bacterial type 3 RNAP-β' is composed of two polypeptide chains. The N-terminal sub-subunit (RNAP-β' BacN) ends with β'-uD3; and the C-terminal sub-subunit (RNAP-β' BacC) starts at β'-uD4 (Figure 2a). RNAP-β' BacN and RNAP-β' BacC assemble to form a complete RNAP-β' (Qayyum, et al. 2024).
The HABAS domain is a four-stranded open β-sheet with a conserved sequence motif in one of the β-strands and the adjoining loop. The motif contains glycine (G), aliphatic (Ψ), and polar (ρ) amino acids as follows: ΨxΨρxGρxΨxxGρxΨxx. We call this motif the GD-box-like motif because it is similar but not identical to the GD-box sequence motif ΨxΨxxGρxΨxΨ (Alva, et al. 2009). We found three distinct topologies of secondary structural elements in HABAS domains. These topologies are related by circular permutation. We distinguish and name these topological variants by the locations of their GD-box-like motifs. The most frequently observed topology has a GD-box-like motif in strand β4 (β4 GD). β-bD1 has a β4 GD topology (Figure 4d). β-uD5 has a β2 GD topology (Figure 4e), and a HABAS domain of NusG has a β3 GD topology (Figure 4f).
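The motif notation above can be translated into a regular expression for scanning sequences. The following is a minimal sketch. The residue sets chosen for "aliphatic" and "polar" are assumptions made for illustration (the text does not list them), and `motif_to_regex` and `find_motif` are hypothetical helpers, not the search procedure used in this work.

```python
import re

# ASSUMPTION: residue classes are not defined in the text; these sets
# are one common convention and are used here only for illustration.
ALIPHATIC = "AVILM"      # Ψ
POLAR = "STNQDEKRH"      # ρ

SYMBOLS = {
    "Ψ": f"[{ALIPHATIC}]",
    "ρ": f"[{POLAR}]",
    "x": ".",             # any residue
    "G": "G",             # glycine
}


def motif_to_regex(motif):
    """Compile a motif written with Ψ, ρ, x and G into a regex."""
    return re.compile("".join(SYMBOLS[symbol] for symbol in motif))


GD_BOX_LIKE = motif_to_regex("ΨxΨρxGρxΨxxGρxΨxx")
GD_BOX = motif_to_regex("ΨxΨxxGρxΨxΨ")


def find_motif(pattern, sequence):
    """Return 0-based start positions of non-overlapping matches."""
    return [match.start() for match in pattern.finditer(sequence)]


# Synthetic sequence built to satisfy the GD-box-like pattern
print(find_motif(GD_BOX_LIKE, "MKVALTAGDAIAAGDAVAA"))
```

Because the residue classes are assumed, a match under this sketch indicates only that a sequence fits the written pattern, not that it contains a structurally validated GD-box-like motif.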
The β and β' subunits are fused into one polypeptide chain in Candidatus Adlerbacteria and Wolinella succinogenes (Table S1).
Type 1 RNAP-β lacks b/lineage-specific BEAN insertions. Its scattered phylogenetic distribution could correspond to horizontal gene transfer (HGT) or reduction from more elaborate types. To test whether the scattered distribution of type 1 RNAP can be attributed to HGT, we calculated a maximum likelihood gene tree of RNAP-β using sites that are conserved in all bacteria (Figure 6). We compared the gene tree of RNAP-β to a consensus tree of bacteria calculated previously using 27 vertically inherited genes (Moody, et al. 2022). Our phylogenetic analysis shows that the tree of RNAP-β follows the tree of bacteria (Figures 6 and S4).
Type 4 insertions in Terrabacteria may be convergent with type 4 insertions in Gracilicutes. Two scenarios could explain the distributions of type 4 RNAP-β insertions in two Terrabacteria groups, Armatimonadetes and DST. Type 4 RNAP-β insertions could arise from HGT from Gracilicutes or from vertical inheritance with convergence of the insertion site. In the case of Armatimonadetes, the first scenario (HGT from Gracilicutes) is not supported by our gene tree of RNAP-β (Figure 6 and Figure S4). Our gene tree is consistent with bacterial phylogenies that have been calculated using distinct gene markers (Coleman, et al. 2021; Megrian, et al. 2022; Witwinowski, et al. 2022). Thus, RNAP-β appears to have been vertically inherited in Armatimonadetes, and the location of its BEAN insertions appears to be convergent with type 4 RNAP-β from Gracilicutes. The location of BEAN insertions in RNAP-β from the DST group could also be convergent. However, in our gene tree, the DST appear as a sister lineage of Gracilicutes. Because RNAP-β from DST does not group within any of the Gracilicutes lineages, but appears as a separate sister lineage, the topology of our tree does not allow us to rule out or support HGT of type 4 RNAP-β.

Discussion
The data presented here are consistent with a model in which RNAP was subject to a discrete episode of aggressive domain insertion, around or after the last bacterial common ancestor, followed by a precipitous decline in the frequency of insertion (Figure 7). RNAP is a multi-subunit protein complex that contains RNAP-β and RNAP-β' subunits. RNAP-β and RNAP-β' are found in RNAPs in archaea, bacteria, eukarya, and nucleocytoplasmic large DNA viruses (Iyer, et al. 2001). Here we report that RNAP-β and RNAP-β' each contain homologous insertional domains with idiosyncratic positions that generate block structures of RNAP-β and RNAP-β' MSAs.
Block structures of MSAs are not exclusive to RNAP-β and RNAP-β' and have been described for universal components of the translation system [ribosomal proteins (Vishwanath, et al. 2004), and aminoacyl tRNA synthetases (Alvarez-Carreño, et al. 2023)]. But block differences in the translational system are observed between archaea and bacteria whereas here, in RNAP, they are observed within archaeal and bacterial domains. These insertional domains are informative about evolutionary history and pose important questions about evolutionary mechanisms. Acquisition of BEAN domains may have been independent of acquisition of HABAS domains. We observe a mismatch in the distributions of RNAP-β and RNAP-β' types (Figures 5 and 7).

BEAN insertions and RNAP-β evolution
The distribution of RNAP-β types suggests that BEAN domains were inserted in the ancestors of three early branching bacterial lineages (Figure 7). These lineages are: i) the ancestor of Firmicutes, which acquired type 3 insertions; ii) the ancestor of Chloroflexi and CPR bacteria, which acquired type 2 insertions; and iii) the ancestor of all Gracilicutes, which acquired type 4 insertions. Armatimonadetes and DST also have type 4 insertions.

HABAS insertions and RNAP-β' evolution
HABAS insertions in bacterial RNAP-β' appear to have occurred in three ancestral populations: i) the ancestor of Armatimonadetes, Actinobacteria, Chloroflexi and CPR bacteria; ii) an ancestor within the CMS group; and iii) the ancestor of Gracilicutes. Extensive insertional diversity within the DST group suggests that these insertions occurred very early in bacterial evolution. The lack of insertional diversity in RNAP in late divergent groups suggests cessation of insertions later in bacterial evolution. The genes for RNAP-β and RNAP-β' recorded and preserved the marks of evolutionary events that affected ancestral groups.

Insertional domains in the evolution of translation and transcription
In most archaea, uL10 is in the neighbourhood of the genes that encode for NusG, uL1 and uL11 (Table S3).
Transcription and translation are two of the central biological processes responsible for the encoding and synthesis of proteins. The patterns of insertion of HABAS and BEAN domains in universal and ancient proteins pose provocative questions regarding the timing and order of events during the early evolution of life. The mechanism of insertion remains unclear.
The combined data suggest that the bulk of the acquisition of BEAN domains in RNAP-β and archaeal uL10 and HABAS domains in RNAP-β' occurred in ancestral lineages, shortly after LUCA, and that the descendants generally retained these insertions. We speculate that BEAN and HABAS insertions could be influenced by the genomic context. The slight differences in the locations of BEAN and HABAS insertions in RNAP-β and RNAP-β' may reflect distinct bacterial lineages with distinct gene locations. Thus, in our model, b/lineage-specific insertions occurred in the deep evolutionary past, just after an early divergence of the Last Bacterial Common Ancestor into distinct bacterial groups. The patterns that we observe left a mark on some of the first ancestral bacterial groups, and hint at an early diversification of Terrabacteria, particularly of the DST group.

Methods
Identification of RNAP-β and β' subunits in bacteria and archaea
The sequences of RNAP subunits β and β' from Sulfolobus acidocaldarius (UniProt IDs:

Identification of HABAS and BEAN domain homologs
The a-, b- and b/lineage-specific insertions were trimmed according to the blocks in the MSAs. Profiles were calculated for each trimmed MSA with hhmake from the hh-suite (Steinegger, Meier, et al. 2019), treating columns with fewer than 50% gaps as match states.
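The 50% gap rule for match states can be illustrated on a toy alignment. The following is a minimal sketch of the column-selection rule only. It does not reproduce hhmake or its profile calculation, and the function name and example alignment are hypothetical.

```python
def match_state_columns(msa, max_gap_fraction=0.5, gap_chars="-."):
    """Return 0-based indices of columns treated as match states.

    msa: list of equal-length aligned sequences (strings).
    A column is a match state when its gap fraction is strictly
    below max_gap_fraction (here: fewer than 50% gaps).
    """
    if not msa:
        return []
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    columns = []
    for i in range(length):
        gaps = sum(seq[i] in gap_chars for seq in msa)
        if gaps / len(msa) < max_gap_fraction:
            columns.append(i)
    return columns


# Toy alignment, for illustration only
toy_msa = [
    "MK-LV",
    "MKA-V",
    "M--LV",
    "MK--V",
]
print(match_state_columns(toy_msa))
```

In the toy alignment, the third column (75% gaps) and the fourth column (exactly 50% gaps) are excluded, because the rule requires fewer than 50% gaps.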

Figure 1. Multi-domain organizations of RNAP-β. (a) Multi-domain organization diagrams of RNAP-β in archaea and bacteria. First row: domains of archaeal RNAP-β. Second row: universally shared domains between archaeal and bacterial orthologs and universal sequence motifs described in (Sweetser, et al. 1987). Third row: domains of bacterial type 1 RNAP-β. Fourth row: location of bacterial type 2 insertions. Fifth row: location of bacterial type 3 insertions. Sixth row: location of bacterial type 4 insertions. (b to e) Three-dimensional structure of the bacterial RNAP-β. Domains are coloured as in (a). (b) type 1 (AlphaFold DB: AF-A2BT61-F1), (c) type 2 (AlphaFold DB: AF-A9B6J3-F1), (d) type 3 (AlphaFold DB: AF-Q8ETY8-F1), and (e) type 4 (PDB: 4IGC, chain C). (f to i) Bacterial RNAP-β coloured by residue conservation. The conservation score ranges from 0 (not conserved) to 1 (highly conserved). Valdar01 score was calculated on the multiple sequence alignment of sequence representatives for each type using Scorecons. For clarity, N- and C-terminal residues that extend beyond the shared core of bacteria were masked.

Figure 3. The BEAN domain. (a) BEAN domain sequences clustered by similarity at a P-value threshold of 1×10⁻¹¹

The first bracket is formed by α1⊥β2⊥β3 and the second bracket by β3⊥β4⊥α5. The BEAN domain maps to CATH superfamily 3.90.105.10. We have identified BEAN domains in bacterial and archaeal proteins other than RNAP-β (Figure 3a to 3d). Using sequence and structure similarities, we find the BEAN domain in bacterial RNAP-β'; in the archaeal version, but not the bacterial version, of ribosomal protein uL10; in molybdenum cofactor biosynthesis protein MoeA; in ornithine/lysine/arginine decarboxylases; in a putative ferredoxin; and in one uncharacterized protein.

A given bacterial lineage tends to have a single type of RNAP-β. Bacteria in the CMS group (Cyanobacteria and related bacteria) contain type 1 RNAP-β; Armatimonadetes contain type 4* RNAP-β; Actinobacteria contain type 1 RNAP-β; and Chloroflexi contain type 2 RNAP-β. However, type 1 is observed interspersed among other types of RNAP-β in some bacterial lineages. Most bacteria in the DST group contain type 4 RNAP-β, but some contain type 1. Firmicutes generally contain type 3 RNAP-β, but some contain type 1. Bacteria in the CPR group (Candidate Phyla Radiation) contain either type 2 or type 1 RNAP-β.

Figure 5. Phylogenetic distribution of the different multi-domain organizations of RNAP-β and RNAP-β' in bacteria. Domain organization types are shown in different colors. The tree of Bacteria was adapted from (Moody, et al. 2022). (a) Distribution of RNAP-β types in bacteria. Phylogenetic groups with a scattered distribution of type 1 RNAP-β are indicated by a darker outline. (b) Distribution of RNAP-β' types in bacteria. CMS: Cyanobacteria, Margulisbacteria, Melainabacteria; CPR: Candidate Phyla Radiation; DST: Deinococcus-Thermus, Synergistes, Thermotogae, Bipolaricaulota, Caldiserica, Coprothermobacterota; FCB:

The correspondence between the RNAP-β gene tree and the tree of bacteria (Figures 6 and S4) suggests vertical inheritance of the RNAP-β gene in DST, Firmicutes and CPR. This correspondence further suggests that type 1 in DST, Firmicutes and CPR evolved by reduction through loss of b/lineage-specific BEAN insertions.

Figure 6. Maximum likelihood tree of RNAP-β. The ML tree (inferred with model Q.yeast+G+I) of RNAP-β was calculated using positions conserved in all bacteria.

Figure 7. RNAP-β and RNAP-β' domain insertions mapped onto a schematic representation of the tree of bacteria. The tree reproduces the topology from Figures 5a and b; branch lengths have been altered. External nodes correspond to 14 bacterial lineages with 13 distinct combinations of RNAP-β and RNAP-β' types.
Our sequence similarity searches indicate that HABAS and BEAN are inserted in multiple unrelated proteins. We identify a b-specific BEAN insertion in RNAP-β' and b/lineage-specific BEAN insertions in bacterial RNAP-β. We observe BEAN insertions in the archaeal version of uL10. We identify a universal HABAS and a b-specific HABAS in RNAP-β and b/lineage-specific HABAS insertions in RNAP-β'. Finally, we observe HABAS insertions in NusG, the only universally conserved transcription elongation factor (Werner and Grohmann 2011). HABAS insertions in NusG were identified only in bacteria from the DST group: Fervidobacterium islandicum, Petrotoga olearia, Kosmotoga olearia and Candidatus Bipolaricaulis anaerobius (Table S1). The observation of a BEAN domain in the archaeal but not in the bacterial version of universal ribosomal protein L10 suggests some frequency of insertion before the last universal common ancestor (LUCA). The local gene neighbourhood may have influenced the acquisition of BEAN and HABAS domains. The genes for RNAP-β and -β' are adjacent to each other in the genomes of virtually all bacteria and most are in the neighbourhood of the genes that encode for NusG and universal ribosomal proteins uL1 and uL11 (Tables S1-S2).

The sequences of RNAP subunits β and β' from Sulfolobus acidocaldarius (UniProt IDs: P11513 and P11512) and Bacillus subtilis (UniProt IDs: P37870 and P37871) were searched in a set of archaeal and bacterial proteomes derived from (Moody, et al. 2022) using phmmer from the HMMER3 suite (Eddy 2011). Sequences above threshold (E-value < 1×10⁻¹⁰) were retrieved and aligned. Multiple sequence alignments (MSAs) were generated with the einsi option from MAFFT v7 (Katoh and Standley 2013).

Domain annotation of RNAP-β and β' and classification
The MSAs of RNAP subunits β and β' were each converted into a sequence profile and compared to CATH_S40, ECOD_F70 and SCOPe95 with HHsearch (Steinegger, Meier, et al. 2019) on the MPI Bioinformatics Toolkit (Zimmermann, et al. 2018). CATH_S40 contains CATH domains clustered at 40% sequence identity; ECOD_F70 contains ECOD domains clustered at 70% sequence identity; and SCOPe95 contains domain sequences clustered at 95% sequence identity. The block patterns on the MSAs were used as reference to classify the multi-domain organization types of bacterial RNAP-β and RNAP-β' proteins (Figure S1). For each RNAP-β and RNAP-β' type, a representative was selected for structure analysis. All representatives have an experimentally determined structure in the PDB (Berman, et al. 2000) or a predicted structure in AlphaFold DB (Varadi, et al. 2022). Per-residue confidence score (pLDDT) and predicted aligned error (PAE) of the structure predictions are shown in Figures S6-S16. Sequences with unique insertion patterns were annotated individually, and the annotations were inspected over structure predictions generated with AlphaFold version 2.0 (Jumper, et al. 2021).
2013); and converted to HMM profiles with HMMER version 3.3.2 (Eddy 2011). The HMM profiles were searched with phmmer in the same set of archaeal and bacterial proteomes (Moody, et al. 2022) that was used for RNAP-β and RNAP-β' identification. Structure-based MSAs of HABAS and BEAN domains were calculated with MATRAS (Kawabata 2003).

Table 1. Multi-domain architecture of RNAP β subunit in representatives from Archaea and Bacteria

Table 2. Multi-domain architecture of RNAP β' subunit in representatives from Archaea and Bacteria