Title: Finding priority ribosomes for future structural and pathogen research based upon global RNA and protein sequence analysis

Ribosome-targeting antibiotics comprise over half of antibiotics used in medicine, but our fundamental knowledge of their binding sites is derived primarily from ribosome structures from non-pathogenic species. These include Thermus thermophilus, Deinococcus radiodurans and Haloarcula marismortui, as well as commensal and pathogenic Escherichia coli. Advancements in electron cryomicroscopy have allowed for the determination of more ribosome structures from pathogenic bacteria, with each study highlighting species-specific differences that had not been observed in the non-pathogenic structures. These observed differences suggest that more novel ribosome structures, particularly from pathogens, are required for a more accurate understanding of the diversity of the bacterial ribosome as a whole, potentially leading to advancements in antibiotic research. In this study, high accuracy covariance and hidden Markov models were used to annotate ribosomal RNA and protein sequences, respectively, from genomic sequence, allowing us to determine the underlying ribosomal sequence diversity using phylogenetic methods. This analysis provided evidence that the current non-pathogenic ribosome structures are not sufficient representatives of some pathogenic bacteria, such as Campylobacter pylori, or of whole phyla such as Bacteroidetes.

We further note that the specific target sites of antibiotics correspond to highly conserved, functional ribosome components, and that a survey focusing on these sites will not yield significant diversity. However, we also note that ribosome function has been found to be influenced by less well conserved components of structure (2,22), and even by elements of structure quite distant from conserved antibiotic binding sites (24). Therefore, the aim of this study is to broadly survey ribosome sequence space to determine how representative current ribosome structures are, ranging from the extremophiles (e.g. D. radiodurans, T. thermophilus) to human pathogenic species (e.g. S. aureus, L. monocytogenes), when compared to all bacterial species. Using this analysis, we propose new representative bacteria, including some surprising and diverse species, that could be prioritised for future ribosomal structural studies and have the potential to yield valuable new structural information.

Materials and Methods
Our analysis includes the full-length conserved components of the bacterial ribosome. We have elected not to specifically analyse antibiotic targets (e.g. Spectinomycin targets G1064 and C1192 of 16S rRNA (25)) as these are highly conserved functional components of the ribosome, which therefore carry a limited number of phylogenetically informative characters (26). Full-length sequences also allow a global overview of ribosome variation, some of which may change the behaviour of binding sites due to "action at a distance" (2,22,24). A total of 3,758 bacterial genomes and the H. marismortui genome (Accession: AY596297.1) were obtained from the European Nucleotide Archive (27,28). One representative sequence was retained per species for each rRNA and protein, and the dataset was filtered to only include genomes with annotations covering 80% of the expected sequence length based upon consensus sequences. If paralogues were present, the sequence with the highest corresponding covariance model or hidden Markov model bit score for the species was used (29). Both 16S and 23S rRNA were annotated using barrnap v0.9 (https://github.com/tseemann/barrnap), and INFERNAL 1.1.2 was used to create sequence alignments using Rfam v14.3 covariance models (29,30). Covariance models are the gold standard for structural RNA alignment (29,31), and the Rfam models are derived from the CRW database (32), which is the product of decades of manual alignment curation, resulting in some of the highest quality ribosomal RNA structural alignments available to date (32,33). Ribosomal protein sequences from 32 universally conserved proteins were identified from six-frame translations of whole genomes and open-reading frame predictions, to account for potentially inconsistent or absent genome annotations. The selected sequences were aligned using HMMER v3.1 (hmmer.org), with protein hidden Markov models from Pfam v33.1 (34).
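The length filter and paralogue selection described above can be illustrated with a minimal sketch. It assumes that annotation hits have already been parsed into simple records; the `Hit` class, the `select_representatives` function, the toy species and the consensus length are all hypothetical and are not taken from the study's scripts. The sketch also applies the 80% coverage check to each sequence individually, which is an assumption about how the genome-level filter was implemented.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One annotated rRNA or protein sequence (hypothetical record layout)."""
    species: str
    gene: str
    length: int       # length of the annotated sequence
    bit_score: float  # covariance model or HMM bit score


def select_representatives(hits, consensus_length, min_coverage=0.8):
    """Keep one hit per (species, gene).

    Hits shorter than min_coverage of the consensus length are dropped,
    and among paralogues the hit with the highest bit score is kept.
    """
    best = {}
    for hit in hits:
        if hit.length < min_coverage * consensus_length[hit.gene]:
            continue  # annotation covers too little of the expected length
        key = (hit.species, hit.gene)
        if key not in best or hit.bit_score > best[key].bit_score:
            best[key] = hit
    return best


# Toy data: all names and numbers are made up for illustration.
hits = [
    Hit("Species A", "16S", 1500, 1450.0),
    Hit("Species A", "16S", 1520, 1480.5),  # paralogue with a higher score
    Hit("Species B", "16S", 900, 700.0),    # below 80% of the consensus
]
consensus_length = {"16S": 1533}
chosen = select_representatives(hits, consensus_length)
```

Keying on the (species, gene) pair means that a single pass over the hits is enough to apply both the coverage filter and the best-score rule.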
Profile HMMs are the gold standard for protein homology search and alignment (35,36), and Pfam is a widely used and highly curated database of protein domain alignments. The advantages of this approach are that our results are up-to-date and consistent with the current genome databases, contain both RNA and protein components of the bacterial ribosome and utilise the most accurate methods for alignment production. Whilst we acknowledge that there are existing ribosomal datasets (e.g. SILVA, GreenGenes, RDB, etc.), these do not meet all the above requirements. Although there are alternative ribosomal RNA and protein alignments we could have used (e.g. (26,37,38)), which would presumably produce similar results, the Rfam and Pfam alignments have been continuously curated for more than 20 years and are shipped with statistical representations for effective homology searches (covariance and hidden Markov models). Phylogenetic trees for each alignment were generated using the maximum likelihood method from PHYLIP v3.697, with distance matrices of the pairwise distances between species computed in R v4.2.1 using ape v5.6 (39,40). The distance matrices for each ribosomal gene were summed to create a single unified distance matrix, and this was used for multi-dimensional scaling (MDS) and visualisation (Fig. 1).
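The summation of per-gene distance matrices can be sketched as follows. This is an illustration under stated assumptions rather than the study's own code, which used R and ape: the function name `sum_distance_matrices`, the nested-dictionary layout and the toy values are hypothetical, and restricting the sum to species present in every matrix is an assumption about how missing genes were handled.

```python
def sum_distance_matrices(matrices):
    """Sum per-gene pairwise distance matrices into one combined matrix.

    Each matrix is a dict of dicts, matrix[a][b] = distance between
    species a and b. Only species present in every matrix are retained
    (an assumption made for this sketch).
    """
    shared = set.intersection(*(set(m) for m in matrices))
    species = sorted(shared)
    combined = {a: {b: 0.0 for b in species} for a in species}
    for matrix in matrices:
        for a in species:
            for b in species:
                combined[a][b] += matrix[a][b]
    return combined


# Toy matrices for two genes; species D lacks a 16S sequence here.
gene_16s = {
    "A": {"A": 0.0, "B": 0.1, "C": 0.4},
    "B": {"A": 0.1, "B": 0.0, "C": 0.2},
    "C": {"A": 0.4, "B": 0.2, "C": 0.0},
}
gene_ul2 = {
    "A": {"A": 0.0, "B": 0.2, "C": 0.5, "D": 0.9},
    "B": {"A": 0.2, "B": 0.0, "C": 0.3, "D": 0.8},
    "C": {"A": 0.5, "B": 0.3, "C": 0.0, "D": 0.7},
    "D": {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.0},
}
combined = sum_distance_matrices([gene_16s, gene_ul2])
```

Because the matrices are summed rather than averaged, every gene contributes in proportion to its own branch lengths, so faster-evolving genes carry more weight in the combined distance.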

Results
Ribosomal sequence clustering suggests that several bacterial phyla are poorly represented by the current structures.
To identify phyla that are poorly represented by solved ribosome structures, multidimensional scaling (MDS) was used to reduce the phylogenetic trees derived from two rRNAs and 26 ribosomal proteins to the dimensions that capture the greatest dissimilarity between species. As assessing individual ribosomal sequences is unlikely to reflect structural variation, we combined the 28 phylogenetic trees, one for each conserved rRNA and protein, into one distance matrix prior to MDS (38). Only ribosomal genes that are conserved across the majority of the species in our dataset were used. Therefore, ribosomal proteins uS2, uS4, uL15, uL23, uL24 and uL30 were removed as there were no homologous sequences within Epsilonproteobacteria.
The resulting MDS plot shows four main clusters: two illustrate the non-monophyletic nature of Proteobacteria (38), one is dominated by Actinobacteria and the fourth consists of the remaining bacterial phyla (Fig. 1). The Proteobacteria clusters follow observations made in previous studies, such as Beta- and Gammaproteobacteria being closely related, and Deltaproteobacteria and Oligoflexia being more closely related to non-Proteobacteria (38,41). Due to these non-monophyletic properties, Proteobacteria classes will be treated as individual phyla for the rest of this study (38,41). Bacteroidetes, Acidithiobacillia and Epsilonproteobacteria formed slightly isolated groups away from the larger clusters, suggesting that these could become more defined clusters if their sample sizes were larger (Fig. 1). Alphaproteobacteria is the only cluster without a solved structure, implying the presence of phylum-specific variation that has not been captured by current structures (Fig. 1). Therefore, it is unlikely that all the bacterial phyla are well represented by the available ribosome structures, given that the solved structures in the multiple phyla cluster group together, instead of being evenly distributed throughout the clusters (Fig. 1).
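For readers unfamiliar with MDS, the following sketch shows how a distance matrix can be reduced to two plotting coordinates. It implements classical (Torgerson) MDS with power iteration in plain Python; whether the study used classical or another MDS variant is not stated here, so this choice is an assumption, and the function name `classical_mds` and the toy points are hypothetical.

```python
import math


def classical_mds(dist, dims=2, iterations=500):
    """Classical MDS of a symmetric distance matrix (list of lists).

    Returns one coordinate list of length dims per item. The leading
    eigenvectors are found by power iteration with deflation, which is
    adequate for a small illustration but slow for large matrices.
    """
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J (D is assumed symmetric).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        v = [float(i + 1) for i in range(n)]  # deterministic start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # no variation left to extract
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][d] = v[i] * scale
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


# Toy example: four points on a 4 x 1 rectangle.
points = [(0.0, 0.0), (4.0, 0.0), (0.0, 1.0), (4.0, 1.0)]
dist = [[math.dist(p, q) for q in points] for p in points]
embedding = classical_mds(dist)
```

In the toy example the input distances are exactly Euclidean, so the two recovered dimensions reproduce every pairwise distance; for summed phylogenetic distances the first two dimensions would only approximate the full matrix.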

Evaluation of current structures indicates that Bacillus subtilis is the most representative.
To evaluate whether current ribosome structures from non-pathogens are sufficiently representative of the available sequenced bacterial ribosomes, phylogenetic distances were calculated from the summed distance matrix between 13 published solved structures (Supplementary Table 2) and 1,385 bacterial ribosomal sequences available in this study (11-16, 18, 20-22). The solved structure with the lowest recorded distance for each species is considered to be the most representative, on the assumption that species with similar primary sequences will form similar tertiary structures. As the minimum distance to the nearest solved structure increases, it becomes more likely that current structures are not suitably representative. These underrepresented species should be prioritised for future ribosome structural studies. Overall, B. subtilis was considered to be the most representative structure for 382 species, followed by A. baumannii for 264 species and M. smegmatis for 167 species (Fig. 1). D. radiodurans, T. thermophilus and, unsurprisingly, the archaeon H. marismortui were the three least representative structures analysed and were not representative of any pathogenic species (Fig. 1), implying that the structures from these three species are becoming less representative with the introduction of ribosome structures from less divergent bacteria (11). M. tuberculosis was also representative for a small number of species, which is likely due to the presence of M. smegmatis: these two bacteria are closely related, so they effectively split the Actinobacteria between them (Fig. 1).
This observation does not imply that one structure is necessarily sufficient to represent a complete phylum or class, with E. coli and P. aeruginosa representing 149 and 110 species respectively, accounting for most Gammaproteobacteria (Fig. 1). Whilst these structures are very similar, they do capture some important differences (21). Each solved structure tends to be representative solely of the phylum it originated from, with M. smegmatis as the most representative of Actinobacteria, and both E. coli and P. aeruginosa representing Gammaproteobacteria exclusively. However, B. subtilis and A. baumannii were not representative of only their respective phyla, as the Gammaproteobacterium A. baumannii is representative of both Alpha- and Betaproteobacteria and the Bacillota/Firmicutes B. subtilis was the most representative for the remaining phyla without a ribosome structure (Table 1). Therefore, we hypothesise that having at least one representative structure per phylum, or preferably one per class, would allow the majority of bacteria to be sufficiently represented.
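The assignment of each species to its nearest solved structure, and the per-structure counts reported in this section, can be sketched as below. The function names, the dictionary layout and all distances are hypothetical placeholders chosen for illustration; they are not values from the study.

```python
def nearest_solved_structure(distances, solved):
    """Map each species without a structure to its nearest solved structure.

    distances[a][b] is the combined phylogenetic distance between a and b.
    solved is a list of species with a solved ribosome structure.
    Returns {species: (nearest solved species, minimum distance)}.
    """
    nearest = {}
    for species, row in distances.items():
        if species in solved:
            continue  # solved structures represent themselves
        best = min(solved, key=lambda s: row[s])
        nearest[species] = (best, row[best])
    return nearest


def count_represented(nearest):
    """Count how many species each solved structure is nearest to."""
    counts = {}
    for best, _ in nearest.values():
        counts[best] = counts.get(best, 0) + 1
    return counts


# Toy distances: placeholder names and made-up values.
solved = ["solved_X", "solved_Y"]
distances = {
    "solved_Y": {"solved_X": 10.0, "solved_Y": 0.0},
    "sp1": {"solved_X": 5.0, "solved_Y": 9.0},
    "sp2": {"solved_X": 12.0, "solved_Y": 3.0},
    "sp3": {"solved_X": 7.0, "solved_Y": 8.0},
}
nearest = nearest_solved_structure(distances, solved)
counts = count_represented(nearest)
```

The minimum distance stored for each species is the quantity that is later compared against a threshold: the larger it is, the less likely any current structure is a suitable representative.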

Fig. 1.
A multidimensional scaling (MDS) plot for combined phylogenetic distances from 16S, 23S rRNA and 26 universally conserved ribosomal proteins (N=1,396). All Proteobacteria are coloured by class, and other included phyla are coloured individually to highlight taxonomic relationships with clusters. Species with a solved ribosome structure (N=13), or known pathogenic bacteria without a structure (N=38), have been highlighted with a black dot. The number of bacteria for which each solved structure is the most representative, based on the minimum phylogenetic distance, has been labelled along with the species' names. The full list of species, MDS coordinates, the solved structures that were considered to be most representative and the corresponding minimum distance are available on GitHub.

Introducing new ribosome structures shows that an Epsilonproteobacteria representative, such as Campylobacter jejuni, should be prioritised for structural work.
To simulate the effect of having at least one representative ribosome structure per phylum, we selected one proposed structure per phylum, choosing the species with the smallest average phylogenetic distance to all other members of that phylum. Only species with a minimum distance to a solved structure greater than 12.86 were considered, which is above the lower quartile of all the minimum distances recorded, as these species are most likely to be poorly represented by current structures. The available pathogens with a minimum distance above this threshold were prioritised due to their relevance in health research. The introduction of new proposed structures resulted in a decrease in the average minimum distance to a solved structure across each phylum, implying that having at least one structure per phylum is a suitable sampling strategy for improving representation (Table 1). Of the ten highest priority structures, only Epsilonproteobacteria, Chlamydiae and Chlorobi had an average minimum distance per phylum below the lower quartile threshold after the proposed structures were introduced (Table 1). This suggests that the other phyla listed are more diverse and may require additional structures to capture the remaining variation, or that a non-pathogenic ribosome would be a more suitable representative. Campylobacter jejuni was identified as the highest priority structure (Table 1), as it was the most representative Epsilonproteobacteria pathogen, had the largest average minimum distance observed across all phyla and is a WHO priority pathogen (1). However, the selected ribosome for solving does not specifically need to be from a pathogenic strain, with either an attenuated lab strain (20,22) or another species from the same genus (15,18) being appropriate alternatives (Fig. 1).
* The first value is the average minimum distance prior to the proposed structures being selected, and the value in brackets is the recalculated average distance after the proposed structures have been incorporated.
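The selection of one proposed structure per phylum can be sketched as follows. The threshold of 12.86 is the value quoted in the text; everything else, including the function name `propose_representatives`, the input layout and the toy values, is a hypothetical illustration. The additional prioritisation of pathogens described above is deliberately left out of this sketch.

```python
from statistics import mean


def propose_representatives(distances, phylum_of, min_dist_to_solved,
                            threshold=12.86):
    """Propose one candidate structure per phylum.

    Candidates are species whose minimum distance to a solved structure
    exceeds the threshold. Within each phylum, the candidate with the
    smallest average distance to the other members is proposed.
    """
    members = {}
    for species, phylum in phylum_of.items():
        members.setdefault(phylum, []).append(species)

    proposals = {}
    for phylum, group in members.items():
        candidates = [s for s in group if min_dist_to_solved[s] > threshold]
        if not candidates or len(group) < 2:
            continue  # phylum is already well represented, or too small

        def average_distance(s):
            return mean(distances[s][t] for t in group if t != s)

        proposals[phylum] = min(candidates, key=average_distance)
    return proposals


# Toy data: placeholder phyla, species and distances.
phylum_of = {"a": "X", "b": "X", "c": "X", "d": "Y", "e": "Y"}
min_dist_to_solved = {"a": 20.0, "b": 15.0, "c": 5.0, "d": 2.0, "e": 3.0}
distances = {
    "a": {"b": 3.0, "c": 4.0},
    "b": {"a": 3.0, "c": 10.0},
    "c": {"a": 4.0, "b": 10.0},
    "d": {"e": 1.0},
    "e": {"d": 1.0},
}
proposals = propose_representatives(distances, phylum_of, min_dist_to_solved)
```

In the toy data, phylum Y yields no proposal because both of its members already lie within the threshold distance of a solved structure.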

Discussion
Our results reveal significant phylogenetic gaps in the existing cohort of solved ribosome structures. Filling these gaps will help identify features specific to isolated bacterial clades that currently lack representative ribosome structures. This is important for the analysis of ribosome evolution, pathogen research, and potentially for antibiotic development.
There are two limitations with this approach to ribosome structure prioritisation. First, there is a bias towards species with available genomes, which are enriched in pathogenic species rather than being the most evolutionarily representative (42). Second, our approach prioritises the most novel ribosomes within each major bacterial clade, with the risk of over-emphasising the most diverse ribosomes. This is somewhat mitigated by selecting the most closely related pathogenic species as representative targets.
While having at least one representative ribosome structure per phylum is expected to capture the bulk of structural variation (Table 1), there is no guarantee that primary sequence differences result in significant changes at the tertiary level. A prominent example of this is H. marismortui which, although it is an archaeal ribosome structure, has been used as an alternative to bacterial counterparts (11), despite having no close relatives at the primary sequence level (Fig. 1). Another consideration is that the structural variation observed may be species-specific, rather than shared across a phylum, with species-specific differences having been observed between P. aeruginosa and A. baumannii, yet no phylum-specific differences when compared to E. coli and other available structures (16,21). However, these three structures were observed to be relatively divergent from each other based upon the MDS analysis (Fig. 1), which reinforces the in vivo observation that P. aeruginosa and A. baumannii ribosomes are more similar to each other than to E. coli, suggesting that phylogenetic analyses have the potential to reflect structural variation (16).
Our list of high-priority bacterial ribosomes reveals that a number of bacterial clades are poorly represented by the existing solved structures (Table 1, Supplementary Table 1). Working down this list gives an opportunity to identify the most variant ribosomes based upon sequence analysis, which in many instances will reflect diverse ribosome structures. These will allow the development of more specific classes of antibiotics, as well as provide further information regarding the specifics of ribosome structure evolution.

Data Availability
All data described were last accessed in September 2022. Custom scripts, details of the parameters and dependencies used, the accessions for all downloaded data, and the resulting curated alignments and trees are available on GitHub: https://github.com/helena-bethany/ribosomal-variation