The phylogenomic landscape of the genus Serratia

The genus Serratia has been studied for over a century and includes clinically-important and diverse environmental members. Despite this, there is a paucity of genomic information across the genus and a robust whole genome-based phylogenetic framework is lacking. Here, we have assembled and analysed a representative set of 664 genomes from across the genus, including 215 historic isolates originally used in defining the genus. Phylogenomic analysis of the genus reveals a clearly-defined population structure which displays deep divisions and aligns with ecological niche, as well as striking congruence between historical biochemical phenotyping data and contemporary genomics data. We show that Serratia is a diverse genus which displays striking plasticity and ability to adapt to its environment, including a highly-varied portfolio of plasmids, and provide evidence of different patterns of gene flow across the genus. This work provides an essential platform for understanding the emergence of clinical and other lineages of Serratia.

phenotypic and DNA-DNA hybridisation data with contemporary genomics data. In this study, we bring 4 together all of the previous molecular and biochemical knowledge from this important genus and place it in a genomics framework. This not only explains why previous definitions of this genus were robust, To understand how this GC pattern impacts on protein coding, we investigated the variation in GC 137 content over the three codon positions for all lineages using the genus-core set of 1655 genes ( Fig. 4c; 138 Fig. 2b), and separately for all other genes, designated "non-core". We observed no obvious difference 139 in GC content between genes that were core or non-core at all three codon positions, termed as GC1-3 140 (Fig. 4c). The GC content at GC2 is essentially fixed across the genus (Fig. 4c), whilst GC1 shows a 141 slight skew across the genus, varying by approximately 1%. Codon position GC3 showed a clear bias 142 for A/T-ending codons in low GC species and G/C-ending codons in high GC species, as expected 41 143 (Fig. 4d). Hence, the difference in average G/C across the genus is largely explained by variation in codon position GC3. For example, GC3 in S. grimesii is 20% lower than in S. marcescens L9 (Fig. 4c).

145
Taken together, the variations in metabolic capability and GC content between both species and niche-146 adapted lineages are indicative of long-term niche adaptation within evolutionary timescales.

149
The results so far suggest that the pan-genome of Serratia lineages is phylogenetically constrained, yet 150 members of Enterobacteriaceae are known to have a highly plastic gene content through horizontal 7 gene transfer (HGT). To investigate this further, we sought to understand genus-wide species plasticity. Plasticity can be estimated by comparing the pan-genome size and complexity against the size of the genomes (Fig. 2b). Genes were defined as core to a lineage if a gene was present in at least 95 percent of the genomes in each lineage, and the union of all lineage-core genes was defined as the genus-core, 158 consisting of 1655 genes (Fig. 2a,b). This analysis showed lineage-and species-level core gene gain 159 and loss, which are markedly larger in terms of the number of genes when looking at lineages that have   Table 2). In contrast, S. ficaria, which has similar internal branch lengths,

168
has a more open pan-genome curve (Fig. 2c), and an accessory genome of 3490 genes, despite being 169 represented by fewer genomes in the analysis (Supplementary Table 2

171
Evidence of core gene gain and loss possibly reflective of speciation or niche adaptation can be seen 172 when examining this data. For example, 99 genes are found core to all three lineages in S. entomophila, 173 and 41 genes are found core to the entire S. liquefaciens complex, which comprises S. liquefaciens, 174 grimesii, proteamaculans and quinivorans (Fig. 2b). Within the pan-genome we identified lineage-and 175 species-exclusive gene sets, as well as those whose genes are also present at intermediate or rare 176 frequencies across the genus (Fig. 2b). For example, of the 99 S. entomophila species-core genes, 35 genes were found across the rest of the genus ( Supplementary Fig. 5), shared between both high and low GC members. In contrast, in the 41 genes core to the S. liquefaciens complex, very few are found outside the complex, and where they are, they are predominantly present in S. ficaria and S. plymuthica 180 ( Supplementary Fig. 6). The sharing of genes across the genus, implying potential gene flow, raises 181 questions about whether GC3 has been ameliorated to reflect the GC3 trend in a potential recipient 182 genome. Of the 35 genes from the high GC species S. entomophila that are found across the low GC 183 species S. liquefaciens complex, S. plymuthica and S. fonticola, the GC3 values of these genes are lower genes which are also found in S. ficaria and S. plymuthica, both species with higher GC than members 8 In an attempt to understand the mechanisms by which genes are gained and lost, we focused initially 188 on S. marcescens. We investigated the genetic context of the metabolic gene loci associated with different biotypes. In doing so, we identified a hypervariable locus analogous to the plasticity zone seen the other encoding tRNA-Sertga and tRNA-Thrtgt. It encoded the genes required for gentisate degradation 193 (nag gene cassette) and/or tetrathionate reduction (ttr gene cassette), present in the same order and 194 orientation across the species, located alongside three sets of genes that were variably present across 195 the S. marcescens phylogeny (Fig. 5). These three sets comprised: (1) four genes including one encoding     Table 3). The 218 collection of identified plasmids displays a wide range of sizes (~1-310 kb) and GC content (~30-66%), indicating diversity. However, the distribution of these traits varied amongst Serratia species L9 and L12, which are the 'clinical' lineages in which 97% and 81% percent of the isolates, plasmids ranges from single genus to multi-phyla, with the most heterogeneous host range profile 228 observed for plasmids found S. marcescens (Fig. 6).

244
Notably, the predicted host range of the plasmids brings an additional perspective on their potential 245 dynamics within the genus. Most plasmids identified in S. marcescens appear to be restricted to this 246 species within Serratia. Yet many of them have a predicted host range that goes beyond the taxonomic 247 rank of family, implying transfer outside the genus, including two clusters of small ColRNAI plasmids 248 predicted to cross multiple phyla (Fig. 6). In contrast, the largest plasmid cluster (pADAP-like),

249
featuring multiple non-marcescens species, seems to be restricted to the Serratia genus. Altogether, this 250 picture may suggest that the ecological niche of S. marcescens has favoured plasmid exchange with 251 diverse hosts outside the genus but has also promoted plasmid containment within the species in  10 pigment prodigiosin 17 . However, in fact, prodigiosin production has only been observed in S. 258 marcescens biogroups A1, A2 and A6, some rubidaea and some plymuthica isolates 21 . The pig gene plymuthica genomes (Figs. 3, 7) which are associated with biotypes or biogroups known to be 263 pigmented. In each case, the cluster presents exactly the same contiguous order of genes (pigA-pigN), 264 however, notably, it is found in separate genomic loci in each of the three different species (Fig. 7).

265
Representative pig gene clusters from each species share ~77-80% identity at the nucleotide level (   to be discovered. This may be partly due to geography and lack of sampling: the strain that occupies 11 L16 (MSU97) was sourced from a plant in the Carrao River in Venezuela 46 , a region which is not highly 293 sampled.
It is interesting to consider how the computational approaches used here to classify and describe the ways an in silico proxy, whilst the connection between in silico prediction of metabolic potential and 298 the lab-based tests detecting the corresponding metabolic pathway in the original biochemical-based 299 biotyping is obvious. Furthermore, in some cases these biotypes highlight further clusters within 300 lineages that match branching within the phylogeny. For example, biotypes C1c, EB and RB, and

304
This accuracy is particularly striking given that we have observed that presence or absence of metabolic 305 pathways (corresponding to the historic biotyping tests used) can be due to repeated gene gain or loss 306 in the same locus over short evolutionary distances. For example, the genes required for the degradation 307 of gentisate and the reduction of tetrathionate are gained and lost within and between lineages in S. 308 marcescens, in the same locus and also in the same conserved order (Fig. 5). This would explain why 309 the original phenogrouped biotypes based on biochemical typing had "variable" results for certain 310 metabolic tests, such as gentisate degradation being observed to be variable in the clinical biotypes A8a 311 and A8c 92 . This locus-specific pathway gain and loss in historic isolates is also seen in more 312 contemporary strains (Fig. 5). The maintenance of this plasticity zone suggests that there are transient 313 and frequently re-occurring environmental selective pressures where the benefit and cost of these 314 pathways is great enough to provide selection both for and against them. In other words, the data suggest 315 that both the loss and re-acquisition of these elements is of benefit to S. marcescens at various times.

316
It is also noteworthy that the environment from which strains were isolated across our assembled dataset 317 tends to match the environments and niches with which each biotype was historically associated 21 . Of biotyping data along with the population structure defined here, the combined data suggest that S. 339 marcescens is highly plastic in its nature yet can also become specialised in a particular niche.

340
Speciation and niche specialisation events or processes are seen across the phylogeny, as highlighted 341 by the long branch lengths between divisions, separations in GC content, variation in metabolic 342 potential, and enrichment for certain isolation source sites in different lineages. These divisions likely 343 represent ancient speciation events that have occurred as Serratia has spread to be ubiquitous 344 worldwide. As mentioned above, changes in GC content can be a response to long-term niche 345 adaptation, however there is no commonly held theory or understanding of the possible reasons that 346 underpin this. One possible factor that may have influenced the variation in GC content observed across 347 Serratia is a difference in ideal growth temperature: higher GC Serratia species tend to be able to grow 348 better at higher temperatures than lower GC Serratia species 21 . Another possibility is that the observed 349 GC-dependent change in codon usage, which does not alter protein sequence or function, is indicative 350 of a shift to an optimal set of codons for each particular Serratia species, although the evolutionary 351 pressure that would drive such a shift is not clear. Importantly, however, this division in GC content 352 does not seem to be a barrier for gene flow in Serratia, since genes core to the high GC species S. 353 entomophila can also be found in polyphyletic and variable patterns across the genus, including in low 354 GC Serratia species (Supplementary Fig. 11). However, it is formally possible that these genes could 355 have been horizontally acquired from non-Serratia sources.

356
This study also provides definitive genomic evidence to explain the variation in a classical Serratia 357 phenotype, namely the production of the red pigment prodigiosin (Fig. 7). The high level of synteny 358 within the pig gene cluster together with the absence of homology in the flanking regions indicates that 359 the ability to produce prodigiosin has been acquired on at least three separate occasions within Serratia, nucleotide identity, it is unclear when and how these genes were incorporated into the chromosome, 13 and whether each event reflects gene flow within the genus or separate acquisition from an external noting that S. marcescens, S. plymuthica, and S. rubidaea all variably produce a red pigment 21 .
bacterial, immunosuppressive, and anti-cancer activity 44 . The biological advantage for these individual 367 Serratia species, or subsets thereof, to be able to produce prodigiosin is unclear, however it could reflect

419
and then corrected using Bracken 55 which assigns reads to a specific reference sequence, species or 420 genus. If reads were not able to be assigned to a taxonomic class, they were classed as 'unclassified'.

421
Any read sets that belonged to genera other than Serratia were discarded from any further analysis, 422 along with any assemblies obtained from those read sets.

423
Any read sets with more than an estimated five percent of heterozygous SNPs across the whole genome 424 were removed from further analysis, in addition to any assemblies obtained from those read sets.

425
Heterozygous SNPs were calculated using a software pipeline from the pathogen informatics team at 426 the Wellcome Sanger Institute. Specifically, read sets from each Serratia sample were aligned to an 427 appropriate reference for that sample, given the taxonomic profile from the Kraken and Bracken output.   were downloaded if the species was attributed to any of the following: Serratia sp., odorifera, rubidaea, 16 from the analysis, along with any assemblies comprised of more than 250 contigs. Quast v4.6.0 58 was used to extract statistics for genomes and genomic assemblies, specifically whole genome GC content, genomes detemined by visual inspection as being non-Serratia or close to non-Serratia members of 451 Enterobacteriacaeae were removed from any subsequent analysis. Ten genomes were excluded on this 452 basis, including several so-called Serratia sp. and Serratia oryzae.

453
Genome assembly and annotation

454
The assembly method used for genome assembly and annotation for each genome are detailed in applied to the assembly with the best N50 and contigs were scaffolded using SSPACE 60 and sequence 461 gaps filled using GapFiller 61 . For isolates from UK hospitals that were only sequenced by short-read 462 technology, these short reads were assembled using SPAdes v3.6.1 62 , using default settings.
gene accumulation curves were generated using the specaccum function from the R package Vegan  For the Serratia phylogeny, a concatenated core-gene alignment from 2252 genes (2,820,212 bp in 496 length) from Panaroo v1.2.3 67 (as described above) was filtered to remove monomorphic sites that were exclusively A, T, G or C using SNP-sites v2.5.1 70 . The resulting alignment was 398,551 bp in length.

505
Whole-genome assemblies were compared in a pairwise manner using fastANI v1.3 74 , and phylogroups 506 determined through clustering these comparisons using a cutoff of 95% average nucleotide identity 507 (ANI). Genomic assemblies were then clustered base on this cutoff value, using the script 508 fastANI_to_clusters.py which uses the networkx package (https://networkx.github.io/), and visualised 509 using Cytoscape v3.7.1 75 . The phylogeny was partitioned into lineages defined through hierarchical 510 bayesian clustering using Fastbaps v1.0.4 76 . Fastbaps was used to cluster the phylogeny over four levels, used as input to Fastbaps, alongside the rooted phylogenetic tree to provide a guide for the hierarchical 18

515
In silico reconstruction of metabolic pathways was performed using Pathway tools v23.5 39 , using a 516 multi-processing wrapper tool mpwt (https://github.com/AuReMe/mpwt) 77 . In order to arrange input 517 data into the appropriate format, and subsequently parse the output, a collection of Python and R scripts 518 were written (https://github.com/djw533/pathwaytools_gff2gbk). Further specific information about 519 how to run this can be found in the readme hosted at the github repository. In brief: Representative 520 protein sequences for each of the 47,743 protein family groups identified in the pan-genome analysis 521 were extracted from the pan-genome graph-associated data using Cytoscape v3.7.1, and functionally

830
Firstly, we would like to acknowledge the contribution of, and thank, all those colleagues who 831 contributed over many years to the collection of the Serratia isolates forming the Institut Pasteur 408 genomes from publicly available databases, and 256 sequenced in this study. Tree constructed with 1000 ultrafast bootstraps. The core-gene alignment was produced from a Panaroo pan-genome analysis run with "--clean_mode moderate" and the protein family threshold set to 70% shared sequence identity. Branches are coloured according to phylogroups defined by clustering assemblies at 95% ANI. Clades are shaded according to lineage, calculated through hierarchical bayesian clustering to three levels using FastBaps. "Labelled species" refers to the labelled name of species on the provided Serratia strain sample, or species name associated with published Serratia genome sequences in the NCBI GenBank database.   marcescens is plotted against a maximum-likelihood sub-phylogeny from the tree shown in Fig. 1. Clades for which all descending tips represent strains that have an identical set of genes in the locus depicted are collapsed, and denoted by a diamond shape within the tree.
The size of the diamond represents the number of tips in each collapsed clade. Tips lacking a completely assembled gene locus between tRNA-Proggg and tRNA-Sertga have been pruned from the tree. Genes are coloured according to their role, or in the absence of any predicted function, named according to the group number assigned by Panaroo in the pan-genome (Fig. 2). Prophage regions and the closest related prophage sequence determined by PHASTER are indicated.