Emergence and features of the multipartite genome structure of the family Burkholderiaceae revealed through comparative and evolutionary genomics

The multipartite genome structure is found in a diverse group of important symbiotic and pathogenic bacteria; however, the advantage of this genome structure remains incompletely understood. Here, we perform comparative genomics of hundreds of finished β-proteobacterial genomes to study the role and emergence of multipartite genomes. Nearly all essential secondary replicons (chromids) of the β-proteobacteria are found in the family Burkholderiaceae. These replicons arose from just two plasmid acquisition events, and they were likely stabilized early in their evolution by the presence of core genes, at least some of which were likely acquired through an inter-replicon translocation event. On average, Burkholderiaceae genera with multipartite genomes had a larger total genome size, but smaller chromosome, than genera without secondary replicons. Pangenome-level functional enrichment analyses suggested that inter-replicon functional biases are partially driven by the enrichment of secondary replicons in the accessory pangenome fraction. Nevertheless, the small overlap in orthologous groups present in each replicon’s pangenome indicates a clear functional separation of the replicons. Chromids appeared biased to environmental adaptation, as the functional categories enriched on chromids were also over-represented on the chromosomes of the environmental genera (Paraburkholderia, Cupriavidus) compared to the pathogenic genera (Burkholderia, Ralstonia). Using ancestral state reconstruction, it was predicted that the rate of accumulation of modern-day genes by chromids was more rapid than the rate of gene accumulation by the chromosomes. Overall, the data are consistent with a model where the primary advantage of secondary replicons is in facilitating increased rates of gene acquisition through horizontal gene transfer, consequently resulting in a replicon enriched in genes associated with adaptation to novel environments.

replicons (chromids) of the β-proteobacteria are found in the family Burkholderiaceae. These 23 replicons arose from just two plasmid acquisition events, and they were likely stabilized early in 24 their evolution by the presence of core genes, at least some of which were likely acquired 25 through an inter-replicon translocation event. On average, Burkholderiaceae genera with 26 multipartite genomes had a larger total genome size, but smaller chromosome, than genera 27 without secondary replicons. Pangenome-level functional enrichment analyses suggested that 28 inter-replicon functional biases are partially driven by the enrichment of secondary replicons in 29 the accessory pangenome fraction. Nevertheless, the small overlap in orthologous groups present 30 in each replicon's pangenome indicates a clear functional separation of the replicons. Chromids 31 appeared biased to environmental adaptation, as the functional categories enriched on chromids 32 were also over-represented on the chromosomes of the environmental genera (Paraburkholderia,33 Cupriavidus) compared to the pathogenic genera (Burkholderia, Ralstonia). Using ancestral state 34 reconstruction, it was predicted that the rate of accumulation of modern-day genes by chromids 35 was more rapid than the rate of gene accumulation by the chromosomes. Overall, the data are 36 consistent with a model where the primary advantage of secondary replicons is in facilitating 37 increased rates of gene acquisition through horizontal gene transfer, consequently resulting in a 38 replicon enriched in genes associated with adaptation to novel environments. 39

INTRODUCTION 41
The genomes of about 10% of bacterial species display a multipartite structure, consisting 42 of a chromosome plus one or more additional replicons (Harrison et  Burkholderia. In the case of Cupriavidus, the following genomes were included: i) all genomes 212 annotated with Cupriavidus as the genus, and ii) Ralstonia pickettii DTP0602, Ralstonia 213 eutropha H16, and R. eutropha JMP134, as these grouped with the Cupriavidus in both the 214 phylogeny and ANI matrix. All strains annotated with the genus Ralstonia not included in the 215 genus Cupriavidus were included in the calculation of the pangenome of the genus Ralstonia. 216 To compare the core genomes of the different genera, a Blast bidirectional best hit (Blast-217 BBH) approach was employed. For each pair-wise comparison, the nucleotide 218 pan_genome_reference.fa files returned by Roary were converted into amino acid fasta files 219 using transeq of the EMBOSS version 6.6.0.0 package (Rice et al. 2000

Functional analyses 233
Functional annotation of the accessory pangenomes was performed on the replicon-234 specific pangenomes produced as described above. A custom pipeline using Bash, Perl, and 235 MATLAB (mathworks.com) scripts was used to collect the representative genes for each 236 orthologous protein family present in less than 95% of strains, based on the 237 'gene_presence_absence.csv' and 'pan_genome_reference.fa' files produced by Roary (Page et 238 al. 2015). Each coding sequence was converted to the corresponding amino acid sequence using 239 the transeq function of the EMBOSS package (Rice et al. 2000). Eggnog-mapper (Huerta-Cepas 240 et al. 2017) was then used to functionally annotate the proteins using the 'bact' database, 32 241 CPUs, and the 'usemem' option. Next, a Bash script was used to extract the eggnog-mapper 242 output for just those proteins present in less than 75% or 25% of the strains, as well as to count 243 the total number of proteins present in less than 75% or 25% of the strains. The collected data 244 was then parsed using a Perl script, which returned the number or proteins and percentage of 245 proteins annotated with each COG category (Tatusov et al. 2003 with an alignment length of at least 70% of the shorter protein (-aS 0.7), and following additional 255 options: -G 0, -T 30, -n 2, -d 0, and -M 50000. A total of 63,554 clusters were produced. The 256 output of cd-hit was converted into a binary presence (one) and absence (zero) matrix, indicating 257 which clusters were found in which strains. 258 Each of the 63,554 clusters were associated to the Roary produced pangenomes, 259 described above, for the chromosomes and chromids of each of the genera Burkholderia, 260 Paraburkholderia, Ralstonia, and Cupriavidus. The strains were then split into two groups: i) the 261 Burkholderia and Paraburkholderia strains, and ii) the Ralstonia and Cupriavidus strains. In 262 both groups, the chromosomal clusters of the two genera were merged as a single list, as were 263 the chromid clusters. Then, clusters found on the chromosomes but not chromids of that group 264 were identified, and the clusters specific to the chromids and absent on the chromosomes of that 265 group were also identified. For the Burkholderia/Paraburkholderia group, there were 13,625 266 chromosome clusters and 12,332 chromid clusters. For the Cupriavidus/Ralstonia group, there 267 were 9,908 chromosome clusters and 6,474 chromid clusters. 268 Ancestral state reconstruction was performed using the ace function of the APE package 269 version 5.1 implemented in R (Paradis et al. 2004). This was done using the matrix produced 270 based on the cd-hit output, and the phylogeny of the 293 Burkholderiaceae strains rooted based 271 on the β-proteobacteria phylogeny of Figure 1. Ancestral state reconstruction was run 272 independently for each cluster, using the equal rates model. If the probability of the ancestor 273 having the gene was above 50%, the strain was said to contain the cluster; otherwise, the cluster 274 was said to be absent in that strain. Next, for each set of clusters of interest (e.g., 275 Burkholderia/Paraburkholderia chromosomal clusters), the number of clusters predicted to be 276 present in the ancestor were summed. The summed ancestral reconstruction data were then 277 mapped to the phylogeny using the phytools package version 0.6.44 of R (Revell 2011).  Figure S3). Notably, replicons larger than 800 kb 306 tend to be widespread within a clade of species, whereas those smaller than 800 kb tend to be 307 strain-or species-specific. Therefore, we further focused our analyses of secondary replicons to 308 those greater than 800 kb in size, expanded to include smaller replicons displaying clear signs of 309 synteny to a secondary replicon larger than 800 kb. These large replicons were classified as large corresponding to the stink-bug-associated beneficial and environmental group (Takeshita and  328 Kikuchi 2017), contain a second large megaplasmid; strikingly, the chromosome accounts for 329 only 33-43% of the total genome size of these strains (Table 1) replicon acquired by an ancestor at the base of this taxon prior to its divergence. 356 The above conclusion is supported by synteny and gene content analyses. Although there 357 is significant genetic variation, clear stretches of synteny can be seen when comparing between 358 chromids of the genus Burkholderia as well as when comparing chromids of the genus 359 Paraburkholderia ( Figure S5). Additionally, a core set of chromid genes (on at least 99% of the 360 chromids) is observed for both genera: 130 genes belong to the core gene set of the genus 361 Burkholderia, while 74 genes are part of the core gene set of the genus Paraburkholderia (Table  362 2, Data Set S5). Moreover, 37 of the core genes are common between the Burkholderia and the 363 Paraburkholderia (Table 2, Data Set S5). These results thus support that the chromids of all 364 species within these genera share common ancestry. However, the low level of synteny and gene 365 conservation may suggest that the ancestral plasmid contained relatively few genes found on the 366 modern day chromids. 367 The 37 genes common to the chromids of Burkholderia and  (Table 1). On the other hand, the 474 multipartite genomes of the genus Ralstonia are similar in size to those of the genus Pandoraea. 475 In contrast, the Pandoraea chromosomes are significantly larger (> 1 Mb) than the chromosomes 476 of all four genera with multipartite genomes (Table 1) Table S2). In particular, energy production (category 512 C), transport and metabolism of amino acids, lipids, carbohydrates, and inorganic ions 513 (categories E, G, I, P), secondary metabolism (category Q), transcription (category K), and signal 514 transduction (category T) were enriched in the chromid and/or megaplasmid accessory 515 pangenomes (Table 3, Table S2). Thus, the increased genetic variability of secondary replicons is 516 insufficient to fully explain the inter-replicon functional biases within bacterial multipartite 517 genomes. 518 For each of the COG categories mentioned above that are biased between chromosomes 519 and chromids (C, E, G, I, P, Q, K, T), we approximated the functional diversity by counting the 520 number of unique eggNOG orthologous groups (OGs) annotated within each category. Two 521 consistent patterns emerged. First, when considering the number of OGs, the enrichments 522 observed in chromid versus chromosome accessory genomes tended to decrease, but still 523 remained (Table 3). This suggests that the enrichment of these functions in the chromid 524 accessory genome is a combination of the independent acquisition (or duplication) of 525 functionally redundant genes, as well as an extension of the functional repertoire of the replicon. 526 Second, a large fraction (66-88%) of the OGs for each genus were specific to either the 527 chromosomal or chromid accessory genome. This observation further supports the notion that the 528 chromids and chromosomes are functionally divergent, and it suggests that their gene content is 529 derived from a distinct gene pools. 530 Overall, the above results are consistent with different selective pressures acting on each 531 replicon, thereby selecting for differential functional enrichments on each replicon. This is 532 consistent with results suggestive of differential evolutionary trajectories of each replicon in a 533 genome (Galardini et al. 2013), and the observation of different mutation rates between replicons 534 that is at least partly independent of functional biases (Cooper et al. 2010;Dillon et al. 2015). 535

Functional biases follow lifestyle not taxonomy 536
The functional contents of the four genera with multipartite genomes were compared to 537 identify the effect of lifestyle. The results revealed that the chromosomal pangenome functional 538 content of the genus Burkholderia is similar to that of the genus Ralstonia, while that of the 539 genera Cupriavidus and Paraburkholderia were similar and distinct from the functional content 540 of the other two genera ( Figure 5A). A similar pattern was detected for the pangenome functional 541 content of the secondary replicons ( Figure 5B). Notably, species of the genera Burkholderia and 542 Ralstonia are pathogenic, whereas species of the genera Cupriavidus and Paraburkholderia are 543 environmental isolates (Data Set S4). Hence, the functional content of the species of the family 544 Burkholderiaceae appears to be related more to lifestyle than to taxonomy. 545 The observation that the chromosomes of the family Burkholderiaceae are functionally 546 biased according to lifestyle suggests that having a chromid is not necessary, per se, for 547 adaptation to the distinct niches of these organisms. This is further supported by the pangenome 548 functional content of the chromosomes of the genus Pandoraea. This genus lacks secondary 549 replicons but has a lifestyle similar to the genus Burkholderia, including being an emerging 550 pathogen found in the lungs of cystic fibrosis patients (Coenye et al. 2000;Daneshvar et al. 551 2001). It is therefore not surprising that in terms of COG functional abundances, the 552 chromosomal accessory pangenome of the Pandoraea was similar to that of Burkholderia and 553 Ralstonia ( Figure 5A and Table S2). However, there did appear to be moderate biases in the 554 chromosomal COG abundances of Pandoraea relative to Burkholderia and Ralstonia, with these 555 biases often being in the direction of those seen for the chromids versus chromosomes. For 556 example, moderate enrichments were observed in COG categories C, K, P, and T ( Figure 5 and 557 Table S2). We therefore conclude that a secondary replicon is not an absolute necessity for niche 558 adaptation by these organisms. 559 560

Chromids appear enriched in environmental adaptation functions 561
Given that functional biases were detected between pathogenic and environmental 562 genera, and that functional biases were detected between chromosomes and chromids, we 563 examined whether there were similarities between these two sets of biases. As shown in Figure  564 6A, there is a correlation between the ratio of COG abundance in the chromosomal accessory 565 genome of the Paraburkholderia versus the Burkholderia, and the ratio of COG abundance in 566 the chromid and chromosomal accessory genomes of these genera. A similar pattern was 567 observed for the genera Cupriavidus and Ralstonia ( Figure 6B). Notably, for both pairs of 568 genera, the observed relationship was stronger for the pathogenic organisms (Burkholderia and 569 Ralstonia). Thus, at least in the family Burkholderiaceae, functions enriched in environmental 570 isolates relative to pathogenic organisms are the same functions enriched on chromids relative to 571 chromosomes. This is especially true for the pathogenic genera, likely as the chromosomes of the 572 environmental strains also contain higher amounts of these genes. These results are therefore 573 consistent with chromids being enriched in environmental, or niche, adaptation genes. However, 574 this result further supports that chromids are not absolutely necessary for the accumulation of 575 niche adaptation genes, and that such genes are likely accumulated on the chromosome even in 576 the presence of a chromid. 577 578

Chromids may facilitate increased rates of horizontal gene transfer 579
It has been suggested that chromid evolution involves significant gene accumulation and 580 replicon enlargement (diCenzo et al. 2014; diCenzo and Finan 2017), and that chromids may 581 permit species to continue to enlarge their genome when the chromosome reaches a maximal size 582 (Slater et al. 2009). We therefore evaluated the evolution of the genome content within the 583 family Burkholderiaceae. As described in the Methods, all proteins of the 293 Burkholderiaceae 584 strains were grouped into clusters using cd-hit (Li and Godzik 2006), and the presence of each 585 cluster in each ancestral strain was predicted using ancestral state reconstruction, as implemented 586 in the APE package of R (Paradis et al. 2004). We considered two groups of strains: i) the genera 587 Ralstonia and Cupriavidus, and ii) the genera Burkholderia and Paraburkholderia. Within each 588 group, the analysis was limited to those clusters found only in the chromosome pangenome or 589 only in the chromid pangenome. 590 The same overall trend was observed in both groups of strains (Figure 7). Moving from 591 the base of the tree to the extant strains, there was generally an increase in the number of genes 592 found in the ancestral strains that are also present in the modern-day strains. This was true for 593 both the chromosomal genes and for the chromid genes. However, the proportional increase in 594 the number of chromid genes outpaced that in the rate of gain of chromosomal genes ( Figure  595 S8), an observation that was particularly pronounced in the Burkholderia and Paraburkholderia. 596 These observations are consistent with the ancestral chromid replicons carrying a relatively small 597 portion of the genes currently found on the modern-day descendants. Additionally, these results 598 are consistent with secondary replicons more rapidly acquiring (and possibly losing) genes 599 through horizontal transfer than chromosomes, contributing to increased genome plasticity. 600

CONCLUSIONS 602
It has been noted elsewhere that ~ 10% of sequenced bacterial species contain a chromid 603 (Harrison et al. 2010; diCenzo and Finan 2017), suggesting that the presence of a chromid is 604 uncommon but not rare. In our analysis of the b-proteobacteria, we examined 213 named 605 species, of which 49 (23%) contained a chromid (Figure 1). Notably, all but one of these were 606 within the family Burkholderiaceae, and the data suggest that the chromids in this family arose 607 from just two ancestral plasmids; however, it remains possible that the transition from plasmid to 608 chromid may have occurred multiple times. It has also been proposed that the chromids of the 609 Rhizobiaceae family, which account for the majority of the chromids in the a-proteobacteria, 610 arose from a single ancestral plasmid (Slater et al. 2009). Thus, we conclude that it is likely that 611 the majority of chromids in sequenced strains arose from a few parental replicons, and that the 612 emergence of chromids is likely rarer than suggested by the 10% prevalence. 613 On average, bacteria containing a chromid have a larger total genome than bacteria 614 lacking chromids, and the difference in genome size can be attributed to the presence of the 615 chromid (Harrison et al. 2010;. This observation has led to the 616 suggestion that the role of chromids is to allow further genome enlargement than is possible with 617 only a chromosome; however, some evidence is also inconsistent with this hypothesis (Egan et 618 al. 2005; diCenzo and Finan 2017). Here, we observed that genera with a chromid had a larger 619 genome than related genera without a chromid (Table 1). However, the chromosomes of these 620 genera tended to be smaller than the chromosomes of a related genus (Pandoraea) without a 621 chromid (Table 1). Similarly, a-rhizobia with a chromid contain a smaller chromosome than a-622 rhizobia lacking chromids (MacLean et al. 2007). This suggests that, at least in some cases, 623 species with a chromid do not contain a chromosome that has reached a maximal size limit. We 624 therefore conclude that the presence of a chromid cannot be solely explained by a need for a 625 mechanism to increase genome content, although chromids may facilitate this process. 626 Another theory is that chromids (and megaplasmids) serve niche specialized roles, and 627 their evolution was driven by adaptation to novel environments (Chain et al. 2006 which contain a chromid ( Figure 5A). Hence, while secondary replicons may indeed be niche 635 specialized, chromids are not essential for adapting to these environments, and the same 636 functions can instead be accumulated by chromosomes in the absence of secondary replicons. 637 We therefore propose a modification of the theory that the primary advantage of 638 secondary replicons is related to niche adaptation. We suggest that the main advantage provided 639 by a secondary replicon is increased genetic malleability, which consequently results in 640 environmental specialization. If acquisition of the secondary replicon corresponds with the 641 ability to inhabit a new environment, the cell will experience new selective pressures selecting 642 for the acquisition of a distinct repertoire of genes. If secondary replicons are more genetically 643 malleable than chromosomes, these new genes would be preferentially accumulated by the 644 secondary replicon, thereby resulting in its enlargement and a replicon specialized for adaptation 645 to a particular environment. In support of this, hot spots for horizontal gene transfer have been 646   * The ratio is calculated as chromid / chromosome. 920 † Values represent the percentage of all genes in the replicon-specific pangenome accessory fraction annotated with the indicated 921 COG category (Table S3).

922
¥ Values represent the number of unique orthologous groups within the indicated COG category in replicon-specific accessory 923 fraction, expressed as a percentage of the total number of genes in the replicon-specific pangenome accessory fraction. 924