Virulent but not temperate bacteriophages display hallmarks of rapid translation initiation

Bacteriophages rely almost exclusively on host-cell machinery to produce their proteins, and their mRNAs must therefore compete with host mRNAs for valuable translational resources. In many bacterial species, highly translated mRNAs are characterized by the presence of a Shine-Dalgarno sequence motif upstream of the start codon and weak secondary structure within the start codon region. However, the general constraints and principles underlying the translation of phage mRNAs are largely unknown. Here, we show that phage mRNAs are highly enriched in strong Shine-Dalgarno sequences and have comparatively weaker secondary structures in the start codon region than host-cell mRNAs. Phage mRNAs appear statistically similar to the most highly expressed host genes in E. coli according to both features, strongly suggesting that they initiate translation at particularly high rates. Interestingly, we find that these observations are driven largely by virulent phages and that temperate phages encode mRNAs with similar start codon features to their host genes. These findings apply broadly across a wide-diversity of host-species and phage genomes. Further study of phage translational regulation—with a particular emphasis on virulent phages—may provide new strategies for engineering phage genomes and recombinant expression systems more generally.

Protein translation consumes a large amount of cellular resources, and the genomes of mi-2 crobial species show strong evidence of selection for rapid and efficient translation [1][2][3][4][5][6][7][8]. The 3 statistical analysis of genome sequence features has provided important insights into many 4 of the basic mechanisms of gene expression, particularly by contrasting sequence patterns 5 within genes that have high-and low-expression demands [9][10][11][12][13][14] . Over-representation of 6 a purine-rich motif upstream of bacterial start codons, for instance, ultimately lead to the 7 discovery of the Shine-Dalgarno sequence mechanism-a direct binding interaction between 8 the 30S ribosomal sub-unit and mRNA that governs start codon recognition and facilitates 9 translation initiation [15][16][17][18][19][20][21]. Collectively, gene sequence analyses have facilitated advances 10 in recombinant protein engineering by deciphering the sequence rules for optimal expression, 11 which have subsequently been validated, refined, and expanded upon in numerous exper-12 imental and evolutionary studies [22][23][24][25][26][27][28][29][30][31][32]. However, the amount of available information 13 contained within individual bacterial genomes is limited both in terms of the overall number 14 of proteins and the sequence-level diversity that is present across closely related strains. 15 Bacteriophage genomes, by contrast, contain a large pool of untapped sequence diver-16 sity that metagenomic techniques have only recently begun to characterize in depth [33][34][35][36][37][38][39]. 17 Phage mRNAs must be expressed in host cells-in most cases, entirely by existing host cell 18 machinery-and the statistical patterns and constraints that are encoded in these sequences 19 may thus help to further elucidate host-cell transcriptional and translational constraints 20 and mechanisms [40,41]. While several model phage species are well-characterized [42- 21 46], there have been comparatively few investigations into larger-scale statistical patterns 22 present within phage genomes [47][48][49][50][51][52][53]. Deciphering the coding sequence rules governing 23 phage genome evolution is additionally critical for engineering phages as well as for deter- 24 mining the utility of this knowledge for host-cell engineering applications [54,55]. 25 Here, we analyze thousands of complete, high-quality phage genomes from 33 different 26 bacterial hosts to determine whether the constraints shaping translation-initiation regions 27 in phage mRNAs differ from host-cell genomes. We first present an in depth analysis of E. 28 coli -infecting phages and find that phage genomes are predicted to have translation-initiation 29 rates that are on-par with host-cell genes encoding only the mostly highly abundant proteins. 30 Next, we show how translation initiation-related features co-vary and associate strongly with 31 phage lifestyle. Finally, we extend our findings across a broad phylogenetic range of host 32 species, suggesting that phage mRNAs-and virulent phages in particular-are subject to 33 strong evolutionary pressure to ensure rapid translation initiation.

35
Phage mRNAs are predicted to have rapid translation-initiation 36 rates. 37 We focused our study on two sequence features that are robustly linked to translation-38 initiation rates across a wide-range of bacterial species. For a given gene, we first measured 39 the strength of anti-Shine-Dalgarno (aSD) sequence binding interaction (5 -CCUCCU-3 ) 40 by selecting the strongest binding hexamer sequence within a narrow window upstream of 41 the annotated start codon (Fig. 1A). Stronger aSD sequence binding (more negative ∆G) 42 is associated with start codon recognition and rapid translation initiation [56]. Next, we 43 calculated the structural accessibility of the start codon by predicting the stability of the 44 secondary structure for a 90 nucleotide fragment surrounding each start codon (30 bases 45 upstream and 60 bases downstream, Fig. 1B). Weaker secondary structure within this region 46 (more positive ∆G) is robustly associated with higher translation-initiation rates [57]. 47 We measured aSD sequence binding and secondary structure strengths for all protein 48 coding genes in phage T7-a well-studied, model phage-and its bacterial host, E. coli. We 49 observed a clear distinction whereby the aSD sequence binding strengths of T7 genes are 50 narrower and highly skewed towards stronger binding relative to E. coli genes (Fig. 1C, mean 51 of -7.61 vs -4.77 kcal/mol, Welch's t-test p < 0.001). The distribution of secondary structure 52 strengths within the start codon region are slightly (but insignificantly) shifted towards 53 weaker secondary structure, which is also associated with higher translation-initiation rates 54 (Fig. 1D, mean of -18.3 vs -19.5 kcal/mol, p = 0.07). Given that aSD sequence binding 55 strengths are substantially stronger and that the strength of mRNA secondary structures 56 are statistically similar in phage T7 relative to E. coli, this analysis shows that overall 57 translation-initiation rates (TIR) are predicted to be higher for phage T7 mRNAs relative 58 to host mRNAs. 59 To determine the generality of these findings, we extended this analysis to a set of 254 60 diverse E. coli -infecting phage genomes (see Materials and Methods). Similar to phage T7, 61 we found that 202 of these 254 phage genomes had a stronger mean aSD sequence binding 62 strength when compared to the E. coli genome and 82 of these comparisons were significant 63 (Fig. 1E, Welch's t-test with FDR-correction, p < 0.01). By contrast, only 10 of the phage 64 genomes displayed significantly weaker aSD binding strengths when compared with E. coli 65 (according to the same significance criteria). 66 When we considered mRNA secondary structure surrounding the start codon, we found 67 that 152 out of 254 tested phage genomes had weaker mRNA secondary structure in the 68 start codon region compared with E. coli genes and 99 of these comparisons were statistically 69 significant (Welch's t-test with FDR-correction, p < 0.01, Fig. 1F). As with aSD sequence 70 binding, this result indicates that phage mRNAs are likely to initiate translation more rapidly 71 than host-cell mRNAs. Only 13 phage genomes had significantly stronger predicted mRNA 72 secondary structures within the start codon region compared to host genome mRNAs (and 73 thus lower predicted translation-initiation rates). 74 One possible explanation for why phage mRNAs appear to have higher translation-75 initiation rates than hosts is that phage genomes are highly compact and may be devoid 76 of genes that are rarely expressed or under weak selective pressures-as may be the case for 77 much of the > 4, 000 protein coding genes within the E. coli genome. We therefore repeated 78 the above analyses, comparing phage genomes to subsets of the E. coli genome with pro-79 gressively more stringent cutoffs in terms of their average protein abundances (see Materials 80 and Methods). As expected, we found that E. coli mRNAs that encode highly abundant 81 proteins have both substantially stronger aSD sequence binding strengths and weaker mRNA 82 secondary structure in the start codon regions. Achieving parity in terms of roughly bal-83 ancing the number of phage genomes with significantly stronger (and weaker) aSD sequence 84 binding strengths required considering only the top 10-25% E. coli genes with the highest 85 protein abundances ( Supplementary Fig. S1A). The results when assessing mRNA secondary 86 structure in the start codon region are similar: overall the translation-initiation region of 87 phage mRNAs appear to be under strong selection that is on par with only the most highly 88 expressed host genes ( Supplementary Fig. S1B). We additionally considered essential and 89 non-essential host gene categories separately under the expectation that phage genes might 90 more closely resemble essential host-cell genes. However, we found that phage translation-91 initiation regions are significantly distinct from both essential and non-essential gene subsets 92 ( Supplementary Fig. S1A,B). 93 Sequence features in translation-initiation regions differ between 94 virulent and temperate phages. 95 Our findings thus far show that phage mRNAs are predicted to bind strongly to ribosomes 96 and thus are predicted to have increased translation-initiation rates relative to host mRNAs. 97 Additionally, phage mRNAs also appear to have generally weaker secondary structures sur-98 rounding the start codon, a feature that is itself associated with rapid recruitment of ribo-99 somes and high translation-initiation rates. However, we have investigated these features in 100 isolation and it is unclear whether these two sequence features represent competing strate-101 gies that different phages use to ensure rapid translation initiation or whether these features 102 co-occur within the same genomes.

103
To investigate the possible differences between phage and host mRNAs along several 104 dimensions, we used a logistic regression framework that attempts to predict the genome of 105 origin (host or phage) based on knowledge of both the aSD sequence binding strength and 106 the strength of mRNA secondary structure surrounding the start codon for each phage-host 107 pairing. Using only a single predictor variable, this procedure is equivalent to a Student's 108 t-test but the flexible nature of the model allows us to analyze both variables simultaneously 109 and to further account for potentially confounding variables. The reported effect size is 110 simply the model coefficient for each variable (a standardized conditional log-odds ratio), 111 the magnitude of which can be directly compared for various predictor variables to assess 112 their overall contribution. We decided to control for two potentially confounding variables: 113 coding sequence GC content and codon usage biases. The GC content of coding sequences 114 can directly influence the strength of mRNA secondary structure (and this feature may vary 115 between phage and host genomes). Additionally, codon usage biases are a general indicator 116 of translational selection-particularly on translation elongation. We found that GC content 117 and codon usage biases do indeed vary between host and phages when these variables are 118 considered in isolation ( Supplementary Fig. S2). Our logistic regression model thus contains 119 two features of direct interest related to translation initiation and two features that we treat 120 as potentially confounding variables.

121
Using the same dataset of 254 E. coli infecting phages, we found that strong aSD sequence 122 binding strength and weak mRNA secondary structure co-occur in a large subset of phages 123 (upper-left quadrant, Fig. 2A). The number of significant phages indicated in the figure (37 124 in the upper-left quadrant) refers to to phages where both variables are significantly different 125 from the host (after separately applying a FDR-correction to each variable, p < 0.01). How-126 ever, numerous phage genomes reside in the lower-left quadrant where they display slightly 127 stronger aSD binding strengths than the host genome but also slightly stronger mRNA sec-128 ondary structure in the start codon region. Only two of these genomes are significant for both 129 quantities, however. Finally, a comparatively small number of genomes occupy quadrants 130 on the right side of this scatter plot (displaying weaker aSD sequence binding strengths) 131 and only a single phage genome is statistically different from the host for both features-in 132 this case, having significantly weaker aSD sequence binding and stronger mRNA secondary 133 structure in the start codon region.

134
All of the results that we have presented thus far treat phage genomes uniformly. How-135 ever, this approach may obscure important differences in how natural selection operates on 136 the translation-initiation regions of different phages. We thus predicted the lifestyle of each 137 phage genome in the dataset using BACPHLIP [58] and noticed a striking difference between 138 phages that are confidently predicted (≥ 0.95% probability) to be either temperate or viru-139 lent (n=143, of which 39 are temperate and 104 are virulent, Fig. 2B). While both categories 140 of phages have strong aSD sequence binding energies (most data points remain to the left of 141 zero in Fig. 2B), the virulent phages reside in the upper-left quadrant and almost exclusively 142 have high predicted translation-initiation rates (Fig. 3B). By contrast, the vast majority of 143 temperate phages reside in the lower-left quadrant where they display signatures of strong 144 aSD sequence binding strengths but also stronger mRNA secondary structures as compared 145 to the host. Repeating the full analysis from Fig. 2 without including the coding sequence 146 GC percent and iCUB covariates reveals similar findings, with virulent phages preferentially 147 residing in the top left quadrant, which is indicative of storng predicted translation initiation 148 rates ( Supplementary Fig. S3). 149 We further confirmed this stark separation between temperate and virulent phage genomes 150 by looking at each of the features independently within the temperate and virulent phage 151 genome sets ( Supplementary Fig. S4). The clearest and most significant differences emerged 152 when considering mRNA secondary structure around the start codon and codon usage bi-153 ases. In both cases, virulent phages are predicted to have stronger translation initiation rates 154 and translation efficiency when compared with temperate phages (Welch's t-test, p < 0.001). 155 The strength of aSD sequence binding also differs slightly (p = 0.011) between these two 156 sets of phages: of the 143 genomes with confident lifestyle predictions, the top 35 phages 157 with the strongest mean aSD sequence binding energies are all virulent phages. Finally, 158 coding sequence GC content also varied, virulent phages displaying a generally lower coding 159 sequence GC percent (p < 0.001) but the link between this feature and translation efficiency 160 is less well studied.

161
Finally, to better understand how these two translation initiation-related features con-162 tribute to translation efficiency-a metric that is roughly akin to the amount of protein 163 produced from a given level of mRNA over a given time period and is often calculated using 164 ribosome profiling. We built multi-variate regression models to predict gene-specific transla-165 tion efficiencies based off of two previously published, empirically determined values [59,60]. 166 Both translation initiation features were highly significant, and the resulting models had 167 Adjusted-R 2 values between 0.11-0.17 ( Supplementary Fig. S5). We applied the best fitting 168 model to predict translation efficiency values for phage genomes, and observed that a large 169 number had significantly greater predicted mean translation efficiency when compared with 170 the predictions for the entire E. coli genome. Confirming our lifestyle-based findings, this ef-171 fect was driven almost entirely by virulent phage genomes-many of which had significantly 172 greater predicted translation efficiency whereas no temperate phage genomes showed this 173 effect ( Supplementary Fig. S5).

174
High translation-initiation rates are a general feature of phage 175 genomes. 176 We have thus far presented an in depth analysis of E. coli infecting phage genomes, but the 177 generality of these findings to other host-cell bacterial species is unclear. To determine if 178 phage mRNAs display evidence of rapid translation initiation in general, and whether this 179 effect is particularly pronounced in virulent phages, we repeated our analyses from Fig. 1E,F 180 on 32 additional taxonomically diverse bacterial host organisms for which we have at least 181 10 (dereplicated) complete phage genomes for comparison.

182
For nearly all host species, we found that a substantial fraction of phage genomes had sig-183 nificantly different aSD sequence binding strengths and mRNA secondary structure strengths 184 compared to their hosts (Fig. 3,  is calculated via Welch's t-test with FDR-correction, p < 0.01). In nearly all cases, the 186 effect was heavily biased in the direction of increased predicted translation-initiation rates 187 for phage mRNAs: stronger aSD sequence binding strengths and weaker mRNA secondary 188 structure in the start codon region. There were, however, a few scattered host species for 189 whom there appeared to be no significant difference for one or both of the translation-190 initiation features. In Cellulophaga baltica, for instance, we note that it has previously been 191 shown that many members of the Bacteroidetes phylum do not use the aSD sequence bind-192 ing mechanism and it is thus unsurprising to see that phages infecting this species are also 193 devoid of this feature [6, 61,62]. Additionally, we observed generally weaker effects for both 194 translation initation-related features across members of the Firmicutes phylum (excepting 195 the Bacillus genus) but note that the translation-initiation regions in many of these host-cell 196 genomes have particularly strong features compared with species like E. coli. 197 We additionally wished to assess differences between temperate and virulent phages for 198 this diverse set of species. However, most phages with known, annotated lifestyles come from 199 a biased set of host species and this bias also affects the accuracy of phage lifestyle prediction 200 for different host-cell species. We predicted the lifestyle of all phages infecting the host-cell 201 species depicted in Fig. 3 and (as in Fig 2) required a 95% probability of correct lifestyle 202 assignment. This procedure yielded 9 host species for which we could confidently identify a 203 minimum of 5 temperate and 5 virulent phage genomes. For these hosts, we split the phage 204 genomes into temperate and virulent categories and repeated the analysis from Fig. 3   considering temperate and virulent phages independently.

206
Similar to our findings across E. coli infecting phages, a substantial number of both 207 virulent and temperate phages displayed stronger aSD sequence binding strengths relative 208 to host-cell genomes and we observed comparatively minor differences in the magnitude of 209 this feature across lifestyle classes Fig. 4A. Also confirming our results seen in E. coli, we 210 observed very few cases where temperate phages were predicted to have significantly different 211 mRNA secondary structure in the start codon region in either direction. By contrast, large 212 numbers of virulent phages have weaker mRNA secondary structure in the start codon region 213 Fig. 4B. Virulent phages, across nearly all of the 9 species studied, have strong sequence 214 signatures that are indicative of rapid translation initiation. As with E. coli, the picture 215 is more nuanced for temperate phages, which tend to have strong aSD sequence binding 216 strengths (Fig. 4, left) but little or no significant difference in mRNA secondary structure 217 strengths in the start codon region compared to host genes (Fig. 4, left)

219
Translation initiation region sequence preferences provide bacteria with a means to differen-220 tiate start codons from background sequences and to modulate the protein production rate 221 for individual mRNAs. By analyzing the sequence patterns contained within thousands of 222 complete bacteriophage genomes, we have shown these sequence preferences are particularly 223 strong in phages and indicative of rapid translation-initiation rates that are on par with 224 only the most highly expressed host-cell genes. Most strikingly, we find that virulent phage 225 genomes differ from temperate phage genomes in this regard: mRNAs from both types of 226 phages appear to be enriched in strong anti-Shine-Dalgarno sequence binding sites, but only 227 virulent phages couple this ribosomal capture mechanism with particularly weak mRNA 228 secondary structure that helps to facilitate ribosomal recruitment.

229
Within temperate phage genomes, the net result of partially contradictory observations-230 stronger aSD sequence binding and slightly stronger mRNA secondary structure relative 231 to host genes-is unclear. To date, there have been a limited number of genome-wide 232 transcriptomic and proteomic studies specifically targeting phage infections [42][43][44][45][46]. As 233 techniques such as RNA-seq and ribosome profiling become more widespread and applied to 234 diverse phage and host species, we expect that the details of ribosomal competition during 235 phage infection will become more clear. However, based on regression models that were fit to 236 empirically determined E. coli translation efficiency values, it appears that temperate phages 237 are likely to have slightly lower translation efficiency ( Supplementary Fig. S5). By contrast, 238 virulent phage genomes appear to have unequivocally increased rates of translation initiation 239 according to all of our best knowledge about this detailed molecular process. Our study thus 240 joins a growing body of literature highlighting important differences in evolutionary processes 241 between temperate and virulent phage genomes [63][64][65][66][67][68].

242
While we are unable to say precisely why temperate and virulent phages should differ 243 so starkly in regards to their translational initiation regions, we speculate that the life-244 history of these different virus types results in unique translational pressures. Temperate 245 phage genomes, for instance, may accumulate numerous point mutations that are likely to 246 be under weak selection pressure during extended periods of dormancy. It is possible that 247 excision of temperate phages from the host-cell genome (and subsequent entry into the lytic 248 cycle) is preceded by other steps that limit competition between phage and host-cell mRNAs. 249 By contrast, virulent phage genomes must rapidly produce protein in a race against host-cell 250 detection mechanisms in order to reproduce at every generation. While not the focus of our 251 study, we observed that codon usage biases also differ significantly between temperate and 252 virulent phages, with temperate phages having generally weaker biases than we observed 253 for virulent phage genomes. This again points in the direction of stronger overall levels 254 of translational selection on virulent phages. However, the precise consequences of these 255 different evolutionary modes as it relates to optimizing translation initiation remains an 256 intriguing area of further theoretical and empirical study.

257
Prior work has shown that viral mRNAs (including phage) tend to have weak mRNA 258 secondary structure surrounding the start codon, but limits in the number of available phage 259 genomes made it difficult to explicitly study phage-host pairings [69]. Other studies have 260 looked at differences in codon usage biases and higher-order sequence effects within bacterio-261 phage genomes and have generally found strong signatures of translational selection in phage 262 genomes [47][48][49][50][51][52][53]. Our findings thus build on existing research and suggest that a distinct an 263 important comparison is between host and phage mRNAs, which are likely to be in direct 264 competition for the same pool of ribosomes and translational resources during periods of 265 active infection. The growing abundance of complete, host-linked phage genomes presents 266 exciting opportunities for further study in this area.

267
A potential limitation of our work is that phylogenetic relatedness presents numerous sta-268 tistical challenges. Two randomly chosen phages may share either close or distant sequence 269 identity and common ancestry [70]. While phylogenetic comparative methods can be applied 270 to correct for non-independence, these methods require an underlying phylogenetic tree that 271 can be particularly challenging to assemble for phage genomes [39,71,72]. In lieu of this 272 approach, we have opted for a simpler method that should partially mitigate phylogenetic 273 biases: clustering genomes and selecting single representative genomes per cluster to ensure 274 that the data points are more independent than they might otherwise be. Additionally, 275 while we present multi-species comparisons between various host-cell species (Figs. 3 & 4), 276 we again have not accounted for any confounding effects that could arise due to phylogenetic 277 relatedness between host-cell species. However, we do not perform any explicit statistical 278 analyses on the entire set of data, and rather include these figures to graphically show that 279 our findings appear to apply across numerous species.

280
Analysis of protein coding sequences, as well as up-and down-stream regulatory se-281 quences, has enhanced our understanding of several transcriptional and translational mech-282 anisms, including the role of codon usage biases and higher-order sequence features in the 283 rate of protein production [73][74][75][76][77][78]. In a sense, phages are the original synthetic biologists-284 exploiting host cells to produce a set of macromolecules that decrease host fitness. The 285 precise details of how individual phages manipulate their hosts is highly varied, but higher-286 order genome features such as we have studied here may provide important insight into 287 translational regulation, which may further enhance our ability to engineer both phage and 288 host-cell genomes.

290
Virus data compilation, genome annotation, and host prediction 291 The core of our study relies on access to numerous and distinct bacteriophage genomes 292 with consistent annotations and trustworthy predictions of primary host species. On-going 293 research in each of these areas is continuing to expand the availability of phage genomes that 294 are suitable for genome-linked analyses such as we performed here [39,[79][80][81][82][83]. To maintain 295 consistency and to ensure access to the highest-quality genomes, we used the NCBI Virus 296 genome database (last accessed: November 2020) and selected only "complete" genomes to 297 include in our study. An advantage of this choice is that we additionally relied on previously 298 existing genome annotations for all phages included in this study. Only a small subset of 299 phage genomes have host annotations, so we limited our selection of host-cell species to 300 those with at least 50 annotated phage genomes and 10 annotated phage genomes after 301 dereplication (see below).

302
To ensure that the selected phage genomes were not severely biased by a small number 303 of over-represented and closely related genomes, we used FastANI to measure the average 304 nucleotide identity (ANI) between all phages that infect a single host, requiring that the 305 alignment span a minimum of 80% the length of the shortest genome [84]. We constructed a 306 unique all-to-all distance matrix, and used the cd-hit-est greedy algorithm to cluster phages 307 at 95% sequence identity [85]. From each cluster, we subsequently selected the longest Ref-308 Seq genome as a cluster representative. In the event that there were no RefSeq genomes 309 in a cluster, we simply selected the longest genome. Finally, we removed poorly annotated 310 phage genomes from the analysis, which we conservatively defined as those with an anno-311 tated coding sequence density <50%. For E. coli, our processing began with a set of 1,473 312 genomes and our final dataset encompassed 254 genomes, which were analyzed throughout 313 this manuscript.

314
Definitions of sequence features related to translation initiation.

315
The Shine-Dalgarno sequence is a collection of related mRNA sequence motifs that are de-316 fined by their ability to bind strongly to the highly conserved anti-Shine-Dalgarno sequence-317 located on the 30S subunit of the bacterial ribosome. Here, we define the aSD binding 318 strength of a given coding sequence as the strongest possible binding affinity between the 319 anti-SD sequence (defined via the core sequence of 5 -CCUCCU-3 ) and hexamer sequences 320 upstream of the start codon, restricting the upstream range to have a minimum gap of 4 321 nucleotides and a maximum gap of 10 nucleotides between the 3 most base in the hex-322 amer and the first nucleotide of the start codon. Binding energies were calculated using the 323 "RNAcofold" program, part of the ViennaRNA (v2.4.14) suite [86]. 324 We determined the strength of mRNA secondary structure surrounding all start codons 325 by assessing a 90 nucleotide long sequence fragment (30 nts upstream of the start codon and 326 60 nts downstream) using the "RNAfold" program from ViennaRNA (v2.4.14) [86]. While 327 we expect this to be a rough approximation of the true secondary structure present in the 328 start codon region for each mRNA, numerous prior studies have showed that similar window 329 sizes roughly capture the relevant feature of mRNA secondary structure strength [22,56,57]. 330 More complicated models of translation initiation may include explicit penalties for SD 331 sequence spacing relative to the start codon [23], the identity of the start codon itself [87], 332 kinetics of secondary structure unfolding and refolding [88], penalties for binding too strongly 333 to the aSD sequence [56], as well as a number of other possible features [89][90][91]. However, 334 it is unknown how transferable these other mechanisms are across diverse organisms given 335 that the vast majority of existing work has been performed in E. coli. To keep our model 336 simple, we focused on the two most consistent and frequently cited modifiers of translation-337 initiation rates. Additionally, we found that a simple multi-variable linear regression model 338 consisting of only aSD binding strength and the strength of secondary structure around 339 the start codon is capable of significantly predicting translation efficiencies derived from two 340 separate genome-wide E. coli datasets [59,60] (Supplementary Fig. S5). In these models, R 2 341 values ranged from 0.11-0.17 and predictions of both models were highly correlated. Further, 342 in both cases, start codons with weak mRNA secondary structure and strong aSD sequence 343 binding have the highest empirically measured translation efficiency values and both sequence 344 features were highly significant in the regression models (p < 0.001). We did not apply these 345 models to predict translation efficiencies for phages or hosts outside of E. coli, reasoning that 346 doing so would make a dangerous assumption that the mechanisms of translational regulation 347 remain similar across phylogenetically diverse species. Our current approach does, however, 348 assume that the mechanisms of translation efficiency and translational regulation remain 349 similar between normal E. coli growth and during phage infection.

350
Other CDS-level feature definitions and inclusion criteria 351 Codon usage bias is frequently considered an indicator of translational selection due to the 352 potential for synonymous codons to modulate the rate of translation elongation. We thus 353 wanted to ensure that our findings were robust to variation in codon usage bias differences 354 between host and phage genomes, as well as GC content variation (which has a direct impact 355 on mRNA secondary structure strength). Coding sequence GC content was simply calculated 356 for each CDS as the number of G+C nucleotides divided by the total coding sequence length. 357 Codon usage bias was measured using iCUB [92]. There are many possible metrics of codon 358 usage bias, but a benchmark study showed that iCUB outperformed other metrics that rely 359 solely on coding sequence information (as opposed to a priori defined reference sets of genes 360 or other external information such as tRNA abundances or a reference set of highly expressed 361 genes). The iCUB metric is similar to the more commonly cited effective number of codons: 362 lower values indicate fewer codons and thus more bias. However, iCUB explicitly controls 363 for GC content variation and produces better estimates of protein abundance across diverse 364 microbial taxa. We observed that both of these features (GC-content and iCUB) differ 365 significantly between phage and host species (Supplementary Fig. S2), providing validation 366 that our results are likely to be more robust and conservative by controlling for these effects. 367 For all studied coding sequences (either phage or host-derived), we only considered an-368 notated genes whose length was at least 90 nucleotides (30 amino acids) and for which the 369 length was a multiple of three (potentially removing a small number of genes with pro-370 grammed frame-shifts). Additionally, all of our results for host genomes excluded coding 371 sequences whose "product" feature annotation contained the word "phage". This filter was 372 performed to ideally exclude prophage-associated coding sequences from being included in 373 the host genome when drawing comparisons.

374
Compilation of E. coli -specific empirical data 375 We leveraged several existing genome-scale, empirical data sources to thoroughly contrast 376 E. coli infecting phages with various subsets of E. coli genes as well as to ensure that 377 our translation-initiation metrics were associated with empirically derived translation effi-378 ciency measurements. Specifically, we relied on PAXdb for protein abundance data [93], 379 two separate datasets of gene essentiality [94,95], and two separate datasets of ribosome-380 profiling derived translation efficiency measurements (briefly referenced in the preceding 381 section) [59,60].

382
For protein abundance data, we simply drew various percentile-based thresholds to con-383 sider increasingly stringent sets of "highly-expressed" genes. PAXdb aggregates protein 384 abundance measurements from numerous studies and growth conditions, and while these 385 abundance values are estimates from only a small snap-shot of possible studies and possible 386 growth conditions, these values are well-established (if rough) indicators of overall protein 387 abundance [93]. For gene essentiality, we used a consensus approach where we only considered 388 genes that were considered either "essential" or "non-essential" in two separate datasets (to 389 increase robustness) and did not consider genes with conflicting annotations for this portion 390 of the analysis.

391
Phage lifestyle prediction 392 We used the BACPHLIP software [58] to categorize all phage genomes in our dataset into 393 temperate and virulent lifestyles. BACPHLIP uses genome-sequence input, determines the 394 presence/absence of several hundred lysogeny-associated protein domains, and uses a random 395 forest classifier return a probability of the given phage being either temperate or virulent. 396 Here, we limited our analyses of both E. coli infecting phages and phages that infect other 397 species to those with a predicted lifestyle probability ≥ 0.95. Additionally, when assessing 398 differences between temperate and virulent phages, we removed any host species from our 399 analysis that did not have at least 5 phage genomes (after dereplication) in each lifestyle 400 category.

Statistical tests 402
Single-variable comparisons were performed in Python using the scipy package implementa-403 tion of Welch's T-test. Multiple comparisons were corrected for by using the applying the 404 statsmodels implementation of FDR correction (Benjamini/Hochberg) to the list of p−values 405 with alpha set to 0.01. Multivariate analyses were performed using the logit function in 406 statsmodels with zscore normalization to all predictor variables in order to standardize ef-407 fect sizes. Correction for multiple comparisons in these models was again accomplished by 408 filtering the p−values from individual model coefficients through the FDR correction proce-409 dure with the alpha value set to 0.01. Thus, we ensured that all of our results were highly 410 significant and robust to artifacts arising from multiple comparisons.

411
Host taxonomy assignment 412 We did not explicitly model any effects across a phylogenetic tree and instead present analyses 413 of multiple species independently in Figs. 3 & 4. We do, however, note that no phage genomes 414 were shared across host species for any part of this analysis, but individual host species 415 nevertheless have their own complicated and non-independent phylogenetic histories. We 416 arranged species taxonomically for easier visual comparison by querying the NCBI taxonomy 417 database and used the ete3 package to roughly order species according to their taxonomic 418 grouping. All statistical results were performed for each species independently.

419
Code and data availability. Supplementary Figure S1. Comparing translation-initiation features in phage genomes to host genome subsets. (A) The distribution of mean aSD binding energy across phage genomes compared with indicated subsets of host genes (error bars depict standard errors for host genome subset means, orange square indicates the median phage genome value). The number of phage genomes that are significantly different from gene categories are shown on the right, with red bars extending to the left of 0 indicating genomes with significantly stronger aSD binding strengths and brown bars to the right indicating significantly weaker aSD binding strengths (significance was defined via Welch's t-test with FDR-correction, p < 0.01). (B) As in (A), depicting mean start codon secondary structure around the start codon.  [60]. (C) Differences in mean predicted translation efficiency (using the 2014 model) between phage genomes and the host E. coli genome. 'n. sig' denotes the number of phage genomes that are significantly differenent from E. coli in either direction (Welch's t-test with FDR-correction, p < 0.01). (D) As in panel (C), splitting phage genomes according to lifestyle (including only genomes with high-confidence lifestyle predictions: 39 temperate and 104 virulent). Among virulent phage genomes, 73 (0) have significantly higher (and lower) predicted translation efficiency compared to E. coli. By contrast, 0 (3) temperate phage genomes have significantly higher (and lower) predicted levels of translation efficiency.