Polymorphism at four-fold degenerate site, but not at the intergenic regions, explains nucleotide compositional strand asymmetry in bacteria

We investigated single nucleotide polymorphism in intergenic regions (IRs) and four-fold degenerate sites (FFS) in genomes of three γ-Proteobacteria and two Firmicutes to understand the mechanism of nucleotide compositional asymmetry between the leading and the lagging strands. The polymorphism spectra were alike regarding transitions but variable regarding transversions in the IRs of these bacteria. Contrasting trends of complementary polymorphisms such as C→T vs G→A as well as A→G vs T→C in the IRs supported similar replication-associated strand asymmetry regarding cytosine and adenine deamination, respectively, across these bacteria. Surprisingly, the polymorphism pattern at FFS differed from that of the IRs, and its frequency was always higher than in the IRs. Further, the polymorphism patterns within a bacterium were inconsistent across the five amino acids, which neither replication-associated nor transcription-associated mutations could explain. However, the polymorphism at FFS coincided with amino acid specific codon usage bias in the five bacteria. Further, strand asymmetry in nucleotide composition could be explained by the polymorphism at FFS, not at the IRs. Therefore, polymorphisms at FFS, unlike those in IRs, might not be treated as nearly neutral in these bacteria.


Introduction
Prior to the analysis of nucleotide polymorphism at intergenic regions (IRs), we performed an elaborate study of their nucleotide composition. The G+C% difference between the whole genome and the IRs was more prominent for the three γ-Proteobacteria (Ec, Kp, Se) than for the two Firmicutes (Sa, Sp) (Supplementary Table 1). The G+C% in the IRs was similar between the LeS and the LaS within a bacterium. Abundance values of complementary nucleotides were more similar than those of two non-complementary nucleotides (Table 1). We calculated different skews (AT/GC/KM/RY) in the IRs. AT skew was negative in the LeS and positive in the LaS in all bacteria except Sa, where the reverse AT skew pattern was observed. The atypical AT skew in the IRs was in concordance with the AT skew in the Sa chromosome (Charneski et al. 2011). It is pertinent to note that the AT skew patterns were not consistent between the IRs of Sa and Sp, although both belong to Firmicutes. In contrast to AT skew, GC skew values were positive in the LeS and negative in the LaS of these bacteria. The magnitude of GC skew was higher than that of AT skew across the five bacteria. Among the five bacteria, the magnitude of both AT and GC skew was relatively high in Sa. The keto-amino (KM) and purine-pyrimidine (RY) skews were in general positive in the LeS and negative in the LaS of these bacteria (Table 1).

Table 1

strand, unlike ti. The most frequent transversion, G→T (C→A), is known to be due to the oxidation of G to form 8oxoG (Loon et al. 2010), which seems to occur equally in both strands. In Sa and Sp the frequency of G→T (C→A) was higher than that of the transition polymorphism A→G (T→C). This contradicts the general notion that a transition polymorphism is more frequent than a transversion polymorphism.
The reason behind the higher frequency of A→T (T→A) than that of A→C (T→G) or C→G (G→C) is not known. In ti as well as in tv, polymorphisms were more than two times biased towards A/T over G/C in the three γ-Proteobacteria, and more than three times biased towards A/T over G/C in the two Firmicutes. This was in concordance with their genome composition: polymorphism was more biased towards A/T in genomes with low G+C composition.

In conclusion, the polymorphism study at IRs revealed that replication-associated polymorphism asymmetry between the LeS and the LaS is mainly attributed to deamination of cytosine and adenine. This

Table 2). In general, ti was more frequent than tv across the five amino acids in these bacteria. However, the ti/tv values were variable, which indicated differences across these amino acids regarding polymorphism at FFS (Table 3). The transition as well as the transversion polymorphisms were more

the case of transversion polymorphisms (Table 3). It is pertinent to note that polymorphisms such as C→T and G→A were not always more frequent than A→G and T→C. For example, in case of Thr in Ec and across the five amino acids in case of Kp (Supplementary Table 2

We compared frequency values between complementary polymorphisms within a strand to find out any possible role of strand asymmetry. In general, the difference between C→T and G→A was positive in the LeS and negative in the LaS across the five amino acids in these five bacteria. This indicated replication-associated strand asymmetry regarding cytosine deamination. Regarding the other transition polymorphisms, the difference between A→G and T→C was high but non-uniform across the five amino acids within a bacterium: the difference could be positive for one amino acid and negative for another (Figure 3). Further, the difference between A→G and T→C values was similar in the LeS and the LaS. This indicated that the difference was not generated by replication-associated strand asymmetry. Similarly, the inconsistent pattern across the amino acids indicated that the pattern was not due to transcription-associated mutation. In the case of complementary transversions, the difference values were high but not uniform across the five amino acids within a bacterium. Further, the difference values remained similar in the LeS and the LaS. Unlike in the IRs, the differences between complementary polymorphisms were high at FFS, and the difference values were amino acid specific rather than strand specific.
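The within-strand comparison of complementary polymorphisms described above can be sketched in a few lines of Python. This is only an illustrative sketch, assuming that normalized frequencies are already available for each strand; the function names, the `C>T` key notation and all numeric values are hypothetical and are not taken from the study.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_change(change):
    """Return the complementary polymorphism, e.g. 'C>T' -> 'G>A'."""
    ref, alt = change.split(">")
    return f"{COMPLEMENT[ref]}>{COMPLEMENT[alt]}"


def complementary_differences(freqs):
    """Difference between each polymorphism and its complement in one strand.

    Each complementary pair is reported once, as (first - partner), where
    'first' is the alphabetically smaller key (so C>T - G>A and A>G - T>C).
    """
    diffs = {}
    for change in sorted(freqs):
        partner = complement_change(change)
        if partner in freqs and change < partner:
            diffs[f"{change} - {partner}"] = freqs[change] - freqs[partner]
    return diffs


# Hypothetical normalized frequencies (made-up numbers, for illustration only).
leading = {"C>T": 0.30, "G>A": 0.20, "A>G": 0.10, "T>C": 0.08}
lagging = {"C>T": 0.20, "G>A": 0.30, "A>G": 0.08, "T>C": 0.10}

for name, freqs in (("LeS", leading), ("LaS", lagging)):
    for pair, diff in complementary_differences(freqs).items():
        print(name, pair, round(diff, 3))
```

In this toy input the sign of each difference flips between the LeS and the LaS, which is the pattern read in the text as replication-associated asymmetry. A difference that keeps the same sign and size in both strands would instead point to a cause that is not strand specific.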

The polymorphism at the four-fold degenerate site coincides with codon usage bias

The amino acid specific polymorphisms at FFS indicated that the polymorphisms were most probably

LeS was more than that in LaS. The reverse was true regarding the frequencies of A and C.

were studied in the other two γ-Proteobacteria Kp and Se. In Kp, as the genome is G+C rich, G/C-ending codons are more abundant than A/T-ending codons. In general, A→G was more frequent than C→T and G→A in all amino acids, which favoured higher abundance of G-ending codons in the genome. A→C was more frequent than C→A in all amino acids except Pro, which supported CCC being less preferred here. Further, we noticed a correlation between preference for C-ending codons and higher frequency of G→C in Thr and Gly, whereas G-ending codons were preferred in Val and Pro, which corresponded to the higher frequency of C→G. A→T was more frequent than T→A in Val and Gly, which supported the T-ending codons being preferred over the A-ending codons in these amino acids. T→G was more frequent than G→T in Pro while T→G was less frequent than G→T in Gly, which supported the G-ending codon being preferred in Pro but not in Gly. Therefore, the impact of codon usage bias was observed on the polymorphism at FFS of Kp. Similarly, in Se, the G→C transversion was more frequent than C→G in Thr and Gly while the reverse pattern was true in Pro in both strands. This was in concordance with the preference for the C-ending codon in Thr, and the avoidance of the C-ending codon in Pro and of the G-ending codon in Gly. A→T was more frequent than T→A in Val and Gly, which supported the higher abundance of GTT/GGT in the genome. Therefore, the polymorphism pattern at FFS was influenced by codon usage bias in Se.

addition, G→A was more frequent than C→T in case of Thr because ACA is the most frequent codon here. In Sp, the A-ending codon was more frequent than the T-ending codon in Pro.
It was evident that A→T was more frequent than T→A in Val, Ala and Gly but the reverse was true in Pro, which was in concordance with the codon usage bias. Therefore, polymorphism at FFS in Sa as well as in Sp was influenced by codon usage bias.

aureus could not be explained by the cytosine deamination theory. Therefore, a detailed investigation has been

are the most frequent ones and display the main difference between the strands. This is in favour of the notion that cytosine deamination is the major cause of polymorphism in genomes and that the process is more frequent in the LeS than in the LaS. To a smaller extent, we have also observed that A→G and T→C contribute towards strand asymmetry at IRs. In parallel with the cytosine deamination theory, it may be hypothesized that more frequent adenine deamination in the LeS might result in a higher A→G transition frequency in that strand than in the complementary strand. It is known that cytosine deamination and adenine deamination have opposite impacts on genome G+C%. Regarding transition polymorphisms at IRs, the five bacteria behave similarly in this study. However, for transversions, the frequency of a polymorphism is similar between the strands and the difference between complementary transversions within a strand is very low. Therefore, the contribution of transversion polymorphism to strand asymmetry is very low in the IRs of these bacteria. It is pertinent to note that the frequency of G→T (C→A) is higher than that of other transversions. Transversion polymorphisms, like transition polymorphisms, increase A/T at IRs. But the bias of these polymorphisms towards A/T is greater in the two Firmicutes than in the γ-Proteobacteria. The G→T (C→A) value is higher and the A→G (T→C) value is lower in Firmicutes than in the γ-Proteobacteria, which is why the A/T bias is greater in the former than in the latter.
The polymorphism study at the IRs suggests that the replication-associated strand asymmetry does not differ between the two groups of bacteria. Therefore, the atypical AT skew in the chromosome of Sa is not supported by the polymorphism at IRs.

The two Firmicutes, Sa and Sp, exhibit opposite patterns of nucleotide composition at FFS. The nucleotide A is more frequent than T in Sa, while T is more frequent than A in Sp. Coding sequences are more abundant in the leading strand than in the lagging strand of the Firmicutes (Rocha 2004). Therefore, A is more abundant than T in the LeS of Sa and T is more abundant than A in the LeS of Sp. Codon usage bias at FFS of an amino acid is similar between the strands, indicating a weak influence of strand-specific polymorphism. Therefore, the atypical AT skew in Sa can be attributed to codon usage bias, which is due to selection on codon usage. Our findings are in concordance with the earlier observation that selection and gene distribution asymmetry between the strands was associated with the atypical AT skew in

understanding regarding the influence of codon usage bias on the compositional strand asymmetry become clear in this study because of the polymorphism study done separately in IRs and FFS.

the minimum and maximum genome composition (Raghavan et al. 2012). Directional mutation bias in support of the neutral theory of evolution has been proposed to explain genome G+C% in organisms (Sueoka 1988). In support of directional mutation theory, Muto and Osawa (1987) demonstrated that synonymous codon usage varies between high and low G+C bacterial genomes.
But theoretical analysis of the G+C% of the synonymous codons suggests that the maximum G+C composition difference between two synonymous codons (e.g., GGU and GGC) of an amino acid can be 33.33%, with exceptions only in Arg (e.g., CGG, AGA) and Leu (e.g.,

to believe that the selection on codon usage bias is responsible for the observed polymorphism. It is already known that GGG and CCC codons are prone to translational frameshift (O'Connor 1998, 2002). Interestingly,

to be selected positively in bacterial genomes (Satapathy et al. 2014, 2016). Further, amino acid specific codon usage bias is similar in the two strands, indicating a weak influence of strand-specific mutation bias in comparison to translational selection in genomes. This observation supports that G+C% variations across different amino acids at FFS are due to selection on codon usage bias and may not be due to directional mutation. Further, considering a limited set of high expression genes, we observed that the polymorphism at FFS of the high expression genes is in line with that at FFS of the whole genome. It is interesting to note that the earlier notion of genome composition determining the codon usage bias (Muto and Osawa 1987) is found inconsistent in this study. Our analysis using a large number of genomes of γ-Proteobacteria and Firmicutes has suggested the role of codon usage bias in determining the genome G+C%, supporting the selection theory of evolution. Assuming that the selection theory is true, it is speculated that in an AT-rich genome, A/T-ending synonymous codons are likely to be selected over G/C-ending synonymous codons and that the reverse is true for GC-rich genomes (Hershberg and Petrov 2009). We anticipate that future research on the translation rate of synonymous codons will uncover the mechanism of genome composition in AT-rich and GC-rich bacteria.
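The 33.33% bound quoted earlier in this discussion for the G+C difference between synonymous codons can be checked by enumerating the standard genetic code. The following is a small illustrative sketch, not part of the original analysis. It assumes the standard (NCBI table 1) code, writes codons in DNA notation (T in place of U), and uses hypothetical names such as `max_gc_difference`.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def max_gc_difference():
    """Largest G+C% gap between any two synonymous codons, per amino acid."""
    gc_counts = {}
    for codon, aa in CODON_TABLE.items():
        gc = sum(base in "GC" for base in codon)
        gc_counts.setdefault(aa, []).append(gc)
    # Each codon has three positions, so a count gap of 1 equals 33.33%.
    return {aa: (max(v) - min(v)) / 3 * 100 for aa, v in gc_counts.items()}


diffs = max_gc_difference()
# Amino acids whose synonymous codons can differ by more than one G/C position.
exceptions = sorted(aa for aa, d in diffs.items() if d > 34)
print(exceptions)
```

Under these assumptions only Leu (L) and Arg (R) exceed one G/C position of difference, because their six-codon families differ at the first codon position as well as the third (for example CGG versus AGA, and CTG versus TTA).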
The large range of genome G+C% is a classic example of the neutral theory of evolution in bacteria, which means that there is no specific advantage that could be linked to genome composition (Lassalle et al. 2015).

genomic AT-contents in intracellular genetic elements. Genome composition is known to be associated with bacterial phylogeny, such as Firmicutes with low genome G+C%, and Actinomycetes and β-Proteobacteria with high genome G+C% (Satapathy et al. 2010). However, the reason for the high and low G+C% in these phyla is not clearly understood. But phylogeny specific optimal codon selection has been reported recently (Satapathy et al. 2016). Future understanding of translational decoding by the ribosome might explain the phylogeny specific codon usage bias and genome composition. It is pertinent to note that ribosome mediated gene regulation by co-translational protein folding has been demonstrated to be species specific in E. coli and

case bacteria with poor selection on codon usage bias should exhibit low genome G+C%, while bacteria with high genome G+C% must exhibit high selection on codon usage bias. In a different study, it has been shown that bacteria with high genome G+C% indeed exhibited selection on codon usage bias (Satapathy et al. 2014). Hence, this further supports that selection on codon usage bias is responsible for genome composition in bacteria. As codon usage bias is universal in bacteria, it is possible that the difference between two genomes regarding codon usage bias acts as a selection against lateral gene transfer in bacteria.

Materials and Methods
Segregating the leading and the lagging strands in bacterial chromosomes

We carried out a detailed single nucleotide polymorphism study using computational analysis of

In coding sequences (CDS), we considered polymorphism at FFS of the amino acids Val, Pro, Thr, Ala and Gly. For example, if a nucleotide change (say, A→T) was observed at the 3rd position of a codon, the corresponding codon was identified in the reference sequence (considering the preceding two nucleotides). If the codon codes for Val (i.e., the codon is GTT/GTC/GTA/GTG), then the A→T count for Val was increased by 1. Using this approach, we calculated polymorphism at FFS of the five amino acids. Polymorphism frequencies were normalized by dividing the total count of a given change by the total count, in the reference sequence, of the nucleotide in which the polymorphism occurred. For example, if the total number of C→T changes is 20 and the total number of C in the reference sequence (either at IRs or at FFS of that amino acid) is 100, then the normalized frequency is 20/100 = 0.2. The frequencies of different nucleotide polymorphisms were calculated accordingly. For statistical analysis and determination of p-values for significance, the Mann-Whitney test was used (Mann and Whitney 1947).
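The counting and normalization steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the reference CDS is assumed to be an in-frame DNA string, SNPs are assumed to be given as hypothetical `(0-based position, alternate base)` pairs, and the function names and toy sequence are invented for the example.

```python
from collections import Counter

# First two codon positions that define the five four-fold degenerate families.
FFS_PREFIXES = {"GT": "Val", "CC": "Pro", "AC": "Thr", "GC": "Ala", "GG": "Gly"}


def count_ffs_changes(reference_cds, snps):
    """Tally third-position changes per amino acid, e.g. ('Val', 'A>T')."""
    counts = Counter()
    for pos, alt in snps:
        if pos % 3 != 2:
            continue  # not a third codon position
        amino_acid = FFS_PREFIXES.get(reference_cds[pos - 2:pos])
        ref = reference_cds[pos]
        if amino_acid is None or alt == ref:
            continue  # codon is outside the five families, or no change
        counts[(amino_acid, f"{ref}>{alt}")] += 1
    return counts


def count_ffs_bases(reference_cds):
    """Count reference nucleotides at the FFS of each of the five amino acids."""
    totals = Counter()
    for i in range(0, len(reference_cds) - 2, 3):
        amino_acid = FFS_PREFIXES.get(reference_cds[i:i + 2])
        if amino_acid is not None:
            totals[(amino_acid, reference_cds[i + 2])] += 1
    return totals


def normalized_frequency(change_count, base_count):
    """Count of a change divided by the count of its reference nucleotide."""
    return change_count / base_count


# Toy reference: GTA (Val), GTT (Val), CCC (Pro), ATG (Met, not counted).
reference = "GTAGTTCCCATG"
snps = [(2, "T"), (8, "T"), (11, "A"), (1, "C")]
changes = count_ffs_changes(reference, snps)
bases = count_ffs_bases(reference)
print(changes[("Val", "A>T")], bases[("Val", "A")])
print(normalized_frequency(20, 100))
```

The last line reproduces the worked example from the text (20 C→T changes over 100 reference C gives 0.2). Keeping the change counts and the reference base counts in separate tallies allows the same normalization to be reused for IRs, where no codon lookup is needed.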