Protein-coding potential of RNAs measured by potentially translated island scores

Recent studies have identified numerous RNAs that are functionally both coding and noncoding. However, the sequence characteristics that determine bifunctionality remain largely unknown. In this study, we developed and tested a potentially translated island (PTI) score, defined as the occupancy of the longest open reading frame (ORF) among all putative ORFs. We found that this score correlated with translation, including noncoding RNAs. In bacteria and archaea, coding and noncoding transcripts had narrow distributions of high and low PTI scores, respectively, whereas those of eukaryotes showed relatively broader distributions, with considerable overlap between coding and noncoding transcripts. The extent of overlap positively and negatively correlated with the mutation rates of genomes and effective population sizes of species, respectively. These overlaps were significantly increased in threatened species. In macroevolution, the appearance of the nucleus and multicellularity seem to have influenced the overlap of PTI score distributions, so that the probability of the existence of bifunctional RNAs is increased in eukaryotes. In mammalian testes, we observed an enrichment of noncoding RNAs with high PTI scores, which are candidates for bifunctional RNAs. These results suggest that the decrease in population size and the emergence of testes in eukaryotic multicellular organisms allow for the stable existence of bifunctional RNAs, consequently increasing the probability of the birth of novel coding and non-coding RNAs.


Introduction 43
Recent advances in RNA sequencing technology have revealed that most of the 44 eukaryotic genome is transcribed, primarily producing noncoding RNAs (Okazaki et al. 45 2002;Djebali et al. 2012;Ulitsky and Bartel 2013;Kopp et al. 2018). Noncoding RNAs 46 that are more than 200 nucleotides in length are referred to as long noncoding RNAs 47  Figure 2B). We also calculated the PTI scores of mouse transcripts 136 from RefSeq and Ensembl and found that the distribution of f(x) was shifted to the right 137 with an apex of 0.55 (Supplementary Figure 2C), similar to that of human transcripts. 138 These results suggest that the sequences, not lengths, of the coding transcripts increased 139 the PTI scores in mice and humans. 140 141 PTI scores correlate with protein-coding potential in humans and mice 142 Next, we examined the relationship between PTI score and protein-coding potential. 143 Based on the PTI score distributions of coding and noncoding transcripts, protein-144 coding potential F(x) was defined as the probability of being a coding transcript with a 145 PTI score of x. A sample F(0.15) calculation for human transcripts is shown in Figure  146 1D. This result indicates that any given human RNA transcript with a calculated PTI 147 score of 0.15 has a protein-coding potential F(x) of 0.183. F(x) was correlated with PTI 148 scores ≤ 0.65 ( Figure 1E and Supplementary Figure 3A). The protein-coding potentials 149 of sequences in RefSeq data slightly decreased after peaking at 0.65 ( Figure 1E  The intercepts were near zero, and the slopes were approximately 1.3. Using these 161 formulas, we can calculate the protein-coding potential F(x) for any given human For both human and mouse transcripts, the PTI score correlated linearly with the 176 protein-coding potential at PTI scores ≤ 0.65. Moreover, when the PTI score limit 177 approached 0, the probability of the transcript being a coding RNA was 0 ( Figure 1E  Next, we investigated whether the PTI score is useful for identifying coding RNAs 182 among NR transcripts. From the 7,144 transcripts registered as noncoding genes in 183 2015, we excluded small RNAs (< 200 nucleotides) and those with short primary PTIs 184 (< 20 residues). Among the remaining 6,617 NR genes, 219 were reassigned as NM 185 over the past 3 years (Supplementary Table 2), including the previously identified de 186 novo gene MYCNOS/NCYM (Suenaga et al. 2014). The percentage of reclassification 187 increased for NR transcripts with high PTI scores ( Figure 1F). Thus, a high PTI score is 188 a useful indicator of coding transcripts. NR transcripts with high protein-coding 189 potential (0.6 ≤ PTI score < 0.8) were then extracted, and the domain structure of the 190 pPTI amino-acid sequence was assessed using BLASTP. A total of 217 transcripts 191 showed putative domain structures in pPTI, whereas 310 did not (Supplementary Table  192 3). Transcripts with domain structures are often derived from transcript variants, 193 pseudogenes, or readthrough of coding genes; those without domain structures are often 194 derived from antisense or long intergenic noncoding RNAs (lincRNAs) ( Table 1). We next examined the functions of genes generating NR transcripts with high coding 196 potential (0.6 ≤ PTI score < 0.8). We divided the NR transcripts into those with and 197 without putative domains to investigate novel coding gene candidates, either originating 198 from pre-existing genes or created from non-genic regions. Analysis using the Database 199 for Annotation, Visualization, and Integrated Discovery (DAVID) functional annotation 200 tool (Huang et al. 2009a(Huang et al. , 2009b showed that NR transcripts without domain structures 201 were derived from original genes related to transcriptional regulation, multicellular 202 organismal processes, and developmental processes (Supplementary Table 4). Among 203 the target genes of transcription factors, NMYC (MYCN), TGIF, and ZIC2 were ranked 204 in the top three and are all necessary for forebrain development (Supplementary Table  205 4) (Brown et al. 1998;Gripp et al. 2000;van Bokhoven et al. 2005 and chromosome V or single-organism cellular processes (Supplementary Table 11) 216 were enriched. Therefore, the relationship between brain development and cancer in the 217 function of high-PTI-score lncRNAs seems to be specific to humans. 218 219 PTIs affect the protein-coding potential predicted by Ka/Ks 220 To examine the relationship between PTI scores and natural selection in the prediction 221 of protein-coding potential, we calculated the ratio of nonsynonymous (Ka) to 222 synonymous (Ks) values by comparing human transcripts with syntenic genomic 223 regions of chimpanzees and mice ( Figure 1G). Transcripts were selected based on the 224 syntenically conserved regions: 44,593 (vs. chimp) and 14,016 (vs. mouse). We found a 225 linear relationship between the F(x) and PTI scores in the conserved transcripts ( Figure  226 1G, left panels). As predicted, coding transcripts exhibited Ka/Ks < 0.5 at a higher 227 frequency than did noncoding transcripts, with large differences observed when for PTI 228 scores > 0.9 or < 0.1, with the smallest difference for PTI scores of approximately 0.35 229 to 0.45 ( Figure 1G, right panels). These results indicate that for transcripts with PTI 230 scores near the highest or lowest values, the conservation of ORF/pPTI sequences 231 (negative selection, Ka/Ks < 0.5) determines the coding potential. In contrast, for 232 transcripts with PTI scores between 0.35 and 0.45, the conservation of ORF/pPTI 233 sequences has almost no effect on the coding potential, and thus ORF/pPTI sequences 234 have more potential to evolve neutrally. Therefore, noncoding transcripts showing both 235 negative selection (Ka/Ks < 0.5) and the highest PTI scores may include new coding 236 transcript candidates. We list 23 such transcripts in Supplementary  Figure 4A). On the other hand, there was no rightward shift in ORF 252 size or transcript length (Supplementary Figure 4B and 4C). Therefore, the translation 253 of non-coding RNAs was strongly correlated with the PTI score, but not with transcript 254 length, ORF size, or ORF coverage. 255 Next, to examine whether PTI scores were associated with translation occupancy of 256 ORFs in coding RNAs, we defined the ORFs in which translation products were 257 identified in the sPTIs: uORFs, sPTIs with translation products detected in the 5'UTRs; 258 sORFs, PTIs with translation products detected in other frames overlapping with ORFs 13 of major proteins; and dORF, sPTIs with translation products detected in the 3'UTRs 260 ( Figure 2B). When the PTI scores of coding RNAs with uORF, sORF, and dORF were 261 calculated and the PTI score distribution was compared with that of all coding RNAs, 262 the PTI scores shifted to lower values, peaking at 0.35-0.45 ( Figure 2C). Although there 263 are differences in the effect of the location of the translated sPTI in a dataset from a 264 different database, the PTI score distribution remained similar, that is, it shifted to lower 265 values and increased the number of coding transcripts with PTI scores of 0.35-0.45 266 (Supplementary Figure 5). These results support the idea that the PTI score is related to 267 the occupancy of major ORFs in the translation of RNAs. In addition, when considering 268 the results in Figure 1G To analyze the relationship between PTI scores and protein-coding potential in a broad 275 lineage of organisms, we selected 100 organisms, consisting of five bacteria, ten 276 archaea, and 85 eukaryotes ( Figure 3. To examine the evolutionary 280 potential in humans and mice, we selected a relatively large number of mammalian 282 species (36). Species with fewer than three lncRNAs were not used to calculate g(x) and 283 were not included in the histograms illustrating their relationship with the PTI score 284 (Figures 4 and 5). For all organisms, the relative frequency of coding transcripts f(x) 285 was shifted to the right (higher PTI score) compared to random or random shuffling  and 5). In sharp contrast to f(x), the relative frequency of lncRNAs g(x) was shifted to 297 the right (higher PTI scores) in eukaryotes, including G. lamblia, which belongs to the 298 earliest diverging eukaryotic lineage and lacks mitochondria ( Figure 4). Since the 299 distribution of f(x) in the Excavata, including G. lamblia, showed a similar pattern to 300 that of bacteria, the right shift of g(x) seems to be an earlier event than the left shift of 301 f(x) in the evolution of eukaryotes. Collectively, the right and left shifts of f(x) and g(x) 302 contribute to blurring the boundary between coding and noncoding transcripts in 303 eukaryotes. 304 305 PTI score distribution overlap is inversely correlated with effective population size 306 In general, eukaryotes (particularly multicellular organisms) have smaller effective 307 population sizes than prokaryotes, with higher mutation rates due to the effect of genetic size was inversely proportional to the mutation rate of genomic DNA, even in the 317 remaining 24 species (exponent = −1.126, R 2 = 0.6842, Figure 6A). Opti positively and 318 negatively correlated with mutation rates and effective population size, respectively, 319 with relationships that could be approximated as logarithmic (R 2 = 0.7578) or 320 exponential functions (R² = 0.4667). In contrast, ORF coverage (Ocov) showed a weaker 321 relationship with mutation rates and effective population size (Supplementary Figure 9).  consider the possibility that Opti might be elevated in endangered organisms. We 327 calculated Opti for 35 vertebrate species on the IUCN Red List (left panel, Figure 6B; 328 Supplementary table 1), and found that species at risk of extinction had significantly 329 higher Opti than species with little risk of extinction (Least Concern, LC). In addition, 330 among LCs, Opti was higher for species with decreasing numbers compared to those 331 with stable populations (right panel, Figure 6B; Supplementary table 1). 332 333

Relationship between PTI score and protein-coding potential 334
The overlapping of relative frequencies in f(x) and g(x) led us to examine the 335 relationship between the PTI score and protein-coding potential F(x) in eukaryotes. To 336 avoid being misled by small sample numbers, we selected 32 species with more than 337 1000 lncRNAs that contained pPTIs to calculate F(x) (Figure 7 and Supplementary 338 Figure 10). In humans and mice, the relationship between the PTI score and F(x) was 339 approximated with a linear function passing through the origin of the PTI score ≤ 0.65.

Characteristics of RNA virus genomes in human and bacterial cells 355
In sharp contrast to the coding transcripts of bacteria and archaea, the PTI scores of 356 coding transcripts of eukaryotes overlapped with those of noncoding RNAs due to their 357 broad distribution of low PTI scores. To investigate the molecular mechanism 358 underlying the distinct distribution of coding transcripts between bacteria and 359 eukaryotes, we analyzed the genome sequences of RNA viruses that infect human or 360 bacterial cells. Positive-sense single-stranded RNAs, or (+) ssRNAs, are parts of the 361 viral genome that generate mRNAs and are translated into viral proteins via the host 362 translation system. Therefore, efficient translation in host cells contributes to the 363 replication of (+) ssRNA viruses. We speculated that PTIs other than bona fide ORFs 364 affect the coding potential of the viral genome in host cells. Multiple bona fide ORFs 365 are present in viral genomes. Thus, we extended the concept of PTIs to multiple ORFs 366 in viral RNA genomes ( Figure 8A) and set the viral ORF (vORF) score. 367 Among the positive sense ssRNA viruses registered in the NCBI database, 198 were 368 human viruses and 13 were bacteriophages. We eliminated the viruses that produced 369 viral proteins by exceptional translation mechanisms such as ribosome frameshifting, 370 alternative initiation sites, ribosome slippage, and RNA editing, focusing on the 371 remaining 95 human viruses, including nine retroviruses (Supplementary Table 14) and 372 10 bacteriophages (Supplementary Table 15). The relative frequencies of the human 373 viruses and bacteriophages showed distinct peaks at PTI scores of 0.65 and 0.75, 374 respectively ( Figure 8B). These values correspond to the PTI scores of the highest 375 protein-coding potential in humans ( Figure 1E and Supplementary Figure 3A) and the 376 highest frequency of coding transcripts in bacteria ( Figure 4). In addition, the relative 377 frequency of human viruses showed a broader distribution of low PTI scores compared 378 to bacteriophages, particularly in human retroviruses ( Figure 8B). Therefore, RNA virus 379 genomes appear to have sequence characteristics that maximize their protein-coding 380 potential in host cells. 381 382

The relationship between PTI scores and tissue-specific expressions 383
The right shift of the PTI score distribution for noncoding RNAs is pronounced in 384 eukaryotes, especially in multicellular organisms (Figures 4 and 5). To examine the 385 possibility that different tissues in multicellular organisms show different PTI 386 distributions for noncoding RNAs, we analyzed transcriptome data to calculate the PTI 387 scores of human noncoding transcripts expressed in multiple tissues ( Figure 8C). The 388 shifted to higher values for mature testes ( Figure 8C). Similar results were also obtained 390 for opossums, rats, mice, and macaques, although their shifts were weaker than those of 391 humans (Supplementary figure 11). Furthermore, the noncoding transcripts that were 392 expressed in a tissue-specific manner had higher PTI scores than ubiquitously expressed 393 noncoding transcripts in humans ( Figure 8D) and the other four species (Supplementary 394 figure 12). The relationship between the specificity of expression and the PTI score was 395 also found for human coding transcripts (Supplementary figure 13). These results 396 suggest that the tissue-/cell type-specific expression of transcripts evolved in 397 multicellular eukaryotes contributes to increased PTI scores for noncoding transcripts. 398 Since the majority of tissue-specific transcripts were expressed in matured testes (7,573 399 of 8,523 transcripts (89%) in the highest specificity group for humans), the evolution of 400 the testis also seems to contribute to the existence of high PTI score-noncoding RNAs, 401 thus contributing to the birth of new coding genes. 402 403

Discussion 404
Here, we showed that PTI scores are associated with protein-coding potential in cellular 405 organisms. In bacteria and archaea, the PTI-score distributions for noncoding and 406 coding transcripts were distinct (low and high scores), whereas they were merged in 407 eukaryotes. 408 Right shifts in the distribution of noncoding RNA occurred in G. lamblia, one of the 409 earliest diverging eukaryotes, which are binucleate and lack mitochondria, peroxisomes, 410 and a typical Golgi apparatus (Ankarklev et al. 2010;Bartelt et al. 2015;Buret et al. 2020) and were commonly observed in all eukaryotes examined. Moreover, functional 412 noncoding RNAs, by definition, should not be translated by ribosomes in cells. 413 However, in bacteria and archaea, newly transcribed RNAs are immediately bound by 414 ribosomes (Miller et al. 1970;French et al. 2007) and do not have the chance to escape 415 translation. Thus, as expected, transcripts with noncoding functions in bacteria and 416 archaea showed low PTI scores (top panel, Figure 9A). Alternatively, in eukaryotes, the 417 existence of the nucleus prevents the immediate binding of lncRNAs by ribosomes, so 418 cytoplasmic translocation from the nucleus is required for translation. Therefore, 419 eukaryotic lncRNAs may function in the nucleus even with high PTI scores, and the 420 subsequent evolution of cytosolic translocation of these noncoding RNAs may 421 contribute to the origination of new coding genes (middle panel, Figure 9A). Thus, the 422 pervasive transcription of the genome seems to help eukaryotes to create new functional 423 noncoding/coding RNAs, while being disadvantageous for bacteria and archaea by 424 increasing the risk of transcription of high-PTI-score transcripts, leading to immediate 425 translation of wasteful and/or toxic proteins (top and middle panels, Figure 9A,  Figure 9A). Since the possibility that a new 430 protein will not be toxic in multiple intracellular environments is lower than the 431 possibility that it will not be toxic in a particular intracellular environment, noncoding 432 RNAs that are ubiquitously expressed need to have lower PTI scores than those with 433 specific expression (bottom panel, Figure 9A). 434 Kaessmann proposed an "out of the testes hypothesis," arguing that the testis facilitates Consistent with this hypothesis, we found that the molecular traits of coding or 507 noncoding RNAs were prominent in bacteria/archaea and weak in multicellular 508 eukaryotes, allowing the existence of bifunctional RNAs. The excessive overlap of PTI 509 score distributions (Opti > 0.7) diminished the correlation between the PTI score and 510 protein-coding potential. This indicates that both coding and noncoding transcripts lost 511 their molecular traits as coding and noncoding RNAs in terms of PTI score, which 512 became lethal or highly deleterious for the species. 513 Species with decreasing population sizes showed significantly higher Opti compared 514 with species with a stable population size, even in the LC group. Combined with the 515 results discussed above, we propose a novel model of new gene origination in which 516 new gene birth occurs in response to decreased effective population sizes ( Figure 9B). In conclusion, the PTI score is an important indicator for integrating the concept of gene 540 birth into classical evolutionary theory, thereby contributing to the elucidation of the 541 molecular basis for the evolution of complex species, including humans. In the future, it will be necessary to calculate PTI scores based on the transcriptomes of additional 543 species to test our hypothesis that positioning new gene birth as a countermeasure to the 544 decline in effective population size. 545 546

Potentially translated islands (PTIs) 548
Definition 549 PTIs are defined as sequence segments beginning at AUG and ending with any of the 550 UAA, UAG, or UGA stop codons in the 5ʹ to 3ʹ direction within an RNA sequence in all 551 three possible reading frames ( Figure 1A).

Definition 560
The PTI length is defined as the length of the amino acid sequence, excluding the stop 561 codon, and is represented by ( Figure 1A). In an RNA sequence, the longest PTI is 562 designated as the primary PTI (pPTI), whereas the others are termed secondary PTIs 563 (sPTIs). The lengths of pPTI and sPTI are described as !"#$ and %"#$ , respectively 564 ( Figure 1A). 565

Definition 577
We defined the PTI score ( Figure 1A) according to Equations 2 and 3. 578 + %"#$* + ⋯ + %"#$+ + ⋯ + %"#$( (2) 579 represents the sum of all PTI lengths. 581 The definition is derived from the hypothesis that the potential for translation of a pPTI 582 is reduced by translation of sPTIs. Consistent with this hypothesis, coding transcripts 583 with translation of sPTIs had lower PTI scores than all coding transcripts ( Figure 2C). 584

Example 585
For an RNA sequence with only one PTI, the PTI score is 1 ( Figure 1B). An RNA 586 sequence with many sPTIs tended to have a score close to 0 ( Figure 1B). If the sum of 587 all sPTI lengths was equal to pPTI length, the PTI score is 0.5 ( Figure 1B). The PTI 588 score of the NCYM transcript is 0.568 (Supplementary Figure 1C). Further information 589 is included in the Supplementary Notes. 590

Relative frequencies f(x) and g(x) 595
Definition 596 We defined the, f(x) and g(x), respectively, as ( Figure 1C To define coding/non-coding transcripts with a PTI score of x, we divided the 603 histograms into ten classes, and used the median values of the classes to represent the 604 PTI score ( Figure 1C). Therefore, in Equations 5 and 6, the PTI score x is restricted as 605 follows:

Nonsynonymous (Ka) to synonymous (Ks) nucleotide substitution ratios 664
To identify orthologous regions between human transcripts and chimpanzee/mouse 665 genomes, BLAT v. 36 (Kent 2002) was conducted using human transcript sequences 666 with the estimated PTI score against chimpanzee (PtRV2) and mouse (GRCm38.p6) 667 genomic sequences defined in the NCBI database. We defined the blat best-hit genomic 668 regions of chimpanzee/mouse as orthologs for each human transcript. The human-669 chimpanzee (or human-mouse) sequences were aligned for each exon region and the sequences were combined for each transcript. Only orthologous sequence pairs of more 671 than 60 bp in length (encoding > 20 amino acid residues) were extracted. 672 Nonsynonymous (Ka) and synonymous (Ks) nucleotide substitution rates were 673 estimated as described by Yang and Nielsen (Yang and Nielsen 2000), implemented in 674 PAML version 4.8a (Yang 1997

Relative frequencies of negatively selected genes 680
We defined this frequency, h(x), in coding and noncoding transcripts ( Figure 1G TimeTree (Hedges et al. 2006) was used to draw trees using official species names. 690

Selection of viruses and identification of vORFs 692
The complete genomes of positive-sense single-stranded RNA viruses infecting humans 693 or bacteria (Supplementary Tables 14 and 15)  Transcriptome data from five species were obtained from a previous study (Sarropoulos 713 et al. 2019). All transcripts expressed at detectable levels (non-zero) in each tissue were 714 used to calculate PTI scores for lncRNAs and to plot PTI score distributions. For the 715 correlation between tissue specificity and PTI score, we divided the transcripts into the 716 indicated groups according to the number of tissues in which the transcript was detected 717 and described the PTI score distribution in each group. Human transcriptome data for 718 coding transcripts were obtained from the Human Protein Atlas 719 (http://www.proteinatlas.org), including RNA isoform data from 131 cell lines and 281 720 tissues. The PTI score for each transcript was calculated from Ensembl data.  P-values were calculated using the Yate's continuity correction. N.S., not significant. 1027