Potentially translated sequences determine protein-coding potential of RNAs in cellular organisms

RNA sequence characteristics determine whether transcripts are coding or noncoding. Recent studies have shown that, contrary to the definition of noncoding RNA, several long noncoding RNAs (lncRNAs) are translated into functional peptides or proteins. However, the characteristics of RNA sequences that distinguish such newly identified coding transcripts from lncRNAs remain largely unknown. In this study, we found that potentially translated sequences in RNAs determine the protein-coding potential of RNAs in cellular organisms. We defined the potentially translated island (PTI) score as the fraction of the length of the longest potentially translated region among all regions. To analyze its relationship with protein-coding potential, we calculated the PTI scores in 3.4 million RNA transcripts from 100 cellular organisms, including 5 bacteria, 10 archaea, and 85 eukaryotes, as well as 105 positive-sense single-stranded RNA virus genomes. In bacteria and archaea, coding and noncoding transcripts exclusively presented high and low PTI scores, respectively, whereas those of eukaryotic coding and noncoding transcripts showed relatively broader distributions. The relationship between the PTI score and protein-coding potential was sigmoidal in most eukaryotes; however, it was linear and passed through the origin in three distinct eutherian lineages, including humans. The RNA sequences of virus genomes appeared to adapt to the translation systems of host organisms by maximizing protein-coding potential in host cells. Hence, the PTIs determined the protein-coding potential of RNAs in cellular organisms. Additionally, coding and noncoding RNAs do not exhibit dichotomous sequence characteristics in eukaryotes; instead, they exhibit a gradient of protein-coding potential.

Introduction

[...] defined this indicator using PTI lengths and subsequently examined associations between the indicator and protein-coding potential. We also present analyses of more than 3.4 million transcripts in 100 cellular organisms belonging to all three domains of life to investigate the evolution of the relationship between the PTI score and protein-coding potential. Finally, we examined whether virus RNA genomes have differentially evolved to maximize the protein-coding potential in different host organisms.

Results

Coding transcripts show higher PTI scores in humans and mice
We defined the PTI score as described in the Materials and Methods and illustrated in Figure 1A and B, and analyzed human transcripts registered in the nucleotide database of [...]

In both human and mouse transcripts, the PTI score correlated linearly with the protein-coding potential at PTI scores ≤ 0.65. Moreover, as the PTI score approached 0, the probability of the transcript being a coding RNA approached 0.

PTIs affect the protein-coding potential predicted by Ka/Ks
To examine the relationship between the PTI score and natural selection in the prediction of protein-coding potential, we calculated the ratio of nonsynonymous (Ka) to synonymous (Ks) values by comparing human transcripts with syntenic genome regions of chimpanzee or mouse (Figure 1F). Transcripts were selected based on syntenically conserved regions: 44,593 (vs. chimpanzee) and 14,016 (vs. mouse). The results revealed linear relationships between F(x) and the PTI score in the conserved transcripts (Figure 1F, left panels). As predicted, transcripts with Ka/Ks < 0.5 were more frequent among coding transcripts than among noncoding transcripts, with the largest difference observed when the PTI score was > 0.9, and the smallest difference at PTI scores of approximately 0.45 (Figure 1F, right panels). Therefore, noncoding transcripts showing both negative selection (Ka/Ks < 0.5) and the highest PTI score may include new coding transcript candidates. We have listed 23 such transcripts (Supplementary Table 2 [...]

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.14.439730; this version posted April 15, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Characterization of high PTI score lncRNAs in humans
Next, we investigated whether the PTI score is useful for identifying coding RNAs among NR transcripts. From the 7,144 transcripts registered as noncoding genes until 2015, we excluded small RNAs (< 200 nucleotides) and those with short primary PTIs (< 20 amino acids). Among the remaining 6,617 NR genes, 219 were reassigned as NM over the past 3 years (Supplementary Table 3), including the previously identified de novo gene MYCNOS/NCYM (Suenaga et al. 2014). The percentage of reclassification increased among NR transcripts with high PTI scores (Figure 1G). Thus, a high PTI score is a useful indicator of coding transcripts. NR transcripts with high protein-coding potential (0.6 ≤ PTI score < 0.8) were then extracted, and the domain structure of the pPTI amino acid sequence was estimated using BLASTP (see the Materials and Methods for details). A total of 217 transcripts showed a putative domain structure(s) in pPTI, whereas 310 transcripts showed none (Supplementary Table 4). Transcripts with a domain structure are often derived from transcript variants, pseudogenes, or readthrough of coding genes; transcripts without domain structures are often derived from antisense RNA or long intergenic noncoding RNA (lincRNA) (Table 1).

We next examined the functions of genes generating NR transcripts with high coding probabilities (0.6 ≤ PTI score < 0.8). We divided the NR transcripts into those with or without putative domains to investigate candidates of novel coding genes, either originating from pre-existing genes or created from non-genic regions, respectively.
To analyze the relationship between PTI scores and protein-coding potential in a broad lineage of cellular organisms, we selected 100 organisms, consisting of 5 bacteria, 10 archaea, and 85 eukaryotes (Supplementary Table 1), and calculated the PTI score for a total of more than 3.4 million transcripts (Supplementary Table 1). Phylogenetic trees of the cellular organisms are presented on a logarithmic time scale along with the number of species in each lineage used in the analyses (Figure 2). To examine the evolutionary conservation of the linear relationship between PTI score and protein-coding potential observed in humans and mice, we selected a relatively large number of species (36) from mammals. Species with fewer than three lncRNAs were not used to calculate g(x) and were not included in the histograms illustrating the relationship with PTI score (Figure 3). In all organisms, the relative frequency of coding transcripts f(x) was shifted to the right (higher PTI scores) compared to random or random shuffling controls ( [...] shifted to the left (lower PTI scores) compared to bacteria and archaea (Figures 3 and 4). In sharp contrast to f(x), the relative frequency of lncRNAs g(x) was shifted to the right (higher PTI scores) in eukaryotes, including G. lamblia, which belongs to the earliest diverging eukaryotic lineage and lacks mitochondria (Figure 3). Since the distribution of f(x) in the Excavata, including G. lamblia, showed a similar pattern to bacteria, the right shift of g(x) seems to be an earlier event than the left shift of f(x) in the evolution of eukaryotes. Collectively, the left shift of f(x) and the right shift of g(x) contribute to blurring the boundary between coding and noncoding transcripts in eukaryotes.

Relationship between the PTI score and protein-coding potential

The overlap of the relative frequencies f(x) and g(x) led us to examine the relationship between PTI score and protein-coding potential F(x) in eukaryotes. To avoid misleading results caused by small sample sizes, we selected 32 species with more than 1000 lncRNAs that contain pPTIs for the calculation of F(x) (Figure 5 and Supplementary Figure 6). In humans and mice, the relationship between PTI score and F(x) was approximated by a linear function passing through the origin for PTI scores ≤ 0.65. Therefore, we applied a linear approximation to the F(x) of the 32 species and divided them into two groups, linear and sigmoidal, based on the shape and formula of the approximated function (Figure 5). We defined the linear group as R² > 0.9 with an absolute value of the intercept < 0.1; the sigmoidal group had a slope > 1.0 and an intercept < -0.1. In U. americanus, C. canadensis, and G. gorilla, fewer than five lncRNAs exhibited PTI scores of 0.05; thus, we excluded F(0.05) in these species from the linear approximation (indicated with asterisks in Figure 5). The five species that did not fit in the linear or sigmoidal group were characterized by high F(x) at low PTI scores, and belonged to plants (Z. mays), reptiles (A. carolinensis), and mammals (O. anatinus, S. boliviensis, and G. gorilla) (Figure 5). In these species, PTI scores showed a weaker association with protein-coding potential. Sigmoidal relationships were observed in 18/32 species, while linear relationships were apparent in nine species within three mammalian lineages: Cetartiodactyla, Rodentia, and Primates. Since the sigmoidal relationship was the most broadly observed across the examined lineages, this relationship appears to be of an ancestral type.
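The grouping criteria stated above (linear: R² > 0.9 and |intercept| < 0.1; sigmoidal: slope > 1.0 and intercept < -0.1) can be illustrated with a short routine. This is a minimal sketch rather than the authors' implementation: it assumes an ordinary least-squares fit over whatever (PTI score, F(x)) points the caller supplies, and the names `linear_fit` and `classify_relationship` are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept; returns (slope, intercept, R^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    # If F(x) is constant there is no variance to explain; report R^2 = 0 (assumption).
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return slope, intercept, r2


def classify_relationship(xs, ys):
    """Apply the thresholds quoted in the text to the fitted line."""
    slope, intercept, r2 = linear_fit(xs, ys)
    if r2 > 0.9 and abs(intercept) < 0.1:
        return "linear"
    if slope > 1.0 and intercept < -0.1:
        return "sigmoidal"
    return "other"


# Illustrative (made-up) F(x) values at class midpoints 0.05 ... 0.65.
xs = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65]
print(classify_relationship(xs, xs))                               # proportional data
print(classify_relationship(xs, [0, 0, 0, 0.1, 0.5, 0.9, 1.0]))    # switch-like data
```

The two conditions cannot both hold, because an intercept below -0.1 already violates |intercept| < 0.1, so the order of the checks does not matter.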

Characteristics of RNA virus genomes in human and bacterial cells
In sharp contrast to the coding transcripts of bacteria and archaea, the PTI score in [...]

Among the positive-sense ssRNA viruses registered in the NCBI database, 198 were human viruses and 13 were bacteriophages. We eliminated the viruses that translate [...] Table 8 [...]

Here, we showed that the PTI is associated with protein-coding potential in cellular organisms. In bacteria and archaea, the distributions of noncoding and coding transcripts were separated at low and high PTI scores, respectively, whereas they merged in eukaryotes. The overlapping distribution of noncoding and coding RNA in eukaryotes is caused by the right and left distribution shifts of noncoding and coding transcripts, respectively.
The right shifts in the distribution of noncoding RNA occurred for G. lamblia, one of the earliest diverging eukaryotes that contain two nuclei and lack mitochondria, [...]

In eukaryotes, we calculated protein-coding potential F(x) based on the overlapped relative frequencies of noncoding g(x) and coding transcripts f(x). The relationship between protein-coding potential and PTI score was divided into three groups: sigmoidal, linear, and others. Among them, switch-type sigmoidal relationships seem to be of the ancestral type based on their conservation in eukaryotes, and to have emerged after the all-or-none-type relationships in bacteria and archaea. Meanwhile, the linear group showed relatively high protein-coding potential at low or intermediate PTI scores, further blurring the boundary between noncoding and coding transcripts. For example, [...]

The PTI score is a value uniquely calculated based on any given RNA sequence, whereas the Ka/Ks value requires comparison with another species and varies depending on the species compared. Therefore, the Ka/Ks value cannot be uniquely calculated from a given RNA sequence. Hence, PTI scores, but not Ka/Ks values, exhibit species-specific relationships with unique distributions of coding potential. Owing to these advantages of the PTI score, we can investigate evolutionary changes in the relationship between PTI score and coding potential. Furthermore, Ka/Ks < 0.5 did not show different relative frequencies of noncoding and coding transcripts at a PTI score of approximately 0.45, suggesting that Ka/Ks does not predict the coding potential at this PTI score and that the prediction is dependent on the PTI score. In addition, to calculate Ka/Ks, orthologous transcripts or genomes should be conserved among species. Therefore, Ka/Ks cannot be applied for coding prediction of species-specific transcripts such as NCYM. On the other hand, by using approximate functions, we can calculate the coding potential F(x) of any given transcript with a PTI score ≤ 0.65 in the nine species classified into the linear group. Human NR transcripts with high PTI scores have been reclassified as coding genes over [...]

In conclusion, we identified a novel determinant of protein-coding potential in cellular organisms. The relationship between PTI score and protein-coding potential revealed that the boundary between coding and noncoding transcripts, which is distinct in bacteria and archaea, is blurred in eukaryotes. Therefore, when a eukaryotic transcript possesses a moderate PTI score, bifunctional characterization as both a coding and noncoding transcript may be essential for a full understanding of the biological roles of RNAs.

Materials and Methods
Potentially translated islands

Definition
PTIs are defined as sequence segments beginning at AUG and ending with any of UAA, UAG, or UGA in the 5ʹ to 3ʹ direction within an RNA sequence in all three possible reading frames (Figure 1A). [...]

The length of a PTI is the length of the amino acid sequence excluding the stop codon and is represented by $L_{PTI}$ (Figure 1A). In an RNA sequence, the longest PTI is designated as the primary PTI (pPTI), whereas the others are termed secondary PTIs (sPTIs). The lengths of a pPTI and sPTI are described as $L_{pPTI}$ and $L_{sPTI}$, respectively (Figure 1A).

Based on these definitions, the shortest PTI is "AUGUAA," "AUGUAG," or "AUGUGA," and its amino acid length is 1. For example, the NCYM transcript has a pPTI with a length of 109 at frame 1; three sPTIs with lengths of 69, 8, and 6, respectively, at frame 2; and no PTIs at frame 3 (Supplementary Figure 1B and C).
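The definition above can be made concrete with a small scanner. The following is a minimal sketch under stated assumptions, not the authors' code: within each reading frame a PTI is taken to start at the first AUG after the previous stop codon (in-frame AUGs inside an open PTI do not start a new one), and an AUG that never reaches an in-frame stop codon is not counted. The function name `find_ptis` is hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_ptis(rna):
    """Return a list of (frame, start_index, length_in_amino_acids) for each PTI.

    frame is 1, 2, or 3; start_index is the 0-based position of the AUG;
    the length counts codons from AUG up to, but excluding, the stop codon.
    """
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as well
    ptis = []
    for frame in range(3):
        start = None  # position of the AUG of the currently open PTI, if any
        for i in range(frame, len(rna) - 2, 3):
            codon = rna[i:i + 3]
            if start is None:
                if codon == "AUG":
                    start = i
            elif codon in STOP_CODONS:
                ptis.append((frame + 1, start, (i - start) // 3))
                start = None
    return ptis


# The shortest possible PTI has an amino acid length of 1.
print(find_ptis("AUGUAA"))
```

The primary PTI is then the entry with the largest length, and all remaining entries are secondary PTIs.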

Characteristics
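Therefore, the following relationship between the lengths of pPTI and sPTI holds, with $L_{pPTI}$ and $L_{sPTI}$ denoting the lengths of the pPTI and of an sPTI, respectively:

$$L_{pPTI} \geq L_{sPTI} \qquad (1)$$

Definition

The definition of PTI score is motivated by our hypothetical concept that translation of pPTI is limited by alternate competing sPTIs. We defined the PTI score (Figure 1A) according to Equations 2 and 3:

$$\text{PTI score} = \frac{L_{pPTI}}{\sum L_{PTI}} \qquad (2)$$

$$\sum L_{PTI} = L_{pPTI} + \sum L_{sPTI} \qquad (3)$$

where $\sum L_{PTI}$ represents the sum of all PTI lengths.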

Example
If an RNA sequence has only one PTI, the PTI score is 1 (Figure 1B). An RNA sequence with many sPTIs tends to have a score close to 0 (Figure 1B). If the sum of all sPTI lengths is equal to the pPTI length, the PTI score is 0.5 (Figure 1B). The PTI score of the NCYM transcript is calculated as 0.568 (Supplementary Figure 1C). Further information is included in Supplementary Notes.
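The worked examples above can be checked numerically. This is a minimal sketch that takes the PTI score to be the pPTI length divided by the sum of all PTI lengths, which is the reading consistent with the three cases in the text and with the NCYM lengths given earlier (109, 69, 8, and 6); the function name `pti_score` is hypothetical.

```python
def pti_score(pti_lengths):
    """PTI score = length of the longest PTI / sum of all PTI lengths."""
    if not pti_lengths:
        raise ValueError("the sequence contains no PTI, so the score is undefined")
    return max(pti_lengths) / sum(pti_lengths)


# NCYM: pPTI of length 109 plus three sPTIs of lengths 69, 8, and 6.
ncym_lengths = [109, 69, 8, 6]
print(round(pti_score(ncym_lengths), 3))  # 109 / 192 -> 0.568

# A single PTI gives a score of 1; sPTIs summing to the pPTI length give 0.5.
print(pti_score([50]))
print(pti_score([30, 20, 10]))
```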

Definition
We defined the relative frequencies of coding and noncoding transcripts, f(x) and g(x), as follows (Figure 1C): [...]

To define coding/noncoding transcripts with a PTI score of x, we made histograms divided into ten classes, and the median values of the classes were used to represent the PTI score (Figure 1C). Therefore, in Equations 5 and 6, the PTI score x is restricted as [...]

As an example, F(0.15) in human transcripts is depicted in Figure 1D.
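Because the defining equations did not survive in this copy, the following sketch only illustrates one plausible reading and should not be taken as the authors' formulas. It assumes that f(x) and g(x) are the fractions of coding and noncoding transcripts falling in each of ten equal-width classes on the PTI score axis (represented by the class midpoints 0.05, 0.15, ..., 0.95), and that F(x) = f(x) / (f(x) + g(x)). The function names and the handling of class boundaries are likewise assumptions.

```python
def relative_frequency(scores, n_bins=10):
    """Fraction of transcripts in each equal-width PTI score class on [0, 1]."""
    if not scores:
        raise ValueError("at least one PTI score is required")
    counts = [0] * n_bins
    for s in scores:
        # Assumption: a score of exactly 1.0 is assigned to the last class.
        counts[min(int(s * n_bins), n_bins - 1)] += 1
    return [c / len(scores) for c in counts]


def coding_potential(coding_scores, noncoding_scores, n_bins=10):
    """Map each class midpoint x to F(x) = f(x) / (f(x) + g(x)), or None if the class is empty."""
    f = relative_frequency(coding_scores, n_bins)
    g = relative_frequency(noncoding_scores, n_bins)
    result = {}
    for k in range(n_bins):
        x = round((k + 0.5) / n_bins, 2)  # class midpoint, e.g. 0.15
        total = f[k] + g[k]
        result[x] = f[k] / total if total > 0 else None
    return result


# Made-up PTI scores for illustration only.
coding = [0.95, 0.95, 0.55, 0.15]
noncoding = [0.15, 0.15, 0.15, 0.55]
F = coding_potential(coding, noncoding)
print(F[0.15], F[0.55], F[0.95])
```

Using relative frequencies rather than raw counts keeps F(x) from being dominated by whichever class of transcripts is more numerous in the database.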

Ratio of nonsynonymous (Ka) to synonymous (Ks) nucleotide substitution rates
To identify orthologous regions between human transcripts and chimpanzee/mouse genomes, BLAT v. 36 (Kent 2002) was run using human transcript sequences with the estimated PTI score against chimpanzee (PtRV2) and mouse (GRCm38.p6) genomic sequences defined in the NCBI database. We defined the BLAT best-hit genomic regions of chimpanzee/mouse as orthologs for each human transcript. The human-chimpanzee (or human-mouse) sequences were aligned for each exon region and the sequences were combined for each transcript. Only orthologous sequence pairs more than 60 bp in length (encoding > 20 amino acids) were extracted.