The longest intron rule

Despite the fact that long introns mean an energy and time burden for eukaryotic cells, they play an irreplaceable role in the diversification and regulation of protein production. A general feature is the accumulation of the longest introns closer to the start of protein-coding genes. Our work aimed to more closely characterize the genes in which the longest intron is only located in the second or third third of the gene. Data on the lengths of all introns in genes were extracted from the genomes of 4 vertebrates (human, mouse, chicken and zebrafish), nematode worm and yeast. With the genes divided according to the relative position of the longest intron, gene set enrichment analyses were performed, the results of which were then confronted with the results of randomly generated gene sets of the same size. The analyses show that among the genes associated with biological processes of ion transport across membranes, cell signaling or the development of multicellular organisms, there is a greater representation of genes with the longest intron in the first third. Conversely, among the genes associated with the processes of RNA processing and ribosome biogenesis, there are more genes with the longest intron in the second or third third. It is a newly discovered characteristic with more general validity among multicellular organisms.


The longest intron rule Introduction
Although in the genomes of eukaryotic organisms we find a few percent of genes that consist of only one exon, the vast majority of genes are made up of both exons and introns.The length of individual exons and introns can vary by several orders of magnitude even within a single gene, and this variability primarily concerns the length of introns.The biological function of exons as gene segments encoding the sequence of amino acids in polypeptide chains, except for the less frequent cases of exons in the untranslated regions of genes, was recognized soon after they were defined.The functions of introns were revealed gradually, independently of the discussions about the time of their origin (introns-early versus introns-late theories), and it is still an incompletely explored issue [1].Today, it is clear that introns play an irreplaceable role in increasing the diversity of proteins and in regulating the level of their gene expression [2,3].The fact that the longest intron is significantly longer than the other introns in the gene and that it is often located closer to the 5' end of the gene was already noted by the first works studying the structure of genes and genomes of sequenced eukaryotic organisms [4,5].Information on the detailed distribution of intron lengths in the genes of various organisms has, however, been published so far only little and scattered.In our first study on this topic, we further defined the distribution of the longest introns in human genes -64% of genes with the longest intron in the first third of all introns and 19 and 17% for the second and third third, with notable peaks for the middle and last introns of approx.5 and 6%, respectively [6].Moreover, we showed that localization of the longest intron in the second or third third is significantly more frequent for certain functionally related groups of genes, e.g. for DNA repair genes.In this follow-up study, we extended the number of tested genes to the entire human genome and further tested five genomes of other organisms -mouse, chicken, zebrafish, nematode worm and yeast -with the intention of finding a possible more generally applicable rule of the relationship between the position of the longest intron and the biological function of the gene.

Algorithm for calculating the lengths of all introns in protein-coding genes
In the first step, we obtained the positions of the beginnings (hereinafter referred to as Exon Start) and ends (Exon End) of all exons in the desired transcripts.For this, we used a query to the Ensembl database (https://www.ensembl.org/index.html;Release 108) [7] via the BioMart tool.In order to evaluate the most representative transcripts, we used MANE Select flags for Filters criteria in the human genome [8].Considering that MANE Select is not available for other genomes, we sorted the transcripts in other organisms based on the selection of criteria Gene type -protein coding and at the same time Ensembl Canonical.In the event that even Ensembl Canonical flags were not available for the given genome, e.g. in Caenorhabditis elegans, we used APPRIS annotation instead.Among the data (Attributes) we queried for each transcript were Gene stable ID, Gene stable ID version, Transcript Stable ID, Gene name, Strand, Exon rank in transcript, Exon region start (bp, base pairs) and Exon region end (bp).In the second step, we used an in-house shell script to calculate the lengths of individual introns from the obtained data (the code is available in Supplementary Information S1).Genes were sorted not by their names (symbols) but by Gene or Transcript Stable IDs.We calculated intron lengths in bp for genes on the Forward strand according to the formula {[Exon(n+1)Start -Exon(n)End] -1}; n are positive integers starting from 1.For genes on the Reverse strand, this formula was modified to {[Exon(n)Start -Exon(n+1)End] -1}.The script than created a table with the lengths of all introns for each protein-coding gene and searched for the position of the longest intron in the gene.In particular, we used AWK language to perform the following steps: 1) Calculate the lengths of the introns; 2) Extract Gene names if they are available; 3) Indicate the longest intron; 4) Calculate the relative position of the longest intron (the ratio of the position of the longest intron to the total number of introns in the given gene) and Bourne Again Shell (BASH) to create a matrix of Gene or Transcript Stable IDs versus the lengths of the introns in bp.The script can handle various delimiters and require an input file exported from Ensembl BioMart query with either Gene Stable ID Version or Transcript stable ID as the main unique identifier.The first seven columns in the file must be: (1) Gene stable ID, (2) Gene or Transcript stable ID (version), (3) Gene name, (4) Strand, (5) Exon rank in transcript, (6) Exon region start and (7) Exon region end.The primary data thus obtained, which we analyzed in this study, for individual organisms -Homo sapiens (human), Mus musculus (mouse), Gallus gallus (chicken), Danio rerio (zebrafish), Caenorhabditis elegans (worm) and Saccharomyces cerevisiae (yeast) -are stored in Supplementary Information Tables S2-7, respectively.Gene set enrichment analysis (GSEA) For GSEA, we selected genes that contain at least three introns from each studied genome, with the exception of yeast's genome, and further divided these genes into three subgroups according to the relative position of the longest intron (defined above).The position of the longest intron in the first third of all introns means that this ratio is in the interval (0;0.33].Similarly, for the second third it is in the interval (0.33;0.66] and for the third third in (0.66;1].In yeast, where there is only a minimal number of genes with more than one intron, we divided all protein-coding genes into only two tested groups: 1) without introns and 2) with introns.Our input (primary) data for GSEA can be obtained from Table S8.We performed GSEA in parallel using two web platforms -g:Profiler (https://biit.cs.ut.ee/gprofiler/gost) [9] and ShinyGO (http://bioinformatics.sdstate.edu/go/)[10], taking into account the procedure recommended in the work of Reimand et al. [11].With the g:Profiler program, we used the option to analyze multiple gene files simultaneously (Run as multiquery option), other possible options for setting the result parameters were left in the default settings, even with the ShinyGO program.We used the AmiGO 2 project (http://amigo.geneontology.org/amigo/landing)[12] for better orientation in the hierarchical structure of GO terms (Gene Ontology; http://geneontology.org/) [13] in the resulting lists of terms and the tool for creating Venn diagrams Multiple List Comparator (https://molbiotools.com/listcompare.php)[14] for finding common terms between these lists.KEGG diagrams (Kyoto Encyclopedia of Genes and Genomes; https://www.genome.jp/kegg/)[15] were created with the help of Pathview (https://pathview.uncc.edu/)[16].Random gene lists In order to obtain comparable control data, we generated random sets of genes using the RSAT web server (http://rsat.sb-roscoff.fr/index.php)[17] and performed the same analyses as the primary data.For each of the multicellular organisms tested, we generated three sets, for yeast only two, always with the same number of genes as for adequate primary data.Control data is stored in Table S9.

Characteristics of the studied genomes
We analyzed 4 genomes of representatives of the vertebrate subphylum (human, mouse, chicken and zebrafish) and one genome each from the phylum Nematoda (worm) and division Ascomycota (yeast).Of the mentioned vertebrates, the chicken has the fewest protein-coding genes (approx.17,000) and the zebrafish the most (30,000), the human and the mouse have 19 and 22 thousand genes respectively.The median number of introns in a gene is similar among all four vertebrates, i.e. 6 or 7, and approximately 80% of genes carry 3 or more introns.Human and mouse have approximately 60% of genes that have their longest introns located in the first third and 20% each of genes that have their longest introns located in the second and third thirds.In chicken and zebrafish, this ratio is skewed towards a higher percentage of genes that have the longest intron in the third third, i.e. 24 and 32%, respectively.In the nematode worm, the number of protein-coding genes is also about 20 thousand, but the median number of introns in a gene is only 4, and the number of genes with the longest intron in the first third is even lower (about 40%).From the point of view of introns, the unicellular yeast genome, with approximately 7,000 genes, is fundamentally different.96% of the genes do not have a single intron present, and the remaining 4% are genes with one intron (only 13 yeast genes carry two introns and 5 genes more than two).Therefore, the division into groups according to the position of the longest intron is not relevant here.More precise values are shown in Table 1.3), ( 4), ( 6) and ( 8), and the term monoatomic ion transport (9) is superior to the terms ( 5), ( 6), ( 7), ( 8) and (11); the other terms are not closely related.Another 121 terms (shown in Table S11) were common to four of the tested multicellular organisms.P values for selected terms are demonstrated in Table 2.

Overrepresentation of specific Gene Ontology terms among genes that have the longest intron localized in the second or third part of the gene
In a similar manner, in all evaluated multicellular organisms, three GO terms from the Biological Process category -(1) ncRNA metabolic process, (2) ncRNA processing and (3) ribosome biogenesisshowed a significantly higher occurrence exclusively among genes in which the longest intron is located in the second or third part.Even in the case of yeast, these three GO terms are overrepresented among genes with introns compared to genes without introns (Table 3).The term No. ( 1) is superior to (2).A number of other terms showed the same characteristics in several of the studied genomes.In particular, the following five terms -RNA modification, chromosome organization, rRNA metabolic process, tRNA metabolic process and translation -were common to the four multicellular organisms and another twelve terms including DNA repair, cilium organization, methylation, nuclear division and rRNA processing to three organisms (Table S12).All these results for the longest introns localized in the second or third part are listed in Table S13 including P values of individual GO terms.

Increased occurrence of genes included in certain KEGG pathways depending on the position of the longest intron in the gene
The analysis focused on the participation of genes, divided according to the position of the longest intron, in specific KEGG pathways brought additional results.The number of genes with the longest intron in the first third was significantly increased, in all 5 tested multicellular organisms, for the following 3 pathways: Calcium signaling pathway, FoxO signaling pathway and Purine metabolism.Additional thirty-eight pathways, including Apoptosis, Adherens, Tight and Gap junctions, ErbB, Hedgehog and Insulin signaling pathways, and Carbon and Glycerolipid metabolisms, met statistical significance in 4 organisms (Table S14).Conversely, for genes with the longest intron in the second or third third, the most represented pathways were: Biosynthesis of cofactors, Ribosome, Spliceosome, Cell cycle, Proteasome and Ribosome biogenesis in eukaryotes (Table S15).A demonstrative selection of KEGG pathways is shown in Table 4, and more detailed results can be found in Tables S16   and S17.Genes from the Spliceosome pathway, which have the longest intron located in the second or third third of the gene in the model organism zebrafish, are highlighted in Figure 1.*calculated by the ShinyGO program; adj., adjusted; NS, not significant Figure 1: The Spliceosome pathway in the model organism zebrafish.The genes, which have the longest intron located in the second or third third of the gene are highlighted in red.

Results from control datasets
No KEGG pathway was significantly increased in the control data, nor were any of the GO terms belonging to the Biological Process category, which were overrepresented in the primary data.Only a few terms from other categories (e.g.cytoplasm and protein binding) met the criteria for statistical significance in random gene control sets, but only in single cases without repetition in another organism.

Discussion
Although longer introns represent an energy and time burden for the genetic apparatus of the cell trying to efficiently transfer information from its storage to concrete implementation, they perform important and apparently irreplaceable tasks in this process.Our presented analyses show that a characteristic position of the longest intron can be found in genes with certain biological functions.
The results in five multicellular and one unicellular eukaryotic organisms suggest the more general validity of this previously unreported characteristic.
We have evidence that longer introns contain higher densities of conserved sites [18] and various regulatory elements [19,20].Also, the presence of longer introns closer to the 5' end of eukaryotic genes is a well-known phenomenon [21], which is explained by a greater occurrence of regulatory elements and thus a greater restriction of selectivity in these introns.The current work, on the other hand, aimed to focus on genes that have the longest intron located in the second or third third of the gene.We specified the proportion of these protein-coding genes in the genomes of five multicellular organisms.Such genes make up 61% of the nematode worm genome, 56% in zebrafish, 45% in chicken, and only around 40% in humans and mice.It appears that the proportion of genes with the longest intron in the first third of the gene increases significantly with the complexity of multicellular organisms.It is evident from our results that among the genes associated with the processes of cell signaling, export from cell, cell junction organization, peptidyl-amino acid modification or developmental growth, there is a greater representation of genes with the longest intron in the first third.And this characteristic is again more evident in more complex organisms.Conversely, among the genes associated with nuclear division, ribosome biogenesis, macromolecule methylation or cytoplasmic translation, there are more genes with the longest intron in the second or third third; with more general validity among studied organisms.The term ncRNA (non-coding RNA) is superior both to RNA molecules involved in the formation of proteins according to the information stored in mRNA (rRNA, tRNA, snRNA and snoRNA), and to RNA molecules with various other regulatory functions (snoRNA, miRNA, siRNA and others), see e.g., the work of Mattick and Makunin [22].Thus, the term ncRNA processing encompasses a whole constellation of more broadly related events.However, the core RNA processing mechanisms and machinery are highly conserved, with dysregulation of constitutive patterns often having severe cellular consequences [23].Some authors even consider RNA processing as a relic that can be traced back either to the RNA world that preceded both DNA and encoded protein synthesis or to the later ribonucleoprotein (RNP) world [24].The differences in the length structure of the introns, discussed by us, in these genes are most likely related to this high evolutionary conservation.The influence of introns on the regulation of genes for ribosomal proteins and the fact that these genes carry introns even in the otherwise intron-poor yeast genome have already been mentioned in the published literature [25].The presence of introns substantially increases the complexity of ribosomal protein gene expression as they variably slow the expression cycle, and in addition, many introns can contain non-coding RNA involved in other layers of regulation [2].In current works on ribosomal proteins, their other functions beyond the framework of ribosome formation and maintenance are emphasized, e.g.activation of p53-dependent pathway in response to stress [26,27].Thus, the significance of the longest intron feature of these genes highlighted by us may also turn out to be broader and awaits further investigation.

Conclusions
In eukaryotic organisms, genes related to more complex functions of multicellular organisms tend to locate the longest intron in the first third of the gene, while genes related to very ancient and evolutionarily conserved functions related to RNA processing show the position of the longest intron mainly in the second or third third.

Table 1 :
Numbers and localization of introns within protein-coding genes of six studied eukaryotic genomes.,amonggenes/transcriptswith three and more introns; **,since a single representative transcript is not available in C. elegans, all protein-coding transcripts were evaluated; g, gene(s); i, intron(s); N, number; nr, not relevant; t, transcript(s); Overrepresentation of specific Gene Ontology terms among genes that have the longest intron localized in the first part of the gene The analysis focused on GO terms from the Biological Process category characterizing genes that have the longest intron located in the first third of the gene found a large number of terms exclusively overrepresented in this subgroup, especially in the human and mouse genomes (TableS10).The intersection of all these terms among the five studied multicellular organisms contained the following 12 terms: (1) cell morphogenesis, (2) dephosphorylation, (3) inorganic cation transmembrane transport, (4) inorganic ion transmembrane transport, (5) metal ion transport, (6) monoatomic cation transmembrane transport, (7) monoatomic cation transport, (8) monoatomic ion (12)nsmembrane transport, (9) monoatomic ion transport, (10) multicellular organismal process,(11)sodium ion transport and (12) transmembrane transport.The term transmembrane transport(12)is superior to the terms (

Table 2 :
Selected Gene Ontology (GO) terms exclusively overrepresented in genes with the longest intron localized in the first part of the gene

Table 3 :
Selected Gene Ontology (GO) terms exclusively overrepresented in genes with the longest intron localized in the second or third part of the gene the longest intron number to number of all introns ratio; **calculated by the g:Profiler program; adj., adjusted; NS, not significant *

Table 4 :
Selected KEGG pathways overrepresented among genes divided according to the position of the longest intron