Genomic Abelian Finite Groups

Experimental studies reveal that genome architecture splits into DNA sequence domains, suggesting a well-structured genomic architecture where, for each species, genome populations are integrated by individual mutational variants. Herein, we show that, consistent with the fundamental theorem of Abelian finite groups, the architecture of population genomes from the same or closely related species can be quantitatively represented in terms of the direct sum of homocyclic Abelian groups of prime-power order defined on the genetic code and on the set of DNA bases, where populations can be stratified into subpopulations with the same canonical decomposition into p-groups. Through concrete examples we show that the architectures of currently annotated genomic regions, including (but not limited to) transcription factor binding motifs, promoter regulatory boxes, and exon and intron arrangements associated with gene splicing, can feasibly be modeled as decomposable Abelian p-groups. Moreover, we show that the epigenomic variations induced by diseases or environmental changes can also be represented as an Abelian group decomposable into homocyclic Abelian p-groups. The nexus between the direct sum of homocyclic Abelian p-groups and the endomorphism ring paves the way to unveil unsuspected stochastic-deterministic logical propositions ruling the ensemble of genomic regions. Our study aims to set the basis for concrete applications of the theory in computational biology and bioinformatics. Consistent with this goal, a computational tool is provided for the analysis of fixed mutational events in gene/genome populations, represented as endomorphisms and automorphisms.
Results suggest that complex local architectures and evolutionary features not evident through direct experimentation can be unveiled through the analysis of the endomorphism ring and the subsequent application of machine learning approaches for the identification of stochastic-deterministic logical rules (reflecting the evolutionary pressure on the region) constraining the set of possible mutational events (represented as homomorphisms) and the evolutionary paths.

are reported in several diseases [5]. Hence, some hierarchical logic is inherent to the genetic information system, which makes it amenable to mathematical study. In particular, there are mathematical-biology reasons to analyze the genetic information system as a communication system [7][8][9][10].

We propose the study of genome architecture in the context of population genomics, where all the variability constrained by the evolutionary pressure is expressed. Despite the random nature of the mutational process, only a small fraction of mutations is fixed in genomic populations. In particular, fixation events, ultimately guided by random genetic drift and positive selection, are constrained by the genetic code, which permits a probabilistic estimation of the evolutionary mutational cost by simulating the evolutionary process as an optimization process with genetic algorithms [11].

Under the assumption that current forms of life evolved from simple primordial cells with very simple genomic structure and robust coding apparatus, the genetic code is a fundamental link to the primeval

of homocyclic 2-groups and 5-groups defined on the genetic code. Concepts and basic applications are introduced step by step, sometimes with statements that are self-evident for a reader familiar with molecular biology. However, it will be shown that the algebraic modeling is intended to unveil relationships between the molecular evolutionary process and the genomic architecture that are more complex than those visible to the naked eye. This goal is illustrated in section 3.2. Our algebraic model approach is intended to set the theoretical basis for further studies addressed to unveil and to

in SI Fig 2 [11]).
Simulation of the evolutionary mutational process with the application of genetic algorithms indicates that fixed mutational events found in different protein populations are very restrictive, in the sense that the optimal evolutionary codon distances are reached for specific models of genetic-code cube or for specific combinations of genetic-code cube models [11]. In the present work, it will be shown that codon mutational events represented in terms of automorphisms can also be restrictive for specific genetic-code cube models (section 3.1).

The genetic code
All the Abelian p-groups included in the current work are oriented to the study of the mutational process [11,20-23]. That is, we are interested in those structures that permit the analysis and quantitative description of the mutational process in organismal populations, where mutational events can be represented by means of endomorphisms, automorphisms, and translations on a group. In the present work, we do not include algebraic structures designed to study the origin and evolution of the genetic code [11,21]. The genetic code is taken as it currently is, without imposing any evolutionary hypothesis on it.

A general model also considers Abelian 5-groups that include a dummy variable (denoted by the letter D), which extends the DNA alphabet to five letters. The usefulness of including a fifth base in the evolutionary analysis was shown in reference [23], where two evolutionary models, an algebraic model and a stationary Markov (process) model, were applied to phylogenetic analysis, both reaching greater discriminatory power than the (now) classical Tamura

version: 3.16) and, also, in GitHub at: https://github.com/genomaths/GenomAutomorphism. The whole R script pipeline applied in the estimation of automorphisms (Fig 8) and decision tree (Fig 9)

Then, it is said that G is the direct sum of its subgroups $B_i$, which formally is expressed by the expression:

Genomic DNA sequences from higher organisms are integrated by intergenic regions and gene regions. The former are the larger regions, while the latter include the protein-coding regions as subsets. The MSA of DNA and protein-coding sequences reveals allocations of the nucleotide bases and amino acids into stretches of strings. The alignment of these stretches indicates the presence of substitution, insertion, and deletion (indel) mutations.
As a result, the alignment of homologous genomic regions or whole-chromosome DNA sequences from several individuals from the same or closely related species can be split into well-defined subregions or domains, each of which can be represented as a homocyclic Abelian group, i.e., as the direct sum of cyclic groups of the same prime-power order (Fig 3). Consequently, each DNA sequence is represented as an N-dimensional vector with numerical coordinates representing bases and codons.

Table 1

Although the above direct sums of Abelian p-groups provide a useful compact representation of an MSA, for applications in genomics we would also consider using the concept of direct

where gaps representing base D from the extended genetic code were added to preserve the coding frame, which naturally is restored by splicing soon after transcription. b, A gene model where both exons, 1 and 2, carry a complete set of three codons (base-triplets). Both gene models, from panels a and b, share a common group representation as direct sum of Abelian 5-groups.

The respective exon regions have different lengths, and gaps ("₋", representing base D in the extended genetic code) were added to exons 1 and 2 (from panel a) to preserve the reading frame in the group representation (after transcription and splicing, gaps are removed). Both gene models, from panels a and b, share a common direct sum of Abelian 2-groups and 5-groups: 3  7  2  8  3  6  3  5  5  2  5  5  5 n n m n

The analysis of these gene models suggests that DNA sequences sharing a common group representation as direct sum of Abelian p-groups would carry the same, similar, or closely related biological information. However, this does not imply that the architecture of these protein-coding regions is the same. The gene model in

particularly, Sp1 transcription factors. A putative GC box was included in exon 2, which is an atypical scenario, but it can be found, e.g., in the second exon of the gene encoding sphingosine kinase 1 (SPHK1), transcript variant 2 (NM_182965, CCDS11744.1). In this group representation, the spliceosome donor GTR can be represented by the elements from a quotient group (see main text).

Since purine bases (R: A and G) are the only accepted variants at the third codon position, it is convenient to model these base-triplets with the group defined on the cube AGCU [11] (SI Fig 3). Next, following analogous reasoning as in [22], it turns out that the set of base-triplets GTR is a coset

(Fig 7c and d)

The MSA of a whole genome derives from the MSA of every genomic region, from populations from the same or closely related species. At this point, it is worth recalling that there is not, for example, just one human genome or just one genome of any other species, but populations of human genomes and genome populations from other species. Since a relatively small genomic region can be represented by the direct sum of Abelian homocyclic groups of prime-power order, the whole genome population from individuals from the same or closely related species can be represented as an Abelian group, which will be, in turn, the direct sum of Abelian homocyclic groups of prime-power order. Hence, results lead us to the representation of genomic regions from organismal populations from the same species or closely related species (as suggested in Figs 3 to 7) by means of the direct sum of their group representations into Abelian cyclic groups.
A general illustration of this modelling would be, for example:

where the exponents and stand for the number of components in the given homocyclic group. That is, Eq. 7 expresses that any large enough genomic region can be represented as a direct sum of homocyclic Abelian groups of prime-power order. In other words, the fundamental theorem of Abelian finite groups (FTAG) has an equivalent in genomics.

Theorem 1. The genomic architecture from a genome population can be quantitatively represented as an Abelian group isomorphic to a direct sum of homocyclic Abelian groups of prime-power order.

The proof of this theorem follows from the discussion and examples presented here. Basically, group representations of the genetic code lead to group representations of local genomic domains in terms of cyclic groups of prime-power order, for example, ( ) ( ) , until covering the whole genome. As for any finite Abelian group, the Abelian group representation of genome populations can be expressed in terms of a direct sum of Abelian homocyclic groups of prime-power order. Any new discovery in the annotation of a given genome population will only split an Abelian group, already defined on some genomic domain/region, into the direct sum of Abelian subgroups ■.

The application of the FTAG in terms of the group representation of genomic regions G, as given in Eq. 7, establishes the basis for the study of the molecular evolutionary (mutational) processes in terms of endomorphisms. That is, fixed mutational events in the organismal population can be modeled as homomorphisms: endomorphisms and automorphisms, all elements of the endomorphism ring $R(G)$ on G (see next section).
In the context of comparative evolutionary genomics, the analysis of the endomorphism ring $R(G)$ is an intermediate step for the further application of methods from category theory, which has the potential to unveil unsuspected features of the genome architecture that are hard to infer from direct experimentation.
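As a minimal sketch of what Theorem 1 implies computationally, an element of a direct sum of homocyclic groups can be stored as a vector of integers, with each coordinate added modulo the prime-power order of its cyclic component. The block list below (primes, exponents, ranks) and the coordinate values are hypothetical and chosen only for illustration; they are not taken from any figure or dataset in this work.

```python
from math import prod

# Hypothetical decomposition of a small genomic region, one tuple per
# homocyclic block: (prime p, exponent e, rank r) stands for (Z_{p^e})^r.
BLOCKS = [(2, 6, 2), (5, 3, 1), (2, 6, 3)]


def moduli(blocks):
    """Flatten the block list into one modulus per vector coordinate."""
    return [p ** e for p, e, r in blocks for _ in range(r)]


def add(u, v, blocks):
    """Componentwise sum of two elements of the direct sum."""
    mods = moduli(blocks)
    if not (len(u) == len(v) == len(mods)):
        raise ValueError("vector length must equal the total rank")
    return [(a + b) % m for a, b, m in zip(u, v, mods)]


def group_order(blocks):
    """Order of the direct sum: product of p^(e*r) over all blocks."""
    return prod(p ** (e * r) for p, e, r in blocks)


# Illustrative coordinates only (not real codon encodings).
u = [9, 32, 100, 24, 56, 60]
v = [60, 40, 30, 50, 10, 4]
w = add(u, v, BLOCKS)
```

Two regions with the same block list live in the same group, so their vectors can be compared and added directly; regions with different block lists would need separate representations.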

The endomorphism ring
In other words, mutational events fixed in gene/genome populations can be quantitatively described as endomorphisms and automorphisms.

In the Abelian p-group defined on

In this modeling, mutational events are represented as endomorphisms. This fact permits the study of the genome architecture through the study of the evolutionary (mutational) process in a genome population. Moreover, the decomposition of the endomorphism ring into subgroups, quotient groups, and cosets can lead to a deterministic algebraic taxonomy of the species based on their genome architecture, which is not limited by our current biological knowledge. Particularly relevant for evolutionary comparative genomics is the Baer-Kaplansky theorem: if G and H are p-groups such that their endomorphism rings are isomorphic, then G and H are isomorphic.

Application of the Baer-Kaplansky theorem implies that two gene-body regions encoding exactly the same polypeptide but with different region architecture (Fig 5) are under different evolutionary pressure. That is, if the group representations of two homologous gene-body regions are not isomorphic, then their endomorphism rings are not isomorphic either and, consequently, they will be under different evolutionary pressure, experiencing different subsets of mutational events, which are represented as endomorphisms from their corresponding endomorphism rings. This scenario is typically found in some isoforms, which are proteins that are similar to each other and perform similar roles within cells [48]. This is the case where two or more closely related genes are responsible for the same translated protein, illustrated in

Fig 2), the frequency of purine (HHR) transitions is followed by pyrimidine (HHY) transitions.

The analysis of the pairwise alignment of protein-coding regions of SARS and Bat SARS-like coronaviruses is presented in Fig 8b and c.
Most of the mutational events distinguishing human SARS from Bat SARS-like coronaviruses can be described by automorphisms on cube ACGT. This observation was confirmed in primate somatic cytochrome c (Fig 8c) and the BRCA1 DNA repair gene (Fig 8d). Since automorphisms transform the null element (gap-triplet DDD/---) into itself, insertion-deletion mutational events cannot be described by automorphisms but as translations on the groups (denoted as Trnl in Fig 8). The representation of conserved genomic regions with homocyclic p-groups is straightforward. However, their frequency in the genome architecture decreases exponentially with the size of the region (Fig 8f and SI Fig 4).

the two mentioned bat strains. The best fitted probability distribution turned out to be the generalized gamma distribution.

Next, under the assumption that Eq. 8 holds, different protein-coding regions must experience "preference" for specific types of automorphisms. To illustrate the concept with an analysis based on the application of Theorem 1 and Eq. 8 to gene/genome population studies, decision tree algorithms were applied to primate BRCA1 genes. Results for the analysis with Chi-squared Automated Interaction Detection (CHAID) are presented in Fig 9. It is important to keep in mind that this is only an illustrative example with a small sample size, and that definite conclusions related to BRCA1 genes can only be derived with a larger sample of human and non-human primate sequences. In this algorithmic approach, for each compound category consisting of three or more of the original categories, the algorithm finds the most statistically significant binary split for a node (split-variable) based on a chi-squared test [52].
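The remark above, that automorphisms fix the null element so that indels must be modeled as translations, can be checked on a single cyclic component. The sketch below relies on the standard fact that the automorphisms of the cyclic group of integers modulo n are the multiplications by integers coprime to n. The function names and the integer values are hypothetical; this is not presented as the procedure implemented in the GenomAutomorphism package.

```python
from math import gcd


def find_automorphism(x, y, n):
    """Return the smallest k coprime to n with k*x = y (mod n), or None.

    Multiplication by such a k is an automorphism of Z_n mapping x to y.
    """
    for k in range(1, n):
        if gcd(k, n) == 1 and (k * x) % n == y % n:
            return k
    return None


def translation(x, y, n):
    """Return b such that x + b = y (mod n); a translation always exists."""
    return (y - x) % n


n = 64                               # order of the cyclic group Z_{2^6}
k = find_automorphism(9, 27, n)      # substitution-like event: 9 -> 27
gap = find_automorphism(0, 5, n)     # null element -> non-null: impossible
shift = translation(0, 5, n)         # the same event as a translation
```

Because k * 0 is always 0, no automorphism can send the null element to a non-null element, which is why the second call finds nothing while the translation succeeds.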
For a given MSA of protein-coding regions, the resulting decision tree leads to stochastic-deterministic logical rules (propositions) permitting a probabilistic estimation of the best model approach holding Eq. 8. For example, since only one human-to-human mutational event from class A3 is reported in the right side of the tree (Fig 9), with high probability the proposition "(A4 ˅ (A3 ˄ ¬ HRH)) → ¬ human" is true. That is, with high probability only non-humans hold the last rule.

Likewise, an estimation of mutational cost can be given in terms of distances between amino acids based on codon distances defined on a specific genetic-code cube model or on a combination of two models [11,54]. Examples of some stochastic mutational rules are given in Table 1.
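To make the notion of a stochastic-deterministic rule concrete, the sketch below evaluates the proposition "(A4 ˅ (A3 ˄ ¬ HRH)) → ¬ human" on a handful of records and reports the fraction of matching records that are non-human. The records, the field names (`autm`, `mut_type`, `human`), and the resulting fraction are entirely hypothetical toy values; they are not the data behind Fig 9.

```python
def antecedent(rec):
    """Left-hand side of the rule: A4 or (A3 and not HRH)."""
    return rec["autm"] == "A4" or (
        rec["autm"] == "A3" and rec["mut_type"] != "HRH"
    )


def rule_support(records):
    """Fraction of records matching the antecedent that are non-human.

    Returns None when no record matches the antecedent.
    """
    hits = [r for r in records if antecedent(r)]
    if not hits:
        return None
    return sum(1 for r in hits if not r["human"]) / len(hits)


# Hypothetical toy records, for illustration only.
records = [
    {"autm": "A4", "mut_type": "HHY", "human": False},
    {"autm": "A3", "mut_type": "YHH", "human": False},
    {"autm": "A3", "mut_type": "HRH", "human": True},
    {"autm": "A3", "mut_type": "HHR", "human": True},
    {"autm": "A1", "mut_type": "HHY", "human": True},
]
support = rule_support(records)
```

A support close to 1 on a sufficiently large sample is what "true with high probability" would mean for such a rule; with a sample this small the number carries no statistical weight.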

Our results provide supporting evidence for the previous finding reported in [11] that the selection of the genetic-code cube model cannot be arbitrary, since the automorphisms and the estimation of mutational costs (as defined in [11]) on different local DNA protein-coding regions show a clear "preference" for specific models. The mathematical model is only a tool (a representation of the physicochemical relationships between molecules) applied to uncover the existence of specific evolutionary constraints on the transmission of genetic information.

In this section we highlight a direction for future theoretical developments. Full coverage of this topic is beyond the scope of the current work; nevertheless, a sketch of a future direction is presented here. Our goal will be the description of the mutational process on protein-coding regions in terms of homomorphisms of different algebraic structures.

Genomic regions represented as an Abelian group decomposable into homocyclic Abelian p-groups, e.g.
This group is an element of the Ab category ′ , which is a subcategory of the + -Mod category R over the ring

a = (9,32,24,56,60,27,28,5) and = (8,1,56,60,28,2), respectively, with coordinates on the ring + . The coordinates of sequences A and B on the ring + permit the new representations: ′ = $(9 \in \mathbb{Z}_{2^6}, 32 \in \mathbb{Z}_{5^3}, (24,56,60) \in \mathbb{Z}_{2^6}, (27,28,5) \in \mathbb{Z}_{2^6})$ and ′ = $((8,1) \in \mathbb{Z}_{2^6}, 0 \in \mathbb{Z}_{5^3}, (56,60) \in \mathbb{Z}_{2^6}, 0 \in \mathbb{Z}_{5^3}, (28,2) \in \mathbb{Z}_{2^6})$, respectively. Sequence ′ is mapped into ′ by the group homomorphism : '

Or by means of the affine transformation:

(Fig 7a and b). The group representation of protein-coding regions (or base-triplet sequences) as numerical vectors with coordinates on 3 5

Results indicate that, as a consequence of the genetic code constraints and the evolutionary pressure on protein-coding regions, stochastic-deterministic logical rules can be inferred from a large enough sample of a gene/genomic-region population. Such stochastic-deterministic rules lead to specific applications of Theorem 1 and Eq. 8 and, consequently, to the analysis of the mutational process on each group, subgroup, and coset. For example, mutational events on a MSA column (identified) from class YHH (with discriminatory classification power as shown in Fig 8)

to be detected by current experimental approaches. All the information required can be retrieved from the MSA of DNA sequences, which is particularly relevant for poorly annotated genomes.

Results shown in Figs. 8 and 9 also suggest deep implications of the Baer-Kaplansky theorem for the genome architecture. Concretely, in an evolutionary context, the fact that two genomic regions from two different species are identical or almost identical, and that they can even encode the same functional protein, does not necessarily imply that they hold the same genome architecture.
The Baer-Kaplansky theorem implies that to hold the same architecture, these hypothetical regions must also experience equivalent evolutionary pressure, which in turn implies that the regions must experience the same type of mutational events in terms of automorphism/endomorphism representations.

For example, suppose that the results shown in Fig 9 were derived from a large sample (large enough to derive statistically significant rules). Then the rule "A1 ˄ R3 ˄ (YHH ˅ HHY) → human" (Fig 9) implies that the BRCA1 gene regions from human and non-human primates do not belong to the same equivalence class of genomic regions. In particular, since the endomorphism rings ( )

A group isomorphism is a one-to-one correspondence (mapping) between two sets that preserves the binary operations between elements of the sets. That is, an isomorphism is a homomorphism holding the inverse mapping:

As shown in the next section, these algebraic structures have been defined on the genetic code. In particular, the ring ( )

All the data, computational and statistical analyses can be reproduced following the R scripts provided in tutorials available at the GenomAutomorphism R package website https://genomaths.github.io/genomautomorphism/. In particular, data and R scripts used in the computation of automorphisms and the decision tree from Fig 9 are

Reported genetic code abelian groups relevant for the current study
Herein, we assume that readers are familiar with the definition of an abelian group, which can otherwise be found in textbooks and elsewhere, including Wikipedia. All the abelian groups discussed here are isomorphic to the well-known abelian groups of integers modulo n, which are accessible to readers with a college-level education. For example, the abelian group defined on the set {0, 1, 2, 3, 4} corresponds to the group of integers modulo 5 ($\mathbb{Z}_5$), where (2 + 1) mod 5 = 3, (1 + 3) mod 5 = 4, (2 + 3) mod 5 = 0, etc. The underlying biophysical and biochemical reasoning behind the algebraic operations defined on the set of DNA bases and on the codon set was given in references [12,14,17].
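The arithmetic in the example above can be reproduced with a few lines of Python. This is only an illustrative sketch of the group of integers modulo 5; the helper names `add_mod` and `neg_mod` are introduced here and do not come from the references.

```python
def add_mod(a, b, n):
    """Sum of two residues in the cyclic group of integers modulo n."""
    return (a + b) % n


def neg_mod(a, n):
    """Additive inverse of a residue modulo n."""
    return (-a) % n


n = 5
elements = range(n)

# Cayley (operation) table of the group: row a, column b holds a + b mod 5.
table = [[add_mod(a, b, n) for b in elements] for a in elements]

for row in table:
    print(" ".join(str(x) for x in row))
```

Each row and each column of the printed table is a permutation of {0, 1, 2, 3, 4}, as expected for a group operation, and 0 acts as the identity element.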
The $\mathbb{Z}_{64}$-algebras of the genetic code (Cg)
The $\mathbb{Z}_{64}$-algebras of the genetic code (Cg) and gene sequences were stated several years ago. In the $\mathbb{Z}_{64}$-algebra Cg the sum operation, defined on the codon set, is a way to consecutively obtain all codons from the codon AAC (UUG), in such a way that the genetic code represents a non-dimensional scale of amino acid interaction energy in proteins.
A description of the genetic code abelian finite group (Cg, +) can be found in [12]. Group ( )    . Each element of this group represents an equivalence class of codons.
Two triplets $X_1X_2X_3$ and $Y_1Y_2Y_3$ are equivalent if, and only if, the difference $X_1X_2X_3 - Y_1Y_2Y_3$

In biological terms, substitution mutations involving codons from the same class will not alter (or at least will not substantially alter, in most cases) the physicochemical properties of the encoded protein domains, since the worst scenario involves amino acids with very close physicochemical properties, with the exception of the codon for the amino acid tryptophan.

The ℤ group of the extended genetic code (Ce)
The extension of the genetic code group ( ) x x x = + + (see Table 1).

The � , +� group of the extended genetic code (Ce)
Formally, the genetic code is limited to translated coding regions where the number of RNA bases is a multiple of 3. However, as suggested in reference [17], the difficulties in the prebiotic synthesis of the nucleoside components of RNA (nucleobase + sugar) suggest that some of the original bases may not have been the present purines or pyrimidines [18]. Piccirilli et al. [19] demonstrated that the alphabet can in principle be larger. Switzer et al. [20] have shown an enzymatic incorporation of new functionalized bases into RNA and DNA. This expanded the genetic alphabet from 4 to 5 or more letters, which permits new base pairs and provides RNA molecules with the potential to greatly increase their catalytic power.
It is important to notice that even in the current (friendly) environmental conditions not a single cell can survive without a DNA repair enzymatic machinery, and that such an enzymatic machinery did not exist at all in the primeval forms of life. Here, we are confronting the chicken-and-egg problem. To date, the best solution (to our knowledge) is the admission of alternative base pairs in the primordial DNA alphabet which, as suggested by studies on prebiotic chemistry, could contribute to the thermal and general physicochemical stability of the primordial DNA molecules. Cytosine DNA methylation results from the addition of methyl groups to cytosine C5 residues, and the configuration of methylation within a genome provides trans-generational epigenetic information. These epigenetic modifications can influence the transcriptional activity of the corresponding genes, or maintain genome integrity by repressing transposable elements and affecting long-term gene silencing mechanisms.

It is worth noting that there are 24 ways to define each one of the above-mentioned algebraic structures [30,31]. Nevertheless, for each defined genetic code group, there is only one genetic code abelian group up to isomorphism, which leads to its representation as an abelian group where the sum operation corresponds to the sum of integers modulo $n \in \{2, 2^6, 5, 5^3\}$.

REST DNA sequence motifs. R code
The DNA sequence motifs targeted by transcription factors usually integrate genomic building blocks across several species.

DNA sequence alignment of the protein-coding sequences from phospholipase B domain containing-2 (PLBD2) carrying the footprint sequence motif recognized (targeted) by the RE1-Silencing Transcription factor (REST), also known as Neuron-Restrictive Silencer Factor (NRSF).

Table S2
Operation tables of the Klein four group defined on the ordered set of four DNA bases {A,G,C,U}.

Figure S2
Automorphism distribution per mutation type in primate BRCA1 gene. As a result of the evolutionary pressure on DNA protein-coding regions (addressed to preserve the amino acid physicochemical properties and, consequently, the biological functions of the encoded proteins), the highest mutational rate is found in the third base of the codon, followed by the first base, and the lowest rate is found in the second one [23]. DNA bases are classified based on the physicochemical criteria used to order the set of codons: number of hydrogen bonds (strong-weak, S-W), chemical type (purine-pyrimidine, R-Y), and chemical groups (amino versus keto, M-K) [20]. Preserved codon positions are labeled with the letter "H", and insertion mutations identified in the multiple sequence alignment are labeled as "---". The data and R script to build this graphic are available at: https://genomaths.github.io/genomautomorphism/articles/automorphism_on_msa_brca1.html.

Figure S3
Genetic-code cube AGCU inserted in the three-dimensional space and centered in the coordinate origin.
The insertion of the cube in 3D-space takes advantage of the isomorphism: