Longest Order Conserved Exemplar Subsequences

We propose a new problem whose input data are two linear genomes together with two indexed gene subsequences of them, which asks to find a longest common exemplar subsequence of the two given genomes with a subsequence identical to the given indexed gene subsequences. We present an algorithm for this problem such that the algorithm is allowed to take diminishing time and space to solve the problem by setting the indexed genes with an incremental number. Although an incremental number of indexed genes were selected, the algorithm was verified definite to reach a solution whose length insistently comes very close to a real longest common exemplar subsequence of the two given genomes. Aiming at 23 human/gorilla chromosome pairs, the algorithm was examined for use in questing for longest common exemplar subsequences whose basic units are annotated genes as well as pseudo genes, namely consecutive DNA subsequences. By contrasting the pseudo gene common exemplar subsequences the algorithm had reached for the human chromosomes 7 and 16 and their gorilla homologues with those annotated genes in the human and gorilla chromosomes, we found more than 1 000 and 500 pseudo genes in the human chromosomes 7 and 16 that occur in the same order as they are in the gorilla chromosomes 7 and 16 and, do not overlap with any annotated gene. Author summary There is a benefit of the algorithm: It can reach a long enough common exemplar subsequence of two linear genomes in as fast a speed as one requires even if the given genomes would be equipped with too many duplicated genes, which can be done by setting incremental number of indexed genes. We developed a Java software based on the algorithm, that has been available for download on https://github.com/ShuZhang-sdu/LCES. Only in need to set the indexed gene sequences as null, was it verified successful for our algorithm to obtain the longest common exemplar subsequences of the annotated gene summary pairs extracted from 23 human/gorilla chromosome pairs. In convenience for researchers to find new motifs or conserved genes, we devoted for the algorithm to quest pseudo gene (i.e. consecutive DNA subsequences) summary pairs of the 23 human/gorilla chromosome pairs for solutions. There are 20 pseudo gene summary pairs whose longest common exemplar subsequences have been found by the algorithm with null indexed gene sequences. The other 3 pseudo gene summary pairs were verified solvable for the algorithm to reach their longest common exemplar subsequences that have to admit subsequences identical to given indexed gene subsequences. There were informed to exist 2 353 and 1 148 pseudo genes in the gorilla chromosome 7 and 16 that occur in the same order as they are in the human chromosome 7 and 16 and, do not overlap with any annotated gene. These pseudo genes should be significant for annotating the human or gorilla genome.


Introduction
1 Conserved gene based molecular mechanism analysis happens basically in kinds of 2 bioresearch such as breeding soybean seeds [1] as well as developing drugs or vaccines 3 to beat back corona viruses [2] [3] [4]. Finding conserved gene subsequences in 4 genomes has been an attractive topic in biomedicine, bioinformatics and computer 5 science all the time. A consecutive DNA subsequence is not necessarily an annotated 6 gene if it is conserved. Thus a sequence of conserved consecutive DNA subsequences in 7 a genome sounds to have a broader range of applications such as for people to find 8 motifs in it or illuminate bio-functions of it. An order conserved sequence of conserved 9 consecutive DNA subsequences or genes can be found by solving a combinatorial 10 problem in terms of genome comparison [5] [6] [7]. 11 The exemplar breakpoint distance problem (abbr. EBD) proposed by Sankoff is a 12 pioneering problem for this purpose [5]. Some heuristic algorithms using branch and 13 bound [5] or divide and conquer [8] can be found effective for this problem. This 14 problem is NP-Hard [9] and there are extensive approaches on the computational 15 complexity of this problem [10] [11] [12]. Some more algorithmic progresses for special 16 versions of this problem can be found in the literatures [13] [14]. The decision version 17 of EBD, the zero exemplar breakpoint distance problem (abbr. ZEBD), remains 18 NP-Hard [11] and admits an O(n 2 1.86 n ) time algorithm for the two given genomes 19 both to allow at most 2 identical genes where n is the number of genes in the given 20 genomes [15]. Some special versions of ZEBD admit polynomial time 21 algorithms [11] [15] [16]. 22 In the sense of order conserved, a long enough common exemplar subsequence of 23 two or more linear genomes should be better qualified for conserved. The exemplar 24 longest common subsequence problem (abbr. ELCS) was proposed by Bonizzoni et 25 al. [6] to quest for a longest common subsequence of two or more linear genomes that 26 contains a gene of each mandatory gene family. This problem is NP-Hard since ZEBD 27 is a special version of it and, admits a polynomial time algorithm for the special 28 version whose input is of two linear genomes with at most three genes of each 29 mandatory family [6]. This problem admits fixed-parameter algorithms for instances 30 of two linear genomes [6]. 31 The repetition-free longest common subsequence problem (abbr. RFLCS) proposed 32 by Adi et al. [7] is in fact a special version of ELCS. This problem is NP-Hard since B) in exclusion of those identical to genes in X (resp. Y ) admits a span s(A, X) (resp. 66 s(B, Y )), then the time complexity and the space complexity of the algorithm are as Y with increasing number of genes.

74
Using the algorithm for LCES-IG, we developed a software that have been ready 75 for finding a longest common exemplar subsequence of two linear genomes A and B, 76 which admits a subsequence identical to a given subsequence X of A as well as a given 77 subsequence Y of B. Aiming at 23 human/gorilla chromosome pairs, we performed 78 experiments where the algorithm was employed to quest for their longest common 79 exemplar subsequences of annotated genes. By running the algorithm with X and Y 80 as null, we obtained the longest common exemplar subsequences of 23 annotated gene 81 summary pairs extracted from those 23 human/gorilla chromosome pairs.

82
In expectation of avail to find new conserved genes or new motifs in human as well 83 as gorilla genomes, we performed experiments where the algorithm was employed to 84 quest for longest common exemplar subsequences of pseudo genes, i.e., consecutive 85 DNA subsequences that occurs in a human chromosome and its gorilla homologue, 86 aiming at those 23 human/gorilla chromosome pairs. A modified version of 87 SDquest [23] was used to identify the consecutive DNA subsequences in those 23 88 human/gorilla chromosome pairs as pseudo genes so that 23 pairs of pseudo gene 89 sequences each of which will be called a pseudo gene summary were extracted from 90 those 23 human/gorilla chromosome pairs.

91
For X and Y to be set with null, there are 20 pseudo gene summary pairs whose 92 longest common exemplar subsequences have been found by the algorithm. The other 93 3 pseudo gene summary pairs were all verified solvable for the algorithm to reach their 94 longest common exemplar subsequences that have to admit subsequences identical to 95 given indexed gene subsequences X and Y . Positively, one is allowed to select more 96 indexed genes in X as well as Y to make the algorithm take less time and space to 97 November 24, 2020 3/24 reach a solution whose length can always come very close to a real longest common 98 exemplar subsequence of the two given genomes. This implies it constantly true for 99 the algorithm to reach an order conserved sequence of conserved consecutive DNA 100 subsequences in a genome in as short a running time as required in practice. 101 In convenience for researchers to mine something from those longest common  [1,5] ∥ π [6,10].

127
There can be more than one gene of the same gene family in a linear genome or 128 gene sequence. Two genes are referred to as identical, if they are of the same gene 129 family. For a gene π[i] in π, we denote by occ(π, π[i]) the number of genes in π that 130 are identical to π[i]. We refer to j − i for i < j as the span of π[i] and π[j] in π. The 131 span of a linear genome refers to the maximum span of two identical genes in that 132 linear genome. Let π = 1 4 3 2 4 1 5 2 6 for example. Then occ(π, 1) = occ(π, 2) = 2, 133 occ(π, 3) = 1 and the span of π is 5, because π [1] and π [6] are identical and their span 134 5 is maximum over all two identical genes in π. 135 We denote it x = y for two genes x and y to be identical. Two gene sequences X = 136 X [1, p] 137 A sequence of genes is referred to as a subsequence of a linear genome, if it can be 138 obtained by deleting some genes (not necessarily consecutive) from the linear genome. 139 A subsequence of a linear genome on Σ is referred to as an exemplar subsequence of 140 the genome, if any two genes in this subsequence are not identical. problem is what we were just asked.

165
Question: Find a longest common exemplar subsequence of G 1 , ..., G N indexed by 166 X j for 1 ≤ j ≤ N . 167 We are going to mention this problem as LCES-IG for short and abbreviate by an 168 LCES a longest common exemplar subsequence of some linear genomes.

169
In what follows of this section, we present a dynamic programming algorithm for 170 LCES-IG whose instances are of two linear genomes and their respective exemplar 171 subsequences. We represent an LCES-IG instance by two linear genomes A = A [1] ...
, if identical to a gene in X (resp. Y ) and 175 other than the gene, cannot occur in any common exemplar subsequence indexed by 176 X (resp. Y ). Thus A (resp. B) is always assumed with no other gene identical to  179 For i and j with 0 ≤ i ≤ m and 0 ≤ j ≤ n, let C(i, j) denote the set of all common 180 exemplar subsequences of A [1, i] and B [1, j]. A member C ∈ C(i, j) is referred to as   188 be indexed by X [1, k] as well as Y [1, k], then any extension of the member cannot be 189 indexed by X as well as Y . Proof. Let C 1 ∈ C(i, j), C ∈ C(m, n). If C is an extension of C 1 , then there is a It follows from occ(A, A[x p ]) = 1 for 1 ≤ p ≤ k that C 2 cannot be indexed by X[p] 194 for 1 ≤ p ≤ k. If C 1 fails to be indexed by X [1, k] as well as Y [1, k], so does C. 195 It follows by Lemma 1 that if i ≥ x k and j ≥ y k , anyone in C(i, j) can be given up 196 to maintain for extension if it fails to be indexed by X [1, k] as well as Y [1, k]. This 197 implies that if i ≥ x k+1 and j ≥ y k+1 for 1 ≤ k + 1 ≤ q, then there is no need to 198 maintain for extension any one in C(i, j) that is indexed by X [1, k] as well as Y [1, k] 199 instead of X[1, k + 1] as well as Y [1, k + 1]. On the other hand, if i ≥ x k+1 and j < 200 y k+1 (resp. i < x k+1 and j ≥ y k+1 ) for 1 ≤ k + 1 ≤ q, then it follows by Lemma 1 201 that anyone in C(i, j) can be given up to maintain for extension. Thus only when i 202 and j fall in such intervals as x k ≤ i < x k+1 and y k ≤ j < y k+1 , is it necessary to 203 maintain for extension those common exemplar subsequences in C(i, j) indexed by 204 X [1, k] as well as Y [1, k],   For i and j with x k ≤ i < x k+1 and y k ≤ j < y k+1 (0 ≤ k ≤ q), we focus on how 212 to get CP (i, j). Since A[1, 0] = "" and B[1, 0] = "", we set CP (x 0 , y 0 ) = CP (0, 0) = 213 {""}. The following lemma is in terms of how to get CP (x k , y k ) for k ≥ 1.

221
Only if : Let C ∈ C(x k , y k ) be indexed by X [1, k] as well as Y [1, k].
. It remains to argue the way of how to get CP (i, j) from CP (i, denote an arbitrary common exemplar subsequence of 236 A [1, i] and B [1, j] indexed by X [1, k] as well as Y [1, k].   248 that can be argued true in the same way as in case (1.1). Since i = x k , no member in 249 CP (i − 1, j) can be indexed by X [1, k] as well as Y [1, k] To sum up, CP (i, j) can be computed recursively by Formula (1) for i and j with i 254 The lemma follows from that the 272 longest extension of C 1 in C(m, n) has no less genes than |C 1 ∥ C 3 |. 273 By Lemma 3, to get an LCES of A and B indexed by X as well as Y , it suffices for 274 the dynamic programming to maintain a subset of CP (i, j) whose common exemplar

286
It follows from CP (0, 0) = {""} that CF P (0, 0) = {""}. Then for i and j with i + 287 j > 0, x k ≤ i < x k+1 and y k ≤ j < y k+1 (0 ≤ k ≤ q), we face with the task of getting 288 a minimum representative subset of CP (i, j) from CF P (i − 1, j), CF P (i, j − 1) and . Then we argue for CF P to be 300 representative in the following two aspects.
The reason why CF P is minimum over all representative subsets of subset of CP (i, j) can be extracted by examining every two members in a subset of   336 Formally, the confused gene family set of C can be got as follows.
where it is treated as what occurs in C ∈ CP (i, j). Thus to get the confused gene 343 family set of C ∈ CP (i, j), it suffices to decide whether the gene family of . This can be argued in two subcases.  In summary of (2.1) and (2.2), f (i, j, C) can be got by the following formula.
353  Let F P (i, j) denote the confused gene family set collection of CF P (i, j). Since To extract F P (i, j), it needs help of the lengths of those members in CP (i, j, x k+1 and y k ≤ j < y k+1 (0 ≤ k ≤ q), the CES length of an arbitrary confused gene can 378 be computed recursively by Formula (5). 379 accompanied with a CES length, F P (i, j) can be arrived at recursively by Formula (6). 388 This leads to a dynamic programming based algorithm to get F P (m, n). Since all 1: x 0 ← 0; y 0 ← 0; x q+1 ← m + 1; y q+1 ← n + 1; 2: F P (i, 0) = F P (0, j) ← {∅}, 0 ≤ i ≤ m, 0 ≤ j ≤ n; 3: for k from 0 to q do 4: for i from x k to x k+1 − 1 do 5: for j from y k to y k+1 − 1 do 6: Get F P (i, j) by (3)

498
Those 23 human/gorilla chromosome summary pairs were all verified solvable for 499 LCES(A, B, X, Y ) with X = Y = "" to reach their LCESs. These LCESs have been 500 prepared ready in S1 File. Since the length of an LCES of two chromosome summaries 501 reflects the structure similarity of them, we present in Table 2 the first row the lengths 502 of those 23 LCESs. Moreover, we present in Table 2  A "length" statistic represents the gene number of an LCES. An "lces/lh" (resp. "lces/lg") statistic represents the ratio of the gen number of an LCES to the gene number of a human (resp. gorilla) chromosome summary. A "time(ms)" statistic represents the real running time.
There exist 1 658 distinct named genes in the LCES for the algorithm to reach in 507 questing the summaries of human chromosome 1 and its gorilla homologue. It took

514
To break through the restriction of genes that have been annotated, we turn attention 515 to find order conserved sequences of significant subsequences in human or gorilla 516 chromosomes without any regard to the annotation files. We refer to a consecutive 517 DNA subsequence as a pseudo gene, if it has at least a stated number (500) of DNA 518 bases and occurs in a human chromosome as well as its gorilla homologue. We start 519 with a DNA sequence represented human chromosome and its gorilla homologue to 520 pursue one of their pseudo gene sequence represented LCESs.

521
A modified version of SDquest [23] was employed to extract two respective pseudo 522 gene subsequences from a human chromosome and its gorilla homologue. For SDquest 523 November 24, 2020 13/24 with input to be a human chromosome and its gorilla homologue, it will output two 524 pseudo gene sequence represented subsequences of the human chromosome and its 525 gorilla homologue respectively. SDquest was set to identify a consecutive DNA 526 subsequence in a human (resp. gorilla) chromosome as a pseudo gene if it has at least 527 500 DNA bases and occurs in the chromosome's gorilla (resp. human) homologue with 528 at least 95% identity 1 [23]. All pseudo genes for SDquest to identify were encoded 529 with integers and two pseudo genes were treated as identical and encoded with the 530 same integer, if they are with at least 95% identity.

531
The "hg38" assembly [28] and the "gorGor4" assembly [29]  A "Human" or "Gorilla" statistic represents the length of a human or gorilla pseudo gene summary. whose LCES lengths as well as the running time the algorithm took to reach them, are 547 given in Table 4. We were informed out of memory by LCES(A, B, "", "") for it to 548 quest the other three summary pairs. 549 A "length" statistic represents the length of an LCES. A "time(ms)" statistic represents the real time for the algorithm to get an LCES.

LCES without indexed genes
1 95% identity means matches matches+mismatches+indels = 95%, where matches (resp. mismatches) is the number of matched (resp. mismatched) nucleotides and indels is the number of indels.  What we did for this uncertainty is: For r to take value in (0, 1) increasingly, 576 randomly select a subsequence C ′ of C with r|C| pseudo genes and examine

LCES with indexed genes
where each time of the algorithm's execution in a round was with 591 C ′ randomly selected from C and |C ′ | = r|C|. 592 In Table 5, we present the respective successful numbers in all rounds of algorithm 593 executions. Those statistics in Table 5   algorithm execution was with C ′ randomly selected from C and |C ′ | = r|C|. 615 We present in Table 6  pseudo gene summary pairs. We present in Table 6

622
Moreover, we present in Table 6  The statistics in Table 6 show that although there are an increasing number of indexed 627 genes to be selected, there exists no downward trend on the lengths of solutions for the 628 algorithm to reach. The lengths of solutions the algorithm had reached are always 629 very close to the length of a real LCES. This implies that one is allowed to select more 630 indexed genes to make the algorithm take less time to reach a common exemplar 631 subsequence whose length can come very close to an LCES.  As r takes value from 30% to 50%, the average running time for Index-LCES(A, B, 633 solutions. This might be because there are 12 identical pseudo gene pairs in both of 638 these two pseudo gene summaries whose spans are no less than 4730.

639
For the pseudo gene summary pairs of the human chromosomes 7 and 16 and their 640 gorilla homologues and for r ∈ {10%, 15%, 20%, 25%, 30%}, we present in Table 7 Table 7 show that the running time the algorithm 646 had taken averagely to reach a solution rapidly decreases as the value of r increases, 647 instead of which, the average length of those common exemplar subsequences for the 648 algorithm to reach remains unchanged.   genes. In S6 File (resp. S7 File), we have prepared ready the annotated genes in those 677 23 human (resp. 24 gorilla) chromosomes that overlap with some pseudo genes in 678 sequences in S4 File.

679
A pseudo gene in a sequence in S4 File, if does not overlap with any annotated gene, 680 is valuable for people to illuminate its bio-function or significant for people to mine 681 motifs in it. So In S8 File (resp. S9 File), we have prepared ready those pseudo genes 682 that do not overlap with any annotated gene in human (resp. gorilla) chromosomes.

683
One may be more interested in a pseudo gene, if it both does not overlap with any 684 annotated gene and is long enough. So in S10 File (resp. S11 File), we have prepared 685 ready those pseudo genes that do not overlap with any annotated gene in human as 686 well as gorilla chromosomes and each of them has at least 10 000 bases. concern because they were reached by Index-LCES(A, B, X, Y ) with some pseudo 691 genes fixed as indexed. Since the gorilla chromosome 2 we mentioned is a fusion of two 692 real gorilla chromosomes, we will give up to show more statistics about CES-2 in 693 contrast with the annotated gene sequences that were extracted from the human 694 chromosome 2 and its gorilla homologue. In Table 8, we disclose more statistic 695 information about CES-7 and CES-16.

696
In Table 8 the column with title AG (resp. PG), we present the numbers of 697 annotated (resp. pseudo) genes that were extracted from the human chromosomes 7

709
To identify new genes or motifs, the pseudo genes that do not overlap with any 710 annotated gene should be more attractive for people to pay attention to. In Table 8 711 the column with title PG∧AG, we present the numbers of pseudo genes in CES-7 and 712 CES-16 that do not overlap with any annotated gene extracted from the human 713 chromosome 7 and 16. We present in the column with title PG∧AG(10000) 714 additionally, the numbers of pseudo genes in CES-7 and CES-16 that were involved in 715 the column with title PG∧AG and each of them has at least 10 000 bases. In Table 8 716 the last column, we present the site intervals of the longest pseudo genes that do not 717 overlap with any annotated gene in CES-7 and CES-16 respectively. 718 We are informed by these statistics an order conserved sequence of more than 400 719 (resp. 1 000) pseudo genes all of which are of more than 10 000 bases and do not 720 overlap with any annotated gene in the human (resp. gorilla) chromosome 7. The same 721 two numbers for the human and gorilla chromosome 16 are 140 and 400 respectively. "humanpos" (resp. "gorillapos") indicates the identification number in A (resp. B) of 800 the pseudo gene that is in the same row as it lies in and the column with title 801 "indexedgene". Let X (resp. Y ) denote the subsequence of pseudo genes in A (resp. 802 B) whose identification numbers are given in the column with title "humanpos" (resp. 803 "gorillapos"). Then the LCES of A and B indexed by X as well as Y will be reached