Determine transposable genes when the orders of genes are different

Certain nucleotide sequences in DNA can change their positions. Such nucleotide sequences might be shorter than a general gene. When we restrict to nucleotide sequences that form complete genes, we can still find genes that change their relative locations in a genome. Thus for different individuals of the same species, the orders of genes might be different. Such spatial difference of gene orders might be affected by temporal difference of gene (mutation) orders, and can be used to explain the order of mutation problem in myeloproliferative neoplasm. A practical problem is to determine such transposable genes in given gene sequences. Through an intuitive rule, we transform the biological problem of determining transposable genes into a rigorous mathematical problem of determining the longest common subsequence. Given several number sequences, determining the longest common subsequence is a classical problem in computer science. Depending on whether the gene sequence is linear or circular, and whether genes have multiple copies, we classify the problem of determining transposable genes into different scenarios and design corresponding algorithms. Specifically, we study the situation where the longest common subsequence is not unique.


Introduction
Various events, such as inversion, insertion, deletion, and duplication, can change the nucleotide sequence [28]. Such rearrangement events lead to the existence of transposons (also called transposable elements or jumping genes), which are DNA sequences that can change their relative positions within the genome. Transposons were first discovered in maize by Barbara McClintock [41]. Transposons have various types: long terminal repeats (LTR) retrotransposons, Dictyostelium intermediate repeat sequence (DIRS)-like elements, Penelope-like elements (PLE), long interspersed elements (LINE), short interspersed elements (SINE), terminal inverted repeats (TIR), Helitrons, etc. [40].
Transposons are common in various species. For the human genome, the proportion of transposons is approximately 44%, although most transposons are inactive [42]. Transposons can participate in controlling gene expression [82], and they are related to several diseases, such as cancer [13], hemophilia [33], and porphyria [45]. Transposons can drive rapid phenotypic variations, which cause complicated cell behaviors [80,48,47,11,29]. Transposons can be used to detect cancer drivers [49] and potential therapies [2]. Transposons are also essential for the development of Oxytricha trifallax [50], antibiotic resistance of bacteria [3], and the proliferation of various cells [54,78,14]. With the presence of transposons, the regulation between genes might be affected, which is a challenge for inferring the structures of gene regulatory networks [74] and general transcriptome analysis [60,81].
When transposons have been determined, we can use them to compare the genomes of different species, and such comparisons can be combined with other measurements between species, such as metrics on developmental trees [70]. Such comparisons can also be extended to different tissues to help with the prediction of tissue transplantation experiments [75]. Besides, for some species, cells at different positions have different gene expression patterns, which might be related to transposons [72].
Many transposons are as short as 10^2 to 10^3 base pairs, shorter than a general gene [53]. To determine such short transposons, one needs to analyze the original AGCT nucleotide sequences. Many algorithms have been developed to determine short transposons from nucleotide sequences, such as MELT (Mobile Element Locator Tool) [18], ERVcaller (Endogenous Retro-Virus caller) [10], and TEMP2 (Transposable Elements Movements Present 2) [79]. Different algorithms may only determine certain types of transposons. For more details, readers may refer to other papers [51,20]. These algorithms use raw DNA sequencing data, which only contain imperfect information about the true DNA sequence, and the data quality depends on some factors that vary across different datasets [17]. Besides, they need a corresponding genome or reference transposon libraries.
There are also gross DNA changes that involve many genes, called genomic rearrangements [21]. Such rearrangements include inversion, transposition, fusion, and fission [8]. To determine such gross genomic rearrangements, one first needs to convert nucleotide sequences into gene sequences by annotation. For two different gene sequences, the general idea of determining rearrangements is to calculate the minimal number of operations required for transforming one sequence into the other [62]. This defines an edit distance between gene sequences, which can be used to compare the evolutionary distance between species and construct the phylogenetic tree [61]. Many algorithms have been developed to determine genomic rearrangements. They consider different scenarios: whether the gene sequence is linear or circular, whether genes have unique labels, and what operations can be taken. Kececioglu and Sankoff only consider inversion for linear sequences with unique gene labels [34]; Blanchette et al. consider inversion and transposition for circular sequences with unique gene labels [6]; Tesler considers inversion, transposition, fusion, and fission for linear and circular sequences with unique gene labels [62]; Terauds and Sumner study circular sequences with representation theory tools [61]; Bohnenkämper et al. consider linear and circular sequences with possibly duplicated labels [8]. There are also systematic pipelines for determining rearrangements from whole-genome assemblies [19,43]. Nevertheless, these methods consider large-scale rearrangements and minimize the number of operations needed to transform one gene sequence into the other, rather than identifying concrete genes that can change their locations. Besides, these methods only compare two gene sequences, not more. Their results depend on the set of possible operations, which is somewhat arbitrary.
In this paper, we consider a mesoscopic scenario between the genomic rearrangement situation and the short transposon situation: Given accurately annotated gene sequences (not nucleotide sequences) from different individuals, determine individual genes (not short nucleotide segments or long gene strands) that can change their locations (transposable). This provides a qualitative description for the stability of genes, which can guide gene editing [67] and phylogenetics [32]. The proportion of fixed genes quantifies the robustness of the genome. We aim at minimizing the number of genes to move. When there are only two gene sequences, this is equivalent to calculating genomic rearrangements, where the only allowed operation is single-gene transposition.
In the copy-paste (duplication) case and deletion case, we can compare the numbers of copies of genes for different individuals to determine the transposable genes that have changed their copy numbers. In the inversion case, we can check the direction of genes to determine transposable genes that have changed their orientations [38]. In the cut-paste (insertion) case, the compositions of gene sequences are the same, but the orders of genes differ. It is not straightforward to uniquely determine which genes have changed their relative locations. Instead, we can consider the complement of transposable genes, which keep their relative locations and form a common subsequence of gene sequences from different individuals. Notice that genes in a subsequence do not need to be adjacent in the original sequences, different from a substring. We aim at explaining the difference among gene sequences with minimal transposable genes, meaning that we want to maximize the length of the complement of transposable genes. Thus we define the transposable genes to be the complement of the longest common subsequence. Given raw nucleotide sequences, we first transform them into gene sequences. Then we apply our algorithms to find the longest common subsequence, and its complement consists of the transposable genes. If the longest common subsequence is not unique, we also need to determine which genes are more conserved and appear in all longest common subsequences.
It is common to use the length of the longest common subsequence as a quantitative score for comparing DNA sequences [12,26,83]. The longest common subsequence has also been used to define ultraconserved elements [55] or remove incongruent markers [16].
Determining the longest common subsequence is a classical problem in computer science. Various scenarios for this problem have been studied. Here we list Scenarios A-E, where the first two are more commonly studied. For more works in these scenarios, readers may refer to more thorough reviews [5,24,76]. Scenario A considers two sequences with possibly repeated genes, and the sequence length is n. The goal is to find the longest common subsequence, where the length is counted by gene copies. This can be solved by dynamic programming with O(n^2) time complexity and O(n) space complexity [23], but O(n^(2−ϵ)) time complexity for any ϵ > 0 is impossible under the strong exponential time hypothesis [4]. This can also be solved with o(n) space complexity and O(n^3) time complexity [35]. In Scenario B, there are m sequences with possibly repeated genes, and the sequence length is n. The goal is to find the longest common subsequence, where the length is counted by gene copies. A standard dynamic programming algorithm has O(n^m) time complexity [7]. There have been other faster algorithms [66,44,27]. This scenario is equivalent to the maximum clique problem in graph theory, which is NP-hard [39], but has fast exact and heuristic algorithms [30,37,71]. Scenario C considers two sequences with possibly repeated genes, and the sequence length is n. The goal is to find the longest common subsequence, where each gene appears at most once. This scenario is NP-hard [1]. Scenario D is similar to Scenario B, but only considers common subsequences that contain or do not contain certain strings [68,46]. In Scenario E, the gene sequences are arc-annotated, and the longest common subsequence should have the same arc annotation in the original sequences [31].
In this paper, we consider four scenarios that are different from the previously studied longest common subsequence problems. These four scenarios are determined by two factors: whether the considered species has linear or circular gene sequences, and whether genes have multiple copies. When genes have multiple copies, we only consider common subsequences that consist of all or none of the copies of the same gene. Scenario 1 has linear sequences without duplicated genes; Scenario 2 has circular sequences without duplicated genes; Scenario 3 has linear sequences with duplicated genes; Scenario 4 has circular sequences with duplicated genes.
Most known methods only aim at finding one longest common subsequence. When the longest common subsequence is not unique, we also need to classify whether a gene appears in all/some/none of the longest common subsequences. Determining all longest common subsequences is too time-consuming. To determine the relationship between genes and longest common subsequences, we develop corresponding algorithms with polynomial time complexities for Scenarios 1,2 (Algorithms 2,4). To our knowledge, there are no other polynomial-time methods for determining whether genes appear in all longest common subsequences. Scenarios 3,4 only consider subsequences that consist of all or none of the copies of the same gene, and calculate the length by genes. Therefore, they are different from the classic Scenario B. We establish the equivalence of Scenario 3 with the maximum clique problem on graphs (Proposition 1). We prove that Scenario 4 is between the maximum clique problem on graphs and the maximum clique problem on 3-uniform hypergraphs (Propositions 2, 3). Although circular sequences are commonly studied in the context of genomic rearrangements, they are rare in the literature of longest common subsequence problems. Therefore, our Algorithm 3 that finds a longest common subsequence for Scenario 2 should also be novel. We test Algorithms 1,2,3,4 on the gene sequences of different Escherichia coli individuals and find some possible transposable genes.
If we only need to find one longest common subsequence, then Scenario 1 is a special case of Scenario B, and our method (Algorithm 1) is easily derived from standard algorithms. Scenarios 3,4 are equivalent to maximum clique problems in graphs and hypergraphs, which are NP-hard. These properties are also similar to Scenario B. Although there have been numerous algorithms for the maximum clique problem [77], for the sake of completeness, we design fast heuristic algorithms (Algorithms 5,6) and test them to find that they only fail in rare cases.
We proposed the idea of using the longest common subsequence to find transposable genes and Algorithm 1 in a previous paper [32]. There, Algorithm 1 was applied to study the "core-gene-defined genome organizational framework" (the complement of transposable genes) in various bacteria, and it was found that for different species, the transposable gene distribution and developmental traits are correlated. This paper considers other situations (especially when the longest common subsequence is not unique), and can be regarded as a theoretical sequel of that previous paper. Algorithm 1 is contained in this paper for the sake of completeness.
We first describe the setup for the problem of determining transposable genes and transform it into the problem of finding the longest common subsequence. In the following four sections, we transform them into corresponding graph theory problems and design algorithms. We finish with some discussions. All the algorithms in this paper have been implemented in Python. For the code and data files, see https://github.com/YueWangMathbio/Transposon.

Setup
Given raw DNA sequencing data, the first step is to transform them into gene sequences. This can be done with various genome annotation tools [59,9]. For simplicity, we replace the gene names by numbers 1, . . . , n.
For some species, the DNA is a line [58]. We can represent this DNA as a linear gene sequence of distinct numbers that represent genes: (1,2,3,4). If some genes change their transcriptional orientations, we can simply detect them and handle the remaining genes. Now a linear DNA naturally has a direction (from 5' end to 3' end), thus (1, 2, 3, 4) and (4, 3, 2, 1) are two different gene sequences.
Consider two linear gene sequences from different individuals: (1, 2, 3, 4) and (1,4,2,3). We can intuitively detect that gene 4 changes its relative position, and should be regarded as a transposable gene. However, changing the positions of genes 2, 3 can also transform one sequence into the other. The reason that we think gene 4 (not genes 2, 3) changes its relative position is that the number of genes we need to move is smaller. However, the number of genes that change their relative locations is difficult to determine. We can consider the complement of transposable genes, i.e., genes that do not change their relative positions. These fixed genes can be easily defined as the longest common subsequence of given gene sequences. Here a common subsequence consists of some genes (not necessarily adjacent, different from a substring) that keep their relative orders in the original sequences. Thus transposable genes are the complement of this longest common subsequence. Notice that the longest common subsequence might not be unique. We classify genes by their relations with the longest common subsequence(s). The motivation of classifying transposable genes with respect to the intersection and union of longest common subsequences is similar to defining essential variables with Markov boundaries in causal inference [73].

Definition 1.
A gene is proper-transposable if it is not contained in any longest common subsequence. A gene is non-transposable if it is contained in every longest common subsequence. A gene is quasi-transposable if it is contained in some but not all longest common subsequences.
For some species, the DNA is a circle, not a line [65]. A circular DNA also has a natural direction (from 5' end to 3' end), and we use the clockwise direction to represent this natural direction. In the circular sequence scenario, a common subsequence is a circular sequence that can be obtained from each circular gene sequence by deleting some genes. See Fig. 1 for two circular gene sequences and their longest common subsequence. Notice that we can rotate each circular sequence for a better match.
A gene might have multiple copies (duplicated) in a gene sequence [25]. Notice that a transposable gene is defined as a gene (a specific DNA sequence) that has the ability to change its position, not a certain copy of a gene that changes its position. This means transposable genes should be defined for genes, not gene copies. Thus we should only consider common subsequences that consist of all or none of the copies of the same gene. When calculating the length of a common subsequence, we should count genes, not gene copies. Consider two linear sequences (4, 1, 2, 1, 1, 3, 2, 4, 1, 1) and (4, 1, 2, 3, 1, 1, 2, 1, 1, 4). If we consider any subsequences, the longest common subsequence is (4, 1, 2, 1, 1, 2, 1, 1); if we only consider subsequences that contain all or none of the copies of the same gene, but count the length by copies, the longest common subsequence is (1, 2, 1, 1, 2, 1, 1); if we only consider subsequences that contain all or none of the copies of the same gene, and count the length by genes, the unique longest common subsequence is (4, 2, 3, 2, 4), and gene 1 is proper-transposable.
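The last case of the example above can be checked by exhaustive search. The following is a minimal sketch, not one of the algorithms of this paper: the names `restrict` and `longest_all_or_none` are hypothetical, and the search is exponential in the number of genes, so it is only meant for tiny inputs. It uses the observation that a common subsequence containing all or none of the copies of each gene is obtained by deleting every copy of the unselected genes, so a gene subset is admissible exactly when all sequences restricted to it coincide.

```python
from itertools import combinations


def restrict(seq, genes):
    """Keep every copy of the selected genes and drop all other genes."""
    return tuple(g for g in seq if g in genes)


def longest_all_or_none(seqs):
    """Brute-force search over gene subsets, largest subsets first.

    Returns (number of genes, list of common subsequences of that size).
    The length is counted by genes, not by gene copies.
    """
    genes = sorted(set(seqs[0]))
    for size in range(len(genes), 0, -1):
        hits = []
        for subset in combinations(genes, size):
            images = {restrict(s, set(subset)) for s in seqs}
            if len(images) == 1:  # identical restriction in every sequence
                hits.append(images.pop())
        if hits:
            return size, hits
    return 0, [()]


seqs = [(4, 1, 2, 1, 1, 3, 2, 4, 1, 1), (4, 1, 2, 3, 1, 1, 2, 1, 1, 4)]
size, hits = longest_all_or_none(seqs)
print(size, hits)  # 3 [(4, 2, 3, 2, 4)]
```

Since only one subset of three genes survives, the longest common subsequence is unique here, and gene 1 is not contained in it.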
When we consider circular gene sequences with duplicated genes, we should still only consider subsequences that consist of all or none of the copies of the same gene, and calculate the length by genes. Notice that circular sequences can be rotated. See Fig. 2 for two circular gene sequences with duplicated genes and their longest common subsequence.
We have turned the problem of determining transposable genes into finding the longest common subsequence of several gene sequences. Depending on whether the gene sequences are linear or circular, and whether genes have multiple copies, the problem can be classified into four scenarios:
Scenario 1: Consider m linear sequences of genes 1, . . . , n, where each gene has only one copy in each sequence. Determine the longest linear sequence that is a common subsequence of these m sequences.
Scenario 2: Consider m circular sequences of genes 1, . . . , n, where each gene has only one copy in each sequence. Determine the longest circular sequence that is a common subsequence of these m sequences. Here circular sequences can be rotated.
Scenario 3: Consider m linear sequences of genes 1, . . . , n, where each gene can have multiple copies in each sequence. Determine the longest linear sequence that is a common subsequence of these m sequences. Only consider subsequences that consist of all or none of the copies of the same gene, and calculate the length by genes.
Scenario 4: Consider m circular sequences of genes 1, . . . , n, where each gene can have multiple copies in each sequence. Determine the longest circular sequence that is a common subsequence of these m sequences. Only consider subsequences that consist of all or none of the copies of the same gene, and calculate the length by genes. Here circular sequences can be rotated.
These four scenarios correspond to different algorithms, and will be discussed separately.

Linear sequences without duplicated genes
In Scenario 1, consider m linear gene sequences, where each sequence contains n genes 1, . . . , n. Each gene has only one copy. For such permutations of 1, . . . , n, we need to find the longest common subsequence.

A graph representation of the problem
Brute-force searching that tests whether each subsequence appears in all sequences is not applicable, since the time complexity is exponential in n.
To develop a polynomial algorithm, we first design an auxiliary directed graph G.

Definition 2.
For m linear sequences with n non-duplicated genes, the corresponding auxiliary graph G is a directed graph, where each vertex is a gene g_i, and there is a directed edge from g_i to g_j if and only if g_i appears before g_j in all m sequences.
A directed path g_1 → g_2 → g_3 → · · · → g_4 → g_5 in G corresponds to a common subsequence (g_1, g_2, g_3, . . . , g_4, g_5) of m sequences, and vice versa. We add 0 to the head of each sequence and n + 1 to the tail. Then the longest common subsequence must start at 0 and end at n + 1. The problem of finding the longest common subsequence becomes finding the longest path from 0 to n + 1 in G. See Fig. 3 for an example of using the auxiliary graph to determine transposable genes. This auxiliary graph G has no directed loop (acyclic). If there exists a loop g_1 → g_2 → g_3 → · · · → g_4 → g_1, then g_1 is prior to g_4 and g_4 is prior to g_1 in all sequences, a contradiction.
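The construction of the auxiliary graph can be sketched as follows. This is a minimal illustration under the assumptions of Scenario 1 (each sequence is a permutation of 1, . . . , n); the function name `auxiliary_graph` and the adjacency-list representation are our own choices for the example, not taken from the released code.

```python
def auxiliary_graph(seqs):
    """Build G for permutations of 1..n, with auxiliary vertices 0 and n+1.

    Returns a dict mapping each vertex u to the sorted list of vertices v
    such that u appears before v in every (padded) sequence.
    """
    n = len(seqs[0])
    padded = [(0, *s, n + 1) for s in seqs]
    # pos[k][g] is the position of gene g in the k-th padded sequence
    pos = [{g: i for i, g in enumerate(p)} for p in padded]
    edges = {u: [] for u in range(n + 2)}
    for u in range(n + 2):
        for v in range(n + 2):
            if u != v and all(p[u] < p[v] for p in pos):
                edges[u].append(v)
    return edges


graph = auxiliary_graph([(1, 2, 3, 4), (1, 4, 2, 3)])
print(graph[2])  # [3, 5]: gene 2 precedes gene 3 and the auxiliary 5 everywhere
print(graph[4])  # [5]: gene 4 precedes only the auxiliary 5 in both sequences
```

The double loop compares all vertex pairs in all sequences, so this sketch costs O(mn^2) time.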

Find the longest path
Determining the longest path between two vertices in a directed acyclic graph can be solved by a standard dynamic programming algorithm. For a vertex g_i ∈ {0, 1, . . . , n}, consider the longest path from g_i to n + 1. Since there exists an edge g_i → n + 1, and G is acyclic, this longest path exists. If the longest path is not unique, assign one arbitrarily.
Definition 3. Define F^+(g_i) to be the length of the longest path from g_i to n + 1 in G, and H^+(g_i) to be the vertex next to g_i in this path.
F^+ and H^+ can be calculated recursively: For one gene g_i, consider all genes g_j with an edge g_i → g_j in G. The gene g_j with the largest F^+(g_j) is assigned to be H^+(g_i), and F^+(g_i) = F^+(g_j) + 1. If g_l → n + 1 is the only edge that starts from gene g_l, then F^+(g_l) = 1, and H^+(g_l) = n + 1. In other words, with F^+(n + 1) = 0,
F^+(g_i) = 1 + max_{g_j : g_i → g_j} F^+(g_j), H^+(g_i) = argmax_{g_j : g_i → g_j} F^+(g_j).
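The recursion for F^+ and H^+ can be sketched in a few lines. This is a minimal illustration rather than the released implementation: the function name is hypothetical, ties are broken deterministically instead of randomly, and the edges of G are tested on the fly instead of being stored. It relies on the fact that the first padded sequence is a topological order of G, so processing it from tail to head guarantees that every successor is already evaluated.

```python
def longest_common_subsequence(seqs):
    """Find one longest common subsequence of permutations of 1..n.

    Returns (path, f_plus), where path lists the genes of the subsequence
    and f_plus[g] is the length (number of edges) of the longest path
    from g to the auxiliary vertex n+1.
    """
    n = len(seqs[0])
    padded = [(0, *s, n + 1) for s in seqs]
    pos = [{g: i for i, g in enumerate(p)} for p in padded]
    f_plus, h_plus = {n + 1: 0}, {}
    for u in reversed(padded[0][:-1]):
        # successors of u: already processed vertices that follow u everywhere
        succ = [v for v in f_plus if all(p[u] < p[v] for p in pos)]
        best = max(succ, key=lambda v: f_plus[v])  # n+1 is always a successor
        f_plus[u] = f_plus[best] + 1
        h_plus[u] = best
    path, v = [], h_plus[0]
    while v != n + 1:  # follow H^+ from 0 until the tail is reached
        path.append(v)
        v = h_plus[v]
    return path, f_plus


path, f_plus = longest_common_subsequence([(1, 2, 3, 4), (1, 4, 2, 3)])
print(path, f_plus[0])  # [1, 2, 3] 4
```

In this example the path 0 → 1 → 2 → 3 → 5 has four edges, so the longest common subsequence has F^+(0) − 1 = 3 genes, and gene 4 is left out.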

Test the uniqueness of the longest path
To test whether quasi-transposable genes exist, we need to check the uniqueness of this longest path.
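Definition 4. For a vertex g_i ∈ {1, . . . , n, n + 1}, define F^-(g_i) to be the length of the longest path from 0 to g_i in G, and H^-(g_i) to be the vertex prior to g_i in this path.
F^- and H^- can be calculated similarly to F^+ and H^+. Let L_0 be the longest path from 0 to n + 1 obtained by following H^+ from 0. We can see that F^-(g_i) + F^+(g_i) is the length of the longest path from 0 to n + 1 that passes through g_i, and F^+(0) is the length of L_0. Thus a gene g_i ∉ L_0 is proper-transposable if F^-(g_i) + F^+(g_i) < F^+(0), and quasi-transposable if F^-(g_i) + F^+(g_i) = F^+(0). If every gene g_i ∉ L_0 is proper-transposable, then the longest common subsequence is unique, and all genes in L_0 (excluding the auxiliary 0 and n + 1) are non-transposable. The procedure of determining transposable genes stops here. Otherwise, the longest common subsequence is not unique, and we need to find quasi-transposable genes in L_0.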
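The uniqueness test can be sketched by computing, for every gene, the length of the longest path from 0 to n + 1 through that gene. This is a minimal sketch with a hypothetical function name; it assumes that this through-length is the sum of the backward and forward path lengths, which follows from the definitions of F^- and F^+.

```python
def through_lengths(seqs):
    """For permutations of 1..n, return (through, total).

    through[g] is the length of the longest path 0 -> ... -> g -> ... -> n+1,
    and total is the length of the longest path from 0 to n+1.
    """
    n = len(seqs[0])
    padded = [(0, *s, n + 1) for s in seqs]
    pos = [{g: i for i, g in enumerate(p)} for p in padded]

    def edge(u, v):
        return all(p[u] < p[v] for p in pos)

    order = padded[0]  # a topological order of the auxiliary graph
    f_plus = {n + 1: 0}
    for u in reversed(order[:-1]):
        f_plus[u] = 1 + max(f_plus[v] for v in f_plus if edge(u, v))
    f_minus = {0: 0}
    for v in order[1:]:
        f_minus[v] = 1 + max(f_minus[u] for u in f_minus if edge(u, v))
    through = {g: f_minus[g] + f_plus[g] for g in range(1, n + 1)}
    return through, f_plus[0]


through, total = through_lengths([(1, 2, 3, 4), (1, 4, 2, 3)])
print(through, total)  # {1: 4, 2: 4, 3: 4, 4: 3} 4
```

A gene lies on some longest path exactly when its through-length equals the total. A longest path contains total − 1 genes, so the longest common subsequence is unique exactly when total − 1 genes reach the total. Here three genes do, so the subsequence (1, 2, 3) is unique and gene 4 is proper-transposable.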

Find quasi-transposable genes
When determining all quasi-transposable genes g_1, . . . , g_k not in L_0, as described above, we construct corresponding longest paths L_1, . . . , L_k from 0 to n + 1, where each L_i passes through g_i. We claim that a gene g_j ∈ L_0 is non-transposable if and only if g_j is contained in all L_1, . . . , L_k. To prove this, we need the following lemma.
Lemma 1. In Scenario 1 of linear sequences without duplicated genes, each quasi-transposable gene g_i has a corresponding quasi-transposable gene g_j, so that no longest common subsequence can contain both g_i and g_j.
If a gene g_j ∈ L_0 is non-transposable, then it is contained in all L_1, . . . , L_k. If g_j ∈ L_0 is quasi-transposable, by Lemma 1, there is a quasi-transposable gene g_l ∉ L_0 which is mutually exclusive with g_j, in the sense that g_l and g_j cannot appear in the same longest common subsequence. The corresponding longest path L_l contains g_l, thus cannot contain g_j. This proves our approach to determine the quasi-transposable genes in L_0.
Proof of Lemma 1. Fix a quasi-transposable gene g_i. It is contained in a longest path L_i, which contains all non-transposable genes. Thus for each non-transposable gene g^*, there is an edge between g^* and g_i in G. Assume g_i has no such mutually exclusive quasi-transposable gene g_j. Then there is an edge (direction unknown) in G between g_i and each quasi-transposable gene g_j. Choose a longest path L^* in G that does not contain g_i. Whether g_j ∈ L^* is a non-transposable gene or a quasi-transposable gene, there is an edge between g_j and g_i. Determine the first gene g_k in L^* that has an edge g_i → g_k. Since there is an edge g_i → n + 1, g_k exists. Since there is an edge 0 → g_i, g_k ≠ 0. Denote the previous gene of g_k in L^* by g_l, then g_l exists, and there is an edge g_l → g_i. Thus we construct a path 0 → · · · → g_l → g_i → g_k → · · · → n + 1, which is longer than the longest path, a contradiction. Thus g_i has a mutually exclusive quasi-transposable gene g_j.
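The classification procedure justified by Lemma 1 can be sketched as follows. This is a minimal illustration, not the released implementation: the function name and the labels "non", "quasi", and "proper" are hypothetical, ties are broken deterministically, and genes outside L_0 are assumed to be quasi-transposable exactly when the longest path through them is as long as L_0.

```python
def classify_genes(seqs):
    """Classify genes of permutations of 1..n as 'non', 'quasi', or 'proper'."""
    n = len(seqs[0])
    padded = [(0, *s, n + 1) for s in seqs]  # auxiliary 0 and n+1
    pos = [{g: i for i, g in enumerate(p)} for p in padded]

    def edge(u, v):
        return all(p[u] < p[v] for p in pos)

    order = padded[0]  # a topological order of the auxiliary graph
    f_plus, h_plus = {n + 1: 0}, {}
    for u in reversed(order[:-1]):
        h_plus[u] = max((v for v in f_plus if edge(u, v)), key=f_plus.get)
        f_plus[u] = f_plus[h_plus[u]] + 1
    f_minus, h_minus = {0: 0}, {}
    for v in order[1:]:
        h_minus[v] = max((u for u in f_minus if edge(u, v)), key=f_minus.get)
        f_minus[v] = f_minus[h_minus[v]] + 1

    def path_through(g):
        """Genes on a longest path from 0 to n+1 that passes through g."""
        genes, u = {g}, g
        while u != 0:  # walk backward with H^-
            u = h_minus[u]
            genes.add(u)
        u = g
        while u != n + 1:  # walk forward with H^+
            u = h_plus[u]
            genes.add(u)
        return genes - {0, n + 1}

    l0 = path_through(0)  # one longest path, i.e. L_0
    status, paths = {}, []
    for g in range(1, n + 1):
        if g in l0:
            continue
        if f_minus[g] + f_plus[g] < f_plus[0]:
            status[g] = "proper"
        else:
            status[g] = "quasi"
            paths.append(path_through(g))  # the path L_i through g_i
    for g in l0:
        status[g] = "non" if all(g in p for p in paths) else "quasi"
    return status


print(classify_genes([(1, 2, 3), (2, 1, 3)]))
# genes 1 and 2 are quasi-transposable, gene 3 is non-transposable
```

For the sequences (1, 2, 3) and (2, 1, 3), the longest common subsequences are (1, 3) and (2, 3), so genes 1 and 2 exclude each other as in Lemma 1, while gene 3 appears in both.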

Algorithms and complexities
We summarize the above method as Algorithms 1,2. If we already know that the longest common subsequence is unique, then we just need to apply Algorithm 1, so that genes in L_0 are non-transposable, and genes not in L_0 are proper-transposable. We have reported Algorithm 1 previously [32,69]. Algorithm 1 is kept here for completeness. Assume we have m sequences with length n, and the length of the longest common subsequence is n − k.
Algorithm 1: Detailed workflow of determining proper-transposable genes and quasi-transposable genes in Scenario 1, preparation stage.
1. Input m linear sequences of genes 1, . . . , n, where each gene has only one copy in each sequence
2. Add 0 to the head and n + 1 to the tail of each sequence, and construct the auxiliary graph G
3. Set F^+(n + 1) = 0, and calculate F^+(·) and H^+(·) for each gene g_i in 0, 1, . . . , n recursively:
    F^+(g_i) = 1 + max F^+(g_j), H^+(g_i) = argmax F^+(g_j), over all g_j with an edge g_i → g_j
    % If g_j with g_i → g_j that maximizes F^+(g_j) is not unique, choose one randomly
4. Set F^-(0) = 0, and calculate F^-(·) and H^-(·) for each gene g_i in 1, . . . , n, n + 1 recursively:
    F^-(g_i) = 1 + max F^-(g_j), H^-(g_i) = argmax F^-(g_j), over all g_j with an edge g_j → g_i
    % If argmax is not unique, choose one randomly
5. Construct a longest path L_0 from 0 to n + 1: 0 → H^+(0) → H^+(H^+(0)) → · · · → n + 1

Algorithm 2:
For each gene g_i in g_1, . . . , g_k (the genes not in L_0)
    If F^-(g_i) + F^+(g_i) < F^+(0)
        Output g_i is a proper-transposable gene
    Else
        Output g_i is a quasi-transposable gene
        Use H^+(·) and H^-(·) to construct L_i, a longest path from 0 to n + 1 that passes g_i.
End of for
For each gene g_j in L_0 (excluding auxiliary 0 and n + 1)
    If g_j is contained in every constructed path L_i
        Output g_j is non-transposable
    Else
        Output g_j is quasi-transposable

All three sequences of ST540 start with gene dnaA and end with gene rpmH. We can regard them as linear gene sequences. We remove genes that appear more than once in one sequence, and remove genes that do not appear in all three sequences. After applying Algorithms 1,2 on these three sequences, there are 301 non-transposable genes, 4 quasi-transposable genes (hpaC, iraD, fbpC, psiB), and 263 proper-transposable genes. The reason for the large number of proper-transposable genes is that sequence CP007265.1 is significantly different from the other two. After removing it and applying Algorithms 1,2 to the remaining two sequences (CP007390.1 and CP007391.1), there are 564 non-transposable genes and 4 quasi-transposable genes (hpaC, iraD, fbpC, psiB). Therefore, some genes in hpaC, iraD, fbpC, psiB are likely to translocate.
All three sequences of ST2747 start with gene glnG and end with gene hemG. We can regard them as linear gene sequences. We remove genes that appear more than once in one sequence, and remove genes that do not appear in all three sequences. After applying Algorithms 1,2 on these three sequences, all 573 genes are non-transposable.

Circular sequences without duplicated genes
In Scenario 2, consider m circular gene sequences, where each sequence contains n genes 1, . . . , n. Each gene has only one copy in each sequence. For such circular permutations of 1, . . . , n, we need to find the longest common subsequence. Assume the length of the longest common subsequence is n−k.

Find a longest common subsequence
We first randomly choose a gene g_i. Cut all circular sequences at g_i and expand them to be linear sequences. For example, the circular sequences in Fig. 1 cut at 1 are correspondingly (1, 2, 3, 4, 5, 6) and (1, 2, 6, 4, 5, 3). Using Algorithm 1, we can find L_i that begins with g_i, which is a longest common subsequence of all expanded linear sequences. In the above example, the longest common linear subsequence starting from 1 is (1, 2, 4, 5). If g_i is a non-transposable gene or a quasi-transposable gene, then L_i (glued back to a circle) is a longest common circular subsequence. If g_i is a proper-transposable gene, then L_i is shorter than the longest common circular subsequence. In Fig. 1, gene 1 is non-transposable, and (1, 2, 4, 5) (glued) is the longest common circular subsequence.
We do not know if L_i (glued) is a longest common subsequence (whether containing g_i or not) for all circular sequences. If there is a longer common subsequence, it should contain genes that are not in L_i. Consider four variables L, g, C, and S, whose initial values are L_i, g_i, the length of L_i, and the complement of L_i. These variables contain information on the longest common linear subsequence that we have found during this procedure.
Choose a gene g_j in S, and cut all circular gene sequences at g_j. Apply Algorithm 1 to find L_j, which is the longest among the common subsequences that contain g_j. If the length of L_j is larger than C, set L to be L_j, set g to be g_j, set C to be the length of L_j, and set S to be the complement of L_j. Otherwise, keep L, g, C, and S unchanged.
Choose another gene g_l in S which has not been chosen before, and repeat this procedure. This procedure terminates when all genes in S have been chosen and cut. Denote the final values of L, g, C, and S by L_0, g_0, C_0, and S_0. Here S_0 is the complement of L_0.
During this procedure, if the current g is a proper-transposable gene, then S contains a non-transposable gene or a quasi-transposable gene, which has not been chosen. Thus L, g, C, S will be further updated. If the current g is a non-transposable gene or a quasi-transposable gene, then C has reached its maximum, and L, g, C, S will not be further updated. This means L_0 is a longest common circular subsequence, and C_0 is the length of the longest common subsequence, n − k. Also, the total number of genes being chosen and cut is k + 1. All k genes in S_0 and g_0 are chosen and cut. A gene g_t in L_0 (excluding g_0) is non-transposable or quasi-transposable, and cannot be chosen and cut. The reason is that it cannot be chosen before g_0 is chosen (only proper-transposable genes can be chosen before g_0 is chosen), and it cannot be chosen after g_0 is chosen (g_t ∉ S_0).
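The cut-and-search procedure above can be sketched as follows. This is a minimal illustration with hypothetical function names: the first gene of the first sequence replaces the random initial choice, genes of S are taken in the order of the first sequence, and `linear_lcs` is a compact stand-in for Algorithm 1 without the auxiliary vertices 0 and n + 1.

```python
def linear_lcs(seqs):
    """One longest common subsequence of linear sequences without duplicates."""
    pos = [{g: i for i, g in enumerate(s)} for s in seqs]
    length, nxt = {}, {}
    for u in reversed(seqs[0]):
        # genes already processed that follow u in every sequence
        succ = [v for v in length if all(p[u] < p[v] for p in pos)]
        nxt[u] = max(succ, key=length.get, default=None)
        length[u] = 1 + (length[nxt[u]] if succ else 0)
    u = max(length, key=length.get)
    path = []
    while u is not None:
        path.append(u)
        u = nxt[u]
    return path


def rotate(seq, g):
    """Cut a circular sequence at gene g and expand it to a linear one."""
    i = seq.index(g)
    return seq[i:] + seq[:i]


def circular_lcs(seqs):
    """One longest common circular subsequence, found by repeated cutting."""
    genes = list(seqs[0])
    g = genes[0]
    best = linear_lcs([rotate(s, g) for s in seqs])  # contains the cut gene
    tried = {g}
    while True:
        # S: genes outside the current best that have not been cut yet
        remaining = [x for x in genes if x not in best and x not in tried]
        if not remaining:
            return best
        g = remaining[0]
        tried.add(g)
        candidate = linear_lcs([rotate(s, g) for s in seqs])
        if len(candidate) > len(best):
            best = candidate


print(circular_lcs([(1, 2, 3, 4, 5, 6), (1, 2, 6, 4, 5, 3)]))  # [1, 2, 4, 5]
```

After cutting, the cut gene is the first element of every linear sequence, so it precedes all other genes and is always included in the subsequence returned by `linear_lcs`. For the sequences of Fig. 1, cutting at genes 3 and 6 only yields subsequences of length 3, so (1, 2, 4, 5) is kept.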

Determine quasi-transposable genes
For each gene g_p ∈ S_0, apply Algorithm 1 to calculate C_p, the length of the longest common subsequence that contains g_p. If C_p < C_0, g_p is a proper-transposable gene. Otherwise, C_p = C_0 means g_p is a quasi-transposable gene. We have found all proper-transposable genes. If all genes in S_0 are proper-transposable, then all genes in L_0 are non-transposable, and the procedure terminates. If S_0 contains quasi-transposable genes, then L_0 also has quasi-transposable genes. To determine quasi-transposable genes in L_0, we need the following lemma.
Lemma 2. In Scenario 2, choose a quasi-transposable gene g_p and cut the circular sequences at g_p to obtain linear sequences. A proper-transposable gene for the circular sequences is also a proper-transposable gene for the linear sequences; a non-transposable gene for the circular sequences is also a non-transposable gene for the linear sequences.
Proof. Consider a longest common subsequence L_p for linear sequences cut at g_p. Since g_p is a quasi-transposable gene, the length of L_p is also n − k, meaning that L_p is also a longest common subsequence for circular sequences. Now, this lemma is proved by the definition of proper/quasi/non-transposable gene.
If a gene g_r in L_0 is non-transposable for the circular sequences, then g_r is a non-transposable gene for linear sequences cut at each quasi-transposable gene g_q ∈ S_0. If a gene g_s in L_0 is quasi-transposable for the circular sequences, then there is a longest common circular subsequence L_t that does not contain g_s, meaning that L_t contains a quasi-transposable gene g_t not in L_0. Then g_s is a proper/quasi-transposable gene for linear sequences cut at g_t. Therefore, we can use the following method to determine quasi-transposable genes in L_0. For each quasi-transposable gene g_q ∈ S_0, cut at g_q and apply Algorithms 1,2 to determine if each gene in L_0 is proper/quasi/non-transposable for the linear gene sequences cut at g_q. A gene g_r ∈ L_0 is non-transposable for the circular sequences if and only if it is non-transposable for linear sequences cut at any quasi-transposable gene g_q ∈ S_0. A gene g_s ∈ L_0 is quasi-transposable for the circular sequences if and only if it is proper/quasi-transposable for linear sequences cut at some quasi-transposable gene g_q ∈ S_0.
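The whole classification for circular sequences can be sketched as follows. This is a simplified illustration with hypothetical function names and labels, not the released implementation. Two assumptions differ from the procedure described above: the sequences are cut at every gene (n cuts instead of k + 1), and the linear classification works without the auxiliary vertices 0 and n + 1, so path lengths are counted in genes.

```python
def classify_linear(seqs):
    """Linear classification: returns (status per gene, genes of one longest path)."""
    pos = [{g: i for i, g in enumerate(s)} for s in seqs]

    def before(u, v):
        return all(p[u] < p[v] for p in pos)

    order = list(seqs[0])
    fp, hp, fm, hm = {}, {}, {}, {}
    for u in reversed(order):  # longest path starting at u, counted in genes
        succ = [v for v in fp if before(u, v)]
        hp[u] = max(succ, key=fp.get, default=None)
        fp[u] = 1 + (fp[hp[u]] if succ else 0)
    for v in order:  # longest path ending at v, counted in genes
        pred = [u for u in fm if before(u, v)]
        hm[v] = max(pred, key=fm.get, default=None)
        fm[v] = 1 + (fm[hm[v]] if pred else 0)

    def path_through(g):
        genes = {g}
        for step in (hm, hp):  # walk backward, then forward
            u = step[g]
            while u is not None:
                genes.add(u)
                u = step[u]
        return genes

    best = max(fp.values())
    l0 = path_through(max(fp, key=fp.get))
    status, paths = {}, []
    for g in order:
        if g in l0:
            continue
        if fp[g] + fm[g] - 1 < best:
            status[g] = "proper"
        else:
            status[g] = "quasi"
            paths.append(path_through(g))
    for g in l0:
        status[g] = "non" if all(g in p for p in paths) else "quasi"
    return status, l0


def rotate(seq, g):
    i = seq.index(g)
    return seq[i:] + seq[:i]


def classify_circular(seqs):
    """Classify genes of circular sequences without duplicated genes."""
    genes = list(seqs[0])
    cuts = {g: classify_linear([rotate(s, g) for s in seqs]) for g in genes}
    c0 = max(len(l0) for _, l0 in cuts.values())  # length of the circular LCS
    g0 = next(g for g in genes if len(cuts[g][1]) == c0)
    l0 = cuts[g0][1]  # one longest common circular subsequence
    status, quasi_outside = {}, []
    for g in genes:
        if g not in l0:
            if len(cuts[g][1]) < c0:
                status[g] = "proper"
            else:
                status[g] = "quasi"
                quasi_outside.append(g)
    for g in l0:
        fixed = all(cuts[q][0][g] == "non" for q in quasi_outside)
        status[g] = "non" if fixed else "quasi"
    return status


print(classify_circular([(1, 2, 3, 4), (1, 3, 2, 4)]))
# genes 1 and 4 are non-transposable, genes 2 and 3 are quasi-transposable
```

For the circular sequences (1, 2, 3, 4) and (1, 3, 2, 4), the longest common circular subsequences are (1, 2, 4) and (1, 3, 4). When the sequences are cut at the quasi-transposable gene outside L_0, the other quasi-transposable gene becomes proper-transposable for the linear sequences, which is how it is detected.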
When we have determined all quasi-transposable genes in S 0 , it might be tempting to apply a simpler approach to determine quasi-transposable genes in L 0 : For each quasi-transposable gene g q ∈ S 0 , cut at g q and apply Algorithm 1 to find a longest common subsequence L q . A gene in L 0 is non-transposable if and only if it appears in all such L q . This approach is valid only if the following conjecture holds, which is similar to Lemma 1:
Conjecture 1. In Scenario 2 of circular sequences without duplicated genes, each quasi-transposable gene g i has a corresponding quasi-transposable gene g j , so that no longest common subsequence can contain both g i and g j .
However, Conjecture 1 does not hold. See Fig. 4 for a counterexample. All genes are quasi-transposable. Any two quasi-transposable genes are contained in a longest common subsequence (length 3). Thus the simplified approach above does not work.
We summarize the above method as Algorithms 3,4. If the longest common subsequence is known to be unique, then we only need to apply Algorithm 3: genes in S 0 are proper-transposable, and genes not in S 0 are non-transposable. Assume we have m sequences with length n, and the length of the longest common subsequence is n − k.

Applications on experimental data
We regard all three sequences of ST540 as circular gene sequences. We remove genes that appear more than once in one sequence, and remove genes that do not appear in all three sequences. After applying Algorithms 3,4 on these three sequences, there are 389 non-transposable genes, 50 quasi-transposable genes, and 129 proper-transposable genes. The reason for the large number of proper-transposable genes is that sequence CP007265.1 is significantly different from the other two. After removing it and applying Algorithms 3,4 to the remaining two sequences (CP007390.1 and CP007391.1), there are 564 non-transposable genes and 4 quasi-transposable genes (hpaC, iraD, fbpC, psiB). Therefore, some of the genes hpaC, iraD, fbpC, psiB are likely to translocate.
We regard all three sequences of ST2747 as circular gene sequences. We remove genes that appear more than once in one sequence, and remove genes that do not appear in all three sequences. After applying Algorithms 3,4 on these three sequences, all 573 genes are non-transposable genes.
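The filtering step used for both strains can be sketched as follows. This is only an illustration: the function name and the toy gene labels are hypothetical, and it assumes that a gene duplicated in any one sequence is dropped from every sequence.

```python
from collections import Counter

def keep_single_copy_shared_genes(sequences):
    """Keep only genes that occur exactly once in every sequence.

    Assumption (not spelled out in the text): a gene duplicated in one
    sequence, or missing from one sequence, is removed from all sequences.
    """
    shared = set(sequences[0])
    for seq in sequences:
        counts = Counter(seq)
        # A gene survives only if it occurs exactly once in this sequence.
        shared &= {gene for gene, c in counts.items() if c == 1}
    # Preserve the original order of the surviving genes.
    return [[gene for gene in seq if gene in shared] for seq in sequences]

# Toy input with hypothetical gene labels.
example = [
    ["a", "b", "c", "d", "b"],  # "b" is duplicated here
    ["c", "a", "b", "d", "e"],  # "e" is absent from the other sequences
    ["a", "c", "d", "b"],
]
filtered = keep_single_copy_shared_genes(example)
print(filtered)  # [['a', 'c', 'd'], ['c', 'a', 'd'], ['a', 'c', 'd']]
```

After this step every sequence is a permutation of the same gene set, which is the setting of Scenario 2.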
1. Input m circular sequences of genes 1, . . . , n, where each gene has only one copy in each sequence
2. Choose a gene g i randomly
   Cut all circular sequences at g i and expand them to be linear sequences
   Apply Algorithm 1 to find L i , a longest common subsequence in the expanded linear sequences
   Set C to be the length of L i , and set S to be the complement of L i
3. While S has a gene g j that has not been chosen and cut
     Cut all circular sequences at g j and apply Algorithm 1 to find L j
     Denote the length of L j by C j
     If C j > C
       Update C to be C j , and update S to be the complement of L j
     End of if
   End of while

   For each gene g l in S 0
     Cut all circular sequences at g l and expand them to be linear sequences
     Apply Algorithm 1 to find L l , a longest common subsequence in the expanded linear sequences
     Denote the length of L l by C l
     If C l < C 0
       Output g l is a proper-transposable gene
     Else
       Output g l is a quasi-transposable gene
       Cut all circular sequences at g l and apply Algorithms 1,2 to find all proper/quasi-transposable genes for linear gene sequences starting at g l
       Output genes not in S 0 but being proper/quasi-transposable for such linear sequences are quasi-transposable for circular sequences
     End of if
   End of for
   Output other genes that have not been determined to be proper/quasi-transposable are all non-transposable
3. Output: whether each gene is proper/quasi/non-transposable
Algorithm 4: Detailed workflow of determining proper-transposable genes and quasi-transposable genes in Scenario 2, output stage.
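The cut-and-search loop of the preparation stage can be sketched in Python as below. This is a minimal sketch under stated assumptions: `lcs_of_permutations` is a standard longest-chain dynamic program used as a stand-in for Algorithm 1 (which is not reproduced in this section), the starting gene is fixed rather than random so that the run is reproducible, and all function names are hypothetical.

```python
def lcs_of_permutations(seqs):
    """Longest common subsequence of linear sequences that are permutations
    of the same genes (a stand-in for Algorithm 1, O(m n^2) time)."""
    pos = [{g: i for i, g in enumerate(s)} for s in seqs]
    order = seqs[0]
    best = {}  # gene -> (length of best chain ending at gene, previous gene)
    for i, g in enumerate(order):
        best[g] = (1, None)
        for h in order[:i]:
            # h may precede g in a common subsequence only if it does so everywhere.
            if all(p[h] < p[g] for p in pos) and best[h][0] + 1 > best[g][0]:
                best[g] = (best[h][0] + 1, h)
    end = max(order, key=lambda g: best[g][0])
    chain = []
    while end is not None:
        chain.append(end)
        end = best[end][1]
    return chain[::-1]

def cut(seq, gene):
    """Cut a circular sequence at `gene` and expand it to a linear one."""
    i = seq.index(gene)
    return seq[i:] + seq[:i]

def longest_circular_common_subsequence(circ):
    genes = list(circ[0])
    first = genes[0]  # the text picks this gene at random; fixed here
    best = lcs_of_permutations([cut(s, first) for s in circ])
    tried = {first}
    while True:
        # S: genes outside the current best subsequence that were not cut yet.
        untried = [g for g in genes if g not in best and g not in tried]
        if not untried:
            return best
        g = untried[0]
        tried.add(g)
        candidate = lcs_of_permutations([cut(s, g) for s in circ])
        if len(candidate) > len(best):
            best = candidate

print(longest_circular_common_subsequence([[3, 4, 5, 1, 2], [4, 5, 3, 1, 2]]))
# [1, 2, 4, 5]
```

In this toy input the first cut (at gene 3) only finds a subsequence of length 3; cutting at gene 1, which lies in the complement, recovers the longer one, which is why the loop keeps cutting at genes of S.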

Linear sequences with duplicated genes
In Scenario 3, consider m linear gene sequences, where each sequence contains different numbers of copies of n genes 1, . . . , n. We need to find the longest common subsequence. Here we only consider common subsequences that contain either all copies or no copies of each gene, and the subsequence length is counted in genes, not gene copies.

Definition 5. A subgraph of G consists of some genes g 1 , . . . , g l and the edges between them. In a subgraph, if there is an edge between any two genes, this subgraph is called a complete subgraph (also called a clique).
Definition 6. In graph G, the degree of a gene g is the number of edges linking g. In a complete graph of p genes, where any two genes have an edge in between, each gene has degree p − 1.
Definition 7. If all copies of genes g 1 , . . . , g l keep their relative locations in all linear sequences, we say that g 1 , . . . , g l form a common subsequence.
The following Lemma 3 shows that there is a bijection between common subsequences and complete subgraphs in G. The problem of determining the longest common subsequence now becomes determining the largest complete subgraph of G.
Lemma 3. In Scenario 3, construct the auxiliary graph G from gene sequences. If g 1 , . . . , g k form a complete subgraph in G, then g 1 , . . . , g k form a common subsequence, and vice versa.
Proof. If g 1 , . . . , g k form a common subsequence, then there is an edge in G between any two genes in g 1 , . . . , g k , meaning that they form a complete subgraph.
For the other direction, only consider copies of g 1 , . . . , g k in these sequences. If g 1 , . . . , g k do not form a common subsequence, find the first position at which these restricted sequences differ. Assume g p appears at this position in one sequence and g q appears there in another. Then g p , g q cannot form a common subsequence, and there is no edge between g p and g q .
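The construction of the auxiliary graph can be sketched as follows. The sketch assumes, consistently with the proof of Lemma 3, that two genes are joined by an edge exactly when the two of them form a common subsequence in the sense of Definition 7, and that every gene occurs in every sequence; the function names are hypothetical. The input is the pair of linear sequences that the paper uses as a counterexample for Algorithm 5.

```python
from itertools import combinations

def restrict(seq, genes):
    """Keep only the copies of the given genes, in their original order."""
    return [g for g in seq if g in genes]

def auxiliary_graph(seqs):
    """Adjacency sets of G: an edge joins a and b when all copies of a and b
    keep their relative locations in every sequence (assumed edge rule)."""
    genes = sorted(set(seqs[0]))
    adj = {g: set() for g in genes}
    for a, b in combinations(genes, 2):
        views = [restrict(s, {a, b}) for s in seqs]
        if all(v == views[0] for v in views):
            adj[a].add(b)
            adj[b].add(a)
    return adj

seqs = [
    [7, 8, 9, 10, 1, 1, 2, 3, 3, 4, 5, 5, 6],
    [1, 2, 1, 3, 4, 3, 5, 6, 5, 7, 8, 9, 10],
]
adj = auxiliary_graph(seqs)
degrees = {g: len(neighbours) for g, neighbours in adj.items()}
print(degrees)
# {1: 4, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4, 7: 3, 8: 3, 9: 3, 10: 3}
```

With m sequences of length O(n), the pairwise comparison costs O(mn) per pair if done naively as here; precomputing positions brings the whole construction to the O(mn^2) bound quoted for Algorithm 5.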

A heuristic algorithm
The above discussion shows that given gene sequences, we can construct an undirected graph G, so that there is a bijection between common subsequences and complete subgraphs. The inverse also holds: We can construct corresponding gene sequences for a graph.
Lemma 4. Given an undirected graph G, we can construct two gene sequences, so that there is a bijection between common subsequences and complete subgraphs.
Combining Lemma 3 and Lemma 4, we obtain the following result:
Proposition 1. Finding the longest common subsequence in Scenario 3 is equivalent to the maximum clique problem, which is NP-hard.
Proof. For an undirected graph, we can use Lemma 4 to construct corresponding sequences. If we can solve the longest common subsequence problem in Scenario 3, then we can find the largest complete subgraph with additional polynomial time.
For gene sequences in Scenario 3, we can construct the corresponding auxiliary graph. If we can find the largest complete subgraph, then we can use Lemma 3 to find the longest common subsequence in Scenario 3 with additional polynomial time.
Therefore, finding the longest common subsequence in Scenario 3 and finding the largest complete subgraph are equivalent. The problem of determining the largest complete subgraph is just the maximum clique problem, which is NP-hard [64]. Thus finding the longest common subsequence in Scenario 3 is also NP-hard. This means it is unlikely that one can design an algorithm that always correctly determines the longest common subsequence in polynomial time.
We have transformed Scenario 3 into the maximum clique problem for a graph G. There have been various algorithms for the maximum clique problem [30,37,71], and readers may refer to a review for more details [77]. For completeness, we propose a simple idea: In the auxiliary graph G, repeatedly abandon the gene with the smallest degree (and also edges linking this gene) until the remaining genes form a complete subgraph. See Algorithm 5 for the details of this greedy heuristic method. This algorithm is easy to understand, and can provide some intuition. We do not claim that Algorithm 5 is comparable to other sophisticated algorithms.
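The greedy idea described above can be sketched as follows. This is an illustrative sketch rather than the paper's Algorithm 5 verbatim: the graph is given as a dictionary of adjacency sets, and ties between genes of equal degree are broken by the smallest label, which is an arbitrary choice made here.

```python
def greedy_clique(adj):
    """Repeatedly abandon a gene of smallest degree until the remaining
    genes form a complete subgraph. `adj` maps each gene to its neighbours."""
    alive = set(adj)
    degree = {v: len(adj[v]) for v in adj}
    # The remaining graph is complete when every gene has degree |alive| - 1.
    while alive and any(degree[v] < len(alive) - 1 for v in alive):
        # Tie-breaking by label is an assumption of this sketch.
        v = min(alive, key=lambda u: (degree[u], u))
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                degree[u] -= 1
    return sorted(alive)

# A triangle 1-2-3 with an extra gene 4 attached to gene 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(greedy_clique(toy))  # [1, 2, 3]
```

Each removal updates the degrees of the neighbours only, so the loop itself stays within O(n^2) work for n genes.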
We test Algorithm 5 on random graphs. Construct a random graph with n genes, where any two genes are joined by an edge with probability 0.5. Use brute-force search to find the maximum clique, and compare its size with the result of Algorithm 5. For each n ≤ 15, we repeat this 10000 times, and every time Algorithm 5 returns the correct result. Therefore, for small random graphs, the 95% credible interval for the success rate of Algorithm 5 is [0.9997, 1]. On such small random graphs, Algorithm 5 is thus a good heuristic that fails with very small probability. Since finding the true maximum clique requires exponentially slow brute-force search, we do not test on very large graphs. Nevertheless, Algorithm 5 does not always produce the correct result. See Fig. 6 for a counterexample. Here genes 1, 2, 3, 4, 5, 6 have degree 4, while genes 7, 8, 9, 10 have degree 3. When applying Algorithm 5, genes 7, 8, 9, 10 are abandoned first, and the final result has just three genes, such as 1, 3, 5. However, the largest complete subgraph is 7, 8, 9, 10. Besides, Algorithm 5 can only determine one (possibly longest) common subsequence. Thus we cannot determine the existence of quasi-transposable genes.
Fig. 6. The auxiliary graph G of linear sequences (7,8,9,10,1,1,2,3,3,4,5,5,6) and (1,2,1,3,4,3,5,6,5,7,8,9,10). Algorithm 5 fails on this counterexample.
Assume we have m sequences with n genes. In general, the copy number of a gene is small, and we can assume the length of each sequence is O(n). The time complexities of Step 2 and Step 3 in Algorithm 5 are O(mn^2) and O(n^2), and the overall time complexity is O(mn^2). The space complexity is O(mn + n^2).
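One way to arrive at an interval such as [0.9997, 1] is sketched below. It assumes a uniform Beta(1, 1) prior on the success rate, which the text does not state; with 10000 successes in 10000 trials the posterior is then Beta(10001, 1), whose cumulative distribution function is x^10001, so the one-sided 95% interval starts at 0.05^(1/10001). The function name is hypothetical.

```python
def credible_lower_bound(successes, trials, mass=0.95):
    """Lower end x of the one-sided credible interval [x, 1] when every trial
    succeeds, under a uniform Beta(1, 1) prior (an assumption of this sketch)."""
    if successes != trials:
        raise ValueError("this closed form only covers the all-success case")
    # Posterior is Beta(trials + 1, 1) with CDF x ** (trials + 1);
    # solve x ** (trials + 1) = 1 - mass for x.
    return (1.0 - mass) ** (1.0 / (trials + 1))

lower = credible_lower_bound(10000, 10000)
print(f"{lower:.4f}")  # 0.9997
```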

Circular sequences with duplicated genes
In Scenario 4, consider m circular gene sequences, where each sequence contains different numbers of copies of n genes 1, . . . , n. We need to find the longest common subsequence. Here we only consider common subsequences that contain either all copies or no copies of each gene, and the subsequence length is counted in genes, not gene copies.
We shall prove that finding the longest common subsequence in Scenario 4 is no easier than in Scenario 3. Thus Scenario 4 is also NP-hard.
Proposition 2. Finding the longest common subsequence in Scenario 4 is NP-hard.
Proof. From Proposition 1, Scenario 3 is NP-hard, meaning that any NP problem can be reduced to Scenario 3 in polynomial time. We just need to prove that Scenario 3 can be reduced to Scenario 4 in polynomial time.
Given m linear sequences with n genes in Scenario 3, add genes n + 1, . . . , 2n + 1 to the end of each sequence, and glue each linear sequence into a circular sequence. The longest common subsequence for these circular sequences has the following properties: (1) it contains all genes n + 1, . . . , 2n + 1; (2) after cutting at n + 1 and removing genes n + 1, . . . , 2n + 1, the remaining linear sequence is the longest common subsequence in Scenario 3.
(2) After cutting and removing n + 1, . . . , 2n + 1, the remaining linear sequence is a common subsequence in Scenario 3. If there were a longer common subsequence in Scenario 3, then that subsequence together with genes n + 1, . . . , 2n + 1 would be a longer common subsequence in Scenario 4, a contradiction.
Therefore, if we can find the longest common subsequence for these circular sequences, then we can find the longest common subsequence for linear sequences in polynomial time.
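The padding used in this reduction, and the way a circular answer is mapped back to a linear one, can be sketched as follows. The function names are hypothetical, circular sequences are represented as Python lists read cyclically, and the circular subsequence in the usage example is a made-up input chosen only to exercise the mapping, not the output of any solver.

```python
def pad_for_circular(linear_seqs, n):
    """Append the n + 1 fresh genes n+1, ..., 2n+1 to every linear sequence;
    each padded list is then read as a circular sequence."""
    padding = list(range(n + 1, 2 * n + 2))
    return [list(seq) + padding for seq in linear_seqs]

def strip_padding(circular_subseq, n):
    """Cut a circular common subsequence at gene n+1 and drop the padding
    genes, leaving a linear sequence over genes 1, ..., n."""
    i = circular_subseq.index(n + 1)
    rotated = circular_subseq[i:] + circular_subseq[:i]
    return [g for g in rotated if g <= n]

linear = [[1, 2, 2, 3], [2, 1, 2, 3]]
padded = pad_for_circular(linear, n=3)
print(padded)  # [[1, 2, 2, 3, 4, 5, 6, 7], [2, 1, 2, 3, 4, 5, 6, 7]]

# Hypothetical circular answer, written starting from an arbitrary point.
print(strip_padding([6, 7, 1, 3, 4, 5], n=3))  # [1, 3]
```

Both steps take time linear in the total sequence length, so the reduction is polynomial.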
Similar to Scenario 3, to find the longest common subsequence in Scenario 4, we want to reduce it to a maximum clique problem. However, Lemma 3 does not hold in Scenario 4. For example, consider the circular sequence (1, 2, 3) and its mirror image (3, 2, 1). Any two genes form a common subsequence of these two sequences, but the three genes together do not. However, inspired by Lemma 3, we have the following conjecture, although we do not know whether it is correct.
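The mirror example can be checked mechanically. The sketch below assumes that a set of genes forms a circular common subsequence when the sequences, restricted to those genes, coincide up to rotation; the function names are hypothetical.

```python
from itertools import combinations

def same_circular(a, b):
    """True if list b is a rotation of list a."""
    if len(a) != len(b):
        return False
    if not a:
        return True
    return any(b == a[i:] + a[:i] for i in range(len(a)))

def forms_circular_common_subsequence(seqs, genes):
    """Restrict every circular sequence to `genes` and compare up to rotation."""
    views = [[g for g in s if g in genes] for s in seqs]
    return all(same_circular(views[0], v) for v in views[1:])

seqs = [[1, 2, 3], [3, 2, 1]]  # a circular sequence and its mirror image
pairs_ok = all(
    forms_circular_common_subsequence(seqs, {a, b})
    for a, b in combinations([1, 2, 3], 2)
)
triple_ok = forms_circular_common_subsequence(seqs, {1, 2, 3})
print(pairs_ok, triple_ok)  # True False
```

Every pair passes because a circular sequence of two distinct genes has only one cyclic order, which is why pairwise edges carry no information here and triples are needed.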

Conjecture 2. In Scenario 4, if any three genes g i , g j , g l in g 1 , . . . , g k form a common subsequence, then g 1 , . . . , g k form a common subsequence.
To solve Scenario 4, construct a 3-uniform hypergraph G as follows [15]: vertices are genes 1, . . . , n; there is a 3-hyperedge (undirected) that links genes g i , g j , g k if and only if they form a common subsequence.
Proposition 3. If Conjecture 2 holds, then finding the longest common subsequence in Scenario 4 can be reduced to the maximum clique problem for 3-uniform hypergraphs.
Proof. If g 1 , . . . , g k form a common subsequence, then any three genes g i , g j , g l have a 3-hyperedge, and g 1 , . . . , g k form a complete subgraph. If g 1 , . . . , g k form a complete subgraph, then any three genes g i , g j , g l form a common subsequence. By Conjecture 2, this means g 1 , . . . , g k form a common subsequence. Therefore, there is a bijection between common subsequences and complete subgraphs. If we can solve the maximum clique problem for 3-uniform hypergraphs, then the maximum clique corresponds to the longest common subsequence.
We have reduced Scenario 4 to the maximum clique problem for 3-uniform hypergraphs, which is also NP-hard [77]. There have been some algorithms for the maximum clique problem for 3-uniform hypergraphs [63,57]. For completeness, we propose a simple idea: Repeatedly delete the gene that has the smallest degree, until we have a complete subgraph in which any three genes are linked by a 3-hyperedge. We summarize this greedy heuristic method as Algorithm 6. This algorithm is easy to understand, and can provide some intuition. We do not claim that Algorithm 6 is comparable to other sophisticated algorithms.
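The greedy idea for hypergraphs can be sketched as follows. This is an illustration, not Algorithm 6 verbatim: the degree of a gene is taken to be the number of remaining 3-hyperedges that contain it, ties are broken by the smallest label, and degrees are recomputed from scratch in every round for clarity rather than speed. The hyperedges in the example are a made-up toy input.

```python
from itertools import combinations

def greedy_hyperclique(vertices, hyperedges):
    """Delete a gene of smallest degree until any three remaining genes
    are linked by a 3-hyperedge."""
    edges = {frozenset(e) for e in hyperedges}
    alive = set(vertices)

    def is_complete():
        return all(frozenset(t) in edges for t in combinations(sorted(alive), 3))

    while not is_complete():
        # Degree = number of hyperedges lying entirely inside the alive set.
        degree = {v: 0 for v in alive}
        for e in edges:
            if e <= alive:
                for v in e:
                    degree[v] += 1
        # Tie-breaking by label is an assumption of this sketch.
        alive.remove(min(alive, key=lambda u: (degree[u], u)))
    return sorted(alive)

# Toy hypergraph: all triples inside {1, 2, 3, 4}, plus the hyperedge {3, 4, 5}.
toy_edges = list(combinations([1, 2, 3, 4], 3)) + [(3, 4, 5)]
print(greedy_hyperclique(range(1, 6), toy_edges))  # [1, 2, 3, 4]
```

The result corresponds to a common subsequence only if Conjecture 2 holds, as required by Proposition 3.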
We test Algorithm 6 on random graphs. Construct a random graph with n genes, and any two genes have probability 0.5 to have an edge in between. Use brute-force search to find the maximum clique, and compare its size with the result of Algorithm 6. For each n ≤ 15, we repeat this 10000 times, and every time Algorithm 6 returns the correct result. Therefore, for small random graphs, the 95% credible interval for the success rate of Algorithm 6 is [0.9997, 1]. On such small random graphs, Algorithm 6 is thus a good heuristic that fails with very small probability. Since finding the true maximum clique requires exponentially slow brute-force search, we do not test on very large graphs.
Assume we have m sequences with n genes. In general, the copy number of a gene is small, and we can assume the length of each sequence is O(n).
Figure: Four circular sequences. The longest common subsequence is (7,8,9,10). Algorithm 6 fails on this counterexample.

Discussion
Certain patients with myeloproliferative neoplasm have two mutations, JAK2 V617F and TET2. The temporal order of these two mutations could affect various macroscopic and microscopic features [52]. One explanation for this phenomenon is that mutations can lead to different spatial orders, namely gene transpositions, but different transpositions are not interchangeable [56]. Algorithms in this paper can be used to determine the transposable genes and study how different temporal gene (mutation) orders and spatial gene orders interfere with each other.
A gene g i might be missing in some sequences. Since g i is not in any longest common subsequence, it should be a proper-transposable gene. This gene can be directly removed before applying corresponding algorithms.
We can adopt a stricter definition of transposable genes to exclude a gene which only changes its relative position in a few (no more than l, where l is small enough) sequences. Then we should consider the longest sequence which is a common subsequence of at least m − l sequences. We can run the corresponding algorithm for every choice of m − l sequences. Thus the total time complexity will be multiplied by a factor of at most m^l.
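The enumeration over every choice of m − l sequences can be sketched as below. The driver takes the actual solver as a parameter; `common_prefix` is only a toy stand-in used to make the example runnable and is not one of the paper's algorithms, and all names are hypothetical. The number of choices is the binomial coefficient C(m, l), which never exceeds m^l.

```python
from itertools import combinations
from math import comb

def longest_over_subsets(seqs, l, solve):
    """Run `solve` on every choice of m - l sequences and keep the longest result."""
    m = len(seqs)
    best = []
    for group in combinations(seqs, m - l):
        candidate = solve(list(group))
        if len(candidate) > len(best):
            best = candidate
    return best

def common_prefix(group):
    """Toy stand-in for a real solver: the longest common prefix of the group."""
    out = []
    for column in zip(*group):
        if len(set(column)) != 1:
            break
        out.append(column[0])
    return out

seqs = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 4, 3], [2, 1, 3, 4]]
print(longest_over_subsets(seqs, 1, common_prefix))  # [1, 2]
print(longest_over_subsets(seqs, 2, common_prefix))  # [1, 2, 3, 4]

m, l = 10, 2
print(comb(m, l), m ** l)  # 45 100
```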
In Scenario 1 and Scenario 2 (linear/circular sequences without duplicated genes), if each sequence has n genes, and the longest common subsequence has length n − k, then there are at most k proper-transposable genes. About quasi-transposable genes, inspired by Lemma 1, we have the following guess.
Conjecture 3. Consider m linear/circular sequences with n genes without multiple copies. Assume the length of the longest common subsequence is n − k, and there are l proper-transposable genes. Then the number of quasi-transposable genes is no larger than 2(k − l).
When l + 2(k − l) ≤ n, in both linear and circular scenarios, we can find examples with 2(k − l) quasi-transposable genes.

Conclusion
In this paper, we study the problem of determining transposable genes in gene sequences, and design Algorithms 1-6 for different scenarios. To apply those algorithms, one needs to apply genomic annotation tools to transform raw DNA sequencing data into gene sequences, and replace gene names by numbers. Those algorithms have at most O(mn^3) time complexity, where m is the number of sequences, and n is the number of genes. Thus they can run in a reasonable time for most applications. We prove that the latter two scenarios are NP-hard (Propositions 1,2), and propose two unresolved problems (Conjectures 2,3) in discrete mathematics.
We start with gene sequences and determine translocated genes. Therefore, short transposons (possibly shorter than a gene) cannot be determined. Besides, we do not determine specific genomic rearrangement events. We aim at determining which genes are able to translocate. Specifically, we study how many longest common subsequences contain a certain gene, as a measure for its "stability". This mesoscopic viewpoint can be intriguing for understanding changes in genome.
The results in this paper are not limited to Scenarios 1-4. They can be applied to other bioinformatics situations, or even other fields that need discrete mathematics tools, such as text processing, compiler optimization, data analysis, image analysis [22]. Besides, algorithms in this paper might be able to detect non-syntenic regions [36].
There are some possible future directions: (1) prove Conjectures 2,3; (2) extend Proposition 3 to find more efficient solutions to Scenario 4; (3) determine whether genes appear in all longest common subsequences in other similar scenarios.