Evolutionary insights into Mariner-like elements in Apis 1

Abstract


Background
Mariner is a class II transposable element (TE) that moves within a host genome in a "copy and paste style" manner [1][2][3].Mariner was initially discovered in the Drosophila mauritiana genome [4].The length of first detected mariner (MOS1) is 1286 bp.TE similar to Mariner were subsequently identified in the genomes of many plants, vertebrates, and invertebrates, and named mariner-like elements (MLEs).The size of MLEs are about the same as one of Mos1, but varies copy to copy.The copy number of MLEs varies from species to species as well.For example, a copies of MLEs in the human genome is estimated about 14,000 [5] and one of in Apis mellifera is less than 2,000 copies (see below).Mariner and MLEs encode a transposase that possesses a D,D34D motif, which is characteristic of mariner and MLEs [3].Another mariner-specific characteristic is the terminal inverted repeat (TIR) located at both ends of mariner that contains transposase-binding sites.MLE has been detected using PCR with specific primers designed form consensus region of the transposase.Detected MLEs have been are classified their subfamilies (e.g.irritans, mellifera, mauritana, cecropia, and capitata) [2,6,7].Previous studies indicated that mariner and MLE distributed in the genomes of many species via horizontal transfer [2].
Evolutionary, the invading element is thought to have exploded in the host's genome (burst) and with the accumulation of mutations, the MLEs becomes immobilized gradually.Most of the detected MLEs are thought to be non-autonomous.
The PCR-based methods cannot detect all mariner and MLEs in genomes because a MLE that has lost the consensus region cannot be found in this way.These days, due to the availability of whole genome sequence data and the lower cost of genome sequencing, the methods employed to identify TE, including mariner and MLE, have been modified.One method involves the use of "RepeatMasker" (Smit A.F.A., Hubley R. & Green P. RepeatMasker Open 4.0.2013-2015, http://www.repeatmasker.org), which searches sequences in whole genome data that are similar to consensus TE sequences registered in a sequence library (e.g.Repbase) [8].This method detects all types of TE and autonomous and non-autonomous elements in the whole genome region, thereby tracking MLEs and providing further insights into the horizontal transfer of MLEs between host species.
The genus Apis (honey bee species) consists of four major and several minor species, and the following whole genome data of major 4 species are currently available (three Apis species and three Apis cerana subspecies): A. mellifera (Am) [9], A. cerana japonica (Acj) [10], A. cerana cerana Korea native (Ack) [11], A. cerana cerana China native (Acc) [12], A. dorsata (Ad) [13], and A. florea (Af) (Table 1).Apis genomes contain small numbers of TE, which mainly consist of class II TE, particularly MLEs [10,14].Thus, through analysis using these whole genomes, and detected MLEs data, MLEs dynamics in Apis genomes could be revealed.Being a small number of copies, in MLE evolutionary studies, Apis are one of the suitable insects.
Difficulties have been associated with the classification of TE because of variations in their sequences, as described above.Using whole genome information, six mariner consensus sequences (Ammers 1-6) have been reported for A. mellifera [14].However, this classification is just the results of RepeatMasker therefore it is not well considered.
The old PCR methods did not follow evolutionary back ground.In the present study, we extracted Dromar sequences in each Dromar class and built the profile files of each Dromar classes by Clustal Omega and hmmer programs with the Dromar sequences.Apis MLEs were then classified using these profiles by nhmmer (DNA homology search with profile Hidden Markov Models).Moreover, a phylogenetic analysis of the Apis MLE of each Dromar class was performed (Fig. 1).The dynamics of long Apis MLEs, defined as larger than 1kbp in Apis genomes, were revealed by using sequence analysis.Through these results, we provide novel evolutionary insights into Apis MLEs.

MLEs in six Apis species
RepeatMasker was used to detect MLEs in the Am, Ack, Acc, Acj, Af, and Ad genomes.
The output files obtained are shown in Supporting data 1 and 2. Approximately 50 to 150 MLEs were detected in the Ack, Acc, Acj, Af, and Ad genomes, while 2147 MLEs were identified in the Am genome.Detailed and sequence data of the MLEs detected are shown in Supporting data 3 and 4, respectively.To establish whether the identification of many MLEs was A. mellifera species-or subspecies-specific, the detection of MLEs by RepeatMasker with the same conditions was performed with the A. mellifera subspecies, these are A. mellifera carnica (Amcar), A. mellifera caucasica (Amcau), A. mellifera intermissa (Ami), and A. mellifera mellifera (Amm) (Table 1).The RepeatMasker results of these A. mellifera were not markedly different from Am (Supporting data 1 and 2), and the numbers of MLEs in Amcar, Amcau, Ami and Amm were 2247, 2215, 1738 and 2237, respectively.The numbers of MLEs in these genomes were not markedly different from those in Am.Therefore, we decided to perform further analyses against Am genome data.While processing the present study, new and highquality Acc genome data (Acc_new) were published [16].To clarify whether the results of RepeatMasker particularly in MLEs, were changed, the detection of MLEs by RepeatMasker with the same conditions was performed.The results of RepeatMasker were not markedly different between Acc and Acc_new (Supporting data 1 and 2), and the number of MLEs in Acc and Acc_new were 147 and 200, respectively.Therefore, we decided to perform further analyses of Acc genome data.Collectively, the results of MLE numbers in these Apis genomes suggested that a small number of MLEs existed in Apis genomes, except for Am genome, and some of these MLEs might burst within the Am genome in the process of evolution.

Dromar profiling
To build profiling data on 36 Dromar groups, we extracted Dromar sequences using 20 Drosophila genome sequences and the positional data of Dromar reported by Wallau et al. (2014).We only obtained 1418 Dromar sequences from 3685 Dromar copies because some Drosophila genome data were not available in the public database or the descriptions of some scaffolds in Drosophila genome data differed from those in positional data (Supporting data 5).

Phylogenetic analysis of Apis MLE and Dromar sequences.
Sequence alignments were performed using the Dromar and Apis MLE sequences belonging to each Dromar class (newick files for phylogenetic trees are shown in Supporting data 9), and a phylogenetic tree of Dromar 14, Dromar 17, and Dromar 29 was constructed based on alignment results because the majority of MLEs were classified into the three classes (Fig. 3), and an original tree figure is shown in supporting data 10.In Dromar 14, there were distinct clades in the phylogenetic tree (Fig. 3A).Two of these clades (clades I and III) consisted of only Am MLEs, while the other clade included six Apis MLEs plus Dromars (clade II).In Dromar 17, there were four distinct clades (Fig. 3B).One of them included only Am_MLEs (clade I).Clades II and IV include six Apis MLEs while clade III included Dromars.
In the case of Dromar 29, there were five distinct clades (Fig. 3C).Clade I consisted of only Am_MLEs while clade II included two Ad_MLEs and an Ad_MLE.Clade III included the six Apis MLEs while clade IV consisted of Dromars plus one Am_MLE.These results suggested that there are two groups of Apis MLEs.We hypothesized that some MLEs forming the Am_MLE-only group in the phylogenetic trees may invade to A. mellifera genomes by horizontal transfer and burst within the A. mellifera genome, while the other MLEs forming six Apis MLEs settle in Apis genomes.

Long MLE analysis
MLEs extracted using RepeatMasker include collasped, non-functional and short "junk" MLEs.Further elucidating the evolution insight of MLE in Apis, "long" MLE were analyzed.The long MLEs are thought to maintain the potential mobile feature of MLE.To confirm this hypothesis, we initially listed Apis MLEs larger than 1 kbp (long MLEs) (Table 4 and Supporting data 11).The majority of long MLEs are Am_MLEs.Long Am_MLEs were equally distributed in all chromosomes, except for the LG_8 chromosome (Fig. 4), suggesting that the burst of long Am_MLEs is not a chromosomal-local but across these all chromosomes.
To assess the dynamism of these long Apis MLEs within or among the six Apis genomes, we obtained the predicted transposase amino acid sequences using TransDecorder (Supporting data 12).A phylogenetic tree was constructed by using the transposase sequences (Fig. 5).There were three clades.One included Am_, Acc_, Ack_ and Acj_MLEs while the other two included only Am_MLEs, suggesting that MLEs consisting of Am_, Acc_, Ack_, and Acj_MLEs settled in Apis genomes, while those consisting of Am_MLEs invaded and burst in A. mellifera genome.
To confirm the settlement, we searched long MLEs in the clade including multiple Apis MLEs that code full-length transposases (sequence of more than 300 amino acids).We found Acc_MLE_100.p1,Acj_MLE_148.p1,and Ack_MLE_6.p1 in this clade and the three sequences were the same, except for one residue, with 99% identity of "Apis cerana Mariner transposase" in NCBI-nr (Accession ID: BAB86288.1)(data not shown) [12].To clarify whether the transposase also existed in Am, Ad and, Af genomes, we performed BLAST search (tblastn) against the Am, Ad and Af genomes.The tblastn results of Am showed that 73 MLEs with an alignment length > 300 and e-value < 1e-100 were hits (supplemental data 14) and the E_value of one of the 73 MLEs (sequence ID:NC_037644.1 range:12509588-12510547) was 0.0, which may have been a "settling MLE" from Apis species in the Am genome.The tblastn results of Ad and Af showed that a single sequence was a hit in each genome.(supplemental data 14).To investigate whether the Acc_MLE_10 settled in the Apis genomes, we compared the nucleotide sequences of the Acc_MLE_10 plus the flanking 500bp on both sides with the hit MLEs plus flanking 500bp on both sides in Am, Ack, Acj, Ad, Af (supporting data 15).More than 80% identities were observed in all combinations and the internal sites of the two sequences were mostly same while most outer sites in both sides were different, showing that the TIR of each MLE was conserved.Collectively, these results suggested that the MLEs settled through Apis genomes and burst in the Am genome.
We searched long MLEs in the two clades including only Am_MLEs that code fulllength transposases (sequence of more than 300 amino acids) (Fig. 5).Of the two clades, we found no transposase with a sequence of more than 300 amino acids and the longest transposase was Am_MLE_1186.p1(271 aa) in the smaller clade.We performed a blastp search by using Am_MLE_1186.p1as query against NCBI-nr.The top hit sequence was "Mariner Mos1 transposase [Stegodyphus mimosarum] (Accession ID: KFM57872.1),the percent identity, Evalue, and query cover of which were 51.16%, 1e-96, and 94% respectively (data not shown).
Using Am_MLE_1186.p1,we performed a tblastn search against the six Apis genome with Evalue < 1e-100.We did not obtain hits in any genomes, except for the Am genome (Supporting data 16).Among the hit sites in the Am genome, there were 5 sites plus one query itself that encoded more than 250 aa.We found two full-length transposases, Am_MLE_182.p1 and Am_MLE_284.p1, in the larger clade.Am_MLE_182.p1 was annotated as Camar1 transposase (Accession ID: AAO12862.1)by blastp against NCBI-nr, the percent identity, E-value, and query cover of which were 48.09%, 1e-120, and 99% respectively (data not shown).In the case of Am_MLE_182.p1, the same annotation results of Am_MLE_284.p1 were obtained.
Consequently, we performed a tblastn search against the six Apis genomes using the two Am_MLEs.Am_MLE_284.p1 hit one sequence with an alignment length > 300 and e-value < 1e-100 in each Af, Acj and Ack genome, but did not hit any sequence in the Acc and Ad genomes, and the hit sites in the Af, Acj and Ack were the same as the Acc_MLE_100 tblastn hit site as previously reported.However, the percent of identities of these hits were approximately 45 % (data not shown).In the Am genome, seven sequences were more than 300 amino acids in length and had 80% identity (Supporting data 17).In the case of Am_MLE_182.p1, there was no tblastn hit site against the Ack, Acj, Acc, Ad, or Af genome (data not shown).In the Am genome, there were seven with a sequence length of more than 300 amino acids and 80% identity.The hit sites were same as the hits of Am_MLE_284.p1 (Supporting data 17).These results suggested that there were two types of MLEs, Am_MLE_1186 and Am_MLE_284 types, which were only found in the Am genome and burst within it.

Discussion
In the present study, we extracted MLEs from six Apis genomes.The number of Am_MLEs was markedly higher than those of the other Apis species MLEs.MLEs were classified into Dromar classes according to Hidden Markov Model (HMM) profiles made by Dromar sequences [15].Classification results revealed that the majority of Apis MLEs were classified into mellifera subfamilies in old classification.The phylogenetic tree of each class showed that these MLEs formed two types of clusters, of which one included only Am_MLEs, while the other included the MLEs of six or multiple Apis species.Moreover, there was one long MLE-encoding complete transposase, which might settle in the Ad, Af, Acc, Ack, and Acj genomes and might explode in the Am genome, and two other types of long Am_MLEs that only existed in the A. mellifera genome and might explode within it.More Am MLEs were detected than the other five Apis MLEs.Differences in the MLE numbers detected were not due to the reference genome qualities of the six Apis species.We detected MLEs by RepeatMasker using multiple Apis genome data (Table 3).For example, new Acc genome data were published during the processing of the present study [16], and we performed MLE detection using these data.Although the contig N50 values of new Acc genome data were more than 150-fold higher than that of the Acc genome used in the present study, the numbers of MLEs detected were not markedly different between the two sets of genome data.Moreover, we detected MLEs by multiple A. mellifera genomes, the contig N50 values of which varied.Approximately 1700 to 2200 MLEs were detected in these Am genomes.
Based on these results, differences in MLEs between not Am genome and Am genome were not due to artificial or technical differences, suggesting that different feature of the MLEs among each Apis species seemed to due to evolutionary history of each species.
The initial study on the Am genome detected approximately 1100 MLEs in the genome, and these MLEs were classified into six classes, AmMar1-6, which included members of the mellifera, irritans, and rosa subfamilies [14].In the present study, we detected approximately 2000 MLEs from the latest version of the Am genome [9], and 864 MLEs of the detected MLEs with E-value < 1e-5 were classified.All 864 MLEs were allocated to mellifera subfamilies.This difference in classifications between the present and previous studies may be due to the detection and classification method, nhmmer.nhmmer adopted a HMM model that detects previously unrecognized sequence features [17].In detail, nhmmer, hmmer for nucleotide, can detect remote homologs as sensitively as possible [17].Thus, we think nhmmer is suitable for analyzing TEs, which are often mutated and highly varied.RepeatMakser also adopted nhmmer and searched TEs with Repbase library.On the other hand, the Dromar classification is considered to be sophisticated because the classification was achieved by multiple methods and many genome data.We attempted to perform a new classification method for Apis MLEs that combines nhmmer (HMM) and Dromar sequences, and provided detailed classifications and novel insights into Apis MLEs.Therefore, our classification method using nhmmer and the MLE sequences of the other species may be employed in a TE analysis.Based on our classification in the present study, the majority of MLEs of Am, Acc, Ack, Acj, Ad, and Af were classified into mellifera subfamilies, while a few were classified into mauritiana subfamilies.The present results further supported Apis MLEs exploding in the Am genome.
Phylogenetic trees revealed that there were two types of clusters, one of which consisted of only Am_MLEs, and the other of several Apis species MLEs.It is plausible that the former type of MLE invaded the genome in horizontal transmission and exploded after the bees diverged in speciation from other species of bees while MLEs of the latter type invaded the genome in horizontal transmission and settled through Apis species genomes and explode in the Am genome before the species divergence.This aspect was also consistent with the analysis of the long MLEs, which have putative transposases sequence and may maintain mobility.In addition, the long MLE that encoded the complete transposase, A. cerana Mariner transposase, was not found in the Ad, Af, or Am genome.However, the MLE encoding the complete transposase was identified in the Ad, Af, and Am genomes by a tblastn search because these MLEs were detected as a length < 1 kbp by RepeatMasker and were filtered out.One MLE that encoded the same sequence of A. cerana Mariner transposase was found in the Am genome, and several MLEs that encoded transposases that were highly similar to A. cerana Mariner transposase were also detected.These results suggested that the MLEs encoding A. cerana Mariner transposase invade horizontally through Apis evolutions and "burst" in the Am genome, but not in other Apis species.On the other hand, other types of Am_MLEs which carry the transposase similar to Camar1 transposase [Chymomyza amoena] were only found in the Am genome.These types of MLEs may invade the only Am genome by horizontal transfer and burst within the Am genome.The reason why MLEs only burst in the Am genome remains unknown.
The origin of Apis honey bee is Asia, except for Am, whose origin is Europe-Africa.On the other hand, in most beekeeping, the Am is used and is widespread in the all over world as domestic insects.These facts may be related to this aspect.More detailed structural analysis of Am and other Apis genome must be needed.This is the first example of the finer aspects of MLE evolution among closely related species, and perhaps the first time in other TEs.
Integrating all the results, the evolutionary aspect of MLEs belonging to the Dromar is shown in Figure 6.

Conclusions
We detected MLEs from the genome data of six Apis species, and performed nhmmerbased Apis MLE classification and phylogenetic analyses.About over 50-100 times of Am_MLEs were detected than these of the other Apis MLEs.Almost all of Apis MLEs classified were classified into mellifera subfamilies.The long MLEs of which length are over 1kbp divided into two types.One type of MLEs settled in Apis genome and burst in Am genome, the other invade into and burst in Am genome.We showed firstly provided evolutional insight Apis MLEs (Fig. 6).

MLE detection and extraction of sequence data
The six Apis genome genomes plus other versions and several A. mellifera subspecies sequences used in the present study are shown in Table 1.
RepeatMasker (Smit A.F.A., Hubley R. & Green P. RepeatMasker Open 4.0.2013-2015, http://www.repeatmasker.org) was used to detect TE in the six Apis genomes, as described in our previous study [10].RepeatMasker output (.tbl and .out)files are shown in Supporting data 1 and 2, respectively.Data on MLE were extracted from outfiles, and a single ID was allocated to each detected element (Supporting data 3).To extract each mariner sequence, positional data in Supporting data 3 and BLAST software were utilized (Supporting data 4) [18].

Classification of Apis MLEs
To build the hmmer profiles of Dromars, Dromar sequences were extracted using the

Long MLE analysis
The distribution of long Am_MLEs in A. mellifera chromosomes was visualized by ChromoMap [23].To obtain the predicted transposase of long MLEs, TransDecorder version 5.5.0 was used with default settings (https://github.com/TransDecoder/TransDecoder/wiki).EMBOSS needle (6.6.0.0) was used to compare the sequences of the long MLEs encoding the complete transposase plus flanking site [24].

Figures & Tables
Therefore, sequences for Dromar 3, 7, 18, 19, and 20 were not extracted.Consequently, the nhmmer profiles of 31 Dromars for the classification of Apis MLEs were built from the alignment results of each Dromar group (separated and merged alignment and nhmmer profiling data are shown in Supporting data 6 and 7, respectively).
MLEs were classified using the extracted Apis MLE sequences and Dromar profiles.The results of the classification are shown in Fig. 2, and detailed classification results are provided in Supporting data 8.In total, 864 out of approximately 2000 Am MLEs were classified into only mainly Dromar 14, Dromar 17 and Dromar 29 except for one into Dromar6, which are all melliferra subfamilies, while no MLE were classified into the other subfamilies.(Fig. 2).The MLEs of the other five species classified into mauritiana and mellifera subfamilies, and most of MLEs were classified into Dromar 14, Dromar 17 and Dromar 29.A list of all classified MLEs is shown in Supporting data 8.The results suggested that MLEs in Apis genome mainly consist of melliferra subfamilies.

Drosophila
genome data and Dromar location data reported by Wallau et al. (2014) (Supporting data 5) [15].However, some Dromar sequences were not extracted because genome sequence IDs in Drosophila genome data differed from those described in Dromar location data.Therefore, several Dromars were not used in the present study.The sequences of each Dromar were aligned by Clustal Omega (Clustal Omega -1.2.4) [19], and the hmmer profiles of Dromars were made by hmmer-building (HMMER 3.2.1)(Supporting data 6) [17].Using the nhmmer search program with merged Dromar hmm profiles (Supporting data 7), the Apis MLEs similar to each Dromar profile were identified (E-values < 1e-5) [17].The output data and table results of nhmmer are shown in Supporting data 18 and 19 respectively.Based on hmmer search results, each MLE was classified into hit Dromar.If a MLE was hit to multiple Dromars, it was classified into Dromar with the lowest e-value.

Fig. 1 Fig. 2 Fig. 3
Fig. 1 Scheme of data analyses in the present study

Fig. 4 Fig. 5
Fig. 4 Distributions of long Am_MLEs.LG_1 to LG_16 indicate the chromosome names of A. mellifera.Yellow markers indicate long Am_MLEs.Bold yellow markers indicate two MLEs placed at very close positions to each other.The figure was drawn by chromoMap

Fig. 6
Fig. 6 Evolutional insight of Apis MLEsSchematic summary of evolutionary insights into Apis MLEs revealed by this study.