Global non-random abundance of short tandem repeats in rodents and primates

Background: Short tandem repeats (STRs) are predominantly abundant across vertebrate genomes and have significant biological implications, yet the relevance of STR abundance to speciation remains largely elusive and is, for the most part, attributed to random coincidence. In a model study, we collected the whole-genome abundance of mono-, di-, and trinucleotide STRs in nine species encompassing rodents and primates: rat, mouse, olive baboon, gelada, macaque, gorilla, chimpanzee, bonobo, and human. The unnormalized and normalized data were used for hierarchical clustering of the STR abundances in the selected species.
Results: We found massive differential abundances between the rodent and primate orders. In addition, while numerous STRs had random abundance across the nine selected species, the global abundance conformed to three consistent clusters, namely rodents, Old World monkeys, and great apes, which coincided with the phylogenetic distances of the selected species (p < 4E-05). Exceptionally, in the trinucleotide STR compartment, human was significantly distant from all other species.
Conclusion: We propose that the global abundance of STRs is non-random in rodents and primates, and probably had a determining impact on the speciation of the two orders. We also identify the STRs and STR lengths whose abundance specifically coincided with the phylogeny of the selected species.

STRs are a source of rapid and continuous morphological evolution [11], for example, in the evolution of facial length in mammals [12]. These highly evolving genetic elements may also be ideal responsive elements to fluctuating selective pressures. A role in evolutionary selection and adaptation is consistent with deep evolutionary conservation of some STRs, as "tuning knobs", including several in genes with neurological and neurodevelopmental function [13].
While a limited number of studies indicate that purifying selection and drift can shape the structure of STRs at the inter- and intra-species levels [14-19], the global abundance of STRs at the crossroads of speciation remains largely unknown.
Mononucleotide and dinucleotide STRs are the most common categories of STRs in vertebrate genomes [20,21]. In addition to their association with frameshifts in coding sequences, with pathological [22] and possibly evolutionary consequences, recent evidence indicates surprising functions for mononucleotide STRs, such as a provisional role in translation initiation site selection [9]. Several groups have found evidence for the involvement of a number of dinucleotide STRs in gene regulation, speciation, and evolution [3,20,23-26].
Trinucleotide STRs are frequently linked to human neurological disorders, most of which are specific to this species [27,28].
In a model study, here we analyzed the evolutionary abundance of all types of mono-, di-, and trinucleotide STRs in nine selected species, encompassing rodents, Old World monkeys, and great apes.

Species and whole-genome sequences
The whole genomes of nine species were downloaded from the UCSC genome browser (https://hgdownload.soe.ucsc.edu) and analyzed. The species and their genome sizes were as follows:

Chromosome-by-chromosome aggregation of STRs
Whole-genome chromosome-by-chromosome data were aggregated and analyzed in the nine species, without normalization (approach 1) and with normalization (approach 2). In approach 1, all chromosomal data were collected without removing any chromosomes that were numerically non-identical across the nine species. In approach 2, data on the chromosome sets that were numerically identical across the nine species were collected in an array of 20 columns, each column corresponding to a chromosome. Mouse was selected as the reference for this approach because it had the lowest number of chromosomes among the nine species.
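The two aggregation approaches can be sketched as follows. This is a minimal illustration in Python, not the pipeline used in this study: the dictionary layout, the function names `aggregate_all` and `aggregate_common`, and the toy counts are all assumptions. Only the idea of using every chromosome (approach 1) versus a fixed number of chromosome columns shared by all species (approach 2) is taken from the description above.

```python
from typing import Dict, List

# Hypothetical input layout: for each species, a list of STR counts with one
# entry per chromosome, in chromosome order. The numbers below are invented.
Counts = Dict[str, List[int]]


def aggregate_all(counts: Counts) -> Dict[str, int]:
    """Approach 1 (sketch): sum over every chromosome of each species."""
    return {species: sum(per_chrom) for species, per_chrom in counts.items()}


def aggregate_common(counts: Counts, n_columns: int) -> Dict[str, List[int]]:
    """Approach 2 (sketch): keep the first n_columns chromosomes of each
    species, so that every species contributes an array of the same width."""
    shortest = min(len(per_chrom) for per_chrom in counts.values())
    if n_columns > shortest:
        raise ValueError("n_columns exceeds the smallest chromosome set")
    return {species: per_chrom[:n_columns] for species, per_chrom in counts.items()}


toy = {
    "mouse": [120, 95, 80],
    "rat": [130, 90, 85, 40],
    "human": [60, 55, 50, 45, 30],
}
totals = aggregate_all(toy)                   # one number per species
matrix = aggregate_common(toy, n_columns=3)   # equal-width rows per species
```

In the setting described above, `n_columns` would be 20, the width of the array defined by the reference species.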

STR abundance and hierarchical cluster analysis across species
Whole-genome STR abundances across the selected species were depicted by boxplot diagrams and hierarchical clustering, using the boxplot and hclust functions [29] in R, respectively. Boxplots illustrate abundance differences among segments across the selected species, and hierarchical clustering plots demonstrate the level of similarity and difference across the obtained abundances. The input data to these functions were numerical arrays obtained with each approach. Each array consisted of a number of columns, each column corresponding to the STR abundance in a different chromosome.
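To make the clustering step concrete, the following is a small pure-Python sketch of agglomerative hierarchical clustering on per-species abundance rows. It is not the R code used in this study. Complete linkage on Euclidean distances is used because those are the defaults of `hclust` and `dist` in R, but the text does not state which linkage or distance was applied, so both are assumptions, as are the function name `complete_linkage_clusters` and the toy numbers.

```python
import math
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length abundance rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage_clusters(rows: Dict[str, Sequence[float]], k: int) -> List[List[str]]:
    """Merge the two closest clusters repeatedly until k clusters remain.

    The distance between two clusters is the largest pairwise distance
    between their members (complete linkage; an assumption, see lead-in).
    """
    clusters: List[List[str]] = [[name] for name in rows]
    while len(clusters) > k:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(
                    euclidean(rows[a], rows[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Invented abundance rows, one value per chromosome column.
toy = {
    "mouse": [900.0, 880.0],
    "rat": [920.0, 870.0],
    "macaque": [500.0, 510.0],
    "baboon": [505.0, 515.0],
    "human": [300.0, 310.0],
    "chimpanzee": [305.0, 300.0],
}
groups = complete_linkage_clusters(toy, k=3)
```

Stopping at `k` clusters corresponds to cutting the dendrogram at a chosen level; running the merges down to a single cluster and recording each merge distance would give the full tree.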

Statistical analysis
The STR abundances across the nine selected species were compared by repeated-measures analysis, using one-way and two-way ANOVA tests. These analyses were confirmed by nonparametric tests.
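As an illustration of the kind of comparison involved, the sketch below computes the F statistic of a plain one-way ANOVA for independent groups. This is a simplification: the analysis described above used repeated-measures one-way and two-way designs, which partition the variance differently. The function name `one_way_anova_f` and the toy values are assumptions. A p-value additionally requires the F distribution, which in practice would come from a statistics library (for example `scipy.stats.f_oneway` for this simple design).

```python
from statistics import fmean
from typing import Sequence


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> float:
    """Return the one-way ANOVA F statistic for independent groups.

    F = (between-group sum of squares / (k - 1))
        / (within-group sum of squares / (n - k)),
    where k is the number of groups and n the total number of values.
    """
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    k = len(groups)
    n = len(all_values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Invented per-chromosome abundances for three hypothetical species groups.
f_stat = one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
```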

Results
Global abundance of mono-, di-, and trinucleotide STRs coincides with the phylogenetic distance of the nine selected species.
The whole-genome STR abundances from aggregated chromosome-by-chromosome analysis in the dinucleotide category (Table 2) showed global decremented patterns in all primate species vs. mouse and rat (Fig. 2). There was global shrinkage of the trinucleotide STR compartment in primates vs. rodents, without (P=3.8E-05) and with normalization of the data (P=2.4E-07) (Table 3, Fig. 3, and Suppl. 1). Remarkably, human stood out among all other species in the hierarchical clustering.

Differential abundance patterns of STRs across rodents and primates.
Numerous STRs across the mono-, di-, and trinucleotide STR categories coincided with the phylogenetic distances of the nine selected species. For example, the most abundant STRs across all nine species were T/A mononucleotides of 10, 11, and 12 repeats, whose abundance coincided with the genetic distance of the selected species (Fig. 4). Likewise, (ct)6 and (taa)4 conformed to the phylogeny of the studied species in the di- and trinucleotide STR categories, respectively.
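For readers unfamiliar with the notation, (ct)6 denotes six tandem copies of the motif ct, and a T mononucleotide of 10 repeats denotes a run of ten T bases. The sketch below counts such repeats in a sequence with a regular expression. The detection rule is an assumption, since the text does not specify one: a run is counted only if it consists of exactly the requested number of copies and is not part of a longer run, and the complementary strand is not considered. The function name `count_exact_str` is hypothetical.

```python
import re


def count_exact_str(sequence: str, motif: str, repeats: int) -> int:
    """Count runs made of exactly `repeats` tandem copies of `motif`.

    The lookbehind and lookahead reject runs that are flanked by another
    copy of the motif, so a run of 7 copies is not counted as a run of 6.
    Matching is case-insensitive by upper-casing both inputs.
    """
    m = re.escape(motif.upper())
    pattern = re.compile(f"(?<!{m})(?:{m}){{{repeats}}}(?!{m})")
    return len(pattern.findall(sequence.upper()))


# Invented toy sequence: one (ct)6 run, one (ct)7 run, one (taa)4 run.
toy_seq = "gg" + "ct" * 6 + "aa" + "ct" * 7 + "g" + "taa" * 4 + "c"
n_ct6 = count_exact_str(toy_seq, "ct", 6)
n_taa4 = count_exact_str(toy_seq, "taa", 4)

# Invented toy sequence with one run of 10 T bases and one run of 11.
mono_seq = "a" + "t" * 10 + "c" + "t" * 11
n_t10 = count_exact_str(mono_seq, "t", 10)
```

Applying such a counter per chromosome and per species would yield the abundance arrays used as clustering input, under the stated assumptions.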

Discussion
It is largely unknown whether, at the crossroads of speciation, STRs evolved as a result of purifying selection, genetic drift, and/or in a directional manner. In a model study, we selected multiple species across rodents and primates, and investigated the abundance of all possible types of mononucleotide, dinucleotide, and trinucleotide STRs on the whole-genome scale in those species. Hierarchical clustering of the obtained abundances yielded clusters that predominantly coincided with the phylogenetic distances of the selected species.
Hierarchical clustering is an unsupervised method for grouping data. It is unsupervised because it works on unlabelled data: starting from individual observations, it successively merges the most similar observations or clusters into a tree (dendrogram), which can then be cut at any level to obtain a given number of clusters. Here we implemented this algorithm to cluster the nine selected species based on the obtained STR abundances.
Our findings may be of significance in two respects. Firstly, there were significant differential abundances separating rodents from primates, for example, a massive decrease in the abundance of dinucleotide and trinucleotide STRs, and a massive increase in the abundance of mononucleotide STRs, in primates vs. rodents. Those differential abundances might have had determining roles in the speciation of the two orders. Secondly, the three major clusters obtained from global hierarchical cluster analysis matched the phylogeny of the three groups of species, i.e., <rodents>, <Old World monkeys>, and <great apes>. It is possible that the abundance of STRs must fall within certain mathematical channels/thresholds in various orders. This is in line with the hypothesis that STRs function as scaffolds for biological computers [30].
In addition, our data indicate that various STRs and STR lengths behave differently with respect to their global abundance. Not all studied STRs coincided with the phylogenetic distances of the nine selected species. We hypothesize that those which coincided had a link with the speciation of those species, whereas those which did not probably followed random patterns. Future studies, such as large-scale genome editing of STRs [45] in embryonic stem cells and investigation of their differentiation into various cell lineages, may be candidate approaches to investigate how the observed massive non-random patterns link to speciation and evolution.
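The statement that clusters matched the three taxonomic groups amounts to comparing two partitions of the same nine species. A minimal sketch of that check is given below. The taxonomic grouping follows the species list in this study, whereas the cluster identifiers (`c1`, `c2`, `c3`) and the helper name `as_partition` are hypothetical and stand in for the output of a clustering run.

```python
from typing import Dict, FrozenSet, Set


def as_partition(labels: Dict[str, str]) -> Set[FrozenSet[str]]:
    """Turn a species -> label mapping into a set of groups.

    Label names are discarded, so two labelings compare equal whenever
    they split the species into the same groups.
    """
    groups: Dict[str, set] = {}
    for species, label in labels.items():
        groups.setdefault(label, set()).add(species)
    return {frozenset(members) for members in groups.values()}


taxonomy = {
    "rat": "rodents", "mouse": "rodents",
    "olive baboon": "Old World monkeys", "gelada": "Old World monkeys",
    "macaque": "Old World monkeys",
    "gorilla": "great apes", "chimpanzee": "great apes",
    "bonobo": "great apes", "human": "great apes",
}
# Hypothetical cluster assignment, e.g. from cutting a dendrogram at 3 clusters.
clusters = {
    "rat": "c1", "mouse": "c1",
    "olive baboon": "c2", "gelada": "c2", "macaque": "c2",
    "gorilla": "c3", "chimpanzee": "c3", "bonobo": "c3", "human": "c3",
}
matches = as_partition(taxonomy) == as_partition(clusters)
```

An STR whose clustering places, say, human in a cluster of its own would fail this check, which is one way to separate STRs that coincide with the phylogeny from those that do not.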

Conclusion
We propose that the global abundance of STRs is non-random across rodents and primates. We also identify the STRs and STR lengths whose abundance coincided with the phylogenetic distances of those species.

Competing interests
The authors have no conflicts of interest to declare.

Funding
This research was funded by the University of Social Welfare and Rehabilitation Sciences, Tehran, Iran.

Authors' contributions
MA performed and coordinated the bioinformatics analyses. MS performed the biostatistics analysis. YHN, IA, and AMAM contributed to data collection. KK contributed to data collection and coordination. MO conceived and supervised the project, and wrote the manuscript.

29. Murtagh, F. and P.

Fig. 1. Unnormalized data on whole-genome mononucleotide STRs in the nine selected species. Global incremented pattern was observed in the primate species vs. rodents (left graphs). The overall hierarchical clustering yielded three clusters, which coincided with rodents, Old World monkeys, and great apes (right graphs).

Fig. 2. Unnormalized data on whole-genome dinucleotide STRs in the nine selected species. Global decremented patterns were observed in all primate species vs. mouse and rat.

Fig. 3. Unnormalized data on whole-genome trinucleotide STRs in the nine selected species. While global decremented patterns were observed in primates vs. rodents, intriguingly, human stood out in this category, in comparison to all other species.