The variations of human miRNAs and Ising like base pairing models

miRNAs are small about 22-base pair long, RNA molecules are of extreme biological importance. Like other longer RNA molecules, messages in miRNAs are encoded by the permutations of only four nucleotide bases represented by A, U, C and G. However, just like words in any language, not all combination of these alphabets make a meaningful word. In fact, we find that the distributions of nucleotides bases in human miRNAs show significant deviation from randomness. First, a miRNA sequence containing four bases are mapped into a binary string with three kinds of classifications according to their chemical properties. Then, we propose a simple nearest neighbor model (Ising model) to understand the statistical variations in human miRNAs.


Introduction
Micro-RNAs (miRNAs) are small non-coding RNA molecules. They are made of the different permutations of nucleotide bases A, U, G and C. The number of bases in the miRNA is conserved across the species and they vary around 22 nucleotides (mostly). They play important role as the regula- 5 tors of gene expression and they target some of messenger-RNAs (mRNAs) [1,2,3]. Mainly, the function of miRNAs is to down-regulate gene expression [4]. In recent years, the effects of miRNAs are also found in malignant cells; the miRNAs influence numerous cancer-relevant processes in a malignant cell [5,6]. It will be worthy if the governing role of miRNAs could be 10 apprehended from nucleotide bases and their properties. The recent study have demonenstrated how the miRNAs of five species are distributed on the basis of purine-pyrimidine bases that used different mathematica parameters [7]. To have a deeper understanding of the nature of miRNAs, here we study a few statistical properties of the miRNAs and propose some simple physics-15 inspired models to account for their ensemble properties. As we will discuss, our results may shade some light on how the miRNAs are produced in the cell.

Dataset specification 20
The whole miRNA sequences (for Human, Hominidae family) are accessable through the website (a miRNA database: http://www.mirbase.org/). In human, 2588 reported mature miRNAs sequences are found. Their lengths vary from 16 to 28 nucleotide bases, but mostly within the range of 20-24 bases (Fig 1). Let L be the data set containing 2588 miRNAs i.e. L = 2588.

25
The average length or mean (M (L)) of 2588 miRNAs is 21.5881. All the computational codes are written in MATLAB R2016a software.

Individual variation of nucleotides bases
There are only four ribonucleotide bases which are distributed over the miRNAs. Here, we will see their distribution in terms of the probability 30 measure. Let, P r(X) is the probability of a nucleotide X (X ∈ {A, G, U, C}) to be present in the 2588 miRNAs, then where f X is the average number of occurrence of the nucleotide X over 2588 miRNAs.  From the data set, we find f A = 4.77, f G = 6.21, f U = 5.55 and f C = 5.53. So, P r(A) = 0.22, P r(G) = 0.28, P r(U ) = 0.26 and P r(C) = 0.24. It is observed that P r(A) = P r(G) = P r(U ) = P r(C). Therefore, we find the individual variations of each nucleotide base over the 2588 miRNAs. And 40 the orderis P r(A) < P r(C) < P r(U ) < P r(G).

Grouping of nucleotide bases and transformations into binary
The four-letter alphabets (A/G/U/C) may be mapped onto two-letter (binary) alphabets in three distinct possible ways. As it turns out, these three classifications may be understood on the basis of chemical properties 45 of nucleotide bases [8]. These are: • Purine-Pyrimidine classification: The bases AG and UC are the purine and pyrimidine (Pu-Py) groups respectively. Among the three classifications, the purine-pyrimidine class is possibly the most important one [9]. Purines are the most widely occurring nitrogencontaining heterocyclic compounds in nature. In order to form DNA and 55 RNA, both purines and pyrimidines are needed by the cell in approximately equal quantities. But, this may not hold true for miRNAs. For all of these three groupings, we can assign the bits 1's and 0's to the respective members of any of the aforementioned classifications. For example, for the purine-pyrimidine class we may have the following rule: Interestingly, for all three categories, the distribution of occurrence of 0 (or 1) in the binary string representation of miRNAs fits well to a normal  distribution. The group-specific distributions of the nucleotide base pairs vary significantly from each other. However, we observe that the mean oc-70 currence of any member (pu or py) in the Purine-Pyrimidine classification is the closest to the occurrence averaged over all the members of all the three classifications (10.79), as compared to other two groupings. The mean and variance for each members of all the classifications are provided in the Table  1. The question that naturally arises is: which kind of model best describes 75

Binomial Model
The simplest thing to do is to fit a binomial model. In a binomial model, probability of occurrence of a 0 (or 1) at each location is determined by an independent coin-toss with a probability p (or q). Here p + q = 1. The mean 80 of an binomial distribution is np. The binomial distribution has a simple property that variance is npq (Theoretical/Expected). Hence from the mean we can calculate p and check if the variance is close to the observed variance. (Here, n = M (L) = 21.5881). The details of Expected variance and observed variance for the bases of each classifications are shown in Table 1.

85
So, from the above results we find that the observed variance is greater than the variance obtained from the binomial model. However, the discrepancy in variance for the Pu-Py classification is much larger than that for the other classifications. For all other classes the expected variance is much closer to the observed variance. So, there are patterns of miRNAs where one 90 group (Pu or Py) is more frequent than what is expected from chance alone. Therefore, it is interpreted that miRNAs have a stronger interdependency with respect to the purine-pyrimidine class.
To have a better statistical understanding of our claim that observed mean is different from the mean calculated from observed variance assum-95 ing binomial distribution, we performed one-sample t-test (h) with the null hypothesis that two observed mean are equal. We performed the t-test separately for all three cases and calculated the p value. In all three cases we can reject the null hypothesis with high significance ( Table 2).

Mean-field model 100
In the previous subsection we have observed that the variance of the data is more than what is expected of the independent, binomially distributed binary strings. This discrepancy indicates that there could be interactions other element within the same string with equal strength (all-to-all connected topology). In the other possible case, there might be a strong nearest neighbor interaction instead of a mean-field one. To check which of these two descriptions are closer to the observations from the data, we calculate and compare the probabilities of finding 2-clusters (i.e., like-pairs) of Purines and Pyrimidines in the binary strings obtained from the data (shown in Table 4) and the same obtained by the bootstrap method. In the bootstrap method we shuffle the elements of each binary string multiple times keeping the numbers of Purines and Pyrimidines constant. We then calculate the probabilities of a 2-cluster in the shuffled strings and obtain a distribution of them. If the 115 probabilities obtained from the data falls under the bootstrap distribution, preferably close to the pick of the distribution, then one can infer that shuffling does not change the cluster-probabilities, and a mean-field picture is enough to model the statistical properties of the miRNA sequences. The distributions and the values calculated from the data are shown in figures 5 120 and 6. A statistically significant difference between the observed means and val-  ues obtained in the bootstrap analysis indicate that the purines and pyrimidines are not positioned in a random fashion in the miRNAs. Their special arrangements, therefore, suggest that they have a rather strong nearest 125 neighbour interaction.

Nearest Neighbor Model : Ising Model
Here we construct a Nearest Neighbour (NN) model which assumes pairwise interaction between any two nearest neighbour elements. In a different context, for various nucleotide sequencing data such models have previously been used [10]. In this model, the probability of occurrence of a particular string of length N with r zeros would be given by, where 'J's are the strength of nearest neighbour interactions.  Table 3 we observe that P 00 ≈ P 11 and P 01 ≈ P 10 for all 3 types of classifications. This indicates that it might be sufficient to club the four 130 different interactions to interactions between similar elements and those between the different elements. Also, we know that P 0 = P 1 (from Table 1) and P 00 = P 01 (from Table 3) for all of the different classification schemes. These observations together suggest that a nearest neighbour model can possibly be a good candidate to fit and explain the observed data.

135
In physics literature the nearest neighbour model is called the Ising model. Historically, the Ising model was created/used to describe the paramagnetic to ferromagnetic transition in 1-dimension. It is a semi-classical model which, in simple terms, assumes that the magnetic moments (spins) of atoms in a 1d atomistic chain can assume only + 1 2 (↑) and − 1 2 (↓) values in some suitable 140 unit system along a particular (say, Z) axis. In a specific variant of the model, let us consider a closed (i.e., with periodic boundary condition) chain of N spins with nearest neighbour interaction strength, J, and chemical potential (magnetic field like term), µ. We assume that the inverse temperature, β = 1. There can be two approaches to fit this model to the observed microRNA 145 data. One is to calculate the susceptibility using the transfer matrix method and treat it as the variance of the difference between population of the two states (↑ and ↓ for Ising model, Pu and Py for our classification, for example) . This approach is equivalent to looking at the microRNA strings from a holistic point of view. The other approach is to calculate exactly the 150 probabilities of n-clusters of a particular state and fit them to the observed values.

Cluster probabilities for large N Ising model
As mentioned earlier, we can verify the exactness of our model by checking the probabilities for high order clusters (11 and 00, 111 and 000 and so on).
155 This is shown in Table 4. (Pu,St,Am) for 1's and (Py,We,Ke) for 0's. And the corresponding 2-D line plot is shown in Figure 5 in two different styles for (Pu, St, Am) and (Py, We, Ke). The P r(r) values (r > 4) for Pu-Py class is more than the other classes.  Feeding these values in eqn.s 17 and 18 we find, P 000 = 0.147573 (6% deviation from the observed value) and P 0000 = 0.0809581 (16.5% deviation from the observed value).
However, doing the same thing with equations and probabilities for 1-165 clusters gives, P 111 = 0.167168 (3.4% deviation from the observed value) and P 1111 = 0.0957027 (8% deviation from the observed value).

Cluster probabilities
In their paper [11] Ivanytskyi & Chelnokov have calculated the exact expression for average occupancy number of distinct l-clusters in Ising model in the absence of external field. The average occupancy number can be written as, Now, the number of all distinct and non-distinct q-clusters ,ñ q≤N , can be related to the number of all distinct clusters as, Solving equation 4 for q = 2 and p 00 from Table 4 we get, J = 0.1728697. Using this J in equations 3 and 4 we predict p 000 from this model to be 170 0.1714543, which deviates from the observation ( Table 4) by 1.05% only.

Variance from finite N Ising chain with PBC
In App. 3.1 we calculate a frequently used physical quantity, known as the isothermal susceptibility, for the Ising chain for finite N. This quantity can be argued to be identical to the variance of difference between the population of two classes (in case of Pu-Py classification, χ T ≡ σ 2 n P u −n P y N ). At µ → 0 limit this variance is given as, Solving for J with σ 2 database /N = 39.8470/22 gives, J = 0.297 (6) which is almost double the value obtained in subsection 2.6.2. For the strong-weak H-bond grouping, σ 2 database /N = 29.4017/22 gives, whereas, J = 0.175 for the amino-keto grouping.

Concluding Remarks
We observe that the Ising model with nearest neighbour interactions and chemical potential serves as a better descriptor of the miRNA data than the binomial model. We also observe that, within the scope of Ising model, the mean-field like quantities such as variance cannot lead us to a full description of the data. For a better description we may use a hybrid model which takes strong nearest-neighbour interactions into account with a background meanfield description.
At least for the Pu-Py classification, we observe that there is hardly any difference between the finite and large N Ising models in describing the miRNA data. So the finite number of elements in the string does not play 185 an important role in our approximated description [12].
In all of our analysis we have mapped the four letter alphabet of nucleotide bases into a two letter binary alphabet consisting of zero and one. It is possible to keep the four letter alphabet and address the problem in all its generality. In that case the transfer matrix will have sixteen parameters. We Where F = e 2J (e 2µ (e µ (G + 4) − 1) + 1),