## Abstract

The evolution of human genetics is one of the most interesting areas for researchers. Determining haplotypes not only provides valuable information for this purpose but also plays a major role in investigating the probable relation between diseases and genomes. Determining haplotypes by experimental methods is a time-consuming and expensive task. Recent progress in high-throughput sequencing allows researchers to use computational methods for this purpose. Although several algorithms have been proposed, they become less accurate as the error rate of the input fragments increases. In this paper, first, a fuzzy conflict graph is constructed based on the similarities of all input fragments, and then the cluster centers are used as initial centers by the fuzzy c-means (FCM) algorithm. The proposed method has been tested on several real datasets and compared with some current methods. The comparison with existing approaches shows that our method can play a complementary role among them.

## 1. Introduction

The sequencing efforts of the Human Genome Project revealed that more than 99% of human DNA sequences are identical [1]. Consequently, genomic differences are responsible for the diversity of our phenotypes and can be exploited in many applications, such as medicine, drug design, disease diagnosis and the study of population history [2, 3]. Single Nucleotide Polymorphisms (SNPs) are the sites on DNA sequences that show common variation [4]. The nucleotides involved in an SNP are called alleles. A haplotype is the set of SNPs located on a specific chromosome. Recent works show that haplotypes carry more valuable information than individual SNPs [5]. In diploid organisms, such as humans, genomes are organized into pairs of chromosomes, one inherited from the father and the other from the mother, called paternal and maternal respectively. Consequently, one haplotype sequence can be obtained from each copy [6, 7]. Determining haplotypes through experimental work is very time-consuming and costly; hence, computational methods are an appropriate alternative. Various methods have been proposed to solve the haplotype reconstruction problem. At present, there are two chief models: haplotype inference [8–13] and haplotype assembly [6, 14–20]. The method presented in this article is based on haplotype assembly.

Lancia and his colleagues [21] first proposed the haplotype assembly problem. Suppose there are short SNP fragments belonging to a pair of chromosomes; their model tries to divide these fragments into two clusters such that each haplotype can be reconstructed. Errors and gaps in the fragments, together with the diploid nature of the organism, make this problem challenging. To detect and correct fragment errors, several models have been proposed, of which Minimum Fragment Removal (MFR), Minimum SNP Removal (MSR), Longest Haplotype Reconstruction (LHR) and Minimum Error Correction (MEC) are the four chief ones. MEC was presented by Lippert and coworkers [22]. Although this model is the most complicated among them, it is very popular and has been used in many related works. It has been proved that the MEC problem is NP-hard [23].

Up to now, several approaches based on the MEC model have been proposed to address the single individual haplotype (SIH) problem; they can be categorized as exact, metaheuristic and probabilistic methods. Exact methods attempt to solve the problem precisely and reconstruct the haplotypes optimally; however, these approaches have to impose constraints on the input fragments [18, 24–26]. Since the MEC problem is NP-hard, metaheuristic algorithms such as GA and PSO have also been applied; in this case, an objective function is designed based on the MEC model and the method attempts to improve it iteratively [6, 15, 27–30]. The gaps and errors in the input data encouraged some researchers to propose probabilistic models; for example, HASH [19] and CUT [20] are two main approaches in this category.

The FastHap method was recently proposed by Mazrouee and her colleagues [31]. Improving accuracy and time complexity are the main goals of their method. Algorithmically, the dissimilarity of every pair of fragments is measured by a new distance metric; next, a weighted graph is built from the obtained measures; then, the graph is used to partition the fragments one after another; eventually, the initial partitioning is refined to improve the overall MEC. Their experimental results show improvements in both accuracy and running time.

Fuzzy c-means (FCM) clustering is an unsupervised technique that has been widely applied in many fields such as geology, medical imaging, target recognition and image segmentation [32–37]. Its main advantage over hard c-means is that each sample can belong to several clusters according to its membership degree. This ability is particularly suitable for noisy data, as it decreases the sensitivity to the existing noise [38]. The single individual haplotype (SIH) reconstruction problem is one of the active research areas in bioinformatics and can be modelled as a clustering problem. Most of the existing methods cluster the input fragments based on their distances. However, the errors and gaps present in the input fragments make the computed distances unreliable.

This paper focuses on using the FCM approach and introduces a new haplotype reconstruction method. The membership degree of each input fragment describes its cluster assignment more precisely. The proposed method consists of two steps. First, the distance metric suggested in [31] is used to build a fuzzy conflict graph; the input fragments are then partitioned into two clusters based on their similarities, and an initial haplotype is obtained from each cluster. In the second step, the obtained haplotypes are used as preliminary centers by the fuzzy c-means (FCM) clustering method, which tries to improve the accuracy of the previous partitioning. The results of several experiments on real data show that the proposed method can consistently find good solutions. Moreover, comparison with several methods indicates that our method achieves appropriate accuracy in most cases.

The rest of this paper is organized as follows. In Section 2, the SIH problem is formally defined and preliminary definitions and notations are given. In Section 3, the data representation is discussed and the proposed method is described. In Section 4, the experimental results comparing the method with other popular approaches are provided. Our conclusions are drawn in the final section.

## 2. Problem formulation

Given a set of SNP fragments read from both chromosomes, the columns with identical values are first removed. Next, as shown in Fig. 1(a), an *m* × *n* matrix called the SNP matrix is constructed from the fragments, where *m* is the number of fragments and *n* is the number of SNP sites. In reality, there are two possible alleles for each SNP; hence, the alleles of each SNP can be denoted by '0' and '1' based on their frequency in the population [7] (Fig. 1(b)).

Each element of the matrix can be '1', '0' or '−', where '−' indicates a gap. The original haplotypes are a pair of binary strings *H*(*h*_{1}, *h*_{2}) of length *n*. The aim of SIH reconstruction is to divide the rows of the SNP matrix into two parts and then reconstruct the corresponding haplotype from each part.
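To make this representation concrete, the following sketch shows a small invented SNP matrix (not the one from Fig. 1) and how a haplotype could be recovered from a group of compatible fragments by a simple per-site consensus over the non-gap alleles:

```python
import numpy as np

# A toy SNP matrix: rows are fragments, columns are SNP sites.
# '0'/'1' are alleles; '-' marks a gap (site not covered by the fragment).
snp_matrix = np.array([
    list("01-10"),
    list("0111-"),
    list("10-01"),
    list("1000-"),
])

def consensus_haplotype(rows):
    """Reconstruct a haplotype from compatible fragments by taking,
    at each site, the non-gap allele that occurs most often."""
    hap = []
    for col in rows.T:
        alleles = [a for a in col if a != "-"]
        hap.append(max(set(alleles), key=alleles.count) if alleles else "-")
    return "".join(hap)

print(consensus_haplotype(snp_matrix[:2]))  # fragments from one chromosome copy
print(consensus_haplotype(snp_matrix[2:]))  # fragments from the other copy
```

With error-free fragments, the two consensus strings come out complementary, as expected for a haplotype pair.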

If the fragments are error-free (Fig. 1), they can be clustered into two groups such that all the fragments in each cluster are compatible (Fig. 2). However, in the presence of errors, some fragments conflict with both clusters. In this case, we should reconstruct the haplotypes such that some objective function is minimized. In this study, we define this function based on the Minimum Error Correction (MEC) model [21, 22].
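As a hedged illustration of the MEC objective, the sketch below scores a candidate haplotype pair by charging each fragment the number of allele corrections needed against the closer haplotype (toy fragments invented for this example):

```python
def mec(fragments, haplotypes):
    """MEC score of a bipartition: each fragment contributes the number of
    corrections needed to make it consistent with the closer haplotype."""
    def errors(frag, hap):
        # count covered sites where the fragment disagrees with the haplotype
        return sum(1 for a, b in zip(frag, hap) if a != "-" and a != b)
    return sum(min(errors(f, haplotypes[0]), errors(f, haplotypes[1]))
               for f in fragments)

# One flipped allele in the last fragment costs exactly one correction.
print(mec(["01-10", "0111-", "10-01", "1010-"], ("01110", "10001")))  # 1
```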

Suppose that $H' = (h_1', h_2')$ is the pair of reconstructed haplotypes and $H = (h_1, h_2)$ is the pair of original haplotypes. The accuracy of the algorithm is measured by the reconstruction rate (RR) [6], which is defined as follows:

$$RR = 1 - \frac{\min\left(D(h_1, h_1') + D(h_2, h_2'),\; D(h_1, h_2') + D(h_2, h_1')\right)}{2n} \tag{1}$$

where $D(h_i, h_j') = \sum_{k=1}^{n} d\left(h_i[k], h_j'[k]\right)$ and $d(x, y) = \begin{cases} 0, & x = y \\ 1, & x \neq y \end{cases} \tag{2}$
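As a sketch, assuming the standard definition in which the better of the two possible pairings between original and reconstructed haplotypes is taken, RR can be computed as:

```python
def hamming(a, b):
    """Number of mismatching sites between two equal-length haplotypes."""
    return sum(x != y for x, y in zip(a, b))

def reconstruction_rate(original, reconstructed):
    """RR: fraction of correctly recovered alleles, taking the best of the
    two possible pairings between original and reconstructed haplotypes."""
    (h1, h2), (r1, r2) = original, reconstructed
    n = len(h1)
    d = min(hamming(h1, r1) + hamming(h2, r2),
            hamming(h1, r2) + hamming(h2, r1))
    return 1 - d / (2 * n)

# A perfect reconstruction scores 1.0 even if the two haplotypes are swapped.
print(reconstruction_rate(("0110", "1001"), ("1001", "0110")))  # 1.0
```

Taking the minimum over both pairings matters because the clustering cannot distinguish the maternal from the paternal haplotype.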

## 3. Materials and methods

As demonstrated by a series of recent publications [14, 39–46] and summarized in Chou's 5-step rule [47], to present a suitable analysis method for a biological system, we should follow five guidelines: (a) select a valid benchmark dataset; (b) formulate the data with an effective mathematical expression; (c) introduce a powerful algorithm to perform the reconstruction; (d) evaluate the accuracy; (e) establish a user-friendly web server. Below, we describe how we deal with these steps one by one.

### 3.1. Materials

Geraci's dataset [48] is one of the major benchmarks, prepared from the HapMap project. It consists of 22 pairs of human chromosomes from four different populations and has been widely used by several researchers [14, 15, 17, 48–50]. Three parameters characterize the dataset: haplotype length, error rate and coverage, denoted by *l*, *e* and *c*, respectively. Each parameter takes several values: *l* = 100, 350 and 700; *e* = 0.0, 0.1, 0.2 and 0.3; *c* = 3, 5, 8 and 10. The error rate is the fraction of data that has been read imprecisely; for example, *e* = 0.1 means that 10% of the available data is noisy. The coverage is the number of times each of the two haplotypes is replicated when generating the dataset. For each combination of these parameters there are 100 instances.

### 3.2. Data formulation

As mentioned in the previous section, the input SNP fragments are represented as an *m* × *n* matrix called the SNP matrix. The distance (dissimilarity) between two fragments *X* = (*x*_{1}, *x*_{2}, …, *x*_{n}) and *Y* = (*y*_{1}, *y*_{2}, …, *y*_{n}) can be defined as follows:

$$D(X, Y) = \sum_{k=1}^{n} d(x_k, y_k), \qquad d(x_k, y_k) = \begin{cases} 1, & x_k \neq y_k \ \text{and} \ x_k, y_k \neq \text{'$-$'} \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

It should be noted that Eq. 3 is a type of Hamming distance, which has been used in [31]. The distances between all pairs of input fragments must be calculated. Next, a complete fuzzy conflict graph is constructed with *m* vertices, one per fragment, where each edge weight represents the distance between the two corresponding fragments. In effect, this graph represents the dissimilarity between pairs of fragments. For example, the matrix shown in Fig. 3 contains the normalized distances between the six fragments of Fig. 1; the distance between *f*_{i} and *f*_{j} is normalized by the number of SNP sites covered by at least one of *f*_{i} and *f*_{j}. The corresponding fuzzy conflict graph is depicted as well.
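The normalized pairwise distances and the weight matrix of the fuzzy conflict graph can be sketched as follows (the fragments are invented toy data, not those of Fig. 1):

```python
import numpy as np

def fragment_distance(f1, f2):
    """Normalized Hamming-style distance between two gapped fragments:
    conflicts at sites covered by both fragments, divided by the number
    of sites covered by at least one of them (a sketch of Eq. 3 with the
    normalization described in the text)."""
    conflicts = sum(1 for a, b in zip(f1, f2)
                    if a != "-" and b != "-" and a != b)
    covered = sum(1 for a, b in zip(f1, f2) if a != "-" or b != "-")
    return conflicts / covered if covered else 0.0

fragments = ["01-10", "0111-", "10-01", "1000-"]
m = len(fragments)
# Weight matrix of the complete fuzzy conflict graph on m vertices.
W = np.zeros((m, m))
for i in range(m):
    for j in range(i + 1, m):
        W[i, j] = W[j, i] = fragment_distance(fragments[i], fragments[j])
print(np.round(W, 2))
```

Fragments from the same chromosome copy get a distance near 0, while fragments from opposite copies get a distance near 1, which is what the bi-partitioning step exploits.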

### 3.3. Proposed method

The proposed method has two phases. First, the distances between all the fragments are calculated according to Eq. 3 and the corresponding fuzzy conflict graph is constructed; the obtained distances are then used to bi-partition all the fragments. Note that this clustering is done based on the dissimilarities between the fragments. In the second phase, the centers of the obtained clusters (the obtained haplotypes) are used as initial centers by the fuzzy c-means algorithm. The first phase increases the convergence speed of FCM and decreases the number of iterations. The FCM algorithm assigns fragments to each cluster according to fuzzy memberships. Let *u*_{ij} be the degree of membership of the *j*th fragment to the *i*th cluster, where *i* ∈ {1, 2}, and let the matrix *U* = [*u*_{ij}]_{2 × N} contain the memberships of all the fragments. In this way, each fragment may belong to either of the two clusters with a different membership degree. The algorithm is an iterative optimization that minimizes the cost function defined as follows:

$$J = \sum_{i=1}^{2} \sum_{j=1}^{N} u_{ij}^{m} \, d_{ij}^{2} \tag{4}$$

where *d*_{ij} is the distance between the *i*th cluster center and the *j*th input fragment, defined based on the distance metric of Eq. 3, *N* is the number of fragments, and *m* is a constant that controls the fuzziness of the resulting partition. This parameter can take any value between one and infinity, and there is no theoretical way to determine it; in this study, *m* is set to 2, following several previous studies [33, 37, 51–53]. The membership matrix and the cluster centers are updated by

$$u_{ij} = \frac{1}{\sum_{k=1}^{2} \left( \frac{d_{ij}}{d_{kj}} \right)^{\frac{2}{m-1}}} \tag{5}$$

$$c_{i} = \frac{\sum_{j=1}^{N} u_{ij}^{m} \, f_j}{\sum_{j=1}^{N} u_{ij}^{m}} \tag{6}$$

where *f*_{j} denotes the *j*th fragment.

The algorithm is summarized by the flowchart shown in Fig. 4. The last two steps are iterated until the improvement over the previous iteration falls below a threshold *ε*. The cost function is minimized when fragments close to the centroid of their cluster receive high membership values and fragments far from the centroid receive low ones.
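The iteration can be sketched as follows. This is a minimal reading of the method, not the authors' implementation: it assumes a gap-aware Hamming distance between fragments and centers, *m* = 2, and a membership-weighted majority vote to update the binary haplotype centers (the usual continuous FCM mean does not apply directly to allele strings):

```python
def fcm_haplotypes(fragments, centers, m=2.0, eps=1e-4, max_iter=50):
    """Sketch of the FCM phase: alternate membership updates and
    membership-weighted majority voting for the two haplotype centers."""
    n = len(fragments[0])

    def dist(frag, center):
        # gap-aware Hamming distance, kept strictly positive for the ratios
        d = sum(1 for a, b in zip(frag, center) if a != "-" and a != b)
        return max(d, 1e-9)

    prev = None
    for _ in range(max_iter):
        # membership update: standard FCM formula with two clusters
        U = []
        for f in fragments:
            d0, d1 = dist(f, centers[0]), dist(f, centers[1])
            u0 = 1.0 / ((d0 / d0) ** (2 / (m - 1)) + (d0 / d1) ** (2 / (m - 1)))
            U.append((u0, 1.0 - u0))
        # center update: weighted vote of non-gap alleles at each site
        new_centers = []
        for i in range(2):
            hap = []
            for k in range(n):
                w = {"0": 0.0, "1": 0.0}
                for u, f in zip(U, fragments):
                    if f[k] != "-":
                        w[f[k]] += u[i] ** m
                hap.append("0" if w["0"] >= w["1"] else "1")
            new_centers.append("".join(hap))
        centers = new_centers
        # stop when the cost (Σ u^m d^2) no longer improves by more than eps
        cost = sum(u[i] ** m * dist(f, centers[i]) ** 2
                   for u, f in zip(U, fragments) for i in range(2))
        if prev is not None and abs(prev - cost) < eps:
            break
        prev = cost
    return centers, U

fragments = ["01-10", "0111-", "10-01", "1000-", "0-110"]
centers, U = fcm_haplotypes(fragments, ["01110", "10001"])
print(centers)
```

Seeding the centers with the haplotypes from the graph-based bi-partitioning, as the text describes, means the loop typically needs only a few iterations to converge.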

## 4. Experimental results

To evaluate the performance of our method, as mentioned previously, we have used the dataset in Geraci’s research [48, 54].

In order to assess the performance of the proposed algorithm, it is compared with the algorithms investigated in Geraci's research [48]. MLF [55] and 2d [26] are based on the MEC model; the former uses a confidence score for each SNP site and the latter uses two distance metrics to cluster the input fragments. Both DGS [56] and Cut [20] work with a sub-matrix of the input fragments: the first considers an initial pair of haplotypes and refines it step by step based on the majority rule, while the second models SIH as a max-cut problem on derived SNP graphs. Fast [57] sorts the input fragments according to the positions where their gaps begin and assigns them to the clusters. SHR [58] is a randomized approach that selects the input fragments iteratively and assigns each to the closest set using the Hamming distance. Finally, SPH [54] is a heuristic method that exploits the statistical correlations between SNPs and is intended for highly noisy fragments with low coverage. The results of the proposed method, called FCMhap, as well as those of the other algorithms are reported below. Tables 1–3 present the results for haplotypes of length 100, 350 and 700, respectively. The first two columns in these tables indicate the error rate and the coverage; the results of our method appear in the last column. The bold values mark the highest RRs, and the gray values the second highest. It is interesting to note that all competing methods perform well in the error-free cases. However, as the amount of noise and gaps increases, their performance decreases dramatically. By using FCM, each fragment can belong to both clusters, and its assignment is determined by its membership degrees. The membership degree describes the belonging of each fragment more accurately, especially for input fragments with a large amount of noise.
Therefore, the comparison of results, particularly in cases with a high error rate, demonstrates that FCMhap performs well against the other methods.

In this study, we have focused on improving the reconstruction rate. However, to provide a comprehensive assessment of the proposed method, its running time has also been compared with that of the other approaches. For this purpose, for each combination of the parameters, the methods were run on an ordinary desktop PC over 10 randomly selected samples. The average running times for each set of parameters are reported in Tables 4–6.

The comparison of haplotype reconstruction times demonstrates that the running time of the proposed method is comparable to that of the other approaches, and it can reconstruct the haplotypes for each parameter assignment in less than seven seconds.

## 5. Conclusion

The availability of huge amounts of genomic sequence data has increased the importance of the single individual haplotype problem. Haplotype determination can be useful in several domains, such as understanding the relation between genetic variations and complicated diseases. Since laboratory-based methods are time-consuming and expensive, several computational approaches have been proposed that reconstruct haplotypes directly from the reads; however, their performance can decrease dramatically when dealing with noisy input data. We have presented FCMhap, an effective method that uses the fuzzy c-means (FCM) algorithm as its main step. By assigning a fuzzy membership to each read, FCM can efficiently cluster noisy data. The obtained results demonstrate that FCMhap can improve the reconstruction rate, especially for high-error-rate data. It should be noted that the code used to prepare this article is available from the authors upon request.

## Acknowledgement

We would like to acknowledge the help we received from our colleagues in the Machine Learning and Bioinformatics Laboratory (MLBL) of the University of Zanjan, Zanjan, Iran. The authors would also like to thank Dr. F. Geraci for providing his benchmark dataset.