## ABSTRACT

A framework that formulates haplotype assembly as sparse tensor decomposition is proposed. The problem is cast as that of decomposing a tensor having special structural constraints and missing a large fraction of its entries into a product of two factors, **U** and **V̲**; the tensor **V̲** reveals haplotype information while **U** is a sparse matrix encoding the origin of erroneous sequencing reads. An algorithm, AltHap, which reconstructs haplotypes of either diploid or polyploid organisms by solving this decomposition problem is proposed. Starting from a judiciously selected initial point, AltHap alternates between two optimization tasks to recover **U** and **V̲** by relying on a modified gradient descent search that exploits salient structural properties of **U** and **V̲**. The performance and convergence properties of AltHap are theoretically analyzed and, in doing so, guarantees on the achievable minimum error correction scores and correct phasing rate are established. AltHap was tested in a number of different scenarios and was shown to compare favorably to state-of-the-art methods in applications to haplotype assembly of diploids, and to significantly outperform existing techniques when applied to haplotype assembly of polyploids.

## 1 INTRODUCTION

Fast and accurate DNA sequencing has enabled unprecedented studies of genetic variations and their effect on human health and medical treatments. Complete information about variations in an individual’s genome is given by haplotypes, the ordered lists of single nucleotide polymorphisms (SNPs) on the individual’s chromosomes [37]. Haplotype information is of fundamental importance for a wide range of applications. For instance, when the corresponding genes on a homologous pair of chromosomes contain multiple variants, they could exhibit different gene expression patterns. In humans, this may affect an individual’s susceptibility to diseases and response to therapeutic drugs, and hence suggest directions for medical and pharmaceutical research [11]. Haplotype information also enables whole genome association studies that focus on the so-called tag SNPs [19], representative SNPs in a region of the genome characterized by strong correlation between alleles (i.e., by high linkage disequilibrium). Moreover, haplotype sequences can be used to infer recombination patterns and identify genes under positive selection [36]. In addition to the SNPs and minor structural variations found in a healthy individual’s genome, complex chromosomal aberrations such as translocations and nonreciprocal structural changes – including aneuploidy – are present in cancer cells. Cancer haplotype assembly enables identification of “driver” mutations and thus helps in understanding the mechanisms behind the disease and discovering its genetic signatures.

Haplotype assembly from short reads obtained by high-throughput DNA sequencing requires partitioning (either directly or indirectly) the reads into *K* clusters (*K* = 2 for diploids, *K* = 3 for triploids, etc.), each collecting the reads corresponding to one of the chromosomes. If the reads were free of sequencing errors, this task would be straightforward. However, sequencing is erroneous – state-of-the-art platforms have error rates on the order of 10^{−3} – 10^{−2}. This leads to ambiguities regarding the origin of a read and therefore renders haplotype assembly challenging. For this reason, the vast majority of haplotype assembly techniques attempt to remove the aforementioned ambiguities by either discarding or altering sequencing data; this has led to the minimum fragment removal, minimum SNP removal [26], maximum fragments cut [16], and minimum error correction formulations of the assembly problem [29]. Most of the recent haplotype assembly methods (see, e.g., [7, 25, 31, 32, 40]) focus on the minimum error correction (MEC) formulation where the goal is to find the smallest number of nucleotides in reads that need to be changed so that any read partitioning ambiguities are resolved. It has been shown that finding an optimal solution to the MEC formulation of the haplotype assembly problem is NP-hard [7, 10, 26]. In [39], the authors used a branch-and-bound scheme to minimize the MEC objective over the space of reads; to reduce the search space, they relied on a bound on the objective obtained by a random partition of the reads. Unfortunately, the exponential growth of the complexity of this scheme makes it computationally infeasible even for moderate haplotype lengths. Integer linear programming techniques have been applied to haplotype assembly in [9], but the approach there fails at computationally difficult instances of the problem.
More recently, fixed parameter tractable (FPT) algorithms with runtimes exponential in the number of variants per read [6, 22] were proposed; these methods are well-suited for short reads but become infeasible for long ones. A dynamic programming scheme for haplotype assembly of diploids proposed in [21] is also exponential in the length of the longest read. A probabilistic dynamic programming algorithm that optimizes a likelihood function generalizing the MEC objective is developed in [25]; this method is characterized by high accuracy but is significantly slower than earlier heuristics. The methods in [31, 32] aim to process long reads by developing algorithms for the exact optimization of weighted variants of the MEC score that scale well with read length but are exponential in the sequencing coverage. These methods, along with ProbHap [25], struggle to remain accurate and practically feasible at high coverages (e.g., higher than 12 [25]).

The computational challenges of optimizing the MEC score have motivated several polynomial time heuristics. In a pioneering work [28], a greedy algorithm seeking the most likely haplotypes was used to assemble haplotypes of the first complete diploid individual genome obtained via high-throughput sequencing. To compute posterior joint probabilities of consecutive SNPs, Bayesian methods relying on MCMC and Gibbs sampling schemes were proposed in [4] and [24], respectively; unfortunately, slow convergence of the Markov chains that these schemes rely on limits their practical feasibility. Following an observation that haplotype assembly can be interpreted as a clustering problem, a max-cut formulation was proposed in [3]; an efficient algorithm (HapCUT) that solves it and significantly outperforms the method in [28] was developed and has been widely used subsequently. A flow-graph based approach in [1], HapCompass, re-examined the fragment removal strategy and demonstrated superior performance over HapCUT. Other recent diploid haplotype assembly methods include a greedy max-cut approach in [17], a convex optimization program for minimizing the MEC score in [13], and a communication-theoretic interpretation of the problem solved via belief propagation (BP) in [34]. Note that deep sequencing coverage provided by state-of-the-art high-throughput sequencing platforms and the emergence of very long insert sizes in recent technologies (e.g., fosmid [17]) may enable assembly of extremely long haplotype blocks but also impose a significant computational burden on the methods above.

Increased affordability, capability to provide deep coverage, and longer sequencing read lengths have also enabled studies of genetic variations of polyploid organisms. However, haplotype assembly for polyploid genomes is considerably more challenging than that for diploids; to illustrate this, note that for a polyploid genome with *k* haplotype sequences of length *m*, under the all-heterozygous assumption there are (*k* − 1)^{m} different genotypes and at least 2^{(m − 1)} (*k* − 1)^{m} different haplotype phasings. In part for this reason, relatively few methods for solving the haplotype assembly problem in polyploids have been developed. In fact, with the exception of HapCompass [1], SDhaP [13] and BP [34], the above listed methods are restricted to diploid genomes. Other recent techniques capable of reconstructing haplotypes for both diploid and polyploid genomes include HapTree [5], a Bayesian method to find the maximum likelihood haplotype shown to be superior to HapCompass and SDhaP (see, e.g., [30] for a detailed comparison), and H-PoP [40], the state-of-the-art dynamic programming method that significantly outperforms the schemes developed in [1, 5, 13] in terms of accuracy, memory consumption, and speed.

In this paper, we propose a unified framework for haplotype assembly of diploid and polyploid genomes based on sparse tensor decomposition; the framework essentially solves a relaxed version of the NP-hard MEC formulation of the haplotype assembly problem. In particular, read fragments are organized in a sparse binary tensor which can be thought of as being obtained by multiplying a matrix that contains information about the origin of erroneous sequencing reads and a tensor that contains haplotype information of an organism. The problem then is recast as that of decomposing a tensor having special structural constraints and missing a large fraction of its entries. Based on a modified gradient descent method and after unfolding the observed and haplotype information bearing tensors, an iterative procedure for finding the decomposition is proposed. The algorithm exploits underlying structural properties of the factors to perform decomposition at a low computational cost. In addition, we analyze the performance and convergence properties of the proposed algorithm and determine bounds on the minimum error correction (MEC) scores and correct phasing rate (CPR) – also referred to as reconstruction rate – that the algorithm achieves for a given sequencing coverage and data error rate. To the best of our knowledge, this is the first polynomial time approximation algorithm for haplotype assembly of diploids and polyploids having explicit theoretical guarantees for its achievable MEC score and CPR. The proposed algorithm, referred to as AltHap, is tested in applications to haplotype assembly for both diploid and polyploid genomes (synthetic and real data) and compared with several state-of-the-art methods. Our extensive experiments reveal that AltHap outperforms the competing techniques in terms of accuracy, running time, or both. 
It should be noted that while state-of-the-art haplotype assembly methods for polyploids assume haplotypes may only have biallelic sites, AltHap is capable of reconstructing polyallelic haplotypes which are common in many plants and some animals, are of particular importance for applications such as crop cultivation [35], and may help in reconstruction of viral quasispecies [33]. Moreover, unlike several state-of-the-art haplotype assembly methods that have complexity which scales exponentially with either read length (e.g., [9]) or coverage (e.g., [25]), AltHap’s iterative steps are linear in both; this makes AltHap well-suited for haplotype assembly from long sequencing reads and deep coverage data. Indeed, we confirm this claim by performing extensive experiments using simulated datasets.

## 2 MATHEMATICAL MODEL AND PROBLEM FORMULATION

We briefly summarize the notation used in the paper. Bold capital letters refer to matrices and bold lowercase letters represent vectors. Tensors are denoted by underlined bold capital letters, e.g., **M̲**. **M**_{::1} and **M**_{(1)} denote the frontal slice and the mode-1 unfolding of a third-order tensor **M̲**, respectively. For a positive integer *n*, [*n*] denotes the set {1, …, *n*}. The condition number of a rank-*k* matrix **M** is defined as *κ* = *σ*_{1}/*σ*_{k}, where *σ*_{1} ≥ · · · ≥ *σ*_{k} > 0 are the singular values of **M**. SVD_{k}(**M**) denotes the rank-*k* approximation (compact SVD) of **M** computed by the power iteration method [2, 27].

Let 𝓗 = {**h**_{1}, …, **h**_{k}} denote the set of haplotype sequences of a *k*-ploid organism, and let **R** be an *n* × *m* SNP fragment matrix where *n* denotes the number of sequencing reads and *m* is the length of the haplotype sequences. **R** is an incomplete matrix that can be thought of as being obtained by sampling, with errors, a matrix **M** that consists of *n* rows; each row of **M** is a sequence randomly selected from among the *k* haplotype sequences. Since each SNP is one of four possible nucleotides, we use the alphabet 𝒜 = {(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)} to describe the information in the haplotype sequences; the mapping between nucleotides and alphabet components follows an arbitrary convention.

The reads can now be organized into an *n* × *m* × 4 SNP fragment tensor which we denote by **R̲**. The (*i*, *j*,:) fiber of **R̲**, i.e., a one-dimensional slice obtained by fixing the first and second indices of the tensor, represents the value of the *j*^{th} SNP in the *i*^{th} read. Let Ω denote the set of informative fibers of **R̲**, i.e., the set of (*i*, *j*,:) such that the *i*^{th} read covers the *j*^{th} SNP. Define an operator 𝒫_{Ω}(·) as the sampling operator that keeps the fibers indexed by Ω and sets all other fibers to **0**, i.e., [𝒫_{Ω}(**X̲**)]_{ij:} = **X**_{ij:} if (*i*, *j*,:) ∈ Ω and [𝒫_{Ω}(**X̲**)]_{ij:} = **0** otherwise.
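For illustration, the described encoding of a SNP fragment matrix into the binary tensor, together with the sampling operator 𝒫_{Ω}, can be sketched in numpy as follows (the function names and the convention that 0 marks an uncovered position are ours, not part of the released AltHap implementation):

```python
import numpy as np

def encode_fragments(R):
    """Encode an n-by-m SNP fragment matrix (entries 1..4 indexing the
    four nucleotides, 0 marking positions a read does not cover) into an
    n-by-m-by-4 binary tensor together with the mask of informative fibers."""
    n, m = R.shape
    T = np.zeros((n, m, 4))
    omega = R > 0                        # informative (i, j) positions
    i, j = np.nonzero(omega)
    T[i, j, R[i, j] - 1] = 1.0           # one-hot encode each observed SNP
    return T, omega

def project_omega(T, omega):
    """The sampling operator P_Omega: keep the fibers indexed by Omega
    and zero out all others."""
    return T * omega[:, :, None]

# two reads over three SNP positions; read 1 misses SNP 2, read 2 misses SNP 3
R = np.array([[1, 0, 3],
              [2, 2, 0]])
T, omega = encode_fragments(R)
```

Since the encoded tensor is zero outside Ω by construction, applying `project_omega` to it leaves it unchanged.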

**R̲** is a tensor obtained by sampling, with errors, a tensor **M̲** having *n* copies of the *k* encoded haplotype sequences as its horizontal slices. More specifically, we can write **M̲** = **V̲** ×₁ **U**, where **V̲** ∈ {0, 1}^{k×m×4} contains the haplotype information, i.e., the *l*^{th} horizontal slice of **V̲**, **V**_{l::}, is the encoded sequence of the *l*^{th} haplotype, and **U** ∈ {0, 1}^{n×k} is a matrix that assigns each of the *n* horizontal slices of **M̲** to one of the *k* haplotype sequences, i.e., the *i*^{th} row of **U**, **u**_{i}, is an indicator of the origin of the *i*^{th} read. Let Φ = {**e**_{1}, …, **e**_{k}}, where **e**_{l} ∈ ℝ^{k} is the *l*^{th} standard basis vector having 1 in the *l*^{th} position and 0 elsewhere. The rows of **U** are standard unit basis vectors in ℝ^{k}, i.e., **u**_{i} ∈ Φ, ∀*i* ∈ [*n*]. This representation is illustrated in Fig. 1, where the (1, 1,:) fiber of **R̲** specified with dashed lines is mapped to the (1, 1,:) fiber of **V̲**, which in turn implies that in the example described in Fig. 1 we have **u**_{1} = **e**_{1}.
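The mode-1 product and its unfolded matrix form can be verified numerically; the following sketch (with arbitrary small dimensions of our choosing) checks that multiplying the haplotype tensor along its first mode by a one-hot assignment matrix stacks haplotype copies, and that the same relation holds between the mode-1 unfoldings:

```python
import numpy as np

n, m, k = 5, 6, 3
rng = np.random.default_rng(0)

# V: k x m x 4 tensor of one-hot encoded haplotypes
V = np.zeros((k, m, 4))
V[np.arange(k)[:, None], np.arange(m)[None, :],
  rng.integers(0, 4, (k, m))] = 1.0

# U: n x k assignment matrix whose rows are standard basis vectors
U = np.eye(k)[rng.integers(0, k, n)]

# mode-1 product M = V x_1 U, computed fiber-wise ...
M = np.einsum('il,ljs->ijs', U, V)

# ... agrees with the matrix product of U and the mode-1 unfolding V_(1)
M1 = U @ V.reshape(k, -1)
assert np.array_equal(M1, M.reshape(n, -1))

# each horizontal slice of M is a copy of the haplotype its read came from
assert np.array_equal(M[0], V[U[0].argmax()])
```

This identity between the tensor product and the unfolded matrix product is exactly what allows the assembly problem to be rewritten in matrix form later in this section.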

DNA sequencing is erroneous and hence we assume a model where the informative fibers in **R̲** are perturbed versions of the corresponding fibers in **M̲** with data error rate *p*_{e}, i.e., if the (*i*, *j*,:) ∈ Ω fiber in **M̲** takes value **e**_{l} ∈ 𝒜, then **R**_{ij:} equals **e**_{l} with probability 1 − *p*_{e} and with probability *p*_{e} takes one of the other three possibilities. Thus, the observed SNP fragment tensor can be modeled as **R̲** = 𝒫_{Ω}(**M̲** + **N̲**), where **N̲** is an additive noise tensor whose (*i*, *j*,:) fiber equals 𝒰(𝒜\{**M**_{ij:}}) − **M**_{ij:} with probability *p*_{e} and **0** otherwise; the notation 𝒰(𝒜\{**M**_{ij:}}) denotes uniform selection of a vector from 𝒜\{**M**_{ij:}}. The goal of haplotype assembly can now be formulated as follows:

*Given the SNP fragment tensor **R̲**, find the tensor of haplotype sequences **V̲** that minimizes the MEC score*.

Next, we formalize the MEC score as well as the correct phasing rate, also known as the reconstruction rate, the two metrics that are used to characterize performance of haplotype assembly schemes (see, e.g., [9, 14, 18, 21]). For two alleles **a**_{1}, **a**_{2} ∈ 𝒜 ∪ {**0**}, we define a dissimilarity function *d*(**a**_{1}, **a**_{2}) as *d*(**a**_{1}, **a**_{2}) = 1 if **a**_{1}, **a**_{2} ∈ 𝒜 and **a**_{1} ≠ **a**_{2}, and *d*(**a**_{1}, **a**_{2}) = 0 otherwise (in particular, whenever either argument is the all-zero vector **0** indicating a position not covered by the read).

The MEC score is the smallest number of fibers in **R̲** that need to be altered so that the resulting modified data is consistent with the reconstructed haplotype tensor **V̲**, i.e.,

MEC(**V̲**) = Σ_{i=1}^{n} min_{l∈[k]} Σ_{j=1}^{m} *d*(**R**_{ij:}, **V**_{lj:}).

The correct phasing rate (CPR), also referred to as the reconstruction rate, can conveniently be written using the dissimilarity function *d*(·, ·). Let **Ṽ̲** denote the tensor of true haplotype sequences. Then

CPR = 1 − (1/(*km*)) min_{𝓜} Σ_{l=1}^{k} Σ_{j=1}^{m} *d*(**V**_{lj:}, **Ṽ**_{𝓜(l)j:}),

where 𝓜 is a one-to-one mapping from the horizontal slices of **V̲** to those of **Ṽ̲**, i.e., a one-to-one mapping from the set of reconstructed haplotypes to the set of true haplotypes.
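For concreteness, the two metrics can be computed as follows on the encoded tensors of Section 2 (a numpy sketch of ours; the exhaustive search over the *k*! haplotype matchings is affordable because *k* is small):

```python
import numpy as np
from itertools import permutations

def mec(T, omega, V):
    """MEC score of reconstructed haplotypes V (k x m x 4) against the
    read tensor T (n x m x 4) with informative-fiber mask omega: each
    read is charged the number of covered SNPs on which it disagrees
    with its best-matching haplotype."""
    # mismatch[i, l, j] = True iff read i and haplotype l differ at SNP j
    mismatch = (T[:, None, :, :] != V[None, :, :, :]).any(axis=3)
    mismatch &= omega[:, None, :]            # only covered positions count
    return int(mismatch.sum(axis=2).min(axis=1).sum())

def cpr(V_hat, V_true):
    """Correct phasing rate: agreement under the best one-to-one matching
    of reconstructed to true haplotypes, averaged over all k*m positions."""
    k, m, _ = V_true.shape
    best = 0
    for perm in permutations(range(k)):      # exhaustive over k! matchings
        agree = sum((V_hat[perm[l]] == V_true[l]).all(axis=1).sum()
                    for l in range(k))
        best = max(best, agree)
    return best / (k * m)
```

As a sanity check, error-free reads drawn from the true haplotypes yield an MEC score of 0, and a reconstruction that merely permutes the true haplotypes attains a CPR of 1.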

We now describe our proposed relaxation of the MEC formulation of the haplotype assembly problem. Let *p*_{i} ∈ [*k*], ∀*i* ∈ [*n*], denote the index of the haplotype from which the *i*^{th} read originates, i.e., **M**_{i::} = **V**_{p_i::}. Notice that *d*(**R**_{ij:}, **V**_{p_i j:}) = ½‖**R**_{ij:} − **V**_{p_i j:}‖₂² for any *j* such that (*i*, *j*,:) ∈ Ω. Therefore, by denoting Ω_{i} the set of informative fibers for the *i*^{th} read we obtain^{1}

Σ_{j∈Ω_i} *d*(**R**_{ij:}, **V**_{p_i j:}) = ½ Σ_{j∈Ω_i} ‖**R**_{ij:} − **V**_{p_i j:}‖₂² =^{(a)} ½ ‖𝒫_{Ω_i}(**R**_{i::} − **V**_{p_i::})‖_F² =^{(b)} ½ ‖vec(𝒫_{Ω_i}(**R**_{i::} − **V**_{p_i::}))‖₂²,  (6)

where (*a*) follows from the definition of the Frobenius norm and vec(·) in (*b*) denotes the vectorization of its argument. Let **e**_{p} be the *p*^{th} standard unit vector, ∀*p* ∈ [*k*]. It is straightforward to observe that the last equality in (6) can equivalently be written in terms of **V**_{(1)}, the mode-1 unfolding of the tensor **V̲**. Hence, let **U** ∈ {0, 1}^{n×k} be the matrix such that for its *i*^{th} row it holds that **u**_{i} = **e**_{p_i}. In addition, notice that vec(**R**_{i::}) is the *i*^{th} row of **R**_{(1)}. Therefore, from the definition of the Frobenius norm and the fact that **M**_{(1)} = **UV**_{(1)} we obtain

min_{**U**, **V̲**} ‖𝒫_{Ω}(**R**_{(1)} − **UV**_{(1)})‖_F²  s.t. **u**_{i} ∈ Φ ∀*i* ∈ [*n*], **V**_{(1)} ∈ {0, 1}^{k×4m}.  (7)

The optimization problem in (7) is NP-hard since the entries of **V**_{(1)} are binary and the objective function is non-convex. Relaxing the binary constraint to **V**_{(1)} ∈ 𝒞^{k×4m}, where 𝒞 = [0, 1], results in the following relaxation of the MEC formulation,

min_{**U**, **V**_{(1)}} *f*(**U**, **V**_{(1)}) := ‖𝒫_{Ω}(**R**_{(1)} − **UV**_{(1)})‖_F²  s.t. **u**_{i} ∈ Φ ∀*i* ∈ [*n*], **V**_{(1)} ∈ 𝒞^{k×4m}.  (8)

The new formulation can be summarized as follows. We start by finding the so-called mode-1 unfoldings of the tensors **R̲** and **M̲** and denote the decomposition **M**_{(1)} = **UV**_{(1)}, as illustrated in Fig. 2. As implied by the figure, after unfolding, the entries of the (1, 1,:) fiber are mapped to four blocks of **R**_{(1)} and **V**_{(1)} that correspond to the frontal slices of tensors **R̲** and **V̲**, respectively. Then, to determine the haplotype sequence that minimizes the MEC score, one needs to solve (8) and find the optimal tensor decomposition.

## 3 STRUCTURED TENSOR DECOMPOSITION ALGORITHM

Although the objective function in (8) is convex in each of the factors when the other factor is fixed, it is in general jointly nonconvex. To facilitate a computationally efficient search for the solution of (8), we rely on a modified gradient search algorithm which exploits the special structures of **U** and **V**_{(1)} and iteratively updates the estimates starting from some initial point (**U**^{0}, **V**_{(1)}^{0}). More specifically, given the current estimates (**U**^{t}, **V**_{(1)}^{t}), the update rules are
**u**_{i}^{t+1} = argmin_{**u**∈Φ} ‖𝒫_{Ω_i}(vec(**R**_{i::}) − **u**ᵀ**V**_{(1)}^{t})‖₂², ∀*i* ∈ [*n*],  (9)

**V**_{(1)}^{t+1} = ∏_{𝒞}(**V**_{(1)}^{t} − *α* ∇_{V} *f*(**U**^{t+1}, **V**_{(1)}^{t})),  (10)

where ∇_{V} *f*(**U**^{t+1}, **V**_{(1)}^{t}) denotes the partial derivative of the objective evaluated at (**U**^{t+1}, **V**_{(1)}^{t}), *α* is a judiciously chosen step size, and ∏_{𝒞} denotes the projection operator onto 𝒞. Notice that the optimization in (9) is done by exhaustively searching over the *k* vectors in Φ. Since the number of haplotypes *k* is relatively small, the complexity of the exhaustive search (9) is low. The proposed scheme is formalized as Algorithm 1. MATLAB and Python implementations of AltHap are freely available from https://sourceforge.net/projects/althap/.
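The alternation between the two updates can be sketched in a few dozen lines of numpy. The following is a self-contained illustration, not the released AltHap implementation: it uses a random initialization in place of the SVD-based one, and a conservative step size *C*/(2*n*_max) (with *n*_max the largest cluster size, safe because **U**ᵀ**U** is diagonal when the rows of **U** are one-hot) in place of the paper's rule (11):

```python
import numpy as np

def althap_sketch(R1, omega1, k, C=1.0, T=50, seed=0):
    """Alternating-minimization sketch of updates (9)-(10) for
    min_{U, V1} ||P_Omega(R1 - U @ V1)||_F^2, with one-hot rows of U and
    V1 confined to [0, 1]. R1 is the mode-1 unfolded n x 4m observation
    and omega1 its binary mask. Returns the factors and the per-iteration
    objective values (which are non-increasing by construction)."""
    n, m4 = R1.shape
    rng = np.random.default_rng(seed)
    V1 = rng.uniform(size=(k, m4))       # random init (the paper instead
    E = np.eye(k)                        # uses an SVD-based initialization)
    hist = []
    for _ in range(T):
        # (9): exhaustive search over the k candidate haplotypes per read
        resid = (R1[:, None, :] - V1[None, :, :]) * omega1[:, None, :]
        U = E[(resid ** 2).sum(axis=2).argmin(axis=1)]
        # (10): projected gradient step on V1 with a descent-safe step size
        G = -2.0 * U.T @ (omega1 * (R1 - U @ V1))
        alpha = C / (2.0 * U.sum(axis=0).max() + 1e-12)
        V1 = np.clip(V1 - alpha * G, 0.0, 1.0)   # projection onto [0, 1]
        hist.append(float(((omega1 * (R1 - U @ V1)) ** 2).sum()))
    return U, V1, hist
```

Because the **U**-update minimizes the objective exactly for fixed **V**_{(1)} and the projected gradient step uses a step size below the inverse Lipschitz constant, the recorded objective values decrease monotonically, mirroring the convergence argument of Section 4.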

Structured Tensor Decomposition Algorithm

**Input:** SNP fragment matrix **R**, step size *α*, maximum number of iterations *T*

**Output: **, an estimate of the true haplotype tensor

**Preprocessing:** Encode **R** to binary tensor and find the mode-1 unfolding,

**Initialization:** Compute the rank-*k* approximation SVD_{k}(**R**_{(1)}) = **U**_{0}**Σ**_{0}**V**_{0}ᵀ and let **V**_{(1)}^{0} = ∏_{𝒞}(**Σ**_{0}**V**_{0}ᵀ). Define Φ = {**e**_{1}, …, **e**_{k}}

**for** *t* = 0, 1, 2, …, *T* − 1 **do**

1. **u**_{i}^{t+1} ← argmin_{**u**∈Φ} ‖𝒫_{Ω_i}(vec(**R**_{i::}) − **u**ᵀ**V**_{(1)}^{t})‖₂², ∀*i* ∈ [*n*]

2. Compute the step size *α* according to (11)

3. **V**_{(1)}^{t+1} ← ∏_{𝒞}(**V**_{(1)}^{t} − *α* ∇_{V} *f*(**U**^{t+1}, **V**_{(1)}^{t}))

**end for**

Decode **V**_{(1)}^{T} to obtain the estimated haplotype tensor
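The final decoding step, detailed in the appendix (keep the largest entry of each length-4 fiber), can be sketched as follows (a numpy illustration of ours):

```python
import numpy as np

def decode(V1, k, m):
    """Decode the relaxed solution V_(1) (k x 4m) into a binary haplotype
    tensor: reshape each row into m fibers of length 4 and set the largest
    entry of every fiber to 1 and the remaining entries to 0."""
    V = V1.reshape(k, m, 4)
    out = np.zeros_like(V)
    # write a 1 at the argmax position of every (l, j, :) fiber
    np.put_along_axis(out, V.argmax(axis=2)[..., None], 1.0, axis=2)
    return out
```

Note that the normalization of each fiber mentioned in the appendix does not change the argmax, so it is omitted here.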

## 4 CONVERGENCE ANALYSIS OF ALTHAP

In this section, we analyze the convergence properties of AltHap and provide performance guarantees in different scenarios.

In the appendix we show that a judicious choice of the step size *α* according to (11), where *C* ∈ (0, 2) is a constant, guarantees that the value of the objective function in (8) decreases as one alternates between (9) and (10), which in turn implies that AltHap converges. The key observation that leads to this result is that the objective is a convex function in each of the factor matrices and that 𝒞 = [0, 1] is a convex set; hence the projection ∏_{𝒞} in (10) leads to a reduction of the objective in each iteration *t*.

It is important, however, to determine the conditions under which the stationary point of AltHap coincides with the global optimum of (8). To this end, we first provide the definition of incoherence of matrices [8].

*A rank-k matrix* **M** ∈ ℝ^{n×m} *with singular value decomposition* **M** = **XΣY**ᵀ *is incoherent with parameter μ* ≥ 1 *if for every* 1 ≤ *i* ≤ *n,* 1 ≤ *j* ≤ *m it holds that* ‖**X**_{i:}‖₂² ≤ *μk*/*n* *and* ‖**Y**_{j:}‖₂² ≤ *μk*/*m*.
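The incoherence parameter is straightforward to compute from the compact SVD; the following sketch (our helper, following the row-norm characterization above) returns the smallest *μ* satisfying both inequalities:

```python
import numpy as np

def incoherence(M, k):
    """Incoherence parameter of a rank-k matrix M (n x m): the smallest mu
    such that every row of the left singular factor X has squared norm at
    most mu*k/n and every row of the right factor Y at most mu*k/m."""
    n, m = M.shape
    X, _, Yt = np.linalg.svd(M, full_matrices=False)
    X, Y = X[:, :k], Yt[:k].T
    mu_left = (X ** 2).sum(axis=1).max() * n / k
    mu_right = (Y ** 2).sum(axis=1).max() * m / k
    return max(mu_left, mu_right)
```

For instance, the all-ones matrix is maximally incoherent (*μ* = 1, the ideal case), while a matrix supported on a single entry attains the largest possible *μ*.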

Let each fiber in **R̲** be observed uniformly at random with probability *p*. Let *C*_{snp} denote the expected number of SNPs covered by each read, and *C*_{seq} denote the expected coverage of each of the haplotype sequences. Theorem 4.2, built upon the results of [20, 23, 38], states that with an adequate number of covered SNPs, the solution found by AltHap reconstructs **V̲** up to an error term that stems from the presence of errors in the sequencing reads.

*Assume* **M**_{(1)} *is μ-incoherent and suppose its condition number is κ. Then there exist numerical constants C*_{0}, *C*_{1} > 0 *such that if* Ω *is uniformly generated at random and the sample-size condition (13) holds,*

*then with high probability the solution found by AltHap satisfies the error bound (14).*

The proof of Theorem 4.2, omitted for brevity, relies on a coupled perturbation analysis to establish a certain type of local convexity of the objective function around the global optimum. Thus, under (13) there is no other stationary point around the global optimum and hence, starting from a good initial point, AltHap converges globally. We employ the initialization procedure suggested by [38] – summarized in the initialization step of Algorithm 1 – which is based on a low-cost singular value decomposition of **R**_{(1)} using the power method [2, 27] and with high probability lies in the described convexity region.

*Remark 1:* Under the assumptions of Theorem 4.2, condition (13) specifies a lower bound on the expected number of covered SNPs, *C*_{snp}, that is required for the exact recovery of **V̲** in the idealistic error-free scenario, i.e., for *p*_{e} = 0. With higher sequencing coverage, more SNPs are covered by the reads and hence the *C*_{snp} required for accurate haplotype assembly scales with *C*_{seq} along with other parameters. Moreover, the term on the right-hand side of (14) bounds the error of the solution generated by AltHap; this bound increases with the sequencing error rate *p*_{e} and the ploidy *k*, and decreases with *C*_{snp} and the number of reads *n*, as expected.

*Remark 2:* If **M**_{(1)} is well-conditioned, i.e., characterized by a small incoherence parameter *μ* and a small condition number *κ*, the recovery becomes easier; this is reflected in a less strict sufficient condition (13) and improved achievable performance (14). In fact, as we verified in our simulation studies, when the proposed framework for haplotype assembly is used, the parameters *μ* and *κ* associated with **M**_{(1)} are close to 1 (the ideal case). Theorem 4.3 provides theoretical bounds on the expected MEC score and CPR achieved by AltHap. (The proof is in the appendix.)

*Under the conditions of Theorem 4.2, with high probability it holds that*

𝔼{MEC} ≤ 2*p*_{e} (*C*_{seq}*m* + *κ*^{4}*C*_{1}*k*).  (15)

*Moreover, if the reads sample the haplotype sequences uniformly, with high probability the expected CPR satisfies the bound (16).*

*Remark 3:* The bound established in (15) suggests that the expected MEC score increases with the length of the haplotype sequences, the sequencing error rate, the number of haplotype sequences, and the sequencing coverage. A higher sequencing coverage results in a larger set of fragment data, which in turn leads to higher MEC scores.

*Remark 4:* As intuitively expected, the bound (16) suggests that AltHap’s achievable expected CPR improves with the number of sequencing reads and the SNP coverage; on the other hand, the CPR deteriorates at higher data error rates. Finally, assuming the same sequencing parameters, (16) implies that reconstruction of polyploid haplotypes is more challenging than that of diploids.

## 5 SIMULATION RESULTS AND DISCUSSION

We evaluated the performance of the proposed method on both experimental and simulated data, as described next. AltHap was implemented in Python and MATLAB, and the simulations were conducted on a single core of an Intel Xeon E5-2690 v3 (Haswell) running at 2.6 GHz with 64 GB DDR4-2133 RAM. The benchmarking algorithms include Belief Propagation (BP) [34], a communication-inspired method capable of performing haplotype assembly of diploid and biallelic polyploid species, HapTree [5], and H-PoP [40], the state-of-the-art dynamic programming algorithm for haplotype assembly of diploid and biallelic polyploid species shown to be superior to HapTree [5], HapCompass [1], and SDhaP [13] in terms of both accuracy and speed [30, 40]. Following the prior works on haplotype assembly (see, e.g., [9, 14, 18, 21]), we use the MEC score and CPR to assess the quality of the reconstructed haplotypes.^{2}

### 5.1 Experimental data

We first tested the performance of AltHap in an application to haplotype reconstruction of a data set from the 1000 Genomes Project – in particular, the sample NA12878 sequenced at high coverage using the 454 sequencing platform. In this work, we take the trio-phased variant calls from the GATK resource bundle [15] as the true haplotype sequences. We compare the MEC score, CPR, and running time achieved by AltHap to those of H-PoP, BP, and HapTree. All the algorithms used in the benchmarking study were executed with their default settings. The results are given in Table 1. As seen there, among the considered algorithms AltHap achieves the smallest MEC score for nearly all chromosomes and the highest CPR for the majority of the chromosomes. Moreover, H-PoP and BP are the fastest and second fastest schemes but their speed comes at the cost of reduced accuracy. On the other hand, AltHap is slightly slower than H-PoP and BP but faster than HapTree.

Fosmid pool-based sequencing provides very long fragments and is characterized by a much higher ratio of the number of SNPs to the number of reads than the standard data sets generated by high-throughput sequencing platforms. We consider the fosmid sequence data for chromosomes of HapMap NA12878 and again take the trio-phased variant calls from the GATK resource bundle [15] as the true haplotype sequences. We compare the performance of AltHap to those of H-PoP, BP, and HapTree and report the results in Table 2. As can be seen from Table 2, AltHap achieves the best CPR and MEC score for most of the chromosomes and is faster than BP and HapTree. H-PoP is the fastest among the considered schemes but its speed comes at the cost of reduced accuracy. The second most accurate method is HapTree. However, it is significantly slower than the other methods, which makes its use more challenging in practice. Since HapTree could not finish assembling the haplotypes of chromosome 6 in 48 hours, that result is missing from the table.

### 5.2 Simulated data: the diploid case

To further benchmark the performance of the proposed scheme, we test it on the synthetic data from [18] often used to compare methods for haplotype assembly of diploids. These data sets emulate haplotype assembly under varied coverage, sequencing error rates, and haplotype block lengths. We constrain our study to the assembly of haplotype blocks having length *m* = 700 (the longest blocks in the data set). The results, averaged over 100 instances of the problem, are given in Table 3. As evident from this table, AltHap outperforms the other algorithms for nearly all combinations of data error rates and sequencing coverage and is also much faster than BP and HapTree while being slightly slower than H-PoP.

### 5.3 Simulated data: the polyploid case

The performance of AltHap in applications to haplotype assembly for polyploids was tested using simulations; in particular, we studied how AltHap’s accuracy depends on coverage and sequencing error rate. The generated data sets consist of paired-end reads with long inserts that emulate the scenario where long connected haplotype blocks need to be assembled. We simulate sampling of the entire genome using paired-end reads and generate SNPs along the genome with probability 1 in 300. In other words, the distance between pairs of adjacent SNPs follows a geometric distribution with parameter 1/300 (the SNP rate). To emulate a sequencing process capable of facilitating reconstruction of long haplotype blocks, we randomly generate paired-end reads of length 500 with an average insert length of 10,000 bp and a standard deviation of 10%; sequencing errors are inserted using realistic error profiles [12] and genotyping is performed using a Bayesian approach [15]. At such read and insert lengths, the generated haplotype blocks are nearly fully connected. Each experiment is repeated 10 times. AltHap is compared with H-PoP and BP. We also attempted to run HapTree; however, it could not finish the simulations for the considered block size in 48 hours.
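The SNP placement step of this generation procedure can be sketched as follows (a numpy illustration of ours; the genome length and function name are arbitrary choices, not parameters from the study):

```python
import numpy as np

def simulate_snp_positions(genome_len, snp_rate=1/300, seed=0):
    """Place SNPs along a genome of length genome_len so that the gaps
    between adjacent SNPs are i.i.d. geometric with parameter snp_rate,
    i.e., each base is a SNP independently with probability snp_rate."""
    rng = np.random.default_rng(seed)
    # over-generate gaps, accumulate, and truncate at the genome end
    n_gaps = int(2 * genome_len * snp_rate) + 100
    gaps = rng.geometric(snp_rate, size=n_gaps)
    pos = np.cumsum(gaps)
    return pos[pos < genome_len]

pos = simulate_snp_positions(3_000_000)
# the mean inter-SNP spacing should be close to 1 / snp_rate = 300
```

The remaining steps (drawing paired-end reads, restricting them to the SNP positions they cover, and flipping alleles with the sequencing error rate) follow the same pattern.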

Table 4 compares the CPR, MEC score, and running times of AltHap with those of H-PoP and BP for biallelic triploid genomes with haplotype block lengths of *m* = 1000 for several combinations of sequencing coverage and data error rates. As can be seen there, AltHap outperforms both H-PoP and BP in terms of the CPR and MEC score in all the scenarios. In addition, AltHap is much faster than BP while achieving highly accurate performance. Interestingly, AltHap is much faster than H-PoP for the highest coverage, i.e., 30. This is due to the linear complexity of AltHap’s iterations, which makes it suitable for high-throughput sequencing data characterized by high coverage. The results of tests conducted on simulated biallelic tetraploid genomes are summarized in Table 5, where we observe that AltHap outperforms the competing schemes in terms of both accuracy and running time.

We further studied the performance of AltHap on triploid and tetraploid organisms having polyallelic sites and the results are summarized in Table 6 and Table 7, respectively. Notice that none of the competing schemes are capable of handling polyallelic genomes. The results suggest that AltHap was able to reconstruct underlying haplotype sequences with competitive performance at a low computational cost.

The results of these extensive simulations imply that, as expected, haplotype assembly becomes more challenging as the number of haplotype sequences (i.e., the ploidy) increases. Nevertheless, in all the conducted studies, AltHap consistently reconstructs haplotype sequences accurately and with low computational cost. In addition, the results of Table 4 and Table 5 demonstrate that the computational time of AltHap grows significantly slower with coverage than the computational time of the competing schemes. In particular, for high coverages that are characteristic of high-throughput sequencing technologies, AltHap is the most efficient algorithm. Finally, we use the results obtained by running AltHap on simulated biallelic triploid data (i.e., the results summarized in Table 4) to examine tightness of the theoretical bounds on the CPR stated in Theorem 4.3. In particular, theoretical bounds on CPR are compared to the CPRs empirically computed for various combinations of coverage and data error rates (averaged over 10 independent problem instances). In Fig. 3a, the theoretical bound and experimental CPR results are shown as functions of the data error rate for coverage 15. We observe that the bound is reasonably close to the experimental results over the considered range of data error rates. In Fig. 3b, the theoretical bound and experimental CPR results are plotted against sequencing coverage for the data error rate *p*_{e} = 0.002. This figure, too, implies that the theoretical CPR bound is relatively close to the experimental results.

## 6 CONCLUSION

In this paper, we developed a novel haplotype assembly framework for both diploid and polyploid organisms that relies on sparse tensor decomposition. The proposed algorithm, referred to as AltHap, exploits structural properties of the problem to efficiently find tensor factors and thus assemble haplotypes in an iterative fashion, alternating between two computationally tractable optimization tasks. If the algorithm starts the iterations from an appropriately selected initial point, AltHap converges to a stationary point which, with high probability, is in close proximity to the solution that is optimal in the MEC sense. In addition, we analyzed the performance and convergence properties of AltHap and found bounds on its achievable MEC score and correct phasing rate. AltHap, unlike the majority of existing methods for haplotype assembly of polyploids, is capable of reconstructing haplotypes with polyallelic sites, making it useful in a number of applications involving plant genomes. To the best of our knowledge, AltHap is the first polynomial time approximation algorithm for haplotype assembly with analytical guarantees on its achievable performance. Moreover, unlike several state-of-the-art techniques which are exponential in either read length or sequencing coverage, AltHap’s steps are linear in both and thus suitable for processing sequencing data with long reads and deep coverage. Our extensive tests on real and simulated data demonstrate that AltHap compares favorably to competing methods in applications to haplotype assembly of diploids, and significantly outperforms existing techniques when applied to haplotype assembly of polyploids.

As part of the future work, it is of interest to extend the sparse tensor decomposition framework to viral quasispecies reconstruction and recovery of bacterial haplotypes from metagenomic data.

## Derivation of the MEC and CPR bounds

Recall that under the conditions of Theorem 4.2, the error bound (14) on the estimate **V̂**_{(1)} holds with high probability. Once the stationary point is found, AltHap performs a decoding (rounding) step in order to obtain the binary solution. In this rounding procedure, AltHap first normalizes all unfolded fibers so that the entries of each fiber sum to 1. Then AltHap sets the largest entry of each unfolded fiber to 1 and the remaining three entries to 0. Eventually, AltHap reshapes the solution to the tensor **V̲̂**. Note that the normalization is not actually required; we consider it only for analysis purposes. It therefore suffices to establish a bound on the number of incorrectly decoded fibers. Let 𝓛 be the set of mismatching fibers, i.e., the set of fibers whose decoded values differ from the corresponding fibers of **Ṽ̲**.

Now, consider a fiber *f* ∈ 𝓛. The minimum estimation error on *f* occurs when two entries of the (normalized) estimated fiber are both equal to 0.5 and the remaining two entries are both 0; hence each mismatching fiber contributes at least ½ to the squared estimation error. Thus, with high probability it holds that |𝓛| ≤ 2‖**V̂**_{(1)} − **Ṽ**_{(1)}‖_F²,
which is the desired relation. We now establish the MEC bound. Using the linearity of expectation and the above discussion, we obtain

Thus, 𝔼{MEC} ≤ 2*p*_{e} (*C*_{seq}*m* + *κ*^{4}*C*_{1}*k*). We now establish the CPR bound. The following is an equivalent definition of CPR computed using unfolded tensors of the true and the reconstructed haplotype sequences,
where 𝓜 is a one-to-one mapping from the corresponding entries of the lateral slices of **V̲̂** to those of **Ṽ̲**. Assuming that the sequencing reads uniformly sample the haplotype sequences, on average the mismatching fibers contribute equally to each of the *k* haplotypes. Therefore,

Thus, the CPR bound (16) follows, which is the desired result.

## Derivation of the proposed step size

For the proposed algorithm to converge, it must hold that

*f*(**U**^{t+1}, **V**_{(1)}^{t+1}) ≤ *f*(**U**^{t}, **V**_{(1)}^{t})

∀*t*. First, since the update (9) exactly minimizes the objective over each row of **U** for fixed **V**_{(1)}^{t}, it holds that *f*(**U**^{t+1}, **V**_{(1)}^{t}) ≤ *f*(**U**^{t}, **V**_{(1)}^{t}). It now remains to show that *f*(**U**^{t+1}, **V**_{(1)}^{t+1}) ≤ *f*(**U**^{t+1}, **V**_{(1)}^{t}).

Since ∏_{𝒞} is the projection onto the convex constraint set 𝒞 = [0, 1], the projected update can only decrease the objective relative to the unprojected gradient step; hence it suffices to show that the gradient step itself does not increase the objective. Expanding the objective around **V**_{(1)}^{t} yields

Now, consider the second-order term in the last line of the expansion above. Following straightforward linear algebra we obtain

Therefore,

By choosing the step size (11) we obtain

Clearly, if *C* ∈ (0, 2), it must hold that *f*(**U**^{t+1}, **V**_{(1)}^{t+1}) ≤ *f*(**U**^{t+1}, **V**_{(1)}^{t}), which in turn implies convergence.

## Footnotes

* This work is funded by the National Science Foundation under grants CCF 1320273 and CCF 1618427.


^{1}Notice that *d*(**a**_{1}, **a**_{2}) = ½‖**a**_{1} − **a**_{2}‖₂² for **a**_{1}, **a**_{2} ∈ 𝒜.

^{2}For diploid data, we also quantified the performance of the different methods by means of the switch error rate (SWER) (omitted here for brevity and reported at https://sourceforge.net/projects/althap/).