Principal curve approaches for inferring 3D chromatin architecture

Three-dimensional (3D) genome spatial organization is critical for numerous cellular processes, including transcription, while certain conformation-driven structural alterations are frequently oncogenic. Genome architecture had been notoriously difficult to elucidate, but the advent of the suite of chromatin conformation capture assays, notably Hi-C, has transformed understanding of chromatin structure and provided downstream biological insights. Although many findings have flowed from direct analysis of the pairwise proximity data produced by these assays, there is added value in generating corresponding 3D reconstructions, deriving from the ability to superpose genomic features on the reconstruction. Accordingly, many methods for inferring 3D architecture from proximity data have been advanced. However, none of these approaches exploit the fact that single chromosome solutions constitute a one-dimensional (1D) curve in 3D. Rather, this aspect has either been addressed by imposition of constraints, which is both computationally burdensome and cell type specific, or ignored with contiguity imposed after the fact. Here we target finding a 1D curve by extending principal curve methodology to the metric scaling problem. We illustrate how this approach yields a sequence of candidate solutions, indexed by an underlying smoothness or degrees-of-freedom parameter, and propose methods for selection from this sequence. We apply the methodology to Hi-C data obtained on IMR90 cells and are thereby positioned to evaluate reconstruction accuracy by referencing orthogonal imaging data. The results indicate the utility and reproducibility of our principal curve approach in the face of underlying structural variation.


Introduction
The three-dimensional (3D) configuration of chromosomes within the eukaryotic nucleus is important for several cellular functions, including gene expression regulation, and has also been linked to translocation events and cancer-driving gene fusions (Mitelman and others, 2007). While direct visualization of 3D architecture has improved (see Section 2.9), imaging challenges pertaining to chromatin compaction and dynamics persist. However, the ability to infer chromatin architecture at increasing resolution has been enabled by chromosome conformation capture (3C) assays (Dekker and others, 2002). In particular, when coupled with next generation sequencing, such methods, notably Hi-C (Lieberman-Aiden and others, 2009; Duan and others, 2010), yield an inventory of pairwise, genome-wide chromatin interactions, or contacts. In turn, the contact data form the basis for reconstructing 3D configurations (Zhang and others, 2013; Varoquaux and others, 2014; Ay and others, 2014; Zou and others, 2016; Rieber and Mahony, 2017). While many novel conformation-related findings have flowed from direct analysis of contact level data, added value of performing downstream analysis based on attendant 3D reconstructions has been demonstrated. These benefits derive from the ability to superpose genomic features on the reconstruction. Examples include co-localization of genomic landmarks such as early replication origins in yeast (Witten and Noble, 2012; Capurso and Segal, 2014), gene expression gradients in relation to telomeric distance and co-localization of virulence genes in the malaria parasite (Ay and others, 2014), the impact of spatial organization on double strand break repair (Lee and others, 2016), and elucidation of '3D hotspots' corresponding to (say) overlaid ChIP-Seq transcription factor extremes which can reveal novel regulatory interactions (Capurso and others, 2016).
The contact or interaction matrices resulting from Hi-C assays, which are typically performed on bulk cell populations, are depicted as heatmaps, which record the frequency with which pairs of binned genomic loci are cross-linked, reflecting spatial proximity of the respective loci bins within the nucleus. A common first step toward 3D reconstruction is the conversion of contact frequencies into distances, typically assuming inverse power-law relationships (Varoquaux and others, 2014; Ay and others, 2014; Shavit and others, 2014; Rieber and Mahony, 2017), from which 3D chromatin architecture can be obtained via versions of the multi-dimensional scaling (MDS) paradigm. In response to (i) the bulk cell population underpinnings of contact data, (ii) computational challenges posed by the dimensionality of the MDS reconstruction problem as governed by bin extent, and (iii) accommodating biological considerations, several competing reconstruction algorithms have been advanced. However, none of these take advantage of the fact that the 3D solution for individual chromosomes corresponds to a one-dimensional (1D) curve in 3-space. Rather, this aspect has been addressed by imposition of constraints (Duan and others, 2010; Ay and others, 2014; Stevens and others, 2017), which are cell type specific and require prescription of constraint parameters. These parameters can be difficult to specify and their inclusion substantially increases the computational burden. Other approaches (Zhang and others, 2013; Park and Lin, 2017; Rieber and Mahony, 2017) do not formally incorporate contiguity but impose it post hoc, creating chromatin reconstructions by "connecting the dots" of the 3D solution according to the ordering of corresponding genomic bins.
Here we directly target chromosome reconstruction by finding a 1D curve approximation to the contact matrix via extending principal curve methodology (Hastie and Stuetzle, 1989) to the metric scaling problem. After reviewing problem formulation and current reconstruction techniques in Section 2.1, we develop two building blocks, Principal Curve Metric Scaling (PCMS; Sections 2.2, 2.3) and Weighted PCMS (WPCMS; Sections 2.4, 2.5), that enable our novel Poisson Metric Scaling (PoisMS; Sections 2.6, 2.7) approach. Strategies for selecting a specific reconstruction from a degrees-of-freedom indexed series of solutions are described in Section 2.8. Methods for appraising the accuracy of candidate reconstructions using orthogonal imaging data are outlined in Section 2.9. Results from applying the methodology to Hi-C data from IMR90 cells are presented in Section 3, while the Discussion indicates directions for future work.

Existing approaches to 3D chromatin reconstruction from Hi-C assays
Our focus is on reconstruction of individual chromosomes; whole genome architecture can follow by appropriately positioning these solutions (Segal and Bengtsson, 2015;Rieber and Mahony, 2017). As is standard, we disregard complexities deriving from chromosome pairing arising in diploid cells (which can be disentangled at high resolutions (Rao and others, 2014)) and address issues surrounding bulk cell experiments and inter-cell variation in the Discussion.
The result of a Hi-C experiment is the contact map, a symmetric matrix C = [C_ij] ∈ Z₊^{n×n} of contact counts between n (binned) genomic loci i, j on a genome-wide basis; Figure 1 provides an example. We defer questions surrounding contact matrix normalization. This matrix can be exceedingly sparse, even after binning. The 3D chromatin reconstruction problem is to use the contact matrix C to obtain a 3D point configuration x_1, …, x_n ∈ R³ corresponding to the spatial coordinates of loci 1, …, n respectively; Figure 2 gives an illustration.
Many approaches have been proposed to tackle this problem, with a broad distinction between optimization and model-based methods (Varoquaux and others, 2014; Rieber and Mahony, 2017). A common first step is conversion of the contact matrix into a distance matrix D = [D_ij] (Duan and others, 2010; Varoquaux and others, 2014; Ay and others, 2014; Shavit and others, 2014), followed by solving the multi-dimensional scaling (MDS; Hastie and others, 2009) problem: position points (corresponding to genomic loci) in 3D so that the resultant interpoint distances best conform to the distance matrix.
A variety of methods have also been used for transforming contacts to distances. At one extreme, in terms of imposing biological assumptions, are methods that relate observed intra-chromosomal contacts to genomic distances and then ascribe physical distances based on organism specific findings on chromatin packing (Duan and others, 2010) or relationships between genomic and physical distances for crumpled polymers (Ay and others, 2014). Such distances inform the subsequent optimization step as they permit incorporation of known biological constraints that can be expressed in terms of physical separation. Importantly, these constraints include prescriptions on the 3D separation between contiguous genomic bins. It is by this means that obtaining a 1D curve is indirectly facilitated. However, obtaining physical distances requires both strong assumptions and organism specific data (Fudenberg and Mirny, 2012). More broadly, a number of approaches (Zhang and others, 2013; Varoquaux and others, 2014; Zou and others, 2016; Rieber and Mahony, 2017) utilize power law transfer functions to map contacts to (non-physical) distances. Adoption of the power law derives from empirical and theoretical work but again constitutes a strong assumption (Fudenberg and Mirny, 2012).
Once we have a distance matrix D, optimization approaches seek a 3D configuration x_1, …, x_n that best fits D according to an MDS criterion. If ‖·‖ designates the Euclidean norm, then an example of MDS loss incorporating weights and penalty (Zhang and others, 2013) is

ℓ_MDS(x_1, …, x_n) = Σ_{(i,j): C_ij ≠ 0} W_ij (‖x_i − x_j‖ − D_ij)² − λ Σ_{(i,j): C_ij = 0} ‖x_i − x_j‖²,   (1)

with the corresponding optimization problem

min_{x_1, …, x_n ∈ R³} ℓ_MDS(x_1, …, x_n).   (2)

Here common choices for the weights W_ij include D_ij^{−1} (Zhang and others, 2013) and D_ij^{−2} (Varoquaux and others, 2014), these being analogous to precision weighting since large C_ij (small D_ij) are more accurately measured. Similarly, the penalty (second) term maximizes the pairwise distances for loci bins with C_ij = 0 under the presumption that such loci should not be too close.
It is worth noting that (1), and related criteria, correspond to a nonconvex, nonlinear optimization problem that is NP hard and, while various devices have been employed to mitigate the computational burden (e.g., Zhang and others, 2013), computational concerns, particularly for high resolution (many loci bins) problems, remain at the forefront.
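To make the form of criterion (1) concrete, the following is a minimal numpy sketch that evaluates a weighted MDS loss of this type; the function name `mds_loss` and the penalty weight `lam` are illustrative choices of ours, not prescriptions from any particular reconstruction method.

```python
import numpy as np

def mds_loss(X, D, W, lam=1.0):
    """Weighted MDS loss with a separation penalty, in the spirit of
    criterion (1); `lam` is an illustrative penalty weight."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distances
    obs = D > 0                             # pairs with observed contacts
    fit = (W[obs] * (d[obs] - D[obs]) ** 2).sum()
    pen = (d[~obs] ** 2).sum()              # pairs with C_ij = 0 ...
    return fit - lam * pen                  # ... are pushed apart

# toy configuration: three collinear loci whose distances match D exactly
X = np.array([[0., 0., 0.], [1., 0., 0.], [2., 0., 0.]])
D = np.array([[0., 1., 2.], [1., 0., 1.], [2., 1., 0.]])
W = np.zeros_like(D)
W[D > 0] = D[D > 0] ** -2.0                 # precision-style weights D_ij^{-2}
```

For the toy configuration the fit term vanishes, illustrating that the loss rewards exact agreement of reconstructed and target distances.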
Probabilistic methods model the contact counts with an optimization goal of maximizing the corresponding log-likelihood.
In particular, Poisson models, C_ij ∼ Pois(λ_ij), are widely used (Varoquaux and others, 2014; Zou and others, 2016; Park and Lin, 2017), where λ_ij = λ_ij(x_1, …, x_n) is a function of the genomic loci spatial coordinates x_1, …, x_n. For example, Rosenthal and others (2019) prescribe power-law dependence between the Poisson rate parameter and inter-loci distances: λ_ij = β‖x_i − x_j‖^α for some α < 0, a framework we slightly modify in Section 2.6.
All existing approaches implicitly represent chromatin as a polygonal chain. Constraints on the geometrical structure of the polygonal chain can be imposed via penalties on edge lengths and angles between successive edges, with even quaternion-based formulations employed (Caudai and others, 2015). Rosenthal and others (2019) utilize penalties to control smoothness of the resulting conformations. However, despite imparting targeted properties to the resulting reconstruction, such penalty-based approaches increase the complexity of the objective, its gradient and Hessian, slowing the associated algorithms and limiting the resolutions they can handle.
Here we develop a suite of novel approaches that directly model chromatin configuration as a 1D curve in 3D. Our primary method, Poisson Metric Scaling (PoisMS), is based on a Poisson model for contact counts and provides an efficient means for obtaining smooth 1D reconstructions, combining advantages of both MDS and probabilistic models. This technique utilizes two building blocks of intrinsic interest. First, we introduce the Principal Curve Metric Scaling (PCMS) approach that features an optimization problem inspired by MDS and stated in terms of inner products. This problem admits a simple solution obtained via the singular value decomposition. Next, we develop Weighted PCMS (WPCMS), a weighted version of PCMS that, importantly, models distances rather than inner products and further permits control over the influence of particular elements of the contact matrix on the resulting reconstruction. This technique requires an iterative algorithm that uses PCMS as the core component. Finally, WPCMS in turn can be used in conjunction with projected gradient descent to solve a second order approximation of the Poisson log-likelihood, yielding our PoisMS algorithm.

PCMS: metric scaling with a smooth curve constraint
The PCMS technique is based on classical MDS. Given a symmetric matrix Z, PCMS treats it as a similarity matrix and approximates it by an inner product matrix (Buja and others, 2008). In particular, Z can correspond to the contact matrix after conversion to a distance matrix followed by double centering, the standard MDS device that turns (Euclidean) squared distances into inner products. We illustrate this approach in the Supplement (S2) with distances obtained via power law transformation. However, while it is thereby possible to use PCMS as a standalone reconstruction tool, we seek methods that avoid having to convert contacts to distances. So, here we develop PCMS with a view to utilizing it as a building block of our PoisMS technique.
Let X ∈ R^{n×3} be the matrix of genomic loci coordinates and let S(X) = XXᵀ refer to the inner product matrix of the reconstruction X. If ‖·‖_F denotes the Frobenius norm, then the goal is to minimize the Strain objective:

ℓ_PCMS(X) = ‖Z − S(X)‖²_F.   (3)

Instead of adding a smoothness penalty to the objective, we impose an additional constraint:

x_i = γ(t_i) for a smooth 1D curve γ : R → R³.   (4)

This constraint will serve to capture the inherent contiguity of chromatin. We model the curve γ by a cubic spline with k degrees-of-freedom as follows (Hastie and others, 2009). Suppose h_1(t), …, h_k(t) are cubic spline basis functions, so that γ(t) = Σ_{j=1}^k θ_j h_j(t) with coefficient vectors θ_j ∈ R³. Let t_i index the genomic locus of x_i in the parameter space of γ, i.e. x_i = γ(t_i), and let H ∈ R^{n×k} be the matrix of spline basis evaluations at the t_i, i.e. H_{ij} = h_j(t_i). Since binning typically results in evenly spaced genomic loci it is convenient to set t_1 = 1, t_2 = 2, …, t_n = n, although irregular spacing is readily handled. So the constraint (4) can be written as x_i = Θᵀ H_i, or equivalently, in matrix form as X = HΘ with Θ ∈ R^{k×3}, leading to the optimization problem

min_Θ ℓ_PCMS(HΘ) = ‖Z − HΘΘᵀHᵀ‖²_F.   (5)

Hereafter we denote the corresponding solution by Θ̂ = PCMS(Z, H), the resulting chromatin reconstruction by X̂ = HΘ̂ and the approximation of the original matrix Z as Ẑ = S(X̂).
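As an illustration of the basis construction, the sketch below builds a cubic spline basis matrix H at t_i = 1, …, n from a truncated power basis (one of several equivalent choices; the text does not prescribe this particular basis) and orthonormalizes its columns, which is permissible because Θ is only identified up to a full-rank transform.

```python
import numpy as np

def spline_basis(n, df):
    """Cubic spline basis matrix H (n x df) evaluated at t = 1..n, built
    from a truncated power basis with df - 4 interior knots (assumes
    df >= 4); columns are orthonormalized by QR."""
    t = np.arange(1, n + 1, dtype=float)
    cols = [t ** p for p in range(4)]          # polynomial part: 1, t, t^2, t^3
    knots = np.linspace(t[0], t[-1], df - 2)[1:-1]  # evenly spaced interior knots
    for xi in knots:
        cols.append(np.maximum(t - xi, 0.0) ** 3)   # truncated cubic terms
    Q, _ = np.linalg.qr(np.column_stack(cols))      # orthonormal columns
    return Q

H = spline_basis(100, df=10)
```

A reconstruction under constraint (4) is then any matrix of the form X = H @ Theta with Theta of shape (df, 3).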

PCMS solution via eigen-decomposition
Note that the parameter Θ in the PCMS problem (5) is unconstrained. Since Θ is defined only up to multiplication by a full-rank matrix, we can assume H to be a matrix with orthonormal columns. To find the PCMS solution the following lemma, proved in Supplement Section S1, is useful.

LEMMA 2.1 If H has orthonormal columns, then minimizing ℓ_PCMS(HΘ) is equivalent to minimizing ℓ̃_PCMS(Θ) = ‖HᵀZH − ΘΘᵀ‖²_F.

Minimizing ℓ̃_PCMS(Θ) can be interpreted as approximating the matrix HᵀZH by a positive semi-definite rank 3 matrix ΘΘᵀ. Assuming that the symmetric k × k matrix HᵀZH has at least three positive eigenvalues, the solution can be found via its eigen-decomposition: let HᵀZH = QΛQᵀ for orthogonal Q and diagonal Λ = diag(λ_1, …, λ_k) with λ_1 ≥ λ_2 ≥ … ≥ λ_k; then

Θ̂ = [√λ_1 q_1, √λ_2 q_2, √λ_3 q_3],

where q_1, q_2, q_3 are the first three columns of Q. The computational efficiency of PCMS derives from the fact that it relies on eigen-decomposition of a small k × k matrix, requiring only O(k³) additional operations.
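The eigen-decomposition solution can be written in a few lines of numpy; this sketch (function name ours) assumes H already has orthonormal columns and clips any negative eigenvalues among the top three to zero.

```python
import numpy as np

def pcms(Z, H):
    """PCMS solution: approximate H^T Z H by a positive semi-definite
    rank-3 matrix Theta Theta^T via eigen-decomposition. Assumes H has
    orthonormal columns."""
    M = H.T @ Z @ H                          # small k x k matrix
    lam, Q = np.linalg.eigh(M)               # eigenvalues in ascending order
    lam, Q = lam[::-1], Q[:, ::-1]           # reorder to descending
    top = np.clip(lam[:3], 0.0, None)        # keep the positive part
    return Q[:, :3] * np.sqrt(top)           # Theta-hat = Q_3 Lambda_3^{1/2}

# sanity check on a case admitting an exact rank-3 fit
rng = np.random.default_rng(0)
n, k = 60, 8
H, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal basis
Theta0 = rng.standard_normal((k, 3))
Z = H @ Theta0 @ Theta0.T @ H.T                    # Z = S(H Theta0)
Theta = pcms(Z, H)
X = H @ Theta                                      # reconstruction
```

In this exact-fit case the recovered inner product matrix S(X̂) reproduces Z, even though Θ̂ itself is only determined up to rotation.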

WPCMS: a distance-based model for chromatin reconstruction
As indicated, direct application of PCMS to Hi-C data is limited by the need to convert contact counts to distances and then (via double centering) to inner products, since such conversion can be problematic. Even simplistic approaches based on power law transformation require prescribing a value for the index parameter, failing to accommodate dependence of the index on influencing factors such as cell type, chromosome, organism and resolution. Moreover, the double centering trick requires that the resultant distances be Euclidean.
Accordingly, we develop a distance-based version of PCMS, wherein the symmetric matrix Z contains pairwise squared distances, as opposed to inner products. Additional flexibility is afforded by introducing weights to the problem setup, which permit control over the impact of particular elements Z_ij on the reconstruction, for example to counteract diagonal dominance (Yang and others, 2017). Although the resulting technique, Weighted PCMS (WPCMS), can again be used as a standalone reconstruction tool (Supplement Section S4), akin to PCMS its primary purpose is as a component of the PoisMS approach.
We introduce a matrix of weights W ∈ [0, 1]^{n×n}, denote by D(X) the matrix of pairwise distances between genomic loci and consider the following loss function:

ℓ_WPCMS(X) = ‖W ∗ (Z − D²(X))‖²_F,   (7)

where ∗ refers to the Hadamard (element-wise) product and the matrix squaring in D²(X) is also element-wise. The WPCMS problem can be stated as follows:

min_Θ ℓ_WPCMS(HΘ).   (8)

The corresponding solution and reconstruction are denoted by Θ̂ = PCMS_W(Z, H) and X̂ = HΘ̂, respectively, along with the corresponding approximation Ẑ = D²(X̂) of the matrix Z.

Iterative algorithm for solving the WPCMS problem
Problem (8) can be elegantly solved using projected gradient descent (PGD) (Hastie and others, 2015), broadly used to solve constrained optimization problems. We first exploit the fact that the matrix of squared distances can be rewritten in terms of the inner product matrix:

D²(X) = diag(S(X)) 1ᵀ − 2 S(X) + 1 diag(S(X))ᵀ.   (9)

Here 1 = (1, 1, …, 1)ᵀ ∈ Rⁿ and diag(S(X)) = (‖x_1‖², ‖x_2‖², …, ‖x_n‖²)ᵀ is the diagonal of the inner product matrix. So (8) can be restated in terms of inner products:

min ℓ_WPCMS(S) subject to S ∈ M(H) = {HΘΘᵀHᵀ : Θ ∈ R^{k×3}}.   (10)

The PGD procedure alternates the following two steps:

[Gradient] S ← S − ε ∇ℓ_WPCMS(S);
[Projection] S ← proj_{M(H)}(S).

Here proj_{M(H)}(S) denotes the projection of matrix S onto the matrix manifold M(H). The [Gradient] step makes recourse to the following lemma, proved in Supplement Section S3.
LEMMA 2.2 Let D² denote the matrix of squared distances corresponding to the inner product matrix S (as in (9)). If G = W ∗ W ∗ (Z − D²) and G⁺ = diag(G · 1) is the diagonal matrix containing the row sums of G on its diagonal, then, up to a scaling factor, ∇ℓ_WPCMS(S) = G − G⁺.
Next, note that the [Projection] step requires solving the optimization problem

min_Θ ‖S − HΘΘᵀHᵀ‖²_F,

which is easily done using PCMS, i.e. Θ = PCMS(S, H). Thus, we end up with the following PGD procedure:

1. [Distances] Calculate the current guess for the inner product matrix S = XXᵀ and use it to compute the matrix of squared distances D² via (9).
2. [Gradient] Update the inner product matrix by the gradient step S ← S − ε (G − G⁺).
3. [Projection] Project back onto the constraint set: Θ ← PCMS(S, H), X ← HΘ.
Convergence is assessed via the stopping criterion

(ℓ_WPCMS(Θ_old) − ℓ_WPCMS(Θ_new)) / ℓ_WPCMS(Θ_old) < ε₁,

where ε₁ is a pre-chosen accuracy rate, and Θ_old and Θ_new are the Θ values calculated at the previous and current iterations respectively. Details and extensions of WPCMS are provided in the Supplement.
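A compact numpy sketch of the WPCMS iterations follows; the initialization from plain PCMS, the initial step size, and the backtracking safeguard (halving ε when the loss fails to decrease) are our assumptions rather than part of the published algorithm.

```python
import numpy as np

def pcms(Z, H):
    """Rank-3 inner product approximation of H^T Z H (Section 2.3)."""
    lam, Q = np.linalg.eigh(H.T @ Z @ H)
    lam, Q = lam[::-1], Q[:, ::-1]
    return Q[:, :3] * np.sqrt(np.clip(lam[:3], 0.0, None))

def sq_dists(S):
    """Squared distances from an inner product matrix, per identity (9)."""
    d = np.diag(S)
    return d[:, None] + d[None, :] - 2.0 * S

def wpcms(Z, W, H, step=1.0, tol=1e-8, max_iter=300):
    """Projected gradient descent for the WPCMS problem (a sketch)."""
    def loss(Th):
        S = H @ Th @ Th.T @ H.T
        return np.sum((W * (Z - sq_dists(S))) ** 2)

    Theta = pcms(Z, H)                             # crude initialization
    cur = loss(Theta)
    for _ in range(max_iter):
        S = H @ Theta @ Theta.T @ H.T
        G = (W * W) * (Z - sq_dists(S))            # Lemma 2.2
        grad = 4.0 * (G - np.diag(G.sum(axis=1)))  # gradient up to scaling
        cand = pcms(S - step * grad, H)            # gradient + projection
        new = loss(cand)
        while new > cur and step > 1e-12:          # backtracking safeguard
            step /= 2.0
            cand = pcms(S - step * grad, H)
            new = loss(cand)
        if new > cur:
            break                                  # no descent step found
        Theta, done = cand, cur - new < tol * max(cur, 1e-12)
        cur = new
        if done:
            break
    return Theta, cur

# exact-fit example: Z holds squared distances of a curve X0 = H Theta0
rng = np.random.default_rng(1)
n, k = 40, 6
H, _ = np.linalg.qr(rng.standard_normal((n, k)))
X0 = H @ rng.standard_normal((k, 3))
Z = sq_dists(X0 @ X0.T)
Theta, final_loss = wpcms(Z, np.ones((n, n)), H)
```

Because each accepted step is required to decrease the loss, the final loss never exceeds that of the PCMS initialization.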

PoisMS: Poisson model for contact counts
We now develop our primary approach, Poisson Metric Scaling (PoisMS), using WPCMS as a building block. We define a probabilistic model for contact counts based on natural and previously adopted assumptions: Poisson distributed counts C_ij with dependence of the Poisson mean on chromatin 3D structure, specifically on pairwise (squared) distances between genomic loci:

C_ij ∼ Pois(λ_ij),  log λ_ij = −‖x_i − x_j‖² + β,   (11)

with β ∈ R an intercept parameter. The negative log-likelihood objective is

ℓ_PoisMS(X, β) = Σ_{i,j} (λ_ij − C_ij log λ_ij),   (12)

and the MLE optimization problem under the smooth curve constraint (4) is

min_{Θ, β} ℓ_PoisMS(HΘ, β).   (13)

In (11) we use squared distance rather than distance, reflecting criterion (1) and conferring computational convenience. We denote the corresponding matrix of spline coefficients by Θ̂ = PoisMS(C, H) and the resulting chromatin reconstruction by X̂ = HΘ̂.

Iterative algorithm for solving PoisMS problem
A virtue of the Poisson model is that the second order Taylor approximation (SOA) of the negative log-likelihood (12) is simply the weighted Frobenius norm. Further, it is well known that the optimal value of this SOA amounts to one step of the Newton method for optimizing the original loss function. We use these facts to develop an iterative algorithm based on the WPCMS technique, which is equivalent to a projected Newton Method.
First, we review the SOA of the negative Poisson log-likelihood in the univariate case. Suppose c ∼ Pois(λ). The negative log-likelihood ℓ(λ) = λ − c log λ can be reparametrized in terms of the natural parameter η = log(λ), leading to ℓ(η) = e^η − cη. Then the SOA of the reparametrized negative log-likelihood at some point η₀ = log λ₀, up to scaling and shifting by a constant, is:

ℓ_SOA(η) = λ₀ (η − η₀ − (c/λ₀ − 1))².

The multivariate version is as follows. Suppose C ∈ Z₊^{n×n} where C_ij ∼ Pois(λ_ij) and η_ij = log(λ_ij). Let the respective matrices of Poisson and natural parameters be Λ = [λ_ij] ∈ R₊^{n×n} and H = [η_ij] ∈ R^{n×n}. Then the SOA of the negative log-likelihood at some point H₀, up to scale and shift constants, is

ℓ_SOA(H) = ‖ e^{H₀/2} ∗ (H − H₀ − (C/e^{H₀} − 1)) ‖²_F.

Here ∗ is the Hadamard (element-wise) product, with matrix exponentiation and division also being interpreted as element-wise operations.
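The completion of the square above can be checked numerically: the quadratic below differs from the second order Taylor expansion of e^η − cη only by a constant. The function names are ours.

```python
import math

def nll(eta, c):
    """Negative Poisson log-likelihood in the natural parameter."""
    return math.exp(eta) - c * eta

def soa(eta, eta0, c):
    """SOA at eta0, written as a single square (scale/shift dropped)."""
    lam0 = math.exp(eta0)
    return 0.5 * lam0 * (eta - eta0 - (c / lam0 - 1.0)) ** 2

def taylor(eta, eta0, c):
    """Exact second order Taylor expansion of nll at eta0."""
    lam0 = math.exp(eta0)
    return nll(eta0, c) + (lam0 - c) * (eta - eta0) + 0.5 * lam0 * (eta - eta0) ** 2

c, eta0 = 5.0, 1.2
shift = soa(eta0, eta0, c) - taylor(eta0, eta0, c)  # the constant offset
```

The difference soa − taylor is the same at every η, confirming that the linear terms cancel exactly.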
Recall that in the Poisson model (11) the natural parameter depends linearly on the matrix of genomic loci pairwise squared distances: H = log Λ = −D²(X) + β. So the SOA can be rewritten in terms of D²(X). Suppose that the current reconstruction guess is X₀ with corresponding natural parameter value H₀ = −D²(X₀) + β and Poisson parameter matrix Λ₀ = e^{H₀}. Then we have the following approximation of the Poisson loss (12) at the point H₀, again up to scaling and shifting by a constant:

ℓ_SOA(X) = ‖ W ∗ (Z − D²(X)) ‖²_F,  where W = Λ₀^{1/2} and Z = D²(X₀) + 1 − C/Λ₀.

Thus, under the smooth curve constraint X = HΘ, the loss function ℓ_SOA(X) coincides with the WPCMS loss (7), and we obtain a nice application of the WPCMS algorithm, with the solution to the second order approximation of problem (13),

min_Θ ℓ_SOA(HΘ),

being exactly Θ = PCMS_W(Z, H). This observation can be applied to simplify computations for the Poisson model and underlies our PoisMS algorithm.
The last step of our PoisMS algorithm is to update β according to the current guess of Θ. This can be done by optimizing the negative log-likelihood with respect to β. Altogether this leads to the following algorithm, which repeatedly approximates the Poisson objective at the current guess Θ by a quadratic function and shifts Θ towards the global minimum of this quadratic approximation:

1. Initialize Θ and β.
2. Repeat until convergence:
   2.1 Compute the current Poisson parameter matrix Λ₀ and form the SOA quantities W = Λ₀^{1/2} and Z = D²(HΘ) + 1 − C/Λ₀.
   2.2 Update Θ = PCMS_W(Z, H) by solving the resulting WPCMS problem.
   2.3 Update β by minimizing ℓ_PoisMS(HΘ, β) with respect to β.
The stopping rule for the PoisMS algorithm is similar to WPCMS: for some fixed accuracy rate ε₂ we check whether the updated (Θ_new, β_new) meets the criterion

(ℓ_PoisMS(Θ_old, β_old) − ℓ_PoisMS(Θ_new, β_new)) / ℓ_PoisMS(Θ_old, β_old) < ε₂

after each iteration of steps 2.1-2.3.
The non-convexity of the PoisMS criterion (13) implies that initialization can impact the resulting reconstruction. In the Supplement we discuss use of WPCMS to provide a warm start for the PoisMS algorithm, as well as algorithmic extensions and computational complexity.
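Putting the pieces together, the following is an illustrative numpy sketch of the PoisMS outer loop, with each Newton (WPCMS) subproblem solved approximately by a few projected gradient iterations. The initialization, iteration counts, step size, weight rescaling and numerical guards are our assumptions; the closed-form β update follows from setting the β-derivative of (12) to zero, which equates the sum of fitted rates to the sum of counts.

```python
import numpy as np

def pcms(Z, H):
    """Rank-3 inner product approximation of H^T Z H (Section 2.3)."""
    lam, Q = np.linalg.eigh(H.T @ Z @ H)
    lam, Q = lam[::-1], Q[:, ::-1]
    return Q[:, :3] * np.sqrt(np.clip(lam[:3], 0.0, None))

def sq_dists(X):
    """Pairwise squared distances between the rows of X."""
    s = (X ** 2).sum(axis=1)
    return s[:, None] + s[None, :] - 2.0 * X @ X.T

def poisms(C, H, n_outer=20, n_inner=25, step=1e-3):
    """Sketch of PoisMS: alternate SOA (WPCMS) steps for Theta with
    closed-form updates of the intercept beta."""
    Theta = np.zeros((H.shape[1], 3))
    Theta[0, 0] = 1.0                                 # crude initialization
    beta = np.log(C.mean() + 1e-12)
    for _ in range(n_outer):
        D2 = sq_dists(H @ Theta)
        Lam0 = np.maximum(np.exp(-D2 + beta), 1e-8)   # current means, guarded
        W = np.sqrt(Lam0)
        W = W / W.max()                               # rescale weights into [0, 1]
        Z = D2 + 1.0 - C / Lam0                       # SOA working response
        for _ in range(n_inner):                      # approximate WPCMS solve
            X = H @ Theta
            G = (W * W) * (Z - sq_dists(X))
            grad = 4.0 * (G - np.diag(G.sum(axis=1)))
            Theta = pcms(X @ X.T - step * grad, H)
        D2 = sq_dists(H @ Theta)
        beta = np.log(C.sum() / np.exp(-D2).sum())    # closed-form intercept
    return Theta, beta

# simulated example: counts drawn from model (11)
rng = np.random.default_rng(2)
n, k = 30, 5
H, _ = np.linalg.qr(rng.standard_normal((n, k)))
X_true = H @ (0.7 * rng.standard_normal((k, 3)))
lam = np.exp(-sq_dists(X_true) + 3.0)
C = rng.poisson(lam)
C = np.triu(C, 1) + np.triu(C, 1).T                  # symmetric contact matrix
Theta_hat, beta_hat = poisms(C, H)
```

A production implementation would instead run the inner WPCMS solve to convergence and apply the relative-decrease stopping rules with ε₁, ε₂ as in the text.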

Determination of principal curve degrees-of-freedom
The main hyperparameter of the PoisMS approach is the spline degrees-of-freedom df (spline basis size), which controls the smoothness of the resulting reconstruction. To determine the optimal value, for each df we create the spline basis matrix H_df, find the corresponding solution (Θ̂_df, β̂_df) and the resulting reconstruction X̂_df = H_df Θ̂_df. We measure the error rate by the normalized Poisson deviance, i.e.

err(X̂_df, β̂_df) = (2/n²) Σ_{1≤i,j≤n} [ C_ij log(C_ij/λ̂_ij) − (C_ij − λ̂_ij) ],

where λ̂_ij denotes the fitted Poisson parameter.

Initially, we tried cross-validation to find the optimal value of df, as is common for smoothing (penalty) parameter determination. However, the complex and structural dependencies that characterize contact matrices made this approach problematic. As an alternative we adopted an approach based on identifying the "elbow" that is prototypic in graphs of resubstitution error, here err(X̂_df, β̂_df), versus model complexity, here df. The logic whereby this change point constitutes a basis for model complexity determination is described in Breiman and others (1984) in terms of the bias-variance tradeoff. Elbow identification is also used for determining appropriate numbers of principal components (Jolliffe, 2002) and clusters (Hastie and others, 2009), as well as dimension in MDS (Kruskal and Wish, 1978) and non-negative matrix factorization (NMF; see Hutchins and others, 2008) problems.
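The normalized deviance is straightforward to compute; this sketch (function name ours) adopts the usual 0 · log 0 = 0 convention for zero counts.

```python
import numpy as np

def poisson_deviance(C, Lam):
    """Normalized Poisson deviance between counts C and fitted means Lam,
    i.e. (2/n^2) * sum_ij [C_ij log(C_ij/lam_ij) - (C_ij - lam_ij)]."""
    with np.errstate(divide="ignore", invalid="ignore"):
        # 0 * log 0 convention: zero-count entries contribute no log term
        term = np.where(C > 0, C * np.log(C / np.maximum(Lam, 1e-300)), 0.0)
    return 2.0 * np.sum(term - (C - Lam)) / C.size

C = np.array([[1., 2.], [2., 1.]])
```

The deviance vanishes at a perfect fit (Λ = C) and is positive otherwise, which is what makes the err-versus-df curve a usable resubstitution error for elbow detection.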

Accuracy assessment via multiplex FISH
While the prescription in Section 2.8 provides a means for selecting a particular PoisMS model it does not address the accuracy of the chosen model. The absence of gold standards makes such assessment challenging. In comparing competing 3D genome reconstructions several authors have appealed to simulation (Zhang and others, 2013;Varoquaux and others, 2014;Zou and others, 2016;Park and Lin, 2017), however, real data referents are preferable. To that end, many of the same reconstruction algorithm developers have made recourse to fluorescence in situ hybridization (FISH) imaging as a basis for gauging accuracy. This proceeds by comparing distances between imaged probes with corresponding reconstruction-based distances. But such methods are necessarily limited by the sparse number of probes (∼ 2 − 6; see Lieberman-Aiden and others, 2009; Shavit and others, 2014; Park and Lin, 2017) and the modest resolution thereof, many straddling over 1 megabase. The recent advent of multiplex FISH (Wang and others, 2016) transforms 3D genome reconstruction accuracy evaluation by providing an order of magnitude more probes and hence two orders of magnitude more inter-probe distances than conventional FISH. Moreover, the probes are at higher resolution and centered at topologically associated domains (TADs; see Dixon and others, 2012). We use this imaging data, along with companion accuracy assessment approaches (Segal and Bengtsson, 2018) to evaluate our PoisMS reconstructions.
The image-based 3D genomic coordinates furnished from multiplex FISH serve to define the gold standard by which we assess reconstructions. The existence of numerous multiplex FISH replicates is crucial for this task and three steps are necessary to effect such evaluation.
Obtaining the gold standard. Given N multiplex FISH replicates, denote the matrix of spatial coordinates for replicate i ∈ {1, …, N} by M_i ∈ R^{n₀×3}, where n₀ denotes the number of distinct multiplex FISH loci (probes) over all replicates. We start by defining the medoid replicate. For a pair of 3D conformations X₁, X₂ ∈ R^{n₀×3} denote the number of loci observed in both by n(X₁, X₂) and suppose d_proc(X₁, X₂) is the squared Procrustes distance from X₂ to X₁ following alignment allowing translation, rotation and scaling (Hastie and others, 2009). Then the dissimilarity between X₁ and X₂ is defined by

d(X₁, X₂) = d_proc(X₁, X₂) / n(X₁, X₂),   (16)

using the asymmetric (scaling and rotation transforms applied to X₂ only) Procrustes distance. This measure of agreement between two reconstructions coincides with mean squared deviation (see, for example, Segal and Bengtsson (2018)).
We next define the medoid replicate as the replicate whose (weighted) average dissimilarity to the other replicates is minimal:

M_med = argmin_{M_i} Σ_{j≠i} w_ij d(M_i, M_j),   (17)

with weights w_ij proportional to the number of loci n(M_i, M_j) observed in both replicates. The remaining replicates are then aligned to the medoid and averaged to obtain the average Procrustes conformation M̄.

Computing the reference distribution. Treating the average Procrustes conformation M̄ as our gold standard, we obtain a reference distribution by measuring the dissimilarity between it and the multiplex FISH replicates: d(M̄, M_i). The resulting empirical distribution captures experimental variation around the gold standard. A fine point is that this distribution will exhibit reduced dispersion compared to its target population quantity owing to data re-use, since M_i contributes to M̄. While this concern could be mitigated by employing leave-one-out techniques, the large number of available replicates (> 110) renders this unnecessary (Segal and Bengtsson, 2018).
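For intuition, this sketch implements the asymmetric Procrustes dissimilarity and medoid selection in numpy (the published analysis uses the R package vegan; here equal weights replace the shared-loci weighting, all loci are assumed observed in every replicate, and the orthogonal alignment may include reflections).

```python
import numpy as np

def procrustes_d(X1, X2):
    """Squared Procrustes distance from X2 to X1: translation removed from
    both, optimal orthogonal map and scaling applied to X2 only."""
    A = X1 - X1.mean(axis=0)
    B = X2 - X2.mean(axis=0)
    U, s, Vt = np.linalg.svd(A.T @ B)
    R = (U @ Vt).T                       # optimal orthogonal alignment for B
    c = s.sum() / (B ** 2).sum()         # optimal scaling for B
    return ((A - c * B @ R) ** 2).sum()

def dissimilarity(X1, X2):
    """Per-locus normalized squared Procrustes distance, as in (16)."""
    return procrustes_d(X1, X2) / len(X1)

def medoid_index(reps):
    """Index of the replicate minimizing total dissimilarity to the others."""
    totals = [sum(dissimilarity(Xi, Xj) for j, Xj in enumerate(reps) if j != i)
              for i, Xi in enumerate(reps)]
    return int(np.argmin(totals))

# example: a rotated, scaled, shifted copy is at distance 0 from the original
rng = np.random.default_rng(3)
X1 = rng.standard_normal((10, 3))
Rz = np.array([[0., 1., 0.], [-1., 0., 0.], [0., 0., 1.]])  # rotation about z
X2 = 2.0 * X1 @ Rz + np.array([1.0, -2.0, 0.5])
X3 = X1 + rng.standard_normal((10, 3))                      # noisy replicate
```

Because the alignment absorbs translation, rotation and scale, X1 and X2 are equivalent conformations, and the medoid among {X1, X2, X3} is the one closest, on average, to the others.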
Evaluating chromatin reconstructions. To evaluate reconstructions resulting from the PoisMS approach we first need to align the reconstruction with the gold standard. This may involve preliminary coarsening of one or other coordinate set to yield comparable resolution. Here, the genomic coordinate ranges for each multiplex FISH probe are coarser than the Hi-C bins used in our reconstructions. So we calculate the average of the reconstruction coordinates falling in the corresponding multiplex FISH bins to obtain a lower resolution reconstruction X̃ of the same dimension as M̄. To quantify how close this reconstruction is to the gold standard M̄ we again measure dissimilarity following alignment: d(M̄, X̃). Interpretations of this quantity in the context of the reference distribution are presented in the Results section.

A contrasting reconstruction algorithm: HSA
To compare our PoisMS solution with an alternate reconstruction algorithm we make recourse to HSA (Zou and others, 2016). This technique provides an interesting contrast in that it employs a similar Poisson formulation to (12) but instead of contiguity being captured via principal curves per (4), it is indirectly imparted by constraints that induce dependencies on a hidden Gaussian Markov chain over the solution coordinates. Obtaining these spatial coordinates is achieved via simulated annealing with further smoothness effected via distance-based penalization.
HSA has performed well in some benchmarking studies and features several compelling attributes including (i) simultaneously handling multiple data tracks allowing for integration of replicate contact maps and (ii) adaptively estimating the power-law index whereby contacts are transformed to distances as previously emphasized. Nonetheless, in contrast to PoisMS, HSA incurs a substantial compute and memory burden, and questions surrounding robustness have been raised (Rieber and Mahony, 2017).
To compare PoisMS performance with HSA we use the approach described in Section 2.9. Having obtained a HSA reconstruction we measure the dissimilarity between the reconstruction and the gold standard. The quantity so obtained is interpreted in the context of the attendant reference distribution (see Section 3.3 and Supplement Section S10).

Chromosome reconstructions
We present PoisMS reconstructions for IMR90 cell chromosome 20 at 100kb resolution for which multiplex FISH and Hi-C data acquisition and processing has been previously described (Segal and Bengtsson, 2018). Results for chromosome 21 are presented in the Supplement.
In Figure 1 we present the heatmap for log(C). The resulting PoisMS reconstructions X̂_df, along with the Poisson parameter matrices log(Λ_df) = −D²(X̂_df) + β, which can be viewed as approximations of log(C), are presented in Figure 2 for a series of degrees-of-freedom values.

Determining degrees-of-freedom
The graph of the error rate err(X̂_df, β̂_df) versus df reveals rapidly decreasing error rates up to df = 30 with subsequent gradual decline (Figure 3). The optimal df according to the elbow heuristic, obtained using the R package segmented (Muggeo, 2008), is df = 25, also shown in Figure 3.

Evaluating reconstructions via the multiplex FISH referent
Procrustes alignment of 3D conformations, and calculation of the corresponding distances d_proc(·, ·), was performed using the R package vegan (Oksanen and others, 2016). We obtain the multiplex FISH medoid conformation based on the smallest row sum (17) of the dissimilarity matrix of normalized Procrustes distances (16), as described above. The 111 multiplex FISH replicate conformations are then aligned to the medoid as a prelude to calculating the average Procrustes conformation, our gold standard. Figure 4 shows the histogram of dissimilarities between multiplex FISH replicates and our derived gold standard that constitutes the reference distribution. We position the PoisMS reconstruction dissimilarities therein, corresponding to the indicated series of degrees-of-freedom values. HSA reconstruction dissimilarity values are also included.
The following conclusions can be drawn from Figure 4. For chromosome 20 (see Supplement Section S10 for chromosome 21), all PoisMS fits lie within the range of the multiplex FISH dissimilarity distribution that reflects experimental variation. The fact that the PoisMS dissimilarity values are in the left tail of this distribution indicates the accuracy of the proposed reconstructions, highlighting the utility of the methodology. Further, that larger dissimilarity values pertain for HSA, particularly for chromosome 21, suggests that PoisMS performs at least comparably to this well benchmarked alternative. That PoisMS wall clock times are minutes, rather than days for HSA, is also notable.

Discussion
Central to our principal curve based approaches to 3D chromatin reconstruction is the premise that the configuration of an individual chromosome within the nucleus can be treated as a contiguous 1D curve, since the diameter of the chromatin fiber is negligible compared to the nuclear volume. The extent to which the curve is "smooth" is determined by an adaptively selected degrees-of-freedom parameter. As mentioned in the introduction, previous reconstruction methods either impart contiguity indirectly by prescribing constraints, which are difficult to specify, or impose it post hoc. In comparison, our methods based on principal curves are computationally efficient, readily scale to high resolution contact data, and are parsimonious with regard to tuning parameters.
Our implementation of PoisMS utilizes cubic spline basis functions, which contribute to this computational efficiency. However, the nature of chromatin folding and attendant Hi-C data is such that these bases will be less effective in capturing fine 3D structure, as opposed to global backbone architecture. This derives from the hierarchical, domain-based organization of chromatin, aspects that have been tackled by some reconstruction algorithms using strategies that synthesize solutions obtained at differing scales (Rieber and Mahony, 2017;Trieu and others, 2019). We will investigate whether principal curve solutions can similarly serve as building blocks in addition to exploring the use of alternate basis functions, notably wavelets.
Our analyses of Hi-C data from IMR90 cells were motivated by the availability of corresponding multiplex FISH data enabling accuracy assessment. However, the extent and resolution of multiplex FISH imaging is limited, narrowing the applicability of this means of evaluation. An even more fundamental issue pertains to attempting chromatin reconstruction using bulk Hi-C data from large cell populations. As has been emphasized (Lando and others, 2018), the presence of numerous conflicting contacts suggests that the notion of a consensus underlying 3D conformation is questionable and that there is substantial cell-to-cell structural variation. This places a premium on pursuing single cell reconstructions as enabled by the recent emergence of single cell Hi-C protocols (Ramani and others, 2017). That one of these advances (Stevens and others, 2017) also provides parallel imaging data, putatively enabling reconstruction accuracy determination, underscores the importance of applying reconstruction methods in single cell settings, despite contact map sparsity, and is the subject of future work.

Software
Proposed methods are implemented in the R package PoisMS; the software is available from GitHub (https://github.com/ElenaTuzhilina/PoisMS).

Funding
Mark Segal was partially supported by grant GM-109457 from the National Institutes of Health. Trevor Hastie was partially supported by grants DMS-1407548 and IIS 1837931 from the National Science Foundation, and grant 5R01 EB 001988-21 from the National Institutes of Health.
Supplementary Material

S1 PCMS lemma proof

LEMMA 2.1 If H ∈ R^(n×k) is a matrix with orthogonal columns, i.e. H^T H = I, and S(X) = XX^T, then the problem

min_Θ ‖Z − S(HΘ)‖_F²

is equivalent to the problem

min_Θ ‖H^T Z H − S(Θ)‖_F².

Proof. Suppose H̃ = [H H⊥] is a column-wise combination of H and its orthogonal complement H⊥. Therefore, H̃ is a square orthogonal matrix, and for any B ∈ R^(n×n) the following relation holds:

‖B‖_F² = ‖H̃^T B H̃‖_F².

Applying this with B = Z − S(HΘ), and noting that H̃^T HΘΘ^T H^T H̃ has ΘΘ^T as its top-left k × k block and zeros elsewhere, yields

‖Z − S(HΘ)‖_F² = ‖H^T Z H − S(Θ)‖_F² + ‖H^T Z H⊥‖_F² + ‖H⊥^T Z H‖_F² + ‖H⊥^T Z H⊥‖_F².

In the last equation the constant terms do not depend on the parameter Θ, implying the equivalence of the above two optimization problems.
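The equivalence can also be verified numerically: for a random symmetric Z, an orthonormal H, and two random coefficient matrices Θ, the gap between the full and projected objectives is the same constant. This check is purely illustrative and not part of the package:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 40, 8
Z = rng.standard_normal((n, n))
Z = (Z + Z.T) / 2                                  # symmetric target matrix
H, _ = np.linalg.qr(rng.standard_normal((n, k)))   # H^T H = I

def full_objective(theta):
    """||Z - S(H Theta)||_F^2 with S(X) = X X^T."""
    S = H @ theta @ theta.T @ H.T
    return np.linalg.norm(Z - S) ** 2

def projected_objective(theta):
    """||H^T Z H - S(Theta)||_F^2, the reduced k x k problem."""
    return np.linalg.norm(H.T @ Z @ H - theta @ theta.T) ** 2

t1 = rng.standard_normal((k, 3))
t2 = rng.standard_normal((k, 3))
c1 = full_objective(t1) - projected_objective(t1)
c2 = full_objective(t2) - projected_objective(t2)
assert abs(c1 - c2) < 1e-7                         # the gap is constant in Theta
```

Since the gap does not depend on Θ, minimizing the n × n objective and minimizing the much smaller k × k objective pick out the same Θ.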

S2 PCMS reconstruction
In our original development of principal curve metric scaling we directly modeled the contact matrix C in terms of the inner product matrix S(X) = XX^T, where X ∈ R^(n×3) is the matrix of genomic loci coordinates of the reconstruction. Accordingly, in optimizing the Strain objective, we are approximating C_ij ≈ ⟨x_i, x_j⟩. However, as pointed out by a referee, this is a strong assumption, corresponding to treating C as a similarity matrix. The referee further noted that, in contrast to exploratory uses of MDS, for which postulating plausible similarity matrices suffices, our goal of reproducing (up to a scale factor) the correct 3D chromatin configuration requires more careful specification of the target similarity matrix.
Here we develop a proposal of the referee in devising one such appropriate similarity matrix, by combining power law transformation with double centering, and demonstrate application of PCMS as a reconstruction tool in this context. Specifically, we first convert contacts to distances via D_ij = 1/(C_ij + 1). This corresponds to power law transformation with index α = −1 (Lesne and others, 2014), where we add one in the denominator to avoid dividing by zero. Next, we use the centering operator J = I − (1/n)11^T to transform distances to similarities via double centering: Z = −(1/2) J D² J (Buja and others, 2008). The heatmap of the resulting similarity matrix is displayed in Figure 5. Finally, we run PCMS on Z. The solution Θ̂ = PCMS(Z, H) leads to the reconstruction X̂ = HΘ̂ and approximations of the similarity matrix Z ≈ S(X̂) presented in Figure 7 for a range of degrees-of-freedom (df) values. The corresponding plot of approximation error against degrees-of-freedom is presented in Figure 6. Use of segmented regression to identify reasonable model size, given by the elbow, suggests a value df = 25 as shown.
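The full pipeline, from contact counts to a PCMS reconstruction, can be sketched as below. The closed-form fit of Θ via the top-three eigenpairs of H^T Z H follows from Lemma 2.1; the polynomial basis used here is an arbitrary stand-in for the spline basis, and the function name is ours:

```python
import numpy as np

def pcms_reconstruction(C, H):
    """PCMS as a reconstruction tool (sketch): power-law transform with
    index alpha = -1, double centering, then the closed-form fit of
    Theta from the top-3 eigenpairs of H^T Z H."""
    n = C.shape[0]
    D = 1.0 / (C + 1.0)                      # contacts -> distances
    J = np.eye(n) - np.ones((n, n)) / n      # centering operator
    Z = -0.5 * J @ (D ** 2) @ J              # double centering: similarities
    w, V = np.linalg.eigh(H.T @ Z @ H)       # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:3]            # top-3 eigenpairs
    Theta = V[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))
    return H @ Theta                         # reconstruction X = H Theta

# Toy run: random symmetric contact counts, orthonormal polynomial basis.
rng = np.random.default_rng(0)
n, k = 60, 10
C = rng.poisson(5.0, (n, n)).astype(float)
C = np.triu(C, 1)
C = C + C.T
t = np.linspace(-1, 1, n)
H, _ = np.linalg.qr(np.vander(t, k, increasing=True))
X = pcms_reconstruction(C, H)
```

Clipping negative eigenvalues to zero enforces positive semi-definiteness of the fitted rank-3 inner product matrix, so the returned X is always a valid 3D configuration.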

S3 WPCMS lemma proof
LEMMA 2.2 Let D² denote the matrix of squared distances corresponding to the inner product matrix S = XX^T. For similarity matrix Z and weight matrix W let G = W ∗ (Z − D²) and G+ = diag(G · 1), a diagonal matrix with column sums of G on the diagonal. Then, up to a scaling factor, ∇WPCMS(S) = G − G+.
Proof. Note that D²_ij = S_ii + S_jj − 2S_ij, so that, writing the WPCMS objective as WPCMS(S) = Σ_ij W_ij (Z_ij − D²_ij)² with G = W ∗ (Z − D²), the partial derivatives are

∂WPCMS/∂S_kl = 4 W_kl (Z_kl − D²_kl) = 4 G_kl for k ≠ l,
∂WPCMS/∂S_kk = 4 (G_kk − (G · 1)_k).

In matrix form this is exactly ∇WPCMS(S) = 4(G − G+), i.e. G − G+ up to a scaling factor.
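The lemma can be checked numerically by comparing the closed form against central finite differences of the loss Σ_ij W_ij (Z_ij − D²_ij)², treating the entries of S as independent (the symmetric case follows by symmetry); the check is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 12
W = (rng.random((n, n)) < 0.7).astype(float)
W = np.triu(W, 1)
W = W + W.T                                   # symmetric binary weights, zero diagonal
Z = rng.standard_normal((n, n))
Z = (Z + Z.T) / 2
S = rng.standard_normal((n, n))
S = (S + S.T) / 2

def sqdist(S):
    """Squared-distance matrix from an inner product matrix."""
    d = np.diag(S)
    return d[:, None] + d[None, :] - 2 * S

def loss(S):
    return np.sum(W * (Z - sqdist(S)) ** 2)

G = W * (Z - sqdist(S))
Gplus = np.diag(G.sum(axis=1))
grad = 4 * (G - Gplus)                        # closed form from the lemma

# Central finite differences on a few entries of S.
eps = 1e-6
for (k, l) in [(0, 1), (3, 3), (5, 2)]:
    Sp = S.copy(); Sp[k, l] += eps
    Sm = S.copy(); Sm[k, l] -= eps
    fd = (loss(Sp) - loss(Sm)) / (2 * eps)
    assert abs(fd - grad[k, l]) < 1e-4
```

Because the loss is quadratic in the entries of S, the central difference is exact up to floating point roundoff, making this a tight check of both the off-diagonal and diagonal terms.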

S3.1 Learning rate
Since the WPCMS algorithm is based on Projected Gradient Descent (PGD), many extensions can be naturally derived. For example, a learning rate δ can be introduced by replacing G by δG at the [Gradient] step. Further, line search can be readily added to the algorithm.
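A sketch of one such step, combining a learning rate δ with backtracking line search, is given below. It follows the gradient/projection structure of the PGD scheme, with the gradient G − G+ from Lemma 2.2, but the function name, defaults, and loss convention are our own illustrative choices, not the package API:

```python
import numpy as np

def wpcms_step(Theta, H, Z, W, delta0=1.0, shrink=0.5, max_backtrack=30):
    """One projected-gradient step with backtracking line search (sketch).

    Gradient step in the space of inner products S = X X^T using
    grad = G - G+, then projection back onto the principal-curve
    parameterization X = H Theta via the top-3 eigenpairs of H^T S H.
    """
    def sqdist(S):
        d = np.diag(S)
        return d[:, None] + d[None, :] - 2 * S

    def loss(Th):
        S = (H @ Th) @ (H @ Th).T
        return np.sum(W * (Z - sqdist(S)) ** 2)

    def project(S):                           # best Theta for a given S
        w, V = np.linalg.eigh(H.T @ S @ H)
        idx = np.argsort(w)[::-1][:3]
        return V[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))

    S = (H @ Theta) @ (H @ Theta).T
    G = W * (Z - sqdist(S))
    grad = G - np.diag(G.sum(axis=1))         # gradient up to scaling
    base = loss(Theta)
    delta = delta0
    for _ in range(max_backtrack):            # backtracking line search
        cand = project(S - delta * grad)
        if loss(cand) < base:
            return cand
        delta *= shrink
    return Theta                              # no improving step found
```

Because a candidate is accepted only if it lowers the loss, the iteration is monotone non-increasing by construction.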

S3.2 Intercept
As with PoisMS, it is possible to incorporate an intercept parameter β into the WPCMS problem statement, leading to the optimization problem of minimizing Σ_ij W_ij (Z_ij − D²_ij(X) + β)² over X = HΘ and β. The WPCMS iterative algorithm is then modified by the addition of a step that, with X held fixed, updates the intercept to its closed-form minimizer β̂ = −Σ_ij W_ij (Z_ij − D²_ij(X)) / Σ_ij W_ij.

S4 WPCMS reconstruction
Similarly to PCMS, we can use WPCMS as a standalone chromatin reconstruction technique. To make results comparable to PoisMS we adopt the following approach. According to the Poisson model under consideration, the matrix of Poisson rate parameters Λ depends on genomic loci spatial coordinates X via the relation log Λ = −D²(X) + β. Since Λ is the expectation of the contact counts C, it is reasonable to seek the approximation log(C) ≈ −D²(X) + β. To sidestep taking logarithms of zero counts, we introduce the binary matrix of weights W with elements W_ij = 1 if C_ij > 0 and W_ij = 0 otherwise. Then the corresponding optimization problem coincides with the WPCMS problem (19) with Z = −log(C). We obtained the solution (X̂, β̂) using our WPCMS algorithm and present chromatin reconstructions X̂ = HΘ̂, as well as approximations −D²(X̂) + β̂ to the log-transformed contact matrix log(C), for a series of degrees-of-freedom in Figure 8. The corresponding plot of approximation error is presented in Figure 9. Use of segmented regression to identify reasonable model size, given by the elbow, suggests a value df = 35 as shown.
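The construction of the target Z and the binary weights W from the contact matrix can be sketched as follows; the helper name is ours, and masking the diagonal is a simplifying choice in this sketch (D²_ii = 0 carries no distance information):

```python
import numpy as np

def wpcms_inputs(C):
    """Map a contact count matrix to the WPCMS target and weights (sketch):
    Z = -log(C) where counts are positive, with binary weights W masking
    the zero-count entries so the logarithm is never taken at zero."""
    W = (C > 0).astype(float)
    np.fill_diagonal(W, 0.0)                 # simplifying choice: ignore self-contacts
    Z = np.zeros_like(C, dtype=float)
    Z[C > 0] = -np.log(C[C > 0])
    return Z, W
```

The masked entries contribute nothing to the loss, so zero counts, which are abundant in sparse Hi-C maps, never force an arbitrary pseudo-distance into the fit.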

S5 WPCMS initialization
We study the influence of initialization on the resulting WPCMS reconstructions. In our experiments we generated 50 random Θ with Θ_ij ∼ N(0, 1). At the [Initialize] step, for each of these Θ, we set the starting value for the reconstruction to X = HΘ, fixed the degrees-of-freedom at df = 50, and ran the WPCMS algorithm. We then compared the solutions obtained, as well as the losses and β values. In Figure 10 we superpose coordinate-wise representations of all 50 curves obtained via WPCMS. The WPCMS algorithm converges to two stable conformations. The relative change in the achieved values of both the loss and β does not exceed 1%; the number of iterations required for the same convergence criterion varies from tens to a few hundred. Figure 11 depicts the 50 WPCMS solutions in terms of two key geometric characteristics of the reconstructions: distances between successive genomic loci, which reflect curvature, and angles formed by three successive genomic loci, which reflect torsion. The plot showcases highly similar geometric structure across reconstructions, irrespective of initialization.
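The two geometric summaries can be computed directly from a reconstruction matrix; a minimal sketch (the angle reported here is the turning angle between successive edges, and the function name is ours):

```python
import numpy as np

def edge_lengths_and_angles(X):
    """Geometric summaries used to compare reconstructions:
    distances between successive genomic loci, and turning angles
    (radians) formed by each pair of successive edges."""
    E = np.diff(X, axis=0)                    # successive edges of the curve
    lengths = np.linalg.norm(E, axis=1)
    u = E / lengths[:, None]                  # unit edge directions
    cosines = np.clip(np.sum(u[:-1] * u[1:], axis=1), -1.0, 1.0)
    angles = np.arccos(cosines)
    return lengths, angles
```

These summaries are invariant to rigid motions of the reconstruction, which is what makes them suitable for comparing solutions obtained from different initializations without first aligning them.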

S6.2 Learning rate
Note that the update for the Poisson generalized linear model (GLM) amounts to one step of the Newton method in the space of the natural parameter log Λ = −D²(X) + β. It is not difficult to show that this is equivalent to a Newton step in the parameter space of squared distances D²(X) and, consequently, in the parameter space of inner products S(X), since the elements of D²(X) depend linearly on the elements of S(X). Therefore, we can add a learning rate, as well as line search in the parameter space of S(X), to the PoisMS algorithm.
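Entrywise, writing the natural parameter as η, the Poisson negative log-likelihood e^η − Cη has curvature e^η, so the quadratic (Newton) approximation around a current η₀ has weights e^{η₀} and minimizer η₀ − (e^{η₀} − C)/e^{η₀}. A sketch of this update (helper name ours):

```python
import numpy as np

def poisson_newton_update(C, eta0):
    """One entrywise Newton step for the Poisson negative log-likelihood
    exp(eta) - C * eta: returns the curvature weights exp(eta0) and the
    updated natural parameter eta0 - (exp(eta0) - C) / exp(eta0)."""
    lam = np.exp(eta0)                        # current Poisson rates
    return lam, eta0 - (lam - C) / lam

# Iterating the entrywise step drives eta to the minimizer log(C).
eta = np.array([1.0])
for _ in range(10):
    _, eta = poisson_newton_update(np.array([5.0]), eta)
```

In the PoisMS setting this entrywise quadratic, restated in terms of S(X) via the linear map from inner products to squared distances, is what yields a weighted least squares subproblem at each iteration.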

S7 PoisMS initialization
Akin to our exploration of the impact of initialization on WPCMS (Section S5) here we examine initialization in the context of PoisMS. In Figure 12 we again superpose 50 curves obtained from different initializations of PoisMS and represent them coordinate-wise. As is apparent, PoisMS exhibits higher sensitivity to the starting value of Θ than WPCMS. However, the relative change in the achieved values for both loss and β is still small, i.e. does not exceed 3%.
While Figure 12 highlights reconstruction variability with respect to initialization, Figure 13 which, as per Figure 11, captures geometry via successive edge lengths and angles, suggests the existence of a common underlying similar structure.

S8 PoisMS warm start
In view of the sensitivity of PoisMS to initialization we pursue use of a warm start. Based on the heuristic from Section S4, we replace the random initialization step by initializing Θ at the WPCMS solution (21), where W is the binary matrix (20). Adoption of warm starts significantly decreases the number of iterations required for convergence and allows for lower loss values. The resulting reconstruction is presented in Figure 12 in black.

Figure 12: 50 reconstructions X̂ computed for df = 50 via PoisMS; each reconstruction corresponds to an initialization (random initialization in blue, warm start in black) and is represented coordinate-wise.

Figure 13: 50 reconstructions X̂ computed for df = 50 via PoisMS; each reconstruction corresponds to a random initialization, and is represented as lengths of reconstruction edges and angles between successive edges.

S9 Computational complexity
The computational complexity of PCMS is O(n²k) operations for computing the product H^T C H, plus an extra O(k³) for the eigendecomposition (negligible for small k). Each iteration of WPCMS has the same complexity as PCMS, plus O(n²) operations for computing Hadamard products with the matrix W. Finally, the complexity of a PoisMS iteration is equivalent to that of a WPCMS iteration and requires an additional O(n²) operations for calculating the second order approximation.

S10 PoisMS reconstructions for IMR90 cell chromosome 21
We provide results for IMR90 chromosome 21 at 100kb resolution that parallel the presentation for chromosome 20.