A Model Selection Approach to Hierarchical Shape Clustering with an Application to Cell Shapes

Shape is an important phenotype of living species that carries environmental and genetic information. Clustering living cells using their shape information can provide a preliminary guide to their functionality and evolution. Hierarchical clustering, and dendrograms as a visualization tool for hierarchical clustering, are commonly used by practitioners for classification and clustering. The existing hierarchical shape clustering methods are distance based. Such methods often lack a proper statistical foundation for making inference on important parameters such as the number of clusters, which is often of prime interest to practitioners. We take a model selection perspective to clustering and propose a shape clustering method through linear models defined on spherical harmonics expansions of shapes. We introduce a BIC-type criterion, called CLUSBIC, and study the consistency of the criterion. Special attention is paid to the notions of over- and under-specified models, which are important in studying model selection criteria and naturally defined in the model selection literature. These notions do not automatically extend to shape clustering when a model selection perspective is adopted for clustering. To this end we take a novel approach using hypothesis testing. We apply our proposed criterion to cluster a set of real 3D images from the HeLa cell line.

Shape modelling plays an important role in medical imaging and computer vision [1,2]. There is, in particular, a vast literature on cell shape analysis, where the shape can often provide important information about the functionality of the cell [3,4,5,6,7]. Most such analyses employ techniques that often lack a statistical error component. Variability among shapes has been studied using different approaches: active shape models [8], computational anatomy [9], planar shape analysis [10,11], etc.
where l and m are integers with |m| ≤ l, and P_l^m is the associated Legendre polynomial [13]. Given a spherical function r(θ, φ) and a specified maximum degree L_max, one can write r(θ, φ) as a linear expansion of the Y_l^m, with possibly some measurement error, and find the coefficients of the fit through the method of least squares. That is,

y = Xβ + ε,    (2)

where y^T = (r_1, r_2, ..., r_N), β^T = (β_1, β_2, ..., β_K), X is the matrix of spherical harmonic basis functions evaluated at the observed directions, N is the number of observations, |m| ≤ l ≤ L_max, and K = (L_max + 1)² is the number of parameters. The quality of fit improves as L_max increases, i.e., the larger the number of expansion terms.

The model (2) is only suitable for surface modeling of stellar (star-shaped) shapes. A different parametric form for surface modeling, regardless of the type of shape, is suggested by [14,15]. This parametric form gives three explicit functions defining the surface of the shape, x_s(θ, φ), y_s(θ, φ), and z_s(θ, φ), where each coordinate is modeled as a function of spherical harmonic bases. Accordingly, the following three linear models are generated,

x_s = Zβ_x + ε_x,   y_s = Zβ_y + ε_y,   z_s = Zβ_z + ε_z,    (3)

where Z is the common matrix of spherical harmonic basis functions. The set of expansion coefficients (β_x, β_y, β_z) defines the shape completely. Assuming x_s, y_s and z_s to be independent of each other, the above three models can be combined into a single linear model of the general form introduced in the next section.
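To make the least-squares fit of model (2) concrete, the following Python sketch builds a real-valued spherical-harmonic design matrix with scipy and fits the coefficients to synthetic radial data; the grid, noise level, and the helper design_matrix are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of the least-squares fit in model (2):
# a real-valued spherical-harmonic design matrix up to degree L_max, fitted to
# synthetic radial data.
import numpy as np
from scipy.special import sph_harm

def design_matrix(theta, phi, L_max):
    """Real spherical-harmonic basis evaluated at polar angles theta and azimuths phi."""
    cols = []
    for l in range(L_max + 1):
        for m in range(-l, l + 1):
            # scipy convention: sph_harm(order m, degree l, azimuth, polar angle)
            Y = sph_harm(abs(m), l, phi, theta)
            if m < 0:
                cols.append(np.sqrt(2.0) * Y.imag)
            elif m == 0:
                cols.append(Y.real)
            else:
                cols.append(np.sqrt(2.0) * Y.real)
    return np.column_stack(cols)                             # shape: N x (L_max + 1)^2

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, np.pi, 500)                         # polar angle
phi = rng.uniform(0.0, 2.0 * np.pi, 500)                     # azimuth
r = 1.0 + 0.2 * np.cos(theta) + 0.05 * rng.standard_normal(500)   # noisy radii

X = design_matrix(theta, phi, L_max=4)
beta_hat, *_ = np.linalg.lstsq(X, r, rcond=None)             # least-squares coefficients
print(beta_hat.shape)                                        # (25,) = (L_max + 1)^2
```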

1 Shape Clustering
Having characterised shapes by functional forms, we can cluster shapes according to their estimated coefficients. Clustering could be achieved simply by computing the Euclidean distance between the estimated parameters of the models in equation (2). However, this heuristic method may not lead to proper clusters, so we aim to develop a methodology using the fitted models. To this end, we present a likelihood-based approach for clustering shapes in this section. From the Bayesian point of view, the clustering procedure is enhanced by taking into account the distribution of the parameters in equation (2). We assume prior distributions over the parameters of the fit β and calculate the marginal distribution of the model by integrating the likelihood with respect to the assumed prior. In this approach, the hierarchy of clusters is built from the marginal likelihoods [16,17,18,19]. Agglomerative hierarchical clustering is used to establish the dendrogram, in which curves are merged as long as the merge improves the marginal likelihood, as discussed in [20].
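The agglomerative step described above can be sketched as a greedy loop that accepts a merge only when it improves the total marginal likelihood. In the sketch below, log_marginal is a hypothetical function returning log p(y | grouping); the loop is an illustration of the procedure, not the authors' code.

```python
# A sketch of the greedy agglomerative step driven by marginal likelihoods.
# `log_marginal(groups)` is a hypothetical function returning log p(y | grouping);
# a merge is accepted only if it increases the total log marginal likelihood.
def agglomerate(items, log_marginal):
    groups = [[i] for i in items]                        # start from singletons
    current = log_marginal(groups)
    history = [([tuple(g) for g in groups], current)]
    while len(groups) > 1:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                candidate = [g for k, g in enumerate(groups) if k not in (a, b)]
                candidate.append(groups[a] + groups[b])
                score = log_marginal(candidate)
                if best is None or score > best[0]:
                    best = (score, candidate)
        if best[0] <= current:                           # no merge improves the criterion
            break
        current, groups = best
        history.append(([tuple(g) for g in groups], current))
    return history                                       # dendrogram levels and their scores
```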

This section is organized as follows. First, a brief sketch of likelihood calculation under classical assumptions is provided in Subsection 1.
Consider the linear model

y = Xβ + ε,    (4)

where N and K indicate the number of observations and the number of attributes, respectively. We assume ε is distributed according to the Gaussian distribution with mean 0_{N×1} and covariance matrix σ²I_N, i.e. ε ∼ N(0, σ²I_N), where I_N is the identity matrix of size N.

In the case of spherical harmonic expansions, equation (3), we assume that the error terms ε_x, ε_y, and ε_z are independent of each other and that each follows the Gaussian distribution N(0, σ²I_N). Subsequently, the stacked error vector satisfies

(ε_x^T, ε_y^T, ε_z^T)^T ∼ N(0, σ²(I_3 ⊗ I_N)),    (5)

where the symbol ⊗ denotes the Kronecker product.
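As a small illustration of this stacked structure, the numpy sketch below builds the joint design I_3 ⊗ Z shared by the three coordinate models; the matrix Z, the coefficients and the noise level are placeholders assumed for the example, not quantities from the paper.

```python
# A sketch of the stacked model: the three coordinate models in (3) share the same
# spherical-harmonic design matrix Z, so the joint design is I_3 (Kronecker) Z.
import numpy as np

N, K = 200, 25
rng = np.random.default_rng(1)
Z = rng.standard_normal((N, K))                          # stand-in for the SH design matrix
beta = rng.standard_normal(3 * K)                        # stacked (beta_x, beta_y, beta_z)

X = np.kron(np.eye(3), Z)                                # 3N x 3K joint design matrix
sigma = 0.1
y = X @ beta + sigma * rng.standard_normal(3 * N)        # stacked (x_s, y_s, z_s) responses
print(X.shape, y.shape)                                  # (600, 75) (600,)
```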

Given D distinct shapes, the model associated with this set of shapes is

y = Xβ + e,    (6)

where y^T = (y_1^T, ..., y_D^T) stacks the D response vectors, X is the corresponding block-diagonal matrix of covariates, and β^T = (β_1^T, ..., β_D^T). We explain the methodology using the notation of equation (4); a similar methodology applies using equation (3), taking into account the property in equation (5). Here d = (d_1, d_2, ..., d_D) is a grouping vector, e.g. d = (1, 1, ..., 1) assigns all D shapes to one group and d = (1, 2, ..., D) assigns each shape to a singleton. The likelihood function for the model (6) given the grouping vector d is

p(y | β, X, d, σ²) = ∏_{i=1}^{C(d)} p(y^(i) | β_i, X^(i), σ²),

where C(d) denotes the number of unique elements in d (the number of clusters), and N^(i), y^(i) and X^(i) represent, respectively, the number of observations, the vector of responses, and the matrix of covariates after combining the clusters. The vector of unknown parameters is denoted by β.

For the sake of simplicity, we propose conjugate priors [21]. To begin with, σ² is assumed to be known. The standard conjugate prior imposed on β, conditional on σ², is β | σ² ∼ N(β_0, σ²V_0), where β_0 and V_0 are the prior mean and prior covariance matrix for β, respectively. For a detailed discussion of Bayesian methods see [22]. The marginal likelihood of y is obtained by integrating the likelihood against this prior, p(y | σ²) = ∫ p(y | β, σ²) p(β | σ²) dβ. In the case of conjugate priors, this marginal model is the multivariate Gaussian distribution with mean Xβ_0 and covariance matrix σ²(I_N + XV_0X^T), where N = Σ_{i=1}^{D} N_i denotes the total number of observations. The curves are assigned to the grouping with maximum p(d | y). By Bayes' theorem,

p(d | y) ∝ p(y | d) p(d).    (8)

The dendrogram exploits the marginal likelihood to build the hierarchy. It is expected that the marginal likelihood reaches its maximum over a reasonable grouping. Thus, the logical cut-off on the dendrogram is where the marginal likelihood is maximized over the dendrogram.
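Under these assumptions the marginal likelihood has a closed form; the following sketch (with illustrative argument names beta_0, V_0 and sigma2) evaluates its logarithm directly, assuming the σ²-scaled conjugate prior stated above.

```python
# A sketch of the closed-form log marginal likelihood under the conjugate prior
# beta | sigma^2 ~ N(beta_0, sigma^2 V_0), so that
# y | sigma^2 ~ N(X beta_0, sigma^2 (I_N + X V_0 X^T)).
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(y, X, beta_0, V_0, sigma2):
    mean = X @ beta_0
    cov = sigma2 * (np.eye(len(y)) + X @ V_0 @ X.T)
    return multivariate_normal.logpdf(y, mean=mean, cov=cov)
```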

In order to calculate the marginal likelihood p(y | σ²) for each curve, one needs a matrix inversion of size N; see [24]. If σ² is nearly degenerate, i.e., E(σ²) = µ_0 and Var(σ²) ≈ 0, this model is asymptotically the same as the Gaussian model.

Bayesian clustering according to p(d | y), equation (8), requires modelling p(y | d) and p(d). Here, we follow the structure suggested in [20]. The random vector d denotes the possible groupings for a set of shapes. Suppose C(d) is the total number of clusters at each step and n_1, n_2, ..., n_{C(d)} are the numbers of shapes in each of the clusters. Suppose that C(d) is uniformly distributed over the set {1, 2, ..., D}, where D is the total number of shapes, and that (n_1, ..., n_{C(d)}) is distributed according to a multinomial-Dirichlet distribution with parameter vector π. A uniform setting on the parameter vector π then yields a closed-form prior probability for the cluster sizes (n_1, ..., n_{C(d)}).
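A minimal sketch of this prior over groupings, assuming a uniform prior on C(d) and a Dirichlet-multinomial prior with a common concentration α = 1 on the cluster sizes, is given below; the exact normalisation used by the authors may differ, so the helper functions are an assumption.

```python
# A sketch of the prior p(d) over groupings: a uniform prior on the number of clusters
# C(d) in {1, ..., D} combined with a Dirichlet-multinomial prior (common alpha = 1)
# on the cluster sizes.
from math import lgamma, log

def log_dirichlet_multinomial(counts, alpha=1.0):
    n, C = sum(counts), len(counts)
    out = lgamma(n + 1) + lgamma(C * alpha) - lgamma(n + C * alpha)
    for n_j in counts:
        out += lgamma(n_j + alpha) - lgamma(n_j + 1) - lgamma(alpha)
    return out

def log_prior_grouping(counts, D):
    return -log(D) + log_dirichlet_multinomial(counts)   # uniform on C(d), then sizes

print(log_prior_grouping([3, 2, 5], D=10))               # e.g. three clusters of 10 shapes
```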

We introduce a criterion for clustering based on the marginal likelihoods, called the Clustering Bayesian Information Criterion (CLUSBIC). CLUSBIC is similar to the BIC in nature, but it is designed for the purpose of hierarchical clustering. When all data fall into one cluster, CLUSBIC coincides with the ordinary BIC.

[26] proposed a simple informative distribution on the coefficient vector β in Gaussian regression models. As a prior on the parameter vector, given the covariates, he considered β ∼ N(β_0, gσ²(X^TX)^{-1}), a conjugate Gaussian prior distribution. This covariance matrix is a scaled version of the covariance matrix of the maximum likelihood estimator of β. In practice, β_0 can be taken as zero, and g is an overdispersion parameter to be estimated or manually tuned. The parameter g can be set according to various common model selection methods such as AIC, BIC and RIC [27]. [26,28] suggested the use of a prior distribution on g as a fully Bayesian method (see also [29]). Various other methods have been recommended in the literature for finding an optimal value for g, such as the empirical Bayes approach.
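For this g-prior with β_0 = 0 and σ² treated as known, the marginal likelihood of y is available in closed form because the hat matrix is idempotent; the sketch below (illustrative names, not the authors' code) implements that formula.

```python
# A sketch of the marginal likelihood under Zellner's g-prior with beta_0 = 0 and
# sigma^2 known: y ~ N(0, sigma^2 (I + g H)) with H the hat matrix, which gives
# the closed form below since H is an idempotent (projection) matrix of rank K.
import numpy as np

def log_marginal_g_prior(y, X, g, sigma2):
    N, K = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)                # hat matrix X (X^T X)^{-1} X^T
    quad = y @ y - (g / (1.0 + g)) * (y @ H @ y)         # y^T (I + gH)^{-1} y
    return (-0.5 * N * np.log(2.0 * np.pi * sigma2)
            - 0.5 * K * np.log(1.0 + g)                  # log det(I + gH) = K log(1 + g)
            - quad / (2.0 * sigma2))
```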

The testing formulation of model selection is to test the hypothesis that a given subset of the regression coefficients is zero. Clustering, in contrast, is concerned with finding different ways of combining the covariates. Clustering may then be formulated as testing the hypothesis

H_0 : Tβ = 0   versus   H_1 : Tβ ≠ 0,    (10)

where T_{qK×DK} is the matrix of constraints such that Tβ = 0 is a set of q linear constraints on the β_j for 1 ≤ j ≤ D. The following theorem provides an approximation of the marginal likelihood, which we call CLUSBIC.

Theorem 1. As N → ∞,

−2 log p(y | σ²) = −2 log p(y | β̂, σ²) + (D − q)K log N + O_p(1),

where β̂ is the maximum likelihood estimate of β under the null hypothesis H_0 : Tβ = 0.

Proof. The proof is given in Section A of the Appendix.

Using CLUSBIC, two groups, say group i and group j, are merged if the merge leads to a decrease in CLUSBIC. The CLUSBIC is calculated for the different combinations of shapes, and a pair of groups that minimizes CLUSBIC is a merging candidate. For example, groups i and j are merged in the second level of the hierarchy only if the merge decreases the total CLUSBIC compared with the level before. The following calculation shows that this idea of merging groups is closely aligned with likelihood maximization.
Here ℓ_i = log{p(y_i | β_i, X, σ²)} and ℓ_(ij) is the log-likelihood after merging group i with group j. The total number of observations is fixed at each level of the hierarchy, i.e., N = Σ_i N_i. For Gaussian models, Ward's linkage ∆(i, j) is closely related to CLUSBIC: minimizing the change in CLUSBIC between two consecutive levels of the dendrogram amounts to minimizing a metric c_ij built from the fitted values Ê(y_i) = Xβ̂_i. When X^TX is a diagonal matrix, c_ij equals ∆(i, j).
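To connect the merging step with the hypothesis-testing formulation in (10), the sketch below constructs a constraint matrix T encoding β_i = β_j for each candidate merge; the function name, the pair list, and the 0-based indexing are illustrative assumptions, not the authors' construction.

```python
# A sketch of building the constraint matrix T in H_0: T beta = 0 for candidate merges:
# each merged pair (i, j) contributes a K x DK block enforcing beta_i = beta_j.
import numpy as np

def constraint_matrix(pairs, D, K):
    rows = []
    for i, j in pairs:
        block = np.zeros((K, D * K))
        block[:, i * K:(i + 1) * K] = np.eye(K)
        block[:, j * K:(j + 1) * K] = -np.eye(K)
        rows.append(block)
    return np.vstack(rows)                               # (qK) x (DK), q = number of pairs

T = constraint_matrix([(0, 1)], D=3, K=4)                # merge shapes 1 and 2 out of D = 3
print(T.shape)                                           # (4, 12)
```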

In this section, we show that the CLUSBIC criterion is consistent in choosing the true clustering as N_i → ∞ for i = 1, 2, ..., C(d). Assuming that there exists a model m_0 ∈ M that represents the true clustering, the CLUSBIC, developed in Theorem 1, is said to be a consistent criterion if

lim_{N→∞} P(m̂ = m_0) = 1,

where m̂ is the model selected by CLUSBIC.

In a regression setting, the space of models M can be partitioned into two sub-spaces of under-specified and over-specified models in a rather straightforward fashion. The space of under-specified models, M_1, contains all models that mistakenly exclude attributes of the true model. On the other hand, the space of over-specified models, M_2, contains all models that include additional attributes besides the true model's attributes. In other words, the sub-spaces M_1 and M_2 can be established simply by considering the presence or absence of the attributes of the true model: M_1 consists of the models that do not contain all attributes of the true model m_0, and M_2 consists of the models that strictly contain them. Consequently, for each model that belongs to M_2, the dimension of the model, K, is always greater than that of the true model.

The definition of under-specified and over-specified models needs to be adjusted for the clustering context. By definition, the over-specified models must include the attributes of the true model. The question that arises here is how the notion of attributes should be interpreted for the clustering problem. Suppose that each column of X represents an attribute; more clusters then mean more attributes. There is some ambiguity as to whether the over-specified models should contain the clusters of the true model (i.e., have a smaller number of clusters than the true model) or split the clusters of the true model (i.e., have a larger number of clusters).

Consider the general linear model y = Xβ + e in equation (6), and let N and DK denote the total number of observations and the number of parameters in the model, respectively. One can approach the definition of the model space M_2 from the hypothesis testing point of view (10). In this approach, an over-specified model is one whose null hypothesis holds true under the assumptions on the true model. More formally, for the clustering problem we define

M_2 = {m ∈ M : T_m β = 0 holds under the true model m_0},   M_1 = M \ M_2,

where T_m is the matrix of constraints for the model m ∈ M. To establish the consistency of CLUSBIC, we need to prove the following statements:

a) lim_{N→∞} p_N(m) = 0 for m ∈ M_1;
b) lim_{N→∞} p_N(m) = 0 for m ∈ M_2 \ {m_0}.

We in fact prove a stronger version of the first statement (a); we show that

a*) lim_{N→∞} N^h p_N(m) = 0 for m ∈ M_1, and for any positive h.

The consistency of the BIC in model selection has been established in the literature under various assumptions; see [33,34,35]. Estimating the parameters under the null hypothesis in (10) leads to a constrained least squares problem. Consequently, the fitted values can be written as Qy, where the matrix Q is a projection matrix, which is symmetric and idempotent.
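A minimal sketch of this constrained least-squares step, obtained by projecting the unconstrained estimate onto the set {β : Tβ = 0}, is given below; the variable names are illustrative and T is assumed to have full row rank.

```python
# A sketch of constrained least squares under H_0: T beta = 0, obtained by projecting
# the unconstrained estimate; the fitted values X beta_c equal Q y for a symmetric,
# idempotent Q.
import numpy as np

def constrained_ls(y, X, T):
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_ols = XtX_inv @ X.T @ y
    correction = XtX_inv @ T.T @ np.linalg.solve(T @ XtX_inv @ T.T, T @ beta_ols)
    beta_c = beta_ols - correction                       # satisfies T beta_c = 0
    return beta_c, X @ beta_c                            # constrained estimate and fitted values
```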

In the following theorem, we show that CLUSBIC is a consistent clustering measure for any arbitrary distribution on model (4). To prove the consistency of CLUSBIC, we rely on a quadratic approximation to the logarithm of the likelihood suggested by [36]. It can be shown that the results are exact for Gaussian models; see Section B.

Theorem 2. The CLUSBIC is a consistent clustering criterion.

Proof. The proof is given in Section B of the Appendix.

2 Application

In this section, we apply our proposed method to the biological cell data obtained from the HeLa cell line, from which a boundary with the so-called common variance is generated for each cell.

Note that in the case of 2D shape modeling, one can employ basis functions such as Fourier bases, splines or wavelets and then follow the same procedure for shape clustering as in Section 1. For the sake of readability, the dendrogram in Figure 1 is reported for only ten random cells. Each cell is assigned a number corresponding to its order in the database.

In order to obtain the 3D Cartesian coordinates associated with each voxel on the surface, we proceed as follows. First, the boundary data for each image stack are extracted separately. Second, the 2D coordinates of all stacks are combined to create the 3D coordinates of the cell shapes depicted in Figure 2. We then take the image spacing information into account. For this dataset, the voxel spacing is (0.049 µm, 0.049 µm, 0.203 µm) [40]. It should be noted that the final extracted data must lie within a sphere of unit radius to be suitable for modeling using spherical harmonics.
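A small sketch of this preprocessing, assuming an array of voxel index triples as input, converts indices to physical coordinates with the stated spacing, centres the point cloud, and rescales it into the unit sphere; the centring step and variable names are assumptions for illustration.

```python
# A sketch of the preprocessing described above: voxel indices are scaled by the stated
# spacing, the point cloud is centred, and then rescaled to lie inside the unit sphere.
import numpy as np

spacing = np.array([0.049, 0.049, 0.203])                # micrometres per voxel (x, y, z)

def to_unit_sphere(voxels):
    xyz = voxels * spacing                               # physical coordinates
    xyz = xyz - xyz.mean(axis=0)                         # centre at the origin
    return xyz / np.linalg.norm(xyz, axis=1).max()       # largest radius becomes 1
```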

As image data involve noise, the least squares problem in equation (3) is ill-posed. A regularization approach can then be taken to estimate the model parameters; see [41] for details and further discussion. The result of clustering for D = 10 random cells is reported in Figure 3.
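One common way to stabilise such an ill-posed fit is ridge (Tikhonov) regularisation of the spherical-harmonic coefficients, sketched below; this is an illustrative choice, and [41] should be consulted for the regularisation actually used.

```python
# A sketch of a ridge (Tikhonov) regularised estimate of the spherical-harmonic
# coefficients; `lam` is the regularisation weight.
import numpy as np

def ridge_fit(Z, target, lam=1e-3):
    K = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(K), Z.T @ target)
```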

We repeat the same procedure considering all D = 50 cells. The clustering result is reported in Figure 4.

As we discussed in Section 1.2, the number of all possible groupings of the D = 50 cells is Σ_{k=1}^{50} S(50, k) ≈ 10^47, where S(50, k) is the Stirling number of the second kind; this sum is the Bell number B_50. In practice, it is not feasible to explore all possible groupings when D is relatively large. We ran the random Gibbs sampling for 8000 cycles as an example. The convergence behaviour of the sampler throughout the 8000 cycles is reported in Figure 5.
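The count above is the Bell number B_50, which can be computed with the standard recurrence; the short sketch below is an illustration and not part of the authors' code.

```python
# A sketch of counting all possible groupings of 50 shapes: the Bell number B_50,
# computed with the recurrence B_{n+1} = sum_k C(n, k) B_k.
from math import comb

def bell(D):
    B = [1]                                              # B_0 = 1
    for n in range(D):
        B.append(sum(comb(n, k) * B[k] for k in range(n + 1)))
    return B[D]

print(f"{float(bell(50)):.3e}")                          # approximately 1.857e+47
```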
Proof of Theorem 1. Consider the marginal posterior for a set of D shapes, p(y | σ²) = ∫ p(y | β, σ²) p(β | σ²) dβ. For simplicity, we use the notation p(y | β, σ²) instead of p(y | β, X, d, σ²). In order to obtain an approximation to this integral, we take a second-order Taylor expansion of the log-likelihood at β̂, the solution to the following constrained optimization problem,

max_β log{p(y | β, σ²)}    (12)
subject to Tβ = 0.    (13)

The solution β̂ can be found using the method of Lagrange multipliers. The Lagrangian function for this problem is

L(β, λ) = log{p(y | β, σ²)} + λ^T Tβ,

where λ is the vector of Lagrange multipliers. Expanding L(β, λ) about β̂ and λ̂, and using the fact that Tβ̂ = 0, the quadratic term is governed by the average observed Fisher information matrix, defined as Î(β̂) = −(1/N) ∂² log p(y | β, σ²)/∂β ∂β^T evaluated at β = β̂. Integrating the resulting Gaussian approximation over β produces the log N penalty term. The desired result then follows upon noticing that β̂ →p β [42] and that log det{Î(β̂)} is bounded in probability.

Proof of Theorem 2. The proof comprises two steps. First, the consistency of CLUSBIC is established for Gaussian models. We then extend the result to smooth non-Gaussian models, where by "smooth" we generally mean that the likelihood is a C³ function of the unknown parameter. The second step essentially follows from step one upon applying a quadratic approximation to the logarithm of the likelihood.

Step 1. Gaussian models: Suppose the error terms in (4) are distributed according to the Gaussian distribution. One can then show that CLUSBIC takes the following form, similar to the BIC (see [43]),

CLUSBIC(m) = N log σ̂²_m + dim(m) log N,    (18)

where the variance estimate for model m is σ̂²_m = (1/N)(y − Q(m)y)^T(y − Q(m)y) and Q(m) is the projection matrix of the constrained fit for model m. Suppose K_d and D are fixed; we follow the same setting as in [34], modified for constraints of the form Tβ = 0. Two assumptions are required for the proof, the first being that X^TX is positive definite. The consistency statements for Gaussian models are established through lemmas whose proofs are given in the Supplementary Material.

Having established the above lemmas, the following theorem holds for Gaussian models.

Theorem 3. The CLUSBIC is a consistent clustering measure for Gaussian models.

Proof. The proof is given in Supplementary Material.

Step 2. Non-Gaussian models: For any m ∈ M and the true value of the parameter, β_0, one can write the decomposition in (20). Applying the second-order Taylor expansion to log{p_m(y | β_0)} at the point β̂ of equation (17), the leading quadratic term involves I(β_0), the Fisher information matrix. On the other hand, the first-order conditions of the constrained optimization problem in equations (12) and (13) hold at β̂. Expanding the score function around the point β_0 and λ_0 = 0, and taking into account equations (21) and (22), one obtains the representation in (24), where I^{-1}(β_0) is the inverse of the Fisher information matrix. By equations (24) and (21), and the fact that I(β_0)Σ_β̂ is an idempotent matrix, one can conclude that the first component converges in distribution to χ, which has a chi-squared distribution with dim(b_0) degrees of freedom. As convergence in distribution implies boundedness in probability, component 1 in (20) is of order O_p(1).

As for component 2 in (20), its order follows from the central limit theorem.

Now, coming back to equation (19), for both m ∈ M_1 and m ∈ M_2 − {m_0} the remaining terms are bounded using Jensen's inequality; therefore, the required limits hold as N tends to infinity.

The following standard properties of projection matrices are used in the proofs.

(iii) If V is a sub-space of W, then rank(W) = rank(V) + rank(V⊥ ∩ W), where V⊥ ∩ W is the orthogonal complement of V with respect to W.
(v) Q_X = X(X^TX)^{-1}X^T is called the projection matrix onto C(X). The matrix Q_X is symmetric (Q_X = Q_X^T) and idempotent (Q_X Q_X = Q_X).

(vi) If Q_X^1 and Q_X^2 are projection matrices with C(Q_X^1) ⊂ C(Q_X^2), then Q_X^2 − Q_X^1 is also a projection matrix, onto the orthogonal complement of C(Q_X^1) with respect to C(Q_X^2).

(vii) Let A be a k × k matrix of constants and y ∼ N(µ, σ²I). If A is idempotent with rank p, then y^TAy/σ² ∼ χ²_p(λ), where λ = µ^TAµ/σ².

Proof of Lemma 1. In order to prove the lemma, we work with the difference between the CLUSBIC of the true model and that of an arbitrary model in M_1. We decompose this difference into several random variables and proceed using the properties of the multivariate Gaussian distribution and of quadratic forms from the same family. Let m ∈ M_1, and decompose the difference into terms involving X_N and Y_N, where:

1. e^T Q* Xβ ∼ N(0, σ²β^TX^TQ*²Xβ), which after standardization gives X_N ∼ N(0, 1).
2. Y_N is a quadratic form from the Gaussian distribution, whose distribution is expressed through χ²_b, the chi-squared distribution with b degrees of freedom.

The validity of equation (29) can easily be verified through the definition of CLUSBIC in equation (18). By the Fréchet inequality, P(∩_i A_i) ≥ Σ_i P(A_i) − (n − 1), where the A_i are events, the probability in equation (29) is bounded. The bound is then controlled using assumptions 1 and 2, together with the tail behaviour of the standard Gaussian distribution function, where φ(·) is the density function of the standard Gaussian distribution.

Given that Y_N is a quadratic form, the moment generating function of quadratic forms [45] yields the corresponding bound; equations (31), (32) and (33) together give the desired result for m ∈ M_1.

By definition, under H_0 we have Tβ = 0 for an arbitrary model, so µ ∈ C(X) = V and µ ⊥ C(T*); equivalently, µ ∈ C(T*)⊥ ∩ C(X) = V_0, which is the orthogonal complement of C(T*) with respect to C(X). Therefore, rank(T*) = qK. By the definition of the projection matrix in (v) and by property (vi), Q(m) − Q(m_0) is a projection matrix. Thus, using equation (35), χ has a chi-squared distribution with p degrees of freedom. For the second term in equation (34), the required bound follows from an inequality on the chi-squared distribution; see [46]. This completes the proof.

The risk, or expected loss, for the model is defined through the indicator function I_{m̂=m}. By the Cauchy-Schwarz inequality, the terms A_1 and A_2 can be bounded, and therefore A_1 and A_2 tend to 0 as N → ∞ by condition (a).