Hyperbolic geometry-based deep learning methods to produce population trees from genotype data

The production of population-level trees using the genomic data of individuals is a fundamental task in the field of population genetics. Typically, these trees are produced using standard methods like hierarchical clustering, neighbor joining, or maximum likelihood. However, such methods are non-parametric: they require all data to be present at the time of tree formation, and the addition of new data points necessitates the regeneration of the entire tree, a potentially expensive process. They also do not easily integrate with larger workflows. In this study, we aim to address these problems by introducing parametric deep learning methods for tree formation from genotype data. Our models specifically create continuous representations of population trees in hyperbolic space, which has previously proven highly effective in embedding hierarchically structured data. We present two different architectures - a multi-layer perceptron (MLP) and a variational autoencoder (VAE) - and we analyze their performance using a variety of metrics along with comparisons to established tree-building methods. Both models tested produce embedding spaces that reflect human evolutionary history. In addition, we demonstrate the generalizability of these models by verifying that addition of new samples to an existing tree occurs in a semantically meaningful manner. Finally, we use Dasgupta’s Cost to compare the quality of trees generated by our models to those produced by established methods. Despite the fact that the benchmark methods are directly fit on the evaluation data, our models are able to outperform some of these and achieve highly comparable performance overall.


Introduction
Recent advances in genetic sequencing technology, along with the wide availability of 16 genetic data, have resulted in the increased interest in predicting disease risk and other 17 traits directly from an individual's genomic data. These predictions are made possible 18 through studies like Genome-Wide Asssociation Studies (GWAS), which determine 19 correlations between genomic variants and such traits, and the calculation of quantities 20 like Polygenic Risk Scores (PRS), which use an individual's genotype to quantify disease 21 risk. However, these models do not always generalize across ancestries, often requiring 22 different models to be developed for different populations [1]. Accurate knowledge of an 23 individual's genetic ancestry is therefore frequently important in making the most 24 accurate and effective healthcare decisions possible. 25 Ancestry categories are hierarchical, with each individual having a label at each level 26 in the hierarchy. For example, if we were to separate humans by continental population, 27 a particular individual could be described as European. If we then proceeded to cluster 28 the European population, the same individual could then be described as Italian. If we 29 then identified sub-clusters within the Italian population, this person could now be 30 identified as Sicilian, and so on. Trees are a natural choice to represent such 31 hierarchically structured data, and they have been widely used in the past in several 32 related paradigms. For example, Li et al. utilized trees from genotype data to elucidate 33 migration patterns and study relationships between human populations from across the 34 world [2]. Duda and Zrzavý combined genetic data with linguistic data to further probe 35 these relationships [3]. Trees are widely applicable to non-human species as well, as 36 evidenced by Parker et al.'s study utilizing trees to elucidate the population histories of 37 canids [4]. On a broader scale, of course, trees have been utilized to catalog and 38 categorize the entirety of life on Earth into an all-encompassing "tree of life" [5]. 39 Finally, there has been much theoretical, mathematical, and computational work done 40 to devise and improve state-of-the-art tree production methods [6][7][8]. 41 Indeed, utilizing trees can provide several benefits in the study of ancestry. 42 Importantly, a tree can help assess how related two population groups or two 43 individuals are to each other. This fact can allow us to, for example, use population 44 trees to determine if GWAS or PRS results from one group are likely to generalize to 45 another. Similarly, we can use trees to determine which groups can be added to a 46 particular study and therefore increase its power while retaining accuracy. At present, 47 the majority of genome-trait association studies have been conducted in individuals of 48 European descent, largely neglecting individuals of other ancestries [1]. Population trees 49 can allow us to optimize study design for these groups and can be used for 50 pharmacogenomics, thus allowing for the delivery of vital healthcare information as 51 accurately and efficiently as possible [9].

52
Genetic ancestry-based trees also have a variety of applications outside healthcare.

53
Notably, their structure can provide vital information regarding the course of human 54 evolution, and they can also potentially explain various migration events that have 55 occurred through history [2,6]. Such information may prove valuable for historical or 56 demographic studies. Furthermore, trees can be produced for crops, livestock, and other 57 species, potentially aiding in decisions regarding agriculture and ecological health [4,10]. 58 To create a tree from genomic data, it is necessary to leverage certain genetic 59 signatures to calculate measures of similarity between the entities that will comprise the 60 tree. A common choice is to use Single Nucleotide Polymorphisms (SNPs), or single 61 genomic positions that are known to vary across human populations due to the 62 evolutionary history of the human species [2,11]. In general, it is expected that a pair of 63 closely related individuals will share more SNPs than a pair of more distantly related 64 ones. Typically, a tree-building algorithm like hierarchical clustering is employed 65 together with a similarity metric related to the number of SNPs shared in common 66 between each pair of individuals [12]. trees [14][15][16]. We will therefore describe the properties of hyperbolic spaces -in 91 particular, those following the Poincaré disc model -which allow for this hierarchical 92 structure to be captured, along with those that are relevant for the understanding of our 93 models.

94
In general, hyperbolic geometry is defined as a non-Euclidean geometry with a 95 constant negative curvature [13,17]. Various types of hyperbolic spaces exist, and one 96 commonly used space is the Poincaré ball model, or its two-dimensional equivalent, the 97 Poincaré disk. In a Poincaré disk (as with other hyperbolic spaces), the distance 98 between two points z i and z j is calculated differently than in the standard Euclidean 99 formulation [13]. Specifically, the distance formula is given by: A byproduct of this distance metric is that surface area grows in an exponential 101 manner as we reach the edge of the disk. Since the number of leaves in a tree grows 102 exponentially with respect to the tree's height, a Poincaré disk (or other hyperbolic 103 space) is therefore ideal for learning accurate tree representations without the need for 104 an excessively large embedding space. In fact, one can think of hyperbolic spaces as 105 continuous analogues of trees [14]. Previous studies have demonstrated methods to 106 embed trees into hyperbolic spaces with minimal error [16]. Euclidean spaces do not 107 possess this property of exponential growth, meaning modeling larger trees requires a 108 higher dimensionality, which can cause a variety of problems with training, including 109 overfitting and memory concerns [14].

110
It is also important to consider the nature of the curves, or geodesics, which 111 represent the shortest paths between two points in the Poincaré disk. In this 112 formulation, geodesics are either lines through the origin or circular arcs that intersect 113 the two points and are orthogonal to the edges of the disk [18].

114
Given our knowledge of geodesics, we must introduce one final important property: 115 the exponential map. Assume that we start with a point z, which is on a unit geodesic γ 116 such that γ(0) = z. There will exist only one possible γ such that the tangent vector at 117 z (i.e. γ ′ (0)) is a certain vector v. We then define the exponential map of v as 118 exp z (v) = γ(1). A closed-form solution exists for this transformation, and it is used 119 frequently in the various architectures and probability distributions utilized in this 120 study [17].

121
The HypHC Model 122 Central to the models employed in this study is Hyperbolic Hierarchical Clustering, or 123 HypHC, which was introduced by Chami et al. in 2020 [13]. This method utilizes a 124 differentiable objective function to embed data points in hyperbolic space such that 125 distances between embeddings reflect closeness in a hierarchical clustering tree.

126
Hyperbolic geometry was chosen for this task because it allows for more natural and 127 effective representations of trees than is possible in the standard Euclidean space [13]. 128 Specifically, the objective function of HypHC is a continuous relaxation in hyperbolic 129 space of Dasgupta's Cost, a common metric used to evaluate the quality of hierarchical 130 cluster assignments.

131
Given a similarity matrix w representing the data, and a tree T generated with 132 hierarchical clustering, Dasgupta's cost can be written as follows: where w ij is the similarity between sample i and j, w ijk|T = w ij if the tree first splits k 134 March 30, 2022 4/16 from i and j, w ik if the tree first splits j from i and k, and w jk if the tree first splits i 135 from j and k [13].

136
Dasgupta's cost therefore calculates whether more similar nodes are located closer 137 together in the tree than less related ones. Alternatively, it gauges if the Lowest

138
Common Ancestors (LCA) -the points at which two nodes split from each other -of 139 more similar nodes are lower in the tree, while the LCAs of less similar nodes are higher 140 in the tree. The term w ijk|T can therefore be interpreted as the similarity of the two 141 nodes in {i, j, k} with the LCA that is farthest from the root of the tree [13].

142
Although this objective function is discrete, a continuous analog can be developed in 143 hyperbolic space. To do so, we must define the continuous equivalent of the LCA. Let z i 144 and z j be the hyperbolic embeddings of two data points i and j, and let q ij be the 145 curve representing the geodesic (shortest path distance) between z i and z j . We first 146 consider that in a conventional tree, the LCA between two nodes will be the point on 147 the shortest path between the nodes that is closest to the root. Similarly, if we assume 148 our tree is rooted at the origin in hyperbolic space, the hyperbolic LCA between z i and 149 z j will be the point on the geodesic that is closest to the origin [13].

150
If we define d LCA (z i , z j ) to be the distance from the origin to the hyperbolic LCA of 151 z i and z j , we can define the continuous analog to Dasgupta's cost as follows: The term w ijk|HypHC is then defined as follows: where σ T is the softmax function with a learned temperature scaling parameter 154 (specifically, the logits are scaled by 1 T before the softmax is applied) [13]. Therefore, 155 the softmax output acts as a smooth relaxation of an indicator function.

156
This objective can then be optimized by gradient descent in a similar manner to a 157 conventional neural network. Even though all data points are considered at once, we 158 can divide the triplets i, j, k into batches for the sake of computational efficiency and 159 memory usage.

160
Finally, if desired, the HypHC paper provides algorithms to convert the continuous 161 space into a discrete tree. The most theoretically rigorous of these can be 162 computationally expensive under certain situations, but a greedy algorithm is provided 163 which runs far faster and has been validated to achieve comparable performance. In this 164 algorithm, points first are normalized to have the same distance from the origin in the 165 Poincaré Disk. Next, the points are sorted according to their angles in the disk, which 166 are analogous to their positions when normalized. We then calculate the two largest 167 gaps in this sorted list (ie. the two largest differences between consecutive angles), and 168 we split the dataset into points that have angles between these two gaps and points that 169 do not. Now that the first division has been determined, we recursively divide each split 170 at the largest gap until the full tree structure is reached [13]. In this manner, we have a 171 simple, efficient algorithm to produce discrete trees from HypHC embedding spaces.

172
Novel Parametric Models 173 We will now discuss the parametric models developed as part of this study. each individual in a batch, embeddings will be obtained using the fully connected layers. 181 Next, the HypHC loss will be calculated using only these embeddings, and a gradient 182 update will then be performed.

183
While the traditional HypHC model directly optimizes each embedding z i given an 184 input x i , our proposed Hyperbolic MLP model makes uses of a function f (represented 185 by the neural network) to translate between input and output, specifically by 186 calculating f (x i ) = z i . Therefore, we learn a parametric model, f , that predicts z i given 187 an input x i .

188
When the model has been trained, embeddings for new data points can be obtained 189 by applying the fully connected layers to the input genotype. In this manner, the 190 embedding space is still optimized using a hierarchical clustering-based objective, but it 191 also retains the ability for new data points to be added in a semantically meaningful 192 manner.

194
The model consists of MLP-based encoders and decoders, with transformations applied 195 to convert data into coordinates in hyperbolic space, following the framework provided 196 in [17]. Specifically, after applying linear layers, followed by ReLU non-linearities, to the 197 input data to obtain predicted means and standard deviations for the embedding 198 coordinates, the predicted means are applied along the exponential map (which does not 199 occur in the MLP model). Latent coordinates are then sampled. The decoder consists of 200 a logarithmic map -to undo the exponential map -followed by a series of linear layers 201 to reproduce the input. In all cases, the embedding space used was two-dimensional.

202
The optimization objective of the model consists of three weighted components. The 203 first two components result from the need to maximize the evidence lower bound 204 (ELBO), as in standard VAEs. When extended to Riemannian spaces, this is equivalent 205 to minimizing the expression [17]: The term −log(q(x i |z)) represents the likelihood of the reconstructed data, and it is 207 calculated in practice using the binary cross-entropy between the reconstructions and 208 the original data. The term (log(q(z|x i )) − log(p(z))) represents the similarity between 209 the predicted and prior distributions for latent space coordinates, and it is analogous to 210 the KL divergence term in the standard VAE objective. For the latent prior, in lieu of 211 using a standard normal distribution, we use a Wrapped Normal distribution on a  The final per-batch objective for the model can therefore be written as: March 30, 2022 6/16 where λ 1 , λ 2 , and λ 3 are manually specified weights for each of the loss terms and n is 222 the batch size. In this architecture, the reconstruction loss is intended as an additional 223 objective to ensure the latent embedding space most accurately reflects the relationships 224 present in the raw data. We surmised that the combination of HypHC loss and 225 reconstruction loss could possibly prove more effective than simply using the HypHC  [13,17]. Notably, 248 to ensure effective visualization, only two-dimensional embedding spaces were used. 249 We conducted a non-exhaustive hyperparameter and architecture search for each The MLP model was trained with a learning rate of 10 −3 and a temperature of 10 −4 , 257 and the VAE model was trained with a learning rate of 10 −2 and a temperature of 10 −6 . 258 Due to the nature of the HypHC loss calculation, the batch size is more consequential 259 than in most machine learning models. In both cases, a batch size of 64 was used to 260 balance runtime with a sufficient sample size for HypHC loss. In all cases, the 261 Riemannian Adam optimizer was used [13,22]. Finally, all models were trained using 262 early stopping with a patience of 5.

263
After inspecting the embedding spaces that were learnt, we compared the results of 264 these models to those from non-parametric models applied to the test dataset -namely, 265 HypHC and various forms of hierarchical clustering. In all cases, the similarity between 266 two data points was defined as the fraction of SNPs for which the same nucleotide was 267 present.  both plots -namely, the fact that the African cluster is the most separated from the 283 others -therefore perfectly represents established scholarship [23,24]. 284 Similarly, the order of continental populations in the plot reflects the history of 285 human migration. When early human populations left Africa, they likely did so via the 286 modern near East, which falls under the "West Asia" category in our dataset. From 287 there, migrations are believed to have occurred to the east into Southern Asia, and 288 eventually to the west into Europe. Within Asia, migrations occurred to populate the 289 entire continent, and along the Indian Ocean coast into Oceania [25,26]. Finally, 290 humans crossed the Bering Strait land bridge from East Asia into the Americas. Due to 291 West Asia being the immediate destination from Africa, and a region that has remained 292 in contact with it, it is reasonable that West Asians are the closest groups to Africans in 293 the plot. It is also fitting that Europeans and South Asians are closest to West Asians, 294 due again both to geography and the migration patterns mentioned above. Finally, due 295 to the patterns of migration within and out of Asia, we would expect South Asians, 296 East Asians, Oceanians, and Americans to be present in close proximity to each other, 297 which is indeed the case.

298
The model can also deliver insights regarding the unique genetic ancestries of  [27,28]. In conclusion, both the MLP and VAE models produce embedding 307 spaces that largely reflect established knowledge regarding the genetic relationships 308 between human ancestries.

310
As stated, the chief advantage of parametric tree production methods is that new 311 samples can be added to an existing tree in a semantically meaningful manner. We These results therefore illustrate that the models presented in this study are able to 320 effectively generalize beyond the data they were trained on. Specifically, one can easily 321 incorporate new samples into existing tree representations, a useful task for a variety of 322 experimental settings.

324
To explore comparisons with non-parametric methods, we performed the following 325 procedure for the MLP and VAE models. First, we used a trained model to predict on 326 all samples from the test dataset. These embeddings represent a continuous tree 327 consisting of all test datapoints, and we then converted them into a discrete tree using 328 the decoding method suggested in the HypHC paper. We then compared this tree to a 329 variety of non-parametric methods, with Dasgupta's cost as the metric of choice. As 330 stated, these non-parametric methods include HypHC and various forms of hierarchical 331 clustering. These variations of hierarchical clustering differ in the manner in which 332 distances between clusters are calculated. If we assume we are calculating the distances 333 between clusters u and v, then these methods can be described as follows: where cluster u was created 338 from clusters s and t.

339
These methods represent all hierarchical clustering variations found in the Scipy Python 340 package able to utilize a Hamming Distance metric [29]. HypHC was implemented by 341 adapting code found in the HypHC package to fit the specific use case [13]. For the 342 HypHC model, we used a learning rate of 10 −2 , a temperature of 10 −3 , and early 343 stopping with a patience of 5.

344
Note that since the output of the VAE model is non-deterministic, the value 345 presented is the result of aggregating the Dasgupta's Cost values produced by 346 predicting on the test set ten different times. Table 1 shows the Dasgupta's Cost values 347 for all methods on the test set.

348
As shown in Table 1, both parametric methods achieved highly competitive 349 performance as compared to the non-parametric methods, even outperforming multiple 350 of these benchmarks. It is instructive to once again note the key difference between the 351 parametric and non-parametric methods with respect to this experiment -while the test 352 set was directly used to train or fit all non-parametric methods used in this experiment, 353 it represents completely unseen data for the parametric methods. The fact that the 354 VAE and MLP models performed similarly to the state-of-the-art methods despite this 355 disadvantage reflects their generalizability and ability to create effective tree 356 representations from previously unseen sequences.   Table 1. Comparison of Dasgupta's Cost values on the test dataset for all parametric and non-parametric methods tested. Since the output of the VAE model is non-deterministic, the value presented is the result of predicting on the test set ten different times.
creation. If new data is obtained subsequently, it cannot easily be added to the 365 resultant tree; instead, the tree must be remade. In addition, as the method is discrete, 366 it cannot easily be incorporated into larger workflows. An effective parametric analog 367 for hierarchical clustering that can successfully represent genotype data is therefore 368 desirable. Hyperbolic geometry is generally seen as a natural choice for embedding 369 hierarchically structured data, so we focused in this area -and in particular, on 370 variations of the HypHC method -when designing methodologies for this study.

371
Overall, the performance of our two parametric methods -VAE and MLP-based 372 models -proved very promising when evaluated from a variety of different perspectives. 373 We first verified that our models' embedding spaces grouped individuals by their 374 continental population, and we observed that the location and orientation of these 375 clusters reflected human evolutionary history. We also found cases in which 376 subpopulations with diverse influences were represented accordingly. We next tested  The properties of these models can allow for a variety of useful and interesting 384 applications. Of course, one of the most obvious pertains to adding a new sample to a 385 tree with a large number of nodes. Instead of performing the computationally expensive 386 process of remaking the tree each time a new node is added, we can simply add points 387 to the embedding space as we desire. A discrete tree can then be decoded from the 388 embedding space. In this manner, significant time is saved when producing updated 389 trees. Additionally, the frameworks presented makes it far easier to include hierarchical 390 clustering as part of larger machine learning workflows. Embeddings can be calculated 391 at some point in a pipeline, fed into the next step, and the entire model can eventually 392 be trained in an end-to-end fashion. One can imagine, for example, using hyperbolic 393 embeddings as inputs to a classifier for whether individuals are at risk for a particular 394 disease. Ultimately, a continuous and parametric model can greatly expand the realm of 395 tasks over which hierarchical clustering can contribute to a solution. Our models can 396 also potentially aid in preserving privacy of genetic data. If researchers desire to 397 produce a tree from certain individuals, they would normally need to access the raw 398 genotype data. Now, another option could be to receive the model embeddings instead, 399 after which the tree can be easily decoded. In this manner, the raw data will remain 400 hidden, thus protecting the privacy of the subjects. Furthermore, researchers without 401 permission to view the data could still perform analyses. Of course, such models would 402 likely require modifications, along with vigorous testing, to ensure they are truly 403 privacy-preserving. However, the models presented in this study represent a promising 404 approach. Finally, the dual methods of representation possible -a continuous 405 embedding space or a discrete tree -can allow for more comprehensive visualization of a 406 population. A tree is clearly effective at denoting the position of each node in the 407 overall hierarchy. However, relative distances between large groups or overall population 408 trends may be more intuitively depicted in a continuous plot.

409
It is also informative to compare the two parametric models. The MLP and VAE 410 achieved similar performance to each other, indicating that neither objective function is 411 intrinsically superior to the other. This is an interesting finding, as the MLP objective 412 is explicitly optimized for hierarchical clustering, while the VAE objective is not. It is 413 possible that for this data type simply embedding the data in hyperbolic space -which, 414 as stated previously, is highly effective for representing hierarchically structured data -is 415 enough to achieve strong hierarchical clustering performance. Both objective functions 416 can certainly produce semantically meaningful embeddings. Further elucidating the 417 specific behavior imparted by each objective function is an interesting future direction, 418 as is probing a larger number of architectures and objectives to determine the optimal 419 design choices. Another area of exploration pertains to fully studying the capabilities of 420 the VAE model. VAEs have a variety of functions that were not explicitly tested and further, one would likely have to obtain more samples for each population or generate 439 them synthetically. More data would also allow us to train targeted models on specific 440 populations rather than individuals from around the world, thus allowing for greater 441 regional precision and resolution. One could also experiment with the metrics used for 442 hierarchical clustering. In all experiments performed in this study, the metric used was 443 directly derived from the genotype data, which was also the input to the model. 444 However, the similarity metric could also conceivably be used to add new information. 445 For example, one could craft a similarity metric around whether two samples are of the 446 same ethnicity, thus introducing supervised learning based on a sample's population 447 labels. One could then take this idea a step further and introduce arbitrary similarity 448 metrics -for example, defining the metric based on the primary language spoken by 449 each individual. If a model can cluster according to this metric, while still grouping by 450 genotype within each cluster, it would prove an interesting tool.

451
In summary, this study introduces multiple parametric methods for tree formation 452 from genotype data, and it also puts forward several future directions for portable 453 private models, supervised approaches, and synthetic data generation based on these 454 methods.