## Abstract

Characterizing the genetic substructure of large cohorts has become increasingly important as genetic association and prediction studies are extended to massive, increasingly diverse biobanks. ADMIXTURE and STRUCTURE are widely used unsupervised clustering algorithms for characterizing such ancestral genetic structure. These methods decompose individual genomes into fractional cluster assignments, with each cluster representing a vector of DNA marker frequencies. The assignments and clusters provide an interpretable representation for geneticists to describe population substructure at the sample level. However, with the rapidly increasing size of population biobanks and the growing number of variants genotyped (or sequenced) per sample, such traditional methods become computationally intractable. Furthermore, multiple runs with different hyperparameters are required to properly depict the population clustering using these traditional methods, which increases the computational burden and can lead to days of compute. In this work we present Neural ADMIXTURE, a neural network autoencoder that follows the same modeling assumptions as ADMIXTURE, providing similar (or better) clustering while reducing the compute time by orders of magnitude. In addition, this network can include multiple outputs, providing results equivalent to running the original ADMIXTURE algorithm many times with different numbers of clusters. These models can also be stored, allowing later cluster assignment to be performed in linear computational time.

## Introduction

The rapid growth in numbers of sequenced human genomes and the proliferation of population-scale biobanks have enabled the creation of increasingly accurate models to predict traits and disease risk based on an individual’s genome. However, different predictive models can be required depending on an individual’s genetic ancestry, and this necessitates accurately characterizing genetic ancestry composition at the individual level [1]. Such characterization is also an essential part of most modern population genetic studies and national biobanking projects. However, many existing algorithms for population genetic analyses struggle to keep up with next generation sequencing data sets, where both the number of samples and the number of sequenced positions along the genome are much greater. This has created an intense need for more computationally efficient and accessible methods for detailed large-scale structure analyses. To date, the vast majority of association studies, such as genome-wide association studies (GWAS), which look for correlations between genomic sequences and phenotypes, and of predictive models, such as polygenic risk scores (PRS), which indicate genetic predisposition to phenotypes, rely on samples from individuals of European descent, thus excluding most of the world’s population and creating a new divide in healthcare [2]. The inclusion of fast, interpretable algorithms that characterize the ancestry makeup of genetic sequences is an important part of facilitating the creation of diverse association studies and expanding the reach of personalized genomic medicine.

A common approach for resolving the population structure within a genetic dataset is to describe each sample by a set of proportional cluster assignments obtained through an unsupervised clustering algorithm. Such methods take as input each individual’s sequence of single nucleotide polymorphisms (SNPs), that is, those positions along the genome known to vary between individuals. There are over 10 million known SNPs in the human genome, with most of the remainder of the human DNA sequence shared by all humans. Such positions are commonly encoded with a binary value, where 0 is used to encode the most common (or reference) variant at that SNP position on the genome, and 1 is used to encode the minority (sometimes called “alternative”) variant. This binary encoding works because the vast majority of such variable positions have only one alternative to the common variant (that is, they are biallelic). The frequency distribution of these variants, and the correlations (linkage disequilibrium) between neighboring SNPs, will vary between populations due to the different founder events, migration histories, and genetic drift experienced by those populations. These differences can lead to predictive models trained on one population failing when faced with sequences from an unseen population. In addition, characterizing differing variant frequencies between populations can provide valuable historical and demographic information such as divergence times, migration events, and historical census size [3].

In this paper we present an autoencoder that implements one of the most widely used clustering methods for population genetics applications: ADMIXTURE [4, 5]. ADMIXTURE was developed as a more computationally efficient solution than STRUCTURE, and we now take this pursuit of efficiency to the next generation. Our proposed method, *Neural ADMIXTURE*, follows the same modeling assumptions as ADMIXTURE but re-frames the task as a neural network-based autoencoder, providing much faster computational times, both on GPU and CPU, and higher quality cluster assignments. Additionally, we introduce *Multi-head Neural ADMIXTURE*, which combines multiple decoders to obtain clustering results equivalent to running the original ADMIXTURE repeatedly with different priors for the number of clusters. Both methods also include a supervised version that performs regular classification given ground truth training labels. The proposed method is fully compatible with the original ADMIXTURE framework, allowing use of ADMIXTURE results as initialization for Neural ADMIXTURE parameters, and vice versa.

## Related work

Model-based clustering methods such as FRAPPE [6], STRUCTURE [7] and ADMIXTURE [4, 5] are the most commonly used unsupervised clustering techniques for analyzing the population structure of genomic sequences. These methods, which resemble probabilistic versions of non-negative matrix factorization (NMF), decompose each input sequence into a set of cluster assignments and a set of centroids (average SNP variant sequences) for each cluster. Specifically, the cluster assignments specify what proportion of each ancestry cluster an individual has, while the centroids indicate the SNP variant frequencies at each genetic position for each cluster. These methods allow the user to visualize ancestry composition within genetic datasets, compute how genetically distant different population groups are, and compute statistics that allow the dating of migration history. STRUCTURE [7] makes use of Bayesian models, using a Dirichlet prior for the cluster assignments and the cluster centers, trained with Markov chain Monte Carlo (MCMC), making it highly computationally intensive. FRAPPE [6] and ADMIXTURE [4, 5] make use of maximum-likelihood point estimates, obtaining predictions of a quality comparable to STRUCTURE, but with much faster computational times. FRAPPE makes use of the Expectation-Maximization (EM) algorithm while ADMIXTURE, as explained in depth in the following section, makes use of a faster block relaxation quasi-Newton optimization technique. However, each method still requires many hours of compute time and is not well suited for modern biobank datasets with tens or hundreds of thousands of samples and millions of SNP feature dimensions.

Several autoencoder architectures similar to our work have been presented. Some examples include the Dirichlet Variational Autoencoder [8], Deep Archetypal Analysis (DeepAA) [9], and Genotype Convolutional Autoencoder (GCAE) [10]. Such networks encode each sample as a point within a convex hull, or as a set of proportions and probabilities. The Dirichlet VAE replaces the commonly used Gaussian prior in the bottleneck with a Dirichlet prior. DeepAA adds constraints to enforce that the bottleneck representation is non-negative and sums to one. GCAE is a convolutional neural network with a softmax activation in the bottleneck that provides clustering results similar to those of ADMIXTURE, while being more computationally intensive. Such methods are composed of non-linear encoders and decoders, which denies them the interpretability that our method provides. In fact, our proposed method can be seen as a non-variational version of the Dirichlet VAE with a linear decoder (without bias) and additional constraints on the dynamic range of its weights.

Neural Network-based supervised methods for ancestry classification have also been introduced. Some examples include LAI-Net [11], which provides a high-resolution ancestry estimate along a chromosome sequence, Diet Networks [12], which proposes a genome classifier with different regularization techniques to deal with the high dimensionality of genomic data, and Locator [13], which treats ancestry inference as a geographical prediction problem. While these methods can accurately classify genomes once trained, the ground truth labels used to train these supervised methods are typically hand-crafted reflections of concepts such as ethnicity, or self-reported race of the individual samples. These human-informed classes do not always reflect the full spectrum, or significant clusters, of genetically relevant substructure within and between populations. Therefore, in many genetic applications, it is preferred to use unsupervised methods that do not rely on the complexity of socially-constructed labeling schemes.

## ADMIXTURE

In this work, we follow the notation presented in Alexander et al. [5]. Note that each individual human has two copies of each chromosome (one paternal and one maternal). Therefore, for a given individual at each genomic position we have the possibility of four different combinations of biallelic SNPs (0/0, 0/1, 1/0, 1/1). It is common practice to sum both maternal and paternal sequences, obtaining a count sequence $n_{ij}$. In this scenario, an individual $i$ has $n_{ij} \in \{0, 1, 2\}$ copies of the minority variant at SNP $j$. ADMIXTURE models each individual’s sequence, given a fixed number of clusters (population groups) $K$, as $n_{ij} \sim \mathrm{Bin}(2, p_{ij})$, where $p_{ij} = \sum_{k} q_{ik} f_{kj}$, with $q_{ik}$ denoting the fraction of population $k$ assigned to $i$, and $f_{kj}$ denoting the frequency of the variant with value “1” at SNP $j$ in population $k$. ADMIXTURE applies block relaxation to try to find the parameters $Q$ and $F$ that minimize the following negative log-likelihood function:

$$\min_{Q, F} \; \mathcal{L}_C(Q, F) = -\sum_{i,j} \left[ n_{ij} \log p_{ij} + (2 - n_{ij}) \log (1 - p_{ij}) \right] \quad \text{s.t.} \quad 0 \le f_{kj} \le 1, \quad q_{ik} \ge 0, \quad \sum_{k} q_{ik} = 1, \tag{1}$$

where $Q = (q_{ik})$ and $F = (f_{kj})$. ADMIXTURE also allows an expectation-maximization (EM) based optimization, identical to FRAPPE [6], but this approach is slower than the block relaxation approach [4]. The value of $K$ is typically chosen by using a cross-validation procedure [5], necessitating runs across a range of values.
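To make the model concrete, the following minimal sketch sums hypothetical maternal and paternal haplotypes into counts $n_{ij}$, forms $p_{ij} = \sum_k q_{ik} f_{kj}$, and evaluates the binomial negative log-likelihood up to an additive constant. The function name `admixture_nll`, the `eps` guard and the toy values are assumptions made for illustration and are not taken from the ADMIXTURE software.

```python
import math

def admixture_nll(counts, Q, F, eps=1e-9):
    """Binomial negative log-likelihood of genotype counts, up to an additive constant."""
    total = 0.0
    for n_i, q_i in zip(counts, Q):
        for j, n_ij in enumerate(n_i):
            # p_ij = sum_k q_ik * f_kj
            p_ij = sum(q_ik * f_k[j] for q_ik, f_k in zip(q_i, F))
            p_ij = min(max(p_ij, eps), 1.0 - eps)  # guard the logarithms
            total -= n_ij * math.log(p_ij) + (2 - n_ij) * math.log(1.0 - p_ij)
    return total

# Toy data (hypothetical): 2 individuals, 3 SNPs, K = 2 populations.
maternal = [[0, 1, 1], [1, 0, 0]]
paternal = [[0, 1, 0], [1, 1, 0]]
counts = [[m + p for m, p in zip(ms, ps)] for ms, ps in zip(maternal, paternal)]

Q = [[0.9, 0.1], [0.2, 0.8]]            # each row is non-negative and sums to one
F = [[0.1, 0.9, 0.5], [0.9, 0.4, 0.1]]  # each entry lies in [0, 1]
nll = admixture_nll(counts, Q, F)
```

Lower values of `nll` indicate that the pair $(Q, F)$ explains the observed counts better; an optimizer would search over $Q$ and $F$ subject to the constraints noted in the comments.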

ADMIXTURE allows, among others, two valuable optimization alternatives, which are the *projective* analysis and the *supervised* training. In the projective analysis, *F* is initialized to a previously estimated matrix, and only *Q* is optimized. The initialization of *F* may come from previously fit ADMIXTURE models for which the learnt population structure is considered robust. This is especially useful in scenarios where ADMIXTURE is fit in a large dataset and new unseen samples need to be processed. The projective analysis allows estimation of the cluster assignments without the need of fitting the complete model with all the dataset samples. On the other hand, the supervised version requires that some population ancestries are known, so some rows of *Q* are initialized and fixed to these ancestries, while the rest of the rows of *Q* and *F* are optimized normally.
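As an illustration of the projective analysis, the sketch below keeps $F$ fixed and estimates the ancestry fractions of a single unseen individual. It uses FRAPPE-style EM updates rather than ADMIXTURE's block relaxation, so it should be read as a stand-in for the idea and not as the ADMIXTURE implementation; `project_q`, the iteration count and the toy frequencies are assumptions.

```python
def project_q(counts_i, F, iters=200, eps=1e-9):
    """Estimate one individual's ancestry fractions q with F held fixed (EM updates)."""
    K = len(F)
    q = [1.0 / K] * K  # start from uniform fractions
    for _ in range(iters):
        new_q = [0.0] * K
        for j, n in enumerate(counts_i):
            p = sum(q[k] * F[k][j] for k in range(K))
            p = min(max(p, eps), 1.0 - eps)
            for k in range(K):
                # expected number of the two allele copies at SNP j drawn from population k
                new_q[k] += n * q[k] * F[k][j] / p
                new_q[k] += (2 - n) * q[k] * (1.0 - F[k][j]) / (1.0 - p)
        total = sum(new_q)  # equals 2 * number of SNPs when p is not clipped
        q = [v / total for v in new_q]
    return q

# Hypothetical frequencies for K = 2 populations over 4 SNPs, previously estimated.
F = [[0.05, 0.9, 0.1, 0.8], [0.9, 0.1, 0.85, 0.2]]
new_sample = [0, 2, 0, 2]  # genotype counts of an unseen individual
q = project_q(new_sample, F)
```

Because only the $K$ fractions of each new individual are updated, the cost per sample does not depend on the size of the dataset used to estimate $F$.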

The block relaxation optimization in ADMIXTURE runs much faster than its main competitors, namely FRAPPE [6] and STRUCTURE [7]. Moreover, it can be run in multi-threading mode, greatly reducing the execution time. However, this speedup is still insufficient when dealing with either a large number of samples or a large number of SNPs. A neural network version of the algorithm, by contrast, benefits from massive speedups during training (*e.g.* minibatch training, GPU usage), as well as at inference time with a well-chosen architecture.

## Neural ADMIXTURE

### Network architecture

ADMIXTURE can be formulated as a non-negative matrix factorization problem. Let $X$ denote the training samples, where the features are the alternate allele counts per SNP. Then, $X \approx QF$, where $Q$ are the assignments, $F$ are the alternate allele frequencies per SNP and population, and the negative log-likelihood in Equation (1) is the distance metric between $X$ and $QF$. This can be naturally translated into the neural network world as a vanilla autoencoder, with $Q = f_{\theta}(X)$ being the bottleneck estimated by the encoder $f_{\theta}$ and $F$ being the decoder weights themselves. The encoder-decoder architecture is depicted in Figure 1. The fact that $Q$ is estimated at every forward pass and not learnt as a whole for the training data means that, at inference time, we will not have to run the optimization process again, as in ADMIXTURE’s projective analysis, but instead perform a simple forward pass.

Note that the restrictions in the optimization problem (Equation (1)) impose restrictions on the architecture. Those relating to $Q$ ($\sum_{k} q_{ik} = 1$ and $q_{ik} \ge 0$) can be enforced by applying a softmax activation at the encoder output, making the bottleneck equivalent to the population estimates. Furthermore, while the decoder restriction ($0 \le f_{kj} \le 1$) could also be enforced in the architecture itself (*e.g.* applying the sigmoid function to the decoder weights), we have found that it suffices to simply project the weights of the decoder to the interval [0, 1] after every optimization step, which is one of the most common forms of projected gradient descent [14].

We note that, critically, the decoder must be linear and cannot be followed by a non-linearity, as this would break the interpretability of the *F* matrix, and the equivalence between the decoder weights and the cluster centroids (frequencies per SNP and per cluster) would be lost. On the other hand, the encoder architecture is free from constraints, and it may be composed of several neural layers with their corresponding non-linearities, if deemed appropriate. In fact, the proposed Neural ADMIXTURE includes a 512-dimensional non-linear intermediate layer with a ReLU activation before the bottleneck, as well as a batch normalization layer that acts directly on the input.
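The forward pass and the projection step described above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions and not the authors' PyTorch implementation: batch normalization is omitted, the hidden layer has 4 units instead of 512, the weights are random, and all function names are hypothetical.

```python
import math
import random

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def linear(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def forward(x, W1, b1, W2, b2, F):
    h = [max(0.0, a) for a in linear(W1, b1, x)]  # hidden layer with ReLU
    q = softmax(linear(W2, b2, h))                # bottleneck: q >= 0 and sums to one
    # linear decoder without bias: recon_j = sum_k q_k * f_kj
    recon = [sum(q[k] * F[k][j] for k in range(len(F))) for j in range(len(F[0]))]
    return q, recon

def project_decoder(F):
    """Clamp decoder weights to [0, 1], as would be done after each optimizer step."""
    return [[min(1.0, max(0.0, f)) for f in row] for row in F]

rng = random.Random(0)
J, hidden, K = 6, 4, 3  # toy sizes
W1 = [[rng.gauss(0, 0.5) for _ in range(J)] for _ in range(hidden)]
b1 = [0.0] * hidden
W2 = [[rng.gauss(0, 0.5) for _ in range(hidden)] for _ in range(K)]
b2 = [0.0] * K
F = project_decoder([[rng.uniform(-0.2, 1.2) for _ in range(J)] for _ in range(K)])

x = [0.0, 0.5, 1.0, 1.0, 0.0, 0.5]  # genotype counts divided by two
q, recon = forward(x, W1, b1, W2, b2, F)
```

Because `q` is a convex combination and every row of `F` lies in [0, 1], each reconstructed value is itself a valid frequency, which is what keeps the decoder rows readable as per-cluster allele frequencies.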

The classical ADMIXTURE model does not exactly reconstruct the input data as a regular autoencoder would do, as the input SNP genotype sequences, $n_{ij} \in \{0, 1, 2\}$, and the reconstructions, $p_{ij} \in [0, 1]$, do not have matching ranges. This can easily be remedied by dividing the genotype counts by two, so that now the input data are $n_{ij}/2 \in \{0, 0.5, 1\}$. Moreover, instead of minimizing Equation (1), we propose minimizing the binary cross-entropy, with a penalty term on the Frobenius norm of the first non-linear layer weights, $W_{1}$ (Equation (2)).

This regularization term avoids hard assignments in the bottleneck, which helps during the training process and reduces overfitting. In Equation (3) we show that the proposed optimization problem and the classical ADMIXTURE one are equivalent (excluding the regularization term) by using the definitions of the parameters as well as Equations (1) and (2):
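Spelled out, the argument is short. Writing $\tilde{n}_{ij} = n_{ij}/2$ for the halved counts and $\mathcal{L}_{\mathrm{BCE}}$ for the binary cross-entropy summed over all entries (both labels are introduced here for illustration, and the summed rather than averaged form is an assumption), the binomial negative log-likelihood factors as

$$
\begin{aligned}
\mathcal{L}_C(Q, F) &= -\sum_{i,j} \left[ n_{ij} \log p_{ij} + (2 - n_{ij}) \log (1 - p_{ij}) \right] \\
&= -2 \sum_{i,j} \left[ \tilde{n}_{ij} \log p_{ij} + (1 - \tilde{n}_{ij}) \log (1 - p_{ij}) \right] \\
&= 2\, \mathcal{L}_{\mathrm{BCE}}(\tilde{N}, QF), \qquad \tilde{N} = (\tilde{n}_{ij}).
\end{aligned}
$$

A positive constant factor does not change the minimizer, so both objectives share the same optimal $Q$ and $F$ under the same constraints. Under the further assumption that the penalty is the squared Frobenius norm scaled by the regularization parameter $\lambda$, the regularized objective would read $\mathcal{L}_{\mathrm{BCE}}(\tilde{N}, QF) + \lambda \lVert W_{1} \rVert_F^2$.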

Note that a perfect reconstruction can be obtained by setting the number of clusters equal to the number of training samples or to the number of input dimensions. However, we want the bottleneck to capture elementary information about the population structure of the given sequences; therefore, we make use of low-dimensional bottlenecks.

### Decoder initialization

Due to the restriction that non-linearities cannot be used in the decoder, as well as the fairly large number of parameters for a single layer, the decoder weights (and thus, the overall performance of the model) are quite sensitive to the initialization. Common initializations, such as Xavier [15], do not work successfully in this architecture. However, the fact that the decoder is interpretable can be exploited in our favour, as we can try to insert information about the population structure into the initialization in an unsupervised manner. As the entries of $(f_{kj})$ are the frequencies of the alternate variant of SNP $j$ in population $k$, $f_{k}$ almost coincides with the centroid of the samples in population $k$. This suggests running a classical clustering method and using its results to initialize the decoder weights.

In the high-dimensional space in which we work, even fast clustering algorithms such as K-Means would yield high execution times. Instead of clustering in the original feature space, we propose to project the data using Principal Component Analysis (PCA) into a lower dimensional subspace of only a few (2 to 6) principal components and then perform K-Means. PCA is widely used in genetic analyses, as a small number of principal components often explain much of the population substructure of the sequences [16, 17, 18], which is what we are interested in. Hence, to initialize the decoder weights, we propose Algorithm 1 (PCK-Means): the data are projected with PCA, K-Means is run on the projections, and the resulting clusters are used to initialize the decoder weights.
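The following sketch shows one way such an initialization could look on toy data. It is not the paper's Algorithm 1: PCA is approximated with power iteration and deflation, K-Means uses deterministic farthest-point seeding, each decoder row is set to the centroid of the halved counts of one cluster, and all function names and data are hypothetical.

```python
import random

def pca_project(X, n_components, iters=100, seed=0):
    """Project mean-centred rows of X onto leading principal directions (power iteration)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    C = [[x - m for x, m in zip(row, mean)] for row in X]
    comps = []
    for _ in range(n_components):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            s = [sum(c * vi for c, vi in zip(row, v)) for row in C]        # C v
            w = [sum(s[i] * C[i][j] for i in range(n)) for j in range(d)]  # C^T (C v)
            for u in comps:  # deflation: remove directions already found
                dot = sum(a * b for a, b in zip(w, u))
                w = [a - dot * b for a, b in zip(w, u)]
            norm = sum(a * a for a in w) ** 0.5
            if norm == 0.0:
                break
            v = [a / norm for a in w]
        comps.append(v)
    return [[sum(c * uj for c, uj in zip(row, u)) for u in comps] for row in C]

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    centers = [points[0]]  # farthest-point seeding keeps the sketch deterministic
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sqdist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def pck_means_init(counts, k, n_components=2):
    labels = kmeans(pca_project(counts, n_components), k)
    overall = [sum(col) / (2.0 * len(counts)) for col in zip(*counts)]
    F0 = []
    for c in range(k):
        members = [row for row, l in zip(counts, labels) if l == c]
        # centroid of the halved counts; fall back to overall frequencies if a cluster is empty
        F0.append([sum(col) / (2.0 * len(members)) for col in zip(*members)] if members else overall)
    return F0, labels

# Toy genotype counts: two clearly separated groups of three individuals over 4 SNPs.
counts = [[2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 1, 0],
          [0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 2, 1]]
F0, labels = pck_means_init(counts, k=2)
```

Clustering happens only in the low-dimensional projection, while the centroids are computed in the original SNP space, so every row of `F0` has one entry per SNP and lies in [0, 1].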

### Multi-head architecture

In ADMIXTURE, cross-validation must be performed in order to choose the number of population clusters (*K*), unless specific prior information about the number of population ancestries is known. Furthermore, in many applications, practitioners desire to observe how cluster assignments change as the number of clusters increases. With the number of both sequenced individuals and variants increasing, the feasible number of different trials of cross-validation rapidly decreases due to its computational cost. As a solution, we propose a variation of Neural ADMIXTURE: the *Multi-head Neural ADMIXTURE* (MNA), which takes advantage of the 512-dimensional latent representation (from now on, shared bottleneck) computed by the encoder. In MNA, the shared bottleneck is jointly learnt for different values of *K*, $\{K_{1}, \dots, K_{H}\}$.

Figure 2 shows how the shared bottleneck of the multi-head structure is split into $H$ different heads. The $i$-th head consists of a non-linear projection to a $K_{i}$-dimensional vector, which corresponds to an assignment assuming there are $K_{i}$ different populations in the data. While every head could be combined and fed through a decoder, this would cause the decoder weights $F$ not to be interpretable. Therefore, every head needs to have its own decoder and, thus, $H$ different reconstructions of the input are performed in every forward pass.

As we have $H$ reconstructions, we will now have $H$ different loss values. We can train this architecture by minimizing a joint objective over all heads, where $Q'_{h}$ and $F'_{h}$ are, respectively, the cluster assignments and the SNP frequencies per population for the $h$-th head. The restrictions of the ADMIXTURE optimization problem (Equation (1)) must be satisfied by $Q'_{h}$ and $F'_{h}$ $\forall h \in \{1, \dots, H\}$.
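A compact sketch of the multi-head forward pass follows. It assumes that the joint objective is the plain sum of the per-head reconstruction losses, $\sum_{h=1}^{H} \mathcal{L}_{\mathrm{BCE}}(\cdot, Q'_{h} F'_{h})$; whether the heads are summed, averaged or weighted is an assumption here, the regularization term is left out, and all names and values are hypothetical.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def bce(x, recon, eps=1e-9):
    """Binary cross-entropy summed over SNPs."""
    return -sum(xi * math.log(max(r, eps)) + (1.0 - xi) * math.log(max(1.0 - r, eps))
                for xi, r in zip(x, recon))

def multi_head_forward(h, heads, decoders):
    outputs = []
    for (W, b), F in zip(heads, decoders):
        logits = [sum(w * hi for w, hi in zip(row, h)) + bi for row, bi in zip(W, b)]
        q = softmax(logits)  # K_h-dimensional assignment for this head
        # each head has its own linear decoder F_h
        recon = [sum(q[k] * F[k][j] for k in range(len(F))) for j in range(len(F[0]))]
        outputs.append((q, recon))
    return outputs

def multi_head_loss(x, outputs):
    return sum(bce(x, recon) for _, recon in outputs)

h = [0.5, 0.0, 2.0]       # shared representation (512-dimensional in the paper)
x = [0.0, 0.5, 1.0, 1.0]  # halved genotype counts of one individual
heads = [
    ([[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]], [0.0, 0.0]),                      # K_1 = 2
    ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0]),  # K_2 = 3
]
decoders = [
    [[0.1, 0.4, 0.9, 0.8], [0.2, 0.6, 0.7, 0.9]],
    [[0.1, 0.4, 0.9, 0.8], [0.5, 0.5, 0.5, 0.5], [0.2, 0.6, 0.7, 0.9]],
]
outputs = multi_head_forward(h, heads, decoders)
loss = multi_head_loss(x, outputs)
```

The shared representation `h` is computed once per sample, so each additional value of $K$ costs only one extra small projection and one extra decoder product.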

The multi-head architecture allows computation of *H* different cluster assignments, for different values of *K*, efficiently in a single forward pass. Results can then be both quantitatively and qualitatively analysed in order to decide which value of *K* is the most suitable for the data.

### Supervised training

ADMIXTURE allows for supervised training by fixing some (or all) entries in the $Q$ matrix. The same approach cannot be applied to the neural network architecture because the entries of $Q$ are not learnt parameters (like $F$) but are instead the activations of the encoder, estimated at every forward pass. As a solution, we propose to add a classification loss to the bottleneck entries. Let $Y$ denote the ground truth ancestries, let $\mathcal{L}_{\mathrm{CE}}$ denote the cross-entropy loss, and let $\mathcal{L}_{N}$ denote the unsupervised objective of Equation (2). In the supervised version, the optimization problem (assuming a single-head architecture) is formulated as

$$\min_{Q, F} \; \mathcal{L}_{N}(Q, F) + \eta\, \mathcal{L}_{\mathrm{CE}}(Y, Q),$$

along with the restrictions from Equation (1). Note that unsupervised Neural ADMIXTURE can be seen as a particular case by setting $\eta = 0$. Furthermore, having both losses allows for semi-supervised training, where only part of the training samples have ground truth labels.
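The combined loss, including the semi-supervised case, can be sketched as below. The reconstruction term is a summed binary cross-entropy, the classification term is the cross-entropy on the bottleneck weighted by $\eta$ and applied only to labelled samples, and the weight penalty is omitted; the tuple layout of `batch` and the function names are assumptions for illustration.

```python
import math

def bce(x, recon, eps=1e-9):
    return -sum(xi * math.log(max(r, eps)) + (1.0 - xi) * math.log(max(1.0 - r, eps))
                for xi, r in zip(x, recon))

def cross_entropy(q, label, eps=1e-9):
    """Cross-entropy between a one-hot label and the bottleneck assignment q."""
    return -math.log(max(q[label], eps))

def supervised_loss(batch, eta):
    """batch holds (x, q, recon, label) tuples; label is None when the ancestry is unknown."""
    total = 0.0
    for x, q, recon, label in batch:
        total += bce(x, recon)  # reconstruction term, always present
        if label is not None:
            total += eta * cross_entropy(q, label)  # classification term, labelled samples only
    return total

batch = [
    ([0.0, 1.0], [0.8, 0.2], [0.1, 0.9], 0),     # labelled sample
    ([1.0, 0.5], [0.3, 0.7], [0.8, 0.5], None),  # unlabelled sample
]
unsupervised = supervised_loss(batch, eta=0.0)
semi_supervised = supervised_loss(batch, eta=1.0)
```

With `eta=0.0` the function reduces to the purely unsupervised reconstruction loss, mirroring the special case noted above.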

In the supervised learning setting, instead of initializing the decoder weights using PCK-Means (Algorithm 1), we can exploit the fact that we know to which population each sample belongs; the decoder weights can simply be initialized as the centroid of each ground truth population, a straightforward computation. Moreover, this initialization will avoid permutation issues (*i.e.* cluster “i” found by PCK-Means may not correspond to population encoded as “i”) which would make convergence slower, or even have a negative impact on performance.
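The centroid initialization amounts to a per-population mean. The sketch below assumes integer population labels and genotype counts in {0, 1, 2}, which are halved so that each decoder row is a frequency in [0, 1]; `supervised_init` is a hypothetical helper name.

```python
def supervised_init(counts, labels, k):
    """Initialize decoder row `pop` as the centroid of the halved counts of population `pop`."""
    F0 = []
    for pop in range(k):
        members = [row for row, y in zip(counts, labels) if y == pop]
        if not members:
            raise ValueError(f"no training samples for population {pop}")
        F0.append([sum(col) / (2.0 * len(members)) for col in zip(*members)])
    return F0

counts = [[2, 0], [2, 2], [0, 1]]  # 3 individuals, 2 SNPs
labels = [0, 0, 1]                 # ground truth population per individual
F0 = supervised_init(counts, labels, k=2)
```

Row `pop` of `F0` is tied to the population encoded as `pop`, so the decoder starts out aligned with the label encoding and the permutation issue mentioned above does not arise.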

### Pretrained Neural ADMIXTURE

As a final contribution, we propose a training scheme which allows reusing the results of a previously optimized ADMIXTURE model to speed up inference on a novel dataset. This is especially useful when many ADMIXTURE runs have already been computed, so that full retraining, which is computationally expensive with large datasets, is not desirable.

Let $F_{A}$ denote the $F$ matrix estimated by ADMIXTURE. This matrix can be used in Neural ADMIXTURE so that the $Q$ estimates will be similar to those given by ADMIXTURE by (a) initializing the decoder weights $W_{D}$ to $F_{A}$, and (b) freezing $W_{D}$ and training the encoder that produces $Q$ for a few epochs. While the resulting $Q$ estimates will not be exactly equal to the estimates coming from the classic ADMIXTURE, the computation of cluster assignments will be sped up noticeably. We show this empirically in the next section.

## Experiments

### Datasets

We use a comprehensive set of publicly available human whole genome sequences from diverse populations across the world, combining the 1000 Genomes Project [19], the Simons Genome Diversity Project [20], and the Human Genome Diversity Project [21]. We include 550 Africans (AFR), 75 Native Americans (AMR), 651 East Asians (EAS), 496 Europeans (EUR), 27 Oceanians (OCE), 590 South Asians (SAS), and 127 West Asians (WAS). Each category is defined geographically, with the American populations additionally filtered to exclude post-colonial groups with recent origins from other continents (e.g., Europe and Africa) by considering only samples with over 95% indigenous local ancestry segments. The genome sequences are from anonymous individuals sequenced with their full consent. Samples are randomly split into training and validation sets using an 80/20 split. We make use of three different datasets: Chm-22, Chm-22-SIM and Chm-1. Chm-22 and Chm-1 include the same set of individuals, but consider only the part of the genome sequence encoded on chromosome 22 and chromosome 1, respectively. Chm-22-SIM is an augmented version of the Chm-22 data: it contains simulated descendants of the real individuals, created using a recombination simulation program, PyAdmix [22], with the simulations performed independently on the train and validation partitions of Chm-22. A total of 400 individuals per ancestry are generated in the training set and 50 in the validation set. Both Chm-22 and Chm-22-SIM have 317,408 SNPs, while Chm-1 has 362,605 variants. Originally, chromosome 1 sequences contained 5 times more SNPs, but inspired by linkage disequilibrium pruning recommendations in classical ADMIXTURE, we uniformly down-sampled the sequences. While down-sampling is required for long sequences when using GPU compute (to avoid memory problems), the complete sequence can be used when using CPU. Furthermore, even when down-sampling the sequence, centroid estimates for all the SNPs can be recovered after training by simply computing the weighted average of the training samples, using the cluster assignments as weights.
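The weighted average mentioned above can be written directly. In this sketch the training samples are assumed to be genotype counts in {0, 1, 2}, so the average is halved to yield a frequency; `recover_frequencies` and the toy values are hypothetical.

```python
def recover_frequencies(counts, Q):
    """Per-cluster allele frequencies as an assignment-weighted average of the samples."""
    n, J, K = len(counts), len(counts[0]), len(Q[0])
    F = []
    for k in range(K):
        weight = sum(Q[i][k] for i in range(n))
        if weight == 0.0:
            raise ValueError(f"cluster {k} has no assigned mass")
        # f_kj = sum_i q_ik * (n_ij / 2) / sum_i q_ik
        F.append([sum(Q[i][k] * counts[i][j] for i in range(n)) / (2.0 * weight)
                  for j in range(J)])
    return F

counts = [[2, 2], [0, 2]]     # all SNPs, including those dropped by down-sampling
Q = [[0.5, 0.5], [1.0, 0.0]]  # assignments estimated on the down-sampled SNPs
F_full = recover_frequencies(counts, Q)
```

Only the assignments `Q` depend on the trained network, so the same call works for any set of SNPs genotyped in the training samples.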

### Benchmark setup

ADMIXTURE models are optimized (both in training and projection mode) using 16 threads on an Intel Xeon 2.6GHz (x86_64), with 32 cores and 260GB of RAM. The same hardware is used to train and run inference using Neural ADMIXTURE for the CPU experiments. For the GPU experiments, all networks are trained on an NVIDIA GeForce RTX 2080 Ti. The same type of graphics card is used to run inference on the trained models.

All models are trained using *K* = 7 clusters. Classic ADMIXTURE models are stopped after 20 iterations in training mode. In the projective analysis mode, the software stops upon convergence (which typically happens in fewer than 15 iterations). Unsupervised networks (including the pretrained Neural ADMIXTURE) are trained using Adam [23] for 10 epochs, except on the Chm-22-SIM dataset, where the networks are trained for 20 epochs. The supervised network is optimized over 6 epochs. To run inference on the validation data using the networks, a batch size of 200 sequences is used. All Neural ADMIXTURE networks are trained using a learning rate of $10^{-4}$ and a batch size of 200. The regularization parameter, *λ*, is set to $10^{-2}$. During our experiments, we observed that the value of *λ* is correlated with the distribution of the assignments (with *λ* = 0 assignments are hard, while large values of *λ* result in completely uniform assignments). The networks are implemented using the PyTorch framework [24].

To quantitatively evaluate performance, we run inference on the validation data. Inference is understood as running a projective analysis on the validation set in ADMIXTURE, and performing a forward pass to obtain the $Q$ estimates in Neural ADMIXTURE. Furthermore, we use the Adjusted Mutual Information (AMI) [25] between the ground-truth ancestries and the $Q$ estimates of the validation data. The AMI normalizes the mutual information between two labelings and corrects it for chance. In its definition, $H(q)$ denotes the entropy of $q$, $H(q : \hat{q})$ denotes the mutual information between the two labelings $q$ and $\hat{q}$, and the expected mutual information can be calculated using the equation introduced in [26]. A score of one indicates perfect agreement between the assignments, and is the maximum value this metric can take. The *AMI* can be computed using the scikit-learn [27] package. Note that the *AMI* is not defined for soft clustering evaluation, so in order to compute it we assume that the population assigned to a sample corresponds to the cluster to which the individual most belongs.
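The hardening step and the quantity that the AMI adjusts can be illustrated with a short sketch. It computes only the plain mutual information in pure Python. The AMI is commonly written as $(\mathrm{MI} - \mathbb{E}[\mathrm{MI}]) / (\text{normalizer} - \mathbb{E}[\mathrm{MI}])$, with the normalizer taken as the mean or the maximum of the two entropies; which normalizer applies here is not assumed. The helper names and toy values are hypothetical.

```python
import math
from collections import Counter

def hard_labels(Q):
    """Assign each sample to the cluster with the largest fractional assignment."""
    return [max(range(len(q)), key=q.__getitem__) for q in Q]

def mutual_information(a, b):
    """Mutual information (in nats) between two hard labelings of the same samples."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items())

Q = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.6, 0.3, 0.1], [0.2, 0.1, 0.7]]
truth = [1, 0, 1, 0]  # ground truth ancestry labels
pred = hard_labels(Q)
mi = mutual_information(truth, pred)

# The chance-adjusted score itself is available in scikit-learn as
# sklearn.metrics.adjusted_mutual_info_score(labels_true, labels_pred).
```

The cluster indices in `pred` need not match the label encoding in `truth`: the mutual information depends only on how the two partitions overlap, which is why the metric suits unsupervised assignments.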

On the other hand, we also evaluate the soft clustering itself with a new metric, $\Delta$, as

$$\Delta = \frac{1}{N^2} \left\lVert QQ^{T} - YY^{T} \right\rVert_F^2,$$

where $Q$ is the cluster assignment matrix, $Y$ the one-hot encoded ground truth and $N$ the number of samples. This is equivalent to computing the mean squared difference between the covariance matrices of both the estimated and the target population. If the estimates $Q$ completely agree with $Y$ (up to permutation), then $\Delta$ will be 0. The greater the disagreement, the higher the value of $\Delta$.
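A direct implementation helps to check the stated properties. The sketch takes $\Delta$ to be the mean squared entrywise difference between $QQ^{T}$ and $YY^{T}$, as described in the text; `delta_metric` and the toy matrices are hypothetical.

```python
def delta_metric(Q, Y):
    """Mean squared difference between the N x N matrices Q Q^T and Y Y^T."""
    n = len(Q)
    total = 0.0
    for i in range(n):
        for j in range(n):
            qq = sum(a * b for a, b in zip(Q[i], Q[j]))  # (Q Q^T)_ij
            yy = sum(a * b for a, b in zip(Y[i], Y[j]))  # (Y Y^T)_ij
            total += (qq - yy) ** 2
    return total / (n * n)

Y = [[1, 0], [1, 0], [0, 1]]        # one-hot ground truth
Q_perm = [[0, 1], [0, 1], [1, 0]]   # same partition with cluster indices swapped
Q_uniform = [[0.5, 0.5]] * 3        # completely uninformative assignments

d_exact = delta_metric(Y, Y)
d_perm = delta_metric(Q_perm, Y)
d_uniform = delta_metric(Q_uniform, Y)
```

Because $QQ^{T}$ compares samples with each other rather than with cluster indices, swapping cluster columns leaves the value unchanged, while uninformative assignments are penalized.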

We are interested in these metrics (*AMI* and Δ) as they are more easily interpretable than the loss function value itself. We are aware that these pseudo-supervised metrics will not give us the true quality of the predictions of the models, as some labels may not be accurate. However, assuming that most of the labels are correct, they allow us to quantify the agreement between the handcrafted labels and the model estimates, and therefore to estimate the quality of the predictions. We use this estimate to compare the classical and the neural approaches, and to draw conclusions when the difference between them is large.

Moreover, let $T_{C}$ be the execution time for classical ADMIXTURE (in either training or projective analysis) and let $T_{N}$ be the execution time for the corresponding Neural ADMIXTURE. We define the speedup as the ratio between execution times, $S = T_{C} / T_{N}$. When $S > 1$, Neural ADMIXTURE runs faster than ADMIXTURE, and when $S < 1$ it runs slower.

### Single-head results

The results in Table 2 show that, when training on GPU, Neural ADMIXTURE is at least an order of magnitude faster than ADMIXTURE, while achieving very competitive results. Moreover, it is approximately two orders of magnitude faster in inference, both on CPU and on GPU. In the supervised case the CPU version of Neural ADMIXTURE is slower than ADMIXTURE. We believe this is due to overhead introduced by the extra gradient computation on the added supervised loss term.

We also compare ADMIXTURE and Neural ADMIXTURE by visualizing their *Q* estimates and their respective SNP frequencies *F*. Figure 3a contains the predictions over the training and validation data of the dataset Chm-22-SIM. The SNP frequencies (that is, the entries in the *F* matrix) from both models can be observed (projected onto the first two principal components of the training data) in Figure 3b. Qualitatively, Neural ADMIXTURE estimates tend to be more polarized, with many samples being assigned only to a single population, while ADMIXTURE appears to be more conservative. On this dataset ADMIXTURE does not differentiate Native Americans (AMR) and East Asians (EAS), and instead partitions Africans (AFR) into two different ancestry clusters. Neural ADMIXTURE, however, is able to split the EAS and AMR populations. Qualitative examples of the Multi-head Neural ADMIXTURE are shown in the Appendix.

Such qualitative results are consistent with the AMI and Δ values for both training and validation data (Table 2), in both the supervised and unsupervised settings, indicating that Neural ADMIXTURE provides estimates that are closer to the ground truth labels than those of classic ADMIXTURE.

### Multi-head results

The main advantage of the Multi-head Neural ADMIXTURE (MNA) architecture is that it can perform simultaneous clustering and inference for multiple values of *K* (number of ancestry clusters). Running many different values for *K* allows geneticists to obtain a more complete picture of the variation within populations, and is recommended practice. Figures 5 and 6 show examples of the output of MNA for different clustering results under values of *K* ranging from 3 to 10.

In Figure 5a (K=3) we can observe that EUR, WAS and SAS are combined within the same cluster, while OCE and EAS are clustered together, and AFR has its own cluster. These results reflect the genetic similarity between the respective groups due to their Out-of-Africa migration patterns and subsequent gene flow.

After adding a new cluster (Figure 5b) OCE obtains its own cluster, reflecting the ancient divergence of that population from the others. As more clusters are incorporated, AMR and EAS obtain their own clusters and OCE is divided between a component found predominantly in OCE and a component characteristic of EAS. The latter likely reflects the later migration of Austronesian speakers from East Asia out into the Pacific Islands, where they contributed their ancestry to the Oceanian inhabitants. A shared component between EUR, SAS and WAS is maintained, independent of the cluster number *K*, which could be linked to early farmer expansions out of West Asia and into both Europe and South Asia, following the birth of agriculture, as well as to the much later expansion of the Indo-European language family across all of these regions. Other genetic exchanges between these neighboring regions doubtlessly played a role.

With a sufficiently high number of clusters (Figure 6d), a shared component between WAS and some AFR populations appears, which might reflect North African gene flow.

## Discussion

In this paper, we demonstrate that the unsupervised clustering algorithm ADMIXTURE can be re-framed as an autoencoder. This novel framing, which we name Neural ADMIXTURE, allows for the use of common optimization techniques such as SGD or Adam and provides rapid inference through the encoder, two orders of magnitude faster than the original ADMIXTURE algorithm. Furthermore, by adding more heads, multiple estimates with different priors on the cluster number (*K*) can be performed simultaneously, reducing overall compute time still further. This approach, combined with the use of GPU compute, can enable rapid results on even large modern biobanks.

## Acknowledgments

This work was supported in part by NIH grant 7U01HG009080.

## Appendix A Results

The order of the labels in all figures is the same as in Figure 3a.

## Chm-1 (unsupervised)

## Chm-22 (unsupervised)

## Chm-22 (pretrained)

## Chm-22 (supervised)

## Footnotes

\* adomi{at}stanford.edu and ioannidis{at}stanford.edu