## Abstract

Admixture graphs represent the genetic relationship between a set of populations through splits, drift and admixture. In this paper we present the `Julia` package `miqoGraph`, which uses mixed-integer quadratic optimization to fit topology, drift lengths, and admixture proportions simultaneously. Inference of topology is particularly powerful, with integer optimization automating what is usually an arduous manual process.

**Availability** https://github.com/juliayyan/PhylogeneticTrees.jl

**Contact** jyyan{at}mit.edu

## 1 Introduction

The genetic relationship between a set of populations cannot be described precisely by a simple tree because of the presence of admixture. An admixture graph provides a way to represent the complex relationship between populations, including their separation, subsequent drift, and possible merging by using distributions of multiple trees. Several methods exist to build and visualize admixture graphs as well as to infer optimal parameters of drift lengths and admixture edges, such as `TreeMix` (Pickrell and Pritchard, 2012), `AdmixTools` (Patterson *et al.*, 2012), `MixMapper` (Lipson *et al.*, 2013), and `admixturegraph` (Leppälä *et al.*, 2017). Some of these methods cannot infer the optimal topology simultaneously with the parameters of the graph under that topology. Rather, they require that the topology of a particular graph be pre-specified, and then infer the graph parameters. Other methods, such as `TreeMix` (Pickrell and Pritchard, 2012) and `MixMapper` (Lipson *et al.*, 2013), search a restricted space of possible admixtures. In Leppälä *et al.* (2017), all possible topologies are enumerated using exhaustive searches; however, such an approach becomes intractable at larger problem sizes.

Here, we describe the implementation of the `miqoGraph` package, which can simultaneously infer the optimal graph topology, drift lengths, and admixture proportions. We test our algorithm on simulated and real datasets comprising several populations and show that, even though fewer parameters are specified *a priori*, running times improve on those of competing algorithms by orders of magnitude: high-quality and optimal solutions can be found in seconds.

## 2 Methods

A typical approach to admixture graph fitting is to first specify a topology, and then compute the graph’s fit to genetic data. Drift patterns in the data can be summarized by *f*-statistics (Patterson *et al.*, 2012), and for a given topology it is possible to construct a basis set of expected values of *f*-statistics that define the graph (Pickrell and Pritchard, 2012). We will refer to the vector of empirical *f*-statistics as **f**, and given a topology **x**, drift lengths **w**, and admixture proportions ***α***, we will call the vector of the expected *f*-statistics *g*(**w**, ***α***; **x**). Drift lengths and admixture proportions are then selected to maximize the likelihood, which amounts to minimizing the covariance-weighted squared deviation between expected and empirical statistics:

$$\min_{\mathbf{w},\; \boldsymbol{\alpha} \in \mathcal{A}} \; \big(g(\mathbf{w}, \boldsymbol{\alpha}; \mathbf{x}) - \mathbf{f}\big)^{\top} \Sigma^{-1} \big(g(\mathbf{w}, \boldsymbol{\alpha}; \mathbf{x}) - \mathbf{f}\big) \qquad (1)$$

where Σ is the covariance matrix of the empirical statistics **f** (so that Σ^{−1} is its inverse), and 𝒜 represents the set of all valid admixture proportions. This is the approach of `qpGraph`, developed by Patterson *et al.* (2012).
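As a rough illustration of the fitting criterion just described, the sketch below evaluates a covariance-weighted squared deviation between expected and empirical *f*-statistics. It assumes the usual Gaussian form of the likelihood and takes the inverse covariance (precision) matrix as an input; the function name and the toy numbers are hypothetical and are not taken from `qpGraph` or `miqoGraph`.

```python
def weighted_sq_deviation(expected, empirical, precision):
    """Return (g - f)^T P (g - f), where P is the inverse covariance matrix.

    expected  -- list of expected f-statistics g (assumed already computed)
    empirical -- list of empirical f-statistics f
    precision -- square matrix (list of lists) holding the inverse covariance
    """
    r = [g - f for g, f in zip(expected, empirical)]
    n = len(r)
    return sum(r[i] * precision[i][j] * r[j] for i in range(n) for j in range(n))


# Toy numbers, chosen only to exercise the function (not real data).
f = [0.10, 0.20]
g = [0.12, 0.17]
P = [[100.0, 0.0], [0.0, 400.0]]
score = weighted_sq_deviation(g, f, P)
```

A smaller score indicates a closer fit; in practice an optimizer would adjust the drift lengths and admixture proportions that determine the expected statistics.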

In our approach, which we call `miqoGraph`, rather than fixing the topology **x** before solving for drift lengths **w** and admixture proportions ***α***, we optimize over all three *simultaneously*:

$$\min_{\mathbf{x} \in \mathcal{X},\; \mathbf{w},\; \boldsymbol{\alpha} \in \mathcal{A}} \; \big(g(\mathbf{w}, \boldsymbol{\alpha}; \mathbf{x}) - \mathbf{f}\big)^{\top} \Sigma^{-1} \big(g(\mathbf{w}, \boldsymbol{\alpha}; \mathbf{x}) - \mathbf{f}\big) \qquad (2)$$

where 𝒳 represents the set of all valid topologies. Topologies have previously been explored by enumeration over small graphs (Leppälä *et al.*, 2017), but this approach is intractable for larger graphs. Here we present a novel formulation of the problem using mixed-integer quadratic optimization (MIQO), where we model the problem of determining a best-fit graph topology as assignment of populations to leaf nodes of a binary tree. Although such problems are difficult in theory, modern optimization solvers such as Gurobi (Gurobi Optimization, Inc., 2016) and CPLEX (IBM ILOG CPLEX Optimization Studio, 2013) can quickly solve large-scale MIQO problems in practice. For an overview of integer optimization, see Wolsey and Nemhauser (2014), and for our formulation, see the supplementary material.

Our approach requires pre-specification of the following parameters:

1. Tree depth *D* ∈ ℤ⁺,
2. Number of admixture events *A* ∈ {0} ∪ ℤ⁺, and
3. Admixture resolution *K*, for *K* ∈ ℤ⁺ (only needed if *A* > 0).

If there are no admixture events (*A* = 0), the populations’ relationship can be represented using a single binary tree. We model admixture (*A* > 0) by allowing the population assignments to leaf nodes to be between 0 and 100%, and the admixture resolution *K* allows admixture proportions to be estimated to an accuracy of 1/*K*. For example, *K* = 10 allows values of 0%, 10%, …, 90%, 100% (see supplementary materials).
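To make the role of the resolution *K* concrete, the following sketch lists the proportions available at a given *K* and checks whether one population's shares across leaf nodes lie on that grid and sum to 100%. The function names are hypothetical, and the check is only an illustration of the discretization described above, not the constraint set of the actual formulation (which is given in the supplementary material).

```python
from fractions import Fraction


def admissible_proportions(K):
    """All proportions representable at resolution K: 0, 1/K, ..., 1."""
    return [Fraction(k, K) for k in range(K + 1)]


def is_valid_assignment(row, K):
    """Check one population's shares across leaf nodes.

    Each share must be a multiple of 1/K between 0 and 1,
    and the shares must sum to exactly 1.
    """
    on_grid = all(0 <= share <= 1 and (share * K).denominator == 1 for share in row)
    return on_grid and sum(row) == 1


# K = 10 gives the eleven values 0%, 10%, ..., 100%.
grid_10 = admissible_proportions(10)

# A 50/50 split over two of four leaves is representable at K = 2 ...
even_split = [Fraction(1, 2), Fraction(1, 2), 0, 0]
# ... but a 10/90 split is not representable at K = 7.
uneven_split = [Fraction(1, 10), Fraction(9, 10)]
```

Exact rational arithmetic is used so that the grid membership test does not depend on floating-point rounding.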

We then solve optimization problem (2) to find the best-fit tree topology, drift lengths, and admixture proportions under the specified parameters. A major benefit of `miqoGraph` over prior approaches is the flexibility of the parameters, with each specification of parameter values representing numerous potential admixture graphs. As such, although it is computationally intractable to enumerate the thousands of potential topologies for several populations, our algorithm quickly finds well-fit topologies using MIQO. Although it may not be obvious *a priori* which parameter values are appropriate, multiple optimization problems can be solved in parallel over a reasonable range of parameter values. In our experiments, we found that trying one tree depth, several admixture resolutions, and a few admixture-event counts was sufficient to find the correct admixture graph topologies.
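The parameter sweep described above could be organized as a list of independent configurations, each of which would be handed to the solver separately (for example, in separate processes). The sketch below only builds that list; the function name and the example values are hypothetical, and no `miqoGraph` call is shown because its interface is not described here.

```python
from itertools import product


def parameter_grid(depths, admixture_events, resolutions):
    """Build (D, A, K) configurations to solve independently.

    K is only meaningful when A > 0, so configurations without
    admixture are emitted once, with K set to None.
    """
    configs = []
    for D, A in product(depths, admixture_events):
        if A == 0:
            configs.append({"D": D, "A": 0, "K": None})
        else:
            configs.extend({"D": D, "A": A, "K": K} for K in resolutions)
    return configs


# Illustrative values only: one depth, up to two admixture events, three resolutions.
grid = parameter_grid(depths=[3], admixture_events=[0, 1, 2], resolutions=[2, 3, 4])
```

Because the configurations do not depend on one another, they can be solved in any order and their objective values compared afterwards.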

Although it is not required, prior knowledge can reduce the solution space and speed up the solution time. For example, a user can specify that the path from the root to a particular population does not contain admixture, which we found to be a particularly useful feature in our simulations.

## 3 Computational Results

In Section 3.1, we validate our models on simulated admixture graphs, before turning to real data in Section 3.2. Our methods were implemented in the Julia programming language (Bezanson *et al.*, 2014), using the optimization modeling package JuMP (Lubin and Dunning, 2015). We used the Gurobi solver version 8.0.1 (Gurobi Optimization, Inc., 2016). Computational experiments were run on a desktop computer with a 16-core Intel Xeon E5-2650 CPU, 3.40 GHz processor, and 64GB of memory. Our code is available on GitHub as the package `PhylogeneticTrees.jl`. Documentation is available in the supplementary materials.

### 3.1 Simulated Data

We tested `miqoGraph` on three simulated admixture graphs that represent the possible varieties of admixture events to validate its performance. Parameters for the simulation are described in the supplementary material.

A summary of our simulated graphs is shown in Figure 1. Figure 1a shows the base graph upon which the three simulated graphs are built. The first of these, in Figure 1b, involves a single admixture event where Population 3 is produced from equal mixtures of Populations Slot0 and Slot1. The next graph in Figure 1c is identical to the first except that the admixture proportions are changed to 10% and 90%. A more complex graph in Figure 1d has Population 3 produced from a nested admixture event between Populations Slot0, Slot1, and Slot2. These graphs will be referred to as *SimpleMix*, *UnevenMix*, and *NestedMix*, respectively. These simulations are by no means exhaustive. Because the underlying optimization model’s size scales quadratically with the admixture granularity *K* (see supplementary materials), `miqoGraph` may not be appropriate for detecting low admixture proportions; the lowest that we test is 10% in the UnevenMix case. Furthermore, our formulation models admixture only at the leaf nodes: for example, the NestedMix case is represented as Population 3 being assigned 25% to Slot0, 25% to Slot1, and 50% to Slot2. Although the nesting is straightforward in this case, `miqoGraph` may produce less interpretable results for admixture graphs with many interdependent nested admixture events. Nonetheless, `miqoGraph` is a powerful tool on a variety of use cases, as we will show.
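The leaf-level representation of NestedMix can be illustrated by flattening a nested mixture into per-leaf shares. The sketch below assumes, purely for illustration, that the nested event is a 50/50 mix of Slot0 and Slot1 which is then mixed 50/50 with Slot2; the data structure and function name are hypothetical. It also shows why nested events can be hard to interpret: a differently ordered nesting yields the same leaf shares.

```python
from fractions import Fraction


def flatten(mix, weight=Fraction(1)):
    """Collapse a nested mixture into shares of leaf nodes.

    mix is either a leaf name (str) or a list of (share, sub_mix) pairs.
    """
    if isinstance(mix, str):
        return {mix: weight}
    totals = {}
    for share, sub in mix:
        for leaf, w in flatten(sub, weight * share).items():
            totals[leaf] = totals.get(leaf, Fraction(0)) + w
    return totals


half = Fraction(1, 2)
# Assumed nesting: (Slot0 + Slot1 at 50/50), then mixed 50/50 with Slot2.
nested_mix = [(half, "Slot2"), (half, [(half, "Slot0"), (half, "Slot1")])]
leaf_shares = flatten(nested_mix)

# A different ordering of events that collapses to the same leaf shares.
other_nesting = [
    (Fraction(1, 4), "Slot0"),
    (Fraction(3, 4), [(Fraction(1, 3), "Slot1"), (Fraction(2, 3), "Slot2")]),
]
other_shares = flatten(other_nesting)
```

Both nestings flatten to 25% Slot0, 25% Slot1, and 50% Slot2, so the leaf-level shares alone do not determine the order of the admixture events.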

#### 3.1.1 The SimpleMix example

On the SimpleMix dataset, we fitted a tree with a single admixture event (*A* = 1) at an admixture resolution of *K* = 2. Our `miqoGraph` algorithm found the optimal solution in 94s with an objective value of 18.99, although it took significantly longer to prove that the topology and parameters were optimal, terminating in 1,040s. In practice, it is often the case that optimization algorithms find the optimal solution quickly but take longer to prove optimality. As such, common practice is to terminate the solver early when progress in the objective slows or halts.

However, if a guarantee of optimality is desired, it can be efficiently obtained through a combination of optimization and grid search, since the model’s solution time can speed up dramatically if it is known *a priori* that certain populations do not have admixture events in their paths to the root. We were able to perform a grid search in a total of 106s that produced the objective-18.99 tree along with a guarantee of optimality. Details of this grid search are in the supplementary material.

Most importantly, the optimal solution matched the simulated SimpleMix graph perfectly in topology and parameters.

#### 3.1.2 The UnevenMix example

In the SimpleMix case, the actual graph was admixed at exactly 50% and 50%, allowing for a low admixture granularity of *K* = 2 to capture the correct admixture event. A natural question arises when considering admixture proportions that require a higher level of resolution to capture: what level of resolution is sufficient to infer the correct admixture graph? To answer this question, we turn to the UnevenMix case of Figure 1c, where the admixture proportions are 10% and 90%.

As in the SimpleMix case, for the UnevenMix case we began by solving for a tree without admixture. In this case, `miqoGraph` terminated in 3s with an objective value of 22.08. We then ran `miqoGraph` with the admixture resolution varying over *K* = 2, 3, …, 10, specifying that only Population 3 was admixed. At *K* = 7, the objective and topologies converged to the correct values and the admixture proportions reached the approximate values of 14% and 86%. This result indicates that the resolution need not be set exactly to the level required for the correct graph (*K* = 10), although it should be reasonably close. A reasonable approach might be to run `miqoGraph` at increasingly fine granularities until the topology converges. A continuous optimization algorithm such as `qpGraph` can also be run for fine-tuning.
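The 14% and 86% values at *K* = 7 are what one would expect from the discretization: the closest representable split to 10%/90% at that resolution is 1/7 and 6/7. The sketch below, with a hypothetical helper name, lists the closest strictly interior grid value to a 10% target for each resolution tried.

```python
from fractions import Fraction


def nearest_nonzero_grid_value(target, K):
    """Closest multiple of 1/K to target, excluding 0 and 1."""
    candidates = [Fraction(k, K) for k in range(1, K)]
    return min(candidates, key=lambda p: abs(p - target))


target = Fraction(1, 10)  # the minor admixture proportion in UnevenMix
for K in range(2, 11):
    p = nearest_nonzero_grid_value(target, K)
    print(f"K={K:2d}: {float(p):.1%} / {float(1 - p):.1%}")
```

Only at *K* = 10 does the grid contain the true proportions exactly; at coarser resolutions the nearest available minor share is 1/*K*.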

#### 3.1.3 The NestedMix example

Our final simulated case was called NestedMix, and was similar to SimpleMix and UnevenMix except that it included a second admixture event. Admixed graphs were inferred with two admixture events, and only Population 3 was allowed to experience admixture. For both admixture resolutions *K* = 3 and 4, the inferred topology once again matched the original topology, albeit with different admixture proportions, and these trees were proved to be optimal in 33s and 63s respectively.

Table 1 shows a summary of the running times of `miqoGraph` for each dataset, compared with the running time of `qpGraph` (with the topology fixed to the correct topology). Our algorithm, `miqoGraph`, is able to find the correct graphs orders of magnitude more quickly than `qpGraph`, and with the exception of the UnevenMix case, we prove optimality more quickly as well. Our algorithm accomplishes these improved running times while also allowing exploration of varied graph topologies, whereas `qpGraph` requires fixing a single graph topology *a priori*.

The three cases SimpleMix, UnevenMix, and NestedMix are toy examples, but they represent a range of common cases. In the following section, we demonstrate the performance of `miqoGraph` on real data.

### 3.2 Real Data from Eurasia and the Americas

After testing the performance and correctness of `miqoGraph` on simulated data, we ran it on a six-population dataset of modern and ancient DNA samples from Eurasia and the Americas to infer the phylogeny of populations leading to the Karitiana, a South American population from Brazil.

Even without specifying which population should be admixed and at a coarse admixture granularity of *K* = 2, `miqoGraph` found a solution within 8s and verified optimality after 9s. Most importantly, the graph matched one found using exhaustive searches with `qpGraph`.

We were able to refine the admixture proportions by running `miqoGraph` at higher resolutions of *K* = 3 and 4. By leveraging the knowledge that Karitiana should be admixed, learned from the *K* = 2 output, `miqoGraph` inferred the correct topology almost instantaneously (1s and 3s, respectively). At *K* = 4, the inferred admixture proportions 25%-75% corresponded closely to the values of 28%-72% estimated by `qpGraph`, and the drift lengths were also similar (see supplementary materials). The admixture graph inferred by `miqoGraph` at *K* = 4 is shown in Figure 2. In this topology, Karitiana is admixed between an ancient North Eurasian-related and a present-day East Asian-related source, consistent with previous results examining the initial peopling of the Americas (Raghavan *et al.*, 2014).

## 4 Limitations

The main limitation of `miqoGraph` lies in the restriction of admixture events to the leaf nodes of the graph and, therefore, in the interpretation of its output in the presence of multiple nested admixture events. Suppose a particular population A has admixture from populations B and C, and that B itself is admixed from D and E. The ordering of these events is not captured in our representation of the graph, and it can be challenging to reconstruct the correct sequence of events leading to the true admixture graph. To aid interpretability, our framework allows the user to sequentially add new populations while fixing the topology for other populations. The positions of these new populations can vary freely, or they can be tentatively assigned to positions based on the user’s best guess, giving the optimizer a “warm start” to improve upon.

A second issue with our approach is that admixture proportions are inferred only at discrete values whose granularity is specified *a priori*. It is possible that at low admixture granularities, the best-fit topology may be incorrect. One possible way to mitigate this effect is to use `miqoGraph` to explore a possible set of graph topologies and then to use continuous optimizers such as that implemented in `AdmixTools` (Patterson *et al.*, 2012) to fit parameters on these topologies.

## 5 Conclusion

Our results show that `miqoGraph` is able to simultaneously infer topologies, drift lengths, and admixture proportions in seconds to minutes for both simulated and real-world cases. The formulation is primarily useful in settings with few nested admixture events. Nonetheless, the use of integer optimization to model what was previously a combination of labor-intensive manual enumeration and continuous optimization represents a significant step forward in efficient inference of admixture graphs.

## Acknowledgements

The authors thank Dimitris Bertsimas and members of the Reich laboratory for productive discussions.