Historical trade routes for diversification of domesticated chickpea inferred from landrace genomics

According to archaeological records, chickpea (Cicer arietinum) was first domesticated in the Fertile Crescent 10 thousand years ago. Its subsequent diversification in South Asia, Ethiopia, and the Western Mediterranean, however, remains obscure and cannot be resolved using only archeological and historical evidence. In particular, chickpea has two market types: ‘desi’, which has a similar flower and seed coat color to chickpea’s wild relatives; and ‘kabuli’, which has light-colored seed, and is linguistically tied to Central Asia but has an unknown geographic origin. Based on the genetic data from 421 chickpea landraces from six geographic regions, we tested complex historical hypotheses of chickpea migration and admixture on two levels: within and between major regions of cultivation. For the former, we developed popdisp, a Bayesian model of population dispersal from a regional center towards sample locations, and confirmed that chickpea spread within each region along trade routes rather than by simple diffusion. For the latter, migration between regions, we developed another model, migadmi, that evaluates multiple and nested admixture events. Applying this model to desi populations, we found both Indian and Middle Eastern traces in Ethiopian chickpea, suggesting presence of a seaway from South Asia to Ethiopia — and the cultural legacy of the Queen of Sheba. As for the origin of kabuli chickpeas, we found significant evidence for an origin from Turkey rather than Central Asia.

Fertile Crescent 10 thousand years ago. Its subsequent diversification in South Asia, Ethiopia,23 and the Western Mediterranean, however, remains obscure and cannot be resolved using only 24 archeological and historical evidence. In particular, chickpea has two market types: 'desi', 25 which has a similar flower and seed coat color to chickpea's wild relatives; and 'kabuli', which 26 has light-colored seed, and is linguistically tied to Central Asia but has an unknown geographic 27 origin. 28 29 Based on the genetic data from 421 chickpea landraces from six geographic regions, we tested 30 complex historical hypotheses of chickpea migration and admixture on two levels: within and 31 between major regions of cultivation. For the former, we developed popdisp, a Bayesian model 32 of population dispersal from a regional center towards sample locations, and confirmed that 33 chickpea spread within each region along trade routes rather than by simple diffusion. 34 35 For the latter, migration between regions, we developed another model, migadmi, that 36 evaluates multiple and nested admixture events. Applying this model to desi populations, we 37 found both Indian and Middle Eastern traces in Ethiopian chickpea, suggesting presence of a 38 seaway from South Asia to Ethiopia -and the cultural legacy of the Queen of Sheba. As for 39 Introduction 44 45 The genetic variation of species reflects evolutionary history. The history of a domesticated 46 species is inextricably linked with human history and we can learn much about one from 47 studying the other. Reconstructing the spread of cultigens reveals the history of both plant and 48 human and has the potential to improve modern genomics-assisted breeding schemes. 49 50 Chickpea (Cicer arietinum L.) is an important source of high-quality protein (Abbo et al., 2003a), 51 ranked third among legumes in terms of grain production (Jain et al., 2013). It is extensively 52 cultivated in India, West Asia, Eastern Africa, and the Mediterranean Basin, but how it reached 53 these regions, and its subsequent admixture history is not well-understood. Limiting factors in 54 reconstructing chickpea domestication history include: (1) lack of whole-genome sequences 55 from ancient chickpea, (2) reduced genetic diversity in cultivars due to domestication 56 bottlenecks, (3) the replacement of locally evolving landraces with modern commercial 57 varieties (Abbo et al., 2003a). The most suitable material for studying chickpea domestication 58 is the historical germplasm collection made by Vavilov in the 1920s-1930s, stored at the N.I. 59 Vavilov All Russian Institute of Plant Genetic Resources (VIR). This collection currently contains 60 3380 chickpea accessions, almost half of which represent pre-Green Revolution landraces with 61 known geographical origin (Figure 1a). Vavilov not only established this unique collection, but 62 also identified several "centers of origin" (or diversity) of crop plants (Vavilov, 1926) ( Figure  63 2a). For chickpea, centers of diversity include six regions (van der Maesen, 1984;Vavilov, 64 1951), which we will denote by the nearest contemporary country: Turkey, Uzbekistan, India, 65 Lebanon, Morocco, and Ethiopia. We assembled a panel of 421 chickpea landraces which 66 represent these regions ( Figure 1a) and tested historical hypotheses of chickpea diversification 67 based on genotyping at 2579 loci. 68 69 Chickpea centers of diversity have rich archaeological records, and several domestication 70 scenarios have been proposed based on these. The wild progenitor of C. arietinum is C. 71 reticulatum, a rare species found in a small area of south-eastern Turkey (Abbo et al., 2003a). 72 Because Turkey (and Syria) also harbor several archaeological sites with the earliest remains of 73 cultivated chickpea (ca 9500 ybp) (Abbo et al., 2003b;Tanno and Willcox, 2006), this region is 74 generally accepted as the origin of chickpea. Based on the archaeological records, chickpea 75 then spread throughout ancient world, reaching western-central Asia (Uzbekistan) and the 76 Indus Valley ca 6000 ybp, the Mediterranean basin (Lebanon, Morocco) ca 5500 ybp, and 77 Ethiopia ca 3500 ybp. While the chickpea migration relationships between Turkey, Lebanon, 78 India and central Asia are supported by archeological records, the exact dispersal and 79 admixture history of chickpea within the Mediterranean Basin and to Ethiopia are anyone's 80 guess. 81 The C. arietinum L. history gets more complicated due to the presence of two distinct types: 83 'desi' and 'kabuli', which differ in size/morphology, color and surface of seeds (Purushothaman 84 et al., 2014) ( Figure 1a). Desi and kabuli types have sometimes been designated as subspecies 85 microsperma and macrosperma, respectively (Moreno and Cubero, 1978), although these 86 older taxonomic terms do not reflect a crossing boundary or substantial molecular genetic 87 differentiation (Varma Penmetsa et al., 2016). The desi type is considered to be ancestral and 88 resembles wild progenitors (С. reticulatum and C. echinospermum) more than kabuli. It was 89 proposed that kabuli was once selected from the local desis, and then spread; however, the 90 region of origin is not known. 91 92 We utilized the genotyped landraces from Vavilov's collection to test the ambiguities in 93 chickpea history and reconstruct migration routes of both desi and kabuli types in the following 94 way. We first obtained robust estimates of allele frequencies in 10 chickpea populations (6 95 desis: Turkey, Uzbekistan, India, Lebanon, Morocco, and Ethiopia, and 4 kabulis: Turkey, 96 Uzbekistan, Lebanon, and Morocco). For this purpose, we developed the popdisp model 97 (population dispersals), which considers geographical locations of chickpea sampling sites, the 98 nonequal number of samples in locations, and, most crucially, possible ways of chickpea 99 dispersals within a region. We examined two hypothetical dispersals for each of 10 populations 100 and get estimates of allele frequencies in populations' centers. outliers. To get more robust estimates, we developed a model, popdisp (Figure 2a), which 175 considers different scenarios for dispersals within a geographic region and takes into account 176 landrace-specific effects. The structure of the model was inspired by BayPass (Gautier, 2015), 177 a b and processing of allele frequencies was performed as in BEDASSLE (Bradburd et al., 2013) and 178 compositional data analysis (CoDA) (Pawlowsky-Glahn and Buccianti, 2011). 179 We hypothesized that each region had one trade center, where chickpea was first introduced,  180 and considered two scenarios for subsequent dispersal within the region. In the first scenario, 181 dispersal within each region proceeded by the transport of seeds to local villages via roads and 182 paths. As a result, the genetic relatedness in local landraces would be predicted by the net of 183 regional trade routes. This scenario was contrasted with simple diffusion, so that genetic 184 differences between landraces would be explained by geodesic distance. We called these two 185 scenarios "trade routes" and "linear", respectively ( The desi chickpea type resembles the wild progenitor and is considered ancestral. Its spread 227 between regions is partly known from archaeology: chickpea was domesticated in Turkey and 228 then introduced into India, Uzbekistan and Lebanon. We set these four populations as sources 229 with known phylogeny (black-coloured subtree in Figure 3b). Ethiopian and Moroccan chickpea 230 desi populations appeared later, and their sources are not known (Figure 3a). 231 232     Because desi type is considered to be more primitive and ancestral it is not unreasonable to 335 assume that kabuli's spread between centers of secondary diversification had an influence 336 from local desis. However, both kabuli's origin and migration history with possible desi 337 influences remain unclear. 338 339 To identify the origin of kabulis, we draw alternative admixture graphs of population 340 relatedness. The first assumes the dispersal of kabuli chickpea from Turkey's Fertile Crescent 341 ( Figure 4а) and the second reflects a Central Asian origin (modern Uzbekistan) with subsequent 342 movement back to Turkey (Figure 4b). Parameters for the black-coloured part of the graphs in 343 Figure 4a,b were taken from the previous analysis of desi populations, the remaining 344 parameters were estimated with the migadmi model. The optimal likelihood of the former 345 graph is higher, but not significantly. Therefore, to determine the kabuli's origin, we analysed 346 fractions of variance in each mixed population explained by its sources.  To test the migration and admixture hypotheses, we developed two methods. The first model 391 is popdisp, which estimates allele frequencies in the population, under the assumption of a 392 particular dispersal model within the region. We considered two reasonable physical agents of 393 migration: traders or diffusion that approximates continuous-time stochastic process. Our 394 assertion was that genomic resemblance between accessions can reflect either 'least-cost 395 path' trade route distance between sample sites or linear distance between them. Our 396 analyses unambiguously favour the former hypothesis ( Figure 1a). In the future it will be 397 interesting to apply this approach to species with different dispersal strategies, for instance 398 comparing crops like round-seeded chickpea to human-associated weeds like spiky-podded 399 Medicago We developed popdisp, a Bayesian hierarchical model (Figure 2a)  New variable ! # means the log-balance between frequencies of reference and alternative 484 alleles, and is not bounded, i.e., can take values in (−∞, +∞). The latter allows us to model 485 correlations between population frequencies using Multivariate normal distributions without 486 artificial truncation, which is necessary when the model operates with non-transformed 487 frequencies (Gautier, 2015). 488 489 To describe the genetic drift of allele frequencies along the binary-branching paths, we 490 modified the approach proposed in TreeMix (Pickrell and Pritchard, 2012) and BayPass (Gautier,491 2015). In the Wright Fisher model, the expected value and variance of allele frequency in -th 492 where is the amount of genetic drift, 493 which has occurred along the path from the ancestral population to -th population. To match 494 these first two moments after ilr-transformation of allele frequencies (Appendix 2), the 495 following should be satisfied: . 496

497
Using the logic of model construction from TreeMix(Pickrell and Pritchard, 2012) and Gaussian 498 model for changing log-balances, we get that ! #~O " where is proportional to 499 the cumulative path from the ancestral population to -th population. Using the Felsenstein's 500 approach (Felsenstein, 1973), we model the change of log-balances along the binary-branching 501 path with Multivariate normal distribution: 502 503 where , QQQ⃗ = @ ' # , -# , … . # A, # is the constant of proportionality specific for -th SNP, is 505 × matrix, which reflects the covariance structure between population based on the 506 binary-branching path. This path can be represented as a binary tree structure with ancestral 507 population at the root and leaves (Figure 2b). On the diagonal, matrix contains cumulative 508 branch lengths from the tree root to respective leaves, and the off-diagonal elements are equal 509 to sum of common branches for respective pair of populations (Felsenstein, 1973). We 510 compute values in matrix based on known length of binary-branching path and scale it, so 511 that the mean value of diagonal elements should equal to one. 512 513 514  515 For each SNP, model has the following parameters: the allele frequency in the ancestral 516 population, log-balances of allele frequencies for populations, and the constant of 517

Prior probabilities and MCMC
proportionality. To get estimates, we constructed Bayesian model with the following prior 518 distributions for parameters. 519 520 For " # , we proposed uninformative beta prior, @ # , # A, with uniform prior for the mean, 521 / " / " 01 " ∼ (0,1), and exponential prior for the so-called "sample size", # + # ∼ (1). 522 We also assume the exponential prior for constant of proportionality: # ∼ (1). 523 524 The complexity of the model does not allow the use of Gibbs Sampling. Instead, we performed 525 the algebraic inference of derivatives for log posterior distribution and run Hamiltonian Monte 526 Carlo sampling algorithm (Neal, 2012)  To test hypothetical migration routes of chickpea between regions, we created a model based 540 on the same assumptions as used in the model for population spread within a region. We 541 consider populations characterised with vectors of log-balances of allele frequencies, which 542 are obtained from the previous analysis. We denote log-balances of allele frequencies of -th 543 SNP in -th populations with ! # . 544

545
A migration hypothesis is set by the binary tree, which branch lengths are parameters. Based 546 on the migration hypothesis, we construct the parametrized covariance matrix and matrix 547 containing variances of differences between log-balances: !2 = !! + 22 − 2 !2 . Then, we 548 can construct the following likelihood function (Appendix 3): 549 550 where is a number of SNPs, is the matrix of log-balances for all SNPs and all populations, 552 # is a SNP-specific scale parameter. 553 554 The likelihood (2) contains a unique scale parameter, # , for each SNPs, making the model 555 overparametrized. To reduce the number of parameters, we applied the sliding window 556 technique. We divided each chromosome into overlapping windows of the same size almost 557 equal to the LD, 3 ⋅ 10 8 bp; the step parameter in the sliding window was 1 ⋅ 10 8 . As the 558 density of SNPs along chromosomes is not uniform (Supplementary File 5), windows contained 559 different numbers of SNPs; those with less than 10 SNPs were filtered out. 560 561 We assumed that SNPs within each window are probably linked and had evolved with a similar 562 rate. This assumption allows us to avoid # parameters (set it to 1), and infer objective function 563 proportional to log-likelihood (see Appendix 4): 564 565

566
where 9 ( , , ) is a root mean square distance between -th and -th populations, 567 computed on SNPs from -th window (see Appendix 4), log denotes the log-density of 568 normal distribution. We estimate parameters in matrix separately for each window. 569 570 Modeling admixture events 571 572 We developed a new model of admixtures which considers that (i) admixture events happened 573 long ago and all populations (both source and mixed) accumulated their own variance after 574 the event, (ii) number of source populations in one event are not constrained, i.e., can be 575 higher than 2, (iii) several admixture events can be analyzed simultaneously, and (iv) 576 admixtures can form a hierarchy, i.e., a mixed population in one admixture event can be a 577 source in another event. 578 579 Let population be a mixture of sources ( : , = 1, nnnnn ), which are precursors of current 580 populations ( : , = 1, nnnnn ). We parametrized this admixture event with the following variables: Projection 596 The map projection used to represent a geographic region on a flat surface plays a critical role 597 when measuring distances (such as distances between regions), areas or assessing shape or 598 direction. Whenever a spherical model of Earth is projected onto two-dimensional surface, 599 distortions of one or another kind are introduced, altering these variables to a different degree. 600 Our It is typical to use geodesic measurements of distance between pairs of points in landscape 611 genomics (Abebe et al., 2015) and although these can yield adequate results, they do not take 612 full advantage of genomic data to provide insights into historical patterns of trade and 613 diffusion. Least-cost path models (Douglas, 1994) have emerged as an explanatory framework 614 for movement of goods in archeology (Kantner, 1997). This approach of calculating the 615 distance of a path with the least "cost" (interpreted usually as change in elevation) provides a 616 mechanism, in the absence of historical data on exact movement routes, to estimate the time 617 and energy that it would have taken to travel from location to location. Pairwise distances 618 between concentrations of accessions were calculated both using geodesics as is typical in 619 landscape genomics (Abebe et al., 2015) and as least-cost paths with slope and water bodies 620 defining landscape friction, following a trend to use three-dimensional spatial modeling to 621 predict trade routes between ancient settlements (Herzog, 2014;van Lanen et al., 2015). We 622 used the hiking function, which has been used in archaeological and ethnographic applications 623 (Gorenflo and Gale, 1990) to assign resistance along with a cost surface accounting for climatic 624 conditions. 625 626 We created a cost surface using selected geo-climatic Holocene data sets, mask of water 627 bodies, and weighted elevation gradient, rescaled to a common scale. We used the following 628 climatic layers: maximum temperature of the warmest month, minimum temperature of the 629 coldest month and precipitation of wettest month for past conditions (Mid-Holocene), 630 obtained from WorldClim, Version 1.4 database, MIROC-ESM GCM (Hijmans et al., 2005). 631 Temperature and Precipitation ranges were ranked in accordance with ASHRAE Thermal 632 Comfort chart (Hoyt et al., 2013). 633 634 A slope layer was created from the world elevation (GTOPO30) and reclassified according to 635 the Tobler function (Tobler, 1993). In addition, a water mask was created to mask out water 636 bodies. We then used Weighted Overlay tool of ArcGIS to create a cost surface layer, where 637 each pixel had a value of the least accumulative cost distance from or to a source of interest. 638 Supplementary File 7 describes scheme of classification for each layer and its relative weight 639 in building cost surface. 640 641 One hypothesis is that movement between sites always goes through historical centers of 642 trade before dispersing out to rural villages. In this exploratory analysis we converted least-643 cost paths between mean centers that could have served as the foci of crop dispersion, using 644 data acquired from the Ancient World Mapping Center, UNC GIS, into vector format and 645 construct a road network for the whole area. 646 647 The cost distance layer was further used to prototype paths between cities (regional centers 648 of dispersion) as well as within each cluster. The resultant least-cost path rasters were 649 converted to vector format, cleaned of duplicates and served as base data for building a road 650 network. We then employed ArcGIS Network Analyst functionality to build a road network that 651 encountered for terrain relief and point connectivity, and to retrieve distance values between 652 and within spatial clusters. Straight-line geodesic distances were calculated with the ESRI 653 ArcGIS Near tool. 654 655 Selection of Centers of Diversification 656 We estimated the number and locations for hypothetical centers of diffusion by combining 657 current knowledge of regions that served as World Centers of Diversity (Corinto, 2014), cluster 658 analysis of our accessions' locations, and historical data for locations of ancient cities that were 659 prominent trading centers during ancient times (Ancient World Mapping Center, n.d.). 660 We applied ArcGIS clustering analysis and spatial statistics tools to group all accessions into six 661 clusters based on geographic locations and spatial constraints, and to calculate mean center 662 for each cluster. We then compared the locations of the mean centers with known ancient . 685 686 Appendix 3. Estimates for branch parameters of a tree 687 688 Let's consider populations originated from one ancestral state and a binary tree depicting 689 their migration history; all tree branch lengths are parameters. Each population is 690 characterized by log-balance of allele frequencies for a SNP, # . In the model for population 691 spread within a region, it has been assumed that ⃗ ∼ r " QQQQ⃗, ( ' , -, … , 5 ), " is the log-balance of allele frequency in the root of the tree (ancestral state). 693 However, in testing historical hypotheses, there is no given information about the ancestral 694 state: " is not known, position of the root in the binary tree is parametrized. Therefore, it is 695 impractical to include " into the model and use the above-mentioned multivariate normal 696 distribution. 697 698 To avoid the use of " , we propose an approach which considers total variance between 699 populations instead of covariance. Let covariance matrix between populations, be obtained 700 based on the fully parametrized binary tree according to Felsenstein's method (Felsenstein,701 1973) (see Example on Figure A1). Then, we can obtain a matrix , which elements are 702 proportional to variances of the difference between log-balances: Based on Gaussian changing log-balances, we get: . 708

709
To get maximum likelihood estimates of the tree branch length based one SNP, the following 710 likelihood function can be written: 711 . 712 713 714 Figure A1. Example of constructing matrix V based on the tree with parametrized branches. 715 716 717 Appendix 4. Inference of likelihood function for a set of linked SNPs 718 719 A "window" is a segment on a chromosome of length equal to a predefined value (≈LD) that 720 contains a subset of SNPs. We assumed, that, within each window, SNPs are probably linked 721 and they had evolved with a similar rate. Let 9 be a set (group) of SNPs corresponding to -722 th window, and 9 be a scale, specific for this window and reflecting the rate. For -th SNP in 723 -th population, we denote log-balances of allele frequency with ! # . Then, the Likelihood 724 function for log-balances of allele frequencies in the -th window is: 725 726 where " # is the allele frequency of the ancestral state. This value is not a parameter, is not 729 known, and plays the scale role. In line with CoDA, we estimate it as "  Consider six populations originated from one ancestral state, and a tree depicting the history 746 of the populations ( Figure A2a); ! is a normal random variable reflecting the log-balance of 747 frequencies for the SNP in population ( Figure A2a). We denote lengths of tree branches with 748 # . 749 Let the seventh population (having log-balance of frequencies for the SNP) originate by a 750 mixture event of three populations (precursors of ' , J , and 8 ), and then evolve 751 independently along the branch with the length ; ( Figure A2b). We assume that the mixture 752 event happened long ago, so that current populations # have their own evolutionary history, 753 independent from the sources # . To carefully consider the mixture event, we introduced 754 weight parameters # , # , # , as demonstrated in Figure A2b,e. In our example, the number of 755 additional parameters is 10, and the number of constraints is 4; hence, the number of free 756 parameters is 6. The number of cells in the matrix , which contain additional parameters, is 757 6, so all free parameters are identifiable in this example. However, in the extreme situation, 758 when all six initial populations can be considered as sources of the mixed one, the number of 759 free parameters reaches 12, and some of them become non-identifiable. 760 In general, when the initial tree connects KLK populations and all of them can be sources of a 761 mixed one, the number of free parameters is 2 KLK and number of cells in the matrix , which 762 contain additional parameters, is KLK . Therefore, to avoid this overparameterization we 763 introduce several constraints. First, we assume that all # are equal to each other, and this 764 assumption reduce the number of free parameters to ( KLK + 1) ( Figure A2c). Second, we set 765 the regularization on # weights using the Dirichlet prior with all concentration parameters 766 equal to 0.9: r ' , … M )*) s ∼ Dirichlet(0.9 … 0.9). Imitating absorbing states in the genetic 767 drift, this prior tends to pull some weights to zeros, i.e. to put r ' , … M )*) s vector closer to 768 the border of KLK -dimensional simplex. These two introduced restrictions make all free 769 parameters in the model identifiable. admixture; # represent the frequency balance of a SNP for -th population, is the population 777 formed with an admixture, # are the length of a tree branch, # , # , and # are a weight 778 parameters. The -table demonstrates the variance-covariance matrix for all populations  779 after re-parametrization. 780 and MixMapper (Lipson et al., 2013). We created a list of characteristics to compare the 789 packages and found that our method covers and outperforms capabilities of TreeMix and 790 MixMapper: our package copes with estimating multiple complex admixture events with more 791 than 2 sources and demonstrates the admixture patterns along the chromosomes. Moreover, 792 it has two additional features that were not accounted for in previous models. 793 794 The first feature is that, migadmi allows populations to get their own variance after admixture 795 events. In the existing approaches, it is assumed that the composite population is a weighted 796 sum of some source populations, and weights sum to 1. However, in reality, almost no 797 population is settled as a net sum of two or more. Ordinarily, when a part of one population 798 appears in a new place, it evolves some period of time getting its own variability, and then if 799 the admixture event happens, the mixed population continues to evolve. As a result, the 800 variance in the admixed population can be factored into contributions from source populations 801 and self-accumulated variance. The latter is especially important if the admixture events 802 happened long ago (e.g., as in our study). Things get more complicated when considering that 803 source populations have also evolved. To avoid modeling the mixed populations as a weighted 804 sum of source ones, we parametrized the own variance of each population after the admixture 805 event. 806

807
The second important feature of migadmi is the use of ilr-transformed allele frequency instead 808 of allele frequency itself. Allele frequencies, as fractions or percentages, are constrained (i.e. 809 sum up to 1 or 100%), which makes standard statistical methods inapplicable. For example, 810 frequencies cannot be modelled as normally distributed random variables, as the domain of 811 the normal distribution is (−∞, +∞), not [0, 1]. Another problem is presence of negative bias 812 in covariance estimates between frequencies (Aitchison, 1986). Moreover, frequency of one 813 allele is inextricably linked with frequencies of others as they sum to 1. Therefore, modeling 814 frequency changes of one allele cannot be considered without modeling changes in other 815 alleles. To correctly work with frequencies, the theory of compositional data analysis and 816 Aitchison geometry were first established in the end of previous century (Aitchison,817 1986)(Pawlowsky-Glahn and Buccianti, 2011). Following this theory, one can independently 818 analyze ( − 1) balances between frequencies, instead of frequencies. In case of biallelic 819 SNPs, the balance is the logarithm of the ratio between reference and alternative alleles, and 820 this balance takes values in (−∞, +∞). We adapted the use of balances to model changes of 821 allele frequencies in line with the Wright-Fisher drift model. The balance-based approach was 822 used in both popdisp and migadmi models. 823

824
The direct comparison of migadmi results with TreeMix and MixMapper results is not possible 825 because we used migadmi to estimate complex admixture graphs, which TreeMix and 826 MixMapper cannot cope with (Table A1). However, we performed the standard TreeMix and 827 MixMapper analyses and traced the common and different trends in results. 828 829 First, we applied TreeMix and set to estimate 4 events within 10 populations. We used TreeMix 830 in two modes: without tree root specification and with specificationof Ethiopia desi population 831 as a root, the most distinct one ( Figure A3). We also used the bootstrap with the size of 35, 832 that equals to the mean number of SNPs in our sliding window technique. Both obtained 833 admixture graphs demonstrated two expectable distant clades in trees: Uzbekistan-India and 834 Turkey-Lebanon-Morocco. However, the obtained trees also contained deviations from the 835 expectations. In the root-specified tree, the Ethiopian desi population is the source for Turkish 836 desi that contradicts the conventional story of chickpea spread ( Figure A3a). The root-837 unspecified tree contains India's influence on Moroccan desi, which is also unlikely, because 838 these populations are the most distant to each other ( Figure A3b). 839

840
On the other hand, TreeMix graphs partly support the hypothetical origin of Ethiopian and 841 Moroccan desis. The location of Ethiopian desi on the root-unspecified tree demonstrated its 842 sources from both main clades, which is in line with the mixed origin of this population. In the 843 root-unspecified tree, the Moroccan desi population is located between Turkish and Lebanese 844 populations, while in the root-specified tree, it locates close to Turkey with an admixture from 845 Lebanese desi. Therefore, we may conclude that Moroccan desi is an indirect mixture of 846 Turkish desi and Lebanese desi. 847 The origin of kabulis is impossible to infer from this tree, however, the root-specified tree 848 indicates that the Uzbeki kabuli has an admixture from the Turkey-Morocco clade, that is in 849 line with our hypothesis, that Uzbeki kabuli is not the source of other kabulis. 850 851 852 Figure A3. Admixture graphs obtained with the TreeMix package for (a) unrooted tree and ( The most pronounced difference between desi and kabuli chickpea types is the flower color. 881 In legumes, this trait is Mendelian and controlled by the so-called A gene (Hellens et al., 2010). 882 For Pisum sativum and Medicago truncatula, the sequences of this gene can be found at 883 GenBank accessions: GU132940 (MtbHLH) and GU132941 (PsbHLH). We took these 884 sequences, performed the tBLASTn search against Cicer ariethinum genes, and found the 885 match with basic helix-loop-helix protein A located at LOC101506726 locus (2149255-886 2158629bp, the beginning of chromosome 4). 887 To verify that this region is associated with desi/kabuli difference, we performed GWAS 888 analysis on the binary trait (belonging to desi or kabuli) using rrBLUP. We found one significant 889 SNP which is located very close to the found homologous LOC101506726 locus. Therefore, we 890 suppose that this locus can be considered as a marker locus for kabuli. 891  We would like to thank Magnus Nordborg for discussions and his helpful advice on the paper 916 structure.   Figure 9. Correspondence between mean SNP frequencies in 6 desi populations and SNP frequencies estimated by two more robust methods. For each method, we took into account the regional distribution of samples: samples in each population belong to geographical locations. For each SNP, we estimated the mean allele frequency in each location, " ! $ !"#,% , and then applied two methods. The first method (brown dots) reflects the median values across " ! $ !"#,% . The second method (green dots) corresponds to the calculation of the center composition as in the compositional data analysis (CoDA). Together with mean allele frequencies in locations, this method considers frequencies of the second . Obtained values for each SNP are "averaged" allele frequencies in a population in line with CoDA. Analysis of brown dots shows long vertical ranges at 0 and 1, indicating the prevalence of locations with homozygous SNPs, which is not caught by calculations of means. The CoDA-based method not only highlights the prevalence of SNP homozygosity but also softly accounts for minor heterozygosity. As our popdisp method, both methods (more robust than mean values) demonstrate S-like shape dependency between the mean and estimated SNP frequencies.