India represents an intricate tapestry of population sub-structure shaped by geography, language, culture and social stratification operating in concert [1–3]. To date, no study has attempted to model and evaluate how these evolutionary forces have interacted to shape the patterns of genetic diversity within India. Geography has been shown to closely correlate with genetic structure in other parts of the world [4, 5]. However, the strict endogamy imposed by the Indian caste system and the large number of spoken languages add further levels of complexity. We merged all publicly available data from the Indian subcontinent into a dataset of 835 individuals across 48,373 SNPs from 84 well-defined groups [2, 6–9]. Bringing together geography, sociolinguistics and genetics, we developed COGG (Correlation Optimization of Genetics and Geodemographics) in order to build a model that optimally explains the observed population genetic sub-structure. We find that shared language, rather than geography or social structure, has been the most powerful force in creating paths of gene flow within India. Further investigating the origins of Indian substructure, we create population genetic networks across Eurasia. We observe two major corridors towards mainland India: one through the Northwestern and another through the Northeastern frontier, with the Uygur population acting as a bridge across the two routes. Importantly, network analysis, ADMIXTURE analysis and f3 statistics support a far northern path connecting Europe to Siberia and gene flow from Siberia and Mongolia towards Central Asia and India.
The genetic structure of human populations reflects gene flow around and through geographic, linguistic, cultural, and social barriers. We set out to explore how the complex interplay of these factors may shape patterns of genetic variation, focusing on India, a country with an intriguingly complex population structure. The caste system in India has been documented since 1500-1000 BC and has imposed strict rules of endogamy over the past several thousand years. Social stratification within India may be summarised into the so-called Forward Castes and Backward Castes [10], while 8.2% of the total population belongs to Scheduled Tribes: minorities that lie outside the caste system, still rely largely on hunting, gathering and unorganized agriculture, and have no written form of language [11]. Furthermore, there are 22 official languages within India, which also follow a distinctive geographic spread. The Dravidian (DR) speaking groups inhabit southern India, Indo-European (IE) speakers inhabit primarily northern India (but also parts of west and east India) and Tibeto-Burman (TB) speakers are mostly confined to northeastern India. The numerically small group of Austro-Asiatic (AA) speakers, who are exclusively tribal and are thought to be the original inhabitants of mainland India, inhabit fragmented geographical areas of eastern and central India. Previous studies have uncovered four ancestral components within India [2, 8, 9], representing Northern India, Southern India, Austroasiatic speakers and Tibeto-Burman speakers. Furthermore, it has been shown that prior to the establishment of the caste system, there was wide admixture across tribes and castes in India, which came to an abrupt end 1,900 to 4,200 years before present [8].
Starting from all publicly available data from the Indian subcontinent (835 individuals, see Figure 1A and Supplementary Table 1) and unlike prior studies [9, 12], we created a normalized data set over castes, tribes, geographical locations, and language families that guarantees an approximately equal representation of endogamous populations, geographical locations, and language groups (a total of 368 individuals from 33 populations genotyped across 48,373 SNPs). In other regions of the world, it has often been observed that individuals from the same geographical region cluster together and that the top two principal components are well correlated with geography, namely longitude and latitude [13, 14]. For instance, within Europe, the squared Pearson correlation coefficient r² was 0.77 between the top singular vector of the genetic covariance matrix and latitude (north-south), and 0.78 between the second singular vector of the same matrix and longitude (east-west). In order to explore whether Indian genetic information mirrors geography, we computed the top two principal components using EIGENSTRAT [15] and plotted the top two left singular vectors of the resulting genetic covariance matrix (Figure 1B). It is straightforward to observe that the IE and DR speaking populations form a long cline, while the AA and TB speakers form separate clusters. We computed the squared Pearson correlation coefficient (r²) between the top two left singular vectors of the covariance matrix (denoted PC1 and PC2) and the geographic coordinates (longitude and latitude) of the samples under study, and we observed r² = 0.604 for PC1 vs. longitude and r² = 0.065 for PC2 vs. latitude. Thus, PC1 recovers a substantial fraction of the variation in longitude, but PC2 almost entirely fails to recover latitude. These findings are in sharp contrast with findings within the European continent [4, 9, 16].
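The comparison between principal components and geography described above can be sketched in a few lines. This is an illustration only, not the EIGENSTRAT pipeline used in the study: the function names `top_two_pcs` and `r_squared` are hypothetical, the genotype matrix is assumed to be samples × SNPs with allele counts 0/1/2, and the sketch only mean-centers each SNP, whereas EIGENSTRAT additionally normalizes SNPs by an allele-frequency-dependent factor. The data in the demo are synthetic.

```python
import numpy as np

def top_two_pcs(genotypes):
    """Top two left singular vectors of the SNP-centered genotype matrix.

    genotypes: (samples x SNPs) array of allele counts (assumed layout).
    The left singular vectors of the centered matrix are the eigenvectors
    of the sample-by-sample covariance matrix.
    """
    centered = genotypes - genotypes.mean(axis=0)
    left_vectors, _, _ = np.linalg.svd(centered, full_matrices=False)
    return left_vectors[:, 0], left_vectors[:, 1]

def r_squared(x, y):
    """Squared Pearson correlation coefficient between two vectors."""
    r = np.corrcoef(x, y)[0, 1]
    return r * r

# Synthetic demo: allele frequencies drift with longitude only.
rng = np.random.default_rng(0)
m, n_snps = 100, 500
longitude = rng.uniform(70, 95, m)   # made-up coordinates
latitude = rng.uniform(8, 35, m)
freq = 0.2 + 0.6 * (longitude - 70) / 25
genotypes = rng.binomial(2, freq[:, None], size=(m, n_snps)).astype(float)

pc1, pc2 = top_two_pcs(genotypes)
print("r2(PC1, longitude):", r_squared(pc1, longitude))
print("r2(PC2, latitude): ", r_squared(pc2, latitude))
```

Because the sign of a singular vector is arbitrary, the squared coefficient is the natural quantity to report.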
ADMIXTURE analysis is consistent with previous studies, showing high degrees of shared ancestry across castes, but also across castes and tribes, thus supporting the notion that a demographic shift from wide admixture to endogamy occurred recently in Indian history (Figure 2, Supplementary Figure 1). Our meta-analysis of the ADMIXTURE output [17] shows that the IE and DR populations across castes share very high ancestry, indicating the autochthonous origin of the caste system in India (Figure 2). f3 statistics show that most of the castes and tribes in India are admixed, with contributions from other castes and/or tribes, across language affiliations (Supplementary Table 4 and Supplementary Note). The geographically isolated Tibeto-Burman tribes and the Dravidian speaking tribes appear to be the most isolated in India. Linear Discriminant Analysis on the normalized data set clearly supports genetic stratification by castes and languages in the Indian sub-continent (Supplementary Figures 3A and 3B).
In order to understand the genetic substructure of India, considering the strongly endogamous social structure as well as the presence of multiple language families, we developed COGG (Correlation Optimization of Genetics and Geodemographics). COGG is a novel method that correlates genome-wide genotypes, as represented by the top two principal components, with geography (longitude and latitude) and sociolinguistic factors (caste and language information in this case). The need for such methods has been pointed out by many studies [3, 9, 18–26]. Given information on m samples, the objective of COGG is to maximize the correlation between the genetic component, as represented by the top singular vectors of the genetic covariance matrix formed from the genotypic data, and a matrix containing information on geography, castes, tribes, and languages for each sample. More precisely, let u be the m-dimensional vector that represents either PC1 or PC2. Let G be the Geodemographic Matrix (an m × k matrix, where k is the number of geodemographic attributes that will be studied). Then, COGG seeks to maximize
$$\max_{a} \ \mathrm{Corr}(u, Ga) = \max_{a} \ \mathrm{Corr}\Big(u, \sum_{i=1}^{k} a_i G_i\Big). \tag{1}$$
In the above, a is an (unknown) k-dimensional vector whose elements are the ai’s; we use Gi to denote the i-th column of the matrix G as a column vector. In our experiment, G has nine columns (i.e., k = 9): longitude and latitude are represented as numeric values, while caste, tribe and language information is encoded as zero-one indicator variables. We analytically solved the optimization problem of eqn. (1) to obtain a closed-form solution for amax (see Supplementary Note). Plugging the solution for amax into our data, we obtain a squared Pearson correlation coefficient r² = 0.93 for PC1 vs. G and r² = 0.85 for PC2 vs. G. Thus, we recover almost all of the genetic structure of the Indian subcontinent using the Geodemographic matrix G instead of just longitude and latitude: the values of r² increase from 0.6 to 0.93 for PC1 and from 0.06 to 0.85 for PC2. This large improvement comes from considering endogamy and language families, two attributes that are pivotal in studying the genetic stratification of Indian populations, and it is statistically significant (Figure 3).
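To make the optimization concrete, the following is a minimal sketch under stated assumptions, not the released COGG code. It relies on a standard fact: maximizing a Pearson correlation over linear combinations of the columns of G is equivalent to an ordinary least-squares fit of the centered u on the centered columns of G, and the maximal r² equals the R² of that fit. The function name `cogg`, the column order of G and the synthetic inputs are all hypothetical. `numpy.linalg.lstsq` is used because a full zero-one encoding of castes and languages makes the centered columns linearly dependent; it returns the minimum-norm solution in that case.

```python
import numpy as np

def cogg(u, G):
    """Return (a_max, r2): weights maximizing Corr(u, G @ a) and the
    resulting squared Pearson correlation.

    u: (m,) vector, e.g. PC1 or PC2.
    G: (m, k) geodemographic matrix (numeric and indicator columns).
    """
    u_c = u - u.mean()
    G_c = G - G.mean(axis=0)
    # Least-squares fit; handles rank-deficient G (minimum-norm solution).
    a_max, *_ = np.linalg.lstsq(G_c, u_c, rcond=None)
    fitted = G_c @ a_max
    denom = (u_c @ u_c) * (fitted @ fitted)
    r2 = float((u_c @ fitted) ** 2 / denom) if denom > 0 else 0.0
    return a_max, r2

# Synthetic demo with k = 9 columns: longitude, latitude,
# three caste indicators and four language indicators (made-up labels).
rng = np.random.default_rng(0)
m = 50
lon, lat = rng.uniform(70, 95, m), rng.uniform(8, 35, m)
caste = rng.integers(0, 3, m)
lang = rng.integers(0, 4, m)
G = np.column_stack([lon, lat, np.eye(3)[caste], np.eye(4)[lang]])
u = 0.5 * (lang == 2) - 0.3 * (caste == 0) + 0.05 * rng.normal(size=m)

a_max, r2 = cogg(u, G)
print("maximal r2 using all nine columns:", r2)
```

Any positive rescaling of `a_max` gives the same correlation, so only the direction of the weight vector is meaningful.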
In order to formally investigate which of the nine features (columns) of the geodemographic matrix G contribute most to the optimization problem of eqn. (1), we used the sparse approximation framework and the Orthogonal Matching Pursuit (OMP) algorithm from applied mathematics [27] (see Supplementary Note). Running OMP on our dataset, we obtain two sets of three features each, S1 and S2, for PC1 and PC2 respectively:
Plugging in S1 and S2 as the reduced feature spaces in COGG resulted in r² = 0.92 for PC1 vs. S1 and r² = 0.85 for PC2 vs. S2; these values capture approximately 99% or more of the values returned by COGG when all the features in G are included. Our feature selection approach for COGG explains the influence of sociolinguistics in shaping the genetic structure of the region, identifying membership of the AA or TB language group (which mostly consist of Backward Caste and Tribal groups), Forward Caste (usually found in the IE and DR language groups), and latitude as the most significant geodemographic features that correlate with genetic structure within India, highlighting the language-caste interplay.
We proceeded to explore the structure of the Indian sub-continent in relation to the rest of Eurasia by analysing a dataset of 1,332 individuals over 42,975 SNPs (Supplementary Table 1), sampled from 73 populations. Meta-analysis of the ADMIXTURE output reveals that, overall, Indian populations share a large proportion of ancestry with the so-called Indian Northwestern Frontier populations, namely the tribal populations spanning Afghanistan and Pakistan (Figure 4). In concordance with previous studies, we find higher degrees of shared ancestry of Central Asian populations with IE and DR Forward Castes [12, 20, 28]. IE Forward Castes also share large amounts of ancestry with other IE speaking populations (i.e., Europeans). However, IE and TB speakers as well as DR speaking Castes also share considerable amounts of ancestry with the Uygurs. On the other hand, AA speakers, who have been suggested as the earliest settlers of India [20, 29], appear more isolated. TB speakers share very high amounts of ancestry with populations from China but also Mongolia and Siberia.
PCA uncovers a structure that resembles a triangle, with Europeans residing in one corner, the Chinese in another corner and the Dravidian and Austro-Asiatic speaking tribal populations of India occupying the third corner (Figure 5A). Siberians, Mongols and Uygurs stretch towards India’s Northwestern Frontier, while Tibeto-Burman speaking Indians connect India to China. We employed a population network analysis approach [30] in order to trace the gene-flow paths towards the Indian subcontinent (Figure 5B). Within India, IE, TB and AA Tribes are major nodes connecting to multiple populations. Tibeto-Burman Tribes stand at the Northeastern gateway from China to India, while IE Forward Castes are at the entry point from the Northwestern frontier. Considering the whole of Eurasia, we observe three major paths leading to the two entry points of India: from Europe to Central Asia and the Indian Northwestern Frontier, and from Northern Europe to Siberia and then Mongolia, then splitting towards China and Northeast India on one hand or the Uygurs, Central Asia and Northwestern India on the other hand. f3 tests [31] (Figure 6) and TreeMix [32] analyses also support the notion that IE and TB Forward Castes have arisen through admixture of populations originating from the Caucasus and Mongolia (Supplementary Table 3, Supplementary Figure 4, and Supplementary Note). Previous studies have also supported a north-western and north-eastern corridor of migration towards India. However, this is the first study to connect the two paths through the populations of Siberia and Mongolia.
In summary, we present a novel method that builds a model correlating geographic, social, cultural and linguistic factors with genetic structure. The method is of independent interest and can be used to analyze any dataset of genotypic data where side information (e.g., geographic locations and/or other demographic information) for the samples is known. We are thus able to uncover the major forces that have shaped population genetic structure within India. Furthermore, through population genetic networks, ADMIXTURE analysis and f3 tests, we have drawn paths of migration and gene flow throughout Eurasia, bringing out the importance of an ancient northern route moving from Europe through Siberia and Mongolia and merging back towards Central Asia and India. The ability to correlate genomic background with geographic, social and cultural differences opens new avenues for understanding how human history and mating patterns translate into the genomic structure of extant human populations.
Code Availability
All code (including source files) is available at https://github.com/aritra90/COGG.
Data Availability
We have used publicly available data sets along with data reported by other studies. Our data sets will be made available upon request to the corresponding authors.
Author Contributions
A.B., P.D. and P.P. conceived and designed the project. A.B. gathered samples from various sources and performed the data analyses after discussing with D.E.P., P.D. and P.P. D.E.P. performed and wrote the LDA analysis. L.P. participated in and discussed analyses. A.B., P.D. and P.P. wrote the manuscript.
Online Methods
Samples
We used PLINK [33, 34] to assemble genome-wide data for 839 samples from 87 well-defined sociolinguistic groups (see Supplementary Table 1) genotyped on 48,225 SNPs. These samples were collected from various sources [2, 6–9] with the consent of the corresponding authors. We created and tested subsets of this dataset in order to construct an equal representation of castes, tribes, language families and geographical locations for this study. The normalized subset for which we have reported results for the Indian populations contains 368 samples from 33 populations genotyped for 48,326 SNPs.
We merged reference populations from Eurasia and Southeast Asia, collected from various publicly available sources such as HGDP [35], the Estonian Biocenter [36–42] and the Allele Frequency Database (ALFRED) [43] with our normalized Indian dataset to create a merged data set of 1,332 samples from 73 population groups genotyped on 42,975 SNPs (Supplementary Table 1).
PCA and LDA
We used the smartPCA program of the EIGENSOFT package 6.1.4 [15] as well as our own MATLAB implementation of PCA [44, 45]. We also implemented our own version of Linear Discriminant Analysis.
COGG and feature selection using Orthogonal Matching Pursuit
COGG stands for Correlation Optimization of Genetics and Geodemographics and maximizes the correlation between one of the top two principal components and the Geodemographic matrix, containing geographical coordinates, caste, tribe and language information encoded as indicator variables. We restrict our encoding into three castes: Forward castes, Backward castes and Tribal or nomadic hunter gatherers. u is the vector containing either one of the top two principal components, computed by EIGENSTRAT [15]; the Geodemographic matrix is denoted by G. The caste (Forward, Backward and Tribals) and language (AA, DR, IE, TB) encoding was performed as follows:
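Let a be the k-dimensional vector whose elements are a1 … ak (in our case, k = 9). COGG solves the following optimization problem (see Supplementary Note for details):
$$\max_{a} \ \mathrm{Corr}(u, Ga) = \max_{a} \ \frac{\mathrm{Cov}(u, Ga)}{\sqrt{\mathrm{Var}(u)\,\mathrm{Var}(Ga)}}$$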
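Recall that Gi denotes the i-th column of G as a column vector. Let $d_i = \mathrm{Cov}(u, G_i)$ for i = 1 … k and let d be the vector of the di’s. Also, let $M_{ij} = \mathrm{Cov}(G_i, G_j)$ for all i, j = 1 … k and let M be the matrix of the Mij’s. Then the optimizer for COGG is given by
$$a_{\max} = M^{+} d,$$
where $M^{+}$ denotes the Moore-Penrose pseudoinverse of M (equal to $M^{-1}$ when M is invertible); any positive multiple of $a_{\max}$ attains the same correlation.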
We also check the statistical significance of the maximum squared Pearson correlation coefficient r² returned by COGG by randomly permuting the columns corresponding to castes and languages in G over 1,000 iterations and calculating amax for each iteration; we report the histogram of the resulting r² values.
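The permutation check can be sketched as follows. This is an illustration under assumptions rather than the authors' script: the names `max_r2` and `permutation_null` are hypothetical, and "permuting the columns" is read here as shuffling the caste and language entries across samples (one joint shuffle per iteration) while leaving longitude and latitude in place. The maximal r² is obtained from a least-squares fit, which is equivalent to maximizing the Pearson correlation over linear combinations of the columns.

```python
import numpy as np

def max_r2(u, G):
    """Maximal squared Pearson correlation between u and any G @ a."""
    u_c = u - u.mean()
    G_c = G - G.mean(axis=0)
    a, *_ = np.linalg.lstsq(G_c, u_c, rcond=None)
    fitted = G_c @ a
    denom = (u_c @ u_c) * (fitted @ fitted)
    return float((u_c @ fitted) ** 2 / denom) if denom > 0 else 0.0

def permutation_null(u, G, social_cols, n_iter=1000, seed=0):
    """Null distribution of r2 when the sociolinguistic columns are
    shuffled across samples; geographic columns stay untouched."""
    rng = np.random.default_rng(seed)
    null = np.empty(n_iter)
    for t in range(n_iter):
        permuted = G.copy()
        order = rng.permutation(G.shape[0])
        permuted[:, social_cols] = G[order][:, social_cols]
        null[t] = max_r2(u, permuted)
    return null

def empirical_p_value(observed, null):
    """Fraction of permutations reaching the observed r2 (add-one rule)."""
    return (1 + np.sum(null >= observed)) / (1 + null.size)
```

A histogram of the returned `null` array, with the observed r² marked on it, corresponds to the kind of figure described above.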
We used a greedy feature selection algorithm described in [27] to select features of the Geodemographic matrix G. We obtain two sets of the three most significant features from the nine features in G, one for PC1 and the other for PC2. The algorithm is described in detail in the Supplementary Note. In words, it selects the column of G that yields the maximum r² value and then projects G (and u) onto the subspace perpendicular to the selected column in order to form G′ (and u′). We iterate the process until the required number of features has been selected from G.
All the values returned by this method are statistically significant, as random permutations of the elements of the features in S1 and S2 recover almost nothing. We also checked all possible sets of three features exhaustively and concluded that (for both PC1 and PC2) S1 and S2 return the maximum correlation.
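The greedy procedure described in words above can be sketched as follows. This is a minimal illustration, not the implementation from the Supplementary Note: the function name `greedy_select` is hypothetical, and columns and u are mean-centered first so that the score of each column is its squared Pearson correlation with the current residual of u.

```python
import numpy as np

def greedy_select(u, G, n_features=3):
    """Greedily pick column indices of G, OMP-style.

    At each step: score every remaining column by its squared correlation
    with the residual of u, pick the best one, then project both the
    residual and the remaining columns onto the subspace perpendicular
    to the picked column.
    """
    resid = u - u.mean()
    cols = G - G.mean(axis=0)
    selected = []
    for _ in range(n_features):
        col_norms = np.linalg.norm(cols, axis=0)
        denom = (col_norms * np.linalg.norm(resid)) ** 2
        scores = np.zeros(cols.shape[1])
        usable = denom > 1e-12
        scores[usable] = (cols[:, usable].T @ resid) ** 2 / denom[usable]
        scores[np.array(selected, dtype=int)] = 0.0
        best = int(np.argmax(scores))
        if scores[best] <= 0.0:
            break  # nothing left that correlates with the residual
        selected.append(best)
        q = cols[:, best] / col_norms[best]
        resid = resid - q * (q @ resid)
        cols = cols - np.outer(q, q @ cols)
    return selected
```

For a matrix with only nine columns, the result can also be cross-checked by scoring all 84 three-column subsets exhaustively, in the spirit of the exhaustive check mentioned above.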
Estimating population admixture
We used the ADMIXTURE v1.22 software [46] for all admixture analyses and used our in-house script to plot the admixture estimates. Before running ADMIXTURE, we pruned for LD using PLINK [33, 34] by setting --indep-pairwise 50 10 0.8. To determine the optimal number of ancestral populations (K), we varied K between two and eight, iterating until convergence for each value of K. We also performed a quantitative analysis of ADMIXTURE’s output using a method described and implemented in [17]. To visualize the results of this quantitative analysis, we designed a color-coding scheme, where the highest shared ancestry between two populations is black and the lowest shared ancestry is white. All intermediate values of shared ancestry follow a gradient from white to black.
Three population statistics, network analysis, and TreeMix
We used ADMIXTOOLS [31] to compute f3 statistics for our data sets to find signs of admixture using the qp3Pop program. To better visualize and understand the connection between the populations included in our study, we performed a network analysis on the results of ADMIXTURE, using a method presented by a previous study [30]. Finally, TreeMix [32] was used to analyze the population divergence, mainly for the IE language dispersal into the Indian subcontinent. We used migration values from zero to eight to infer language dispersal routes.
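For readers unfamiliar with the three-population test, the quantity behind it can be sketched directly from allele frequencies. This is a simplified illustration, not a substitute for qp3Pop: the function names are hypothetical, the estimator below is the naive average of (c − a)(c − b) over SNPs, and ADMIXTOOLS additionally corrects for finite sample size and defines jackknife blocks along the genome, neither of which is reproduced faithfully here. A significantly negative value for a target population C with sources A and B is the usual signature of admixture.

```python
import numpy as np

def f3_naive(c, a, b):
    """Naive f3(C; A, B): mean over SNPs of (c - a) * (c - b),
    where c, a, b are per-SNP allele frequencies in each population."""
    return float(np.mean((c - a) * (c - b)))

def f3_block_jackknife(c, a, b, n_blocks=20):
    """Estimate, standard error and z-score via a delete-one-block
    jackknife over consecutive SNP blocks (blocks by index, assumed)."""
    terms = (c - a) * (c - b)
    blocks = np.array_split(terms, n_blocks)
    total, n = terms.sum(), terms.size
    # Leave-one-block-out estimates of the mean.
    loo = np.array([(total - blk.sum()) / (n - blk.size) for blk in blocks])
    estimate = terms.mean()
    se = np.sqrt((n_blocks - 1) / n_blocks * np.sum((loo - loo.mean()) ** 2))
    return float(estimate), float(se), float(estimate / se)

# Synthetic demo: C is an equal mixture of A and B, so f3 is negative.
rng = np.random.default_rng(0)
a = rng.uniform(0.05, 0.95, 2000)
b = rng.uniform(0.05, 0.95, 2000)
c = 0.5 * a + 0.5 * b
print(f3_block_jackknife(c, a, b))
```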
Acknowledgements
This study was supported by NSF IIS-1319280, NSF IIS-1661760, and IBM. Part of this work was done at IBM TJ Watson Research Center where AB was an intern. We thank D. Reich and P. Moorjani for sharing genotypic data of 248 samples from [2] and 378 samples from [8]. We also thank P. P. Majumder who allowed us to use the genotypic data from 367 samples from [9].