Single haplotype admixture models using large scale HLA genotype frequencies to reproduce human admixture

The human leukocyte antigen (HLA) is the most polymorphic region in humans. Anthropologists use HLA to trace populations’ migration and evolution. However, recent admixture between populations can mask the ancestral haplotype frequency distribution. We present a statistical method that uses high-resolution HLA haplotype frequencies to resolve population admixture within a non-negative matrix factorization formalism, and we validate it using haplotype frequencies from 56 world populations. The result is a minimal set of source components (SCs) decoding roughly 90% of the total variance in the studied admixtures. These SCs agree with the geographical distribution, phylogenies, and recent admixture events of the studied groups. With the growing population of multi-ethnic individuals, and of individuals who do not report race/ethnic information, the HLA matching process for stem-cell and solid organ transplants is becoming more challenging. The presented algorithm provides a framework that facilitates the breakdown of highly admixed populations into SCs, which can be used to better match the rapidly growing population of multi-ethnic individuals worldwide.


Author Summary

Human Leukocyte Antigen (HLA) is known to be the most polymorphic region in the human genome. Anthropologists frequently use HLA to trace migration and evolution of different populations. This is due to the high linkage among HLA genes leading to the transmission of intact haplotypes from parents to offspring, hence preserving key population ancestral features. We developed a new HLA-based method to identify admixture models in mixed populations using high-resolution HLA haplotype frequencies. Our results highlight that a single highly polymorphic locus can contain enough information to clearly map human admixture and the population genetics of the different human populations, and to reproduce results based on SNP arrays.

The presented algorithm is validated using haplotype frequencies sampled from 56 worldwide populations. Under such factorization we demonstrate that 90% of the variance in these populations can be explained using a much-reduced set of 8 ethnic groups. We demonstrate that the estimated ethnic groups and admixture models agree with the geographical distribution, population phylogenies and recent historic admixture events of the studied populations.

Human Leukocyte Antigen (HLA) plays a crucial role in both the adaptive and innate immune system and has proven instrumental in multiple medical disciplines, including matching for solid organ and hematopoietic stem-cell transplantation (HSCT) [1][2][3][4]. As generations of humans migrated throughout the world, HLA has evolved with each population conserving key ancestral features [5], and with more than 15,000 defined alleles to date [6] and over 1,000,000 haplotypes [7], it stands as the most polymorphic region in the human genome.

The factorization is initialized with a seed (e.g. A_0 and/or B_0), which initiates the expectation-maximization (EM) process. In each iteration, a new non-negative matrix A or B is calculated. Then, the newly obtained matrix is used for calculating the complementary matrix. Iterations are repeated until converging to a locally optimal matrix factorization [43]. There are currently multiple standard algorithms for NMF [43]. These algorithms approximate a non-negative matrix (a matrix where all elements are non-negative) as a product of two low-rank non-negative matrices. The results of the approximation are affected by the loss function used and by the weighting of the input columns. In the current context, we produce a haplotype frequency matrix of all HLA haplotype frequencies for over 50 populations and present it as the product of a haplotype frequency matrix of eight original populations and a matrix of mixing factors.
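The alternating scheme described above can be illustrated with a minimal sketch. It uses the Lee-Seung multiplicative updates for a squared-error (Frobenius) loss, which is one standard NMF algorithm; the text does not state which update rule, loss function, or column weighting was actually used, so this should be read as an assumption rather than as the authors' implementation. The function name `nmf_multiplicative`, the matrix orientation (haplotypes in rows, populations in columns), the random seeding, and the final rescaling step are likewise illustrative choices.

```python
import numpy as np

def nmf_multiplicative(F, k, n_iter=500, eps=1e-12, seed=0):
    """Approximate a non-negative matrix F as B @ A with B, A >= 0.

    Assumed layout (not specified in the text):
      F: haplotypes x populations  (observed haplotype frequencies)
      B: haplotypes x k            (frequencies in k original populations)
      A: k x populations           (mixing factors)
    """
    rng = np.random.default_rng(seed)
    n_hap, n_pop = F.shape
    B = rng.random((n_hap, k)) + eps  # seed B_0
    A = rng.random((k, n_pop)) + eps  # seed A_0
    for _ in range(n_iter):
        # Update A with B held fixed, then B with the newly obtained A.
        A *= (B.T @ F) / (B.T @ B @ A + eps)
        B *= (F @ A.T) / (B @ A @ A.T + eps)
    # Rescale each column of B to sum to 1 (a frequency distribution) and
    # move the scale into A, which leaves the product B @ A unchanged.
    scale = B.sum(axis=0)
    B /= scale
    A *= scale[:, None]
    return B, A
```

Called as `B, A = nmf_multiplicative(F, k=8)`, the sketch returns one locally optimal factorization; different seeds can converge to different local optima, which is why the choice of seed matters in practice.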

We applied a Neighbor-Joining (NJ) algorithm to construct a phylogenetic tree of the studied populations [45,46]. Following the tree construction, populations were grouped by the most dominant OP obtained from the NMF analysis. To validate the NMF algorithm, we investigated the similarity between the tree structure and the phylogenetic grouping.

[...] 7-10 OPs (Fig. 1A). Note that non-scaled solutions produce OPs strongly biased toward highly admixed groups (Supp. Mat. Fig. S1) and produce higher error rates (Fig. 1A).

Results

Our results show that eight OPs are sufficient to represent the admixture in the 56 studied populations. We observed that the diversity level varied widely among populations; in some groups the admixture was predominantly represented by one OP, as in the case of the US Japanese and Filipino populations, while other populations, such as US Middle Eastern, New Zealand and Australian Aboriginal, were quite admixed (Fig. 1B). These results are consistent when the admixture method and number of OPs are varied (Supp. Mat. Fig. S1). The difference in composition can be observed in the entropy of the admixture vector (ordinate axis in Fig. 1C).
The highly admixed populations with higher diversity in haplotype frequency may be challenging to properly classify, as can be seen from the correlation between the admixture error E_i and the admixture entropy (Fig. 1C).
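As a small illustration of the admixture entropy discussed here, the sketch below computes the Shannon entropy of one population's vector of OP proportions. The text does not give the exact definition (logarithm base, normalization), so the natural logarithm and the renormalization to a sum of 1 are assumptions, and the function name `admixture_entropy` is hypothetical.

```python
import math

def admixture_entropy(proportions):
    """Shannon entropy (in nats) of one population's admixture vector.

    The vector is renormalized to sum to 1 before the entropy is taken;
    zero entries contribute nothing to the sum.
    """
    total = sum(proportions)
    if total <= 0:
        raise ValueError("admixture proportions must have a positive sum")
    probs = [p / total for p in proportions]
    return -sum(p * math.log(p) for p in probs if p > 0)
```

Under this definition a population represented by a single OP has entropy 0, and a population spread evenly over eight OPs has the maximum entropy, ln 8.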
Note that the observed correlation may result from a confounding factor such as the population size. To neutralize such an effect, a partial correlation was performed (i.e. a correlation over the residuals of the regression on the population size), yielding a Spearman partial correlation of 0.4 (p < 1e-4). Interesting correlations also emerge between the entropy of the haplotype frequencies and the error, with a low error for high entropy (a more uniform distribution of haplotypes). Moreover, as could be expected, population size is correlated with a lower error and a lower entropy of A, reflecting the bias toward a more precise representation of larger populations (Supp. Mat. Fig. S2).
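One common way to compute a partial Spearman correlation is sketched below: rank-transform all three variables, regress the ranks of the two variables of interest on the ranks of the covariate, and correlate the residuals. Whether the analysis here regressed on ranks or on raw population sizes is not stated, so the rank-based regression is an assumption, and the helper names are hypothetical. The sketch returns only the coefficient, not a p-value.

```python
import numpy as np

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def linear_residuals(y, z):
    """Residuals of the least-squares fit y ~ a + b * z."""
    design = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

def partial_spearman(x, y, z):
    """Spearman correlation of x and y after removing the effect of z."""
    rx, ry, rz = average_ranks(x), average_ranks(y), average_ranks(z)
    ex, ey = linear_residuals(rx, rz), linear_residuals(ry, rz)
    return float(np.corrcoef(ex, ey)[0, 1])
```

In the setting above, `x` would hold the per-population admixture error, `y` the admixture entropy, and `z` the population size. If either residual vector is constant, the coefficient is undefined and `np.corrcoef` returns NaN.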

The admixture results (Fig. 1B) are consistent with the phylogenetic analysis based on the Fst distance between the HLA haplotype frequencies of the observed populations (Fig. 2A), where leaves are colored per [...]. The tree splits into primary ethnic groups. Similar populations from different registries (e.g. [...]

[...] admixture was estimated using the NMF-based analysis and HLA haplotype frequencies. As seen in Fig. 3A, significant similarities can be seen between the NMF-based (top) and MCMC-estimated (bottom) admixtures, especially in Chinese, Japanese, African and European populations. We observed more splits in the Latin-American 1KG populations (MXL, PUR, CLM) in the MCMC admixture. This could be the result of over-representation of these samples in the 1KG project compared with the registry samples. Note that the registry data do not contain a specific Iberian Spanish population, and the European admixture in registry Latin American populations is usually lower than in Iberian populations because of the Amerindian influence.

One application of the presented methods is the detection of the composition of untested groups to understand their sub-structure. To estimate the composition of unknown populations, we performed an admixture analysis using the estimated OPs to compute admixture in groups from the Canadian OneMatch registry, without the constraint that the entire composition should be explained by a combination of the OPs: [...]

HLA haplotype frequencies from the Canadian registry were estimated at a four-locus resolution. To estimate the admixture of the Canadian populations, we reduced the resolution of the OP haplotype frequency distribution to four loci by summing over all five-locus haplotypes that differ only in HLA-DQB1. We then computed the admixture of the new observed populations using Eq. (5).
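The reduction from five-locus to four-locus resolution amounts to marginalizing the frequency table over HLA-DQB1. A minimal sketch follows, assuming haplotypes are stored as `~`-delimited strings of `LOCUS*allele` fields (for example `A*01:01~C*07:01~B*08:01~DRB1*03:01~DQB1*02:01`); this storage format and the function name `drop_locus` are assumptions made for illustration, not details taken from the text.

```python
from collections import defaultdict

def drop_locus(freqs, locus="DQB1", sep="~"):
    """Sum haplotype frequencies over all alleles of one locus.

    freqs maps a haplotype string to its frequency. Haplotypes that differ
    only at `locus` collapse onto the same reduced haplotype, and their
    frequencies are added together.
    """
    collapsed = defaultdict(float)
    for haplotype, freq in freqs.items():
        kept = [allele for allele in haplotype.split(sep)
                if not allele.startswith(locus + "*")]
        collapsed[sep.join(kept)] += freq
    return dict(collapsed)
```

Because frequencies are only added, the total frequency mass is preserved, so a normalized five-locus distribution remains normalized at four-locus resolution.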

For most Canadian populations, the estimated admixture via OPs agreed with self-reported race and ethnicities and with similar groups from other registries in the US, Australia and Europe (Figure 3B). The similarity between the US and Canadian admixtures is clearly demonstrated in the admixture estimation of the Canadian Asian, African, European, Latin American, Korean, and Middle Eastern populations. However, some populations, such as Aboriginal and Filipino Canadians, show a slightly different composition. For example, the Canadian Aboriginal population is composed not only of the Australian Aboriginal population components, but also has fragments from African and Hispanic populations. This could be attributed to different patterns of population migration and historical events.

We present an algorithm that dissects population genetic admixture, based on HLA haplotype frequencies, into OP components in the presence of background LD and high polymorphism. While traditional admixture and lineage models typically rely on many biallelic genome-wide loci, we demonstrate that HLA is polymorphic enough to allow for a clear delineation of population composition using a single genomic region. The admixture problem, shown to be equivalent to a Non-negative Matrix Factorization analysis, is more computationally efficient than traditional admixture models and allows for the admixture of a large number of populations with an extensive number of haplotypes (on the order of a million haplotypes). To our knowledge, this is the first algorithm to dissect population admixture with a specific focus on HLA, and the first validation framework to use a dataset of the presented magnitude.

We developed and applied our method to the haplotype frequencies of 56 populations from different adult volunteer stem-cell registries representing over 3.5 million donors. We showed that the resulting admixture is consistent with the known ethnic composition, recent history and SNP-based admixture.

The results of the admixture and phylogenetic analyses show a clear distinction between Asian, African, European and Israeli populations. As expected, the Latin-American and Middle Eastern Arab populations were more admixed than other populations and contained a notable European component. Our method was also able to distinguish East Asian from South Asian and Pacific Islander populations such as New Zealand. Some of the analyzed populations were more admixed than others; for example, most East Asian populations had a single predominant OP while Caribbean and Middle Eastern groups were more admixed. A distinct difference emerged between Israeli and non-Israeli populations. Within the Jewish Israeli populations, phylogenetic analysis separated several distinct groups: Ashkenazi, North African, Central Asian, and Yemenite and Ethiopian Jews. Interestingly, the Jewish populations were mainly admixed with Caucasian and South Asian populations, probably representing historic migration and modern admixture events. Our results showed over 50% European admixture in Native American and Australian Aboriginal populations, suggesting large recent admixtures (Figure 1B). Additionally, we have reported in a previous study a degree of over-reporting of self-identified Native-American race among Be The Match donors [64] that did not completely coincide with the genetic composition of reporting individuals, some of whom were found to have substantial European admixtures. Our phylogenetic analysis showed comparable results (Figure 2A). The general division of branches agreed with previous phylogenies of general human populations

A limitation of the current method is its restriction to phased HLA haplotype frequencies. The current implementation cannot be applied to SNP or other unphased marker data. Additionally, the presented admixture assignments are estimated at the population level, i.e. each haplotype is assigned a most likely OP, but a relative fraction of OP cannot be assigned to an individual. Thus, the presented method is complementary to existing unphased admixture models.

[...] has an assorted color, depending on the group it belongs to. The correlation between the entropy and admixture is unaffected by the population size.

[...] five-locus haplotype frequencies of the 56 studied populations. The tree is divided into branches delineating the main ethnic groups in the current analysis: Jewish, African, European, Hispanic, East and South Asians. Color coding shows concordance between the estimated admixture and tree branches. The only exceptions are highly admixed populations, such as Ethiopian Jews. B) PCoA (Principal Coordinate Analysis) of the distances as defined by pairwise population Fst. The populations have been grouped into broad regions: South Asian, East Asian, Pacific Islanders, Jews, European, African and Hispanic. More plots are provided in supplementary materials.