Abstract
The origins of the Albanian people have vexed linguists and historians for centuries, as Albanians first appear in the historical record in the 11th century CE, while their language is one of the most enigmatic branches of the Indo-European family. To identify the populations that contributed to the ancestry of Albanians, we undertake a genomic transect of the Balkans over the last 8000 years, where we analyse more than 6000 previously published ancient genomes using state-of-the-art bioinformatics tools and algorithms that quantify spatiotemporal human mobility. We find that modern Albanians descend from Roman era western Balkan populations, with additional admixture from Slavic-related groups. Remarkably, Albanian paternal ancestry shows continuity from Bronze Age Balkan populations, including those known as Illyrians. Our results provide an unprecedented understanding of the historical and demographic processes that led to the formation of modern Albanians and help locate the area where the Albanian language developed.
Introduction
During the Iron Age (1100 BCE–150 CE), the Balkans were characterised by remarkable cultural, linguistic, and genetic heterogeneity (1–6). In the western Balkans, “Celtic” cultures such as Hallstatt and La Tène interacted for centuries with local groups referred to as the “Illyrians” and “Dalmatians” (7, 8). Deep in the Balkan heartland, heterogenous populations named by classical authors as “Dacians”, “Dardanians”, “Moesians”, and “Paeonians” (1, 7–9) bordered nomadic cultures from the Pontic-Caspian steppe known as the “Scythians” (10), while the southeastern part of the peninsula was inhabited by “Thracians” and “Greeks” (2, 11). Balkan peoples also expanded beyond the confines of the peninsula, with “Messapians” migrating to southeast Italy at least since 600 BCE (12). The linguistic and cultural diversity of the Balkans was considerably homogenised during the Hellenistic (2, 9), Roman (9, 13–15), and especially the Migration Period, when Germanic and Slavic-speaking groups settled in the region in large numbers (2, 3, 16, 17). These events ultimately led to the extinction of all palaeo-Balkan languages except Greek and Albanian. The latter is one of the most enigmatic branches of the Indo-European language family, having vexed linguists for more than two centuries (2, 18, 19).
Tracing the origins of Albanians and their language is challenging for several reasons. Only a handful of historical sources comment on the ethnic and linguistic composition of the southwest Balkans during the transition from classical antiquity to Medieval times (500-1000 CE) (20–22), and none of them mention an Albanian-speaking population from the territory of modern Albania. Speakers of Slavic languages are reported to have inhabited what is now southern Albania in the 8th century CE (15, 21), where the frequency of Slavic toponyms also peaks (23), while the same region is characterised by the presence of Greek-speakers at least since the Medieval period (24). The urbanised Medieval populations of the northwest, referred to by contemporary historians as the Romani/Ῥωμᾶνοι (20), are thought to have spoken a variant of vulgar Latin known as West Balkan Romance (14, 15, 25) that persisted at least until the 13th century CE (26). The demographic and linguistic situation in the mountainous interior is unknown, and it is only in the 11th century CE that Albanians appear in the historical record (22), while the earliest surviving written document of their language dates to 1462 CE (27).
A number of linguistic hypotheses have attempted to identify the affinities of the Albanian language and to locate the region where it developed, yet no definitive conclusions have been drawn. The most prominent, mutually exclusive hypotheses can be divided into those arguing for a local west Balkan origin from an Illyrian (28, 29) or Messapic background (19, 30, 31) [which may or may not have been distinct languages (7, 30, 32)], and those proposing a non-local origin from a Daco-Moesian-Thracian background (2, 19, 33) or an unattested Balkan language, whose speakers entered Albania from the central-east Balkans sometime after 400 CE (15, 32, 34, 35). The validity of these hypotheses, although hotly debated, is hard to test, as these ancient languages are poorly recorded, being known only from fragmentary inscriptions, toponyms, and a handful of historical sources (2, 7, 36). Furthermore, all of the ethnonyms of ancient Balkan peoples, such as “Illyrian” and “Thracian”, are likely artificial labels that were coined by ancient and modern authors (37), and may include several related languages with largely obscure geographical limits, intelligibility, and emic identities of their speakers (8, 9, 32). The most recent linguistic hypotheses propose a sister-group relationship of Albanian to Greek or to the Greek-Armenian clade (18, 38, 39), which firmly places the origin of the language in the Balkans but does not pinpoint the location of the proto-Albanian homeland within the peninsula and its potential affiliation to historically attested populations.
Archaeological data on Albania’s Medieval cultures are also inconclusive, especially for the Komani-Kruja complex (ca. 600-800 CE), which has been interpreted as the cultural expression of either a Romanised population (local or intrusive) (8, 40, 41) or an indigenous Albanian-speaking group (42).
Due to the challenges associated with linking archaeological, literary, and linguistic evidence, an archaeogenetic approach may offer novel insights into the origin of the Albanians, their biological relationships to ancient people, and the affinities of their language. Although gene flow is not always accompanied by language shifts [as in the case of Basque (43) and Etruscan (44)], migration is one of the primary vectors of cultural change (45, 46), of which language dissemination is a frequent outcome (3, 17, 47–49). Recent years have witnessed a surge in the palaeogenomic sampling of the Balkan peninsula (3–6, 11, 16), yet the resulting datasets have not been mined to help us understand how migration led to the emergence and spread of new material cultures, communities, and languages in the territory of modern Albania.
To gain insight into the biological and linguistic origins of modern Albanians, we undertake a palaeogenomic transect of the Balkans from the Neolithic to the modern era (Fig. 1; Tables S1-S2), using more than 6000 previously published ancient genomes from western Eurasia, which we interrogate by means of state-of-the-art statistical analyses (Tables S3-S18), large-scale algorithms that quantify human mobility (Table S19), and a meta-analysis of ancient IBD-sharing data [Tables S20-S21; (50)]. We also mine publicly available Y-chromosome haplogroup data from more than 2500 ancient and modern Balkan samples (Tables S22-S34), which reveal the proximate ancestors of modern Albanian men.
Geographical distribution and dating of the examined Balkan populations. A) Archaeological sites annotated by period. B) Analysed ancient individuals from the territory of modern Albania arranged by archaeological site and date (radiocarbon or archaeological chronology).
Results
Complex genetic transformations during the Balkan Stone to Bronze Age transition
To visualise variation in genetic composition, we perform principal components analysis (PCA), in which we project previously published ancient individuals from the Balkans and adjacent regions onto present-day West Eurasian populations genotyped with the Human Origins array (Fig. S1). Three distinct population clusters can be seen on PC1 and PC2 of our Stone Age-Early Bronze Age PCA (Fig. 2). Hunter-Gatherer (HG) populations, which include Western Hunter-Gatherers and Iron Gates Hunter-Gatherers (IGHG), occupy the right side of PC1; Early European Farmers (EEF), who introduced agriculture to Europe, cluster in the bottom left corner of PC2; and Yamnaya steppe pastoralists, who are credited with the expansion of Indo-European languages across Eurasia and beyond, plot at the uppermost space between PC1 and PC2.
PCA of modern West Eurasian samples (grey points) with projection of ancient Neolithic-Early Bronze Age individuals from the Balkans. Dashed lines indicate the two poles of observed ancestry shifts from the Neolithic onwards – one generated by the assimilation of hunter-gatherers (blue line), and another by admixture with incoming Yamnaya-related populations from the Pontic-Caspian steppe (yellow line). Samples clustering within the PCA space enclosed by HG, EEF, and Yamnaya derive variable amounts of ancestry from these populations.
The PCA shows that the single Early Neolithic sample from Albania clusters with contemporary farming populations from Greece and Anatolia (Fig. 2), an affinity that is further supported by outgroup f3-statistics (Fig. S2A), which measure shared genetic drift between a pair of populations, as well as formal f4-statistics qpAdm tests with rotating sources (Table S3). No major genetic changes are observed in samples dating to the Neolithic-Chalcolithic transition (Fig. 2), with the individuals from Albania (Alb NChl) being modelled as deriving their entire ancestry from the preceding Albanian Neolithic, possibly with an additional 5% admixture from an IGHG-like source (Figs. 3A, S3; Table S4).
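As an illustration of what an outgroup f3-statistic measures, the following is a minimal sketch rather than the pipeline used here. It assumes per-SNP allele frequencies are already available, and it omits the finite-sample bias correction and the block-jackknife standard errors that standard software applies. The function name and all frequency values are hypothetical.

```python
# Minimal sketch of an outgroup f3-statistic, f3(O; A, B).
# Assumption: inputs are per-SNP allele frequencies in [0, 1], with None
# marking missing data. Bias correction and jackknife errors are omitted.

def outgroup_f3(freq_o, freq_a, freq_b):
    """Mean of (o - a) * (o - b) over SNPs with data in all three populations."""
    products = [
        (o - a) * (o - b)
        for o, a, b in zip(freq_o, freq_a, freq_b)
        if None not in (o, a, b)
    ]
    if not products:
        raise ValueError("no SNPs with data in all three populations")
    return sum(products) / len(products)


# Hypothetical allele frequencies at five SNPs (illustrative values only).
outgroup = [0.0, 0.0, 1.0, 0.0, 1.0]
pop_a = [0.8, 0.6, 0.2, 0.5, None]   # last SNP missing in A, so it is skipped
pop_b = [0.7, 0.5, 0.3, 0.5, 0.4]    # frequencies close to A
pop_c = [0.1, 0.9, 0.9, 0.1, 0.6]    # frequencies unlike A

f3_ab = outgroup_f3(outgroup, pop_a, pop_b)
f3_ac = outgroup_f3(outgroup, pop_a, pop_c)
print(f3_ab, f3_ac)
```

In this toy example the value for the pair (A, B) is larger than for (A, C), which is how a higher outgroup f3 is read: more drift shared since the split from the outgroup.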
Distal qpAdm models for Balkan populations from the Neolithic to post-Medieval period. A) Neolithic and Early Bronze Age population models. B) Bronze-Iron Age models. C) Roman-post-Roman models. Putative ancestral source populations are colour-coded and presented in the top part of the figure. The single model receiving the highest statistical support is presented (all possible models and the criteria used to select the ones shown here are provided in Tables S3-S18).
Unlike the genetic continuity observed during the Neolithic-Chalcolithic transition, the PCA indicates remarkable ancestry shifts in the Balkans during the Early Bronze Age (EBA) (ca. 2600-1800 BCE). Most EBA Balkan samples show a significant change in their PC position and plot towards the Yamnaya (Fig. 2), suggesting the arrival of the steppe cultural package in the region, including Indo-European speech (4, 11, 43, 47, 49, 51–53). The Albania Çinamak EBA sample, dated to 2663-2472 BCE (Fig. 1B; Table S1), together with two samples from Bulgaria Boyanovo EBA (2500-2000 BCE), cluster closest to Yamnaya (Fig. 2), and qpAdm tests model them as having roughly 70% and 80-85% steppe ancestry, respectively, with the remainder of their genome deriving from local farming populations (Fig. 3A; Tables S5, S7). Accordingly, both male samples from Albania Çinamak EBA and Bulgaria Boyanovo EBA carry Y-chromosome haplogroup R1b-M269 (Tables S22, S26), the primary indicator of paternal ancestry from the Pontic-Caspian steppe (6, 49, 53). Importantly, steppe-derived admixture in Albania Çinamak EBA is best modelled with largely unadmixed Yamnaya groups rather than with EEF-and-HG-admixed Corded Ware populations (Fig. S2C; Table S7), suggesting that Indo-European-related ancestry arrived in the Balkans directly from the Pontic-Caspian steppe, in accordance with previous studies (6). This is also supported by the presence of Y-chromosomal haplogroup R1b-Z2103 in the Bulgaria Boyanovo EBA sample, which has been found in the earliest Yamnaya contexts (6, 49, 53), whereas Corded Ware samples are characterised by largely different subclades of haplogroup R1 (R1a-M417, R1b-U106, R1b-L151) (6, 45, 54). The ancestors of Albania Çinamak EBA were likely recent arrivals from the steppe, as this individual lived during the transitional period when the first kurgan burials made their appearance in Albania (55).
Contemporary EBA archaeological finds from tumuli in Shkodër (100 km to the west of Çinamak) include artefacts associated with the Vučedol culture (55), revealing links with the northern Adriatic and Pannonia. It should be noted that a male with high levels of steppe ancestry discovered in a Vučedol context in Croatia also belonged to Y-chromosomal haplogroup R1b-Z2103 (56).
Regarding the broader Balkan EBA context, early interactions between incoming nomadic pastoralists and local farming populations were likely complex, as evidenced by the heterogenous levels of steppe admixture among geographically adjacent groups, which range from 0-7% (Greece Manika EBA), 16% (Bulgaria Tell Kran EBA), 30% (Serbia Mokrin EBA Maros) to ca. 45% (Bulgaria Kapetan Andreevo EBA) (Fig. 3A; Table S5). These differences in ancestry are reflected in the PCA, where the abovementioned groups occupy a middle space between largely unadmixed Yamnaya and EEF populations (Fig. 2).
The establishment of the BA-IA Balkan genetic cline
We next examined the projections of Bronze Age (BA) and Iron Age (IA) Balkan populations upon the principal components generated earlier (Fig. 4). By the Middle-Late Bronze Age (MLBA) and the Iron Age, the genetic ancestry of Balkan populations had become less heterogenous, exhibiting a north-to-south cline that broadly reflects geography (Fig. 4A). Accordingly, the MLBA-IA samples from Çinamak in Albania are clinal between contemporary populations from Croatia and Montenegro on one side of PC1 and PC2, and from North Macedonia and northern Greece on the other side (Fig. 4B). This affinity is further corroborated by f3-statistics (Fig. S2D, E) and Mobest analysis (Fig. S4), which employs an algorithm for spatiotemporal mapping of genetic profiles using bulk aDNA data (57). Formal f4-models with qpAdm using ultimate sources indicate a uniform genetic profile and a resurgence of farmer ancestry across the Central-West Balkans and northern Greece compared with the preceding EBA, as most samples derive around 60% of their ancestry from EEFs, 0-5% from Iran N, and 30-40% from steppe populations (Fig. 3B; Table S5). In the southern end of the Balkan genetic cline, Bronze Age and Iron Age populations of central-southern Greece and Bulgaria have considerably higher EEF (75-80%) and Iran N (5-10%) ancestry, and a much lower proportion of steppe ancestry (15-20%) (Fig. 3B; Table S5).
PCA of modern West Eurasian samples (grey circles) with projection of ancient Bronze Age and Iron Age individuals from the Balkans. A) PCA of all tested populations, showing a north-to-south cline that mirrors geography. B) Detail of region enclosed in the dotted rectangle in panel A, showing the clustering of populations from Albania (enclosed in solid line).
The close clustering of BA-IA populations from Albania, Croatia, Montenegro, North Macedonia, and northern Greece is also confirmed in proximate qpAdm models, as the Çinamak MLBA-IA samples derive most of their ancestry from the West Balkans (Tables S8-S9), with a possible 15-25% contribution from a southeast Balkan source (Bulgaria EIA, Greece BA Mycenaean) after the Middle Bronze Age (MBA) (Table S9). Based on the above, the MBA-IA populations of a large geographic region spanning northern Greece, North Macedonia and the entire Adriatic coast, including the region of modern Albania, form a uniform genetic cluster with similar admixture proportions (Fig. 3B) that persists for at least 1,500 years and transcends the linguistic boundaries identified by classical authors (7, 9). Our findings are further reinforced by IBD-sharing between certain samples from Albania and North Macedonia (Table S20) (50).
Intriguingly, two mercenaries from the Battle of Himera in Sicily (58) fall within the PCA cluster of BA-IA Albania (Fig. 4B). This position might be coincidental, as these two samples are characterised by high proportions of IGHG-related ancestry (9%), which in the Balkans can only be found in populations from EBA Serbia (Fig. 3B; Table S5). Furthermore, the proximate ancestry sources of these mercenaries remain uncertain under both f3- and f4-statistics (Fig. S5; Table S10), suggesting they might derive from a currently unsampled population, likely from the Central Balkans.
An outlier from Hellenistic North Macedonia might represent an early migrant from the Near East, as he is shifted towards populations from the Caucasus (Fig. 4A), and his ultimate ancestry is successfully modelled with proxies from the Caucasus and the Levant (Fig. 3B; Table S5).
Albania as a refugium of Iron Age ancestry during the rise and fall of the Roman Empire
It has been previously demonstrated (3, 16) that during Imperial Roman times, or perhaps even the preceding Hellenistic era, as we show here (Fig. 4B), large-scale westward immigration from the Eastern provinces took place, which transformed the genetic landscape of the Empire (59). This demographic shift also took place in the Balkans (3), as can be seen in the PCA of Roman and Post-Roman samples (Fig. 5), where virtually all local populations are shifted from their Bronze Age and Iron Age position towards the Eastern Mediterranean, forming a cluster along what we term the Anatolian cline (Fig. 5A).
PCA of modern West Eurasian samples (grey circles) with projection of ancient Roman, Medieval, and post-Medieval individuals from the Balkans. A) PCA of tested populations, highlighting the shift in the PCA coordinates of Roman and post-Roman groups towards the Anatolian (green dashed line) and Balto-Slavic (blue dashed line) clines, compared to their counterparts from the same regions in the preceding Bronze Age and Iron Age (faded grey polygons). B) Detail of region in panel A, showing the tight clustering of populations from Albania over the past 3000 years, including modern individuals (orange polygon). For simplicity, Roman-era populations within and close to this cluster have been removed and can be seen in the previous panel.
Formal f4-statistical tests using qpAdm also support the divergence in Balkan populations between the BA/IA and the Hellenistic/Roman era, by modelling post-Neolithic Anatolian admixture with an Iran N-like reference (Fig. 3C), which became pervasive in Anatolia and the Aegean by the EBA (60), and was also often accompanied by Levantine ancestry (59). Although largely absent in the preceding Iron Age, Iran N-like admixture entered Balkan populations in a multiphased manner, with some Roman-era-Early Medieval populations being modelled as deriving 0-15% of their ancestry from this source (Croatia Roman Beli Manastir and Šćitarjevo, Montenegro Doclea Roman), which in adjacent regions can be as high as 20-30% (Croatia Novo Selo Bunje, Zadar, and Trogir Dragulin) (Fig. 3C; Table S5). These differential ancestry shifts are also corroborated using ADMIXTURE on a subset of our dataset, represented by the appearance of a West-Central Asian component that was absent prior to the Roman period (Fig. S3).
While Anatolian ancestry was becoming established in the Balkans, a second, even more significant demographic transformation began to take place. By 200 CE, individuals related to north-eastern European, Balto-Slavic, and nomadic steppe populations started appearing in the Balkans (Fig. 5A). Such movements peaked during the Migration Period (roughly 350-600 CE) with the mass migration and settlement of Avar, Germanic and Slavic-speaking groups, which transformed the cultural, linguistic, and ethnic composition of the region (3, 9, 17, 61). Admixture with these newcomers was pervasive, as can be seen in the PCA, where most Balkan populations show a significant shift towards Balto-Slavic groups, along what we term the Balto-Slavic cline (Fig. 5A). This admixture event is mirrored in an unprecedented increase of IGHG-related ancestry in most late Roman-Early Medieval Balkan populations, which is most significant in samples located within the Balto-Slavic cline (Fig. 3C; Table S5). Both qpAdm and ADMIXTURE models suggest that from an average of 4-5% in the preceding BA-IA and the early Roman era, IGHG ancestry rose to 13-18% in strongly Balto-Slavic-shifted populations from late Roman, Medieval, and Post-Medieval Croatia, Macedonia, Montenegro, and Serbia (Fig. 3C; Fig. S3; Table S5).
To quantify the impact of north-eastern European migrations into the Balkans, we used f4 rotating-source qpAdm tests with proximate sources dating to the Iron Age, Bronze Age and the Roman era, where Russia Ingria IA served as a proxy for Balto-Slavic-related ancestry. Our qpAdm models recover remarkably high Balto-Slavic-related ancestry in the late Roman and Post-Roman populations of Croatia (50-65%), Montenegro (45-65%), North Macedonia (30-50%), and Serbia (50-55%) (Table S11). These admixture proportions are almost identical to those proposed by studies on modern Slavic-speaking populations using different methods, which recover the ancestry of South Slavs as 55-70% Balto-Slavic-related, with the remainder originating from the pre-Slavic inhabitants of the Balkans (48), further affirming the accuracy of our models.
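The rotating-source strategy can be sketched as an enumeration of candidate models. The snippet below is a hedged illustration of the general scheme, not the configuration used in this study. It assumes that each candidate population not chosen as a source is moved into the reference set for that model, and all population labels and the function name are hypothetical placeholders. It only lists the models; fitting them would require dedicated qpAdm software.

```python
from itertools import combinations

# Sketch of enumerating models under a rotating-source scheme.
# Assumption: candidates not used as sources join the reference set.
# All labels below are hypothetical placeholders, not the study's populations.

def rotating_models(candidates, fixed_references, max_sources=2):
    """Yield (sources, references) pairs for every source combination."""
    for k in range(1, max_sources + 1):
        for sources in combinations(candidates, k):
            rotated = [pop for pop in candidates if pop not in sources]
            yield list(sources), list(fixed_references) + rotated


candidates = ["West_Balkan_Roman", "East_Balkan", "Slavic_proxy"]
fixed = ["Outgroup_1", "Outgroup_2"]

models = list(rotating_models(candidates, fixed))
for sources, references in models:
    print(sources, "| references:", references)
```

With three candidates and at most two sources per model, this yields three one-way and three two-way models. Because unused candidates act as references, a model passes only if its sources account for the target's relationship to the competing candidates as well.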
Considering the abovementioned Anatolian and Balto-Slavic ancestry shifts in the Roman and Medieval Balkans, we model the ancestry of samples from Medieval Albania. In contrast to neighbouring populations, the samples from Medieval South-Eastern (Shtikë, 889-989 calCE) and North-Eastern (Kënetë, 773-885 calCE) Albania (hereafter Albania Mdv) experience only a minor shift in their PC position from the BA-IA to the Migration Period (Fig. 5A, B), hinting at large-scale genetic continuity over 2500-3000 years. This is reflected in the ancestry makeup of Albania Mdv, as in ultimate f4 qpAdm models they derive 17% of their ancestry from Iran N-like sources (2-5% in the BA-IA), while their IGHG ancestry increases only marginally (0-4% in BA-IA, 6% in Medieval times) (Fig. 3C; Table S5). Proximate qpAdm models comprising Balkan BA-IA sources and proxies for Anatolian-Levantine (East Anatolia BA IA; Syria Ebla EMBA) ancestry also replicate this ancestry shift (Table S12). The most strongly supported result was a two-way model where Albania Mdv derive 85% of their ancestry from BA-IA-Hellenistic Albania, and 15% from either the Anatolian or Levantine proxy, with equal support (Table S12).
To further resolve the observed ancestry patterns, we ran a second proximate model using Roman era West Balkan sources, together with a Slavic proxy (Russia Ingria IA), where we recovered Albania Mdv as either 100% Roman West Balkan, or 85% Roman West Balkan and 15% Slavic-related (Table S13). Accordingly, Albania Mdv cluster together with Roman era West Balkan samples that derive only a small proportion of their ancestry from Anatolian populations [Croatia (Beli Manastir, Gardun, Omišalj, Sisak, Trogir, Velić, Zadar); Montenegro (Doclea); Serbia (West-Balkan-shifted samples)] (Fig. 5B) and show little to no increase in IGHG ancestry (Fig. 3C).
Populations of largely unadmixed palaeo-Balkan ancestry persisted in pockets in other regions of the collapsing Roman Empire as well. Two samples from Early Avar Pannonia (550-650 CE) cluster with populations from Iron Age Bulgaria and Greece on the PCA (Fig. 5C), a relationship corroborated by f3-statistics (Fig. S6), qpAdm (Table S14), and IBD-sharing (Table S21) (50). Such outliers may affirm historical reports of the Avars undertaking mass resettlements of Roman subjects from the area of Thrace towards their Khaganate in Pannonia (62, 63).
Given that some linguistic hypotheses suggest a mixed West Balkan and Thracian origin for the Medieval population of Albania (2, 64), we undertake an f4 qpAdm test combining Roman era West Balkan, East Balkan (the two outliers from Early Avar Hungary), and Slavic-related (Russia IA Ingria) sources. Albania Mdv is once again effectively modelled as being 100% of Roman era West Balkan origin (Croatia Roman Gardun, Montenegro Roman Doclea), or as a two-way mix of a Roman period West Balkan source (85%) and a Slavic-related source (15%) with high support (p = 0.16; SE = 0.04) (Table S15). Two-way or three-way models with West Balkan + East Balkan or West Balkan + East Balkan + Slavic-related sources also passed with low statistical support (SE = 0.12 and SE = 0.09-0.14, respectively), recovering Albania Mdv as deriving 45-60% of their ancestry from the two Thracian-shifted Avar era outliers (Table S15), a finding that may also receive support from f3-statistics (Fig. S2F). Furthermore, models using the Avar era outliers as the sole palaeo-Balkan source for Albania Mdv receive high support (p = 0.72; SE = 0.03; Table S15). However, it is unlikely that the Avar era outliers are realistic local sources, because such significant ancestry shifts would have pulled the Albania Mdv samples towards southeastern Balkan populations on the PCA, which is not the case (Fig. 5A, B). Overall, the PCA and f4 qpAdm statistical models suggest that the Medieval population of Albania was minimally affected by demographic changes during the Roman era, in stark contrast to adjacent regions such as Croatia and Serbia (Fig. 5A). However, currently unsampled urbanised areas in Albania, such as Durrës and Shkodër, may have comprised populations of more complex ancestry.
Although we show that some West and East Balkan populations persisted largely unadmixed in late Roman and early post-Roman times (Fig. 5A), such ancestry profiles cannot be found post-900 CE, as Mobest analysis of Albania Mdv shows affinities only to populations from Italy, which maintained a larger proportion of Eastern Mediterranean ancestry compared to contemporary sampled locations in southeastern Europe (Fig. 6B, C). This suggests that the region of modern Albania served as a refugium of Iron Age West Balkan ancestry throughout the demographic and social upheaval that took place during the Migration Period (9, 17, 61). However, our qpAdm models cannot discriminate whether West palaeo-Balkan ancestry in Medieval Albania originated from indigenous populations or other incoming palaeo-Balkan groups from adjacent regions such as the northern Adriatic coast or the Balkan interior. Indeed, IBD-sharing suggests that the sample from Kënetë shares small (10.5 cM) segments with BA and Roman individuals from neighbouring Velika Gruda in Montenegro and Zadar in Croatia, respectively (Table S20) (50). Conversely, the individual from Shtikë shares short segments (11-14.5 cM) with geographically distant Hun and Avar-era commoners from Hungary (Table S20), suggesting that possible trade networks between Medieval Albania and the Avar Khaganate (based on archaeological artefacts) (8, 15, 40) may have also involved genetic exchange.
Mobest analysis of Late Bronze Age-Iron Age, Medieval, and post-Medieval samples from the region of Albania, plotting the probability surface that identifies the highest genetic-geographical match at the mean date of the respective individual. The higher the probability surface (light yellow-green), the closer the genetic match. Only a single sample from the post-Medieval population of Bardhoc is included here, as all individuals from this region display the same probability surface (Fig. S7A-C). The latitude and longitude coordinates with the best fit for the examined individuals are provided in EPSG:3035 projection in Table S19.
The formation of the modern Albanian genome
We next sought to characterise the ancestry of post-Medieval samples from Central (Pazhok, 1527-1660 calCE) and north-eastern (Bardhoc, 1400-1700 calCE) Albania. The place-name Bardhoc is of Albanian etymology (35), and is mentioned in Ottoman registers contemporary to the studied samples (65, 66), suggesting the latter might have been Albanian-speakers. Furthermore, in the adjacent region of Has, a majority of the population is recorded as Albanian by contemporary Ottoman sources, although speakers of South-Slavic languages were also present (65).
In the Roman-Post-Roman PCA, individuals from post-Medieval Bardhoc (hereafter Bardhoc PostMdv) cluster with the preceding Medieval samples from Kënetë and Shtikë, except for one outlier who clusters with the Pazhok sample; both of these individuals are pulled towards the Balto-Slavic cline (Fig. 5B), a pattern confirmed by f3-statistics (Fig. S2H, I). We employ a two-way f4-statistics qpAdm model to test whether the pattern observed in the PCA reflects shared ancestry, where Albania Mdv serves as a local source (Table S16). To test for excess north-eastern-European-related admixture, the second ancestry source is modelled with a proxy for unadmixed Balto-Slavic-related ancestry (Russia IA Ingria), as well as a suite of palaeo-Balkan-admixed populations within the Balto-Slavic cline dating to the Roman (Croatia Roman NE Europe o, Serbia Roman NE Europe o), Medieval (Macedonia Medieval, Montenegro Doclea Slavic, East Anatolia Roman Slav), and post-Medieval (Serbia Post-Medieval) periods (Tables S16-S17).
One-way models for Bardhoc PostMdv using Albania Mdv as a source receive very high statistical support, while two-way models with an additional low-level contribution from unadmixed (2%) or admixed (5-10%) Balto-Slavic-related sources are also feasible (Fig. 7; Tables S16-S17). However, a one-way f4 qpAdm model for the two Balto-Slavic-shifted outliers from Bardhoc and Pazhok is rejected (Tables S16-S17), suggesting more complex ancestry in these individuals. These outliers are instead recovered as deriving their ancestry from a two-way mixture between a local source (Albania Mdv) and unadmixed (ca. 15%) or admixed (ca. 25-35%) Balto-Slavic-related groups (Fig. 7; Tables S16-S17). Slavic-related admixture is similarly variable among the modern Albanian samples from Tirana, in whom it ranges from 18-48% (Fig. 7). It is likely that modern populations inhabiting areas in Albania that experienced little to no Slavic settlement in Medieval times, such as the south-west (Labëria) and the north (Malësi e Madhe, Rrethi i Matit) (8, 14, 15, 23), will harbour less Balto-Slavic-related ancestry than the current modern Albanian samples.
Admixture modelling of samples from post-Medieval Albania using a rotated qpAdm model. The samples of Albania Mdv are used as a proximate local source, while various Roman and Medieval populations along the Balto-Slavic cline serve as proxies for Balto-Slavic-related ancestry. Models with poor fit (p-value ≤ 0.05) and infeasible coefficients (≥0.15) are not shown. Note that models with modern Albanian samples are of lower resolution as they exploit 600k SNPs and not the full 1240k SNPs used for tests including solely ancient samples. However, the proportion of Slavic-related ancestry in modern Albanians is comparable with that of the post-Medieval outliers from Bardhoc, suggesting that qpAdm tests including modern Albanian samples are accurate. Figure based on Table S16.
Although models using the unadmixed Balto-Slavic-related proxy (Russia IA Ingria) receive higher statistical support (Fig. 7; Tables S16-S18), they are less likely historically, as there is no evidence that such individuals persisted in the Balkans post-600 CE based on the PCA (Fig. 5A). Likewise, Mobest analysis on the post-Medieval samples of Bardhoc and Pazhok reveals highest genetic similarity to neighbouring South-Slavic populations and not to north-eastern unadmixed Balto-Slavic groups (Fig. 6D-F). It is therefore likely that Slavic admixture entered the Albanian population via an ancestry profile related to modern South Slavs, rather than unadmixed Balto-Slavic populations. Regarding the origins of such South Slavic-related admixture in post-Medieval and modern Albanian populations, qpAdm models with north-western South Slavic sources (Croatia, Montenegro, Serbia) are highly supported, in agreement with historical and linguistic data (33), while a source from Medieval North Macedonia is either rejected or receives low support (SE = 0.14 and 0.15 in Tables S16-S17, respectively).
We have shown that modern Albanians from Tirana derive 25-48% of their ancestry from a South-Slavic-related source (Fig. 7; Tables S16-S17). This ancestry contribution is two to three times higher than the frequency of South-Slavic-associated Y-chromosome haplogroups (R1a-M417, I2a-M423) (67, 68) in the modern Albanian population (15% combined; Fig. S8), suggesting that Slavic-related admixture may have been largely female-mediated, as has been shown in 10th century Serbia (3). These findings are in agreement with anthropological and historical data supporting a strongly patrilineal, kinship-focused culture among Albanians until early modern times (69–71).
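The comparison between autosomal and paternal Slavic-related contributions can be checked with simple arithmetic. The sketch below uses only the figures quoted in the preceding paragraph (25-48% autosomal ancestry, 15% combined haplogroup frequency); the variable names are illustrative.

```python
# Ratio of autosomal South-Slavic-related ancestry to the combined frequency
# of South-Slavic-associated Y-chromosome haplogroups, using figures from the text.

autosomal_range = (0.25, 0.48)   # autosomal ancestry in modern Tirana samples
y_haplogroup_freq = 0.15         # R1a-M417 and I2a-M423 combined

ratios = tuple(round(a / y_haplogroup_freq, 2) for a in autosomal_range)
print(ratios)  # lower and upper bound of the autosomal-to-paternal ratio
```

The resulting bounds, roughly 1.7 and 3.2, bracket the "two to three times" contrast described above.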
To determine whether the similarities in autosomal ancestry and PCA position between the Medieval and post-Medieval population of Albania are the result of direct descent, we mined previously published datasets of IBD-sharing between ancient samples (50) (Table S20). Remarkably, all samples from post-Medieval Bardhoc share large IBD segments (10-70 cM) with the Medieval individuals from Kënetë and Shtikë, despite the latter being situated 300 km to the south and having lived 500-700 years earlier (Table S20). However, such a close relationship may be exaggerated by founder effects, especially considering that the modern Albanian population displays elevated haplotype sharing (72, 73). We also observe IBD-sharing (9-13 cM; Table S20) of the Bardhoc samples with a 17th-20th century CE individual from Roopkund lake in India who clusters with modern mainland Greeks (74). This finding suggests that a population related to Bardhoc may have been involved in the mass migrations of Albanian-speakers into Greece in the 14th-16th centuries CE (18, 75).
Based on the findings from the PCA, the f4 qpAdm models, and IBD-sharing, we show that the population of Bardhoc PostMdv descends from earlier medieval groups from Albania, which in turn showed continuity with IA and Roman era western Balkan individuals. However, despite its relative isolation, post-Medieval Albania represented a diverse cultural landscape. This is mirrored in three individuals from Barç, who project far from the remaining Balkan populations on the PCA (Fig. 5B). Although previously interpreted as being of Turkic origin (6, 16), our qpAdm and ADMIXTURE analyses conclusively show that these individuals harbour ancestry from South Asia (Figs. 3C, S3; Table S6), suggesting descent from the Roma people (76). Furthermore, the sole male individual is assigned to Y-chromosome haplogroup J2a-Y18404 (Table S22), which is found almost exclusively among Roma populations (77). The same holds true for the mitochondrial haplogroup U3b1 of one of the females [especially for branch U3b1c2 (77)]. To our knowledge, this is the first record of Roma people in the aDNA record.
Large-scale persistence of pre-Migration Period haplogroups in modern Albanians
Given that the transmission of genetic ancestry, culture, and likely language is often male-driven (43, 46, 54), we interrogated ancient and modern Balkan Y-chromosome haplogroups to further refine our understanding of demographic changes in the region. Remarkably, we find that 80% of the paternal ancestry of modern Albanians stems from pre-Migration Period populations (Fig. S8). We focus on the primary palaeo-Balkan lineages of modern Albanians, namely haplogroups E-V13 (27-35%), J2b-Z600 (15%), and R1b-BY611>Z2705 (12-14%) (Fig. S8; Table S33), and report their regional history and phylogeny in the context of the formation of modern Albanian populations.
Haplogroup J2b-Z600 branched off its parent lineage J2b-L283 around 3500-3000 BCE (77, 78). Current sampling suggests that J2b-Z600 was absent from the European Neolithic-Chalcolithic (Fig. 8), as it appears abruptly on the aDNA record in the Serbian Bronze Age (2100-1800 BCE) in a Maros cultural context, alongside the parent subclade of R1b-BY611 (Fig. S9) (5, 79). This places two of the most frequent paternal lineages of the Albanians (Fig. S9) in the Central-West Balkans by the EBA (Figs. 8-9, S9). Haplogroup J2b-Z600 experienced a major founder effect and diversification in the ancient populations of the Adriatic coast (Albania, Croatia, Montenegro), where it accounts for 50-70% of all paternal lineages during the BA-IA (Figs. 8-9), and has been found in samples associated with major West Balkan archaeological cultural expressions, most notably in Maros, Cetina, Japodian and Liburnian contexts (Tables S23-S24, S28) (5, 79). Coupled with its remarkably local distribution in pre-Roman times (Fig. S10), J2b-Z600 may represent a reliable indicator of ultimate Bronze Age-Iron Age West Balkan paternal ancestry. The distributional expansion of J2b-Z600 in northern and western Europe in Roman and post-Roman times (Figs. 8-9, S10) is not surprising, as the West Balkans supplied the Empire with mercenaries, soldiers, and Emperors for centuries (7–9). Within an Albanian context, J2b-Z600 subclades found in BA-IA, Roman and Medieval Albania (Bardhoc, Çinamak), Montenegro (Doclea, Velika Gruda), and Southern Croatia (Gardun, Gudnja cave), have daughter or sister lineages in modern Albanians (Table S34), suggesting significant paternal continuity from ancient south-west Balkan populations identified as “Illyrians” by classical authors (Fig. S11). 
aDNA samples from Roman Serbia (Sviloš, Viminacium) and Late Avar-Medieval Hungary (Alattyán, Sárrétudvari) belong to J2b-Z600 lineages related to those of modern Albanians (Table S34), indicating an ultimately south-west Balkan paternal origin for these individuals, which corroborates inscriptional and historical evidence for transplantations of “Illyrian” soldiers along the Limes (8, 9).
Y-DNA transect of the Balkans and Hungary over the past 8000 years, using data from Tables S22-S31. Sample sizes are shown above the bars.
Temporal distribution of the principal Central-West Balkan-derived Y-chromosome haplogroups in European populations, with a schematic phylogeny on the y-axis. Horizontal lines extending from some datapoints represent the archaeologically or radiocarbon-determined time range in which a particular individual lived. Compiled from Tables S22-S31.
The most frequent paternal lineage among the Yamnaya, R1b-Z2103 (6, 49), is represented in the ancient Balkans by daughter haplogroup R1b-CTS7556, which was first found in a Maros cultural context in Serbia (Figs. 8-9, S9-S10) (79), suggesting a direct migration of Yamnaya-related groups into the Pannonian plain. In turn, the primary descendant clade of R1b-CTS7556 is R1b-CTS1450, whose various daughter lineages appear in Bronze and Iron Age populations of northeastern Albania (Çinamak) and North Macedonia (Figs. 8-9, S9-S10). Importantly, R1b-CTS1450 lineages directly ancestral to haplogroup R1b-BY611>Z2705 (which today comprises almost exclusively Albanians), were found in BA-IA Çinamak (Fig. S9C). Although these lineages ultimately did not contribute to modern Balkan populations, their presence in Çinamak suggests that the group that introduced R1b-BY611>Z2705 to the territory of modern Albania was located nearby. One of the Bardhoc PostMdv samples belongs to R1b-BY611>Z2705 (Table S22), further supporting our interpretation of that population being related to modern Albanians.
Although E-V13 is one of the most frequent haplogroups in modern Balkan populations (67, 80–83), its origins are enigmatic. The earliest record of this haplogroup among historically attested groups is in BA-IA Bulgaria (Figs. 8-9, S10), suggesting an association with the people known as the “Thracians”. By the early Roman era, E-V13 likely experienced significant demographic increase, as it appears at medium to high frequencies in areas where in the preceding Bronze and Iron Age it was either very rare (Croatia, Hungary) or entirely absent (Serbia) (Figs. 8-9). An association of the expansion of E-V13 with southeastern Balkan populations from the Thracian world is reinforced by one of the Avar-era outliers from Hungary, who is assigned to E-V13 and clusters with BA-IA populations from Bulgaria on the PCA (Fig. 5A), an affinity confirmed by qpAdm (Table S14) and IBD-sharing (Table S21) (50). A Scythian from Moldova (Table S21) who clusters close to Balkan IA populations (Fig. 4A) and belongs to E-V13 (Fig. S10) also displays IBD-sharing with Bulgaria IA (Table S21). Our findings support late Roman historical records which mention the presence of “Thracian” groups known as the “Bessi” throughout the Balkans until the 6th century CE (2, 36, 62, 64).
However, not all populations with E-V13 were characterised by a Bulgaria EIA-like autosomal profile, as shown by the two E-V13-bearing Himera mercenaries (Fig. 4), who were likely related to EBA-LBA populations from Serbia (Supplementary Methods; Table S10). Furthermore, several Roman-era samples with a West Balkan autosomal profile [Croatia (Sipar R3664, Ščitarjevo R3659); Serbia (Viminacium R6756)] also harboured E-V13 (Tables S23, S28). This does not exclude a Thracian origin, as the historical region of Dardania (roughly modern Kosovo, southern Serbia, and western North Macedonia) is recorded as a zone of linguistic contact between “Illyrian” and “Daco-Thracian” groups (Fig. S7) (1, 7). It is therefore likely that the population that introduced E-V13 into Albanians would plot close to Roman era West Balkan groups. Remarkably, several E-V13 subclades found in Avar and Medieval era Hungary, as well as in one of the Himera mercenaries, are characterised by sister or daughter branches that comprise almost exclusively Albanians (Table S34). Whether the individuals from Hungary had an ultimately Balkan origin is unknown, as most do not share IBD segments with Balkan populations and are significantly admixed with northern European and Central Asian populations (Table S21).
To obtain insights into the ethnogenesis of modern Albanians, we plot the mean Y-full TMRCAs of Albanian-specific subclades of E-V13, J2b-Z600, R1b-BY611 and other palaeo-Balkan haplogroups (R1b-PF7562, I-M223) (Fig. 10). Remarkably, a majority of these haplogroups (J2b-Z600, R1b-BY611, R1b-PF7562, I-M223) experience a sudden and steep increase in subclade diversity between 500 and 800 CE (Fig. 10), which coincides with the timing proposed by linguistic and historical hypotheses on the origins of Albanians and their language (33–35, 64, 84), as well as IBD-sharing analyses (72). The low number of diversifying subclades prior to 500 CE is likely caused by missing data, probably due to significant loss of diversity associated with the demographic turmoil of the Migration Period.
Graphical representation of clade formation (cumulative sum of new subclades) of the principal palaeo-Balkan haplogroups of modern Albanians, using TMRCA estimates from Y-full. Plotted using data from Table S32.
Unlike the abovementioned haplogroups, E-V13 exhibits continuous subclade diversification from the Bronze Age to the Roman period (Fig. 10), suggesting that populations with a high frequency of E-V13 may have followed a different demographic trajectory from those with J2b-Z600, R1b-BY611, R1b-PF7562, and I-M223. The rate of E-V13 subclade diversification increased steeply from 500 CE onwards, following the pattern of the other haplogroups found in modern Albanians (Fig. 10). Based on the above, it is possible that currently unsampled populations from the Central-West Balkan interior that were characterised by high frequencies of E-V13 may have entered the region of modern Albania around 500 CE, where they merged and co-expanded with local groups. This may also explain the absence of E-V13 from the aDNA transect of Albania, despite being the commonest haplogroup in the modern Albanian population.
Together with qpAdm and IBD data, the principal Y-chromosome haplogroups add further evidence for large-scale continuity of modern Albanians from local groups, without excluding the possibility of admixture with neighbouring palaeo-Balkan populations. Continuity is also mirrored in rarer lineages, such as haplogroup T-Y206597, found in the Medieval individual from Kënetë (Table S34), which today comprises almost exclusively Albanians (77).
Discussion
Our genomic transect of the population of Albania from the Neolithic to the modern era reveals fluctuations in genetic ancestry over a period of 8000 years. In contrast to the southeastern Balkans, where the arrival of Pontic-Caspian steppe ancestry and the associated Indo-European cultural package during the EBA did not lead to a lasting genetic turnover (4), we show that contemporary populations in Albania were genetically transformed both in autosomal and paternal ancestry (Fig. 8; Table S5). We find that more than a millennium later, BA-IA Balkan populations with high levels of steppe ancestry (30-40%) formed a distinct genetic cluster that extended across northwestern Greece, North Macedonia and the Adriatic coast (including Albania) and transcended archaeological and linguistic boundaries (Fig. 4A). This genetic continuum was broken down across the Balkans during the Roman and Migration Periods (Fig. 5A), due to mass settlement of Germanic and Slavic-speaking groups in the region.
However, in agreement with linguistic studies, we find that Albanians likely descend from a surviving West palaeo-Balkan population that experienced significant demographic increase between approximately 500 and 800 CE (Fig. 10), perhaps after a population bottleneck. We show that in contrast to the rest of the Balkans, the Medieval samples from both North and South Albania experienced little to no contribution from surrounding Slavic populations (Fig. 6B-C; Tables S12-S13, S15) and maintained high levels of BA-IA West Balkan ancestry. Remarkably, the same genetic profile persisted 500-800 years later in most of the post-Medieval samples from Bardhoc, as shown by the PCA (Fig. 5), qpAdm analyses (Tables S16-S18), and IBD data (Table S20), which indicate significant genetic continuity from the Medieval populations of Albania. However, qpAdm models cannot exclude the possibility of additional admixture with currently unsampled neighbouring late Roman-early Medieval palaeo-Balkan groups with a similar ancestry profile. Based on linguistic data, the area of modern Kosovo and southeastern Serbia may have been such a source (15, 33, 34).
Despite being largely unaffected by the demographic changes that took place during the Migration Period, the historical Albanians did not emerge in isolation. At the peak of the Migration Period, the Medieval population of Albania displayed genetic links as far as Pannonia (Tables S20-S21), while in post-Medieval times we detected the presence of individuals likely related to modern Roma people (Fig. 5A). Furthermore, two of the post-Medieval samples exhibit significant admixture with South Slavic populations (Tables S16-S18), and modern Albanians display highly variable levels of Slavic ancestry (Fig. 5B, Tables S16-S18). This indicates complex historical interactions with South Slavic populations, as suggested by toponymy and linguistics (23, 35).
We reveal that a significant proportion of the paternal ancestry of modern Albanians derives from groups ultimately descending from the BA-IA West Balkans (Tables S33-S34), including those traditionally known as “Illyrians” (Figs. S9, S11), which reflects our findings on autosomal ancestry. However, inferring the language spoken by the Medieval samples from Albania is challenging, as Greek, South Slavic and West Balkan Romance are the only recorded languages of the region (14, 15, 25), while there is no indication of the survival of “Illyrian” following the first centuries of Roman rule (7, 8). Furthermore, Albanian displays Latin loans from both the Western and Eastern Balkans (85), which attests to linguistic influences beyond the confines of modern Albania. Testing the Messapic hypothesis for Albanian (7, 19, 30, 32) was not possible due to the low coverage of said samples (12). Although the presence of haplogroups J2b-L283, I-M223, and R1b-Z2103 among the Messapians (Table S30) suggests a West Balkan origin, whether a related language persisted in the Balkans during Medieval times is unknown.
Even though Eastern Roman historians were unfamiliar with Albanians (22), we cannot exclude the possibility that proto-Albanians interacted with populations speaking Greek, Aromanian, or Slavic in what is now southern Albania during Medieval times. Given that genetic data strongly suggest a predominantly local origin for Albanians, their Medieval ancestors may have inhabited a geographically restricted area [possibly the region of Mat in central Albania (14)], only occasionally venturing towards the south. These movements may have increased in scale over time, finally attracting the attention of Greek-speaking historians in the 11th century (22).
While the quest for the origins of the Albanian language will certainly continue, we expect that the present study will shape these debates and provide the necessary framework for more extensive research on the genetic ancestry of the ancient and modern inhabitants of Albania.
Materials and Methods
Experimental Design
The following sections describe the examined dataset, the resources providing the samples examined herein, the statistical methods used to analyse their ancestry, and the employed data visualisation software.
Dataset
A list of all the samples and naming conventions used in the qpAdm models and PCA of this study can be found in Tables S1-S2. On several occasions, our sample naming differs from that given in the original study, based on insights from the PCA, admixture modelling, and dating interpretation. We mention here naming differences based on a re-interpretation of the samples in question. Individuals I18723, I18721, I18719 from Bezdanjača Cave in Croatia, which were archaeologically dated to the Bronze Age (1500-800 BCE), were interpreted in previous studies as outliers compared to the contemporary population of Croatia (5, 6). We argue that based on insights from the PCA and their uniparental markers, these outlying individuals likely date to post-Medieval times. Individuals I18721 and I18719 are assigned to haplogroup I2-M423>I-Y3120 (Table S23), which is associated with the Slavic expansion toward southern Europe during the Migration Period, and has experienced major founder effects in the South Slavic population of the Balkans (48, 68, 82). This haplogroup is extremely unlikely to have entered the Western Balkans in the Bronze Age, as our extensive haplogroup dataset shows that subclades downstream of I2-M423 appear in the region primarily during the Migration Period (Fig. 8), as expected. Occasional migrants from the Balto-Slavic world to southern Europe are known in the Iron Age, such as two mercenaries from the Battle of Himera in Sicily (58). However, the Bezdanjača Cave outliers are unlikely to represent BA migrants from northern Europe, as the mitochondrial haplogroup of individual I18719 (HV0a1a1b) has a TMRCA of 225 years before present with a person from modern Germany (77). Furthermore, previous archaeological studies in Bezdanjača Cave radiocarbon-dated an ancient individual to the 17th century CE, and also reported the remains of two skeletons which date to World War II (86).
Another sample that might be misdated is I13170 from Velika Gruda in Montenegro, attributed to the Iron Age (800-400 BCE) (6). This individual, which was not radiocarbon dated, clustered with modern South Slavs (6), and was therefore excluded from our analyses.
Within the context of the samples from Albania, individual I13834 from Barç (Southeast, Korça Basin), radiocarbon dated to Post-Medieval times (1452-1619 calCE; 385±15 BP, PSUAMS-8300) (6, 16), is intriguing. The sample clusters with Bronze Age and Iron Age samples from Albania (Fig. 5), and also lacks any of the Iran N-related ancestry present in Medieval, Post-Medieval, and modern samples from Albania (Table S5). In contrast, sample I13839 from neighbouring Shtikë (Southeastern, Kolonja Plateau), which is radiocarbon dated to 889-989 CE (6, 16), derives part of his ancestry from an Iran N-related source (Table S5), as expected. Based on the above, individual I13834 from Barç either maintained an unadmixed profile for more than 1600 years, or it might represent a case of the freshwater reservoir effect, which can cause errors in carbon dating of samples (87). Due to its uncertain dating, we tentatively assigned individual I13834 to the Medieval population on the PCA (Fig. 5A, B), and excluded it from all proximate admixture analyses.
Statistical Analysis
PCA
We undertook Principal Component Analysis (PCA) on a subset of the HO (51) and newly genotyped Reitsema et al. (2022) (58) dataset using the ‘smartpca’ function (v16000) in EIGENSOFT (version 7.2.1) (88) (Fig. S1). SNP datasets in EIGENSTRAT format were combined using the mergeit function included in EIGENSOFT, with default parameters, aside from allowdups: YES; outputformat: EIGENSTRAT; strandcheck: NO; hashcheck: NO. We converted .bam files to EIGENSTRAT format using the mpileup function from the SAMtools software package (v1.16), with optional flags -R -B -q30 -Q30. The resulting pileup files were processed into EIGENSTRAT format using the pileupCaller function from the SequenceTools software package (v1.5.2 - https://github.com/stschiff/sequenceTools) using the default parameters and calling_method: randomHaploid.
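The pseudo-haploid calling step named above can be illustrated with a minimal sketch. It is not pileupCaller's actual implementation: the function name `call_random_haploid`, its arguments, and the rule of discarding reads that match neither allele before sampling are assumptions made for illustration. The only conventions relied upon are that random haploid calling draws a single read per site, and that EIGENSTRAT genotypes count copies of the reference allele, with 9 marking missing data.

```python
import random

MISSING = 9  # EIGENSTRAT code for a missing genotype


def call_random_haploid(bases, ref, alt, rng):
    """Return a pseudo-haploid genotype for one SNP.

    bases: string of read bases covering the site (hypothetical input format)
    ref, alt: the two alleles of the SNP, upper case
    rng: a random.Random instance, passed in so results are reproducible
    """
    # Assumption: reads showing neither allele are ignored.
    informative = [b.upper() for b in bases if b.upper() in (ref, alt)]
    if not informative:
        return MISSING
    picked = rng.choice(informative)  # one read drawn uniformly at random
    # The single allele is written as a homozygous call: 2 = ref, 0 = alt.
    return 2 if picked == ref else 0


rng = random.Random(42)
print(call_random_haploid("AAGA", "A", "G", rng))
```

Because only one read is sampled, a low-coverage individual is never called heterozygous, which avoids biasing calls towards whichever allele happens to have more reads at a site.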
HO dataset individuals of ancestry directly relevant to West Eurasian populations were used (Caucasus, Central Asia, Europe, Middle East), while certain Middle Eastern and North African populations with significant sub-Saharan ancestry that caused shrinkage of the PCA were removed. For compatibility with the HO dataset, SNPs within 1240K SNP datasets corresponding to those in the HO dataset were renamed prior to the datasets being combined using mergeit. We then projected ancient samples onto present-day individuals with “lsqproject:YES” and “shrinkmode:YES”, in three chronologically distinct datasets (Neolithic-Early Bronze Age, Bronze Age-Iron Age, Roman-Post-Roman), resulting in Figs. 2, 4, 5.
Low coverage, non-UDG-treated ancient samples were not included in our PCA, as they can cause artefactual shifts towards the direction of sub-Saharan populations, which is not only misleading but also causes shrinkage of the PCA. To avoid redundancy due to shared ancestry, we also removed close relatives from the ancient samples dataset. The final dataset of the modern and ancient samples used for the PCA is provided in Table S2.
ADMIXTURE
We used ADMIXTURE (v1.3.0) to analyse 1,044 modern and ancient human samples. Most of the individuals were sourced from the Allen Ancient DNA Resource or AADR (v54.1_1240K_public) from the David Reich Lab. Sixteen of the samples came from Antonio et al. 2023 (89). The initial set of 1,150,639 autosomal SNPs was pruned for linkage disequilibrium in PLINK 1.90 with parameters --indep-pairwise 50 25 0.2, which resulted in a final set of 276,155 SNPs. All individuals with a genotyping rate of less than 5% were removed from the analysis. ADMIXTURE was run with the number of ancestral populations (K) ranging from 2 to 8. The results, particularly at K7 and K8, are similar to our formal qpAdm statistics-based analyses. The ADMIXTURE analysis also reveals South Asian-related ancestry in the ALB_Barc_PostMdv_Roma_profile individuals. Unlike other Europeans, these individuals show ∼20% membership in the K7 and K8 clusters (labelled C3 and C1, respectively) that are modal in modern samples from southern and eastern India (such as ILA.SG and BIR.SG).
f3-statistics to measure genetic drift
We employed the ADMIXTOOLS v.2.0.0 package in R (version 4.1.1), operated using RStudio (v. 2022.07.1+554), to estimate outgroup f3-statistics of the form f3(outgroup; population A, population B), where outgroup = Cameroon_SMA, population A = the tested individual or metapopulation, and population B = each of the ancient populations in our working dataset. Outgroup f3-statistics estimate pairwise genetic affinity via allele sharing (90). The higher the value of the f3-statistic, the higher the genetic affinity between population A and the tested ancient population B.
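To make the quantity concrete, the following is a minimal sketch of a naive outgroup f3 estimate, computed as the average over SNPs of (o - a)(o - b), where o, a and b are allele frequencies in the outgroup, population A and population B. It is an illustration only and not the ADMIXTOOLS implementation, which additionally applies a finite-sample bias correction and estimates standard errors by block jackknife. The function name and the toy frequencies are hypothetical.

```python
from statistics import mean


def outgroup_f3(freq_out, freq_a, freq_b):
    """Naive f3(outgroup; A, B): mean over SNPs of (o - a) * (o - b).

    Each argument is a list of allele frequencies, one per SNP.
    SNPs with missing data (None) in any population are skipped.
    """
    products = [
        (o - a) * (o - b)
        for o, a, b in zip(freq_out, freq_a, freq_b)
        if None not in (o, a, b)
    ]
    if not products:
        raise ValueError("no SNPs with data in all three populations")
    return mean(products)


# Toy frequencies (hypothetical): B_close shares its drift with A,
# B_far has drifted much less away from the outgroup along A's path.
outgroup = [0.0, 1.0, 0.0, 1.0]
pop_a = [0.6, 0.4, 0.5, 0.5]
b_close = [0.6, 0.4, 0.5, 0.5]
b_far = [0.1, 0.9, 0.0, 1.0]

f3_close = outgroup_f3(outgroup, pop_a, b_close)
f3_far = outgroup_f3(outgroup, pop_a, b_far)
print(f3_close > f3_far)
```

The product (o - a)(o - b) is large only where A and B have both moved away from the outgroup in the same direction, which is why a larger value is read as more shared drift, and hence closer affinity, between A and B.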
qpAdm admixture modelling
We employed qpAdm to model the ancestry of the examined populations by using a rotating sources approach (91), with an emphasis on the samples from Albania. A model was accepted if its p-value indicated an adequate fit (≥0.05) and the standard errors (SE) of its coefficients were sufficiently low (≤0.15). Among accepted models, those with a satisfactory Z-score (≥3) that also corresponded to patterns observed on the PCA were favoured, although we did accept models with lower Z-scores as well. Models with poor fit (p-value <0.05) or infeasible coefficients (SE >0.15) were rejected and are not shown.
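The numerical part of these acceptance criteria can be written down as a short filter. The sketch below assumes that the Z-score of a coefficient is its estimate divided by its standard error, and that a model is favoured only when every coefficient reaches Z ≥ 3; the class, field names and example values are hypothetical, and the agreement with the PCA, which is a matter of judgement, is not encoded.

```python
from dataclasses import dataclass

P_MIN = 0.05       # minimum p-value for an adequate fit
SE_MAX = 0.15      # maximum standard error of any coefficient
Z_FAVOURED = 3.0   # Z-score at or above which a model is favoured


@dataclass
class QpAdmModel:
    sources: tuple      # names of the source populations
    p_value: float      # tail probability of the model fit
    weights: tuple      # estimated admixture proportions
    std_errors: tuple   # standard error of each proportion


def classify(model):
    """Return 'rejected', 'accepted' or 'favoured' for one model."""
    if model.p_value < P_MIN:
        return "rejected"
    if any(se > SE_MAX for se in model.std_errors):
        return "rejected"
    # Assumption: Z = estimate / SE, required for every coefficient.
    z_scores = [w / se for w, se in zip(model.weights, model.std_errors)]
    return "favoured" if min(z_scores) >= Z_FAVOURED else "accepted"


# Hypothetical example models, not results from this study.
models = [
    QpAdmModel(("Local", "SlavicProxy1"), 0.32, (0.70, 0.30), (0.05, 0.05)),
    QpAdmModel(("Local", "SlavicProxy2"), 0.20, (0.80, 0.20), (0.10, 0.10)),
    QpAdmModel(("Local", "SlavicProxy3"), 0.01, (0.60, 0.40), (0.04, 0.04)),
    QpAdmModel(("Local", "SlavicProxy4"), 0.40, (0.75, 0.25), (0.16, 0.16)),
]
labels = [classify(m) for m in models]
print(labels)
```

In a rotating design the same filter is applied to every combination of candidate sources, so the thresholds are held fixed while the source and reference sets change.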
We note that qpAdm analyses between ancient and modern individuals use around half (600k) of the SNPs compared to ancient-to-ancient models, which exploit the full 1240k SNPs available. This is due to the fact that all modern individuals have been genotyped with the HO SNP array (90) which tests for a much smaller number of SNPs compared to the 1240k array used for ancient samples. qpAdm can be used successfully even with low coverage samples (91), and the admixture proportions we recover for the modern Albanian samples are consistent with their position on the PCA and the ADMIXTURE analyses. However, we caution that the resolution of the models involving modern samples will be lower compared to those including only ancient metapopulations.
An extensive description of our qpAdm models and the rationale behind the chosen reference populations can be found in the Supplementary Material.
Mobest analysis
The Mobest analyses were run using a kernel size of 800 (corresponding to 800 km in space and 800 years in time). The prediction grid was set to 50 by 50 km tiles. As genetic input we used the first two PCs of the West Eurasian PCA as well as the first two PCs of a PCA only including European individuals. Our reference dataset comprised 5664 published individuals from the Allen Ancient DNA Resource and the samples from the Antonio et al. (2022) pre-print that were used for the qpAdm analyses (Table S1). The reference samples date between 150 and 5000 BP. Those samples are located between 29.9° and 70° Lat. and -24° and 70° Long. (in EPSG:4326 projection). Samples with fewer than 15k SNPs were excluded. The relative search time was set to 0; thus, the probability surface indicates the highest genetic-geographical match at the mean date of the respective individual.
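The inclusion criteria for the reference dataset stated above can be summarised as a simple filter. This is only a sketch of the selection rules, not of Mobest itself: the dictionary keys and the example records are hypothetical, and the bounds are assumed to be inclusive.

```python
# Bounds taken from the text; inclusive limits are an assumption.
LAT_RANGE = (29.9, 70.0)      # degrees latitude, EPSG:4326
LON_RANGE = (-24.0, 70.0)     # degrees longitude, EPSG:4326
AGE_RANGE_BP = (150, 5000)    # mean date in years before present
MIN_SNPS = 15_000             # samples with fewer SNPs are excluded


def in_reference_set(sample):
    """Return True if a sample meets all stated reference criteria."""
    return (
        LAT_RANGE[0] <= sample["lat"] <= LAT_RANGE[1]
        and LON_RANGE[0] <= sample["lon"] <= LON_RANGE[1]
        and AGE_RANGE_BP[0] <= sample["age_bp"] <= AGE_RANGE_BP[1]
        and sample["n_snps"] >= MIN_SNPS
    )


# Hypothetical records for illustration only.
samples = [
    {"id": "s1", "lat": 41.3, "lon": 19.8, "age_bp": 1100, "n_snps": 600_000},
    {"id": "s2", "lat": 41.3, "lon": 19.8, "age_bp": 7000, "n_snps": 600_000},
    {"id": "s3", "lat": 25.0, "lon": 45.0, "age_bp": 2000, "n_snps": 800_000},
    {"id": "s4", "lat": 45.0, "lon": 16.0, "age_bp": 2500, "n_snps": 9_000},
]
kept = [s["id"] for s in samples if in_reference_set(s)]
print(kept)
```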
Y-chromosome analysis and interpretation
Ancient DNA Y-chromosome haplogroup data were aggregated from the literature (Tables S22-S31), albeit at low subclade resolution. In most cases, the terminal subclade we cite was assigned by yfull.com (77) and the free, publicly available FamilyTreeDNA Discover Beta™ (Gene by Gene Ltd), both being broadly used resources for inferring human uniparental haplogroups and their TMRCA (6, 92, 93). Additionally, publicly available raw data from Albania by Lazaridis et al. (2022) (6, 16) and Central-Western Balkan samples by Antonio et al. (2022) (89) were manually evaluated in the present study using IGV (94), and were also called with SAMtools and BCFtools (95) by Open Genomes (Ted Kandell), and Open Genomes and the Society of Serbian Genealogists “Poreklo” (Milan Rajevac). TMRCAs are publicly available at yfull.com and FamilyTreeDNA Discover Beta™. The modern Albanian data populating the yfull.com and FamilyTreeDNA Discover Beta™ Y-chromosome public databases stem from direct-to-consumer whole genome sequencing tests (BigY (96), Nebula Genomics, Dante Labs, YSEQ).
To generate Figs. 8-10, S9-S11, we assembled all Y-chromosome haplogroups from Tables S22-S30 and grouped them into Table S31, which we used to produce a bar-chart in RStudio using ggplot2 (97). Regarding Fig. 10, which plots the TMRCAs of the principal Y-chromosome haplogroups of modern Albanians (E-V13, J2b-Z600, R1b-BY611, I-M223, and R1b-PF7562), we consulted the corresponding phylogenetic trees at yfull.com (77). We then assembled in Table S32 the TMRCAs of the Y-chromosome subclades associated with Albanians and their expansions into neighbouring regions (Greece, Bosnia, Montenegro, North Macedonia, Serbia). We did not include the TMRCAs of subclades lacking an association with Albanians (i.e. not having any Albanians in their daughter lineages). To study the Y-chromosome haplogroup distribution of the modern Albanian population, we used both academic samples from the scientific literature (75, 82, 83) (n = 377) and the publicly available dataset of rrenjet.com (98) (n = 1534) (Table S33, Supplementary Material).
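The clade-formation curves of Fig. 10 are described as a cumulative sum of new subclades ordered by TMRCA. A minimal sketch of that calculation is given below, assuming TMRCAs already expressed as calendar years (negative for BCE) and grouped into fixed-width bins; the function name, the 100-year bin width and the example values are hypothetical and do not come from Table S32.

```python
from collections import Counter
from itertools import accumulate


def cumulative_clade_formation(tmrca_years, bin_width=100):
    """Return (bin_start, cumulative_count) pairs in chronological order.

    tmrca_years: TMRCA of each subclade as a calendar year (BCE negative)
    bin_width: width of each time bin in years; bin b covers [b, b + width)
    """
    # Floor division also places BCE (negative) years in the correct bin.
    counts = Counter((year // bin_width) * bin_width for year in tmrca_years)
    bins = sorted(counts)
    totals = accumulate(counts[b] for b in bins)
    return list(zip(bins, totals))


# Hypothetical TMRCAs for one haplogroup's Albanian-specific subclades.
example_tmrcas = [-1200, -300, 550, 600, 620, 750, 1100]
curve = cumulative_clade_formation(example_tmrcas)
print(curve)
```

A steep rise in the cumulative count over a short span of bins corresponds to a burst of subclade formation, which is the pattern the text reports for 500-800 CE.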
IBD data
A recent study (50) provides a dataset with IBD-sharing between 10156 ancient Eurasian individuals. We mined said dataset for samples from Albania and their matches, which are presented in Table S20. We also provide data on IBD-based matches for Scythian and Avar era samples which were characterised by haplogroups of possible Balkan origin (E-V13, J2b-Z600, R1b-Z2103 and R1b-PF7562) in Table S21. We excluded potential false positive IBD matches appearing among shotgun-sequenced and non-UDG-treated samples of the ancIBD dataset, especially in Viking era individuals (50).
Data visualisation
We used R (version 4.1.1) via RStudio (v. 2022.07.1+554). All plots were created using package ggplot2. To generate vertical stacked bar plots, we additionally used package forcats. To combine plots, we used Adobe Illustrator v. 27.4. We generated maps using the QGIS Geographic Information System, QGIS Association (http://www.qgis.org) and SimpleMappr.
Acknowledgements
LRD acknowledges the Leverhulme Trust Early Career Fellowship grant (ECF-2021-199) for funding him during this research. The authors’ contributions are as follows: Conceptualization: LRD. Data curation: LRD, AA. Formal Analysis: LRD, AA, DW, AH. Funding acquisition: LRD. Investigation: LRD. Methodology: LRD, AA, DW, AH. Project Administration: LRD. Visualization: LRD, AA, DW, AH. Writing – original draft: LRD. Writing – review and editing: LRD, AA, DW, AH.
The authors are grateful to Joscha Gretzinger (Max Planck Institute for Evolutionary Anthropology, Leipzig) for undertaking the Mobest analyses. LRD thanks Andreas Kyropoulos and Leonidas Embirikos for extensive discussions on Albanian history and linguistics, and James Kempton for assistance with QGIS mapping. LRD thanks Alexandros Spanos and Leo Cooper for assistance in compiling part of the ancient Greek Y-chromosome dataset (Table S25). The authors thank Ted Kandell (Open Genomes) and Milan Rajevac (Open Genomes, the Society of Serbian Genealogists “Poreklo”) for providing detailed Y-chromosome haplogroup determinations for part of the examined dataset.
The authors declare that they have no competing interests. All data needed to evaluate the conclusions in the paper are present in the paper and the Supplementary Materials.
Footnotes
aenaristo{at}gmail.com, eurogenesblog{at}gmail.com, a.heraclides{at}euc.ac.cy