Comprehensive analysis of microsatellite polymorphisms in human populations

Microsatellites (MS) are tandem repeats of short units, and have been used for population genetics, individual identification, and medical genetics. However, studies of MS on a whole-genome level are limited, and genotyping methods for MS have yet to be established. Here, we analyzed approximately 8.5 million MS regions using a previously developed MS caller for short reads (MIVcall method) for three large publicly available human genome sequencing data sets: the Korean Personal Genome Project, Simons Genome Diversity Project, and Human Genome Diversity Project. Our analysis identified 253,114 polymorphic MS. A comparison among different populations suggests that MS in the coding region evolved by random genetic drift and natural selection. In an analysis of genetic structures, MS clearly revealed population structures as SNPs and detected clusters that were not found by SNPs in African and Oceanian populations. Based on the MS polymorphisms, we selected MS marker candidates for individual identification. Finally, we applied our method to a deep sequenced ancient DNA sample. This study provides a comprehensive picture of MS polymorphisms and application to human population studies.


Introduction
Repetitive sequences account for more than two-thirds of the human genome (de Koning et al. 2011). Among them, sequences consisting of tandem repeats of short units are classified as microsatellites (MS). The mutation rate of MS is higher than of other genomic regions, and MS have high diversity among individuals (Ellegren 2004).
Due to their high heterozygosity and multiallelic nature, MS have been widely used as genetic markers in population studies (Shriver et al. 1997;Ellegren 2004;Pemberton et al. 2013). Previous studies have analyzed dozens to several hundreds of MS and revealed the genetic structure of modern human populations, the relationship of genetic and linguistic variations, and the trace of natural selection (Shriver et al. 1997;Rosenberg et al. 2002;de Filippo et al. 2012). These studies have made a great contribution to understanding human evolutionary history. However, although conventional PCR-based MS genotyping methods, such as capillary electrophoresis-based fragment length analysis, are highly accurate and able to analyze long MS, they are not suitable for high-throughput genotyping; therefore, MS has largely been replaced with single-nucleotide polymorphism (SNP) in related studies. Indeed, in the past decade, genomewide SNP analysis had become a standard method for human population genetics (Novembre et al. 2008).
In addition to population studies, MS has been widely used for personal identification in forensic science and paternity testing (Ruitberg et al. 2001;Butler 2006), since MS are multiallelic and have more discriminative power than SNP (Ellegren 2004). Established common MS sets, such as the Globalfiler kit, have been used for many years (Ludeman et al. 2018). Although these MS marker sets have sufficient power for most cases, they are not selected from entire genomes and there may be other more efficient MS in the human genome.
Next-generation sequencing technologies (NGS) enable whole-genome sequencing (WGS). In the past decade, applications of NGS and the development of algorithms for the analysis have successfully identified genetic variations, including single-nucleotide variations, insertions and deletions, copy-number variations, and structural variations (Nagasaki et al. 2015). However, due to the short-read length and the high sequencing error rate in 1 3 repeat regions, the identification of mutations and polymorphisms in MS regions has been difficult. Previously, several groups including ours developed MS genotyping tools from WGS data (Willems et al. 2014;Fujimoto et al. 2020;Jakubosky et al. 2020). Although these WGS-based tools cannot be used to analyze long MS and may be less accurate than conventional genotyping methods, they can provide a comprehensive picture of MS variation in the genome. Indeed, they have identified somatic mutations and germline polymorphisms in MS in the human population and revealed genome-wide patterns of MS polymorphisms, factors that determine the mutation rate of MS, and functional roles of MS on gene expressions (Willems et al. 2014;Gymrek et al. 2015Gymrek et al. , 2017Fujimoto et al. 2020). Although these studies provide important information on MS, the MS genotyping method is not perfect, and only a few studies have been conducted for MS polymorphisms. Indeed, the patterns of genome-wide MS polymorphisms have not been well analyzed in human populations. Nor has the amount of genetic variation in MS among different human populations been compared in detail. Furthermore, clustering based on principal component analysis (PCA) with MS polymorphisms has presented unclear results compared to that with SNPs (Willems et al. 2014;Mallick et al. 2016), and the efficiency of genome-wide MS for analyzing population structures requires more study.
Here, we analyzed approximately nine million MS regions using a previously developed MS caller for three large publicly available human genome sequencing data sets: Korean Personal Genome Project (KPGP), Simons Genome Diversity Project (SGDP), and Human Genome Diversity Project (HGDP; Cavalli-Sforza 2005; Mallick et al. 2016;Fujimoto et al. 2020). We revealed the pattern of MS polymorphisms, analyzed the genetic structure of the populations with several dimensionality reduction methods, and identified useful candidate MS for individual identification. Additionally, we analyzed MS of an ancient DNA sample (Kanzawa-Kiriyama et al. 2019). Our analysis provides a comprehensive picture of MS polymorphisms and their application to human population studies.

Establishment of MS calling
We identified genotypes of MS with the MIVcall method (Fujimoto et al. 2020). MIVcall detects genetic variations in MS regions by considering sequencing error rates of different repeat units. MIVcall calculates a likelihood that the observed pattern of read numbers is caused by sequencing errors; the higher the likelihood value the more likely the pattern was caused by sequencing errors. MIVcall outputs the log 10 (likelihood) and number of reads for each MS allele, which can then be used to evaluate the reliability of the MS genotypes. In our previous study, we evaluated the threshold of log 10 (likelihood) using simulation data from male chrX and performed an experimental validation with capillary electrophoresis. In the present study, we examined two output parameters, log 10 (likelihood) and the number of reads, using monozygotic twins in the KPGP (KPGP-00088 and KPGP-00089). Because monozygotic twins have completely same genotypes, we considered all disconcordant genotypes between the twins as errors. We compared the genotypes of 8,343,174 MS in the twins and classified them into concordant homozygote, concordant heterozygote, disconcordance of two alleles, and disconcordance of one allele (Table S1). Based on the results, the cutoff − log 10 (likelihood) and minimum number of reads for allele detection were set to − 4 and 3, respectively, in this study (Table S1).

Selection of samples and MS
We selected samples for the analysis. In this study, MS covered by less than ten reads were considered insufficient depth, and we excluded samples if more than 4% had insufficient depth of MS. As a result, 276 samples from the SGDP, 693 samples from the HGDP, and 81 samples from the KPGP (excluding one of the monozygotic twins) were selected.
We next selected MS loci for this study using the 81 Korean samples from the KPGP. We selected MS that were genotyped (number of reads ≥ 10) in more than 85% of the 81 samples, leaving 8,468,218 MS. We also tested deviation from the Hardy-Weinberg equilibrium (HWE) in the KPGP, but no MS showed significant deviation.
Of the selected 8,468,218 MS, 7,740,569 MS were monomorphic. The average number of analyzed MS was 8,460,463.5 in the HGDP and SGDP combined, and 8,431,686.8 in the KPGP. In an ancient DNA sample (F23), 7,787,953 MS were analyzed. The average numbers of reads of MS were 31.7 in HGDP and SGDP combined,

Genome-wide pattern of MS polymorphisms
The length of MS was negatively correlated to the number of samples with insufficient depth (p value < 10 -200 Kruskal-Wallis test) but positively correlated to the number of alleles (p value < 10 -200 Kruskal-Wallis test; Fig. 1a, b). The increase in the number of alleles was more gradual in longer MS (Fig. 1b). Longer MS had fewer reads fully covering them compared to shorter MS and thus a lower detection sensitivity (Fujimoto et al. 2020). The number and heterozygosity of different repeat units showed that MS with higher AT content had significantly higher heterozygosity (Pearson's uncorrelated test p value = 1.47 × 10 -61 ), suggesting that the mutability of MS is affected by the base composition (Payseur et al. 2011;Fig. 1c, d).
To compare the genetic variation among human populations, we calculated the distribution of the heterozygosity of all MS, MS in CDS, MS in non-CDS, 3n MS in CDS, and non-3n MS in CDS ( Fig. 3a-e; Table S5). We also calculated the ratio of the mean heterozygosity of non-3n MS to mean heterozygosity of 3n MS in CDS among each region (Fig. 2f). Africa had the highest heterozygosity and lowest ratio of mean heterozygosity, as previously reported (Willems et al. 2014; Fig. 3; Table S5).
We observed that 696 genes had 724 polymorphic non-3n MS (Tables S4, S6). We performed a pathway analysis for these genes, but did not find any significantly over-represented pathway (data not shown). Among the 696 genes, 444 genes were reported in the GWAS catalog database, and the MS polymorphisms may affect diseases and traits (Table S6).

Analysis of population structure
To analyze the population structure, we conducted five dimensionality reduction methods for MS polymorphisms: PCA, t-Distributed Stochastic Neighbor Embedding (t-SNE), Uniform Manifold Approximation and Projection (UMAP), PCA-t-SNE, and PCA-UMAP using MS and SNP. For this analysis, we used MS and SNP with MAF ≥ 1% Africa,358,538 in America,416,191 in Central Asia and Siberia,403,015 in East Asia,389,358 in Oceania,437,410 in South Asia, and 432,163 in West Eurasia (Table S2).
To apply dimensionality reduction methods, we converted MS genotypes to numerical values using two methods, the multiallelic method and average method (see Methods), and performed PCA with both methods for all samples. Although the results were not significantly different between the two, the average method had a higher contribution rate (PC1 = 5.80%) than the multiallelic method (PC1 = 4.97%; Supplementary Fig. 2). Therefore, we selected the average method in this study.
PCA was conducted for all regions and each individual region ( Fig. 4; Supplementary Fig. 3). We found similar patterns between MS and SNPs for all samples (Fig. 4a, b). However, patterns were slightly different in African and Oceanian populations between MS and SNPs ( Fig. 4c-f). These results suggest that genome-wide MS has comparable resolution to SNPs for the genetic structure of human Fig. 2 Features of MS in CDS and non-CDS regions. a MS in CDS had higher GC content (p value = 6.49 × 10 -144 Wilcoxon rank sum test; mean 0.42 in CDS and 0.14 in non-CDS). b MS in CDS had shorter lengths (p value = 1.25 × 10 -29 Wilcoxon rank sum test; mean 8.2 in CDS and 9.3 in non-CDS). c MS in CDS had fewer alleles (p value = 1.80 × 10 -47 Wilcoxon rank sum test, mean 2.2 in CDS and 2.7 in non-CDS). d MS in CDS had lower heterozygosity (p value = 1.42 × 10 -89 Wilcoxon rank sum test; mean 0.024 in CDS and 0.092 in non-CDS). e Total number of MS loci per unit length in non-CDS. f Total number of MS loci per unit length in CDS. g Number of multiallelic MS of each unit length in CDS. h Number of multiallelic MS of each unit length in non-CDS. i Heterozygosity of MS in non-CDS and CDS. Non-3n MS had lower heterozygosity than 3n MS (p value = 5.23 × 10 -13 Wilcoxon rank sum test). Orange lines in (a)-(d) and (i) indicate medians ◂ populations and that MS can be used to find new genetic structures. t-SNE, UMAP, PCA-t-SNE, and PCA-UMAP were also conducted for all samples (Supplementary Fig. 4). In most of these analyses, MS did not detect novel clusters; however, MS discriminated African populations from the others in the t-SNE analysis ( Supplementary Fig. 4a).

MS marker set for individual identification
Our analysis identified 1,495 MS with a unit length = 4 bp and heterozygosity ≥ 0.5 (Table S7). From these, 22 MS with the highest heterozygosity were selected from each chromosome (Table S8). For the selected 22 loci, we calculated the discriminative power using allele frequencies of 255 Japanese ICGC datasets (Table S8). The discriminative power for this MS set was estimated to be 1.0 × 10 -17 .

MS analysis of an ancient DNA sample
Although the genome-wide MS analysis of ancient DNA samples has yet to be conducted, an analysis of MS variation in ancient DNA samples may contribute to clarifying the genetic structure of ancient populations. Since low-quality sequencing data of ancient DNA samples can result in incorrect results, we selected an ancient DNA sample with high sequence depth (sample ID; F23; Kanzawa-Kiriyama et al. 2019) and analyzed the distribution of the variant allele frequency (VAF) for MS ( Fig. 5a-d). VAF was calculated as (number of reads with second most frequent pattern)/ (total number of reads) for heterozygous MS. The distributions of VAF were quite different between F23 and modern human samples in MS with lengths < 3 bp, suggesting that genotypes contained many errors. However, in MS with lengths ≥ 3 bp, the distributions of VAF were not different ( Fig. 5a-d). Therefore, we used MS with lengths ≥ 3 bp and performed PCA with HGDP and SGDP samples (All, East Asia, America, Oceania, Central Asia and Siberia, and South Asia; Fig. 5e, f). F23 was clustered close to Central Asia and Siberia and East Asia populations in the PCA plot.

Discussion
Population genetic studies strongly depend on variant calling. Thus, like other types of variants, MS analysis is affected by genotyping methods. Since most MS calling methods analyze only predefined MS regions, the number of target MS is different among studies (700,000 loci in Gymrek et al. 2017;Willems et al., 2014, and 1.6 million loci in Jakubosky et al. 2020;Willems et al. 2014;Gymrek et al. 2017;Jakubosky et al. 2020). In this study, we analyzed a larger number of MS regions (8,468,218 MS) than previous studies for a comprehensive analysis of human MS polymorphisms. Compared to conventional PCR-based MS studies (Rosenberg et al. 2002), this study has another advantage: MS were not influenced by ascertainment bias. Most conventional studies have analyzed pre-screened MS marker sets, which are influenced by the MS marker selection. On the other hand, we did not select MS based on the allele frequency in certain populations and could analyze features of MS and compare the variation among different populations without the influence of ascertainment bias.
We first selected parameters for the analysis based on the evaluation of monozygotic twins in the KPGP. The concordant rate of this parameter set was estimated to be 99.87%, which is sufficiently accurate for population studies (Table S1). We then selected 8,468,218 MS based on the call rate and HWE in the KPGP for 81 Korean individuals. Our MS calling identified 253,114 MS with MAF ≥ 1% in the SGDP and HGDP (Supplementary Fig. 1). Since the SGDP and HGDP datasets represent human genome diversity, these MS polymorphisms can be used for future population studies. Previous genome-wide studies reported that MS polymorphisms could influence gene expression patterns and human traits (Gymrek et al. 2015;Jakubosky et al. 2020). Therefore, applications of our MS set and our MS calling method may contribute to discovering novel disease susceptibility genes (Press et al. 2014). In particular, 696 genes with non-3n MS, in which 444 genes were reported by previous GWAS studies, are good targets for disease studies (Table S6).
The amount of genetic variation reflects the effective population size (N). A comparison of heterozygosities across regions showed Africa with the highest (Fig. 3; Table S5) and America, which was estimated to have a very small effective population size, with the lowest (Fig. 3; Table S5; Hey 2005). Such patterns were observed in other genetic variations and MS in previous studies (Willems et al. 2014;Auton et al. 2015), suggesting that the heterozygosity of MS reflects the size of each population. Theoretical population genetics also predicts that the effectiveness of natural selection depends on the selection coefficient (s) of the genetic variation and population size (Charlesworth 2009). Therefore, a comparison of genetic variations of MS in the CDS may provide additional information about the evolution of MS. Most of the 3n MS should not cause severe damage to protein functions and have neutral or nearly neutral effects, whereas non-3n MS cause a frameshift and should have deleterious effects. We attempted to evaluate the strength of  Table S5). The African population showed the lowest ratio, and populations with lower heterozygosity tended to have a higher ratio ( Fig. 3f; Table S5). This pattern indicates that the African population has the largest effective population size, and that stronger natural selection has acted to remove deleterious non-3n MS. In Central Asia and Siberia, the heterozygosity was not the lowest, but the heterozygosity ratio of non-3n MS to 3n MS in CDS was the highest (Fig. 3f; Table S5). A previous study showed that subdivided populations have a higher effective population size and lower selection coefficient (Cherry and Wakeley 2003). The Central Asia and Siberia population may be composed of subpopulations, which may affect the selection pressure against MS. These results indicate that MS evolved by the combination of population history and natural selection.
MS has been used to infer genetic structures because of high genetic diversity (Rosenberg et al. 2002). In a previous study, PCA was conducted using 53,002 MS, but the genetic structures by the MS PCA were less clear than that by SNP PCA (Mallick et al. 2016). In the present study, we used a larger number of MS (253,114 MS with MAF ≥ 1%) and obtained highly concordant results with SNPs (Fig. 4a, b). Although the overall patterns were similar between MS and SNP ( Fig. 4; Supplementary Fig. 3), small differences were observed in African and Oceanian populations (Fig. 5c-f). In Oceanian populations, NAN-Melanesians (NAN; Non-Austronesian) and Bougainville, who belong to Melanesians, were clustered in the SNP PCA but not in the MS PCA (Fig. 4e, f). In African populations, Biaka and Mbuti populations showed different patterns between the SNP and MS PCAs (Fig. 4c, d), which may be caused by hidden population structures. Although the efficiency of using MS for genetic structures should be evaluated by a larger number of samples, these results indicate that MS can be an additional marker set and may detect hidden population structures in the human population.
In addition to the modern human samples, we analyzed a deep sequenced ancient human sample (F23; Kanzawa-Kiriyama et al. 2019). In ancient genome sequencing, the DNA fragmentation and library construction process should affect the quality of the sequence reads. To evaluate the quality of the MS call, we compared the distribution of VAF of this sample with that of modern human samples (Fig. 5a-d). The clear skew of VAF was observed in MS with unit lengths ≤ 2 bp, suggesting that MS with a short unit are strongly affected by the quality of DNA samples. Our previous study estimated that MS with short repeat units have higher sequencing error rates than those with longer units (Fujimoto et al. 2020). Therefore, MS with short units in the ancient DNA sample should be strongly affected by errors. However, the distributions of VAF for unit lengths ≥ 3 bp were not different; therefore, we used these MS for the analysis. In the PCA, F23 was close to East Asians, which is consistent with the SNP PCA in a previous study (Kanzawa-Kiriyama et al. 2019;Fig. 5e, f). Although this analysis provided a plausible result, one needs to note that it was performed for a sample with very high depth (mode of depth = 48×). The majority of ancient DNA samples have much lower depth, making MS analysis difficult. In the future, improvements in sequencing technology may produce deep sequenced ancient DNA data, allowing the application of MS to detect novel population structures.
We selected 22 MS marker candidates for the personal identification. Using the allele frequencies in a Japanese population, the discriminative power was estimated to be 1 × 10 -17 , which is sufficient for personal identification. Although the discriminative power of our MS set is slightly lower than that of the Globalfiler kit, which is a standard MS set, for a Japanese population (5.6 × 10 18 ; Fujii et al. 2015), the length of our MS was shorter and can be genotyped by short-read sequencers. Additionally, the PCR success rate of MS is known to be affected by the length of the MS (Schneider et al. 2004), and our shorter MS may be robust to DNA degradation.
This study provides a comprehensive catalog of MS in human populations and shows the applicability of MS to modern and ancient human population studies. Nevertheless, our study has several limitations. First, the genotyping of MS needs reads that cover MS regions. Therefore, the amount of data and read length strongly affect the results. For example, we removed 824,459 MS and 395 samples from the SGDP and HGDP due to insufficient depth. Deeper sequence data would improve the quality of the MS calling. Second, long MS cannot be analyzed using short-read data. A previous study has reported that the mutation rate of longer MS is higher than that of shorter MS (Sun et al. 2012). Therefore, longer MS, which were not analyzed in this study, should have higher genetic variations. In the current study, only ~ 3% of MS were polymorphic, but longer MS should contain a large percentage. A recent study using a long-read sequencer reported high genetic variation in long repeat regions (Audano et al. 2019). In the future, the analysis of long-read data should detect a larger number of polymorphic MS, which would provide a better catalog of MS polymorphisms.  Currently, large-scale sequencing projects are ongoing worldwide, in which the analysis of MS, in addition to SNPs, should provide deeper understanding of human genetic variations and benefit genome medicine. KPGP data were used for the MS selection and parameter optimization of the MS calling. The KPGP has data from monozygotic twins (KPGP-00088 and KPGP-00089), which were used for the parameter optimization of the MS calling (Table S9). One of the monozygotic twins and other Korean samples (in total n = 81) were used for the MS selection. The HGDP and SGDP samples were merged and used as a single dataset (Table S9). When population names were inconsistent between the SGDP and HGDP, we adopted the population names of the SGDP (Table S9).

Data
As a result of the sample selection (see below), we selected 81 Korean samples from the KPGP and 969 samples from the HGDP and SGDP (138 samples from Africa, 70 from America, 48 from Central Asia and Siberia, 193 from East Asia, 38 from Oceania, 195 from South Asia, and 287 from West Eurasia). The quality of the ICGC sequencing data was not constant among samples; therefore, Japanese samples from the ICGC were used to estimate the allele frequencies of our MS marker set for personal identification (Table S10).
The downloaded bam files were results of the mapping to GRCh37; therefore, our analysis was based on GRCh37. The list of MS is provided as Additional file 1. For future studies, we also provide a list of MS based on GRCh38 (Additional file 2).

MS genotyping using MIVcall method
Target MS regions were selected in our previous study (Fujimoto et al. 2020) using three software packages: MSDetector, Tandem Repeat Finder, and MISA software (Benson 1999;Girgis and Sheetlin 2013;Hause et al. 2016). Regions were filtered based on the uniqueness of the flanking sequences and the distance to other MS. Insertions and deletions in a target MS were detected using the MIVcall method (Fujimoto et al. 2020); MIVcall counts the length of each MS in each read. When multiple lengths are observed in an MS locus in a sample, the most frequent pattern is assumed to be present, and the second most frequent pattern is examined. The likelihood value was calculated based on the number of reads and the difference in length between the most frequent pattern and the second most frequent pattern. In addition to the general pattern, non-unit length alleles can exist (for example, + 1 bp change in 4-bp MS; Tables S7, S8). If the difference between the most frequent pattern and the second most frequent pattern was not the multiple of the unit length, the whole number was used for the calculation of the likelihood value (L). Genotypes were determined based on the likelihood value, the number of reads of each allele, and VAF.

Establishing the MS detection method
In our previous study (Fujimoto et al. 2020), the optimal criteria of likelihood and number of reads [the minimum − log 10 (likelihood) value and minimum number of allelic reads] were chosen to analyze somatic mutations. To obtain the optimal criteria for a polymorphism, we used monozygotic twins in the KPGP (KPGP-00088 and KPGP-00089). Since all genotypes of monozygotic twins are identical, we tested various parameter sets and compared the concordance rates of genotypes between twins.

Sample selection and MS filtering
Since MS are susceptible to sequencing errors, selecting high-quality samples is necessary. Thus, we counted the total number of reads for each MS, and MS covered by less than ten reads were considered MS with insufficient depth. We excluded samples if more than 4% of MS loci had insufficient depth.
Next, we selected MS loci from the 9,292,677 MS selected in our previous study (Fujimoto et al. 2020). Using the 81 Korean samples in the KPGP, we counted the number of samples with insufficient depth for each MS and removed samples if the percentage of MS with insufficient depth was ≥ 15%. Additionally, we tested deviations from the HWE with Fisher's exact test for a 2 × n contingency table (n; number of genotypes) in the KPGP (α = 0.0001).

Genome-wide pattern of MS
In this study, we analyzed MS in the autosome. To reveal the landscape of MS polymorphisms, we analyzed the features of MS. This analysis was performed using HGDP and SGDP samples. We analyzed the association of the length of an MS region in the reference genome with the number of samples with insufficient depth and heterozygosity. We also examined the number and heterozygosity of MS for repeat units with different sequences (for example, A, G, and AC). We merged MS with different unit sequences if reverse-complement (e.g., GT to CA) or reverse (e.g., TA to AT) generated the same sequences.
We then focused on MS in CDS and in non-CDS. We compared GC content, MS length, number of alleles, and heterozygosity between CDS and non-CDS MS. In CDS, MS were classified into 3n MS (3 and 6 bp) and non-3n MS (1, 2, 4, 5 and 7 bp). We compared heterozygosity among all MS, 3n MS, non-3n MS, MS in CDS, and MS in non-CDS. The ratios of the mean heterozygosity of non-3n MS to the mean heterozygosity of 3n MS were compared among the different populations. We also performed a pathway analysis for genes with multiallelic non-3n MS using the Reactome database.

Analysis of population structures with MS
We conducted five dimensionality reduction methods: Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), Uniform Manifold Approximation and Projection (UMAP), PCA-t-SNE, and PCA-UMAP. For these analyses, the genotypes of MS and SNPs were converted to numerical values. SNP genotypes were converted to numerical values by counting the number of minor alleles. For MS, we used two methods, the multiallelic method and average method. In the multiallelic method, each allele at an MS locus is treated as a different marker (if three samples have the following genotypes, [8/8, 13/14, 8/14], we converted them into 3 independent pseudo-loci: [2, 0, 1] (8 or else), [0, 1, 0] (13 or else), and [0, 1, 1] (14 or else; Putman et al. 2014). In the average method, we calculated the average length of two alleles in each individual (in the previous example, the average method converts the genotypes to [8, 13.5, 11]; Supplementary Fig. 5). We performed a PCA with both methods and compared the results.
We used MS and SNP with MAF ≥ 1% in the SGDP and HGDP samples. MAF was calculated as 1 -(major allele frequency). A PCA was conducted for all samples and samples in each region (Africa, America, Central Asia and Siberia, East Asia, Oceania, South Asia, and West Eurasia). Other dimensionality reduction methods (t-SNE, UMAP, PCA-t-SNE, and PCA-UMAP) were applied to all samples only.

MS sets for personal identification
Individual identification with MS is an important topic in human genetics. We selected a set of 22 MS and estimated the discriminative power. We calculated the heterozygosity of MS with repeat unit lengths = 4 and selected 22 MS with the highest heterozygosity in each chromosome. The allele frequencies of the selected 22 loci were estimated using 255 Japanese samples from the ICGC data. The discriminative power was calculated as the product of frequencies of the most frequent genotype in each locus (Fujii et al. 2015).

Analysis of ancient DNA samples
To analyze MS variation in ancient DNA, we analyzed one ancient DNA sample with a higher depth of coverage from a previous study (Sample ID; F23; Kanzawa-Kiriyama et al. 2019). To examine the quality of the variant calling, we calculated the VAF of each MS of F23 and compared it with the average VAF of 10 randomly selected HGDP samples. We then conducted a PCA for F23 sample with the SGDP and HGDP samples.