Long-read Data Revealed Structural Diversity in Human Centromere Sequences

Centromeres invariably serve as the loci of kinetochore assembly in all eukaryotic cells, but their underlying DNA sequences evolve rapidly. Human centromeres are characterized by their extremely repetitive structures, i.e., higher-order repeats, rendering the region one of the most difficult parts of the genome to assess. Consequently, our understanding of centromere sequence variations across human populations is limited. Here, we analyzed chromosomes 11, 17, and X using long sequencing reads of two European and two Asian genomes, and our results show that human centromere sequences exhibit substantial structural diversity, harboring many novel variant higher-order repeats specific to individuals, while frequent single-nucleotide variants are largely conserved. Our findings add another dimension to our knowledge of centromeres, challenging the notion of stable human centromeres. The discovery of such diversity prompts further deep sequencing of human populations to understand the true nature of sequence evolution in human centromeres.


Direct determination and quantification of HOR variants via HOR encoding of long reads
To investigate interindividual variations within the centromeric array, we analyzed single-molecule real-time sequencing reads from four samples: B561 (Japanese), AK1 (Korean) 30 , HG002 (Ashkenazi) 33 , and CHM13 (European) 31 . First, the long reads were preprocessed in silico to filter out the noncentromeric fraction, and the remaining reads were interpreted as a series of alphoid monomers using a catalogue of 1197 monomers, i.e., represented as monomer-encoded reads (Figure 1b). Then the monomer-encoded reads were clustered based on the composition of the different types of monomers , and for each cluster of reads associated with one of the HOR arrays, we constructed a catalog of variant HORs by detecting frequent patterns in the monomer-encoded reads. Finally, HOR-encoded reads were obtained by automatically replacing these patterns with symbols representing HORs. Below, we focused on HOR arrays in chromosomes 11 (D11Z1), 17 (D17Z1), and X (DXZ1), which evolved from the 5-mer archetypal HOR, in which we expected more complex variations could be found than in the other chromosomes associated with the dimeric archetypes 16 . We avoided chromosomes 1, 5, 13, 14, 19, 21, and 22, where chromosome identity is obscured because of shared HOR patterns. For CHM13, we performed the same analysis using Oxford Nanopore Technologies (ONT) data 37 for comparison with the PacBio data.

Diversity of HOR variant frequencies in four different samples
The detected variant HORs showed a wide diversity in terms of presence and abundance among the four samples.
In chromosome X, the canonical HOR consists of 12 monomers, and it was the most frequent pattern found in reads across the four datasets (98.0% ∼ 99.1% of all HOR types). Other than the canonical HOR, 11 variant HORs whose sizes ranged from 5-mer to 22-mer were defined (Figure 1c, d, Supplementary Figure 1). While five variants were shared among the four samples, six were found in <4 samples. For example, the 18-mer, an 11-mer were specific to CHM13 and another 11-mer was found only in AK1. To understand which variant HORs exhibited notable variability even in the presence of a small number of false positives and false negatives in the detection of variant HORs, we performed chi-square tests against the models where the relative frequency of variants should be constant across four samples. At least eight variants showed significant variation (p < 0.01 after correction for the number of tests) among the samples (Supplementary Table 1).
Similarly, for chromosome 17, 21 distinct variants of sizes from 7-mer to 24-mer were detected (Figure 1e). Twelve of them were specific to <4 samples. Notably, the 13-mer (in which three monomers, the 10th, 11th, and 12th, were deleted from the 16-mer) was present at high frequency only in AK1 and was essentially missing in the other three samples. Thus, AK1 turned out to have haplotype II, which is known to be shared among 35% of individuals 25,41 . Consequently, while all datasets agreed that the most frequent pattern was the HOR made up of 16 monomers, its densities were different, measuring 59.2%, 35.4%, 59.7%, and 57.1% (per 16 monomers), respectively, in B561, AK1, HG002, and CHM13(PacBio). Again, the distribution of variant HORs across the individual samples was far from uniform (Supplementary Figure 4, Supplementary Table 1).
In chromosome 11, the 5-mer canonical HOR 16 was the most frequent species, representing 97.4% ∼ 99.4% of all HOR types. In addition, 18 variant HORs were detected, 12 of which comprised the same repertoire of 5 monomers as the canonical HORs, while the other six variants exhibited slightly different compositions in terms of the monomers they contained (Supplementary Figure 3). As with the other chromosomes we investigated, these variant HORs were observed with significantly variable frequencies across the four samples ( Figure 1h). The most prominent difference was observed for a 6-mer variant (6m2; deletion of the second monomer), which accounted for 1.98% of the HORs detected in AK1. The same 6-mer variant was present at very low frequencies in two European samples HG002 and CHM13, but it was completely missing in the Japanese sample, B561.

The different modes of local expansion of variant HORs
Because the variant HORs were detected in long reads, we could investigate the contexts within which they were found ( Figure  1f, g, i). In general, we found several examples in which the same type of variant HOR was enriched in a read, which suggested that these variants might have expanded locally through a series of duplication events, rather than occurring independently. Moreover, different modes of expansion were in evidence depending on the type of variant HOR. For example, the characteristic 13-mer variant (13m0-2) detected in chromosome 17 of AK1 was found in tandem or interleaved with other HORs (Figure  1f, g). In contrast, the 6-mer variant (6m2) detected in chromosome 11 of AK1 was found in a sporadic manner (Figure 1i). Therefore, unlike the variant 13m0-2, 6m2 seemed incapable of independent tandem expansion, and some preference (e.g., length) with respect to the unit of expansion may exist.

Local evolutionary patterns within centromeric array by rare and specific variant HOR motifs
We searched the HOR-encoded long reads for rare variant HOR motifs that were conserved among samples. We excluded the possibility that such a rare motif would have emerged independently multiple times in centromeres, which is an implausible hypothesis according to the maximum-parsimony criterion. In addition, we observed at most two patterns in the context of such rare motifs within the long reads from each of the three diploid individuals, and a single pattern for CHM13, implying that 2/11 these regions with rare motifs were homologous. Therefore, we could compare the regions around each motif to understand local sequence evolution patterns in centromeres.
For example, we observed a combination of 12-mer and 5-mer variant HORs in chromosome 11 and the reads with the motif could be aligned and clustered into a single group (homozygous) or two groups (heterozygous) in each sample, revealing at least six alternative loci across four samples (Figure 2a, Supplementary Figures 5, 6).
Although each cluster may not represent a single locus, we were able to compare homologous regions and identify structural alteration events around them. In B561, the downstream motifs were characterized by the presence of interleaving 8-mer variants, the 11-mer, and then, tandem 12-mers. The long reads suggested that the motif was located in two alternative loci with similar but clearly distinct structures (Figure 2a). In HG002, similar patterns were observed and at least two alternative loci were suggested, but we could not phase them due to insufficient read length. Reasonably, there were structural alteration events involving the variant HORs, and both B561 and HG002 were heterozygous in terms of HOR-level variation. In AK1 and CHM13, which were depleted of the 8-mer variant, the downstream motifs were tandem canonical 5-mers (8 copies in CHM13, 9 copies in AK1), followed by the combination of another type of 8-mer variant and the 14-mer variant. These differential patterns also suggested that the expansion or contraction had occurred in one of these individuals.
Similarly, we identified an additional motif comprising a 15-mer, 12-mer, and 16-mer in chromosome 17 (Figure 2b, Supplementary Figure 7). While all the reads with this motif suggested a single locus in haploid CHM13, two loci were found for each of the two diploid samples (B561 and HG002), and the pattern was missing altogether in AK1. Thus, in total, the motif appeared in five alternative alleles, assuming that they were all homologous. Overall, these examples, as well as the previous examples, demonstrated the highly variable nature of the centromere sequences.

How useful are HOR variants in assembling centromeric sequencing reads?
Given that centromeric arrays contain ample HOR variants, one might also expect that they would serve as unique anchors useful in genome assembly. Put differently, with enough HOR variants, centromeric arrays would no longer appear repetitive. This intuition needs to be examined carefully, because the vast majority of the HORs found in centromeric arrays are canonical, and HOR variants remain rare. For example, the fraction of variant HORs among all the detected HOR units in chromosome X was 0.9% ∼ 2.0% across our four samples. Considering this low rate of variant HORs that serve as positional markers, we calculated the probability that a window of a given length was locatable, i.e., contained multiple HOR variants that could determine the relative position of the region compared to others. We assumed that HOR variants occurred in centromeres according to the Bernoulli distribution. Then, in an array with 99% canonical 12-mers (mimicking the situation in chromosome X), the probabilities that the regions would be locatable were 2.4%, 8.3%, and 16.6%, respectively, for 50 kb, 100 kb, and 150 kb window sizes, leaving most regions unlocatable. In an array with 99% canonical 5-mers (similar to chromosome 11 in our data), the probabilities were 11.5%, 32.3%, 52.3%, respectively, for 50 kb, 100 kb, and 150 kb window sizes. Therefore, we speculate that much longer reads would be needed to assemble centromeric regions based solely on the presence of HOR variants.

SNV landscapes on the canonical HORs
Next, we analyzed SNV distribution over the canonical HORs, i.e., 5-mers in chromosome 11, 16-mers in chromosome 17, and 12-mers in chromosome X. Here, we did not analyse indels because they cannot be called confidently within long reads. Although most of the alternative bases were observed with a certain frequency (approximately a few percent) due to remaining substitution errors in the long reads, we could spot prevalent SNV sites as prominent peaks in the plots (Figure 3a-d, Supplementary Figures 8, 9). For example, the number of SNVs detected with a frequency above 5% among the canonical 5-mers in chromosome 11 were 25, 17, 17, and 20, respectively, for the B561, AK1, HG002, and CHM13 (PacBio) datasets. Strikingly, those SNVs were likely to be shared among the four samples and even their frequencies were strongly correlated ( Figure 3e). Similar patterns were confirmed for chromosomes 17 and X as well (Supplementary Figures 8, 9). One might expect that the pairs of Asian samples (B561-AK1) or European samples (HG002-CHM13) would show stronger correlations than the geographically unrelated pairs. However, we did not observe such a pattern, and AK1-CHM13 was always the most strongly correlated pair across the three chromosomes we analyzed (Supplementary Figure 10).
We found that the ONT read dataset for CHM13 was the most dissimilar to all the PacBio datasets (Supplementary Figures  10, 11, 12); thus, these reads might be unsuitable for SNV analysis, presumably due to persistent systematic sequencing errors.
Overall, we observed the conservation of SNVs in canonical HORs, in contrast to the diversity of HOR patterns.

Discussion
In this study, we cataloged variant HORs found in the diploid human centromeres of three chromosomes 11, 17, and X by effectively utilizing long read data without explicit sequence assembly. We expanded the knowledge of variant HORs that have been documented 20-22 , and comparison of the variant HOR frequencies among four individuals revealed a substantial diversity

3/11
in human centromeres. For some cases, we could even trace structural changes among samples via rare and unique motifs, i.e., relatively rare combinations of variant HORs. By analysing three diploid samples, we detected several loci that appeared to represent alternative alleles, but none of them were the same, indicating that such variations are far from fixed among human populations and that other configurations remain to be discovered. Here, we note that regions harboring variant HORs were only a portion of the entire picture. Assuming that similar structural changes are occurred within the sea of tandem replicates of canonical HORs, there is even greater diversity in centromeres than we conservatively estimated here. Thus, we hypothesize that the tandem nature of centromeric arrays makes them extremely variable, as with other tandem repeats, and if one could accurately measure them, they would contain sufficient information to distinguish individual persons just as microsatellites can. We would be able to prove or disprove the hypothesis by exploring how diverse the structures of HOR arrays may be within a single population. While centromeres were highly variable at the HOR level, we demonstrated that the frequent SNVs in the HOR units were clearly conserved among diverse populations, suggesting that they have spread before the geographical separation of the human population. These seemingly contradictory observations can be reconciled easily, because the unequal crossovers and gene conversions that would be responsible for the introduction of HOR-level structural variation can contribute to homogenization at the SNV level. What does it means to have such large structural diversity in centromeres? Because centromeres have a fundamental importance to proper chromosome segregation during cell division, it was once considered surprising to observe great diversity in centromeric sequences across different eukaryotic taxa ("centromere paradox") 42 . Centromere drive theory explained the rapid evolvability of centromeres via genetic conflict during female meiosis I, rendering the centromeres as a crux of the molecular identity of species 43 . Nevertheless, growing evidence suggests that centromeres can be highly variable within a single species 5, 10, 20, 24 , and our findings add another layer of diversification, ultimately challenging the notion of stable human centromeres.
Several mechanisms can contribute to such structural diversity within centromeres: unequal crossover between sister chromatids, meiotic unequal crossover, gene conversion, and homologous recombination resulting in noncrossover products, to name a few. Among them, meiotic crossovers might arguably be excluded as a major driving force because they are suppressed near centromeric regions 7,44 , and consequently, centromeres are reported to form large conserved linkage-disequilibrium blocks 10 . On the one hand, the structural diversity within centromeres can be best explained by frequent unequal crossovers between sister chromatids as well as gene conversions. On the other hand, centromere integrity in a human population might have been maintained through occasional gene conversions and infrequent meiotic crossovers, both of which can counteract the diversification processes by effectively homogenizing sequences among different alleles. Notably, all these mechanisms are consistent with the local, progressive expansion suggested in this study as well as in previous evolutionary analyses 45 . We speculate that all these mechanisms might have contributed to the current landscape of human centromeres.
Recently, in addition to a complete linear assembly of a human Y centromere, a telomere-to-telomere assembly spanning the whole human X centromere utilizing ultralong ONT reads has been reported, showing, at last, the time is ripe to investigate centromeres in terms of sequencing technology 36, 37 . With an increasing number of individual genomes from the same or closely related populations sequenced by long reads, one would be able to precisely observe the processes of diversification and homogenization that occur within human centromeres. Therefore, such a study should provide a basis to delineate the complex mechanisms involved and to understand the true nature of centromere evolution.

Preparation of long read sequencing data
For SMRTbell library preparation, B cell DNA (named B561 in the main text) was sheared using a Diagenode's Megaruptor 2 with software setting 75 kb and purified using a 0.6× volume ratio of AMPure beads (Pacific Biosciences, Menlo Park, CA, USA). SMRTbell libraries for sequencing were prepared using the "Procedure & Checklist-Preparing > 30 kb Libraries Using SMRTbell Express Template Preparation Kit" protocol. Briefly, the steps included (1) DNA repair, (2) blunt ligation with hairpin adapters with the SMRTbell Express Template Preparation Kit (Pacific Biosciences) (3) 15 kb size cutoff size selection using the BluePippin DNA Size Selection System by Sage Science, and (4) binding to polymerase using Sequel Binding Kit 2.1, later Sequel Binding Kit 3.0, (Pacific Biosciences). SMRTbell libraries were sequenced on Sequel SMRT Cells (Pacific Biosciences) using diffusion loading, 30 kb-insert size and 600-min movies. Other long read data, i.e., from AK1 30 , CHM13 31 , HG002(Hi-Fi) 33 , and CHM13 (ONT ultra-long) 37 were obtained via public repositories respectively.

Filtering out non-centromeric reads
To enrich the centromeric reads in silico, we calculated the reference 6-mer frequency vector with the non-redundant catalog of 1194 monomers. We also calculated the "query" 6-mer frequency vector (normalized by length in bp) and its dot product with the reference for each long read. The dot products exhibited a bi-modal distribution which represents the mixture of centromeric

4/11
and non-centromeric reads. Thus, only reads with the dot product greater than 200 were included in later analysis. We modified squeakr 46 to perform these steps.

Monomer encoding of long reads
For monomer encoding of long reads, the catalog of human alphoid monomers was adapted from: https://raw.githubusercontent.com/volkansevim/alpha-CENTAURI/nhmmscan/example/MigaKH.HigherOrderRptMon.fa Then, the distinct 1197 monomers were mapped by blastn (version 2.4.0+) to long reads with following parameters: -max_target_seqs 1000000 -word_size 7 -qcov_hsp_perc 60 -outfmt "17 SQ" Optimal assignment was calculated via dynamic programming procedure, maximizing the following quantity: where i indexes monomers assigned to the read, s i is the BLAST score of the hit, (b i , e i ) the region covered by the monomer. Intuitively, it tries to assign as many monomers with acceptable scores as possible, due to the first term. The second term penalized the overlaps (cf. gaps were not penalized) so that each segment of the read be assigned at most one monomer.

HOR encoding
Reads were then clustered based on the types of assigned monomers. First, each read was treated as a bag of monomers and represented as a vector encoding the relative frequencies of each type of monomer assigned to the read. Then, K-means clustering (K = 40) was performed with the cosine distance (i.e., 1− correlation) so that similar HOR patterns were represented in a single cluster. The three clusters of reads from chromosome 11, 17, and X can be identified, since they mainly comprised the characteristic (SF3, supra-chromosomal family 3) monomers. In each read cluster for chromosomes 11, 17, and X, recurrent combinations of monomers were identified as HORs. Within HOR patterns, no gap of >100 bp was allowed between neighboring monomers. With the list of identified HORs, reads were processed again to be encoded as series of assigned HORs plus the mismatches against the reference monomers.

Author contributions statement
Y.S. conceived and conducted the study. Y.S., G.M., and S.M. analysed the results and wrote the manuscript All authors reviewed the manuscript.

Additional information
All sequencing data for B561 are available under the accession code JGAS00000000173 in the DDBJ Japanese Genotypephenotype Archive for genetic and phenotypic human data. The snapshot of all custom codes used for the study is available here: https://github.com/hacone/hc_temporary/releases/tag/submission-v.0.9 The authors declare no competing interests.  m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12   m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12  m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m1 m2 m3 m11 12~m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12   22  22m  m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11   m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12   18  18m  m1 m2 m3 m4 m5 m6 m7 m8 m9   m4 m5 m6 m7 m8 m9 m10 m11 m12   17  17m  m1 m2 m3 m4 m5 m6   m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12   11  11m1  m1  m3 m4 m5 m6 m7 m8 m9 m10 m11   The specific combination of 12-mer and variant 5-mer (enclosed by blue lines) was found in all samples in similar but slightly different contexts. In CHM13 and AK1, there were another 8-mer plus 14-mer motif in downstream, but the number of canonical 5-mer HOR between them was differed by one unit. In B561, the differences found in upstream and downstream of the motif could be phased into two alternative patterns, suggesting they were two alternative alleles. In HG002, the alternative patterns in downstream could not be phased due to shorter read length, but explained by two loci. (b) The specific combination of 15-mer variant, 12-mer variant, and 16-mer canonical HOR in this order in three samples. While there was a single pattern in CHM13, two alternative patterns were found in upstream and/or downstream in B561 and HG002.    Single-nucleotide variants found in the 16-mer canonical HOR of chromosome 17 in four samples. (a-d) The frequency (y-axis) of alternative bases according to positions over the 16-mer canonical HOR (x-axis). Data for each alternative base is colored in yellow, purple, green, or blue if it is A, C, G, or T, respectively. (e) Correlation of frequency of variants among four samples. Each point represents a specific SNV, and plotted according to its frequency in two samples (x-axis and y-axis). P.C = the Pearson correlation between frequencies of all variants.  12 Correlation of frequency of detected variants between two sequencing technologies. Each point represents a specific SNV, and plotted according to its frequency in each dataset (x-axis for PacBio and y-axis for ONT). Data for three canonical HORs, 5-mer in chromosome 11, 16-mer in chromosome 17, and 12-mer in chromosome X, is shown. There are some spurious variants called only in ONT along the y-axis. P.C = the Pearson correlation between frequencies of all variants.   Table 1 Summary of chi-square tests for distribution of variant HORs across four samples, in each of the three chromosomes X, 11, and 17. All statistical tests were performed based on the numbers shown in the tables in Supplementary Figure  1-3. For each type of variant HOR, we performed five tests in total: one for the observed number in each sample (degree of freedom = 1) and one for distribution across the four samples (degree of freedom = 4). The cells are colored in yellow if the minus log p-value exceeds the threshold for family-wise error rate of 0.01, which is shown below each table. The cells are colored in gray if p-value was too small to calculate its logarithm.