Abstract
Existing human genome assemblies have almost entirely excluded highly repetitive sequences within and near centromeres, limiting our understanding of their sequence, evolution, and essential role in chromosome segregation. Here, we present an extensive study of newly assembled peri/centromeric sequences representing 6.2% (189.9 Mb) of the first complete, telomere-to-telomere human genome assembly (T2T-CHM13). We discovered novel patterns of peri/centromeric repeat organization, variation, and evolution at both large and small length scales. We also found that inner kinetochore proteins tend to overlap the most recently duplicated subregions within centromeres. Finally, we compared chromosome X centromeres across a diverse panel of individuals and uncovered structural, epigenetic, and sequence variation at single-base resolution across these regions. In total, this work provides an unprecedented atlas of human centromeres to guide future studies of their complex and critical functions as well as their unique evolutionary dynamics.
One-sentence summary: Deep characterization of fully assembled human centromeres reveals their architecture and fine-scale organization, variation, and evolution.
Introduction
The human genome reference sequence has remained incomplete for two decades. Genome assembly efforts to date have excluded an estimated 5-10% of the human genome, most of which is found in and around each chromosome’s highly repetitive centromere, owing to a fundamental inability to assemble across long, repetitive sequences using short DNA sequencing reads (1, 2). Centromeres function to ensure proper distribution of genetic material to daughter cells during cell division, making them critical for genome stability, fertility, and healthy development (3). Nearly everything known about the sequence composition of human centromeres and their surrounding regions, called pericentromeres, stems from individual experimental observations (4–7), low-resolution classical mapping techniques (8, 9), analyses of unassembled sequencing reads (10–13), or recent studies of centromeric sequences on individual chromosomes (14–16). As a result, millions of bases in each chromosome’s peri/centromere have remained largely uncharacterized and have been omitted from essentially all contemporary genetic and epigenetic studies. Emerging long-read sequencing and assembly methods have now enabled the Telomere-to-Telomere Consortium to produce the first complete assembly of an entire human genome (T2T-CHM13) (2). This effort relied on careful measures to correctly assemble, polish, and validate entire centromeric and pericentromeric repeat arrays for the first time (2, 17). By deeply characterizing these newly assembled sequences, we present the first high-resolution, genome-wide atlas of the sequence content and organization of human peri/centromeric regions.
Centromeres provide a robust assembly and attachment point for kinetochore proteins, which physically couple each chromosome to the mitotic or meiotic spindle (3). Compromised centromere function can lead to nondisjunction, a major cause of somatic and germline disease (18, 19). In many eukaryotes, the centromere is composed of tandemly repeated DNA sequences, called satellite DNA, but these sequences differ widely among species (20, 21). In humans, centromeres are defined by alpha satellite DNA (αSat), an AT-rich repeat family composed of ~171 bp monomers, which can occur as different subtypes repeated in a head-to-tail orientation for millions of bases (22, 23). In the largest αSat arrays, different monomer subtypes belong to higher order repeats (HORs); for example, monomer subtypes a,b,c can repeat as abc-abc-abc (24, 25). Each array can contain thousands of nearly identical HORs, but kinetochore proteins bind only a subset of HORs within a single HOR array on each chromosome (25). HOR arrays tend to differ in sequence and structure between chromosomes (26, 27) and, like other satellite repeats, they evolve rapidly, expanding and contracting in repeat copy number over time, generating a high degree of polymorphism across individuals (28–31). Active (kinetochore-binding) centromeric sequences are embedded within inactive pericentromeric regions, which often include smaller arrays of diverged αSat monomers that lack HORs (26, 32). Pericentromeric regions also contain transposable elements and segmental duplications, which sometimes include expressed genes (33, 34), and frequently contain non-centromeric satellite repeat families (Human Satellites 1-3, beta and gamma satellite, reviewed (35)), which have poorly understood functions. 
Given the unprecedented opportunity to explore these regions in a complete human genome assembly, we investigated the localization of inner kinetochore proteins within active centromeres and surveyed sequence-based trends in the structure, function, variation, and evolution of peri/centromeric DNA.
Complete assessment of αSat substructure and evolution
Human peri/centromeric satellite DNAs represent 6.2% of the T2T-CHM13v1.0 genome (~189.9 Mb) (Supplemental Section 1, Table S1,2, Fig. S1), which is roughly equal to the entire length of chromosome 4. Nearly all of this sequence remains unassembled in the current GRCh38/hg38 reference sequence (hereafter, hg38), including pericentromeric satellite DNA families that extend into each of the five acrocentric short arms. Based on decades of individual observations, a framework for the overall structure of a typical human peri/centromeric region has been proposed (Fig. 1A). Using the CHM13 assembly, we tested and largely confirmed this broad framework genome-wide at base-pair resolution, with some notable and surprising exceptions (Fig. 1B,C).
Consistent with prior studies (reviewed by (36)), all centromeric regions contain long tracts, or arrays, of alpha satellite (αSat) monomers, which are organized in a head-to-tail orientation spanning millions of bases (85 Mb total genome-wide). A genome-wide assessment of CHM13 αSat monomers revealed a broad range of pairwise sequence identity, with limited deviation in repeat unit length (median 171 bp, with 99.5% within the range 140-187 bp). We next compared these CHM13 monomers with those found in hg38, which are mostly limited to 59 Mb of centromere reference models that lack biologically meaningful long-range repeat structures (11). We found these monomer sets to be largely concordant with respect to both pairwise sequence identity (≥98%) and chromosomal localization, with only a small number of repeats specific to each respective assembly (CHM13 vs hg38) (Supplemental Section 1, Table S3).
Using previously described methods (32, 37, 38), we performed complete, monomer-by-monomer classification of all αSat into 20 distinct suprachromosomal families (SFs; Supplemental Section 1). Each family is composed of SF-specific monomer classes (Table S4). SFs annotated include novel SFs 01 and 02, which unite the sequences previously identified as archaic SF1 and SF2, respectively (37, 38), and SFs13-18, which represent small pieces of the most ancient αSat and occur far from the centromere, presumably at the sites of long-defunct ancient centromeres (Table S5). Within each centromeric region, we identify between 1 and 9 HOR arrays, totaling 70 Mb genome-wide (62 Mb active HORs, 8 Mb inactive HORs, Fig. 1). Although 18 out of 23 chromosomes contain multiple, distinct HOR arrays, only one HOR array per chromosome binds inner kinetochore proteins and is thus designated as active with respect to centromere function (Table S3) (25). All other HOR arrays on the same chromosome are considered inactive, although it is possible for these inactive arrays to be competent for centromere function, as previously observed for ‘epialleles’ on chromosome 17 and for other cases where CENP-A is present on different HOR arrays between the two homologs (40, 41). The active array on each chromosome ranges in size from 5 Mb on chr18 down to 340 kb on chr21, which is near the low end of normal variation for this centromere (39). Adjacent to many highly homogeneous arrays are regions of highly divergent αSat HORs, in which HOR periodicity is somewhat or even completely destroyed (38), as well as highly divergent monomeric layers (32), together totaling 15.2 Mb in CHM13.
Utilizing HOR monomers and HORs primarily inferred from the hg38 centromere reference models, we provided initial annotation for all HOR arrays in CHM13 at the monomer level, revealing 80 HORs and >1000 different monomers in HORs. All HOR arrays, including those containing previously named HORs, received new names according to a naming system designed specifically for αSat HORs (38) (Table S3). Most of the HORs are represented in both the T2T-CHM13 assembly and the hg38 centromere reference models (Table S3). We confirmed the comprehensiveness and accuracy of this HOR/monomer annotation and of the CHM13 reference itself by observing the following expected features: (i) complete coverage of SF1-3 arrays by HOR annotations, (ii) the absence of significant contamination of one HOR array by monomers of another, and (iii) concordance with known monomer arrangements in canonical HORs. The annotated CHM13 assembly provides a unique opportunity to analyze αSat across an entire genome, which we detail below.
New pericentromeric satellite families and detailed study of the largest human satellite arrays
Outside of the αSat arrays, we generated detailed maps of each pericentromeric region, encompassing 104.7 Mb of non-αSat sequences (Supplemental Section 1). Classical human satellites 2 and 3 (HSat2,3, totaling 28.7 and 47.6 Mb, respectively) constitute the largest contiguous satellite arrays found in the human genome, with large arrays on chromosomes 1, 9, and 16 (13.2, 27.6, and 12.7 Mb respectively). HSat2 and HSat3 are derived from a simple ancestral (CATTC)n repeat that diverged into distinct families and 14 previously characterized subfamilies (10, 42). HSat1 describes two distinct sequence families that were discovered within the same AT-rich fraction of genomic DNA isolated by classic separation methods (42, 43). We now provide a new naming system for these two HSat1 families to clarify their identity and origin: HSat1A (formerly “SAR”), which is a 42 bp repeat constituting the most AT-rich regions of the genome, and HSat1B (formerly “HSATI”) which is a composite of AT-rich sequences and Alu fragments, found almost entirely on the Y chromosome (44).
Beta satellite (βSat) represents the next-largest family after αSat and HSat1-3 (7.7 Mb genome-wide). It is enriched on the acrocentric short arms (2, 45), and within the pericentromeric regions of 11 chromosomes, and is defined by a 68 bp repeat unit (46). βSat can be further subdivided into simple arrays and beta-composite arrays, in which βSat repeats are interspersed with LSau elements (47–49). Gamma satellite DNA (γSat), a well-characterized 220 bp tandem repeat on chromosomes 8 and X (15, 16), was identified within all acrocentric short arms and six pericentromeric regions (50, 51) (630 kb total). Although both βSat and γSat represent smaller satellite families in the human genome, they are more GC-rich than other satellites (βSat, 52%; γSat, 72%; αSat, 39%) and contain dense CpG methylation (Fig. S2). All remaining annotated pericentromeric satellite DNAs (collectively referenced as ‘p-censat’) total 5.55 Mb, with 1.19 Mb representing new satellite array predictions (49). Non-satellite bases observed between adjacent arrays and extending into the p-arms and q-arms are considered ‘centric transition’ regions, which largely represent long tracts of segmental duplications, including expressed genes (Supplemental Section 2) (2, 52).
Novel chromosomal localizations and polymorphisms of satellite subfamilies
Distinct repeat arrays from the same satellite family show varying degrees of similarity with each other. For example, centromeres on chromosomes 13/21, 14/22 and 1/5/19 have near-identical HORs that have confounded studies in the past (25, 36–38). T2T-CHM13 is the first assembly to successfully assign each of these active arrays to its specific chromosome, permitting a more comprehensive assessment of their respective repeat structures and sequence composition. As HOR 1/5/19 presents an especially problematic case because of dimeric expansions (38), we demonstrated that the arrays were chromosome-specific with reference to flow-sorted chromosome libraries (Supplemental Section 3, Fig. S3). We next developed separate consensus HORs, an HMM-based automatic HOR annotation tool, and browser tracks for each array and discovered that each chromosome has a distinct haplotype characterized by chromosome-specific sequence differences (3-20 base changes per HOR) and sometimes structural variants.
To provide a genome-wide view of the overall sequence similarity between different αSat arrays, we obtained the full set of 75-mer sequences within each array and searched for exact matches to the rest of the genome (Fig. 1D), readily identifying the hierarchical evolutionary relationships between subsets of αSat arrays (which can be organized into SFs and sub-SFs; reviewed in (36)). This hierarchical subfamily organization is also observed for HSats and βSats, although their inter-array divergence levels appear lower than for αSats overall (Fig. 1E-F).
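The exact-match k-mer comparison described above can be illustrated with a minimal sketch. It assumes each array is available as a plain string and compares arrays pairwise, rather than searching each array against the rest of the genome as in the text. The function names `kmer_set` and `shared_kmer_fraction` are hypothetical, and reverse-complement matches are ignored for brevity.

```python
from itertools import combinations


def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(arrays, k=75):
    """For each ordered pair (a, b), return the fraction of a's distinct
    k-mers that also occur exactly in b.

    arrays: dict mapping an array name to its sequence (an assumption of
    this sketch; real arrays would be read from the assembly).
    """
    sets = {name: kmer_set(seq, k) for name, seq in arrays.items()}
    result = {}
    for a, b in combinations(sets, 2):
        shared = len(sets[a] & sets[b])
        result[(a, b)] = shared / len(sets[a]) if sets[a] else 0.0
        result[(b, a)] = shared / len(sets[b]) if sets[b] else 0.0
    return result


# Toy usage with a short k so the example stays readable.
toy = {"x": "ACGTACGT", "y": "ACGTTTTT"}
fractions = shared_kmer_fraction(toy, k=4)
```

Arrays that share a recent common origin would be expected to show higher shared fractions than arrays from different SFs, which is the kind of signal that a hierarchical grouping can be read from.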
Using standards developed in previous work (10, 26, 32, 37, 38), we assigned the largest satellite DNA families (αSat, HSat3, and HSat2) into their respective sequence subfamilies, determining some of their chromosomal localizations for the first time. For example, we identified a 280 kb HSat3 array on chr17 and found that it belongs to subfamily B1, which had never previously been localized to a particular chromosome (10). This subfamily is entirely specific to chr17. While we found novel chromosomal localizations of several HSat3 subfamilies (Fig. 2), we also noticed a conspicuous lack of HSat3B2 on CHM13 chr1, contrary to expectations (10). To examine whether this was true for other individuals, we searched for contigs overlapping the chr1 pericentromeric region across 16 haplotype-resolved draft assemblies from genetically diverse individuals, from the Human Pangenome Reference Consortium (HPRC) (53) (Supplemental Section 4). This revealed that the chr1 HSat3B2 array in CHM13 belongs to a haplotype with a 400 kb polymorphic deletion, which we detected in 29% (8/28) of those examined (Fig. 2A).
Novel structural rearrangements and genes in peri/centromeric regions
Annotating strand orientations across entire satellite arrays revealed several novel and unexpected anomalies (Fig. 2, Table S6,7). While diverged αSats are known to contain many sequence inversions (54), we quantified this phenomenon genome-wide for the first time (Fig. 2F), and found a 1.7 Mb inversion inside the active αSat HOR array on chr1 (Fig. 2A), along with inversions in inactive HORs on chromosomes 3, 16, and 20 (Fig. 2B). Surprisingly, the large pericentromeric HSat3B5 array on chr9 and the beta satellite arrays on chr1 and the acrocentrics (Fig. 2A and Fig. S4) contain over 200 inversion breakpoints. Apart from inversions, two multi-megabase HSat1A arrays appear to have inserted and expanded within the active HOR arrays of chromosomes 3 and 4 (Fig. 2B), and given the concordant strand orientations of the flanking αSat arrays, these apparent insertions were unlikely to have arisen from inversions. The large insertions and inversions within active HOR arrays are particularly surprising (Fig. 2, chrs 1, 3, 4), because they reveal dynamism within an area of the genome previously considered highly homogeneous (30, 55). We sought to investigate these further by searching for evidence of these insertion/inversion breakpoints in the set of HPRC draft assemblies (Fig. S5). We found that the chr1 active HOR inversion is polymorphic across individuals, evident in about half of ascertainable haplotypes (11/24), while the HSat1A insertions on chr3 and chr4 were evident in all ascertainable haplotypes (32/32 and 33/33, respectively; Fig. S6). We also found evidence for an ancient duplication event that predated African ape divergence and involved a large segment of the ancient chr6 centromere plus about 1 Mb of adjacent p-arm sequence. This duplication has created a new centromere locus that hosts the current active cen6 HOR array. The duplication is visible as two nested centromere-flanking intra-chromosomal segmental duplications about 1 Mb in size and an old ~200 kb αSat array that follows the q-arm duplication and presumably contains the decayed remnants of the old centromere (Supplemental Section 5, Table S8).
Like inversions and insertions, transposable elements (TEs) are virtually absent from homogeneous HOR arrays but are enriched in divergent αSat (Fig. 2F) (56, 57). The CHM13 assembly also revealed that certain novel satellites are composed entirely of combinations of TEs, which we refer to as “composite satellites” (49) (Hoyt et al. 2021). Consistent with individual published observations (44, 47, 58), we also found that other satellites, such as HSat1, HSat3, and βSat, often include fragments of ancient TEs as part of their repeating units (Fig. 2A,F)—a phenomenon we rarely observe in αSat HOR arrays (Fig. S7).
Finally, we compared our pericentromeric maps to gene annotations (Table S9,10). One region on chr17, located between the large HSat3 and αSat arrays (Fig. 2D), contains two protein-coding genes: KCNJ17, which encodes a disease-associated potassium channel in muscle cells (59), and UBBP4, which encodes a functional ubiquitin variant that may play a role in regulating nuclear lamins (60). Notably, KCNJ17 is missing from GRCh38, which causes inaccurate and missed variant calls in homologous genes KCNJ12 and KCNJ18 (61). Interestingly, this region also contains a long noncoding RNA annotation (LINC02002), which starts inside an SST1 element and continues into an adjacent 33 kb array of divergent αSat (Fig. 2D). Unexpectedly, we also identified a processed paralog of an apoptosis-related protein-coding gene, BCLAF1 (BCL2 Associated Transcription Factor 1), as part of a segmental duplication embedded within an inactive HOR array on chr16 (Fig. S8).
New methods uncover the fine repeat structure of satellite DNA arrays
To further chart the structure of peri/centromeric regions at high resolution, we compared individual repeat units within and between different satellite arrays. We decomposed each αSat HOR array first into individual monomers and then into entire HORs, revealing the positions of full-size canonical HORs and structural variant HORs resulting from insertions or deletions (Supplemental Section 6, Table S11). These indels are often not in register with arbitrarily chosen monomer start sites, creating hybrid monomers. For example, if the canonical HOR structure is abc-abc-abc, a deletion variant might occur as ac-ac-ac (i.e. an in-register deletion of b) or a/bc-a/bc-a/bc where a/b is a hybrid (i.e. an out-of-register deletion overlapping the junction between a and b). HOR structures were characterized by two different approaches: we applied hg38-based manual HOR inference to the CHM13 assembly (38), and performed de novo inference from the CHM13 assembly (Supplemental Section 6). Both approaches yielded the same canonical HORs for active arrays (62–64). We also searched for these canonical and variant HOR types in HiFi sequencing reads from 16 genetically diverse individuals. While some chromosomes, such as chr7, are composed almost entirely of canonical HOR units, other chromosomes, such as chr10, contain many structural variant HOR types, with high variation in the relative frequency of these variants across individuals (Fig. 3A and Fig. S9).
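The abc-abc-abc example can be made concrete with a small sketch that splits a stream of monomer labels into HOR units and tallies canonical versus structural-variant units. This is an illustration only, not the annotation pipeline used in this study. It assumes monomers have already been classified into labels, that hybrid monomers are written as `"a/b"`, and that a hybrid can be ranked by its first component. The function names `split_into_hors` and `count_hor_types` are hypothetical.

```python
from collections import Counter


def split_into_hors(monomers, canonical):
    """Split a list of monomer labels into HOR units.

    A new unit starts whenever the position in the canonical order does
    not increase. Hybrid labels such as "a/b" are ranked by their first
    component (a simplifying assumption of this sketch).
    """
    rank = {m: i for i, m in enumerate(canonical)}
    units, current, prev = [], [], -1
    for label in monomers:
        pos = rank[label.split("/")[0]]
        if current and pos <= prev:
            units.append(tuple(current))
            current = []
        current.append(label)
        prev = pos
    if current:
        units.append(tuple(current))
    return units


def count_hor_types(units, canonical):
    """Tally canonical versus structural-variant HOR units."""
    canon = tuple(canonical)
    counts = Counter()
    for unit in units:
        key = "canonical" if unit == canon else "variant:" + "-".join(unit)
        counts[key] += 1
    return counts


# Toy array: abc, then ac (in-register deletion of b),
# then a/b-c (out-of-register deletion), then abc again.
labels = ["a", "b", "c", "a", "c", "a/b", "c", "a", "b", "c"]
units = split_into_hors(labels, ["a", "b", "c"])
tally = count_hor_types(units, ["a", "b", "c"])
```

Applied per individual, a tally like this gives the relative frequency of each variant HOR type, which is the quantity compared across individuals in the text.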
Repeat structure decomposition of other satellite families is less straightforward because, unlike αSat, some families have inconsistent or unknown repeat unit sizes. For example, although both HSat2 and HSat3 are thought to have evolved from an ancient simple repeat of (CATTC)n, they have long since diverged on different chromosomes and, at least in some cases, they have been shown to be composed of longer repeat units on the order of multiple kilobases (10). We propose calling these longer repeat units nested tandem repeats (NTRs), to distinguish them from higher order repeats, which are composed of discrete numbers of monomers of similar lengths. To expand our ability to annotate repeat structure within newly assembled satellite DNA arrays, we created NTRprism, a versatile algorithm for discovering and visualizing satellite repeat periodicity (Fig. 3B and Fig. S10). NTRprism is somewhat analogous to classical restriction digest experiments that revealed repeat periodicities in certain satellite families (65), but it is greatly enhanced by the ability to computationally examine all possible k-mers, not just those targeted by restriction enzymes. Using this tool, we discovered new HORs in HSat1 and βSat arrays, as well as new NTRs in multiple HSat2,3 arrays (Fig. 3B and Fig. S10). We also applied this tool in smaller windows across individual arrays, showing that repeat periodicity can vary across an array, consistent with NTRs evolving and expanding hyper-locally in some cases (Fig. S10).
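The idea of reading periodicity from all k-mers, rather than only from restriction sites, can be sketched as follows. This is a simplified illustration and not the NTRprism implementation: it assumes that the spacing between consecutive occurrences of each k-mer, pooled over all k-mers, is a reasonable proxy for the repeat period. The function names and the `max_span` cutoff are hypothetical.

```python
from collections import Counter, defaultdict


def kmer_interval_spectrum(seq, k=6, max_span=5000):
    """Histogram of distances between consecutive occurrences of each
    k-mer in seq, pooled over all k-mers. Distances above max_span are
    ignored."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    spectrum = Counter()
    for pos in positions.values():
        for a, b in zip(pos, pos[1:]):
            if b - a <= max_span:
                spectrum[b - a] += 1
    return spectrum


def dominant_period(seq, k=6, max_span=5000):
    """Most frequent k-mer spacing, or None if no k-mer repeats."""
    spectrum = kmer_interval_spectrum(seq, k, max_span)
    return spectrum.most_common(1)[0][0] if spectrum else None


# Toy array: a 10 bp unit repeated 20 times.
toy_array = "ACGTTGCAAT" * 20
period = dominant_period(toy_array, k=6)
```

Running the same function in sliding windows along an array would show whether the dominant period changes with position, in the spirit of the windowed analysis described above.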
Genome-wide evidence of layered expansions in centromeric arrays
Previous αSat studies have hypothesized a layered expansion model for centromeric αSat arrays (36), in which distinct new repeats periodically emerge and expand within an active array, displacing the older repeats sideways and becoming the new site of kinetochore assembly. Over time, distinct layers of progressively older and more divergent repeats are expected to expand out on either side, flanking the active centromere core with mirror symmetry (Fig. 3C). The repeats that seed new layers may originate from outside the array (e.g. by insertion of an αSat sequence from another chromosome) or from mutations within the same array (Fig. 3D). As the new centromere core expands, the flanks rapidly shrink and accumulate mutations, inversions, TE insertions, and other satellite expansions (16, 32, 38). Previous efforts to document this layered expansion pattern have focused on divergent αSat compartments that surround HOR arrays (32). Here, we performed a study of active αSat arrays in their entirety, together with adjacent flanking regions, to survey the degree of peri/centromeric symmetry, divergence, and decay signatures, providing the first detailed, genome-wide evidence in support of this model.
First, in agreement with prior studies, we observed a symmetrical flanking arrangement of two types of divergent αSat: divergent HORs (dHORs) (Table S12), and monomeric αSats (Table S13), which represent ancient, decayed centromeres of primate ancestors (32). We classified divergent αSat into distinct SFs and dHOR families, and demonstrated how these sequences accumulate mutations, inversions, and TE insertions over time (Table S14). In monomeric αSats, L1s are common, and the age of the oldest L1s increases with the age of the αSat layer (Fig. 3C). Although L1s are extremely rare in αSat HOR arrays, when they do occur, they are always the youngest L1Hs elements, which are known to be still active in humans (Fig. 3C, Table S15) (49, 66). In agreement with previous studies, we document a gradient of size (Fig. 3C), and a gradient of intra-array divergence (17 to 26%) preceded by a steep (~10%) increase that marks the transition between HOR arrays and outer layers (Fig. 3C), (16, 32, 38). In total, molecular dating of the flanking αSats by divergence (grouped by SF-layers), shared occurrence in primate lineages (32), and the age of embedded L1 elements revealed an age gradient away from the central active array (Fig. 3C,D, Table S15).
We next asked if the layered expansion pattern overlaps the active arrays themselves. As shown in Fig. 3D, the sequences seeding the expanding satellite array can be either introduced from within (intra-array seeding) (32) or from an external HOR (or non-HOR) array (inter-array seeding) (67, 68) (Table S16). In total, we document five cases of inter-array type symmetry (Fig. 3D) of which only one was known before (69). In some cases of the inter-array model, the active HOR array originates from a different SF than the flanking inactive array (chrs 1 and 16) (Fig. 3D). This, together with a well-studied case of inter-array symmetry in cen17 (70), provides evidence of how entire arrays have been displaced recently in favor of an introduced sequence.
Moreover, detailed study of active arrays in their entirety provided evidence of intra-array symmetry, defined by classification of HORs by their shared sequence variants. Such variants were known for decades (30, 71, 72) and recently were noted in the first completely assembled centromeres from chromosomes X (73, 74) and 8 (16), where the central part of the active array was found to contain HOR variants slightly different from those on the flanks. To test if this array structure is typical, we aligned individual HOR units within the same array and clustered them into “HOR-haplotypes” or “HOR-haps” (Supplemental Section 6). Initial broad classifications of entire arrays into 2-4 distinct HOR-haps revealed that active HOR arrays are also composed of distinct layers, which typically expand from the middle (dark red versus grey, Fig. 3D).
Further classification of sequences within each broad HOR-hap identified additional substructure, and evidence for symmetric patterns (Fig. 3E). To examine whether the middle HOR-haps are likely to be the youngest evolutionarily, we built phylogenetic trees of consensus HOR-haps (Fig. 3E) and rooted them using reconstructed “mock” SF-ancestral sequences built from consensus monomers for each SF. We also performed complete phylogenetic analysis of all HORs. The identification of evolutionarily younger and older HOR-haps was supported by both methods, as shown for chr3 (Fig. 3E). In addition, the intra-array divergence in central HOR-haps is usually slightly lower than in the flanking arrays, indicating that the central HOR-haps have expanded more recently. Together, these findings present strong genome-wide evidence for a layered expansion pattern within active arrays.
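The comparison of intra-array divergence between central and flanking HOR-haps can be illustrated with a minimal sketch. It assumes HOR units have already been aligned to equal length and uses a plain per-site mismatch fraction, which is a simplification of any alignment-aware divergence measure. The function names are hypothetical and the toy sequences are invented for illustration.

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of mismatching sites between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_pairwise_divergence(units):
    """Mean mismatch fraction over all pairs of aligned repeat units."""
    pairs = list(combinations(units, 2))
    if not pairs:
        return 0.0
    return sum(hamming_fraction(a, b) for a, b in pairs) / len(pairs)


# Invented toy units: a homogeneous "central" group and a more
# diverged "flanking" group.
central = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
flank = ["ACGTACGT", "ACCTACGA", "TCGTACGT"]
central_div = mean_pairwise_divergence(central)
flank_div = mean_pairwise_divergence(flank)
```

Under the layered expansion model, a recently expanded central HOR-hap would be expected to show the lower of the two values.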
Satellite array organization at sites of kinetochore assembly
Human centromeres are defined epigenetically as the specific subregion bound by inner kinetochore proteins within each active αSat HOR array (21, 75). Centromeres contain a combination of epigenetic marks distinguishing them from the surrounding pericentromeric heterochromatin, including the presence of the histone variant CENP-A (76, 77), “centrochromatin”-associated histone modifications (78), and reduced CpG methylation (15, 16, 79). To study HOR organization at sites of kinetochore assembly, we identified discrete regions of CENP-A enrichment within each active array using published native CHM13 ChIP-seq (NChIP) data (16) along with CUT&RUN (80) data generated in this study (Supplemental Section 7).
Consistent with previous studies, CENP-A binding is almost exclusively localized within αSat HOR arrays, with one active array per chromosome (25) (Table S17). To demarcate specific subsets of ‘active HORs’, we developed a new repeat-sensitive short-read alignment method (Fig. 4A) using a collection of informative markers across each active array to precisely map each overlapping read. We identified unique 100 bp marker sequences covering 5.9% of all bases in all active arrays in CHM13. These markers are non-uniformly distributed, showing depletion of coverage at the sites of centromere protein enrichment (Fig. S11), which we determined to be due to recent, local duplications. To increase the comprehensiveness of our short-read mapping strategy, we also included “region-specific markers” observed at two or more locations within a set maximum distance from each other. These region-specific markers allowed us to ask if sequences specific to a given region showed evidence of enrichment, broadening our coverage of the array. We studied centromere protein enrichment patterns using a combination of these single-copy and region-specific markers, either by directly determining the enrichment of these sequences across read datasets (reference-independent (81), Fig. S12) or by filtering read alignments based on overlap (marker-spanning alignments) (Supplemental Section 7). Use of these two orthogonal methods allowed us to determine the span of constitutively bound CENP-A within each array and delineate active HOR arrays containing each chromosome’s centromere (Table S17).
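The single-copy marker idea can be sketched in a few lines. The example below finds k-mers that occur exactly once and reports the fraction of bases they cover. It is an illustration under simplifying assumptions: uniqueness is checked only within the supplied sequence rather than genome-wide, reverse complements are ignored, and the function names are hypothetical. The default `k=100` follows the marker length stated in the text.

```python
from collections import Counter


def unique_marker_positions(seq, k=100):
    """Start positions of k-mers that occur exactly once in seq."""
    n_windows = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    return [i for i in range(n_windows) if counts[seq[i:i + k]] == 1]


def fraction_covered(seq_len, starts, k=100):
    """Fraction of bases overlapped by at least one marker."""
    covered = 0
    end_prev = 0
    for s in sorted(starts):
        lo = max(s, end_prev)  # skip bases already counted
        hi = s + k
        if hi > lo:
            covered += hi - lo
            end_prev = hi
    return covered / seq_len if seq_len else 0.0


# Toy sequence with repetitive flanks and a unique core, short k.
toy_seq = "AAAAACGTAAAAA"
starts = unique_marker_positions(toy_seq, k=4)
coverage = fraction_covered(len(toy_seq), starts, k=4)
```

In a region of recent local duplication, most k-mers occur more than once, so the list of unique starts thins out. This matches the depletion of marker coverage described above.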
In agreement with previous studies, we found the strongest CENP-A enrichment near sites reported to be depleted in CpG methylation, or centromere dip regions (CDRs) (16, 79). Notably, some chromosomes show evidence for multiple peaks within each CDR, which could represent interspersed domains, variation in the organization of CENP-A nucleosomes across the two homologous chromosomes, or polymorphic organization across the population of cells (79). Here, we extend these findings and report that the complete span of the centromere region, as defined by the CENP-A enrichment patterns, extends outside of CDRs by hundreds of kilobases across all chromosomes (Fig. 4C). Furthermore, on some chromosomes, we detected smaller regions of centromere protein enrichment outside of the primary CDR, with some overlapping a minor, secondary CDR (chromosome 4) or no CDR at all (chromosome 18) (Fig. 4C, Fig. S13). In total, these findings provide the first map of human centromeres in a complete genome. In doing so, we identified subregions within each HOR array that are competent to support kinetochore assembly and centromere function.
In relation to the layered expansion pattern, we found that CENP-A is commonly enriched in the youngest HOR-haplotypes within the majority of arrays (Table S17). Furthermore, for each centromere region we increased the number of HOR-hap clusters to study more refined groupings of HORs that are enriched for CENP-A and associated with the sites of kinetochore assembly (Fig. 4B-E). In the active array on chromosome 12, we identified CENP-A enrichment on one of two large macro-repeat structures, both presenting similar HOR-hap sequences (Fig. 4B, Fig. S14). Constructing phylogenetic trees from consensus repeats revealed a subset of HORs that are evolutionarily derived from the ancestral SF-specific class monomers (as shown in Fig. 4F for chromosomes 4, 6, 7, and 12, Fig. S15). Further investigation into the region of CENP-A enrichment on chromosome 12 revealed a zone of recent HOR expansions (i.e. eight sites of recent duplications within a ~365 kb region; Supplemental Section 7, Fig. S16) that coincides with the CDR and distinguishes one macro-repeat region from the other. We observed similar zones of recent expansion that overlap recent HOR-haps on most other chromosomes (Fig. 4C, shown for example on centromere 7 in Fig. S15), although we identified a few notable exceptions to this general trend. For example, on chromosome 4, which has two CENP-A regions occurring on either side of a 1.7 Mb HSat1A array, we found that the larger CENP-A region spans a slightly younger HOR-hap and the smaller CENP-A region spans an older HOR-hap (Fig. 4D,F). Similarly, we observe CENP-A enrichment within an older HOR-hap layer on centromere 6, over a megabase away from the site of a recent duplication event (Fig. 4E,F). In summary, we provide support of the layered expansion model, and observe that human centromeres are commonly positioned over the youngest layers within each array and that these layers are prone to recent duplications.
Genetic and epigenetic variation across human X centromeres
Satellite DNA arrays are known to be highly variable in size across individuals; in fact, the extremes of this size variation are often plainly visible under the microscope in chromosomal karyotypes, and they have been described for decades, yet the clinical significance of these variants remains unknown and largely unexplored (82, 83). More recent studies have provided low-resolution sequencing-based evidence for variability in both satellite array lengths and the frequency of certain sequence and structural variants within human populations (10–12), suggesting accelerated sequence evolution in these regions compared to the rest of the genome. However, satellite array variation and evolution remain poorly understood at base-level resolution due to the lack of complete centromere assemblies.
To address this, we deeply characterized and compared centromere array assemblies from chrX across seven diverse males, thus capturing the full extent of biologically important sequence variation (Fig. 5A, Supplemental Section 8, Fig. S17). We assigned repeats to seven HOR-haps, revealing both localized and broad variation within each array. For example, we identified large, tandem duplications (spanning hundreds of kilobases) in two assemblies relative to CHM13 (HG01109, PUR and HG03492, PJL; Fig. S18). Four of the seven arrays contain zones of recent duplication in the younger HOR-hap, in a similar position to that of CHM13, with all remaining assemblies showing a trend of recent duplication within a shared region closer to the p-arm (spanning different subsets of more divergent and less derived older HOR-haps). Notably, we found evidence for an additional HOR-hap type in ancient lineages (84) that did not participate in the Late Pleistocene emigration of modern humans from Africa (Fig. 5A, dark red), representing an independent core of expanding centromeric sequence.
Next, we studied how this variation within αSats relates to variation across single-nucleotide markers that are in linkage disequilibrium (LD) with the centromere, i.e. markers that tend to be co-inherited with the centromere. Because meiotic recombination rates are extremely low in pericentromeric regions (due to the “centromere effect”; (85)), centromeres are embedded in long haplotypes, which are called cenhaps (Fig. 5B; (84)). Cenhaps are identified by first clustering pericentromeric single-nucleotide variants into phylogenetic trees, and then splitting them into large clades of shared descent. Here, we divided a group of 1599 males genotyped using published short-read sequencing data (86) into 12 cenhaps (with 98 individuals remaining unclassified; Fig. 5C, Fig. S19, Table S18). We also defined array-specific and HOR-hap-specific k-mer markers, allowing us to use short-read sequencing data to estimate the absolute size of each individual’s chrX centromere array (Supplemental Section 8, Table S19; (11, 84)), along with the relative proportion of that individual’s array assigned to each HOR-hap. The results revealed that different cenhaps have different αSat array size distributions as well as different average HOR-hap compositions (Fig. 5C, Fig. S20). As shown in Fig. 5D, two of the 12 cenhaps, 1 and 2, are very common outside of Africa (overall, 49% and 47%, respectively), while the other cenhaps exhibit a range of frequencies across the samples from Africa as well as those with recent African admixture (ASW, PUR, CLM, ACB). This pattern is consistent with the accepted demographic bottleneck associated with early human migration (87). The observed concentration of cenhap and αSat variation in African individuals underlines the need for greater representation of African genomes in pan-genome assembly efforts.
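The k-mer-based array size estimate described above can be illustrated with a minimal sketch. It is not the procedure from Supplemental Section 8. It assumes, hypothetically, that each marker k-mer has a known copy number in a reference array of known length, and that read counts can be converted to copy numbers by dividing by the sample's single-copy sequencing depth. The function names, the median-ratio scaling, and the input layout are all illustrative assumptions.

```python
from statistics import median

def estimate_array_size(read_counts, ref_counts, ref_array_bp, single_copy_depth):
    """Scale a reference array length by the typical marker copy-number ratio.

    read_counts: marker k-mer -> occurrences in the sample's short reads
    ref_counts: marker k-mer -> copies in the reference array (assumed known)
    ref_array_bp: length of the reference array in base pairs
    single_copy_depth: sequencing depth over single-copy sequence
    """
    ratios = []
    for kmer, ref_copies in ref_counts.items():
        # Convert raw read counts to an estimated copy number in the sample.
        sample_copies = read_counts.get(kmer, 0) / single_copy_depth
        ratios.append(sample_copies / ref_copies)
    # The median is used here only to damp the effect of outlier markers.
    return ref_array_bp * median(ratios)

def hor_hap_proportions(hap_copies):
    """Convert per-HOR-hap copy estimates into fractions of the array."""
    total = sum(hap_copies.values())
    return {hap: copies / total for hap, copies in hap_copies.items()}

# Toy inputs (invented for illustration only).
size = estimate_array_size(
    read_counts={"k1": 45000, "k2": 90000},
    ref_counts={"k1": 1000, "k2": 2000},
    ref_array_bp=3_000_000,
    single_copy_depth=30,
)
proportions = hor_hap_proportions({"hap1": 300, "hap2": 100})
```

In this toy case both markers appear at 1.5 times their reference copy number, so the estimated array is 1.5 times the reference length.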
To explore the variation within one of the large cenhap groups (cenhap 2), we compared fine-scale cenhap phylogenies and HOR-hap assignments across 567 individual X chromosomes, revealing a degree of further substructure and variation in the αSat array on a more recent evolutionary timescale (Fig. 5E). To dissect this further, we compared two finished centromere assemblies from CHM13 and HG002, a cell line whose chrX array had been constructed using T2T assembly methods, and whose array structure had been experimentally validated by both pulsed-field gel electrophoresis Southern blots and digital droplet PCR (2). CHM13 and HG002 have similar array sizes and belong to cenhap 2 (Fig. 5E). We also studied patterns of CENP-A CUT&RUN enrichment in HG002 relative to CHM13 (79). We found both genomes to be highly concordant across the array, apart from three regions, where we observed recent amplifications and/or deletions of repeats (Fig. 5F). Notably, the region with the most pronounced structural differences between CHM13 and HG002 coincides with the strongest CENP-A enrichment in both arrays. Therefore, even though inner kinetochore proteins are present in both arrays over CDRs and young HOR-haps, the HOR sequences enriched with CENP-A represent local duplication events that are not shared and distinguish the two arrays (marked in yellow, Fig. 5G, Fig. S21).
Finally, we asked if CENP-A enrichment patterns were consistently found in the younger HOR-haps, as observed in CHM13 and HG002, across publicly available ChIP-seq datasets (Fig. S22). Using the T2T-CHM13 X array as a reference, we mapped these available datasets and determined CENP-A enrichment for each X HOR-hap relative to the matched input DNA. Notably, in several individuals we observed CENP-A enrichment within the older HOR-hap subregion, proximal to the p-arm, indicating the presence of a centromere X epiallele (as shown for three XY individuals, HuRef (88) in Fig. 5H, and also supported with data from HT1080b (89) and MS4221 (90)). Further, we compared two independent CUT&RUN experiments from the RPE-1 cell line (XX) (91) and found consistent evidence for heterozygous positions of CENP-A within the same cell line, with enrichment on both older and newer HOR-haps, indicating that the two X homologs carry different functional epialleles. Three additional 46,XX cell lines (IMS13q, PDNC4, K562 (92)) were determined to be consistent with CHM13, providing evidence that the same CENP-A+ HOR-hap is shared across both homologous X chromosomes in each line. In total, these findings uncover frequent variation in the position of the X centromere, indicating that individuals may have heterozygous and homozygous epialleles within the population. Further, these observations highlight the need to study both epigenetic and genetic variation in centromeric regions, across both related and unrelated individuals and across populations of cells over time, to better define the trends and exceptions regarding centromeric epiallele positioning and inheritance.
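The per-HOR-hap enrichment relative to matched input described above can be sketched as a log2 ratio of read fractions. This is a generic illustration under stated assumptions, not the pipeline used for Fig. 5H or Fig. S22: reads are assumed to be already assigned to HOR-haps, and the function name, the pseudocount, and the toy counts are hypothetical.

```python
import math

def log2_enrichment(chip_counts, input_counts, pseudocount=1.0):
    """Return log2(ChIP fraction / input fraction) for each HOR-hap.

    chip_counts, input_counts: HOR-hap label -> number of assigned reads.
    A pseudocount (an assumption of this sketch) avoids division by zero.
    """
    haps = sorted(set(chip_counts) | set(input_counts))
    n = len(haps)
    chip_total = sum(chip_counts.values()) + pseudocount * n
    input_total = sum(input_counts.values()) + pseudocount * n
    enrichment = {}
    for hap in haps:
        chip_frac = (chip_counts.get(hap, 0) + pseudocount) / chip_total
        input_frac = (input_counts.get(hap, 0) + pseudocount) / input_total
        enrichment[hap] = math.log2(chip_frac / input_frac)
    return enrichment

# Toy counts (invented): CENP-A reads concentrate on the younger HOR-hap.
result = log2_enrichment(
    chip_counts={"young": 799, "old": 199},
    input_counts={"young": 499, "old": 499},
)
```

A positive value marks a HOR-hap that is over-represented in the CENP-A sample relative to input, and a negative value marks one that is depleted. Normalizing by library totals keeps samples of different sequencing depth comparable.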
Discussion
This work has produced detailed maps of previously unassembled centromeric and pericentromeric regions, which represent the largest fraction of newly introduced sequence in the complete T2T-CHM13 reference assembly (2). We produced detailed annotations and resources to facilitate further analysis of these complicated loci by the community. In doing so, we revealed surprising large and small scale variations in the organization and composition of active centromeres. The most dramatic variants include the interruption of two active centromeric arrays by a different satellite repeat family (on chrs 3 and 4), and a large inversion in another active array (on chr 1). We also found strong genome-wide evidence for a layered expansion model of centromere evolution, supported by ancient evolutionary patterns in the divergent satellites that flank the active centromere, as well as by recent sequence expansions within the active centromere. We defined sets of short markers specific to each array that can be used for mapping short sequencing reads and for interrogating peri/centromeric structure and function, such as designing oligo-FISH probes and guide RNAs for CRISPR-based experiments. We also demonstrated the utility of these markers to accurately localize short reads from protein-DNA interaction mapping experiments and whole-genome shotgun sequencing datasets. Furthermore, we developed a new method, NTRprism, for visualizing and quantifying tandem repeat periodicity in any satellite family, and we used this to discover novel repeat structure within multiple HSat arrays.
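The idea of quantifying tandem repeat periodicity from k-mer spacing can be illustrated with a simplified sketch. It is not the NTRprism implementation. It assumes that the distance between consecutive occurrences of the same k-mer reflects the repeat period, and the function names and the default k are hypothetical.

```python
from collections import Counter

def kmer_interval_spectrum(seq, k=6):
    """Count distances between consecutive occurrences of each k-mer."""
    last_seen = {}
    spectrum = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in last_seen:
            # Distance back to the previous occurrence of this k-mer.
            spectrum[i - last_seen[kmer]] += 1
        last_seen[kmer] = i
    return spectrum

def dominant_period(seq, k=6):
    """Return the most frequent k-mer spacing, or None if no k-mer recurs."""
    spectrum = kmer_interval_spectrum(seq, k)
    return spectrum.most_common(1)[0][0] if spectrum else None

# Toy array: a 10 bp unit repeated 20 times (invented for illustration).
toy_array = "ACGTTGCAAG" * 20
period = dominant_period(toy_array)
```

On real satellite arrays the spectrum would typically show several peaks, for example at the monomer length and at the higher-order repeat length, so inspecting the full spectrum is more informative than the single dominant value.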
Our new tools and resources allowed us to characterize satellite array variation to new depths, uncovering a large polymorphic deletion of an entire HSat3 array, along with a novel expansion of a particular chrX alpha satellite HOR-haplotype within African populations. Additionally, we found a recent duplication in the chrX HOR array, representing hundreds of kilobases, that is common in individuals from a specific centromere-spanning haplotype group (cenhap 1), which can explain why individuals harboring this cenhap have larger average array sizes compared to other cenhaps. The evidence for such large duplications in human history was revealed by our assessment of centromeric macrorepeats, including those on chromosomes 6, 12, and X. The high degree of polymorphism in these regions underlines the need to produce telomere-to-telomere assemblies from many diverse individuals, to fully capture the extent of human variation in these regions and to shed light on their recent evolution and the functional consequences of this evolution. Achieving this goal will require an ability to produce accurate, complete, phased assemblies from diploid individuals. Centromeric regions would seem to present the greatest challenge for phased assembly due to their repetitive nature, but their high degree of variation may assist these efforts. Now, equipped with the T2T-CHM13 assembly and the approaches we developed here to study and compare the most challenging repetitive regions in the genome, we are optimistic that future high-quality, phased, diploid, T2T assemblies are within reach.
Finally, apart from genetic variation in these regions, we identified epigenetic variation in the location of centromere proteins within an array, as has been described previously on other chromosomes (40, 41, 93, 94). Future investigations need to study centromeric protein localization at fine scales across many individuals, in order to better understand centromeric establishment and propagation, and how this relates to the underlying genetic variation found within each array. In light of our observation that CENP-A tends to localize to the most recently expanded HORs genome-wide, many questions remain about the evolutionary and molecular mechanisms responsible for the relationship between the kinetochore and the layered expansion patterns of satellite DNAs. It is possible that satellite expansions occur neutrally and that more recently expanded subregions coincidentally attract the kinetochore; this is feasible in light of the evolutionary patterns we observe within non-centromeric satellite arrays. Another possibility is what we refer to as the “kinetochore selection hypothesis,” in which the kinetochore plays a causal role in amplifying particular HOR variants to which it preferentially binds (36). These models are not mutually exclusive, and they are also compatible with models of centromere drive or other molecular drive models (95, 96). Experiments in model organisms have demonstrated that extreme array sequence variants increase meiotic and mitotic nondisjunction rates and can promote mutational drive and/or (female) meiotic drive (97–99). Similar drive mechanisms, along with selection for variants that promote high-fidelity chromosome transmission, may also play a role in shaping centromeric sequence diversity in the human population.
Exploring these models will require careful experimental systems and methods for precisely measuring interactions between kinetochore proteins and repetitive DNA, as well as how these interactions affect the fidelity of chromosome transmission. While the short-read mapping methods that we developed enable the use of existing protocols like NChIP (100) and CUT&RUN (80) to provide sensitive protein-DNA interaction information at broad scales within satellite arrays, we anticipate that new long-read methods for mapping protein-DNA interactions will be essential for providing high-resolution binding footprint information, including in regions that lack single-copy or region-specific markers (101). We anticipate a future in which we will soon have pan-genome and pan-epigenome references in all human peri/centromeric regions, finally making them accessible for careful study using modern genomic tools.
Author contributions
αSat sequence characterization: AVB, LU, FDR, AM, VAS, TD, OK, FG, EIR, PAP, IAA, KHM; Pericentromeric satellite characterization: NA, GAL, SJH, MEGS, DO, TJW, LGDL, AMP, RJO, KHM; CUT&RUN experiments, mapping, and enrichment analyses: GVC, NA, GAL, PS, SJH, AMM, AR, MEGS, KT, SRS, AS, AFS, BAS, AFD, GHK, AG, WT, KHM; cenhap analysis and interpretation: NA, SAL, CHL, IAA, KHM; array length prediction: MB, JLG, MCS, JMZ; Methylation Analysis: AG, WT; Chromosome imaging and flow sorting: TP, JLG, SB, AY, AMP; Dotplot analysis: LU, FDR, MRV, RL, PK, AMP, IAA; Transposable element analysis: NA, SJH, GAH, RJO, LU, IAA; CHM13 satellite assembly and het analysis: NA, GAL, SN, SK, AR, AMM, AMP; UCSC genome browser and annotation workflow: MD; HiFi assemblies and quality assessment of diverse panel: NA, MA, RL, KS, AM, AVB, SA, JMZ, MCS, BP, EEE, AMP; gene annotation and expression: CJS, MRV, MH, MYD, MD; manuscript writing: NA, IAA, KHM, with input from all authors.
Competing interests
SK and KHM have received travel funds to speak at symposia organized by Oxford Nanopore. W.T. has two patents (8,748,091 and 8,394,584) licensed to Oxford Nanopore Technologies.
Data and materials availability
Sequence data are available through https://www.ncbi.nlm.nih.gov/bioproject/559484
The Human Pangenome Reference Consortium (HPRC) generated long and accurate HiFi reads for sixteen human samples: HG002, HG003, HG004, HG005, HG006, HG007, HG01243, HG02055, HG02109, HG02723, HG03492, HG01109, HG01442, HG02080, HG02145, and HG03098. We refer to these datasets as HPRC samples (https://github.com/human-pangenomics/hpgp-data).
Data tracks and satellite annotations can be visualized on the UCSC Genome Browser (102, 103): http://genome.ucsc.edu/cgi-bin/hgTracks?genome=t2t-chm13-v1.0&hubUrl=http://t2t.gi.ucsc.edu/chm13/hub/hub.txt
Annotation data tables and other supporting data and analysis workflows: https://github.com/kmiga/t2t_censat/
Acknowledgements
This work was supported, in part, by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health (SN, SK, AR, AMM, SB, AY and AMP), the Intramural funding at the National Institute of Standards and Technology (JMZ), HHMI Hanna H. Gray Fellowship (NA), Damon Runyon Postdoctoral Fellowship, Pew Latin American Fellowship (GVC); grants from the U.S. National Institutes of Health (NIH/NIGMS F32 GM134558 to GAL; NIH/NHGRI R01 HG00990905 to PS and AFS; NIH R01GM123312-02 to SJH, GAH, RJO; NIH R21CA240199, NSF 1643825 to RJO; NIH/NHGRI F31HG011205 to CJS; NIH/NHGRI R01 HG009190 to AG and WT; NIH/NHGRI R01HG010485, U41HG010972, U01HG010961 to KS; NIH/NIGMS R01GM132600, P20GM103546 and NIH/NHGRI U24HG010136 to DO and TJW; NIH/NHGRI R01 HG010329 to SRS; NIH/NHGRI R01HG010485, U41HG010972, U01HG010961, U24HG011853, OT2OD026682 to BP; NIH/GM R35 GM139653 and R01 GM117420 to GHK; NIH R01 GM124041, R01 GM129263, R21 CA238758 to BAS; NIH/NHGRI R01HG002385, R01HG010169, NIH/NHGRI U01 1U01HG010971 to EEE; NIH/OD/NIMH DP2 OD025824 to MYD; NIH/NHGRI U24HG010263; NHGRI U24HG006620; NCI U01CA253481; NIDDK R24 DK106766-01A1 to MCS; NIH/NHGRI U41HG007234 to MD; NIH/NHGRI R01 1R01HG011274-01, NIH/NHGRI R21 1R21HG010548-01 to KHM), National Science Foundation (NSF 1613806 to SJH, GAH, RJO; NSF DBI-1627442, NSF IOS-1732253, NSF IOS-1758800 to MCS); Mark Foundation for Cancer Research to SA and MCS (19-033-ASP); Russian Science Federation RSF 19-75-30039 (analysis of genomic repeats) to IAA; Russian Foundation for Basic Research (RFBR 18-29-13051) to LU; LU is supported by Sirius University; St. Petersburg State University (grant ID PURE 73023573 to AM, TD, IAA and grant ID PURE 51555639 to OK); NIH R01AG054712 to EIR; Ministry of Science and Higher Education of the Russian Federation (075-10-2020-116 (13.1902.21.0023)) to FG; Connecticut Innovations to RJO; Stowers Institute for Medical Research to JLG; AS is a Chan Zuckerberg Biohub Investigator; EEE and AFD are investigators of the Howard Hughes Medical Institute. Certain commercial equipment, instruments, or materials are identified to specify adequately experimental conditions or reported results. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment, instruments, or materials identified are necessarily the best available for the purpose. This work utilized the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov).