Abstract
A challenging problem in human population genetics is related to the unique role of human Y chromosome, with properties that distinguish humans from other species. Centromeres in primate genomes are constituted of tandem repeats of ∼ 171 bp alpha satellite monomers, commonly organized into higher order repats (HORs). Because of gaps in DNA sequencing, HOR regions as genomic “black holes” have been understudied in spite of crucial importance. Only recently the sequencing of more complete satellite DNAs becomes accessible. In human Y chromosome the largest alpha satellite higher order repeat unit 34/36mer was found, but its polymorphic variants were not investigated. Here, we study the human Y chromosome centromeric genomic sequence from hg38 assembly using our novel ALPHAsub algorithm for simple identification of alpha satellite arrays and robust GRM algorithm for HOR identification in repeat sequences. We determine the monomer alignment scheme for alpha satellite HOR array based on canonical 34mer HOR, discovering a wealth of novel polymorphic variants which include the HOR-type monomer duplications, monomer deletions/insertions or rearrangements and non-HOR insertions.
Author Summary The centromere is important for segregation of chromosomes during cell division in eukaryotes. Its destabilization results in chromosomal missegregation, aneuploidy, hallmarks of cancers and birth defects. In primate genomes centromeres contain tandem repeats of ∼ 171 bp alpha satellite DNA, commonly organized into higher order repeats (HORs). In this work, we used our bioinformatics algorithms to study the human Y chromosome centromeric genomic sequence and we discover a wealth of novel polymorphic variants which include the HOR-type monomer duplications, monomer deletions/insertions or rearrangements and non-HOR insertions. These results could help to understand the role of alpha satellites and alpha HOR structures in centromeric organization and function, in particular their role in creating a functional kinetochore that is crucial for chromosome segregation during cell division.
Introduction
It was noted that “the properties of the Y chromosome read like a list of violations of the rulebook of human genetics” and “it seems the more we know about the Y chromosome, the more questions we have” [1]. Studies of atypical structure of human Y chromosome were largely focused on gene related content [2]. On the other hand, human Y chromosome is replete with pronounced noncoding repetitive sequences [3-7].
Centromeres of all human chromosomes consist of tandem repeats of alpha satellite monomers, commonly organized as higher order repeats (HORs) superimposed on approximately periodic tandem of alpha satellite monomers [8-12]. Because of gaps in DNA sequencing, these HOR regions, like genomic “black holes” [13, 14], have been understudied in spite of their crucial importance [15, 16]. With an impressive recent progress in sequencing technology [4, 16-18], the study of more complete satellite DNAs becomes accessible [4, 19].
An alpha satellite HOR in human chromosome Y was found previously by restriction map estimates [3]. It is organized into tandemly repeating units, most of which are approximately 5.7 kb long while some variant units are about 6.0 kb long. Both the 5.7 kb and the 6.0 kb HOR units were found to consist of tandemly repeating alpha satellite monomers, with the 6.0 kb unit containing two more monomers compared to 5.7 kb unit. Later, a value of 5.941 kb was reported for this HOR unit length [2].
Sequence contigs spanning junctions at the edges of the centromere array, becoming available, enabled more extensive bioinformatics analyses of repeat patterns in human genome [13, 20-24]. However, major gaps remained in the centromeric regions of human chromosomes, as “black holes” in genomes [13, 14, 25, 26].
Using GRM algorithm, the alpha satellite HORs were identified and analysed bioinformatically for previous incomplete and gapped Build 37.1 assembly for human and Build 2.1 for chimpanzee Y chromosome [27]. The human genome reference sequence was incomplete owing to the challenge of assembling long tracts of near-identical tandem repeats in centromeres.
An alpha satellite reference model has been recently produced and incorporated in the hg38 human genome assembly [28-30]. In the hg38 human genome assembly, centromere gaps have been filled by alpha satellite reference models, which are statistical representations of homogeneous HOR arrays [31]. Only recently, a long-read strategy was applied to a human centromere. A nanopore sequencing strategy was used to generate high-quality reads that span highly repetitive DNA in centromere of one individual human Y chromosome [4]. The DYZ3 array was assembled and characterized as HOR with 5.8 kb consensus sequence without repeat inversions [4]. Instances of 6.0 kb HOR structural variant were detected, evidence for seven 6.0 kb copies within DYZ3 array was found, present in two clusters separated by 110 kb, in accordance with predictions by previous restriction map estimates [32].
Results
Using robust ALPHAsub+GRM algorithm [27, 33, 34] we identify and analyse alpha satellite HOR arrays in hg38 assembly of Y chromosome (RefSeq Accession NC_000024.10). The resulting alpha satellite HOR ideogram for centromeric region of hg38 assembly of Y chromosome is shown, with three HOR domains I-III (Fig. 1).
The alpha satellite arrays extracted in the first step are given in Table 1. The GRM algorithm was extended by introducing the method ALPHAsub used to identify positions of alpha satellite arrays in DNA sequence, regardless weather they are of HOR or nonHOR type. In this way we determined segments with alpha satellite arrays. We have found 28 different alpha satellite arrays in human chromosome Y, three of them arranged in 34mer/36mer HOR structures. Previously, for human chromosome Y, only the 5.7 kb and the 6.0 kb alpha satellite HOR units were found[2-4].
The GRM peak at 5786 bp corresponding to the major 34mer/36mer HOR in GRM diagram is sizable (Fig. 2), because of large number of approximately regular HOR copies in HOR array. Applying ALPHAsub+GRM algorithm we determined the monomer alignment schemes for alpha satellite HOR arrays in hg38 assembly for human Y chromosome in domains I, II, and III (Fig. 3a-c, respectively). Going along genome sequence, the monomers are positioned in the monomer alignment scheme in order of appearance. In constructing monomer alignment scheme monomers are assigned to the same monomer type if divergence is less than 5 % and are placed in the same column of the scheme. Otherwise, monomers do not belong to types included in HOR and are referred to as non-HOR monomers.
The mean divergence among monomers within each HOR copy in domain I is ∼ 20%, and the mean divergence among monomers of the same type in different HOR copies is ∼ 0.5%. For 34mer HOR array in domain II the mean divergence is ∼ 20 % and ∼ 3 %, respectively; and for 34mer HOR array in domain III ∼ 20 % and ∼ 1 %, respectively, not counting monomer insertions/deletions. The corresponding consensus sequences of monomers in 34/36mer are given in Supplementary Table 1.
34/36mer HOR and its polymorphic variants in domain I
Using ALPHAsub+GRM algorithm in domain I of hg38 assembly, the 40 HOR copies are obtained: 27 complete 34mer HOR copies (referred to as canonical) and 13 polymorphic variants (Fig. 3a). The monomer types m8, m15 and m16 are absent in most of HOR copies and thus they are referred to as non-canonical monomers. In the aligned monomer scheme each horizontal bar presenting a monomer is characterized by its monomer type m1, m2, m37 and arrays of non-HOR monomers are characterized by arrow and symbol of insertion (for example a1 in Fig. 3a). For example, the HOR copy h1 consists of 36 monomer tandems of types m1-m7 and m9-m37 (displayed by horizontal bars No. 1-7 and 9-37). HOR copy h37 consists of a 20 monomers, tandem m1-m7 (monomers No. 1-7), monomer m9 (No. 9), tandem m28-m37 (monomers No. 28-37) and two non-HOR monomer array a1 inserted after monomer m9.
Four polymorphic variants are complete 36mers, three 32mers, three 35mers, one 27mer, one 24mer, and one 20mer (Table 2a). The four complete variant 36mer HOR copies arise from canonical 34mer by inserting after m14 two additional monomers. Such are HOR copies h1, h2, h32, h39. This HOR array, containing both 34mer and 36mer copies, is referred to as 34/36mer HOR. Occasional monomer duplications, similar as found here, appear also in alpha satellite HORs in some other human chromosomes (for example, [27, 35]).
The variant 35mer HOR copies h9, h20 and h38 arise from canonical 34mer HOR copy by duplicating the monomer m14, resulting in the variant monomer sequence … m13 m14 m14 m17 m18 … This variant triplet of HOR copies characterizes domain I and domain III. It is noted that monomer duplications appear in alpha satelite HORs also in other human chromosomes (for example, [27, 35]). Monomer duplication is present in six variant HOR copies for domain I. A particular case is combined monomer duplication, deletion and insertion in HOR copies h8, h21 and h25. In these three variant 32mer HOR copies the typical monomers m1-m7, m13-m14 and m17-m37 are intact. Then m5 is duplicated, a non-canonical monomer m8 is inserted and typical m9-m12 monomers are absent. These three HOR copies appear as polymorphic variants, where the 5-monomer region m8-m12 is distorted in a specific way, including replacement by one duplicate and one atypical monomer. This results in variant monomer sequence with … m4, m5, m6, m7, m5, m8, m13, m14, m17…
Variant HOR copies h34 and h37 involve significant deletions of monomers, and h40 is located at the end of domain I and is missing last ten monomers (m28-m37). Variant HOR copy h37 also involve insertion of two non-HOR alpha satellite monomers (non-repeating monomers which don’t have another copies within HOR structure)
34mer HOR and its polymorphic variants in domain II
The domains II and III have been inserted in the hg38 assembly sequence of Y chromosome in reverse orientation, considering the orientation of domain I. We have presented domain II and domain III monomer alignment schemes for canonical alpha satellite 34mer HORs and its polymorphic variants in direct orientations (Fig 3b and 3c) and we have conveniently adjust the start of HOR sequences to match the individual monomers from domains II and III to monomers in domain I (Fig 4a and 4b).
In domain II we identified segments classified in five 34mer HOR copies h1-h5 (Fig. 3b and Table 2b). They contain 45 different types of alpha satellite monomers: out of 34 canonical and 3 non-canonical monomers (m8, m15, m16) from domain I, all have counterparts in domain II. The corresponding similarity of monomer types between domains I vs. II are shown in Fig 4a. The other monomers are insertions of non-HOR monomers.
In this HOR array each of 34 HOR-monomers m1-m34 from domain I is repeated, most of them threefold. However, besides these HOR-repeating monomers, the HOR copy h3 has some inserted non-HOR monomers, appearing only once in the aligned scheme, labelled as b1 (array of 8 non-HOR monomers). This array of 8 inserted monomers is similar (up to 5%) to array of 8 inserted monomers in domain III (labelled as c1) (Fig. 4d). This array are followed by a duplication of monomer m5, then by an insertion of non-canonical monomer m8, then by a tandem of HOR monomers m9-m4, then by an insertion of two non-canonical monomers m15 and m16, and then by a tandem of HOR monomers m17-m37.
34mer HOR and its polymorphic variants in domain III
Using ALPHAsub+GRM algorithm the scheme of aligned monomer structure for HORs in domain III with 6 HOR copies was determined (Fig. 3c and Table 2c). The corresponding consensus HOR unit is almost identical to consensus HOR unit in domain I (divergence ∼ 0.3 %,). The corresponding similarity of monomer types between domains I vs. III are shown in Fig 4b.
All six HOR copies in domain III are polymorphic variants of canonical 34mer HOR. Two HOR copies (h3 and h4) are 35mers, where both copies have one monomer duplicated (m14, like 35mers in domain I). The HOR copy h2 has one non-HOR monomer insertion (marked as insertion c2 in Fig. 3c) and three monomers deletion (m35-m37). The HOR copy h5 contains 44 HOR monomer types, and is sizably distorted by addition of array of 8 non-HOR monomers (labelled as c1 in Fig 3c) that follow after m7 and are continued by non-canonical monomer insertion m8. These additional 8 monomers diverge from all classical HOR monomers by more than 5% and are similar to 8 additional monomers from domain II (Fig 4d). The structure of HOR copies h1 (5mer) and h6 (10mer) are determined by their location, the start and the end of domain III, respectively.
Full monomer divergence matrices between consensus monomers from domains I vs. II, I vs. III, and II vs. III are shown by heatmap in Fig. 4. As could be predicted from Fig 3, m15 and m16 in domains I and II have no monomer counterparts (below 5% identity) in domain III (Fig 4b and 4c).
Discussion
Recent rapidly improving second and third generation sequencing opens the possibility to determine complete ensemble of alpha satellite HORs in the whole human genome, which will enable broader investigations of alpha satellite HORs, their polymorphic variants and their influence on centromere dynamics. Previously, in chromosome Y, HOR with 5.8 kb consensus sequence and 6.0 kb HOR structural variant were detected [3, 4] that correspond to 34mer and 36mer HORs, respectively. In this paper, we have discovered a wealth of novel polymorphic variants, which include the HOR-type monomer duplications, monomer deletions/insertions or rearrangements and non-HOR insertions. In particular, these polimorfic varyants result with HOR structures up to 44 monomers length. These results could help to understand the role of alpha satellites and alpha HOR structures in centromeric organization and function, in particular their role in formation of functional kinetochore. One could expect that rich long HOR repeat units will be found also in centromere of some other human chromosomes. The coming years may bring exciting new developments in HOR investigations.
Methods
In this study the hg38 assembly sequence of Y chromosome (RefSeq Accession.version NC_000024.10) was used for HOR analysis downloaded from: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/ARCHIVE/ANNOTATION_RELEASE.1_08/Assembled_chromosomes/seq/.
Here we used our robust computational algorithm GRM - Global Repeat Map algorithm[27, 33, 34] convenient for HOR identification in novel centromeric repeat sequences and an ALPHAsub algorithm convenient for simple identification of alpha satellite arrays.
ALPHAsub algorithm
ALPHAsub algorithm is a simple method for extraction of alpha satellite tandem arrays from a given genomic sequence, irrespectively of whether they are organized into HORs or not. As a convenient “ideal key word” (a “seed”), we use a robust 28-bp segment from alpha satellite DNA sequences, TGAGAAACTGCTTTGTGATGTGTGCATT and its reverse complement. This choice of a well conserved region of known alpha satellite tandem is based on our previous experience with alpha satellite tandem arrays [27, 33, 35]. First, using the Levenshtein distance algorithm [36], all positions in the whole chromosome are determined where the 28-bp sequence of “ideal key word” or its reverse complement differs from a “real key word” by at most nine nucleotides. Second, the distances between positions of neighbouring “real key words” are calculated. Third, only those “real key words” are retained for which distance to its previous neighbour is approximately equal to 171 bp or to a multiple of 171 bp (d (n, n − 1)∼m · 171; m = 1, 2, …). In the latter case (m > 1), the additional “real key words” (one for m = 2, two for m = 2, and so on) are determined in the sequence between “real key word” and its previous neighbour, using the Levenshtein distance algorithm, at positions with the smallest difference of “real key words” compared to “ideal key word” or its reverse complement. In general, a distance between the additional “real key words”, obtained by this method, is always approximately equal to 171 bp. In this way, we determined positions of all alpha satellites within chromosome Y. In the next step, using positions of “real key words”, all alpha satellites from hg38 DNA sequence for chromosome Y are extracted and different alpha satellite ensembles are identified. On this basis, we have designed our ALPHAsub algorithm and computer program. Applying ALPHAsub program to the hg38 sequence of Y chromosome we determine location of all alpha satellite arrays within genomic sequence. In this way we determine regions within Y chromosome that contain alpha satellites arrays.
GRM algorithm
Global repeat algorithm (GRM) is an efficient and robust novel method to identify and study repeats, especially HORs, in a given DNA sequence [27, 33, 34]. For long DNA sequences of whole chromosomes, the noise in GRM diagram increases with increasing length of HOR repeat unit. This noise is significantly reduced by applying GRM to those regions which contain alpha satellite arrays selected using ALPHAsub algorithm for analysis of the whole chromosome sequence. We note that the GRM algorithm chooses the starting point autonomously, causing a difference of starting point with respect to standardly used sequence of consensus monomer. This choice of starting point does not influence the results.
ALPHAsub+GRM algorithm: GRM algorithm expanded by ALPHAsub algorithm
Successive application of ALPHAsub and GRM algorithms is used for identification and analysis of alpha satellite HORs in a whole chromosome sequence: in the first step we identify chromosome regions that contain alpha satellite arrays and in the second step we perform GRM computation for these regions. The algorithm is freely available on our web server genom.hazu.hr at https://genom.hazu.hr/tools.html. For identification of higher order structures, we use Needleman-Wunsch algorithm that creates a divergence matrix where diagonals highlight higher order structures of n-mers. We also show divergences in heatmap graphs.
Author Contributions
I.V. and M.G. performed the computations. M.G. wrote computational algorithm ALPHAsub. V.P. supervised the study. All authors analysed computational results. V.P. wrote the manuscript. All authors read and approved the final version of the manuscript.
Supporting information captions
Supplementary Tables S1
a) Consensus sequence of canonical 34/36mer HOR in domain I
b) Consensus sequence of variant 34mer HOR in domain II
c) Consensus sequence of variant 34mer HOR in domain III
Acknowledgements
We thank C. Tyler-Smith for stimulating our interest for alpha satellites. We also acknowledge support from the QuantiXLie Centre of Excellence, a project cofinanced by the Croatian Government and European Union through the European Regional Development Fund - the Competitiveness and Cohesion Operational Programme (Grant KK.01.1.1.01.0004), and the grant IP-2014-09-3626 from Croatian Science Foundation.