Abstract
Recombination is a main source of genetic variability. However, the potential role of the variation generated by recombination in phenotypic traits, including diseases, remains unexplored as there is currently no method to infer chromosomal subpopulations based on differences in recombination patterns. We developed recombClust, a method that uses SNP-phased data to detect differences in historic recombination in a chromosome population. We validated our method by performing simulations and by using real data to accurately predict the alleles of well-known recombination modifiers, including common inversions in Drosophila melanogaster and humans, and the chromosomes under selective pressure at the lactase locus in humans. We then applied recombClust to the complex human 1q21.1 region, where nonallelic homologous recombination produces deleterious phenotypes. We discovered and validated the presence of two different recombination histories in this region that were significantly associated with the differential expression of ANKRD35 in whole blood and that were in high linkage with variants previously associated with hypertension. By detecting differences in historic recombination, our method opens a way to assess the influence of recombination variation on phenotypic traits.
Introduction
Recombination plays a central role in adaptation and evolution, and its influence on human disease is becoming increasingly clear [1]. During the last decade, our understanding of genome-wide recombination rates and landscape has been greatly increased by the resolution and power of high-throughput data and analysis methods on population samples. Methods that extract recombination signals from linkage between SNPs have been instrumental [2–6]. However, despite these great advances, the outstanding question of how recombination variability influences phenotypes has lagged behind, as there has not been a method to measure recombination variation between individuals for large association studies. A large body of theoretical work, initiated by Nei [7], has explored the conditions under which the variability of general recombination modifiers evolves [8, 9], yet empirical studies that link recombination variability in a genomic region with phenotypic traits and fitness are restricted to already known specific modifiers, such as inversions or specific polymorphisms [10–12]. In this context, we developed recombClust, a pioneering method to detect recombination variability between chromosomes by inferring the differences in recombination histories within a genomic region.
Recombination produces offspring chromosomes with new combinations of maternal and paternal DNA material at each side of a recombination event [13], making it a main source of novel genetic diversity. At the population level, when multiple recombination events have occurred between two genomic markers, the linkage between them decreases and a random association is then observed. Historic recombination patterns were thus successfully extracted from the linkage between dense SNP markers, strongly matching direct observations on recombination events in sperm samples [14]. Because linkage methods are population-based estimates, they have been extensively used to compute accurate recombination rates and landscapes in large population samples but, at the same time, have been disregarded for their ability to detect recombination variation between individuals [15], i.e. to infer groups of chromosomes with different recombination histories in a genomic region. However, latent variable mixture models can be incorporated into linkage methods to detect the underlying mixture of chromosome subpopulations, characterized by different recombination patterns. We, therefore, hypothesized that in a genomic region where the recombination frequency and location are modified in a subpopulation of chromosomes, the chromosomes can be grouped according to consistent recombination histories within the region. The detected chromosome groups could then be tested for association with phenotypes, allowing the use of large cohorts to study the phenotypic effects of recombination variability in the genomic region.
Here, we proposed a method that leverages chromosomal differences in linkage patterns in a genomic region to classify the chromosomes of a population into groups with different recombination histories. The method, named recombClust, comprises two steps. First, it fits a mixture model for each pair of SNP blocks within a genomic region to classify chromosomes into those with a history of high recombination or high linkage between the blocks; second, it tests the consistency of the chromosomes’ classification across all the mixture models i.e. all SNP block pairs. Chromosome groups with different recombination histories are thus called by the chromosome’s classification into the consistent recombining groups. By estimating the proportion of chromosomes with historic recombination at a given point in the region, the recombination pattern for each chromosome subpopulation can be reconstructed.
We tested the performance and adequacy of the method using numerous simulated scenarios and demonstrated its ability to detect known recombination modifiers with their correct recombination patterns using real data for Drosophila melanogaster and humans. Finally, we used the method to i) detect and validate chromosome subpopulations with different historic recombination at 1q21.1, a genomic region at risk of deleterious rearrangements leading to the thrombocytopenia-absent radius (TAR) syndrome [16, 17], and ii) to associate the chromosome groups with changes in gene expression in blood. The method was implemented in a computationally efficient tool, compatible with Bioconductor’s packages and the variant call format (VCF). The development version is available at https://github.com/isglobal-brge/recombClust and the final version will be available in Bioconductor.
Results
We implemented recombClust, a method to classify chromosomes into groups with different recombination histories across a genomic region (Figure 1). The method comprises two steps. First, for each pair of SNP blocks in a genomic region, it fits a mixture model of two chromosome groups (recomb/linkage), one in which chromosomes display random association between the blocks (recomb) and the other where the blocks are found in complete association (linkage). Second, recombClust classifies chromosomes into subpopulations (A/B) based on a consensus clustering across all the mixture models fitted along the genomic region. The chromosome groups A/B are the subpopulations associated with different recombination histories, which can be reconstructed from the proportion of chromosomes in the recomb group at each point across the genomic region. The underlying chromosome substructure (A/B) can be used in downstream analyses, such as transcription and methylation profiling or association with phenotypes.
Modeling the mixture of chromosomes under recombination and linkage
We developed a mixture model to split the chromosomes of a population into those showing high recombination and those showing high linkage history between two SNP blocks (Methods). Figure 1A illustrates two instances where the mixture model is fitted at two different points in a genomic region. For illustration purposes, only two alleles are shown at each SNP block (+,-). The first recombination point is tested by blocks 1/2 where M chromosomes are in the recomb group showing random association between the blocks and N chromosomes are in the linkage group showing maximum linkage between the blocks. The other point is tested by blocks i–1/i where now the N previous chromosomes belong to the recomb group and the previous M chromosomes to the linkage group. Note that although the model at one point can be ambiguous for some chromosomes (i.e. chrom 1 at 1/2 in Figure 1A), the final chromosome classification into subpopulations consistent with specific recombination patterns is robust when considering other points in the region (chrom 1 at i–1/i), as explained in the following section.
We simulated multiple datasets representing a SNP block pair that flanked one recombination point for a group of chromosomes (recomb) but remained in linkage for a second group (linkage); see the Methods section. We first evaluated how the proportion between recomb and linkage populations affected the accuracy of the model to correctly classify the chromosomes, varying the proportion between 0.1 and 0.9. We observed that the mixture model had high accuracy (>80%) across the whole range of proportions, being optimal, as expected, when the mixture was small, i.e. when the mixture frequency approached 1 or 0 (Sup Figure 1A). We also observed that the model was robust under different initializations of the mixture frequency (Sup Figure 1B). Overall, our simulations showed that the mixture model was able to robustly split the chromosomes into two groups, one with null LD (recomb) and the other with full LD (linkage) between a pair of 2-SNP blocks.
We then evaluated the accuracy of the model under different within- and between-block SNP variabilities, using a fixed scenario with a 0.5 mixture proportion between the recomb and linkage groups. To test SNP block variability, we simulated multiple two-SNP block pairs, flanking a recombination point, and determined the haplotypes across the blocks. We varied the number of SNP alleles that were different between the most frequent recomb and linkage haplotypes. We thus assessed the extent to which the accuracy of the model was affected by increasing differences in haplotype variability between the groups. We observed that the mixture model had an accuracy of 75% when the most frequent haplotypes were shared between groups, and reached 90% when the difference between the haplotypes was given by only one SNP allele (Sup Figure 1C). This suggests a substantial accuracy when the differences in mutation frequency between the groups are small, which, in addition, can be boosted by the presence of one SNP allele that associates with one of the groups. We then assessed the influence of within-block variability on model accuracy. For a scenario of full linkage of the SNPs within all blocks, which reduces to having blocks of 1 SNP, the accuracy dropped to ~60%, showing that larger and more variable SNP blocks increase the model's accuracy (Sup Figure 1D).
Classifying chromosomes into different recombination histories within a genomic region
The second step of recombClust is a consensus clustering of mixture models at numerous points along a genomic region to classify chromosomes into consistent recombining groups (A/B) (Figure 1B). Within the region, all possible SNP block pairs are tested such that they do not overlap and are at a maximum distance of 10 kb. For each block pair, a mixture model is fitted and the chromosomes are classified into the recomb and linkage groups. Because chromosomes in recomb at one point in the region can be in linkage at another point, a consistent classification over the mixture model predictions was considered. For this step, we applied a clustering method (k-means) on the first principal components of the model prediction variables obtained from the mixture models fitted along the region. The clusters identified were then considered as chromosomes with similar recombination patterns within the region. Mixture model classification across the region was used to reconstruct the pattern of classification proportion into the recomb and linkage groups. This pattern was then compared with the recombination patterns obtained by other linkage-based methods, which are applicable only when the chromosome subpopulations A/B are initially known.
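The reconstruction of a recombination pattern from the mixture-model classifications can be illustrated with a minimal Python sketch. It is not the recombClust implementation: the function name `recomb_proportion` and the input layout (one row per chromosome, one column per block-pair model) are assumptions made for illustration, and the 0.5 cutoff follows the classification rule stated in Methods.

```python
def recomb_proportion(prob_matrix, labels, group, threshold=0.5):
    """Proportion of chromosomes of one subpopulation classified as recomb
    at each block-pair model.

    prob_matrix: rows are chromosomes, columns are block-pair models; each
                 entry is the probability of belonging to the recomb group.
    labels:      subpopulation label of each chromosome (e.g. "A" or "B").
    group:       label of the subpopulation to summarize.
    """
    rows = [row for row, lab in zip(prob_matrix, labels) if lab == group]
    if not rows:
        raise ValueError("no chromosome carries the requested label")
    n_models = len(rows[0])
    # A chromosome counts as recomb at a model when its probability exceeds 0.5.
    return [sum(row[j] > threshold for row in rows) / len(rows)
            for j in range(n_models)]


# Hypothetical probabilities for four chromosomes and two block pairs.
probs = [[0.9, 0.2], [0.8, 0.1], [0.1, 0.7], [0.3, 0.9]]
labels = ["A", "A", "B", "B"]
pattern_a = recomb_proportion(probs, labels, "A")
pattern_b = recomb_proportion(probs, labels, "B")
```

Plotting each returned list against the genomic position of the block pairs would give one curve per subpopulation, which is the kind of pattern compared against other linkage-based estimates.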
We used simulations to test whether the number of chromosomes and the number of recombination points affected the accuracy of recombClust to identify subpopulations of chromosomes with different recombination patterns. We thus simulated datasets representing SNP block pairs that flanked multiple recombination points. We simulated two kinds of populations: (1) a mixture population, where one subpopulation (A) belonged to the recomb group in half of the points and to the linkage group in the other half while a second subpopulation (B) belonged to the linkage and recomb groups, respectively; and (2) a single population where all chromosomes belonged to the same recombination groups across all recombination points.
First, to assess false discovery rate and statistical power, we selected several scenarios changing the number of chromosomes per population (from 20 to 60) and the number of recombination points (from 10 to 100). In all cases, we performed a PCA on the classification matrix given by the mixture model probabilities of belonging to the recomb group at each SNP block pair (Figure 1B). Then, using k-means, we clustered the first two principal components into two groups and considered that recombClust detected differences in recombination patterns when the average silhouette value of the clustering was higher than 0.7 [18] (Sup Figure 2). We observed that under single population simulations, recombClust had a false discovery rate < 0.05 for recombination points > 70 and for all the numbers of chromosomes considered (>20). In addition, the power to detect different recombination patterns for simulations of chromosomes with two different recombination histories reached 80% for > 25 chromosomes and for differences in historical recombination in > 16 points (Sup Figure 3).
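The silhouette criterion used as the detection rule can be sketched in plain Python as follows. This is an illustrative implementation of the standard average silhouette width with Euclidean distance; the function names and the toy coordinates are hypothetical, and only the 0.7 threshold comes from the text.

```python
import math


def average_silhouette(points, labels):
    """Mean silhouette width of a clustering, using Euclidean distance."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for i, p in enumerate(points):
        # Mean distance from point i to the members of each cluster,
        # leaving out point i itself.
        mean_dist = {}
        for c in clusters:
            d = [math.dist(p, q) for j, q in enumerate(points)
                 if labels[j] == c and j != i]
            if d:
                mean_dist[c] = sum(d) / len(d)
        a = mean_dist.get(labels[i])
        if a is None:
            continue  # singleton cluster: its silhouette is taken as 0
        b = min(v for c, v in mean_dist.items() if c != labels[i])
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def has_substructure(points, labels, threshold=0.7):
    """Detection rule: average silhouette above the chosen threshold."""
    return average_silhouette(points, labels) > threshold


# Toy coordinates on the first two principal components (hypothetical).
separated = [(0, 0), (0, 1), (10, 0), (10, 1)]
single_blob = [(0, 0), (1, 0), (2, 0), (3, 0)]
two_groups = [0, 0, 1, 1]
```

With these toy inputs, the well-separated configuration passes the threshold while the evenly spaced one does not, which mirrors the distinction between mixture and single-population simulations.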
Second, to confirm that the model detected differences in recombination histories rather than allele differences, we compared recombClust classification with that of a PCA on the simulated genotypes. For a simulation with chromosome mixture, we observed a neat separation of the chromosome subpopulations (Sup Figure 4A) with recombClust, which we did not observe for allele differences.
recombClust accurately classifies inversion status based on differences in historic recombination
The alleles of polymorphic inversions differ in the recombination histories inside the inverted region because recombination is suppressed in heterokaryotypes [3]. We, therefore, asked the extent to which the inversion alleles, being strong recombination modifiers, could be inferred by recombination differences using recombClust. We evaluated the method’s performance to predict simulated inversions from the coalescent simulator invertFREGENE and tested its accuracy to classify common inversions in Drosophila melanogaster and humans. Using invertFREGENE [19], we simulated inversions with different lengths (from 50Kb to 1Mb) and frequencies (from 0.1 to 0.9) and tested the prediction accuracy of chromosome classification into their inversion alleles. We observed accuracy greater than 90% for inversions larger than 250Kb (Sup Figure 4B). As expected, accuracy for short inversions was lower as they presented fewer recombination points. recombClust’s mean accuracy was higher (95%) for inversion frequencies between 0.2 and 0.8 (Sup Figure 5) but did not correlate with the inversion’s age (r = 0.02, p-value = 0.19) (Sup Figure 6).
We then used recombClust to determine whether the alleles of three common polymorphic chromosomal inversions in Drosophila melanogaster (In(2L)t, In(2R)NS and In(3R)Mo) could be determined from different recombination histories. We ran recombClust on genome-wide SNP data of 205 lines derived from the Raleigh, USA population, comprised in the Drosophila Genetics Reference Panel (DGRP2) [20, 21], and compared the inferred recombining subpopulations with the experimental inversion alleles of the lines. For all the inversions, we observed clear clustering in the first principal component of the mixture classification matrix (Figure 2) that resulted in a 98% match with the inversion alleles when a k-means clustering was applied. Likewise, we compared the recombClust calling of human inversions at 8p23.1 and 17q21.31 with the experimental inversion genotypes, as obtained from the invFEST repository [22] for the European subjects of the 1000 Genomes Project. Using SNP-phased data, we found that recombClust neatly separated inverted and standard chromosomes (Figure 2) in the first principal component of the mixture classification matrix. The k-means clustering of the first principal component accurately matched the experimental inversion alleles (8p23.1: 100%, 17q21.31: 99.3%). Overall, these results showed that recombination substructure can reliably identify the inversion alleles of some common inversions in two different species.
To test whether recombClust classification reflects differences in historical recombination rates, we compared the recombination pattern obtained with recombClust along the longest human polymorphic inversion, 8p23.1, with the recombination rates estimated independently for each inversion allele with FastEPRR [23] (Figure 3). Remarkably, we observed that the inferred proportion of chromosomes in the recomb population across the genomic region accurately captured the underlying recombination patterns obtained by FastEPRR for each of the 8p23.1 inversion alleles. We also observed that the largest differences in recombination proportion were obtained near the recombination hotspots reported by Alves et al. [3] (Figure 3). These results confirmed that the chromosome subpopulations identified by recombClust map to different recombination histories.
recombClust detects recombination histories associated to ancestral differences
Modifiers of historical recombination patterns include numerous processes other than inversions that can act simultaneously on the same genomic region. In particular, differences in historical recombination patterns between ancestries can derive from random differences in the occurrence of recombination events or from the emergence of hotspot differences regulated by ancestry-specific alleles [24]. As such, we asked the extent to which differences between human populations could also be detected in loci already under the influence of inversion alleles. We, therefore, used recombClust to detect the modifier alleles in the loci corresponding to the human inversions at 8p23.1 and 17q21.31 for all the individuals in the 1000 Genomes Project, covering four different continental populations [25]. We inspected the first two principal components of the mixture model predictions for inv-8p23.1 (Sup Figure 7), and observed multiple clusters, in which chromosomes segregated both by inversion status and ancestry. However, for inv-17q21.31, the additional clusters observed in the standard allele did not map to ancestral differences. The observations on both human inversions confirmed that the clusters identified in the first principal components of the mixture model predictions can be interpreted as non-recombining chromosome groups that differ in inversion status, ancestry, or other unobserved factors that suppress recombination between the groups, such as copy number variants likely segregating in standard chromosomes at 17q21.31 [26].
recombClust detects recombination histories associated with selection
Chromosomes with advantageous alleles show a decrease in recombination around the locus under selection. While selection, like demography, does not have a direct influence on the biological process of recombination, both modulate the historical recombination patterns [5]. Therefore, we asked whether recombClust was able to detect chromosomes under selection and recover their recombination patterns. We studied the LCT locus, a human locus known to be under positive selection for lactase persistence, as defined in PopHumanScan (chr2:135770000-136900000, hg19) [27]. We aimed to detect the underlying chromosomes under selection and their recombination pattern in the LCT locus, for the European individuals of the 1000 Genomes Project. We observed two chromosome subpopulations (A/B) by clustering the first principal components of the mixture classification matrix (allele 1: 60.8%, allele 2: 39.2%) (Figure 4). Notably, chromosome allele 1 was the most frequent except for the Tuscany population (TSI) (Sup Table 1), the only European population which does not show marks of selection in the LCT locus, as reported in PopHumanScan [27]. We also observed a strong correlation between rs4988235 (C/T(-13910)), the SNP linked to lactase persistence, and the inferred subpopulation groups (r² = 0.64), where the allele conferring lactase persistence (T) was very frequent in chromosome allele 1 (83%) and almost absent in chromosome allele 2 (<1%). The ability of recombClust to detect chromosomes under selection was further confirmed by the proportion of chromosomes in recomb along the locus for each chromosome subpopulation. As expected, we confirmed that recombination appeared flat in group A (under selection) but not in B (not under selection) across the LCT locus (Figure 4). We also recovered the recombination patterns independently obtained with FastEPRR for each chromosome subpopulation. Recombination peaks for chromosome allele 2 were found between the genes R3HDM1 and DARS, matching previously reported recombination peaks [28].
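The correlation between a biallelic SNP and the inferred subpopulation can be computed with the usual r² statistic for two binary variables. The sketch below is illustrative only: the function name and the example vectors are hypothetical and do not reproduce the study data.

```python
def r_squared(a, b):
    """Squared correlation (r^2) between two binary vectors of equal length,
    e.g. SNP allele (0/1) and subpopulation membership (0/1) per chromosome."""
    n = len(a)
    pa = sum(a) / n                                # frequency of 1 in a
    pb = sum(b) / n                                # frequency of 1 in b
    pab = sum(x * y for x, y in zip(a, b)) / n     # joint frequency of (1, 1)
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        raise ValueError("r^2 is undefined when either vector is constant")
    d = pab - pa * pb                              # disequilibrium coefficient D
    return d * d / denom


# Hypothetical coding for eight chromosomes.
snp_allele = [1, 1, 1, 0, 0, 0, 0, 0]
subpopulation = [1, 1, 0, 0, 0, 0, 0, 0]
value = r_squared(snp_allele, subpopulation)
```

Treating subpopulation membership as a biallelic marker in this way allows the inferred groups to be compared with any phased SNP using the same measure applied between SNPs.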
recombClust detects recombination differences in complex genomic regions
The region at 1q21.1 between chr1:145,399,075-145,594,214 (hg19) [29] is prone to various deleterious rearrangements by non-allelic homologous recombination (NAHR) at the numerous segmental duplications (SD) in the region [16]. The rearrangements include microdeletions leading to the thrombocytopenia-absent radius (TAR) syndrome and a range of multiple neurodevelopmental phenotypes caused by duplications and deletions distal to the TAR region [16]. As strong control of recombination is expected in regions at risk of NAHR during meiosis [30], we hypothesized that different recombination histories would be detectable in the region and aimed to determine their functional correlates.
We ran recombClust across the region chr1:145.35-145.75Mb characterized by four blocks of segmental duplications. The most common deletion for the TAR syndrome is observed between the first and third block [31] while the smallest reported deletion was found between the second and third block [16] (Figure 5). We first analyzed the European individuals of the 1000 Genomes project and observed two clear clusters in the first two PCs of the classification matrix across mixture models. We defined two chromosome subpopulations (subpopulation 1: 80.9%, subpopulation 2: 19.1%) that were in Hardy-Weinberg equilibrium (P = 1) and thus confirmed our hypothesis for the presence of different recombination histories in the region. For each group, we estimated the recombination pattern given by the proportion of chromosomes in recomb (Figure 5), observing important differences between the groups. Notably chromosomes in subpopulation 2 had higher recombination proportion than those in subpopulation 1 along the region except for the small interval containing the genes LIX1L and RBM8A, the causative gene of TAR syndrome [29]. However, the highest differences in recombination proportions were observed between the third and fourth SD blocks, where subpopulation 1 showed null recombination; suggesting a stronger suppression of recombination for this group of chromosomes. We fully validated the chromosome subpopulations and their recombination patterns using the Whole Genome Sequencing data of 287 European individuals from the Genotype-Tissue Expression project (GTEx) (Figure 5). We thus obtained strong evidence for the existence of two recombination histories in the region.
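A Hardy-Weinberg check on the diploid genotypes formed by the two chromosome subpopulations can be sketched as below. The text does not state which test was used, so this is only one common option (a chi-square goodness-of-fit test with one degree of freedom); the function name and the genotype counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for a biallelic marker.

    Returns the statistic and its p-value (1 degree of freedom).
    Assumes both alleles are present, so no expected count is zero.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele a
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical genotype counts (not the study data).
chi2_fit, p_fit = hwe_chi_square(64, 32, 4)     # matches HW proportions
chi2_dev, p_dev = hwe_chi_square(50, 0, 50)     # no heterozygotes at all
```

For small samples or rare subpopulations, an exact test would usually be preferred over this approximation.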
We further asked whether the recombination histories could have a functional role. We tested, using RNA-sequencing data in blood from the GTEx project, whether the expression levels of the genes in 1q21.1 were associated with the two different recombination histories. We found a significant differential expression of ANKRD35 (log fold change = 0.18, P = 6.7×10^−4) and noted that the SNP rs10910843, an eQTL of ANKRD35 in blood [32], was in high linkage with the chromosome subpopulations. We additionally found that the SNP rs72704264, a risk factor for hypertension [33], was also in high linkage with the subpopulations, showing likely functional links associated with the different recombination histories.
Discussion
recombClust is the first method to classify chromosomes into different subpopulations based on the inference of the recombination histories along genomic regions. Linkage methods for detecting historic recombination patterns have been important to characterize the distribution of recombination hotspots between species and ancestries [34–36]. While current methods aim to robustly estimate the recombination rate between markers by coalescent modeling, accounting for selection and demographic effects, they do not detect recombination variation between individuals. recombClust fills this gap, further allowing the association between differences in recombination histories and phenotypes to be tested.
recombClust assumes that there is an inverse relationship between recombination and linkage between genetic markers (SNP blocks). However, the similarity of the recombination patterns obtained with recombClust with those obtained with FastEPRR shows that this assumption is not inaccurate. This is because recombClust is also the first method to incorporate the spatial correlation of the recombination signal along a genomic region, which other linkage methods do not. Consequently, demographic and selection signals, which induce spatial correlation, are directly extracted from the data (Figures 5–6). Additional analyses are, however, required to identify the nature of different recombination histories and to determine whether they are due to ancestry, selection or the presence of chromosomal rearrangements affecting the recombination within the region. In particular, the method successfully split the groups of chromosomes being selected in the LCT locus from those which are not, accurately giving a flat recombination pattern to the group under selection. This is an added advantage with respect to methods like FastEPRR in the computation of recombination patterns because recombClust explicitly extracts the selection signal from the data by identifying the chromosomes under selection as those with a flat recombination pattern in the locus. Our analyses showed that at the LCT locus, the pattern differences between chromosome groups were large, further suggesting a novel approach in the detection of selection signals.
We have shown that when recombination modifiers are expected to affect a genomic region, such as inversions, recombClust can be reliably used to infer its alleles in large population samples. recombClust can, for instance, be added to other methods that genotype inversion from SNP data, offering an additional signal derived from recombination patterns [37]. However, we expect that the limitations of these methods also apply to recombClust, such as inversions being ancient and not recurrent. Recombination modifiers acting on small targeted sequences that are not expected to show a spatial-extended historic pattern require further methodological developments, like merging the mixture model with coalescent modeling. In general, recombination modifiers whose effects cannot be observed in historical recombination patterns are beyond linkage methods.
We also showed that recombClust can detect differences in recombination histories in complex regions prone to non-allelic homologous recombination (NAHR) and, therefore, likely subjected to tight regulation of recombination [30]. We discovered and validated the existence of two recombination histories in the 1q21.1 locus at risk of deleterious syndromes. Detailed analyses are needed to disentangle the nature of the recombination modifiers acting on the region, which can be, for instance, a mixture of genomic rearrangements, epigenetic marks or functional mechanisms regulating double strand breaks that avoid NAHR [30]. In addition, the question arises of whether the recombination between the chromosome subpopulations confers specific risks of deletions and duplications in the offspring. As for the subpopulations’ relation with more common phenotypes, we observed a strong linkage with a risk factor for hypertension, showing probable implications of recombination variation for this trait within 1q21.1. We, therefore, showed an approach to measure the impact of different recombination histories on phenotypes, opening a way to study how recombination variation influences traits.
Methods
recombClust description
We proposed a method to classify chromosomes according to the combinations of SNP alleles across a genomic region that are allowed by different recombination patterns. Consider a situation where two recombination patterns are latent in the chromosome population, generating two chromosome subpopulations in a given genomic region (Sup Figure 8). A first subpopulation of chromosomes comprises those that have recombined at any of three given points within the region, and a second subpopulation comprises those that have recombined at any of two other points. In this case, we can see, for instance, that while two specific haplotypes G1 and H1 are compatible with recombination pattern 1, they are maximally different in mutation content at each SNP variant. In addition, H1 is more similar in mutation content to H2 than G1 is to H1, despite H1 and H2 belonging to different recombination subpopulations. In this work, we proposed the method recombClust, which first classifies chromosomes into those that have recombined at a point between two markers and those that have not, and second computes a consensus classification of chromosomes across all points, separating the population of chromosomes according to different recombination patterns along the segment.
Mixture model to classify a fraction of recombining chromosomes
The first step of recombClust is the classification of chromosomes that have recombined at one point flanked by two SNP blocks. We therefore propose to model the likelihood that a chromosome in the sample is drawn from a mixture of chromosomes that highly recombined at the point (recomb) and that remained in complete LD (linkage) (Figure 1A). The likelihood is therefore given by a mixture of two latent chromosome groups (recomb/linkage). In the first group, we model the recombination at a point that lies in the sequence interval between a pair of SNP blocks (i=1, 2) of length L. Phased SNP alleles are encoded by 0 or 1, the haplotype of a chromosome at block i is a random variable denoted Xi ∈ {0,1}L and the haplotype of the joint blocks is the random variable given by the concatenation of the block variables X12 = X1 ∘ X2. Under this model, the recombination completely breaks the LD between the SNP blocks (r2 = 0) in the recomb subpopulation and therefore X1 and X2 are statistically independent. Therefore, the probability that a chromosome is observed with haplotype x12 in a chromosome group under recombination is: given the haplotype frequencies n1 and n2.
For the second chromosome group, we consider that there is no recombination and we model the SNP blocks to be in complete LD (r2 = 1). For the chromosomes in the linkage group, X1 and X2 are completely linked. X2 can be unambiguously mapped to X1 (f : X2 → X1). Under this model, the probability of observing haplotype x12 is: where d are the frequencies of X1.
We define the mixture model with two components, following equations (1) and (2). The model represents a chromosome population with a mixture of recomb and linkage groups with proportion π. We therefore assume that the probability of observing a chromosome with haplotype x12 is where r1 and r2 are the frequencies of haplotypes X1 and X2 in recomb, l1 is the haplotype frequencies of X1 in linkage, where g is the function linking X2 to X1.
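Given a set of $m$ independent chromosomes ($k = 1, \dots, m$), we denote the random variable for the joint blocks over all chromosomes as $Y_{12} = (X_{12}^{1}, \dots, X_{12}^{m})$, and therefore the likelihood of observing the data $y_{12}$ under the mixture model is:
$$P(Y_{12} = y_{12}) = \prod_{k=1}^{m} P(X_{12}^{k} = x_{12}^{k})$$
where each factor is the mixture probability of the haplotype observed in chromosome $k$.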
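The mixture model parameters are determined using an Expectation-Maximization (EM) algorithm. For each chromosome, we define a hidden variable $z_k \in \{0,1\}$ that indicates whether the chromosome belongs to the linkage ($z_k = 0$) or the recomb ($z_k = 1$) group. The EM algorithm updates the model parameters iteratively, maximizing the expected log-likelihood of the data. Given the parameters of the model $\omega = (r_1, r_2, l_1, g, \pi)$, we define the probability that chromosome $k$ belongs to the linkage group, $s_{0,k}(\omega) = P(z_k = 0 \mid x_{12}^{k}, \omega)$. Similarly, the probability that chromosome $k$ belongs to the recomb group given $\omega$ is $s_{1,k}(\omega) = P(z_k = 1 \mid x_{12}^{k}, \omega)$. For each $k$ the probability of belonging to any group is 1 and, therefore, $s_{0,k}(\omega) + s_{1,k}(\omega) = 1$. In each step of the EM algorithm, we find the value of $\omega'$ that maximizes:
$$Q(\omega' \mid \omega) = \sum_{k=1}^{m} \left[ s_{0,k}(\omega) \log P(x_{12}^{k}, z_k = 0 \mid \omega') + s_{1,k}(\omega) \log P(x_{12}^{k}, z_k = 1 \mid \omega') \right]$$
We therefore update the mixture parameters with $\omega'$ given by:
$$\omega' = \arg\max_{\tilde{\omega}} Q(\tilde{\omega} \mid \omega)$$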
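We estimate haplotype frequencies $r_1$, $r_2$, and $l_1$ in closed form using Lagrange multipliers, following Sindi et al. [38]. In particular, we obtain
$$r_i'(x_i) = \frac{\sum_{k=1}^{m} s_{1,k}(\omega)\, \mathbb{1}[x_i^{k} = x_i]}{m\, s_1(\omega)} \quad (i = 1, 2), \qquad l_1'(x_1) = \frac{\sum_{k=1}^{m} s_{0,k}(\omega)\, \mathbb{1}[x_1^{k} = x_1]}{m\, s_0(\omega)}$$
where $s_0(\omega) = \frac{1}{m} \sum_{k} s_{0,k}(\omega)$ and $s_1(\omega) = \frac{1}{m} \sum_{k} s_{1,k}(\omega)$ are the probabilities that a chromosome in the population belongs to the linkage or the recomb groups, respectively, and give the updated mixture proportions. We consider that a chromosome $k$ belongs to recomb if $s_{1,k} > 0.5$. The function $g'$ is defined using a greedy algorithm that sequentially pairs each observed haplotype $x_2$, in decreasing order of frequency, with the not yet paired $x_1$ for which the observed frequency of $x_{12}$ is maximum. Iteration stops when the root squared difference between $\omega'$ and the previous estimate is lower than machine precision. In addition, for numerical stability we set the zero in equation (2) to $10^{-5}$.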
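The EM procedure described above can be illustrated with a minimal sketch. This is not the recombClust implementation: the function names (`greedy_map`, `weighted_freq`, `em_mixture`) are hypothetical, π is assumed to be the proportion of the recomb group, frequencies are initialised from the pooled sample, the mapping g is computed once from raw haplotype counts, and the stopping tolerance is an arbitrary small value rather than machine precision.

```python
from collections import Counter
import math

EPS = 1e-5  # replaces the zero of the linkage component, as described in the text


def greedy_map(pairs):
    """Pair each x2 (most frequent first) with the still-unpaired x1 it co-occurs with most."""
    joint = Counter(pairs)
    g, used = {}, set()
    for x2, _ in Counter(b for _, b in pairs).most_common():
        cands = [(n, a) for (a, b), n in joint.items() if b == x2 and a not in used]
        if cands:
            g[x2] = max(cands)[1]
            used.add(g[x2])
    return g


def weighted_freq(haps, weights):
    """Haplotype frequencies weighted by group membership probabilities."""
    total = max(sum(weights), 1e-300)
    freq = Counter()
    for h, w in zip(haps, weights):
        freq[h] += w
    return {h: w / total for h, w in freq.items()}


def em_mixture(pairs, pi=0.5, max_iter=500, tol=1e-12):
    """Fit the two-component model to a list of (x1, x2) block haplotypes."""
    m = len(pairs)
    x1s = [a for a, _ in pairs]
    x2s = [b for _, b in pairs]
    g = greedy_map(pairs)
    r1 = weighted_freq(x1s, [1.0] * m)  # assumed start: pooled frequencies
    r2 = weighted_freq(x2s, [1.0] * m)
    l1 = dict(r1)
    s1 = [0.5] * m
    for _ in range(max_iter):
        # E-step: posterior probability that each chromosome is in recomb
        s1 = []
        for a, b in pairs:
            p_rec = pi * r1[a] * r2[b]
            p_lin = (1 - pi) * (l1[a] if g.get(b) == a else EPS)
            s1.append(p_rec / (p_rec + p_lin))
        s0 = [1 - s for s in s1]
        # M-step: closed-form weighted frequencies and mixing proportion
        new = (weighted_freq(x1s, s1), weighted_freq(x2s, s1), weighted_freq(x1s, s0))
        new_pi = sum(s1) / m
        diff = (new_pi - pi) ** 2 + sum(
            (n[h] - o[h]) ** 2 for n, o in zip(new, (r1, r2, l1)) for h in o
        )
        (r1, r2, l1), pi = new, new_pi
        if math.sqrt(diff) < tol:
            break
    return {"pi": pi, "r1": r1, "r2": r2, "l1": l1, "g": g, "s1": s1}
```

Calling `em_mixture` on a list of `(x1, x2)` haplotype strings returns the fitted parameters together with the per-chromosome recomb probabilities `s1`, which can be thresholded at 0.5 as in the text.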
Clustering of chromosomes into different recombination patterns
Differences in recombination patterns are given by the recombination points in which only a fraction of chromosomes showed historical recombination. In the second step of recombClust, a consensus clustering is performed on all the recombination points tested over a genomic region to determine whether individual chromosomes are consistently classified into different recombination patterns. Therefore, to detect a subpopulation of chromosomes across the region based on their recombination patterns, recombClust first extensively fits the mixture model between numerous non-overlapping 2-SNP blocks. For each model, the method computes the probability that the chromosomes belong to the recomb group. Finally, recombClust produces a consensus classification of the chromosomes by clustering the first principal component of the recomb probabilities matrix across all mixture models fitted in the genomic region (Figure 1B).
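The consensus step can be sketched as follows, under stated assumptions: the recomb probabilities are arranged as a matrix with one row per chromosome and one column per fitted model, the first principal component is obtained by power iteration, and the scores are split into two groups with a simple one-dimensional two-means procedure. The clustering method and the function names (`first_pc_scores`, `two_means_1d`) are illustrative choices, not necessarily those used by recombClust.

```python
import math


def first_pc_scores(prob, n_iter=500):
    """Score of each row (chromosome) on the first principal component."""
    m, p = len(prob), len(prob[0])
    means = [sum(row[j] for row in prob) / m for j in range(p)]
    X = [[row[j] - means[j] for j in range(p)] for row in prob]
    # Start from the centred row with the largest norm (lies in the row space)
    v = max(X, key=lambda row: sum(c * c for c in row))
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0:
        return [0.0] * m
    v = [c / norm for c in v]
    for _ in range(n_iter):
        scores = [sum(a * b for a, b in zip(row, v)) for row in X]
        w = [sum(scores[i] * X[i][j] for i in range(m)) for j in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:
            break
        v = [c / norm for c in w]
    return [sum(a * b for a, b in zip(row, v)) for row in X]


def two_means_1d(values, n_iter=100):
    """Split one-dimensional scores into two clusters (labels 0 and 1)."""
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(n_iter):
        labels = [int(abs(v - hi) < abs(v - lo)) for v in values]
        g0 = [v for v, l in zip(values, labels) if l == 0]
        g1 = [v for v, l in zip(values, labels) if l == 1]
        if not g0 or not g1:
            break
        new_lo, new_hi = sum(g0) / len(g0), sum(g1) / len(g1)
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return labels
```

Because the sign of a principal component is arbitrary, only the grouping of chromosomes is meaningful, not which cluster receives label 0 or 1.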
Extraction of recombination patterns along a genomic region
We defined recombClust recombination patterns as the proportion of chromosomes that have recombined in each subpopulation at different points inside a target region. We started by dividing the target region into non-overlapping windows. In each window, we selected those models overlapping the window. In each model, we assigned a chromosome to the recomb group if its probability of belonging to the recomb group was higher than 0.5. Then, we considered that the chromosome belonged to the recombining group in a given window when it was assigned to the recomb group in more than half of the models. We defined non-overlapping windows of 50 Kb for the human ~4 Mb inversion 8p23.1 and the ~1 Mb LCT region, and of 20 Kb for the 0.4 Mb 1q21.1 region.
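The window-based summary can be sketched as below. The data layout is an assumption made for illustration: each model is a tuple `(start, end, probs)` giving the genomic interval spanned by its two SNP blocks and the recomb probability of each chromosome, and `labels` holds the subpopulation of each chromosome. The function name `recombination_pattern` is hypothetical.

```python
def recombination_pattern(models, labels, region_start, region_end, window):
    """Proportion of recombining chromosomes per subpopulation in each window.

    Returns a list of (window_start, window_end, {group: proportion}) tuples;
    the proportion is None when no model overlaps the window.
    """
    groups = sorted(set(labels))
    patterns = []
    for w_start in range(region_start, region_end, window):
        w_end = min(w_start + window, region_end)
        # Models whose interval overlaps the current window
        hits = [probs for start, end, probs in models if start < w_end and end > w_start]
        props = {}
        for group in groups:
            members = [k for k, lab in enumerate(labels) if lab == group]
            if not hits:
                props[group] = None
                continue
            # A chromosome recombines in the window if it is assigned to
            # recomb (probability > 0.5) in more than half of the models
            recombining = sum(
                1 for k in members
                if sum(p[k] > 0.5 for p in hits) > len(hits) / 2
            )
            props[group] = recombining / len(members)
        patterns.append((w_start, w_end, props))
    return patterns
```

With 50 Kb windows, for example, the call would be `recombination_pattern(models, labels, start, end, 50_000)`.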
Simulation to assess mixture model performance
We evaluated the accuracy of the mixture model to classify individual chromosomes with extensive simulations. We generated 200 instances of a reference scenario and compared them with 200 instances of multiple scenarios under different SNP block and between chromosome group variabilities. For one instance of the reference scenario, we simulated 1000 chromosomes in each of the recomb and linkage groups, given by the random and full linkage association between a pair of two-SNP blocks, respectively. For the recomb group, the chromosome alleles at each SNP were drawn from a binomial distribution whose frequency was independently sampled from a uniform distribution (unif(0.55, 0.95)), assuming no LD within the blocks and between blocks. For the linkage group, SNPs within the blocks were independent but the pair of blocks, flanking the recombination point, was in maximum LD. We then considered that the most frequent haplotype for the joint SNP blocks was the same in both subpopulations and given by the SNP alleles with maximum frequency, so the overall linkage in the total population was D’ = 1. Different scenarios were obtained by changing the parameters of these simulations, and in each we assessed the performance of the mixture model, given by the accuracy with which chromosomes were classified into the recomb/linkage groups. We first assessed the extent to which the accuracy of the model was affected by the genetic variability between populations, by considering increasingly large differences between the most frequent block-pair haplotypes in each chromosome group. We did this by changing the number of SNP alleles that were different between the most frequent haplotypes in each group.
We also assessed the influence of within-block variability on model accuracy, by taking blocks where the linkage between the SNPs in the block was maximum. This scenario reduces to having blocks of 1 SNP. Finally, we evaluated how the proportion between recomb and linkage populations affected the mixture model performance. We simulated different scenarios where the proportion of the recomb population ranged between 0.1 and 0.9. We also tested the model on the reference scenario using different initializations for the mixture frequency.
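One instance of the reference scenario could be generated along the following lines. This is a sketch under assumptions not stated in the text: allele 1 is coded as the major allele at every SNP, and full linkage is represented by making the second block an exact copy of the first. The function name `simulate_reference` and its parameters are hypothetical.

```python
import random


def simulate_reference(n_per_group=1000, block_len=2, seed=0):
    """Simulate (x1, x2) block haplotypes for the recomb and linkage groups."""
    rng = random.Random(seed)

    def draw_block(freqs):
        # Each SNP is drawn independently; allele 1 is the major allele
        return "".join("1" if rng.random() < f else "0" for f in freqs)

    # recomb: all SNPs independent, within and between blocks
    rec_freqs = [rng.uniform(0.55, 0.95) for _ in range(2 * block_len)]
    recomb = [
        (draw_block(rec_freqs[:block_len]), draw_block(rec_freqs[block_len:]))
        for _ in range(n_per_group)
    ]

    # linkage: SNPs independent within block 1, block 2 fully determined by block 1
    lin_freqs = [rng.uniform(0.55, 0.95) for _ in range(block_len)]
    linkage = []
    for _ in range(n_per_group):
        b1 = draw_block(lin_freqs)
        linkage.append((b1, b1))  # assumed mapping: identical blocks
    return recomb, linkage
```

Pooling the two returned lists, and varying `n_per_group` separately for each group, would give mixtures with different recomb proportions.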
Performance of recombClust to detect chromosomes with different recombination histories
We also evaluated the performance of recombClust in classifying chromosomes under different recombination patterns. As inversion polymorphisms produce chromosomal subpopulations that differ in their recombination patterns, we tested the ability of recombClust to detect inversion status in simulated inversions. We simulated an inversion of 800 Kb with a frequency of 20% using invertFREGENE [19] to evaluate the mixture model at different recombination points. We varied the inversion length (from 50 Kb to 1 Mb) and inversion frequency (from 0.1 to 0.9) to evaluate the overall performance of recombClust in calling the inversion status of the chromosomes. Each combination of frequency and length was run 100 times. In all simulations, we used the default values of invertFREGENE parameters (recombination: 1.25 × 10⁻⁷, mutation rate: 2.3 × 10⁻⁷).
Drosophila melanogaster and human inversions
We tested whether recombClust could characterize chromosomal inversions using differences in recombination patterns in Drosophila melanogaster and in humans. We used recombClust to infer the inversion status of chromosomes for three well known inversions: In(2L)t (2L:2225744-13154180, dm6), In(2R)NS (2R:11278659-16163839, dm6) and In(3R)Mo (3R:17232639-24857019). We used SNP data from DGRP2 lines [20, 21], excluding individuals with call rate < 95% and SNPs with any missing genotype or a MAF < 5%. We classified the lines into the underlying recombination patterns computed by recombClust and compared the classification with experimental inversion genotypes [21].
We used recombClust to classify phased chromosomes into underlying recombination patterns within human inversions at 8p23.1 (chr8:8055789-11980649, hg19) and 17q21.31 (chr17:43661775-44372665, hg19). We used SNP phased data from the 1000 Genomes project [25]. We inferred the recombination modifier variants with recombClust and compared them with the experimental inversion genotypes available in the invFEST repository [22].
Recombination substructure in the susceptibility region of TAR syndrome
We ran recombClust across the region chr1:145.35-145.75Mb, characterized by four blocks of segmental duplications. This region is prone to deleterious rearrangements by non-allelic homologous recombination (NAHR), which can lead to the thrombocytopenia-absent radius (TAR) syndrome. We analyzed the 503 European individuals from the 1000 Genomes project and the 528 European individuals of the Genotype-Tissue Expression (GTEx) project [39]. We obtained GTEx data from dbGaP (accession code: phs000424.v7.p2), phased them with SHAPEIT [40], and selected those individuals classified as European by peddy [41] with a probability higher than 0.9. In the recombClust analysis, we included SNPs with a MAF > 0.05 and performed the consensus clustering across the detected points with hierarchical clustering. In GTEx, we used the first two PCs of the chromosome subpopulation probabilities, while in 1000 Genomes we only used the second PC. We tested Hardy-Weinberg equilibrium using SNPassoc [42].
We studied whether the chromosome genotypes, derived from the chromosome subpopulations, were associated with gene expression and phenotype differences between individuals. We evaluated the association with gene expression in whole blood using GTEx data, using the gene raw counts from recount2 [33]. For each tissue, we removed genes with fewer than 10 counts in more than 90% of the samples. We tested the association between the chromosome alleles and gene expression by applying a robust linear regression with limma [43] to log2CPM values obtained with voom [44]. We included sex, platform, the top three genome-wide principal components and variables from PEER as covariates.
Supplementary material
Additional File 1: .pdf. Supplementary Figures and Tables. Sup Figures 1–8, Sup Table 1
Funding
This work was partly supported by the Spanish Ministry of Economy and Competitiveness [MTM2015-68140-R]; and the Catalan Government [#016FI_B 00272 to CR-A]. Funding for open access charge: Spanish Ministry of Economy and Competitiveness. J.G. is funded by the European Commission (H2020-ERC-2014-CoG-647900), the Ministerio de Ciencia, Innovación y Universidades/AEI/FEDER (BFU2017-82937-P) and the Secretaria d’Universitats i Recerca del Departament d’Economia i Coneixement de la Generalitat de Catalunya (GRC 2017 SGR 880).
Disclosure Declaration
The authors declare no conflict of interest.
Acknowledgments
The authors would like to express their gratitude to the Supercomputing and Bioinnovation Center (SCBI) of the University of Malaga (Spain) for their support and resources. The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. GTEx data were obtained from: the GTEx Portal on 06/07/2018 and dbGaP accession number phs000424.v7.p2 on 12/05/2017.
Footnotes
† Authors should be regarded as joint First Authors