Multidimensional Scaling and Relatedness Research

Jan Graffelman; Iván Galván Femenía; Rafael de Cid; Carles Barceló-i-Vidal

doi:10.1101/297879

Abstract

Multidimensional scaling is a well-known multivariate technique, that is often used in genetics for studying population substructure. In this paper we show that multidimensional scaling of marker data is of relevance for relatedness research. Relatedness is usually investigated by estimating and plotting identity-by-state and identity-by-descent allele-sharing statistics. We show that outlying individuals in a map obtained by multidimensional scaling of genetic variables do not necessarily stem from a different human population, but can be the consequence of relatedness. We propose a method for classifying pairs of individuals into the standard relationship categories that combines genetic bootstrapping, multidimensional scaling and discriminant analysis. We validate our method with simulation studies. Given the variant filtering procedures, our method classifies relationships up to and including the fourth degree with high accuracy (96-97%), using only identity by state. The usefulness of the method is illustrated with data from the 1,000 genomes and the GCAT projects.

1 Introduction

Multidimensional scaling (MDS) is a versatile multivariate statistical technique that has found applications in many fields of science [Cox and Cox, 2001, Borg and Groenen, 2005]. In genetics, MDS is often used to investigate the existence of population substructure. If individuals do not come from a single homogeneous human population, but from two or more populations with different allele frequencies, then an MDS of the genetic marker data typically separates the individuals of the different populations. An example is given in Figure 1 where we show the MDS map for a sample of 368 individuals, 165 from the CEU population, and 203 from the YRI population of the HapMap project [The International HapMap Consortium, 2007], now included in the 1,000 genomes project [The 1000 Genomes Project Consortium, 2015], using the allele-sharing distance [Mountain and Cavalli-Sforza, 1997]. Many examples of the use of MDS for detecting population substructure can be found in the genetics literature. Pemberton et al. (2010) used MDS to investigate the population affiliation of the HapMap Phase III individuals. Other examples of population substructure research with MDS are given by Jakobsson et al. (2008), Sabatti (2009), Foulkes (2009), Wang (2010) and Pemberton et al. (2013).

Figure 1:

MDS map for a sample of 165 individual of the CEU and 203 individuals of the YRI population of the 1,000 genomes project, using 48,072 LD-pruned, autosomal SNPs with MAF > 0.40 and non-significant in a HWE exact test.

The examples in the literature and in Figure 1 suggest that observing outlying individuals in an MDS map implies these individuals are different, and therefore potentially stem from a different human populations that may be characterized by different allele frequencies for the variants under study. In this paper we show that this conclusion is not always warranted. Outlying individuals, and even groups of outlying individuals, may very well stem from the same human population and appear separated from their population for a very different reason: they correspond to individuals that are closely related. Figure 1 illustrates this point, where some interrelated individuals of the YRI population appear separated from their alikes. In particular, the map distance of the most outlying individuals with respect to the center of the YRI population is actually larger than the distance between the centers of the YRI and the CEU population. The four most outlying individuals are all involved in two parent-offspring (PO) relationships, whereas the two intermediate outliers are involved in one PO relationship (Pemberton et al. (2010), supplemental data). Relatedness investigations in genetics are mostly based on allele-sharing statistics such as the average number of IBS alleles shared by a pair of individuals over a set of loci, or by estimating the probabilities of sharing 0, 1 or 2 alleles IBD [Cotterman, 1941, Thompson, 1991], known as Cotterman’s coefficients. Plots of these sharing statistics typically show clusters that correspond to unrelated individuals (UN), parent-offspring pairs (PO), full sibs (FS), half sibs (HS), monozygotic twins (MZ), avuncular pairs (AV), first cousins (FC), grandparent-grandchild (GG) or more remote relationships. MDS is not used in such relatedness investigation, but is linked to the allele sharing approach in the sense that a distance matrix can easily be computed from the allele sharing statistics. Moreover, recent studies [Weir and Goudet, 2017, Conomos et al., 2016] have pointed out that relatedness and population substructure are tightly connected topics in the sense that both induce allele sharing.

The remainder of this paper is organized as follows. In Section 2 we provide background on relatedness research and MDS, and we develop theory for using MDS in relatedness research by calculating genetic distances between pairs of individuals. Section 3 explores and validates our methodology with some simulation studies, that assess the effects of sample size, number of variants and minor allele frequency (MAF) on the classification error rate obtained by a discriminant analysis based on MDS results. Section 4 shows applications of the proposed methodology with data from the 1,000 genomes and GCAT projects. A Discussion section (5) finishes the paper.

2 Theory

Relatedness research

We briefly review some fairly standard procedures that are currently used in relatedness research. Relatedness investigations are focused on the extent to which alleles are shared between individuals. Two individuals can share 0, 1 or 2 alleles for any autosomal variant. Alleles can be identical by state (IBS) or identical by descent (IBD). A pair of individuals share IBS alleles if they match irrespective of their provenance; whereas they share IBD alleles only if they come from a common ancestor. In IBS studies, the means and standard deviations of the IBS allele counts over variants [Abecasis et al., 2001], or the proportions of variants sharing 0, 1 and 2 IBS alleles [Rosenberg, 2006] can be plotted. These plots reveal characteristic clusters corresponding to MZ, PO, FS, HS/AV/GG, FC or UN pairs. Alternatively, in an IBD based approach, the probability of sharing 0, 1 or 2 IBD alleles (usually denoted by and ) can be represented in a scatterplot [Nembot-Simo and Graham, 2013]. The Cotterman coefficients can be estimated by the method of moments [Purcell et al., 2007] or by maximum likelihood [Thompson, 1975, Milligan, 2003, Weir et al., 2006]. A good account of the state-of-the-art of IBD relatedness research is given by Weir et al. (2006). In IBD studies, reference values for the standard relationships are available (see Table 1). Related pairs can also be distinguished, albeit at lower resolution, by using the coancestry coefficient defined as θ = k₁/2 + k₂ or the kinship coefficient defined as φ = θ/2. An example of the use of φ is shown by Conomos et al. (2016) where plots of the kinship coefficient and k₀ also uncover clusters with the different family relationships. Galvaán et al. (2017) give an overview of graphics for related ness research. Figure 2 shows a panel plot of some standard graphics used in IBS and IBD studies for all the pairs of individuals from the CEU population of the HapMap project. These plots distinguish UN, PO, FS and second degree pairs. First and second degree relationships for the CEU population were documented by Pemberton et al. (2010) using IBS methods, and confirmed by Kyriazopoulou et al. (2011), who used hidden Markov models and suggested additional third and fourth degree relationships. Stevens et al. (2012) used IBD methods confirming the results of Pemberton et al. (2010).

View this table:

Table 1:

Degree of relationship (R), kinship coefficient (φ), and probability of sharing zero, one or two alleles identical by descent (k₀, k₁, k₂).

Figure 2:

Some standard graphics for relatedness research for the CEU sample of the HapMap project. IBS/IBD statistics were calculated over a set of 26,081 complete, LD-pruned autosomal SNPs with MAF above 0.4, and HWE exact test p-value above 0.05. A: scatterplot of the mean and standard deviation of the number of IBS alleles. B: scatterplot of the fraction of variants sharing 2 (p₂) against the fraction sharing 0 (p₀) IBS alleles. C: scatterplot of against . D: scatterplot of φ against .

Alternatively, a Markov-chain approach with the calculation of likelihood ratios for putative and alternative relationship has been developed by Epstein et al. (2000) (the Relpair program) and by McPeek and Sun (2000) (the Prest-plus program).

Multidimensional scaling and relatedness research

Multidimensional scaling (MDS) is a well-known multivariate method with many applications in genetics, though its potential in relatedness research seems not to have been explored. Several variants of multidimensional scaling exist and a distinction is usually drawn between metric and non-metric multidimensional scaling. Metric MDS is also known as classical scaling or principal coordinate analysis (PCO, Gower [1966]). In metric MDS a map of the individuals is obtained by extracting eigenvalues and eigenvectors from a conveniently transformed distance matrix. The distances in the map approximate the (genetic) distances between the individuals. The method is not in-variant under monotone transformations of the original distances. Non-metric multidimensional scaling obtains the map in an iterative manner by minimizing a stress function, and is invariant under monotone transformations of the distances, as it uses only the rank order of the distances. In this paper we confine ourselves mostly to metric MDS. The theory of MDS is described in most textbooks on multivariate analysis [Mardia et al., 1979] and some books are entirely dedicated to the topic [Cox and Cox, 2001, Borg and Groenen, 2005].

Figure 3 shows metric MDS maps for simulated populations with some related individuals. Bi-allelic marker data was generated by sampling independent genotypes from a multinomial distribution with Hardy-Weinberg frequencies, using uniformly distributed allele frequencies. Some related individuals were then generated from the base set of independent individuals. Figure 3A shows that a single monozygotic twin in the database stands out as an outlier in the map (the two twin-sibs show up as two coinciding points in the first dimension). The existence of two identical points in MDS can also be inferred from the eigenvalues, as two identical points create an additional zero eigenvalue in the MDS output (one zero eigenvalue is always present and is a characteristic of the method as it results from the double-centring of the transformed distance matrix). At first sight the appearance of related individuals as outliers seems contradictory. The MZ pair in panel (A) is shown at a large distance from the rest of the population suggesting that the genetic distance between a twin-sib and the rest of the individuals is much larger than any of the distances between any other pair of unrelated non-twin-sib individuals. This is actually not true. The genetic distances between a twin-sib and the other individuals are in fact of similar order of magnitude as the distances between any pair of unrelated individuals. The reason that the MZ pair separates out in the map at a larger distance is due to the fact that the MZ pair constitutes an outlier in the analysis. The genetic distance between the twin-sibs is zero, and this is an extreme outlier in the off-diagonal entries of the distance matrix, where all distances between unrelated pairs are much larger. MDS is sensitive to the existence of outliers and fits the outlying point much better than the other points. In regression, it is well-known that a single outlying observation can exert leverage on the slope of the regression line and can highly influence the slope [Weisberg, 2005, Chapter 9]. This phenomenon also arises in MDS, where a single outlying observation exerts leverage on the fitted plane. Consequently, the goodness-of-fit of all distances involving the twins are all high, whereas the goodness-of-fit of all other distances is very low, as is revealed in a diagnostic plot of fitted versus observed distances (see supplementary Figure S1). Similarly, most of the maps shown in Figure 3 containing some related individuals are pretty much uninformative on most of the distances, except for those distances that involve the outlying points.

Figure S1:

Fitted versus observed distance for an MDS map of a homogeneous population with one MZ pair. The single pair involving the two individuals of the MZ pair is red, pairs involving one individual of the MZ pair are blue and pairs not involving any individual of the MZ pair are black.

Figure 3:

MDS maps for simulated populations with some related individuals. (F=Father, M=Mother, C=Child, MZ=Monozygotic twin, FS=Full Sib, HS=Half Sibs)

Figure 4:

MDS maps for simulated samples, with 1% genotyping error. 100 pairs of each type of relationship (UN, fifth, fourth, third (FC), second (HS), FS and PO) were generated using an allele frequencies of 0.5, assuming Hardy-Weinberg equilibrium and 5,000 independent bi-allelic variants.

MDS with pairs of individuals

An important difference between the standard allele-sharing methods outlined in the first paragraph of this Section and MDS as discussed above is that the allele-sharing methods plot pairs of individuals, whereas MDS is usually employed to plot the individuals themselves. To bring MDS maps to the same realm as the allele sharing techniques, we propose to compute a distance matrix between all possible pairs of individuals rather than the distance matrix between the individuals. In order to do so, we need to define a distance (or similarity) measure between two pairs of individuals. This can be done in many different ways. The distance measure needs to indicate to what extend two pairs of individuals are similar or not. We use n for the number of individuals, and k for the number of genetic variants. For bi-allelic variants there exist six possible pairs of genotypes whose counts over k variants can be laid out in a triangular array shown in Table 2, where k_ij refers to the number of variants that have i B alleles for one individual, and j B alleles for the other individual.

View this table:

Table 2:

Lower triangular matrix layout with counts for all possible genotype pairs.

Consequently, each pair can be represented by a vector of six counts which can be expressed as a composition by division by its total

The total number of variants is given by k = Σ_i≥j k_ij. For PO pairs this vector has, in theory, a structural zero, k₂₀ = 0, because PO pairs share at least one IBS allele. However, for empirical data k₂₀ = 0 is, with large k, never observed due to the existence of some mutations and genotyping error. In order to measure the distance between two pairs, x and y, we consider the ordinary Euclidean distance (d_e), the Manhattan distance (d_m), the Canberra distance (d_c) and the Aitchison distance [Aitchison, 1986], d_a, where the latter is appealing for being specifically designed for compositional data. These distances are given by where D represents the number of parts of the composition (here D = 6), and g(•) the geometric mean of the composition. The Aitchison distance requires strictly positive elements, which is theoretically a problem for PO pairs. Methodology for dealing with zeros in compositions has been developed (Mart´ın-Fernaändez et al., 2011). Additional metrics could be explored, but are not considered here.

A drawback of the representation of pairs of individuals in an MDS map is that the type of relationship cannot be inferred. Outlying points observed in pairwise MDS maps are likely to correspond to related pairs of individuals, but without additional analysis one does not know whether the outlier is an MZ, PO, FS, HS or other pair. We resolve this by first identifying a subset of approximately unrelated individuals in the database, having a co-ancestry coefficient with other individuals that is below 0.05. We next simulate pairs of related individuals of known relationships from this subset, applying the Mendelian inheritance rules. E.g., PO pairs are simulated by first drawing two parents at random from the unrelated subset. A child is then simulated by drawing one allele at random from both these parents. The process is repeated in order to generate many random PO pairs. FS, HS and pairs of other relationships are simulated in an analogous manner. This process is based on re-sampling the alleles of the individuals, and we will refer to this principle as genetic bootstrapping. The artificially generated data set forms a reference set or training set against which the empirically observed data can be compared. This reference set is conditional on the allele frequencies of the observed sample. The distance matrix of all simulated pairs is calculated and used to construct an MDS reference map. The empirically observed pairs are projected onto this MDS map and their relationship is inferred, according to which simulated type of relationship is most close to the empirical pair. This can be done in a quantitative way by classifying all empirical pairs with linear discriminant analysis, using the simulated pairs as a training set. We briefly outline the calculations below.

MDS [Mardia et al., 1979, Chapter 14] is performed by the spectral de composition in Equation (6) of the m × m scalar product matrix B, which is obtained from the double centred matrix with the squared distances (D) as , where H is an idempotent centring matrix (H = I − (1/m)11^′),

Matrix V contains the eigenvectors of B, and D_λ is a diagonal matrix containing the eigenvalues in non-increasing order of magnitude. Matrix X contains the coordinates of the pairs in an MDS map of dimension r, the dimensionality chosen for the representation of the results.

Let d be the m × 1 vector of supplementary distances, containing the squared distances between an empirical pair and all m pairs in the reference set. The squared transformed distances are obtained as

This transformation amounts to subtracting the average of the squared distance vector and adding half the mean of the squared entries of the original distance matrix calculated for the reference set. The projection of the empirical pairs is carried out by using results that plot supplementary points in an MDS map. In particular, the coordinates of a projected pair, z, are found by [Gower, 1968] where the resulting coordinate vector z is r × 1, and has as many elements as the dimensionality chosen for representation of the results. The validity of (7) and (8) is easily verified by taking a copy of one row of the original distance matrix, and applying the projection equations. The map coordinates of the projected point will then coincide exactly with the coordinates of that point in the initial MDS solution. Equations (7) and (8) exemplify that a supplementary point is found by first centring (and scaling) the observation in the same way the original observations are centered (and scaled), followed by a regression [Graffelman and Aluja-Banet, 2003].

3 Simulations

We first validate the proposed methodology with some simulations. We simulated 5,000 independent genetic bi-allelic variants by sampling from a multinomial distribution under the Hardy-Weinberg assumption, using a minor allele frequency of 0.5 for all variants. Using Mendelian inheritance rules, 100 independent pairs of each type of relationship were simulated. We used two scenarios, one with perfect application of Mendelian inheritance rules, yielding simulated data sets that are free of Mendelian inconsistencies, and one with 1% genotyping error for all three genotypes. Two-dimensional metric MDS maps using four distance measures (Euclidean, Manhattan, Canberra and Aitchison) for the six-part compositions of the pairs are shown in Figure 4. An MDS with the Euclidean distance is seen to separate the main types of relationships until the fourth degree (UN, fourth, third, and second degree, FS and PO), with overlapping clusters for UN, fifth and fourth degree. The FS group is most outlying. The Manhattan metric gives a picture that is similar to the Euclidean distance, but with apparently less discrimination between UN, fifth, fourth, third and second degree relationships. The Canberra metric, which is related to the Manhattan metric, places PO and FS pairs at similar positions in the first dimension, and separates them in the second. Results obtained with the Aitchison metric differ, as this metric strongly sorts out the PO pairs in the first dimension, inverting the order of the PO and FS clusters in this dimension. Scatterplots of the MDS coordinates beyond the second dimension show, in general, little separation of the different relationships, confirming that the solution is essentially two-dimensional. Exceptions are the Canberra and Aitchison distance, which do show clear clustering of relationships in the third dimension (see supplementary figure S2). Classification rates for a linear discriminant analysis, using the first three MDS dimensions, are given for the four metrics in Table 3.

Figure S2:

Scatterplot matrices of the coordinates of the first four dimensions of an MDS solution using Euclidean, Manhattan, Canberra and Aitchison distance. Data obtained by simulation of 5,000 independent variants with a MAF of 0.5 under Hardy-Weinberg equilibrium.

View this table:

Table 3:

Overall and up to and including fourth degree relationships classification rates for linear discriminant analysis using four different metrics, and applying genetic bootstrapping without genotyping error, or with 1% genotyping error.

The non-perfect overall classification can mainly be ascribed to the fifth degree pairs, who are partly misclassified as fourth degree or unrelated. The classification rates up to and including the fourth degree pairs are about five percent higher. The differences between the metrics are small, though the Euclidean distance seems to work best, in particular in the presence of genotyping error. Somewhat better classification rates are obtained if genotyping error is absent. When there is no genotyping error, there are structural zeros in the compositions of PO pairs, which were dealt with using methods outlined by Martín-Fernández et al. (2011). Similar results were obtained by using quadratic discriminant analysis (not shown).

This simulation concerns a relatively ideal dataset with independent variants and maximally polymorphic variants. For empirical data sets, the independence of the variants can be approximately achieved by LD pruning variants. In practice, many variants have a low MAF. We investigated the effect of the MAF on the discriminatory power of the different metrics, by simulating variants with different MAF. Figure 5 shows how the classification rate varies as a function of the MAF and the number of bootstrap samples for the considered metrics.

Figure 5:

Classification rate up to and including the fourth degree as a function of the minor allele frequency (MAF) and the number of maximally polymorphic variants obtained by linear discriminant analysis using four different metrics, and allowing for 1% genotyping error.

Figure 5 shows a worse classification rate for the Aitchison metric, whereas Euclidean, Manhattan and Canberra metrics achieve the best classification with high MAF variants. The observed classification rates are similar for a training set with 20 and 100 simulated pairs. The classification rate depends on the number of maximally polymorphic variants, and suggests that 25,000 variants are sufficient to almost perfectly classify PO, FS, second, third and fourth degree relationships. In the light of the simulation results, we will use one hundred bootstrap samples and the Euclidean distance for the examples presented in Section 4. In the remainder of the paper, we will refer to our method, which combines genetic bootstrapping, IBS allele sharing, MDS and LDA, as IBS-MDS.

4 Applications

In this Section we apply the proposed MDS of pairwise distances to two genomic data sets. We use the CEU population of the HapMap project, now available as part of the 1,000 genomes project [The 1000 Genomes Project Consortium, 2015] (www.internationalgenome.org), whose family relationships have been analysed in detail by Pemberton et al. (2010), Kyriazopoulou et al. (2011), Huff et al. (2011) and Stevens et al. (2011, 2012). We also present a relatedness study of the population based GCAT Genomes for Life project (a cohort study of the genomes of Catalonia, www.genomesforlife.com, Obo´n-Santacana et al. [2018]). For both projects, we used Plink 1.90 [Purcell et al., 2007] for data manipulation and filtering, and R [R Core Team, 2014] for MDS and discriminant analysis.

4.1 The CEU sample

We detail the analysis of the CEU panel. HapMap phase II and III variants of the CEU panel were filtered according to missingness (only variants genotyped for all individuals were used), MAF (> 0.40) and Hardy-Weinberg equilibrium test result (using the exact mid p-value > 0.05 [Graffelman and Moreno, 2013]). Variants were LD-pruned with Plink’s independent pairwise option (indep-pairwise 50 5 0.2). The final data set contained 31,370 autosomal variants. The CEU panel consists of 165 individuals, mainly PO trios, giving 13,530 possible pairs of individuals. The ordinary MDS map of the individuals of this panel is shown in Figure 6. Documented related individuals are connected by lines. Figure 6 shows an outlying FS pair in the first dimension, that pulls the related families along this axis.

Figure 6:

MDS map for a sample of 165 individuals of the CEU population, with outlying FS pair in the first dimension

Figure 7 shows the MDS map of the pairwise approach, using the Euclidean distance measure. The convex hulls delimit the cloud of the positions of the simulated UN, fifth, fourth, third, second, FS and PO pairs (using 100 pairs of each). The classification rate of the simulated data was as high as 99.6%. Points with open circles represent the empirical pairs which were projected onto the map using Equations (7) and (8), and are coloured according to their predicted relationship. The map suggests the CEU sample has 96 PO pairs, one FS pair, two second degree pairs, one third degree pair, five fourth degree pairs and many fifth degree pairs that merge with UN pairs. The classification of the empirical pairs by linear discriminant analysis confirmed the 96 PO and the single FS pair relationships described by Pemberton et al. (2010) (results not shown), as well as the additional FC pair reported by Kyriazopoulou et al. (2011), with a posterior probability of one for all these pairs. Third and fourth degree relationships uncovered by Kyriazopoulou et al. (2011) are reported in Table 4, together with the posterior probabilities obtained in our IBS-MDS analysis. 193 pairs were classified as fifth degree relationship pairs, of which 66 had a posterior probability above 0.95. We do not report all these pairs because given the overlap with the UN cluster and the poorer classification rate of the fifth degree observed in the simulations.

View this table:

Table 4:

Predicted relationships of third (3rd) and fourth (4th) degree pairs of the CEU sample of the HapMap project reported by Kyriazopoulou et al. (2011) and their posterior probabilities according to IBS-MDS. Coding and abbreviations used: sex 1 = male, 2 = female; a hyphen (-) indicates the corresponding pair is not annotated or regarded unknown by the corresponding authors; FCOR: first cousin once removed; Pem.: Pemberton et al. (2010); Kyr.: Kyriazopoulou et al. (2011); Ste.: Stevens et al. (2012); Huf.: Huff et al. (2011).

Figure 7:

MDS map of 13,530 pairs of individuals of the CEU population. Convex hulls delimit the region of the pairs obtained by genetic bootstrapping.

Our results confirm a third degree pair (pair 1 in Table 4) reported by Kyriazopoulou et al. (2011). We also confirm four of the fourth degree pairs reported by the latter authors (pairs 2-5 in Table 4). However, we also observed considerably incongruence of our results with those of the latter authors. We found an FC pair to be classified as fourth degree (pair 6) by our method and 11 reported fourth degree pairs were classified as fifth degree. One of these (pair 16) had a high probability of being even more remotely related. We also compared results with those published by Huff et al. (2011), who estimate recent shared ancestry (ERSA) by using IBD segments. Our work confirms three fourth degree pairs and one fifth degree pair report by the latter authors, though we found two additional fourth degree pairs which are not confirmed by Huff et al. (2011).

4.2 GCAT samples

We use samples from the GCAT Genomes for life project, a cohort study of the genomes of Catalonia (www.genomesforlife.com). GCAT is a prospective cohort study that includes 17,924 participants (40-65 years, release August 2017) recruited from the general population of Catalonia, a Mediterranean region in the northeast of Spain. Participants are mainly part of the Blood and Tissue Bank (BST), a public agency of the Catalan Department of Health. Detailed information regarding the GCAT project is described in Obo´n-Santacana et al. (2018). We study relatedness on 5,075 GCAT Spanish participants from Caucasian origin and 736,223 SNPs that passed quality control [Sumoy et al., 2017]. Inferred relatives of first and second degree were confirmed by the BST public agency, for pairs sharing one surname (PO, second degree pairs) or two surnames (FS pairs), respecting the privacy of the participants. According to the same filtering procedures used in the CEU samples, 26,006 SNPs (MAF > 0.40, LD-pruned, HWE exact mid p-value > 0.05, and missing call rate 0) were considered for relatedness analysis. A MDS map of the individuals, using the allele sharing distance, is shown in Figure 8A.

Figure 8:

MDS maps of 5,075 individuals of the GCAT study.

The first dimension shows an outlying triple of sibs (pairs 48, 49 and 50 in Table S1, Figure 8A). When considering second and third dimensions, this trio is masked, and these axes reveal other FS pairs (Figure 8B). Figure 9 shows the IBS-MDS approach for pairs of individuals using the Euclidean distance. Our IBS-MDS method detects four PO, 70 FS and 12 second degree pairs, and reveals 68 and 54 third and fourth degree pairs (See supplementary Table S1). The IBS-MDS-plot shows a large variability in the FS group, where many pairs fall outside their respective simulated convex hull. We suggest this excess variability is, at least in part, due to the existence of some three-quarter siblings in the database (e.g. half-sibs where two of the three parents have an FS or PO relationship). The Cotterman coefficients for a three-quarter sibling relationship are readily deduced to be k₀ = 3/8, k₁ = 1/2 and k₂ = 1/8, and their co-ancestry coefficient is θ = 3/8, or, equivalently, their kinship coefficient is 3/16. They can therefore be expected to fall in-between the

View this table:

Table S1:

Predicted relationships of first, second, third and fourth degree pairs of the GCAT cohort and their probabilities according to IBS-MDS. k0, k1, k2: Cotterman’s coefficients, θ: coancestry coefficient.

Figure 9:

MDS map of pairs of individuals of the GCAT study. Convex hulls delimit the region of the pairs obtained by genetic bootstrapping.

FS and 2nd degree clusters, as is shown for some labeled pairs in Figure 9. We therefore repeated IBS-MDS with a new training set where this type of relationship had been included. The resulting map is shown in Figure 10 and shows that one outlying second degree pair (77) and several initially outlying FS pairs (73, 63, 53, 43) are now indeed classified as three-quarter siblings. The posterior probabilities for these classifications are given in supplementary Table S2.

View this table:

Table S2:

Potential three-quarter siblings of the GCAT cohort and their probabilities according to IBS-MDS. k0, k1, k2: Cotterman’s coefficients, θ: coancestry coefficient.

Figure 10:

MDS map of pairs of individuals of the GCAT study, trained with an additional category for three-quarter siblings (3/4S). Convex hulls delimit the region of the pairs obtained by genetic bootstrapping.

5 Conclusions and Discussion

Multidimensional scaling is a widely used tool in genetic analysis, and is often used to check data for homogeneity and for substructure investigation. If there exists population substructure, MDS maps typically reveal clusters corresponding to the different populations. However, outlying observations or groups of outlying observations do not necessarily prove the existence of population substructure. Individuals may separate out in MDS because they are related but actually stem from the same population. We conclude that population substructure cannot be ruled out or confirmed only by looking at MDS maps. We consider it necessary to look for additional zero eigenvalues and to inspect diagnostic plots like Figure S1 to identify influential observations, and to perform a relatedness study before inferring the existence of any population substructure.

The proposed method for classifying pairs using MDS and discriminant analysis is seen to perform well with both simulated and empirical data. The sampling of artificially related pairs from the observed data requires a considerable number of approximately unrelated individuals to be present. We therefore suggest the method to be used for large samples, where such a substantial subset of unrelated individuals can be identified. This is probably not an obstacle for the use of our method, as increasingly large samples are being used in epidemiological genomics. The sampling of artificial pairs from the observed data respects the allele frequency distribution of the original data, and provide reference areas for the different relationships given the allele frequencies of the observed data.

If MDS is used with the primary purpose of detecting substructure, we suggest first to remove one individual from each related pair, and to apply MDS to the filtered dataset. If not, related individuals can upset the proper detection of substructure. Incidental outliers in MDS maps are likely to correspond to individuals that have a close family relationship. Figure 2 shows a set of standard patterns that are relatively easy to recognize in genetic data analysis. However, if more family relationships are present in the database, the patterns obtained by MDS are more difficult to interpret.

The proposed IBS-MDS method can be used in an exploratory and iterative way. For some data sets, a first analysis may reveal that certain relationship categories are obviously absent in the data. In such cases, IBS-MDS can be re-run with a training set without these categories, in order to focus the analysis on those relationships that are of relevance for the data under study. Indeed, we tacitly omitted monozygotic twins (MZ) from our training set, as MZ pairs (or duplicates) are easily recognized, and this way the analysis is focused on more remote relationships. The GCAT example illustrates this iterative use, as three-quarter siblings were added to the training set in second instance to improve the analysis.

In standard applications of MDS to bi-allelic genetic variants, genetic data is often coded in (0,1,2) format and Euclidean distances between individuals are calculated in this coding, and represented by MDS in low-dimensional space. The units on the axes of such maps are not very well interpretable, because the Euclidean distances are calculated over a large number of genetic variants. We propose to scale the Euclidean distance matrix that is entered into MDS by a factor of , where k is the number of variants. This scaling will not affect the configuration of the points in the map, and neither its goodness-of-fit, but it will render the axes and distances of the map more interpretable. Two individuals that are a unit distance apart in the map now differ on average by one allele per locus. This will typically scale all the coordinates in the MDS map within the (−1,1) interval, as the maximum difference in number of alleles between two individuals is two. We note that the Euclidean distance is not the same as the allele-sharing distance we used in this paper. The allele-sharing distance is the sum of the absolute value of the difference in numbers of alleles over all loci. In multivariate analysis this metric is better known as the Manhattan distance, the city-block distance, or the taxicab metric. Division of the Manhattan distance by the number of polymorphisms gives the allele-sharing distance. For the sake of interpretability, it is convenient to scale the Manhattan distance matrix by 1/k prior to using it in MDS. A pair of individuals two units apart will then differ by two alleles at all loci. The suggested scalings facilitate interpretation of MDS results, but are hampered by the fact that the map is a low-dimensional approximation to the original distance matrix, and therefore the property will not hold exactly, but only approximately so. For distance matrices that have the Euclidean property, and that therefore only have non-negative eigenvalues [Mardia et al., 1979, Chapter 14], the map distances will approximate the true distances from below, and one can say that two individuals that are, for instance, 0.5 units apart, will differ on average by at least 0.5 minor alleles per locus, or can be expected to have at least one different allele for every pair of loci. We have used Manhattan distances scaled by 1/k in all MDS maps of individuals used this paper (Figures 1, 3, 6 and 8).

We note that the family relationships detected in Figure 3 would have gone largely unnoticed if the data is analyzed by non-metric multidimensional scaling of the between-individual distance matrix. In non-metric MDS only the rank order of the distances is used, not the value of the distances itself. In non-metric MDS related individuals do therefore generally not appear as outliers. This makes the non-metric approach less attractive at the individual level, though it does not rule out the non-metric approach for analysing pairs.

We recommend the use of discriminant analysis in allele-sharing studies as employed in this paper. The posterior probabilities of the different relationships give a quantitative criterion for deciding upon which relationship is most likely for a given pair of individuals. In allele sharing studies this decision is mostly made graphically by inspecting a (p₀, p₂) plot in IBS studies, or a (k₀, k₂) plot in IBD studies. We note that these posterior probabilities differ from those used in a standard discriminant analysis, in the sense that they are affected by additional uncertainty generated by using a training set obtained by genetic bootstrapping of the observed data.

The GCAT sample shows, for almost all relationship categories, larger variability in the MDS coordinates than would be expected under strict Mendelian sampling of alleles from unrelated individuals. This excess variability can, at least in part, be explained by the presence of additional relatedness between (unobserved) close relatives of the individuals in the database. This leads to increased autozygosity, which is a characteristic of more endogamous populations. The occurrence of three-quarter siblings is just a particular instance of this phenomenon. Consequently, the degree of relatedness of two individuals tends to become a continuous variable, which is increasingly hard to discretize into the standard relationship categories.

Several scholars have recently distinguished between remote or background relatedness (population substructure) and more recent relatedness (family relationships), describing methods that try to address both [Conomos et al., 2016, Weir and Goudet, 2017]. We have shown in this paper that MDS can be used to detect both recent relatedness and substructure.

Acknowledgments

This work was partially supported by grants SGR551 and SGR1269 from the Agència de Gestio´ d’Ajuts Universitaris i de Recerca (AGAUR) of the Generalitat de Catalunya, by grant MTM2015-65016-C2-2-R and ADE 10/00026 (MINECO/FEDER) of the Spanish Ministry of Economy and Competitiveness and European Regional Development Fund, and by grant R01 GM075091 from the United States National Institutes of Health. R. de Cid was supported by the Ramon y Cajal action (RYC-2011-07822).

References

↵
G. R. Abecasis, S. S. Cherny, C. W. O. C., and L. R. Cardon. GRR: Graphical representation of relationship errors. Bioinformatics, 17(8):742–743, 2001.
OpenUrl CrossRef PubMed Web of Science
↵
J. Aitchison. The Statistical Analysis of Compositional Data. The Blackburn press, Caldwell, NJ, 1986. 2003 printing.
↵
I. Borg and P. J. F. Groenen. Modern multidimensional scaling. Springer, New York, second edition, 2005.
↵
M. P. Conomos, A. P. Reiner, B. S. Weir, and T. A. Thornton. Model-free estimation of recent genetic relatedness. American Journal of Human Genetics, 98(1):127–148, 2016. doi: 10.1016/j.ajhg.2015.11.022.
OpenUrl CrossRef PubMed
↵
C. W. Cotterman. Relatives and human genetic analysis. The Scientific Monthly, 53:227–234, 1941.
OpenUrl
↵
T. F. Cox and M. A. A. Cox. Multidimensional Scaling. Chapman & Hall, second edition, 2001.
↵
M. Epstein, W. Duren, and B. M. Improved inference of relationship for pairs of individuals. American Journal of Human Genetics, 67:1219–1231, 2000.
OpenUrl CrossRef PubMed Web of Science
↵
A. S. Foulkes. Applied statistical genetics with R. Springer, New York, 2009.
↵
I. Galván-Femenía, J. Graffelman, and C. Barceló-i Vidal. Graphics for relatedness research. Molecular Ecology Resources, 17(6):1271–1282, 2017. doi: https://doi.org/10.1111/1755-0998.12674.
OpenUrl
↵
J. C. Gower. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53(3):325–338, 1966.
OpenUrl CrossRef Web of Science
↵
J. C. Gower. Adding a point to vector diagrams in multivariate analysis. Biometrika, 55:582–585, 1968.
OpenUrl CrossRef Web of Science
↵
J. Graffelman and T. Aluja-Banet. Optimal representation of supplementary variables in biplots from principal component analysis and correspondence analysis. Biometrical Journal, 45(4):491–509, 2003.
OpenUrl
↵
J. Graffelman and V. Moreno. The mid p-value in exact tests for Hardy-Weinberg equilibrium. Statistical Applications in Genetics and Molecular Biology, 12(4):433–448, 2013.
OpenUrl CrossRef
↵
C. Huff, D. Witherspoon, T. Simonson, J. Xing, W. Watkins, Y. Zhang, T. Tuohy, D. Neklason, R. Burt, S. Guthery, S. Woodward, and L. Jorde. Maximum-likelihood estimation of recent shared ancestry (ersa). Genome research, 21(5):768–774, 2011. doi: 10.1101/gr.115972.110.
OpenUrl Abstract/FREE Full Text
↵
M. Jakobsson, S. W. Scholz, P. Scheet, J. R. Gibbs, J. M. VanLiere, H. Fung, Z. Szpiech, J. Degnan, K. Wang, R. Guerreiro, J. Bras, J. Schymick, D. Hernandez, B. Traynor, J. Simon-Sanchez, M. Matarin, A. Britton, J. van de Leemput, I. Rafferty, M. Bucan, H. Cann, J. Hardy, N. Rosenberg, and A. Singleton. Genotype, haplotype and copy-number variation in worldwide human populations. Nature, 451:998–1003, 2008. doi: http://dx.doi.org/10.1038/nature06742.
OpenUrl CrossRef PubMed Web of Science
↵
S. Kyriazopoulou-Panagiotopoulou, D. Kashef Haghighi, S. J. Aerni, A. Sundquist, S. Bercovici, and S. Batzoglou. Reconstruction of genealogical relationships with applications to Phase III of HapMap. Bioinformatics, 27(13):i333–41, 2011. doi: 10.1093/bioinformatics/btr243.
OpenUrl CrossRef
↵
K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate analysis. Academic press London, 1979.
↵
1. V. Pawlowsky-Glahn and
2. A. Buccianti
J. A. Martín-Fernández, J. Palarea-Albaladejo, and R. A. Olea. Dealing with zeros. In V. Pawlowsky-Glahn and A. Buccianti, editors, Compositional Data Analysis: Theory and Applications, pages 43–58, Chichester (UK), 2011. John Wiley & Sons. ISBN 978-0-470-71135-4.
↵
M. S. McPeek and L. Sun. Statistical tests for detection of misspecified relationships by use of genome-screen data. American Journal of Human Genetics, 66(3):1076–1094, 2000.
OpenUrl CrossRef PubMed Web of Science
↵
B. G. Milligan. Maximum-likelihood estimation of relatedness. Genetics, 163 (3):1153–1167, 2003.
OpenUrl Abstract/FREE Full Text
↵
J. L. Mountain and L. L. Cavalli-Sforza. Multilocus genotypes, a tree of individuals, and human evolutionary history. American Journal of Human Genetics, 61:705–718, 1997.
OpenUrl CrossRef PubMed Web of Science
↵
A. Nembot-Simo and B. Graham, J. amd McNeney. CrypticIBDcheck: an R package for checking cryptic relatedness in nominally unrelated individuals. Source Code for Biology and Medicine, 8(5), 2013.
↵
M. Obón-Santacana, M. Vilardell, A. Carreras, X. Duran, J. Velasco, I. Galván-Femenía, T. Alonso, L. Puig, L. Sumoy, E. J. Duell, M. Peru-cho, V. Moreno, and de Cid R. Cohort profile: Gcat genomes for life. a prospective cohort study of the genomes of catalonia. BMJ Open, 2018. To appear.
↵
T. J. Pemberton, C. Wang, J. Z. Li, and N. A. Rosenberg. Inference of unexpected genetic relatedness among individuals in HapMap Phase III. American Journal of Human Genetics, 87:457–464, 2010.
OpenUrl CrossRef PubMed Web of Science
↵
T. J. Pemberton, M. DeGiorgio, and N. A. Rosenberg. Population structure in a comprehensive genomic data set on human microsatellite variation. G3: Genes, Genomes, Genetics, 3(5):891–907, 2013. doi: 10.1534/g3.113.005728. URL http://www.g3journal.org/content/3/5/891.
OpenUrl Abstract/FREE Full Text
↵
S. Purcell, B. Neale, K. Todd-Brown, L. Thomas, M. A. R. Ferreira, D. Ben-der, J. Maller, P. Sklar, P. I. W. de Bakker, M. J. Daly, and P. C. Sham. Plink: A toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81(3):559–575, 2007. URL http://pngu.mgh.harvard.edu/purcell/plink/.
OpenUrl CrossRef PubMed
↵
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2014. URL http://www.R-project.org/.
↵
N. A. Rosenberg. Standardized subsets of the HGDP-CEPH Human Genome Diversity cell line Panel, accounting for atypical and duplicated samples and pairs of close relatives. American Journal of Human Genetics, 70(6): 841–847, 2006. doi: 10.1111/j.1469-1809.2006.00285.x.
OpenUrl CrossRef PubMed Web of Science
↵
C. Sabatti, S. K. Service, A. Hartikainen, A. Pouta, S. Ripatti, J. Brodsky, C. G. Jones, N. A. Zaitlen, T. Varilo, M. Kaakinen, U. Sovio, A. Ruoko-nen, J. Laitinen, E. Jakkula, L. Coin, C. Hoggart, A. Collins, H. Turunen, S. Gabriel, P. Elliot, M. I. McCarthy, M. J. Daly, M. Järvelin, N. B. Freimer, and L. Peltonen. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nature genetics, 41(1):35–46, 2009. doi: 10.1038/ng.271.
OpenUrl CrossRef PubMed Web of Science
↵
E. L. Stevens, G. Heckenberg, E. D. O. Roberson, J. D. Baugher, T. J. Downey, and J. Pevsner. Inference of relationships in population data using indentity-by-descent and identity-by-state. PLoS Genetics, 7(9):1–15, 2011.
OpenUrl
↵
E. L. Stevens, J. D. Baugher, M. D. Shirley, L. P. Frelin, and J. Pevsner. Unexpected relationships and inbreeding in HapMap Phase III populations. PLoS ONE, 7(11):e49575, 2012. doi: 10.1371/journal.pone.0049575.
OpenUrl CrossRef PubMed
↵
L. Sumoy, D. Piñeyro, I. Galván-Femenía, R. Pluvionet, L. Ramos, and R. de Cid. Joint calling description and qc summary. Technical report, GCAT Project, IMPPC, Badalona (Barcelona, Spain), 2017. URL http://www.genomesforlife.com/investigadors.
↵
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526:68–74, 2015.
OpenUrl CrossRef PubMed
↵
The International HapMap Consortium. A second generation human haplo-type map of over 3.1 million SNPs. Nature, 449:851–861, 2007.
OpenUrl CrossRef PubMed Web of Science
↵
E. A. Thompson. Estimation of pairwise relationships. Annals of Human Genetics, 39:173–188, 1975.
OpenUrl CrossRef PubMed Web of Science
↵
1. C. R. Rao and
2. R. Chakraborty
E. A. Thompson. Estimation of relationships from genetic data. In C. R. Rao and R. Chakraborty, editors, Handbook of Statistics, volume 8, pages 255–269, Amsterdam, 1991. Elsevier Science.
OpenUrl
↵
C. Wang, Z. Szpiech, J. Degnan, M. Jakobsson, T. Pemberton, J. Hardy, A. Singleton, and N. Rosenberg. Comparing spatial maps of human population-genetic variation using procrustes analysis. Statistical applications in genetics and molecular biology, 9, 2010. doi: 10.2202/1544-6115.1493. Article 13.
OpenUrl CrossRef
↵
B. S. Weir and J. Goudet. A unified characterization of population structure and relatedness. Genetics, 2017. ISSN 0016-6731. doi: 10.1534/genetics.116.198424. URL http://www.genetics.org/content/early/2017/05/26/genetics.116.198424.
OpenUrl Abstract/FREE Full Text
↵
B. S. Weir, A. D. Anderson, and A. B. Hepler. Genetic relatedness analysis: modern data and new challenges. Nature Reviews Genetics, 7(10):771–780, 2006.
OpenUrl CrossRef PubMed Web of Science
↵
S. Weisberg. Applied Linear Regression. John Wiley & Sons, New Jersey, third edition, 2005.

View the discussion thread.

Posted April 09, 2018.

Download PDF

Citation Tools

Subject Area

Genetics

Subject Areas

All Articles

Animal Behavior and Cognition (5204)
Biochemistry (11718)
Bioengineering (8724)
Bioinformatics (29132)
Biophysics (14937)
Cancer Biology (12052)
Cell Biology (17362)
Clinical Trials (138)
Developmental Biology (9407)
Ecology (14146)
Epidemiology (2067)
Evolutionary Biology (18270)
Genetics (12223)
Genomics (16768)
Immunology (11844)
Microbiology (28016)
Molecular Biology (11560)
Neuroscience (60841)
Paleontology (450)
Pathology (1864)
Pharmacology and Toxicology (3231)
Physiology (4940)
Plant Biology (10405)
Scientific Communication and Education (1681)
Synthetic Biology (2878)
Systems Biology (7333)
Zoology (1642)

[1] ↵
G. R. Abecasis, S. S. Cherny, C. W. O. C., and L. R. Cardon. GRR: Graphical representation of relationship errors. Bioinformatics, 17(8):742–743, 2001.
OpenUrl CrossRef PubMed Web of Science

[2] ↵
J. Aitchison. The Statistical Analysis of Compositional Data. The Blackburn press, Caldwell, NJ, 1986. 2003 printing.

[3] ↵
I. Borg and P. J. F. Groenen. Modern multidimensional scaling. Springer, New York, second edition, 2005.

[4] ↵
M. P. Conomos, A. P. Reiner, B. S. Weir, and T. A. Thornton. Model-free estimation of recent genetic relatedness. American Journal of Human Genetics, 98(1):127–148, 2016. doi: 10.1016/j.ajhg.2015.11.022.
OpenUrl CrossRef PubMed

[5] ↵
C. W. Cotterman. Relatives and human genetic analysis. The Scientific Monthly, 53:227–234, 1941.
OpenUrl

[6] ↵
T. F. Cox and M. A. A. Cox. Multidimensional Scaling. Chapman & Hall, second edition, 2001.

[7] ↵
M. Epstein, W. Duren, and B. M. Improved inference of relationship for pairs of individuals. American Journal of Human Genetics, 67:1219–1231, 2000.
OpenUrl CrossRef PubMed Web of Science

[8] ↵
A. S. Foulkes. Applied statistical genetics with R. Springer, New York, 2009.

[9] ↵
I. Galván-Femenía, J. Graffelman, and C. Barceló-i Vidal. Graphics for relatedness research. Molecular Ecology Resources, 17(6):1271–1282, 2017. doi: https://doi.org/10.1111/1755-0998.12674.
OpenUrl

[10] ↵
J. C. Gower. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53(3):325–338, 1966.
OpenUrl CrossRef Web of Science

[11] ↵
J. C. Gower. Adding a point to vector diagrams in multivariate analysis. Biometrika, 55:582–585, 1968.
OpenUrl CrossRef Web of Science

[12] ↵
J. Graffelman and T. Aluja-Banet. Optimal representation of supplementary variables in biplots from principal component analysis and correspondence analysis. Biometrical Journal, 45(4):491–509, 2003.
OpenUrl

[13] ↵
J. Graffelman and V. Moreno. The mid p-value in exact tests for Hardy-Weinberg equilibrium. Statistical Applications in Genetics and Molecular Biology, 12(4):433–448, 2013.
OpenUrl CrossRef

[14] ↵
C. Huff, D. Witherspoon, T. Simonson, J. Xing, W. Watkins, Y. Zhang, T. Tuohy, D. Neklason, R. Burt, S. Guthery, S. Woodward, and L. Jorde. Maximum-likelihood estimation of recent shared ancestry (ersa). Genome research, 21(5):768–774, 2011. doi: 10.1101/gr.115972.110.
OpenUrl Abstract/FREE Full Text

[15] ↵
M. Jakobsson, S. W. Scholz, P. Scheet, J. R. Gibbs, J. M. VanLiere, H. Fung, Z. Szpiech, J. Degnan, K. Wang, R. Guerreiro, J. Bras, J. Schymick, D. Hernandez, B. Traynor, J. Simon-Sanchez, M. Matarin, A. Britton, J. van de Leemput, I. Rafferty, M. Bucan, H. Cann, J. Hardy, N. Rosenberg, and A. Singleton. Genotype, haplotype and copy-number variation in worldwide human populations. Nature, 451:998–1003, 2008. doi: http://dx.doi.org/10.1038/nature06742.
OpenUrl CrossRef PubMed Web of Science

[16] ↵
S. Kyriazopoulou-Panagiotopoulou, D. Kashef Haghighi, S. J. Aerni, A. Sundquist, S. Bercovici, and S. Batzoglou. Reconstruction of genealogical relationships with applications to Phase III of HapMap. Bioinformatics, 27(13):i333–41, 2011. doi: 10.1093/bioinformatics/btr243.
OpenUrl CrossRef

[17] ↵
K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate analysis. Academic press London, 1979.

[18] ↵
V. Pawlowsky-Glahn and
A. Buccianti
J. A. Martín-Fernández, J. Palarea-Albaladejo, and R. A. Olea. Dealing with zeros. In V. Pawlowsky-Glahn and A. Buccianti, editors, Compositional Data Analysis: Theory and Applications, pages 43–58, Chichester (UK), 2011. John Wiley & Sons. ISBN 978-0-470-71135-4.

[19] V. Pawlowsky-Glahn and

[20] A. Buccianti

[21] ↵
M. S. McPeek and L. Sun. Statistical tests for detection of misspecified relationships by use of genome-screen data. American Journal of Human Genetics, 66(3):1076–1094, 2000.
OpenUrl CrossRef PubMed Web of Science

[22] ↵
B. G. Milligan. Maximum-likelihood estimation of relatedness. Genetics, 163 (3):1153–1167, 2003.
OpenUrl Abstract/FREE Full Text

[23] ↵
J. L. Mountain and L. L. Cavalli-Sforza. Multilocus genotypes, a tree of individuals, and human evolutionary history. American Journal of Human Genetics, 61:705–718, 1997.
OpenUrl CrossRef PubMed Web of Science

[24] ↵
A. Nembot-Simo and B. Graham, J. amd McNeney. CrypticIBDcheck: an R package for checking cryptic relatedness in nominally unrelated individuals. Source Code for Biology and Medicine, 8(5), 2013.

[25] ↵
M. Obón-Santacana, M. Vilardell, A. Carreras, X. Duran, J. Velasco, I. Galván-Femenía, T. Alonso, L. Puig, L. Sumoy, E. J. Duell, M. Peru-cho, V. Moreno, and de Cid R. Cohort profile: Gcat genomes for life. a prospective cohort study of the genomes of catalonia. BMJ Open, 2018. To appear.

[26] ↵
T. J. Pemberton, C. Wang, J. Z. Li, and N. A. Rosenberg. Inference of unexpected genetic relatedness among individuals in HapMap Phase III. American Journal of Human Genetics, 87:457–464, 2010.
OpenUrl CrossRef PubMed Web of Science

[27] ↵
T. J. Pemberton, M. DeGiorgio, and N. A. Rosenberg. Population structure in a comprehensive genomic data set on human microsatellite variation. G3: Genes, Genomes, Genetics, 3(5):891–907, 2013. doi: 10.1534/g3.113.005728. URL http://www.g3journal.org/content/3/5/891.
OpenUrl Abstract/FREE Full Text

[28] ↵
S. Purcell, B. Neale, K. Todd-Brown, L. Thomas, M. A. R. Ferreira, D. Ben-der, J. Maller, P. Sklar, P. I. W. de Bakker, M. J. Daly, and P. C. Sham. Plink: A toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81(3):559–575, 2007. URL http://pngu.mgh.harvard.edu/purcell/plink/.
OpenUrl CrossRef PubMed

[29] ↵
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2014. URL http://www.R-project.org/.

[30] ↵
N. A. Rosenberg. Standardized subsets of the HGDP-CEPH Human Genome Diversity cell line Panel, accounting for atypical and duplicated samples and pairs of close relatives. American Journal of Human Genetics, 70(6): 841–847, 2006. doi: 10.1111/j.1469-1809.2006.00285.x.
OpenUrl CrossRef PubMed Web of Science

[31] ↵
C. Sabatti, S. K. Service, A. Hartikainen, A. Pouta, S. Ripatti, J. Brodsky, C. G. Jones, N. A. Zaitlen, T. Varilo, M. Kaakinen, U. Sovio, A. Ruoko-nen, J. Laitinen, E. Jakkula, L. Coin, C. Hoggart, A. Collins, H. Turunen, S. Gabriel, P. Elliot, M. I. McCarthy, M. J. Daly, M. Järvelin, N. B. Freimer, and L. Peltonen. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nature genetics, 41(1):35–46, 2009. doi: 10.1038/ng.271.
OpenUrl CrossRef PubMed Web of Science

[32] ↵
E. L. Stevens, G. Heckenberg, E. D. O. Roberson, J. D. Baugher, T. J. Downey, and J. Pevsner. Inference of relationships in population data using indentity-by-descent and identity-by-state. PLoS Genetics, 7(9):1–15, 2011.
OpenUrl

[33] ↵
E. L. Stevens, J. D. Baugher, M. D. Shirley, L. P. Frelin, and J. Pevsner. Unexpected relationships and inbreeding in HapMap Phase III populations. PLoS ONE, 7(11):e49575, 2012. doi: 10.1371/journal.pone.0049575.
OpenUrl CrossRef PubMed

[34] ↵
L. Sumoy, D. Piñeyro, I. Galván-Femenía, R. Pluvionet, L. Ramos, and R. de Cid. Joint calling description and qc summary. Technical report, GCAT Project, IMPPC, Badalona (Barcelona, Spain), 2017. URL http://www.genomesforlife.com/investigadors.

[35] ↵
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526:68–74, 2015.
OpenUrl CrossRef PubMed

[36] ↵
The International HapMap Consortium. A second generation human haplo-type map of over 3.1 million SNPs. Nature, 449:851–861, 2007.
OpenUrl CrossRef PubMed Web of Science

[37] ↵
E. A. Thompson. Estimation of pairwise relationships. Annals of Human Genetics, 39:173–188, 1975.
OpenUrl CrossRef PubMed Web of Science

[38] ↵
C. R. Rao and
R. Chakraborty
E. A. Thompson. Estimation of relationships from genetic data. In C. R. Rao and R. Chakraborty, editors, Handbook of Statistics, volume 8, pages 255–269, Amsterdam, 1991. Elsevier Science.
OpenUrl

[39] C. R. Rao and

[40] R. Chakraborty

[41] ↵
C. Wang, Z. Szpiech, J. Degnan, M. Jakobsson, T. Pemberton, J. Hardy, A. Singleton, and N. Rosenberg. Comparing spatial maps of human population-genetic variation using procrustes analysis. Statistical applications in genetics and molecular biology, 9, 2010. doi: 10.2202/1544-6115.1493. Article 13.
OpenUrl CrossRef

[42] ↵
B. S. Weir and J. Goudet. A unified characterization of population structure and relatedness. Genetics, 2017. ISSN 0016-6731. doi: 10.1534/genetics.116.198424. URL http://www.genetics.org/content/early/2017/05/26/genetics.116.198424.
OpenUrl Abstract/FREE Full Text

[43] ↵
B. S. Weir, A. D. Anderson, and A. B. Hepler. Genetic relatedness analysis: modern data and new challenges. Nature Reviews Genetics, 7(10):771–780, 2006.
OpenUrl CrossRef PubMed Web of Science

[44] ↵
S. Weisberg. Applied Linear Regression. John Wiley & Sons, New Jersey, third edition, 2005.

Multidimensional Scaling and Relatedness Research

Abstract

1 Introduction

2 Theory

Relatedness research

Multidimensional scaling and relatedness research

MDS with pairs of individuals

3 Simulations

4 Applications

4.1 The CEU sample

4.2 GCAT samples

5 Conclusions and Discussion

Acknowledgments

References

Citation Manager Formats

Subject Area