Long runs of homozygosity are correlated with marriage preferences across global population samples

Children of consanguineous unions carry long runs of homozygosity (ROH) in their genomes, due to their parents’ recent shared ancestry. This increases the burden of recessive disease in populations with high levels of consanguinity and has been heavily studied in some groups. However, there has been little investigation of the broader effect of consanguinity on patterns of genetic variation on a global scale. Here, we collect published genetic data and information about marriage practices from 396 worldwide populations and show that preference for cousin marriage has a detectable effect on the distribution of long ROH in these samples, increasing the expected number of ROH longer than 10Mb by a factor of 1.5 (P=2.3 × 10−4). Variation in marriage practice and consequent rates of consanguinity is therefore an important aspect of demographic history for the purposes of modeling human genetic variation. However, marriage practices explain a relatively small proportion of the variation in ROH distribution and consequently the ability to predict marriage practices from population genetic samples (for example of ancient populations) is limited.


Introduction
Marriage practices have consequences for human genetic variation. One extensively debated and regulated practice is consanguinity, the union between closely related individuals. An [...]

For these reasons, populations that practice cousin marriage may in fact demonstrate relatively few of the negative genetic consequences associated with increased homozygosity. For example, many marriages might occur between "cousins" who are actually quite distantly related. Alternatively, in many regions of the world, cousin marriage co-occurs with endogamy (marriage within defined sub-groups such as castes) [8]. Since endogamy can itself lead to excess recessive disease burden [9], the marginal effects of consanguinity might be small. Finally, even though both endogamy and consanguinity, by increasing ROH, increase the risk of recessive disease in the short term, the same process exposes such alleles to selection, purging them more effectively from the population in the long term. [...]

We therefore set out to test whether a cultural preference for cousin marriage is detectable in genetic data from a worldwide sample of 3,859 individuals from 396 populations. We categorize these populations into those prohibiting (43%), allowing (12%), and preferring (44%) cousin marriage using publicly available ethnographic sources. We find that ethnographic measures of preferences for cousin marriage are detectably related to the distribution of long ROH, increasing the expected number of ROH longer than 10 Mb by 1.5× in populations preferring cousin marriages over those that prohibit it (P = 2.31 × 10−4), after controlling for ten principal components of genetic variation, shorter runs of homozygosity, and heterozygosity (collectively serving as proxies for demographic events such as bottlenecks, admixture and endogamy). This effect corresponds to an approximate in- [...]

[...] only those that were identical by descent (IBD) within an individual. We refer to these segments as runs of homozygosity (ROH) to distinguish them from inter-individual IBD. We also calculated the expected number and total length of long runs of homozygosity analytically using the model described in [...] The integrals [...]

Genetic data collection and processing
We collected genotype data generated with the Affymetrix Human Origins array from 6 papers [9, 19-23], and merged this with whole-genome sequence data from three further [...]

For each individual, we calculated heterozygosity (at genotyped SNPs) and the number of ROH greater than 1, 2, 5 and 10 Mb (using the --het and --homozyg commands in plink). We then took differences between these cumulative counts so that our variable NROH1 represents the number of ROH longer than 1 but smaller than 2 Mb, NROH5 representing the number of ROH [...]

Ethnographic data collection and processing
We collected data on marriage practices from the following sources:

We encoded the level of consanguinity in a population categorically as 'prohibited' (coded as 0), 'permitted' (1), or 'preferred' (2). When quantitative measures (e.g. prevalence of cousin unions in the population) were available, we assigned populations to these three categories based on the percentage of unions between first cousins and first cousins once removed. In such cases, a prevalence of 0 to 2.5% was encoded as 'prohibited', 2.5% to 20% as 'permitted', and 20% and above as 'preferred'. When quantitative measures were not available, we analyzed ethnographic records for terms indicating the preference for cousin unions. Groups for which cousin marriage was described as "common", "practiced", "encouraged" or equivalent were classified as 'preferred'. Groups where cousin marriage was described as "allowed", "occasional", "present but uncommon" or equivalent were coded as 'permitted'. Lastly, groups where cousin unions were "forbidden", "barred" or marriage was "exogamous" were classified as 'prohibited'. We classified groups where information was unclear or unavailable as 'missing'. We gathered marriage practice information for 522 populations, which was reduced to 396 populations (3,859 individuals) after merging with the genetic data.

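The consanguinity scoring rules can be illustrated with a minimal Python sketch. This is not the code used in the study: the function names and the lookup table are hypothetical, descriptor matching is reduced to an exact lookup of the quoted terms, and assigning a prevalence of exactly 2.5% to 'permitted' is our assumption, since the text does not specify how that boundary was handled.

```python
from typing import Optional

PROHIBITED, PERMITTED, PREFERRED = 0, 1, 2

# Descriptor terms quoted in the text, mapped to their category.
# Real ethnographic records would need more careful reading than a lookup.
TERM_TO_SCORE = {
    "common": PREFERRED, "practiced": PREFERRED, "encouraged": PREFERRED,
    "allowed": PERMITTED, "occasional": PERMITTED,
    "present but uncommon": PERMITTED,
    "forbidden": PROHIBITED, "barred": PROHIBITED, "exogamous": PROHIBITED,
}


def score_from_prevalence(percent: float) -> int:
    """Map the percentage of first-cousin (and once-removed) unions to a score."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("prevalence must be a percentage between 0 and 100")
    if percent >= 20.0:   # "20% and above" is 'preferred'
        return PREFERRED
    if percent >= 2.5:    # assumption: exactly 2.5% counts as 'permitted'
        return PERMITTED
    return PROHIBITED


def consanguinity_score(prevalence_percent: Optional[float] = None,
                        descriptor: Optional[str] = None) -> Optional[int]:
    """Return 0, 1 or 2, or None when the practice is 'missing'.

    Quantitative prevalence takes priority over a qualitative descriptor.
    """
    if prevalence_percent is not None:
        return score_from_prevalence(prevalence_percent)
    if descriptor is not None:
        return TERM_TO_SCORE.get(descriptor.strip().lower())
    return None
```

For example, `consanguinity_score(10.0)` gives 1 ('permitted'), and `consanguinity_score(descriptor="forbidden")` gives 0 ('prohibited').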
To validate our consanguinity scoring, we tested whether our assignments were corre- [...]

To further ensure that the relationship between consanguinity and NROH10 was not confounded by long-term demographic history, we identified population pairs that were matched genetically and geographically but differed in that cousin unions were preferred in one population and prohibited in the other. To maximize genetic similarity and geographic proximity in selecting population pairs, we calculated genetic (Euclidean) distance using PC1-10 and geographic distances (using longitude and latitude) between populations that prefer consanguinity (score = 2) and those that either prohibit it or permit it (score ∈ {0, 1}). Then, we divided each distance matrix by its median (to account for differences in scale) and averaged the two to calculate a single distance matrix. We selected population pairs in ascending order of the distance between them such that each population was part of only [...]

[...] on the average number or total length of long (>10 cM) ROH (ROH10) (Fig. 2). This is because long runs of homozygosity arise primarily due to cousin unions in the recent past [...] as short runs of homozygosity (Fig. 2).
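The matched-pair selection described in the Methods (median-scaled genetic and geographic distances, averaged, then pairs taken in ascending order) can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: the function names are hypothetical, geographic distance is computed as great-circle distance (the text says only that longitude and latitude were used), and we assume each population may appear in at most one pair.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0


def great_circle_km(lon1, lat1, lon2, lat2):
    """Haversine distance in km (assumed; the text only says lon/lat were used)."""
    lon1, lat1, lon2, lat2 = (np.radians(x) for x in (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def match_pairs(pcs, lon, lat, scores):
    """Greedily pair 'preferred' (score 2) populations with score 0/1 populations.

    pcs: (n_populations, n_pcs) population-level principal components.
    Returns a list of (prefer_index, other_index, combined_distance).
    """
    pcs = np.asarray(pcs, dtype=float)
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    scores = np.asarray(scores)

    prefer = np.flatnonzero(scores == 2)
    other = np.flatnonzero(np.isin(scores, (0, 1)))

    # Genetic distance: Euclidean distance in PC space.
    genetic = np.linalg.norm(
        pcs[prefer][:, None, :] - pcs[other][None, :, :], axis=2)
    geographic = great_circle_km(
        lon[prefer][:, None], lat[prefer][:, None],
        lon[other][None, :], lat[other][None, :])

    # Divide each matrix by its median to equalize scale, then average.
    combined = (genetic / np.median(genetic)
                + geographic / np.median(geographic)) / 2

    pairs, used_prefer, used_other = [], set(), set()
    # Walk candidate pairs from closest to farthest; assumption: a population
    # joins at most one pair.
    for flat in np.argsort(combined, axis=None):
        i, j = np.unravel_index(flat, combined.shape)
        if i in used_prefer or j in used_other:
            continue
        used_prefer.add(i)
        used_other.add(j)
        pairs.append((int(prefer[i]), int(other[j]), float(combined[i, j])))
    return pairs
```

A greedy pass of this kind does not guarantee a globally optimal matching, but it follows the "ascending order of distance" rule in the text directly.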

Geographic variation in cultural preference of consanguinity
We show the geographic variation in cultural preference for consanguinity in Fig. 3. Notably, in India, preferences for consanguinity appear to change markedly between northern and southern populations (Fig. 3B). Consanguinity [...] and horizontal bars represent the 95% CI.

[...] geographic proximity, but discordant for cousin marriage practice, and then tested for a difference in NROH within pairs (Methods). We find that NROH10 is significantly different between such matched populations (P = 0.007, N = 72 pairs) (Fig. 5), whereas other variables that are more sensitive to long-term demographic history, such as NROH1, NROH2, NROH5, and HET, are not (Table 3), consistent with results from the linear models. Removing populations with mean NROH10 > 5 did not change this result (Table S6). [...] which is conservative in this context. Using leave-one-out cross-validation we find that classification power is low (area under the curve, or AUC = 0.62) but well calibrated for predicted probabilities > 50% (Fig. 6).

The negative genetic consequences of marriage between close relatives have been well known for centuries, yet the practice continues amongst millions of households around the world. [...]

[...] for consanguinity score is shown relative to the score of 0 (consanguinity prohibited). Coefficients of 1 and 2 represent populations which permit and prefer cousin unions, respectively. Regional coefficients are shown relative to the Americas. Ten genetic PCs (calculated for the full sample) were also included in the model but their coefficients are not shown.

Table S6: Genetic differences between pairs of populations preferring and prohibiting consanguinity (outliers removed, N = 68 pairs)