Whole-genome Studies of Malagasy People Uncover Novel Body Composition Associations

The majority of human genomic research studies have been conducted in European ancestry cohorts, at the loss of detecting potentially novel and globally impactful findings. Here, we present the first whole genome sequence data and genome-wide association study in a cohort of 264 Malagasy individuals from three locations on the island of Madagascar. We describe genetic variation in this Malagasy cohort, providing insight into the shared and unique patterns of genetic variation across the island. We observe phenotypic variation by location, and find high rates of hypertension particularly in the Southern Highlands as well as elevated malaria prevalence in the West Coast relative to other sites. We find a number of genetic associations with body composition traits, including many variants that are unique to African populations or populations with admixed African ancestry such as Madagascar. This study highlights the utility of including diverse populations in genomic research for the potential to gain novel insights, even with small cohort sizes. This project was conducted in partnership and consultation with local Malagasy stakeholders and serves as an example for equitable genomic research with potential impacts on our understanding of human health and disease.


Main Text
2][3] The Malagasy population can trace their genetic ancestry to eastern African Bantu-speaking populations and Austronesian groups most likely from modern day Borneo, Indonesia. 4,5Austronesian founders arrived in eastern Madagascar and African founders in the northwest coast of Madagascar.
These ancestral groups likely settled the island at different times and in waves over the course of 1000-2000 years, with subsequent admixture beginning approximately 28 generations in the past.[8][9] There have been few genome-wide studies of the Malagasy population of Madagascar.
Of these existing studies, most have primarily focused on the selective and demographic history including the timing of settlement, evidence of past bottlenecks, admixture, and positive selection. 4,5,8,10,11The unique genetic architecture of Madagascar is of particular interest from a population genetics perspective.There is a known signal of strong post-admixture selection at the malaria protective Duffy-null variant on chromosome 1. 10,12Beyond these two studies which both focused on the single locus signal at Duffy-null, other signals of selection in Malagasy have not been investigated.Further, these past studies either focused on a single region in Madagascar (central highlands) 12 , or grouped individuals from multiple regions across the island into one population for analysis purposes. 10However, due to the settlement history of the island, patterns of genetic diversity vary across Madagascar, including relative contributions of Austronesian and African sources. 8Accordingly, there may be differences in selective pressures and genomic regions under selection for groups residing in different parts of Madagascar.The recent admixture and selective pressures unique to the island may have led novel or globally rare variants with functional impacts to increase in frequency.This provides a distinctive opportunity to detect genetic associations either at novel loci or with novel variants at known loci, which can add to our understanding of the underlying biology driving an association.
Of note, the genome-wide studies to date have all relied on genotyping array data.Here, we describe the first whole genome sequence data from the Malagasy population of Madagascar.We also identify several genetic associations with anthropometric phenotypes, representing the first GWAS in a Malagasy cohort.This study expands the diversity of populations that have been included in genomic research, allowing us to gain novel insight into human health both in Madagascar and globally.
This study was conducted in collaboration with scientists and anthropologists at Variant Bio (VB) and the University of Antananarivo Department of Anthropobiology and Sustainable Development (UA).First, representatives of VB and UA conducted an initial community consultation in order to determine study sites and to assess the feasibility and interest in genetic research within the community.We also established a benefit-sharing agreement based on the needs of the community, with access to clean water and improved school infrastructure being the most common concerns across communities. 13timately, the study included community-based recruitment of adult volunteers from three locales in order to gain population-representative samples: Tsianaloka Village, Tsimafana (West Coast); Ampandrialaza Village, Ampangabe (Central Highlands); and Tsiandatsiana Village, Ankerana (Southern Highlands) (Figure 1A).We collected 15 anthropometric and spirometric phenotypes as well as history of malaria infection.See Supplemental Methods for more details on participant inclusion criteria and phenotyping.
Overall, the participants were healthy, with the vast majority of individuals self-reporting excellent health (Figure 2A).The sample population was generally young, with an average age across the three sites ranging from 29 to 35 years old (West Coast = 35 years, Southern Highlands = 33 years, Central Highlands = 29.5 years) (Figure 2B, Supplemental Table 1).
Approximately 20% of participants had been to a doctor in the year preceding this study.Only 14 individuals (~5%) reported having malaria in the past year, and 20% of participants reported ever having malaria.This proportion of both male and female participants reporting ever having malaria is noticeably higher in the West Coast (male = 56%, female = 41%) compared to the other two study sites (0-11%), likely reflective of an increased presence of the malaria parasites along the coast (Supplemental Table 2).We measured both systolic and diastolic blood pressure and categorized individuals who had Stage 1 or Stage 2 hypertension versus normal blood pressure based on American College of Cardiology (ACC) guidelines. 14We find that approximately 61% of participants were hypertensive (Normotensive / slightly elevated = 39%, Stage 1 hypertension = 31% Stage 2 hypertension = 30%), with the Southern Highlands showing on average higher blood pressure than the other two sites (Figure 2C, Supplemental Table 1).A previous study has similarly reported high rates of hypertension (49%) in a rural community in the northeastern coast of Madagascar. 15In Figure 2C, we show distributions of six phenotypes stratified by site and sex: maximum forced expiratory volume in 1 second (FEV1 max) measured in percentage, systolic blood pressure measured in millimeters of mercury (mmHg), weight measured in kilograms (kg), body fat percentage, and hip and weight circumferences measured in centimeters (cm).Descriptive statistics for all collected phenotype distributions, stratified by site and sex, are provided in Supplemental Tables 1 & 2.
Mid-pass whole genome sequencing, joint genotyping, and imputation was performed for all participants, with 264 passing quality control at a mean coverage of 4.6x (Supplemental Methods) 16 .Reference genomes were included from the Indonesian Genome Diversity Project (IDGP) (EGAS00001003054) and the Papuan Genome Diversity Project (PGDP) 17 and 1000 Genomes references from eastern Asia, Europe, and central, eastern, and western Africa (Han 18 Including the Malagasy cohort, the final call set included 1,077 individuals.A summary table of population-level allele frequencies for the ~29 million alleles segregating in the Malagasy cohort is provided (see Data Availability).
We performed separate quality checks and filtering at the individual and variant level for GWAS and population genetics analyses respectively (Supplemental Methods).We ran a principal components analysis (PCA) in Hail (v0.2) and ADMIXTURE for a subset of 977 individuals passing QC, including 208 of the Malagasy participants. 19,20Individuals within the Malagasy cohort cluster by locale along the African to Asian ancestry cline in PC space (Figure 1B).The clustering by site is likely indicative of different proportions of African and Austronesian ancestry resulting from the complex settlement history of the island.Indeed, we see from ADMIXTURE results at K=3 that the average proportion of African-related global ancestry is higher in the West Coast (66% African) compared to a higher average proportion of Austronesian-related global ancestry in the Central Highlands (41% African), with the global ancestry proportions Southern Highlands being closer to equal contributions from the two ancestral sources (53% African) (Figure 1C).
We identified genetic variants that are unique or enriched in the Malagasy population relative to other reference populations.Specifically, we defined a variant as "enriched" if the minor allele frequency was less than 0.01 in gnomAD v3, 21,22 less than 0.01 in all reference populations included in joint calling and imputation (LUH, GAM, YRI, CHB, GBR, IDGP, PGDP) and more than 0.05 in the Malagasy cohort.We further annotated these enriched variants as having likely functional impacts if they had a CADD greater than 30 or else was annotated by Variant Effect Predictor with a consequence of "high" or "moderate" impact according to the Ensembl IMPACT rating (https://ensembl.org/info/genome/variation/prediction/predicted_data.html)(e.g.stop gain, stop lost, frameshift, missense, etc).Specifically, "high" impact variants are defined as those that are "assumed to have high (disruptive) impact in the protein, probably causing protein truncation, loss of function or triggering nonsense mediated decay," and "moderate" impact variants are defined as those that are "non-disruptive" and "might change protein effectiveness."Out of 44,731,585 total variants passing QC, we identified 20,656 variants that were enriched in the Malagasy population (maf > 0.05) and either absent or low frequency ( <0.01) in other global reference populations as described above.Of these, 116 had likely functional impact according to our criteria (Figure 3A, Supplementary File 1).
Next, runs of homozygosity (ROH) were called to identify signatures of recent or older bottlenecks.We detected runs of homozygosity (ROH) for each individual using bcftools/RoH. 23 called ROH for each individual, and calculated the total amount of the genome contained within ROH.Larger amounts of the genome contained in ROH, and in particular ROH spanning long genomic distances, may be indicative of recent bottleneck events. 24In Figure 3B, we see that the overall patterns of ROH align with expectations.That is, bottlenecked out-of-Africa (OOA) populations showing higher sum total amounts of the genome contained in ROHs and continental African populations show lower values due to the higher effective population sizes.
Among the three study sites in Madagascar, the Central Highlands have higher values of sum total ROH, reflective of a historical bottleneck in the region.This is consistent with a past study which found evidence of a bottleneck in a central highlands subgroup based on patterns of IBD sharing from SNP array data. 8,25 performed GWAS for the 16 phenotypes collected across 12,528,464 million variants passing GWAS QC (Supplemental Methods), combining all individuals from across the three sampling sites into one cohort due to the low sample size (213-214 individuals, varying by phenotype).We identified 22 variants reaching genome-wide significance (p < 5e-8) for at least one phenotype.Summary statistics for the lead variant associations are shown in Table 1, variant-level summary statistics for all significant associations can be found in Supplemental Table 3, and an additional summary of all GWAS is provided in Supplemental Table 4. Full summary statistics are provided for all GWAS (Data Availability).
Our results include many low to moderate frequency variants specific to African ancestry populations associated with body composition traits (Table 1, Supplemental Table 3).For example, a haplotype overlapping with RPTOR on chromosome 17 was associated with both body fat percentage and weight (Figure 4A, Table 1).The lead variant (rs115215854) is an intron variant in RPTOR.7][28][29] With a fraction of that sample size (n = 214), we detect significant associations at RPTOR.Further, the lead variant for these two body composition associations is segregating at low frequencies (~3%) in African populations or populations with admixed African ancestry, including the Malagasy cohort, and unobserved in other 1000Gs populations around the world (Table 1).
Similarly, we identified a significant association with waist circumference and waist-to-hip ratio on chromosome 20, with the lead variant (rs115838297) in the intron of DOK5 (Figure 4B).
Variants in DOK5 have been weakly associated with body mass index in the UKBiobank with large cohort sizes. 26,27,30Again, we calculated the allele frequencies for the lead variant from this study in 1000 Genomes populations and found that this allele is unique to African populations or populations with admixed African ancestry, including the Malagasy cohort and segregates at a low frequency (~4%) (Table 1).
Among the novel associations is a variant in WASHC5 on chromosome 20 that is positively associated with weight and negatively associated with body fat percentage (rs16900241) (Figure 4A).We find no past studies associating variants in WASHC5 with body composition traits.WASHC5 encodes the protein strumpellin which is most highly expressed in skeletal muscle and also highly expressed in the thyroid. 31We also identified a significant association with the maximum forced expiratory volume in 1 second (FEV1 max) on chromosome 5.The peak variant (rs11950040) is globally common with an allele frequency of 42.8% in 1000 Genomes, and is a downstream gene variant for CCL28 (Table 1, Supplemental Table 3).Additional summary statistics, including population-level allele frequencies for lead variants and all significant variant associations are provided in Table 1 and Supplemental Table 3, respectively.Despite the small cohort size in this study, GWAS for well-powered quantitative body composition traits like weight and waist circumference led to discovery of novel associations or novel variants at known trait-associated loci.We included a minor allele count (MAC) threshold in our QC criteria (see Supplemental Methods); however, we recognize that this cohort is a small sample size and that a number of our GWAS associations are with variants between 1 and 5% minor allele frequency.Though we urge caution in interpretation, we are encouraged that a number of the low frequency variant associations are in genes that have previously been associated with relevant phenotypes in larger cohorts (e.g.DOK5, RPTOR).Additionally, the lead variant associated with FEV1 max, chr5_43372087_A_G / rs11950040, is also common in European ancestry and has a nominally significant association with peak expiratory flow in Europeans. 32Although we combined the cohort across sites due to low sample sizes, the slight genetic differentiation and varying environmental or selective pressures across the sites suggests deeper sampling and investigations in each region could identify group-specific genetic associations with traits of interest.
This study of genetic and phenotypic variation across three Malagasy villages demonstrates site-specific trait distributions including malaria incidence and genetic substructure by locale.These results emphasize the need for more granular genomic studies that take into account region-specific genetic and environmental differences.In particular, these results have implications for studies that have scanned the Malagasy genome for signals of positive selection.Studies have identified the strong selection at Duffy-null in a cohort with participants either grouped together from all across the island of Madagascar 10 or at one specific site. 12Instead, a study designed to detect selection signals in each site independently may find that the post-admixture Duffy-null signal is not as wide-spread, particularly in places where malaria incidence is low.A site-specific study design may also find unique and novel selection signals at each locale that were masked in a study that grouped all populations despite the clear substructuring that we observe in our analysis.The WGS data described in this study from three specific sites is a valuable start for identifying such region-specific signals.
The Malagasy people have never been included in WGS studies or GWA studies, but here we show that even with a small cohort, we can draw potentially important insights into the genetic basis of clinically relevant traits.This study demonstrates the utility of partnering with local communities in order to expand the diversity of people who participate in and potentially benefit from genomic research.processed sequencing data, performed imputation, and quality control.M.M. carried out population genetic analyses.R.R. and S.L.V.B. performed community engagement and consultation.S.E.C. carried out GWAS.All authors reviewed and contributed to writing the manuscript.

Figure
Figure 1.Cohort overview.A) Map of three sampling sites on Madagascar corresponding to the West Coast (Tsianaloka Village, Tsimafana), Central Highlands (Ampandrialaza Village, Ampangabe), and the Southern Highlands (Tsiandatsiana Village, Ankerana).B) PCs 2-4 vs PC1 and C) ADMIXTURE plot for K=3, including Malagasy cohort (West Coast, Central Highlands, Southern Highlands) and relevant reference populations from 1000 Genomes (GWD, YRI, LUH CHB, GBR) and additional reference samples from IDGP and PGDP.Points in the scatterplot are colored by group legend, global ancestry bars are colored by proportion of African-related (pink), Asian-related (purple), or European-related (yellow) ancestry.

1 .
Figure 1.Cohort overview.A) Map of three sampling sites on Madagascar corresponding to the West Coast (Tsianaloka Village, Tsimafana), Central Highlands (Ampandrialaza Village, Ampangabe), and the Southern Highlands (Tsiandatsiana Village, Ankerana).B) PCs 2-4 vs PC1 and C) ADMIXTURE plot for K=3, including Malagasy cohort (West Coast, Central Highlands, Southern Highlands) and relevant reference populations from 1000 Genomes (GWD, YRI, LUH CHB, GBR) and additional reference samples from IDGP and PGDP.Points in the scatterplot are colored by group legend, global ancestry bars are colored by proportion of African-related (pink), Asian-related (purple), or European-related (yellow) ancestry.

Figure 2 .
Figure 2. Phenotype distributions by site.A) Self-reported health across all three sites B) Distribution of ages stratified by site (color) and sex (x-axis) C) Distribution of traits values stratified by site and sex for six of the collected phenotypes (FEV1 max [percent], systolic blood pressure [mmhg]], weight [kg], body fat percentage [percent], hip circumference [cm], waist circumference [cm]).

Figure 3 .
Figure 3. Genetic Variation Overview.A) Count of variants enriched in the Malagasy cohort relative to global populations, grouped by functional category (VEP consequence, x-axis) and Ensembl IMPACT classification (high, medium, low, moderate).B) Sum of the genome contained in runs of homozygosity (ROH) by Madagascar study site or reference population.

Figure 4 .
Figure 4. GWAS Manhattan plots for two anthropometric traits.P-values for all autosomal variants passing GWAS QC on a -log10 scale for A) body fat percentage and B) waist circumference.Annotated gene for top variant, or nearest protein coding gene in parenthesis, is listed for each locus reaching genome-wide significance (p < 5e-8, red dashed line).

Table 1 .
Summary statistics for the lead variants from 15 significant associations.The p-values and beta coefficients for each lead variant and trait associated locus are listed.The lead variant key (chr_pos_ref_alt), rsID, VEP annotated gene (or closest coding gene in parenthesis), and allele frequencies are included (MG=Malagasy cohort from this study, AFR=1000 Genomes African populations, EAS=1000 Genomes East Asian populations, EUR=1000 Genomes European populations, SAS=1000 Genomes South Asian populations).Processing was done with Variant Bio's in-house processing pipeline.Raw sequencing data were inspected with fastqc (v0.11.7) and adapters were trimmed using cutadapt (v2.10).Trimmed reads were then processed following the GATK Best Practices guidelines, more specifically the CCDG functional equivalence version(Regier et al. 2018) (using BWA-mem v0.7.15 for mapping onto GRCh38, Picard v2.21.4 for duplicate marking, samtools v1.10 for sorting, and GATK v4.1.4.0 for BQSR).Coverage was determined using Picard CollectWgsMetrics.Two samples were found to have excessive bacterial contamination with mapping rates to GRCh38 below 30% and excluded from further analysis.British [GBR], Yoruba [YRI], Luhya [LWK], Gambian [GWD]; all sequenced at high coverage and processed with the same pipeline).We next ran VQSR (setting --truth-sensitivity-filter-level to 99.8 for SNPs and 99.0 for indels) and only retained PASS filter sites for further analyses.Next, we ran imputation within the cohort using Beagle v5.1 (beagle.27Apr20.b81.jar), with all genotypes with GQ < 18 set to missing (Emde et al. 2021) (method: https://github.com/variant-bio/mid-pass).In a first pass, we only imputed genotypes of variants in Genome In a Bottle (GIAB) "easy" regions(Olson et al. 2022), ensuring high quality imputation within this stricter set.After fixing these genotypes, we added variants outside easy regions as well as structural variants (SVs) from the Human Genome Structural Variation Consortium genotyped in all individuals with PanGenie (Ebler et al. 2022) (GQ < 200 set to missing) and imputed these within the cohort as well.populationgeneticanalysessuchas PCA, ADMIXTURE, and runs of homozygosity (ROH), we started with the joint callset which included 44,731,585 and 1,077 reference and Malagasy cohort samples.We first masked GIAB low confidence regions(Olson et al. 2022), resulting in 28,670,287 variants.We filtered 11 subjects with missing sex, 60 subjects with relatedness above 3rd degree (pihat > 0.125), and 29 subjects greater than 7 standard deviations from the mean for any of PCs 1-10 (in the unfiltered callset).This resulted in 977 total subjects, including 208 from the Malagasy cohort (West Coast n = 69; Southern Highlands n = 65; Central Highlands n = 74).We filtered 740,169 variants with MSQ ranksum not equal to zero or variants that did not pass QC in gnomAD v3(Karczewski et al. 2020).We filtered 140,648 variants without a MAC of at least 5 after subject-level filtering.We again restricted the analyses to autosomes only, resulting in an additional 907,092 variants filtered.Ultimately, this final callset included 977 subjects and 26,882,378 variants. For

Table 4 .
Summary of all 16 GWAS run in this study.For each phenotype, the lambda GC, number of hits, number of independent hits as determined by conditional analysis, the number of loci, sample size, and the number of cases and controls if applicable.