Abstract
MalariaGEN is a data-sharing network that enables groups around the world to work together on the genomic epidemiology of malaria. Here we describe a new release of curated genome variation data on 7,000 Plasmodium falciparum samples from MalariaGEN partner studies in 28 malaria-endemic countries. High-quality genotype calls on 3 million single nucleotide polymorphisms (SNPs) and short indels were produced using a standardised analysis pipeline. Copy number variants associated with drug resistance and structural variants that cause failure of rapid diagnostic tests were also analysed. Almost all samples showed genetic evidence of resistance to at least one antimalarial drug, and some samples from Southeast Asia carried markers of resistance to six commonly-used drugs. Genes expressed during the mosquito stage of the parasite life-cycle are prominent among loci that show strong geographic differentiation. By continuing to enlarge this open data resource we aim to facilitate research into the evolutionary processes affecting malaria control and to accelerate development of the surveillance toolkit required for malaria elimination.
Introduction
A major obstacle to malaria elimination is the great capacity of the parasite and vector populations to evolve in response to malaria control interventions. The widespread use of chloroquine and DDT in the 1950s led to high levels of drug and insecticide resistance, and the same pattern has been repeated for other first-line antimalarial drugs and insecticides. Over the past 15 years, mass distribution of pyrethroid-treated bednets in Africa and worldwide use of artemisinin combination therapy (ACT) have led to substantial reductions in malaria prevalence and mortality, but there are rapidly increasing levels of resistance to ACT in Southeast Asian parasites and of pyrethroid resistance in African mosquitoes. A deep understanding of local patterns of resistance and the continually changing nature of the local parasite and vector populations is necessary to manage the use of drugs and insecticides and to deploy public health resources for maximum sustainability and impact.
Current methods for genetic surveillance of the parasite population are largely based on targeted genotyping of specific loci, e.g. known markers of drug resistance. Whole genome sequencing of malaria parasites is currently more expensive and complex, particularly at the stage of data analysis, but it is an important adjunct to targeted genotyping, as it provides a more comprehensive picture of parasite genetic variation. It is particularly important for discovery of new drug resistance markers and for monitoring patterns of gene flow and evolutionary adaptation in the parasite population.
The Plasmodium falciparum Community Project (Pf Community Project) was established with the aim of integrating parasite genome sequencing into clinical and epidemiological studies of malaria (www.malariagen.net/projects). It forms part of the Malaria Genomic Epidemiology Network (MalariaGEN), a global data-sharing network comprising multiple partner studies, each with its own research objectives and led by a local investigator.1 Genome sequencing was performed centrally, and partner studies were free to analyse and publish the genetic data produced on their own samples, in line with MalariaGEN’s guiding principles on equitable data sharing.1–3 A programme of capacity building for research into parasite genetics was developed at multiple sites in Africa alongside the Pf Community Project.4
The first phase of the project focused on developing simple methods to obtain purified parasite genome DNA from small blood samples collected in the field 5,6 and on establishing reliable computational methods for variant discovery and genotype calling from short-read sequencing data.7 This presented a number of analytical challenges due to long tracts of highly repetitive sequence and hypervariable regions within the P. falciparum genome, and also because a single infection can contain a complex mixture of genotypes. Once a reliable analysis pipeline was in place, a process was established for periodic data releases to partners, with continual improvements in data quality as new analytical methods were developed.
Data from the Pf Community Project were initially released through a companion project called Pf3k (www.malariagen.net/data), whose goal was to bring together leading analysts from multiple institutions to benchmark and standardise methods of variant discovery and genotype calling. A visual analytics web application was developed8 for researchers to explore the data (www.malariagen.net/apps/pf3k). The open dataset was enlarged in 2016 when multiple partner studies contributed to a consortial publication on 3,488 samples from 23 countries.9
Data produced by the Pf Community Project have been used to address a broad range of research questions, both by the groups that generated samples and data and by the wider research community, and have generated over 50 previous publications (refs 5-55). These data have become a key resource for the epidemiology and population genetics of antimalarial drug resistance9–22 and an important platform for the discovery of new genetic markers and mechanisms of resistance through genome-wide association studies23–27 and combined genome-transcriptome analysis.28 The data have also been used to study gene deletions that cause failure of rapid diagnostic tests29; to characterise genetic variation in malaria vaccine antigens30,31; to screen for new vaccine candidates32; to investigate specific host-parasite interactions33,34; and to describe the evolutionary adaptation and diversification of local parasite populations.7,9,12,35–40
The Pf Community Project data also provide an important resource for developing and testing new analytical and computational methods. Key areas of methods development are quantification of within-host diversity7,41–46, estimation of inbreeding7,47, and deconvolution of mixed infections into individual strains.48,49 The data have also been used to develop and test methods for estimating identity by descent50,51, imputation52, typing structural variants53, designing other SNP genotyping platforms54 and data visualisation8,55. In a companion study we performed whole genome sequencing of experimental genetic crosses of P. falciparum, and this provided a benchmark to test the accuracy of our genotyping methods, and to conduct an in-depth analysis of indels, structural variants and recombination events which are complicated to ascertain in these population genetic samples56.
Here we describe a new release of curated genome variation data on 7,113 samples of P. falciparum collected by 49 partner studies from 73 locations in Africa, Asia, South America and Oceania between 2002 and 2015 (Table 1, Supplementary Tables 1 and 2).
Results
Variant discovery and genotyping
We used the Illumina platform to produce genome sequencing data on all samples and we mapped the sequence reads against the P. falciparum 3D7 v3 reference genome. The median depth of coverage was 73 sequence reads averaged across the whole genome and across all samples. We constructed an analysis pipeline for variant discovery and genotyping, including stringent quality control filters that took into account the unusual features of the P. falciparum genome, incorporating lessons learnt from our previous work7,56 and the Pf3k project, as outlined in the Methods section.
In the first stage of analysis we discovered variation at over six million positions, corresponding to about a quarter of the 23 Mb P. falciparum genome (Supplementary Table 3). These included 3,168,721 single nucleotide polymorphisms (SNPs): these were slightly more common in coding than non-coding regions and were mostly biallelic. The remaining 2,882,975 variants were predominantly short indels but also included more complex combinations of SNPs and indels: these were much more abundant in non-coding than coding regions, and mostly had at least 3 alleles. The predominance of indels in non-coding regions has been previously observed and is most likely a consequence of the extreme AT bias which leads to many short repetitive sequences.56,57
For the purpose of this analysis, we excluded all variants in subtelomeric and internal hypervariable regions, mitochondrial and apicoplast genomes, and some other regions of the genome where the mapping of short sequence reads is prone to a high error rate due to extremely high rates of variation.56 A total of 1,838,733 SNPs (of which 1,626,886 were biallelic) and 1,276,027 indels (or SNP/indel combinations) passed all these filters. The pass rate for SNPs in coding regions (66%) was considerably higher than that for SNPs in non-coding regions (47%), indels in coding regions (37%) and indels in non-coding regions (47%). Finally, we removed samples with a low genotyping success rate or other quality control issues. We also removed replicates and 41 samples with genetic markers of infection by multiple Plasmodium species, leaving 5,970 high-quality samples from 28 countries (Table 1).
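As a quick arithmetic check on the figures above, the overall pass rates implied by the reported counts can be computed directly. This snippet uses only numbers quoted in the text; the variable names are illustrative and are not part of the data release.

```python
# Variant counts quoted in the text: discovered vs. passing all filters.
discovered = {"SNP": 3_168_721, "indel": 2_882_975}
passed = {"SNP": 1_838_733, "indel": 1_276_027}

total_discovered = sum(discovered.values())
total_passed = sum(passed.values())

# Overall pass rate per variant class, as a fraction of discovered variants.
pass_rate = {kind: passed[kind] / discovered[kind] for kind in discovered}

for kind, rate in pass_rate.items():
    print(f"{kind}: {rate:.1%} of discovered variants passed all filters")
print(f"total discovered: {total_discovered:,}; total passed: {total_passed:,}")
```

The overall SNP pass rate obtained this way (about 58%) lies between the coding (66%) and non-coding (47%) SNP pass rates reported above.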
We used coverage and read pair analysis to determine duplication genotypes around mdr1, plasmepsin2/3 and gch1, each of which is associated with drug resistance. For each of these three genes we discovered many different sets of breakpoints (29, 10 and 3 pairs of breakpoints for mdr1, gch1, and plasmepsin 2/3, respectively), including complex rearrangements58 that to the best of our knowledge have not been observed before in Plasmodium species (Supplementary Note, Supplementary Tables 4-6). We also used sequence read coverage to identify large structural variants that appear to delete or disrupt hrp2 and hrp3, an event that can cause rapid diagnostic tests to malfunction.
The population genetic analyses in this paper are based on the filtered dataset of high-quality SNP genotypes in 5,970 samples. These data are openly available, together with annotated genotyping data on 6 million putative variants in all 7,113 samples, plus details of partner studies and sampling locations, at www.malariagen.net/resource/26 .
Global population structure
The genetic structure of the global parasite population reflects its geographic regional structure 7,9,10 as illustrated by a neighbour-joining tree and a principal component analysis of all samples based on their SNP genotypes (Figure 1). Based on these observations we grouped the samples into eight geographic regions: West Africa, Central Africa, East Africa, South Asia, the western part of Southeast Asia, the eastern part of Southeast Asia, Oceania and South America. Each of these can be viewed as a regional sub-population of parasites, which is more or less differentiated from other regional sub-populations depending on rates of gene flow and other factors. The different regions encompass a range of epidemiological and environmental settings, varying in transmission intensity, vector species and history of antimalarial drug usage. Note these regional classifications are intentionally broad, and therefore overlook many interesting aspects of local population structure, e.g. a distinctive Ethiopian sub-population can be identified by more detailed analysis of African samples.12
Genetically mixed infections were considerably more common in Africa than other regions, consistent with the high intensity of malaria transmission in Africa (Figure 2a). Analysis of FWS, a measure of within-host diversity7, shows that most samples from Southeast Asia (1763/2341), South America (37/37) and Oceania (158/201) have FWS >0.95, which to a first approximation indicates that the infection is dominated by a clonal population of parasites.41 In contrast, nearly half of samples from Africa (1625/3314) have FWS <0.95, indicating the presence of more complex infections. Genetically mixed infections were also common in Bangladesh (41/77 samples have FWS <0.95), another area of high malaria transmission and the only South Asian country represented in this dataset, but did not reach the extremely high levels of within-host diversity (FWS <0.2) observed in some samples from Africa.
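The FWS threshold used above can be illustrated with a minimal sketch. It assumes the commonly used definition FWS = 1 − Hw/Hs, where Hw is the heterozygosity within a single infection (estimated from read counts) and Hs is the heterozygosity of the local parasite population. The ratio-of-sums estimator, the function names and the toy read counts are illustrative assumptions; this is not the procedure used to produce the values in this release.

```python
def heterozygosity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic site with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fws(ref_reads, alt_reads, population_alt_freq):
    """Simplified FWS = 1 - Hw/Hs for one sample.

    ref_reads, alt_reads: per-SNP read counts observed in the sample.
    population_alt_freq: per-SNP alternate allele frequency in the local population.
    """
    hw_total = 0.0
    hs_total = 0.0
    for ref, alt, pop_freq in zip(ref_reads, alt_reads, population_alt_freq):
        depth = ref + alt
        if depth == 0:
            continue  # no coverage at this site in this sample
        hw_total += heterozygosity(alt / depth)  # within-host heterozygosity
        hs_total += heterozygosity(pop_freq)     # population heterozygosity
    if hs_total == 0.0:
        raise ValueError("no informative sites: population heterozygosity is zero")
    return 1.0 - hw_total / hs_total


def is_predominantly_clonal(fws_value, threshold=0.95):
    """Apply the FWS > 0.95 rule of thumb described in the text."""
    return fws_value > threshold


# Toy data: three SNPs with hypothetical population allele frequencies.
pop_freq = [0.5, 0.3, 0.2]
clonal = fws([40, 0, 35], [0, 50, 0], pop_freq)   # every site is homozygous
mixed = fws([20, 50, 35], [20, 0, 0], pop_freq)   # first site has a 50/50 read mix
print(clonal, is_predominantly_clonal(clonal))
print(round(mixed, 3), is_predominantly_clonal(mixed))
```

In this toy example a sample with no within-host heterozygosity scores 1, while a single evenly mixed site is enough to pull the score well below the 0.95 threshold.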
The average nucleotide diversity across the global sample collection was 0.040% (median=0.028%), i.e. two randomly-selected samples differ by an average of 4 nucleotide positions per 10kb. Levels of nucleotide diversity vary greatly across the genome56 and also geographically (Figure 2b). Distributions of values were highest in Africa, followed by Bangladesh, but the scale of regional differences was relatively modest, ranging from an average of 0.030% in Eastern Southeast Asia to 0.040% in West Africa (median=0.019% and 0.028% respectively; Figure 2b). In other words, the nucleotide diversity of each regional parasite population was not much less than that of the global parasite population. This is consistent with the idea that the global P. falciparum population has a common African origin and that historically there must have been significant levels of migration.
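To make the units concrete, the following sketch computes nucleotide diversity as the mean proportion of sites that differ between pairs of sequences, and converts a percentage into differences per 10 kb. It assumes one haploid consensus sequence per sample with no missing data, which is a simplification; the function names and toy sequences are hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(haplotypes, n_sites):
    """Mean pairwise differences per site across all pairs of sequences."""
    if len(haplotypes) < 2:
        raise ValueError("need at least two sequences")
    n_pairs = 0
    n_diff = 0
    for h1, h2 in combinations(haplotypes, 2):
        n_pairs += 1
        n_diff += sum(a != b for a, b in zip(h1, h2))
    return n_diff / (n_pairs * n_sites)


def differences_per_10kb(pi):
    """Convert a per-site diversity (as a fraction) to differences per 10,000 bp."""
    return pi * 10_000


# Toy alignment of three 10-bp sequences.
toy = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAT"]
pi = nucleotide_diversity(toy, n_sites=10)
print(f"toy pi = {pi:.4f}")

# The global average quoted in the text: 0.040% expressed as a fraction.
print(f"{differences_per_10kb(0.040 / 100):.1f} differences per 10 kb")
```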
All regional sub-populations showed very low levels of linkage disequilibrium relative to human populations, e.g. r2 decayed to <0.1 within 500 bp (Figure 2c). As expected, African populations had the highest rates of LD decay, implying the highest levels of haplotype diversity.
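For readers less familiar with the statistic, r2 between two biallelic sites can be computed from haploid genotypes as follows. This is a minimal sketch that assumes clonal samples coded 0 (reference) or 1 (alternate) with no missing calls; the function name and toy genotypes are illustrative, and handling of mixed infections is not shown.

```python
def r_squared(site_a, site_b):
    """Squared correlation (r^2) between two biallelic sites.

    site_a, site_b: equal-length lists of 0/1 alleles, one entry per haploid sample.
    Returns NaN if either site is monomorphic.
    """
    n = len(site_a)
    if n == 0 or n != len(site_b):
        raise ValueError("sites must be non-empty and of equal length")
    p_a = sum(site_a) / n
    p_b = sum(site_b) / n
    # Frequency of haplotypes carrying the alternate allele at both sites.
    p_ab = sum(1 for a, b in zip(site_a, site_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


# Toy genotypes for four samples.
linked = r_squared([0, 0, 1, 1], [0, 0, 1, 1])    # alleles always co-occur
unlinked = r_squared([0, 0, 1, 1], [0, 1, 0, 1])  # alleles are independent
print(linked, unlinked)
```

An LD decay curve such as the one summarised above would be obtained by averaging this quantity over pairs of SNPs binned by physical distance.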
Geographic patterns of population differentiation and gene flow
Parasite sub-populations in different locations naturally tend to differentiate over time unless there is sufficient gene flow to counterbalance genetic drift. Genome-wide estimates of FST provide an indicator of this process of genetic differentiation, which is partly determined by geographic distance (Figure 3). For example, we observe much greater genetic differentiation between South America and South Asia (genome-wide average FST 0.22) or between Africa and Oceania (0.20) than between sub-regions within Asia (<0.1) or within Africa (<0.02).
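A genome-wide FST of this kind can be sketched from per-SNP allele frequencies in two populations. The estimator below is a simple heterozygosity-based form, FST = (HT − HS)/HT, combined across SNPs as a ratio of sums; it is an assumption chosen for illustration and may differ from the estimator used in the Methods. Function names and frequencies are hypothetical.

```python
def fst_components(p1, p2):
    """Return (HT - HS, HT) for one biallelic SNP, weighting both populations equally.

    p1, p2: alternate allele frequency in population 1 and population 2.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                        # pooled heterozygosity
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2   # mean within-population
    return h_total - h_within, h_total


def genome_wide_fst(freqs_pop1, freqs_pop2):
    """Combine SNPs as a ratio of sums rather than a mean of per-SNP ratios."""
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        num, den = fst_components(p1, p2)
        numerator += num
        denominator += den
    if denominator == 0:
        return float("nan")  # no polymorphic SNPs
    return numerator / denominator


# Toy examples: identical frequencies, then fixed differences.
same = genome_wide_fst([0.3, 0.5], [0.3, 0.5])
fixed = genome_wide_fst([0.0, 1.0], [1.0, 0.0])
print(same, fixed)
```

Summing numerators and denominators before dividing keeps rare variants, whose per-SNP ratios are noisy, from dominating the genome-wide value.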
These data reveal some interesting exceptions to the general rule that genome-wide FST is correlated with geographic distance. For example, African parasites are more strongly differentiated from Southeast Asian parasites (genome-wide average FST 0.20) than they are from parasites in Bangladesh (0.11), which borders Southeast Asia. If this is examined in more detail, there is an unexpectedly steep gradient of genetic differentiation at the geographical boundary between South Asia and Southeast Asia, i.e. parasites sampled in Myanmar and Western Thailand are much more strongly differentiated from parasites sampled in Bangladesh (genome-wide FST 0.07) than would be expected given that these are neighbouring countries. As discussed later, Southeast Asia is the global epicentre of antimalarial drug resistance, and these observations add to a growing body of evidence that Southeast Asian parasites have acquired a wide range of genomic features that are likely due to natural selection rather than genetic drift.23,40
It is noteworthy that the level of genetic differentiation between western and eastern parts of Southeast Asia (genome-wide FST 0.05) is greater than between West Africa and East Africa (0.02) although the geographic distances are much greater in Africa. This is likely due to the lower intensity of malaria transmission in Southeast Asia, and in particular the presence of a malaria-free corridor running through Thailand, which act as barriers to gene flow across the region.23,40
Genes with high levels of geographic differentiation
The FST metric can also be calculated for individual variants to identify specific genes that have acquired high levels of geographic differentiation relative to the genome as a whole. This can be done either at the global level (to identify variants that are highly differentiated between different regions of the world) or at the local level (to identify variants that are highly differentiated between different sampling locations within a region).
To identify variants that are strongly differentiated at the global level, we began by estimating FST for each SNP across all of the eight regional sub-populations. The group of SNPs with the highest global FST levels were found to be strongly enriched for non-synonymous mutations, suggesting that the process of differentiation is at least in part due to natural selection (Figure 4). After ranking all SNPs according to their global FST value, we calculated a global differentiation score for each gene based on the highest-ranking non-synonymous SNP within the gene (see Methods). All genes are ranked according to their global differentiation score in the accompanying data release, and those with the highest score are listed in Supplementary Table 7. The most highly differentiated gene, p47, is known to interact with the mosquito immune system59 and has two variants (S242L and V247A) that are at fixation in South America but absent in other geographic regions. Also among the five most highly differentiated genes are gig (implicated in gametocytogenesis60), pfs16 (expressed on the surface of gametes61) and ctrp (expressed on the ookinete cell surface and essential for mosquito infection62). Thus four of the five most highly differentiated parasite genes are involved in the process of transmission by the mosquito vector, raising the possibility that this reflects evolutionary adaptation of the P. falciparum population to the different Anopheles species that transmit malaria in different geographical regions.
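The gene-level ranking described above can be sketched as follows. The sketch ranks all SNPs by FST and assigns each gene the rank of its highest-ranking non-synonymous SNP. The exact transformation from rank to differentiation score is defined in the Methods and is not reproduced here, so the raw rank is used as a stand-in; the record layout and gene names are hypothetical.

```python
def best_nonsynonymous_rank(snps):
    """Rank SNPs by FST (highest first) and keep, for each gene, the best
    rank achieved by a non-synonymous SNP.

    snps: iterable of dicts with keys "gene", "fst" and "nonsynonymous".
    Returns a list of (gene, rank) tuples, most differentiated gene first.
    """
    ranked = sorted(snps, key=lambda snp: snp["fst"], reverse=True)
    best = {}
    for rank, snp in enumerate(ranked, start=1):
        # Synonymous SNPs occupy a rank but never set a gene's score.
        if snp["nonsynonymous"] and snp["gene"] not in best:
            best[snp["gene"]] = rank
    return sorted(best.items(), key=lambda item: item[1])


# Toy input with placeholder gene names.
toy_snps = [
    {"gene": "geneA", "fst": 0.9, "nonsynonymous": False},
    {"gene": "geneB", "fst": 0.8, "nonsynonymous": True},
    {"gene": "geneA", "fst": 0.5, "nonsynonymous": True},
    {"gene": "geneC", "fst": 0.1, "nonsynonymous": True},
]
gene_ranking = best_nonsynonymous_rank(toy_snps)
print(gene_ranking)
```

Here geneA contains the single most differentiated SNP, but because that SNP is synonymous the gene is ranked by its next-best, non-synonymous SNP.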
It is more difficult to characterise variants that are strongly differentiated at the local level, due to smaller sample sizes and various sources of sampling bias, but a crude estimate can be obtained by analysis of each of the six geographical regions with samples from multiple countries. FST was estimated for each SNP across different sampling locations within each geographical region, and the results for different regions were combined by a heuristic approach to obtain a local differentiation score for each gene (see Methods). A range of genes associated with drug resistance (crt, dhfr, dhps, kelch13, mdr1, mdr2 and fd) were in the top centile of local differentiation scores (Supplementary Figure 1, Supplementary Table 8, Supplementary Note).
Geographic patterns of drug resistance
Classification of samples based on markers of drug resistance
Antimalarial drug resistance represents a major focus of research for many partner studies within the Pf Community Project, and this dataset therefore contains a significant body of data that have appeared in previous reports on drug resistance. Readers are referred to these publications for more detailed analyses of local patterns of resistance9–14,16–22 and of resistance to specific drugs including chloroquine16,21, sulfadoxine-pyrimethamine16,19,21 and artemisinin combination therapy9–11,13–15,17,18,21,22.
Here we have classified all samples into different types of drug resistance based on published genetic markers and current knowledge of the molecular mechanisms (see www.malariagen.net/resource/26 for details of the heuristic used). Table 2 summarises the frequency of different types of drug resistance in samples from different geographical regions. Overall, we observed higher prevalence of samples classified as resistant in Southeast Asia than anywhere else, with multiple samples resistant to all drugs considered. Note that samples were collected over a relatively long time period (2002-15) during which there were major changes in global patterns of drug resistance, and that the sampling locations represented in a given year depended on which partner studies were operative at the time. To alleviate this problem we have also divided the data into samples collected before and after 2011 (Supplementary Table 10), but temporal trends in aggregated data should be interpreted with due caution.
Below we summarise the overall profile of drug resistance types in the regional sub-populations: this is intended simply to provide context for users of this dataset, and should not be regarded as a statement of the current epidemiological situation. The Supplementary Notes contain a more detailed description of the geographical distribution of haplotypes, CNV breakpoints, interactions between genes, and variants associated with less commonly used antimalarial drugs. In the accompanying data release we also identify samples with mdr1, plasmepsin2/3 and gch1 gene amplifications that can affect drug resistance.
Chloroquine resistance
Samples were classified as chloroquine resistant if they carried the crt 76T allele. As shown in Table 2, this was found in almost all samples from Southeast Asia, South America and Oceania. It was also found across Africa but at lower frequencies, particularly in East Africa where chloroquine resistance is known to have declined since chloroquine was discontinued63–65. Supplementary Table 11 shows the geographical distribution of different crt haplotypes (based on amino acid positions 72-76) which is consistent with the theory that chloroquine resistance spread from SE Asia to Africa with multiple independent origins in South America and Oceania66,67. The crt locus is also relevant to other types of drug resistance, e.g. crt variants that are relatively specific to SE Asia form the genetic background of artemisinin resistance, and newly emerging crt alleles have been associated with the spread of ACT failure due to piperaquine resistance13,14,22,68.
Sulfadoxine-pyrimethamine resistance
Clinical resistance to sulfadoxine-pyrimethamine (SP) is determined by multiple mutations and their interactions, so following current practice69 we classified SP resistant samples into four overlapping types: (i) carrying the dhfr 108N allele, associated with pyrimethamine resistance; (ii) carrying the dhps 437G allele, associated with sulfadoxine resistance; (iii) carrying the dhfr triple mutant, which is strongly associated with SP failure; (iv) carrying the dhfr/dhps sextuple mutant, which confers a higher level of SP resistance. As shown in Table 2, dhfr 108N was found in almost all samples in all regions apart from West Africa, while dhps 437G was at very high frequency throughout most of Africa and Asia, and at lower frequencies in South America and Oceania (see also Supplementary Table 12). Triple mutant dhfr parasites were common throughout Africa and Asia, whereas sextuple mutant dhfr/dhps parasites were at much lower frequency except in Western SE Asia. In the accompanying data release we also identify samples with gch1 gene amplifications (Supplementary Table 4) that can modulate SP resistance70, although their effect on the clinical outcome and interaction with mutations in dhfr and dhps is not fully established.
Resistance to artemisinin combination therapy
We classified samples as artemisinin resistant based on the World Health Organisation classification of non-synonymous mutations in the propeller region of the kelch13 gene that have been associated with delayed parasite clearance71. By this definition, artemisinin resistance was confined to Southeast Asia but, as previously reported, this dataset contains a substantial number of non-synonymous kelch13 propeller SNPs occurring at <5% frequency in Africa and elsewhere9. The most common ACT formulations in Southeast Asia are artesunate-mefloquine (AS-MQ) and dihydroartemisinin-piperaquine (DHA-PPQ). We classified samples as mefloquine resistant if they had mdr1 amplification72 or as piperaquine resistant if they had plasmepsin 2/3 amplification25. Mefloquine resistance was observed throughout SE Asia and was most common in the western part. Piperaquine resistance was confined to eastern SE Asia with a notable concentration in western Cambodia. Elsewhere11,13 we describe the KEL1/PLA1 lineage of artemisinin- and piperaquine-resistant parasites that expanded in western Cambodia during 2008-13, and then spread to other countries during 2013-18, causing high rates of DHA-PPQ treatment failure across eastern SE Asia: since the current dataset extends only to 2015 it captures only the first phase of the KEL1/PLA1 lineage expansion.
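The single-marker rules described in this section can be summarised in a short sketch. It encodes only the rules stated in the text (crt 76T, dhfr 108N, dhps 437G, a kelch13 mutation on the WHO list, mdr1 amplification and plasmepsin 2/3 amplification). The class and field names are hypothetical, the kelch13 marker list must be supplied by the caller, and missing or mixed genotype calls, as well as the dhfr triple and dhfr/dhps sextuple mutant types, are not handled. The full heuristic is documented at www.malariagen.net/resource/26.

```python
from dataclasses import dataclass, field


@dataclass
class SampleMarkers:
    """Hypothetical per-sample summary of the markers named in the text."""
    crt_76: str                      # amino acid at crt position 76
    dhfr_108: str                    # amino acid at dhfr position 108
    dhps_437: str                    # amino acid at dhps position 437
    kelch13_mutations: set = field(default_factory=set)
    mdr1_amplified: bool = False
    plasmepsin23_amplified: bool = False


def classify(sample, who_kelch13_markers):
    """Map one sample to a resistant / not-resistant call for six drugs.

    who_kelch13_markers: set of kelch13 propeller mutations that the caller
    treats as associated with delayed clearance (not hard-coded here).
    """
    return {
        "chloroquine": sample.crt_76 == "T",
        "pyrimethamine": sample.dhfr_108 == "N",
        "sulfadoxine": sample.dhps_437 == "G",
        "artemisinin": bool(sample.kelch13_mutations & set(who_kelch13_markers)),
        "mefloquine": sample.mdr1_amplified,
        "piperaquine": sample.plasmepsin23_amplified,
    }


# Toy sample; "EXAMPLE_MUT" is a placeholder label, not a real kelch13 mutation.
toy_sample = SampleMarkers(
    crt_76="T", dhfr_108="N", dhps_437="A",
    kelch13_mutations={"EXAMPLE_MUT"}, mdr1_amplified=True,
)
calls = classify(toy_sample, who_kelch13_markers={"EXAMPLE_MUT"})
print(calls)
```

Keeping the kelch13 list as an argument reflects the fact that the set of recognised markers is maintained externally and changes over time.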
HRP2/3 deletions that affect rapid diagnostic tests
Rapid diagnostic tests (RDTs) provide a simple and inexpensive way to test for parasites in the blood of patients who are suspected to have malaria, and have become a vital tool for malaria control73,74. The most widely used RDTs are designed to detect P. falciparum histidine-rich protein 2 and cross-react with histidine-rich protein 3, encoded by the hrp2 and hrp3 genes respectively. Parasites with gene deletions of hrp2 and/or hrp3 have emerged as an important cause of RDT failure in a number of locations75–79. It is difficult to devise a simple genetic assay to monitor for risk of RDT failure because hrp2 and hrp3 deletions comprise a diverse mixture of large structural variations with multiple independent origins, and both genes are located in subtelomeric regions of the genome with very high levels of natural variation29,80–83. In the absence of a well-validated algorithmic method, we visually inspected sequence read coverage and identified samples with clear evidence of large structural variants that disrupted or deleted the hrp2 and hrp3 genes. We took a conservative approach: samples that appeared to have a mixture of deleted and non-deleted genotypes were classified as non-deleted.
Deletions were found at relatively high frequency in Peru (8 of 21 samples had hrp2 deletions, 14 had hrp3 deletions and 6 had both) but were not seen in samples from Colombia and were relatively rare outside South America. Oceania was the only other region where we observed hrp2 deletions, but at very low frequency (4%, n=3/80), and also had hrp3 deletions (25%) though no combined deletions were seen. Deletions of hrp3 only were more geographically widespread than hrp2 deletions, being common in Ethiopia (43%, n=9/21) and in Senegal (7%, n=6/84), and at relatively low frequency (<5%) in Kenya, Cambodia, Laos, and Vietnam (Supplementary Table 13). Note that these findings might under-estimate the true prevalence of hrp2/hrp3 deletions, due to sampling bias (our samples were primarily collected from RDT-positive cases) and also because we focused on large structural variants and did not consider polymorphisms that might also cause RDT failure but would require more sophisticated analytical approaches. There is a need for more reliable diagnostics of hrp2 and hrp3 deletions, and we hope that these open data will accelerate this important area of applied methodological research.
Discussion
This open dataset comprises sequence reads and genotype calls on over 7,000 P. falciparum samples from MalariaGEN partner studies in 28 countries. After excluding variants and samples that failed to meet stringent quality control criteria, the dataset contains high-quality genotype calls for 3 million polymorphisms including SNPs, indels, CNVs and large structural variations, in almost 6,000 samples. The data can be analysed in their entirety or can be filtered to select for specific genes, or geographical locations, or samples with particular genotypes. This is approximately twice the sample size of our previous consortial publication9 and is the largest available data resource for analysis of P. falciparum population structure, gene flow and evolutionary adaptation. Each sample has been annotated to show its profile of resistance to six major antimalarial drugs and whether it carries structural variations that can cause RDT failure. The classification scheme is heuristic and based on a subset of known genetic markers, so it should not be treated as a failsafe predictor of the phenotype of a particular sample. Our purpose in providing these annotations is to make it easy for users without specialist training in genetics to explore the global dataset and to analyse any subset of samples for key features that are relevant to malaria control.
An important function of this curated dataset is to provide information on the provenance and key features of samples associated with each partner study, thus allowing the findings reported in different publications to be linked and compared. Data produced by the Pf Community Project have been analysed in more than 50 publications (refs 5-55) and a few examples will serve to illustrate the diverse ways in which the data are being used. An analysis of samples collected across Africa by Amambua-Ngwa, Djimde and colleagues found evidence that parasite population structure overlaps with historical patterns of human migration and that the P. falciparum population in Ethiopia is significantly diverged from other parts of the continent.12 A series of studies by Amato, Miotto and colleagues have documented the evolution of a multidrug-resistant lineage of P. falciparum that originated in Western Cambodia over ten years ago and is now expanding rapidly across Southeast Asia, acquiring additional resistance mutations as it spreads.11,13,14 McVean and colleagues have developed a computational method for deconvolution of the haplotypic structure of mixed infections, allowing analysis of the pedigree structure of parasites that are cotransmitted by the same mosquito.49 Bahlo and colleagues have developed a different haplotype-based method to describe the relatedness structure of the parasite population and to identify new genomic loci with evidence of recent positive selection.50
A recent report from the World Health Organisation highlights the need for improved surveillance systems in sustaining malaria control and achieving the long-term goal of malaria eradication.84 To be of practical value for national malaria control programmes, genetic data must address well-defined use cases and be readily accessible.85 Amplicon sequencing technologies provide a powerful new tool for targeted genotyping that could feasibly be implemented locally in malaria-endemic countries 86,87, but there remains a need for the international malaria control community to generate and share whole genome sequencing data, e.g. to monitor for newly emerging forms of drug resistance and to understand regional patterns of parasite migration. The next generation of long-read sequencing technologies will improve the precision of population genomic inference, e.g. by enabling analysis of hypervariable regions of the genome, and of pedigree structures within mixed infections. The accuracy with which the resistance phenotype of a sample can be predicted from genome sequencing data will also improve as we gain better functional understanding of the polygenic determinants of drug resistance.
Thus the next few years are likely to see major advances in both the scale and information content of parasite genomic data. The practical value for malaria control will be greatly enhanced by the progressive acquisition of longitudinal time-series data, particularly if this is linked to other sources of epidemiological data and translated into reliable, actionable information with sufficient rapidity to allow control programmes to monitor the impact of their interventions on the parasite population in near real time. The Pf Community Project provides proof of concept that systems can be developed for groups in different countries to share data, to analyse it using standardised methods, and to make it readily accessible to other researchers and the malaria control community.
Methods
All samples in this study were derived from blood samples obtained from patients with P. falciparum malaria, collected with informed consent from the patient or a parent or guardian. At each location, sample collection was approved by the appropriate local and institutional ethics committees. The following local and institutional committees gave ethical approval for the partner studies: Human Research Ethics Committee of the Northern Territory Department of Health & Families and Menzies School of Health Research, Darwin, Australia; National Research Ethics Committee of Bangladesh Medical Research Council, Bangladesh; Comite d’Ethique de la Recherche - Institut des Sciences Biomedicales Appliquees, Benin; Ministere de la Sante – Republique du Benin, Benin; Comité d’Éthique, Ministère de la Santé, Bobo-Dioulasso, Burkina Faso; Institutional Review Board Centre Muraz, Burkina Faso; Ministry of Health National Ethics Committee for Health Research, Cambodia; Institutional Review Board University of Buea, Cameroon; Comite Institucional de Etica de investigaciones en humanos de CIDEIM, Colombia; Comité National d’Ethique de la Recherche, Cote d’Ivoire; Comite d’Ethique Universite de Kinshasa, Democratic Republic of Congo; Armauer Hansen Research Institute Institutional Review Board, Ethiopia; Addis Ababa University, Aklilu Lemma Institute of Pathobiology Institutional Review Board, Ethiopia; Kintampo Health Research Centre Institutional Ethics Committee, Ghana; Ghana Health Service Ethical Review Committee, Ghana; University of Ghana Noguchi Medical Research Institute, Ghana; Navrongo Health Research Centre Institutional Review Board, Ghana; Comite d’Ethique National Pour la Recherché en Santé, Republique de Guinee; Indian Council of Medical Research, India; Eijkman Institute Research Ethics Commission, Eijkman Institute for Molecular Biology, Jakarta, Indonesia; KEMRI Scientific and Ethics Review Unit, Kenya; Ministry of Health National Ethics Committee For Health 
Research, Laos; Ethical Review Committee of University of Ilorin Teaching Hospital, Nigeria; Comité National d’Ethique auprès du Ministère de la Santé Publique, Madagascar; College of Medicine Regional Ethics Committee University of Malawi, Malawi; Faculté de Médecine, de Pharmacie et d’Odonto-Stomatologie, University of Bamako, Bamako, Mali; Ethics Committee of the Ministry of Health, Mali; Ethics Committee of the Ministry of Health, Mauritania; Department of Medical Research (Lower Myanmar); Ministry of Health, Government of The Republic of the Union of Myanmar; Institutional Review Board, Papua New Guinea Institute of Medical Research, Goroka, Papua New Guinea; PNG Medical Research Advisory Council (MRAC), Papua New Guinea; Institutional Review Board, Universidad Nacional de la Amazonia Peruana, Iquitos, Peru; Ethics Committee of the Ministry of Health, Senegal; National Institute for Medical Research and Ministry of Health and Social Welfare, Tanzania; Medical Research Coordinating Committee of the National Institute for Medical Research, Tanzania; Ethics Committee, Faculty of Tropical Medicine, Mahidol University, Bangkok, Thailand; Ethics Committee at Institute for the Development of Human Research Protections, Thailand; Gambia Government/MRC Joint Ethics Committee, Banjul, The Gambia; London School of Hygiene and Tropical Medicine Ethics Committee, London, UK; Oxford Tropical Research Ethics Committee, Oxford, UK; Walter Reed Army Institute of Research, USA; National Institute of Allergy and Infectious Diseases, Bethesda, MD, USA; Ethical Committee, Hospital for Tropical Diseases, Ho Chi Minh City, Vietnam; Ministry of Health Institute of Malariology-Parasitology-Entomology, Vietnam.
Standard laboratory protocols were used to determine DNA quantity and proportion of human DNA in each sample as previously described7,56.
Here we summarise the bioinformatics methods used to produce and analyse the data; all the details are available at www.malariagen.net/resource/26.
Reads mapping to the human reference genome were discarded before all analyses, and the remaining reads were mapped to the P. falciparum 3D7 v3 reference genome using bwa mem88. “Improved” BAMs were created using the Picard tools CleanSam, FixMateInformation and MarkDuplicates and GATK base quality score recalibration. All lanes for each sample were merged to create sample-level BAM files.
We discovered potential SNPs and indels by running GATK’s HaplotypeCaller89 independently across each of the 7,182 sample-level BAM files and genotyped these for each of the 16 reference sequences (14 chromosomes, 1 apicoplast and 1 mitochondrion) using GATK’s CombineGVCFs and GenotypeGVCFs.
SNPs and indels were filtered using GATK’s Variant Quality Score Recalibration (VQSR). Variants with a VQSLOD score ≤ 0 were filtered out. Functional annotations were applied using snpEff90 version 4.1. Genome regions were annotated using vcftools and masked if they were outside the core genome. Unless otherwise specified, we used biallelic SNPs that pass all quality filters for all analyses.
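As an illustration of the filtering rules above, the following minimal sketch keeps only biallelic SNPs with a VQSLOD score above 0 that lie in the core genome. The `Variant` record, its field names and the example values are hypothetical stand-ins for parsed VCF records; this is not the pipeline's own code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    """Hypothetical, simplified stand-in for one parsed VCF record."""
    chrom: str
    pos: int
    ref: str
    alts: List[str]
    vqslod: float
    in_core_genome: bool


def is_biallelic_snp(v: Variant) -> bool:
    # Exactly one alternate allele, and both alleles are single nucleotides.
    return (
        len(v.alts) == 1
        and v.ref in ("A", "C", "G", "T")
        and v.alts[0] in ("A", "C", "G", "T")
    )


def passes_filters(v: Variant) -> bool:
    # Variants with VQSLOD <= 0 are filtered out, and regions outside
    # the core genome are masked.
    return v.vqslod > 0 and v.in_core_genome


def analysis_snps(variants: List[Variant]) -> List[Variant]:
    return [v for v in variants if is_biallelic_snp(v) and passes_filters(v)]


# Made-up records, one per reason for exclusion.
example = [
    Variant("chr_example", 100, "A", ["T"], 2.5, True),       # kept
    Variant("chr_example", 200, "A", ["T"], 0.0, True),       # VQSLOD <= 0
    Variant("chr_example", 300, "A", ["T", "G"], 3.0, True),  # multiallelic
    Variant("chr_example", 400, "AT", ["A"], 4.0, True),      # indel
    Variant("chr_example", 500, "C", ["G"], 1.0, False),      # outside core genome
]
kept = analysis_snps(example)
```

Keeping the SNP test separate from the quality test makes it straightforward to reuse the quality filter alone for analyses that include indels.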
We removed 69 samples from lab studies to create the release VCF files, which contain 7,113 samples. VCF files were converted to Zarr format, and subsequent analyses were mainly performed using scikit-allel (https://github.com/cggh/scikit-allel) and the Zarr files.
We identified species using nucleotide sequence from reads mapping to six different loci in the mitochondrial genome, using custom Java code (https://github.com/malariagen/GeneticReportCard). The loci were located within the cox3 gene (PF3D7_MIT01400), as described in a previously published species detection method.91 Alleles at various mitochondrial positions within the six loci were genotyped and used for classification as shown in Supplementary Table 14.
We created a final analysis set of 5,970 samples after removing replicates, low-coverage samples, samples with suspected contamination or mislabelling, and mixed-species samples.
We calculated genetic distance between samples from biallelic SNPs that pass filters, using a method previously described9. In addition to calculating genetic distance between all pairs of samples from the current data set, we also calculated the genetic distance between each sample and the lab strains 3D7, 7G8, GB4, HB3 and Dd2 from the Pf3k project (www.malariagen.net/projects/pf3k).
The matrix of genetic distances was used to generate neighbour-joining trees and principal coordinates. Based on these observations we grouped the samples into eight geographic regions: South America, West Africa, Central Africa, East Africa, South Asia, the western part of Southeast Asia, the eastern part of Southeast Asia and Oceania, with samples assigned to region based on the geographic location of the sampling site. Five samples from returning travellers were assigned to region based on the reported country of travel.
FWS was calculated with custom Python scripts using the method previously described7. Nucleotide diversity (π) was calculated in non-overlapping 25 kbp genomic windows, considering only coding biallelic SNPs to reduce the ascertainment bias caused by poor accessibility of non-coding regions. LD decay (r²) was calculated using the method of Rogers and Huff, restricted to biallelic SNPs with low missingness and regional allele frequency >10%. Mean FST between populations was calculated using Hudson’s method.
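To make two of these statistics concrete, the sketch below computes windowed nucleotide diversity and Hudson's FST (as a ratio of averages over SNPs) from per-SNP allele counts. It is a simplified illustration and not the analysis code: it assumes one called allele per sample at each biallelic SNP, and it divides π by the full window length rather than by the number of accessible coding bases. All function names are illustrative.

```python
from typing import Dict, List, Tuple

# (reference allele count, alternate allele count) at one biallelic SNP
Counts = Tuple[int, int]


def mean_pairwise_difference(counts: Counts) -> float:
    """Fraction of sample pairs carrying different alleles at one SNP."""
    n = sum(counts)
    if n < 2:
        return 0.0
    n_pairs = n * (n - 1) / 2
    n_same = sum(c * (c - 1) / 2 for c in counts)
    return (n_pairs - n_same) / n_pairs


def windowed_pi(snps: List[Tuple[int, Counts]], window: int = 25_000) -> Dict[int, float]:
    """Per-base nucleotide diversity in non-overlapping windows.

    `snps` holds (1-based position, allele counts). Windows are keyed by
    their 1-based start coordinate; windows without SNPs are omitted.
    """
    totals: Dict[int, float] = {}
    for pos, counts in snps:
        start = (pos - 1) // window * window + 1
        totals[start] = totals.get(start, 0.0) + mean_pairwise_difference(counts)
    return {start: total / window for start, total in totals.items()}


def hudson_fst(pop1: List[Counts], pop2: List[Counts]) -> float:
    """Hudson's FST: sum(between - within) / sum(between) over SNPs."""
    num_sum = den_sum = 0.0
    for c1, c2 in zip(pop1, pop2):
        n1, n2 = sum(c1), sum(c2)
        if n1 < 2 or n2 < 2:
            continue  # a within-population diversity cannot be estimated
        within = (mean_pairwise_difference(c1) + mean_pairwise_difference(c2)) / 2
        # Probability that one allele drawn from each population differs.
        between = (c1[0] * c2[1] + c1[1] * c2[0]) / (n1 * n2)
        num_sum += between - within
        den_sum += between
    return num_sum / den_sum if den_sum > 0 else float("nan")
```

Summing numerators and denominators before dividing, rather than averaging per-SNP ratios, keeps rare variants from dominating the estimate. Note that the estimator can be slightly negative when two populations have near-identical allele frequencies.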
Allele frequencies stratified by geographic regions and sampling sites were calculated using the genotype calls produced by GATK. FST was calculated between all 8 regions, and also between all sites with at least 25 QC pass samples. FST between different locations for individual SNPs was calculated using Weir and Cockerham’s method.
We used two complementary methods to determine tandem duplication genotypes around mdr1, plasmepsin2/3 and gch1, namely a coverage-based method and a method based on position and orientation of reads near discovered duplication breakpoints. In brief, the outline algorithm is: (1) determine copy number at each locus using a coverage-based hidden Markov model (HMM); (2) determine breakpoints of identified duplications by manual inspection of reads and face-away read pairs around all sets of breakpoints; (3) for each locus in each sample, initially set copy number to that determined by the HMM if ≤ 10 CNVs were discovered in total, else consider it undetermined; (4) if face-away pairs provide self-sufficient evidence for the presence or absence of the amplification, override the HMM call; (5) for each locus in each sample, set the breakpoint to be that with the highest proportion of face-away reads.
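Steps (3) to (5) of the outline can be sketched as a small decision function. This is a hedged illustration only: the function and parameter names are hypothetical, the threshold of 10 is read as the total number of CNVs the HMM reports for a sample (an assumption), and the judgement of whether face-away evidence is "self-sufficient" is taken as an input rather than modelled.

```python
from typing import Dict, Optional

MAX_CNVS_PER_SAMPLE = 10  # threshold from step (3)


def call_amplification(
    hmm_copy_number: int,
    total_cnvs_in_sample: int,
    face_away_evidence: Optional[bool],
) -> Optional[bool]:
    """Return True (amplified), False (not amplified) or None (undetermined).

    `face_away_evidence` is True or False when face-away read pairs alone
    are judged sufficient to show presence or absence of the amplification,
    and None when they are not conclusive.
    """
    # Step (3): accept the HMM call only if the sample has few CNV calls.
    if total_cnvs_in_sample <= MAX_CNVS_PER_SAMPLE:
        call: Optional[bool] = hmm_copy_number > 1
    else:
        call = None
    # Step (4): self-sufficient face-away evidence overrides the HMM call.
    if face_away_evidence is not None:
        call = face_away_evidence
    return call


def choose_breakpoint(face_away_proportion: Dict[str, float]) -> Optional[str]:
    """Step (5): pick the breakpoint with the highest face-away proportion."""
    if not face_away_proportion:
        return None
    return max(face_away_proportion, key=lambda name: face_away_proportion[name])
```

Returning `None` for undetermined calls keeps the three outcomes distinct, so a noisy sample is never silently counted as lacking an amplification.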
We genotyped deletions in hrp2 and hrp3 by manual inspection of sequence read coverage plots.
The procedure used to map genetic markers to inferred resistance status classification is described in the details for each drug in the accompanying data release (https://www.malariagen.net/resource/26).
In brief, we called amino acids at selected loci by first determining the reference amino acids and then, for each sample, applying all variations using the GT field of the VCF file. The amino acid and copy number calls generated were used to classify all samples into different types of drug resistance. Our methods of classification were heuristic and based on the available data and current knowledge of the molecular mechanisms. Each type of resistance was considered to be either present, absent or unknown for a given sample.
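The amino acid calling step can be illustrated with a minimal sketch that starts from a reference codon, applies a sample's SNP calls according to the GT field, translates the result, and maps it to a present/absent/unknown status. The data layout and function names are hypothetical, and several simplifications are assumed: only SNPs within a single forward-strand codon are handled, and missing or heterozygous genotypes are left undetermined. The actual per-drug rules are those given in the data release.

```python
from typing import Dict, Optional, Set, Tuple

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Offset within the codon (0-2) -> (GT string, alternate base).
SampleCalls = Dict[int, Tuple[str, str]]


def call_amino_acid(ref_codon: str, calls: SampleCalls) -> Optional[str]:
    """Apply a sample's SNP genotypes to the reference codon and translate."""
    codon = list(ref_codon)
    for offset, (gt, alt) in calls.items():
        alleles = set(gt.replace("|", "/").split("/"))
        if alleles == {"0"}:
            continue             # homozygous reference: keep reference base
        if alleles == {"1"}:
            codon[offset] = alt  # homozygous alternate: substitute the base
        else:
            return None          # missing or heterozygous: undetermined here
    return CODON_TABLE["".join(codon)]


def classify(amino_acid: Optional[str], resistance_alleles: Set[str]) -> str:
    """Map an amino acid call to 'present', 'absent' or 'unknown'."""
    if amino_acid is None:
        return "unknown"
    return "present" if amino_acid in resistance_alleles else "absent"
```

For example, a reference codon `AAA` (lysine) with a homozygous alternate `C` at the second position becomes `ACA` (threonine), which `classify` would report as "present" if threonine were the resistance-associated allele at that position.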
We defined the global differentiation score for a gene as , where N is the rank of the non-synonymous SNP with the highest global FST value within that gene. To define the local differentiation score, we first calculated, for each region containing multiple sites (WAF, EAF, SAS, WSEA, ESEA and OCE), FST for each SNP between sites within that region. For each gene, we then calculated the rank of the highest-FST non-synonymous SNP within that gene for each of the six regions. We defined the local differentiation score for each gene using the second highest of these six ranks (N), to ensure that the gene was highly ranked in at least two populations, i.e. to minimise the chance of artefactually ranking a gene highly due to a single variant in a single population. The final local differentiation score was normalised to ensure that the range of possible scores was between 0 and 1; the local differentiation score was defined as .
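The rank bookkeeping behind these scores can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: SNPs are ranked by descending FST among the non-synonymous SNPs supplied (rank 1 is the highest FST), and "second highest" rank is read as the second-best rank across regions. The conversion of the rank N into a normalised score is not shown, and the gene identifiers are placeholders.

```python
from typing import Dict, List, Tuple

# (gene identifier, FST) for each non-synonymous SNP
SnpFst = Tuple[str, float]


def best_rank_per_gene(snps: List[SnpFst]) -> Dict[str, int]:
    """Rank SNPs by descending FST (1 = highest); keep each gene's best rank."""
    ordered = sorted(snps, key=lambda snp: snp[1], reverse=True)
    best: Dict[str, int] = {}
    for rank, (gene, _fst) in enumerate(ordered, start=1):
        best.setdefault(gene, rank)  # the first occurrence is the best rank
    return best


def second_best_rank(ranks_by_region: Dict[str, Dict[str, int]], gene: str) -> int:
    """Second-best rank of a gene across regions.

    Using the second-best rank means a gene must be highly ranked in at
    least two regions to obtain a small N.
    """
    ranks = sorted(r[gene] for r in ranks_by_region.values() if gene in r)
    if len(ranks) < 2:
        raise ValueError("gene must be ranked in at least two regions")
    return ranks[1]
```

A gene ranked 1 in one region but 7 and 40 in two others would therefore get N = 7, so a single extreme variant in a single population cannot place it at the top of the list.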
Authors
MalariaGEN Plasmodium falciparum Community Project
Data analysis group
Pearson, RD1, 2, *, Amato, R1, 2, *, Hamilton, WL1, 3, Almagro-Garcia, J1, 2, Chookajorn, T4, Kochakarn, T1, 4, Miotto, O1, 2, 5, Kwiatkowski, DP1, 2, 6
*Joint analysis lead
Local study design, implementation and sample collection
Ahouidi, A7, Amambua-Ngwa, A8, Amaratunga, C9, Amenga-Etego, L10, 11, Andagalu, B12, Anderson, TJ13, Andrianaranjaka, V14, Apinjoh, T15, Ashley, E5, Auburn, S16, 17, Awandare, G11, 18, Ba, H19, Baraka, V20, 21, Barry, AE22, 23, 24, Bejon, P25, Bertin, GI26, Boni, MF17, 27, Borrmann, S28, Bousema, T29, 30, Branch, O31, Bull, PC25, 32, Chotivanich, K4, Claessens, A8, Conway, D29, Craig, A33, 34, d’Alessandro, U8, Dama, S35, Day, N5, Diakite, M35, Djimdé, A35, Dolecek, C17, Dondorp, A5, Drakeley, C29, Duffy, P9, Echeverry, DF36, 37, Egwang, TG38, Erko, B39, Fairhurst, RM40, Faiz, A41, Fanello, CA5, Fukuda, MM42, Gamboa, D43, Ghansah, A44, Golassa, L39, Harrison, GLA24, Hien, TT27, 45, Hill, CA46, Hodgson, A47, Imwong, M4, Ishengoma, DS20, 48, Jackson, SA49, Kamaliddin, C26, Kamau, E50, Konaté, A51, Kyaw, MP52, 53, Lim, P54, 9, Lon, C42, Loua, KM55, Maïga-Ascofaré, O35, 56, 57, Marfurt, J16, Marsh, K17, 58, Mayxay, M59, 60, Mobegi, V61, Mokuolu, OA62, Montgomery, J63, Mueller, I24, 64, Newton, PN65, Nguyen, T27, Noedl, H66, Nosten, F17, 67, Noviyanti, R68, Nzila, A69, Ochola-Oyier, LI25, Ocholla, H70, 71, 72, Oduro, A10, Onyamboko, MA73, Ouedraogo, J74, Peshu, N25, Phyo, AP5, 67, Plowe, CV75, Price, RN16, 45, 5, Pukrittayakamee, S4, Randrianarivelojosia, M14, 76, Rayner, JC1, Ringwald, P77, Ruiz, L78, Saunders, D42, Shayo, A79, Siba, P80, Su, X9, Sutherland, C29, Takala-Harrison, S81, Tavul, L80, Thathy, V25, Tshefu, A82, Verra, F83, Vinetz, J43, 84, Wellems, TE9, Wendler, J6, White, NJ5, Yavo, W51, 85, Ye, H86
Sequencing, data production and informatics
Pearson, RD1, 2, Stalker, J1, Ali, M1, Amato, R1, 2, Ariani, C1, Busby, G2, Drury, E1, Hart, L2, Hubbart, C6, Jacob, C1, Jeffery, B2, Jeffreys, AE6, Jyothi, D1, Kekre, M1, Kluczynski, K2, Malangone, C1, Manske, M1, Miles, A1, 2, Nguyen, T1, Rowlands, K6, Wright, I2, Goncalves, S1, Rockett, KA1, 6
Partner study support and coordination
Simpson, VJ2, Miotto, O1, 2, 5, Amato, R1, 2, Goncalves, S1, Henrichs, C2, Johnson, KJ2, Pearson, RD1, 2, Rockett, KA1, 6, Kwiatkowski, DP1, 2, 6
Affiliations
1Wellcome Sanger Institute, Hinxton, UK
2MRC Centre for Genomics and Global Health, Big Data Institute, University of Oxford, Oxford, UK
3Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK
4Mahidol University, Bangkok, Thailand
5Mahidol-Oxford Tropical Medicine Research Unit (MORU), Thailand
6Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
7Hopital Le Dantec, Universite Cheikh Anta Diop, Dakar, Senegal
8Medical Research Council Unit, The Gambia
9National Institute of Allergy and Infectious Diseases (NIAID), NIH, USA
10Navrongo Health Research Centre, Ghana
11West African Centre for Cell Biology of Infectious Pathogens (WACCBIP)
12United States Army Medical Research Directorate-Africa, Kenya Medical Research Institute/Walter Reed Project, Kisumu, Kenya
13Texas Biomedical Research Institute, San Antonio, USA
14Institut Pasteur de Madagascar
15University of Buea, Cameroon
16Menzies School of Health Research, Darwin, Australia
17Nuffield Department of Medicine, University of Oxford, UK
18University of Ghana, Legon, Ghana
19Institut National de Recherche en Santé Publique, Nouakchott, Mauritania
20National Institute for Medical Research (NIMR), United Republic of Tanzania
21Department of Epidemiology, International Health Unit, Universiteit Antwerpen, Belgium
22Deakin University, Australia
23Burnet Institute, Australia
24Walter and Eliza Hall Institute, Australia
25KEMRI Wellcome Trust Research Programme, Kenya
26Research Institute for Development, France
27Oxford University Clinical Research Unit (OUCRU), Vietnam
28Institute for Tropical Medicine, University of Tübingen, Germany
29London School of Hygiene and Tropical Medicine, UK
30Radboud University Medical Center, The Netherlands
31NYU School of Medicine Langone Medical Center, USA
32Department of Pathology, University of Cambridge, UK
33Liverpool School of Tropical Medicine, UK
34Malawi-Liverpool-Wellcome Trust Clinical Research Programme
35Malaria Research and Training Centre, University of Science, Techniques and Technologies of Bamako, Mali
36Centro Internacional de Entrenamiento e Investigaciones Médicas - CIDEIM, Cali, Colombia
37Universidad Icesi, Cali, Colombia
38Biotech Laboratories, Uganda
39Aklilu Lemma Institute of Pathobiology, Addis Ababa University, Ethiopia
40National Institutes of Health (NIH), USA
41Dev Care Foundation, Dhaka, Bangladesh
42Department of Immunology and Medicine, US Army Medical Component, Armed Forces Research Institute of Medical Sciences (USAMC-AFRIMS), Bangkok, Thailand
43Laboratorio ICEMR-Amazonia, Laboratorios de Investigacion y Desarrollo, Facultad de Ciencias y Filosofia, Universidad Peruana Cayetano Heredia, Lima, Peru
44Noguchi Memorial Institute for Medical Research, Legon-Accra, Ghana
45Centre for Tropical Medicine and Global Health, University of Oxford, UK
46Department of Entomology, Purdue University, West Lafayette, USA
47Ghana Health Service, Ministry of Health, Ghana
48East African Consortium for Clinical Research (EACCR), United Republic of Tanzania
49Center for Applied Genetic Technologies, University of Georgia, Athens, GA, USA
50Walter Reed Army Institute of Research, U.S. Military HIV Research Program, Silver Spring, MD, USA
51University Félix Houphouët-Boigny, Côte d’Ivoire
52The Myanmar Oxford Clinical Research Unit, University of Oxford, Myanmar
53University of Public Health, Yangon, Myanmar
54Medical Care Development International, Maryland, USA
55Institut National de Santé Publique, Conakry, Republic of Guinea
56Bernhard Nocht Institute for Tropical Medicine, Germany
57Research in Tropical Medicine, Kwame Nkrumah University of Sciences and Technology, Kumasi, Ghana
58African Academy of Sciences, Kenya
59Wellcome Trust-Mahosot Hospital Oxford University Medicine Research Collaboration (LOMWRU)
60Faculty of Postgraduate Studies, University of Health Sciences (UHS), Vientiane, Laos
61School of Medicine, University of Nairobi, Kenya
62Department of Paediatrics and Child Health, University of Ilorin, Ilorin, Nigeria
63Institute of Vector-Borne Disease, Monash University, Clayton, Victoria, 3800, Australia
64Barcelona Centre for International Health Research, Spain
65Wellcome Trust-Mahosot Hospital-Oxford Tropical Medicine Research Collaboration, Lao PDR
66Malaria Research Initiative Bandarban (MARIB), Bangladesh
67Shoklo Malaria Research Unit
68Eijkman Institute for Molecular Biology, Indonesia
69King Fahd University of Petroleum and Minerals (KFUPM), Saudi Arabia
70Malaria Capacity Development Consortium
71KEMRI - Centres for Disease Control and Prevention (CDC) Research Program, Kisumu, Kenya
72Centre for Bioinformatics and Biotechnology, University of Nairobi, Kenya
73Kinshasa School of Public Health, University of Kinshasa, DRC
74Institut de Recherche en Sciences de la Santé, Burkina Faso
75Duke Global Health Institute, Duke University
76Universités d’Antananarivo et de Mahajanga, Madagascar
77World Health Organization (WHO), Switzerland
78Universidad Nacional de la Amazonia Peruana, Peru
79Nelson Mandela Institute of Science and Technology, Tanzania
80Papua New Guinea Institute of Medical Research, PNG
81Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD, USA
82University of Kinshasa, DRC
83Sapienza University of Rome, Italy
84Yale School of Medicine, New Haven, CT
85Malaria Research and Control Center of the National Institute of Public Health, Côte d’Ivoire
86Department of Medical Research, Myanmar
Acknowledgements
This study was conducted by the MalariaGEN Plasmodium falciparum Community Project, and was made possible by clinical parasite samples contributed by partner studies, whose investigators are represented in the author list and in the associated data release (https://www.malariagen.net/resource/26). In addition, the authors would like to thank the following individuals who contributed to partner studies, making this study possible: Dr Eugene Laman for work in sample collection in the Republic of Guinea; Dr Abderahmane Tandia and Dr Yacine Deh for work in sample collection in Mauritania; Dr Ibrahim Sanogo for work in sample collection in Mali. Genome sequencing was undertaken by the Wellcome Sanger Institute and we thank the staff of the Wellcome Sanger Institute Sample Logistics, Sequencing, and Informatics facilities for their contribution. The sequencing, analysis, informatics and management of the Community Project are supported by Wellcome through Sanger Institute core funding (098051), a Strategic Award (090770/Z/09/Z) and the Wellcome Centre for Human Genetics core funding (203141/Z/16/Z), and by the MRC Centre for Genomics and Global Health which is jointly funded by the Medical Research Council and the Department for International Development (DFID) (G0600718; M006212). The views expressed here are solely those of the authors and do not reflect the views, policies or positions of the U.S. Government or Department of Defense. Material has been reviewed by the Walter Reed Army Institute of Research. There is no objection to its presentation and/or publication. The opinions or assertions contained herein are the private views of the author, and are not to be construed as official, or as reflecting true views of the Department of the Army or the Department of Defense. The investigators have adhered to the policies for protection of human subjects as prescribed in AR 70–25.
Footnotes
* The full list of authors appears at the end of the manuscript