Abstract
Individuals infected with the Plasmodium falciparum malaria parasite can carry multiple strains with varying levels of relatedness. Yet, how parameters of local epidemiology and the biology of transmission affect the rate and relatedness of such mixed infections remains unclear. Here, we develop an enhanced method for strain deconvolution from genome sequencing data, which estimates the number of strains, their proportions, identity-by-descent (IBD) profiles and individual haplotypes. We validate the method through experimental and in silico simulations and apply it to the Pf3k data set, consisting of 2,344 field samples from 13 countries. We find that the rate of mixed infection varies from 18% to 63% across countries and that 51% of all mixed infections involve more than two strains. By modelling the structure of IBD resulting from different infection mechanisms we estimate that 55% of dual infections contain sibling strains likely to have been co-transmitted from a single mosquito, and find evidence of mixed infections propagated over successive infection cycles. By combining genetic data with epidemiological estimates of prevalence from the Malaria Atlas Project, we find that, at the country level, prevalence correlates with both the rate of mixed infection (Pearson r = 0.65, P = 3.7 × 10−6) and the level of IBD (r = −0.51, P = 6.0 × 10−4). Genomics is becoming a standard tool in pathogen surveillance. In this work, we conclude that monitoring fine-scale patterns of mixed infections and within-sample relatedness will be highly informative for assessing the impact of interventions and informing malaria control programs.
1 Introduction
Individuals infected with malaria-causing parasites of the genus Plasmodium often carry multiple, distinct strains of the same species (Bell et al., 2006). Such mixed infections, also known as complex infections, are likely indicative of intense local exposure rates, being common in regions of Africa with high rates of prevalence (Howes et al., 2016). However, they have also been documented for P. vivax and other malaria-causing parasites (Ivo Mueller, 2007; Collins, 2012), even in regions of much lower prevalence (Howes et al., 2016; Steenkeste et al., 2010). Mixed infections have been associated with increased disease severity (de Roode et al., 2005) and also facilitate the generation of genomic diversity within the parasite, enabling co-transmission to the mosquito vector where sexual recombination occurs (Mzilahowa et al., 2007). Mixed infections are transient (Bruce and Day, 2002; Zimmerman et al., 2004), but little is known about the distribution of their duration. Whether the clearance of one or more strains results purely from host immunity (Borrmann and Matuschewski, 2011) or can be influenced by interactions between the distinct strains (Enosse et al., 2006; Bushman et al., 2016) is also an open question.
Although mixed infections can be studied from genetic barcodes (Galinsky et al., 2015) or single nucleotide polymorphisms (SNPs) (O’Brien et al., 2016), genome sequencing provides a more powerful approach for detecting mixed infections (Chang et al., 2017). Genetic differences between co-existing strains manifest as polymorphic loci in the DNA sequence of the isolate. The higher resolution of sequencing data allows the use of statistical methods for estimating the number of distinct strains, their relative proportions, and genome sequences (Zhu et al., 2018). Although genomic approaches cannot identify individuals infected multiple times by identical strains, and are affected by sequencing errors and problems of incomplete or erroneous reference assemblies, they provide a rich characterisation of within-host diversity (Manske et al., 2012; Auburn et al., 2012; Pearson et al., 2016).
Previous research has highlighted that co-existing strains can be highly related (Nair et al., 2014; Trevino et al., 2017). For example, in P. vivax 58% of mixed infections show long stretches of within-host homozygosity (Pearson et al., 2016). In addition, Nkhoma et al. (2012) reported an average of 78.7% P. falciparum allele sharing in Malawi and 87.6% sharing in Thailand. Relatedness can arise through different mechanisms. First, a mosquito vector may acquire distinct strains from biting a single multiply-infected individual, in which case sexual reproduction and onward transmission can result in F2 inbred progeny. A similar process may occur from biting multiple infected individuals. If these subsequently undergo sexual reproduction in the mosquito midgut, then transmission may result in an individual being infected with multiple, sibling strains with lower levels of inbreeding than in the previous case. Alternatively, relatedness can occur through independent infection events but in a population where genetic diversity is low, such as during the early stages of an outbreak or following severe population bottlenecks; for instance, those provoked by an intervention (Mouzin et al., 2010; Wong et al., 2017; Daniels et al., 2015).
The rate and relatedness structure of mixed infections are therefore highly relevant for understanding regional epidemiology. However, progress towards utilising this source of information is limited by two problems. Firstly, while strain deconvolution within mixed infections has received substantial attention (Galinsky et al., 2015; O’Brien et al., 2016; Chang et al., 2017; Zhu et al., 2018), currently, no methods perform joint deconvolution of strains and estimation of relatedness. Because existing deconvolution methods assume equal relatedness along the genome, differences in relatedness that occur, for example, through infection by sibling strains can lead to errors in the estimation of the number, proportions and sequences of individual strains (Figure 1). Recently, progress has been made in the case of dual infections with balanced proportions (Henden et al., 2018), but a general solution is lacking. The second problem is that little is known about how the rate and relatedness structure of mixed infections relates to underlying epidemiological parameters. Informally, mixed infections will occur when prevalence is high, an observation exploited by Cerqueira et al. (2017) when estimating changes in transmission over time. However, the quantitative nature of this relationship, the key parameters that influence mixed infection rates and how patterns of relatedness relate to infection dynamics are largely unexplored.
Here, we develop, test and apply an enhanced method for strain deconvolution, which builds on our previously-published DEploid software. The method separates estimation of strain number, proportions, and relatedness (specifically the identity-by-descent, or IBD, profile along the genome) from the problem of inferring genome sequences. This strategy provides substantial improvements in accuracy under complex settings or when dealing with low coverage data. We apply the approach to 2,344 field isolates of P. falciparum collected from 13 countries over a range of years (2001–2014) and available through the Pf3k Project (see Supplementary Note), and characterise the rate and relatedness patterns of mixed infections. In addition, we develop a statistical framework for characterising the processes underlying mixed infections, estimating that more than half of mixed infections arise from the transmission of siblings, as well as demonstrating the propagation of mixed infections through cycles of host-vector transmission. Finally, we investigate the relationships between statistics of mixed infection and epidemiological estimates of pathogen prevalence (MAP, 2017), showing that country-level rates of mixed infection are highly correlated with estimates of malaria parasite prevalence.
2 Strain deconvolution in the presence of relatedness
Existing methods for deconvolution of mixed infections typically assume that the different genetic strains present in mixed infections are unrelated. This assumption allows for efficient computation of priors for allele frequencies within samples, either through assuming independence of loci (O’Brien et al., 2016) or as sequences generated as imperfect mosaics of some (predefined) reference panel (Zhu et al., 2018). However, when strains are related to each other, and particularly when patterns of IBD vary along the genome (for example through being siblings, or sibs for short), the constraints imposed on within-sample allele frequencies through IBD can cause problems for deconvolution methods, which can try to fit complex strain combinations (with relatedness) as simpler configurations (without relatedness). Below we outline the approach we take to integrating IBD into DEploid. Further details are provided in the Supplementary Materials.
2.1 Decoding genomic relatedness among strains
A common approach to detecting IBD between two genomes is to employ a hidden Markov model that transitions into and out of IBD states (Chang et al., 2015; Gusev et al., 2009, 2011). We have generalised this approach to the case of k haploid Plasmodium genomes (strains). In this setting, there are $2^k$ possible genotype configurations, as each of the k strains can be either reference, i.e. same as the reference genome used during assembly, or alternative at a given locus (we assume all variation is bi-allelic). If each of the k strains constitutes a unique proportion of the infection, each genotype configuration will produce a distinct alternative within-sample allele frequency (WSAF; Figure 1A), which defines the expected fraction of total sequencing reads that are alternative at a given locus in the sequenced infection.
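The mapping from genotype configurations to expected WSAF values can be illustrated with a minimal sketch. The function name `expected_wsaf` and the strain proportions used here are hypothetical and chosen only for illustration; the sketch ignores sequencing error and read sampling.

```python
from itertools import product

def expected_wsaf(proportions):
    """Map each genotype configuration (0 = reference, 1 = alternative,
    one entry per strain) to its expected within-sample allele frequency."""
    table = {}
    for config in product((0, 1), repeat=len(proportions)):
        # The expected WSAF is the summed proportion of strains carrying
        # the alternative allele at the locus.
        table[config] = sum(w * g for w, g in zip(proportions, config))
    return table

# Hypothetical proportions for a k = 3 infection (not taken from the data).
proportions = [0.1, 0.3, 0.6]
wsaf = expected_wsaf(proportions)
for config, value in sorted(wsaf.items(), key=lambda kv: kv[1]):
    print(config, round(value, 3))
```

With these proportions every subset of strains has a different total, so the eight configurations give eight distinct WSAF values. With equal proportions several configurations would collapse onto the same value.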
The effect of IBD among these k strains is to limit the number of distinct genotype configurations possible, in a way that depends on the pattern of IBD sharing. Consider that, for any given locus, the k strains in the infection are assigned to j ≤ k possible reference haplotypes. IBD exists when two or more strains are assigned to the same haplotype. In this scenario, the total number of possible patterns of IBD for a given k is equal to $\sum_{j=1}^{k} S(k, j)$, where S(k, j) is the number of ways k objects can be split into j subsets (a Stirling number of the second kind; Graham et al., 1988). Thus, for two strains, there are two possible IBD states (IBD or non-IBD), for three strains there are five states (all IBD, none IBD and the three pairwise IBD configurations), for four strains there are fifteen states (see Supplementary Materials), and so on. We limit analysis to a maximum of four strains for computational efficiency and because higher levels of mixed infection are rarely observed. Finally, for a given IBD state, only $2^j$ rather than $2^k$ genotype configurations are possible, thereby restricting the set of possible WSAF values.
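The state counts quoted above (two, five and fifteen for k = 2, 3 and 4) can be checked by summing Stirling numbers of the second kind over j. The following is a minimal sketch using the standard recurrence; the function names are illustrative and do not come from the DEploidIBD software.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(k, j):
    """Number of ways to split k objects into j non-empty subsets."""
    if k == j:
        return 1
    if j == 0 or j > k:
        return 0
    # Either the k-th object joins one of the j existing subsets,
    # or it forms a new subset on its own.
    return j * stirling2(k - 1, j) + stirling2(k - 1, j - 1)

def n_ibd_states(k):
    """Total number of IBD configurations among k strains."""
    return sum(stirling2(k, j) for j in range(1, k + 1))

for k in (2, 3, 4):
    print(k, n_ibd_states(k))
```

The totals are the Bell numbers, which grow quickly with k, consistent with the choice to cap the analysis at four strains.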
Moving along the genome, recombination can result in changes in IBD state, hence changing WSAF values at those loci (Figure 1B). To infer IBD states we use a hidden Markov model, which assumes linkage equilibrium between variants for computational efficiency, with a Gamma-Poisson emission model for read counts (see Supplementary Materials). Population-level allele frequencies are estimated from isolates obtained from a similar geographic region. Given the structure of the hidden Markov model, we can compute the likelihood of the strain proportions by integrating over all possible IBD sharing patterns, yielding a Bayesian estimate for the number and proportions of strains (see Methods). We then use posterior decoding to infer the relatedness structure across the genome (Figure 1B). To quantify relatedness, we compute the mean IBD between pairs of strains, and statistics of IBD tract length (mean, median and N50, the length-weighted median IBD tract length, Figure 1C).
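As an illustration of the N50 statistic, the sketch below computes the length-weighted median of a set of IBD tract lengths: tracts are sorted from longest to shortest, and N50 is the length of the tract at which the running total first reaches half of the summed length. The function name and the example lengths are hypothetical.

```python
def n50(lengths):
    """Length-weighted median: the tract length at which the cumulative
    length (longest first) first reaches half of the total."""
    if not lengths:
        raise ValueError("at least one tract length is required")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length

# Hypothetical tract lengths in arbitrary units (not taken from the data).
tracts = [10, 20, 30, 40]
print(n50(tracts))
```

Unlike the plain median, N50 is dominated by long tracts, so it is less sensitive to many short segments produced by fragmented inference.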
In contrast to our previous work, DEploidIBD infers strain structure in two steps. In the first we estimate the number and proportions of strains using Markov Chain Monte-Carlo (MCMC), allowing for IBD as described above. In the second, we infer the individual genomes of the strains, using the MCMC methodology of Zhu et al. (2018), which can account for linkage disequilibrium (LD) between variants, but without updating strain proportions. The choice of reference samples for deconvolution is described in Zhu et al. (2018) and in the Supplementary Materials. During this step we do not use the inferred IBD constraints per se, though the inferred haplotypes will typically copy from the same (or identical) members of the reference panel within the IBD tract.
3 Results
3.1 Method validation
We validated DEploidIBD through both experimental mixtures using lab strains and in silico mixtures using clonal field samples. First, to test consistency with DEploid (Zhu et al., 2018), we re-analysed the 27 experimental mixtures from Wendler (2015). This data set comprises various mixtures of four laboratory parasite lines (3D7, Dd2, HB3 and 7G8; Figure 2-Supplement 3). Allowing for mixtures of up to four strains and using optimal reference panels, we found comparable performance with the single-step DEploid method, with the exception of three strains of equal proportions where LD information is necessary to achieve accurate deconvolution (Figure 2-Supplement 3).
To test the accuracy of DEploidIBD in a more realistic setting, we created in silico mixtures of two strains from 212 clonal samples of Asian origin (proportions ranging from 10/90% to 45/55%) using Chromosome 14 data (8,070 sites). A further 20 randomly chosen samples were used as the reference panel. In order to compare the accuracy of the two methods at different levels of relatedness, we set 25%, 50% and 75% of the second haplotype to be the same as the first haplotype to mimic scenarios of low, medium and high relatedness. This operation sets a lower limit to the relatedness between two strains, as background relatedness may also exist. To simulate data, we used empirical read depths and drew read counts for the two alleles from binomial proportions. We inferred strain proportions (summarised by the effective number of strains, $K_e = 1/\sum_i w_i^2$, where $w_i$ is the proportion of strain $i$) and haplotypes. Both DEploid and DEploidIBD correctly estimate strain proportions with low relatedness (Figure 2A). However, for moderate and high relatedness mixtures, DEploid fails to recover the correct proportion when the minor strain proportion is below 30%.
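The two ingredients of this benchmark can be sketched as follows. The effective number of strains is taken here to be the inverse of the sum of squared proportions, which is an assumption about the summary used; the read-count simulation draws the alternative-allele count at one site as a binomial sample given a read depth and an expected WSAF. Function names, depths and proportions are hypothetical.

```python
import random

def effective_k(proportions):
    """Effective number of strains, assumed here to be 1 / sum(w_i^2)."""
    return 1.0 / sum(w * w for w in proportions)

def simulate_alt_count(depth, wsaf, rng):
    """Draw the number of alternative-allele reads at one site as a
    binomial sample: each of `depth` reads is alternative with
    probability `wsaf`."""
    return sum(1 for _ in range(depth) if rng.random() < wsaf)

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical 25/75 mixture; at a site where only the minor strain is
# alternative the expected WSAF equals the minor proportion.
proportions = [0.25, 0.75]
print(effective_k(proportions))
print(simulate_alt_count(depth=80, wsaf=proportions[0], rng=rng))
```

Under this definition a balanced dual infection has an effective number of exactly two, while a 10/90 mixture has a value only slightly above one, which is why the summary reflects proportions as well as strain counts.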
DEploidIBD is a substantial improvement on DEploid. In addition to estimating proportions and number of strains, DEploidIBD also estimates identity-by-descent (IBD) profiles. However, due to background relatedness DEploidIBD typically over-estimates IBD fraction by a few percentage points (Figure 2B). Rates of genotype error are similar for the two approaches in settings of low relatedness (error rate of 0.4% per site for 25/75 mixtures and 1.0% for 45/55 mixtures). However, for the 25/75% mixtures with high relatedness, genotype error for the non-IBD approach increases to 0.6%, while error in the IBD approach remains at 0.4% (Figure 2C). Switch errors in haplotype estimation are comparable between the two methods and decrease with increased relatedness due to the higher homozygosity (Figure 2D). In summary, joint inference of IBD profiles and strain haplotypes is expected to improve estimates of strain proportions (and hence haplotypes), particularly in regions with high rates of IBD. Moreover, direct estimates of IBD within mixed infections can be used as an additional feature to characterise isolates.
We repeated the in silico experiment with mixtures of two strains from 197 clonal African samples, with mixing proportions of 10/90%, 25/75% and 45/55%, using 92,780 sites from Chromosome 14. DEploidIBD estimates the correct proportions at all relatedness levels (Figure 2-Supplement 1), although with a greater relative difference in effective K compared to Asia (~ 2% vs. ~ 1%). DEploidIBD also recovers the correct level of relatedness and IBD tract length (note that in Africa background relatedness is typically low). The per site genotype error rate remains below 1%. The number of haplotype switch errors is higher than in Asia, but by a factor much less than the 11-fold increase in the number of SNPs.
Finally, we extended benchmarking to in silico mixtures of three Asian strains (Figure 2-Supplement 2). We set one strain to have the highest proportion (the dominant strain) and constructed the two minor strains to be IBD with the dominant strain over distinct halves of the chromosome, such that at any point there are only two distinct haplotypes present. We find that DEploidIBD outperforms (lower relative difference) DEploid in all cases and typically provides accurate estimates of proportions (Figure 2-Supplement 2) with the exception of two cases. For the case of (0.10, 0.40, 0.50), the minor strain creates very weak allele frequency imbalance, leading DEploidIBD to infer the number of strains as two (with proportions ~ 45/55%) in 90/100 cases. For the case of (0.30, 0.30, 0.40), the problem is fundamentally unidentifiable and DEploidIBD fits the data as a mixture of two strains. In these cases, DEploidIBD also underestimates the pairwise relatedness and N50 tract lengths.
3.2 Geographical variation in mixed infection rates and relatedness
To investigate how the rate and relatedness structure of mixed infections varies among geographical regions with different epidemiological characteristics, we applied DEploidIBD to 2,344 field samples of P. falciparum released by the Pf3k project (Pf3k Consortium, 2016). These samples were collected under a wide range of studies with heterogeneous designs, with the majority of samples being taken from symptomatic individuals seeking clinical treatment. A summary of the data sources is presented in Table 1 and full details regarding study designs can be found at https://www.malariagen.net/projects/pf3k#sampling-locations. Details of data processing are given in the Methods. For deconvolution, samples were grouped into geographical regions by genetic similarity: four in Africa and three in Asia (Table 1). Reference panels were constructed from the clonal samples found at each region. Since previous research has uncovered severe population structure in Cambodia (Miotto et al., 2013), we stratified samples into West and North Cambodia when performing analysis at the country level. Diagnostic plots for the deconvolution of all samples can be found at https://github.com/mcveanlab/mixedIBD-Supplement and inferred haplotypes can be accessed at URL. We identified 787 samples where low sequencing coverage or the presence of low-frequency strains resulted in unusual haplotypes (see Supplementary Material). Estimates of strain number, proportions and IBD states from these samples are used in subsequent analyses, but not the haplotypes. We also confirmed that reported results are not affected by the exclusion of all metrics from samples with low-confidence haplotypes.
We find substantial variation in the rate and relatedness structure of mixed infections across continents and countries. Within Africa, rates of mixed infection vary from 18% in Senegal to 63% in Malawi (Figure 3A). In Southeast Asian samples, mixed infection rates are in general lower, though also vary considerably; from 21% in Thailand to 54% in Bangladesh. Where data for a location is available over multiple years, we find no evidence for significant fluctuation over time (though we note that these studies are typically not well powered to see temporal variation and collection dates are very heterogeneous). We observe that between 5.1% (Senegal) and 40% (Malawi) of individuals have infections carrying more than two strains.
Relatedness between samples and populations also varies substantially. In dual infections, the average fraction of the genome inferred to be IBD ranges from 21% in Guinea to 59% in West Cambodia (Figure 3B). Asian populations show, on average, a higher level of relatedness within dual infections (48%) compared to African populations (29%). Levels of IBD in samples with three or more strains are comparable to those seen in dual infections (average IBD being 50% in Asia and 29% in Africa) and significantly correlated at the country level, with weighted correlation of 0.76 (P = 0.0017, weighted by the number of mixed samples). Overall, 53% of all mixed infections involve strains with over 30% of the genome being IBD.
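A weighted correlation of this kind can be computed as a weighted Pearson coefficient, in which each country contributes to the means, variances and covariance in proportion to its weight. The sketch below shows one common formulation; it is not necessarily the exact estimator used for the reported value, and the input numbers are invented for illustration.

```python
import math

def weighted_pearson(x, y, w):
    """Pearson correlation in which observation i carries weight w[i]."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y)
              for wi, xi, yi in zip(w, x, y)) / total
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / total
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y)) / total
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-country values (not taken from Pf3k): mean IBD in dual
# infections, mean IBD in infections with three or more strains, and the
# number of mixed samples used as the weight.
ibd_dual = [0.21, 0.30, 0.45, 0.59]
ibd_multi = [0.25, 0.28, 0.50, 0.55]
n_mixed = [120, 80, 40, 30]
print(round(weighted_pearson(ibd_dual, ibd_multi, n_mixed), 3))
```

Weighting by the number of mixed samples reduces the influence of countries whose IBD averages rest on few infections.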
Table 1: Summary of Pf3k samples.
We next considered the relationship between mixed infection rate and the level of IBD. We find that populations with higher rates of mixed infection tend to have lower levels of IBD within mixed infections (linear model P = 0.06 after accounting for a continental level difference and weighted by sample size). However, the continental level effect is driven by Senegal, which has an unusual combination of low mixed infections and also low IBD. Excluding Senegal, we find a consistent pattern across populations (Figure 3C), with a strong negative correlation between mixed infection rate and the level of IBD (Pearson r = –0.84, P = 3 × 10−4). Previous work has demonstrated how a recent and dramatic decline in P. falciparum prevalence within Senegal has left an impact on patterns of genetic variation (Daniels et al., 2015), which may explain its unusual profile.
3.3 Inferring the origin of IBD in mixed infections
The high levels of IBD observed in many mixed infections suggest the presence of sibling strains (Figure 4). To quantify the expected IBD patterns between siblings, we developed a meiosis simulator for P. falciparum (pf-meiosis), incorporating relevant features of malaria biology that can impact the way IBD is produced in a mosquito and detected in a human host. Most importantly, a single infected mosquito can undergo multiple meioses in parallel, one occurring for each oocyst that forms on the mosquito midgut (Ghosh et al., 2000). In a mosquito infected with two distinct strains, each oocyst can either self (the maternal and paternal strain are the same) or outbreed (the maternal and paternal strains are different). We model a K = n mixed infection as a sample of n strains (without replacement, as drawing identical strains yields K = n – 1) from the pool of strains created by all oocysts. Studies of wild-caught Anopheles gambiae suggest that the distribution of oocysts is roughly geometric, with the majority of infected mosquitoes carrying only one oocyst (Beier et al., 1991; Collins et al., 1984). Surprisingly, in such a case, a K = 2 infection will have an expected IBD of 1/3 (see Supplementary Materials). Conditioning on at least one progeny originating from an outbred oocyst (such that a detectable recombination event has occurred), the expected IBD asymptotically approaches 1/2 as the total number of oocysts grows.
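The values of 1/3 and 1/2 can be recovered from a simplified per-locus argument, sketched below. The sketch is not the pf-meiosis simulator and rests on assumptions that are ours: the two parental strains share no background IBD, every oocyst is outbred, each oocyst contributes four meiotic products to the pool with 2:2 segregation at every locus, and two products are drawn uniformly without replacement. Two products from the same oocyst then match at a locus with probability 1/3 (one of the three remaining products carries the same parental allele), whereas products from different oocysts match with probability 1/2.

```python
from fractions import Fraction

def expected_ibd_k2(n_oocysts):
    """Expected per-locus IBD between two products drawn without
    replacement from n outbred oocysts (simplified model, see text)."""
    total_products = 4 * n_oocysts
    # Given the first product, 3 of the remaining products come from
    # the same oocyst.
    p_same_oocyst = Fraction(3, total_products - 1)
    within = Fraction(1, 3)   # same tetrad, 2:2 segregation
    between = Fraction(1, 2)  # independent meioses
    return p_same_oocyst * within + (1 - p_same_oocyst) * between

for n in (1, 2, 5, 50):
    print(n, expected_ibd_k2(n), float(expected_ibd_k2(n)))
```

Under these assumptions the expectation rises from 1/3 with a single oocyst towards 1/2 as oocysts accumulate. Allowing selfed oocysts, as the full model does, changes the intermediate values.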
Using this simulation framework, we sought to classify observed mixed infections based on their patterns of IBD. We used two summary statistics to perform the classification: mean IBD segment length and IBD fraction. We built empirical distributions for these two statistics for each country in Pf3k, by simulating meiosis between pairs of clonal samples from that country. In this way, we control for variation in genetic diversity (as background IBD between clonal samples) in each country. Starting from a pair of clonal samples (M = 0, where M indicates the number of meioses that have occurred), we simulated three successive rounds of meiosis (M = 1, 2, 3), representing the creation and serial transmission of a mixed infection (Figure 5A). Each round of meiosis increases the amount of observed IBD. For example, in Ghana, the mean IBD fraction for M = 0 was 0.002, for M = 1 was 0.41, for M = 2 was 0.66, and for M = 3 was 0.80 (Figure 5B). West Cambodia, which has lower genetic diversity, had a mean IBD fraction of 0.08 for M = 0 and consequently, the mean IBD fractions for higher values of M were slightly increased, to 0.46, 0.68, 0.81 for M = 1, 2 and 3, respectively (Figure 5B).
From these simulated distributions, we used Naive Bayes to classify K = 2 mixed infections in Pf3k (Figure 5C). Of the 404 K = 2 samples containing only high-quality haplotypes (see Supplementary Materials), 288 (71%) had IBD statistics that fell within the range observed across all simulated M. Of these, more than half (221, 55%) were classified as siblings (M > 0, with > 99% posterior probability). Moreover, we observe geographical differences in the rate at which sibling and unrelated mixed infections occur. Notably, in Asia a greater fraction of all mixed infections contained siblings (65% vs. 51% in Africa), driven by a higher frequency of M = 2 and M = 3 mixed infections (Figure 5D).
3.4 Characteristics of mixed infections correlate with local parasite prevalence
To assess how characteristics of mixed infections relate to local infection intensity, we obtained estimates of P. falciparum prevalence (PfPR2–10) from the Malaria Atlas Project (MAP, 2017, see Table 1). The country level prevalence estimates range from 0.01% in Thailand to 55% in Ghana, with African countries having up to two orders of magnitude greater values than Asian ones (mean of 36% in Africa and 0.6% in Asia). However, seasonal and geographic fluctuations in prevalence mean that, conditional on sampling an individual with malaria, local prevalence may be much higher than the longer-term (and more geographically widespread) average. We summarise mixed infection rates by the average effective number of strains, which reflects both the number and proportion of strains present. This metric both avoids the problem of having to estimate a threshold for determining the presence of a very low proportion strain and is sensitive to the presence of triply (and more) infected samples.
We find that the effective number of strains is a significant predictor of PfPR2–10 in African populations (r = 0.48, P = 0.04), but is uncorrelated within Asian populations. Similarly, within-sample IBD and background IBD are both negatively correlated with PfPR2–10 only in Africa (r = –0.67, P = 0.0017 and r = –0.53, P = 0.02, respectively). The rate of sibling infection (M = 1) is not correlated with the parasite prevalence (r = –0.06, P = 0.70). However, the super-sibling infection rate (M = 2, 3) does exhibit a marginally significant correlation with PfPR2–10 (r = –0.29, P = 0.06), albeit only at the continental scale. Interestingly, all statistics relating to IBD are positively correlated with PfPR2–10 in Asian populations (though not significantly so), in contrast to the negative (and significant) associations seen within African populations.
4 Discussion
Although it has long been appreciated that mixed infections are an integral part of malaria biology, determining the number, proportions, and haplotypes of the strains that comprise them has proven a formidable challenge. Previously we developed an algorithm, DEploid, for deconvolving mixed infections (Zhu et al., 2018). However, we subsequently noticed the presence of mixed infections with highly related strains in which the algorithm performed poorly, particularly with low-frequency minor strains. Mixed infections containing highly related strains represent an epidemiological scenario of particular interest, because they are likely to have been produced from a single mosquito bite, itself multiply infected, and in which meiosis has occurred to generate sibling strains. Thus, we developed an enhanced method, DEploidIBD, capable not only of deconvolving highly related mixed infections, but also of providing a profile of IBD segments between all pairs of strains present in the infection. We note that technical difficulties remain, including analysing data with multiple infecting species, coping with low-coverage data, and selecting appropriate reference panels from the growing reference resources.
The application of DEploidIBD to the 2,344 samples in the Pf3k project has revealed the extent and structure of relatedness among malaria infections and how these characteristics vary between geographic locations. We found that 1,026 (44%) of all samples in Pf3k were mixed, being comprised of 480 K = 2 infections, 372 K = 3 and 127 K = 4 infections. Across the entire data set, the total number of genomes extracted from mixed infections is nearly double the number extracted from clonal infections (2,584 genomes from K > 1 vs. 1,365 from K = 1). We also found considerable variation between countries and continents in the characteristics of mixed infections, suggesting that they are sensitive to local epidemiology. For example, in West Africa, Senegal (which has undergone a recent and effective malaria control campaign) has a rate of mixed infections less than half that of neighbouring Guinea and Mali. Previous work has highlighted the utility of mixed infection rate in discerning changes in regional prevalence, and we reinforce that finding here, observing a significant correlation between the effective number of strains and parasite prevalence across Pf3k collection sites. Similarly, using DEploidIBD we also observe significant geographical variation in the relatedness profiles of strains within mixed infections. Interestingly, this variation is structured such that regions with high rates of mixed infection tend to contain strains that are less related, resulting in a significant negative correlation between mixed infection rate and mean relatedness within those infections.
The ability to identify the extent and genomic structure of IBD enables inference of the mechanisms by which mixed infections can arise. A mixed infection of K strains can be produced by either K independent infectious bites or by j < K infectious bites. In the first case, parasites are delivered by separate vectors and no meiosis occurs between the distinct strains, thus any IBD observed in the mixed infection must have pre-existed as background IBD between the individual strains. In the second case, meiosis may occur between strains, resulting in long tracts of IBD. The exact amount of IBD produced by meiosis is a random variable, dependent on outcomes of meiotic processes, such as the number of recombination events, the distance between them, and the segregation of chromosomes. Importantly, the mean IBD produced during meiosis in P. falciparum also depends on the number and type (selfed vs outbred) of oocysts in the infectious mosquito. Consequently, the amount of IBD expected in a single-bite mixed infection produced from two unrelated parasite strains will always be slightly less than 1/2, and as low as 1/3.
To quantify the distribution of IBD statistics expected through different mechanisms of mixed infection, we developed a Monte Carlo simulation tool, pf-meiosis, which we used to infer the recent transmission history of individuals with dual (K = 2) infections. We considered mixed infection chains, in which M successive rounds of meiosis, transmission to host, and uptake by vector can result in sibling strain infections with very high levels of IBD. Overall, we found that 56% of all mixed infections are from sibling strains and, particularly within Asian population samples, evidence for long mixed infection chains (M > 1). This observation is not a product of lower genetic diversity in Asia, as differences in background IBD between countries have been controlled for in the simulations. Rather, it reflects true differences in transmission epidemiology between continents. These findings have three important consequences. First, they suggest that successful establishment of multiple strains through a single infection event is a major source of mixed infection. Second, they imply that the bottlenecks imposed at transmission (to host and vector) are relatively weak. Finally, they indicate that the source of mixed infections reflects aspects of local epidemiology.
We note that a non-trivial fraction (29%) of all mixed infections had patterns of IBD inconsistent with the simulations (typically higher IBD than background but lower than among siblings). We suggest two explanations. First, our estimate of background IBD, generated by combining pairs of random clonal samples from a given country into an artificial M = 0 mixed infection, will underestimate true background IBD if there is very strong local population structure. Second, we only simulated simple mixed infection transmission chains, to the exclusion of more complex transmission histories, such as those involving strains related at the level of cousins. The extent to which such complex histories can be inferred with certainty remains to be explored.
Finally, our results show that the rate and relatedness structure of mixed infections correlate with estimated levels of parasite prevalence, at least within Africa, where prevalence is typically high (Smith et al., 1993). In Asia, which has much lower overall prevalence, as well as greater temporal (and possibly spatial) fluctuations, we do not observe such correlations. However, it may well be that other genomic features that we do not consider in this work could provide much higher resolution, in space and time, for capturing changes in prevalence than traditional methods. Testing this hypothesis will lead to a much greater understanding of how genomic data can potentially be used to inform global efforts to control and eradicate malaria.
5 Methods and Materials
The data analysed within this paper were collected and made openly available to researchers by members of the Pf3k Consortium. Information about studies within the data set can be found at https://www.malariagen.net/projects/pf3k#sampling-locations. Detailed information about data processing can be found at https://www.malariagen.net/data/pf3k-5. Briefly, field isolates were sequenced to an average read depth of 86 (range 12.6 – 192.5). After removing human-derived reads and mapping to the 3D7 reference genome, variants were called using GATK best practices and approximately one million variant sites were genotyped in each isolate. After filtering samples for low coverage and cross-species contamination, 2,344 samples remained. The Supplementary Material provides details on the filters used and data availability. For deconvolution, samples were grouped into geographical regions by genetic similarity: four in Africa and three in Asia (Table 1). Reference panels were constructed from the clonal samples found in each region. Since previous research has uncovered severe population structure in Cambodia (Miotto et al., 2013), we stratified samples into West and North Cambodia when performing analysis at the country level.
7 Data availability
Metadata on samples is available from ftp://ngs.sanger.ac.uk/production/pf3k/release_5/pf3k_release_5_metadata_20170804.txt.gz. Sequence data (aligned to Plasmodium falciparum strain 3D7 v3.1 reference genome sequences, for details see ftp://ftp.sanger.ac.uk/pub/project/pathogens/gff3/2015-08/Pfalciparum.genome.fasta.gz) is available from ftp://ngs.sanger.ac.uk/production/pf3k/release_5/5.1/. Diagnostic plots for the deconvolution of all samples can be found at https://github.com/mcveanlab/mixedIBD-Supplement and deconvolved haplotypes can be accessed at XXX. Code implementing the algorithms described in this paper, DEploidIBD, is available at https://github.com/mcveanlab/DEploid.
8 Disclosure Declaration
None declared.
6 Acknowledgements
This study was supported by the Wellcome Trust (206194, 090770, 204911, 100956/Z/13/Z to GM), the Medical Research Council (G0600718), and the UK Department for International Development (M006212).
This study used data from the MalariaGEN Pf3k Project. Genome sequencing was done by the Wellcome Sanger Institute (WSI), and sample collections were coordinated by the MalariaGEN Resource Centre. The samples from Senegal were supported by funding from the Bill and Melinda Gates Foundation to Dyann Wirth, and sequenced by the Broad Institute. We thank the staff of the WSI Sample Logistics, Sequencing, and Informatics facilities for their contribution, and all patients and collaborators contributing samples and data to the Pf3k project.