Summary
Plasmodium simium, a malaria parasite of non-human primates in the Atlantic forest region of Brazil, was recently shown to cause zoonotic infection in humans in the region. Phylogenetic analyses based on the whole genome sequences of six P. simium isolates infecting humans and two isolates from brown howler monkeys revealed that P. simium is monophyletic within the broader diversity of South American Plasmodium vivax, consistent with the hypothesis that P. simium first infected non-human primates as a result of a host-switch from humans carrying P. vivax. We provide molecular evidence that the current zoonotic infections of people have likely resulted from multiple independent host switches, each seeded from a different monkey infection. Very low levels of genetic diversity within P. simium genomes and the absence of P. simium-P. vivax hybrids suggest that the P. simium population emerged recently and has subsequently experienced a period of independent evolution in Platyrrhini monkeys. We further find that Plasmodium Interspersed Repeat (PIR) genes, Plasmodium Helical Interspersed Subtelomeric (PHIST) genes and Tryptophan-Rich Antigen (TRAg) genes in P. simium are genetically divergent from P. vivax and are enriched for non-synonymous single nucleotide polymorphisms, consistent with the rapid evolution of these genes. Analysis of genes involved in erythrocyte invasion revealed several notable differences between P. vivax and P. simium, including large deletions within the coding region of the Duffy Binding Protein 1 (DBP1) and Reticulocyte Binding Protein 2a (RBP2a) genes in P. simium. Genotyping of P. simium isolates from non-human primates (NHPs) and zoonotic human infections showed that a precise deletion of 38 amino acids in DBP1 is exclusively present in all human-infecting isolates, whereas non-human primate-infecting isolates were polymorphic for the deletion.
We speculate that these deletions in the parasite-encoded key erythrocyte invasion ligands and the additional rapid genetic changes have facilitated zoonotic transfer to humans. Non-human primate malaria parasites can be considered a reservoir of potentially infectious human parasites that must be taken into account in any attempt at malaria elimination. The genome of P. simium will thus form an important basis for future functional characterizations of the mechanisms underlying malaria zoonosis.
Introduction
There are currently eight species of malaria parasites known to cause disease in humans: Plasmodium falciparum, Plasmodium vivax, Plasmodium malariae, Plasmodium ovale curtisi, Plasmodium ovale wallikeri, Plasmodium knowlesi, Plasmodium cynomolgi and Plasmodium simium. The latter three species are more commonly parasites of non-human primates and have only relatively recently been shown to infect humans 1-3.
As interventions against human malaria parasites, particularly P. falciparum and P. vivax, continue to reduce their prevalence, the increasing importance of zoonotic malaria is becoming apparent. In countries currently moving towards the elimination of malaria, the presence of populations of potentially zoonotic parasites in non-human primates constitutes a significant obstacle.
The propensity of malaria parasites to switch hosts and the consequences of this for human health are underlined by the fact that both P. vivax and P. falciparum first arose as human pathogens as the result of host switches from great apes in Africa 4-6. As contact between humans and the mosquitoes that feed on non-human primates increases due to habitat destruction and human encroachment into non-human primate habitats 7, there is increasing danger of zoonotic malaria transmission leading to the emergence of novel human malaria pathogens. Understanding how malaria parasites adapt to new hosts and new transmission environments allows assessment of the risks posed by novel zoonotic malaria outbreaks.
The clinical epidemiology of zoonotic malaria varies according to the parasite species involved and the demographics of the human-host population infected. Severe and lethal outcomes have been reported in people infected with P. knowlesi in Malaysia 8, whilst infection with P. cynomolgi in the same region appears to cause moderate/mild clinical symptoms 9. Interestingly, both P. knowlesi and P. cynomolgi infections in the Mekong region appear less virulent than in Malaysia, and are often asymptomatic 3,10, and this may be due to the relative virulence of the parasite strains circulating there and/or differences in the susceptibility of the local human populations. As the parasites of non-human primates have co-evolved with and adapted to their monkey hosts, it is impossible to predict their potential pathogenesis in zoonotic human infections. The virulence of P. falciparum, for example, has been attributed to its relatively recent emergence as a human pathogen 11, which appears to have occurred following a single host transfer from a gorilla in Africa 5.
Eighty-nine percent of the malaria infections in Brazil are caused by P. vivax, with over 99% of these cases occurring in the Amazonian region. This region accounts for almost 60% of the area of Brazil, and is home to 13% of the population (https://www.ibge.gov.br/). Of the 0.4% of cases registered outside the Amazon, around 90% occur in the Atlantic Forest, a region of tropical forest that extends along the Atlantic coast of Brazil, and are caused by an apparently mild, vivax-like malaria parasite transmitted by Anopheles (Kerteszia) cruzii, a mosquito species that breeds in the leaf axils of bromeliad plants 12.
Following a malaria outbreak in the Atlantic Forest of Rio de Janeiro in 2015/2016, it was shown that these infections were caused by the non-human primate malaria parasite P. simium 1. DNA samples collected both from humans and non-human primates (NHPs) in the same region shared identical mitochondrial genome sequences, distinct from P. vivax isolates from anywhere in the world and identical to that of a P. simium parasite isolated from a monkey in the same region in 1966, and to all isolates of P. simium recovered from NHPs since 13,14.
It was previously thought that Plasmodium vivax became a parasite of humans following a host switch from macaques in Southeast Asia, due to its close phylogenetic relationship with a clade of parasites infecting monkeys in this region and due to the high genetic diversity among P. vivax isolates from Southeast Asia 15. We now know, however, that it became a human parasite following a host switch from great apes in Africa 6. It is likely that it was introduced to the Americas by European colonisers following Columbus’ journey to the New World in 1492. Indeed, present-day P. vivax in South America is closely related to a strain of the parasite present, historically, in Spain 16. The genetic diversity of extant P. vivax in the Americas suggests multiple post-Columbian colonising events associated with the passage of infected people from various regions throughout the world 17. There is some evidence to suggest that P. vivax parasites may also have been introduced to South America in pre-Columbian times, and may have contributed to the extensive genetic diversity of the parasite on this continent 17.
Plasmodium simium, a parasite of various species of Platyrrhini monkeys whose range is restricted to the Atlantic Forest of south and southeast Brazil 18, is genetically and morphologically similar to P. vivax 1,19-22. Based on this similarity, it appears likely that P. simium originated as a parasite of monkeys in Brazil following a host switch from humans carrying P. vivax. The recent 2015/2016 outbreak of P. simium in the local human population of Rio de Janeiro’s Atlantic Forest raises questions about the degree of divergence that has occurred between P. vivax and P. simium, and whether adaptation to monkeys has led to the evolution of a parasite with clinical relevance to human health that differs from that of P. vivax.
It is unclear whether the current outbreak of P. simium in the human population of Rio de Janeiro was the result of a single transfer of the parasite from a monkey to a human and its subsequent transfers between people, or whether multiple independent host switches have occurred, each seeded from a different monkey infection. Furthermore, the degree and nature of adaptation to a non-human primate host and a sylvatic transmission cycle that has occurred in P. simium following its anthroponotic origin is of relevance to the understanding of how malaria parasites adapt to new hosts. It is also of interest to determine whether the current, human-infecting P. simium parasites have recently undergone changes at the genomic level that have allowed them to infect people in this region, as it has previously been suggested that P. simium has historically lacked the ability to infect man 23.
In order to resolve these questions, and so to better understand the epidemiology and natural history of this emerging zoonotic parasite, we analysed the whole genome sequences of P. simium parasites isolated from both humans and non-human primates in the Atlantic Forest region of Rio de Janeiro.
Results
Genome assembly and phylogeny
From a single P. simium sample collected from Rio de Janeiro state in 2016 1, short read sequences were obtained and assembled into a draft genome (see Supplementary Materials). The assembled genome consists of 2,192 scaffolds over 1 kb with a combined size of 29 Mb (Table S1). Two scaffolds corresponding to the apicoplast and mitochondrial organelles were also identified (Figure S1). Gene content analysis showed an annotation completeness comparable to previously published Plasmodium assemblies (Figure S35). A phylogenetic tree constructed from 3,181 1:1 orthologs of the annotated P. simium protein-coding genes with Plasmodium vivax, P. cynomolgi, P. coatneyi, P. knowlesi, P. malariae, P. falciparum, P. reichenowi, and P. gallinaceum confirmed that P. simium is very closely related to P. vivax (Figure S2).
P. simium-P. vivax diversity analysis
To detect single nucleotide polymorphisms (SNPs) within the P. vivax/P. simium clade, short Illumina paired-end sequence reads were mapped onto the P. vivax P01 reference genome 24. Reads were collected from eleven human P. simium samples, two monkey P. simium samples, two P. vivax samples from Brazilian Amazon, and a range of P. vivax strains representing a global distribution retrieved from the literature 25. Including only SNPs with a minimum depth of five reads, a total of 232,780 SNPs were initially called across 79 samples. Sixteen samples were subsequently removed from further analysis primarily due to low coverage resulting in a total of 63 samples (Table S2, Table S3). Few SNP loci are covered across all samples, and to enable diversity analysis, we restrict all further analysis to the 124,968 SNPs for which data is available from at least 55 samples (Figure S3).
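The two filtering thresholds described above (a minimum depth of five reads per call, and data from at least 55 samples per locus) can be illustrated with a short sketch. This is not the pipeline used in the study: the function name `filter_snps` and the in-memory layout (a dictionary mapping each locus to per-sample read depths) are hypothetical stand-ins for a parsed variant file, and the toy data are invented.

```python
from typing import Dict, List

MIN_DEPTH = 5      # minimum read depth for a call to count (from the text)
MIN_SAMPLES = 55   # minimum number of samples with data at a locus (from the text)


def filter_snps(depths: Dict[str, Dict[str, int]],
                min_depth: int = MIN_DEPTH,
                min_samples: int = MIN_SAMPLES) -> List[str]:
    """Return loci with depth >= min_depth in at least min_samples samples.

    `depths` maps a locus identifier to {sample name: read depth}; this layout
    is a hypothetical stand-in for a parsed variant file.
    """
    kept = []
    for locus, per_sample in depths.items():
        covered = sum(1 for d in per_sample.values() if d >= min_depth)
        if covered >= min_samples:
            kept.append(locus)
    return kept


# Toy example with 3 samples and a threshold of 2 samples (not real data).
toy = {
    "chr1:100": {"s1": 10, "s2": 7, "s3": 2},
    "chr1:200": {"s1": 3, "s2": 1, "s3": 9},
}
kept_toy = filter_snps(toy, min_depth=5, min_samples=2)
```

In the toy example only the first locus has sufficient depth in two or more samples, so it is the only one retained.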
P. simium-P. vivax population analysis
A Principal Component Analysis (PCA) plot constructed from these genome-wide SNP loci showed a clear separation between American and Asian P. vivax samples as well as a distinct grouping of P. simium samples (Figure S4). The latter observation suggests that both human and monkey P. simium samples form a single population that is genetically differentiated from other American P. vivax populations. A similar pattern is observed when performing a multidimensional scaling (MDS) analysis of the SNP data (Figure S5). To enable a phylogenetic approach, we constructed an alignment from the 124,968 SNP sites. In the resulting phylogenetic tree, P. vivax strains generally clustered according to their geographical origin, and the Asian and American samples were clearly separated (Figure 1A, a tree with sample IDs is available in Figure S6). P. simium samples clustered as a monophyletic group with Mexican vivax samples (Figure 1A), consistent with a recent American origin for P. simium.
To examine whether the P. simium isolates we obtained were part of a continuous population with local P. vivax, we examined population ancestry with the ADMIXTURE program 26 (Figure S7). This analysis is consistent with the PCA and MDS analyses (Figure S4 & Figure S5) and the phylogenetic analysis of segregating SNPs (Figure 1), showing that P. simium forms a genetically distinct population of P. vivax. The absence of P. simium-P. vivax hybrids (introgression events) suggests that P. simium has undergone a period of independent evolution in Platyrrhini monkeys.
P. simium genetic differentiation from P. vivax is enhanced in host-parasite interacting genes
To characterise the P. simium population further, we estimated the nucleotide diversity in P. simium and P. vivax samples (see Materials and Methods). P. simium diversity (genome-median: 1.3×10−4) is more than five times lower than the diversity observed when comparing all P. vivax samples (genome-median: 7.5×10−4) (Figure 2). Diversity within coding sequences in P. vivax is consistent with previous reports 6. The median nucleotide diversity between P. simium and P. vivax genomes of 8.4×10−4 and the low diversity within P. simium suggest that the strains we examined are part of a relatively recent or isolated population.
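For readers unfamiliar with the statistic, nucleotide diversity is the average number of pairwise differences per site between sampled sequences. The sketch below is a minimal illustration under stated assumptions, not the procedure described in Materials and Methods: haplotypes are represented as hypothetical allele strings over the variable sites, `n_sites` is the total number of sites surveyed, and missing calls are marked 'N'. The last line simply divides the two genome-median values reported above.

```python
from itertools import combinations
from typing import Sequence


def nucleotide_diversity(haplotypes: Sequence[str], n_sites: int) -> float:
    """Average number of pairwise differences per site (pi).

    `haplotypes` holds one allele string per sample over the variable sites;
    `n_sites` is the total number of sites surveyed, including invariant ones.
    Positions where either sample has 'N' (missing) are skipped for that pair.
    """
    pairs = list(combinations(haplotypes, 2))
    if not pairs or n_sites <= 0:
        raise ValueError("need at least two haplotypes and n_sites > 0")
    total_diffs = 0
    for a, b in pairs:
        total_diffs += sum(1 for x, y in zip(a, b)
                           if x != y and x != "N" and y != "N")
    return total_diffs / (len(pairs) * n_sites)


# Toy data: three samples, four variable sites within a 1,000-site window.
toy_pi = nucleotide_diversity(["ACGT", "ACGA", "TCGA"], n_sites=1000)

# Fold difference between the genome-median values reported in the text.
fold = 7.5e-4 / 1.3e-4
```

The fold difference evaluates to a value above five, in line with the statement that P. simium diversity is more than five times lower.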
We then examined the population differentiation over the entire genome using FST, a measure of the proportion of ancestry private to a population (FST=0 for completely intermixed populations, FST=1 for populations with completely independent ancestry). Although our analysis contains very few samples, FST estimates can be very accurate if multiple genomic sites are used 27. Consistent with phylogenetic and admixture analysis, we observed a high level of differentiation between human P. simium and American P. vivax (FST=0.46). For comparison, the differentiation between vivax from America and vivax from Myanmar and Thailand (henceforth referred to as ‘Asian vivax’) is less than half of this (FST=0.22). To examine whether there were any signals of adaptive change in P. simium that may have occurred during its adaptation in monkeys upon anthroponotic transfer, we calculated the fixation index for all individual genes. Clearly, the small number of samples renders this analysis prone to spurious signals, and FST values for individual genes should be interpreted with caution. Nevertheless, we attempted to look for general patterns in FST values across gene groups.
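To make the two limiting cases of FST concrete, the following sketch computes FST from per-population allele frequencies, pooling the signal over many SNPs. The text does not state which estimator was used in the study; Hudson's estimator, combined as a ratio of sums over SNPs, is swapped in here purely for illustration, and the input tuples are toy numbers.

```python
from typing import Iterable, Tuple

# Each SNP is (p1, n1, p2, n2): allele frequency and sample count per population.
Snp = Tuple[float, int, float, int]


def hudson_fst(snps: Iterable[Snp]) -> float:
    """Hudson's FST estimator, combined across SNPs as a ratio of sums."""
    num = 0.0
    den = 0.0
    for p1, n1, p2, n2 in snps:
        if n1 < 2 or n2 < 2:
            continue  # the sample-size correction needs at least two samples
        num += ((p1 - p2) ** 2
                - p1 * (1 - p1) / (n1 - 1)
                - p2 * (1 - p2) / (n2 - 1))
        den += p1 * (1 - p2) + p2 * (1 - p1)
    if den == 0:
        raise ValueError("no informative SNPs")
    return num / den


# Fixed differences at every site give FST = 1; identical frequencies with
# large samples give a value near 0 (toy numbers, not study data).
fst_fixed = hudson_fst([(1.0, 10, 0.0, 10), (0.0, 8, 1.0, 12)])
fst_mixed = hudson_fst([(0.5, 1000, 0.5, 1000)])
```

Summing numerator and denominator separately before dividing is what lets many sites compensate for a small number of samples; a per-gene value uses only that gene's SNPs and is correspondingly noisier.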
Amongst the 4,341 P. vivax genes with at least one SNP in our data set, we examined the top-25% of the genes with highest FST values for enrichment in functional Gene Ontology (GO) terms or metabolic pathways. No GO terms or pathways were significant at the 0.05 level after Bonferroni correction (Table S4 & Table S5). Using the P. falciparum orthologs instead, when available, gave similar results (not shown). We next tested if any of the gene families (Figure S8, Figure S9, Table S6) were associated with high FST values. Genes belonging to the Plasmodium Interspersed Repeat (PIR) family involved in antigenic variation 28, the Plasmodium Helical Interspersed Subtelomeric (PHIST) genes, a family of exported proteins 29, the merozoite surface proteins MSP7 30, and Tryptophan-rich antigens (TRAg) 31 were enriched among the genes with high FST values (binomial distribution, PIR; p=3.5×10−3, PHIST; p=4.1×10−4, MSP7; p=0.034, TRAg; p=2.5×10−3).
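The logic of the binomial enrichment test can be sketched as follows. If family membership were unrelated to FST, each gene in a family would fall in the top quartile with probability 0.25, so the number of family members observed there follows a binomial distribution. The one-sided form of the test and the example counts are assumptions made for illustration; the per-family gene counts are not reproduced here.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float = 0.25) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


# Hypothetical family: 20 genes with SNPs, 11 of them in the top quartile.
p_toy = binomial_upper_tail(11, 20)
```

With 20 genes, five would be expected in the top quartile by chance, so observing 11 yields a small tail probability.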
As these gene families are involved in parasite-host interactions, the observation of elevated FST values may simply reflect a general pattern of rapid genetic divergence in Plasmodium parasites. To test this, we repeated the FST analysis between American vivax and a selection of Asian vivax isolates (Myanmar and Thailand samples only). Consistent with the phylogenetic analysis (Figure 1A), gene FST was slightly higher overall between simium and American vivax than between American and Asian vivax samples (Figure 3A). However, none of the gene families were overrepresented among genes with high FST (top-25%) between American and Asian vivax. To further examine if the elevated FST measures found for PIR, PHIST, MSP7, and TRAg genes are exclusive to the comparison between simium and American vivax, we calculated the ratio between the two FST measurements (‘simium versus American vivax’ and ‘American versus Asian vivax’) (Figure 3B). The ratios for PIR, PHIST and TRAg genes were significantly higher than observed for the remaining genes (Figure 3C), whereas ratios for MSP7 genes were not (Mann-Whitney U, p=0.12). Although both the P. simium and the P. vivax P01 genomes encode a high number of gene family members, our analysis is restricted to the P. vivax genes to which our P. simium short read sequences can map. For example, only 408 out of the 1209 P. vivax PIR genes have coverage from P. simium reads across at least 80% of their gene length (Figure S10). Further, an even smaller number of these genes have detectable SNPs between simium and American vivax samples and are included in the analysis (numbers shown below Figure 3B).
Sequence redundancy among gene family loci could result in spurious cross-mapping of short sequence reads. To test for this, we specifically examined the quality of SNPs in gene families. SNPs residing in gene families showed no signs of decreased calling, mapping, or base qualities compared to other SNPs (Figure S11).
A phylogenetic analysis of PIR, PHIST and TRAg proteins harbouring genomic SNPs revealed no apparent association between certain protein phylogenetic sub-groups and high FST ratios (Figure S12-S14), consistent with a subtle signature of polygenic adaptation in these gene families.
When testing all exported genes and genes involved in invasion (Table S8), the observed FST ratios were not significantly different from the background (Mann-Whitney U, p=0.5473). Hence, the differences in FST observed for PIR, PHIST and TRAg genes are not a general phenomenon amongst the genes known to be involved in interactions with the host and red cell invasion.
The observed skew towards higher FST values when comparing simium and American vivax (Figure 3A) could be a result of an inherent diversity between different American vivax populations potentially stemming from multiple introductions of P. vivax to the American continent 17. To test if such founder effects and subsequent population bottlenecks could explain the observations, we repeated the FST analysis using only Mexican vivax samples as American representatives. Four Mexican samples (SRS693273, SRS694229, SRS694244, SRS694267) were used. These clustered close together in both the SNP phylogeny (Figure 1) and in the PCA and MDS plots (Figure S4 & Figure S5), and are assumed to share a recent evolutionary history. This analysis revealed the same pattern of elevated FST values between simium and Mexican vivax, and PIR genes again displayed significantly higher FST ratios (Figure S15). Although PHIST and TRAg genes also showed higher FST ratios, these were no longer significant (Figure S15). We therefore conclude that the observed higher FST values between simium and American vivax PIR genes are not solely a result of diversity within American vivax populations, but rather appear specific to comparisons with P. simium.
Adaptive changes in PIR genes would be expected to produce stronger genetic divergence in non-synonymous codon positions. To examine this, we divided genic SNPs into synonymous and non-synonymous changes. In PIR genes, there are 353 non-synonymous and 185 synonymous SNPs (synonymous to non-synonymous SNP ratio = 1:1.91). Similarly, in PHIST and TRAg genes we find 220 and 103 non-synonymous, respectively, and 67 and 41 synonymous SNPs, respectively (PHIST ratio = 1:3.28, TRAg ratio = 1:2.51). In all other genes, the ratio between synonymous and non-synonymous SNPs is 1:1.49. Hence, the proportion of non-synonymous SNPs in PIR, PHIST and TRAg genes is significantly higher than in all other genes (chi-square, PIR; p=0.0073, PHIST; p=9.4×10−9, TRAg; p=0.0054).
Our finding that PIR, PHIST and TRAg genes overall display markedly higher FST values between simium and vivax suggests that these gene groups are enriched for private alleles, consistent with natural selection acting upon these genes subsequent to the split between P. simium and P. vivax.
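The reported ratios can be checked directly from the SNP counts by dividing the non-synonymous count by the synonymous count for each family. The sketch below does this and also shows one way a 2×2 chi-square comparison against the remaining genes could be set up. The totals for "all other genes" are not given in the text, so the background counts used here are hypothetical values chosen only to match the reported 1.49 ratio; the resulting statistic illustrates the calculation and is not a reproduction of the published p-values.

```python
from math import erfc, sqrt

# Non-synonymous and synonymous SNP counts reported in the text.
counts = {"PIR": (353, 185), "PHIST": (220, 67), "TRAg": (103, 41)}
ratios = {fam: round(nonsyn / syn, 2) for fam, (nonsyn, syn) in counts.items()}


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple:
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p-value) with one degree of freedom.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("table has an empty row or column")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = erfc(sqrt(stat / 2))  # chi-square survival function, df = 1
    return stat, p_value


# HYPOTHETICAL background counts chosen only to match the reported 1.49 ratio;
# the real totals for "all other genes" are not given in the text.
background = (14900, 10000)
stat, p = chi_square_2x2(*counts["PIR"], *background)
```

Because the p-value depends on the size of the background as well as its ratio, the real counts would be needed to recover the values reported above.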
P. simium invasome components
In invading P. vivax merozoites, binding to host red blood cells is mediated by two gene families: Duffy Binding Proteins (DBPs) bind the Duffy Antigen Receptor for Chemokines (DARC) 32,33, which is present on both host normocytes and reticulocytes, whereas Reticulocyte Binding Proteins (RBPs) preferentially bind host reticulocytes 34−36. Recently, the reported protein structure of P. vivax RBP2b revealed the evolutionary conservation of residues involved in the invasion complex formation 36. Two DBPs, DBP1 and DBP2, are present in P. vivax P01 (Table S9). RBPs can be divided into three subfamilies, RBP1, RBP2, and RBP3 37. The P. vivax P01 genome encodes 11 RBPs (including the reticulocyte binding surface protein, RBSA), of which three are pseudogenes (Table S9).
The P. vivax DBP and RBP protein sequences were used to search the P. simium proteins, resulting in the detection of the two DBP proteins and RBP1a, RBP1b, RBP2a, RBP2b, and RBP3 and failure to detect RBP2c and RBP2d (Figure 4; Table S9; Figure S16; Figure S17) across all sequenced P. simium samples. As in other P. vivax genomes, the P. simium RBP3 is a pseudogene 38, indicating that the pseudogenization event happened prior to the split between P. vivax and P. simium.
To determine whether the apparent absences of individual RBP genes in P. simium were due to incomplete genome assembly, we examined the coverage of P. simium reads mapped onto P. vivax RBP gene loci. As expected, no P. simium coverage was observed at the RBP2c, RBP2d, and RBP2e genes in P. simium samples, including the previously published CDC strain deposited in GenBank (accession ACB42432) 39 (Figure S18).
Coverage of mapped reads across invasome gene loci revealed no apparent elevated coverage in genes compared to their flanking genomic regions, which would otherwise be expected if the P. simium genome contained multiple (duplicated) copies of non-assembled invasion genes (Figure S19). Similarly, analysis of P. simium read mapping data using the DELLY software 40 showed no large genomic duplication or deletion events at loci harbouring invasion genes (Table S10), although numerous short indels were detected within protein-coding genes (Table S11).
Structural variation in P. simium Duffy Binding Protein 1
The simium assembly revealed that the invasion gene DBP1 contains a large deletion within its coding sequence (Figure 4) (a full alignment is provided in Figure S20). Intriguingly, the previously published P. simium CDC strain (originally isolated in 1966) DBP1 does not contain the deletion (‘simium CDC’ in Figure 4B). A haplotype network confirms that this previously published DBP1 gene is indeed a P. simium sequence (Figure S22), and the SNP analyses consistently assign the CDC strain to the simium cluster (Figure 1, Figure S4, and Figure S5). Compared to the P. vivax P01 reference genome, the SalI reference harbours a 27 base pair deletion in DBP1, in contrast to the 115 bp deletion observed in all P. simium samples isolated from humans (Figure 4). The 27 bp deletion is also present in most P. vivax isolates (Figure S23). Additional deletion patterns exist among isolates, and in a few cases multiple versions are detected within samples (Figure S23).
The presence of repetitive sequences within the DBP1 gene could potentially result in aberrant assembly across the DBP1 locus, which could appear as an apparent deletion in subsequent bioinformatic analysis. We tested this possibility and found that the DBP1 gene does not harbour any noticeable degree of repetitiveness (Figure S24). Several read mapping analyses confirmed that the P. simium-specific 115 bp deletion was not an assembly artefact (Figure S25-S27).
We next designed primers for PCR amplification of a genomic segment across the deleted region in the P. simium DBP1 gene and tested the occurrence of these deletion events in a range of P. vivax and P. simium field samples from Brazil. All P. vivax samples tested by PCR produced bands consistent with absence of the deletion, whereas all samples from human-infecting P. simium produced bands consistent with the presence of the precise 115 bp deletion (Figure S28, top & middle). Interestingly, non-human primate (NHP)-infecting P. simium isolates were a mix of samples with and without deletions (Figure S28, bottom). If the P. simium-specific deletion in DBP1 is a prerequisite for the ability to infect humans, this suggests that only a subset of NHP-infecting P. simium parasites currently possess the DBP1 allele required for zoonotic transfer to humans.
A large, additional deletion was observed in the P. simium RBP2a gene, the presence of which was also supported by read mapping and PCR analysis (Figure 3, Figure S29-S32).
Potential structural implications of the deletion in DBP1 and RBP2a
We next investigated if the observed deletions render DBP1 and RBP2a dysfunctional. DBP1 contains a large extracellular region, which includes the N-terminal DBL region that mediates the association with DARC in P. vivax 41, followed by a largely disordered region and a cysteine-rich domain (Figure 4c). DBP1 has a single-pass transmembrane helix and a short cytoplasmic tail. The deletion observed in the human-infecting P. simium only affects the disordered region, leaving the flanking domains intact. We produced homology models of the DBL domains from the P. vivax strain P01, the human-infecting P. simium strain AF22, and the P. simium CDC strain, based on the crystal structure of the >96% identical DBL domain of P. vivax bound to DARC (PDB ID 4nuv). Whereas no significant substitutions were found in the DBL domain between the two P. simium sequences, our analysis showed that residue substitutions between P. simium and P. vivax DBL domains cluster in proximity to the DARC binding site (Figure S33). Based on our models, these substitutions are unlikely to negatively affect the association with DARC, supporting the view that the DBL domains of both P. simium strains would be capable of binding to human DARC. Hence, the human-infecting P. simium sequence encodes a protein that retains the capacity to bind to human DARC, but would have the interacting domain positioned closer to the membrane than in the monkey-infecting CDC strain.
The deletion we detected in human-infecting RBP2a was more severe, resulting in the loss of 1003 residues. These residues are predicted to form a mostly α-helical extracellular stem-like structure that positions the reticulocyte binding domain away from the membrane (Figure 4d). However, given that the deletion affects neither the transmembrane region nor the receptor-binding domain, our analysis supports the view that the resulting truncated RBP2a protein can still associate with the human receptor, but that the binding event would occur closer to the plasmodium membrane.
Discussion
We present the genome of Plasmodium simium, the eighth malaria parasite species known to infect humans in nature. In recent evolutionary time, P. simium has undergone both anthroponosis and zoonosis making it unique for the study of the genetics underlying host-switching in malaria parasites. The genome content confirmed the close phylogenetic relationship between P. simium and P. vivax, and further analyses on single nucleotide divergences support a very recent American origin for P. simium. This recent split between P. vivax and P. simium precludes detection of genes under positive evolution 42, and we have instead performed a general analysis of population differentiation between extant P. simium and P. vivax isolates using FST. We find that members of three gene families involved in antigenic variation, PIR, PHIST and TRAg, show significantly elevated FST levels between P. simium and P. vivax. As higher FST values amongst these genes are not observed between global vivax populations, their genetic differentiation appears to be associated with host-switching between human and monkey.
Two proteins involved in host invasion, DBP1 and RBP2a, were found to harbour extensive deletions in P. simium compared to P. vivax. Interestingly, experimental analysis of P. simium samples revealed that isolates from human hosts all carried the DBP1 deletion, whereas isolates from non-human primates displayed both absence and presence of the deletion. This DBP1 deletion is not present in the P. simium isolated from a brown howler monkey in the 1960s, which was previously shown to be incapable of infecting humans 23, although some degree of laboratory adaptation of this parasite may have affected its genome. However, this deletion is also absent in P. vivax, so cannot in itself explain the ability of P. simium to infect humans in the current outbreak. It is possible, however, that this deletion is required for P. simium to invade human red blood cells given the alterations that have occurred elsewhere in its invasome following adaptation to non-human primates since the split between P. simium and its human-infecting P. vivax ancestor.
Invasome proteins are obvious candidates for genetic factors underlying host-specificity, and an inactivating mutation in a P. falciparum erythrocyte binding antigen has recently been shown to underlie host-specificity 43. Traditionally, functional studies on invasome proteins have focused on domains known to bind or interact directly with the host. Although the P. simium-specific DBP1 and RBP2a deletions reported here do not cover known structural motifs, these deletions could nevertheless affect host cell recognition as disordered protein regions have known roles in cellular regulation and signal transduction 44. Further, a shorter, less flexible linker between the plasmodium membrane and the receptor-binding DBP1 domain may favour a more rigid and better oriented positioning of the dimeric DBP1, enhancing its capacity to engage the human receptor.
Phylogenetic analysis of the P. simium clade gives the geographical location of its most closely related P. vivax strain as Mexico, and not Brazil. In imported populations, the relationship between geographical and genetic proximity may be weak. Multiple introductions of diverse strains from founder populations may occur independently over large distances, so that two closely related strains may be introduced in distantly located regions. It may be postulated that there occurred the introduction of strains of P. vivax to Mexico from the Old World that were closely related, due to similar regions of origin, to strains introduced to the Atlantic Forest which went on to become P. simium in New World monkeys. Strains from a different point of origin were introduced to the Amazonian region of Brazil. This hypothesis necessitates reproductive isolation of the P. simium clade from the Brazilian P. vivax parasites following their initial introduction; an isolation that would be facilitated, presumably, by their separate host ranges.
Due to uncertainties regarding the number of individual genomes that were transferred during the original host switch from man to NHPs that resulted in the formation of the P. simium clade, it is impossible to perform dating analyses to determine a time for the split between P. vivax and P. simium with which we can be confident. The phylogeny shown in figure 1 is consistent with the hypothesis that all present-day P. vivax/P. simium originated from a now extinct Old World population. The most parsimonious explanation for this is that today’s New World P. vivax/P. simium originated from European P. vivax, which was itself a remnant of the original Eurasian/African P. vivax driven to extinction in Africa by the evolution of the Duffy negative condition in the local human populations, and from Europe by malaria eradication programmes in the latter half of the twentieth century. This hypothesis is supported by the evidence of a close relationship between historical Spanish P. vivax and South American strains of the parasite 16, and by previous analyses of the mitochondrial genome 45. Therefore, we postulate that the host switch between humans and non-human primates that eventually led to establishment of P. simium in howler monkeys must have occurred subsequent to the European colonisation of the Americas, within the last 600 years.
We find no evidence from the nuclear genome, the mitochondrial genome or the apicoplast genome that any of the P. vivax/P. simium strains from the New World considered in our analyses are more closely related to Old World parasites than they are to each other, as previously contended 46. However, our nuclear genome phylogeny is based on genome-wide SNPs, and so represents an “average” phylogeny across the genome. Because of recombination, this average cannot be taken to reflect the true history of parasite ancestry, and it is possible that trees produced from individual genes might reveal different phylogenetic relationships.
Given the limited genetic diversity amongst the P. simium isolates considered here compared to that of P. vivax, it is almost certain that the original host switch occurred from humans to NHPs, and not the other way around 22. Similarly, the larger amount of genetic diversity in the current NHP-infecting P. simium compared to those P. simium strains isolated from humans (as indicated by the higher degree of DBP1 polymorphism in the NHP-infecting P. simium compared to the strains infecting humans) suggests that humans are being infected from a pool of NHP parasites in a true zoonotic manner, as opposed to the sharing of a common parasite pool between humans and NHPs.
The biological definition of a species is a group of organisms that can exchange genetic material and produce viable offspring. We have no way of knowing whether this is the case for P. vivax and P. simium, and genetic crossing experiments would be required to resolve this question. Our phylogenetic analysis, however, clearly shows P. simium forming a clade on its own within the broader diversity of P. vivax, and that strongly suggests, given what we know about its biology, that allopatric speciation has been/is occurring. Plasmodium simium appears to have been reproductively isolated from other strains of P. vivax for long enough for significant genetic differentiation to occur (FST = 0.46), with some invasome genes showing even higher genetic differentiation.
Plasmodium simium is currently recognised as a species separate from P. vivax; it has been well characterised and described in the literature, and there is a type specimen available, with which all the strains sequenced here cluster in one monophyletic group. Therefore, we cannot at present overturn the species status of P. simium in the absence of conclusive proof from crossing experiments.
In summary, the recent outbreak of human malaria in the Atlantic Forest of Rio de Janeiro underlines the impact of zoonotic events on human health. In this sense, non-human primate malaria parasites can be considered a reservoir of potential infectious human parasites that must be considered in any attempt at malaria eradication. Little is known about the genetic basis for zoonosis, yet the presented genome sequence of P. simium suggests a deletion within the DBP1 gene as a possible facilitator of zoonotic transfer. The genome of P. simium will thus form an important basis for future functional characterizations of the mechanisms underlying malaria zoonosis.
Methods
Sample Collection and Preparation
Human and primate samples of P. simium were collected and prepared as part of a previous study 1,14. Additionally, two P. vivax samples from the Amazon area of Brazil were also collected from human patients (Table S2). All participants provided informed written consent. The P. simium CDC (Howler) strain (Catalog No. MRA-353) from ATCC was obtained via the BEI Resources Repository in NIAID-NIH (https://www.beiresources.org/).
DNA extraction and sequencing
DNA was extracted as described 1. The genomic DNA for each sample was quantitated using the Qubit® 2.0 Fluorometer and was used for library preparation. The DNA for intact samples was sheared to fragments of 500 bp using a Covaris E220 DNA sonicator. The DNA libraries for intact samples were made using the TruSeq Nano DNA Library Prep kit (Illumina), whereas the DNA libraries for degraded samples were made using the Ovation Ultralow Library System V2 kit (Nugen), according to the manufacturers’ instructions. The amplified libraries were stored at −20 °C. The pooled libraries were sequenced on an Illumina HiSeq4000 instrument (2 × 150 bp PE reads) (Illumina). A PhiX control library was added to the sequencing run as a base-balanced sequence for the calibration of the instrument so that each base type is captured during the entire run. Raw sequence reads were submitted to FastQC v.0.11.5 and the quality score of the sequences generated was determined. Samples AF22, AF26 and AF36 were additionally sequenced and scaffolded using the PacBio RS II platform (Pacific Biosciences, California, US) with a SMRT library. Genomic DNA from the P. vivax samples was extracted from filter paper as previously described 47.
Illumina reads preparation and mapping
FastQC v 0.11.6 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) was used to evaluate the quality of Illumina reads. Illumina adapters were removed, and reads were trimmed using the Trimmomatic v0.33 48 software with the following settings: LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:36.
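The effect of the four trimming settings can be illustrated with a minimal Python sketch. This is a simplified re-implementation for illustration only, not the Trimmomatic code itself; the function name `trim_read` and its arguments are hypothetical, and the real tool handles edge cases (for example, reads shorter than the window) differently.

```python
from typing import List, Optional, Tuple


def trim_read(seq: str, quals: List[int],
              leading: int = 20, trailing: int = 20,
              window: int = 4, window_q: int = 20,
              minlen: int = 36) -> Optional[Tuple[str, List[int]]]:
    """Simplified illustration of LEADING, TRAILING, SLIDINGWINDOW and MINLEN.

    Steps are applied in the order listed, as in the settings quoted above.
    Returns None when the trimmed read is shorter than minlen.
    """
    # LEADING: drop 5' bases while their quality is below the threshold.
    start = 0
    while start < len(quals) and quals[start] < leading:
        start += 1
    seq, quals = seq[start:], quals[start:]

    # TRAILING: drop 3' bases while their quality is below the threshold.
    end = len(quals)
    while end > 0 and quals[end - 1] < trailing:
        end -= 1
    seq, quals = seq[:end], quals[:end]

    # SLIDINGWINDOW: scan 5' to 3' and cut at the start of the first window
    # whose mean quality falls below window_q (simplified behaviour).
    cut = len(quals)
    for i in range(max(len(quals) - window + 1, 0)):
        if sum(quals[i:i + window]) / window < window_q:
            cut = i
            break
    seq, quals = seq[:cut], quals[:cut]

    # MINLEN: discard reads that are too short after trimming.
    if len(seq) < minlen:
        return None
    return seq, quals
```

With the default thresholds, a 40-base read whose last 20 bases have quality 10 is discarded, because only 20 bases remain after 3' trimming and MINLEN is 36.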
To exclude human reads from our analysis, trimmed reads were mapped against the human reference genome (v. hg38) and the Plasmodium vivax strain P01 reference genome (v. 36) from PlasmoDB (www.plasmodb.org) with bowtie2 (v 2.3.3.1) 49. Reads mapping against the human genome were removed from further analysis.
Genome Assembly
P. simium sample AF22 was selected for genome assembly based on read quality and coverage. After removal of human contaminants, Illumina reads were assembled into contigs using the Spades (v 3.70) assembler 50. Contigs were assembled into scaffolds by running SSPACE (v 3.0) 51 for 15 rounds, and gaps were filled with Gapfiller (v 1.10) 52. Scaffolds were subsequently corrected with Illumina reads using the Pilon (v 1.22) software 53. Blobtools (v 1.0) (DOI: 10.5281/zenodo.845347) 54 was used to remove any residual contaminant scaffolds. Genome size and GC content are in line with those of P. vivax (Table S1). Genomic scaffolds representing the mitochondrial and apicoplast genomes were identified through blastn searches against the corresponding P. falciparum and P. vivax sequences (Figure S1). The P. simium mitochondrial genome was aligned against a range of previously published P. vivax and P. simium mitochondrial genomes 55,56. A gap-filled region in the alignment where the distal parts of the P. simium scaffold were merged was manually deleted. A minimum spanning haplotype network was produced using PopART 57,58, confirming the authenticity of the P. simium mitochondrial genome (Figure S34).
Genome Annotation
Two approaches were used to annotate the reference P. simium AF22 genome. Firstly, the Maker pipeline (v 2.31.8) 59 was run for two rounds, using ESTs and protein evidence from P. vivax, P. cynomolgi strain B and P. falciparum to generate Augustus gene models. Secondly, a separate annotation was produced using the Companion web server 60. Companion was run using the P. vivax P01 reference assembly and default parameters. Basic annotation statistics are provided in Table S1. The relatively low number of genes (5966) is due to the fragmented and incomplete nature of the P. simium assembly (Table S1). Gene content was estimated using BUSCO 61,62 (v3.0), revealing an annotation completeness comparable to other Plasmodium genome assemblies (Figure S35).
PlasmoDB Genome References and Annotations
Genome fasta files, as well as annotated protein and CDS files, were obtained from PlasmoDB for the following species: P. gallinaceum 8A, P. cynomolgi B and M, P. knowlesi H, P. falciparum 3D7, P. reichenowi G01, P. malariae UG01, P. ovale curtisi GH01, P. coatneyi Hackeri, P. vivax P01 and P. vivax SalI. For each species, version 36 was used.
Orthologous group determination
Amino acid sequence-based phylogenetic trees were prepared using protein sequences from the P. simium annotation, as well as the protein annotations from 10 malaria species downloaded from PlasmoDB: P. vivax P01, P. cynomolgi B, P. knowlesi H, P. vivax-like Pvl01, P. coatneyi Hackeri, P. falciparum 3D7, P. gallinaceum 8A, P. malariae UG01, P. ovale curtisi GH01, and P. reichenowi G01. The P. vivax-like annotation was taken from PlasmoDB version 43 and all other annotations from version 41. A total of 3181 1:1 orthologous genes were identified using the Proteinortho (v 6.0.3) software 63. Approximately 88% of the predicted genes in P. simium have orthologs in P. vivax P01 (Figure S36).
Indels in genes
Structural variations were detected using DELLY 40 (v 0.7.9). Coordinates of structural rearrangements and their nearest genes are listed in Table S10. Shorter indels were detected from soft-clipping information in read mapping (using the ‘-i’ option in DELLY) (Table S11). Indels in exons were further compared to indels present in the P. simium AF22 genome assembly, suggesting a high false discovery rate of DELLY indels compared to assembly indels (Figure S37).
Protein phylogeny
Protein sequences were aligned using mafft (v 7.222) 64 and alignments were subsequently trimmed with trimAl (v 1.2rev59) 65 using the heuristic ‘automated1’ method to select the best trimming procedure. Trimmed alignments were concatenated and a phylogenetic tree was constructed using RAxML (v 8.2.3) 66 with the PROTGAMMALG model.
SNP calling and analysis
Short sequence reads from 15 P. simium samples (13 human and 2 monkey) and two P. vivax samples, all from this study (Table S2), were aligned against a combined human (hg38) and P. vivax (strain P01, version 39) genome using NextGenMap (v0.5.5) 67. The same procedure was applied to 30 previously published P. vivax strains 25 and the Sal1 reference. These data sets were downloaded from ENA (https://www.ebi.ac.uk/ena) (Table S3).
Duplicate reads were removed using samtools (v 1.9) 68 and the filtered reads were realigned using IndelRealigner from the GATK package (v 4.0.11) 69. SNPs were called independently with GATK HaplotypeCaller and freebayes (v 1.2.0) 70, keeping only SNPs with a QUAL score above 30. The final SNP set was determined as the intersection between the GATK and freebayes call sets. Allele frequencies and mean coverage across SNP sites are shown in Figure S38. The PCA plot was constructed using plink (v 1.90) 71, and admixture analysis was done with Admixture (v 1.3.0) 26. FST values were estimated from nucleotide data with the PopGenome R package 72,73 using the Weir & Cockerham method 74. Non-synonymous and synonymous SNPs were identified using snpeff 75.
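The intersection step can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the pipeline used in the study: it assumes plain-text, tab-delimited VCF records, treats a SNP as shared only if chromosome, position, reference and alternative allele all match, and the function names `passing_snps` and `consensus_snps` are hypothetical.

```python
from typing import Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def passing_snps(vcf_lines: Iterable[str], min_qual: float = 30.0) -> Set[Variant]:
    """Collect single-nucleotide variants with QUAL strictly above min_qual."""
    snps: Set[Variant] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos, _id, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
        if qual == "." or float(qual) <= min_qual:
            continue  # missing or insufficient quality
        for allele in alt.split(","):
            # Keep only single-base substitutions (no indels, no '*' or '.').
            if len(ref) == 1 and len(allele) == 1 and allele not in {".", "*"}:
                snps.add((chrom, int(pos), ref, allele))
    return snps


def consensus_snps(caller_a: Iterable[str], caller_b: Iterable[str],
                   min_qual: float = 30.0) -> Set[Variant]:
    """SNPs that pass the QUAL filter in both call sets."""
    return passing_snps(caller_a, min_qual) & passing_snps(caller_b, min_qual)
```

Requiring agreement between two independent callers trades sensitivity for specificity: a SNP reported by only one caller, or passing the quality filter in only one call set, is dropped.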
SNP phylogeny
Alleles from SNP positions with data in 55 samples were retrieved, concatenated, and aligned using mafft 64. The tree was produced by PhyML 76,77 with the GTR substitution model selected by SMS 78. Branch support was evaluated with the Bayesian-like transformation of the approximate likelihood ratio test, aBayes 79. The phylogenetic network was made in SplitsTree 80 using the NeighborNet method 81.
Nucleotide diversity
Conventional tools that calculate nucleotide diversity directly from variant call files assume that samples are aligned across the entire reference sequence. Because read coverage across the reference genome was highly uneven between samples (Figure S38), an adjustment was required. Coverage across the reference genome was thus calculated for each sample using samtools mpileup (v 1.9) 68. For each comparison between two samples, the nucleotide divergence was calculated as the number of detected bi-allelic SNPs per nucleotide with read coverage of at least 5X in both samples.
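The coverage-adjusted calculation can be illustrated with a short Python sketch. It assumes per-base coverage has already been parsed into one list per sample (indexed by 0-based reference position) and that the positions of bi-allelic SNPs distinguishing the two samples are known; the function name `pairwise_divergence` and its inputs are hypothetical, not taken from the original analysis scripts.

```python
from typing import List, Optional, Sequence


def pairwise_divergence(cov_a: Sequence[int], cov_b: Sequence[int],
                        snp_positions: List[int],
                        min_cov: int = 5) -> Optional[float]:
    """SNPs per nucleotide, counting only sites covered >= min_cov in both samples.

    cov_a, cov_b: per-base read depth along the same reference (assumed input).
    snp_positions: 0-based positions of bi-allelic SNPs between the two samples.
    Returns None if no site is sufficiently covered in both samples.
    """
    def covered(pos: int) -> bool:
        return cov_a[pos] >= min_cov and cov_b[pos] >= min_cov

    # Denominator: sites callable in both samples.
    callable_sites = sum(1 for pos in range(min(len(cov_a), len(cov_b))) if covered(pos))
    if callable_sites == 0:
        return None

    # Numerator: SNPs that fall on jointly callable sites.
    n_snps = sum(1 for pos in snp_positions if covered(pos))
    return n_snps / callable_sites
```

Restricting both the numerator and the denominator to jointly covered sites prevents poorly covered samples from appearing artificially similar to the reference.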
Gene sequence deletions
Exploratory Neighbor-Joining phylogenies were produced with CLUSTALW 82,83 and visualized with FigTree (https://github.com/rambaut/figtree/) after alignment with mafft 82. PacBio reads were aligned using Blasr (v 5.3.2) 84, and short Illumina reads using NextGenMap (v0.5.5) 67. Dotplots were produced with FlexiDot (v1.05) 85.
Gene families and groups
Exported gene sets were compiled from the literature 86-88. Invasion genes were retrieved from 89. Gene families were assessed in seven Plasmodium genomes (P. simium, P. vivax SalI, P. vivax P01, P. vivax-like Pvl01, P. cynomolgi M, P. cynomolgi B, and P. knowlesi H) using the following pipeline. For all genomes, annotated genes were collected for each gene family. These ‘seed’ sequences were used to search all proteins from all genomes using BLASTP, and the best hits for all proteins were recorded. For each gene family, ‘seed’ sequences were then aligned with mafft 64 and trimmed with trimAl 65, and HMM models were built using HMMer (http://hmmer.org/). For PIR/VIR and PHIST genes, models were built for each genome independently; for all other gene families, a single model was built from all genomes. These models were then used to search all proteins in all genomes. All proteins with a best BLASTP hit to a ‘seed’ sequence from a given genome were sorted according to their bit score. The lowest 5% of hits were discarded and the remaining proteins with best hits to a ‘seed’ sequence were assigned one ‘significant’ hit. As all proteins were searched against ‘seeds’ from the six annotated genomes (P. simium excluded), a maximum of six ‘significant’ BLAST hits could be obtained. Similarly, for each HMM model the bottom 25% of hits were discarded and the remaining hits were considered ‘significant’. The final set of gene families consists of previously annotated genes and un-annotated genes with at least two ‘significant’ hits (either BLASTP or HMM).
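The hit-filtering logic for a single gene family can be sketched in Python. The data layout (one dictionary of bit scores per seed genome and per HMM model), the function names `keep_top` and `assign_family`, and the rounding of the discarded fraction (floor) are assumptions made for illustration; the original implementation may differ in these details.

```python
import math
from collections import Counter
from typing import Dict, Set

Hits = Dict[str, float]  # protein id -> bit score


def keep_top(hits: Hits, discard_fraction: float) -> Set[str]:
    """Drop the lowest-scoring fraction of hits and return the remaining ids."""
    ranked = sorted(hits, key=hits.get)  # ascending by bit score
    n_drop = math.floor(len(ranked) * discard_fraction)  # rounding is an assumption
    return set(ranked[n_drop:])


def assign_family(blast_by_genome: Dict[str, Hits],
                  hmm_by_model: Dict[str, Hits],
                  min_hits: int = 2) -> Set[str]:
    """Proteins with at least min_hits 'significant' hits (BLASTP or HMM)."""
    counts: Counter = Counter()
    # One possible 'significant' BLASTP hit per seed genome (lowest 5% discarded).
    for hits in blast_by_genome.values():
        counts.update(keep_top(hits, 0.05))
    # One possible 'significant' hit per HMM model (lowest 25% discarded).
    for hits in hmm_by_model.values():
        counts.update(keep_top(hits, 0.25))
    return {protein for protein, n in counts.items() if n >= min_hits}
```

In this sketch, previously annotated family members would be added to the returned set separately, since the text includes them regardless of their hit count.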
PCR amplification of DBP1 and RBP2a genes
PCR primers were initially designed from alignments between P. vivax and P. simium sequences and subsequently tested using Primer-BLAST 90 and PlasmoDB (www.plasmodb.org). For DBP1, the reaction was performed in 10 μL volumes containing 0.5 μM of each oligonucleotide primer, 1 μL DNA and 5 μL of Master Mix 2x (Promega) (0.3 units of Taq Polymerase, 200 μM each deoxyribonucleotide triphosphate and 1.5 mM MgCl2). Samples were run with the following settings: 2 minutes of activation at 95°C, followed by 35 cycles with 30 seconds denaturation at 95°C, 30 seconds annealing at 57°C (ΔT=-0.2 °C from 2nd cycle) and 1 minute extension at 72°C, then 5 minutes final extension at 72°C and hold at 4°C. For RBP2a, the reaction was performed in 10 μL volumes containing 0.5 μM of each oligonucleotide primer, 1 μL DNA, 0.1 μL Platinum Taq DNA Polymerase High Fidelity (Invitrogen, 5 U/μL), 0.2 mM each deoxyribonucleotide triphosphate and 2 mM MgSO4. The PCR assays were performed with the following cycling parameters: an initial denaturation at 94°C for 1.5 min followed by 40 cycles of denaturation at 94°C for 15 sec, annealing at 65°C for 30 sec (ΔT=-0.2 °C from 2nd cycle) and extension at 68°C for 3.5 min. The temperature was then reduced to 4 °C until the samples were taken. All genotyping assays were performed in a Veriti 96-well thermocycler (Applied Biosystems), and the amplified fragments were visualized by electrophoresis on agarose gels (2% for DBP1 and 1% for RBP2a) in 1x TAE buffer (40 mM Tris-acetate, 1 mM EDTA) with 5 μg/mL ethidium bromide (Invitrogen) in a horizontal system (Bio-Rad) at 100 V for 30 min. Gels were examined with a UV transilluminator (UVP - Bio-Doc System).
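The touchdown annealing steps can be made explicit with a small Python calculation. It assumes that "ΔT=-0.2 °C from 2nd cycle" means the annealing temperature is lowered by 0.2 °C in every cycle after the first; the function name `annealing_schedule` is hypothetical.

```python
from typing import List


def annealing_schedule(start_temp: float, cycles: int, delta: float = -0.2) -> List[float]:
    """Annealing temperature (in °C) for each cycle of a touchdown PCR.

    Assumes the first cycle runs at start_temp and every later cycle
    changes the temperature by delta relative to the previous one.
    """
    return [round(start_temp + delta * i, 1) for i in range(cycles)]


dbp1 = annealing_schedule(57.0, 35)   # DBP1 assay: 35 cycles starting at 57 °C
rbp2a = annealing_schedule(65.0, 40)  # RBP2a assay: 40 cycles starting at 65 °C
print(dbp1[0], dbp1[-1])    # first and last annealing temperature for DBP1
print(rbp2a[0], rbp2a[-1])  # first and last annealing temperature for RBP2a
```

Under this reading, annealing falls from 57 °C to 50.2 °C over the DBP1 cycles and from 65 °C to 57.2 °C over the RBP2a cycles.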
To prevent cross-contamination, the DNA extraction and mix preparation were performed in separate “parasite DNA-free rooms”. Furthermore, each of these separate areas had its own set of pipettes and all procedures were performed using plugged pipette tips. DNA extraction was performed twice on different days. Positive (DNA extracted from blood from patients with known P. vivax infection) and negative (no DNA and DNA extracted from individuals who have never traveled to malaria-endemic areas) controls were used in each round of amplification. DNA extracted from blood of a patient with high parasitemia for P. vivax and DNA of P. simium of a non-human primate with an acute infection and parasitemia confirmed by optical microscopy served as positive controls in the PCR assays. Primer sequences are provided in Figures S28 and S32.
Structural modelling of DBP1 and RBP2a genes
RaptorX 91 was used for prediction of secondary structure and protein disorder. Homology models for the DBL domain of P. vivax P01 strain, P. simium AF22, and the previously published CDC P. simium strain were produced by SWISS-MODEL 92, using the crystallographic structure of the DBL domain from Plasmodium vivax DBP bound to the ectodomain of the human DARC receptor (PDB ID 4nuv), with an identity of 98%, 96% and 96% for P. vivax, P. simium AF22 and P. simium CDC, respectively. QMEAN values were −2.27, −2.04 and −2.03, respectively. The homology model for the reticulocyte binding protein 2 (RBP2a) of P. vivax strain P01 was produced based on the cryoEM structure of the complex between the P. vivax RBP2b and the human transferrin receptor TfR1 (PDB ID 6d05) 36, with an identity of 31% and QMEAN value of −2.46. The visualization and structural analysis of the produced models was done with PyMOL (https://pymol.org/2/).
Data availability
The reference genome assembly and short sequence reads have been uploaded to the European Nucleotide Archive (https://www.ebi.ac.uk/ena/) under the Study accession number PRJEB34061.
Author contributions
CTDR, PB, AdPC, RLdO, RC and AP conceived the study. CTDR, PB, CFAdB, MdFFdC, RC and AP supervised and co-ordinated the study. AdPC, FVSdA, DAMA, CBJ, JCdSJ and ZMBH collected materials. OD, QG, AdPC, CFAdB, MdFFdC, FVSdA, DAMA, CBJ, JCdSJ and ZMBH conducted wet-lab experiments. TM, AK, SF, DCJ, FJGV, SA, CTDR, PB, RLdO, CFAdB, MdFFdC, FVSdA, DAMA, CBJ, JCdSJ, ZMBH, RC and AP analysed and interpreted the data. TM, RC, AP, CTDR, PB, CFAdB and RLdO drafted and edited the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We thank Prof. Xin-zhuan Su at the National Institute of Allergy and Infectious Diseases, NIH, for invaluable help in obtaining the parasite gDNA from the BEI Resources; Sidnei Silva and Graziela Zanini, for assistance on the parasitological diagnosis of the human samples; Aline Lavigne and Larissa Gomes for undertaking the PCR for P. vivax; Alcides Pissinatti and Silvia Bahadian Moreira for the facilities provided at the Primate Centre of Rio de Janeiro; Orzinete Rodrigues Soares for non-human primates’ blood slides; Marcelo Quintela, Waldemir Paixão Vargas, Carlos Alberto C. da Silva, Alexandre B. de Souza, Vicente Klonowski, Romenique L. Araújo, Luis R. Nogueira, Fernando Barreto, Ana L. Quijada, Luiz P.P. Silva, Gelson Medeiros, Adilson B. Ramos, Marcilene B. Ramos, Carlos A.A. Júnior, Paulo G. Barbosa, Sérgio F. Fragoso, Adilson R. Silva, Cecília Cronemberger, Marcelo Rheingantz, Leonardo Nascimento and João Marins for the field support; Grupo Técnico de Vigilância de Arboviroses (GT-Arbo – Brazilian Ministry of Health) for field and material support; and Cassio Leonel Peterka from the Brazilian Ministry of Health for malaria epidemiological data. The following reagent was obtained through BEI Resources, NIAID, NIH: Plasmodium simium, Strain Howler, MRA-353, contributed by William E. Collins. The work was supported by the King Abdullah University of Science and Technology (KAUST) through the baseline fund BRF1020/01/01 to AP and BAS/1/1056-01-01 to STA, and the Award No. URF/1/1976-25 from the Office of Sponsored Research (OSR).
The field work in the Atlantic Forest and laboratory analysis in Brazil received financial support from the Secretary for Health Surveillance of the Ministry of Health through the Global Fund (agreement IOC-005-Fio-13), Programa Nacional de Excelência (PRONEX) and contract 407873/2018-0 of the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), the Fundação de Amparo à Pesquisa do Estado de Minas Gerais (Fapemig CBB-APQ-02620-15) and the Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro (Faperj), Brazil. CNPq supports CFAB, CTDR, MFFC, PB and RLO, with a research productivity fellowship. CTDR (CNE: E-26/202.921/2018), MFFC, PB and RLO are also supported by Faperj as Cientistas do nosso estado. AdP-C was supported by a postdoctoral fellowship from the Faperj and DAMA by a fellowship from the CGZV-SVS (Brazilian Ministry of Health) TED 49/2018 grant. SF was supported by a Wellcome Seed Award in Science to DCJ (208965/Z/17/Z).