Abstract
African trypanosomes are vector-borne haemoparasites that cause African trypanosomiasis in humans and animals. Parasite survival in the bloodstream depends on immune evasion, achieved by antigenic variation of the Variant Surface Glycoprotein (VSG) coating the trypanosome cell surface. Recombination, or rather directed gene conversion, is fundamental in Trypanosoma brucei, as both a mechanism of VSG gene switching and of generating antigenic diversity during infections. Trypanosoma vivax is a related, livestock pathogen also displaying antigenic variation, but whose VSG lack key structures necessary for gene conversion in T. brucei. Thus, this study tests a long-standing prediction that T. vivax has a more restricted antigenic repertoire. Here we show that global VSG repertoire is broadly conserved across diverse T. vivax clinical strains. We use sequence mapping, coalescent approaches and experimental infections to show that recombination plays little, if any, role in diversifying T. vivax VSG sequences. These results explain interspecific differences in disease, such as propensity for self-cure, and indicate that either T. vivax has an alternate mechanism for immune evasion or else a distinct transmission strategy that reduces its reliance on long-term persistence. The lack of recombination driving antigenic diversity in T. vivax has immediate consequences for both the current mechanistic model of antigenic variation in African trypanosomes and species differences in virulence and transmission strategy, requiring us to reconsider the wider epidemiology of animal African trypanosomiasis.
African trypanosomes (Trypanosoma spp.) are unicellular hemoparasites and the cause of African Trypanosomiasis in animals and humans1. These parasites are transmitted by tsetse flies (Glossina spp.), and their proliferation in blood and other tissues leads to anaemia, immune and neurological dysfunction, which is typically fatal if untreated. The profound, negative impact of this disease on livestock productivity across sub-Saharan Africa is measured in billions of dollars annually2.
Trypanosoma vivax is a livestock parasite found throughout sub-Saharan Africa and South America3–5. Although superficially like the more familiar T. brucei (the species responsible for Human African trypanosomiasis) and T. congolense (another livestock parasite), T. vivax is distinct in morphology and motility6, cellular ultrastructure7,8 and genetic repertoire9,10. Most conspicuously, it has a truncated life cycle in tsetse flies, lacking a procyclic stage in the insect midgut, and can be transmitted non-cyclically by other genera of hematophagous flies6.
Although distinct from T. brucei, T. vivax shares a defining phenotype with other African trypanosomes. Trypanosome cell surfaces are coated with a Variant Surface Glycoprotein (VSG) that undergoes antigenic variation11. Trypanosome genomes encode hundreds of alternative VSG, but each cell expresses just a single variant. Periodically, new variants emerge that have dynamically switched to an alternative expressed VSG11. Each VSG is strongly immunogenic but confers no heterologous protection. Thus, as antibodies clear the dominant VSG clones of the parasite infra-population, serologically distinct clones replace them, rendering cognate antibodies redundant and facilitating a persistent infection12.
Previously, we showed that T. vivax VSG are distinct from those in T. brucei or T. congolense. T. vivax VSG genes display much greater sequence divergence, and include sub-families absent in other species (named Fam23-26 inclusive13). In T. brucei, gene conversion is crucial to switching VSG genes and generating novel antigens14,15. However, sequence repeats known to facilitate gene conversion in T. brucei were absent from the T. vivax reference genome, suggesting that the T. brucei-based paradigm of antigenic variation might not apply there10.
Experiments from the pre-genomic era revealed certain enigmatic features that corroborate the distinctiveness of antigenic variation in T. vivax and which remain unexplained. Animals infected with T. vivax self-cure more often and faster compared with other species, which was attributed to antigenic exhaustion16,17. Clones expressing certain VSG re-emerged late in infection after the host had developed immunity3,17. Quite unlike T. brucei or T. congolense, recovered animals displayed immunity to strains from very distant locations, indicating that T. vivax serodemes could span countries, or even the whole continent18,19. Such features prompted the prediction that antigen repertoire in T. vivax would be smaller than in other trypanosomes3.
Here, we address these long-standing issues by characterising antigenic diversity in clinical T. vivax isolates. We apply the data to examine VSG recombination in parasite populations and to profile VSG expression during experimental infections in a goat model. The Variant Antigen Profile (VAP) we establish for T. vivax shows that VSG sequence patterns in T. vivax are incompatible with the current, T. brucei-based model for antigenic variation in trypanosomes.
Results
Genomes of 28 T. vivax clinical strains isolated from seven countries were sequenced on the Illumina MiSeq platform. Genome assemblies ranged in coverage from 32.8% to 80.4%, in sequence depth from 3.5x to 78.5x, and in contiguity (N50) from 238 to 2852 (Supplementary Table 1). Using sequence homology with known VSG sequences in the T. vivax Y486 and T. brucei TREU927 reference genomes, between 40 and 436 VSG genes were recovered from assembled genome contigs; the mean average (175) is approximately one fifth of the reference genome repertoire (N=865)10.
T. vivax variant antigen profiles reflect genealogy
We devised a VAP for T. vivax VSG gene repertoire to examine antigenic diversity across strains. The four VSG-like gene sub-families (Fam23-26)13 in the T. vivax Y486 reference sequence (hereafter called ‘Y486’) occurred in all genomes, in similar proportions (Supplementary Fig. 1), making them unsuitable for discriminating between strains. Therefore, we produced clusters of orthologs (COGs) for all VSG-like sequences from Y486 and 28 clinical strains (N=6235), defining a COG as a group of VSG-like sequences with ≥90% sequence identity. This produced 2038 COGs, each comprising a single gene plus near-identical paralogs from multiple strains. Most COGs (78%) were cosmopolitan (i.e. present in multiple locations, see Methods), while 441 were strain-specific (Supplementary Table 2).
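As an illustration of the COG-based profile described above, the following is a minimal sketch, not the pipeline used in this study. It assumes COG membership has already been reduced to the set of strains carrying each COG; the function names, the toy strain names and the three-way labelling (including the intermediate 'location-specific' label) are hypothetical.

```python
def classify_cogs(cog_members, strain_location):
    """Label each COG by how widely it is distributed.

    cog_members: dict mapping COG id -> set of strain names with a member sequence.
    strain_location: dict mapping strain name -> sampling location.
    The labels are an assumed reading of 'cosmopolitan' and 'strain-specific'.
    """
    labels = {}
    for cog, strains in cog_members.items():
        locations = {strain_location[s] for s in strains}
        if len(locations) > 1:
            labels[cog] = "cosmopolitan"       # seen in more than one location
        elif len(strains) == 1:
            labels[cog] = "strain-specific"    # seen in exactly one strain
        else:
            labels[cog] = "location-specific"  # several strains, one location
    return labels


def presence_absence(cog_members, strains):
    """Return {strain: [0/1 per COG]}, with COGs taken in sorted order."""
    cogs = sorted(cog_members)
    return {s: [int(s in cog_members[c]) for c in cogs] for s in strains}


# Toy data (hypothetical strains and COGs)
strain_location = {"S1": "Nigeria", "S2": "Nigeria", "S3": "Uganda"}
cog_members = {"COG1": {"S1", "S3"}, "COG2": {"S2"}, "COG3": {"S1", "S2"}}

labels = classify_cogs(cog_members, strain_location)
vap = presence_absence(cog_members, ["S1", "S2", "S3"])
```

Each row of `vap` is a binary profile for one strain, which could then be compared between strains with any presence/absence distance.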
VAPs based on presence or absence of VSG COGs were compared to strain genealogy and geography to examine spatio-temporal variation in VSG repertoire. Fig. 1 shows that VAP-based strain relationships matched those inferred from whole genome single nucleotide polymorphisms (SNPs), and therefore, that VAP reflects both population history and location. There is a remarkable correspondence between VAPs of Ugandan strains with those from Brazil, suggesting that these Brazilian T. vivax were introduced into Brazil from East Africa. The correspondence of VAPs and SNPs is particularly clear when we compare the Ugandan/Brazilian profile with those in Nigeria. While clearly divergent in their VSG repertoire, there remain 769 COGs (37%) that are shared between these locations; (‘TvILV-21’ possesses various COGs widespread in West Africa). Thus, T. vivax VSG repertoires diverge in concert with the wider genome and provide a faithful record of population history, in contrast to T. congolense, where the opposite effect was observed20.
Global T. vivax VSG repertoire comprises 174 phylotypes
The VSG gene complements in our strain genome sequences are incomplete. So, while comparing partial strain genomes in combination provides a coherent analysis of global VSG variation, the spatial distribution of COGs, and the number of truly location-specific COGs, will increase with greater sampling. This is clear when we consider that 248 COGs (12.2%) comprise a single Y486-specific sequence, Y486 being the only strain with a complete VSG complement. Presently, a COG-based VAP will include too many false negative ‘absences’ to reliably profile individual strains.
A VAP that allows comparison of any two strains must be based on universal markers that also vary in the population. COGs are not universal and sub-families do not vary; so, we reasoned that a taxon of intermediate inclusivity would satisfy both criteria. Therefore, we devised another VAP based on phylotypes, each consisting of multiple, related COGs with ≥70% sequence identity (see Methods). 174 VSG phylotypes accommodated every VSG-like sequence we observed. Fig. 2 shows the size and distribution of these across strains and emphasizes the widespread distribution of most phylotypes, 86% (149/174) of which are cosmopolitan.
Exceptions to this trend, as structurally distinct VSG sub-families restricted to specific populations, may be epidemiologically important. Among Nigerian samples, the location with the largest sample (N=11) and so the most robust presence/absence calls, five phylotypes are unique (P94, P118, P126, P170, P173). These are not recent derivations in Nigerian T. vivax because they are defined by a threshold sequence identity and so, are of approximately equal age. Moreover, their positions in Fig. 2 indicate no significant difference in the node connectivity of Nigeria-specific and cosmopolitan phylotypes overall. As these phylotypes comprised only one or two COGs, we extended the analysis to COGs generally.
We found 130 COGs in at least 9/11 Nigerian strains and no other location. We hypothesized that, if they were relatively recent gene duplications, they would have shorter genetic distances to their closest relatives than cosmopolitan COGs. We estimated Maximum Likelihood phylogenies for each phylotype containing a Nigerian-specific COG and inferred relative divergence times using the RelTime tool in MEGA v10.0.521. This showed that there was no significant difference (p=0.35) in the mean divergence times for Nigeria-specific COGs (μ=0.038±0.005; N=83) and cosmopolitan COGs in the same phylotype (μ=0.041±0.005; N=212). Therefore, Nigerian-specific COGs and phylotypes are just as ancient as lineages with cosmopolitan distributions, and do not provide evidence for population-specific gene family expansions.
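The comparison of mean divergence times between two sets of COGs can be sketched as follows. The text does not state which statistical test produced p=0.35, so a two-sided permutation test on the difference in means is swapped in here purely for illustration; the function name and parameters are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in group means.

    a, b: lists of relative divergence times for the two groups of COGs.
    Returns the proportion of label shufflings whose absolute difference
    in means is at least as large as the observed one (add-one corrected).
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

A large p-value from such a test would be read, as in the text, as no evidence that the two groups of COGs differ in age.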
In summary, the incompleteness of strain genomes compelled us to adopt phylotypes as a universal but variable metric to profile T. vivax VSG repertoire. On this basis, T. vivax VSG repertoire appears to be relatively conserved continent-wide. Population variation does exist, especially at COG-level, but appears to originate through differential patterns of lineage loss rather than population-specific gene family expansions, since Nigeria-specific COGs are no younger than other VSG. This degree of continent-wide conservation is quite unlike patterns seen in T. brucei22. Suspecting that this indicated a more fundamental difference between African trypanosome species in how antigenic diversity evolves, we examined population variation among their VSG sequences in detail.
Minimal signature of recombination in T. vivax VSG sequences
We took multiple approaches to test the hypothesis10 that T. vivax VSG recombine less than T. brucei and T. congolense VSG. First, we asked if VSG sequences assort. Based on the current model of antigenic switching11, VSG reads from 28 clinical strains would not remain paired after mapping to Y486 because historical recombination events would have distributed them across multiple reference loci. Fig. 3a shows that the proportion of strain read-pairs remaining paired after mapping is significantly higher in T. vivax (mean=92%; N=19) relative to T. congolense (mean=87%; t=3.23; p<0.05) and T. brucei (mean=76%; t=12.8; p<0.001), and is almost as high as a negative control comprising adenylate cyclase genes (mean=97%).
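The read-pair statistic plotted in Fig. 3a can be illustrated with a minimal sketch. It assumes that each read-pair has already been reduced to the reference locus hit by each mate, and that 'remaining paired' means both mates hit the same locus; a real analysis would more likely rely on the proper-pair information reported by the read mapper. The function name and input format are hypothetical.

```python
def fraction_paired(pair_hits):
    """Proportion of read-pairs whose two mates map to the same locus.

    pair_hits: list of (locus_of_mate1, locus_of_mate2) tuples, where
    None marks an unmapped mate. Pairs with an unmapped mate are ignored
    (an assumption of this sketch).
    """
    mapped = [(a, b) for a, b in pair_hits if a is not None and b is not None]
    if not mapped:
        return 0.0
    together = sum(1 for a, b in mapped if a == b)
    return together / len(mapped)
```

Under the recombination-driven model, frequent gene conversion would be expected to lower this fraction for VSG relative to a non-recombining control family.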
Reversing this approach, we examined how Y486 VSG gene sequences mapped to strain assemblies when broken into 150 bp segments. Fig. 3b shows how the outcome of segmental mapping was defined. The mean proportion of Y486 VSG that are mosaics of strain genes (i.e. ‘Multi-coupled’ (MC: 25%) or ‘Uncoupled’ (UC: 7%)) is significantly lower than in T. congolense (MC: 33%; p<0.05; UC: 31%; p<0.001) and T. brucei (MC: 39%; p<0.001; UC: 12%; p<0.001), while the proportion that are essentially orthologous (i.e. ‘Fully-coupled’ (FC: 59%)) is significantly greater (for T. congolense, p<0.001; for T. brucei, p<0.001) (Fig. 3c). Analysis of phylogenetic incompatibility in alignments of FC and MC quartets using PHI23 corroborates the mapping patterns. Across all species, FC VSG contain little evidence for phylogenetic incompatibility and not generally more than the adenylate cyclase control (Fig. 3d). While MC VSG display phylogenetic incompatibility, T. vivax MC quartets displayed this less frequently (Ppi=41%) than in T. congolense (Ppi=65%) and T. brucei (Ppi=67%).
While there are fewer MC VSG in T. vivax, this sizeable minority might still be genuine mosaics. Alternatively, other processes such as gene paralogy or substitution rate heterogeneity could account for the signature of recombination. Hence, we explicitly modelled the history of recombination within FC or MC sequence quartets using ancestral recombination graphs (ARG) and inferred the time to most recent common ancestor (TMRCA) for each quartet. Average TMRCA was significantly greater for T. vivax FC VSG (0.19±0.17) than either T. congolense (0.05±0.06) or T. brucei (0.06±0.07), indicating much deeper coalescent times for T. vivax VSG. More importantly, the variance in TMRCA along sequence alignments is extremely small for T. vivax FC VSG, showing that the whole alignment shares a common ARG (Fig. 3e). Variance is greater for MC VSG, but both MC and FC types are significantly less variable than either other species (p<0.001). Both the relatively small TMRCA and variance in TMRCA along alignments indicate that T. brucei and T. congolense VSG are routinely mosaics, while the coalescence of most T. vivax VSG can be modelled without recombination. Interestingly, TMRCA variance is significantly higher among T. brucei MC VSG quartets than T. congolense VSG (p<0.001), indicating that the former may have a higher recombination rate (explored further in Supplementary Table 3).
In summary, these analyses show that retention of orthology among VSG loci across trypanosome populations varies significantly between species. Fig. 3f plots the total pairwise orthology between strains (see Methods). Around 75% of T. vivax VSG are found in multiple strains as orthologs, without evidence for recombination, compared with ~40% in T. brucei (p<0.001) and T. congolense (p<0.001). As the VAPs indicated, T. vivax VSG typically retain orthology and essentially behave like ‘normal’ genes in the population, while T. brucei or T. congolense VSG recombine frequently, causing loss of orthology and the appearance of strain-specific mosaics throughout the population.
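The segmental mapping step can be sketched as below. The authoritative definitions of the three outcomes are those given in Fig. 3b, which are not reproduced in the text; the rules coded here (and the handling of a short final segment) are assumptions made only to show the shape of the analysis, and the function names are hypothetical.

```python
def split_segments(seq, size=150):
    """Cut a sequence into consecutive, non-overlapping segments.

    A final segment shorter than `size` is kept (an assumption).
    """
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def coupling_class(segment_targets):
    """Classify one reference VSG from the best hit of each of its segments.

    segment_targets: best-hit strain gene per segment, in order along the
    reference gene; None means the segment had no hit.
    Assumed rules (hypothetical, see Fig. 3b for the real ones):
      FC  every segment maps, and all map to the same strain gene
      UC  no two neighbouring segments share a strain gene
      MC  anything in between (blocks of segments on different genes)
    """
    if not segment_targets:
        raise ValueError("no segments supplied")
    if None not in segment_targets and len(set(segment_targets)) == 1:
        return "FC"
    pairs = zip(segment_targets, segment_targets[1:])
    neighbours_share = any(a is not None and a == b for a, b in pairs)
    return "MC" if neighbours_share else "UC"
```

For example, a 400 bp gene yields segments of 150, 150 and 100 bp, and the list of their best hits is then passed to `coupling_class`.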
Strong phylogenetic effects in VSG expression in vivo
Broadly conserved VSG phylotypes containing little signature of historical recombination indicate that VSG mosaics do not contribute to antigenic diversity in vivo. We tested this by measuring VSG transcript abundance in goats experimentally infected with T. vivax (strain Lins24) over a 40-day period. Parasitaemia and expression profiles of VSG phylotypes in four replicates are shown in Fig. 4. We observed the expected waves of parasitaemia beginning after four days and continuing approximately every three days until termination (i.e. 6-9 parasitaemic peaks). Transcriptomes were prepared for each peak and revealed 282 different VSG transcripts across all replicates (Supplementary Table 4), which belonged to 31 different phylotypes (18% of total).
Variant antigen profiling of the expressed transcripts characterised the dominant, (but more often co-dominant), VSG phylotypes across successive peaks (Fig. 4). Somewhat contrary to expectation, persistent expression of a phylotype across peaks, e.g. P24 (Supplementary Fig. 2) and P2 (Supplementary Fig. 3), or re-emergence of a phylotype after decline, e.g. P40 (Supplementary Fig. 4) and P143 (Supplementary Fig. 5), was often seen. The identity of expressed phylotypes was partly reproduced across replicates, with 12/31 phylotypes observed in all four animals, and 19 phylotypes in three animals (Supplementary Fig. 6); on 21 occasions this extended to an identical VSG sequence, (for detail, see Supplementary Fig. 2-5).
Similarly, the order of VSG expression was partly reproducible across animals. Fig. 5 displays transcript number and abundance at early, middle and late points in the experiment, mapped on to the sequence similarity network of all phylotypes. The best example of reproducibility is the dominant expression of P24 in the middle-to-late period across all animals. Other examples include a group of phylotypes (P2, P40, P142 and P143) expressed early (i.e. peak 1/2, Fig. 5a) in A2 and A3, then re-emerging later at peak 5/6 in A1-3 (Fig. 5b), and even later in A4. For detailed analysis of phylotype abundance at each time-point see Supplementary Fig. 7. Importantly, however, while phylotypes show consistency in expression through time and across replicates, individual VSG transcripts do not. Hence, while P24 was a dominant variant antigen in every replicate, the actual P24 transcript expressed was different in each case and diverged by up to 26.5%. Further examples in Supplementary Fig. 2-5 demonstrate that this was typical.
Across all peaks, groups of related transcripts of the same phylotype were commonly co-expressed at the same peak (e.g. P2 expression comprised 3.08±1.97 transcripts on average, P24=2.33±1.3, P40=2.67±1.12, P143=2.71±1.25). On three occasions, the observed phylotype comprised seven distinct transcripts (P2 at peak 5 in A1, P8 at peak 8 in A4 and P135 at peak 5 in A1). Overall, only 8/31 phylotypes were only ever represented by a single transcript. This indicates that the expressed repertoire is determined in part by sequence homology, and Supplementary Fig. 8 shows that expressed transcripts belong to significantly fewer phylotypes than simulated transcript repertoires of the same size, confirming that they are not drawn from the available repertoire by chance. For detailed examples, see Supplementary Fig. 2-5.
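The comparison of expressed transcripts with simulated repertoires of the same size can be sketched as follows. This is an illustrative reconstruction, not the procedure behind Supplementary Fig. 8: it assumes that simulated repertoires are drawn uniformly and without replacement from the genomic VSG set, and the function name and arguments are hypothetical.

```python
import random


def phylotype_count_p_value(expressed, gene_to_phylotype, n_sim=1000, seed=0):
    """Test whether expressed genes fall into unusually few phylotypes.

    expressed: list of expressed gene ids.
    gene_to_phylotype: dict mapping every gene in the repertoire -> phylotype.
    Returns (observed phylotype count, one-sided p-value), where the p-value
    is the add-one corrected share of random repertoires of the same size
    that contain as few or fewer phylotypes than observed.
    """
    rng = random.Random(seed)
    genes = sorted(gene_to_phylotype)
    observed = len({gene_to_phylotype[g] for g in expressed})
    as_low = 0
    for _ in range(n_sim):
        draw = rng.sample(genes, len(expressed))
        if len({gene_to_phylotype[g] for g in draw}) <= observed:
            as_low += 1
    return observed, (as_low + 1) / (n_sim + 1)
```

A small p-value indicates that the expressed transcripts are concentrated in fewer phylotypes than expected if they were drawn from the repertoire by chance.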
An obvious feature in Fig. 5 is the concentration of highly-expressed phylotypes in the bottom-left corner of the network. A complex of closely-related Fam23 phylotypes (e.g. P2, P40, P142) were expressed early in A1 and A2 (Fig. 5a-b). This was followed by Fam23 phylotypes more centrally placed (e.g. P8), and finally, Fam25 phylotypes (e.g. P24/P44) in late infection. In A3 and A4, a similar pattern occurred, except that Fam25 VSG (i.e. P44) were expressed early, followed by the Fam23 complex and then P24. This can also be seen in Supplementary Fig. 7, where phylotypes displaying reproducible profiles across replicates are often closely related (e.g. P2, P40, P142 and P143). The connectivity of nodes representing expressed phylotypes is greater than that expected by chance. The clustering coefficient of a sub-network representing all ‘expressed’ nodes across all peaks is significantly greater than randomised sub-networks of the same size (p<0.05; for detail, see Supplementary Fig. 9).
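The randomisation test on network clustering can be illustrated with a short sketch. It assumes the statistic is the average local clustering coefficient of the sub-network induced by the expressed phylotypes, and that randomised sub-networks are obtained by sampling the same number of nodes uniformly from the full network; both are assumptions, and all names are hypothetical.

```python
import random
from itertools import combinations


def build_adjacency(edges):
    """Turn a list of undirected (u, v) edges into {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def average_clustering(adj, nodes):
    """Average local clustering coefficient of the sub-network induced by `nodes`."""
    nodes = set(nodes)
    if not nodes:
        return 0.0
    total = 0.0
    for n in nodes:
        nbrs = adj[n] & nodes          # neighbours that are inside the sub-network
        k = len(nbrs)
        if k < 2:
            continue                   # coefficient is 0 with fewer than 2 neighbours
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(nodes)


def clustering_p_value(adj, expressed, n_rand=1000, seed=0):
    """Share of random node sets clustering at least as strongly as `expressed`."""
    rng = random.Random(seed)
    observed = average_clustering(adj, expressed)
    all_nodes = sorted(adj)
    hits = 0
    for _ in range(n_rand):
        sample = rng.sample(all_nodes, len(expressed))
        if average_clustering(adj, sample) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_rand + 1)
```

In this framing, the phylotype similarity network supplies `adj` and the phylotypes detected across all peaks supply `expressed`.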
In summary, the major pattern emerging from in vivo expression profiles is a strong phylogenetic signal on three levels. First, the identity and order of expressed phylotypes is partly reproducible, (but expression of individual transcripts is typically not). Second, phylotypes expressed at a given peak regularly comprise multiple related, but non-identical, transcripts. Finally, at the phylotype level, related phylotypes are expressed simultaneously or consecutively, manifested as clustering in Fig. 5 and Supplementary Fig. 8. Therefore, phylogeny (or sequence identity) is an important factor in explaining VSG expression profile in T. vivax.
No mosaics of VSG phylotypes during experimental infections
Expressed VSG in T. brucei include sequence mosaics, which is interpreted as evidence for recombination of VSG loci during infections15,25,26. In T. brucei, VSG mosaics can be formed between highly divergent donors with as little as 25% identity along their entire lengths26, and can implicate relatively short recombinant tracts of ~100 bp27. We analysed expressed VSG transcript sequence mosaics by comparing 100 bp windows of each transcript to the T. vivax Lins genome sequence using BLASTp28. Typically, mosaics would be confirmed where a single transcript displayed affinities to different VSG genes along its length. Unfortunately, since both VSG transcripts and gene sequences were often fragmentary, it was common for a transcript to have multiple affinities as no single gene sequence spanned its length. Even so, without exception, the closest related sequences in every window of each transcript were other sequences in the same phylotype.
With sequence affinities inconclusive, we searched for reorganisation of an expressed VSG sequence relative to a genomic locus by mapping all read-pairs belonging to VSG transcripts to the T. vivax Lins genome. The percentage of read-pairs that mapped to unpaired genomic positions (1.06-5.63%) was greater than the percentage arising from a random selection of 100 housekeeping genes (0.01 - 0.05%). However, given that T. vivax VSG are arranged in tandem gene arrays of closely-related paralogs10, we reasoned that this repetitive organisation might lead to multiple mapping of reads. Indeed, the percentage of VSG read-pairs split after mapping is not significantly different to that of adenylate cyclases (3.43-7.53%; p=0.892), which do not form mosaics but are often arranged in tandem arrays29.
Nonetheless, the few mis-mapped reads could still derive from rare mosaic transcripts. To examine these explicitly, we aligned VSG transcripts with the three most similar genes from the T. vivax Lins genome sequence using BLASTn (where three sequences >500bp in length could be obtained; N=68) and used GARD30 to identify potential recombination breakpoints. The closest matches to each transcript were again always from the same phylotype (minimum full-length sequence identity of 86%). GARD found that 54/68 alignments displayed significant topological incongruence not attributable to rate heterogeneity, indicating 1.94±1.66 breakpoints on average (ranging between 0 and 7). This might suggest that mosaicism is widespread within phylotypes, however, this degree of phylogenetic incompatibility was not significantly different to adenylate cyclases (36/48 alignments with significant topological incongruence and an average of 1.87±1.88 breakpoints (ranging between 0 and 8); p=0.39).
In summary, while most transcript alignments contained breakpoints, these only implicated very closely related sequences, and the scale of genetic admixture is comparable with other tandemly arrayed gene families. Thus, we believe that these slight topological inconsistencies are consistent with re-arrangements (real or artefactual) caused by tandem arrangement of T. vivax VSG. Certainly, no transcript contained evidence for mosaics of different VSG phylotypes and therefore, assortment of the order seen in T. brucei was not observed.
Discussion
The current model of trypanosome antigenic variation has recombination as the driver behind novelty and persistence. Unlike T. brucei and T. congolense, we find little evidence for VSG mosaics, either historically in the population or during experimental infections. Instead, T. vivax VSG repertoire comprises 174 conserved phylotypes, and incomplete sorting of these lineages produces population variation. We see now that the deep ancestry of VSG lineages and lack of VSG pseudogenes in T. vivax10 reflect a long history without recombination.
Experiments in the twentieth century documented the progression of Variant Antigen Types (VATs) during T. vivax infections3,16,17. VATs represent parasite clones that confer a specific, reproducible immunity, assumed to relate to a specific VSG. Our results confirm the hypothesis that emerged from these experiments, that the T. vivax VSG repertoire is smaller than those of other species3,16. While the number of VSG genes is comparable to T. brucei and T. congolense, these provide fewer unique antigens because they are often extremely similar, expressed simultaneously, and cannot recombine. This explains several features of T. vivax infections, including the propensity for host self-cure16 and the re-emergence of VATs late in infection17. Furthermore, 70% of phylotypes and 45% of COGs are shared between East and West Africa respectively, which could explain the widespread distribution of serodemes, that is, why immunity to VATs in East Africa provides protection against some parasite strains from Western and Southern Africa also19,31.
We have defined VSG phylotypes as universal but variable quantities for variant antigen profiling of any T. vivax strain. The evolutionary conservation of many phylotypes, and their reproducible expression patterns (in contrast with individual genes), has shown that phylotypes are not merely convenient, but have biological relevance. A crucial consideration then is how phylotypes relate to VATs. If individual transcripts in a phylotype cross-react with the same antibody, then VATs are likely to be synonymous with phylotypes; which raises the question of why multiple transcripts are expressed when this confers no benefit to parasite persistence. Conversely, if all VSG transcripts are serologically distinct, this poses the question of why co-expression is determined by sequence homology. Either way, the relevance of VSG phylogeny to antigenic variation is clear. The absence of recombination means that the mechanism of VSG switching in T. vivax must be different to the T. brucei model. We have seen that VSG expression in vivo displays an obvious phylogenetic signal, which might be explained if co-expressed transcripts derive from the same tandem array of VSG paralogs, which exist throughout the T. vivax genome10. If so, these structures could have a central role in a distinct switching mechanism not dependent on gene conversion.
Without recombination to create mosaic VSG sequences, there is a fundamental limitation on antigenic diversity in T. vivax and therein its capacity for immune evasion. This poses profound new questions of how T. vivax persists long enough to transmit, (which it evidently does very successfully). Perhaps T. vivax has adopted a different life strategy with respect to the transmission-virulence or invasion-persistence trade-offs that govern pathogen evolution32,33. One possibility is that T. vivax has evolved a more acute infection strategy than other species and achieves transmission over shorter periods. Some aspects support an invasion-persistence trade-off; T. vivax infections (where the host survives) are typically shorter than other species34,35, and some haemorrhagic strains cause an extremely acute syndrome that is also hypervirulent36,37. Furthermore, where trypanosome species have been directly compared, chronic pathologies such as reduced packed cell volume34,35 and humoral immunosuppression38 are less severe with T. vivax. However, there is no evidence that T. vivax replicates or transmits quicker, as would be expected under a trade-off. Another possibility is that the idiosyncratic life cycle and wider vector range of T. vivax6 are an adaptation to increase transmission in the absence of long-term persistence. However, in various reports, animals that survive the initial acute T. vivax infection are said to develop a chronic, often asymptomatic, infection during which parasites are not visible39–41, but which may cause progressive neuropathy42. Thus, another possibility is that T. vivax causes long-term, chronic infections like other species, but has an alternative mechanism for persistence. Dissemination to immune-privileged sites might allow persistence at low cell densities and T. vivax does disseminate to the reproductive and nervous systems, but all trypanosome species have a comparable ability for disease tropism43.
In conclusion, the orthology of VSG phylotypes across populations, and the considerable structural divergence among them, indicates that the global T. vivax variant antigen repertoire has remained largely unchanged over time. Crucially, we find no evidence in T. vivax for the vital role that recombination, or gene conversion, has in diversifying VSG sequences and mediating antigenic switching in T. brucei. This is a major departure from the current model of antigenic variation, indicating that T. vivax has a distinct mechanism of immune evasion. Antigenic diversity in T. vivax is finite, in a way that T. brucei and T. congolense are not; this both explains the antigenic exhaustion observed during T. vivax infections and poses important new questions of how infections persist under such circumstances. Possibly, the lack of adaptation for persistence, so evident in T. brucei, reflects a fundamentally different life strategy in T. vivax, with profound implications for understanding virulence and transmission of this pervasive and devastating pathogen.
Methods
Ethical Considerations
This study was conducted in accordance with the guidelines of the Brazilian College of Animal Experimentation (CONCEA), following the Brazilian law for “Procedures for the Scientific Use of Animals” (11.794/ 2008 and decree 6.899/2009). Ethical approval was obtained from the Ethical Committee to the Use of Animals (CEUA) of the Veterinary and Agrarian Sciences Faculty (FCAV) of the State University of São Paulo (Jaboticabal campus) (São Paulo, Brazil) (protocol no. 001494/18, issued on 08/02/2018). The study was also approved by the Animal Welfare and Ethical Review Body (AWERB) of the University of Liverpool (AWC0103).
Sample preparation
A panel of 25 T. vivax-infected blood stabilates (150 μl), representing isolates from Burkina Faso (N=5), Ivory Coast (N=3), Nigeria (N=11), Gambia (N=1), Uganda (N=4), Togo (N=1), were selected from Azizi Biorepository (http://azizi.ilri.org/repository/) at the International Livestock Research Institute (ILRI), and the Centre International de Recherche-Développement sur l’Elevage en zone Subhumide (CIRDES) (Supplementary Table 4). In addition, genomic DNA of three Brazilian isolates previously described24,44,45 was obtained from Instituto de Ciências Biomédicas (ICB) at the University of São Paulo. For samples from ILRI and CIRDES: Red blood cells were lysed with ACK lysing buffer (Gibco, UK) and discarded by centrifugation. Cells were washed twice in 1ml MACS buffer by centrifugation (10 min, 2500 rpm). The pellet was resuspended in 100 μl lysis buffer (aqueous solution of 1 M Tris-HCl pH8.0, 0.1 mM NaCl, 10 μM EDTA, 5% SDS, 0.14 μM Proteinase K). Samples were incubated at room temperature for 1 h and DNA was extracted with magnetic Sera-Mag Speedbeads (GE Healthcare Life Sciences, UK) according to the manufacturer’s protocol. For samples from ICB: DNA obtained from ICB was extracted following an ammonium acetate protocol previously described38 (TvBrMi) or a traditional phenol-chloroform extraction protocol (TvBrRp).
Genome sequencing and assembly
Illumina paired-end sequencing libraries were prepared from genomic DNA using the NEBNext® Ultra™ DNA Library Prep Kit according to the manufacturer’s protocol (New England Biolabs, UK) and sequenced by standard procedures on the Illumina MiSeq platform, as 150 bp (ILRI) or 250 bp (ICB and CIRDES) paired ends. For each sample, the data yield from sequencing after quality filtering was between 1.69 × 10⁶ and 1.32 × 10⁷ read pairs. Samples were assembled de-novo using Velvet 1.2.1039 with a kmer of 65 (ILRI and CIRDES) or 99 (ICB). These produced assemblies with N50 between 238 and 2852 bp (median=353; mean=985). Allele frequencies were inspected to ensure samples were from single infections only (Accession number: PRJNA486085).
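For readers unfamiliar with the contiguity statistic quoted above, N50 is the contig length at which the cumulative length of contigs, taken from longest to shortest, first reaches half of the total assembly length. A minimal sketch follows; the function name is hypothetical and assemblers normally report this value directly.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths in bp."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:   # half of the assembly is now covered
            return length
    raise ValueError("no contigs supplied")
```

For instance, contigs of 8, 8, 4, 3, 3, 2, 2 and 2 bp total 32 bp, and the two longest contigs already cover 16 bp, so the N50 is 8.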
VSG-like sequence recovery and systematics
VSG-like nucleotide sequences were retrieved from the assembled contig files by sequence similarity search with tBLASTx28. We used a database of T. vivax Y486 VSG as query and a significance threshold of p<0.001, contig length ≥100 amino acids, and sequence identity ≥40%. Additionally, we queried a database of T. brucei a-VSG and b-VSG sequences, using the same p-value and length thresholds, to accommodate VSG genes that might be absent from T. vivax Y486, i.e. the possibility that the reference is not representative of all strains. In the event, the reference proved to be representative.
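The thresholds above can be applied to tabular BLAST output as in the sketch below. It assumes the standard 12-column tabular format (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and assumes that the length threshold is applied to the alignment length column; neither detail is stated in the text, and the function name `filter_hits` is hypothetical.

```python
def filter_hits(lines, max_evalue=1e-3, min_length=100, min_identity=40.0):
    """Return the set of subject (contig) IDs with at least one hit that
    passes all three thresholds.

    Assumes 12-column tabular BLAST output; the column meanings are an
    assumption about the output format, not a detail from the study.
    """
    kept = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        pident = float(fields[2])   # percent identity
        length = int(fields[3])     # alignment length
        evalue = float(fields[10])  # expectation value
        if evalue < max_evalue and length >= min_length and pident >= min_identity:
            kept.add(fields[1])     # subject ID, here the assembled contig
    return kept


# Hypothetical hits, for illustration only.
example = [
    "vsgA\tcontig1\t55.0\t120\t50\t2\t1\t360\t10\t369\t1e-20\t150",
    "vsgA\tcontig2\t35.0\t150\t90\t3\t1\t450\t5\t454\t1e-10\t90",
    "vsgB\tcontig3\t60.0\t80\t30\t1\t1\t240\t1\t240\t1e-15\t110",
]
print(sorted(filter_hits(example)))
```

In the example, only `contig1` passes: `contig2` fails the identity threshold and `contig3` fails the length threshold.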
VSG-like sequences were translated and clustered using OrthoFinder46 under the default settings. OrthoFinder clustered orthologous sequences from the reference and 28 strains. In practice, these clusters of orthologs (‘COGs’) also included near-identical in-paralogs. Sequences in each cluster were aligned using ClustalX47 and all alignments were edited to remove overhangs and short (<100 bp) sequences. Edited alignments were refined to produce COGs with >90% average sequence identity by combining COGs that were very similar or, more frequently, subdividing OrthoFinder clusters that contained several orthologous groups until the average sequence divergence was <0.05. In complex cases of large OrthoFinder clusters, neighbour-joining phylogenies were estimated to aid sub-division. Sequences that could not be placed with any other such that sequence divergence was <0.05 were categorized as ‘unclustered’ (assumed to be strain-specific VSG).
With the membership of COGs determined, we reverted to the original, unedited sequences to identify the longest representative as a ‘type sequence’ of that COG. These were combined with the original, unclustered sequences and compared with Fam23-26 VSG reference sequences using BLASTp to confirm their validity and assign a subfamily. The type sequences subdivided thus: Fam23 (967), Fam24 (539), Fam25 (345) and Fam26 (193). Sequences found not to have a satisfactory match to Fam23-26 VSG were excluded. This process produced 760 COGs (comprising 2576 sequences) and 1278 unclustered, or ‘singleton’, sequences. Each type sequence and singleton was compared against all others using BLASTp to establish cohorts of related COGs/singletons, which we call ‘phylotypes’. The BLASTp output was used to create sequence alignments for phylotypes and to estimate neighbour-joining phylogenies for each. The membership of phylotypes was manually adjusted by removing the most divergent sequences until each met a threshold of 70% average sequence identity.
Note that the geographical distribution of VSG COGs and phylotypes is inferred from the strains in which type sequences were detected. We define a ‘cosmopolitan’ COG or phylotype as being present in more than one location, except if these locations are Brazil and Uganda, or any combination of Ivory Coast, Togo and Burkina Faso. In both cases, we judged the T. vivax strains to be too close to justify treating these as separate populations. COGs or phylotypes found only in Brazil and Uganda are considered ‘East African’ in this study. Those found only in some combination of Ivory Coast, Togo and Burkina Faso are considered ‘West African’.
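The classification rule above can be written out as a short sketch. The function name and the ‘location-specific’ label for a COG seen in a single other country are hypothetical; treating a COG found in only one of Brazil or Uganda as ‘East African’ (and likewise for the West African group) is also an assumption, since the text defines these categories only for combinations of locations.

```python
EAST_AFRICAN = frozenset({"Brazil", "Uganda"})
WEST_AFRICAN = frozenset({"Ivory Coast", "Togo", "Burkina Faso"})


def classify_distribution(locations):
    """Classify a COG or phylotype from the locations where it was detected."""
    observed = set(locations)
    if not observed:
        raise ValueError("at least one location is required")
    if observed <= EAST_AFRICAN:
        return "East African"
    if observed <= WEST_AFRICAN:
        return "West African"
    if len(observed) == 1:
        return "location-specific"  # hypothetical label, not from the text
    return "cosmopolitan"


print(classify_distribution(["Brazil", "Uganda"]))
print(classify_distribution(["Nigeria", "Togo"]))
```

The subset tests are evaluated before the single-location test, so that the two merged populations are never reported as cosmopolitan.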
Variant Antigen Profiling
To produce VAPs for each strain, we used sequence mapping to confirm the presence or absence of individual COGs. As mapping makes use of low-coverage reads that would not otherwise be integrated into VSG sequence assemblies, this was more efficient than inspecting genome contigs for sequence homology. There was an 11% increase in the observed repertoire size (an average of 87 additional VSG) when mapping relative to BLAST. Mapping indicated that most singleton sequences were present in other strains despite the absence of assembled orthologs. Of 1279 sequences that could not be placed in a COG, only 34 (2.7%) remained location-specific after mapping. For these reasons, trimmed sequence reads were aligned to the 2038 COG type sequences using Bowtie248 with the settings -D 20 -R 3 -N 1 -L 20. A customized Perl script was used to select entries with a match length ≥245 nucleotides (corresponding to a 2% error rate in a 250 bp sequencing read) that mapped as proper pairs, in the correct orientation and within the expected insert size. This list was compared to the COG database and used to produce the presence/absence binary matrix that represents the T. vivax VAP. VAP-based strain relationships were estimated by hierarchical clustering analysis in R, using binary distances and Ward’s minimum variance method49, and compared to the whole-genome variation phylogeny. For phylotype-based VAPs, presence/absence and distribution data were generated by summing over all constituent VSG COGs and singletons.
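The following sketch illustrates three steps of the profiling described above: deriving the minimum match length from a tolerated error rate, building a presence/absence profile, and computing a binary distance between two profiles. It is written in Python for illustration only (the study used a Perl script and R); all function names and example identifiers are hypothetical. The distance follows the definition used by the ‘binary’ method of R's `dist`, that is, the proportion of positions where only one profile is ‘on’ among positions where at least one is ‘on’. Ward clustering itself is not reproduced here.

```python
def min_match_length(read_length=250, error_percent=2):
    """Minimum matched bases tolerated for a read, using integer arithmetic."""
    return read_length - (read_length * error_percent) // 100


def build_vap(cogs_per_strain, cog_ids):
    """Build a presence/absence profile (1/0 per COG) for each strain.

    cogs_per_strain: {strain: set of COG ids supported by accepted read pairs}
    cog_ids: ordered list of all COG ids (the profile columns)
    """
    return {
        strain: [1 if cog in hits else 0 for cog in cog_ids]
        for strain, hits in cogs_per_strain.items()
    }


def binary_distance(a, b):
    """Proportion of positions where exactly one profile is 'on', among
    positions where at least one is 'on' (equivalent to Jaccard distance)."""
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        return 0.0  # assumption: two empty profiles are treated as identical
    differ = sum(1 for x, y in zip(a, b) if bool(x) != bool(y))
    return differ / either


# Hypothetical strains and COG ids, for illustration only.
cogs = ["COG1", "COG2", "COG3", "COG4"]
vap = build_vap({"strainA": {"COG1", "COG2"}, "strainB": {"COG1", "COG3"}}, cogs)
print(min_match_length())
print(binary_distance(vap["strainA"], vap["strainB"]))
```

Because shared absences are ignored, two strains with small repertoires are not scored as similar merely for lacking the same COGs.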
Strain variation
To estimate strain relationships based on the whole genome, MiSeq reads were retrieved and mapped against the T. vivax Y486 genome using BWA mem50, converted to BAM format, sorted and indexed with SAMtools51. Sorted BAM files were cleaned, duplicates marked and indexed with Picard (http://broadinstitute.github.io/picard/), and Single Nucleotide Polymorphisms (SNPs) were called and filtered with the Genome Analysis Toolkit (GATK) suite according to the best practice protocol for multi-sample variant calling52. The multi-sample VCF file obtained from GATK was converted to FASTA format using VCFtools v0.1.1453 and a maximum likelihood phylogeny was estimated with PHYML54, using the GTR+Γ+I model of nucleotide substitution, following Smart Model Selection55.
T. vivax experimental infections
Five male Saanen goats of 4 to 8 months of age, housed at the Veterinary and Agrarian Sciences Faculty (FCAV) of the State University of São Paulo (Jaboticabal campus) (São Paulo, Brazil), were infected with the T. vivax Lins24 isolate. Before inoculation, parasite stabilates cryopreserved in 8% glycerol were thawed and checked for viability under a light microscope. Each animal was inoculated intravenously with approximately 6 × 10⁶ parasites. Animals were clinically examined daily and parasitaemia was determined by microscopy as previously described56. Animal 2 was euthanized by anesthesia overdose on day 39 post-infection (p.i.) after showing signs of health deterioration (loss of appetite, lethargy and anaemia). Xylazine hydrochloride (0.2 mg/kg) was administered intramuscularly as pre-anesthetic medication, followed by intramuscular ketamine hydrochloride (2 mg/kg) as anesthetic. Cardio-respiratory arrest was induced by intrathecal administration of lidocaine hydrochloride. Remaining animals were euthanized on day 45 p.i. according to the same procedure.
Blood collection, RNA extraction and sequencing
At each parasitaemia peak, 4 ml of blood were collected by jugular venepuncture and centrifuged for 15 min at 13,000 × g. The buffy coat was removed into a 2.0 ml LoBind microcentrifuge tube (Eppendorf, UK), 1.5 ml of ACK Lysing buffer (Gibco, UK) added, and the mixture incubated for 15 min at room temperature to lyse leftover red blood cells. Samples were centrifuged for 15 min at 13,000 × g, washed twice in PBS, pH 8.0, snap frozen in liquid nitrogen and kept at −80 °C until RNA extraction. RNA was extracted using the RNeasy Mini Kit (Qiagen, UK) according to the manufacturer’s protocol, yielding a total RNA output between 117 ng and 13 μg per sample, quantified on the NanoDrop 2000 (ThermoFisher Scientific, Brazil). Up to 1 μg of total RNA was used to prepare multiplexed cDNA libraries as described57 using the T. vivax splice-leader (SL) sequence58 as the second cDNA strand primer. For samples up to day 30 p.i., the protocol of Cuypers et al. (2017)57 was followed exactly as described. Libraries were quantified using Qubit HS dsDNA (Invitrogen, UK) and the Agilent 2100 Bioanalyzer (Agilent Technologies, UK), and sequenced at the Centre for Genomic Research (Liverpool, UK) on a single lane of the HiSeq 4000 platform (Illumina Inc, USA) as 150 bp paired ends, producing 280M mappable reads. However, as the library insert sizes produced were longer than recommended for the HiSeq 4000 platform, the protocol for samples from days 30-45 p.i. was modified. Instead of adding the indexes from the Illumina Nextera index kit, adapter-ligated, SL-selected cDNA was used as input for the NEB Ultra II FS DNA library kit (NEB, UK), which includes an initial step of DNA fragmentation. Sequencing statistics are shown in Supplementary Table 1.
Transcriptome Profiling
RNAseq reads were assembled de novo using Trinity59. Transcript abundances were estimated for each sample with kallisto60 using Trinity pre-compiled scripts. Subsequently, transcript abundances of samples from the same animal, expressed as transcripts per million, were combined and normalized based on the weighted trimmed mean of log expression ratios (trimmed mean of M values (TMM)61). TMM normalization adjusts expression values to the library size and reduces composition bias. TMM values were used to produce transcript expression matrices for each animal. To recover all VSG-like sequences in the transcriptomes, a sequence similarity search was performed with tBLASTx28 using the T. vivax COG database produced above as query and a significance threshold of E<0.001, contig length ≥150 amino acids, and sequence identity ≥70%. All retrieved VSG-like sequences were manually curated to remove spurious matches. The resulting lists of VSG transcripts were used as query in a sequence similarity search to identify VSG transcripts matching the list of COGs defined in the VAP. A threshold of E<0.001, contig length greater than 50 amino acids, and sequence identity ≥98% was applied. Finally, VSG transcripts were assigned a phylotype based on sequence similarity comparison to the VSG phylotype network (≥70% nucleotide identity across the whole gene sequence). VSG transcript abundances were combined per phylotype, resulting in a transcript expression matrix containing the abundance of each VSG phylotype over time.
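The final aggregation step, in which transcript abundances are combined per phylotype over time, might look like the sketch below. The data layout, function name and identifiers are hypothetical; the input values are assumed to be already normalized (for example, TMM-adjusted) abundances.

```python
from collections import defaultdict


def abundance_by_phylotype(abundance_by_time, transcript_phylotype):
    """Sum per-transcript abundances into per-phylotype abundances.

    abundance_by_time: {time_point: {transcript_id: abundance}}
    transcript_phylotype: {transcript_id: phylotype_id}
    Returns {phylotype_id: {time_point: summed abundance}}.
    """
    matrix = defaultdict(dict)
    for time_point, abundances in abundance_by_time.items():
        totals = defaultdict(float)
        for transcript, value in abundances.items():
            phylotype = transcript_phylotype.get(transcript)
            if phylotype is not None:  # skip transcripts with no assignment
                totals[phylotype] += value
        for phylotype, total in totals.items():
            matrix[phylotype][time_point] = total
    return {phylotype: dict(series) for phylotype, series in matrix.items()}


# Hypothetical transcripts and abundances, for illustration only.
abundances = {
    "day10": {"t1": 10.0, "t2": 5.0, "t3": 1.0},
    "day20": {"t1": 2.0, "t3": 8.0},
}
assignment = {"t1": "P1", "t2": "P1", "t3": "P2"}
print(abundance_by_phylotype(abundances, assignment))
```

A phylotype that is not detected at a given time point is simply absent from that series; a downstream step could fill such gaps with zero.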
Recombination Analysis
Fifty previously published genomes from T. brucei spp.29,62,63 and T. congolense20 and nineteen of the T. vivax genomes presented in this study were used to compare signatures of recombination across species (Supplementary Table 4). VSGs and adenylate cyclase genes were extracted from genome assemblies by sequence similarity search (BLASTn28) using a nucleotide identity ≥50%, length ≥600 nucleotides, and E<0.001. VSG assortment was quantified by read mapping using Bowtie248. VSG read-pairs were retrieved from the genomes and mapped against reference full-length VSG to calculate the proportion of strain read-pairs remaining paired after mapping. This protocol was repeated for adenylate cyclases.
In the segmental mapping approach, reference VSGs were broken into 150 bp fragments and mapped against the strain VSGs to calculate the frequency of reference reads remaining paired. VSGs were categorized as uncoupled, multi-coupled or fully coupled, according to the estimated number of donors. Fully coupled VSGs were those with at least one donor contributing more than 84% of the sequence. Multi-coupled VSGs were those with one or more donors contributing more than one fragment (≥300 bp), whereas uncoupled VSGs were those remaining, i.e. with one or more donors each contributing one fragment only (≤150 bp). The reference VSGs that were not mapped at least once to the strain VSGs were considered reference-specific variants.
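The categories defined above can be expressed as a simple decision rule. The sketch below is illustrative only: the function names are hypothetical, a trailing fragment shorter than 150 bp is kept (an assumption), and the fraction of fragments assigned to a donor is used as a proxy for the fraction of sequence contributed.

```python
def fragment(sequence, size=150):
    """Split a sequence into consecutive fragments of at most `size` bases.
    Keeping a shorter final fragment is an assumption of this sketch."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def classify_vsg(n_fragments, fragments_per_donor):
    """Classify a reference VSG from the number of its fragments that
    mapped to each donor (strain VSG).

    fragments_per_donor: {donor_id: number of fragments mapped to it}
    """
    counts = [c for c in fragments_per_donor.values() if c > 0]
    if not counts:
        return "reference-specific"
    top = max(counts)
    if top / n_fragments > 0.84:
        return "fully coupled"
    if top > 1:
        return "multi-coupled"
    return "uncoupled"


# Hypothetical donor counts for a 1500 bp reference VSG (10 fragments).
print(len(fragment("A" * 1500)))
print(classify_vsg(10, {"donor1": 9}))
print(classify_vsg(10, {"donor1": 3, "donor2": 2}))
print(classify_vsg(10, {"donor1": 1, "donor2": 1}))
```

The checks are ordered from the most to the least stringent category, so each VSG receives exactly one label.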
The phylogenetic signal of multi-coupled (MC) and fully coupled (FC) VSGs and adenylate cyclases was calculated using phylogenetic incompatibility (Ppi) in PHI23 and compared to the Ppi of two sets of simulated data (250 replicates, 16 sequences per replicate), with and without recombination. Simulated data were generated with NetRecodon64, under diploid settings, a population mutation rate (θ) of 160, a heterogeneity rate of 0.05, and an expected population size of 1000. The population recombination rate (ρ) was set to 0 and 96 for the non-recombinant and recombinant datasets, respectively. Both experimental and simulated sequences were divided into sequence quartets, aligned with Muscle65 and iteratively parsed through PHI23. FC, adenylate cyclase and simulated quartets were randomly generated and parsed through PHI 100 times for statistical power. MC quartets were compiled manually with one MC VSG and 3 donors.
Total sequence orthology in each trypanosome species VSG repertoire was calculated as the proportion of shared nucleotides in the total number of nucleotides of the VSG repertoire of a given strain. The number of shared nucleotides was extracted from the mapping output file using genomecov from bedtools66.
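As a sketch of the proportion described above, the example below counts covered positions in per-base depth output. It assumes three tab-separated columns per line (sequence name, position, depth), as produced by `bedtools genomecov -d`; whether the authors used this per-base mode is not stated in the text, and the function name is hypothetical.

```python
def shared_fraction(per_base_lines):
    """Proportion of repertoire positions covered by at least one mapped read.

    Each line is expected to hold: sequence name, position, depth
    (tab-separated). This layout is an assumption of the sketch.
    """
    total = 0
    covered = 0
    for line in per_base_lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        total += 1
        if int(fields[2]) > 0:
            covered += 1
    if total == 0:
        raise ValueError("no positions supplied")
    return covered / total


# Hypothetical per-base depths for a 4 bp sequence, for illustration only.
example = ["vsg1\t1\t3", "vsg1\t2\t0", "vsg1\t3\t1", "vsg1\t4\t0"]
print(shared_fraction(example))
```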
Estimation of ancestral recombination graphs
Ancestral recombination graphs were reconstructed for multi-coupled and fully coupled VSG quartet alignments and adenylate cyclase control quartet alignments using the ACG software package67. The time to the most recent common ancestor (TMRCA) was estimated along the length of each aligned quartet at 20 bp intervals with a 100 bp wide sliding window, using constant recombination rate / population size models with an MCMC length of 10,000,000, a burn-in of 1,000,000 and a sampling frequency of 2,500. For each individual quartet, the TMRCA along the length of the alignment was summarised by calculating the mean TMRCA. To identify evidence of recombination, which would generate a sequence with regions of differing ancestries, the variance in TMRCA along the alignment was calculated for each individual quartet.
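The window geometry and the per-quartet summary statistics can be sketched as follows. This does not reproduce the ACG estimation itself; it only shows where 100 bp windows at 20 bp intervals fall along an alignment and how a series of window-wise TMRCA values could be summarised. The function names are hypothetical, 0-based window starts are used, and the use of the population variance is an assumption, since the text does not specify the estimator.

```python
from statistics import mean, pvariance


def window_starts(alignment_length, width=100, step=20):
    """0-based start positions of full-width windows along an alignment."""
    if alignment_length < width:
        return []
    return list(range(0, alignment_length - width + 1, step))


def summarise_tmrca(tmrca_per_window):
    """Return (mean, variance) of TMRCA across windows. A high variance
    suggests regions of differing ancestry, as expected under recombination."""
    return mean(tmrca_per_window), pvariance(tmrca_per_window)


# Hypothetical TMRCA values, for illustration only.
print(window_starts(200))
print(summarise_tmrca([1.0, 3.0]))
```

A quartet with a single shared ancestry along its length would give near-constant window values and hence a variance close to zero.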
Author Contributions
Conceived and designed the experiments: SSP, APJ. Performed the experiments: SSP, HN, MO, KN. Analysed the data: SSP, CWD, PR, APJ. Contributed reagents/materials/analysis tools: RMA, ZB, SK, RZM, MMGT, APJ. Wrote the paper: SSP, APJ. Obtained funding: SSP, MMGT, RZM, APJ.
Competing Interests
The authors declare no competing interests.
Acknowledgements
This work was supported by grants from the Biotechnology and Biological Sciences Research Council (BB/M022811/1 and BB/R021139/1), an International Veterinary Vaccinology Network (IVVN) pump-priming award, a Bill and Melinda Gates Foundation Grand Challenges Explorations award (Round 11), and the Wellcome Trust (WT206815/Z/17/Z).