Getting close to nature – Plasmodium knowlesi reference genome sequences from contemporary clinical isolates

Plasmodium knowlesi, a malaria parasite of old-world macaque monkeys, is used extensively to model Plasmodium biology. Recently P. knowlesi was found in the human population of Southeast Asia, particularly Malaysia. P. knowlesi causes un-complicated to severe and fatal malaria in the human host with features in common with the more prevalent and virulent malaria caused by Plasmodium falciparum. As such P. knowlesi presents a unique opportunity to inform an experimental model for malaria with clinical data from same-species human infections. Experimental lines of P. knowlesi represent well characterised genetically static parasites and to maximise their utility as a backdrop for understanding malaria pathophysiology, genetically diverse contemporary clinical isolates, essentially wild-type, require comparable characterization. The Oxford Nanopore PCR-free long-read sequencing platform was used to sequence P. knowlesi parasites from archived clinical samples. The sequencing platform and assembly pipeline was designed to facilitate capturing data on important multiple gene families, including the P. knowlesi schizont-infected cell agglutination (SICA) var genes and the Knowlesi-Interspersed Repeats (KIR) genes. The SICAvar and KIR gene families code for antigenically variant proteins that have been difficult to resolve and characterise. Analyses presented here suggest that the family members have arisen through a process of gene duplication, selection pressure and variation. Highly evolving genes tend to be located proximal to genetic elements that drive change rather than regions that support core gene conservation. For example, the virulence-associated P. falciparum erythrocyte membrane protein (PfEMP1) gene family members are restricted to relatively unstable sub-telomeric regions. In contrast the SICAvar and KIR genes are located throughout the genome but as the study presented here shows, they occupy otherwise gene-sparse chromosomal locations. The novel methods presented here offer the malaria research community new tools to generate comprehensive genome sequence data from small clinical samples and renewed insight into these complex real-world parasites. Author summary Malaria is a potentially severe disease caused by parasite species within genus Plasmodium. Even though the number of cases is in decline there were over 200 million reported cases of malaria in 2019 that resulted in >400,000 deaths. Despite huge research efforts we still do not understand precisely how malaria makes some individuals very ill and by extension how to successfully augment and manage severe disease. Here we developed a novel method to generate comprehensive robust genome sequences from the malaria parasite Plasmodium knowlesi collected from clinical samples. We propose to use the method and initial data generated here to begin to build a resource to identify disease associated genetic traits of P. knowlesi taken from patient’s samples. In addition to the methodology, what further sets this work apart is the unique opportunity to utilize same-species experimental P. knowlesi parasites to discover a potential role for particular parasite traits in the differential disease progression we observe in patients with P. knowlesi malaria. While we developed the methods to study severe malaria, they are affordable and accessible, and offer the wider malaria research community the means to add context and insight into real-world malaria parasites.


Plasmodium falciparum.
23 As such P. knowlesi presents a unique opportunity to inform an experimental model for malaria with 24 clinical data from same-species human infections.

25
Experimental lines of P. knowlesi represent well characterised genetically static parasites and to 26 maximise their utility as a backdrop for understanding malaria pathophysiology, genetically diverse 27 contemporary clinical isolates, essentially wild-type, require comparable characterization.

29
The Oxford Nanopore PCR-free long-read sequencing platform was used to sequence P. knowlesi 30 parasites from archived clinical samples. The sequencing platform and assembly pipeline was 31 designed to facilitate capturing data on important multiple gene families, including the P. knowlesi 32 schizont-infected cell agglutination (SICA) var genes and the Knowlesi-Interspersed Repeats (KIR) 33 genes.

2
The SICAvar and KIR gene families code for antigenically variant proteins that have been difficult to 35 resolve and characterise. Analyses presented here suggest that the family members have arisen 36 through a process of gene duplication, selection pressure and variation. Highly evolving genes tend 37 to be located proximal to genetic elements that drive change rather than regions that support core 38 gene conservation. For example, the virulence-associated P. falciparum erythrocyte membrane 39 protein (PfEMP1) gene family members are restricted to relatively unstable sub-telomeric regions.

40
In contrast the SICAvar and KIR genes are located throughout the genome but as the study 41 presented here shows, they occupy otherwise gene-sparse chromosomal locations.

42
The novel methods presented here offer the malaria research community new tools to generate 43 comprehensive genome sequence data from small clinical samples and renewed insight into these 44 complex real-world parasites.

46
Author summary 47 Malaria is a potentially severe disease caused by parasite species within genus Plasmodium. 48 Even though the number of cases is in decline there were over 200 million reported cases of 49 malaria in 2019 that resulted in >400,000 deaths. Despite huge research efforts we still do 50 not understand precisely how malaria makes some individuals very ill and by extension how 51 to successfully augment and manage severe disease. 52 Here we developed a novel method to generate comprehensive robust genome sequences 53 from the malaria parasite Plasmodium knowlesi collected from clinical samples. 54 We propose to use the method and initial data generated here to begin to build a resource 55 to identify disease associated genetic traits of P. knowlesi taken from patient's samples. In 56 addition to the methodology, what further sets this work apart is the unique opportunity to 57 utilize same-species experimental P. knowlesi parasites to discover a potential role for 58 particular parasite traits in the differential disease progression we observe in patients with 59

65
Plasmodium knowlesi is a malaria parasite first described in a natural host, the long-tailed macaque 66 monkey (Macaca fascicularis), in the early part of the 20 th Century [1]. Although an incidental find, P.
3 knowlesi was soon exploited as a model parasite for malaria research [2][3][4]. Experimental P. knowlesi 68 was well-characterised over time with several additional lines adapted from natural hosts in 69 geographically distinct regions, including a human infection [2][3][4][5][6]. Taken   Zoonotic malaria caused by P. knowlesi is currently the most common type of malaria in Malaysia, 79 with most of the cases reported in Malaysian Borneo [9]. Indeed, naturally acquired P. knowlesi 80 malaria causes a spectrum of disease from uncomplicated to severe and fatal infections with 81 tantalizing similarity to severe adult malaria caused by P. falciparum [10][11][12][13] 82 The clinical similarities observed in patients with severe P. knowlesi and P. falciparum infections 83 suggest that P. knowlesi has the potential to serve as both a human pathogen and animal model for 84 severe malaria pathophysiology that has hitherto eluded medical science [11,14,15].

85
To take this idea forward, it seemed prudent to compare genome sequences derived from 86 contemporary clinical isolates of P. knowlesi with the reference P. knowlesi genome generated from 87 a genetically static and laboratory passaged experimental line [16]. 88 We developed methods to produce high-quality Illumina short-read P. knowlesi genome sequence 89 data from frozen clinical blood samples [17]. The outputs of this work identified genome-wide 90 diversity, including genomic dimorphism in P. knowlesi isolates from patients. Comparisons also 91 highlighted that reference P. knowlesi genome sequence data, generated from experimental lines 92 established mid-twentieth century, may not properly reflect and capture important loci for research 93 on malaria pathophysiology, particularly multiple gene families.

130
The new reference genomes will, for the first time, provide insight into clinically relevant 131 contemporary P. knowlesi parasites. These diverse parasites are essentially wild-type and the product 132 of ongoing mosquito transmission and recombination in nature [17,[36][37][38][39]. The genomes will offer a 133 valuable resource not only for our studies on members of the SICAvar gene family and virulence but 134 5 also to the wider zoonotic malaria research community working on comparative biology of malaria 135 parasites, drug discovery and vaccine development.

138
Evaluating draft de novo genomes

139
The genome pipeline, beginning with Oxford Nanopore Technologies (ONT) MinION sequencing 140 through to de novo assembly and genome annotation with downstream analyses, is shown (Fig 1).

141
The pipeline was used to produce draft P. knowlesi de novo genomes using DNA extracted from two 142 clinical isolates sks047 and sks048 and for comparison the well characterised cultured line, P.

143
knowlesi A1-H.1. For purpose of clarity, the P. knowlesi A1-H.1 de novo draft genome assembled 144 here is referred to as StAPkA1H1 (please see methods section). Read coverage of 225x, 71x and 65x 145 was obtained for StAPkA1H1, sks047 and sks048 respectively (Error! Reference source not found.).

146
The draft assemblies resolved into 100 or fewer contigs before further reduction to <72 contigs after 147 scaffolding (      Legend to Table 2: SICAvar domain fragments are found annotated across the genomes; combinations of these fragments can form complete SICAvar proteins, indicating the possibility of a larger number of SICAvar proteins present in native genomes. Gene data for reference PkA1H1 was unavailable. * total genome length excluding the mitochondrial and apicoplast genome sequences ** total number of coding genes and pseudogenes identified with a function *** SICAvar Type 1 (T1); SICAvar Type 2 (T2); SICAvar single domain fragments (SDM's). Single domain fragments code for SICAvar protein fragments.

7
SICAvar genes and gene fragments in the draft genomes presented here were resolved to the best 203 current sequencing technology. The differences observed between SICAvar gene copy numbers and 204 fragment copy numbers in clinical isolates compared with those in experimental lines deserves 205 further investigation.

207
In regions of the draft genomes where gaps could not be resolved contigs which had evidence that 208 they belong together either by long reads spanning them, or similarity to the reference, were 209 scaffolded with N bases, proportional to the gap size (

220
These orthologous genes can be considered as the core P. knowlesi gene set and are indicative of 221 reliable and accurate assemblies (Table 2). In particular, the contemporary patient isolates -sks047 222 and sks048 -show >4600 shared orthologues with the PKNH reference genome (Table 2).

225
The apicoplast genome (API) could not be assembled for sks047, and while API contigs were 226 successfully assembled for StAPkA1H1 and sks048 (SI Table 1 Fig 3). The unplaced sequences in the 00 'bin' chromosomes account for at 241 least 40% of gaps in the three draft genomes (Error! Reference source not found.). Indeed, each 242 draft genome's chromosome structure conforms to that of the PKNH reference genome with 243 uniform coverage across the chromosomes in regions with no gaps (SI Fig 4). This is also apparent in 244 fragmented chromosomes, which retain the same chromosomal structure as PKNH (SI Fig 5). While   6). SICAvar genes appear to be distributed across the genome, 276 chromosomes, including the chromosomal extremities with more members annotated than 277 previously reported by Pain et al. (2008), particularly on chromosomes 10, 11 and 12 (SI Fig 7).

280
Following filtering for length, quality and depth, reads-based structural variants (SVs) were called 281 using the ONT SV pipeline and assembly-based SVs were called using Assemblytics [41]. The reads-282 based approach returned 1316 and 1398 SVs for sks047 and sks048, respectively (Table 4). The 283 10 assembly-based approach returned 856 and 839 SVs for sks047 and sks048, respectively (Table 4).

284
The reads-based approach is expected to return more variants due to a higher error rate in the raw 285 reads used compared with the collapsed assembly-based methodology.   Singleton (no identified duplication; proximal (two identified duplicated genes with <20 genes 317 between them); dispersed (>20 genes between the 2 candidate genes); tandem (duplication events 318 next to each other) and segmental/ whole genome duplication (WGD) (>4co-linear genes with <25 319 genes between them). To gain an insight into differences in the duplication types we classified 320 duplication types for the BUSCO core eucaryotic core control gene population and the PkSICAvar type 321 1, PkSICAvar type 2 and the KIR multiple gene families of interest in the three draft genomes 322 StAPkA1H1, sks047 and sks048 (Fig 4). p= 0.40). Therefore, there was no observed excess duplication types for BUSCO genes (Fig 4). However, 326 duplication profiles for the genes annotated SICAvar type 1, SICAvar type 2 and KIR in the draft 327 genomes, StAPkA1H1, sks047 and sks048, were markedly different from the BUSCO gene profiles with 328 no evidence for singleton genes (Fig 4). When compared to 100 randomly obtained genes as a 329 population this result profile was statistically significant (Mann-Whitney U test, p < 1.0e-9).

439
The two clinical isolates (sks047 and sks048) and the control (StAPkA1H1) resolved into 14 440 chromosomes as expected for Plasmodium spp. and one 'bin' 00 chromosome. The PKNH reference 441 genome also resolves into 14 chromosomes and one bin chromosome where 1.73% of the total 442 sequence comprising 62 genes were assigned [16]. The bin chromosomes (chr00) of StAPkA1H1, 443 sks047 and sks048 contain 1.59%, 2.09% and 1.94% total sequence length with 18, 35 and 25 genes 444 respectively. Sequences placed in chr00 were unable to pass alignment quality thresholds for 445 placement in chromosomes 1-14. For example, when aligned with minimap2 in D-GENIES to produce 446 dotplots, StAPkA1H1 chr00 sequences tended to cluster with PKNH chr00 both representing P.

637
TransposonPSI was also used on the AM-F assemblies with default parameters to find repeat 638 elements based on their coding sequences.

21
Redundant repeat element sequences were removed from the outputs of RepeatMasker, OCTFA, 640 LTRHarvest and TransposonPSI using a custom script, to generate a genome feature file (GFF3) 641 where each transposable and repetitive element of each AM-F assembly is represented once. Then, 642 within each draft assembly, repeat elements were masked using the coordinates present in the non-643 redundant GFF3 file and the 'maskfasta' function of bedtools (v2.27; default settings and `-soft`).

651
Companion software was run with no transcript evidence, 500bp minimum match length and 80% 652 match similarity for contig placement, 0.8 AUGUSTUS [76] score threshold and taxid 5851.

653
Additionally, pseudochromosomes were contiguated, reference proteins were aligned to the target 654 sequence, pseudogene detection was carried out, and RATT was used for reference gene models.

681
Failed SV types were manually filtered based on length and quality alone to determine the presence 682 of high-quality, low-occurrence variants. 688 [83]. VCF files for successful reads-based and assembly-based SV calling as well as the failed SV-type 689 VCF files were further filtered to remove any variants less than 50bp in length and less than Q5 in 690 quality using a bcftools one-liner (https://github.com/samtools/BCFtools). A quality filter was not 691 applicable for the assembly-based approach due to the lack of quality information in the original