Novel functional insights from the Plasmodium falciparum sporozoite-specific proteome by probabilistic integration of 26 studies

Plasmodium species, the causative agent of malaria, have a complex life cycle involving two hosts. The sporozoite life stage is characterized by an extended phase in the mosquito salivary glands followed by free movement and rapid invasion of hepatocytes in the human host. This transmission stage has been the subject of many transcriptomics and proteomics studies and is also targeted by the most advanced malaria vaccine. We applied Bayesian data integration to determine which proteins are not only present in sporozoites but are also specific to that stage. Transcriptomic and proteomic Plasmodium data sets from 26 studies were weighted for how representative they are for sporozoites, based on a carefully assembled gold standard for Plasmodium falciparum (Pf) proteins known to be present or absent during the sporozoite life stage. Of 5418 Pf genes for which expression data were available at the RNA level or at the protein level, 1105 were identified as enriched in sporozoites and 90 specific to them. We show that Pf sporozoites are enriched for proteins involved in type II fatty acid synthesis in the apicoplast and GPI anchor synthesis, but otherwise appear metabolically relatively inactive, in the salivary glands of mosquitos. Newly annotated hypothetical sporozoite-specific and sporozoite-enriched proteins highlight sporozoite specific functions. They include PF3D7_0104100 that we identified to be homologous to the prominin family, which in human has been related to a quiescent state of cancer cells. We document high levels of genetic variability for sporozoite proteins, specifically for sporozoite-specific proteins that elicit antibodies in the human host. Nevertheless, we can identify nine relatively well-conserved sporozoite proteins that elicit antibodies and that together can serve as markers for previous exposure. Our understanding of sporozoite biology benefits from identifying key pathways that are enriched during this life stage. This work can guide studies of molecular mechanisms underlying sporozoite biology and potential well-conserved targets for marker and drug development. Author Summary When a person is bitten by an infectious malaria mosquito, sporozoites are injected into the skin with mosquito saliva. These sporozoites then travel to the liver, invade hepatocytes and multiply before the onset of the symptom-causing blood stage of malaria. By integrating published data, we contrast sporozoite protein expression with other life stages to filter out the unique features of sporozoites that help us understand this stage. We used a “guideline” that we derived from the literature on individual proteins so that we knew which proteins should be present or absent at the sporozoite stage, allowing us to weigh 26 data sets for their relevance to sporozoites. Among the newly discovered sporozoite-specific genes are candidates for fatty acid synthesis while others might play a role keeping the sporozoites in an inactive state in the mosquito salivary glands. Furthermore, we show that most sporozoite-specific proteins are genetically more variable than non-sporozoite proteins. We identify a set of conserved sporozoite proteins against which antibodies can serve as markers of recent exposure to sporozoites or that can serve as vaccine candidates. Our predictions of sporozoite-specific proteins and the assignment of previously unknown functions give new insights into the biology of this life stage.

Malaria is a mosquito transmittable disease resulting in over 220 million clinical cases and half a million 53 deaths annually. Most deaths are caused by Plasmodium falciparum (Pf), one of the five species of 54 Plasmodium that can infect humans. The infection begins with the deposition of liver-infective 55 sporozoite forms in the skin by blood-feeding mosquitoes. These sporozoites travel to the liver where 56 they invade, differentiate and multiply asymptomatically inside hepatocytes for approximately a week 57 before releasing red blood cell (RBC)-infective merozoites into the circulation. The subsequent asexual 58 multiplication, rupture and re-invasion of the parasites into circulating RBCs cause the symptoms 59 associated with malaria. The used data sets were examined for their correlations with each other (S1 Figure). By and large 128 most data sets show little correlation. We did leave the few correlated data sets in to keep the data 129 integration transparent and maximize the amount of included information. Oocyst-derived and 130 salivary gland sporozoites showed high correlations with each other, which led us to combining all 131 respective studies into the "sporozoite" data input and not make a distinction between those stages. 132 S1 Figure: Correlations of all integrated data sets with each other, hierarchically clustered.

139
Proteomic data were converted into unique peptide counts for each protein identified and 140 transcriptomic data were converted into expression percentiles for a total of 5668 P. falciparum gene 141 IDs. Proteomic and transcriptomic data was binned consistently for all data sets, with 0, 1, or >1 142 identified unique peptides, or into four bins, containing transcripts that are in the >80 percentile, >60 143 percentile, >40 percentile and <40 percentile, respectively. The data sets were then weighted 144 according to their ability to retrieve the gold standard proteins. Each bin in each data set was given a 145 log2 score according to equation (1), where B = present in bin, S = sporozoite specific and nonS = not 146 sporozoite specific.

147
(1) ( ( | )) = ( The Bayesian score for an individual protein is then the sum of the scores for the bins in which it 149 occurs (one bin per data set). As the expected number of sporozoite-specific proteins was unknown, 150 no prior was included, rather we used cutoffs based on the position of known sporozoite specific 151 proteins to define sporozoite specific proteins and sporozoite enriched proteins. 152

Overrepresentation of function categories in sporozoite proteins
153 GO terms were acquired from PlasmoDB and formatted into a .gmt file according to the format 154 specified by the GSEA server at the Broad Institute [14]. GSEAPreranked on the ranked list of 155 sporozoite-specific proteins was used with the conservative "preranked" option in the "classic mode", to the sporozoite stage (S1 Table). In order to optimally exploit those data to obtain sporozoite-230 enriched proteins we integrated them in a Bayesian manner (Methods). Integrating the data sets using 231 the sets of 31 positive and 39 negative gold standard proteins (S2 Table, S3 Table) produced a list of 232 all proteins in P. falciparum ranked according to their likelihood of being sporozoite-specific (S4 Table).

233
The score distribution of the negative and positive gold standard proteins varied depending on using 234 all available data sets, or proteomic or transcriptomic data separately ( Figure 1A and S2 Figure). The

235
Bayesian integration using only transcriptomic data sets resulted in a ranked list where 14 negative 236 gold standard genes scored higher than the lowest scoring positive gold standard gene (S2 Figure). This 237 overlap was lower when using only proteomic data sets (S2 Figure), Proteins were considered sporozoite-specific when ranking in the first quartile of the gold standard 256 protein list i.e. scoring above 12.27, which represents a factor of being at least 2 12.37 to 2  5200 more 257 likely to be specific to sporozoites than to any of the other stages. Do note that hereby we ignore the 258 prior probability of a protein being sporozoite specific at all. Although such a prior is standard in 259 Bayesian data integration, in this case a prior, e.g. of 1/5, is given the high cut-off for sporozoite specific 260 proteins that we used, not very relevant. As we did not consider STARP to be sporozoite specific (see 261 above), we considered proteins that scored higher than the second to lowest positive gold standard 262 protein (LIMP protein, score = -1.31), to be enriched in sporozoites. Finally, the abundance of unique 263 peptides by mass-spectrometry was assessed for each protein. Proteins were deemed present in 264 sporozoites when identified in two independent studies or with more than 1 unique peptide in at least 265 one study. Our analysis thus identified 90 sporozoite-specific proteins, 1105 sporozoite-enriched 266 proteins and 2736 that were present in sporozoites (S4 Table). Out of the 90 sporozoite-specific 267 proteins, 67 were not part of the positive gold standard list. We validated our predictions by 5-fold 268 cross validation (5-fold CV) by randomly skipping 1/5 th of the gold standard proteins from the data 269 integration and assessing their predicted sporozoite specificity based on the remaining data ( Fig 1B).

270
The high sensitivity and specificity indicated that novel sporozoite-specific proteins would also score 271 higher than non-sporozoite specific proteins. We also compared the ranking of the gold standard 272 proteins based on the integrated data with a ranking based on individual data sets of sporozoite RNAs 273 and proteins. The cross validation separated the gold standard proteins better than individual data 274 sets, supporting the integration of multiple data sets ( Fig 1B). Toxoplasma gondii, an apicomplexan related to P. falciparum that is present in the HHpred database.

289
The sporozoite specific protein with the highest score that was not part of the gold standard,  The level of polymorphisms among P. falciparum strains is indicated for the separate regions as the average number of polymorphisms per nucleotide in the strains in PlasmoDB. PF3D7_0104100 has a high density of polymorphisms within P. falciparum strains that are concentrated in the second extracellular loop. Cysteines that are conserved among the homologs in Plasmodium species are indicated. The cysteines that are conserved also in human homologs are in bold. Note that the conserved cysteines occur in close proximity to each other, suggesting the formation of disulphide bonds. shorter TM region, putting the cysteine that is located in that TM region in the extracellular space. 311 The Toxoplasma protein was included because its sequence profile has significant sequence similarity 312 against both the human protein profile (E= 2e-20) and the Plasmodium protein profile (E= 3.4e-44), 313 while the similarity between the human and the Plasmodium protein is less significant (E=0.0001).

394
The width of the arrows is determined by the Bayesian score reflecting the level of over representation 395 of that enzyme in sporozoites, e.g. 16 for ACS2 and 8 for FabB/F (S4 Table). For PDH that consists of 396 three proteins, the width of the arrow was determined by the average of those three. Most of FASII 397 proteins are enriched in sporozoites, except PKII, FABZ and LipA. The scheme is a simplification of the 398 pathway as depicted by Shears et al.[22], to which ACS2 was added as it is highly enriched in sporozoites 399 and relevant for fatty acid synthesis.

402
In contrast to the paucity of upregulated processes, there is significant under representation (FDR =< 403 0.001) of processes linked to splicing, translation, translation elongation, folding of proteins as well as 404 proteolysis (S5 Table). These processes are primarily modulated by a number of sporozoite specific  Table).  Table S7. Induced antibody profiles 421 represent a blue print of immunogenic proteins in sporozoites, liver stages and early blood stages.

422
Here we focus on the set of sporozoite target proteins that may be used as markers of previous 423 sporozoite exposure or may act as potential targets for vaccines. Minimal sequence variation 424 between Pf strains would thereby strengthen the candidature of the proteins for epidemiological or 425 clinical applications, however antibody eliciting proteins including CSP, show relatively high levels of 426 polymorphisms among sequenced malaria strains (S4 Figure), and also sporozoite specific proteins 427 show high levels of polymorphisms (S5 Figure). The selection of proteins that could serve as markers 428 of exposure or vaccine candidates is a compromise between on the one hand protein sequence 429 conservation and on the other hand the frequency of volunteers with antibodies to that protein. As 430 the maximum level of sequence variation between candidate marker proteins we used 8 non-431 synonymous SNPs per kilobase, which is lower than the variation in for instance Pfs48/45, a highly 432 conserved gametocyte protein with 8.9 nonsynomymous SNP per kilobase. As the minimum number 433 of people in which a protein should elicit antibodies we chose six (out of the 38). Using a "greedy 434 search algorithm" (methods) we selected a set of nine proteins of which at least one elicited 435 antibodies per volunteer (Table 1).
436 were also applied to the indels. Numbers of called SNPs and indels were roughly similar for NF135 468 and NF166, reflecting their independent geographic origins. As expected, NF54 showed much lower 469 numbers of called SNPs and indels than the other strains, as it is the parental line from which 3D7 is 470 derived (S8 Table). As observed in Plasmodb, proteins in NF166 and NF135 showed high levels of 471 polymorphisms for antibody-binding proteins enriched in sporozoites ( Figure S6)

503
Despite translational repression, which is expected to reduce the correlation between mRNA and 504 protein levels, including the mRNA levels led to a better performance at the protein level that only 505 including proteomics data. Cross validation shows furthermore that the integrated list is better at 506 separating the gold standard positive and negative data sets from each other than the individual data 507 sets.

508
In this study, we did not separate datasets derived from oocyst sporozoites (the earliest form of this 509 stage) and salivary gland sporozoites (the mature form). There may have been subtle difference in 510 protein expression between sporozoites in the two differing host environments (midgut versus salivary 511 gland) that were not detected. However, oocyst-sporozoite data represent a minority of the combined 512 data set (2/26) and are highly correlated with salivary gland-derived sporozoite data (S1 Figure). A list 513 of proteins with its own sporozoite specificity score will be a valuable resource for studying sporozoite 514 biology and understanding novel protein function.  [67]. Indeed as we have shown here both sporozoite specific proteins and proteins that elicit 544 antibodies are highly polymorphic, and only a fraction of those are conserved between NF54, NF135 545 and NF166 (Fig S6, S7). The correlation between immunogenicity and level of sequence conservation 546 suggest that antigenic drift plays a role in the sequence variation. It is not clear whether antigenic drift 547 would also be responsible for high variation among sporozoite specific proteins, as we did not observe 548 a correlation between the Bayesian score of sporozoite specificity and the immunogenicity.