Dual RNA-Seq meta-analysis in Plasmodium infection

Dual RNA-Seq is the simultaneous retrieval and analysis of host and pathogen transcriptomes. It follows the rationale that cross-species interactions determine both pathogen virulence and host tolerance, resistance or susceptibility. Correlated gene expression might help identify interlinked signaling, metabolic or gene regulatory pathways in addition to potentially physically interacting proteins. Numerous studies have used RNA-Seq to investigate Plasmodium infection with a focus on only one organism, either the host or the parasite.

Here we propose a meta-analysis approach for dual RNA-Seq. We screened malaria transcriptome experiments for gene expression data from both Plasmodium and its host. Out of 105 malaria studies in Homo sapiens, Macaca mulatta and Mus musculus, we identified 56 studies with the potential to provide host and parasite data. While 15 studies (1,935 total samples) of these 56 explicitly aimed to generate dual RNA-Seq data, 41 (1,129 samples) had an original focus on either the host or the parasite. We show that a total of up to 2,530 samples are suitable for dual RNA-Seq analysis providing an unexplored potential for meta-analysis.

We argue that the multitude of variations in experimental conditions should help narrow down a conserved core of cross-species interactions. Different hosts are infected by evolutionarily diverse species of the genus Plasmodium. We propose to overlay interaction networks of different host-parasite systems based on orthologous genes. This might allow us to gauge the applicability of model systems for different pathways in malaria infection and to address the evolution of parasite-host interactions.


Transcriptomes are often analyzed as a first attempt to understand cellular and organismic events, because a comprehensive profile of RNA expression can be obtained at a reasonable cost and with high technical accuracy [37]. Microarrays dominated transcriptomics since 1995 [11,22,44]. Microarrays quantify gene expression based on hybridization of a target sequence to an immobilized probe of known sequence. Technical difficulties associated with microarrays lie in probe-selection, cross-hybridization, and design cost of custom chips [51]. RNA sequencing (RNA-Seq) eliminates these difficulties and provides deep and accurate expression estimates for all RNAs in a sample. RNA-Seq has thus replaced microarrays as the predominant tool for transcriptomics [37,50]. RNA-Seq assesses host and parasite transcriptomes simultaneously, if RNA of both organisms is contained in a sample. It has been pro- [...]

[...] inactive in the mammalian host [39]. In blood stage of the infection, leukocytes are, thus, the source of host mRNA. Malaria research, especially transcriptomics, is traditionally designed to target one organism, either the host or the parasite. Expression of mRNA, for example, can be compared between different time points in the life cycle of Plasmodium or between different drug treatment conditions. Researchers conducting a targeted experiment might consider transcripts from the non-target organism "contamination". Nevertheless, expression of those transcripts potentially corresponds to response to stimuli during the investigation. Additionally, some recent studies on malaria make intentional use of dual RNA-Seq. Malaria is the most thoroughly investigated disease caused by a eukaryotic organism and the accumulation of these two kinds of studies, RNA-Seq with "contaminants" and intentional dual RNA-Seq, provides a rich resource for meta-analysis.
bioRxiv preprint doi: https://doi.org/10.1101/576116; this version posted March 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Such a meta-analysis can use co-regulated gene expression to infer host-parasite interactions. Correlation of mRNA expression can be indicative of different kinds of biological "interactions": On one hand, protein products could be directly involved in the formation of complexes and might therefore be produced in quantities varying similarly under altered conditions. On the other hand, involvement in the same biological pathways can result in co-regulated gene expression without physical interaction.
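As an illustration of what "correlated expression" between the two organisms could mean in practice, the following minimal sketch computes Pearson correlations between every host gene and every parasite gene across shared samples. It is not the pipeline of this study: the function name `cross_species_correlation`, the toy matrices and the choice of Pearson correlation on raw values are assumptions made only for this example.

```python
import numpy as np


def cross_species_correlation(host, parasite):
    """Pearson correlation of each host gene with each parasite gene.

    Both inputs are genes x samples arrays whose columns refer to the
    same samples in the same order. Genes with constant expression have
    zero variance and are assumed to have been removed beforehand.
    """
    host = np.asarray(host, dtype=float)
    parasite = np.asarray(parasite, dtype=float)
    if host.shape[1] != parasite.shape[1]:
        raise ValueError("host and parasite must share the same samples")

    # Center each gene, then scale it to unit length so that the dot
    # product of two rows equals their Pearson correlation.
    host_c = host - host.mean(axis=1, keepdims=True)
    para_c = parasite - parasite.mean(axis=1, keepdims=True)
    host_c /= np.linalg.norm(host_c, axis=1, keepdims=True)
    para_c /= np.linalg.norm(para_c, axis=1, keepdims=True)

    # Result has shape (number of host genes, number of parasite genes).
    return host_c @ para_c.T


# Hypothetical toy data: 2 host genes and 2 parasite genes in 5 samples.
host_expr = np.array([[1, 2, 3, 4, 5],
                      [5, 3, 4, 1, 2]])
parasite_expr = np.array([[2, 4, 6, 8, 10],
                          [10, 8, 6, 4, 2]])

corr = cross_species_correlation(host_expr, parasite_expr)
```

Each entry `corr[i, j]` relates host gene `i` to parasite gene `j`. A high absolute value alone does not distinguish physical interaction from shared pathway membership, which is the distinction drawn in the paragraph above.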

This broad concept of interaction has long been exploited in single organisms (e.g. [10,33,42,46] [...]

[...] studies with the potential to provide host and Plasmodium reads for our meta-analysis. We first selected studies with "library strategy" given as "RNA-Seq" and "Plasmodium" in study title, abstract or sample attributes using the "dbGetQuery" function. Then we used the "getSRA" function with the query "(malaria OR Plasmodium) AND RNA-Seq". This function searches all fields. We manually curated the combined results and added studies based on a literature review using the terms described [...]

In total, we found 56 potentially suitable studies in this database and literature review (Figure 1a). Homo sapiens was the host organism for 22 studies, Mus musculus for 24 and Macaca mulatta for 10. Human studies featured P. falciparum and P. vivax. P. yoelii, P. chabaudi and P. berghei were used in mouse infection studies. P. berghei was additionally used to infect human liver cell culture. Macaque studies included P. cynomolgi and P. coatneyi infections (table 1).

We note that 20 of the 56 studies depleted (or enriched, respectively) specific classes of cells from [...] to focus expression analysis on Plasmodium or the host immune system, respectively. In all these scenarios, we suspect depletion to be imperfect and thus the samples to potentially include mRNA of both organisms.

For 15 out of the 56 studies, the authors state that they intended to simultaneously study host [...]

Besides invading liver and RBCs, Plasmodium parasites sequester in bone marrow, adipose tissue, lung, spleen and brain (during cerebral malaria) [12,18]. To study a comprehensive spectrum of host-parasite interactions, it would be optimal to have data from these different tissues. Our collection of studies comprises data derived from blood and liver for all three host organisms. In addition, we have found seven transcriptomic studies on spleen [1-3,7,19,29,36] and two studies on cerebral malaria [5,48] from mice. MaHPIC offers a collection of blood and bone marrow studies in macaques.

Experiments performed on mouse blood focus on the parasite instead of the host (11 vs. 0). Studies on human blood infection focus more often on the host immune response than on the parasite (9 vs. [...]

Suitable runs bearing "contaminants" at the thresholds chosen here are identified from human-P. falciparum, macaque-P. cynomolgi, human-P. berghei and mouse-P. berghei systems. Unfortunately, with current thresholds and currently available data, we highly under-represent human-P. vivax and human-P. berghei systems, the two liver in vitro models. This outcome is understandable owing to the low parasitemia in liver cultures [13]. We note that the thresholds could further be made lenient enough to include more runs for these systems at the cost of analyzing only the most highly [...]

[...] impracticable for very large datasets in the proposed meta-analysis.

We consider partial correlation as an additional approach that could be combined with the above methods. Partial correlation can control pairwise correlations for the influence of other genes [26].
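To make the idea of controlling a pairwise correlation for another gene concrete, the sketch below computes a first-order partial correlation, i.e. the correlation of two genes conditioned on a single third gene. This is the simplest case and is shown only for illustration. The function name `partial_corr_first_order` and the simulated expression vectors are assumptions of this example, not data or methods taken from the studies discussed here.

```python
import numpy as np


def partial_corr_first_order(x, y, z):
    """Correlation of x and y after removing the linear effect of z.

    Uses the standard first-order formula
    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)).
    """
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Simulated (hypothetical) expression across 200 samples: a host gene x and
# a parasite gene y that both follow a third gene z but are otherwise
# independent of each other.
rng = np.random.default_rng(0)
z = rng.normal(size=200)
x = z + 0.5 * rng.normal(size=200)
y = z + 0.5 * rng.normal(size=200)

raw = np.corrcoef(x, y)[0, 1]                # inflated by the shared driver z
partial = partial_corr_first_order(x, y, z)  # close to zero once z is controlled
```

In this toy setting the raw correlation is high although x and y are linked only through z, while the partial correlation is close to zero. Conditioning on all other genes at once follows the same logic but requires many more computations.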

In transcriptomic applications, full-conditioned partial correlation is computationally very expensive.

Some studies therefore resort to second-order partial correlation (relationship between two genes in- [...]

[...] would require the construction of networks for each study in a single host-parasite system followed by a multi-layered network analysis on these networks.

We anticipate that correlation between host and parasite transcript expression will highlight host- [...]

Figure 1: Proportion and number of sequencing reads and expressed genes from parasite and host in selected malaria RNA-Seq studies. We mapped sequencing reads from studies selected for their potential to provide both host and parasite gene expression data (total studies=56, total runs=3064; references in Supplementary File 1) against appropriate host and parasite genomes. (a) The percentage of parasite reads (x-axis) is plotted for each run in each study. The studies are categorised according to the host organisms and labeled "enriched/depleted" to indicate enrichment of infected hepatocytes or depletion of leukocytes from blood. Studies labeled "dual" were originally intended to simultaneously assess host and parasite transcriptomes. The colored area gives an estimation of the density of runs at a particular host-parasite percentage. The number of runs, "N", in each study is displayed. We also plot the number of reads mapped in each run against the number of expressed genes for (b) host and (c) parasite estimated based on this mapping. The number of expressed genes increases with sequencing depth towards the maximum of all annotated genes for the respective organism. The vertical lines indicate a threshold of 1,000,000 and 100,000 reads for host and parasite, respectively. The horizontal lines correspond to thresholds on the number of expressed genes at 10,000 and 3,000 for host and at 3,000, 1,000 and 100 for the parasite. At such exemplary thresholds, data could be considered sufficient for dual RNA-Seq analysis on both organisms.
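The exemplary thresholds in Figure 1 can be read as a simple per-run filter. The sketch below is one possible reading, under the assumption that a run counts as suitable only if the read and expressed-gene thresholds are met for both organisms at once. The `Run` record, the function `is_dual_suitable`, the use of "greater than or equal" and the example runs are all hypothetical; only the threshold values come from the caption.

```python
from dataclasses import dataclass


@dataclass
class Run:
    run_id: str
    host_reads: int       # reads mapped to the host genome
    parasite_reads: int   # reads mapped to the parasite genome
    host_genes: int       # number of expressed host genes
    parasite_genes: int   # number of expressed parasite genes


# Read thresholds from the Figure 1 caption.
HOST_MIN_READS = 1_000_000
PARASITE_MIN_READS = 100_000


def is_dual_suitable(run, host_min_genes=10_000, parasite_min_genes=3_000):
    """Assumed rule: every threshold must be met for both organisms."""
    return (
        run.host_reads >= HOST_MIN_READS
        and run.parasite_reads >= PARASITE_MIN_READS
        and run.host_genes >= host_min_genes
        and run.parasite_genes >= parasite_min_genes
    )


# Hypothetical runs, not taken from any of the 56 studies.
runs = [
    Run("run_A", 20_000_000, 2_000_000, 14_000, 4_500),
    Run("run_B", 15_000_000, 150_000, 12_000, 400),
    Run("run_C", 800_000, 5_000_000, 2_000, 5_000),
]

# Strictest gene thresholds from the caption (10,000 host / 3,000 parasite).
strict = [r.run_id for r in runs if is_dual_suitable(r)]
# Most lenient gene thresholds from the caption (3,000 host / 100 parasite).
lenient = [r.run_id for r in runs if is_dual_suitable(r, 3_000, 100)]
```

Relaxing the gene thresholds admits `run_B`, which has few expressed parasite genes, while `run_C` stays excluded because its host read count is below the read threshold.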