Abstract
Dual RNA-Seq is the simultaneous retrieval and analysis of host and pathogen transcriptomes. It follows the rationale that cross-species interactions determine both pathogen virulence and host tolerance, resistance or susceptibility. Correlated gene expression might help identify interlinked signaling, metabolic or gene regulatory pathways in addition to potentially physically interacting proteins. Numerous studies have used RNA-Seq to investigate Plasmodium infection with a focus on only one organism, either the host or the parasite.
Here we propose a meta-analysis approach for dual RNA-Seq. We screened malaria transcriptome experiments for gene expression data from both Plasmodium and its host. Out of 105 malaria studies in Homo sapiens, Macaca mulatta and Mus musculus, we identified 56 studies with the potential to provide host and parasite data. Of these 56 studies, 15 (1,935 samples in total) explicitly aimed to generate dual RNA-Seq data, while 41 (1,129 samples) had an original focus on either the host or the parasite. We show that up to 2,530 samples are suitable for dual RNA-Seq analysis, providing an unexplored potential for meta-analysis.
We argue that the multitude of variations in experimental conditions should help narrow down a conserved core of cross-species interactions. Different hosts are infected by evolutionarily diverse species of the genus Plasmodium. We propose to overlay interaction networks of different host-parasite systems based on orthologous genes. This might allow us to gauge the applicability of model systems for different pathways in malaria infection and to address the evolution of parasite-host interactions.
Introduction
Transcriptomes are often analyzed as a first attempt to understand cellular and organismic events, because a comprehensive profile of RNA expression can be obtained at a reasonable cost and with high technical accuracy [50]. Microarrays dominated transcriptomics from 1995 onwards [5, 28, 63]. Microarrays quantify gene expression based on hybridization of a target sequence to an immobilized probe of known sequence. Technical difficulties associated with microarrays lie in probe selection, cross-hybridization, and the design cost of custom chips [89]. RNA sequencing (RNA-Seq) eliminates these difficulties and provides deep and accurate expression estimates for all RNAs in a sample. RNA-Seq has thus replaced microarrays as the predominant tool for transcriptomics [50, 88]. RNA-Seq assesses host and parasite transcriptomes simultaneously if RNA of both organisms is contained in a sample. It has been proposed to analyze the transcriptomes of both organisms involved in an infection, as pathogen virulence in an infectious disease is often a result of interlinked processes of host and pathogen (host-pathogen interactions) [43, 88, 89].
In the case of malaria, unlike in bacterial infections, both the pathogen and the host are eukaryotic organisms with similar transcriptomes. Host and parasite mRNA is selected simultaneously when poly-dT priming is used to amplify polyadenylated transcripts [88, 89]. This means that most malaria RNA-Seq datasets potentially contain transcripts of both species, making them suitable for dual RNA-Seq analysis. In the mammalian intermediate host, Plasmodium first invades liver and then red blood cells (RBCs) for development and asexual expansion. While the nuclear machinery of cells from both host and parasite produces mRNA in the liver, RBCs are enucleated and transcriptionally inactive in the mammalian host [53]. In the blood stage of the infection, leukocytes are thus the source of host mRNA. Malaria research, especially transcriptomics, is traditionally designed to target one organism, either the host or the parasite. Expression of mRNA, for example, can be compared between different time points in the life cycle of Plasmodium or between different drug treatment conditions. Researchers conducting a targeted experiment might consider transcripts from the non-target organism “contamination”. Nevertheless, expression of those transcripts potentially corresponds to a response to stimuli during the investigation. Additionally, some recent studies on malaria make intentional use of dual RNA-Seq. Malaria is the most thoroughly investigated disease caused by a eukaryotic organism, and the accumulation of these two kinds of studies, RNA-Seq with “contaminants” and intentional dual RNA-Seq, provides a rich resource for meta-analysis.
Such a meta-analysis can use co-regulated gene expression to infer host-parasite interactions. Correlation of mRNA expression can be indicative of different kinds of biological “interactions”: On one hand, protein products could be directly involved in the formation of complexes and might therefore be produced in quantities varying similarly under altered conditions. On the other hand, involvement in the same biological pathways can result in co-regulated gene expression without physical interaction. This broad concept of interaction has long been exploited in single organisms (e.g. [3, 44, 58, 81]). We (and others before [57]) propose to extrapolate this to interactions between a host and its pathogen. It can be expected that a stimulus presented by the parasite to its host elicits host immune response and that the parasite, in turn, tries to evade this response, creating a cascade of genes co-regulated at different time points or under different conditions.
In this paper, we explore first steps in a comparative meta-analysis of dual RNA-Seq transcriptomes. Existing raw read datasets collectively present an unexplored potential to answer questions that have not been investigated by individual studies. Meta-analyses increase the number of observations and statistical power and help eliminate false positives and false negatives, which may otherwise conceal important biological inferences [20, 21, 27]. Since mice and macaques are used as laboratory models for human Plasmodium infection, we analyzed the availability and suitability of mRNA sequencing data from three evolutionarily close hosts - Homo sapiens, Macaca mulatta and Mus musculus - and their associated Plasmodium parasites. Here we summarize available data, outline challenges and show exemplary approaches to uncover host-parasite interactions. We discuss orthology across different host-parasite systems as a means to enrich information.
Data review and curation of potentially suitable studies
Sequence data generated in biological experiments are submitted to one of the three mirroring databases of the International Nucleotide Sequence Database Collaboration (INSDC): NCBI Sequence Read Archive (SRA), EBI Sequence Read Archive (ERA) and DDBJ Sequence Read Archive (DRA). Comprehensive query tools exist to access these databases via web interfaces and programmatically via scripting languages (for example, SRAdb [93], ENAbrowseR [82]). In these databases, all experiments submitted under a single accession are given a single “study accession number” and are collectively referred to as a “study” from here onwards.
We used SRAdb, a Bioconductor/R package [31, 32], to query SRA [45] for malaria RNA-Seq studies with the potential to provide host and Plasmodium reads for our meta-analysis. We first selected studies with “library strategy” given as “RNA-Seq” and “Plasmodium” in study title, abstract or sample attributes using the “dbGetQuery” function. Then we used the “getSRA” function with the query “(malaria OR Plasmodium) AND RNA-Seq”. This function searches all fields. We manually curated the combined results and added studies based on a literature review using the terms described for the “getSRA” function in SRA, PubMed and Google Scholar. This search resulted in 196 studies, of which we disregarded 91 studies, all of which provide data from vectors and non-target hosts (e.g. avian malaria). A further 49 studies were excluded because their gene expression data was derived from Plasmodium spp. in blood-stage Plasmodium cultures and thus can be expected to be devoid of host mRNA. We then used SRAdb and the prefetch and fastq-dump tools from the SRA Toolkit [83] to download all replicate samples (experimental replicate or “run” in the jargon of sequencing databases) of the selected studies. The final curation of studies and download of runs was performed on 21 January 2019.
In total, we found 56 potentially suitable studies in this database and literature review (Figure 1a). Homo sapiens was the host organism for 22 studies, Mus musculus for 24 and Macaca mulatta for 10. Human studies featured P. falciparum and P. vivax. P. yoelii, P. chabaudi and P. berghei were used in mouse infection studies. P. berghei was additionally used to infect human liver cell culture. Macaque studies included P. cynomolgi and P. coatneyi infections (table 1).
Figure 1. Proportion and number of sequencing reads and expressed genes from parasite and host in selected malaria RNA-Seq studies. We mapped sequencing reads from studies selected for their potential to provide both host and parasite gene expression data (total studies = 56, total runs = 3,064; references in Supplementary File 1) against appropriate host and parasite genomes. (A) The percentage of parasite reads (x-axis) is plotted for each run in each study. The studies are categorised according to the host organisms and labeled “enriched/depleted” to indicate enrichment of infected hepatocytes or depletion of leukocytes from blood. Studies labeled “dual” were originally intended to simultaneously assess host and parasite transcriptomes. The colored area gives an estimation of the density of runs at a particular host-parasite percentage. The number of runs, “N”, in each study is displayed. We also plot the number of reads mapped in each run against the number of expressed genes for (B) host and (C) parasite, estimated based on this mapping. The number of expressed genes increases with sequencing depth towards the maximum of all annotated genes for the respective organism. The vertical lines indicate a threshold of 1,000,000 and 100,000 reads for host and parasite, respectively. The horizontal lines correspond to thresholds on the number of expressed genes at 10,000 and 3,000 for the host and at 3,000, 1,000 and 100 for the parasite. At such exemplary thresholds, data could be considered sufficient for dual RNA-Seq analysis on both organisms.
Table 1. Number of studies for each host-parasite pair and suitability analysis of their runs.
We note that 20 of the 56 studies depleted (or enriched, respectively) specific classes of cells from their samples. The low number of parasite cells at the physiologically asymptomatic liver stage [40], on one hand, gives the opportunity to test the effect of sporozoite-derived vaccines [2, 30, 64], but on the other hand, makes it difficult to study Plasmodium transcriptomes in this stage. To reduce overwhelming host RNA levels, 3 out of 10 liver studies sorted infected hepatoma cells from uninfected cells. Similarly, 17 other studies have depleted or enriched host white blood cells (WBCs or leukocytes) to focus expression analysis on Plasmodium or the host immune system, respectively. In all these scenarios, we suspect depletion to be imperfect and thus the samples to potentially include mRNA of both organisms.
For 15 out of the 56 studies, the authors state that they intended to simultaneously study host and parasite transcriptomes (“dual RNA-Seq”). This includes 8 studies from MaHPIC (Malaria Host-Pathogen Interaction Center), based at Emory University, that made extensive omics measurements in macaque malaria. The original focus of the remaining 41 studies was on the parasite in 20 and on the host in 21 cases.
Besides invading liver and RBCs, Plasmodium parasites sequester in bone marrow, adipose tissue, lung, spleen and brain (during cerebral malaria) [7, 22]. To study a comprehensive spectrum of host-parasite interactions, it would be optimal to have data from these different tissues. Our collection of studies comprises data derived from blood and liver for all three host organisms. In addition, we have found seven transcriptomic studies on spleen and two studies on cerebral malaria from mice (see Supplementary Table 1). MaHPIC offers a collection of blood and bone marrow studies in macaques.
Experiments performed on mouse blood focus on the parasite instead of the host (11 vs. 0). Studies on human blood infection focus more often on the host immune response than on the parasite (9 vs. 5). Liver and spleen studies focus on host and parasite almost equally often, with sources for host tissue in this case being either mice (in vivo) or hepatoma cell cultures (in vitro).
Below, we argue that small clusters of genes co-expressed across several such diverse conditions might help point towards potentially novel core host-parasite interactions. In addition, the extent of agreement in gene expression correlation between in vivo and in vitro systems, different tissues and organisms, might help us assess how far different biological pathways of these systems have model characteristics for human malaria.
Dual RNA-Seq suitability analysis
A sample suitable for dual RNA-Seq analysis must provide “sufficient” gene expression from both host and parasite. To assess the proportion of host and parasite RNA sequencing reads in each study and sample, we mapped sequencing reads onto concatenated host and parasite reference genomes using STAR [9, 10]. Simultaneous mapping against both genomes should avoid non-specific mapping of reads in regions conserved between host and parasites. We quantified the sequencing reads mapped to exons using the “countOverlaps” function of the GenomicRanges package [41] and calculated the proportion of reads mapping to host and parasite genes.
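The final step of this paragraph, turning per-gene counts into host and parasite proportions, can be sketched as follows. This is a minimal illustration in Python rather than the R workflow described above; the gene identifiers, the counts and the `gene_to_organism` lookup are hypothetical and would in practice be derived from the two annotation files.

```python
from collections import Counter


def read_proportions(gene_counts, gene_to_organism):
    """Return the fraction of exon-mapped reads assigned to each organism.

    gene_counts: dict mapping gene ID -> read count for one run.
    gene_to_organism: dict mapping gene ID -> "host" or "parasite"
    (hypothetical lookup, assumed to cover every counted gene).
    """
    totals = Counter()
    for gene, count in gene_counts.items():
        totals[gene_to_organism[gene]] += count
    mapped = sum(totals.values())
    if mapped == 0:
        # No reads mapped to any annotated gene in this run.
        return {"host": 0.0, "parasite": 0.0}
    return {org: totals[org] / mapped for org in ("host", "parasite")}


# Toy run: gene IDs and counts are made up for illustration.
counts = {"HOST_GENE_A": 900, "HOST_GENE_B": 50,
          "PARASITE_GENE_X": 40, "PARASITE_GENE_Y": 10}
organism = {"HOST_GENE_A": "host", "HOST_GENE_B": "host",
            "PARASITE_GENE_X": "parasite", "PARASITE_GENE_Y": "parasite"}
proportions = read_proportions(counts, organism)
```

Because reads are mapped against the concatenated genome, each counted read belongs to exactly one organism, so the two proportions sum to one for any run with mapped reads.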
The proportions of host and parasite reads for each run do not always reflect the original focus of a study (fig. 1a). Studies without depletion or enrichment give us an idea of how skewed overall RNA expression towards one organism can be expected to be under native conditions. Studies originally designed to use a dual RNA-Seq approach on blood stages use samples in which parasitemia is very high. In samples with lower parasitemia, such as many of those for which the original focus is on immune gene expression from leukocytes, the number of host reads is often overwhelming, and these samples are mostly not suitable for dual RNA-Seq analysis (table 1).
Many studies using depletion or enrichment prior to RNA sequencing (“enriched/depleted” in fig. 1a) still show considerable expression of the non-target organism. Studies on liver infection often comprise several runs with balanced proportions of host and parasite reads. This is a result of infected liver cells being sorted from uninfected cells in culture. Studies depleting leukocytes from whole blood to focus on parasite transcriptomes still show considerable host gene expression and provide, in principle, suitable runs for the analysis of blood infection at lower intensities. The latter comes with the caveat that host expression might be biased by unequal depletion of particular cell types. This could be the case if incomplete depletion affected different types of WBCs differentially and hence biased the detectable host mRNA expression in the direction of less depleted cell types. For similar reasons, controlling for parasite depletion might be challenging.
To establish suitability thresholds for inclusion of individual samples in further analysis, we plotted the number of host and parasite reads against the number of host and parasite genes expressed (fig. 1b and fig. 1c). For runs with high sequencing depth, the total number of expressed genes of the host and parasite approaches the number of all annotated genes: around 30,000 for the mammalian host and around 4,500 for Plasmodium. When sequencing depth is lower, the number of genes detected as “expressed” is lower and a decrease in sensitivity can be expected to prevent analysis of poorly expressed genes. We propose four parameters for suitability thresholds in dual RNA-Seq analysis: the number of reads mapping to (1) host and (2) parasite genes and the number of genes these reads map to (expressed genes) in (3) host and (4) parasite. In table 1, we give the number of runs considered suitable for three different combinations of thresholds. Without claiming a particular threshold to be ideal, we propose to use thresholds to exclude uninformative runs from further processing and thereby reduce the computational burden of co-expression analysis.
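The four-parameter filter can be sketched as below. The default values are taken from one of the exemplary threshold combinations drawn in Figure 1 (1,000,000 host reads, 100,000 parasite reads, 10,000 host genes, 3,000 parasite genes); the `RunSummary` class, the function name and the toy runs are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Per-run mapping summary (hypothetical container)."""
    host_reads: int       # reads mapped to host genes
    parasite_reads: int   # reads mapped to parasite genes
    host_genes: int       # host genes with mapped reads
    parasite_genes: int   # parasite genes with mapped reads


def is_suitable(run, min_host_reads=1_000_000, min_parasite_reads=100_000,
                min_host_genes=10_000, min_parasite_genes=3_000):
    """Return True if a run passes all four exemplary thresholds."""
    return (run.host_reads >= min_host_reads
            and run.parasite_reads >= min_parasite_reads
            and run.host_genes >= min_host_genes
            and run.parasite_genes >= min_parasite_genes)


# Toy runs with made-up numbers.
balanced_run = RunSummary(5_000_000, 250_000, 15_000, 3_500)
host_dominated_run = RunSummary(5_000_000, 20_000, 15_000, 900)

strict = [is_suitable(r) for r in (balanced_run, host_dominated_run)]
# A more lenient combination keeps the host-dominated run as well.
lenient = is_suitable(host_dominated_run,
                      min_parasite_reads=10_000, min_parasite_genes=100)
```

Keeping the thresholds as parameters makes it easy to tabulate the number of suitable runs for several combinations, as done in table 1.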
Suitable runs bearing “contaminants” at the thresholds chosen here are identified from human-P. falciparum, macaque-P. cynomolgi, human-P. berghei and mouse-P. berghei systems. Unfortunately, with current thresholds and currently available data, we highly under-represent human-P. vivax and human-P. berghei systems, the two liver in vitro models. This outcome is understandable owing to the low parasitemia in liver cultures [8]. We note that the thresholds could further be made lenient enough to include more runs for these systems at the cost of analyzing only the most highly expressed parasite genes. An alternative approach relies on using depleted/enriched samples for these systems. For further analysis, however, including depleted/enriched samples could prove challenging as discussed before. Analysis approaches such as multilayer networks (see below) might help to gauge problems with such runs for the inference of co-expression in further steps of analysis.
Identification of co-expressed genes via correlation techniques
Some genes are likely to show almost uniform expression under different experimental conditions (e.g., “housekeeping genes”). Naïve assessments of correlation could, however, identify pairs of such genes as highly correlated. An analysis of co-expression can deal with this challenge in two different ways:
Firstly, the most variable genes within and across studies can be selected and other genes discarded. While requiring little computational time and resources, exclusion of genes with too little variance in expression from downstream analysis should be performed with caution, as seemingly small variations might result in a suitable signal over a large set of runs. To select only variable genes, one option is to compute their variance across all samples (in one or multiple studies). Genes with variance below a threshold may then be excluded from further analysis. As variance increases with the mean for gene expression data, the Biological Coefficient of Variation (BCV) [51,59] may provide a more robust threshold.
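As a rough illustration of this filtering step, the sketch below drops genes whose expression varies little relative to their mean. Note that it uses the plain coefficient of variation (standard deviation divided by mean) as a simple stand-in; it is not the model-based BCV estimate referred to above. The gene names, values and cutoff are invented for the example.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (0.0 if the mean is 0)."""
    m = mean(values)
    if m == 0:
        return 0.0
    return stdev(values) / m


def filter_variable_genes(expression, min_cv):
    """Keep genes whose coefficient of variation reaches min_cv.

    expression: dict mapping gene ID -> list of expression values,
    one value per run (at least two runs are required).
    """
    return {gene: values for gene, values in expression.items()
            if coefficient_of_variation(values) >= min_cv}


# Toy data: one near-constant gene and one strongly varying gene.
expression = {
    "flat_gene": [100, 101, 99, 100],
    "variable_gene": [10, 200, 50, 400],
}
kept = filter_variable_genes(expression, min_cv=0.1)
```

Dividing by the mean addresses the mean-variance relationship mentioned above only crudely; a cutoff such as 0.1 is arbitrary and would need to be chosen with the caution the text recommends.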
Secondly, one can compute empirical correlation indices, similar to p-values, for any gene-pair. Empirical correlation indices are a robust way to estimate whether gene-pairs are correlated because of specific events (treatment condition, time point) and not by chance (e.g., housekeeping genes) [37, 55]. These methods construct a null distribution using permutations of the given data instead of assuming a null distribution in advance. Since host and parasite genomes total nearly 35,000 genes, the number of permutations has to be around 10^12 (for a resolution of 0.1% FDR) to be suitable for corrections for multiple-testing [1]. Alternatively, as computational costs for these permutations can be expected to be too high for datasets with thousands of samples, uncorrected “p-values” may be considered a ranking for host-parasite gene correlation, following the suggestion of Reid and Berriman [57]. Nevertheless, reliance on empirical computation of p-values without prior variance/BCV filtering might become impracticable for very large datasets in the proposed meta-analysis.
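A permutation-based empirical “p-value” for a single host-parasite gene pair might look like the sketch below. It assumes the simplest possible scheme, shuffling one expression profile across runs and counting how often the shuffled absolute Pearson correlation reaches the observed one; the cited methods may permute the data differently. The function names, the number of permutations and the toy profiles are illustrative assumptions.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def empirical_p_value(x, y, n_permutations=1000, seed=0):
    """Fraction of permutations with |r| at least as large as observed."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Toy profiles across eight runs (made-up values).
host_gene = [1, 2, 3, 4, 5, 6, 7, 8]
parasite_gene = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
p = empirical_p_value(host_gene, parasite_gene)
```

The smallest value this estimate can take is 1 / (n_permutations + 1), which is why resolving very small p-values for all gene pairs requires the very large permutation counts discussed above.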
We consider partial correlation as an additional approach that could be combined with the above methods. Partial correlation can control pairwise correlations for the influence of other genes [35]. In transcriptomic applications, full-conditioned partial correlation is computationally very expensive. Some studies therefore resort to second-order partial correlation (relationship between two genes independent of two other genes) [94, 95]. A suitable pipeline might first use zero-order partial, i.e., “regular” correlation with empirical p-values to remove constitutively expressed genes. For all correlations with an empirical “p-value” below a certain threshold, one could compute e.g. first-order partial correlations, thus reducing the number of computations. Iterations of such an approach with higher-order partial correlations are then possible.
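To make the first-order case concrete, the sketch below computes the correlation of two genes x and y after removing the linear influence of a third gene z, using the standard formula r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The function names and the toy profiles are assumptions for illustration; in the toy data z drives both x and y.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def first_order_partial(x, y, z):
    """Correlation of x and y controlling for a single third variable z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    denominator = sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator == 0:
        # x or y is perfectly explained by z; nothing is left to correlate.
        return 0.0
    return (r_xy - r_xz * r_yz) / denominator


# Toy profiles (made-up values): x and y both follow z plus small noise.
z = [1, 2, 3, 4, 5, 6]
x = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5]
y = [0.5, 2.5, 3.5, 3.5, 4.5, 6.5]

raw = pearson(x, y)
partial = first_order_partial(x, y, z)
```

In this toy case the raw correlation between x and y is high, but most of it disappears once z is controlled for, which is the behaviour that motivates partial correlation as a follow-up to zero-order screening.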
Across different studies; across different host-parasite systems
Gene × gene matrices obtained from correlation analysis can be visualized and analyzed as interaction networks. We have identified different but interlinked workflows to reconstruct a consensus network of expression correlation (fig. 2). A first approach (fig. 2(a)) integrates data from different studies of one host-parasite system by simply appending expression profiles of their runs.
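The first approach, appending runs from several studies of one host-parasite system and then correlating host genes with parasite genes, can be sketched as follows. The sketch assumes that expression values are already normalized and comparable across studies, which is a strong simplification; the gene names, values and the correlation cutoff are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def append_studies(studies):
    """Concatenate runs of all studies for genes present in every study.

    studies: list of dicts mapping gene ID -> list of values per run.
    """
    shared = set(studies[0])
    for study in studies[1:]:
        shared &= set(study)
    return {gene: [v for study in studies for v in study[gene]]
            for gene in sorted(shared)}


def cross_species_edges(expression, host_genes, parasite_genes, min_abs_r):
    """Return (host gene, parasite gene, r) for pairs with |r| >= min_abs_r."""
    edges = []
    for h in host_genes:
        for p in parasite_genes:
            r = pearson(expression[h], expression[p])
            if abs(r) >= min_abs_r:
                edges.append((h, p, r))
    return edges


# Two toy studies of the same host-parasite system (made-up values).
study_a = {"H1": [1, 2, 3], "P1": [2, 4, 6], "P2": [5, 1, 4]}
study_b = {"H1": [4, 5], "P1": [8, 10], "P2": [2, 6]}

combined = append_studies([study_a, study_b])
edges = cross_species_edges(combined, ["H1"], ["P1", "P2"], min_abs_r=0.9)
```

Only host-parasite pairs are tested here, so the resulting edge list corresponds to the cross-species block of the gene × gene matrix rather than the full matrix.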
Figure 2. Two strategies identified to reconstruct host-parasite interaction networks from SRA. We identified two approaches to obtain a consensus network involving multiple hosts and multiple parasites. We selected appropriate studies from SRA for this analysis. The aim is to find a set of important interactions in malaria using co-regulated gene expression and to visualize this information as a biological network. Using the first approach (A), we form single networks from single RNA-Seq datasets or single networks from all studies of a host-parasite system appended one after the other, using cross-species gene correlation. To obtain a consensus network for all hosts and all parasites, we use 1:1 orthologous gene names for all hosts and all parasites, rename these genes to show their equivalency and append them to form one big dataset. Next, we perform pairwise correlation of genes and finally obtain a network that represents the direct interactions among orthologous genes. In (B), the second approach, we implement multi-layered network analysis to obtain a consensus network from several layers of individual networks. In this approach, we make single networks for individual RNA-Seq datasets. To obtain a network for a host-parasite system, we either append all datasets of the host-parasite system with each other and form a network, or we apply multi-layered network analysis on single networks to get the consensus. To reconstruct a network involving multiple host-parasite systems, we rename orthologous genes in each layer and then look for overlapping communities.
Knowledge of 1:1 orthologs [38] between different host and different parasite species can be used in the next steps to integrate across different host-parasite systems. Humans and macaques share 18,179 1:1 orthologous genes, humans and mice share 17,089 orthologous genes and 14,776 genes are 1:1:1 orthologous among all three species. Similarly, 7,600 groups of orthologous genes exist among the Plasmodium species.
A simple approach to combine data across host-parasite systems could again append these orthologs in the original datasets before correlations of gene expression. Alternatively, to construct a consensus network involving all hosts and parasites, a multi-layer network analysis could align networks by orthologous genes. This approach (fig. 2(b)) can offer more control when looking for correlations consistent in different layers representing different host-parasite systems. Similarly, more insight could be gained when correlations from different types of tissues are combined as multilayer networks. This would require the construction of networks for each study in a single host-parasite system followed by a multi-layered network analysis on these networks.
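A very reduced picture of aligning layers by orthology is sketched below: edges of each host-parasite network are renamed to shared ortholog-group identifiers, and edges that recur in a minimum number of layers are kept. This edge-counting scheme is a deliberately simple stand-in, not the multilayer community analysis proposed here, and all gene names and ortholog groups are hypothetical.

```python
from collections import Counter


def rename_edges(edges, ortholog_group):
    """Translate species-specific gene IDs into ortholog-group IDs.

    Edges involving a gene without a 1:1 ortholog group are dropped.
    """
    renamed = set()
    for host_gene, parasite_gene in edges:
        if host_gene in ortholog_group and parasite_gene in ortholog_group:
            renamed.add((ortholog_group[host_gene],
                         ortholog_group[parasite_gene]))
    return renamed


def consensus_edges(layers, min_layers):
    """Return edges present in at least min_layers of the given layers."""
    counts = Counter(edge for layer in layers for edge in layer)
    return {edge for edge, n in counts.items() if n >= min_layers}


# Hypothetical 1:1 ortholog groups for two host-parasite systems.
ortholog_group = {
    "human_A": "HOST_OG1", "mouse_A": "HOST_OG1",
    "human_B": "HOST_OG2", "mouse_C": "HOST_OG3",
    "falciparum_X": "PARASITE_OG1", "berghei_X": "PARASITE_OG1",
    "falciparum_Y": "PARASITE_OG2", "berghei_Z": "PARASITE_OG3",
}

# One toy network (set of host-parasite edges) per system.
human_layer = rename_edges({("human_A", "falciparum_X"),
                            ("human_B", "falciparum_Y")}, ortholog_group)
mouse_layer = rename_edges({("mouse_A", "berghei_X"),
                            ("mouse_C", "berghei_Z")}, ortholog_group)

core = consensus_edges([human_layer, mouse_layer], min_layers=2)
```

Varying `min_layers` gives some of the control mentioned above: a high value isolates interactions conserved across systems, while a low value retains system-specific edges for comparison.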
We anticipate that correlation between host and parasite transcript expression will highlight host-parasite interactions worthy of scrutiny in further focused research. As a second goal, meta-analysis involving different host-parasite systems could give insights into how easily observations made in malaria models can be translated to human malaria. If, e.g., certain groups of pathways show lower evolutionary conservation in host-parasite co-expression networks, one could expect results on those pathways to be more difficult to translate between systems. Finally, one can ask whether expression correlation between host and parasite species is more or less evolutionarily conserved than within host species in malaria [47, 86, 91].
Funding
This work was supported by the Alliance Berlin Canberra “Crossing Boundaries: Molecular Interactions in Malaria”, which is co-funded by a grant from the Deutsche Forschungsgemeinschaft (DFG) for the International Research Training Group (IRTG) 2290 and the Australian National University.
Conflict of Interest Statement
The authors declare that there is no conflict of interest.
Author Contribution
EH and PM designed the study. PM obtained and analyzed the data. EH and PM contributed to the final manuscript.
Acknowledgements
We thank Gaétan Burgio, Alyssa Ingmundson, Kai Matuschewski and Alice Balard for comments on earlier versions of this manuscript.
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].
- [69].
- [70].
- [71].
- [72].
- [73].
- [74].
- [75].
- [76].
- [77].
- [78].
- [79].
- [80].
- [81].
- [82].
- [83].
- [84].
- [85].
- [86].
- [87].
- [88].
- [89].
- [90].
- [91].
- [92].
- [93].
- [94].
- [95].