Meta-analysis of RNA-seq studies reveals genes responsible for life stage-dominant functions in Schistosoma mansoni

Background Since the genome of the parasitic flatworm Schistosoma mansoni was sequenced in 2009, various RNA-seq studies have been conducted to investigate differential gene expression between certain life stages. Based on these studies, the overview of gene expression in all life stages can improve our understanding of S. mansoni genome biology. Methods publicly available RNA-seq data covering all life stages and gonads were mapped to the latest S. mansoni genome. Read counts were normalised across all samples and differential expression analysis was preformed using the generalized linear model (GLM) approach. Results we revealed for the first time the dissimilarities among all life stages. Genes that are abundantly-expressed in all life stages, as well as those preferentially-expressed in certain stage(s), were determined. The latter reveals genes responsible for stage-dominant functions of the parasite, which can be a guidance for the investigation and annotation of gene functions. In addition, distinct differential expression patterns were observed between adjacent life stages, which not only correlate well with original individual studies, but also provide additional information on changes in gene expression during parasite transitions. Furthermore, thirteen novel housekeeping genes across all life stages were identified, which is valuable for quantitative studies (e.g., qPCR). Conclusions the metaanalysis provides valuable information on the expression and potential functions of S. mansoni genes across all life stages, and can facilitate basic as well as applied research for the community.


Introduction
Schistosomes are parasite flatworms that infect more than 240 million people worldwide, and cause up to 200,000 deaths every year (GBD 2015 Mortality and Causes of Death Collaborators 2016). There is currently no vaccine, and only one putatively-used drug. To accelerate the elimination of this parasite, it is important to understand its biology, in particular its genome biology. First sequencing of schistosome genomes were accomplished in 2009, for two of the most common Schistosoma species, S. mansoni (Berriman et al. 2009) and S. japonicum (The Schistosoma japonicum Genome Sequencing and Functional Analysis Consortium 2009). Afterwards, an improved version of S. mansoni genome was achieved (Protasio et al. 2012). The genome information provides a basis for many high throughput approaches, among which RNA sequencing (RNA-seq) has been developed and exploited for determining gene expression in recent years.
Due to the complexity of schistosome life cycle, as well as the high sequencing cost in the past, previous RNA-seq studies were mainly focused on certain life stages, such as the larvae Protasio et al. 2012), the adult stage (Lu et al. 2016;Anderson et al. 2015), or based on certain experimental conditions, e.g., UV-radiation treatment (Collins III et al. 2013). While we can get valuable information from the partial comparisons, an overall estimation of gene expression during schistosome development is missing. On the other hand, expression profiling in all life stages is also able to improve functional annotation of the genome, especially for stage-specifically expressed genes and hypothetical genes (Elias et al. 2009).
To obtain the information on gene expression changes in different life stages, we performed a comprehensive meta-analysis on published RNA-seq studies in Schistosoma mansoni. We identified genes that might account for the dominant function of the parasite in specific life stage(s), which will be important for understanding the biology of schistosomes and the process of parasitism. Further explored differential gene expression during life stage transitions, as well as those ubiquitously-and/or abundantly-expressed genes can benefit basic and applied research for the community.

Methods
Implemented life stages and RNA-seq sequence data All sequence data were obtained from ENA (http://www.ebi.ac.uk/ena), originating from four published studies. The accession numbers and sample types were summarised in Table 1.

Principle Component Analysis (PCA) and Hierarchical Clustering
Sample dissimilarities were revealed by PCA using the R (https://www.r-project.org) (v3.3.2) package DESeq2 (Love et al. 2014) (v1.14.1), with the parameter ntop set to the total number of genes. Plotting of PCA data were performed using default settings. Sample distance matrix was calculated using the sample package. Hierarchical clustering of all genes was performed using the hclust function and the "ward.D2" method. Heat maps were generated using the gplots (https://cran.r-project.org/web/packages/gplots/index.html) (v2.2.1) package.

Normalisation and differential expression analysis
Read counts were imported into edgeR (Robinson et al. 2010) (v3.16.5) and normalised across all samples by the Trimmed Mean of M-values (TMM) method (Robinson et al. 2010) using the function calcNormFactors(). Differential gene expression were analysed with the Generalized Linear Models (GLM) approach using the functions glmFit() and glmLRT().

Determination of most abundantly-expressed, and stage(s)-preferentially expressed genes
To obtain most abundantly-expressed genes in all life stages, mean expression values (before normalisation) of genes in all samples excluding gonads were ranked and the top twenty were listed. Life stage(s)-preferentially expressed genes were defined as significant higher expression in one sample/group than in the rest of samples. This was calculated by using the GLM approach and setting False Discovery Rate (FDR) < 0.01. Further manual filtration was performed to select genes with higher expression in certain stage(s) than in any other stage.

Identification of housekeeping genes
Housekeeping genes were identified with the GLM approach which compared all life stages (excluding gonads) to bFe, as an arbitrarily selected reference. Candidate housekeepers were selected from those with no significant difference (FDR > 0.05) and in addition folddifference < 1.5. Further testing of their suitability was performed by calculating the stability value among all samples using the R package Normfinder (Andersen et al. 2004(Andersen et al. ) (v05/01-2015.

Correlations with original studies
To check the correlations between meta-analysis and original studies, differentially expressed genes (FDR cut-off 0.01) were selected from a specific comparison and corresponding log2based fold-change (log 2 FC) values from both analyses were used to calculated Pearson correlation coefficient.

Dissimilarities between all life stages and stage-associated expression profiling
Across all life-cycle stages and all experiments, the proportion of genes showing convincing expression (RPKM > 1) varied from 58% to 83% (Table 2), with eggs displaying the smallest repertoire of expressed genes. By Principle Component Analysis (PCA) (Fig. 1A) several clusters were identified: bOv, bTe, Egg, Mir-Spo, Cer-Som, sFe-sMa-bMa, and bFe. This set of dissimilarities were confirmed by sample distance matrix (Fig. 1B) and hierarchical clustering analysis (Fig. 1C). Based on the revealed sample clusters, we obtained 407, 2,141, 1,955, 1,541, and 870 genes with preferential expression in Egg, Mir-Spo, Cer-Som, sM-sF-bM, and bFe, respectively (Supplementary Table 1). These preferentially-expressed genes make up 68% of S. mansoni protein-coding genes, and their product information indicates that they are responsible for the primary role of the parasite at specific life stage(s) (Fig. 2, Supplementary Table 1). For instance, splicing factors, RNA helicases, eIFs are commonly known as important for proliferation (Jankowsky 2011;Jackson et al. 2010), which happens tremendously in sporocyst; GPCRs, channel proteins and calcium-associated proteins are associated with sensory and motion (Lodish et al. 2000;Gover et al. 2009; Julius and Nathans 2012), important processes for cercaria and schistosomulum development; tetraspanins, tegument allergen, VALs and MEGs are well known to be involved in host-parasite interaction (Philippsen et al. 2015); and eggshell proteins are required for reproduction in S. mansoni (note that the number of eggshell proteins should be higher than indicated as many of them are annotated as hypothetical proteins). Top pathways associated with these genes were shown at the right on Fig. 2, which also supports the transition of dominant functions in the parasite. Figure 2. Preferentially-expressed genes and associated functions. Heat map of genes with preferential expression in certain life stages, the primary function of that stage, main products for these genes, and associated pathways were shown. Number of preferentially-expressed genes in each group was shown in the parentheses. Heat map was generated using the Z-score method (scaling on row). In the product information, number of genes coding for the same product was indicated in parentheses. For full list of products, please refer to Supplementary Table 1. In the pathway map, notations for common pathways, such as "Metabolic pathways" and "Ribosome", or ambiguous pathways, such as "Huntington's disease" and "Vibrio cholerae infection", were excluded from the list.
Top 10 genes from each of above enriched groups were summarised in Table 3 with information from previous investigations. The expression patterns of these genes are provided in Fig. 3A. Note that many of them are annotated as "hypothetical protein" (also see Fig. 2) and the patterns of their expression abundance will facilitate their predicted functions in the parasite, or improve the function annotation in the database (e.g., Smp_032670, Smp_193380 and Smp_179420 were annotated as "egg protein" but they have highest and preferential expression in mircidium / sporocyst).

Distinct differential expression patterns occur during parasite development
To obtain detailed gene expression during parasite development, and to validate the results of meta-analysis, pairwise differential gene expression analysis was performed for adjacent life stages. By setting the threshold at FDR < 0.01 and fold difference > 2, the number of differentially expressed genes (DEGs) were summarised in Fig. 4 (See detailed lists in Supplementary Table 2). We can observe a massive change in gene expression levels from sporocyst to cercariae, and from schistosomulum (skin-stage) to adult (numbers of DEGs > 4,300), whereas less DEGs were obtained in other comparisons. The results obtained by the meta-analysis were in good correlations with previous individual studies in either the larval stages (e.g., Spo vs Mir , Som vs Cer (Protasio et al. 2012) ), or in the adult stage (e.g., bTe vs bOv, and adult comparisons between male and female or before and after pairing (Lu et al. 2016)) (Fig. 5).  Besides the known comparisons, we can also obtain new information about the transitions from gonads to the egg/embryo, from sporocyst to cercaria, as well as from schistosomula to adults (Supplementary Table 2). The transitions not only reflect in the morphological change, but also in changing the biological processes, as reflected in significantly enriched GO terms in each stage. The mostly significantly enriched GO terms in sporocyst include mRNA splicing, RNA secondary structure unwinding, mitotic cell cycle process, and regulation of translation, whereas in cercaria they include cilium assembly, transport, chemical synaptic transmission, cell surface receptor signalling pathway (Fig. 6A). This observation is consistent with our previous conclusions (Fig. 2). As for the transformation from skin-stage schistosomulum to adult worm, the GO terms also reveal different processes. In schistosomulum, abundantly-expressed genes are involved in positive regulation of transcription, cell differentiation, anatomical structure development, response to lipid, etc, which indicate early response to host factors. After developed into adult worm, the processes switch to cilium assembly, oxidation-reduction process, transport along microtubule, ATP metabolic process, etc (Fig. 6B).

Novel housekeeping genes were identified across all life stages
Thirteen potential housekeeping genes were identified by the GLM approach (excluding gonads). Furthermore, their suitability was tested by calculating the stability value among all samples using Normfinder (Andersen et al. 2004). The gene IDs, product information, and stability values were summarised in Table 4. Expression patterns can be seen in Fig. 3B. Most of them demonstrate good stabilities, and are suitable candidates as ubiquitously-and/or strongly-expressed genes.

Most abundantly-expressed genes in all life stages
Abundantly-expressed genes can be used for functional genomics studies, e.g., promoters for transgenes, or for screening targets for vaccine development. With ranked average expression in all stages (excluding gonads) and to avoid zero-inflation, we select those with RPKM > 100 in all samples as ubiquitously abundant genes. Table 5 shows a summary for top 20 of those genes, and their expression profiles can be obtained from Fig. 3C. Some of them are previously known as ubiquitously-expressed genes, e.g., HPS70 (Smp_106930) (Neumann et al. 1993), and as vaccine candidates, e.g., aldolase (Smp_042160), GAPDH (Smp_056970), GTS28 (Smp_054160) (Wilson et al. 2016).

Discussion
Meta-analysis of gene expression has been exploited in other organisms such as humans, either with RNA-seq data across species and tissues (Sudmant et al. 2015), or with Microarray data from different studies (O'Mara et al. 2016). While technical differences and additional biological variabilities might exist among studies, a robust normalisation and differential analysis method is critical for reliable results. Trimmed Mean of M-values (TMM) is a batch normalisation method on a group of samples and has been proven to perform well in bulk RNA-seq (Dillies et al. 2013). As for differential expression analysis, the negative binomial generalized linear model (GLM) was tested suitable for low inter-study variation and small numbers of studies (Rau et al. 2014). In our case, by applying the TMM normalisation and GLM approach for differential expression analysis, we obtained reliable results having very good correlations with original studies. In the case of the egg sample, the variability was probably due to technical difference (de novo assembly vs reference mapping) and natural differences between S. mansoni strains (Brazilian vs Liberian strain).
Genes identified to be preferentially expressed in certain life stages can be valuable resource for the research community. On one hand, it can be a guidance for investigating gene functions in proper life stage(s), as many genes show extremely higher expression in one stage than in the others, e.g. calmodulins in cercariae. On the other hand, stage-preferential gene expression can benefit functional gene annotations. For instance, there are 23 genes annotated as "egg protein" but detected in our analysis as preferentially-expressed in miracidium/sprocyst. Another example is that many hypothetical proteins show preferential expression in bFe (paired female; Fig. 2), and by comparing the expression in whole female with that in the ovary, we can probably estimate the gene function in the vitellarium. Finally, many of the identified genes were consistent with previous studies (Table 3), indicating the robustness of the meta-analysis.
Our analysis also confirms assumptions from previous studies. For example, elav2 (Smp_194950) and cdc25 (Smp_152200) were previously found to be exclusively transcribed in testes in the adult stage (Lu et al. 2016), and our analysis extended the data and supported the conclusion (Fig. 7A). Furthermore, cpeb1 (Smp_070360) was proposed to fulfil roles in oocyte maturation (Lu et al. 2016;Wang et al. 2017), and the meta-analysis suggests its additional function in the embryo (Fig. 7A), as a dual-function protein also found in Xenopus (Novoa et al. 2010). In addition, our data supports the differential expression of aromatic-Lamino-acid decarboxylase (AADC) and allatostatin-A receptor-like gene (AlstR) between male and female worms, which were characterised in a recent study in Schistosoma japonicum as mediator of reproduction (Wang et al. 2017). Their S. mansoni orthologues exhibiting similar expression patterns (Fig. 7B) indicates a similar regulation theme. When we calculated differential gene expression between adjacent stages, we obtained quite distinct numbers of DEGs (Fig. 4). The most significant gene expression changes were observed during the transformation from sporocyst to cercaria, as well as from schistosomulum (3h skin stage) to adult worm. The former can be associated with the changes from a less active state where the parasites are mainly manipulating themselves, to a state where they need to swim fast, and to sense the definitive host efficiently, as confirmed in GO enrichment analysis (Fig. 6A). We saw that genes more abundantly-expressed in sporocyst are mainly involved in cell replication (e.g., mRNA splicing, DNA replication initiation, translational elongation, etc.), and those in cercaria are mainly involved in sensory and mobility (e.g., cilium assembly / movement, transport, chemical synaptic transmission, etc.) The latter can reveal many physical and biochemical changes in the parasite from skinto lung-stage, accompanied by increased immunological defences (GOBERT et al. 2007). Overall, the tremendous differential expression is associated with the morphological and functional changes in theses stages, and can support our understanding of schistosome developmental biology.
With respect to venom allergen-like (VALs) genes, our data supports previous findings. As show in Fig. 8A, SmVALs are distributed in different life stages. While consistent with previous finding on the presence of SmVALs2, 3, 5, 9 in egg, miracidia and sporocyst (Chalmers et al. 2008), we found the highest abundance in the egg instead of miracidia stage as discovered by the authors. This is probably due to the unsuitability of the housekeeping gene used in the previous study. The same expression patterns were found for SmVAL23 and SmVAL29, as discovered before (Wu et al. 2009). In addition, we supported the discovery of SmVAL22 preferential expression in sporocyst ). SmVAL1 and SmVAL24 were found to be nearly exclusively in cercaria, which differs from the finding about higher SmVAL24 expression in the germ balls (Fernandes et al. 2017). More interestingly, we discovered two VALs that seem exclusive for the testis, whose functions need further investigations.
Micro-exon genes (MEGs) were previously reported to be clustered in the esophagus of adult schistosomes Wang and Collins 2016) and proposed with functions for blood processing. Our data supports that by showing that most of known SmMEGs were found to be preferential in the sM-sF-bM group (Fig. 8B). We also identified that some members of SmMEG2 and SmMEG3 are preferential in the egg, which agrees with previous comparison between expression of them in egg to cercaria (DeMarco et al. 2010).
Our results also give new information about expression of neuropeptides in schistosomes. Based on the discovery of novel neuropeptide precursors (NPPs) in flatworms (Koziol et al. 2016), we found that most of SmNPPs were preferentially-expressed in sM-sF-bM (Fig. 8C), a pattern identified recently in the adult stage (Lu et al. 2016). This indicates active neuronal processes at the host-parasite interface. The identified novel housekeeping genes and abundantly-expressed genes can be useful for quantitative and/or functional studies. While candidates for housekeeping genes had been proposed before (Lu et al. 2016), the presented comprehensive analysis can provide more reliable result as it covers all life stages of schistosomes. From the expression data they seem to have a reasonable transcript abundance and can be used for testing purposes (Fig. 3B). Abundantly-expressed genes normally exhibited strong promoter activities thus suitable for functional approaches including overexpression or know-down of specific genes, and for vaccine development. We extracted 20 strongly expressed genes in all life stages (Table 5; Fig. 3C). While some of them have been discussed before, the list provides new candidates for such purposes.