Abstract
Advances in sequencing technology have forced a quantitative revolution in Evolutionary Biology. One important feature of this renaissance is that comprehensive genomic resources can be obtained quickly for almost any taxon, thus speeding the development of new model organisms. Here we analyze 20 RNA-seq libraries from morphologically, sexually, and genetically distinct polyp types from the gonochoristic colonial hydrozoan, Hydractinia symbiolongicarpus (Cnidaria). Analyses of these data using Weighted Gene Co-expression Networks highlights deeply conserved genetic elements of animal spermatogenesis and demonstrate the utility of these methods in identifying modules of genes that correlate with different zooid types across various statistical contrasts. RNAseq data and analytical scripts described here are deposited in publicly available databases.
Introduction
Next Generation Sequencing (NGS) is revolutionizing organismal biology and holds particular promise for understanding the genetic mechanisms that give rise to complex morphological traits (Meng, et al. 2013; Reed et al 2011; Siebert, et al. 2011). One common application of NGS is RNAseq. In RNAseq, sequencing libraries are prepared from mRNAs extracted from tissues of interest and sequenced on a massively parallel platform such as Illumina (Mortazavi, et al. 2008). By mapping individual sequencing reads back to genomic or transcriptome assemblies, quantitative estimates of the expression levels of individual genes can be assessed statistically (Wang, et al. 2009). Comparisons of differential gene expression between tissues of interest are now ubiquitous in biological research, but additional information exists in these large and complex datasets. For instance, genes that underlie a specific biological function are often expressed in a correlated fashion (Miller, et al. 2008; Stuart, et al. 2003). Thus, analytical procedures that focus on identifying such correlations such as Weighted Gene Co-expression Network Analysis (WGCNA; (Langfelder and Horvath 2008), are a powerful means for identifying the key genetic modules underling the trait of interest. Here we test the utility of the WGCNA (Langfelder and Horvath 2008) in characterizing the genetic modules that underlie specific morphological phenotypes (e.g. zooid types) using an extensive RNAseq dataset derived from the modular colonial hydrozoan, Hydractinia symbiolongicarpus.
H. symbiolongicarpus is a powerful model system for integrative studies in organismal biology for several reasons. First, it is amenable to laboratory husbandry and the full reproductive cycle can be accomplished in the laboratory within roughly three months - a shorter period from generation to generation than other popular model taxa like mice (Silver 1995) or zebrafish (Detrich, et al. 2011). Second, colonies are easily crossed, facilitating the sorts of genetics studies that are difficult in other hydrozoan models such as Hydra species. Further, H. symbiolongicarpus, like other colonial invertebrates, shows a remarkable capacity for regeneration, allowing individual colonies to be sub-cloned by surgical fragmentation and stocked in perpetuity. In addition, the recent development of transgenic approaches in the congener, Hydractinia echinata (Kunzel, et al. 2010), opens the possibility of functional genomics. Finally, transcriptome data for H. echinata are available (Soza-Ried, et al. 2010) and a draft genome sequence of H. symbiolongicarpus is in preparation (Schnitzler, et al. 2014).
Colonies of H. symbiolongicarpus consist of up to four distinct types of polyps, including feeding gastrozooids, reproductive gonozooids, defensive dactylozooids, and sensory spiral zooids, all of which develop from basal ectodermal mat or network of stolons (Frank, et al. 2001; McFadden, et al. 1984). Here we characterize an extensive RNAseq dataset comprising 20 sequencing libraries representing multiple replicates of male and female gastrozooids and gonozooids from different genotypes. We describe a workflow for the analysis of these data using the WGCNA statistical package (Langfelder and Horvath 2008) and show that sufficient signal exists in our RNAseq dataset to construct networks of correlated gene expression. Many of the genetic modules recovered by these methods correlate with specific zooid morphologies. Our analyses provide the foundation for extending our understanding of the modules of functionally related genes that underlie specific morphological features, such as the different types of zooids in H. symbiolongicarpus and other modular organisms that exhibit morphological polymorphism at the level of zooids or polyps.
Raw data furnished by this study can be accessed at Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) (Bioproject PRJNA244078). Data matrices and scripts for statistical analyses are available at https://bitbucket.org/plachetzki/plachetzki-et-al.-sicb-2014, and sequence files associated with this analysis are located at dryad (doi:10.5061/dryad.98pt3).
Methods
Animal Stocks and Husbandry
We collected colonies of H. symbiolongicarpus inhabiting shells of the intertidal hermit crab, Pagurus longicarpus, from Barnstable Harbor, MA in July of 2010. Animal husbandry was conducted as previously described (Grosberg, et al. 1996). Briefly, we maintained the hermit crabs and their associated H. symbiolongiocarpus in a 200-gallon recirculating seawater system (held at 18°C) at the University of California, Davis. We fed the carbs and their associated colonies daily to repletion with 2-3 day old cultures of brine shrimp (Artemia salina). We mated single pairs of male and female colonies by isolating crabs carrying single, sexually mature male or female colonies in the dark for 24 hours, then placing them together in two-liter containers of filtered (0.44 µm) seawater (FSW) at room temperature (RT) under bright light for 4 hours. We collected ≈ 500 fertilized embryos from each mating, and transferred them into clean seawater in sterile petri dishes. We transferred the developing embryos daily into clean FSW. After 55 hours we induced metamorphosis by treating planulae with a solution of 55 mM CsCl in FSW (Muller 1973). We transferred metamorphosing planulae to borosilicate glass plates submerged in FSW, where they continued to develop into primary feeding polyps. We fed the resulting colonies daily to repletion with 2-3 day old Artemia until the colonies reached sexually maturity after five to eight weeks. We selected two male and two female full-sibling offspring for extractions of RNA.
Extraction of RNA, Construction of RNAseq Libraries, and Sequencing
We used fine scissors to excise gonozooids and gastrozooids from living colonies, and placed the polyps directly into Trizol (Invitrogen) on ice. We immediately extracted RNA after removal, following manufacturers protocols (Invitrogen). We extracted three replicate total-RNA samples from each type of polyp (male gastrozooid, male gonozooid, female gastrozooid, female gonozooid) from four full-sibling individuals (two male and two female colonies, each), for a total of 24 extractions. We eluted total RNA and stored the RNA at -80° C in RNAse-free H2O. We assessed RNA quality using a Bioanalyzer and the RNA 1000 kit (Agilent). We prepared RNAseq libraries using the TrueSeq kit (version 2, Illumina) with 200 – 500 ng of total RNA per library, following the manufacturer’s protocols. We used a Bioanalyzer and DNA 100 chips (Agilent) to determine physical quality of sequencing libraries. Sequencing and de-multiplexing were carried out at the University of California, Davis Genome Center on two lanes of HiSeq 2000 (paired end 100) with 12 RNAseq libraries per lane. Four of our RNAseq datasets had low read content, and we eliminated them from subsequent analysis (see Table 1).
Bioinformatics: Filtering and Assembly of Raw Data
We first cleaned low-quality reads from the raw sequence data for each RNAseq dataset using fastq_quality_filter version 0.013 and the first 10 base pairs of each read were removed using fastx_trimmer version 0.013 (http://hannonlab.cshl.edu/fastx_toolkit/index.html). Sequencing adaptors were removed using cutadapt (Martin 2011). Comparisons between pre-cleaning and post-cleaning datasets were made using fastqc version 0.10.0 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Following cleaning, we used Trinity version 2013.11.10 (Grabherr, et al. 2011) under default parameters to assemble non-redundant datasets with the highest number of reads. We then mapped reads from individual, cleaned RNAseq datasets to contigs using bowtie version 1.0.1 (Langmead, et al. 2009) within RSEM version 1.1.15 (Li and Dewey 2011). Mapping efficiency was assessed using the flagstat command in samtools version 0.1.19 (Li, et al. 2009). The raw transcriptome assembly and the resulting matrix of un-normalized expected count data are available at the dryad link associated with this work.
Bioinformatics: Statistical Analytics
We analyzed the counts matrix from all of 20 datasets using Weighted Gene Co-expression Network Analysis (WGCNA) version 1.36 (Langfelder and Horvath 2008). First, we discarded all contigs that included fewer than 10 counts in any library. The resulting matrix included 3,539 contigs. Next, we normalized count data using the model based calcNormFactors function in EdgeR version 3.6.2 (Robinson, et al. 2010). Hierarchical clustering was then used to assess similarity between the resulting 20 RNAseq datasets using flashClust in R version 3.0.0 (R_Development_Core_Team 2008). As none of our datasets exceeded a threshold of 0.9% divergence in Euclidian distance, none were considered outliers and all were included in the following analyses. Soft thresholding analyses of powers and WGCNA analyses were conducted as per the authors’ instructions using the pickSoftThreshold function in WGCNA. Once modules of co-expressed genes were estimated, statistical analyses included: i) tests of association between individual genes and proposed module membership, ii) tests of association between individual genes and sex or zooid type, and iii) analyses of variance (ANOVA) to test for the effect of zooid type and sex on expression of each of the elucidated modules. P-values were corrected for multiple tests using a false discovery rate of 0.05. Finally, the sequences of individual modules of co-expressed genes that resulted from this analysis were BLASTed against the refseq_protein database (Altschul, et al. 2009) with an expect value of 1.0 × 10−3 and annotated using BLAST2GO (Conesa, et al. 2005).
Results
RNAseq Data
Twenty Hydractinia symbiolongicarpus RNAseq datasets (paired end 100bp, HiSeq 2000) were prepared during this study. These datasets comprise multiple replicates of male and female gastrozooids and gonozooids, with two genotypes per sex (genotypes 5, 48, 17 and 73). Table 1 describes these data and their publicly available accessions, which together comprise > 2.5 × 108 reads.
Assembly in Trinity produced 120,196 contigs (transcripts). Twenty-five per cent of the total sequence length from the assembly is contained in 8,519 contigs of 1,556 bp or greater, 50% of total sequence length from the assembly is contained in 24,514 contigs of 890 bp or greater, and 75% of total sequence length from the assembly is contained in 54,696 contigs of 432 bp or greater. On average 88% of read data across 20 cleaned RNAseq datasets mapped to the assembly.
Figure 1 shows a dendrogram of these 20 RNAseq libraries, following the removal of contigs that had fewer than 10 counts per contig for each library. Each type of polyp, including female gonozooids, male gonozooids, and gastrozooids were recovered each as unique clusters, the gastrozooid cluster being comprised of both sexes. The genotype of a given RNAseq library was not a strong determinant in its clustering, as would be expected if genotype played a limited role in driving differences in gene expression between types of polyps.
WGCNA
The first step in weighted co-expression gene network analysis (WGCNA) is the selection of β, the soft thresholding power used to calculate the adjacency matrix, or co-expression network, of genes. Using the automated approach to network construction (Langfelder and Horvath 2008; Zhang and Horvath 2005) we selected a power of 9 for β, which is the inflection point in proposed power values where the scale-free topology model fit to our data begins to stop improving with increases in power (Fig. 2). Using this value for β, our data were partitioned into 10 gene co-expression modules that included between 38 and 1258 transcripts (Fig 3A). Fifteen transcripts, out of the original 3,549 did not fall into any significant module of gene co-expression.
Genes included in specific modules may share similar biological functions. Thus, understanding the composition of specific modules may inform our understanding of the differences in gene expression that underlie the different types of zooids in H. symbiolongicarpus. In order to assign modules of co-expressed transcripts to specific traits (i.e. gastrozooids and gonozooids of different sexes), we first verified a gene’s module membership by testing the correlation between the individual gene’s expression and the overall expression eigenscore of its assigned module. We then tested the association between module eigenscores and trait (male gonozoid (MO), female gonozooid (FO) or gastrozooid (A)). Figure 3B shows the results of these correlations when the zooid type A/O (gastrozooid A and gonozooid O) is used as a correlate of membership in a gene module. This categorization scheme disregards sex differences between zooid samples. These analyses suggest that the turquoise, pink, blue, and yellow modules are strongly associated with A/O zooid types and therefore may represent genetic modules underlying some of the morphological differences in these zooid types (Fig. 3B). However, other modules that do not yield high correlations between module membership and A/O trait-distinctions may be explained by other possible trait categorization schemes. For instance, genes included in the red, brown, and green modules show a low correlation to zooid type under the A/O scheme, but a high correlation to zooid type once zooids are categorized as either gastrozooid, male gonozooid, or female gonozooid (A/MO/FO) (Fig. 3C).
In order to test whether zooid type and/or sex has significant explanatory power to predict the expression of our estimated genetic modules, we conducted an analysis of variance on each module (Fig. 4). These analyses suggest that the turquoise module is significantly associated with gastrozooid type (zooid: F = 100, p < 0.001); the magenta and brown modules with sex differences (magenta: F = 6.5, p = 0.0424; brown: F = 155, p < 0.001); the green module is strongly associated with the female gonozooid type (sex: F = 35.4, p = 0.0001; zooid: F = 30.1, p = 0.0002); and the blue module associates strongly with the male gonozooid type (sex: F = 12.5, p = 0.0069; zooid: F = 16.1, p = 0.0034). Support for this association between the between blue module and male gonozooid polyp type comes from GO analyses of the blue module, which reveal a number of genes associated with spermatogenesis and meiosis (Table 2). The sequences of the transcripts from each module are given at the dryad link associated with this publication; however, a detailed global analysis of all the genes contained in each module will be explicated in a forthcoming publication.
Discussion
We describe the data and their deposition in publicly available databases from 20 RNAseq libraries derived from multiple replicates of male and female gastrozooids and gonozooids from the colonial hydrozoan H. symbiolongicarpus. Our analyses suggest that the WGCNA approach to uncovering genes associated with specific morphological features may be a powerful tool for organismal biology. We recovered gene modules that correlate strongly with specific types of zooids. In particular, the blue module correlates with the male gonozooid trait and this module contains numerous genes associated with spermatogenesis and meiosis (Table 2).
The modules of correlated gene expression estimated from WGCNA are a priori naïve to trait-states and associations between the specific modules identified and trait-states are driven by nuances in the expression data. This renders the WGCNA approach a valuable tool for identifying the key genetic constituents underlying traits where little is known, such as the polymorphic polyp morphologies of H. symbiolongicarpus. In addition, modules that show no correlation with any particular state of a trait do not represent failures in the procedure. Such modules may represent molecular functions that are shared across states but have correlated expression nonetheless, such as metabolic pathways.
All comparative transcriptomics analyses are likely to include false positive and negative results. Therefore, analyses such as these represent only a first step in identifying the genes that function to specify different polyp morphologies. Future analyses will investigate the expression patterns and molecular functions of selected candidate genes recovered in the modules of correlated gene expression described here. In addition, our present analyses of gene ontology are based on BLAST (Altschul, et al. 1997) searches of similarity against accessioned databases. While useful, such analyses conflate orthology and parology. Future analyses will apply a more robust phylogenetic approach aimed at disentangling orthology from parology in genes of interest. Only then may we be able to address the evolutionary significance of our findings as they relate to the origins of polymorphic polyp types and gametogenesis.
Funding
This work was supported by a Howard Hughes Medical Institute postdoctoral fellowship of the Life Sciences Research Foundation to DCP and a National Science Foundation grant OCE 09-29057 to RKG.
Acknowledgements
We are indebted to Brenda B. Cameron for assistance and training in animal husbandry. The manuscript benefited greatly from helpful comments from two anonymous reviewers and the editorial staff at ICB.