Abstract
Identifying co-regulated genes can provide a useful approach for defining pathway-specific machinery in an organism. To be efficient, this approach relies on thorough genome annotation, which is not available for most organisms with sequenced genomes. Studies in Tetrahymena thermophila, the most experimentally accessible ciliate, have generated a rich transcriptomic database covering many well-defined physiological states. Genes that are involved in the same pathway show significant co-regulation, and screens based on gene co-regulation have identified novel factors in specific pathways, for example in membrane trafficking. However, a limitation has been the relatively sparse annotation of the Tetrahymena genome, making it impractical to approach genome-wide analyses. We have therefore developed an efficient approach to analyze both co-regulation and gene annotation, called the Co-regulation Data Harvester (CDH). The CDH automates identification of co-regulated genes by accessing the Tetrahymena transcriptome database, determines their orthologs in other organisms via reciprocal BLAST searches, and collates the annotations of those orthologs' functions. Inferences drawn from the CDH reproduce and expand upon experimental findings in Tetrahymena. The CDH, which is freely available, represents a powerful new tool for analyzing cell biological pathways in Tetrahymena. Moreover, to the extent that genes and pathways are conserved between organisms, the inferences obtained via the CDH should be relevant, and can be explored, in many other systems.
1. Motivation and significance
Tetrahymena thermophila is a ciliate, one of the best-studied members of this large group of protists [1]. Its use as a model system led to the Nobel Prize-winning discoveries of telomerase and self-splicing RNA, as well as to other breakthroughs, including the isolation of dyneins and making the link between histone modification and transcriptional regulation [2, 3, 4, 5]. These contributions to our understanding of important cellular pathways made use of classical forward and reverse genetics, as well as biochemical approaches. More recently, genomic and transcriptomic data became available for T. thermophila, which have been used to infer functional gene networks [6, 7, 8, 9, 10, 11].
The T. thermophila genome has been sequenced and assembled [6], and is available online on the Tetrahymena Genome Database (TGD) [7]. While the TGD collates the sequence data along with available gene annotations and descriptions, the genome overall remains incompletely annotated. An extensive transcriptomic database, the Tetrahymena Functional Genomics Database or TetraFGD, is also available for T. thermophila [10]. These data were collected over a well-established range of culture conditions in which T. thermophila undergoes large physiological changes [8, 9, 10, 11]. In addition to displaying individual expression profiles, the Tetra FGD can indicate the statistical strength of co-expression between any two genes, as calculated using the Context Likelihood of Relatedness (CLR) algorithm [9, 12, 13]. Co-expression in T. thermophila, as judged based on mRNA levels, can reveal functionally significant co-regulation. Gene regulation in this species appears to predominantly occur at the level of transcription [14], and so steady-state mRNA levels may explain the majority of steady-state protein levels, as reported in other systems [15]. In this report, we will refer to genes that are listed as co-expressed in the Tetra FGD as co-regulated.
A high-throughput analysis of T. thermophila gene expression profiles revealed that accurate gene networks can be inferred from co-regulation data [13], providing evidence that co-regulated genes tend to be functionally associated. This approach has been used in bacterial, mammalian, and apicomplexan systems [12, 16, 17]. There is also experimental evidence in T. thermophila to support the conclusion that co-regulation corresponds to functional association: Co-regulation data were used to successfully predict novel sorting factors and proteases involved in the biosynthesis of a class of secretory vesicles, called mucocysts [18, 19]. These results suggest that the T. thermophila transcriptome may be used to bioinformatically infer factors involved in an array of cellular pathways.
The CDH was designed to facilitate genome-wide analyses of gene co-regulation. The CDH automatically mines co-regulation data for genes of interest, and annotates the co-regulated genes via forward and reciprocal BLAST searches that identify orthologs in other model organisms. The CDH provides a systematic tool for gathering and annotating genomic information from public databases, and it can allow a researcher to quickly develop a robust hypothesis about the cellular pathways or structures in which a gene of interest may be acting, based upon the genes with which it is co-regulated.
2. Software description
2.1. Software Architecture
The CDH was developed for Python 2.7, along with the following packages: sys, os, platform, logging, re, dill, difflib, csv, pdb, shutil, xml, win32com.shell, requests, BeautifulSoup4, and Biopython [21].
Executables for Windows (x64) and MacOS (10.6+) were made using the Pyinstaller library. The CDH gathers available data for a set of co-regulated genes from publicly available databases, and uses these data to predict possible gene functions (Figure 1). The gathered information includes the co-regulation data from the TetraFGD, and the gene names, sequences, and annotations from the TGD. The available annotations come from a combination of experimental results and inferences from homology [7]. The CDH predicts annotations for genes based on the annotation of their respective orthologs, which are themselves identified by a series of forward and reciprocal BLAST searches via the National Center for Biotechnology Information (NCBI). These predicted annotations are generated by using the Ratcliff-Obershelp algorithm [22], as implemented in the python difflib library, to identify common phrases in the orthologs’ annotations.
2.2. Software Functionalities
The basic functions of the CDH are to gather available co-regulation and annotation data, perform forward and reciprocal BLAST searches and predict gene annotations, and report this gathered information in a human-readable format. The CDH interface first asks the user to enter the ID for the gene whose co-regulated factors are of interest (Figure 2). Next, the user defines: how many of the co-regulated genes should be interrogated via BLAST; how to process data files that had been previously generated and are relevant to the current query; whether to use the BLASTp or BLASTx algorithm; and in which taxa to run the forward BLAST searches. The results of the CDH analysis are saved as a Comma Separated Values (.csv) file in the user's “Documents” folder: /Documents/CoregulationDataHarvester/csvFiles.
3. Illustrative Examples
3.1. A CDH analysis of a factor required for programmed genome rearrangement returns the vast majority of experimentally-verified genes involved in the pathway
Programmed genome rearrangement is a tightly regulated process that occurs during the formation of the new somatic nucleus in conjugating Tetrahymena [23, 24]. This process is well-studied and known to be driven by a special adaptation of RNA interference, utilizing Dicer- and Piwi-like proteins, among other factors [25, 26]. TWI1 encodes a Piwi-like protein that plays a central role in programmed genome rearrangement [27, 26]. When TWI1 is entered as the query for the CDH, the CDH retrieves a large number of DNA and RNA-processing factors, as well as chromodomain proteins (Supplementary File 1). Importantly, these include the key factors known to be involved in programmed genome rearrangement (Table 1). The CDH report for this TWI1 query is attached as Supplementary File 2. Within this report, we have highlighted the cases in which the CDH matched or expanded upon existing annotations.
It is also notable that specific homologs of Dicer that are not involved in programmed genome rearrangement, namely DCR1 and DCR2 [26], are not present in the list of genes co-regulated with TWI1. Similarly, while TPB2 is a known genome rearrangement factor and is present in the CDH output [26], its paralog TPB1 is neither involved in this process nor identified as co-regulated with TWI1. Thus, the CDH is a useful tool for focusing on pathway-specific paralogs within gene families.
3.2. A CDH analysis of a mucocyst biogenesis factor enriches for mucocyst cargo and maturation factors
Mucocysts in Tetrahymena are secretory organelles. Mucocysts undergo a maturation process that requires the catalyzed cleavage of cargo proteins, called GRLs [28]. The T. thermophila genome encodes approximately 480 predicted proteases [6], but only five of these are co-regulated with GRLs, as revealed by a manual inspection of expression profiles on the TetraFGD [19]. Two of these proteases, called CTH3 (cathepsin 3) and CTH4 (cathepsin 4), were subsequently shown to represent key enzymes for GRL cleavage [19, 29].
Using CTH3 as a query for the CDH results in a list that includes a large number of genes known to be involved in mucocyst biogenesis (Table 2), and is enriched in membrane-trafficking factors and proteins with as-yet unknown functions in this organism (Supplementary File 3). Among the latter are a subunit of the AP3 complex and a syntaxin in the STX7 subfamily. Subsequent functional analysis of these genes showed that they are both essential for mucocyst formation, providing the best evidence to date that mucocysts are lysosome-related organelles (Kaur et al., submitted). The CDH report for this CTH3 query is attached as Supplementary File 4. This report is also edited to indicate the cases when the CDH matched or expanded upon existing gene annotations.
4. Impact
The CDH reproduces existing annotations with high accuracy, and provides a large number of new annotations and expansions upon existing ones (Table 3; Supplementary Files 2 and 4). Effectively, the CDH increased the annotation coverage of the genes co-regulated with TWI1 from 46% to 60%, and the annotation coverage of the genes co-regulated with CTH3 from 41% to 57%. Specifying the BLAST parameters allows the user to discover the most informative functional predictions for their genes and pathways of interest. Limiting the CDH search to lineages outside of the ciliates is more likely to retrieve previously annotated orthologs, but runs the increased risk that weak homologs will generate spurious results. For some processes, such as programmed genome rearrangement in which TWI1 is involved, the most informative BLAST searches may be those restricted to the ciliates. In our trials, the effectiveness of the CDH is maintained regardless of which taxa the BLAST searches are run against.
In addition to providing a means of quickly gathering available data about a set of co-regulated genes and inferring their functions, the CDH data can be extended to to investigate the potential overlap between components of different cellular pathways. For example, NUP50 encodes a gene that functions both in nuclear import at the nuclear pore complex and as part of a complex involved in transcription [30, 31]. Accordingly, the genes co-regulated with NUP50 show extensive overlap with genes co-regulated with an import factor (Importinβ) and with a gene involved in transcription (RPB81, an RNA polymerase II subunit), among other factors involved in both processes (Figure 3, A).
It is informative to compare the overlap of co-regulated genes in different pathways. The co-regulated gene sets for three factors involved in genome rearrangement, TWI1, GIW1, and DCL1, show almost complete overlap, suggesting that they may be involved in a single common process (Figure 3, B). In contrast to the case of programmed genome rearrangement, the co-regulated gene sets for three factors required in mucocyst formation, CTH3, SOR4, and APM3, show partial overlap, hinting that one or more of these factors may also play roles unrelated to mucocysts (Figure 3, C). Consistent with this idea, CTH3 is an essential gene, while mucocysts themselves are dispensable for cell viability in the laboratory [19]. Importantly, there is very little overlap between the co-regulated gene sets defined for the three different cellular processes (nuclear import/transcription, genome remodeling, and mucocyst formation) (Figure 3, D). The overlap is smallest between the genes co-regulated with mucocyst biogenesis factors and genes co-regulated with either nuclear import, transcriptional regulation, or programmed genome rearrangement. The somewhat greater sharing of genes co-regulated with nuclear import, transcriptional regulation, and programmed genome rearrangement may reflect the fact that these pathways all take place in the nucleus and are intrinsically linked to the cell cycle. Given the ease of assembling sets of co-regulated genes using the CDH, this type of overlap analysis can be extended to many cellular pathways.
5. Conclusions
Protists constitute the majority of eukaryotic diversity, meaning that this group needs to be included in evolutionary analyses of cellular processes, but this diversity is largely overlooked in the standard collection of model eukaryotes [20]. We present the Co-regulation Data Harvester for T. thermophila (CDH) as a tool that expedites analyses of T. thermophila genome, transcriptome, and cellular biology in an evolutionary context. The CDH is freely available and provides a systematic framework for genome annotation. It quickly gathers information from disparate databases and, by optionally reusing BLAST results that had been stored during previous queries, can increase in speed with successive uses. In providing a new means to analyze transcriptomic data, the CDH makes clear the potential for using the rapidly growing amount of genomic and transcriptomic data in many organisms, to facilitate functional analysis in poorly annotated or emerging model systems.
Users of the CDH should keep in mind that its reports are necessarily limited by pre-existing data from the TGD, TetraFGD, and the NCBI. For example, the Tetra FGD does not provide co-expression data for genes whose expression level falls below a set threshold. Because of this limit, some T. thermophila genes may be overlooked by the CDH. Executable files for the program can be found at http://ciliate.org/index.php/show/CDH. A manual with detailed instructions and usage examples is provided in Supplementary File 5.
Acknowledgements
We would like to thank Stefano Allesina for providing the initial impetus for this project, excellent advice, and review of this manuscript. Thanks to Chad Pearson and Alexander Stemm-Wolf for their feedback on the CDH and this manuscript. We thank Naomi Stover and Wei Miao for granting the Ciliate community access to the TGD and TetraFGD, without which none of this work would be possible. We thank Tokuko Haraguchi for helpful discussion about NUP50. We thank Daniela Sparvoli and Harsimran Kaur for helpful discussion and good company. This work was supported by NIH GM-105783 to APT.