Who is this gene and what does it do? A toolkit for munging transcriptomics data in Python

Transcriptional regulation is extremely complicated. Unfortunately, so is working with transcriptional data. Genes can be referred to using a multitude of different identifiers and are assigned to an ever-increasing number of categories. Gene expression data may be available in a variety of units (e.g., counts, RPKMs, TPMs). Batch effects can dominate the signal, but the metadata needed to account for them may not be available. Most of the available tools are written in R. Here, we introduce a library, genemunge, that makes it easier to work with transcriptional data in Python. This includes translating between various types of gene names, accessing Gene Ontology (GO) information, obtaining expression levels of genes in healthy tissue, correcting for batch effects, and using prior knowledge to select sets of genes for further analysis. Code for genemunge is freely available on GitHub.


I. OVERVIEW
munge: verb 1. to manipulate (raw data), especially to convert (data) from one format to another.
www.dictionary.com/browse/munge
Like any area that uses big data, transcriptomics requires extensive munging: rote but critical tasks such as cleaning data, selecting relevant data, structuring metadata, and making labels interpretable. These tasks often need to be repeated within a given project as the data and aims evolve, and they tend to be similar between different analyses. To face these challenges, a library of data munging tools can be extraordinarily useful. Such a library can provide reliable and tested tools that cleanly separate munging tasks from analysis, making it easier to start new projects and making data processing pipelines less fragile. This note introduces genemunge, a library of tools for working with human transcriptomics data. genemunge is written in Python and is available as a package through PyPI.
This initial version, v0.0, contains tools for tasks such as:
• Translating between conventions for gene symbols [1].
• Using prior knowledge of biological and molecular processes to select gene sets [5,6].
• Converting expression data to TPM from counts or RPKM [11].
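The unit conversions named in the last item can be illustrated with a short, self-contained sketch. This is not the genemunge API: the function names `counts_to_tpm` and `rpkm_to_tpm` and the toy numbers are assumptions made for illustration, and the code only shows the standard arithmetic (length-normalize, then rescale each sample to sum to one million).

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts to TPM, given gene lengths in base pairs."""
    # Step 1: reads per kilobase (RPK) removes the gene-length bias.
    rpk = [count / (length / 1000.0) for count, length in zip(counts, lengths_bp)]
    # Step 2: rescale so that the values in the sample sum to one million.
    total = sum(rpk)
    return [1e6 * value / total for value in rpk]


def rpkm_to_tpm(rpkm):
    """Convert RPKM to TPM by renormalizing the sample to sum to one million."""
    total = sum(rpkm)
    return [1e6 * value / total for value in rpkm]


# Toy sample with three genes (all numbers are made up for illustration).
counts = [100, 300, 600]
lengths_bp = [1000, 3000, 2000]

tpm_from_counts = counts_to_tpm(counts, lengths_bp)

# RPKM for the same sample: RPK divided by (total reads / one million).
total_reads = sum(counts)
rpkm = [c / (l / 1000.0) / (total_reads / 1e6) for c, l in zip(counts, lengths_bp)]
tpm_from_rpkm = rpkm_to_tpm(rpkm)

print(tpm_from_counts)
print(tpm_from_rpkm)
```

Both routes give the same TPM values up to floating-point error, because RPKM and TPM differ only in the per-sample scaling factor.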
The goal of genemunge, and its current use case for the authors, is to serve as a resource for gene information that can return useful data structures and be integrated into processing pipelines. The next section gives a few example use cases of the library.

II. EXAMPLE USE CASES OF GENEMUNGE
We consider a simple analysis where genemunge is useful. Suppose we want to find genes associated with the immune system, select those with larger expression in the small intestine than the stomach, and then retrieve basic information about those genes.
The following code snippets use the API from genemunge v0.0. We begin by importing libraries required for the example.
The Gene Ontology (GO) contains basic descriptors for each ontology entry [2][3][4]. Since we want genes related to the immune system, we will do a keyword search for immune and retrieve the GO identifiers that match this keyword. We can then obtain the associated genes.
We will then use prior knowledge to remove housekeeping genes. A list of housekeeping genes curated by [6] is stored in genemunge.
We can use genemunge to access summary statistics from the GTEx project (through recount) about expression levels in healthy tissue [7][8][9][10]. The median expression value can be used to find genes that are more highly expressed in the small intestine than in the stomach.
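As a rough illustration of the logic of these steps, the sketch below runs the same sequence on toy data: keyword search over ontology term names, collection of the annotated genes, removal of housekeeping genes, and a comparison of median expression between two tissues. It deliberately does not call genemunge; every identifier, gene name, helper function, and expression value here is hypothetical.

```python
# Toy ontology: term identifier -> term name and annotated genes (all made up).
toy_go = {
    "GO:TOY0001": {"name": "immune response", "genes": {"GENE_A", "GENE_B", "GENE_C"}},
    "GO:TOY0002": {"name": "innate immune response", "genes": {"GENE_C", "GENE_D"}},
    "GO:TOY0003": {"name": "lipid metabolic process", "genes": {"GENE_E"}},
}

# Toy housekeeping list and toy median expression per tissue (all made up).
toy_housekeeping = {"GENE_B"}
toy_median_tpm = {
    "GENE_A": {"Small Intestine": 10.0, "Stomach": 2.0},
    "GENE_B": {"Small Intestine": 50.0, "Stomach": 48.0},
    "GENE_C": {"Small Intestine": 1.0, "Stomach": 5.0},
    "GENE_D": {"Small Intestine": 4.0, "Stomach": 3.0},
    "GENE_E": {"Small Intestine": 7.0, "Stomach": 1.0},
}


def keyword_search(ontology, keyword):
    """Return identifiers of terms whose name contains the keyword."""
    keyword = keyword.lower()
    return sorted(
        go_id for go_id, term in ontology.items() if keyword in term["name"].lower()
    )


def genes_for_terms(ontology, go_ids):
    """Return the union of genes annotated to the given terms."""
    genes = set()
    for go_id in go_ids:
        genes |= ontology[go_id]["genes"]
    return genes


# 1. Find terms that mention the keyword and collect their genes.
immune_terms = keyword_search(toy_go, "immune")
immune_genes = genes_for_terms(toy_go, immune_terms)

# 2. Use prior knowledge to drop housekeeping genes.
candidates = immune_genes - toy_housekeeping

# 3. Keep genes with a higher median in the small intestine than in the stomach.
selected = sorted(
    gene
    for gene in candidates
    if toy_median_tpm[gene]["Small Intestine"] > toy_median_tpm[gene]["Stomach"]
)

print(immune_terms)
print(selected)
```

Representing gene sets as Python sets keeps each filtering step a single set operation, which makes it easy to reorder the steps or add further filters.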

C. Converting between gene identifier types
In genemunge, the base representation of genes is in terms of their Ensembl ID (without a version number). We will want to see the gene symbol in the results, so we convert the Ensembl IDs to gene symbols. We can then select genes with high relative expression. Of course, a careful study would perform a proper differential expression analysis; here we simply compare median expression values.
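The two operations described here, dropping the version suffix from an Ensembl-style identifier before looking up a symbol and then ranking genes by relative expression, can be sketched as follows. This is a toy illustration rather than the genemunge implementation: the lookup table, the identifiers, the pseudocount, and the ratio threshold are all assumptions chosen for the example.

```python
def strip_version(ensembl_id):
    """Drop a trailing version suffix such as '.4' from an identifier."""
    return ensembl_id.split(".", 1)[0]


# Hypothetical lookup table from unversioned identifiers to gene symbols.
toy_symbols = {
    "ENSGTOY0001": "TOYA",
    "ENSGTOY0002": "TOYB",
    "ENSGTOY0003": "TOYC",
}

# Hypothetical median expression (TPM) in the two tissues, keyed by raw ID.
toy_median_tpm = {
    "ENSGTOY0001.4": {"Small Intestine": 10.0, "Stomach": 2.0},
    "ENSGTOY0002.11": {"Small Intestine": 4.0, "Stomach": 3.0},
    "ENSGTOY0003": {"Small Intestine": 30.0, "Stomach": 0.0},
}


def relative_expression(medians, pseudocount=1.0):
    """Ratio of medians; the pseudocount (an assumption) avoids division by zero."""
    return (medians["Small Intestine"] + pseudocount) / (medians["Stomach"] + pseudocount)


min_ratio = 2.0  # arbitrary threshold chosen for illustration
ranked = []
for raw_id, medians in toy_median_tpm.items():
    gene_id = strip_version(raw_id)
    # Fall back to the identifier itself when no symbol is known.
    symbol = toy_symbols.get(gene_id, gene_id)
    ratio = relative_expression(medians)
    if ratio >= min_ratio:
        ranked.append((symbol, ratio))

# Highest relative expression first.
ranked.sort(key=lambda item: item[1], reverse=True)

for symbol, ratio in ranked:
    print(symbol, round(ratio, 2))
```

A ratio of medians is only a screening heuristic; it ignores sample-to-sample variance, which is why a differential expression analysis is the appropriate tool for drawing conclusions.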

III. SUMMARY
genemunge was built to make working with transcriptomics data easier. It provides a simple way to select genes of interest in an analysis and return useful metadata about them.
We find that it is a useful component of a larger analysis and data processing pipeline. Our intent in open-sourcing the package is to engage with the computational biology community and build it into a broadly useful tool. We welcome feedback, feature requests, and contributions to genemunge through GitHub.