Geneplot: a coordinate conversion approach for graphical representation of protein domain data on the exon-intron structure of a gene

Graphical representation of single gene data, including subgenic features, polymorphisms and protein domains, is part of the regular routine of genome analyses. In the case of protein-coding genes, integration of such information with the exon-intron structure has advantages since intron polymorphisms may also have a biological impact, and the extent to which exons and protein domains overlap is of interest to evolutionary research. This report introduces geneplot, an open-source Python library to generate this type of graphical output from standard file formats. The library applies a coordinate conversion approach in order to represent protein domain data on genomic areas.

gene evolution during speciation events), and intron persistence on genes among different lineages [2], [3]. Thus, integrating domain information with the exon-intron topology is a relevant approach to study the interplay between gene structure and function. For example, the correlation between domain topology and exons across eukaryotic genes has been an active area of research that has fed the exon shuffling theory, a mechanism proposed to explain how domains could be reused during evolution [3]-[5].
Additionally, single nucleotide polymorphisms (SNPs) are one of the preferred molecular markers, because they are abundant, highly polymorphic and distributed genome-wide in nearly every organism [6]-[9]. Many SNPs within the coding regions of genes have been reported to contribute to phenotypic diversity, including human diseases and agricultural traits. SNPs overlapping with protein domains have long been considered stronger candidates when searching for trait-associated polymorphisms, which has even promoted the development of dedicated databases [10], [11]. However, introns may contain functional elements whose mutations can influence the biological fitness of an organism, as exemplified in the human genome [12], which highlights the relevance of including exon-intron topology in the graphical representation of SNP data.
Several visualization tools exist to display genetic information at the single-gene level. Most of them allow a straightforward representation of SNPs and other data types, but they seldom integrate protein domain information with the exon-intron structure, and finding a tool for this task is challenging, especially for non-model organisms. For example, online genome browsers such as the UCSC Genome Browser (https://genome.ucsc.edu/index.html, [13]) are valuable general-purpose resources that provide protein domain information mapped onto genic subfeatures, but they are limited to the model species listed in their databases. In the UCSC database, Pfam domains are first identified on protein sequences and then mapped to the transcripts using a specialized tool called pslmap. As an alternative to this mapping approach, programming languages offer several options to convert transcript or protein coordinates to genome positions (and vice versa) from raw data sources, such as genome annotation files or the output of protein annotation software. The converted coordinates can then be passed to any visualization tool. Some R packages support this route, for example ensembldb and GenomicFeatures [14], [15]. In principle, they are intended to be used with the Ensembl, UCSC and BioMart databases, but GenomicFeatures can build a custom database from a GFF or GTF file, which makes them applicable to non-model species. Likewise, Perl offers a solution through the Bio::Location library of BioPerl [16]. In Python, however, at the time of writing only a Mapper module is under development for this purpose, and it is not fully integrated in the latest stable release of Biopython [17]. This publication introduces geneplot, a Python library for integrating and visualizing exon-intron topology, SNP data and protein domain information. It implements its own coordinate conversion method within a Python framework and can be applied to non-model species, since it takes as input standard file formats produced by popular tools: the output of InterProScan [18], the General Feature Format v3 (GFF3, https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md) and the Variant Call Format (VCF, https://github.com/samtools/hts-specs).
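The idea behind converting protein coordinates to genome positions can be illustrated with a minimal sketch. It is not the geneplot implementation: the function name `protein_to_genome`, its signature and the example intervals are hypothetical, and the sketch assumes that the coding segments (CDS) of a single transcript are given as 1-based inclusive genomic intervals starting at the first base of the start codon.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive genomic coordinates


def protein_to_genome(cds: List[Interval], strand: str,
                      aa_start: int, aa_end: int) -> List[Interval]:
    """Map a 1-based inclusive protein range onto the CDS segments of one
    transcript (hypothetical helper, not part of geneplot)."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Position of the range within the spliced coding sequence
    # (0-based start, exclusive end): each residue spans three bases.
    nt_start = (aa_start - 1) * 3
    nt_end = aa_end * 3
    # Walk the CDS segments in the direction of transcription.
    ordered = sorted(cds, reverse=(strand == "-"))
    total = sum(end - start + 1 for start, end in ordered)
    if not 0 <= nt_start < nt_end <= total:
        raise ValueError("protein range falls outside the coding sequence")
    blocks = []
    consumed = 0  # coding bases covered by the segments visited so far
    for seg_start, seg_end in ordered:
        seg_len = seg_end - seg_start + 1
        lo = max(nt_start, consumed)
        hi = min(nt_end, consumed + seg_len)
        if lo < hi:  # the range overlaps this segment
            off_lo = lo - consumed
            off_hi = hi - consumed  # exclusive offset within the segment
            if strand == "+":
                blocks.append((seg_start + off_lo, seg_start + off_hi - 1))
            else:
                # On the minus strand offsets count down from the segment end.
                blocks.append((seg_end - off_hi + 1, seg_end - off_lo))
        consumed += seg_len
    return sorted(blocks)


if __name__ == "__main__":
    # Two made-up CDS segments of 9 and 12 bases (7 codons in total).
    example_cds = [(101, 109), (201, 212)]
    print(protein_to_genome(example_cds, "+", 3, 5))
    print(protein_to_genome(example_cds, "-", 1, 2))
```

In this sketch a domain that spans an exon boundary is returned as several genomic blocks, one per CDS segment, which is the form needed to draw the domain on top of an exon-intron diagram.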

Implementation.
The library combines information from genome annotation files in GFF3 format, the standard output of InterProScan containing domains identified on protein sequences, and SNP data from VCF files. It is available from GitHub (https://github.com/) under a GPL v3.0 license, and an automated test to assist package development is supplied for use with the unittest Python library (http://docs.python.org/library/unittest.html). Execution and error traceability have been implemented with the logging package (https://docs.python.org/3/library/logging.html). The geneplot library depends on the external software VCFtools [19] and on the third-party Python libraries gffutils, Biopython and matplotlib [22].
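As an illustration of the protein domain input mentioned above, the following sketch reads domain positions from InterProScan tab-separated output. It assumes the standard TSV column order (protein accession, MD5 digest, sequence length, analysis, signature accession, signature description, start, stop, followed by further columns); the names `Domain` and `read_domains` and the example records are invented for this sketch and do not describe the geneplot API.

```python
import io
from typing import Dict, IO, List, NamedTuple


class Domain(NamedTuple):
    accession: str
    description: str
    start: int  # first residue of the match (1-based)
    end: int    # last residue of the match (inclusive)


def read_domains(handle: IO[str],
                 analysis: str = "Pfam") -> Dict[str, List[Domain]]:
    """Collect the matches of one analysis, grouped by protein accession
    (hypothetical helper, not part of geneplot)."""
    domains: Dict[str, List[Domain]] = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Skip short or malformed lines and matches from other analyses.
        if len(fields) < 8 or fields[3] != analysis:
            continue
        hit = Domain(fields[4], fields[5], int(fields[6]), int(fields[7]))
        domains.setdefault(fields[0], []).append(hit)
    # Keep the matches of each protein in N- to C-terminal order.
    for hits in domains.values():
        hits.sort(key=lambda d: d.start)
    return domains


# Invented records, for illustration only.
EXAMPLE = (
    "tx1\tmd5placeholder\t300\tPfam\tPF00069\tProtein kinase domain\t"
    "20\t280\t1.0E-50\tT\t01-01-2020\n"
    "tx1\tmd5placeholder\t300\tSMART\tSM00220\tS_TKc\t"
    "18\t282\t1.0E-60\tT\t01-01-2020\n"
)

if __name__ == "__main__":
    for protein, hits in read_domains(io.StringIO(EXAMPLE)).items():
        for hit in hits:
            print(protein, hit.accession, hit.start, hit.end)
```

The protein ranges obtained this way are expressed in residues, so they still have to be converted to genomic positions before they can be drawn on the exon-intron structure.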

Results and discussion.
The largest transcripts from 1,351 genes on chromosome 2R of the fruit fly genome were used to compare the domain coordinates obtained by geneplot with those reported by the UCSC genome browser. Figure S1-1 of the supplementary material shows an example of a perfect match between both approaches. In several cases, additional Pfam domains were identified or missed in the UCSC predictions (e.g. Figure S1-2). This might reflect different versions of the Pfam database used by the UCSC genome browser and geneplot. Since this problem is unrelated to the genome coordinate conversion itself, those cases were discarded and not considered further in the comparison. Among the cases where the same domains were identified, one source of discrepancy was small differences that were magnified by the short length of the exons involved (Figure S1-3 serves as an example). In other cases, the discrepancy came from the domain mapping strategy followed by the UCSC tools, which use all available exons of a gene model as a template. Conversely, geneplot only uses transcript-associated exons (see Figure S1-4 for another example).
Protein signatures corresponding to extremely short domains, or even to single functional residues (such as those involved in catalysis, folding or protein-ligand interaction), are of special relevance with regard to the applicability of coordinate conversion, since mapping approaches are probably less suitable to correctly transform those protein positions into genomic ranges, due to the small signature length.
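The case of very short signatures can be made concrete with a small sketch that converts a single residue into the three genomic bases of its codon. It is an illustration under stated assumptions rather than the method used by geneplot: the function name `codon_positions` and the intervals are hypothetical, and the coding segments are assumed to be 1-based inclusive intervals that start at the first base of the start codon.

```python
from typing import List, Tuple


def codon_positions(cds: List[Tuple[int, int]], strand: str,
                    residue: int) -> List[int]:
    """Return the genomic positions of the three bases encoding one residue,
    listed in the direction of transcription (hypothetical helper)."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # List every coding base in the order in which it is transcribed.
    bases: List[int] = []
    for start, end in sorted(cds, reverse=(strand == "-")):
        if strand == "+":
            bases.extend(range(start, end + 1))
        else:
            bases.extend(range(end, start - 1, -1))
    first = (residue - 1) * 3
    codon = bases[first:first + 3]
    if residue < 1 or len(codon) != 3:
        raise ValueError("residue falls outside the coding sequence")
    return codon


if __name__ == "__main__":
    # Made-up CDS segments of 7 and 14 bases: the third codon is split
    # by the intron between them.
    example_cds = [(101, 107), (201, 214)]
    print(codon_positions(example_cds, "+", 3))
    print(codon_positions(example_cds, "-", 5))
```

Because a codon can be interrupted by an intron, even a one-residue signature may correspond to two separate genomic blocks, which is why exact coordinate arithmetic matters most for the shortest features.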