Repurposing The Dark Genome. I - Antisense Proteins

From the functional standpoint, the genome may be considered a collection of three types of sequences: protein encoding, RNA encoding, and non-expressing. Based on the standard sequencing and annotation work, it is now well accepted that a small proportion of the genome has been allocated the job of encoding proteins, while most of the genome encodes RNA, and some DNA sequences are not used for expression. The exact ratio among these three types of sequences varies based on the organism. We asked: Is it possible to artificially encode protein and peptide sequences from naturally non-expressing (dark genome) sequences? This led to proof of the concept of making functional proteins from the intergenic sequences of E.coli (Dhar et al 2009). This study is an extension of the original concept and has been organized around antisense DNA sequences. The full-length antisense gene equivalents in forward and reverse orientations were computationally studied for their structural, cellular location, and functional properties, leading to many interesting observations. The current study points to a huge untapped genomic space that needs to be examined from cell physiology, evolutionary, and application perspectives.


Introduction
Genomes are the fundamental molecular chassis of organisms providing primary molecular codes for running the cell. Genomes come with various sizes and base pair proportions, hosting a unique arrangement of genes and non-genic regions providing the fundamental blueprint for activating the downstream molecular processes. Historically, DNA has been the main focus of several generations of scientists, as it provides intellectual fuel to understand collective molecular events like replication, transcription, translation, pathways, and so on. Following the discovery of protein-coding genes, the regulatory space defined by RNA-coding genes has received increased attention for the last several decades. It is now accepted that the so-called 'junk' is a huge repository of RNA coding genes. Interestingly, the non-expressing part of the genome has only met a computational treatment leading to an unclear understanding of its role in the cell.
To fill up this gap, our lab provided preliminary experimental evidence of artificially encoding intergenic sequences into functional molecules (Dhar et al 2009). Extending the search further, the current study was organized to understand the potential of antisense strands in generating novel biomolecules.

Material and Methods
The Escherichia coli (Strain: K-12 MG1655) genome database and Saccharomyces cerevisiae (S288C) genome database and Drosophila melanogaster (fruit fly) genome database was used in this study (NCBI Resource Coordinators, 2013). Data was downloaded from the NCBI (National Center for Biotechnology Information) database. 4315 protein-coding genes of E. coli, 6017 protein-coding genes of S. cerevisiae, and 13,962 protein-coding genes of Drosophila were used to generate antisense DNA sequence equivalents.

Translation of Antisense and Reverse Antisense Sequences
Sequences on the minus (-) strand of E. coli and S. cerevisiae genomes were computationally translated into Protein Sequence with the help of the UGENE tool (Okonechnikov et al., 2012). Only those translated sequences used were full-length equivalents, not interrupted by a stop codon (Figure 1). Sequences that showed stop codons within computationally translated antisense strands were discarded. Expasy-Translation Tool (Gasteiger, 2003) was used to quality check the result. In the antisense strand, both directions were used to predict new genes as shown in Fig.1.

Protein Sequence Similarity With Previously Available Database
In the next step, full-length antisense translated sequences were matched against peptides and protein sequences available in the NCBI Non-Redundant Protein Database (Altschul et al., 1990). Only those sequences were included further in the study that did not show any sequence similarity with the available protein sequences.

Physicochemical properties of protein Sequences
Physicochemical properties of Sequences checked by the Expasy-ProtParam tool of SIB Swiss Institute of Bioinformatics (Roy et al., 2011). The ProtParam tool computes various physical and chemical properties of protein sequences such as molecular weight, theoretical pI, instability index, and grand average of hydropathicity (GRAVY) index.

Structure and Function Prediction
Structure and Function were predicted by the I-TASSER Server (Iterative Threading ASSEmbly Refinement) tool (Zhang, 2008). I-TASSER predicts the secondary and tertiary structure of a protein along with probable functions. Subcellular localization of protein was computed by the WoLF PSORT tool (Horton et al., 2007).

Stereochemical properties of protein models
The stereochemical properties of the protein model were studied by using Procheck Server. The geometry of amino acids was assessed by the Ramachandran plot. Ramachandran Plot is a two-dimensional (2D) plot between the torsional angles of amino acids phi (φ) and psi (ψ) in protein sequences (Hollingsworth & Karplus, 2010).

Translation of Antisense and Reverse Antisense Sequences
The computational translation of the antisense DNA sequences showed several fulllength gene sequence equivalents (i.e., without stop codon) in the forward and reverse directions (Table 1).

Sequence Similarity
The full-length antisense and reverse antisense sequences from E. coli, S. cerevisiae, and Drosophila were matched against the existing NCBI protein sequence database. Some sequences are not found to have any similarity with previously available protein databases (Novel protein) and the remaining sequences show similarity in different percentages. (Table 2)

Physicochemical properties
Antisense and reverse antisense encoded proteins were studied for their physicochemical properties (Table 4). The molecular Weight of antisense proteins is found to range from 3.8 KDa to 43.0 KDa, with Isoelectric Point (pI) value ranging from 3.80 to 13.04 (pI <7 show acidic nature whereas pI >7 indicate basic nature), Instability index ranged from 4.5 -134, hydropathicity value (GRAVY) ranged from -0.496 to 0.672 showing interaction with water (low GRAVY score indicates better interaction with water), with G-C ratio ranging from 33.3 -64.4.

Structure and Function Prediction
Structure and function prediction of dark proteins predicted by I-Tasser showed a favorable C-score i.e., within the range of -5 to +2. The secondary and tertiary prediction of protein structures (predicted by I-TASSER) along with normalized B factor (BFP) profile were found within a favorable range. BFP shows experimental structure stability of proteins. Subcellular localization (WoLF PSROT server) showed predominant localization of proteins to mitochondria and several cytoplasmic destinations, while some were found to be secretory Reverse Antisense Protein

Discussion
The term 'dark-genome' refers to fully dark (never expressing), partially dark (RNA expressing), and virtually dark (pseudogenes) sequences that can be artificially expressed into functional peptides and proteins. The fully dark sequences comprise antisense, reverse coding, highly repetitive sequences, and intergenic sequences. The "partially dark" genome comprises sequences that didn't go beyond transcription i.e., tRNA, long ncRNA, ribosomal RNA, and intron sequences. The "virtually dark" genome comprises sequences like pseudogenes that were previously expressed but were retired during evolution.
This study using E.coli, S.cerevisiae, and D.melanogaster is an extension of our previous work (Dhar et al., 2009) where intergenic sequences of E.coli were artificially expressed into cell growth inhibiting proteins. Subsequently, it was observed that pseudogenes also have the potential to generate stable and functional proteins, if expressed (Shidhi et al., 2015) Given that genes are found on both strands, the term antisense does not indicate an "end-to-end" stretch of complementary DNA. Thus, sense and antisense only indicate a specific arrangement on a particular genomic address. In this study, we considered full-length hypothetical genes on the antisense strands in the forward and reverse directions. DNA sequences that showed stop codons on computational translation were discarded.
Strong evidence of unutilized genomic potential in the form of hypothetical full-length antisense proteins and reverse antisense proteins were observed in E.coli (0.7% and 5.1%), S.cerevisiae (0.15%, 0.5%) and D.melanogaster (0.2%, 2.1%), respectively. The physicochemical properties of these molecules favored the possibility of a welldefined structure and function.
Most of the proteins considered in this study were predicted to be located in various cellular locations with some proteins showing a possibility of secretory nature. The functional prediction tools mapped many of these proteins to transporter and enzyme functions.
To our best knowledge, this is the first report that describes the repurposing of antisense DNA sequences toward generating functional biomolecules. An additional parameter of "reverse antisense" proteins was considered in this study, to raise more evolutionary questions and increase the inventory of synthetic biomolecules. In the future, more work is needed to release the raw antisense and reverse antisense data in the form of a publicly available database and experimentally validate some of the key predictions.

Author Contributions
PKD conceived the idea of making functional proteins from antisense DNA strands, supervised the work and wrote final version of the manuscript. MG performed all the computational work, and drafted the first version of the manuscript.