RT Journal Article
SR Electronic
T1 Defining functional intergenic transcribed regions based on heterogeneous features of phenotype genes and pseudogenes
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 127282
DO 10.1101/127282
A1 John P. Lloyd
A1 Zing Tsung-Yeh Tsai
A1 Rosalie P. Sowers
A1 Nicholas L. Panchy
A1 Shin-Han Shiu
YR 2017
UL http://biorxiv.org/content/early/2017/04/13/127282.abstract
AB With advances in transcript profiling, the presence of transcriptional activities in intergenic regions has been well established in multiple model systems. However, whether intergenic expression reflects transcriptional noise or the activity of novel genes remains unclear. We identified intergenic transcribed regions (ITRs) in 15 diverse flowering plant species and found that the amount of intergenic expression correlates with genome size, a pattern that could be expected if intergenic expression is largely non-functional. To further assess the functionality of ITRs, we first built machine learning classifiers using Arabidopsis thaliana as a model that can accurately distinguish functional sequences (phenotype genes) from non-functional ones (pseudogenes and random unexpressed intergenic regions) by integrating 93 biochemical, evolutionary, and sequence-structure features. Next, by applying the models to ITRs, we found that 2,453 (21%) had features significantly similar to phenotype genes and thus were likely parts of functional genes, while an additional 17% resembled benchmark RNA genes. However, ∼60% of ITRs were more similar to non-functional sequences and should be considered transcriptional noise unless falsified with experiments. The predictive framework established here provides not only a comprehensive look at how functional, genic sequences are distinct from likely non-functional ones, but also a new way to differentiate novel genes from genomic regions with noisy transcriptional activities.