TY  - JOUR
T1  - Defining the functional significance of intergenic transcribed regions based on heterogeneous features of phenotype genes and pseudogenes
JF  - bioRxiv
DO  - 10.1101/127282
SP  - 127282
AU  - John P. Lloyd
AU  - Zing Tsung-Yeh Tsai
AU  - Rosalie P. Sowers
AU  - Nicholas L. Panchy
AU  - Shin-Han Shiu
Y1  - 2017/01/01
UR  - http://biorxiv.org/content/early/2017/04/27/127282.abstract
N2  - With advances in transcript profiling, the presence of transcriptional activities in intergenic regions has been well established in multiple model systems. However, whether intergenic expression reflects transcriptional noise or the activity of novel genes remains unclear. We identified intergenic transcribed regions (ITRs) in 15 diverse flowering plant species and found that the amount of intergenic expression correlates with genome size, a pattern that could be expected if intergenic expression is largely nonfunctional. To further assess the functionality of ITRs, we first built machine learning classifiers using Arabidopsis thaliana as a model that can accurately distinguish functional sequences (phenotype genes) and nonfunctional ones (pseudogenes and unexpressed intergenic regions) by integrating 93 biochemical, evolutionary, and sequence-structure features. Next, by applying the models genome-wide, we found that 4,427 ITRs (38%) and 796 annotated ncRNAs (44%) had features significantly similar to benchmark protein-coding or RNA genes and thus were likely parts of functional genes. However, ∼60% of ITRs and ncRNAs were more similar to nonfunctional sequences and should be considered transcriptional noise unless falsified with experiments. The predictive framework established here provides not only a comprehensive look at how functional, genic sequences are distinct from likely nonfunctional ones, but also a new way to differentiate novel genes from genomic regions with noisy transcriptional activities.
ER  - 