Abstract
In most organisms, transcript and protein levels correlate only moderately, for reasons that include regulation of transcription and protein degradation. Matching RNA/protein datasets produced by modern high-throughput RNA-Seq and mass spectrometry methods have made it possible to better predict and understand the correlation between gene expression and protein abundance. In this work, we used several well-studied matching RNA/protein datasets and explored, for the first time, a bi-clustering method to group genes that share consistent correlation patterns between gene expression and protein abundance. We interpreted the clustering results from the perspective of both transcriptomic and proteomic features; the results show that mRNA half-life, protein half-life and protein structure in concert significantly affect the correlation between gene expression and protein abundance. With these and other carefully selected features, we built a prediction model based on individual clusters, called the Cluster-based Linear prediction Model (CLM), and tested it on mouse liver mitochondrial, mouse brainstem mitochondrial, Saccharomyces cerevisiae and Danio rerio datasets. CLM could identify genes whose protein abundance can be predicted from mRNA data. In summary, by combining bi-clustering, feature selection and the CLM, we have established a new and valuable cluster-based method for predicting protein abundance.
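The idea of fitting a separate linear predictor within each gene cluster can be illustrated with a minimal sketch. It is not the implementation used in this work: the cluster labels are assumed to be given (here they are assigned by hand rather than obtained by bi-clustering), the two feature columns are synthetic stand-ins for features such as mRNA level or half-life, and the function names `fit_cluster_models` and `predict` are hypothetical.

```python
import numpy as np

def fit_cluster_models(X, y, labels):
    """Fit one ordinary least-squares model per cluster.

    X: (n_genes, n_features) feature matrix (assumed, e.g. mRNA-derived features)
    y: (n_genes,) protein abundance
    labels: (n_genes,) cluster id of each gene (assumed to be precomputed)
    """
    models = {}
    for c in np.unique(labels):
        mask = labels == c
        # Append a column of ones so each cluster model has an intercept.
        A = np.column_stack([X[mask], np.ones(mask.sum())])
        coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        models[int(c)] = coef
    return models

def predict(models, X, labels):
    """Predict each gene's abundance with the model of its own cluster."""
    A = np.column_stack([X, np.ones(len(X))])
    return np.array([A[i] @ models[int(labels[i])] for i in range(len(X))])

# Synthetic, noise-free data: two clusters with different linear relations.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
labels = np.repeat([0, 1], 30)
true_coef = {0: np.array([2.0, 0.5, 1.0]), 1: np.array([-1.0, 0.0, 3.0])}
y = np.array([np.append(X[i], 1.0) @ true_coef[int(labels[i])] for i in range(60)])

models = fit_cluster_models(X, y, labels)
y_hat = predict(models, X, labels)
```

Because the two synthetic clusters follow different linear relations, a single global regression would fit neither well, whereas the per-cluster fits recover each relation separately. This is the motivation for predicting within clusters, shown here only on toy data.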