%0 Journal Article
%A Shyam M. Saladi
%A Nauman Javed
%A Axel Müller
%A William M. Clemons, Jr.
%T Decoding sequence-level information to predict membrane protein expression
%D 2017
%R 10.1101/098673
%J bioRxiv
%P 098673
%X The expression of membrane proteins remains a major bottleneck in the characterization of these important proteins. Expression levels are currently unpredictable, which renders the pursuit of these targets challenging and inefficient. Evidence demonstrates that small changes in the nucleotide or amino-acid sequence can dramatically affect membrane protein biogenesis; yet these observations have not resulted in generalizable approaches to improve expression. Here, we develop a data-driven statistical model, named IMProve, that enriches for the likelihood of selecting membrane proteins that express in E. coli directly from sequence. The model, trained on experimental data, combines a set of sequence-derived variables resulting in a score that predicts the likelihood of expression. We test the model against various independent datasets that contain a variety of experimental outcomes demonstrating that the model significantly enriches for expressed proteins. Analysis of the underlying features reveals a significant role for nucleotide derived features in predicting expression. This computational model can immediately be used to identify favorable targets for characterization.
Author Summary: Membrane proteins play a pivotal role in biology, representing a quarter of all proteomes and a majority of drug targets. While considerable effort has been focused on improving our functional understanding of this class, much of the investment has been hampered by the inability to obtain sufficient amounts of sample. Until now, there have been no broadly successful strategies for predicting and improving expression which means that each target requires an ad hoc adventure.
Complex biological processes govern membrane protein expression; therefore, sequence characteristics that influence protein biogenesis are not simply additive. Many properties must be considered simultaneously in predicting the expression level of a protein.
We provide a first solution to the membrane protein expression problem by learning from published data to develop a statistical model that predicts the outcomes of expression trials across families, scales, and laboratories (all independent of the model’s training data). Given that the process of finding a target for large-scale expression is arduous, often requiring a long trial-and-error process that consumes significant financial and human resources, this work will have immediate applicability. The ability to study and engineer inaccessible membrane proteins becomes feasible with the use of our predictor. Furthermore, this work will enable others in developing new computational methods to assist in the experimental study of membrane proteins.
%U https://www.biorxiv.org/content/biorxiv/early/2017/06/21/098673.full.pdf