PT - JOURNAL ARTICLE
AU - Anna Rychkova
AU - MyMy C. Buu
AU - Curt Scharfe
AU - Martina I. Lefterova
AU - Justin I. Odegaard
AU - Iris Schrijver
AU - Carlos Milla
AU - Carlos D. Bustamante
TI - Developing Gene-Specific Meta-Predictor of Variant Pathogenicity
AID - 10.1101/115956
DP - 2017 Jan 01
TA - bioRxiv
PG - 115956
4099 - http://biorxiv.org/content/early/2017/03/10/115956.short
4100 - http://biorxiv.org/content/early/2017/03/10/115956.full
AB - Rapid, accurate, and inexpensive genome sequencing promises to transform medical care. However, a critical hurdle to enabling personalized genomic medicine is predicting the functional impact of novel genomic variation. Various methods for predicting the pathogenicity of missense variants have already been developed. Here we present a new strategy for developing a pathogenicity predictor with improved accuracy by training and applying a supervised machine learning model in a gene-specific manner. Our meta-predictor combines the outputs of various existing predictors, supplements them with an extended set of stability and structural features of the protein, as well as its physicochemical properties, and adds allele frequency information from various datasets. We applied this supervised, gene-specific meta-predictor approach to the CFTR gene, training the model and predicting the pathogenicity of about 1,000 variants of unknown significance that we collected from various publicly available and internal resources. Our CFTR-specific meta-predictor, based on the Random Forest model, performs better than the other machine learning algorithms that we tested, and also outperforms other available tools, such as CADD, MutPred, SIFT, and PolyPhen-2. Our predicted pathogenicity probability correlates well with clinical measures of cystic fibrosis patients and with experimental functional measures of mutated CFTR proteins.
Training the model on a single gene, in contrast to taking a genome-wide approach, makes it possible to take into account structural features specific to a particular protein, thus increasing the overall accuracy of the predictor. Collecting data from several separate resources, in turn, makes it possible to accumulate allele frequency information, which our approach ranks as the most important feature, for a larger set of variants. Finally, our predictor will be hosted on the ClinGen Consortium database to make it available to CF researchers and to serve as a feasibility pilot study for other Mendelian diseases.
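
The meta-predictor idea summarized in the abstract (existing predictor scores, protein stability features, and allele frequency fed into one gene-specific ensemble classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the feature names, value ranges, and synthetic data are assumptions made purely for illustration, and a small bagged ensemble of decision stumps with random feature subsets stands in for a full Random Forest so that the example needs only the Python standard library.

```python
import random

# Hypothetical per-variant features (names and value ranges are illustrative,
# not the feature set used in the paper): three existing predictor scores,
# one stability feature, and one allele-frequency feature.
FEATURES = ["sift", "polyphen2", "cadd", "ddg", "log10_af"]
# (mean if pathogenic, mean if benign, standard deviation) for each feature
TOY_PARAMS = [(0.05, 0.5, 0.15), (0.9, 0.3, 0.2), (28.0, 12.0, 5.0),
              (2.5, 0.5, 0.8), (-5.0, -2.5, 0.8)]


def make_synthetic_variants(n, rng):
    """Generate toy labelled variants (1 = pathogenic); NOT real CFTR data."""
    rows, labels = [], []
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else 0
        rows.append([rng.gauss(p if y else b, sd) for p, b, sd in TOY_PARAMS])
        labels.append(y)
    return rows, labels


def fit_stump(rows, labels, feature_ids):
    """Find the one-feature threshold rule with the fewest training errors."""
    best = None  # (errors, feature index, threshold, label predicted above it)
    for f in feature_ids:
        for t in {r[f] for r in rows}:
            # Number of correct calls if we predict 1 whenever value > t.
            correct = sum((r[f] > t) == (y == 1) for r, y in zip(rows, labels))
            for label_above, ok in ((1, correct), (0, len(rows) - correct)):
                errors = len(rows) - ok
                if best is None or errors < best[0]:
                    best = (errors, f, t, label_above)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, n_sub=2, seed=0):
    """Bagging: each stump sees a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        feats = rng.sample(range(len(rows[0])), n_sub)
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest


def predict_proba(forest, row):
    """Pathogenicity score: fraction of stumps voting 'pathogenic'."""
    votes = sum(lab if row[f] > t else 1 - lab for f, t, lab in forest)
    return votes / len(forest)


rng = random.Random(1)
X, y = make_synthetic_variants(300, rng)
forest = fit_forest(X[:200], y[:200])  # train on the first 200 toy variants
hits = sum((predict_proba(forest, r) >= 0.5) == (lab == 1)
           for r, lab in zip(X[200:], y[200:]))
print(f"held-out accuracy on toy data: {hits / 100:.2f}")
```

In practice, a library implementation such as scikit-learn's `RandomForestClassifier` would replace the hand-written ensemble, and the feature vector would be built from real predictor outputs and curated variant labels for the gene of interest.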