PT - JOURNAL ARTICLE
AU - E Mossotto
AU - JJ Ashton
AU - RJ Pengelly
AU - RM Beattie
AU - BD MacArthur
AU - S Ennis
TI - GenePy – a score for estimating gene pathogenicity in individuals using next-generation sequencing data
AID - 10.1101/336701
DP - 2018 Jan 01
TA - bioRxiv
PG - 336701
4099 - http://biorxiv.org/content/early/2018/06/01/336701.short
4100 - http://biorxiv.org/content/early/2018/06/01/336701.full
AB - Next-generation sequencing (NGS) is revolutionising the diagnosis and treatment of rare diseases. However, its application to common diseases remains relatively modest, limited by the available analytical approaches. Rather than the variant-level approaches typical of rare-disease or large-cohort analyses, contemporary investigation of common polygenic disorders requires tools that combine the mutational burden and biological impact of an individual's set of mutations into single gene scores. GenePy (https://github.com/UoS-HGIG/GenePy) is a gene score that transforms sequencing data into an estimate of whole-gene pathogenicity on a per-patient basis. GenePy implements known deleteriousness metrics and incorporates allele frequency and individual zygosity information. Individuals harbouring multiple rare, highly deleterious mutations accumulate extreme gene scores, while the majority of genes usually achieve very low scores. Following correction for gene length, GenePy intuitively prioritises genes within individuals and affords gene/pathway score comparison between groups of individuals. Herein, we generate GenePy scores from whole-exome sequencing data for ∼15,000 genes across a cohort of 508 individuals. We describe score attributes and model behaviour under various biological conditions. As proof of concept that GenePy sensitively identifies known causal genes, we calculate GenePy scores for NOD2 (an established causal Crohn’s disease gene) in a modest cohort of patients and compare them against controls.
This test of GenePy using a positive control gene yields markedly more significant results (p = 1.37 × 10⁻⁴) than the most commonly applied tool for combining common and rare variation. In addition to increasing the biological information content for each variant, the per-gene, per-individual nature of GenePy transforms the utility of sequencing data. GenePy scores are intuitive both when assessing individual patients and when comparing groups. Because GenePy intrinsically reflects pathogenicity at the gene level, it facilitates downstream data integration (e.g. into machine learning, network and topological analyses) with transcriptomic and proteomic data that also report at the gene level.
Author Summary
Rapid technological advances have made DNA sequencing an effective, economical tool for detecting genomic variation. Detecting rare variation at the individual level is proving very successful in identifying the genetic causes of disease when a single mutation is sufficient to cause it. However, interpreting genomic data is much less straightforward for common diseases such as asthma, arthritis or heart disease, where many genetic changes across multiple genes combine with the environment to bring about disease symptoms. We have developed a new scoring system called GenePy that generates whole-gene pathogenicity scores for individual patients. The score corrects for the length of the gene and is intuitive to use. Unlike many mutation deleteriousness metrics, GenePy also takes into account the population frequency of the variant and the number of copies of any given mutation, and it combines data for as many variants as are present in a given gene for any one individual. In this paper we apply the GenePy scoring system to a cohort of over 500 individuals for whom we have sequencing data across all genomic regions that code for protein.
We describe how GenePy performs and demonstrate superior sensitivity in detecting known causal genes in a common autoimmune condition.
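
The abstract names the ingredients of the score (a deleteriousness metric, allele frequency, zygosity, and a gene-length correction) but does not state how they are combined. The Python sketch below is one illustrative way to combine them, not the GenePy implementation itself. Everything in it is an assumption made for illustration: the function name `gene_score`, the input fields `deleteriousness`, `alt_freq` and `alt_copies`, the `FREQ_FLOOR` clamp, the use of a negative log10 of the product of the two carried allele frequencies as the rarity weight, and the per-kilobase length correction.

```python
import math

# Hypothetical clamp so log10 never receives exactly 0 or 1 (assumption of this sketch).
FREQ_FLOOR = 1e-5


def gene_score(variants, gene_length_bp=None):
    """Illustrative per-gene, per-individual score.

    variants: list of dicts with
        'deleteriousness' : float in [0, 1], from any variant-level metric
        'alt_freq'        : population frequency of the alternate allele
        'alt_copies'      : 0, 1 or 2 copies carried by this individual
    gene_length_bp: optional gene length in base pairs for length correction.
    """
    score = 0.0
    for v in variants:
        copies = v["alt_copies"]
        if copies == 0:
            continue  # individual does not carry the variant: no contribution
        f_alt = min(max(v["alt_freq"], FREQ_FLOOR), 1.0 - FREQ_FLOOR)
        f_ref = 1.0 - f_alt
        # Frequency of the second allele depends on zygosity.
        f_second = f_alt if copies == 2 else f_ref
        # Rare alleles give a large -log10 weight; deleteriousness scales it.
        score += -v["deleteriousness"] * math.log10(f_alt * f_second)
    if gene_length_bp:
        score /= gene_length_bp / 1000.0  # assumed correction: score per kilobase
    return score


# Made-up example variants for a single gene in a single individual.
rare_het = {"deleteriousness": 1.0, "alt_freq": 0.01, "alt_copies": 1}
rare_hom = {"deleteriousness": 1.0, "alt_freq": 0.01, "alt_copies": 2}
common_benign = {"deleteriousness": 0.05, "alt_freq": 0.4, "alt_copies": 1}

print(gene_score([rare_het]))
print(gene_score([rare_hom]))
print(gene_score([rare_het, common_benign], gene_length_bp=2000))
```

Under these assumptions, a rare deleterious variant carried in two copies contributes more than the same variant carried once, common low-impact variants contribute very little, and contributions from all variants in a gene add up, which mirrors the behaviour the abstract describes.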