1 Abstract
We found tremendous inequality across gene and protein annotation resources. We observe that this bias leads biomedical researchers to focus on richly annotated genes instead of those with the strongest molecular data. We advocate for researchers to reduce these biases by pursuing data-driven hypotheses.
2 Introduction
After samples are analyzed with a high-throughput technology, the de facto first step is to perform pathway or network analysis to identify biological processes that are statistically enriched in the data.1 Researchers typically form hypotheses for their follow-up experiments based on the genes or proteins involved in the enriched processes. Commonly used resources for identifying gene functions and interactions include the Gene Ontology (GO),2 Reactome,3 Comparative Toxicogenomics Database (CTD),4 DrugBank,5 Protein Data Bank (PDB),6 Pubpular,7 and NCBI GeneRIF. Since these resources are created by curation of the scientific literature, they typically only contain functional annotations for genes with published experimental data. Although GO includes predicted functional annotations for genes, they are considered of low quality.8 Consequently, researchers select those genes or proteins for further validation that have prior experimental evidence, which, in turn, leads to more functional annotations for those genes at the expense of under-studied genes.
We hypothesized that this experimental paradigm has led to a gene-centric disease research bias where hypotheses are confounded by the streetlight effect of looking for “answers where the light is better rather than where the truth is more likely to lie”.9–11 To test this hypothesis, we examined the annotation inequality for the human genome across a number of biomedical databases using the gini coefficient, a measure of inequality in which a higher value indicates greater inequality.12
3 Results
3.1 Gene annotation inequality persists across databases
Despite the tremendous growth of GO from 20,826 annotations in 2004 for 7,524 human genes to 122,926 annotations for 16,173 genes in 2017, annotation inequality in GO has increased from a gini coefficient of 0.34 in 2004 to 0.50 in 2017. The growth in inequality over time indicates that genes with existing annotations continue to receive even more annotations. Pathway databases, including Reactome (gini=0.33)3 and the CTD (gini=0.47),4 have a similarly high level of inequality. Indeed, every gene annotation resource we examined displayed a similarly high level of annotation inequality, including: CTD chemical-gene associations (gini=0.63);4 PDB 3D protein structures (gini=0.68);6 DrugBank drug-gene associations (gini=0.70);5 GeneRIF gene publication annotations (gini=0.79); and Pubpular disease-gene publication associations (gini=0.82).7, 13 We calculated a global gene annotation gini coefficient of 0.63 when considering the number of annotations pooled across all these databases. When comparing annotation inequality in gene resources to income inequality in the world, we observed that the inequality index for many of the gene resources is higher than that of any nation in the Organisation for Economic Co-operation and Development (OECD).14
3.2 Annotation inequality bias affects biomedical research
Next, we explored whether disease research may be affected by the inequality in gene annotation databases. General concern that most published findings are false,15 many results are inflated,16 and research funding is being wasted17, 18 has led to searches for solutions that will yield reproducible and clinically relevant findings. We have repeatedly demonstrated that a multi-cohort analysis framework19, 20 can identify novel disease-gene relationships that lie outside the “halo of the streetlight”, and which have diagnostic, prognostic, and therapeutic utility across diverse diseases including cancer,21–23 organ transplantation,19 infectious diseases,24–27 and autoimmunity.28
In our manually curated meta-analyses of 104 distinct human conditions, we have integrated transcriptome data from over 41,000 patients and 619 studies to calculate an effect size for disease-gene associations.20 Our meta-analyses covered diverse classes of human conditions, such as cancer, autoimmune disease, viral infection, neurodegenerative and psychiatric disorders, pregnancy, and obesity. For these conditions, we extracted all disease-gene associations with at least ten publications.7, 13 Published disease-gene associations exhibited no significant correlation with differential gene expression false discovery rate (FDR) rank [Figure 2A, Spearman’s correlation = −0.005, p = 0.716]. Overall, only 19.5% of published disease-gene associations were identified in gene expression meta-analyses at an FDR of 5% [Figure S1a]. This result is consistent with previous publications that have successfully replicated between 11% and 25% of research studies.29, 30
Figure 1. We measured the gini coefficient across a variety of gene annotation resources. For comparison, we also displayed gini coefficients for income inequality across a sample of OECD nations.
Figure 2. (A) The number of publications for every disease-gene pair was not significantly correlated with the gene expression meta-analysis effect size FDR rank [Spearman’s correlation = −0.005, p = 0.716]. (B) The number of publications for every disease-gene pair correlated with the number of non-inferred from electronic annotation (non-IEA) Gene Ontology annotations [Spearman’s correlation = 0.100, p = 8.7e-13]. Orange points represent disease-gene associations published in our prior meta-analyses.19, 23, 24 Purple points have at least 1000 publications. See also Figure S1.
To determine whether this phenomenon was specific to gene expression, we extracted genome-wide significant single nucleotide polymorphisms (SNPs) from the GWAS Catalog.31 We observed a nominally significant correlation between the number of publications and SNP p-values, indicating moderate concordance between genetic mutations and disease-gene publications [Figure S1b, Spearman’s correlation = −0.127, p = 0.015].
Based on these results, we hypothesized that the lack of correlation with molecular evidence may have been an artifact of research bias towards well-characterized genes. Therefore, we examined the correspondence between publications about a disease-gene pair and existing knowledge about that gene, as indicated by the number of GO annotations. Indeed, the number of GO annotations for a gene of interest was significantly correlated with the published disease-gene associations [Figure 2B, Spearman’s correlation = 0.100, p = 8.7e-13], but not with gene expression effect size FDR rank in disease [Figure S1c, Spearman’s correlation = −0.010, p = 0.136].2 Many of the highly published disease-gene associations may have been studied for reasons that would not be directly reflected in gene expression analysis, including BRCA1 in breast cancer and CD4 in human immunodeficiency virus. The more troubling bias occurs when associations with strong molecular evidence have no publication record. Disease-gene associations we have reported in our published meta-analyses were typically novel findings with few Gene Ontology annotations, despite having extremely low false discovery rates and high effect sizes19, 21, 24 [orange points in Figure 2]. We observed similar patterns when we performed the same analysis on similar publication and GWAS data from HuGE Navigator32, 33 [Figures S1d, S1e, S1f].
4 Discussion
Collectively, our results provide evidence of a strong research bias in the literature that focuses on well-annotated genes instead of the genes with the most significant disease relationship in terms of both gene expression and genetic variation. While focusing research on the best-characterized genes may be natural because it is easy to formulate a mechanistic hypothesis of the gene’s function in disease, we propose that omics-era researchers should instead allow data to drive their hypotheses. Our prior work shows that expanding research outside of the streetlight of well-characterized genes identifies novel disease-gene relationships, leads to successful repurposing of drugs, and provides clinically actionable diagnostics.19, 21–27, 34 To enable researchers to pursue data-driven hypotheses, we have made our gene expression meta-analysis data publicly available at [http://metasignature.stanford.edu] where it may be explored based on either diseases or genes of interest. By focusing on genes with the strongest molecular evidence instead of the most annotations, researchers will break the self-perpetuating annotation inequality cycle that results in research bias.
5 Materials and Methods
5.1 Gini coefficient calculation
We calculated the gini coefficients using the R package ineq.35 We included all human genes with at least one annotation in the gini calculations. We used the Entrez Gene list downloaded in February 2017 of 20,698 current, protein-coding, human genes as our source of human genes.
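The gini calculation itself can be illustrated with a minimal Python sketch. It is not the ineq implementation used for the analysis; it applies the standard uncorrected sample formula to a list of per-gene annotation counts, and the function name `gini` and the toy counts are assumptions made for illustration only.

```python
def gini(values):
    """Gini coefficient of non-negative counts (0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # G = sum_i (2i - n - 1) * x_i / (n * sum(x)), with i = 1..n over sorted values
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Hypothetical annotation counts for four genes (not data from this study).
evenly_annotated = [5, 5, 5, 5]
graded = [1, 2, 3, 4]
one_gene_dominates = [0, 0, 0, 10]

for counts in (evenly_annotated, graded, one_gene_dominates):
    print(counts, gini(counts))
```

In this toy example, the evenly annotated genes give 0, the graded counts give 0.25, and concentrating all ten annotations on one of four genes gives 0.75, the maximum for a sample of four under this formula.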
We calculated the number of annotations for each human gene in the Gene Ontology.2 We only considered the biological process and molecular function categories and excluded terms with evidence codes IEA and ND. Duplicate annotations that differ only in evidence codes were counted once. We calculated the number of annotations in the January 2004 release and annually for the January releases from 2009 through 2017.
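The filtering and de-duplication rules above can be sketched in Python as follows. This is an illustrative sketch only, not the code used for the analysis: the record layout `(gene, go_term, evidence_code, aspect)`, the function name, and the placeholder GO identifiers are assumptions. The single-letter aspect codes follow the GO annotation file convention (P for biological process, F for molecular function, C for cellular component).

```python
from collections import Counter

EXCLUDED_EVIDENCE = {"IEA", "ND"}   # evidence codes dropped from the counts
INCLUDED_ASPECTS = {"P", "F"}       # biological process and molecular function


def count_go_annotations(records):
    """Count unique (gene, GO term) pairs per gene after filtering.

    records: iterable of (gene, go_term, evidence_code, aspect) tuples.
    """
    unique_pairs = set()
    for gene, go_term, evidence, aspect in records:
        if evidence in EXCLUDED_EVIDENCE or aspect not in INCLUDED_ASPECTS:
            continue
        # Using a set means annotations that differ only in evidence code
        # collapse into a single (gene, term) pair.
        unique_pairs.add((gene, go_term))
    return Counter(gene for gene, _ in unique_pairs)


# Toy records with placeholder GO identifiers (not real annotations).
records = [
    ("BRCA1", "GO:0000001", "IDA", "P"),
    ("BRCA1", "GO:0000001", "IMP", "P"),  # duplicate term, different evidence
    ("BRCA1", "GO:0000002", "IEA", "F"),  # excluded evidence code
    ("BRCA1", "GO:0000003", "IDA", "C"),  # excluded category
    ("CD4", "GO:0000004", "TAS", "F"),
    ("CD4", "GO:0000005", "ND", "P"),     # excluded evidence code
]
counts = count_go_annotations(records)
print(dict(counts))
```

Genes whose annotations are all filtered out do not appear in the result, which matches the rule of including only genes with at least one annotation in the gini calculation.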
We manually downloaded gene-publication data in August 2016 from Pubpular for 102 of the diseases in our gene expression database.7, 13 “Pubpular Total” refers to the inequality of gene-publication data across all diseases. “Pubpular Median” refers to the median inequality of gene-publication data for each disease.
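The difference between the two Pubpular summaries can be made concrete with a small Python sketch. It rests on one loudly stated assumption: that the total is computed by summing each gene's publication counts across diseases before measuring inequality, which is only one possible reading of the description above. The nested-dictionary layout, the function names, and the toy counts are likewise hypothetical.

```python
from collections import defaultdict
from statistics import median


def gini(values):
    """Uncorrected sample gini coefficient of non-negative counts."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1)) / (n * total)


def total_and_median_gini(pubs_by_disease):
    """pubs_by_disease: {disease: {gene: publication_count}}."""
    # ASSUMPTION: the total pools counts by summing per gene across diseases.
    pooled = defaultdict(int)
    for gene_counts in pubs_by_disease.values():
        for gene, n_pubs in gene_counts.items():
            pooled[gene] += n_pubs
    per_disease = [gini(c.values()) for c in pubs_by_disease.values()]
    return gini(pooled.values()), median(per_disease)


# Toy counts for two hypothetical genes in three hypothetical diseases.
toy = {
    "disease1": {"geneA": 1, "geneB": 3},
    "disease2": {"geneA": 3, "geneB": 1},
    "disease3": {"geneA": 2, "geneB": 2},
}
total_gini, median_gini = total_and_median_gini(toy)
print(total_gini, median_gini)
```

In this toy case the pooled counts are equal (gini of 0) while the median per-disease gini is 0.25, which shows why the two summaries can differ and are reported separately.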
We downloaded Reactome pathway data from the complete database release 59.3 We downloaded data in MySQL format and parsed pathways into UniProt identifiers using custom scripts. We converted UniProt identifiers to gene names using the UniProt identifier conversion tool.36 We calculated the number of pathways including each gene name.
We downloaded the CTD4 data in February 2017, with the chemical-gene associations and the gene-pathway associations. We calculated the number of chemical-gene and gene-pathway associations for each gene name.
We downloaded GeneRIFs from the NCBI in February 2017. We included all human GeneRIFs (Tax ID: 9606). We calculated the number of GeneRIFs for each gene.
We downloaded the gene names associated with protein structures from the Protein Data Bank6 in February 2017 and calculated the number of structures per gene name.
We downloaded the DrugBank5 database version 5.0.5 and identified all drugs with known activities on human genes. We calculated the number of drugs targeting each gene.
We downloaded OECD14 nation income inequality gini coefficients from the July 2016 data release at http://www.oecd.org/social/income-distribution-database.htm.
The code and data we used to run this analysis are available at http://khatrilab.stanford.edu/researchbias.
5.2 Gene expression data collection and meta-analysis
Gene expression meta-analysis data were compiled from the MetaSignature database.20 MetaSignature includes data from manual meta-analysis of over 41,000 samples, 619 studies, and 104 diseases. Briefly, relevant data were downloaded from Gene Expression Omnibus and ArrayExpress.37, 38 Cases and controls were manually labeled for each disease, and meta-analysis was performed using the MetaIntegrator package.20 We used the Hedges’ g summary effect size, standard error, and false discovery rate, which the MetaIntegrator package calculates for every gene.
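For readers unfamiliar with Hedges’ g, the following Python sketch computes it for one gene in a single study using the textbook small-sample-corrected formula. It is not the MetaIntegrator implementation, and it omits the later step of pooling per-study effects into a summary effect size across studies. The function name and the toy expression values are assumptions.

```python
import math
from statistics import mean, variance


def hedges_g(cases, controls):
    """Return (g, standard_error) for one gene in one study."""
    n1, n0 = len(cases), len(controls)
    df = n1 + n0 - 2
    # Pooled standard deviation from the two sample variances.
    pooled_sd = math.sqrt(
        ((n1 - 1) * variance(cases) + (n0 - 1) * variance(controls)) / df
    )
    d = (mean(cases) - mean(controls)) / pooled_sd  # Cohen's d
    j = 1 - 3 / (4 * df - 1)                        # small-sample correction
    g = j * d
    # Standard error of g: J times the standard error of d.
    se = j * math.sqrt((n1 + n0) / (n1 * n0) + d * d / (2 * (n1 + n0)))
    return g, se


# Toy expression values (hypothetical, not data from this study).
g, se = hedges_g(cases=[2, 4, 6], controls=[1, 3, 5])
print(g, se)
```

With these toy values, Cohen's d is 0.5 and the correction factor is 0.8, so g is 0.4; swapping cases and controls flips the sign.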
5.3 Data collection for disease-gene publications and SNP data
We downloaded the number of publications for each disease-gene relationship from PubPular and HuGE Navigator in August 2016 for as many of the 104 diseases in MetaSignature as were present in the databases (102 in PubPular and 81 in HuGE).7, 13, 32 PubPular gave the top 261 gene associations, and HuGE gave all known associations. For all correlations, we only considered disease-gene associations with at least 10 publications to limit false positive associations.
We downloaded disease-SNP relationships, including gene mappings, odds ratios, and p-values, from the GWAS Catalog and HuGE Navigator for 61 and 54, respectively, of the 103 diseases in MetaSignature.31, 33 From Gene Ontology, we calculated the counts of non-IEA annotations (those not inferred from electronic annotation) for all the genes in the MetaSignature database.2 The Spearman rank correlation was used for all correlations.
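The Spearman rank correlation can be sketched in Python as the Pearson correlation of average ranks, which handles tied publication counts. This is an illustrative sketch under stated assumptions, not the code used for the analysis: the function names and toy values are hypothetical, and no p-value is computed.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy disease-gene pairs as (publication_count, fdr_rank); hypothetical values.
pairs = [(3, 40), (12, 900), (25, 15), (150, 4000), (1200, 320)]
kept = [p for p in pairs if p[0] >= 10]  # at least 10 publications
rho = spearman([p[0] for p in kept], [p[1] for p in kept])
print(rho)
```

The filter on publication count mirrors the threshold of at least 10 publications applied before computing the correlations.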
Our plots show the top 10,000 gene associations for each disease by effect size FDR rank. Correlation calculations do not include a similar limit.
6 Acknowledgements
We thank Paul J. Utz for feedback about the manuscript and figures and Alex Schrenchuk for computer support. WAH is funded by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-114747. PK is funded by the Bill and Melinda Gates Foundation and NIAID grants 1U19AI109662, U19AI057229 and U54I117925.
7 Author Contributions
Conceptualization, W.A.H., A.T., and P.K.; Methodology, W.A.H., A.T., and P.K.; Software, W.A.H. and A.T.; Investigation, W.A.H. and A.T.; Data Curation, W.A.H. and A.T.; Writing-Original Draft, W.A.H. and P.K.; Writing-Reviewing and Editing, W.A.H., A.T., and P.K.; Visualization, W.A.H.; Funding Acquisition, P.K.