Abstract
Pathway-based interpretation of gene lists is a staple of genome analysis. It depends on frequently updated gene annotation databases. We analyzed the evolution of gene annotations over the past seven years and found that the vocabulary of pathways and processes has doubled. This strongly impacts practical analysis of genes: 80% of publications we surveyed in 2015 used outdated software that only captured 20% of pathway enrichments apparent in current annotations.
Pathway enrichment analysis is a common technique for interpreting candidate gene lists derived from high-throughput experiments1-3. It reveals the most characteristic biological processes and molecular pathways of candidate genes by combining statistical enrichment methods and prior knowledge of gene function from resources such as Gene Ontology4 (GO) and Reactome5. However, genomic and transcriptomic datasets have grown by several orders of magnitude since the turn of the century6,7 and the scientific literature has doubled between 2010 and 20158. For example, GO is updated daily while new Reactome versions are released quarterly. Dozens of software tools have been written to interpret gene lists using information derived from GO and pathway databases, but only a few are regularly updated with the most current functional gene annotations. Many pathway enrichment tools have not been updated for years.
To investigate whether the age of pathway annotations adversely affects gene list interpretation, we analyzed the breadth and depth of functional information. We then reinterpreted cancer genes from recent experiments using gene annotation data with data snapshots taken at various times during the past seven years (2009-2016). First we surveyed the update times of 21 web-based pathway enrichment analysis tools9-29 (Table 1, Supplementary Figure 1). While several tools (g:Profiler9, Panther12, ToppGene18, GREAT15) were up to date with functional datasets revised within the past six months (February 2016), most publicly available tools were outdated by several years, and 12 (57%) tools had not been updated in the past five years (e.g. DAVID19 in January 2010). Nine tools had been updated within five years (e.g. FunSpec29, GeneCodis23) and eight tools had been updated within two years (e.g. Babelomics14, WebGestalt11). Surprisingly, the vast majority of surveyed publications from 2015 (84%) cited outdated tools that had not been updated within five years (Figure 1A, Supplementary Table 1).
(A) The majority of public pathway enrichment analysis software tools use outdated gene annotations, and the majority of surveyed papers published in 2015 use annotations that are more than 5 years old. (B) Annotated Gene Ontology (GO) terms per organism have grown rapidly in 2009-16 and more than doubled for human genes. (C) Two-dimensional density plots show accumulation of gene annotations during 2009-2016. Pathways per annotated gene (Y-axis) and median pathway sizes per annotated gene (X-axis) are shown. The median gene is shown with green dashed lines. (D) Quality of gene annotations is improving rapidly as manually curated Reactome annotations are becoming more frequent and fewer genes have only automatic GO annotations (Inferred from Electronic Annotation, IEA). (E) Gene annotations from 2010 miss 75% of enriched GO biological processes and Reactome pathways in essential breast cancer genes from recent shRNA screens. (F) Pathway enrichment analysis of frequently mutated glioblastoma (GBM) genes shows the proportion of results missed in outdated GO annotations. Each bar compares annotations of a given year to annotations of 2016. (G) Enrichment Map shows functional themes enriched in GBM genes that are missed in analysis of 2010 annotations. Nodes represent processes and pathways and edges connect nodes with many shared genes. Druggable pathways are indicated with asterisks.
Web-based pathway enrichment analysis software tools with times of most recent updates.
We then asked how the use of outdated annotation databases would affect functional analysis of genes. We found consistent growth in the number and complexity of functional annotations over time across multiple species (2009-2016; Figure 1B). The number of biological processes in GO having human gene annotations has more than doubled between 2009 and 2016, from 6,509 to 14,735 with an average growth of 12.5% per year (Supplementary Figure 2). Similar growth was seen among cell components (936 to 1,728) and molecular functions (3,035 to 4,291). A typical pathway database, Reactome, had similar growth. Its human pathways nearly doubled over the same period from 880 to 1,746. GO annotations of mouse, fly, yeast, and Arabidopsis thaliana all grew at similar rates.
In addition to improvements in quantity, we found that the annotation databases have also improved in quality and complexity. The GO hierarchy has grown significantly deeper over time, as measured by the average path length between terms and root (7.59 to 8.06; p<10−5, permutation test; Supplementary Figure 3). The mean number of paths to root per term has tripled (47 to 160; p<10−5; Supplementary Figure 4). The former metric shows greater detail of our biological vocabulary, while the latter shows increasing interconnectedness of concepts. These changes directly affect gene list interpretation, as genes in GO terms are automatically propagated to parent terms.
Analysis of gene-term associations showed rapid growth in the knowledge of individual genes (Figure 1C). The median human gene in 2016 is associated with 29 processes and pathways compared to 16 in 2009 (p<10−5). The median functional gene set has also grown from 817 to 1,144 genes (p<10−5). Earlier general terms included thousands of genes, while recent annotations cover a wide spectrum of broad and specific pathways and processes. In particular, the manually-curated Reactome resource has grown from 49 to 150 genes per median pathway (Supplementary Figure 5). The increase in GO annotations is also mirrored in model organisms (Supplementary Figure 6).
High-confidence experimental annotations are becoming more common, while the fraction of poorly annotated human genes is decreasing (Figure 1D). In 2016, 42% of genes have at least one Reactome annotation (versus 15% in 2009), while 14% of genes have only low-confidence electronic annotations (versus 37% in 2009). The former category is based on manual literature curation, while the latter only includes IEA annotations in GO (Inferred from Electronic Annotation). The ‘dark matter’ of the genome also contributes to the quality and quantity of gene annotations. In 2009, one of eight human protein-coding genes (12.4%) had no annotations in GO or Reactome, compared to 4.9% of genes in 2016 (Supplementary Figure 7). Changes in gene symbols also contribute to loss of information. Earlier software would miss one of eight genes due to changing gene nomenclature. Using standard gene symbols from 2015, we found that genes were missed due to nomenclature changes 12.2% of the time when using symbols from 2009, compared to just 1.1% of gene symbols in 2014 (Supplementary Figure 8). Collectively, these data show that gene annotations have become substantially broader, more specific and higher-quality over the last seven years, covering more protein-coding genes and reducing the number of un-annotated genes.
Given this steady improvement, we evaluated the impact of annotation evolution on gene list interpretation. First we analyzed essential genes of 77 breast cancer cell lines derived from recent shRNA screens30. Strikingly, 74% of enriched terms of 2016 were missed when testing the top 500 essential genes from each cell line with annotations from 2010 (191 pathways per median cell line in 2010 vs. 695 in 2016, Figure 1E, Supplementary Figure 9). We also confirmed this finding using the top-100 genes (Supplementary Figure 10) and by repeating the analysis separately for GO and Reactome (Supplementary Figure 11).
To confirm our observations in a high-confidence dataset, we studied 75 significantly mutated driver genes in the glioblastoma (GBM) form of brain cancer, taken from a recent pan-cancer analysis of 6,800 tumors31,32 that includes both known cancer drivers (EGFR, PIK3CA, PIK3R1, PTEN, TP53, NF1, RB1)33 and less well-known candidates. A standard enrichment analysis, performed using Fisher’s exact test and outdated annotations, missed many pathways detected with 2016 data (Figure 1F, Supplementary Figure 12, Online Methods). In particular, the 2010-era annotations (which the DAVID software uses) only capture ~20% of the 2016 results, including 172/827 GO biological processes and 16/128 Reactome pathways. The 13 general processes exclusively seen in 2010, which include large groupings such as transcription and apoptosis, were restructured in later databases.
Some enriched pathways and processes are also missed in relatively recent gene annotations (Figure 1F). In the enrichment analysis performed using 2015 data, 89/743 (12%) of GO terms and 29/116 (25%) of Reactome pathways were missed compared to 2016. Some missing GO terms (8%) are present in the set of annotations from 2016, while most are not significant at FDR p<0.05. However, ~40% of the non-significant processes are seen at less stringent cutoffs (FDR p<0.1) (Supplementary Figure 13). As pathways grow over time, enrichment signals from gene lists are diluted. Thus researchers may not be able to replicate all pathways when repeating analysis using newer gene annotations.
A detailed summary of GBM pathways represented as an Enrichment Map34 shows major functional themes missed in 2010 annotations (FDR p<0.05, Figure 1G; Supplementary Tables 2-3). Certain themes, such as neurotransmitter signaling (n=6, FDR p=0.0013), circadian clock (n=8, FDR p=1.5x10−4) and glucose signaling (n=7, FDR p=0.0016), are only highlighted in the current analysis. These processes are expected from brain cancer genes35,36; for example, enhanced glucose uptake by brain tumor initiating cells helps them overcome nutrient deprivation37. In particular, immune response (n=29, FDR p=5.2x10−5) and related processes are only apparent in annotations from 2016 and emphasize emerging opportunities in cancer immunotherapy45. Other themes underline increased specificity of neuronal context: while apoptosis is highlighted in both analyses, neuronal apoptosis only comes up in newer data (n=7 genes, FDR p=0.018). The difference between annotations of 2010 and 2016 is even stronger when excluding GO IEA annotations, as 96.5% (603/625) of high-confidence pathways and processes are missed when older data are used (Supplementary Figure 14).
The analysis with current annotation data also highlights signaling pathways relevant to GBM biology and therapy development33,38,39 such as Notch (n=5, FDR p=0.0019), TGF-β (n=5, FDR p=0.027), and fibroblast growth factor (n=12, FDR p=1.13x10−6) (Figure 1G). For example, the Notch pathway is targetable with γ-secretase inhibitors (RO4929097) that are currently in phase I and II glioma trials38. The TGF-β pathway can be inhibited via the ligand (Trabedersen) or the receptor (Galunisertib)40. Current gene annotations also highlight the EGFRvIII41,42 signaling pathway (n=5, FDR p=1.07x10−5) that is among the most common alterations in GBM. It involves deletion of exons 2-7 of the EGFR gene that drives tumor progression and correlates with poor prognosis43. While EGFR inhibitors have been unsuccessful in GBM treatment to date, the recently developed Rindopepimut vaccine targets EGFRvIII and has entered clinical trials44. These targetable pathways would not have been identified using outdated annotation information.
The growth in the quantity and completeness of functional annotations has a crucial impact on practical analysis of high-throughput data in current literature. Out of the 21 tools we studied, the most popular software was DAVID, used in 2,500 publications in 2015 and representing 71.4% of the citations of software tools we reviewed. Our analysis shows that at least 74% of pathway enrichment hits are missed when analyzing gene annotations from 2010, including ~12% of dark-matter genes lost due to lack of annotations, ~12% of genes lost due to outdated or absent gene symbols, and at least 50% of results missed due to enhancements in the catalog of biological pathways and processes. The implication is that thousands of recent studies have severely underestimated the functional significance of their gene lists because of outdated software and gene annotations. We now have the opportunity to discover new valuable information and outline experimental hypotheses by carefully re-analyzing existing datasets. The bioinformatics community needs to prioritize the timeliness of gene annotations and data reproducibility. At least semiannual software updates are required as genome databases and biomedical ontologies receive major updates several times a year. To ensure reproducibility of earlier analyses, software tools need to provide access to historical versions of gene annotations. As an example of recommended practice, our g:Profiler web server (http://biit.cs.ut.ee/gprofiler), which includes gene annotations for human and more than two hundred other organisms, is synchronized with the Ensembl database every few months and maintains an archive of earlier releases dating back to 2011.
Researchers and reviewers need to pay attention to timeliness of data, and software tools need to clearly indicate the time of the most recent update. Reliable and up-to-date software tools allow researchers to make the best use of current knowledge of gene function and best interrogate experimental data for making scientific discoveries.
Online methods
Ontologies and pathways
Functional terminology of biological processes, molecular functions, and cell components was retrieved from the Gene Ontology4 website and comprised January releases of each year (2009-2016). Gene annotations were derived from the Gene Ontology Annotation (UniProt-GOA) database46. Molecular pathways from the Reactome5 database were retrieved from archives and included December releases of previous years. Genes were annotated to GO terms as well as parent and ancestor terms via all possible paths. Obsolete terms and negative relationships in GO were removed. We filtered out human genes with non-public status and analyzed protein-coding genes of matched versions of the NCBI Consensus Coding Sequence Database47 (CCDS).
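The propagation of gene annotations to parent and ancestor terms can be illustrated with a minimal sketch. It is not the code used in this study; the term and gene identifiers are hypothetical placeholders, and the hierarchy is assumed to be given as a simple child-to-parents mapping with obsolete terms and negative relationships already removed.

```python
from collections import defaultdict


def ancestors(term, parents, cache):
    """Return all ancestors of `term` reachable via any path in the hierarchy."""
    if term in cache:
        return cache[term]
    result = set()
    for parent in parents.get(term, ()):
        result.add(parent)
        result |= ancestors(parent, parents, cache)
    cache[term] = result
    return result


def propagate(direct_annotations, parents):
    """Map each term to the genes annotated to it directly or via a descendant term."""
    cache = {}
    term_to_genes = defaultdict(set)
    for gene, terms in direct_annotations.items():
        for term in terms:
            term_to_genes[term].add(gene)
            for ancestor in ancestors(term, parents, cache):
                term_to_genes[ancestor].add(gene)
    return dict(term_to_genes)


# Toy hierarchy with hypothetical identifiers (not real GO accessions).
parents = {
    "neuron_apoptosis": ["apoptosis"],
    "apoptosis": ["cell_death"],
    "cell_death": ["biological_process"],
}
direct = {"GENE_A": ["neuron_apoptosis"], "GENE_B": ["apoptosis"]}
gene_sets = propagate(direct, parents)
```

In this toy example, GENE_A is annotated directly only to the most specific term but also appears in the gene sets of every ancestor, which is why changes in the hierarchy alter enrichment results even when direct annotations stay the same.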
Analysis of gene annotations
Pathway databases were analyzed for growth in total number of pathway terms separately for the three main ontologies in GO (biological processes, molecular functions, cell components) for each year (2009-2016). The same analysis was repeated for human Reactome pathways and GO annotations for model organisms (mouse, Arabidopsis thaliana, fly, yeast). We counted GO terms and Reactome pathways with at least one annotated gene of the studied species. Path lengths and numbers from terms to roots were computed with custom scripts. Human annotations of GO terms and Reactome pathways contained high-confidence protein-coding genes from the nearest previous release of the CCDS database (e.g. 2015 release for 2016 annotations) and only included genes with public status. Density of human gene annotations was assessed with two-dimensional density plots. For each gene, the number of associated processes and pathways (Y-axis) and the median size of corresponding gene sets per gene (X-axis) were shown. Non-annotated genes (i.e. “dark matter”) were included in density estimation but are not shown in the plots. Dark matter genes were selected as protein-coding genes of the corresponding CCDS release that had no annotations in GO or Reactome, or only had top-level GO annotations (one or more of biological_process, cell_component, molecular_function). GO biological processes per gene were also estimated for model organisms (mouse, A. thaliana, fly, yeast) without filtering of CCDS genes and dark matter. The proportion of missing gene symbols was estimated from earlier CCDS releases relative to the most recent CCDS release of 2015. Quality of human gene annotations was assessed in three mutually exclusive categories: genes with at least one Reactome annotation, genes with at least one non-electronic (non-IEA) annotation in GO, and genes with only IEA (Inferred from Electronic Annotation) annotations in GO.
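As an illustration of the hierarchy metrics, the sketch below counts the paths from each term to the root and the mean length of those paths. It is one plausible reading of these metrics rather than the custom scripts used here; the function name, the child-to-parents mapping and the toy terms are all assumptions made for the example.

```python
from functools import lru_cache


def path_stats(parents, root):
    """Return {term: (number of paths to root, mean path length in edges)}."""

    @lru_cache(maxsize=None)
    def walk(term):
        # Returns (n_paths, summed_length) over all paths from `term` to `root`.
        if term == root:
            return (1, 0)
        n_paths, summed_length = 0, 0
        for parent in parents.get(term, ()):
            n, total = walk(parent)
            n_paths += n
            summed_length += total + n  # each of the n paths gains one edge
        return (n_paths, summed_length)

    stats = {}
    for term in parents:
        n, total = walk(term)
        if n:  # skip terms that cannot reach the root
            stats[term] = (n, total / n)
    return stats


# Toy hierarchy with hypothetical terms; "d" reaches the root by three routes.
parents = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["c", "a"],
}
stats = path_stats(parents, "root")
```

Averaging the first element of each tuple over all terms gives a mean number of paths to root per term, and averaging the second gives a mean depth, so both metrics can be tracked across yearly ontology snapshots.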
Pathway Enrichment Analysis
Pathway enrichment analysis was conducted on GO biological processes and Reactome pathways using Fisher’s exact tests. Multiple testing correction was applied separately for GO and Reactome terms using the Benjamini-Hochberg False Discovery Rate (FDR) procedure. Terms with FDR p<0.05 were considered significant. Enrichment analyses of GO and Reactome terms conservatively used separate background gene sets that included all genes with at least one annotation in GO biological processes or Reactome, respectively. We chose this general enrichment strategy, as direct comparison of tools would be confounded by differences in underlying methods.
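The enrichment strategy can be sketched as follows. This is a minimal illustration, not the pipeline used in this study. It assumes a one-sided Fisher’s exact test for over-representation, computed as the upper tail of the hypergeometric distribution; the function names, term labels and gene counts are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact p-value for over-representation.

    k: query genes in the term, n: query list size,
    K: background genes in the term, N: background size.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: term -> (query genes in term, background genes in term).
terms = {"term_x": (5, 50), "term_y": (1, 400)}
query_size, background_size = 75, 15000
pvals = [fisher_enrichment_p(k, query_size, K, background_size)
         for k, K in terms.values()]
fdr = benjamini_hochberg(pvals)
significant = [t for t, q in zip(terms, fdr) if q < 0.05]
```

Running the correction separately for GO and Reactome, as described above, would mean calling `benjamini_hochberg` once per database, each with its own background size.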
Two sets of enrichment analyses were conducted on cancer gene lists using gene annotations from 2016 (corresponding to g:Profiler) and 2010 (corresponding to DAVID). First we analyzed essential breast cancer genes from recent shRNA screens of 77 cell lines30. We separately analyzed top-100 and top-500 lists of genes according to per-gene zGARP scores provided by the study. We counted shared, outdated-only, and recent-only gene annotations enriched in the analyses (FDR p<0.05) and matched these using GO and Reactome term identifiers. The most common terms only found in the up-to-date analysis were visualized with the WordCloud R package. To simulate practical analysis, we did not manually convert outdated gene symbols in breast cancer analysis. Second, the same comparison pipeline was repeated for a smaller set of 75 glioblastoma driver genes found as frequently mutated in pan-cancer analysis31,32 derived from the IntOGen database48. We manually mapped outdated gene symbols in the GBM analysis to create a more conservative analysis scenario. We compared enriched annotations across the years 2009-2015 relative to 2016 and counted common and distinct terms as above. We also visualized the pathway and process enrichments of 2010 and 2016 using the Enrichment Map34 app of Cytoscape49. The Enrichment Map analysis covered pathways with at least four genes. Our observations were also confirmed when all pathways were included. Functional themes and signaling pathways were curated manually.
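The comparison of enriched terms between annotation snapshots reduces to set operations on term identifiers. The sketch below is illustrative only; the function name and the identifiers are placeholders rather than real GO or Reactome accessions.

```python
def compare_enrichments(outdated_terms, recent_terms):
    """Count shared, outdated-only and recent-only enriched term identifiers."""
    outdated_terms, recent_terms = set(outdated_terms), set(recent_terms)
    shared = outdated_terms & recent_terms
    recent_only = recent_terms - outdated_terms
    outdated_only = outdated_terms - recent_terms
    # Fraction of the up-to-date results that the outdated analysis misses.
    fraction_missed = len(recent_only) / len(recent_terms) if recent_terms else 0.0
    return {
        "shared": len(shared),
        "recent_only": len(recent_only),
        "outdated_only": len(outdated_only),
        "fraction_missed": fraction_missed,
    }


# Placeholder identifiers for terms significant at FDR p<0.05 in each snapshot.
enriched_2010 = {"TERM:1", "TERM:2", "TERM:9"}
enriched_2016 = {"TERM:1", "TERM:2", "TERM:3", "TERM:4", "TERM:5"}
summary = compare_enrichments(enriched_2010, enriched_2016)
```

Matching on stable identifiers rather than term names keeps the comparison robust to terms that were renamed between releases.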