Abstract
Since the entry into genome-enabled biology 20 years ago, much progress has been made in determining, describing, and disseminating functions of genes and their products. Yet this information remains difficult for many to access, especially across genomes. To provide easy access to the status of genome function annotation for model organisms and bioenergy and food crop species, we created a web application (https://genomeannotation.rheelab.org) to visualize and download genome annotation data for 27 species. The summary graphics and data tables will be updated semi-annually and snapshots archived to provide a historical record of the progress of genome function annotation efforts.
Background
Rapid advances in DNA sequencing technologies made genome sequences widely available and revealed a plethora of genes encoded within the genomes in the last two decades [1]. The timely invention and wide adoption of the Gene Ontology (GO) system transformed how gene and protein functions are described, quantified, and compared across many organisms [2,3]. Despite this tremendous progress in genome biology, it is still nontrivial for scientists to get a snapshot of the status of genome function annotation across species.
There are several reasons for the difficulty in obtaining the status of genome function annotation across species. First, genome sequences and their annotations are hosted across multiple databases that use different gene/protein/sequence identifier systems. For example, Phytozome [4] uses its own database identifiers for its genes and does not provide cross-database identifier (ID) mapping functionalities. Although some databases include cross-database references and provide tools to map IDs, such as UniProt’s Retrieve/ID mapping and BioMart’s ID conversion [5], these tools are not available for all sequenced genomes. Second, gene function information in the literature and databases is not generally annotated using the GO system. Third, genome function annotation databases generally include only annotated genes, and it is not trivial to retrieve the number and identity of unannotated genes. The importance of unannotated genes is exemplified by a recent success in identifying a minimal bacterial genome that included 473 essential genes [6]. Among these were 149 whose molecular functions remain unknown.
To provide scientists and students an easy way to access the status of genome function annotations of model species and bioenergy and food crops, we created a web application that displays these data graphically and tabularly. The website retrieves data from multiple databases, and generates plots that show the percentages of genes with experimental, computational, or no annotations. The snapshots are updated semi-annually and past snapshots will be archived.
Results and Discussion
To represent the status of genome function annotation, we selected three groups of organisms: model organisms, bioenergy model and crop species, and most annotated plant species (Figure 1). Model organisms are important experimental tools for investigating biological processes and represent key reference points of biological knowledge for other species [7–9]. This panel includes: Arabidopsis thaliana, Caenorhabditis elegans, Danio rerio, Drosophila melanogaster, Mus musculus, and Saccharomyces cerevisiae (Fig. 1A). We also included Homo sapiens, a species for which many model organisms are studied. Next, we selected bioenergy models and crops, which are important in expanding the renewable energy sector needed to combat the climate crisis and steward a more sustainable environment. Biomass is currently the biggest source of renewable energy [10] and is projected to become the biggest source of primary energy by 2050 [11]. The bioenergy models and crops we selected include: Brachypodium distachyon, Chlamydomonas reinhardtii, Glycine max, Miscanthus sinensis, Panicum hallii, Panicum virgatum, Physcomitrium patens, Populus trichocarpa, Sorghum bicolor, and Setaria italica (Fig. 1B). Finally, we selected ten additional plant species that have the largest number of GO annotations in UniProt [12]: Oryza sativa Japonica Group (rice), Gossypium hirsutum (cotton), Spinacia oleracea (spinach), Zea mays (corn), Medicago truncatula, Solanum tuberosum (potato), Ricinus communis (castor bean), Nicotiana tabacum (tobacco), Papaver somniferum (opium poppy), and Triticum aestivum (wheat) (Fig. 1C). These include the world’s most important cereal crops, such as corn, rice, and wheat, and vegetable crops such as potato [13].
There are several ways of accessing the status of genome function annotation for the 27 species. From the front page, visitors can get a quick summary of the state of genome function annotation as pie charts for the three groups of species (Figure 1). These pie charts show the percentage of genes that have: 1) annotations with experimental evidence (green); 2) only computationally generated annotations (blue); or 3) no annotations, or annotations indicating unknown function (Figure 1). Of the 7 selected model organisms, S. cerevisiae has the highest percentage of genes with experimental evidence and the fewest genes that are unannotated or annotated as having unknown function, followed by H. sapiens and A. thaliana. Among the model organisms, C. elegans is the least characterized species, with the greatest number of genes unannotated or annotated as having unknown function. Most of the plant species have too few experimentally supported GO annotations to be visible in the pie charts. Visitors can get more detailed information about any of the species by clicking on the species name below the pie charts. Each species page shows additional information about the annotation status, including the portion of genes annotated to at least one GO domain (molecular function, cellular component, and biological process [2,3]) as well as a Venn diagram showing the overlap of genes annotated to more than one GO domain (Figure 2). This page also has links to source data and a tabular format of the annotation summary for browsing and downloading.
In developing our web application, we came across a few hurdles. First, there was not a single site where all data were available. To obtain GO annotations for the 27 species, we had to visit at least three databases. A positive finding was that all sites that had GO annotations were using the GO Annotation File (GAF) format. Nevertheless, having a single entry point where GO annotations of any species can be accessed would be useful. Second, our website includes genes that are unannotated, which are often missing in gene function annotations and enrichment analyses [14]. Currently, extracting genes that are not annotated is not trivial and requires many steps that are different for each species. Including the unannotated genes of a genome in GAF files would facilitate many downstream applications.
To our surprise, some plant species with well-maintained, species-specific databases seem to have a low number of experimentally supported GO annotated genes in UniProt. Other than TAIR, which provides GO annotations for A. thaliana [19], we were not able to find any database that provides experimental evidence codes for its GO annotations. Apart from Nicotiana tabacum and Papaver somniferum, all plant species on our website are included in the most recent version of Phytozome (V13), but their GO terms are assigned computationally [4]. The Sol Genomics Network (SGN) (https://solgenomics.net accessed 13 June 2022) [15] hosts genome annotations of Solanaceae species, including Nicotiana tabacum and Solanum tuberosum. An annotation file for Nicotiana tabacum is available [16], but its annotations are assigned computationally using InterProScan [17]. SpudDB [18] (http://spuddb.uga.edu/ accessed 13 June 2022) provides GO annotations for Solanum tuberosum, but they are generated with InterProScan and by best hit to the Arabidopsis proteome (TAIR10) [19]. MaizeGDB [20] (https://www.maizegdb.org/ accessed 13 June 2022) provides GO annotations for Zea mays that are assigned with GO annotation tools including Argot2.5, FANN-GO, and PANNZER [21], all of which produce computational annotations. SpinachBase (http://www.spinachbase.org/ accessed 13 June 2022) provides centralized access to Spinacia oleracea data, and its GO annotations are generated computationally with Blast2GO [22]. Oryza sativa Japonica Group GO annotations can be found at the Rice Genome Annotation Project [23], and they are assigned with BLASTP searches against Arabidopsis GO-curated proteins [24]. Gramene [31] (https://www.gramene.org/ accessed 13 June 2022) hosts genome data for many species, but we could not find GO annotations with evidence codes.
We were not able to find species-specific databases that provide GO annotations for Triticum aestivum, Gossypium hirsutum, Medicago truncatula, Papaver somniferum, or Ricinus communis. In summary, most plant genome databases stop at computationally generating GO annotations, and some important species do not appear to have dedicated databases. More effort is needed both in experimentally validating functional annotations made from computational approaches and in curating experimentally supported function descriptions in the literature into structured annotations such as GO, which will be crucial for accelerating gene function discovery.
Conclusions
Our website provides a convenient way to obtain the current state of genome function annotation for model organisms and crops for bioenergy, food, and medicine. Our website shows how much is annotated and unannotated in the 27 species that represent some of the most intensely studied and arguably the most valuable organisms for science and society. By proxy, these charts illustrate how much is known and unknown. These snapshots will be updated on a semi-annual basis, and comparing the charts across time will reflect how biological knowledge changes over time. These snapshots can be useful in many contexts including research projects, grant proposals, review articles, annual reports, and outreach materials. The data summarized on this website can be linked to their sources, which can be used for a variety of investigations. Successful examples include exploring why certain proteins remain unannotated [25], developing pipelines to infer function without relying on sequence similarity [26], and assessing annotation coverage across bacterial proteomes [27]. As our society transitions into biology-enabled manufacturing [32], fundamental knowledge of how genes and their products function at various scales will be crucial in ushering in the era of bio-economy.
Methods
Selecting species and data retrieval
For the seven model organisms, gene function annotations were downloaded as GO Annotation Files (GAF files) from the GO consortium website (http://current.geneontology.org/products/pages/downloads.html accessed 13 June 2022) of the 2022-05-16 release. Genes found in a genome were retrieved from the source indicated on the GO annotation download page as General Feature Format (GFF) files. A detailed description of the files used to generate charts on our website, including data for the other category of species, can be found in Table S1.
Genome annotations and gene lists for bioenergy models and crops were downloaded from Phytozome version V13 (https://data.jgi.doe.gov/refine-download/phytozome accessed 13 June 2022). Although some species in this category had GO annotations in the GO consortium database, the sequence identifiers (IDs) for genes could not easily be mapped to Phytozome IDs. To maintain consistency within this category, all annotation files were downloaded from Phytozome. All Phytozome GO annotations are computationally generated [4].
For the last category of plant species, we selected the most annotated plant species from the UniProt GO annotation database [28] GAF files hosted on the GO consortium website (http://current.geneontology.org/products/pages/downloads.html 2022-05-16 release, accessed 13 June 2022). We downloaded these species’ reference proteomes from UniProt release 2022_02 and retrieved the number of corresponding genes.
Using the evidence codes provided in the GAF files, we counted the genes annotated with GO terms supported by experimental evidence. If a gene has at least one GO term annotated using any of the following codes: EXP (Inferred from Experiment), IDA (Inferred from Direct Assay), IPI (Inferred from Physical Interaction), IMP (Inferred from Mutant Phenotype), IGI (Inferred from Genetic Interaction), or IEP (Inferred from Expression Pattern), we categorized the gene as having “Experimental Evidence” for function. Genes that have at least one annotated GO term, none of which carries any of the evidence codes listed above, are categorized as “Predicted”. Since Phytozome has only computationally generated GO annotations, all of its annotated genes are categorized as having their functions “Predicted”. By subtracting the number of annotated genes from the total number of genes, we obtained the number of genes without any GO annotations. These numbers were used to generate pie charts showing the proportions of genes in each category for every species.
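The categorization described above can be illustrated with a minimal Python sketch. It is not the code from the project repository: the function name `classify_genes` and its arguments are hypothetical, and the sketch assumes the standard tab-delimited GAF layout in which the gene identifier is in column 2 and the evidence code is in column 7. It also assumes that only annotations whose identifiers appear in the genome's gene list are counted, so that the three categories sum to the total number of genes. Any special handling of annotations that merely record an unknown function is not covered here.

```python
# Evidence codes treated as experimental in the Methods text.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def classify_genes(gaf_lines, all_gene_ids):
    """Count genes per annotation category (hypothetical helper).

    gaf_lines    -- iterable of lines from a GAF file
    all_gene_ids -- iterable of every gene ID in the genome
    """
    all_gene_ids = set(all_gene_ids)
    annotated = set()      # genes with at least one GO annotation
    experimental = set()   # genes with at least one experimental code

    for line in gaf_lines:
        # Skip blank lines and GAF header/comment lines, which start with "!".
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # malformed row: no evidence code column
        gene_id, evidence = fields[1], fields[6]
        # Assumption: ignore annotations for IDs absent from the gene list.
        if gene_id not in all_gene_ids:
            continue
        annotated.add(gene_id)
        if evidence in EXPERIMENTAL_CODES:
            experimental.add(gene_id)

    return {
        "Experimental Evidence": len(experimental),
        # Annotated, but never with an experimental evidence code.
        "Predicted": len(annotated - experimental),
        # Total genes minus annotated genes.
        "Unannotated": len(all_gene_ids - annotated),
    }
```

Dividing each count by the total number of genes gives the proportions shown in a pie chart. A gene with both experimental and computational annotations falls into the “Experimental Evidence” category, because a single experimental code is sufficient.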
All files were processed with scripts written in Python (3.10). All pie charts were generated using Python Matplotlib version 3.5.2, and Venn diagrams were generated using Python matplotlib-venn version 0.11.7. The code repository is available on GitHub (https://github.com/bxuecarnegie/AnnotationStats).
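The gene sets underlying a three-way Venn diagram of GO domains can be derived from a GAF file as in the following minimal sketch. It is an illustration under stated assumptions rather than the repository code: the function names `genes_by_aspect` and `venn_region_sizes` are hypothetical, and the sketch assumes the standard GAF layout in which column 9 holds the aspect code (F for molecular function, P for biological process, C for cellular component) and column 2 holds the gene identifier.

```python
ASPECTS = ("F", "P", "C")  # molecular function, biological process, cellular component


def genes_by_aspect(gaf_lines):
    """Collect the set of annotated gene IDs for each GO aspect."""
    genes = {aspect: set() for aspect in ASPECTS}
    for line in gaf_lines:
        # Skip blank lines and GAF header/comment lines, which start with "!".
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # malformed row: no aspect column
        gene_id, aspect = fields[1], fields[8]
        if aspect in genes:
            genes[aspect].add(gene_id)
    return genes


def venn_region_sizes(genes):
    """Return the number of genes in each of the seven Venn regions."""
    f, p, c = genes["F"], genes["P"], genes["C"]
    return {
        "F only": len(f - p - c),
        "P only": len(p - f - c),
        "C only": len(c - f - p),
        "F and P only": len((f & p) - c),
        "F and C only": len((f & c) - p),
        "P and C only": len((p & c) - f),
        "F, P and C": len(f & p & c),
    }
```

The three sets returned by `genes_by_aspect` could also be passed directly to the `venn3` function of matplotlib-venn, which accepts a list of three sets. The explicit region counts are shown here because they are the values reported in a tabular summary.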
Creating the Website
To create a website for hosting our charts, we used Node.js [29] for our server-side environment, which provides the Application Program Interface (API) for the front end to retrieve the plots generated by Python. The front end of the website uses AngularJS [30].
Declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Availability of data and materials
Data used in this study are all publicly available. GO annotation files were downloaded from the GO consortium website (http://current.geneontology.org/annotations/index.html 2022-05-16 release, accessed 13 June 2022) and Phytozome (https://data.jgi.doe.gov/refine-download/phytozome V13, accessed 13 June 2022). Gene data were downloaded from the sources indicated on the GO consortium download page (http://current.geneontology.org/products/pages/downloads.html accessed 13 June 2022), from Phytozome, and from UniProt (https://www.uniprot.org/ accessed 13 June 2022). Supplemental Table S1 provides detailed information on all species annotation and gene source databases, downloaded versions, and URLs. Graphs and statistics data generated in this study are available at http://genomeannotation.rheelab.org/ (accessed 13 June 2022). Scripts used to process the data and generate the graphs are written in Python 3 and are available on GitHub (https://github.com/bxuecarnegie/AnnotationStats accessed 13 June 2022).
Competing interests
The authors declare no competing interests.
Funding
This work was supported, in part, by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research, Genomic Science Program grant nos. DE-SC0018277, DE-SC0008769, DE-SC0020366 and DE-SC0021286 and the U.S. National Science Foundation grants MCB-1617020 and IOS-1546838. This work was done on the ancestral land of the Muwekma Ohlone Tribe, which was and continues to be of great importance to the Ohlone people.
Authors’ contributions
SYR conceived the project and BX implemented the project. BX and SYR wrote and edited the manuscript.
Supplemental Information
Additional file 1
Table S1 Source data. A table describing the data sources, versions downloaded, and URLs
Acknowledgements
We would like to thank the members of Rhee lab for the discussions and suggestions on the project.
Footnotes
Emails: srhee{at}carnegiescience.edu (SYR); bxue{at}carnegiescience.edu (BX)
http://current.geneontology.org/products/pages/downloads.html