Abstract
A global genome database of all of Earth’s species diversity could be a treasure trove of scientific discoveries. However, regardless of the major advances in genome sequencing technologies, only a tiny fraction of species have genomic information available. To contribute to a more complete planetary genomic database, scientists and institutions across the world have united under the Earth BioGenome Project (EBP), which plans to sequence and assemble high-quality reference genomes for all ∼1.5 million recognized eukaryotic species through a stepwise phased approach. As the initiative transitions into Phase II, where 150,000 species are to be sequenced in just four years, worldwide participation in the project will be fundamental to success. As the European node of the EBP, the European Reference Genome Atlas (ERGA) seeks to implement a new decentralised, accessible, equitable and inclusive model for producing high-quality reference genomes, which will inform EBP as it scales. To embark on this mission, ERGA launched a Pilot Project to establish a network across Europe to develop and test the first infrastructure of its kind for the coordinated and distributed reference genome production on 98 European eukaryotic species from sample providers across 33 European countries. Here we outline the process and challenges faced during the development of a pilot infrastructure for the production of reference genome resources, and explore the effectiveness of this approach in terms of high-quality reference genome production, considering also equity and inclusion. The outcomes and lessons learned during this pilot provide a solid foundation for ERGA while offering key learnings to other transnational and national genomic resource projects.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
To indicate shared authorship of the first three authors, and to move the senior authors to the back of the author list. No edits have been made to the text of the manuscript.
Glossary
- Biodiversity genomics
- The application of genomic methods to research biodiversity.
- BUSCO
- A bioinformatic method (Benchmarking Universal Single-Copy Orthologues) used to estimate the completeness of the coding fraction of an organism’s genome based on the proportion of (lineage specific) single copy orthologous genes that are found in a genome assembly 51.
- INSDC
- International Nucleotide Sequence Database Collaboration (https://www.insdc.org/) is an initiative between the DDBJ, EMBL-EBI and NCBI that together act as a global repository of sequence data and associated metadata, and provide tools and services that allow access to genomic resources.
- Reference genome
- An accepted standard representation of an organism’s DNA sequence. High-quality reference genomes typically have high completeness (chromosome-level with few gaps in sequence), few errors, and are annotated and accessible. A reference genome serves as a tool for alignment-based analyses, such as variant calling or RNAseq, and has many other applications, for example, phylogenetics and evolutionary relationships, identification of genes and variants, functional analysis and comparative genomics. Reference genomes referred to as “drafts” are those that are under active construction and refinement, and not yet finalised through manual curation.
- Genomic resource
- A genomic resource, for the purpose of this manuscript, refers to a reference genome, genome annotation, voucher specimen, cryopreserved sample and comprehensive metadata.
- FAIR Principles
- A set of principles to guide appropriate management and curation of scientific data (https://www.go-fair.org/fair-principles/) that emphasise data accessibility and use by ensuring that data are Findable, Accessible, Interoperable, and Reusable. Due to the increasing amount of scientific data being reposited, FAIR guidelines promote a data format that is amenable to automated computational access of data by stakeholders63.
- CARE Principles
- The CARE principles for Indigenous data governance (https://www.gida-global.org/care) provide a governance framework that supports the recognition of rights and interests Indigenous Peoples’ to their physical and digital data as well as their Indigenous Knowledges64.
- Metadata
- A collection of data that provides contextual information about multiple characteristics of other, corresponding original data.
- Voucher
- A voucher specimen is a permanently preserved object (either whole or in part, and/or physical or digital) of an identified organism (verified by a recognised expert) and which is deposited in an accessible facility or database. A voucher provides physical evidence about any specimen’s taxonomic identity14. Voucher deposition is a best practice for conducting biodiversity genomics research.
- (Genome) annotation
- The process of identifying the functions of different pieces of a genome. This includes genes that code for proteins and non coding features (e.g. intron-exon structure of protein coding genes, promotors, transposable elements). Typically performed using computational methods, followed by manual curation.
- (Genome) completeness
- An estimate of how well a reference genome represents the complete sequence of the target organism. A complete genome should equal the haploid genome size of the target, but may be defined when ‘all chromosomes are gapless and have no runs of 10 or more ambiguous bases, there are no unplaced or unlocalized scaffolds, and all expected chromosomes are present.’ (https://www.ncbi.nlm.nih.gov/assembly/). There are different approaches to estimate the completeness, like BUSCO, analysing K-mers, etc.
- Library
- DNA, cDNA, or RNA that has been prepared for NGS within (usually) a specific size range and containing adapters, which are designed to be appropriate for (a) specific sequencing platform(s).
- (Genome) assembly
- A genome assembly is a representation of an organism’s genome that is made using computer programs to turn (assemble) raw sequence data into longer, continuous sequences.
- PUID
- A permanent unique identifier is a unique label for an object that does not change, such as the Digital Object Identifier (DOI) attached with a scientific publication.
- ENA
- The European Nucleotide Archive (https://www.ebi.ac.uk/ena) is a global repository for sequence data and provides resources that support management and access to sequence data.
- Equity Deserving
- According to the Canadian Council (https://canadacouncil.ca/glossary/equity-seeking-groups) equity deserving groups are those individual researchers, communities, Peoples, regions or countries that have identified barriers to equal access, opportunities, and resources due to disadvantage and/or discrimination and that are actively seeking, and deserving of social justice and reparation. The discrimination experienced could be caused by attitudinal, historic, social, and environmental barriers that could be based on a plethora of characteristics that are including (but not limited to) sex, age, ethnicity, disability, economic status, gender, gender expression, nationality, race, sexual orientation, and creed.
- COPO
- The Collaborative OPen Omics (COPO) platform is for researchers to publish their research assets, providing metadata annotation and deposition capability. It allows researchers to describe their datasets according to community standards and broker the submission of such data to appropriate repositories whilst tracking the resulting accessions/identifiers28.
- Open data
- Open data are freely accessible and unrestricted data that can be accessed, used, reused and shared with third parties for any purpose.
- HSM
- Hierarchical Storage Management is both a data management and data storage technique which transparently manages the movement of data between the different layers of a tiered storage based on file size thresholds, usage and I/O pressure. Usually, a tiered storage is composed of one or more layers of disk arrays, ordered by capacity, latency, redundancy and storage cost. A slow but economically effective archival layer is at the bottom, composed of magnetic tape libraries and automated tape robots, with the highest capacity and latency. The movement between layers is automatically triggered.
- ONT
- Oxford Nanopore Technologies (ONT; https://nanoporetech.com/) is a next generation sequencing technology whereby sequence data are generated from the changes in current that occur as single-stranded DNA or RNA molecules pass through nanoscale protein pores (nanopores). ONT provides long read data (up to several megabases) that facilitate genome assembly65,66.
- PacBio
- Pacific Biosciences (PacBio; https://www.pacb.com/) is a single-molecule, real time (SMRT) next generation sequencing technology in which sequence data are generated by fluorescent light emission that occurs when a DNA polymerase adds nucleotides. PacBio produces long read data (tens of kilobases) that facilitate genome assembly.
- HiFi reads
- HiFi (High Fidelity) PacBio reads are produced by taking multiple sequences of the same molecule to provide a consensus sequence that is usually 12-20kbp long and has a low error rate (>99.9 % consensus accuracy)67.
- Hi-C
- Sequencing-based method used to study three-dimensional interactions among chromatin regions by measuring the frequency of contact between pairs of loci. Since contact frequency is related to the distance between a pair of loci, Hi-C linking information is used to help with scaffolding stages during a genome assembly process.
- Hi-C map / graph production
- The occurrence and frequency of Hi-C contacts are analysed and used in assembly scaffolding. They are typically visualised in Hi-C 2D heatmaps with the full genome sequence on the X and Y axis and a markup for each observed contact.
- Omni-C
- Modified version of Hi-C that uses a sequence-independent endonuclease during its protocol to produce more even sequence coverage increasing overall resolution.
- RNA-Seq
- RNA-Seq is a technique that determines the complete or partial RNA sequence using NGS. The RNA expression profiles vary in different tissues of the same organism and can be influenced by physiopathological circumstances. RNA-Seq data facilitate genome assembly by providing empirical evidence for annotation of transcribed regions68.
- IsoSeq
- This is a sequencing protocol developed by PacBio that aims to sequence full-length transcripts using the accurate, long read capabilities of PacBio HiFi technology. IsoSeq data facilitate analysis of transcriptomes and genome annotation by identifying full-length isoforms of transcripts.
- Haplotype
- A haplotype refers to the collection of genetic material within an organism that is inherited together. Haplotype may be used to describe a few loci or any number of chromosomes (a chromosome-scale haplotype).
- K-mer
- A K-mer is a DNA sequence of length k; for example, the sequence AGCT contains the 3-mers (K-mers of length 3) AGC and GCT.
- Transcriptome
- A transcriptome is a set of aligned RNAseq reads representing RNA collected from a sample or collection of samples. This includes both protein-coding and non-coding transcripts. For the ERGA Pilot Project, poly-A+ transcripts were profiled.
- Interested Parties
- This term, for the purposes of this manuscript refers to the range of external stakeholders (e.g., commercial companies, policymakers etc) and rights holders (e.g., Indigenous Peoples) that have an interest in biodiversity genomics research.
- EBP Genome assembly quality standard 6..Q40
- Minimum reference standard of 6.C.Q40, i.e. megabase N50 contig continuity and chromosomal scale N50 scaffolding, with less than 1/10,000 error rate. For species with chromosome N50 smaller than a megabase this will be C.C.Q40. Additional recommendations include K-mer completeness >90%, BUSCO complete single-copy single >90%, BUSCO complete single duplicate < 5%, and Gaps/Gbp <1000.
- Widening Country
- Widening countries are countries with low participation rates in FP7 and H2020 projects (low level of investment into research and innovation (R&I)). According to the Horizon Europe regulation the Widening countries are: Bulgaria, Croatia, Cyprus, Czech republic, Estonia, Greece, Hungary, Latvia, Lithuania, Malta, Poland, Portugal, Romania, Slovakia, Slovenia and all associated countries with equivalent characteristics in terms of R&I performance and the Outermost Regions.