MaveDB v2: a curated community database with over three million variant effects from multiplexed functional assays

A central problem in genomics is understanding the effect of individual DNA variants. Multiplexed Assays of Variant Effect (MAVEs) can help address this challenge by measuring all possible single nucleotide variant effects in a gene or regulatory sequence simultaneously. Here we describe MaveDB v2, which has become the database of record for MAVEs. MaveDB now contains a large fraction of published studies, comprising over two hundred datasets and three million


Background
Genomes contain both genes that encode RNAs and proteins and non-coding regulatory elements that modulate gene expression. Variation within genomes produces interindividual differences that govern a multitude of traits, including many that relate to disease. Because DNA sequencing is now inexpensive and widely deployed, human genetic variants are being discovered at a staggering pace. For example, approximately 241 million small variants comprising single nucleotide changes and small deletions/insertions have been identified in 140,000 individuals in gnomAD [1]. Of these variants, 4.6 million are single amino acid changes (i.e. missense variants). 4.5 million missense variants have been identified among 200,000 individuals in the UK Biobank [2]. In contrast, only 437,000 missense variants have been annotated in ClinVar, of which greater than 70% are currently variants of uncertain significance and thus cannot be used for clinical decision making [3]. Understanding how these existing variants, as well as the ones we will discover as more individuals are sequenced, impact molecular, cellular and organismal phenotypes represents a central challenge for genomics.
In the past, genetic variants would be tested for functional effects in bespoke assays singly or in relatively low numbers, but more recent technologies have enabled the increasingly popular multiplexed assays of variant effect (MAVEs) approaches [4]. In a MAVE, the functional effects of thousands or tens of thousands of variants of a DNA regulatory region, coding genes, UTRs or other functional elements are simultaneously experimentally determined. To achieve this scale, a large library of variants is made and tested in a pooled fashion, with high-throughput DNA sequencing used to read out variant effects (for a detailed description see [5][6][7]). The result is a comprehensive variant effect map, which contains the experimentally measured effects of most or all of the possible single nucleotide or missense variants. Variant effect maps are of high utility. For example, in genes where germline variants can increase disease risk, variant effect maps can help to resolve up to ~70% of clinical variants of uncertain significance [8]. Variant effect maps can also be used to probe the protein sequence/function relationship, assist in protein design [43], reveal protein structure [44,45], elucidate regulatory DNA and gene function [46][47][48][49] and train or evaluate variant effect predictors [50][51][52]. Thus, multiplexed functional data are poised to have a major impact, and efforts are now underway to scale up and apply MAVEs to a significant fraction of the human genome [53][54][55].
However, multiplexed functional data availability has been a major hurdle. In 2019 we created MaveDB, a public, open source repository for multiplexed functional data [56]. MaveDB allows researchers to store, share, and access processed multiplexed variant functional data and associated metadata in a standardized, searchable format. MaveDB implements an easy-to-use web interface that facilitates data deposition by researchers. However, MaveDB suffered from three key limitations. First, it contained only a small fraction of the data available at the time. Second, data from new multiplexed assays such as saturation genome editing [14,57] were not compatible with the original MaveDB data model. Finally, MaveDB was not designed with federation across genomic data resources in mind.
Here we describe several technical advances and data model improvements implemented in MaveDB v2, and provide a progress update on our extensive curation of previously published multiplexed assay results and expansion of database content. We have improved the user experience for contributors by adding API-based user uploads and a companion Python module for researchers with bioinformatics or data science experience. We have updated our data model by adding a new type of record for imputation or the combination of results across multiple experiments. We have also refined and formalized our variant representation, allowing us to support more diverse types of MAVE variants and associated experimental designs, while also improving compatibility with existing international standards. In addition to these technical improvements, we have curated 161 new datasets from the literature, encompassing two million new variants and more than tripling the total number of variant effect measurements in the database.

Database functionality and content
MaveDB is designed to store and distribute multiplexed variant functional data, including scores, associated data and metadata. A MAVE dataset minimally includes a collection of variant effect scores that describe the functional consequences of nucleotide or amino acid variants. Each variant effect score can also be associated with a variance estimate, variant frequency or other data. The metadata includes descriptions of the experimental and data analysis methods and target information, such as the sequence and accession numbers in other databases.
MaveDB has a hierarchical structure populated by score set, experiment, and experiment set records. Score sets contain the variant effect scores and associated data columns, such as variance estimates and variant counts, details about the experimental target sequence, and a description of the score calculations. Experiments group together multiple score sets that depend on the same raw data, thereby improving discoverability for users, and describe the assay that was performed. Experiment sets do not have any data or metadata themselves, but group related experiments, such as multiple assays performed on a single target and described in the same publication.
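The three record types can be pictured as nested containers. The following is a minimal sketch of that hierarchy, not the MaveDB schema: the class and field names are hypothetical and chosen only to mirror the description above, and the example values are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScoreSet:
    """Variant effect scores plus a description of how they were computed."""
    title: str
    target_sequence: str
    scores: Dict[str, float]  # variant string -> score
    # Optional per-variant columns, e.g. {"variance": {variant: value}}
    extra_columns: Dict[str, Dict[str, float]] = field(default_factory=dict)


@dataclass
class Experiment:
    """Describes one assay and groups score sets that share raw data."""
    title: str
    assay_description: str
    score_sets: List[ScoreSet] = field(default_factory=list)


@dataclass
class ExperimentSet:
    """Carries no data or metadata itself; only groups related experiments."""
    experiments: List[Experiment] = field(default_factory=list)

    def all_score_sets(self) -> List[ScoreSet]:
        # Flatten the hierarchy to list every score set under this record.
        return [s for e in self.experiments for s in e.score_sets]


# Made-up example: two assays on one target, each with one score set.
abundance = Experiment(
    "Assay 1", "abundance readout",
    [ScoreSet("Assay 1 scores", "ATGGAA", {"c.2T>C": -0.4})],
)
activity = Experiment(
    "Assay 2", "activity readout",
    [ScoreSet("Assay 2 scores", "ATGGAA", {"c.2T>C": -1.1})],
)
study = ExperimentSet([abundance, activity])
```

Keeping the experiment set free of data means it can be created or regrouped without touching any scores.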

Meta-analysis score sets
In addition to standard score sets linked to an experiment and its raw data, MaveDB v2 implements meta-analysis score sets. These new records are linked to one or more existing score set records, and are used to describe a distinct version of the scores that have been transformed in some way. For example, a dataset that has imputed the values of missing scores would be best represented as a meta-analysis score set linked to the pre-imputation scores, ensuring the pre-imputation scores are preserved and discoverable. Another example is a set of scores that result from the combination of multiple existing score sets (Figure 1). We recommend that users upload minimally transformed scores as standard score sets to MaveDB, and create new meta-analysis score sets that describe normalization or imputation steps as applicable. This enables other researchers who want to build models that would be sensitive to data normalization or evaluate their own normalization methods to make use of the data without having to perform a full re-analysis from counts or sequence reads.
With its hierarchical data model and the introduction of meta-analysis score sets, MaveDB enables provenance tracking for individual variant measurements from a multiplexed assay. This tracking is critical for downstream usage, particularly in clinical pipelines, where it is essential to understand the independence of different data sources for evaluating variant pathogenicity using the ACMG guidelines [58].
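To illustrate why these links matter for judging independence, the sketch below walks from a score set back to the experiments it ultimately depends on. It is an illustration under assumed names (`ScoreSetRecord`, `source_experiments` and `are_independent` are hypothetical), not code from MaveDB.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ScoreSetRecord:
    name: str
    # Standard score sets point at the experiment that produced them.
    experiment: Optional[str] = None
    # Meta-analysis score sets point at the score sets they transform.
    derived_from: List["ScoreSetRecord"] = field(default_factory=list)


def source_experiments(record: ScoreSetRecord) -> Set[str]:
    """Return every experiment that a score set ultimately depends on."""
    if record.experiment is not None:
        return {record.experiment}
    found: Set[str] = set()
    for parent in record.derived_from:
        found |= source_experiments(parent)
    return found


def are_independent(a: ScoreSetRecord, b: ScoreSetRecord) -> bool:
    """Two score sets are independent if they share no source experiment."""
    return source_experiments(a).isdisjoint(source_experiments(b))


# Made-up example: two assays combined into one meta-analysis score set.
assay_1 = ScoreSetRecord("assay 1 scores", experiment="experiment 1")
assay_2 = ScoreSetRecord("assay 2 scores", experiment="experiment 2")
combined = ScoreSetRecord("combined scores", derived_from=[assay_1, assay_2])
```

Here the two assay score sets are independent of each other, but the combined score set is not independent of either of them, so it should not be counted as a separate line of evidence.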

Standardized variant representation
Previously, MaveDB used a variant representation based on the Enrich2 [59] output format, which was in turn inspired by the HGVS Sequence Variant Nomenclature [60]. Starting with MaveDB v2, we have adopted a revised variant representation we call MAVE-HGVS. MAVE-HGVS is a subset of HGVS version 20.05. HGVS nomenclature is comprehensive and very expressive, and consequently includes syntax that is not needed to represent variants from multiplexed variant functional data. While packages exist for parsing HGVS [61,62], they are intended for use in human genetics and rely on sequence database entries that are not always available for multiplexed assays. MAVE-HGVS is an easy-to-parse subset of the HGVS nomenclature that captures the types of variants that routinely occur in MAVE datasets while not relying on external sequence databases or identifiers.
In addition to the specification, MAVE-HGVS has a reference Python implementation, mavehgvs. We use this implementation in MaveDB v2 to validate variants when multiplexed variant functional data are uploaded to the database, both to make sure that the variant strings are in a valid format and to ensure that variants are consistent with the target sequence. MAVE-HGVS offers three major improvements over the old MaveDB format. First, MAVE-HGVS is much more easily convertible to standard HGVS. Second, the previous format did not define target-identical ("wild type") variants in a standard way and was not able to define target-identical variants by position, a feature that is needed for MITE-Seq datasets [31]. Finally, we can continue to draw from HGVS as support for new variant types is needed. We have already taken advantage of this by adding splice variants to more faithfully represent datasets curated from the literature.
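The two validation steps, format and consistency with the target, can be illustrated with a deliberately small sketch. This is not mavehgvs and covers only a narrow, assumed slice of the syntax: single protein substitutions such as p.Glu2Trp and position-specific target-identical variants such as p.Glu2=. The function name and return values are hypothetical.

```python
import re

# Three-letter to one-letter amino acid codes ("Ter" is a stop codon).
AA3_TO_1 = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}

_AA = "|".join(AA3_TO_1)
_PATTERN = re.compile(
    rf"^p\.(?P<ref>{_AA})(?P<pos>[1-9][0-9]*)(?P<alt>{_AA}|=)$"
)


def check_protein_variant(variant: str, target: str) -> str:
    """Return "ok" or a short reason why the variant string is rejected.

    `target` is the target protein sequence in one-letter codes.
    """
    match = _PATTERN.match(variant)
    if match is None:
        return "invalid format"
    position = int(match.group("pos"))
    if position > len(target):
        return "position outside target"
    if target[position - 1] != AA3_TO_1[match.group("ref")]:
        return "reference does not match target"
    return "ok"
```

Because the check needs only the target sequence supplied with the dataset, it does not depend on any external sequence database, which is the property the text highlights.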

Community tools
To empower the MAVE community, we developed MaveTools, a Python library designed for interactive development environments like Jupyter notebooks [63], replacing our previous command-line tool mavedbconvert. MaveTools has functions to convert commonly used variant representations such as Enrich seqid [64] to MAVE-HGVS and create MAVE-HGVS strings that describe the differences between pairs of codons. MaveTools also implements a local version of the dataset validation logic used by the MaveDB server, making it easier for users to identify and resolve formatting errors. In addition to MaveDB-specific features, we are augmenting the library with general utility functions for researchers working with MAVE datasets. MaveDB, mavehgvs, and MaveTools are all distributed under OSI-approved open source licenses and development activity is ongoing in their public GitHub repositories (see Code availability section). We envision that MaveTools will become a valuable community resource, particularly for students and others who are new to working with MAVE data. We welcome contributions and engagement from other members of the community in those spaces or by contacting us directly.
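As an illustration of describing the difference between a pair of codons, the sketch below builds an HGVS-style coding string from a reference codon, a variant codon and a codon number. It is not the MaveTools implementation; the function name is hypothetical, and the choice to report any multi-nucleotide change within a codon as a single delins is one plausible convention assumed here.

```python
def codon_change_to_hgvs(ref_codon: str, alt_codon: str, codon_number: int) -> str:
    """Describe the change from ref_codon to alt_codon as an HGVS-style string.

    codon_number is 1-based, so codon 1 covers coding positions 1 to 3.
    """
    if len(ref_codon) != 3 or len(alt_codon) != 3:
        raise ValueError("codons must be exactly three nucleotides long")
    if codon_number < 1:
        raise ValueError("codon_number must be 1 or greater")

    start = 3 * (codon_number - 1) + 1  # coding position of the first base
    changed = [i for i in range(3) if ref_codon[i] != alt_codon[i]]

    if not changed:
        # No difference; position-specific forms are not attempted here.
        return "c.="
    if len(changed) == 1:
        i = changed[0]
        return f"c.{start + i}{ref_codon[i]}>{alt_codon[i]}"

    # Two or three changes: report one delins from first to last changed base.
    first, last = changed[0], changed[-1]
    return f"c.{start + first}_{start + last}delins{alt_codon[first:last + 1]}"
```

For example, `codon_change_to_hgvs("ATG", "ACG", 1)` gives `c.2T>C`, while changing the first and third bases of the same codon gives a delins spanning all three positions.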

Improved API support
The previous version of MaveDB only accepted data via a web form. We now also support data deposition through a REST API. The API uses the same logic and validation as the web interface, ensuring continuity and data integrity. This makes depositing some datasets much more efficient, such as a series of similar assays that measure variant effects under different concentrations of a small molecule. We hope that authors of MAVE analysis pipelines will consider adopting this API endpoint as an output option.
MaveTools provides REST API wrappers so users with a bioinformatics or data science background can query the database and upload or download datasets. By formatting a Python object that contains appropriate metadata and file paths for score and count files, users can validate the data and upload it to the server within a script or interactive session. Researchers who are building machine learning models or otherwise using large amounts of MAVE data can also use the wrappers to automate downloads. The MaveTools documentation and source code repository include example Jupyter notebooks for both upload and download that users can follow and modify.
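The benefit of scripted deposition is clearest for a series of near-identical assays. The sketch below prepares one metadata object per small-molecule concentration. It stops short of any upload call, and everything in it is an assumption made for illustration: the `ScoreSetDraft` class, its field names and the file naming scheme are hypothetical and do not reflect the MaveTools API.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ScoreSetDraft:
    """Hypothetical container for the metadata and file paths of one upload."""
    title: str
    short_description: str
    scores_path: str
    counts_path: str


def drafts_for_concentrations(
    target: str,
    compound: str,
    concentrations_um: Sequence[float],
    data_dir: str,
) -> List[ScoreSetDraft]:
    """Build one draft per concentration from a shared template."""
    drafts = []
    for concentration in concentrations_um:
        tag = f"{concentration:g}uM"  # e.g. 0.5 -> "0.5uM", 10.0 -> "10uM"
        drafts.append(
            ScoreSetDraft(
                title=f"{target} variant effects at {tag} {compound}",
                short_description=(
                    f"Scores for {target} variants assayed with {tag} {compound}."
                ),
                scores_path=f"{data_dir}/scores_{tag}.csv",
                counts_path=f"{data_dir}/counts_{tag}.csv",
            )
        )
    return drafts


# Made-up example: three concentrations of a placeholder compound.
drafts = drafts_for_concentrations("GENE1", "compound X", [0.5, 2, 10.0], "data")
```

Each draft could then be validated locally and passed to whichever upload wrapper is in use, so that adding a concentration means adding one number rather than filling in another web form.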
We have also added a new API endpoint that allows users to download data on individual variants. This feature currently only supports the MaveDB variant IDs that are assigned when the variants are created in our database, but we are in the process of mapping these IDs to more widely used accession numbers, particularly for genes of clinical relevance.

Database curation and contents
When the original MaveDB manuscript was published in 2019, data from only 7% of publications at the time were included. Since then, as with many databases, the gap in representation has grown. Thus, we launched a concerted effort to deposit published datasets that were not yet included in MaveDB. Our curation team spanned three sites: WEHI/University of Melbourne in Melbourne, Australia; University of Washington in Seattle, USA; and Harvard University in Boston, USA.
To enable the discovery, reuse, and interpretation of data in MaveDB, we developed a robust process for curating and summarizing these heterogeneous experimental results, including training materials, much of which has been incorporated in the updated documentation pages in MaveDB v2. Key information was abstracted from the associated publications and synthesized into a title, short description, abstract, and methods for each experiment and score set record. Accession numbers for raw data and target sequences for each dataset were also included. Each draft MaveDB entry was peer reviewed by at least one other MaveDB curator to ensure all relevant information was present and accurate before inclusion in the database. In addition to writing the free text sections and organizing associated metadata, our curation team also formatted scores and related values from supplemental tables. As a result of these efforts, MaveDB now contains 231 datasets from 68 publications. This comprises 46% of all published papers for which data were available, and 28% of published papers overall (Figure 2). Furthermore, MaveDB now makes available 20 datasets from 7 papers that were not provided as part of their original publications.

Conclusions
Measuring, predicting and understanding variant effects on a genome-wide scale are key challenges. MAVEs are an important approach for overcoming these challenges, but the data must be stored in a stable, standardized fashion along with the metadata required to enable reanalysis, modeling and prediction. Moreover, MAVE datasets must be easily discoverable, and MAVE data must appear in other commonly used data portals.
MaveDB v2 marks a major improvement to our data model and ability to handle these heterogeneous datasets. It improves upon the previously used variant representation and furthers our aim to standardize and share data. MaveDB v2 also incorporates new API features, including the ability to request data on individual variants and upload data via script. These features are supported by open source software designed to serve the research community.
Additionally, we launched a massive curation effort involving 161 datasets, ultimately populating MaveDB with nearly half of all published data. This equates to three million variant measurements. Going forward, we strongly encourage data generators to deposit their data directly into MaveDB prior to or upon publication.

Figure 1: Representative structure for a meta-analysis score set. The cartoon depicts the relationship between experiment sets, experiments, score sets, and meta-analysis score sets for a real-world dataset. Two assays were performed on the gene NUDT15 and combined into a resulting "function score" that summarized performance across both assays [41]. The meta-analysis score set that combines these results is shown on the far right, and is described in MaveDB under urn:mavedb:00000055-0-1.

Figure 2: Growth of datasets and availability over time. We compiled a list of 238 publications that contained at least one new MAVE dataset and determined whether counts, scores, or raw sequence data were made available in supplementary information. At present, we have made 46% of all datasets for which scores were available accessible via MaveDB, and work is underway to add the remaining datasets.