MPRAVarDB: an online database and web server for exploring regulatory effects of genetic variants

Summary: Massively parallel reporter assay (MPRA) is an important technology to evaluate the impact of genetic variants on gene regulation. Here, we present MPRAVarDB, an online database and web server, for exploring regulatory effects of genetic variants. MPRAVarDB harbors 18 MPRA experiments designed to assess the regulatory effects of genetic variants associated with GWAS loci, eQTLs and various genomic features, resulting in a total of 242,818 variants tested across more than 30 cell lines and 30 human diseases or traits. MPRAVarDB empowers the query of MPRA variants by genomic region, disease and cell line or by any combination of these query terms. Notably, MPRAVarDB offers a suite of pretrained machine learning models tailored to the specific disease and cell line, facilitating the genome-wide prediction of regulatory variants. MPRAVarDB is friendly to use, and users only need a few clicks to receive query and prediction results.


Supplementary Notes
Computational pipeline to process MPRA summary data We collect the MPRA summary data from 18 MPRA studies in Table 1.Next, we use a computational pipeline to process MRPA variants from all studies into the same format with 18 annotations (Table 2).For annotations missing from the MPRA summary data, we will adopt the following bioinformatics tools to generate the annotations.Specifically,

Variant annotation:
If genomic position, reference and alternative alleles are provided for variants in the MPRA summary data but rsID is missing, we adopt "snpsByOverlaps" function in in "SNPlocs.Hsapiens.dbSNP144.GRCh37/GRCh38" R/Bioconductor package, which takes the "GRanges" object of query variants and reference SNPs in "SNPlocs.Hsapiens.dbSNP144.GRCh37/GRCh38" R/Bioconductor package as the parameters and output the rsID.If rsID is provided for variants in the MPRA summary data but reference and alternative alleles are missing, we adopt "snpsById" function, which takes rsID and reference SNPs in "SNPlocs.Hsapiens.dbSNP144.GRCh37/GRCh38" R/Bioconductor package as the parameters and output the genomic positions of query variants.Next, we adopt "inferRefAndAltAlleles" function in "BSgenome.Hsapiens.UCSC.hg19/hg38"R/Bioconductor package, which takes the genomic positions of query variants and human refence genome in "BSgenome.Hsapiens.UCSC.hg19/hg38"R/Bioconductor package as the parameters and output the reference and alternative alleles for the query variants.

Statistical measurement:
If log2FC and pvalue are missing, we fill the missing entries with NA.

Gene annotation:
Entrez IDs for all human protein-coding genes are retrieved from "TxDb.Hsapiens.UCSC.hg19/hg38.knownGene"R/Bioconductor package.Accordingly, gene symbols are obtained from R/Bioconductor package "org.Hs.eg.db" by using "select" function with Entrez IDs as the parameter.Next, gene annotations can be obtained by using "annotatePeakInBatch" function in R/Bioconductor package "ChIPpeakAnno" with "GRanges" object of query variants and reference gene annotations in "EnsDb.Hsapiens.v75/v86"R/Bioconductor package as the parameters and output the gene-level annotations for query variants.
Feature extraction for machine learning models 1. Motif frequency for DNA sequence For random forest model with motif frequency as the feature, we obtain the motif feature in the R/Bioconductor environment.The human motifs are collected from Bioconductor package "JASPAR2020" with species ID 9606, which results in 633 Position frequency matrix (PFM).Motif scan is carried out by using the 633 PFMs to scan 1kb DNA sequence surrounding each MPRA variant with the variant in the middle.The DNA sequence can be provided in the fasta file directly or as genomic regions in the bed file.
If input file is fasta file, the DNA sequences will be converted into "DNAStringSet" object in R/Bioconductor package "Biostrings".If input file is bed file, the genomic regions will be converted to "GRanges" object in R/Bioconductor package "GenomicRanges".Next, "matchMotifs" function in R/Bioconductor package "motifmatchr", if input file is fasta file, takes two parameters including (1) 633 human motifs in the format of PFMs (2) DNA sequences stored in the "DNAStringSet" object; if input file is bed file, takes three parameters including (1) 633 human motifs in the format of PFMs (2) genomic regions stored in a "GRanges" object and (3) Human reference genome represented R/Bioconductor package"BSgenome.Hsapiens.UCSC.hg19/hg38".As a result, for each DNA sequence, the motif features will be generated as a vector of 633 motif frequency.

3mer frequency for DNA sequence
For random forest with 3mer frequency as the feature, we obtain 3mer frequency in the R/Bioconductor environment.If input file is fasta file, the DNA sequences will be converted into "DNAStringSet" object in R/Bioconductor package "Biostrings".If the input file is bed file, the genomic regions will be converted to "GRanges" object in R/Bioconductor package "GenomicRanges"."getSeq" function in R/Bioconductor package "BSgenome", takes both the "GRanges" object and human reference genom"BSgenome.Hsapiens.UCSC.hg19/hg38"as the parameters and output the DNA sequences in a "DNAStringSet" object.Next, "trinucleotideFrequency" function in R/Bioconductor package "Biostrings" takes "DNAStringSet" object and extracts the 3mer frequency, resulting in 64 3mer features.