Abstract
Background Non-coding variants have emerged as important contributors to the pathogenesis of human diseases, not only as common susceptibility alleles but also as rare high-impact variants. Despite recent advances in the study of regulatory elements and the availability of specialized data collections, the systematic annotation of non-coding variants from genome sequencing remains challenging.
Results We integrated 24 data sources to develop a standardized collection of 2.4 million regulatory elements in the human genome, transcription factor binding sites, DNase peaks, ultra-conserved non-coding elements, and super-enhancers. Where available, information on the controlled gene(s), tissue(s) and associated phenotype(s) is provided for each regulatory element. We also calculated a variation constraint metric for regulatory regions and showed that genes controlled by constrained regions are more likely to be disease-associated genes and essential genes identified in mouse knock-out screens. Finally, we evaluated 16 non-coding impact prediction scores and provide suggestions for variant prioritization. The companion tool annotates VCF files with information about the regulatory regions and with non-coding prediction scores to inform variant prioritization. The proposed annotation framework captured previously published disease-associated non-coding variants. Its integration in a routine prioritization pipeline increased the number of candidate genes, including both genes potentially correlated with the patient phenotype and established clinically relevant genes.
Conclusion We have developed a resource for the annotation and prioritization of regulatory variants in whole-genome sequencing (WGS) analysis to support the discovery of candidate disease-associated variants in the non-coding genome.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
edoardo.giacopuzzi{at}well.ox.ac.uk, niko.popitsch{at}well.ox.ac.uk, jenny.taylor{at}well.ox.ac.uk
Abbreviations
- OR: Odds ratio
- TPR: True positive rate (sensitivity)
- TNR: True negative rate (specificity)
- FDR: False discovery rate
- ACC: Accuracy
- HPO: Human Phenotype Ontology
- TFBS: Transcription factor binding site
- UCNE: Ultra-conserved non-coding element
- AUC: Area under the curve
- OPM: Overall Performance Measure
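
For reference, four of the performance abbreviations above (TPR, TNR, FDR and ACC) have standard definitions in terms of confusion-matrix counts. The sketch below computes them using those textbook formulas. The names `ConfusionCounts` and `classification_metrics` are hypothetical and chosen for illustration only; they are not part of the companion tool. OPM is left out because its formula is not given in this section.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing predicted labels with known labels (hypothetical helper)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def classification_metrics(c: ConfusionCounts) -> dict:
    """Standard definitions of TPR, TNR, FDR and ACC from confusion-matrix counts."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "TPR": _ratio(c.tp, c.tp + c.fn),   # sensitivity
        "TNR": _ratio(c.tn, c.tn + c.fp),   # specificity
        "FDR": _ratio(c.fp, c.fp + c.tp),   # false discovery rate
        "ACC": _ratio(c.tp + c.tn, total),  # accuracy
    }


if __name__ == "__main__":
    # Illustrative counts only, not results from the study.
    example = ConfusionCounts(tp=80, fp=10, tn=90, fn=20)
    for name, value in classification_metrics(example).items():
        print(f"{name}: {value:.3f}")
```

Undefined ratios (a zero denominator) are returned as NaN rather than raising an error, so a score with no positive predictions can still be tabulated alongside the others.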