RT Journal Article SR Electronic T1 GoMi - A new gold standard corpus for miRNA Named Entity Recognition to test dictionary, rule-based and machine-learning approaches JF bioRxiv FD Cold Spring Harbor Laboratory SP 2021.10.18.464801 DO 10.1101/2021.10.18.464801 A1 Anika Frericks-Zipper A1 Markus Stepath A1 Karin Schork A1 Katrin Marcus A1 Michael Turewicz A1 Martin Eisenacher YR 2021 UL http://biorxiv.org/content/early/2021/10/19/2021.10.18.464801.abstract AB Biomarkers have been the focus of research for more than 30 years [REF1]. Paone et al. were among the first scientists to use the term biomarker, in the course of a comparative study dealing with breast carcinoma [REF2]. In recent years, in addition to proteins and genes, microRNAs (miRNAs), which play an essential role in gene expression, have gained increased interest as valuable biomarkers. As a result, more and more information on miRNA biomarkers can be extracted via text mining approaches from the increasing amount of scientific literature. In the late 1990s, the recognition of specific terms in biomedical texts became a focus of bioinformatic research, with the aim of automatically extracting knowledge from the increasing number of publications. For this, amongst other methods, machine learning algorithms are applied. However, the ability of machine learning or rule-based algorithms to recognize (classify) terms depends on their correct and reproducible training and development. In the case of machine learning-based algorithms, the quality of the available training and test data is crucial. The algorithms have to be trained and tested with curated and trustworthy data sets, the so-called gold or silver standards. Gold standards are text corpora that are annotated by experts, whereas silver standards are curated automatically by other algorithms. Training and calibration of neural networks is based on such corpora. In the literature there are some silver standards with approx. 500,000 tokens [REF3]. 
Gold standards have also already been published for species, genes, proteins, and diseases. However, there is no corpus that has been generated specifically for miRNA. To close this gap, we have generated GoMi, a novel and manually curated gold standard corpus for miRNA. GoMi can be used directly to train ML methods and to calibrate or test algorithms based on the rule-based or dictionary-based approach. The GoMi gold standard corpus was created using publicly available PubMed abstracts. GoMi can be downloaded here: https://github.com/mpc-bioinformatics/mirnaGS---GoMi. Competing Interest Statement: The authors have declared no competing interest.