TY - JOUR
T1 - Accurate Name Entity Recognition for Biomedical Literatures: A Combined High-quality Manual Annotation and Deep-learning Natural Language Processing Study
JF - bioRxiv
DO - 10.1101/2021.09.15.460567
SP - 2021.09.15.460567
AU - Dao-Ling Huang
AU - Quanlei Zeng
AU - Yun Xiong
AU - Shuixia Liu
AU - Chaoqun Pang
AU - Menglei Xia
AU - Ting Fang
AU - Yanli Ma
AU - Cuicui Qiang
AU - Yi Zhang
AU - Yu Zhang
AU - Hong Li
AU - Yuying Yuan
Y1 - 2021/01/01
UR - http://biorxiv.org/content/early/2021/09/17/2021.09.15.460567.abstract
N2 - A combined high-quality manual annotation and deep-learning natural language processing study is reported to achieve accurate named entity recognition (NER) for biomedical literature. An in-house set of entity annotation guidelines for biomedical literature was constructed. Our manual annotations show over 92% overall consistency, for all four entity types (gene, variant, disease and species), with publicly available corpora previously annotated by other experts. A total of 400 full biomedical articles from PubMed were annotated according to these in-house guidelines. Both a BERT-based large model and a DistilBERT-based simplified model were constructed, trained and optimized, for offline and online inference, respectively. The F1-scores of NER for gene, variant, disease and species with the BERT-based model are 97.28%, 93.52%, 92.54% and 95.76%, respectively, while those for the DistilBERT-based model are 95.14%, 86.26%, 91.37% and 89.92%, respectively. The F1-scores of the DistilBERT-based NER model thus retain 97.8%, 92.2%, 98.7% and 93.9% of those of the BERT-based model for gene, variant, disease and species, respectively.
Moreover, both our BERT-based and DistilBERT-based NER models outperform the state-of-the-art model, BioBERT, indicating the importance of training an NER model on biomedical-domain literature together with high-quality annotated datasets. Competing Interest Statement: The authors have declared no competing interest.
ER -