PT - JOURNAL ARTICLE
AU - Nathaniel T. Hawkins
AU - Marc Maldaver
AU - Anna Yannakopoulos
AU - Lindsay A. Guare
AU - Arjun Krishnan
TI - Systematic tissue annotations of –omics samples by modeling unstructured metadata
AID - 10.1101/2021.05.10.443525
DP - 2021 Jan 01
TA - bioRxiv
PG - 2021.05.10.443525
4099 - http://biorxiv.org/content/early/2021/05/20/2021.05.10.443525.short
4100 - http://biorxiv.org/content/early/2021/05/20/2021.05.10.443525.full
AB - There are currently >1.3 million human –omics samples that are publicly available. This valuable resource remains acutely underused because discovering particular samples from this ever-growing data collection remains a significant challenge. The major impediment is that sample attributes are routinely described using varied terminologies written in unstructured natural language. We propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for –omics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms. Our approach significantly outperforms an advanced graph-based reasoning annotation method (MetaSRA) and a baseline exact string matching method (TAGGER). Model similarities between related tissues demonstrate that NLP-ML models capture biologically-meaningful signals in text. Additionally, these models correctly classify tissue-associated biological processes and diseases based on their text descriptions alone. NLP-ML models are nearly as accurate as models based on gene-expression profiles in predicting sample tissue annotations but have the distinct capability to classify samples irrespective of the –omics experiment type based on their text metadata.
Python NLP-ML prediction code and trained tissue models are available at https://github.com/krishnanlab/txt2onto. Competing Interest Statement: The authors have declared no competing interest.
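
The pipeline described in the abstract (turn free-text sample descriptions into numerical vectors, then feed those vectors to a supervised classifier that predicts a tissue term) can be sketched as follows. This is a minimal illustration and not the authors' method or the txt2onto API: the abstract does not specify the text representation or the classifier, so the sketch substitutes a normalized bag-of-words vector and a nearest-centroid classifier. The function names (`tokenize`, `embed`, `train`, `predict`) and the toy training descriptions are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase free text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text, vocab):
    """Map a description to an L2-normalized bag-of-words vector over vocab.

    Assumption: a stand-in for whatever numerical representation NLP-ML uses.
    """
    counts = Counter(tokenize(text))
    vec = [float(counts[word]) for word in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def train(samples):
    """Fit one centroid per tissue label from (description, label) pairs."""
    vocab = sorted({word for text, _ in samples for word in tokenize(text)})
    grouped = defaultdict(list)
    for text, label in samples:
        grouped[label].append(embed(text, vocab))
    centroids = {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }
    return vocab, centroids


def predict(text, vocab, centroids):
    """Return the label whose centroid has the largest dot product with the text."""
    vec = embed(text, vocab)
    scores = {
        label: sum(a * b for a, b in zip(vec, centroid))
        for label, centroid in centroids.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy metadata, invented purely for illustration.
TRAINING = [
    ("whole blood from healthy donor", "blood"),
    ("peripheral blood mononuclear cells", "blood"),
    ("liver biopsy hepatocellular carcinoma", "liver"),
    ("primary hepatocytes from liver tissue", "liver"),
]

vocab, centroids = train(TRAINING)
print(predict("blood sample, donor 12", vocab, centroids))
```

Because the classifier sees only text, the same model could in principle be applied to metadata from any experiment type, which is the property the abstract highlights.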