Text mining of CHO bioprocess bibliome: Topic modeling and document classification

Chinese hamster ovary (CHO) cells are widely used for mass production of therapeutic proteins in the pharmaceutical industry. With the growing need to optimize the performance of producer CHO cell lines, research on CHO cell line development and bioprocessing has increased markedly in recent decades. Bibliographic mapping and classification of relevant research studies will be essential for identifying research gaps and trends in the literature. To qualitatively and quantitatively understand the CHO literature, we have conducted topic modeling using a CHO bioprocess bibliome manually compiled in 2016, and compared the topics uncovered by the Latent Dirichlet Allocation (LDA) models with the human labels of the CHO bibliome. The results show a significant overlap between the manually selected categories and the computationally generated topics, and reveal the machine-generated topic-specific characteristics. To identify relevant CHO bioprocessing papers from new scientific literature, we have developed supervised models using Logistic Regression to identify specific article topics and evaluated the results using three CHO bibliome datasets: the Bioprocessing set, the Glycosylation set, and the Phenotype set. The use of top terms as features supports the explainability of document classification results to yield insights on new CHO bioprocessing papers.

studies irrelevant to CHO bioprocess. For each BP abstract in the CHO bibliome, one or more category labels from a total of 16 research categories were manually assigned based on the types of phenotypic and bioprocess data contained therein (6).

The CHO bibliome has continued to grow since its last compilation in 2015, with over 500 PubMed citations annually. To automate text analysis of the CHO bibliome and gain insight into key topics and trends in CHO bioprocessing and biotechnologies, we have applied topic modeling to explore and classify the CHO literature and compared the results with the manually assigned category labels in the CHO bibliome. When coupled with our classifiers trained with supervised machine learning methods, the resulting models can automatically classify newly published CHO cell studies after 2015 into bioprocess categories and help researchers select CHO cell research articles of interest.

Topic modeling and document classification. Natural language processing (NLP) allows machines to interpret human language with either unsupervised or supervised approaches (7,8). For text analysis to uncover the main topics in an unlabeled set of documents, probabilistic topic models are considered an effective framework for unsupervised topic discovery (9,10). Latent Dirichlet Allocation (LDA) is a widely used topic modeling method (11) with many applications (12). It is a generative probabilistic model of a corpus. The basic principle is that documents are represented as random mixtures over latent (hidden) topics, where each topic is characterized by a distribution over the words in the corpus. In this study, LDA is adopted for automatic exploration of latent topics in the CHO bioprocess bibliome, which are then compared and contrasted with the previously and manually assigned research categories. This allows us to gain insight into the practical performance of LDA topic models relative to human category labels, and into the potential benefits of applying topic modeling to identify significant topics.
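The generative process described above can be written compactly. The following sketch uses the standard LDA notation (hyperparameters α and β, document-topic mixtures θ, and topic-word distributions φ), which the text implies but does not spell out:

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture for document } d\\
\varphi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution for topic } k\\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d) && \text{topic of the $n$-th word in document } d\\
w_{d,n} &\sim \mathrm{Multinomial}(\varphi_{z_{d,n}}) && \text{observed word}
\end{align*}
```

Fitting the model inverts this process: given only the observed words w, inference recovers the latent θ and φ that make the corpus most probable.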

To identify new CHO bioprocessing papers from PubMed (especially publications after 2015), a classifier is needed to separate BP from non-BP studies and identify their bioprocessing topics by learning how the existing CHO bibliome classifies them. For this task, a supervised approach, Logistic Regression, is utilized to classify the bibliome using three datasets: one for the overall "Bioprocess" category (BP set), and two for the specific bioprocessing categories "Phenotype and Production Characteristics" (Phenotype set) and "Glycosylation" (Glycosylation set), respectively. Logistic regression allows different term representations to be used in classification efficiently, ranging from a term's binary presence/absence, to term frequency (tf), to term frequency-inverse document frequency (tf-idf) (13). Our objective is to determine whether each category of interest includes unique terms that could be used for document classification.
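The three term representations can be illustrated with a small pure-Python sketch. The toy corpus, function name, and tf-idf variant below are ours for illustration only; a production pipeline would typically use scikit-learn's CountVectorizer and TfidfVectorizer instead:

```python
import math
from collections import Counter

def term_vectors(docs):
    """Compute binary, tf, and (smoothed) tf-idf vectors for tokenized docs."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # document frequency: number of documents containing each term
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    binary, tf, tfidf = [], [], []
    for d in docs:
        counts = Counter(d)
        binary.append([1 if counts[t] else 0 for t in vocab])
        tf.append([counts[t] for t in vocab])
        # one common tf-idf variant: tf * (log(N / df) + 1)
        tfidf.append([counts[t] * (math.log(n / df[t]) + 1) for t in vocab])
    return vocab, binary, tf, tfidf

docs = [["cho", "cell", "cell"], ["cho", "glycan"]]
vocab, b, tf, tfidf = term_vectors(docs)
print(vocab)        # ['cell', 'cho', 'glycan']
print(b[0], tf[0])  # [1, 1, 0] [2, 1, 0]
```

Note how "cho", which appears in every document, receives no idf boost, while the rarer "cell" does; this is the intuition behind using tf-idf to surface category-discriminative terms.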

If the model is able to predict the category of a document in a dataset with high accuracy, it suggests that the documents in that category share an adequate number of unique terms for classification, which may yield insights on new CHO bioprocessing papers.

The CHO bibliome processing and analysis workflow consists of document processing,

LDA is among the most widely applied probabilistic topic modeling approaches (12).

Python's GENSIM package (18) was used for LDA applications in this study. Bigrams and trigrams were created with GENSIM phrase detection and added to the dictionary. Words that appear in fewer than 5 documents were filtered out, resulting in 2,534 words in the final dictionary.

The second trial involved the use of tf-idf (13), which is often used to capture the relative importance of a term within a document collection.

The LDA model discovered 9 topics (Fig 2, S2 File) from the bioprocess documents.
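The rare-word filtering step can be sketched in a few lines. The pure-Python version below is our own illustration of the idea; in GENSIM itself this is done with Dictionary.filter_extremes(no_below=5):

```python
from collections import Counter

def filter_rare_tokens(tokenized_docs, min_docs=5):
    """Keep only tokens that occur in at least `min_docs` distinct documents
    (mirrors gensim's Dictionary.filter_extremes(no_below=min_docs))."""
    # document frequency: count each token once per document
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    keep = {tok for tok, df in doc_freq.items() if df >= min_docs}
    return [[tok for tok in doc if tok in keep] for doc in tokenized_docs]

docs = [["cho", "cell"], ["cho", "glycan"], ["cho"], ["cho", "cell"], ["cho"]]
print(filter_rare_tokens(docs, min_docs=5))  # only "cho" appears in all 5 docs
```

Dropping such rare tokens shrinks the dictionary and removes terms too infrequent to support a stable topic-word distribution.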

With the terms mapped closely between Topic-4 and Category "Enzyme", it is not surprising that the majority of documents captured in Topic-4 are indeed in Category "Enzyme" in the human annotation, and vice versa, indicating an intrinsic cohesiveness between the human labels and the fitted LDA model for this topic (Fig 3A).

In contrast, there are three significant human-label categories for Topic-3: "Glycosylation", "Purification" and "Enzyme" (excluding category "Phenotype", which is known to contain intrinsically diverse documents). Among the most frequent words for Topic-3, "glycosylation", "structure", "purify", "glycan", "oligosaccharide", "glycoprotein", "nglycan", and "residue" are discriminative terms for categories "Glycosylation" and "Purification" (S3 File). Likewise, the frequent and discriminative words for Topic-2 include "expression", "gene", "clone", "stable", "promoter", "transfection", "selection", and "vector", which correlate well with categories "Expression", "Cell Line" and "Secretion", where those words can be common and expected to occur together. In summary, our LDA model is able to cluster BP documents into topics with salient terms that are discriminative and descriptive for their underlying categories, and the computationally generated models correlate well with several human-labeled categories.

Because the true number of positives was so limited, the accuracy was 93%. This confirms the need for under-sampling. The accuracy is also much higher than the other statistics for the BP vs. non-BP task, which shows that under-sampling may be needed here as well. When the ratio of positive to negative data was set to roughly 1:1, the performance was greatly improved (Table 3).

BERT (Bidirectional Encoder Representations from Transformers) was also considered due to its strength in classifying biomedical literature (20). We used the Google Cloud platform and conducted preliminary studies in which the learning rate, number of epochs, and token limits, as well as other variables, could be controlled (20,21). This method is a possible path for supervised text classification of datasets such as the CHO bioprocess bibliome, but more research must be performed to test its applicability. BERT models are powerful but tend to overfit when the training data are not sufficiently large; this was a factor in not including them in this work.
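The 1:1 under-sampling used to balance the classes above can be sketched as follows; the function name, seed, and toy class sizes are ours, not from the paper:

```python
import random

def undersample(pos, neg, ratio=1.0, seed=42):
    """Randomly down-sample the majority (negative) class so that
    len(neg_kept) is roughly ratio * len(pos), yielding a balanced set."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    k = min(len(neg), int(ratio * len(pos)))
    return pos, rng.sample(neg, k)

pos = list(range(50))     # e.g. 50 positive (BP) abstracts
neg = list(range(1000))   # e.g. 1000 negative (non-BP) abstracts
p, n = undersample(pos, neg)
print(len(p), len(n))     # 50 50
```

Balancing this way prevents a classifier from achieving deceptively high accuracy simply by predicting the majority class.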

We thank the CHO Genome to Phenome (CHOg2p) project community for their helpful discussions and suggestions, and appreciate the Dr. Sarah Harcum (Clemson University) and Dr.

Kelvin Lee (University of Delaware) groups for sharing their ideas at an early stage. This work