Abstract
The role of macrophages in regulating the tumor microenvironment has spurred the exponential growth of nanoparticle targeting technologies. Given the volume of literature and the speed at which it is generated, it is difficult to remain current with the field. In this study, we performed a topic modeling analysis of the most common uses of nanoparticle targeting of macrophages in solid tumors. The data span 20 years of literature, providing an extensive meta-analysis of nanoparticle strategies. Our topic model found six distinct topics: Immune and TAMs, Nanoparticles, Imaging, Gene Delivery and Exosomes, Vaccines, and Multi-modal Therapies. We also found distinct nanoparticle usage, tumor types, and therapeutic trends across these topics. Moreover, we established that the topic model could be used to assign new papers to the existing topics, thereby creating a Living Review. This type of meta-analysis provides a useful assessment tool for aggregating data about a large field.
Introduction
Macrophage nanoparticle targeting in cancer immunotherapy is a rapidly expanding field. Tumor-associated macrophages (TAMs) are a prime target for biologists who seek to understand their role in tumor progression and immune evasion. Likewise, they have been a target for engineers who design therapeutics to reprogram TAM functions. The enormous amount of literature generated spans multiple scientific disciplines, institutions, and geographical regions. In addition, literature is accessed across numerous databases. Moreover, differences between disciplines can lead to different nomenclatures, which may limit the accuracy of traditional literature search strategies. All of these complexities can make it difficult for any single researcher to stay current.
A scoping review is a powerful tool that gives researchers a broad picture of a field by extracting data from literature related to a specific research question. To accomplish this goal, papers are processed through multiple phases of searching, screening, and data extraction. By processing a large volume of work in this way, researchers can obtain both a qualitative and a quantitative view of a field. This systematic approach has other advantages as well. It can reduce the bias that is introduced when reviewing a smaller number of papers. Furthermore, by utilizing a protocol for screening papers, it can help identify emerging research trends.
One of the challenges of a scoping review is the time and work needed to screen and extract data from thousands of papers. Machine learning, and specifically natural language processing, has been used to aid in multiple phases of a scoping review [1]–[3]. Topic modeling, a form of unsupervised machine learning, has previously been used to create an overview of the types of papers in a literature review based on their abstracts [4]. Fields to which it has been applied include, but are not limited to, cancer immunotherapy [5], emergency medicine [6], and HIV-AIDS research [7].
In this paper, we generate a scoping review of macrophage-targeting cancer therapy using the Gensim Ensemble Latent Dirichlet Allocation (eLDA) topic modeling algorithm [8], [9]. The articles used in this project were the result of a full-text screening process using pre-defined eligibility criteria, previously described in “Scoping Review of Pre-clinical and Translational Studies on Macrophage Polarization in Nanoparticle–based Cancer Immunotherapy” [10]. Additionally, using this model, we demonstrate how LDA topic models can be used to create a “living scoping review” in which new articles are assigned to existing topics.
Materials and Methods
Datasets and Code
The datasets used to create this model, a Jupyter notebook containing sample code to create the model and figures, and the files that contain the model can all be found in the supplemental material.
Obtaining the Dataset
The dataset for this review was obtained using guidance from several scoping review methodology resources [8,9]. Scoping reviews are used to obtain a near-comprehensive overview of a particular body of literature with a well-defined scope. A systematic approach to searching for and identifying relevant articles is used, with the aim of minimizing selection bias. With this approach in mind, a protocol was developed and registered on the Open Science Framework [10]. A search string was crafted by two of the authors who are information specialists with experience in evidence synthesis methods (MG, SY) to search for articles containing information about nanoparticles and macrophage polarization in a cancer context (Supplemental File 1). We searched the following bibliographic databases: Web of Science Core Collection (including Science Citation Index-Expanded, Social Science Citation Index and Emerging Sources Citation Index), Scopus, IEEE Xplore Digital Library, Medline (PubMed), and Biotechnology & BioEngineering Abstracts (ProQuest). Additional articles were found by hand-searching journals, references from related literature reviews, and Google Scholar. The database searches were run between April 23, 2020 and October 20, 2020. We chose to only include articles published in 2000 or later due to the large growth of research in cancer nanotheranostics in the last twenty years. Due to the large number of included articles and limited resources, we did not conduct forward or backward citation searching.
The articles, hereafter referred to as records, underwent the first few stages of a scoping review, including de-duplication, title and abstract screening, and full text screening [11]. In the de-duplication phase, records that appeared multiple times in the dataset were consolidated into one entry. This work was done in Zotero.
In the title-abstract screening phase, records were labeled as “include” or “exclude” based on a set of pre-defined criteria [10]. These criteria included whether the study was carried out in a cancer context, whether it included information about the characterization and use of nanoparticles, and whether it might contain information about macrophage polarization. For a record to be included or excluded, two independent reviewers needed to assign identical labels. In the event of a conflict, a third expert reviewer determined the final label.
During the full text screening, records were again labeled “include” or “exclude” based on more refined criteria. As with the previous screening stage, inclusion decisions needed to be consistent between two independent reviewers, with conflicts being resolved by a third expert reviewer. Papers that were excluded at this stage were also labeled with an exclusion reason. The title-abstract and full text screening phases were carried out in Sysrev [12].
We used Sysrev’s built-in machine learning capabilities to automatically exclude articles in our large dataset [12]. We first trained the Sysrev inclusion prediction algorithm by manually screening half (7,482) of the records in the initial title-abstract phase of screening. We then used its prediction to automatically exclude records with a prediction value of less than 40% (3,460 records). We found that 14 of the 7,482 articles (0.19%) in the training dataset had been incorrectly classified for exclusion by the prediction algorithm when using a prediction value of less than 40%. These records were manually included in the full-text screening phase of the project. Records identified from Google Scholar, Biotechnology & BioEngineering Abstracts (ProQuest), and hand searching were not used to train the algorithm and were manually screened for inclusion at a later stage of the project.
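The exclusion rule and the reported error rate can be illustrated with a short sketch. The record IDs, toy scores, and function names below are hypothetical and chosen only for illustration; Sysrev's own interface is not shown, and only the 40% threshold and the counts 14 and 7,482 come from the text.

```python
from typing import List, Tuple

THRESHOLD = 0.40  # records scoring below this value were auto-excluded (from the text)


def auto_exclude(scores: List[Tuple[str, float]], threshold: float = THRESHOLD) -> List[str]:
    """Return the IDs of records whose predicted inclusion value is below the threshold."""
    return [record_id for record_id, value in scores if value < threshold]


def false_exclusion_rate(n_wrongly_excluded: int, n_screened: int) -> float:
    """Percentage of manually screened records that the threshold would have wrongly excluded."""
    return 100.0 * n_wrongly_excluded / n_screened


# Hypothetical prediction values for three unscreened records.
toy_scores = [("rec-1", 0.12), ("rec-2", 0.40), ("rec-3", 0.87)]
excluded = auto_exclude(toy_scores)

# Counts reported in the text: 14 of the 7,482 manually screened records.
rate = false_exclusion_rate(14, 7482)
print(excluded, round(rate, 2))
```

In this sketch a record scoring exactly 40% is retained, because the text describes exclusion for values of less than 40%.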
The topic model used in this study was created using 854 of the abstracts of papers included after the full text screening (Supplemental File 2). An additional 95 abstracts were retrieved using the same search strategies to create a living scoping review.
Data Pre-Processing
The dataset went through multiple phases of pre-processing including converting all words to lowercase, tokenization, removing stop words, and stemming.
Tokenization and Converting to Lowercase
The NLTK regular-expression tokenizer (RegexpTokenizer) was used to tokenize the document set [3]. We used the regular expression r'\w+', which matches runs of Unicode word characters. The tokens for each document were then converted to lowercase.
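The tokenization step can be sketched with the Python standard library alone. Here the built-in re module stands in for the NLTK tokenizer; for this particular pattern the two should produce the same tokens, but that equivalence is an assumption of the sketch rather than a statement about the authors' notebook. The function name and the sample sentence are illustrative.

```python
import re
from typing import List

# Same pattern as in the text; \w is Unicode-aware for str patterns in Python 3.
WORD_PATTERN = re.compile(r"\w+")


def tokenize(abstract: str) -> List[str]:
    """Split an abstract into word tokens, then convert each token to lowercase."""
    return [token.lower() for token in WORD_PATTERN.findall(abstract)]


# Illustrative sentence, not taken from the dataset.
tokens = tokenize("M2-like TAMs were repolarized by PLGA nanoparticles.")
print(tokens)
```

Note that punctuation acts as a separator under this pattern, so a hyphenated term such as "M2-like" is split into two tokens.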
Stop Words
A list of stop words to remove from the dataset was created in multiple steps. First, a list of common English stop words was imported from NLTK [3]. Next, a list of stop words was created based on the search string. These included words like “cancer”, “nanoparticle”, and “macrophage”, which were common in the dataset even though they are not common in the English language. Such a domain-specific list of stop words is sometimes needed, for example when an abstract set has a narrow scope [4]. Finally, words that occurred in only one abstract or in 90% or more of the abstracts were removed.
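The document-frequency filter described above can be sketched as follows. This is a plain-Python illustration, not the authors' code: the function names and toy documents are hypothetical, the treatment of the 90% boundary (at or above) is an assumption, and the common English stop word list from NLTK is omitted for brevity.

```python
from collections import Counter
from typing import List, Set


def document_frequency_stop_words(
    docs: List[List[str]], min_docs: int = 2, max_fraction: float = 0.9
) -> Set[str]:
    """Words found in fewer than min_docs documents or in at least max_fraction of them."""
    # Count each word once per document, regardless of how often it repeats there.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    n_docs = len(docs)
    return {
        word
        for word, df in doc_freq.items()
        if df < min_docs or df / n_docs >= max_fraction
    }


def remove_stop_words(docs: List[List[str]], stop_words: Set[str]) -> List[List[str]]:
    """Drop every stop word from every tokenized document."""
    return [[word for word in doc if word not in stop_words] for doc in docs]


# Search-string stop words named in the text.
SEARCH_STRING_STOP_WORDS = {"cancer", "nanoparticle", "macrophage"}

# Toy tokenized abstracts (hypothetical).
docs = [
    ["cancer", "tumor", "liposome", "tam"],
    ["cancer", "tumor", "micelle", "tam"],
    ["cancer", "tumor", "imaging", "iron"],
]

stop_words = SEARCH_STRING_STOP_WORDS | document_frequency_stop_words(docs)
cleaned = remove_stop_words(docs, stop_words)
print(cleaned)
```

In the toy data, "tumor" is removed because it appears in every document, and "liposome" is removed because it appears in only one.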
Stemming
The abstracts were stemmed using the NLTK Porter Stemmer [3]. We chose to stem the abstracts rather than lemmatize them because many words in the dataset are very uncommon in colloquial English. Therefore, lemmatization would not work as well on this dataset.
LDA Algorithms
The traditional LDA model generated noisy, incoherent topics and this result was insensitive to hyperparameter tuning. Therefore, we attempted to use a modified topic modeling algorithm that was developed to decrease the noise associated with the random allocation of topics when initializing LDA [13]. Gensim Ensemble Latent Dirichlet Allocation (eLDA) uses the DBSCAN algorithm on a collection of LDA topics to distinguish between stable and noisy topics [14]. We found that eLDA had superior performance to the traditional LDA model and used this technique for the scoping review [15]. In addition, we have made the code available for public use (Supplemental File 3).
The Gensim eLDA module requires two parameters: corpus, a bag-of-words model of the document set, and id2word, a mapping between integer identifiers and the words they represent. The dictionary was created using the Dictionary class from the Gensim corpora module. The bag-of-words model of the document set, the corpus, was created by applying the dictionary's doc2bow function to every abstract in the cleaned dataset.
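To make the two inputs concrete, the following plain-Python sketch builds a stand-in for the dictionary and the bag-of-words corpus. It mirrors the shape of the data (a list of (word ID, count) pairs per document) but does not call Gensim; the function names are hypothetical, and the order in which IDs are assigned here is an assumption that may differ from Gensim's.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_word2id(docs: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct token an integer ID, in order of first appearance."""
    word2id: Dict[str, int] = {}
    for doc in docs:
        for token in doc:
            word2id.setdefault(token, len(word2id))
    return word2id


def doc_to_bow(doc: List[str], word2id: Dict[str, int]) -> List[Tuple[int, int]]:
    """Convert one tokenized document into sorted (word ID, count) pairs."""
    counts = Counter(word2id[token] for token in doc if token in word2id)
    return sorted(counts.items())


# Toy stemmed abstracts (hypothetical).
docs = [["tam", "polar", "tam"], ["micell", "polar"]]

word2id = build_word2id(docs)
id2word = {word_id: word for word, word_id in word2id.items()}
corpus = [doc_to_bow(doc, word2id) for doc in docs]
print(id2word, corpus)
```

Tokens that are absent from the dictionary are silently skipped in this sketch, which is relevant when previously unseen abstracts are converted with a dictionary built from the original dataset.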
Evaluating the Topic Model
The topic model was evaluated both quantitatively and qualitatively. The quantitative metric was average topic coherence, measured using the Gensim CoherenceModel module [15]. The topics created by the models were also evaluated qualitatively for coherence by individuals with domain knowledge in the field. This was done by examining the 30 highest-weighted words from each topic distribution as defined by the LDA model. Finally, we examined the 100 most frequent terms to look for trends within each topic based on previous domain knowledge.
LDA for a Living Review
We used the 95 new abstracts to demonstrate a “proof-of-concept” for a living review. To do this, we first pre-processed the abstracts as described above and created a “bag-of-words” model of each document. We then used the eLDA model to extract topics for each document. LDA, and by extension eLDA, can infer topics for documents that were not part of the original dataset, a strategy that has previously been used in classification studies [8], [16], [17]. In this instance, we did not update the model to incorporate the new documents.
Results and Discussion
Overview of the Scoping Review and the generated dataset
The searches of bibliographic databases, handsearching and Google Scholar resulted in 28,217 records. After deduplication and removing retracted records, 15,374 were screened with 859 records included in the final dataset (Figure 1). 854 records were used to train the topic model and 95 new records were used to generate a living scoping review. Note that the 95 new records did not undergo the screening process.
Flow chart outlining the number of records from article identification, screening, and inclusion into the final dataset used in the eLDA topic model. Of the 28,217 articles identified, 859 met the inclusion criteria. 95 newly identified records were used in the Living Review model. Adapted from [18].
Before creating the topic model, analyses were carried out to characterize the dataset (Figure 2). First, we investigated the trends over time in the publications (Figure 2A). We observed two inflection points of growth in the field of macrophage cancer immunotherapy, which coincide with two major advances in immunotherapy. In 2011, ipilimumab, an immune checkpoint inhibitor, was approved by the FDA for use in melanoma [19]. Moreover, the 10-year follow-up to the original Hallmarks of Cancer included the role of the immune system as a principal biological component in cancer progression and treatment [20]. In 2017, the first CAR T therapy was approved by the FDA [21]. During this time period, it became increasingly apparent that the immune system plays a crucial role in finding effective treatments for cancer. Both of these advances utilize T cells, which is why they have been most useful for tumors with high T-cell infiltration as well as liquid tumors [19]. Macrophages make up a large portion of the volume within tumors and contribute to the immunosuppressive environment that prevents T-cell infiltration [22]. Perhaps for this reason, macrophages specifically became a target for cancer nanotherapeutic research, resulting in exponential growth in this field.
(a) Distribution of number of authors in the dataset. (b) Number of documents in the dataset over time. Gray lines indicate important advancements in the field. (c) Most common journals in the dataset. (d) Distribution of journal topics included in the dataset. Breakdown of the journal scopes within (e) physical and engineering journals and (f) biological journals.
Further bibliographic analysis demonstrated other publishing trends within the field. The median number of authors was about 8, suggesting that this type of research frequently requires a larger collaborative team, since immunomodulatory nanotherapeutics often need a multidisciplinary approach (Figure 2B). We did not extract data related to author gender and institution because the sociological expertise needed to properly extract and analyze these features was considered out of scope. Overall, the articles were published in an array of journals, with Biomaterials, Journal of Controlled Release, and ACS Nano being the most represented (Figure 2C). All three of these journals emphasize the multidisciplinary nature of their scope, which spans physics, chemistry, and biology.
Characterization of the Topic Model
The topic model was evaluated both qualitatively and quantitatively to assess its quality. Average topic coherence was used as the quantitative evaluation [23]. The LDA hyperparameters and number of topics are typically tuned using the coherence score. However, we found that varying these hyperparameters did not have a significant impact on the coherence score. Instead, we chose hyperparameters to create a model that fit our purposes. The Cv coherence score for this model was 0.589. A visualization of the topic model, created using pyLDAvis [24], can be found in Figure 3.
(A) Intertopic Distance Map of LDA topics. The size of each circle represents the relative number of articles within that topic, while the distance between circles represents the similarity between topics. The marginal topic distribution for each topic indicates the percentage of the dataset that belongs to that topic. (B) Most Salient Terms. These are the thirty most frequent terms used to create the topic model. This figure was generated using the pyLDAvis library for topic model visualization in Python [24].
For the qualitative evaluation, the topic model was determined to be successful if it produced topics that made human sense. That is, a successful topic model had topics with distinct themes based on the top words and documents assigned to them. The topics were named in accordance with these themes and the contents of the abstracts, as follows: Immune and TAMs, Nanoparticles, Imaging, Gene Delivery and Exosomes, Vaccines, and Multi-modal Therapies.
Although the eLDA algorithm was effective in reducing the generation of incoherent topics, it did not result in reproducible topics over multiple runs [23], [24]. This is because the LDA algorithm, which is the basis for eLDA, uses a random allocation to initially assign topics. As a result, the model can return different topics when given the same parameters. To account for this, we ran the model multiple times to determine the most persistent topics to represent the dataset. We calculated the average topic coherence for each topic model created and found that it did not vary significantly between runs.
Qualitatively, we found that the model most frequently returned the Immunology and Nanoparticle topics. The remaining topics appeared throughout the model runs, but not as consistently. Imaging and Multi-modal Therapies were usually found within the Nanoparticle topic, while the Vaccine and the Gene Delivery and Exosomes topics were contained within the Immunology topic. This seemed reasonable given the topic proximity on the Intertopic Distance Map (Figure 3A). When evaluated qualitatively, the models produced using eLDA still made more “human sense” than the topic models created with LDA alone on this dataset. Therefore, while the eLDA algorithm did not create entirely reproducible models, it did reliably create meaningful topics. The eLDA model discussed throughout the rest of this manuscript is the one that we found to best characterize the major and minor features of the dataset.
The Immune and TAMs topic had the largest share of documents. This was expected because the searching and screening phases of the scoping review targeted documents that discussed macrophage polarization in the context of cancer immunotherapy. Interestingly, the two smallest topics were Nucleic Acids and Exosomes, and Nanoparticles.
The most salient terms, that is, the most frequent terms in the corpus, are shown in Figure 3B. The bar length represents the term frequency in the entire dataset. The high prevalence of TAM is unsurprising, as the scoping review targeted studies in which nanoparticles interacted with tumor-associated macrophages (TAMs).
Identification of distinct categories within the dataset
To evaluate the topic contents, we categorized the most frequent words that occurred within each topic. Each article was assigned to a topic if its weight for that topic was at least 0.40. The most frequent words were then categorized by theme and sorted by frequency within the dataset. Figure 4A summarizes the results of this analysis. In total, six categories of words were identified: immunology, nanoparticles, imaging, gene delivery and exosomes, vaccines, and multi-modal therapies.
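The assignment rule can be sketched as a simple threshold on a document's topic weights. The (topic ID, weight) pair format is an assumption modeled on how Gensim's LDA models report per-document topics, and the function name and example weights are hypothetical. The sketch returns every topic that meets the threshold, which is one reading of the rule; keeping only the single highest-weighted topic would be another.

```python
from typing import List, Tuple

ASSIGNMENT_THRESHOLD = 0.40  # minimum topic weight for assignment (from the text)


def assign_topics(
    topic_weights: List[Tuple[int, float]], threshold: float = ASSIGNMENT_THRESHOLD
) -> List[int]:
    """Return the IDs of all topics whose weight for this document meets the threshold."""
    return [topic_id for topic_id, weight in topic_weights if weight >= threshold]


# Hypothetical topic weights for two documents.
clear_document = [(0, 0.62), (1, 0.30), (4, 0.08)]
mixed_document = [(0, 0.35), (2, 0.33), (3, 0.32)]

print(assign_topics(clear_document))
print(assign_topics(mixed_document))
```

Under this rule a document whose weight is spread evenly across several topics is left unassigned, as the second example shows.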
(a) Word clouds by topic. Word size was determined by the frequency within the dataset while the colors were determined by their category. (b) Heat map of word categories within each topic.
Some topics were largely dominated by one type of word. In the Immunology topic, terms such as TAM, M1, and M2 featured prominently. The Nanoparticle topic, which also had the highest weight score of all the topics, predominantly contained words associated with nanoparticle formulation and design. The other topics, however, were identified by a unique combination of two or more categories. The Imaging topic, for example, was characterized by “imaging” and “nanoparticle” terms. The Nucleic Acids and Exosome group was characterized by “nanoparticle”, “nucleic acid”, and “cancer” terms. Furthermore, these trends were quantitatively validated by the weight scores (Figure 4B). The weight scores were calculated by dividing the frequency of words belonging to a given category by the total frequency of words in that topic. For example, in documents that were assigned to the Nanoparticle topic, among all appearances of the 100 most frequent words, nanoparticle terms occurred approximately 24% of the time. It is important to note that words that were too vague for their meaning to be determined in the context of the documents were not included in this analysis, even if they did appear among the 100 most frequent terms. These include words like “activ”, “method”, and “cancer”.
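The weight score calculation can be sketched as follows. The function name, the word-to-category mapping, and the toy tokens are hypothetical. The sketch also assumes that vague, uncategorized words are left out of both the numerator and the denominator; the text does not state which denominator was used, so this is one possible reading.

```python
from collections import Counter
from typing import Dict, List


def category_weights(
    tokens: List[str], word_to_category: Dict[str, str], top_n: int = 100
) -> Dict[str, float]:
    """Share of each word category among the categorized top_n most frequent words."""
    top_words = Counter(tokens).most_common(top_n)
    category_counts: Counter = Counter()
    for word, count in top_words:
        category = word_to_category.get(word)
        if category is not None:  # skip vague words that were given no category
            category_counts[category] += count
    total = sum(category_counts.values())
    if total == 0:
        return {}
    return {category: count / total for category, count in category_counts.items()}


# Toy stemmed tokens pooled from the documents of one topic (hypothetical).
topic_tokens = ["plga"] * 3 + ["micell"] * 2 + ["tam"] * 3 + ["mri"] * 2 + ["activ"] * 5

# Hypothetical manual categorization; "activ" is deliberately left uncategorized.
word_to_category = {
    "plga": "nanoparticle",
    "micell": "nanoparticle",
    "tam": "immunology",
    "mri": "imaging",
}

weights = category_weights(topic_tokens, word_to_category)
print(weights)
```

In this toy topic, nanoparticle terms account for half of all categorized word occurrences, even though the uncategorized stem "activ" is the single most frequent token.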
It should also be noted that while more than one topic contained nanoparticles as a predominant word category, the kinds of nanoparticle terms that appeared in each topic were unique. For instance, the nanoparticle terms in the Imaging topic, such as “iron” and “gold”, did not appear in the Nanoparticle topic, which instead contained terms such as PLGA, micelle, and chitosan (Figure 4A). Similarly, the nucleic acid word category in the Vaccine topic contained “cpg”, “odn”, and “antigen”, whereas in the Nucleic Acids and Exosome group it contained “mRNA”, “microRNA”, and “3p”.
We also noticed that the organs that featured prominently varied by topic. For example, breast, liver, brain, and blood were key terms that appeared in the Imaging topic. In Nucleic Acids and Exosomes, pancreas, bone, prostate, and hepatocellular terms were prevalent.
Immunology
Papers in the Immunology topic tend to prioritize repolarization of tumor-associated macrophages (TAMs). These papers describe the TAM phenotype and strategies to convert TAMs into anti-tumoral, pro-inflammatory macrophages. These strategies include cytokine treatment, microRNA gene delivery, and small molecule stimulation of pro-inflammatory transcriptional pathways such as NF-κB and TLR7/8. Notably, these therapeutic cargoes are distinct from the types of therapeutics seen in the Nanoparticle topic. Rather than attempting to induce death in cancer cells, these strategies attempt to target TAM function, migration [25], [26], transcriptional activity, and antigen presentation [27]. This was a particularly strong topic within the dataset and consistently appeared during model reruns.
Nanoparticles
The Nanoparticle topic showcases papers with a strong focus on nanoparticle characterization and development for anti-tumor therapy. While nanoparticles show up in other topics, papers that describe in vivo pharmacokinetic/pharmacodynamic (PK/PD) measurements, in vitro and in vivo cytotoxicity dose measurements, and structural and loading characterization are most likely to be found in this topic. Polymeric nanoparticles are the most common composition, followed by lipids/liposomes. In addition, chemotherapeutic drugs such as paclitaxel, docetaxel, and doxorubicin feature prominently in this topic. Many such papers aim to overcome multidrug resistance [28]–[30]. This was another strong topic, persistently appearing alongside the Immunology topic when the model was rerun.
Imaging
The Imaging topic contains papers that use nanoparticles to visualize macrophages within tumors. The most prominent nanoparticles are metallic nanoparticles that provide strong contrast in MRI. Given the high prevalence of macrophages within solid tumors, these nanoparticles are also used to detect tumor metastasis [31]–[33]. Nanoparticles are conjugated with mannose [34]–[36] to enhance uptake in monocyte, macrophage, and dendritic cell populations. Moreover, these nanoparticles are used as theranostics (diagnostics + therapeutics). Many papers combine metal particles with therapeutics such as curcumin [37] and a COX-2 inhibitor [38]. This topic has some overlap with the Multi-modal Therapies topic.
Gene Delivery and Exosomes
The Gene Delivery and Exosomes topic featured papers that used extracellular vesicles (EVs) for drug delivery, as well as papers on intercellular signaling within tumors. Extracellular vesicles, which include exosomes, are released by cells and can carry a wide array of cell constituents such as DNA, RNA, and proteins [39]. EVs can be used as drug delivery vehicles or as therapeutics themselves. In these papers, most studies used EVs to deliver nucleic acids, such as microRNA. The topic broadly focused on the delivery of biologics: besides microRNA, proteins such as an anti-CD137 antibody and SIRPα were also delivered using nanoparticles [40], [41]. Besides extracellular vesicles, this category also included nanoparticles with different surface functionalizations aimed at improving cell-specific targeting: macrophage membrane, cancer cell membrane, and CD47 [42]–[44].
Vaccines
As the name suggests, this topic contains vaccine strategies for macrophage reprogramming. Here, the cargoes were not primarily chemotherapeutics but antigens and adjuvants designed to initiate pro-inflammatory activation. Lymph nodes featured in this topic because of the organ’s significance in adaptive immunity and as a site of metastasis. While this topic did include more traditional vaccines against different cancer types like melanoma [45], it also encompassed therapies aimed at activating the immune system, such as modulating the lymphocyte populations within the tumor [46], reprogramming the tumor microenvironment [47], or heightening the patients’ immune response [48]. CpG ODNs were commonly seen in these studies as an immune stimulator and adjuvant. Notably, the cancer vaccine field is much larger than what is represented in the current dataset because vaccines were excluded from the search string design.
Multi-modal Therapies
The Multi-modal Therapies topic encapsulated papers that employ more than one strategy, such as photothermal therapy, photodynamic therapy, or magnetic fields combined with chemotherapy or immunotherapy. Interestingly, we found glioblastoma to be a prevalent cancer type in this category. These studies often exploited the cooperative relationship between nanoparticles and the second strategy. For example, metal nanoparticles are rapidly heated during photothermal therapy, which directs heat to the tumor and minimizes off-target effects [49]–[51]. In other studies, the second strategy was an alternative delivery method, such as inhalation [52], [53]. As stated previously, there was overlap with the Imaging topic, with imaging, assisted by the nanoparticles, serving as the second strategy. Some of these overlapping studies used nanoparticles as radiosensitizers [54], where nanoparticles enhance a tumor’s sensitivity to radiation through ROS production [55]. Overall, the studies in this topic focused on how nanoparticles and a second strategy or therapy can amplify each other’s effects, resulting in a synergistic relationship.
Method to Implement a Living Scoping Review
While the topic model provided a useful assessment of the existing literature, this review will eventually become outdated and lose its utility. Thus, we sought to determine whether we could use the eLDA topic model to generate a living review, a system that can be continuously updated (Figure 5A) [16], [17], [56]. We used the eLDA topic model to extract topics from 95 abstracts that also passed the full-text screening. The model was able to identify five of the six topics in this new dataset (Figure 5B). The only topic that was not extracted was the Gene Delivery and Exosomes topic.
(a) Schematic of process used to generate a living scoping review. (b) Table comparing the percentages of papers that were distributed into each topic between the original dataset and the newly acquired papers that form the Living Review. (c) Word cloud of most significant word by topic of papers included in the Living Review. Word size was determined by the frequency within the dataset while the colors were determined by their category.
The absence of the Gene Delivery and Exosomes topic is unsurprising, since this topic was the smallest in the original dataset, as visualized in the intertopic distance map (Figure 3). We then followed the same methodology to identify trends within the topics using word clouds (Figure 5C). We found a large degree of consistency between the frequent words in the original topics and those in the newer topics. For example, the Immunology topic still contained words related to TAMs and immunotherapy. Additionally, the Nanoparticles topic still had a focus on polymeric nanoparticle terms such as micelle, DSPE, and PEG.
Conclusion
We used machine learning-enabled topic modeling to categorize pre-clinical literature on macrophage-targeting nanoparticles within the cancer field. By using the eLDA model, we were able to analyze the large body of data collected during our scoping review. Systematically retrieving data from over 800 papers by hand is a huge undertaking and would have required an unrealistic time investment.
The eLDA topic modeling analysis revealed important insights into the last 20 years of macrophage-targeted cancer nanotherapeutics. Based on word frequency, six distinct topics were determined and labeled as “Immunology, Nanoparticles, Imaging, Gene Delivery and Exosomes, Vaccines, and Multi-modal Therapies.” The topics demonstrated the different study approaches used within the field. The Immunology topic highlighted the TAM repolarization strategy, in which researchers attempt to make the tumor microenvironment less immunosuppressive by changing macrophage phenotype. As demonstrated in the topic’s word cloud, these papers often used liposomes and targeted breast cancer, lung cancer, or melanoma. The Nanoparticle topic focused on studies whose analysis was centered on nanoparticle formulation, with detailed information about the formulation’s PK/PD and structure. The nanoparticles most often studied in this fashion were polymeric and were often loaded with chemotherapeutics. The Imaging topic demonstrated the strategy of using the phagocytic ability of TAMs to take up nanoparticles, often metal, to aid in imaging tumors. The most common targets for these imaging studies were tumors in the breast, liver, and brain. The Exosome topic very clearly illustrated the link between exosomes or other vesicle-like particles and gene delivery, particularly of miRNA. The Vaccine topic encompassed “traditional” vaccines, in which nanoparticles are used to deliver an antigen, often alongside an adjuvant, but also studies aimed simply at activating the immune system to encourage an immune response to the tumor. As expected, CpG is very prominent within this category, as it is a strong innate cell activator and a vaccine adjuvant. The Multi-modal topic highlighted how some studies combine approaches for a synergistic treatment strategy, with NIR (near infrared radiation), photothermal therapy, and imaging being augmented by the administered nanoparticle.
While each topic employed nanoparticles, the eLDA model was able to capture distinct nanoparticle types and therapeutic features, demonstrating the scope of the TAM-targeted nanoparticle field. The word frequency tables generated using the eLDA model also allowed visualization of the different cancer subtypes and organs associated with each topic, as outlined above. This insight is valuable for future therapeutic design and development, and for drawing connections between nanoparticle strategy and targeted cancer.
This study demonstrates how topic modeling allows scientists to analyze large amounts of literature in a manner that can reduce prestige, journal, and geographical bias. However, as with all uses of machine learning, care must be given to the initial input used when generating the model to achieve the most accurate results. It is also recommended to collaborate with information specialists who have expertise in evidence synthesis methods when compiling search strategies to ensure comprehensive results.
It is important to consider that the topic modeling approach is a way to extract latent themes from a dataset. Data can often be summarized and categorized in multiple different ways, which is reflected in the fact that two topic models created using the same parameters and data can produce different results. This topic model should therefore be interpreted as one way, rather than the only way, to organize and make sense of a large amount of information.
Finally, we demonstrated the ability to generate a living review by using our topic model to classify papers that were not included in the original dataset. Thus, we propose that topic modeling approaches can be adopted by an individual or a team of investigators for use in curating existing documents stored in reference managers. Topic modeling is a useful tool that can also be used to assess new scientific fields quickly and thoroughly.
Data Availability
The raw data required to reproduce these findings are included in the supplemental files. The processed data, including the code used to produce the findings, are also included in the supplemental files.
Acknowledgements
Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number 1 R35 GM142957-01 and by the Carnegie Mellon University SURG and ChESS Program. We would also like to thank Megha Anand, Anushree Gupta, Archippe Mbembo, Wonhee Han, Jacob Bauldock, Chris Hunyh, Hannah Yankello, and Dasia Aldarondo for their participation in the abstract screening.
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].