Are Machine-Learning Methods More Efficient than Humans in Triaging Literature for Systematic Reviews?

Systematic literature reviews (SLRs) provide rigorous assessments of clinical, cost-effectiveness, and humanistic data. Accordingly, there is a growing trend worldwide among healthcare agencies and decision-makers to require them in order to make informed decisions. Because these reviews are labor-intensive and time-consuming, we applied advanced analytic methods (AAMs) to determine whether machine-learning methods could classify abstracts as well as humans. Literature searches were run for treatments of metastatic non-small cell lung cancer (mNSCLC) and metastatic castration-resistant prostate cancer (mCRPC). Records were reviewed by humans and by two AAMs. AAM-1 involved a pre-trained data-mining model specialized in biomedical literature, and AAM-2 was based on support vector machine algorithms. The AAMs assigned an accept/reject status, with reasons for exclusion. The automated results were compared with those of the humans. For mNSCLC, 5820 records were processed by humans; 440 (8%) records were accepted and the remainder rejected. AAM-1 correctly accepted 6% of records and correctly excluded 79%; AAM-2 correctly accepted 6% and correctly excluded 82%. The review was completed by AAM-1 or AAM-2 in 52 hours, compared with 196 hours for humans. Work saved was estimated at 76% and 79% for AAM-1 and AAM-2, respectively. For mCRPC, 2434 records were processed by humans; 26% of these were accepted and 74% rejected. AAM-1 correctly accepted 23% of records and correctly rejected 62%; AAM-2 correctly accepted 20% and correctly rejected 66%. The review was completed by AAM-1, AAM-2, and humans in 25, 25, and 85 hours, respectively. Work saved was estimated at 61% and 68% for AAM-1 and AAM-2, respectively. AAMs can markedly reduce the time required for searching and triaging records during an SLR. Future research should assess how consistently similar methods perform in SLRs of economic, epidemiological, and humanistic evidence.

Introduction

A systematic literature review (SLR) is a specialized type of literature review that is designed to address a specific research question using rigorous, reproducible, and transparent methods. An SLR may focus on various topics, such as clinical trials for a given drug and/or indication, economic evaluations of healthcare technologies, epidemiological information, or humanistic data. Because SLRs are rigorous, there is a growing trend worldwide among health technology assessment (HTA) agencies and other healthcare decision-makers to require SLRs in order to make informed decisions [1,2]. In addition, the importance of SLRs in supporting modern evidence-based medicine is shown by the 27-fold increase in the annual rate of published SLRs, to 28,959 [3]. However, an SLR is highly time- and labor-intensive. An analysis of 195 SLRs in the PROSPERO registry found that it took an average of 67.3 (±31.0) weeks to complete a review through publication [4]. Currently, there are more than 32 million citations cataloged in PubMed, and the list grows by more than 3000 articles daily; thus, generating an SLR requires searching through very large numbers of references [5].

Because SLRs are generally meant to contain the most recently available studies, they need to be updated periodically. Moreover, for both new SLRs and updates to existing ones, the tasks of identifying records, examining their text, and triaging the results into accepted and rejected articles are particularly time-consuming [6,7]. Automating these tasks could speed up the process and free up investigators' time for other matters, which is why several articles have investigated how automation can be implemented in the SLR process [8,9]. A potential solution is to use artificial intelligence (AI), including natural language processing (NLP), in which computer science and linguistics are used to teach machines to understand human languages and to learn to classify or categorize text [10]. With this method, algorithms can identify and extract unstructured language elements and render them into a form that a machine can understand, after which the text elements can be used for categorization [7].

One method explored is deep learning, a subset of AI [11-13]. It is an advanced analytics technique that teaches a machine to perform a task through learning by example. However, deep learning requires considerable computing power, and it is not always practical to develop a deep learning model from scratch. A common way to deal with this issue is transfer learning.

Instead of training a deep model from scratch, an already trained model from a similar domain is used to initialize the target model, which allows it to be trained faster and to achieve better results. In NLP, the base model is trained on a very large public corpus, for example all Wikipedia articles, using an auxiliary task. This captures general grammar and language structures, though not text specific to a specialized field; in this way, a model learns what a noun, an adjective, a verb, and so on, are. This step is called pre-training.

Once a large pre-trained model is established and becomes available for others to use, it can be transferred to a more specialized model by combining the selected pre-trained model with a classifier (e.g., a feed-forward neural network) trained on a smaller dataset, resulting in substantial time savings for the more specialized model. This step is called fine-tuning. During this phase there is no need for such large amounts of data and computing power: the model can be fine-tuned using only target-task data, resulting in a classification model ready to be used.

For example, BERT and other Transformer encoder architectures have been very successful on a variety of tasks in natural language processing. They compute vector-space representations of natural language that are suitable for use in deep learning models. The BERT family of models uses the Transformer encoder architecture to process each token of input text in the full context of all tokens before and after it, hence the name: Bidirectional Encoder Representations from Transformers (BERT).

BERT models are usually pre-trained on a large corpus of text, then fine-tuned for specific tasks.

Consequently, text classification with BERT can be done by fine-tuning a pre-trained BERT model and feeding its vector-space representations of text into a classifier trained on domain data [14]. For this reason, the BERT family of models has become a standard approach for training task-specific models.
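As a concrete illustration of this fine-tuning step, the sketch below loads a pre-trained biomedical BERT checkpoint and attaches a two-class head for accept/reject triage, using the Hugging Face transformers library. It is a minimal sketch only: the checkpoint name, the toy training records, and the training settings are our assumptions for illustration, not a description of how AAM-1 was actually built.

```python
# Hedged sketch of fine-tuning: a pre-trained biomedical BERT
# checkpoint plus a two-class head for accept/reject triage.
# Checkpoint, records, and settings are illustrative assumptions.
import torch
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

checkpoint = "dmis-lab/biobert-base-cased-v1.1"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)  # 0 = reject, 1 = accept

# Hypothetical screening records: title + abstract, human label.
texts = ["Phase III trial of pembrolizumab in metastatic NSCLC...",
         "A case report of an unusual dermatologic reaction..."]
labels = [1, 0]

enc = tokenizer(texts, truncation=True, padding=True,
                return_tensors="pt")

class TriageDataset(torch.utils.data.Dataset):
    """Wraps tokenized abstracts and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# Fine-tuning needs only the (small) target-task dataset; the
# expensive pre-training has already been done on a general corpus.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="triage-bert",
                           num_train_epochs=3),
    train_dataset=TriageDataset(enc, labels),
)
trainer.train()
```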

A second method explored employs support vector machines (SVMs), a supervised machine-learning algorithm that is commonly used for the automatic classification of data [17]. In order to use SVMs to classify textual data, such as the titles and abstracts of clinical studies, the texts need to be represented as points in a geometrical space (numerical vectors). This is achieved by decomposing the text of a study into words (tokenization); the counts of words sharing the same linguistic word stem (stemming) are then added up in one entry of the vector representing the study text (a context-agnostic bag-of-words representation).
SVMs are based on the idea of dividing a dataset into two classes by determining a linear separation (hyperplane) [17]. To find the best separation, the algorithm chooses the hyperplane that maximizes the distance between the hyperplane and the nearest data points (support vectors) of the two classes in the training set (the maximum margin). By constructing more than one SVM and applying additional analytical methods, data can also be classified into three or more categories simultaneously (multi-label classification). While SVMs are a powerful means of classifying data, their complexity depends on the size of the input data, and they are not optimal for classifying very large datasets or text corpora [18].
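A minimal sketch of this pipeline is shown below, assuming scikit-learn for the bag-of-words vectors and the SVM, and NLTK's Porter stemmer for stemming; the training texts and labels are invented for illustration, and the actual AAM-2 implementation is not specified at this level of detail.

```python
# Hedged sketch: context-agnostic bag-of-words plus a linear SVM
# for accept/reject screening. Texts, labels, and parameters are
# invented; the actual AAM-2 pipeline is not specified here.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

stemmer = PorterStemmer()

def stem_tokens(text):
    # Tokenization + stemming: words sharing a stem are merged so
    # their counts add up in one entry of the study's vector.
    return [stemmer.stem(tok) for tok in text.lower().split()]

train_texts = [
    "randomized phase iii trial of chemotherapy in nsclc patients",
    "case report of a rare adverse event in one patient",
]
train_labels = [1, 0]  # 1 = accept, 0 = reject

# Each study text becomes a point in a geometrical space: a vector
# of stemmed word counts (bag of words, word order discarded).
vectorizer = CountVectorizer(tokenizer=stem_tokens)
X_train = vectorizer.fit_transform(train_texts)

# LinearSVC chooses the maximum-margin hyperplane between classes.
clf = LinearSVC()
clf.fit(X_train, train_labels)

X_new = vectorizer.transform(["phase ii trial of immunotherapy"])
print(clf.predict(X_new))  # 1 = accept, 0 = reject
```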

Literature searches were performed on EMBASE and run separately for two highly investigated health conditions: metastatic non-small cell lung cancer (mNSCLC) and metastatic castration-resistant prostate cancer (mCRPC). The eligibility criteria for each condition are presented in Table 1.

[Table 1. Abbreviations: HRQoL = health-related quality of life; PICOS = population, intervention, comparator, outcomes, and study design.]
AAM-2 was an AI tool combining SVMs with other advanced analytic methods. Between 25% and 30% of the studies from the literature search were used for training, from which more than 10,000 words were tokenized, based on the stem forms of the words, and counted into a vector for each study. These vectors were used to determine the hyperplanes that classify each study as "accept" or "reject". […] To judge whether an F1 score is "good" or "bad", it can be compared with the F1 score calculated from the prevalence (positive class: include) of the data, i.e., that of a random classification.

The Matthews correlation coefficient (MCC) is a measure of the quality of a binary classification: it measures the correlation between the observed and predicted classifications. A score of 1 indicates perfect correlation, 0 means the prediction is no better than random prediction, and -1 indicates complete disagreement between the predicted and observed classifications.
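For concreteness, the sketch below computes F1, its prevalence-based baseline, and MCC with scikit-learn on invented labels. It also computes WSS@95% under the commonly used definition WSS@R = (TN + FN)/N − (1 − R), which we assume matches the metric reported here.

```python
# Hedged sketch: the assessment metrics on invented labels. We
# assume the common definition WSS@R = (TN + FN)/N - (1 - R); the
# study's exact formula is not restated here.
from sklearn.metrics import f1_score, matthews_corrcoef

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 1 = include, 0 = exclude
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

f1 = f1_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)

# Baseline F1 for a random classifier at the data's prevalence p:
# expected precision and recall both equal p, so baseline F1 = p.
p = sum(y_true) / len(y_true)

# Work saved over sampling at 95% recall.
TN = sum(t == 0 and q == 0 for t, q in zip(y_true, y_pred))
FN = sum(t == 1 and q == 0 for t, q in zip(y_true, y_pred))
wss95 = (TN + FN) / len(y_true) - (1 - 0.95)

print(f"F1={f1:.2f} (random baseline {p:.2f}), "
      f"MCC={mcc:.2f}, WSS@95%={wss95:.2f}")
```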

Results

A. Metastatic non-small cell lung cancer

A total of 7845 records were analyzed during the human-conducted review. Of these, 2025 were used to train both AAM-1 and AAM-2. The remaining 5820 records were analyzed by AAM-1 and AAM-2 using the protocol-predefined criteria. The human-conducted classification took 196 hours to complete, whereas the TCAR for AAM-1 and AAM-2 was 52 hours each.

Multilabel classification
For multilabel classification, the records were also classified into five categories simultaneously, with the classes being "accept" and "reject", with the reasons for rejection given as "population", "intervention", "outcome", and "study design". The results are presented in a multi-classification confusion matrix for AAM-1 (Table 7) and AAM-2 (Table 8). The items that were classified correctly are represented in the shaded cells along the diagonal of each table.

A total of 2866 records (62%) were rejected for the correct reasons by AAM-1 (Table 7). Among these, 1671 of 3047 were correctly rejected for population, 185 of 303 were correctly rejected for intervention, 137 of 181 were correctly rejected for outcomes, and 873 of 1849 were correctly rejected for study design (Table 7).

Of those correctly rejected by AAM-2, the correct rejection reason was attributed to 3112 (65%) records (Table 8). Among these, 1627 of 3047 were correctly rejected for population, 75 of 303 were correctly rejected for intervention, 22 of 181 were correctly rejected for outcomes, and 1388 of 1849 were correctly rejected for study design (Table 8). […] have limited prediction power (Table 9). Thus, predicting the reasons to reject is better than […] and wrongly classified 15% (363 records) (Table 10; Fig 2).

B. Metastatic castration-resistant prostate cancer

The data presented in the binary classification confusion matrices, giving the TN, FN, FP, and TP values, and the assessment metrics (ROC-AUC, F1, MCC, and WSS) were determined for prostate cancer (Table 12). […]

The records were also classified into four categories simultaneously, with the classes being "accept" and "reject", with the reasons for rejection given as "population", "intervention", and "study design". The results are presented in a multi-classification confusion matrix for AAM-1 (Table 13) and AAM-2 (Table 14). Although "outcomes" were part of the eligibility criteria defined in the protocol for this condition, this class was not used as an exclusion reason at the title and abstract review level. For records rejected by AAM-1, 12 of 202 were correctly rejected for population, 0 of 5 were correctly rejected for intervention/comparator, and 1363 of 1589 were correctly rejected for study design (Table 13). For records rejected by AAM-2, 25 of 196 were correctly rejected for population, 0 of 5 were correctly rejected for intervention/comparator, and 1454 of 1581 were correctly rejected for study design (Table 14).

Discussion

[…] and up to four times faster than human reviewers. This represents a substantial saving in time spent, as the manual review of abstracts can be a very time-consuming task, taking days to complete.

The WSS@95% values for AAM-1 and AAM-2 for both searches indicated predicted work […] understand what the result of a conflict resolution between the automatic classification and the initial human classification would be. Also, while records could have been excluded as having irrelevant outcomes during the human review of the mCRPC records, owing to a lack of reporting of the outcomes pre-specified in the SLR protocol, this exclusion reason was not applied at the title and abstract screening level. This practice may be used by literature review teams that are careful about wrongfully excluding records that meet the other inclusion criteria but could potentially provide data on outcomes during the next-level review of full-text papers.

Consequently, no records classified as having irrelevant outcomes were used to train either of the AAMs that assessed the mCRPC evidence. Nevertheless, the AAMs were still able to produce results comparable to those of the human reviewers.

Another potential limitation is that the SVM found 25 mCRPC records to be very close to the decision boundary (hyperplane), so the automatic classification was inconclusive for records with a confidence value below 0.1; these examples should be reviewed manually. The AAMs appeared to work better for accept/reject classification than for assigning reasons to the rejected records. In addition, automated methods require training data. The impact of the size, nature, and generation process of the training set, as well as the impact of the overall size of the database, is not entirely clear.
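As a sketch of how such borderline records can be routed to a human, the snippet below treats the signed distance to the hyperplane returned by scikit-learn's decision_function as the confidence value (an assumption on our part) and flags anything within 0.1 of the boundary; the records are invented.

```python
# Hedged sketch: flag records near the SVM decision boundary for
# manual review, assuming the confidence value is the signed
# distance to the hyperplane (scikit-learn's decision_function).
# Records are invented; the 0.1 cutoff is taken from the text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

texts = ["trial of abiraterone in mcrpc patients",
         "narrative review of screening methods",
         "phase ii study in metastatic prostate cancer",
         "editorial comment on treatment guidelines"]
labels = [1, 0, 1, 0]  # 1 = accept, 0 = reject

vec = CountVectorizer()
clf = LinearSVC().fit(vec.fit_transform(texts), labels)

scores = clf.decision_function(vec.transform(texts))
for text, score in zip(texts, scores):
    if abs(score) < 0.1:
        print("inconclusive -> manual review:", text)
    else:
        print("accept" if score > 0 else "reject", "->", text)
```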

Lastly, although the WSS@95% is a widely documented metric in the literature for estimating […] involved. Our analysis found that a key driver of this metric is how long it takes to create and prepare the training dataset. Consequently, we acknowledge that the human review rate in our work may be conservative for some teams that are expert in literature reviewing. In the cases explored in our research, retrospective data from previously completed SLRs were used, and as such the final status of each record was known before the training dataset was prepared. In a de novo review, such information is not available. As a result, to prepare the training dataset, it may be necessary to screen more records than those actually used to train the model.