The structural extraction of Chinese medical narratives

Medical narratives document a vast amount of clinical data. This data has a valuable secondary purpose, as it may be used to optimize health service delivery and improve the quality of medical care. However, medical narratives are typically recorded in an unstructured manner, which complicates the process of extracting the structured information required for optimization. In this paper, we address this problem by applying and comparing two models, a rule-based model and a model based on conditional random fields (CRFs), to a data set of Chinese medical narratives. Among 4626 manually annotated Chinese medical narratives, collected from Shanxi Dayi Hospital in China, the rule-based model achieved 95.87% precision, 69.82% recall, and an F-score of 80.80%, and the CRF-based model realized 95.99% precision, 65.11% recall, and a 77.59% F-score. These experimental results demonstrate the efficacy of both proposed models for structural extraction from Chinese medical narratives.


Introduction
Since the 1960s, natural language processing (NLP) research has been active in the clinical sector. Although medical NLP research has lagged behind more general NLP applications [1], a great deal of meaningful work has been done. Lack of shared data used to impede medical NLP research, so the efforts of [2][3][4][5] to make clinical data available for sharing merit note. There is also a large body of work on NLP for clinical information [6][7][8][9][10][11][12][13][14][15][16], including applications related to respiratory illness, radiology, discharge summaries, electrocardiography, echocardiography, pharmacovigilance, postoperative surgical complications, and pathology. This study focused on medical narrative that documented progress notes and daily rounds for patients with cardiovascular diseases. One well-known NLP project, the Language String Project-Medical Language Processor (LSP-MLP) [17][18][19], extracts and summarizes symptomologies and medication dosage information [20]. Columbia University has developed Medical Language Extraction and Coding System (MEDLEE), an NLP system that identifies clinical information in narrative reports and converts textual information into structured representations [21]. This study shares the objectives of these systems, but differs from the existing approaches in that the proposed method is specifically applied to medical narratives composed using Chinese characters and language.
Despite its late start, Chinese natural language processing has made great progress in the past two decades [22][23][24]. Huang et al. proposed a segmentation standard for Chinese natural language processing which serves as a preliminary work on the structural extraction of Chinese medical narratives [25]. A group from Harbin Institute of Technology developed an integrated Chinese language processing platform that achieved good results by executing syntactic and semantic parsing modules [26]. Wang et al. presented a method for symptom name recognition from free-text clinical records based on conditional random fields (CRFs) [27]. Xu et al. developed two independent systems of word segmentation and named entity recognition which were also based on CRFs [28].
As a country, China has a well-established and solid foundation for the information construction of medical institutions. The electronic medical record (EMR) system has been widely applied by medical institutions at all levels. The content of EMRs consists of unstructured narrative text as well as structured coded data [29]. Although EMRs contain abundant data for advanced research, existing techniques for processing unstructured text are too inefficient for practical application. Therefore, the key to data mining EMRs is to fulfill the structural extraction of the medical record data, which is accomplished by processing EMR narrative content into structured text to facilitate effective data mining subsequently. Both proposed models, the rule-based model and the CRF model, have been widely used for English medical language processing [30][31][32]. Because of the difficulties encountered in word segmentation and other ambiguities, natural language processing is more complex and less understood for documents written in Chinese.
In this paper, we evaluate the execution of rule-based and CRF models on Chinese medical narratives and compare the results. Factors limiting the extraction results were also explored. Specifically, we conducted a detailed analysis of the word segmentation process. The unique contributions of this paper are as follows: 1) Instead of relying a single model, we executed and compared a rule-based model and a CRF model; 2) For the structural extraction of Chinese medical narratives with similar writing habits, we observed that the rule-based model achieved higher recall while the CRF-based model attained higher precision; 3) To our knowledge, there has been little focused study of the progress notes and daily records in Chinese medical narratives that were used in this study.

Data sources and overview
We obtained EMRs from the cooperative hospital, Shanxi Dayi Hospital, located in Taiyuan, China. Each EMR is comprised of patient progress notes and daily records sorted by timestamp. Each EMR describes the physical condition and associated diagnostic information of the patient at the time of observation. If applicable, examination results are also included. Owing to variations in EMR reporting criteria instituted by different hospitals, we suggest tailoring preprocessing to the EMR protocols of the target hospital. However, after targeted preprocessing, the structured extraction methods of the proposed method are universal. Table 1 provides an excerpt from one of the EMRs used in this study, as original transcribed in Chinese and translated to English. heart rate 64 beats/min; noise unheard in all valve auscultation areas; soft abdomen; no tenderness or rebound tenderness; no swelling of both lower extremities. Cardiac echocardiography showed enlargement of the left atrium, tricuspid regurgitation (mildness), and pulmonary artery pressure (PASP = 31 mmHg). Continue secondary prevention and anti-infective treatment of coronary heart disease, the patient requested to leave the hospital tomorrow. Regular medication and outpatient review is suggested. ***

Data preprocessing
As previously mentioned, targeted preprocessing is applied to sets of EMRs obtained from individual hospitals to standardize the data for universal structured extraction techniques. An EMR, or clinical narrative, contains multiple progress notes and daily records divided into stages according to the timestamp of each notation. In the Chinese language, clauses are independent and complete when separated by commas and periods. Thus, for each stage, or individually timestamped note, the text was separated into independent clauses based on the punctuation. In the final step of preprocessing, the independent clauses were segmented using Jieba, a Chinese word segmentation tool. Table 2 provides an example of an input clause and the segmentation implemented during preprocessing.

The rule-based model
The pipeline of the rule-based model Fig 1 presents the structured extraction process of the rule-based model. First, the clauses of Chinese medical narratives are input as test data. Next, the Jieba segmentation tool separated the input clauses into individual terms. Then, if the terms are found in the concept dictionary, they are replaced by code indicating their associated concept and an ordinal number. The rules are then applied to the encoded segments to extract the valid information from the clause and produce the structural results as the final output.

Constructing the concept dictionary
The rule-based model requires the use of a Chinese medical word dictionary [33]. Thus, we constructed a lightweight Chinese medical word dictionary based on the EMR data set used in this experiment. Matching the key-value pairs of input text and dictionary concepts was crucial to the efficacy of the proposed approach. For example, the term "abdomen" corresponds to the concept "Body_Part," while "cough" is recorded as a "Symptom." The key-value matching process was repeated many times because the accuracy and the capacity of the concept dictionary directly affects the results of the subsequent rule-based structural extraction, especially the recall. Table 3 displays an excerpt of the concept dictionary constructed for this study.

Rules
To establish the rules that would direct the structured extraction, we analyzed the EMR data set and mined rules based on commonly accepted medical transcription conventions. Thus, abstract sentences with similar attributes or content were classified into rules delivered as regular expressions [34]. Table 4 lists all the ten rules used by the proposed model. To identify additional meaningful patterns in the Chinese medical narratives, we created some more complex rules that consisted of two regular expressions.

Revision of the extraction
The results of the structured extraction were constrained by the accuracy of the text segmentation. For example, if the segmentation was excessively fragmented, then the named entities of key-value pairs could be separated during segmentation, thus failing to recognize the appropriate conceptual pairing. Conversely, if the segmentation was too rough, the named entities could be embedded in larger segments that provide suboptimal specificity. Both situations would produce terms that could not be matched to regular expressions of the model rules and would therefore be excluded from the structured extraction implemented by enforcing these rules.
To mitigate omissions related to the granularity of the segmentation process, the extracted results were compared to a golden data set through multiple manual checks. During the debugging process, incorrect extraction results were evaluated to determine whether segmentation error was a factor. Regardless of the segmentation process, expected terms that were missing in the result set were added to an attached dictionary, and their frequency was noted. Frequency was an important consideration because higher frequency terms are easier to identify and properly extract to produce meaningful results. We note that the structured extraction results improved considerably after modifying the segmentation as described even though relatively few omissions were identified by the preceding revision process. The missing terms were added to the concept dictionary immediately after their discovery. As part of the process for incorporating the newly discovered terms, we tested whether the new terms could be correctly segmented, and adjusted the segmentation method as needed.

The conditional random fields (CRF) model Training model and test model
The conditional random field is a framework for building probabilistic models [35] widely used in natural language processing for word segmentation. The tagging set consists of EMRs processed by manual annotations. Four tags were assigned to the characters in a Chinese word: beginning (B), middle (M), end (E), or a single (S) word itself. While Chinese characters were described by BMES tags, punctuation and numbers were distinguished using the respective tags "PUNC" and "NUM." The pipeline of the conditional random fields (CRF) model Fig 2 shows the extraction process of the CRF model. Note that the CRF extraction process that defines this model replaces the Jieba segmentation tool used by the preceding rules-based model. We first constructed the training model by manual annotating the EMR data set using the tags described in subsection 2.4.1. Then, the clauses and sentences segmented from the Chinese EMRs during preprocessing were supplied as input to the CRF model which executed word segmentation based on the assigned BMES tags. The subsequent structural extraction process resembled that of the rule-based model.

Results
The performance of the proposed methods was evaluated using the standard three metrics, precision, recall, and F-score, derived from a confusion matrix. The input parameters of the confusion matrix are explained in Table 5. Table 6 provides the calculations used to obtain the statistical data and evaluation metrics presented in Table 7 and Table 8. Fig 3 and Fig 4 chart the variation and trendlines of precision, recall, and F-score as the annotations are increased using the rules-based model and the CRF model, respectively. The quadratic fitting curves are also plotted in the above Figs. According to the fitting formulas, calculating data is given in Table 8. And Fig 5 provides the comparison between two models.

Parameters
Implication Interpretation FN False Negative It is determined as a negative sample, but in fact a positive sample.

FP False Positive
It is determined as a positive sample, but in fact a negative sample.

TN True Negative
It is determined as a negative sample, and in fact a negative sample.

TP True Positive
It is determined as a positive sample, and in fact a positive sample. Table 6. The calculation of some metrics in Table 7.

Metrics Calculation
Number of annotations u l im

Number of extractions u l iu
Precision u Ȁ u l iu) Recall u Ȁ u l im) F score u‫ݎ‬ ݁ ܿ ݅ ݁݁ Ȁu‫ݎ‬ ݁ ܿ ݅ l ݁݁ The bold data indicates the model with a better performance compared with each other.

Discussion
As the rule-based model is a white box method, the error statistics were used to determine whether a rule is acceptable or requires modification. Fig 6 graphs the error distribution per thousand units of annotated data, clearly indicating that the "Body parts-Description" (BP_DS) rule was the most flawed rule. Table 9 analyzes an application of the BP_DS rule to show why the errors occurred. As observed in Table 9, faulty word segmentation caused the extraction error. In Chinese, "intervertebral" may also written as what translates to English as "intervertebral hole." When the Chinese words were segmented, "intervertebral (椎间)" and "hole (孔)" were separated. Thus, although the concept of "intervertebral ( 椎 间 孔 ) @@Body_Part" was in the concept dictionary, only the "narrow hole (孔狭窄) " was extracted, resulting in the loss of semantics.
Regarding the CRF model, our evaluation revealed that this model required a much larger training data set compared to the rule-based model. Despite using a training set of more than one million words, the training set did not include many common phrases and medical terms. Furthermore, the variety of word combinations and phrase sequences also exerted a strong influence on the correctness of the test set.

Conclusions
This paper compared a rule-based model and a CRF model in extracting Chinese medical narratives. Among 4626 un-seen manually annotated records, the rules-based model achieved 95.87% precision, 69.82% recall, and an F-score of 80.80%, while the CRF model realized 95.99% precision, 65.11% recall, 77.59% F-score.
To summarize, for Chinese natural language texts with similar writing habits, such as progress notes and daily records, the narrative content is more easily abstracted into specific rules, to the advantage of the rule-based model. In contrast, although the recall of the CRF-based model is unremarkable, its accuracy is considerable. Increasing the size and quality of the training set would improve the performance of the CRF model. In the future work, we will consider an ensemble model that combines the merits of both the CRF model and the rule-based model. Furthermore, there is a great deal of unlabeled data, which could be used to extend the training set. Finally, it is also possible to apply a semi-supervised model to learn from both annotated data and unlabeled data to further improve the extraction results.