IMPatienT: an Integrated web application to digitize, process and explore Multimodal PATIENt daTa

Medical acts, such as imaging, lead to the production of medical text reports that describe the relevant findings. This induces multimodality in patient data by linking image data to free text; consequently, multimodal data have become central to driving research and improving diagnosis. However, the exploitation of patient data is challenging because the ecosystem of analysis tools is fragmented by type of data (images, text, genetics), task (processing, exploration) and domain of interest (clinical phenotype, histology). To address these challenges, we present IMPatienT (Integrated digital Multimodal PATIENt daTa), a simple, flexible and open-source web application to digitize, process and explore multimodal patient data. IMPatienT has a modular architecture to: (i) create a standard vocabulary for a domain, (ii) digitize and process free-text data, (iii) annotate images and perform image segmentation, and (iv) generate a visualization dashboard and perform diagnosis suggestions. We showcased IMPatienT on a corpus of 40 simulated muscle biopsy reports of congenital myopathy patients. As IMPatienT relies on a user-designed vocabulary, it can be adapted to any domain of research and can be used as a patient registry for exploratory data analysis (EDA). A demo instance of the application is available at https://impatient.lbgi.fr/.

Patient data now incorporate the results of numerous modalities, including imaging, next-generation sequencing and, more recently, wearable devices. Most of the time, medical acts that produce imaging data, such as echography, radiology or histology, result in the production of medical reports that describe the relevant findings. Thus, multimodality is induced in patient data, as imaging data is inherently linked to free-text reports. The link between image and report data is crucial, as raw images can be re-interpreted during the patient's medical journey with new domain knowledge or by different experts, leading to different reports. Patient multimodal data therefore need to be processed in an integrated way to preserve this link in a single database.

Useful tools to centralize, process and explore multimodal data are essential to drive research and improve diagnosis. The use of multimodal data has been shown to increase disease understanding and diagnosis (Kerr et […] text data) to improve diagnosis of Alzheimer's disease (Venugopalan et al., 2021). In Mendelian diseases, integration of multiple levels of information is key to the establishment of a diagnosis. For instance, in congenital myopathies (CM), a combination of muscle biopsy analysis (imaging information) with medical records and sequencing data is essential for differential diagnosis between CM subtypes (Böhm et al., 2013; Cassandrini et al., 2017; North et al., 2014). Centralization of multimodal data using dedicated software is essential to implement such an approach.

However, the ecosystem of tools for the exploitation of patient data is heavily fragmented, depending on the type of data (images, text, genetic sequences), the task to be performed (digitization, processing, exploration) and the domain of interest (clinical phenotype, histology…). Exploitation tools can be divided into two main categories: (i) tools to process the data and (ii) tools to explore the data.

IMPatienT is a web application developed with Flask, a Python-based web micro-framework. Figure 1 illustrates the global organization of the web application. The web application is composed of four modules: (i) Standard Vocabulary Creator, (ii) Report Digitization, (iii) Image Annotation, and (iv) Automatic Visualization Dashboard. All modules incorporate free, open-source and well-maintained libraries that are described in detail in the corresponding sections.

The standard vocabulary creator module allows the user to create and modify a hierarchical list of vocabulary terms with rich definitions. These terms can be used as image annotation classes, for processing text reports, or for diagnosis suggestion. The standard vocabulary is an essential module of IMPatienT, as it interacts with all subsequent modules.

Figure 2 shows a screenshot of the page used to create and manage the standard vocabulary tree. The drag-and-drop system of the graphical user interface (GUI) allows the user to quickly edit and reorganize the vocabulary, add new terms or modify existing ones. In addition, the detailed form for each vocabulary term (node) makes it easy to edit term properties.

The tree is generated and rendered with the JavaScript library JSTree (version 3.3.12). Each node (term) can have only one parent. For each created node, the user can assign a name and organize the tree structure (hierarchy) through the drag-and-drop interface. Each term in the tree is associated with nine optional properties. Four properties are defined by the user: description, list of synonyms, translation in another language, and whether to show the term as an annotation class. Two properties are automatically generated: the term's unique identifier (ID) and the hexadecimal color associated with the term (for image annotation). Additional term properties (associated diagnosis/disease class, associated genes, list of positively correlating terms [i.e. co-occurring terms in reports]) are extracted from patient records registered in the database.

Finally, if the user defines an alternative translation for terms, an "invert vocabulary language" button switches between the two standard vocabulary languages. For instance, the user can create a vocabulary in any language, define the translation in English, and then switch between the two display modes.
[…] its unique ID, display name, alternative language translation, synonyms, description, associated genes and diseases and correlating terms extracted from the application instance database.
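As an illustration of the term properties listed above, a vocabulary node could be modelled roughly as follows. This is a minimal sketch, not the data model used by IMPatienT: the class name `VocabularyTerm`, the field names, the ID scheme, the way the color is derived from the ID and the `display_name` helper are all assumptions made for this example.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


def _new_id() -> str:
    # Hypothetical ID scheme: the text only states that IDs are unique
    # and automatically generated.
    return uuid.uuid4().hex[:8]


@dataclass
class VocabularyTerm:
    name: str
    # A node has at most one parent (None for a root term).
    parent_id: Optional[str] = None
    # Four user-defined properties.
    description: str = ""
    synonyms: List[str] = field(default_factory=list)
    translation: str = ""
    show_as_annotation_class: bool = False
    # Two automatically generated properties.
    term_id: str = field(default_factory=_new_id)
    color: str = ""
    # Three properties extracted from registered patient records.
    associated_diseases: List[str] = field(default_factory=list)
    associated_genes: List[str] = field(default_factory=list)
    correlated_terms: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.color:
            # Assumption: derive a stable hexadecimal color from the ID.
            digest = hashlib.sha256(self.term_id.encode("utf-8")).hexdigest()
            self.color = "#" + digest[:6]


def display_name(term: VocabularyTerm, inverted: bool = False) -> str:
    """Return the translation when the vocabulary language is inverted."""
    if inverted and term.translation:
        return term.translation
    return term.name
```

With this layout, switching the display language only changes which field is shown; the term ID, and therefore every annotation that refers to it, stays the same.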

The standard vocabulary terms are used to process documents that are in a free-text format. Module 2 uses a semi-automatic approach for digitization and processing of free-text reports that combines fast automatic detection of terms with manual review of the detection. The interface of Module 2 is a form divided into four parts (Figure 3).

In the first part of the digitization form (Fig 3a), a PDF file of the free-text report can be uploaded for natural language processing (NLP) of the content. The text of the PDF report is automatically extracted and processed with NLP. The NLP method is only used to detect histological terms defined in the standard vocabulary. Detected standard vocabulary terms are highlighted (see the section "Optical Character Recognition and Vocabulary Terms Detection" below). The highlighting makes it easy to identify which standard vocabulary terms were detected as present or in negated form. This is useful for quantitative performance assessment.

The second part (Fig 3b) […] absence/presence slider. This section allows the user to correct the automatic detection of the NLP method or to add new observations. Each vocabulary term can be marked as present, absent or no information. For terms marked as present, the slider is used to indicate a notion of quantity or certainty of the term. For example, the statement "There are a small number of fibers containing rods" can be annotated by hand by setting the vocabulary term "Rods" to the value "Present" with a low quantity value. For terms that have been automatically detected, this slider value is automatically set to 0 (present in a negated sentence) or 1 (present).

Finally, the fourth part (Fig 3d) […]

The patient report digitization in module 2 is facilitated by the automatic text recognition and keyword detection method. The user uploads a PDF version of the text report to perform Optical Character Recognition (OCR), followed by Natural Language Processing (NLP) to automatically detect terms from the standard vocabulary in the report. The NLP method only matches the raw text to the standard vocabulary defined in module 1 (Standard Vocabulary Creator). Figure 5 describes the workflow of the vocabulary term detection method. First, the PDF file is converted to plain text using the Tesseract OCR engine (through its Python wrapper, pytesseract). Then, the text is processed with spaCy, a Python NLP library, by splitting the text into sentences and then into individual words. The resulting list of sentences is processed to detect negation using a simple implementation of the concept of NegEx (Chapman et al., 2001). An n-gram procedure (monograms, digrams, and trigrams) is applied to the list of words to identify contiguous words in the context of all the sentences of the report. The n-grams are then mapped against the user-created standard vocabulary using fuzzy partial matching (based on the Levenshtein distance) with a score threshold of 0.8. Matched keywords are kept and shown on the interface with a green or red highlight of the detected text using the Mark.js JavaScript library (green indicates the presence of the keyword, red indicates its presence in a negated sentence). Keywords are also automatically marked as present or absent (negated) in the vocabulary tree.

[…] the similarity between a list of input vocabulary terms annotated as "present" for a patient (the query) and a simulated patient profile for each disease class (model report) that is generated based on the data from already registered patients.
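The vocabulary term detection workflow described above (sentence splitting, negation detection, n-gram generation and fuzzy matching with a 0.8 threshold) can be sketched with the Python standard library alone. This is only an approximation under stated assumptions: regular expressions stand in for spaCy, a small hypothetical cue list stands in for the NegEx-style negation detection, `difflib.SequenceMatcher` stands in for the Levenshtein-based partial matching, and the OCR step is omitted. The function names are invented for this example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny cue list; NegEx uses a much richer one.
NEGATION_CUES = {"no", "not", "without", "absent"}


def split_sentences(text):
    """Naive sentence splitter (stand-in for spaCy sentence segmentation)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def ngrams(words, max_n=3):
    """Yield monograms, digrams and trigrams of contiguous words."""
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def similarity(a, b):
    """Similarity in [0, 1]; a stand-in for a Levenshtein-based score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def detect_terms(text, vocabulary, threshold=0.8):
    """Map each matched vocabulary term to its best n-gram, score and negation state."""
    matches = {}
    for sentence in split_sentences(text):
        words = re.findall(r"\w+", sentence)
        # Simplification: one cue anywhere negates the whole sentence.
        negated = any(word.lower() in NEGATION_CUES for word in words)
        for gram in ngrams(words):
            for term in vocabulary:
                score = similarity(gram, term)
                if score < threshold:
                    continue
                best = matches.get(term)
                if best is None or score > best["score"]:
                    matches[term] = {"ngram": gram, "score": score, "negated": negated}
    return matches
```

Because matching is fuzzy, a single n-gram can exceed the threshold for several close vocabulary terms at once, and whole-sentence negation cannot tell which part of a sentence a cue applies to. Both limitations are inherent to this simplified scheme.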
We implemented this algorithm in Python, and we modified it to use the frequencies of vocabulary terms per disease for the generation of the model report instead of the initial deterministic way (not frequency aware). This means that the model report is generated based on the probability (frequency) of each vocabulary term. For example, if disease A is annotated with vocabulary term B at a frequency=0.9 and vocabulary term C at a frequency=0.1, the generated model report for disease A will have a probability=0.9 of containing vocabulary term B and a probability=0.1 of containing vocabulary term C.

Due to the stochastic nature of the generation of the model report, for any given prediction, the generation and computation of the similarity with the query is repeated 50 times. For each repetition, if a disease has a prediction probability>0.5, it is considered to be the best prediction, otherwise the prediction is "no prediction". Finally, of the 50 repetitions, the prediction with the highest occurrence is taken as the final prediction.

Additionally, access to all modules and data entered via the web application is restricted by a login page and user accounts can only be created by the administrator of the platform. No user information is stored except for the username, email and salted and hashed passwords.
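The frequency-aware sampling and the 50-repetition vote described for the diagnosis suggestion can be sketched as follows. This is a simplified illustration, not the BOQA computation itself: a Jaccard index replaces the Bayesian scoring, and the "prediction probability" is approximated here by the best score divided by the sum of all scores. Both choices, as well as the function names, are assumptions made for this example.

```python
import random
from collections import Counter


def sample_model_report(term_frequencies, rng):
    """Include each term with a probability equal to its frequency in the disease."""
    return {term for term, freq in term_frequencies.items() if rng.random() < freq}


def jaccard(a, b):
    """Set similarity used as a stand-in for the BOQA score."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_diagnosis(query, disease_profiles, repetitions=50, seed=0):
    """Repeat the stochastic comparison and return the most frequent prediction."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(repetitions):
        scores = {
            disease: jaccard(query, sample_model_report(freqs, rng))
            for disease, freqs in disease_profiles.items()
        }
        total = sum(scores.values())
        best, best_score = max(scores.items(), key=lambda item: item[1])
        # Assumption: normalised score plays the role of the prediction probability.
        if total > 0 and best_score / total > 0.5:
            votes[best] += 1
        else:
            votes["no prediction"] += 1
    return votes.most_common(1)[0][0]
```

A call such as `suggest_diagnosis({"B"}, {"disease A": {"B": 0.9, "C": 0.1}, "disease D": {"C": 0.8}})` mirrors the example in the text: term B appears in about 90% of the sampled model reports for disease A, so most repetitions vote for it.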

IMPatienT is an interactive and user-friendly web application that integrates a semi-automatic approach for text and image data digitization, processing, and exploration. Due to its modular architecture and its standard vocabulary creator, it has a wide range of potential uses.

IMPatienT Main Functionalities
IMPatienT implements novel functionalities to process and exploit patient data. For example, IMPatienT is compatible with any domain of research thanks to its standard vocabulary builder. Also, with the OCR/NLP method, IMPatienT can process histologic text reports, allowing the user to exploit scanned documents. Finally, IMPatienT also provides utilities to exploit patient data through its visualizations, term frequency tables, correlation matrix and the automatic enrichment of the vocabulary term definitions (associated genes and diseases).

IMPatienT Usage
Figure 1 shows how the user can interact with the web application to digitize, process, and explore patient data. In IMPatienT, modules can be used independently, allowing users to use only the tools they need. For example, a user who only has text report data can use the standard vocabulary creator, the report digitization tools and the visualization dashboard to process and explore their data. In another scenario, a user may only be interested in annotating an image dataset using a shared standard vocabulary that can be modified and updated collaboratively. In this case, they can use only the standard vocabulary creator and the image annotation module. However, the main strength of IMPatienT lies in the multimodal approach it provides and in the interactions between modules.

For the complete multimodal approach, the first step is to create a standard vocabulary using the Standard Vocabulary Creator interface (module 1). The user only needs to create a few terms (nodes) to begin using the web application. Defining the properties of the terms (definition, synonyms…) is optional, and organizing them in a hierarchical structure is also optional.

Then, the user can start digitizing patient reports using module 2 (step 2).
This can be done manually by filling out the form in module 2 and checking terms as present or absent in a given report, or the user can employ the Vocabulary Term Matching method by uploading a PDF version of the report. Using module 3, the user can also upload, annotate, and segment image data. Finally, the user can view multiple exploratory graphs (histograms, correlation matrix, confusion matrix, frequency tables) that are automatically generated in module 4. All data entered via the web application are retrievable in standard formats, including the whole database of reports as a single SQLite3 file or CSV files, the images and their segmentation models and masks as a GZIP archive, the standard vocabulary with annotation as a JSON file and various graphs and tables as JSON or PNG files.
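Since the report database can be retrieved as a single SQLite3 file, it can be post-processed with standard tooling. The sketch below dumps one table of such a file to CSV text using only the Python standard library. The function name `table_to_csv` and any table name passed to it are hypothetical, as the database schema is not described here.

```python
import csv
import io
import sqlite3


def table_to_csv(connection, table):
    """Return the content of one table as CSV text, header row included."""
    # Identifiers cannot be bound as SQL parameters, so only accept
    # table names that actually exist in the database.
    known = {
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    if table not in known:
        raise ValueError(f"unknown table: {table}")
    cursor = connection.execute(f'SELECT * FROM "{table}"')
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()
```

In practice the connection would be opened on the exported file with `sqlite3.connect(path)`, and the available table names can be listed from `sqlite_master` before choosing one.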

As a use case of IMPatienT, we focused on congenital myopathies (CM). We used the standard vocabulary creator to create a sample muscle histology standard vocabulary based on common terms used in muscle biopsy reports from the Paris Institute of Myology. Then, we inserted 40 generated digital patients in the database with random sampling of standard vocabulary terms and associated a gene and disease class among a list of common CM genes and three recurring CM subtypes (nemaline myopathy, core myopathy and centronuclear myopathy). All these data are available on the demo instance of IMPatienT (https://impatient.lbgi.fr/).

For text data, Supplementary Figure S1 shows the results of the automatic NLP method applied to an artificial muscle histology report. Twenty-two keywords were detected and matched to the standard vocabulary, and seven of them were detected in negated sentences (red highlight). Out of the twenty-two keywords, eighteen were correctly detected and one was detected in the wrong state of negation: "abnormal fiber differentiation" is highlighted as negated while it is present in a non-negated part of the sentence. Three keywords (fiber type, internalized nuclei, centralized nuclei) were detected as matching multiple keywords from the vocabulary at the same time due to high similarity. For example, the keywords "internalized nuclei" and "centralized nuclei" have a similarity score of 86 using the Levenshtein distance. Two keywords defined in the standard vocabulary were missed and not highlighted: "biopsy looks abnormal" ("abnormal biopsy" in the vocabulary) and "purplish shade" ("purplish aspect" in the vocabulary).

For the image data, Figure 7 shows an example of the segmentation of a biopsy image, where we annotated the cytoplasm of the cells (green), intercellular spaces (black) and cell nuclei (red). The raw image (Fig 7a) is annotated with free-shape areas associated with standard vocabulary terms (Fig 7b). Then, the whole image is automatically segmented based on the annotations, producing the segmentation mask where each pixel is associated with a class (Fig 7c, 7d).

The automatic visualization dashboard was used to generate the six visualizations provided in Figure 8. These visualizations include a breakdown of the patients in the database by age, genes, or diagnosis (Fig 8a). A correlation matrix (using the Pearson correlation coefficient) between the occurrences of standard vocabulary terms is generated (Fig 8b), which can serve as a starting point for exploring the co-occurrence of features in patients. The confusion matrix of the final diagnosis of patients versus the diagnosis suggested with BOQA (Fig 8c) allows the user to monitor the accuracy of the disease suggestion function. In addition, a histogram showing the classification of patients without a final diagnosis is provided to indicate possible prognosis of undiagnosed patients (Fig 8d). Finally, the frequency of each standard vocabulary term by gene and by disease is automatically calculated and shown in two tables (see Supplementary Tables S1 and S2).
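The two tabular outputs of the dashboard described above, the Pearson correlation matrix between term occurrences and the per-group term frequency table, can be approximated with a few lines of standard-library Python. This is a sketch under assumptions: patients are represented as dictionaries with a hypothetical `"terms"` set and a grouping key such as `"disease"`, which is not necessarily how IMPatienT stores them.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when a term is constant across patients
    return cov / sqrt(var_x * var_y)


def correlation_matrix(patients, terms):
    """Correlate binary presence vectors (1 = term present) for every term pair."""
    columns = {
        term: [1 if term in patient["terms"] else 0 for patient in patients]
        for term in terms
    }
    return {a: {b: pearson(columns[a], columns[b]) for b in terms} for a in terms}


def term_frequency_by(patients, key):
    """Fraction of patients in each group (gene, disease...) showing each term."""
    totals = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for patient in patients:
        group = patient[key]
        totals[group] += 1
        for term in patient["terms"]:
            counts[group][term] += 1
    return {
        group: {term: count / totals[group] for term, count in term_counts.items()}
        for group, term_counts in counts.items()
    }
```

The per-disease frequencies returned by `term_frequency_by` have the same shape as the term frequencies used to sample model reports for the diagnosis suggestion.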

IMPatienT is a platform that simplifies the digitization, processing, and exploration of both textual and image patient data. The web application is centered around the concept of a standard vocabulary tree that is easy to create and is used to process text and image data. This allows IMPatienT to work with patient data from domains that still lack a consensus ontology and to rely on well-established ontologies for patient data, such as HPO for phenotypes, Orphanet for disease names or HGNC/HGVS for genetic diagnoses.

The semi-automatic approach implemented in IMPatienT offers faster digitization while ensuring accuracy through manual review. This is achieved by analyzing text data using OCR and NLP to automatically match the text to the standard vocabulary, followed by manual correction. For image data, the user first provides sparse annotations on the image, which are then used to compute an automatic segmentation of the whole image. For data exploration, IMPatienT uses a fully automatic approach including various visualizations as well as diagnosis suggestions, while allowing the user to extract the processed data in a standard format for further analysis (database, images, frequency tables).

IMPatienT aims to integrate multiple approaches in a unified platform with two main objectives: universality (i.e. not restricted to a specific domain) and multimodality (i.e. integration of multiple data types). To our knowledge, other tools similar to IMPatienT do not fulfill both objectives.

We compared the main functionalities of IMPatienT with other tools used in the community. Phenotips, SAMS and PhenoStore are similar to IMPatienT in that they are designed as patient information databases. However, they are restricted to processing patient phenotype data using HPO and do not integrate multimodal data. IMPatienT goes further by allowing custom observations with the vocabulary builder, by providing automatic digitization with OCR/NLP, and by integrating tools to exploit image data.

Other tools resemble only one or two modules of IMPatienT. For example, Doc2HPO is a tool that also uses a semi-automatic approach to digitize clinical text according to a list of HPO terms, based on NLP methods and negation detection. However, as Doc2HPO is also restricted to HPO, it does not provide custom vocabulary tree facilities. In contrast, IMPatienT is suitable for digitization of text data from any domain of interest.

For image data, software such as Cytomine and Ilastik are widely used and perform well on biological data, but they do not allow the user to take into consideration the multimodal aspects of patient data by keeping the raw image and the expert interpretation (histological report) in a single database along with a collaborative and richly defined custom ontology.

Finally, in IMPatienT we reimplemented the diagnosis suggestion algorithm called BOQA, which is also used in Phenomizer, a tool to rank a list of the top matching diseases based on a list of input HPO terms. We modified the algorithm to consider frequencies of terms by disease to obtain meaningful predictions. However, BOQA uses binary states for terms (terms are marked as present or absent) and is not compatible with numeric features. In the future, it will be necessary to implement a more complex system such as explainable AI with learning classifier systems (Urbanowicz & Moore, 2015). This should improve accuracy, explainability, and handling of quantitative values, although at the cost of computational power.

IMPatienT still lacks some features compared to other tools, such as a pedigree editor, support for DICOM and gigapixel images and phenotypic data export to the Phenopacket format. In the future, we plan to further develop IMPatienT by adding these features to the interface. We also want to explore the automation of the standard vocabulary creation through the analysis of a complete corpus of text. For text analysis, we wish to implement additional context comprehension, i.e. not only negation but also hypothetical statements, uncertainty and family context, as well as better matching between text and vocabulary terms. Finally, we plan to expand the scope of the OCR/NLP method by integrating existing NLP tools to automatically detect HPO terms, gene symbols and disease names in the report text.

With IMPatienT, we have developed an integrated web application to digitize, process and explore multimodal patient data. IMPatienT can serve as a research tool to find new associations of patient features that might be relevant for diagnosis. A demonstration instance of the web application is available at https://impatient.lbgi.fr.

The source code for IMPatienT v1.5.0 is available in its GitHub repository (https://github.com/lambda-science/IMPatienT). The synthetic dataset generated and analyzed during the current study is also available in the same repository.

The authors declare that they have no conflict of interest.

We thank the BiGEst-ICube platform for their assistance. We thank the Agence Nationale de […]

[…] histology report text with detected keywords highlighted in green and red. A red highlight indicates that the keyword is in a negated sentence. (b) Table of some highlighted keywords and the details of the match (matching vocabulary ID and terms, position in the raw text, matching n-gram [raw text] and the similarity score of the comparison). Green and red colors correspond to keywords detected as present and present in a negated sentence, respectively.