DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains

In recent years, pre-trained language models (PLMs) have achieved the best performance on a wide range of natural language processing (NLP) tasks. While the first models were trained on general-domain data, specialized ones have emerged to treat specific domains more effectively. In this paper, we propose an original study of PLMs in the medical domain applied to the French language. We compare, for the first time, the performance of PLMs trained on both public data from the web and private data from healthcare establishments. We also evaluate different learning strategies on a set of biomedical tasks. In particular, we show that we can take advantage of an already existing biomedical PLM in a foreign language by further pre-training it on our targeted data. Finally, we release the first specialized PLMs for the biomedical field in French, called DrBERT, as well as the largest corpus of medical data under free license on which these models are trained.


Introduction
In recent years, pre-trained language models (PLMs) have been shown to significantly improve performance on many Natural Language Processing (NLP) tasks. Recent models, such as BERT (Devlin et al., 2018) or RoBERTa (Liu et al., 2019), increasingly take advantage of huge quantities of unlabeled data thanks to recent unsupervised approaches like masked language modeling based on the Transformers architecture (Vaswani et al., 2017). Most of these PLMs are pre-trained on general-domain corpora, such as news articles, books or encyclopedias. An additional fine-tuning step can then be applied to adapt these PLMs to a targeted task (Devlin et al., 2018).
Although these generic models are used in various contexts, recent works have shown that optimal performance in specialized domains, such as finance (Yang et al., 2020), medicine (Yang et al., 2022) or traveling (Zhu et al., 2021), can only be achieved using PLMs adapted to the targeted conditions.
The adaptation of language models to a domain generally follows two strategies. The first is training a new model from scratch using only textual data from the targeted specialty. The second approach, called continual pre-training (Howard and Ruder, 2018), continues the training of an already pre-trained model, turning a generic model into a specialized one. Even if studies have shown that the first strategy generally offers better performance (Lee et al., 2019), the second requires far fewer resources (Chalkidis et al., 2020; El Boukkouri et al., 2022), whether in terms of computing power or amount of data.
However, domain-specific data are generally difficult to obtain, resulting in only a few specialized PLMs being available. This difficulty is even greater for languages other than English. In the medical domain, data produced during clinical care capture the finesse of medical reasoning and are the most prevalent in terms of quantity. However, they are rarely accessible due to patient privacy constraints.
In this paper, we describe and freely release DrBERT, the first RoBERTa-based PLMs specialized in the biomedical field for French, as well as the corpus that allowed their training. We also propose an original study concerning the evaluation of different language model pre-training strategies for the medical field, comparing them with our model derived from private clinical data, called ChuBERT. In our experiments, PLMs trained on publicly available biomedical data can reach similar or better performance compared to models trained on highly specialized private data collected from hospital reports, or on larger corpora containing only generic data. Our contributions can be summarized as follows:
• A new benchmark aggregating a set of NLP tasks in the medical field in French, making it possible to evaluate language models at the syntactic and semantic levels (multi-label classification, part-of-speech tagging, named entity recognition, etc.).
• A large textual data collection, called NACHOS, crawled from multiple biomedical online sources.
• The construction and evaluation of the first open-source PLMs in French for the biomedical domain based on RoBERTa architecture, called DrBERT, including the analysis of different pre-training strategies.
• A set of models using both public and private data trained on comparable data sizes. These models were then compared by evaluating their performance on a wide range of tasks, both public and private.
• The free distribution of the NACHOS corpus and of the public PLMs under the open-source MIT license 1 .

Related work
BERT (Devlin et al., 2018) is a contextualized word representation model based on the concept of masked language modeling and pre-trained using bidirectional Transformers (Vaswani et al., 2017). Since its release, it has obtained state-of-the-art (SOTA) performance on almost every NLP task, while requiring minimal task-specific architectural modifications. However, the training cost of such a model is very high in terms of computation, due to the complexity of the training objective and the quantity of data needed. Consequently, new methods have emerged that propose more effective ways of performing pre-training. One of them is RoBERTa (Liu et al., 2019). In order to improve the initial BERT model, the authors made some simple design changes to its training procedure. They modified the masking strategy to perform dynamic masking, removed the next sentence prediction task, dramatically increased the batch sizes and used significantly more data over a longer training period. Nowadays, RoBERTa is the standard model for many NLP tasks and languages, including French with the CamemBERT model (Martin et al., 2020).
Recently, multiple language models have been developed for the biomedical and clinical fields through unsupervised pre-training of Transformer-based architectures, mainly for the English language. One of the first models was BioBERT (Lee et al., 2019), which is based on the initially pre-trained BERT model and further pre-trained on biomedical-specific data through continual pre-training. Other models like BlueBERT (Peng et al., 2019) and ClinicalBERT (Huang et al., 2019) also used this approach on various data sources. An alternative method, when enough in-domain data is available, is to directly pre-train models from scratch (SciBERT (Beltagy et al., 2019), PubMedBERT (Gu et al., 2021), etc.). Note that SciBERT was trained on mixed-domain data from the biomedical and computer science domains, while PubMedBERT was trained on biomedical data only. Gu et al. (2021) disputed the benefits of mixed-domain data for pre-training, based on results obtained on tasks from the BLURB benchmark.
In other languages than English, BERT-based models are much rarer and primarily rely on continual pre-training. Examples include German (Shrestha, 2021), Portuguese (Schneider et al., 2020), and Swedish (Vakili et al., 2022). Only the Spanish (Carrino et al., 2021) and Turkish (Türkmen et al., 2022) models were trained from scratch with biomedical and clinical data from various sources. For French, there is, to our knowledge, no publicly available model specifically built for the biomedical domain.

Pre-training datasets
In the biomedical domain, previous works on PLMs (Gu et al., 2021) highlighted the importance of matching the data sources used for pre-training to the targeted downstream tasks. Due to their sensitive nature (protection of user data, protected health information of patients, etc.), medical data are extremely difficult to obtain. Massive collection of web data related to this domain appears to be a solution that can overcome this lack. However, these web documents vary in quality. No comparison has been made between PLMs based on domain-specific data from the web and those based on private documents from clinical data warehouses, whose quality can be controlled.
We extracted two different medical datasets for French. The first gathers data crawled from a variety of free-of-use online sources, and the second gathers private hospital stay reports from the Nantes University Hospital. Table 1 gives a general overview of the two collected corpora. The public web-based data, detailed in Section 3.1, allowed the constitution of a corpus, called NACHOS large, containing 7.4 GB of data. The private dataset, called NBDW small, is described in Section 3.2 and contains 4 GB of data. In order to perform comparable experiments, we extracted a NACHOS sub-corpus (NACHOS small) of the same size as the private data. Finally, Section 3.3 describes the pre-processing applied to both datasets.

Public corpus -NACHOS
We introduce the opeN crAwled frenCh Healthcare cOrpuS (NACHOS), a French medical open-source dataset compiled by crawling a variety of textual sources around the medical topic. It consists of more than one billion words, drawn from 24 French-speaking high-quality websites. The corpus includes a wide range of medical information: descriptions of diseases and conditions, information on treatments and medications, general health-related advice, official scientific meeting reports, anonymized clinical cases, scientific literature, theses, French translation pairs, university health courses and a large range of data obtained from raw textual sources, web scraping, and optical character recognition (OCR). Table 2 summarizes the different data sources of NACHOS. We use heuristics to split the texts into sentences and aggressively filter out short or low-quality sentences, like those obtained from OCR. Finally, we classified them by language using our own classifier, trained on the multilingual Opus EMEA (Tiedemann and Nygaard, 2004) and MASSIVE (FitzGerald et al., 2022) corpora, to keep only the sentences in French.
For the 4 GB version of NACHOS (NACHOS small), we shuffled the whole corpus and randomly selected 25.3M sentences to maximize data-source homogeneity. The full NACHOS corpus is now freely available online 2 .
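The crawling and filtering pipeline itself is not detailed in the paper; a minimal sketch of the kind of heuristics described above (sentence splitting, then aggressive filtering of short or low-quality sentences such as noisy OCR output) might look like the following, where the regular expression and thresholds are illustrative assumptions, not the actual values used:

```python
import re

def split_sentences(text):
    # Naive heuristic: split on sentence-final punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def keep_sentence(sentence, min_words=5, min_alpha_ratio=0.7):
    # Drop sentences that are too short or mostly non-alphabetic
    # (typical of noisy OCR output). Thresholds are illustrative only.
    words = sentence.split()
    if len(words) < min_words:
        return False
    alpha = sum(c.isalpha() for c in sentence)
    return alpha / max(len(sentence), 1) >= min_alpha_ratio

text = ("Le patient présente une insuffisance cardiaque aiguë. "
        "4 8 # ~~ zz. "
        "Un traitement par diurétiques a été instauré.")
sentences = [s for s in split_sentences(text) if keep_sentence(s)]
```

A language-identification pass (here, the authors' own classifier trained on EMEA and MASSIVE) would then run on the surviving sentences.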

Private corpus -NBDW
The private corpus, called Nantes Biomedical Data Warehouse (NBDW), was obtained using the data warehouse of Nantes University Hospital. This data warehouse includes different dimensions of patient-related data: socio-demographic information, drug prescriptions and other information associated with consultations or hospital stays (diagnosis, biology, imagery, etc.). The authorization to implement and exploit the NBDW dataset was granted in 2018 by the CNIL (Commission Nationale de l'Informatique et des Libertés), the French independent supervisory authority in charge of the application of national and European data privacy protection laws; authorization N°2129203. For this work, a sample of 1.7 million de-identified hospital stay reports was randomly selected and extracted from the data warehouse. As described in Table 3, the reports come from various hospital departments, emergency medicine, gynecology and ambulatory care being the most frequent.
Each report was split into token sequences, with an average of 15.26 words per sequence. Then, all token sequences from all reports were shuffled to build the corpus. This corpus contains 655M words, from 43.1M sentences, for a total size of approximately 4 GB.
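The segmentation procedure is only described at a high level; a toy sketch of the described pipeline (chunk each report into short token sequences, then shuffle all sequences across reports) could be written as follows, where the 15-word bound is an assumption loosely matching the reported 15.26-word average:

```python
import random

def report_to_sequences(report, max_words=15):
    # Chunk one report into consecutive sequences of at most max_words words.
    words = report.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Toy stand-ins for de-identified hospital stay reports.
reports = [
    "compte rendu d'hospitalisation " * 10,   # 30 words -> 2 sequences
    "consultation de cardiologie " * 12,      # 36 words -> 3 sequences
]
sequences = [seq for report in reports for seq in report_to_sequences(report)]
random.shuffle(sequences)  # decorrelate sequences from their source reports
```

Shuffling at the sequence level rather than the report level further reduces the risk of reconstructing any individual report from the released corpus.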

Pre-processing step
The supplied text data was split into subword units using SentencePiece (Kudo and Richardson, 2018), an extension of Byte-Pair Encoding (BPE) (Sennrich et al., 2016) and WordPiece (Wu et al., 2016) that does not require pre-tokenization (at the word or token level), thereby avoiding the need for language-specific tokenizers. We employ a vocabulary size of 32k subword tokens. For each model pre-trained from scratch (see Section 4.2), the tokenizer was built using all the sentences from the pre-training dataset.
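SentencePiece's actual training procedure is considerably more involved; purely as an illustration of the underlying Byte-Pair Encoding idea (repeatedly merge the most frequent pair of adjacent symbols until the target vocabulary size, 32k here, is reached), here is a toy pure-Python sketch, not the tokenizer used in this work:

```python
from collections import Counter

def most_frequent_pair(vocab):
    # vocab maps a space-separated symbol sequence to its corpus frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair by its concatenation.
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Toy character-level corpus with an end-of-word marker.
vocab = {"c a r d i o </w>": 5, "c a r d i a q u e </w>": 3, "c a s </w>": 2}
for _ in range(3):  # a real tokenizer iterates until the target vocabulary size
    vocab = merge_pair(most_frequent_pair(vocab), vocab)
```

After three merges, the frequent prefix "card" has already been promoted to a single subword, which is how domain-specific terms end up as whole tokens in a specialized vocabulary.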

Models pre-training
In this section, we describe the pre-training modalities of our studied models from two points of view: 1) the influence of the data used (size and nature), and 2) the pre-training strategies of the models. These two levels are respectively detailed in Sections 4.1 and 4.2. Section 4.3 finally presents the existing state-of-the-art pre-trained models that will be used for comparison purposes.

Influence of data
One issue is to identify the amount of data required to create a model that performs well and can compete with models trained on general domains. Recent studies, such as those by Zhang et al. (2020) and Martin et al. (2020), discuss the impact of the size of pre-training data on model performance. According to these studies, some tasks perform better with less data, while others, such as commonsense knowledge and reasoning tasks, keep improving as pre-training data are added.
In the medical field, no study has been conducted to compare the impact of varying the amount of domain-specific data during pre-training, or to assess the impact of the supposedly variable quality of the data depending on their source of collection.
We thus propose to evaluate the pre-training of several language models on either the NACHOS small or the NBDW small corpus, as described in Section 3. Additionally, we propose a model pre-trained on NACHOS large to investigate whether having almost twice as much data improves model performance. Finally, a combination of both public NACHOS small and private NBDW small sources, for a total of 8 GB (NBDW mixed), is explored, to determine whether combining private and public data is a viable approach in low-resource domains.

Pre-training strategies
In addition to the analysis of the size and sources of data, we also seek to evaluate three training strategies of PLMs for the medical domain:
• Training a full model from scratch, including the subword tokenizer.
• Continuing the pre-training of the state-of-the-art language model for French, called CamemBERT, on our medical-specific data while keeping the initial tokenizer.
• Continuing the pre-training of a state-of-the-art domain-specific language model for the medical field, but in English, called PubMedBERT, on our French data while keeping the initial tokenizer.
Regarding the last strategy, our objective is to compare the performance of an English medical model further pre-trained on our French medical data against one based on a generic French model. Indeed, the medical domain shares many terms across languages, which makes the mixture of resources from two languages relevant. Table 4 summarizes all the configurations evaluated in this paper, integrating both the study of data size and pre-training strategies.

Model architecture All models pre-trained from scratch use the CamemBERT base configuration, which is the same as the RoBERTa base architecture (12 layers, 768 hidden dimensions, 12 attention heads, 110M parameters). We did not train large versions of our models due to resource limitations.
Language modeling We train the models on the Masked Language Modeling (MLM) task using the HuggingFace library (Wolf et al., 2019). It consists of randomly replacing a subset of tokens in the sequence with a special token and asking the model to predict them using a cross-entropy loss. In BERT and RoBERTa models (including CamemBERT), 15% of the tokens are randomly selected. Of those selected tokens, 80% are replaced with the <mask> token, 10% remain unchanged and 10% are randomly replaced by a token from the vocabulary. We keep this masking probability of 15% for the training of our models.
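In practice this corruption is handled by the HuggingFace library; purely as an illustration of the 80/10/10 scheme described above, a toy implementation could look like this (the vocabulary and token strings are made up):

```python
import random

MASK = "<mask>"
VOCAB = ["le", "patient", "coeur", "traitement", "aigu"]  # made-up toy vocabulary

def mask_tokens(tokens, mlm_prob=0.15, seed=None):
    # Select ~15% of tokens; of those, 80% become <mask>, 10% stay unchanged
    # and 10% are replaced by a random vocabulary token. Targets record the
    # original token for selected positions (None elsewhere).
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mlm_prob:
            targets.append(tok)
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)
            elif r < 0.9:
                corrupted.append(tok)  # unchanged, but still predicted
            else:
                corrupted.append(rng.choice(VOCAB))
        else:
            targets.append(None)
            corrupted.append(tok)
    return corrupted, targets

corrupted, targets = mask_tokens(["patient"] * 1000, seed=0)
```

Note that loss is computed only on the selected positions, including the 10% that keep their original surface form.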
Optimization & Pre-training We optimize the models for 80k steps with batch sizes of 4,096 sequences, each sequence filled with 512 tokens, allowing us to process 2.1M tokens per step. The learning rate is warmed up linearly over 10k steps, going up from zero to the peak learning rate of 5×10⁻⁵. Models are trained on 128 Nvidia V100 32 GB GPUs for 20 hours on the Jean Zay supercomputer. We use mixed precision training (FP16) (Micikevicius et al., 2017) to reduce the memory footprint, allowing us to fit a batch of 32 sequences on each GPU.
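The schedule can be written explicitly. Only the warmup phase is specified above, so the post-warmup linear decay in this sketch is an assumption for illustration; the tokens-per-step arithmetic, however, follows directly from the stated batch and sequence sizes:

```python
PEAK_LR = 5e-5
WARMUP_STEPS = 10_000
TOTAL_STEPS = 80_000

def learning_rate(step):
    # Linear warmup from 0 to the peak rate over the first 10k steps.
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Post-warmup behavior is not specified in the paper; a linear decay
    # to zero at 80k steps is assumed here for illustration.
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)

tokens_per_step = 4_096 * 512  # 2,097,152, i.e. the reported ~2.1M tokens per step
```

Likewise, 128 GPUs × 32 sequences per GPU reproduces the global batch of 4,096 sequences.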

Baseline models
We describe some existing pre-trained models used as baselines in our comparative study.
CamemBERT (Martin et al., 2020) is a RoBERTa-based model pre-trained entirely from scratch on the French subset of the OSCAR corpus (138 GB). This model is our main baseline, since it is the state-of-the-art model for French. We also use the 4 GB variants of CamemBERT to compare the impact of the nature and quantity of the data.
PubMedBERT (Gu et al., 2021) is a BERT-based biomedical-specific model pre-trained entirely from scratch on the 3.1 billion words of the PubMed corpus (21 GB).
ClinicalBERT (Huang et al., 2019) is a clinical-specific model based on the BERT tokenizer and weights, further pre-trained on the 0.5 billion words of the MIMIC corpus (3.7 GB).
BioBERT v1.1 (Lee et al., 2019) is a biomedical-specific model based on the BERT tokenizer and weights, further pre-trained on the 4.5 billion words of the PubMed corpus.

Downstream evaluation tasks
To evaluate the different pre-training configurations of our models, a set of tasks in the medical domain is necessary. While such a domain-specific NLP benchmark exists for English (BLURB (Gu et al., 2021)), none exists for French. In this section, we describe an original benchmark, summarized in Table 5, integrating various NLP medical tasks for French. Among them, some come from publicly available datasets (Section 5.1), allowing the replication of our experiments. Other tasks come from private datasets (Section 5.2) and cannot be shared. However, they are useful to evaluate our models more accurately.

Publicly-available tasks

ESSAIS / CAS: French Corpus with Clinical Cases
The ESSAIS (Dalloux et al., 2021) and CAS (Grabar et al., 2018) corpora respectively contain 13,848 and 7,580 clinical cases in French. Some clinical cases are associated with discussions. A subset of the whole set of cases is enriched with morpho-syntactic (part-of-speech (POS) tagging, lemmatization) and semantic (UMLS concepts, negation, uncertainty) annotations. In our case, we focus only on the POS tagging task.
FrenchMedMCQA The FrenchMedMCQA corpus (Labrak et al., 2022) is a publicly available Multiple-Choice Question Answering (MCQA) dataset in French for the medical domain. It contains 3,105 questions coming from real exams of the French medical specialization diploma in pharmacy, integrating single and multiple answers.

QUAERO French Medical Corpus
The QUAERO French Medical Corpus (Névéol et al., 2014) introduces an extensive corpus of biomedical documents annotated at the entity and concept levels to provide NER and classification tasks. Three text genres are covered, comprising a total of 103,056 words obtained either from EMEA or MEDLINE. Ten entity categories corresponding to UMLS (Bodenreider, 2004) Semantic Groups were annotated, using automatic pre-annotations validated by trained human annotators. Overall, a total of 26,409 entity annotations were mapped to 5,797 unique UMLS concepts. To simplify the evaluation process, we sort the nested labels in alphabetical order and concatenate them into a single label, transforming the task into a format usable for token classification with BERT-based architectures.

MUSCA-DET MUSCA-DET is a French corpus of sentences extracted from the "Lifestyle" section of clinical notes from the Nantes University Hospital biomedical data warehouse. The corpus contains 27,000 pseudonymized sentences annotated with 26 entities related to Social Determinants of Health (living, marital status, housing, descendants, employment, alcohol, smoking, drug abuse, physical activity). The corpus includes two tasks: nested named entity recognition (NER) and multi-label classification.
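The exact preprocessing code is not given; the nested-label flattening described for QUAERO (sort a token's nested labels alphabetically and concatenate them into one label for standard token classification) can be sketched as follows, with the separator character being an illustrative assumption:

```python
def flatten_nested_labels(token_labels, sep="+"):
    # Each element is the set of nested labels for one token; an empty set
    # becomes the usual "O" (outside) tag. The "+" separator is illustrative.
    return [sep.join(sorted(labels)) if labels else "O" for labels in token_labels]

# Toy example: the first token is annotated as both ANATOMY and DISORDER.
flat = flatten_nested_labels([{"DISORDER", "ANATOMY"}, set(), {"CHEMICAL"}])
```

Sorting makes the composite labels deterministic, so "ANATOMY+DISORDER" and "DISORDER+ANATOMY" collapse to a single class in the tag set.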

Private tasks
Technical Specialties Sorting This classification task consists of assigning a specialty to a medical report based on its transcription. The dataset consists of 7,356 French medical reports that have been manually annotated and equally sampled across 6 specialties: Psychiatry, Urology, Endocrinology, Cardiology, Diabetology, and Infectiology.

Medical report acute heart failure structuration (NER) This corpus contains 350 hospital stay reports (divided into 3,511 sentences) from Nantes University Hospital. The reports are annotated with 46 entity types related to the following clinical information: cause of chronic heart failure, triggering factor for acute heart failure, diabetes, smoking status, heart rate, blood pressure, weight, height, medical treatment, hypertension and left ventricular ejection fraction. Overall, the corpus contains 6,116 clinical entities.
Acute heart failure (aHF) classification This task consists of classifying hospital stay reports according to the presence or absence of a diagnosis of acute heart failure. The corpus consists of 1,639 hospital stay reports from Nantes University Hospital, labeled as positive or negative for acute heart failure.

Results and Discussions
As previously described, we evaluate the performance of our pre-trained language models proposed for the biomedical domain on a set of public and private NLP downstream tasks related to the medical domain. We first analyze the results according to the different pre-training strategies used (Section 6.1), then focus on the impact of the pre-training data, both in terms of size and nature (Section 6.2). Finally, we examine the generalization capacities of our domain-specific models (Section 6.3). Results are reported in Tables 6 and 7 for private and public tasks respectively. For readability reasons, the first part of each table presents the results of existing baseline models, the second part our specialized models trained from scratch, and the last part our models using continual pre-training.

Impact of pre-training strategies
As observed in both Tables 6 and 7, models pre-trained completely from scratch (DrBERT NACHOS and ChuBERT NBDW) tend to produce the best results for both types of data sources and tasks (i.e. private and public). Indeed, considering the F1-score, they obtain the best results on all private tasks and on almost all public ones (5 tasks out of 7). The two remaining public tasks (MUSCA-DET T2 and QUAERO-MEDLINE) are better handled by PubMedBERT NACHOS small, a model that was first pre-trained on domain-specific data (English biomedical data) and then further pre-trained on our French medical data (NACHOS small).
We also observed that continual pre-training from domain-generic models (CamemBERT NACHOS small or CamemBERT NBDW small) does not reach the performance of the other specialized models, with neither of these two models ranking first or second (in terms of performance) on any task.
Finally, the baseline models trained on generic data (CamemBERT OSCAR) and those trained on biomedical data in English (PubMedBERT, ClinicalBERT and BioBERT) remain competitive on a few public biomedical tasks (CAS POS, FrenchMedMCQA or MUSCA-DET T2), while none of them rank first or second on private tasks. This seems to highlight the difficulty of private tasks when non-matching data are used.

Effect of data
Regarding the amount of data used for pre-training (small vs. large or mixed), results show that the larger the data, the better the model performs, regardless of the pre-training strategy or the source of data (private or public). However, the difference is very small for most tasks: models trained on the small corpora are often ranked second behind the large models, even though they were trained on half as much data.
We notice a clear dominance of models that were pre-trained on web-based sources, specifically OSCAR and NACHOS, when applied to public tasks. Indeed, models relying on private NBDW data only achieve the best performance (in terms of F1-score) on the MUSCA-DET T1 task. This trend does not hold on private tasks, where NBDW-based models obtain comparable or even better performance when mixed with public biomedical data (ChuBERT NBDW mixed), as seen in Table 6. We believe this discrepancy is mainly due to the different nature of the processed data.
Finally, we observe that the English-based biomedical models perform close to the French-based CamemBERT model, which shows the usefulness of pre-training on domain-specific data. For example, better results are obtained with continual pre-training of the PubMedBERT model on our specialized French data (PubMedBERT NACHOS small), corroborating our hypothesis about the effectiveness of cross-language knowledge transfer.

Generalization capacities

Table 8 gives the results obtained by all PLMs on general-domain downstream tasks. These tasks come from Martin et al. (2020), who used them to evaluate the CamemBERT model. The first four are POS tagging tasks (GSD, SEQUOIA, SPOKEN and PARTUT), the last being a natural language inference task (XNLI). All our models lose performance on all of these tasks. The largest drop is on the natural language inference task, where ChuBERT NBDW small performs almost 13% lower than CamemBERT 138 GB. We also observe that the specialized models in English are as effective as our biomedical models in French. It seems quite clear from these observations that specialized models generalize poorly to other tasks, but that specialized information captured in one language can transfer to another language.

Conclusion
In this work, we proposed the first biomedical and clinical Transformer-based language models, based on the RoBERTa architecture, for the French language. An extensive evaluation of these specialized models was performed on an aggregated collection of diverse private and public medical tasks. Our open-source DrBERT models improved the state of the art on all medical tasks against both the French general-domain model (CamemBERT) and the English medical ones (BioBERT, PubMedBERT and ClinicalBERT). In addition, we showed that pre-training on a constrained amount (4 GB) of web-crawled medical data makes it possible to compete with, and even frequently surpass, models trained on specialized data from medical reports.
Results also highlighted that continual pre-training of an existing domain-specific English model, here PubMedBERT, is a more viable solution than continual pre-training of a French general-domain model when targeting French biomedical downstream tasks. The performance of this approach with more data, similar to what we did with DrBERT NACHOS large, remains to be investigated.
The pre-trained models, as well as the pre-training scripts 3 , are freely available online.

Ethical considerations
Concerning risks and biases, all the freely available models pre-trained on NACHOS may be exposed to some of the concerns presented in the work of Bender et al. (2021) and Sheng et al. (2021), since some of the NACHOS sub-corpora may be of lower quality than expected, particularly those from non-governmental sources. When using a BERT-based biomedical language model, potential biases can be encountered, including fairness issues, gendered language, limited representation and temporal correctness.

Limitations
It is important to mention some limitations of our work. Firstly, it would be wise to evaluate the impact of the tokenizer on the performance of the models, to ensure that it is not the main reason for the observed performance gains. Furthermore, we cannot determine from this study whether the medical domain transfer observed from English to French using continual pre-training of PubMedBERT can be generalized to other languages or other domains.
Finally, it is possible that training a ChuBERT model with more diverse private clinical data and in a larger quantity could have brought notable performance gains on private tasks.
A considerable amount of computational resources was used to conduct this study: approximately 18,000 hours of GPU computation were used to create the 7 models presented here, as well as about 7,500 GPU hours for debugging due to technical issues related to model configurations and poor performance, for a total of 25,500 hours. The total environmental cost, according to the Jean Zay supercomputer documentation 4 , is equivalent to 6,604,500 Wh or 376.45 kg CO2eq, based on the carbon intensity of the energy grid mentioned in the BLOOM environmental cost study, also carried out on Jean Zay (Luccioni et al., 2022). This makes the present study difficult to reproduce and to transpose to other languages when limited material resources are available.
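The reported totals can be sanity-checked with a few lines of arithmetic: 7 models × 128 GPUs × 20 hours gives roughly the 18,000 training GPU hours cited, and the stated energy figure implies the average per-GPU-hour draw:

```python
models, gpus, hours_per_model = 7, 128, 20
training_gpu_hours = models * gpus * hours_per_model   # 17,920, consistent with ~18,000
total_gpu_hours = 18_000 + 7_500                       # training + debugging = 25,500
wh_per_gpu_hour = 6_604_500 / total_gpu_hours          # implied average draw per GPU hour
```

The implied draw of about 259 Wh per GPU hour is plausible for a V100 node once CPU, memory and cooling overheads are amortized.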

A Appendix
A.1 Vocabularies Inter-coverage

As we can see in Figure 3, despite having similar performance, some of the models do not share much mutual vocabulary.

A.2 Models Stability
We observed during the evaluation phase that most of the models based on the continual pre-training strategy from CamemBERT OSCAR 138 GB suffer from poor consistency and stability during fine-tuning, which translates into performance fluctuations between runs. We also noticed during PubMedBERT NACHOS small pre-training that the model loss is globally stable during almost the entire pre-training, until step 71,000, where the loss drops sharply, reaching zero at step 72,500.

A2. Did you discuss any potential risks of your work?
In Section 8.

A3. Do the abstract and introduction summarize the paper's main claims?
In Section 1.

A4. Have you used AI writing assistants when working on this paper?
Left blank.

B Did you use or create scientific artifacts?
The pre-trained models used as our baselines are presented in Sections 2 and 4.3. The datasets used are listed in Table 5. For the tools, we cite the HuggingFace library in Section 4.2 under the paragraph "Language modeling". Our publicly available artifacts are explicitly enumerated in the contributions of our introduction (Section 1) and conclusion (Section 7). The following will be made available online once the paper is accepted:
- Our training scripts (available at the anonymized repository link in the paper) under the MIT license.
- The web-crawled pre-training corpus, called NACHOS, on Zenodo and HuggingFace (currently private) under the CC0 1.0 license.
- Our models trained on NACHOS, on HuggingFace (currently private) under the MIT license.

B1. Did you cite the creators of artifacts you used?
We cite the models used as artifacts in Sections 2 and 4.3. The datasets used are listed in Table 5 and cited in Sections 5.1 and 6.3. For the tools, we cite the HuggingFace library in Section 4.2 under the paragraph "Language modeling".
B2. Did you discuss the license or terms for use and / or distribution of any artifacts?
The licenses are stated at the end of the introduction (Section 1).
B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)? Not applicable. Left blank.
B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it? We used two corpora for our language models: the first model is trained on crawled data, hand-picked from high-quality web sources, which does not require any anonymization step. The second model is trained on clinical data, and we specify in Section 3.2 that we used de-identified hospital stay reports and comply with the GDPR and local authorities, since we have official accreditations.
B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.? Coverage of domains is presented in Tables 2 and 3. Concerning languages, we only consider French in our study, as the title suggests. Demographic groups represented in the data are not made explicit in the paper.
The Responsible NLP Checklist used at ACL 2023 is adopted from NAACL 2022, with the addition of a question on AI writing assistance.