Creation of an Open Science Dataset from PREVENT-AD, a Longitudinal Cohort Study of Pre-symptomatic Alzheimer’s Disease

ABSTRACT:
We describe the creation of an open science dataset from a cohort of cognitively unimpaired aging individuals with a parental or multiple-sibling history of Alzheimer's disease (AD). Our purpose was to enable PResymptomatic EValuation of Experimental or Novel Treatments for AD ("PREVENT-AD"). To characterize this population, possibly progressing in the pre-symptomatic phase of AD, we studied genetic variants and obtained longitudinal measures of cognition, brain structure and function, blood and cerebral fluid biochemistry and neurosensory capacities. Two nested prevention trials were also conducted. Data were hosted in LORIS, a platform that facilitates data organization, curation and sharing. We initially assessed 425 individuals, 385 meeting criteria for sustained investigation and 330 remaining active for longitudinal follow-ups. Between 2011 and 2017, we obtained quality-controlled data from 1704 MRI scans, 532 CSF samples, and 1882 cognitive evaluations. To date, 310 active participants (94%) have agreed that their data be openly shared. In addition to being a living resource for continued data acquisition, therefore, PREVENT-AD offers shared data to facilitate understanding of AD pathogenesis.

BACKGROUND AND SUMMARY
Dementia is the final stage of Alzheimer's disease (AD), representing the culmination of a process that begins decades before onset of symptoms. [1][2][3] Characterizing and tracking the pre-symptomatic stage of AD requires methods sensitive to the disease's early manifestations. These may include not only subtle cognitive decline, but also biochemical changes and structural or functional brain alterations. Studying these pre-symptomatic changes is crucial to a full understanding of AD, and their precise measurement is critical for trials of interventions that seek to prevent symptom onset.
To meet this challenge, in 2010, investigators at McGill University and the Douglas Mental Health University Institute Research Centre created a Centre for Studies on Prevention of Alzheimer's Disease (StoP-AD). The Centre's prime objective was to pursue innovative studies of pre-symptomatic AD, with efforts to provide a relatively enriched population sample for prevention trials requiring individuals 'at-risk' of developing the disease. 8 To this end, the StoP-AD Centre developed an observational cohort for PRe-symptomatic EValuation of Experimental or Novel Treatments for AD ("PREVENT-AD"). This cohort consists of cognitively unimpaired persons with a parental or multiple-sibling history of AD-like dementia, a population having a 2- to 3-fold relative increase in risk of AD dementia. 9 The StoP-AD Centre uses LORIS as its data management platform for storage and curation of data. 11,12 LORIS was designed to facilitate sharing of data with research collaborators. The Centre shares data with more than 12 collaborative research groups. Recently, a portion of the PREVENT-AD data was made available more widely under the principles of open science [13][14][15] (https://openpreventad.loris.ca/). The PREVENT-AD data sharing resource is a major initiative of the Canadian Open Neuroscience Platform (CONP; https://conp.ca), and the Tanenbaum Open Science Institute. 16

A. Overview
Here we briefly describe the PREVENT-AD cohort and associated clinical trials, including the data acquisition strategy, methods and the infrastructure used for the curation and dissemination of data to the wider research community.

Observational Cohort
Recruitment to the observational PREVENT-AD cohort began in November 2011 but was suspended, owing to funding constraints, in May 2017. To increase the probability that cognitively intact participants would harbor early changes of presymptomatic AD, entry criteria rested on two broad principles of advanced age and a parental or multiple-sibling history of AD. Participants were 60 years of age or older, excepting persons between 55 and 59 years old, who were eligible if their own age was within 15 years of symptom onset in their youngest-affected first-degree relative. Participants' family history of "AD-like dementia" was ascertained either by compelling report of an AD diagnosis from an experienced clinician or, if such was not available, by use of a structured questionnaire developed for the Cache County Study and intended to establish memory or concentration issues in first-degree relatives sufficiently severe to cause disability or loss of function, with an insidious onset or gradual progression (as opposed to obvious consequences of a stroke or other sudden insult). Enrollment further required confirmation of intact cognition, stable general health and availability of a study partner to provide information on daily functioning (Table 1). For more details about recruitment and eligibility determination, see section B.1.
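The age rule above can be written as a short check. The sketch below is illustrative only: the function name and arguments are hypothetical, and it assumes that "within 15 years" means the strict inequality given in Table 1 (fewer than 15 years younger than the affected relative's age at symptom onset). It covers the age criterion alone, not the other eligibility requirements.

```python
from typing import Optional


def meets_age_criterion(age: int, relative_onset_age: Optional[int] = None) -> bool:
    """Illustrative age check (hypothetical helper, not the study's own code)."""
    if age >= 60:
        # Participants aged 60 or older satisfy the age criterion outright.
        return True
    if 55 <= age <= 59 and relative_onset_age is not None:
        # Assumption: eligible when fewer than 15 years younger than the
        # relative's age at symptom onset (strict inequality, as in Table 1).
        return (relative_onset_age - age) < 15
    return False
```

For instance, under these assumptions a 57-year-old whose parent developed symptoms at 70 would satisfy the age rule, whereas one whose parent developed symptoms at 75 would not.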

Inclusion criteria
➔ Parental or multiple-sibling (defined by 2 or more) history of Alzheimer-like dementia
➔ Age 60 years or older (persons aged 55-59 years and < 15 years younger than their affected index relative were also eligible)
➔ Minimum of 6 years of formal education
➔ Study partner available to provide information on cognitive status
➔ Sufficient fluency in spoken and written French and/or English
➔ Ability and intention to participate in regular visits
➔ Provision of informed consent
➔ Agreement for periodic donation of blood and urine samples
➔ Agreement to participate in periodic multimodal assessments via MRI and LP for CSF collection (LP optional at first, then mandatory for participation)
➔ Agreement to limit use of medicines as required by investigational protocols, if applicable

Exclusion criteria
➔ Cognitive disorders, known or identified during eligibility assessments (MoCA and CDR)
➔ Use of acetyl-cholinesterase inhibitors including tacrine, donepezil, rivastigmine, galantamine
➔ Use of memantine or other approved prescription cognitive enhancer
➔ Use of vitamin E at greater than 600 i.u. / day or aspirin at >325 mg / day
➔ Use of opiates (oxycodone, hydrocodone, tramadol, meperidine, hydromorphone)
➔ Use of NSAIDs or regular use of systemic or inhalation corticosteroids
➔ Clinically significant hypertension (accepted if controlled medically), anemia, significant liver or kidney disease
➔ Concurrent use of warfarin, ticlopidine, clopidogrel, or similar anti-coagulant
➔ Current plasma creatinine >1.5 mg/dl (132 μmol/l)
➔ Current alcohol, barbiturate or benzodiazepine abuse/dependence
CSF: cerebrospinal fluid; MRI: magnetic resonance imaging; NSAID: non-steroidal anti-inflammatory drug; LP: lumbar puncture

After telephone and on-site screening, eligible participants were enrolled and followed annually with structured evaluations. Cognitive performances (immediate memory, delayed memory, language, attention and visuospatial capacities) were assessed by the Repeatable Battery for Assessment of Neuropsychological Status (RBANS) 26 and neurosensory abilities were evaluated by measuring olfactory identification abilities using the standardized University of Pennsylvania Smell Identification Test (UPSIT). 35 At each visit, the clinical team obtained blood and urine samples and performed neurological and physical examinations, including electrocardiogram. Further 'in-house' medical history and review of systems questionnaires were also administered. Participants also underwent an MRI scanning session of 1 to 1.5 hours including numerous structural and functional acquisitions. On a separate day, participants who consented to the procedure donated CSF samples via lumbar puncture (LP).
Initially, we performed lumbar punctures only on participants enrolled in clinical trials. In 2016, however, considering the overall success of the LP program (acceptance, tolerability, and retention through serial repetitions), we began also to perform serial LPs in the broader observational cohort. In 2017, consent for such LPs became an inclusion criterion for new participants. Over time, various other modalities were added. These included: i) in 2014, evaluation of central auditory processing, a neurosensory function; ii) in 2015, evaluation of subjective cognitive impairment (Everyday Cognition test, ECog); and iii) in 2016, a modified MRI protocol designed to investigate the integrity of hippocampal subfields and brain microstructure (iron deposition, myelination concentration) (Fig. 2). Telephone follow-ups (FU) were conducted between on-site annual visits to maintain contact and update clinical information (Fig. 3).
Most evaluations were conducted from the program's onset (blue items), while some were added later (gray items). Lumbar punctures (dotted line) were originally performed in clinical trial participants but then became optional in the observational cohort and, most recently, an integral part of the program.

Preventive intervention trials
We conducted two clinical trials nested in PREVENT-AD to test potentially preventive pharmaceutical agents. Described below, these trials were INTREPAD, a randomized, placebo-controlled trial of low-dose naproxen sodium, and DEPEND, a proof-of-concept trial of the lipid-lowering agent probucol as a potential inducer of apolipoprotein E (apoE) protein availability (Fig. 3).
three months after randomization. The 3-month assessment was intended to determine whether treatment-related changes, if any, occurred gradually or as a rapid response. LPs were optional, but were undertaken by over half of all participants. The primary outcome was a composite Alzheimer Progression Score (APS, described below) derived using item response theory from various cognitive and biomarker measures. 17 Results of INTREPAD were published in Meyer et al., 2019. 18
b. DEPEND (Dosage and Efficacy of Probucol-Induced apoE to Negate cognitive Deterioration; ClinicalTrials.gov NCT02707458) was a single-arm proof-of-concept trial planned as a 3-month dose-finding phase, followed by a 1-year validation and follow-up phase. As suggested by earlier pilot data, its principal outcome was change in concentration of apoE. Secondary outcomes were corresponding reduction in vascular biomarkers in CSF and blood. 19 LPs were therefore obligatory. Twenty-four participants enrolled in the first phase were given a standard dose of probucol (600 mg), with intention to develop an individualized dosing regimen. Data collection for the first 3-month phase occurred from June through December 2016. The follow-up phase aimed to treat participants with personalized doses of probucol over one year to observe the specified outcomes. See Figure 4 for the participant flow in the recruitment process.

Data Sharing Initiatives
All consent procedures fulfilled modern requirements for human subjects protection, while avoiding excess participant burden. Consent forms were carefully crafted to use simple but comprehensive language (typically at an 8th grade reading level).

Characteristics of the population
The range of data collected at recruitment and updated at each FU encounter is shown in Table 2.
Genotyping. All participants consented to blood acquisition for genotyping, but genetic analysis was performed only in those who were confirmed eligible. DNA was isolated from 200 μl whole blood using a QIASymphony apparatus and the DNA Blood Mini QIA Kit (Qiagen, Valencia, CA, USA). The standard QIASymphony isolation program was used following the manufacturer's instructions. As indicated, allelic variants of six genes associated with AD were determined using pyrosequencing (PyroMark96). [22][23][24][25]
Of the 425 participants who underwent baseline visits, 385 were confirmed as appropriate for final data analysis. Among the 40 exclusions, 31 were judged unsuitable for continued participation because of cognitive deficits that had escaped detection but became apparent upon more detailed testing at baseline.
Other reasons for exclusion included similar post-enrollment detection of stroke (2), anxiety and attention problems (2), refusal of further MRI (2), or discovery that their AD family history in fact failed to fulfill entry criteria (3). Table 3 summarizes key baseline characteristics of the remaining 385 members of the analysis pool. The somewhat smaller number available for data sharing reflects the ongoing process of re-consent to the open science sharing. As of January 2020, we contacted the remaining group of active participants (n=330).
Information about participants who are lost to follow-up or who have withdrawn will be presented with the forthcoming data sharing plan as a second step. Out of the 330 who remained active, 310 accepted (94%), 15 refused (5%) and 5 did not answer (2%). In addition to this re-consent process, data curation, including quality control, may have excluded a few participants.
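The participant counts reported above can be checked with simple arithmetic. The short sketch below uses only the figures given in the text; the variable and function names are ours. It also shows why the three re-consent percentages sum to 101: each is rounded independently to the nearest whole percent.

```python
# Counts as reported in the text.
ASSESSED = 425
ANALYSIS_POOL = 385
ACTIVE = 330

EXCLUSIONS = {
    "cognitive deficits": 31,
    "stroke": 2,
    "anxiety and attention problems": 2,
    "refusal of further MRI": 2,
    "family history not meeting criteria": 3,
}

RECONSENT = {"accepted": 310, "refused": 15, "did not answer": 5}


def rounded_percentages(counts, total):
    """Return each count as a whole-number percentage of the total."""
    return {label: round(100 * n / total) for label, n in counts.items()}


# The exclusion reasons account for the gap between assessed and analysed.
assert sum(EXCLUSIONS.values()) == ASSESSED - ANALYSIS_POOL

# The re-consent outcomes account for every active participant.
assert sum(RECONSENT.values()) == ACTIVE

print(rounded_percentages(RECONSENT, ACTIVE))
```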

Longitudinal follow-up
Data for the observational cohort and trial participants are described separately. The

PREVENT-AD biomarkers
The PREVENT-AD research group measured not only classical AD biomarkers, but also emergent potential indicators of AD progression described below.

a. Cognition
Neuropsychological performance was measured using the RBANS, 26 which evaluates 5 cognitive domains: immediate memory, delayed memory, attention, language and visuospatial abilities. A global cognition score summarizes all these domains. This test was designed specifically to provide sensitive detection of cognitive decline in persons whose cognitive status is still within normal limits.
It is therefore used frequently in prevention trials or studies of cognitively frail (but still "normal") elderly. Its ~30-minute battery is available in both French and English in 4 equivalent versions to reduce practice effects in longitudinal assessment. Trained research assistants administered testing, and scores were calculated by a single PhD neuropsychologist. We developed correction factors to improve version equivalence among the 4 French versions. 27 Apart from this adjustment, we used 'research' (not "corrected" for age) scores instead of published age 'norms' that are customary in clinical use. For more details about "age-corrected" scores, see 'Code Availability'.
At the same visit, we administered the structured Alzheimer Dementia 8 (AD8) interview to the study partner, who rated the participant on eight functional abilities intended to discriminate normal cognitive aging from very mild dementia.
The AD8 was designed specifically for the detection of change over time.
Beginning in 2015, each participant was also asked to rate subjective change in memory abilities using the Measurement of Everyday Cognition (ECog). This instrument uses a four-point scale to ascertain perceived changes over the past year. Although administered annually, these ratings did not necessarily coincide with annual FU visits.

b. Cerebrospinal Fluid (CSF) proteins
Lumbar puncture (LP) was performed by a neurologist (PR-N) in a procedure that typically lasted less than 15 minutes. A large-bore introducer was inserted at the L3-L4 or L4-L5 intervertebral space, after which the atraumatic Sprotte 24 ga. spinal needle was inserted to puncture the dura. Up to 30 ml of CSF were withdrawn in 5.0 ml polypropylene syringes. These samples were centrifuged at room temperature for 10 minutes at ~2000g, and then aliquoted in 0.
Table 4 for parameters and Figure 6-9 for acquisition protocols.
Because heart rate and respiration correlate with functional Blood Oxygen Level Dependent (BOLD) signal, physiological monitoring (heart rate and breathing) was performed during all functional acquisitions, allowing physiological correction of the fMRI data. [30][31][32] During these functional acquisitions, heart rate, chest wall motion and a logic pulse (marking the time of each fMRI volume) were recorded using a BIOPAC MP150 system at a sampling rate of 400 Hz (BIOPAC Systems, Inc., Goleta, CA).
Chest wall motion was monitored using a respiratory belt transducer (TSD201) connected to a RSP100C Respiration Amplifier module, part of the BIOPAC system, while EKG traces were recorded using the ECG100C amplifier of the BIOPAC to monitor participants' heart rate.
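To illustrate how the logic pulse can align the physiological traces with the fMRI volumes, the following minimal sketch finds the rising edges of a trigger channel sampled at 400 Hz and cuts a physiological trace into per-volume segments. It is not the processing used by the study: the function names, the 0.5 threshold and the assumption of a clean two-level pulse are ours, and reading the BIOPAC export format is outside its scope.

```python
FS_HZ = 400  # sampling rate reported in the text


def volume_onset_samples(trigger, threshold=0.5):
    """Return sample indices where the logic pulse crosses the threshold upward."""
    return [
        i
        for i in range(1, len(trigger))
        if trigger[i - 1] <= threshold < trigger[i]
    ]


def volume_onset_times(trigger, fs_hz=FS_HZ, threshold=0.5):
    """Convert rising-edge sample indices to seconds from the start of recording."""
    return [i / fs_hz for i in volume_onset_samples(trigger, threshold)]


def split_by_volume(signal, trigger, threshold=0.5):
    """Slice a physiological trace into segments between consecutive volume onsets."""
    onsets = volume_onset_samples(trigger, threshold)
    return [signal[a:b] for a, b in zip(onsets, onsets[1:])]
```

A pulse that is already high at the first sample is not counted as an onset in this sketch, because no upward crossing is observed.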
Episodic memory task fMRI
An episodic memory task for object-location associations was performed by participants longitudinally. The study design is similar to that published previously. 33,34 Participants were scanned as they encoded an object and its left/right spatial location on the screen. Forty-eight encoding stimuli were presented one at a time for 2000 msec with a variable inter-trial interval (ITI). A twenty-minute break followed encoding, during which time structural MRIs were acquired. After this break participants were presented with the associative retrieval task in which they were presented with 96 objects (48 "old", previously encoded objects; 48 "new" objects) and were asked to make a forced choice among four alternative answers: i) "The object is FAMILIAR but you don't remember the location"; ii) "You remember the object and it was previously on the LEFT"; iii) "You remember the object and it was previously on the RIGHT"; and iv) "The object is NEW". The E
We tested central auditory processing (CAP) using the Synthetic Sentence Identification with Ipsilateral Competing Message (SSI-ICM) test and the Dichotic Stimulus Identification (DSI) test. After having first been assessed for simple auditory acuity (with monosyllabic words), participants were asked to identify spoken "pseudo-sentences," either with various sound levels of a distracting background narrative (SSI-ICM) or with dichotic binaural presentation (DSI). The latter test was available only in French. CAP testing was introduced in 2014. 38,39 In the SSI-ICM test, one pseudo-sentence is heard while a story is recited in the background. Both the sentence and story are played in the same (ipsilateral) ear.
The participant is asked to identify the target sentence among 10 choices offered. Participants performed this task a minimum of 10 and a maximum of 30 times, with designated score-dependent stopping points. 40 The other ear was then tested using the same protocol. SSI-ICM testing can typically be completed in less than 30 minutes.
The DSI task tests dichotic listening capability.
relies on an assumption that informative changes in any AD marker arise from a single underlying latent process, viz., AD pathogenesis. 17 The construct validity of the APS approach was first demonstrated in the BIOCARD study, in which a version that incorporated data from that study showed ability to conjoin four different imaging and CSF markers as predictors of subsequent change in clinical diagnosis to MCI or AD dementia. 17 We used non-trial PREVENT-AD participant data both to estimate parameters for the APS scoring algorithm (using baseline data) and then to demonstrate measurement invariance (i.e., showing that the same set of parametric "weights" estimated at baseline served well to estimate disease progression at later time points). We further demonstrated "portability" of the score to the trial cohort by comparing the performance of observational cohort-derived "weighting" parameters with performance using "weights" derived de novo in trial participants. Variables for inclusion were selected if their longitudinal measure of change more than offset item variance (hence, they were likely to contribute positively to statistical power of the method to detect change).
Included measures were several cognitive test results, neurosensory abilities (olfactory identification), total brain volume, grey matter cortical thickness and density, cerebral blood-flow and CSF biomarkers. Missing data were accommodated in an "averaging over" of interpolating data points assessed by a subsequent iterative validation of the resulting values. The value of the composite APS is evidenced by the fact that this score appears to provide more information on subtle brain changes than any one of its constituent biomarkers. 18

Data Management & Open Science Plans
Management, quality control (QC), validation and distribution of PREVENT-AD data were performed in LORIS, a system designed for linking heterogeneous data (e.g. behavioral, clinical, imaging, genomic) within a longitudinal context. 12 Numerous LORIS modules were used to facilitate the curation process, including the Participant Status, Family Information, Family History, Acknowledgements, Document Repository, Drug Compliance and Data Release modules (Table 5). In addition, behavioral forms included customized algorithms developed for aggregating various pieces of data in a user-friendly manner. Data selection and dissemination were done through the Data Query Tool, or via specialized scripts that prepared large amounts of data via releases in spreadsheet-ready formats. 12

LORIS Modules: Brief Description
Participant Status: Overall status categories ("Active", "Stop Medication Active", "Ineligible", "Withdrawn", "Excluded", etc.) for each participant and their respective statuses for the individual drug trials.
Family Information: Identifying and characterizing relatives of participants.
Family History: Family history of clinical and memory problems.
Acknowledgements: Reference for study collaborators that dynamically generates a publication acknowledgement list.
Document Repository: Centralized location for managing study documents.

For each such data release, a "freeze" was placed on all data as of a specified date. Between the time of the "freeze" and eventual release, data were edited, validated and cleaned. While data were continuously queryable in the Data Query Tool, data releases were also tagged as a series of spreadsheets (including a data dictionary for all instrument variables), archived imaging datasets, and a document summarizing the project and available variables. To gain access to PREVENT-AD data releases, collaborators were required to sign a Data Use Agreement that included a publication policy.

Development of Open Science data sharing
As an initial instance of Open Science data sharing, we released a subset of the
Second, even when partially de-identified data had been prepared for sharing with collaborating research teams, additional dataset de-identification steps were required to share data with a much larger community of researchers. All PREVENT-AD participant names had already been assigned an internal study code. Each of these codes was now assigned a new "public" alphanumeric code, to which the participant's identity cannot be directly linked, with the sole exception of an ability to do this retained exclusively by the StoP-AD team. All brain images to be shared were "scrubbed" to remove all potentially identifying fields from their header (e.g., date of birth or acquisition dates) and structural modalities were defaced to prevent re-identification using 3D rendering of the face. 42
data, medical information, family history of AD, and certain demographic data such as ethnicity and profession/occupation (see Table 7). Furthermore, some of these sensitive data may be shared with additional precautions, following an 'on-request' procedure. Permitted uses of data are communicated using Consent Codes, a structured way of presenting consent-based permissions and restrictions on the use of Open Science data. 44,45 In our case, PREVENT-AD data must be used for neuroscience research as stipulated in the consent forms and in the terms of use.
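The re-coding step described above, in which each internal study code receives a new "public" alphanumeric code, can be sketched as follows. This is an illustration under our own assumptions, not the procedure used by the StoP-AD team: the function name, the code length of 8 and the alphabet are hypothetical. The key property is that public codes are drawn at random rather than derived from the internal codes, so the only link between the two is the mapping table, which would be kept privately.

```python
import secrets
import string

# Assumed alphabet for public codes: upper-case letters and digits.
ALPHABET = string.ascii_uppercase + string.digits


def assign_public_codes(internal_codes, length=8):
    """Map each internal study code to a random, unique public code.

    The returned dictionary is the only link between the two code sets
    and would need to be stored securely by the study team.
    """
    mapping = {}
    used = set()
    for internal in internal_codes:
        public = "".join(secrets.choice(ALPHABET) for _ in range(length))
        while public in used:  # redraw on the rare collision
            public = "".join(secrets.choice(ALPHABET) for _ in range(length))
        used.add(public)
        mapping[internal] = public
    return mapping
```

Because the codes are random, running the function twice yields different mappings; a real workflow would generate the table once and reuse it for every release.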

Ongoing and future efforts.
AD research is at last regarded as a critical priority for Canada and the world. [4][5][6] Accordingly, it is now essential to collect data from large cohorts, and subsequently to make such data available to the global research community. As an encouragement to other projects contemplating the conversion of their data to open science access, we note that this transition in PREVENT-AD required substantial resource availability but was relatively smoothly achieved over the course of approximately six months. In that time, enormous efforts were required to obtain additional ethics review and a re-consent process for all participants whose data are to be shared. Additional data preparation and "cleaning" was also required. Although costly, we expect these efforts and resources to yield much incremental value to the project in years to come.
The StoP-AD Centre continues to collect data. In time, additional data-gathering modalities along with an expanded set of longitudinal observations in PREVENT-AD will become available to the greater research community. Newer methods will include other neuroimaging techniques, such as positron emission tomography (PET) and magnetoencephalography (MEG), as well as genomic information from large-scale methods, such as GWAS. Additional information will also become available on lifestyle, comparison-group data, and data from participants who develop mild cognitive impairment (MCI). To further complement the existing cohort, novel sensitive blood-based biomarker assays are being developed to bypass the CSF (or PET) requirement now needed to monitor disease progression in both the pre-symptomatic and symptomatic phases of AD. Given that some of these new acquisitions will differ from pre-existing PREVENT-AD modalities, they may yield less longitudinal information, but will nonetheless enhance the information value of the evolving data resource. With these new data continually being acquired in this "at-risk" population, PREVENT-AD is poised to become a marquee study in dementia research, akin to other data sharing initiatives such as the Alzheimer's Disease Neuroimaging Initiative (ADNI). Emerging machine learning and deep learning methods are bound to change the analytic landscape in the years to come and, as PREVENT-AD releases more data, our dissemination methods will have to adapt to better reflect this new reality.
Code Availability:

Generation of neuropsychological dataset: RBANS specificities
The RBANS French versions are known not to be fully equivalent. The author of the RBANS (Dr. Christopher Randolph) recommended systematic correction of +4 for the semantic fluency section of version B. Our group compared performance at the 3M visit with BL scores in the INTREPAD trial, assuming no treatment effects and comparable abilities at the two timepoints. It was determined that additional corrections were needed to control version differences. Suspecting part of this problem could be traced back to the nonequivalence of English and French tests, we developed adjustment factors that brought the several versions into approximate equivalence. This procedure is described in Lafaille-Magnan et al., 2018. 27 The script used is available at https://github.com/marieleyse/RBANS-correction.
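As a simple illustration of a version-specific additive correction, the sketch below applies the +4 adjustment for the semantic fluency section of version B mentioned above. It is not the published script: the function name, the table layout and the subtest labels are hypothetical, and the additional adjustment factors developed by our group are deliberately not reproduced here (they would be further entries in the same table).

```python
# Only the +4 for semantic fluency on version B comes from the text.
# Any further (version, subtest) entries would have to be taken from the
# published correction script; none are assumed here.
VERSION_CORRECTIONS = {("B", "semantic_fluency"): 4}


def corrected_score(raw_score, version, subtest, corrections=VERSION_CORRECTIONS):
    """Add the version-specific correction, if any, to a raw subtest score."""
    return raw_score + corrections.get((version, subtest), 0)
```

With this table, a raw semantic fluency score of 18 on version B becomes 22, while the same score on any other version is left unchanged.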
Whereas clinical testing may call for scoring criteria that vary by age to compare individual performance to "normative" data, we avoided correcting for age in scoring the RBANS for research purposes. We scored all participants using norms for individuals aged 60-69 years. This method revealed an actual decline in performance with age, whether or not this decline was related to disease. Both scores (clinical, adjusted for age; and research, using 60-69 norms) are available.

Generation of episodic memory task fMRI dataset: software and stimuli
Psychology Software Tools E-Prime (version 2) was used to design the experiment, collect the data and perform analysis. Data were saved in .edat2 format, readable by the program only. Data were also saved as text files to facilitate data sharing (https://pstnet.com/products/e-prime-legacy-versions/).
For de-identification before data sharing, the text files containing the behavioral task fMRI related data were scrubbed to remove dates and PREVENT-AD study ID using a script available on GitHub: https://github.com/cmadjar/Loris-MRI/blob/open_preventad_v20.1.0/tools/scrub_and_relabel_task_events.pl. Images used for the task were from a bank of standardized stimuli. 46,47

Generation of de-identified MR images for data sharing
Anatomical images (T1W, T2W, FLAIR, MP2RAGE, quantitative T2*, GRE T2*, fieldmap magnitude files) were defaced using the defacing algorithm developed by Fonov and Collins (2018), which was shown not to significantly affect data processing outcomes. 42 The code was slightly modified so that it could be integrated into the LORIS platform. The version of the script used to run the defacing on the PREVENT-AD datasets is available on GitHub (https://github.com/cmadjar/Loris-MRI/blob/open_preventad_v20.1.0/uploadNeuroDB/bin/deface_minipipe.pl).
Identifying fields (such as PREVENT-AD participant's ID, date of birth, date of MRI, etc.) were scrubbed from the DICOM headers using the DICOM Anonymization Tool (DICAT; https://github.com/aces/DICAT).
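The scrubbing of task text files described above (removing dates and replacing the study ID) can be illustrated with a few lines of Python. This is a hedged sketch and not a port of the Perl script linked in this section: the function name is ours, the date layout (two digits, two digits, four digits, separated by hyphens) is an assumption about how dates appear in the exported files, and the identifiers in the usage example are invented.

```python
import re

# Assumed date layout in the exported text files, e.g. 03-15-2016.
DATE_RE = re.compile(r"\b\d{2}-\d{2}-\d{4}\b")


def scrub_text(text, study_id, public_id, date_placeholder="XX-XX-XXXX"):
    """Mask dates and swap the internal study ID for the public one."""
    without_dates = DATE_RE.sub(date_placeholder, text)
    return without_dates.replace(study_id, public_id)


# Usage with invented identifiers:
example = "SessionDate: 03-15-2016\nSubject: STUDY0001\n"
print(scrub_text(example, "STUDY0001", "PUB7K2QX"))
```

Dates written in other layouts would pass through this pattern untouched, so a production script would need to enumerate every date format actually present in the files.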

Available data
Since April 2019, MRI raw data and basic demographics (age at MRI, sex, language, handedness) are available for sharing using the LORIS platform (https://openpreventad.loris.ca).
Users are able to access the OPEN data listed in Table 6, for a group of participants who had agreed to data sharing at the time of the release (n=232 as of April 2019). The URL provided leads to the PREVENT-AD OPEN LORIS instance, since the CONP portal was not entirely functional at the time of the release. More data from these 232 participants will be accessible via a Registered Access model and will eventually be released (Table 7). Other waves of data sharing, with increased numbers of participants and other types of data, are also planned to enrich this important research resource.

Published manuscripts using this dataset
Major findings using PREVENT-AD data from data releases (1.0 to 5.0) can be found in 22 published articles and in more than 75 abstracts (http://preventalzheimer.net/). In brief, several analyses demonstrated novel associations, correlations or predictions among various direct and derived measures of AD pathology, including MRI, CSF biomarkers of AD, protein mediators of innate immune activity, and neurosensory faculties. 18,36,38,[48][49][50] In addition to publications of results derived from the INTREPAD trial, associations were established between measures of AD pathology (revealed by MRI and/or CSF proteins) and subjective cognitive decline, and proximity to age at onset of parental AD symptoms. 18,[51][52][53][54] Novel MRI techniques and disease progression modelling were also validated in this dataset. 17,41,55 Articles examining the association between PET and CSF amyloid and tau, and their relation with vascular risk factors, were recently published. Several additional papers are in preparation.
We expect that sharing the PREVENT-AD data with the larger community under the open science principle will result in many additional publications in the coming years.
Additional information about the PREVENT-AD program can be found at http://prevent-alzheimer.net/.

TECHNICAL VALIDATION
Homogeneity of procedure: We decided to acquire data for this study at a single site to limit the need for harmonization. Other elements were part of the protocol with the intention to reduce the risk of external variability: i) cognitive testing was scored by a single neuropsychologist and the final scores were automatically computed by LORIS; ii) MRI acquisitions were performed on the same scanner and quality controlled by the same individual; iii) lumbar punctures were performed by the same physician (more than 95% of the time) following an internationally accepted, published protocol; iv) clinical information was reviewed by the same two physicians; v) clinical evaluation was performed by the same team of nurses and psychometrists.
Data entry: Data were entered in LORIS in duplicate. The 'conflict resolvers' feature of LORIS allowed detection of discrepancies between two entries of the same information and systematic correction of mistakes by the data entry personnel. In case of doubt, the proximity of the team of nurses and psychometrists made it easier to verify information against the source documentation at the time of data entry. Before data releases, the clinical team, headed by the study physicians and the research coordinator, reviewed special cases and determined their suitability for data analysis by a pass or fail decision. Any failed cases were then flagged in LORIS and made unavailable for analysis.
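The principle behind double data entry can be illustrated with a minimal Python sketch. This is not the LORIS 'conflict resolvers' implementation; the function name `find_conflicts`, the field names and the example values are all hypothetical and serve only to show how two independent entries of the same form can be compared field by field.

```python
from typing import Any, Dict, Tuple


def find_conflicts(first_entry: Dict[str, Any],
                   second_entry: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return every field whose value differs between two independent entries.

    A field present in only one entry is reported with None for the other.
    """
    conflicts = {}
    for field in sorted(set(first_entry) | set(second_entry)):
        first_value = first_entry.get(field)
        second_value = second_entry.get(field)
        if first_value != second_value:
            conflicts[field] = (first_value, second_value)
    return conflicts


# Hypothetical double entry of the same paper form (illustrative values only).
entry_a = {"visit_date": "2013-05-02", "rbans_total": 102, "handedness": "right"}
entry_b = {"visit_date": "2013-05-20", "rbans_total": 102, "handedness": "right"}

conflicts = find_conflicts(entry_a, entry_b)
for field, (first_value, second_value) in conflicts.items():
    print(f"{field}: first entry = {first_value!r}, second entry = {second_value!r}")
```

Each reported field would then be resolved by returning to the source document rather than by trusting either entry.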
Quality control (QC): LORIS has several internal QC checks in place to avoid missing data in required fields, out-of-range values, etc. Each week, 26 automatic checkpoints were run to detect any abnormalities. At every cycle of data freeze and release, the data entry personnel completed data entry and performed data verification via additional automatic checkpoints. MRI acquisitions were harmonized across sessions by saving each session's protocol directly on the scanner console so that the same protocol could be used for a given MRI visit. In addition, acquisition parameters were automatically checked at the time of insertion into LORIS to ensure acquisition harmonization. Finally, manual quality control was performed on all structural modalities of the original dataset by the same individual, and the result of that quality assessment was imported into the open dataset along with the shared images. QC status, predefined comments and text comments were saved directly in LORIS at the time of QC.
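An automated acquisition-parameter check of the kind described above could look like the following Python sketch. It is an assumption about the general approach, not the LORIS insertion pipeline: the function `check_acquisition`, the parameter names, the tolerance and the reference values are hypothetical placeholders, not the PREVENT-AD protocol.

```python
import math
from typing import Dict, List, Optional, Tuple


def check_acquisition(params: Dict[str, float],
                      reference: Dict[str, float],
                      rel_tol: float = 1e-3) -> List[Tuple[str, float, Optional[float]]]:
    """Compare a scan's parameters with a reference protocol.

    Returns (name, expected, found) for each parameter that is missing
    or that deviates from the reference by more than rel_tol.
    """
    deviations = []
    for name, expected in reference.items():
        found = params.get(name)
        if found is None or not math.isclose(found, expected, rel_tol=rel_tol):
            deviations.append((name, expected, found))
    return deviations


# Placeholder reference protocol and scan (illustrative values only).
reference_protocol = {"repetition_time_ms": 2300.0,
                      "echo_time_ms": 2.98,
                      "flip_angle_deg": 9.0}
scan = {"repetition_time_ms": 2300.0,
        "echo_time_ms": 3.5,
        "flip_angle_deg": 9.0}

deviations = check_acquisition(scan, reference_protocol)
for name, expected, found in deviations:
    print(f"{name}: expected {expected}, found {found}")
```

A scan with a non-empty list of deviations would be flagged for review instead of being silently accepted.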
Additional QC for data sharing on CONP: One out of every 10 research files was manually reviewed to compare the source documents with data entered in LORIS. We chose to review data acquired in 2013 and 2017 and reviewed all data from the following 'instruments': RBANS tests, smell identification tests, laboratory blood results and central auditory processing tests (at BL and all annual FU). We selected handedness, family history of AD, medical and surgical history and laboratory blood results for review at eligibility and enrolment visits. Out of 140 instruments reviewed, only 5 mistakes were found and corrected (1 wrong visit date, 3 RBANS repetitions not entered, and 1 piece of missing information in the family history of AD). We added 5 more automated QC checks to make sure that no names were part of the dataset, that no duplicated visit labels were created and that no data entries remained 'in progress', and to check for '0' values in laboratory results. After the de-identification process of the anatomical MR images, every single image was visually reviewed to ensure proper defacing. Any problem with the defacing process was detected, fixed and reviewed. A final review to detect the presence of dates and any potentially identifiable information was also performed on the whole shared dataset.
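The pre-release checks listed above (names, duplicated visit labels, entries still 'in progress', '0' laboratory values) can be sketched in Python as follows. This is a minimal illustration under assumed conventions, not the checks actually run in LORIS: the record layout, the field names, the participant identifiers and the list of known names are all hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def release_checks(records: List[Dict[str, Any]], known_names: List[str]) -> List[str]:
    """Return human-readable descriptions of problems found before a release."""
    problems = []

    # Duplicated visit labels for the same participant.
    counts = Counter((r["participant_id"], r["visit_label"]) for r in records)
    for (participant, label), n in sorted(counts.items()):
        if n > 1:
            problems.append(f"duplicate visit label {label} for {participant}")

    for r in records:
        where = f"{r['participant_id']}/{r['visit_label']}"
        # Data entry that was never completed.
        if r.get("entry_status") == "in progress":
            problems.append(f"data entry in progress at {where}")
        # Laboratory values of exactly 0, which may hide a missing result.
        for test_name, value in r.get("lab_results", {}).items():
            if value == 0:
                problems.append(f"zero lab value for {test_name} at {where}")
        # Names appearing in free-text comments.
        comment = r.get("comment", "").lower()
        for name in known_names:
            if name.lower() in comment:
                problems.append(f"possible name in free text at {where}")
    return problems


# Hypothetical records (illustrative identifiers and values only).
records = [
    {"participant_id": "P001", "visit_label": "PREFU12", "entry_status": "complete",
     "lab_results": {"glucose": 5.1}, "comment": ""},
    {"participant_id": "P001", "visit_label": "PREFU12", "entry_status": "in progress",
     "lab_results": {"glucose": 0}, "comment": "Accompanied by Jane Doe"},
]
problems = release_checks(records, known_names=["Jane Doe"])
for problem in problems:
    print(problem)
```

A plain substring search for names is deliberately simple here; a real de-identification review would still require the manual inspection described in the text.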

Uniformity and comparability of data collection methods and analysis:
The aforementioned attention to design and quality control contributed to the reproducibility of findings by different investigators, thereby addressing concerns about reproducibility. 56,57 Similarly, rigorous methods were employed to ensure proper data curation. 58 Overarching strategies such as the FAIR data principles are the current guideposts to success in this era of Big Data, as data sharing becomes a necessity. 14 Leveraging these data to create larger pools of data for conjoint analysis is also becoming the norm, but this requires proper documentation, full provenance and reliable organization. Even the most thoroughly documented and harmonized datasets are of little use without ease of access. Thus, user-friendly data systems like LORIS also provide a substantial advantage for such data-sharing efforts.

USAGE NOTES
For reuse of the PREVENT-AD data, users must respect the PREVENT-AD terms and publication policy, which can be found when requesting an account on https://openpreventad.loris.ca/. In brief, data must be used for neuroscience purposes, and users must commit to properly citing the dataset and to following good data use practices (e.g. not attempting to re-identify participants). The PREVENT-AD research group requests that anyone using PREVENT-AD data for publication state in the methods section of their publication that PREVENT-AD is the (or one of the) source of data, and cite 2 PREVENT-AD scientific papers (including this one). 8 For reuse of the PREVENT-AD data, we suggest that researchers carefully read and understand the context of the data collection described in this paper and in the documentation available with the data. For re-analysis of the INTREPAD trial data, please refer to our results paper recently published in Neurology and contact Dr. John Breitner for possible collaboration and further details about the study design. 18
The label convention used in the PREVENT-AD dataset is presented in Table 8. Data collected at a specific time point (under a specific visit label) are grouped within instruments containing numerous variables (e.g. data for total RBANS scores, 12 months after baseline in the observational cohort, will be associated with PREFU12 under the instrument named 'RBANS'). IMPORTANT NOTES about the label convention: The first visit in the program is always labelled as PREEL00. INTREPAD trial participants are identified at the enrolment visit (NAPEN00) and at the following visits (e.g. NAPBL00, NAPFU03, etc.). Even after the termination of the treatment and trial protocol (24 months; FU24), INTREPAD participants remain labelled as NAP for all the following annual FU (NAPFU36, NAPFU48, for example). However, if a participant was not able to follow the study protocol until NAPFU03, he/she was excluded from the trial and 'switched back' to the observational cohort (PRE) to continue to be followed annually (PREFU12 and following). Central auditory processing (AP) also has its own label, even though this test was performed in conjunction with BL and/or FU. Lumbar puncture (LP) stands alone, as the procedure was done on a separate day, close to the annual FU.

PREVENT-AD is the result of efforts of many other co-investigators from a range of academic institutions and private corporations, as well as an extraordinarily dedicated and talented clinical and technical assistant staff, students, and postdoctoral fellows. Here is listed the entire PREVENT-AD

COMPETING INTEREST
No competing interest was disclosed.