Background
Dementias Platform UK (DPUK) is a £53M public-private partnership established by the Medical Research Council (MRC) to facilitate experimental medicine programmes that bridge the evidence gap between basic mechanistic research and large-scale trials. DPUK does this in three ways: 1) by providing access to individual and aggregate level data from multiple cohort studies for hypothesis generation and testing, 2) through a register of highly characterised risk-stratified volunteers consented for re-contact (which is undergoing recruitment), and 3) through a programme of academic and industry based experimental studies. Here we describe the DPUK Data Portal, which facilitates access to cohort data (including one e-cohort) for 3 461 244 individuals in 35 cohorts. Metadata are available for a further 12 cohorts.
There are several arguments for multi-cohort-focused data repositories including: 1) as research questions focus on smaller effect sizes access is required to data at-scale to achieve statistical purchase, 2) as emerging research questions become more complex access to diverse multi-modal data is needed for rigorous hypothesis testing, 3) as scientific rigour increases there is growing recognition of the value of triangulation and replication using independent data, 4) as cohort datasets increase in size the cohort-by-cohort transfer of large datasets is decreasingly feasible, 5) as cohort datasets become more complex the mastering of bespoke data models for survey, omics (genomics, proteomics and metabolomics), imaging and device data becomes burdensome, 6) as cohort datasets become more sensitive the non-auditable use of data is decreasingly acceptable. Whilst these issues can be addressed individually, the Data Portal provides an integrated solution.
Data resource basics
The DPUK Data Portal (https://portal.dementiasplatform.uk/) [1] is a collaboration between DPUK and a growing number of cohort research teams who wish to make their data globally accessible (Table 1).
Benefits for cohort research teams
The Data Portal supports the work of cohort research teams through data curation, access management, and cohort enhancement. The Data Portal contributes to the widely accepted FAIR principles (Findability, Accessibility, Interoperability and Reusability) to improve the infrastructure supporting the reuse of data [2]. DPUK also supports the legal engagement needed for data transfer into the Data Portal on behalf of accessing researchers, by ensuring that robust contractual arrangements, in the form of the DPUK Data Deposit Agreement, are in place as an overarching mechanism for data governance and use [3].
The Data Portal operates within the UKSeRP environment according to ISO 27001 [4] as a data processor according to the UK Data Protection Act 2018 [5] and EU General Data Protection Regulation 2016 [6]. Data may be accessed remotely for in-situ analyses but not downloaded to third-party sites. Data-use approval remains with the cohort research teams who retain control over data access. Preparing datasets for third-party researchers and providing suitable documentation is resource intensive. The Data Portal reduces this burden through the management of access requests on behalf of cohort research teams, use of a common data model, and the development of standard documentation. For data stored within the Data Portal, the need for repeated data transfer is eliminated.
The Data Portal enables cohort enhancement through web-based procedures that can be ‘branded’ for each cohort. This utility is suitable for collecting consent, questionnaire and cognitive performance data. Whilst the comparability of data collected through different modalities is an empirical question, there is growing evidence on the acceptability and validity of remote recruitment and assessment procedures [7]. The UKSeRP environment has been specifically designed for use with linked electronic health records and is a suitable environment for the onward sharing of linked data.
Benefits for researchers
For researchers the Data Portal enables the pursuit of ideas from any location with suitable connectivity. It has three core utilities: data discovery, access, and analysis. A tiered data discovery pathway begins with the Cohort Matrix [8] which provides a high-level comparison of data availability for each cohort. The Cohort Directory [9] enables detailed exploration across cohorts using a range of metadata categories (Figure 1).
For data access, the Data Portal provides a single point of contact for multiple cohorts. The application form is a synthesis of key issues that are addressed by most if not all the individual cohort access management procedures, comprising public interest, potential for subject identifiability, scientific rationale, appropriate analysis plan, and conflict of scientific interest. These issues are specifically identified as fail-fast criteria to enable cohort data access management panels to more easily evaluate an application. Of the 47 cohorts that have provided data or metadata to the Data Portal, 26 have adopted the DPUK electronic application pro-forma (Table 1). A further 14 require their own access form to be completed alongside the DPUK process, and seven require their own process to be used exclusively.
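The fail-fast screening described above can be illustrated with a short sketch. The criterion keys and the `screen_application` helper are hypothetical and are not taken from the DPUK application pro-forma; the sketch only shows the idea of stopping at the first issue an applicant has left unaddressed.

```python
# Hypothetical keys paraphrasing the five issues listed in the text;
# they are not the field names of the actual DPUK form.
FAIL_FAST_CRITERIA = (
    "public_interest",
    "subject_identifiability",
    "scientific_rationale",
    "analysis_plan",
    "conflict_of_scientific_interest",
)


def screen_application(application):
    """Return the first criterion with no substantive answer, or None."""
    for criterion in FAIL_FAST_CRITERIA:
        answer = application.get(criterion, "")
        if not answer.strip():
            return criterion  # fail fast: stop at the first gap
    return None


# Illustrative application with one section left blank.
example = {
    "public_interest": "Informs dementia risk prediction.",
    "subject_identifiability": "Aggregate outputs only.",
    "scientific_rationale": "",
    "analysis_plan": "Mixed-effects models.",
    "conflict_of_scientific_interest": "None declared.",
}
first_gap = screen_application(example)
```

A check of this kind would only flag missing detail; judging the quality of each answer remains with the cohort data access management panels.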
For data analysis, once approval has been granted by a cohort research team and a data access agreement has been completed, data are made available within the secure analysis area of the Data Portal for in-situ analysis. The analysis area provides several widely used general statistical packages (R, STATA, SPSS, SAS, Matlab, Python). Specialist software can be made available on request and bespoke software can be uploaded upon approval. Researchers are provided with a personal virtual desktop infrastructure which requires two-factor authentication to access. The standard desktop specification is optimised for epidemiologic survey data and includes 8GB RAM and four CPUs, which are sufficient for most analyses. Bespoke desktop configurations can be requested for computationally intensive operations.
Multi-modal analysis is facilitated by optimising a virtual desktop infrastructure for survey, imaging, or omics analysis, in terms of capacity and tooling, and then combining results within a project-specific common data folder. For example, image-derived phenotypes may be generated within the environment using specialist software, transferred to a common folder, and then integrated with survey data. For device data (e.g. data from smart phones, accelerometers, portable cardiac monitors), the current solution is for data to be processed into clinically meaningful measures before upload to the Data Portal, and to be accessible via the standard desktop. Linked data are also available where governance permissions allow. Results can be exported for dissemination purposes. All exports must be requested, screened for non-identifiability, and approved prior to release. Microsoft Office is provided for preparing reports for export.
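As an illustration of the integration step described above, the following minimal sketch joins image-derived phenotypes to survey records on a shared participant identifier. The column names, the `merge_by_participant` helper, and all values are assumptions made for the example; they do not describe the Data Portal's actual file layout.

```python
def merge_by_participant(survey_rows, idp_rows, key="participant_id"):
    """Inner-join two lists of dict rows on a shared participant identifier."""
    idp_index = {row[key]: row for row in idp_rows}
    merged = []
    for row in survey_rows:
        match = idp_index.get(row[key])
        if match is not None:
            # Combine survey fields with the matching image-derived fields.
            merged.append({**row, **match})
    return merged


# Illustrative rows only; identifiers and measures are invented.
survey = [
    {"participant_id": "P001", "age": 71, "mmse": 28},
    {"participant_id": "P002", "age": 68, "mmse": 25},
]
image_derived = [
    {"participant_id": "P001", "hippocampal_volume_mm3": 3950.0},
]
combined = merge_by_participant(survey, image_derived)
```

An inner join is used here so that only participants present in both sources are retained; an analysis that needs to keep unmatched survey records would use a left join instead.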
The Data Portal allows joint use of data within research consortia. Although the virtual desktop is the researcher’s own personal virtual laboratory, for distributed research groups and for consortia, the Data Portal can be used to hold a core dataset which can then be accessed by researchers in multiple locations without risk to the integrity of the core. This network of virtual desktops is flexible and can be configured in terms of access rights and capacity according to the requirements of the consortium.
The Data Journey
The data journey begins with upload to the Data Portal (Figure 2). Datasets and data dictionaries are received from cohorts on an ‘as-is’ basis along with other supporting documentation. Data are then curated to a common data model. The data model simplifies the analytic challenge of working across multiple datasets by providing a standard structure, variable naming and value labelling conventions. It is optimised for the analysis of ‘flat-file’ observational data and allows sorting by cohort, data category, repeat measurement to assess measurement error (array), and serial measurement to detect change (wave). Higher order data must be pre-processed prior to curation. Other data models, such as CDISC [10], OMOP [11], or HPO [12], involve structural complexity that is rarely relevant to cohort-based analyses. Data curation is resource intensive and is ongoing. To enable the feasibility of analyses to be assessed prior to a data application being made, a set of 20 variables relevant to dementia has been harmonised across cohorts (Figure 2). Researchers may request access to either native or curated data.
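The sorting dimensions of the common data model named above (cohort, data category, array, wave) can be sketched as a simple record type. This is an illustrative assumption about how such a record might look, not the DPUK schema; the class name, field names, and example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableRecord:
    cohort: str    # contributing cohort
    category: str  # data category, e.g. cognition
    name: str      # standardised variable name
    array: int     # repeat measurement, used to assess measurement error
    wave: int      # serial measurement, used to detect change


def sort_records(records):
    """Order records by cohort, category, variable, then wave and array."""
    return sorted(
        records,
        key=lambda r: (r.cohort, r.category, r.name, r.wave, r.array),
    )


# Invented example records.
records = [
    VariableRecord("COHORT_B", "cognition", "memory_score", 0, 2),
    VariableRecord("COHORT_A", "cognition", "memory_score", 1, 1),
    VariableRecord("COHORT_A", "cognition", "memory_score", 0, 1),
]
ordered = sort_records(records)
```

Keeping array and wave as separate fields lets repeat measures within a visit be distinguished from change between visits.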
Data collected
Data are available for 35 cohorts (Table 1). Of these, 22 (n=1 399 082) have uploaded full or partial datasets and 13 (n=2 062 162) will upload on a per project basis. A further 12 cohorts (n=52 361) have begun the process of making data available and have provided metadata.
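The cohort and participant counts above can be checked with simple arithmetic. The figures are taken from the text; the variable names are illustrative.

```python
# Figures as reported in the text.
uploaded = {"cohorts": 22, "participants": 1_399_082}
per_project = {"cohorts": 13, "participants": 2_062_162}
metadata_only = {"cohorts": 12, "participants": 52_361}

# Cohorts for which data are available (uploaded or per-project upload).
accessible_cohorts = uploaded["cohorts"] + per_project["cohorts"]
accessible_participants = uploaded["participants"] + per_project["participants"]

# All cohorts that have provided data or metadata.
all_cohorts = accessible_cohorts + metadata_only["cohorts"]
```

The totals agree with the 35 cohorts and 3 461 244 individuals reported for the Data Portal, and with the 47 cohorts that have provided data or metadata.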
The data are diverse. Clinical cohort studies include familial disease cohorts [13][14], disease focused patient cohorts [15][16], ageing focused population cohorts [17][18][19][20], repurposed cardiovascular cohorts [21][22], birth cohorts [23][24], repurposed cancer cohorts [25][26], and disease agnostic cohorts [27][28][29]. In terms of real world evidence the Data Portal provides access to an e-cohort (SAIL-DeC) covering health records for 1.2m individuals including 130k dementia cases [30]. Data availability varies according to cohort but includes epidemiologic survey, imaging, genetics, and linked administrative data. The XNAT [31] imaging platform is used to receive and process DICOM and NIfTI files. For genetics, variant call format and allele frequency data may be uploaded. Although not a cohort, the Data Portal provides a link to the UK-CRIS Network for natural language processing of 2.5m UK mental health records [32]. Overall these cohorts and data modalities represent an unusually complex data environment suitable for machine learning as well as hypothesis driven analyses.
Looking ahead, brain bank digital data, omics, devices, and environmental data are developing areas for DPUK. Cohorts remain the best source of post mortem scientifically informative brain tissue due to the wealth of background data that are available. Cohort derived brain tissue can also represent a range of disease stages for any particular outcome and a range of pathologies. DPUK is working with the UK Brain Banking Network to provide a central database linking brain donation to cohorts. Devices are an increasingly important source of data. To explore the collection of device data, DPUK provides a cost-neutral pipeline for data capture, processing and storage for collaborating cohorts. Specific omics pipelines can be made available on request.
Data resource use
From its public launch in November 2017 to end of December 2018, 81 data access requests have been received involving 149 applicants. The 81 requests span 41 institutions (34 academic, five commercial, two government) in nine countries and 51 requests involve multiple cohorts. Of the 81 data requests, 11 have been declined and seven withdrawn. The remainder have either been approved (n=39) or are under review (n=24). Our target response time for a decision on applications is 28 days. Currently the median response time is 25 days (mean=44 days).
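The breakdown of requests reported above can be tallied as follows. The counts come from the text; the variable names are illustrative.

```python
# Counts as reported in the text.
requests_total = 81
outcomes = {"approved": 39, "under_review": 24, "declined": 11, "withdrawn": 7}
institutions = {"academic": 34, "commercial": 5, "government": 2}

# Share of all requests in each outcome, rounded to whole percentages.
outcome_pct = {
    name: round(100 * count / requests_total)
    for name, count in outcomes.items()
}

outcomes_sum = sum(outcomes.values())
institutions_sum = sum(institutions.values())
```

The outcome counts sum to the 81 requests and the institution counts to the 41 institutions. The gap between the median (25 days) and mean (44 days) response times suggests a right-skewed distribution, in which a minority of slow decisions raises the mean.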
Project proposals are diverse with applications coming from multi-disciplinary applicants involving multiple institutions. Projects include Mendelian randomisation, imaging, psychometric, and machine learning studies alongside risk stratification models for dementia prediction and diagnosis. Others are less dementia specific such as: trajectories of longitudinal assessment of comorbidities between mental health, hormonal indicators and cognitive change; the impact of child adversity on adult outcomes; the longitudinal tracking and determinants of well-being and cognitive performance; and successful cognitive ageing in 90+ year olds.
To facilitate innovative and exploratory analyses, without compromising data security or cohort governance principles, the Data Portal can be configured as a ‘sandbox’ environment upon request. An example of this was the hosting of a datathon at the Alan Turing Institute utilising the Deep and Frequent Phenotyping pilot study data [33]. These multi-modal data include Magneto-encephalography (MEG), Positron Emission Tomography (PET), structural and functional MRI, ophthalmology, gait, and serial cognitive and clinical assessment. The Data Portal was used to host 40 data scientists over a three-day multidisciplinary workshop, during which traditional regression and machine learning procedures were used to interrogate the data within the virtual desktop interface.
Strengths and weaknesses
A strength of the Data Portal is that it obviates the need for repeated data transfer. Other strengths include providing a single point of access for multiple cohort datasets, streamlined and standard access procedures, a common data model, a secure analysis environment, and a process which is fully auditable from data upload to results export. The Data Portal is optimised and populated for dementia. However, it is a generic solution to the problem of analysing cohort data that can be used for any health outcome for which data are available.
The Data Portal provides access to real world evidence to inform experimental medicine and other clinical studies. Observational datasets can be used to inform emerging hypotheses, scrutinise genetic instruments in a Mendelian randomisation framework, and validate experimental findings. This has particular relevance for biomarker development and drug discovery. A secure data repository also strengthens the case for national data linkage agreements. In the UK for example, DPUK is working closely with Health Data Research UK [34] to establish cohort linkage to electronic health records on behalf of all collaborating UK cohorts.
The Data Portal provides a solution for data access beyond the UK. The Data Portal is not geographically restricted, and data are available on the Data Portal from an increasing number of international cohorts. However, national or regional repositories may be more acceptable to funders. To increase the overall size of the available data corpus, collaboration is underway with Dementias Platform Korea, EMIF AD, and the Ontario Brain Institute to establish a fully interoperable environment for European, Korean, and Canadian data.
Challenges include meeting the data access needs and expectations of diverse scientific communities; epidemiologic, imaging, genetics, and data science communities each have different conventions over what constitutes an appropriate data request. This problem is illustrated in that 11 out of 81 (14%) Data Portal applications were declined due to insufficient detail. Parallel processes of expectations coalescing across disciplines, and educating applicants on how to develop high quality proposals, will assist all stakeholders to simplify and standardise procedures.
Time is required for cohort research teams to adjust to the opportunity provided by a data platform although for many cohort research teams centralised data access management provides immediate advantage. For researchers, a challenge is the discipline of accessing data remotely rather than locally. However, as datasets become increasingly valuable and sensitive, remote access via secure repositories will likely become accepted routine practice. A more fundamental limitation is that the data repository model is not appropriate for all datasets. Clearly there is a need for a mixed model and the Data Portal offers both centralised and distributed analyses.
In the dementia space, other data platforms are available. The JPND Global Cohort Directory [35] provides contact details for 175 cohorts (n=3 586 109) whilst the IALSA Network [36] provides details for 110 cohorts (n=1 485 410). More sophisticated and convenient data discovery tools are provided by GAAIN [37] with 47 cohorts (n=480 020). GAAIN also offers centralised processing for selected datasets. EMIF-AD [38] offers a comprehensive data harmonisation programme for a selection of their 60 catalogued cohorts (n=135 959) and 18 electronic health records datasets (n=65M). For selected datasets EMIF-AD provides centralised cohort data processing facilities through the TranSMART platform [39].
Data resource access
The researcher journey begins with the data discovery tools (Figure 2). The Cohort Matrix and Directory are accessible to registered bona fide researchers. Registration requires having either an academic email address or an industry email from a certified company, and in the case of PhD or Master’s students, the requirement is to have a senior researcher as study lead. Registered researchers can complete a data access application form and submit it for review by the data guardians of the datasets requested. Upon approval, completion by the applicant of a data access agreement is required prior to access being granted. DPUK undertakes to send this to the applicant’s legal representative within two working days. Upon receipt of a completed data access agreement, data access is granted within five working days [40].
Two-factor authentication is required to enable access to approved datasets. This involves the provision of a username with password creation, and an authentication code generated by an app on a mobile device of the applicant’s choosing. The data may then be accessed for analysis on the Data Portal.
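The text does not specify how the mobile app generates its authentication code. A common mechanism for app-generated codes is the time-based one-time password (TOTP) of RFC 6238, and the minimal sketch below assumes that scheme purely for illustration; it is not a description of the Data Portal's implementation. The shared secret used in the example is the published RFC test value, not a real credential.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 time-based one-time password using HMAC-SHA1."""
    counter = unix_time // step  # number of time steps since the Unix epoch
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation offset (RFC 4226)
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


# RFC 6238 test secret and timestamp (59 seconds after the epoch).
code = totp(b"12345678901234567890", 59)
```

Under this scheme the server and the app share a secret and both derive the same short-lived code from the current time, so a password alone is not sufficient to gain access.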
Tables, graphs and scripts for export are submitted to the data export panel for approval. Manuscripts may be prepared in the Data Portal so that collaborators who are registered users may contribute without the need for manuscript download. A facility for import is also available, enabling researchers to upload scripts and additional datasets from outside the Data Portal to reside within their approved DPUK datasets.
Publications arising from use of the Data Portal are required to conform to the DPUK publications policy [41]. The intention of this policy is to acknowledge the importance of the ‘team science’ underlying the opportunity provided to researchers. Not only is the researcher dependent on decades of generosity and work from cohort participants and cohort research teams respectively, but also upon the data scientists who deliver the provenance of the infrastructure, and the funders. A goal of the Data Portal is to reduce data access costs sufficiently that they may be borne centrally; effectively making data free at point of access. By undertaking data storage, curation and access management on behalf of cohorts the need for access fees is ameliorated and for most cohorts there is no access fee.
The DPUK Data Portal was established by MRC to accelerate the development of new treatments for dementia by using cohort data to inform experimental medicine. It is a recognition of the unique value of cohort data and a contribution to the wider debate on how best to support cohort studies and facilitate their use within the wider research environment. By streamlining procedures for cohort research teams, increasing data accessibility for researchers, and reducing costs and adding value for funders, the Data Portal is also an investment in the future of cohorts generally.
Profile in a nutshell
The DPUK Data Portal was established to increase the realised scientific value of cohort data by enabling remote access to multi-modal data from multiple independent datasets
Launched in 2017, the Data Portal enables access to individual level data for 3m participants from 35 population and clinical cohorts
Data types vary according to cohort and include survey, imaging, genetic, device and linked outcome data
All projects are by default collaborations with the cohort research teams which have generated the data and application for access can be made through the Data Portal https://portal.dementiasplatform.uk/
Acknowledgements
DPUK would like to express gratitude to:
Cohort members and their research teams for generously making data available
EMIF-AD for providing access to their data catalogue and supporting software
Professor Ian Deary and Dr Declan Jones for their contribution to this paper through their support in the DPUK Executive Team.
This work was supported by the UK Research and Innovation Medical Research Council [MR/L023784/1 and MR/L023784/2].
Footnotes
† Joint lead authors