Abstract
Though it is widely recognized that data sharing enables faster scientific progress, the sensible need to protect participant privacy hampers this practice in medicine. We train deep neural networks that generate synthetic subjects closely resembling study participants. Using the SPRINT trial as an example, we show that machine-learning models built from simulated participants generalize to the original dataset. We incorporate differential privacy, which offers strong guarantees on the likelihood that a subject could be identified as a member of the trial. Investigators who have compiled a dataset can use our method to provide a freely accessible public version that enables other scientists to perform discovery-oriented analyses. Generated data can be released alongside analytical code to enable fully reproducible workflows, even when privacy is a concern. By addressing data sharing challenges, deep neural networks can facilitate the rigorous and reproducible investigation of clinical datasets.
One Sentence Summary: Deep neural networks can generate shareable biomedical data to allow reanalysis while preserving the privacy of study participants.
Introduction
Sharing individual-level data from clinical studies remains challenging. The status quo often requires scientists to establish a formal collaboration and execute extensive data usage agreements before sharing data. These requirements slow or even prevent data sharing between researchers in all but the closest collaborations.
Recent initiatives have begun to address cultural challenges around data sharing. The New England Journal of Medicine recently held the Systolic Blood Pressure Intervention Trial (SPRINT) Data Analysis Challenge to examine possible benefits of clinical trial data sharing (1, 2). The SPRINT clinical trial examined the efficacy of intensive management of systolic blood pressure (<120 mmHg) compared with standard management (<140 mmHg). Intensive management resulted in fewer cardiovascular events, and the trial was stopped early. Reanalysis of the challenge data led to the development of personalized treatment scores (3) and decision support systems (4), in addition to a more specific analysis of blood pressure management in participants with chronic kidney disease (5). Such efforts have begun to address cultural norms. Yet even for this effort, which focused on data sharing, investigators were required to execute data use agreements that included clauses to maintain security and prohibit re-identification or sharing.
We sought to remove technical barriers that hamper data sharing. Computer scientists have used deep neural networks to tackle problems previously considered particularly challenging (6), and this class of machine learning methods is becoming more widely used in biology and medicine (7). In this work, we trained two deep neural networks against each other to generate realistic simulated participant blood pressure trajectories from the SPRINT trial dataset. One neural network, called the generator, is trained to generate a participant from a set of random numbers. The other neural network, called the discriminator, is trained to classify data as real or generated. As the networks are trained, the generator learns to build samples that fool the discriminator. Networks trained in this way are called Generative Adversarial Networks (GANs) (8) and can also be used to generate labeled samples (9). A pair of recent preprints have reported participant generation via neural networks (10, 11). However, it is not enough to simply build new examples. Numerous linkage and membership inference attacks have demonstrated the ability to re-identify participants or reveal participation in a study, both from biomedical datasets (12–18) and from machine learning models (19–21).
To provide a formal privacy guarantee, we build GANs under the constraint of differential privacy (16). Informally, differential privacy requires that no subject in the study has a significant influence on the information released by the algorithm (see Materials and Methods for a formal definition). Despite being a stringent notion, differential privacy allows us to generate new plausible individuals while revealing almost nothing about any single study participant. This is especially important in the biomedical domain where, for example, Homer et al. showed the ability to identify whether an individual was a part of a study even with complex genomic mixtures (22). Simmons and Berger later developed a method to enable differential privacy for genome-wide association studies (23). Recently, methods have been developed to train deep neural networks under differential privacy with formal assurances about privacy risks (24, 25). In the context of a GAN, the discriminator is the only component that accesses the real, private, data. By training the discriminator under differential privacy, we can produce a differentially private GAN framework.
We evaluated whether this approach could generate biomedical data that could be shared for reanalysis while reducing participant privacy risks. We evaluated usefulness by: (1) comparing variable distributions between the real and simulated data, (2) comparing the correlation structure between variables in the real and simulated data, and (3) comparing machine learning predictors constructed on real vs. simulated data. We find that the model learns realistic distributions and that models constructed from the simulated data successfully classify participants in a held-out portion of the underlying real dataset.
Results
We used an Auxiliary Classifier Generative Adversarial Network (AC-GAN) (9) to simulate participants based on the population of the SPRINT clinical trial. We included all participants with measurements for the first twelve time periods (n=6,502), dividing them into a training set (n=6,000) and a test set (n=502). We trained two AC-GANs using the training set: a standard AC-GAN (labeled non-private) and an AC-GAN trained under differential privacy (labeled private). We used both to simulate data that we compared to the real data. We visualized participant blood pressure trajectories, analyzed the variable correlation structure, and evaluated transfer learning performance for a machine learning classification task.
Auxiliary Classifier GAN for SPRINT Clinical Trial Data
An AC-GAN (Fig. 1A) is made up of two neural networks competing with each other. We found convolutional layers effectively modeled the sequential measurements and used deep convolutional neural networks for both the generator and discriminator (Fig. 1B, 1C). We trained the Generator (G) to take in a specified treatment arm (standard/intensive) and random noise and generate new participants that can fool the Discriminator (D). We trained the discriminator to differentiate real and simulated data from a dataset containing both groups. We repeated this process until the generator created synthetic participants that were difficult to discriminate from real ones.
A.) Structure of an AC-GAN. B.) The generator model takes a class label and random noise as input and outputs a 3x12 vector for each participant (SBP, DBP and medication counts at each time point). C.) The discriminator model takes both real and simulated samples as input and learns to predict the source and a class label (i.e., standard or intensive treatment group). D.) Training loss for a non-private AC-GAN. E.) Training loss for a private AC-GAN.
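To make the architecture concrete, below is a minimal sketch of the generator and discriminator pair using the Keras functional API. The specific layer sizes, use of tensorflow.keras, and function names here are illustrative assumptions; the published model in the repository may differ.

```python
from tensorflow import keras
from tensorflow.keras import layers

LATENT_DIM = 100  # dimension of the random noise vector
N_CLASSES = 2     # standard vs. intensive treatment arm

def build_generator():
    noise = layers.Input(shape=(LATENT_DIM,))
    label = layers.Input(shape=(1,), dtype="int32")
    # Embed the treatment arm and combine it with the noise vector.
    label_emb = layers.Flatten()(layers.Embedding(N_CLASSES, LATENT_DIM)(label))
    x = layers.multiply([noise, label_emb])
    x = layers.Dense(3 * 12, activation="relu")(x)
    x = layers.Reshape((3, 12, 1))(x)
    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    out = layers.Conv2D(1, (3, 3), padding="same")(x)  # (3, 12): SBP, DBP, meds at 12 visits
    return keras.Model([noise, label], out)

def build_discriminator():
    sample = layers.Input(shape=(3, 12, 1))
    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(sample)
    x = layers.Flatten()(x)
    real_fake = layers.Dense(1, activation="sigmoid", name="source")(x)   # real vs. simulated
    arm = layers.Dense(N_CLASSES, activation="softmax", name="class")(x)  # treatment arm
    return keras.Model(sample, [real_fake, arm])
```

The generator conditions on the treatment arm by embedding it and mixing it with the noise vector, while the discriminator produces both a real-versus-simulated output and a treatment-arm prediction, matching the two AC-GAN objectives described above.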
We trained under differential privacy by limiting the effect any single subject has on the training process and by adding random noise proportional to that maximum effect. From a technical perspective, we limited the effect of participants by clipping the norm of the gradient and adding proportionate Gaussian noise. This combination offers plausible deniability: training could have been guided by a different subject within or outside the real training data, and the maximum effect of an outlier is limited and bounded. Comparing the loss functions of the private and non-private training processes demonstrates the effects of these constraints. Under normal training, the losses of the generator and discriminator converged to an equilibrium before eventually increasing steadily (Fig. 1D). Under differentially private training, the losses converged to and remained in a noisy equilibrium (Fig. 1E). At the beginning of training the neural networks changed rapidly. As training continued and the model achieved a better fit, these steps (the gradients) shrank. When the gradients became very small, the added noise outweighed the signal and limited further training.
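As a conceptual illustration of this step, the sketch below clips each participant's gradient contribution and adds Gaussian noise scaled to the clipping threshold. It is a simplified NumPy rendering of the differentially private gradient update of Abadi et al. (24), not the training code itself; the default parameter values are illustrative.

```python
import numpy as np

def privatize_gradient(per_example_grads, clip_norm=1e-4, noise_multiplier=1.0):
    """Clip each example's gradient to clip_norm, sum, add Gaussian noise, and average.
    A simplified sketch of one differentially private gradient step."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose L2 norm exceeds the clipping threshold.
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    summed = np.sum(clipped, axis=0)
    # Noise standard deviation is proportional to the clipping threshold.
    noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    return (summed + noise) / len(per_example_grads)
```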
Evaluation of Simulated Participants
After training the AC-GAN, we compared the synthetic participants to the real participants (Figure 2). Figure 2 shows the median systolic blood pressures for: (1) real participants, (2) participants simulated by the non-private AC-GAN and (3) participants simulated by the differentially private AC-GAN. The non-private participants generated at the end of training appear similar to the real participants. The private participants have wider variability because of the noise added during training (Fig. 2A). As the models achieve better fit, the gradient shrinks, causing the gradient-to-noise ratio to decrease. This can occasionally lead to the private generator and discriminator falling out of sync (Supp. Fig. 1) or, more commonly, to the private model generating less realistic samples. To choose when to stop training, we developed an approach that incorporates a machine learning analysis chosen to resemble the expected use case. Here we evaluated each epoch's data by training an additional classifier to distinguish whether a generated participant belonged to the standard or intensive treatment group. We applied two common machine learning classification algorithms and selected the top epochs in a differentially private manner (Fig. 2B and 2C).
A.) Simulated samples (private and non-private) generated from the final (500th) epoch of training. B.) Simulated samples generated from the epoch with the best performing logistic regression classifier. C.) Simulated samples from the epoch with the best performing random forest classifier. D.) Simulated samples from the top five random forest classifier epochs and top five logistic regression classifier epochs.
However, selecting only a single epoch does not account for the AC-GAN training process. Because the discriminator and generator compete from epoch to epoch, their results can cycle around the underlying distribution. The non-private models consistently improved throughout training (Supp. Fig. 2A, Supp. Fig. 3A), but this could be due to the generator eventually learning characteristics specific to individual participants. We observed that epoch selection was important for the generation of realistic populations from models that incorporated differential privacy (Supp. Fig. 2B, Supp. Fig. 3B). To address this, we simulated 1,000 participants from each of the top five epochs selected by the logistic regression evaluation and the top five selected by the random forest evaluation, and combined them to form a multi-epoch training set. This process maintained differential privacy and resulted in a generated population that, throughout the trial, was consistent with the real population (Fig. 2D).
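A minimal sketch of this multi-epoch sampling is shown below. The generator objects and their predict interface follow the hypothetical Keras sketch above, and the list of generators is assumed to correspond to epochs already selected under differential privacy.

```python
import numpy as np

def multi_epoch_sample(generators, n_per_epoch=1000, latent_dim=100):
    """Draw n_per_epoch synthetic participants from each selected epoch's generator
    and pool them into a single multi-epoch training set."""
    samples, labels = [], []
    for gen in generators:  # e.g., top five epochs by logistic regression plus top five by random forest
        noise = np.random.normal(size=(n_per_epoch, latent_dim))
        arms = np.random.randint(0, 2, size=(n_per_epoch, 1))  # 0 = standard, 1 = intensive
        samples.append(gen.predict([noise, arms]))
        labels.append(arms.ravel())
    return np.concatenate(samples), np.concatenate(labels)
```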
A.) Original, real data, B.) Non-private, AC-GAN simulated data C.) Differentially private, AC-GAN simulated data.
The Pearson correlation structure of the real data (Fig. 3A) was closely reflected by the correlation structure of the non-private generated data (Fig. 3B). Of note was the initial positive correlation between the number of medications a participant was taking and the early systolic blood pressures, a correlation that decreased as the trial progressed. The private generated data generally reflected these trends but had an increased level of noise (Fig. 3C). The noisy training process of the private discriminator places an upper bound on its ability to fit the distribution of data. Increased sample sizes would help to clarify this distribution, and because larger sample sizes incur less privacy loss, less noise would need to be added to achieve an acceptable privacy budget.
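For reference, this correlation comparison can be computed as below, treating each participant as a flattened vector of the 36 measurements (3 variables × 12 time points). This is a straightforward sketch rather than the original plotting code.

```python
import numpy as np

def correlation_matrix(participants):
    """Pearson correlation across the 36 flattened measurements for a cohort
    of shape (n_participants, 3, 12)."""
    flat = participants.reshape(len(participants), -1)
    return np.corrcoef(flat, rowvar=False)  # 36 x 36 correlation matrix
```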
Feasibility of Simulated Participants for Transfer Learning Task
Visualizations of participant distributions and variable correlations showed that synthetic participants appeared similar to real participants. We next sought to determine whether synthetic participants could be used for subsequent data mining. We trained machine learning classifiers using four methods (logistic regression, random forests, support vector machines, and nearest neighbors) to distinguish treatment groups on three different sources of data: real participants, synthetic participants generated by the non-private model, and synthetic participants generated by the private model. We compared performance of these classifiers on a holdout test set of 502 real participants (Fig. 4 A-D). This analysis revealed two main trends: classifiers trained on the set constructed from the combined top epochs exhibited more stable performance on the test data, in line with observations from the population distributions; and classifiers trained on data from the non-private model slightly outperformed those trained on data from the private model. A drop in performance was expected because adding noise to maintain privacy reduces signal. If desired, training a non-private model could provide an upper bound on expected performance.
A.) Performance on transfer learning task by source of training data for each machine learning model. B.) Random forest variable importance scores by training data. C.) Logistic Regression variable coefficients by training data. D.) Support Vector Machine variable coefficients by training data.
We also sought to determine the extent to which the classifiers were using similar predictive features. We evaluated the random forest feature importance scores (Fig. 4E) as well as the logistic regression and support vector machine feature coefficients (Fig. 4F, 4G). All showed similar trends in which features were useful across real and generated data. A Spearman correlation test between the importance scores (random forest) or coefficients (SVM and logistic regression) of the models trained on real data and those of the models trained on each synthetic set revealed significant associations in all cases (Table 1). Though all three classification methods achieved similar accuracy, the random forest classifier found the medication features to be important, while these features had near-zero coefficients in the SVM and logistic regression classifiers.
Spearman Correlation between variable importance scores (Random Forests) and model coefficients (Support Vector Machine and Logistic Regression).
Privacy Analysis
The formal definition of differential privacy has two parameters. The key parameter ε measures the "privacy loss" incurred by the computation. The second parameter δ bounds the probability that the privacy loss exceeds ε. The values of (ε, δ) accumulate as the algorithm repeatedly accesses the private data. In our experiment, the private AC-GAN algorithm was able to generate useful synthetic data with ε = 2 and δ < 10⁻⁵ (Fig. 5). The epoch selection task (see Materials and Methods) used at most (0.05, 0)-differential privacy per selected model, for a total of (0.5, 0) across the ten selections. This establishes a modest, single-digit-epsilon privacy budget of (2.5, 10⁻⁵).
The value of δ as a function of training epoch for different ε values. An ε value of 2 allows for 500 epochs of training with δ < 10⁻⁵.
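The overall budget follows from simple additive composition of the training and selection steps, as the arithmetic below illustrates.

```python
# Simple additive composition of the privacy budget reported above.
train_eps, train_delta = 2.0, 1e-5   # differentially private AC-GAN training
select_eps = 10 * 0.05               # ten epoch selections at (0.05, 0) each
total_eps = train_eps + select_eps   # 2.5
total_delta = train_delta            # the selection steps contribute delta = 0
print(total_eps, total_delta)        # 2.5 1e-05
```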
Discussion
Deep generative adversarial networks and differential privacy offer a technical solution to the challenge of sharing biomedical data to facilitate exploratory analyses. Our approach, which uses deep neural networks for data simulation, can generate synthetic data to be distributed and used for secondary analysis. We perform training within a differential privacy framework that limits study subjects' privacy risk. We applied this approach to data from the SPRINT clinical trial because of its recent use for a data reanalysis challenge.
We introduce an approach that samples from multiple epochs to improve performance while maintaining privacy. However, several challenges remain. Deep learning models have many training parameters and require substantial sample sizes, which can hamper this method's use for small clinical trials or targeted studies. Another fruitful area of application may be large electronic health records systems, where the ability to share synthetic data may aid methods development and the initial discovery of predictive models. Similarly, financial institutions or other organizations that use outside contractors or consultants to develop risk models might choose to share generated data instead of actual client data. In very large datasets, there is evidence that differential privacy may even prevent overfitting, reducing the error of subsequent predictions (26).
Though our approach provides a general framing, the precise neural network architecture may need to be tuned for specific use cases. Data with multiple types presents a challenge. EHRs contain binary, categorical, ordinal and continuous data. Neural networks require these types to be encoded and normalized, a process that can reduce signal and increase the dimensionality of the data. New neural network architectures have been designed to deal more effectively with discrete data (27, 28). Researchers will need to incorporate these techniques and develop new methods for mixed types if their use case requires it. Generating data under differential privacy with deep neural networks offers those who wish to share data a technical solution to the challenge of patient privacy. This technical work complements ongoing efforts to change the data sharing culture of clinical research.
Materials and Methods
We developed an approach to train auxiliary classifier generative adversarial networks (AC-GANs) in a differentially private manner to enable privacy preserving data sharing. Generative adversarial networks offer the ability to simulate realistic-looking data that closely matches the distribution of the source data. AC-GANs add the ability to generate labeled samples. By training AC-GANs under the differential privacy framework we generated realistic samples that can be used for initial analysis while guaranteeing a specified level of participant privacy.
The source code for all analyses is available under a permissive open source license in our repository (https://github.com/greenlab/SPRINT_gan). In addition, continuous analysis (29) was used to re-run all analyses, to generate docker images matching the environment of the original analysis, and to track intermediate results and logs. These artifacts are freely available (https://hub.docker.com/r/brettbj/sprint-gan/ and archival version: https://doi.org/10.6084/m9.figshare.5165731.v1).
SPRINT Clinical Trial Data
SPRINT was a randomized, single-blind treatment trial in which participants were randomized into two groups: an intensive treatment group with a systolic blood-pressure target of less than 120 mmHg and a standard treatment group with a systolic blood-pressure target of less than 140 mmHg. The trial included a total of 9,361 participants. We included 6,502 participants from the trial by filtering for all participants who had blood pressure measurements at each of the first 12 time points (RZ, 1M, 2M, 3M, 6M, 9M, 12M, 15M, 18M, 21M, 24M, 27M). We included measurements of systolic blood pressure, diastolic blood pressure and the count of medications prescribed to each participant. This provided an input vector of shape (3, 12).
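A sketch of this preprocessing step is shown below. The column names (PIDNO, VISITCODE, SBP, DBP, N_MEDS) are hypothetical placeholders and do not correspond to the actual SPRINT table layout.

```python
import numpy as np
import pandas as pd

VISITS = ["RZ", "1M", "2M", "3M", "6M", "9M", "12M", "15M", "18M", "21M", "24M", "27M"]

def build_input_matrix(bp_long):
    """Pivot a long-format blood pressure table into an (n, 3, 12) array,
    keeping only participants measured at all 12 visits."""
    wide = bp_long.pivot(index="PIDNO", columns="VISITCODE",
                         values=["SBP", "DBP", "N_MEDS"]).dropna()
    sbp = wide["SBP"][VISITS].to_numpy()
    dbp = wide["DBP"][VISITS].to_numpy()
    meds = wide["N_MEDS"][VISITS].to_numpy()
    return np.stack([sbp, dbp, meds], axis=1)  # shape (n_participants, 3, 12)
```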
Auxiliary Classifier Generative Adversarial Network
We implemented the AC-GAN as described in Odena et al. (9) using Keras (30). Results shown use a latent vector of dimension 100, a learning rate of 0.0002, and a batch size of 100. To handle edge cases and mimic the precision of the real data measurements, we clamp simulated values at a floor of zero and convert all values to integers.
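This post-processing step amounts to the following minimal sketch:

```python
import numpy as np

def postprocess(simulated):
    """Clamp simulated values at zero and round down to integers,
    mirroring the granularity of the recorded measurements."""
    return np.floor(np.maximum(simulated, 0)).astype(int)
```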
Transfer Learning Task
Each of the 6,502 participants in the SPRINT dataset is labeled by treatment group. We evaluated machine learning methods (logistic regression, support vector machines, and random forests from the scikit-learn (31) package) by their ability to predict which group a participant belongs to. This was done by splitting the 6,502 participants into a training set of 6,000 participants (labeled real) and a test set of 502 participants. A vanilla AC-GAN was trained using the 6,000-participant training set, providing a simulated training set (labeled non-private). A differentially private AC-GAN was trained using the same 6,000-participant training set, providing a differentially private simulated training set (labeled private). Each classifier was then trained on the real, non-private and private training sets and evaluated on the same real test set of participants. This allows for a comparison of classification performance between models trained on the real data, synthetic data and private synthetic data. We evaluated both accuracy and the correlation between feature importances (random forest) and model coefficients (logistic regression and support vector machine).
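A sketch of this evaluation loop is shown below, using default scikit-learn settings; the hyperparameters are assumptions, and a nearest neighbors classifier could be added analogously.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def transfer_learning_accuracy(train_X, train_y, test_X, test_y):
    """Train on one data source (real, non-private, or private synthetic participants)
    and evaluate on the held-out 502 real participants."""
    classifiers = {
        "logistic_regression": LogisticRegression(),
        "random_forest": RandomForestClassifier(n_estimators=100),
        "svm": LinearSVC(),
    }
    results = {}
    for name, clf in classifiers.items():
        # Flatten each (3, 12) participant matrix into a 36-dimensional feature vector.
        clf.fit(train_X.reshape(len(train_X), -1), train_y)
        results[name] = clf.score(test_X.reshape(len(test_X), -1), test_y)
    return results
```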
Differential Privacy
Differential privacy is a stability property for algorithms, specifically for randomized algorithms (32). Informally, it requires that a change to any single data point in the dataset has little influence on the output distribution of the algorithm. To formally define differential privacy, let us consider X as the set of all possible data records in our domain. A dataset is a collection of n data records from X. A pair of datasets D and D′ are neighboring if they differ by at most one data record. In the following, we will write R to denote the output range of the algorithm, which in our case corresponds to the set of generative models.
[Differential Privacy (33)]: Let ε, δ > 0. An algorithm A: Xⁿ → R satisfies (ε, δ)-differential privacy if for any pair of neighboring datasets D, D′, and any event S ⊆ R, the following holds:

Pr[A(D) ∈ S] ≤ e^ε Pr[A(D′) ∈ S] + δ,

where the probability is taken over the randomness of the algorithm.
A crucial property of differential privacy is its resilience to post-processing: any data-independent post-processing procedure applied to the output of a private algorithm remains private. More formally:
[Resilience to Post-Processing]: Let algorithm A: Xⁿ → R be an (ε, δ)-differentially private algorithm. Let A′: R → R′ be a "post-processing" procedure. Then the composition of running A over the dataset D and then running A′ over the output A(D) also satisfies (ε, δ)-differential privacy.
Training AC-GANs in a Differentially Private Manner
During the training of an AC-GAN, the only component that requires direct access to the private (real) data is the discriminator. To achieve differential privacy, we therefore only need to "privatize" the training of the discriminator. The differential privacy guarantee of the entire AC-GAN follows directly, because the output generative models are simply post-processing of the discriminator.
To train the discriminator under differential privacy, we add noise to the stochastic gradient descent process as outlined in Abadi et al. (24). First, we place an upper bound on the norm of the gradient at any individual step by clipping its ℓ2-norm. Next, we perturb each coordinate of the gradient with noise drawn from a Gaussian distribution whose variance is proportional to the clipping threshold. The more noise we add relative to the clipped norm of the gradient, the better the privacy guarantee. To achieve a modest privacy budget, we found we could clip the ℓ2-norm of the gradient at 0.0001 and add noise with a noise multiplier of 1, i.e., drawn from 𝒩(0, 1 × (0.0001)²). This relative noise level is substantially higher than previously shown, likely due to either the dynamic nature of GAN training, where the target is inexact and changes over time, or to averaging over many mini-batches. We used the moments accountant described in Abadi et al. (24) to compute the privacy parameters (ε, δ).
Differentially Private Model Selection
We found that sampling from multiple different epochs throughout training provided a more diverse training set. This produced summary statistics closer to the real data and higher accuracy in the transfer learning task. During the GAN training, we saved the generative models from all epochs. We then generated a batch of synthetic data from each generative model and used a machine learning algorithm (logistic regression or random forest) to train a prediction model on each synthetic batch of data. We then tested each prediction model on the real dataset and calculated the resulting accuracy. To select the epochs that generate training data for the most accurate models under differential privacy, we used the standard "Report Noisy Min" subroutine: we first add independent Laplace noise, drawn from Lap(1/(nε)), to the accuracy of each model, where n is the size of the private dataset we perform prediction on; this achieves (ε, 0)-differential privacy. We then output the model with the best noisy accuracy.
In practice, we chose the top five models in the transfer learning task using logistic regression classification and the top five using random forest classification (for a total of 10 models). We performed this task under (0.5, 0)-differential privacy; in each of the ten rounds of selection, ε was set to 0.05. This achieves a good balance of accuracy while maintaining a reasonable privacy budget.
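A minimal sketch of one selection round, under the parameters above, might look as follows. The accuracies are assumed to have been computed on the private dataset of size n_private.

```python
import numpy as np

def report_noisy_min(accuracies, n_private, eps_per_round=0.05):
    """Select the epoch with the highest noisy transfer-learning accuracy.
    Adding Laplace(1/(n*eps)) noise gives (eps_per_round, 0)-differential privacy
    per selection, since accuracy on n points has sensitivity 1/n."""
    scale = 1.0 / (n_private * eps_per_round)
    noisy = [acc + np.random.laplace(0.0, scale) for acc in accuracies]
    return int(np.argmax(noisy))
```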
Acknowledgments
We thank Jason H. Moore (University of Pennsylvania), Aaron Roth (University of Pennsylvania), Gregory Way (University of Pennsylvania), Yoseph Barash (University of Pennsylvania) and Anupama Jha (University of Pennsylvania) for their helpful discussions. We also thank the participants of the SPRINT trial and the entire SPRINT Research Group for providing the data used in this study. Funding: This work was supported by the Gordon and Betty Moore Foundation under a Data Driven Discovery Investigator Award to C.S.G. (GBMF 4552). B.K.B.-J. was supported by a Commonwealth Universal Research Enhancement (CURE) Program grant from the Pennsylvania Department of Health and by US National Institutes of Health grants AI116794 and LM010098. Z.S.W. is funded in part by a subcontract on the DARPA Brandeis project and a grant from the Sloan Foundation. Author Contributions: B.K.B.-J. and C.S.G. conceived the study. B.K.B.-J. and C.W. performed initial analyses. B.K.B.-J. and Z.S.W. designed and validated the privacy approach. B.K.B.-J., C.S.G. and Z.S.W. wrote the manuscript and all authors revised and approved the final manuscript. Competing interests: The authors have no competing interests to disclose. Data and materials availability: All data used in this manuscript are available via the NHLBI (https://biolincc.nhlbi.nih.gov/studies/sprint_pop/), the source code is available via GitHub (https://github.com/greenlab/SPRINT_gan) and an archived version is available via Figshare (DOI: 10.6084/m9.figshare.5165737).