Abstract
Though it is widely recognized that data sharing enables faster scientific progress, the sensible need to protect participant privacy hampers this practice in medicine. We train deep neural networks that generate synthetic participants closely resembling study participants. Using the SPRINT trial as an example, we show that machine-learning models built from simulated participants generalize to the original dataset. We incorporate differential privacy, which offers strong guarantees on the likelihood that a participant could be identified as a member of the trial. Investigators who have compiled a dataset can use our method to provide a freely accessible public version that enables other scientists to perform discovery-oriented analyses. Generated data can be released alongside analytical code to enable fully reproducible workflows, even when privacy is a concern. By addressing data sharing challenges, deep neural networks can facilitate the rigorous and reproducible investigation of clinical datasets.
One Sentence Summary Deep neural networks can generate shareable biomedical data to allow reanalysis while preserving the privacy of study participants.
Introduction
Sharing individual-level data from clinical studies remains challenging. The status quo often requires scientists to establish a formal collaboration and execute extensive data usage agreements before sharing data. These requirements slow or even prevent data sharing between researchers in all but the closest collaborations.
Recent initiatives have begun to address cultural challenges around data sharing. The New England Journal of Medicine recently held the Systolic Blood Pressure Intervention Trial (SPRINT) Data Analysis Challenge to examine possible benefits of clinical trial data sharing (1, 2). The SPRINT clinical trial examined the efficacy of intensive lowering of systolic blood pressure (<120 mm Hg) compared with treatment to a standard systolic blood pressure goal (<140 mm Hg). Intensive blood pressure lowering resulted in fewer cardiovascular events, and the trial was stopped early. Reanalysis of the challenge data led to the development of personalized treatment scores (3) and decision support systems (4), in addition to a more specific analysis of blood pressure management in participants with chronic kidney disease (5). Yet even for this effort, which focused on data sharing, investigators were required to execute data use agreements that included clauses to maintain security and prohibit re-identification or sharing.
We sought to alleviate privacy barriers that hamper data sharing. One approach is to generate synthetic individual participant data similar enough to the original trial data that analyses yield the same answers. Park and Ghosh developed an initial approach to managing privacy threats using a perturbed Gibbs sampler, a method that generates synthetic data with a quantifiable privacy risk (6). Goodfellow et al. (7) developed a method called Generative Adversarial Networks (GANs) that uses neural networks to generate realistic data from complex distributions. GANs have become a widely used class of machine learning methods and have recently been applied in biology and medicine (8). In this work, we trained two deep neural networks against each other to generate realistic simulated participant blood pressure trajectories and medication adjustments from the SPRINT trial dataset. One neural network, called the generator, is trained to generate a participant from a set of random numbers. The other neural network, called the discriminator, takes in both real data and the synthetic data from the generator. It is trained to classify whether a sample is real or a synthetic sample created by the generator. As the networks are trained, the generator learns to build datasets that fool the discriminator. When the discriminator can no longer differentiate between the real and fake data, the synthetic samples “look” realistic. Networks trained in this way are called GANs and can also be used to create synthetic participants from multiple groups, in this case the standard and intensive arms of the trial (9). A pair of recent preprints have reported generation of synthetic individual participant data via neural networks (10, 11). For example, Esteban et al. generated synthetic patient data and showed that a neural network could not distinguish between the synthetic data and real data. However, it is not enough to simply build synthetic participants. Numerous linkage and membership inference attacks on both biomedical datasets (12–19) and machine learning models (20–22) have demonstrated the ability to re-identify participants or reveal participation in a study.
To provide a formal privacy guarantee, we built GANs to generate realistic synthetic individual participant data with mathematical properties like those of the original participants’ data, adding the extra protection of differential privacy (16). Differential privacy protects against common privacy attacks including membership inference, homogeneity, and background knowledge attacks (23). Informally, differential privacy requires that no single study participant has a significant influence on the information released by the algorithm (see Materials and Methods for a formal definition). Despite being a stringent notion, differential privacy allows us to generate new plausible individuals while revealing almost nothing about any single study participant. Within the biomedical domain, Simmons and Berger developed a method using differential privacy to enable privacy-preserving genome-wide association studies (24). Recently, methods have also been developed to train deep neural networks under differential privacy with formal assurances about privacy risks (25, 26). In the context of a GAN, the discriminator is the only component that accesses the real, private data. By training the discriminator under differential privacy, we can produce a differentially private GAN framework.
We evaluated whether this approach could generate biomedical data that could be shared for valid reanalysis while reducing participant privacy risks. We evaluated usefulness by: (1) comparing variable distributions between the real and simulated data, (2) comparing the correlation structure between variables in the real and simulated data, (3) asking three clinicians to judge whether individual participant data were real or simulated, and (4) comparing machine learning predictors constructed on real vs. simulated data. We find that the model learns to generate realistic data and that models constructed from the simulated data successfully predict which arm participants were assigned to in a held-out portion of the underlying real dataset.
Results
We used a type of GAN known as an Auxiliary Classifier Generative Adversarial Network (AC-GAN) (9) to simulate participants based on the population of the SPRINT clinical trial. We included all participants with measurements for the first twelve time periods (n=6,502), dividing them into a training set (n=6,000) and a test set (n=502). To evaluate the effect of applying differential privacy during the generation of synthetic participant data, we trained two AC-GANs using the training set: a standard AC-GAN (results termed “non-private” throughout the remainder of this manuscript) and an AC-GAN trained under differential privacy (results termed “private”). We used both GANs to simulate data that we then compared to the real data by visualizing participant blood pressure trajectories, analyzing variable correlation structure, and evaluating transfer learning performance for a machine learning classification task. Three clinicians attempted to predict whether participants were real or synthetic and whether they were in the standard or intensive treatment group.
Auxiliary Classifier GAN for SPRINT Clinical Trial Data
An AC-GAN (Supp. Fig. 1A) is made up of two neural networks competing with each other. We found convolutional layers effectively modeled the sequential measurements made during the clinical trial, so we used deep convolutional neural networks for both the generator and discriminator (Supp. Fig. 1B, 1C). We trained the Generator (G) to take in a specified treatment arm (standard/intensive) and random noise and generate new participants that can fool the Discriminator (D). The generator simulated a systolic blood pressure, a diastolic blood pressure, and a medication count for each synthetic participant at each of 12 SPRINT study visits. We trained the discriminator to differentiate real and simulated data from a dataset containing both groups. We repeated this process until the generator created synthetic participants that were difficult to discriminate from real ones (i.e., the accuracy of the discriminator could not improve above ~50%).
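A minimal sketch of these two networks in Keras (the library used in our implementation) appears below; the layer widths, kernel sizes, and activations are illustrative assumptions rather than the exact architecture, which is available in ac_gan.py in our repository.

```python
# A minimal AC-GAN sketch in Keras. Layer widths, kernel sizes, and
# activations are illustrative assumptions; the exact architecture is in
# the repository (SPRINT_gan/ac_gan.py).
from tensorflow import keras
from tensorflow.keras import layers

LATENT_DIM = 100  # dimension of the random-noise input
N_CLASSES = 2     # standard vs. intensive treatment arm
N_VISITS = 12     # RZ through the 27-month visit
N_FEATURES = 3    # SBP, DBP, medication count

def build_generator() -> keras.Model:
    noise = layers.Input(shape=(LATENT_DIM,))
    label = layers.Input(shape=(1,), dtype="int32")
    # Embed the treatment-arm label and fold it into the noise vector.
    label_emb = layers.Flatten()(layers.Embedding(N_CLASSES, LATENT_DIM)(label))
    x = layers.multiply([noise, label_emb])
    x = layers.Dense(N_VISITS * 32, activation="relu")(x)
    x = layers.Reshape((N_VISITS, 32))(x)
    x = layers.Conv1D(64, 3, padding="same", activation="relu")(x)
    # One output channel per measurement type at each of the 12 visits.
    sample = layers.Conv1D(N_FEATURES, 3, padding="same")(x)
    return keras.Model([noise, label], sample)

def build_discriminator() -> keras.Model:
    sample = layers.Input(shape=(N_VISITS, N_FEATURES))
    x = layers.Conv1D(64, 3, padding="same", activation="relu")(sample)
    x = layers.Conv1D(128, 3, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    # Two heads: real-vs-synthetic, plus the auxiliary treatment-arm classifier.
    validity = layers.Dense(1, activation="sigmoid", name="validity")(x)
    arm = layers.Dense(N_CLASSES, activation="softmax", name="arm")(x)
    return keras.Model(sample, [validity, arm])
```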
We trained under differential privacy by limiting the effect any single SPRINT study participant has on the training process and by adding random noise based on the maximum effect of a single study participant. From the technical perspective, we limited the effect of participants by clipping the norm of the discriminator’s training gradient and added proportionate Gaussian noise. This combination ensures that training cannot be tied to an individual and that it could have been guided by a different subject within or outside the real training data; the maximum effect of an outlier is thus bounded. Comparing the neural network loss functions of the private and non-private training processes demonstrates the effects of these constraints. Under normal training, the losses of the generator and discriminator converged to an equilibrium before eventually increasing steadily (Supp. Fig. 1D). Under differentially private training, the losses converged to and remained in a noisy equilibrium (Supp. Fig. 1E). At the beginning of training, the neural networks changed rapidly. As training continued and the model achieved a better fit, these steps (the gradients) decreased. Eventually the gradient became too small in comparison to the noise for training to continue any further.
Evaluation of Simulated Participants
After training the AC-GANs, we compared the synthetic participants to the real participants. Figure 1 shows the median systolic blood pressures for: (1) real participants, (2) participants simulated via the non-private AC-GAN, and (3) participants simulated via the differentially private AC-GAN. The non-private participants generated at the end of training appear similar to the real participants. The private participants have wider variability because of the noise added during training (Fig. 1A). As the models achieve better fit, the gradient shrinks, causing the gradient-to-noise ratio to decrease. This can occasionally lead to the private generator and discriminator falling out of sync (Supp. Fig. 2) or, more commonly, to the private model generating less realistic samples due to noise. To best select epochs, or training steps, where synthetic samples closely resemble real samples, we tested each epoch’s data by training an additional classifier that must distinguish whether a generated participant was part of the standard or intensive treatment group. We applied two common machine learning classification algorithms and selected the top epochs in a differentially private manner (Fig. 1B and 1C).
Fig. 1. Median Systolic Blood Pressure Trajectories from initial visit to 27 months. A.) Simulated samples (private and non-private) generated from the final (500th) epoch of training. B.) Simulated samples generated from the epoch with the best performing logistic regression classifier. C.) Simulated samples from the epoch with the best performing random forest classifier. D.) Simulated samples from the top five random forest classifier epochs and top five logistic regression classifier epochs.
However, selecting only a single epoch does not account for the AC-GAN training process. Because the discriminator and generator compete from epoch to epoch, their results can cycle around the underlying distribution. The non-private models consistently improved throughout training (Supp. Fig. 3A, Supp. Fig. 4A), but this could be due to the generator eventually learning characteristics specific to individual participants. We observed that epoch selection based on the training data was important for the generation of realistic populations from models that incorporated differential privacy (Supp. Fig. 3B, Supp. Fig. 4B). To address this, we simulated 1,000 participants from each of the top five epochs ranked by the logistic regression evaluation and each of the top five ranked by the random forest evaluation on the training data, and combined them to form a multi-epoch training set. This process maintained differential privacy and resulted in a generated population that, throughout the trial, was consistent with the real population (Fig. 1D). The epoch selection process was independent of the holdout testing data.
To evaluate whether the resulting synthetic data are similar to the real data, we evaluated the correlation between each study visit’s systolic blood pressure, diastolic blood pressure, and medication count. We performed this analysis within the SPRINT dataset (“real correlation structure”) and within the datasets generated by the GAN without and with differential privacy (“non-private correlation structure” and “private correlation structure,” respectively). The Pearson correlation structure of the real SPRINT data (Fig. 2A) was closely reflected by the correlation structure of the non-private generated data (Fig. 2B). Of note was an initial positive correlation between the number of medications a participant was taking and the early systolic blood pressures, a correlation that decreased over time. The correlation matrices of the real SPRINT data (i.e., the training data) and the non-private data were highly correlated (Spearman correlation = 0.9645, p-value < 10−325). Addition of differential privacy during the synthetic data generation process (i.e., the “private dataset”) generated data generally reflecting these trends, but with an increased level of noise (Fig. 2C). The correlation matrices of the real SPRINT data and the private generated data were only slightly less correlated (Spearman correlation = 0.8787, p-value = 7.692 × 10−204). The noisy training process of the private discriminator places an upper bound on its ability to fit the distribution of data. Increased sample sizes (such as in EHRs or other real-world data sources) would help to clarify this distribution, and because larger sample sizes incur less privacy loss, less noise would need to be added to achieve an acceptable privacy budget.
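This comparison can be sketched in a few lines; the sketch assumes the real and synthetic data are pandas DataFrames with one column per visit-measurement pair (36 columns: SBP, DBP, and medication count at each of the 12 visits), and the names are illustrative.

```python
# Sketch of the correlation-structure comparison described above.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def correlation_similarity(real: pd.DataFrame, synthetic: pd.DataFrame):
    real_corr = real.corr(method="pearson").values
    synth_corr = synthetic.corr(method="pearson").values
    # Compare only the upper triangles so each variable pair counts once.
    iu = np.triu_indices(real_corr.shape[0], k=1)
    rho, p = spearmanr(real_corr[iu], synth_corr[iu])
    return rho, p
```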
Fig. 2. Pairwise Pearson correlation between columns for the A.) Original, real data, B.) Non-private, AC-GAN simulated data, C.) Differentially private, AC-GAN simulated data. (RZ, randomization visit; 1M, 1 month visit; 2M, 2 month visit; 3M, 3 month visit; 6M, 6 month visit; 9M, 9 month visit; 12M, 12 month visit; 15M, 15 month visit; 18M, 18 month visit; 21M, 21 month visit; 24M, 24 month visit; 27M, 27 month visit).
To ensure that the similarity between the synthetic and real SPRINT data persisted under rigorous inspection at a more granular scale, we asked three clinicians to judge whether individual participant data were real SPRINT data or synthetic data. These three physicians, experienced in the treatment of hypertension and familiar with the SPRINT trial, were asked to determine in a blinded fashion whether 100 participants (50 real, 50 synthetic) looked real. The clinicians looked for data inconsistent with the SPRINT protocol or that otherwise appeared anomalous. For example, the clinicians were alert for instances in which the systolic blood pressure was less than 100 mm Hg but the participant was prescribed an additional medication. The clinicians classified each record on a zero to ten realism scale (10 was the most realistic), as well as whether the data corresponded to standard or intensive treatment (Fig. 3A-D). The mean realism score for synthetic patients was 5.01 and the mean score for the real patients was 5.16 (Fig. 3E). We performed a Mann-Whitney U test to evaluate whether the scores were drawn from significantly different distributions and found a p-value of 0.287. The clinicians correctly classified 61.3% of the real SPRINT participants and 67.3% of the synthetic participants as the standard or intensive group.
Fig. 3. A.) Synthetic participant scored a 2 by clinician expert. B.) Synthetic participant scored a 4 by clinician expert. C.) Synthetic participant scored a 6 by clinician expert. D.) Synthetic participant scored an 8 by clinician expert. E.) Comparison of scores between real and synthetic participants (dotted red lines indicate means). F.) Distribution of scores between real (blue) and synthetic (green) patients.
Machine Learning Models Trained on Simulated Participants are Accurate for Real Participants
Clinician review, visualizations of participant distributions, and variable correlations showed that synthetic participants appeared similar to real participants. We next sought to determine whether synthetic participants could be used for subsequent data analysis. We trained machine learning classifiers using four methods (logistic regression, random forests, support vector machines, and nearest neighbors) to distinguish treatment groups on three different sources of data: real participants, synthetic participants generated by the non-private model, and synthetic participants generated by the private model. We compared performance of these classifiers on a separate holdout test set of 502 real participants that were not included in the training process (Fig. 4A-D). This analysis revealed two main trends: classifiers trained on the set constructed from combined top epochs exhibited more stable performance on the test data, in line with observations from the population distributions, and classifiers trained on data from the non-private model slightly outperformed those trained on data from the private model. A drop in performance was expected because adding noise to maintain privacy reduces signal. If desired, training a non-private model could provide an upper bound for expected performance.
Fig. 4. Performance on transfer learning task by source of training data for each machine learning method. A.) Logistic Regression. B.) Random Forest. C.) Support Vector Machine. D.) Nearest Neighbors.
We also sought to determine the extent to which the classifiers trained on real vs. synthetic data were relying on the same features to make their predictions (Supplemental Figure 5). We found that there was significant correlation between the importance scores (random forest) and coefficients (SVM and logistic regression) for the models trained on real vs. synthetic data (Supplemental Table 1).
Privacy Analysis
The formal definition of differential privacy has two parameters. The key parameter ε measures the “privacy loss” incurred by the computation. The second parameter δ bounds the probability that the privacy loss exceeds ε. Put another way, ε represents the worst-case privacy loss when there is no privacy breach, and δ represents the probability of a privacy breach. It is therefore important to choose values for ε and δ that are satisfactory for the specific use case and correspond to the consequences of a privacy breach. The values of (ε, δ) accumulate as the algorithm repeatedly accesses the private data. In our experiment, our private AC-GAN algorithm is able to generate useful synthetic data with ε = 2 and δ < 10−5 (Fig. 5). The epoch selection task (see Materials and Methods) used at most (0.05, 0) per selected model, for a total of (0.5, 0)-differential privacy across the ten selected models. This established a modest, single-digit-epsilon privacy budget of (2.5, 10−5).
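In symbols, the budget composes additively across the GAN training and the ten selection rounds:

```latex
% Composition of the overall privacy budget reported above.
\varepsilon_{\text{total}} = \varepsilon_{\text{GAN}} + 10\,\varepsilon_{\text{select}}
                           = 2 + 10 \times 0.05 = 2.5,
\qquad
\delta_{\text{total}} = \delta_{\text{GAN}} + 10 \times 0 < 10^{-5}.
```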
Fig. 5. The value of delta as a function of epoch for different epsilon values. An ε value of 2 allows for 500 epochs of training and δ < 10−5.
Predicting Heart Failure in the MIMIC Critical Care Database
We tested whether our approach could be applied to a second dataset by predicting heart failure from the first five measurements of nine vital signs in 7,222 patients from the MIMIC Critical Care Database. Performance of models trained on privately generated synthetic patients was on par with that of models trained on real patients (Fig. 6A-D). As in the SPRINT data, the coefficients for logistic regression and the support vector machine, as well as the random forest feature importances, were significantly correlated between real and synthetic data (Supplemental Table 2).
Fig. 6. A-D.) Performance on transfer learning task by source of training data for each machine learning method. E.) Pairwise Pearson correlation between columns for the original, real data. F.) Pairwise Pearson correlation between columns for the private synthetic data.
Discussion
Deep generative adversarial networks and differential privacy offer a technical solution to the challenge of sharing biomedical data to facilitate exploratory analyses. Our approach, which uses deep neural networks for data simulation, can generate synthetic data to be distributed and used for secondary analysis. We perform training with a differential privacy framework that limits study participants’ privacy risk. We applied this approach to data from the SPRINT clinical trial due to its recent use for a data reanalysis challenge.
We introduce an approach that samples from multiple epochs to improve performance while maintaining privacy. However, several challenges remain. Deep learning models have many training parameters and require substantial sample sizes, which can hamper this method’s use for small clinical trials or targeted studies. Another fruitful area of use may be large electronic health record systems, where the ability to share synthetic data may aid methods development and the initial discovery of predictive models. Similarly, financial institutions or other organizations that use outside contractors or consultants to develop risk models might choose to share generated data instead of actual client data. In very large datasets, there is evidence that differential privacy may even prevent overfitting, reducing the error of subsequent predictions (27).
Though our approach provides a general framing, the precise neural network architecture may need to be tuned for specific use cases. Data with multiple types present a challenge. EHRs contain binary, categorical, ordinal, and continuous data. Neural networks require these types to be encoded and normalized, a process that can reduce signal and increase the dimensionality of data. New neural networks have been designed to deal more effectively with discrete data (28, 29). Researchers will need to incorporate these techniques and develop new methods for mixed types if their use case requires it. We expect this approach to be best suited to sharing specific variables from clinical trials to enable wide sharing of data with similar properties to the actual data. We do not intend the method to be applied to generate high-dimensional genetic data from whole genome sequences or other such features. Application to that problem would require the selection of a subset of variants of interest or substantial additional methodological work.
Due to the fluid nature of security and best practices, it is important to choose a method that is mathematically provable and ensures that any outputs are robust to post-processing. Differential privacy satisfies both needs and is thus being relied upon in the upcoming 2020 United States Census (30). It is imperative to remember that to receive the guarantees of differential privacy, a proper implementation is required. We believe testing frameworks to ensure accurate implementations are a promising direction for future work, particularly in domains with highly sensitive data, like healthcare.
The practice of generating data under differential privacy with deep neural networks offers those who wish to share data a technical solution to the challenge of patient privacy. This technical work complements ongoing efforts to change the data sharing culture of clinical research.
Materials and Methods
We developed an approach to train auxiliary classifier generative adversarial networks (AC-GANs) in a differentially private manner to enable privacy preserving data sharing. Generative adversarial networks offer the ability to simulate realistic-looking data that closely matches the distribution of the source data.
AC-GANs add the ability to generate labeled samples. By training AC-GANs under the differential privacy framework we generated realistic samples that can be used for initial analysis while guaranteeing a specified level of participant privacy.
The source code for all analyses is available under a permissive open source license in our repository (https://github.com/greenelab/SPRINT_gan). In addition, continuous analysis (31) was used to re-run all analyses, to generate docker images matching the environment of the original analysis, and to track intermediate results and logs. These artifacts are freely available (https://hub.docker.com/r/brettbj/sprint-gan/ and archival version: https://doi.org/10.6084/m9.figshare.5165731.v1).
SPRINT Clinical Trial Data
SPRINT was a randomized, single-blind treatment trial in which participants were randomized into two groups: an intensive treatment group with a systolic blood pressure target of less than 120 mm Hg and a standard treatment group with a systolic blood pressure target of less than 140 mm Hg. The trial included a total of 9,361 participants. We included 6,502 participants from the trial by filtering for all participants that had blood pressure measurements for each of the first 12 measurements (RZ, 1M, 2M, 3M, 6M, 9M, 12M, 15M, 18M, 21M, 24M, 27M). We included measurements for systolic blood pressure, diastolic blood pressure, and the count of medications prescribed to each participant. This provided an input vector of shape (3, 12).
Auxiliary Classifier Generative Adversarial Network
We implemented the AC-GAN as described in Odena et al. (9) using Keras (32) to simulate systolic and diastolic blood pressures as well as the number of hypertension medications prescribed. Results shown use a latent vector of dimension 100, a learning rate of 0.0002, and a batch size of 1 trained for 500 epochs. To conform with the privacy claims laid out in Abadi et al. (25), gradients must be clipped per example; in our implementation, this requires the batch size to be 1. To handle edge cases and mimic the sensitivity of the real data measurements, we set a floor of zero (taking the larger of zero and the simulated value) and convert all values to integers. Full implementation details can be seen in the GitHub repository (https://github.com/greenelab/SPRINT_gan/blob/master/ac_gan.py).
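The value clamping and integer conversion amount to a short, data-independent post-processing step, sketched below; by the resilience to post-processing property (see Differential Privacy below), it does not weaken the privacy guarantee.

```python
# Sketch of the post-processing applied to simulated values.
import numpy as np

def postprocess(samples: np.ndarray) -> np.ndarray:
    # Floor negative simulated values at zero, then convert to integers
    # to match the granularity of the real measurements.
    return np.maximum(samples, 0).astype(int)
```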
Clinician Evaluation
Three physicians made a “real or synthetic” determination for each of 100 figures showing systolic blood pressure, diastolic blood pressure, and number of medications at each of 12 visits. The clinicians rated how realistic the patients looked (from 1-10, where 10 is most realistic) and classified whether the patients were part of the standard or intensive treatment plan. Prior to reviewing the figures, and regularly during the review of figures, the clinicians reviewed the published SPRINT protocol to help contextualize the data. We performed a Mann-Whitney U test to evaluate whether the real or synthetic samples received significantly different scores and compared the accuracy of the treatment plan classifications.
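The score comparison reduces to a single scipy call; the score vectors in this sketch are hypothetical placeholders for the clinicians’ ratings of the 50 real and 50 synthetic records.

```python
# Sketch of the realism-score comparison with a Mann-Whitney U test.
from scipy.stats import mannwhitneyu

real_scores = [5, 7, 4, 6, 5]       # placeholder ratings
synthetic_scores = [4, 6, 5, 5, 3]  # placeholder ratings
stat, p_value = mannwhitneyu(real_scores, synthetic_scores,
                             alternative="two-sided")
print(f"Mann-Whitney U = {stat}, p = {p_value:.3f}")
```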
Transfer Learning Task
Each of the 6,502 participants in our analytical dataset is labeled by treatment group. We evaluated machine learning methods (logistic regression, support vector machines, random forests, and nearest neighbors from the scikit-learn (33) package) by their ability to predict which group a participant belongs to. This was done by splitting the 6,502 participants into a training set of 6,000 participants (labeled real) and a test set of 502 participants. A vanilla AC-GAN was trained using the 6,000-participant training set, providing a simulated training set (labeled non-private). A differentially private AC-GAN was trained using the same 6,000-participant training set, providing a differentially private simulated training set (labeled private). Each classifier was then trained on the real, non-private, and private training sets and evaluated on the same real test set of participants. This allows for a comparison of classification performance between models trained on the real data, synthetic data, and private synthetic data. We evaluated both accuracy and the correlation between feature importances (random forest) and model coefficients (logistic regression and support vector machine).
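This evaluation loop can be sketched as follows; the hyperparameters here are scikit-learn defaults rather than the settings used for the reported results, and the variable names are illustrative.

```python
# Sketch of the transfer learning evaluation: fit each classifier on each
# training source and score all of them on the same held-out real test set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

CLASSIFIERS = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(),
    "svm": SVC(),
    "nearest_neighbors": KNeighborsClassifier(),
}

def evaluate(train_sets, X_test, y_test):
    """train_sets maps 'real'/'non-private'/'private' to (X, y) pairs."""
    results = {}
    for source, (X_train, y_train) in train_sets.items():
        for name, clf in CLASSIFIERS.items():
            clf.fit(X_train, y_train)  # fit() resets any previous training
            results[(source, name)] = accuracy_score(y_test,
                                                     clf.predict(X_test))
    return results
```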
Differential Privacy
Differential privacy is a stability property for algorithms, specifically for randomized algorithms (34). Informally, it requires that the change of any single data point in the dataset has little influence on the output distribution of the algorithm. To formally define differential privacy, let us consider X as the set of all possible data records in our domain. A dataset is a collection of n data records from X. A pair of datasets D and D′ are neighboring if they differ by at most one data record. In the following, we will write R to denote the output range of the algorithm, which in our case corresponds to the set of generative models.
Definition 1 [Differential Privacy (35)]: Let ε, δ > 0. An algorithm A: Xⁿ → R satisfies (ε, δ)-differential privacy if for any pair of neighboring datasets D, D′, and any event S ⊆ R, the following holds:

Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S] + δ,

where the probability is taken over the randomness of the algorithm.
A crucial property of differential privacy is its resilience to post-processing --- any data independent postprocessing procedure on the output by a private algorithm remains private. More formally:
Lemma [Resilience to Post-Processing]: Let algorithm A: Xⁿ → R be an (ε, δ)-differentially private algorithm. Let A′: R → R′ be a “post-processing” procedure. Then the composition A′(A(D)), obtained by running A over the dataset D and then running A′ over the output A(D), also satisfies (ε, δ)-differential privacy.
Training AC-GANs in a Differentially Private Manner
During the training of an AC-GAN, the only part that requires direct access to the private (real) data is the training of the discriminator. To achieve differential privacy, we only need to “privatize” the training of the discriminator. The differential privacy guarantee of the entire AC-GAN directly follows because the output generative models are simply post-processing of the privately trained discriminator.
To train the discriminator under differential privacy, we add noise to the stochastic gradient descent process as outlined in Abadi et al. (25). First, we provide an upper bound on the norm of the gradient at any individual step. This is done by clipping the ℓ2-norm of the gradient. Next, we perturb each coordinate of the gradient by adding noise drawn from a Gaussian distribution with a variance proportional to the gradient clipping bound. The more noise we added (relative to the clipped norm of the gradient), the better the privacy guarantee. To achieve a modest privacy budget, we found we could clip the ℓ2-norm of the gradient at 0.0001 and add Gaussian noise with a noise multiplier σ of 1, i.e., noise drawn from 𝒩(0, (1 × 0.0001)²). This noise level is substantially higher than previously shown, likely due either to the dynamic nature of GAN training, where the target is inexact and changes over time, or to averaging over many mini-batches. We used the moments accountant described in Abadi et al. (25) to compute the privacy parameters (ε, δ). These parameters were determined after running a grid search over noise multipliers (0.25, 0.5, 1, 1.5, 2, 3, 4, 8) and gradient clipping bounds (0.1, 0.01, 0.001, 0.0001, 0.00001) to determine how long models could be trained under an (ε, δ) of (2.5, 10−5).
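The privatized update can be sketched as follows in plain numpy; this is illustrative rather than the exact Keras implementation, and it generalizes to batches even though our implementation uses a batch size of 1 (so each “batch” is a single example).

```python
# Minimal sketch of the privatized gradient step: clip each per-example
# gradient's l2 norm, sum, and add Gaussian noise scaled to the clip bound.
import numpy as np

CLIP = 1e-4   # l2-norm bound on each per-example gradient
SIGMA = 1.0   # noise multiplier; the noise std is SIGMA * CLIP

def private_gradient(per_example_grads: np.ndarray,
                     rng: np.random.Generator) -> np.ndarray:
    """per_example_grads has shape (batch_size, n_params)."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale each example's gradient so its l2 norm is at most CLIP.
    clipped = per_example_grads * np.minimum(1.0,
                                             CLIP / np.maximum(norms, 1e-12))
    noise = rng.normal(0.0, SIGMA * CLIP, size=per_example_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_example_grads)
```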
Differentially Private Model Selection
We found that sampling from multiple different epochs throughout training provided a more diverse training set. This provided summary statistics closer to the real data and higher accuracy in the transfer learning task. During GAN training, we saved the generative models from all epochs. We then generated a batch of synthetic data from each generative model and used a machine learning algorithm (logistic regression or random forest) to train a prediction model based on each synthetic batch of data. We then tested each prediction model on the training set from the real dataset and calculated the resulting accuracy. To select the epochs that generate training data for the most accurate models under differential privacy, we used the standard “Report Noisy Min” subroutine: first add independent Laplace noise (drawn from Lap(1/(n·ε))) to the accuracy of each model to achieve (ε, 0)-differential privacy, where n is the size of the private dataset we perform the prediction on, and then output the model with the best noisy accuracy.
In practice, we chose the top five models that performed best on the transfer learning task for the training data using both logistic regression classification and random forest classification (for a total of 10 models). We performed this task under (0.5, 0)-differential privacy; in each of the ten rounds of selection, epsilon was set to 0.05. This achieves a good balance of accuracy while maintaining a reasonable privacy budget.
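One round of this selection can be sketched as follows; function and variable names are illustrative.

```python
# Sketch of one round of the noisy selection subroutine: perturb each
# candidate epoch's accuracy with Laplace noise of scale 1/(n * eps) and
# report the index with the best noisy score.
import numpy as np

def report_noisy_best(accuracies, n, eps, rng: np.random.Generator) -> int:
    """Return the index of the epoch with the best noisy accuracy.

    Each call satisfies (eps, 0)-differential privacy; we run ten rounds
    at eps = 0.05 each.
    """
    noise = rng.laplace(0.0, 1.0 / (n * eps), size=len(accuracies))
    return int(np.argmax(np.asarray(accuracies) + noise))
```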
Predicting Heart Failure in the MIMIC Critical Care Database
We applied the method to the MIMIC Critical Care Database (36) to demonstrate its generality. We generated synthetic patients for the purpose of predicting heart failure. MIMIC is a database of 46,297 de-identified electronic health records for critical care patients at Beth Israel Deaconess Medical Center. We defined patients who suffered from heart failure as any patient in MIMIC diagnosed with an ICD-9 code included in the Veterans Affairs’ Chronic Heart Failure Quality Enhancement Research Initiative’s guidelines (402.01, 402.11, 402.91, 404.01, 404.03, 404.11, 404.13, 404.91, 404.93, 428, 428.1, 428.20, 428.21, 428.22, 428.23, 428.30, 428.31, 428.32, 428.33, 428.40, 428.41, 428.42, 428.43, and 428.9). We performed complete-case analysis for patients with at least five measurements for mean arterial blood pressure, arterial systolic and diastolic blood pressures, beats per minute, respiration rate, peripheral capillary oxygen saturation (SpO2), mean non-invasive blood pressure, and mean systolic and diastolic blood pressures. For patients with more than five measurements for these values, the first five were used. This yielded 8,260 total patients and 2,110 cases of heart failure. We included the first 7,500 patients in the training set and the remaining 760 in a hold-out test set. The training and transfer learning procedures matched the SPRINT protocol. Because the classes were unbalanced, we used the F1 score to evaluate the results from the transfer learning exercise.
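The cohort labeling and the imbalance-aware scoring can be sketched as follows; the code list follows the text, but the DataFrame and column names are hypothetical placeholders rather than the actual MIMIC schema.

```python
# Sketch of the heart-failure labeling and F1-based evaluation.
import pandas as pd
from sklearn.metrics import f1_score

HF_CODES = {
    "402.01", "402.11", "402.91", "404.01", "404.03", "404.11", "404.13",
    "404.91", "404.93", "428", "428.1", "428.20", "428.21", "428.22",
    "428.23", "428.30", "428.31", "428.32", "428.33", "428.40", "428.41",
    "428.42", "428.43", "428.9",
}

def label_heart_failure(diagnoses: pd.DataFrame) -> pd.Series:
    """diagnoses: one row per (patient_id, icd9_code) pair."""
    flagged = diagnoses.assign(is_hf=diagnoses["icd9_code"].isin(HF_CODES))
    return flagged.groupby("patient_id")["is_hf"].any()

def evaluate_unbalanced(y_true, y_pred) -> float:
    # Classes are unbalanced (2,110 cases of 8,260 patients), so report F1
    # rather than accuracy.
    return f1_score(y_true, y_pred)
```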
Funding
This work was supported by the Gordon and Betty Moore Foundation under a Data Driven Discovery Investigator Award to C.S.G. (GBMF 4552). B.K.B.-J. was supported by a Commonwealth Universal Research Enhancement (CURE) Program grant from the Pennsylvania Department of Health and by US National Institutes of Health grants AI116794 and LM010098. Z.S.W. is funded in part by a subcontract on the DARPA Brandeis project and a grant from the Sloan Foundation. J.B.B. is funded by US National Institutes of Health grant K23-HL128909.
Author Contributions
B.K.B.-J. and C.S.G. conceived the study. B.K.B.-J. and C.W. performed initial analyses. B.K.B.-J. and Z.S.W. designed and validated the privacy approach. J.B.B. performed a blinded review of records. B.K.B.-J., C.S.G., and Z.S.W. wrote the manuscript, and all authors revised and approved the final manuscript.
Competing interests
The authors have no competing interests to disclose.
Data and materials availability
All data used in this manuscript are available via the NHLBI (https://biolincc.nhlbi.nih.gov/studies/sprintpop/), the source code is available via GitHub (https://github.com/greenelab/SPRINT_gan), and an archived version is available via Figshare (DOI: 10.6084/m9.figshare.5165737).
Supplemental Materials
Supp. Fig. 1. AC-GAN architecture and training. A.) Structure of an AC-GAN. B.) The generator model takes a class label representing the treatment group (e.g., intensive or standard care group) and random noise as input and outputs a 3×12 vector for each participant (SBP, DBP, and medication counts at each time point). C.) The discriminator model takes both real and simulated samples as input and learns to predict the source and a class label (i.e., standard or intensive treatment group). D.) Training loss for a non-private AC-GAN. E.) Training loss for a private AC-GAN.
Supp. Fig. 2. Random noise breaks equilibrium.
Supp. Fig. 3. Top Ranking Epochs for Transfer Learning Exercise.
Supp. Fig. 4. Scores vs. Epoch for Transfer Learning Task.
Supp. Fig. 5. A.) Random forest variable importance scores by training data. B.) Logistic Regression variable coefficients by training data. C.) Support Vector Machine variable coefficients by training data.
Supp. Table 1. Spearman Correlation between variable importance scores (Random Forests) and model coefficients (Support Vector Machine and Logistic Regression) for the SPRINT transfer learning task.
Supp. Table 2. Spearman Correlation between variable importance scores (Random Forests) and model coefficients (Support Vector Machine and Logistic Regression) for the MIMIC heart failure task.
Acknowledgments
We thank Jason H. Moore (University of Pennsylvania), Aaron Roth (University of Pennsylvania), Gregory Way (University of Pennsylvania), Yoseph Barash (University of Pennsylvania), Anupama Jha (University of Pennsylvania), and Blanca Himes (University of Pennsylvania) for their helpful discussions. This manuscript was prepared using SPRINT_POP Research Materials obtained from the NHLBI Biologic Specimen and Data Repository Information Coordinating Center and does not necessarily reflect the opinions or views of the SPRINT_POP or the NHLBI. We thank the participants of the SPRINT trial and the entire SPRINT Research Group.