Generating Digital Twins with Multiple Sclerosis Using Probabilistic Neural Networks

Multiple Sclerosis (MS) is a neurodegenerative disorder characterized by a complex set of clinical assessments. We use an unsupervised machine learning model called a Conditional Restricted Boltzmann Machine (CRBM) to learn the relationships between covariates commonly used to characterize subjects and their disease progression in MS clinical trials. A CRBM is capable of generating digital twins, which are simulated subjects having the same baseline data as actual subjects. Digital twins allow for subject-level statistical analyses of disease progression. The CRBM is trained using data from 2395 subjects enrolled in the placebo arms of clinical trials across the three primary subtypes of MS. We discuss how CRBMs are trained and show that digital twins generated by the model are statistically indistinguishable from their actual subject counterparts along a number of measures.


I. INTRODUCTION
Statistical models that characterize and predict disease progression have the potential to become important tools for research and management of complex, chronic indications like Multiple Sclerosis (MS). In the context of clinical research, for example, statistical models of disease progression could be used to simulate different study designs [1], to identify patients who are likely to progress more rapidly than typical patients [2], or as external comparators in early stage clinical trials [3]. Despite the availability of Disease Modifying Therapeutics (DMTs) for MS, many clinical trials still compare experimental treatments to placebo controls [4][5][6][7], which can deter patients from enrolling in those trials. At the same time, the rich history of clinical trials in MS provides a high quality source of data about disease progression under placebo or non-DMT control conditions, creating an opportunity for statistical models to further the clinical understanding of MS and better inform the design and analysis of trials [8].
Ideally, a statistical model of disease progression would describe all of the clinical characteristics that are relevant for a patient with MS, how these characteristics change through time, their inter-relationships, and their variation. In more technical terms, it would describe a joint probability distribution of the relevant clinical characteristics over time. Throughout, we refer to a sample drawn from this model distribution as a digital subject. By definition, therefore, a digital subject is a computationally generated clinical trajectory with the same statistical properties as clinical trajectories from actual patients. The primary advantages of digital subjects, in comparison to data from actual patients, are that they present no risk of revealing private health information and make it possible to quickly simulate patient cohorts of any size and characteristics.
Although the ability to create digital subjects enables one to simulate cohorts of patients, there are many applications for which one would like to make predictions about a particular patient. A statistical model that is able to generate digital subjects can also be used for individual patients by generating digital subjects that have the same clinical characteristics as the patient of interest at a given date (such as the start of a clinical study). We refer to a digital subject that has been generated in order to match a particular patient as a digital twin of that patient, in analogy with the usage of the term "digital twin" in engineering applications [9]. For example, if the statistical model represents disease progression on a placebo, then a digital twin provides a potential outcome, i.e., what would likely happen to this patient if she/he were given a placebo in a clinical trial? In more technical terms, a digital twin is a sample from a joint probability distribution of the relevant clinical characteristics at future times conditioned on the values of those characteristics at a previous time. Note that it is possible to generate multiple digital twins for any patient.
Evaluating the quality of a model that generates digital subjects or digital twins is more complex than the evaluation of traditional machine learning models. For example, it is necessary, but not sufficient, that the means, standard deviations, correlations, and autocorrelations of all covariates computed from cohorts of digital subjects agree with those computed from cohorts of actual subjects. Here, we introduce a concept called statistical indistinguishability. Digital subjects, or digital twins, are statistically indistinguishable from actual subjects if a statistical procedure designed to classify a given clinical record as a digital subject or an actual subject performs consistently with random guessing.
Creating a statistical model that can generate digital subjects and digital twins that are statistically indistinguishable from actual patients is no small feat, but recent advances in generative neural networks make it possible to train generative models for complex probability distributions [10][11][12], including time-dependent clinical data [2]. Here, we train a probabilistic generative neural network to create digital subjects with MS using a database of placebo arms aggregated across 8 historical clinical trials [8] covering many clinically relevant covariates including demographic information, relapses, the Expanded Disability Status Scale (EDSS) [13], and functional tests. To illustrate the concept of a digital subject for MS, Figure 1 shows two digital subjects generated by the model with the same starting characteristics (i.e., the two subjects are digital twins).
As a complex neurodegenerative disorder with a course that is "highly varied and unpredictable" [14], MS is both a good test-case for the application of machine learning methods as well as a disease area that would benefit from improved characterization and prediction of disease progression. Previous models for disease progression in MS include survival models [15][16][17], Markov models [18][19][20], and other methods [21][22][23][24][25]. Most of these studies aim to predict the progression of either the total EDSS score or relapses, as these are common ways to characterize disease severity. To our knowledge, our approach is the first that models comprehensive subject-level clinical trajectories that go beyond the total EDSS score. In this case we model 20 covariates relevant to capturing MS progression (see Appendix A).
A CRBM is a class of probabilistic neural network that learns a joint probability distribution over time, and is closely related to Restricted Boltzmann Machines [26][27][28][29][30]. A CRBM learns the relationship between covariates at k visits expressed as a parametric probability distribution. For example, for k = 3 with a 3-month spacing between visits the distribution takes the form

p(x(t + 6), x(t + 3), x(t)) = Z^{-1} ∫ dh e^{-U(x(t+6), x(t+3), x(t), h)},

in which x(t) is the vector of covariates at time t (in months), h is a vector of hidden variables, U(·) is called the energy function, and Z is a normalization constant. More details are provided in Appendix B.
The CRBM allows for approximate sampling from the joint probability distribution over all covariates and across k visits. Thinking of disease progression as a Markov process, the model can be used to probabilistically generate data for a subject at a visit given data from the previous k − 1 visits. For MS, we model data in 3-month intervals and find that modeling k = 3 simultaneous visits in the CRBM is sufficient to accurately generate long trajectories.
In other words, we model MS clinical trajectories as a lag-2 Markov process.
The CRBM is used in an iterative fashion to generate clinical trajectories for digital subjects or digital twins. To generate a digital subject, the model is used to generate data at time zero (i.e., at baseline) by sampling from the marginal distribution p(x(t = 0)) using Markov Chain Monte Carlo (MCMC) methods. To generate a digital twin, the covariates at time zero (baseline) are set equal to the observed covariates of a particular patient. Starting from these baseline data, the CRBM is used to generate data for the 3- and 6-month visits for that subject by sampling from p(x(t = 6), x(t = 3)|x(t = 0)). Then these sampled 3- and 6-month visit data may be used to generate data for the 9-month visit by sampling from p(x(t = 9)|x(t = 6), x(t = 3)). Similarly, the sampled 6- and 9-month visit data may be used to generate data for the 12-month visit by sampling from p(x(t = 12)|x(t = 9), x(t = 6)).
This continues as needed to create clinical trajectories of the desired length.
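The iterative generation scheme above can be sketched as follows. Here `sample_next` is a hypothetical stand-in for drawing a visit from the CRBM's conditional distribution p(x(t)|x(t − 3), x(t − 6)), not the actual model interface.

```python
import numpy as np

def generate_trajectory(sample_next, x0, n_visits, lag=2):
    """Iteratively extend a clinical trajectory with a lag-2 model.

    sample_next(history) is a hypothetical stand-in for sampling the
    next visit from the CRBM's conditional distribution; `history`
    holds the previous `lag` visits (fewer near baseline, where a
    real CRBM would sample the first visits jointly).
    """
    traj = [np.asarray(x0, dtype=float)]
    while len(traj) < n_visits:
        # Condition only on the most recent `lag` visits (Markov property).
        traj.append(np.asarray(sample_next(traj[-lag:]), dtype=float))
    return traj
```

For instance, passing a `sample_next` that carries the last visit forward produces a constant trajectory; a trained model would instead draw stochastic continuations.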
Under this framework, we can now define digital subjects and digital twins more formally.
A digital subject is a clinical trajectory of length τ sampled from the joint distribution across all τ visits, {x(t)}_{t=0}^{t=τ} ∼ p(x(t = τ), ..., x(t = 0)). A digital twin is a clinical trajectory of length τ in which the visits after t = 0 are sampled from the conditional distribution {x(t)}_{t=3}^{t=τ} ∼ p(x(t = τ), ..., x(t = 3)|x(t = 0)). One important concept is that digital subjects and digital twins are stochastic. Therefore, one can create many digital twins for a given patient. Each of these digital twins will have the same covariates at baseline, but their trajectories will differ after t = 0. Taken together, these digital twins map out the distribution of possible clinical trajectories for that patient.
Note that subjects in clinical trials do not receive different treatments at the baseline visit, so that digital twins may be created for subjects in any arm. If the CRBM models the progression of the disease under the control condition, then the model may be used to generate counterfactuals that can be used to estimate treatment effects. In this work, our training and validation data were collected entirely from placebo control arms of previously completed clinical trials so that digital subjects generated from the model are representative of placebo controls.
A schematic overview of our process for building the CRBM and generating digital twins is given in Figure 2.

B. Data
One legacy of the long history of clinical development for MS is a rich set of historical clinical trials with placebo arms. The Multiple Sclerosis Outcome Assessments Consortium (MSOAC) [8] has compiled data from 16 clinical trials that span three subtypes of MS: relapsing remitting (RRMS), secondary progressive (SPMS), and primary progressive (PPMS).
A database of placebo arms from 8 of these trials comprising 2465 subjects has been made available to qualified researchers. This database contains measurements spanning a number of domains including background information, questionnaires and functional assessments, medical history, and concomitant medications. No imaging or biomarker data is available.

[Figure caption: An important feature of the CRBM is that it can create digital subjects and digital twins. Digital subjects are synthetic clinical records generated from the model. Digital twins are digital subjects whose baseline data is that of a given subject, allowing for subject-level predictions of outcomes. The CRBM is probabilistic, meaning many digital twins may be created for an actual subject and used to build distributions of predicted outcomes. In this work we use digital twins to analyze the quality of the CRBM as a disease progression model.]
MSOAC data are encoded according to the Study Data Tabulation Model (SDTM) format, a highly structured format commonly used to submit the results of clinical trials to regulatory authorities [31,32]. Starting from the MSOAC placebo dataset, we standardized and converted SDTM-formatted measurements into a tidy format [33].

C. Training
A well-trained CRBM is able to generate digital subjects with the same statistical properties as actual subjects in the dataset. To accomplish this goal, we trained the CRBM via stochastic gradient ascent to maximize a convex combination of a composite likelihood and an adversarial objective as previously described [2,10]. The relative weight given to the likelihood and adversarial components of the objective function was treated as a hyperparameter.
Hyperparameters refer to various parameters of the model or training process that must be specified, but cannot be learned by applying gradient ascent to the objective function as with the main model parameters. These hyperparameters included the weighting of the likelihood and adversarial terms of the objective function, number of epochs, batch size, initial learning rate, number of hidden units, magnitude of parameter regularization, and sampling parameters (i.e., the number of Monte Carlo steps and the magnitude of temperature-driven sampling [10]). We performed a grid search over each of these dimensions in parallel, training a total of 1296 CRBMs with various hyperparameter choices to maximize an objective function on the training dataset.
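The grid search over these hyperparameter dimensions can be sketched with a generic enumerator; the axis names and values below are illustrative, not the actual settings used in the paper.

```python
import itertools

def hyperparameter_grid(**axes):
    """Enumerate every combination of hyperparameter values.

    Each keyword argument names one axis of the grid and supplies the
    list of values to try along that axis. Yields one dict per model
    to be trained (1296 combinations in the paper's sweep).
    """
    keys = list(axes)
    for values in itertools.product(*(axes[k] for k in keys)):
        yield dict(zip(keys, values))
```

For example, two learning rates crossed with three hidden-layer sizes yields six candidate configurations.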
After training the CRBMs in the grid search, we evaluated them using a variety of metrics that assess a model's statistical performance as well as clinically important outcomes. All metrics were computed on the validation dataset. The statistical metrics are coefficients of determination (R^2 values) between the actual subjects and their digital twins for the equal-time and lagged autocorrelations. Three clinical metrics were used, one measuring the ability to predict individual relapse events, another measuring the agreement between average EDSS progression values for the data and model, and the third measuring the agreement between chronic disease worsening (CDW) fractions for the data and model.
We adopted a minimax approach to choose a best performing model across the grid search. That is, we aimed to choose a CRBM that performed well across all metrics, even if it was not the best performing model on any single metric. To accomplish this, we ranked all N (N = 1296) models from best (rank 1) to worst (rank N) for each metric and determined the worst rank across metrics for each model. We used this worst rank to choose the top 25% of models. This selected models that performed well on all metrics. To choose a best model from this set, we repeated the process on the clinical metrics alone, ranking models on the clinical metrics and determining the worst rank of each model. We selected the model with the lowest (best) worst rank and designated this as the "optimal" model. Narrowing the focus to the clinical metrics emphasized their importance in model selection.
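A minimal sketch of this worst-rank selection follows, assuming a score matrix where higher is better. For simplicity it uses the same metric set at both stages, whereas the paper's second stage restricts to the clinical metrics.

```python
import numpy as np

def minimax_select(scores):
    """Select a model by worst-rank minimax.

    scores: (n_models, n_metrics) array, higher is better.
    Returns the index of the model whose worst per-metric rank is
    best, restricted to the top 25% of models by worst rank.
    """
    order = np.argsort(-scores, axis=0)          # best model first, per metric
    ranks = np.empty_like(order)
    n_models = scores.shape[0]
    for j in range(scores.shape[1]):
        ranks[order[:, j], j] = np.arange(n_models)
    worst = ranks.max(axis=1)                    # worst rank across metrics
    cutoff = np.quantile(worst, 0.25)            # keep top 25% by worst rank
    candidates = np.flatnonzero(worst <= cutoff)
    return candidates[np.argmin(worst[candidates])]
```

A model that is mediocre on every metric can thus beat a model that is best on one metric but poor on another.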
The hyperparameters of this optimal model were then used to train a new, "final" model on the combined training and validation datasets. All results that follow in the main text focus on the performance of this final model computed using the test dataset. More details on the hyperparameters, training procedures, and selection process for the optimal model are described in Appendix B and Appendix C.

A. Model Performance Across the Hyperparameter Sweep
The hyperparameter sweep trained 1296 models over a grid of hyperparameters on the training dataset. 83 of these models failed to finish training due to poor performance (e.g., a learning rate that was too high) and were excluded from further evaluation, effectively ranking them lowest in model selection. We computed the model selection metrics using the validation dataset on the remaining 1213 models and determined the hyperparameters of the optimal model. These hyperparameters were used to train a new model on the combination of the training and validation datasets, which we refer to as "the final model".
For comparison, the test metrics were computed using both the optimal model (determined from the validation set) and the final model (trained on the combined training and validation sets).
The distributions of the validation metrics used for the hyperparameter sweep are shown in Figure 3, along with metric values for the optimal and final models. As expected, the optimal model performed well on all metrics without performing poorly on any individual metric. Furthermore, the performance of the optimal model was consistent across the validation and test datasets, with the CDW fraction showing the largest difference. The performance of the optimal and final models on the test dataset were also quite similar, aside from the relapse prediction metric. This suggests that overall model performance is fairly stable with respect to changes to the dataset used for model training or evaluation.

B. Assessing Statistical Indistinguishability of Digital Twins
We say that digital twins and actual subjects are statistically indistinguishable if statistical analysis methods cannot differentiate the two groups better than random chance. Due to the complexity of clinical data, there are a number of ways to assess statistical indistinguishability. We focus on three different methods to quantitatively assess the degree of indistinguishability that span from subject- and covariate-level statistics to population-level statistics. The first compares the means, standard deviations, correlations, and autocorrelations of all covariates computed from the digital twins to those computed from the actual subjects, quantifying agreement using linear regression and coefficients of determination where appropriate. The second aims to quantify the agreement between the observed clinical trajectory of a particular patient and the distribution of potential clinical trajectories defined by his/her digital twins. The third trains logistic regression models to distinguish between actual subjects and their digital twins, quantifying performance with the area under the receiver operating characteristic curve (AUC).
To evaluate the statistical indistinguishability of digital twins, we created 1000 digital twins for each of the 718 actual subjects in the test dataset.The statistical properties of these twins were compared to the actual subjects in a number of ways to quantify the goodness-of-fit of the CRBM.
First, in Figure 4 we compare the means, standard deviations, correlations, and autocorrelations between covariates computed from the actual subjects and a cohort consisting of a single digital twin for each subject. The duration of the clinical trajectory of each digital twin was the same as the matched patient. We ignored the baseline time point when computing the statistics because the actual subject and his/her digital twin have the same values of all covariates at the initial time point, by definition. All of the statistics computed from the cohort of digital twins agree with those computed from the cohort of actual subjects, with R^2 values for all comparisons greater than 0.95 and best fit coefficients close to their ideal values (i.e., 0 for the intercept, 1 for the slope).

[Figure 4 caption: Points on the plot are darker if more data is present (see the color bar). These regressions estimate the relationship between actual subjects and digital twins. Theil-Sen regression is used for the means and standard deviations, which is outlier robust and appropriate for the widely varying scales present [36,37]. For the equal-time and lagged autocorrelations, an ordinary least squares regression is used, which allows the R^2 values shown to be computed. The best fit line is shown for each comparison.]
Figure 4 compares a cohort of actual subjects to a cohort of matched digital twins based on population-level statistics. It is also important to quantify performance using statistics that are sensitive to subject-level agreement between an actual subject and his/her digital twins as in Figure 5. We generated 1000 digital twins for each actual subject. For each subject, each covariate, and each visit, the samples from the digital twins form a distribution to which the observed data for the actual subject may be compared. We used the distribution defined by the digital twins to compute a tail area probability (i.e., p-value) for each observation, then used the inverse cumulative distribution function of the standard normal distribution to convert the p-value to a statistic, ϕ = Φ^{-1}(p), that has mean 0 and standard deviation 1 if the digital twins and subject data are drawn from identical distributions. This statistic controls for highly non-normal distributions of covariates better than a traditional z-score.
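The ϕ statistic can be computed from the twin samples via an empirical tail-area probability; the sketch below is an assumption about the implementation, as the paper does not specify tie handling or the exact empirical-CDF convention.

```python
from statistics import NormalDist

def phi_statistic(observed, twin_samples):
    """Normal score of an observation relative to its twin distribution.

    Computes an empirical tail-area probability p of `observed` under
    the digital-twin samples, then maps it through the inverse CDF of
    the standard normal. The continuity correction keeping p in (0, 1)
    and the half-credit for ties are our own conventions (a sketch,
    not the paper's exact procedure).
    """
    n = len(twin_samples)
    below = sum(1 for s in twin_samples if s < observed)
    ties = sum(1 for s in twin_samples if s == observed)
    p = (below + 0.5 * ties + 0.5) / (n + 1)
    return NormalDist().inv_cdf(p)
```

An observation near the center of the twin distribution maps to ϕ ≈ 0, while one in the far tail maps to a large |ϕ|.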
We applied the Kolmogorov-Smirnov test to assess whether the distribution of these statistics deviates from N(0, 1) and mark any cases with statistically significant disagreement at α = 0.05 after a Bonferroni correction.
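A minimal version of this check computes the one-sample Kolmogorov-Smirnov statistic against N(0, 1) and the Bonferroni-corrected significance level; the decision rule itself (via critical values or p-values) is omitted here.

```python
from statistics import NormalDist

def ks_stat_vs_normal(phis):
    """One-sample Kolmogorov-Smirnov statistic of a sample vs N(0, 1).

    Returns the maximum absolute gap between the empirical CDF of the
    phi statistics and the standard normal CDF.
    """
    xs = sorted(phis)
    n = len(xs)
    cdf = NormalDist().cdf
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Compare against the ECDF just before and just after the jump at x.
        d = max(d, abs((i + 1) / n - f), abs(f - i / n))
    return d

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after a Bonferroni correction."""
    return alpha / n_tests
```

With, say, 20 covariate-visit combinations tested at α = 0.05, each individual test is assessed at 0.0025.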
Only early visits for the timed 25-foot walk and 9-hole peg test show statistically significant differences between actual subjects and their digital twins, with the difference primarily due to the standard deviation of the ϕ distribution being too small. This indicates that the standard deviations of the distributions defined for these covariates by the digital twins are too large. These two functional tests are scored as time to completion of a task. Subjects can have large single-visit outliers in this time due to short-term ambulatory or motor difficulties, which is particularly challenging to learn. Although the model generally agrees with the data well, these results suggest that the CRBM generates large times for these tests too frequently.
Finally, in Figure 6, we demonstrate that a logistic regression model cannot distinguish between an actual subject and one of his/her digital twins better than random chance, which is a direct evaluation of statistical indistinguishability. For each time point (except baseline, when the actual subjects and their twins are the same by definition), we trained a logistic regression model to distinguish between each subject and his/her digital twins.
In addition, we trained logistic regression models to distinguish between each subject and his/her digital twins using differences between time points. Over 36 months, nearly every logistic regression model performed consistently with random guessing; that is, the models could not distinguish actual subjects from their digital twins.
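The AUC used to score these classifiers can be computed directly from the rank-sum (Mann-Whitney) identity; an AUC near 0.5 corresponds to the random-guessing behavior reported above. This is a minimal sketch, not the paper's implementation.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum identity.

    labels: 1 for actual subjects, 0 for digital twins.
    scores: classifier scores (higher means "more likely actual").
    Counts the fraction of (actual, twin) pairs the classifier orders
    correctly, with 0.5 credit for ties; 0.5 overall means the
    classifier is no better than random guessing.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect separator scores 1.0; identical score distributions for the two groups pull the AUC toward 0.5.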

IV. DISCUSSION
This work introduced three concepts: digital subjects, digital twins, and the statistical indistinguishability of digital subjects from actual subjects. Using a dataset of subjects enrolled in the placebo arms of MS clinical trials, we trained a Conditional Restricted Boltzmann Machine to generate digital subjects. This dataset included demographic information and disease history, functional assessments, and components of the EDSS score.
The model learned the relationship between covariates across multiple visits, treating disease progression as a Markov process; this means that given data for a subject at some number of prior visits, the model is capable of generating clinical trajectories representing the distribution of potential outcomes at future visits for that subject. Taken in context with previous work by the authors applying CRBMs to generate digital subjects for Alzheimer's Disease [2], this work suggests that approaches based on probabilistic neural networks like CRBMs will be broadly applicable to generating digital subjects across multiple diseases.
[…] the evolution or presentation of MS. Disease modifying therapies were rarely used (5% of subjects received them) and were likely provided as rescue therapies for relapses or adverse events. However, as dosing time information was not provided, these variables were not included in further analyses.
The Expanded Disability Status Scale (EDSS) is commonly used to assess disease progression in MS [13], but the score has a non-linear nature that makes it difficult to model. Note that we are assuming that subjects with high EDSS scores have impaired ambulatory function. KFSS scoring rubrics do not provide consistent guidelines for assigning scores above the ambulation cutoff, indicating this assumption is valid. As a cross-check of this assumption, we took the subjects whose reported EDSS score was less than or equal to 4.5, and recomputed the score from the KFSS components. Although we found disagreements between the reported and recomputed EDSS scores in approximately 10% of cases, in a majority of these cases we were able to clearly identify an apparent mis-scoring of EDSS based on the reported KFSS component scores. As a result, we used the EDSS scores that we recomputed directly from the KFSS components in further analyses.
As with many statistical models, the performance of a CRBM is often improved by scaling the variables so that their values are of order 1. For binary and categorical covariates no normalization is applied, while for ordinal covariates we scale the covariate by its maximum so that the values range between 0 and 1. For continuous covariates we apply different normalizing transformations depending on the features of the covariate. Details of these transformations are provided in Table II.
For age, the "standardization" transform means we subtract the mean and divide by the standard deviation of the training set. The functional assessment variables (timed 25-foot walk, 9-hole peg test, and paced auditory serial addition test (PASAT)) all use a logit transformation. The functional form of these transformations is

y = log[(x − x₋ + δ) / (x₊ − x + δ)],

where x± are the upper and lower limits of the variable (the range values given in the table) and δ is a buffer to prevent the argument from being too small or large (we use δ = 0.5).
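As an illustration, a buffered logit transform with this behavior and its inverse might look as follows; the exact expression and buffer handling used in the paper may differ.

```python
import math

def logit_transform(x, x_lo, x_hi, delta=0.5):
    """Buffered logit transform of a bounded variable.

    Maps x in [x_lo, x_hi] to the real line; delta keeps the logit
    argument away from 0 and 1 at the boundaries (a reconstruction of
    the form described in the text, not a verified implementation).
    """
    p = (x - x_lo + delta) / (x_hi - x_lo + 2 * delta)
    return math.log(p / (1 - p))

def inverse_logit_transform(y, x_lo, x_hi, delta=0.5):
    """Invert logit_transform, recovering the natural representation."""
    p = 1 / (1 + math.exp(-y))
    return p * (x_hi - x_lo + 2 * delta) + x_lo - delta
```

The midpoint of the range maps to 0, and the round trip recovers the original value, which is the invertibility property relied on below.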
These transformations, like the others used, are invertible, meaning we can transform from the representation of the data generated by the CRBM back to the natural representation of the data. For the PASAT test, we choose to model the variable as continuous (rounding the value when using the model output) rather than ordinal due to the large number of possible values.
p(x_t | x_{t−1}, ..., x_{t−K}) = Z^{-1} ∫ dh e^{-U(x_t, h | x_{t−1}, ..., x_{t−K})}, (B1)

in which x_t is the vector of variables at time index t, h is a vector of hidden variables, Z is a normalization constant, and U(·|·) is the conditional energy function. K is the number of prior time points on which the distribution is conditioned. Here, the conditional energy function is a family of functions parametrized by linear functions of the K previous time points. Graphically, this is a model in which there are directed arrows extending from the previous time points to the hidden units (and, potentially, the current visible units) of an RBM (Figure 7).
The CRBM architecture described in [12,41] is one-directional; the future is conditioned on the past. However, there are a number of reasons why we may desire a model that is bidirectional so that we could predict the future given the past or predict the past given the future. For example, the ability to predict the past given the future is especially useful for imputing missing observations, e.g., knowledge that a subject had an EDSS score of 3.5 at their second visit provides some information on what their EDSS score likely was at the first visit, and we could use that information to impute the first visit value if it was not measured.
Our CRBM architecture differs from that in [12,41] because it is bidirectional and it allows for some of the visible variables (such as sex or race) to be time-independent (denoted x_static). To make the model bidirectional, we remove all connections between visible units and make all of the visible-hidden unit connections undirected (Figure 7). This leads to a joint probability distribution of k + 1 time-adjacent vectors given by

p(x_{t+k}, ..., x_{t+1}, x_t, x_static) = Z^{-1} ∫ dh e^{-U(x_{t+k}, ..., x_{t+1}, x_t, x_static, h)}, (B2)

in which the energy function U(·) takes the form

U(v, h) = −Σ_{ij} a_{ij}(v_{ij}) − Σ_µ b_µ(h_µ) − Σ_{ij,µ} W_{ij,µ} (v_{ij}/σ_{ij})(h_µ/ε_µ). (B3)

Each unit of the visible and hidden layers has bias parameters determined by the choice of functions a_{ij}(·) and b_µ(·), as well as scale parameters σ_{ij} and ε_µ. The connection between the layers is parameterized by the weight matrices W_{ij,µ}. Understood as a single RBM, our CRBM contains the visible units for multiple time points coupled to a standard hidden layer. The visible units are organized as

v = x_{t+k} ⊕ ··· ⊕ x_{t+1} ⊕ x_t ⊕ x_static, (B4)

where k is the time lag of the model and ⊕ signifies the direct sum. The static units are only used once over all time points, as they are constant over all times.
A bidirectional CRBM learns the complete joint probability distribution between all k + 1 adjacent time points simultaneously, p(x_{t+k}, ..., x_t, x_static). That means that any conditional sampling of the data may be performed, such as predicting the data for a time point given the previous k time points. In fact, the conditional distribution p(x_{t+k} | x_{t+k−1}, ..., x_{t+1}, x_t, x_static) has the same functional form as that in [12,41]. Trajectories can be sampled by iteratively drawing from appropriate conditional distributions. For example, the trajectory (x_{t=τ}, ..., x_{t=1}) | (x_{t=0}, x_static) is created by sampling from x_{t=2} ∼ p(x_{t=2} | x_{t=1}, x_{t=0}, x_static), ...
in that sequence. Note that we can also sample from a marginal distribution like p(x_{t=0}, x_static) simply by sampling from p(x_k, ..., x_0, x_static) and discarding (x_k, ..., x_1); therefore, it is possible to use the model both to simulate a cohort of subjects at baseline and to simulate the time-evolution of a cohort of subjects given their baseline measurements. In Figure 7, we show a schematic diagram highlighting the differences between an RBM, the bidirectional CRBM we describe here, and the directed CRBM originally described in [12,41].

If x_t is the vector of covariates for a patient at time t (in months) and x_static is the vector of static variables for the same patient, then the visible units used to train the CRBM are v = {x_{t+6}, x_{t+3}, x_t, x_static}, a concatenation of the data from the adjacent time points t, t + 3 months, and t + 6 months with the static variables represented only once. Transforming the data from a subject into this triplet format involves 3 steps:

1. For each subject we create vectors of data from all possible sets of 3 consecutive visits.
For example, if a subject has data through 12 months we create vectors for 0, 3, and 6 months; 3, 6, and 9 months; and 6, 9, and 12 months. Repeated occurrences of the static (non-longitudinal) covariates are dropped so that they only appear once in the vector.
2. Any vector that has no longitudinal data for the last visit is removed, in order to exclude cases in which the subject missed visits.
3. The data are standardized according to the transformations discussed in Appendix A.
Each subject typically has trajectories much longer than the 6-month time window (trajectories up to 57 months are used). As a result, there are multiple triplets per subject, and the number of samples in the combined training and validation datasets (11129) is much larger than the number of subjects. Before training, samples are randomly shuffled so that minibatches contain a mixture of subjects and times.
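The three preparation steps above can be sketched as a sliding window over each subject's visit list; the dict-based data layout here is illustrative, and standardization (step 3) is omitted.

```python
def make_triplets(visits, static, k=3):
    """Slide a window of k consecutive visits over a subject's record.

    visits: per-visit covariate dicts in temporal order, with None
    marking a missed visit; static: covariates attached once per
    triplet. Windows whose final visit is missing are dropped, as in
    step 2 of the text (a sketch of the preparation, not the actual
    pipeline).
    """
    triplets = []
    for i in range(len(visits) - k + 1):
        window = visits[i : i + k]
        if window[-1] is None:   # no longitudinal data at the last visit
            continue
        triplets.append({"visits": window, "static": static})
    return triplets
```

A subject with visits at 0, 3, 6, 9, and 12 months thus contributes up to three triplets, which is why the sample count far exceeds the subject count.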
The CRBM is trained using stochastic gradient descent to minimize an objective function that simultaneously aims to maximize the composite likelihood of the data and to minimize the ability of a classifier to discriminate between samples drawn from the model and samples taken from the training dataset [10]. Let V denote the set of all triplets of visible variables in the training set. The logarithm of the composite likelihood of the data is

L(θ) = Σ_{v ∈ V} log p_θ(v), (B5)

in which θ denotes the parameters of the RBM. Note that two neighboring triplets from the same subject will have repeated entries and, therefore, are not independently distributed. However, we make an approximation in which we treat these neighboring triplets as independently distributed. Therefore, it is more appropriate to refer to L(θ) as the logarithm of a composite likelihood [43] than a true likelihood.
In [10], we showed that likelihood-based approaches to training RBMs have a tendency to oversmooth the estimated distribution. To prevent this oversmoothing, we train a discriminator q(data|h) to classify a vector of hidden unit activities as corresponding to a visible vector drawn from the data or drawn from the model. In this case, the classifier is a random forest trained on the hidden unit activities of the previous minibatch. We then aim to minimize an adversarial objective A(θ), which means that we train the RBM to generate samples that the discriminator classifies as coming from the data distribution. This concept is similar to the approaches used to train Generative Adversarial Networks in machine learning [30].
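A minimal sketch of the discriminator step, assuming hidden-unit activities have already been computed for a minibatch of data samples and a minibatch of model samples (all shapes and distributions below are placeholders, not the paper's data):

```python
# Minimal sketch of the adversarial discriminator: a random forest
# classifies hidden-unit activities as "data" vs "model". All shapes
# and the synthetic inputs are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, n_hidden = 200, 16

# Stand-ins for hidden activities h ~ p(h|v) computed on a minibatch.
h_data = rng.normal(0.5, 1.0, size=(n, n_hidden))   # from training data
h_model = rng.normal(0.0, 1.0, size=(n, n_hidden))  # from model samples

X = np.vstack([h_data, h_model])
y = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = data, 0 = model

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# q(data|h) for fresh model samples; pushing this toward 1 is the
# adversarial goal when updating the generative model.
q = clf.predict_proba(rng.normal(0.0, 1.0, size=(n, n_hidden)))[:, 1]
print(q.shape)
```

In training, the forest would be refit on each previous minibatch, so the discriminator tracks the model as it improves.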
The final objective function C(θ) is a linear combination of the log-composite likelihood L(θ) and adversarial A(θ) objectives, in which a parameter γ, selected as part of the hyperparameter grid search, weighs the relative importance of the two objectives. We minimize this objective function with stochastic gradient descent using the ADAM optimizer [44].

Temperature-driven sampling
Computing the gradients of the objective function (Equation B7) requires computing various moments of the model distribution [10]. Unfortunately, the moments of a distribution defined by an RBM are typically intractable and can only be approximated using Monte Carlo methods. Specifically, it is possible to sample from an RBM using block Gibbs sampling by iteratively drawing h ∼ p(h|v) and v ∼ p(v|h). This creates a Markov chain that converges to the stationary distribution, but the rate of convergence may be very slow.
To address this, we employ a simple approach to improve the mixing of the CRBM when generating digital subjects or digital twins.
The underlying RBM that the CRBM is built upon can incorporate the concept of temperature, which enters the joint probability distribution as p_β(v, h) = Z_β^{-1} e^{-βE(v,h)}, where β is the inverse temperature and v and h are the visible and hidden units of the model.
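A toy illustration of temperature-driven block Gibbs sampling for a Bernoulli RBM with energy E(v, h) = -v·W·h - a·v - b·h. The weights are random placeholders, not a trained model:

```python
# Toy temperature-driven block Gibbs sampling in a Bernoulli RBM,
# p_beta(v, h) ∝ exp(-beta * E(v, h)). Parameters are random
# placeholders, not a trained model.
import numpy as np

rng = np.random.default_rng(1)
n_vis, n_hid = 8, 4
W = rng.normal(0, 0.1, size=(n_vis, n_hid))
a = np.zeros(n_vis)  # visible bias
b = np.zeros(n_hid)  # hidden bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_chain(v, beta, steps):
    """Alternate h ~ p_beta(h|v) and v ~ p_beta(v|h)."""
    for _ in range(steps):
        p_h = sigmoid(beta * (v @ W + b))
        h = (rng.random(n_hid) < p_h).astype(float)
        p_v = sigmoid(beta * (W @ h + a))
        v = (rng.random(n_vis) < p_v).astype(float)
    return v

v0 = (rng.random(n_vis) < 0.5).astype(float)
# Lower beta (higher temperature) flattens the distribution, helping
# the chain mix; beta = 1 recovers the model distribution.
sample_hot = gibbs_chain(v0.copy(), beta=0.5, steps=100)
sample_cold = gibbs_chain(v0.copy(), beta=1.0, steps=100)
print(sample_hot.shape, sample_cold.shape)
```

Running part of the chain at β < 1 and then returning to β = 1 is one common way to trade a little sampling fidelity for much faster mixing.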
Let M_ij denote the rank of model i on metric j; that is, M_ij = 1 if model i is the best performing model on metric j, and M_ij = N if model i is the worst performing model on metric j. We used a two-step minimax procedure to select the best performing model on these metrics. In step one, we sorted the models based on their worst performance on any metric, MaxRank_i = max_j(M_ij), and discarded all models outside of the lowest quartile. In step two, we focused on a subset J of the metrics and sorted the models based on their worst performance in this subset, MaxSubsetRank_i = max_{j∈J}(M_ij). The model with the lowest MaxSubsetRank_i was selected as the best performing model overall. This two-step procedure ensured that the selected model (1) did not perform poorly on any metric and (2) performed especially well on a subset of metrics. Equal and time-lagged correlations between variables computed from a model that performs well should be the same as those computed from actual data. Therefore, we defined metrics to evaluate the performance of a model based on the agreement between the lag-ℓ correlations computed from the model and those computed from the data for ℓ = 0, 1, 2, 3.
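The two-step minimax selection can be sketched as follows; the scores and metric subset J are illustrative placeholders:

```python
# Sketch of the two-step minimax ranking used for model selection.
# M[i, j] holds the rank of model i on metric j (1 = best).
import numpy as np

rng = np.random.default_rng(2)
n_models, n_metrics = 16, 6
scores = rng.random((n_models, n_metrics))
# Rank models per metric; rank 1 = lowest score here (illustrative).
M = scores.argsort(axis=0).argsort(axis=0) + 1

# Step 1: keep the quartile of models with the lowest worst-case rank.
max_rank = M.max(axis=1)
cutoff = np.quantile(max_rank, 0.25)
survivors = np.flatnonzero(max_rank <= cutoff)

# Step 2: among survivors, minimize the worst rank on a metric subset J.
J = [0, 1, 2]
best = survivors[M[survivors][:, J].max(axis=1).argmin()]
print(best in range(n_models))  # True
```

The first step screens out models with any glaring weakness; the second breaks ties on the metrics deemed most important.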
Computing these correlations is complicated by the fact that some observations are missing. Therefore, let x^s_{a,t,i} denote the value of variable a at time t in subject i from source s ∈ {data, twin}; by construction, x^twin_{a,t=0,i} = x^data_{a,t=0,i}. In addition, let I_{a,t,i} = 1 if x^data_{a,t,i} was observed, and I_{a,t,i} = 0 if it was missing. Then, we computed the lag-ℓ autocovariance between variables a and b from source s as

C^s_{a,b,ℓ} = [ Σ_{i,t} (x^s_{a,t,i} - µ^s_a)(x^s_{b,t+ℓ,i} - µ^s_b) I_{a,t,i} I_{b,t+ℓ,i} ] / [ Σ_{i,t} I_{a,t,i} I_{b,t+ℓ,i} ], (C1)

in which µ^s_a = Σ_{i,t} x^s_{a,t,i} I_{a,t,i} I_{b,t+ℓ,i} / Σ_{i,t} I_{a,t,i} I_{b,t+ℓ,i} and µ^s_b is defined analogously from x^s_{b,t+ℓ,i}.
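A masked implementation of the lag-autocovariance in Equation C1 might look like the following; the array shapes and random data are illustrative:

```python
# Sketch of the missing-data-aware lag-autocovariance in Equation C1.
# x has shape (subjects, times, variables); mask I marks observed values.
import numpy as np

def lag_autocov(x, I, a, b, lag):
    T = x.shape[1]
    xa, xb = x[:, :T - lag, a], x[:, lag:, b]
    w = I[:, :T - lag, a] * I[:, lag:, b]  # joint observation indicator
    mu_a = (xa * w).sum() / w.sum()
    mu_b = (xb * w).sum() / w.sum()
    return ((xa - mu_a) * (xb - mu_b) * w).sum() / w.sum()

rng = np.random.default_rng(3)
x = rng.normal(size=(5, 4, 2))
I = (rng.random((5, 4, 2)) < 0.8).astype(float)  # ~20% missing
c = lag_autocov(x, I, a=0, b=1, lag=1)
print(np.isfinite(c))
```

Only subject-time pairs where both variables are observed contribute, mirroring the indicator products in Equation C1.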
As a metric, we defined the squared correlation coefficient between the lag-autocovariances computed from the data and those computed from the twins. The standard deviations were computed as

σ^s_{a,t} = sqrt( Σ_i (x^s_{a,t,i} - µ^s_{a,t})^2 I_{a,t,i} / Σ_i I_{a,t,i} ), (D2)

in which µ^s_{a,t} denotes the corresponding indicator-weighted mean, and the lag-autocovariances were computed using Equation C1 for sources s ∈ {data, twin}.
Each of the sums only includes subjects from the test data set.
We assessed the agreement between the means and standard deviations computed from the data and those computed from their digital twins using a Theil-Sen regression. In principle, the slope of a regression line can be estimated from any single pair of data points, but any such estimate is very noisy. The Theil-Sen estimator therefore computes slopes from many pairs of data points (up to all pairs) and uses the median slope. This procedure is robust to outliers, which is helpful here because the means and standard deviations for different variables can have different scales.
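A minimal Theil-Sen estimate, the median over all pairwise slopes, can be written directly; the toy data below are illustrative:

```python
# A minimal Theil-Sen fit: the median over pairwise slopes is robust
# to outliers and to scale differences across covariates.
import numpy as np
from itertools import combinations

def theil_sen(x, y):
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)
              if x[j] != x[i]]
    slope = np.median(slopes)
    intercept = np.median(y - slope * x)
    return slope, intercept

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
y[4] = 100.0  # a gross outlier barely moves the median slope
slope, intercept = theil_sen(x, y)
print(round(slope, 2))  # 2.0
```

With one corrupted point out of five, ordinary least squares would be pulled far from the true slope, while the median of pairwise slopes stays at 2.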
The lag-autocovariances can be set to the same scale by converting them to autocorrelations, i.e., by normalizing relative to the appropriate variances, ρ^s_{a,b,ℓ} = C^s_{a,b,ℓ} / sqrt(C^s_{a,a,0} C^s_{b,b,0}), for source s ∈ {data, twin}. As a result, the problems that we encountered with scale differences in the means and standard deviations do not apply to analyzing the agreement between the autocorrelations computed from the data and those computed from their digital twins. However, each value of ρ^s_{a,b,ℓ} was only computed using pairs of data for which both variables a and b were observed, so some autocorrelations were estimated with more precision than others. Therefore, we estimated the relationship by minimizing a weighted least-squares objective with weights f_{a,b,ℓ} = Σ_t Σ_i I_{a,t,i} I_{b,t+ℓ,i}. We report the estimated coefficients and the coefficient of determination in the main text.
Although the CRBM provides a model for the probability distribution of x_{a,t}, this distribution is generally intractable. However, we can estimate properties of this distribution using samples drawn with MCMC algorithms. Let x^twin_{a,t,i,k} for k = 1, …, K denote a set of digital twins matched to subject i and sampled from the CRBM such that x^twin_{a,t=0,i,k} = x^data_{a,t=0,i} for all k.
Figure 6 shows the area under the receiver operating characteristic curve (AUC) for logistic regression models trained to distinguish actual subjects from their digital twins. An AUC of 0.5 indicates that the logistic regression can do no better than random chance at determining whether data come from an actual subject or a digital twin, while an AUC of 1.0 indicates that it can perfectly identify the source of a clinical record.
Different subjects in the dataset were observed for different durations, so we trained separate logistic regression models on each visit or on the difference between consecutive visits. In each case, we computed the mean AUC over 100 simulations with different digital twins for each simulation; the error bars on the points show the standard deviation over simulations. The AUC for each simulation was estimated using 5-fold cross validation. Missing data from actual subjects are mean-imputed; the corresponding values for the digital twins are assigned the same mean values to avoid biasing the logistic regression models.
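The indistinguishability test can be sketched as follows; the synthetic features stand in for the paper's covariates, so the exact feature construction is an assumption:

```python
# Sketch of the indistinguishability test: 5-fold cross-validated AUC
# for a logistic regression separating actual subjects from twins.
# The synthetic features are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n, d = 300, 10
actual = rng.normal(0.0, 1.0, size=(n, d))
twins = rng.normal(0.0, 1.0, size=(n, d))  # same distribution

X = np.vstack([actual, twins])
y = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = actual, 0 = twin

auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=5, scoring="roc_auc").mean()
print(round(auc, 2))  # close to 0.5 for identically distributed data
```

Because the two groups are drawn from the same distribution here, the classifier cannot systematically separate them, which is the behavior the paper reports for actual subjects and their twins.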

FIG. 2. Overview. A graphical summary of the process used to build the CRBM studied in this paper (A) and of how digital twins are analyzed (B). To build the model, MSOAC data in SDTM format are converted to a numerical form suitable for machine learning. A set of models is trained, sweeping over a grid of hyperparameters (variations in the way the model is trained). The final model is selected by computing metrics on validation data and choosing the best-ranked model.

FIG. 4. The model captures the leading statistical moments of the data. Moments for the actual subject data are plotted against moments for digital twins for means, standard deviations, equal-time correlations, and lagged autocorrelations. Each digital twin contributes data for the same study duration as its actual subject counterpart, up to 36 months. The means and standard deviations are compared at each visit, with a logarithmic scale used to accommodate the differing scales of the covariates. For the equal-time and lagged autocorrelations, the correlation coefficient is computed for each pair of covariates across all visits. In all cases, the slope and intercept fit coefficients shown come from regressions weighted by the fraction of data present for each covariate.

FIG. 5. The model accurately simulates individual subject trajectories. The mean and standard deviation of a statistic measuring subject-level agreement between the data and the CRBM are shown for each longitudinal covariate. Each visit beyond baseline is displayed using a point for the mean and an error bar for the standard deviation. The visits are ordered for each covariate, with data shown up to 36 months. For each subject, a p-value for each covariate at each visit is computed by comparing the data value to the distribution of values predicted by the CRBM under repeated simulations of digital twins for that subject. The inverse normal CDF is applied to this p-value to define a statistic, ϕ ≡ Φ^{-1}(p), and the mean and standard deviation of the distribution of ϕ over subjects are plotted as a point and error bar. As the diagram on the right explains, if the CRBM is in statistical agreement with the data (unbiased), then the mean should be 0 and the standard deviation should be 1. Different types of bias are shown, and statistically significant cases (p < 0.05, with the confidence level adjusted to 0.05/144 after applying a Bonferroni correction for the 144 comparisons shown in the figure) are labeled in red.
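The calibration logic behind ϕ ≡ Φ^{-1}(p) can be illustrated numerically. Under perfect calibration the per-subject p-values are uniform, so the transformed statistic is standard normal (mean 0, standard deviation 1); the uniform draws below stand in for actual p-values:

```python
# Illustration of the phi statistic: p-values mapped through the
# inverse normal CDF. For an unbiased model, phi has mean ~0 and
# standard deviation ~1. The uniform p-values are a stand-in.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
p_values = rng.uniform(0.0, 1.0, size=5000)  # perfectly calibrated case
phi = norm.ppf(p_values)  # phi = Phi^{-1}(p)

print(round(phi.mean(), 2), round(phi.std(), 2))
```

Systematic shifts of the mean away from 0 or of the standard deviation away from 1 would indicate the types of bias diagrammed in the figure.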
Using a variety of statistical tests, we demonstrated that digital subjects generated by the CRBM are statistically indistinguishable from actual subjects enrolled in the placebo arms of clinical trials for MS. Modeling MS disease progression is challenging due to the complexity of endpoint measures and the varied course of the disease. Relapses are difficult to predict, and the EDSS score has multiple domains and depends on many components. It is likely that significant improvements could be made to the model by incorporating additional data on imaging-related endpoints. Incorporating these additional covariates would enable generation of digital subjects with most of the covariates that are typically measured in MS clinical trials. In addition, incorporating data from subjects treated with interferons or other commonly used DMTs would make it possible to generate cohorts of digital subjects representing broader control populations. The ability to generate digital twins that are statistically indistinguishable from actual subjects enrolled in control arms of clinical trials is a powerful tool for understanding disease progression and modeling clinical trials. Each digital twin represents a potential outcome of its matched patient; that is, what would likely happen to this patient if he or she were to receive a placebo in the context of a clinical trial? Models of potential outcomes are important tools for estimating treatment effects [38-40], and the ability to generate digital twins representing per-subject controls opens the door to a variety of novel clinical trial analyses aimed at assessing responses to treatments for individual patients.

EDSS scores below 5.0 represent patients who are able to walk without aid. In this regime, the EDSS score is determined from the components of the Kurtzke Functional Systems Score (KFSS), a 7-component test assessing function in a variety of bodily systems [13,34,35]. EDSS scores of 5.0 or greater, by contrast, are principally defined by the subject's ability to walk. Due to the nature of the scoring, some values are much more likely than others, giving a highly multi-modal marginal distribution of EDSS scores. We chose to model the individual KFSS components and the ambulatory impairment of subjects separately rather than model the total score directly, allowing the model to learn the simpler constitutive covariates and the relationships between them. Unfortunately, the MSOAC database does not record the ambulatory impairment component of EDSS directly. Instead, we infer this ambulation score from the EDSS score and encode it as an ordinal variable. The ambulation score was determined using ambulation score = 2 (EDSS - 4.5) if EDSS ≥ 5.0, else 0.
As discussed in the main text, to train the CRBM we divided the 2395 subjects in the dataset into 3 parts: training (50% of subjects), validation (20% of subjects), and test (30% of subjects). Training was done in two stages. First, we trained a large number (1296) of models, each of which has a single hidden layer of ReLU units [42], over a grid of hyperparameters and measured the performance of the models according to a number of metrics. The hyperparameters of the best-performing model were then used to train a model on the combination of the training and validation data. We use a bidirectional CRBM with lag k = 2; as a result, the CRBM is trained on the data from adjacent triplets of time points. If x_t is the vector of time-dependent variables at time t, the model defines the joint distribution

p(x_{t+k}, …, x_{t+1}, x_t, x_static) ∝ ∫ dh e^{U(x_{t+k}, …, x_t, x_static, h)},

and predictions are drawn from the conditional distribution

p(x_{t+k} | x_{t+k-1}, …, x_t, x_static) ∝ ∫ dh e^{U(x_{t+k}, h | x_{t+k-1}, …, x_t, x_static)}.
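The ambulation-score encoding can be written as a small function. Note that the exact mapping here is a reconstruction from garbled source text and should be checked against the original paper:

```python
# Sketch of the ambulation-score encoding described above. The exact
# mapping 2*(EDSS - 4.5) is a reconstruction, not a verified formula.
def ambulation_score(edss):
    """Ordinal ambulatory impairment inferred from the total EDSS."""
    return int(round(2 * (edss - 4.5))) if edss >= 5.0 else 0

print([ambulation_score(e) for e in (2.0, 5.0, 5.5, 6.5)])  # [0, 1, 2, 4]
```

Under this mapping, each 0.5-point EDSS step above 5.0 increments the ordinal ambulation variable by one.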


FIG. 7. Architectures of RBMs and CRBMs. A schematic of the CRBM used in this work is shown in the middle, compared to an RBM (an equivalent architecture) and the original CRBMs described in the text. Both the RBM and the CRBM used here have completely undirected connections, while the original CRBM has directed connections except at the last time point. Directed connections are shown as gray arrows.
The hyperparameter grid used to train the CRBM. The grid was defined as the Cartesian product of the values listed across hyperparameters. The value of each hyperparameter for the selected model is shown in bold. The grid of batch sizes was chosen so that there were 5, 10, or 20 batches per epoch in the combined training and validation datasets. The grid of hidden-unit counts was chosen as 1/2 or 1 times the number of visible units, rounded up to the nearest integer. The grid of learning rates was chosen as 1/4, 1/2, or 1 times the inverse of the number of visible units.

R^2 = 1 - [ Σ_{a≤b} (C^data_{a,b,ℓ} - C^twin_{a,b,ℓ})^2 ] / [ Σ_{a≤b} (C^twin_{a,b,ℓ} - (1/(p(p-1))) Σ_{a≤b} C^twin_{a,b,ℓ})^2 ], (C2)

in which p is the number of variables. These correlation coefficients are computed for ℓ = 0, 1, 2, 3 (ℓ = 0 gives the equal-time correlations and ℓ > 0 the lagged autocorrelations). The metrics computed from the equal-time and lagged autocovariances capture a general notion of goodness-of-fit for the CRBM, but we would also like to ensure that the model performs well on outcome measures that are frequently used as endpoints in clinical trials: the probability of a relapse at each visit, the change in the Expanded Disability Status Scale (EDSS) over 18 months, and 6-month, 1-year, and 2-year Confirmed Disability Worsening (CDW). For relapse, which is one of the time-dependent variables in the model, we defined metrics in terms of the area under the receiver operating characteristic curve (AUC) measuring the model's ability to predict relapses at each visit from 3 to 18 months. The total EDSS score is not one of the variables directly included in the model, but it can be computed because all of its components are included. Let s^data_i denote the change in the EDSS score for subject i over 18 months, and let s^twin_ik denote the change in the EDSS score for the k-th digital twin of subject i. We defined an EDSS score metric, t_EDSS, from these changes, in which the sums only include subjects for which the 18-month change in EDSS was observed and n is the number of those subjects. We used K = 10 digital twins for each subject.
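The R^2 agreement metric between covariance estimates can be sketched as follows. For simplicity this version centers on the sample mean over pairs a ≤ b, a slight simplification of the normalization in Equation C2:

```python
# Sketch of the R^2 agreement metric between covariance estimates
# from the data and from the twins, over pairs a <= b. Centering on
# the plain sample mean is a simplification of Equation C2.
import numpy as np

def r_squared(C_data, C_twin):
    a, b = np.triu_indices(C_data.shape[0])  # pairs with a <= b
    d, t = C_data[a, b], C_twin[a, b]
    resid = ((d - t) ** 2).sum()
    total = ((t - t.mean()) ** 2).sum()
    return 1.0 - resid / total

C = np.array([[1.0, 0.3], [0.3, 1.0]])
print(r_squared(C, C))  # 1.0 when the covariances agree exactly
```

The metric approaches 1 when the data and twin covariances agree and falls toward 0 (or below) as they diverge.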

The hyperparameters of the best performing model are shown in bold in Table III. These hyperparameters are used to train a model on the combination of the training and validation datasets.