## Abstract

Prediction settings with multiple studies have become increasingly common. Ensembling models trained on individual studies has been shown to improve replicability in new studies. Motivated by a groundbreaking new technology in human neuroscience, we introduce two generalizations of multi-study ensemble predictions. First, while existing methods weight ensemble elements by cross-study prediction performance, we extend weighting schemes to also incorporate covariate similarity between training data and target validation studies. Second, we introduce a hierarchical resampling scheme to generate pseudo-study replicates (“study straps”) and ensemble classifiers trained on these rather than the original studies themselves. We demonstrate analytically that existing methods are special cases. Through a tuning parameter, our approach forms a continuum between merging all training data and training with existing multi-study ensembles. Leveraging this continuum helps accommodate different levels of between-study heterogeneity.

Our methods are motivated by the application of Voltammetry in humans. This technique records electrical brain measurements and converts signals into neurotransmitter concentration estimates using a prediction model. Using this model in practice presents a cross-study challenge, for which we show marked improvements after application of our methods. We verify our methods in simulations and provide the `studyStrap` R package.

## 1. Introduction

### 1.1. Motivating Application

The elucidation of the neural correlates of neurological and psychiatric diseases is critical for the development of treatments. Recently, a groundbreaking application of Fast-Scan Cyclic Voltammetry (FSCV), a technology historically used in animal models ([Rodeberg et al., 2017]), has been implemented in awake humans to enable the monitoring of changes in neurochemical levels in the human brain with the temporal resolution (usually 10 measurements per second) and spatial resolution (probes are on the micrometer scale) necessary to investigate how these signals evolve in real-time and contribute to human behavior ([Kishida et al., 2011]; also see commentary by [Platt and Pearson, 2016]; [Kishida et al., 2016]; [Moran et al., 2018]). This technique offers unprecedented opportunity to monitor human brain activity that is currently impossible to measure with non-invasive brain imaging techniques (e.g., functional magnetic resonance imaging (fMRI), positron emission tomography (PET), electroencephalography (EEG)). FSCV is an invasive technique that involves temporarily inserting a tiny recording electrode deep into the brain of participants undergoing neurosurgery for psychiatric or neurological conditions. This allows for the measurement of neurotransmitters (chemicals that brain cells use to communicate) such as dopamine, an important molecule that is thought to underlie learning, reward, and many psychiatric and neurological diseases such as drug addiction and Parkinson's disease ([Volkow et al., 2018]). Measurements are taken at 10 Hz (i.e., 10 measurements per second) while participants perform behavioral tasks. Their neurochemical activity is then estimated and correlated with their performance on the task.

This technological achievement involves the translation of hardware commonly used in pre-clinical settings to the human neurosurgery operating room, but also critical is the novel application of statistical learning tools. This is because the technique does not directly measure neurotransmitter concentration (the measure of scientific interest) but rather records high-dimensional electrical signals that arise from measuring current changes in response to manipulation of the voltage potential across a recording electrode. These signals then must be “converted” into neurochemical concentration estimates via a predictive model. To train this model, one can use an in vitro training set where the true concentrations of the neurotransmitters (i.e., the outcome) are known, and the covariates (i.e., the electrical measurements) are measured ([Kishida et al., 2016]; [Moran, et al., 2018]). In practice, it is impossible to generate calibration data sets that exactly replicate the same settings as those present in brain tissue. The use of FSCV in humans requires the use of prediction models that generalize from controlled laboratory settings to data collected in the operating room where less experimental control is afforded. Producing accurate predictions in the brain therefore requires that models trained on in vitro training sets generalize 1) to different electrodes and 2) to data collected in the brain.

The original work to implement FSCV in humans ([Kishida et al., 2016]; [Moran et al., 2018]) reports that training models on datasets that combine training sets from multiple electrodes substantially improves the cross-electrode generalizability and accuracy of neurotransmitter estimates. This forces one to train an algorithm on data from multiple sources (multiple electrodes in vitro) and make predictions on data collected in a different setting (a different electrode in vivo). As validation in vivo is not possible, we sought to optimize the accuracy of neurochemical estimates by improving the cross-electrode generalizability of the models. We used 15 in vitro datasets, each generated on a different electrode. We then developed algorithms to train models on all datasets save a held-out validation set. This presents a cross-study problem, where each "study" is a training set generated on a different electrode. In an attempt to improve the quality of estimates of neurotransmitter concentrations, we drew upon the multi-study learning literature.

### 1.2. Dataset Structure

The data studied here are from 15 studies (electrodes), each comprising roughly 20,000 observations. Covariates for each observation take the form of a vector called a CV ("cyclic voltammogram"), which can also be viewed as a single functional covariate. The covariates are electrical measurements (current, measured in nA) collected at 1000 discrete voltage potentials. In each observation, the outcome is a measurement of chemical (neurotransmitter) concentration (nM). This dataset exhibited considerable between-study heterogeneity both in the distribution of the covariates (Figure 15) and in the conditional distribution of the outcome given the covariates (Figure 16). Neuroscientists have analyzed voltammetry data based not only on the raw covariates themselves ([Rodeberg et al., 2017]) but also on a numerical estimate of the derivative of the current with respect to voltage potential index, in an effort to provide between-study standardization ([Kishida et al., 2016]; [Moran et al., 2018]). For this reason we present results from both here. The data are described in greater depth in the supplementary materials (Section C.1).

### 1.3. Formal Problem Statement

We propose methods for training models using multiple studies (electrodes) to improve their predictive performance. We explore the setting in which we have *K* training studies with common outcomes and covariates, and we aim to make predictions on data collected in a separate study (study *K* + 1), a setting called "multi-study learning." We begin with a set of training studies, {𝕊_{1}, …, 𝕊_{K}}, 𝕊_{k} = [**y**_{k}|𝕏_{k}], where we denote the outcome of the *k*^{th} study as **y**_{k} and its matrix of covariates as 𝕏_{k}. The studies have sample sizes *n*_{1}, *n*_{2}, …, *n*_{K}. We assume the data from each study are generated independently *across* studies, where the covariates from study *k* are drawn from distribution *f* (𝕏_{k}), and the outcome from *f* (**y**_{k} | 𝕏_{k}). We aim to make predictions based upon the design matrix of the target study, 𝕏_{K+1}.

We do not assume the marginal distributions of 𝕏_{k} and **y**_{k}, or the conditional distribution of **y**_{k}|𝕏_{k}, are common across studies. We specifically address between-study heterogeneity in both the distribution of the covariates, *f* (𝕏_{k}), and the conditional distribution of the outcome given the covariates, *f* (**y**_{k} | 𝕏_{k}) (i.e., the true model coefficients, *β*_{k}).

### 1.4. Earlier Methods

The present work proposes weighting and ensembling schemes that build directly upon recent work which we review briefly. Connections with other related concepts are found in the Discussion.

#### 1.4.1. Merging

As a baseline, we will consider the standard strategy of training a single model **Ŷ**_{Merged}(·) on data from all studies combined into a single "Merged dataset," 𝕊_{Merged} = [𝕊_{1}, 𝕊_{2}, …, 𝕊_{K}]^{T}.

#### 1.4.2. Trained-on-Observed-Studies (TOS) Ensemble

Recent work has shown that in settings where there is high between-study heterogeneity, ensembling classifiers trained on each of the individual studies can improve the predictive performance in multi-study settings above that of Merging ([Guan et al., 2019]; [Patil and Parmigiani, 2018]). Denote a classifier trained on data from the *k*^{th} study by **Ŷ**_{k}(·) and the predictions made with this classifier on the covariates of study *K* + 1 by **Ŷ**_{k}(𝕏_{K+1}). If one assumes a linear model and uses the OLS estimator, for example, then **Ŷ**_{k}(𝕏_{K+1}) = 𝕏_{K+1}**β̂**_{k}, where **β̂**_{k} = (𝕏_{k}^{T}𝕏_{k})^{−1}𝕏_{k}^{T}**y**_{k}.

The Trained-on-Observed-Studies (TOS) classifier **Ŷ**_{TOS}(·) trains a single model on each study and then ensembles the resulting classifiers via

**Ŷ**_{TOS}(𝕏_{K+1}) = ∑_{k=1}^{K} *w*_{k}**Ŷ**_{k}(𝕏_{K+1}),

where *w*_{k} are weights.
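As a concrete illustration, the TOS construction can be sketched in a few lines of NumPy; the simulated studies, the per-study OLS learner, and the simple average weights below are illustrative assumptions, not the paper's data or protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_fit(X, y):
    # beta_hat_k = (X_k^T X_k)^{-1} X_k^T y_k
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy stand-ins for K training studies and the target design matrix X_{K+1}.
K, n, p = 3, 50, 4
studies = []
for _ in range(K):
    X = rng.normal(size=(n, p))
    beta_k = rng.normal(size=p)          # study-specific true coefficients
    studies.append((X, X @ beta_k + rng.normal(size=n)))
X_target = rng.normal(size=(10, p))

# One learner per study; ensemble the predictions with weights w_k.
betas = [ols_fit(X, y) for X, y in studies]
w = np.full(K, 1.0 / K)                  # simple average weights
preds = np.column_stack([X_target @ b for b in betas])
y_tos = preds @ w                        # Y_hat_TOS(X_{K+1})
```

With equal weights *w*_{k} = 1*/K*, the ensemble prediction reduces to the row-wise average of the per-study predictions.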

#### 1.4.3. Stacking Weights

This ensembling paradigm can be implemented with a variety of weighting schemes ([Ramchandran et al., 2019]; [Patil and Parmigiani, 2018]). Stacking weights ([Breiman, 1996]), for example, are generated from the coefficient estimates of a regression of the predictions from each constituent model against the true outcome labels of the training studies. This upweights predictions made from models that exhibit strong cross-study predictive performance. It is common to regress **y**_{S} on 𝕏_{S} using, for example, non-negative least squares (NNLS) or ridge regression, where

𝕏_{S} = [**Ŷ**_{1}(𝕏) | **Ŷ**_{2}(𝕏) | ⋯ | **Ŷ**_{K}(𝕏)], 𝕏 = [𝕏_{1}^{T}, 𝕏_{2}^{T}, …, 𝕏_{K}^{T}]^{T}, **y**_{S} = [**y**_{1}^{T}, **y**_{2}^{T}, …, **y**_{K}^{T}]^{T},

and **Ŷ**_{k}(𝕏_{j}) are the predictions on the design matrix of study *j* using the model trained on study *k*; **y**_{k} are the labels from study *k*. Assuming an intercept is included and the model is fit with NNLS, the weights **ŵ** are generated from the following constrained optimization problem

(*ŵ*_{0}, **ŵ**) = argmin_{*w*_{0} ∈ ℝ, **w** ⪰ 0} ‖**y**_{S} − *w*_{0}**1** − 𝕏_{S}**w**‖_{2}^{2},

where **w** = (*w*_{1}, …, *w*_{K})^{T} and **w** ⪰ 0 denotes element-wise non-negativity. We use stacking in several cases, and include implementation details in the supplementary materials (Section A.1).
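The stacking construction can be sketched as follows; this is a simplified illustration (toy OLS learners, intercept omitted for brevity) using `scipy.optimize.nnls`, not the implementation detailed in the supplement.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)

# Toy setup: K studies, each with an OLS learner trained on it.
K, n, p = 3, 60, 4
Xs = [rng.normal(size=(n, p)) for _ in range(K)]
ys = [X @ rng.normal(size=p) + rng.normal(size=n) for X in Xs]
betas = [np.linalg.solve(X.T @ X, X.T @ y) for X, y in zip(Xs, ys)]

# Stacking design matrix: column k stacks Y_hat_k(X_j) over all training
# studies j; the response stacks the corresponding labels y_j.
X_stack = np.vstack([np.column_stack([X @ b for b in betas]) for X in Xs])
y_stack = np.concatenate(ys)

# NNLS yields the non-negative stacking weights w_hat.
w_hat, _ = nnls(X_stack, y_stack)
```

Models whose predictions track the true labels well across studies receive larger weights; the non-negativity constraint zeroes out models that do not help.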

#### 1.4.4. Hierarchical Resampling

We also leverage the bootstrap and bootstrap aggregation ("bagging") literature. [Davison et al., 1997] (pp. 100-102) described a *randomized cluster bootstrap*, or "Strategy 1," where both clusters and observations within a cluster are sampled *with* replacement, and the *cluster bootstrap*, "Strategy 2," where only the first stage is sampled *with* replacement. Also relevant is bagging, or "**B**ootstrap **agg**regation," first proposed by Leo Breiman ([Breiman, 1996]), which implements bootstrap resampling in prediction settings. Training one or more learners on each of the bootstrap samples and then ensembling the resulting models can sometimes enhance performance, often through a variance reduction.

### 1.5. Heuristic Summary and Motivation for Proposed Methods

Our methods are motivated by the empirical observation that ensembling (TOS) and Merging each outperform each other at times, but that it may be difficult to predict when one will be superior. We thus sought to combine the methods by ensembling classifiers that were trained on combinations of the different datasets. This is achieved through a hierarchical resampling scheme that first samples a bag of studies (a list of study indices from which to subsample), then assigns slots to each study by drawing random proportions, and finally randomly selects observations from the selected studies to fill the slots. This results in synthetic studies (“Study Straps”) that are combinations of a subset of the *K* observed training studies, with varying degrees of similarity to the originals. We then use these Study Straps as the units for training constituent models in an ensemble.

This hierarchical resampling scheme is a form of the generalized bootstrap. The application of the resampling approach implemented here in an ensembling setting is in essence a hierarchical bagging or cluster-based bagging approach to ensembling ([Li, et al., 2019]).

We analytically explore fundamental properties of the Study Strap framework and demonstrate that the TOS Ensemble and Merging approaches are special cases of the Study Strap. By identifying a unifying framework for these opposing strategies, we show that a continuum exists between them. We argue that leveraging this continuum to accommodate the level of within- and between-study heterogeneity can produce gains in predictive performance.

In our application, as well as others, the design matrix of the target study is available prior to observation of the labels, and at the time training takes place. This information can be leveraged to give greater weight to studies or study straps that have covariate distributions, or profiles, that are more similar to the target study. To implement this, we propose the Covariate Profile Similarity (CPS) Weighting scheme. Heuristically, this assumes that predictors will perform better on a given dataset if it has been trained on data that have a comparable distribution of covariates. We propose an ensembling method that tries to favor the generation of pseudo-studies that are similar, by some metric, to a target study.

### 1.6. Objectives and Outline

This paper seeks to provide an in-depth introduction to Covariate Profile Similarity Weighting, the Study Strap, and the Covariate-Matched Study Strap. We describe the algorithms (Section 2), explore their properties analytically (Section 3), and provide empirical results via applications to simulated data (Section 4) and the electrode measurement dataset (Section 5) ([Kishida et al., 2016]).

## 2. Algorithms

### 2.1. Ensembling with Covariate Profile Similarity Weighting

We propose Covariate Profile Similarity (CPS) Weighting, a weighting scheme that upweights classifiers trained on datasets with average feature profiles that are similar to that of the target study, 𝕊_{K+1}. For a given study or study strap *k*, we define a similarity metric *𝒮*_{k} = 𝒮(𝕏_{K+1}, 𝕏_{k}) ∈ ℝ^{+} such that larger values correspond to greater similarity. An example metric is simply the inverse of the *ℓ*_{2} distance between the sample means of the covariates in the two studies considered: *𝒮*_{k} = ‖𝕏̄_{K+1} − 𝕏̄_{k}‖_{2}^{−1}, where 𝕏̄_{k} denotes the vector of covariate sample means of study *k*. The similarity metric can then be used to adjust the weights *w*_{k} for the predictions from the *k*^{th} study, obtaining final weights proportional to *w*_{k}*𝒮*_{k}.
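A minimal sketch of CPS weighting under the inverse-ℓ2-distance metric; the simulated studies and the starting equal weights are illustrative assumptions.

```python
import numpy as np

def cps_similarity(X_target, X_k):
    # Inverse l2 distance between covariate sample means (larger = more similar).
    return 1.0 / np.linalg.norm(X_target.mean(axis=0) - X_k.mean(axis=0))

rng = np.random.default_rng(2)
X_target = rng.normal(0.0, 1.0, size=(100, 5))
# Three toy training studies whose covariate means drift away from the target.
X_studies = [rng.normal(mu, 1.0, size=(100, 5)) for mu in (0.1, 1.0, 3.0)]

# Scale existing ensemble weights w_k by similarity, then renormalize.
w = np.full(len(X_studies), 1.0 / len(X_studies))
s = np.array([cps_similarity(X_target, X) for X in X_studies])
w_cps = w * s / np.sum(w * s)
```

The study whose covariate profile lies closest to the target receives the largest final weight.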

### 2.2. Ensembling with the Study Strap

We propose an ensembling framework based on a hierarchical resampling scheme that generates pseudo-study replicates called *study straps*, and then ensembles the models trained on these replicates. To generate the *r*^{th} study strap 𝕊^{(r)}, we sample a vector of study counts **A**^{(r)} = (*A*_{1}(*r*), …, *A*_{K}(*r*)) from a *Multinomial*(*b*, (1*/K*, …, 1*/K*)) distribution. We then determine, by normalizing these counts, the proportions of the original *n*_{k} observations that are to be drawn into the pseudo-study. This is shown in Figure 1. In the second, or observation-sampling, step, we sample *n*_{k}(*r*) = *n*_{k}*A*_{k}(*r*)*/b* observations (with or without replacement) from each training study, rounding to the nearest integer. We refer to *b* as the "bag size"; it controls, stochastically, the number of studies from which we draw observations. A small bag size leads to fewer studies being represented, with more observations from each study. A large bag size allows more studies to be represented in a given study strap, with fewer observations from each study. Although the Study Strap can employ observation-level sampling with or without replacement, we will argue below that sampling without replacement is preferable; the standard Study Strap therefore samples without replacement. The sample size of each pseudo-study replicate, *n*^{(r)} = ∑_{k} *n*_{k}(*r*), is random, and is controlled by the bag size (through **A**^{(r)}) and the sample sizes of the original studies. When *n*_{k} = *n* for all *k*, then *n*^{(r)} = *n*. We denote a classifier trained on the *r*^{th} "study strap" as **Ŷ**^{(r)}(·). To avoid redundancy in the predictions, we require that each study strap be unique. We generate a total of *R* unique pseudo-study replicates. We later show that Merging and TOS are special cases of the Study Strap that arise from limiting values of *b*; as such, *b* can be used to adapt to the degree of between-study heterogeneity. We explore this property in Section 3.
Study Strap ensemble learning proceeds by building a collection of studies in this fashion, training one or more single-study learners on each, and ensembling predictions with a simple average or another weighting scheme (e.g., stacking, CPS weighting, or both). We leave this general in the algorithm description and denote the weighting function simply as 𝒲(·).
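The hierarchical resampling step can be sketched as follows; function and variable names are hypothetical, and observation-level sampling is done without replacement, as in the standard Study Strap.

```python
import numpy as np

def study_strap(studies, b, rng):
    # Step 1: bag of studies A^(r) ~ Multinomial(b, (1/K, ..., 1/K)).
    K = len(studies)
    A = rng.multinomial(b, np.full(K, 1.0 / K))
    X_parts, y_parts = [], []
    for k, (X, y) in enumerate(studies):
        # Step 2: draw round(n_k * A_k(r) / b) observations from study k,
        # without replacement.
        m = int(round(len(y) * A[k] / b))
        if m == 0:
            continue
        idx = rng.choice(len(y), size=m, replace=False)
        X_parts.append(X[idx])
        y_parts.append(y[idx])
    return np.vstack(X_parts), np.concatenate(y_parts)

rng = np.random.default_rng(3)
# Five toy studies with equal sample sizes n_k = 40 and p = 3 covariates.
studies = [(rng.normal(size=(40, 3)), rng.normal(size=40)) for _ in range(5)]
X_ss, y_ss = study_strap(studies, b=2, rng=rng)
```

With equal study sizes *n*_{k} = *n*, the resulting pseudo-study has exactly *n* observations, as noted above.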

### 2.3. Ensembling with the Covariate Matched Study Strap

While generating study straps has the potential to create powerful training sets, it may also require generating large numbers of straps before enough pseudo-studies with high similarity to the target arise. To address this issue, we developed an adaptive "accept/reject" (AR) approach that embeds covariate similarity into the study strap resampling scheme. AR generates study straps as in Algorithm 1, but only trains classifiers on study straps that meet an increasingly selective threshold for covariate similarity. Each accepted study strap updates the similarity threshold that subsequent straps must exceed. The algorithm runs until a certain number of successive study straps are generated without acceptance for model training. Due to the stochasticity of this sampling scheme, we propose iterating through multiple accept/reject paths and ensembling all accepted classifiers from all paths; this reduces between-path variability in performance.
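One accept/reject path can be sketched as follows, assuming the inverse-ℓ2 mean-distance similarity from Section 2.1; the helper name, the `patience` stopping counter, and the choice to return strap data rather than trained models are illustrative simplifications.

```python
import numpy as np

def covariate_matched_straps(studies, X_target, b, patience, rng):
    # One accept/reject path: keep straps that beat the running similarity record.
    target_mean = X_target.mean(axis=0)
    K = len(studies)
    accepted, best, rejects = [], -np.inf, 0
    while rejects < patience:
        # Candidate study strap (observation sampling without replacement).
        A = rng.multinomial(b, np.full(K, 1.0 / K))
        X_parts, y_parts = [], []
        for k, (X, y) in enumerate(studies):
            m = int(round(len(y) * A[k] / b))
            if m:
                idx = rng.choice(len(y), size=m, replace=False)
                X_parts.append(X[idx])
                y_parts.append(y[idx])
        X_ss, y_ss = np.vstack(X_parts), np.concatenate(y_parts)
        # Similarity: inverse l2 distance between covariate sample means.
        sim = 1.0 / np.linalg.norm(target_mean - X_ss.mean(axis=0))
        if sim > best:
            best = sim                     # threshold tightens on acceptance
            accepted.append((X_ss, y_ss))  # a learner would be trained here
            rejects = 0
        else:
            rejects += 1                   # no model trained on rejected straps
    return accepted

rng = np.random.default_rng(7)
studies = [(rng.normal(k, 1.0, size=(40, 3)), rng.normal(size=40)) for k in range(4)]
X_target = rng.normal(1.0, 1.0, size=(50, 3))
straps = covariate_matched_straps(studies, X_target, b=2, patience=10, rng=rng)
```

Because the threshold only ever tightens, accepted straps form an increasing sequence in similarity to the target, and the path terminates once `patience` successive candidates fail to improve on the record.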

### Ensembling with the Covariate-Matched Study Strap (Accept/Reject)

As was the case for the Study Strap, we can weight predictions based upon, for example, stacking, CPS, or simple average weights (i.e., *w*_{r} = 1*/R*).

### 2.4. `studyStrap` Package

We provide the `studyStrap` package, which implements the standard methods Merging, TOS, and stacking. Additionally, it implements our proposed methods: the Study Strap, Covariate Profile Similarity Weighting, and the Covariate-Matched Study Strap. Embedded within the `caret` environment, our package allows users to fit models with a broad range of methods (i.e., all available through `caret`) and to fit multiple single-study learners per study. Our prediction functions automatically apply average weights, stacking weights, custom user-specified weighting functions, or one of many standard built-in covariate profile similarity weighting functions. Our package can be downloaded from the GitHub repository: https://github.com/gloewing/studyStrap

## 3. Analytical Results

We present analytical results to demonstrate that the Study Strap provides a flexible and useful multi-study prediction framework that encompasses not only the methods proposed above but also earlier methods. Specifically, we show that the TOS Ensemble and the Merging Algorithm are special cases of the Study Strap. This is visually depicted in the supplementary materials (Figure 1).

First, observe that under standard (non-hierarchical) bagging, the number of times an observation is represented in a given bootstrap sample follows a *Bin*(*N*, 1*/N*) distribution, where *N* = ∑_{k} *n*_{k} is the total sample size. In the Study Strap, we implement a hierarchical resampling scheme. From the multinomial specification of the vector of counts **A**^{(r)}, the count corresponding to the *k*^{th} study is *marginally* distributed as a *Bin*(*b*, 1*/K*). If one samples observations with replacement, then, conditionally on *A*_{k}(*r*), the number of times the *i*^{th} observation of the *k*^{th} study is included in the study strap is distributed as *Bin*(*n*_{k}(*r*), 1*/n*_{k}), where *n*_{k}(*r*) = *n*_{k}*A*_{k}(*r*)*/b*. If one samples without replacement, then *Z*_{ik}^{(r)} is the indicator that the *i*^{th} observation of the *k*^{th} study is represented in the *r*^{th} study strap, where *Z*_{ik}^{(r)} | *A*_{k}(*r*) ∼ *Ber*(*A*_{k}(*r*)*/b*).

We point out that under standard (non-hierarchical) bagging without replacement (or bagging with jackknife resampling) of the Merged dataset, the probability that an observation is represented in a given bootstrap sample is *s/N*, where *N* is the total sample size of the dataset and *s* is the size of the subsample.
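As a quick empirical check of the marginal inclusion probability under without-replacement observation sampling, the following Monte Carlo sketch (illustrative parameters) estimates the probability that a fixed observation enters a study strap; it should be close to 1*/K*.

```python
import numpy as np

# Monte Carlo check: under without-replacement observation sampling, a fixed
# observation should enter a study strap with marginal probability 1/K.
rng = np.random.default_rng(4)
K, n, b, reps = 5, 40, 2, 20000
hits = 0
for _ in range(reps):
    A = rng.multinomial(b, np.full(K, 1.0 / K))   # bag of studies
    m = int(round(n * A[0] / b))                  # draws from study 1
    if m and 0 in rng.choice(n, size=m, replace=False):
        hits += 1                                 # observation 1 was selected
p_hat = hits / reps                               # should be near 1/K = 0.2
```

The estimate agrees with the marginal *Ber*(1*/K*) claim used in the propositions below.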

Proposition 1. *The Trained-on-Observed-Studies Ensemble is a Study Strap with bag size b* = 1, *R* = *K, and sampling without replacement.*

Proof. First, observe that for the *r*^{th} study strap, all of the observations in 𝕊^{(r)} are sampled from a single study, say 𝕊_{k}, since we draw the *r*^{th} study bag from a *Multinomial*(1, (1*/K*, …, 1*/K*)) distribution. Therefore *A*_{k}(*r*) = 1 for the sampled *k* and 0 otherwise. Then *n*^{(r)} = *n*_{k} observations are sampled from the *k*^{th} study. When the study strap is sampled without replacement, all observations are sampled, so the *r*^{th} study strap and the *k*^{th} study are identical (i.e., 𝕊^{(r)} = 𝕊_{k}). It then trivially follows that the corresponding classifiers and predictions are equivalent: **Ŷ**^{(r)} = **Ŷ**_{k} and **Ŷ**^{(r)}(𝕏_{K+1}) = **Ŷ**_{k}(𝕏_{K+1}). Since we require each study strap to be unique (i.e., 𝕊^{(i)} ≠ 𝕊^{(j)} ∀ *i* ≠ *j* ∈ {1, 2, …, *R*}), there are exactly *K* distinct bags (i.e., *R* = *K*) and corresponding study straps ({𝕊_{1}, 𝕊_{2}, …, 𝕊_{K}} = {𝕊^{(1)}, 𝕊^{(2)}, …, 𝕊^{(K)}}). It therefore directly follows that the final ensembles are equivalent; that is, the Trained-on-Observed-Studies Ensemble **Ŷ**_{TOS}(·) is equivalent to the Study Strap classifier **Ŷ**_{SS}(·).

The above proof stipulates that, to achieve an exact equivalence between a special case of the Study Strap algorithm and the Trained-on-Observed-Studies Ensemble, we cannot have multiple identical classifiers trained in an ensemble. Although we feel this stipulation is principled, it is trivial to show that the vector of predictions generated from a Study Strap with no limitations on the number of identical classifiers in an ensemble would converge in probability, as *R* → ∞, to that produced by the Trained-on-Observed-Studies Ensemble, by appealing to the Weak Law of Large Numbers:

(1*/R*) ∑_{r=1}^{R} **Ŷ**^{(r)}(𝕏_{K+1}) →_{p} (1*/K*) ∑_{k=1}^{K} **Ŷ**_{k}(𝕏_{K+1}).

We now move to an important connection between the Merging algorithm and the Study Strap.

Proposition 2. *The study strap with bag size b* = *N/K and observation-level sampling without replacement is a delete-*(*N* − *b*) *jackknife (non-hierarchical) bagging of the Merged dataset.*

Proof. Let *Z*_{ik}^{(r)} be the indicator of whether the *i*^{th} observation in the *k*^{th} study is represented in the *r*^{th} study strap. First, recall that delete-*d* jackknife sampling of a sample of size *n*^{∗} entails sampling *n*^{∗} − *d* observations *without* replacement from the *n*^{∗} observations. Thus, a delete-(*N* − *b*) jackknife entails sampling *N* − (*N* − *b*) = *b* observations from the Merged dataset, which has sample size *N*.

Next, recall that under standard (non-hierarchical) bagging without replacement (i.e., bagging with jackknife resampling) of the Merged dataset, the indicator *Z*_{i} that observation *i* is represented in a given bootstrap sample is distributed as *Ber*(*s/N*), where *s* is the number of observations sampled *without* replacement from the dataset. Now if we let *s* = *b*, we have *Z*_{i} ∼ *Ber*(*b/N*). Since *b* = *N/K*, it follows that *Z*_{i} ∼ *Ber*(*b/N*) ≡ *Ber*(1*/K*).

Now, let us show for a study strap with these parameters that the probability that the *i*^{th} observation in the *k*^{th} study is represented in the *r*^{th} bag is marginally distributed as *Ber*(1*/K*). First, recall that in generating the *r*^{th} study strap, we sample *n*_{k}(*r*) observations from the *k*^{th} study, where *n*_{k}(*r*) = *n*_{k}*A*_{k}(*r*)*/b*. The probability that the *i*^{th} observation in the *k*^{th} study is represented in a given study strap then follows, approximately, the conditional distribution *Z*_{ik}^{(r)} | *A*_{k}(*r*) ∼ *Ber*(*A*_{k}(*r*)*/b*), where the approximation is up to the rounding error needed to ensure an integer sample size for the *r*^{th} study strap. Therefore, the marginal probability that the *i*^{th} observation in the *k*^{th} study is represented in the *r*^{th} study strap can be expressed as

P(*Z*_{ik}^{(r)} = 1) = E[*A*_{k}(*r*)*/b*] = (1*/b*)(*b/K*) = 1*/K*,

so that, marginally, *Z*_{ik}^{(r)} ∼ *Ber*(1*/K*).

We can achieve a similar result when we sample observations *with* replacement; however, this requires the extra, and perhaps unrealistic, assumption that *n* = *n*_{1} = *n*_{2} = … = *n*_{K}.

Proposition 3. *The study strap with b* = *N and observation-level sampling with replacement is a standard (non-hierarchical) bagging of the Merged dataset when n* = *n*_{1} = *n*_{2} = … = *n*_{K}.

Proof. Recall that the Study Strap is a hierarchical resampling scheme in which, *marginally*, the proportion of observations sampled from the *k*^{th} study is *A*_{k}(*r*)*/b*, with *A*_{k}(*r*) ∼ *Bin*(*b*, 1*/K*). Conditionally on *A*_{k}(*r*), the number of times the *i*^{th} observation of the *k*^{th} study is included in a bootstrap sample is distributed as *Bin*(*n*_{k}(*r*), 1*/n*_{k}), with *n*_{k}(*r*) = *n*_{k}*A*_{k}(*r*)*/b*. By standard probability results, the marginal distribution of this count is approximately *Bin*(*n*, 1*/*(*Kn*)) ≡ *Bin*(*n*, 1*/N*), where the equivalence follows from the assumption that *n* = *n*_{1} = *n*_{2} = … = *n*_{K} (so that *N* = *Kn*). Thus when *b* = *N*, we have that the number of times a given observation in the *k*^{th} study is represented in a bag follows the distribution *Bin*(*n*, 1*/N*), a standard bagging scheme with bootstrap sample size *n*.

It then directly follows that under the same conditions (i.e., *n* = *n*_{1} = *n*_{2} = … = *n*_{K}), but with *b* = *n* and *n*^{(r)} = *n*, the Study Strap is a standard (non-hierarchical) bagging of the Merged dataset with a smaller bootstrap sample size. We achieve a similar result when we sample *without* replacement, where instead the marginal probability of an observation being selected into a study strap follows a *Ber*(*b/N*). Then the study strap with *b* = *n* is simply a delete-*d* jackknife bagging scheme with *d* = *N* − *b*.

Taken together, the TOS is a Study Strap with bag size *b* = 1, sample size *n*^{(r)} = *n*_{k}, and sampling without replacement. The Study Strap with the same parameters except *b* = *n*_{k} is a delete-*d* jackknife bagging scheme of the Merged dataset. Framed in this manner, we see how the parameter *b*, varying between 1 and *n*_{k}, can be thought of as providing a tuning mechanism that generates a continuum between the Merging algorithm and the TOS.

The utility of these proofs is three-fold. First, by framing the TOS and the Merging Algorithm as special cases of the Study Strap, we propose a unifying framework for multi-study prediction. Second, we provide a theoretical foundation for why the bag size provides a tuning parameter that can be used to account for varying levels of heterogeneity. With greater between-study heterogeneity, the TOS is often advantageous compared to the Merging algorithm ([Guan et al., 2019]); in such a setting, our proposed framework would anticipate a smaller bag size to be optimal, since bag size *b* = 1 recovers the TOS. Conversely, the framework predicts that lower between-study heterogeneity would be better accommodated by a larger bag size (i.e., moving closer to a subsample of the Merged dataset). Hold-one-study-out cross-validation provides a principled tuning procedure to empirically identify the optimal bag size.
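A hold-one-study-out tuning loop can be sketched as follows; the OLS-with-ridge-jitter learner, average-weight ensembling, and candidate bag sizes are illustrative assumptions rather than the paper's protocol.

```python
import numpy as np

def fit_ols(X, y):
    # OLS with a tiny ridge jitter for numerical stability.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + 1e-8 * np.eye(p), X.T @ y)

def one_strap(studies, b, rng):
    # Draw a single study strap (observation sampling without replacement).
    K = len(studies)
    A = rng.multinomial(b, np.full(K, 1.0 / K))
    Xp, yp = [], []
    for k, (X, y) in enumerate(studies):
        m = int(round(len(y) * A[k] / b))
        if m:
            idx = rng.choice(len(y), size=m, replace=False)
            Xp.append(X[idx])
            yp.append(y[idx])
    return np.vstack(Xp), np.concatenate(yp)

def tune_bag_size(studies, bag_sizes, R, rng):
    # Hold-one-study-out CV RMSE for each candidate bag size b.
    rmse = {}
    for b in bag_sizes:
        errs = []
        for held, (X_val, y_val) in enumerate(studies):
            train = [s for j, s in enumerate(studies) if j != held]
            # Average-weight ensemble of R study-strap learners.
            preds = np.mean(
                [X_val @ fit_ols(*one_strap(train, b, rng)) for _ in range(R)],
                axis=0)
            errs.append(np.sqrt(np.mean((preds - y_val) ** 2)))
        rmse[b] = float(np.mean(errs))
    return rmse

rng = np.random.default_rng(8)
studies = []
for _ in range(4):
    X = rng.normal(size=(30, 3))
    beta = rng.normal(size=3)
    studies.append((X, X @ beta + rng.normal(size=30)))
rmse = tune_bag_size(studies, bag_sizes=[1, 3, 9], R=10, rng=rng)
```

One would then select the bag size minimizing the held-out RMSE before refitting on all training studies.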

The third contribution of these proofs to the Study Strap framework is that they provide analytical rationale for the choice of sampling *with* or *without* replacement in the secondary (observation-level) step. If the studies have variable sample sizes and one wishes to ensure that each observation has the same probability of being selected into a study strap, then sampling *without* replacement will ensure this, because this property is "robust" against variation in sample sizes between the studies (save some small approximation error from rounding). This essentially "weights" the contribution of each study to the Study Strap ensemble in proportion to the sample sizes of the training studies. On the other hand, if one wishes to draw observations more evenly from each study despite variable sample sizes, then sampling *with* replacement is justified: observations that come from studies with smaller sample sizes will have a greater marginal probability of being selected into a given study strap than observations from studies with larger sample sizes. Recall that in Proposition 3 we required that *n* = *n*_{1} = … = *n*_{K} to achieve an equivalence between Merging and a special case of the Study Strap. In contrast, when one samples *without* replacement at the observation-level step, variable sample sizes across the studies are readily accommodated, as we did not need to assume *n* = *n*_{1} = … = *n*_{K} to achieve the desired equivalence.

## 4. Simulations

### 4.1. Simulation Framework

To probe the performance of our proposed methods, we generated 100 simulation iterations, each with 16 training studies (*K* = 16) and one test set, under various conditions. We simulated studies with 20 covariates (*p* = 20) and sample sizes of *n*_{k} = 400 for all studies. We arbitrarily set 10 of the true model coefficients to be exactly 0, so as to simulate the sparse setting we encountered in the neuroscience data. We chose *p* = 20 and *n*_{k} = 400 to replicate the ratio of covariates to observations in the neuroscience data. We noticed considerable variation in the conditional distribution of **y**_{k}|𝕏_{k} in the neuroscience dataset (Figure 16) and thus tried to emulate this in simulations. The conditional distribution of our outcome followed **y**_{k} | 𝕏_{k} ∼ *N*(𝕏_{k}*β*_{k}, 1). We visually illustrate the general simulation framework in Figure 9*a*. We randomly generated the *p*^{th} model coefficient of the *k*^{th} study, *β*_{k,p}, from a common hyperdistribution. To explore the impact of varying degrees of between-study heterogeneity in the true model, we simulated datasets where the variance *σ*_{β}^{2} of the hyperdistribution from which we drew the true model coefficients ranged over four levels. What constitutes high or low between-study heterogeneity in the true model coefficients depends on the study context and likely requires domain-specific knowledge. In the absence of any meaningful measure to contextualize the degree of heterogeneity in simulation, we based our range on the performance of the Merging Algorithm. As shown above each panel in Figure 8, the performance of the Merging Algorithm ranges from nearly perfect (RMSE is bounded below by 1 since *Var*(**y**_{k}|𝕏_{k}) = 1) to roughly 150 times that, suggesting that we achieved a wide range of scenarios capturing "easy" problems all the way up to arguably unrealistically "hard" problems. We describe further parameter choices in the supplement (Supplement B.1).
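One way to generate data consistent with this description can be sketched as follows; the hyper-mean, the identity of the null coefficients, and the specific *σ*_{β} value are illustrative assumptions (the paper's four *σ*_{β} levels are not reproduced here).

```python
import numpy as np

def simulate_studies(K, n, p, sigma_beta, rng):
    # Hyper-mean of the nonzero coefficients (assumed for illustration).
    mu = rng.normal(size=p)
    support = np.arange(p) < p // 2         # half the coefficients set to 0
    studies = []
    for _ in range(K):
        # Study-specific coefficients drawn around the hyper-mean.
        beta_k = np.where(support, mu + sigma_beta * rng.normal(size=p), 0.0)
        X = rng.normal(size=(n, p))
        y = X @ beta_k + rng.normal(size=n)  # y_k | X_k ~ N(X_k beta_k, 1)
        studies.append((X, y))
    return studies

rng = np.random.default_rng(5)
studies = simulate_studies(K=16, n=400, p=20, sigma_beta=1.0, rng=rng)
```

Sweeping `sigma_beta` over a grid reproduces the range from "easy" (near-homogeneous) to "hard" (highly heterogeneous) regimes described above.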

We also tested the impact of study clusters on the performance of our proposed methods. We opted to explore this because the data application that motivated the current work appeared to exhibit study clusters (Figure 4). Clusters were modeled by generating studies with similar true model coefficients and similar distributions of the covariates across studies within a cluster. The true model coefficients differed across studies within a cluster by a degree of noise that was proportional to the between-study variability in model coefficients. If **μ**_{c} is the mean vector of true model coefficients for studies in the *c*^{th} cluster, then the true model coefficients for the *j*^{th} study in the *c*^{th} cluster are **β**_{c,j} = **μ**_{c} + **ε**_{c,j}, where **ε**_{c,j} is drawn from a mean-zero normal distribution. Noise was added within clusters to ensure that studies within a cluster did not have identical model coefficients or covariate distributions. We simulated studies with *c* = 0 or *c* = 4 clusters (*c* = 0 indicates that each study was generated from a distinct distribution). The number of training and test studies was selected to allow for an even number of studies per cluster. The simulation scheme for the clusters is visually depicted in Figure 9*b*.
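The cluster structure can be sketched as follows; the fraction `noise_frac` controlling within-cluster noise is a hypothetical parameter standing in for the paper's proportionality constant.

```python
import numpy as np

def cluster_coefficients(C, studies_per_cluster, p, sigma_between, noise_frac, rng):
    # beta_{c,j} = mu_c + eps_{c,j}, with within-cluster noise proportional
    # (via noise_frac) to the between-cluster spread sigma_between.
    betas = []
    for _ in range(C):
        mu_c = sigma_between * rng.normal(size=p)    # cluster-mean coefficients
        for _ in range(studies_per_cluster):
            eps = noise_frac * sigma_between * rng.normal(size=p)
            betas.append(mu_c + eps)
    return betas

rng = np.random.default_rng(6)
betas = cluster_coefficients(C=4, studies_per_cluster=4, p=20,
                             sigma_between=1.0, noise_frac=0.25, rng=rng)
```

By construction, coefficient vectors within a cluster sit much closer to one another than to those in other clusters, mirroring the intended reduction in effective between-study heterogeneity.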

We used the LASSO as our single-study learner for two reasons. First, we wanted to emulate the sparse setting of the FSCV data, as well as of many multi-study settings. Given the sparsity of the simulations, a generic similarity measure comparing covariate profiles would weight all covariates equally, including covariates that did not impact the conditional distribution *f* (**y**_{k}| 𝕏_{k}). We therefore weighted the covariate-wise similarity by a function of the corresponding coefficient estimates, as a measure of variable importance, and so wanted a single-study learner that sets some coefficients exactly to zero. Additionally, the LASSO is a common method in the human FSCV literature. Specifically, we used a similarity measure based on the weighted distance ‖**w** ⊙ (**x̄**_{j} − **x̄**_{k})‖, where ⊙ indicates the Hadamard product and the *p*^{th} element of **w** is a function of the coefficient estimate *β̂*_{p} and its estimated variance, where we estimated the variance of *β̂*_{p} using a bootstrap with 500 iterations. The motivation behind including the variance was to ensure that noisier studies, with greater variance in their coefficient estimates, were penalized.
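The idea can be sketched as follows; the specific weighting function here, |β̂_p| scaled by its bootstrap-estimated variance, is an illustrative assumption rather than the paper's exact form:

```python
import numpy as np

def cps_similarity(xbar_train, xbar_target, beta_hat, beta_var):
    """Covariate-profile similarity weighted by coefficient importance.

    Each covariate-wise difference in study means is scaled (a Hadamard
    product) by a weight built from the coefficient estimate and its
    variance, so covariates with large, stable coefficients dominate
    and covariates irrelevant to the outcome are ignored. Returns a
    similarity in (0, 1], larger meaning more similar.
    """
    w = np.abs(beta_hat) / beta_var              # illustrative weight form
    dist = np.linalg.norm(w * (xbar_train - xbar_target))
    return 1.0 / (1.0 + dist)
```

Note that covariates with zero estimated coefficients contribute nothing to the distance, which is exactly why a sparse learner such as the LASSO pairs naturally with this measure.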

### 4.2. Simulation Results

We present the results of our simulations in Figure 8 and in Table 2. Figure 8 is on the log scale to allow for easier comparison of the methods. Versions on the original scale are presented in the supplement (Figure 9).

As expected, the average performance of all methods monotonically deteriorates as between-study variability in true model coefficients (*σ*_{β}) increases. Conversely, when there is clustering in the studies, performance improves; this is expected, since increasing the number of studies per cluster functionally reduces the level of between-study variability in true model coefficients. Indeed, the degree of between-study variability within a cluster pales in comparison to the between-study variability across clusters.

Consistent with past results, the TOS is superior to Merging in a majority of the test studies when there are no clusters (*c* = 0) and there is moderate to high heterogeneity in model coefficients (*σ*_{β}).

CPS weighting confers substantial benefit when there is clustering in the data. The Trained-on-Study-Straps Ensemble produces variable benefits above the TOS when there is clustering, but shows the most promise when combined with CPS weights.

The AR algorithm with average weights is superior to other forms of ensembling (TOS and SS) when there is clustering. Interestingly, it is only when *σ*_{β} is low that we see clear benefits above Merging.

When tuning the study strap and accept/reject step algorithms, we used a hold-one-study-out cross validation scheme within the *training* studies. We also generated bag size performance curves but, unlike in the tuning step, we generated these post hoc, with RMSEs computed on the test sets. Here we include one for the accept/reject algorithm with clusters (Figure 3) to emphasize the utility of bag size tuning: as *σ*_{β} grows, so does the optimal bag size. We emphasize this relationship further in the supplement (Figure 14). Moreover, the performance curve underscores the continuum between the TOS and the Merging Algorithm generated by varying the bag size, *b*. The performance at high bag sizes is nearly identical to that of the Merging Algorithm, and likewise the performance at low bag sizes approaches that of the TOS, as anticipated by the Study Strap framework. The bag sizes used in the accept/reject algorithm and the study strap are indicated on each plot in Figure 8. The tuning step did not always choose the optimal bag size, but weighted predictions appear less sensitive to correct bag size specification.
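The hold-one-study-out tuning loop can be sketched as follows; the OLS single-study learner, the toy data shapes, and the equal-weight averaging of strap models are simplifying assumptions (the paper uses LASSO/PCR learners and several weighting schemes):

```python
import numpy as np

rng = np.random.default_rng(1)

def fit(X, y):
    """Toy single-study learner: ordinary least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def study_strap(studies, b):
    """One pseudo-study: draw b study indices with replacement, then
    bootstrap rows within each drawn study and pool everything."""
    ks = rng.choice(len(studies), size=b, replace=True)
    Xs, ys = [], []
    for k in ks:
        Xk, yk = studies[k]
        idx = rng.choice(len(yk), size=len(yk), replace=True)
        Xs.append(Xk[idx])
        ys.append(yk[idx])
    return np.vstack(Xs), np.concatenate(ys)

def tune_bag_size(studies, bag_sizes, n_straps=10):
    """Hold-one-study-out CV over candidate bag sizes, within the
    training studies: each study takes a turn as the validation set."""
    cv_err = {}
    for b in bag_sizes:
        errs = []
        for held in range(len(studies)):
            train = [s for k, s in enumerate(studies) if k != held]
            X_val, y_val = studies[held]
            # equal-weight ensemble over n_straps study-strap models
            preds = np.mean([X_val @ fit(*study_strap(train, b))
                             for _ in range(n_straps)], axis=0)
            errs.append(np.sqrt(np.mean((preds - y_val) ** 2)))
        cv_err[b] = float(np.mean(errs))
    return min(cv_err, key=cv_err.get), cv_err
```

The selected bag size is the one minimizing average held-out-study RMSE, mirroring the scheme described above.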

For brevity, we only include the AR bag size curve for the clusters here, but all figures, including the performance of weighting, appear in the supplement (Supplementary Materials B). Importantly, weighting with CPS or stacking appears most beneficial at lower bag sizes; as the bag size grows large and approaches the Merging Algorithm, the performance of average, CPS and stacking weights appears to converge to that of Merging. Moreover, the relationship between optimal bag size and *σ*_{β} is far less clear for the Trained-on-Study-Straps Ensemble, as well as for the AR without clusters. The relationship between between-study heterogeneity and optimal bag size appears complex and warrants further study.

We point out that more sophisticated bag size tuning schemes may be an important area of future research: in almost every case the tuned bag size for the AR and Trained-on-Study-Straps Ensemble algorithms was suboptimal. This could be due to cross validation bias. It may also have arisen because we did not correct for the ratio of *b* to *K* when moving from the hold-one-study-out cross validation scheme (where *K*_{train} = 15, since one study is held out) to the actual training setting (where *K* = 16).

Importantly, the benefit of either CPS or Stacking weighting schemes diminishes as the bag size increases and approaches the Merging Algorithm (Figure 3). This is likely because as *b* rises, the degree of variability between study straps diminishes since study straps will be more likely to be composed of roughly equal proportions from all studies.
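A quick simulation illustrates why: the study-composition proportions of a strap concentrate around 1/*K* as *b* grows, so large-bag straps become nearly interchangeable. This sketch resamples study labels only, ignoring the within-study bootstrap:

```python
import numpy as np

rng = np.random.default_rng(2)

def strap_proportions(K, b):
    """Composition of one study strap: the proportion of the b resampled
    study labels falling on each of the K training studies."""
    return np.bincount(rng.choice(K, size=b), minlength=K) / b

def between_strap_sd(K, b, n_straps=200):
    """Average (over studies) s.d. of strap composition across straps;
    shrinks on the order of 1/sqrt(b) as the bag size grows."""
    props = np.array([strap_proportions(K, b) for _ in range(n_straps)])
    return float(props.std(axis=0).mean())
```

For fixed *K*, the between-strap variability at a large bag size is far smaller than at a small one, which is the mechanism behind the convergence of all weighting schemes toward Merging.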

Taken together, these simulations demonstrate that CPS weights and the Study Strap framework confer substantial benefit when there are clusters of data that share similar marginal covariate distributions and similar conditional distributions of the outcome given the covariates. Moreover, the simulations broadly demonstrate the benefit of the bag size tuning parameter, *b*, as the improvement in predictive performance of the Trained-on-Study-Straps Ensemble and AR varied considerably across bag sizes (Figure 3). Importantly, the simulations offer empirical evidence that the bag size tuning parameter creates a continuum between the TOS and the Merging Algorithm.

## 5. Data Application: Neurochemical measurements

We now turn to the primary motivation for the methods described above. The purpose of this work was to improve the cross-electrode generalizability of models trained on in vitro data, and thereby to generate more reliable neurochemical concentration estimates from measurements made in the brain. The need for this arises because the electrodes used to generate in vitro training sets differ from those used for brain recordings, and training sets are generated purely from in vitro data. Specifically, we sought to improve the predictive performance of models trained on a collection of in vitro datasets, each generated on a different electrode (i.e., studies), and validated on a dataset generated on a held-out electrode.

### 5.1. Modeling and Methods Application

We applied our proposed methods to these data using Principal Component Regression (PCR) ([Rodeberg, et al., 2017]) as the single-study learner, since it is the standard in the field. We point out, however, that unlike most applications of PCR in FSCV, we selected the number of principal components to retain using cross validation, as opposed to a measure of the proportion of variance explained. While regularization methods have also been proposed ([Kishida et al., 2016]; [Moran, et al., 2018]), we found that PCR produced superior cross-study predictive performance (Figure 18). While we did not explicitly account for the functional nature of the data, we did implement functional data analytic methods, such as functional regression and functional Principal Components Regression (fPCR) using basis splines, and found that they exhibited inferior performance in this application. We tuned the number of principal components to retain separately for each held-out study (electrode), based upon a hold-one-study-out cross validation scheme with the merging approach, excluding the test study (i.e., each test study was not used for training or validation in its respective tuning stage). The number of components selected for each held-out study was held constant across all of the methods explored. The final methods were validated with a hold-one-study-out validation scheme to estimate out-of-study prediction error. Since the outcome is chemical concentration, we imposed the non-negativity constraint *ŷ*_{i} = max(*ŷ*_{i}, 0), truncating negative predictions at zero. We implemented this constraint so that predictions would be scientifically coherent in every case, although it only slightly impacted the net predictions and performance.
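A minimal PCR sketch with the non-negativity truncation applied at prediction time; the SVD-based implementation and function names are ours, not the paper's code, and the number of components would be chosen by hold-one-study-out cross validation as described above:

```python
import numpy as np

def pcr_fit(X, y, n_comp):
    """Principal component regression: center X, keep the top n_comp
    right singular vectors, and regress centered y on the PC scores."""
    mu, ybar = X.mean(axis=0), y.mean()
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_comp].T                         # p x n_comp loadings
    gamma, *_ = np.linalg.lstsq(Xc @ V, y - ybar, rcond=None)
    return {"mu": mu, "ybar": ybar, "V": V, "gamma": gamma}

def pcr_predict(model, X, nonneg=True):
    pred = (X - model["mu"]) @ model["V"] @ model["gamma"] + model["ybar"]
    # concentrations cannot be negative: truncate predictions at zero
    return np.maximum(pred, 0.0) if nonneg else pred
```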

The vector of covariates was high dimensional and the vector of true model coefficients was sparse. For this reason, we opted to engineer features that summarize average differences in the covariates between the studies. Let 𝕏_{k} be the design matrix of the *k*^{th} study. As we describe in greater detail in the supplement (Supplement C.2), we engineered features via a dimension reduction step where, for each study, *ν*_{k} = *g* (𝕏_{k}), with *ν*_{k} ∈ ℝ^{8} a vector of features engineered to summarize the covariate profile of the *k*^{th} study. We then used the distance ‖*ν*_{j} − *ν*_{k}‖ as a similarity measure (Figure 17). We used this distance measure, taken on the raw covariates, when fitting models on both the original covariates and a numerical estimate of the derivative of the covariates. We estimated the derivative of the covariates using the `diff()` function in R, since the measurements were taken at evenly spaced intervals (in time and in voltage potential).
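The derivative pre-processing and the study-level distance can be sketched as follows; here the summary function `g` defaults to a column mean, an illustrative stand-in for the paper's inflection-point features:

```python
import numpy as np

def derivative_features(X):
    """Numerical derivative of each row (evenly spaced measurements),
    analogous to applying R's diff() along a sweep: first differences."""
    return np.diff(X, axis=1)

def study_distance(Xj, Xk, g=lambda X: X.mean(axis=0)):
    """Distance between engineered study summaries nu_k = g(X_k)."""
    return float(np.linalg.norm(g(Xj) - g(Xk)))
```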

Given the inherently stochastic nature of our methods, we noticed some between-seed variability in results for the accept/reject step method. For that reason, we ran the methods on 10 separate seeds and present the results averaged across seeds. We show the variability in average performance between seeds in the supplement (Figure 20).

As the FSCV data were large, memory considerations were a constant challenge, even on a powerful cluster. For that reason, we were unable to systematically implement stacking in all of our analyses given the additional computational cost. In the analyses where we did explore stacking weights, we found that they provided no additional benefit, so we opted not to implement them more broadly. We have included a figure in the supplement exploring the performance of stacking weights in the AR algorithm (Figure 19).

For the bag size tuning curve, we ran all bag sizes of the accept/reject algorithm on a single path and with a subset of the data (*n*_{k} = 2500), due to the high computational cost associated with using the full dataset. The tuned bag sizes were *b* = *K* = 14 for the raw analyses and *b* = 75 for the derivative analyses.

### 5.2. Results

As mentioned above, we noticed clustering in the average covariates of the studies and provided a motivation for emulating this in the simulations.

The results above demonstrate the strength of the Covariate Profile Similarity weighting scheme. While the Trained-on-Observed-Studies Ensemble with equal weights, and its weighted versions (CPS weights and Stacking weights), perform dramatically worse than Merging, the degree of improvement over simple average weights is substantial: CPS weights produced a 53.9% and 60.7% reduction in RMSE relative to simple average weights for the raw and derivative analyses, respectively. While these weights are inferior to stacking, the covariate profile similarity measure allows for the integration of an accept/reject step within the Study Strap framework. Indeed, the accept/reject step exhibited the strongest performance among all the methods examined. Importantly, this demonstrates that even when Merging is superior to the TOS, ensembling approaches can out-compete Merging in some scenarios.

Importantly, the standardization of the covariates provided by the derivative pre-processing (Figure 15) shifts the optimal bag size towards the Merging Algorithm. As apparent in Figure 6, the optimal bag size lies roughly between 12 and 24 for the raw data, and between 35 and 85 for the derivative. The preprocessing thus pushes the optimal bag size towards the Merging Algorithm. The present data thereby provide a within-dataset demonstration of the utility of the bag size: as between-study heterogeneity changes, the optimal bag size necessary to accommodate it also shifts.

Consistent with the observation that the optimal bag size is shifted towards the Merging Algorithm in the derivative case, the benefit of the accept/reject step is reduced. This echoes the observation in the simulations that as the bag size increases, the benefit of the SS, AR, or any weighting scheme diminishes (Figure 10, Figure 12). The derivative provides a between-study standardization of the covariates and improves the performance of all methods substantially. This across-the-board enhancement in performance may also contribute to the discrepancy in performance improvement between the raw and derivative analyses.

## 6. Discussion

Motivated by challenges of transportability of prediction models in emerging neuroscience technologies, we propose a general multi-study learning framework including a resampling scheme and a complementary weighting procedure. These methods produced substantial improvements in performance when there is clustering of studies. Importantly, in these settings, Merging was superior to the Trained-on-Observed-Studies Ensemble (with or without stacking weights). However, when we ensembled and applied Covariate Profile Similarity weighting and/or the accept/reject algorithm, these vastly outperformed Merging, highlighting that ensembling approaches can sometimes outperform Merging even when between-study heterogeneity is low. It appears that a Covariate Profile Similarity approach, whether applied through CPS weights or the accept/reject algorithm, is most helpful when the “borders” of the studies are murky. That is, when there is substantial clustering of the studies, one might be better off delineating between clusters instead of between the original observed studies themselves. When the studies do not exhibit clustering, it appears that the standard TOS or Merging techniques are superior.

By expressing the TOS and the Merging Algorithm as special cases of the Study Strap, we propose a unifying framework for multi-study prediction. We suggest a theoretical foundation for why the bag size can help adjust for varying levels of heterogeneity. Using a numerical estimate of the derivative for feature engineering appears to produce cross-study standardization and shifts the optimal bag size to greater values (i.e., towards the Merging Algorithm) compared to the optimal bag size for the raw covariates. This example serves to demonstrate the utility of the bag size tuning parameter in accounting for between-study heterogeneity. We emphasize that the benefit of the Covariate-Matched Study Strap (AR) appears to rely on ensembling at its core. For example, using the final study strap accepted prior to convergence as the sole learner performed worse than ensembling all accepted study straps in the neuroscience data, even though the last accepted replicate has a covariate profile more similar to the target study than all previously accepted study straps. Interestingly, including a burn-in step (eliminating early accepted study straps) did not appear to provide any systematic benefit, and could actually be detrimental depending on the degree of burn-in. We point out that we used 10 paths in our accept/reject algorithm to be conservative, but we noticed little benefit beyond 5 paths. Moreover, running multiple paths did not change average performance relative to a single path, but served to reduce between-run variability in performance.

We explored a variety of similarity measures in the initial stages of this work, although we present the results from only one for the sake of brevity and consistency. Matrix correlation packages and canonical correlation analysis techniques are out-of-the-box methods that exhibited promise in the simulations we explored. Our `studyStrap` package provides over 20 such generic similarity measures that are automatically computed for any of the ensembling methods. Context-specific similarity measures (as we developed for the neuroscience application) may be preferable if subject-matter knowledge can guide their development. But even in the absence of any straightforward method of constructing such metrics, more general measures, such as that used in the simulations, may be viable in a range of scenarios.

The Study Strap framework allows for a variety of weighting schemes including stacking, CPS and out-of-study-bag weighting. It not only introduces flexibility in how study replicates and ensembles are formed, but also in how stacking can be executed. For example, we have proposed a number of stacking approaches that differ in the manner in which one constructs the design matrix and outcome vector used in the stacking regression. We outline a couple in the supplement that are enabled by the Study Strap framework and may be suited to different scenarios (Sections A.2 and A.3). Although this was not the focus of the paper, we point this out as a potential area of future research.

Importantly, the FSCV results demonstrate that despite the statistical challenges present in the application of FSCV in humans, substantial improvements in cross-electrode generalizability are feasible. Statistical research into this topic is critical, as FSCV in humans is a new technique with, in some domains, unprecedented capacity to provide insight into the brain, behavior and cognition. Moreover, FSCV in humans marks an important victory for statistics in science: novel applications of statistical learning methods were instrumental in its successful implementation in humans. What is more, these statistical methods have enabled researchers to use FSCV to measure multiple neurotransmitters simultaneously, a feat thought not to be possible with this technology, even in preclinical settings ([Moran, et al., 2018]).

[17] consider heterogeneity in multi-study settings, with the aim of identifying a subset of studies within which comparison of prediction algorithms can be reliably carried out, and of comparing algorithms using a merged subset of the studies. They address both problems by integrating clustering of studies and model comparison through a Bayesian model for the array of cross-study validation statistics. The combination of study strap and covariate similarity weighting is a natural evolution of this approach to multi-study ensembles. While our implementation is not Bayesian, and clustering is not explicit, the resampling and reweighting jointly ensure that out-of-study predictions are based on the subset of studies most similar to the target in both covariate profile and covariate-outcome relations.

### 6.0.1. Covariate Shift

The Covariate Shift literature contains a rich array of prediction methods for settings in which the distribution of the data differs between training and test sets. Investigators distinguish between three situations: 1) “virtual drift” (also sometimes referred to simply as “covariate shift”) occurs when the distribution of the covariates differs (i.e., *f* [𝕏_{train}] ≠ *f* [𝕏_{test}]) but the conditional distribution remains constant (i.e., *f* [**Y**_{train}|𝕏_{train}] = *f* [**Y**_{test}|𝕏_{test}]); 2) “real drift” refers to contexts in which the conditional distribution, *f* [**Y**_{t}|𝕏_{t}], changes over time independent of changes in the marginal distribution of the covariates, *f* [𝕏_{t}]; 3) “hybrid drift” presents when both *f* [**Y**_{t}|𝕏_{t}] and *f* [𝕏_{t}] change over time ([Yang, et al., 2019]).

In the present work, we focus on a multi-study analogue of the hybrid drift literature. The multi-study framework does not posit, however, any meaningful ordering (temporal or otherwise) among the studies, and focuses on data that are often from different sources. This is in contrast to much of the concept/hybrid drift literature, which deals with streams of data (often time series) that change over time in one of the manners described above (i.e., *f* (𝕏_{t}) and/or *f* (**Y**_{t}|𝕏_{t}) change over time) and are typically from one source, such as environmental or energy data ([Yang, et al., 2019]; [Raza, et al., 2019]; [Karnick, et al., 2008]; [Ditzler, et al., 2010]). As such, many concept drift methods employ approaches inherently tailored to the temporal nature of the data streams, such as moving averages ([Raza, et al., 2008]), detection systems to identify the presence ([Yang, et al., 2019]) or speed ([Minku, et al., 2009]) of a shift at a given instance, and ensemble methods that add, remove, or reweight classifiers across time ([Raza, et al., 2008]; [Minku, et al., 2009]). While virtual drift work often does not assume changes over time, its methods usually center on reweighting observations ([Sugiyama, et al., 2008]; [Shimodaira, 2000]) and assume that *f* (**Y**|𝕏) remains constant between training and test sets. These differences likely explain why covariate shift methods have seen little implementation in multi-study learning.

### 6.1. Limitations

Our methods and analyses are not without limitations. To begin, this work does not provide an explicit analytical expression for prediction error (RMSE) as a function of the bag size parameter, *b*. As such, tuning *b* must be conducted empirically with cross validation. Indeed, our simulations and data application reveal that the bag size parameter is helpful in accommodating between-study heterogeneity, but do not reveal the exact nature of the heterogeneity for which it is beneficial. While it appears to help account for heterogeneity in the marginal distribution of the covariates, *f* (𝕏_{k}), in the simulations with clusters and in the neuroscience data, it is clear that the relationship is neither linear nor monotonic, and its contours appear to vary considerably between simulation settings. Finally, although the accept/reject step showed the most promise in the neuroscience data, it is also the most computationally expensive. More efficient algorithms that achieve a similar goal may allow the application of this technique in a wider range of settings. Future work may seek to achieve this through an optimization step as opposed to random resampling.

As mentioned above, we did not account for the functional nature of the neuroscience data explicitly because we found that the functional methods that we implemented (e.g., fPCR with basis splines) produced inferior performance. In the future, investigators may wish to experiment with other functional approaches (e.g., wavelet regression) as they may produce gains in predictive accuracy. Moreover, we did not explicitly account for the temporal nature of the neuroscience data as this was not the goal of the present work. We point out that no standard methods in the FSCV field, to our knowledge, account for the time-series nature of the data in estimating neurochemical concentration ([Rodeberg, et al., 2017]; [Kishida et al., 2016]; [Moran, et al., 2018]). Nevertheless, future work might consider implementing methods that account for this feature of the data structure as it could be exploited for gains in performance.

Finally, although we have shown progress in enhancing between-electrode generalizability of the models, we cannot verify generalizability from in vitro models to in vivo settings, as this would require a gold standard measure of neurochemical concentration in the human brain, for which no assay currently exists. We point out that such a gold standard does not exist even in animal models, where a greater array of invasive methods is available. It is our hope that by enhancing cross-electrode generalizability, generalization to the brain is improved as well.

### 6.2. Conclusions

Taken together, the Study Strap, Covariate Profile Similarity Weighting and the Covariate-Matched Study Strap are flexible ensembling and weighting schemes that can improve predictive performance in multi-study settings. We hope the methodology will benefit the statistics, neuroscience and computational psychiatry ([Montague, et al., 2012]) communities. We have provided a user-friendly R package that can be used by investigators who use FSCV, as well as by other analysts who encounter multi-study prediction settings. We argue that the present work contributes to the multi-study literature both through its proposed methods and theoretically, through the general Study Strap framework.

## SUPPLEMENTARY MATERIAL

### A.1 Stacking Strategy

We used non-negative least squares with an intercept to generate weights. That is, we obtained (ŵ_{0}, ŵ_{1}, …, ŵ_{K}) by regressing the stacked outcomes on the corresponding stacked study-specific predictions, subject to ŵ_{k} ≥ 0 for *k* ≥ 1. We found that standardizing the weights led to a degradation of performance, so we proceeded without standardizing the coefficient estimates. Thus, the final predictions are **Ŷ** (𝕏) = ŵ_{0} + ∑_{k=1}^{K} ŵ_{k} **Ŷ**^{(k)}(𝕏).
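A sketch of stacking weights via non-negative least squares with a free intercept; this simple projected-gradient solver is our stand-in for an off-the-shelf NNLS routine (e.g., a Lawson–Hanson implementation), not the paper's code:

```python
import numpy as np

def nnls_stack(P, y, n_iter=5000):
    """Stacking weights by non-negative least squares with a free
    intercept: member weights are projected to be non-negative after
    each gradient step; the intercept is left unconstrained."""
    A = np.column_stack([np.ones(len(y)), P])   # prepend intercept column
    w = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / sigma_max(A)^2
    for _ in range(n_iter):
        w = w - step * (A.T @ (A @ w - y))
        w[1:] = np.maximum(w[1:], 0.0)          # project onto w_k >= 0
    return w[0], w[1:]
```

Here the columns of `P` hold each ensemble member's predictions on the stacking data; final predictions on new data are the intercept plus the weighted combination of member predictions.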

### A.2. Study Strap Stacking Strategy

This is the Study Strap (or AR) analogue of the stacking implemented for the TOS. It regresses **y**_{S} on 𝕏_{S}, where 𝕏_{S} stacks, across study straps *j*, the matrices of predictions [**Ŷ**^{(1)}(𝕏^{(j)}), …, **Ŷ**^{(R)}(𝕏^{(j)})], and **y**_{S} stacks the corresponding **y**^{(j)}. Here **Ŷ**^{(r)}(𝕏^{(j)}) are the predictions on the design matrix of study strap *j* using the model trained on study strap *r*, and **y**^{(j)} are the labels from study strap *j*. The stacking procedure then proceeds as above.

### A.3. Study Strap Stacking Strategy 2

Below is an additional way one could construct the stacking regression for the Study Strap (or AR). It regresses **y**_{S} on 𝕏_{S}, where 𝕏_{S} stacks, across training studies *k*, the matrices of predictions [**Ŷ**^{(1)}(𝕏_{k}), …, **Ŷ**^{(R)}(𝕏_{k})], and **y**_{S} stacks the corresponding **y**_{k}. Here **Ŷ**^{(r)}(𝕏_{k}) are the predictions on the design matrix of training study *k* using the model trained on study strap *r*, and **y**_{k} are the labels from training study *k*. The stacking procedure then proceeds as above.

One could similarly construct the stacking regression to weight the TOS or the standard Trained-on-Study-Straps Ensemble using the design matrices of the accepted study straps in the AR algorithm (and the models from the TOS or the standard Trained-on-Study-Straps Ensemble, respectively). The above variations on standard stacking highlight the flexibility introduced by the Study Strap framework.
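One concrete reading of these constructions can be sketched as follows: stack each member model's predictions column-wise, data set by data set, where the data sets are study straps (as in A.2) or training studies (as in A.3); the helper name is ours:

```python
import numpy as np

def stacking_design(models, designs, labels):
    """Build the stacking regression inputs: for each data set j, put
    model r's predictions on design matrix j into column r; stack the
    prediction blocks over j, and stack the label vectors to match."""
    X_S = np.vstack([np.column_stack([m(X) for m in models])
                     for X in designs])
    y_S = np.concatenate(labels)
    return X_S, y_S
```

The resulting (𝕏_S, **y**_S) pair then feeds directly into the non-negative least squares step described in A.1.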

## ACKNOWLEDGEMENTS

## APPENDIX B SIMULATIONS SUPPLEMENTARY MATERIALS

### B.1. Simulation Parameters

Here we describe additional details related to the simulations. When tuning the *λ* hyperparameter for the LASSO, we used a hold-one-study-out cross validation scheme and used the same *λ* across studies and methods within a given simulation iteration. In other words, we used the same *λ* for all methods (i.e., Merging, TOS, SS, AR) so that this did not influence our between-method comparisons. We tested 43 values of *λ* between 0.0001 and 5. When tuning the study strap bag size, we implemented 150 study straps at each of 21 bag sizes between *b* = 1 and *b* = *N*, using a hold-one-study-out cross validation scheme. When generating the study strap bag size performance curve, we used 250 study straps per bag size. When testing the final Study Strap with the tuned *b*, we used 500 study straps. We used fewer study straps during the tuning stages to limit the computational cost (since we were using a hold-one-study-out cross validation scheme, tuning was computationally intensive).

When tuning the AR bag size, we used a convergence criterion of 10,000 consecutive study straps without an acceptance, and averaged across 3 paths. We used the same parameters for the AR bag size performance curve. For the final AR implementation with the tuned *b*, we used 5 paths and a convergence criterion of 100,000 consecutive study straps without an acceptance. We felt these parameters struck a balance between computational feasibility and sufficiency for our purposes.
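The accept/reject loop with this consecutive-rejection stopping rule can be sketched generically; the greedy accept-if-more-similar-than-all-previous rule and the function names are our simplified reading of the algorithm:

```python
import numpy as np

rng = np.random.default_rng(3)

def accept_reject(similarity, propose, max_consec_rejects=100):
    """Covariate-matched study strap loop: keep proposing pseudo-studies
    and accept one whenever it is more similar to the target than every
    previously accepted strap; declare convergence after a run of
    max_consec_rejects consecutive rejections."""
    accepted, best, rejects = [], -np.inf, 0
    while rejects < max_consec_rejects:
        strap = propose()
        score = similarity(strap)
        if score > best:
            best, rejects = score, 0
            accepted.append(strap)
        else:
            rejects += 1
    return accepted
```

By construction, the accepted straps have strictly increasing similarity to the target, and the ensemble is formed over all of them rather than the last one alone.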

### B.2. Extra Simulation Figures

## APPENDIX C NEUROSCIENCE DATA

### C.1. Data Description

The data is collected by exposing an electrode in a flow cell (in vitro) to different prepared concentrations (*c*_{j}) of a neurotransmitter, dopamine, a naturally occurring brain chemical (where the concentration is known to the investigator). Many measurements are taken at a given concentration (these different measurements are referred to as “Replicate” below). While measurements are taken across time, time is ignored in the training set (i.e., each observation is treated as independent). The rows of the dataset then correspond to a measurement (observation) at a given concentration, at a given time point. Furthermore, these data are paired with a vector of labels of known neurotransmitter concentrations ([*DA*]_{i}). The structure of the data for the *k*^{th} electrode is presented in Table 4.

A sample of the average covariates of four of the 15 electrodes is presented in Figure 15 for both the raw and derivative pre-processing versions. The figure highlights the standardization in covariates that the derivative provides.

### C.2. Similarity Metric

In order to compare the covariates of two studies, we engineered features that summarize the covariate profile of a given observation. These features are the coordinates of the inflection points of the CVs (Figure 17b): **ν** contains these coordinates collapsed into a single vector (i.e., **ν** ∈ ℝ^{8}). Then, to compare the *i*^{th} and *j*^{th} studies, we calculate the distance between the average **ν** of each study via the similarity metric (Figure 17a). Studies/electrodes differed not only in the average magnitude of the covariates (i.e., the height of the figures), but also in which covariates were concentration-sensitive (i.e., which coefficients, *β*, were non-zero). This measure was designed to account for this by measuring distance in terms of both the covariate index and height.

## Footnotes

↵* NSF-DMS1810829

↵† T32 AI 007358

↵‡ NIH, R01 DA048096; NIH, R01 MH121099; NIH, R01 NS092701; NIH, 5KL2TR00142

↵§ WFSOM, Phys/Pharm Neurosurgery

Address of the First and Second authors: E-MAIL: gloewinger{at}g.harvard.edu, kkishida{at}wakehealth.edu

Address of the Third author: E-MAIL: gp{at}jimmy.harvard.edu