Decentralized Distribution-sampled Classification Models with Application to Brain Imaging

Background

In this age of big data, large data stores allow researchers to compose robust models that are accurate and informative. In many cases, the data are stored in separate locations, requiring data transfer between local sites, which can create practical hurdles such as privacy concerns and heavy network load. This is especially true for medical imaging data, which can be constrained by the Health Insurance Portability and Accountability Act (HIPAA). Medical imaging datasets can also contain many thousands or millions of features, further increasing network load.

New Method

Our research expands upon current decentralized classification research by implementing a new singleshot method for both neural networks and support vector machines. Our approach is to estimate the statistical distribution of the data at each local site and pass this information to the other local sites, where each site resamples from the individual distributions and trains a model on both locally available data and the resampled data.

Results

We show applications of our approach to handwritten digit classification as well as to multi-subject classification of brain imaging data collected from patients with schizophrenia and healthy controls. Overall, the results showed classification accuracy comparable to the centralized model with lower network load than multishot methods.

Comparison with Existing Methods

Many decentralized classifiers are multishot, requiring heavy network traffic. Our model attempts to alleviate this load while preserving prediction accuracy.

Conclusions

We show that our proposed approach performs comparably to a centralized approach while minimizing network traffic compared to multishot methods.

Highlights

- A novel yet simple approach to decentralized classification
- Reduces total network load compared to current multishot algorithms
- Maintains prediction accuracy comparable to the centralized approach

There is a current body of various decentralized models [Gazula et al., 2018, Saha et al., 2017, Wojtalewicz et al., 2017, Baker et al., 2015], and more specifically, decentralized neural networks [Lewis et al., 2017] and support vector machines (SVMs) [Forero et al., 2010]. However, these models are multishot, meaning they pass statistical information many times during the training process, which can require a great deal of network traffic. The multishot neural network, or decentralized-data neural network (dDNN) [Lewis et al., 2017], requires heavy network traffic at least once every epoch, i.e., one full iteration through the entire dataset during the training process. This is because the dDNN model passes all gradient information from local sites to a centralized location after every epoch, then calculates the average of these gradients and passes the averaged gradients back to the local sites. As neural networks can require many thousands of epochs, the overall network traffic would be unmanageable for neuroimaging data, which can contain hundreds of thousands of features. The same problem occurs for multishot SVMs, which also require a high number of steps in which gradients are passed between local sites.

In this research, we attempt to mitigate these issues for certain classifiers by introducing a singleshot method. Singleshot methods require statistical information to be passed only once, either before or after the local models have been trained. In our case, statistical information is passed to the local sites, and then each site trains separately. The statistical information is an estimated distribution of the local data, comprising the per-feature mean and a covariance matrix of the features. We refer to this model as a decentralized distribution-sampled classifier (dDSC). This use of statistical inference to estimate new samples for decentralized modeling is applied to both neural networks (dDS-NN) and SVMs (dDS-SVM) to show efficacy with multiple classification models. We quantify the data at each local site by building local distributions using a Gaussian mixture model (GMM) and pass these distributions to the remaining local sites, where they are used in training the local models. Each local site combines artificial data sampled from the given distributions with locally available data to train its model. We demonstrate the efficacy of dDS-NN and dDS-SVM on two datasets.
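To make the multishot cost concrete, the following is a minimal sketch of the per-epoch gradient-averaging step of the dDNN as described above. The function names and the random stand-in gradients are illustrative, not taken from the published implementation.

```python
import numpy as np

def average_site_gradients(site_gradients):
    """Central aggregation step: element-wise mean of per-site gradients."""
    return np.mean(np.stack(site_gradients), axis=0)

# Illustrative training loop: every epoch, each of three sites sends a
# gradient vector to the server, which broadcasts back the average.
rng = np.random.default_rng(0)
weights = np.zeros(100)
for epoch in range(10):
    # Stand-ins for gradients computed by local backpropagation.
    grads = [rng.normal(size=100) for _ in range(3)]
    avg_grad = average_site_gradients(grads)
    weights -= 0.01 * avg_grad  # every site applies the identical update
```

Note that this exchange happens once per epoch, so the traffic scales with both the number of epochs and the number of model parameters.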

In the previous multishot models, the dDNN and the consensus-based SVM, high-level statistical information (i.e., gradients) is passed between local sites many times during the training of the models. This requires a high traffic load, and as the number of training iterations increases, the chance of network failure also increases. The dDNN aggregates the local gradients every iteration (or epoch through the data), averages the gradients, and passes these updated gradients back to the local sites. The multishot SVM uses the alternating direction method of multipliers (ADMoM) to accumulate the updating parameters, or the model weights [Forero et al., 2010].

Our approach for singleshot classifiers, dDSC, gathers statistical information about the datasets at the local sites, rather than about the models as in the multishot algorithms, and passes this information between the sites before the models are trained. We use a GMM to estimate the distribution of the local site data for each class. Once the distribution is gathered from the model for each site, it is passed to the other sites. The other sites then draw artificial samples from the remaining sites' distributions and train their own models on both locally available data and the artificial samples. This approach also requires a much smaller amount of network traffic, as the mixture model is transferred once, with a polynomial relationship to the number of input features. This is the case for both the neural network and SVM methods.
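As a rough illustration of this singleshot flow, the sketch below fits one GMM per class at a "remote" site, shares the fitted mixtures (in practice, their weights, means, and covariances), samples artificial data at the receiving site, and trains a classifier on the union of local and artificial data. The toy data and function names are hypothetical; this is a sketch of the scheme as described, not the authors' code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def fit_local_distributions(X, y, n_components=2, seed=0):
    """At one local site: fit a GMM per class and return the fitted models."""
    models = {}
    for label in np.unique(y):
        gmm = GaussianMixture(n_components=n_components, random_state=seed)
        gmm.fit(X[y == label])
        models[label] = gmm  # weights_, means_, covariances_ are what get shared
    return models

def sample_artificial_data(models, n_per_class):
    """At a receiving site: draw artificial samples from another site's GMMs."""
    Xs, ys = [], []
    for label, gmm in models.items():
        X_new, _ = gmm.sample(n_per_class)
        Xs.append(X_new)
        ys.append(np.full(n_per_class, label))
    return np.vstack(Xs), np.concatenate(ys)

# Toy data standing in for the features at a remote site and a local site.
rng = np.random.default_rng(1)
X_remote = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(3, 1, (50, 5))])
y_remote = np.array([0] * 50 + [1] * 50)
X_local = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(3, 1, (50, 5))])
y_local = np.array([0] * 50 + [1] * 50)

remote_gmms = fit_local_distributions(X_remote, y_remote)
X_art, y_art = sample_artificial_data(remote_gmms, n_per_class=50)

# Train on the union of local and artificial data, as in the dDSC scheme.
clf = SVC().fit(np.vstack([X_local, X_art]), np.concatenate([y_local, y_art]))
```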

Gaussian Mixture Model
The EM algorithm begins by, for each class, creating a random Gaussian distribution, i.e., a randomized mean and variance. Then, the probability that each data point belongs to each distribution is computed (the E-step), and the distribution parameters are re-estimated from these membership probabilities (the M-step); the two steps alternate until convergence. A minimal sketch of this procedure is given after the experiment overview below.

The sMRI data are drawn from an aggregated multisite dataset [Potkin et al., 2008, Gollub et al., 2013, Hanlon et al., 2011, Aine et al., 2017]. We tested the models' performances on MNIST in three cases: the data are uniformly and randomly distributed across three sites; three sites have access to only certain classes; and the data are uniformly and randomly distributed across 20 sites. We also tested the models on the sMRI data in two cases: the data are uniformly distributed across four sites at random, and four sites have access to only certain classes. For MNIST, we tested the models using the test set established by the dataset creators.
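The following is a minimal numpy sketch of the EM loop just described, for a one-dimensional two-component mixture. It is illustrative only; the paper's models fit per-class GMMs over image or sMRI feature vectors.

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=100, seed=0):
    """Minimal EM for a 1-D Gaussian mixture: random initialization, then
    alternating E-step (membership probabilities) and M-step (updates)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k)      # random initial means
    var = np.full(k, x.var())       # shared initial variance
    pi = np.full(k, 1.0 / k)        # uniform mixing weights
    for _ in range(n_iter):
        # E-step: probability of each point under each component.
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 300)])
print(em_gmm_1d(x))
```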

In the first dDS-NN experiment, we randomly select 20,000 images from the entire dataset of 60,000 images for each of the three sites. This process is used for the dDNN and dDS-NN models. The centralized model, however, has one site with access to all 60,000 images. All of the neural network models share the same architecture.
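One possible implementation of this uniform random split (function name and seed hypothetical):

```python
import numpy as np

def split_uniform(n_samples, n_sites, seed=0):
    """Shuffle sample indices and deal them out evenly across sites."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(idx, n_sites)

# Three sites of 20,000 MNIST training images each.
site_indices = split_uniform(60_000, 3)
```

The 20-site experiment described later corresponds to split_uniform(60_000, 20), giving 3,000 images per site.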

The third experiment, which was only used to test the dDS-NN and not the dDS-SVM, uses 60,000 images as in the first experiment. The data are processed the same way as in the first experiment, and the models are also of the same architecture. The primary difference is that the data are separated into 20 local sites as opposed to 3, meaning there are a total of 3,000 images at each local site, distributed uniformly at random. Then, as in the previous experiments, the accuracies of the three models are compared.

151
The dDS-SVM model was tested on the same MNIST dataset as was used to test the dDS-NN.

In the first SVM experiment, we randomly and uniformly distributed all of the training data across the three sites.

The sMRI dataset is very large, with over 58 thousand features. As such, with or without statistical inference, the network traffic would be problematic. Due to this, we use a diagonal matrix to store the covariance of each distribution (a cost comparison is sketched below).

In the biased sMRI experiment, the data are distributed unevenly across all four local sites; the goal is to show the model's robustness to extremely biased data. We also use the sMRI data to test the dDS-SVM by uniformly distributing the sMRI data across four sites at random. 10-fold cross-validation was used to test the entire dataset. This was compared to a centralized SVM in which one site has access to the entire dataset.
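The saving from a diagonal covariance follows from the parameter counts: a full symmetric covariance over d features has d(d+1)/2 unique entries, while a diagonal covariance has only d. A quick check with the approximate sMRI dimensionality (the exact feature count is not reproduced here):

```python
# Values transmitted per Gaussian component for d input features.
d = 58_000  # approximate sMRI feature count cited in the text

full_cov = d * (d + 1) // 2   # unique entries of a symmetric d x d matrix
diag_cov = d                  # per-feature variances only

print(f"full covariance:     {full_cov:,} values")  # 1,682,029,000
print(f"diagonal covariance: {diag_cov:,} values")  # 58,000
```

In scikit-learn terms, this corresponds to fitting GaussianMixture(covariance_type="diag") rather than the default "full".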

The first neural network experiment, in which the data are uniformly distributed across all three sites, shows near-identical accuracies between the dDNN and centralized approaches.

In the second neural network experiment, the data are biased in such a way that each site has access to only three of the possible classes. Digit '3' is removed so as to give each site an equal number of classes. From Figure 5, we see that, as in the previous experiment, the dDNN and centralized approaches are almost identical, converging toward 97.8% accuracy. The dDS-NN approach is slightly less accurate, converging toward 95.5% accuracy. The discrepancy between these accuracies and those of the previous experiment is most likely due to the exclusion of digit three; it appears that including digit three makes the MNIST problem slightly more difficult.
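A hypothetical sketch of this biased partition, with digit 3 removed and the remaining nine classes dealt out three per site (the class-to-site assignment shown is illustrative):

```python
import numpy as np

def split_by_class(y, site_classes):
    """Give each site only the samples whose label is in its class list."""
    return [np.flatnonzero(np.isin(y, classes)) for classes in site_classes]

# Digit 3 is excluded entirely; each site sees three of the nine classes.
site_classes = [(0, 1, 2), (4, 5, 6), (7, 8, 9)]
y = np.random.default_rng(0).integers(0, 10, size=60_000)  # stand-in labels
site_indices = split_by_class(y, site_classes)
```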


In the second sMRI experiment of the dDS-NN model, the data are evenly distributed across four sites but are biased in such a way that each site has access to only one of the possible classes. This means that two sites had access to only patients and the remaining two sites had access to only controls.

From Figure 7, we see that, as in the previous experiment, the dDNN and centralized approaches are almost identical, converging toward 72.8% accuracy. The dDS-NN approach converges toward 65.1% accuracy.

When the sMRI data are modeled with an SVM, we uniformly and randomly distribute the data across the four sites, and 10-fold cross-validation is used, covering all data samples. The mean accuracy across all 10 folds for the dDS-SVM model is 67.5%, whereas the mean for the centralized method across all folds is 72%. These results are congruent with the results from the dDS-NN.
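The cross-validation protocol can be reproduced with standard tooling; a minimal sketch for the centralized baseline, with random stand-in data in place of the sMRI features:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))    # stand-in for sMRI feature vectors
y = rng.integers(0, 2, size=200)  # stand-in patient/control labels

scores = cross_val_score(SVC(), X, y, cv=10)  # 10-fold cross-validation
print(scores.mean())
```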

The importance of this work rests in the dDSC model's ability to reduce total network traffic.

Our asymptotic analysis of the models provides a method to quantify the total network load as a function of the number of input features and the number of local sites.
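As a rough illustration, the following back-of-the-envelope cost model (our assumptions, not the paper's derivation) counts the values transferred by each scheme: a length-d gradient travelling site-to-server and back once per epoch for multishot, versus one all-to-all broadcast of per-class GMM parameters for singleshot.

```python
def multishot_values(d, sites, epochs):
    """Assumed cost: a length-d gradient travels site->server and
    server->site once per epoch, for every site."""
    return 2 * d * sites * epochs

def singleshot_values(d, sites, components, diag=True):
    """Assumed cost: each site sends its GMM parameters (means,
    covariances, mixing weights) once to every other site."""
    cov = d if diag else d * (d + 1) // 2
    per_component = d + cov + 1  # mean + covariance + mixing weight
    return sites * (sites - 1) * components * per_component

d, sites = 58_000, 4
print(multishot_values(d, sites, epochs=1_000))   # 464,000,000 values
print(singleshot_values(d, sites, components=4))  # 5,568,048 values
```

Under these assumptions, the singleshot transfer is roughly two orders of magnitude smaller for realistic epoch counts.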

Although the model appears to be valid for many criteria, it does have limitations. As seen in Figure 2, the accuracy of the dDSC model decreases with many more local sites. It should also be noted that small datasets will always be a limiting factor for machine learning, and a certain amount of error is expected with small datasets.

We suggest that future researchers could develop models that are more accurate in general, or at least in the case of many local sites. This could include testing different mixture models or methods of estimating the local distributions.

Figure 1: The three paradigms, from left to right: dDS-NN, dDNN, and a centralized model. In the dDSC model, for every site i, every other site calculates the distribution of its local data and passes the distribution (in the form of a matrix) to site i. Site i then samples data from these distributions and uses this artificial data, as well as the local data, to train its own model. In the dDNN model, each local site trains its own model on the available local data and passes the gradient data to a centralized server. The centralized server then averages the local gradients and passes this average back to the local sites to train the local models. The centralized paradigm uses all possible data in a single model at a central site.