A neural network-based framework to understand the Type 2 Diabetes (T2D)-related alteration of the human gut microbiome

Identifying microbial markers in the complex human gut microbiome to delineate disease-related microbial alterations is of great interest. Here, we develop a framework combining a neural network (NN) and random forest (RF), which identifies 40 marker species and 90 marker genes from the metagenomic dataset D1 (185 healthy and 183 type 2 diabetes (T2D) samples). Using these markers, the NN model achieves higher accuracy in classifying the T2D-related samples than other machine learning-based approaches. The NN-based regression analysis identifies fasting blood glucose (FBG) as the most significant factor (P<<0.05) associated with the T2D-related alteration of the gut microbiome. Twenty-four marker species that vary little between the case and control samples, and are therefore often neglected by statistics-based methods, shift greatly across different stages of T2D development, implying that the cumulative effect of the markers, rather than that of individual markers, likely drives the alteration of the gut microbiome.

requires massive samples for training. Another reason is that applying the NN model to the microbiome data is still a challenge due to its 'black box' nature, which makes it difficult to demonstrate which input feature plays a decisive role in the output.

Here, we propose a framework combining NN and RF algorithms for identifying the biomarkers linking the gut microbiome with T2D based on the microbial profiles. To demonstrate the utility of our approach, we took advantage of two publicly available independent metagenomic datasets from Chinese diabetes patients and non-diabetic controls (Qin, Li et al. 2012). We first trained an NN model for predicting the disease state of the samples. Further, RF was used to rank the microbial features, and the most important features contributing to the prediction accuracy of the NN model were selected as microbial markers. Using these microbial markers, we finally constructed interaction networks and performed the [...]

Neural network (NN) model training process and comparison

We used D1 samples to train the NN model and determined the parameter setting for the model based on the microbial profiles. There is no standard approach to determining the parameters (e.g., the number of layers and the number of nodes in each layer) of a neural network. A common practice is to set up a range of parameters and choose those with which the NN model performs best. The datasets used in this study contain only hundreds of samples. Therefore, the number of hidden layers was varied from 2 to 5, and the number of nodes in a layer from 5 to 25. The prediction performance (i.e., accuracy) of the NN model with different parameter values was assessed using 5-fold cross-validation (CV) based on the microbial profiles (species level and gene level, respectively) of D1.
In 5-fold CV, all samples of D1 were randomly divided into five equal-sized subsamples; four were used for training and the remaining one for testing. The process was repeated five times, so that each of the five subsamples was used once as the testing set. To assess the performance of the model, we adopted 20 runs of 5-fold CV, resulting in a total of 100 trials for testing. We then chose the parameter settings with the best performance in this study. Rectified linear units (ReLU) were used for the hidden layers, and the softmax function was used in the output layer when we used the NN model for a classification task. The cross-entropy loss function was used for training.
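The repeated cross-validation scheme described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' Keras code: the function names `repeated_kfold` and `mean_cv_accuracy`, the `seed` argument, and the `fit_predict` callable are hypothetical, introduced here for illustration. Because D1 has 368 samples (185 + 183), the five folds cannot be exactly equal, so in this sketch their sizes differ by at most one.

```python
import random


def repeated_kfold(n_samples, k=5, runs=20, seed=0):
    """Yield (train_idx, test_idx) pairs for `runs` independently shuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(runs):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
        folds = [idx[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train, test


def mean_cv_accuracy(fit_predict, features, labels, k=5, runs=20, seed=0):
    """Average test accuracy over all runs * k trials.

    `fit_predict(train_X, train_y, test_X)` is a hypothetical callable that
    trains a model on the training part and returns predicted labels for test_X.
    """
    scores = []
    for train, test in repeated_kfold(len(labels), k, runs, seed):
        preds = fit_predict(
            [features[i] for i in train],
            [labels[i] for i in train],
            [features[i] for i in test],
        )
        correct = sum(p == labels[i] for p, i in zip(preds, test))
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# 20 runs of 5-fold CV on 368 samples give 100 train/test trials.
splits = list(repeated_kfold(368))
print(len(splits))  # 100
```

Each candidate parameter setting (number of hidden layers, nodes per layer) would be scored with `mean_cv_accuracy`, and the setting with the highest mean accuracy kept.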

The code was implemented using the Keras package in Python (3.6). Briefly, let $x = [x_1, x_2, \cdots, x_n]$, where $x_n$ is the abundance of the $n$th marker in one sample; the prediction of its state is as follows: [...] where the remaining symbols in the equation are the model parameters (Table S1 and Table S2), and Ls represents the predicted result of the sample (0 for the non-diabetic, 1 for the diabetic).

[...] within the corresponding interval were calculated and normalized (mapped to (0, 1)).
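The prediction rule for a sample (ReLU hidden layers followed by a softmax output, trained with cross-entropy loss) can be illustrated with a small forward pass. This is a sketch in plain Python rather than the Keras implementation, and it assumes a standard fully connected network. The helper names and the toy weights below are hypothetical; the fitted parameters are those referred to in Table S1 and Table S2.

```python
import math


def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def softmax(v):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def dense(weights, bias, v):
    """Fully connected layer; `weights` holds one row of coefficients per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def predict_proba(x, layers):
    """Forward pass: ReLU after every layer except the last, softmax on the last."""
    h = x
    for weights, bias in layers[:-1]:
        h = relu(dense(weights, bias, h))
    weights, bias = layers[-1]
    return softmax(dense(weights, bias, h))


def predict_label(x, layers):
    """Return the most probable class (0 = non-diabetic, 1 = diabetic)."""
    probs = predict_proba(x, layers)
    return probs.index(max(probs))


def cross_entropy(probs, true_class):
    """Cross-entropy loss for one sample with a known class index."""
    return -math.log(probs[true_class])


# Toy example: 2 marker abundances, one hidden layer of 2 units, 2 output classes.
# These weights are made up for illustration only.
toy_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),   # output layer
]
sample = [0.2, 0.5]
print(predict_label(sample, toy_layers))
```

During training, the loss returned by `cross_entropy` is averaged over samples and minimized with respect to the weights and biases.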

Accuracy, precision, recall, and F1 metrics

The metrics were defined as follows: [...]

[...] those of the NN model (Table 1). Also, they perform worse on both recall and F1 (especially KNN) than the NN model, which means that these machine learning classifiers identify only a low proportion of the T2D samples.
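For reference, the four metrics named in this section can be computed from the confusion counts as sketched below. The sketch uses the standard textbook definitions and assumes that T2D (label 1) is the positive class. The function name `classification_metrics` and the toy labels are hypothetical and are not taken from the study's data.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 1 = T2D, 0 = non-diabetic.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
print(classification_metrics(y_true, y_pred))
```

Recall is the fraction of true T2D samples that the classifier recovers, which is why a low recall indicates that many T2D samples are missed.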

The identified marker species play a decisive role in the T2D-related alteration of the human gut microbiota

We further investigated which species, rather than the gut microbiota as a whole, play a decisive role in classifying the T2D-related samples using the NN model. We used the feature selection [...] suggesting that these species can be used to delineate the T2D-related alteration of the gut microbiota. Thus, we took the selected top 40 species as the marker species (Table S4).

We further employed the independent dataset D2+ to assess the prediction capability of [...] (Table S5). However, with these [...] (Table S6).

We further tested the NN model using the 90 marker genes (trained on the D1 dataset) in [...]

The data (i.e., the profiles of D1 and D2) and the demo of the NN model are available in the GitHub repository (https://github.com/gsgowell/microbial_markers_identification).

Competing interests

The authors declare that they have no competing interests.

Table S4 The list of diabetes-associated gut microbiota species that were selected by our method
Table S5 The list of diabetes-associated gut microbiota species that were selected simultaneously by our method and LEfSe
Table S6 The list of diabetes-associated gut microbiota genes that were selected by our method
Table S7 The functional prediction of the marker genes in KEGG
Table S9 The list of the marker species that could be assigned to previously known [...]

[...] The abundance of the 270 species is used as the feature for the classifiers. Prediction performance metrics were calculated by 20 runs of 5-fold CV for all classifiers. The best value for each prediction performance metric is reported in bold.

[...] The best performance was achieved when the parameters were set to 18 and 9 for the first and second hidden [...]