Single-cell classification using graph convolutional networks

Background: Analyzing single-cell RNA sequencing (scRNAseq) data plays an important role in understanding the intrinsic and extrinsic cellular processes in biological and biomedical research. One significant effort in this area is the identification of cell types. With the availability of a huge amount of single-cell sequencing data and the discovery of more and more cell types, classifying cells into known cell types has become a priority. Several methods have been introduced to classify cells using gene expression data. However, incorporating biological gene interaction networks has proven valuable in cell classification procedures.

Results: In this study, we propose a multimodal end-to-end deep learning model, named sigGCN, for cell classification that combines a graph convolutional network (GCN) and a neural network to exploit gene interaction networks. We used standard classification metrics to evaluate the performance of the proposed method on within-dataset classification and cross-dataset classification. We compared the performance of the proposed method with that of existing cell classification tools and traditional machine learning classification methods.

Conclusions: Results indicate that the proposed method outperforms other commonly used methods in terms of classification accuracy and F1 scores. This study shows that integrating prior knowledge about gene interactions with gene expression using GCN methodologies can extract effective features that improve the performance of cell classification.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04278-2.

In this section, we describe the datasets and the data preprocessing used in this work.

The distances between the cell populations were computed using this matrix. Figure

To investigate the generalizability of the methods, we used the within-dataset classification and cross-dataset classification to evaluate the performance of the methods. In our within-dataset classification, we randomly split each dataset for training and testing the models. We used the same training and testing datasets for all the methods. In our cross-dataset classification study, first, we combined the four human

Almost all the methods perform well on the Xin, Segerstolpe, and Muraro datasets, and there are no significant differences between their performance. This may be due to the well-separated cell populations in these datasets. As can be seen in Figure 2, the distances between clusters are higher in these datasets than in the rest. Interestingly, for the more complex datasets, Zhengsorted and Zheng68K, in which the Euclidean distances between clusters are lower, our proposed method performs significantly better, as can be seen in Table 1.
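The notion of cluster separation used above can be illustrated with a minimal sketch. It assumes that the distance between two cell populations is the Euclidean distance between their centroids in some feature space; the exact matrix and procedure behind Figure 2 are not reproduced here, and the function name `centroid_distances` and the toy data are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def centroid_distances(features, labels):
    """Return {(label_a, label_b): Euclidean distance between class centroids}."""
    groups = defaultdict(list)
    for row, label in zip(features, labels):
        groups[label].append(row)
    # Centroid = per-dimension mean over the cells of one population.
    centroids = {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }
    return {
        (a, b): math.dist(centroids[a], centroids[b])
        for a, b in combinations(sorted(centroids), 2)
    }


# Toy data (hypothetical): two cells per population, two features per cell.
features = [[0.0, 0.0], [0.0, 2.0], [3.0, 4.0], [3.0, 6.0]]
labels = ["alpha", "alpha", "beta", "beta"]
distances = centroid_distances(features, labels)
```

Under this assumption, small pairwise distances would correspond to populations that are harder to tell apart, which is the situation described for Zhengsorted and Zheng68K.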

Note that scPred, scID, scmapcluster, and scmapcell provide a rejection function, which means they can predict a cell class as "unassigned". In our comparison, we computed the accuracy and F1 scores with these unassigned cells included (Table 1 and Table 2). Since our method outputs the probability of cell class assignments, we also provide an additional function to predict a cell class as "unassigned" by setting a prediction threshold. We set the threshold to 0.65, which means that if a cell does not have a prediction probability larger than 0.65, the cell is predicted as "unassigned". We also computed accuracy and F1 scores with the unassigned cells excluded for these four methods and for our method with the "unassigned" function (shown in Additional file 2 Table). Results show that our method has the smallest unassigned rate and the best or near-best accuracy and F1.
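The thresholding rule described above can be sketched as follows. This is an illustrative sketch, not the sigGCN implementation: the function names and the toy probabilities are hypothetical, and only the rule itself (a cell is "unassigned" unless its top class probability is larger than 0.65) is taken from the text.

```python
def assign_with_rejection(probabilities, class_names, threshold=0.65):
    """Label each cell with its most probable class, or "unassigned"."""
    predictions = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)
        # Reject when the top probability is not larger than the threshold.
        predictions.append(
            class_names[best] if row[best] > threshold else "unassigned"
        )
    return predictions


def unassigned_rate(predictions):
    """Fraction of cells that were rejected."""
    return predictions.count("unassigned") / len(predictions)


def accuracy_on_assigned(predictions, truth):
    """Accuracy computed only over cells that received a class label."""
    kept = [(p, t) for p, t in zip(predictions, truth) if p != "unassigned"]
    return sum(p == t for p, t in kept) / len(kept) if kept else float("nan")


# Toy class probabilities for four cells and two classes (hypothetical values).
probabilities = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.65, 0.35]]
truth = ["alpha", "beta", "alpha", "alpha"]
predictions = assign_with_rejection(probabilities, ["alpha", "beta"])
rate = unassigned_rate(predictions)
accuracy = accuracy_on_assigned(predictions, truth)
```

In this toy example the second and fourth cells are rejected; the fourth has a top probability of exactly 0.65, which is not larger than the threshold.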

Since the Zhengsorted dataset is the one that has the ground truth, we provided more details on the performance of the methods using this dataset in this section. The confusion matrix of the ten classes using our proposed method, sigGCN, is shown in Figure 4. The confusion matrices of the other scRNAseq classifiers are shown in Additional file 1: Figures S7-S8, and those of the conventional methods are shown in Additional file 1: Figure S9. As can be seen in Figure 4,

To further investigate the performance of the methods, we performed ROC analysis for each class using the Zhengsorted dataset. Figure 5(a) shows the ROC curves for each class using our proposed method.

We also compared our model with the other methods in terms of AUC and the ROC analysis of each class.
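A per-class AUC of the kind discussed here can be sketched in a one-vs-rest fashion. The sketch below is an assumption about how such an analysis could be set up, not the code used for Figure 5: it uses the equivalence between the AUC and the probability that a randomly chosen cell of the class receives a higher score than a randomly chosen cell of any other class, and all names and values are hypothetical.

```python
def one_vs_rest_auc(scores, labels, positive_class):
    """AUC for one class against all others via pairwise comparison.

    Equals the probability that a random positive cell scores higher
    than a random negative cell, with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == positive_class]
    neg = [s for s, y in zip(scores, labels) if y != positive_class]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative cell")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_class_auc(probabilities, labels, class_names):
    """Map each class name to its one-vs-rest AUC."""
    return {
        name: one_vs_rest_auc([row[k] for row in probabilities], labels, name)
        for k, name in enumerate(class_names)
    }


# Toy predicted probabilities for five cells and three classes (hypothetical).
probabilities = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
]
labels = ["a", "a", "b", "b", "c"]
auc = per_class_auc(probabilities, labels, ["a", "b", "c"])
```

The pairwise formulation is quadratic in the number of cells, so it is only suitable for small illustrations; a sorting-based computation would be preferable for datasets of realistic size.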

The ROC analysis for class 4 (CD4+/CD25 T Reg) using all the methods is shown in Figure 5

Segerstolpe datasets are all from the human pancreas, we used these four datasets for the validation. The common cell types among these four datasets are alpha, beta, delta, and gamma, so we extracted these four cell types from each dataset before combining them. Before combining the datasets, we preprocessed the data using a log-transformation and normalized each dataset by min-max scaling to bring the four datasets to the same scale. We ran four experiments, and in each experiment we used three of the four datasets as the training dataset and the remaining one as the testing dataset. Additional file 1: Figures S11-S14 show the performance of the methods in terms of accuracy, median F1, median precision, and median recall score.

The accuracy results are shown in Table 3, and the median F1 scores are shown in Table 4.
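The cross-dataset protocol (log-transformation, per-dataset min-max scaling, and training on three datasets while testing on the fourth) can be sketched as follows. Several details are assumptions made only for illustration: the log-transformation is taken to be log(1 + x), the min-max scaling is applied over all values of a dataset rather than per gene, and the dataset keys `D1` to `D4` and the toy matrices are placeholders.

```python
import math


def log_min_max(matrix):
    """Apply log(1 + x), then min-max scale the whole dataset to [0, 1]."""
    logged = [[math.log1p(v) for v in row] for row in matrix]
    lo = min(min(row) for row in logged)
    hi = max(max(row) for row in logged)
    span = hi - lo
    if span == 0:
        # Constant matrix: map every entry to 0 to avoid division by zero.
        return [[0.0 for _ in row] for row in logged]
    return [[(v - lo) / span for v in row] for row in logged]


def leave_one_dataset_out(datasets):
    """Yield (training dataset names, held-out dataset name) pairs."""
    names = list(datasets)
    for held_out in names:
        yield [n for n in names if n != held_out], held_out


# Toy expression matrices (hypothetical): rows are cells, columns are genes.
datasets = {
    "D1": [[0, 1], [3, 7]],
    "D2": [[2, 2], [5, 0]],
    "D3": [[1, 4], [0, 9]],
    "D4": [[6, 0], [1, 1]],
}
scaled = {name: log_min_max(m) for name, m in datasets.items()}
folds = list(leave_one_dataset_out(scaled))
```

Scaling each dataset separately, as the text describes, means that no statistics of the held-out dataset leak into the training datasets.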