Constructing early diagnosis model of colorectal cancer based on expression profile

purpose In order to break through the restrictive factors such as the fecal occult blood test (FOBT) in the routine detection of bowel cancer, which is susceptible to diet and drugs, and the high cost and inconvenience of microscopy, Seeking a possible FOBT alternative. Methods An error back propagation neural network (BPNN) algorithm was used to construct a CRC diagnosis model based on expression profiles. Results The accuracy of the model on the training and test sets is 0.943 and 0.935, respectively. AUC all reached above 0.95. Conclusion The CRC molecular detection model based on expression profiles provides a possible alternative to FOBT. It provides a new approach and method for the clinical diagnosis of bowel cancer.


Differentially Expressed Gene Screening
Differentially expressed gene analysis is mainly based on GSE39582 and GSE41258 data, because both are affymetrix platforms and the sample type is tissue. The limma (version 3.8) tool was used to identify differentially expressed genes (DEGs) in normalized and tumor samples of GSE39582 and GSE41258 after homogenization. Genes with fold change more than 2 times and FDR (BH adjusted Pvalue) <0.05 were taken as DEGs. TCGA's COAD and READ data are of the NGS type. The read count value of the transcript was analyzed by DESeq2, and the FDR <0.05 DEGs threshold was also taken.
Before the above differential expression analysis, the similarity of the samples was evaluated on the GEO and TCGA datasets, and the correlation coefficients between the samples were calculated. The results show that the tumor and normal samples from different sources have high internal consistency (S1_Fig).
Both the heatmap and volcano map expressed by DEGs were constructed using R software.

Functional enrichment analysis
We use clusterProfiler (version 3.8) package to perform DEA on the biological process (BP) of Gene Ontology (GO), cellular component (CC), molecular function (MF) and KEGG pathway enrichment analysis. Take q value <0.05 as the threshold for significant enrichment. The dotplot of clusterProfiler displays the enrichment result.

protein interaction analysis
Using the protein interaction (PPI) information provided by the STRING database ((https://stringdb.org/), we built a PPI network of DEGs, retaining PPI information with confidence socre> 0.9, and using Cytoscape (version 3.7.1) Show PPI network. Hub gene analysis uses Cytoscape's NetworkAnalyzer plug-in for analysis, calculates the connectivity degree of each gene (node), and ranks genes according to the connectivity degree. For genes that are significantly up-regulated and significantly down-regulated, construct the above network and take the degree respectively The largest gene is determined as the hub gene, and finally 2 hub genes are obtained. The PPI network module analysis uses the MCODE tool and the parameters take the default value. The GO and KEGG analysis of the genes in the module also use the clusterProfiler tool.

Construction of a neural network-based diagnostic model
Using the error back propagation neural network (BPNN) algorithm, we constructed a CRC diagnosis model based on hub genes. First, randomly group the Normal (healthy), Mucosa, and CRC samples of the GSE44076 dataset, set seed = 12345, and divide the 246 samples into a training set and a testing set evenly. The main parameters of the BPNN algorithm are the learning rate, the lambda of the regular term coefficient, the number of hidden layers, and the number of neurons included in the hidden layer. In order to find the optimal parameters, a grid search method is used to evaluate the performance of the model under different parameter combinations. Since our target value is a categorical variable, the accuracy of the model prediction is used as the model's judgment index. Accuracy is calculated as follows: Finally, the model with the maximum training set and testing set accuracy (training set accuracy + testing set accuracy-1) is the optimal model. The model parameter learning rate = 0.006, lambda = 6e-04, hidden layer = 10 neurons. To avoid biasing the model by random grouping, we used the bootstrap method to calculate the accuracy (S12_Table) of the training set and testing set of the model under 100 random samples.

Statistical Analysis
Statistical analyses were performed using R (version 3.5.2) software. Student t-test was used to test the significance of differences in gene expression levels of paired samples, and Wilcox rank test was used to perform a two-group significance test of gene expression levels of unpaired samples. The Kruskal-Wallis rank test was used for the significance test of two or more groups, and the FDR was calculated using the BH-method. In this study, unless otherwise specified, *** indicates p <1e-5, ** indicates p <0.01, and * indicates p <0.05.

Analysis process
In this study, the gene expression profile data of normal and tumor samples provided by the GSE39582 and GSE41258 datasets were first used to calculate the differentially expressed genes (DEGs) of the two using the limma tool. The DEGs common to both were used for subsequent verification. Using the TCGA's COAD and READ data sets, we verified the identified DEGs to further determine the reliability of our DEGs. We then commented on the possible functions of DEGs, including participating biology processes (BP) and pathways. Analysis of protein interactions allows us to have a deeper understanding of changes in cellular pathways (signaling pathways, metabolic pathways) that may be involved in the transition from normal to tumor.

2.2(DGEs)Identification of differentially expressed genes
Using the limma tool, we analyzed the differentially expressed genes in the normal and tumor

DEGs Functional Analysis
In order to further study the functions of these DEGs, we performed GO function annotation on 270 DEGs. Due to the significant difference in up-and down-regulated gene expression patterns, we analyzed the function of up-and down-regulated genes, respectively. We see that genes that are up-regulated in the tumor are mainly involved in biological processes related to the extracellular matrix and extracellular structure (Fig2A, S4_Table), while the genes that are down-regulated are involved in biological processes that are significantly different from up-regulated genes, mainly related to ion Detoxification is related to stress response (Fig2B, S5_Table). Analysis of the KEGG pathway found that genes that were significantly up-regulated in the tumor were significantly enriched in the ECM-receptor interaction, focal adhesion, and PI3K-Akt signaling pathway (Fig2C, S6_Table), and these pathways are importantly related to tumor formation and progression. The down-regulated genes were significantly enriched in the pathways such as Fatty acid degradation, Glycolysis / Gluconeogenesis (Fig2D, S7_Table), which indicates that these genes involved in fatty acid and glucose metabolism are inhibited in tumor cells.

GO enrichment results for shared up-regulated genes；B：GO enrichment results for shared
down-regulated genes；C：KEGG pathway enrichment results for shared up-regulated genes； D：KEGG pathway enrichment results for the shared down-regulated genes.

DEGs Interaction Network and Hub Gene Analysis
The 90 DEGs up and 180 DEGs down get 715 and 1089 PPI network edges, respectively. These edges have a confidence score> 0.9. Hub gene analysis found that the CCND1 and FOS genes had the highest degree of up-regulation and down-regulation in the DEGs network and were significantly higher than other genes (43/54, S8_Table, S9_Table). The module analysis can significantly divide the upregulated DEGs network into 3 subclusters, which contain 25, 12, and 5 genes, respectively. Among them, subcluster1 is closely related to cancer occurrence, and the functions of subcluster2 and subcluster3 are unknown (Fig4A, S8_Table). The down-regulated DEGs network can be significantly divided into 5 sub-networks, among which subcluster1 is related to glycolysis / gluconeogenesis metabolism, subcluster2 is related to ion metabolism, subcluster3 is related to bile secretion, and subcluster4 is related to nitric biosynthesis process (Fig4B, S9_Table).

Fig4. Analysis of Differentially Expressed Gene Interaction Networks A：Module Analysis and
Function Notes for Upgrading DEGs Network； B：Modified the module analysis and function notes of DEGs network.

2.5DEGs expression analysis in independent validation set
Using expression data of intestinal cancer (colon and rectal adenocarcinoma) provided by TCGA, we verified the expression of 270 DEGs. Since TCGA's CRC expression data is of RNA seq type, we used the read count values corresponding to these DEGs to compare the overall expression of up-and down-regulated genes on normal and tumor. It is significantly higher than normal, and the downregulated genes are also significantly lower than normal on the tumor (Fig5A-B). These are highly consistent with our results based on the GEO chip expression data. Further testing the expression levels of each of the up-and down-regulated genes in tumor and normal, it was found that in COAD and READ samples, 99% (261/263) and 93% (244/263) of the gene expression were significant. The difference (FDR <0.05, Fig5C-D, S10-S11_Table), which further shows the reliability of the DEGs we identified.

Fig5. The expression of differentially expressed genes on the TCGA CRC dataset. A：Up-and down-regulation of overall gene expression levels in tumor and normal samples of COAD； B： Up-and down-regulation of overall gene expression levels in READ's tumor and normal
samples； C：Differential expression test of up-and down-regulated genes in tumor and normal samples of COAD； D：Up-regulated and down-regulated genes were tested for differential expression in tumor and normal samples of READ. 3 In conclusion 1. This study used the GSE39582 and GSE41258 data sets from GEO to identify a group of differentially expressed genes between normal and CRC. A total of 270 DEGs were obtained on two sets of data sets from different sources, 90 of which were in CRC samples. Medium up-regulated genes and 180 down-regulated genes in CRC samples.

Constructing a neural network-based diagnostic model
2. TCGA database's CRC (colon adenocarcinoma + rectal adenocarcinoma) independent data set was used to verify 270 DEGs. Among them, more than 90% of the genes showed differential expression in normal and tumor samples, and the expression patterns were up-and down-regulated Also consistent. This shows that our DEGs filtered based on the GEO dataset are reliable.

3.
The functional annotation of differentially expressed genes found that genes that are upregulated in the tumor are mainly involved in biological processes related to the extracellular matrix and extracellular structure, while down-regulated genes are mainly related to the detoxification and stress response of ion. Pathway enrichment analysis shows that the pathways involved in upregulated