Identification of biological mechanisms by semantic classifier systems

The interpretability of a classification model is one of its most essential characteristics. It allows for the generation of new hypotheses on the molecular background of a disease. However, it is questionable whether more complex molecular regulations can be reconstructed from limited sets of data. To bridge the gap between complexity and interpretability, we replace the de novo reconstruction of these processes by a hybrid classification approach partially based on existing domain knowledge. Using semantic building blocks that reflect real biological processes, these models were able to construct hypotheses on the underlying genetic configuration of the analysed phenotypes. As in the building process, these hypotheses are also composed of high-level biology-based terms. The semantic information we utilise from the Gene Ontology is a vocabulary which comprises the essential processes or components of a biological system. The constructed semantic multi-classifier system consists of expert base classifiers which each select the most suitable term for characterising their assigned problems. Our experiments conducted on datasets from three distinct research fields revealed terms with well-known associations to the analysed context. Furthermore, some of the chosen terms do not seem to be obviously related to the issue and thus lead to new hypotheses to pursue.

Author summary

Data mining strategies are designed for an unbiased de novo analysis of large sample collections and aim at the detection of frequent patterns or relationships. Later on, the gained information can be used to characterise diagnostically relevant classes and to provide hints about the underlying mechanisms which may cause a specific phenotype or disease. However, the practical use of data mining techniques can be restricted by the available resources and might not correctly reconstruct complex relationships such as signalling pathways.
To counteract this, we devised a semantic approach to the issue: a multi-classifier system which incorporates existing biological knowledge and returns interpretable models based on these high-level semantic terms. As a novel feature, these models also allow for qualitative analysis and hypothesis generation on the molecular processes and their relationships leading to different phenotypes or diseases.

Understanding the information hidden in molecular profiles brings new possibilities for the accurate identification of diseases and the selection of individual treatments. It is one foundation of precision or personalised medicine [1]. Molecular profiles typically do not permit any direct inspection in a multivariate setting, as they usually comprise tens of thousands of measurements. Computer-aided systems are paramount for interpretation and categorisation.

Such a computer-aided system is a classification system that assigns known phenotypes, called classes, to molecular profiles. The classification system itself generates the diagnostic rules applied in this process. The corresponding decision criteria are constructed in a training phase according to a set of sample instances. Data-driven classification algorithms therefore generate new hypotheses on the underlying characteristics of a biological phenotype and allow for addressing different research questions. Nevertheless, this ability can be limited by the restricted amount of training samples. Complex molecular interactions might not be detected.

However, information about molecular processes can be extracted from various sources such as biological literature and databases [2]. A continuous challenge is to incorporate this heterogeneous information into existing classification algorithms. This will lead from purely data-driven systems to knowledge-based modelling approaches. An overview of existing approaches can be found in Porzelius et al. [3]. In the context of [...] Johannes et al. extract information from protein-protein interaction networks for the optimisation criterion of their recursive feature elimination algorithm [5]. Another [...] designed a hierarchical system that is guided by the terms of the gene ontology [7]. Taudien et al. utilise semantic terms for aggregating mutation patterns [8]. In this work, we concentrate on semantic multi-classifier systems (S-MCS), which construct an ensemble of term-dependent base classifiers [9]. We extend this approach to incorporate semantic relationships given by an ontology during the training process. This eventually leads to the identification of new biological mechanisms underlying the discrimination process.

Classification is the procedure of categorising an object into one of $k \geq 2$ distinct classes $\mathcal{Y} = \{y_1, \ldots, y_k\}$. These classes are assumed to reflect some kind of semantic interpretation (e.g. carcinoma vs. inflammation). They are predicted according to a vector of measurements $x = (x^{(1)}, \ldots, x^{(n)})^T$ of the object. The corresponding decision rule is described by a classification model $c: \mathcal{X} \rightarrow \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^n$ denotes the measurement or feature space.

The final structure and the properties of a classification model are dependent on the chosen type of concept or function class $\mathcal{C}$ and the chosen adaptation process $l$ to be used on a set of learning instances [...]

May 25, 2018

In this basic version, the training procedure $l$ can be seen as a purely data-driven process in which a classifier is adapted according to an optimisation criterion. Knowledge about the underlying biology is typically not incorporated.
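As a minimal sketch of such a purely data-driven training procedure, the following example adapts a nearest centroid classifier (one of the reference classifiers listed in the abbreviations of Table 2) to a handful of learning instances. The function names, the toy measurements, and the class labels are hypothetical and only serve as an illustration; this is not the authors' implementation.

```python
from collections import defaultdict
from math import dist


def train_ncc(samples, labels):
    """Training procedure: compute one centroid per class from the learning instances."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    # Column-wise mean of all samples that share a class label.
    return {
        y: tuple(sum(column) / len(xs) for column in zip(*xs))
        for y, xs in grouped.items()
    }


def predict_ncc(centroids, x):
    """Classification model: assign x to the class with the closest centroid."""
    return min(sorted(centroids), key=lambda y: dist(centroids[y], x))


# Hypothetical two-dimensional profiles of two classes.
train_x = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (3.8, 4.0)]
train_y = ["inflammation", "inflammation", "carcinoma", "carcinoma"]

model = train_ncc(train_x, train_y)
prediction = predict_ncc(model, (3.5, 3.9))
print(prediction)
```

No biological knowledge enters this procedure: the model is determined by the learning instances and the distance criterion alone.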

After the initial training, the generalisation performance is tested on an independent set of test samples [...] The generalisation ability of a classifier is its accuracy in predicting the class label of an unseen sample correctly: [...]

Interpretability

A second, more fuzzy property of a trained classification model is its ability to provide insight into the characteristics, causes or processes behind a difference between classes. We will call this property interpretability in the following. Interpretability might be seen as a model's ability to generate hypotheses on the underlying data.

This property is partially influenced by the structure and complexity of the chosen concept class $\mathcal{C}$, which determines the syntactical structure of the final decision rule [10]. For example, a logical conjunction can be more easily interpreted than a linear combination [11].

The interpretability of a classifier is also influenced by the interpretability of its input signals and their transformations [12][13][14]. A binary signal (e.g. high/low) often allows an easier interpretation than a real-valued gradient [15]. Nevertheless, many of these techniques do not directly provide a possible interpretation regarding the original experiment. While many algorithms lead to a solution with a clear mathematical structure, they often lack a direct semantic interpretation in the terms of the underlying (biological) processes.

Feature selection is one of the most prominent techniques for increasing the interpretability of high-dimensional profiles [13,16,17]. It is typically applied to reduce noisy or irrelevant measurements [18,19]. The remaining features fulfil the chosen quality criterion and are therefore considered possible (informative) candidate markers.

Technically, feature selection can be seen as a process which maps from the chosen [...] The elements (signatures) $i \in \mathcal{I}$ will be used to indicate the features which pass the data-driven or model-driven process, which incorporates stochastic or heuristic elements. Exhaustive and therefore optimal strategies are typically avoided due to the size of the corresponding search spaces [20].

Although data- or model-driven approaches can improve the accuracy of a [...] An alternative to this approach is the idea of knowledge-based feature selection [9].

Here, it is assumed that a predefined set of interpretable signatures $\mathcal{V} \subseteq \mathcal{I}$ exists. It will be called a vocabulary $\mathcal{V} = \{v_i\}_{i=1}^{|\mathcal{V}|}$ in the following. In this scenario, the problem of constructing a suitable marker signature is altered to a selection problem in which a predefined signature $v \in \mathcal{V}$ or a combination of predefined signatures $V \subseteq \mathcal{V}$ is chosen from a vocabulary. Assuming that $|\mathcal{V}| \ll |\mathcal{I}|$, the use of a vocabulary reduces the corresponding search space to $2^{|\mathcal{V}|} - 1$ elements.
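The size of this reduced search space can be checked with a small enumeration. The sketch below lists every non-empty combination of terms from a toy vocabulary; the term names, the signatures, and the number of features are hypothetical values chosen for illustration.

```python
from itertools import combinations


def term_combinations(vocabulary):
    """Yield every non-empty combination of predefined signatures (terms)."""
    terms = sorted(vocabulary)
    for size in range(1, len(terms) + 1):
        yield from combinations(terms, size)


# Hypothetical vocabulary: term name -> signature (indices of measurements).
vocabulary = {
    "term_a": {0, 3, 7},
    "term_b": {1, 3},
    "term_c": {2, 5, 8, 9},
}
n_features = 10  # assumed number of measurements per profile

candidates = list(term_combinations(vocabulary))
term_search_space = 2 ** len(vocabulary) - 1
feature_search_space = 2 ** n_features - 1

print(len(candidates), term_search_space, feature_search_space)
```

Even in this toy setting, selecting among three terms leaves far fewer candidates than selecting among all non-empty subsets of ten individual measurements.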

The term "knowledge-based feature selection" refers to the known semantic [...] assumed to summarise the components of a more complex structure. In general, we will utilise the word term to denote both a predefined signature and its interpretation. In the context of systems biology, a term might reflect a known molecular signalling pathway [21] or a cellular component [22]. There are pre-formed terms available from [...]

Although we assume that the signatures in a vocabulary are unique, this might not be true for their interpretation. For example, a union of two signatures $v_i \cup v_j$ might cover the signature of a third one. Therefore, we constructed a multi-classifier system that handles the terms and the corresponding measurements individually.

Semantic multi-classifier system

The design of our semantic multi-classifier system (S-MCS) can be seen as an ensemble $\mathcal{E} \subseteq \mathcal{C}_{\mathcal{V}}$ of semantic base classifiers $\mathcal{C}_{\mathcal{V}}$ which are restricted to the signature of an [...] The ensemble itself will be denoted by [...]. The restriction to a specific signature will force a classifier $c_v(x)$ to become an expert in interpreting the signals in $v$ and therefore to become an expert in interpreting the corresponding term. An S-MCS utilising a vocabulary $\mathcal{V}$ will be denoted by S-MCS$_{\mathcal{V}}$.

The overall structure of an S-MCS is shown in Figure 1d). The individual predictions of the semantic base classifiers are merged on a symbolic level by a late-aggregation strategy [23] [...] In this way, the selected signatures are encapsulated during the system's training and prediction phases. The interpretation of the corresponding terms will not be mixed up. More precisely, our ensemble decision utilises a majority vote on the chosen base [...]

During its training phase the S-MCS will gain access to a preselected set of possible candidate terms $\mathcal{V}^* \subseteq \mathcal{V}$. From this set $\mathcal{V}^*$ the terms and the corresponding base [...] The initial set of candidate terms $\mathcal{V}^*$ is constructed according to filter criteria based on the available domain knowledge.
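The following sketch illustrates the general idea of term-restricted base classifiers combined by a majority vote. It rests on several assumptions that are not taken from the text: the base classifiers are nearest centroid classifiers, the terms are ranked by their accuracy on the training samples, and a fixed number of top-ranked terms forms the ensemble. All function names, term names, and data are hypothetical.

```python
from collections import Counter, defaultdict
from math import dist


def restrict(x, signature):
    """Keep only the measurements that belong to the signature of a term."""
    return tuple(x[j] for j in signature)


def train_base(samples, labels, signature):
    """Nearest centroid base classifier that only sees the features in `signature`."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(restrict(x, signature))
    return {
        y: tuple(sum(column) / len(xs) for column in zip(*xs))
        for y, xs in grouped.items()
    }


def predict_base(centroids, signature, x):
    z = restrict(x, signature)
    return min(sorted(centroids), key=lambda y: dist(centroids[y], z))


def train_smcs(samples, labels, candidate_terms, ensemble_size=3):
    """Train one base classifier per candidate term and keep the best-ranked ones.

    Assumption: terms are ranked by training accuracy (ties broken by name).
    """
    scored = []
    for term, signature in candidate_terms.items():
        sig = sorted(signature)
        model = train_base(samples, labels, sig)
        hits = sum(predict_base(model, sig, x) == y for x, y in zip(samples, labels))
        scored.append((hits / len(samples), term, sig, model))
    scored.sort(key=lambda entry: (-entry[0], entry[1]))
    return [(term, sig, model) for _, term, sig, model in scored[:ensemble_size]]


def predict_smcs(ensemble, x):
    """Late aggregation: majority vote over the class labels of the base classifiers."""
    votes = Counter(predict_base(model, sig, x) for _, sig, model in ensemble)
    return votes.most_common(1)[0][0]


# Hypothetical profiles: features 0-2 separate the classes, feature 3 is noise.
samples = [
    (1.0, 1.1, 0.9, 5.0), (1.2, 0.9, 1.1, 1.0), (0.8, 1.0, 1.0, 3.0),
    (3.0, 3.1, 2.9, 1.2), (3.2, 2.9, 3.1, 5.1), (2.8, 3.0, 3.0, 2.9),
]
labels = ["adult", "adult", "adult", "aged", "aged", "aged"]
candidate_terms = {
    "term_a": {0, 1},
    "term_b": {1, 2},
    "term_c": {0, 2},
    "term_noise": {3},
}

ensemble = train_smcs(samples, labels, candidate_terms, ensemble_size=3)
selected_terms = [term for term, _, _ in ensemble]
prediction = predict_smcs(ensemble, (2.9, 3.0, 3.1, 0.0))
print(selected_terms, prediction)
```

Because each base classifier only ever sees the measurements of its own term, the list of selected terms can be read directly as a term-level description of the decision. An odd ensemble size avoids tied votes in the two-class case.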

Knowledge-based term selection

A general vocabulary $\mathcal{V}$ is typically not designed for a particular classification task. It consists of a large collection of widespread terms which can be used to describe several tasks in a certain domain or field. In order to guide the search process of the S-MCS, a subset of suitable terms $\mathcal{V}^* \subseteq \mathcal{V}$ can be pre-selected according to existing domain knowledge.
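One way to realise such a knowledge-based pre-selection is to keep the terms whose gene sets overlap a task-related gene set more than expected by chance. The sketch below uses a hypergeometric tail probability for this purpose. The choice of test, the threshold `alpha`, the omission of a multiple testing correction, and all names and gene identifiers are assumptions made for illustration.

```python
from math import comb


def hypergeom_tail(overlap, term_size, initial_size, universe_size):
    """P(X >= overlap) when `term_size` genes are drawn without replacement from a
    universe of `universe_size` genes that contains `initial_size` initial genes."""
    upper = min(term_size, initial_size)
    favourable = sum(
        comb(initial_size, k) * comb(universe_size - initial_size, term_size - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(universe_size, term_size)


def select_candidates(vocabulary, initial_set, universe_size, alpha=0.05):
    """Return the terms with a significant overlap and their tail probabilities."""
    selected = {}
    for term, genes in vocabulary.items():
        overlap = len(genes & initial_set)
        p_value = hypergeom_tail(overlap, len(genes), len(initial_set), universe_size)
        if overlap > 0 and p_value < alpha:
            selected[term] = p_value
    return selected


# Hypothetical universe of 100 measured genes; the initial set holds 10 of them.
initial_set = {f"g{i}" for i in range(10)}
vocabulary = {
    "term_x": {"g0", "g1", "g2", "g3", "g50"},    # 4 of 5 genes in the initial set
    "term_y": {"g9", "g60", "g61", "g62", "g63"},  # 1 of 5 genes in the initial set
}

selected = select_candidates(vocabulary, initial_set, universe_size=100)
print(sorted(selected))
```

In this toy example only `term_x` passes the filter, since a single shared gene, as for `term_y`, is well within what random overlap would produce.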

In the following, we present a term selection strategy which is guided by knowledge about suitable measurements. We assume that an additional signature $w \in \mathcal{V}$ exists which comprises known signals that are associated with a given classification task. The [...]

In the context of our S-MCS we utilise overrepresentation analysis as a technique for integrating vocabularies or ontologies (Figure 1a). A term is added to the set of candidates if it has a significant overlap with the initial gene set $w$ [...]

Ontological search

An ontological search relies on the graphical representation of the relationships between the terms of a vocabulary $\mathcal{V}$. It requires the availability of an ontology or a semantic network. The structure of the corresponding graph induces a neighbourhood which can be screened for suitable candidate terms later on.
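A minimal sketch of such a neighbourhood screen is given below, taking the direct subcategories of already selected terms as the neighbourhood. The dictionary `children`, the term names, and the function name are hypothetical; a real ontology such as the Gene Ontology would first have to be parsed into this form.

```python
def direct_subcategories(children, selected_terms):
    """Collect the terms that are direct subcategories of already selected terms."""
    found = set()
    for term in selected_terms:
        found.update(children.get(term, ()))
    return found


# Hypothetical ontology fragment: category -> direct subcategories.
children = {
    "biological_process": {"signalling", "ageing"},
    "signalling": {"ras_signalling", "kinase_signalling"},
    "ageing": {"cell_ageing"},
}

# Terms assumed to have been selected in an earlier, knowledge-based step.
selected = {"signalling", "ageing"}

# Candidate set: the selected terms plus their direct subcategories.
candidates = selected | direct_subcategories(children, selected)
print(sorted(candidates))
```

Only one level of the hierarchy is followed here, so more distant descendants stay outside the candidate set unless their parent has itself been selected.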

In the following, we concentrate on ontological graph structures (Figure 1b). Thus, we assume a directed hypernym-hyponym relationship $r(w, v)$ which reflects the hierarchical relationship between a category $w$ and its subcategories $v$. For a given set of already selected terms $W \subseteq \mathcal{V}$ we additionally consider those terms as candidates which are direct subcategories of the terms in $W$ [...]

Our final candidate set for the data-driven term selection $\mathcal{V}^*$ is then constructed from the terms selected by the overrepresentation analysis and the ontological search [...]

Experiments

We have evaluated our S-MCS in three molecular settings. A short characterisation of the datasets can be found in Table 1 [...]

Table 1. Characterisation of the datasets.

Dataset             Ref.   Features   Samples   Class sizes
Ageing              [32]   54613      41        20 / 21
Leukemia            [33]   12559      72        24 / 48
Pancreatic cancer   [34]   54613      78        39 / 39

Gene Ontology

[...]

The results will be discussed for each dataset individually. An overview of the achieved classification results can be found in Table 2 [...]

The dataset is publicly available from the GEO database under accession GSE53890.

Total RNA was isolated from cortical grey matter samples from the frontal pole of post-mortem human brains. It was hybridised to Affymetrix U133 plus 2.0 arrays.

For our analysis of ageing we combined the first two classes into an "adult" class (21 samples) and the latter two into an "aged" class (20 samples). The GO term ageing (GO:0007568) has been chosen as initial set $w$ for the term filtering procedure. The [...]

Table 2. Accuracies of the 10 × 10 CV (mn ± std) for all datasets. Abbreviations: CV, cross validation; mn, mean; std, standard deviation; 1-NN, 1-nearest neighbour; SVM, support vector machine; NCC, nearest centroid classifier; RF-100, random forest with 100 trees; S-MCS, semantic multi-classifier system.

Table 2 shows the results of the 10 × 10 cross validation for the ageing dataset. [...]

As expected due to the common relation to adenylate cyclase, the first two terms have an overlap (9 genes). Additionally, terms one and five (9 genes) and terms four and five (8 genes) overlap. However, these overlaps are only pairwise. This indicates that the classification performance is not related to only a small subset of genes associated with the terms.

For most of these terms evidence for a connection to ageing can be found in

The most frequently selected terms of the S-MCS$_{\mathcal{V}^*}$ are given in Fig 2b). In 80% of all experiments the term protein kinase binding was chosen (332 genes), followed by the term laminin binding (66.0%; 24 genes) and less frequently by the terms signal transduction activity (29.0%; 226 genes) and T cell costimulation (21.0%; 72 genes). In 18.0% of all experiments the term regulation of apoptotic process was chosen (155 genes).

For this dataset more overlaps of single genes occur than in the others, especially between the most frequently selected term protein kinase binding and the other terms. The largest overlaps are between protein kinase binding and signal transduction activity (19 genes), T cell costimulation (13 genes), and regulation of apoptotic process (10 genes). The first two terms contain a high number of genes. This might result in the especially high overlap between the two terms. The overlaps between the other terms are smaller than 10 genes. All sets (except the first one) contain the LGALS1 gene which is hypermethylated in [...] al. [42]. A discussion of signal transduction processes in the context of leukaemia is given by Ihle et al. [43].
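Overlap counts of this kind can be obtained directly from the gene sets of the selected terms. The sketch below computes all pairwise overlaps and the genes shared by every term; the term names and gene identifiers are placeholders and do not reproduce the gene sets of the study.

```python
from itertools import combinations


def pairwise_overlaps(term_genes):
    """Number of shared genes for every pair of terms."""
    return {
        (a, b): len(term_genes[a] & term_genes[b])
        for a, b in combinations(sorted(term_genes), 2)
    }


def shared_by_all(term_genes):
    """Genes that are contained in the gene set of every term."""
    return set.intersection(*term_genes.values())


# Placeholder gene sets of three selected terms.
term_genes = {
    "term_1": {"geneA", "geneB", "geneC", "geneD"},
    "term_2": {"geneC", "geneD", "geneE"},
    "term_3": {"geneD", "geneF"},
}

overlaps = pairwise_overlaps(term_genes)
common = shared_by_all(term_genes)
print(overlaps, common)
```

Comparing the pairwise counts with the set of genes shared by all terms helps to judge whether a small common gene subset could be driving the selection of several terms at once.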

Pancreatic cancer dataset

Badea et al. [34] analysed gene expression data of 36 pancreatic ductal adenocarcinoma (PDAC) tumour patients. They used pairs of tumour samples and normal pancreatic tissue samples from the same patients. The dataset is publicly available in the GEO database under accession GSE15471. Total RNA was hybridised to Affymetrix U133 plus 2.0 arrays. As an initial marker set, an mRNA signature developed from the data published by Gress and co-workers was chosen [44].

The terms most frequently selected by S-MCS$_{\mathcal{V}^*}$ are shown in Fig 2c). The term p53 binding was selected in 51.0% of all experiments (56 genes). Base classifiers for negative regulation of Ras protein signal transduction (24 genes) and negative regulation of signal transduction (18 genes) were chosen in 38.0% and 32.0% of all experiments. The terms fibroblast proliferation (4 genes) and negative regulation of cellular component movement (6 genes) were both selected in 22.0% of all experiments. The terms selected by the classifier system do not overlap.

The role of p53 in tumour development is frequently discussed in the literature [45,46]. The influence of Ras mutations is discussed, for example, by Fensterer et al. [47]. Fibroblasts and fibroblast proliferation are also known to play a role in cancer [48,49]. In the case of pancreatic cancer they play a role in the invasiveness of the cancer [50]. This could also be related to the last term, negative regulation of cellular component movement. Effects on the associated components may increase cell movement, which is a key step in the formation of metastases [51][52][53][54].

In this work, we proposed an ontologically guided multi-classifier system that can incorporate semantic domain knowledge into a data-driven modelling process. Utilising information about the integral parts of a biological process or a cellular component, the system was able to construct interpretable hypotheses on the molecular background of the analysed phenotypes or diseases in the form of a set of GO terms. These terms are a high-level description of the problem and a basis for further research.

In our scenarios, we evaluated the multi-classifier system in three well-structured research fields. The literature corroborated several of the chosen semantic terms, while other terms suggest so far unknown mechanisms. Utilising information about the integral parts of a biological process or a cellular component, the system was able to construct interpretable hypotheses on the molecular background of the analysed phenotypes or diseases, suggesting that the system can be used to guide molecular experiments. One of the central assumptions of systems biology is that common high-level components influence almost all biological processes. Thus, our multi-classifier system can be applied as a general-purpose instrument to many different problems in a broad field of applications.

The research leading to these results has received funding from the European