Pan-cancer in silico analysis of somatic mutations in G-protein coupled receptors: The effect of evolutionary conservation and natural variance

G protein-coupled receptors (GPCRs) form the most frequently exploited drug target family, moreover they are often found mutated in cancer. Here we used an aggregated dataset of mutations found in cancer patient samples derived from the Genomic Data Commons and compared it to the natural human variance as exemplified by data from the 1000 Genomes project. While the location of these mutations across the protein domains did not differ significantly in the two datasets, a mutation enrichment was observed in cancer patients among conserved residues in GPCRs such as the “DRY” motif. We subsequently created a ranking of high scoring GPCRs, using a multi-objective approach (Pareto Front Ranking). The validity of our approach was confirmed by re-discovery of established cancer targets such as the LPA and mGlu receptor families, and we identified novel GPCRs that had not been directly linked to cancer before such as the P2Y Receptor 10 (P2RY10). As a proof of concept, we projected the structurally investigated mutations in the crystal structure of the C-C Chemokine (CCR) 5 receptor, one of the high-ranking GPCRs previously linked to cancer. Several positions were pinpointed that relate to either structural integrity or endogenous and synthetic ligand binding, providing a rationale to their mechanism of influence in cancer. In conclusion, this study identifies a list of GPCRs that are prioritized for experimental follow up characterization to elucidate their role in cancer. The computational approach here described can be adapted to investigate the roles in cancer of any protein family. Author summary Despite cancer being one of the most studied diseases due to its high mortality rate, one underexplored aspect is the association of certain protein families with tumor pathogenicity. We focused here on the G protein-coupled receptors family for three reasons. Firstly, it has been shown that this is the second most mutated class of proteins in cancer following kinases. Secondly, this family has been extensively studied resulting in a wide availability of experimental data for these proteins. Finally, more than 30 % of the drugs currently in the market target its members. For these receptors, we explored the mutational landscape across cancer patients compared to healthy individuals. Our findings show the existence of cancer-related alteration patterns that occur at conserved positions. Additionally, we computationally ranked these G protein-coupled receptors on their importance in the pathogenesis of cancer based on multiple objectives. The result is a list of recommendations on where to focus next. These results suggest that there is room for repurposing existing therapies for cancer treatment while also highlighting the risk of potential interactions between cancer treatments and common drugs. All in all, we present a window of opportunity for new targeting strategies in oncology for G protein-coupled receptors.


69
Cancer is the second leading cause of death globally [1]. Research towards this multifactorial 70 disease has expanded our knowledge significantly over the last two decades [2,3]. Recently,71 results from these endeavors have been condensed in the form of public databases containing 72 patient-derived data [4]. Cancer is typically the result of compounding mutations that transform 73 healthy cells to malignant ones [5]. Previous work involving large scale mutational analysis 74 picked up G Protein-coupled receptors (GPCRs) as the second most mutated class of proteins 75 in the context of cancer [6]. Cancer cells are driven to proliferate and avoid the immune system. 76 GPCRs have multiple functions in this process from increased growth (early stage) all the way 77 to metastasis (late stage) [7]. Thus, any anomalies in GPCR functioning might be related to 78 cancer growth. Another interesting property of GPCRs is that they are the most common drug 79 target family with around 35% of drugs acting through a GPCR [8], providing a diverse set of 80 molecular tools to potentially combat cancer. to also be involved in ligand recognition and activation, whereas the intracellular part of the 88 receptor is linked to G protein recognition and activation. Finally GPCRs contain an N-and C-89 terminus which are also relatively little conserved [10,11]. To denote the residues in GPCRs in 90 a comprehensive way, we use Ballesteros-Weinstein (BW) numbering [12]. BW is mainly 91 restricted to the TM domains and consists of two parts, i.e. the first number is the TM where 92 this residue is found, and the second number is relative to the most conserved residue in that 93 5 TM. The most conserved residue receives number 50, and the number goes down for residues 94 towards the N-terminus and up for residues towards C-terminus.

96
In previous work, knock-down studies have been performed on several proteins to identify their 97 role in the context of cancer, but these studies were typically embarked upon after prior 98 identification of the protein's role in cancer [13][14][15]. One of the main reasons these in vivo 99 studies are done is to identify whether a mutation is either a driver or a passenger mutation, 100 where the first provides a selective growth advantage, and thus promotes cancer development, 101 while the latter has occurred coincidentally and is thus generally of less interest. Moreover, 102 these studies provide insight whether a driver mutation is located on either an oncogene or a 103 tumor suppressor gene [16].

105
In the current work, we focused on GPCRs in the context of cancer by using patient-derived 106 data sets and specifically looked at trends and mutational patterns in this protein family. We 107 performed a deeper investigation into several "motifs", parts of the GPCR sequence that are 108 conserved that contribute most to the stability and function of the GPCR [17][18][19]. Moreover, 109 we provide a list of GPCRs with known small molecule ligands (in some cases approved drugs), 110 ranked by relative interest for follow-up using multi-objective ranking. This ranking 111 incorporates mutational count, locations of mutations in regions of interest, availability of in-112 house expertise, and ability to perform virtual screening (as performed by QSAR). Finally, we 113 exemplified our findings in a more in-depth analysis on C-C chemokine receptor type 5 (CCR5) 114 to show the feasibility of our approach.

116
Overview of datasets 117 Several filtering steps were applied to both the GDC and 1000 Genomes dataset including 118 constraints to missense mutations and to mutations in GPCRs in residues defined in the 119 GPCRdb alignment. The mutation analysis was done for all unique missense mutations in 120 GPCRs, while for the Pareto optimization the datasets were enriched with ChEMBL data for 121 those GPCRs for which such data were available. The corresponding numbers are shown in 122 The GDC dataset is larger compared to the 1000 Genomes set based on patient count, but both 126 are in the same order of magnitude when looking at the amount of missense mutations.

127
Nevertheless, for better comparison in our analyses, the fraction of mutated residues per dataset 128 was used instead of absolute mutation count to correct for the absolute difference in data points.

132
A two-entropy analysis (TEA) was performed on our dataset as was done previously with slight 133 modifications [19,20]. Key to the TEA approach is that for each alignment position the Shannon 134 entropy is calculated both within a GPCR subfamily and within all GPCRs, with the difference 135 between these indicating the residue function in the protein family and superfamily. From this 136 type of analysis multiple interesting groups were identified, in particular residues relevant for 137 receptor function such as activation (type A), and residues relevant for ligand recognition (type 138 L). The former is made up by positions with a low Shannon entropy both within GPCR 139 subfamilies and for the entire GPCR superfamily, indicating high conservation in general and 140 within the subfamily. This high conservation has been linked to their involvement in GPCR-141 conserved working mechanisms [20]. The latter (type L) is made up by residues that are 142 conserved within subfamilies, yet are not so much conserved within the GPCR superfamily.

143
Hence, these are typically associated with ligand recognition, which is specific and conserved 144 within a given subfamily. Type L residues are represented in the top left corner in Figure 1  In Figure 1, we colored the data points based on the frequency of absolute mutation counts 165 found per Ballesteros-Weinstein GPCR aligned position in cancer patients in the GDC dataset.

166
The rest of aligned positions without a Ballesteros-Weinstein label is represented in Figure S1. 167 We defined residues with a high mutation frequency as those above the 75 th percentile in the 168 distribution of mutation counts by position. Conversely, residues with a low mutation frequency 169 were defined as those under the 25 th percentile. The middle mutation frequency are the 170 remaining data points. From Figure 1, it follows that absolute mutation count is (anti)correlated 171 with entropy. We observe a trend where the more conserved type A residues (bottom left corner 172 of the graph, low entropy) have a higher mutation rate in cancer compared to the less conserved 173 residues (top right corner of the graph, high entropy). We illustrate this with the mean ± SD is supported by the fact that from the type A residues highlighted in Figure 1, the higher 186 mutation frequencies are associated with the most conserved positions in TM domains 3, 4, 6, 187 and 7 (i.e. 3.50, 4.50, 6.50, and 7.50). Three of these (i.e. 3.50, 6.50, and 7.50) are interesting 188 positions, since they are part of the "DRY" (TM3), "CWxP" (TM6), and "NPxxY" (TM7) 189 10 conserved GPCR functional motifs. The high amount of mutations in residues of these motifs 190 is remarkable and will be investigated further in section 'Mutation patterns within functionally 191 conserved motifs'. Overall, cancer mutation frequency is correlated with individual residue 192 conservation as we initially noted from Figure 1. We therefore investigated groups of residues 193 as defined by GPCR domains in order to further explore cancer mutation patterns.

195
Mutation rates over GPCR domains 196 We hypothesized that mutations associated with altered function in the context of cancer, would The rescaled average number of mutations per position over domains represented in Figure 2B 232 allow us to more accurately describe the differences between protein domains and datasets. In suggest that the mutations in ICL1 are context-specific and need to be examined for every 245 GPCR individually. However, the effects of these mutations may be limited [22][23][24]. We 246 observed that all TM domains, all ECL loops, and ICL2-4 demonstrate a slightly higher 247 mutation rate in the cancer samples, although for TM1 and TM4 the difference is minimal.

248
Conversely, N-terminus, ICL1, and C-terminus demonstrate a slightly lower mutation rate in 249 the cancer samples. From the analysis of this data, we concluded that some domains may be 250 more amenable to mutation in the context of cancer, but that the high diversity of the GPCRs 251 studied and their diverse roles obscure a clear conclusion. To further investigate these incipient 252 mutation patterns in protein domains, we proceeded to the analysis of previously identified 253 motifs that have a conserved function in GPCRs and that were also highlighted in our two-254 entropy analysis.  Genomes datasets of conserved motifs found in GPCRs. "DRY", "CWxP", and "NPxxY" motifs are analyzed along with their 271 "extended" version, which includes three residues before and after the motif, as found in Table 2 in mutation rate visible. In the extended "DRY" motif, this effect is smaller but still visible. For 282 both "CWxP" and "NPxxY" the rate in the GDC dataset is comparable to the whole sequence 283 rate, whereas it is also lower in the 1000 Genomes dataset for these two motifs. In the extended 284 motifs of "CWxP" and "NPxxY", this trend is still observed. An absolute count of the mutations 285 found per residue in the aforementioned motifs in both the GDC dataset and the 1000 Genomes 286 dataset is shown in Figure S4. From this, we conclude that a trend of higher mutation rates is 287 present within highly conserved motifs in the GDC dataset compared to the average mutation 288 rate, which is not observed in the 1000 Genomes dataset. Moreover, a pattern is observed that 289 the mutation rate in conserved motifs is lower in the 1000 Genomes set compared to the GDC 290 set, confirming their essential role and conservation. To gain further insights into the mutation 291 rate and trends identified here we selected the most mutated individual positions in the GDC 292 dataset ( Figure 4). A count overview of unique GPCR cancer mutations for  Weinstein positions is provided in Figure S5, and an overview of the substitutions found in all 294 of the mutations is provided in Figure S6.  and "NPxxY" motifs. These three positions also showed a higher mutation rate in the two-305 entropy analysis. In addition, residue 3.53, which is part of the extended "DRY" motif, also 306 shows up as highly mutated. The fact that two residues of the "DRY" motif are some of the 307 most mutated in cancer could explain why this motif shows the biggest enrichment in the GDC 308 dataset in Figure 3. On the contrary, no residues of the "CWxP" motif are included in Figure 4, Having confirmed that patterns can be identified in GPCR mutations in the cancer context, we 315 ranked GPCRs for experimental follow-up. For each GPCR, the absolute mutation count was 316 divided by receptor length, to provide a mutation rate for each receptor (a higher mutation rate 317 yielding a lowerbetterrank). To identify patterns within GPCR families (as classified by    The first front in the Pareto optimization is considered "dominating", which means that this set 362 of GPCRs has no GPCR that scores better in the properties. For the remaining (i.e. dominated) 363 data points, a second front can be calculated, with GPCRs that score worse than those in the 364 first front but better than the rest of the solutions. Therefore, we used the first and second fronts 365 for a subsequent ranking based on crowding distances between the receptors ( Figures 6A and   366 6B, respectively). Crowding distances are a measure of how dense the environment is. Denser 367 environments mean more balance in the objectives and thus more interesting GPCRs. As the 368 crowding distance can go up to near infinite, we used a cut-off at a value of 10. In Figure 6A, the 15 GPCRs from the best scoring (first) front are shown, which translate to the 377 GPCRs with the most desirable scores in the combined objectives of the Pareto optimization. 378 We demonstrate that GPCRs previously linked to cancer show up in the first front alongside 379 others that have not been thoroughly investigated yet. Hence, our list can be seen as a list of 380 potential candidates for follow-up experimental research. Twelve of these GPCRs (80%) have 381 been identified in literature as related to cancer (red bars in Figure 6A). The second Pareto front 382 ( Figure 6B), reveals a list of 19 GPCRs, from which 14 (74%) have been previously linked to 383 cancer (red bars in Figure 6B). Out of the cancer-related receptors in our analysis we selected  In this study, we performed a comprehensive comparison of mutations found in cancer patients 414 (GDC dataset) versus mutations found in natural variance (1000 Genomes dataset) in GPCRs. 415 We followed this up by investigating several highly conserved motifs for an increase in   We also studied mutation distribution after aggregating residues by protein domains rather than  A closer look at the more functionally conserved motifs of GPCRs showed a clearer pattern.

452
The higher mutation pressure observed in the GDC data compared to the 1000 Genomes data 453 to avoid these motifs, and especially the "DRY" motif ( Figure 3), drove us to speculate that 454 changes in these positions have a very high chance of disabling receptor function. Thus, this 455 might not be tolerated in healthy tissues but can be advantageous to cancer development. For 456 "DRY" mutations, it has been shown that G protein coupling and recognition can be decreased 457 which can reduce binding affinity of drugs [17,29,31,56]. For both mutations in "DRY" and 458 "NPxxY" it has been shown that a decrease in ligand-receptor complex stability may occur, 459 decreasing the response from the GPCR [18,27,28]. Thus, any mutations found in these motifs 460 can have an impact on the signal transduction of an endogenous ligand or the therapeutic effect 461 26 of a small molecule drug. In fact, these motifs have been shown to be collectively involved in 462 a conserved Class A GPCR activation pathway [57]. In practice, however, the effects of 463 mutations found in these motifs have been shown to cause instability or loss of function in some 464 GPCRs, but increased expression or activity in others [17,24,26,31,56,[58][59][60]. Recently, in a complementary extensive study by Wu et al (2019) [81] the TCGA dataset was 511 used to identify significantly mutated GPCRs in cancer. Compared to their study, we elaborate 512 on our findings through a motif analysis of highly conserved residues in GPCRs, a link to 513 positional entropy, and a link to structural information (i.e. analyzing the CCR5 chemokine 514 receptor). Moreover, in our analysis we included the availability of chemical tools to study the 515 selected GPCRs, as exemplified by our QSAR models. Recently, we have published an analysis 516 of another GPCR, the Adenosine A2B receptor, for which cancer-related somatic mutations were 517 prioritized based on a structural analysis as presented here [46]. There we used a yeast system 518 to explore the effect said cancer-related mutations have on receptor function directly and found 519 that there is a complex pattern of activation modulation (increase, decrease, or disable). Similar 520 approaches could be used to experimentally validate the relevance in cancer of somatic 521 mutations in GPCRs prioritized in this work.  We conclude from our study that mutations found in GPCRs related to cancer are in general 536 weakly correlated to specific domains in the protein or evolutional conservation. Rather we 537 conclude that these are highly context dependent (cancer type, tissue type). However, we do 538 demonstrate that there is a higher mutational pressure in conserved motifs (i.e. "DRY", 539 "CWxP", and "NPxxY") in cancer patients (as shown in the GDC set) compared to healthy 540 individuals. We observe a correlation between our mutational analysis and empirical findings 541 on the role of several receptors in the cancer process. Moreover, we show that the role and 542 mechanism of specific mutations can be elucidated using structural analysis as an intermediate connected by a complex network of primary (PK) and foreign (FK) keys built to optimize the 561 storage space and query processing. All PKs are unique numerical values. Some data fields (i.e. 562 gene expression data) contain analyzed data derived from GDC raw data files. A more extensive 563 description of the database architecture, as well as the analyses performed and the end-to-end 564 mapping strategy is available in the supplementary data. For this project, we used data on 565 somatic missense mutations found in a diverse set of cancer types, to which we will refer as the 566 "GDC" data set.  identifies the TM where this residue is found, and the second number is relative to the most 605 conserved residue in that TM. The most conserved residue is defined to be position 50, with 606 downstream positions receiving a lower number (towards the N-terminus) and upstream 607 positions receiving a higher number (towards the C-terminus). When discrepancies in the BW 608 number were found in the alignment, the most common label was used.  Figure S8 shows the results of our re-implementation in the synthetic dataset provided in 615 [19]. Hereafter, we renamed "Total entropy" as "Rescaled Shannon entropy" and "Average 616 entropy" as "Average entropy across families" for clarification. Firstly, we adapted the previous 617 implementation by using the GPCRdb hierarchy levels to define GPCR subfamilies, resulting 618 in 81 subfamilies for analysis. Secondly, we did not limit the entropy calculation to class A 619 GPCRS but applied it to all GPCRs. However, as opposed to the previous implementation, we 620 included only human GPCR sequences, resulting in 388 sequences for analysis.

622
Structural information 623 The data set was enriched with structural information from GPCRdb [90]. This consisted of 624 data that was annotated to the GPCRs present in the GDC and 1000 Genomes dataset. Included 625 were the family trees to find related proteins, the amino acid sequence of a protein and sequence 626 alignment data, which was used to add BW numbering to the residues. Finally, to connect all 627 the data we found, we used the HUGO Gene Nomenclature Committee (HGNC) for source to 628 source mapping [94].