Abstract
Autophagy is a cellular process to recycle damaged cellular components and its modulation can be exploited for disease treatments. A key autophagy player is a ubiquitin-like protein, LC3B. Compelling evidence attests to the role of autophagy and LC3B in different cancer types. Many LC3B structures have been solved, but a comprehensive study, including dynamics, has not yet been undertaken. To address this knowledge gap, we assessed ten physical models for molecular dynamics for their capabilities to describe the structural ensemble of LC3B in solution using different metrics and comparison with NMR data. With the resulting LC3B ensembles, we characterized the impact of 26 missense mutations from Pan-Cancer studies with different approaches. Our findings shed light on driver or neutral mutations in LC3B, providing an atlas of its modifications in cancer. Our framework could be used to assess the pathogenicity of mutations by accounting for the different aspects of protein structure and function altered by mutational events.
Introduction
Autophagy is a highly conserved pathway in eukaryotes that allows the recycling of multiple cellular components during physiological conditions and in response to different types of stress such as starvation (Galluzzi et al., 2017; Gatica et al., 2018). Autophagy requires the sequestration of cytoplasm within double-membrane vesicles, known as autophagosomes, which then mature and fuse to lysosomes, leading to the degradation of their cargo (Parzych and Klionsky, 2013). This last step allows for the production of molecular building blocks that can be recycled and reused by the cell (He and Klionsky, 2009).
Autophagy selectively mediates the removal of damaged or unwanted cellular components (Fimia et al., 2013; Gatica et al., 2018; Stolz et al., 2014; Zaffagnini and Martens, 2016) and involves the binding of cargo receptors to the core autophagy machinery (Fimia et al., 2013; Gatica et al., 2018; Rogov et al., 2014; Stolz et al., 2014). Selective autophagy occurs in diverse forms depending on the target organelles (Kubli and Gustafsson, 2012; Zaffagnini and Martens, 2016) or cellular components. A common trait among different types of selective autophagy can be identified; a receptor protein needs to bind the cargo and to link it to the LC3/GABARAP family of Atg8 mammalian homologs (Maruyama et al., 2014; Schaaf et al., 2016) through the interaction with a scaffold protein (Gatica et al., 2018). An autophagy receptor is, by definition, a protein that can bridge the cargo to the autophagosomal membrane, thus promoting the engulfment of the cargo by the autophagosome (Johansen and Lamark, 2011; Stolz et al., 2014). Atg8 family members can also interact with another class of proteins, the autophagy adaptors. The adaptor proteins do not guide the engulfment and degradation of the cargo, but they work as anchor points or scaffolds for the autophagy machinery (Johansen and Lamark, 2011; Stolz et al., 2014).
One of the most studied LC3/GABARAP family proteins is LC3B, which is used as a marker for the assessment of autophagy activity in cellular assays (Kabeya, 2004; Klionsky et al., 2016; Mizushima et al., 2010; Tanida et al., 2008). The first study focused on the function of LC3B in autophagy was published in 2000 (Kabeya et al., 2000) and has gained more than 5000 citations since then. The LC3B structure is characterized by two α-helices at the N-terminus of the protein and a ubiquitin (Ub)-like core (Figure 1A and B) (Mohan and Wollert, 2018; Sugawara et al., 2004). LC3B is a versatile protein that serves as a platform for protein-protein interactions (Nakatogawa et al., 2007; Wild et al., 2014). LC3B and Atg8 proteins, more broadly, recruit the autophagy receptors through the binding of a specific short linear motif known as the LC3-interacting region (LIR) in eukaryotes (Figure 1C) (Alemu et al., 2012; Kim et al., 2016; Klionsky and Schulman, 2014; Noda et al., 2010, 2008; Slobodkin and Elazar, 2013). The first LIR was discovered in the mammalian SQSTM1 (also known as p62) autophagy receptor in 2007 (Pankiv et al., 2007). The consensus sequence of the LIR motif includes an aromatic residue (W/F/Y) and a hydrophobic one (L/I/V), separated by two other residues (Alemu et al., 2012; Birgisdottir et al., 2013; Noda et al., 2008). Non-canonical LIR motifs have also been identified (Di Rita et al., 2018; Kuang et al., 2016a; von Muhlinen et al., 2012). The binding of the LIR to the LC3 proteins occurs at the interface between the N-terminal helical region and the Ub-like subdomain (Figure 1C). Interactions of biological partners with LC3/GABARAP can also be mediated at interfaces distinct from the LIR-binding site (Marshall et al., 2019; Wild et al., 2014). The LC3/GABARAP family members are not necessarily redundant in terms of functions, and each of them can be involved in different autophagy pathways (Lee and Lee, 2016; Schaaf et al., 2016), as illustrated in Figure 1D.
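The canonical LIR consensus described above (an aromatic residue, two arbitrary residues, then a hydrophobic residue) can be illustrated with a short pattern search. The following is a minimal sketch, not the procedure used in this study: the function name `find_lir_candidates` and the toy peptide are hypothetical, and a match only marks a candidate core motif, since functional LIRs also depend on flanking residues and structural context.

```python
import re

# Core LIR consensus as stated in the text: aromatic (W/F/Y), two arbitrary
# residues, then a hydrophobic residue (L/I/V). A lookahead is used so that
# overlapping candidate motifs are all reported.
LIR_CORE = re.compile(r"(?=([WFY]..[LIV]))")


def find_lir_candidates(sequence):
    """Return (1-based start position, 4-residue motif) for each candidate."""
    seq = sequence.upper()
    return [(m.start() + 1, m.group(1)) for m in LIR_CORE.finditer(seq)]


if __name__ == "__main__":
    # Hypothetical toy peptide used only for illustration.
    toy_peptide = "GSDDDWTHLSSKE"
    for position, motif in find_lir_candidates(toy_peptide):
        print(position, motif)
```

A plain consensus search of this kind is expected to return many false positives on full-length proteins, which is why dedicated predictors score candidates beyond the four-residue core.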
Selective autophagy pathways are known to positively or negatively affect a large range of diseases including cancer (Mah and Ryan, 2012; Mancias and Kimmelman, 2016; Mizushima, 2018; Rybstein et al., 2018). These findings have attracted the attention of pharmaceutical and medical researchers as modulation of autophagy could provide means for new disease treatments and the identification of prognostic factors and disease-related markers (Thorburn et al., 2014; Towers and Thorburn, 2016). For example, in colorectal cancer, high LC3B expression levels correlate with improved survival after surgery (Wu et al., 2015), whereas in clear cell renal cell carcinomas, without the loss of the tumor suppressor gene VHL, high levels of LC3 proteins have been linked to tumor progression (Cowey and Rathmell, 2009; Mikhaylova et al., 2012) and the VHL-regulated miRNA-204 can suppress tumor growth inhibiting LC3B (Mikhaylova et al., 2012). In another study, the low protein expression of Beclin1 and LC3A/B contributed to an aggressive cancer phenotype in metastatic colorectal carcinoma (H. Zhao et al., 2017). A systematic review of cancer data found that high expression of LC3B predicts adverse prognosis in breast cancer (He et al., 2014) and oral squamous carcinoma (Tang et al., 2013). In a Pan-Cancer study on 20 cancer types, LC3B exhibited a common expression pattern in malignancy, whereby high LC3B expression was associated with proliferation, invasion, metastasis, high grades and an adverse outcome (Lazova et al., 2012). Similar expression patterns for LC3B have also been found in other independent studies on triple negative breast cancer (Zhao et al., 2013) and hepatocellular carcinoma (Wu et al., 2014). In summary, LC3B-associated autophagy is likely to promote or repress tumorigenic events depending on the context and tumor site (Schaaf et al., 2016). 
Autophagy can have protective functions against cancer development but can contribute to cancer progression and resistance to treatment. In this way, autophagy is often described as a double-edged sword in cancer (Kimmelman, 2011; Singh et al., 2018; Thorburn, 2014; White, 2012; White and DiPaola, 2009; Yao et al., 2016).
Apart from evidence regarding changes in the expression level of LC3B in different cancer types, genomic alterations of LC3/GABARAP family members and their relation to cancer have been poorly explored (Nuta et al., 2018). We thus decided to curate an atlas of missense mutations found in cancer samples for the LC3B autophagy marker and to characterize them with a combination of structural and systems biology approaches. We aim to provide a framework that can be extended to other Atg8 proteins and to other cancer targets in general.
Several three-dimensional (3D) structures of LC3B have been solved by X-ray or NMR in the free state (Kouno et al., 2005; Rogov et al., 2013) or in complex with different biological partners (Ichimura et al., 2008; Jemal et al., 2011; Kuang et al., 2016b; Kwon et al., 2017; Lv et al., 2016; McEwan et al., 2015; Olsvik et al., 2015; Qiu et al., 2017; Rogov et al., 2017, 2013; Stadel et al., 2015; Suzuki et al., 2014; Yang et al., 2017). This provides a valuable source of information for investigations using biomolecular simulations and other structural computational methods of both the LC3B wild-type and its mutated variant. LC3 proteins have not been thoroughly explored by systematic biomolecular simulations or other computational structural methods. Among the simulations present in the literature, each one focused on specific aspects of LC3B dynamics using all-atom molecular dynamics (MD) simulations with a single force field (Jatana et al., 2019; Liu et al., 2013; Nuta et al., 2018; L. Zhao et al., 2017) or coarse-grained approaches (Thukral et al., 2015).
Here we aim to apply methods which combine conformational ensembles derived by molecular dynamics (MD) simulations and analysis with methods inspired by graph theory (Nygaard et al., 2016; Papaleo, 2015; Papaleo et al., 2016), to study the LC3B structure-dynamics-function relationship. We also aim to investigate the effects of missense mutations found in different cancer types to facilitate the classification of cancer driver and passenger mutations. As a first step, we aimed to choose the best physical description for LC3B by assessing ten different state-of-the-art physical models (i.e., force fields) for MD. The choice of the force field can affect the quality of the sampling in a system-dependent manner, and it is crucial to evaluate them when a new protein is under investigation (Dror et al., 2012; Klepeis et al., 2009). Most of the current force fields indeed differ for subtle substitutions in the torsional potential of the protein backbone and side chains, which can have a significant impact on the simulated structure and dynamics (Guvench and MacKerell, 2008; Lange et al., 2010; Lindorff-Larsen et al., 2012; Martín-García et al., 2015; Tiberti et al., 2015; Unan et al., 2015). This assessment has been extensively carried out for ubiquitin with different strategies (Bowman, 2016; F et al., 2012; Huang et al., 2017; Lange et al., 2010; Li and Brüschweiler, 2010a; Lindorff-Larsen et al., 2016, 2012; Long et al., 2011; Maragakis et al., 2008; Martín-García et al., 2015; Showalter and Brüschweiler, 2007; Sultan et al., 2018) but not for the Atg8 proteins.
With meaningful conformational ensembles in hand for LC3B, we turned our attention to the study of the impact of mutations identified in genomics Pan-Cancer studies, focusing on those which are unlikely to be natural polymorphisms in the healthy population. In our study, we accounted for the different aspects that a mutation could alter (Figure 2): i) protein stability, ii) interaction with the biological partners, iii) long-range communication between sites distant from the functional ones (which is often at the base of allostery), and iv) the interplay with post-translational modifications. We thus integrated the analysis of the MD ensembles with a range of other bioinformatic, network-based and structural approaches to achieve a comprehensive classification of the missense mutations in the LC3B coding region.
Results and Discussion
State-of-the-art force fields for all-atom molecular dynamics consistently describe microsecond dynamics of LC3B
To compare the quality and sampling of the MD simulations of LC3B carried out with the ten state-of-the-art force fields selected for this study, we integrated different and complementary metrics.
At first, we estimated the atomic resolution (R) for each of the MD ensembles, as we recently did for another cancer-related protein simulated with different physical models (Nygaard et al., 2016). The predicted R-value allows for the collective quantification of different parameters for structural quality (Berjanskii et al., 2012) such as the population of side-chain dihedrals, rotamers that deviate from the “penultimate rotamer library” (Lovell et al., 2000), deviation from the allowed regions of the Ramachandran plot, packing of the protein core, hydrogen-bond networks and atomic clashes. After discarding the flexible N- (1-6) and C-terminal (117-120) tails from the analysis, the median R values for the different MD ensembles of LC3B(7-116) were in reasonable agreement with each other, and all the ensembles featured R values mostly below the resolution threshold (Figure 3A). Notably, all the MD simulations provided conformational ensembles with R values better than that of the starting structure, suggesting that the applied MD algorithms are refining the structure to a level close to the X-ray structure deposited in the entry 3VTU (R = 1.6 Å; Rogov et al., 2013). We performed a statistical pairwise Wilcoxon test to verify which of the MD ensembles provided a significantly different resolution distribution (see Table S1). Out of the force fields tested, RSFF1 had the highest median, in proximity of the threshold, and ff99SB*-ILDN had the lowest median, with most R values below the threshold. ff99SBnmr1, a99SB-disp and ff99SB*-ILDN-Q had the highest R values and were significantly different from the MD ensembles generated with the CHARMM force fields or ff99SB*-ILDN. LC3B has been studied by NMR spectroscopy, providing a number of probes of protein dynamics in solution, such as the assignment of the backbone chemical shifts (Kouno et al., 2005).
Chemical shifts provide information about motions occurring on a heterogeneous range of time scales (Case, 2013; Li and Brüschweiler, 2010a; Palmer, 2015; Robustelli et al., 2012) and can be calculated from an MD ensemble of structures (Kohlhoff et al., 2009; Li and Brüschweiler, 2015, 2012). As chemical shifts are useful for experimental validation of the MD ensembles produced using different force fields (Beauchamp et al., 2012; Best et al., 2012b; Guvench and MacKerell, 2008; Henriques et al., 2015; Lange et al., 2010; Lindorff-Larsen et al., 2012; Martín-García et al., 2015; Nygaard et al., 2016; Papaleo et al., 2018, 2014b; Piana et al., 2014; Unan et al., 2015), we calculated the backbone chemical shifts from each MD ensemble and compared them to the experimental values. The calculated chemical shifts from our simulations are overall in fair agreement with the experimental data (Table S2), with RMSE values below or close to the error of the prediction associated with each atom type (Li and Brüschweiler, 2015). The only relevant exceptions are RSFF1 and RSFF2, with RMSE values higher than 1.20 ppm for the Cα atoms. For LC3B in its unbound state, we did not find available Cβ or side-chain methyl chemical shifts, which might have been more sensitive to changes in the intramolecular interactions described by the different force fields (Papaleo et al., 2018).
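The comparison between calculated and experimental chemical shifts can be sketched as follows. This is an illustrative example under stated assumptions rather than the actual pipeline: it assumes that per-frame predicted shifts for one atom type are already available as dictionaries keyed by residue number (the predictor itself is not shown), and the function names and all numerical values are hypothetical.

```python
import math


def ensemble_average(per_frame_shifts):
    """Average predicted shifts (ppm) over frames for each residue.

    per_frame_shifts: list of {residue_number: shift_ppm}, one dict per frame.
    Assumes every frame reports the same residues.
    """
    n_frames = len(per_frame_shifts)
    residues = per_frame_shifts[0].keys()
    return {r: sum(f[r] for f in per_frame_shifts) / n_frames for r in residues}


def rmse(predicted, experimental):
    """Root-mean-square error over residues present in both dictionaries."""
    shared = sorted(set(predicted) & set(experimental))
    if not shared:
        raise ValueError("no residues in common")
    squared = sum((predicted[r] - experimental[r]) ** 2 for r in shared)
    return math.sqrt(squared / len(shared))


if __name__ == "__main__":
    # Invented Calpha shifts for two frames and two residues (toy values).
    frames = [{10: 56.0, 11: 58.0}, {10: 57.0, 11: 60.0}]
    experimental = {10: 56.0, 11: 60.0, 12: 55.0}  # residue 12 has no prediction
    averaged = ensemble_average(frames)
    print(round(rmse(averaged, experimental), 3))
```

The resulting RMSE for each atom type would then be compared with the expected error of the shift predictor for that atom type, as done in the text.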
As a complementary approach to evaluate the impact of different force fields on the description of the protein, the structural ensembles can also be compared in terms of the overlap of the conformational space sampled by the different simulations. Dimensionality reduction techniques can be used as part of this approach (Lange et al., 2010; Martín-García et al., 2015; Tiberti et al., 2015). At first, we used Principal Component Analysis (PCA) and focused on the projection along the first three principal components (PCs) (Supplementary File S1). We found that in most of the simulations there is a region of general overlap but that the sampling also deviates from the starting structure (marked as a *). This is probably due to the low resolution of the initial structure deposited in PDB which the MD force fields refined, as confirmed by the lower R values and agreement with the chemical shifts.
Since the first three PCs explained only 35.5% of the total variance for the concatenated trajectory, we turned our attention to more quantitative metrics for comparison of the distributions underlying the structural ensembles. We decided to use a structural clustering method (Clustering-based Ensemble Similarity, CES) implemented in ENCORE (Tiberti et al., 2015) to achieve a complementary assessment of the similarities between force fields and the sampled conformational subspaces. As revealed by the projections in Figure 3B and the heatmaps (Supplementary File S2), RSFF1, ff99SB*-ILDN, RSFF2 and a99SB-disp are the ensembles most distant from the others.
We then wondered if these differences could be due to local conformational variabilities in the structures sampled by each simulation. To this aim, we exploited the structural alphabet (SA) paradigm (Craveur et al., 2015) to estimate the sampling of local states in the simulations, relying on a reduced version of the M32K25 structural alphabet (Pandini et al., 2010) (A, K, U, R, N and Y, as detailed in Methods and Figure 3C). At first, we estimated the discrete probability distribution of the states for each fragment of the protein in the different MD ensembles. We used the Jensen-Shannon divergence (jsD) as a measure to compare the calculated probability distributions to have an estimate of the differences in the conformations of the MD ensembles (Figure 3C and Supplementary File S3). The SA analysis confirmed the diversity of the structures explored by RSFF1 in different areas of the protein, including the LIR-binding interface (see Supplementary File S3). The analysis also highlighted the different local structures sampled by ff99SB*-ILDN in the C-terminal subdomain of the protein (residues 82-105). Moreover, we observed local differences for other force fields, particularly CHARMM36, CHARMM27, ff14SB and RSFF2 (Figure 3D as an example and Supplementary File S4).
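The Jensen-Shannon divergence between the state distributions of two ensembles at a given fragment can be sketched as below. This is a minimal illustration, assuming that each ensemble has already been encoded as a string of structural-alphabet letters for one fragment (one letter per frame); the base-2 logarithm, which bounds the divergence between 0 and 1, is an assumption, as are the function names and the toy strings.

```python
import math
from collections import Counter

# Reduced structural alphabet states named in the text.
STATES = "AKURNY"


def state_distribution(letters):
    """Discrete probability of each state for one fragment across frames.

    Assumes every letter in `letters` is one of STATES.
    """
    counts = Counter(letters)
    total = len(letters)
    return [counts.get(state, 0) / total for state in STATES]


def _kl(p, q):
    # Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two distributions over the same states."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, mixture) + 0.5 * _kl(q, mixture)


if __name__ == "__main__":
    # Toy encodings of the same fragment in two hypothetical ensembles.
    ensemble_1 = "AAAAKKUU"
    ensemble_2 = "AAKKKKRR"
    p = state_distribution(ensemble_1)
    q = state_distribution(ensemble_2)
    print(jensen_shannon_divergence(p, q))
```

Computing this value for every fragment and every pair of ensembles gives a per-position profile of where two force fields sample different local conformations.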
In summary, all the force fields provide a structural ensemble of reasonable quality for LC3B. However, using methods for comparison of ensembles of structures and their local states, we have been able to identify important local differences. Overall, CHARMM22* seems to be among the most robust force fields in the description of LC3B according to the properties here analysed. We thus selected it for further analyses to link the MD ensemble to functional properties and to study the mutation sites of LC3B found in genomic cancer studies.
An atlas of LC3B missense mutations in cancer and their interplay with post-translational modifications and functional motifs
We retrieved 28 missense mutations identified in LC3B from cancer genomic studies (Figure 1B) and verified whether they could be natural polymorphisms in healthy individuals. In the ExAC database (Kobayashi et al., 2017), we identified only three mutations within our dataset (R37Q, K65E, and R70H), each with a very low frequency in the population (< 1/10000). We thus retained all 28 mutations in 23 residue sites found in 13 different cancer types (see Supplementary File S5). We analyzed the mutation sites in the context of their interplay with PTMs, overlap with functional short linear motifs (SLiMs), coupling and conservation upon coevolution-based analysis, and their potential to be pathogenic according to the REVEL score (Ioannidis et al., 2016; Li et al., 2018). We also verified that the corresponding mutated transcript was expressed in the samples, through analysis of the corresponding entry in cBioPortal (Cerami et al., 2012). Most of the mutant LC3B variants are expressed, with the exception of M60I and R70C, for which we cannot exclude additional effects due to a markedly low expression of the gene product. We verified (for those samples with information available) that the mutation was the only one targeting the LC3B gene in that specific sample. For most of the mutation sites, we also found an overlap with predicted SLiMs associated with different functions, regulatory modifications or interactors (Figure 4A).
We carried out a first annotation of the potential pathogenic impact for each mutation using REVEL. We identified 11 likely pathogenic variants: R16G, D19Y, P28L, P32Q, R37Q, K49N, R70C/H, V89F, Y113C, and G120R/V. We evaluated if the mutations could have a functional impact by abolishing any of the SLiMs or PTMs, along with the likelihood of mutant variants harboring new PTM sites (Figure 4A). The mutation T29A is expected to abolish a phosphorylation site specific for the protein kinase C (Jiang et al., 2010). The analysis of the multiple sequence alignment generated by a coevolution method (Ovchinnikov et al., 2014) and the associated scores show that at this position negatively charged residues or other phosphorylatable residues (i.e., serine) are favoured with high conservation scores, emphasizing the functional importance of a negative charge at this position (Figure 4B). S3W, P32Q, and K49N are in the flanking region of T29 or of two other phosphorylation sites (i.e., T6 (Jiang et al., 2010) and T50 (Wilkinson et al., 2015; Wilkinson and Hansen, 2015)) and could impair the binding of the kinases/phosphatases. The R21G/Q, K49N and K65E mutations are predicted to abolish methylation, acetylation (Huang et al., 2015) and ubiquitination sites (Wagner et al., 2011), respectively (Figure 4A). In particular, acetylation of K49 on the cytoplasmic form of LC3B is important for nuclear transport and the maintenance of the LC3B reservoir. Deacetylation is, on the contrary, functional to the translocation to the cytoplasm for interaction with the autophagy machinery (Huang et al., 2015; Huang and Liu, 2015). Deacetylation/acetylation cycles are important to maintain the proper pools of LC3B in the cell, and a mutation impairing this modification, such as K49N, could have detrimental effects, increasing the cytoplasmic pool of LC3B and leading to uncontrolled autophagy.
D19Y is predicted to introduce a phosphorylatable residue for the TK and EGFR families of kinases by both GPS (Xue et al., 2008) and NetPhos (Blom et al., 2004). The residue is exposed on the protein surface and its substitution to tyrosine could make it available for post-translational modification, introducing a new level of regulation absent in the wild-type variant.
D19 is also tightly coupled to R16 according to the coevolution analysis (Supplementary File S6) and the only substitutions which are conserved are to glutamate and lysine (Figure 4B), respectively, suggesting an important contribution by charged residues at these positions, which might be compromised by the somatic mutation of arginine to glycine and of aspartate to tyrosine.
Position 35, where an isoleucine is located in the wild-type LC3B, shows a propensity to accommodate any hydrophobic aliphatic residue and, as such, the I35V mutation is likely to provide the same features as the wild-type variant (Figure 4B). A similar scenario occurs for the M60I and L82F mutations (Figure 3B). The arginine at position 70 shows a lower conservation score than the histidine residue to which it is mutated in cancer (Supplementary File S6).
Most of the positions corresponding to the LC3B mutation sites are conserved in the other LC3 proteins (LC3A, LC3C and GABARAP) or show very conservative substitutions, especially when LC3A and LC3B are compared. The only marked differences are in R21 (highly variable site), T29 (where only LC3A retains the phospho-site), Y38 replaced by an alanine in GABARAP, K39 and D56 substituted by prolines or polar residues in LC3C and GABARAP, and K65, which is replaced by a serine and a phenylalanine in LC3C and GABARAP, respectively. We thus evaluated if LC3A, LC3C and GABARAP featured mutations similar to LC3B in cancer samples, using a similar approach as the one illustrated in Figure 4A. The mutation sites which featured the highest overlap in the four proteins were D19, P28, T29, P32, Y38, R70, and V89. Other sites were sensitive to mutations only in LC3B and LC3A but not in GABARAP or LC3C. Interestingly, the R70C and R70H mutations are found in cancer samples in the corresponding regions of the other LC3 proteins, whereas P28L and P32Q are found as somatic mutations in other members of the family, suggesting a sensitive hotspot for these regions. The surroundings of P32 and R70 have also been shown to be important for the protein activity, since the corresponding mutant variants in yeast were autophagy-defective (Nakatogawa et al., 2007).
The mutations D19Y and R21Q/G are also likely to impair the signal motif (i.e. Endosome-Lysosome-Basolateral sorting signals, ELB) to direct the protein to the endosome and lysosome compartments with a potential major impact on the autophagy pathway (Figure 4A). Several LC3B mutations could affect docking or recognition motifs for phosphatases or kinases, thus impacting on the regulation of LC3B activity and stability by its upstream regulators (see Supplementary File S5). Mutations at the residue G120 could abolish the cleavage site by ATG4B, impairing the activity of LC3B, as confirmed by experiments on the G120A mutant variant of LC3B (Kouno et al., 2005; Satoo et al., 2009).
Assessment of the impact on protein stability upon LC3B missense mutations
A deeper understanding of the impact of the missense mutations on the protein can be achieved using structural methods (Nielsen et al., 2017; Nygaard et al., 2016; Riera et al., 2014; Scheller et al., 2019; Stein et al., 2019). A deleterious effect of a mutation could, for example, be to alter the protein structural stability, causing local misfolding and a higher propensity of the protein to be degraded in the cell. We thus estimated the changes in free energy associated with protein stability for each of the LC3B somatic mutations using an empirical energy function (Guerois et al., 2002; Schymkowitz et al., 2005). We implemented this procedure in a high-throughput manner (Nygaard et al., 2016; Papaleo et al., 2014a) so that all the possible mutations at each LC3B site can be assessed. Such a high-throughput approach allowed us to evaluate the impact of all possible mutations in the protein without being limited to the mutations found in the cancer genomic studies. Indeed, we investigated, more broadly, whether the cancer mutation sites are sensitive hotspots to substitutions and predicted whether other sites of the protein, when mutated, could impact its stability, providing groundwork for the prediction of LC3B somatic mutations which might arise from other cancer genomic studies or from the profiling of more cancer samples (Figure 5A). Our mutational scan shows that R16, P32, I35 and Y113 are sensitive hotspots for protein stability and cannot tolerate most of the amino-acid substitutions. With regard to the atlas of somatic mutations found for LC3B, twelve of them are clearly destabilizing for the structural stability of the protein (R16G, R21G, P28L, P32Q, I35V, Y38H, R70C, L82F, and Y113C). Other mutations have mild or neutral effects, with the exception of V89F, which was found to have a stabilizing effect, likely due to an improved packing of the protein core (Figure 5B).
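The notion of a site that "cannot tolerate most substitutions" in a saturation mutational scan can be made concrete with a small summary over the predicted free energy changes. The sketch below is illustrative only: it assumes the ΔΔG values (kcal/mol) have already been computed by an external energy function, the destabilization cutoff of 1.0 kcal/mol and the majority criterion are assumptions rather than the thresholds used in this study, and the site labels and values are invented.

```python
def destabilizing_fraction(scan, threshold=1.0):
    """Fraction of substitutions at each site with ddG above the cutoff.

    scan: {site_label: {mutant_residue: ddG_kcal_per_mol}}
    threshold: assumed cutoff above which a substitution counts as
               destabilizing (hypothetical default).
    """
    return {
        site: sum(1 for ddg in subs.values() if ddg > threshold) / len(subs)
        for site, subs in scan.items()
    }


def sensitive_hotspots(scan, threshold=1.0, min_fraction=0.5):
    """Sites where more than `min_fraction` of substitutions are destabilizing."""
    fractions = destabilizing_fraction(scan, threshold)
    return sorted(site for site, frac in fractions.items() if frac > min_fraction)


if __name__ == "__main__":
    # Invented ddG values for two hypothetical sites, three substitutions each.
    toy_scan = {
        "X10": {"A": 2.1, "G": 3.0, "L": 0.2},
        "X20": {"A": 0.1, "G": 1.5, "L": -0.3},
    }
    print(destabilizing_fraction(toy_scan))
    print(sensitive_hotspots(toy_scan))
```

In a full scan each site would carry all 19 possible substitutions, and the same summary could be applied to sites not yet observed as mutated in cancer samples.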
To achieve a better understanding of the role of these residues in the structural stability of LC3B, we also employed a method to estimate atomic contacts (Mercadante et al., 2018) and their lifetime in the CHARMM22* MD ensemble (see Materials and Methods). We calculated the average local interaction time (avLIT) for each residue of the protein during the simulation (Figure 5C). The mean of the distribution of avLIT values is 0.4 (fraction of frames), and the somatic mutation sites with avLIT values higher than this threshold were R16, D19, P32, L82, V98 and Y113, reflecting the results of the mutational scan above. We then estimated the strength and location of the interactions for each mutation site over time, as well as the associated number of encounters (Figure 5D). Two macro-groups can be identified, i.e., mutation sites with atomic contacts only with residues contiguous in the sequence, and mutation sites also involved in strong or more dynamic contacts with residues distant in the sequence. In the first class, we found mostly the mutation sites predicted to be neutral for stability, whereas the second group accounts for residues such as D19, P32, L82, and Y113.
Local effects of the mutations on binding of LIR motifs and other interactors
The LC3B interactome is large (Wild et al., 2014) and the main function of LC3B is to recruit many different proteins to the autophagosome. Thus, to better appreciate the effect of the mutations identified in the cancer samples, one should also consider the spectrum of alterations of the LC3B interactors in the same samples. To this end, we retrieved LC3B interactors by mining the IID protein-protein interactions database (Kotlyar et al., 2019) and integrated them with interactors reported in a recent publication (Wild et al., 2014). We identified 95 LC3B partners overall, and for each of them we verified if a LIR motif was reported in the literature as experimentally validated. For cases where no information was available on the mode of interaction, we predicted putative LIRs with iLIR (Jacomin et al., 2016).
We identified 70 interactors as either experimentally validated LIR-containing proteins or having a predicted LIR motif with a significant score by the predictor. We then retrieved the mutational status of the LIR-containing interactors in the same samples where the LC3B missense mutations were identified to explore the possibility of co-occurrence of mutations. For each of them, we evaluated if the mutation was in the proximity of the experimentally validated or predicted LIRs (Figure 6A-B). We found 39 mutations in 27 LIR-containing interactors occurring in samples where LC3B was mutated (highlighted in red in Figure 6A). In particular, 17 of these mutations were truncations with the potential of abolishing all or most of the LIR motifs in the interactors (Figure 6B). The remaining mutations were located in the core motif or in its proximity, as well as in the N- and C-terminal regions (Figure 6B). We noticed that the mutations in the proximity of the LIR region or in the core motifs mostly affected charged residues, which are likely to stabilize the binding of the wild-type complexes, and are thus likely to act through changes in the electrostatic interactions.
For two of the interactors (OPTN and FUNDC1), we used the known 3D structures of the complexes with LC3B to estimate the changes in binding free energy associated with the cancer mutations of the LIR in co-occurrence with LC3B mutations: A184D of OPTN and R37Q of LC3B, along with D16N of FUNDC1 and R70C of LC3B. We identified a slightly negative binding ΔΔG (average ΔΔG = -0.37 kcal/mol) due to the combination of A184D (OPTN) and R37Q (LC3B), and a destabilization (average ΔΔG = 1.41 kcal/mol) induced by D16N (FUNDC1) and R70C (LC3B). In parallel, we also estimated the changes in binding free energies with another protocol based on Rosetta (Barlow et al., 2018) for a reciprocal control of the calculation results, as recently suggested for similar applications (Buß et al., 2018). The calculation suggests that only D16N (FUNDC1)-R70C (LC3B) has a destabilizing effect on binding affinity, whereas A184D (OPTN)-R37Q (LC3B) has neutral effects according to the Rosetta scan.
Moreover, we estimated local effects induced by the LC3B mutations on the binding to their partners of interaction calculating the changes in binding free energies for the complexes of LC3B-p62 (Ichimura et al., 2008) and LC3B-FUNDC1 (Kuang et al., 2016a), as prototypes of two different binding modes (Figure 6C-D). Most of the mutations have neutral effects on the local binding. We observed LIR-specific effects for the remaining mutations, i.e., P32Q and R70C as destabilizing mutations for the p62-LC3B complex, whereas K49N and R70C/H affected the LC3B binding with the FUNDC1 LIR. These results are also confirmed by the mutational scan with another method based on the Rosetta energy function (Supplementary File S8). The results were in agreement with the decreased binding of LC3B to p62 and FYCO LIRs upon mutations of LC3B R70 to alanine (Ichimura et al., 2008; Olsvik et al., 2015). D19N was recently characterized in the context of the binding of FUNDC1 with LC3B (Kuang et al., 2016a). The minor effect that we observed for this mutation on the unphosphorylated FUNDC1-LC3B is in agreement with the fact that D19 is located in proximity of the LIR phospho-site Y18 of FUNDC1 LIR and it is responsible for the stabilization of the phosphorylated state of FUNDC1, a function that is likely abolished by this cancer mutation, but that does not have a marked effect on the binding of the unphosphorylated LIRs. We thus assessed the impact of this mutation (D19N) on the structure of the phosphorylated FUNDC1-LC3B (Kuang et al., 2016a) complex and we estimated a higher destabilizing effect with a ΔΔG of approximately 1.14-1.27 kcal/mol.
K49N is also predicted to have destabilizing effects on the binding of the FUNDC1 LIR with LC3B (Figure 6D). K49 is especially important since it is a gatekeeper that regulates the binding of the LIR to the LC3/GABARAP pocket and undergoes conformational changes upon binding (Suzuki et al., 2014).
LC3B interacts not only with LIR motifs but also with other proteins of the core autophagy machinery, such as Atg4B (Satoo et al., 2009). We also estimated the changes in binding free energy upon mutation for this protein complex (Figure 6E). A group of mutations (Y38H, K65E, L82F and G120V/R) specifically altered the interaction with Atg4B and did not affect the ones with the LIR motifs, as also shown by Rosetta Flex ddG (Supplementary File S8). Interestingly, L82 has been shown to be an essential residue for the C-terminal cleavage of LC3B (Fass et al., 2007), supporting our predictions. Y113C was also shown experimentally to have a functional impact on the interaction with Atg4B (Nuta et al., 2018), but it is not identified as destabilizing for the interaction by our local mutational scan. According to our mutational scan on protein stability, this could be because Y113C has an impact on structural stability (Figure 5B). Thus, a Y113C LC3B variant could be more prone to proteasomal degradation, and the altered biological readout might be a consequence of lower levels of this protein variant in the cell.
LC3B ensemble under the lens of protein structure network: hubs for stability and long-range induced effects
The mutational scan described in the previous section only captures local effects for mutations of residues in the proximity of the interface. We thus used the Protein Structure Network (PSN) framework combined with MD simulations (Papaleo, 2015) to assess more distal effects. PSN-based methods provide a solid framework to study protein structures (Angelova et al., 2011; Di Paola and Giuliani, 2015; Ghosh and Vishveshwara, 2008; Mariani et al., 2013; Verkhivker, 2019; Vuillon and Lesieur, 2015). We recently applied them to the study of other cancer-related proteins, shedding light on the underlying structural communication among distal sites (Invernizzi et al., 2014; Lambrughi et al., 2016; Papaleo et al., 2012a; Sora and Papaleo, 2019) or helping in the classification of potential cancer passenger and driver missense mutations (Nygaard et al., 2016).
In detail, the PSN employs the graph formalism to identify a network of interacting residues in a given protein from the number of non-covalent contacts in the protein or other intramolecular interactions (Tiberti et al., 2014; Viloria et al., 2017). Two main properties of a PSN are the hub residues, i.e., residues that are highly connected within the network, and the connected components, i.e., clusters of residues which are inter-connected but do not interact with residues in other clusters. Hubs in a PSN can shorten the communication between distal residues, or they can have a structural role thanks to their contribution to the robustness of the network. Indeed, substitutions occurring at nodes with a small degree are unlikely to have a large effect on the integrity of the network (and thus of the structure). On the contrary, if hubs are altered, the network integrity can be compromised. We calculated three PSNs, based on side-chain contacts (Tiberti et al., 2014; Viloria et al., 2017), hydrogen bonds (Tiberti et al., 2014) and salt bridges (Jónsdóttir et al., 2014; Tiberti et al., 2014), respectively.
We then calculated the hubs and connected components of the contact-based and hydrogen-bond-based PSNs from the CHARMM22* simulation of LC3B. Among the LC3B cancer mutation sites, I35, Y38, and V89 have a hub behaviour in the contact-based PSN (Figure 7A). These residues also belong to the second connected component of the contact-based network together with other hydrophobic residues, highlighting their importance for protein stability (Figure 7B). Moreover, many mutation sites are located at hub positions in the hydrogen bond network (i.e., R16, D19, R21, Y38, M60, K65, R70, V98 and Y113) (Figure 7C-D). Overall, given the pivotal role of these sites in mediating different classes of intramolecular interactions, and given that the substitutions would not conserve these interactions, the mutations R16G, D19Y, R21G, Y38H, R70C and Y113C are likely to impact the structural stability of LC3B.
Since 11 mutations were in charged residues of LC3B, we also calculated the network of electrostatic interactions between positively (arginine and lysine) and negatively charged (aspartate and glutamate) residues in the MD ensemble (Supplementary File S9). D19 is central to a small network of salt bridges with K51 and R11 (which is important for the LIR binding), whereas R16 is on one side of a four-residue network with D106, K8 and D104, which constrains a loop of the protein. K65 and R21 are only involved in local intra-helical salt bridges, whereas R70 shows a persistent interaction with D48. The other charged mutation sites either have electrostatic interactions of low persistence or are not involved in salt bridges and are in solvent-exposed positions.
To predict effects promoted from distal sites to the LIR binding region, we also calculated, from the contact-based PSN, the shortest paths of communication between each of the mutation sites and the LIR binding interface, i.e., R10, R11 (Ichimura et al., 2008; Kirkin et al., 2009), K49, K51, L53, H57 and R70 (Ichimura et al., 2008; Kirkin et al., 2009; Olsvik et al., 2015), which could be disrupted or weakened by the mutations (Table 1). We identified long-range communication with the LC3B mutation sites only for the interface residues K49, L53 and H57. We observed that I35 is crucial for the communication from the core of the protein to the LIR binding interface at several different sites, spanning different areas of the LIR binding groove. I35 is also often an intermediate residue in other paths mediated by different mutation sites (Table 1). In particular, we identified a strong short path of communication between the gatekeeper residue K49 at the LIR binding site and the mutation site I35 (average weight = 64.7), which might be weakened when I35 is mutated to a residue with a shorter side chain, such as valine. On the opposite side of the binding pocket, I35 can, acting through a longer path, trigger conformational changes in the binding site residues L53 and H57 (Table 1). K49, apart from playing an important local role in mediating the LIR binding, can also communicate long range with H57 in the LIR binding groove, which is important for the binding of the C-terminal part of the LIR. A similar behaviour is observed for M60 and K65, which are pivotal for long-range communication to all three LIR binding sites (K49, L53 and H57, Table 1). In addition, K65 and I35 are part of the same long-range communication spine from the surface of the protein to the LIR binding interface, suggesting that two-site communication can occur between the Atg4B binding site, to which K65 belongs, and the LIR binding site.
Other potential long-range effects can be exerted by P32 on H57 (passing also through L53), along with the three valine residues at positions 89, 91 and 98, which seem to act more as intermediate nodes of more complex paths (Table 1).
For all these residues that we found critical in distal communication, we speculate that their mutations could alter the native structural communication of LC3B protein from distal sites to the binding sites for different interactors.
Classification and impact of LC3B missense mutations
We integrated all the data collected in this study to map the different effects that the cancer-related mutations of LC3B could exert, providing a comprehensive view of the many aspects that they can alter and that are ultimately linked to protein function at the cellular level, i.e., protein stability, regulation, abolishment/formation of PTMs or functional motifs for protein-protein interactions, and local and distal effects influencing the binding to the interaction partners, along with the co-occurrence of mutations in the same cancer samples for the protein and its ligands (Figure 8A). We then ranked the mutations according to the properties that they alter to help in the classification of potentially damaging and neutral ones (Figure 8B). The ranking allowed the selection of mutations to prioritize for validation of their “driver” or “passenger” potential, along with the planning of the proper experimental readout for the validation. For example, if a mutation is predicted to be damaging for stability, experiments tailored to estimate the cellular protein level and half-life could be used, as we recently did for other disease-related proteins (Nielsen et al., 2017; Scheller et al., 2019). On the other hand, if the impact is more related to the protein activity or to the introduction/abolishment of functional motifs, binding assays based, for example, on peptide arrays, isothermal titration calorimetry, NMR spectroscopy or co-immunoprecipitation can be used together with assays of the related biological readouts in cellular models, as we recently did in the study of a LIR-containing scaffolding protein (Di Rita et al., 2018). Assays with upstream modifiers, such as ubiquitination or phosphorylation assays, can instead be used to validate the interplay with PTMs, both for the introduction of new layers of regulation upon mutation and for their abolishment.
In our case study, we identified three potential classes of driver mutations: i) mutations that alter both stability and activity (R16G, D19Y, R70C, P32Q, Y113C, and R21G); ii) detrimental mutations for protein stability (R16G, V98A, V89F); iii) mutations neutral for the stability but altering the activity (G120R, G120V, and K49N).
We then searched the literature for experimental data that could validate our predictions, and we found results in agreement with the functional impact for mutations at R70, D19, G120, and K49, supporting our results. Mutations at R70 showed no accumulation of the pro-forms of LC3B (Liu et al., 2013), slower kinetics for Atg4B-mediated cleavage (Costa et al., 2016) and reduced binding for more than 20 interactors (Behrends et al., 2010; Kraft et al., 2014; Olsvik et al., 2015). G120 is fundamental for proper C-terminal cleavage, which is impaired when this glycine is mutated to alanine (Tanida et al., 2004), and substitution of G120 with alanine has also been shown to impair the binding of lamin B1 to LC3B (Dou et al., 2015). Y113C was recently shown to inhibit the enzymatic activity of Atg7 (E1-like enzyme) but not the E2-like activity (Nuta et al., 2018). Mutations at K49 alter the binding to the phosphorylated variants of the LIR-containing LC3B interactor FUNDC1 (Lv et al., 2016), whereas mutation of this residue to alanine can increase the binding of another LIR-containing protein, Nix (Rogov et al., 2017). D19N altered the selectivity for phosphorylated and unphosphorylated variants of FUNDC1 (Kuang et al., 2016a).
Materials and Methods
Molecular Dynamics Simulations
The molecular dynamics (MD) simulations of the human LC3B protein were performed in explicit solvent using GROMACS version 4.6 (Hess et al., 2008). One-μs MD simulations of the LC3B monomer were collected starting from the free-state NMR structure with PDB entry 1V49 (Kouno et al., 2005). We employed ten force fields: ff14SB (Maier et al., 2015), ff99SBnmr1 (Li and Brüschweiler, 2010b), ff99SB*-ILDN (Best and Hummer, 2009; Lindorff-Larsen et al., 2010), ff99SB*-ILDN-Q (Best et al., 2012a; Best and Hummer, 2009; Lindorff-Larsen et al., 2010), a99SB-disp (Robustelli et al., 2018), RSFF2 (Li and Elcock, 2015; Zhou et al., 2015), CHARMM22* (Piana et al., 2011), CHARMM27 (Bjelkmar et al., 2010), CHARMM36 (Huang et al., 2017) and RSFF1 (Jiang et al., 2014).
As solvent models, we used the TIP3P model adjusted for CHARMM force fields (MacKerell et al., 1998), TIP3P (Jorgensen et al., 1983) for AMBER force fields and the TIP4P-Ew water model (Horn et al., 2004) for the RSFF1 force field. We used a dodecahedral box with a minimum distance between the protein and the box edges of 12 Å, applying periodic boundary conditions, and a NaCl concentration of 150 mM, neutralizing the charges of the system. In the simulations, we used the Nε2-H tautomer for all the histidine residues. The system was equilibrated according to a protocol previously applied to other case studies (Papaleo et al., 2014b). We carried out productive MD simulations in the canonical ensemble at 300 K with a time-step of 2 fs. We calculated long-range electrostatic interactions using the Particle-mesh Ewald summation scheme, whereas we truncated van der Waals and Coulomb interactions at 10 Å.
Structural assessment of the MD ensembles
From each of the ten simulations, we selected 100 frames (equally spaced in time) as representative of the trajectory for the prediction of chemical shifts and atomic resolution.
In particular, we calculated the backbone and side-chain proton chemical shift values for each MD ensemble with PPM_One (Li and Brüschweiler, 2015). We calculated the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE) between the predicted chemical shift values (from the simulations) and the experimentally measured chemical shifts from the Biological Magnetic Resonance Bank (BMRB entry 5958) (Kouno et al., 2005) to assess the ability of the force fields to describe a conformational ensemble close to the experimental one. For the comparison, we used 119 Cα, 116 C, 119 Hα, 113 H and 114 N chemical shifts.
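The two error metrics used for the chemical shift comparison can be illustrated with a minimal sketch. It assumes that the predicted shifts have already been averaged over the frames of an ensemble and paired with the experimental values for the same nuclei; the function names and the numbers are hypothetical and are not taken from the study. MAE is computed here as the mean of the absolute deviations.

```python
import math


def rmse(predicted, experimental):
    """Root mean square error between paired shift values (ppm)."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))


def mae(predicted, experimental):
    """Mean of the absolute deviations between paired shift values (ppm)."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - e) for p, e in zip(predicted, experimental)) / len(predicted)


# Hypothetical ensemble-averaged predictions and experimental Cα shifts (ppm).
predicted_ca = [56.1, 58.4, 61.0, 54.2]
experimental_ca = [56.5, 58.0, 60.2, 54.2]
print(f"RMSE = {rmse(predicted_ca, experimental_ca):.3f} ppm")
print(f"MAE  = {mae(predicted_ca, experimental_ca):.3f} ppm")
```

Because RMSE weights large deviations more heavily than MAE, reporting both helps to distinguish a force field with a few badly predicted nuclei from one with a uniform small offset.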
We predicted the atomic resolutions with ResProx (Berjanskii et al., 2012) to assess the structural quality of the MD conformational ensembles representing each trajectory. ResProx uses machine learning techniques to estimate the atomic resolution of a protein structure from features that are derived from its atomic coordinates. We also verified that the ResProx results and their distribution were not affected by the approach used to select the subset of 100 MD frames. To this aim, we carried out the ResProx calculations on a set of 1000 equally spaced frames selected from the one-μs CHARMM36 trajectory. Moreover, we also carried out structural clustering on the 1000-frame ensemble of the CHARMM36 trajectory using the GROMOS clustering algorithm, with a main-chain Root Mean Square Deviation (RMSD) cutoff of 2.4 Å. We retained only the most populated cluster (which accounts for 888 structures) and estimated the atomic resolution of each structure of the cluster and the corresponding distribution. The two approaches gave results similar to the calculations performed on the 100-frame ensembles, featuring similar distributions and median values.
To assess the statistical difference of the ResProx data among the force field simulations, we calculated the distribution of the atomic resolution data for each MD ensemble and performed the Shapiro-Wilk test to evaluate whether the data can be approximated by a normal distribution. According to this test, most of the ResProx data distributions do not come from a normally distributed population. In fact, the distributions of the sampled atomic resolution (R) values were either left- or right-skewed, due to outlier structures. The only exceptions are the MD ensembles obtained with ff99SBnmr1 and a99SB-disp. We therefore performed a pairwise Wilcoxon rank sum (Mann-Whitney U) test with continuity correction, adjusting the p-values with the Holm-Bonferroni method, on the R sets obtained from each force field pair to test whether the samples were drawn from populations having the same distribution. Pairs of force field simulations whose samples of R values have adjusted p-values higher than 0.05 are not significantly different from each other.
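The multiple-testing step can be sketched as follows. This is only an illustration of the Holm-Bonferroni adjustment applied to a list of raw p-values from pairwise tests; the rank sum tests themselves are not reproduced, and the p-values and force field pairs shown are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted values along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three force field pairs.
raw = {"pair_1": 0.01, "pair_2": 0.04, "pair_3": 0.03}
for pair, p_adj in zip(raw, holm_adjust(list(raw.values()))):
    verdict = "different" if p_adj <= 0.05 else "not significantly different"
    print(f"{pair}: adjusted p = {p_adj:.3f} ({verdict})")
```

The step-down scheme is less conservative than a plain Bonferroni correction while still controlling the family-wise error rate, which matters when all pairs of ten ensembles are compared.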
Principal Component Analysis of the MD trajectories
Principal component analysis (PCA) is used to extract the essential motions relevant to the function of the protein through the eigenvectors (principal components, PCs) of the covariance matrix of the positional fluctuations observed in MD trajectories, leaving out the irrelevant, physically constrained local fluctuations of the protein (Amadei et al., 1993). We performed all-atom PCA on a concatenated trajectory (including the trajectories for all ten force fields), superposing the protein on the Cα coordinates so that the simulations can be compared in the same subspace. Prior to the calculation, we discarded the six N-terminal and four C-terminal residues from our analyses to prevent their motion from masking the important motions in the remainder of the protein. The first three PCs account for 35% of the fluctuations of the protein, which is not surprising since the trajectory used for the analyses is a concatenated one including all the force field simulations.
MD ensemble comparison with ENCORE
We used ENCORE (Tiberti et al., 2015), as implemented in the MDAnalysis package (Michaud-Agrawal et al., 2011), to calculate ensemble similarity scores between each pair of ensembles. ENCORE estimates the probability distributions of the conformations that underlie each ensemble and calculates a probability similarity measure between each pair of them. To compare the LC3B simulations, we used the clustering ensemble similarity (CES) approach available in ENCORE. The method calculates the ensemble similarity as the Jensen-Shannon divergence between the estimated probability densities. CES partitions the whole conformational space into clusters and uses the relative populations of the different ensembles in the clusters as an estimate of the probability density. The CES values range between 0 and ln(2), where 0 indicates completely superimposable ensembles and ln(2) indicates non-overlapping ensembles. The clustering is carried out using the affinity propagation method (Frey and Dueck, 2007). The similarity scores were calculated using 1000 frames for each simulation, on Cα atoms only and excluding the flexible N- and C-terminal tails, as done in the PCA. The pairwise divergence values were visualised in heat maps and rendered as scatter plots using the tree preserving embedding method (Shieh et al., 2011).
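The idea behind the CES score can be sketched in a few lines. The example below is not the ENCORE implementation: it assumes that the frames of two ensembles have already been assigned to shared clusters (the affinity propagation step is skipped), and the cluster labels are hypothetical. It only shows how cluster populations are turned into a Jensen-Shannon divergence bounded by 0 and ln(2).

```python
import math
from collections import Counter


def cluster_populations(labels, n_clusters):
    """Fraction of frames of one ensemble falling in each shared cluster."""
    counts = Counter(labels)
    return [counts.get(c, 0) / len(labels) for c in range(n_clusters)]


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    mixture = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kullback_leibler(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kullback_leibler(p, mixture) + 0.5 * kullback_leibler(q, mixture)


# Hypothetical cluster assignments for the frames of two ensembles.
ensemble_a = [0, 0, 1, 1, 2, 2]
ensemble_b = [0, 1, 1, 1, 2, 2]
pop_a = cluster_populations(ensemble_a, n_clusters=3)
pop_b = cluster_populations(ensemble_b, n_clusters=3)
print(f"CES-like score: {jensen_shannon(pop_a, pop_b):.4f} (maximum {math.log(2):.4f})")
```

Two ensembles that populate the same clusters with the same frequencies give a score of 0, whereas ensembles that never share a cluster reach ln(2).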
Structural alphabets
We compared differences in the sampling of local conformations by analysing the trajectories using the M32K25 structural alphabet (hereafter SA) (Pandini et al., 2010). This alphabet describes the local conformation of the protein by means of fragments made of the Cα atoms of four consecutive residues, which were originally described by means of three angles. The 25 conformations, or letters, of the SA represent a set of canonical states describing the most probable local conformations (i.e., conformational attractors) in a set of experimentally derived protein structures. For every simulation, we used GSATools (Pandini et al., 2013) to encode the conformation of each frame into a SA string, composed of 117 letters for our 120-residue protein. We then transformed the SA representation into that of a reduced structural alphabet (rSA), according to the mapping defined between these two alphabets (Pandini and Fornili, 2016). The rSA is a reduced representation of the original alphabet in which each letter corresponds to a macro-region of the density space. Since we compare distributions derived from the structural alphabet (see below), this ensures that the observed differences reflect significant differences between states rather than the minor differences existing between letters of the original SA.
We then devised a fragment-wise measure to estimate the difference in sampling between different simulations. The following procedure has been carried out independently for each fragment. For each simulation, we calculated the frequency of each letter over the frames and used this as an estimate of the discrete probability distribution of the letters for that fragment and simulation. We then pairwise compared these distributions using the Jensen-Shannon divergence. In this way we obtained a JSd value for every pair of simulations, which accounts for the difference in sampling of different letters.
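The fragment-wise procedure can be sketched as below, under the assumption that each simulation is available as a list of equally long alphabet strings (one string per frame, one letter per fragment). The letters and strings are hypothetical and much shorter than the 117-letter strings described above.

```python
import math
from collections import Counter


def letter_frequencies(frames, position):
    """Relative frequency of each letter at one fragment position."""
    counts = Counter(frame[position] for frame in frames)
    return {letter: n / len(frames) for letter, n in counts.items()}


def jsd(p, q):
    """Jensen-Shannon divergence between two letter distributions (dicts)."""
    value = 0.0
    for letter in set(p) | set(q):
        a, b = p.get(letter, 0.0), q.get(letter, 0.0)
        mixture = 0.5 * (a + b)
        if a > 0:
            value += 0.5 * a * math.log(a / mixture)
        if b > 0:
            value += 0.5 * b * math.log(b / mixture)
    return value


def fragment_wise_jsd(simulation_a, simulation_b):
    """One divergence value per fragment for a pair of simulations."""
    n_fragments = len(simulation_a[0])
    return [
        jsd(letter_frequencies(simulation_a, i), letter_frequencies(simulation_b, i))
        for i in range(n_fragments)
    ]


# Hypothetical encoded frames: three fragments, four frames per simulation.
sim_a = ["ABA", "ABA", "ABC", "ABA"]
sim_b = ["ABC", "ACC", "ABC", "ACC"]
for index, value in enumerate(fragment_wise_jsd(sim_a, sim_b)):
    print(f"fragment {index}: JSd = {value:.4f}")
```

Fragments whose letters are sampled with identical frequencies in both simulations give a value of 0, so the profile along the sequence highlights where the local conformational sampling differs.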
Network Analyses of the MD Ensembles
We applied protein structure network (PSN) analysis to the MD ensembles, as implemented in PyInteraph (Tiberti et al., 2014). We defined as hubs those residues of the network with at least three edges, as commonly done for networks of protein structures (Papaleo, 2015). We used the node inter-connectivity to calculate the connected components, which are clusters of connected residues in the graph. For the contact-based PSN, we tested four different distance cutoffs to define the existence of a link between the nodes (i.e., 5, 5.125, 5.25 and 5.5 Å). We then selected the cutoff of 5.125 Å as the best compromise between an entirely connected and a sparse network, according to our recent work on PSN cutoffs (Viloria et al., 2017). The distance was estimated between the centers of mass of the residue side chains (except glycines). Since MD force fields are known to have different mass definitions, we used the PyInteraph mass databases for each of the MD ensembles.
We also calculated two other PSNs to reflect other classes of intramolecular interactions, i.e., hydrogen bonds and salt bridges. For salt bridges, all the distances between atom pairs belonging to the charged moieties of two oppositely charged residues were calculated, and the charged moieties were considered as interacting if at least one pair of atoms was found at a distance shorter than 4.5 Å. In the case of aspartate and glutamate residues, the atoms forming the carboxylic group were considered. The NH3+ and the guanidinium groups were employed for lysine and arginine, respectively. A hydrogen bond was identified when the distance between the acceptor atom and the hydrogen atom was lower than 3.5 Å and the donor-hydrogen-acceptor atom angle was greater than 120°.
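The geometric criteria above can be written down as a short sketch for a single frame. It is not the PyInteraph code: atoms are represented as plain (x, y, z) tuples in Å, the function names are hypothetical, and the selection of the relevant atoms of each residue is assumed to have been done beforehand.

```python
import math


def angle_deg(a, vertex, b):
    """Angle a-vertex-b in degrees for three points in 3D."""
    v1 = [x - y for x, y in zip(a, vertex)]
    v2 = [x - y for x, y in zip(b, vertex)]
    dot = sum(x * y for x, y in zip(v1, v2))
    cosine = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to [-1, 1] to protect acos from rounding errors.
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def is_hydrogen_bond(donor, hydrogen, acceptor, max_distance=3.5, min_angle=120.0):
    """Hydrogen-acceptor distance below 3.5 Å and D-H-A angle above 120°."""
    close_enough = math.dist(hydrogen, acceptor) < max_distance
    return close_enough and angle_deg(donor, hydrogen, acceptor) > min_angle


def is_salt_bridge(charged_group_a, charged_group_b, cutoff=4.5):
    """True if at least one atom pair of the two groups is closer than 4.5 Å."""
    return any(
        math.dist(atom_a, atom_b) < cutoff
        for atom_a in charged_group_a
        for atom_b in charged_group_b
    )


# Hypothetical coordinates (Å): a nearly linear donor-hydrogen-acceptor triplet.
print(is_hydrogen_bond((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.9, 0.3, 0.0)))
# Hypothetical carboxylate oxygens against a single lysine nitrogen.
print(is_salt_bridge([(0.0, 0.0, 0.0), (2.2, 0.0, 0.0)], [(5.0, 1.0, 0.0)]))
```

Evaluating these tests on every frame gives, for each residue pair, the fraction of frames in which the interaction is present, which is the quantity later filtered to build the network.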
To obtain contact-, salt-bridge- or hydrogen-bond-based PSNs for each MD ensemble, we retained only those edges which were present in at least 20% of the simulation frames (pcrit = 20%), as previously applied to other proteins (Papaleo et al., 2012b; Tiberti et al., 2014). We applied a variant of the depth-first search algorithm to identify the shortest path of communication. We defined the shortest path as the path in which the two residues were non-covalently connected by the smallest number of intermediate nodes. All the PSN calculations were carried out using the PyInteraph suite of tools (Tiberti et al., 2014), whereas we used the xPyder plugin for PyMOL (Pasi et al., 2012) for the mapping of the connected components on the 3D structure.
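The network construction can be illustrated with a minimal sketch that starts from a table of edge persistences. It is not the PyInteraph implementation: node names and persistence values are hypothetical, and the shortest path is found here with a plain breadth-first search, which returns a path with the fewest intermediate nodes in an unweighted graph, rather than with the depth-first variant mentioned in the text.

```python
from collections import deque


def build_psn(persistence, p_crit=20.0):
    """Keep edges whose persistence (% of frames) is at least p_crit."""
    graph = {}
    for (res_i, res_j), percent in persistence.items():
        # Every residue becomes a node, even if none of its edges survives.
        graph.setdefault(res_i, set())
        graph.setdefault(res_j, set())
        if percent >= p_crit:
            graph[res_i].add(res_j)
            graph[res_j].add(res_i)
    return graph


def find_hubs(graph, min_degree=3):
    """Nodes with at least min_degree edges."""
    return sorted(node for node, nbrs in graph.items() if len(nbrs) >= min_degree)


def connected_components(graph):
    """Groups of nodes that are mutually reachable (singletons included)."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for nbr in graph[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(component)
    return components


def shortest_path(graph, source, target):
    """Breadth-first search; returns None if the nodes are not connected."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nbr in sorted(graph[node]):
            if nbr not in parents:
                parents[nbr] = node
                queue.append(nbr)
    return None


# Hypothetical edge persistences (% of frames) between generic nodes.
toy_persistence = {
    ("A", "B"): 80.0, ("A", "C"): 35.0, ("A", "D"): 22.0,
    ("D", "E"): 50.0, ("F", "G"): 10.0,
}
psn = build_psn(toy_persistence)
print("hubs:", find_hubs(psn))
print("components:", [sorted(c) for c in connected_components(psn)])
print("path B -> E:", shortest_path(psn, "B", "E"))
```

In this toy network the edge between F and G falls below the 20% threshold, so the two nodes end up as isolated components and no path connects them.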
Contact analysis with CONAN
We performed the analysis of intramolecular contacts using the program CONtact ANalysis (CONAN) (Mercadante et al., 2018). CONAN implements the statistical and dynamical analysis of contacts in proteins and investigates how they evolve over time along MD trajectories. The program is open-source and calculates the contacts along the trajectory using the mdmat tool available in GROMACS version 5.1. We used the CONAN executable conan.py to perform the analysis over all the different force field simulations of LC3B. Inter-residue contacts in CONAN are defined by different distance cutoffs, which can be set by the user.
The main cutoff is rcut: any pair of residues that does not have at least one pair of atoms within this range is not considered. The cutoff under which a contact is formed between a pair of residues is rinter. The minimum distance between the atoms of residues i and j for a time period t is rijmin(t). The cutoff over which interactions are broken is rhigh-inter. We set rcut to 10 Å and both rinter and rhigh-inter to 5 Å, in agreement with the values used in the reference paper to study simulations of ubiquitin (Mercadante et al., 2018). We used 1001 discretization levels between 0 and rcut to analyze contact formation, obtaining a resolution of 0.01 Å. We performed the calculation every ns of simulation, considering all the atoms of the protein. We performed additional analyses on the CONAN output using, for each contact between a pair of residues, the persistence, calculated as the number of frames in which the contact is identified divided by the total number of frames in the trajectory, and the number of encounters, i.e., the number of times that a contact is formed and broken during the trajectory. We used an in-house R script to produce the heatmaps in Figure 5D. We included in the supplementary materials the movie of the contact map evolution during the trajectory created by CONAN, generated through mencoder, part of the MPlayer software package.
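The two quantities derived from the contact analysis can be sketched for a single residue pair. This is not the CONAN code nor the in-house script: the input is a hypothetical time series of minimum inter-residue distances, the behaviour at exactly the cutoff distance is an assumption, and an encounter is counted here each time a contact is newly formed.

```python
def contact_states(min_distances, r_inter=5.0, r_high_inter=5.0):
    """Contact state per frame: formed below r_inter, broken above r_high_inter."""
    states, in_contact = [], False
    for distance in min_distances:
        if not in_contact and distance < r_inter:
            in_contact = True
        elif in_contact and distance > r_high_inter:
            in_contact = False
        states.append(in_contact)
    return states


def persistence(states):
    """Fraction of frames in which the contact is present."""
    return sum(states) / len(states)


def encounters(states):
    """Number of times the contact switches from absent to present."""
    count, previous = 0, False
    for state in states:
        if state and not previous:
            count += 1
        previous = state
    return count


# Hypothetical minimum distances (Å) between two residues, one value per frame.
distances = [4.0, 4.5, 6.0, 7.0, 4.8, 5.2]
states = contact_states(distances)
print("persistence:", persistence(states))
print("encounters:", encounters(states))
```

A contact with high persistence and few encounters is stable, whereas a similar persistence with many encounters points to a contact that repeatedly forms and breaks.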
Identification of cancer missense mutations and their annotation
We collected and aggregated a subset of cancer-related missense somatic mutations found in LC3B from cBioPortal (Cerami et al., 2012), considering all studies available on the 12th of October 2018, and COSMIC version 86 (Tate et al., 2019), considering all cancer types and excluding those mutations classified as natural polymorphisms. Moreover, we collected annotations on post-translational modifications at the mutation sites using a local version of the PhosphoSitePlus database (Hornbeck et al., 2015), downloaded on 04/05/2018. Additional PTMs were manually annotated through a survey of the literature on LC3B. We collected short linear motifs (SLiMs) located in proximity of the identified mutations using predictions from the Eukaryotic Linear Motif (ELM) server (Van Roey et al., 2013). Those SLiMs for which an interaction is not compatible with their localization on the LC3B structure, such as a PP2A docking site, were discarded from further analyses. Moreover, for each cancer sample where a mutation was identified, we verified (when the information was available): i) the expression level of the LC3B gene; ii) whether other mutations occurred in the LC3B gene; and iii) whether any of the interactors (see below) was mutated in the same sample. We also verified whether any of the mutations was reported as a possible natural polymorphism in the healthy population, through a search in the ExAC database (Kobayashi et al., 2017). We predicted whether each of the mutant variants could harbor new SLiMs by querying ELM with the sequence of each mutant variant. We annotated each variant with the REVEL score (Ioannidis et al., 2016), as available on the MyVariant.info web resource (Xin et al., 2016). REVEL is an ensemble method for predicting the pathogenicity of missense variants from the scores generated by other individual prediction tools, which was found to be among the top performing pathogenicity predictors in a recent benchmarking effort (Li et al., 2018).
The REVEL score ranges from 0 to 1, with higher values indicating a stronger likelihood of pathogenicity. As done in the benchmarking study, we classified as pathogenic those variants with a score ≥ 0.4, which represents the best trade-off between sensitivity and specificity.
Coevolution analysis
We used two different parameters estimated by Gremlin (Ovchinnikov et al., 2014) to analyse the mutation sites. In particular, we employed: i) the conservation score estimated by the coupling matrix for the wild-type and the mutated residue at a certain position; ii) the residues that are coupled to the wild-type residue with a scaled score higher than one. We also derived a logo plot from the Gremlin sequence alignment with Weblogo (Crooks et al., 2004).
LC3B interaction network and identification of LIR-containing candidates
We retrieved the known LC3B interactors from the Integrated Interaction Database (IID) version 2018-05 (Kotlyar et al., 2019). We retained only those interactions identified by at least two of the studies annotated in the database. We then predicted LIR motifs for each of the interactors through the iLIR database (Jacomin et al., 2016) and retained only those with a score higher than 11. This threshold was selected as it allows for a higher sensitivity (92%) at the price of a slightly lower specificity (Jacomin et al., 2016). We also verified through a literature search whether any of the interactors includes one or more experimentally verified LIR motifs (see Supplementary File 14). For each interactor with at least one LIR motif, we annotated the occurrence of cancer mutations in the same samples where a mutation of LC3B was found. We then retained only the interactors in which such a mutation abolished a LIR motif or occurred in its proximity. The resulting LC3B interactors were displayed as a network plot, using the igraph R package and in-house developed code.
Structure-Based Prediction of Impact on Protein Stability and Binding Free Energies
We employed the FoldX energy function (Guerois et al., 2002; Schymkowitz et al., 2005) to perform in silico saturation mutagenesis using a Python wrapper that we recently developed. We used the same protocol that we recently applied to another protein (Nygaard et al., 2016). Calculations with the wrapper resulted in an average ΔΔG (differences in ΔG between mutant and wild-type variant) for each mutation over five independent runs performed using the NMR structure of LC3B (PDB entry 1V49, (Kouno et al., 2005)) and the X-ray crystallographic structure 3VTU (Rogov et al., 2013). We used the same pipeline to estimate the effect of mutations on the interaction free energy with the LIR domains using as a reference the structure of LC3B in complex with p62, FUNDC1 and ATG4B. This was performed by using the AnalyseComplex FoldX command on the mutant variant and corresponding wild-type conformation, and calculating the difference between their interaction energies.
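The way a binding ΔΔG is obtained from interaction energies can be sketched as follows. This is not the Python wrapper used in the study: it assumes that each independent run yields one interaction energy for the mutant and one for the corresponding wild-type conformation, and the energy values are hypothetical.

```python
from statistics import mean


def binding_ddg(mutant_energies, wildtype_energies):
    """Average over runs of (mutant - wild-type) interaction energy, kcal/mol."""
    if len(mutant_energies) != len(wildtype_energies) or not mutant_energies:
        raise ValueError("one mutant and one wild-type energy per run are required")
    return mean(m - w for m, w in zip(mutant_energies, wildtype_energies))


# Hypothetical interaction energies (kcal/mol) from five independent runs.
mutant = [-9.8, -10.1, -9.5, -9.9, -10.0]
wildtype = [-11.2, -11.0, -11.3, -11.1, -11.2]
print(f"average binding ΔΔG = {binding_ddg(mutant, wildtype):.2f} kcal/mol")
```

With this sign convention a positive value means that the mutant complex has a less favourable interaction energy than the wild-type one, i.e., the mutation is predicted to destabilize binding.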
Moreover, we applied a correction to the FoldX energy values related to protein stability, as defined by Tawfik’s group (Tokuriki et al., 2007), to make the FoldX ΔΔG values more comparable with the expected experimental values, as previously described (Nygaard et al., 2016). In our calculations, we cannot use an experimental value of the unfolding ΔG for the wild-type variant of the LC3B domain since, to the best of our knowledge, no experimental data are available in the literature. Nevertheless, values in the range of 5–15 kcal/mol are generally obtained for the net free energy of unfolding of proteins (Fersht and Serrano, 1993; Privalov, 1979). LC3B has a ubiquitin-like fold, and we can therefore refer to the free energy of unfolding experimentally measured for ubiquitin (Khorasanizadeh et al., 1996), 7.2 kcal/mol, which we used as the reference value to estimate the ΔG of the mutant variants.
In parallel, we collected ΔΔG predictions upon mutation using the Rosetta-based Flex ddG protocol (Barlow et al., 2018), which couples standard side-chain repacking and minimization with a “backrub” approach to produce an ensemble of structures sampling backbone degrees of freedom, and applies a generalized additive model to the Rosetta energy function. Flex ddG returns ΔΔG scores in Rosetta Energy Units (REUs), which are not directly convertible to kcal/mol but have been shown to correlate with experimental values (Barlow et al., 2018). We ran the protocol for each point mutation, setting 35000 backrub trials, 5000 maximum iterations per minimization and an absolute score threshold for minimization convergence of 1.0 REUs. We generated ensembles of 35 different structures for each mutant, calculated the ΔΔG for each structure and derived an average score. Destabilizing and stabilizing mutations are predicted using as empirical cutoffs values higher than 1 or lower than −1 REUs, respectively. We ran the protocol using Rosetta v2019.12.60667.
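The classification step can be sketched as below. The sketch only covers the averaging of per-structure scores and the application of the empirical cutoffs; it does not run Rosetta, the scores are hypothetical, and labelling the intermediate range as "neutral" is an assumption made for the example.

```python
from statistics import mean


def classify_flex_ddg(per_structure_scores, cutoff=1.0):
    """Average the per-structure ΔΔG scores (REUs) and apply the ±cutoff rule."""
    average = mean(per_structure_scores)
    if average > cutoff:
        label = "destabilizing"
    elif average < -cutoff:
        label = "stabilizing"
    else:
        # Assumption: scores between the two cutoffs are treated as neutral.
        label = "neutral"
    return average, label


# Hypothetical per-structure scores (REUs) for three mutations.
scores = {
    "mutation_1": [1.6, 1.2, 1.9, 1.4],
    "mutation_2": [0.2, -0.3, 0.4, 0.1],
    "mutation_3": [-1.5, -1.1, -1.3, -1.6],
}
for name, values in scores.items():
    average, label = classify_flex_ddg(values)
    print(f"{name}: average = {average:.2f} REUs -> {label}")
```

Averaging over the generated structures before thresholding reduces the influence of individual backbone models that give outlier scores.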
Conclusions
We unveiled the effects exerted by missense mutations found in cancer genomic studies on the key autophagy ubiquitin-like protein, LC3B, providing a solid computational framework that allows the parallel assessment of their impact on the most important properties that define its function and stability. We identified as driver sites for LC3B function four mutation sites that had already been experimentally shown to alter the protein activity, supporting our approach. Moreover, we suggested new mutations for experimental studies, such as those at D19 and R70, and Y113C. For these variants, it would be useful to evaluate whether the effect on the activity is dominated by a pronounced effect on protein stability, which alters the turnover of the mutated LC3B variants. R16G, which has not been studied so far, is also of interest since it seems to play a critical role in modulating protein stability. Moreover, thanks to the collection of MD simulations with ten different force fields, our data can also guide the selection of physical models for MD simulations of the conformational ensemble and structure-dynamics-function relationships of the proteins of the Atg8 family, here illustrated with LC3B as a prototype of this family.
Our framework provides the groundwork to better understand the impact of mutations found in high-throughput cancer genomics data on a group of proteins that are key players in the autophagy pathway. More generally, it can be applied to the study of cancer proteins to prioritize candidates for experimental validation of their potential passenger and driver effects. In such cases, it can also suggest the most convenient experimental methodologies for the validation, depending on whether the impact of the mutations is likely to be on the structural stability of the protein, on its activity, or on more specific aspects such as changes in post-translational regulation or allosteric mechanisms.
Acknowledgments
This project was supported by an exploratory LEO foundation grant (LF17006), a Carlsberg Foundation Distinguished Fellowship (CF18-0314), The Danish Council for Independent Research, Natural Science, Project 1 (102517), and Danmarks Grundforskningsfond (DNRF125) to the EP group. Moreover, the project has been supported by a KBVU pre-graduate fellowship and a Netaji Subhash ICAR international fellowship, Govt. of India, to MaK and MuK, respectively, to work in the EP group. The calculations described in this paper were performed using the DeiC National Life Science Supercomputer Computerome at DTU (Denmark), a DECI-PRACE 14th HPC Grant for calculations on Archer (UK), and ISCRA-CINECA HP10C0T58M. The authors would like to thank Lisa Cantwell for the professional scientific proofreading of the manuscript and Emiliano Maiani for fruitful comments and discussion.