ABSTRACT
Background Single Nucleotide Polymorphisms (SNPs) in the HEXB gene are associated with a neurodegenerative disorder called Sandhoff disease (SD) (GM2 gangliosidosis-O variant). This study aimed to predict the possible pathogenic SNPs of this gene and their impact on the protein using different bioinformatics tools.
Methods SNPs retrieved from the NCBI database were analyzed using several bioinformatics tools. The different algorithms collectively predicted the effect of single nucleotide substitution on both structure and function of beta subunit beta subunit of both hexosaminidase A and hexosaminidase B proteins.
Results Forty nine mutations were found to be extremely damaging to the structure and function of the HEXB gene protein.
Conclusion According to this study, forty two novel nsSNP in HEXB are predicted to have possible role in Sandhoff disease using different bioinformatics tools, beside two SNPs found to have effect on miRNAs binding site affecting expression of HEXB gene. Our findings may assist in genetic study and diagnosis of Sandhoff disease.
1. INTRODUCTION
Sandhoff disease (SD) also called (GM2 gangliosidosis-O variant) is a fatal rare autosomal recessive lysosomal storage disease of sphingolipid (GM2 ganglioside GM2) metabolism resulting from the deficiency of β-hexosaminidase (HexB) (1-5). The estimated carrier frequency of SD with a high incidence among certain isolated communities and ethnic groups around the world such as in Saskatchewan at rate 1:15 (1, 6). It is caused by deficiency of N-acetyl-β-hexosaminidase (Hex) resulting in pathological accumulation of GM2 ganglioside in lysosomes of the central nervous system (CNS) and progressive neurodegeneration. Currently, there is no treatment for SD. (2, 7) Affected individuals present with a wide spectrum of clinical manifestations, ranging from psychomotor impairment and death in the infantile form to motor neuron disease and autonomic dysfunction in the adult form(8) The disorder is classified according to the age of onset, as infantile, juvenile and adult form.(5) The most common and severe form of Sandhoff disease becomes apparent in infancy,(9-13) The clinical features of juvenile SD include ataxia, dysarthria and cerebellar atrophy, develop early and severe sensory loss in addition to chronic motor neuron disease and cerebellar ataxia(14) While the clinical features of adult SD include progressive muscle cramps, as well as wasting and weakness of the legs with onset after age 20. They also can show intention tremor of the upper extremities and dysarthria.(8)
The HEXB gene provides instructions for making a protein that is a subunit of two related enzymes; beta-hexosaminidase A and beta-hexosaminidase B. play a critical role in the central nervous system, These enzymes are found in lysosomes. Within lysosomes, the enzymes break down sphingolipids, oligosaccharides, and molecules that are linked to sugars such as glycoproteins (15). HEXB gene is localized on 5q13.3, which is the long (q) arm of chromosome 5 at position 13.3. The disease is caused by mutation in HEXB encoding the β-subunit of β-hexosaminidase A. β-Hexosaminidase A exists as a heterodimer consisting of α- and β-subunits (16). Numerous studies in the past have shown that SNPs are responsible of mutations that lead to cause Sandhoff disease (16, 17). Single nucleotide polymorphisms (SNPs) are an important source of human genome variability. Non-synonymous SNPs occurring in coding regions result in single amino acid polymorphisms (SAPs) that may affect protein function and lead to pathology (18). Functional variations can have deleterious or neutral effects on protein structure or function (19) Damaging effects might include destabilization of protein structure, altering gene regulation (20) affecting protein charge, geometry, hydrophobicity(21), stability, dynamics, translation and inter/intra protein interactions. (22, 23), hence structural integrity of cells comes under risk (24). Thus it can be avowed that nsSNPs might get linked with many human diseases because of these missense SNPs.(25) The aim of this study was to identify the possible pathogenic SNPs in HEXB gene using in silico prediction software, and to determine the structure, function and regulation of their respective proteins. This is the first study which covers an extensive in silico analysis of nsSNPs of HEXB protein hence this work might be useful in future in developing precision medicines for the treatment of diseases caused by these genomic variations. (26)
2. MATERIALS AND MEYHODS
2.1 Data mining
The data on human HEXB gene was collected from National Center for Biological Information (NCBI) web site (27). The SNP information (protein accession number and SNP ID) of the MEFV gene was retrieved from the NCBI dbSNP (http://www.ncbi.nlm.nih.gov/snp/) and the protein sequence was collected from Swiss Prot databases (http://expasy.org/). (28)
2.2 SIFT
SIFT is a sequence homology-based tool (29) that sorts intolerant from tolerant amino acid substitutions and predicts whether an aminoacid substitution in a protein will have a phenotypic Effect. Considers the position at which the change occurred and the type of amino acid change. Given a protein sequence, SIFT chooses related proteins and obtains an alignment of these proteins with the query. Based on the amino acids appearing at each position in the alignment, SIFT calculates the probability that an amino acid at a position is tolerated conditional on the most frequent amino acid being tolerated. If this normalized value is less than a cutoff, the substitution is predicted to be deleterious. SIFT scores <0.05 are predicted by the algorithm to be intolerant or deleterious amino acid substitutions, whereas scores >0.05 are considered tolerant. It is available at (http://sift.bii.a-star.edu.sg/).
2.3 Polyphen-2
It is a software tool (30) to predict possible impact of an amino acid substitution on both structure and function of a human protein by analysis of multiple sequence alignment and protein 3D structure, in addition it calculates position-specific independent count scores (PSIC) for each of two variants, and then calculates the PSIC scores difference between two variants. The higher a PSIC score difference, the higher the functional impact a particular amino acid substitution is likely to have. Prediction outcomes could be classified as probably damaging, possibly damaging or benign according to the value of PSIC as it ranges from (0_1); values closer to zero considered benign while values closer to 1 considered probably damaging and also it can be indicated by a vertical black marker inside a color gradient bar, where green is benign and red is damaging. nsSNPs that predicted to be intolerant by Sift has been submitted to Polyphen as protein sequence in FASTA format that obtained from UniproktB /Expasy after submitting the relevant ensemble protein (ESNP) there, and then we entered position of mutation, native amino acid and the new substituent for both structural and functional predictions. PolyPhen version 2.2.2 is available at (http://genetics.bwh.harvard.edu/pph2/index.shtml).
2.4 Provean
Provean is a software tool (31) which predicts whether an amino acid substitution or indel has an impact on the biological function of a protein. it is useful for filtering sequence variants to identify nonsynonymous or indel variants that are predicted to be functionally important. It is available at (https://rostlab.org/services/snap2web/).
2.5 SNAP2
Functional effects of mutations are predicted with SNAP2 (32). SNAP2 is a trained classifier that is based on a machine learning device called “neural network”. It distinguishes between effect and neutral variants/non-synonymous SNPs by taking a variety of sequence and variant features into account. The most important input signal for the prediction is the evolutionary information taken from an automatically generated multiple sequence alignment. Also structural features such as predicted secondary structure and solvent accessibility are considered. If available also annotation (i.e. known functional residues, pattern, regions) of the sequence or close homologs are pulled in. In a cross-validation over 100,000 experimentally annotated variants, SNAP2 reached sustained two-state accuracy (effect/neutral) of 82% (at an AUC of 0.9). In our hands this constitutes an important and significant improvement over other methods. It is available at (https://rostlab.org/services/snap2web/).
2.6 PHD-SNP
An online Support Vector Machine (SVM) based classifier, is optimized to predict if a given single point protein mutation can be classified as disease-related or as a neutral polymorphism, it is available at: (http://snps.biofold.org/phd-snp/phdsnp.html).
2.7 SNP& Go
SNPs&GO is an accurate method that, starting from a protein sequence, can predict whether a variation is disease related or not by exploiting the corresponding protein functional annotation. SNPs&GO collects in unique framework information derived from protein sequence, evolutionary information, and function as encoded in the Gene Ontology terms, and outperforms other available predictive methods. (33) It is available at (http://snps.biofold.org/snps-and-go/snps-and-go.html)
2.8 P-Mut
PMUT a web-based tool (34) for the annotation of pathological variants on proteins, allows the fast and accurate prediction (approximately 80% success rate in humans) of the pathological character of single point amino acidic mutations based on the use of neural networks. It is available at (http://mmb.irbbarcelona.org/PMut).
2.9 I-Mutant 3.0
I-Mutant 3.0 Is a neural network based tool (35) for the routine analysis of protein stability and alterations by taking into account the single-site mutations. The FASTA sequence of protein retrieved from UniProt is used as an input to predict the mutational effect on protein stability. It is available at (http://gpcr2.biocomp.unibo.it/cgi/predictors/I-Mutant3.0/I-Mutant3.0.cgi).
2.10 Project Hope
Online software is available at: (http://www.cmbi.ru.nl/hope/method/). It is a web service where the user can submit a sequence and mutation. The software collects structural information from a series of sources, including calculations on the 3D protein structure, sequence annotations in UniProt and prediction from other software. It combines this information to give analysis for the effect of a certain mutation on the protein structure. HOPE will show the effect of that mutation in such a way that even those without a bioinformatics background can understand it. It allows the user to submit a protein sequence (can be FASTA or not) or an accession code of the protein of interest. In the next step, the user can indicate the mutated residue with a simple mouse click. In the final step, the user can simply click on one of the other 19 amino acid types that will become the mutant residue, and then full report well is generated (36).
2.12 Raptor X
RaptorX (http://raptorx.uchicago.edu/): It is a web server predicting structure property of a protein sequence without using any templates. It outperforms other servers, especially for proteins without close homologs in PDB or with very sparse sequence profile. The server predicts tertiary structure (37)
2.11 UCSF Chimera (University of California at San Francisco)
UCSF Chimera (https://www.cgl.ucsf.edu/chimera/) is a highly extensible program for interactive visualization and analysis of molecular structures and related data, including density maps, supramolecular assemblies, sequence alignments, docking results, trajectories, and conformational ensembles. High-quality images and animations can be generated. Chimera includes complete documentation and several tutorials. Chimera is developed by the Resource for Biocomputing, Visualization, and Informatics (RBVI), supported by the National Institutes of Health (P41-GM103311).(38)
2.12 PolymiRTS
PolymiRTS is a software used to predict 3UTR (un-translated region) polymorphism in microRNAs and their target sites available at (http://compbio.uthsc.edu/miRSNP/). It is a database of naturally occurring DNA variations in mocriRNAs (miRNA) seed region and miRNA target sites. MicroRNAs pair to the transcript of protein coding genes and cause translational repression or mRNA destabilization. SNPs in microRNA and their target sites may affect miRNA-mRNA interaction, causing an effect on miRNA-mediated gene repression, PolymiRTS database was created by scanning 3UTRs of mRNAs in human and mouse for SNPs in miRNA target sites. Then, the effect of polymorphism on gene expression and phenotypes are identified and then linked in the database. The PolymiRTS data base also includes polymorphism in target sites that have been supported by a variety of experimental methods and polymorphism in miRNA seed regions. (39)
2.13 GeneMANIA
It is gene interaction software that finds other genes which is related to a set of input genes using a very large set of functional association data. Association data include protein and genetic interactions, pathways, co-expression, co-localization and protein domain similarity. GeneMANIA also used to find new members of a pathway or complex, find additional genes you may have missed in your screen or find new genes with a specific function, such as protein kinases. available at (https://genemania.org/) (40).
3. RESULTS
3.1 The functional effect of Deleterious and damaging nsSNPs of HEXB by SIFT, Provean PolyPhen-2, and SNAP2
3.2 Functional analysis of ADAMTS13 gene using Disease Related and pathological effect of nsSNPs by PhD-SNP, SNPs & GO and PMut softwares
3.3 Prediction of Change in Stability due to Mutation Using I-Mutant 3.0 Server
3.4 SNPs effect on 3’UTR Region (miRNA binding sites) in HEXB using PolymiRTS Database
3.5 Modeling of amino acid substitution effects on protein structure using
Interactions of HEXB gene with other Functional Genes illustrated by GeneMANIA
4. DISSCUTION
42 novel nsSNPs were identified out of 49 most damaging nsSNPs in the HEXB gene retrieved by our analysis using different bioinformatics tools which they could have adverse effects on the resulting protein fractionally and structurally.
The variant rs373979283 which entails the substitution Y266D was previously reported in patients with Sandoff disease(41-43), was predicted to be deleterious by all of the softwares used in this study in spite of the fact that Tyrosine is not highly conserved at this position because other homologous sequences were observed having other residues at the same position, but none of them is similar to Aspartate in terms of physiochemical properties, because Aspartate is smaller and less hydrophobic than Tyrosine, which could affect the protein folding process, in addition to that Tyrosine is neutral while Aspartate is negative in charge, therefore, this mutation could still be damaging.
The rs121907985 variant results in converting Proline into Serine which is smaller and less hydrophobic amino acid at the position 504, the difference in size could cause an empty space in the core of the protein, and the internal hydrophobic interactions could be lost because of the difference in the hydrophobicity value. Moreover, Proline’s rigidity could be crucial to induce a special backbone conformation at this position which can be disturbed by this substitution. This mutation is annotated to be pathogenic in ClinVar and ExPASy databases, and was previously reported by Hou et al.(1998) in two sisters with Sandoff disease, they found that this mutation significantly increases precursor/mature ratio as a result of increasing the retention of newly synthesized protein in the Endoplasmic reticulum, and also lowers its heat stability(44). The P504S substitution is mostly linked to chronic and less severe forms of the disease (45).
The variant rs764552042 marks the substitution from Arginine to Cysteine at the position 533 (R533C), the mutant residue is neutral in charge, smaller and less hydrophobic than the wild residue which is positive in charge. The difference in charge and size could possibly have some detrimental effects on external interactions and interactions with internal molecules. The same mutation was previously reported in many countries in patients with Sandoff disease (13, 46-48). Zampieri et al.(2012) found that the activity of the affected protein is considerably diminished or completely absent compared to the normal protein, and they predicted the mutant residue C533 could form a disulphide bond with C551 which normally forms this bond with C534, and that could result in severe misfolding of the C terminal loop of the protein(49).
All the three mentioned SNPs reside at IPR025705 and IPR017853 domains, in addition to IPR015883 for Y266D, P504S and R533C. These three domains are crucial for normal protein function and they interact with each other, which mean any mutation with deleterious effects within those domains could directly disturb the normal protein functionality.
Two SNPs (rs75974765 and rs1048088) were found to have an effect in 3’UTR. rs75974765 SNP is predicted to disrupt binding site of hsa-miR-3679-3p miRNA and creating new binding site at hsa-miR-204-5p, hsa-miR-211-5p, hsa-miR-4446-5p, hsa-miR-4755-5p and hsa-miR-5006-3p miRNAs while rs1048088 SNP is predicted to creat new binding site at hsa-miR-4731-5p hsa-miR-5589-5p miRNAs. Both of those SNPs might result in expression of the gene.
The importance of this study relies on the fact that it has subjected all the reported variants within the HEXB gene for computational functional and structural analysis which might facilitate further association studies. However, there are some limitations to be considered, like the highly selective protocol which might lead to missing some important SNPs. Computational analysis provides a good insight into which SNPs could drastically affect the protein’s tertiary structure and function, but it’s not completely conclusive, as such, we recommend following this study with wet-lab functional genetic studies and gene knockout on real animal models.
5. CONCLUSION
This study revealed 42 novel nsSNP out of 49 damaging SNPs in the HEXB gene affecting its function, structure and stability which they possibly lead to Sandhoff disease, by using different bioinformatics tools. Also, 2 SNPs found to have effect on miRNAs binding site affecting expression of HEXB gene. Optimistically, these results will aid in genetic studying and diagnosis of Sandhoff disease improvement.
7. DATA AVAILABILITY
All relevant data used to support the findings of this study are included within the manuscript and supplementary information files.
8. CONFLICT OF INTEREST
The authors declare that there is no conflict of interest regarding the publication of this paper.
6. ACKNOWLEDGMENT
The authors desire to acknowledge the exciting collaboration of Africa City of Technology – Sudan.