Prediction of the Effect of Naturally Occurring Missense Mutations on Cellular N-Acetyl-Glucosaminidase Enzymatic Activity

In 2015, the Critical Assessment of Genome Interpretation (CAGI) proposed a challenge to devise a computational method for predicting the phenotypic consequences of genetic variants of the lysosomal hydrolase α-N-acetylglucosaminidase (NAGLU). As of 2014, the Human Gene Mutation Database listed 153 NAGLU mutations associated with MPS IIIB, 90 of which are missense mutations. The ExAC dataset catalogued 189 missense mutations in NAGLU based on exome sequence data from about 60,000 individuals, 24 of which are known to be disease associated. The biotechnology company BioMarin has quantified the relative functionality of NAGLU for the remaining subset of 165 missense mutations. For this challenge, we examined the missense mutations within the ExAC dataset, predicted the probability that each mutation is deleterious, and related this measure to enzymatic activity. In doing so, we hoped to learn the degree to which changes in amino acid physicochemical properties are tolerable for NAGLU function. Amino acid substitution (AAS) prediction methods are based mainly on sequence and structure information. Direct comparisons between different AAS methods are difficult and of questionable value, because each method has been tested on different datasets and built on different versions of the underlying databases. Several AAS prediction methods are currently available. PolyPhen-2, an updated version of PolyPhen, predicts the possible impact of an amino acid substitution on protein structure and function; users provide protein or SNP identifiers, protein sequences, substitution positions, and related inputs.
PolyPhen-2 provides a score ranging from 0 to 1, corresponding to the probability that a mutation eliminates the enzyme's functionality. Once the probability scores were generated, the dataset was run through multiple machine learning algorithms to build a model for predicting the enzymatic activity of MPS IIIB-related mutations. The prediction used the PolyPhen-2 probability score and other information about each mutation (amino acid type, location, allele frequency, etc.) as input features. The resulting aggregate score for each mutation was then reported back to CAGI. The results of the analysis give us confidence that the scores are reasonable predictors of enzymatic activity for a given mutation in the NAGLU amino acid sequence.


Introduction
Sanfilippo Syndrome type B (Mucopolysaccharidosis IIIB, MPS IIIB) is an autosomal recessive, neurodegenerative disease that primarily afflicts children. Early clinical manifestations include intellectual disability and behavioral disturbance, with death typically occurring in the second or third decade of life. MPS IIIB is an inborn error of glycosaminoglycan metabolism characterized by the systemic accumulation of the substrate heparan sulfate, leading to cognitive decline [1]. The lysosomal hydrolase α-N-acetylglucosaminidase (NAGLU) removes terminal α-N-acetylglucosamine residues from heparan sulfate; NAGLU deficiency is the defect responsible for MPS IIIB, one of the four enzyme deficiencies underlying the Sanfilippo syndromes.
The National Institute of Neurological Disorders and Stroke estimates that Sanfilippo syndrome type B has a prevalence of 1 in every 200,000 live births [2]. According to the work done by Meikle, et al. in Prevalence of Lysosomal Storage Disorders [3], Sanfilippo syndrome type B generally manifests in young children and is carried by an autosomal recessive gene. For those affected, the symptoms are debilitating.
The National MPS Society reports that common symptoms include delayed development and behavioral issues, generally first seen in children. As the disease progresses with age, behavioral and cognitive symptoms such as chewing on the hands, difficulties with toilet training, and learning disabilities worsen. Other symptoms may include nausea, digestive problems, and diarrhea, as well as the distinctive facial features of MPS-III patients [4].
Currently, there is no known cure for this disease, and those diagnosed die at a very early age [4]. An assessment of mutational severity can be time-consuming and costly when one employs traditional wet-lab experiments. In recent years, there has been a drive towards concerted efforts to boost the efficiency of gene mutation studies by implementing computational methods to predict functional outcomes [5]. The demonstrated effectiveness of computationally driven predictive models in characterizing functional outcomes shows that they can save time and money when evaluating the effect of a genetic mutation. To increase the use and accuracy of these predictive models, they first need to be validated on known cases. Continuous research efforts to understand this disease will hopefully lead to better diagnosis and treatment options. Patients suffering from MPS-III may benefit from a computational framework to predict the functional outcomes of particular variants, from which a more targeted treatment strategy may be developed.

CAGI
The Critical Assessment of Genome Interpretation (CAGI) provides a unique platform for researchers in bioinformatics to apply machine learning methods and sharpen the understanding of a life-altering disease. The NAGLU use case was part of series IV of the CAGI challenges. The CAGI platform provided the mutation data for MPS-IIIB, which we used in this study.
Results predicting the effects of mutations on NAGLU enzymatic activity were submitted to the Critical Assessment of Genome Interpretation contest [6]. The submitted predictions are summarized in Table 1.

Table 1: Prediction scoring table, ranging from 0 (no enzymatic activity) to 1 (wild-type level of activity).

Figure 1: Prediction scores as a function of NAGLU enzymatic activity.

Along with these values, amino acid mutation scores and the standard deviation of the predictions were also reported. As shown in Figure 1, the values range from 0.0 to 1.0 and correspond to the interpretations laid out in Table 1.

VCFTools
Data from 1000 Genomes is reported in the Variant Call Format (.vcf) [7]. For MPS-IIIB, only a particular region of chromosome 17, which contains the NAGLU gene, is of interest. However, only data for the entire chromosome is available for direct use, so once downloaded, the file needed to be trimmed to the region of interest.
Upon downloading the entire chromosome, the file was noticeably large, roughly 23 GB for the .vcf file. At that size it was far too large to manage with a spreadsheet program. The raw tabular layout was also difficult to use directly: it contains the pertinent information, but not in a form suited to our analysis. See Table 3.
Using VCFTools, a Perl add-in for reading and working with .vcf files, we transformed the large input file, shown in Table 3, into a more manageable, tabular format, as shown in Table 4 [8,9].
We ran the tool on the entire .vcf file of chromosome 17 and summarized the data into a format that reports the frequency of the different alleles at each biallelic site along the chromosome. This information is summarized from the 2,504 unrelated genomes from 26 populations around the globe. By creating a statistical summary and filtering out the parts of the chromosome outside the region of interest, the resulting file size was drastically reduced while providing a relevant indicator of likely pathogenicity based upon incidence.

Table 4: Summarized information after the use of VCFTools
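To illustrate how the vcftools allele-frequency summary can be consumed downstream, the Python sketch below parses one data line of a vcftools --freq output (.frq) file into a per-site record. The column layout (CHROM, POS, N_ALLELES, N_CHR, then one ALLELE:FREQ field per allele) follows vcftools' documented --freq output; the coordinates, region bounds, and frequencies shown are illustrative, not values from our dataset.

```python
# Example vcftools invocation (region bounds are illustrative):
#   vcftools --gzvcf chr17.vcf.gz --chr 17 \
#            --from-bp 40688000 --to-bp 40697000 --freq --out naglu
def parse_frq_line(line):
    """Parse one data line of a vcftools --freq output (.frq) file.

    Columns are tab-separated: CHROM, POS, N_ALLELES, N_CHR, then one
    ALLELE:FREQ field per allele observed at the site.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _n_alleles, n_chr = fields[:4]
    alleles = {}
    for entry in fields[4:]:
        allele, freq = entry.split(":")
        alleles[allele] = float(freq)
    return {"chrom": chrom, "pos": int(pos), "n_chr": int(n_chr),
            "alleles": alleles}
```

For a biallelic site with a rare alternate allele, a line such as `"17\t40688513\t2\t5008\tG:0.9996\tA:0.0004"` parses into a record whose minor allele frequency (0.0004) can feed directly into the pathogenicity indicator described above.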

PolyPhen-2
PolyPhen-2 is a tool for predicting the functional effects of human nonsynonymous single nucleotide polymorphisms (nsSNPs) [10]. The tool, executed via Perl, relies on external references and databases to make its functional predictions. We utilized the precomputed MLC and MultiZ alignments, which are necessary to decrease the overall computational time of the program, as well as the bundled databases: the ncbi-blast-2.2.31+ tools, the UniRef100 nonredundant protein sequence database, and the PDB and DSSP databases.
The tools within PolyPhen-2 run in sequential order, beginning with the optional Map SNPs annotation tool (mapsnps.pl), which maps the genomic coordinates of each SNP and produces an output file of the variants that result in missense mutations, formatted to be piped into the main PolyPhen-2 run (run_pph.pl). For the CAGI dataset, which provided the reference amino acids and their mutated variants at given positions, the PolyPhen-2 protein-variant input format was chosen as the method for running the data through the pipeline. PolyPhen-2 is capable of making predictions based upon two models, HumDiv and HumVar. For this task, the HumVar model was used, as its training set includes nsSNPs with minor allele frequencies greater than 1.0% as well as known disease-associated mutations.
As a safety check on the performance of PolyPhen-2, we collected known disease-associated SNPs from the Online Mendelian Inheritance in Man (OMIM) database as a positive control [11]. We also tested the performance on polymorphisms curated by dbSNP as having no associated disease to serve as a negative control set [12]. From this evaluation we felt confident in PolyPhen-2's ability to reliably calculate the probability of a mutation being deleterious.
For the 1000 Genomes files, the only information present is the genomic position and the variants, making the initial Map SNPs tool necessary in order to extract the nsSNPs. This leads into the tools for protein variant annotation and probabilistic classification: the main PolyPhen-2 run (run_pph.pl) and the Weka prediction (run_weka.pl). The PolyPhen-2 output file for SNPs belonging to our CAGI dataset was subsampled and reformatted using Perl to highlight the columns of interest. A probability close to 1.0 indicates an amino acid change that is likely to damage the functionality of the protein. PolyPhen-2 generates a prediction output, which we parsed into a tabular format, removing unnecessary attributes, as shown in Table 5.

Table 5: Sample parsed PolyPhen-2 output of CAGI dataset.
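Although our reformatting was done in Perl, the subsampling step can be sketched in Python. The function below reduces a PolyPhen-2 output table (a '#'-prefixed, tab-separated header followed by tab-separated, space-padded data rows) to a few columns of interest; the column names reflect a typical PolyPhen-2 output header and may vary by version, and the sample accession and values in the test are illustrative.

```python
def parse_pph2(text, keep=("o_pos", "o_aa1", "o_aa2", "pph2_prob")):
    """Reduce a PolyPhen-2 output table to a few columns of interest.

    The first line is a '#'-prefixed, tab-separated header; data fields
    are tab-separated and padded with spaces, so each is stripped.
    """
    lines = text.strip().splitlines()
    header = [h.strip() for h in lines[0].lstrip("#").split("\t")]
    rows = []
    for line in lines[1:]:
        fields = [f.strip() for f in line.split("\t")]
        row = dict(zip(header, fields))
        rows.append({k: row[k] for k in keep})
    return rows
```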
The data from the 1000 Genomes is provided as a .vcf file, which gives the genomic coordinates of each mutation but lacks any clear information for determining which variants are nonsynonymous mutations within the exonic regions of NAGLU. PolyPhen-2's Map SNPs tool (mapsnps.pl) was used to handle this task. However, the output from this tool indicated that none of the 253 variants called within the region of the NAGLU gene were nonsynonymous, a result that seems unlikely given the large number of variants. As a check, we joined the table of 1000 Genomes SNPs against the table of CAGI nsSNPs, which did produce matches. To determine which variants from the 1000 Genomes data were nsSNPs, and thus could serve as input for the PolyPhen-2 and Weka runs, these variants were instead processed through wANNOVAR so that the necessary comparisons across datasets could be made.

wANNOVAR
To accomplish the task of accurately annotating the 1000 Genomes dataset, wANNOVAR was employed [13]. wANNOVAR allows selection of the hg19 genome build used by 1000 Genomes and accepts several different file formats. The input file format is similar to the "genomic position" format used by PolyPhen-2: it includes the variant's genomic start and end positions, the reference nucleotide, and the variant nucleotide in separate columns. This web-based tool annotates the set of variants, allowing the extraction of those SNPs designated as exonic and resulting in nonsynonymous mutations. In addition, the output includes a PolyPhen-2 score derived from the HumDiv model.
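For illustration, converting VCF-style SNV records into the five-column "genomic position" input described above (chromosome, start, end, reference, variant) might look like the following Python sketch; the example coordinates are hypothetical, not taken from our dataset.

```python
def vcf_records_to_avinput(records):
    """Format (chrom, pos, ref, alt) SNV tuples as ANNOVAR-style input
    lines: chromosome, start, end, reference, variant. For a single
    nucleotide variant the start and end coordinates are the same."""
    lines = []
    for chrom, pos, ref, alt in records:
        if len(ref) != 1 or len(alt) != 1:
            continue  # this sketch handles SNVs only, not indels
        lines.append(f"{chrom}\t{pos}\t{pos}\t{ref}\t{alt}")
    return "\n".join(lines)
```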

Database Construction
Given that the machine learning task requires a single, flattened, tabular dataset as input, we curated a database where we could join the various aforementioned features together. These data were housed in a Microsoft SQL Server 2014 server instance.
Importing the datasets was performed using SQL Server Integration Services, an enterprise-grade Extract, Transform, Load (ETL) tool. Once the data was successfully written into the database in its respective tables, the tables were joined together by amino acid sequence position. This combines, for example, the .vcf summary, the CAGI data, and the PolyPhen-2 probabilities, resulting in a final SQL view that served as the input dataset for our machine learning analysis. See Table 7.
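The join logic behind the final SQL view can be illustrated with a minimal Python sketch: an inner join keyed on amino acid substitution position. The field names here are hypothetical stand-ins for the actual table columns; the real work was done in SQL Server.

```python
def join_by_position(cagi, pph2, freq):
    """Inner-join three per-position record sets on amino acid
    substitution position, mimicking the SQL view that fed the
    machine learning step. Each argument maps position -> dict of
    feature columns; only positions present in all three survive."""
    joined = []
    for pos in sorted(cagi):
        if pos in pph2 and pos in freq:
            row = {"pos": pos}
            row.update(cagi[pos])
            row.update(pph2[pos])
            row.update(freq[pos])
            joined.append(row)
    return joined
```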

Machine Learning
Using the conglomerate dataset (shown in Table 7), we utilized multiple machine learning algorithms to generate five different models. Two platforms were used: Microsoft Azure Machine Learning [14] and Apache Spark [15] on Azure Databricks [16]. Both platforms and the various algorithms within all used the same input dataset and the same features. See Table 8.

Microsoft Azure Machine Learning
Using the Microsoft Azure Machine Learning Studio, our preliminary model was generated using a Decision Forest algorithm [17,18]. The input parameters for the Decision Forest algorithm are in Table 9.

Spark MLlib
Revisiting the machine learning approach, we also trained four additional models using the Apache Spark Machine Learning Library (MLlib) [19]. While the Microsoft Azure Machine Learning approach was successful in generating a highly predictive model of NAGLU activity, Spark allows for a much more robust parameter sweep and cross-validation in a distributed cluster context.
The four MLlib algorithms used to create the additional models were logistic regression [20], random forest [21], gradient boosted trees [22], and decision tree [23]. Each of the four additional models was trained using a grid search through various parameters together with 5-fold cross-validation. See Table 9. An additional change was made in the evaluation of model fitness: since the CAGI competition has since published the actual (experimental) enzymatic activity for the mutations in the training data, we used this information in the test dataset to evaluate the models. It was not used in model training, but only to compare the predictions to the actual enzymatic activity values. A model for each algorithm was then selected by minimizing the root mean square error (RMSE) between the prediction and the actual enzymatic activity.

Table 9: Parameters and resulting RMSE and correlation (compared to actual enzymatic activity) for each machine learning algorithm.
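The model-selection criterion, minimizing RMSE between predicted and published enzymatic activity, can be sketched in plain Python; the model names and values in the usage example are placeholders, not our actual results.

```python
import math

def rmse(predicted, actual):
    """Root mean square error between predicted and actual activity."""
    pairs = list(zip(predicted, actual))
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

def select_best(models):
    """Pick the model name whose (predicted, actual) lists give the
    lowest RMSE, the selection rule applied to each algorithm's grid."""
    return min(models, key=lambda name: rmse(*models[name]))
```

For example, a candidate whose predictions sit within 0.1 of the true activities beats one whose predictions sit within 0.5, so `select_best` returns the former.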

Discussion
Here we show that, by combining heterogeneous information such as reference data (on alleles and amino acids) with MPS-IIIB data (from CAGI), and then processing the data with functional analysis (PolyPhen-2 with wANNOVAR exon data), we can create a useful training dataset for predicting the enzymatic effects of the mutations related to this disease.
Using various machine learning approaches, we successfully created multiple machine learning models. The winning model in terms of Pearson correlation was the Spark MLlib-based random forest model; however, the Azure Machine Learning-based decision forest model performed best in terms of RMSE.
These models, while not perfect, have decent power for predicting NAGLU enzymatic activity in MPS-IIIB.
Predictive models such as these will help researchers understand and pinpoint locations where mutations drastically affect the output and function of NAGLU. By understanding the mutations' effects, more targeted treatments can be developed to mitigate, reverse, or protect against the functional changes in the mutated enzyme.
Future work should include further model tuning, complete with the addition of other data. The correlation between the models' predictions and the actual tested activity is clearly positive, but still leaves room for improvement. This suggests that additional factors have not been captured by the datasets used in this study; finding those factors and adding them to the model should improve future performance, yielding more precise predictions.