Prediction of Polygenic Risk Score by Machine Learning and Deep Learning Methods in Genome-wide Association Studies

A polygenic risk score (PRS) combines multiple SNPs simultaneously into a single disease risk score, and it is a useful tool for precision and personalised medicine. The classical method of calculating PRS typically requires two different data sets, a training set and a testing set, which is a disadvantage. Machine learning (ML) and deep learning (DL) methods can work with a single data set while avoiding the problem of overfitting, and can therefore serve as a good alternative. Genome-wide association study (GWAS) data were generated with the PLINK program, with a hundred replications at different allele frequencies and different sample sizes. We applied two ML algorithms, Support Vector Machine (SVM) and Random Forest (RF), as well as a DL approach. ML methods obtained more consistent results in terms of case-control separation than PRS calculated with the classical method. We suggest the use of ML and DL methods as an alternative to classical methods for calculating PRS.


INTRODUCTION
Genome-wide association studies (GWAS) are used to identify and investigate the genes and genomic regions that may underlie a specific disease. By using whole chromosomes, and thus hundreds of thousands of genetic variants simultaneously, gene-gene and gene-environment relationships are investigated. GWAS are therefore not restricted to specifically selected genes; they aim to determine differences between case and control groups using large data sets. A change in a single nucleotide in the genome sequence is called a single nucleotide polymorphism (SNP). SNPs explain to a great extent why some individuals are healthier while others are prone to disease, why the same disease progresses differently among different individuals, and why some individuals respond positively to treatment while others do not. GWAS are conducted with SNP data. SNPs associated with many diseases have now been identified. Therefore, studies that go beyond GWAS are needed.
Studies such as those on the polygenic risk score (PRS) may help clinicians and would be useful for precision and personalized medicine before disease develops [1,2]. PRS is a risk measure that uses many SNPs at the same time to calculate a risk score for a genetic disease. A low PRS means low genetic susceptibility to the disease; a high PRS, on the contrary, means a high genetic predisposition. The aims of this study are to find the model that most accurately separates cases from controls and predicts genetic risk, and to validate the classical PRS calculation using machine learning (ML) and deep learning (DL).

MATERIALS AND METHODS
To calculate PRS and find the best PRS model, we used a raw GWAS data set (bed, bim and fam files) simulated from real 1000 Genomes Project data. The data set relates to obesity and consists of 251 cases and 232 controls with 489805 SNPs. A BMI above 30 was used as the criterion to distinguish cases from controls [3]. In our simulation scenarios, we created different odds ratios for the case and control groups and different sample sizes.
In all data sets, the disease-associated SNPs created for the case groups make up 1% of all SNPs. To separate the case and control groups, disease-associated SNPs were defined as those with a p-value below 0.05 in logistic regression analysis.
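The selection rule above can be illustrated with a per-SNP logistic regression. This is only a sketch under assumptions: the text does not say which test statistic was used, so a Wald test is assumed here, and the helper name `snp_logistic_pvalue` and the toy genotypes are hypothetical.

```python
import math
import numpy as np

def snp_logistic_pvalue(genotype, phenotype, n_iter=25):
    """Two-sided Wald p-value for one SNP (logistic regression with intercept)."""
    x = np.column_stack([np.ones(len(genotype)), genotype]).astype(float)
    y = np.asarray(phenotype, dtype=float)
    beta = np.zeros(2)
    for _ in range(n_iter):  # Newton-Raphson updates
        p = 1.0 / (1.0 + np.exp(-(x @ beta)))
        w = p * (1.0 - p)
        information = x.T @ (x * w[:, None])
        gradient = x.T @ (y - p)
        beta = beta + np.linalg.solve(information, gradient)
    # Standard error of the SNP coefficient from the inverse information matrix
    p = 1.0 / (1.0 + np.exp(-(x @ beta)))
    w = p * (1.0 - p)
    covariance = np.linalg.inv(x.T @ (x * w[:, None]))
    z = beta[1] / math.sqrt(covariance[1, 1])
    return math.erfc(abs(z) / math.sqrt(2.0))

# Toy data (assumption): 200 cases followed by 200 controls, genotypes coded 0/1/2.
rng = np.random.default_rng(0)
phenotype = np.concatenate([np.ones(200), np.zeros(200)])
null_snp = rng.binomial(2, 0.3, size=400)
associated_snp = np.concatenate([rng.binomial(2, 0.5, size=200),
                                 rng.binomial(2, 0.3, size=200)])

p_null = snp_logistic_pvalue(null_snp, phenotype)
p_associated = snp_logistic_pvalue(associated_snp, phenotype)
is_associated = p_associated < 0.05  # selection threshold taken from the text
```

In a real analysis the same test would be run for every SNP, typically with a dedicated GWAS tool instead of a hand-written loop.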

Polygenic Risk Score
The polygenic risk score (PRS) is a useful method for calculating the risk of complex genetic diseases. By its nature, PRS uses variants across the whole genome in the calculation, and can therefore carry more information than individual mutations. PRS is calculated as the weighted sum of all variants across the genome.
It is calculated with Equation (1), in which each SNP has a weight and the alleles at that SNP are multiplied by this weight [3]. The odds ratio is used for each weight.
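As a minimal sketch of this weighted sum, assume that genotypes are coded as 0, 1 or 2 copies of the counted allele and that one weight per SNP is available. The function name `polygenic_risk_score` and the toy values are illustrative assumptions, not taken from the study.

```python
import numpy as np

def polygenic_risk_score(genotypes, weights):
    """Weighted sum of allele counts for each individual.

    genotypes: (n_individuals, n_snps) array of allele counts (0, 1 or 2)
    weights:   (n_snps,) array with one weight per SNP
    """
    genotypes = np.asarray(genotypes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return genotypes @ weights

# Toy example (assumption): two individuals, three SNPs.
genotypes = np.array([[0, 1, 2],
                      [2, 1, 0]])
odds_ratios = np.array([1.2, 0.9, 1.5])  # the text uses the odds ratio as the weight
scores = polygenic_risk_score(genotypes, odds_ratios)
```

Many PRS implementations use the natural logarithm of the odds ratio as the weight; the sketch follows the text and uses the odds ratio directly.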

Machine Learning Methods
Machine learning (ML) is the name given to a family of algorithms that detect patterns in data and improve their prediction rate by themselves. In other words, ML algorithms improve their own predictive performance using the patterns in the data sets they are given [4].
ML methods are becoming increasingly popular because, compared with classical statistical methods, they do not rely on assumptions such as normality or sample size. In addition, the many tuning parameters available when solving nonlinear problems are an advantage of ML methods [5,6]. Genetic epidemiological data sets are by nature big data; the mathematical methods used must therefore be able to handle data at this scale. In general, because ML methods have good predictive power, they are increasingly applied to the analysis of big data [7].

Support Vector Machines
Support vector machines (SVM) were introduced by Vladimir Vapnik and colleagues in 1992. The classification problem is solved with Lagrange multipliers, which yields the boundary that separates the classes with the minimum error rate. Although its training time is longer than that of other algorithms, SVM is resistant to the over-fitting problem [8,9].

Random Forest
Random Forest (RF) is an algorithm that combines many decision trees. It can be used with both categorical and continuous data, and for both classification and regression. RF can also be used for missing data imputation, feature selection, and measuring the contribution of each variable to the model, which is called variable importance [10-13].

Deep Learning Methods
Deep learning (DL) is a branch of machine learning (ML); the difference between the two lies in data processing steps, such as feature selection and data labelling, carried out before the model is applied.
The data must be processed before ML is applied, whereas DL does not need this [14]. The major advancements have been in image and speech recognition as well as natural language processing and language translation. The successes of deep learning originate from how it learns hierarchical representations of data by increasing the level of abstraction [15,16]. Deep learning architectures are artificial neural networks of multiple non-linear layers in which each successive layer operates on the representation from the preceding layer. Each layer consists of one or more artificial neurons, each of which is connected to neurons in the preceding layer. The artificial neuron receives separately weighted inputs and sums them to produce an activation using an activation function such as the sigmoid or the rectified linear unit (ReLU). In the training step, the most suitable hierarchical representations are learned from data by optimizing the weight parameters in each layer. The forward pass sequentially propagates signals forward through the network, and in the final output layer the loss function calculates the error. To minimize the error, the backward pass back-propagates error signals and updates the weight parameters using optimization algorithms based on stochastic gradient descent (SGD) [17].
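The forward pass, loss calculation and back-propagation described above can be illustrated with a small NumPy network. This is a sketch under stated assumptions, not the architecture used in the study: the toy genotype data, the single hidden layer of 8 ReLU units, the sigmoid output, the binary cross-entropy loss, the batch size and the learning rate are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 200 individuals x 10 SNPs coded 0/1/2;
# the label depends on the first two SNPs plus noise.
X = rng.integers(0, 3, size=(200, 10)).astype(float)
y = (X[:, 0] + X[:, 1] + rng.normal(0.0, 0.5, 200) > 2.0).astype(float)
X = X - X.mean(axis=0)  # centre the inputs

# Weight parameters of the two layers
W1 = rng.normal(0.0, 0.1, (10, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 0.1, (8, 1)); b2 = np.zeros(1)

def forward(inputs):
    """Forward pass: ReLU hidden layer followed by a sigmoid output."""
    hidden_pre = inputs @ W1 + b1
    hidden = np.maximum(hidden_pre, 0.0)
    output = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))
    return hidden_pre, hidden, output.ravel()

def binary_cross_entropy(labels, probs, eps=1e-12):
    return float(-np.mean(labels * np.log(probs + eps)
                          + (1.0 - labels) * np.log(1.0 - probs + eps)))

initial_loss = binary_cross_entropy(y, forward(X)[2])

learning_rate, batch_size = 0.1, 20
for epoch in range(100):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        hidden_pre, hidden, probs = forward(xb)
        # Backward pass: propagate the error signal from the output to the input layer
        d_out = (probs - yb)[:, None] / len(idx)
        grad_W2 = hidden.T @ d_out
        grad_b2 = d_out.sum(axis=0)
        d_hidden = (d_out @ W2.T) * (hidden_pre > 0)
        grad_W1 = xb.T @ d_hidden
        grad_b1 = d_hidden.sum(axis=0)
        # SGD update of the weight parameters
        W2 -= learning_rate * grad_W2; b2 -= learning_rate * grad_b2
        W1 -= learning_rate * grad_W1; b1 -= learning_rate * grad_b1

final_loss = binary_cross_entropy(y, forward(X)[2])
```

Comparing `final_loss` with `initial_loss` shows whether the updates reduced the training error on the toy data.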

Alternative Methods for Polygenic Risk Score Prediction
First, we identify the weight vectors needed to develop a new method for predicting PRS.
For the SVM and DL methods, the weight vectors were taken from the weight matrices that the models use to discriminate between the classes; these weights were used as the weight vector when predicting PRS. For the RF method, the variable importance measurements were used as the weight vector. For all methods, each SNP was multiplied by its weight, and individual risk scores were thus obtained. The formulas are as follows.
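As a hedged illustration of this weighting idea (not the authors' exact formulas), the sketch below derives per-SNP weights from a linear SVM and from RF variable importance with scikit-learn, then multiplies the genotypes by those weights. The toy data, the linear kernel, and the hyperparameters are assumptions; the DL weights are omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy data (assumption): 100 cases, 100 controls, 50 SNPs coded 0/1/2.
# The first 5 SNPs have a higher allele frequency in cases.
n_cases, n_controls, n_snps = 100, 100, 50
freq_controls = np.full(n_snps, 0.3)
freq_cases = freq_controls.copy()
freq_cases[:5] = 0.5
X = np.vstack([
    rng.binomial(2, freq_cases, size=(n_cases, n_snps)),
    rng.binomial(2, freq_controls, size=(n_controls, n_snps)),
]).astype(float)
y = np.concatenate([np.ones(n_cases), np.zeros(n_controls)])

# SVM: with a linear kernel, coef_ holds one weight per SNP.
svm = SVC(kernel="linear", C=1.0).fit(X, y)
svm_weights = svm.coef_.ravel()

# RF: variable importance used as the weight vector.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
rf_weights = rf.feature_importances_

# Individual risk scores: each SNP multiplied by its weight, then summed.
prs_svm = X @ svm_weights
prs_rf = X @ rf_weights

# Case-control separation in terms of mean score
svm_mean_gap = prs_svm[y == 1].mean() - prs_svm[y == 0].mean()
rf_mean_gap = prs_rf[y == 1].mean() - prs_rf[y == 0].mean()
```

Note that RF importances are non-negative, so they rank SNPs by contribution without indicating the direction of effect, whereas the SVM weights are signed.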

Simulations Steps
Because it is very hard and expensive to access GWAS data for a specific disease, we simulated data with the PLINK software. The raw files (bed, bim, fam) were used to develop the new method and to compare the power of the approaches. Sample sizes of the simulations for cases-controls: 500-500,

RESULTS
In the simulation study, we compared polygenic risk scores calculated by the classical method, SVM, RF and [...] was used (p=0.760). The same situation is seen when 100000 SNPs were used (Table 3). [...] [24,25]. A PRS calculation study using ML in the field of neurology concluded that SNPs should be weighted with ML methods [26]. PRS calculated by ML methods achieved more successful case-control distinction in terms of means than the classical method, across different sample sizes and different numbers of SNPs. With a low number of SNPs, DL may give better results, but it loses power as the number of SNPs increases; previous studies have also found that the classification ability of DL is not strong enough for binary GWAS data [27]. Given that all the PRS calculation methods are based on weights*SNPs, the allele part, which is the *SNPs term, is the same for all of them. The difference in case-control separation therefore comes from the calculated weights. With PRS, however, it is possible to measure the risk of a progressive disease in individuals [30]. A method whose average risk scores have small standard deviations may be advantageous for the use of PRS, because the range of variation is small. It can be said that ML methods give better results than DL and traditional methods. The classical method of calculating PRS typically uses two different data sets, a training set and a test set. This is not always easy to arrange and is a disadvantage of the classical method [31]. By using a single data set, ML and DL methods avoid the problem of overfitting and can be used as a good alternative that does not share this disadvantage.
Previous studies of a similar nature have also shown that, using GWAS summary statistics with continuous data, ML methods yield better results than the classical polygenic risk score calculation method [32,33]. In a similar study examining gene-gene and gene-environment interactions related to type II diabetes, the clinical risk score was found to be highly similar to PRS. In another study following similar logic, PRS was used together with variables that pose clinical risk [34].