Hybrid Support Vector Regression Model and K-Fold Cross Validation for Water Quality Index Prediction in Langat River, Malaysia

Water quality analysis is an important step in water resources management; it must be carried out efficiently to control pollution that may affect the ecosystem and to ensure that environmental standards are met. Developing a water quality prediction model is therefore an important step towards better management of river water quality. The objective of this work is to use a hybrid of Support Vector Regression (SVR) modelling and K-fold cross-validation (CV) as a tool for Water Quality Index (WQI) prediction. According to the Department of Environment (DOE) Malaysia, the standard WQI is a function of six water quality parameters, namely Ammoniacal Nitrogen (AN), Biochemical Oxygen Demand (BOD), Chemical Oxygen Demand (COD), Dissolved Oxygen (DO), pH, and Suspended Solids (SS). In this research, the SVR model is combined with the K-fold CV method to predict WQI in Langat River, Kajang. Two stations, L15 and L04, monitored monthly for ten years, serve as the case study. The final model was selected from a series of results, namely kernel function performance, kernel hyperparameter values, the K-fold CV value and sets of prediction models, all of which underwent training and testing phases. The Nu-RBF SVR model combined with 5-fold CV is found to predict WQI successfully in a cost-efficient and timely manner. In conclusion, the SVR model and the K-fold CV method are powerful tools in statistical analysis and can be used not only in water quality applications but also in other engineering applications.


Introduction
Water Quality Index (WQI)

Water Quality Index (WQI) has been used to check the status of river quality for different uses. WQI was developed by Brown et al. in 1970 and then, in 1975, was ...

To obtain the estimation of ... and ..., Equation (3) is transformed to the primal function given by Equation (5) ... the inner product of two vectors ... in the feature space ..., that is,

Four common kernel function types of SVM are given as follows:
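As an illustration, the four kernel types named later in this paper (Linear, Radial Basis, Polynomial and Sigmoid) can be sketched in plain Python. The parameter names `gamma`, `coef0` and `degree` and the exact functional forms follow the common LIBSVM convention; this is an assumption, since the parameterisation used by the authors is not shown here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    # K(u, v) = u . v
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    # K(u, v) = (gamma * u . v + coef0) ** degree  (LIBSVM-style form, assumed)
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    # K(u, v) = exp(-gamma * ||u - v||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    # K(u, v) = tanh(gamma * u . v + coef0)
    return math.tanh(gamma * dot(u, v) + coef0)
```

Each function takes two input vectors, for example two observations of the six water quality parameters, and returns a single similarity value.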

There are two types of SVM regression; both have the general formula given in Equation (2). The first type of SVM regression is known as Type 1 or Epsilon. The error function of this type is given by the formula shown in Equation (6). The second type of regression is known as Nu.
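In the standard formulation of epsilon-type SVR, the error function is built on the ε-insensitive loss, which ignores residuals smaller than ε. The sketch below assumes that standard form, which may not match the paper's Equation (6) term by term. The WQI values are invented for illustration only.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Standard epsilon-insensitive loss: zero inside the tube, linear outside."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def count_outside_tube(actual, predicted, epsilon):
    """Number of samples whose residual exceeds epsilon."""
    return sum(1 for t, p in zip(actual, predicted) if abs(t - p) > epsilon)


# Hypothetical WQI values (not taken from the Langat River data set).
actual = [82.0, 75.5, 68.0, 90.2]
predicted = [81.7, 76.5, 68.05, 88.0]

for eps in (0.1, 0.5, 2.5):
    print(eps, count_outside_tube(actual, predicted, eps))
```

As ε grows, fewer samples fall outside the tube. In standard ε-SVR only samples on or outside the tube can become support vectors, so a wider tube generally means fewer support vectors and a coarser fit.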

C and gamma are the parameters of a nonlinear SVM with the RBF kernel. For a higher value of gamma, the width of the Gaussian kernel is smaller, indicating that each support vector does not have wide-spread influence. In general, larger gamma leads to lower bias and higher variance models, and vice versa.
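The effect of gamma on the reach of a support vector can be sketched numerically. The example assumes the usual RBF form exp(-gamma * d^2). The helper `influence_radius` and its threshold of 0.5 are hypothetical choices made only for illustration.

```python
import math


def rbf_similarity(distance, gamma):
    """RBF kernel value for two points separated by `distance`."""
    return math.exp(-gamma * distance ** 2)


def influence_radius(gamma, threshold=0.5):
    """Distance at which the kernel value falls to `threshold` (illustrative helper)."""
    return math.sqrt(-math.log(threshold) / gamma)


for gamma in (0.2, 2.0, 20.0):
    print(gamma, round(influence_radius(gamma), 3), round(rbf_similarity(1.0, gamma), 4))
```

A larger gamma gives a smaller radius, so each support vector affects only the points close to it.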

Meanwhile, C is the parameter of the soft margin cost function, which controls the influence of each individual support vector. This process involves trading error penalty ... face the risk of overfitting. Technically, a higher value of C leads to lower bias and higher variance models, while a smaller C causes higher bias and lower variance [5].

K-fold cross-validation (CV) is a robust technique to evaluate the accuracy of a model. An advantage of k-fold CV is that it generally gives more accurate estimates of the test error rate [21]. A smaller value of K gives a more biased estimate and is therefore less desirable, whereas a larger value of K is less biased but can suffer from higher variance. Although there is no formal rule for choosing the value of k, the common choices in view of this trade-off are 5 [37] or 10 (Fig 6). These values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance [12,18,19].
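The splitting step of k-fold CV can be sketched with the standard library alone. The sample count of 120 assumes one complete record per month over ten years at a single station, which is an assumption rather than a figure from the paper. The function names and the fixed `seed` are illustrative.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and cut them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % k) folds receive one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    return folds


def train_test_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices); each fold is the test set exactly once."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


# 120 = 12 months x 10 years (assumed complete record for one station).
for train_idx, test_idx in train_test_splits(120, 5):
    print(len(train_idx), len(test_idx))
```

With k = 5, each round trains on 96 samples and tests on the remaining 24. With k = 10 the test fold would shrink to 12 samples.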

In this study, the SVM model for the prediction of WQI has been developed using various kernel functions, namely the Linear, Radial Basis, Polynomial and Sigmoid functions. Initially, these functions need to be evaluated with 10-fold cross-validation to determine the best kernel type. As a result, a comparison of the performance of the SVM models has been shown in ...

The selection of the optimal parameter sets plays an important role in attaining valid predictive performance for the SVM model. The generalization performance of SVM is influenced by the setting of the hyper-parameters (C, γ) and the kernel parameters (epsilon ε, nu ν). Hence, in order to find the best SVM performance, two types of RBF kernel models, epsilon (ε) and nu (ν), are considered for WQI prediction. For the epsilon-RBF model, C is fixed at 8, gamma at 0.2, and epsilon is varied over the range 0.001 to 0.5. The training and testing phases show that as epsilon increases, the RMSE also increases. However, the value of the correlation coefficient and the number of support vectors decrease.

... is visualized by the similar behaviour of the plots of the actual and predicted values of WQI.

The closeness of these data indicates that the deviation of error is very small and can be neglected. Hence, this result also confirms the suitability of the combination of Nu-RBF with 5-fold CV for model selection.

In Malaysia, WQI is calculated based on the DOE-WQI formula using six water quality parameters (ammoniacal nitrogen (AN), biochemical oxygen demand (BOD), chemical oxygen ...

The prediction of water quality is crucial in pollution monitoring and in ensuring that the environmental standards for water resources are met. If water quality changes, this prediction can provide early warnings and thus possibly minimize the effects of poor water quality on the community, provided that stringent action is executed properly by the authority. In this study, monthly water quality data from two stations on Langat River, Kajang, covering six water quality parameters over a ten-year period, are utilized.

In this research, the Support Vector Regression (SVR) model combined with K-fold cross-validation is proposed to predict the Water Quality Index (WQI). As shown previously in Table 5, the Nu-RBF type with 5-fold CV provides the highest correlation coefficient (R) of 0.9998, compared with 0.9994 for Epsilon-RBF. This result indicates that the optimal SVR performance is obtained with 5-fold cross-validation, which performs better than the other K-fold CV values.
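The two performance measures used for model selection, RMSE and the correlation coefficient (R), can be computed as sketched below. The sketch assumes that R is Pearson's correlation between actual and predicted WQI, which the text here does not state explicitly. The function names are illustrative.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def correlation_coefficient(actual, predicted):
    """Pearson correlation coefficient (assumed definition of R)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)
```

In a k-fold setting, both measures would typically be computed on each test fold and then averaged over the k folds before models are compared.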

The SVR algorithm is developed by considering several combinations of input parameters (Model 1 to Model 7), whereby each model used a different combination of water quality parameters as modelling input (