An Evaluation of Techniques for Outlier Detection and Missing Values Imputation of Hydrological Data Series in the Zarrineh-Roud Basin, Lake Urmia

Eishoeei, Edith; Miryaghoubzadeh, Mirhassan; Erfanian, Mahdi; Mahboobi Esfanjani, Reza; Mancini, Marco

doi:10.61882/jwmr.2025.1310

Volume 16, Issue 2 (9-2025) J Watershed Manage Res 2025, 16(2): 19-34

Eishoeei E, Miryaghoubzadeh M, Erfanian M, Mahboobi Esfanjani R, Mancini M. (2025). An Evaluation of Techniques for Outlier Detection and Missing Values Imputation of Hydrological Data Series in the Zarrineh-Roud Basin, Lake Urmia. J Watershed Manage Res. 16(2), 19-34. doi:10.61882/jwmr.2025.1310
URL: http://jwmr.sanru.ac.ir/article-1-1310-en.html

Edith Eishoeei¹, Mirhassan Miryaghoubzadeh*¹, Mahdi Erfanian¹, Reza Mahboobi Esfanjani², Marco Mancini³

1- Department of Watershed Management Engineering, Natural Resources Faculty, Urmia University, Urmia, Iran
2- Department of Electrical Engineering, Faculty of Electrical Engineering, Sahand University of Technology, Tabriz, Iran
3- Department of Civil and Environmental Engineering, Politecnico di Milano, Milan, Italy

Abstract:

Extended Abstract
Background: Accurate river flow measurements are essential for water resource management, flood mitigation, river conservation and restoration, and stream rehabilitation. Most flood control and design flow strategies in river management and restoration projects are derived from hydrological and hydraulic analyses of observed river flow. Hydrological investigations therefore rely on observational data, which frequently contain errors. These errors can arise from computational issues, misreporting, sampling inaccuracies, and human or instrumental faults, leading to problems such as unrecorded data, incorrect values, equipment failure or loss, and the misidentification of outliers as missing data. Outliers, defined as data points that deviate significantly from the rest of the series, can introduce substantial calculation errors. Outlier detection techniques include supervised, semi-supervised, and unsupervised approaches, which may be distribution-based, clustering-based, or density-based. Consequently, these data must be assessed and estimated before they are used in models, and preprocessing is needed to limit errors. Preprocessing prepares data series for computations such as classification, prediction, and estimation, and includes the elimination of missing data, removal of outliers, imputation of missing values, and data normalization.
Method: This study used flow and rainfall data from six hydrometeorological stations and 16 rain gauge stations to identify outliers and impute missing or incomplete hydrological values. The data were obtained from the Zarrineh-Roud basin, and all analyses were carried out in R. The Zarrineh River watershed is the largest watershed of Lake Urmia. Normality tests, including the Shapiro-Wilk and Kolmogorov-Smirnov tests, were applied to the data, and the results indicated that the data did not follow a normal distribution. After data normalization, outliers were detected using boxplot, z-score, histogram, chi-square, mean and standard deviation, and median techniques. Values exceeding the established upper limit were removed. Missing values were imputed using K-Nearest Neighbor (KNN), Lasso regression, and Bayesian linear regression. Lasso regression is a regularization technique designed to reduce model complexity and avoid overfitting. Bayesian linear regression combines linear regression with Bayesian inference. The KNN algorithm is an instance-based, nonparametric supervised learning method. Cross-validation was used to assess the accuracy of the imputation methods, with RMSE and R² serving as performance metrics.
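Two of the outlier rules named above, the z-score and the boxplot, can be illustrated with a short sketch. The study ran its analysis in R; the Python below is not the authors' code. The thresholds (|z| > 3 and 1.5 times the interquartile range) are common defaults assumed here, not values reported in the abstract, and the `flows` series is hypothetical.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Indices whose absolute z-score exceeds `threshold` (assumed default: 3)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > threshold]

def boxplot_outliers(values, k=1.5):
    """Indices outside the boxplot fences Q1 - k*IQR and Q3 + k*IQR (assumed k: 1.5)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Hypothetical flow series with one moderate and one extreme high value.
flows = [12.1, 9.8, 11.4, 10.6, 13.0, 10.2, 11.9, 9.5, 12.4, 10.9, 18.5, 95.0]

print(zscore_outliers(flows))   # [11]
print(boxplot_outliers(flows))  # [10, 11]
```

On this toy series the boxplot rule flags both high values, while the z-score rule flags only the extreme one, because the extreme value inflates the standard deviation. This is only an illustration of how the two rules can disagree on skewed data, not a reproduction of the study's results.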
Result: According to the results, p-values at all six study stations were less than 0.05. Cross-validation was used to assess the accuracy and precision of KNN, Lasso regression, and Bayesian linear regression. RMSE values near zero and R² values above 0.7 across all stations indicated that KNN was a robust and accurate method for missing value imputation. It provided considerably more accurate and reliable results than Lasso regression and Bayesian linear regression, without altering the trend of the data series. Outliers were removed from the Jan-Agha and Darreh Pandedan stations during normalization. Histogram analysis revealed skewness and outliers at the Jan-Agha, Sariqamish, and Pol-Anyan stations, indicating a heterogeneous and non-normally distributed dataset. Outliers were identified and removed following normalization. The Shapiro-Wilk and Kolmogorov-Smirnov tests yielded p-values significantly below 0.05 after normalization, confirming a normal distribution. This suggests that normalization and outlier removal were carried out accurately and that outliers were detected and estimated reliably. The Rosner test established the upper limit for each data series across two successive tests, classifying values beyond this limit as outliers. The probability density functions of the observed values and of the values imputed with KNN were consistent, indicating good agreement between the two distributions. KNN also imputed the maximum, average, and minimum values more effectively than the other two methods at the studied stations.
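The evaluation described above (impute with KNN, then score against withheld observations using RMSE and R²) can be sketched as follows. This is a minimal Python illustration, not the authors' R implementation. It assumes Euclidean distance over the other stations' values, an unweighted neighbor mean, and leave-one-out as the cross-validation scheme, none of which are specified in the abstract. The function names and the `rows` data are hypothetical.

```python
import math

def knn_impute(rows, target, k=3):
    """Fill None entries in column `target` with the mean of the k nearest donor rows.

    Distance is Euclidean over all other columns, which are assumed complete.
    """
    donors = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(row[target])
            continue

        def distance(donor, row=row):
            return math.sqrt(sum((donor[j] - row[j]) ** 2
                                 for j in range(len(row)) if j != target))

        nearest = sorted(donors, key=distance)[:k]
        filled.append(sum(d[target] for d in nearest) / len(nearest))
    return filled

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def leave_one_out(rows, target, k):
    """Withhold each target value in turn, impute it, and collect (observed, predicted)."""
    observed, predicted = [], []
    for i, row in enumerate(rows):
        masked = [list(r) for r in rows]
        masked[i][target] = None
        predicted.append(knn_impute(masked, target, k)[i])
        observed.append(row[target])
    return observed, predicted

# Hypothetical records: (station A flow, station B flow, target station flow).
rows = [(1.0, 2.0, 3.0), (2.0, 3.0, 5.0), (3.0, 4.0, 7.0),
        (4.0, 5.0, 9.0), (5.0, 6.0, 11.0), (6.0, 7.0, 13.0)]

obs, pred = leave_one_out(rows, target=2, k=2)
print(pred)                            # [6.0, 5.0, 7.0, 9.0, 11.0, 10.0]
print(round(rmse(obs, pred), 3))       # 1.732
print(round(r_squared(obs, pred), 3))  # 0.743
```

In this toy series the interior values are recovered exactly, while the first and last values are pulled toward their neighbors. The example shows one way a neighbor-averaging imputer can miss extremes at the edges of the observed range, which is why a hold-out check of this kind is useful before trusting imputed values.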
Conclusion: The results indicate that the boxplot classifies all values outside the whiskers as outliers and therefore flags substantially more outliers than the other methods. Consequently, this method is considered unsuitable for outlier detection in hydrological data. KNN proved highly effective for missing data imputation compared with Lasso regression and Bayesian linear regression. This study involved normalizing the data series, determining the outlier values, and employing the KNN algorithm to estimate incomplete, unmeasured, and missing values. In datasets exhibiting little variation, KNN has high accuracy and is regarded as one of the most valuable and dependable techniques for imputing missing values. Cross-validation was used to assess the performance of KNN, Lasso regression, and Bayesian linear regression. KNN achieved R² values above 0.7 and RMSE values close to zero, and it outperformed the other two methods in estimating missing values in continuous and discontinuous flow data. This effectiveness is attributed to KNN's ability to identify optimal nearest neighbor values, making it suitable for accurate predictions, even during low flow periods. KNN combines computational simplicity with high efficacy in calculating and imputing missing values while preserving the integrity of the data series.

Keywords: Bayesian linear regression, K Nearest Neighbor, Lasso regression, Shapiro-Wilk test, Zarrineh-roud basin

Type of Study: Research | Subject: Hydrology
Received: 2025/02/2 | Accepted: 2025/05/11

Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
