Extended Abstract
Introduction and Objectives: Accurate river flow measurements are essential for effective water resource management, flood mitigation, river conservation and restoration, and stream rehabilitation. Most flood control and design flow strategies in river management and restoration initiatives are derived from hydrological and hydraulic analyses based on observed river flow. Hydrological investigations rely fundamentally on observational statistical data, which frequently contain multiple errors. These errors can arise from computational issues, misreporting, sampling inaccuracies, and human or instrumental errors, leading to problems such as unrecorded data, incorrect values, equipment failure or loss, and the misidentification of outliers as missing data. Outliers, defined as data points that deviate significantly from the norm, can introduce substantial calculation errors. Outlier detection techniques include supervised, semi-supervised, and unsupervised approaches, which may be distribution-based, clustering-based, or density-based. Consequently, the data must be estimated and assessed before they are applied in models, and preprocessing must be performed before use to mitigate mistakes. Preprocessing methods prepare the data series for computations such as classification, prediction, and estimation, and include the elimination of missing data, removal of outliers, imputation of missing values, and data normalization.
Materials and Methods: This study used flow and rainfall data from six hydrometeorological stations and sixteen rain stations to identify outliers and impute missing or incomplete hydrological values. The data, obtained from the Zarrineh-roud basin, were analyzed using R software. The Zarrineh River watershed is the largest watershed of Lake Urmia. To assess normality, the Shapiro-Wilk and Kolmogorov-Smirnov tests were applied, and the findings indicated that the data did not conform to a normal distribution. After the data were normalized, outliers were detected using boxplot, z-score, histogram, chi-square, mean and standard deviation, and median techniques. Values exceeding the established upper limit were removed. Missing values were imputed using K-Nearest Neighbor (KNN), Lasso regression, and Bayesian linear regression. Lasso regression is a regularization technique designed to reduce model complexity and avoid overfitting. Bayesian linear regression is a statistical method that combines linear regression with Bayesian inference. The KNN algorithm is an instance-based, nonparametric method from supervised learning. Cross-validation was used to assess the accuracy of the imputation methods, with RMSE and R² serving as performance metrics.
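As an illustration of two of the detection rules named above (z-score and boxplot), the following minimal Python sketch flags suspect values in a short series. The study itself used R; the sample flows, the function names, and the thresholds (2.5 standard deviations, whiskers at 1.5 × IQR) are assumptions chosen for the example, not values reported by the authors.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return indices whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > threshold]


def boxplot_outliers(values, k=1.5):
    """Return indices outside the boxplot whiskers (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical daily flows (m^3/s); the 95.0 entry stands in for a recording error.
flows = [12.1, 11.8, 13.0, 12.5, 95.0, 12.9, 11.5, 12.2, 13.4, 12.0]

print("z-score rule:", zscore_outliers(flows, threshold=2.5))
print("boxplot rule:", boxplot_outliers(flows))
```

The z-score threshold is set to 2.5 rather than the common 3 because, in a sample of only ten values, a single extreme value inflates the standard deviation so much that its own z-score cannot exceed about 2.85. The boxplot rule depends on quartiles rather than on the mean and standard deviation, so it is less affected by the extreme value itself.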
Results: P-values at all six study stations were less than 0.05, indicating that the raw data were not normally distributed. Cross-validation was used to assess the accuracy and precision of the KNN, Lasso regression, and Bayesian linear regression techniques. RMSE values near zero and R² values above 0.7 across all stations indicated that KNN is a robust and accurate method for missing value imputation. Compared with Lasso regression and Bayesian linear regression, KNN provides significantly more accurate and reliable outcomes without distorting the trend of the data series. Outliers were removed from the Jan-Agha and Darreh Pandedan stations during normalization. Histogram analysis revealed skewness and outliers at the Jan-Agha, Sariqamish, and Pol-Anyan stations, indicating a heterogeneous and non-normally distributed dataset. Following normalization, outliers were identified and removed. The Shapiro-Wilk and Kolmogorov-Smirnov tests yielded p-values significantly below 0.05 after normalization, confirming a normal distribution. This suggests that the normalization process and outlier removal were executed with precision, rendering the detection and estimation of outliers valid. The Rosner test established the upper limit for each data series across two successive tests, and values beyond this limit were classified as outliers. The probability density functions of the observed values and of the values imputed with KNN were consistent, indicating adequate agreement between the two. KNN was more effective than the other two methods in imputing the maximum, average, and minimum values at the studied stations.
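The kind of check described above can be sketched as follows: a KNN imputer fills gaps in a target series from records with similar predictor values, and a leave-one-out form of cross-validation scores it with RMSE and R². This is a minimal Python sketch under stated assumptions, not the authors' R workflow. The data, the function names `knn_impute` and `loo_scores`, the choice of k = 2, and the leave-one-out scheme are all hypothetical.

```python
import math


def knn_impute(predictors, target, k=3):
    """Fill None entries of `target` with the mean target of the k rows
    whose predictor vectors are closest in Euclidean distance."""
    known = [(x, y) for x, y in zip(predictors, target) if y is not None]
    filled = []
    for x, y in zip(predictors, target):
        if y is None:
            nearest = sorted(known, key=lambda row: math.dist(row[0], x))[:k]
            y = sum(t for _, t in nearest) / len(nearest)
        filled.append(y)
    return filled


def loo_scores(predictors, target, k=3):
    """Leave-one-out check: hide each observed value in turn, impute it,
    and return (RMSE, R^2) of imputed versus observed values."""
    obs, est = [], []
    for i, y in enumerate(target):
        if y is None:
            continue
        masked = list(target)
        masked[i] = None  # pretend this observation is missing
        est.append(knn_impute(predictors, masked, k)[i])
        obs.append(y)
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - e) ** 2 for o, e in zip(obs, est))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return math.sqrt(ss_res / len(obs)), 1 - ss_res / ss_tot


# Hypothetical records: predictors are readings at two neighbouring stations,
# target is the flow at the station of interest (None marks a gap).
predictors = [(1.0, 2.0), (1.1, 2.1), (0.9, 1.9), (5.0, 6.0),
              (5.1, 6.1), (4.9, 5.9), (1.05, 2.0), (5.0, 6.05)]
target = [10.0, 10.4, 9.6, 30.0, 30.5, 29.5, None, None]

filled = knn_impute(predictors, target, k=2)
rmse, r2 = loo_scores(predictors, target, k=2)
print(filled[6], filled[7], rmse, r2)
```

In practice the predictors would usually be standardized before distances are computed, and k would be tuned by the same cross-validation rather than fixed in advance.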
Conclusion: The results indicate that the boxplot classifies all values outside its whiskers as outliers and therefore flags substantially more outliers than the other methods. Consequently, this method is considered unsuitable for outlier detection in hydrological data. This study involved normalizing the data series, calculating the values of outliers, and employing the KNN algorithm to estimate incomplete, unmeasured, and missing values. KNN proved highly effective for missing data imputation compared to Lasso regression and Bayesian linear regression. In datasets exhibiting little variation, KNN has high accuracy and is regarded as one of the most valuable and dependable techniques for imputing missing values. Cross-validation of KNN, Lasso regression, and Bayesian linear regression showed that KNN achieved R² values above 0.7 and RMSE values close to zero. KNN outperformed the other two methods in estimating missing values in continuous and discontinuous flow data. This effectiveness is attributed to KNN's ability to identify optimal nearest neighbor values, making it suitable for accurate predictions, even during low flow periods. The precision of KNN stems from its computational simplicity and high efficacy in calculating and imputing missing values while preserving the integrity of the data series.
Type of Study: Research | Subject: Hydrology | Received: 2025/02/9 | Accepted: 2025/06/15