Abstract:
Over the past few decades, the banking sector has increasingly recognized the importance of automated systems for managing data quality, leading to a growing focus on data quality evaluation. Data governance for such systems is necessary to ensure the performance of machine learning (ML) models. Data cleansing is a fundamental, quality-focused component of data governance and an essential step before building data analytics services. This paper introduces an automated framework for ensuring data quality in banking using statistical and ML methods, highlighting its objectives, functionality, and methodological advancements. The proposed approach aims to demonstrate the need for data quality assessment before training ML models. In the data quality evaluation, outliers are detected using three approaches: Tukey's IQR, a statistical method; Isolation Forest (IF), an unsupervised tree-based learning method; and DBSCAN, an unsupervised density-based clustering method. Of the three, the IQR method detected the largest number of outliers. The outliers are removed, and the three methods are then compared by training three ML models on the cleaned data: logistic regression, K-nearest neighbors, and Naive Bayes.
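As an illustration of the statistical step named in the abstract, the following is a minimal sketch of Tukey's IQR rule using only the Python standard library. It is not the paper's implementation: the function name `tukey_outliers`, the fence multiplier `k=1.5` (the conventional default), the inclusive quartile method, and the sample `amounts` are all assumptions made for this example. Isolation Forest and DBSCAN are not shown.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return the values lying outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration; k=1.5 is the conventional choice,
    not a setting confirmed by the paper.
    """
    # Quartiles via linear interpolation over the sample (inclusive method).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up transaction amounts; not data from the paper.
amounts = [120, 135, 128, 140, 132, 125, 138, 130, 5000]

flagged = tukey_outliers(amounts)
# Drop the flagged values before any model training.
cleaned = [v for v in amounts if v not in flagged]
```

In this toy sample only the value 5000 falls outside the fences, so `cleaned` keeps the remaining eight amounts. A pipeline like the one described would then train the classifiers on the cleaned data.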
Published in: 2024 Second International Conference on Emerging Trends in Information Technology and Engineering (ICETITE)
Date of Conference: 22-23 February 2024
Date Added to IEEE Xplore: 18 April 2024