Abstract:
Machine learning typically assumes that training and test sets are drawn independently from the same distribution, but this assumption is often violated in practice, which introduces a bias. Many methods to identify and mitigate this bias have been proposed, but they usually rely on ground-truth information. But what if the researcher is not even aware of the bias? In contrast to prior work, this paper introduces a new method, Imitate, to identify and mitigate Selection Bias in the case that we may not know whether (and where) a bias is present, and hence no ground-truth information is available. Imitate investigates the dataset's probability density, then adds generated points to smooth out the density so that it resembles a Gaussian, the most common density occurring in real-world applications. If the artificial points concentrate in certain areas rather than being widespread, this may indicate a Selection Bias in which those areas are underrepresented in the sample. We demonstrate the effectiveness of the proposed method on both synthetic and real-world datasets. We also point out limitations and future research directions.
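The abstract's core idea — compare the observed density against a fitted Gaussian and see where "fill-in" points would concentrate — can be illustrated with a minimal 1-D toy sketch. This is a hypothetical illustration, not the authors' Imitate implementation; the truncation point, bin count, and density-deficit heuristic are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical biased sample: a standard Gaussian with one side
# truncated, so the region x >= 0.5 is underrepresented.
full = rng.normal(loc=0.0, scale=1.0, size=2000)
biased = full[full < 0.5]

# Fit a Gaussian to the biased sample.
mu, sigma = biased.mean(), biased.std()

# Compare the empirical density to the fitted Gaussian density.
bins = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 40)
hist, edges = np.histogram(biased, bins=bins, density=True)
centers = (edges[:-1] + edges[1:]) / 2
gauss = np.exp(-((centers - mu) ** 2) / (2 * sigma**2)) / (
    sigma * np.sqrt(2 * np.pi)
)

# Density deficit: where the Gaussian exceeds the observed density,
# artificial points would have to be generated to smooth the gap.
deficit = np.clip(gauss - hist, 0.0, None)

# If the deficit concentrates in one region rather than being
# widespread, that region may be underrepresented in the sample.
peak_region = centers[np.argmax(deficit)]
print(f"largest density gap near x = {peak_region:.2f}")
```

In this toy setup the largest gap sits just above the truncation point, i.e. exactly where the selection bias removed data; a widespread, low-magnitude deficit would instead suggest no pronounced bias.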
Published in: 2020 IEEE International Conference on Data Mining (ICDM)
Date of Conference: 17-20 November 2020
Date Added to IEEE Xplore: 09 February 2021