Abstract:
We describe Statistically-sound Knowledge Discovery from Data (StatKDD), a groundbreaking change of paradigm that shifts the focus of the KDD pipeline from the (overzealous) analysis of the available data towards understanding the partially unknown, random Data Generating Process (DGP) that produces the data. This shift is required by the practice of scientific research and by many industrial applications, where results from data analysis must capture new knowledge about the DGP while avoiding costly false discoveries. StatKDD considers every result obtained from the data as a hypothesis, which must pass severe statistical testing under a strong null model in order to be considered significant, i.e., informative about the DGP. The challenges to be solved to enable StatKDD include (1) developing representative null models and severe tests for different KDD tasks from different kinds of data; (2) treating multiple hypothesis testing as a necessity, not an afterthought; (3) offering flexible statistical guarantees, depending on the stage of the discovery process; and (4) creating algorithms for the extraction and testing of hypotheses that scale along multiple axes, including but not limited to the size of the data and the number and complexity of the hypotheses.
Date of Conference: 01-04 November 2023
Date Added to IEEE Xplore: 19 February 2024