In this paper we propose a stability measure for entropy estimates based on the principles of Bayesian statistics. Stability, that is, how much the estimates vary as the training set varies, is a critical issue, especially for problems where the parameter-to-data ratio is extremely high, as in language modeling and text compression. There are two natural estimates of entropy: the classical estimate and the Bayesian estimate. We show that the difference between them is strongly positively correlated with the variance of the classical estimate when it is not too small, and we propose this difference as a stability measure for entropy estimates. To evaluate it for language models, where parameter estimates are available but the posterior distribution generally is not, we suggest using a Dirichlet distribution chosen so that its expectation agrees with the estimated parameters while the total count is preserved. Experiments on two benchmark corpora show that the proposed measure indeed reflects the stability of classical entropy estimates.
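
As a rough illustration of the idea, the following sketch computes both estimates for a single discrete distribution. It rests on assumptions that the abstract does not spell out: the classical estimate is taken to be the plug-in entropy of the estimated parameters, the Bayesian estimate is taken to be the expected entropy under a Dirichlet distribution, and the Dirichlet parameters are set to alpha_i = N * p_i so that the mean equals the estimated parameters and the total count N is preserved. The function names (`digamma`, `classical_entropy`, `bayesian_entropy`, `stability_gap`) are hypothetical. The expected entropy uses the standard closed form psi(A + 1) - sum_i (alpha_i / A) psi(alpha_i + 1), with A the sum of the alpha_i.

```python
import math


def digamma(x: float) -> float:
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    if x <= 0:
        raise ValueError("x must be positive")
    result = 0.0
    # Shift x upward with psi(x) = psi(x + 1) - 1/x until the series is accurate.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # Asymptotic series: ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 / x
    result -= inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def classical_entropy(p):
    """Plug-in entropy (in nats) of the estimated distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def bayesian_entropy(p, total_count):
    """Expected entropy (in nats) under Dirichlet(alpha).

    Assumption: alpha_i = total_count * p_i, so the Dirichlet mean is p
    and the alpha_i sum to total_count. p is assumed to sum to 1.
    """
    expected = digamma(total_count + 1.0)
    for pi in p:
        if pi > 0:  # zero-probability entries contribute nothing
            expected -= pi * digamma(total_count * pi + 1.0)
    return expected


def stability_gap(p, total_count):
    """Classical minus Bayesian estimate; non-negative by Jensen's inequality."""
    return classical_entropy(p) - bayesian_entropy(p, total_count)


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # illustrative parameters, not from the paper
    for n in (10, 100, 1000):
        print(n, stability_gap(p, n))
```

Under these assumptions the gap shrinks as the total count grows, which matches the intuition that estimates backed by more data are more stable. A real language model would need the same computation applied per context and then combined, which this sketch does not attempt.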