The minimum description length (MDL) principle is derived for universal compression of i.i.d. sources with large alphabets whose size k may grow with the data sequence length n, provided k remains sub-linear in n. Each unknown source probability parameter is shown to cost 0.5 log(n/k) bits. This cost is shown to be a lower bound in the average minimax sense, and also for most sources in the class. The bound is shown to be achievable, even sequentially, with the well-known low-complexity Krichevsky-Trofimov scheme.
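
As a rough illustration of the sequential scheme mentioned above, the following minimal sketch computes the code length of the standard Krichevsky-Trofimov (add-1/2) estimator and compares the excess over the empirical entropy, per free parameter, with 0.5 log(n/k). The function names, the integer symbol encoding 0..k-1, and the use of base-2 logarithms are assumptions made for this example and do not come from the paper. The stated result is asymptotic and holds in an average or minimax sense, so a single finite sequence need not match the reference value closely.

```python
import math
from collections import Counter


def kt_code_length(seq, k):
    """Code length in bits of the sequential KT (add-1/2) estimator.

    Assumes symbols are integers in range(k). After t symbols, the next
    symbol a is assigned probability (count[a] + 1/2) / (t + k/2).
    """
    counts = [0] * k
    bits = 0.0
    for t, sym in enumerate(seq):
        p = (counts[sym] + 0.5) / (t + k / 2.0)
        bits -= math.log2(p)
        counts[sym] += 1
    return bits


def empirical_entropy_bits(seq):
    """n times the empirical entropy of seq, in bits (the ML code length)."""
    n = len(seq)
    return -sum(c * math.log2(c / n) for c in Counter(seq).values())


def excess_bits_per_parameter(seq, k):
    """Excess of the KT code length over the ML code length, per free parameter.

    An i.i.d. source over k symbols has k - 1 free probability parameters.
    """
    return (kt_code_length(seq, k) - empirical_entropy_bits(seq)) / (k - 1)


def reference_cost(n, k):
    """The per-parameter cost 0.5 * log2(n / k) quoted in the abstract."""
    return 0.5 * math.log2(n / k)
```

For example, one could draw a sequence with `random.choices(range(k), k=n)` for some k much smaller than n and print `excess_bits_per_parameter(seq, k)` next to `reference_cost(n, k)` to see how the two compare on that particular draw.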