
Symbol ranking text compressors


Abstract:

Summary form only given. In 1951 Shannon estimated the entropy of English text by giving human subjects a sample of text and asking them to guess the next letters. He found, in one example, that 79% of the attempts were correct on the first try, 8% needed two attempts and 3% needed three attempts. By regarding the number of attempts as an information source he could estimate the entropy of the language. Shannon also stated that an "identical twin" of the original predictor could recover the original text, and these ideas are developed here to provide a new taxonomy of text compressors. In all cases these compressors recode the input into "rankings" of "most probable symbol", "next most probable symbol", and so on. The rankings have a very skewed distribution (low entropy) and are processed by a conventional statistical compressor. Several "symbol ranking" compressors have appeared in the literature, though seldom under that name or with any reference to Shannon's work. The author has developed a compressor that uses constant-order contexts and is based on a set-associative cache with LRU update. A software implementation has run at about 1 Mbyte/s with an average compression of 3.6 bits/byte on the Calgary Corpus.
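
The abstract describes the symbol-ranking idea only at a high level. The sketch below illustrates one possible form of the transform, assuming order-2 contexts, a small per-context LRU list (standing in for the paper's set-associative cache), and an escape rank for symbols not yet cached for a context; the cache size, escape scheme, and all names are illustrative assumptions, not the author's implementation. The rank stream it produces is heavily skewed toward rank 0 and would normally be handed to a conventional statistical coder, as the abstract notes.

# A minimal symbol-ranking transform sketch (not the author's implementation).
# Assumptions: order-2 contexts, a per-context LRU list capped at WAYS entries
# (a stand-in for the paper's set-associative cache), and an escape rank for
# symbols not found in the list. The resulting ranks are heavily skewed toward 0.

from collections import defaultdict

WAYS = 4          # candidate symbols kept per context (assumed cache associativity)
ORDER = 2         # constant context order (assumed)
ESCAPE = WAYS     # rank emitted when the symbol is not cached for this context


def rank_encode(data: bytes):
    """Return (ranks, literals): one rank per input byte, plus escaped literals."""
    cache = defaultdict(list)          # context -> LRU-ordered candidate symbols
    ranks, literals = [], []
    context = b"\x00" * ORDER
    for b in data:
        cands = cache[context]
        if b in cands:
            r = cands.index(b)         # 0 = most recently seen = "most probable symbol"
            cands.remove(b)
        else:
            r = ESCAPE
            literals.append(b)
            if len(cands) >= WAYS:     # LRU eviction of the oldest candidate
                cands.pop()
        cands.insert(0, b)             # move-to-front = LRU update
        ranks.append(r)
        context = (context + bytes([b]))[-ORDER:]
    return ranks, literals


def rank_decode(ranks, literals):
    """Invert rank_encode by running the identical predictor in step."""
    cache = defaultdict(list)
    out = bytearray()
    lit = iter(literals)
    context = b"\x00" * ORDER
    for r in ranks:
        cands = cache[context]
        if r == ESCAPE:
            b = next(lit)
            if len(cands) >= WAYS:
                cands.pop()
        else:
            b = cands.pop(r)
        cands.insert(0, b)
        out.append(b)
        context = (context + bytes([b]))[-ORDER:]
    return bytes(out)


if __name__ == "__main__":
    text = b"the quick brown fox jumps over the lazy dog, the lazy dog"
    ranks, lits = rank_encode(text)
    assert rank_decode(ranks, lits) == text
    print("fraction of rank-0 symbols:", ranks.count(0) / len(ranks))

The decoder recovers the text by running the same predictor as the encoder, which is exactly Shannon's "identical twin" observation; in a real compressor the rank stream would then be entropy-coded rather than stored directly.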
Date of Conference: 25-27 March 1997
Date Added to IEEE Xplore: 06 August 2002
Print ISBN: 0-8186-7761-9
Print ISSN: 1068-0314
Conference Location: Snowbird, UT, USA
