Compression, Information Theory, and Grammars: A Unified Approach

Abraham Bookstein, Shmuel T. Klein

Research output: Contribution to journal › Article › peer-review

16 Scopus citations


Text compression is of considerable theoretical and practical interest. It is, for example, becoming increasingly important for satisfying the requirements of fitting a large database onto a single CD-ROM. Many of the compression techniques discussed in the literature are model based. Here we propose the notion of a formal grammar as a flexible model of text generation that encompasses most of the models offered before and, in principle, extends the possibility of compression to a much more general class of languages. Assuming a general model of text generation, a derivation is given of the well-known Shannon entropy formula, making possible a theory of information based upon text representation rather than on communication. The ideas are shown to apply to a number of commonly used text models. Finally, we focus on a Markov model of text generation, suggest an information-theoretic measure of similarity between two probability distributions, and develop a clustering algorithm based on this measure. This algorithm allows us to cluster Markov states, and thereby base our compression algorithm on a smaller number of probability distributions than would otherwise have been required. A number of theoretical consequences of this approach to compression are explored, and a detailed example is given.
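
As a rough illustration of the quantities the abstract refers to, the sketch below computes the Shannon entropy of a distribution and one possible information-theoretic measure of similarity between two distributions. The abstract does not state which measure the paper uses, so the choice here is an assumption: the extra bits per symbol incurred when two sources are coded with a single pooled distribution instead of their own. The names entropy and merge_cost, and the weights wp and wq, are hypothetical and chosen only for this example.

```python
import math


def entropy(p):
    """Shannon entropy, in bits, of a distribution given as a list of probabilities."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def merge_cost(p, q, wp=0.5, wq=0.5):
    """Extra bits per symbol from coding two sources with one pooled distribution.

    Assumption for illustration only: this is not taken from the paper.
    wp and wq are the relative frequencies with which each source is used.
    """
    total = wp + wq
    wp, wq = wp / total, wq / total
    pooled = [wp * a + wq * b for a, b in zip(p, q)]
    # Entropy of the pooled distribution minus the weighted entropies of the parts.
    return entropy(pooled) - (wp * entropy(p) + wq * entropy(q))


if __name__ == "__main__":
    state_a = [0.7, 0.2, 0.1]
    state_b = [0.6, 0.3, 0.1]
    state_c = [0.1, 0.1, 0.8]
    # Similar states are cheap to merge; dissimilar ones are expensive.
    print(merge_cost(state_a, state_b))
    print(merge_cost(state_a, state_c))
```

Under this assumed measure, a clustering procedure could repeatedly merge the pair of Markov states with the smallest merge cost, which reduces the number of distributions the compressor has to store at a small cost in coding efficiency.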

Original language: English
Pages (from-to): 27-49
Number of pages: 23
Journal: ACM Transactions on Information Systems
Issue number: 1
State: Published - 1 Mar 1990
Externally published: Yes

