TY - JOUR
T1 - Compression, Information Theory, and Grammars
T2 - A Unified Approach
AU - Bookstein, Abraham
AU - Klein, Shmuel T.
PY - 1990/3/1
Y1 - 1990/3/1
N2 - Text compression is of considerable theoretical and practical interest. It is, for example, becoming increasingly important for satisfying the requirements of fitting a large database onto a single CD-ROM. Many of the compression techniques discussed in the literature are model-based. We propose the notion of a formal grammar as a flexible model of text generation that encompasses most of the models offered before and, in principle, extends the possibility of compression to a much more general class of languages. Assuming a general model of text generation, a derivation is given of the well-known Shannon entropy formula, making possible a theory of information based on text representation rather than on communication. The ideas are shown to apply to a number of commonly used text models. Finally, we focus on a Markov model of text generation, suggest an information-theoretic measure of similarity between two probability distributions, and develop a clustering algorithm based on this measure. This algorithm allows us to cluster Markov states, and thereby base our compression algorithm on a smaller number of probability distributions than would otherwise have been required. A number of theoretical consequences of this approach to compression are explored, and a detailed example is given.
UR - http://www.scopus.com/inward/record.url?scp=0025263004&partnerID=8YFLogxK
U2 - 10.1145/78915.78917
DO - 10.1145/78915.78917
M3 - Article
AN - SCOPUS:0025263004
SN - 1046-8188
VL - 8
SP - 27
EP - 49
JO - ACM Transactions on Information Systems
JF - ACM Transactions on Information Systems
IS - 1
ER -