Modeling word occurrences for the compression of concordances

A. Bookstein, S. T. Klein, T. Raita

Research output: Contribution to journal › Article › peer-review

11 Scopus citations

Abstract

An earlier paper developed a procedure for compressing concordances, assuming that all elements occurred independently. The models introduced in that paper are extended here to take the possibility of clustering into account. The concordance is conceptualized as a set of bitmaps, in which the bit locations represent documents, and the one-bits represent the occurrence of given terms. Hidden Markov Models (HMMs) are used to describe the clustering of the one-bits. However, for computational reasons, the HMM is approximated by traditional Markov models. A set of criteria is developed to constrain the allowable set of n-state models, and a full inventory is given for n ≤ 4. Graph-theoretic reduction and complementation operations are defined among the various models and are used to provide a structure relating the models studied. Finally, the new methods were tested on the concordances of the English Bible and of two of the world's largest full-text retrieval systems: the Trésor de la Langue Française and the Responsa Project.
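
To make the contrast between the independence assumption and a clustering-aware model concrete, the following is a minimal illustrative sketch, not the authors' method. It assumes a single term bitmap held as a Python list of 0/1 values and compares the ideal code length (in bits) under an independent-bits model with that under a plain two-state, first-order Markov chain. The function names `independent_cost` and `markov_cost` are hypothetical, the probabilities are estimated from the bitmap itself, and the cost of transmitting those parameters is ignored.

```python
import math


def independent_cost(bits):
    """Ideal code length when every bit is coded independently.

    P(1) is estimated as the fraction of one-bits in the bitmap.
    """
    p_one = sum(bits) / len(bits)
    cost = 0.0
    for b in bits:
        # Only symbols that actually occur are coded, so the
        # probability used here is never zero.
        cost -= math.log2(p_one if b else 1.0 - p_one)
    return cost


def markov_cost(bits):
    """Ideal code length under a two-state, first-order Markov chain.

    Transition probabilities are estimated from adjacent bit pairs;
    the first bit is coded with the same estimate as the independent model.
    """
    # counts[prev][cur] = number of times bit `cur` follows bit `prev`
    counts = [[0, 0], [0, 0]]
    for prev, cur in zip(bits, bits[1:]):
        counts[prev][cur] += 1

    p_one = sum(bits) / len(bits)
    cost = -math.log2(p_one if bits[0] else 1.0 - p_one)
    for prev, cur in zip(bits, bits[1:]):
        cost -= math.log2(counts[prev][cur] / sum(counts[prev]))
    return cost


if __name__ == "__main__":
    # Hypothetical bitmap: 20 one-bits in two runs among 100 documents.
    clustered = [0] * 40 + [1] * 10 + [0] * 40 + [1] * 10
    print("independent:", round(independent_cost(clustered), 2), "bits")
    print("markov:     ", round(markov_cost(clustered), 2), "bits")
```

On a bitmap whose one-bits occur in runs, as in the hypothetical example above, the Markov estimate is considerably smaller than the independent one, which is the effect that modeling clustering is meant to exploit.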

Original language: English
Pages (from-to): 254-290
Number of pages: 37
Journal: ACM Transactions on Information Systems
Volume: 15
Issue number: 3
State: Published - Jul 1997

Keywords

  • E.2 [Data]: Data Storage Representations - composite structures
  • E.4 [Data]: Coding and Information Theory - data compaction and compression
  • F.1.2 [Computation by Abstract Devices]: Modes of Computation - probabilistic computation
  • Markov models
