Abstract
An earlier paper developed a procedure for compressing concordances, assuming that all elements occurred independently. The models introduced in that paper are extended here to take the possibility of clustering into account. The concordance is conceptualized as a set of bitmaps, in which the bit locations represent documents, and the one-bits represent the occurrence of given terms. Hidden Markov Models (HMMs) are used to describe the clustering of the one-bits. However, for computational reasons, the HMM is approximated by traditional Markov models. A set of criteria is developed to constrain the allowable set of n-state models, and a full inventory is given for n ≤ 4. Graph-theoretic reduction and complementation operations are defined among the various models and are used to provide a structure relating the models studied. Finally, the new methods were tested on the concordances of the English Bible and of two of the world's largest full-text retrieval systems: the Trésor de la Langue Française and the Responsa Project.
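To make the bitmap view concrete, the following minimal sketch compares the ideal code length of one term's bitmap under an independence assumption with its code length under a simple first-order, two-state Markov chain. It is an illustration only and not one of the models from the paper: the function names, the toy bitmap, and the choice to estimate probabilities from the bitmap itself (ignoring the cost of transmitting them) are all assumptions made here.

```python
import math

def independent_cost_bits(bitmap):
    """Ideal code length in bits when every bit is modeled independently.

    The one-bit probability is estimated from the bitmap itself
    (an assumption of this sketch; parameter cost is ignored).
    """
    p_one = sum(bitmap) / len(bitmap)
    cost = 0.0
    for bit in bitmap:
        prob = p_one if bit else 1.0 - p_one
        cost -= math.log2(prob)
    return cost

def markov_cost_bits(bitmap):
    """Ideal code length in bits under a first-order two-state Markov chain.

    The state is simply the previous bit. Transition probabilities are
    estimated from the bitmap; the first bit is charged one full bit.
    """
    counts = [[0, 0], [0, 0]]  # counts[prev][cur]
    for prev, cur in zip(bitmap, bitmap[1:]):
        counts[prev][cur] += 1
    cost = 1.0  # first bit sent verbatim
    for prev, cur in zip(bitmap, bitmap[1:]):
        total = counts[prev][0] + counts[prev][1]
        cost -= math.log2(counts[prev][cur] / total)
    return cost

# Hypothetical bitmap for one term over 64 documents: the one-bits
# (documents containing the term) occur in two clusters.
clustered = [0] * 20 + [1] * 8 + [0] * 20 + [1] * 4 + [0] * 12

print(f"independent model: {independent_cost_bits(clustered):.1f} bits")
print(f"Markov model:      {markov_cost_bits(clustered):.1f} bits")
```

On a clustered bitmap such as this one, the Markov chain assigns high probability to staying in a run of zeros or ones, so its code length comes out shorter than that of the independent model. The n-state models described in the abstract generalize this idea beyond the two-state chain used here.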
Original language | English |
---|---|
Pages (from-to) | 254-290 |
Number of pages | 37 |
Journal | ACM Transactions on Information Systems |
Volume | 15 |
Issue number | 3 |
State | Published - Jul 1997 |
Keywords
- E.2 [Data]: Data Storage Representations - composite structures
- E.4 [Data]: Coding and Information Theory - data compaction and compression
- F.1.2 [Computation by Abstract Devices]: Modes of Computation - probabilistic computation
- Markov models