Storing text retrieval systems on CD-ROM: compression and encryption considerations

S. Klein, Abraham Bookstein, Scott Deerwester

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The emergence of the CD-ROM as a storage medium for full-text databases raises the question of the maximum size database that can be contained by this medium. As an example, the problem of storing the Trésor de la Langue Française on a CD-ROM is examined in this paper. The text alone of this database is 700 megabytes long, more than a CD-ROM can hold. In addition, the dictionary and concordance needed to access these data must be stored. A further constraint is that some of the material is copyrighted, and it is desirable that such material be difficult to decode except through software provided by the system. Pertinent approaches to compression of the various files are reviewed, and the compression of the text is related to the problem of data encryption: Specifically, it is shown that, under simple models of text generation, Huffman encoding produces a bit-string indistinguishable from a representation of coin flips.
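The coin-flip claim in the abstract can be illustrated with a small experiment. The following is only a sketch under stated assumptions, not the authors' method: it assumes a memoryless source whose symbol probabilities are all powers of 1/2 (one possible "simple model of text generation"), and the names `huffman_code`, `ones_fraction`, and `DYADIC` are hypothetical. It checks only that 0s and 1s are balanced in the encoded output, which is a necessary condition for looking like coin flips rather than a proof of indistinguishability.

```python
import heapq
import random
from itertools import count


def huffman_code(probs):
    """Return {symbol: bitstring} for a dict of symbol probabilities."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(p, next(tie), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees, prefixing their codes.
        p0, _, left = heapq.heappop(heap)
        p1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]


def ones_fraction(probs, n_symbols=20000, seed=0):
    """Encode a random text drawn from probs; return the share of 1 bits."""
    rng = random.Random(seed)
    code = huffman_code(probs)
    symbols = list(probs)
    weights = [probs[s] for s in symbols]
    text = rng.choices(symbols, weights=weights, k=n_symbols)
    bits = "".join(code[s] for s in text)
    return bits.count("1") / len(bits)


# Assumed toy alphabet: every probability is a power of 1/2.
DYADIC = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

if __name__ == "__main__":
    print(huffman_code(DYADIC))
    print(ones_fraction(DYADIC))
```

With these assumed probabilities, each branch of the Huffman tree splits its probability mass evenly, so the share of 1 bits should come out close to one half. For a skewed, non-dyadic source the balance is generally only approximate.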
Original language: American English
Title of host publication: Proc. 12-th ACM-SIGIR Conf
State: Published - 1989

Bibliographical note

Place of conference: Cambridge
