TY - JOUR
T1 - Processing truncated terms in document retrieval systems
AU - Bratley, Paul
AU - Choueka, Yaacov
PY - 1982
Y1 - 1982
N2 - In a typical inverted-file full-text document retrieval system, the user submits queries consisting of strings of characters combined by various operators. The strings are looked up in a text-dictionary which lists, for each string, all the places in the database at which it occurs. It is desirable to allow the user to include in his query truncated terms such as X*, *X, *X*, or X*Y, where X and Y are specified strings and * is a variable-length-don't-care character, that is, * represents an arbitrary, possibly empty, string. Processing these terms involves finding the set of all words in the dictionary that match these patterns. How to do this efficiently is a long-standing open problem in this domain. In this paper we present a uniform and efficient approach for processing all such query terms. The approach, based on a "permuted dictionary" and a corresponding set of access routines, requires essentially one disk access to obtain from the dictionary all the strings represented by a truncated term, with negligible computing time. It is thus well suited for on-line applications. Implementation is simple, and storage overhead is low: it can be made almost negligible by using some specially adapted compression techniques described in the paper. The basic approach is easily adaptable for slight variants, such as fixed (or bounded) length don't-care characters, or more complex pattern matching templates.
AB - In a typical inverted-file full-text document retrieval system, the user submits queries consisting of strings of characters combined by various operators. The strings are looked up in a text-dictionary which lists, for each string, all the places in the database at which it occurs. It is desirable to allow the user to include in his query truncated terms such as X*, *X, *X*, or X*Y, where X and Y are specified strings and * is a variable-length-don't-care character, that is, * represents an arbitrary, possibly empty, string. Processing these terms involves finding the set of all words in the dictionary that match these patterns. How to do this efficiently is a long-standing open problem in this domain. In this paper we present a uniform and efficient approach for processing all such query terms. The approach, based on a "permuted dictionary" and a corresponding set of access routines, requires essentially one disk access to obtain from the dictionary all the strings represented by a truncated term, with negligible computing time. It is thus well suited for on-line applications. Implementation is simple, and storage overhead is low: it can be made almost negligible by using some specially adapted compression techniques described in the paper. The basic approach is easily adaptable for slight variants, such as fixed (or bounded) length don't-care characters, or more complex pattern matching templates.
UR - http://www.scopus.com/inward/record.url?scp=0020246348&partnerID=8YFLogxK
U2 - 10.1016/0306-4573(82)90004-8
DO - 10.1016/0306-4573(82)90004-8
M3 - Article
AN - SCOPUS:0020246348
SN - 0306-4573
VL - 18
SP - 257
EP - 266
JO - Information Processing and Management
JF - Information Processing and Management
IS - 5
ER -