Abstract
Text indexing is a fundamental problem in computer science: index a given text (string) $T[1..n]$ so that, whenever a pattern $P[1..p]$ arrives as a query, all locations where $P$ occurs as a substring of $T$ can be reported efficiently. In this paper, we consider the case where $P$ contains wildcard characters (each of which matches any character). The first non-trivial solution to this problem was given by Cole et al. [11], where the index space is $O(n \log^k n)$ words, or $O(n \log^{k+1} n)$ bits, and the query time is $O(p + 2^h \log\log n + occ)$. Here $k$ is the maximum number of wildcard characters allowed in $P$, $h \le k$ is the number of wildcard characters in $P$, and $occ$ is the number of occurrences of $P$ in $T$. Even though many indexes offering different space-time trade-offs were proposed later, no clear improvement on this result is known. We first propose an index of $O(n \log^{k+\varepsilon} n)$ bits that achieves the same query time as Cole et al.'s index, where $0 < \varepsilon < 1$ is an arbitrarily small constant. We then propose another index of $O(n \log^k n \log \sigma)$ bits, with a slightly higher query time of $O(p + 2^h \log n + occ)$, where $\sigma$ denotes the alphabet size. We also study a related problem, where the task is to index a collection of documents (of $n$ characters in total) so as to find the number of distinct documents containing a query pattern $P$. For the case where $P$ contains at most a single wildcard character, we propose an $O(n \log n)$-word index with optimal $O(p)$ query time.
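To make the two query types concrete, the following is a minimal brute-force sketch of their semantics only. It is not the index proposed in the paper: it scans the text in O(np) time per query and uses no preprocessing. The wildcard symbol `?` and the function names `matches_at`, `wildcard_occurrences`, and `count_documents` are assumptions made for illustration.

```python
WILDCARD = "?"  # assumed wildcard symbol; the abstract does not fix one


def matches_at(text, pattern, start):
    """Return True if pattern matches text at 0-based offset start,
    where each wildcard in the pattern matches any single character."""
    return all(
        pc == WILDCARD or pc == text[start + j]
        for j, pc in enumerate(pattern)
    )


def wildcard_occurrences(text, pattern):
    """Report all 1-based starting positions of pattern in text
    (1-based to follow the T[1..n] convention of the abstract)."""
    p = len(pattern)
    return [
        i + 1
        for i in range(len(text) - p + 1)
        if matches_at(text, pattern, i)
    ]


def count_documents(documents, pattern):
    """Count the distinct documents that contain at least one match."""
    return sum(1 for doc in documents if wildcard_occurrences(doc, pattern))


if __name__ == "__main__":
    print(wildcard_occurrences("abracadabra", "a?a"))
    print(count_documents(["banana", "cherry", "bandana"], "n?n"))
```

An index of the kind described in the abstract answers the same queries without scanning the text, at the cost of the stated index space.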
Original language | English |
---|---|
Pages (from-to) | 120-127 |
Number of pages | 8 |
Journal | Theoretical Computer Science |
Volume | 557 |
Issue number | C |
DOIs | |
State | Published - 2014 |
Bibliographical note
Publisher Copyright: © 2014.
Funding
An early part of this work appeared in ISAAC 2013 [21]. Work supported by NSERC of Canada and the Canada Research Chairs program.
Funders | Funder number |
---|---|
Natural Sciences and Engineering Research Council of Canada | |
Canada Research Chairs | |
Keywords
- Data structures
- Range searching
- String indexing
- Suffix trees
- Wildcards