We consider the problem of fingerprinting text by sets of symbols. Specifically, if S is a string, of length n, over a finite, ordered alphabet Σ, and S' is a substring of S, then the fingerprint of S' is the subset φ of Σ of precisely the symbols appearing in S'. In this paper we show efficient methods of answering various queries on fingerprint statistics. Our preprocessing is done in time O(n |Σ| log n log |Σ|) and enables answering the following queries: (1) Given an integer k, compute the number of distinct fingerprints of size k in time O(1). (2) Given a set φ ⊆ Σ, compute the total number of distinct occurrences in S of substrings with fingerprint φ in time O(| Σ| log n).
Bibliographical noteFunding Information:
Giorgio Satta's work was supported in part by MURST under project PRIN: BioInformatica e Ricerca Genomica and by University of Padova, under project Sviluppo di Sistemi ad Addestramento Automatico per l'Analisi del Linguaggio Naturale.
Amihood Amir was partially supported by NSF grant CCR-01-04494, BSF grant 96-00509, and an Israel–Italy exchange scientist grant.
Alberto Apostolico's work was supported in part by NSF Grant CCR-9700276, by MURST under project PRIN: BioInformatica e Ricerca Genomica, by the University of Padova under project Development of Novel Pattern Discovery Algorithms and Software, and by an Israel–Italy exchange scientist grant.
This research was performed during exchange visits conducted, respectively, by the first and third authors at the University of Padova, and by the second author at the Universities of Bar-Ilan and Haifa, as part of an Israel–Italy exchange scientist grant jointly funded by the Israel Ministry of Science and the National Research Council of Italy.
Gad Landau was partially supported by NSF grants CCR-9610238, and CCR-0104307, by NATO Science Programme grant PST.CLG.977017, by the Israel Science Foundation grants 173/98 and 282/01, by the FIRST Foundation of the Israel Academy of Science and Humanities, and by IBM Faculty Partnership Award, and an Israel–Italy exchange scientist grant.
- Combinatorial algorithms on words
- Design and analysis of algorithms