Efficient text fingerprinting via Parikh mapping

  • Amihood Amir
  • Alberto Apostolico
  • Gad M. Landau
  • Giorgio Satta

Research output: Contribution to journal › Article › peer-review

47 Scopus citations

Abstract

We consider the problem of fingerprinting text by sets of symbols. Specifically, if S is a string of length n over a finite, ordered alphabet Σ, and S' is a substring of S, then the fingerprint of S' is the subset φ of Σ consisting of precisely the symbols appearing in S'. In this paper we show efficient methods of answering various queries on fingerprint statistics. Our preprocessing is done in time O(n |Σ| log n log |Σ|) and enables answering the following queries: (1) Given an integer k, compute the number of distinct fingerprints of size k in time O(1). (2) Given a set φ ⊆ Σ, compute the total number of distinct occurrences in S of substrings with fingerprint φ in time O(|Σ| log n).
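
The fingerprint definition and the two query types can be illustrated with a brute-force sketch. This is not the paper's method and does not come close to the stated bounds: it simply enumerates all substrings in quadratic time. The function names are hypothetical, and counting every (start, end) pair in query (2) is only one possible reading of "distinct occurrences"; the paper's precise definition may differ.

```python
from collections import Counter


def distinct_fingerprints(s):
    """Return the fingerprints (as frozensets) of all non-empty substrings of s."""
    found = set()
    for start in range(len(s)):
        seen = set()
        # Extend the substring one symbol at a time; its symbol set only grows.
        for end in range(start, len(s)):
            seen.add(s[end])
            found.add(frozenset(seen))
    return found


def count_fingerprints_by_size(s):
    """Query (1), naively: map each size k to the number of distinct fingerprints of size k."""
    return Counter(len(f) for f in distinct_fingerprints(s))


def count_occurrences(s, phi):
    """Query (2), naively: count (start, end) pairs whose substring has fingerprint phi.

    Assumption: every position pair counts as a separate occurrence.
    """
    phi = frozenset(phi)
    total = 0
    for start in range(len(s)):
        seen = set()
        for end in range(start, len(s)):
            seen.add(s[end])
            if not seen <= phi:
                break  # a symbol outside phi appeared; longer substrings cannot match
            if seen == phi:
                total += 1
    return total
```

For S = "abac", the sketch finds six distinct fingerprints ({a}, {b}, {c}, {a,b}, {a,c}, {a,b,c}), so `count_fingerprints_by_size("abac")` gives 3 for k = 1, 2 for k = 2, and 1 for k = 3, while `count_occurrences("abac", {"a", "b"})` returns 3 (the substrings "ab", "aba", and "ba").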

Original language: English
Pages (from-to): 409-421
Number of pages: 13
Journal: Journal of Discrete Algorithms
Volume: 1
Issue number: 5-6
State: Published - Oct 2003

Bibliographical note

Funding Information:
Giorgio Satta's work was supported in part by MURST under project PRIN: BioInformatica e Ricerca Genomica and by the University of Padova under project Sviluppo di Sistemi ad Addestramento Automatico per l'Analisi del Linguaggio Naturale.

Funding Information:
Amihood Amir was partially supported by NSF grant CCR-01-04494, BSF grant 96-00509, and an Israel–Italy exchange scientist grant.

Funding Information:
Alberto Apostolico's work was supported in part by NSF Grant CCR-9700276, by MURST under project PRIN: BioInformatica e Ricerca Genomica, by the University of Padova under project Development of Novel Pattern Discovery Algorithms and Software, and by an Israel–Italy exchange scientist grant.

Funding Information:
This research was performed during exchange visits conducted, respectively, by the first and third authors at the University of Padova, and by the second author at the Universities of Bar-Ilan and Haifa, as part of an Israel–Italy exchange scientist grant jointly funded by the Israel Ministry of Science and the National Research Council of Italy.

Funding Information:
Gad Landau was partially supported by NSF grants CCR-9610238 and CCR-0104307, by NATO Science Programme grant PST.CLG.977017, by Israel Science Foundation grants 173/98 and 282/01, by the FIRST Foundation of the Israel Academy of Science and Humanities, by an IBM Faculty Partnership Award, and by an Israel–Italy exchange scientist grant.

Funding

Funders                                                           Funder number
FIRST Foundation of the Israel Academy of Science and Humanities
Israel Ministry of Science
MURST
Universities of Bar-Ilan and Haifa
National Science Foundation                                       CCR-0104307, CCR-9700276, CCR-9610238, CCR-01-04494
International Business Machines Corporation
North Atlantic Treaty Organization                                PST.CLG.977017
National Research Council
United States-Israel Binational Science Foundation                96-00509
Università degli Studi di Padova
Israel Science Foundation                                         282/01, 173/98

    Keywords

    • Combinatorial algorithms on words
    • Design and analysis of algorithms
