Efficient sampling of non-strict turnstile data streams

Neta Barkay, Ely Porat, Bar Shalem

Research output: Contribution to journal › Article › peer-review

1 Scopus citations

Abstract

We study the problem of generating a large sample from a data stream S of elements (i,v), where i is a positive integer key, v is an integer equal to the count of key i, and the sample consists of pairs (i, C_i) for C_i = Σ_{(i,v)∈S} v. We consider strict turnstile streams and general non-strict turnstile streams, in which C_i may be negative. Our sample is useful for approximating both forward and inverse distribution statistics, within an additive error ε and provable success probability 1-δ. Our sampling method improves by an order of magnitude the known processing time of each stream element, a crucial factor in data stream applications, thereby providing a feasible solution to the sampling problem. For example, for a sample of size O(ε^-2 log(1/δ)) in non-strict streams, our solution requires O((log log(1/ε))^2 + (log log(1/δ))^2) operations per stream element, whereas the best previous solution requires O(ε^-2 log^2(1/δ)) evaluations of a fully independent hash function per element. We achieve this improvement by constructing an efficient K-elements recovery structure from which K elements can be extracted with probability 1-δ. Our structure enables our sampling algorithm to run on distributed systems and extract statistics on the difference between streams.
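To make the notation concrete, the following minimal Python sketch computes the aggregated counts C_i exactly from a small turnstile stream and shows how the difference between two streams reduces to a single non-strict stream. It is a naive illustration that stores every key, not the sublinear-space recovery structure or sampling algorithm described in the abstract; the function names `aggregate` and `stream_difference` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def aggregate(stream):
    """Return {i: C_i}, where C_i is the sum of v over all updates (i, v).

    Naive exact aggregation for illustration only; keys whose updates
    cancel out to zero are dropped.
    """
    counts = defaultdict(int)
    for key, delta in stream:
        counts[key] += delta
    return {key: c for key, c in counts.items() if c != 0}


def stream_difference(stream_a, stream_b):
    """Aggregate stream_a minus stream_b by negating the updates of stream_b.

    The combined stream is in general non-strict: a key that is heavier
    in stream_b ends up with a negative count.
    """
    negated_b = [(key, -delta) for key, delta in stream_b]
    return aggregate(list(stream_a) + negated_b)


# A non-strict turnstile stream: key 1 ends with a negative count,
# key 3 cancels out entirely.
stream = [(1, 3), (2, 5), (1, -4), (3, 2), (3, -2)]
counts = aggregate(stream)

# Difference between two streams observed at different sites.
diff = stream_difference([(1, 2), (2, 1)], [(1, 2), (3, 4)])
```

Here `counts` holds a negative value for key 1, which is the case a strict turnstile stream rules out, and `diff` keeps only the keys on which the two streams disagree.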

Original language: English
Pages (from-to): 106-117
Number of pages: 12
Journal: Theoretical Computer Science
Volume: 590
DOIs
State: Published - 26 Jul 2015

Bibliographical note

Publisher Copyright:
© 2015 Elsevier B.V.

Keywords

  • Data streams
  • Inverse distribution
  • Sampling
