TY - GEN
T1 - Efficient sampling of non-strict turnstile data streams
AU - Barkay, Neta
AU - Porat, Ely
AU - Shalem, Bar
PY - 2013
Y1 - 2013
N2 - We study the problem of generating a large sample from a data stream of elements (i,v), where the sample consists of pairs (i,C_i) for C_i = Σ_{(i,v)∈stream} v. We consider strict turnstile streams and general non-strict turnstile streams, in which C_i may be negative. Our sample is useful for approximating both forward and inverse distribution statistics, within an additive error ε and provable success probability 1 - δ. Our sampling method improves by an order of magnitude the known processing time of each stream element, a crucial factor in data stream applications, thereby providing a feasible solution to the problem. For example, for a sample of size O(ε⁻² log(1/δ)) in non-strict streams, our solution requires O((log log(1/ε))² + (log log(1/δ))²) operations per stream element, whereas the best previous solution requires O(ε⁻² log²(1/δ)) evaluations of a fully independent hash function per element. We achieve this improvement by constructing an efficient K-elements recovery structure from which K elements can be extracted with probability 1 - δ. Our structure enables our sampling algorithm to run on distributed systems and extract statistics on the difference between streams.
AB - We study the problem of generating a large sample from a data stream of elements (i,v), where the sample consists of pairs (i,C_i) for C_i = Σ_{(i,v)∈stream} v. We consider strict turnstile streams and general non-strict turnstile streams, in which C_i may be negative. Our sample is useful for approximating both forward and inverse distribution statistics, within an additive error ε and provable success probability 1 - δ. Our sampling method improves by an order of magnitude the known processing time of each stream element, a crucial factor in data stream applications, thereby providing a feasible solution to the problem. For example, for a sample of size O(ε⁻² log(1/δ)) in non-strict streams, our solution requires O((log log(1/ε))² + (log log(1/δ))²) operations per stream element, whereas the best previous solution requires O(ε⁻² log²(1/δ)) evaluations of a fully independent hash function per element. We achieve this improvement by constructing an efficient K-elements recovery structure from which K elements can be extracted with probability 1 - δ. Our structure enables our sampling algorithm to run on distributed systems and extract statistics on the difference between streams.
UR - http://www.scopus.com/inward/record.url?scp=84883184512&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-40164-0_8
DO - 10.1007/978-3-642-40164-0_8
M3 - Conference contribution
AN - SCOPUS:84883184512
SN - 9783642401633
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 48
EP - 59
BT - Fundamentals of Computation Theory - 19th International Symposium, FCT 2013, Proceedings
T2 - 19th International Symposium on Fundamentals of Computation Theory, FCT 2013
Y2 - 19 August 2013 through 21 August 2013
ER -