TY - JOUR
T1 - Fingerprints for highly similar streams
AU - Bachrach, Yoram
AU - Porat, Ely
N1 - Publisher Copyright: © 2015 Published by Elsevier Inc.
PY - 2015/10/12
Y1 - 2015/10/12
N2 - We propose an approach for approximating the Jaccard similarity of two streams, J(A,B) = |A∩B|/|A∪B|, for domains where this similarity is known to be high. Our method is based on a reduction from Jaccard similarity to F2 norm estimation, for which there exists a sketch that is efficient in terms of both size and compute time, which we augment by a sampling technique. Our approach offers an improvement in the fingerprint size that is quadratic in the degree of similarity between the streams. More precisely, to approximate the Jaccard similarity up to a multiplicative factor of ε with confidence δ, it suffices to take a fingerprint of size O(ln(1/δ) · (1-t)^2/ε^2 · log(1/(1-t))), where t is the known minimal Jaccard similarity between the streams. Further, computing our fingerprint can be done in time O(1) per element in the stream.
AB - We propose an approach for approximating the Jaccard similarity of two streams, J(A,B) = |A∩B|/|A∪B|, for domains where this similarity is known to be high. Our method is based on a reduction from Jaccard similarity to F2 norm estimation, for which there exists a sketch that is efficient in terms of both size and compute time, which we augment by a sampling technique. Our approach offers an improvement in the fingerprint size that is quadratic in the degree of similarity between the streams. More precisely, to approximate the Jaccard similarity up to a multiplicative factor of ε with confidence δ, it suffices to take a fingerprint of size O(ln(1/δ) · (1-t)^2/ε^2 · log(1/(1-t))), where t is the known minimal Jaccard similarity between the streams. Further, computing our fingerprint can be done in time O(1) per element in the stream.
UR - http://www.scopus.com/inward/record.url?scp=84941361489&partnerID=8YFLogxK
U2 - 10.1016/j.ic.2015.06.001
DO - 10.1016/j.ic.2015.06.001
M3 - Article
AN - SCOPUS:84941361489
SN - 0890-5401
VL - 244
SP - 113
EP - 121
JO - Information and Computation
JF - Information and Computation
ER -