TY - GEN
T1 - Exponential space improvement for min-wise based algorithms
AU - Feigenblat, Guy
AU - Porat, Ely
AU - Shiftan, Ariel
PY - 2012
Y1 - 2012
N2 - In this paper we introduce a general framework that exponentially improves the space, the degree of independence, and the time needed by min-wise based algorithms. In SODA11 [15], the authors introduced an exponential time improvement for min-wise based algorithms by defining and constructing an almost k-min-wise independent family of hash functions. Here we develop an alternative approach that achieves both exponential time and exponential space improvement. The new approach relaxes the need for approximately min-wise hash functions, and hence gets around the Ω(log 1/ε) independence lower bound in [23]. This is done by defining and constructing a d-k-min-wise independent family of hash functions. Surprisingly, for most cases only 8-wise independence is needed for the additional improvement. Moreover, as the degree of independence is a small constant, our function can be implemented efficiently. Informally, under this definition, all subsets of size d of any fixed set X have an equal probability of having hash values among the minimal k values in X, where the probability is over the random choice of hash function from the family. This property measures the randomness of the family, since a truly random function obviously satisfies the definition for d = k = |X|. We define and give a time- and space-efficient construction of an approximately d-k-min-wise independent family of hash functions for the case where d = 2, as this is sufficient for the additional exponential improvement. We discuss how this construction can be used to improve many min-wise based algorithms. To our knowledge, such definitions for hash functions have never been studied, and no construction has been given before. As an example, we show how to apply it to similarity and rarity estimation over data streams. Other min-wise based algorithms can be adjusted in the same way.
KW - Hash functions
KW - Min-wise
KW - On-line algorithms
KW - Similarity
KW - Streaming
KW - Sub-linear algorithms
UR - http://www.scopus.com/inward/record.url?scp=84880212113&partnerID=8YFLogxK
U2 - 10.4230/LIPIcs.FSTTCS.2012.70
DO - 10.4230/LIPIcs.FSTTCS.2012.70
M3 - Conference contribution
AN - SCOPUS:84880212113
SN - 9783939897477
T3 - Leibniz International Proceedings in Informatics, LIPIcs
SP - 70
EP - 85
BT - 32nd International Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2012
T2 - 32nd International Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2012
Y2 - 15 December 2012 through 17 December 2012
ER -