TY - JOUR

T1 - Exponential time improvement for min-wise based algorithms

AU - Feigenblat, Guy

AU - Porat, Ely

AU - Shiftan, Ariel

N1 - Funding Information:
Supported by ISF, BSF and Google award.

PY - 2011/4

Y1 - 2011/4

N2 - In this paper we extend the notion of a min-wise independent family of hash functions by defining a k-min-wise independent family of hash functions. Informally, under this definition, all subsets of size k of any fixed set X have an equal chance of having the minimal hash values among all the elements in X, where the probability is over the random choice of a hash function from the family. This property measures the randomness of the family, since a truly random function obviously satisfies the definition for k = |X|. We define and give a time- and space-efficient construction of an approximately k-min-wise independent family of hash functions by extending Indyk's construction of approximately min-wise independent families. The number of words needed to represent each function is O(k log log(1/ε) + log(1/ε)), which is suboptimal only by a factor of O(log log(1/ε)), where ε ∈ (0, 1) is the desired error bound. This construction is the first applicable for sampling bottom-k sketches out of the universe. In addition, we introduce a general and novel technique that utilizes our construction and can be used to improve many min-wise based algorithms. As an example, we show how to apply it to similarity estimation over data streams, exponentially reducing the run time of the current known result [5]. We also discuss improvements to known algorithms for estimating rarity and entropy of random walks over graphs.

AB - In this paper we extend the notion of a min-wise independent family of hash functions by defining a k-min-wise independent family of hash functions. Informally, under this definition, all subsets of size k of any fixed set X have an equal chance of having the minimal hash values among all the elements in X, where the probability is over the random choice of a hash function from the family. This property measures the randomness of the family, since a truly random function obviously satisfies the definition for k = |X|. We define and give a time- and space-efficient construction of an approximately k-min-wise independent family of hash functions by extending Indyk's construction of approximately min-wise independent families. The number of words needed to represent each function is O(k log log(1/ε) + log(1/ε)), which is suboptimal only by a factor of O(log log(1/ε)), where ε ∈ (0, 1) is the desired error bound. This construction is the first applicable for sampling bottom-k sketches out of the universe. In addition, we introduce a general and novel technique that utilizes our construction and can be used to improve many min-wise based algorithms. As an example, we show how to apply it to similarity estimation over data streams, exponentially reducing the run time of the current known result [5]. We also discuss improvements to known algorithms for estimating rarity and entropy of random walks over graphs.

UR - http://www.scopus.com/inward/record.url?scp=79951789742&partnerID=8YFLogxK

U2 - 10.1016/j.ic.2011.01.005

DO - 10.1016/j.ic.2011.01.005

M3 - Article

AN - SCOPUS:79951789742

SN - 0890-5401

VL - 209

SP - 737

EP - 747

JO - Information and Computation

JF - Information and Computation

IS - 4

ER -