TY - CONF
T1 - Similarity based deduplication with small data chunks
AU - Aronovich, Lior
AU - Asher, Ron
AU - Harnik, Danny
AU - Hirsch, Michael
AU - Klein, Shmuel T.
AU - Toaff, Yair
N1 - Place of conference: Prague, Czech Republic
PY - 2012
Y1 - 2012
AB - Large backup and restore systems may have a petabyte or more of data in their repository. Such systems are often compressed by means of deduplication techniques that partition the input text into chunks and store recurring chunks only once. One approach is to use hashing methods to store a fingerprint for each data chunk, detecting identical chunks with a very low probability of collisions. As an alternative, it has been suggested to use similarity-based rather than identity-based searches, which allows the definition of much larger chunks. This implies that the data structure needed to store the fingerprints is much smaller, so such a system may be more scalable than systems built on the first approach. This paper deals with an extension of the second approach to systems in which the use of small chunks is still preferred. We describe the design choices made during the development of what we call an approximate hash function, which serves as the basic tool of the newly suggested deduplication system, and report on extensive tests performed on a variety of large input files.
KW - Approximate hash scheme
KW - Compression
KW - Deduplication
UR - http://www.scopus.com/inward/record.url?scp=84870499060&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84870499060
SN - 9788001050958
T3 - Proceedings of the Prague Stringology Conference, PSC 2012
SP - 3
EP - 17
BT - Proceedings of the Prague Stringology Conference, PSC 2012
T2 - Prague Stringology Conference, PSC 2012
Y2 - 27 August 2012 through 28 August 2012
ER -