TY - CHAP
T1 - Comparative analysis of approximate blocking techniques for entity resolution
AU - Papadakis, George
AU - Svirsky, Jonathan
AU - Gal, Avigdor
AU - Palpanas, Themis
PY - 2016
Y1 - 2016
AB - Entity Resolution is a core task for merging data collections. Due to its quadratic complexity, it typically scales to large volumes of data through blocking: similar entities are clustered into blocks and pair-wise comparisons are executed only between co-occurring entities, at the cost of some missed matches. There are numerous blocking methods, and the aim of this work is to offer a comprehensive empirical survey, extending the dimensions of comparison beyond what is commonly available in the literature. We consider 17 state-of-the-art blocking methods and use 6 popular real datasets to examine the robustness of their internal configurations and their relative balance between effectiveness and time efficiency. We also investigate their scalability over a corpus of 7 established synthetic datasets that range from 10,000 to 2 million entities.
UR - http://www.scopus.com/inward/record.url?scp=84975801848&partnerID=8YFLogxK
U2 - 10.14778/2947618.2947624
DO - 10.14778/2947618.2947624
M3 - Chapter
AN - SCOPUS:84975801848
T3 - Proceedings of the VLDB Endowment
SP - 684
EP - 695
BT - Proceedings of the VLDB Endowment
PB - Association for Computing Machinery
T2 - 42nd International Conference on Very Large Data Bases, VLDB 2016
Y2 - 5 September 2016 through 9 September 2016
ER -