Comparative analysis of approximate blocking techniques for entity resolution

George Papadakis, Jonathan Svirsky, Avigdor Gal, Themis Palpanas

Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review

128 Scopus citations

Abstract

Entity Resolution is a core task for merging data collections. Due to its quadratic complexity, it typically scales to large volumes of data through blocking: similar entities are clustered into blocks and pair-wise comparisons are executed only between co-occurring entities, at the cost of some missed matches. There are numerous blocking methods, and the aim of this work is to offer a comprehensive empirical survey, extending the dimensions of comparison beyond what is commonly available in the literature. We consider 17 state-of-the-art blocking methods and use 6 popular real datasets to examine the robustness of their internal configurations and their relative balance between effectiveness and time efficiency. We also investigate their scalability over a corpus of 7 established synthetic datasets that range from 10,000 to 2 million entities.
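The blocking idea described in the abstract can be illustrated with a minimal sketch. It assumes a simple token-based scheme, chosen here only for illustration and not presented as one of the methods the paper evaluates: every whitespace-separated token becomes a block key, and only entities that share a block are compared. The names `token_blocking`, `candidate_pairs`, and the four toy records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(entities):
    """Group entity ids by the lowercase tokens found in their text.

    Illustrative assumption: each token acts as a blocking key.
    """
    blocks = defaultdict(set)
    for entity_id, text in entities.items():
        for token in text.lower().split():
            blocks[token].add(entity_id)
    # A block holding a single entity yields no comparisons, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Collect the distinct pairs of entities that co-occur in some block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy records, not taken from the paper's datasets.
entities = {
    1: "John Smith London",
    2: "Jon Smith London",
    3: "Maria Garcia Madrid",
    4: "M. Garcia Madrid",
}

blocks = token_blocking(entities)
pairs = candidate_pairs(blocks)

# Exhaustive comparison would need n * (n - 1) / 2 pairs.
all_pairs = len(entities) * (len(entities) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

In this toy input the co-occurring pairs are (1, 2) and (3, 4), so two comparisons replace the six that an exhaustive pass would perform. A duplicate pair that shared no token would be missed, which is the trade-off between effectiveness and efficiency that the abstract refers to.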

Original language: English
Title of host publication: Proceedings of the VLDB Endowment
Publisher: Association for Computing Machinery
Pages: 684-695
Number of pages: 12
Edition: 9
DOIs
State: Published - 2016
Externally published: Yes
Event: 42nd International Conference on Very Large Data Bases, VLDB 2016 - New Delhi, India
Duration: 5 Sep 2016 to 9 Sep 2016

Publication series

Name: Proceedings of the VLDB Endowment
Number: 9
Volume: 9
ISSN (Electronic): 2150-8097

Conference

Conference: 42nd International Conference on Very Large Data Bases, VLDB 2016
Country/Territory: India
City: New Delhi
Period: 5/09/16 to 9/09/16
