CleanEr: Interactive, Query-Guided Error Mitigation for Data Cleaning Systems

Ran Schreiber, Yael Amsterdamer

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

A key challenge in data cleaning is estimating which of the tuples in a given database are correct and which are not. However, the output of data cleaning systems typically includes both false positives and false negatives, i.e., incorrect tuples labeled as correct and vice versa. When queries are performed over the output of such cleaning systems, cleaning errors may have an intricate impact on the query results. We introduce CleanEr, a generic framework that is used on top of existing data cleaning systems and that assists users in identifying the impact of potential cleaning errors on query results, and in deciding accordingly whether and how to proceed with the cleaning. We introduce novel indicators reflecting the current uncertainty with respect to the tuples in the query result, as well as the effect of each relevant input tuple on this uncertainty. We design and implement efficient algorithms for computing these indicators in CleanEr. Based on these indicators, CleanEr helps data analysts decide whether to trust the query output and guides them in further cleaning of relevant parts of the data through an interactive process. We propose to demonstrate CleanEr using NELL, a large database extracted from the Web.

Original language: English
Title of host publication: Proceedings - 2024 IEEE 40th International Conference on Data Engineering, ICDE 2024
Publisher: IEEE Computer Society
Pages: 5421-5424
Number of pages: 4
ISBN (Electronic): 9798350317152
State: Published - 2024
Event: 40th IEEE International Conference on Data Engineering, ICDE 2024 - Utrecht, Netherlands
Duration: 13 May 2024 - 17 May 2024

Publication series

Name: Proceedings - International Conference on Data Engineering
ISSN (Print): 1084-4627
ISSN (Electronic): 2375-0286

Conference

Conference: 40th IEEE International Conference on Data Engineering, ICDE 2024
Country/Territory: Netherlands
City: Utrecht
Period: 13/05/24 - 17/05/24

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

Keywords

  • data cleaning
  • provenance
  • uncertain databases
