Entropy-Based Approach to Efficient Cleaning of Big Data in Hierarchical Databases

Eugene Levner, Boris Kriheli, Arriel Benis, Alexander Ptuskin, Amir Elalouf, Sharon Hovav, Shai Ashkenazi

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

When databases are at risk of containing erroneous, redundant, or obsolete data, a cleaning procedure is used to detect, correct, or remove such undesirable records. We propose a methodology for improving data cleaning efficiency in a large hierarchical database. The methodology relies on Shannon’s information entropy for measuring the amount of information stored in databases. This approach, which builds on previously gathered statistical data regarding the prevalence of errors in the database, enables the decision maker to determine which components of the database are likely to have undergone more information loss, and thus to prioritize those components for cleaning. In particular, in cases where the cleaning process is iterative (from the root node down), the entropic approach produces a scientifically motivated stopping rule that determines the optimal (i.e., minimally required) number of tiers in the hierarchical database that need to be examined. This stopping rule defines a more streamlined representation of the database, in which less informative tiers are eliminated.
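
The abstract does not spell out the formulas, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It computes Shannon entropy, H = -sum(p * log2 p), for a per-tier probability distribution and applies a simple threshold-based stopping rule from the root down. The function name `tiers_to_examine`, the threshold value, and the sample distributions are all hypothetical and chosen purely for illustration.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def tiers_to_examine(tier_entropies: Sequence[float], threshold: float) -> int:
    """Hypothetical stopping rule (an assumption, not taken from the paper):
    walk the tiers from the root down and stop at the first tier whose
    entropy falls below `threshold`. Returns the number of tiers kept."""
    count = 0
    for h in tier_entropies:
        if h < threshold:
            break
        count += 1
    return count


# Made-up per-tier distributions (e.g. clean vs. erroneous records), root first.
tiers = [
    [0.5, 0.5],
    [0.9, 0.1],
    [0.99, 0.01],
]
entropies = [shannon_entropy(t) for t in tiers]
depth = tiers_to_examine(entropies, threshold=0.2)
print(entropies, depth)
```

With these illustrative numbers, the third tier carries well under 0.2 bits, so the sketch would examine only the first two tiers. How the paper actually defines the per-tier entropy and its optimal cutoff may differ.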

Original language: English
Title of host publication: Big Data – BigData 2020 - 9th International Conference, Held as Part of the Services Conference Federation, SCF 2020, Proceedings
Editors: Surya Nepal, Wenqi Cao, Aziz Nasridinov, MD Zakirul Alam Bhuiyan, Xuan Guo, Liang-Jie Zhang
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 3-12
Number of pages: 10
ISBN (Print): 9783030596118
DOIs
State: Published - 2020
Event: 9th International Conference on Big Data, BigData 2020, held as part of the Services Conference Federation, SCF 2020 - Honolulu, United States
Duration: 18 Sep 2020 to 20 Sep 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12402 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 9th International Conference on Big Data, BigData 2020, held as part of the Services Conference Federation, SCF 2020
Country/Territory: United States
City: Honolulu
Period: 18/09/20 to 20/09/20

Bibliographical note

Publisher Copyright:
© 2020, Springer Nature Switzerland AG.

Keywords

  • Data cleaning
  • Entropy evaluation
  • Entropy-based analytics
