Abstract
Data scientists frequently examine the raw content of large tables when exploring an unknown dataset. In such cases, small subsets of the full tables (sub-tables) that accurately capture table contents are useful. We present a framework which, given a large data table T, creates a sub-table of small, fixed dimensions by selecting a subset of T's rows and projecting them over a subset of T's columns. The question is: Which rows and columns should be selected to yield an informative sub-table? Our first contribution is an informativeness metric for sub-tables with two complementary dimensions: cell coverage, which measures how well the sub-table captures prominent data patterns in T, and diversity. We use association rules as the patterns captured by sub-tables, and show that computing optimal sub-tables directly using this metric is infeasible. We then develop an efficient algorithm that indirectly accounts for association rules using table embedding. The resulting framework produces sub-tables for the full table as well as for the results of queries over the table, enabling the user to quickly understand results and determine subsequent queries. Experimental results show that high-quality sub-tables can be efficiently computed, and verify the soundness of our metrics as well as the usefulness of selected sub-tables through user studies.
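To make the two dimensions of the metric concrete, the following is a minimal Python sketch of scoring a candidate sub-table. It is an illustration under stated assumptions, not the paper's definitions: the abstract does not give formulas, so the rule representation (a conjunction of column/value pairs rather than a full antecedent/consequent rule), the coverage score (fraction of rules matched by some row), the diversity score (mean fraction of differing cells between row pairs), the weight `alpha`, and all function names are hypothetical.

```python
from itertools import combinations


def sub_table(table, row_ids, cols):
    """Project the chosen rows of `table` (a list of dicts) onto `cols`."""
    return [{c: table[r][c] for c in cols} for r in row_ids]


def rule_coverage(sub, rules):
    """Fraction of rules fully matched by at least one sub-table row.

    Assumption: a rule is a dict {column: value}; a row matches when it
    contains every column of the rule with the same value.
    """
    if not rules:
        return 0.0
    hits = sum(
        any(all(c in row and row[c] == v for c, v in rule.items()) for row in sub)
        for rule in rules
    )
    return hits / len(rules)


def diversity(sub):
    """Mean fraction of differing cells over all pairs of sub-table rows."""
    pairs = list(combinations(sub, 2))
    if not pairs or not sub[0]:
        return 0.0
    n_cols = len(sub[0])
    total = sum(sum(a[c] != b[c] for c in a) / n_cols for a, b in pairs)
    return total / len(pairs)


def informativeness(sub, rules, alpha=0.5):
    """Hypothetical weighted combination of the two dimensions."""
    return alpha * rule_coverage(sub, rules) + (1 - alpha) * diversity(sub)


# Toy data, invented purely for illustration.
table = [
    {"city": "Paris", "country": "FR", "plan": "pro"},
    {"city": "Paris", "country": "FR", "plan": "free"},
    {"city": "Lyon", "country": "FR", "plan": "free"},
    {"city": "Rome", "country": "IT", "plan": "pro"},
]
rules = [
    {"city": "Paris", "country": "FR"},
    {"city": "Rome", "country": "IT"},
]

varied = sub_table(table, [0, 3], ["city", "country"])
redundant = sub_table(table, [0, 1], ["city", "country"])
```

In this toy setup, `varied` matches both rules and its two rows differ in every cell, while `redundant` repeats the same row and matches only one rule, so it scores lower on both dimensions. Scoring every possible choice of rows and columns this way grows combinatorially with table size, which is in line with the abstract's remark that direct optimization is infeasible.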
Original language | English |
---|---|
Title of host publication | Proceedings - 2023 IEEE 39th International Conference on Data Engineering, ICDE 2023 |
Publisher | IEEE Computer Society |
Pages | 2496-2509 |
Number of pages | 14 |
ISBN (Electronic) | 9798350322279 |
State | Published - 2023 |
Event | 39th IEEE International Conference on Data Engineering, ICDE 2023 - Anaheim, United States; Duration: 3 Apr 2023 → 7 Apr 2023 |
Publication series
Name | Proceedings - International Conference on Data Engineering |
---|---|
Volume | 2023-April |
ISSN (Print) | 1084-4627 |
Conference
Conference | 39th IEEE International Conference on Data Engineering, ICDE 2023 |
---|---|
Country/Territory | United States |
City | Anaheim |
Period | 3/04/23 → 7/04/23 |
Bibliographical note
Publisher Copyright: © 2023 IEEE.
Keywords
- Interactive data exploration and discovery