TY - GEN
T1 - Efficient multidimensional quantitative hypotheses generation
AU - Amir, Amihood
AU - Kashi, Reuven
AU - Netanyahu, Nathan S.
PY - 2003
Y1 - 2003
AB - Finding local interrelations (hypotheses) among attributes within very large databases of high dimensionality is an acute problem for many database and data mining applications. These include dependency modeling, clustering of large databases, and correlation and link analysis. Traditional statistical methods are concerned with the corroboration of (a set of) hypotheses on a given body of data. Testing all of the hypotheses that can be generated from a database with millions of records and dozens of fields is clearly infeasible. Generating, on the other hand, a set of the most "promising" hypotheses (to be corroborated) requires much intuition and ingenuity. In this paper we present an efficient method for ranking multidimensional hypotheses using image processing of data visualizations. At the heart of the method lies the use of visualization techniques and image processing ideas to rank subsets of attributes according to the relation between them in the database. Some of the scalability issues are solved by concise generalized histograms and by an efficient on-line computation of clustering around a median with only five additional memory words. In addition to presenting our algorithmic methodology, we demonstrate its efficiency and performance by applying it to real census data sets as well as synthetic data sets.
UR - http://www.scopus.com/inward/record.url?scp=35048827670&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:35048827670
SN - 0769519784
SN - 9780769519784
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 1
EP - 10
BT - Proceedings - 3rd IEEE International Conference on Data Mining, ICDM 2003
T2 - 3rd IEEE International Conference on Data Mining, ICDM '03
Y2 - 19 November 2003 through 22 November 2003
ER -