TY - JOUR

T1 - Analyzing quantitative databases: Image is everything

AU - Amir, Amihood

AU - Kashi, Reuven

AU - Netanyahu, Nathan S.

PY - 2001/1/1

Y1 - 2001/1/1

N2 - Traditional statistical methods deal with corroborating given hypotheses on a given body of data. However, generating the hypothesis itself is a matter of intuition and ingenuity. It is clearly impossible to test all hypotheses on a database with millions of records and hundreds of fields. There have been attempts to bridge this gap through data mining. Association generation is a method of creating such statistical hypotheses for binary data. For quantitative databases the situation is still unsatisfactory. There are a number of known methods. One is a reduction to binary data by creating intervals and then generating associations. This method is computationally expensive. Another suggested method generates associations that are statistically interesting. This method was also tried only on small databases and is applicable only to binary relations, e.g., in certain ranges of field X, field Y lies significantly outside its average. We suggest a method that addresses some of the problems with the current techniques. Our idea is based on using visualization techniques and image processing ideas to rank subsets of fields according to the relation between them in the database. This ranking suggests the hypotheses to be statistically investigated. Our method has the following advantages: 1. It is scalable. Our algorithm is based mainly on analyzing histograms of the data set, and is thus more efficient. It is also naturally suitable for sampling. 2. It is generalizable in the size of the set of fields. No current method handles more than a binary relation. 3. It affords comparability between fields over different base sets. This allows a uniform scale for different sets of fields in different databases. In this paper we present an algorithmic methodology and the results of its application to the Census Bureau databases cpsm93p and nhis93ac.

AB - Traditional statistical methods deal with corroborating given hypotheses on a given body of data. However, generating the hypothesis itself is a matter of intuition and ingenuity. It is clearly impossible to test all hypotheses on a database with millions of records and hundreds of fields. There have been attempts to bridge this gap through data mining. Association generation is a method of creating such statistical hypotheses for binary data. For quantitative databases the situation is still unsatisfactory. There are a number of known methods. One is a reduction to binary data by creating intervals and then generating associations. This method is computationally expensive. Another suggested method generates associations that are statistically interesting. This method was also tried only on small databases and is applicable only to binary relations, e.g., in certain ranges of field X, field Y lies significantly outside its average. We suggest a method that addresses some of the problems with the current techniques. Our idea is based on using visualization techniques and image processing ideas to rank subsets of fields according to the relation between them in the database. This ranking suggests the hypotheses to be statistically investigated. Our method has the following advantages: 1. It is scalable. Our algorithm is based mainly on analyzing histograms of the data set, and is thus more efficient. It is also naturally suitable for sampling. 2. It is generalizable in the size of the set of fields. No current method handles more than a binary relation. 3. It affords comparability between fields over different base sets. This allows a uniform scale for different sets of fields in different databases. In this paper we present an algorithmic methodology and the results of its application to the Census Bureau databases cpsm93p and nhis93ac.

UR - http://www.scopus.com/inward/record.url?scp=34347358023&partnerID=8YFLogxK

M3 - Article

JO - VLDB 2001 - Proceedings of 27th International Conference on Very Large Data Bases

JF - VLDB 2001 - Proceedings of 27th International Conference on Very Large Data Bases

ER -