TY - JOUR
T1 - "Padding" bitmaps to support similarity and mining
AU - Gelbard, Roy
PY - 2013/3
Y1 - 2013/3
N2 - The current paper presents a novel approach to bitmap-indexing for data mining purposes. Currently, bitmap-indexing enables efficient data storage and retrieval, but is limited in terms of similarity measurement, and hence as regards classification, clustering and data mining. Bitmap-indexes mainly fit nominal discrete attributes and are thus unattractive for widespread use, which requires the ability to handle continuous data in a raw format. The current research describes a scheme for representing ordinal and continuous data by applying the concept of "padding", where each discrete nominal data value is transformed into a range of nominal-discrete values. This "padding" is done by adding adjacent bits "around" the original value (bin). The padding factor, i.e., the number of adjacent bits added, is calculated from the first and second derivative degrees of each attribute's domain-distribution. The padded representation better supports similarity measures, and therefore improves the accuracy of clustering and mining. The advantages of padding bitmaps are demonstrated on Fisher's Iris dataset.
AB - The current paper presents a novel approach to bitmap-indexing for data mining purposes. Currently, bitmap-indexing enables efficient data storage and retrieval, but is limited in terms of similarity measurement, and hence as regards classification, clustering and data mining. Bitmap-indexes mainly fit nominal discrete attributes and are thus unattractive for widespread use, which requires the ability to handle continuous data in a raw format. The current research describes a scheme for representing ordinal and continuous data by applying the concept of "padding", where each discrete nominal data value is transformed into a range of nominal-discrete values. This "padding" is done by adding adjacent bits "around" the original value (bin). The padding factor, i.e., the number of adjacent bits added, is calculated from the first and second derivative degrees of each attribute's domain-distribution. The padded representation better supports similarity measures, and therefore improves the accuracy of clustering and mining. The advantages of padding bitmaps are demonstrated on Fisher's Iris dataset.
KW - Bitmap-index
KW - Classification
KW - Cluster analysis
KW - Data mining
KW - Data representation
KW - Similarity index
UR - http://www.scopus.com/inward/record.url?scp=84874944930&partnerID=8YFLogxK
U2 - 10.1007/s10796-011-9318-9
DO - 10.1007/s10796-011-9318-9
M3 - Article
AN - SCOPUS:84874944930
SN - 1387-3326
VL - 15
SP - 99
EP - 110
JO - Information Systems Frontiers
JF - Information Systems Frontiers
IS - 1
ER -