TY - GEN
T1 - CUBS
T2 - 10th IEEE International Conference on Data Mining Workshops, ICDMW 2010
AU - Richardson, Ariella
AU - Kaminka, Gal
AU - Kraus, Sarit
PY - 2010
Y1 - 2010
N2 - Multivariate temporal sequence classification is an important and challenging task. Several attempts to address this problem exist, but none provide a full solution. In this paper we present CUBS: Classification Using Bounded Z-Score with Sampling. CUBS uses itemset mining to produce frequent subsequences, and then selects among them the statistically significant subsequences to compose a classification model. We introduce an improved itemset mining algorithm that solves the short-sequence bias present in many itemset mining algorithms. Unfortunately, the z-score normalization hinders pruning. We provide a bound on the z-score to address this issue. Calculating the z-score normalization requires knowledge of some statistical values of the data, which are gathered using a small sample of the database. The sampling causes a distortion in the values. We analyze this distortion and correct it. We evaluate CUBS for accuracy and scalability on a synthetic dataset and on two real-world datasets. The results demonstrate how the short-subsequence bias is solved in the mining, and show how our bound and sampling technique enable speedup.
KW - Classification
KW - Mining multiple information sources
KW - Multivariate sequence mining
KW - Sampling
UR - http://www.scopus.com/inward/record.url?scp=79951747936&partnerID=8YFLogxK
U2 - 10.1109/icdmw.2010.38
DO - 10.1109/icdmw.2010.38
M3 - Conference contribution
AN - SCOPUS:79951747936
SN - 9780769542577
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 72
EP - 79
BT - Proceedings - 10th IEEE International Conference on Data Mining Workshops, ICDMW 2010
Y2 - 14 December 2010 through 17 December 2010
ER -