TY - JOUR
T1 - REEF: Resolving length bias in frequent sequence mining using sampling
AU - Richardson, Ariella
AU - Kaminka, Gal A.
AU - Kraus, Sarit
PY - 2014
Y1 - 2014
N2 - Classic support based approaches efficiently address
frequent sequence mining. However, support based mining has
been shown to suffer from a bias towards short sequences.
In this paper, we propose a method to resolve this bias when
mining the most frequent sequences. In order to resolve the
length bias we define norm-frequency, based on the statistical z-score
of support, and use it to replace support based frequency.
Our approach mines the subsequences that are frequent relative
to other subsequences of the same length. Unfortunately, naive
use of norm-frequency hinders mining scalability. Using norm-frequency
breaks the anti-monotonic property of support, an
important part in being able to prune large sets of candidate
sequences. We describe a bound that enables pruning to provide
scalability. Calculation of the bound uses a preprocessing stage
on a sample of the dataset. Sampling the data creates a distortion
in the sample's measures. We present a method to correct this
distortion. We conducted experiments on 4 data sets, including
synthetic data, textual data, remote control zapping data and
computer user input data. Experimental results establish that
we manage to overcome the short sequence bias successfully,
and to illustrate the production of meaningful sequences with
our mining algorithm.
AB - Classic support based approaches efficiently address
frequent sequence mining. However, support based mining has
been shown to suffer from a bias towards short sequences.
In this paper, we propose a method to resolve this bias when
mining the most frequent sequences. In order to resolve the
length bias we define norm-frequency, based on the statistical z-score
of support, and use it to replace support based frequency.
Our approach mines the subsequences that are frequent relative
to other subsequences of the same length. Unfortunately, naive
use of norm-frequency hinders mining scalability. Using norm-frequency
breaks the anti-monotonic property of support, an
important part in being able to prune large sets of candidate
sequences. We describe a bound that enables pruning to provide
scalability. Calculation of the bound uses a preprocessing stage
on a sample of the dataset. Sampling the data creates a distortion
in the sample's measures. We present a method to correct this
distortion. We conducted experiments on 4 data sets, including
synthetic data, textual data, remote control zapping data and
computer user input data. Experimental results establish that
we manage to overcome the short sequence bias successfully,
and to illustrate the production of meaningful sequences with
our mining algorithm.
UR - https://scholar.google.co.il/scholar?q=REEF%3A+Resolving+length+bias+in+frequent+sequence+mining+using+sampling&btnG=&hl=en&as_sdt=0%2C5
UR - http://www.umiacs.umd.edu/users/sarit/data/articles/immm-2013-5-20-20043.pdf
M3 - Article
VL - 7
SP - 208
EP - 222
JO - International Journal on Advances in Intelligent Systems
JF - International Journal on Advances in Intelligent Systems
IS - 1
ER -