TY - JOUR
T1 - Improving text categorization bootstrapping via unsupervised learning
AU - Gliozzo, Alfio
AU - Strapparava, Carlo
AU - Dagan, Ido
PY - 2009/10/1
Y1 - 2009/10/1
N2 - We propose a text-categorization bootstrapping algorithm in which categories are described by relevant seed words. Our method introduces two unsupervised techniques to improve the initial categorization step of the bootstrapping scheme: (i) using latent semantic spaces to estimate the similarity among documents and words, and (ii) the Gaussian mixture algorithm, which differentiates relevant and nonrelevant category information using statistics from unlabeled examples. In particular, this second step maps the similarity scores to class posterior probabilities, and therefore reduces sensitivity to keyword-dependent variations in scores. The algorithm was evaluated on two text categorization tasks, and obtained good performance using only the category names as initial seeds. In particular, the performance of the proposed method proved to be equivalent to a pure supervised approach trained on 70-160 labeled documents per category.
AB - We propose a text-categorization bootstrapping algorithm in which categories are described by relevant seed words. Our method introduces two unsupervised techniques to improve the initial categorization step of the bootstrapping scheme: (i) using latent semantic spaces to estimate the similarity among documents and words, and (ii) the Gaussian mixture algorithm, which differentiates relevant and nonrelevant category information using statistics from unlabeled examples. In particular, this second step maps the similarity scores to class posterior probabilities, and therefore reduces sensitivity to keyword-dependent variations in scores. The algorithm was evaluated on two text categorization tasks, and obtained good performance using only the category names as initial seeds. In particular, the performance of the proposed method proved to be equivalent to a pure supervised approach trained on 70-160 labeled documents per category.
KW - Bootstrapping
KW - Text categorization
KW - Unsupervised machine learning
UR - http://www.scopus.com/inward/record.url?scp=70350339784&partnerID=8YFLogxK
U2 - 10.1145/1596515.1596516
DO - 10.1145/1596515.1596516
M3 - Article
AN - SCOPUS:70350339784
SN - 1550-4875
VL - 6
JO - ACM Transactions on Speech and Language Processing
JF - ACM Transactions on Speech and Language Processing
IS - 1
M1 - 1
ER -