Abstract
The need to cluster small text corpora composed of a few hundred short texts arises in various applications; e.g., clustering top-retrieved documents based on their snippets. This clustering task is challenging because of the vocabulary mismatch between short texts and because the small corpus size yields insufficient corpus-based statistics (e.g., term co-occurrence statistics). We address this clustering challenge using a framework that utilizes a set of external knowledge resources that provide information about term relations. Specifically, we use information induced from the resources to estimate similarity between terms and produce term clusters. We also utilize the resources to expand the vocabulary used in the given corpus and thus enhance term clustering. We then project the texts in the corpus onto the term clusters to cluster the texts. We evaluate various instantiations of the proposed framework by varying the term clustering method used, the approach of projecting the texts onto the term clusters, and the way of applying external knowledge resources. Extensive empirical evaluation demonstrates the merits of our approach compared with applying clustering algorithms directly to the text corpus and with using state-of-the-art co-clustering and topic modeling methods.
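The three steps named in the abstract (vocabulary expansion, term clustering, projection of texts onto term clusters) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `RELATED` list stands in for an external knowledge resource, term clustering is reduced to connected components of the relatedness graph, and projection simply assigns each text to the term cluster with which it shares the most tokens. The function names `expand_vocabulary`, `cluster_terms`, and `project_texts` are hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for an external knowledge resource: pairs of related terms.
RELATED = [
    ("car", "vehicle"),
    ("automobile", "vehicle"),
    ("refund", "repayment"),
    ("reimbursement", "repayment"),
]


def expand_vocabulary(vocab, related_pairs):
    """Add resource terms that are related to at least one corpus term."""
    extra = {t for pair in related_pairs for t in pair
             if any(p in vocab for p in pair)}
    return vocab | extra


def cluster_terms(vocab, related_pairs):
    """Toy term clustering: connected components of the relatedness graph."""
    parent = {t: t for t in vocab}

    def find(t):
        # Follow parent links to the root, halving the path on the way.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in related_pairs:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    clusters = {}
    for t in sorted(vocab):  # sorted for a deterministic cluster order
        clusters.setdefault(find(t), set()).add(t)
    # Singleton clusters carry no relatedness information here, so drop them.
    return [c for c in clusters.values() if len(c) > 1]


def project_texts(texts, term_clusters):
    """Assign each text to the term cluster it overlaps most (None if no overlap)."""
    labels = []
    for text in texts:
        tokens = Counter(text.lower().split())
        best, best_score = None, 0
        for i, cluster in enumerate(term_clusters):
            score = sum(tokens[t] for t in cluster)
            if score > best_score:
                best, best_score = i, score
        labels.append(best)
    return labels


texts = [
    "my car broke down",
    "the automobile would not start",
    "i want a refund",
    "please send the reimbursement",
]
vocab = {tok for text in texts for tok in text.lower().split()}
expanded = expand_vocabulary(vocab, RELATED)
term_clusters = cluster_terms(expanded, RELATED)
labels = project_texts(texts, term_clusters)
print(labels)  # [0, 0, 1, 1]
```

In this toy corpus, "car" and "automobile" never co-occur and are linked only through "vehicle", a term that appears in no text. Without the expansion step they would stay in separate clusters, which is the vocabulary-mismatch problem the abstract describes.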
Original language | English |
---|---|
Pages (from-to) | 273-306 |
Number of pages | 34 |
Journal | Information Retrieval Journal |
Volume | 21 |
Issue number | 4 |
State | Published - 1 Aug 2018 |
Bibliographical note
Publisher Copyright: © 2017, Springer Science+Business Media, LLC, part of Springer Nature.
Funding
Acknowledgements: This work was partially supported by the MAGNETON Grant No. 43834 of the Israel Ministry of Industry, Trade and Labor, the Israel Ministry of Science and Technology, the Israel Science Foundation Grant 1112/08 and Grant 1136/17, the PASCAL-2 Network of Excellence of the European Community FP7-ICT-2007-1-216886 and the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 287923 (EXCITEMENT). We would like to thank NICE Systems and especially Maya Gorodetsky, Gennadi Lembersky and Ezra Daya for help in creating the datasets. Finally, we thank the anonymous reviewers for their useful comments and suggestions.
Funders | Funder number |
---|---|
Israel Ministry of Science and Technology | |
NICE Systems | |
Israel Science Foundation | 1112/08, 1136/17 |
Ministry of Industry, Trade and Labor | |
Seventh Framework Programme (FP7/2007-2013) | 287923, FP7-ICT-2007-1-216886 |
Keywords
- Clustering
- Clustering short texts
- Short text similarities