Clustering small-sized collections of short texts

Lili Kotlerman, Ido Dagan, Oren Kurland

Research output: Contribution to journal, Article, peer-review

12 Scopus citations

Abstract

The need to cluster small text corpora composed of a few hundred short texts arises in various applications, e.g., clustering top-retrieved documents based on their snippets. This clustering task is challenging because of the vocabulary mismatch between short texts and because the small corpus size yields insufficient corpus-based statistics (e.g., term co-occurrence statistics). We address this clustering challenge using a framework that utilizes a set of external knowledge resources that provide information about term relations. Specifically, we use information induced from the resources to estimate similarity between terms and produce term clusters. We also utilize the resources to expand the vocabulary used in the given corpus and thus enhance term clustering. We then project the texts in the corpus onto the term clusters to cluster the texts. We evaluate various instantiations of the proposed framework by varying the term clustering method used, the approach of projecting the texts onto the term clusters, and the way of applying external knowledge resources. Extensive empirical evaluation demonstrates the merits of our approach relative to applying clustering algorithms directly to the text corpus and to using state-of-the-art co-clustering and topic modeling methods.
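The pipeline outlined in the abstract (estimate term similarity from external resources, cluster the expanded vocabulary, then project texts onto the term clusters) can be illustrated with a toy sketch. This is not the authors' implementation: the `TERM_SIMILARITY` dictionary is a hypothetical stand-in for the external knowledge resources, single-link thresholding is swapped in for the term clustering methods the paper evaluates, and the maximum-overlap projection is only one assumed way of mapping texts to term clusters.

```python
from collections import Counter

# Hypothetical stand-in for an external knowledge resource: pairwise term
# relatedness scores in [0, 1]. Real resources and scores would differ.
TERM_SIMILARITY = {
    ("bill", "invoice"): 0.9,
    ("invoice", "charge"): 0.7,
    ("internet", "connection"): 0.8,
    ("connection", "router"): 0.6,
}

def cluster_terms(terms, similarity, threshold=0.5):
    """Single-link clustering: merge terms whose similarity >= threshold."""
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent links to the cluster root, compressing the path.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for (a, b), score in similarity.items():
        if score >= threshold and a in parent and b in parent:
            parent[find(a)] = find(b)

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), set()).add(t)
    # Assumption of this sketch: singleton clusters carry no relatedness
    # evidence, so they are dropped before projection.
    return [c for c in clusters.values() if len(c) > 1]

def project_texts(texts, term_clusters):
    """Assign each text to the term cluster it shares the most tokens with."""
    assignment = {}
    for text in texts:
        tokens = Counter(text.lower().split())
        overlaps = [sum(tokens[t] for t in c) for c in term_clusters]
        best = max(range(len(term_clusters)),
                   key=overlaps.__getitem__, default=None)
        # Texts with no overlapping term stay unassigned (None).
        assignment[text] = best if best is not None and overlaps[best] > 0 else None
    return assignment

texts = [
    "wrong charge again",
    "my bill is too high",
    "router keeps dropping",
    "slow internet today",
]

# Vocabulary expansion: corpus terms plus terms known to the resource.
corpus_terms = {tok for text in texts for tok in text.lower().split()}
resource_terms = {t for pair in TERM_SIMILARITY for t in pair}
vocabulary = sorted(corpus_terms | resource_terms)

term_clusters = cluster_terms(vocabulary, TERM_SIMILARITY)
assignment = project_texts(texts, term_clusters)
for text, cluster_id in assignment.items():
    print(cluster_id, text)
```

In this toy corpus "bill" and "charge" never co-occur and share no surface form, so clustering the texts directly would not relate them. The expansion term "invoice", which appears only in the assumed resource, links them into one term cluster, and the two billing texts end up grouped together.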

Original language: English
Pages (from-to): 273-306
Number of pages: 34
Journal: Information Retrieval Journal
Volume: 21
Issue number: 4
State: Published - 1 Aug 2018

Bibliographical note

Publisher Copyright:
© 2017, Springer Science+Business Media, LLC, part of Springer Nature.

Funding

Acknowledgements: This work was partially supported by the MAGNETON Grant No. 43834 of the Israel Ministry of Industry, Trade and Labor, the Israel Ministry of Science and Technology, the Israel Science Foundation Grant 1112/08 and Grant 1136/17, the PASCAL-2 Network of Excellence of the European Community FP7-ICT-2007-1-216886 and the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 287923 (EXCITEMENT). We would like to thank NICE Systems and especially Maya Gorodetsky, Gennadi Lembersky and Ezra Daya for help in creating the datasets. Finally, we thank the anonymous reviewers for their useful comments and suggestions.

Funders (with funder number where listed)
FP7/2007
Israel Ministry of Science and Technology
NICE Systems
Israel Science Foundation: 1136/17, FP7-ICT-2007-1-216886, 1112/08
Ministry of Industry, Trade and Labor
Seventh Framework Programme: 287923

    Keywords

    • Clustering
    • Clustering short texts
    • Short text similarities
