Abstract
In this work we suggest a novel Text Categorization (TC) scenario, motivated by an ad-hoc industrial need to assign documents to a set of predefined categories when no labeled training data for the categories is available. The scenario is applicable in many industrial settings and is interesting from the academic perspective. We present a new dataset tailored to the main characteristics of the scenario, and use it to investigate the name-based TC approach, which takes the category names as its only input and does not require training data. We evaluate and analyze the performance of state-of-the-art methods on this dataset to identify their shortcomings in our scenario, and suggest ways to overcome them. We use statistical correlation measured over a target corpus to improve on the state of the art, and offer a different classification scheme based on the characteristics of the setting. We evaluate our improvements and adaptations and show superior performance of our suggested method.
| Original language | English |
|---|---|
| Pages (from-to) | 227-261 |
| Number of pages | 35 |
| Journal | Language Resources and Evaluation |
| Volume | 49 |
| Issue number | 2 |
| State | Published - 1 Jun 2015 |
Bibliographical note
Publisher Copyright: © 2015, Springer Science+Business Media Dordrecht.
Funding
This work was supported by the Next Generation Video (NeGeV) Project. We would like to thank our industrial partners, Comverse Technology Inc. and Orca Interactive Ltd. We thank Libby Barak for helping us in replicating the results of Barak et al. (). We thank Naomi Zeichner for preparing the taxonomy and annotating the dataset. Finally, we thank the anonymous reviewers for their useful comments and suggestions.
| Funders |
|---|
| Next Generation Video |
Keywords
- Name-based Text Categorization
- Natural language processing
- Semantic similarity