Abstract
In this work we propose a novel Text Categorization (TC) scenario, motivated by an ad-hoc industrial need to assign documents to a set of predefined categories when no labeled training data for the categories is available. The scenario is applicable in many industrial settings and is also of academic interest. We present a new dataset geared to the main characteristics of the scenario, and use it to investigate the name-based TC approach, which takes the category names as its only input and requires no training data. We evaluate and analyze the performance of state-of-the-art methods on this dataset to identify their shortcomings in our scenario, and suggest ways to overcome them. We use statistical correlation measured over a target corpus to improve on the state of the art, and offer a different classification scheme based on the characteristics of the setting. We evaluate our improvements and adaptations and show that our suggested method achieves superior performance.
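To make the name-based idea concrete, the following is a minimal sketch, not the method evaluated in the article. It assumes that a category can be represented by the words of its name alone and that a document is assigned to the category whose name is most similar to it under bag-of-words cosine similarity. The function names (`tokenize`, `cosine`, `classify_by_name`) and the example categories are hypothetical and chosen for illustration only; the corpus-based correlation statistics mentioned in the abstract are not modeled here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_by_name(document, category_names):
    """Score a document against each category name; no training data is used.

    Illustrative assumption: the only knowledge about a category is the
    set of words in its name.
    """
    doc_vec = Counter(tokenize(document))
    scores = {
        name: cosine(doc_vec, Counter(tokenize(name)))
        for name in category_names
    }
    # Ties (including the all-zero case) resolve to the first category listed.
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    categories = ["space exploration", "computer graphics"]
    text = "The probe began its exploration of deep space last year."
    label, all_scores = classify_by_name(text, categories)
    print(label, all_scores)
```

A literal word-overlap score like this fails whenever a document discusses a category without using the words of its name, which is one reason name-based approaches typically rely on richer semantic similarity between the name and the document.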
Original language | English |
---|---|
Pages (from-to) | 227-261 |
Number of pages | 35 |
Journal | Language Resources and Evaluation |
Volume | 49 |
Issue number | 2 |
State | Published - 1 Jun 2015 |
Bibliographical note
Publisher Copyright: © 2015, Springer Science+Business Media Dordrecht.
Keywords
- Name-based Text Categorization
- Natural language processing
- Semantic similarity