Text Categorization from category name in an industry-motivated scenario

Chaya Liebeskind, Lili Kotlerman, Ido Dagan

Research output: Contribution to journal > Article > peer-review

8 Scopus citations

Abstract

In this work we suggest a novel Text Categorization (TC) scenario, motivated by an ad-hoc industrial need to assign documents to a set of predefined categories, while labeled training data for the categories is not available. The scenario is applicable in many industrial settings and is interesting from the academic perspective. We present a new dataset geared for the main characteristics of the scenario, and utilize it to investigate the name-based TC approach, which uses the category names as its only input and does not require training data. We evaluate and analyze the performance of state-of-the-art methods for this dataset to identify the shortcomings of these methods for our scenario, and suggest ways for overcoming these shortcomings. We utilize statistical correlation measured over a target corpus for improving the state-of-the-art, and offer a different classification scheme based on the characteristics of the setting. We evaluate our improvements and adaptations and show superior performance of our suggested method.

Original language: English
Pages (from-to): 227-261
Number of pages: 35
Journal: Language Resources and Evaluation
Volume: 49
Issue number: 2
State: Published - 1 Jun 2015

Bibliographical note

Publisher Copyright:
© 2015, Springer Science+Business Media Dordrecht.

Keywords

  • Name-based Text Categorization
  • Natural language processing
  • Semantic similarity
