Text Categorization from category name in an industry-motivated scenario

Research output: Contribution to journal > Article > peer-review

Abstract

In this work we suggest a novel Text Categorization (TC) scenario, motivated by an ad-hoc industrial need to assign documents to a set of predefined categories, while labeled training data for the categories is not available. The scenario is applicable in many industrial settings and is interesting from the academic perspective. We present a new dataset geared for the main characteristics of the scenario, and utilize it to investigate the name-based TC approach, which uses the category names as its only input and does not require training data. We evaluate and analyze the performance of state-of-the-art methods for this dataset to identify the shortcomings of these methods for our scenario, and suggest ways for overcoming these shortcomings. We utilize statistical correlation measured over a target corpus for improving the state-of-the-art, and offer a different classification scheme based on the characteristics of the setting. We evaluate our improvements and adaptations and show superior performance of our suggested method.

Original language: English
Pages (from-to): 227-261
Number of pages: 35
Journal: Language Resources and Evaluation
Volume: 49
Issue number: 2
State: Published - 1 Jun 2015

Bibliographical note

Publisher Copyright:
© 2015, Springer Science+Business Media Dordrecht.

Funding

This work was supported by the Next Generation Video (NeGeV) Project. We would like to thank our industrial partners, Comverse Technology Inc. and Orca Interactive Ltd. We thank Libby Barak for helping us in replicating the results of Barak et al. (). We thank Naomi Zeichner for preparing the taxonomy and annotating the dataset. Finally, we thank the anonymous reviewers for their useful comments and suggestions.

Funders
Next Generation Video

Keywords

• Name-based Text Categorization
• Natural language processing
• Semantic similarity
