Topic-based Classification through Unigram Unmasking

Yaakov Hacohen-Kerner, Avi Rosenfeld, Asaf Sabag, Maor Tzidkani

Research output: Contribution to journal, Conference article, peer-review

9 Scopus citations

Abstract

Text classification (TC) is the task of automatically assigning documents to a fixed number of categories. TC is an important component in many text applications such as text indexing, information extraction, information retrieval, text mining, and word sense disambiguation. In this paper, we present an alternative method of feature reduction, a concept we call unigram unmasking. Previous text classification approaches have typically focused on a "bag-of-words" vector. We posit that some of the most frequent unigrams, which have the greatest weight within these vectors, are at times not only unnecessary for classification but can even hurt models' accuracy. We present an approach in which a percentage of common unigrams is intentionally removed, thus "unmasking" the added value of less popular unigrams. We present results from a topic-based classification task (hundreds of free online textbooks belonging to five domains: Career and Study Advice, Economics and Finance, IT Programming, Natural Sciences, and Statistics and Mathematics) and show that unmasking was helpful across several machine learning models, with some models even benefiting from removing nearly 50% of the most frequent unigrams from the bag-of-words vectors.
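
The removal step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: whitespace tokenization, ranking unigrams by total corpus count, and the names unmask_vocabulary, bag_of_words, and drop_fraction are all assumptions made for illustration, as is the toy corpus.

```python
from collections import Counter


def unmask_vocabulary(documents, drop_fraction):
    """Return the vocabulary left after dropping the most frequent unigrams.

    drop_fraction is the share of the frequency-ranked vocabulary to remove
    (0.5 would remove the top half). Ranking by total corpus count is an
    assumption; the abstract does not specify how frequency is measured.
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    # Assumed tokenization: lowercase, split on whitespace.
    counts = Counter(tok for doc in documents for tok in doc.lower().split())
    # Most frequent first; ties broken alphabetically so results are stable.
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    n_drop = int(len(ranked) * drop_fraction)
    return sorted(ranked[n_drop:])


def bag_of_words(document, vocabulary):
    """Count how often each vocabulary unigram occurs in one document."""
    counts = Counter(document.lower().split())
    return [counts[tok] for tok in vocabulary]


# Hypothetical toy corpus, one short document per topic.
docs = [
    "the market and the bank",
    "the loop and the array",
    "the proof and the theorem",
]
vocab = unmask_vocabulary(docs, drop_fraction=0.25)
vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

In this toy corpus the two most frequent unigrams ("the" and "and") occur in every document and carry no topic information, so dropping them leaves only the topic-specific words in each vector. The resulting vectors could then be passed to any supervised classifier.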

Original language: English
Pages (from-to): 69-76
Number of pages: 8
Journal: Procedia Computer Science
Volume: 126
State: Published - 2018
Externally published: Yes
Event: 22nd International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, KES 2018 - Belgrade, Serbia
Duration: 3 Sep 2018 to 5 Sep 2018

Bibliographical note

Publisher Copyright:
© 2018 The Author(s).

Keywords

  • Bag of words
  • Overfitting Features
  • Supervised machine learning
  • Text classification
  • Textual features
  • Topic-based classification
  • Unmasking
  • Word unigrams
