Various document clustering tasks using word lists

Yaakov HaCohen-Kerner, Orr Margaliot

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

2 Scopus citations

Abstract

This research investigates whether word lists are appropriate features for clustering documents by author, by country of origin, or by the historical period in which they were written. We define three kinds of word lists: most frequent words (FW), including function words (stopwords); most frequent filtered words (FFW), excluding function words; and words with the highest variance values (VFW). The application domain is articles on Jewish law written in Hebrew and Aramaic. The clustering experiments were carried out using the EM (expectation-maximization) algorithm. To the best of our knowledge, clustering documents by country or by period is novel. The improvement rates in these tasks vary from 11.53% to 39.43%. The clustering tasks with 2 or 3 authors achieved results above 95% and show superior improvement rates (between 15.61% and 56.51%); most of the improvements were achieved with FW and VFW. These findings are surprising and contradict the initial assumption that FFW is the prime word list for clustering tasks.
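The three word lists named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_word_lists` and `to_vector`, the list size `k`, the alphabetical tie-breaking, and the reading of "variance" as the variance of a word's relative frequency across documents are all assumptions made for illustration. Documents are assumed to be non-empty lists of tokens.

```python
from collections import Counter
from statistics import pvariance


def build_word_lists(docs, stopwords, k):
    """Return (FW, FFW, VFW), each a list of at most k words.

    docs: list of non-empty token lists (hypothetical input format).
    stopwords: set of function words to exclude from FFW.
    """
    totals = Counter(tok for doc in docs for tok in doc)
    # Rank by descending corpus frequency; ties are broken alphabetically.
    ranked = sorted(totals, key=lambda w: (-totals[w], w))
    fw = ranked[:k]                                      # function words kept
    ffw = [w for w in ranked if w not in stopwords][:k]  # function words removed

    # Assumption: variance is taken over per-document relative frequencies.
    counts = [Counter(doc) for doc in docs]

    def variance(word):
        return pvariance([c[word] / len(d) for c, d in zip(counts, docs)])

    vfw = sorted(totals, key=lambda w: (-variance(w), w))[:k]
    return fw, ffw, vfw


def to_vector(doc, word_list):
    """Represent one document as relative frequencies of the listed words."""
    counts = Counter(doc)
    return [counts[w] / len(doc) for w in word_list]
```

Each document's vector could then be passed to an EM-based clusterer, such as a Gaussian mixture model, with the number of clusters set to the number of authors, countries, or periods.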

Original language: English
Title of host publication: Information Retrieval Technology - 9th Asia Information Retrieval Societies Conference, AIRS 2013, Proceedings
Pages: 156-169
Number of pages: 14
State: Published - 2013
Externally published: Yes
Event: 9th Asia Information Retrieval Societies Conference on Information Retrieval Technology, AIRS 2013 - Singapore, Singapore
Duration: 9 Dec 2013 → 11 Dec 2013

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8281 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 9th Asia Information Retrieval Societies Conference on Information Retrieval Technology, AIRS 2013
Country/Territory: Singapore
City: Singapore
Period: 9/12/13 → 11/12/13

Keywords

  • Authorship attribution
  • Composition country
  • Document clustering
  • Historical period
  • Word lists
