Authorship attribution of responsa using clustering

Yaakov Hacohen-Kerner, Orr Margaliot

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Authorship attribution of text documents is a "hot" research domain; however, almost all of its applications use supervised machine learning (ML) methods. In this research, we explore authorship attribution as a clustering problem; that is, we attempt to perform authorship attribution using unsupervised ML methods. The application domain is responsa, which are answers written by well-known Jewish rabbis in response to various Jewish religious questions. We have built a corpus of 6,079 responsa, composed by five authors who lived mainly in the 20th century and containing almost 10 M words. Clustering tasks were performed for two, three, four, or five authors. Clustering was performed using three kinds of word lists: most frequent words (FW), including function words (stopwords); most frequent filtered words (FFW), excluding function words; and words with the highest variance values (HVW). Two unsupervised ML methods were applied: K-means and Expectation Maximization (EM). The best clustering tasks for two, three, or four authors achieved results above 98%, and the improvement rates were above 40% in comparison to the "majority" (baseline) results. The EM method was found to be superior to K-means for the discussed tasks. FW was found to be the best word list, far superior to FFW. FW, in contrast to FFW, includes function words, which are usually regarded as words that have little lexical meaning. This might imply that normalized frequencies of function words can serve as good indicators for authorship attribution using unsupervised ML methods. This finding is consistent with previous findings on the usefulness of function words for other tasks, such as authorship attribution using supervised ML methods, and genre and sentiment classification.
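
As an illustration of the three kinds of word lists named in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes whitespace tokenization, a tiny hypothetical English stopword set (`STOPWORDS`), per-document relative frequencies as the "normalized frequencies", and population variance of those frequencies as the HVW ranking criterion. The function names `build_word_lists` and `vectorize` are invented for this example. The resulting vectors are the kind of input a K-means or EM clusterer would take; the clustering step itself is not shown.

```python
from collections import Counter
from statistics import pvariance

# Hypothetical stopword set for illustration only; the paper's
# function-word list is not given in the abstract.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def normalized_freqs(tokens):
    """Relative frequency of each word within a single document."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty documents
    return {word: count / total for word, count in counts.items()}


def build_word_lists(docs, n):
    """Return (FW, FFW, HVW) word lists with at most n words each.

    FW:  the n most frequent words in the corpus, stopwords included.
    FFW: the n most frequent words after removing stopwords.
    HVW: the n words whose per-document relative frequency has the
         highest variance (assumed criterion).
    """
    tokenized = [tokenize(doc) for doc in docs]
    corpus_counts = Counter(word for tokens in tokenized for word in tokens)

    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(corpus_counts, key=lambda w: (-corpus_counts[w], w))
    fw = ranked[:n]
    ffw = [w for w in ranked if w not in STOPWORDS][:n]

    per_doc = [normalized_freqs(tokens) for tokens in tokenized]
    variance = {
        w: pvariance([freqs.get(w, 0.0) for freqs in per_doc])
        for w in corpus_counts
    }
    hvw = sorted(variance, key=lambda w: (-variance[w], w))[:n]
    return fw, ffw, hvw


def vectorize(doc, word_list):
    """Map a document to its normalized frequencies over a word list."""
    freqs = normalized_freqs(tokenize(doc))
    return [freqs.get(w, 0.0) for w in word_list]


if __name__ == "__main__":
    docs = [
        "the law of the land",
        "the rabbi answered the question",
        "law and custom",
    ]
    fw, ffw, hvw = build_word_lists(docs, n=2)
    print("FW:", fw)
    print("FFW:", ffw)
    print("HVW:", hvw)
    print("doc 0 over FW:", vectorize(docs[0], fw))
```

Each document becomes one fixed-length vector per word list, so the same clustering method can be run on FW, FFW, and HVW features and the results compared.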

Original language: English
Pages (from-to): 530-545
Number of pages: 16
Journal: Cybernetics and Systems
Volume: 45
Issue number: 6
State: Published - 18 Aug 2014
Externally published: Yes

Keywords

  • Hebrew
  • authorship attribution
  • responsa
  • text clustering
  • unsupervised machine learning methods
  • word lists
