Cross-domain authorship attribution: Author identification using char sequences, word Uni-grams, and POS-tags features: Notebook for PAN at CLEF 2018

Yaakov HaCohen-Kerner, Daniel Miller, Yair Yigal, Elyashiv Shayovitz

Research output: Contribution to journal › Conference article › peer-review

3 Scopus citations

Abstract

Authorship Attribution deals with identifying the author of an anonymous text, i.e., attributing each test text of unknown authorship to one of a set of known authors whose training texts are given. In this paper, we describe the participation of our teams (miller18 and yigal18; both teams consist of the same people, listed in a different order) in the PAN 2018 shared task on cross-domain Author Identification. Given a set of documents authored by known authors, the task is to identify the authors of documents from another set. All documents in a problem are in the same language, which may be one of the following five: English, French, Italian, Polish, or Spanish. We describe our pre-processing, feature sets, the applied machine learning methods, and the average F1 scores of the three submitted models. For the evaluation corpus, we submitted the top three models according to their results on the development corpus using PCA and Linear SVC. The first model scored an average of 0.582. Its features consist of the frequencies of all char 6-gram sequences, POS-tag sequence frequencies, orthographic features, quantitative features, and lexical richness features. The second model scored an average of 0.598. Its features consist of all char sequences of length 3 to 8, all word Uni-grams, POS-tags features, and all stylistic features from the first model. The third model scored an average of 0.611. Its features consist of the content-based features mentioned in the second model and POS-tags features.
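The content-based features named in the abstract (char sequences of length 3 to 8 and word Uni-grams) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the whitespace tokenization, the lowercasing, and the choice to normalize counts by the total number of extracted n-grams are all assumptions made for illustration, and the abstract does not specify any of them.

```python
from collections import Counter


def char_ngram_frequencies(text, n_min=3, n_max=8):
    """Relative frequencies of all character n-grams with n_min <= n <= n_max.

    Hypothetical helper: counts are normalized by the total number of
    n-grams extracted across all lengths (an assumed normalization).
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.items()}


def word_unigram_frequencies(text):
    """Relative frequencies of lowercased, whitespace-separated tokens.

    Hypothetical helper: real systems typically use a proper tokenizer.
    """
    words = text.lower().split()
    if not words:
        return {}
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}


def content_features(text):
    """Merge both feature families into one dict with prefixed keys."""
    features = {}
    for gram, freq in char_ngram_frequencies(text).items():
        features["char:" + gram] = freq
    for word, freq in word_unigram_frequencies(text).items():
        features["word:" + word] = freq
    return features
```

Feature dictionaries of this kind would typically be vectorized and then passed to a dimensionality-reduction step and a linear classifier, such as the PCA and Linear SVC mentioned in the abstract.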

Original language: English
Journal: CEUR Workshop Proceedings
Volume: 2125
State: Published - 2018
Externally published: Yes
Event: 19th Working Notes of CLEF Conference and Labs of the Evaluation Forum, CLEF 2018 - Avignon, France
Duration: 10 Sep 2018 to 14 Sep 2018

Bibliographical note

Funding Information:
Acknowledgments. This work was partially funded by the Jerusalem College of Technology (Lev Academic Center) and we gratefully acknowledge its support.

Keywords

  • Author Identification
  • Authorship Attribution
  • Content-based Features
  • Style-based Features
  • Supervised Machine Learning
  • Text Classification
