Survey on profiling age and gender of text authors

  • Yaakov HaCohen-Kerner

Research output: Contribution to journal › Review article › peer-review

27 Scopus citations

Abstract

Author profiling from text documents has become a popular task in natural language applications in recent years. Author profiling is important for various domains such as advertising, marketing, forensics, and security. This survey focuses on profiling age and gender, which are probably the two most researched profile attributes. In this paper, we present an overview of representative studies and datasets of the field (including those organized by PAN), along with several significant leaps. Due to the increasing use of deep learning (DL) methods in recent years, we have also reviewed several DL systems that profile authors’ age and gender. Most age and gender datasets contain blog posts or Twitter messages written in English, Spanish, or Arabic. There are also several relevant datasets written in Dutch, Italian, Portuguese, Turkish, and Russian. The datasets are neither consistent nor uniform in the number and types of their documents, the division into training, dev, and test sets, the types of preprocessing methods applied, and the quality measures used to evaluate the classification results. A prominent finding is that the best age accuracy results are not as high as we might have expected, considering the relatively simple types of classification, especially by gender (only 2 categories), and the large number of teams that have competed over the years. Another interesting finding, which recurs in various classification tasks, is that classical ML methods are still better than DL methods for age and gender classification. Most classical systems used word unigrams and bigrams and character 3-, 4-, and 5-grams. Several systems also used various types of stylistic features. While many earlier systems did not apply preprocessing methods, most recent systems applied several, e.g., lowercase conversion and replacement of various strings (e.g., URLs, LF characters, and user mentions). We also suggest several potential future issues in age and gender profiling research.
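The preprocessing steps and n-gram features mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from any surveyed system: the function names, the `<url>` and `<user>` placeholder tokens, the regular expressions, and whitespace tokenization are all assumptions made for illustration, and real systems typically feed such counts into a classical classifier.

```python
import re
from collections import Counter

# Hypothetical patterns; the surveyed systems may define these differently.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def preprocess(text: str) -> str:
    """Lowercase, replace URLs and user mentions, and remove line feeds."""
    text = text.lower()
    text = URL_RE.sub("<url>", text)       # placeholder token is an assumption
    text = MENTION_RE.sub("<user>", text)  # placeholder token is an assumption
    text = text.replace("\r\n", " ").replace("\n", " ")
    return " ".join(text.split())          # collapse repeated whitespace


def word_ngrams(text: str, n_values=(1, 2)) -> Counter:
    """Count word unigrams and bigrams using simple whitespace tokenization."""
    tokens = text.split()
    counts = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def char_ngrams(text: str, n_values=(3, 4, 5)) -> Counter:
    """Count character 3-, 4-, and 5-grams over the whole string."""
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def extract_features(text: str) -> Counter:
    """Combine word and character n-gram counts into one feature bag."""
    clean = preprocess(text)
    features = Counter()
    for gram, count in word_ngrams(clean).items():
        features["w:" + gram] = count
    for gram, count in char_ngrams(clean).items():
        features["c:" + gram] = count
    return features
```

Calling `extract_features` on a raw post returns a bag of prefixed n-gram counts, which could then be vectorized for a classical classifier; the `w:` and `c:` prefixes only keep word and character features from colliding.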

Original language: English
Article number: 117140
Journal: Expert Systems with Applications
Volume: 199
State: Published - 1 Aug 2022
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2022 Elsevier Ltd

Funding

My deepest thanks to Prof. Walter Daelemans for his wise advice during various stages of this paper. I am also grateful to Netanel Sadeh, my former student, who helped me with several papers related to author profiling using deep learning methods. The author acknowledges partial financial support from the Jerusalem College of Technology (Lev Academic Center) and the COST Action CA16204 “Distant Reading for European Literary History.”

Funders (funder number)
Jerusalem College of Technology
European Cooperation in Science and Technology (CA16204)

    Keywords

    • Age classification
    • Author profiling
    • Deep learning
    • Gender classification
    • Supervised machine learning
    • Text classification
