Abstract
Author profiling from text documents has become a popular task in natural language applications in recent years. Author profiling is important for various domains such as advertising, marketing, forensics, and security. This survey focuses on age and gender, which are probably the two most researched profile attributes. In this paper, we present an overview of representative studies and datasets in the field (including those organized by PAN), along with several significant leaps. Due to the increasing use of deep learning (DL) methods in recent years, we also review several DL systems that profile authors’ age and gender. Most age and gender datasets contain blog posts or Twitter messages written in English, Spanish, or Arabic. There are also several relevant datasets written in Dutch, Italian, Portuguese, Turkish, and Russian. The datasets lack consistency and uniformity in the number and types of their documents, the division into training, development, and test sets, the preprocessing methods applied, and the quality measures used to evaluate the classification results. A prominent finding is that the best age accuracy results are not as high as might be expected, given the relatively simple types of classification, especially by gender (only two categories), and the large number of teams that have competed over the years. Another finding, which recurs across various classification tasks, is that classical machine learning (ML) methods still outperform DL methods for age and gender classification. Most classical systems used word unigrams and bigrams and character 3-, 4-, and 5-grams. Several systems also used various types of stylistic features. While many earlier systems did not apply preprocessing, most recent systems applied several preprocessing methods, e.g., lowercase conversion and replacement of various strings (e.g., URLs, LF characters, and user mentions). We also suggest several potential future issues in age and gender profiling research.
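To make the feature set named in the abstract concrete, the following is a minimal sketch of the preprocessing steps (lowercasing; replacing URLs, user mentions, and LF characters) followed by word 1-2-gram and character 3-5-gram counting. It uses only the Python standard library. The function names, the regular expressions, and the placeholder tokens `<url>` and `<user>` are illustrative assumptions and are not taken from any surveyed system.

```python
import re
from collections import Counter

# Assumed patterns and placeholder tokens, chosen for illustration only.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def preprocess(text):
    """Lowercase the text and replace URLs, user mentions, and line feeds."""
    text = text.lower()
    text = URL_RE.sub("<url>", text)
    text = MENTION_RE.sub("<user>", text)
    return text.replace("\n", " ")


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return all contiguous character n-grams, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def extract_features(text):
    """Count word unigrams/bigrams and character 3-, 4-, and 5-grams."""
    clean = preprocess(text)
    tokens = clean.split()  # whitespace tokenization, a simplifying assumption
    feats = Counter()
    for n in (1, 2):
        feats.update(f"w{n}:{g}" for g in word_ngrams(tokens, n))
    for n in (3, 4, 5):
        feats.update(f"c{n}:{g}" for g in char_ngrams(clean, n))
    return feats
```

In a classical pipeline, counts like these would typically be weighted and passed to a linear classifier; that step is omitted here because the abstract does not specify one.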
| Original language | English |
|---|---|
| Article number | 117140 |
| Journal | Expert Systems with Applications |
| Volume | 199 |
| DOIs | |
| State | Published - 1 Aug 2022 |
| Externally published | Yes |
Bibliographical note
Publisher Copyright: © 2022 Elsevier Ltd
Funding
My deepest thanks to Prof. Walter Daelemans for his wise advice during various stages of this paper. I am also grateful to Netanel Sadeh, my former student, who helped me with several papers related to author profiling using deep learning methods. The author acknowledges partial financial support from the Jerusalem College of Technology (Lev Academic Center) and the COST Action CA16204 “Distant Reading for European Literary History.”
| Funders | Funder number |
|---|---|
| Jerusalem College of Technology | |
| European Cooperation in Science and Technology | CA16204 |
Keywords
- Age classification
- Author profiling
- Deep learning
- Gender classification
- Supervised machine learning
- Text classification
Fingerprint
Dive into the research topics of 'Survey on profiling age and gender of text authors'. Together they form a unique fingerprint.