Abstract
Author profiling deals with identifying various details about the author of a text (e.g., age and gender). In this paper, we describe the participation of our team (hacohenkerner19) in the PAN 2019 shared task on Bots and Gender Profiling in two languages: English and Spanish. Given a Twitter feed, the task is to determine whether its author is a bot or a human and, if the author is human, to identify her/his gender. We describe our preprocessing methods, feature sets, five applied machine learning methods, and accuracy results. The best accuracy for the English dataset (84.8%) was obtained by LinearSVC using 2,000 word unigrams. The same accuracy (84.8%) was also obtained by logistic regression (LR) using four preprocessing methods, 2,000 word unigrams, and 1,000 word bigrams with maximal skips of 2 words. The best accuracy for the Spanish dataset (75.54%) was achieved using LinearSVC with only the HTML tag removal preprocessing method and a combination of 1,000 word unigrams, 1,000 word bigrams, and 3,000 character trigrams.
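The feature types named in the abstract can be illustrated with a minimal sketch. It assumes the common k-skip-bigram definition (a pair of words with at most k words skipped between them) and frequency-based selection of the top features; the abstract does not spell out either detail, so both are assumptions, and the function names `skip_bigrams`, `char_trigrams`, and `top_k` are hypothetical helpers rather than the authors' code.

```python
from collections import Counter


def skip_bigrams(tokens, max_skip=2):
    """Return word pairs with at most `max_skip` words skipped between them.

    Assumption: standard k-skip-bigram definition; max_skip=0 yields
    ordinary contiguous bigrams.
    """
    pairs = []
    for i, left in enumerate(tokens):
        # j - i - 1 is the number of skipped words, allowed to be 0..max_skip.
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


def char_trigrams(text):
    """Return all overlapping character trigrams of `text`."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def top_k(features, k):
    """Keep the k most frequent features (assumed selection criterion)."""
    return [feat for feat, _ in Counter(features).most_common(k)]


tokens = "bots tweet the same links again".split()
pairs = skip_bigrams(tokens, max_skip=2)
trigrams = char_trigrams("bots")
```

In a full pipeline, the selected features (for example, the 2,000 most frequent word unigrams) would be counted per Twitter feed and passed to a classifier such as LinearSVC or logistic regression.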
| Original language | English |
|---|---|
| Journal | CEUR Workshop Proceedings |
| Volume | 2380 |
| State | Published - 2019 |
| Externally published | Yes |
| Event | 20th Working Notes of CLEF Conference and Labs of the Evaluation Forum, CLEF 2019 - Lugano, Switzerland (9 Sep 2019 → 12 Sep 2019) |
Bibliographical note
Publisher Copyright: © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.
Funding
Acknowledgments. This work was partially funded by the Jerusalem College of Technology (Lev Academic Center) and we gratefully acknowledge its support.
| Funders |
|---|
| Jerusalem College of Technology |
Keywords
- Bot Profiling
- Character N-grams
- Gender Profiling
- Supervised Machine Learning
- Word N-grams
Fingerprint
Dive into the research topics of 'Bots and gender profiling of tweets using word and character N-grams notebook for PAN at CLEF 2019'. Together they form a unique fingerprint.