Abstract
We present our system for the CLIN29 shared task on cross-genre gender detection for Dutch. We experimented with a multitude of neural models (CNN, RNN, LSTM, etc.), more “traditional” models (SVM, RF, LogReg, etc.), different feature sets, and data pre-processing steps. The final results suggested that tokenized, non-lowercased data works best for most of the neural models, while a combination of word clusters, character trigrams and word lists proved most beneficial for the majority of the more “traditional” (that is, non-neural) models, beating features used in previous tasks such as n-grams, character n-grams, part-of-speech tags and combinations thereof. In contrast to the results reported in previous comparable shared tasks, our neural models performed better than our best traditional approaches with our best feature set-up. Our final model was a weighted ensemble of the top 25 models. It won both the in-domain gender prediction task and the cross-genre challenge, achieving an average accuracy of 64.93% on in-domain gender prediction and 56.26% on cross-genre gender prediction.
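The abstract does not say how the top 25 models were weighted or combined. As a rough illustration only, a weighted ensemble could average each model's class probabilities in proportion to a per-model weight (for example, its validation accuracy). The function name `weighted_ensemble`, the label names `"F"` and `"M"`, and the soft-voting scheme itself are assumptions made for this sketch, not details taken from the paper.

```python
from typing import List, Sequence, Tuple


def weighted_ensemble(
    probabilities: Sequence[Sequence[float]],
    weights: Sequence[float],
    labels: Sequence[str] = ("F", "M"),  # hypothetical label names
) -> Tuple[str, List[float]]:
    """Weighted soft voting over per-model class probabilities.

    probabilities[k][i] is model k's probability for labels[i].
    weights[k] is the (assumed) weight of model k, e.g. its dev accuracy.
    """
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Weighted average of each class probability across all models.
    combined = [
        sum(w * p[i] for p, w in zip(probabilities, weights)) / total
        for i in range(len(labels))
    ]
    # Predict the label with the highest combined probability.
    best = max(range(len(labels)), key=lambda i: combined[i])
    return labels[best], combined


if __name__ == "__main__":
    # Three hypothetical models scoring one document.
    model_probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
    label, scores = weighted_ensemble(model_probs, weights=[3.0, 1.0, 1.0])
    print(label, scores)
```

Averaging probabilities rather than counting hard votes lets a confident, highly weighted model outweigh several uncertain ones; a hard-voting variant would instead sum the weights of the models that predict each label.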
Original language | English |
---|---|
Pages (from-to) | 53-61 |
Number of pages | 9 |
Journal | CEUR Workshop Proceedings |
Volume | 2453 |
State | Published - 2019 |
Event | 2019 Shared Task on Cross-Genre Gender Prediction in Dutch at CLIN29, GxG-CLIN29 2019 - Groningen, Netherlands Duration: 31 Jan 2019 → … |
Bibliographical note
Publisher Copyright: © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Funding
This work has been supported by the Dublin City University Faculty of Engineering & Computing under the Daniel O’Hare Research Scholarship scheme, by the ADAPT Centre for Digital Content Technology, funded under the SFI Research Centres Programme (Grant 13/RC/2106), and by Theo Hoffenberg, founder & CEO of Reverso. We would also like to thank the organizers of the shared task.
Funders | Funder number |
---|---|
ADAPT Centre for Digital Content Technology | |
Science Foundation Ireland | 13/RC/2106 |
Dublin City University | |