Abstract
In this paper, we, the JCT team, describe our submissions to the HASOC 2023 track. We participated in Task 4, which addresses hate speech and offensive language identification in three languages: Bengali, Bodo, and Assamese. We developed models using five classical supervised machine learning methods: multinomial Naive Bayes (MNB), a support vector classifier, random forest, logistic regression (LR), and a multi-layer perceptron. Our models were applied to word unigram and/or character n-gram features. In addition, we applied two versions of relevant deep learning models. Our best model for Assamese is an MNB model with 5-gram features, which achieves a macro-averaged F1-score of 0.6988. Our best model for Bengali is an MNB model with 6-gram features, which achieves a macro-averaged F1-score of 0.66497. Our best submission for Bodo is an LR model trained on all word unigrams in the training set; it obtained a macro-averaged F1-score of 0.85074 and was ranked joint 2nd-3rd out of 20 teams, only 0.00576 below the result of the 1st-place team. Our GitHub repository is avigailst/co2023 (github.com).
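The kind of pipeline the abstract describes — a multinomial Naive Bayes classifier over character n-gram counts — can be sketched in scikit-learn as follows. This is an illustrative reconstruction, not the authors' code: the toy texts, labels, and label names (`HOF`/`NOT`) are assumptions for demonstration only.

```python
# Illustrative sketch of an MNB classifier over character 5-gram features,
# the kind of classical pipeline the abstract describes for Assamese.
# Texts and labels below are toy placeholders, not the shared-task data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["you are wonderful", "i hate you", "have a nice day", "you are awful"]
labels = ["NOT", "HOF", "NOT", "HOF"]  # HOF = hate/offensive, NOT = neutral

model = make_pipeline(
    # analyzer="char" extracts overlapping character n-grams;
    # ngram_range=(5, 5) keeps only 5-grams, as in the best Assamese model.
    CountVectorizer(analyzer="char", ngram_range=(5, 5)),
    MultinomialNB(),
)
model.fit(texts, labels)

pred = model.predict(["hate hate hate"])
print(pred[0])  # the char 5-grams "hate " / " hate" only occur in HOF texts
```

Swapping the vectorizer for `CountVectorizer(analyzer="word", ngram_range=(1, 1))` and the classifier for `LogisticRegression()` gives the word-unigram LR variant used for Bodo.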
Original language | English |
---|---|
Pages (from-to) | 418-426 |
Number of pages | 9 |
Journal | CEUR Workshop Proceedings |
Volume | 3681 |
State | Published - 2023 |
Externally published | Yes |
Event | 15th Forum for Information Retrieval Evaluation, FIRE 2023 - Goa, India Duration: 15 Dec 2023 → 18 Dec 2023 |
Bibliographical note
Publisher Copyright: © 2023 Copyright for this paper by its authors.
Keywords
- char n-grams
- hate speech
- offensive language
- supervised machine learning
- word unigrams