Detecting Offensive Language in Bengali, Bodo, and Assamese using Word Unigrams, Char N-grams, Classical Machine Learning, and Deep Learning Methods

Avigail Stekel, Avital Prives, Yaakov HaCohen-Kerner

Research output: Contribution to journal, Conference article, peer-review

1 Scopus citation

Abstract

In this paper, we, the JCT team, describe our submissions for the HASOC 2023 track. We participated in task 4, which addresses the problem of hate speech and offensive language identification in three languages: Bengali, Bodo, and Assamese. We developed different models using five classical supervised machine learning methods: multinomial Naive Bayes (MNB), support vector classifier, random forest, logistic regression (LR), and multi-layer perceptron. Our models were applied to word unigram and/or character n-gram features. In addition, we applied two versions of relevant deep learning models. Our best model for Assamese is an MNB model with 5-gram features, which achieves a macro-averaged F1-score of 0.6988. Our best model for Bengali is an MNB model with 6-gram features, which achieves a macro-averaged F1-score of 0.66497. Our best submission for Bodo is an LR model with all word unigrams in the training set. This model obtained a macro-averaged F1-score of 0.85074 and shared 2nd-3rd place out of 20 teams. Our result is lower by only 0.00576 than the result of the team ranked 1st. Our GitHub repository is avigailst/co2023 (github.com).
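The feature types and the evaluation metric named in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' code: the function names, the whitespace tokenization, and the "HOF"/"NOT" labels in the usage example are assumptions made for illustration only. It shows how character n-gram and word unigram counts might be extracted from a text, and how a macro-averaged F1-score is computed as the unweighted mean of the per-class F1-scores.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count all contiguous character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_unigrams(text):
    """Count whitespace-separated tokens (an assumed, simple tokenization)."""
    return Counter(text.split())


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical labels and predictions, for illustration only.
    gold = ["HOF", "NOT", "HOF", "NOT"]
    pred = ["HOF", "NOT", "NOT", "NOT"]
    print(char_ngrams("hello", 2))
    print(word_unigrams("a b a"))
    print(round(macro_f1(gold, pred), 4))
```

Because macro averaging weights every class equally, a model that does poorly on the minority class is penalized even when overall accuracy is high, which is why the metric suits imbalanced offensive-language data.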

Original language: English
Pages (from-to): 418-426
Number of pages: 9
Journal: CEUR Workshop Proceedings
Volume: 3681
State: Published - 2023
Externally published: Yes
Event: 15th Forum for Information Retrieval Evaluation, FIRE 2023 - Goa, India
Duration: 15 Dec 2023 - 18 Dec 2023

Bibliographical note

Publisher Copyright:
© 2023 Copyright for this paper by its authors.

Keywords

  • Char n-grams
  • hate speech
  • offensive language
  • supervised machine learning
  • word unigrams
