Detecting Offensive Language in English, Hindi, and Marathi using Classical Supervised Machine Learning Methods and Word/Char N-grams

Yaakov HaCohen-Kerner, Moshe Uzan

Research output: Contribution to journal › Conference article › peer-review

1 Scopus citation

Abstract

In this paper, we describe our submissions to the HASOC 2021 contest. We tackled subtask 1A, which addresses hate speech and offensive language identification in three languages: English, Hindi, and Marathi. We developed models using six classical supervised machine learning methods: support vector classifier, binary support vector classifier, random forest, AdaBoost classifier, multi-layer perceptron, and logistic regression. Our ML models were applied to various combinations of character and/or word n-gram features, from unigrams to 8-grams. Our best submission was a random forest model for offensive language identification in Marathi, which ranked 6th out of 25 teams. Its result is only 0.0059 lower than that of the team ranked 3rd.
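
As a rough illustration of the character and word n-gram features mentioned in the abstract, the following minimal sketch counts all n-grams from length 1 to 8 in a text. The abstract does not say how the features were extracted or weighted, so the function names (`char_ngrams`, `word_ngrams`, `combined_features`), the whitespace tokenization, and the use of raw counts are assumptions made for this example only, not the authors' implementation.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=8):
    """Count every character n-gram of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of n characters across the text.
        for i in range(len(text) - n + 1):
            counts[("char", text[i:i + n])] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=8):
    """Count every word n-gram of length n_min..n_max in text.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    tokens = text.split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the token list.
        for i in range(len(tokens) - n + 1):
            counts[("word", " ".join(tokens[i:i + n]))] += 1
    return counts


def combined_features(text, char_range=(1, 8), word_range=(1, 8)):
    """Merge character and word n-gram counts into one feature bag.

    Keys are tagged with "char" or "word" so the two kinds never collide.
    """
    features = Counter()
    features.update(char_ngrams(text, *char_range))
    features.update(word_ngrams(text, *word_range))
    return features


if __name__ == "__main__":
    feats = combined_features("you are so rude", char_range=(1, 3), word_range=(1, 2))
    print(feats.most_common(5))
```

A feature bag like this could then be turned into a vector and passed to any of the classifiers listed in the abstract. Narrowing `char_range` or `word_range` gives the kind of n-gram combinations the abstract refers to.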

Original language: English
Pages (from-to): 501-507
Number of pages: 7
Journal: CEUR Workshop Proceedings
Volume: 3159
State: Published - 2021
Event: Working Notes of FIRE - 13th Forum for Information Retrieval Evaluation, FIRE-WN 2021 - Gandhinagar, India
Duration: 13 Dec 2021 to 17 Dec 2021

Bibliographical note

Publisher Copyright:
© 2021 Copyright for this paper by the Forum for Information Retrieval Evaluation, December 13-17, 2021, India.

Keywords

  • hate speech
  • offensive language
  • supervised machine learning
  • word/char n-grams
