Gender clustering of blog posts using distinguishable features

Yaakov HaCohen-Kerner, Yarden Tzach, Ori Asis

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

1 Scopus citation

Abstract

The aim of this research is to find out how to effectively cluster unlabeled personal blog posts written in English by gender. Given a gender-labeled blog corpus and a blog corpus that is not gender-labeled, we extracted from the labeled corpus distinguishable unigrams for both males and females. Then, we defined two general features, males' frequency and females' frequency, that represent the relative frequencies of the distinguishable males' unigrams and females' unigrams, respectively. The best distinguishable feature was found to be the males' frequency feature with a ratio factor at least 1.4 times that of females. This feature leads to an accuracy rate of 83.7% for gender clustering of the unlabeled blog corpus. To the best of our knowledge, this study presents two novelties: (1) it is the first study to cluster blog posts by gender, and (2) it clusters an unlabeled corpus using distinguishable features that were extracted from a labeled corpus.
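The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the whitespace tokenizer, the helper names, the handling of unigrams missing from the other corpus, and the final decision rule (assign a post to whichever frequency feature is larger) are all assumptions made for illustration. Only the ratio factor of 1.4 and the two frequency features come from the abstract.

```python
from collections import Counter

RATIO = 1.4  # ratio factor reported in the abstract


def tokenize(text):
    # Assumption: lowercase whitespace split stands in for real tokenization.
    return text.lower().split()


def relative_freqs(posts):
    # Relative frequency of each unigram over all tokens in a corpus.
    counts = Counter(tok for post in posts for tok in tokenize(post))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def distinguishable_unigrams(own_posts, other_posts, ratio=RATIO):
    # Keep unigrams whose relative frequency in one corpus is at least
    # `ratio` times their relative frequency in the other corpus.
    # Assumption: a unigram absent from the other corpus counts as frequency 0.
    own = relative_freqs(own_posts)
    other = relative_freqs(other_posts)
    return {w for w, f in own.items() if f >= ratio * other.get(w, 0.0)}


def frequency_feature(post, vocab):
    # Share of a post's tokens that belong to a distinguishable vocabulary.
    tokens = tokenize(post)
    if not tokens:
        return 0.0
    return sum(t in vocab for t in tokens) / len(tokens)


def cluster(post, male_vocab, female_vocab):
    # Assumption: a simple comparison of the two features; the abstract
    # does not state the actual clustering rule.
    m = frequency_feature(post, male_vocab)
    f = frequency_feature(post, female_vocab)
    return "male" if m >= f else "female"


# Toy labeled corpus (invented data, for illustration only).
male_posts = ["football beer football game", "game beer car"]
female_posts = ["shopping love love dress", "dress love car"]

male_vocab = distinguishable_unigrams(male_posts, female_posts)
female_vocab = distinguishable_unigrams(female_posts, male_posts)
```

With the toy corpus, the word "car" is equally frequent in both labeled sets, so it fails the 1.4 ratio test and lands in neither vocabulary; unlabeled posts are then assigned by calling `cluster(post, male_vocab, female_vocab)`.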

Original language: English
Title of host publication: KDIR 2016 - 8th International Conference on Knowledge Discovery and Information Retrieval
Editors: Ana Fred, Jan Dietz, David Aveiro, Kecheng Liu, Jorge Bernardino, Joaquim Filipe
Publisher: SciTePress
Pages: 384-391
Number of pages: 8
ISBN (Electronic): 9789897582035
State: Published - 2016
Externally published: Yes
Event: 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2016 - Porto, Portugal
Duration: 9 Nov 2016 - 11 Nov 2016

Publication series

Name: IC3K 2016 - Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management
Volume: 1

Conference

Conference: 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2016
Country/Territory: Portugal
City: Porto
Period: 9/11/16 - 11/11/16

Bibliographical note

Funding Information:
The authors would like to acknowledge the "laser and additive manufacturing unit and Advanced Melting Unit" at Central Metallurgical Research and Development Institute, for supporting this work. They also would like to thank Prof. Khalid Abdelhany for his valuable assistance in this work.

Keywords

  • Blog Posts
  • Distinguishable Features
  • Gender Clustering
