Classifying documents' authors to their ethnic group using stems

Yaakov Hacohen-Kerner, Hananya Beck, Zvi Boger, Elchai Yehudai

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

2 Scopus citations

Abstract

Semitic language processing in general is of great interest today. However, the Hebrew and Aramaic languages have been relatively little studied. In this study, we investigate how to classify Jewish Law articles written in these languages according to the ethnic group of their authors. The motivation is to investigate the cultural differences in writing between Ashkenazi authors and Sephardi authors. Two artificial neural networks (ANNs) have been built to implement this task. The first ANN uses stems of words, excluding the most frequent (>95%) and the least frequent (<5%). The second ANN uses all stems except those appearing only once. These ANNs achieve correct classification rates of 85% and 89.7%, respectively. These results are reasonable but not excellent. Possible explanations for these findings are that the accuracy of the stemming program we use needs to be improved, and that some Sephardi and Ashkenazi rabbis were active in modern Israel, so their articles were influenced by the prevalent non-ethnic Hebrew speech. Future directions for research include conducting more experiments using other advanced ML methods and checking whether stem-based classification can also be used for other ethnic classification tasks, e.g., distinguishing among various Muslim sects that use Arabic.
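The two stem-selection rules described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the 95% and 5% thresholds are measured, so the code below assumes they refer to document frequency (the share of documents in which a stem occurs). The function names and the toy corpus are hypothetical, and the stems are assumed to have already been produced by some stemming program.

```python
from collections import Counter

def select_stems_by_doc_frequency(docs, low=0.05, high=0.95):
    """Keep stems whose document frequency lies within [low, high].

    Assumption: the abstract's ">95%" and "<5%" are read here as
    document-frequency thresholds; the paper may define them differently.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for stems in docs:
        # Count each stem at most once per document.
        doc_freq.update(set(stems))
    return {s for s, c in doc_freq.items() if low <= c / n_docs <= high}

def select_stems_not_hapax(docs):
    """Keep stems that occur more than once in the whole corpus."""
    total = Counter(s for stems in docs for s in stems)
    return {s for s, c in total.items() if c > 1}

# Hypothetical toy corpus: each document is a list of stems.
toy_docs = [["a", "b"], ["a", "c"], ["a", "b", "d"]]
first_feature_set = select_stems_by_doc_frequency(toy_docs)
second_feature_set = select_stems_not_hapax(toy_docs)
```

In this toy corpus the stem "a" occurs in every document and is dropped by the first rule, while the stems "c" and "d" occur only once overall and are dropped by the second rule. Either resulting set could then serve as the input vocabulary for a classifier such as an ANN.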

Original language: English
Title of host publication: 20th International Conference on Computer Applications in Industry and Engineering 2007, CAINE 2007
Pages: 5-11
Number of pages: 7
State: Published - 2007
Externally published: Yes
Event: 20th International Conference on Computer Applications in Industry and Engineering 2007, CAINE 2007 - San Francisco, CA, United States
Duration: 7 Nov 2007 to 9 Nov 2007

Publication series

Name: 20th International Conference on Computer Applications in Industry and Engineering 2007, CAINE 2007

Keywords

  • Artificial neural network
  • Ethnic group
  • Stems
  • Text classification
