TY - CONF
T1 - Classifying documents' authors to their ethnic group using stems
AU - Hacohen-Kerner, Yaakov
AU - Beck, Hananya
AU - Boger, Zvi
AU - Yehudai, Elchai
PY - 2007
Y1 - 2007
N2 - Semitic language processing is of great interest today, yet Hebrew and Aramaic have been studied relatively little. In this study, we investigate how to classify Jewish Law articles written in these languages according to the ethnic group of their authors. The motivation is to investigate the cultural differences in writing between Ashkenazi and Sephardi authors. Two artificial neural networks (ANNs) were built for this task. The first ANN uses word stems, excluding the most frequent (>95%) and the least frequent (<5%). The second ANN uses all stems except those appearing only once. These ANNs achieve correct classification rates of 85% and 89.7%, respectively. These results are reasonable but not excellent. Two possible explanations are that the accuracy of the stemming program we use needs to be improved, and that some Sephardi and Ashkenazi rabbis were active in modern Israel, so their articles were influenced by the prevalent non-ethnic Hebrew speech. Future research directions include conducting more experiments with other advanced machine learning (ML) methods and checking whether stem-based classification can also be used for other ethnic classification tasks, e.g., various Muslim sects that use Arabic.
KW - Artificial neural network
KW - Ethnic group
KW - Stems
KW - Text classification
UR - http://www.scopus.com/inward/record.url?scp=84883279683&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84883279683
SN - 9781604239867
T3 - 20th International Conference on Computer Applications in Industry and Engineering 2007, CAINE 2007
SP - 5
EP - 11
BT - 20th International Conference on Computer Applications in Industry and Engineering 2007, CAINE 2007
T2 - 20th International Conference on Computer Applications in Industry and Engineering 2007, CAINE 2007
Y2 - 7 November 2007 through 9 November 2007
ER -