TY - JOUR
T1 - Cuisine
T2 - Classification using stylistic feature sets and/or name-based feature sets
AU - HaCohen-Kerner, Yaakov
AU - Beck, Hananya
AU - Yehudai, Elchai
AU - Rosenstein, Mordechay
AU - Mughaz, Dror
PY - 2010/8
Y1 - 2010/8
N2 - Document classification presents challenges due to the large number of features, their dependencies, and the large number of training documents. In this research, we investigated the use of six stylistic feature sets (including 42 features) and/or six name-based feature sets (including 234 features) for various combinations of the following classification tasks: ethnic groups of the authors and/or periods of time when the documents were written and/or places where the documents were written. The investigated corpus contains Jewish Law articles written in Hebrew-Aramaic, which present interesting problems for classification. Our system CUISINE (Classification Using Stylistic feature sets and/or NamE-based feature sets) achieves accuracy results between 90.71 to 98.99% for the seven classification experiments (ethnicity, time, place, ethnicity&time, ethnicity&place, time&place, ethnicity&time& place). For the first six tasks, the stylistic feature sets in general and the quantitative feature set in particular are enough for excellent classification results. In contrast, the name-based feature sets are rather poor for these tasks. However, for the most complex task (ethnicity&time&place), a hill-climbing model using all feature sets succeeds In significantly improving the classification results. Most of the stylistic features (34 of 42) are language-independent and domain-independent. These features might be useful to the community at large, at least for rather simple tasks.
AB - Document classification presents challenges due to the large number of features, their dependencies, and the large number of training documents. In this research, we investigated the use of six stylistic feature sets (including 42 features) and/or six name-based feature sets (including 234 features) for various combinations of the following classification tasks: ethnic groups of the authors and/or periods of time when the documents were written and/or places where the documents were written. The investigated corpus contains Jewish Law articles written in Hebrew-Aramaic, which present interesting problems for classification. Our system CUISINE (Classification Using Stylistic feature sets and/or NamE-based feature sets) achieves accuracy results between 90.71 to 98.99% for the seven classification experiments (ethnicity, time, place, ethnicity&time, ethnicity&place, time&place, ethnicity&time& place). For the first six tasks, the stylistic feature sets in general and the quantitative feature set in particular are enough for excellent classification results. In contrast, the name-based feature sets are rather poor for these tasks. However, for the most complex task (ethnicity&time&place), a hill-climbing model using all feature sets succeeds In significantly improving the classification results. Most of the stylistic features (34 of 42) are language-independent and domain-independent. These features might be useful to the community at large, at least for rather simple tasks.
UR - http://www.scopus.com/inward/record.url?scp=77954913728&partnerID=8YFLogxK
U2 - 10.1002/asi.21350
DO - 10.1002/asi.21350
M3 - ???researchoutput.researchoutputtypes.contributiontojournal.article???
AN - SCOPUS:77954913728
SN - 1532-2882
VL - 61
SP - 1644
EP - 1657
JO - Journal of the American Society for Information Science and Technology
JF - Journal of the American Society for Information Science and Technology
IS - 8
ER -