Abstract
Stemming is useful for various natural language processing tasks, such as document indexing and text classification, so identifying the correct root of a given word is important. For Hebrew this is not a trivial task, owing to the complexity of Hebrew morphology and orthography. Many Hebrew words are ambiguous in the sense that each can be derived from several possible roots. In a specific context, however, a given word has only one correct root, or no root at all. We have developed a variety of features for finding the correct root of an ambiguous Hebrew word. These features fall into three distinct groups: root-based features, conjugation-based features, and statistical features. Several common machine learning methods were tested in order to find a successful integration of the features. The best result was achieved by Naïve Bayes, with about 87% accuracy.
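As an illustration of how features from the three groups might be integrated by Naïve Bayes, the following is a minimal sketch, not the authors' implementation. The feature names (`root_in_lexicon`, `pattern_matches`, `root_frequency`), the placeholder roots, the toy training rows, and the 0.5 rejection threshold are all hypothetical; the abstract does not specify them. Each candidate root is scored for the probability that it is the correct one, and the best candidate is returned, or `None` when no candidate is likely enough (the "no root at all" case).

```python
import math
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Naive Bayes over discrete features with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.n_rows = len(labels)
        self.label_counts = Counter(labels)
        # value_counts[(label, feature)][value] = number of training rows
        self.value_counts = defaultdict(Counter)
        self.feature_values = defaultdict(set)
        for row, label in zip(rows, labels):
            for feature, value in row.items():
                self.value_counts[(label, feature)][value] += 1
                self.feature_values[feature].add(value)
        return self

    def log_score(self, row, label):
        # log P(label) + sum over features of log P(value | label)
        score = math.log(self.label_counts[label] / self.n_rows)
        for feature, value in row.items():
            seen = self.value_counts[(label, feature)]
            # Distinct values known for this feature (plus this one if unseen).
            n_values = len(self.feature_values[feature] | {value})
            score += math.log(
                (seen[value] + 1) / (self.label_counts[label] + n_values)
            )
        return score

    def prob_correct(self, row):
        # Normalise the two class scores into P(correct | features).
        log_true = self.log_score(row, True)
        log_false = self.log_score(row, False)
        top = max(log_true, log_false)
        exp_true = math.exp(log_true - top)
        exp_false = math.exp(log_false - top)
        return exp_true / (exp_true + exp_false)


def pick_root(model, candidates, threshold=0.5):
    """Return the most probable candidate root, or None if none is likely.

    The threshold is an assumption made for this sketch.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda root: model.prob_correct(candidates[root]))
    if model.prob_correct(candidates[best]) < threshold:
        return None
    return best


# Hypothetical toy data: one row per (word, candidate root) pair.
# root_in_lexicon ~ root-based, pattern_matches ~ conjugation-based,
# root_frequency ~ statistical. The label says whether the candidate was correct.
TRAIN_ROWS = [
    {"root_in_lexicon": True, "pattern_matches": True, "root_frequency": "high"},
    {"root_in_lexicon": True, "pattern_matches": True, "root_frequency": "low"},
    {"root_in_lexicon": True, "pattern_matches": False, "root_frequency": "high"},
    {"root_in_lexicon": False, "pattern_matches": True, "root_frequency": "low"},
    {"root_in_lexicon": False, "pattern_matches": False, "root_frequency": "low"},
    {"root_in_lexicon": True, "pattern_matches": False, "root_frequency": "low"},
]
TRAIN_LABELS = [True, True, True, False, False, False]

# Two placeholder candidate roots for one ambiguous word.
CANDIDATES = {
    "root_a": {"root_in_lexicon": True, "pattern_matches": True, "root_frequency": "high"},
    "root_b": {"root_in_lexicon": False, "pattern_matches": False, "root_frequency": "low"},
}

model = CategoricalNaiveBayes().fit(TRAIN_ROWS, TRAIN_LABELS)
print(pick_root(model, CANDIDATES))
```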
Original language | English |
---|---|
Pages (from-to) | 36-53 |
Number of pages | 18 |
Journal | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
Volume | 8003 |
State | Published - 2014 |
Externally published | Yes |
Bibliographical note
Publisher Copyright: © Springer-Verlag Berlin Heidelberg 2014.
Keywords
- Disambiguation
- Hebrew-Aramaic documents
- Machine learning methods
- Natural language processing
- Stemming