TY - JOUR
T1 - Automatic machine learning of keyphrase extraction from short HTML documents written in Hebrew
AU - Hacohen-Kerner, Yaakov
AU - Stern, Ittay
AU - Korkus, David
AU - Fredj, Erick
PY - 2007/1
Y1 - 2007/1
N2 - Keyphrases extracted from documents can save precious time in tasks such as filtering, summarization, and categorization. A few keyphrase extraction systems are available for documents written in English. In this paper, we propose a model called LEH_KEY (Learning to Extract Hebrew KEYphrases) that, for the first time, learns to extract keyphrases from documents written in Hebrew. We introduce a relatively high number (15) of baseline extraction methods, whereas related systems use combinations of only two or three baseline extraction methods. We have investigated various combinations of this larger number of baseline methods and have tested various machine learning methods. The best results were achieved by a combination of six baseline methods using J48 (an improved variant of C4.5). Our results were found to be at least equal in quality to those achieved by extraction systems for documents written in English that are regarded as state of the art.
AB - Keyphrases extracted from documents can save precious time in tasks such as filtering, summarization, and categorization. A few keyphrase extraction systems are available for documents written in English. In this paper, we propose a model called LEH_KEY (Learning to Extract Hebrew KEYphrases) that, for the first time, learns to extract keyphrases from documents written in Hebrew. We introduce a relatively high number (15) of baseline extraction methods, whereas related systems use combinations of only two or three baseline extraction methods. We have investigated various combinations of this larger number of baseline methods and have tested various machine learning methods. The best results were achieved by a combination of six baseline methods using J48 (an improved variant of C4.5). Our results were found to be at least equal in quality to those achieved by extraction systems for documents written in English that are regarded as state of the art.
UR - http://www.scopus.com/inward/record.url?scp=33847135192&partnerID=8YFLogxK
U2 - 10.1080/01969720600998546
DO - 10.1080/01969720600998546
M3 - Article
AN - SCOPUS:33847135192
SN - 0196-9722
VL - 38
SP - 1
EP - 21
JO - Cybernetics and Systems
JF - Cybernetics and Systems
IS - 1
ER -