Automatic machine learning of keyphrase extraction from short html documents written in Hebrew

Yaakov Hacohen-Kerner, Ittay Stern, David Korkus, Erick Fredj

Research output: Contribution to journal › Article › peer-review

20 Scopus citations

Abstract

Keyphrases extracted from documents can save valuable time in tasks such as filtering, summarization, and categorization. A few keyphrase extraction systems are available for documents written in English. In this paper, we propose a model called LEH_KEY (Learning to Extract Hebrew KEYphrases) that, for the first time, learns to extract keyphrases from documents written in Hebrew. We introduce a relatively high number (15) of baseline extraction methods, whereas related systems use combinations of only two or three baseline extraction methods. We have investigated various combinations of this larger set of baseline methods and have tested various machine learning methods. The best results were achieved by a combination of six baseline methods using J48 (an improved variant of C4.5). Our results were found to be at least equal in quality to those achieved by extraction systems for documents written in English that are regarded as state of the art.
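The idea of feeding several baseline extraction scores to a learner can be illustrated with a minimal sketch. The abstract does not list the 15 baseline methods, so the three features below (term frequency, presence in the title, relative first position) are assumptions chosen only for illustration, as are the function names and the whitespace tokenization, which is far too crude for real Hebrew text. The sketch covers only the feature-construction step; the resulting vectors are the kind of input a decision-tree learner such as J48 would be trained on.

```python
from collections import Counter


def candidate_phrases(tokens, max_len=2):
    """Yield (start_index, phrase) for every n-gram of length 1..max_len."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield i, tuple(tokens[i:i + n])


def baseline_features(title, body, max_len=2):
    """Score each candidate phrase in `body` with three hypothetical baselines.

    These baselines are illustrative assumptions, not the paper's methods.
    """
    title_tokens = title.lower().split()   # naive whitespace tokenization
    body_tokens = body.lower().split()

    counts = Counter()
    first_pos = {}
    for start, phrase in candidate_phrases(body_tokens, max_len):
        counts[phrase] += 1
        first_pos.setdefault(phrase, start)  # keep the earliest occurrence

    title_set = {p for _, p in candidate_phrases(title_tokens, max_len)}

    features = {}
    for phrase, count in counts.items():
        features[phrase] = {
            # Baseline 1: occurrences relative to body length
            "term_frequency": count / len(body_tokens),
            # Baseline 2: does the phrase also appear in the title?
            "in_title": phrase in title_set,
            # Baseline 3: relative position of the first occurrence
            "first_position": first_pos[phrase] / len(body_tokens),
        }
    return features
```

Each candidate phrase ends up with one feature vector; paired with human-assigned keyphrase labels, such vectors would form the training set for whichever learner is being evaluated.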

Original language: English
Pages (from-to): 1-21
Number of pages: 21
Journal: Cybernetics and Systems
Volume: 38
Issue number: 1
State: Published - Jan 2007
Externally published: Yes
