Word segmentation, unknown-word resolution, and morphological agreement in a Hebrew parsing system

Yoav Goldberg, Michael Elhadad

Research output: Contribution to journalArticlepeer-review

20 Scopus citations


We present a constituency parsing system for Modern Hebrew. The system is based on the PCFG-LA parsing method of Petrov et al. (2006), which is extended in various ways in order to accommodate the specificities of Hebrew as a morphologically rich language with a small treebank. We show that parsing performance can be enhanced by utilizing a language resource external to the treebank, specifically, a lexicon-based morphological analyzer. We present a computational model of interfacing the external lexicon and a treebank-based parser, also in the common case where the lexicon and the treebank follow different annotation schemes. We show that Hebrew word-segmentation and constituency-parsing can be performed jointly using CKY lattice parsing. Performing the tasks jointly is effective, and substantially outperforms a pipelinebased model. We suggest modeling grammatical agreement in a constituency-based parser as a filter mechanism that is orthogonal to the grammar, and present a concrete implementation of the method. Although the constituency parser does not make many agreement mistakes to begin with, the filter mechanism is effective in fixing the agreement mistakes that the parser does make. These contributions extend outside of the scope of Hebrew processing, and are of general applicability to the NLP community. Hebrew is a specific case of a morphologically rich language, and ideas presented in this work are useful also for processing other languages, including English. The lattice-based parsingmethodology is useful in any case where the input is uncertain. Extending the lexical coverage of a treebank-derived parser using an external lexicon is relevant for any language with a small treebank.

Original languageEnglish
Pages (from-to)121-160
Number of pages40
JournalComputational Linguistics
Issue number1
StatePublished - Mar 2013
Externally publishedYes


Dive into the research topics of 'Word segmentation, unknown-word resolution, and morphological agreement in a Hebrew parsing system'. Together they form a unique fingerprint.

Cite this