This chapter describes a general, comprehensive and robust word-alignment system and its application to the Hebrew-English language pair. A major goal of the system architecture is to assume as little as possible about its input and about the relative nature of the two languages, while allowing the use of (minimal) specific monolingual pre-processing resources when required. The system thus receives as input a pair of raw parallel texts and requires only a tokeniser (and possibly a lemmatiser) for each language. After tokenisation (and lemmatisation if necessary), a rough initial alignment is obtained for the texts using a version of Fung and McKeown's DK-vec algorithm (Fung und McKeown, 1997; Fung, this volume). The initial alignment is given as input to a version of the word_ align algorithm (Dagan, Church and Gale, 1993), an extension of Model 2 in the IBM statistical translation model. Word_align produces a word level alignment for the texts and a probabilistic bilingual dictionary. The chapter describes the details of the system architecture, the algorithms implemented (emphasising implementation details), the issues regarding their application to Hebrew and similar Semitic languages, and some experimental results.
|Original language||American English|
|Title of host publication||Parallel text processing|
|State||Published - 2000|
|Name||Text, Speech and Language Technology|