A comprehensive bilingual word alignment system

Yaacov Choueka, Ehud S. Conley, Ido Dagan

Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review

Abstract

This chapter describes a general, comprehensive and robust word-alignment system and its application to the Hebrew-English language pair. A major goal of the system architecture is to assume as little as possible about its input and about the relative nature of the two languages, while allowing the use of (minimal) specific monolingual pre-processing resources when required. The system thus receives as input a pair of raw parallel texts and requires only a tokeniser (and possibly a lemmatiser) for each language. After tokenisation (and lemmatisation if necessary), a rough initial alignment is obtained for the texts using a version of Fung and McKeown's DK-vec algorithm (Fung and McKeown, 1997; Fung, this volume). The initial alignment is given as input to a version of the word_align algorithm (Dagan, Church and Gale, 1993), an extension of Model 2 in the IBM statistical translation model. Word_align produces a word-level alignment for the texts and a probabilistic bilingual dictionary. The chapter describes the details of the system architecture, the algorithms implemented (emphasising implementation details), the issues regarding their application to Hebrew and similar Semitic languages, and some experimental results.
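To illustrate how a probabilistic bilingual dictionary can be estimated from parallel text, the following is a minimal sketch of expectation-maximisation for IBM Model 1. This is a deliberately simpler relative of the Model 2 extension that word_align builds on; it is not the chapter's algorithm, and it omits the DK-vec initial alignment and any position modelling. The function names (`train_model1`, `best_translation`) and the toy transliterated Hebrew-English segments are assumptions made for the example only.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t[(tgt, src)] ~ P(tgt | src) with IBM Model 1 EM.

    pairs: list of (source_tokens, target_tokens) for aligned segments.
    Hypothetical helper for illustration; not the word_align algorithm.
    """
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(tgt_vocab)
    # Start from a uniform translation table.
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected (tgt, src) co-translation counts
        total = defaultdict(float)  # expected counts per source word

        # E-step: spread each target token over the source tokens of its segment.
        for src, tgt in pairs:
            for w in tgt:
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / norm
                    count[(w, s)] += frac
                    total[s] += frac

        # M-step: renormalise the expected counts into probabilities.
        t = defaultdict(float, {(w, s): c / total[s] for (w, s), c in count.items()})

    return t


def best_translation(t, src_word):
    """Return the target word with the highest P(tgt | src_word)."""
    candidates = {w: p for (w, s), p in t.items() if s == src_word}
    return max(candidates, key=candidates.get)


# Toy corpus (assumed data): transliterated Hebrew segments with English glosses.
toy_pairs = [
    (["bayit", "gadol"], ["big", "house"]),
    (["sefer", "gadol"], ["big", "book"]),
    (["sefer", "qatan"], ["small", "book"]),
]

table = train_model1(toy_pairs, iterations=10)
for hebrew in ["bayit", "sefer", "gadol", "qatan"]:
    print(hebrew, "->", best_translation(table, hebrew))
```

The resulting table plays the role of a probabilistic dictionary: each entry gives an estimated probability that a target word translates a source word. A system such as the one described would additionally constrain which word pairs may align, using the rough initial alignment and positional information.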
Original language: American English
Title of host publication: Parallel text processing
Editors: Jean Véronis
Publisher: Springer Netherlands
Pages: 69-96
ISBN (Print): 978-94-017-2535-4
State: Published - 2000

Publication series

Name: Text, Speech and Language Technology
Volume: 13
