Efficient unsupervised recursive word segmentation using minimum description length

Shlomo Argamon, Navot Akiva, Amihood Amir, Oren Kapah

Research output: Contribution to conference, Paper, peer-review

15 Scopus citations

Abstract

Automatic word segmentation is a basic requirement for unsupervised learning in morphological analysis. In this paper, we formulate a novel recursive method for minimum description length (MDL) word segmentation, whose basic operation is resegmenting the corpus on a prefix (equivalently, a suffix). We derive a local expression for the change in description length under resegmentation, i.e., one which depends only on properties of the specific prefix (not on the rest of the corpus). Such a formulation permits use of a new and efficient algorithm for greedy morphological segmentation of the corpus in a recursive manner. In particular, our method does not restrict words to be segmented only once, into a stem+affix form, as do many extant techniques. Early results for English and Turkish corpora are promising.
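To make the basic operation concrete, the following is a minimal sketch of resegmenting a corpus on a prefix and measuring the resulting change in description length. It is not the paper's formulation: the two-part code used here (a flat per-character lexicon cost plus an empirical-frequency corpus cost), the constant `BITS_PER_CHAR`, and all function names are assumptions made for illustration. The change is also computed by brute force over the whole corpus, whereas the abstract describes a local expression that depends only on properties of the prefix.

```python
import math
from collections import Counter

# Assumed flat cost per letter (roughly log2 of a 26-letter alphabet, rounded up).
BITS_PER_CHAR = 5.0


def description_length(tokens):
    """Two-part code length in bits: lexicon cost plus corpus cost (illustrative model)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Lexicon: spell out each distinct morph once, plus one end marker per morph.
    lexicon_bits = sum((len(m) + 1) * BITS_PER_CHAR for m in counts)
    # Corpus: encode each token with a code length of -log2(relative frequency).
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits


def resegment_on_prefix(tokens, prefix):
    """Split every token that starts with `prefix` (and is longer than it) in two."""
    out = []
    for t in tokens:
        if t.startswith(prefix) and len(t) > len(prefix):
            out.extend([prefix, t[len(prefix):]])
        else:
            out.append(t)
    return out


def delta_dl(tokens, prefix):
    """Change in description length if the corpus is resegmented on `prefix`."""
    return description_length(resegment_on_prefix(tokens, prefix)) - description_length(tokens)


def greedy_segment(tokens, max_rounds=10):
    """Greedily apply the most helpful prefix split; remainders may be split again later."""
    for _ in range(max_rounds):
        # Every proper prefix of every current token is a candidate.
        candidates = {t[:i] for t in set(tokens) for i in range(1, len(t))}
        best = min(candidates, key=lambda p: (delta_dl(tokens, p), p), default=None)
        if best is None or delta_dl(tokens, best) >= 0:
            break  # no candidate shortens the description
        tokens = resegment_on_prefix(tokens, best)
    return tokens


corpus = ["walked", "walking", "walks", "talked", "talking", "talks"]
print(delta_dl(corpus, "walk"))   # negative: splitting on "walk" shortens the description
print(greedy_segment(corpus))
```

Because each round operates on the already-split token sequence, a word can be segmented more than once, which mirrors the recursive behaviour the abstract describes. The brute-force `delta_dl` rescans the corpus for every candidate, so this sketch is only practical for toy inputs.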

Original language: English
State: Published - 2004
Event: 20th International Conference on Computational Linguistics, COLING 2004 - Geneva, Switzerland
Duration: 23 Aug 2004 to 27 Aug 2004

Conference

Conference: 20th International Conference on Computational Linguistics, COLING 2004
Country/Territory: Switzerland
City: Geneva
Period: 23/08/04 to 27/08/04

Bibliographical note

Publisher Copyright:
© 2004 COLING 2004 - Proceedings of the 20th International Conference on Computational Linguistics. All rights reserved.
