Efficient Unsupervised Recursive Word Segmentation Using Minimum Description Length

S Argamon, N Akiva, A Amir, O Kapah

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Automatic word segmentation is a basic requirement for unsupervised learning in morphological analysis. In this paper, we formulate a novel recursive method for minimum description length (MDL) word segmentation, whose basic operation is resegmenting the corpus on a prefix (equivalently, a suffix). We derive a local expression for the change in description length under resegmentation, i.e., one which depends only on properties of the specific prefix (not on the rest of the corpus). Such a formulation permits use of a new and efficient algorithm for greedy morphological segmentation of the corpus in a recursive manner. In particular, our method does not restrict words to be segmented only once, into a stem+affix form, as do many extant techniques. Early results for English and Turkish corpora are promising.
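
The idea of greedy, recursive resegmentation on a prefix can be illustrated with a minimal sketch. It is not the authors' algorithm: the paper derives a local expression for the change in description length, whereas this sketch simply recomputes the full description length after each trial resegmentation, which is much slower but easier to follow. The coding scheme (a fixed per-character cost for the lexicon plus an entropy code for the morph sequence), the 27-symbol alphabet, and the function names `description_length`, `resegment`, and `greedy_segment` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Assumption: 26 letters plus one end-of-morph marker, each coded at a flat cost.
BITS_PER_CHAR = math.log2(27)


def description_length(corpus):
    """Bits to encode the morph lexicon plus the corpus as a morph sequence."""
    counts = Counter(m for word in corpus for m in word)
    total = sum(counts.values())
    # Lexicon: spell out each distinct morph once, character by character.
    lexicon_bits = sum((len(m) + 1) * BITS_PER_CHAR for m in counts)
    # Corpus: each morph token costs -log2 of its relative frequency.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits


def resegment(corpus, prefix):
    """Split every morph that properly starts with `prefix` into prefix + rest."""
    out = []
    for word in corpus:
        morphs = []
        for m in word:
            if m.startswith(prefix) and len(m) > len(prefix):
                morphs.extend((prefix, m[len(prefix):]))
            else:
                morphs.append(m)
        out.append(tuple(morphs))
    return out


def greedy_segment(words):
    """Repeatedly apply the prefix resegmentation that most reduces the length."""
    corpus = [(w,) for w in words]  # start with every word unsegmented
    best = description_length(corpus)
    while True:
        # Candidates: every proper prefix of every current morph.
        candidates = {m[:i] for word in corpus for m in word
                      for i in range(1, len(m))}
        scored = [(description_length(resegment(corpus, p)), p)
                  for p in sorted(candidates)]
        if not scored:
            break
        new_dl, prefix = min(scored)
        if new_dl >= best:  # stop when no resegmentation shortens the description
            break
        corpus, best = resegment(corpus, prefix), new_dl
    return corpus, best


if __name__ == "__main__":
    words = ["walk", "walked", "walking", "talk", "talked", "talking"]
    segmented, bits = greedy_segment(words)
    for word in segmented:
        print("+".join(word))
    print(f"description length: {bits:.1f} bits")
```

Because remainders produced by one split stay in the pool of morphs, a later pass may split them again, so a word is not limited to a single stem+affix cut. On a toy word list like the one above, the splits chosen depend heavily on the assumed coding scheme and should not be read as results from the paper.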
Original language: American English
Title of host publication: The 20th International Conference on Computational Linguistics
Publisher: Association for Computational Linguistics
State: Published - 2004

Bibliographical note

Place of conference: Geneva, Switzerland
