TY - GEN
T1 - Unsupervised lexicon-based resolution of unknown words for full morphological analysis
AU - Adler, Meni
AU - Goldberg, Yoav
AU - Gabay, David
AU - Elhadad, Michael
PY - 2008
Y1 - 2008
AB - Morphological disambiguation proceeds in two stages: (1) an analyzer provides all possible analyses for a given token, and (2) a stochastic disambiguation module picks the most likely analysis in context. When the analyzer does not recognize a given token, we face the problem of unknowns. In large-scale corpora, unknowns appear at a rate of 5 to 10% (depending on the genre and the maturity of the lexicon). We address the task of computing the distribution p(t|w) for unknown words for full morphological disambiguation in Hebrew. We introduce a novel language-independent algorithm: it exploits a maximum entropy letters model trained over the known words observed in the corpus and the distribution of the unknown words in known tag contexts, through iterative approximation. The algorithm achieves 30% error reduction on disambiguation of unknown words over a competitive baseline (to a level of 70% accurate full disambiguation of unknown words). We have also verified that taking advantage of a strong language-specific model of morphological patterns provides the same level of disambiguation. The algorithm we have developed exploits distributional information latent in a wide-coverage lexicon and large quantities of unlabeled data.
UR - http://www.scopus.com/inward/record.url?scp=84859916086&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84859916086
SN - 9781932432046
T3 - ACL-08: HLT - 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference
SP - 728
EP - 736
BT - ACL-08: HLT - 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference
T2 - 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-08: HLT
Y2 - 15 June 2008 through 20 June 2008
ER -