Abstract
This work investigates the most basic units that underlie contextualized word embeddings such as BERT: the so-called word-pieces. In Morphologically-Rich Languages (MRLs) that exhibit morphological fusion and non-concatenative morphology, the different units of meaning within a word may be fused or intertwined, and cannot be separated linearly. Therefore, when using word-pieces in MRLs, we must consider that: (1) a linear segmentation into sub-word units might not capture the full morphological complexity of words; and (2) representations that leave morphological knowledge on sub-word units inaccessible might negatively affect performance. Here we empirically examine the capacity of word-pieces to capture morphology by investigating the task of multi-tagging in Hebrew, as a proxy for evaluating the underlying segmentation. Our results show that, while models trained to predict multi-tags for complete words outperform models tuned to predict the distinct tags of word-pieces (WPs), we can improve WP tag prediction by purposefully constraining the word-pieces to reflect their internal functions. We conjecture that this is due to the naïve linear tokenization of words into word-pieces, and suggest that linguistically-informed word-piece schemes that make morphological knowledge explicit might boost performance for MRLs.
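As a rough illustration of what "linear segmentation" means here, the sketch below implements greedy longest-match-first splitting, the inference procedure commonly used by WordPiece tokenizers such as BERT's. It is not the authors' code. The function name `wordpiece_tokenize`, the toy vocabulary, and the example word are all assumptions made for illustration: `wbbyt` is a consonantal transliteration of the Hebrew ובבית ("and in the/a house"), where the definite article is not written as a separate letter after the preposition.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word.

    Non-initial pieces carry the conventional '##' continuation prefix.
    Returns [unk] if some position cannot be matched by any vocabulary entry.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not taken from the paper).
TOY_VOCAB = {"w", "##b", "##byt", "##t"}

print(wordpiece_tokenize("wbbyt", TOY_VOCAB))  # ['w', '##b', '##byt']
```

Under these assumptions the word is cut into three contiguous pieces (conjunction, preposition, noun). The definite article, which may be fused into the preposition, has no piece of its own, so a tagger that assigns one tag per piece has nowhere to attach it.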
Original language | English |
---|---|
Title of host publication | SIGMORPHON 2020 - 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Proceedings of the Workshop |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 204-209 |
Number of pages | 6 |
ISBN (Electronic) | 9781952148194 |
State | Published - 2020 |
Event | 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, SIGMORPHON 2020 as part of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020 - Virtual, Online, United States. Duration: 10 Jul 2020 → … |
Publication series
Name | Proceedings of the Annual Meeting of the Association for Computational Linguistics |
---|---|
ISSN (Print) | 0736-587X |
Conference
Conference | 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, SIGMORPHON 2020 as part of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020 |
---|---|
Country/Territory | United States |
City | Virtual, Online |
Period | 10/07/20 → … |
Bibliographical note
Publisher Copyright: © 2020 Association for Computational Linguistics.
Funding
We thank Yoav Goldberg, Noah Smith, Omer Levy and three reviewers for interesting discussions of an earlier draft. This research is funded by ERC grant #677352 and ISF grant #1739/26, for which we are grateful.
Funders | Funder number |
---|---|
European Commission | 677352 |
Israel Science Foundation | 1739/26 |