TY - GEN
T1 - Getting the ##life out of living: How Adequate Are Word-Pieces for Modelling Complex Morphology?
AU - Klein, Stav
AU - Tsarfaty, Reut
PY - 2020
Y1 - 2020
N2 - This work investigates the most basic units that underlie contextualized word embeddings, such as BERT: the so-called word-pieces. In Morphologically-Rich Languages (MRLs), which exhibit morphological fusion and non-concatenative morphology, the different units of meaning within a word may be fused and intertwined, and cannot be separated linearly. Therefore, when using word-pieces in MRLs, we must consider that: (1) a linear segmentation into sub-word units might not capture the full morphological complexity of words; and (2) representations that leave morphological knowledge on sub-word units inaccessible might negatively affect performance. Here we empirically examine the capacity of word-pieces to capture morphology by investigating the task of multi-tagging in Hebrew, as a proxy to evaluating the underlying segmentation. Our results show that, while models trained to predict multi-tags for complete words outperform models tuned to predict the distinct tags of WPs, we can improve the WPs' tag prediction by purposefully constraining the word-pieces to reflect their internal functions. We conjecture that this is due to the naïve linear tokenization of words into word-pieces, and suggest that linguistically-informed word-piece schemes, which make morphological knowledge explicit, might boost performance for MRLs.
UR - https://www.mendeley.com/catalogue/e5c1441d-a177-3259-a54a-d8055f8ad614/
DO - 10.18653/v1/2020.sigmorphon-1.24
M3 - Conference contribution
SP - 204
EP - 209
BT - Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, SIGMORPHON 2020, Online, July 10, 2020
A2 - Nicolai, Garrett
A2 - Gorman, Kyle
A2 - Cotterell, Ryan
PB - Association for Computational Linguistics
ER -