Abstract
Historical linguists have identified regularities in the process of historic sound change. The comparative method utilizes those regularities to reconstruct proto-words based on observed forms in daughter languages. Can this process be efficiently automated? We address the task of proto-word reconstruction, in which the model is exposed to cognates in contemporary daughter languages and has to predict the proto-word in the ancestor language. We provide a novel dataset for this task, encompassing over 8,000 comparative entries, and show that neural sequence models outperform conventional methods applied to this task so far. Error analysis reveals variability in the ability of neural models to capture different phonological changes, correlating with the complexity of the changes. Analysis of learned embeddings reveals that the models learn phonologically meaningful generalizations, corresponding to well-attested phonological shifts documented by historical linguistics.
Original language | English |
---|---|
Title of host publication | NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics |
Subtitle of host publication | Human Language Technologies, Proceedings of the Conference |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 4460-4473 |
Number of pages | 14 |
ISBN (Electronic) | 9781954085466 |
DOIs | |
State | Published - 2021 |
Event | 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021 - Virtual, Online; Duration: 6 Jun 2021 → 11 Jun 2021 |
Publication series
Name | NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference |
---|
Conference
Conference | 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021 |
---|---|
City | Virtual, Online |
Period | 6/06/21 → 11/06/21 |
Bibliographical note
Publisher Copyright: © 2021 Association for Computational Linguistics.
Funding
We thank Arya McCarthy for pointing out relevant references. This project received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, grant agreement No. 802774 (iEXTRACT).
... the very conservative orthography of French, which masks the phonological innovations that occurred in the language. Indeed, the network focuses exclusively on French for the reconstruction of the characters <h> and <y>, which are consistently represented only in French orthography, disappearing from the written form of the other Romance languages. The comparison to the attention of the phonetic dataset shows that the network tends to actually ignore French, favoring other sources instead. Similarly, in the orthographic dataset, French is favored in the initial positions, a tendency that disappears in the phonetic dataset. Finally, an interesting trend in the phonetic dataset is a tendency to attend to Romanian at the initial positions and to Portuguese at later ones.
Funders | Funder number |
---|---|
European Union’s Horizon 2020 | |
European Commission | |