Ab Antiquo: Neural Proto-language Reconstruction

Carlo Meloni, Shauli Ravfogel, Yoav Goldberg

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

25 Scopus citations

Abstract

Historical linguists have identified regularities in the process of historic sound change. The comparative method utilizes those regularities to reconstruct proto-words based on observed forms in daughter languages. Can this process be efficiently automated? We address the task of proto-word reconstruction, in which the model is exposed to cognates in contemporary daughter languages and has to predict the proto-word in the ancestor language. We provide a novel dataset for this task, encompassing over 8,000 comparative entries, and show that neural sequence models outperform conventional methods applied to this task so far. Error analysis reveals variability in the ability of neural models to capture different phonological changes, correlating with the complexity of the changes. Analysis of learned embeddings reveals that the models learn phonologically meaningful generalizations, corresponding to well-attested phonological shifts documented by historical linguistics.
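The task framing in the abstract (cognates in, proto-word out) can be illustrated with a minimal sketch. The language-tag input format, the function names `encode_cognate_set` and `edit_distance`, and the use of Levenshtein distance to score a prediction are assumptions made for illustration, not the authors' implementation. The example words are well-known Romance reflexes of Latin "noctem" and are not drawn from the paper's dataset.

```python
from typing import Dict, List


def encode_cognate_set(cognates: Dict[str, str]) -> List[str]:
    """Flatten a cognate set into one character-level source sequence.

    Hypothetical format: each daughter form is preceded by a language tag
    so that a sequence model can tell which language a character came from.
    """
    tokens: List[str] = []
    for lang in sorted(cognates):       # fixed order keeps inputs consistent
        tokens.append(f"<{lang}>")
        tokens.extend(cognates[lang])   # one token per character
    return tokens


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a predicted and a reference proto-word."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Illustrative cognate set (assumed example, not taken from the dataset).
source = encode_cognate_set({"it": "notte", "es": "noche", "fr": "nuit"})
reference = "noctem"
```

A trained model would map `source` to a predicted proto-word, and `edit_distance(prediction, reference)` would then measure how far the prediction is from the attested form; a lower value means a closer reconstruction.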

Original language: English
Title of host publication: NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics
Subtitle of host publication: Human Language Technologies, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 4460-4473
Number of pages: 14
ISBN (Electronic): 9781954085466
DOIs
State: Published - 2021
Event: 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021 - Virtual, Online
Duration: 6 Jun 2021 - 11 Jun 2021

Publication series

Name: NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference

Conference

Conference: 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021
City: Virtual, Online
Period: 6/06/21 - 11/06/21

Bibliographical note

Publisher Copyright:
© 2021 Association for Computational Linguistics.

Funding

We thank Arya McCarthy for pointing out relevant references. This project received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, grant agreement No. 802774 (iEXTRACT).

the very conservative orthography of French, that masks the phonological innovations that occurred in the language. Indeed, the network focuses exclusively on French for the reconstruction of the characters <h> and <y>, which are consistently represented only in French orthography, disappearing from the written form of the other Romance languages. The comparison to the attention of the phonetic dataset shows that the network tends to actually ignore French, favoring other sources instead. Similarly, in the orthographic dataset, French is favored in the initial positions, a tendency that disappears in the phonetic dataset. Finally, an interesting trend in the phonetic dataset is a tendency to attend to Romanian at the initial positions and to Portuguese at later ones.

Funders / Funder number
European Union’s Horizon 2020
European Commission
