A Balanced Data Approach for Evaluating Cross-Lingual Transfer: Mapping the Linguistic Blood Bank

Dan Malkin, Tomasz Limisiewicz, Gabriel Stanovsky

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

10 Scopus citations

Abstract

We show that the choice of pretraining languages affects downstream cross-lingual transfer for BERT-based models. We inspect zero-shot performance in balanced data conditions to mitigate data size confounds, classifying pretraining languages that improve downstream performance as donors, and languages whose zero-shot performance is improved as recipients. We develop a method of quadratic time complexity in the number of languages to estimate these relations, instead of an exponential exhaustive computation of all possible combinations. We find that our method is effective on a diverse set of languages spanning different linguistic features and two downstream tasks. Our findings can inform developers of large-scale multilingual language models in choosing better pretraining configurations.

Original language: English
Title of host publication: NAACL 2022 - 2022 Conference of the North American Chapter of the Association for Computational Linguistics
Subtitle of host publication: Human Language Technologies, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 4903-4915
Number of pages: 13
ISBN (Electronic): 9781955917711
State: Published - 2022
Externally published: Yes
Event: 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022 - Seattle, United States
Duration: 10 Jul 2022 to 15 Jul 2022

Publication series

Name: NAACL 2022 - 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference

Conference

Conference: 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022
Country/Territory: United States
City: Seattle
Period: 10/07/22 to 15/07/22

Bibliographical note

Publisher Copyright:
© 2022 Association for Computational Linguistics.

Funding

We would like to thank Roy Schwartz for his helpful comments and suggestions and the anonymous reviewers for their valuable feedback. This work was supported in part by a research gift from the Allen Institute for AI. Tomasz Limisiewicz’s visit to the Hebrew University has been supported by grant 338521 of the Charles University Grant Agency and the Mobility Fund of Charles University.

Funders:
Mobility Fund of Charles University
Grantová Agentura, Univerzita Karlova
