You Can Have Your Data and Balance It Too: Towards Balanced and Efficient Multilingual Models

Tomasz Limisiewicz, Dan Malkin, Gabriel Stanovsky

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Multilingual models have been widely used for cross-lingual transfer to low-resource languages. However, performance on these languages is hindered by their underrepresentation in the pretraining data. To alleviate this problem, we propose a novel multilingual training technique based on teacher-student knowledge distillation. In this setting, we utilize monolingual teacher models, each optimized for its own language. We use those teachers along with balanced (sub-sampled) data to distill the teachers' knowledge into a single multilingual student. Our method outperforms standard training methods in low-resource languages and retains performance on high-resource languages.
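The core training signal described in the abstract can be illustrated with a minimal sketch of soft-target distillation: the student is trained to match the softened output distribution of the per-language teacher on each balanced batch. This is a generic distillation loss under assumed names (`distillation_loss`, the temperature value, the logit shapes are illustrative), not the paper's actual implementation.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis, numerically stabilized."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's soft targets
    (equivalent to KL divergence up to a constant in the student).

    In the balanced multilingual setting, teacher_logits would come from
    the monolingual teacher of the example's language, and this loss is
    averaged over a sub-sampled batch that gives each language equal weight.
    """
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean() * temperature**2)
```

The loss is smallest when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it per-language transfers each teacher's knowledge into the shared student.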

Original language: English
Title of host publication: SIGTYP 2023 - 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Proceedings of the Workshop
Editors: Lisa Beinborn, Koustava Goswami, Saliha Muradoglu, Alexey Sorokin, Ritesh Kumar, Andreas Shcherbakov, Edoardo M. Ponti, Ryan Cotterell, Ekaterina Vylomova
Publisher: Association for Computational Linguistics
Pages: 1-11
Number of pages: 11
ISBN (Electronic): 9781959429562
State: Published - 2023
Externally published: Yes
Event: 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, SIGTYP 2023, co-located with the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 - Hybrid, Dubrovnik, Croatia
Duration: 6 May 2023 → …

Publication series

Name: SIGTYP 2023 - 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Proceedings of the Workshop

Conference

Conference: 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, SIGTYP 2023, co-located with the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023
Country/Territory: Croatia
City: Hybrid, Dubrovnik
Period: 6/05/23 → …

Bibliographical note

Publisher Copyright:
© 2023 Association for Computational Linguistics.

Funding

We thank the anonymous reviewers for their valuable comments on earlier versions of this article. This work was supported in part by a research gift from the Allen Institute for AI, and by research grant 2336 from the Israeli Ministry of Science and Technology. Tomasz Limisiewicz's visit to the Hebrew University was supported by grant 338521 of the Charles University Grant Agency and the Mobility Fund of Charles University.

Funders (funder number):
Mobility Fund of Charles University
Grantová Agentura, Univerzita Karlova (338521)
Ministry of Science and Technology, Israel (2336)
