Abstract
Multilingual pretrained language models have demonstrated remarkable zero-shot cross-lingual transfer capabilities. Such transfer emerges by fine-tuning on a task of interest in one language and evaluating on a distinct language not seen during fine-tuning. Despite promising results, we still lack a proper understanding of the source of this transfer. Using a novel layer ablation technique and analyses of the model's internal representations, we show that multilingual BERT, a popular multilingual language model, can be viewed as the stacking of two sub-networks: a multilingual encoder followed by a task-specific, language-agnostic predictor. While the encoder is crucial for cross-lingual transfer and remains mostly unchanged during fine-tuning, the task predictor has little effect on the transfer and can be reinitialized during fine-tuning. We present extensive experiments with three distinct tasks, seventeen typologically diverse languages, and multiple domains to support our hypothesis.
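The layer ablation the abstract describes can be made concrete. Below is a minimal sketch, assuming the HuggingFace `transformers` library; the checkpoint path, the layer split point (layers 8–11), and the token-classification head are illustrative assumptions, not the authors' exact configuration. It restores the upper encoder layers of a fine-tuned multilingual BERT to their pretrained values, so that evaluating the resulting model probes how much fine-tuning of those layers matters for cross-lingual transfer.

```python
import torch
from transformers import BertForTokenClassification

# Fine-tuned mBERT checkpoint (hypothetical path) and the original pretrained model.
finetuned = BertForTokenClassification.from_pretrained("path/to/finetuned-mbert")
pretrained = BertForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=finetuned.config.num_labels
)

# Ablate the upper layers: replace their fine-tuned weights with the
# pretrained ones (layers 8-11 are an assumed split point, not the paper's).
with torch.no_grad():
    for i in range(8, 12):
        finetuned.bert.encoder.layer[i].load_state_dict(
            pretrained.bert.encoder.layer[i].state_dict()
        )

# Evaluating `finetuned` on a target language now measures how much
# fine-tuning of these layers contributed to zero-shot transfer.
```

If performance is largely unchanged after this ablation, the ablated layers behave like a language-agnostic predictor in the sense of the abstract; a large drop would instead indicate they are essential to the transfer.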
Original language | English |
---|---|
Title of host publication | EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 2214-2231 |
Number of pages | 18 |
ISBN (Electronic) | 9781954085022 |
State | Published - 2021 |
Event | 16th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2021 - Virtual, Online. Duration: 19 Apr 2021 → 23 Apr 2021 |
Publication series
Name | EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference |
---|---|
Conference
Conference | 16th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2021 |
---|---|
City | Virtual, Online |
Period | 19/04/21 → 23/04/21 |
Bibliographical note
Publisher Copyright: © 2021 Association for Computational Linguistics
Funding
We want to thank Hila Gonen, Shauli Ravfogel and Ganesh Jawahar for their insightful reviews and comments. We also thank the anonymous reviewers for their valuable suggestions. This work was partly funded by two French national projects granted to Inria and other partners by the Agence Nationale de la Recherche, namely PARSITI (ANR-16-CE33-0021) and SoSweet (ANR-15-CE38-0011), as well as by the third author's chair in the PRAIRIE institute, funded by the French national agency ANR as part of the "Investissements d'avenir" programme under the reference ANR-19-P3IA-0001. Yanai Elazar is grateful to be partially supported by the PBC fellowship for outstanding PhD candidates in Data Science.
Funders | Funder number |
---|---|
French national agency ANR | ANR-19-P3IA-0001 |
SoSweet | ANR-15-CE38-0011 |
Agence Nationale de la Recherche | ANR-16-CE33-0021 |
Planning and Budgeting Committee of the Council for Higher Education of Israel |