On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations

Roy Schwartz, Gabriel Stanovsky

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

6 Scopus citations

Abstract

Recent work has shown that deep learning models in NLP are highly sensitive to low-level correlations between simple features and specific output labels, leading to overfitting and lack of generalization. To mitigate this problem, a common practice is to balance datasets by adding new instances or by filtering out "easy" instances (Sakaguchi et al., 2020), culminating in a recent proposal to eliminate single-word correlations altogether (Gardner et al., 2021). In this opinion paper, we identify that despite these efforts, increasingly powerful models keep exploiting ever-smaller spurious correlations, and as a result even balancing all single-word features is insufficient for mitigating all of these correlations. In parallel, a truly balanced dataset may be bound to "throw the baby out with the bathwater" and miss important signal encoding common sense and world knowledge. We highlight several alternatives to dataset balancing, focusing on enhancing datasets with richer contexts, allowing models to abstain and interact with users, and turning from large-scale fine-tuning to zero- or few-shot setups.

Original language: English
Title of host publication: Findings of the Association for Computational Linguistics
Subtitle of host publication: NAACL 2022 - Findings
Publisher: Association for Computational Linguistics (ACL)
Pages: 2182-2194
Number of pages: 13
ISBN (Electronic): 9781955917766
State: Published - 2022
Externally published: Yes
Event: 2022 Findings of the Association for Computational Linguistics: NAACL 2022 - Seattle, United States
Duration: 10 Jul 2022 to 15 Jul 2022

Publication series

Name: Findings of the Association for Computational Linguistics: NAACL 2022 - Findings

Conference

Conference: 2022 Findings of the Association for Computational Linguistics: NAACL 2022
Country/Territory: United States
City: Seattle
Period: 10/07/22 to 15/07/22

Bibliographical note

Publisher Copyright:
© Findings of the Association for Computational Linguistics: NAACL 2022 - Findings.

Funding

We would like to thank Matt Gardner and Will Merrill for the in-depth discussion. We would also like to thank Omri Abend, Yoav Goldberg, Inbal Magar, and the anonymous reviewers for their feedback. This work was supported in part by research gifts from the Allen Institute for AI.

Funders: ALLEN INSTITUTE (funder number not listed)
