Evaluating models’ local decision boundaries via contrast sets

Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, Ben Zhou

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

245 Scopus citations

Abstract

Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture the abilities a dataset is intended to test. We propose a more rigorous annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model’s decision boundary, which can be used to more accurately evaluate a model’s true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, and IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets—up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.
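To make the evaluation idea in the abstract concrete, below is a minimal, illustrative sketch of how contrast-set evaluation could be scored: standard accuracy on the original test instances versus a stricter "contrast consistency" check that requires the model to be correct on an original instance and on every small perturbation derived from it. The function names, the toy sentiment classifier, and the example data are hypothetical and are not taken from the paper or its released benchmarks, which use task-specific metrics.

# Illustrative sketch only; `toy_predict`, `accuracy`, and
# `contrast_consistency` are hypothetical names, not the authors' code.

def accuracy(model_predict, examples):
    """Fraction of (input, gold_label) pairs the model gets right."""
    correct = sum(model_predict(x) == y for x, y in examples)
    return correct / len(examples)

def contrast_consistency(model_predict, contrast_sets):
    """A model is consistent on a contrast set only if it is correct on the
    original instance and on every perturbed instance derived from it."""
    consistent = 0
    for original, perturbations in contrast_sets:
        bundle = [original] + perturbations
        if all(model_predict(x) == y for x, y in bundle):
            consistent += 1
    return consistent / len(contrast_sets)

# Toy sentiment "model" that keys on a single surface word.
toy_predict = lambda text: "positive" if "great" in text else "negative"

# One contrast set: an original instance plus a small, meaningful
# perturbation that flips the gold label.
contrast_sets = [
    (
        ("The acting was great.", "positive"),
        [("The acting was anything but great.", "negative")],
    ),
]

print(accuracy(toy_predict, [orig for orig, _ in contrast_sets]))  # 1.0 on originals
print(contrast_consistency(toy_predict, contrast_sets))            # 0.0 once perturbed

The gap between the two numbers is the point of the method: a simple decision rule that looks perfect on the original test set fails as soon as the local neighborhood of an instance is probed.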

Original language: English
Title of host publication: Findings of the Association for Computational Linguistics, Findings of ACL
Subtitle of host publication: EMNLP 2020
Publisher: Association for Computational Linguistics (ACL)
Pages: 1307-1323
Number of pages: 17
ISBN (Electronic): 9781952148903
State: Published - 2020
Event: Findings of the Association for Computational Linguistics, ACL 2020: EMNLP 2020 - Virtual, Online
Duration: 16 Nov 2020 - 20 Nov 2020

Publication series

Name: Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2020

Conference

Conference: Findings of the Association for Computational Linguistics, ACL 2020: EMNLP 2020
City: Virtual, Online
Period: 16/11/20 - 20/11/20

Bibliographical note

Publisher Copyright:
©2020 Association for Computational Linguistics

Funding

We thank the anonymous reviewers for their helpful feedback on this paper, as well as many others who gave constructive comments on a publicly-available preprint. Various authors of this paper were supported in part by ERC grant 677352, NSF grant 1562364, NSF grant IIS-1756023, NSF CAREER 1750499, ONR grant N00014-18-1-2826 and DARPA grant N66001-19-2-403.

Funders and funder numbers:
National Science Foundation: 1562364, IIS-1756023, 1750499
Office of Naval Research: N00014-18-1-2826
Defense Advanced Research Projects Agency: N66001-19-2-403
Horizon 2020 Framework Programme: 802800
European Research Council (ERC): 677352
