Diversify your datasets: Analyzing generalization via controlled variance in adversarial datasets

Ohad Rozen, Vered Shwartz, Roee Aharoni, Ido Dagan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

21 Scopus citations

Abstract

Phenomenon-specific "adversarial" datasets have recently been designed to perform targeted stress-tests for particular inference types. Recent work (Liu et al., 2019a) proposed that such datasets can be used to train NLI and other types of models, often allowing the model to learn the phenomenon in focus and improve on the challenge dataset, which indicates a "blind spot" in the original training data. Yet although a model can improve through such training, it might still be vulnerable to other challenge datasets that target the same phenomenon but are drawn from a different distribution, for example one with a different level of syntactic complexity. In this work, we extend this method to draw conclusions about a model's ability to learn and generalize a target phenomenon, rather than to "learn" a dataset, by controlling additional aspects of the adversarial datasets. We demonstrate our approach on two inference phenomena, dative alternation and numerical reasoning, elaborating on, and in some cases contradicting, the results of Liu et al. Our methodology enables building better challenge datasets for creating more robust models, and may yield better model understanding and subsequent overarching improvements.

Original language: English
Title of host publication: CoNLL 2019 - 23rd Conference on Computational Natural Language Learning, Proceedings of the Conference
Publisher: Association for Computational Linguistics
Pages: 196-205
Number of pages: 10
ISBN (Electronic): 9781950737727
State: Published - 2019
Event: 23rd Conference on Computational Natural Language Learning, CoNLL 2019 - Hong Kong, China
Duration: 3 Nov 2019 to 4 Nov 2019

Publication series

Name: CoNLL 2019 - 23rd Conference on Computational Natural Language Learning, Proceedings of the Conference

Conference

Conference: 23rd Conference on Computational Natural Language Learning, CoNLL 2019
Country/Territory: China
City: Hong Kong
Period: 3/11/19 to 4/11/19

Bibliographical note

Funding Information:
We would like to thank Ori Shapira for assisting in data analysis, and the anonymous reviewers for their constructive comments. This work was supported in part by the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1), by a grant from Reverso and Theo Hoffenberg, and by the Israel Science Foundation (grant 1951/17).

Publisher Copyright:
© 2019 Association for Computational Linguistics.
