GENIE: Toward Reproducible and Standardized Human Evaluation for Text Generation

Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi, Noah A. Smith, Daniel S. Weld

Research output: Contribution to conference › Paper › peer-review

7 Scopus citations

Abstract

While often assumed a gold standard, effective human evaluation of text generation remains an important, open area for research. We revisit this problem with a focus on producing consistent evaluations that are reproducible, both over time and across different populations. We study this goal in different stages of the human evaluation pipeline. In particular, we consider design choices for the annotation interface used to elicit human judgments and their impact on reproducibility. Furthermore, we develop an automated mechanism for maintaining annotator quality via a probabilistic model that detects and excludes noisy annotators. Putting these lessons together, we introduce GENIE: a system for running standardized human evaluations across different generation tasks. We instantiate GENIE with datasets representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. For each task, GENIE offers a leaderboard that automatically crowd-sources annotations for submissions, evaluating them along axes such as correctness, conciseness, and fluency. We have made the GENIE leaderboards publicly available, and have already ranked 50 submissions from 10 different research groups. We hope GENIE encourages further progress toward effective, standardized evaluations for text generation.
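To illustrate the general idea of filtering noisy annotators with a probabilistic model, the sketch below scores each annotator by agreement with the per-item majority judgment under a Beta prior and drops annotators whose posterior reliability is low. This is a minimal illustration under assumed inputs, not GENIE's actual model or code; the function name, parameters, and threshold are hypothetical.

```python
# Minimal sketch (illustrative only, not GENIE's implementation): estimate each
# annotator's reliability as agreement with the per-item majority label under a
# Beta prior, and exclude annotators whose posterior mean reliability is low.
from collections import Counter, defaultdict


def filter_noisy_annotators(labels, prior_a=2.0, prior_b=2.0, min_reliability=0.5):
    """labels: iterable of (annotator_id, item_id, label) tuples."""
    # Majority label per item, used as a stand-in for the latent "true" judgment.
    by_item = defaultdict(list)
    for annotator, item, label in labels:
        by_item[item].append(label)
    consensus = {item: Counter(votes).most_common(1)[0][0]
                 for item, votes in by_item.items()}

    # Count each annotator's agreements with the consensus.
    agree, total = Counter(), Counter()
    for annotator, item, label in labels:
        total[annotator] += 1
        agree[annotator] += int(label == consensus[item])

    # Posterior mean reliability under a Beta(prior_a, prior_b) prior.
    reliability = {
        a: (agree[a] + prior_a) / (total[a] + prior_a + prior_b)
        for a in total
    }
    return {a for a, r in reliability.items() if r >= min_reliability}
```

In practice such a filter would run continuously as annotations arrive, so that low-quality workers are excluded before their judgments affect leaderboard scores; the threshold and prior here are placeholder values.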

Original language: English
Pages: 11444-11458
Number of pages: 15
State: Published - 2022
Externally published: Yes
Event: 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 - Abu Dhabi, United Arab Emirates
Duration: 7 Dec 2022 - 11 Dec 2022

Conference

Conference: 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 7/12/22 - 11/12/22

Bibliographical note

Publisher Copyright:
© 2022 Association for Computational Linguistics.

Funding

The authors would like to thank the leaderboard team at Allen Institute for AI, particularly Michal Guerquin and Sam Skjonsberg. We thank Peter Clark, Oyvind Tafjord and Daniel Deutsch for valuable feedback throughout this project. We are grateful to the many AMT workers whose contributions make human evaluation possible, and to the anonymous reviewers for their helpful feedback on this manuscript. This work was supported in part by DARPA MCS program through NIWC Pacific (N66001-19-2-4031) and research grant 2336 from the Israeli Ministry of Science and Technology.

Funders and funder numbers:
Defense Advanced Research Projects Agency: N66001-19-2-4031
Allen Institute for AI
Ministry of Science and Technology, Israel: 2336
