WEC: Deriving a Large-scale Cross-document Event Coreference dataset from Wikipedia

Alon Eirew, Arie Cattan, Ido Dagan

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

12 Scopus citations

Abstract

Cross-document event coreference resolution is a foundational task for NLP applications involving multi-text processing. However, existing corpora for this task are scarce and relatively small, while annotating only modest-size clusters of documents belonging to the same topic. To complement these resources and enhance future research, we present Wikipedia Event Coreference (WEC), an efficient methodology for gathering a large-scale dataset for cross-document event coreference from Wikipedia, where coreference links are not restricted within predefined topics. We apply this methodology to the English Wikipedia and extract our large-scale WEC-Eng dataset. Notably, our dataset creation method is generic and can be applied with relatively little effort to other Wikipedia languages. To set baseline results, we develop an algorithm that adapts components of state-of-the-art models for within-document coreference resolution to the cross-document setting. Our model is suitably efficient and outperforms previously published state-of-the-art results for the task.

Original language: English
Title of host publication: NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics
Subtitle of host publication: Human Language Technologies, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 2498-2510
Number of pages: 13
ISBN (Electronic): 9781954085466
State: Published - 2021
Event: 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021 - Virtual, Online
Duration: 6 Jun 2021 to 11 Jun 2021

Publication series

Name: NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference

Conference

Conference: 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021
City: Virtual, Online
Period: 6/06/21 to 11/06/21

Bibliographical note

Funding Information:
We would like to thank Valentina Pyatkin, Daniela Stepanov and Oren Pereg for their valuable assistance in the data validation process. The work described herein was supported in part by grants from Intel Labs, Facebook, the Israel Science Foundation grant 1951/17, the Israeli Ministry of Science and Technology and the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1).

Publisher Copyright:
© 2021 Association for Computational Linguistics.
