Abstract
Cross-document event coreference resolution is a foundational task for NLP applications involving multi-text processing. However, existing corpora for this task are scarce and relatively small, and they annotate only modest-size clusters of documents belonging to the same topic. To complement these resources and enhance future research, we present Wikipedia Event Coreference (WEC), an efficient methodology for gathering a large-scale dataset for cross-document event coreference from Wikipedia, where coreference links are not restricted within predefined topics. We apply this methodology to the English Wikipedia and extract our large-scale WEC-Eng dataset. Notably, our dataset creation method is generic and can be applied with relatively little effort to other Wikipedia languages. To set baseline results, we develop an algorithm that adapts components of state-of-the-art models for within-document coreference resolution to the cross-document setting. Our model is suitably efficient and outperforms previously published state-of-the-art results for the task.
Original language | English |
---|---|
Title of host publication | NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics |
Subtitle of host publication | Human Language Technologies, Proceedings of the Conference |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 2498-2510 |
Number of pages | 13 |
ISBN (Electronic) | 9781954085466 |
State | Published - 2021 |
Event | 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021 - Virtual, Online; Duration: 6 Jun 2021 → 11 Jun 2021 |
Publication series
Name | NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference |
---|
Conference
Conference | 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021 |
---|---|
City | Virtual, Online |
Period | 6/06/21 → 11/06/21 |
Bibliographical note
Funding Information:
We would like to thank Valentina Pyatkin, Daniela Stepanov and Oren Pereg for their valuable assistance in the data validation process. The work described herein was supported in part by grants from Intel Labs, Facebook, the Israel Science Foundation grant 1951/17, the Israeli Ministry of Science and Technology and the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1).
Publisher Copyright:
© 2021 Association for Computational Linguistics.