Abstract
We created a dataset of syntactic-ngrams (counted dependency-tree fragments) based on a corpus of 3.5 million English books. The dataset includes over 10 billion distinct items covering a wide range of syntactic configurations. It also includes temporal information, facilitating new kinds of research into lexical semantics over time. This paper describes the dataset, the syntactic representation, and the kinds of information provided.
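To make "counted dependency-tree fragments" with temporal information concrete, the following is a minimal sketch that counts the simplest kind of fragment, a single head-modifier arc, per publication year. It is an illustration under stated assumptions, not the authors' extraction pipeline or the dataset's file format: the toy input `PARSED`, the token layout, and the helper names `arcs` and `count_by_year` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy input: (year, tokens). Each token is
# (word, dependency_label, head_index); head_index is 1-based, 0 marks the root.
PARSED = [
    (1900, [("the", "det", 2), ("dog", "nsubj", 3), ("barked", "ROOT", 0)]),
    (1900, [("a", "det", 2), ("dog", "nsubj", 3), ("barked", "ROOT", 0)]),
    (1950, [("the", "det", 2), ("dog", "nsubj", 3), ("slept", "ROOT", 0)]),
]


def arcs(tokens):
    """Yield one (head_word, label, modifier_word) fragment per non-root token."""
    for word, label, head in tokens:
        if head > 0:
            yield (tokens[head - 1][0], label, word)


def count_by_year(parsed):
    """Map each fragment to a Counter of year -> number of occurrences."""
    counts = defaultdict(Counter)
    for year, tokens in parsed:
        for fragment in arcs(tokens):
            counts[fragment][year] += 1
    return counts


counts = count_by_year(PARSED)
```

Larger fragments (for example, a head with two modifiers) could be counted the same way by yielding bigger tuples from `arcs`; the per-year `Counter` is what allows a fragment's frequency to be tracked over time.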
Original language | English |
---|---|
Title of host publication | SEM 2013 - 2nd Joint Conference on Lexical and Computational Semantics, Proceedings of the Main Conference and the Shared Task |
Subtitle of host publication | Semantic Textual Similarity |
Editors | Mona Diab, Tim Baldwin, Marco Baroni |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 241-247 |
Number of pages | 7 |
ISBN (Electronic) | 9781937284480 |
State | Published - 2013 |
Event | 2nd Joint Conference on Lexical and Computational Semantics, SEM 2013 - Atlanta, United States; Duration: 13 Jun 2013 → 14 Jun 2013 |
Publication series
Name | SEM 2013 - 2nd Joint Conference on Lexical and Computational Semantics, Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity |
---|
Conference
Conference | 2nd Joint Conference on Lexical and Computational Semantics, SEM 2013 |
---|---|
Country/Territory | United States |
City | Atlanta |
Period | 13/06/13 → 14/06/13 |
Bibliographical note
Publisher Copyright: © 2013 Association for Computational Linguistics.
Funding
This work was supported in part by the DARPA TransTac program.
Funders | Funder number |
---|---|
Defense Advanced Research Projects Agency |