A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books

Yoav Goldberg, Jon Orwant

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We created a dataset of syntactic-ngrams (counted dependency-tree fragments) based on a corpus of 3.5 million English books. The dataset includes over 10 billion distinct items covering a wide range of syntactic configurations. It also includes temporal information, facilitating new kinds of research into lexical semantics over time. This paper describes the dataset, the syntactic representation, and the kinds of information provided.

Original language: English
Title of host publication: SEM 2013 - 2nd Joint Conference on Lexical and Computational Semantics, Proceedings of the Main Conference and the Shared Task
Subtitle of host publication: Semantic Textual Similarity
Editors: Mona Diab, Tim Baldwin, Marco Baroni
Publisher: Association for Computational Linguistics (ACL)
Pages: 241-247
Number of pages: 7
ISBN (Electronic): 9781937284480
State: Published - 2013
Event: 2nd Joint Conference on Lexical and Computational Semantics, SEM 2013 - Atlanta, United States
Duration: 13 Jun 2013 → 14 Jun 2013

Publication series

Name: SEM 2013 - 2nd Joint Conference on Lexical and Computational Semantics, Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

Conference

Conference: 2nd Joint Conference on Lexical and Computational Semantics, SEM 2013
Country/Territory: United States
City: Atlanta
Period: 13/06/13 → 14/06/13

Bibliographical note

Publisher Copyright: © 2013 Association for Computational Linguistics.

Funding

This work was supported in part by the DARPA TransTac program.

Funder: Defense Advanced Research Projects Agency (no funder number listed)
