A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books

Yoav Goldberg, Jon Orwant

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

96 Scopus citations

Abstract

We created a dataset of syntactic-ngrams (counted dependency-tree fragments) based on a corpus of 3.5 million English books. The dataset includes over 10 billion distinct items covering a wide range of syntactic configurations. It also includes temporal information, facilitating new kinds of research into lexical semantics over time. This paper describes the dataset, the syntactic representation, and the kinds of information provided.
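To make "counted dependency-tree fragments" concrete, here is a minimal sketch that counts the simplest such fragment, a single head-to-dependent arc, per publication year. The toy parses, the tuple layout, the dependency labels, and the function names `arcs` and `count_arcs_by_year` are all illustrative assumptions; they do not reflect the dataset's actual file format, fragment types, or the parser the authors used.

```python
from collections import Counter

# Hypothetical toy input: (year, tokens), where each token is
# (word, head_index, dependency_label) and head_index is -1 for the root.
# These sentences and labels are invented for illustration only.
PARSES = [
    (1990, [("dogs", 1, "nsubj"), ("chase", -1, "ROOT"), ("cats", 1, "dobj")]),
    (1990, [("dogs", 1, "nsubj"), ("chase", -1, "ROOT"), ("cars", 1, "dobj")]),
    (2000, [("cats", 1, "nsubj"), ("chase", -1, "ROOT"), ("mice", 1, "dobj")]),
]


def arcs(tokens):
    """Yield (head_word, label, dependent_word) for every non-root token."""
    for word, head, label in tokens:
        if head >= 0:
            yield (tokens[head][0], label, word)


def count_arcs_by_year(parses):
    """Count each single-arc fragment separately for each year."""
    counts = Counter()
    for year, tokens in parses:
        for arc in arcs(tokens):
            counts[(arc, year)] += 1
    return counts


counts = count_arcs_by_year(PARSES)
for (arc, year), n in sorted(counts.items(), key=lambda item: (item[0][1], item[0][0])):
    print(year, arc, n)
```

Keying the counter on both the fragment and the year is what allows a fragment's frequency to be tracked over time; larger fragments (for example a head with two dependents) could be counted the same way under a richer key.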

Original language: English
Title of host publication: *SEM 2013 - 2nd Joint Conference on Lexical and Computational Semantics
Publisher: Association for Computational Linguistics (ACL)
Pages: 241-247
Number of pages: 7
ISBN (Electronic): 9781937284480
State: Published - 2013
Event: 2nd Joint Conference on Lexical and Computational Semantics, *SEM 2013 - Atlanta, United States
Duration: 13 Jun 2013 – 14 Jun 2013

Publication series

Name: *SEM 2013 - 2nd Joint Conference on Lexical and Computational Semantics
Volume: 1

Conference

Conference: 2nd Joint Conference on Lexical and Computational Semantics, *SEM 2013
Country/Territory: United States
City: Atlanta
Period: 13/06/13 – 14/06/13

Bibliographical note

Publisher Copyright:
© 2013 Association for Computational Linguistics
