Lexical richness revisited: Blueprint for a more economical measure

Noam Ordan, Victoria Itskovich, Miriam Shlesinger, Ido Kanter

Research output: Contribution to journal › Review article › peer-review

2 Scopus citations

Abstract

According to Heaps' Law, the number of types, S, in a natural text consisting of L tokens follows a power law, S = K · L^β, where K and β are constants. After presenting the theoretical background, we try to predict the asymptotic behaviour of Heaps' Law using a 25-million-token corpus of original English. We then examine this law in an English corpus in which original English texts (O) are compared with three subcorpora of translations into English (T) from three different languages. We show that (1) K and β change along the L-axis: as K grows, β gets smaller; (2) K is larger in the translations and β is larger in the originals; (3) for a given L, S is higher in O than in T. We then show a more economical way to tell O from T, based on the increase in types belonging to specific parts of speech. Finally, we discuss the consequences of this research for information retrieval and for translation studies.
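The power law in the abstract can be illustrated with a minimal sketch that estimates K and β from a tokenized text. It is not the authors' procedure: it assumes a plain least-squares fit of log S = log K + β · log L, and the function names `type_token_curve` and `fit_heaps` and the `step` parameter are hypothetical, introduced only for this example.

```python
import math


def type_token_curve(tokens, step=1000):
    """Return (L, S) pairs: the number of distinct types S after every `step` tokens."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve


def fit_heaps(curve):
    """Fit log S = log K + beta * log L by ordinary least squares.

    Assumption: a simple log-log regression; the paper may fit differently.
    Needs at least two points with distinct L. Returns (K, beta).
    """
    xs = [math.log(L) for L, _ in curve]
    ys = [math.log(S) for _, S in curve]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta
```

Since the abstract reports that K and β change along the L-axis, fitting the same text over different ranges of L (for example, the first and last halves of the curve) would be expected to give different estimates, so any single pair of values should be read together with the range of L it was fitted on.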

Original language: English
Pages (from-to): 55-67
Number of pages: 13
Journal: Journal of Quantitative Linguistics
Volume: 17
Issue number: 1
State: Published - Feb 2010

Bibliographical note

Funding Information:
The research underlying this article was partially supported by the Israel Science Foundation, grant no. 1180/06. We are grateful to Dr Brenda Malkiel from Bar Ilan University and to Dr Pietro Bortone from the University of Illinois. This article was conducted partly within the PhD framework of Noam Ordan.

Funding

Funder: Israel Science Foundation
Funder number: 1180/06
