Abstract
According to Heaps' Law, the number of types, S, in a natural text consisting of L tokens grows according to a power law, $S = K \cdot L^{\beta}$, where K and β are constants. After presenting the theoretical background, we try to predict the asymptotic behaviour of Heaps' Law using a 25-million-token corpus of original English. We then examine the law in an English corpus in which original English texts (O) are compared with three subcorpora of translations into English (T) from three different languages. We show that (1) K and β change along the L-axis: as K grows, β gets smaller; (2) K is larger in the translations and β is larger in the originals; (3) for a given L, S of O is higher than S of T. We then show a more economical way to tell O from T, based on the increase in types belonging to specific parts of speech. Finally, we discuss the consequences of this research for information retrieval and for translation studies.
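As an illustration of the power law stated in the abstract, the sketch below counts types against tokens and estimates K and β by ordinary least squares on the log-log form, log S = log K + β · log L. The abstract does not say how the authors fitted the constants, so the fitting method, the function names `type_growth` and `fit_heaps`, and the toy token list are all assumptions made for illustration only.

```python
import math


def type_growth(tokens):
    """Return (L, S) pairs: S distinct types seen after the first L tokens."""
    seen = set()
    curve = []
    for position, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((position, len(seen)))
    return curve


def fit_heaps(curve):
    """Estimate (K, beta) in S = K * L**beta.

    Assumption: a plain least-squares line through (log L, log S).
    This is one common choice, not necessarily the paper's method.
    """
    xs = [math.log(length) for length, _ in curve]
    ys = [math.log(types) for _, types in curve]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                      # slope of the log-log line
    k = math.exp(mean_y - beta * mean_x)  # intercept mapped back from log space
    return k, beta


# Hypothetical toy input; a real study would use a tokenised corpus.
toy_tokens = "the cat saw the dog and the dog saw a bird".split()
K, beta = fit_heaps(type_growth(toy_tokens))
print(f"K = {K:.3f}, beta = {beta:.3f}")
```

Because the abstract reports that K and β change along the L-axis, a fit of this kind would presumably be repeated over separate ranges of L rather than once over the whole curve.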
Original language | English |
---|---|
Pages (from-to) | 55-67 |
Number of pages | 13 |
Journal | Journal of Quantitative Linguistics |
Volume | 17 |
Issue number | 1 |
State | Published - Feb 2010 |
Bibliographical note
Funding Information: The research underlying this article was partially supported by the Israel Science Foundation, grant no. 1180/06. We are grateful to Dr Brenda Malkiel from Bar Ilan University and to Dr Pietro Bortone from the University of Illinois. This article was conducted partly within the PhD framework of Noam Ordan.
Funding
Funders | Funder number |
---|---|
Israel Science Foundation | 1180/06 |