TY - JOUR

T1 - Numerical analysis of word frequencies in artificial and natural language texts

AU - Cohen, A.

AU - Mantegna, R. N.

AU - Havlin, S.

PY - 1997/3

Y1 - 1997/3

N2 - We perform a numerical study of the statistical properties of natural texts written in English and of two types of artificial texts. As statistical tools we use the conventional Zipf analysis of the distribution of words and the inverse Zipf analysis of the distribution of frequencies of words, the analysis of vocabulary growth, the Shannon entropy and a quantity which is a nonlinear function of frequencies of words, the frequency "entropy". Our numerical results, obtained by investigation of eight complete books and sixteen related artificial texts, suggest that, among these analyses, the analysis of vocabulary growth shows the most striking difference between natural and artificial texts. Our results also suggest that, among these analyses, those that give greater weight to low-frequency words succeed better in distinguishing between natural and artificial texts. The inverse Zipf analysis seems to succeed better than the conventional Zipf analysis and the frequency "entropy" better than the usual word entropy. By studying the scaling behavior of both entropies as a function of the total number of words T of the investigated text, we find that the word relative entropy scales with the same functional form for both natural and artificial texts but with a different parameter, whereas the frequency relative "entropy" decreases monotonically with T for the artificial texts but has a minimum at T ≈ 10^4 for the natural texts.

AB - We perform a numerical study of the statistical properties of natural texts written in English and of two types of artificial texts. As statistical tools we use the conventional Zipf analysis of the distribution of words and the inverse Zipf analysis of the distribution of frequencies of words, the analysis of vocabulary growth, the Shannon entropy and a quantity which is a nonlinear function of frequencies of words, the frequency "entropy". Our numerical results, obtained by investigation of eight complete books and sixteen related artificial texts, suggest that, among these analyses, the analysis of vocabulary growth shows the most striking difference between natural and artificial texts. Our results also suggest that, among these analyses, those that give greater weight to low-frequency words succeed better in distinguishing between natural and artificial texts. The inverse Zipf analysis seems to succeed better than the conventional Zipf analysis and the frequency "entropy" better than the usual word entropy. By studying the scaling behavior of both entropies as a function of the total number of words T of the investigated text, we find that the word relative entropy scales with the same functional form for both natural and artificial texts but with a different parameter, whereas the frequency relative "entropy" decreases monotonically with T for the artificial texts but has a minimum at T ≈ 10^4 for the natural texts.

UR - http://www.scopus.com/inward/record.url?scp=0002999358&partnerID=8YFLogxK

U2 - 10.1142/S0218348X97000103

DO - 10.1142/S0218348X97000103

M3 - Article

AN - SCOPUS:0002999358

SN - 0218-348X

VL - 5

SP - 95

EP - 104

JO - Fractals

JF - Fractals

IS - 1

ER -