Statistical and linguistic features of noncoding DNA: A heterogeneous «Complex system»

H. E. Stanley, S. V. Buldyrev, A. L. Goldberger, S. Havlin, R. N. Mantegna, C. K. Peng, M. Simons

Research output: Contribution to journal › Article › peer-review

13 Scopus citations

Abstract

We present evidence supporting the idea that the DNA sequence in genes containing noncoding regions is correlated, and that the correlation is remarkably long range; indeed, base pairs thousands of base pairs apart are correlated. We do not find such a long-range correlation in the coding regions of the gene; we utilize this fact to build a Coding Sequence Finder algorithm, which uses statistical ideas to locate the coding regions of an unknown DNA sequence. We resolve the problem of the «non-stationarity» feature of the sequence of base pairs (that the relative concentration of purines and pyrimidines changes in different regions of the mosaic-like chain) by describing a new algorithm called Detrended Fluctuation Analysis (DFA). We address the claim of Voss that there is no difference in the statistical properties of coding and noncoding regions of DNA by systematically applying the DFA algorithm, as well as standard FFT analysis, to every DNA sequence (33 301 coding and 29 453 noncoding) in the entire GenBank database. We describe a simple model to account for the presence of long-range power law correlations (and the systematic variation of the scaling exponent α with evolution) which is based upon a generalization of the classic Lévy walk. Finally, we describe briefly some recent work showing that the noncoding sequences have certain statistical features in common with natural languages. Specifically, we adapt to DNA the Zipf approach to analyzing linguistic texts, and the Shannon approach to quantifying the «redundancy» of a linguistic text in terms of a measurable entropy function. We suggest that noncoding regions in eukaryotes may display a smaller entropy and larger redundancy than coding regions for plants and invertebrates, further supporting the possibility that noncoding regions of DNA may carry biological information.
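The abstract names Detrended Fluctuation Analysis but does not spell out its steps, so the following is only a minimal sketch of the procedure as it is commonly described, not the authors' implementation. It assumes a purine/pyrimidine mapping to steps of -1 and +1 (the sign convention is arbitrary here), a linear detrend within non-overlapping boxes, and a least-squares fit of log F(n) against log n to estimate the scaling exponent α. The function names `dna_walk`, `dfa_fluctuation`, and `dfa_exponent`, and the choice of box sizes, are hypothetical and chosen for illustration.

```python
import numpy as np

PURINES = {"A", "G"}  # assumption: every other base counts as a pyrimidine


def dna_walk(seq):
    """Map bases to -1 (purine) / +1 (pyrimidine) steps and integrate them."""
    steps = np.array([-1.0 if base in PURINES else 1.0 for base in seq.upper()])
    # Subtracting the mean step is optional; the local detrend removes drift anyway.
    return np.cumsum(steps - steps.mean())


def dfa_fluctuation(profile, n):
    """RMS deviation of the walk from a local linear trend in boxes of length n."""
    n_boxes = len(profile) // n
    x = np.arange(n)
    mean_squares = []
    for k in range(n_boxes):
        box = profile[k * n:(k + 1) * n]
        slope, intercept = np.polyfit(x, box, 1)  # linear trend inside this box
        residual = box - (slope * x + intercept)
        mean_squares.append(np.mean(residual ** 2))
    return np.sqrt(np.mean(mean_squares))


def dfa_exponent(seq, box_sizes):
    """Estimate alpha as the slope of log F(n) versus log n."""
    profile = dna_walk(seq)
    fluct = [dfa_fluctuation(profile, n) for n in box_sizes]
    alpha, _ = np.polyfit(np.log(box_sizes), np.log(fluct), 1)
    return alpha


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    random_seq = "".join(rng.choice(list("ACGT"), size=20000))
    print(dfa_exponent(random_seq, [8, 16, 32, 64, 128, 256]))
```

For an uncorrelated random sequence the estimate should lie near 0.5; in this framework, long-range power-law correlations of the kind the abstract reports for noncoding regions would correspond to a different value of α.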

Original language: English
Pages (from-to): 1339-1356
Number of pages: 18
Journal: Il Nuovo Cimento D
Volume: 16
Issue number: 9
State: Published - Sep 1994

Keywords

  • Conference proceedings
  • General, theoretical, and mathematical biophysics (including logic of biosystems, quantum biology, and relevant aspects of thermodynamics, information theory, cybernetics, and bionics)
  • Statistical mechanics
