Optimal Probabilistic Generation of XML Documents

Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, P. Senellart

Research output: Contribution to journal › Article › peer-review


Given a corpus of XML documents and its schema, we study the problem of finding an optimal (generative) probabilistic model, where optimality means maximizing the likelihood that the model generates this particular corpus. Focusing first on the structure of documents, we present an efficient algorithm for finding the best generative probabilistic model in the absence of constraints. We further study the problem in the presence of integrity constraints, namely key, inclusion, and domain constraints. In this case we study two different kinds of generators. First, we consider a continuation-test generator that performs tests of schema satisfiability while generating documents; these tests prevent the generation of a document that violates the constraints but, as we will see, they are computationally expensive. We also study a restart generator that may generate an invalid document and, when this is the case, restarts and tries again. Finally, we consider the injection of data values into the structure to obtain a full XML document, and we study different approaches for generating these values.
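The restart idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the toy "document" (a flat list of ids), the uniqueness check standing in for a key constraint, and all function names and parameters (`generate_candidate`, `satisfies_key`, `restart_generate`, `stop_prob`, `max_restarts`) are assumptions made purely for illustration.

```python
import random


def generate_candidate(rng, choices, stop_prob):
    """Draw a toy 'document': a list of ids, stopping with probability stop_prob."""
    doc = []
    while rng.random() >= stop_prob:
        doc.append(rng.choice(choices))
    return doc


def satisfies_key(doc):
    """Toy key constraint (an assumption): no id may occur twice."""
    return len(set(doc)) == len(doc)


def restart_generate(rng, choices, stop_prob, max_restarts=10_000):
    """Generate candidates until one is valid; return it with the restart count."""
    for restarts in range(max_restarts):
        doc = generate_candidate(rng, choices, stop_prob)
        if satisfies_key(doc):
            return doc, restarts
    raise RuntimeError("no valid document found within max_restarts")


rng = random.Random(0)  # fixed seed so the run is reproducible
doc, restarts = restart_generate(rng, choices=["a", "b", "c"], stop_prob=0.4)
print(doc, restarts)
```

In this sketch the generator never checks validity during generation; it only validates the finished candidate and discards it on failure, which is the trade-off the abstract contrasts with the more expensive continuation tests.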

Original language: English
Pages (from-to): 806-842
Number of pages: 37
Journal: Theory of Computing Systems
Issue number: 4
Early online date: 21 Oct 2014
State: Published - 1 Nov 2015
Externally published: Yes

Bibliographical note

Funding Information:
We would like to thank Yann Ollivier for insightful comments, and Siqi Liu for feedback on the proof of Theorem 5. This work has been supported in part by the Advanced European Research Council grants Webdam, agreement 226513 (http://webdam.inria.fr/), and MoDaS, agreement 291071 (http://www.math.tau.ac.il/~milo/projects/modas/), by the Israel Ministry of Science, and by the US–Israel Binational Science Foundation.

Publisher Copyright:
© 2014, Springer Science+Business Media New York.


  • Constraints
  • Generator
  • Probabilistic model
  • Schema
  • XML


