Most information extraction systems focus on the textual content of documents. They treat a document as a sequence of words and disregard the physical and typographical layout of the information. While this strategy focuses the extraction process on the key semantic content of the document, much valuable information can also be derived from the document's physical appearance. Fonts, physical positioning, and other graphical characteristics are often used to provide additional context for the information, and this context is lost in pure-text analysis. In this paper we describe a general procedure for structural extraction, which automatically extracts entities from a document based on their visual characteristics and relative position in the document layout. Our structural extraction procedure is a learning algorithm that generalizes automatically from examples. The procedure is general and applies to any document format that carries visual and typographical information. We also describe a specific implementation of the procedure for PDF documents, called PES (PDF Extraction System). PES is able to extract fields such as Author(s), Title, and Date with very high accuracy.
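To make the idea of learning from layout examples concrete, the following is a minimal sketch and not the PES algorithm itself, which the abstract does not specify. It assumes each text block has already been pulled from a page with its font size, weight, and normalised position. The `Block` type, the four features, the nearest-prototype rule, and all sample strings are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Block:
    """A text block with visual attributes (hypothetical representation)."""
    text: str
    font_size: float  # in points
    bold: bool
    x: float          # left edge as a fraction of page width (0..1)
    y: float          # top edge as a fraction of page height (0 = top)


def features(block, page_blocks):
    """Map a block to layout features, scaling font size by the page maximum."""
    largest = max(b.font_size for b in page_blocks)
    return (block.font_size / largest, float(block.bold), block.x, block.y)


def learn_prototypes(examples):
    """Average the feature vectors of the labelled examples for each field."""
    sums, counts = {}, {}
    for label, block, page_blocks in examples:
        vec = features(block, page_blocks)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc)
            for label, acc in sums.items()}


def extract(page_blocks, prototypes):
    """For each field, return the text of the block nearest to its prototype."""
    return {
        label: min(page_blocks,
                   key=lambda b: dist(features(b, page_blocks), proto)).text
        for label, proto in prototypes.items()
    }


# Made-up training page: one labelled Title block and one labelled Author block.
train_page = [
    Block("An Example Paper Title", 18, True, 0.20, 0.08),
    Block("A. Author, B. Author", 12, False, 0.30, 0.14),
    Block("Abstract text ...", 10, False, 0.10, 0.25),
]
examples = [
    ("Title", train_page[0], train_page),
    ("Author", train_page[1], train_page),
]
prototypes = learn_prototypes(examples)

# Made-up unseen page with different text and font sizes but a similar layout.
new_page = [
    Block("1 Introduction", 12, True, 0.10, 0.40),
    Block("Jane Doe and John Roe", 11, False, 0.32, 0.15),
    Block("A Different Paper Title", 16, True, 0.22, 0.07),
    Block("Body text ...", 10, False, 0.10, 0.45),
]
result = extract(new_page, prototypes)
print(result)
```

The sketch never looks at the words themselves: the title and author are found purely from relative font size, weight, and position, which is the kind of signal a pure-text system discards. A practical system would need many more labelled examples and a richer model than a single averaged prototype per field.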
|Number of pages||8|
|State||Published - 2002|
|Event||Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM 2002) - McLean, VA, United States|
Duration: 4 Nov 2002 → 9 Nov 2002