TY - JOUR
T1 - Visual information extraction
AU - Aumann, Yonatan
AU - Feldman, Ronen
AU - Liberzon, Yair
AU - Rosenfeld, Benjamin
AU - Schler, Jonathan
PY - 2006/7
Y1 - 2006/7
N2 - Typographic and visual information is an integral part of textual documents. Most information extraction (IE) systems ignore most of this visual information, processing the text as a linear sequence of words. Thus, much valuable information is lost. In this paper, we show how to make use of this visual information for IE. We present an algorithm that automatically extracts specific fields of the document (such as the title, author, etc.) based exclusively on the visual formatting of the document, without any reference to the semantic content. The algorithm employs a machine learning approach, whereby the system is first provided with a set of training documents in which the target fields are manually tagged and automatically learns how to extract these fields in future documents. We implemented the algorithm in a system for automatic analysis of documents in PDF format. We present experimental results of applying the system on a set of financial documents, extracting nine different target fields. Overall, the system achieved 90% accuracy.
AB - Typographic and visual information is an integral part of textual documents. Most information extraction (IE) systems ignore most of this visual information, processing the text as a linear sequence of words. Thus, much valuable information is lost. In this paper, we show how to make use of this visual information for IE. We present an algorithm that automatically extracts specific fields of the document (such as the title, author, etc.) based exclusively on the visual formatting of the document, without any reference to the semantic content. The algorithm employs a machine learning approach, whereby the system is first provided with a set of training documents in which the target fields are manually tagged and automatically learns how to extract these fields in future documents. We implemented the algorithm in a system for automatic analysis of documents in PDF format. We present experimental results of applying the system on a set of financial documents, extracting nine different target fields. Overall, the system achieved 90% accuracy.
KW - Information extraction
KW - PDF analysis
KW - Text analysis
KW - Wrapper induction
UR - http://www.scopus.com/inward/record.url?scp=33745942846&partnerID=8YFLogxK
U2 - 10.1007/s10115-006-0014-x
DO - 10.1007/s10115-006-0014-x
M3 - Article
AN - SCOPUS:33745942846
SN - 0219-1377
VL - 10
SP - 1
EP - 15
JO - Knowledge and Information Systems
JF - Knowledge and Information Systems
IS - 1
ER -