TY - JOUR
T1 - TEG - A hybrid approach to information extraction
AU - Feldman, Ronen
AU - Rosenfeld, Benjamin
AU - Fresko, Moshe
PY - 2006/1
Y1 - 2006/1
N2 - This paper describes a hybrid statistical and knowledge-based information extraction model, able to extract entities and relations at the sentence level. The model attempts to retain and improve the high accuracy levels of knowledge-based systems while drastically reducing the amount of manual labour by relying on statistics drawn from a training corpus. The implementation of the model, called TEG (trainable extraction grammar), can be adapted to any IE domain by writing a suitable set of rules in a SCFG (stochastic context-free grammar)-based extraction language and training them using an annotated corpus. The system does not contain any purely linguistic components, such as PoS tagger or shallow parser, but allows to using external linguistic components if necessary. We demonstrate the performance of the system on several named entity extraction and relation extraction tasks. The experiments show that our hybrid approach outperforms both purely statistical and purely knowledge-based systems, while requiring orders of magnitude less manual rule writing and smaller amounts of training data. We also demonstrate the robustness of our system under conditions of poor training-data quality.
AB - This paper describes a hybrid statistical and knowledge-based information extraction model, able to extract entities and relations at the sentence level. The model attempts to retain and improve the high accuracy levels of knowledge-based systems while drastically reducing the amount of manual labour by relying on statistics drawn from a training corpus. The implementation of the model, called TEG (trainable extraction grammar), can be adapted to any IE domain by writing a suitable set of rules in a SCFG (stochastic context-free grammar)-based extraction language and training them using an annotated corpus. The system does not contain any purely linguistic components, such as PoS tagger or shallow parser, but allows to using external linguistic components if necessary. We demonstrate the performance of the system on several named entity extraction and relation extraction tasks. The experiments show that our hybrid approach outperforms both purely statistical and purely knowledge-based systems, while requiring orders of magnitude less manual rule writing and smaller amounts of training data. We also demonstrate the robustness of our system under conditions of poor training-data quality.
KW - Hidden Markov models
KW - Hybrid approaches
KW - Information extraction
KW - Rule bases systems
KW - Stochastic context free grammars
KW - Text mining
UR - http://www.scopus.com/inward/record.url?scp=32544451271&partnerID=8YFLogxK
U2 - 10.1007/s10115-005-0204-y
DO - 10.1007/s10115-005-0204-y
M3 - ???researchoutput.researchoutputtypes.contributiontojournal.article???
AN - SCOPUS:32544451271
SN - 0219-1377
VL - 9
SP - 1
EP - 18
JO - Knowledge and Information Systems
JF - Knowledge and Information Systems
IS - 1
ER -