TY - GEN
T1 - Minimizing manual annotation cost in supervised training from corpora
AU - Engelson, Sean P
AU - Dagan, I.
N1 - Place of conference:Santa Cruz, California, USA
PY - 1996
Y1 - 1996
N2 - Corpus-based methods for natural language processing often use supervised training, requiring expensive manual annotation of training corpora. This paper investigates methods for reducing annotation cost by sample selection. In this approach, during training the learning program examines many unlabeled examples and selects for labeling (annotation) only those that are most informative at each stage. This avoids redundantly annotating examples that contribute little new information. This paper extends our previous work on committee-based sample selection for probabilistic classifiers. We describe a family of methods for committee-based sample selection, and report experimental results for the task of stochastic part-of-speech tagging. We find that all variants achieve a significant reduction in annotation cost, though their computational efficiency differs. In particular, the simplest method, which has no parameters to tune, gives excellent results. We also show that sample selection yields a significant reduction in the size of the model used by the tagger.
AB - Corpus-based methods for natural language processing often use supervised training, requiring expensive manual annotation of training corpora. This paper investigates methods for reducing annotation cost by sample selection. In this approach, during training the learning program examines many unlabeled examples and selects for labeling (annotation) only those that are most informative at each stage. This avoids redundantly annotating examples that contribute little new information. This paper extends our previous work on committee-based sample selection for probabilistic classifiers. We describe a family of methods for committee-based sample selection, and report experimental results for the task of stochastic part-of-speech tagging. We find that all variants achieve a significant reduction in annotation cost, though their computational efficiency differs. In particular, the simplest method, which has no parameters to tune, gives excellent results. We also show that sample selection yields a significant reduction in the size of the model used by the tagger.
UR - https://scholar.google.co.il/scholar?q=Minimizing+Manual+Annotation+Cost+in+Supervised+Training+from+Corpora+&btnG=&hl=en&as_sdt=0%2C5
M3 - Conference contribution
BT - 34th annual meeting on Association for Computational Linguistics
PB - Association for Computational Linguistics
ER -