Sample selection in natural language learning

Sean P Engelson, I. Dagan

Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review


Many corpus-based methods for natural language processing are based on supervised training, requiring expensive manual annotation of training corpora. This paper investigates reducing annotation cost by sample selection. In this approach, the learner examines many unlabeled examples and selects for labeling only those that are most informative at each stage of training. In this way it is possible to avoid redundantly annotating examples that contribute little new information. The paper first analyzes the issues that need to be addressed when constructing a sample selection algorithm, arguing for the attractiveness of committee-based selection methods. We then focus on selection for training probabilistic classifiers, which are commonly applied to problems in statistical natural language processing. We report experimental results of applying a specific type of committee-based selection during training of a stochastic part-of-speech tagger, and demonstrate substantially improved learning rates over complete training using all of the text.
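The selection step described in the abstract can be illustrated with a minimal sketch. It assumes a committee of classifiers represented as plain callables and uses vote entropy as the disagreement measure. The abstract does not state which measure or threshold the authors use, so the names `vote_entropy`, `select_for_labeling`, and `threshold` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy (in bits) of the labels voted by committee members for one example."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_for_labeling(examples, committee, threshold=0.5):
    """Return the examples on which the committee disagrees strongly.

    `committee` is a list of callables mapping an example to a label.
    `threshold` is an assumed cut-off, not a value taken from the paper.
    """
    selected = []
    for x in examples:
        votes = [member(x) for member in committee]
        if vote_entropy(votes) > threshold:
            selected.append(x)  # informative: worth the annotation cost
    return selected


# Toy committee of three hand-written "taggers" (purely illustrative).
committee = [
    lambda w: "N",
    lambda w: "V" if w == "run" else "N",
    lambda w: "V" if w in {"run", "walk"} else "N",
]
chosen = select_for_labeling(["dog", "run", "walk"], committee)
```

In this toy setup the committee agrees unanimously on "dog", so that example is skipped, while the split votes on "run" and "walk" mark them as candidates for manual annotation.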
Original language: American English
Title of host publication: International Joint Conference on Artificial Intelligence
Editors: Stefan Wermter, Ellen Riloff, Gabriele Scheler
Publisher: Springer Berlin Heidelberg
ISBN (Print): 978-3-540-49738-7
State: Published - 1995

Publication series

Name: Lecture Notes in Computer Science

