Abstract
We propose an unsupervised approach to
template induction for information extraction,
based on detecting sub-topics and themes that cut
across the documents of a topical corpus. We
introduce a new method, cross component
clustering, that simultaneously clusters the
components forming our setting, each of which
consists of the words of a single article. Our
algorithm is derived from the Information
Bottleneck clustering algorithm. The resulting
clusters are found to be in systematic
correspondence with sets of terms that are used
in filling the slots of the MUC3/4 ready-made
template, which was used for evaluation.
Original language | American English |
---|---|
Title of host publication | Workshop on Text Learning (TextML-2002) |
State | Published - 2002 |