Using sequence classification for filtering web pages

Binyamin Rosenfeld, Ronen Feldman, Lyle Ungar

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citations

Abstract

Web pages often contain text that is irrelevant to their main content, such as advertisements, generic format elements, and references to other pages on the same site. When used by automatic content-processing systems, e.g., for Web indexing, text classification, or information extraction, this irrelevant text often produces a substantial amount of noise. This paper describes a trainable filtering system based on a feature-rich sequence classifier that removes irrelevant parts from pages, while keeping the content intact. Most of the features the system uses are purely form-related: HTML tags and their positions, sizes of elements, etc. This keeps the system general and domain-independent. We also experiment with content words and show that while they perform very poorly alone, they can slightly improve the performance of pure-form features, without jeopardizing the domain-independence. Our system achieves very high accuracy (95% and above) on several collections of Web pages. We also do a series of tests with different features and different classifiers, comparing the contribution of different components to the system performance, and comparing two known sequence classifiers, Robust Risk Minimization (RRM) and Conditional Random Fields (CRF), in a novel setting.
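To illustrate what "purely form-related" features might look like, here is a minimal sketch in Python using only the standard library. It is not the authors' feature set or implementation: the block-tag inventory, the feature names (`tag`, `position`, `text_len`, `link_len`), and the function `extract_form_features` are all assumptions made for this example. It turns a page into an ordered sequence of per-block feature dictionaries, which is the kind of input a sequence classifier such as a CRF could label as content or noise.

```python
from html.parser import HTMLParser

# Hypothetical inventory of block-level tags (an assumption for this sketch).
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3"}


class BlockFeatureExtractor(HTMLParser):
    """Collects one form-only feature dict per text-bearing block, in page order."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished feature dicts
        self._current = None   # block currently being filled
        self._link_depth = 0   # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()
            self._current = {"tag": tag, "text_len": 0, "link_len": 0}

    def handle_endtag(self, tag):
        if tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        n = len(data.strip())
        if n and self._current is not None:
            self._current["text_len"] += n
            if self._link_depth:
                self._current["link_len"] += n  # characters inside links

    def _flush(self):
        # Keep only blocks that actually carried text; record their order.
        if self._current and self._current["text_len"] > 0:
            self._current["position"] = len(self.blocks)
            self.blocks.append(self._current)
        self._current = None


def extract_form_features(html):
    """Return the page as a sequence of form-only feature dicts (no content words)."""
    parser = BlockFeatureExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit a trailing block left open by unclosed markup
    return parser.blocks


page = '<div><p>Main article text here.</p><li><a href="/x">Home</a></li></div>'
features = extract_form_features(page)
```

In this sketch a block whose `link_len` equals its `text_len` consists entirely of link text, a pattern typical of navigation menus. Because no feature looks at the words themselves, the representation stays independent of the page's topic, which is the property the abstract attributes to form-related features.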

Original language: English
Title of host publication: Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08
Pages: 1355-1356
Number of pages: 2
DOIs
State: Published - 2008
Externally published: Yes
Event: 17th ACM Conference on Information and Knowledge Management, CIKM'08 - Napa Valley, CA, United States
Duration: 26 Oct 2008 to 30 Oct 2008

Publication series

Name: International Conference on Information and Knowledge Management, Proceedings

Conference

Conference: 17th ACM Conference on Information and Knowledge Management, CIKM'08
Country/Territory: United States
City: Napa Valley, CA
Period: 26/10/08 to 30/10/08

Keywords

  • Sequence classification
  • Text mining
  • Web page cleaning
