TY - GEN
T1 - Using sequence classification for filtering web pages
AU - Rosenfeld, Binyamin
AU - Feldman, Ronen
AU - Ungar, Lyle
PY - 2008
Y1 - 2008
N2 - Web pages often contain text that is irrelevant to their main content, such as advertisements, generic format elements, and references to other pages on the same site. When used by automatic content-processing systems, e.g., for Web indexing, text classification, or information extraction, this irrelevant text often produces a substantial amount of noise. This paper describes a trainable filtering system based on a feature-rich sequence classifier that removes irrelevant parts from pages, while keeping the content intact. Most of the features the system uses are purely form-related: HTML tags and their positions, sizes of elements, etc. This keeps the system general and domain-independent. We also experiment with content words and show that while they perform very poorly alone, they can slightly improve the performance of pure-form features, without jeopardizing the domain-independence. Our system achieves very high accuracy (95% and above) on several collections of Web pages. We also do a series of tests with different features and different classifiers, comparing the contribution of different components to the system performance, and comparing two known sequence classifiers, Robust Risk Minimization (RRM) and Conditional Random Fields (CRF), in a novel setting.
AB - Web pages often contain text that is irrelevant to their main content, such as advertisements, generic format elements, and references to other pages on the same site. When used by automatic content-processing systems, e.g., for Web indexing, text classification, or information extraction, this irrelevant text often produces a substantial amount of noise. This paper describes a trainable filtering system based on a feature-rich sequence classifier that removes irrelevant parts from pages, while keeping the content intact. Most of the features the system uses are purely form-related: HTML tags and their positions, sizes of elements, etc. This keeps the system general and domain-independent. We also experiment with content words and show that while they perform very poorly alone, they can slightly improve the performance of pure-form features, without jeopardizing the domain-independence. Our system achieves very high accuracy (95% and above) on several collections of Web pages. We also do a series of tests with different features and different classifiers, comparing the contribution of different components to the system performance, and comparing two known sequence classifiers, Robust Risk Minimization (RRM) and Conditional Random Fields (CRF), in a novel setting.
KW - Sequence classification
KW - Text mining
KW - Web page cleaning
UR - http://www.scopus.com/inward/record.url?scp=70349250028&partnerID=8YFLogxK
U2 - 10.1145/1458082.1458276
DO - 10.1145/1458082.1458276
M3 - Conference contribution
AN - SCOPUS:70349250028
SN - 9781595939913
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 1355
EP - 1356
BT - Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08
T2 - 17th ACM Conference on Information and Knowledge Management, CIKM'08
Y2 - 26 October 2008 through 30 October 2008
ER -