TY - GEN
T1 - Style-based text categorization: What newspaper am I reading?
AU - Argamon-Engelson, Shlomo
AU - Koppel, M.
AU - Avneri, Galit
N1 - Place of conference:USA
PY - 1998
Y1 - 1998
N2 - Most research on automated text categorization has focused on determining the topic of a given text. While topic is generally the main characteristic of an information need, there are other characteristics that are useful for information retrieval. In this paper we consider the problem of text categorization according to style. For example, we may wish to automatically determine if a given text is taken from a magazine or a newspaper, is an editorial or a news item, is promotional or informative, was written by a native English speaker or not, and so on. Learning to determine the style of a document is a dual to that of determining its topic, in that those document features which capture the style of a document are precisely those which are independent of its topic. We here define the features of a document to be the frequencies of each of a set of function words and parts-of-speech triples. We then use machine learning techniques to classify documents. We test our methods on four collections of newspaper and magazine articles.
UR - https://scholar.google.co.il/scholar?q=Which+Newspaper+Am+I+Reading%3A+Style-Based+Text+Categorization+&btnG=&hl=en&as_sdt=0%2C5
M3 - Conference contribution
BT - AAAI Workshop on Text Categorization
ER -