TY - GEN

T1 - A needle in a haystack

T2 - Proceedings, Twenty-First International Conference on Machine Learning, ICML 2004

AU - Crammer, Koby

AU - Chechik, Gal

PY - 2004

Y1 - 2004

AB - This paper addresses the problem of finding a small and coherent subset of points in a given data set. This problem, sometimes referred to as one-class or set covering, requires finding a small-radius ball that covers as many data points as possible. It arises naturally in a wide range of applications, from finding gene modules to extracting document topics, where many data points are irrelevant to the task at hand, or in applications where only positive examples are available. Most previous approaches to this problem focus on identifying and discarding a possible set of outliers. In this paper we adopt the opposite approach, which directly aims to find a small set of coherently structured regions by using a loss function that focuses on local properties of the data. We formalize the learning task as an optimization problem using the Information-Bottleneck principle. An algorithm to solve this optimization problem is then derived and analyzed. Experiments on gene expression data and a text document corpus demonstrate the merits of our approach.

UR - http://www.scopus.com/inward/record.url?scp=14344258245&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:14344258245

SN - 1581138385

T3 - Proceedings, Twenty-First International Conference on Machine Learning, ICML 2004

SP - 201

EP - 208

BT - Proceedings, Twenty-First International Conference on Machine Learning, ICML 2004

A2 - Greiner, R.

A2 - Schuurmans, D.

Y2 - 4 July 2004 through 8 July 2004

ER -