Improved spectral-norm bounds for clustering

Pranjal Awasthi, Or Sheffet

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

66 Scopus citations

Abstract

Aiming to unify known results about clustering mixtures of distributions under separation conditions, Kumar and Kannan [1] introduced a deterministic condition for clustering datasets. They showed that this single deterministic condition encompasses many previously studied clustering assumptions. More specifically, their proximity condition requires that in the target k-clustering, the projection of a point x onto the line joining its cluster center μ and some other center μ′ is a large additive factor closer to μ than to μ′. This additive factor can be roughly described as k times the spectral norm of the matrix representing the differences between the given (known) dataset and the means of the (unknown) target clustering. Clearly, the proximity condition implies center separation: the distance between any two centers must be as large as the above-mentioned bound. In this paper we improve upon the work of Kumar and Kannan [1] along several axes. First, we weaken the center separation bound by a factor of √k, and second, we weaken the proximity condition by a factor of k (in other words, the revised separation condition is independent of k). Using these weaker bounds we still achieve the same guarantees when all points satisfy the proximity condition. Under the same weaker bounds, we achieve even better guarantees when only a (1 - ε)-fraction of the points satisfy the condition. Specifically, we correctly cluster all but an (ε + O(1/c⁴))-fraction of the points, compared to the O(k²ε)-fraction of [1], which is meaningful even in the particular setting when ε is a constant and k = ω(1). Most importantly, we greatly simplify the analysis of Kumar and Kannan. In fact, in the bulk of our analysis we ignore the proximity condition and use only center separation, along with the simple triangle and Markov inequalities. Yet these basic tools suffice to produce a clustering which (i) is correct on all but a constant fraction of the points, (ii) has k-means cost comparable to the k-means cost of the target clustering, and (iii) has centers very close to the target centers. Our improved separation condition allows us to match the results of the Planted Partition Model of McSherry [2], improve upon the results of Ostrovsky et al. [3], and improve separation results for mixture of Gaussian models in a particular setting.
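
The two quantities named in the abstract, the spectral norm of the data-minus-centers matrix and the projection margin of a point between two centers, can be illustrated with a minimal sketch. All function names, the toy data, and the threshold `delta` are hypothetical; the abstract gives the additive factor only roughly, so `delta` is left as a free parameter here rather than the exact bound used in the paper.

```python
import math

def spectral_norm(M, iters=100):
    """Largest singular value of M (a list of rows), estimated by power
    iteration on M^T M, restarted from every standard basis vector."""
    n, d = len(M), len(M[0])
    best = 0.0
    for start in range(d):
        v = [1.0 if j == start else 0.0 for j in range(d)]
        for _ in range(iters):
            Mv = [sum(row[j] * v[j] for j in range(d)) for row in M]
            w = [sum(M[i][j] * Mv[i] for i in range(n)) for j in range(d)]
            norm_w = math.sqrt(sum(c * c for c in w))
            if norm_w == 0.0:
                break  # this start vector lies in the null space of M
            # For a unit vector v, ||M^T M v|| never exceeds sigma_max^2.
            best = max(best, math.sqrt(norm_w))
            v = [c / norm_w for c in w]
    return best

def residual_matrix(points, labels, centers):
    """Rows are x_i minus the center of the cluster that x_i belongs to."""
    return [[xi - ci for xi, ci in zip(x, centers[lab])]
            for x, lab in zip(points, labels)]

def proximity_margin(x, mu, mu_other):
    """Project x onto the line through mu and mu_other and return how much
    closer the projection is to mu than to mu_other (assumes mu != mu_other)."""
    direction = [b - a for a, b in zip(mu, mu_other)]
    length = math.sqrt(sum(c * c for c in direction))
    # Signed coordinate of the projection, measured from mu toward mu_other.
    t = sum((xi - a) * c for xi, a, c in zip(x, mu, direction)) / length
    return abs(length - t) - abs(t)

def satisfies_proximity(points, labels, centers, delta):
    """True if every point's margin against every other center is >= delta.
    delta is an assumed threshold, not the paper's exact bound."""
    for x, lab in zip(points, labels):
        for other, mu_other in enumerate(centers):
            if other != lab and proximity_margin(x, centers[lab], mu_other) < delta:
                return False
    return True

# Toy data (hypothetical): two well-separated clusters in the plane.
points = [(0.0, 1.0), (0.0, -1.0), (10.0, 1.0), (10.0, -1.0)]
labels = [0, 0, 1, 1]
centers = [(0.0, 0.0), (10.0, 0.0)]

norm = spectral_norm(residual_matrix(points, labels, centers))
k = len(centers)
print(norm, satisfies_proximity(points, labels, centers, delta=k * norm))
```

Restarting the power iteration from each basis vector is a design choice made for this sketch: it avoids the failure case where a single start vector happens to be orthogonal to the top singular direction, at the cost of repeating the iteration once per dimension.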

Original language: English
Title of host publication: Approximation, Randomization, and Combinatorial Optimization
Subtitle of host publication: Algorithms and Techniques - 15th International Workshop, APPROX 2012, and 16th International Workshop, RANDOM 2012, Proceedings
Pages: 37-49
Number of pages: 13
DOIs
State: Published - 2012
Externally published: Yes
Event: 15th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, APPROX 2012 and the 16th International Workshop on Randomization and Computation, RANDOM 2012 - Cambridge, MA, United States
Duration: 15 Aug 2012 - 17 Aug 2012

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 7408 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 15th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, APPROX 2012 and the 16th International Workshop on Randomization and Computation, RANDOM 2012
Country/Territory: United States
City: Cambridge, MA
Period: 15/08/12 - 17/08/12

Funding

Funders: Directorate for Computer and Information Science and Engineering
Funder number: 0830540, 1116892, 1065251
