TY - GEN
T1 - Parallel pairwise clustering
AU - Yom-Tov, Elad
AU - Slonim, Noam
PY - 2009
Y1 - 2009
N2 - Given the pairwise affinity relations associated with a set of data items, the goal of a clustering algorithm is to automatically partition the data into a small number of homogeneous clusters. However, since the input size is quadratic in the number of data points, existing algorithms are infeasible for many practical applications. Here, we propose a simple strategy for clustering massive data by randomly splitting the original affinity matrix into small, manageable affinity matrices that are clustered independently. Our proposal is most appealing in a parallel computing environment, where at each iteration each worker node clusters a subset of the input data and the results from all workers are then integrated at a master node to create a new clustering partition over the entire data. We demonstrate that this approach yields high-quality clustering partitions for various real-world problems, even though at each iteration only small fractions of the original data matrix are examined and at no point is the entire affinity matrix stored in memory or even computed. Furthermore, we demonstrate that the proposed algorithm has intriguing stochastic convergence properties that provide further insight into the clustering problem.
UR - http://www.scopus.com/inward/record.url?scp=72749119484&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:72749119484
SN - 9781615671090
T3 - Society for Industrial and Applied Mathematics - 9th SIAM International Conference on Data Mining 2009, Proceedings in Applied Mathematics
SP - 741
EP - 751
BT - Society for Industrial and Applied Mathematics - 9th SIAM International Conference on Data Mining 2009, Proceedings in Applied Mathematics 133
T2 - 9th SIAM International Conference on Data Mining 2009, SDM 2009
Y2 - 30 April 2009 through 2 May 2009
ER -