# Project "Practical theory for clustering algorithms"

[

Members] [

Support] [

Overview] [

Publications]

- Johannes Blömer
- Christian Sohler
- Marcel R. Ackermann
- Daniel Kuntze
- Melanie Schmidt

Support

Research supported by Deutsche Forschungsgemeinschaft, grant So 514/4-2. The project is within the research cluster

Algorithm Engineering.

The clustering is the process of partitioning a set of objects into subsets such that each two objects inside one subset are more similar than two objects in different subsets. Similarity of two objects is usually measured using similarity or distance measures. Distance measures can be classified as metric spaces and non-metric spaces. Famous metric spaces are Euclidean distance, Manhattan norm, more generally norm vector spaces and arbitrary metric spaces with bounded doubling dimension. In the last decade we had lots of new approximation algorithms for this class of distance measures. Most of these algorithms have high constants behind or even in the exponent of their running times which makes testing their applicability on real-world data a tedious work if not impossible. In practice, we have some known heuristics for this class of distance measures which have fast running times. One issue about these heuristics is that they do not have good theoretical guarantees. The first goal in this project is to close this gap between theory and practice in clustering by proposing new approximation algorithms which have smaller constants. Besides we would like to study the worst case analysis of known heuristics trying to find theoretical guarantees of their bounds.

The Cosine similarity measure, Pearson correlation, Kullback-Leibler divergence (relative entropy), Mahalanobis distances, and Bregman divergences are well-known non-metric distance measures. Very few results in the area of clustering have been known for non-metric distance measures. Our second goal in this project is to develop approximation algorithms for non-metric distance measures.

The highlights of this project are to

- Develop new tools and algorithms for clustering problems in high dimensional euclidean spaces.
- Develop new tools and algorithms for clustering problems under Metric and Non-metric distance measures.
- Analyze well-known clustering such as k-Means and Agglomerative clustering.
- Implement clustering algorithms and tools to study their applicability on real world data.

Implementations

BICO: Fast k-means clustering in data streams

Dan Feldman, Morteza Monemizadeh, Christian Sohler, David P. Woodruff:

Coresets and Sketches for High Dimensional Subspace Approximation Problems. In Proceedings of SODA 2010, pp. 630-649, 2010.

Letzte Änderung am 01.10.2013 von M. Schmidt