Research supported by Deutsche Forschungsgemeinschaft, grant So 514/4-2. The project is within the research cluster Algorithm Engineering.
Clustering is the process of partitioning a set of objects into subsets such that any two objects within the same subset are more similar to each other than two objects from different subsets. The similarity of two objects is usually quantified by a similarity or distance measure. Distance measures can be classified as metric or non-metric. Well-known metric distance measures include the Euclidean distance, the Manhattan distance, norms on vector spaces more generally, and arbitrary metrics with bounded doubling dimension. In the last decade, many new approximation algorithms have been developed for this class of distance measures. However, most of these algorithms have large constants hidden in, or even in the exponent of, their running times, which makes testing their applicability on real-world data tedious, if not impossible. In practice, several well-known heuristics with fast running times exist for this class of distance measures, but they lack good theoretical guarantees. The first goal of this project is to close this gap between theory and practice in clustering by developing new approximation algorithms with smaller constants. In addition, we want to study the worst-case behavior of known heuristics in order to establish theoretical guarantees on the quality of their solutions.
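To make the gap between theory and practice concrete, the following is a minimal sketch of Lloyd's heuristic for k-means, the prototypical fast clustering heuristic alluded to above: it alternates between assigning points to their nearest center and recomputing each center as the mean of its assigned points. The data and parameter choices are illustrative, not part of the project.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's heuristic: alternate nearest-center assignment and recentering."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center under squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # Move each center to the mean of the points assigned to it.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(coord) / len(cl) for coord in zip(*cl))
    return centers

# Two well-separated groups; Lloyd's heuristic recovers one center per group.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers = kmeans(points, 2)
```

Each iteration runs in time linear in the number of points, which explains the fast running time in practice; however, the heuristic only converges to a local optimum, which is exactly the kind of behavior whose worst case the project aims to analyze.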
Well-known non-metric distance measures include the cosine similarity, the Pearson correlation, the Kullback-Leibler divergence (relative entropy), Mahalanobis distances, and Bregman divergences. Very few clustering results are known for non-metric distance measures. Our second goal in this project is to develop approximation algorithms for non-metric distance measures.
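The difficulty with these measures is that they violate the metric axioms on which most approximation algorithms rely. A small illustration, assuming discrete probability distributions as input: the Kullback-Leibler divergence is not symmetric, so it is not a metric.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) of discrete distributions p, q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]
d_pq = kl(p, q)  # roughly 0.368
d_qp = kl(q, p)  # roughly 0.511
# Asymmetry: D(p || q) != D(q || p), so KL divergence violates the metric axioms.
```

Standard tools for metric spaces (such as triangle-inequality-based pruning) are therefore unavailable, which is why dedicated approximation algorithms are needed for this class of measures.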
The highlights of this project are:
BICO: Fast k-means clustering in data streams