Sprungmarken

Servicenavigation

       

Hauptnavigation

Bereichsnavigation

BICO

BICO is a fast streaming algorithm to compute high quality solutions for the k-means problem on very large sets of points. It combines the tree data structure of SIGMOND Test of Time Award winning algorithm BIRCH with insights from clustering theory to obtain solutions fast while keeping the error regarding the k-means cost function low.

Getting BICO

BICO is implemented in C++. The most recent version of the source code is available here. You may use this bash script as a guide on how to compile and (optionally) install BICO using gcc. It is tailored to Debian-based distributions but should be easily adaptable. Besides the actual BICO code which allows to compute an accurate summary of a large point set, the source code contains a straightforward implementation of k-means++ (see 2), which can be used to compute a k-means solution using the summary.

The sourcecode provided on this webpage is the intellectual property of its authors. You have permission to download and use this sourcecode for private and academic use. You are not allowed to redistribute this sourcecode without our permission. If you want to use the sourcecode in any other way, please contact us.

We thank the authors of the library CluE for providing some framework classes which are included in the source code of BICO.

Experimental Evaluation

We evaluated BICO experimentally in 3 to test its speed and accuracy. The original source code that was used to perform the experiments is available here. Please note that this version is intended for reproducing the experiments only. We strongly suggest to use the most recent version from the top of this page for all other purposes.

Data Sets

We used five data sets. Tower, Covertype and Census are from the UCI Machine Learning Repository. BigCross is a subset of the Cartesian product of Tower and Covertype, created by the authors of 1. The fifth data set is called CalTech128 and consists of 128 SIFT descriptors. For further information on the data sets and the availability of these, please contact us.

Reference Implementations

BIRCH: We used the implementation of BIRCH by the authors available here.
MacQueen: We used the implementation of MacQueen's k-means implemented in the open source project ESMERALDA.
StreamKM++: We used the implementation of Stream-KM++ by the authors available here.
StreamLS: We used the implementation of StreamLS by the authors available here.

Testing Environment

The testing environment that we used is based on the environment provided on the StreamKM++ webpage and contains scripts for the automatic execution of BICO and the reference algorithms.

References

  1. M. Ackermann, C. Lammersen, M. Märtens, C. Raupach, C. Sohler, K. Swierkot: StreamKM++: A Clustering Algorithm for Data Streams. ALENEX 2010. Also see the webpage of StreamKM++.
  2. D. Arthur, S. Vassilvitskii: k-means++: The advantages of careful seeding. SODA 2007
  3. H. Fichtenberger, M. Gillé, M. Schmidt, C. Schwiegelshohn, C. Sohler. ESA 2013