BICO
BICO is a fast streaming algorithm to compute high quality solutions for the k-means problem on very large sets of points. It combines the tree data structure of SIGMOND Test of Time Award winning algorithm BIRCH with insights from clustering theory to obtain solutions fast while keeping the error regarding the k-means cost function low.
Getting BICO
BICO is implemented in C++.
The most recent version of the source code is available here. You may use
this bash script as a guide on how to compile and (optionally) install BICO using gcc. It is tailored to Debian-based distributions but should be easily adaptable.
Besides the actual BICO code which allows to compute an accurate summary of a large point set, the source code contains a straightforward implementation of k-means++ (see
2), which can be used to compute a k-means solution using the summary.
The sourcecode provided on this webpage is the intellectual property of its authors.
You have permission to download and use this sourcecode for private and academic use.
You are not allowed to redistribute this sourcecode without our permission.
If you want to use the sourcecode in any other way, please
contact us.
We thank the authors of the library
CluE for providing some framework classes which are included in the source code of BICO.
Experimental Evaluation
We evaluated BICO experimentally in
3 to test its speed and accuracy.
The original source code that was used to perform the experiments is available
here.
Please note that this version is intended
for reproducing the experiments only.
We strongly suggest to use the most recent version from
the top of this page for all other purposes.
Data Sets
We used five data sets.
Tower, Covertype and Census are from the
UCI Machine Learning Repository.
BigCross is a subset of the Cartesian product of Tower and Covertype, created by the authors of
1.
The fifth data set is called CalTech128 and consists of 128 SIFT descriptors.
For further information on the data sets and the availability of these, please
contact us.
Reference Implementations
BIRCH:
We used the implementation of BIRCH by the authors available
here.
MacQueen: We used the implementation of MacQueen's k-means implemented in the open source project
ESMERALDA.
StreamKM++: We used the implementation of Stream-KM++ by the authors available
here.
StreamLS: We used the implementation of StreamLS by the authors available
here.
Testing Environment
The
testing environment that we used is based on the environment provided on the
StreamKM++ webpage and contains scripts for the automatic execution of BICO and the reference algorithms.
- M. Ackermann, C. Lammersen, M. Märtens, C. Raupach, C. Sohler, K. Swierkot: StreamKM++: A Clustering Algorithm for Data Streams. ALENEX 2010.
Also see the webpage of StreamKM++.
- D. Arthur, S. Vassilvitskii: k-means++: The advantages of careful seeding. SODA 2007
- H. Fichtenberger, M. Gillé, M. Schmidt, C. Schwiegelshohn, C. Sohler. ESA 2013