Research supported by Deutsche Forschungsgemeinschaft, grant SO 514/3-1 and SO 514/4-2.
The design and implementation of internet search engines is an increasingly sophisticated challenge. Several requirements have to be met, such as handling massive amounts of data, efficient text processing, and fault tolerance. The aims of the Krake web-search project are the implementation and practical improvement of the different components of internet search engines, with an emphasis on scalability and flexibility.
This involves:
An early evaluation of the freely available web crawlers led to the conclusion that a new crawler framework should be created to meet all the defined requirements. To satisfy the demands for distribution and simplicity at the same time, it was decided to use the popular MapReduce paradigm as the basis for the framework.
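To illustrate how a crawl step can be phrased in MapReduce terms, the following sketch (an assumption for illustration only; the actual Krake implementation is not shown here) treats the parsing of a fetched page as the map phase, emitting one (target, source) pair per outgoing link, and the grouping of those pairs by target URL as the reduce phase. The grouped result serves both as the next crawl frontier and as raw web-graph data.

```python
from collections import defaultdict

def map_page(url, links):
    """Map phase: emit one (target, source) pair per outgoing link."""
    for target in links:
        yield target, url

def reduce_links(pairs):
    """Reduce phase: group pairs by target URL into the set of referring pages."""
    grouped = defaultdict(set)
    for target, source in pairs:
        grouped[target].add(source)
    return dict(grouped)

# Hypothetical mini-crawl result: two fetched pages and their outgoing links.
fetched = {
    "http://a.example.de": ["http://b.example.de", "http://c.example.de"],
    "http://b.example.de": ["http://c.example.de"],
}
pairs = [p for url, links in fetched.items() for p in map_page(url, links)]
graph = reduce_links(pairs)
```

Because both phases operate on independent key-value pairs, such a formulation distributes naturally across many crawler nodes, which is the appeal of the MapReduce model here.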
The development phase was heavily prototype-driven to ensure the practical relevance of the framework, and finally led to the stable system presented below.
The Krake crawler framework is a reliable, distributed, and modern crawler framework that can easily be modified to fit individual research interests. It is meant to be used as an out-of-the-box crawler or as the basis for a customized crawling and analysis system.
In conjunction with the actual crawler, the Krake framework also provides means to export and aggregate the gathered data into a web-graph and content-database. The file formats of the exported data are designed to be easily read or modified by other programs without sacrificing efficiency.
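The concrete Krake file formats are not specified above, so the following is only a hypothetical example of an export format in the spirit described: one tab-separated "source TAB target" edge per line, which plain-text tools and graph libraries alike can read and modify without a special parser.

```python
import io

def write_edges(edges, fh):
    """Write the web-graph as one tab-separated edge per line."""
    for src, dst in edges:
        fh.write(f"{src}\t{dst}\n")

def read_edges(fh):
    """Read the edge list back into a list of (source, target) tuples."""
    return [tuple(line.rstrip("\n").split("\t")) for line in fh if line.strip()]

# Round-trip a small hypothetical web-graph through an in-memory file.
edges = [("http://a.example.de", "http://b.example.de"),
         ("http://b.example.de", "http://c.example.de")]
buf = io.StringIO()
write_edges(edges, buf)
buf.seek(0)
restored = read_edges(buf)
```

A line-oriented text format like this trades a little space against the ability to process the web-graph with standard stream tools, one plausible reading of "easily read or modified by other programs without sacrificing efficiency".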
The framework was thoroughly tested by performing two large crawls in the ".de" and ".li" domain spaces, which lasted several months in total. Both crawls were successful and yielded web-graphs with millions of pages and links, as well as a content-database containing multiple terabytes of data.
Please contact Christian Sohler (christian.sohlertu-dortmund