Web-Project "Krake"




Members

  • Alexander Munteanu
  • Christian Sohler

Former Members

  • Thomas Nitschke
  • Marc Gillé
  • Christopher Schröder
  • Hendrik Spiegel

Support

Research supported by the Deutsche Forschungsgemeinschaft, grants SO 514/3-1 and SO 514/4-2.

Overview

The design and implementation of internet search engines is an increasingly sophisticated challenge. Several requirements have to be met, such as handling massive amounts of data, efficient text processing, and fault tolerance. The aims of the Krake web-search project are the implementation and practical improvement of the different components of internet search engines, with an emphasis on scalability and flexibility.

This involves:

  1. Distributed data management
  2. Solid and tested web crawler
  3. Efficient web graph implementation
  4. Easy access to the crawled content
  5. Simplicity and extensibility
  6. Scalability

Results

An early evaluation of the freely available web crawlers led to the conclusion that a new crawler framework should be created to meet all the defined requirements. In order to satisfy the demands for distribution and simplicity at the same time, it was decided to use the very popular MapReduce paradigm as the basis for the framework.
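
To illustrate how a crawl maps onto this paradigm, the following sketch expresses one crawl round as a Hadoop MapReduce job: the mapper fetches each frontier URL and emits the outgoing links it finds, and the reducer deduplicates the discovered URLs into the frontier for the next round. The class names and the naive link extraction are illustrative assumptions and do not reflect the actual Krake code.

    import java.io.IOException;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CrawlRoundSketch {

        // Mapper: one frontier URL per input record; fetch the page and emit its links.
        public static class FetchMapper extends Mapper<Object, Text, Text, Text> {
            @Override
            protected void map(Object key, Text url, Context context)
                    throws IOException, InterruptedException {
                for (String link : extractLinks(url.toString())) {
                    // Emit (discovered URL, referring URL) so the reducer can deduplicate.
                    context.write(new Text(link), url);
                }
            }
        }

        // Reducer: keep each discovered URL exactly once for the next crawl round.
        public static class DedupReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text url, Iterable<Text> referrers, Context context)
                    throws IOException, InterruptedException {
                context.write(url, referrers.iterator().next());
            }
        }

        // Naive page fetch and link extraction; a real crawler would additionally
        // respect robots.txt, timeouts and politeness policies.
        static List<String> extractLinks(String pageUrl) {
            List<String> links = new ArrayList<>();
            try (Scanner in = new Scanner(new URL(pageUrl).openStream(), "UTF-8")) {
                String html = in.useDelimiter("\\A").hasNext() ? in.next() : "";
                Matcher m = Pattern.compile("href=\"(https?://[^\"]+)\"").matcher(html);
                while (m.find()) {
                    links.add(m.group(1));
                }
            } catch (IOException e) {
                // Unreachable page: contribute no links in this round.
            }
            return links;
        }
    }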

The development phase was heavily prototype-driven to ensure the practical relevance of the framework and ultimately led to the stable system presented below.

Krake Crawler Framework

The Krake crawler framework is a reliable, distributed and modern crawler framework that can easily be modified to fit individual research interests. It is meant to be used as an out-of-the-box crawler or as a basis for a customized crawling and analysis system.
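
As a hypothetical illustration of such a customization, the sketch below restricts a crawl to a single top-level domain (as in the ".de" and ".li" crawls described further down). The class name and the use of a plain Predicate are assumptions made for this example and do not describe the actual Krake extension API.

    import java.net.URI;
    import java.net.URISyntaxException;
    import java.util.function.Predicate;

    // Accepts only URLs whose host ends with the configured top-level domain.
    public class TopLevelDomainFilter implements Predicate<String> {

        private final String suffix;

        public TopLevelDomainFilter(String tld) {
            this.suffix = tld.startsWith(".") ? tld : "." + tld;
        }

        @Override
        public boolean test(String url) {
            try {
                String host = new URI(url).getHost();
                return host != null && host.endsWith(suffix);
            } catch (URISyntaxException e) {
                return false; // reject URLs that cannot be parsed
            }
        }
    }

A filter constructed as new TopLevelDomainFilter(".de") would, for example, accept http://www.tu-dortmund.de/ and reject URLs from other top-level domains.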

In conjunction with the actual crawler, the Krake framework also provides means to export and aggregate the gathered data into a web-graph and content-database. The file formats of the exported data are designed to be easily read or modified by other programs without sacrificing efficiency.
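
The exact on-disk formats are not described on this page. As a minimal sketch, assuming the web-graph were exported as one tab-separated "source<TAB>target" edge per line, a smaller extract of it could be loaded into memory as follows:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class WebGraphLoader {

        // Load the edge list into an adjacency list keyed by source URL.
        public static Map<String, List<String>> load(String path) throws IOException {
            Map<String, List<String>> adjacency = new HashMap<>();
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] edge = line.split("\t", 2);
                    if (edge.length != 2) {
                        continue; // skip malformed lines
                    }
                    adjacency.computeIfAbsent(edge[0], k -> new ArrayList<>()).add(edge[1]);
                }
            }
            return adjacency;
        }

        public static void main(String[] args) throws IOException {
            Map<String, List<String>> graph = load(args[0]);
            System.out.println("Pages with outgoing links: " + graph.size());
        }
    }

For a full multi-terabyte crawl, a streaming or MapReduce-based analysis would of course be more appropriate than loading the entire graph into memory.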

The framework was thoroughly tested by performing two large crawls in the ".de" and ".li" domain spaces, which lasted several months in total. Both crawls were successful and yielded web-graphs with millions of pages and links as well as a content-database containing multiple terabytes of data.

Features at a glance

  • Based on Apache Hadoop / MapReduce
  • Makes heavy use of distributed computation
  • Exports crawled data as web-graph and content-database
  • Written entirely in Java
  • Easy to adapt
  • Well tested through several months of operation

Downloads

Please contact Christian Sohler (christian.sohler(at)tu-dortmund.de) if you would like to access the crawled data and web-graphs.


Last update: 20.06.2012 by L. Pradel