Wednesday, February 20, 2008

Do You Have an Internet Sized Problem?

Yahoo! Launches World's Largest Hadoop Production Application. This is mainly notable as a sign that Hadoop has reached a real level of maturity.

The Webmap build starts with every Web page crawled by Yahoo! and produces a database of all known Web pages and sites on the internet and a vast array of data about every page and site. This derived data feeds the Machine Learned Ranking algorithms at the heart of Yahoo! Search.

Some Webmap size data:

* Number of links between pages in the index: roughly 1 trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes
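A quick back-of-the-envelope check on those numbers: roughly 300 TB of output over roughly 1 trillion links works out to about 300 bytes of compressed output per link, which seems plausible if the output is dominated by per-link records plus their metadata.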


As described in the video, the Webmap is the directed graph of the Web, and it stores aggregated metadata about the links between pages. Much of the code sounds like it is written in C++ (which is why Hadoop has the Pipes API). According to discussion on the Hadoop mailing list, this puts the Webmap at roughly the same scale as Google's equivalent systems.
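To make the Pipes angle concrete, here is a minimal sketch of what a C++ MapReduce job over link data might look like. This is not Yahoo!'s actual Webmap code: the input format (tab-separated source/target URL pairs), the class names, and the link-counting logic are all assumptions for illustration. The HadoopPipes/HadoopUtils calls, though, follow the standard Pipes word-count example that ships with Hadoop.

```cpp
#include <string>
#include <vector>

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

// Hypothetical mapper: assumes each input line is "sourceURL<TAB>targetURL"
// and emits one count per inbound link, keyed by the target URL.
class LinkCountMapper : public HadoopPipes::Mapper {
public:
  LinkCountMapper(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    std::vector<std::string> fields =
        HadoopUtils::splitString(context.getInputValue(), "\t");
    if (fields.size() == 2) {
      context.emit(fields[1], "1");  // key: target URL, value: one link
    }
  }
};

// Reducer: sums the counts for each target URL, giving its inbound-link
// count; a real Webmap job would aggregate much richer link metadata.
class LinkCountReducer : public HadoopPipes::Reducer {
public:
  LinkCountReducer(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    int sum = 0;
    while (context.nextValue()) {
      sum += HadoopUtils::toInt(context.getInputValue());
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(sum));
  }
};

int main(int argc, char* argv[]) {
  // The Pipes runtime drives the mapper/reducer over a socket from the
  // Java framework, so C++ logic plugs into a stock Hadoop cluster.
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<LinkCountMapper, LinkCountReducer>());
}
```

Whatever the actual per-link metadata looks like, the Mapper/Reducer/factory structure above is the same for any Pipes job, which is presumably what made it practical to keep the existing C++ codebase while moving the build onto Hadoop.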
