Wednesday, February 20, 2008

Do You Have an Internet Sized Problem?

Yahoo! Launches World's Largest Hadoop Production Application. This is mainly notable as a sign that Hadoop has reached a real level of maturity.

The Webmap build starts with every Web page crawled by Yahoo! and produces a database of all known Web pages and sites on the internet and a vast array of data about every page and site. This derived data feeds the Machine Learned Ranking algorithms at the heart of Yahoo! Search.

Some Webmap size data:

* Number of links between pages in the index: roughly 1 trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes
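A quick back-of-the-envelope check on those numbers: roughly 300 TB of output over roughly 1 trillion links works out to about 300 bytes of compressed output per link, which seems plausible if the output is dominated by per-link records plus their metadata.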


As described in the video, the Webmap is the directed graph of the Web, and it stores aggregated metadata about the links between pages. Much of the code sounds like it is written in C++ (which is why Hadoop has the Pipes API). According to discussion on the Hadoop mailing list, this puts the Webmap at roughly the same scale as Google's equivalent systems.
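To make the Pipes angle concrete, here is a minimal sketch of what a C++ MapReduce job over link data might look like. This is not Yahoo!'s actual Webmap code: the input format (tab-separated source/target URL pairs), the class names, and the link-counting logic are all assumptions for illustration. The HadoopPipes/HadoopUtils calls, though, follow the standard Pipes word-count example that ships with Hadoop.

```cpp
#include <string>
#include <vector>

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

// Hypothetical mapper: assumes each input line is "sourceURL<TAB>targetURL"
// and emits one count per inbound link, keyed by the target URL.
class LinkCountMapper : public HadoopPipes::Mapper {
public:
  LinkCountMapper(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    std::vector<std::string> fields =
        HadoopUtils::splitString(context.getInputValue(), "\t");
    if (fields.size() == 2) {
      context.emit(fields[1], "1");  // key: target URL, value: one link
    }
  }
};

// Reducer: sums the counts for each target URL, giving its inbound-link
// count; a real Webmap job would aggregate much richer link metadata.
class LinkCountReducer : public HadoopPipes::Reducer {
public:
  LinkCountReducer(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    int sum = 0;
    while (context.nextValue()) {
      sum += HadoopUtils::toInt(context.getInputValue());
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(sum));
  }
};

int main(int argc, char* argv[]) {
  // The Pipes runtime drives the mapper/reducer over a socket from the
  // Java framework, so C++ logic plugs into a stock Hadoop cluster.
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<LinkCountMapper, LinkCountReducer>());
}
```

Whatever the actual per-link metadata looks like, the Mapper/Reducer/factory structure above is the same for any Pipes job, which is presumably what made it practical to keep the existing C++ codebase while moving the build onto Hadoop.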
