The Webmap build starts with every Web page crawled by Yahoo! and produces a database of all known Web pages and sites on the internet, along with a vast array of data about every page and site. This derived data feeds the Machine Learned Ranking algorithms at the heart of Yahoo! Search.
Some Webmap size data:
* Number of links between pages in the index: roughly 1 trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes
As described in the video, the Webmap is the directed graph of the web, and it stores aggregated metadata about the links between pages. Much of the code sounds like it is written in C++ (which is why Hadoop has the Pipes API). According to the Hadoop mailing list, this puts the Webmap at roughly the same scale as Google's web graph.
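For a sense of what a Pipes job looks like, here is a minimal C++ sketch of a map-reduce pass that aggregates one small piece of link metadata: the inbound-link count per target page. The input format (tab-separated "source target" edge lines) and the class names are assumptions for illustration, not the actual Webmap code.

```cpp
// Minimal Hadoop Pipes sketch (assumed input: one "source<TAB>target" edge per line).
// Counts inbound links per target page -- one example of the kind of link
// metadata a web-graph build would aggregate. Illustrative only.
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

class InlinkCountMap : public HadoopPipes::Mapper {
public:
  InlinkCountMap(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    // Each input value is one edge: "source\ttarget".
    std::vector<std::string> edge =
        HadoopUtils::splitString(context.getInputValue(), "\t");
    if (edge.size() == 2) {
      context.emit(edge[1], "1");  // key on the target page
    }
  }
};

class InlinkCountReduce : public HadoopPipes::Reducer {
public:
  InlinkCountReduce(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    int inlinks = 0;
    while (context.nextValue()) {
      inlinks += HadoopUtils::toInt(context.getInputValue());
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(inlinks));
  }
};

int main(int argc, char* argv[]) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<InlinkCountMap, InlinkCountReduce>());
}
```

A binary like this is compiled against the Pipes library and submitted with something along the lines of `hadoop pipes -input edges -output inlink-counts -program inlink-count` (paths and program name here are placeholders).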