Wednesday, June 27, 2007

Scalability, Scalability, Scalability, Scalability

Dare Obasanjo has four recent postings on the recent Google Scalability Conference.

Google Scalability Conference Trip Report: MapReduce, BigTable, and Other Distributed System Abstractions for Handling Large Datasets:

The talk was about the three pillars of Google's data storage and processing platform; GFS, BigTable and MapReduce.

A developer only has to write their specific map and reduce operations for their data sets which could run as low as 25 - 50 lines of code while the MapReduce infrastructure deals with parallelizing the task and distributing it across different machines, handling machine failures and error conditions in the data, optimizations such as moving computation close to the data to reduce I/O bandwidth consumed, providing system monitoring and making the service scalable across hundreds to thousands of machines.

Currently, almost every major product at Google uses MapReduce in some way. There are 6000 MapReduce applications checked into the Google source tree with the hundreds of new applications that utilize it being written per month. To illustrate its ease of use, a graph of new MapReduce applications checked into the Google source tree over time shows that there is a spike every summer as interns show up and create a flood of new MapReduce applications that are then checked into the Google source tree.


Google Scalability Conference Trip Report: Using MapReduce on Large Geographic Datasets:

A common pattern across a lot of Google services is creating a lot of index files that point and loading them into memory to make lookups fast. This is also done by the Google Maps team which has to handle massive amounts of data (e.g. there are over a hundred million roads in North America).

Q: Where are intermediate results from map operations stored?
A: In BigTable or GFS


Google Scalability Conference Trip Report: Lessons in Building Scalable Systems:

The most important lesson the Google Talk team learned is that you have to measure the right things. Questions like "how many active users do you have" and "how many IM messages does the system carry a day" may be good for evaluating marketshare but are not good questions from an engineering perspective if one is trying to get insight into how the system is performing.

Specifically, the biggest strain on the system actually turns out to be displaying presence information.

Giving developers access to live servers (ideally public beta servers not main production servers) will encourage them to test and try out ideas quickly. It also gives them a sense of empowerement. Developers end up making their systems easier to deploy, configure, monitor, debug and maintain when they have a better idea of the end to end process.


And finally, Google Scalability Conference Trip Report: Scaling Google for Every User, which had some interesting ideas about search engine usage.

Update: links for 2007-07-06 includes links to the Google talk on MapReduce tasks on large datasets and other goodies (scalable b-trees, hashing, etc).

Update 2: Greg Linden has a much better list of videos and commentary.

No comments: