Saturday, January 26, 2008

Scaling MapReduce

Before I started down the path of using Hadoop, I had to be reasonably sure that it would scale. One of the pieces of evidence that convinced me was IBM's study "Scalability of the Nutch Search Engine", in which they ran tests on real-world systems and also modeled the architecture to verify the measurements:
We observe...that the throughput, for a fixed data set size per back-end, increases with the number of back-ends.

We observe a generally good agreement between measurement and prediction for small number of back-end servers. The response time is essentially flat with the number of back-end servers, up to 500, 1000, and 2000 servers for data set sizes (per server) of 10 GB, 20 GB, and 40 GB respectively. The system is more scalable for larger data set sizes because the work per back-end server per query increases.


In other words, a single front-end server can service well over 1,000 back-end nodes before response times start to degrade, and at that point the cluster is holding tens of terabytes of data (1,000 nodes at 20 GB each is 20 TB; 2,000 nodes at 40 GB each is 80 TB).
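As a rough illustration of the programming model whose scaling is being measured here, a word-count style job written against the Hadoop MapReduce Java API looks something like the sketch below (class and package names follow the newer org.apache.hadoop.mapreduce API, so the details vary by Hadoop version). The point is that the same map and reduce code runs unchanged whether the job is spread over ten nodes or a thousand; the framework handles the partitioning.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```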
