Apache's Hadoop project aims to solve these problems by providing a framework for running large data processing applications on clusters of commodity hardware. Combined with Amazon EC2 for running the application and Amazon S3 for storing the data, it lets us run large jobs very economically. This paper describes how to use Amazon Web Services and Hadoop to run an ad hoc analysis on a large collection of web access logs that would otherwise have been prohibitively expensive in time or money.
It took about 5 minutes to transfer the data from S3 to HDFS (the data was compressed, and Hadoop can read compressed input out of the box), so the whole job took less than an hour. At $0.10 per instance-hour, this works out at only $2 for the whole job, plus S3 storage and transfer costs. Those are external transfer costs only, because transfers between EC2 and S3 are free.
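To make the arithmetic explicit, here is a rough sketch of the EC2 part of the bill in Python. The $0.10 rate and the $2 total come from the text; together they imply 20 instance-hours. Reading that as 20 instances each billed for one hour is an assumption, as is the rule that a partial hour is billed as a full hour. The function names and the 55-minute runtime are made up for illustration, and S3 storage and external transfer costs are left out.

```python
from math import ceil

# Rate quoted in the text: $0.10 per instance-hour, kept in cents
# so the arithmetic stays in exact integers.
RATE_CENTS_PER_INSTANCE_HOUR = 10


def ec2_cost_cents(instances, runtime_minutes,
                   rate_cents=RATE_CENTS_PER_INSTANCE_HOUR):
    """EC2 cost in cents for a cluster that runs for runtime_minutes.

    Assumption: each instance is billed for whole hours, rounded up.
    """
    billed_hours = ceil(runtime_minutes / 60)
    return instances * billed_hours * rate_cents


def implied_instance_hours(total_cents,
                           rate_cents=RATE_CENTS_PER_INSTANCE_HOUR):
    """Instance-hours that a given EC2 bill corresponds to."""
    return total_cents // rate_cents


# The $2 total from the text corresponds to 20 instance-hours.
hours = implied_instance_hours(200)

# Assumed reading: 20 instances, job done in under an hour
# (55 minutes is a placeholder, not a measured figure).
cost = ec2_cost_cents(instances=20, runtime_minutes=55)

print(f"{hours} instance-hours, ${cost / 100:.2f}")
```

Under these assumptions, a job that ran even slightly past the hour would be billed a second hour on every instance, doubling the EC2 cost, so keeping the runtime under an hour matters for the total.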
Friday, July 20, 2007
MapReduce by the Hour
Running Hadoop MapReduce on Amazon EC2 and Amazon S3