Wednesday, April 15, 2009

MapReduce vs SQL Databases

A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks
...we present the results of running the benchmark on a 100-node cluster to execute each task. We tested the publicly available open-source version of MapReduce, Hadoop [1], against two parallel SQL DBMSs, Vertica [3] and a second system from a major relational vendor.

First, as we demonstrate in Section 4, at 100 nodes the two parallel DBMSs range from a factor of 3.1 to 6.5 faster than MapReduce on a variety of analytic tasks. While MR may indeed be capable of scaling up to 1000s of nodes, the superior efficiency of modern DBMSs alleviates the need to use such massive hardware on datasets in the range of 1–2PB (1000 nodes with 2TB of disk/node has a total disk capacity of 2PB). For example, eBay’s Teradata configuration uses just 72 nodes (two quad-core CPUs, 32GB RAM, 104 300GB disks per node) to manage approximately 2.4PB of relational data. As another example, Fox Interactive Media’s warehouse is implemented using a 40-node Greenplum DBMS. Each node is a Sun X4500 machine with two dual-core CPUs, 48 500GB disks, and 16 GB RAM (1PB total disk space) [7]. Since few data sets in the world even approach a petabyte in size, it is not at all clear how many MR users really need 1,000 nodes.


In Section 3.1 there are some points made about the advantages of databases over MR in relation to data integrity: "...a MR framework and its underlying distributed storage system has no knowledge of these rules, and thus allows input data to be easily corrupted with bad data. By again separating such constraints from the application and enforcing them automatically by the run time system, as is done by all SQL DBMSs, the integrity of the data is enforced without additional work on the programmer’s behalf."

They mention that "all DBMSs require that data conform to a well-defined schema, whereas MR permits data to be in any arbitrary format. Other differences also include how each system provides indexing and compression optimizations, programming models, the way in which data is distributed, and query execution strategies."

Strip that away and what they are really comparing is text processing against indexed data structures (and the other machinery of a DBMS).
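
To make that concrete: every integrity rule a DBMS would declare once in the schema has to be hand-rolled inside the map function. Here's a minimal sketch against the old Hadoop mapred API, with a made-up tab-separated "userId, pageRank" record format (the class name, record layout and counters are mine, not the paper's):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Every integrity rule a DBMS would enforce declaratively at load time
// has to be checked by hand, per record, in application code.
public class ValidatingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String[] fields = value.toString().split("\t");
    if (fields.length != 2) {
      reporter.incrCounter("Integrity", "MalformedRecords", 1);
      return; // a DBMS would have rejected this row at load time
    }
    try {
      int rank = Integer.parseInt(fields[1]);
      output.collect(new Text(fields[0]), new IntWritable(rank));
    } catch (NumberFormatException e) {
      reporter.incrCounter("Integrity", "BadRank", 1);
    }
  }
}
```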

For loading, "Without using either block- or record-level compression, Hadoop clearly outperforms both DBMS-X and Vertica since each node is simply copying each datafile from the local disk into the local HDFS instance and then distributing two replicas to other nodes in the cluster." The obvious difference to me is that the SQL databases are meanwhile building tables that are "...hash partitioned across all nodes on the salient attribute for that particular table, and then sorted and indexed on different attributes..."
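
That also explains the load numbers: in Hadoop, "loading" is little more than copying bytes into HDFS. A sketch using the FileSystem API (the paths here are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Loading amounts to copying bytes into HDFS. No parsing, partitioning,
// sorting, or index building happens, which is why the load phase is fast.
public class HdfsLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // reads core-site.xml etc.
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("/local/data/grep-input.dat"),
                         new Path("/user/benchmark/grep-input.dat"));
    // HDFS then replicates the blocks (3 copies by default).
  }
}
```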

For text processing, they note that the main problems with Hadoop are the start-up costs (10-25 seconds before all Map tasks start) and, during the Reduce phase, the cost of combining many small files. When you compare a fully indexed system against raw text processing you would expect the indexed system to be faster. Compression was also counted as an advantage for systems like Vertica, whereas in Hadoop it actually reduced performance. Whether compression pays for its own overhead obviously depends on the workload, but the paper never explains why it was a net negative for Hadoop.
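
For what it's worth, compression in Hadoop is toggled with a couple of job settings, so that experiment is cheap to reproduce. A sketch against the old mapred API; the choice of GzipCodec is my assumption, the paper doesn't say which codec was used:

```java
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class CompressionSettings {
  public static void configure(JobConf conf) {
    // Compress intermediate map output (the spills shipped to reducers).
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class); // codec is my guess

    // Compress the final job output written to HDFS.
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
  }
}
```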

They also talk about the difficulty of setting up and configuring the parallel databases compared with Hadoop, which is not an insignificant difference when you are scaling to 100s or 1000s of nodes.

In the summary they talk about 25 years of database development and the advantages of B-Trees and column stores. That raises the question: why wasn't a similar system built on top of the Hadoop infrastructure for the comparison? MR is really a distributed processing framework, not an indexed query system.
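
The B-Tree advantage they cite is easy to state: a selective lookup over indexed data touches O(log n) entries, while MR scans everything. A toy single-machine illustration, with java.util.TreeMap (a red-black tree) standing in for a B-Tree and made-up data:

```java
import java.util.Map;
import java.util.TreeMap;

public class IndexVsScan {
  public static void main(String[] args) {
    // TreeMap is a red-black tree; it stands in for a B-Tree index here.
    TreeMap<Integer, String> index = new TreeMap<Integer, String>();
    for (int i = 0; i < 1000000; i++) {
      index.put(i, "row-" + i);
    }

    // Indexed lookup: O(log n) comparisons down the tree.
    String hit = index.get(867530);

    // The MR-style alternative: scan every record, test the predicate.
    String scanned = null;
    for (Map.Entry<Integer, String> e : index.entrySet()) {
      if (e.getKey().intValue() == 867530) {
        scanned = e.getValue();
        break;
      }
    }
    System.out.println(hit + " / " + scanned);
  }
}
```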

If you took away the distributed layer, what they are doing is comparing something like grep (a really bad implementation of grep) with Lucene or MySQL. Would anyone be surprised by the results then? A better comparison would have been against HBase or other distributed, indexed data stores like Hive or Cloudbase.
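
The MR side of the paper's grep task really is brute force: scan every line and emit it if it contains the pattern. A minimal old-API sketch (the "XYZ" pattern and class name are placeholders of mine):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// The whole "query": brute-force substring search over every record.
public class GrepMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {

  private static final String PATTERN = "XYZ"; // placeholder search term

  public void map(LongWritable offset, Text line,
      OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {
    if (line.toString().indexOf(PATTERN) >= 0) {
      output.collect(offset, line); // no index to consult, ever
    }
  }
}
```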

Update: There's a good follow-up on the hadoop list by Jonathan Gray: "Hadoop is not suited for random access, joins, dealing with subsets of your data; ie. it is not a relational database! It's designed to distribute a full scan of a large dataset, placing tasks on the same nodes as the data its processing. The emphasis is on task scheduling, fault tolerance, and very large datasets, low-latency has not been a priority. There are no "indexes" to speak of, it's completely orthogonal to what it does, so of course there is an enormous disparity in cases where that makes sense. Yes, B-Tree indexes are a wonderful breakthrough in data technology". He suggested Pig, Hive and Cascading would be more suitable for comparison.
