Monday, December 24, 2007

Fat Controller

When does MapReduce make sense and what architecture is appropriate? I don't really know; wheras Tom has some ideas. I like the idea of MapReduce in the Google architecture (as cloned by Hadoop). I like the use of a distributed file system to evenly smear the data across the nodes in the cluster. However, these choices don't always seem appropriate.

The easiest and most obvious example where this causes a problem is that sometimes you want to ensure a single entry or exit in your system (when to start or stop). Another obvious example is where the overhead of network latency overwhelms your ability to parallelize the processing. More than that, it seems that if you can order things globally then the processing should be more efficient but it's unclear to me where the line is between that and the more distributed MapReduce processing (and whether adding that complexity is always worth it).

Google's third lecture (slides) answers some of the basic questions such as why use a distributed file system. It also lists decisions made such as read optimization, mutation handling (serializing writes and atomic appends), no caching (no need due to large file size), fewer but larger files (64MB chunks) and how the file handling is essentially garbage collection. They have implemented appends as it was a frequent operation. This is something that Hadoop has yet to do and can be an issue, especially for databases (which includes the requirements for appends and truncates attached to that issue).

There is obviously some adaption needed to alogirthms to run in a MapReduce cluster. Lecture 5 gives some examples of MapReduce algorithms. It includes a breadth first search of a graph and PageRank. Breadth first is chosen so there doesn't need to be any backtracking. For graph searching, they suggested creating an adjacency matrix - 1 indicating a link in the matrix and 0 indicating no link. To transfer it efficiently they use a sparse matrix (where you only record the links - very much like column databases of course). MapReduce is similar, with the processing split as a single row per page. For both of these process there is a non-MapReduce component. For example, in the PageRank algorithm a process exists to determine convergence of page values.

No comments: