Tuesday, January 03, 2012

Partitioning Graphs in Hadoop

A recent article at LinkedIn called "Recap: Improving Hadoop Performance by (up to) 1000x" had a section called "Drill Bit #2: graph processing" that mentions the problem of partitioning the triples of RDF graphs amongst different nodes.

According to "Scalable SPARQL Querying of Large RDF Graphs", the authors use MapReduce jobs to create indexes where triples such as s,p,o and o,p',o' are on the same compute node.  The idea of using MapReduce to create better indexes is not a new one - but it's good to see that approach applied to RDF, rather than using MapReduce jobs to do the querying itself.  It's similar to what I did with RDF molecules, which create a level of granularity between graphs and nodes, and to things like Nutch.
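
To make the co-location idea concrete, here is a minimal sketch in plain Python rather than an actual Hadoop job. It is not the paper's algorithm: it assumes a simple hash of the subject picks each triple's home node, and it then copies every triple o,p',o' onto the node that holds s,p,o. The names node_for and partition_with_one_hop are made up for this example.

```python
from collections import defaultdict
import zlib


def node_for(term, num_nodes):
    """Pick a home node for an RDF term (hypothetical placement rule).

    crc32 is used instead of hash() because Python randomizes string
    hashes per process, which would make placement unrepeatable.
    """
    return zlib.crc32(term.encode("utf-8")) % num_nodes


def partition_with_one_hop(triples, num_nodes):
    """Place each (s, p, o) on the home node of s, then replicate every
    triple whose subject is o onto that same node."""
    # "Map" step: group triples by subject so the triples that start
    # at a given term can be looked up directly.
    by_subject = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((s, p, o))

    # "Reduce" step: build one set of triples per compute node.
    partitions = [set() for _ in range(num_nodes)]
    for s, p, o in triples:
        home = node_for(s, num_nodes)
        partitions[home].add((s, p, o))
        # Copy the one-hop neighbours (o, p', o') next to (s, p, o).
        for neighbour in by_subject.get(o, ()):
            partitions[home].add(neighbour)
    return partitions


if __name__ == "__main__":
    example = [
        ("alice", "knows", "bob"),
        ("bob", "worksAt", "acme"),
        ("acme", "locatedIn", "brisbane"),
    ]
    for i, part in enumerate(partition_with_one_hop(example, 4)):
        print(i, sorted(part))
```

With this layout a two-step path query such as ?x knows ?y, ?y worksAt ?z can be answered on a single node, at the cost of storing some triples more than once.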
