Thursday, April 17, 2008

hashCode and equals for Blank Nodes

You don't need node ids. Most, if not all, RDF triple stores take a Literal, URI Reference or Blank Node and generate a node id for it. Sometimes it's a hash or UUID, sometimes it comes from a node pool or value store, but you don't really need it. As an aside, in a distributed store you could even do the blocks-of-ids trick that people have used in SQL databases, but I haven't seen that done for RDF yet.

When you do operations like joins in Java or Ruby or some other language, you rely on hash codes to tell values apart: different hash codes mean different values, and if the hash codes are the same you then call equals.
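To make that concrete, here is a minimal sketch of a hash join in Java. The class names (HashJoinSketch, UriRef, Triple) are made up for illustration and don't come from any particular triple store, and I've restricted triples to URI References to keep it short. The HashMap does the work described above: it buckets by hashCode and only falls back to equals inside a bucket.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HashJoinSketch {

    /** Hypothetical URI reference node: its identity is just the URI string. */
    static final class UriRef {
        final String uri;

        UriRef(String uri) { this.uri = uri; }

        @Override public int hashCode() { return uri.hashCode(); }

        @Override public boolean equals(Object o) {
            return o instanceof UriRef && ((UriRef) o).uri.equals(uri);
        }
    }

    /** Hypothetical triple; every position is a UriRef purely for brevity. */
    static final class Triple {
        final UriRef s, p, o;

        Triple(UriRef s, UriRef p, UriRef o) { this.s = s; this.p = p; this.o = o; }
    }

    /** Pairs each left triple with every right triple whose subject equals the left object. */
    static List<Triple[]> join(List<Triple> left, List<Triple> right) {
        // Build side: index the right-hand triples by subject. No node ids involved,
        // the map relies only on UriRef.hashCode() and UriRef.equals().
        Map<UriRef, List<Triple>> bySubject = new HashMap<>();
        for (Triple t : right) {
            bySubject.computeIfAbsent(t.s, k -> new ArrayList<>()).add(t);
        }
        // Probe side: look up each left object in the index.
        List<Triple[]> out = new ArrayList<>();
        for (Triple l : left) {
            for (Triple r : bySubject.getOrDefault(l.o, Collections.<Triple>emptyList())) {
                out.add(new Triple[] { l, r });
            }
        }
        return out;
    }
}
```

Note that the join never asks where a node came from. Two UriRef objects created in different places match as long as their URIs do.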

What if you don't have a node pool?

It's easy to do for what I like to call globally addressable values - URI References and Literals. No matter where you are, their hash code and equals methods return the same results. Not so with Blank Nodes, which are tied to the context of an RDF graph.
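A small sketch of the difference, again in Java with made-up class names (NodeIdentitySketch, Literal, GraphScopedBlankNode are all hypothetical). The Literal is compared purely by value. For the blank node I'm assuming the obvious local approach, a label scoped to its owning graph, which is exactly the thing that stops working once you leave that graph.

```java
import java.util.Objects;

final class NodeIdentitySketch {

    /** Globally addressable: the same lexical form and datatype is the same value anywhere. */
    static final class Literal {
        final String lexical;
        final String datatype;

        Literal(String lexical, String datatype) {
            this.lexical = lexical;
            this.datatype = datatype;
        }

        @Override public int hashCode() { return Objects.hash(lexical, datatype); }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Literal)) return false;
            Literal other = (Literal) o;
            return lexical.equals(other.lexical) && datatype.equals(other.datatype);
        }
    }

    /** Assumed local approach: a blank node is a label that only means something inside one graph. */
    static final class GraphScopedBlankNode {
        final Object graph;   // stands in for the owning graph; compared by reference
        final String label;   // e.g. "_:b0", reused freely by other graphs

        GraphScopedBlankNode(Object graph, String label) {
            this.graph = graph;
            this.label = label;
        }

        @Override public int hashCode() {
            // identityHashCode is only stable within one JVM, so this hash cannot travel.
            return 31 * System.identityHashCode(graph) + label.hashCode();
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof GraphScopedBlankNode)) return false;
            GraphScopedBlankNode other = (GraphScopedBlankNode) o;
            return graph == other.graph && label.equals(other.label);
        }
    }
}
```

Two graphs parsed separately can both contain "_:b0", and the label alone says nothing about whether they describe the same thing. Scoping by graph fixes the false matches but gives you a hash code and equals that only work on the machine holding that graph.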

One solution is to ban blank nodes: they're a pain to parse, query and store. But I actually like blank nodes. They're good at representing things that you don't want confused with something that might actually be a URI to dereference.

The idea we've been working on with our high-falutin' scale-out MapReduce blah blah is really just coming up with sensible implementations of the hashCode and equals methods for blank nodes. There is previous work on distributing blank nodes across graphs; the one I'm most familiar with is RDF Molecules. But it didn't quite cut it as far as hash codes and equals are concerned, and that's basically what I'm presenting next week in China. The hash code is basically the head triple, and equals is the minimal context: the sub-graph for a given blank node.
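Here is a very rough Java sketch of the shape of that idea, not the actual implementation. Everything in it is an assumption made for illustration: the class names, triples written as plain strings with a "_:self" placeholder where the blank node appears, and in particular picking the head triple as the first triple in sorted order. It also only handles a blank node whose neighbours are all URI References or Literals. Chains of blank nodes would need a real sub-graph comparison.

```java
import java.util.Arrays;
import java.util.TreeSet;

final class BlankNodeContextSketch {

    /** Placeholder written wherever the blank node itself appears in one of its triples. */
    static final String SELF = "_:self";

    /** Hypothetical blank node that carries its own context instead of a node id. */
    static final class ContextBlankNode {
        // The triples mentioning this node, kept sorted so the order they arrived in is irrelevant.
        private final TreeSet<String> context;

        private ContextBlankNode(TreeSet<String> context) { this.context = context; }

        static ContextBlankNode of(String... triples) {
            return new ContextBlankNode(new TreeSet<>(Arrays.asList(triples)));
        }

        /** Assumption: the head triple is simply the smallest triple in sorted order. */
        String headTriple() {
            return context.isEmpty() ? "" : context.first();
        }

        /** Cheap: hash one triple only. Nodes with equal contexts share a head triple, so the contract holds. */
        @Override public int hashCode() { return headTriple().hashCode(); }

        /** Expensive: compare the whole context, but only reached when the hash codes collide. */
        @Override public boolean equals(Object o) {
            return o instanceof ContextBlankNode
                && context.equals(((ContextBlankNode) o).context);
        }
    }
}
```

The point of the split is the usual one for hash codes: the head triple is cheap to compute and ship around, and the full context only gets compared for the few candidates that land in the same bucket.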

There's a lot more to say, as I've had to find something to talk about for the whole 15 minutes.
