Wednesday, April 09, 2008

My (Continued) SPARQL Debacle

One of the reasons I started this blog was to record my current thoughts at a particular time. With this in mind, I should track my recent comments about SPARQL and the empty graph pattern and further rehashing of it.

I made a few mistakes during the discussion and spent well over a week in discussion and maybe a week prior to asking the question thinking about it and much time thereafter just thinking about summarizing it.

SPARQL is an algebra that is not consistent (isomorphic) with what I think of as set/relational/bag algebras (even though it appeared at one stage this was considered). The reason is that identities I believe hold in these algebras don't for SPARQL.

The set/relational/bag algebra identities are:
* A + 0 = A * U = A
* A + U = U
* A * 0 = 0

Where + is UNION, * is INTERSECTION, A is any set, 0 is the empty set and U is the universal set. The second one is expressible and does work in SPARQL. The first one isn't expressible in SPARQL. The last two don't hold.

You can derive the last two identities from the first two as long as you have compatible definitions for things like inverse (or complement). When Date creates the algebra for bags he spends most of his time coming up with a reasonable definition for the complement of a bag which seems to be more like difference. In my interpretation 1/T/U is the relational TABLE_DEE and 0/F/empty set is TABLE_DUM. I thought this is quite clear but it appears even this is up for interpretation.

Prior to Date's latest book, I had a bunch of his writings which I used to create identities for OPTIONAL, JOIN and UNION. I struggled a while back to see whether they were compatible with SPARQL, which I eventually decided that they were compatible, it ends up that they are not - because the identities don't hold. I think as long as you don't ask these questions then it still returns the right answer and you could create special cases for SPARQL's empty graph pattern but I'm just not that confident anymore. SPARQL is more like an algebra of numbers than of sets or bags.

Reflecting on this, I was striving for a consistency that just wasn't there and if I squint hard enough I can see how the SPARQL algebra by itself makes sense.

There is still some behavior, even within the SPARQL specification, that appears to be really bad like "SELECT ?x WHERE { ?s ?p ?o }" being a valid query (it returns a number of unbounds to ?x for however many triples there are in the graph). This is quite different to SQL or relational PROJECT. It's also weird that the SPARQL specification is different to the Perez papers about SPARQL - evaluation is done at a grammatical level. UNION also differs as it's defined as multiset union not set union even though OPTIONAL is made up of set union not multiset union. Actually, I'm still not sure if UNION is multiset union because in the implementations I've seen the order is important (that is {} UNION {} UNION { ?s ?p ?o } is different to {} UNION { ?s ?p ?o } UNION {} and { ?s ?p ?o } UNION {} UNION {}) but I guess that's because of the grammatical evaluation.

It does put any further work on JRDF's SPARQL implementation in a bad position. I can keep calling it SPARQL but know that it's not following the standard or rename it (currently I'm thinking URQL) but the whole point of bothering seems to be questionable. The ironic thing is that it could pass all the SPARQL tests even though I know it's not compatible. In other work that I've been doing, I've been interested in SPARQL as the Unix pipes for RDF and blank node round tripping but SPARQL doesn't work there either. Blank node round tripping is where you take the result of one SPARQL query that includes a blank node and put it into a second.

Sometimes you come away from asking a question feeling validated or smarter and sometimes not. This time it's definitely not - I no longer feel confident talking about SPARQL or relational algebra anymore.

No comments: