Friday, November 16, 2007

SPARQL isn't Unix Pipes

It wasn't supposed to be this way. I was just trying to get what I wrote in 2004 acknowledged. All I wanted then, as now, was aggregate functions and a matching data model. I did a bit more research in 2006 (mostly in my spare time while I had a day job to go to) and thought that people could read it and understand it. I even spent some time over the last Christmas holidays making a gentler introduction to it all.

SPARQL is a proposed recommendation - which is one step away from being a published standard. So, I put my objections in to the SPARQL working group. From what I can tell people either didn't understand or thought that I was some kind of weirdo. The unhappiest part of this is the summary of my objection, "The Working Group is unfamiliar with any existing query languages that meet the commenter's design goals."

All I wanted was closure of operations, where the query language matches the data model it queries. Maybe this is a very odd thing to want. No one seems to know what the relational model really is either. Maybe it's a bad example.

Maybe a better example is Unix pipes. Unix pipes connect tools (sort, cut, etc.) that take in text and output text. That is, each tool's output is the same kind of thing as its input, a property known as closure. So you can take the output of one tool and feed it to another, stringing them together in any order you want. Sometimes it's more efficient to do one operation before another. In SPARQL you can't do that, as the first operation of every query turns the triples into variable bindings.
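To make the closure point concrete, here is a rough sketch in Python rather than shell. The function names (`sort_lines`, `first_field`, `uniq`, `pipe`) are made up for illustration and only loosely mimic sort, cut and uniq. Every stage takes text and returns text, so the stages can be chained in whatever order suits.

```python
from functools import reduce

def sort_lines(text):
    """Text in, text out: sort the lines (loosely like `sort`)."""
    return "\n".join(sorted(text.splitlines()))

def first_field(text, sep=","):
    """Text in, text out: keep the first field of each line (loosely like `cut -f1`)."""
    return "\n".join(line.split(sep)[0] for line in text.splitlines())

def uniq(text):
    """Text in, text out: drop adjacent duplicate lines (loosely like `uniq`)."""
    out = []
    for line in text.splitlines():
        if not out or out[-1] != line:
            out.append(line)
    return "\n".join(out)

def pipe(text, *stages):
    """Feed the output of each stage into the next one."""
    return reduce(lambda acc, stage: stage(acc), stages, text)

data = "pear,3\napple,1\npear,2"

# Because every stage has the same input and output type,
# the order of the first two stages can be swapped freely.
a = pipe(data, first_field, sort_lines, uniq)
b = pipe(data, sort_lines, first_field, uniq)
```

Both orderings give the same two lines here, but one may do less work than the other on large inputs, which is the kind of freedom closure buys you.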

I was hoping that SPARQL would be the Unix pipes of RDF. This would mean that the operations like join, filter, restrict (matching triples) and so on take in an RDF graph (or graphs) and output an RDF graph (or graphs). This gives tremendous flexibility in that you can create new operations that all work on the same model. It also means that a lot of the extra complexity that is part of SPARQL (for example, CONSTRUCT and ASK) goes away.
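Here is one way that idea might look, as a minimal sketch in Python. This is not SPARQL and not JRDF's API; a graph is assumed to be a plain set of (subject, predicate, object) tuples, and `restrict`, `select` and `union` are hypothetical names. The point is only that every operation takes triples and returns triples.

```python
def restrict(triples, s=None, p=None, o=None):
    """Triples in, triples out: keep triples matching the pattern (None is a wildcard)."""
    return frozenset(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

def select(triples, predicate):
    """Triples in, triples out: keep triples for which predicate(triple) is true."""
    return frozenset(t for t in triples if predicate(t))

def union(a, b):
    """Triples in, triples out: merge two sets of triples."""
    return frozenset(a) | frozenset(b)

graph = frozenset({
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "foaf:name", "Bob"),
})

# Operations chain in either order because the result is always a set of triples.
names = restrict(graph, p="foaf:name")
alice_name = restrict(names, s="ex:alice")
same_result = restrict(restrict(graph, s="ex:alice"), p="foaf:name")

# An ASK-style question is just "is the resulting graph non-empty?".
has_knows = bool(restrict(graph, p="foaf:knows"))
```

In this toy model nothing like CONSTRUCT is needed, since the result of every step is already a graph that can be queried again.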

This is not to say that SPARQL doesn't have value and shouldn't be supported. It is just a missed opportunity. It could have avoided repeating mistakes made with SQL (like not standardizing aggregate functions, not having a consistent data model and so on).

Update: I re-read this recently. It struck me that maybe I was being a little unclear about what I expected as input and output in the RDF pipes view of SPARQL. Really, it's not a single RDF graph per se that is being processed but sets of triples. It's not really a big difference, as RDF graphs are just sets of triples, but the point is that the triples being processed don't have to come from one graph. Nothing in what I'm talking about above prevents processing, in one go, triples from many different graphs. The criticism is the same though: SPARQL breaks triples into variable bindings. Processing multiple graphs (or sets of triples) just requires that the graph each triple came from be recorded (the quad in most systems). It's certainly something that could be added to JRDF's SPARQL implementation.
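A small sketch of the quad idea, again in Python with made-up names (`tag`, `restrict_quads`) and not taken from JRDF: each triple carries the name of the graph it came from as a fourth element, so triples from several graphs can be processed together without losing their origin.

```python
def tag(graph_name, triples):
    """Turn (s, p, o) triples into (s, p, o, g) quads that record their source graph."""
    return frozenset((s, p, o, graph_name) for (s, p, o) in triples)

def restrict_quads(quads, s=None, p=None, o=None, g=None):
    """Quads in, quads out: keep quads matching the pattern (None is a wildcard)."""
    pattern = (s, p, o, g)
    return frozenset(
        q for q in quads
        if all(want is None or want == got for want, got in zip(pattern, q))
    )

people = {("ex:alice", "foaf:name", "Alice")}
staff = {("ex:bob", "foaf:name", "Bob"), ("ex:bob", "ex:role", "admin")}

# Triples from two different graphs are processed in one go.
merged = tag("ex:people", people) | tag("ex:staff", staff)
names = restrict_quads(merged, p="foaf:name")

# The source graph of each matching triple is still available.
sources = {q[3] for q in names}
only_staff = restrict_quads(names, g="ex:staff")
```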


Simon said...

Sadly, SPARQL ended up as a tool for salvaging data from that irritating RDF graph format into familiar SQL tables. :P

said...

I think I agree that piping SPARQL queries should have been thought of from the beginning, similar to the W3C's XML pipeline language. I was expecting the RDF and XML groups within the W3C to know about each other's work.

Anyway, maybe (CONSTRUCT * ) will "help" solve the problem. This is not yet part of SPARQL, but I heard it will be added.