Wednesday, August 03, 2005

Duplication is a Mistake

Again, I'm looking at DISTINCT in SPARQL.

In the relational world, Date argues that users don't care about duplicates, and that allowing them makes optimization difficult and invalidates operations (like JOIN). Preventing duplicates means that optimizers can make logically equivalent transformations. It seems quite valid to make distinct results the only option.

An example he gives, in "DOUBLE TROUBLE, DOUBLE TROUBLE PART 1", is a query to get supplier numbers for suppliers who supply at least one part: "The obvious first point to make is that the twelve different formulations produce nine different results! -- different, that is, with respect to their degree of duplication...Thus, if the user really cares about duplicates, then he or she needs to be extremely careful in formulating the query appropriately in order to obtain exactly the desired result."

"Here are some implications of this point:
* First, the optimizer code itself is harder to write, harder to maintain, and probably more buggy--all of which combines to make the product simultaneously more expensive and less reliable, as well as late in delivery in the marketplace.
* Second, system performance is likely to be worse than it might otherwise be.
* Third, the user is going to have to get involved in performance issues; for instance, the user might have to spend time and effort on figuring out the best way to express a given query (a state of affairs, incidentally, that the relational model was explicitly designed to avoid)."

"...if I say "the sun is shining here today" and "the sun is shining here today," I'm simply telling you the sun is shining here today! And from this perspective, the notion of duplicate rows--as that notion is usually understood--obviously makes no sense at all."
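
To make the point about formulations concrete, here is a rough Python sketch of my own (not taken from Date's article). The supplier and shipment data are made up, and the two functions are hypothetical stand-ins for two ways of asking for "suppliers who supply at least one part": a join followed by a projection, and an existence test. With duplicates kept the two answers differ; as sets they are the same.

```python
from collections import Counter

# Hypothetical data for illustration only, not Date's actual tables.
suppliers = ["S1", "S2", "S3"]                           # supplier numbers
shipments = [("S1", "P1"), ("S1", "P2"), ("S2", "P1")]   # (supplier, part)

def via_join(suppliers, shipments):
    """Join suppliers to shipments, then project the supplier number."""
    return [s for s in suppliers for (sp_s, _part) in shipments if s == sp_s]

def via_exists(suppliers, shipments):
    """Keep each supplier for which at least one shipment exists."""
    return [s for s in suppliers if any(sp_s == s for (sp_s, _part) in shipments)]

# Counter acts as a bag: it records how many times each row appears.
bag_join = Counter(via_join(suppliers, shipments))
bag_exists = Counter(via_exists(suppliers, shipments))

# With duplicates kept, the two formulations give different results.
print("same as bags:", bag_join == bag_exists)
# With duplicates removed, they give the same result.
print("same as sets:", set(bag_join) == set(bag_exists))
```

If results are always sets, an optimizer is free to swap one formulation for the other; if duplicates count, it is not.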

There's also a part two.

The same point is made here: "I think it would be a mistake for the query language to take a position on whether or not query result sets could contain duplicate rows (or if it did take a position, I'd want it to be that they couldn't!) From a selfish perspective, I worry that we'll have to de-tune RDF Gateway's query evaluation in order to allow duplicate rows to exist in a resultset (after all if a user wants duplicate rows, they can merely select out the variable(s) that make those rows distinguishable). Perhaps the issue of duplicate rows could be implementation specific?"
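
The parenthetical about selecting the variables that make rows distinguishable can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the triples are made up, and `select` is a hypothetical helper, not RDF Gateway's API or a SPARQL engine. It always returns a set of rows, so duplicates cannot occur.

```python
# Hypothetical triples for illustration only.
triples = [
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
]

def select(triples, predicate, variables):
    """Match (?s predicate ?o) and return a set of rows projected onto `variables`."""
    rows = set()
    for s, p, o in triples:
        if p == predicate:
            binding = {"s": s, "o": o}
            rows.add(tuple(binding[v] for v in variables))
    return rows

# Projecting only ?s collapses both matches into a single row.
only_subject = select(triples, "knows", ["s"])
# Also selecting ?o keeps the matches apart, with no duplicates needed.
subject_and_object = select(triples, "knows", ["s", "o"])

print(len(only_subject), len(subject_and_object))
```

A user who wants one row per match asks for the variable that tells the matches apart, and the result stays a set.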


It seems that Danny is reading the same thing I am.
