Wednesday, November 28, 2007

Academic Software and BioMANTA

On Monday I gave a presentation at the ACB all hands meeting on the BioMANTA project. It covered the basics: the integration process, ontology design and the architecture.

Some of the presentations were fairly incomprehensible. But of the ones I did understand, the lipid raft modeling (which looked a bit like Conway's Game of Life) was perhaps the coolest. There were quite a few presentations of software that involved: "we did it this way, the specifications changed, next time we'll do it right because what we have at the moment is a mess". It's very frustrating to see that change is still not accepted as the norm.

Two posts I read recently reminded me of this too: "Architecture and innovation" and "Why is it so difficult to develop systems?". Both describe the poor state of academic software. Many projects suffer from this problem, not just academic ones, although academic projects seem especially prone to adding technology because it's cool/trendy/whatever. That may help get the work published, but it usually obscures the genuinely novel aspects (or worse, there's nothing there except the cool or trendy technology).

But my impression of the last 10-15 years (especially W3C and Grid/eScience projects) is that they rapidly become overcomplicated, overextended and fail to get people using them.

Ultimately much of the database and repository technology is too complicated for what we need at the start of the process. I am involved in one project where the database requires an expert to spend six months tooling it up. I thought DSpace was the right way to go to reposit my data but it wasn’t. I (or rather Jim) put 150,000+ molecules into it but they aren’t indexed by Google and we can’t get them out en masse. Next time we’ll simply use web pages.

By contrast we find that individual scientists, if given the choice, revert to two simple, well-proven systems:

* the hierarchical filesystem
* the spreadsheet

A major reason these hide complexity is that they have no learning curve, and have literally millions of user-years of experience behind them. We take the filesystem for granted, but it's actually a brilliant invention. The credit goes to Dennis Ritchie in ca. 1969. (I well remember my backing store being composed of punched tape and cards.)

If you want differential access to resources, record locking, audit trails, rollback and integrity of commits, and you are building it all from scratch, it will be a lot of work. And you lose sight of your users.

So we’re looking seriously at systems based on simpler technology than databases - such as RDF triple stores coupled to the filesystem and XML.

Present and Future Scalability

Measurable Targets for Scalable Reasoning

This analysis is based on published results of several of the most scalable engines: ORACLE, AllegroGraph, DAML DB, Openlink Virtuoso, and BigOWLIM. The targets are defined with respect to two of the currently most popular performance measuring sticks: the LUBM repository benchmark (and its UOBM modification) and the OWL version of UNIPROT - the richest database of protein-related information.

An interesting aspect of this paper is its description of some of the difficulties in comparing data loading and querying across triple stores. For example, load time can be affected by forward chaining (inferencing) and by the complexity of the data model (whether named graphs and other metadata are used). Query evaluation can vary due to backward chaining, result set size and the types of queries.

While some triple stores can now load up to 40,000 triples a second (BigOWLIM), the average seems to be around 10,000 triples a second over a billion triples. The target in the next few years is 20-100 billion triples at 100,000 triples per second. The rate of 100,000 triples per second is the upper range, but I would imagine that to load the data in a reasonable time this is what people have to aim towards. Otherwise, at 10,000 triples per second, 100 billion triples is going to take over 100 days to load.
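To make that load-time arithmetic concrete, here is a quick back-of-the-envelope sketch using the rates quoted above:

```python
# Load time for 100 billion triples at the rates mentioned above
# (the rates are the benchmark figures quoted in the text, not
# new measurements).
TRIPLES = 100_000_000_000
SECONDS_PER_DAY = 86_400

for rate in (10_000, 40_000, 100_000):  # triples per second
    days = TRIPLES / rate / SECONDS_PER_DAY
    print(f"{rate:>7,} triples/s -> {days:,.0f} days")
```

At today's average of 10,000 triples per second the load takes about 116 days, which is why the 100,000 triples per second figure is the rate to aim for.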

Thursday, November 22, 2007

Why Riding Your Bike Doesn't Help Global Warming

An interesting article, "Climate Change – an alternative approach", highlights that the current supply of oil fails to meet demand. So reducing your CO2 emissions, for example by riding your bike to work, does not help - the drop in demand is quickly met by consumption somewhere else. The author suggests a more viable approach is to focus on production rather than consumption.

He was talking about flying to Sydney and stated that if you chose not to fly you were making an immediate carbon saving (as opposed to offsetting the flight, where the saving is at best delayed, if it ever happens at all). Does tearing up your ticket to Sydney reduce carbon emissions? Ask the question: have some fossil fuels been left in the ground that would otherwise be extracted? The answer, absolutely not - and I'm not talking about how the plane's still going to fly without you.

I’m talking about the fact that oil extraction is not determined by demand, it’s determined by supply. It has been since earlier this decade when the market price diverged markedly from the production costs.

According to the author, the answer lies in the oil- and coal-producing nations reducing production, which is much easier to coordinate than the behaviour of billions of consumers. This doesn't detract from investing in reduced consumption, and the usual suspects (Australia, India, the USA and China) are involved on both sides of production and consumption.

Friday, November 16, 2007

SPARQL isn't Unix Pipes

It wasn't supposed to be this way. I was just trying to get what I wrote in 2004 acknowledged. All I wanted then, as now, was aggregate functions and a matching data model. I did a bit more research in 2006 (mostly in my spare time while I had a day job to go to) and thought that people could read it and understand it. I even spent some time over the last Christmas holidays writing a gentler introduction to it all.

SPARQL is a Proposed Recommendation - which is one step away from being a published standard. So I put my objections to the SPARQL working group. From what I can tell, people either didn't understand or thought that I was some kind of weirdo. The unhappiest part of this is the summary of my objection: "The Working Group is unfamiliar with any existing query languages that meet the commenter's design goals."

All I wanted was closure of operations, where the query language matches the data model it queries. Maybe this is a very odd thing to want. No one seems to know what the relational model really is either. Maybe it's a bad example.

Maybe a better example is Unix pipes. Unix pipes have operations (sort, cut, etc.) that take in text and output text. That is, they produce the same kind of output as they take as input - a property known as closure. So you can take the output of one tool as the input of another and string them together in any order you want. Sometimes it's more efficient to do one operation before another. In SPARQL you can't reorder like that, as the first operation of every query turns the triples into variable bindings.

I was hoping that SPARQL would be the Unix pipes of RDF. This would mean that operations like join, filter, restrict (matching triples) and so on take in an RDF graph (or graphs) and output an RDF graph (or graphs). This gives tremendous flexibility in that you can create new operations that all work on the same model. It also means that a lot of the extra complexity that is part of SPARQL (for example, CONSTRUCT and ASK) goes away.
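A minimal sketch of what closed graph operations look like - with triples as plain 3-tuples and hypothetical operation names, not any real API - might be:

```python
# Hypothetical sketch of closed graph operations: every operation
# takes a set of triples and returns a set of triples, so they
# compose in any order, like Unix pipes compose text filters.

def restrict(graph, predicate):
    """Keep only triples with the given predicate."""
    return {(s, p, o) for (s, p, o) in graph if p == predicate}

def rename(graph, old, new):
    """Rewrite one node to another throughout the graph."""
    swap = lambda n: new if n == old else n
    return {(swap(s), p, swap(o)) for (s, p, o) in graph}

graph = {
    ("a", "type", "Protein"),
    ("a", "name", "p53"),
    ("b", "type", "Lipid"),
}

# Because input and output are the same kind of thing, the
# operations can be strung together in either order.
out1 = rename(restrict(graph, "type"), "a", "x")
out2 = restrict(rename(graph, "a", "x"), "type")
assert out1 == out2  # closure makes reordering safe
```

The point of the sketch is the reordering in the last two lines: an optimiser (or a user) can push the cheaper operation first without changing the result, which is exactly what bindings-based evaluation forecloses.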

This is not to say that SPARQL doesn't have value and shouldn't be supported. It is just a missed opportunity. It could have avoided repeating the mistakes made with SQL (like failing to standardize aggregate functions, lacking a consistent data model and so on).

Update: I re-read this recently. It struck me that maybe I was being a little unclear about what I expected as input and output in the RDF pipes view of SPARQL. Really, it's not a single RDF graph per se that is being processed but sets of triples. It's not a big difference - RDF graphs are just sets of triples - but the triples being processed don't have to come from one graph. There's no restriction in what I'm describing above on processing, in one go, triples from many different graphs. The criticism is the same though: SPARQL breaks triples into variable bindings. Processing multiple graphs (or sets of triples) just requires recording the graph each triple came from (the quad in most systems). It's certainly something that could be added to JRDF's SPARQL implementation.
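Extending the earlier sketch to multiple source graphs just means carrying the graph name alongside each triple - the quad. Again a hypothetical illustration, not a real implementation:

```python
# Hypothetical quad sketch: a quad is a triple plus the name of the
# graph it came from, so triples from many graphs can be processed
# in one go without losing their origin.

quads = {
    ("a", "type", "Protein", "graph1"),
    ("b", "type", "Lipid", "graph1"),
    ("c", "type", "Gene", "graph2"),
}

def restrict(quads, predicate):
    """Still closed: quads in, quads out."""
    return {(s, p, o, g) for (s, p, o, g) in quads if p == predicate}

# One pass over triples from both graphs; the origin is preserved.
typed = restrict(quads, "type")
origins = {g for (_, _, _, g) in typed}
```

The operations stay closed (quads in, quads out); the extra element only records provenance.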

Thursday, November 15, 2007

Sesame Native Store

I'm very impressed at the moment with OpenRDF's native store as others have been in the past. One of the best things is how easy it was to work into the existing JRDF code.

As I've said before, I've been searching for an on-disk solution for loading and simple processing of RDF/XML. In the experiments I've been doing, OpenRDF's B-tree index is much faster than any other solution (again, not unexpected based on previous tests). The node pool/string pool (ValueStore), though, is a bit slower than both BDB and db4o.

Loading 100,000 triples on my MacBook Pro 2GHz takes 37 secs with pure Sesame, 27 with the Sesame index and db4o value store, and 35 with the BDB value store; ehCache is still going (> 5 minutes). A million triples takes around 5 minutes with the Sesame index and db4o node pool (about 3,400 triples/second) and 3 minutes with the Sesame index and in-memory node pool (about 5,500 triples/second).
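Converting those timings into rates, using the figures quoted above:

```python
# Rough triples/second from the timings above (the timings are the
# figures quoted in the post, not new measurements).
timings = {
    "pure Sesame (100k in 37s)": (100_000, 37),
    "Sesame index + db4o (100k in 27s)": (100_000, 27),
    "Sesame index + db4o (1M in ~5 min)": (1_000_000, 5 * 60),
    "Sesame index + memory (1M in ~3 min)": (1_000_000, 3 * 60),
}
for name, (triples, secs) in timings.items():
    print(f"{name}: {triples / secs:,.0f} triples/s")
```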

There's lots of cleanup to go and there's no caching or anything clever going on at the moment, as I'm trying to hit deadlines. 0.5.2 is going to be a lot faster than 0.5.1 for this stuff.

Update: I've done some testing on some fairly low-end servers (PowerEdge SC440, Xeon 1.86GHz, 2GB RAM) and the results are quite impressive: 100,000 triples average around 11,000 triples/second and 10 million average 9,451 triples/second.

Update 2: JRDF 0.5.2 is out. This is a fairly minor release for end user functionality but meets the desired goal of creating, reading and writing lots of RDF/XML quickly. Just to give some more figures: Bdb/Sesame/db4o (SortedDiskJRDFFactory) is 30% faster for adds and 10% slower for writing out RDF/XML than Bdb/Sesame (SortedBdbJRDFFactory). Both have roughly the same performance for finds. I removed ehcache as it was too slow compared to the other approaches.

Friday, November 09, 2007

JRDF 0.5.1 Released

This release is mainly a bug-fix release. There are improvements and fixes to the Resource API, datatype support and persistence. Another persistence library has been added, db4o, which has some different characteristics compared to the BDB implementation. However, it's generally a little slower than BDB. The persistence offered is currently only useful for processing large RDF files in environments where memory is limited.

Also, the bug fixes made to One JAR have been integrated, so JRDF no longer has its own version.

Available here.