Thursday, March 29, 2007

MPTStore

Presentation Summary on MPTStore. A summary of an interesting approach by the Fedora guys to storing lots of triples, fast.

The real motivation behind experimenting with a new triplestore, however, was the NSDL use case. The National Science Digital Library (NSDL) is a moderately large repository (4.7 million objects, 250 million triples) with a lot of write activity (driven by periodic OAI harvests; primarily mixed ingests and datastream modifications). The NSDL data model also includes existential/referential integrity constraints that must be enforced. Querying the RI to determine correct repository state proved to be difficult: Kowari aggressively buffers triples, sometimes on the order of seconds, before writing them to disk. Flushing the buffer after every write is also computationally expensive (hence the drive to use buffers in the first place).


Based on this observation, their solution, called “Mapped Predicate Tables,” creates a table for every predicate in the triplestore. This has several advantages: a low computational cost for triple adds and deletes, fast queries for known predicates, complex queries that benefit from the relatively mature RDBMS planner having finer-granularity statistics and query plans, and flexible data partitioning to help address scalability. This solution comes with several disadvantages, however: one needs to manage the predicate-to-table mapping, complex queries crossing many predicates require more effort to formulate, and with a naive approach simple queries that leave the predicate unbound scale linearly with the number of predicates.
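To make the idea concrete, here is a minimal sketch of a table-per-predicate layout using Python's built-in sqlite3 module. It is not MPTStore's actual schema or API: the class name, the `pmap` mapping table, the `t<N>` table names and the `s`/`o` columns are all assumptions made up for illustration.

```python
import sqlite3


class MappedPredicateStore:
    """Toy illustration only: one two-column (s, o) table per predicate."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        # The predicate-to-table mapping that this approach has to manage.
        self.db.execute(
            "CREATE TABLE pmap (pkey INTEGER PRIMARY KEY, predicate TEXT UNIQUE)"
        )

    def _table_for(self, predicate, create=False):
        row = self.db.execute(
            "SELECT pkey FROM pmap WHERE predicate = ?", (predicate,)
        ).fetchone()
        if row is not None:
            return f"t{row[0]}"
        if not create:
            return None
        cur = self.db.execute("INSERT INTO pmap (predicate) VALUES (?)", (predicate,))
        # Table names come from integer keys, never from user input.
        table = f"t{cur.lastrowid}"
        self.db.execute(f"CREATE TABLE {table} (s TEXT, o TEXT, UNIQUE (s, o))")
        return table

    def add(self, s, p, o):
        # An add touches exactly one small table.
        table = self._table_for(p, create=True)
        self.db.execute(f"INSERT OR IGNORE INTO {table} (s, o) VALUES (?, ?)", (s, o))

    def delete(self, s, p, o):
        table = self._table_for(p)
        if table is not None:
            self.db.execute(f"DELETE FROM {table} WHERE s = ? AND o = ?", (s, o))

    def by_predicate(self, p):
        # Known predicate: a single-table query.
        table = self._table_for(p)
        if table is None:
            return []
        return self.db.execute(f"SELECT s, o FROM {table} ORDER BY s, o").fetchall()

    def by_subject(self, s):
        # Predicate unbound: the naive approach visits every predicate table,
        # so the cost grows linearly with the number of predicates.
        results = []
        mapping = self.db.execute(
            "SELECT pkey, predicate FROM pmap ORDER BY pkey"
        ).fetchall()
        for pkey, predicate in mapping:
            rows = self.db.execute(
                f"SELECT o FROM t{pkey} WHERE s = ? ORDER BY o", (s,)
            )
            results.extend((s, predicate, o) for (o,) in rows)
        return results


store = MappedPredicateStore()
store.add("obj:1", "rel:memberOf", "coll:a")
store.add("obj:1", "dc:title", "First")
store.add("obj:2", "rel:memberOf", "coll:a")
store.delete("obj:2", "rel:memberOf", "coll:a")
```

In this sketch `by_predicate` hits one table, while `by_subject` has to loop over the whole mapping, which is the trade-off described above. A less naive version might issue a single UNION ALL across the tables, but it would still touch each one.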


They achieved basically the same performance with either asynchronous or synchronous modification.

The project is available on SourceForge, including slides and javadoc (which has a similar design to JRDF, except with no blank nodes).

2 comments:

Rob said...

Eek! An RDBMS? The sky is falling!

I thought a native triple store would finally rid the world of the inflexible relational model.

The (relatively) expensive write operations in Kowari may also be related to the AVL tree data structure. Writing requires balancing.

If the store requires a higher ratio of writes to reads, perhaps a binary tree may work better (or even just infrequent balancing of the tree run in a background thread). But this would obviously require some extra development.

What do you think? Am I screaming heresy? :D

Andrew said...

I still think a native triple store is going to kick the butt of any SQL database. Not only in storing RDF but also from requiring less administration and better support for developers. But I also think that RDF databases are going to outperform SQL databases too.

The AVL tree in Kowari makes a fine structure for a certain kind of usage - many reads, single writes. You could also create a multi-write AVL tree if you wanted to but there are probably better structures.

Ideally, I'd like a triple store that could tune itself automatically - optimise itself based on a variety of things like the shape of the triples, the ontology and the queries.

This means moving away from a single data structure to store all data and being able to pick which data is stored in which structure.

SQL databases do this now but they're limited in what they can do. With the data decomposed into triples the optimiser has much greater flexibility to rearrange the data. And it's there in a crude form in Kowari's resolvers (but there's no knowledge expressed on the abilities of resolvers to process the data).

Furthermore, there's nothing inherently wrong with the relational model: it's flexible enough, or can be made flexible enough with untyped relations (as I've written about before).

I thought this article was interesting because the writers looked at their data and saw that there weren't many predicates and there weren't any blank nodes and optimised it for that purpose.

Now we just have to get the people out of the picture.