Tuesday, August 24, 2004

Java + SW = Jena

How Big Is Your Store? "The repository will end up holding metadata about more than 16 million articles (plus their associated authors, affiliations, publications, etc.) and as you'd imagine that's going to end up exploding into a large number of triples."

"I'd be interested to hear about how big a store people have worked with, including which APIs, etc they've been using.

To give a bit more context, as we're mainly a Java shop I've begun by considering any store that can be plugged into Jena. So the Jena persistence model support would be my baseline, with Kowari being another candidate. I see that RDFStore may be adding Jena support so we may explore that too."

16 million triples shouldn't be a problem for many stores.

I find that Jena does impose some limitations on scalability in that it tends to want to use in-memory Models for a lot of things. It also tends to overuse iterators. And it's a terribly complicated API for something as simple as RDF.

I've recently been thinking that JRDF, Sesame's RIO, the Kowari store and SOFA for OWL inferencing would make a small, scalable solution (around 5MB). Is anyone interested in that, though?

6 comments:

Anonymous said...

It'll be more than 16 million triples: there will be ~5-10 triples per article, so between 80 and 160 million triples.

I'll take a closer look at JRDF (it's already on my radar) and RIO. Thanks for the tip!

L.
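The estimate in the comment above is a simple multiplication, and it can be sketched in a few lines of Java. This is only an illustration: the class and method names are made up here, and the 5-10 triples per article figure is the commenter's rough guess, not a measurement.

```java
public class TripleEstimate {
    // Total triples = number of articles * average triples per article.
    // multiplyExact throws ArithmeticException instead of silently overflowing.
    static long estimate(long articles, long triplesPerArticle) {
        return Math.multiplyExact(articles, triplesPerArticle);
    }

    public static void main(String[] args) {
        long articles = 16_000_000L;        // article count quoted in the post
        long low = estimate(articles, 5);   // lower guess: 5 triples per article
        long high = estimate(articles, 10); // upper guess: 10 triples per article
        System.out.println(low + " to " + high + " triples");
    }
}
```

Authors, affiliations and publications would add triples of their own, so the real total would probably sit above this range.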

Andrew said...

Our store on a 64-bit system has had around 240 million triples in it, but I think it can store more than that. On a 32-bit system you won't get near there: 20 million or so, but that's just the standard configuration. You might get double or triple that, but not much more. I don't know if the new store is going to make it to Kowari - I hope so.

Danny said...

100 million triples sounds quite a challenge, but it's easy to imagine that kind of storage being needed in a lot of other domains (what do the bio folks do?).

I've never looked at this kind of scale, couple of questions you may be able to answer - is it feasible to leverage an underlying RDBMS with some/any of the APIs (I realise there's ModelDB etc. but will they scale)?

You mention a ceiling with the 32bit systems - do you know of any work on using multiple fairly-big stores together?

Andrew said...

Our distributed queries in TKS allow you to query multiple servers at once. So you can scale that way too if you want.

As far as SQL databases are concerned I think MySQL certainly seems to perform well up to a point. I'm unsure when that point occurs (it's been a long time since I've used an SQL database to store RDF) but I think it's well before 100 million triples.
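The idea of querying several servers at once, mentioned in the reply above, can be sketched roughly as a fan-out and merge. This is not the TKS or Kowari API: `TripleStore`, `InMemoryStore` and `queryAll` are hypothetical names invented for this illustration, and each "server" here is just an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;

public class FederatedQuerySketch {
    // Minimal triple holder (hypothetical, not a JRDF or Kowari type).
    static final class Triple {
        final String subject, predicate, object;
        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // Hypothetical store interface; a null argument acts as a wildcard.
    interface TripleStore {
        List<Triple> find(String s, String p, String o);
    }

    // Stand-in for one server: a plain list scanned linearly.
    static final class InMemoryStore implements TripleStore {
        private final List<Triple> triples = new ArrayList<>();

        void add(String s, String p, String o) {
            triples.add(new Triple(s, p, o));
        }

        @Override
        public List<Triple> find(String s, String p, String o) {
            List<Triple> hits = new ArrayList<>();
            for (Triple t : triples) {
                boolean match = (s == null || s.equals(t.subject))
                        && (p == null || p.equals(t.predicate))
                        && (o == null || o.equals(t.object));
                if (match) {
                    hits.add(t);
                }
            }
            return hits;
        }
    }

    // Send the same pattern to every store and concatenate the results.
    static List<Triple> queryAll(List<? extends TripleStore> stores,
                                 String s, String p, String o) {
        List<Triple> merged = new ArrayList<>();
        for (TripleStore store : stores) {
            merged.addAll(store.find(s, p, o));
        }
        return merged;
    }

    public static void main(String[] args) {
        InMemoryStore a = new InMemoryStore();
        InMemoryStore b = new InMemoryStore();
        a.add("article:1", "dc:creator", "author:1");
        b.add("article:2", "dc:creator", "author:1");
        b.add("article:2", "dc:title", "Some title");

        List<Triple> hits = queryAll(List.of(a, b), null, "dc:creator", null);
        System.out.println(hits.size() + " matching triples across both stores");
    }
}
```

A real distributed query layer would also have to handle joins that span servers, duplicates and network failures, none of which this sketch attempts.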

Anonymous said...

Yes, I am interested....

a nice rdf db that has a light footprint would be great. The thought that going with it won't limit any future scalability requirements is a nice one as well.

Anonymous said...

oh.... and did I mention that java based is key ;)