Wednesday, November 28, 2007

Present and Future Scalability

Measurable Targets for Scalable Reasoning

This analysis is based on published results for several of the most scalable engines: Oracle, AllegroGraph, DAML DB, OpenLink Virtuoso, and BigOWLIM. The targets are defined with respect to two of the currently most popular performance measuring sticks: the LUBM repository benchmark (and its UOBM modification) and the OWL version of UniProt - the richest database of protein-related information.
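To give a flavour of what the LUBM workload actually measures, here is a small sketch of a query modelled on LUBM Query 1 (graduate students taking a particular course), run with Python's rdflib purely for illustration; the namespace and data URIs are my assumption and should be checked against the LUBM distribution:

    from rdflib import Graph

    # A query modelled on LUBM Query 1: graduate students taking a
    # particular course. The URIs follow my reading of the LUBM
    # distribution and should be verified against the benchmark itself.
    LUBM_Q1 = """
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ub:  <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
    SELECT ?x WHERE {
      ?x rdf:type ub:GraduateStudent .
      ?x ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0> .
    }
    """

    g = Graph()
    # g.parse("lubm_university0.nt", format="nt")   # hypothetical LUBM data dump
    for row in g.query(LUBM_Q1):
        print(row.x)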


An interesting aspect of this paper is its description of some of the difficulties in comparing data loading and querying across triple stores. For example, load times are affected by whether forward chaining (inferencing) is performed and by the complexity of the data model (whether named graphs and other metadata are used). Query evaluation times can vary due to backward chaining, result set size and the types of queries.
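As a rough illustration of why load figures are hard to compare, the sketch below (Python with rdflib, purely illustrative; the file name and format are assumptions) times a bulk load and reports a naive triples-per-second figure. A store that materialises inferred triples during the load, or that maintains named-graph metadata, is doing more work per second than this number suggests, which is exactly why raw rates are not directly comparable:

    import time
    from rdflib import Graph

    def timed_load(path, fmt="nt"):
        """Load an RDF file and report a naive triples-per-second figure.

        rdflib's plain Graph performs no forward chaining and stores no
        named-graph metadata, so the rate it reports is not directly
        comparable with a store that does either while loading.
        """
        g = Graph()
        start = time.time()
        g.parse(path, format=fmt)                 # assumed: an N-Triples dump
        elapsed = max(time.time() - start, 1e-9)  # avoid division by zero
        print(f"{len(g)} explicit triples in {elapsed:.1f}s "
              f"({len(g) / elapsed:,.0f} triples/s)")
        return g

    # timed_load("lubm_university0.nt")           # hypothetical data file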

While some triple stores can now load up to 40,000 triples a second (BigOWLIM), the average seems to be around 10,000 triples a second for a billion triples. The target in the next few years is 20-100 billion triples at 100,000 triples per second. The rate of 100,000 triples per second is at the upper end, but I would imagine that to load the data in a reasonable time this is what people have to aim for. Otherwise, at 10,000 triples per second, 100 billion triples is going to take well over 100 days to load.
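The arithmetic behind that last sentence is straightforward; a minimal Python sketch, using only the rates quoted above:

    SECONDS_PER_DAY = 60 * 60 * 24

    def load_days(total_triples, triples_per_second):
        """Days needed to load total_triples at a sustained rate."""
        return total_triples / triples_per_second / SECONDS_PER_DAY

    for rate in (10_000, 40_000, 100_000):
        print(f"{rate:>7,} triples/s -> {load_days(100e9, rate):6.1f} days for 100 billion triples")

    # 10,000 triples/s  -> ~115.7 days
    # 40,000 triples/s  -> ~28.9 days
    # 100,000 triples/s -> ~11.6 days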

7 comments:

Kingsley Idehen said...

Andrew,

I've looked at your benchmark numbers and they are off base re. Virtuoso.

This particular benchmark is affected by working sets and memory, so you can't compare an 8 gig run against an engine equipped with 16 gigs, amongst other things.

If you don't mind, send me your Virtuoso INI file, and I will send you an optimized variant that you can use to re-run your tests.

I don't expect any of the engines on the list to be faster than Virtuoso when we are running on the same server, so let's simply go for a head-to-head on the same machine versus DAML-DB. And if DAML-DB is faster, then I guess we have some work to do :-)

Kingsley Idehen said...

Here is a link to a project that has an INI file tuned for a large RDF-based project (Neurocommons); you can use the notes there to tweak your Virtuoso INI file.

http://esw.w3.org/topic/HCLS/Banff2007Demo/HCLS/Banff2007Demo/HowToMakeOneForYourself

Kingsley Idehen said...

I noticed a problem with the URI in my previous comment, so here is the INI file URI.

Kingsley Idehen said...

Sorry!

Just realized you aren't the one who performed the benchmark :-)

Unfortunately the paper is missing contact details for the author, etc.

Naso said...

Kingsley,

first of all, sorry for not providing my contact details. One can find them by googling "Atanas Kiryakov". I will fix this in the paper.

Regarding the results - we do not publish results of benchmarks run on our side for any system other than our own repository (OWLIM). This is partly due to the risk of sub-optimal configuration.

The figures about OpenLink in my paper are collected from your publication "Advances in Virtuoso RDF Triple Storage (Bitmap Indexing)". A reference is provided in my publication.

I agree that this is not a head-to-head comparison - this is made clear in the paper. Still, the principle of "publishing manufacturer's data" has its own merits.

If you have more up-to-date performance results that can fit in my overview, please provide them and I will be happy to update it.

Regards,
Atanas Kiryakov

Kingsley Idehen said...

Naso,

I did Google you, hence the email you received from me :-)

We will furnish you with updated numbers, as the document you used is somewhat dated re. Virtuoso.

BTW - Virtuoso is the server behind http://dbpedia.org/sparql (i.e. the public SPARQL endpoint for the DBpedia RDF Data Store). You may consider incorporating DBpedia and other Linked Data datasets into your research, etc.

Naso said...

Dear Kingsley,

I will take care to update the figures as you provide them to me, but I would prefer that you also publish them somewhere, so that I can put a reference in my paper. This way it will be clear what you report as results and what is the analysis and interpretation on my side.

Regarding the SWEO datasets - we are aware of this page. However, benchmarking on our side is not the point of my paper; it is limited to interpreting published performance evaluation data. Even so, I had to do more interpretation of the data than I feel comfortable with. For some reason, most people provide results in a rather sparse and selective manner.

One of the motivations behind my paper was to provoke other semantic repository developers to start publishing more concrete and detailed evaluation data that can be compared across systems.

Best regards
Naso