Wednesday, November 28, 2007

Present and Future Scalability

Measurable Targets for Scalable Reasoning

This analysis is based on the published results of several of the most scalable engines: Oracle, AllegroGraph, DAML DB, OpenLink Virtuoso, and BigOWLIM. The targets are defined with respect to two of the most popular current performance benchmarks: the LUBM repository benchmark (and its UOBM modification) and the OWL version of UNIPROT, the richest database of protein-related information.

An interesting aspect of this paper is its description of some of the difficulties in comparing data loading and querying across triple stores. For example, load times can vary depending on whether forward chaining (inferencing) is performed and on the complexity of the data model (whether named graphs and other metadata are used). Query evaluation can vary due to backward chaining, result set size, and the types of queries.

While some triple stores can now load up to 40,000 triples a second (BigOWLIM), the average seems to be around 10,000 triples a second for a billion triples. The target in the next few years is 20-100 billion triples at 100,000 triples per second. The rate of 100,000 triples per second is at the upper end, but to load that much data in a reasonable time it is what people will have to aim for. Otherwise, at the current average of 10,000 triples per second, loading 100 billion triples would take about 116 days.
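As a rough sanity check on these figures, here is a minimal Python sketch (my own, not from the paper; the function name days_to_load is just for illustration) that works out wall-clock load times at the rates quoted above:

```python
def days_to_load(total_triples: float, triples_per_second: float) -> float:
    """Days needed to load total_triples at a sustained rate."""
    seconds = total_triples / triples_per_second
    return seconds / (60 * 60 * 24)

if __name__ == "__main__":
    target = 100e9  # the 100-billion-triple upper target mentioned above
    for rate in (10_000, 40_000, 100_000):  # triples per second
        print(f"{rate:>7,} triples/s -> {days_to_load(target, rate):6.1f} days")
```

At 10,000 triples per second this comes out to roughly 116 days, at BigOWLIM's 40,000 about 29 days, and only at 100,000 triples per second does the load drop to a more tolerable 12 days or so, which is why that upper rate matters.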