Monday, May 26, 2008

RDF Processing

One of the interesting things about biological data, and probably other types, is that a lot of it is not quite the right structure. That's not to say there aren't people working to improve it (the Gene Ontology seems to be updated almost daily), but data in any structure may be wrong for a particular purpose.

Biologists make a habit, out of necessity, of just hacking and transforming large amounts of data to suit their particular need. Sometimes these hacks get more generalized and abstracted, like GO Slims. We've been using GO Slims in BioMANTA for sub-cellular location (going from 2000 terms to 500). GO contains lots and lots of information; you don't need it all at once, and more often than not you don't need it at its maximum level of granularity. Some categories only have one or two known instances, for example. You may even need to whittle this down further (from say 500 to 200). For example, when we are determining the quality of an interaction we only care about where the proteins exist generally in an organism. If two proteins are recorded as interacting but one is in the heart and the other is in the liver, then it's unlikely that they will interact in the host organism. The part of the liver or the heart and other finer structural detail is not required for this kind of work (AFAIK anyway).
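
To make that concrete, here's a rough sketch of the kind of collapsing involved. It's plain Python with made-up term labels and a hand-written slim map, not the real GO Slim files or our actual code:

    # Hypothetical mapping from fine-grained location terms to coarser "slim" terms.
    GO_SLIM = {
        "mitochondrial inner membrane": "mitochondrion",
        "mitochondrial matrix": "mitochondrion",
        "Golgi cisterna": "Golgi apparatus",
        "nuclear envelope": "nucleus",
    }

    def slim_terms(detailed_terms):
        # Collapse a set of detailed terms to their slim equivalents,
        # leaving unknown terms untouched.
        return {GO_SLIM.get(t, t) for t in detailed_terms}

    def plausible_interaction(terms_a, terms_b):
        # Two proteins are plausible interactors only if their slimmed
        # locations overlap somewhere.
        return bool(slim_terms(terms_a) & slim_terms(terms_b))

The particular mapping doesn't matter; what matters is that every location statement gets pushed through a table like this before anything else looks at it.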

The point is, a lot of our work is processing RDF, not querying it. What's the difference between the two and what effect does it have?

For a start, querying assumes, at least to some degree, selectivity - that the results you're getting are vastly smaller than your original data. In processing, you're taking all of the data, or large chunks of it (by sets of predicates, for example), and changing it or producing more data based on the original set.
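
As a sketch of that difference, assuming the same made-up (subject, predicate, object) tuples and "hasLocation" predicate name rather than a real triple store, a processing step touches every triple and emits a derived set instead of picking out a handful of answers:

    def rewrite_locations(triples, slim_map):
        # Full pass over the data: every triple is inspected, and location
        # statements are rewritten against the slim map. The output is the
        # same order of magnitude as the input, unlike a selective query.
        for s, p, o in triples:
            if p == "hasLocation":
                yield (s, p, slim_map.get(o, o))
            else:
                yield (s, p, o)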

Also, writing is at least as important as reading the data. So data structures optimized for lots of writes (temporary, concurrent ones) are of greater importance than those built around the more familiar requirements of a database.

Sorting and processing distinct items is a lot more important too. When processing millions of entries it can be quite inefficient if the data has a large number of duplicates and needs to be sorted. Processing can also be decentralized - or perhaps more easily decentralized.
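
A minimal sketch of both ideas, using the same tuple representation as above: deduplicate before sorting, and partition by predicate so each group can be handled independently. This is an in-memory stand-in for what would really be an external sort or a distributed job:

    from collections import defaultdict

    def sort_distinct(triples):
        # Drop duplicates, then sort; over millions of triples this would
        # be an external sort, but the shape of the step is the same.
        return sorted(set(triples))

    def partition_by_predicate(triples):
        # Group triples by predicate so each group can be processed on its
        # own, potentially on a different machine.
        groups = defaultdict(list)
        for s, p, o in triples:
            groups[p].append((s, p, o))
        return groups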

To top it off, the data still has to be queried, so this doesn't remove the need for efficient, read-only data structures to perform selective queries for the usual analysis, reporting, etc. None of the existing problems goes away.

1 comment:

Tom Adams said...

Sounds like a use case to me... :)