Friday, March 28, 2008

Microsoft LINQs Data

Microsoft and "Research-Output" Repositories

Our goal is to abstract the use of underlying technologies and provide an easy-to-use development model, based on .NET and LINQ, for building repositories on top of robust technologies.

The platform has a "semantic computing" flavor. The concepts of "resource" and "relationship" are first-class citizens in our platform API. We do offer a number of "research-output"-related entities for those who want to use them (e.g. "technical report", "thesis", "book", "software download", "data", etc.), all of which inherit from "resource". However, new entities can be introduced into the system (even programmatically) while the existing ones can be further extended through the addition of properties.

This means, obviously, that arbitrary relationships between resources can be established. Our platform comes with a number of "known" predicates (e.g. "added by", "authored by", "cites", etc.) but it is extensible to accommodate any new predicates developers want to introduce. Furthermore, we do not interpret the semantics of the relationships; we let applications define how to reason about them.

The concept of a "relationship" may make many think that we are building a triple-store, perhaps even speculate that we are using one. While we do store tuples, we have opted for a hybrid approach between a full-blown relational schema and a triple-store. Our thesis is that by sitting in the middle of the "triple store <–> relational schema" spectrum, we will be able to stay flexible enough without impacting performance.

At the Open Repositories 2008 conference, we will formally unveil our work in advance of its official release and initiate interactions/exchanges with DSpace, EPrints, Fedora, and other players in the repository community. This is crucial to us because, like every other project our group undertakes, we are intensely focused on interoperability.
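
To make the "hybrid" idea above a little more concrete, here is a rough sketch of what sitting between a relational schema and a triple-store could look like. This is only an illustration and not Microsoft's actual schema or API: the table names, column names, sample rows and the "extends" predicate are all invented, and it uses Python with SQLite rather than .NET and LINQ. Common properties live in ordinary columns, while relationships are stored as (subject, predicate, object) rows so that new predicates need no schema change.

```python
import sqlite3

# Hypothetical schema: a typed "resource" table plus a triple-like
# "relationship" table. None of these names come from the platform itself.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resource (
        id    INTEGER PRIMARY KEY,
        kind  TEXT NOT NULL,   -- e.g. 'thesis', 'book', 'technical report'
        title TEXT NOT NULL
    );
    CREATE TABLE relationship (
        subject_id INTEGER NOT NULL REFERENCES resource(id),
        predicate  TEXT NOT NULL,  -- open-ended: any predicate name is allowed
        object_id  INTEGER NOT NULL REFERENCES resource(id)
    );
""")

conn.executemany("INSERT INTO resource VALUES (?, ?, ?)", [
    (1, "thesis", "A Thesis on Repositories"),
    (2, "technical report", "Report on Triple Stores"),
    (3, "book", "Relational Theory"),
])
conn.executemany("INSERT INTO relationship VALUES (?, ?, ?)", [
    (1, "cites", 2),
    (1, "cites", 3),
    (2, "extends", 3),  # a made-up predicate, added without altering the schema
])


def related(subject_id, predicate):
    """Return titles of resources linked from subject_id by predicate."""
    rows = conn.execute(
        """SELECT r.title
           FROM relationship AS rel
           JOIN resource AS r ON r.id = rel.object_id
           WHERE rel.subject_id = ? AND rel.predicate = ?
           ORDER BY r.id""",
        (subject_id, predicate),
    )
    return [title for (title,) in rows]


print(related(1, "cites"))
```

The store itself attaches no meaning to "cites" or "extends"; as the description above says, reasoning about a predicate is left to the application.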


Maybe Microsoft and Yahoo! have more in common than previously thought.

Via "Microsoft set to launch Semantic Web light". I previously looked around for LINQ tools for RDF.

Tuesday, March 04, 2008

Save Ontologies from the Ontologists

I presented a talk at InterOntology08 last week (list of slides presented). It was only 15 minutes, so there wasn't room for much content. I think the most important slide was number 11, about how the BioMANTA project is attempting to produce ontologies as an agile engineering artefact that is verified against reality: experiments are performed, provenance is tracked, and the quality of that provenance is analysed to filter out irrelevant or incorrect data.

There were some good things to come out of it. Thinking about how to describe problems with ontologies to other people in terms of inconsistencies - what will be inferred that contradicts your ontology - was very useful. The work done by Werner Ceusters, Nicola Guarino and Yu Lin was the closest to ours. One of the speakers gave what I think is a succinct description of the difference between top-down and bottom-up ontology development: "what to expect" vs. "what to extract". I also met a lot of great people who I hope to meet again, and Japan was very cool.

Easily the best talk, in the sense of being the most thought-provoking, was Barry Smith's "The Evaluation of Ontologies: Editorial Review vs Democratic Ranking". It discussed the work of the Gene Ontology and the OBO Foundry. He cited the Gene Ontology, which has been developed using a top-down process, as the most useful and most used ontology. It allows comparable data to be produced and removes data silos, and he compared it to the creation of standard measures (the metric system). He said that in order to achieve this standardisation you need editorial committees. An ontology then becomes part of the peer review and journal process. He introduced the OBO Foundry, which has many principles: ontologies must be open, use a formal language, be developed collaboratively, consist of orthogonal components, be versioned and well documented, and have data before they can be accepted.

The alternative view he offered was attributed to Mark Musen. It is a bottom-up approach based on annotating ontologies, and many of the slides were taken from a previous talk. Mark believes that ontologies are still a cottage industry and that it is often difficult to ascertain the quality of an ontology just by inspection. He also said that we may wish to use parts of ontologies even if they are not well designed, and he questions whether a top-down approach can scale. He is developing BioPortal, which offers a way to upload and rate various ontologies. The key question about BioPortal is whether it will generate enough interest to reach a critical mass of reviews.

I had many problems with this talk. Firstly, there was the way it was characterised as one vs. the other - why can't they both work? What stops peer review of popular ontologies, or popular ratings of peer-reviewed ontologies? Barry mentioned that a selection approach works for refrigerators (where peer review designs the function of the refrigerator and the colour is selected by the masses) but questioned whether this should work for science. This is an obviously negative view of what mass selection can do - we choose representatives in a democracy or successful products in a capitalist market, and surely these are very important things that are left to the masses. Are ontologies any less than these things?

Beyond that, both of these methods seem to suggest a certain centralisation. Doesn't this encourage gatekeepers and people holding onto power? Hasn't the web (and governments, capitalism, science etc.) shown that decentralisation is better? I see science as a competition of ideas: out of many possible models, the one chosen is the one that best fits existing data and predicts new observations.

One of the OBO Foundry principles is that you can't reuse an ontology. That is, if you're outside the OBO Foundry and you make a change you can't redistribute or use the same identifiers. This just seems wrong. I must be misunderstanding this part, because it is supported by people who I would expect to support the idea of reusing ideas and, most importantly, sharing them with others.

Many of these arguments seem to be around whether an ontology is attempting to create or represent reality or if it's an engineering artefact. I see it as a bit of both but its primary utility, I'd suggest, is as an engineering artefact. It represents a (hopefully working) system.

A simple example is our "fixing" of BioPAX. BioPAX uses string literals for certain properties and this prevents them being used as subjects in RDF. I would like to link, maybe dereference them and do other cool things with them that you can only do with URIs. So I'd like to make a change now, get something working and distribute my software with these changes.
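
To show the kind of change I mean, here is a minimal sketch in Python using plain tuples as triples. Everything in it is an assumption made for illustration: the `ex:` property names are not real BioPAX properties, the `http://example.org/term/` base is a placeholder, and `promote_literals` is a hypothetical helper rather than our actual fix. It replaces the literal object of chosen properties with a minted URI and keeps the original string as a label, so the value can now appear as a subject.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class Literal:
    """A string literal; in RDF a literal cannot be the subject of a triple."""
    value: str


BASE = "http://example.org/term/"  # placeholder namespace, not a real one


def promote_literals(triples, predicates):
    """Swap literal objects of the given predicates for minted URIs."""
    out = []
    for s, p, o in triples:
        if p in predicates and isinstance(o, Literal):
            uri = BASE + quote(o.value, safe="")
            out.append((s, p, uri))
            # The new URI is now a subject, so more can be said about it.
            out.append((uri, "rdfs:label", o))
        else:
            out.append((s, p, o))
    return out


triples = [
    ("ex:protein1", "ex:cellularLocation", Literal("cell membrane")),
    ("ex:protein1", "ex:name", Literal("Protein One")),
]
fixed = promote_literals(triples, {"ex:cellularLocation"})
for triple in fixed:
    print(triple)
```

The trade-off is the one discussed above: the rewritten data no longer matches the published ontology exactly, which is fine for an engineering artefact but awkward if redistribution of changes is restricted.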

I do think that ontologies should be well documented but documentation can be a barrier when you want to change something, try it out, make more changes, try it out again - the documentation is potentially going to be missing or wrong. The whole process seems to be trying to do too much upfront - which is terrible for the few, overworked ontologists that there are.

I don't necessarily want to wait while my ontology gets peer reviewed - the chances of the right person finding a mistake really don't sit with a committee or voting process - I'd like it to include everyone. I'd like to do it cheaply, both in time and money, if for no other reason than to see whether it works well. If it doesn't work then it's not a big deal; I can just change it back. It seems that if this was part of a big process then it would be less likely to happen.

Both methods also lack verification (or at least it wasn't discussed). There's nothing to say that a bunch of people in the OBO Foundry or a voting process will necessarily achieve certain modelling objectives - something that is right for me or for everyone. Like most systems, ontologies will have contradictory requirements such as flexibility and completeness or security and privacy - there really isn't one true answer. I'd prefer a process that quickly adapts to changing requirements which can then be verified.

Monday, March 03, 2008

PURLs for GO

I think these have been published before but I only just noticed the use of PURLs for references in the Gene Ontology. For example: http://purl.org/obo/owl/GO#GO_0008150 (the term is not dereferenceable but then that seems okay for current purposes). This is following Recipe 1a from the "Best Practice Recipes for Publishing RDF Vocabularies", although the file sizes are probably too big (as suggested in "Serving Static RDF Files"). It does seem inconsistent with the Banff Manifesto, which suggests URLs more like http://purl.org/bm/go:0008150. I know about slashes and hashes but I'm not sure about colons.
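
The file size concern comes from how hash URIs work: the fragment is never sent to the server, so a client asking about one term ends up requesting the whole document before the hash. A small sketch with Python's standard library shows the split for the two URLs above (nothing is fetched here, it only parses the strings):

```python
from urllib.parse import urldefrag

hash_style = "http://purl.org/obo/owl/GO#GO_0008150"
slash_style = "http://purl.org/bm/go:0008150"

# urldefrag separates the part sent over HTTP from the client-side fragment.
hash_doc, hash_fragment = urldefrag(hash_style)
slash_doc, slash_fragment = urldefrag(slash_style)

print(hash_doc, hash_fragment)
print(slash_doc, repr(slash_fragment))
```

With the hash style every GO term shares the single document http://purl.org/obo/owl/GO, whereas the Banff-style URL has no fragment and so could, in principle, be resolved per term.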

Sunday, March 02, 2008

Algebra A and SPARQL

I've been reading "Logic and Databases: The Roots of Relational Theory" and, more importantly, chapter 10, which is about "Why is it called a Relational Algebra?". Date defines what an algebra is, covering properties such as identities, idempotence, absorption and so on, with respect to Algebra A. I first came across Algebra A in The Third Manifesto: it is an untyped relational algebra that defines a relationally complete system in about three operations: REMOVE, NOR or NAND, and TCLOSE (transitive closure). I say "about three" because I'm not sure TCLOSE is part of a relationally complete system, and NOR and NAND are made up of an untyped OR or AND plus NOT. It also gets rid of WHERE, EXTEND and SUMMARIZE (which can be used for aggregate functions) by creating "relational operators", which are special relations that perform an operation (like COUNT).
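
TCLOSE is the odd one out of those operations, so here is a minimal sketch of what transitive closure does to a binary relation. It is my own illustration in Python, treating a relation as a set of pairs; the function name `tclose` and the sample data are invented, and this is not how The Third Manifesto defines the operator formally.

```python
def tclose(pairs):
    """Return the transitive closure of a binary relation given as a set of pairs."""
    closure = set(pairs)
    while True:
        # Compose the relation with itself: (a, b) and (b, d) give (a, d).
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            return closure  # nothing new was derived, so a fixpoint is reached
        closure |= new


parent_of = {(1, 2), (2, 3), (3, 4)}
print(sorted(tclose(parent_of)))
```

The loop runs until no new pairs appear, and the number of rounds depends on the data. That unbounded iteration is the reason it cannot be written as a fixed expression of joins and projections.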

Anyway, one of the more interesting points is on pages 260-261 of "Logic and Databases", where he talks about identities such as A + 0 = A * U = A, A + U = U and A * 0 = 0, where A is any relation, 0 is the empty relation (DUM) and U is the universal relation (DEE). These match the tables I created for JOIN and UNION for SPARQL, and I think they are likewise correct for OPTIONAL.
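
A quick way to convince yourself of some of these identities is to check them on a toy relation. The sketch below is an assumption-laden illustration: a relation is a (heading, body) pair of frozensets, `join` is a plain natural join, and `union` is the ordinary same-heading union rather than Algebra A's generalised OR. Because of that last simplification, the empty relation used for A + 0 shares A's heading, and A + U = U is not checked at all, since it needs a universal relation over typed domains.

```python
def relation(heading, rows):
    """Build a relation as (heading, body) from attribute names and dict rows."""
    return (frozenset(heading), frozenset(frozenset(r.items()) for r in rows))


def join(r1, r2):
    """Natural join: combine tuples that agree on all common attributes."""
    (h1, b1), (h2, b2) = r1, r2
    common = h1 & h2
    body = set()
    for t1 in b1:
        d1 = dict(t1)
        for t2 in b2:
            d2 = dict(t2)
            if all(d1[a] == d2[a] for a in common):
                body.add(frozenset({**d1, **d2}.items()))
    return (h1 | h2, frozenset(body))


def union(r1, r2):
    """Ordinary union, defined only for relations with the same heading."""
    (h1, b1), (h2, b2) = r1, r2
    if h1 != h2:
        raise ValueError("union needs identical headings in this sketch")
    return (h1, b1 | b2)


DEE = relation([], [{}])  # no attributes, exactly one (empty) tuple
DUM = relation([], [])    # no attributes, no tuples

A = relation(["name", "age"], [{"name": "a", "age": 1}, {"name": "b", "age": 2}])
EMPTY_A = (A[0], frozenset())  # empty relation with A's heading

print(join(A, DEE) == A)        # A * U = A
print(join(A, DUM) == EMPTY_A)  # A * 0 = 0
print(union(A, EMPTY_A) == A)   # A + 0 = A
```

Note that joining with DUM gives the empty relation with A's heading rather than DUM itself, so "0" here has to be read as "the empty relation of the appropriate heading".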

There is also a chapter on the closed world assumption and why Date dislikes the open world assumption which I'm still trying to digest. It seems that to get around 3VL Date uses strings - which seems like a massive hack.

It also occurred to me when reading this that SPARQL, and query languages in general, are non-monotonic - that is, as you add more information, results you previously got from a query can change or disappear - which is different to RDF. It made me wonder what a monotonic query language would look like but not for too long.
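
A small sketch makes the difference visible. It is written in Python over a set of triples rather than in SPARQL, and the `ex:` names and both query functions are made up for the example. The first query only ever gains answers as triples are added; the second uses negation ("no email is known"), in the spirit of the OPTIONAL plus not-bound idiom, and loses an answer when a new triple arrives.

```python
def people(triples):
    """Monotonic: adding triples can only add answers."""
    return {s for (s, p, o) in triples if p == "rdf:type" and o == "foaf:Person"}


def people_without_email(triples):
    """Non-monotonic: an answer disappears once an email triple is added."""
    with_email = {s for (s, p, o) in triples if p == "foaf:mbox"}
    return people(triples) - with_email


before = {("ex:alice", "rdf:type", "foaf:Person")}
after = before | {("ex:alice", "foaf:mbox", "mailto:alice@example.org")}

print(sorted(people(before)), sorted(people(after)))
print(sorted(people_without_email(before)), sorted(people_without_email(after)))
```

Every triple in `before` is still present in `after`, yet the second query's earlier answer has been retracted. A monotonic query language would presumably have to give up this kind of negation.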