Monday, May 12, 2008

git + RDF = versioned RDF

Reading Git for Computer Scientists, it seems that if you turn the blob into a set of triples you pretty much have versioned RDF (or molecules even).
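A rough sketch of what I mean, assuming a graph is just a set of ground triples (no blank nodes) written one per line in sorted order. The function name and the N-Triples-ish serialisation are my own illustration, not anything Git or an RDF library provides. It hashes the triples the way Git hashes a blob (SHA-1 over a "blob <size>" header, a NUL byte, then the content), so the same triples always get the same id regardless of the order they were added in.

```python
import hashlib

def graph_id(triples):
    """Return a Git-style blob id for a collection of (s, p, o) string triples."""
    # Sort so the id depends only on which triples are present, not their order.
    lines = sorted(f"{s} {p} {o} ." for s, p, o in set(triples))
    content = "".join(line + "\n" for line in lines).encode("utf-8")
    # Git hashes a blob as "blob <length>\0" followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Two "versions" of a hypothetical graph: the second adds one triple.
v1 = {("<ex:a>", "<ex:knows>", "<ex:b>")}
v2 = v1 | {("<ex:b>", "<ex:knows>", "<ex:c>")}
```

Here graph_id(v1) and graph_id(v2) differ, which is all a commit would need to point at. Blank nodes are the hard part and are ignored in this sketch.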

I'm also wondering, if Digg is so pro-Semantic Web, where's the http://digg.com/semweb?

Tuesday, May 06, 2008

I See Triples

Digg makes official its adoption of a 'semantic Web' standard "Other brief mentions on Digg's blogs over the past month have been the only indications the company has been giving to the world of its direct -- and perhaps even principal -- involvement in RDF and RDFa, besides a simple check of the site's own source code, where attributions such as rel="dc:source" property="dc:title" within <DIV> elements are now common. A few weeks ago, developer Bob DuCharme discovered these little attributions and began playing with them to discern their viability."

"The possibility exists for a kind of mega-meta-source to emerge from Digg, where interesting news topics are associated with cataloged resources. But for that to actually work, someone has to manage those resources -- and that effort will take a level of humanpower and resources of another kind (the kind symbolized with "$") that RDF won't provide even the most ambitious sites just on its own."

See Digging RDFa. More news about RDFa is available at RDFa.info.

One way to see Digg in all its RDFa glory is to copy this Javascript for highlighting (or this one for getting RDF triples) into your bookmark bar and run it after the Digg front page has loaded.

So I still haven't finished writing up everything that I saw at WWW2008 but the overall messages were:

  • RDFa is easy and gets people going with RDF quickly (see "They knew the train would come"). Semantic wikis (links to the Semantic Mediawiki project) have also come a long way to making it more err user friendly.

  • HTML5 and the end of the browser development winter seem like the death of plugins at last. I hadn't realized this before, but the message seems to be that a plugin is a way of saying to the Web "your browser isn't full featured enough".

  • The Facebooks of the world and all those online communities really are a danger to the Web - the creation of data silos. And I'd really like to have the time to write some SIOC plugins to help open up these silos (or just change my blog template to have RDFa).

Bankrupt

"And now, we're [Americans are] the most religious nation on earth - that's why we kill so easily. We're sending people to heaven. And because we are now terribly, terribly religious in a sense that no proper American ever was when I was young - I was in the Second World War." - Gore Vidal.

And they are bankrupt in the financial sense as well due to Iraq (and other causes of course). The speaker also follows a line I've seen often where the war has been fought without enough commitment from the government (i.e. decreasing taxes instead of increasing them, hiring fighters instead of drafting, etc.). One rather shocking statistic was that 48 percent of returning troops will be disabled in some way - maybe that's because more are living than dying but it's still quite an amazing number - but it means "...we've created just for the disabled in this war in the last five years, a gap equal to the gap that we created over decades in the social security system...It's an order of magnitude worse than the Vietnam War."

Friday, May 02, 2008

When URIs are too Much

Every Subject is a Blank Node "In RDF, URIs are good at defining unambiguous property values, in other words objects, including type. But very often, and maybe most of the time, the individual subject (in both meaning of subject of an RDF triple, and topic maps subject of conversation) is best represented as a blank node bearing all kinds of identified properties, but none of them conferring absolute identity. This way, it's left to applications to figure out identification rules, in other words which property or boolean combination of properties they want to consider as identifying or not."

From the mailing list: "With no URI, you are free to let applications decide which contexts are considered the same or not, based on specific rules on properties. Some applications would decide that all contexts where role "I" is played by "John Black" are the same, and will cluster all contextResource properties, some other will not."

Long tail of programming languages

While I'm sick of long tail blahs, I recently came across the idea that programming languages follow the same power laws found in other areas. This particular long tail should be encouraging for those who have a disdain for the current mainstream computer languages: "Rather than finding ways to create an even lower lowest common denominator, the Long Tail is about finding economically efficient ways to capitalize on the infinite diversity of taste and demand that has heretofore been overshadowed by mass markets."

Furthermore, "There is a long tail because the more specialized a language is to a domain, the better it fits to solve problems for that domain. These niche languages trade off generality for efficiency in a domain and they are simply better and more efficient tools for that domain."

Grep the Web

Slides and talks from the recent Hadoop Summit are now available. Some of the more interesting ones are Facebook's Hive, Amazon's GrepTheWeb, IBM's JAQL and Yahoo's just about everything else.

Thursday, May 01, 2008

Engrich

The ability for a foreigner to speak just enough English in order to swindle stupid Westerners.

Thursday, April 24, 2008

Update from WWW2008

The HCLS workshop was very good. I especially enjoyed Mark Wilkinson's talk about BioMoby 2.0 (very Larry Lessig-esque) and Chris Baker's. There's some definite interest from a number of people in my talk too.

The keynote of the first day was from the Vice President of Engineering at Google, Kai-Fu Lee. In my talk I said that IBM had noted that scale-out architecture gives you a 4 times performance benefit for the same cost. He said that Google gets around 33 times or more from a scale-out architecture. The whole cloud thing is really interesting in that it's not only about better value but about doing things you just can't do with more traditional computing architectures. The number of people I've overheard saying that they haven't been able to get their email working because they're using some sort of client/server architecture is amazing. I mean what's to get working these days when you can just use GMail?

The SPARQL BOF was interesting as well (Eric took notes). The time frame seems to be around 2009 before they get started on SPARQL the next generation. What sticks out in my mind is the discussion around free text searching - adding something like Lucene. There were also aggregates, negation, starting from a blank node in a SPARQL query, transitivity and following owl:sameAs. I was pretty familiar with all of these so it was interesting just to listen for a change. So with both aggregates and free text you are creating a new variable. Lucene gives you a score back and I remember in Kowari we had that information but I don't think it was ever visible in a variable (maybe I'm wrong, I don't really remember). It would be nice to be able to bind new variables somehow from things in the WHERE clause for this - and that would also allow you to filter based on COUNTs greater than some value (without having a HAVING clause) or documents that match your Lucene query with a score greater than a certain value. Being able to do transitive relationships just on a subset of the subclass relationship (like only subclasses of mammals, rather than inferring the whole tree of life) seemed to have been met with some reluctance. I really didn't understand this but the objection seemed to be that it was the store's responsibility to control this and not up to the user to specify.
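To make the "only subclasses of mammals" case concrete, here is a small sketch in plain Python (not SPARQL, and not how any store does it). The subclass edges and class names are made up for illustration. It computes the transitive closure downward from one chosen root rather than over the whole hierarchy.

```python
def subclasses_of(root, sub_class_of):
    """Return every class transitively below root, given (child, parent) pairs."""
    children = {}
    for child, parent in sub_class_of:
        children.setdefault(parent, set()).add(child)
    found, stack = set(), [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:  # also guards against cycles in the data
                found.add(child)
                stack.append(child)
    return found

# Hypothetical rdfs:subClassOf edges, written as (child, parent).
edges = [("Dog", "Mammal"), ("Cat", "Mammal"), ("Poodle", "Dog"),
         ("Mammal", "Animal"), ("Sparrow", "Bird"), ("Bird", "Animal")]
mammals = subclasses_of("Mammal", edges)
```

Only the part of the tree under "Mammal" is ever visited, which is the user-specified restriction that seemed to meet reluctance.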

The other thing that was mentioned was transactions. It seems that transactions probably won't be part of SPARQL due to the nature of distributed transactions across the Web.

There was one paper on the first day that really stood out. I don't know what it is about logicians giving talks but they are generally really appealing to me. It was "Structured Objects in OWL: Representation and Reasoning" presented by Bernardo Grau. It seems to take the structural parts of an OWL ontology and create a graph to represent it. This prevents DL reasoning of an infinite tree and creates a bounded graph. This is cool for biology - the makeup of a cell for example - but it also speeds up reasoning and allows errors to be found.

The other interesting part was the linked data area. I was a bit concerned that it was going to create a read only Semantic Web. A lot of the work, such as DBpedia that converts Wikipedia to RDF, seems a bit odd to me as you can only edit the Semantic Web indirectly through documents. But in the Linked Data Workshop a paper was presented called "Tabulator Redux: Browsing and Writing Linked Data" which of course adds write capabilities. I spoke to Chris Bizer (who gave a talk on how the linked data project now has ~2 billion triples) about whether you could edit DBpedia this way and he said probably not yet. That's going to be interesting to see where it goes.

I am just going off memory rather than notes. So I'll probably flesh this out a bit more later.

Saturday, April 19, 2008

Big Web Table

I thought I read that someone had ported Google's AppEngine to use HBase. Good idea but not quite. "Announcing A BigTable Web Service": "I then came up with the crazy idea to offer BigTable as a web service using App Engine. It would be an infinitely scalable database running in Google's datacenters. I spent my weekend learning Python and hacking together an implementation. Now I'm happy to present the BigTable Web Service. It models the API of Hbase—a BigTable clone. Now you can have simulated BigTable running atop App Engine, which itself provides an abstraction on top of the real BigTable."

What it actually does is use HBase's Thrift API on top of Google's BigTable or as they say BigTable as a Web Service (a RESTful one).

Friday, April 18, 2008

Mario's Jazz Bar

Just a random thing to share - there seems to be quite a lot of competition playing Super Mario Galaxy songs on YouTube. A fairly recent one is a Jazz Interpretation of the Observatory theme (there's also a guitar version or accordion one). I still quite like the original orchestral version of Gusty Garden Galaxy Theme and its version on piano. Koji Kondo is also good to look up on YouTube from time to time too.

Thursday, April 17, 2008

What Women Want: Pairing

It's not the first time I've read an article about the continuing decline of women in IT; "Where Did All the Girl Geeks Go?" continues to note the slide:
"There's a perception that being a computer science major leads to a job as a programmer and you sit in a cubicle where you type 12 hours a day and have no interactions with other people," Block said.

Yusupova noted that even if pure programming jobs are outsourced, opportunities still remain within a company for people to bridge the relationship between the outsourced IT vendors and the business side.

"These roles would probably be ideal for women who prefer to be in communication-focused roles, if they know computer science and can communicate to all parties involved," said Nelly Yusupova, chief technology officer of Webgrrls International, a networking organization.


There was a talk given recently at a local XP group that led to a discussion on the benefits of things like pair programming (see "Pair Programming").

I see pair programming and other means of improving interactions between developers not only as essential for better code and a better project but also as a way to improve the IT industry generally and to expand its appeal, especially to younger people and women. The idea being that certain people work better in a participatory manner rather than being told what to do.

This is pretty much what an article from a couple of years ago, "Debunking the Nerd Stereotype with Pair Programming" (or as PDF), suggested:
Jamie wants to be a software engineer. She enjoyed her programming and science classes in high school and wants to combine her interest in both disciplines to help society through biomedical applications. Since she started college, it seems that her life has been centered on time consuming programming classes. In those classes, her professors insist that she work alone—some professors expressly forbid even discussing assignments with fellow classmates. Before entering college, Jamie was aware of the stereotypical view that programmers work long hours by themselves. Based on her college experience, now she knows it’s more than just a stereotype—it’s true. Perhaps she should forget programming. She likes the friends she’s met in her biology lab group—maybe biology would be a better major.


Having been working in bioinformatics for over a year it's startling the number of women in this area compared to IT. It seems basically 50/50 in what is essentially an application of information technology. They still write code, they still develop large applications and so on. Why does it drop to 1 in 20 or worse in IT? It does seem that in bioinformatics you are expected to work in groups and teams, they are forever interacting with each other - it seems a brilliant environment as far as productivity is concerned.

And it's not just IT or biology: there seems to be a general benefit, as greater interaction, more pairing and the like improve performance:
The success rate of underrepresented minorities in science courses has been shown to be dramatically improved by shifting the learning paradigm from individual study to one that capitalizes on group processes, such as student work groups and student-student tutoring.


From an IT perspective pairing doesn't only improve the quality of the software, it also improves your abilities as an individual programmer, as has been demonstrated where pair programming has been used in IT courses and the results of students in exams improved (see "Pair Programming Improves Student Retention, Confidence, and Program Quality").

I re-read "All I Really Need to Know about Pair Programming I Learned In Kindergarten", which still holds up quite well as a set of rationales behind pair programming, and NCSU's Pair Learning has lots of papers related to pairing, learning and making IT more attractive through a more discovery-based system.

Update: Finally found a public version of the nerd article.

hashCode and equals for Blank Nodes

You don't need node ids. Most, if not all, RDF triple stores take a Literal, URI Reference or Blank Node and generate a node id. Sometimes it's a hash or UUID, sometimes it's from a node pool or value store but you don't really need it. As an aside, in a distributed store you could even do the blocks of ids trick which people have done in SQL databases but I haven't seen that done for RDF yet.
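For anyone who hasn't seen the blocks of ids trick, here is a minimal sketch under my own assumptions. The class names are hypothetical, and in a real distributed store the allocator would sit behind a database sequence or a coordination service rather than in the same process. Each machine reserves a whole range of ids at once and only goes back to the central allocator when the range is used up.

```python
class BlockAllocator:
    """Central authority that hands out non-overlapping ranges of ids."""
    def __init__(self, block_size=1000):
        self.block_size = block_size
        self._next_start = 0

    def reserve(self):
        start = self._next_start
        self._next_start += self.block_size
        return range(start, start + self.block_size)

class LocalIdPool:
    """Per-machine pool: only contacts the allocator when its block runs out."""
    def __init__(self, allocator):
        self._allocator = allocator
        self._ids = iter(())  # start with an exhausted block

    def next_id(self):
        for node_id in self._ids:  # take the next id left in the current block
            return node_id
        self._ids = iter(self._allocator.reserve())
        return next(self._ids)

allocator = BlockAllocator(block_size=3)
machine_a, machine_b = LocalIdPool(allocator), LocalIdPool(allocator)
ids_a = [machine_a.next_id() for _ in range(4)]
ids_b = [machine_b.next_id() for _ in range(2)]
```

The ids stay unique across machines without a round trip per node, at the cost of gaps when a machine stops with part of its block unused.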

When you do operations, like joins, in Java or Ruby or some other language you rely on hash codes to tell values apart; if two hash codes are the same then you call equals.

What if you don't have a node pool?

It's easy to do for what I like to call globally addressable values - URI References and Literals - no matter where you are, these methods return the same results from their hash code or equals. Not so with Blank Nodes, which are tied to the context of an RDF graph.

One solution is to ban blank nodes - they're pains to parse, query and store. But I actually like blank nodes. They're good at representing things where you don't want to confuse it with something that might actually be a URI to dereference.

The idea we've been working on with our high-falutin' scale-out MapReduce blah blah is really just coming up with sensible implementations of the hashCode and equals methods for blank nodes. There is previous work done in distributing blank nodes across graphs, the one that I'm most familiar with is RDF Molecules. But they didn't really quite cut it as far as hash codes and equals are concerned and that's basically what I'm presenting next week in China. The hash code is basically the head triple and the equals is the minimal context, sub-graph for a given blank node.
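A toy version of the idea, written in Python where __hash__ and __eq__ play the role of hashCode and equals. This is only a sketch under loud assumptions, not the implementation from the paper: the "minimal context" is taken to be every triple that mentions the node, the "head triple" is taken to be the lexically smallest of those, and blank node labels are assumed to start with "_:".

```python
def _erase(term):
    # Labels only mean something inside one graph, so drop them (assumption:
    # blank node labels start with "_:"). Nested blank nodes collapse too,
    # which is a simplification of this sketch.
    return "_:" if term.startswith("_:") else term

def context(label, graph):
    """Assumed 'minimal context': all triples mentioning label, labels erased."""
    return frozenset(tuple(_erase(t) for t in triple)
                     for triple in graph if label in triple)

class BlankNode:
    def __init__(self, label, graph):
        self._context = context(label, graph)
        # Assumed 'head triple': the lexically smallest triple in the context.
        self._head = min(self._context) if self._context else None

    def __hash__(self):
        return hash(self._head)

    def __eq__(self, other):
        return isinstance(other, BlankNode) and self._context == other._context

# Hypothetical graphs: g1 and g2 describe the same thing with different labels.
g1 = [("_:a", "foaf:name", '"John Black"'),
      ("_:a", "foaf:mbox", "<mailto:jb@example.org>"),
      ("<ex:doc>", "dc:title", '"Unrelated"')]
g2 = [("_:x17", "foaf:mbox", "<mailto:jb@example.org>"),
      ("_:x17", "foaf:name", '"John Black"')]
g3 = [("_:b", "foaf:name", '"John Black"')]
a, b, c = BlankNode("_:a", g1), BlankNode("_:x17", g2), BlankNode("_:b", g3)
```

With this, a and b are equal and hash alike even though they come from different graphs, while c (which has no mbox) is not equal to either. Equal contexts always share a head triple, so the hash stays consistent with equals.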

There's a lot more to say, as I've had to find something to talk about for the whole 15 minutes.

Wednesday, April 09, 2008

My (Continued) SPARQL Debacle

One of the reasons I started this blog was to record my current thoughts at a particular time. With this in mind, I should track my recent comments about SPARQL and the empty graph pattern and further rehashing of it.

I made a few mistakes during the discussion and spent well over a week in discussion and maybe a week prior to asking the question thinking about it and much time thereafter just thinking about summarizing it.

SPARQL is an algebra that is not consistent (isomorphic) with what I think of as set/relational/bag algebras (even though it appeared at one stage this was considered). The reason is that identities I believe hold in these algebras don't for SPARQL.

The set/relational/bag algebra identities are:
* A + 0 = A
* A * U = A
* A + U = U
* A * 0 = 0

Where + is UNION, * is INTERSECTION, A is any set, 0 is the empty set and U is the universal set. The second one is expressible and does work in SPARQL. The first one isn't expressible in SPARQL. The last two don't hold.
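For ordinary finite sets the four identities (A + 0 = A, A * U = A, A + U = U and A * 0 = 0) can be checked directly. This is just a sanity check with Python's built-in sets and a small made-up universe; it says nothing about how SPARQL itself behaves.

```python
U = frozenset(range(10))   # a small stand-in for the universal set
EMPTY = frozenset()        # 0, the empty set
A = frozenset({1, 3, 5})   # any subset of U

# + is UNION (|) and * is INTERSECTION (&).
identities = {
    "A + 0 = A": (A | EMPTY) == A,
    "A * U = A": (A & U) == A,
    "A + U = U": (A | U) == U,
    "A * 0 = 0": (A & EMPTY) == EMPTY,
}
```

Every entry in identities comes out True here, which is the behaviour I was expecting to carry over.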

You can derive the last two identities from the first two as long as you have compatible definitions for things like inverse (or complement). When Date creates the algebra for bags he spends most of his time coming up with a reasonable definition for the complement of a bag which seems to be more like difference. In my interpretation 1/T/U is the relational TABLE_DEE and 0/F/empty set is TABLE_DUM. I thought this was quite clear but it appears even this is up for interpretation.

Prior to Date's latest book, I had a bunch of his writings which I used to create identities for OPTIONAL, JOIN and UNION. I struggled a while back to see whether they were compatible with SPARQL and eventually decided that they were; it turns out that they are not, because the identities don't hold. I think as long as you don't ask these questions then it still returns the right answer and you could create special cases for SPARQL's empty graph pattern but I'm just not that confident anymore. SPARQL is more like an algebra of numbers than of sets or bags.

Reflecting on this, I was striving for a consistency that just wasn't there and if I squint hard enough I can see how the SPARQL algebra by itself makes sense.

There is still some behavior, even within the SPARQL specification, that appears to be really bad like "SELECT ?x WHERE { ?s ?p ?o }" being a valid query (it returns an unbound ?x for each of however many triples there are in the graph). This is quite different to SQL or relational PROJECT. It's also weird that the SPARQL specification is different to the Perez papers about SPARQL - evaluation is done at a grammatical level. UNION also differs as it's defined as multiset union not set union even though OPTIONAL is made up of set union not multiset union. Actually, I'm still not sure if UNION is multiset union because in the implementations I've seen the order is important (that is {} UNION {} UNION { ?s ?p ?o } is different to {} UNION { ?s ?p ?o } UNION {} and { ?s ?p ?o } UNION {} UNION {}) but I guess that's because of the grammatical evaluation.
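Here is a toy model of that first query in plain Python. It is not a SPARQL engine; solutions are modelled as a list (a bag) of dicts, None stands for "unbound", and the helper names are my own. It only shows why projecting a variable that the pattern never binds gives one empty row per triple.

```python
def match_all(graph):
    """Solutions for the pattern { ?s ?p ?o }: one binding per triple."""
    return [{"s": s, "p": p, "o": o} for s, p, o in graph]

def project(solutions, variables):
    """Keep only the named variables; one never bound comes back as None."""
    return [{v: sol.get(v) for v in variables} for sol in solutions]

# A hypothetical three-triple graph.
graph = [("ex:a", "ex:p", "ex:b"), ("ex:b", "ex:p", "ex:c"), ("ex:c", "ex:p", "ex:a")]
rows = project(match_all(graph), ["x"])       # SELECT ?x WHERE { ?s ?p ?o }
distinct = {tuple(r.items()) for r in rows}   # what a set-based PROJECT would keep
```

rows has three entries, all with ?x unbound, while distinct has only one. That gap between bag and set semantics is the difference from relational PROJECT that bothers me.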

It does put any further work on JRDF's SPARQL implementation in a bad position. I can keep calling it SPARQL but know that it's not following the standard or rename it (currently I'm thinking URQL) but the whole point of bothering seems to be questionable. The ironic thing is that it could pass all the SPARQL tests even though I know it's not compatible. In other work that I've been doing, I've been interested in SPARQL as the Unix pipes for RDF and blank node round tripping but SPARQL doesn't work there either. Blank node round tripping is where you take the result of one SPARQL query that includes a blank node and put it into a second.

Sometimes you come away from asking a question feeling validated or smarter and sometimes not. This time it's definitely not - I no longer feel confident talking about SPARQL or relational algebra.

Friday, March 28, 2008

Microsoft LINQs Data

Microsoft and "Research-Output" Repositories

Our goal is to abstract the use of underlying technologies and provide an easy-to-use development model, based on .NET and LINQ, for building repositories on top of robust technologies.

The platform has a "semantic computing" flavor. The concepts of "resource" and "relationship" are first-class citizens in our platform API. We do offer a number of "research-output"-related entities for those who want to use them (e.g. "technical report", "thesis", "book", "software download", "data", etc.), all of which inherit from "resource". However, new entities can be introduced into the system (even programmatically) while the existing ones can be further extended through the addition of properties.

This means, obviously, that arbitrary relationships between resources can be established. Our platform comes with a number of "known" predicates (e.g. "added by", "authored by", "cites", etc.) but it is extensible to accommodate any new predicates developers want to introduce. Furthermore, we do not interpret the semantics of the relationships; we let applications define how to reason about them.

The concept of a "relationship" may make many think that we are building a triple-store, perhaps even speculate that we are using one. While we do store tuples, we have opted for a hybrid approach between a fully-blown relational schema and a triple-store. Our thesis is that by sitting in the middle of the "triple store <–> relational schema" spectrum, we will be able to stay flexible enough without impacting performance.

At the Open Repositories 2008 conference, we will formally unveil our work in advance of its official release and initiate interactions/exchanges with the DSpace, EPrints, Fedora, and other players in the repository community. This is crucial to us because—like every other project our group undertakes—we are intensely focused on interoperability.


Maybe Microsoft and Yahoo! have more in common than previously thought.

Via Microsoft set to launch Semantic Web light. I previously looked around for LINQ tools for RDF.

Tuesday, March 04, 2008

Save Ontologies from the Ontologists

I presented a talk at InterOntology08 last week (there is a list of the slides presented). It was only 15 minutes so there wasn't room for much content. What I think is the most important slide was number 11, about how the BioMANTA project is attempting to produce ontologies as an agile, engineering artefact that is verified in reality (due to experiments being performed, provenance tracked and data analysis on the quality of the provenance to filter out irrelevant or incorrect data).

There were some good things to come out of it. Thinking about how to describe to other people problems with ontologies in terms of inconsistencies - what will be inferred that contradicts your ontology - was very useful. The work done by Werner Ceusters, Nicola Guarino and Yu Lin was the closest to our work. One of the speakers gave what I think is a succinct description of the difference between top down vs. bottom up ontology development: "what to expect" vs. "what to extract". I also met a lot of great people who I hope to meet again and Japan was very cool.

Easily the best, in terms of the most thought provoking, was Barry Smith's "The Evaluation of Ontologies: Editorial Review vs Democratic Ranking". This discussed the work of the Gene Ontology and the OBO Foundry. He cited the Gene Ontology as the most useful and most used ontology which has been developed using a top down process. It allows comparable data to be produced, it removes data silos and he compared it to the creation of standard measures (metric system). He said that in order to achieve this standardisation you need editorial committees. An ontology becomes part of the peer review, journal process. He introduced the OBO Foundry which has many principles: an ontology must be open, use a formal language, be collaborative, have orthogonal components, be versioned and well documented, and must have data before it can be accepted.

The alternative view he offered was attributed to Mark Musen. It's a bottom up, annotation of ontologies and many of the slides were taken from a previous talk. Mark believes that ontologies are still a cottage industry and that it is often difficult to ascertain the quality of an ontology just by inspection. He said it is also true that we may wish to use parts of ontologies even if they are not well designed. He questions whether a top down approach can scale. He is developing BioPortal which offers a way to upload and rate various ontologies. The key question about BioPortal is whether it will generate enough interest to reach a critical mass of reviews.

I had many problems with this talk. Firstly, the way it was characterised as one vs. the other - why can't they both work? What stops peer reviews of popular ontologies or getting popular ratings of peer reviewed ontologies? Barry mentioned that a selection approach works for refrigerators (where peer review designs the function of the refrigerator and colour is selected by the masses) but questioned whether this should work for science. This is an obviously negative view of what mass selection can do - we choose representatives in a democracy or successful products in a capitalist market, surely these are very important things that are left to the masses. Are ontologies any less than these things?

Beyond that, both of these methods seem to suggest a certain centralisation. Doesn't this encourage gatekeepers, people holding onto power? Hasn't the web (governments, capitalism, science etc.) shown that decentralisation is better? I see science as a competition of ideas: out of many possible models, the one chosen is the one that best fits existing data and predicts new observations.

One of the OBO Foundry principles is that you can't reuse an ontology. That is, if you're outside the OBO Foundry and you make a change you can't redistribute or use the same identifiers. This just seems wrong. I must be misunderstanding this part, because it is supported by people who I would expect to support the idea of reusing ideas and, most importantly, sharing them with others.

Many of these arguments seem to be around whether an ontology is attempting to create or represent reality or if it's an engineering artefact. I see it as a bit of both but its primary utility, I'd suggest, is as an engineering artefact. It represents a (hopefully working) system.

A simple example is our "fixing" of BioPAX. BioPAX uses string literals for certain properties and this prevents them being used as subjects in RDF. I would like to link, maybe dereference them and do other cool things with them that you can only do with URIs. So I'd like to make a change now, get something working and distribute my software with these changes.

I do think that ontologies should be well documented but documentation can be a barrier when you want to change something, try it out, make more changes, try it out again - the documentation is potentially going to be missing or wrong. The whole process seems to be trying to do too much upfront - which is terrible for the few, overworked ontologists that there are.

I don't necessarily want to wait while my ontology gets peer reviewed - the chances of the right person finding a mistake really don't sit with a committee or voting process - I'd like it to include everyone. I'd like to do it cheaply both in time and money; if for no other reason than to see whether it works well. If it doesn't work then it's not a big deal, I can just change it back. It seems that if this was part of a big process then it would be less likely to happen.

Both methods also lack verification (or at least it wasn't discussed). There's nothing to say that a bunch of people in the OBO Foundry or a voting process will necessarily achieve certain modelling objectives - something that is right for me or for everyone. Like most systems, ontologies will have contradictory requirements such as flexibility and completeness or security and privacy - there really isn't one true answer. I'd prefer a process that quickly adapts to changing requirements which can then be verified.

Monday, March 03, 2008

PURLs for GO

I think these have been published before but I only just noticed the use of PURLs for references in the Gene Ontology. For example: http://purl.org/obo/owl/GO#GO_0008150 (the term is not dereferenceable but then that seems okay for current purposes). This is following Recipe 1a from the "Best Practice Recipes for Publishing RDF Vocabularies", although the file sizes are probably too big (as suggested in "Serving Static RDF Files"). It does seem inconsistent with the Banff Manifesto which suggests URLs more like http://purl.org/bm/go:0008150. I know about slashes and hashes but I'm not sure about colons.
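To show the two shapes side by side, here is a small sketch that turns an identifier like GO:0008150 into each style. The patterns are inferred only from the two example URLs above, and the function names are mine; neither has been checked against what the OBO or Banff documents actually specify.

```python
def go_owl_purl(go_id):
    """'GO:0008150' -> hash-style PURL, following the first example above."""
    prefix, number = go_id.split(":")
    return f"http://purl.org/obo/owl/{prefix}#{prefix}_{number}"

def go_banff_purl(go_id):
    """'GO:0008150' -> colon-style PURL, following the Banff-like example above."""
    prefix, number = go_id.split(":")
    return f"http://purl.org/bm/{prefix.lower()}:{number}"

hash_style = go_owl_purl("GO:0008150")
colon_style = go_banff_purl("GO:0008150")
```

The hash style puts every term in one document, which is where the file size worry comes from.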

Sunday, March 02, 2008

Algebra A and SPARQL

I've been reading "Logic and Databases: The Roots of Relational Theory" and, more importantly, chapter 10, which is about "Why is it called a Relational Algebra?". Date defines what an algebra is, such as identities, idempotence, absorption and so on, with respect to Algebra A. I first came across Algebra A in the 3rd Manifesto; it is an untyped relational algebra that defines a relationally complete system in about three operations: REMOVE, NOR or NAND and TCLOSE (transitive closure). I say "about three" because I'm not sure TCLOSE is part of a relationally complete system and NOR and NAND are made up of an untyped OR or AND plus NOT. It also gets rid of WHERE, EXTEND and SUMMARIZE (which can be used for aggregate functions) by creating "relational operators" which are special relations that perform an operation (like COUNT).

Anyway, one of the more interesting points is that on page 260-261 of "Logic and Databases" he talks about identities such as: A + 0 = A * U = A, A + U = U and A * 0 = 0. Where A is any relation, 0 is the empty relation (DUM) and U is the universal relation (DEE). These match the tables I created for JOIN and UNION for SPARQL - and likewise I think are correct for OPTIONAL.

There is also a chapter on the closed world assumption and why Date dislikes the open world assumption which I'm still trying to digest. It seems that to get around 3VL Date uses strings - which seems like a massive hack.

It also occurred to me when reading this that SPARQL and query languages in general are non-monotonic - that is as you add more information the results you get from a query can be different - which is different to RDF. It made me wonder what a monotonic query language would look like but not for too long.
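A tiny illustration of the non-monotonic part, in plain Python rather than SPARQL, with made-up data and a hypothetical query function. The query uses negation as failure ("people with no known mailbox"), so adding a triple can take an answer away, whereas adding a triple to an RDF graph never invalidates what the graph already said.

```python
def people_without_email(graph):
    """Negation as failure: people for whom no ex:mbox triple is present."""
    people = {s for s, p, o in graph if p == "rdf:type" and o == "ex:Person"}
    with_email = {s for s, p, o in graph if p == "ex:mbox"}
    return people - with_email

before = {("ex:alice", "rdf:type", "ex:Person"),
          ("ex:bob", "rdf:type", "ex:Person")}
# Strictly more information: everything in `before` plus one new triple.
after = before | {("ex:bob", "ex:mbox", "mailto:bob@example.org")}

answers_before = people_without_email(before)
answers_after = people_without_email(after)
```

ex:bob is an answer before the extra triple and not after it, so the result set shrank even though the graph only grew.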

Friday, February 22, 2008

Tissue Parade

This is just a quick note to let people who have expressed interest in BioMANTA before know that the web site is now pretty much up-to-date with the latest papers and presentations (except for the InterOntology08 presentation in Japan next week). If we've sneezed and there was a Powerpoint slide it's there.