Thursday, April 24, 2008

Update from WWW2008

The HCLS workshop was very good. I especially enjoyed Mark Wilkinson's talk about BioMoby 2.0 (very Larry Lessig-esque) and Chris Baker's. A number of people showed definite interest in my talk too.

The keynote of the first day was from the Vice President of Engineering at Google, Kai-Fu Lee. In my talk I said that IBM had noted that scale-out architecture gives you a 4 times performance benefit for the same cost. He said that Google gets around 33 times or more from a scale-out architecture. The whole cloud thing is really interesting in that it's not only about better value but about doing things you just can't do with more traditional computing architectures. The number of people I've overheard saying that they haven't been able to get their email working because they're using some sort of client/server architecture is amazing. I mean what's to get working these days when you can just use GMail?

The SPARQL BOF was interesting as well (Eric took notes). The time frame seems to be around 2009 before they get started on SPARQL the next generation. What sticks out in my mind is the discussion around free text searching - adding something like Lucene. There were also aggregates, negation, starting from a blank node in a SPARQL query, transitivity, and following owl:sameAs. I was pretty familiar with all of these so it was interesting just to listen for a change. With both aggregates and free text you are creating a new variable. Lucene gives you a score back and I remember in Kowari we had that information but I don't think it was ever visible in a variable (maybe I'm wrong, I don't really remember). It would be nice to be able to bind new variables somehow from things in the WHERE clause for this - that would also allow you to filter on counts greater than some value (without having a HAVING clause) or on documents that match your Lucene query with a score greater than a certain value. Being able to do transitive relationships on just a subset of the subclass relationship (like inferring only the subclasses of mammals rather than the whole tree of life) seemed to be met with some reluctance. I didn't really understand this but the argument seemed to be that it was the store's responsibility to control this and not up to the user to specify.
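To make the "only subclasses of mammals" idea concrete, here is a minimal sketch in Python of a transitive closure that starts from a root the user picks and never touches the rest of the hierarchy. The function name `subclasses_of`, the (child, parent) pair representation and the `ex:` names are all made up for illustration; this is not how any particular store or SPARQL proposal does it.

```python
from collections import deque

def subclasses_of(root, sub_class_of):
    """Return all direct and indirect subclasses of root.

    sub_class_of is an iterable of (child, parent) pairs, a stand-in
    for rdfs:subClassOf triples (an assumption made for this sketch).
    """
    # Index parent -> children so we only ever walk down from the root.
    children = {}
    for child, parent in sub_class_of:
        children.setdefault(parent, []).append(child)

    seen = set()
    queue = deque([root])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:  # also guards against cycles
                seen.add(child)
                queue.append(child)
    return seen

edges = [
    ("ex:Mammal", "ex:Animal"),
    ("ex:Dog", "ex:Mammal"),
    ("ex:Puppy", "ex:Dog"),
    ("ex:Bird", "ex:Animal"),
    ("ex:Oak", "ex:Plant"),
]
mammals = subclasses_of("ex:Mammal", edges)
```

Only the part of the tree below the chosen root is visited, which is the behaviour I'd like the user to be able to ask for.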

The other thing that was mentioned was transactions. It seems that transactions probably won't be part of SPARQL due to the nature of distributed transactions across the Web.

There was one paper on the first day that really stood out. I don't know what it is about logicians giving talks but they are generally really appealing to me. It was "Structured Objects in OWL: Representation and Reasoning" presented by Bernardo Grau. It seems to take the structural parts of an OWL ontology and create a graph to represent them. This prevents DL reasoning over an infinite tree and creates a bounded graph. This is cool for biology - the makeup of a cell, for example - but it also speeds up reasoning and allows errors to be found.

The other interesting part was the linked data area. I was a bit concerned that it was going to create a read only Semantic Web. A lot of the work, such as DBpedia that converts Wikipedia to RDF, seems a bit odd to me as you can only edit the Semantic Web indirectly through documents. But in the Linked Data Workshop a paper was presented called "Tabulator Redux: Browsing and Writing Linked Data" which of course adds write capabilities. I spoke to Chris Bizer (who gave a talk on how the linked data project now has ~2 billion triples) about whether you could edit DBpedia this way and he said probably not yet. That's going to be interesting to see where it goes.

I am just going off memory rather than notes. So I'll probably flesh this out a bit more later.

Saturday, April 19, 2008

Big Web Table

I thought I read that someone had ported Google's App Engine to use HBase. Good idea but not quite. "Announcing A BigTable Web Service": "I then came up with the crazy idea to offer BigTable as a web service using App Engine. It would be an infinitely scalable database running in Google's datacenters. I spent my weekend learning Python and hacking together an implementation. Now I'm happy to present the BigTable Web Service. It models the API of Hbase—a BigTable clone. Now you can have simulated BigTable running atop App Engine, which itself provides an abstraction on top of the real BigTable."

What it actually does is use HBase's Thrift API on top of Google's BigTable or as they say BigTable as a Web Service (a RESTful one).

Friday, April 18, 2008

Mario's Jazz Bar

Just a random thing to share - there seems to be quite a lot of competition playing Super Mario Galaxy songs on YouTube. A fairly recent one is a Jazz Interpretation of the Observatory theme (there's also a guitar version and an accordion one). I still quite like the original orchestral version of Gusty Garden Galaxy Theme and its version on piano. Koji Kondo is also good to look up on YouTube from time to time.

Thursday, April 17, 2008

What Women Want: Pairing

It's not the first time I've read an article about the continuing decline of women in IT. "Where Did All the Girl Geeks Go?" continues to note the slide:
"There's a perception that being a computer science major leads to a job as a programmer and you sit in a cubicle where you type 12 hours a day and have no interactions with other people," Block said.

Yusupova noted that even if pure programming jobs are outsourced, opportunities still remain within a company for people to bridge the relationship between the outsourced IT vendors and the business side.

"These roles would probably be ideal for women who prefer to be in communication-focused roles, if they know computer science and can communicate to all parties involved," Nelly Yusupova, chief technology officer of Webgrrls International, a networking organization.

There was a talk given recently at a local XP group that led to a discussion on the benefits of things like pair programming (see "Pair Programming").

I see pair programming and other means of improving interactions between developers not only as essential for better code and a better project but also as a way to improve the IT industry generally and to expand its appeal, especially to younger people and women. The idea is that certain people work better in a participatory manner rather than being told what to do.

This is pretty much what an article from a couple of years ago, "Debunking the Nerd Stereotype with Pair Programming" (or as PDF), suggested:
Jamie wants to be a software engineer. She enjoyed her programming and science classes in high school and wants to combine her interest in both disciplines to help society through biomedical applications. Since she started college, it seems that her life has been centered on time consuming programming classes. In those classes, her professors insist that she work alone—some professors expressly forbid even discussing assignments with fellow classmates. Before entering college, Jamie was aware of the stereotypical view that programmers work long hours by themselves. Based on her college experience, now she knows it’s more than just a stereotype—it’s true. Perhaps she should forget programming. She likes the friends she’s met in her biology lab group—maybe biology would be a better major.

Having worked in bioinformatics for over a year, I find the number of women in this area startling compared to IT. It seems basically 50/50 in what is essentially an application of information technology. They still write code, they still develop large applications and so on. Why does it drop to 1 in 20 or worse in IT? It does seem that in bioinformatics you are expected to work in groups and teams; people are forever interacting with each other - it seems a brilliant environment as far as productivity is concerned.

And it's not just IT or biology: there seems to be a general benefit from greater interaction, and more pairing and the like generally improves performance:
The success rate of underrepresented minorities in science courses has been shown to be dramatically improved by shifting the learning paradigm from individual study to one that capitalizes on group processes, such as student work groups and student-student tutoring.

From an IT perspective pairing doesn't only improve the quality of the software, it also improves your abilities as an individual programmer, as has been demonstrated where pair programming has been used in IT courses and the exam results of students improved (see "Pair Programming Improves Student Retention, Confidence, and Program Quality").

I re-read "All I Really Need to Know about Pair Programming I Learned In Kindergarten", which still holds up quite well as a set of rationales behind pair programming, and NCSU's Pair Learning has lots of papers related to pairing, learning and making IT more attractive through a more discovery-based system.

Update: Finally found a public version of the nerd article.

hashCode and equals for Blank Nodes

You don't need node ids. Most, if not all, RDF triple stores take a Literal, URI Reference or Blank Node and generate a node id. Sometimes it's a hash or UUID, sometimes it's from a node pool or value store, but you don't really need it. As an aside, in a distributed store you could even do the blocks of ids trick which people have done in SQL databases but I haven't seen that done for RDF yet.

When you do operations, like joins, in Java or Ruby or some other language you rely on hash codes to tell different values apart; if the hash codes are the same then you call equals.
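As a rough illustration of the hash-then-equals pattern, here is a small hash join sketched in Python, where `__hash__` and `__eq__` play the roles of Java's hashCode and equals. The `URIRef` class, the `hash_join` function and the sample bindings are invented for this example; they are not from JRDF or any other library.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class URIRef:
    """A globally addressable value: equality and hash depend only on the URI."""
    value: str

def hash_join(left, right, key):
    """Join two lists of variable bindings (dicts) on a shared variable."""
    # Build phase: the dict buckets rows by hash(row[key]) and uses
    # == to separate values that happen to land in the same bucket.
    buckets = defaultdict(list)
    for row in left:
        buckets[row[key]].append(row)

    # Probe phase: look up each right-hand row by the same key.
    joined = []
    for row in right:
        for match in buckets.get(row[key], []):
            joined.append({**match, **row})
    return joined

names = [
    {"s": URIRef("http://example.org/alice"), "name": "Alice"},
    {"s": URIRef("http://example.org/bob"), "name": "Bob"},
]
ages = [
    {"s": URIRef("http://example.org/alice"), "age": 30},
    {"s": URIRef("http://example.org/carol"), "age": 41},
]
result = hash_join(names, ages, "s")
```

Notice that the two `URIRef` objects for alice are different objects created in different places, yet they join, because hash and equality are computed from the value and not from an id handed out by a node pool.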

What if you don't have a node pool?

It's easy to do for what I like to call globally addressable values - URI References and Literals - no matter where you are, these methods return the same results from their hash code or equals. Not so with Blank Nodes, which are tied to the context of an RDF graph.

One solution is to ban blank nodes - they're pains to parse, query and store. But I actually like blank nodes. They're good at representing things where you don't want to confuse them with something that might actually be a URI to dereference.

The idea we've been working on with our high-falutin' scale-out MapReduce blah blah is really just coming up with sensible implementations of the hashCode and equals methods for blank nodes. There is previous work done in distributing blank nodes across graphs; the one that I'm most familiar with is RDF Molecules. But they didn't really quite cut it as far as hash codes and equals are concerned and that's basically what I'm presenting next week in China. The hash code is basically the head triple and the equals is the minimal context sub-graph for a given blank node.
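A very loose sketch of the shape of this, in Python, with several assumptions that are mine for the sake of the example and not the definitions from the talk: the "context" here is simply every triple that mentions the blank node, with graph-local labels masked out, and the "head triple" is just the smallest masked triple in sort order. The class name `BlankNodeInContext` and the `_:self` / `_:other` placeholders are hypothetical.

```python
def mask(term, label):
    """Replace graph-local blank node labels with context-free placeholders."""
    if term == label:
        return "_:self"
    if term.startswith("_:"):
        return "_:other"
    return term

class BlankNodeInContext:
    """A blank node compared by its surrounding triples, not by its label."""

    def __init__(self, label, graph):
        # Assumed context: all triples mentioning the node, labels masked.
        self.context = frozenset(
            tuple(mask(term, label) for term in triple)
            for triple in graph
            if label in triple
        )
        # Assumed head triple: the smallest masked triple (None if isolated).
        self.head = min(self.context, default=None)

    def __hash__(self):
        # Cheap: only the head triple is hashed.
        return hash(self.head)

    def __eq__(self, other):
        # Expensive: the whole context sub-graph is compared.
        return (isinstance(other, BlankNodeInContext)
                and self.context == other.context)

graph1 = [("_:a", "foaf:name", '"Alice"'), ("_:a", "foaf:knows", "ex:bob")]
graph2 = [("_:x", "foaf:knows", "ex:bob"), ("_:x", "foaf:name", '"Alice"')]
graph3 = [("_:y", "foaf:name", '"Carol"')]

a = BlankNodeInContext("_:a", graph1)
x = BlankNodeInContext("_:x", graph2)
y = BlankNodeInContext("_:y", graph3)
```

Equal contexts always produce the same head triple, so the usual contract holds: objects that are equal have the same hash. Masking every other blank node to the same placeholder is a simplification that would conflate some graphs that are really different.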

There's a lot more to say, as I've had to find something to talk about for the whole 15 minutes.

Wednesday, April 09, 2008

My (Continued) SPARQL Debacle

One of the reasons I started this blog was to record my current thoughts at a particular time. With this in mind, I should track my recent comments about SPARQL and the empty graph pattern and further rehashing of it.

I made a few mistakes during the discussion. I spent well over a week in the discussion itself, maybe a week thinking about it before asking the question, and much time thereafter just thinking about how to summarize it.

SPARQL is an algebra that is not consistent (isomorphic) with what I think of as set/relational/bag algebras (even though it appeared at one stage this was considered). The reason is that identities I believe hold in these algebras don't for SPARQL.

The set/relational/bag algebra identities are:
* A + 0 = A
* A * U = A
* A + U = U
* A * 0 = 0

Where + is UNION, * is INTERSECTION, A is any set, 0 is the empty set and U is the universal set. The second one is expressible and does work in SPARQL. The first one isn't expressible in SPARQL. The last two don't hold.

You can derive the last two identities from the first two as long as you have compatible definitions for things like inverse (or complement). When Date creates the algebra for bags he spends most of his time coming up with a reasonable definition for the complement of a bag which seems to be more like difference. In my interpretation 1/T/U is the relational TABLE_DEE and 0/F/empty set is TABLE_DUM. I thought this was quite clear but it appears even this is up for interpretation.
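For reference, here is how the identities look when checked mechanically, as a small Python sketch. The first half uses plain sets over a small made-up universe. The second half models a relation as a set of tuples, each tuple a frozenset of (attribute, value) pairs, and checks the two join identities with TABLE_DEE (one tuple, no attributes) and TABLE_DUM (no tuples). The `natural_join` function and the sample data are written just for this example.

```python
# Plain sets: + is union, * is intersection.
U = set(range(10))   # a small stand-in for the universal set
EMPTY = set()
A = {1, 2, 3}

assert A | EMPTY == A
assert A & U == A
assert A | U == U
assert A & EMPTY == EMPTY

def natural_join(r, s):
    """Natural join of two relations (sets of frozensets of (attr, value))."""
    out = set()
    for t in r:
        for u in s:
            merged = dict(t)
            # Tuples join when every shared attribute has the same value.
            if all(merged.get(attr, val) == val for attr, val in u):
                merged.update(u)
                out.add(frozenset(merged.items()))
    return out

TABLE_DEE = {frozenset()}   # no attributes, exactly one (empty) tuple
TABLE_DUM = set()           # no attributes, no tuples

R = {
    frozenset({("s", "ex:a"), ("o", "ex:b")}),
    frozenset({("s", "ex:c"), ("o", "ex:d")}),
}

assert natural_join(R, TABLE_DEE) == R           # A * U = A
assert natural_join(R, TABLE_DUM) == TABLE_DUM   # A * 0 = 0
```

This only shows the set and relational side of the comparison; it says nothing about how a SPARQL engine evaluates the empty graph pattern.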

Prior to Date's latest book, I had a bunch of his writings which I used to create identities for OPTIONAL, JOIN and UNION. I struggled a while back to see whether they were compatible with SPARQL and eventually decided that they were; it turns out that they are not, because the identities don't hold. I think as long as you don't ask these questions then it still returns the right answer and you could create special cases for SPARQL's empty graph pattern but I'm just not that confident anymore. SPARQL is more like an algebra of numbers than of sets or bags.

Reflecting on this, I was striving for a consistency that just wasn't there and if I squint hard enough I can see how the SPARQL algebra by itself makes sense.

There is still some behavior, even within the SPARQL specification, that appears to be really bad, like "SELECT ?x WHERE { ?s ?p ?o }" being a valid query (it returns one row with ?x unbound for every triple in the graph). This is quite different to SQL or relational PROJECT. It's also weird that the SPARQL specification is different to the Perez papers about SPARQL - evaluation is done at a grammatical level. UNION also differs as it's defined as multiset union not set union even though OPTIONAL is made up of set union not multiset union. Actually, I'm still not sure if UNION is multiset union because in the implementations I've seen the order is important (that is {} UNION {} UNION { ?s ?p ?o } is different to {} UNION { ?s ?p ?o } UNION {} and { ?s ?p ?o } UNION {} UNION {}) but I guess that's because of the grammatical evaluation.
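The projection oddity can be mimicked in a few lines of Python. This is only a toy model: solutions are dicts, an unbound variable is represented as None, and `match_all`, `project` and `strict_project` are names invented for the sketch. `strict_project` is there for contrast, behaving like SQL does when you select a column that doesn't exist.

```python
def match_all(graph):
    """Solutions for the pattern { ?s ?p ?o }: one per triple."""
    return [{"s": s, "p": p, "o": o} for (s, p, o) in graph]

def project(solutions, variables):
    """Lenient projection: variables that were never bound come back as None."""
    return [{v: row.get(v) for v in variables} for row in solutions]

def strict_project(solutions, variables):
    """Strict projection: asking for an unknown variable is an error."""
    for row in solutions:
        for v in variables:
            if v not in row:
                raise KeyError(f"unknown variable ?{v}")
    return [{v: row[v] for v in variables} for row in solutions]

graph = [
    ("ex:a", "ex:p", "ex:b"),
    ("ex:b", "ex:p", "ex:c"),
    ("ex:c", "ex:q", "ex:a"),
]

rows = project(match_all(graph), ["x"])
```

With three triples in the graph, `rows` holds three solutions, each with `x` left unbound, whereas `strict_project(match_all(graph), ["x"])` raises an error.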

It does put any further work on JRDF's SPARQL implementation in a bad position. I can keep calling it SPARQL but know that it's not following the standard or rename it (currently I'm thinking URQL) but the whole point of bothering seems to be questionable. The ironic thing is that it could pass all the SPARQL tests even though I know it's not compatible. In other work that I've been doing, I've been interested in SPARQL as the Unix pipes for RDF and blank node round tripping but SPARQL doesn't work there either. Blank node round tripping is where you take the result of one SPARQL query that includes a blank node and put it into a second.

Sometimes you come away from asking a question feeling validated or smarter and sometimes not. This time it's definitely not - I no longer feel confident talking about SPARQL or relational algebra.