Thursday, October 30, 2008

Cats Stealing Dogs' Jobs

Who knew that people from Japan read my blog? Tama the station master (actually, in true Japanese style, "super station master") is doing almost the exact thing I mentioned. Except they added merchandising - those damn cats.

A Billion Triples in your Pocket

At the Billion Triples Challenge this afternoon Cathrin Weiss showed off i-MoCo. It's a demonstration not only of a Semantic Web application on an iPhone using a stupid number of triples, but also of their indexing technique, Hexastore, which builds all six indexes over the orderings of RDF's subject, predicate and object in order to improve querying (the indexes are sorted, which allows you to do merge joins). Actually, this made me think that the next step in RDF triple stores will be indexes that are optimised for SPARQL operations and OWL inferences. Indexes for transitive closure perhaps? The data is regular and the storage is available, so it makes sense to build extra indexes that improve querying performance.
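To make the six-index idea concrete, here is a minimal sketch in Python. It is not the Hexastore authors' data structure - it just keeps six sorted lists of tuples, one per ordering, and the names (SixIndexStore, match, merge_join) and the toy triples are made up for illustration. The point is that any pattern with bound terms becomes a prefix lookup on one of the orderings, and because the results come back sorted, two result lists can be joined in a single linear pass.

```python
from bisect import bisect_left
from itertools import permutations

# The six orderings of the subject/predicate/object positions:
# spo, sop, pso, pos, osp, ops.
ORDERINGS = ["".join(p) for p in permutations("spo")]


class SixIndexStore:
    """Toy triple store holding one sorted list per ordering (illustrative only)."""

    def __init__(self, triples):
        unique = set(triples)
        self.indexes = {
            name: sorted(tuple(t["spo".index(c)] for c in name) for t in unique)
            for name in ORDERINGS
        }

    def match(self, s=None, p=None, o=None):
        """Return (s, p, o) triples matching the pattern; None is a wildcard."""
        bound = {"s": s, "p": p, "o": o}
        known = [c for c in "spo" if bound[c] is not None]
        free = [c for c in "spo" if bound[c] is None]
        # Pick the ordering whose leading positions are the bound terms,
        # so the matches form one contiguous, sorted run in that index.
        name = "".join(known + free)
        index = self.indexes[name]
        prefix = tuple(bound[c] for c in known)
        results = []
        for i in range(bisect_left(index, prefix), len(index)):
            row = index[i]
            if row[:len(prefix)] != prefix:
                break
            by_position = dict(zip(name, row))
            results.append((by_position["s"], by_position["p"], by_position["o"]))
        return results


def merge_join(left, right):
    """Intersect two sorted lists of unique values in one linear pass."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


# Made-up data: which cats work at Kishi station?
store = SixIndexStore([
    ("tama", "type", "Cat"),
    ("tama", "worksAt", "Kishi"),
    ("rex", "type", "Dog"),
    ("rex", "worksAt", "Kishi"),
    ("felix", "type", "Cat"),
])
cats = [s for s, _, _ in store.match(p="type", o="Cat")]          # sorted by subject
workers = [s for s, _, _ in store.match(p="worksAt", o="Kishi")]  # sorted by subject
print(merge_join(cats, workers))  # ['tama']
```

Both lookups here happen to use the predicate-object-subject ordering, which is why the subjects arrive already sorted and no extra sort is needed before the join. The cost is storing every triple six times, which is the trade of storage for query speed mentioned above.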

That wasn't the only impressive demo today. For me it's a toss-up between another iPhone SemWeb demo, DBpedia mobile, and SemaPlorer. DBpedia mobile stood out because it was the only one that allowed you to write data to the Semantic Web rather than just read the carefully prepared triples. For a similar reason I thought SemaPlorer was good, because they tried to do more by making it more generic, like integrating Flickr. But they were all excellent, all of them showing what you get with a billion or more triples and inferencing.

Combine that with the guys at Freebase making all of their data available as RDF and it was a big day for the Semantic Web.

Update: I noticed that in John Giannandrea's talk, when he mentioned the three aspects of Freebase, he went from the bottom up - I'm probably reading too much into it.

Update: Also caught an interview with John Francis, the guy who stopped talking, started walking, stopped taking mechanical transport and tried to change the world.

Update: ISWC 2008 awards have been published.

Wednesday, October 22, 2008

Time for REST

The best part of the latest release of JRDF is not code that we've written but Restlet. Restlet provides an excellent abstraction for building RESTful web services and allows deployment without a container (the distributed and local servers are each about 6MB in size).

This has allowed us to quickly develop a web service that answers SPARQL queries using data from one or many machines. We're using it to query the results from MapReduce tasks in our Hadoop cluster but it could probably be used as a general way to query other SPARQL services.

It's still early days and there's a lot that still needs to be added (and re-added) such as limiting result sizes, being able to choose which format to return and better JSON support. A lot of it has been pushed out in time to show a front-end at SSWS.

It would be good to have a JavaScript client that submits the SPARQL queries and handles the ordering and such, rather than having the distributed query server hang onto all the results until it returns.

Download it here.

President of Real America

Yes, Virginians, you may not be in the real part: "Jon Stewart Clarifies Palin Remarks, Expands To 'F%ck All Y'All'". America seems to have gone completely mental. The second video covers the current lunacy, including the pro- and anti-American members of Congress and the endorsement from Colin Powell - a black militant backing Obama, what a surprise.

Tuesday, October 21, 2008

Case Classes

Reading "Is Scala Not “Functional Enough”?" led me back to a post I've read before, "Are Scala's case classes a failed experiment?", but hadn't blogged about. In Java I simply avoid switching based on type (for example, if foo.getClass() else if bar.getClass()), but I really like case classes in Scala, especially for language processing.

Friday, October 10, 2008

5TB Processing

Scaling Hadoop to 4000 nodes at Yahoo! "The 4000-node cluster throughput was 7 times better than 500’s for writes and 3.6 times better for reads even though the bigger cluster carried more (4 v/s 2 tasks) per node load than the smaller one." They did some performance monitoring of the cluster and managed to sort 6TB of data in 37 minutes.

Update: Ex-Google and Yahoo employees create start-up (called Cloudera) and Hadoop Primer by Sun (PDF).

Update: Hadoop users group meeting with IBM talking about different join techniques and Cloudbase project which is an SQL abstraction over log files (tutorial PDF).

Update: Microsoft's first project is Hadoop.

Wednesday, October 08, 2008

Monks Practicing Evolution

Before the printing press allowed exact copies of texts, such as Bibles and other works, scribes would copy manuscripts by hand. These copies were imperfect and these mistakes would then be replicated as other scribes made further copies. The implication is that the church was practicing evolution before science had even discovered it. Darwin could've just popped down to his local monastery or church instead of cruising around the world.

Scientists have used phylogenetic software to look at these texts in order to discover the original document sources (creating a book of life if you will). Like evolution in the natural world, the mutations aren't random and you get errors such as recombination, lateral transfer, deletions, and even convergent evolution. There are some interesting relationships that can be determined, such as certain areas of text being more likely to have mistakes in them than others.

What's cool about a theory, such as evolution, is that it can be applied to many different areas such as natural languages, behavioural patterns, archaeological artifacts, and written works such as chain letters and medieval manuscripts.

More information: Manuscript evolution and Phylogenetics of artificial manuscripts.

Thursday, October 02, 2008

Coffee Inspired


  • The Global graphs in JRDF were inspired by the work done on MSGs (minimum self-contained graphs, published in the RDFSync paper) and RDF Molecules. The former links to an implementation in DBin (a P2P Semantic Web client) and there's also GVS (Graph Versioning System).
  • It's a trap (cloud computing). It's a fairly typical Stallman statement - not wrong, but not aware of the compromises people make. It is obvious that one of the reasons vendors are excited about cloud computing is that it is a chance for them to try to own your data (or at least make switching too hard). But you do have to do more than put data in the cloud - you have to have executable services there too. There is open source cloud computing infrastructure that you can run up on your own servers, like Hadoop or CouchDB. And it's not just back to mainframes and renting CPU time out by the hour; it really is different to what has gone before.

  • Live the Cloud Life lists the cloud computing applications in categories such as email, documents, data, music, photo editing and storing and browser synchronisation. Some categories are missing like RSS reading and calendaring.

  • Apple drops NDA - the outrage worked.

  • The main eResearch 2008 conference is over and some of the papers are available.

  • Speaking of which, all development, and that definitely includes software, in whatever organisation (universities, governments, banks, etc) should have failure as an option. One sign of a truly stuffed culture is to never have a project fail.

  • Muradora: a refactoring of Fedora to allow pluggable authentication and enable metadata editing.

  • Acer Aspire One links: Dual Monitor Support,
    Installing Firefox 3 on Acer Aspire One Linux and
    Updated repositories. It's a shame that OpenOffice doesn't support a presenter's view yet.