Tuesday, December 29, 2009

Syntax Hell

As anyone who has applied a more functional style to Java will realize, the Java syntax really gets in the way. I was using my usual guinea pig, JRDF, adding a "with" construct so that you apply a function rather than iterating by hand, and so that you don't have to close the ClosableIterator yourself. The typical code to print out the graph is:
ClosableIterable<Triple> triples = graph.find(ANY_TRIPLE);
try {
  for (Triple triple : triples) {
    System.out.println("Graph: " + triple);
  }
} finally {
  triples.iterator().close();
}

Using "ClosableIterators.with" this becomes:
with(graph.find(ANY_TRIPLE), new Function<Void, ClosableIterable<Triple>>() {
   public Void apply(ClosableIterable<Triple> object) {
     for (Triple triple : object) {
       System.out.println("Graph: " + triple);
     }
     return null;
   }
});
It's typically one line fewer but that's not much of an improvement.
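
For reference, a minimal sketch of what such a helper might look like - assuming JRDF's Function declares its return type first (as the call above suggests) and that the iterable exposes a closable iterator; the actual JRDF signatures may differ:

public final class ClosableIterators {
  // Run the function over the iterable and always close it afterwards,
  // so callers no longer need their own try/finally block.
  public static <R, T> R with(ClosableIterable<T> iterable, Function<R, ClosableIterable<T>> function) {
    try {
      return function.apply(iterable);
    } finally {
      iterable.iterator().close();
    }
  }
}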

Saturday, November 21, 2009

JRDF 0.5.6 and testing

JRDF 0.5.6 has JSON SPARQL support and many things got a good refactoring, including the MergeJoin (which now looks more like typical textbook examples - for better or worse). The upgrade to Java BDB version 4 has improved disk usage and performance, mostly around temporary results from finds and the like. The next version will include named graph support.

I'm considering making this the last version to support Java 5.

This release was also about learning things like Hamcrest, JUnit 4.7 and various mock extensions (Unitils, Powermock and Spring automocking - which didn't make it due to not being able to mix and match runners). This seems to be a design flaw that I've been encountering with JUnit - you can't mix and match features from various runners. Even combining Powermock and JUnit's Rules (for expected exceptions, anyway) was problematic. The answer was to go back to the inner class block version.
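
For what it's worth, the inner class block style ends up looking roughly like this - a minimal sketch with a hypothetical helper and hypothetical code under test, not JRDF's actual tests:

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.fail;

import org.junit.Test;

// Hypothetical helper: run a block of code and assert it throws the expected
// exception - no special runner or @Rule required, so it mixes with anything.
abstract class ExpectedBlock {
  protected abstract void run() throws Throwable;

  public void shouldThrow(Class<? extends Throwable> expected) {
    try {
      run();
    } catch (Throwable actual) {
      assertEquals(expected, actual.getClass());
      return;
    }
    fail("Expected " + expected.getName() + " to be thrown");
  }
}

public class UriTest {
  @Test
  public void rejectsInvalidUris() {
    new ExpectedBlock() {
      protected void run() throws Throwable {
        new java.net.URI("not a valid uri"); // illustrative code under test
      }
    }.shouldThrow(java.net.URISyntaxException.class);
  }
}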

Wednesday, August 12, 2009

Simpler SPARQL Results

SPARQL Results have two distinct and unnecessarily different ways of doing query results - one for SPARQL SELECT and one for SPARQL ASK (the whole SELECT vs ASK split is worth pursuing too).

The results section of a SPARQL SELECT XML result looks like:

...
<results>
<result>
<binding name="x"> ... </binding>
<binding name="hpage"> ... </binding>
</result>
...


An ASK result looks like:

...
<boolean>true</boolean>
...


The value inside the boolean element is either true or false (or possibly fred - depending on how you parse it - it would be nice to have an XSD). One set of results lives in results/result/binding and the other in a boolean. A simpler way to do it is just to use the absence or presence of a binding within results/result.

For true:

<results>
<result>
<binding/>
</result>
</results>


For false:

...
<results>
<result/>
</results>
...


The JSON looks pretty much the same:

{
...
bindings : [{}] // true
bindings : [] // false
}


No parsing of true or false, and no special cases in code that produces or consumes the different kinds of SPARQL answers.
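
For example, a consumer could answer an ASK query with the same code it already uses for SELECT, simply by checking whether the bindings list is empty. A minimal sketch, assuming the bindings array sits under results as in the standard SELECT layout, and using org.json purely for illustration:

import org.json.JSONArray;
import org.json.JSONObject;

public final class SparqlAnswer {
  // With the proposed convention, "true" is a single empty binding and
  // "false" is an empty binding list - no separate boolean element to parse.
  public static boolean askResult(String json) throws Exception {
    JSONArray bindings = new JSONObject(json)
        .getJSONObject("results")
        .getJSONArray("bindings");
    return bindings.length() > 0;
  }
}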

Sunday, July 26, 2009

A Dangerous Adventure

One of the many messages left by leaders of the world on a silicon disk was this: "May the high courage and the technical genius which made this achievement possible be so used in the future that mankind will live in a universe in which peace, self expression, and the chance of a dangerous adventure are available to all."

Via: Debate over future of US space program.

Sunday, July 12, 2009

See and Feel the iPhone

A recent patent backs up an idea I had about a haptic interface for the iPhone (it's not a new idea even for the iPhone):
The proposed solution is the adoption of "haptic" display technologies which allow for some tactile feedback from touch screen displays. Apple proposes including a grid of piezoelectronic actuators that can be activated on command. By fluctuating the frequency of these actuators, the user will "feel" different surfaces as their finger moves across it.

JRDF 0.5.5.5

It's been a long while between updates but it's finally here. There have been some general concessions made to long-standing "features" in JRDF, namely relational semantics and checked exceptions - both are gone. Yuan-Fang added merge-join support, which improved join performance (by up to 8 times). There's Groovy support, and a nasty memory leak has been fixed. It's in the usual place. The next version won't be so far away, with some further SPARQL query improvements including perhaps some of the newly proposed features.

Monday, April 27, 2009

Happiness by Empiricism (and no numbers or God)

Daniel Everett: Endangered Languages and Lost Knowledge. There are some quite interesting observations made by Daniel about the Piraha:
...so English has how many verb forms, well it has about 5, sing, sang, sung, singing, sings...Spanish or Portuguese might have 40 different verb forms, well Piraha like many American Indian languages has a very complex verbal system. So Piraha has 16 different suffix that can go with the end of a verb, that gives 2 to the 16th power possible verb forms and that is a lot. That is more than 40. And of those things, 3 suffixes are very important and those tell you how you got your evidence. So every verb has to have on it the source of the evidence, did you hear about it, did you see it with your own eyes, or did you deduce it from the local evidence. So if I say did John go fishing? They can say John went fishing "heai" which means I heard that he did, or they can say John went fishing “sibiga” and that means I deduced that he did, or they can say John went fishing “ha” and that means I saw he went. In some respects they are the ultimate empiricist...

They actually demanded evidence for what I believe and I realized, I could not give it as well as they wanted me to give it. So, this changed my profoundly, but I remember telling them about Jesus one time and they said “So Dan, is Jesus is he brown like us or is he white like you? “I do not know I haven’t seen him.” “What did your dad say? Because your dad must have seen him.” “No, he never saw him.” “Oh what did your friends say who saw him?” “No I do not know anybody who saw him.” “Why are you telling us about him then? Why would you talk about something you do not have evidence for?” But of course we do that all the time.


This is coupled with the fact that they don't have the concept of one; instead there are only relative terms like "a little amount". They also mention that the Rosetta Project is available on DVD.

Thursday, April 23, 2009

Linking Resources without HTML, XHTML, Atom...

In HTML you can link to alternative versions of your HTML using the link tag:
<link rel="alternate" type="application/rss+xml" title="RSS Feed"
href="report.rss" />

There is a way to do linking just using HTTP through the Link header (as seen in POWDER). It looks very similar but is more powerful; from the IETF draft document:

Link: <http://example.org/>; rel="index start";
rel="http://example.net/relation/other" rev=copyright


This defines four links for http://example.org/: three outbound (index, start and http://example.net/relation/other) and one inbound (copyright).
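
As an illustration, emitting the equivalent of the HTML link element above as an HTTP Link header from a servlet might look like this (a sketch; the relative href and helper name are mine):

import javax.servlet.http.HttpServletResponse;

public final class LinkHeaders {
  // Advertise the RSS alternative of a resource purely via HTTP,
  // with no HTML, XHTML or Atom document involved.
  public static void addAlternateLink(HttpServletResponse response) {
    response.addHeader("Link",
        "<report.rss>; rel=\"alternate\"; type=\"application/rss+xml\"; title=\"RSS Feed\"");
  }
}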

More information: HTTP Status Report and LinkHeader.

Thursday, April 16, 2009

Trams in Brisbane and Per Capita Carbon Emissions

China expert raps luxurious Aussie life
Prof Pan was unimpressed with Australia's environmental standards, saying public transport seemed poor and the buildings and street lighting were not energy efficient.

He labelled as "insufficient" Australia's pledge to cut greenhouse emissions by five to 15 per cent by 2020.

The Chinese experts called for a global climate pact that would involve each country being allowed to emit a certain amount, based on their populations.

This is ominous for Australia because it has very high per capita emissions, whereas China has fairly low per capita emissions.

Australian climate adviser Ross Garnaut backed the per capita push in a video address to the conference, saying it was fair.


This was at the Australia-China Climate Change Forum.

The biggest mistake, though, is that while there is hope for clean coal, serious investment in other technologies doesn't exist, as highlighted by the recent Clean Coal Air Freshener advert from This is Reality.

As for public transport, Brisbane used to have trams, including a line along Milton Road. There was a fire in Paddington, where Paddington Central is now, and the trams were replaced by buses. The tram e-petition got a massive 600 signatures, which puts it well behind things like a petition against recycling water.

Wednesday, April 15, 2009

MapReduce vs SQL Databases

A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks
...we present the results of running the benchmark on a 100-node cluster to execute each task. We tested the publicly available open-source version of MapReduce, Hadoop [1], against two parallel SQL DBMSs, Vertica [3] and a second system from a major relational vendor.

First, as we demonstrate in Section 4, at 100 nodes the two parallel DBMSs range from a factor of 3.1 to 6.5 faster than MapReduce on a variety of analytic tasks. While MR may indeed be capable of scaling up to 1000s of nodes, the superior efficiency of modern DBMSs alleviates the need to use such massive hardware on datasets in the range of 1–2PB (1000 nodes with 2TB of disk/node has a total disk capacity of 2PB). For example, eBay’s Teradata configuration uses just 72 nodes (two quad-core CPUs, 32GB RAM, 104 300GB disks per node) to manage approximately 2.4PB of relational data. As another example, Fox Interactive Media’s warehouse is implemented using a 40-node Greenplum DBMS. Each node is a Sun X4500 machine with two dual-core CPUs, 48 500GB disks, and 16 GB RAM (1PB total disk space) [7]. Since few data sets in the world even approach a petabyte in size, it is not at all clear how many MR users really need 1,000 nodes.


In section 3.1 there's some points made about the advantages of databases over MR in relation to data integrity, "...a MR framework and its underlying distributed storage system has no knowledge of these rules, and thus allows input data to be easily corrupted with bad data. By again separating such constraints from the application and enforcing them automatically by the run time system, as is done by all SQL DBMSs, the integrity of the data is enforced without additional work on the programmer’s behalf."

They mention that "all DBMSs require that data conform to a well-defined schema, whereas MR permits data to be in any arbitrary format. Other differences also include how each system provides indexing and compression optimizations, programming models, the way in which data is distributed, and query execution strategies."

If you strip it away they are talking about text processing versus indexed data structures (and other parts of a DBMS).

For loading, "Without using either block-or record-level compression, Hadoop clearly outperforms both DBMS-X and Vertica since each node is simply copying each datafile from the local disk into the local HDFS instance and then distributing two replicas to other nodes in the cluster." The obvious difference to me would be that the SQL databases are creating "...a hash partitioned across all nodes on the salient attribute for that particular table, and then sorted and indexed on different attributes..."

For text processing, they note that the main problems with Hadoop are the start-up costs (10-25 seconds before all Map tasks start) and, during the Reduce phase, the cost of combining many small files. When you are comparing a fully indexed system against text processing you would expect the indexed system to be faster. Compression was also considered an advantage of systems like Vertica over Hadoop, where it actually reduced performance. Whether compression is worth its overhead obviously depends on the work being done - it's not explained why compression was a negative for Hadoop.

They also talk about the problems in setting up and configuring the parallel databases over Hadoop, which is not an insignificant difference when you are scaling to 100s and 1000s of nodes.

In the summary they talk about 25 years of database development and the advantages of B-Trees and column stores. This raises the question: why wasn't a similar system used on the Hadoop infrastructure? MR is really more like distributed processing, not an indexed, querying system.

If you took away the distributed layer, what they are doing is comparing something like grep (a really bad implementation of grep) with Lucene or MySQL. Would anyone be surprised by the results then? A better comparison would've been against HBase or other distributed, indexed data stores like Hive or Cloudbase.
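
To make the grep comparison concrete, the MR side of the paper's selection task boils down to a mapper that scans every record and emits the matching ones - a minimal sketch using the standard Hadoop mapper API, with the pattern and output types as illustrative choices:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A full scan of the input: every record is read on every run, with no index
// involved - which is the point about MR being distributed processing rather
// than an indexed query system.
public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    if (line.toString().contains("XYZ")) {        // illustrative selection predicate
      context.write(line, NullWritable.get());    // emit matching lines unchanged
    }
  }
}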

Update: There's a good followup on the hadoop list by Jonathan Gray: "Hadoop is not suited for random access, joins, dealing with subsets of your data; ie. it is not a relational database! It's designed to distribute a full scan of a large dataset, placing tasks on the same nodes as the data its processing. The emphasis is on task scheduling, fault tolerance, and very large datasets, low-latency has not been a priority. There are no "indexes" to speak of, it's completely orthogonal to what it does, so of course there is an enormous disparity in cases where that makes sense. Yes, B-Tree indexes are a wonderful breakthrough in data technology". He suggested Pig, Hive and Cascading would be more suitable for comparison.

Monday, April 13, 2009

TDD for Ontologies

As it was said to me recently, very often something goes from being mysterious and magical to obvious (I'm looking at you, not-so-magical-anymore Grails wiring). So it goes with test driving ontologies (or at least it did for me). One of the standard methodologies for ontology development is creating competency questions for an ontology to successfully answer. This generally means gathering requirements and use cases, organising the data and ontology, testing them, and maintenance. Doesn't that sound an awful lot like waterfall? It would appear obvious that, just like code, ontologies could be developed using test driven design (née development) instead. This should have the same benefits as agile code - like being more adaptable to change.

The closest paper I've found that talks about this idea is "Unit tests for ontologies" (and there's also a bachelor's thesis, Unit Testing for Ontologies). Unfortunately, it talks about adding tests during the maintenance of software. Luckily, it also talks about using them as a way to initially build the ontology, and surely that would be a way to drive out an ontology. Good stuff. I could imagine that certain good patterns of ontology design - like making them small, self consistent, fast, intuitive, etc. - would all come naturally out of applying TDD techniques.

As an aside, the unit testing paper also mentions the usefulness of autoepistemic operators (K and A). You can state that for a given instance there must exist a certain property; for example, a country must have a capital city.
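
As a sketch of what a test-driven competency question could look like in code - assuming a Jena model and an illustrative country/capital vocabulary, neither of which comes from the paper:

import static org.junit.Assert.assertFalse;

import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import org.junit.Test;

public class CountryOntologyTest {
  @Test
  public void everyCountryHasACapital() {
    Model model = ModelFactory.createDefaultModel();
    model.read("file:ontology.rdf"); // the ontology being test-driven
    // Ask whether any country is missing a capital; the test fails until
    // the ontology and its instance data satisfy the competency question.
    boolean missingCapital = QueryExecutionFactory.create(
        "PREFIX ex: <http://example.org/> " +
        "ASK { ?c a ex:Country . " +
        "      OPTIONAL { ?c ex:capital ?city } " +
        "      FILTER (!bound(?city)) }",
        model).execAsk();
    assertFalse("Found a country without a capital", missingCapital);
  }
}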

Tuesday, March 31, 2009

One Way to Describe What I Did Today

I've thought a bit about how best to describe what I did today. I spent a whole day, knowingly, adding more technical debt rather than spending two days removing existing debt. Now, writing bad code is not linear and it's hard to judge, but from what I can tell there is now twice as much debt as yesterday. And this is seen as progress.

Friday, March 20, 2009

Things Get Dark

The Crisis of Credit Visualized. It was quite good except maybe the explanation of leverage was a bit extreme, with 10,000 turning into 990,000.

Thursday, March 19, 2009

Haptic Technology for the iPhone

One of the questions that often comes up is how you could improve on the iPhone without changing the form factor too much. One of the things that occurred to me (and has been mentioned by others) was a keyboard. But it would be nice to have a keyboard that worked as well as the current one does (well, better) and that could work in both portrait and landscape mode. Is that crazy talk?

Well, the idea would be to use electrorheological fluid to create a keyboard dynamically. I came across some discussion on how to use it for braille and it should work equally well for creating a QWERTY keyboard - although power usage may be a problem. For more technical discussion (and I haven't really read it except for the practical use in virtual reality gloves), see Haptic Interfaces Using Electrorheological Fluids.

Thursday, March 12, 2009

You Should Read This

Tony's letter to the Medical Board of Queensland.

Some Nice Programming with Jenabeans

Writing out SIOC triples using Jena + Jenabean "Jenabean’s model connected programming model makes this easy, using interfaces that declare each of the vocabularies as a set of methods." The code example shows how you can piece together (using a fluent interface) different vocabularies:
Thing thing = new Thing(m);
thing.at(uri1).
as(DCTerms.class).
title("Creating connections between discussion clouds with SIOC").
created("2006-09-07T09:33:30Z").
isa(Sioc.Post.class).
has_container(thing.at(uri2)).
...

CloudBurst

This is software that uses MapReduce to handle massive amounts of sequence data and reassemble it. It includes an explanation of the algorithmic side of things. I wonder how it compares with using de Bruijn graphs.

Wednesday, March 11, 2009

With the Passage of Time

Who would've thought that RoboCop was overly optimistic.

First, it was assumed that people would still be living there. Second, that they'd still be able to make things like robots. And third, that cars in America would get 8 mpg and not 1-3.5 mpg.

Tuesday, March 03, 2009

URLs for 10 minutes

Protect Your Site With URL Rewriting seems a little bit of a mad suggestion. They are basically suggesting that you change your application's URIs every 10 minutes to prevent XSS and XSRF (Cross-Site Request Forgery).

We could mitigate much of the risk of these vulnerabilities by frequently changing our URLs—not once every 200 years but once every 10 minutes. Attackers would no longer be able to exploit application vulnerabilities by mass e-mailing poisoned hyperlinks because the links would be broken and invalid by the time the messages reached their intended victims. With all due respect to Sir Tim, while "cool" URIs may not change, secure ones certainly do.
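
To make the mechanism concrete, the rotation being described could be implemented as a token appended to each URL that is only valid for the current ten-minute window - a sketch where the window size, secret handling and helper names are all illustrative, not from the article:

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public final class ExpiringUrls {
  private static final long WINDOW_MILLIS = 10 * 60 * 1000;

  // Produce a token tied to the path and the current ten-minute window;
  // the server rejects any request whose token no longer matches.
  public static String token(String path, byte[] secret) throws Exception {
    long window = System.currentTimeMillis() / WINDOW_MILLIS;
    Mac mac = Mac.getInstance("HmacSHA1");
    mac.init(new SecretKeySpec(secret, "HmacSHA1"));
    byte[] digest = mac.doFinal((path + ":" + window).getBytes("UTF-8"));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString(); // appended to the URL, e.g. ?t=<token>
  }
}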


The negatives seem vast and I wonder what this is really trying to solve, as they say at the end:
An automatically expiring URL can still be exploited by an attacker with access to a Web server of his own. Instead of sending out malicious hyperlinks that point directly to the vulnerable page, he can send out hyperlinks that point to his own site. When his site gets a hit from one of the phished e-mails, it can contact a landing page on the vulnerable site to obtain a valid time stamp and then redirect the user accordingly.


If you are a REST advocate then maybe a quick read over, The Resource-Oriented Architecture in Action, may soothe you (from the excellent book, RESTful Web Services).

A Quick Survey of Bio-Ontologies

The other day I was trying to find a paper that talks about the need for ontologies in biology, dating from around the early 90s - way before OWL and the Semantic Web. I couldn't find the paper I was thinking of, but here are some others that are pretty good and seem to follow at least the same themes:

"Ontologies for molecular biology": "Molecular biology has a communication problem. There are many databases using their own labels and categories for storing data objects and some using identical labels and categories but with a different meaning."


"Ontological Foundations for Biology Knowledge Models": This one was good because it talked about processes and transformations which is where the rules and inferencing stuff comes in.

"Toward Principles for the Design of Ontologies Used for Knowledge Sharing": While not specifically about biology, this is probably the most cited paper and it's what I often think about when you're explaining ontologies and the process of improvement that occurs when you make one.

"Bio-ontologies: current trends and future directions": This covers the process and the parts that make a good ontology on the web. I guess the key for the use of the Web for ontologies is a means to share knowledge. It also has a good history of ontologies going back to the 1600s.

And this one is one of the original GO papers, "Gene Ontology: tool for the unification of biology".

Plastic Ocean

"Sailing the Great Pacific Garbage Patch" has some interesting points including that there's no such thing as a free range fish (because the oceans are so polluted) and that the island of plastic is more a concentration of plastic - which sounds far worse as far as the environment is concerned.

Adding Some Herbs to Open Databases

"Harnessing the Crowd to Make Better Drugs: Merck’s Friend Nails Down $5M to Propel New Open Source Era"

Friend, 54, is leaving his high-profile job as Merck’s senior vice president of cancer research, after having nailed down $5 million in anonymous donations to pursue this vision at a nonprofit organization getting started in Seattle called Sage.

Sage is built on the premise that vast networks of genes get perturbed, or thrown off-kilter, in complex diseases like cancer, diabetes, and obesity. Scientists can’t just pick one faulty gene or protein and make a magic bullet to shut it down. But what if researchers around the world capturing genomic profiles on patients could get all of their data to talk to each other through a free, open database? A researcher in Seattle looking at how all 35,000 genes in breast cancer patients are dialed on or off at a certain stage of illness might be able to make critical comparisons by stacking results up against a deeper and broader data pool that integrates clinical, genetic, and other molecular data from peers in, say, San Francisco, New Haven, CT, or anywhere else.

Some big names have signed on for the early incubating phase. Besides the full-time efforts of Friend and Schadt, the Sage board includes Nobel Laureate Lee Hartwell of the Fred Hutchinson Cancer Research Center; Paul Ramsey dean of the School of Medicine at the University of Washington; Richard Lifton, the chairman of genetics at Yale University; and Hans Wigzell, director emeritus of Sweden’s Karolinska Institute. For insight into how to apply lessons from the open-source computing world, the board has brought on John Wilbanks, the vice president of science at the San Francisco-based Creative Commons.

As with any far-out vision, plenty of things can derail it along the way. What if researchers use different gene analysis machines, from Affymetrix, Illumina, or Applied Biosystems? How will Sage reconcile differences in how experiments are designed by different scientists? How will researchers be enticed to let go of their precious data, currently stored on password-protected hard drives and servers? How will Sage manage the intellectual property that arises from the database? Why would companies want to participate and run the risk of putting valuable proprietary data out in public? How will this get financed?

Some of these things Friend can answer, and some still need to be worked out. Software is already making it possible to manage differences between the various instruments scientists use, and deal with the differences in experimental design, Friend says.

Thursday, February 12, 2009

LOL (List of Links)