Monday, April 27, 2009

Happiness by Empiricism (and no numbers or God)

Daniel Everett: Endangered Languages and Lost Knowledge. There are some quite interesting observations made by Daniel about the Piraha: English has how many verb forms, well it has about 5, sing, sang, sung, singing, sings... Spanish or Portuguese might have 40 different verb forms, well Piraha like many American Indian languages has a very complex verbal system. So Piraha has 16 different suffixes that can go with the end of a verb, that gives 2 to the 16th power possible verb forms and that is a lot. That is more than 40. And of those things, 3 suffixes are very important and those tell you how you got your evidence. So every verb has to have on it the source of the evidence, did you hear about it, did you see it with your own eyes, or did you deduce it from the local evidence. So if I say did John go fishing? They can say John went fishing "heai" which means I heard that he did, or they can say John went fishing "sibiga" and that means I deduced that he did, or they can say John went fishing "ha" and that means I saw he went. In some respects they are the ultimate empiricists...

They actually demanded evidence for what I believe and I realized, I could not give it as well as they wanted me to give it. So, this changed me profoundly, but I remember telling them about Jesus one time and they said "So Dan, is Jesus, is he brown like us or is he white like you?" "I do not know, I haven't seen him." "What did your dad say? Because your dad must have seen him." "No, he never saw him." "Oh, what did your friends say who saw him?" "No, I do not know anybody who saw him." "Why are you telling us about him then? Why would you talk about something you do not have evidence for?" But of course we do that all the time.

This is coupled with the fact that they don't have the concept of one; instead there are only relative terms like "a little amount". They also mention that the Rosetta Project is available on DVD.

Thursday, April 23, 2009

Linking Resources without HTML, XHTML, Atom...

In HTML you can link to alternative versions of your HTML using the link tag:
<link rel="alternate" type="application/rss+xml" title="RSS Feed"
href="report.rss" />

There is a way to do linking using just HTTP, through the Link header (as seen in POWDER). It looks very similar but is more powerful. From the IETF draft document:

Link: <>; rel="index start";
rel="" rev=copyright

This defines four links: three outbound (index, start and a third relation) and one inbound (copyright).
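
A minimal sketch of how a client might split such a header value into individual links. It assumes a placeholder target, http://example.org/, because the URLs in the example above are not shown, so this shortened header yields three links rather than four. The grammar is simplified (no comma-separated link-values, no semicolons inside quoted strings), and the function name parse_link_header is made up for illustration.

```python
import re

def parse_link_header(value):
    """Split one simplified Link header value into (target, direction, relation) tuples."""
    match = re.match(r'\s*<([^>]*)>(.*)', value, re.S)
    if match is None:
        raise ValueError("expected <target> at the start of the Link value")
    target, params = match.group(1), match.group(2)
    links = []
    for param in params.split(';'):
        if '=' not in param:
            continue  # skip empty fragments between semicolons
        name, _, raw = param.partition('=')
        name = name.strip().lower()
        if name not in ('rel', 'rev'):
            continue  # other parameters (title, type, ...) are ignored in this sketch
        # A quoted value may carry several space-separated relation types.
        for relation in raw.strip().strip('"').split():
            links.append((target, name, relation))
    return links

# Hypothetical header modelled on the example above; the URL is an assumption.
header = '<http://example.org/>; rel="index start"; rev=copyright'
links = parse_link_header(header)
for target, direction, relation in links:
    print(direction, relation, target)
```

Each rel entry is an outbound link from the current resource and each rev entry is an inbound one, which is how a single header line can carry several links at once.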

More information: HTTP Status Report and LinkHeader.

Thursday, April 16, 2009

Trams in Brisbane and Per Capita Carbon Emissions

China expert raps luxurious Aussie life
Prof Pan was unimpressed with Australia's environmental standards, saying public transport seemed poor and the buildings and street lighting were not energy efficient.

He labelled as "insufficient" Australia's pledge to cut greenhouse emissions by five to 15 per cent by 2020.

The Chinese experts called for a global climate pact that would involve each country being allowed to emit a certain amount, based on their populations.

This is ominous for Australia because it has very high per capita emissions, whereas China has fairly low per capita emissions.

Australian climate adviser Ross Garnaut backed the per capita push in a video address to the conference, saying it was fair.

This was at the Australia-China Climate Change Forum.

The biggest mistake, though, is that while hope is pinned on clean coal, serious investment in other technologies doesn't exist, as highlighted by the recent Clean Coal Air Freshener advert from This is Reality.

As for public transport, Brisbane used to have trams, including a line along Milton Road. There was a fire in Paddington, where Paddington Central is now, and the trams were replaced by buses. The tram e-petition got a massive 600 signatures, which puts it well behind things like a petition against recycling water.

Wednesday, April 15, 2009

MapReduce vs SQL Databases

A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks
...we present the results of running the benchmark on a 100-node cluster to execute each task. We tested the publicly available open-source version of MapReduce, Hadoop [1], against two parallel SQL DBMSs, Vertica [3] and a second system from a major relational vendor.

First, as we demonstrate in Section 4, at 100 nodes the two parallel DBMSs range from a factor of 3.1 to 6.5 faster than MapReduce on a variety of analytic tasks. While MR may indeed be capable of scaling up to 1000s of nodes, the superior efficiency of modern DBMSs alleviates the need to use such massive hardware on datasets in the range of 1–2PB (1000 nodes with 2TB of disk/node has a total disk capacity of 2PB). For example, eBay’s Teradata configuration uses just 72 nodes (two quad-core CPUs, 32GB RAM, 104 300GB disks per node) to manage approximately 2.4PB of relational data. As another example, Fox Interactive Media’s warehouse is implemented using a 40-node Greenplum DBMS. Each node is a Sun X4500 machine with two dual-core CPUs, 48 500GB disks, and 16 GB RAM (1PB total disk space) [7]. Since few data sets in the world even approach a petabyte in size, it is not at all clear how many MR users really need 1,000 nodes.

In section 3.1 there are some points made about the advantages of databases over MR in relation to data integrity: "...a MR framework and its underlying distributed storage system has no knowledge of these rules, and thus allows input data to be easily corrupted with bad data. By again separating such constraints from the application and enforcing them automatically by the run time system, as is done by all SQL DBMSs, the integrity of the data is enforced without additional work on the programmer’s behalf."

They mention that "all DBMSs require that data conform to a well-defined schema, whereas MR permits data to be in any arbitrary format. Other differences also include how each system provides indexing and compression optimizations, programming models, the way in which data is distributed, and query execution strategies."

If you strip it back, they are talking about text processing versus indexed data structures (and other parts of a DBMS).

For loading, "Without using either block-or record-level compression, Hadoop clearly outperforms both DBMS-X and Vertica since each node is simply copying each datafile from the local disk into the local HDFS instance and then distributing two replicas to other nodes in the cluster." The obvious difference to me would be that the SQL databases are creating "...a hash partitioned across all nodes on the salient attribute for that particular table, and then sorted and indexed on different attributes..."

For text processing, they note that the main problems with Hadoop are the start-up costs (10-25 seconds before all Map tasks start) and, during the Reduce phase, the cost of combining many small files. When you compare a fully indexed system with text processing, you would expect the indexed system to be faster. Compression was also considered an advantage in systems like Vertica, whereas in Hadoop it actually reduced performance. Whether compression is worth its overhead depends on the work being done, but it's not explained why compression was a negative for Hadoop.

They also talk about the problems in setting up and configuring the parallel databases compared with Hadoop, which is not an insignificant difference when you are scaling to 100s and 1000s of nodes.

In the summary they talk about 25 years of database development and the advantages of B-Trees and column stores. That raises the question: why wasn't a similar system used on the Hadoop infrastructure? MR is really more like distributed processing than an indexed querying system.

If you took away the distributed layer, what they are doing is comparing something like grep (a really bad implementation of grep) with Lucene or MySQL. Would anyone be surprised by the results then? A better comparison would've been against HBase or other distributed, indexed data stores like Hive or Cloudbase.
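
To make the grep versus Lucene analogy concrete, here is a toy sketch in plain Python (not Hadoop or any DBMS). The scan function reads every record on every query, as a full-scan job would, while build_index does extra work at load time so that later lookups touch only the matching records. The records and all function names are invented for illustration.

```python
from collections import defaultdict

def scan(records, word):
    """Full scan: examine every record on every query (grep-like)."""
    return [i for i, text in enumerate(records) if word in text.split()]

def build_index(records):
    """Load-time work: map each word to the ids of the records containing it."""
    index = defaultdict(list)
    for i, text in enumerate(records):
        for w in set(text.split()):
            index[w].append(i)
    return index

def lookup(index, word):
    """Indexed query: one dictionary probe, no pass over the data."""
    return index.get(word, [])

# Hypothetical records standing in for a large dataset.
records = [
    "hadoop scans every block",
    "vertica sorts and indexes columns",
    "hadoop loads data quickly",
]
index = build_index(records)
scan_hits = scan(records, "hadoop")
index_hits = lookup(index, "hadoop")
print(scan_hits, index_hits)
```

Both approaches return the same answer; the difference is where the cost lands. That mirrors the paper's results: loading is cheap when nothing is indexed, and querying is cheap once the index exists.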

Update: There's a good followup on the hadoop list by Jonathan Gray: "Hadoop is not suited for random access, joins, dealing with subsets of your data; ie. it is not a relational database! It's designed to distribute a full scan of a large dataset, placing tasks on the same nodes as the data its processing. The emphasis is on task scheduling, fault tolerance, and very large datasets, low-latency has not been a priority. There are no "indexes" to speak of, it's completely orthogonal to what it does, so of course there is an enormous disparity in cases where that makes sense. Yes, B-Tree indexes are a wonderful breakthrough in data technology". He suggested Pig, Hive and Cascading would be more suitable for comparison.

Monday, April 13, 2009

TDD for Ontologies

As it was said to me recently, very often something goes from being mysterious and magical to obvious (I'm looking at you, not-so-magical-anymore Grails wiring). So it goes with test driving ontologies (or at least it did for me). One of the standard methodologies for ontology development is creating competency questions for an ontology to successfully answer. This generally means gathering requirements, use cases, organising the data and ontology, testing them and maintenance. Doesn't that sound an awful lot like waterfall? It would appear obvious that, just like code, ontologies could be developed using test driven design (nay, development) instead. This should have the same benefits as agile code, like being more adaptable to change.

The closest paper I've come across that talks about this idea is "Unit tests for ontologies" (and there's also a bachelor thesis, Unit Testing for Ontologies). Unfortunately, it talks about adding tests during the maintenance of software. Luckily, it also talks about using them as a way to initially build the ontology, and surely that would be a way to drive out an ontology. Good stuff. I could imagine that certain good patterns of ontology design, like making them small, self consistent, fast, intuitive, etc., would all come naturally out of applying TDD techniques.

As an aside the unit testing paper also mentions the usefulness of autoepistemic operators (K and A). You can state that for a given instance there must exist a certain property, for example, a country must have a capital city.
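
As a rough sketch of what a test-first competency check could look like, the Python below stores instance data as plain (subject, predicate, object) tuples and asserts that every country has a capital. The vocabulary (Country, hasCapital), the data and the helper names are all hypothetical; a real setup would use an RDF/OWL toolkit and a reasoner rather than a Python set. The check is closed-world: it fails when a capital is simply not stated, which is the kind of constraint the K operator is meant to express.

```python
# Hypothetical instance data as (subject, predicate, object) triples.
TRIPLES = {
    ("Australia", "type", "Country"),
    ("Australia", "hasCapital", "Canberra"),
    ("Brazil", "type", "Country"),
    ("Brazil", "hasCapital", "Brasilia"),
}

def countries_without_capital(triples):
    """Return the countries for which no hasCapital triple is stated."""
    countries = {s for s, p, o in triples if p == "type" and o == "Country"}
    with_capital = {s for s, p, o in triples if p == "hasCapital"}
    return sorted(countries - with_capital)

def test_every_country_has_a_capital():
    # Competency question: does every known country have a capital?
    assert countries_without_capital(TRIPLES) == []

def test_missing_capital_is_reported():
    # Adding a country with no capital should make the check fail loudly.
    incomplete = TRIPLES | {("Atlantis", "type", "Country")}
    assert countries_without_capital(incomplete) == ["Atlantis"]

test_every_country_has_a_capital()
test_missing_capital_is_reported()
print("all ontology checks passed")
```

Written first, a test like this fails until the ontology (or its data) states the capital, which is the red/green loop TDD relies on.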