Thursday, August 28, 2008

Answering the Right Question

From a reddit comment on Ubiquity:
It's a natural-language interface to the entire web. And not a lame "find me the page at http://xxxxx", but "find the page, filter it of information, extract and cross-reference this data with multiple other data-sets and deliver the results to me in a format and medium of my choosing".
It's not really natural language any more than "Sam and Max" was natural language, but the point about filtering, finding, and integrating seems spot on to me - that's answering the right question.

It's a bit like Linus talking about how Subversion sucks (about halfway through is a good place to start) - they made branching cheap - who cares?

It's this whole Unix pipes for the web thing I guess.

Wednesday, August 27, 2008

Jesus Camp

This seems like quite an amazing depiction of American fundamentalists. It seems as if bigotry and fundamentalism are being turned into a competition with the Middle East. If you don't like being scared by American evangelicals, look away.

Finally a Reason for HTML Email

Ubiquity for Firefox. I've been a little hesitant recently to put up the stuff I've been finding interesting. But this has to go up - it's a really great example of user-driven mashups and what this Semantic Web thing should be all about.

Update: More information at "If You Want To Create a Mashup, Just Ask Your Browser. Mozilla Labs Launches Ubiquity." and the Mozilla Labs blog.

Friday, August 22, 2008

MapReduce Groovy

Cascading: "The processing API lets the developer quickly assemble complex distributed processes without having to "think" in MapReduce. And to efficiently schedule them based on their dependencies and other available meta-data."

Also, HBase now supports transactions (using Optimistic Concurrency Control).

Via “Beyond Relational Databases”.

The Trouble with Mini-Wheats

Australian Mini-Wheats are apparently the only Mini-Wheats with triticale, oats, barley, wheat, and rye. Why is that interesting? Triticale, or quadrotriticale ('cause we're in the future), was in "The Trouble with Tribbles". How many Star Trek references do you eat?

Thursday, August 21, 2008

Information Poo

Thursday, August 14, 2008

One Machine

Kevin Kelly predicting the next 5,000 days of the web. The specifications for the current web, the one machine, are interesting. It uses 5% of the electricity on Earth and holds and processes the equivalent of one human brain (1 HB). If technology continues growing at the current rate, 6 billion HB will be achieved somewhere between 2020 and 2040, outstripping the current human population (and hopefully the population then as well). He also mentions that every bit will be a web bit, that all bits go through the web, and you can see that happening with traditionally non-web data usage like word processing or mobile phone traffic (actually internet would probably be more accurate than web). This coincides with the convergence of the digital and atomic "worlds": rather than moving from one to the other (like in "The Matrix"), he suggests that they'll be integrated and that we are the extension of the web rather than the other way around.

The second half of the talk is about the Semantic Web (found from here, here and here). He describes his own definition of the Semantic Web by showing how it fits into what has gone before. The stages of networking include: connecting by site from one computer to another, connecting by page and linking between them, and the last two stages (I didn't see a clear distinction between the two), which are by data, idea or item.

His web site also mentions that computers are beating some of the best players at Go (the article is quite bad - teraflops as a measure of storage - this one is better).

Tuesday, August 12, 2008

Real Developers don't use Ruby

Hadoop: When grownups do open source. It's quite an amusing read - especially the part about the word count example on 9,000 blogs, the dig at Twitter, Starfish being practically useless (using MySQL and no Reduce phase) and the bit about understanding something being harder than writing a Ruby version of it.
"Twitter decided they would be cute and trendy. They wrote their code in Ruby: the official state language of the hipster-developer nation. Doug Cutting, on the other hand, decided he would get xxxx done, and wrote Hadoop in Java. Starling was hidden away in some corner and forgotten (it's hosted at RubyForge...). Hadoop lives prominently at the Apache Software Foundation. Starling is a re-hash of an existing Java Enterprise API called JMS that has several open source implementations. Hadoop is an implementation of Google's MapReduce, a system that publicly only existed on paper. Hadoop has the added benefit of actually working."
Ahh the joys of installing Visual Studio - enough time to install IntelliJ, run it up, and catch up on news.

Pigeon Programming

IntelliJ 8 continues with "all you need is a space bar and meta combinations to program" functionality: "Pressing Ctrl-Shift-Space twice allows you to find values of the expected types which are "two steps away" (can be retrieved through a chained method call)." It seems to be supporting what I'd consider a code smell.

Burnator

Ack, using ISOs on Windows XP fits in that time period where I think I'll do it often enough to remember what software to use but it ends up being long enough between uses that I don't. So the two I've used the most are InfraRecorder and IsoRecorder (a secret).

Monday, August 11, 2008

JRDF is very crap (part 2)

It didn't take as long as expected, which means it's probably wrong. For non-filtered queries it's three times faster and for certain FILTER queries (with equals) it's 47 times faster (from 284 to 6 seconds). At least it's now in the same order of magnitude as most tools and it's a tiny bit faster than some (although adding more features will probably slow it down again).

What changed:
* AttributeValuePair has been removed and replaced with maps (as discussed previously).
* Seeing as maps were used so much, hashCode and equals were optimized. As I've found before (I think), isAssignableFrom is slower than try/catch for equals (depending on your usage of course).
* Queries now go through unsorted, uncopied finds rather than the standard graph finds. I'd forgotten about how much effort had gone into allowing remove and automatic sorting on iterators.
* A very simple optimizer (it's really only simplifying the FILTER constraints at the moment) was added. Tree manipulation was painful - I resorted to operations that mutate in place.
* Better designed. It's a bit hard to quantify this except that what was there was truly awful - objects being created in constructors, with the constructing object passing itself in. The nice thing about IoC is it's quite easy to see when you're not using objects at the same architectural level.
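
To make the hashCode/equals point concrete, here's a minimal sketch of the try/catch style of equals. AttributeName is a made-up class for illustration (it is not JRDF's code), and whether this really beats an isAssignableFrom check will depend on the JVM and on how often the wrong type actually turns up.

```java
// Hypothetical value class used as a map key; not taken from JRDF.
final class AttributeName {
    private final String name;
    // Keys are immutable, so the hash can be computed once up front.
    private final int hash;

    AttributeName(String name) {
        this.name = name;
        this.hash = name.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (obj == null) {
            return false;
        }
        try {
            // Cast optimistically: the assumption is that callers nearly
            // always pass the same type, so the exception path is rare.
            return name.equals(((AttributeName) obj).name);
        } catch (ClassCastException e) {
            return false;
        }
    }

    @Override
    public int hashCode() {
        return hash;
    }
}
```

If the wrong type turns up often, the cost of throwing will probably swamp any saving, which is the "depending on your usage" part.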

Update: For download.

Thursday, August 07, 2008

JRDF is very crap (part 1)

I've been spending some time looking at the querying part of JRDF. And it's quite bad. How bad? Well, I've been profiling it and noticing that a lot of time was spent comparing attribute value pairs. An attribute in JRDF consists of a name (variable or position in a triple) and type (position in a triple or literal, URI Reference or blank node). Comparisons are done during most operations (like joins) and they are done on sorted attribute values. This is incredibly dumb. What's much better is to have a map of attributes to values. No sorting required and O(1) lookup - hurrah. The code around the comparisons also got a lot simpler and is obviously better. I think there's at least one other case of this at a different level and potentially room for about an order of magnitude speed up over the current release. Test queries are already 2-3 times faster.
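
As a rough illustration of the map idea, here's a sketch of checking whether two tuples can be joined when each tuple is a map of attribute to value. The class and method names are invented for this post and attributes are reduced to plain strings (JRDF's attributes also carry a type), so treat it as a sketch of the technique rather than the actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names are illustrative and not JRDF's API.
final class TupleJoinSketch {

    // Two tuples are join-compatible if they agree on every attribute they share.
    static boolean compatible(Map<String, String> left, Map<String, String> right) {
        // Walk the smaller tuple and probe the larger one; each probe is a
        // hash lookup (O(1) on average), so no sorting is needed.
        Map<String, String> small = left.size() <= right.size() ? left : right;
        Map<String, String> large = small == left ? right : left;
        for (Map.Entry<String, String> entry : small.entrySet()) {
            String other = large.get(entry.getKey());
            if (other != null && !other.equals(entry.getValue())) {
                return false;
            }
        }
        return true;
    }

    // Merge two compatible tuples into one result tuple.
    static Map<String, String> join(Map<String, String> left, Map<String, String> right) {
        Map<String, String> result = new HashMap<>(left);
        result.putAll(right);
        return result;
    }
}
```

With sorted attribute value pairs the same check means keeping both sides ordered and walking them in step; with maps it's just a lookup per shared attribute.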

The main reason for this, though, is that currently the FILTER in JRDF runs about 10 times more slowly than a query using triple matching (this isn't a complexity measurement - it's based on a rather small set of triples). So a query with "?a <some:value> 'foo'" is much faster than "?a <some:value> ?b . FILTER(str(?b) = 'foo')". The queries aren't the same but their performance shouldn't differ that much. However, in order to get to a stage of improving FILTER's performance the code has to be refactored - hopefully into something simpler and faster.

FILTER is nicely functional - it seems a shame to implement it in Java - it's eye-poppingly bad at the moment. I was thinking functional Java but instead of taking the gateway drug I was thinking of just going to the hard stuff. FILTER is operated by creating different operations within a relation - which allows you to put ANDed FILTERs vertically across a relation (columns) and ORed FILTERs horizontally (rows). I don't know if anyone else implements it this way - it might be another bad idea over time.

Saturday, August 02, 2008

Mice Spiders

Interview with Simon Pegg on Spaced. The UK DVDs (and ABC2) are still the only source of the show for Australians until October.