Friday, December 12, 2008

Metadata for Fork Sake

Once in a while, I get excited and frustrated about the state of metadata - particularly in file systems. Previously, I've written about rumors of OS X doing metadata better and the wonders of BeOS. A resource fork is something that Macs have had for ages, although it has appeared in other systems before and since:
The concept of a resource manager for graphics objects, to save memory, originated in the OOZE package on the Alto in Smalltalk-76...Although the Windows NT NTFS can support forks (and so can be a file server for Mac files), the native feature providing that support, called an alternate data stream, has never been used extensively...Early versions of the BeOS implemented a database within the filesystem...Though not strictly a resource fork, AmigaOS stores meta data in files known as .info files...NeXT operating systems NeXTSTEP and OPENSTEP, and its successor, Mac OS X, and other systems like RISC OS implemented another solution. Under these systems the resources are left in an original format, for instance, pictures are included as complete TIFF files instead of being encoded into some sort of container.


This links to the Grand Unified Model 1 and Grand Unified Model 2 which are also good for a few quotes:
Almost every piece of data in the Macintosh ended up being touched by the Grand Unified Model. Even transient data, data being cut and pasted within and between applications, did not escape. The Scrap Manager labeled each piece of data on the clipboard with a resource type. In another Mac innovation, multiple pieces of data, each of a different type, could be stored on the clipboard simultaneously, so that applications could have a choice of representation of the same data (for example, storing both plain and styled text). And since this data could easily be stored on disk in a resource file, we were able to provide cutting and pasting of relatively large chunks of data by writing a temporary file called the Clipboard.

Since resource objects were typed, indicating their internal data format, and had ID's or names, it seemed that files should be able to be typed in the same way. There should be no difference between the formats of an independent TEXT file, stored as a standalone file, and a TEXT resource, stored with other objects in a resource file. So I decided we should give files the same four-byte type as resources, known as the type code. Of course, the user should not have to know anything about the file's type; that was the computer's job. So Larry Kenyon made space in the directory entry for each file for the type code, and the Mac would maintain the name as a completely independent piece of information.

Tuesday, December 09, 2008

Restlet Talk

I spoke last night at the Java Users Group about Restlet. It was a basic introduction to both Restlet and linking data across web sites. I wasn't very happy with the example - it was basically stolen from a Rails introduction. At least I could answer the question about why you would allow your data to be searched (to sell adverts on your recipe web site). I think it went down okay: most Java developers are used to large frameworks and complicated APIs to do what Restlet does (so it's impressive), the Rails developers knew some of the concepts already, and while most are wary of RDF, SPARQL, OWL and the Semantic Web stack, it was a fairly incremental addition to achieve something reasonably powerful.
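
To give an idea of how small it is, a minimal standalone server looks something like this (a sketch assuming Restlet 1.x, not the example from the talk):

import org.restlet.Component;
import org.restlet.Restlet;
import org.restlet.data.MediaType;
import org.restlet.data.Protocol;
import org.restlet.data.Request;
import org.restlet.data.Response;

public class HelloRestlet {
    public static void main(String[] args) throws Exception {
        Component component = new Component();
        // No servlet container needed - Restlet runs its own HTTP server.
        component.getServers().add(Protocol.HTTP, 8182);
        component.getDefaultHost().attach("/hello", new Restlet() {
            @Override
            public void handle(Request request, Response response) {
                response.setEntity("Hello from Restlet", MediaType.TEXT_PLAIN);
            }
        });
        component.start();
    }
}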

Thursday, December 04, 2008

Getting Groovy with JRDF

In an effort to speed up and improve the test coverage in JRDF I've started writing some of the tests in Groovy. It's been a good experience so far - so much so that I'm probably not going back to writing tests in Java in the future.

One of the things I wanted to try was an RdfBuilder, which is similar to Groovy's NodeBuilder.

There are a couple of things that make it a bit tricky. When parsing or debugging builders I haven't yet found a way to discover the methods/properties available, even using MetaClass. And of course, when the magic goes wrong it's a bit harder to debug Groovy versus Java.

It certainly smartens up the creation of triples, for example (bits from the NTriples test case):
def rdf = new RdfBuilder(graph)
rdf.with {
    namespace("eg", "http://example.org/")
    namespace("rdfs", "http://www.w3.org/2000/01/rdf-schema#")
    "eg:resource1" "eg:property":"eg:resource2"
    "_:anon" "eg:property":"eg:resource2"
    "eg:resource1" "eg:property":"_:anon"
    (3..6).each {
        "eg:resource$it" "eg:property":"eg:resource2"
    }
    "eg:resource7" "eg:property":'"simple literal"'
    "eg:resource17" ("eg:property":['"\\u20AC"',
        '"\\uD800\\uDC00"', '"\\uD84C\\uDFB4"', '"\\uDBFF\\uDFFF"'])
    "eg:resource24" "eg:property":'"<a></a>"^^rdfs:XMLLiteral'
    "eg:resource31" "eg:property": '"chat"@en'
}
The first two lines define the two namespaces used. The third line shows the general use of RDF and Groovy. It works out well: an RDF predicate and object map to an attribute and value in Groovy. The next two lines show how you refer to the same blank node across two statements. And the following lines show using ranges and creating different types of literals. The eg:resource17 line creates 4 triples with the same subject and predicate but with different objects.

Using the builder results in a file that's smaller than the test case file. You could remove some duplication by creating a method that takes in a number and the object and generates "eg:resource$number" "eg:property" "$object" but doing that may actually make it harder to read.

If you stick to only using URIs you can do things like:
rdf.with {
    urn.foo6 {
        urn.bar {
            urn.baz1
            urn.baz2
        }
    }
}
Which produces two triples: "urn:foo6, urn:bar, urn:baz1" and "urn:foo6, urn:bar, urn:baz2".

I expect that JRDF will only be more Groovy friendly in the future.

Thursday, November 27, 2008

Turing in Life

Reading the Wikipedia entry on Conway's Game of Life answers the question of why it was developed: "Conway was interested in a problem presented in the 1940s by renowned mathematician John von Neumann, who tried to find a hypothetical machine that could build copies of itself and succeeded when he found a mathematical model for such a machine with very complicated rules on a rectangular grid." It's interesting that von Neumann's idea of self-replication actually predates the discovery of DNA's structure by a few years.

So I asked Google the question and someone has implemented a Turing machine inside Conway's Game of Life; way back in 2000. A book called "Collision Based Computing" and an applet called LogiCell (which uses Conway's Game of Life to do simple calculations) are available here.

Wednesday, November 12, 2008

Indexing for Efficient SPARQL

Another interesting way of indexing triples: A role-free approach to indexing large RDF data sets in secondary memory for efficient SPARQL evaluation "We propose a simple Three-way Triple Tree (TripleT) secondary-memory indexing technique to facilitate efficient SPARQL query evaluation on such data sets. The novelty of TripleT is that (1) the index is built over the atoms occurring in the data set, rather than at a coarser granularity, such as whole triples occurring in the data set; and (2) the atoms are indexed regardless of the roles (i.e., subjects, predicates, or objects) they play in the triples of the data set. We show through extensive empirical evaluation that TripleT exhibits multiple orders of magnitude improvement over the state of the art on RDF indexing, in terms of both storage and query processing costs."

While looking around at arXiv I did a quick search and found two more interesting papers that seem related to a previous discussion on how the Semantic Web needs its own programming language - or, I would say, at least a way to process the web of data - both by Marko A. Rodriguez: "The RDF Virtual Machine" and "A Distributed Process Infrastructure for a Distributed Data Structure".

Tuesday, November 11, 2008

While you were away...

Now that I'm looking around for jobs, I came across a presentation on some of the work the easyDoc project did at Suncorp, "Technical Lessons Learned Turning the Agile Dials to Eleven". It includes automating getter/setter testing, Hibernate, and immutability. It's good to see the sophistication continued to increase after I left, reaching quite a high level (like automatic triangulation and doing molecule level testing).

Thursday, October 30, 2008

Cats Stealing Dogs' Jobs

Who knew that people from Japan read my blog? Tama the station master (actually, in true Japanese style, "super station master") is doing almost exactly what I mentioned. Except they added merchandising - those damn cats.

A Billion Triples in your Pocket

At the Billion Triples Challenge this afternoon Cathrin Weiss showed off i-MoCo. It's a demonstration not only of a Semantic Web application on an iPhone using a stupid number of triples, but also of their indexing technique, Hexastore, which keeps all 6 indexes of RDF's subject, predicate and object in order to improve querying (they are ordered, which allows you to do merge joins). Actually, this made me think that the next steps in RDF triple stores will be indexes that are optimised for SPARQL operations and OWL inferences. Indexes for transitive closure perhaps? The data is regular and the storage is available to index triples in ways that improve querying performance.
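
To make the ordered-index idea concrete, here's a toy sketch of my own (nothing like the real Hexastore code): one sorted index per permutation of subject, predicate and object, so any triple pattern becomes a range scan over already-sorted keys.

import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class SixIndexes {
    private static final String[] ORDERS = {"spo", "sop", "pso", "pos", "osp", "ops"};
    private final Map<String, TreeSet<String>> indexes = new HashMap<String, TreeSet<String>>();

    public SixIndexes() {
        for (String order : ORDERS) {
            indexes.put(order, new TreeSet<String>());
        }
    }

    public void add(String s, String p, String o) {
        Map<Character, String> part = new HashMap<Character, String>();
        part.put('s', s);
        part.put('p', p);
        part.put('o', o);
        for (String order : ORDERS) {
            // For "pos" the key becomes predicate|object|subject, and so on.
            StringBuilder key = new StringBuilder();
            for (char role : order.toCharArray()) {
                key.append(part.get(role)).append('|');
            }
            indexes.get(order).add(key.toString());
        }
    }

    // Matches for a predicate/object pattern come back already sorted,
    // which is what makes merge joins between patterns cheap.
    public SortedSet<String> matchPredicateObject(String p, String o) {
        String prefix = p + "|" + o + "|";
        return indexes.get("pos").subSet(prefix, prefix + Character.MAX_VALUE);
    }
}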

That wasn't the only impressive demo today. For me it's a toss-up between another iPhone SemWeb demo, DBpedia Mobile, and SemaPlorer. DBpedia Mobile stood out because it was the only one that allowed you to write data to the Semantic Web rather than just read the carefully prepared triples. For a similar reason I thought SemaPlorer was good because they tried to do more by making it more generic, like integrating Flickr. But they were all excellent, all of them showing what you get with a billion or more triples and inferencing.

That, combined with the guys at Freebase making all of their data available as RDF, made it a big day for the Semantic Web.

Update: I noticed that in John Giannandrea's talk, when he mentioned the three aspects of Freebase, he went from the bottom up - probably reading too much into it.

Update: Also caught an interview with John Francis the guy who stopped talking, started walking, stopped taking mechanical transport and tried to change the world.

Update: ISWC 2008 awards have been published.

Wednesday, October 22, 2008

Time for REST

The best part of the latest release of JRDF is not code that we've written but Restlet. Restlet provides an excellent abstraction for building RESTful web services and allows deployment without a container (the distributed and local servers are both only about 6MB in size).

This has allowed us to quickly develop a web service that answers SPARQL queries from one or many machines. We're using it to query the results from MapReduce tasks in our Hadoop cluster but it could probably be used as a general way to query other SPARQL services.

It's still early days and there's a lot that still needs to be added (and re-added) such as limiting result sizes, being able to choose which format to return and better JSON support. A lot of it has been pushed out in time to show a front-end at SSWS.

It would be good to have a Javascript client that submits the SPARQL queries and handles the ordering and such rather than having the distributed query server hang onto all results until returning.

Download it here.

President of Real America

Yes, Virginians, you may not be in the real part: "Jon Stewart Clarifies Palin Remarks, Expands To 'F%ck All Y'All'". America seems to have gone completely mental. The second video covers the current lunacy, including the pro- and anti-American members of Congress and the endorsement of Colin Powell - a black militant backing Obama, what a surprise.

Tuesday, October 21, 2008

Case Classes

Reading "Is Scala Not “Functional Enough”? linked to a post I've read before "Are Scala's case classes a failed experiment?" but hadn't blogged about it. In Java I simply avoid using switching based on type (for example, if foo.getClass() else if bar.getClass()) I really like case classes in Scala especially for language processing.

Friday, October 10, 2008

5TB Processing

Scaling Hadoop to 4000 nodes at Yahoo! "The 4000-node cluster throughput was 7 times better than 500’s for writes and 3.6 times better for reads even though the bigger cluster carried more (4 v/s 2 tasks) per node load than the smaller one." They did some performance monitoring of the cluster and sorted 6TB of data in 37 minutes.

Update: Ex-Google and Yahoo employees create start-up (called Cloudera) and Hadoop Primer by Sun (PDF).

Update: Hadoop users group meeting with IBM talking about different join techniques and Cloudbase project which is an SQL abstraction over log files (tutorial PDF).

Update: Microsoft's first project is Hadoop.

Wednesday, October 08, 2008

Monks Practicing Evolution

Before the printing press allowed exact copies of texts, such as Bibles and other works, scribes would copy manuscripts by hand. These copies were imperfect and these mistakes would then be replicated as other scribes made further copies. The implication is that the church was practicing evolution before science had even discovered it. Darwin could've just popped down to his local monastery or church instead of cruising around the world.

Scientists have used phylogenetic software to look at these texts in order to discover the original document sources (creating a book of life if you will). Like evolution in the natural world, the mutations aren't random and you get errors such as recombination, lateral transfer, deletions, and even convergent evolution. There are some interesting relationships that can be determined, such as certain areas of text being more likely to have mistakes in them than others.

What's cool about a theory, such as evolution, is that it can be applied to many different areas such as natural languages, behavioural patterns, archaeological artifacts, and written works such as chain letters and medieval manuscripts.

More information, Manuscript evolution and Phylogenetics of artificial manuscripts.

Thursday, October 02, 2008

Coffee Inspired


  • The Global graphs in JRDF were inspired by the work done in MSG (minimum self-contained graphs, published in the RDFSync paper) and RDF Molecules. The former links to an implementation of DBin (P2P Semantic Web client) and there's also GVS (Graph Versioning System).
  • It's a trap (cloud computing). It's a fairly typical Stallman statement - not wrong, but not aware of the compromises people make. It is obvious that one of the reasons vendors are excited about cloud computing is that it's a chance for them to try and own your data (or at least make switching too hard). But you do have to do more than put data in the cloud - you have to have executable services there too. There is open source cloud computing infrastructure that you can run up on your own servers, like Hadoop or CouchDB. And it's not just back to mainframes and renting CPU time out by the hour; it really is different to what has gone before.

  • Live the Cloud Life lists the cloud computing applications in categories such as email, documents, data, music, photo editing and storing and browser synchronisation. Some categories are missing like RSS reading and calendaring.

  • Apple drops NDA - the outrage worked.

  • The main eResearch 2008 conference is over and some of the papers are available.

  • Speaking of which, all development, and that definitely includes software, in whatever organisation (universities, governments, banks, etc) should have failure as an option. One sign of a truly stuffed culture is to never have a project fail.

  • Muradora, a refactoring of Fedora to allow pluggable authentication and enable metadata editing.

  • Acer Aspire One links: Dual Monitor Support,
    Installing Firefox 3 on Acer Aspire One Linux and
    Updated repositories. It's a shame that OpenOffice doesn't support presenters view yet.

Tuesday, September 30, 2008

More Data

Neurocommons has released its integrated RDF datasets. The release is composed of different modules or bundles including MeSH, Medline, OBO and others.

Wednesday, September 24, 2008

Who Holds the Power?

On Monday I went and saw John Wilbanks talk on "Publishing in Today's Environment" (I can't seem to find a decent URL but a good overview of some of the topics discussed is "The Open Access Interviews: John Wilbanks"). I didn't take many notes but a few things have stuck with me:
  • Libraries providing a role as repositories of data (both public and private).

  • The idea of "free as in puppy" in relation to digital curation - a great metaphor. When someone gives you a puppy it is initially free but the upkeep of it is anything but.

  • The Queensland Government has its own Creative Commons initiative and the Australian Government is following suit.

  • There's a power struggle occurring between researchers and publishers to make data, papers and the like freely available. The power used to be on the side of publishers but as the producers band together (by country, university, faculty and so on) the power is going back to them.

  • Creative Commons are continuing to fight, in cunning ways, to get back to decent copyright law.

The kind of behavior that publishers have exhibited appears to be on the way out, and things seem to be going the other way: databases interoperate with each other, papers can link to the original data, and tools, data and papers are all part of an integrated experience. The linking and integration of papers and other artifacts seems to be a one-way process; it's hard to imagine a developer, scientist, arts graduate or anyone else spending huge amounts of money and time finally wresting control from one bunch of people only to tie it to another.

This leads to the T-Mobile G1 announcement. The announcement was very underwhelming - guys speaking for the first 10-15 minutes saying what a wonderful job they had done and the guy talking about how good an experience the phone was at playing Pacman. The more interesting part is the open platform. Much like the publishing area, the ability to have open access will be a massive differentiator. It reminded me a bit of the discussion of how AOL, Prodigy and the other closed networks quickly died when faced with the open web:
By contrast, the proprietary networks of CompuServe, AOL, Prodigy, and Minitel were out beating the bushes for content, arranging to provide it through the straightforward economic model of being paid by people who would spend connect time browsing it. If anything, we would expect the proprietary networks to offer more, and for a while they did. But they also had a natural desire to act as gatekeepers—to validate anything appearing on their network, to cut individual deals for revenue sharing with their content providers, and to keep their customers from affecting the network’s technology. These tendencies meant that their rates of growth and differentiation were slow.

The closed versus open network is not quite the whole picture with respect to the iPhone versus Android. While there is competition between iTunes and Amazon, Street View and normal Google Maps, and there will be other content battles, I don't think that's the source of the real innovation. I think it's probably the technical innovation that the online providers couldn't match that was decisive and will be for phones:
The software driving these communities was stagnant: subscribers who were both interested in the communities’ content and technically minded had few outlets through which to contribute technical improvements to the way the communities were built. Instead, any improvements were orchestrated centrally. As the initial offerings of the proprietary networks plateaued, the Internet saw developments in technology that in turn led to developments in content and ultimately in social and economic interaction: the Web and Web sites, online shopping, peer-to-peer networking, wikis, and blogs.

Is developing on an iPhone going to lead to more technical innovation over Android? Does the ability to have open source code on Android beat the Apple NDAs?

Apple will probably recognize that it's the developers that will ultimately have the power, but like publishing it depends on the actions of both parties. At the moment both of these new phone platforms are a little limited - innovating on the iPhone's built-in applications like iTunes and Mail is out, but it doesn't seem terribly better for Android (being DRM free is a good start though). If history is any guide, it would seem to favor open development over closed.

Update: Similar more succinct explanation.

Update: No Pragmatic book due to NDA - the outrage!

Friday, September 19, 2008

Tuesday, September 16, 2008

Make Every Web Site a Semantic Web Site

Back in March this was news but I completely missed it. Dapper has a rather nice way of turning web sites into data - XML, JSON, etc. - but it also includes a semantify service. It includes using existing namespaces (FOAF, GSS, Creative Commons, Media RSS and Dublin Core) supported by Yahoo's search engine.

This is covered in more depth in Semantify Hacks - Creating your own RDF schema using Dapper:
So now, building a Dapp means you also built your own RDF compatible schema, that you can use wherever by just pointing to the webservice:

http://www.dapper.net/websiteServices/dapp-scheme.php?dappName=MYDAPP


The given example is MSN's search engine which you can see in all its RDF/XML glory.

ReadWriteWeb has step-by-step instructions.

Monday, September 08, 2008

Into Thick Air

Chrome, JavaScript, and Flash: Two (Mostly) Opposing Views - the second comment closely follows a recent post on CounterNotions, Google Chrome: Bad news for Adobe.
But a full-fledged browser. One that behaves, however, as a platform to host applications best tied to cloud computing with built-in local persistence for offline computing. Sure, in its current form Chrome can’t compete with Silverlight or Flex/AIR for what Adobe calls “expressiveness,” meme-speak for rich graphics, animations, integrated video and other visual UI goodies.

Chrome may shut it off for good. It’s possible that various open source Chrome technologies could melt into Safari and Firefox. But –– whether as a stand-alone product or a progenitor of fast, powerful and expressive browsers –– Chrome signals to anybody but the diehard Microsoft constituents that the browser itself, not a proprietary plug-in or a separate runtime, is the future of RIAs. With its huge ecosystem, Microsoft can live with that. At least until its enterprise monopoly seriously erodes. But Adobe cannot.

In a world where the online pie is divided among the .NET army of Microsoft, the browser-gang of Apple+Mozilla+Google, and the lone Adobe, it’s not difficult to predict whose share will shrink into insignificance. If the exclusion of Flash from the iPhone wasn’t a wake-up call for Adobe, Chrome should certainly be one.


Most of the commentary is focused on the browser-within-an-operating-system angle. Although one of the easter eggs is a familiar screensaver. I think it's more helpful to concentrate on the fact that these browsers are getting rich enough to remove the applications embedded within browsers. There is already a lot of functionality developed or being developed, such as SVG and storage (part of HTML 5). Chrome ships with Gears and WebKit, though; to see how HTML5 and Gears relate, see Aaron Boodman's talk on implementing HTML5 in Gears. They create two namespaces, one for implementing the standard APIs and one for non-standard APIs - it seems like there is quite a solid development process behind it. Is there really a lot of reason left to support these proprietary applications within applications?

More MapReduce Groovy

A very good post on Cascading (covered previously), GOODBYE MAPREDUCE, HELLO CASCADING

Cascading’s logical model abstracts away MapReduce into a convenient tuples, pipes, and taps model. Data is represented as “Tuples”, a named list of objects. For example, I can have a tuple (”url”, “stats”), where “url” is a Hadoop “Text” object and “stats” is my own “UrlStats” complex object, containing methods for getting “numberOfHits” and “averageTimeSpent”. Tuples are kept together in “streams”, and all tuples in a stream have the exact same fields.

An operation on a stream of tuples is called a “Pipe”. There are a few kinds of pipes, each encompassing a category of transformations on a tuple stream. For instance, the “Each” pipe will apply a custom function to each individual tuple. The “GroupBy” pipe will group tuples together by a set of fields, and the “Every” pipe will apply an “aggregator function” to all tuples in a group at once.

One of the most powerful features of Cascading is the ability to fork and merge pipes together.

Once you have constructed your operations into a “pipe assembly”, you then tell Cascading how to retrieve and persist the data using an abstraction called “Tap”. “Taps” know how to convert stored data into Tuples and vice versa, and have complete control over how and where the data is stored. Cascading has a lot of built-in taps - using SequenceFiles and Text formats via HDFS are two examples. If you want to store data in your own format, you can define your own Tap. We have done this here at Rapleaf and it has worked seamlessly.
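
The canonical word count comes out something like this (a sketch along the lines of the Cascading documentation's example; the paths and field names are made up):

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCount {
    public static void main(String[] args) {
        // Taps know how to read and write tuples; here both ends are HDFS text files.
        Tap source = new Hfs(new TextLine(new Fields("line")), "input/docs");
        Tap sink = new Hfs(new TextLine(), "output/wordcount");

        // Each applies a function per tuple; GroupBy/Every aggregate per group.
        Pipe assembly = new Pipe("wordcount");
        assembly = new Each(assembly, new Fields("line"),
                new RegexGenerator(new Fields("word"), "\\S+"));
        assembly = new GroupBy(assembly, new Fields("word"));
        assembly = new Every(assembly, new Count(new Fields("count")));

        // The planner turns the pipe assembly into chained MapReduce jobs.
        Flow flow = new FlowConnector().connect(source, sink, assembly);
        flow.complete();
    }
}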

Senior Re-Searcher

Someone who continues to search (re-search) Google until he finds what is required. Senior in that he remembers a time before Google.

Wednesday, September 03, 2008

Stats for Nerds

I read "Is google chrome under / partially reporting RAM usage ?" which seems to be the typical problem with Microsoft's task manager. A better way to check out Chrome vs others is to hit Shift-Escape, click on "Stats for Nerds" and it has a comparison there.

Tuesday, September 02, 2008

Antimetabole

Bill Clinton had an interesting one recently in his Democratic convention speech: "People the world over have always been more impressed by the power of our example than by the example of our power."

Monday, September 01, 2008

Thesis and Antithesis

David Anderson on Agile. He introduces his talk by discussing some very broad ideas, including how agility has been applied to other areas. He talks about how the agile manifesto seems to have become more about belief and superstition than a scientific model. It's also rarely clicked through to see the principles behind the manifesto.

The agile community has found it is better to develop in a failure tolerant environment. This was a reaction to the focus on more and more accurate estimates, which led to analysis patterns (or antipatterns). They are antipatterns because there was never enough detail in the analysis to provide accurate enough estimates and to deal with the unexpected.

Through his talk he listed a bunch of thought bubbles on software development:
* Value is providing functionality fastest.
* Knowledge work is perishable.
* Perfect is the enemy of good enough.
* Develop a high trust, high social capital.
* Have a highly collaborative culture.
* Reflect and adjust.
* Sustainable pace.
* Craftsmanship.
* Value is contextual, context is temporal.
* Waste over scale.

Traditional industries have relied on bigger batch sizes to provide better economies of scale. With software development the transaction costs cause large batch sizes to work against you. Smaller batches allow you to release more business value more quickly and reduce waste (waste over scale).

He introduces kanban, which has ideas such as: pull, flow, and regulating work in progress. It does not use time-boxed iterations and has little or no planning or estimation (at least compared to a traditional agile approach). It still has constant improvement and delivery. It drops the idea of a generalist approach to labor, which is seen by some as waterfall in disguise. The reason this comes about is that at an enterprise scale it's not feasible to hire a lot of experts who are excellent generalists - the labor pool does not exist. This leads to a tension with typical lean principles of reducing waste by having generalists.

He sees two main changes coming in the future: software factories (software product lines - which includes DSLs) and cloud services (deploying web services in the cloud). Architecture and modeling will come back into fashion as a value chain allows incentives for delivering common behavior. He's talking about a 100-fold improvement in productivity through using these ideas.

Finally, he talks about how CMM/CMMI is actually quite a good indicator of the ability of software projects, and the industry as a whole, to succeed. He mentions a report on how agile and CMMI can work together, the idea being that not only is organizational maturity a good idea, it actually means that agile methods can be implemented better (it's called "CMMI or Agile: Why Not Embrace Both?!" - I only found a few references, for example Agile+CMMI Panel @ SEPG).

Via.

Still have Richard P. Gabriel's talk to go through too.

Thursday, August 28, 2008

Answering the Right Question

From a reddit comment on Ubiquity:
It's a natural-language interface to the entire web. And not a lame "find me the page at http://xxxxx", but "find the page, filter it of information, extract and cross-reference this data with multiple other data-sets and deliver the results to me in a format and medium of my choosing".
It's not really natural language any more than "Sam and Max" was natural language, but the point of the filtering, finding, and integrating seems spot on to me - that's answering the right question.

It's a bit like Linus talking about how Subversion sucks (about half way through is a good place to start) - they made branching cheap - who cares.

It's this whole Unix pipes for the web thing I guess.

Wednesday, August 27, 2008

Jesus Camp

This seems like quite an amazing depiction of American fundamentalists. It seems as if bigotry and fundamentalism is being turned into a competition with the Middle East. If you don't like to be scared about American evangelicals look away.

Finally a Reason for HTML Email

Ubiquity for Firefox. I've been a little hesitant recently to put up the stuff I've been finding interesting. But this has to go up - it's a really great example of user driven mashups and what this Semantic Web thing should be all about.

Update: More information at "If You Want To Create a Mashup, Just Ask Your Browser. Mozilla Labs Launches Ubiquity." and the Mozilla Labs blog.

Friday, August 22, 2008

MapReduce Groovy

Cascading "The processing API lets the developer quickly assemble complex distributed processes without having to "think" in MapReduce. And to efficiently schedule them based on their dependencies and other available meta-data."

Also, HBase now supports transactions (using Optimistic Concurrency Control).

Via “Beyond Relational Databases”.

The Trouble with Mini-Wheats

Australian Mini-Wheats are apparently the only Mini-Wheats with triticale, oats, barley, wheat, and rye. Why is that interesting? Triticale or quadrotriticale (cause we're in the future) was in "The Trouble with Tribbles". How many Star Trek references do you eat?

Thursday, August 21, 2008

Information Poo

Thursday, August 14, 2008

One Machine

Kevin Kelly predicting the next 5,000 days of the web. The specifications for the current web, the one machine, are interesting. It uses 5% of the electricity on Earth and holds and processes the equivalent of one human brain (1 HB). If technology continues growing at the current rate, 6 billion HB will be achieved somewhere between 2020 and 2040, outstripping the current human population (and hopefully the population then as well). He also mentions that every bit will be a web bit, that all bits go through the web, and you see that happening with traditionally non-web data usage like word processing or mobile phone traffic (actually internet would probably be more accurate). This coincides with the convergence of the digital and atomic "worlds": rather than moving from one to the other (like in "The Matrix") he suggests that it'll be integrated and that we are the extension of the web rather than the other way around.

The second half of the talk is about the Semantic Web (found from here, here and here). He describes his own definition of the Semantic Web by showing how it fits into what has gone before. The stages of networking include: connecting by site from one computer to another, connecting by page and linking between them, and the last two stages (I didn't see a clear distinction between the two), which are by data, idea or item.

His web site also mentions that computers are beating some of the best players at Go (the article is quite bad - teraflops as a measure of storage - this one is better).

Tuesday, August 12, 2008

Real Developers don't use Ruby

Hadoop: When grownups do open source. It's quite an amusing read - especially the part about the word count example on 9,000 blogs, the dig at Twitter, Starfish being practically useless (using MySQL and no Reduce phase) and the bit about understanding something being harder than writing a Ruby version of it.
"Twitter decided they would be cute and trendy. They wrote their code in Ruby: the official state language of the hipster-developer nation. Doug Cutting, on the other hand, decided he would get xxxx done, and wrote Hadoop in Java. Starling was hidden away in some corner and forgotten (it's hosted at RubyForge...). Hadoop lives prominently at the Apache Software Foundation. Starling is a re-hash of an existing Java Enterprise API called JMS that has several open source implementations. Hadoop is an implementation of Google's MapReduce, a system that publicly only existed on paper. Hadoop has the added benefit of actually working."
Ahh the joys of installing Visual Studio - enough time to install IntelliJ, run it up, and catch up on news.

Pigeon Programming

Intellij 8 continues with "all you need is a space bar and meta combinations to program" functionality: "Pressing Ctrl-Shift-Space twice allows you to find values of the expected types which are "two steps away" (can be retrieved through a chained method call)." It seems to be supporting what I'd consider a code smell.

Burnator

Ack, using ISOs on Windows XP fits in that time period where I think I'll do it often enough to remember what software to use, but it ends up being long enough that I don't. So the two I've used the most are InfraRecorder and IsoRecorder (a secret).

Monday, August 11, 2008

JRDF is very crap (part 2)

It didn't take as long as expected, which means it's probably wrong. For non-filtered queries it's three times faster and for certain FILTER queries (with equals) it's 47 times faster (from 284 to 6 seconds). At least it's now in the same order of magnitude as most tools and it's a tiny bit faster than some (although adding more features will probably slow it down again).

What changed:
* AttributeValuePair has been removed and replaced with maps (as discussed previously).
* Since maps were used so much, hashCode and equals were optimized. As I've found before (I think), isAssignableFrom is slower than try/catch for equals (depending on your usage of course) - a rough sketch follows this list.
* Queries go through unsorted and uncopied rather than standard graph finds. I'd forgotten about how much effort had gone into allowing remove and automatic sorting on iterators.
* A very simple optimizer (it's really only simplifying the FILTER constraints at the moment) was added. Tree manipulation was painful - I resorted to mutating in place operations.
* Better designed. It's a bit hard to qualify this except what was there was truly awful - objects being created in constructors and passing itself in. The nice thing about IoC is it's quite easy to see when you're not using objects at the same architectural level.
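
For what it's worth, the try/catch equals trick looks roughly like this (a made-up Attribute class, not the actual JRDF code):

public final class Attribute {
    private final String name;
    private final String type;

    public Attribute(String name, String type) {
        this.name = name;
        this.type = type;
    }

    public int hashCode() {
        return 31 * name.hashCode() + type.hashCode();
    }

    // Let the cast fail for foreign or null objects rather than paying for
    // an isAssignableFrom/instanceof check on every successful comparison.
    public boolean equals(Object obj) {
        try {
            Attribute other = (Attribute) obj;
            return name.equals(other.name) && type.equals(other.type);
        } catch (ClassCastException e) {
            return false;
        } catch (NullPointerException e) {
            return false;
        }
    }
}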

Update: For download.

Thursday, August 07, 2008

JRDF is very crap (part 1)

I've been spending some time looking at the querying part of JRDF. And it's quite bad. How bad? Well I've been profiling it and noticing that a lot of time was spent comparing attribute value pairs. An attribute in JRDF consists of a name (variable or position in a triple) and type (position in a triple or literal, URI Reference or blank node). Comparisons are done during most operations (like joins) and they are done on sorted attribute values. This is incredibly dumb. What's much better is to have a map of attributes to values. No sorting required and O(1) lookup - hurrah. The code around the comparisons also got a lot simpler and is obviously better. I think there's at least one other case of this at a different level and potentially room for about an order of magnitude speed-up over the current release. Test queries are already 2-3 times faster.

The main reason for this though, is that currently the FILTER in JRDF runs about 10 times more slowly than a query using triple matching (this isn't a complexity measurement - it's based on a rather small set of triples). So a query with "?a <some:value> ?b . FILTER(str(?b) = 'foo')" is much slower than "?a <some:value> 'foo'". The queries aren't the same but the performance shouldn't be that much slower. However, in order to get to a stage of improving FILTER's performance the code has to be refactored - hopefully simpler and faster.

FILTER is nicely functional - it seems a shame to implement it in Java - it's eye-poppingly bad at the moment. I was thinking Functional Java, but instead of taking the gateway drug I was thinking of just going to the hard stuff. FILTER is operated by creating different operations within a relation - which allows you to put ANDed FILTERs vertically across a relation (columns) and ORed FILTERs horizontally (rows). I don't know if anyone else implements it this way - it might be another bad idea over time.

Saturday, August 02, 2008

Mice Spiders

Interview with Simon Pegg on Spaced. The UK DVDs are still the only source of the show (and ABC2) for Australians until October.

Tuesday, July 29, 2008

YADS and RDF Molecules

BNodes Out! discusses how any usefully scalable system doesn't use blank nodes. What is interesting is the comment on YADS (Yet Another DOI Service). The best reference is Tony's presentation although it is mentioned in Jane's as well. "YADS implements a simple, safe and predictable recursive data model for describing resource collections. The aim is to assist in programming complex resource descriptions across multiple applications and to foster interoperability between them...So, the YADS model makes extensive use of bNodes to manage hierarchies of “fat” resources - i.e. resource islands, a resource decorated with properties. The bNodes are only used as a mechanism for managing containment."

This sounds a lot like RDF molecules and supports visualization (apparently). This seems like a good use of molecules that I hadn't previously thought of (Tony's talk gives an example of the London underground). The main homepage of YADS isn't around anymore - it'll be interesting to see if it's still being used/worked on.

Update: Tony has fixed up the YADS home page (there's also an older version).

Monday, July 28, 2008

Hadoop and Microsoft

Pluggable Hadoop lists some extensions to Hadoop in the pipeline: job scheduling (including one based on Linux's completely fair scheduler), block placement, instrumentation, serialization, component lifecycle, and code cleanup (the analysis used Structure101).

I found the reason why HQL was removed from HBase (to be replaced by a Ruby DSL and to ensure that HBase wasn't confused with an SQL database) and moved to HRdfStore.

There's also rumours that Microsoft's recent investment in Apache may lead to them working on Hadoop too.

Tuesday, July 22, 2008

Save us China

I was in Victoria when the ETS for Australia was announced (well, the discussion papers). It's fairly funny that replacing the world's worst plants, even with other coal plants using Chinese brown coal technology, would reduce emissions by 30% to 40% (by just drying out the brown coal) - Hazelwood is the world's worst. It's still very polluting but it just shows how far behind Australia is. This has led to greater compensation to Victorian polluters (which is just mad). At the same time Queensland is creating another coal port because we can't export the carbon fast enough.

The exclusions were annoying (aluminium, cement and some types of steel). Cement is annoying (5% of all CO2 apparently) as green alternative technologies exist. The time is to invest, not compensate.

Square brackets are scary

For what may be an increasing trend of surfing the Web at 320x480 I noticed Cydia has a number of applications for Jailbroken iPhones (Java, Python and Ruby mainly). The mailing list on iPhone/Java doesn't have much on it except some interesting uses of JocStrap and UICaboodle (available from SVN by Jay Freeman). There's also the Sun blog that has some interesting sample applications using different Java implementations on the iPhone.

Friday, July 04, 2008

JRDF 0.5.5.1 Released

Just a quick note about a new version of JRDF. It's been a short time between releases but it still contains one significant advance over the previous one and that's persistent graphs. It's still in the early stages but it's basic enough for simple use cases. It also contains text serialization (based on NTriples) that is useful for moving RDF molecules around nodes in a cluster (for example). A lot of this code is fairly much "spike" code and I expect that another release will be made after we exercise these new features more (and write some tests/rewrite the code).

Update: 0.5.5.2 is now available fixing many bugs and introducing FILTER support.

Of Mats and Cats

No universal things Re: comparing XML and RDF data models was started by Bernard Vatant. This comes to the heart of whether people can know reality (well, that's how I'd summarize the idea of universals - see Beyond Concepts).

There were a few quotes that I found interesting:
It's been counter-productive in science for centuries. Physics had to go over the notion of universal thing to understand that light is neither a wave, nor a particle. Biology to go over the notion of taxa as rigid concepts based on phenotypes to understand genetics etc. Many examples can be found in all science domains. My day-to-day experience in ontology building, listening to domain experts, is indeed not that 'there are things that people are trying to describe', but that 'there are descriptions people take for granted they represent things before you ask, but really don't know exactly what those things are when you make them look closely'.


Bijan wrote:
I do think that the family of views in computational ontologies generally called "realist" is indeed naive and fundamentally wrong headed. Whether it's a "useful fiction" that helps people write better or more compatible ontologies is an open empirical question.

But I, for one, wouldn't bet on it.

I remember also a project where we were trying to get people to write simple triples. They got that they needed triples. But what they ended up putting into the tool was things like

S P O
"The cat is" "on the" "mat".
"Mary eats" "pudding" "on toast"

They just split up the sentences into somewhat equal parts!


I really feel like an interested amateur and my view is probably influenced by databases in computer science, where you are taking the non-realist approach. I say this because there are usually properties in databases that are not really based on reality but are a result of other requirements (like an "isDeleted" column rather than actually deleting the statement).

Wednesday, July 02, 2008

Round of Links

Tuesday, July 01, 2008

Ob. iPhone 2

Good to see carriers actually putting up a bit of a fight for iPhone business. Telstra announces iPhone 3G details: $279, $30 a month on a 24 month contract, with free access to WiFi hotspots. This better be true.

Update: Optus releases pricing

Thursday, June 26, 2008

JRDF 0.5.5

The main difference in version 0.5.5 from the previous one is the inclusion of an RDF molecule store. Both in memory and disk based versions are supported and can be queried just like a normal triple store. This is also the first version where the query evaluation has been renamed URQL instead of SPARQL. The SPARQL grammar is the same but it does not support the weird outliers that SPARQL has for empty graph patterns, instead following relational (and other) algebras. There's also the usual bug fixes and other features.

Update: Due to a couple of bugs found in 0.5.5 there will be a 0.5.5.1 version released soon.

Sunday, June 22, 2008

Beef of the Sea

Everyone is probably sick of me talking about the Gruen Transfer. So what better way to continue talking about it than to blog about it. Perhaps the best part of the show is The Pitch, especially episode one's selling whale meat (this is the runner-up) and making the Democrats electable (the second is best). Who would've thought deconstructing chocolate adverts would be interesting? One of the good things is that the show is available for download. There is also some good discussion in the forum and links to some other good adverts (although it possibly should've been crows).

Tuesday, June 17, 2008

Bad Balmer

Eight Years of Wrongness. Lists some of the things believed to have gone wrong with Microsoft in the last 10 years or so. They include: losing the DOJ and EU cases, Vista, XBox, IE, Zune, and Windows Mobile. Linked mainly because they use Fake Steve as a source of analysis.

Apple Sprouts

AppleInsider has some details on SproutCore. The official web site says, "makes Javascript fun and easy" - and it's just a Ruby gem install away. They also link to some previous talk about Cocoa for Windows.

Apple's trojan horse in the runtime wars has been well known for a while.

The photo demo looks a lot like the MobileMe Gallery that was presented at WWDC 2008 (SproutCore doesn't seem to work too well under IE 7 and the rotation only works in Safari). Gallery has less functionality than things like Photoshop Express although the integration is obviously better.

There's also an interesting Javascript library for drawing 2D objects (UML, workflows, etc) that I've been shown recently called Draw 2D.

Friday, June 13, 2008

The Curse of the Floppy Penises

A Western floppy penis is more valuable than preventing blindness in an African eye (see neglected diseases). This is part of the story in the video of the launch of "The Health Commons". The video talks about how hundreds of thousands of people go blind from "river blindness". It has very little value associated with it and drug companies focus on more valuable drugs to do with baldness and erectile dysfunction. The video goes on to talk about how the network changes things and how there's a lack of process change in science to take advantage of these effects. If you can leverage network effects then this hopefully reduces the cost of drug discovery, making drug development for less valuable diseases viable. The white paper covers some more of this in detail.

It also talks about an idea that I've often thought of as useful - the collection of failed experiments, "This deeply set inability to capture collective learning dooms everyone to revisit infinitely many blind alleys. The currency of scientific publication encourages individual scientists to hoard rather than share data that they will never have the time or resources to exhaustively mine. And, the wealth of “negative” information gleaned from clinical trial data is mostly lost to the need for companies to safeguard their commercial investments."

The general idea seems to share and standardize all aspects of research and science.

Thursday, June 12, 2008

Ob. iPhone

So I've been trying to find more information from a variety of sources on pricing.

The closest to reality that I've been able to find is these leaked details from Optus (via Gizmodo):
"The iPhone will only be available on a 24 month contract – no outright purchase, with the 8GB model to sell at AUD $220, and the 16GB model at $330, with only the 16GB model in white as Steve Jobs announced at the WWDC keynote.

Accessories will only be available through Apple stores – Optus will only carry the iPhone 3G itself, and the all important voice and data plans are as follows: $79 cap for $300 worth of calls and 1GB of data, or a $99 cap with $400 worth of calls and a 3G data download limit.

Visual voicemail is included, and the cap is whittled away in 35c per 30 second chunks, 25c per SMS message and the always annoying but always present flagfall which is set at 30c."

This makes it over twice as expensive as the AT&T plans (and I think they had unlimited data). This is where I get cranky about Australian carriers and their stupid plans. It would probably count me out at those prices.

Update: No more Apple rumours. As Brad says in the comments, this is wrong.
Update 2: Looks like the UK is getting a good deal.
Update 3: Gizmodo link gone...nothing to see here.

Wednesday, June 11, 2008

Tuesday, June 10, 2008

Linked Data, FOAF, and OWL DL

So I spent a little time a while ago looking through all the different ways ontologies support linked data. Some of my data I wish to link together is not RDF but documents that define a subject. For example, a protein will have peer reviewed documents that define it. It's not RDF but it is important.

The tutorial on linked data has a little bit of information: "In order to make it easier for Linked Data clients to understand the relation between http://dbpedia.org/resource/Alec_Empire, http://dbpedia.org/data/Alec_Empire, and http://dbpedia.org/page/Alec_Empire, the URIs can be interlinked using the rdfs:isDefinedBy and the foaf:page property as recommended in the Cool URI paper."

The Cool URIs paper, Section 4.5 says: "The rdfs:isDefinedBy statement links the person to the document containing its RDF description and allows RDF browsers to distinguish this main resource from other auxiliary resources that just happen to be mentioned in the document. We use rdfs:isDefinedBy instead of its weaker superproperty rdfs:seeAlso because the content at /data/alice is authoritative."

There is also some discussion about linking in URI-based Naming Systems for Science.

Now my use case is linking things to documents that define that thing. So rdfs:seeAlso is not appropriate as it "might provide additional information about the subject resource". And rdfs:isDefinedBy is also out as it is used to link RDF documents together. I need a property that defines a thing, is authoritative but isn't linking RDF (it's for humans). I also would like to keep my ontology within OWL DL.

FOAF has a page property. I've used the OWL DL version of FOAF before and FOAF cleaner (or should that be RDFS cleaner). So it seemed like a good match. However, its inverse is topic which isn't good. Because I'm linking the thing to the page - it's not a topic. So scrub that.

RSS has a link property which extends Dublin Core's identifier. This seems more like it. However, I'd like to extend my own version of link and I'm stuck because as soon as you use RDFS vocabularies in OWL DL you're in OWL Full territory. It'd be nice to stay in OWL DL. There is an OWL DL version of Dublin Core. All of the Dublin Core properties are nicely converted to annotation properties. However, you're still stuck because you can't make sub-properties without going into OWL Full. I like the idea of annotation and semantically Dublin Core seems to be a suitable vocabulary of annotation properties. Extending Dublin Core is out of OWL DL - which is a shame because it's probably the closest match to what I wanted.

As an aside, annotation properties are outside the reasoning engine. The idea is that you don't want an OWL reasoner or RDF application necessarily inferring over this data or trying to look it up in order for the document to be understood. So the way they do it in OWL DL is to have annotation properties that are outside of/special to the usual statements. Sub-properties require reasoning, so limiting them makes some sense but it does hamper extensibility - it'd be nice to express them and turn on the reasoning only when asking about those properties (I think Pellet has this feature but I didn't look up the details).

The other vocabulary I looked at was SIOC's link. Again, this seems like a close match but again it's RDFS.

In the end, I just created another annotation property called link.
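
For the record, declaring it is only a couple of lines with Jena's ontology API (a sketch only - JRDF isn't involved and the namespace is made up):

import com.hp.hpl.jena.ontology.AnnotationProperty;
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;

public class LinkProperty {
    public static void main(String[] args) {
        String ns = "http://example.org/terms#";
        OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_DL_MEM);
        // An annotation property stays outside the reasoner, keeping the ontology in OWL DL.
        AnnotationProperty link = model.createAnnotationProperty(ns + "link");
        Resource protein = model.createResource(ns + "someProtein");
        protein.addProperty(link, model.createResource("http://example.org/paper/123"));
        model.write(System.out, "RDF/XML-ABBREV");
    }
}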

In summary:

  • For my requirements, the suggestions for linking data seem to only work for RDF and RDFS ontologies. Reusing RDFS from OWL DL or OWL DL from RDFS doesn't look feasible as one isn't a subset of the other (an old problem I guess).

  • Current, popular Semantic Web vocabularies are in RDFS. Why aren't there more popular OWL DL versions of these things? Is the lack of extensibility holding it back?

  • Is my expectation wrong - should I stick within OWL DL or is an RDFS and OWL DL combination okay?

  • Why not allow annotation properties to have sub-properties?

  • Maybe the OWL DL specification does have suitable properties for linking certain data but I don't understand which is the right one.



Update: The Neurocommons URI documentation protocol is quite similar as well. Except that it seems to be too specific, as it ties the name to a single thing that defines it. All the parts of Step 5 could potentially be eliminated with what I'm thinking of.

Friday, May 30, 2008

Somewhere


Alarm Bells Sound for the Amazon


Brazil's land mass and farming industry make it one of the most agriculturally productive countries in the world. It has already been dubbed "the world's feeding bowl" and is exporting more and more to emerging economies, such as India and China.

As China's middle-class continues to grow, so, too, does its demand for food. Brazil exports 10 million tons of soybeans to China a year for both animal feed and human consumption, trade that is crucial to Brazil's economic development.

And it's not just poverty that's an issue.

The state of Para has some of the worst human rights abuses in Brazil. People are trafficked from across the impoverished northeast of the country to work in slavelike conditions in the sawmills, illegal charcoal ovens and cattle farms.

They usually work in horrific conditions, with no basic rights and existing on roughly $5 a day. If they try to seek help from the authorities, they are threatened with death.


There's also the WHO page on "Deaths from Climate Change".

Monday, May 26, 2008

RDF Processing

One of the interesting things about biological data, and probably other types, is that a lot of it is not quite the right structure. That's not to say that there aren't people working to improve it - the Gene Ontology seems to be updated almost daily - but data in any structure may be wrong for a particular purpose.

Biologists make a habit, out of necessity, of just hacking and transforming large amounts of data to suit their particular need. Sometimes these hacks get more generalized and abstracted, like GO Slims. We've been using GO Slims in BioMANTA for sub-cellular location (going from 2000 terms to 500). GO contains lots and lots of information; you don't need it all at once and more often you don't need it at the maximum level of granularity that it has. Some categories only have one or two known instances, for example. You may even need to whittle this down further (from say 500 to 200). For example, when we are determining the quality of an interaction we only care where the proteins exist generally in an organism. If two proteins are recorded to interact but one is in the heart and the other in the liver then it's unlikely that they will interact in the host organism. The part of the liver or the heart and other finer structural detail is not required for this kind of work (AFAIK anyway).

The point is, a lot of our work is processing not querying RDF. What's the difference between the two and what effect does it have?

For a start, querying assumes, at least to some degree, that the data is selective - that the results you're getting are vastly smaller than your original data. In processing, you're taking all of the data or large chunks of it (by sets of predicates, for example) and changing or producing more data based on the original set.

Also, writing is at least as important as reading the data. So data structures optimized for lots of writes - temporary, concurrent - are of greater importance than those built around more familiar requirements for a database.

Sorting and processing distinct items is a lot more important too. When processing millions of data entries it can be quite inefficient if the data has a large number of duplicates and needs to be sorted. Processing can also be decentralized - or at least more decentralized.
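
As a rough sketch of the difference, a processing pass looks something like this in plain Java: no indexes or query engine, just a stream over every triple (assumed here to be one tab-separated line per triple, a made-up format), keeping a chosen set of predicates and letting a sorted set handle duplicates:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.TreeSet;

    // Processing rather than querying: stream over every triple, keep only the
    // predicates we care about, and let a sorted set remove duplicates and order
    // the output.
    public class TripleProcessor {
        public static void main(String[] args) throws IOException {
            Set<String> wantedPredicates = new HashSet<String>();
            wantedPredicates.add("http://example.org/locatedIn"); // hypothetical predicate

            Set<String> distinctSorted = new TreeSet<String>();
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] spo = line.split("\t"); // one tab-separated triple per line
                    if (spo.length == 3 && wantedPredicates.contains(spo[1])) {
                        distinctSorted.add(line);
                    }
                }
            } finally {
                in.close();
            }

            for (String triple : distinctSorted) {
                System.out.println(triple); // duplicates gone, output sorted
            }
        }
    }

In memory this is trivial; at the scale of millions of entries the same shape of job is what pushes you towards external sorts or something like Hadoop.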

To top it off, the data still has to be queried. So this doesn't remove the need for efficient, read-only data structures to perform selective queries for the usual analysis, reporting, etc. None of the existing problems go away.

Monday, May 12, 2008

git + RDF = versioned RDF

Reading Git for Computer Scientists, it seems that if you turn the blob into a set of triples you pretty much have versioned RDF (or even molecules).
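
A minimal sketch of the idea, assuming the "blob" is just a canonically sorted set of triples hashed the way git hashes content (this is not JRDF's API, and real canonicalisation is harder than a straight sort):

    import java.security.MessageDigest;
    import java.util.TreeSet;

    // Content-address a set of triples the way git addresses a blob: put the
    // triples into a canonical order, then hash the result. Two graphs with the
    // same triples get the same id; any change produces a new id, so history is
    // just a chain of ids, as in git.
    public class TripleBlob {
        public static String idOf(Iterable<String> triples) throws Exception {
            TreeSet<String> canonical = new TreeSet<String>();
            for (String triple : triples) {
                canonical.add(triple);
            }
            StringBuilder content = new StringBuilder();
            for (String triple : canonical) {
                content.append(triple).append('\n');
            }
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(content.toString().getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }

Blank nodes make proper canonicalisation harder than this, which is presumably where molecules would come in.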

I'm also wondering, if Digg is so pro-Semantic Web, where's the http://digg.com/semweb?

Tuesday, May 06, 2008

I See Triples

Digg makes official its adoption of a 'semantic Web' standard "Other brief mentions on Digg's blogs over the past month have been the only indications the company has been giving to the world of its direct -- and perhaps even principal -- involvement in RDF and RDFa, besides a simple check of the site's own source code, where attributions such as rel="dc:source" property="dc:title" within <DIV> elements are now common. A few weeks ago, developer Bob DuCharme discovered these little attributions and began playing with them to discern their viability."

"The possibility exists for a kind of mega-meta-source to emerge from Digg, where interesting news topics are associated with cataloged resources. But for that to actually work, someone has to manage those resources -- and that effort will take a level of humanpower and resources of another kind (the kind symbolized with "$") that RDF won't provide even the most ambitious sites just on its own."

See Digging RDFa. More news about RDFa is available at RDFa.info.

One way to see Digg in all its RDFa glory is to copy this JavaScript for highlighting, or this one for extracting RDF triples, into your bookmark bar and run it after the Digg front page has loaded.

So I still haven't finished writing up everything I saw at WWW2008, but the overall messages were:

  • RDFa is easy and gets people going with RDF quickly (see "They knew the train would come"). Semantic wikis (links to the Semantic MediaWiki project) have also come a long way towards making it more, err, user friendly.

  • HTML5 and the end of the browser development winter seem like the death of plugins at last. I hadn't realized this before, but the message seems to be that a plugin is a way of saying to the Web "your browser isn't full featured enough".

  • The Facebooks of the world and all those online communities really are a danger to the Web - the creation of data silos. And I'd really like to have the time to write some SIOC plugins to help open up these silos (or just change my blog template to have RDFa).

Bankrupt

"And now, we're [Americans are] the most religious nation on earth - that's why we kill so easily. We're sending people to heaven. And because we are now terribly, terribly religious in a sense that no proper American ever was when I was young - I was in the Second World War." - Gore Vidal.

And they are bankrupt in the financial sense as well, due to Iraq (and other causes of course). The speaker also follows a line I've seen often: that the war has been fought without enough commitment from the government (i.e. decreasing taxes instead of increasing them, hiring fighters instead of drafting, etc.). One rather shocking statistic was that 48 percent of returning troops will be disabled in some way - maybe that's because more are living than dying, but it's still quite an amazing number - and it means "...we've created just for the disabled in this war in the last five years, a gap equal to the gap that we created over decades in the social security system...It's an order of magnitude worse than the Vietnam War."

Friday, May 02, 2008

When URIs are too Much

Every Subject is a Blank Node "In RDF, URIs are good at defining unambiguous property values, in other words objects, including type. But very often, and maybe most of the time, the individual subject (in both meaning of subject of an RDF triple, and topic maps subject of conversation) is best represented as a blank node bearing all kinds of identified properties, but none of them conferring absolute identity. This way, it's left to applications to figure out identification rules, in other words which property or boolean combination of properties they want to consider as identifying or not."

From the mailing list: "With no URI, you are free to let applications decide which contexts are considered the same or not, based on specific rules on properties. Some applications would decide that all contexts where role "I" is played by "John Black" are the same, and will cluster all contextResource properties, some other will not."
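
A small sketch of what "applications decide" could look like: subjects are just bags of properties with no URI, and "sameness" is whichever properties the application keys on (the property names below are invented for the example, not taken from the mailing list post):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Subjects as blank nodes: no URI, just properties. Whether two subjects are
    // "the same" depends entirely on which properties the application treats as
    // identifying when it builds the clustering key.
    public class BlankNodeIdentity {
        public static Map<String, List<Map<String, String>>> cluster(
                List<Map<String, String>> subjects, List<String> identifyingProperties) {
            Map<String, List<Map<String, String>>> clusters =
                    new HashMap<String, List<Map<String, String>>>();
            for (Map<String, String> subject : subjects) {
                StringBuilder key = new StringBuilder();
                for (String property : identifyingProperties) {
                    key.append(property).append('=').append(subject.get(property)).append(';');
                }
                List<Map<String, String>> bucket = clusters.get(key.toString());
                if (bucket == null) {
                    bucket = new ArrayList<Map<String, String>>();
                    clusters.put(key.toString(), bucket);
                }
                bucket.add(subject);
            }
            return clusters;
        }
    }

Cluster on just the property saying role "I" is played by "John Black" and all those contexts collapse together; add more identifying properties and they stay apart - no URI ever has to confer absolute identity.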

Long tail of programming languages

While I'm sick of long tail blahs, I recently came across the idea that programming languages follow the same power laws found in other areas. This particular long tail should be encouraging for those who have a disdain for the current mainstream computer languages: "Rather than finding ways to create an even lower lowest common denominator, the Long Tail is about finding economically efficient ways to capitalize on the infinite diversity of taste and demand that has heretofore been overshadowed by mass markets."

Furthermore, "There is a long tail because the more specialized a language is to a domain, the better it fits to solve problems for that domain. These niche languages trade off generality for efficiency in a domain and they are simply better and more efficient tools for that domain."

Grep the Web

Slides and talks from the recent Hadoop Summit are now available. Some of the more interesting ones are Facebook's Hive, Amazon's GrepTheWeb, IBM's JAQL and just about everything from Yahoo.

Thursday, May 01, 2008

Engrich

The ability for a foreigner to speak just enough English in order to swindle stupid Westerners.

Thursday, April 24, 2008

Update from WWW2008

The HCLS workshop was very good. I especially enjoyed Mark Wilkinson's talk about BioMoby 2.0 (very Larry Lessig-esque) and Chris Baker's. There's some definite interest from a number of people in my talk too.

The keynote of the first day was from the Vice President of Engineering at Google, Kai-Fu Lee. In my talk I said that IBM had noted that scale-out architecture gives you a 4 times performance benefit for the same cost. He said that Google gets around 33 times or more from a scale-out architecture. The whole cloud thing is really interesting in that it's not only about better value but about doing things you just can't do with more traditional computing architectures. The number of people I've overheard saying that they haven't been able to get their email working because they're using some sort of client/server architecture is amazing. I mean what's to get working these days when you can just use GMail?

The SPARQL BOF was interesting as well (Eric took notes). The time frame seems to be around 2009 before they get started on SPARQL the next generation. What sticks out in my mind is the discussion around free text searching - adding something like Lucene. There were also aggregates, negation, starting from a blank node in a SPARQL query, transitive relationships and following owl:sameAs. I was pretty familiar with all of these, so it was interesting just to listen for a change.

With both aggregates and free text you are creating a new variable. Lucene gives you a score back, and I remember in Kowari we had that information, but I don't think it was ever visible in a variable (maybe I'm wrong, I don't really remember). It would be nice to be able to bind new variables somehow from things in the WHERE clause - that would also let you filter out results based on counts greater than some value (without needing a HAVING clause), or documents that match your Lucene query with a score above a certain value. Being able to do transitive relationships on just a subset of the subclass relationship (like only the subclasses of mammals, rather than inferring over the whole tree of life) seemed to be met with some reluctance. I really didn't understand this, but the argument seemed to be that controlling it was the store's responsibility and not something for the user to specify.
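
The subset-transitivity idea is easy enough to sketch outside of SPARQL: walk the subclass graph transitively, but only below the class you care about, so the rest of the tree of life never gets touched (the class map and names here are made up for illustration):

    import java.util.ArrayDeque;
    import java.util.Collections;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Transitive closure restricted to a subtree: find everything that is directly
    // or indirectly a subclass of the root without walking the whole hierarchy.
    public class BoundedTransitiveClosure {
        public static Set<String> subclassesOf(String root,
                Map<String, List<String>> directSubclasses) {
            Set<String> found = new HashSet<String>();
            Deque<String> queue = new ArrayDeque<String>();
            queue.add(root);
            while (!queue.isEmpty()) {
                String current = queue.remove();
                List<String> children = directSubclasses.get(current);
                if (children == null) {
                    children = Collections.emptyList();
                }
                for (String child : children) {
                    if (found.add(child)) { // add() returns false if already seen
                        queue.add(child);
                    }
                }
            }
            return found;
        }
    }

Whether that kind of bounding belongs in the query or in the store was exactly the disagreement.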

The other thing that was mentioned was transactions. It seems that transactions probably won't be part of SPARQL due to the nature of distributed transactions across the Web.

There was one paper on the first day that really stood out. I don't know what it is about logicians giving talks, but they generally really appeal to me. It was "Structured Objects in OWL: Representation and Reasoning", presented by Bernardo Grau. It takes the structural parts of an OWL ontology and creates a graph to represent them, which stops DL reasoning from having to build a potentially infinite tree and gives a bounded graph instead. This is cool for biology - the make-up of a cell, for example - but it also speeds up reasoning and allows errors to be found.

The other interesting part was the linked data area. I was a bit concerned that it was going to create a read-only Semantic Web. A lot of the work, such as DBpedia, which converts Wikipedia to RDF, seems a bit odd to me, as you can only edit the Semantic Web indirectly through documents. But in the Linked Data Workshop a paper was presented called "Tabulator Redux: Browsing and Writing Linked Data", which of course adds write capabilities. I spoke to Chris Bizer (who gave a talk on how the linked data project now has ~2 billion triples) about whether you could edit DBpedia this way and he said probably not yet. It's going to be interesting to see where that goes.

I am just going off memory rather than notes. So I'll probably flesh this out a bit more later.

Saturday, April 19, 2008

Big Web Table

I thought I read that someone had ported Google's AppEngine to use HBase. Good idea, but not quite. "Announcing A BigTable Web Service": "I then came up with the crazy idea to offer BigTable as a web service using App Engine. It would be an infinitely scalable database running in Google's datacenters. I spent my weekend learning Python and hacking together an implementation. Now I'm happy to present the BigTable Web Service. It models the API of Hbase—a BigTable clone. Now you can have simulated BigTable running atop App Engine, which itself provides an abstraction on top of the real BigTable."

What it actually does is offer HBase's Thrift API on top of Google's BigTable - or, as they say, BigTable as a Web Service (a RESTful one).

Friday, April 18, 2008

Mario's Jazz Bar

Just a random thing to share - there seems to be quite a lot of competition playing Super Mario Galaxy songs on YouTube. A fairly recent one is a Jazz Interpretation of the Observatory theme (there's also a guitar version and an accordion one). I still quite like the original orchestral version of the Gusty Garden Galaxy Theme and its version on piano. Koji Kondo is also good to look up on YouTube from time to time.

Thursday, April 17, 2008

What Women Want: Pairing

It's not the first time I've read an article about the continuing decline of women in IT; "Where Did All the Girl Geeks Go?" continues to note the slide:
"There's a perception that being a computer science major leads to a job as a programmer and you sit in a cubicle where you type 12 hours a day and have no interactions with other people," Block said.

Yusupova noted that even if pure programming jobs are outsourced, opportunities still remain within a company for people to bridge the relationship between the outsourced IT vendors and the business side.

"These roles would probably be ideal for women who prefer to be in communication-focused roles, if they know computer science and can communicate to all parties involved," Nelly Yusupova, chief technology officer of Webgrrls International, a networking organization.


There was a talk given recently, at a local XP group, that led to a discussion on the benefits of things like pair programming (see "Pair Programming").

I see pair programming and other means of improving interaction between developers not only as essential for better code and a better project, but also as a way to improve the IT industry generally and to expand its appeal, especially to younger people and women. The idea being that certain people work better in a participatory manner rather than being told what to do.

This is pretty much what an article from a couple of years ago, "Debunking the Nerd Stereotype with Pair Programming" (also available as a PDF), suggested:
Jamie wants to be a software engineer. She enjoyed her programming and science classes in high school and wants to combine her interest in both disciplines to help society through biomedical applications. Since she started college, it seems that her life has been centered on time consuming programming classes. In those classes, her professors insist that she work alone—some professors expressly forbid even discussing assignments with fellow classmates. Before entering college, Jamie was aware of the stereotypical view that programmers work long hours by themselves. Based on her college experience, now she knows it’s more than just a stereotype—it’s true. Perhaps she should forget programming. She likes the friends she’s met in her biology lab group—maybe biology would be a better major.


Having been working in bioinformatics for over a year, the number of women in this area compared to IT is startling. It seems basically 50/50 in what is essentially an application of information technology. They still write code, they still develop large applications and so on. Why does it drop to 1 in 20 or worse in IT? It does seem that in bioinformatics you are expected to work in groups and teams; people are forever interacting with each other - it seems a brilliant environment as far as productivity is concerned.

And it's not just IT or biology; there seems to be a general benefit from greater interaction - more pairing and the like generally improves performance:
The success rate of underrepresented minorities in science courses has been shown to be dramatically improved by shifting the learning paradigm from individual study to one that capitalizes on group processes, such as student work groups and student-student tutoring.


From an IT perspective, pairing doesn't only improve the quality of the software, it also improves your abilities as an individual programmer. This has been demonstrated where pair programming has been used in IT courses and students' exam results improved (see "Pair Programming Improves Student Retention, Confidence, and Program Quality").

I re-read "All I Really Need to Know about Pair Programming I Learned In Kindergarten" which still holds up quite well as a set of rationales behind pair programming and NCSU's Pair Learning has lots of papers related pairing, learning and making IT more attractive to more discovery based system.

Update: Finally found a public version of the nerd article.