Friday, May 30, 2008


Alarm Bells Sound for the Amazon

Brazil's land mass and farming industry make it one of the most agriculturally productive countries in the world. It has already been dubbed "the world's feeding bowl" and is exporting more and more to emerging economies, such as India and China.

As China's middle class continues to grow, so, too, does its demand for food. Brazil exports 10 million tons of soybeans to China a year for both animal feed and human consumption, a trade that is crucial to Brazil's economic development.

And it's not just poverty that's an issue.

The state of Para has some of the worst human rights abuses in Brazil. People are trafficked from across the impoverished northeast of the country to work in slavelike conditions in the sawmills, illegal charcoal ovens and cattle farms.

They usually work in horrific conditions, with no basic rights and existing on roughly $5 a day. If they try to seek help from the authorities, they are threatened with death.

There's also the WHO page on "Deaths from Climate Change".

Monday, May 26, 2008

RDF Processing

One of the interesting things about biological data, and probably other types, is that a lot of it is not in quite the right structure. That's not to say that people aren't working to improve it (the Gene Ontology seems to be updated almost daily), but data in any structure may be wrong for a particular purpose.

Biologists make a habit, out of necessity, of just hacking and transforming large amounts of data to suit their particular need. Sometimes these hacks get more generalized and abstracted, like GO Slims. We've been using GO Slims in BioMANTA for sub-cellular location (going from 2000 terms to 500). GO contains lots and lots of information, and you don't need it all at once; more often than not you don't need it at the maximum level of granularity that it has. Some categories only have one or two known instances, for example. You may even need to whittle this down further (from, say, 500 to 200). For example, when we are determining the quality of an interaction we only care where the proteins exist generally in an organism. If two proteins are recorded to interact but one is in the heart and the other in the liver, then it's unlikely that they will interact in the host organism. The part of the liver or the heart and other finer structural detail is not required for this kind of work (AFAIK anyway).
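As a rough illustration of the slimming idea, here is a minimal Python sketch that maps a fine-grained term up to its nearest "slim" ancestors. The hierarchy, the term names and the `map_to_slim` function are all made up for the example; this is not the actual GO Slim tooling or real GO data.

```python
from collections import deque

def map_to_slim(term, parents, slim_terms):
    """Walk up the hierarchy from `term` and return the nearest slim terms."""
    found = set()
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        if current in slim_terms:
            found.add(current)
            continue  # stop climbing once a slim term is reached
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return found

# Toy hierarchy with hypothetical term names (not real GO identifiers).
parents = {
    "mitochondrial inner membrane": ["mitochondrion"],
    "mitochondrion": ["cytoplasm"],
    "cytoplasm": ["cell"],
    "nucleolus": ["nucleus"],
    "nucleus": ["cell"],
}
slim = {"mitochondrion", "nucleus"}

print(map_to_slim("mitochondrial inner membrane", parents, slim))
print(map_to_slim("nucleolus", parents, slim))
```

A term with no slim ancestor simply maps to an empty set here; a real pipeline would have to decide what to do with those.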

The point is, a lot of our work is processing, not querying, RDF. What's the difference between the two, and what effect does it have?

For a start, querying assumes, at least to some degree, that the data is selective - that the results you're getting are vastly smaller than your original data. In processing, you're taking all of the data or large chunks of it (by sets of predicates, for example) and changing or producing more data based on the original set.
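To make the distinction concrete, here is a small Python sketch over triples held as plain tuples. The `ex:` names and the `query` and `process` helpers are invented for illustration and don't correspond to any particular RDF library.

```python
# Triples as (subject, predicate, object) tuples; all names are hypothetical.
triples = [
    ("ex:p1", "ex:locatedIn", "ex:liver_lobule"),
    ("ex:p2", "ex:locatedIn", "ex:heart_ventricle"),
    ("ex:p1", "ex:interactsWith", "ex:p2"),
]

def query(triples, s=None, p=None, o=None):
    """Selective read: return only the triples matching the given pattern."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

def process(triples, predicate, transform):
    """Bulk rewrite: stream every triple, transforming objects of one predicate."""
    for s, p, o in triples:
        yield (s, p, transform(o)) if p == predicate else (s, p, o)

# Coarsen locations to organ level (a made-up mapping).
coarse = {"ex:liver_lobule": "ex:liver", "ex:heart_ventricle": "ex:heart"}

print(query(triples, s="ex:p1", p="ex:locatedIn"))
print(list(process(triples, "ex:locatedIn", lambda o: coarse.get(o, o))))
```

The query touches a sliver of the data and returns less than it read; the processing step reads everything and writes out a dataset as large as the input.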

Also, writing is at least as important as reading the data. So data structures optimized for lots of temporary, concurrent writes are of greater importance than those built around the more familiar requirements of a database.

Sorting and processing distinct items is a lot more important too. When processing millions of data entries it can be quite inefficient if the data has a large number of duplicates and needs to be sorted. Processing can also be decentralized - or perhaps just more decentralized.
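One common way to handle sort-plus-distinct over data that doesn't fit comfortably in one pass is to sort it in chunks, merge the sorted runs, and drop adjacent duplicates. The Python sketch below does this in memory with a deliberately tiny chunk size; `sorted_distinct` is a hypothetical helper, and a real version would spill the runs to disk.

```python
import heapq
from itertools import groupby, islice

def sorted_distinct(triples, chunk_size=2):
    """Sort in small runs, merge the runs, then drop adjacent duplicates."""
    it = iter(triples)
    runs = []
    while True:
        chunk = sorted(islice(it, chunk_size))
        if not chunk:
            break
        runs.append(chunk)  # a real implementation would write each run to disk
    merged = heapq.merge(*runs)  # lazily merges the already-sorted runs
    return [key for key, _ in groupby(merged)]  # duplicates are now adjacent

data = [
    ("b", "p", "1"),
    ("a", "p", "1"),
    ("b", "p", "1"),
    ("a", "p", "2"),
    ("a", "p", "1"),
]
print(sorted_distinct(data))
```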

To top it off, the data still has to be queried. So this doesn't remove the need for efficient, read-only data structures to perform selective queries for the usual analysis, reporting, etc. So none of the existing problems go away.

Monday, May 12, 2008

git + RDF = versioned RDF

Reading Git for Computer Scientists, it seems that if you turn the blob into a set of triples you pretty much have versioned RDF (or molecules even).
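A minimal Python sketch of what that might look like: give a set of triples a content-derived identifier, the way Git names a blob by hashing its contents. The `triples_id` function and the "triples" header are my own invention, not anything Git defines, and the sketch assumes there are no blank nodes (which would need canonical labelling first).

```python
import hashlib

def triples_id(triples):
    """Hash a set of triples so that equal sets get the same identifier."""
    # Sets have no order or duplicates, so normalize before hashing.
    lines = sorted(" ".join(t) + " ." for t in set(triples))
    body = "\n".join(lines).encode("utf-8")
    header = b"triples %d\0" % len(body)  # loosely modeled on Git's object header
    return hashlib.sha1(header + body).hexdigest()

v1 = [("ex:a", "ex:knows", "ex:b"), ("ex:b", "ex:knows", "ex:c")]
v1_shuffled = [("ex:b", "ex:knows", "ex:c"), ("ex:a", "ex:knows", "ex:b")]
v2 = v1 + [("ex:c", "ex:knows", "ex:a")]

print(triples_id(v1) == triples_id(v1_shuffled))
print(triples_id(v1) == triples_id(v2))
```

Two versions of a graph with the same triples get the same name regardless of order, and any added or removed triple produces a new one, which is most of what a version store needs from its objects.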

I'm also wondering, if Digg is so pro-Semantic Web, where's the

Tuesday, May 06, 2008

I See Triples

Digg makes official its adoption of a 'semantic Web' standard: "Other brief mentions on Digg's blogs over the past month have been the only indications the company has been giving to the world of its direct -- and perhaps even principal -- involvement in RDF and RDFa, besides a simple check of the site's own source code, where attributions such as rel="dc:source" property="dc:title" within <DIV> elements are now common. A few weeks ago, developer Bob DuCharme discovered these little attributions and began playing with them to discern their viability."

"The possibility exists for a kind of mega-meta-source to emerge from Digg, where interesting news topics are associated with cataloged resources. But for that to actually work, someone has to manage those resources -- and that effort will take a level of humanpower and resources of another kind (the kind symbolized with "$") that RDF won't provide even the most ambitious sites just on its own."

See Digging RDFa. More news about RDFa is available at

One way to see Digg in all its RDFa glory is to copy this JavaScript for highlighting, or this one for getting RDF triples, into your bookmark bar and use it after the Digg front page has loaded.

So I still haven't finished writing up everything that I saw at WWW2008, but the overall messages were:

  • RDFa is easy and gets people going with RDF quickly (see "They knew the train would come"). Semantic wikis (links to the Semantic Mediawiki project) have also come a long way towards making it more, err, user friendly.

  • HTML5 and the end of the browser development winter seem like the death of plugins at last. I hadn't realized this before, but the message seems to be that a plugin is a way of saying to the Web "your browser isn't full featured enough".

  • The Facebooks of the world and all those online communities really are a danger to the Web - the creation of data silos. And I'd really like to have the time to write some SIOC plugins to help open up these silos (or just change my blog template to have RDFa).


"And now, we're [Americans are] the most religious nation on earth - that's why we kill so easily. We're sending people to heaven. And because we are now terribly, terribly religious in a sense that no proper American ever was when I was young - I was in the Second World War." - Gore Vidal.

And they are bankrupt in the financial sense as well, due to Iraq (and other causes, of course). The speaker also follows a line I've seen often, where the war has been fought without enough commitment from the government (i.e. decreasing taxes instead of increasing them, hiring fighters instead of drafting, etc.). One rather shocking statistic was that 48 percent of returning troops will be disabled in some way - maybe that's because more are living than dying, but it's still quite an amazing number - but it means "...we've created just for the disabled in this war in the last five years, a gap equal to the gap that we created over decades in the social security system...It's an order of magnitude worse than the Vietnam War."

Friday, May 02, 2008

When URIs are too Much

Every Subject is a Blank Node: "In RDF, URIs are good at defining unambiguous property values, in other words objects, including type. But very often, and maybe most of the time, the individual subject (in both meaning of subject of an RDF triple, and topic maps subject of conversation) is best represented as a blank node bearing all kinds of identified properties, but none of them conferring absolute identity. This way, it's left to applications to figure out identification rules, in other words which property or boolean combination of properties they want to consider as identifying or not."

From the mailing list: "With no URI, you are free to let applications decide which contexts are considered the same or not, based on specific rules on properties. Some applications would decide that all contexts where role "I" is played by "John Black" are the same, and will cluster all contextResource properties, some other will not."
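Here is a small Python sketch of how an application might apply its own identification rule to URI-less subjects. Blank nodes are modelled as plain dicts of properties; the `cluster` function, the "place" property and the sample values are all hypothetical.

```python
from collections import defaultdict

def cluster(nodes, identifying_props):
    """Group blank nodes that agree on every property the application treats as identifying."""
    groups = defaultdict(list)
    singletons = []
    for node in nodes:
        if all(p in node for p in identifying_props):
            key = tuple(node[p] for p in identifying_props)
            groups[key].append(node)
        else:
            singletons.append([node])  # no basis for merging, so keep it apart
    return list(groups.values()) + singletons

# Contexts without URIs, each just a bag of properties.
contexts = [
    {"I": "John Black", "place": "Paris"},
    {"I": "John Black", "place": "Rome"},
    {"I": "Jane Doe", "place": "Paris"},
]

# One application decides that the role "I" alone identifies a context...
print(cluster(contexts, ["I"]))
# ...another requires both "I" and "place" to match.
print(cluster(contexts, ["I", "place"]))
```

The same data yields two clusters under the first rule and three under the second, which is the freedom the quote is describing.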

Long tail of programming languages

While I'm sick of long tail blahs, I recently came across the idea that programming languages follow the same power laws found in other areas. This particular long tail should be encouraging for those who have a disdain for the current mainstream computer languages: "Rather than finding ways to create an even lower lowest common denominator, the Long Tail is about finding economically efficient ways to capitalize on the infinite diversity of taste and demand that has heretofore been overshadowed by mass markets."

Furthermore, "There is a long tail because the more specialized a language is to a domain, the better it fits to solve problems for that domain. These niche languages trade off generality for efficiency in a domain and they are simply better and more efficient tools for that domain."

Grep the Web

Slides and talks from the recent Hadoop Summit are now available. Some of the more interesting ones are Facebook's Hive, Amazon's GrepTheWeb, IBM's JAQL and Yahoo's just about everything else.

Thursday, May 01, 2008


The ability for a foreigner to speak just enough English in order to swindle stupid Westerners.