Friday, February 22, 2008

Tissue Parade

This is just a quick note to let people who have expressed interest in BioMANTA know that the web site is now pretty much up-to-date with the latest papers and presentations (except for the InterOntology08 presentation in Japan next week). If we've sneezed and there was a PowerPoint slide, it's there.

Wednesday, February 20, 2008

Do You Have an Internet Sized Problem?

Yahoo! Launches World's Largest Hadoop Production Application. This is mainly about Hadoop reaching a certain level of maturity.

The Webmap build starts with every Web page crawled by Yahoo! and produces a database of all known Web pages and sites on the internet and a vast array of data about every page and site. This derived data feeds the Machine Learned Ranking algorithms at the heart of Yahoo! Search.

Some Webmap size data:

* Number of links between pages in the index: roughly 1 trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes

As described in the video, Webmap is the directed graph of the web and stores the aggregate metadata about the links. A lot of the code sounds like it is written in C++ (which is why there's the Pipes API in Hadoop). According to the Hadoop mailing list, this means that Webmap is roughly equal to Google's scale.
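
To give a feel for what "aggregating metadata about links" looks like as a Map-Reduce job, here is a minimal single-machine sketch in plain Java. It is not Yahoo!'s Webmap code and does not use the Hadoop API; the class name InlinkCountSketch and its methods are made up for illustration, and the only "metadata" computed is the number of inbound links per page.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: a toy, in-memory imitation of the
// map / shuffle / reduce steps, not the Hadoop API and not Webmap itself.
final class InlinkCountSketch {

    // Map step: for one page, emit a (target, 1) pair for every outgoing link.
    // The source page itself is not needed for a simple inlink count.
    static List<Map.Entry<String, Integer>> map(String page, List<String> outLinks) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String target : outLinks) {
            emitted.add(new AbstractMap.SimpleEntry<String, Integer>(target, 1));
        }
        return emitted;
    }

    // Reduce step: sum all the counts that were emitted for one target page.
    static int reduce(String target, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: run map over every page, group the pairs by key (the "shuffle"),
    // then reduce each group to a single inlink count.
    static Map<String, Integer> run(Map<String, List<String>> webGraph) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, List<String>> page : webGraph.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(page.getKey(), page.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> inlinks = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            inlinks.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return inlinks;
    }
}
```

In a real cluster the map and reduce calls would run on different machines and the grouping step would be done by the framework; the point here is only the shape of the computation.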

Wednesday, February 13, 2008


Maybe I'm in leftist, socialist heaven here but it does seem that most people were following the apology to the stolen generations. People were sitting in their cars listening to it rather than going to work. I liked the focus on how saying sorry is not about making you feel better; it's about the person (or people in this case) you're saying sorry to, and you have to take the good with the bad about your country and make amends. There is also a sense that some sort of action will take place, with both sides of parliament working together. A lot of the Howard policies were designed as wedges to divide, so while I'm very skeptical, I'm a little bit hopeful that some inclusive politics will occur.

There were a couple of moments I thought worth mentioning from Rudd's speech:
After living in Alice Springs for a "few years", government policy changed and the young girl was handed over to the missions.

"The kids were simply told to line up in three lines ... those on the left were told they had become Catholics, those in the middle, Methodist and those on the right, Church of England," Mr Rudd said.

"That's how the complex questions of post-reformation theology were resolved in the Australian outback in the 1930s.

"It was as crude as that."

Mr Rudd said should there still be doubts, the historical record showed that between 1910 and 1970, between 10 and 30 per cent of indigenous children were forcibly taken from their mothers and fathers.

"As a result up to 50,000 children were forcibly taken from their families," he said.

Mr Rudd said one of the most notorious examples of this approach came from the Northern Territory Protector of Natives, who had stated: "Generally by the fifth and invariably by the sixth generation all native characteristics of the Australian Aborigine are eradicated. The problem of our half castes... will quickly be eliminated by the complete disappearance of the black race and the swift submergence of their progeny in the white."

"The 1970s is not exactly a point in remote antiquity," he said.

"There are still serving members in this parliament who were first elected to this place in the early 1970s."

It's been a long time. I remember Paul Keating's Redfern Park speech, which was voted the number 3 speech after Martin Luther King and Jesus.

Tuesday, February 12, 2008

Concept Extractor

Last year ClearForest was bought by Reuters and Tim O'Reilly has covered the story of how the CEO of Reuters sees a Semantic future. This is one of those companies that I'd hoped would become semantically enabled way back when. It would be good for other text mining companies like Autonomy and Inxight to jump on board too. The free web service provided "categorizes and links your document with entities (people, places, organizations, etc.), facts (person ‘x’ works for company ‘y’), and events (person ‘z’ was appointed chairman of company ‘y’ on date ‘x’). The metadata results are stored centrally and returned to you as industry-standard RDF constructs accompanied by a Globally Unique Identifier (GUID). Using the Calais GUID, any downstream consumer is able to retrieve this metadata via a simple call to Calais".

Thursday, February 07, 2008

No Crappy Wrappers

Charles Petrie has an article called, "Is Semantic Web Technology Taking the Wrong Turn?". The author suggests that the current direction in Semantic Web development is leading it towards irrelevance. He notes its requirements and its attempt to provide ways to simplify and speed up tasks, such as integration, by an order of magnitude over existing technologies.

He sees a problem with how Semantic Web Technologies (SWTs) have typically been applied by adding layers alongside existing ones. This just increases "the number of interfaces and mediations required". Furthermore, most publications using SWTs talk about homogeneous environments - languages, ontologies, definitions are all constrained and any differences avoided. The other mistake highlighted is that the work has been divided into the usual architectural layers (persistence, processes, UI, etc.), which has led to each of these layers having its own Semantic Web layer added - creating "a disaster for software architects and engineers who must use the results from several communities in building software applications that are hosted and interconnected".

There are two obvious ways you could attack this argument. The first is that these types of constraints have been applied because of the immaturity of the underlying systems. It's hard to develop an integrated system in one go - dividing up these problems into their layers is obviously one way to make progress. There is also obvious infrastructure missing, not just better triple stores, but also ways to make sure you can reuse ontologies and processes. The other is that there are many examples of aligning ontologies, reusing them, and merging concepts from them, but this work too is still in its infancy. Areas like data mining, which the Semantic Web could leverage, lack mainstream use as well.

Straight ahead from here leads to more SWT languages, hard-to-integrate ontologies, and technology components such as libraries, RDF databases, and logic reasoners. Those who build real-world applications will have to integrate all those elements to use them holistically, thus leaving the integration problem unresolved. As this approach increases the effort required in every part of the software engineering life cycle, chances are that developers will adopt the SWT only for very specific areas and solutions, rather than for general use across all domains in which computing is applied.

He offers a possible solution to the data and process heterogeneity: rather than restricting SWT to the edges of systems, apply it throughout them.

Rather than looking at SWT as interface-wrapping technology, it seems appropriate to make it the foundation for all aspects of information technology and scientific computing. In concrete terms, one way to eliminate mediations when crossing layers is to ensure that data objects are encoded in a single format (such as RDF) and not mapped between layers but rather handed over from layer to layer without change. This, in turn, would challenge the various technologies used for implementing these layers to become totally SWT aware.

I think that development goes through cycles of integration and separation but I do agree that if the Semantic Web is just a technology of wrappers it will fail.

Update: Along much the same lines is an article about Dieter Fensel, "Are Semantic Researchers Missing the Big Picture?", in which he says:
...we do a lot on research of apply[ing] in semantics to all aspects of Enterprise Application Integration where you integrate data, processes, and services (and not only web pages)...

Is the industry neglecting the greater overall goals of scalability for interoperability?

“No,” writes Fensel. “I think they are aware of [it]. For example, Michael Broodie, Scientific Director at Verizon, estimates that world wide around 1 trillion dollars are spent per annum on application integration. The semantic web community (and not the industry) is mostly ignoring this area.”

I do like the cycle though, we've gone from an initial SEMANTIC web and criticism, to semantic WEB and now this criticism and back to highlighting semantic again.

Tuesday, February 05, 2008

50% Less Code or Your Money Back

Less than two weeks ago 0.5.3 of JRDF was released and now it's 0.5.4's turn.

This is mainly driven by making the fix to the bug in the btree (or Sesame's version) available.

It does, however, mean that the changes to the Resource API are made available more quickly. I remember looking at Jena's Resource and thinking it was bloated, confusing and rather poorly thought out from an efficiency point of view (holding onto all Graph and associated objects) but now I finally understand the positive effects it has on the code and I like it. JRDF's Resource sits on top of an RDF graph, automatically converts Java objects like URIs (which become URIReferences), and inserts them into or removes them from the graph. It too has many methods that do the same thing but with different types. It's better than the recent changes to TripleFactory as it allows blank nodes too. JRDF's Graph implementation is no longer the heavyweight object it once was: it references the indexes and nodepool but no longer has any real control over them (this meant removing serialization).
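
The idea of a resource that sits on top of a graph and converts plain Java values on the way in can be sketched in a few lines. This is a toy, not JRDF's implementation: the class ToyResourceSketch, its methods, and the use of a list of strings as the "graph" are all assumptions made up for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a toy stand-in for a resource wrapper.
// It is NOT the JRDF Resource API; the "graph" here is just a list of
// triples rendered as strings.
final class ToyResourceSketch {
    private final String subject;      // e.g. a blank node label such as "_:1"
    private final List<String> graph;  // the graph this resource writes into

    ToyResourceSketch(String subject, List<String> graph) {
        this.subject = subject;
        this.graph = graph;
    }

    // Convert a plain Java value into a node: URIs become URI references,
    // Integers become typed literals, everything else a plain literal.
    static String toNode(Object value) {
        if (value instanceof URI) {
            return "<" + value + ">";
        }
        if (value instanceof Integer) {
            return "\"" + value + "\"^^xsd:int";
        }
        return "\"" + value + "\"";
    }

    private String triple(URI predicate, Object value) {
        return subject + " <" + predicate + "> " + toNode(value);
    }

    // Insert a triple about this resource; the caller never builds nodes by hand.
    void addValue(URI predicate, Object value) {
        graph.add(triple(predicate, value));
    }

    // Remove a previously added triple about this resource.
    void removeValue(URI predicate, Object value) {
        graph.remove(triple(predicate, value));
    }

    public static void main(String[] args) {
        List<String> graph = new ArrayList<>();
        ToyResourceSketch supplier = new ToyResourceSketch("_:1", graph);
        supplier.addValue(URI.create("urn:name"), URI.create("urn:Smith"));
        supplier.addValue(URI.create("urn:status"), 20);
        for (String t : graph) {
            System.out.println(t);
        }
    }
}
```

The caller only ever deals with URIs, strings and numbers; the wrapper decides what kind of node each value becomes, which is where the saving in code comes from.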

Code becomes a lot smaller too, for example (the create call is URI.create):

Resource supplier = elementFactory.createResource();
supplier.addValue(create("urn:supplier"), "S1");
supplier.addValue(create("urn:name"), create("urn:Smith"));
supplier.addValue(create("urn:status"), 20);
supplier.addValue(create("urn:city"), "London", XSD.STRING);

Creates the triples:

_:1 urn:supplier "S1"
_:1 urn:name urn:Smith
_:1 urn:status "20"^^xsd:int
_:1 urn:city "London"^^xsd:string

It used to be something like:

Resource supplier = elementFactory.createResource();
URIReference supplierPred = elementFactory.
graph.add(supplier, supplierPred, "S1");