Sunday, April 29, 2007

You Are Here, Now

Just Fearful Enough

Profile of Jon Stewart
It's sort of odd, and I've always had this problem with the rationality of it. The President says, "We are in the fight for a way of life. This is the greatest battle of our generation, and of the generations to come. And so, what I'm going to do is, you know, Iraq has to be won, or our way of life ends, and our children and our children's children all suffer. So, what I'm gonna do is send 10,000 more troops to Baghdad."

So, there's a disconnect there between — you're telling me this is the fight of our generation, and you're going to increase troops by 10 percent. And that's gonna do it. I'm sure what he would like to do is send 400,000 more troops there, but he can't, because he doesn't have them. And the way to get that would be to institute a draft. And the minute you do that, suddenly the country's not so damn busy anymore. And then they really fight back, and then the whole thing falls apart. So, they have a really delicate balance to walk between keeping us relatively fearful, but not so fearful that we stop what we're doing and really examine how it is that they've been waging this.

And there was, you know, this enormous amount of space and coverage given to Virginia Tech, as there should have been. And I happened to catch sort of a headline lower down, which was 200 people killed in four bomb attacks in Iraq. And I think my focus on what was happening here versus sort of this peripheral vision thing that caught my eye about, "Oh, right, there are lives--"

Saturday, April 28, 2007

Planet RDF Roundup

  • Research Project: Pig Given Yahoo's usage of Hadoop, it's good to see them building a query layer on top of it. And it's not SQL - hence the name - it's relationally based (it uses bags and still has DISTINCT) because that scales. Hah! Some of the documentation is gold: "In a conventional database management system, SQL queries are translated into relational algebra expressions, which are in turn translated into physical evaluation plans. Pig Latin queries are already an algebra, so we're bypassing the first layer." It even has nested relations and a flatten operation (which removes the nesting). There's a good read at "Yahoo Pig and Google Sawzall", which notes that Google's Sawzall looks like Scala.

  • REST Compile/Describe & WADL This, together with "I finally get REST. Wow.", all link to WADL. In what seems like an age ago, I remember several enterprise architects asking what the REST equivalent of WSDL was. The latter has one of the best one-liners to describe REST: "state machine as node graph traversed via URI". An example of how to use it is given in "REST Describe first working Beta released".

  • Stavanger, oil, and the Semantic Web Talks about the "Norwegian Semantic Web Days", which covered things such as the OBO foundry.
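
The nested relations and flatten operation mentioned in the Pig item can be sketched in a few lines. This is a toy in Python, not Pig Latin: relations as bags (lists) of tuples, with a hypothetical `flatten` that un-nests one field the way Pig's FLATTEN does.

```python
# A toy sketch (not Pig itself) of the nested-bag data model the Pig
# documentation describes: relations are bags of tuples, tuples may
# contain nested bags, and FLATTEN removes one level of nesting.

def flatten(relation, field):
    """Un-nest the bag stored at `field` in each tuple, producing one
    output tuple per inner element (the analogue of Pig's FLATTEN)."""
    out = []
    for tup in relation:
        for inner in tup[field]:
            flat = dict(tup)      # copy the tuple...
            flat[field] = inner   # ...replacing the bag with one element
            out.append(flat)
    return out

# A relation whose 'urls' field is a nested bag.
queries = [
    {"user": "alice", "urls": ["a.com", "b.com"]},
    {"user": "bob",   "urls": ["a.com"]},
]

flat = flatten(queries, "urls")
# flat now holds three tuples, one per (user, url) pair.
```

The field name `urls` and the data are invented for illustration; the point is just that the algebra operates on bags directly rather than going through an SQL translation layer.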

Tuesday, April 24, 2007

Jobs for my Dog

So I'm idly looking at the escalating real estate market and considering ways my dog could finally make his living. I thought back across jobs I'd seen that could be done equally well by him:

  1. A station master. While this sounds like a job requiring thumbs, really not so much. The station being mastered no longer received any trains but still took truck deliveries. Not very many though. Now, the station master is unable to load or unload these deliveries - this is done by the drivers. So the job mainly consisted of watching people drop things off and pick things up; a job for my dog.

  2. Train driver. In recent memory an accident occurred that had the distinct possibility of being caused by both drivers doing things other than driving the train. The introduction of an automated system was suggested to prevent it happening again. Driving an automated train is something that my dog could do quite well. Actually, he could do it twice as well and he wouldn't join the union. I'm not sure he'd stick to it, though, as it would bore the heck out of him.

  3. Doorman. Now normally this would require the ability to open and close doors. But more recently, I noticed there was a guy at a building watching an automatic door open and close. Maybe he gave directions or something but the door watching seemed to be his prime activity. I've seen my dog look attentively at the front door - he's very qualified.

  4. Network Administrator. Maybe the frontal lobes need to be developed further for this one but then again maybe not. The key activities undertaken for this position, that I've seen, revolved around attending meetings and playing Solitaire. I think my dog could do this job and he wouldn't even need a computer. Apparently, this is not an uncommon job amongst dogs.

  5. Phone Handler (?). These are the people I've found at the front of buildings saying, "Please use that phone to contact people before going up". Usually replaced by a sign, I think an indicative paw would do as well.

Mistakes I've Made a Few

So the ghosts of implementations past have come to haunt me in recent weeks mostly around Kowari/Tucana (now Mulgara).

Firstly, there was exclude and NAF in iTQL:

> It was introduced to provide
> a limited form of negation, and one that interacts poorly with the
> open-world assumption. We also now have minus, which is well
> defined, corresponds closely to our intuitive understanding of the
> operation, and is (I am told) what was actually required.
> If my memory is correct we should probably at least deprecate, if
> not remove exclude entirely from mulgara.
> Do we agree that exclude should be removed?

Well *I* agree anyway. As you say, it doesn't do what anyone thinks
it does.

> If it should be removed, when should this occur?


Secondly, there was how adding the Jena API to Kowari cost us customers:
Another possible reason is Tucana/Kowari/Mulgara's Jena support - originally put in to provide a migration path for companies looking to move on from research projects to scalable infrastructure - which, as Jena is the de facto semweb tool of choice, people used to evaluate Kowari's scalability. Jena's lack of scaling hurt us several times; I can remember lots of frantic calls as some company wrote us off because of our Jena API.

I'm currently still working in this area (maybe somewhat surprisingly) and Jena still dominates (all of the tools I'm currently looking at are Jena based). And I still haven't seen a Jena implementation that scales (see page 10). Maybe the decision to open source Kowari cost another round of funding too. Maybe this is why Garlik's or Radar Networks' triple stores are still behind closed doors.

Thursday, April 19, 2007

Extending SPARQL

A few papers related to extending SPARQL, I may update this as I find more.

  • Apart from SPARQ2L and PSPARQL, there's also, at the ESWC 2007 conference: iSPARQL (which uses SimPack) and SPARQLeR (which supports some quite sophisticated path queries)

  • SPARQL-DL: SPARQL Query for OWL-DL An enhancement of SPARQL to support DL semantics. And they note: "SPARQL-DL is a step between RDF QLs that are too unstructured w.r.t. OWL-DL and DL QLs which are not as expressive. We believe SPARQL-DL would help interoperability on the Semantic Web as it bridges this gap. As part of future work, we intend to investigate other possible extensions to SPARQL-DL including (but not limited to) aggregation operators, epistemic operators (and negation as failure), and regular expressions on OWL properties."

  • SPARQL/Update Adding INSERT, MODIFY, DELETE and UPDATE to SPARQL. I'd seen this before but hadn't linked to it.

Monday, April 16, 2007

State of Origin

Same-Origin Policy Part 1: Why we’re stuck with things like XSS and XSRF/CSRF
In my experience most developers—and even many security people—don’t really know what the same-origin policy is. Worse yet, the rise of AJAX and mash-ups seems to have turned same-origin into something developers are trying to break. Complicating the issue further are the weaknesses in most browsers’ implementations of same-origin, leaving open questions about the effectiveness of the policy itself.

Talking to Web 2.0 gurus recently, I was surprised that not everyone understood this. The article points to a very clear definition of what exactly counts as the same origin. See also: Subverting Ajax, which includes things like using XSS to extend the XMLHttpRequest object to capture calls and record the data being transmitted.
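
For the record, the core check itself is tiny. A minimal sketch, ignoring the many real-browser wrinkles (document.domain, default ports, and so on): two URLs share an origin only when scheme, host and port all match.

```python
# A minimal same-origin sketch: origin = (scheme, host, port).
# Real browser behaviour has more wrinkles; this is just the core rule.
from urllib.parse import urlsplit

def same_origin(url_a, url_b):
    a, b = urlsplit(url_a), urlsplit(url_b)
    return (a.scheme, a.hostname, a.port) == (b.scheme, b.hostname, b.port)

# A different path is fine; a different scheme, subdomain or port is
# a different origin.
checks = [
    same_origin("http://example.com/a", "http://example.com/b"),
    same_origin("http://example.com/", "https://example.com/"),
    same_origin("http://example.com/", "http://www.example.com/"),
    same_origin("http://example.com/", "http://example.com:8080/"),
]
```

Which is exactly why mash-ups chafe against it: any cross-site call fails all but the first comparison.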

Update: Just noticed that the latest Crypto-Gram highlights a similar attack overriding object creation.

Ontologies in 3D

"OntoSphere3D is a Protégé plug-in for ontologies navigation and inspection using a 3-dimensional hyper-space where information is presented on a 3D view-port enriched by several visual cues (as the colour or the size of visualized entities)."

While I'm not usually a fan of 3D interfaces, this is a fairly intuitive approach. It certainly seems better than the neighbourhood view and hyperbolic view that they talk about in their paper. They offer a "global view" (projected onto a sphere) and "tree focus".

Java 3D can be a pain to install; at least it was for me, as it wouldn't detect Protégé's JRE to install into.

Tuesday, April 10, 2007

Lucene for the Semantic Web

Google's Bigtable, a distributed storage system for structured data, is a very effective mechanism for storing very large amounts of data in a distributed environment.

Just as Bigtable leverages the distributed data storage provided by the Google File System, Hbase will provide Bigtable-like capabilities on top of Hadoop.

Data is organized into tables, rows and columns, but a query language like SQL is not supported. Instead, an Iterator-like interface is available for scanning through a row range (and of course there is an ability to retrieve a column value for a specific key).

Any particular column may have multiple values for the same row key. A secondary key can be provided to select a particular value or an Iterator can be set up to scan through the key-value pairs for that column given a specific row key.

From the Hbase/HbaseArchitecture page:
HBase uses a data model very similar to that of Bigtable. Users store data rows in labelled tables. A data row has a sortable key and an arbitrary number of columns. The table is stored sparsely, so that rows in the same table can have crazily-varying columns, if the user likes.

A column name has the form "<family>:<label>".

The example tables given are very similar to untyped relations. This has only just become part of the nightly build.
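
The model above amounts to a sparse, sorted map of row key to columns, queried by point lookups and range scans rather than SQL. A toy in-memory sketch of that shape (not HBase's actual API; the class and row keys here are invented):

```python
# A toy sketch of the Bigtable/HBase data model: a table is a sparse,
# sorted map of row key -> {column: value}, with point gets and an
# iterator-style scan over a row range instead of SQL.
import bisect

class ToyTable:
    def __init__(self):
        self.keys = []    # row keys, kept sorted
        self.rows = {}    # row key -> {column: value}

    def put(self, row, column, value):
        if row not in self.rows:
            bisect.insort(self.keys, row)
            self.rows[row] = {}
        self.rows[row][column] = value

    def get(self, row, column):
        return self.rows.get(row, {}).get(column)

    def scan(self, start, stop):
        """Yield (row, columns) for start <= row < stop, in key order."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        for key in self.keys[lo:hi]:
            yield key, self.rows[key]

t = ToyTable()
t.put("com.example/a", "anchor:home", "Home")
t.put("com.example/b", "contents:", "<html>...</html>")
t.put("org.other/x", "contents:", "<html>...</html>")

# Rows are sparse - different rows can carry different columns - and a
# scan over a key prefix range picks out related rows cheaply.
rows = list(t.scan("com.example", "com.examplf"))
```

Sorted row keys are what make the range scan the natural access path here, which is why Bigtable-style stores push you to encode locality into the key itself.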

Via Data Parallel.

RDF Path Queries

SPARQ2L: Towards Support For Subgraph Extraction Queries in RDF Databases

Many applications in analytical domains often have the need to connect the dots i.e., query about the structure of data. In bioinformatics for example, it is typical to want to query about interactions between proteins. The aim of such queries is to extract relationships between entities i.e. paths from a data graph. Often, such queries will specify certain constraints that qualifying results must satisfy e.g. paths involving a set of mandatory nodes. Unfortunately, most present day Semantic Web query languages including the current draft of the anticipated recommendation SPARQL, lack the ability to express queries about arbitrary path structures in data.

Implemented using Java and Berkeley DB and the memory store Brahms. Also mentions PSPARQL (part of Exmo), which in February reached version complete status.
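
The kind of query the abstract describes - extract the paths between two entities, constrained to visit some mandatory nodes - can be sketched with a simple graph search. This is a hypothetical illustration in Python of the query shape, not SPARQ2L's algorithm:

```python
# A small sketch of the query shape SPARQ2L targets and SPARQL (as
# drafted) cannot express: enumerate the paths between two nodes,
# keeping only those passing through a set of mandatory nodes.

def paths(graph, start, end, mandatory=frozenset()):
    """Enumerate simple paths start -> end; keep those that visit all
    mandatory nodes. `graph` maps a node to its successor set."""
    results, stack = [], [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            if mandatory <= set(path):
                results.append(path)
            continue
        for nxt in graph.get(node, ()):
            if nxt not in path:          # keep paths simple (no cycles)
                stack.append((nxt, path + [nxt]))
    return results

# A toy protein-interaction style graph (names invented).
g = {"p1": {"p2", "p3"}, "p2": {"p4"}, "p3": {"p4"}, "p4": {"p5"}}

all_paths = paths(g, "p1", "p5")                 # both routes
via_p2 = paths(g, "p1", "p5", mandatory={"p2"})  # only the p2 route
```

The interesting engineering in the paper is doing this at database scale, where materialising every candidate path like this naive search does is exactly what you can't afford.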

Friday, April 06, 2007

The Killer Demo

What a relief: there's a demo that can turn someone onto the “semantic” Web; this has obviously been a long time coming. As long as you believe the hype, the way you get the Semantic Web is one A-list blogger at a time.

Other comments: "Scoble Gets the Semantic Web", "Scoble Gets the Semantic Web", "Nova Spivack sees to it that Robert Scoble finally gets the Semantic Web", "You say Tomato...", "Describing the Semantic Data Web (Take 2)" and "QOTD : Scobleized".


ETech '07 Summary - Part 2 - MegaData.

Here's the thing, we need a new kind of data store, a new kind of SQL, something that does for storing and querying large amounts of data what SQL did for normalized data.

Sure you can store a lot of data in a relational database, but when I say large, I mean really large; a billion or more records. I know we need this because I keep seeing people build it.

All this talk about making SPARQL behave like SQL may be for nothing if people realize that's not what they need after all.

The back of the envelope scalability for an RDF store would be potentially 100s of billions of statements.

The key requirements highlighted are: distributed, joinless (no referential integrity at the store level), denormalized and transactionless.

I was aware of this because a comment linked to one of my previous posts about Kowari scalability (which I must have snuck through at some stage). Kowari got up to 10,000 triples/second later in its life.
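
Putting those two numbers together makes the "distributed" requirement concrete. A back-of-envelope calculation, using 100 billion statements as a round figure for the low end of the target above:

```python
# Back-of-envelope: how long a single loader at Kowari's later rate of
# 10,000 triples/second takes to ingest 100 billion statements.
statements = 100_000_000_000
rate = 10_000                      # triples per second, one stream

seconds = statements / rate        # 10 million seconds
days = seconds / (60 * 60 * 24)    # roughly 116 days of continuous load
```

Nearly four months of non-stop loading before a single query runs - so at this scale a single-node, transactional store isn't just slow, it's the wrong shape, which is the MegaData argument in miniature.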

Wednesday, April 04, 2007

What no MOOSE?

I promised myself I wasn't going to do this again, but I have one more, one more reason why even having DISTINCT/LOOSE/CHOOSE as an option in SPARQL is a bad idea. Part of this is stimulated by the fact that once more I'm sitting next to people trying to make the Semantic Web work, and from my perspective SPARQL is letting them down.

It's not a new reason, it's one I wrote in 2004 which offers a pretty good reason why having this as an optional feature doesn't make sense for RDF:

"The other issue with the SPARQL is the lack of an implicit distinct. In my understanding of SQL, DISTINCT is optional because if your queries work on normalized data and joins are based on distinct keys then the returned results cannot be duplicated. If your query works on rows with repeated values on the same column then you apply DISTINCT.

In RDF's data model there isn't really this problem of duplicated data and normalization. SPARQL has the idea of matching statements in the graph. From my understanding, RDF's data model doesn't support the idea of multiple subject, predicates and/or objects with the same values.

In other words, it only seems valid that if a query matches one result in the graph it should return that one unique result not repeated multiple results."

This is on top of the other reasons I came up with in "Bagging SPARQL". This could actually be seen as further discussion from the initial response I got. Among other things, it was said that duplicates could arise by querying multiple graphs. I'd argue that forced distinct values provide the context to effectively count (or perform other aggregate operations) across these multiple graphs.
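
The multiple-graphs point can be made concrete with a toy pattern matcher. This is a Python illustration of the argument, not any real SPARQL engine; graphs are sets of triples, as in RDF's data model:

```python
# An RDF graph is a *set* of statements, so matching one graph cannot
# yield duplicate rows. Naively concatenating matches from several
# graphs can - unless results are forced distinct.

g1 = {("alice", "knows", "bob"), ("alice", "knows", "carol")}
g2 = {("alice", "knows", "bob")}   # repeats a statement from g1

def match_objects(graph, s, p):
    """Bindings of ?o for the pattern (s, p, ?o) in one graph."""
    return [o for (s2, p2, o) in graph if (s2, p2) == (s, p)]

# Bag semantics across graphs: the shared statement shows up twice.
bag = (match_objects(g1, "alice", "knows")
       + match_objects(g2, "alice", "knows"))

# Forced-distinct semantics: one row per unique binding, which also
# gives a well-defined basis for counting across graphs.
distinct = set(bag)
```

Counting `bag` tells you about how the data happens to be partitioned into graphs; counting `distinct` tells you how many people alice knows, which is the question being asked.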

It's three years on and they couldn't even allow users to declaratively count the number of statements in this mystical, future web of data.

Tuesday, April 03, 2007

Patterns In Software - Part 4

Patterns of Software: Tales from the Software Community (freely available in PDF). The chapter I'll be covering this time is: "The Failure of Pattern Languages".

So it begins with the punch that the last few chapters have been building up to:

Alexander’s story does not end with the publication of A Pattern Language in 1977. It went on and still goes on. And Alexander did not sit still after he wrote the patterns. Like any scientist he tried them out.

And they did not work.

The reason, quoted by Gabriel, is that processes other than architecture play a more fundamental role. A building is not just the result of slapping together a bunch of very well thought out patterns based on empirical evidence; it's also the result of many other processes like finance, zoning, construction, etc.:

The problem with the building process is lump-sum development. In such development few resources are brought to bear on the problems of repair and piecemeal growth. Instead, a large sum of money is dedicated to building a large artifact, and that artifact is allowed to deteriorate somewhat, and anything that is found lacking in the design or construction is ignored or minimally addressed until it is feasible to abandon the building and construct a replacement.

Gabriel gives an example from a previous chapter about how these processes can destroy the otherwise good qualities of development. For example, the process of getting a mortgage and paying it off directly influences the types of buildings built and used. Generally, you invest a large amount of money in a property - so large that you usually can't afford to make piecemeal modifications, because you can barely afford to pay it off. This is okay as long as people are jumping from house to house. This avoids fixing up the problems there may be with these houses until things get really bad and they are knocked down and rebuilt. Perhaps losing what was wrong with them in the first place.

So if you retain these old processes you still get the old results. Now it's clear in software that this is also the case - a nicely architected and controlled software project does not necessarily lead to high quality software. How often have you worked on a software project that you knew was going to be re-written in a few years anyway?

Gabriel suggests that the answer lies in good code and coders, not in the typical separation of analysis, design and implementation:

And isn’t the old-style software methodology to put design in the hands of analysts and designers and to put coding in the hands of lowly coders, sometimes offshore coders who can be paid the lowest wages to do the least important work?

Methodologists who insist on separating analysis and design from coding are missing the essential feature of design: The design is in the code, not in a document or in a diagram. Half a programmer’s time is spent exploring the code, not in typing it in or changing it. When you look at the code you see its design, and that’s most of what you’re looking at, and it’s mostly while coding that you’re designing.

And finally, that the typical software patterns are not following Christopher Alexander's original concept of a pattern language:

When I look at software patterns and pattern languages, I don’t see the quality without a name in them, either. Recall that Alexander said that both the patterns themselves and the pattern language have the quality. In many cases today the pattern languages are written quickly, sort of like students doing homework problems. I heard one neophyte pattern writer say that when writing patterns he just writes what he knows and can write four or five patterns at a sitting.

The answer is to build software piecemeal, incrementally, partially designed and reflecting - much more like a Turkish rug and that's the next chapter, "The Bead Game, Rugs, and Beauty".

Defenders of the Web

Tim Berners-Lee goes postal on spam: "So how are you going to stop the Semantic Web being poisoned?"

TBL, the GLB, replied:

Well, everybody who's building the semantic web pretty much that I know are building systems take data from lots of places, but take data with an awareness of where those places are. So for example, suppose you're getting Geotags and the OS runs a service, lots of people in this country might trust the OS to say this point has a church with a spire - other people might say it's a great church to go to, other people might say it's a heathen church to go to... those are the other sources of data...

There was no let up from the press:

"But that was the basis for Google, and Google got poisoned... "

Shadbolt and Hendler stepped in to shield Sir Tim, but he was seething at the impertinence:

I remember a conference, we were discussing the Semantic Web, and someone asked what do you think is the worst thing that can happen and all the pencils come out. I know you two have been asking about "Woargh - I know the one about... what about the bad guys? Won't we be phished" There's a temptation to give readers about all the terrible things out there OK, and all the ways the web can become less usable.

At this point, your reporter wanted to remind Sir Tim that of all the problems the web has, a hostile press is not one of them. In fact, you can't pick up a newspaper or magazine without reading about how it's ushering in a New Age of Enlightenment. Time magazine gave "Person Of The Year" to every web user in America - or at least every one who looked at the mirror Time placed on its front cover.

He continued, cryptically:

Yes you'll find a bank that's less usable - ... I've never been phished.

So the Greatest Living Briton has never been phished, which is a relief. His answer to the Semantic Web didn't inspire much confidence for the rest of us: it would be used within the firewall, amongst trusted groups, "areas where one is much less worrying about the bad guys".

Monday, April 02, 2007

Testing and AJAX

Two completely unrelated areas of interest.

The first is Unitils which automates a number of testing tasks. It includes:
* Equality assertion through reflection, with options like ignoring Java default/null values and ignoring order of collections.
* Support for database testing involving test data management with DbUnit, automatic maintenance of unit test databases and automatic constraints disabling
* Hibernate integration features such as session management and testing the mapping with the database
* Integration with Spring, involving ApplicationContext management and injection of spring managed beans
* Integration with EasyMock and injection of mocks into other objects

The cookbook offers concrete examples for doing things like mock object injection.
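
The first bullet - equality assertion through reflection, with lenient handling of defaults and collection order - is the feature I find most interesting. A rough sketch of the idea in Python (the real thing is a Java API; this function and its option names are invented for illustration):

```python
# A rough sketch of reflection-style lenient equality: compare objects
# field by field, optionally ignoring "default" (None) expected values
# and ignoring the order of collections. Not the Unitils API.

def reflection_equals(a, b, ignore_defaults=False, lenient_order=False):
    if isinstance(a, dict) and isinstance(b, dict):
        keys = set(a) | set(b)
        return all(
            (ignore_defaults and a.get(k) is None)   # skip unset expectations
            or reflection_equals(a.get(k), b.get(k),
                                 ignore_defaults, lenient_order)
            for k in keys
        )
    if isinstance(a, list) and isinstance(b, list):
        if lenient_order:
            return sorted(map(repr, a)) == sorted(map(repr, b))
        return len(a) == len(b) and all(
            reflection_equals(x, y, ignore_defaults, lenient_order)
            for x, y in zip(a, b)
        )
    return a == b

expected = {"name": "kowari", "tags": ["rdf", "store"], "port": None}
actual = {"name": "kowari", "tags": ["store", "rdf"], "port": 8080}

strict = reflection_equals(expected, actual)
lenient = reflection_equals(expected, actual,
                            ignore_defaults=True, lenient_order=True)
```

Strict comparison fails on the reordered list and the unset port; with both options on, only the fields the test actually cares about are compared - which is what makes this style of assertion so much less brittle than field-by-field `assertEquals` calls.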

The second is Slingshot: Desktop Apps via Rails; "Slingshot looks like it wraps up a Rails app and packages it in such a way that the user is opening up a Site Specific Browser that accesses your world". It's seen as an alternative to Firefox 3 or Apollo.

Sunday, April 01, 2007

Just One More Layer of Abstraction

" Open Source, Open data, Process Models" has Sean McGrath linking to "Open Data matters more than Open Source" which is a comment on "Open Source is Dead".

Sean writes:

Traditionally, reference implementations (i.e. traditional source code) has been the way to do this. "Running code" is the final arbiter.

Maybe this is as good as it gets? Unfortunately, a fully blown word processor runs to many, many thousands of lines of code and the semantic devil is buried way down in the details...

Dave writes:

Until we can convince (or force) web sites to embrace and standardize on Open Data formats — XML, JSON, or even CSV, as appropriate — we will be in some ways even more locked in than we were in the bad old desktop days.

Dare writes:

Similarly, how much value do you think there is to be had from a snapshot of the source code for eBay or Facebook being made available? This is one area where Open Source offers no solution to the problem of vendor lock-in. In addition, the fact that we are increasingly moving to a Web-based world means that Open Source will be less and less effective as a mechanism for preventing vendor-lockin in the software industry. This is why Open Source is dead, as it will cease to be relevant in a world where most consumers of software actually use services as opposed to installing and maintaining software that is "distributed" to them.

The point Sean is making is that even if we achieve what Dave is suggesting, we still haven't solved the semantic problem. Making data explicit and non-proprietary is not achieved by XML, JSON or CSV - these just aren't descriptive enough. And having running code is all fine, but it's not generic enough - it will be tied to Java or C# or whatever.

The answer is of course both: a data format that is descriptive enough (like RDF/OWL) and open source stores that have the ability to process large quantities of it (because you will have vast quantities of your own data in the future and you won't want one company to own it).