Thursday, March 29, 2007

MPTStore

Presentation Summary on MPTStore. A summary of an interesting approach by the Fedora guys to storing lots of triples, fast.

The real motivation behind experimenting with a new triplestore, however, was the NSDL use case. The National Science Digital Library (NSDL) is a moderately large repository (4.7 million objects, 250 million triples) with a lot of write activity (driven by periodic OAI harvests; primarily mixed ingests and datastream modifications). The NSDL data model also includes existential/referential integrity constraints that must be enforced. Querying the RI to determine correct repository state proved to be difficult: Kowari aggressively buffers triples, sometimes on the order of seconds, before writing them to disk. Flushing the buffer after every write is also computationally expensive (hence the drive to use buffers in the first place).


Based on this observation, their solution, called “Mapped Predicate Tables,” creates a table for every predicate in the triplestore. This has several advantages: a low computational cost for triple adds and deletes, queries for known predicates are fast, complex queries benefit from the relatively mature RDBMS planner having finer-granularity statistics and query plans, and flexible data partitioning to help address scalability. This solution comes with several disadvantages, however: one needs to manage predicate to table mapping, complex queries crossing many predicates require more effort to formulate, and with a naive approach simple unbound queries scale linearly with the number of predicates.
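To make the trade-off concrete, here is a minimal sketch of the table-per-predicate idea. It is not MPTStore's actual API: the class name, the t1, t2, ... table names and the (s, o) column names are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a predicate-to-table mapping; not MPTStore's real API.
class PredicateTableMapper {
    // Predicate URI -> table name, kept in order of first use.
    private final Map<String, String> tables = new LinkedHashMap<>();

    // Returns the table for a predicate, allocating t1, t2, ... on first use.
    String tableFor(String predicate) {
        String table = tables.get(predicate);
        if (table == null) {
            table = "t" + (tables.size() + 1);
            tables.put(predicate, table);
        }
        return table;
    }

    // Adding a triple touches only the one table that belongs to its predicate.
    String insertSql(String predicate) {
        return "INSERT INTO " + tableFor(predicate) + " (s, o) VALUES (?, ?)";
    }

    // A query with a known predicate reads a single, narrow table.
    String selectSql(String predicate) {
        return "SELECT s, o FROM " + tableFor(predicate) + " WHERE s = ?";
    }

    // The naive unbound-predicate query visits every table, so it grows
    // linearly with the number of predicates.
    String selectAllPredicatesSql() {
        StringBuilder sql = new StringBuilder();
        for (String table : tables.values()) {
            if (sql.length() > 0) {
                sql.append(" UNION ALL ");
            }
            sql.append("SELECT s, o FROM ").append(table).append(" WHERE s = ?");
        }
        return sql.toString();
    }
}
```

The last method is the disadvantage in miniature: with two predicates it is a two-way UNION, with two thousand it is a two-thousand-way UNION.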


They achieved basically the same performance with either asynchronous or synchronous modification.

The project is available on Sourceforge, including slides and javadoc (which has a similar design to JRDF except no blank nodes).

Wednesday, March 28, 2007

Reliving Old Times

SketchFighter 4000. If you're anything like me, and I assume you are, you will have done something like this. Or played one of the games inspired by Thrust, such as Oids.

Via, An open letter to Ambrosia Software: Please bring Sketch Fighter to the DS.

Monday, March 26, 2007

Scaling to Hundreds of Billions of Triples

In "Nova Spivack Talks with Talis" (in MP3 format), Spivack talks about the recent work by Radar Networks and how they require a triple store that scales to hundreds of billions of triples (from about 45 minutes in). He mentions an interesting business opportunity (48 minutes in) to build a scalable triple store that handles lots of writes and federated queries (he also mentions Tucana and Kowari and how it disappeared).

Sunday, March 25, 2007

News from Oracle

Oracle 11g Gains Native OWL Support and Oracle 11g to support some OWL inferencing. Following the links leads to the "Semantic Technologies Center" and an announcement about TopQuadrant TopBraid Composer supporting Oracle. Apart from the native OWL inferencing, faster querying and faster bulk loading, the PDF referenced talks about the scalability achieved (600 million triples), clustering, failover, concurrent read/write and query features such as the semantic relationship operators sem_related and sem_distance.

On the Web No One Can Agree on the Schema

Wednesday, March 21, 2007

A Better NetInfo

Sick of searching for an unused GID and manually adding properties using NetInfo? A recent Peachpit article, "Master Mac OS X Users and Groups by Making Your Mac Think It's a Server", suggests downloading the Server Admin tool. It can run on any OS X machine by just hitting cancel initially and then connecting to the local directory (Apple+D). It suggests using GIDs from 600 up so they don't clash with others.

Tuesday, March 20, 2007

Wondering About Transactions

Transactionless

A couple of years ago I was talking to a couple of friends of mine who were doing some work at eBay. It's always interesting to hear about the techniques people use on high volume sites, but perhaps one of the most interesting tidbits was that eBay does not use database transactions.

My immediate follow-up to the news of transactionless was to ask what the consequences were for the application programmer, in particular the overall feeling about transactionlessness. The reply was that it was odd at first, but ended up not being a big deal - much less of a problem than you might think. You have to pay attention to the order of your commits, getting the more important ones in first. At each commit you have to check that it succeeded and decide what to do if it fails.
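As a rough sketch of what that discipline might look like in code (this is my own illustration, not anything from eBay; the class and method names are made up), each write is registered in order of importance and its result is checked before the next one runs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical illustration of ordered, individually checked commits.
class OrderedCommits {
    private final List<String> names = new ArrayList<>();
    private final List<BooleanSupplier> writes = new ArrayList<>();

    // Register writes most-important first; each reports success as a boolean.
    OrderedCommits then(String name, BooleanSupplier write) {
        names.add(name);
        writes.add(write);
        return this;
    }

    // Runs the writes in order and stops at the first failure.
    // Returns the name of the failed write, or null if all succeeded.
    // Writes that already succeeded are NOT rolled back; the caller decides
    // what to do about them, which is the price of having no transaction.
    String run() {
        for (int i = 0; i < writes.size(); i++) {
            if (!writes.get(i).getAsBoolean()) {
                return names.get(i);
            }
        }
        return null;
    }
}
```

The important write lands first, so a later failure leaves the system in a state that is incomplete rather than wrong.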


I've been wondering about the usefulness of transactions for some time, and ACID in particular. One of the easiest properties to dismiss, surprisingly perhaps, is isolation. For example, skip list implementations ruin it, as the data may change during iteration, and it can easily be circumvented by writing the information from one transaction to a file to be read by something else in another (either by accident or on purpose). The alternative might be something like CAS and RDF triples. Expecting application programmers to do it sounds awful, however.

Also, noticed by, "Living Without Database Transactions".

Sunday, March 18, 2007

Self Selection

I'm a bit late with this one, but it was interesting nonetheless, "The Power of the Marginal":

If it's corrupt enough, a test becomes an anti-test, filtering out the people it should select by making them to do things only the wrong people would do. Popularity in high school seems to be such a test. There are plenty of similar ones in the grownup world. For example, rising up through the hierarchy of the average big company demands an attention to politics few thoughtful people could spare. [3] Someone like Bill Gates can grow a company under him, but it's hard to imagine him having the patience to climb the corporate ladder at General Electric—or Microsoft, actually.

It's kind of strange when you think about it, because lord-of-the-flies schools and bureaucratic companies are both the default. There are probably a lot of people who go from one to the other and never realize the whole world doesn't work this way.

I think that's one reason big companies are so often blindsided by startups. People at big companies don't realize the extent to which they live in an environment that is one large, ongoing test for the wrong qualities.

If you're an outsider, your best chances for beating insiders are obviously in fields where corrupt tests select a lame elite. But there's a catch: if the tests are corrupt, your victory won't be recognized, at least in your lifetime.


He also contradicts the Bible:

Almost everyone makes the mistake of treating ideas as if they were indications of character rather than talent—as if having a stupid idea made you stupid. There's a huge weight of tradition advising us to play it safe. "Even a fool is thought wise if he keeps silent," says the Old Testament (Proverbs 17:28).


And Yoda:

The word "try" is an especially valuable component. I disagree here with Yoda, who said there is no try. There is try. It implies there's no punishment if you fail. You're driven by curiosity instead of duty. That means the wind of procrastination will be in your favor: instead of avoiding this work, this will be what you do as a way of avoiding other work. And when you do it, you'll be in a better mood.


Also the entire movie file is available as are the others from that conference (Martin Fowler deconstructing Rails is good too) in mp4.

Friday, March 16, 2007

Radiant RDF

A new and hopefully readable article on XML.com, "A Relational View of the Semantic Web". The main reason for writing it is to elicit some feedback and to make a more understandable version of things like this. It also discusses whether or not to use blank nodes when mapping RDF to the relational model (which is something I first published in my thesis). It includes my take, which is no doubt influenced by Simon and working on Kowari, on the DISTINCT issue amongst others.

Update: Well if any feedback is good feedback, having positive feedback is awesome: "But it begs the question: given the relative proximity of RDF/SPARQL to the relational model, why are the Semantic Web standards not closer both syntactically and semantically to it? This is, for me, the major reason which prevents its adoption...Nice to see such a straightforward and well-referenced article outlining the Semantic Web and relational worlds in any case. Thank you."

The question raised in the comment, why not keep pushing SQL, is one I wish I had a short answer for. Often SPARQL syntax has been chosen to look like SQL. But obviously it's still different enough because it is a different domain. I still favour a clean break from SQL for that reason.

I wrote about SQL not being relational enough quite a while ago now (nearly 3 years). At the time, I think I was unaware of some of the operations in SQL like INTERSECT and EXCEPT (set difference).

Update 2: Also noted by Nova Spivack, "Excellent Overview of Benefits of RDF and SPARQL".

Tuesday, March 13, 2007

Guice Boost

Guice "Guice injects constructors, fields and methods (any methods with any number of arguments, not just setters). Guice includes advanced features such as custom scopes, circular dependencies, static member injection, Spring integration, and AOP Alliance method interception, most of which you can ignore until you need it."

The moment I saw IoC wiring in code I thought of one man's wish for better IoC wiring in code (we were using Spring at the time).

Behold:

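import com.google.inject.AbstractModule;
import com.google.inject.Scopes;

// Binds the Service interface to its implementation as a singleton.
public class MyModule extends AbstractModule {
    @Override
    protected void configure() {
        bind(Service.class)
            .to(ServiceImpl.class)
            .in(Scopes.SINGLETON);
    }
}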


And how to inject the Service implementation:

import com.google.inject.Inject;

public class Client {

    private final Service service;

    // Guice passes in whatever implementation is bound to Service.
    @Inject
    public Client(Service service) {
        this.service = service;
    }

    public void go() {
        service.go();
    }
}


Now you can test drive IoC wiring without feeling fruity (test driving XML always felt weird).
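A side benefit of the constructor injection shown above is that the client can be exercised with no container at all. The following is only a sketch: Service and Client are stand-ins for the classes in the post with @Inject left off so that it compiles without Guice on the classpath, and RecordingService and ClientWiringCheck are names I made up.

```java
// Stand-ins for the post's Service and Client, without the Guice annotation.
interface Service {
    void go();
}

class Client {
    private final Service service;

    Client(Service service) {
        this.service = service;
    }

    void go() {
        service.go();
    }
}

// Hypothetical hand-written stub that counts how often it is called.
class RecordingService implements Service {
    int calls = 0;

    public void go() {
        calls++;
    }
}

class ClientWiringCheck {
    // Wires a Client to the stub by hand and reports how many calls got through.
    static int callsAfterGo() {
        RecordingService stub = new RecordingService();
        new Client(stub).go();
        return stub.calls;
    }

    public static void main(String[] args) {
        System.out.println("calls seen by stub: " + callsAfterGo());
    }
}
```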

Which reminded me of a quote by Victoria Livschitz:

We now have a generation of young programmers who think of software in terms of angle brackets. An enormous mess of XML documents that are now being created by enterprises at an alarming rate will be haunting our industry for decades. With all that excitement, no one seems to have the slightest interest in basic computer science.


She was talking about web services but it seems applicable to IoC wiring too.

Via, Guice 1.0 Released.

Update: Similarly, Configuration in Java - It sure beats XML!. It also mentions using Java 6's hotpatch to reload the configuration.

Navigating Java Collections

As a follow-up to a posting a while back, here are two articles about NavigableSet and NavigableMap, the interfaces that extend SortedSet and SortedMap and make them easier to use. They are implemented by skip lists: ConcurrentSkipListSet and ConcurrentSkipListMap.

The first article describes in more detail the poll, iteration and LHCF (lower (L), higher (H), ceiling (C) and floor (F)) methods. The latter are useful for returning the elements of a sorted collection that are less than, greater than, greater than or equal to, and less than or equal to a given value.
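A small example of the four methods plus poll, using ConcurrentSkipListSet from java.util.concurrent. The class name and sample values are just for illustration.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.concurrent.ConcurrentSkipListSet;

class NavigableDemo {
    // A small sorted set to probe: 10, 20, 30.
    static NavigableSet<Integer> sample() {
        return new ConcurrentSkipListSet<>(Arrays.asList(10, 20, 30));
    }

    public static void main(String[] args) {
        NavigableSet<Integer> s = sample();
        System.out.println(s.lower(20));    // 10: greatest element strictly less than 20
        System.out.println(s.floor(20));    // 20: greatest element less than or equal to 20
        System.out.println(s.ceiling(20));  // 20: least element greater than or equal to 20
        System.out.println(s.higher(20));   // 30: least element strictly greater than 20
        System.out.println(s.floor(25));    // 20
        System.out.println(s.ceiling(25));  // 30
        System.out.println(s.lower(10));    // null: nothing is smaller than 10
        System.out.println(s.pollFirst());  // 10: returns and removes the first element
        System.out.println(s);              // [20, 30]
    }
}
```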

The second article notes the lack of a way to return subsets based on values rather than keys: "The disadvantage with NavigableMap is that the subMap method has no provision for returning a map based on a range of values rather than on keys. Therefore, implementing something similar to database views is laborious."
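To show the gap the article is pointing at, here is a sketch with made-up names and data. A key range is a single subMap call that returns a live view; a value range has to be built by hand, and what you get is a copy rather than a view.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

class ValueRangeDemo {
    static NavigableMap<String, Integer> sample() {
        NavigableMap<String, Integer> m = new ConcurrentSkipListMap<>();
        m.put("a", 5);
        m.put("b", 50);
        m.put("c", 7);
        m.put("d", 60);
        return m;
    }

    // Key range: built in, and the result is a view backed by the original map.
    static NavigableMap<String, Integer> byKeyRange(NavigableMap<String, Integer> m,
                                                    String from, String to) {
        return m.subMap(from, true, to, true);
    }

    // Value range: nothing built in, so scan every entry and copy the matches.
    // The result is a snapshot and will not follow later changes to the source.
    static NavigableMap<String, Integer> byValueRange(NavigableMap<String, Integer> m,
                                                      int min, int max) {
        NavigableMap<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Integer> e : m.entrySet()) {
            if (e.getValue() >= min && e.getValue() <= max) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> m = sample();
        System.out.println(byKeyRange(m, "b", "c"));  // {b=50, c=7}
        System.out.println(byValueRange(m, 1, 10));   // {a=5, c=7}
    }
}
```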

The thing to remember with these skip list data structures is: "Iterators are weakly consistent, returning elements reflecting the state of the map at some point at or since the creation of the iterator."
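A quick way to see what weakly consistent means in practice is to write to a ConcurrentSkipListMap while iterating over it. In this sketch (class name and keys are mine) no ConcurrentModificationException is thrown, and whether the iterator visits the entries added along the way is left open by the contract, so the only safe claim is a range.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

class WeakIteration {
    // Iterates over the map while inserting into it and returns how many
    // entries the iterator visited. Only keys below 1000 trigger an insert,
    // so the loop cannot feed itself forever.
    static int iterateWhileWriting(ConcurrentSkipListMap<Integer, String> map) {
        int visited = 0;
        for (Map.Entry<Integer, String> e : map.entrySet()) {
            if (e.getKey() < 1000) {
                map.put(e.getKey() + 1000, "added during iteration");
            }
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<Integer, String> map = new ConcurrentSkipListMap<>();
        map.put(1, "a");
        map.put(2, "b");
        map.put(3, "c");
        int visited = iterateWhileWriting(map);
        // visited is at least 3 and at most 6; the final size is always 6.
        System.out.println("visited " + visited + ", final size " + map.size());
    }
}
```

The same loop over a TreeMap would fail fast instead, which is the behaviour most of us have come to rely on as a poor man's isolation.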

Also, there's a recent article about being able to hotpatch in Java 6 (which Symantec Visual Cafe was doing back in 1997). Visual Cafe was an excellent IDE in its day, and in some respects many IDEs are still catching up, although it was behind in things like refactoring.

Friday, March 09, 2007

Offensive Erlang

erlErr


Handling errors in Erlang is very different from handling errors in all other programming languages [...]. One of the most important rules we'll learn is not to program defensively. The idea of defensive programming - checking all arguments to a function - is alien to Erlang. So much so that we say that if the arguments to a function are incorrect then you should just let your program crash.

This approach will seem very strange at first, but don't worry. The benefits are worth the effort. To start with, your programs will be a lot shorter. Sometimes as much as 30% of a conventional program can be devoted to defensive testing of function arguments - there is no such code in an Erlang program.

Incremental Adobe

Adobe edits the development cycle

The change we made was going from a traditional waterfall method to an incremental development model. Before, we would specify features up front, work on features until a "feature complete" date, and then (supposedly) revise the features based on alpha and beta testing and fix bugs. But we were scrambling so hard to get all the committed features in by the feature complete date - working nights and weekends up to the deadline - that the program was always very buggy at that point. We'd be desperately finding and fixing bugs, with little time to revise features based on tester feedback.

At the end of every cycle, we faced a huge "bugalanch" that required us to work many nights and weekends again. Of the three variables: features, schedule, and quality, the company sets the schedule and it's only slightly negotiable. Until feature complete, we could adjust the feature knob. But when we hit that milestone, quality sucked and we had only a fixed amount of time until the end. From there to the end, cutting features was not an option and all we could do was trade off our quality of life to get the quality of the product to the level we wanted by the ship date. We've never sacrificed product quality to get the product out the door, but we've sacrificed our home lives.


The quality of the program was higher throughout the development cycle, and there have been fewer total bugs. Instead of the bug count climbing towards a (frighteningly high) peak at "feature complete", it stayed at a much lower plateau throughout development. And we were able to incorporate more feedback from outside testers because we didn't switch into "frantic bug fix mode" so early.

Monday, March 05, 2007

Alphabetti Spaghetti!

Triple Soup "Apache TripleSoup is an effort started within the Apache Incubator to create a SPARQL endpoint that is easy to set up, fast, maintainable, and such and so forth. We're only just getting started..."

The Triple Soup Wiki has more information: "TripleSoup is the simplest thing that you can do to turn your apache web server into a SPARQL endpoint. TripleSoup will be an RDF [2] store [3], tooling to work with that database, and a REST [4] web interface to talk to that database using SPARQL [5], implemented as an apache webserver module."

It's to be implemented in C and the authors of Redland (Dave Beckett) and B or lib B, a storage backend for RDF developed at Joost (Andrea Marchesini) are involved. Via Simon.

Also mentioned, "SPARQL Via HTTP Methods".

Sunday, March 04, 2007

RDF is DOOMed and other links