Monday, January 31, 2005

Ruling

Surprisingly, I might be doing something for this in my current job, so I thought I'd go through the rules engines I've come across over the years. Paul recently blogged: "I'm sure Andrew will have blogged this years ago, but I just found Jess. This is a Java implementation of the Rete algorithm." He's right: I first blogged it back in 2002 (linking to Role of Java in the Semantic Web). As Danny mentions, the licence for Jess is unusual.

Some interesting links:
* A good list of rules engines in Java: Open Source Rule Engines Written In Java.
* Mandarax "Mandarax is based on backward reasoning....The easy integration of all kinds of data sources. E.g., database records can be easily integrated as sets of facts and reflection is used in order to integrate functionality available in the object model. Other data sources (like EJB, data returned by web services etc) can be integrated as well." The Mandarax blog mentions Mandarax Event Calculus Module "...a formalism for representing and reasoning about events and their effects in a logic programming framework." EC and Mandarax 3.3.2 is available from the Mandarax SF page.
* SWRL 0.7 RDF and XML concrete syntax that "...extends the set of OWL axioms to include Horn-like rules."
* Proposal for rules (including in RDF).
* ROWL "...source code and instructions to frame rules in OWL, transform them into Jess (Java Expert System Shell) rules. In addition, the page also contains a small tutorial that will help you frame rules in RDF/XML using our rule ontology."
* Rewerse - Reasoning-aware Querying "The objective of the WG is to develop, implement, and test (on selected Semantic Web applications) such a (provisionally rule-based) language at a pre-standard level."
* Pychinko "Pychinko is a Python implementation of the classic Rete algorithm (see Forgy82 for original report.)"

Sunday, January 30, 2005

Too Many Cooks

How Not to Write FORTRAN in Any Language "Style guides: I hate ’em. After all, I know which style is the best: mine! Style guides often appear to be dreary lists of arbitrary-seeming rules that limit my creativity. Reading them puts me to sleep.

When I maintain code, however, I set aside my personal style and try to match the style of the project. I want my code to look exactly like everyone else’s code, at least as far as the style guide goes. The reason for this, again, is familiarity. (Is this sounding familiar?) If you use the same coding conventions throughout a software project, the maintainers will grow accustomed to the style and it will magically become transparent to them. They will see the code, not the style.

Lack of consistency is one of the hallmarks of bad code. If 30 different people worked on a source file, I really, really don’t want to see 30 different coding styles or naming schemes when I read it. It becomes a nightmare to attempt to find structure in code like that. Coders have to be humble and accept that for code to be readable, their favorite style is not as good as the established style."

Changing Software and Legacy Code "The last consequence of avoiding change is fear. Unfortunately, many teams live with incredible fear of change and it gets worse every day. Often they aren't aware of how much fear they have until they learn better techniques and the fear starts to fade away."

Saturday, January 29, 2005

New SOFA

SOFA (Simple Ontology Framework API) has a new home, and a new release is available for download: updated support for Jena 2, new methods on OntologyModel, and ThingModel discarded (will have to change this for Kowari).

Semantic Web Deployment

Piggy-Bank "The Piggy-Bank extension is designed to let users of the Mozilla Firefox browser collect and browse "semantic data" linked from ordinary web pages. Semantic data is data described using the Semantic Web project's Resource Description Framework (RDF)."

Similar to Annozilla but it comes with its own server - neat use of Firefox extensions.

"Piggy-Bank uses the HSQLDB databases, indexed using the Lucene library. Piggy-Bank insternally runs a Jetty web server to serves content retrieved from the databases, dynamically formatted using the Velocity web templating engine."

OS X users will have to use the Java Plugin to enable Java 1.4 in Firefox. But why use Firefox? It can't use ActiveX (it's 1997 all over again).

American Remakes

test screening of HITCHHIKER'S GUIDE TO THE GALAXY "...Speaking of Trillian, needless to say, the arc of her and Arthur falling for each other felt really out of place and was entirely gratuitous."

Reminds me of the following:
* Red Dwarf USA or White Dwarf.
* Office remake 'a flop' "Daniels makes the same mistake that Coupling did by basically copying the original so much that you might as well just ignore this and buy the DVDs from BBC Video."
* American market adapts British TV for new audience "Even from the start, the show faced changes by crossing the Atlantic. Some of the spicier sex jokes were cut, and the half-hour episodes were shortened eight minutes to inject commercials."
* Faulty Towers "John Cleese was a bit baffled with the Amanda's: "I asked the American company how the adaptation was looking, and they told me `It's looking good - we've only made one change.' They wrote out Basil Fawlty, which I found incomprehensible.""

A list of remakes here. Doesn't look like they've attempted Press Gang, yet.

Friday, January 28, 2005

The Digital Librarian

SIMILE: Practical Metadata for the Semantic Web "Like any good system developed in collaboration with a research library, DSpace manages metadata about the content it manages and distributes on the web. However, its metadata support is currently limited to the general but relatively small Dublin Core descriptive metadata schema. In the future, DSpace needs to support additional metadata schemas for a variety of purposes: finding digital research material described in various, domain-specific ways, and managing that digital content over time in order to preserve it. As DSpace expands to use new metadata schemas, it will have to deal with the problem of interoperability.

Enter the Semantic Web and extensible metadata. The Semantic Web Core stack — RDF, RDFS, and OWL — enables people to create ontologies to describe their specialized metadata (perhaps building on existing, more general ontologies) and to make them generally reusable. But most people are not trained Semantic Web developers. They are going to need some tools for this and also to be able to assess whether they did the job correctly."

"For users, can we design faceted browsing interfaces that scale to dozens of RDF ontologies? How about improving navigation across the linkages between ontologies? How can we support searching that will start in one domain/ontology and expand into relevant related domains/ontologies?"

Kowari, Jobs and stuff

This is just an update on the crisis from a few weeks ago.

First off, Kowari development hasn't stopped. There have been code changes from Simon, David M and myself. More importantly, there's been quite a lot of planning, with more to come. It's also been encouraging to get so many potential leads for future paid work. However, unlike some of the other people from Tucana, I needed to have work lined up, for certain, by about now. So with that in mind I've taken a J2EE contracting job for 3 months, starting Monday. While working on Kowari we've taken reduced pay or no pay, but the periods have been somewhat predictable, allowing me to plan ahead. Without any of that, things had to happen faster than I would've liked.

Already though, I have a chance for my first paid Kowari work. This is an article about how Kowari handles blank nodes - both in querying and through various APIs.

The next piece of work I'd like to do is OPTIONAL. There seems to be quite a lot of interest in that. Having recently added things like trans, walk, having and exclude to iTQL, it should be fairly straightforward - something I could do, I think, in my spare time.

The other area of interest is of course OWL and inferencing. There are several issues involved, and the two I'm currently most interested in are management of the graphs and handling schema changes. The first means there's a requirement for a triggering mechanism and for creating logical graphs from several physical graphs. The second means reading and somehow trying to validate ideas that I've already had.

If you would like to get involved in supporting contracting work for Kowari then get in touch with Software Memetics. The key benefit of this is being able to use the pool of ex-Tucana developers. Going forward it's obvious that we need more developers and a steady supply of work - I'd especially be interested in long-term contracts or full-time work.

Paul has also put up what's happening to him and so has Andrae.

Bubble cars and the year of metadata

Cheap Eats at the Semantic Web Café "This event happened for me the past few weeks; an experience formed of discussions about digital identity and laws of same, LID, Technorati Tags, new and old syndication formats, Google’s nofollow, and the divide between tech and user. Especially the divide between tech and user." Wow!

Burningbird on why tagging can't violate the Second Law of Thermodynamics "My initial concern about the hype is whether we're going to get more apps that get us tagging. If we don't, then tags won't have much effect. If we do, then I simply don't know whether we're going to be able to solve the problems inherent in scaling tags: Tags work because they're so simple and because they are so connected to the human semantic context, but having billions of tags won't work because they're so simple and connected to the human semantic context."

Applications to get you tagging are covered in I Want a Pony: Snapshots of a Dream Productivity App "Nobody cares what “metadata” means, but they for damn sure know they want their mp3s tagged correctly. Ditto for del.icio.us, where Master Joshua has shown the world that people will tag stuff that’s important in their world. Don’t like someone else’s homebrewed taxonomy? Doesn’t matter, because you don't need to like it. If I have a repeatable system for tagging the information on just my Mac and it’s working for me, that’s really all that matters."

"The upcoming Tiger release of Mail.app brings iTunes-like smart folders to your mail, and that’s so great. But I also want a Gmail-like tagging system that lets me create multiple non-destructive groupings without multiple copies or resorting to complex hacks. I want all my “stuff” to reside in a big pile, and then I want smart help to script it, organize it, and associate it however I like."

Wednesday, January 26, 2005

Kowari Roadmap and Overview

I have been asked by several people for a general roadmap or plan for future Kowari development. While that's hard with the vagaries of open source development and time/money, I've released a draft version which highlights what I think should be in Kowari 1.1 Pre-Release 3. A future version will include what will make up a full 1.1 release.

Update: Modified it based on some of Paul's early feedback and removed some of the more pressing grammatical errors (still a work in progress though).

Tuesday, January 25, 2005

SeRQL+

SeRQL Extensions "Finished extensions:
- column alias (AS)
- set operators (UNION/MINUS/INTERSECT)
- query nesting (IN)
- quantification (ANY/ALL)
- grouping (GROUP BY/HAVING)"

Based on the attached PDF, the only things extra to iTQL are column aliases and the aggregate function SUM (we have COUNT). The syntax is different, and perhaps SeRQL's is better, although the use of IN in iTQL leads to less verbosity. They've also slated recursion, which is similar to trans and walk. It's good to see all stores reaching a common level of functionality. We have to get OPTIONAL and some other things to catch up too.

Monday, January 24, 2005

Removing the Crap from Metacrap

Folksonomies, Taxonomies and Metacrap "One thing that is clear to me is that personal publishing via RSS and the various forms of blogging have found a way to trample all the arguments against metadata in Cory Doctorow's Metacrap article from so many years ago. Once there is incentive for the metadata to be accurate and it is cheap to create there is no reason why some of the scenarios that were decried as utopian by Cory Doctorow in his article can't come to pass. So far only personal publishing has provided the value to end users to make both requirements (accurate & cheap to create) come true.

Postscript: Coincidentally I just noticed a post entitled Meet the new tag soup by Phil Ringnalda pointing out that emphasizing end-user value is needed to woo people to create accurate metadata in the case of using semantic markup in HTML. So far most of the arguments I've seen for semantic markup [or even XHTML for that matter] have been academic. It would be interesting to see what actual value to end users is possible with semantic markup or whether it really has been pointless geekery as I've suspected all along."

Sunday, January 23, 2005

Using RDF to improve Object-First Development

There are many ways to create systems, including a "relational-first" approach and an "object-first" approach.

The example in these two pieces involves two classes: Person and Employee. Person has a first name, last name and age. Employee also has an ID and salary. How do you model these objects?

"One approach would be to create two tables, PERSON and EMPLOYEE, and use a foreign-key relationship to tie rows from one to the other. This will require a join between these two tables every time we want to work with a given Employee, which requires greater work on the part of the database on every query and modification to the data. We could store both Person and Employee data into a single EMPLOYEE table, but then when we create Student (extending Person), and want to find all Persons whose last name is Smith, we'll have to search both STUDENT and EMPLOYEE tables, neither of which at a relational level have anything to do with one another. And if this inheritance layer gets any deeper, we're just compounding the problem even further, almost exponentially."

"As if the above weren't enough, more frequently than not, the enterprise developer doesn't have control over the database schema--it's one that's already in use, either by legacy systems or other J2EE systems, or the schema has been laid down by developers in other groups. So even if we wanted to build a table structure to elegantly match the object model we built above, we can't arbitrarily change the schema definitions."

The way to approach this problem with an RDF-based system is to create extensions to these schemas - this is not a problem and is almost expected. RDF-specific databases, like Kowari, can avoid joins when representing these two objects - everything is triples. A subject that is a Person can be made an Employee by adding two statements: one saying the subject has an ID and another giving its salary.
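A minimal sketch of this in Java using the Jena API (the namespace and property names are made up for illustration) - the "promotion" to Employee is just two more addProperty calls on the same subject, with no schema migration:

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;

public class PersonToEmployee {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        String ns = "http://example.org/schema#"; // hypothetical namespace

        // A Person: three statements about one subject.
        Resource person = model.createResource("http://example.org/people/fred")
            .addProperty(model.createProperty(ns, "firstName"), "Fred")
            .addProperty(model.createProperty(ns, "lastName"), "Smith")
            .addProperty(model.createProperty(ns, "age"), "42");

        // Two extra statements make the same subject an Employee.
        person.addProperty(model.createProperty(ns, "employeeId"), "E1234");
        person.addProperty(model.createProperty(ns, "salary"), "50000");

        model.write(System.out, "N-TRIPLE"); // still just triples
    }
}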

Another advantage of an RDF system, especially when you start using ideas like resolvers, is enabling developers to integrate existing data and to create application-specific database schemas. All data can be converted and adapted to the view required by the application, filling out objects as they need to be viewed - potentially allowing a Person object to be viewed as an Employee, just with missing values. This can be achieved by applying a specific semantic for missing values (not recorded, not available, etc.) or by inferring missing details based on other data. You can, of course, take the usual approach and only show objects that are valid without modification. The point is, the flexibility is there.

Using this approach avoids both the "smaller than an object" and "smaller than a row (relation)" problems: the first by creating ad hoc objects, and the second because each item in RDF can be individually picked up, at the triple level, and used.

There's a third way mentioned: "procedural-first" creates an encapsulation layer, taking the Session facade approach, where the actual way in which data is persisted is abstracted.

Apart from the usual suspects (mentioned in the above articles) I've recently come across two related ideas - there are quite a few out there:
* JoSQL which is like SQL meets resolvers.
* SQLObject is an object-relational mapper. It allows you to translate RDBMS table rows into Python objects, and manipulate those objects to transparently manipulate the database.

Update: Danny wrote up a schema using RDFS to represent the classes.

Friday, January 21, 2005

Prince

Printing XML: Why CSS Is Better than XSL "The disagreement starts with how best to express all this. Walsh's solution is to write a 1000-line XSL transformation that generates XSL-FO, which is subsequently turned into PDF. We will argue that it's much easier for most authors to express styling in CSS; in the case of the WebArch document, one can reuse the existing CSS stylesheets (200 lines or so) and add some print-specific lines. And, although browsers tend to focus on dynamic screens rather than on printing, products like Prince happily combine CSS with XML and produce beautiful PDF documents."

Prince.

Repairing Databases

Coherent Integration of Databases by Abductive Logic Programming
"It is well-known that in general, the task of repairing a database is not tractable, as there may be an exponential number of different ways of repairing it."

"One important aspect of data integration systems is how concepts in the independent (stand-alone) data-sources and those of the unified database are mapped to each other. A proper specification of the relations between the source schemas and the schema of the amalgamated data exempts the potential user from being aware where and how data is arranged in the sources. One approach for this mapping, sometimes called global-centric or global-as-view (Ullman, 2000), requires that the unified schema should be expressed in terms of the local schemas. In this approach, every term in the unified schema is associated with a view (alternatively, a query) over the sources. This approach is taken by most of the systems for data integration, as well as ours. The main advantage of this approach is that it induces a simple query processing strategy that is based on unfolding of the query, and uses the same terminology as that of the databases...The other approach, sometimes called sourcecentric or local-as-view (used, e.g., in Bertossi et al., 2002), considers every source as a view over the integrated database, and so the meaning of every source is obtained by concepts of the global database. In particular, the global schema is independent of the distributed ones."

"When the set of integrity constraints is given in a clause form, methods of dynamic logic programing (Alferes et al., 2000, 2002) may be useful for handling revisions. As noted in (Alferes et al., 2002), assuming that each local database is consistent (as in our case), dynamic logic programing (together with a proper language for implementing it, like LUPS (Alferes et al., 2002)) provides a way of avoiding contradictory information, and so this may be viewed as a method of updating a database by a sequence of integrity constraints that arrive at different time points."

To determine which statements to keep and which are invalid when integrating:
"Among the common approaches are the skeptical (conservative) one, that it is based on a ‘consensus’ among all the elements of R(UDB, ?) (see Arenas et al., 1999; Greco & Zumpano, 2000), a ‘credulous’ approach, in which entailments are determined by any element in R(UDB, ?), an approach that is based on a ‘majority vote’ (Lin & Mendelzon, 1998; Konieczny & Pino P ?erez, 2002), etc."

ASystem homepage.

Every Machine is an Island

Semantic Web: Rise of the Machines "The bottom line is that it's about helping machines to communicate more effectively by providing them with a linguistics framework. Distributed systems are currently built without this infrastructure, resulting in disjoint "islands of integration", the machine equivalent of the Tower of Babel."

"In the next part of this series, I'll talk about vocabularies, grammars and phrasebooks in more detail."

Thursday, January 20, 2005

Tuplespace

The Blogosphere as a Tuple Space "Tuple spaces get really interesting, though, because different collectives of users of the tuple space can each agree upon different sets of meanings for tuples—and the agreements exist outside of tuple space, so the space doesn’t need reconfiguration for any particular new use."

Reminded me of Semantic Clarity and a Google search also brings up Triple-space computing: Semantic Web Services based on persistent publication of information "...the semantic web is not made unnecessary based on the tuple-spaced paradigm. The global space can help to overcome heterogeneity in communication and cooperation, however, it does not provide any answer to data and information heterogeneity. In fact, this aspect is what the semantic web is all about."

Talking to Paul recently about how to proceed with OWL inferencing, it became apparent to me that the project I worked on before Kowari, TMex, had a very similar algorithm for processing documents. Its underlying pre/post-conditions, coupled with the use of JMS (which has its own query language), were similar to the modified Rete algorithm proposed for Kowari. It also had some similarities to some of the parallel rule-based systems (still looking into that).

Via The Blogosphere as a Tuple Space

More on Comment Spam

Well I found some of mine here, here and here.

MSN Search and MSN Blogger are supporting it too.

Tim Bray's No Follow suggests another enhancement "...which says “don’t count this link for ranking purposes, but do take its content seriously as relevant to the indicated site.”"

The Other Shoe on Nofollow "I can link to something I dislike and use the power of my link in order to punish the linked, but it won’t push them into a higher search result status...It’s an abuse of the purpose of the tag, which was agreed on to discourage comment spammers. More than that, though, it’s an abuse of the core nature of this environment, where our criticism of another party, such as a weblogger, came with the price of a link. Now, even that price is gone."

Google adds "nofollow" to link tags "This will change how I write. And it will encourage more people to link to their competitors....Oh, and, did anyone notice how Google got its competitors to do something without needing to get a standards committee involved? All within hours?"

This reminds me of Annotea, "Annotea: An Open RDF Infrastructure for Shared Web Annotations" and Mozilla package.

Wednesday, January 19, 2005

Google Tweaks the Web's Semantics

Preventing comment spam "From now on, when Google sees the attribute (rel="nofollow") on hyperlinks, those links won't get any credit when we rank websites in our search results. This isn't a negative vote for the site where the comment was posted; it's just a way to make sure that spammers get no benefit from abusing public areas like blog comments, trackbacks, and referrer lists."
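Concretely, the whole mechanism is a single attribute on the anchor tag (the URL here is hypothetical):

<a href="http://example.com/" rel="nofollow">a comment link</a>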

If this isn't a top down schema I don't know what is. Can't work? We'll see I guess.

Tuesday, January 18, 2005

Planning for Tomorrow's Web

Tomorrow's Semantic Web: Understanding What We Mean "If you're going to trust the answers from something, you've got to be able to understand why you should trust them. The web is also moving to being explainable, more capable of filtering, and more capable of executing services."

"Once I have just simple ontologies, so just taxonomies, just the subclass-superclass relationship, I can start to empower a lot of applications in ways that I couldn't before. I get to have the benefit of a shared vocabulary. It benefits things like search engines because you see more usage of the same terms, authors use the controlled vocabulary, users get encouraged to use a controlled vocabulary, databases leverage it, programs don't have to do translation between the terms—so we're all speaking the same language."

"These days if I'm doing an application that is looking to get into those more complicated ontologies at some point down the road—possibly not today but in a year where I want to exploit that information—I typically aim to encode in OWL, the ontology web language, because it sits on top of, it extends the fairly well used vocabularies of XML and RDF, and RDFS, to have more expressive power. And it gives me the ability to encode terms in a precise language where I can count on the semantics."

Monday, January 17, 2005

Optional Support in Rasqal

Rasqal 0.9.5 - SPARQL optionals "The one feature I particularly wanted to get out was optionals. These are pretty handy when you have a somewhat unknown graph that you are querying, which could have some useful triples, ahem, optionally. So rather than do lots of queries of the form "do you have triple (?x prop1 ?y) ?" and "do you have triple (?x prop2 ?z)?" they can be just attached to the other triples in the graph pattern, marked optional, and only if they match you get a binding, otherwise, the query carries on. This is all properly explained in the Including Optional Values section of the SPARQL query draft."

This (well, something like it, although different in a mathematical way) has been a long-standing RFE in Kowari too.
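A sketch of the idea, roughly following the current draft's syntax (the grammar may well change before SPARQL is final):

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?mbox
WHERE ( ?x foaf:name ?name )
OPTIONAL ( ?x foaf:mbox ?mbox )

Every ?x with a name comes back; ?mbox is bound only where the optional triple matches, rather than the whole solution being discarded.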

18 million users of RDF

Of course, it's not modern RDF but RDF 1999; still, that's eighteen million Firefox 1.0 downloads where everyone can use things like the Mozilla Amazon Browser.

An upgraded version of RDF support may be on the cards with XUL Templates - Next Steps: "OK, I think it's time to finally improve XUL templates. There have been numerous proposals out there about how to do it, but no one can really agree on what the best approach is. There is generally consensus that there needs to be support for XML data as well as RDF, and/or other data formats. What there isn't consensus on is what the templates will look like and whether to use a different template mechanism for different kinds of data. Unfortunately, like most of Mozilla, no one is in charge of things, so while lots of ideas float around, nothing actually gets done...I'm hoping that by implementing it incrementally we can see better templates sooner rather than later. In fact, I've already implemented one new feature for RDF based generation which will appear in the Linspire version of Nvu."

Points to XUL Templates on the Mozilla Wiki.

Nvu is a Frontpage/Dreamweaver-like program for Linux, OS X and Windows.

Beautiful Evidence of a Spotless Presentation

Corrupt Techniques in Evidence Presentations: New Chapter from Beautiful Evidence "Here is the first of several chapters on consuming presentations, on what alert members of an audience or readers of a report should look for in assessing the credibility of the presenter. Most of Beautiful Evidence is about helpful techniques in evidence presentations; these 3 or 4 chapters, however, will describe sources of corruption."

There's also a poster demonstrating Stalin's use of PowerPoint, Gantt Charts (a particular niggle of mine) or the Pioneer message we should've sent.

Stories

Seat Belts "Anti-seat belt law advocate is killed in automobile accident...In this vein, we note with a sense of both sadness and irony a couple of articles recently called to our attention. The first is a 17 September 2004 editorial published in the Daily Nebraskan and entitled "Individual Rights Buckle Under Seat Belt Laws," by Derek Kieper, a 21-year-old senior at the University of Nebraska-Lincoln, in which the writer inveighed against mandatory seat belt laws, opining that "Uncle Sam is not here to regulate every facet of life no matter the consequences," and that "Democrats and Republicans alike should stand together to stop these laws that are incongruous with the ideals of both parties."...an article in the 4 January 2005 Lincoln Journal Star reported that Mr. Kieper not only died in a car crash, but the tragic mishap that claimed his life was the very type of accident in which seat belts have proved so effective in saving lives — preventing passengers from being ejected from vehicles..."

Saturday, January 15, 2005

SPARQL Protocol

SPARQL Protocol for RDF "This is a first Public Working Draft of the SPARQL protocol produced by the RDF Data Access Working Group, (part of the Semantic Web Activity) for review by W3C Members and other interested parties. It reflects the best effort of the editors to reflect implementation experience and incorporate input from various members of the WG, but is not yet endorsed by the WG as a whole."

Includes HTTP bindings.

Working Over Google

Four more updates to the recent Always On article:

Now on Slashdot "A research group at University of Maryland has published a blog describing the latest approach for finding and indexing Semantic Web Documents. They have published it in reaction to Peter Norvig's (director of search quality at Google) view on the Semantic Web..."

On finding semantic web documents. About who was looking for Semantic Web documents. "As of this writing, I'd guess there are at least two million SWDs accessible on the web. Most of these are FOAF or RSS documents...There are lots of other uses of RDF content: embedded RDF in HTML documents, in other document types (e.g., PDF, JPG), in databases, etc."

It'd be valid to say "the Semantic Web is 40 million triples" or something if you had to rely on everything being RDF to be useful. But things like MP3s, file systems, SQL databases etc. can all be viewed as RDF - without conversion, with on-the-fly conversion, or whatever. Some things may be viewed as RDF but never stored as RDF.
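A sketch of what "viewed as RDF" could mean in code (a made-up adapter, not Kowari's actual resolver SPI): a directory listing presented as triples that are generated on demand and never stored.

import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class FileSystemAsRdf {

    /** A triple of lexical values - just enough structure for a sketch. */
    static class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
        public String toString() { return "<" + s + "> <" + p + "> \"" + o + "\" ."; }
    }

    static final String NS = "http://example.org/fs#"; // hypothetical vocabulary

    /** Present a directory listing as triples, generated on the fly. */
    static Iterator<Triple> triples(File dir) {
        List<Triple> result = new ArrayList<Triple>();
        File[] files = dir.listFiles();
        if (files != null) {
            for (File f : files) {
                String subject = f.toURI().toString();
                result.add(new Triple(subject, NS + "name", f.getName()));
                result.add(new Triple(subject, NS + "size", String.valueOf(f.length())));
            }
        }
        return result.iterator();
    }

    public static void main(String[] args) {
        for (Iterator<Triple> it = triples(new File(".")); it.hasNext();) {
            System.out.println(it.next());
        }
    }
}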

Semantic Web: A different perspective on what works and what doesn't

"More importantly, the promise of Semantic Web is closely tied to having the tools for semantic annotations of heterogeneous content, i.e., create semantic metadata automatically. This is much easier to do when you have high quality domain ontologies that bound the scope of automatic extraction."

"Commercial technologies (example) can process millions of pages per day and extract semantic metadata, and all these can be represented as RDF (and that is a good idea because of the benefits esp. for high end semantic applications such as analytics)."

"These types of ontologies routinely have millions of instances (look at SWETO, NCI ontology, GlycO..."

What also works... (at which I did a triple take - I thought Danny was linking to some new Semantic Web blog). Talks about how the Semantic Web and LSI are helpful but not tied to one another and points to Semantic Web != Text Analysis; Semantic Web != Controlled Vocabularies.

A New Startup

Bill Coleman Just Can't Stay Retired "Three years ago, it sounded as though William Coleman was planning on kicking back and enjoying his retirement. The co-founder of Internet software maker BEA Systems (BEAS ) had plans to spend time with his wife, do some skiing near his second home in Aspen, Colo., and work on fund-raising for the Coleman Center...With more than $50 million in venture funding and 115 employees, he's aiming to recreate the magic of his early days at BEA...It was like starting BEA. I have a philosophy that if you want to build mission-critical software, if you don't bring in a team that has been together for at least five years, you're not going to get it right.

At BEA, we bought the Tuxedo product line and WebLogic. You're buying a team, buying a competence. When we started this, [we looked at] this company called Unlimited Scale.... These guys had been with Cray [Computers] (CRAY ) for at least 20 years and started a stand-alone company in 2000. Basically, they broke the scale problem. So I bought that company. "

Presentation, RDF Style

Xenon: An RDF Stylesheet Ontology "In this paper we...describe a more general mechanism for enabling heterogeneous composition for user agents. We term this mechanism a “stylesheet ontology” in analogy to the use of stylesheets on the Web and for XML as a way to abstract presentation from content. If we assume that stylesheets were deemed useful in the HTML and XML contexts, we claim that RDF possesses an even stronger need for stylesheets. HTML (and to some degree XML as well, when the schema in play is simple) is designed to yield human-readable content in a browser, whereas datasets that utilize the expressive power of RDF are rarely human-readable regardless of the syntax used."

"The Xenon stylesheet language is specified as an RDF ontology. In other words, the role the abstract syntax tree (AST) usually plays in functional languages such as XSLT or Lisp is played by an RDF fragment...By representing an AST in RDF, we have therefore drawn a correspondence between the terms “language” and “ontology”: a Xenon stylesheet is RDF written with respect to the Xenon ontology."

"The Xenon ontology provides a generic framework for describing how a resource may be transformed into a presentation as well as a template matching system for supporting heterogeneous composition."

"Because templates use RDF Schema to describe their parameters, both “code” and “data” have the same form and differ only by the complexity of the domain they are describing."

Update: Some browsers have trouble with the above link. If you use "Save As" and open it with a PDF reader then it seems to work fine, rather than relying on the browser plugin.

Thursday, January 13, 2005

Oligopsonistic

Semantic Web Oligopsonies "Aren’t search engines and browsers in a sense oligopolistic (actually oligopsonistic) consumers of web content? There are only a few of each that matter anyway.

Barring interest from Google, Microsoft, Mozilla, and Yahoo! there may be oligopsonists (or monopsonists who want their standards adopted) in niches that can drive metadata adoption in their niches."

Triple Store Connector

Trippi "Java library providing a consistent, thread-safe access point for updating and querying a triplestore. Similar in spirit to JDBC, but for RDF databases."

Found out about this from a recent kowari-general post. It contains a Kowari implementation, with source available in CVS. The main class seems to be TriplestoreConnector, which gives you access to readers and writers.

Google Still Doesn't Trust Metadata

Semantic Web Ontologies: What Works and What Doesn't "A friend of mine just asked can I send him all the URLs on the web that have dot-RDF, dot-OWL, and a couple other extensions on them; he couldn't find them all. I looked, and it turns out there's only around 200,000 of them. That's about 0.005% of the web. We've got a ways to go."

A lot of RDF is XML or N3 and ends with dot-XML, dot-N3, dot-RSS, dot-ZIP, dot-GZ, etc. - a lot of this data is going to be fairly invisible.

"The best place where ontologies will work is when you have an oligarchy of consumers who can force the providers to play the game. Something like the auto parts industry, where the auto manufacturers can get together and say, "Everybody who wants to sell to us do this." They can do that because there's only a couple of them. In other industries, if there's one major player, then they don't want to play the game because they don't want everybody else to catch up. And if there's too many minor players, then it's hard for them to get together."

P2P searching shows that you don't need a rich, top level ontology to be able to find songs by Britney Spears, Birtney Speares, or however you spell it. To find a good quality song you need a bit of metadata. To find a specific performance of a song you need even better metadata. This is how it can grow in a bottom up manner. RDF and OWL allow you to grow out. If you don't need to change the code to support new metadata, as your ontology grows, then that is a positive thing for users and developers.

"So there's a problem of spelling correction; there's a problem of transliteration from another alphabet such as Arabic into a Roman alphabet; there's a problem of abbreviations, HP versus Hewlett Packard versus Hewlett-Packard, and so on. And there's a problem with identical names: Michael Jordan the basketball player, the CEO, and the Berkeley professor."

HP/Hewlett Packard/Hewlett-Packard all cluster statistically together. This kind of technology is sufficiently sophisticated - it's not very different from deciding between what's spam and what's not. Or a human can do it, or, most likely, a combination. Being able to tell which Michael Jordan you're talking about is a problem that is solved by metadata.

"What this indicates is, one, we've got a lot of work to do to deal with this kind of thing, but also you can't trust the metadata. You can't trust what people are going to say. In general, search engines have turned away from metadata, and they try to hone in more on what's exactly perceivable to the user."

I can trust my metadata and I might trust yours. People may try and cheat Google's ranking algorithms but won't cheat themselves. Where people care about their own metadata and have to rely on it, it will improve over time.

Wednesday, January 12, 2005

5 for Java 5

Five Reasons to Move to the J2SE 5 Platform "The JVM is now self-configuring and self-tuning on server-class machines. A server-class machine is a machine with two or more CPUs and at least 2 GB of memory. The server-based performance ergonomics kicks in by rightsizing both the memory required and the class of optimizations needed for longer lived applications. This has resulted in an 80 percent improvement on one application server benchmark without changing a line of code or supplying any runtime options!...The Java Community stands behind the improvements in J2SE 5.0. The J2SE 5.0 expert group comprised the following who's who of the Java industry: Apache, Apple, BEA Systems, Borland, Cisco Systems, Fujitsu, Hewlett-Packard, IBM, Macromedia, Nokia, Oracle, SAP, SAS Institute, SavaJe Technologies, Sun Microsystems, John Zukowski, Osvaldo Doederlein, and Juergen Kreileder."

TestNG Article

An update to Good-bye JUnit: an article from IBM's developerWorks, TestNG makes Java unit testing a breeze: "Some peculiar features of JUnit come in for particular criticism from the latter group:

* The need to extend a TestCase class, because the Java language has single inheritance, is very limiting.
* It is impossible to pass parameters to JUnit's test method as well as to setUp() and tearDown() methods.
* The execution model is a bit strange: The test class is reinstantiated every time a test method is executed.
* The management of different suites of tests in complex projects can be very tricky."

"All these features, together with the adoption of Java annotations to define tests, make the whole testing process much more simple and flexible. There are only a few rules that you must obey to write tests; beyond these, you are absolutely free to choose the testing strategy you prefer."

More FTrain

Installing updates "Ftrain is on hiatus for many reasons, all of them good. First, it is on hiatus because I am writing a book...I am also starting a full-time job in February, rather than sitting around the house in my socks. & I'm getting well underway with coding a real, grown-up version of the framework under this site, which I must do in order for the new job. Thus, I'll give away more code for building weird semantic webbish sites soon, under an open-source license, if anyone wants it. For real this time. I swear. I really want to. I won't screw it up. It will be a combination of Apache2 mod_rewrite rules, PHP5, XSLT, MySQL, and the Sesame RDF engine, and you'll need to understand all five technologies to use it. It won't have a content management interface aside from weird XML files, and it won't work at all when you start it up. You'll love that.

AndroMDA 3.0M3

"AndroMDA is a code generation framework that follows the Model Driven Architecture (MDA) paradigm. It takes a UML model from a CASE tool and generates classes and deployable components (J2EE or other), specific for your application architecture. Because of its built-in code generation support for real world target platforms, it has become the mainstream open source MDA tool for generating enterprise applications."

Metafacades " AndroMDA Metafacades are facades that are used to provide access to models loaded by a repository. These "metafacades" shield us from the underlying meta model implementation. Meta models are MOF modules such as UML 1.4, UML 2.0, etc. Metafacades are generated by the andromda-meta-cartridge."

SPARQL4J

Profium, Asemantics and HP Laboratories initiate SPARQL4J open source effort "Profium, Asemantics and HP Laboratories have initiated an open source effort to provide Java programmers with a JDBC driver to access SPARQL-enabled metadata repositories."

SPARQL4J SF project.

A standard driver would be a great thing, but modelling RDF queries in SPARQL doesn't seem a natural fit for JDBC.

A JDBC driver doesn't really seem like the appropriate interface to support - there are a lot of data types in a JDBC driver that just aren't supported in SPARQL and probably won't be; RDF just has a different way of handling datatypes. Also, you have to use things like java.sql.ResultSet, java.sql.Connection, java.sql.SQLException etc. Does it make sense to send a SPARQL query and receive an SQLException? Even other APIs built on SQL databases, like Hibernate's Query Language and JDO's Query Language (or Using JDO 2.0: JDOQL), use their own APIs. XML databases have XMLDB.
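To make the mismatch concrete, here's roughly what querying through such a driver would look like (the jdbc:sparql: URL, endpoint and query are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SparqlOverJdbc {
    public static void main(String[] args) throws SQLException {
        // Hypothetical driver URL - illustrative only.
        Connection con = DriverManager.getConnection("jdbc:sparql:http://example.org/endpoint");
        try {
            Statement stmt = con.createStatement();
            // A graph query shoehorned through an SQL-shaped API...
            ResultSet rs = stmt.executeQuery(
                "SELECT ?name WHERE ( ?x <http://xmlns.com/foaf/0.1/name> ?name )");
            while (rs.next()) {
                // What SQL type is a URI or a typed RDF literal? And a failed
                // graph query surfaces as an SQLException.
                System.out.println(rs.getString("name"));
            }
        } finally {
            con.close();
        }
    }
}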

Maybe it's worth having a look at some of JDBC 4.0, which has things like XML datatype support, but it's still "working with relational SQL data stores from the Java platform".

Tuesday, January 11, 2005

MusiK documentation

MusiK is Kowari's demo application. It does seem to have problems with some MP3 metadata from iTunes, but the supplied MP3 in the data directory works. The UI is still a little rough (you have to scroll up the panel divider in OS X, for example). The point, though, is to show how to write an application using Kowari and its speed at querying, loading, etc.

MusiK (Music Player for Kowari).

Kowari now uses JID3 - A Java ID3 Class Library Implementation (in CVS) and it seems to do a better job of not dying when parsing the ID3 tags.

2 for J2EE

Improve the quality of your J2EE-based projects "Enter the Eclipse IDE, which provides built-in capabilities that, when used with several plug-ins, can aid in increasing the quality of both the codebase and the system. Eclipse is an open, extensible IDE built for anything and nothing in particular. Eclipse’s Java development environment is open source, free, and fully customizable. Eclipse both enables and promotes the addition of new capabilities via open source and commercially available custom-built plug-ins. By utilizing Eclipse...it is possible for a developer, and a team, to measure the quality of any J2EE- or Java-based system."

Includes: Checkstyle, Jupiter, Metrics, XDoclet, WSVT, JUnit, GroboCodeCoverage, Eclipse Profiler.

A non-free tool LISA smiles on J2EE app testers.

Monday, January 10, 2005

Both!

Jan 06, 2005: Folksonomies? How about Metadata Ecologies? "Neither works especially well on its own: controlled vocabularies often miss out on input from content authors and become rigid, stale, and distant from the vernacular of users; folksonomies will begin to break down...Treating them as major parts of a single metadata ecology might expose a useful symbiosis: encourage authors and users to generate folksonomies, and use those terms as candidates for inclusion in richer, more current controlled vocabularies that can evolve to best support findability."

Sunday, January 09, 2005

Better than Nothing

folksonomies + controlled vocabularies "The advantage of folksonomies isn’t that they’re better than controlled vocabularies, it’s that they’re better than nothing, because controlled vocabularies are not extensible to the majority of cases where tagging is needed. Building, maintaining, and enforcing a controlled vocabulary is, relative to folksonomies, enormously expensive, both in the development time, and in the cost to the user, especially the amateur user, in using the system."

"The cost of finding your way through 60K photos tagged ‘summer’, when you can use other latent characteristics like ‘who posted it?’ and ‘when did they post it?’, is nothing compared to the cost of trying to design a controlled vocabulary and then force users to apply it evenly and universally."

Folksonomies succeed where the Semantic Web fails "I hazard a guess that most of the Semantic Web crowd is, like me, firmly in the ‘well-designed metadata’ camp. Our vocabularies and ontologies are designed by experts, then handed down to the users. We are not at ease with the idea of users creating their own categorization schemes. If we’ve learned anything from experience, then that the average user is unable to get a subclass relationship right. A bunch of sloppily assigned tags will not be useful for inferencing."

"Betting on the Semantic Web is betting against ease of use, conceptual simplicity, and maximal user participation. And I don’t see how ontologies and the RDF data model stand even the slightest chance in this particular area."

Via Pro metadata will lose to folksonomy.

Saturday, January 08, 2005

How to Make Code Rule

Why Your Code Sucks "If you don't have tests for your code, it sucks. And I mean comprehensive, fine grained, programmer tests (aka something like unit tests), as well as higher level functional and integration tests. Tests that are automated. Tests that are run routinely. With the programmer tests run after any change to the code...Writing tests before you write the code means that the code is testable by definition."

"Code should be easy to read. Steve McConnell made the statement at his SD West '04 keynote that code should be convienent to read, not convienent to write...Choose style and formating conventions early in the project, and conform to them."

"When making framework decisions, consider if a lighter framework will do the required job. Using something like Hibernate, Prevayler, Spring, PicoContainer, NakedObjects, etc. can be a real win in many situations. Never blindly adopt a heavy framework just because it's the current bandwagon. Likewise, don't blindly adopt a lightweight framework in defiance. Always give due consideration to your choices."

Librarian Action Figure

Capitalising on richer Web data "The project’s second demonstrator involves semantic blogging in the bibliography management area. It was chosen because people working in this field need to share small items of information with a peer group in an easy and timely fashion. “The portal includes peoples’ Web diaries [blogs] with three semantic behaviours – view, navigation and query – built over the base,” says Miller. “If you are in a research reading group, you can call on a number of scripts to generate forms which tell others what the research paper discusses, and so on.” The enriched information generated by blogging and Web technologies makes it easier for computers to retrieve archived data."

The LIBRARIAN ACTION FIGURE "Weapon of Choice: The Dewey Decimal System...The role of a librarian is to make sense of the world of information. If that's not a qualification for superhero-dom, what is?"

How to Manage a Triple Store

Views in Triple Stores "The first is to apply a "window" on the graph, and only extract the data that's within a certain distance of my origin. The MusicBrainz API has a similar notion of query depth; see the ASCII art under the "Select & Get Documentation" section in the docs, and the subsequent section for more information."

The related-to operations in TKS did some of this, but they were unaware of a schema - they used statistics to find the most relevant relationships between resources.

When you start extending from a resource to other related resources you have to know about the schema - more importantly, what to exclude or how to rank these relationships. With something like RDFS entailments everything is a resource, so following that to other resources means everything is related within one level.

Maybe a walk with multiple predicates and a depth.

"The second subset may be created by filtering out the classes and properties extracted from the database based on their namespaces. For example I might have a triple store containing a mixture of public/private data, with the latter in a separate namespace and I want to pull out just the public aspects for returning from a web service."

"I see the combination of these subsetting techniques, along with a declarative mechanism to state how and when they should be applied as broadly analagous to a relational view.

Has anyone implemented anything like this, or aware of triple stores (or more likely APIs) that offer this facility?"

To get URIs out based on their namespace, LIKE must be enabled (it's implemented, just not available in iTQL). After that, though, these groups could be put into different models and alternatively viewed as one.

The functionality comes from being able to use set operations in the FROM clause:
"Because models are sets of statements, it is logical to compose them using set operations. The from clause permits set union using the or operator and set intersection using the and operator, with parentheses used to control association.

The following example queries only the statements appearing in all three models.

... from <rmi://mysite.com/server1#model1> and <rmi://mysite.com/server1#model2>
and <rmi://mysite.com/server1#model3> ..."
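Presumably union reads the same way - for example, the statements appearing in either of two models:

... from <rmi://mysite.com/server1#model1> or <rmi://mysite.com/server1#model2> ...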

Semantic Web Experiences

Why You Should Be Looking at Semantic Technologies Now "From publishers to VCs: Look at what—and how—they're enabling solutions and innovation."

"There are at least 20 new vendors. One way to judge the arrival of something is the increase in the number of vendors. Now I have 10 suppliers of triple stores to consider. We have existing companies to look to. What is going on in IBM now? What is going on with HP? Why are they doing all this stuff with RDF? What is happening at Sun with SwoRDFish?"

"At NASA we're working on ontologies for the space shuttle and for new risk management, and now for wire management on the aging wire."

Hmm, the wire management ontology - I didn't see that one coming. Who said the Semantic Web was going to be merely academic?

RDF Beans

Re: Object triple Mapping "I have implemented a little rdfreactor like library[0][1] over the vacations for use in BlogEd. This allows me to create very simple interfaces using a java beans like pattern that contain all the information about the Ontology."

"This makes development of RDF aware programs in Java incredibly easy. And it also makes it much easier to explain RDF I think to java programmers. With a little more work (or waiting for java 5.0 annotations) it will be very easy to specify the owl ontology completely from an java file, and use those files to generate the
Ontology."
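A sketch of the kind of interface I imagine this means (entirely hypothetical - the names and mapping mechanics are mine, not BlogEd's):

/**
 * Hypothetical bean-style interface: a library could map each
 * getter/setter pair to a property in an ontology.
 */
public interface Person {
    String getName();          // e.g. maps to something like foaf:name
    void setName(String name);

    Person getKnows();         // object properties become typed links
    void setKnows(Person p);
}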

Thursday, January 06, 2005

AudioMan meets Kowari

AudioMan and Kowari "The only problem is the ID3 parser implementation, I guess it chokes on my iTunes-tagged files. I’m not aware of any Java ID3 lib that gets this right, let’s hope Ryan’s jid3rL can solve this."

"About AudioMan(ager), I’m considering if I should use Kowari. I always liked the idea of storing the AudioManager data as triples, but Kowari looks big and complex to me, at least for the moment. However, Kowari would nicely fit into Durham. I’d imagine the existing MP3 content handler could be modified to use the jid3rL parser, and additional handlers be created for XPSF, M3U, Ogg, etc.

For now AudioManager will stick with the Jena API (as I was already using it and Kowari provides some Jena interfaces anyway), and I’ll get working on MusicBrainz integration."

Quick Kowari Update

The most up-to-date Kowari code (post Pre-Release 2) is now available from CVS.

A second Kowari mailing list has been added for Kowari developers (I use that term broadly - not just code, but high-level design, future direction and such).

Still looking for work.

Tuesday, January 04, 2005

Extracting Structure

Semantic Web vision is missing filters for unstructured data "I'd agree that the TBL's Semantic Web vision is missing filters for unstructured data, but I don't have the solution and I also don't have the answer to Adam's Where Have all the databases gone?"

Semantic Search Technology "As Artificial Intelligence (AI) technologies become more powerful, it is reasonable to ask for better search capabilities which can truly respond to detailed requests. This is the intent of semantic-based search engines and semantic-based search agents. A semantic search engine seeks to find documents that have similar 'concepts' not just similar 'words.' In order for the Web to become a semantic network, it must provide more meaningful meta-data about its content, through the use of Resource Description Framework (RDF - http://www.w3.org/RDF/) and Web Ontology Language (called OWL - http://www.w3.org/2004/OWL/ ) tags which will help to form the Web into a semantic network. In a semantic network, the meaning of content is better represented and logical connections are formed between related information."

"One short term approach is to garner semantic information from existing Web pages using LSI."

Semiotic Development

Application Semiotics Engineering Process: Towards ontology-based modelling of application semantics "The ASEP [application semiotics engineering process] is intended for modeling complex business rules, application logic and domain knowledge which need to be either encapsulated for change or separated from conventional software modelling of the functional dimensions of IT systems for different development or asset management. It targets systems with rich application semantics, such as knowledge systems, system integration with divergent and rapidly changing business logic, semantic interface specification of software components or web services, protocols for semantic interoperation of collaborative processes or systems. The ASEP is aimed at the development of corporate or organizational intelligent systems and open services such as knowledge management systems, semantic web services [18]."

"Lexons represent binary relationship between two entities. They are the vocabulary (not terminology) of the application semiotics. Similar to the vocabulary of the natural language, they have ideational purport without reference to specific application or task contexts...Thus underspecified, they serve as basis for consensus, agreement, reusability and versatility."

"While the lexons underpins the flexibility and reusability of the application semiotics with under-specification, the commitment is essentially dedicated to the semantically well-formed, fully specified, consistent actualizations of the underlying patterns with respect to a particular task or application."

"The layered model of application semiotics is important for encapsulating changes and dynamics of models. For example, the continued business process improvement or integration can be catered to by optimisation (different commitment constraints), resutructuring (changed in commitment networks), or innovation (new business concepts with additions to lexons).

From STAR Lab publications.

Aduna

Aduna Metadata Server 2005.1 RC1 Release "The Aduna Metadata Server automatically extracts text and metadata from information sources, like file servers, intranets or public web sites. The extracted information is available for tools such as Aduna AutoFocus and Aduna Spectacle. These tools enable the user to find and explore information by using the extracted metadata. Other tools can also make use of the extracted information by use of the Sesame library (see www.openrdf.org)."

I see a lot of similarities with what we did at Tucana. The metadata server is basically TMex - which then became the content handlers and resolvers in Kowari 1.1 (without the document workflow). They also have the commercial visualization add-ons.

Sunday, January 02, 2005

Looking forwards, looking back

2004: Retrospective "2004 is coming to an end, so I decided to jot down some of my more memorable moments, both good and bad...Served my first full year as CTO of webMethods...Thought a lot about the Semantic Web and Model Driven Design."

2005: Resolutions "I take New Year's resolutions pretty seriously, as it's depressing to make promises to yourself that you don't keep. That being said, here are some of my resolutions for 2005...Evangelize the Semantic Web."

Update: He also has Predictions, which include the Semantic Web overtaking Web Services in mindshare.

A couple of other 2005 predictions:
* My predictions for IT in year 2005 "The Semantic Web will still be mostly at the same point it is now, at the end of 2005. That is, some nice ideas, including RDF and XML will stick around and find some uses, but OWL won’t take off."
* Yet Another 2005 Prediction List "XML Query, the Semantic Web, and WS-* will continue to hold promise. This is the polite way of saying that none of the above will have an explosive burst of adoption in 2005."
* Search 2005 "Metadata-enhanced search. Will be ad hoc and pragmatic, pulling useful bits from private sources and people following officious Semantic Web and lowercase semantic web practices."

Lock Free Programming

Java theory and practice: Going atomic "Until JDK 5.0, it was not possible to write wait-free, lock-free algorithms in the Java language without using native code. With the addition of the atomic variables classes in the java.util.concurrent.atomic package, that has changed. The atomic variable classes all expose a compare-and-set primitive (similar to compare-and-swap), which is implemented using the fastest native construct available on the platform (compare-and-swap, load linked/store conditional, or, in the worst case, spin locks). Nine flavors of atomic variables are provided in the java.util.concurrent.atomic package (AtomicInteger; AtomicLong; AtomicReference; AtomicBoolean; array forms of atomic integer; long; reference; and atomic marked reference and stamped reference classes, which atomically update a pair of values)."
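
As a rough illustration of the compare-and-set idiom the article describes, here is a minimal sketch of a lock-free counter. NonblockingCounter is my own name, but the AtomicInteger calls are the JDK 5.0 API.

    import java.util.concurrent.atomic.AtomicInteger;

    // A minimal lock-free counter built on compare-and-set.
    public class NonblockingCounter {
        private final AtomicInteger value = new AtomicInteger(0);

        // The classic CAS loop: read the current value, compute the next,
        // and retry if another thread raced in between the read and write.
        public int increment() {
            int current;
            do {
                current = value.get();
            } while (!value.compareAndSet(current, current + 1));
            return current + 1;
        }

        public int get() {
            return value.get();
        }
    }

In practice AtomicInteger already provides incrementAndGet(), so the explicit loop is redundant here; it's spelled out only to show what the compare-and-set primitive makes possible for more complex lock-free structures.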

See also More flexible, scalable locking in JDK 5.0, the Atomic Javadoc and an interesting article, The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software.

And Java 1.5 Update 1 is out too.

Memory Bank

Break "Another frustration is that Tucana was the best job I'd had in years, which is why I stayed so long (5 years, including 4 months when I took unpaid leave to go contracting when Tucana ran out of money). We all know quite a lot about RDF and OWL now, and I think we've made something pretty good in Kowari and TKS, but that expertise is all about to go to waste. While some people in Australia are looking at RDF, there are no commercial organisations in Brisbane who have moved that far ahead yet. As a result, everyone will end up working in completely unrelated areas.

From my perspective, I was enjoying learning how to infer for OWL in Kowari. Since my Masters relies upon this I'll be continuing to work on Kowari in my own time. I'll just have to be disciplined as I won't get the opportunity to work on Kowari in my work time."

I haven't yet sat down and listed all the ideas we've had over the last 3 or so years working on this stuff. From what I can tell we've got heaps of ideas (good/ordinary/bad) that need to be written out and explained: Wonder GUI, handling blank nodes in a multi-user distributed database, naming graphs and so on. The recent talk about agile databases covers something that one of our very first customers started using TKS for, and something that Ben and others talked about ages ago. At the very least, we shouldn't let the ideas disappear with the company.

Meme with a View

Word on the street is that there's a hot new consultancy business called Software Memetics. From their resource page it looks like they're pushing something called the Semantic Web and Kowari. It also includes links to an Urchin-Kowari write-up (the demo shows Lucene integration) and Kowari: A Platform for Semantic Web Storage and Analysis (a paper submitted for WWW2005).

The Winning Web

Why we will win "But do you think, being OS independent is enough? Are you content with having your programs run everywhere? If so, fine. But you shouldn't be. You should ask for more. You also want to be independent of applications! Take back your data. Data wants to be free, not locked inside an application. After you have written your text in Word, you want to be able to work with it in your Latex typesetter. After getting contact information via a Bluetooth connection to your mobile phone, you want to be able to send an eMail to the contact from your web mail account."

"Using a common data model with well defined semantics and solving tons of interoperability questions (Charset, syntax, file transfer) and being able to declare semantic mappings with ontologies - just try to imagine that! Applications being aware of each other, speaking a common language - but without standard bodies discussing it for years, defining it statically, unmoving.

There is a common theme in the IT history towards more freedom. I don't mean free like in free speech, I mean free like in free will.

That's why we will win."

Stuff

* Predictions for 2005 "Spring, Hibernate and lightweight framework backlashes occur...java.util.concurrent will be a talking point and will produce a rash of articles and guidance as developers weep like children in the face of Doug Lea's 3rd edition of CPIJ....Paul Graham declares victory and ships Arc, which ends up being Common Lisp without the libraries."
* Predictions for 2005, Yet Another 2005 Prediction List and Microsoft's Top 10 Milestones for 2005.
* What's Out and In for 2005.
* Stapleless Stapler.
* Java in MySQL.
* tis the season: II "Code and Other Laws of Cyberspace" goes Wiki for version two.
* Translating Semantic Web languages into SCL "The three SWeb languages so far given W3C 'recommendation' status (RDF [RDF], RDFS [RDFS] and OWL[Webont]) can all be translated straightforwardly into SCL. This document gives full translations, showing how to express all of the content of these languages directly in SCL."
* Optional match with iTQL.

Agile Databases Again

Let them eat layer cake: flexibility versus clarity in data "Recently I've had some experience at the non-planetary scale of trading off between the extreme flexibility of technology like RDF versus a domain model that a person coming after could reasonably be expected to understand...It turns out that RDF is surprisingly cheap stuff to generate. The downside was that for purposes of communicating intent, ongoing maintenance and adding functionality against the collected data, RDF is not very pleasant to work with, at least not compared to SQL, Objects or XML. This is especially so at the presentation layer. It's also a different paradigm, and by using it you're technologically committed to yet another data model, directed graphs, alongside the usual suspects - objects, markup and relations. The cost of introducing a new model should not be underestimated. As a result RDF has been useful but not as cheap to manipulate as one would like."

Also, "This is another area where RDF falls down. Yes, there is Sparql and before that other SQL like languages, but again you're left iterating over raw RDF graph result sets, which is not always ideal."

As linked to previously, SPARQL results can be either a sub-graph or variable bindings.
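
For illustration, here is a minimal sketch of the two result forms using Jena's ARQ query API: SELECT hands back variable bindings, CONSTRUCT hands back a sub-graph. The example.org URIs are made up, and the model is assumed to be already populated.

    import com.hp.hpl.jena.query.*;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    public class SparqlResultForms {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel(); // assume triples already loaded

            // Variable bindings: a table of solutions, one row per match.
            String select = "SELECT ?s ?name WHERE { ?s <http://example.org/name> ?name }";
            ResultSet rows = QueryExecutionFactory.create(select, model).execSelect();
            while (rows.hasNext()) {
                System.out.println(rows.nextSolution());
            }

            // Sub-graph: a new RDF model assembled from the matched triples.
            String construct = "CONSTRUCT { ?s <http://example.org/label> ?name } "
                             + "WHERE { ?s <http://example.org/name> ?name }";
            Model subGraph = QueryExecutionFactory.create(construct, model).execConstruct();
            subGraph.write(System.out, "N-TRIPLE");
        }
    }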

"Arguably we could have used an RDF store such as Kowari, the in built persistence mappings of Jena, or even XQuery, along with the RDF interchange. The reality is there's only so much new technology you can apply in one go without taking on too much risk, especially in a short time frame, whereas we had a good idea of what we were getting into with a relational store."

Are triplestores good databases? "The Sparql language may not quite be finished but it’s certainly comparable to SQL...Overall this would suggest that RDF stores are potentially good DBs, on many points potentially much better than regular RDBMSs (or XML DBs) because of the more flexible model. But for this to be practicable it assumes the performance can be brought to a level comparable to RDBMSs, which, if it hasn’t already been done, would I think only be a small matter of programming."

I've been fairly negative about SPARQL in the past because of its lack of counting and sorting, and it also doesn't make sense to me to have DISTINCT - everything should be distinct, it's a graph.

There are also some comments about Kowari using Lucene for the database and not having transactions, which is just wrong. Kowari uses NIO and AVL trees for its store and has since the beginning. It's much faster than JDBM (which we looked at for storing our blank node map and discarded because of speed).

Saturday, January 01, 2005

Java and Schemas

There was a recent posting to the Kowari, Sesame and other RDF lists about a Hibernate/JDO-like tool for RDF. This led to a link to RDFReactor.

As mentioned in Re: Object triple Mapping, there are some tricks to modelling objects in RDF, like multiple inheritance. My preference has been ontology-based programming, where the developer programs to a general ontology rather than trying to tie either RDF or Java concretely to the other.
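
A small, entirely hypothetical sketch of the multiple inheritance trick: an RDF resource can be typed as both ex:Student and ex:Employee, but a Java class can extend only one superclass, so mapping tools typically fall back on interfaces.

    // All names here are invented for illustration.
    interface Student  { String getStudentNumber(); }
    interface Employee { String getStaffNumber(); }

    // One proxy class can implement both "RDF classes" at once,
    // which is roughly what interface-based mappers generate.
    class StudentEmployee implements Student, Employee {
        public String getStudentNumber() { return "s12345"; }
        public String getStaffNumber()   { return "e67890"; }
    }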

Crisis

In the same week Tucana goes under we get probably the busiest week on the Kowari mailing list and 45 downloads of Pre-release 2. Considering three quarters of these are the 50MB version, there's either a lot of good bandwidth out there or great patience.

Before the holidays great progress was being made. We had been attempting to meet the "10,000 triples/second challenge" and had succeeded (at least for the first 1 million triples; it degraded to a few thousand per second after 200 million). This is on the same 1.6 GHz Opteron system that we use for all our tests.

Andrae was working on a large refactoring of the transaction operations so that any JRDF, Jena, or iTQL operation (or anything new) could be put within a transaction or use an existing one. They were already run in a transaction, but they couldn't tell, for example, whether they already had the write phase. From what I can remember, Simon was working on the problems associated with using file and HTTP protocols in the FROM clause (external resolvers). David M was working on speeding up triple deletion as well as loading speed. Paul had been working on SOFA and OWL. Robert had started looking at JDO, EJB and mapping relational databases to RDF.

Our commercial focus had been on speed and scalability. With Tucana gone, some of this focus is probably going to change. It takes a lot of time, effort and resources to keep everything changing and to continue to perform at a commercial level. Between 1.0 and 1.1 there is practically no portion of the code that wasn't modified - more often than not it was completely changed. This kind of development, on multiple fronts, is something that probably can't occur in the future.

So the future is probably smaller and simpler, with an eye on standards compliance. I hope some things like multiple writers, phase holding, pluggable datatypes, inferencing HotSpot, SPARQL support and the like will be developed, but I doubt many of them will see the light of day now. I'm not discounting them entirely, but there are many things that had to occur in parallel and that need a team of developers. A lot of the new features depended on the multiple writers feature. Multiple writers is not an easy feature to describe, but suffice to say it's more like Lucene than a normal relational database.

So I think that means improving RDF, RDFS and OWL support. We made some good improvements in Pre-release 2 with better datatype support. David M added the functionality to allow literal and URI prefix matching - someone just has to add it into the query layer. Combine this with matching nodes based on type (literal, URI or bnode) and trans/walk queries and you have a very powerful combination of features. In the background there's also been a focus on languages (I know we were talking about RFC 3066bis support). Paul and I were looking at inference models (which seem to have similarities with pseudo models and datalog vs tableau reasoning). Paul's Masters will probably push some of this into reality.

So all in all there do seem to be some good opportunities in the future. I know that Tucana had customers who were paying lots of money for TKS, and I know that Kowari did help in the risk assessment. So my intention, at least in the short term, is to see that good support occurs, bugs get fixed and the like - although there isn't anyone left to do TKS releases.

I don't really understand why this happened (or what's going on) but to quote Marge Simpson: "One person can make a difference. But most of the time they probably shouldn't."