Tuesday, July 29, 2008
BNodes Out! discusses how any usefully scalable system doesn't use blank nodes. What is interesting is the comment on YADS (Yet Another DOI Service). The best reference is Tony's presentation although it is mentioned in Jane's as well. "YADS implements a simple, safe and predictable recursive data model for describing resource collections. The aim is to assist in programming complex resource descriptions across multiple applications and to foster interoperability between them...So, the YADS model makes extensive use of bNodes to manage hierarchies of “fat” resources - i.e. resource islands, a resource decorated with properties. The bNodes are only used as a mechanism for managing containment."
This sounds a lot like RDF molecules and supports visualization (apparently). This seems like a good use of molecules that I hadn't previously thought of (Tony's talk gives an example of the London underground). The main homepage of YADS isn't around anymore - it'll be interesting to see if it's still being used/worked on.
Update: Tony has fixed up the YADS home page (there's also an older version).
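To make the "resource islands" idea a bit more concrete, here's a minimal Jena sketch of a "fat" resource whose contained parts hang off a blank node. The namespace, the contains property and the Underground detail are invented for illustration - this is not the actual YADS vocabulary:

```java
import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.vocabulary.DC;

public class ResourceIslandSketch {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        // Invented containment property - not the real YADS schema.
        Property contains = model.createProperty("http://example.org/vocab#contains");

        // A "fat" resource: an identified resource decorated with properties.
        Resource line = model.createResource("http://example.org/underground/circle-line");
        line.addProperty(DC.title, "Circle Line");

        // Blank nodes are used only to manage containment of further islands.
        Resource station = model.createResource();   // bNode
        station.addProperty(DC.title, "Baker Street");
        line.addProperty(contains, station);

        model.write(System.out, "N-TRIPLE");
    }
}
```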
Monday, July 28, 2008
Hadoop and Microsoft
Pluggable Hadoop lists some extensions to Hadoop in the pipeline: job scheduling (including one based on Linux's Completely Fair Scheduler), block placement, instrumentation, serialization, component lifecycle, and code cleanup (the analysis used Structure101).
I found the reason why HQL was removed from HBase (to be replaced by a Ruby DSL and to ensure that HBase wasn't confused with an SQL database) and moved to HRdfStore.
There's also rumours that Microsoft's recent investment in Apache may lead to them working on Hadoop too.
Tuesday, July 22, 2008
Save us China
I was in Victoria when the ETS for Australia was announced (well, the discussion papers). It's fairly funny that replacing the world's worst plants (Hazelwood is the world's worst) even with other coal plants, using Chinese brown coal technology, would reduce emissions by 30% to 40% just by drying out the brown coal. It's still very polluting, but it just shows how far behind Australia is. This has led to greater compensation for Victorian polluters (which is just mad). At the same time Queensland is creating another coal port because we can't export the carbon fast enough.
The exclusions were annoying (aluminium, cement and some types of steel). Cement is especially frustrating (apparently 5% of all CO2 emissions) as green alternative technologies already exist. Now is the time to invest, not compensate.
Square brackets are scary
In what may be an increasing trend of surfing the Web at 320x480, I noticed Cydia has a number of applications for jailbroken iPhones (mainly Java, Python and Ruby). The iPhone/Java mailing list doesn't have much on it except some interesting uses of JocStrap and UICaboodle (available from SVN by Jay Freeman). There's also the Sun blog that has some interesting sample applications using different Java implementations on the iPhone.
Friday, July 04, 2008
JRDF 0.5.5.1 Released
Just a quick note about a new version of JRDF. It's been a short time between releases but it still contains one significant advance over the previous one: persistent graphs. It's still in the early stages but it's basic enough for simple use cases. It also contains a text serialization (based on NTriples) that is useful for moving RDF molecules around nodes in a cluster (for example). A lot of this code is pretty much "spike" code and I expect another release after we exercise these new features more (and write some tests/rewrite the code).
Update: 0.5.5.2 is now available fixing many bugs and introducing FILTER support.
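As a rough illustration of why a line-oriented, NTriples-style text form suits shipping molecules between nodes, here's a minimal sketch in plain Java - it is not the JRDF API, and literal escaping is glossed over:

```java
import java.util.Arrays;
import java.util.List;

public class NTriplesSketch {
    // Render one statement as a single NTriples line: one triple per line,
    // so molecules can be streamed, split and concatenated cheaply.
    static String toLine(String subject, String predicate, String literal) {
        return "<" + subject + "> <" + predicate + "> \"" + literal + "\" .";
    }

    public static void main(String[] args) {
        List<String> molecule = Arrays.asList(
            toLine("http://example.org/p1", "http://purl.org/dc/elements/1.1/title", "protein one"),
            toLine("http://example.org/p1", "http://example.org/interactsWith", "p2"));
        for (String line : molecule) {
            System.out.println(line);   // each line is independently parseable
        }
    }
}
```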
Of Mats and Cats
No universal things Re: comparing XML and RDF data models was started by Bernard Vatant. This comes to the heart of whether people can know reality (well, that's how I'd summarize the idea of universals; see Beyond Concepts).
There were a few quotes that I found interesting:
It's been counter-productive in science for centuries. Physics had to go over the notion of universal thing to understand that light is neither a wave, nor a particle. Biology to go over the notion of taxa as rigid concepts based on phenotypes to understand genetics etc. Many examples can be found in all science domains. My day-to-day experience in ontology building, listening to domain experts, is indeed not that 'there are things that people are trying to describe', but that 'there are descriptions people take for granted they represent things before you ask, but really don't know exactly what those things are when you make them look closely'.
Bijan wrote:
I do think that the family of views in computational ontologies generally called "realist" is indeed naive and fundamentally wrong headed. Whether it's a "useful fiction" that helps people write better or more compatible ontologies is an open empirical question.
But I, for one, wouldn't bet on it.
I remember also a project where we were trying to get people to write simple triples. They got that they needed triples. But what they ended up putting into the tool was things like
S P O
"The cat is" "on the" "mat".
"Mary eats" "pudding" "on toast"
They just split up the sentences into somewhat equal parts!
I really feel like an interested amateur, and my view is probably influenced by databases in computer science, where you take the non-realist approach. I say this because there are usually properties in databases that are not really based on reality but are a result of other requirements (like an "isDeleted" column rather than actually deleting the statement).
Wednesday, July 02, 2008
Round of Links
- Apache Hadoop Wins Terabyte Sort Benchmark: "One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds...This is the first time that either a Java or an open source program has won." There were just under 1000 nodes; the benchmark results are hosted by HP (a tad more detail here).
- Microsoft buys Powerset one of the interesting things is that they use Hadoop (see their blog). It's hard to tell whether this is bad or good for Hadoop.
- Google vs Microsoft - oh for structure.
- Tom talking about GridGain from his presentation in February. C++ isn't as productive as Java?
- Applets are back (according to Sun).
- Why commenting is for n00bs. "And Haskell, OCaml and their ilk are part of a 45-year-old static-typing movement within academia to try to force people to model everything. Programmers hate that. These languages will never, ever enjoy any substantial commercial success, for the exact same reason the Semantic Web is a failure. You can't force people to provide metadata for everything they do. They'll hate you."
- Some interesting discussion on Web 2.0 and the future of the web.
- Rich text editor for browsers. Not free though.
- Linked data and what it is.
- ThoughtWorks Podcasts (the REST talk was what drew me to it).
- Turtle specification. I've been looking at this for serialization of RDF molecules but it seems that you can't have blank nodes as objects using the nested syntax.
- Semantic Web for bioinformatics.
- Data structure stuff: Linear Bloom Filters, Bloom filters for Spell Checking, Optimal Bloom Filter replacements and scalable btree and B-tries for Disk-based String Management (a small Bloom filter sketch follows this list).
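Since a few of those links cover Bloom filters, here's a tiny sketch of the basic structure - the bit-array size, number of hashes and hash mixing are arbitrary choices for illustration:

```java
import java.util.BitSet;

// Minimal Bloom filter: k hash functions set k bits per key; lookups can
// return false positives but never false negatives.
public class TinyBloomFilter {
    private static final int SIZE = 1 << 16;    // bit array size, chosen arbitrarily
    private static final int NUM_HASHES = 3;    // k, also chosen arbitrarily
    private final BitSet bits = new BitSet(SIZE);

    private int index(String key, int seed) {
        int h = key.hashCode() ^ (seed * 0x9E3779B9);   // cheap seeded mixing
        return (h & 0x7fffffff) % SIZE;
    }

    public void add(String key) {
        for (int i = 0; i < NUM_HASHES; i++) bits.set(index(key, i));
    }

    public boolean mightContain(String key) {
        for (int i = 0; i < NUM_HASHES; i++) {
            if (!bits.get(index(key, i))) return false;
        }
        return true;
    }
}
```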
Tuesday, July 01, 2008
Ob. iPhone 2
Good to see carriers actually putting up a bit of a fight for iPhone business. Telstra announces iPhone 3G details: $279, $30 a month on a 24 month contract, with free access to WiFi hotspots. This better be true.
Update: Optus releases pricing
Monday, June 30, 2008
ScalaCC
Formal Language Processing in Scala links to External DSLs made easy with Scala Parser Combinators, which I'd read about from here.
Although, just to keep it balanced, I have noticed Steve Yegge's comments, under "Static Typing's Paper Tigers", on the complexity of Scala's type system (it does have a lot of it), and it has been pointed out that this leads to problems with writing IDEs to support it.
Thursday, June 26, 2008
JRDF 0.5.5
The main difference in version 0.5.5 from the previous one is the inclusion of an RDF molecule store. Both in-memory and disk-based versions are supported and can be queried just like a normal triple store. This is also the first version where the query evaluation has been renamed URQL instead of SPARQL. The SPARQL grammar is the same, but it does not support the weird outliers that SPARQL has for empty graph patterns and instead follows relational (and other) algebras. There's also the usual bug fixes and other features.
Update: Due to a couple of bugs found in 0.5.5 there will be a 0.5.5.1 version released soon.
Sunday, June 22, 2008
Beef of the Sea
Everyone is probably sick of me talking about the Gruen Transfer. So what better way to continue talking about it than to blog about it. Perhaps the best part of the show is The Pitch, especially episode one's selling whale meat (this is the runner up) and making the Democrats electable (the second is best). Who would've thought deconstructing chocolate adverts would be interesting? One of the good things is that the show is available for download. There is also some good discussion in the forum and links to some other good adverts (although it possibly should've been crows).
Tuesday, June 17, 2008
Bad Balmer
Eight Years of Wrongness. Lists some of the things believed to have gone wrong with Microsoft in the last 10 years or so. They include: losing the DOJ and EU cases, Vista, XBox, IE, Zune, and Windows Mobile. Linked mainly because they use Fake Steve as a source of analysis.
Apple Sprouts
AppleInsider has some details on SproutCore. The official web site says, "makes Javascript fun and easy" - and it's just a Ruby gem install away. They also link to some previous talk about Cocoa for Windows.
Apple's trojan horse in the runtime wars has been well known for a while.
The photo demo looks a lot like the MobileMe Gallery that was presented at WWDC 2008 (SproutCore doesn't seem to work too well under IE 7 and the rotation only works in Safari). Gallery has less functionality than things like Photoshop Express although the integration is obviously better.
There's also an interesting Javascript library for drawing 2D objects (UML, workflows, etc) that I've been shown recently called Draw 2D.
Friday, June 13, 2008
The Curse of the Floppy Penises
A Western floppy penis is more valuable than preventing blindness in an African eye (see neglected diseases). This is part of the story in the video of the launch of "The Health Commons". The video talks about how hundreds of thousands of people go blind from "river blindness". It has very little value associated with it, and drug companies focus on more valuable drugs to do with baldness and erectile dysfunction. The video goes on to talk about how the network changes things and how there's a lack of process change in science to take advantage of these effects. If you can leverage network effects then this hopefully reduces the cost of drug discovery, making drug development for less valuable diseases viable. The white paper covers some more of this in detail.
It also talks about an idea that I've often thought of as useful - the collection of failed experiments, "This deeply set inability to capture collective learning dooms everyone to revisit infinitely many blind alleys. The currency of scientific publication encourages individual scientists to hoard rather than share data that they will never have the time or resources to exhaustively mine. And, the wealth of “negative” information gleaned from clinical trial data is mostly lost to the need for companies to safeguard their commercial investments."
The general idea seems to be to share and standardize all aspects of research and science.
Thursday, June 12, 2008
Ob. iPhone
So I've been trying to find more information from a variety of sources on pricing.
The closest to reality that I've been able to find is these leaked details from Optus (via Gizmodo):
"The iPhone will only be available on a 24 month contract – no outright purchase, with the 8GB model to sell at AUD $220, and the 16GB model at $330, with only the 16GB model in white as Steve Jobs announced at the WWDC keynote.
Accessories will only be available through Apple stores – Optus will only carry the iPhone 3G itself, and the all important voice and data plans are as follows: $79 cap for $300 worth of calls and 1GB of data, or a $99 cap with $400 worth of calls and a 3G data download limit.
Visual voicemail is included, and the cap is whittled away in 35c per 30 second chunks, 25c per SMS message and the always annoying but always present flagfall which is set at 30c."
This makes it over twice as expensive as the AT&T plans (and I think they had unlimited data). This is where I get cranky about Australian carriers and their stupid plans. It would probably count me out at those prices.
Update: No more Apple rumours. As Brad says in the comments, this is wrong.
Update 2: Looks like the UK is getting a good deal.
Update 3: Gizmodo link gone...nothing to see here.
Wednesday, June 11, 2008
Evolution and SUVs
Two things that I've been interested in before: SUVs and evolution.
The first mirrors what is happening in Australia too where small cars are winning over larger ones and Falcon sales have dropped by half. The F150 isn't really that popular here.
The observation that bacteria have evolved to process other nutrients is interesting. But the related articles tend to be a bit more forceful: such as the Bible's many inaccuracies, the many occurrences of homosexuality and how it actually works in nature and that accepting evolution does not mean rejecting morality.
Tuesday, June 10, 2008
Linked Data, FOAF, and OWL DL
So I spent a little time a while ago looking through all the different ways ontologies support linked data. Some of my data I wish to link together is not RDF but documents that define a subject. For example, a protein will have peer reviewed documents that define it. It's not RDF but it is important.
The tutorial on linked data has a little bit of information: "In order to make it easier for Linked Data clients to understand the relation between http://dbpedia.org/resource/Alec_Empire, http://dbpedia.org/data/Alec_Empire, and http://dbpedia.org/page/Alec_Empire, the URIs can be interlinked using the rdfs:isDefinedBy and the foaf:page property as recommended in the Cool URI paper."
The Cool URIs paper, Section 4.5 says: "The rdfs:isDefinedBy statement links the person to the document containing its RDF description and allows RDF browsers to distinguish this main resource from other auxiliary resources that just happen to be mentioned in the document. We use rdfs:isDefinedBy instead of its weaker superproperty rdfs:seeAlso because the content at /data/alice is authoritative."
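A minimal Jena sketch of what that interlinking looks like as triples, using the dbpedia URIs from the quote (foaf:page is just created by URI here; this is only an illustration of the recommendation):

```java
import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.vocabulary.RDFS;

public class InterlinkSketch {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        Property foafPage = model.createProperty("http://xmlns.com/foaf/0.1/page");

        Resource thing = model.createResource("http://dbpedia.org/resource/Alec_Empire");
        Resource data  = model.createResource("http://dbpedia.org/data/Alec_Empire");
        Resource page  = model.createResource("http://dbpedia.org/page/Alec_Empire");

        // The resource points at its authoritative RDF description and at the
        // human-readable page, as the Cool URIs paper recommends.
        thing.addProperty(RDFS.isDefinedBy, data);
        thing.addProperty(foafPage, page);

        model.write(System.out, "N-TRIPLE");
    }
}
```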
There is also some discussion about linking in URI-based Naming Systems for Science.
Now my use case is linking things to documents that define that thing. So rdfs:seeAlso is not appropriate as it "might provide additional information about the subject resource". And rdfs:isDefinedBy is also out as it is used to link RDF documents together. I need a property that defines a thing, is authoritative but isn't linking RDF (it's for humans). I also would like to keep my ontology within OWL DL.
FOAF has a page property. I've used the OWL DL version of FOAF before and FOAF cleaner (or should that be RDFS cleaner). So it seemed like a good match. However, its inverse is topic which isn't good. Because I'm linking the thing to the page - it's not a topic. So scrub that.
RSS has a link property which extends Dublin Core's identifier. This seems more like it. However, I'd like to extend my own version of link and I'm stuck, because as soon as you use RDFS vocabularies in OWL DL you're in OWL Full territory. It'd be nice to stay in OWL DL. There is an OWL DL version of Dublin Core. All of the Dublin Core properties are nicely converted to annotation properties. However, you're still stuck because you can't make sub-properties without going into OWL Full. I like the idea of annotation, and semantically Dublin Core seems to be a suitable vocabulary of annotation properties. Extending Dublin Core is out of OWL DL - which is a shame because it's probably the closest match to what I wanted.
As an aside, annotation properties are outside the reasoning engine. The idea is that you don't want an OWL reasoner or RDF application necessarily inferring over this data or trying to look it up in order for the document to be understood. So the way they do it in OWL DL is to have annotation properties that are outside of/special to the usual statements. Sub-properties require reasoning, so limiting them makes some sense but it does hamper extensibility - it'd be nice to express them and turn on the reasoning only when asking about those properties (I think Pellet has this feature but I didn't look up the details).
The other vocabulary I looked at was SIOC's link. Again, this seems like a close match but again it's RDFS.
In the end, I just created another annotation property called link.
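For what it's worth, here's a minimal Jena sketch of that last step - the namespace and the protein/paper URIs are invented for illustration:

```java
import com.hp.hpl.jena.ontology.AnnotationProperty;
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;

public class LinkAnnotationSketch {
    public static void main(String[] args) {
        OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_DL_MEM);

        // A home-grown annotation property; annotations stay outside DL reasoning.
        AnnotationProperty link =
            model.createAnnotationProperty("http://example.org/vocab#link");

        // Point a term at the (human-readable) document that defines it.
        Resource protein = model.createResource("http://example.org/protein/P12345");
        Resource paper = model.createResource("http://example.org/papers/pmid-123456");
        protein.addProperty(link, paper);

        model.write(System.out, "RDF/XML-ABBREV");
    }
}
```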
In summary:
- For my requirements, the suggestions for linking data seem to only work for RDF and RDFS ontologies. Reusing RDFS from OWL DL or OWL DL from RDFS doesn't look feasible as one isn't a subset of the other (an old problem I guess).
- Current, popular Semantic Web vocabularies are in RDFS. Why aren't there more popular OWL DL versions of these things? Is the lack of extensibility holding it back?
- Is my expectation wrong - should I stick within OWL DL or is an RDFS and OWL DL combination okay?
- Why not allow annotation properties to have sub-properties?
- Maybe the OWL DL specification does have suitable properties for linking certain data but I don't understand which is the right one.
Update: The Neurocommons URI documentation protocol is quite similar as well, except that it seems too specific, as it ties the name to a single thing that defines it. All the parts of Step 5 could potentially be eliminated with what I'm thinking of.
Friday, May 30, 2008
Somewhere
Alarm Bells Sound for the Amazon
Brazil's land mass and farming industry make it one of the most agriculturally productive countries in the world. It has already been dubbed "the world's feeding bowl" and is exporting more and more to emerging economies, such as India and China.
As China's middle-class continues to grow, so, too, does its demand for food. Brazil exports 10 million tons of soybeans to China a year for both animal feed and human consumption, trade that is crucial to Brazil's economic development.
And it's not just poverty that's an issue.
The state of Para has some of the worst human rights abuses in Brazil. People are trafficked from across the impoverished northeast of the country to work in slavelike conditions in the sawmills, illegal charcoal ovens and cattle farms.
They usually work in horrific conditions, with no basic rights and existing on roughly $5 a day. If they try to seek help from the authorities, they are threatened with death.
There's also the WHO page on "Deaths from Climate Change".
Monday, May 26, 2008
RDF Processing
One of the interesting things about biological data, and probably other types, is that a lot of it is not quite the right structure. That's not to say that there aren't people working to improve it - the Gene Ontology seems to be updated almost daily - but data in any structure may be wrong for a particular purpose.
Biologists make a habit, out of necessity, of just hacking and transforming large amounts of data to suit their particular need. Sometimes these hacks get more generalized and abstracted, like GO Slims. We've been using GO Slims in BioMANTA for sub-cellular location (going from 2000 terms to 500). GO contains lots and lots of information and you don't need it all at once, and more often you don't need it at the maximum level of granularity that it has. Some categories only have one or two known instances, for example. You may even need to whittle this down further (from say 500 to 200). For example, when we are determining the quality of an interaction we only care where the proteins generally exist in an organism. If the two proteins are recorded as interacting but one is in the heart and the other in the liver, then it's unlikely that they actually interact in the host organism. The part of the liver or the heart and other finer structural detail is not required for this kind of work (AFAIK anyway).
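A toy sketch of that kind of filtering - the slim mapping and the co-location test are invented for illustration and are not how BioMANTA actually implements it:

```java
import java.util.*;

public class SlimColocationSketch {
    // Collapse fine-grained GO terms to coarser slim terms (mapping is made up).
    static final Map<String, String> GO_TO_SLIM = new HashMap<String, String>();
    static {
        GO_TO_SLIM.put("GO:0005743", "GO:0005739");   // inner membrane -> mitochondrion
        GO_TO_SLIM.put("GO:0005759", "GO:0005739");   // matrix         -> mitochondrion
        GO_TO_SLIM.put("GO:0005634", "GO:0005634");   // nucleus        -> nucleus
    }

    static Set<String> slim(Set<String> terms) {
        Set<String> slimmed = new HashSet<String>();
        for (String term : terms) {
            if (GO_TO_SLIM.containsKey(term)) slimmed.add(GO_TO_SLIM.get(term));
        }
        return slimmed;
    }

    // A reported interaction is only plausible if the two proteins share
    // at least one coarse-grained location.
    static boolean plausible(Set<String> locationsA, Set<String> locationsB) {
        Set<String> shared = slim(locationsA);
        shared.retainAll(slim(locationsB));
        return !shared.isEmpty();
    }
}
```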
The point is, a lot of our work is processing not querying RDF. What's the difference between the two and what effect does it have?
For a start, querying assumes, at least to some degree, that the data is selective - that the results you're getting are vastly smaller than your original data. In processing, you're taking all of the data, or large chunks of it (by sets of predicates, for example), and changing or producing more data based on the original set.
Also, writing is at least as important as reading the data. So data structures optimized for lots of writes - temporary, concurrent - are of greater importance than those built around the more familiar requirements of a database.
Sorting and processing distinct items is a lot more important too. When processing millions of data entries it can be quite inefficient if the data has a large number of duplicates and needs to be sorted. Processing can also be decentralized - or perhaps more decentralized.
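For instance, a crude way to get sorted, de-duplicated statements out of a processing run is to push the NTriples lines through a sorted set before writing them back out - a minimal sketch that ignores the memory limits that would push a real job towards an external sort (or something like Hadoop):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.TreeSet;

public class SortDistinct {
    public static void main(String[] args) throws IOException {
        // TreeSet gives de-duplication and ordering in one pass; a real
        // pipeline over millions of triples would spill to disk instead.
        TreeSet<String> distinct = new TreeSet<String>();
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            distinct.add(line);
        }
        for (String triple : distinct) {
            System.out.println(triple);
        }
    }
}
```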
To top it off, the data still has to be queried. So this doesn't remove the need for efficient, read-only data structures to perform selective queries for the usual analysis, reporting, etc. So none of the existing problems go away.
Monday, May 12, 2008
git + RDF = versioned RDF
Reading Git for Computer Scientists, it seems like if you turn the blob into a set of triples you pretty much have versioned RDF (or even molecules).
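A rough sketch of the idea: canonicalise the triples (here just by sorting the NTriples lines), hash them the way git hashes a blob, and you have a stable identifier for that version of the graph or molecule. This is only an illustration, not an actual git or JRDF integration:

```java
import java.security.MessageDigest;
import java.util.TreeSet;

public class TripleSetHash {
    // Hash a set of NTriples lines like a git blob: sort for a canonical
    // order, prepend a typed header, then SHA-1 the whole thing.
    static String hash(TreeSet<String> triples) throws Exception {
        StringBuilder body = new StringBuilder();
        for (String triple : triples) {
            body.append(triple).append('\n');
        }
        String content = "blob " + body.length() + "\0" + body;
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(content.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();   // same triples in any insertion order give the same id
    }
}
```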
I'm also wondering, if Digg is so pro-Semantic Web, where's the http://digg.com/semweb?