Tuesday, January 03, 2012
A recent article at LinkedIn called "Recap: Improving Hadoop Performance by (up to) 1000x" had a section called "Drill Bit #2: graph processing" mentioning the problem of partitioning the triples of RDF graphs amongst different nodes.
According to "Scalable SPARQL Querying of Large RDF Graphs" they use MapReduce jobs to create indexes where triples such as (s, p, o) and (o, p', o'') end up on the same compute node. The idea of using MapReduce to create better indexing is not a new one - but it's good to see the approach being used to prepare RDF for querying rather than actually using MapReduce jobs to do the querying. It's similar to what I did with RDF molecules, creating a level of granularity between graphs and nodes, and to things like Nutch.
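As a rough sketch of the co-location idea (mine, not the paper's code - the class name and parsing are invented for the example): a Hadoop map phase can emit each triple under both its subject and its object, so triples that share a resource are routed to the same reducer and end up indexed on the same node.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: route each N-Triples line to the partitions of both its subject and
// its object, so joins such as (s, p, o) and (o, p', o'') can be answered locally.
public class TripleColocationMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\\s+", 3); // naive N-Triples split
        if (parts.length < 3) {
            return; // skip malformed lines
        }
        String subject = parts[0];
        String object = parts[2].replaceAll("\\s*\\.\\s*$", ""); // drop the trailing " ."
        // The partitioner then groups everything that mentions the same resource
        // onto one reducer, which writes that node's local index.
        context.write(new Text(subject), line);
        context.write(new Text(object), line);
    }
}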
Monday, December 13, 2010
What's a Dual?
Erik Meijer has been giving a talk at YOW (in Melbourne) and at QCon in San Francisco about key-value stores being the dual of SQL.
The obvious question is: what is a dual? And like most really good questions you already know the answer. The Boolean operations AND and OR are dual. Which means that if A OR F = A is true, then by replacing OR with AND and F with T (or, as Erik said, switching the arrows), you get A AND T = A. Likewise, A OR T = T becomes A AND F = F.
This is the bit where Maths almost becomes magical. Why should these properties hold? Why doesn't this work for multiplication and addition?
In his talk, Erik quoted Benjamin Black saying that the noSQL movement took away the relational model and gave nothing back. His talk also seems to be about the beginnings of a noSQL/coSQL algebra and a project to write a noSQL database within Microsoft (noSQL Server perhaps?). He's obviously not the only one thinking about these kinds of things.
He also mentioned closure as an important property of numbers and other things. That is, you plug 1 + 1 in and you get 2, then you can plug 2 into + 1 and get 3, and so on - the answer of one operation can be used as the input to the next. You can't do that with SQL or SPARQL, which in my mind is a big fail. Erik agrees - and suggested that noSQL (or coSQL) should support this property. He's not the only one to think that noSQL should support the Unix philosophy (which isn't just closure, but closure is an important part of it). And he's certainly not going to be the last.
He described these properties in the same way as I discovered them in RDF, which was through graphs. The object model in LINQ has a graph of objects - objects which don't have a globally identifiable name, much like blank nodes. The cool thing about LINQ, well one of the many cool things, is that it supports closure - you query your object graph, you get back another object graph, and you can continue to plug away at it as you narrow your query. This certainly makes a lot of sense from a user querying data perspective (among others) - you start with all students, you ask for all male students, then you ask for all male students with a grade point average above 5, or whatever. Narrower and narrower is one typical way people find the answer they're looking for.
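As a minimal sketch of that closure property - using Java streams as a stand-in for LINQ, with an invented Student class - each query returns another collection, so the next question can be asked of the previous answer:

import java.util.List;
import java.util.stream.Collectors;

public class NarrowingQueries {
    record Student(String name, String gender, double gpa) {}

    public static void main(String[] args) {
        List<Student> all = List.of(
                new Student("Alice", "F", 6.1),
                new Student("Bob", "M", 5.4),
                new Student("Carol", "F", 4.9),
                new Student("Dave", "M", 4.2));

        // Each step takes a collection and gives back a collection,
        // so the answer to one query is the input to the next.
        List<Student> male = all.stream()
                .filter(s -> s.gender().equals("M"))
                .collect(Collectors.toList());
        List<Student> maleAboveFive = male.stream()
                .filter(s -> s.gpa() > 5)
                .collect(Collectors.toList());

        System.out.println(maleAboveFive); // just Bob
    }
}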
The other thing he mentioned is that SQL is a nightmare due to normalization, the complexity of queries, null semantics (and computing over nulls) and the difference between objects and tables. Erik showed that objects and relations are dual.
This is the bit where I introduce monads and say it's like pizza delivery or something. Screw that. The point is just that if your operations follow laws then they can be mapped to other things and inherit any work done to make those things good - like optimisation or scaling. In Erik's talk that meant finding out cool things like noSQL being dual to SQL and LINQ being implementable as MapReduce (see this slide of Erik's). This works because LINQ operations are associative and MapReduce operations are also associative. It's not just a good idea, it's a monadic law. If you can, read 10 pages on category theory (I'm only up to page 8 though).
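A tiny demonstration of why associativity matters (my example, not Erik's): if the combining operation is associative, the work can be split into chunks, combined afterwards, and still give the same answer - which is exactly what lets a MapReduce-style runtime distribute it.

import java.util.List;

public class AssociativityDemo {
    public static void main(String[] args) {
        List<Integer> data = List.of(3, 1, 4, 1, 5, 9, 2, 6);

        // One sequential fold over all the data.
        int sequential = data.stream().reduce(0, Integer::sum);

        // The same fold split into two "partitions" and combined afterwards,
        // as a distributed runtime would do it. Associativity guarantees the same result.
        int left = data.subList(0, 4).stream().reduce(0, Integer::sum);
        int right = data.subList(4, 8).stream().reduce(0, Integer::sum);

        System.out.println(sequential == left + right); // true
    }
}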
Obviously, not everything is peaches. The noSQL/coSQL people say don't do joins - they don't scale - you can't self join on the web for example. In a MapReduce/noSQL system you either end up querying multiple times or you end up processing the data again. I'd suggest that someone will write a library that does multiple queries and joins over noSQL databases so that you don't have to do it yourself - maybe locally and in memory at first, and then with disk based structures for scalability later - I guess this is where LINQ would fit in.
There were lots of dual properties in the relational vs noSQL/coSQL talk. I only remember a few (SQL vs noSQL/coSQL): closed world vs open world, identity by value vs identity by reference (or by properties), and (I might be wrong) transactional vs non-transactional.
So with our arrows in hand, ignoring how cool the world will be without JOINs and UNIONs, and armed with our mighty knowledge of set theory, what would JOIN and UNION look like if they were dual and if such things existed as 0 (the empty relation) and U (the universal relation) for relations? U JOIN R = R becomes 0 UNION R = R, and likewise U UNION R = U becomes 0 JOIN R = 0 (which matches my existing prejudices).
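The same pattern is easy to check with plain sets (a quick sketch of just those identities, nothing more): intersecting with the universe and unioning with the empty set are both identities, and swapping the operations swaps the identities.

import java.util.HashSet;
import java.util.Set;

public class DualIdentities {
    public static void main(String[] args) {
        Set<Integer> universe = Set.of(1, 2, 3, 4, 5, 6);
        Set<Integer> r = Set.of(2, 4, 6);

        // U JOIN R = R: intersection with the universe is the identity.
        Set<Integer> joinWithU = new HashSet<>(universe);
        joinWithU.retainAll(r);
        System.out.println(joinWithU.equals(r)); // true

        // 0 UNION R = R: union with the empty set is the identity.
        Set<Integer> unionWithEmpty = new HashSet<>();
        unionWithEmpty.addAll(r);
        System.out.println(unionWithEmpty.equals(r)); // true

        // Swap the operations and the identities swap too:
        // U UNION R = U and 0 JOIN R = 0.
        Set<Integer> unionWithU = new HashSet<>(universe);
        unionWithU.addAll(r);
        System.out.println(unionWithU.equals(universe)); // true

        Set<Integer> joinWithEmpty = new HashSet<>();
        joinWithEmpty.retainAll(r);
        System.out.println(joinWithEmpty.isEmpty()); // true
    }
}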
Just to put the cherry on this pig of a blog post (to mix metaphors), objects and functions look like duals too (given that people claim there's no meaning behind what an object is and what a dual is I feel safe anyway):
But from another perspective, the apply "method" of a closure can be used as a low-level method dispatch mechanism, so closures can be, and are, used to implement very effective objects with multiple methods...With closures seen as a building block with which to implement objects, it's clear that objects are a poor man's closures...If your language only has these restricted closures, and you're forced to build an object system on top of them, it's clear that closures are a poor man's objects.
Erik via Eugene Wigner said there's an unreasonable effectiveness of mathematics and I have to agree. Monads, duals, properties for LINQ, noSQL, relational databases, SPARQL, and so on. Very unreasonable and very cool.
Saturday, November 21, 2009
JRDF 0.5.6 and testing
JRDF 0.5.6 has JSON SPARQL support and many things got a good refactoring including the MergeJoin (which now looks more like typical text book examples - for better or worse). The upgrade to Java BDB version 4 has improved disk usage and performance mostly around temporary results from finds and the like. The next version will include named graph support.
I'm considering this being the last version to support Java 5.
This release was also about learning things like Hamcrest, JUnit 4.7 and various mock extensions (Unitils, Powermock and Spring automocking - the last of which didn't make it due to not being able to mix and match runners). This seems to be a design flaw that I keep encountering in JUnit: you can't mix and match features from different runners. Even combining Powermock with JUnit's Rules (for expected exceptions anyway) was problematic. The answer was to go back to the inner class block version.
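For what it's worth, the fallback looks roughly like this (a sketch rather than the actual JRDF test code): instead of relying on the ExpectedException rule, which needs a cooperating runner, the test catches and asserts on the exception itself.

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.fail;

import org.junit.Test;

public class ExpectedExceptionWithoutRulesTest {
    @Test
    public void rejectsNulls() {
        try {
            operationUnderTest(null); // stand-in for the real call being tested
            fail("Expected an IllegalArgumentException");
        } catch (IllegalArgumentException expected) {
            assertEquals("null is not allowed", expected.getMessage());
        }
    }

    // Hypothetical method standing in for the code under test.
    private void operationUnderTest(Object argument) {
        if (argument == null) {
            throw new IllegalArgumentException("null is not allowed");
        }
    }
}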
Thursday, March 12, 2009
Some Nice Programming with Jenabeans
Writing out SIOC triples using Jena + Jenabean "Jenabean’s model connected programming model makes this easy, using interfaces that declare each of the vocabularies as a set of methods." The code example shows how you can piece together (using a fluent interface) different vocabularies:
Thing thing = new Thing(m);
thing.at(uri1).
    as(DCTerms.class).
    title("Creating connections between discussion clouds with SIOC").
    created("2006-09-07T09:33:30Z").
    isa(Sioc.Post.class).
    has_container(thing.at(uri2)).
    ...
Tuesday, December 09, 2008
Restlet Talk
I spoke last night at the Java Users Group about Restlet. It was a basic introduction to both Restlet and trying to link data across web sites. I wasn't very happy with the example - it was basically stolen from a Rails introduction. At least I could answer the question about why you would allow your data to be searched (to sell adverts on your recipe web site). I think it went down okay, mainly because most Java developers are used to large frameworks and complicated APIs to do what Restlet does (so it's impressive), the Rails developers knew some of the concepts already, and while most are wary of RDF, SPARQL, OWL and the Semantic Web stack, it was a fairly incremental addition to achieve something reasonably powerful.
Thursday, December 04, 2008
Getting Groovy with JRDF
In an effort to speed up and improve the test coverage in JRDF I've started writing some of the tests in Groovy. It's been a good experience so far - so much so that I'm probably not going back to writing tests in Java in the future.
One of the things I wanted to try was an RdfBuilder, which is similar to Groovy's NodeBuilder.
There are a couple of things that make it a bit tricky. When parsing or debugging builders I haven't yet found a way to list the methods/properties available, even using MetaClass. And of course, when the magic goes wrong it's a bit harder to debug Groovy than Java.
It certainly smartens up the creation of triples, for example (bits from the NTriples test case):
def rdf = new RdfBuilder(graph)
rdf.with {
    namespace("eg", "http://example.org/")
    namespace("rdfs", "http://www.w3.org/2000/01/rdf-schema#")
    "eg:resource1" "eg:property":"eg:resource2"
    "_:anon" "eg:property":"eg:resource2"
    "eg:resource1" "eg:property":"_:anon"
    (3..6).each {
        "eg:resource$it" "eg:property":"eg:resource2"
    }
    "eg:resource7" "eg:property":'"simple literal"'
    "eg:resource17" ("eg:property":['"\\u20AC"', '"\\uD800\\uDC00"', '"\\uD84C\\uDFB4"', '"\\uDBFF\\uDFFF"'])
    "eg:resource24" "eg:property":'"<a></a>"^^rdfs:XMLLiteral'
    "eg:resource31" "eg:property":'"chat"@en'
}
The first two lines of the block define the two namespaces used. The third line shows the general use of RDF with Groovy - it works out well, as an RDF predicate and object map to an attribute and value in Groovy. The next two lines show how you refer to the same blank node across two statements. And the following lines show using ranges and creating different types of literals. The third last line creates 4 triples with the same subject and predicate but with different objects. Using the builder results in a file that's smaller than the test case file. You could remove some duplication by creating a method that takes in a number and the object and generates "eg:resource$number" "eg:property" "$object" but doing that may actually make it harder to read.
If you stick to only using URIs you can do things like:
rdf.with {
    urn.foo6 {
        urn.bar {
            urn.baz1
            urn.baz2
        }
    }
}
Which produces two triples: "urn:foo6, urn:bar, urn:baz1" and "urn:foo6, urn:bar, urn:baz2". I expect that JRDF will only be more Groovy friendly in the future.
Wednesday, November 12, 2008
Indexing for Efficient SPARQL
Another interesting way of indexing triples: A role-free approach to indexing large RDF data sets in secondary memory for efficient SPARQL evaluation "We propose a simple Three-way Triple Tree (TripleT) secondary-memory indexing technique to facilitate efficient SPARQL query evaluation on such data sets. The novelty of TripleT is that (1) the index is built over the atoms occurring in the data set, rather than at a coarser granularity, such as whole triples occurring in the data set; and (2) the atoms are indexed regardless of the roles (i.e., subjects, predicates, or objects) they play in the triples of the data set. We show through extensive empirical evaluation that TripleT exhibits multiple orders of magnitude improvement over the state of the art on RDF indexing, in terms of both storage and query processing costs."
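As I read it (this is just my sketch, not the paper's data structure), the index is keyed by atoms - any term that appears anywhere in a triple - and each entry records the triples the atom occurs in, whatever role it plays there.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AtomIndex {
    public record Triple(String subject, String predicate, String object) {}

    // One index over atoms rather than separate subject/predicate/object indexes.
    private final Map<String, List<Triple>> occurrences = new HashMap<>();

    public void add(Triple t) {
        // Index the triple under every atom it contains, regardless of role.
        for (String atom : new String[] {t.subject(), t.predicate(), t.object()}) {
            occurrences.computeIfAbsent(atom, a -> new ArrayList<>()).add(t);
        }
    }

    public List<Triple> lookup(String atom) {
        return occurrences.getOrDefault(atom, List.of());
    }
}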
While looking around on arXiv I did a quick search and found two more interesting papers that seem related to a previous discussion on how the Semantic Web needs its own programming language - or, I would say, at least a way to process the web of data - both by Marko A. Rodriguez: "The RDF Virtual Machine" and "A Distributed Process Infrastructure for a Distributed Data Structure".
Thursday, October 30, 2008
A Billion Triples in your Pocket
At the Billion Triples Challenge this afternoon Cathrin Weiss showed off i-MoCo. It's a demonstration not only of a Semantic Web application on an iPhone using a stupid number of triples, but also of their indexing technique, Hexastore, which keeps all 6 orderings of RDF's subject, predicate and object in order to improve querying (the indexes are sorted, which allows you to do merge joins). Actually, this made me think that the next steps in RDF triple stores will be indexes that are optimised for SPARQL operations and OWL inferences. Indexes for transitive closure perhaps? The data is regular and the storage is available to index triples in ways that improve querying performance.
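A minimal sketch of the six-index idea (not Hexastore's actual layout): keep the same triples sorted in every permutation of subject, predicate and object, so that whichever triple patterns a join needs, both sides can be scanned in a shared sort order and merge-joined.

import java.util.Comparator;
import java.util.EnumMap;
import java.util.Map;
import java.util.TreeSet;
import java.util.function.Function;

public class SixIndexes {
    public record Triple(String s, String p, String o) {}

    public enum Order { SPO, SOP, PSO, POS, OSP, OPS }

    private final Map<Order, TreeSet<Triple>> indexes = new EnumMap<>(Order.class);

    public SixIndexes() {
        indexes.put(Order.SPO, new TreeSet<>(by(Triple::s, Triple::p, Triple::o)));
        indexes.put(Order.SOP, new TreeSet<>(by(Triple::s, Triple::o, Triple::p)));
        indexes.put(Order.PSO, new TreeSet<>(by(Triple::p, Triple::s, Triple::o)));
        indexes.put(Order.POS, new TreeSet<>(by(Triple::p, Triple::o, Triple::s)));
        indexes.put(Order.OSP, new TreeSet<>(by(Triple::o, Triple::s, Triple::p)));
        indexes.put(Order.OPS, new TreeSet<>(by(Triple::o, Triple::p, Triple::s)));
    }

    private static Comparator<Triple> by(Function<Triple, String> first,
                                         Function<Triple, String> second,
                                         Function<Triple, String> third) {
        return Comparator.comparing(first).thenComparing(second).thenComparing(third);
    }

    // Every triple goes into all six sorted orderings.
    public void add(Triple t) {
        indexes.values().forEach(index -> index.add(t));
    }

    // A query picks the ordering that matches its bound positions and scans in order.
    public Iterable<Triple> scan(Order order) {
        return indexes.get(order);
    }
}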
That wasn't the only impressive demo today. For me it's a toss up between another iPhone SemWeb demo, DBPedia mobile and SemaPlorer. DBPedia stood out because it was the only one that allowed you to write data to the Semantic Web rather than just read the carefully prepared triples. For a similar reason I thought SemaPlorer was good because they tried to do more by making it more generic like integrating flickr. But they were all excellent, all of them showing what you get with a billion or more triples and inferencing.
That combined with the guys at Freebase making all of their data available as RDF and it was a big day for the Semantic Web.
Update: I noticed that in John Giannadrea's talk, when he mentioned the three aspects of Freebase, he went from the bottom up - though I'm probably reading too much into it.
Update: Also caught an interview with John Francis the guy who stopped talking, started walking, stopped taking mechanical transport and tried to change the world.
Update: ISWC 2008 awards have been published.
Thursday, October 02, 2008
Coffee Inspired
- Global graphs in JRDF were inspired by the work done on MSG (minimum self-contained graphs, published in the RDFSync paper) and RDF Molecules. The former links to an implementation of DBin (a P2P Semantic Web client) and there's also GVS (Graph Versioning System).
- It's a trap (cloud computing). It's a fairly typical Stallman statement - not wrong, but not aware of the compromises people make. It is obvious that one of the reasons vendors are excited about cloud computing is because it is a chance for them to try and own your data (or at least make switching too hard). But you do have to do more than put data in the cloud - you have to have executable services there too. There is open source cloud computing infrastructure that you can run up on your own servers, like Hadoop or CouchDB. And it's not just back to mainframes and renting CPU time out by the hour - it really is different to what has gone before.
- Live the Cloud Life lists the cloud computing applications in categories such as email, documents, data, music, photo editing and storing and browser synchronisation. Some categories are missing like RSS reading and calendaring.
- Apple drops NDA - the outrage worked.
- The main eResearch 2008 conference is over and some of the papers are available.
- Speaking of which, all development, and that definitely includes software, in whatever organisation (universities, governments, banks, etc) should have failure as an option. One sign of a truly stuffed culture is to never have a project fail.
- Muradora a refactoring of Fedora to allow pluggable authentication and enable metadata editing.
- Acer Aspire One links: Dual Monitor Support, Installing Firefox 3 on Acer Aspire One Linux and Updated repositories. It's a shame that OpenOffice doesn't support presenter view yet.
Tuesday, September 16, 2008
Make Every Web Site a Semantic Web Site
Back in March this was news but I completely missed it. Dapper has a rather nice way of turning web sites into data - XML, JSON, etc. - but it also includes a semantify service, which uses existing namespaces (FOAF, GSS, Creative Commons, Media RSS and Dublin Core) supported by Yahoo's search engine.
This is covered in more depth in Semantify Hacks - Creating your own RDF schema using Dapper:
So now, building a Dapp means you also built your own RDF compatible schema, that you can use wherever by just pointing to the webservice:
http://www.dapper.net/websiteServices/dapp-scheme.php?dappName=MYDAPP
The given example is MSN's search engine which you can see in all its RDF/XML glory.
ReadWriteWeb has step-by-step instructions.
Thursday, August 21, 2008
Information Poo
- Inspiron 910 hopefully will come out on Friday. Inspiron means cheap in Dell marketing? I guess it's better than Dell E. There's something wrong with looking forward to a Dell. I mean there is no Dell Insider or Inspiron Rumors.
- SearchMonkey Tutorials. SearchMonkey uses a format similar to RDF called DataRSS. The FAQ describes the parts of the Semantic Web that work with the parts of SearchMonkey - no reasoning just the triples.
- Introduction to High-Level Programming With Scala. A compiled language that is finally catching up with Ruby's duck typing.
- Dr Horrible Ringtones.
- Freebase and Parallax talks about the excellent demo making the rounds. It's available as an open source project. Who knew George Bush has had more than one assassination attempt made on him? That's the power of the Semantic Web.
- The goodness of Google apps talks about another happy corporate customer (except for Google Code - integrating GitHub does sound good) and mentions Guido's code review tool part of rietveld a Google App Engine application.
- ccREL: The Creative Commons Rights Expression Language. Using RDFa so you don't repeat yourself.
- SCOPE, Automatic Hadoop Optimization and Cassandra (Facebook's BigTable) all via Greg. Cassandra is a Google Code project and includes some good documentation.
- OWL ED papers including an ontology versioning system and OWL constraints.
Tuesday, July 29, 2008
YADS and RDF Molecules
BNodes Out! discusses how any usefully scalable system doesn't use blank nodes. What is interesting is the comment on YADS (Yet Another DOI Service). The best reference is Tony's presentation although it is mentioned in Jane's as well. "YADS implements a simple, safe and predictable recursive data model for describing resource collections. The aim is to assist in programming complex resource descriptions across multiple applications and to foster interoperability between them...So, the YADS model makes extensive use of bNodes to manage hierarchies of “fat” resources - i.e. resource islands, a resource decorated with properties. The bNodes are only used as a mechanism for managing containment."
This sounds a lot like RDF molecules and supports visualization (apparently). This seems like a good use of molecules that I hadn't previously thought of (Tony's talk gives an example of the London underground). The main homepage of YADS isn't around anymore - it'll be interesting to see if it's still being used/worked on.
Update: Tony has fixed up the YADS home page (there's also an older version).
Tuesday, June 10, 2008
Linked Data, FOAF, and OWL DL
So I spent a little time a while ago looking through all the different ways ontologies support linked data. Some of the data I wish to link together is not RDF but documents that define a subject. For example, a protein will have peer-reviewed documents that define it. They're not RDF but they are important.
The tutorial on linked data has a little bit of information: "In order to make it easier for Linked Data clients to understand the relation between http://dbpedia.org/resource/Alec_Empire, http://dbpedia.org/data/Alec_Empire, and http://dbpedia.org/page/Alec_Empire, the URIs can be interlinked using the rdfs:isDefinedBy and the foaf:page property as recommended in the Cool URI paper."
The Cool URIs paper, Section 4.5 says: "The rdfs:isDefinedBy statement links the person to the document containing its RDF description and allows RDF browsers to distinguish this main resource from other auxiliary resources that just happen to be mentioned in the document. We use rdfs:isDefinedBy instead of its weaker superproperty rdfs:seeAlso because the content at /data/alice is authoritative."
There is also some discussion about linking in URI-based Naming Systems for Science.
Now my use case is linking things to documents that define that thing. So rdfs:seeAlso is not appropriate as it "might provide additional information about the subject resource". And rdfs:isDefinedBy is also out as it is used to link RDF documents together. I need a property that links a thing to the document defining it, is authoritative, but doesn't point to RDF (it's for humans). I also would like to keep my ontology within OWL DL.
FOAF has a page property. I've used the OWL DL version of FOAF before and FOAF cleaner (or should that be RDFS cleaner). So it seemed like a good match. However, its inverse is topic which isn't good. Because I'm linking the thing to the page - it's not a topic. So scrub that.
RSS has a link property which extends Dublin Core's identifier. This seems more like it. However, I'd like to extend my own version of link and I'm stuck, because as soon as you use RDFS vocabularies in OWL DL you're in OWL Full territory. It'd be nice to stay in OWL DL. There is an OWL DL version of Dublin Core. All of the Dublin Core properties are nicely converted to annotation properties. However, you're still stuck because you can't make sub-properties without going into OWL Full. I like the idea of annotation and semantically Dublin Core seems to be a suitable vocabulary of annotation properties. Extending Dublin Core is out of OWL DL - which is a shame because it's probably the closest match to what I wanted.
As an aside, annotation properties are outside the reasoning engine. The idea is that you don't want an OWL reasoner or RDF application necessarily inferring over this data or trying to look it up in order for the document to be understood. So the way they do it in OWL DL is to have annotation properties that are outside of/special to the usual statements. Sub-properties require reasoning, so limiting them makes some sense but it does hamper extensibility - it'd be nice to express them and turn on the reasoning only when asking about those properties (I think Pellet has this feature but I didn't look up the details).
The other vocabulary I looked at was SIOC's link. Again, this seems like a close match but again it's RDFS.
In the end, I just created another annotation property called link.
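For the record, creating that kind of annotation property with Jena's ontology API looks roughly like this (a sketch - the namespace, property URI and instance data are made up, not the ones I actually used):

import com.hp.hpl.jena.ontology.AnnotationProperty;
import com.hp.hpl.jena.ontology.Individual;
import com.hp.hpl.jena.ontology.OntClass;
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class LinkAnnotationExample {
    public static void main(String[] args) {
        String ns = "http://example.org/ontology#"; // hypothetical namespace
        OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_DL_MEM);

        // An annotation property stays outside DL reasoning, so the ontology
        // remains in OWL DL while still pointing things at their defining documents.
        AnnotationProperty link = model.createAnnotationProperty(ns + "link");

        OntClass protein = model.createClass(ns + "Protein");
        Individual p53 = model.createIndividual(ns + "P53", protein);
        p53.addProperty(link, model.createResource("http://example.org/papers/p53-review"));

        model.write(System.out, "RDF/XML-ABBREV");
    }
}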
In summary:
- For my requirements, the suggestions for linking data seem to only work for RDF and RDFS ontologies. Reusing RDFS from OWL DL or OWL DL from RDFS doesn't look feasible, as one isn't a subset of the other (an old problem I guess).
- Current, popular Semantic Web vocabularies are in RDFS. Why aren't there more popular OWL DL versions of these things? Is the lack of extensibility holding it back?
- Is my expectation wrong - should I stick within OWL DL or is an RDFS and OWL DL combination okay?
- Why not allow annotation properties to have sub-properties?
- Maybe the OWL DL specification does have suitable properties for linking certain data but I don't understand which is the right one.
Update: The Neurocommons URI documentation protocol is quite similar as well, except that it seems to be too specific, as it ties the name to a single thing that defines it. All the parts of Step 5 could potentially be eliminated with what I'm thinking of.
Monday, May 26, 2008
RDF Processing
One of the interesting things about biological data, and probably other types, is that a lot of it is not quite in the right structure. That's not to say that there aren't people working to improve it - the Gene Ontology seems to be updated almost daily - but data in any structure may be wrong for a particular purpose.
Biologists make a habit, out of necessity, of just hacking and transforming large amounts of data to suit their particular need. Sometimes these hacks get more generalized and abstracted, like GO Slims. We've been using GO Slims in BioMANTA for sub-cellular location (going from 2000 terms to 500). GO contains lots and lots of information; you don't need it all at once, and more often you don't need it at the maximum level of granularity it has. Some categories only have one or two known instances, for example. You may even need to whittle this down further (from say 500 to 200). For example, when we are determining the quality of an interaction we only care where the proteins exist generally in an organism. If two proteins are recorded to interact but one is in the heart and the other in the liver then it's unlikely that they actually interact in the host organism. The part of the liver or the heart and other finer structural detail is not required for this kind of work (AFAIK anyway).
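As a rough sketch of that kind of transformation (the term names and mapping are invented, not the real GO identifiers or the BioMANTA code): map fine-grained location terms up to a slim, then judge a recorded interaction by whether the slimmed locations are compatible.

import java.util.Map;
import java.util.Objects;

public class SlimLocationCheck {
    // Fine-grained term -> slim term (hypothetical identifiers).
    private static final Map<String, String> SLIM = Map.of(
            "term:cardiac_muscle_cell", "heart",
            "term:hepatocyte", "liver",
            "term:liver_sinusoid", "liver");

    static String slim(String term) {
        return SLIM.getOrDefault(term, "unknown");
    }

    // A recorded interaction is doubtful if the two proteins live in
    // different slimmed locations.
    static boolean plausible(String locationA, String locationB) {
        return Objects.equals(slim(locationA), slim(locationB));
    }

    public static void main(String[] args) {
        System.out.println(plausible("term:hepatocyte", "term:liver_sinusoid"));      // true
        System.out.println(plausible("term:hepatocyte", "term:cardiac_muscle_cell")); // false
    }
}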
The point is, a lot of our work is processing, not querying, RDF. What's the difference between the two and what effect does it have?
For a start, querying assumes, at least to some degree, that the data is selective - that the results you're getting are vastly smaller than your original data. In processing, you're taking all of the data, or large chunks of it (by sets of predicates, for example), and changing it or producing more data based on the original set.
Also, writing is at least as important as reading the data. So data structures optimized for lots of writes - temporary, concurrent - are of greater importance than those built around the more familiar requirements of a database.
Sorting and processing distinct items are a lot more important too. When processing millions of data entries it can be quite inefficient if the data has a large number of duplicates and needs to be sorted. Processing can also be decentralized - or perhaps more easily decentralized.
To top it off, the data still has to be queried. So this doesn't remove the need for efficient, read-only data structures to perform selective queries for the usual analysis, reporting, etc. None of the existing problems goes away.
Monday, May 12, 2008
git + RDF = versioned RDF
Reading Git for Computer Scientists, it seems like if you turn the blob into a set of triples you pretty much have versioned RDF (or molecules even).
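A rough sketch of what I mean (not an actual design): hash a canonically ordered set of triples the way git hashes a blob's content, and the digest becomes the version identifier for that piece of the graph.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.TreeSet;

public class TripleBlob {
    public static void main(String[] args) throws NoSuchAlgorithmException {
        // A sorted set gives a canonical order, so the same triples always hash the same.
        TreeSet<String> triples = new TreeSet<>();
        triples.add("<urn:foo6> <urn:bar> <urn:baz1> .");
        triples.add("<urn:foo6> <urn:bar> <urn:baz2> .");

        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        for (String triple : triples) {
            sha1.update(triple.getBytes(StandardCharsets.UTF_8));
            sha1.update((byte) '\n');
        }

        StringBuilder id = new StringBuilder();
        for (byte b : sha1.digest()) {
            id.append(String.format("%02x", b));
        }

        // The digest names this exact set of triples, like a git blob id:
        // change any triple and you get a new id, while old ids keep naming old versions.
        System.out.println(id);
    }
}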
I'm also wondering, if Digg is so pro-Semantic Web, where's the http://digg.com/semweb?
Tuesday, May 06, 2008
I See Triples
Digg makes official its adoption of a 'semantic Web' standard "Other brief mentions on Digg's blogs over the past month have been the only indications the company has been giving to the world of its direct -- and perhaps even principal -- involvement in RDF and RDFa, besides a simple check of the site's own source code, where attributions such as rel="dc:source" property="dc:title" within <DIV> elements are now common. A few weeks ago, developer Bob DuCharme discovered these little attributions and began playing with them to discern their viability."
"The possibility exists for a kind of mega-meta-source to emerge from Digg, where interesting news topics are associated with cataloged resources. But for that to actually work, someone has to manage those resources -- and that effort will take a level of humanpower and resources of another kind (the kind symbolized with "$") that RDF won't provide even the most ambitious sites just on its own."
See Digging RDFa. More news about RDFa is available at RDFa.info.
To see Digg in all its RDFa glory one way is to copy this Javascript for highlighting or this one for getting RDF triples into you bookmark bar after the Digg front page has loaded.
So I still haven't finished writing up everything that I've saw at WWW2008 but the overall messages were:
"The possibility exists for a kind of mega-meta-source to emerge from Digg, where interesting news topics are associated with cataloged resources. But for that to actually work, someone has to manage those resources -- and that effort will take a level of humanpower and resources of another kind (the kind symbolized with "$") that RDF won't provide even the most ambitious sites just on its own."
See Digging RDFa. More news about RDFa is available at RDFa.info.
To see Digg in all its RDFa glory, one way is to copy this Javascript for highlighting, or this one for getting RDF triples, into your bookmark bar and run it after the Digg front page has loaded.
So I still haven't finished writing up everything that I saw at WWW2008 but the overall messages were:
- RDFa is easy and gets people going with RDF quickly (see "They knew the train would come"). Semantic wikis (links to the Semantic MediaWiki project) have also come a long way toward making it more, err, user friendly.
- HTML5 and the end of the browser development winter seems like the death to plugins at last. I hadn't realized this before, but the message seems to be that a plugin is a way of saying to the Web "your browser isn't full featured enough".
- The Facebooks of the world and all those online communities really are a danger to the Web - the creation of data silos. And I'd really like to have the time to write some SIOC plugins to help open up these silos (or just change my blog template to have RDFa).
Friday, May 02, 2008
When URIs are too Much
Every Subject is a Blank Node "In RDF, URIs are good at defining unambiguous property values, in other words objects, including type. But very often, and maybe most of the time, the individual subject (in both meaning of subject of an RDF triple, and topic maps subject of conversation) is best represented as a blank node bearing all kinds of identified properties, but none of them conferring absolute identity. This way, it's left to applications to figure out identification rules, in other words which property or boolean combination of properties they want to consider as identifying or not."
From the mailing list: "With no URI, you are free to let applications decide which contexts are considered the same or not, based on specific rules on properties. Some applications would decide that all contexts where role "I" is played by "John Black" are the same, and will cluster all contextResource properties, some other will not."
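A quick sketch of what "applications decide" could look like in code (my example, with invented property names, not anything from the proposal): two blank-node contexts are merged only if an application-chosen set of identifying properties matches.

import java.util.Map;
import java.util.Objects;
import java.util.Set;

public class IdentificationRules {
    // A blank-node subject is just a bag of properties, none of them a global identifier.
    public record Context(Map<String, String> properties) {}

    // The application chooses which properties confer identity.
    static boolean sameSubject(Context a, Context b, Set<String> identifyingProperties) {
        return identifyingProperties.stream()
                .allMatch(p -> Objects.equals(a.properties().get(p), b.properties().get(p)));
    }

    public static void main(String[] args) {
        Context c1 = new Context(Map.of("role:I", "John Black", "ex:place", "home"));
        Context c2 = new Context(Map.of("role:I", "John Black", "ex:place", "work"));

        // One application treats every context where "I" is John Black as the same subject...
        System.out.println(sameSubject(c1, c2, Set.of("role:I")));             // true
        // ...another also requires the place to match, and keeps them apart.
        System.out.println(sameSubject(c1, c2, Set.of("role:I", "ex:place"))); // false
    }
}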
Thursday, April 24, 2008
Update from WWW2008
The HCLS workshop was very good. I especially enjoyed Mark Wilkinson's talk about BioMoby 2.0 (very Larry Lessig-esque) and Chris Baker's. There's some definite interest from a number of people in my talk too.
The keynote of the first day was from the Vice President of Engineering at Google, Kai-Fu Lee. In my talk I said that IBM had noted that scale-out architecture gives you a 4 times performance benefit for the same cost. He said that Google gets around 33 times or more from a scale-out architecture. The whole cloud thing is really interesting in that it's not only about better value but about doing things you just can't do with more traditional computing architectures. The number of people I've overheard saying that they haven't been able to get their email working because they're using some sort of client/server architecture is amazing. I mean what's to get working these days when you can just use GMail?
The SPARQL BOF was interesting as well (Eric took notes). The time frame seems to be around 2009 before they get started on SPARQL the next generation. What sticks out in my mind is the discussion around free text searching - adding something like Lucene. There were also aggregates, negation, starting from a blank node in a SPARQL query, and transitive relationships and following owl:sameAs. I was pretty familiar with all of these so it was interesting just to listen for a change. With both aggregates and free text you are creating a new variable. Lucene gives you a score back, and I remember in Kowari we had that information, but I don't think it was ever visible in a variable (maybe I'm wrong, I don't really remember). It would be nice to be able to bind new variables somehow from things in the WHERE clause for this - that would also allow you to filter out results based on COUNTs greater than some value (without having a HAVING clause) or documents that match your Lucene query with a score greater than a certain value. Being able to do transitive relationships on just a subset of the subclass relationship (like only subclasses of mammals, not inferring over the whole tree of life) seemed to be met with some reluctance. I really didn't understand this, but it seemed to be that it was the store's responsibility to control this and not up to the user to specify.
The other thing that was mentioned was transactions. It seems that transactions probably won't be part of SPARQL due to the nature of distributed transactions across the Web.
There was one paper on the first day that really stood out. I don't know what it is about logicians giving talks but they generally really appeal to me. It was "Structured Objects in OWL: Representation and Reasoning" presented by Bernardo Grau. It takes the structural parts of an OWL ontology and creates a graph to represent them. This prevents DL reasoning from unravelling into an infinite tree and keeps it to a bounded graph. This is cool for biology - the make-up of a cell, for example - but it also speeds up reasoning and allows errors to be found.
The other interesting part was the linked data area. I was a bit concerned that it was going to create a read only Semantic Web. A lot of the work, such as DBpedia that converts Wikipedia to RDF, seems a bit odd to me as you can only edit the Semantic Web indirectly through documents. But in the Linked Data Workshop a paper was presented called "Tabulator Redux: Browsing and Writing Linked Data" which of course adds write capabilities. I spoke to Chris Bizer (who gave a talk on how the linked data project now has ~2 billion triples) about whether you could edit DBpedia this way and he said probably not yet. That's going to be interesting to see where it goes.
I am just going off memory rather than notes. So I'll probably flesh this out a bit more later.
Thursday, April 17, 2008
hashCode and equals for Blank Nodes
You don't need node ids. Most, if not all, RDF triple stores take a Literal, URI Reference or Blank Node and generate a node id. Sometimes it's a hash or UUID, sometimes it's from a node pool or value store, but you don't really need it. As an aside, in a distributed store you could even do the blocks-of-ids trick which people have done in SQL databases, but I haven't seen that done for RDF yet.
When you do operations, like joins, in Java or Ruby or some other language you rely on hash codes to generate different values; if they're the same then you call equals.
What if you don't have a node pool?
It's easy to do for what I like to call globally addressable values - URI References and Literals - no matter where you are, these methods return the same results from their hash code or equals. Not so with Blank Nodes, which are tied to the context of an RDF graph.
One solution is to ban blank nodes - they're a pain to parse, query and store. But I actually like blank nodes. They're good at representing things that you don't want to confuse with something that might actually be a URI to dereference.
The idea we've been working on with our high-falutin' scale-out MapReduce blah blah is really just coming up with sensible implementations of the hashCode and equals methods for blank nodes. There is previous work on distributing blank nodes across graphs; the one I'm most familiar with is RDF Molecules. But they didn't really quite cut it as far as hash codes and equals are concerned, and that's basically what I'm presenting next week in China. The hash code is basically the head triple and the equals is the minimal context sub-graph for a given blank node.
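To make that concrete, here's a rough sketch of the shape of it (my illustration, not the actual implementation or the definitions from the paper): the hash comes from the blank node's head triple, and equality compares the minimal context sub-graph around the node.

import java.util.Objects;
import java.util.Set;

public class ContextualBlankNode {
    public record Triple(String s, String p, String o) {}

    // A canonical first triple mentioning the node; assumed to be determined by
    // the sub-graph, so hashCode stays consistent with equals.
    private final Triple headTriple;
    private final Set<Triple> minimalSubgraph; // minimal self-contained sub-graph around it

    public ContextualBlankNode(Triple headTriple, Set<Triple> minimalSubgraph) {
        this.headTriple = headTriple;
        this.minimalSubgraph = minimalSubgraph;
    }

    @Override
    public int hashCode() {
        // Cheap and stable across graphs: derived from the head triple only.
        return headTriple.hashCode();
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof ContextualBlankNode that)) return false;
        // The expensive check: compare the whole minimal context sub-graph.
        return Objects.equals(minimalSubgraph, that.minimalSubgraph);
    }
}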
There's a lot more to say, as I've had to find something to talk about for the whole 15 minutes.
Monday, March 03, 2008
PURLs for GO
I think these have been published before but I only just noticed the use of PURLs for references in the Gene Ontology. For example: http://purl.org/obo/owl/GO#GO_0008150 (the term is not dereferenceable, but then that seems okay for current purposes). This follows Recipe 1a from the "Best Practice Recipes for Publishing RDF Vocabularies", although the file sizes are probably too big (as suggested in "Serving Static RDF Files"). It does seem inconsistent with the Banff Manifesto, which suggests URLs more like http://purl.org/bm/go:0008150. I know about slashes and hashes but I'm not sure about colons.