Saturday, December 29, 2007
Wii!
I couldn't be more impressed. For the first time in about 25 years (I think the last time was H.E.R.O. for the Atari 2600) my mum has sat down (well, stood up too) and played a computer game - Wii golf, bowling and tennis. While I remained victorious, it wasn't by much (especially at golf). It reminded me of Nolan Bushnell's description of the arrival of Pong, where women would hustle men (induce them to gamble over the outcome of a game) in bars.
Monday, December 24, 2007
Fat Controller
When does MapReduce make sense and what architecture is appropriate? I don't really know, whereas Tom has some ideas. I like the idea of MapReduce in the Google architecture (as cloned by Hadoop). I like the use of a distributed file system to evenly smear the data across the nodes in the cluster. However, these choices don't always seem appropriate.
The easiest and most obvious example where this causes a problem is that sometimes you want to ensure a single entry or exit in your system (when to start or stop). Another obvious example is where the overhead of network latency overwhelms your ability to parallelize the processing. More than that, it seems that if you can order things globally then the processing should be more efficient but it's unclear to me where the line is between that and the more distributed MapReduce processing (and whether adding that complexity is always worth it).
Google's third lecture (slides) answers some of the basic questions such as why use a distributed file system. It also lists decisions made such as read optimization, mutation handling (serializing writes and atomic appends), no caching (no need due to large file size), fewer but larger files (64MB chunks) and how the file handling is essentially garbage collection. They implemented appends because it is a frequent operation. This is something that Hadoop has yet to do and it can be an issue, especially for databases (the requirements for appends and truncates are attached to that issue).
There is obviously some adaptation needed for algorithms to run in a MapReduce cluster. Lecture 5 gives some examples of MapReduce algorithms. It includes a breadth first search of a graph and PageRank. Breadth first is chosen so there doesn't need to be any backtracking. For graph searching, they suggested creating an adjacency matrix - 1 indicating a link in the matrix and 0 indicating no link. To transfer it efficiently they use a sparse matrix (where you only record the links - very much like column databases of course). The MapReduce version is similar, with the processing split into a single row per page. For both of these processes there is a non-MapReduce component. For example, in the PageRank algorithm a process exists to determine convergence of page values.
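To make the row-per-page split concrete, here's a minimal sketch of a single PageRank pass written as plain map and reduce functions. It's deliberately not the Hadoop or Google API (all of the class and method names below are my own): the map step takes a page's sparse adjacency row and spreads the page's current rank over its outgoing links, the reduce step sums the contributions arriving at each page, and the convergence test lives in a driver outside the map/reduce pass.

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// A minimal, in-memory sketch of one PageRank pass; not the Hadoop or Google API.
public class PageRankSketch {
    static final double DAMPING = 0.85;

    // Map: one sparse adjacency row per page, emitting (target, contribution) pairs.
    static List<Map.Entry<String, Double>> map(String page, double rank, List<String> outLinks) {
        List<Map.Entry<String, Double>> emitted = new ArrayList<Map.Entry<String, Double>>();
        for (String target : outLinks) {
            emitted.add(new AbstractMap.SimpleEntry<String, Double>(target, rank / outLinks.size()));
        }
        return emitted;
    }

    // Reduce: sum all contributions arriving at one page and apply the damping factor.
    static double reduce(String page, List<Double> contributions) {
        double sum = 0.0;
        for (double contribution : contributions) {
            sum += contribution;
        }
        return (1 - DAMPING) + DAMPING * sum;
    }

    // A driver would group the emitted pairs by key, call reduce, and repeat the whole
    // pass until the ranks stop changing - the non-MapReduce convergence check.
}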
Sunday, December 23, 2007
Stacked
Apple to tweak 'Stacks' in Mac OS X Leopard 10.5.2 Update Getting better - now all they have to do is fix Spaces and Java.
Friday, December 14, 2007
Timeless Software
Barry Boehm (having read it many times, it's pronounced "beam"), in "A View of 20th and 21st Century Software Engineering", lists the timeless qualities of a good programmer that have been discovered through the decades. Some of them include: don't neglect the sciences, avoid sequential processing, avoid cowboy programming, what's good for products is good for process, and adaptability trumps repeatability.
The powerpoint for the presentation is also online.
Update: I have notes from this talk but little time to review them. There are three things that stick out in my mind though.
The first is slide 5. This lists the most common causes of project failure. Most of these are not technical but centre around project management and people. I think it can be successfully argued that the other issues, mainly technical problems, are also people problems too, especially things like "lack of executive support".
Slide 9 has a Hegelian view of the progress made in writing software. That is, there is a process of advancement that involves theses, antitheses and syntheses. For example, software as a craft was an overreaction to software being considered analogous to hardware development and agile methods are a reaction (some may say overreaction) to the waterfall methodology.
Thirdly, Slide 30 has what initially looks like a good diagram to help decide how agile your project should be. I had issues with this. Much of my experience with agile projects is that they are actually better planned than most traditionally run projects. I also don't see how the criticality of the defects is appropriate either, as the number of defects is far fewer and less critical in well run agile projects too. Maybe I've never been on a well run planned project.
This made me wonder, why does software development still seem so far removed from science? Or to put it another way, why is it so dependent on personal experience? OO is quite frequently vilified, but the reason people supported it and continue to do so was that reuse increased and costs decreased - people were just following the numbers. Before OO, it was structured programming. And in contrast, what seemed like good ideas, such as proving the correctness of programs, have not been widely adopted. Barry points out that the reasons, since its inception in the 1970s, were that it did not prevent defects occurring in the specification and that it was not scalable. This led to the ideas of prototyping and RAD, which were developed to overcome these perceived failures.
Scala Tipping the Scales
The Scala book has been announced, along with a couple of posts on Reddit: "The awesomeness of Scala is implicit" and "Why Scala?". This coincided with me reading up on what the Workingmouse people (like Tony) were up to with Scalaz.
I've only read Chapter 1 of the book but it's fairly compelling reading, mentioning that it's OO, functional, high level, verifiable, etc. It also gives examples of how succinct Scala is compared to Java.
boolean nameHasUpperCase = false;
for (int i = 0; i < name.length(); ++i) {
if (Character.isUpperCase(name.charAt(i))) {
nameHasUpperCase = true;
break;
}
}
Compared to:
val nameHasUpperCase = name.exists(_.isUpperCase)
This works for me much like the Ruby language and the books. You can understand where you came from and how it's better. It seems much more compelling than giving a sermon about how all Java users are fornicators of the devil and that they corrupt young children. To continue the religious theme, it seems that most of the industry, when selecting a computer language, acts like born-agains - jumping from one state to another: "I used to be a sinner managing my own memory, using pointer arithmetic and now I have seen the one true, pure language".
Update: There's also an interview with Bill Venners on Scala. He compares it with LINQ, explains why keeping static typing is important, how using the Option type removes nulls, that it is both more OO than Java (it removes static methods and primitives) while being fully functional, describes singleton objects and how it achieves the goal of being a scalable language. One of the interesting points he makes is that strong types make you solve problems in a particular way.
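To make the Option point concrete for Java readers, here's a toy version of the idea in Java (my own throwaway class, not Scala's Option and not part of any Java library): the possible absence of a value is visible in the type, so callers can't forget to handle it the way they can forget a null check.

// A toy Option type in Java to illustrate the idea (not Scala's Option).
public abstract class Option<T> {
    public abstract boolean isDefined();
    public abstract T get();

    // A present value.
    public static <T> Option<T> some(final T value) {
        return new Option<T>() {
            public boolean isDefined() { return true; }
            public T get() { return value; }
        };
    }

    // An explicit "no value", instead of null.
    public static <T> Option<T> none() {
        return new Option<T>() {
            public boolean isDefined() { return false; }
            public T get() { throw new IllegalStateException("No value"); }
        };
    }
}

A lookup that returns an Option rather than a possibly-null reference forces the caller to deal with the missing case, which is exactly the "types push you towards a particular solution" effect described in the interview.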
Wednesday, December 12, 2007
LINQed Data
I did a quick search for LINQ and RDF and there are two LINQ providers for .NET:
- Andrew Matthews' LINQtoRDF and LINQtoRDF Designer, which is a Google code project.
- Hartmut Maennel's A LINQ provider for RDF files. Available as a zip file
Wednesday, November 28, 2007
Academic Software and BioMANTA
On Monday I gave a presentation at the ACB all hands meeting on the BioMANTA project. It covered the basics: the integration process, ontology design and the architecture.
There were some very incomprehensible presentations. But of the ones I did understand, the lipid raft modeling (which looked a bit like Conway's Game of Life) was perhaps the coolest. There were quite a few presentations of software that involved: "we did it this way, the specifications changed, next time we'll do it right because what we have at the moment is a mess". It's very frustrating to see that change is still not accepted as the norm.
Two posts I read recently reminded me of this too: "Architecture and innovation" and "Why is it so difficult to develop systems?". Both describe the poor state of academic software. Many projects suffer from this problem, not just academic ones, although academic ones seem to be prone to adding technology because it's cool/trendy/whatever, which may help get it published but usually obscures the real novel aspects (or worse, they don't have anything except the cool or trendy technologies).
But my impression of the last 10-15 years (especially W3C and Grid/eScience projects) is that they rapidly become overcomplicated, overextended and fail to get people using them.
Ultimately much of the database and repository technology is too complicated for what we need at the start of the process. I am involved in one project where the database requires an expert to spend six months tooling it up. I thought DSpace was the right way to go to reposit my data but it wasn’t. I (or rather Jim) put 150,000+ molecules into it but they aren’t indexed by Google and we can’t get them out en masse. Next time we’ll simply use web pages.
By contrast we find that individual scientists, if given the choice, revert to two or three simple, well-proven systems:
* the hierarchical filesystem
* the spreadsheet
A major reason these hide complexity is that they have no learning curve, and have literally millions of users or years’ experience. We take the filesystem for granted, but it’s actually a brilliant invention. The credit goes to Dennis Ritchie in ca. 1969. (I well remember my backing store being composed of punched tape and cards).
If you want differential access to resources, and record locking and audit trails and rollback and integrity of committal and you are building it from scratch, it will be a lot of work. And you lose sight of your users.
So we’re looking seriously at systems based on simpler technology than databases - such as RDF triple stores coupled to the filesystem and XML.
Present and Future Scalability
Measurable Targets for Scalable Reasoning
An interesting aspect of this paper is its description of some of the difficulties in comparing data loading and querying across triple stores. For example, at load time this can include forward chaining (inferencing) and what the complexity of the data model is (whether named graphs and other metadata is used). Query evaluation can vary due to backward chaining, result set size and the types of queries.
While some triple stores can now load up to 40,000 triples a second (BigOWLIM), the average seems to be around 10,000 triples a second for a billion triples. The target in the next few years is 20-100 billion triples at 100,000 triples per second. The rate of 100,000 triples per second is the upper range but I would imagine that to load the data in a reasonable time this is what people have to aim towards. Otherwise, at 10,000 triples per second, 100 billion triples is going to take well over 100 days.
This analysis is based on published results of several of the most scalable engines: ORACLE, AllegroGraph, DAML DB, Openlink Virtuoso, and BigOWLIM. The targets are defined with respect to two of the currently most popular performance measuring sticks: the LUBM repository benchmark (and its UOBM modification) and the OWL version of UNIPROT - the richest database of protein-related information.
Thursday, November 22, 2007
Why Riding Your Bike Doesn't Help Global Warming
An interesting article, "Climate Change – an alternative approach", which highlights that the current supply of oil fails to meet demand. So reducing your usage of CO2 emissions, like riding your bike to work, does not help - the drop in demand is quickly met by consumption somewhere else. The author is suggesting a more viable approach is to focus on production rather than consumption.
According to the author, the answer lies in the oil and coal producing nations reducing production which is much easier to coordinate than billions of consumers. This doesn't detract from investing in reducing consumption and the usual suspects (Australia, India, USA and China) are involved on both sides of production and consumption.
He was talking about flying to Sydney and stated that if you chose not to fly you were making an immediate carbon saving (as opposed to offsetting the flight, where the saving was at least delayed if it ever happened at all). Does tearing up your ticket to Sydney reduce carbon emissions? Ask the question, have some fossil fuels been left in the ground that would otherwise be extracted? The answer, absolutely not, and I’m not talking about how the plane’s still going to fly without you.
I’m talking about the fact that oil extraction is not determined by demand, it’s determined by supply. It has been since earlier this decade when the market price diverged markedly from the production costs.
Friday, November 16, 2007
SPARQL isn't Unix Pipes
It wasn't supposed to be this way. I was just trying to get what I wrote in 2004 acknowledged. All I wanted then, as now, was aggregate functions and a matching data model. I did a bit more research in 2006 (mostly in my spare time while I had a day job to go to) and thought that people could read it and understand it. I even spent some time over the last Christmas holidays writing a gentler introduction to it all.
SPARQL is a proposed recommendation - which is one step away from being a published standard. So, I put my objections in to the SPARQL working group. From what I can tell people either didn't understand or thought that I was some kind of weirdo. The unhappiest part of this is the summary of my objection, "The Working Group is unfamiliar with any existing query languages that meet the commenter's design goals."
All I wanted was closure of operations, where the query language matches the data model it queries. Maybe this is a very odd thing to want. No one seems to know what the relational model really is either. Maybe it's a bad example.
Maybe a better example is Unix pipes. Unix pipes have operations (sort, cut, etc.) that take in text and output text. That is, the output is the same kind of thing as the input, a property known as closure. So you can take the output from one tool and feed it to another, stringing them together in any order you want. Sometimes it's more efficient to do one operation before another. In SPARQL you can't do that, as the first operation of every query turns everything into variable bindings.
I was hoping that SPARQL would be the Unix pipes of RDF. This would mean that the operations like join, filter, restrict (matching triples) and so on take in an RDF graph (or graphs) and output an RDF graph (or graphs). This gives tremendous flexibility in that you can create new operations that all work on the same model. It also means that a lot of the extra complexity that is part of SPARQL (for example, CONSTRUCT and ASK) go away.
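To show what closure of operations buys you, here's a small sketch (my own interfaces, not JRDF's API and not anything from the SPARQL specification) where every operation takes a set of triples and returns a set of triples, so operations compose in any order, just like Unix pipes.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Every operation is graph in, graph out, so stages compose like Unix pipes.
interface GraphOperation {
    Set<Triple> apply(Set<Triple> triples);
}

// A trivial Triple for the sketch (a real store would intern nodes, handle literals, etc.).
class Triple {
    final String subject, predicate, object;
    Triple(String s, String p, String o) { subject = s; predicate = p; object = o; }
}

// Keeps only the triples with a given predicate.
class Restrict implements GraphOperation {
    private final String predicate;
    Restrict(String predicate) { this.predicate = predicate; }
    public Set<Triple> apply(Set<Triple> triples) {
        Set<Triple> result = new HashSet<Triple>();
        for (Triple t : triples) {
            if (t.predicate.equals(predicate)) {
                result.add(t);
            }
        }
        return result;
    }
}

// Chains operations together; the output of one stage is the input to the next.
class Pipeline implements GraphOperation {
    private final List<GraphOperation> operations = new ArrayList<GraphOperation>();
    Pipeline add(GraphOperation operation) { operations.add(operation); return this; }
    public Set<Triple> apply(Set<Triple> triples) {
        Set<Triple> current = triples;
        for (GraphOperation operation : operations) {
            current = operation.apply(current);
        }
        return current;
    }
}

Because every stage speaks the same model, new operations can be added and stages reordered for efficiency - exactly what becomes awkward once the first operation turns triples into variable bindings.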
This is not to say that SPARQL doesn't have value and shouldn't be supported. It is just a missed opportunity. It could have avoided repeating mistakes made with SQL (like not standardizing aggregate functions, not having a consistent data model and so on).
Update: I re-read this recently. It struck me that maybe I was being a little unclear about what I expected as input and output in the RDF pipes view of SPARQL. Really, it's not a single RDF graph per se that is being processed but sets of triples. It's not really a big difference - RDF graphs are just sets of triples - but the triples being processed don't have to come from one graph. There's no restriction on what I'm talking about above to process, in one go, triples from many different graphs. The criticism is the same though - SPARQL breaks triples into variable bindings. Processing multiple graphs (or sets of triples) just requires that the graph each triple came from is recorded (the quad in most systems). It's certainly something that could be added to JRDF's SPARQL implementation.
Thursday, November 15, 2007
Sesame Native Store
I'm very impressed at the moment with OpenRDF's native store as others have been in the past. One of the best things is how easy it was to work into the existing JRDF code.
As I've said before I've been searching for an on disk solution for loading and simple processing of RDF/XML. In the experiments I've been doing OpenRDF's btree index is much faster than any other solution (again not unexpected based on previous tests). The nodepool/string pool or ValueStore though is a bit slower than both Bdb and Db4o.
Loading 100,000 triples on my MacBook Pro 2GHz takes 37 secs with pure Sesame, 27 with the Sesame index and Db4o value store, 35 with Bdb value store and ehCache is still going (> 5 minutes). A million takes around 5 minutes with Sesame index and Db4o nodepool (about 3,400 triples/second) and 3 minutes with a Sesame index and memory nodepool (about 5500 triples/second).
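For what it's worth, the triples/second figures are just wall-clock measurements around a bulk load. Here's a minimal sketch of that kind of measurement (the Runnable stands in for the actual parse-and-store call; it's not the JRDF or Sesame API):

// Rough wall-clock throughput measurement for a bulk load.
public class LoadBenchmark {
    static double triplesPerSecond(long tripleCount, Runnable load) {
        long start = System.currentTimeMillis();
        load.run();                                            // do the actual load here
        long elapsed = Math.max(1, System.currentTimeMillis() - start);
        return tripleCount / (elapsed / 1000.0);
    }

    public static void main(String[] args) {
        double rate = triplesPerSecond(100000, new Runnable() {
            public void run() {
                // placeholder: parse the RDF/XML file and add each triple to the store
            }
        });
        System.out.println(rate + " triples/second");
    }
}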
There's lots of cleanup to go and there's no caching or anything clever going on at the moment, as I'm trying to hit deadlines. 0.5.2 is going to be a lot faster than 0.5.1 for this stuff.
Update: I've done some testing on some fairly low-end servers (PowerEdge SC440, Xeon 1.86GHz, 2GB RAM) and the results are quite impressive. With 100,000 triples averaging around 11,000 triples/second and 10 million averaging 9,451 triples/second.
Update 2: JRDF 0.5.2 is out. This is a fairly minor release for end user functionality but meets the desired goal of creating, reading and writing lots of RDF/XML quickly. Just to give some more figures: Bdb/Sesame/db4o (SortedDiskJRDFFactory) is 30% faster for adds and 10% slower for writing out RDF/XML than Bdb/Sesame (SortedBdbJRDFFactory). Both have roughly the same performance for finds. I removed ehcache as it was too slow compared to the other approaches.
Friday, November 09, 2007
JRDF 0.5.1 Released
This release is mainly a bug fix release. There are improvements and fixes to the Resource API, datatype support and persistence. Another persistence library has been added, db4o, which has some different characteristics compared to the BDB implementation. However, it's generally a little slower than BDB. The persistence offered is currently only useful for processing large RDF files in environments with low memory requirements.
Also, the bug fixes made to One JAR have been integrated, so JRDF no longer has its own version.
Available here.
Saturday, October 27, 2007
No Java 6 for You!
Just in case you're like me and you upgraded to Leopard only to find Java 6 no longer works and Java 5 unstable, here's a fix:
-- First, delete ~/Library/Java/Caches/deployment.properties
-- Move aside your Java 1.6 directory. The 1.6 preview on Tiger does not work on Leopard.
% cd /System/Library/Frameworks/JavaVM.framework/Versions
% sudo mv 1.6.0 Tiger_1.6
% sudo rm 1.6
At least it comes with better Ruby support and a RubyCocoa bridge.
Update: I don't think I like Leopard. I don't like: the 3D or the 2D look of the dock, the semi-transparent menu bar, Spaces behaviour - you can't have multiple windows spanning desktops from the same application (like two browser windows in separate desktops), the removal of text lists (why not have Fan, Grid and List?), in Quick Look you can go to a page in a PDF or Word file but when you click on it it goes to the first page, and Java support (IntelliJ and others seem to have weird refresh issues and you can't seem to allocate them to a virtual desktop).
There's of course a lot to like though too (tab terminals, RSS reader, better Spotlight, Cover flow, Safari, etc).
Update 2: So the Spaces thing. If you minimise a window, it goes to the dock, change to another virtual desktop, then expand, it goes to the desktop that the window was originally in. As noted in the comments though (and it is in the guided tour), if you activate Spaces (the default is F8), then move the window to a new virtual desktop then it is tied to it. There are two other ways: click and hold on the window and switch to another desktop or click and hold on the window, move it to the edge of the screen, wait, and it will go to the next desktop.
To me, some of the behavior breaks the illusion/metaphor of the virtual desktop and seems unnecessarily difficult. I'd prefer non-click and hold options supported as well (pretty much like other features like copying files, etc).
The networking improvements (non-blocking, built-in VNC) overcome what was probably one of the worst things about OS X.
The IntelliJ issue that I mentioned is logged as issue 16084.
Saturday, September 22, 2007
Migration
With the announcement of JRuby in Glassfish (or the end of mongrel), Sun seems to have a hope of capturing more developers, not just from Ruby but from .NET. Many people from both the Java and C# worlds jumped to Ruby, and it occurred to me that .NET developers using Netbeans for Ruby development may notice that there's a surprisingly good C# clone under the covers.
Wednesday, September 19, 2007
Beautiful or Otherwise
Two of the chapters from Beautiful Code, Alberto Savoia's chapter Beautiful Tests and Simon Peyton Jones's on concurrency, are available in PDF form. Beautiful Tests covers most of the different kinds of tests and how to make changes to code to make it more testable, and starts to cover creating and validating theories.
Speaking of tests, "Scala, DSLs, Behavior Driven Development?" talks about how Java is poor at creating DSLs, specifically compared to Scala. And what's the application? Behavior driven development, in a Java project called beanSpec (which superficially looks similar to Instinct, probably because they're both based on RSpec and use the same stack example). So you have this neat sort of convergence where people are looking at testing Java better in ways that are more declarative (functional even).
Making Java more functional is on the cards for Java 7. "Will Java 7 be Beautiful?" links to point-free programming in Java and Haskell and how the new language proposals (closures) make Java 7 look a lot like Haskell, with the suggestion that all Java 7 functions should be curried.
While the syntax is potentially getting more beautiful, user interfaces have traditionally been pretty poor in Java (even with OS X support) but even that is changing. I've been following Chet Haase's blog and recently Filthy Rich Clients was made available to purchase. It's all about improving the look and feel of Java to finally approach the richness of native OS X and Windows applications. He even has some links about language proposals too, including bringing back line numbers. Some examples of their work are on the web site and on Romain Guy's blog, which includes an entry called Beautiful Swing.
Romain also linked to a movie of *7, the prototype handheld device running Green (Java). Who knew Duke had a house?
Thursday, September 13, 2007
A Real LINQ Clone for Java
Introducing Quaere - Language integrated queries for Java
Via, Quaere: LINQ for Java.
The Quaere DSL is very flexible and it lets you perform a wide range of queries against any data structure that is an array, or implements the java.lang.Iterable or the org.quaere.Queryable interface. Below is an overview of the querying and other features available through the DSL interface, the underlying query expression model and query engine. See the examples section to gain an understanding of how these features are used.
* Ability to perform queries against arrays or data structure implementing the Iterable interface.
* An internal DSL (based on static imports and fluent interfaces) that lets you integrate the query language with regular Java code. No preprocessing or code generation steps are required to use the DSL, simply add a reference to the quaere.jar file (and its dependencies).
* A large number of querying operators including restriction, selection, projection, set, partitioning, grouping, ordering, quantification, aggregation and conversion operators.
* Support for lambda expressions.
* The ability to dynamically define and instantiate anonymous classes.
* Many new “keywords” for Java 1.5 and later.
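Just to illustrate what an internal DSL built from static imports and fluent interfaces can look like, here's a made-up sketch (it is not the actual Quaere API or its operators) where a query reads as a single chained expression:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A made-up fluent query sketch, not the Quaere API.
public class FluentQuerySketch {
    // Intended to be used via a static import so queries read as from(...).where(...).select().
    public static <T> Query<T> from(Iterable<T> source) { return new Query<T>(source); }

    public interface Predicate<T> { boolean matches(T item); }

    public static class Query<T> {
        private final List<T> items = new ArrayList<T>();
        Query(Iterable<T> source) { for (T item : source) { items.add(item); } }

        public Query<T> where(Predicate<T> predicate) {
            List<T> kept = new ArrayList<T>();
            for (T item : items) { if (predicate.matches(item)) { kept.add(item); } }
            return new Query<T>(kept);
        }

        public List<T> select() { return items; }
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(5, 12, 8, 21);
        List<Integer> big = from(numbers).where(new Predicate<Integer>() {
            public boolean matches(Integer n) { return n > 9; }
        }).select();
        System.out.println(big);   // prints [12, 21]
    }
}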
Friday, September 07, 2007
Column Databases Reach Slashdot
In a typically incorrect characterization of the issue, Slashdot is covering column databases: Relational database pioneer says technology is obsolete via Are Relational Databases Obsolete?. A better way to explain it is that column databases are designed to get around the current problems with modern computer architectures, such as latency in memory and hard disks, as well as to achieve better CPU utilization (typically databases have very little parallelism).
While it's early days yet, column databases storing something like RDF may get faster than row-oriented SQL databases. I posted about column databases storing RDF previously.
"Vertica beats all row stores on the planet -- typically by a factor of 50," he wrote. "The only engines that come closer are other column stores, which Vertica typically beats by around a factor of 10."
And from the cited blog:
In addition, it provides built-in features appropriate to the needs of 2007 customers. These include:
o Linear scalability over a shared-nothing hardware grid
o Automatic high availability
o Automatic use of materialized views
o "No knobs" -- minimum DBA requirements
Seeing as though I've just been looking into this here are a few references:
- The End of an Architectural Era (It’s Time for a Complete Rewrite) One of the links from my previous post written by Michael Stonebraker and others. It provides a deeper explanation of column databases than in the above articles.
- Decomposition Storage Model the paper that started it all (from 1985). Section 2.5 is especially relevant as it mentions how to store directed graphs and parent-child relationships.
- Database Architecture Optimized for the New Bottleneck: Memory Access
- Monet: A Next-generation DBMS Kernel for Query-intensive Applications
Update 2: While I'm normally pretty skeptical about most things, column databases have had their own fair share of problems in the past.
Thursday, September 06, 2007
Perfect Storm (of Erlang)
It would be remiss of me not to keep track of one of the other contenders for scalability. The first hint was a few days ago with, "CouchDB: Thinking beyond the RDBMS":
CouchDB on first look seems like the future of databases without the weight that is SQL and write consistency.
It stores documents in a flat space.
There are no schemas. But you do store (and retrieve) JSON objects. Cool kids rejoice.
And all this happens using Real REST (you know, the one with PUT, DELETE and no envelopes to hide stuff), so it doesn’t matter that CouchDB is implemented in Erlang. (In fact, Erlang is a feature)
CouchDB - and for even cooler kids, there's a Ruby API too.
The other bit that adds to this is the RDF JSON specification or RDFON. The first specification doesn't support nesting which means only named blank nodes are supported - which I think is a bug - naming things, especially things that don't naturally have names, can be a pain. The second was criticized as being a diversion away from N3.
I'm for a way to make it easier for programmers and web browsers to have better access to RDF data. Whether there needs to be another format to describe RDF is a bit unclear though - although being able to store it in a scalable way, perform queries and update it through REST seems like a positive. JSON is in that fuzzy area between code and data with its own drawbacks (like security).
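As a rough idea of what updating through REST looks like from Java, here's a sketch of an HTTP PUT of a JSON document (the URL layout and document body are illustrative only, not a definitive CouchDB client):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// PUTs a JSON document to a REST endpoint; the URL and body are placeholders.
public class RestPutExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:5984/mydb/doc123");   // assumed database/document path
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("PUT");
        connection.setDoOutput(true);
        connection.setRequestProperty("Content-Type", "application/json");
        OutputStream out = connection.getOutputStream();
        out.write("{\"title\": \"example\"}".getBytes("UTF-8"));
        out.close();
        System.out.println("Response code: " + connection.getResponseCode());
    }
}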
Just to keep the pro scale-out/Hadoop/MapReduce etc. architecture going here's a recent posting on the more general issue:
Functional programming paradigms, the map/reduce pattern, and to a lesser extent distributed and parallel processing in general are subjects not widely understood by most quasi-technical management. Further, the notion of commodity machines with guaranteed lack of reliability as a means of achieving high performance and high scalability is essentially counterintuitive. Even referring newcomers to what I regard as the seminal papers on these topics (the papers written by Ghemawat, Dean et al at Google, and yes I know it all started with LISP but my management wasn't even alive in the 1970s although I was :-)), people steeped in a long tradition of "shared everything" database architectures still don't quite get it. I spend considerable amounts of time in what amounts to management de-programming: No MySQL can't do this, and Oracle can't either, except with Oracle it will cost you a lot more to find that out.
Update: Dare corrects some mistakes made in the above posting about CouchDB. The way I read the original, it wasn't saying CouchDB is replacing relational databases, just that it offers a better solution for certain kinds of problems - where "thinking beyond" means not being constrained by the expected features of a database.
Sunday, August 26, 2007
JRDF 0.5.0
As mentioned this version of JRDF supports a Resource interface, datatype support, persistence (via Berkeley DB), and Java 6. It does run under Java 5 but the RDF/XML writer now uses the StAX API, so it requires Woodstox or some other JSR 173 implementation.
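For anyone who hasn't seen StAX, this is roughly what writing with the JSR 173 API looks like (plain StAX calls for illustration, not JRDF's actual writer). XMLOutputFactory picks up whichever implementation is on the classpath, which is why Woodstox is needed on Java 5:

import java.io.FileOutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

// Writes a minimal RDF/XML skeleton using whatever JSR 173 implementation is on the classpath.
public class StaxExample {
    public static void main(String[] args) throws Exception {
        XMLOutputFactory factory = XMLOutputFactory.newInstance();
        XMLStreamWriter writer = factory.createXMLStreamWriter(new FileOutputStream("out.rdf"), "UTF-8");
        writer.writeStartDocument("UTF-8", "1.0");
        writer.writeStartElement("rdf", "RDF", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
        writer.writeNamespace("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
        writer.writeEndElement();
        writer.writeEndDocument();
        writer.close();
    }
}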
A few things are in the works, such as an Elmo-like API, Globalized nodes, and MSG support.
Sourceforge download.
Better Chess
Higher Games. On the 10-year anniversary of Deep Blue beating Kasparov, some thought-provoking suggestions:
It is interesting in this regard to contemplate the suggestion made by Bobby Fischer, who has proposed to restore the game of chess to its intended rational purity by requiring that the major pieces be randomly placed in the back row at the start of each game (randomly, but in mirror image for black and white, with a white-square bishop and a black-square bishop, and the king between the rooks). Fischer Random Chess would render the mountain of memorized openings almost entirely obsolete, for humans and machines alike, since they would come into play much less than 1 percent of the time. The chess player would be thrown back onto fundamental principles; one would have to do more of the hard design work in real time.
Fischer Random Chess, or Chess 960, removes one of the reasons that I've always disliked chess (I never really could apply myself to remembering openings).
Why isn't it just as nice--or nicer--to think that we human beings might succeed in designing and building brainchildren that are even more wonderful than our biologically begotten children? The match between Kasparov and Deep Blue didn't settle any great metaphysical issue, but it certainly exposed the weakness in some widespread opinions. Many people still cling, white-knuckled, to a brittle vision of our minds as mysterious immaterial souls, or--just as romantic--as the products of brains composed of wonder tissue engaged in irreducible noncomputational (perhaps alchemical?) processes. They often seem to think that if our brains were in fact just protein machines, we couldn't be responsible, lovable, valuable persons.
Via wonderTissue.
Saturday, August 25, 2007
Beautiful Engineering
One Bridge Doesn’t Fit All
American bridge engineering largely overlooks that efficiency, economy and elegance can be mutually reinforcing ideals. This is largely because engineers are not taught outstanding examples that express these ideals.
A 2000 report by the Federal Highway Administration indicated that an average of about 2,500 new bridges are completed each year; each could be an opportunity for better design. The best will be elegant and safe while being economical to build.
The key is to require that every bridge have one engineer who makes the conceptual design, understands construction and has a strong aesthetic motivation and personal attachment to the work. This will require not only a new ethos in professional practice, but also a new focus in the way engineers are educated, one modeled on the approach of those Swiss professors, Wilhelm Ritter and Pierre Lardy.
Via Bridges and code.
Wednesday, August 22, 2007
A Bunch of IntelliJ Goodness
I've been languishing without perhaps the best plugin of all time for IntelliJ. I mentioned it to everyone I work with and someone said, "hey, I've come across this plugin that you might be interested in". And it's ToggleTest. A simple plugin that lets you switch between tests and production code. Never let your hands leave the keyboard again.
Other plugins from the author include the best meta-plugin hot plugin and cruise watcher.
Result
The previous issue of IEEE Software is available to Xplore subscribers. It is an issue dedicated to TDD and agile development.
Mock Objects: Find Out Who Your Friends Are highlights the oft discussed criticisms of test driven development including: breaking encapsulation, hard to read, increased code fragility and whether it is a valuable use of developer time. Of course, these differences can be resolved in a typical XP way:
My point is that plenty of well-written code is produced without relying on mocks or Tell, Don’t Ask. On the other hand, I also encounter plenty of TDD novices who produce poor code and poor tests. Can Steve and Nat’s approach consistently help novices write better code and tests? It could be. At this point, I remain open minded and am ready to pair-program with Steve or Nat to learn more.
The second article is about the studied effect of TDD on the quality of code, TDD: The Art of Fearless Programming. Pretty much every study showed an increase in effort (15-60%) and an increase in quality (5-267%). The ratio of effort to quality is 1:2 - so it seems to pay. Example definitions of quality were "reduction in defects" or "increase in functional tests passing". In general:
All researchers seem to agree that TDD encourages better task focus and test coverage. The mere fact of more tests doesn’t necessarily mean that software quality will be better, but the increased programmer attention to test design is nevertheless encouraging. If we view testing as sampling a very large population of potential behaviors, more tests mean a more thorough sample. To the extent that each test can find an important problem that none of the others can find, the tests are useful, especially if you can run them cheaply.
Via Mock Objects
Saturday, August 18, 2007
Off to See the Wizard
Microsoft comes up with a quite elegant and well integrated language extension for relational and XML data. IBM fires back 2 years later with a wizard that auto-generates beans and Java code with embedded SQL. The article also makes assertions like:
The most popular way to objectize (to programmatically access and manipulate) relational data has been through special APIs and wrappers that provide one or more SQL statements written as text strings.
It does seem like it does give you some benefits when writing SQL but it's simply not LINQ.
Thursday, August 16, 2007
Everything You Know is Wrong
- OWL can be, as the French would say, databasesque. Using OWL to implement database-like integrity constraints.
- It's time to throw away your database (a paper describing the advantages of column databases) and replace it with something more flexible and scalable (using C-Store to store RDF). And column databases mean no NULLs. The problem isn't that your data is normalized, it's that it's not normalized enough (also mentioned here).
- Following a question that I had earlier this year about where is the good code to learn from, I've read about 3/4 of Beautiful Code. And while I have a hard time finding C/C++ code beautiful it certainly covers some good examples of code and the process behind writing it (including Beautiful Tests, Beautiful Concurrency and MapReduce).
- Latency Moore's curse.
- Nova links to why what you thought about genes is wrong (I've actually heard this before but hadn't come across any articles).
- Scaling laws in Biology. Humans run on 100 watts or used to. Now we have cars etc. our power requirements are now 11,000 watts (making us the biggest creatures that have ever lived). And the magical number: 4.
- The Enemies of Reason is now on Google Video.
Wednesday, August 08, 2007
Can't wait until they find the Astronaut Helmet
Lego giant emerges from sea "Workers at a drinks stall rescued the 2.5-metre (8-foot) tall model with a yellow head and blue torso.
"We saw something bobbing about in the sea and we decided to take it out of the water," said a stall worker. "It was a life-sized Lego toy.""
"We saw something bobbing about in the sea and we decided to take it out of the water," said a stall worker. "It was a life-sized Lego toy.""
Naming
Relatively recently, UniProt (a protein sequence database) announced they were moving away from LSIDs (which are URNs) to URIs. The cons of LSIDs seem to be outweighing the pros. More generally, there seems to be much discussion as to what is more appropriate and what can and cannot be done with URIs vs URNs. And even some of LSIDs' proponents are saying that using them at the moment is not a good idea.
The form of an LSID is: urn:lsid:ubio.org:namebank:11815, which can be resolved using: http://lsid.tdwg.org/summary/urn:lsid:ubio.org:namebank:11815. A couple of IBM articles describe it in more detail: "Build a life sciences collaboration network with LSID" and "LSID best practices". Part of the problem seems to be that an LSID needs a resolving service, much like a web service, to return the data for a given LSID. A URI on the other hand can just use a bit of content negotiation to return either RDF data or human readable HTML. It's not a well known feature that a web client can tell a web server what data it can accept. So a Semantic Web client would say "give me RDF" and a normal web client says "give me HTML".
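A minimal illustration of that content negotiation using the standard java.net classes (the URI below is just a placeholder): the only difference between a Semantic Web client and a browser is the Accept header it sends.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Asks the server for RDF/XML rather than HTML via the Accept header.
public class ContentNegotiation {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.org/resource/12345");   // placeholder URI
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestProperty("Accept", "application/rdf+xml");
        InputStream in = connection.getInputStream();
        System.out.println("Server returned: " + connection.getContentType());
        in.close();
    }
}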
An alternative is using an existing standard, such as THTTP, which shows how to turn a URN into a URL by providing a REST based service. When requesting a URL for the URN "urn:foo:12345-54321", it becomes the HTTP request "GET /uri-res/N2L?urn:foo:12345-54321 HTTP/1.0". This is a bit like the bio2rdf.org approach of "http://bio2rdf.org/namespace:id". Having de-referencable URIs is part of the Banff Manifesto.
Creating GUIDs is an interesting problem in a distributed environment. One of the other life science groups compared Handle, DOI, LSID and PURL (persistent URLs).
The mention of the Handle System brought back ideas from previous digital library work and using URNs to name RDF graphs (which I later discovered wasn't entirely novel).
Friday, August 03, 2007
Game Changer
Wag the dog
So I missed lunch with a friend earlier this year because he was stuck in an Exchange upgrade. This was at the same time I was looking into Google's architecture and it struck me that there's no real upgrade process or data conversion process that needs hand holding for their architecture. It was at that point that I thought a lot of the jobs system administrators currently do will be greatly simplified or removed with the right software. The ratio of system administrators to servers at Google has to be much smaller (i.e. more servers to people) because they need so many more servers and there just aren't that many system administrators to look after software written in the usual way.
It came up again when I got roped into a meeting with a bunch of guys who administer several clusters. I was quite happy to be quiet but it was brought up that I was looking at cluster techniques. I explained about EC2 and how you can run a server for as little as 10 cents per instance hour and experiments for $2. I think it came as a bit of a shock to them - first the cheapness (obviously) and second the availability (basically, it's removing the gatekeeping around resources). The second benefit is something that really removes a lot of the politics - something not to be underestimated.
Arguing that EC2 has no intrinsic business value is like arguing that an electrical grid or a telephone network has no intrinsic business value. Speculation: one reason business systems can't adapt is that the assumptions about what the business used to do are embedded deep in the code. Very deep, not easy to pull out. And not just in the code but in the physical architecture the system is running on. Business "logic" is like bindweed - by the time you've pulled it out, you've ripped out half the garden as well.
Thursday, August 02, 2007
More Hadoop
A couple of days late, but OSCON apparently had a focus on Hadoop:
The main focus of attention was the Doug Cutting and Eric Baldeschwieler talk.
The whole shared-nothing programming model is certainly infectious and it seems scarily appropriate for Semantic Web data (see some of the talk for more information). Yahoo's Hadoop research cluster is now apparently up to 2x1000 nodes.
The freedom to take your data away rather than controlling the rights to your source code might dominate the Web 3.0 meme. Hadoop! "I got a call from David Filo (co-founder of Yahoo!) last night saying Yahoo! is really behind this."
Monday, July 30, 2007
Scale Out or Drop Out
I recently came across this study done by IBM using Nutch/Lucene to compare scale-up (a single, shared memory, fast server) vs scale-out (a cluster). It showed that for the type of work performed, scale-out systems outperformed scale-up systems in terms of both price and performance. The IEEE article about Google's architecture in late 2002 notes that Google was spending $278,000 for 88 dual-CPU 2GHz Xeon servers, each with 2 GB of RAM and an 80 GB hard drive. In the IBM paper, $200,000 gets you 112 blades, quad processor with 8GB of memory each, with a 73GB drive as well as a shared storage system.
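Back-of-the-envelope, the per-node arithmetic from those two figures looks something like this (a rough comparison only - the price points are years apart and the configurations aren't like-for-like):

```java
public class CostPerNode {
    public static void main(String[] args) {
        // Google, circa 2002: $278,000 for 88 dual-CPU servers.
        System.out.printf("Google 2002: $%.0f per node%n", 278000.0 / 88);   // about $3,159
        // IBM scale-out study: $200,000 for 112 blades (plus a shared storage system).
        System.out.printf("IBM blades:  $%.0f per blade%n", 200000.0 / 112); // about $1,786
    }
}
```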
This scale-out approach was also recently mentioned in Tim O'Reilly's MySQL: The Twelve Days of Scaleout.
Scale-up x Scale-out: A Case Study using Nutch/Lucene
The Nutch/Lucene search framework includes a parallel indexing operation written using the MapReduce programming model [2]. MapReduce provides a convenient way of addressing an important (though limited) class of real-life commercial applications by hiding parallelism and fault-tolerance issues from the programmers, letting them focus on the problem domain.
The query engine part consists of one or more front-ends, and one or more back-ends. Each back-end is associated with a segment of the complete data set...The front-end collects the response from all the back-ends to produce a single list of the top documents (typically 10 overall best matches).
We see that the peak performance of the scale out solution (BladeCenter) is approximately 4 times better.
Saturday, July 28, 2007
Moving Around
- A better spreadsheet, Resolver "...data and formulae you enter into cells are actually turned into code."
- How to Publish Linked Data on the Web Share structured data on the web today.
- The joys of RDF/XML and its inability to have literals in lists. A possible solution was suggested that would also allow literals as subjects.
- An RDF DB shootout complete with TestSuite. Hopefully, JRDF's use of BDB will have better results.
- An extension to SPARQL to support federated queries.
- Radar Networks Progress Update "We're on track for a invite only launch in the fall timeframe as planned...Several of the world's big media empires have started approaching me...They are interested in the potential of the Semantic Web for adding new capabilities to their content and new services for their audiences."
- Snoggle "...a graphical, SWRL-based ontology mapper to assist in the task of OWL ontology alignment. It allows users to visualize ontologies and then draw mappings from one to another on a graphical canvas."
Friday, July 20, 2007
MapReduce by the Hour
Running Hadoop MapReduce on Amazon EC2 and Amazon S3
Apache's Hadoop project aims to solve these problems by providing a framework for running large data processing applications on clusters of commodity hardware. Combined with Amazon EC2 for running the application, and Amazon S3 for storing the data, we can run large jobs very economically. This paper describes how to use Amazon Web Services and Hadoop to run an ad hoc analysis on a large collection of web access logs that otherwise would have cost a prohibitive amount in either time or money.
It took about 5 minutes to transfer the data (which was compressed - Hadoop can read compressed input data out of the box) from S3 to HDFS, so the whole job took less than an hour. At $0.10 per instance-hour, this works out at only $2 for the whole job, plus S3 storage and transfer costs - that's external transfer costs, because remember transfers between EC2 and S3 are free.
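The arithmetic behind that $2 is just instances x hours x rate; the node count below is my assumption (the excerpt doesn't give it), picked to match the quoted total:

```java
public class Ec2JobCost {
    public static void main(String[] args) {
        double ratePerInstanceHour = 0.10; // EC2 price quoted in the paper
        int instances = 20;                // assumption: 20 nodes, each running under an hour
        int billedHoursEach = 1;           // EC2 bills per started instance-hour
        System.out.printf("Compute cost: $%.2f%n",
                ratePerInstanceHour * instances * billedHoursEach); // $2.00
    }
}
```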
Friday, July 13, 2007
SPASQL
SPASQL
This looks like a nice, pragmatic approach to solving the legacy SQL data problem. There's some interesting discussion about the mismatch between UNION in SQL and SPARQL.
FeDeRate [FED] provides a mapping between RDF queries and SQL queries over conventional relational databases. SPASQL provides similar functionality, but eliminates the query rewriting phase by parsing the SPARQL [SPARQL] directly in the database server. This approach is efficient, and allows applications linked to currently deployed MySQL client libraries to execute SPARQL queries and get the results back through the conventional MySQL protocol.
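A hedged sketch of what that promises in practice: a stock JDBC/MySQL connection shipping a SPARQL string where SQL would normally go. The connection details and the exact form of query SPASQL accepts are my assumptions, not taken from the announcement:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SpasqlSketch {
    public static void main(String[] args) throws Exception {
        // An ordinary MySQL client library; SPASQL's point is that no new driver is needed.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/rdf", "user", "password"); // hypothetical database
        Statement stmt = conn.createStatement();
        // Hypothetical: hand the server a SPARQL query instead of SQL.
        ResultSet rs = stmt.executeQuery(
                "SELECT ?name WHERE { ?person <http://xmlns.com/foaf/0.1/name> ?name }");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        conn.close();
    }
}
```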
Sunday, July 08, 2007
JRDF Development
I've started adding datatype support to JRDF. Nothing too flash at the moment, mainly to support the RDF/XML test cases and prompted by a similar piece of work that was completed by the people at KP Lab. They also implemented an Elmo like API (with lazy collections) and added a Resource interface (which adds methods on top of both Blank Nodes and URI References and is quite similar to Jena's Resource interface) which will be integrated soon.
An interesting aspect of the datatype support is, of course, semantically equal types like integer and long. In the current version of JRDF they wouldn't return equal. In Jena there is "sameValueAs", which returns true if they are semantically equal. In JRDF, I originally decided to use Comparable - but then the use in ordered maps may get confusing (although it would be similar to BigDecimal, but that's the exception rather than the rule). Using Comparable in this way would also be different to the implementation of datatypes in Kowari/Mulgara. So I've kept the old behaviour and added an equivalence comparison operator that is a copy of Comparable but takes semantic equality into account.
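A minimal sketch of the distinction (a hypothetical class, not the JRDF API): equals() stays strict about the datatype, while a separate equivalence check compares by value, so "10"^^xsd:int and "10"^^xsd:long are equivalent but not equal.

```java
import java.math.BigDecimal;

// Hypothetical sketch: equals() keeps lexical/datatype identity,
// equivalentTo() answers "do these denote the same value?".
final class TypedLiteral {
    private final String lexicalForm;
    private final String datatypeUri;

    TypedLiteral(String lexicalForm, String datatypeUri) {
        this.lexicalForm = lexicalForm;
        this.datatypeUri = datatypeUri;
    }

    // Strict identity: "10"^^xsd:int is NOT equal to "10"^^xsd:long.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TypedLiteral)) return false;
        TypedLiteral other = (TypedLiteral) o;
        return lexicalForm.equals(other.lexicalForm) && datatypeUri.equals(other.datatypeUri);
    }

    @Override
    public int hashCode() {
        return 31 * lexicalForm.hashCode() + datatypeUri.hashCode();
    }

    // Value-based equivalence: both of the literals above denote the number 10.
    boolean equivalentTo(TypedLiteral other) {
        if (isNumeric() && other.isNumeric()) {
            return new BigDecimal(lexicalForm).compareTo(new BigDecimal(other.lexicalForm)) == 0;
        }
        return equals(other);
    }

    private boolean isNumeric() {
        return datatypeUri.endsWith("#int") || datatypeUri.endsWith("#long")
                || datatypeUri.endsWith("#integer") || datatypeUri.endsWith("#decimal");
    }
}
```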
The other thing that's been cleared up is the containers and collections - a lot of it (about 90%) was redundant due to generics and didn't need to be there anymore. However, I did come across "The pseudo-typedef antipattern", which pretty much says that creating a StringList (that extends ArrayList) is an anti-pattern. In JRDF, collections and containers are basically Lists or Maps restricted to object nodes, which seems to fit the anti-pattern. However, for me it does add different behaviour over the standard collections, and restricting it by type does seem to make sense. I'm willing to be convinced though.
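For reference, the shape of the anti-pattern being described - a subclass that adds no behaviour and only pins down a type parameter (illustrative only):

```java
import java.util.ArrayList;

// The pseudo-typedef anti-pattern: StringList "names" a parameterization but adds
// nothing, and forces callers onto a concrete class instead of List<String>.
class StringList extends ArrayList<String> {
}
```

The JRDF case is arguably different because the restricted collections do add behaviour (only object nodes are allowed), which is the distinction the paragraph above is making.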
I'm also aware of some of the weirder RDF/XML edge cases that don't look like they were ever implemented correctly in JRDF or Kowari/Mulgara (like turning rdf:li into numbers even outside of collections). I don't think it's a big use case and no one has ever raised it as a bug as far as I know.
Update: It's been pointed out to me again that implementing Comparable isn't flexible enough - I should've known better, of course, as it's been pointed out to me many times in the past, and the equivComparator I've used is a bit dumb.
Friday, July 06, 2007
The Centre of Excellence
The Pmarca Guide to Big Companies, part 2: Retaining great people
Via Dare.
Don't create a new group or organization within your company whose job is "innovation". This takes various forms, but it happens reasonably often when a big company gets into product trouble, and it's hugely damaging.
Here's why:
First, you send the terrible message to the rest of the organization that they're not supposed to innovate.
Second, you send the terrible message to the rest of the organization that you think they're the B team.
That's a one-two punch that will seriously screw things up.
Tuesday, July 03, 2007
When You Get What You Want But Not What You Need
I was at the eResearch 2007 conference last week and it was quite good. I must say, though, that I'm very sick of "the long tail" and mashups being mentioned.
The keynotes were quite good and I'll mention three, but the others and the talks I went to were very good too. I wish I had written this up last week when it was fresher in my mind - so some of my recollections may be a little inaccurate.
David De Roure mentioned the reuse of workflows and that automation requires machine-processable descriptions. He also mentioned the CombeChem project and Semantic Grids. He made some interesting comments, such as that grids are inherently semantic grids, that a little semantics goes a long way (it's all about linked data) and that mashups are workflows. He mentioned the very successful Taverna project.
Phil Bourne gave a scenario of someone taking the bus, reviewing a paper, contacting their friends because it contradicts their current results and by the end of the bus trip having validated their approach and written a response to the author of the paper. He used the acronym IPOL (iPod plus Laptop) but surely the iPhone would've been closer to the mark.
His main idea is that publications aren't enough, that the experimental data has to also be saved, reviewed and made part of the academic process. As someone who runs one of the protein databases and an editor of a PLoS journal he's obviously seen the benefits of this approach. It also reminded me that ontologies in biology were cool before the Semantic Web came about (although most biology ontologies aren't very good (pdf)).
He mentioned the BioLit project which tries to integrate the workflow of papers, figures, data, and literature querying and creating a positive feedback loop for academic metadata. The idea of getting the source data behind graphs that are published is a much touted application of things like ontologies and RDF.
The last thing he mentioned was creating podcasts for published papers - they should give an overview of a paper that's more in-depth than an abstract but more general than the entire paper. To achieve this they've set up the SciVee.tv site (still early days for that). That sounded quite interesting - I can imagine a video explaining the key figures and facts in a multimedia presentation would be very useful. I'm not sure, though, that most current researchers have those skills. If it is a lot more useful it may lead to a situation similar to now, where papers published before the 1980s or so don't get cited or read because they aren't online. Maybe people who are used to YouTube won't read papers because they don't have an accompanying video, although it's probably not as distinct as the pre-digital papers. It seems much more likely that papers without the original experimental data will increasingly be ignored.
The last keynote I'll talk about was by Alex Szalay. I appreciated this one the most even though he did mention the long tail. He has previously written about the exponential increase in scientific data. He wrote it with Jim Gray, and he was one of the people who helped in the search for him (his blog has more information about that too). There's now computational x - where x is any science, including things like biology, physics and astronomy. One of the key effects of this much data is that the process of analysing data and then publishing your results is changing. It's more: publish the data, then do the analysis and publish the analysis.
He mentioned four different places where the power law (long tail) occurs: projects (few big ones, many small ones), data sizes (a few multi-petabyte sources, many more in the tens and hundreds of terabytes and vastly more smaller ones), value-added or refereed products, and users of data (a few users use it a lot but the vast majority use it a little).
The main thing I liked was that he said the processing of this data is fundamentally different from what it was before. It's too difficult to move the data about when it's petabytes - it's easier to move the processing to the data. It was pointed out to me later that versioning the software that processed the data now becomes a very tiny fraction of the data kept but is more often than not overlooked.
The data captured by CCDs has converged (or is about to) with that from the more traditional telescopes, and the data published and searchable now is only 2 years behind the best possible results. For most astronomers it's actually better to observe the universe from the data than to use an actual telescope.
Processing, memory and CCDs are all following Moore's Law but bandwidth is not. He mentioned an approach that's very much along the lines of Hadoop/GFS - the code moves to the data, not the other way around. He also listed things that are fairly well known now: there's no time to get it right from the top down, data processing and management becomes the key skill in the future, taking data from different sources is highly valuable, and build-it-and-they-will-come is not enough - you must provide a decent interface.
He mentioned two projects: Life Under Your Feet and Virtual Observatory. Both have huge data sets and rather cool user interfaces.
Google Culture
So a perspective on Google has been going around for a little while now. The story of how an interview gets widely distributed, though, is more interesting. It's not like Microsoft has anything to gain from distributing this information.
Via, "RE: Life at Google - The Microsoftie Perspective".
Via, "RE: Life at Google - The Microsoftie Perspective".
Squirt Towers
This gives me some of my life back. The weird thing is, even though you finish it, you still want to play one more game.
Wednesday, June 27, 2007
Moving Up
I really appreciate blogs sometimes. I started Al Gore's "The Assault on Reason" before putting it down halfway through because it was depressing (as bad as "Sicko" and "A Crude Awakening"). Lawrence Lessig, on the other hand, has decided to move from the copyright brick wall to the corruption brick wall.
In one of the handful of opportunities I had to watch Gore deliver his global warming Keynote, I recognized a link in the problem that he was describing and the work that I have been doing during this past decade. After talking about the basic inability of our political system to reckon the truth about global warming, Gore observed that this was really just part of a much bigger problem. That the real problem here was (what I will call a "corruption" of) the political process. That our government can't understand basic facts when strong interests have an interest in its misunderstanding.
The answer is a kind of corruption of the political process. Or better, a "corruption" of the political process. I don't mean corruption in the simple sense of bribery. I mean "corruption" in the sense that the system is so queered by the influence of money that it can't even get an issue as simple and clear as term extension right. Politicians are starved for the resources concentrated interests can provide. In the US, listening to money is the only way to secure reelection. And so an economy of influence bends public policy away from sense, always to dollars.
Scalability, Scalability, Scalability, Scalability
Dare Obasanjo has four recent postings on the Google Scalability Conference.
Google Scalability Conference Trip Report: MapReduce, BigTable, and Other Distributed System Abstractions for Handling Large Datasets:
The talk was about the three pillars of Google's data storage and processing platform; GFS, BigTable and MapReduce.
A developer only has to write their specific map and reduce operations for their data sets which could run as low as 25 - 50 lines of code while the MapReduce infrastructure deals with parallelizing the task and distributing it across different machines, handling machine failures and error conditions in the data, optimizations such as moving computation close to the data to reduce I/O bandwidth consumed, providing system monitoring and making the service scalable across hundreds to thousands of machines.
Currently, almost every major product at Google uses MapReduce in some way. There are 6000 MapReduce applications checked into the Google source tree with the hundreds of new applications that utilize it being written per month. To illustrate its ease of use, a graph of new MapReduce applications checked into the Google source tree over time shows that there is a spike every summer as interns show up and create a flood of new MapReduce applications that are then checked into the Google source tree.
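To make the "25 - 50 lines" claim concrete, this is roughly what those developer-written pieces look like in Hadoop, the open-source counterpart (a sketch against the newer org.apache.hadoop.mapreduce API, so the details won't match Google's internal version):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The two pieces a developer actually writes: emit (word, 1) pairs, then sum per word.
// The framework handles partitioning, distribution, retries and so on.
public class WordCount {

    public static class TokenizeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```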
Google Scalability Conference Trip Report: Using MapReduce on Large Geographic Datasets:
A common pattern across a lot of Google services is creating a lot of index files that point and loading them into memory to make lookups fast. This is also done by the Google Maps team which has to handle massive amounts of data (e.g. there are over a hundred million roads in North America).
Q: Where are intermediate results from map operations stored?
A: In BigTable or GFS
Google Scalability Conference Trip Report: Lessons in Building Scalable Systems:
The most important lesson the Google Talk team learned is that you have to measure the right things. Questions like "how many active users do you have" and "how many IM messages does the system carry a day" may be good for evaluating marketshare but are not good questions from an engineering perspective if one is trying to get insight into how the system is performing.
Specifically, the biggest strain on the system actually turns out to be displaying presence information.
Giving developers access to live servers (ideally public beta servers not main production servers) will encourage them to test and try out ideas quickly. It also gives them a sense of empowerment. Developers end up making their systems easier to deploy, configure, monitor, debug and maintain when they have a better idea of the end to end process.
And finally, Google Scalability Conference Trip Report: Scaling Google for Every User, which had some interesting ideas about search engine usage.
Update: links for 2007-07-06 includes links to the Google talk on MapReduce tasks on large datasets and other goodies (scalable b-trees, hashing, etc).
Update 2: Greg Linden has a much better list of videos and commentary.
DERI at Google
The Semantic Web at Google (I found the movie here).
Starting from the end, Stefan was asked about inferencing and relationships. He basically responded that linked data is more practical and immediately useful and that the effect of Description Logics has been overestimated.
The highlight for me was the demo on how to construct user interfaces automatically (from about 20 minutes in). The algorithms are described in more detail in, Extending faceted navigation for RDF data.
They also talked a little about how Ruby was a good language for Semantic Web applications and referenced, ActiveRDF: Object-Oriented Semantic Web Programming.
The applications and tools demoed: SIOC Project, Active RDF, JeromeDL (digital library), BrowseRDF (automatically generated, faceted UI) and S3B (social semantic search and browsing).
Tuesday, June 26, 2007
The History of Trans and Walk
Paul recently started talking about the history of trans and walk in Kowari/Mulgara. I did a bit of searching in old Tucana emails and feature requests.
I also vaguely remember trying to create an abstract class for walk and the base transitive classes (like the Exhaustive and Direct transitive closure classes). And one of the feature requests also mentions the idea of "set depth" to reduce the recursion on unbounded traversal which would also be handy for the shortest path between two nodes.
Friday, June 22, 2007
Getting up to Speed With OWL 1.1
Bijan's "Two Interesting Quotes" really got me started on understanding the new features of OWL 1.1. The addition of the RBox (to ABox and TBox), using SROIQ instead of SHOIN, is the obvious one.
The second interesting quote was from, "Describing chemical functional groups in OWL-DL for the classification of chemical compounds" which gives some clear examples using the new features of OWL 1.1 including qualified cardinality, negation and property chains. The first two are analogous to: chemical has two bonds and chemical has exactly one bond.
Property chains are another interesting feature which is highlighted in the above paper, in "The OBO to OWL mapping, GO to OWL 1.1!" and in the first paper Bijan spoke about. In one way this is a bit disappointing because the OBO work replicates what we've been doing in the BioMANTA project (I hate duplication in all its forms) - we based it on the Protégé plugin. Our relationships are quite simple, mainly is-a, so OWL DL was enough. We hadn't yet encountered some of the problems such as reflexive and anti-symmetric relationships. It also links to a web page mapping OBO to OWL 1.1 and OWL DL semantics.
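Roughly, in description logic notation (my paraphrase of the kinds of axioms those papers use, not copied from them):

```latex
% Qualified cardinality restrictions: "has (at least) two bonds" / "has exactly one bond",
% with the class of the filler (Bond) stated rather than just a bare number restriction.
\mathit{Chemical} \sqcap {\geq} 2\, \mathit{hasBond}.\mathit{Bond}
\qquad
\mathit{Chemical} \sqcap {=} 1\, \mathit{hasBond}.\mathit{Bond}

% A property chain (a role-inclusion axiom in SROIQ): the part of a part is itself a part.
\mathit{hasPart} \circ \mathit{hasPart} \sqsubseteq \mathit{hasPart}
```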
As noted in "Igniting the OWL 1.1 Touch Paper: The OWL API" the addition of an object model along with the OWL 1.1 specification makes it obvious what an OWL 1.1 API should look like. I haven't yet used OWL API much, except looking at how it integrates with Pellet and how closely it matched the OWL specification. The use of axioms made the addition of RBox a bit easier to understand (or misunderstand if I've got it wrong). Hopefully, punning will be made clear to me eventually too (I've been too stupid to understand it just based on the explanations and too lazy to look into it). Pellet 1.5 also introduces incremental inferencing which is hopefully as good as it sounds.
The other papers that were of interest: "Representing Phenotypes in OWL" (again making use of qualified cardinality) and "A survey of requirements for automated reasoning services for bio-ontologies in OWL".
And just to round off, I found a very good paper, "Towards a Quantitative, Platform-Independent Analysis of Knowledge Systems", about all the mistakes that can be made during modeling (errors, assumptions, etc.) and other types of failures such as language (under or over expressive), management, learning, reasoning, querying, and others.
And I know I'll need to come back to this, which lists the fragments of OWL 1.1 that can be used to keep things computable in polynomial time.
Friday, June 08, 2007
Bottom Up Semantics
A couple of recent articles by Rod Coffin have come up on DevX exploring how a bit of semantics (and Semantic Web technology) can help improve your application's ability to understand what people are trying to do. The first, Let Semantics Bring Sophistication to Your Applications, introduces the concepts of ontologies and querying. The second, Use Semantic Language Tools to Better Understand User Intentions, shows how using WordNet you can supplement existing applications to improve search results.
Update: DevX now has a Semantic Web Zone.
Thursday, June 07, 2007
0 dollars
SCO's Linux Billions?
For the second quarter of 2007, SCO reported a total of zero dollars of revenue from its SCOsource licensing program. In the first quarter of 2007, SCO reported SCOsource licensing revenue of $34,000, which is somewhat less than the billions McBride had been expecting.
Wednesday, June 06, 2007
All the Web is a Stage
Google Gears' WorkerPool - Message Passing, Shared Nothing Concurrency
The same idea is also being implemented in Scala with their Actors.
By now, you've probably already seen Google Gears, Google's solution for dragging AJAX applications offline. The one thing that jumped out at me was their WorkerPool component. This is a very nice solution for concurrency in Javascript.
In short: if you have any long running task, you can create a WorkerPool, which is basically a group of Javascript processes (note: they're not threads!). The workers in a WorkerPool share nothing with each other, which makes them like OS processes or Erlang's lightweight processes (actually, they're more like the latter, as they're likely to run in the same address space).
And now guess how these workers, these processes, communicate? Yep: messages, formatted as strings. Important to remember: if you format objects as JSON strings, you can even send objects and structures along. The handler that receives messages also gets the ID of the sender, so if the sender implements a handler too, it's possible to return results asynchronously.
If you're reminded of Erlang or the old Actor concept you're right. I wonder what the Google Apps will do with this new concurrency approach (well, new for Javascript… yes I'm ignoring Javascript Rhino).
I still hope that AJAX will die a quick death, like Java Applets, just for being so damn ugly and horrible to implement. But… things like this tell me that this will probably not happen. Good ideas like Google Gears will help paint over the ugly details of a solution, and it's all hip now, so it's easy to ignore many of the problems.
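For flavour, here's a minimal Java analogue of the same shared-nothing, message-passing style - threads that keep their state to themselves and only exchange strings through queues. This is just the concept, not the Gears WorkerPool API:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SharedNothingWorkers {
    public static void main(String[] args) throws InterruptedException {
        // The only channel between "workers" is string messages on queues.
        final BlockingQueue<String> inbox = new LinkedBlockingQueue<String>();
        final BlockingQueue<String> replies = new LinkedBlockingQueue<String>();

        Thread worker = new Thread(new Runnable() {
            public void run() {
                try {
                    String message = inbox.take();      // receive a message (e.g. a JSON string)
                    replies.put(message.toUpperCase()); // reply asynchronously
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        worker.start();

        inbox.put("hello from the main thread"); // send
        System.out.println(replies.take());      // prints HELLO FROM THE MAIN THREAD
        worker.join();
    }
}
```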
Linked Together
This demo shows what can be achieved with the correct use of metadata from Flickr images. If nothing else, this answers why you would bother with metadata (as if there needs to be a justification these days). By correctly annotating your content you gain an advantage in leveraging a network of data. Via, Semantically-linked Interactive Imagery: Wow! The Emergent Web in Action and the demo Blaise Aguera y Arcas: Photosynth demo.
Update: What? I've edited this entry. Weird, I'm sure it made sense at the time.
Tuesday, June 05, 2007
JRDF 0.4.1.1 Released
JRDF 0.4.1.1. A bug fix release for an error in the wiring. It was wiring up the join code rather than the union code when it was supposed to do union operations. Both Derby and Hadoop were removed for this release (which reduces the GUI jar by 2 MB).
Why Fahrenheit 451 is not like 1984
Ray Bradbury: Fahrenheit 451 Misinterpreted
Points to Ray Bradbury at Home with movies about many topics including stories and politics.
Now, Bradbury has decided to make news about the writing of his iconographic work and what he really meant. Fahrenheit 451 is not, he says firmly, a story about government censorship. Nor was it a response to Senator Joseph McCarthy, whose investigations had already instilled fear and stifled the creativity of thousands.
This, despite the fact that reviews, critiques and essays over the decades say that is precisely what it is all about. Even Bradbury’s authorized biographer, Sam Weller, in The Bradbury Chronicles, refers to Fahrenheit 451 as a book about censorship.
Sunday, June 03, 2007
4:34
I was struck by an interview with Ayaan Hirsi Ali. In it she says a moderate version of Islam cannot exist without disregarding parts of the Koran (and without thinking, questioning and debating). I'd always assumed this was true and thought it wouldn't bother me, but I was surprised when it did. About half way through the interview she was prompted to give support for her claim that Islam does promote violence. She says there is a criterion for determining whether something is true, and she quoted Sura (4:34). Without dancing around the subject, it says when faced with a disobedient wife: shout at her, ignore her, and finally beat her. And as far as I can tell this is not up for dispute. It's a bit hard to describe and I'm sure I'd heard it before, but for some reason it finally got to me.
So, like Timothy 2:12, this is a religious statement, without allegory, that is clearly in conflict with modern laws and morals (assuming that morals can exist outside of religion).
I guess I'd always assumed people ignore bits that don't make sense. But as someone reading Timothy said, "What do you do when you are confronted with a finding in scripture that either goes against what you've always believed or at least contradicts what you would like to believe? There are really only two choices. Understand it, accept it and conform to it or reject it and go on doing whatever you want."
Friday, June 01, 2007
The Road After
D Conference videos of Bill and Steve. All are from Brightcove. The missing intro, complete with Macintosh dating game, is on YouTube (there's also a full transcript).
Tuesday, May 22, 2007
Oh the Waste
Running the Numbers
It also includes, ".5 feet wide by 10.5 feet tall in three horizontal panels, Depicts 125,000 one-hundred dollar bills ($12.5 million), the amount our government spends every hour on the war in Iraq."
Via, Statistical visualization.
This new series looks at contemporary American culture through the austere lens of statistics. Each image portrays a specific quantity of something: fifteen million sheets of office paper (five minutes of paper use); 106,000 aluminum cans (thirty seconds of can consumption) and so on. My hope is that images representing these quantities might have a different effect than the raw numbers alone, such as we find daily in articles and books. Statistics can feel abstract and anesthetizing, making it difficult to connect with and make meaning of 3.6 million SUV sales in one year, for example, or 2.3 million Americans in prison, or 426,000 cell phones retired every day. This project visually examines these vast and bizarre measures of our society, in large intricately detailed prints assembled from thousands of smaller photographs.
It also includes, ".5 feet wide by 10.5 feet tall in three horizontal panels, Depicts 125,000 one-hundred dollar bills ($12.5 million), the amount our government spends every hour on the war in Iraq."
Via, Statistical visualization.
Sunday, May 20, 2007
MapReduce SPARQL
Compositional Evaluation of W3C SPARQL Algebra via Reduce/Map "Tested against older DAWG testsuite. Implemented using functional programming idioms: fold (reduce) / unfold (map)
Does that suggest parallelizable execution?"
Thanks to SPARQL's adoption of compositional semantics (the algebra), it's possible. From what I can remember, while there is an implicit left-to-right evaluation of OPTIONAL, which does limit the top-level execution, the partial results are parallelizable. Being able to execute order-independent OPTIONAL while retaining the correct results for OPTIONAL is still an open question (although feasible, I think).
Available from SVN here. More musings, Musings of a Semantic / Rich Web Architect: What's Next?.
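A hand-wavy sketch of why the compositional algebra helps (my illustration, nothing to do with the implementation linked above): if each operand of a UNION is a pure function from the dataset to a set of solutions, the operands can be evaluated independently and the results folded together afterwards.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ParallelUnion {
    // A "graph pattern" here is just a pure function from the dataset to its solutions.
    static <D, S> List<S> evaluateUnion(D dataset, List<Function<D, List<S>>> operands) {
        return operands.parallelStream()      // map: evaluate each operand independently
                .map(op -> op.apply(dataset))
                .flatMap(List::stream)        // fold/reduce: concatenate the partial results
                .collect(Collectors.toList());
    }
}
```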
Friday, May 18, 2007
One More Pragmatic Language
Pragmatic Haskell
Via.
You see, I’ve been talking to the good folks over at Pragmatic Programmers about the possibility of doing a Haskell book. All of my writing effort has been going into that book, and as I didn’t know what the consequences would be of posting potential book content to the net, I elected to keep my mouth shut.
Well, they have agreed.
Where's the Evil?
I like bats much better than bureaucrats. I live in the Managerial Age, in a world of "Admin." The greatest evil is not now done in those sordid "dens of crime" that Dickens loved to paint. It is not done even in concentration camps and labour camps. In those we see its final result. But it is conceived and ordered (moved, seconded, carried, and minuted) in clean, carpeted, warmed, and well-lighted offices, by quiet men with white collars and cut fingernails and smooth-shaven cheeks who do not need to raise their voice. Hence, naturally enough, my symbol for Hell is something like the bureaucracy of a police state or the offices of a thoroughly nasty business concern.
From, Bullying of Academics in Higher Education. An interesting review, Notes on The Screwtape Letters.
Tuesday, May 15, 2007
The Health Benefits of the Semantic Web
A page about the HCLS Demo given in Banff at WWW2007. Some interesting demos. Using OpenLink's Virtuoso to store and query about 350 million statements (and many more required). Part of the demo used Armed Bear Common Lisp, which is Lisp for the JVM (it compiles Lisp to bytecode).
Thursday, May 10, 2007
Defeasible Logic Plus Time
Temporal extensions to Defeasible Logic. Non-monotonic reasoning is about adding more information over time to reach different conclusions. Rather than adding information, adding temporal extensions actually restricts when existing information applies.
Even the Banner Ads are More Interesting
I was reading Sun Tells Java Plans and got to the second paragraph before noticing an Apple ad (in Flash, with sound). I listened to the commercial, closed the tab, and only barely care what the article was about. So who does marketing better? (Actually, yesterday there was one with the PC guy banging his head against the banner, which was better.) Of course, I could just be responding in a Pavlovian (hmm, dessert) way to the background jingle.
Sunday, May 06, 2007
Romulus and Remus - C# and Java
Ted Neward talks about C# and Java. Also mentions the possible .NET backlash, Scala and LINQ. There's also a very impressive demo of LINQ (about 1/4 of the way through the code demo starts) - the struggle from imperative to declarative.
Saturday, May 05, 2007
An Efficient Link Store
- Another web of data store that produces a subset of RDF/XML, Astoria, is from an unlikely source, Microsoft. Instead of a proprietary Semantic Web, Danny sees it as going to town with URIs and REST.
- Silverlight was the other surprising Microsoft development, nothing beats running code - except maybe browser-based dynamic code 2000 times faster. Applets are cool again.
- Some ideas for static triple indexing "Most mature triplestores also index a 4th query element ‘graph’ or ‘context’. I intend to support this query type without expanding the index by using a trick: In my triples format the fact that the subjects are auto-generated and local to the graph means I can choose them to be sequential and effectively re-use them as graph indexes..."
- Plugged In/Invisible Worlds/Tucana/Northrop/TKS/TMex/Kowari/Mulgara podcast (links to the Talis page).
- PAGE a distributed triple store using DHT and YARS (the original). It does seem to miss the DELIS work on P2P RDF which scaled up to 64 nodes.
- Haskell and the Faith of Programming Languages Philip Wadler gives a rather brilliant talk on programming languages. Covers Haskell, Java generics, combining different typed languages (weak, strong, very strong) as well as monads and Links.
Labels:
astoria,
david wood,
haskell,
microsoft,
programming languages,
rdf,
semantic web,
silverlight,
triple store,
yars
YARS Revenge
With little fanfare the folks at DERI have announced YARS2. I know of at least 4 next generation RDF stores (you know who you are) with a few others on the drawing board. Storing data is cool again.
I still wonder how the DERI guys can make the claim about it being their indexing scheme especially when Kowari was open sourced before YARS or the original paper came out. Maybe it's who publishes first? See Paul's previous discussion about it in 2005 (under the title "Indexing"). I mind that this hasn't been properly attributed as I'd like Paul and any others to get the attribution they deserve. On the other hand, I'm glad that people are taking this idea and running with it.
It's good to see that text searching on literals now seems like a standard feature too. They used a sparse index to create all 6 indices. They also hint at how reasoning is going to be performed by linking to "Unifying Reasoning and Search to Web Scale", which suggests a tradeoff over time and trust.
To save disk space for the on-disk indices, we compress the individual blocks using Huffman coding. Depending on the data values and the sorting order of the index, we achieve a compression rate of ≈ 90%. Although compression has a marginal impact on performance, we deem that the benefits of saved disk space for large index files outweighs the slight performance dip.
Figure 4 shows the correspondence between block size and lookup time, and also shows the impact of Huffman coding on the lookup performance; block sizes are measured pre-compression. The average lookup time for a data file with 100k entries (random lookups for all subjects in the index) using a 64k block size is approximately 1.1 ms for the uncompressed and 1.4 ms for the compressed data file. For 90k random lookups over a 7 GB data file with 420 million synthetically generated triples (more on that dataset in Section 7), we achieve an average seek time of 23.5 ms.
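As a toy illustration of the complete-index idea - nothing like the real YARS2 or Kowari data structures - the same quads can be kept in several sort orders so that any pattern of bound and unbound positions becomes a range scan over one of them:

```java
import java.util.Comparator;
import java.util.NavigableSet;
import java.util.TreeSet;

public class QuadIndexes {

    static final class Quad {
        final String s, p, o, c; // subject, predicate, object, context/graph
        Quad(String s, String p, String o, String c) {
            this.s = s; this.p = p; this.o = o; this.c = c;
        }
        // Build a sort key for an ordering such as "spoc" or "posc".
        String key(String order) {
            StringBuilder k = new StringBuilder();
            for (char slot : order.toCharArray()) {
                switch (slot) {
                    case 's': k.append(s); break;
                    case 'p': k.append(p); break;
                    case 'o': k.append(o); break;
                    default:  k.append(c); break;
                }
                k.append('\u0000'); // slot separator
            }
            return k.toString();
        }
    }

    // One sorted structure per ordering; a real store uses on-disk, block-compressed
    // structures rather than in-memory TreeSets.
    static NavigableSet<Quad> index(final String order) {
        return new TreeSet<Quad>(new Comparator<Quad>() {
            public int compare(Quad a, Quad b) {
                return a.key(order).compareTo(b.key(order));
            }
        });
    }
}
```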
Sunday, April 29, 2007
You Are Here, Now
- Mammal rise 'not linked' to dinos Perhaps the best part is the beautifully rendered result of evolution for the last 166 million years. Good global warming? "However, the supertree shows that the placental mammals had already split into these sub-groups by 93 million years ago, long before the space impact and at a time when dinosaurs still ruled the planet."
- Guice Talk by Google Listing all the good ways DI makes your code better - and yes, testing is the main one, more OO is another, better than a service locator (J2EE), etc. Not much new here, except for no XML, for people who have been doing this for a while.
- Another Google talk by the Cyc people referenced in Priming the Pump and Threshold Conditions. With many examples that will be familiar to Semantic Web proponents.
- Closures for Java JSR Scala gets a mention, as does the BGGA proposal.
- Three reasons that REST is not RPC "Being able to do state transition processing at disparate locations is hugely powerful...A single process can span machines offering differing levels of scalability, reliability and security."
- Six of One, a Half Dozen of the Other "Closures vs. objects should really be a koan". I think this was off a reddit post at some stage. Leads to a purely functional OO system in Scheme.
Labels:
cyc,
evolution,
functional programming,
google,
guice,
java,
rest,
semantic web