Monday, July 30, 2007

Scale Out or Drop Out

I recently came across this study done by IBM using Nutch/Lucene to compare scale-up (a single, shared-memory, fast server) against scale-out (a cluster). It showed that, for the type of work performed, scale-out systems outperformed scale-up systems on both price and performance. The IEEE article about Google's architecture in late 2002 notes that Google was spending $278,000 for 88 dual-CPU 2 GHz Xeons, each with 2 GB of RAM and an 80 GB hard drive. In the IBM paper, $200,000 gets you 112 blades, each a quad processor with 8 GB of memory and a 73 GB drive, as well as a shared storage system.

This scale-out approach was also recently covered in Tim O'Reilly's post, "MySQL: The Twelve Days of Scaleout".

Scale-up x Scale-out: A Case Study using Nutch/Lucene

The Nutch/Lucene search framework includes a parallel indexing operation written using the MapReduce programming model [2]. MapReduce provides a convenient way of addressing an important (though limited) class of real-life commercial applications by hiding parallelism and fault-tolerance issues from the programmers, letting them focus on the problem domain.
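The shape of the model is easy to see without any framework at all. Here's a minimal sketch in plain Python (no Hadoop; the function names `map_fn`, `reduce_fn` and `map_reduce` are mine, purely for illustration) of a MapReduce-style word count, where a sort-and-group stands in for the framework's shuffle phase:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(doc_id, text):
    # Map phase: emit (word, 1) for every word in a document.
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce phase: sum the counts for one word.
    yield word, sum(counts)

def map_reduce(docs):
    # The framework shuffles map output by key before reducing;
    # here a sort + groupby plays that role.
    intermediate = sorted(
        (pair for doc_id, text in docs.items() for pair in map_fn(doc_id, text)),
        key=itemgetter(0),
    )
    return {
        word: next(reduce_fn(word, (c for _, c in group)))[1]
        for word, group in groupby(intermediate, key=itemgetter(0))
    }

print(map_reduce({"d1": "lucene nutch lucene", "d2": "nutch"}))
# {'lucene': 2, 'nutch': 2}
```

The real framework adds the parts this sketch hides: distributing the map tasks, shuffling over the network, and re-running failed tasks.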

The query engine part consists of one or more front-ends, and one or more back-ends. Each back-end is associated with a segment of the complete data set...The front-end collects the response from all the back-ends to produce a single list of the top documents (typically 10 overall best matches).
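The front-end's merge step is essentially a k-way merge of per-segment result lists. A minimal sketch, assuming each back-end returns its hits already sorted by descending score (the function name and data layout are mine, not the paper's):

```python
import heapq
from itertools import islice

def merge_top_k(backend_results, k=10):
    # Each back-end returns [(score, doc_id), ...] sorted by descending
    # score. heapq.merge performs a k-way merge without concatenating
    # the lists; islice keeps only the overall top k.
    merged = heapq.merge(*backend_results, reverse=True)
    return list(islice(merged, k))

backend_a = [(0.9, "a1"), (0.4, "a2")]
backend_b = [(0.8, "b1"), (0.7, "b2"), (0.1, "b3")]
print(merge_top_k([backend_a, backend_b], k=3))
# [(0.9, 'a1'), (0.8, 'b1'), (0.7, 'b2')]
```

Because each back-end only needs to send its own top k, the merge stays cheap no matter how large the segments are.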

We see that the peak performance of the scale out solution (BladeCenter) is approximately 4 times better.

Saturday, July 28, 2007

Moving Around

Friday, July 20, 2007

MapReduce by the Hour

Running Hadoop MapReduce on Amazon EC2 and Amazon S3
Apache's Hadoop project aims to solve these problems by providing a framework for running large data processing applications on clusters of commodity hardware. Combined with Amazon EC2 for running the application, and Amazon S3 for storing the data, we can run large jobs very economically. This paper describes how to use Amazon Web Services and Hadoop to run an ad hoc analysis on a large collection of web access logs that otherwise would have cost a prohibitive amount in either time or money.

It took about 5 minutes to transfer the data (which was compressed - Hadoop can read compressed input data out of the box) from S3 to HDFS, so the whole job took less than an hour. At $0.10 per instance-hour, this works out at only $2 for the whole job, plus S3 storage and transfer costs - that's external transfer costs only, because remember transfers between EC2 and S3 are free.
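The arithmetic behind the $2 is simple: at $0.10 per instance-hour, $2 buys 20 instance-hours, so the cluster size here is an assumption consistent with the quoted price, not a figure from the paper:

```python
RATE = 0.10        # USD per EC2 instance-hour (2007 pricing)
instances = 20     # assumed cluster size, consistent with the $2 total
hours_billed = 1   # partial hours are billed as whole hours

cost = RATE * instances * hours_billed
print(f"${cost:.2f}")  # $2.00
```

The same job on owned hardware would mean buying 20 machines up front, which is what makes the by-the-hour model attractive for ad hoc analysis.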

Friday, July 13, 2007


FeDeRate [FED] provides a mapping between RDF queries and SQL queries over conventional relational databases. SPASQL provides similar functionality, but eliminates the query rewriting phase by parsing the SPARQL [SPARQL] directly in the database server. This approach is efficient, and allows applications linked to currently deployed MySQL client libraries to execute SPARQL queries and get the results back through the conventional MySQL protocol.
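The rewriting idea can be shown with a toy translator. This is not FeDeRate's or SPASQL's actual mapping - the `(s, p, o)` triple-table layout and the function name are my own simplification - but it shows how a SPARQL-style triple pattern becomes an SQL query:

```python
def triple_to_sql(subject, predicate, obj, table="triples"):
    # Translate one SPARQL-like triple pattern into SQL over a simple
    # (s, p, o) triple table. Terms starting with '?' are variables and
    # become selected columns; everything else becomes a WHERE constraint.
    columns, where = [], []
    for col, term in zip(("s", "p", "o"), (subject, predicate, obj)):
        if term.startswith("?"):
            columns.append(f"{col} AS {term[1:]}")
        else:
            where.append(f"{col} = '{term}'")
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql

print(triple_to_sql("?doc", "dc:title", "?title"))
# SELECT s AS doc, o AS title FROM triples WHERE p = 'dc:title'
```

SPASQL's point is that this translation happens inside the database server itself, so existing MySQL clients can issue SPARQL without a separate rewriting layer.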

This looks like a nice, pragmatic approach to solving the legacy SQL data problem. There's some interesting discussion about the mismatch between UNION in SQL and SPARQL.

Sunday, July 08, 2007

JRDF Development

I've started adding datatype support to JRDF. Nothing too flash at the moment - mainly to support the RDF/XML test cases, and prompted by a similar piece of work completed by the people at KP Lab. They also implemented an Elmo-like API (with lazy collections) and added a Resource interface (which adds methods on top of both Blank Nodes and URI References and is quite similar to Jena's Resource interface), which will be integrated soon.

An interesting aspect of the datatype support is, of course, semantically equal values of different types, like integer and long. In the current version of JRDF they don't compare equal. Jena has "sameValueAs", which returns true if two literals are semantically equal. In JRDF I originally decided to use Comparable, but its use in ordered maps could get confusing (although it would be similar to BigDecimal, that's the exception rather than the rule). Using Comparable this way would also differ from the implementation of datatypes in Kowari/Mulgara. So I've kept the old behaviour and added an equivalence comparison operator that is a copy of Comparable but takes semantic similarities into account.
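The distinction can be sketched like this (the class and method names are mine, not JRDF's or Jena's): strict literal equality keeps "5"^^xsd:int and "5"^^xsd:long distinct, while a separate equivalence method compares the values they denote:

```python
class Literal:
    def __init__(self, lexical, datatype):
        self.lexical, self.datatype = lexical, datatype

    def __eq__(self, other):
        # Strict RDF literal equality: same lexical form AND same datatype.
        return (self.lexical, self.datatype) == (other.lexical, other.datatype)

    def equiv(self, other):
        # Semantic ("sameValueAs"-style) comparison: compare denoted values.
        numeric = {"xsd:int", "xsd:long", "xsd:integer"}
        if self.datatype in numeric and other.datatype in numeric:
            return int(self.lexical) == int(other.lexical)
        return self == other

five_int = Literal("5", "xsd:int")
five_long = Literal("5", "xsd:long")
print(five_int == five_long)      # False: different datatypes
print(five_int.equiv(five_long))  # True: same value
```

Keeping the two operations separate is what avoids the ordered-map confusion: maps and sets keep using strict equality, and semantic comparison is opt-in.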

The other thing that's been cleaned up is the containers and collections - a lot of the code (around 90%) was redundant due to generics and didn't need to be there anymore. However, I did come across "The pseudo-typedef antipattern", which pretty much says that creating a StringList (that extends ArrayList) is an anti-pattern. In JRDF, collections and containers are basically Lists or Maps restricted to object nodes, which seems to fit the anti-pattern. However, for me it does add different behaviour over the standard collections, and restricting by type does seem to make sense. I'm willing to be convinced otherwise, though.
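The trade-off is visible in a toy version (the class names here are illustrative, not JRDF's). A subclass that rejects anything other than object nodes really does add behaviour beyond the base collection, which is the usual defence against the pseudo-typedef charge:

```python
class ObjectNode:
    # Stand-in for an RDF object node (literal, blank node, or URI).
    def __init__(self, value):
        self.value = value

class ObjectNodeList(list):
    # Unlike a plain list, this rejects anything that isn't an
    # ObjectNode - extra checking behaviour, not just a type alias.
    def append(self, item):
        if not isinstance(item, ObjectNode):
            raise TypeError("only object nodes allowed")
        super().append(item)

nodes = ObjectNodeList()
nodes.append(ObjectNode("lit"))
try:
    nodes.append("plain string")
except TypeError as e:
    print(e)  # only object nodes allowed
```

If the subclass added no such behaviour, the antipattern argument would apply in full: a generic List restricted by its declared element type would do the same job with less coupling.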

I'm also aware of some of the more weird RDF/XML edge cases that don't look like they were ever implemented correctly in JRDF or Kowari/Mulgara (like turning rdf:li into numbers even if outside of collections). I don't think it's a big use case and no one has ever raised it as a bug as far as I know.

Update: It's been pointed out to me again that implementing Comparable isn't flexible enough. I should've known better, of course, as it's been pointed out to me many times in the past, and having used it before I know the equivComparator is a bit dumb.

Friday, July 06, 2007

The Centre of Excellence

The Pmarca Guide to Big Companies, part 2: Retaining great people
Don't create a new group or organization within your company whose job is "innovation". This takes various forms, but it happens reasonably often when a big company gets into product trouble, and it's hugely damaging.

Here's why:

First, you send the terrible message to the rest of the organization that they're not supposed to innovate.

Second, you send the terrible message to the rest of the organization that you think they're the B team.

That's a one-two punch that will seriously screw things up.

Via Dare.

Tuesday, July 03, 2007

When You Get What You Want But Not What You Need

I was at the eResearch 2007 conference last week and it was quite good. Although I must say that I'm very sick of "the long tail" and mashups being mentioned.

The keynotes were quite good and I'll mention three, but the others, and the talks I went to, were very good too. I wish I had written this up last week when it was fresher in my mind, so some of my recollections may be a little inaccurate.

David De Roure mentioned the reuse of workflows and that automation requires machine-processable descriptions. He also mentioned the CombeChem project and Semantic Grids. He made some interesting comments, such as: grids are inherently semantic grids, a little semantics goes a long way (it's all about linked data), and mashups are workflows. He mentioned the very successful Taverna project.

Phil Bourne gave a scenario of someone taking the bus, reviewing a paper, contacting their friends because it contradicts their current results and by the end of the bus trip having validated their approach and written a response to the author of the paper. He used the acronym IPOL (iPod plus Laptop) but surely the iPhone would've been closer to the mark.

His main idea is that publications aren't enough, that the experimental data has to also be saved, reviewed and made part of the academic process. As someone who runs one of the protein databases and an editor of a PLoS journal he's obviously seen the benefits of this approach. It also reminded me that ontologies in biology were cool before the Semantic Web came about (although most biology ontologies aren't very good (pdf)).

He mentioned the BioLit project, which tries to integrate the workflow of papers, figures, data, and literature querying, creating a positive feedback loop for academic metadata. The idea of getting at the source data behind published graphs is a much touted application of things like ontologies and RDF.

The last thing he mentioned was creating podcasts for published papers - they should give an overview of a paper that's more in-depth than an abstract but more general than the entire paper. To achieve this they've set up the site (still early days for that). That sounded quite interesting - I can imagine a video explaining the key figures and facts in a multimedia presentation would be very useful, though I'm not sure most current researchers have those skills. If it is a lot more useful it may lead to a situation similar to now, where papers published before the 1980s or so don't get cited or read because they aren't online. Maybe people who are used to YouTube won't read papers that don't have an accompanying video, although the divide probably won't be as sharp as with pre-digital papers. It seems much more likely that papers without the original experimental data will increasingly be ignored.

The last keynote I'll talk about was by Alex Szalay. I appreciated this one the most, even though he did mention the long tail. He has previously written about the exponential increase in scientific data; he wrote that piece with Jim Gray, and he was one of the people who helped in the search for Gray (his blog has more information about that too). There's now computational x - where x is any science, including biology, physics and astronomy. One of the key effects of this much data is that the process of analysing data and then publishing your results is changing: it's now more publish the data, then do the analysis and publish the analysis.

He mentioned four different places where the power law (the long tail) occurs: projects (a few big ones, many small ones), data sizes (a few multi-petabyte sources, many more in the tens and hundreds of terabytes, and vastly more smaller ones), value-added or refereed products, and users of data (a few users use it a lot but the vast majority use it a little).

The main thing I liked was his point that the processing of this data is fundamentally different from what it was before. It's too difficult to move the data around when it's petabytes - it's easier to move the processing to the data. It was pointed out to me later that versioning the software that processed the data is now a tiny fraction of the data kept, but is more often than not overlooked.

The data captured by CCDs has converged, or is about to converge, with that from the more traditional telescopes, and the data published and searchable now is only about 2 years behind the best possible results. For most astronomers it's actually better to observe the universe from the data than to use an actual telescope.

Processing, memory and CCDs are all following Moore's Law, but bandwidth is not. He mentioned an approach very much along the lines of Hadoop/GFS: the code moves to the data, not the other way around. He also listed things that are fairly well known now: there's no time to get it right from the top down, data processing and management become the key skills of the future, combining data from different sources is highly valuable, and "build it and they will come" is not enough - you must provide a decent interface.

He mentioned two projects: Life Under Your Feet and Virtual Observatory. Both have huge data sets and rather cool user interfaces.

Google Culture

So a perspective on Google has been going around for a little while now. The story of how an interview gets widely distributed is, though, more interesting. It's not like Microsoft has anything to gain from distributing this information.

Via, "RE: Life at Google - The Microsoftie Perspective".

Squirt Towers

This gives me some of my life back. The weird thing is that even after you finish it you still want to play one more game.