Saturday, January 26, 2008

Scaling MapReduce

Before I started using Hadoop I had to be reasonably sure that it would scale. One of the pieces of evidence that convinced me was IBM's "Scalability of the Nutch Search Engine"; they performed tests using real-world systems and also modelled the architecture to verify the results:
We observe...that the throughput, for a fixed data set size per back-end, increases with the number of back-ends.

We observe a generally good agreement between measurement and prediction for small number of back-end servers. The response time is essentially flat with the number of back-end servers, up to 500, 1000, and 2000 servers for data set sizes (per server) of 10 GB, 20 GB, and 40 GB respectively. The system is more scalable for larger data set sizes because the work per back-end server per query increases.


They were able to service 1000+ back-end nodes with a single front-end server.

Friday, January 25, 2008

Kids Today

You tell them that there used to be monkey-shaped pixels and they won't believe you.

Wednesday, January 23, 2008

Nodes by Type

JRDF 0.5.3 released. This has a couple of bug fixes and some new features. The find with no constraints (find(ANY_SUBJECT_NODE, ANY_PREDICATE_NODE, ANY_OBJECT_NODE)) has always been fast with the on-disk version, but not if it was constrained in any way (a fixed subject, for instance). This was because the wrong method on the B-tree was being called. Obviously, fixing this is a big improvement in performance.
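
As a rough sketch (class and constant names follow the post; the actual JRDF 0.5.3 packages and signatures may differ, and the graph and subject are assumed to have been created through JRDF's factories):

    // Sketch only - names follow the post, not a verified JRDF 0.5.3 API.
    import org.jrdf.graph.*;  // assumed package

    public class FindExample {
        // 'graph' and 'subject' are created elsewhere via GraphFactory and
        // GraphElementFactory, as usual.
        static void lookups(Graph graph, SubjectNode subject) throws GraphException {
            // Unconstrained scan - has always been fast on the on-disk graph.
            graph.find(AnySubjectNode.ANY_SUBJECT_NODE,
                       AnyPredicateNode.ANY_PREDICATE_NODE,
                       AnyObjectNode.ANY_OBJECT_NODE);

            // Constrained find (fixed subject) - now fast too, because the
            // correct B-tree method is called rather than a scan.
            graph.find(subject,
                       AnyPredicateNode.ANY_PREDICATE_NODE,
                       AnyObjectNode.ANY_OBJECT_NODE);
        }
    }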

A cleanup of the Graph interface also occurred. There were all these methods like findUniquePredicates, getBlankNodes and getResources. They have been replaced by findNodes(NodeType), where NodeType is any of the positional (SPO) or value (Resource, URIRef, bnode, Literal) node types. This was possible because JRDF has a NodeTypePool where all nodes are kept in maps by their type. Some node types are not yet supported, but that should be done next release.

The other thing that was very annoying and is now fixed: a Literal or URI Reference had to be created via the GraphElementFactory before you could add it to the Graph. This only makes sense for blank nodes; other node types can simply be re-localized (given their internal node id) each time. Handy given the amount of graph copying we have been doing lately.

There were some other, more boring changes, like a bug fix to the RDF/XML parser and TripleImpl now being null-proof.

Much more to do - I haven't had a chance to clean up the Molecule implementation or add the persistence support.

Solid State Drives

I was wondering what effect SSDs (solid state drives) will have on data structures and algorithms - that is, drives with roughly ten times the seek performance and improved read speeds. Desktop hard drives are about the same (around 18% slower) or beat SSDs in straight-line, sequential access, especially writes. In the short term, it looks like the biggest impact of solid state drives is likely to be giving laptop drives the same performance as desktop ones.

A secondary short-term impact may well be in providing another level of storage between spinning hard disks and other caches. This is mentioned in "The five-minute rule twenty years later, and how flash memory changes the rules".

The name of their rule refers to the break-even interval between accesses. If a record (or page) is accessed more often, it should be kept in memory; otherwise, it should remain on disk and read when needed.

Not surprisingly, the optimal page size for B-tree indexes on modern high-bandwidth disks is much larger than traditional database systems have employed. The access time dominates for all small page sizes, such that additional byte transfer and thus additional utility are almost free. B-tree nodes of 256 KB are very near optimal...a traditional rotating hard disk, Table 3 indicates 337 seconds or just over 5 minutes.

Due to the lack of mechanical seeking and rotation, the transfer time dominates even for small pages. The optimal page size for B-trees on flash memory is 2 KB, much smaller than for traditional disk drives. In Table 3, the break-even interval for pages of 4 KB is 351 seconds.

Using O’Neil’s SB-trees, extents of 256 KB are the units of transfer between flash memory and disk, whereas pages of 4 KB are the unit of transfer between RAM and flash memory.
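
The break-even interval itself comes from Gray and Putzolu's old formula; here is a rough sketch of the calculation (the drive prices, RAM prices and access rates are placeholders, not the paper's Table 3 numbers):

    // A rough sketch of the break-even calculation behind the five-minute rule
    // (Gray and Putzolu's formula as restated in the paper). The prices and
    // access rates below are arbitrary placeholders for illustration only.
    public class FiveMinuteRule {

        static double breakEvenSeconds(double pageSizeKb, double accessesPerSecond,
                                       double pricePerDrive, double pricePerMbOfRam) {
            double pagesPerMbOfRam = 1024.0 / pageSizeKb;
            return (pagesPerMbOfRam / accessesPerSecond) * (pricePerDrive / pricePerMbOfRam);
        }

        public static void main(String[] args) {
            // Made-up example numbers - see the paper's Table 3 for real ones.
            System.out.println("256 KB pages on disk:  "
                + breakEvenSeconds(256, 100, 80, 0.05) + " seconds");
            System.out.println("4 KB pages on flash:   "
                + breakEvenSeconds(4, 5000, 30, 0.05) + " seconds");
        }
    }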


It mentions this paper: "SB-Tree: An Index-Sequential Structure for High-Performance Sequential Access".

Via Flash Memory and Databases.

Monday, January 21, 2008

Years to Days

Dr J Craig Venter – A DNA-Driven World (video)

For the past 15 years at ever faster rates we have been digitising biology. By that I mean going from the analog world of biology through DNA sequencing into the digital world of the computer. I also refer to this as reading the genetic code. The human genome is perhaps the best example of digitising biology. Our computer databases are growing faster per day than during the first 10 years of DNA sequencing. The databases have been filling even faster with the results of our global ocean sequencing project. As a result we have now over 10 million genes in the public databases, the majority of which have been contributed by my teams.

Instead of evolution happening only due to random mutations that survived selective pressure, we can see how, by adding chromosomes to or exchanging them between species, thousands of changes could happen in an instant.

Now they can happen not just by random chance but by deliberate human design and selection. Human thought and design and specific selection is now replacing Darwinian evolution.

The biggest question in my mind is the one of scale. Last year we consumed more than 83 million barrels of oil per day or 30 billion barrels during the year. In addition we used over 3 billion tons of coal. These are mind boggling numbers and the only way that I can see replacing oil and coal is through a widely distributed system. If there were one million bio-refineries around the globe each one would still need to produce 17,000 liters per day.


He also talks about the rise of fundamentalism, that 25% of people in the US don't know that the Earth revolves around the Sun, half think people and dinosaurs co-existed and that 58% don't know how to calculate a 10% tip.

Saturday, January 19, 2008

Who Killed Functional Programming?

MapReduce: A major step backwards. Two good responses, "The Great MapReduce Debate" and "Relational Database Experts Jump The MapReduce Shark":
MapReduce is not a data storage or management system — it’s an algorithmic technique for the distributed processing of large amounts of data.

MapReduce has the same relationship to RDBMSs as my motorcycle has to a snowplow — it’s a step backwards in snowplow technology if you look at it that way.

I don’t think the authors understand distributed processing and distributed file systems when they think reduce must rely on FTP. The Wikipedia article says “each [reduce] node is expected to report back periodically with completed work and status updates,” pretty much the opposite of the “pull” DeWitt and Stonebraker criticize.


It's a fairly specious argument that more features means better, and the comparisons are not equivalent - it would be better to compare BigTable or HBase with databases, not MapReduce - and even that would be silly, because these guys are column-database proponents, which is (roughly) the same approach as HBase/BigTable (they do say it lacks views, but the idea used by Google is copying rather than mutating). An example of using MapReduce to do indexing is Nutch. The distributed filesystem is important, and it should be clear by now, you'd hope, that storing the data is not the problem; processing it is.
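
Roughly, the shape of an indexing job looks like this (a toy, in-memory sketch only; it isn't Hadoop's API, and the documents are made up):

    // Toy illustration of the map and reduce steps behind building an
    // inverted index - the kind of job Nutch runs on Hadoop at scale.
    import java.util.*;

    public class ToyInvertedIndex {
        public static void main(String[] args) {
            Map<String, String> docs = new HashMap<String, String>();
            docs.put("doc1", "mapreduce is a step backwards");
            docs.put("doc2", "mapreduce is a step forwards");

            // Map phase: emit (word, docId) pairs.
            List<String[]> emitted = new ArrayList<String[]>();
            for (Map.Entry<String, String> doc : docs.entrySet()) {
                for (String word : doc.getValue().split("\\s+")) {
                    emitted.add(new String[] {word, doc.getKey()});
                }
            }

            // Shuffle/sort: group document ids by word.
            Map<String, Set<String>> grouped = new TreeMap<String, Set<String>>();
            for (String[] pair : emitted) {
                if (!grouped.containsKey(pair[0])) {
                    grouped.put(pair[0], new TreeSet<String>());
                }
                grouped.get(pair[0]).add(pair[1]);
            }

            // Reduce phase: each word's grouped postings list is an index entry.
            for (Map.Entry<String, Set<String>> entry : grouped.entrySet()) {
                System.out.println(entry.getKey() + " -> " + entry.getValue());
            }
        }
    }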

These types of comparisons, "X does something Y doesn't, therefore X is better", remind me of "Who Killed the Electric Car?". One of the reasons the EV1 was said to have failed was that "lacking an engine, it saves the driver the cost of replacement parts, motor oil, filters, and spark plugs" (and more) - all those moving parts that had to be maintained were gone. How can you change gears if you don't have a gearbox?

Does that remind you of a database or other piece of software you use? Isn't the software industry just like the car industry? There appear to be market forces ensuring that things stay complicated and expensive to maintain.

Meanwhile, Google is getting on with it by providing services for terabytes of data (via Wired): "Building on the company's acquisition of the data visualization technology, Trendalyzer, from the oft-lauded, TED-presenting Gapminder team, Google will also be offering algorithms for the examination and probing of the information. The new site will have YouTube-style annotating and commenting features."

Update: It's been noted that you should bet on cheap and rickety if it's 10 times better (this also mentions Hypertable) and that their comments are not even wrong.

Update 2: Part of my rant was going to be about how Apple designs things with the least stuff in them - what they leave out is what makes them the leader. It's the same gist as what I read at "Heavier than Air".

Wednesday, January 16, 2008

Fusion Beats Transistor

Why fusion? It's answered fairly well in Big machines for big questions. The presenter says that fusion has been outstripping CPU development: the fusion triple product doubles every 1.8 years, versus roughly every 2 years for transistors - over 18 years that's about a 1024-fold improvement versus 512-fold. They talk about ITER being the first fusion plant to generate more energy than is put in. Worth the price of admission to the 21st century alone.

Monday, January 14, 2008

It's Fun to Pun

Punning was one thing that stumped me about OWL 1.1 last time I looked - mainly because I confused it with OWL Full. We did similar things for ontologies such as the Cell Type ontology (although the names were not the same, in order to avoid moving out of DL land).

Recently, there's been a nice thread on "Xtreme Punning". I quite like Pat Hayes' description of the situation:
Conventions which everyone has to agree to use: which will never happen. Names are cheap, but agreement on names is not cheap, so the fewer names we can get away with the better. The datatype example in my earlier email is a good example, IMO: all three uses are quite natural, and I wouldn't want to have to remember to distinguish xsd:number from rdfs:numberMapping from owl:numberThing.

Names can be freely used in ANY logical syntactic role, without ANY restrictions. EVERY occurrence of a name denotes the same thing. ALL names denote both a class and a property (and in CL a function; and in IKL, a proposition. No doubt other ideas can be incorporated.) ANYTHING can be an instance. Hence, any two legal sentences (ontologies) can be simply concatenated and the result is still legal (no checking to be done when merging graphs) and meaningful. And it all works very simply, with a simple uniform semantics, which one can quickly get used to.
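
To make that concrete with an invented example (the names below aren't from the thread): the same name can be used as a class in one statement and as an individual in another, and under punning both uses stand.

    // Invented illustration: ex:Eagle appears both as a class and as an individual.
    public class PunningExample {
        public static void main(String[] args) {
            String[] statements = {
                "ex:harry rdf:type ex:Eagle .",                  // ex:Eagle used as a class
                "ex:Eagle rdf:type ex:Species .",                // ex:Eagle used as an individual
                "ex:Eagle ex:conservationStatus ex:Endangered ." // and as a plain subject
            };
            for (String statement : statements) {
                System.out.println(statement);
            }
        }
    }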

Friday, January 11, 2008

RDF in a...database?

This has many similarities with a recent conversation about the relational model and RDF; from "Relational databases for storing and querying RDF":
Our work in this area, "Scalable Semantic Web Data Management Using Vertical Partitioning," appeared in the VLDB Conference in Vienna in September. It showed that using a column-oriented database, along with this property representation, allows us to overcome the overhead of representing NULLs, while providing two orders of magnitude better performance than the naive triples representation. This is particularly true when processing queries that must access many triples during execution (e.g., computing the number of books grouped by subject area or institution.) Of course, there is a fair amount of subtlety to getting good performance out of such a representation. Have a look at our conference paper for the details!


This paper and others were mentioned previously. This view - property tables in column databases - is very similar to untyped relations.
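
A rough sketch of the layout change (the table and column names are invented here, not taken from the paper): instead of one wide triples table, each property gets its own two-column table, which suits a column store and does away with NULLs.

    // Invented illustration of vertically partitioned RDF storage.
    public class VerticalPartitioning {
        // Naive triples representation: one big three-column table.
        static final String TRIPLES =
            "CREATE TABLE triples (subject VARCHAR, property VARCHAR, object VARCHAR)";

        // Vertical partitioning: a two-column table per property, sorted by
        // subject so that subject-subject merge joins stay cheap.
        static final String TITLE =
            "CREATE TABLE title (subject VARCHAR, object VARCHAR)";
        static final String AUTHOR =
            "CREATE TABLE author (subject VARCHAR, object VARCHAR)";

        public static void main(String[] args) {
            System.out.println(TRIPLES);
            System.out.println(TITLE);
            System.out.println(AUTHOR);
        }
    }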

Monday, January 07, 2008

JDBC for Hadoop

From the Hadoop mailing list:
The code is heavily inspired by the MapReduce layer for HBase and works much like it. However, it's mainly meant to be used for development in its current form, but could potentially be of use for people that must keep their data in a relational database and cannot migrate to HBase for some reason (without all the benefits of HBase, of course).


Download here

Thursday, January 03, 2008

IntelliJ Under OS X

It seems that the bug with IntelliJ under OS X 10.5 (Leopard) (it works for me now, anyway) is fixed in the current nightly builds and will be in 7.0.3. More information here.