Tuesday, July 03, 2007

When You Get What You Want But Not What You Need

I was at the eResearch 2007 conference last week and it was quite good, although I must say I'm very sick of "the long tail" and mashups being mentioned.

The keynotes were quite good and I'll mention three of them, but the others and the talks I went to were very good too. I wish I had written this up last week when it was fresher in my mind, so some of my recollections may be a little inaccurate.

David De Roure talked about the reuse of workflows and noted that automation requires machine-processable descriptions. He also mentioned the CombeChem project and Semantic Grids. He made some interesting comments: that grids are inherently semantic grids, that a little semantics goes a long way (it's all about linked data) and that mashups are workflows. He mentioned the very successful Taverna project.

Phil Bourne gave a scenario of someone taking the bus, reviewing a paper, contacting their friends because it contradicts their current results, and, by the end of the bus trip, having validated their approach and written a response to the paper's author. He used the acronym IPOL (iPod plus Laptop), but surely the iPhone would've been closer to the mark.

His main idea is that publications aren't enough: the experimental data also has to be saved, reviewed and made part of the academic process. As someone who runs one of the protein databases and is an editor of a PLoS journal, he's obviously seen the benefits of this approach. It also reminded me that ontologies in biology were cool before the Semantic Web came about (although most biology ontologies aren't very good (pdf)).

He mentioned the BioLit project, which tries to integrate the workflow of papers, figures, data, and literature querying, creating a positive feedback loop for academic metadata. Getting at the source data behind published graphs is a much-touted application of things like ontologies and RDF.

The last thing he mentioned was creating podcasts for published papers - they should give an overview of a paper that's more in-depth than the abstract but more general than the entire paper. To achieve this they've set up the SciVee.tv site (still early days for that). That sounded quite interesting - I can imagine a video explaining the key figures and facts in a multimedia presentation would be very useful, though I'm not sure most current researchers have those skills. If it turns out to be a lot more useful, it may lead to a situation similar to the one now, where papers published before the 1980s or so don't get cited or read because they aren't online. Maybe people who are used to YouTube won't read papers that don't have an accompanying video, although the divide probably won't be as sharp as it is for pre-digital papers. It seems much more likely that papers without the original experimental data will increasingly be ignored.

The last keynote I'll talk about was by Alex Szalay. I appreciated this one the most, even though he did mention the long tail. He has previously written about the exponential increase in scientific data with Jim Gray, and he was one of the people who helped in the search for him (his blog has more information about that too). There's now computational x - where x is any science, including things like biology, physics and astronomy. One of the key effects of having this much data is that the process of analysing data and then publishing your results is changing: it's now more a matter of publishing the data first, then doing the analysis and publishing that.

He mentioned four different places where the power law (long tail) occurs: projects (a few big ones, many small ones), data sizes (a few multi-petabyte sources, many more in the tens and hundreds of terabytes, and vastly more smaller ones), value-added or refereed products, and users of data (a few users use a lot but the vast majority use it a little).

The main thing I liked was his point that the processing of this data is fundamentally different from what it was before. It's too difficult to move the data about when it's petabytes - it's easier to move the processing to the data. It was pointed out to me later that versioning the software that processed the data is now a very tiny fraction of the data kept, but it's more often than not overlooked.

The data captured by CCDs has converged, or is about to converge, with that from more traditional telescopes, and the data published and searchable now is only about 2 years behind the best possible results. For most astronomers it's actually better to observe the universe from the data than to use an actual telescope.

Processing, memory and CCDs are all following Moore's Law but bandwidth is not. He mentioned an approach that's very much along the lines of Hadoop/GFS - the code moves to the data, not the other way around. He also listed things that are fairly well known now: there's no time to get it right from the top down, data processing and management become the key skills of the future, combining data from different sources is highly valuable, and "build it and they will come" is not enough - you must provide a decent interface.
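The "code moves to the data" idea is easy to sketch. Here's a toy Python example (my own illustration, not Hadoop's or GFS's actual API): each pretend storage node holds a partition of a large dataset, a small summarising function is shipped to where the data lives, and only the tiny per-node results travel back to be combined.

from dataclasses import dataclass

@dataclass
class Node:
    """A pretend storage node holding one partition of a large dataset."""
    partition: list

    def run(self, func):
        # The function travels to the data; only its (small) result returns.
        return func(self.partition)

def local_summary(values):
    """Runs where the data lives and returns a tiny summary, not the raw data."""
    return (len(values), sum(values))

def global_mean(nodes):
    """Combine the small per-node summaries into one answer on the client."""
    summaries = [node.run(local_summary) for node in nodes]
    total_count = sum(count for count, _ in summaries)
    total_sum = sum(total for _, total in summaries)
    return total_sum / total_count

if __name__ == "__main__":
    nodes = [Node([1.0, 2.0, 3.0]), Node([10.0, 20.0]), Node([5.0])]
    print(global_mean(nodes))  # 6.83..., computed without moving the raw partitions

The point of the pattern is that what crosses the network is the size of the function and its summaries, not the size of the dataset - which is exactly what matters when bandwidth isn't keeping up with Moore's Law.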

He mentioned two projects: Life Under Your Feet and Virtual Observatory. Both have huge data sets and rather cool user interfaces.
