Friday, May 02, 2008
While I'm sick of long tail blahs, I recently came across the idea that programming languages follow the same power laws found in other areas. This particular long tail should be encouraging for those who have a disdain for the current mainstream computer languages: "Rather than finding ways to create an even lower lowest common denominator, the Long Tail is about finding economically efficient ways to capitalize on the infinite diversity of taste and demand that has heretofore been overshadowed by mass markets."
Furthermore, "There is a long tail because the more specialized a language is to a domain, the better it fits to solve problems for that domain. These niche languages trade off generality for efficiency in a domain and they are simply better and more efficient tools for that domain."
Tuesday, July 03, 2007
When You Get What You Want But Not What You Need
I was at the eResearch 2007 conference last week and it was quite good, although I must say I'm very sick of "the long tail" and mashups being mentioned.
The keynotes were quite good and I'll mention three of them, but the others and the talks I went to were very good too. I wish I had written this up last week while it was fresher in my mind, so some of my recollections may be a little inaccurate.
David De Roure mentioned the reuse of workflows and that automation requires machine-processable descriptions. He also mentioned the CombeChem project and Semantic Grids. He made some interesting comments, such as that grids are inherently semantic grids, that a little semantics goes a long way (it's all about linked data), and that mashups are workflows. He also mentioned the very successful Taverna project.
Phil Bourne gave a scenario of someone taking the bus, reviewing a paper, contacting their friends because it contradicts their current results, and, by the end of the bus trip, having validated their approach and written a response to the author of the paper. He used the acronym IPOL (iPod plus laptop), but surely the iPhone would've been closer to the mark.
His main idea is that publications aren't enough: the experimental data also has to be saved, reviewed and made part of the academic process. As someone who runs one of the protein databases and is an editor of a PLoS journal, he has obviously seen the benefits of this approach. It also reminded me that ontologies in biology were cool before the Semantic Web came about (although most biology ontologies aren't very good (pdf)).
He mentioned the BioLit project, which tries to integrate the workflow of papers, figures, data and literature querying, creating a positive feedback loop for academic metadata. The idea of getting at the source data behind published graphs is a much-touted application of things like ontologies and RDF.
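As a rough illustration of that idea - my own sketch, not BioLit's actual schema, with hypothetical URIs and vocabulary, using the Apache Jena API - a figure in a paper can simply point back at the dataset it was derived from:

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.Resource;

    public class FigureProvenance {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();

            // Hypothetical vocabulary: a published figure points back to the raw data behind it.
            Property derivedFrom = model.createProperty("http://example.org/vocab#derivedFrom");
            Property describedBy = model.createProperty("http://example.org/vocab#describedBy");

            Resource figure = model.createResource("http://example.org/papers/smith2007#figure3");
            Resource dataset = model.createResource("http://example.org/data/experiment-42.csv");

            figure.addProperty(derivedFrom, dataset);
            dataset.addProperty(describedBy, model.createResource("http://example.org/protocols/assay-v2"));

            // Anyone (or any machine) reading the paper can now query for the data behind the graph.
            model.write(System.out, "TURTLE");
        }
    }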
The last thing he mentioned was creating podcasts for published papers - they should give an overview of a paper that's more in-depth than an abstract but more general than the entire paper. To achieve this they've set up the SciVee.tv site (still early days for that). That sounded quite interesting - I can imagine a video explaining the key figures and facts in a multimedia presentation would be very useful. I'm not sure, though, that most current researchers have those skills. If it proves a lot more useful, it may lead to a situation similar to the one we have now, where papers published before the 1980s or so don't get cited or read because they aren't online. Maybe people who are used to YouTube won't read papers that don't have an accompanying video, although the divide probably won't be as sharp as it is for pre-digital papers. It seems much more likely that papers without the original experimental data will increasingly be ignored.
The last keynote I'll talk about was by Alex Szalay. I appreciated this one the most, even though he did mention the long tail. He has previously written about the exponential increase in scientific data; he wrote that with Jim Gray, and he was one of the people who helped in the search for Gray (his blog has more information about that too). There's now computational x, where x is any science, including biology, physics and astronomy. One of the key effects of this much data is that the process of analysing data and then publishing your results is changing: it's becoming more a matter of publishing the data first, then doing the analysis and publishing that.
He mentioned four different places where the power law (long tail) occurs: projects (a few big ones, many small ones), data sizes (a few multi-petabyte sources, many more in the tens and hundreds of terabytes, and vastly more smaller ones), value-added or refereed products, and users of data (a few users use it a lot but the vast majority use it a little).
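As a toy illustration of what that last power law means in practice - my own numbers, not his, assuming a Zipf-like distribution of usage across a hypothetical 100,000 users:

    public class LongTailUsage {
        public static void main(String[] args) {
            int users = 100_000;       // hypothetical user population
            double exponent = 1.0;     // assumed Zipf exponent
            double total = 0, topOnePercent = 0;

            for (int rank = 1; rank <= users; rank++) {
                double usage = 1.0 / Math.pow(rank, exponent); // heavier use for higher-ranked users
                total += usage;
                if (rank <= users / 100) {
                    topOnePercent += usage;
                }
            }

            // With these assumptions the top 1% of users account for roughly 60% of all
            // usage, yet the remaining 99% - the long tail - still account for about 40%.
            System.out.printf("Top 1%%: %.1f%% of usage%n", 100 * topOnePercent / total);
            System.out.printf("Long tail: %.1f%% of usage%n", 100 * (1 - topOnePercent / total));
        }
    }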
The main thing I liked was his point that processing this data is fundamentally different from what it was before. It's too difficult to move the data around when it's petabytes - it's easier to move the processing to the data. It was also pointed out to me later that versioning the software that processed the data would add only a tiny fraction to the data kept, yet it is more often than not overlooked.
The data captured by CCDs has converged, or is about to converge, with that of the more traditional telescopes, and the data that is published and searchable now is only about two years behind the best possible results. For most astronomers it's actually better to observe the universe from the data than to use an actual telescope.
Processing, memory and CCDs are all following Moore's Law, but bandwidth is not. He mentioned an approach very much along the lines of Hadoop/GFS: the code moves to the data, not the other way around. He also listed things that are fairly well known now: there's no time to get it right from the top down, data processing and management become the key skills of the future, combining data from different sources is highly valuable, and "build it and they will come" is not enough - you must provide a decent interface.
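A minimal sketch of that moving-code-to-data style, assuming the Hadoop MapReduce Java API and an entirely made-up record format - the framework schedules these classes onto the nodes that already hold the data blocks, so only the small results travel over the network:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ObservationCount {

        // The mapper runs where the data lives: it sees one line of a (hypothetical)
        // comma-separated observation file and emits (object id, 1).
        public static class CountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split(",");
                context.write(new Text(fields[0]), ONE);
            }
        }

        // The reducer totals the counts per object id.
        public static class CountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text id, Iterable<LongWritable> counts, Context context)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable c : counts) {
                    sum += c.get();
                }
                context.write(id, new LongWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "observation count");
            job.setJarByClass(ObservationCount.class);
            job.setMapperClass(CountMapper.class);
            job.setReducerClass(CountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input already sitting in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // only the summary comes back out
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }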
He mentioned two projects: Life Under Your Feet and Virtual Observatory. Both have huge data sets and rather cool user interfaces.
Thursday, June 30, 2005
Semantic Web Fast, SOAP not that Slow and other links
* The Semantic Web In One Day "...syntactic aspects of data integration turned out to be tedious. Often, output from tool A can’t be used directly as input for tool B, although both have the same language capabilities. For example, both tools can handle RDF for input and output, but the resulting data is syntactically incompatible to the extent that the tools can’t communicate." Full article here. (There's a small sketch of this round-trip problem after the list below.)
* SOAP Performance Considered Really Rather Good points to a number of people studying the speed of SOAP. An interesting paper is "An Evaluation of Contemporary Commercial SOAP Implementations", which says that "[the gap between] SOAP and non-SOAP implementations continued to widen with .NET Remoting offering 280 msgs/sec at peak while most SOAP implementations were only handling from 30 to 60 msgs/sec. Even the leading Product A Document/Literal implementation only gave a maximum throughput of 67 msgs/sec. The two lowest performing RPC/Encoded implementations only handled 15 msgs/sec, the binary/TCP alternative." Not quite the "speed of light is the limiting factor".
* Secrets of the A-list bloggers: Technorati vs. Google "If Google favors indexing more popular sites more often, a clear opportunity for world-live-web search engines like Technorati would be in the long tail of less-often-indexed sites but Technorati seems to ignore that opportunity and concentrate on the top sites. What that will translate into is a direct reproduction of the power laws when it comes to indexing of blogs."
* A conversation with Jeff Nielsen about agile software development "I was particularly interested to hear about Jeff's use of FIT, Ward Cunningham's Framework for Integrated Test. This technique first appeared on my radar in an outtake from our 2003 story on test-driven development. A more recent development is Fitnesse, a Wiki that supports the use of FIT... pains me to say so but, according to Jeff, XML-oriented tools have so far failed to cut the mustard in this environment." XML is not agile!
* Managing Component Dependencies Using ClassLoaders "Java's class loading mechanism allows for more elegant solutions to this problem. One such solution is for each component's authors to specify the dependencies of their component inside of its JAR manifest."
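That last item lends itself to a small sketch - mine, not the article's code, with purely illustrative file and class names: read the Class-Path entry from a component's JAR manifest and hand the component a URLClassLoader limited to its declared dependencies.

    import java.io.File;
    import java.net.URL;
    import java.net.URLClassLoader;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.jar.Attributes;
    import java.util.jar.JarFile;
    import java.util.jar.Manifest;

    public class ComponentLoader {

        // Builds a class loader for one component from its JAR plus whatever its
        // manifest's Class-Path attribute declares (entries are space-separated,
        // relative to the JAR's directory).
        public static ClassLoader loaderFor(File componentJar) throws Exception {
            List<URL> urls = new ArrayList<URL>();
            urls.add(componentJar.toURI().toURL());

            try (JarFile jar = new JarFile(componentJar)) {
                Manifest manifest = jar.getManifest();
                String classPath = (manifest == null) ? null
                        : manifest.getMainAttributes().getValue(Attributes.Name.CLASS_PATH);
                if (classPath != null) {
                    for (String entry : classPath.trim().split("\\s+")) {
                        urls.add(new File(componentJar.getParentFile(), entry).toURI().toURL());
                    }
                }
            }

            // Standard parent-first delegation: the component sees the parent's classes,
            // its own JAR and its declared dependencies - and nothing else.
            return new URLClassLoader(urls.toArray(new URL[0]),
                    ComponentLoader.class.getClassLoader());
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical component JAR and entry point.
            ClassLoader loader = loaderFor(new File("components/reporting.jar"));
            Class<?> entryPoint = loader.loadClass("com.example.reporting.Main");
            System.out.println("Loaded " + entryPoint.getName() + " via " + loader);
        }
    }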
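And the round-trip problem from the "Semantic Web In One Day" item above, sketched with the Apache Jena API (modern org.apache.jena package names; the API of that era lived under com.hp.hpl.jena): the model is identical, but a consumer hard-wired to parse one syntax will reject the other.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;

    public class RdfRoundTrip {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            Resource output = model.createResource("http://example.org/toolA#output"); // hypothetical URI
            output.addProperty(model.createProperty("http://purl.org/dc/elements/1.1/", "title"),
                    "Result set from tool A");

            // Exactly the same triples, two different serializations.
            model.write(System.out, "RDF/XML");
            model.write(System.out, "N-TRIPLES");
        }
    }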
Saturday, March 12, 2005
Making Money from the Long Tail
The long tail of software. Millions of Markets of Dozens. "You know the real reason Excite went out of business? We couldn’t figure out how to make money from 97% of our traffic. We couldn’t figure out how to make money from the long tail – from those queries asked only once a day."
"57% of Amazon’s sales come from books you can’t even buy at a Barnes and Noble (to be fair, there is some skepticism around this number voiced here). This runs totally counter to the traditional 80/20 rule in retailing – that 80% of your sales come from 20% of your inventory. In Amazon’s case, 57% of their book revenue comes from 0% of Barnes and Noble’s inventory."
"iTunes has over one million songs in its catalog. You know how many have been bought at least once?
Every one."