WebFountain: IBM Buzzes the Web for Intelligent Applications
"WebFountain works in three stages: base mining, in which indexing and search technologies are used to systematically mine the Internet using focused crawling; an industry component, which requires industry-specific expertise to know the types of algorithms with high value (IBM plans to work with customers and consulting organizations to build these); and the application, which will be delivered to customers as an on-demand service."
"'For every megabyte of data we read in,' says Carlson, 'we create about 10 megabytes of metadata. Our value proposition is the metadata. We extract all of this stuff (nouns, locations, entities), then it goes into an industry process.' He says, 'We construct higher-level value from this information.'
"There have been a number of research breakthroughs that have allowed IBM to create the WebFountain infrastructure; the technical challenge was to get to the scale. It is an operation made up of a roughly 1,000-node Intel Linux cluster and half a petabyte of storage, according to Carlson. While a 'me-too solution' could be made by 'cobbling together about 30 or so companies in the marketplace,' according to Carlson, he doubts they could get past 10 million pages."
This pretty much confirms what I thought: there is more metadata than data, at a ratio of about 10:1.
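To get a feel for what 10:1 means at that scale, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that the half petabyte of storage holds nothing but crawled data plus its metadata, at exactly the quoted ratio; the article does not say how the storage is actually used. The function name `metadata_split` is my own.

```python
def metadata_split(total_bytes: float, ratio: float = 10.0) -> tuple[float, float]:
    """Split a storage total into (source data, metadata), given metadata:data = ratio:1."""
    data = total_bytes / (1.0 + ratio)
    return data, total_bytes - data


TB = 10**12
total = 500 * TB  # "half a petabyte", taken here as 500 TB in decimal units

# Assumption: all storage is either source data or metadata, nothing else.
data, metadata = metadata_split(total)
print(f"source data: {data / TB:.1f} TB, metadata: {metadata / TB:.1f} TB")
```

Under those assumptions, only about one eleventh of the storage (roughly 45 TB) would be the pages themselves, with the remaining 455 TB or so being metadata.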