Saturday, January 03, 2004

Analysis Engine

A Fountain of Knowledge: "...imagine a marketing researcher trying to find out the online attitude of consumers toward the popular rock singer Pink. The researcher would have to wade through an ocean of search results to sort out which Web pages were talking about Pink, the person, rather than pink, the color.

What such a researcher needs is not another search engine, but something beyond that—an analysis engine that can sniff out its own clues about a document’s meaning and then provide insight into what the search results mean in aggregate. And that’s just what IBM is about to deliver. In a few months, in partnership with Factiva, a New York City online news company, it will launch the first commercial test of WebFountain...Up to now this kind of aggregate analysis was possible only with so-called structured data, which is organized in such a way as to make its meaning clear. Originally, this required the data to be in some sort of rigidly organized database; if a field in a database is labeled “product color,” there is little chance that an entry reading “pink” refers to a musician."

"Although the pooled data is compressed to about one-third its original size to reduce storage demands, WebFountain still requires a whopping 160 terabytes plus of disk space. It uses a cluster of thirty 2.4-GHz Intel Xeon dual-processor computers running Linux to crawl as much of the general Web as it can find at least once a week."

"WebFountain’s builders admit it’s not always able to guess right, but they point out that humans can also be confused by ambiguous meanings."
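To make the Pink-versus-pink problem concrete, here is a toy sketch of guessing a word's sense from the words around it. This is not how WebFountain works internally (the article does not say); the cue-word lists and the function name `guess_pink_sense` are made up purely for illustration, and a tie is reported as "ambiguous" rather than forced into a guess.

```python
import re

# Hypothetical cue words, chosen only for illustration.
MUSICIAN_CUES = {"singer", "album", "concert", "tour", "song", "lyrics"}
COLOR_CUES = {"paint", "shade", "dress", "color", "flowers", "fabric"}


def guess_pink_sense(text):
    """Guess whether 'pink' in the text means the musician or the color.

    Counts how many distinct cue words from each list appear in the text
    and returns the sense with the higher count, or 'ambiguous' on a tie.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    musician_score = len(words & MUSICIAN_CUES)
    color_score = len(words & COLOR_CUES)
    if musician_score > color_score:
        return "musician"
    if color_score > musician_score:
        return "color"
    return "ambiguous"


if __name__ == "__main__":
    for sample in (
        "Pink's new album and tour dates",
        "She wore a pink dress",
        "I like pink",
    ):
        print(sample, "->", guess_pink_sense(sample))
```

The third sample has no cue words at all, so the sketch declines to guess, which matches the point in the quote above: some mentions are ambiguous even to a human reader.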

"Because the data has been converted from an unstructured format to a structured XML-based format, IBM and its partners can fall back on the data-mining experience and methodologies already developed for analyzing databases. The structured format also provides an easy target for developing new analytic tools."
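As a rough illustration of why a structured format is "an easy target" for analysis, the sketch below runs an aggregate count over a few annotated documents. The XML layout (`documents`, `doc`, `entity`, `sentiment`), the URLs, and the function name are all invented for this example; the article does not describe WebFountain's actual schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotated output: each page has already been tagged with the
# entity it mentions and an overall sentiment label.
ANNOTATED = """
<documents>
  <doc url="http://example.com/a">
    <entity type="musician">Pink</entity>
    <sentiment>positive</sentiment>
  </doc>
  <doc url="http://example.com/b">
    <entity type="color">pink</entity>
    <sentiment>neutral</sentiment>
  </doc>
  <doc url="http://example.com/c">
    <entity type="musician">Pink</entity>
    <sentiment>negative</sentiment>
  </doc>
  <doc url="http://example.com/d">
    <entity type="musician">Pink</entity>
    <sentiment>positive</sentiment>
  </doc>
</documents>
"""


def sentiment_by_entity_type(xml_text, entity_type):
    """Count sentiment labels across documents whose entity has the given type."""
    root = ET.fromstring(xml_text.strip())
    counts = Counter()
    for doc in root.findall("doc"):
        entity = doc.find("entity")
        if entity is not None and entity.get("type") == entity_type:
            counts[doc.findtext("sentiment", default="unknown")] += 1
    return counts


if __name__ == "__main__":
    print(sentiment_by_entity_type(ANNOTATED, "musician"))
```

Once the "musician" versus "color" distinction is stored as an attribute, the aggregate question from the opening quote reduces to an ordinary filter-and-count, the same kind of query used on a database field.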

"This, perhaps more than anything else, is why WebFountain looks like a winner. By creating an open commercial platform for content providers and data miners, it will foster rapid innovation and commercialization in the realm of machine understanding, currently dominated by isolated research projects."
