Wednesday, March 09, 2005

Using the Semantic Web to Handle Petascale Datasets

Scientific Data Management in the Coming Decade: "Increasingly, the datasets are so large, and the application programs are so complex, that it is much more economical to move the end-user’s programs to the data and only communicate questions and answers rather than moving the source data and its applications to the user’s local system."

"If the data is to be analyzed by generic tools, the tools need to “understand” the data. You cannot just present a bundle-of-bytes to a tool and expect the tool to intuit where the data values are and what they mean. The tool will want to know the metadata."

"In addition to standardization, computer-usable ontologies will help build the Semantic Web: applications will be semantically compatible beyond the mere syntactic compatibility that current-generation of Web services offer with type matching interfaces. However, it will take some time before high-performance general-purpose ontology engines will be available and integrated with data analysis tools...The XML integration in modern Database Management Systems (DBMS) opens the door for existing standards like RDF and OWL."

I disagree with this last point: existing DBMSs are not flexible enough to handle things like RDF. OWL is affected even more, because it needs the kind of indexing available from something like Kowari. Other systems, like Sesame and Jena, are both adding tree structures to store their RDF. Part of the reason is easier maintenance and setup, but performance matters too.
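To make the indexing point concrete, here is a minimal sketch of why a triple store wants several orderings of the same statements. It is a toy in-memory structure, not how Kowari, Sesame, or Jena actually store data; the class name `TripleIndex`, its methods, and the `ex:` identifiers are made up for illustration.

```python
from collections import defaultdict


class TripleIndex:
    """Toy triple store that keeps three orderings of the same statements."""

    def __init__(self):
        # subject -> predicate -> objects, and the two rotations of that.
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def match(self, s=None, p=None, o=None):
        """Return sorted triples matching the pattern; None is a wildcard."""
        if s is not None:
            # Subject is bound: start from the subject-first ordering.
            found = ((s, p2, o2)
                     for p2, objs in self.spo.get(s, {}).items() for o2 in objs)
        elif p is not None:
            # Predicate is bound: start from the predicate-first ordering.
            found = ((s2, p, o2)
                     for o2, subs in self.pos.get(p, {}).items() for s2 in subs)
        elif o is not None:
            # Only the object is bound: start from the object-first ordering.
            found = ((s2, p2, o)
                     for s2, preds in self.osp.get(o, {}).items() for p2 in preds)
        else:
            # Nothing is bound: full scan.
            found = ((s2, p2, o2)
                     for s2, po in self.spo.items()
                     for p2, objs in po.items() for o2 in objs)
        # Filter on whatever terms the chosen ordering did not already fix.
        return sorted(t for t in found
                      if (p is None or t[1] == p) and (o is None or t[2] == o))


store = TripleIndex()
store.add("ex:file1", "rdf:type", "ex:Dataset")
store.add("ex:file2", "rdf:type", "ex:Dataset")
store.add("ex:file1", "ex:derivedFrom", "ex:file2")
print(store.match(p="rdf:type", o="ex:Dataset"))
```

Whichever term of a pattern is bound, one of the orderings lets the lookup start from it instead of scanning every statement. A single generic relational table gives you that only if you build and maintain the equivalent indexes yourself.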

"As file systems grow to petabyte-scale archives with billions of files, the science community must create a synthesis of database systems and file systems. At a minimum, the file hierarchy will be replaced with a database that catalogs the attributes and lineage of each file. Set-oriented file processing will make file names increasingly irrelevant – analysis will be applied to “all data with these attributes” rather than working on a list of file/directory names or name patterns."

This is very similar to the TMex V2 architecture that I worked on.
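A rough sketch of the catalog idea from the quote, using SQLite from the Python standard library: files are registered with attributes and lineage, then selected as a set by attribute rather than by name. The schema, the function names, and the sample paths are all assumptions for illustration, and this is not the TMex V2 design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id INTEGER PRIMARY KEY, location TEXT NOT NULL);
CREATE TABLE attributes (file_id INTEGER REFERENCES files(id),
                         name TEXT, value TEXT);
CREATE TABLE lineage (child_id INTEGER REFERENCES files(id),
                      parent_id INTEGER REFERENCES files(id));
CREATE INDEX attr_lookup ON attributes (name, value);
""")


def register(location, attrs, parents=()):
    """Record a file, its attributes, and the files it was derived from."""
    file_id = conn.execute(
        "INSERT INTO files (location) VALUES (?)", (location,)).lastrowid
    conn.executemany("INSERT INTO attributes VALUES (?, ?, ?)",
                     [(file_id, k, v) for k, v in attrs.items()])
    conn.executemany("INSERT INTO lineage VALUES (?, ?)",
                     [(file_id, parent) for parent in parents])
    return file_id


def select_by_attributes(**wanted):
    """Return locations of files that carry every requested attribute value."""
    if not wanted:
        raise ValueError("give at least one attribute")
    clauses = " OR ".join("(a.name = ? AND a.value = ?)" for _ in wanted)
    params = [x for pair in wanted.items() for x in pair]
    rows = conn.execute(
        "SELECT f.location FROM files f "
        "JOIN attributes a ON a.file_id = f.id "
        f"WHERE {clauses} "
        "GROUP BY f.id HAVING COUNT(DISTINCT a.name) = ? "
        "ORDER BY f.location",
        params + [len(wanted)])
    return [row[0] for row in rows]


raw = register("/archive/0001.dat", {"instrument": "telescope-a", "band": "r"})
register("/archive/0002.dat",
         {"instrument": "telescope-a", "band": "r", "stage": "calibrated"},
         parents=[raw])
register("/archive/0003.dat", {"instrument": "telescope-b", "band": "r"})

print(select_by_attributes(instrument="telescope-a", band="r"))
```

The caller never mentions a file or directory name: the query is "all data with these attributes", and the lineage table records which file each result was derived from.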

Also, from Distributed Computing Economics: "Computing economics are changing. Today there is rough price parity between (1) one database access, (2) ten bytes of network traffic, (3) 100,000 instructions, (4) 10 bytes of disk storage, and (5) a megabyte of disk bandwidth."
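The parity list can be turned into a quick back-of-the-envelope check. The sketch below uses only the ratios from the quote (100,000 instructions cost about the same as ten bytes of network traffic); the function name and the example workload are made up, and the ratios are the 2005-era ones from the paper, not current prices.

```python
# Ratios from the quoted parity list: each of these costs one "unit".
INSTRUCTIONS_PER_UNIT = 100_000
NETWORK_BYTES_PER_UNIT = 10


def cost_units(instructions, network_bytes):
    """Return (compute cost, network cost) in units of one parity item."""
    return (instructions / INSTRUCTIONS_PER_UNIT,
            network_bytes / NETWORK_BYTES_PER_UNIT)


# Hypothetical job: scan 1 GB of data, spending 100 instructions per byte.
data_bytes = 10**9
compute, network = cost_units(instructions=100 * data_bytes,
                              network_bytes=data_bytes)
print(compute, network, network / compute)
```

Under these ratios the break-even point is 10,000 instructions per byte shipped. A job that does less work than that per byte costs more in network traffic than in computation, which is the economic argument for sending the program to the data and moving only questions and answers.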

Via Peta-scale Data Centers - a Trio of Great Jim Gray Papers (from: Kevin Schofield’s Weblog).
