More News: Lucene for the Semantic Web

Tuesday, April 10, 2007

Lucene for the Semantic Web

Google's Bigtable, a distributed storage system for structured data, is an effective mechanism for storing very large amounts of data in a distributed environment.

Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase will provide Bigtable-like capabilities on top of Hadoop.

Data is organized into tables, rows and columns, but a query language like SQL is not supported. Instead, an Iterator-like interface is available for scanning through a row range (and of course there is an ability to retrieve a column value for a specific key).
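To make the access pattern concrete, here is a minimal in-memory sketch of a table with sorted row keys, a point lookup, and an iterator over a row range. It is not the HBase API: `ToyTable`, `put`, `get`, and `scan` are hypothetical names chosen for illustration, and the half-open range convention is an assumption.

```python
import bisect


class ToyTable:
    """Hypothetical in-memory stand-in for a table with sorted row keys."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> {column name: value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key, column):
        """Point lookup: one column value for one row key (None if absent)."""
        return self._rows.get(row_key, {}).get(column)

    def scan(self, start_key, stop_key):
        """Yield (row_key, columns) for start_key <= row_key < stop_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


table = ToyTable()
table.put("cherry", "info:color", "red")
table.put("apple", "info:color", "green")
table.put("date", "info:color", "brown")
table.put("banana", "info:color", "yellow")

# Scan the rows whose keys fall in ["b", "d"), in key order.
scanned_keys = [key for key, _ in table.scan("b", "d")]
```

Because the keys are kept sorted, a range scan is two binary searches plus a walk over a contiguous slice, which is why this style of store can get by without a general query language.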

Any particular column may have multiple values for the same row key. A secondary key can be provided to select a particular value, or an Iterator can be set up to scan through the key-value pairs for that column given a specific row key.

From the Hbase/HbaseArchitecture page:

HBase uses a data model very similar to that of Bigtable. Users store data rows in labelled tables. A data row has a sortable key and an arbitrary number of columns. The table is stored sparsely, so that rows in the same table can have crazily-varying columns, if the user likes.

A column name has the form "<group>:<label>" where <group> and <label> can be any string you like. A single table enforces its set of <group>s (called "column groups"). You can only adjust this set of groups by performing administrative operations on the table. However, you can use new <label> strings at any write without preannouncing it. HBase stores column groups physically close on disk. So the items in a given column group should have roughly the same write/read behavior.
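The naming rule can be sketched as follows, assuming a column name splits at its first colon into a group part and a label part. `GroupedTable` and its methods are hypothetical and only model the rule described above (a fixed set of groups, free-form labels); they are not HBase's interface.

```python
class GroupedTable:
    """Hypothetical table that fixes its column groups at creation time."""

    def __init__(self, groups):
        self.groups = set(groups)   # changing this would be an admin operation
        self.rows = {}              # row key -> {"group:label": value}

    def put(self, row_key, column, value):
        group, sep, _label = column.partition(":")
        if not sep:
            raise ValueError("column name must contain a ':' separator")
        if group not in self.groups:
            raise KeyError("unknown column group: " + group)
        # Labels are not declared anywhere; any new label is accepted.
        self.rows.setdefault(row_key, {})[column] = value


t = GroupedTable(["anchor", "contents"])
t.put("row1", "anchor:home", "Home")
t.put("row2", "contents:html", "<html></html>")
t.put("row2", "anchor:brand-new-label", "x")   # new label, known group

try:
    t.put("row1", "mime:type", "text/html")    # group was never declared
    rejected = False
except KeyError:
    rejected = True
```

Note that `row1` and `row2` end up with different columns, which is the sparse, varying-columns behavior the quoted passage describes.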

Writes are row-locked only. You cannot lock multiple rows at once. All row-writes are atomic by default.
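One way to picture row-level atomicity is a lock per row: a write that touches several columns of one row is applied as a unit, while nothing coordinates writes across rows. The sketch below is an assumption about how such a scheme could look in a single process; `RowLockedTable` is a hypothetical name and says nothing about how HBase implements its locks.

```python
import threading
from collections import defaultdict


class RowLockedTable:
    """Hypothetical table where each row has its own lock."""

    def __init__(self):
        self._rows = {}
        self._locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()   # protects the lock table itself

    def _lock_for(self, row_key):
        with self._locks_guard:
            return self._locks[row_key]

    def write_row(self, row_key, updates):
        """Apply all column updates for one row while holding that row's lock."""
        with self._lock_for(row_key):
            row = dict(self._rows.get(row_key, {}))
            row.update(updates)
            self._rows[row_key] = row

    def read_row(self, row_key):
        with self._lock_for(row_key):
            return dict(self._rows.get(row_key, {}))


table = RowLockedTable()


def writer(n):
    # Both columns are written together, so readers never see them disagree.
    table.write_row("row1", {"stats:a": n, "stats:b": n})


threads = [threading.Thread(target=writer, args=(n,)) for n in range(8)]
for th in threads:
    th.start()
for th in threads:
    th.join()

final = table.read_row("row1")
```

Whichever writer finishes last wins, but the two columns always come from the same write. There is deliberately no method that takes more than one row key, mirroring the restriction that multiple rows cannot be locked at once.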

All updates to the database have an associated timestamp. HBase will store a configurable number of versions of a given cell. Clients can get data by asking for the "most recent value as of a certain time". Or, clients can fetch all available versions at once.
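The versioning behavior can be illustrated with a single cell that keeps a bounded, timestamp-ordered history. This is a sketch under assumptions: `VersionedCell`, `max_versions`, `get_as_of`, and `all_versions` are hypothetical names, and dropping the oldest version first is an assumed eviction policy.

```python
import bisect


class VersionedCell:
    """Hypothetical cell that keeps up to max_versions timestamped values."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._timestamps = []   # ascending order
        self._values = []       # parallel to _timestamps

    def put(self, timestamp, value):
        i = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(i, timestamp)
        self._values.insert(i, value)
        # Drop the oldest versions once the configured limit is exceeded.
        excess = len(self._timestamps) - self.max_versions
        if excess > 0:
            del self._timestamps[:excess]
            del self._values[:excess]

    def get_as_of(self, timestamp):
        """Most recent value written at or before timestamp, else None."""
        i = bisect.bisect_right(self._timestamps, timestamp)
        return self._values[i - 1] if i else None

    def all_versions(self):
        """All retained (timestamp, value) pairs, newest first."""
        return list(zip(reversed(self._timestamps), reversed(self._values)))


cell = VersionedCell(max_versions=3)
for ts, value in [(100, "v1"), (200, "v2"), (300, "v3"), (400, "v4")]:
    cell.put(ts, value)

as_of_250 = cell.get_as_of(250)   # "v2": newest version not later than 250
as_of_150 = cell.get_as_of(150)   # None: "v1" was evicted by the limit
history = cell.all_versions()
```

With a limit of three, the fourth write pushes out the oldest version, so a query for a time before the oldest retained timestamp finds nothing.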

The example tables given are very similar to untyped relations. HBase has only just become part of the nightly build.

Via Data Parallel.
