Wednesday, August 21, 2002

Libraries and the Semantic Web

"The Scorpion database consists of an unordered set of concepts that are useful for classifying documents. Each concept is defined in a database record. Following the familiar vector-space model of information retrieval (Salton and McGill 1988), documents are submitted as queries against a database. The query returns a list of database records that contain terms found in the document, ranked in order of their similarity to the document. The result can be interpreted as a prioritized list of concepts that roughly characterize the content of the document.

In our studies, we have created Scorpion databases from two library classification schemes: the Dewey Decimal Classification (DDC) and the Library of Congress Classification (LCC). We include a test database derived from a portion of the LCC in this installation and refer to it here to illustrate a simple Scorpion database design."
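To make the quoted description more concrete, here is a minimal sketch of vector-space ranking, where a document is used as the query against a set of concept records. It is not Scorpion's actual code. The concept records, the names `term_vector`, `cosine` and `rank_concepts`, and the choice of raw term counts with cosine similarity are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical concept records: label -> descriptive terms.
# Real Scorpion records are derived from DDC or LCC; these are made up.
CONCEPTS = {
    "computing": "computer science software programming",
    "libraries": "library science information cataloguing",
    "biology": "biology life organisms",
}

def term_vector(text):
    """Lowercase the text, split on whitespace, and count each term."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_concepts(document, concepts):
    """Treat the document as a query; return (score, label) pairs, best first.

    Concepts that share no terms with the document are left out.
    """
    doc_vec = term_vector(document)
    scored = [(cosine(doc_vec, term_vector(desc)), label)
              for label, desc in concepts.items()]
    return sorted((pair for pair in scored if pair[0] > 0), reverse=True)

if __name__ == "__main__":
    doc = "library software for cataloguing information"
    for score, label in rank_concepts(doc, CONCEPTS):
        print(f"{label}: {score:.2f}")
```

The scores are graded rather than all-or-nothing, so one document can be strongly about one concept and weakly about another at the same time.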

What you wouldn't use is something like WordNet, since the concepts must be distinct. Scorpion relies on a Pears database and the Gwen search engine. Pears is a database designed for storing hierarchically structured data. I think the Dewey Decimal system is fine when a book has to be put in one physical location, but if you could describe something as being 90% about one area and 50% about another, that would make for a much better search system. Networks can be flattened into hierarchies; you just need to pick a starting point.
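As a rough illustration of flattening a network into a hierarchy from a chosen starting point, the sketch below builds a tree by breadth-first search. The example network, the name `flatten_to_hierarchy` and the use of breadth-first search are my own assumptions; other traversals would give different but equally valid hierarchies.

```python
from collections import deque

# Hypothetical concept network: each concept maps to its related concepts.
NETWORK = {
    "science": {"computing", "biology"},
    "computing": {"science", "libraries", "biology"},
    "biology": {"science", "computing"},
    "libraries": {"computing"},
}

def flatten_to_hierarchy(graph, root):
    """Return a child -> parent map for the breadth-first tree rooted at root.

    The root maps to None. Neighbours are visited in sorted order so the
    result is deterministic.
    """
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return parent

if __name__ == "__main__":
    print(flatten_to_hierarchy(NETWORK, "science"))
    print(flatten_to_hierarchy(NETWORK, "libraries"))
```

Picking a different starting point gives a different hierarchy over the same network, which is the point: the hierarchy is a view of the network, not a replacement for it.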

It seems that libraries are crying out for metadata extraction and storage programs. Their current process is done strictly by hand or with poor desktop tools. The National Library of Australia even gets a mention: "The work at NLA developed a practical model for dealing with the immediate threat of disappearing digital objects, and established a workable distributed archive. Similarly, a number of projects and researches - such as OAIS (Open Archival Information System), CEDARS (CURL Exemplars in Digital Archives), NEDLIB (Networked European Deposit Library), and others - have investigated options for dealing with long-term preservation challenges." The New Zealand Digital Library understands how to build a digital library with tools that infer compositional hierarchies or extract the most relevant words and phrases using a modified Bayesian approach. I have both Managing Gigabytes and Data Mining.
