Monday, October 20, 2003

A Little Magic for Java

"JHOVE (pronounced "jove"), the JSTOR/Harvard Object Validation Environment, is a tool to automate the validation of file formats. Unlike less reliable approaches that rely on superficial indicators such as file extensions and MIME types, JHOVE uses format-specific modules to probe a file's internal structure. JHOVE's plug-in style architecture will allow the work of developing format modules to be shared. The JHOVE site will eventually include a tutorial on module writing, and a full explanation of the module interface."

It ships with modules for simple formats such as ASCII, byte stream, and UTF-8, as well as more interesting ones: TIFF and PDF.
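
To make the contrast with extension-based guessing concrete, here is a minimal sketch that looks at the first bytes of a file instead of its name. It is not JHOVE code: the class `SignatureSketch` and its methods are hypothetical names, and a real format module would go on to parse the whole structure rather than stop at the signature. The only facts it relies on are that PDF files begin with `%PDF-` and that TIFF files begin with `II` or `MM` followed by the number 42 in the matching byte order.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not part of JHOVE.
final class SignatureSketch {

    // Returns a rough format guess based on the leading bytes of a file.
    static String sniff(byte[] head) {
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "PDF";
        }
        // "II" + 42 stored little-endian
        if (startsWith(head, new byte[] {0x49, 0x49, 0x2A, 0x00})) {
            return "TIFF (little-endian)";
        }
        // "MM" + 42 stored big-endian
        if (startsWith(head, new byte[] {0x4D, 0x4D, 0x00, 0x2A})) {
            return "TIFF (big-endian)";
        }
        return "unknown";
    }

    // True if data begins with every byte of prefix, in order.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Pretend this content came from a file misleadingly named "report.txt".
        byte[] head = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println("report.txt looks like: " + sniff(head));
    }
}
```

A signature match only suggests a format. Deciding whether the file is actually well-formed still requires reading the rest of it, which is the job the format modules take on.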

It's actually much more feature-rich than Unix's magic (the signature database behind the file command). From the tutorial:
"Format validation conformance is determined at three levels: well-formedness, validity, and consistency.

1. A digital object is well-formed if it meets the purely syntactic requirements for its format
2. An object is valid if it is well-formed and it meets the higher-level semantic requirements for format validity
3. An object is consistent if it is valid and its internally extracted representation information is consistent with externally supplied representation information"
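
The three levels nest: each one presupposes the one before it. The sketch below shows that ordering in Java. The names (`ConformanceSketch`, `Level`, `assess`) are hypothetical and not taken from JHOVE, and the two boolean inputs stand in for the real syntactic and semantic checks a module would run. It also assumes that when no external information is supplied, consistency simply cannot be claimed.

```java
// Hypothetical illustration of the three nested conformance levels.
final class ConformanceSketch {

    enum Level { NOT_WELL_FORMED, WELL_FORMED, VALID, CONSISTENT }

    // syntaxOk / semanticsOk stand in for the results of real format checks.
    // extractedMime is what was found inside the object; suppliedMime is what
    // the caller claimed about it (may be null if nothing was supplied).
    static Level assess(boolean syntaxOk, boolean semanticsOk,
                        String extractedMime, String suppliedMime) {
        if (!syntaxOk) {
            return Level.NOT_WELL_FORMED;
        }
        if (!semanticsOk) {
            return Level.WELL_FORMED;
        }
        // Assumption: without external information, stop at VALID.
        if (suppliedMime == null || !suppliedMime.equals(extractedMime)) {
            return Level.VALID;
        }
        return Level.CONSISTENT;
    }

    public static void main(String[] args) {
        System.out.println(assess(true, true, "application/pdf", "application/pdf"));
        System.out.println(assess(true, true, "application/pdf", "image/tiff"));
        System.out.println(assess(true, false, "application/pdf", "application/pdf"));
    }
}
```

Comparing a single MIME type is a simplification: the tutorial speaks of representation information in general, so a fuller comparison would cover more than one field.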

"The set of characteristics reported by JHOVE about a digital object is known as the object's representation information...includes: file pathname or URI, last modification date, byte size, format, format version, MIME type, format profiles, and optionally, CRC32, MD5, and SHA-1 checksums [CRC32, MD5, SHA-1]."

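The three optional checksums are all available in the standard Java library, so a rough idea of how they might be computed takes only a few lines. This is a sketch under that assumption, not JHOVE's own code: `ChecksumSketch` and its methods are hypothetical names, and it hashes an in-memory byte array, whereas a tool handling large files would more likely stream the content.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

// Hypothetical illustration only; this is not part of JHOVE.
final class ChecksumSketch {

    // CRC32 of the data as 8 lowercase hex digits.
    static String crc32Hex(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return String.format("%08x", crc.getValue());
    }

    // Message digest ("MD5" or "SHA-1") of the data as lowercase hex.
    static String digestHex(String algorithm, byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance(algorithm).digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 and SHA-1 are required on every Java platform, so this is unexpected.
            throw new IllegalStateException("Missing digest algorithm: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println("CRC32: " + crc32Hex(data));
        System.out.println("MD5:   " + digestHex("MD5", data));
        System.out.println("SHA-1: " + digestHex("SHA-1", data));
    }
}
```
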
Output handlers are pluggable too, and include text and XML. JHOVE is part of the JSTOR project (More information).
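
As a rough picture of what pluggable output handlers could look like, the sketch below renders the same set of properties as plain text or as XML, chosen by name at run time. Everything here is hypothetical: the `Handler` interface, the handler classes, the registry, and the XML element names are invented for illustration and do not reflect JHOVE's actual interface or output schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not JHOVE's real handler interface.
final class HandlerSketch {

    interface Handler {
        String render(Map<String, String> info);
    }

    // One "name: value" line per property.
    static final class TextHandler implements Handler {
        public String render(Map<String, String> info) {
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<String, String> e : info.entrySet()) {
                sb.append(e.getKey()).append(": ").append(e.getValue()).append('\n');
            }
            return sb.toString();
        }
    }

    // One <property> element per property, with values escaped for XML.
    static final class XmlHandler implements Handler {
        public String render(Map<String, String> info) {
            StringBuilder sb = new StringBuilder("<info>\n");
            for (Map.Entry<String, String> e : info.entrySet()) {
                sb.append("  <property name=\"").append(escape(e.getKey())).append("\">")
                  .append(escape(e.getValue())).append("</property>\n");
            }
            return sb.append("</info>\n").toString();
        }
    }

    // Escapes the characters that are unsafe in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Registry of available handlers; adding a format means adding an entry.
    static final Map<String, Handler> HANDLERS = new LinkedHashMap<>();
    static {
        HANDLERS.put("text", new TextHandler());
        HANDLERS.put("xml", new XmlHandler());
    }

    public static void main(String[] args) {
        Map<String, String> info = new LinkedHashMap<>();
        info.put("format", "PDF");
        info.put("mimeType", "application/pdf");
        String name = args.length > 0 ? args[0] : "text";
        Handler handler = HANDLERS.getOrDefault(name, HANDLERS.get("text"));
        System.out.print(handler.render(info));
    }
}
```

Keeping rendering behind a small interface means the validation code never needs to know which output format was requested.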
