Aperture "...is a Java framework for extracting and querying full-text content and metadata from various information systems (e.g. file systems, web sites, mail boxes) and the file formats (e.g. documents, images) occurring in these systems."
Supports: plain text, HTML, XHTML, XML, PDF (Portable Document Format), RTF (Rich Text Format), Microsoft Office (Word, Excel, PowerPoint, Visio, Publisher), OpenOffice, OpenDocument, Corel (WordPerfect, Quattro, Presentations) and emails (.eml files). Check out the Extractor API and associated interfaces.
Combine the extracted content and metadata with datasets such as Wikipedia3 and others.
The BBC's open programme information project... including Jon Pertwee in FOAF.