"Nutch provides a transparent alternative to commercial web search engines. Only open source search results can be fully trusted to be without bias. (Or at least their bias is public.) All existing major search engines have proprietary ranking formulas, and will not explain why a given page ranks as it does. Additionally, some search engines determine which sites to index based on payments, rather than on the merits of the sites themselves. Nutch, on the other hand, has nothing to hide and no motive to bias its results or its crawler in any way other than to try to give each user the best results possible."
It probably just harvests the URLs; however, it seeds the crawler using DMOZ's RDF. According to The Inquirer, the Nutch people include Mitch Kapor, Tim O'Reilly, Peter Savich, Raymie Stata and Doug Cutting.
This led to a link posted by Danny Ayers about using Lucene as a triple store: "For example a triple (document) would be:
http://jakarta.apache.org -> title -> "A great Java developer's website"
This would be just one document in the index."
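To make the "one triple, one document" idea concrete, here is a minimal sketch in plain Python. It does not use Lucene at all; the class name TripleIndex, the field names subject, predicate and object, and the whitespace tokenizer are all assumptions made for illustration, standing in for whatever fields and analyzer a real Lucene index would use.

```python
from collections import defaultdict


class TripleIndex:
    """Toy inverted index (hypothetical, not Lucene) storing one triple per document."""

    def __init__(self):
        self.docs = []                     # doc id -> dict of the three fields
        self.postings = defaultdict(set)   # (field, term) -> set of doc ids

    def add(self, subject, predicate, obj):
        """Store one triple as one document and index its fields."""
        doc_id = len(self.docs)
        self.docs.append({"subject": subject, "predicate": predicate, "object": obj})
        # Subject and predicate are indexed whole, as exact-match keywords.
        self.postings[("subject", subject)].add(doc_id)
        self.postings[("predicate", predicate)].add(doc_id)
        # The object literal is split on whitespace and lowercased for text search.
        for token in obj.lower().split():
            self.postings[("object", token)].add(doc_id)
        return doc_id

    def search(self, **terms):
        """Return documents matching every field=term constraint (logical AND)."""
        result = None
        for field, term in terms.items():
            key = term.lower() if field == "object" else term
            ids = self.postings.get((field, key), set())
            result = ids if result is None else result & ids
        return [self.docs[i] for i in sorted(result or [])]


index = TripleIndex()
index.add("http://jakarta.apache.org", "title", "A great Java developer's website")
hits = index.search(predicate="title", object="java")
```

Here hits holds the single document for the Jakarta triple, so a text search on the object plus an exact match on the predicate gets you back to the subject.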
In TKS we present Lucene as a model, use our own store for the triples, and just do joins across the two models.