Showing posts with label life science. Show all posts

Tuesday, March 03, 2009

A Quick Survey of Bio-Ontologies

The other day I was trying to find a paper from sometime around the early 90s, way before OWL and the Semantic Web, that talks about the need for ontologies in biology. I couldn't find the paper I was thinking of, but here are some others that are pretty good and seem to follow at least the same themes:

"Ontologies for molecular biology": "Molecular biology has a communication problem. There are many databases using their own labels and categories for storing data objects and some using identical labels and categories but with a different meaning."


"Ontological Foundations for Biology Knowledge Models": This one was good because it talked about processes and transformations, which is where the rules and inferencing stuff comes in.

"Toward Principles for the Design of Ontologies Used for Knowledge Sharing": While not specifically about biology, this is probably the most cited paper, and it's what I often think about when I'm explaining ontologies and the process of improvement that occurs when you make one.

"Bio-ontologies: current trends and future directions": This covers the process and the parts that make a good ontology on the web. I guess the key reason to use the Web for ontologies is that it is a means to share knowledge. It also has a good history of ontologies going back to the 1600s.

And this one is one of the original GO papers, "Gene Ontology: tool for the unification of biology".

Adding Some Herbs to Open Databases

"Harnessing the Crowd to Make Better Drugs: Merck’s Friend Nails Down $5M to Propel New Open Source Era"

Friend, 54, is leaving his high-profile job as Merck’s senior vice president of cancer research, after having nailed down $5 million in anonymous donations to pursue this vision at a nonprofit organization getting started in Seattle called Sage.

Sage is built on the premise that vast networks of genes get perturbed, or thrown off-kilter, in complex diseases like cancer, diabetes, and obesity. Scientists can’t just pick one faulty gene or protein and make a magic bullet to shut it down. But what if researchers around the world capturing genomic profiles on patients could get all of their data to talk to each other through a free, open database? A researcher in Seattle looking at how all 35,000 genes in breast cancer patients are dialed on or off at a certain stage of illness might be able to make critical comparisons by stacking results up against a deeper and broader data pool that integrates clinical, genetic, and other molecular data from peers in, say, San Francisco, New Haven, CT, or anywhere else.

Some big names have signed on for the early incubating phase. Besides the full-time efforts of Friend and Schadt, the Sage board includes Nobel Laureate Lee Hartwell of the Fred Hutchinson Cancer Research Center; Paul Ramsey, dean of the School of Medicine at the University of Washington; Richard Lifton, the chairman of genetics at Yale University; and Hans Wigzell, director emeritus of Sweden’s Karolinska Institute. For insight into how to apply lessons from the open-source computing world, the board has brought on John Wilbanks, the vice president of science at the San Francisco-based Creative Commons.

As with any far-out vision, plenty of things can derail it along the way. What if researchers use different gene analysis machines, from Affymetrix, Illumina, or Applied Biosystems? How will Sage reconcile differences in how experiments are designed by different scientists? How will researchers be enticed to let go of their precious data, currently stored on password-protected hard drives and servers? How will Sage manage the intellectual property that arises from the database? Why would companies want to participate and run the risk of putting valuable proprietary data out in public? How will this get financed?

Some of these things Friend can answer, and some still need to be worked out. Software is already making it possible to manage differences between the various instruments scientists use, and deal with the differences in experimental design, Friend says.

Tuesday, September 30, 2008

More Data

Neurocommons has released its integrated RDF datasets. The release is composed of different modules or bundles, including MeSH, Medline, OBO and others.

Wednesday, August 08, 2007

Naming

Relatively recently, UniProt (a protein sequence database) announced they were moving away from LSIDs (which are URNs) to HTTP URIs. The cons of LSIDs seem to be outweighing the pros. More generally, there seems to be much discussion as to what is more appropriate and what can and cannot be done with URIs vs URNs. And even some of LSID's proponents are saying that using them at the moment is not a good idea.

An LSID has the form urn:lsid:ubio.org:namebank:11815, which can be resolved using http://lsid.tdwg.org/summary/urn:lsid:ubio.org:namebank:11815. A couple of IBM articles describe it in more detail: "Build a life sciences collaboration network with LSID" and "LSID best practices". Part of the problem seems to be that an LSID needs a resolving service, much like a web service, to return the data for a given LSID. A URI, on the other hand, can just use a bit of content negotiation to return either RDF data or human-readable HTML. It's not a well-known feature, but a web client can tell a web server what data it can accept. So a Semantic Web client would say "give me RDF" and a normal web client would say "give me HTML".
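To make the content negotiation idea concrete, here is a minimal Python sketch. It is an illustration under assumptions, not how UniProt or any particular server does it: the URI http://example.org/protein/P12345 is a made-up placeholder, `build_request` and `choose_representation` are hypothetical helper names, and the server-side function ignores the q-values and wildcards that real negotiation would have to handle. No request is actually sent.

```python
from urllib.request import Request

RDF_TYPE = "application/rdf+xml"
HTML_TYPE = "text/html"


def build_request(uri, want_rdf):
    """Build (but do not send) a GET request whose Accept header
    states which format the client would like back."""
    accept = RDF_TYPE if want_rdf else HTML_TYPE
    return Request(uri, headers={"Accept": accept})


def choose_representation(accept_header):
    """Toy server-side choice: 'rdf' if the client lists RDF/XML, else 'html'.

    Simplification (assumption): q-values and wildcards such as */* are ignored.
    """
    media_types = [
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
    ]
    return "rdf" if RDF_TYPE in media_types else "html"


# Hypothetical URI, used only for illustration.
uri = "http://example.org/protein/P12345"
semantic_web_client = build_request(uri, want_rdf=True)
browser = build_request(uri, want_rdf=False)

print(choose_representation(semantic_web_client.get_header("Accept")))  # rdf
print(choose_representation(browser.get_header("Accept")))              # html
```

The point is that the same URI serves both audiences; only the Accept header differs, so no separate resolving service is needed.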

An alternative is to use an existing standard, such as THTTP, which shows how to turn a URN into a URL by providing a REST-based service. For example, asking for a URL for the URN "urn:foo:12345-54321" becomes the HTTP request "GET /uri-res/N2L?urn:foo:12345-54321 HTTP/1.0". This is a bit like the bio2rdf.org approach of "http://bio2rdf.org/namespace:id". Having dereferenceable URIs is part of the Banff Manifesto.
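The two mappings above can be sketched in a few lines of Python. This is only a rough illustration: the function names are hypothetical, the URN is pasted into the query string unescaped to mirror the request line quoted above (a real client may need to percent-encode some characters), and the `uniprot` namespace and `P12345` identifier in the usage lines are placeholder values chosen for the example.

```python
def n2l_request_line(urn):
    """Return the THTTP-style N2L (URN to URL) request line for a URN."""
    if not urn.lower().startswith("urn:"):
        raise ValueError("not a URN: " + urn)
    # Assumption: the URN is inserted as-is, matching the example in the text.
    return "GET /uri-res/N2L?" + urn + " HTTP/1.0"


def namespace_id_uri(namespace, identifier):
    """Build a URI following the http://bio2rdf.org/namespace:id pattern."""
    return "http://bio2rdf.org/" + namespace + ":" + identifier


print(n2l_request_line("urn:foo:12345-54321"))
# GET /uri-res/N2L?urn:foo:12345-54321 HTTP/1.0

# Placeholder namespace and identifier, for illustration only.
print(namespace_id_uri("uniprot", "P12345"))
# http://bio2rdf.org/uniprot:P12345
```

In the first case the client still has to know which resolver to ask, while in the second the identifier is already a URL that can be fetched directly.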

Creating GUIDs is an interesting problem in a distributed environment. One of the other life science groups compared Handle, DOI, LSID and PURL (persistent URLs).

The mention of the Handle System brought back ideas from previous digital library work and from using URNs to name RDF graphs (which I later discovered wasn't entirely novel).