There has been a large increase in the number of people and organisations interested in extracting or capturing chemical information from the public domain. This is typified by the ongoing discussions between individuals and organisations - here’s a comment on this blog from Antony Williams (Chemspiderman), who has been working very hard to develop approaches towards Open Data (comment to Open Data in Science):
I’m in the middle of curating all chemical structures on Wikipedia. I spent a couple of hours discussing it with Martin Walker last night. The process involves a lot of manual work…I’m at over 150 hours right now. There are issues with chemical names not matching the structure diagrams (people can use nomenclature very poorly!) so this will be an ongoing issue for ANYBODY using name to structure conversion structure. However, there are many names agreeing with the chemical structure. Have you thought about applying OSCAR to Wikipedia to generate a real structure file? You can then add that into the WWMM and hook up to Wikipedia. If you wait a while I’ll have one done and will hopefully be able to get Wikipedia to accept InChIKeys on the structures directly and therefore make Wikipedia searchable by InChIKey. I’ll log about this soon but have other deadlines in the way at present. I have just co-authored a book chapter on name to structure conversion and talked about OSCAR-3 but couldn’t comment too much on capabilities. I can add it in in proofing. Here are 10 names of structures on Wikipedia … they are correct for the structures. You commented “If the names can be interpreted or looked up then OSCAR does a good job.” How well does OSCAR do on this set of 10? If you want to post the InChI strings I’ll check the structures and let you know…
We are very grateful for this work. We are also doing similar things and we’d be delighted to coordinate - I have also been mailing Martin and have booked an IRC session with him and WP-CHEM colleagues asap.
As Antony says, there is a lot of hard work. The good news about social computing - of the sort he and we have been fostering - is that in principle it can scale. The bad news is that technically hard projects are difficult to run this way - and this is a technically hard project. The reason is that it is not about certainty - what is the formula of “snow”? - but requires evaluation of competing assertions (X says the formula of A is C30H22O5; Y says the formula of A is C32H24O5).
There is an awful lot of grunt work. First we have to get the data. For Wikipedia this has been done manually, but I am looking at whether data can be extracted from other sources and fed in automatically. There are at least 1000 common compounds that “should” be in WP. Then there’s the problem of rights - I think we are getting to the stage where the resistance to mining data from chemistry text will weaken. Then we have to deal with the syntax. PDF is still a major hurdle. Can we use images? (I’ll post about that later.) My work over the holiday has shown that extraction from web pages is still fragile, but we can get a lot. (For example, does anyone have a parser for ALL inline chemical formulae - e.g. C(CH3)2=(CH2)2COC(CH3)2Cl.2H2O? JUMBO does a so-so job. If anyone can do better that would be very useful.)
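To show what even the easy part of that parsing problem looks like, here is a minimal sketch in Python that only counts atoms in an inline formula. It is not JUMBO's method and certainly not a parser for ALL formulae. The function names (`count_fragment`, `count_atoms`) are made up for illustration, and the sketch assumes that bond symbols such as "=" carry no atoms, that "." separates components with an optional leading multiplier, and that the hydrate at the end of the example is water (H2O). It does not check that symbols are real elements and ignores charges, isotopes and square brackets.

```python
import re
from collections import Counter

# Element symbol, integer count, or a parenthesis.
TOKEN = re.compile(r"[A-Z][a-z]?|\d+|[()]")
# Bond symbols are dropped: they contribute no atoms (an assumption of this sketch).
BONDS = re.compile(r"[=#-]")


def count_fragment(fragment):
    """Count atoms in one dot-free fragment such as 'C(CH3)2Cl'."""
    text = BONDS.sub("", fragment)
    tokens = TOKEN.findall(text)
    if "".join(tokens) != text:
        raise ValueError(f"cannot tokenise {fragment!r}")
    stack = [Counter()]  # one Counter per open parenthesis level
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            stack.append(Counter())
            i += 1
            continue
        if tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            unit = stack.pop()
        elif tok.isdigit():
            raise ValueError("count with nothing before it")
        else:
            unit = Counter({tok: 1})
        i += 1
        # An optional count applies to the atom or group just read.
        mult = 1
        if i < len(tokens) and tokens[i].isdigit():
            mult = int(tokens[i])
            i += 1
        for element, n in unit.items():
            stack[-1][element] += n * mult
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


def count_atoms(formula):
    """Sum atom counts over dot-separated components, e.g. 'X.2H2O'."""
    total = Counter()
    for part in formula.split("."):
        match = re.match(r"(\d*)(.*)", part)
        mult = int(match.group(1) or 1)
        for element, n in count_fragment(match.group(2)).items():
            total[element] += n * mult
    return total


print(dict(count_atoms("C(CH3)2=(CH2)2COC(CH3)2Cl.2H2O")))
# {'C': 9, 'H': 20, 'O': 3, 'Cl': 1}
```

Even this toy shows why the real problem is hard: it happily accepts nonsense symbols, and a single mistyped character (a zero for the letter O, say) silently changes the answer rather than raising an error.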
Then, when the data have all been extracted, those from different sources can be compared. This often shows real errors. In this case we absolutely need reasoning tools like RDF. They will highlight inconsistencies of the type above. But they can’t resolve them. Can we develop heuristics including probability? Recommender systems, perhaps - A has fewer inconsistencies per entry than B, so we weight A higher.
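The weighting idea can be sketched in a few lines of plain Python. This is not RDF reasoning, just a stand-in to make the heuristic concrete. The sources X, Y, Z, the compounds A and B, and all the formulae are hypothetical, and the scoring rule is my own assumption: for each entry, take the fraction of the other sources that disagree with it, average that over a source's entries, and use one minus the average as the weight. Formulae are compared as raw strings, so a real system would need to normalise them first.

```python
from collections import defaultdict

# Hypothetical assertions: (source, compound, formula). Not real data.
ASSERTIONS = [
    ("X", "A", "C30H22O5"),
    ("Y", "A", "C32H24O5"),
    ("Z", "A", "C30H22O5"),
    ("X", "B", "C6H6"),
    ("Y", "B", "C6H6"),
    ("Z", "B", "C6H6"),
]


def find_conflicts(assertions):
    """Return compounds for which sources assert more than one formula."""
    by_compound = defaultdict(lambda: defaultdict(set))
    for source, compound, formula in assertions:
        by_compound[compound][formula].add(source)
    return {
        compound: {f: sorted(srcs) for f, srcs in formulae.items()}
        for compound, formulae in by_compound.items()
        if len(formulae) > 1
    }


def source_weights(assertions):
    """Weight each source by 1 minus its mean disagreement with other sources."""
    by_compound = defaultdict(dict)
    for source, compound, formula in assertions:
        by_compound[compound][source] = formula
    disagreement = defaultdict(list)
    for claims in by_compound.values():
        for source, formula in claims.items():
            others = [f for s, f in claims.items() if s != source]
            if others:  # an entry nobody else covers tells us nothing
                differing = sum(f != formula for f in others)
                disagreement[source].append(differing / len(others))
    return {s: 1 - sum(rates) / len(rates) for s, rates in disagreement.items()}


print(find_conflicts(ASSERTIONS))
# {'A': {'C30H22O5': ['X', 'Z'], 'C32H24O5': ['Y']}}
print(source_weights(ASSERTIONS))
# {'X': 0.75, 'Y': 0.5, 'Z': 0.75}
```

Note what this does and doesn’t do: it flags that A is in dispute and ranks Y below X and Z, but it cannot say which formula is right. A majority of sources copying the same error would still win.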
I shall respond to the technical questions on images and names in separate posts. I want to make it very clear that this is research - not a production system. There may be cases where precision and recall run at 20%. This is not a failure, it’s a starting point. Some of this is skunkworks - and I am reluctant to involve the community in skunkworks. It takes time before we can reasonably let development code loose on SourceForge.
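For anyone not used to the two measures: precision is the fraction of what we extracted that is correct, and recall is the fraction of what was there to find that we actually found. A minimal sketch, with made-up compound names and a made-up gold standard purely for illustration:

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for a set of extracted items against a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical run: five names extracted, only one of them in the gold set of five.
extracted = ["benzene", "toluenne", "ethnol", "acetonne", "phenoll"]
gold = ["benzene", "toluene", "ethanol", "acetone", "phenol"]
print(precision_recall(extracted, gold))
# (0.2, 0.2)
```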
Part of the point is to encourage authors and publishers to deposit semantic data as well as text. If all papers had InChIs (with compound numbers) then we probably wouldn’t have to extract stuff from images. (There are still compounds which can only be represented graphically.) Similarly, if all chemists published NMR spectra with molecular structures and assignments then we wouldn’t have to do any of this. All this is technically possible already.
There are many areas where the community can help. Chemical nomenclature is one. Part of the reason for OPSIN’s low recall is that the compounds aren’t in its vocabulary. Adding them is not part of our research. But it’s relatively easy for anyone to add these - and once added they are done. I’m guessing that we could double the precision/recall (PR) by this method. But I’ll comment on this in detail later.
So I think we shall see a valuable increase in distributed Open chemical information projects this year. It’s difficult to get funding - but not impossible - and we are hopeful. One important activity is workshops.
More later (including comments on OPSIN, OSRA, etc.).