Wikipedia is a highly useful source of chemistry and also of chemical nomenclature. A limitation of chemical name-to-structure software, such as OPSIN, is that trivial names that resemble systematic names may be misinterpreted if the program has never encountered them. The nature of Wikipedia means that the most important chemicals, and hence the most prevalent trivial names, are included. So surely Wikipedia would be a great resource for finding name/InChI relationships where the name-to-structure software is at fault?
I used Matthew Gamble’s code for extracting chemboxes as RDF to quickly grab the contents of all the current chemboxes. From the output of this tool it was simple to get the name/InChI pairs. As I was interested in trivial names, I used the title of each Wikipedia page as the input for name-to-structure conversion.
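The comparison described above can be sketched roughly as follows. This is only an illustration, not the code actually used: `find_mismatches` and `stub_converter` are hypothetical names, the converter is a stand-in for a real name-to-structure tool such as OPSIN, and the second record is a made-up bad entry included to show a flagged case.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# (page title, InChI taken from the chembox)
Pair = Tuple[str, str]


def find_mismatches(
    pairs: Iterable[Pair],
    name_to_inchi: Callable[[str], Optional[str]],
) -> List[Tuple[str, str, str]]:
    """Return (title, chembox InChI, generated InChI) where the two differ."""
    flagged = []
    for title, chembox_inchi in pairs:
        generated = name_to_inchi(title)
        if generated is None:
            # The name could not be interpreted at all; nothing to compare.
            continue
        if generated != chembox_inchi:
            flagged.append((title, chembox_inchi, generated))
    return flagged


# Stand-in converter (assumption): a real run would call a
# name-to-structure program here instead of a lookup table.
KNOWN = {
    "Methanol": "InChI=1S/CH4O/c1-2/h2H,1H3",
    "Ethanol": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
}


def stub_converter(name: str) -> Optional[str]:
    return KNOWN.get(name)


pairs = [
    ("Methanol", "InChI=1S/CH4O/c1-2/h2H,1H3"),
    # Hypothetical bad record: an ethanol page carrying methanol's InChI.
    ("Ethanol", "InChI=1S/CH4O/c1-2/h2H,1H3"),
    ("Some unparsable trivial name", "InChI=1S/H2O/h1H2"),
]

flagged = find_mismatches(pairs, stub_converter)
```

A plain string comparison is enough here only because both sides are standard InChIs; names that the converter cannot interpret are skipped rather than flagged.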
430 cases were flagged for a range of reasons: ring/chain tautomerism, intentionally underspecified names, under- or over-specification of stereochemistry and, of course, the type of error I was expecting. However, there were also a significant number of cases where the InChI clearly described a different compound. For the records I’ve corrected so far, the root cause appeared to be an incorrect reference to ChemSpider, which then allows script-assisted updates to pull in inappropriate InChIs/SMILES.
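One way to triage flagged pairs into rough groups like those above is to compare the two InChIs layer by layer. The sketch below is an assumption about how that could be done, not what was done for this post: `inchi_layers` and `classify_difference` are hypothetical helpers, only the main layers of a standard InChI are inspected, and anything from the isotopic layer onwards is ignored.

```python
from typing import Dict


def inchi_layers(inchi: str) -> Dict[str, str]:
    """Split a standard InChI into its formula and prefixed main layers."""
    parts = inchi.split("/")
    layers = {"formula": parts[1] if len(parts) > 1 else ""}
    for part in parts[2:]:
        if not part:
            continue
        if part.startswith("i"):
            # Simplification: stop at the isotopic layer, since the layers
            # after it may repeat prefixes already seen.
            break
        layers[part[0]] = part[1:]
    return layers


def classify_difference(inchi_a: str, inchi_b: str) -> str:
    """Name the first main layer in which two InChIs differ."""
    a, b = inchi_layers(inchi_a), inchi_layers(inchi_b)
    if a == b:
        return "identical"  # identical in the main layers inspected here
    if a["formula"] != b["formula"]:
        return "formula"
    if a.get("c") != b.get("c"):
        return "connectivity"
    if a.get("h") != b.get("h"):
        return "hydrogens"
    if any(a.get(key) != b.get(key) for key in "btms"):
        return "stereo"
    return "other"  # e.g. charge or protonation layers


L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
DIMETHYL_ETHER = "InChI=1S/C2H6O/c1-3-2/h1-2H3"
METHANOL = "InChI=1S/CH4O/c1-2/h2H,1H3"

stereo_case = classify_difference(L_ALANINE, ALANINE)
isomer_case = classify_difference(ETHANOL, DIMETHYL_ETHER)
formula_case = classify_difference(ETHANOL, METHANOL)
```

A "stereo" result points at under- or over-specified stereochemistry, while a "formula" result is a strong hint that the record describes a different compound. A "connectivity" result is ambiguous: it covers both ring/chain tautomers and genuinely different isomers, so those cases still need a manual look.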
Example of a previously incorrect page.
Increasing the precision of identification of these incorrect name/structural identifier pairs should be possible if the IUPAC names were used as input…
Well done! And nice catch on that MW parsing library! I did not know about that one 🙂
Correct me if I’m wrong but doesn’t the “verified” mean it’s manually checked… so someone incorrectly verified these?
Unfortunately, in this case I think it only means that the identifiers have been verified against ChemSpider… which doesn’t help if the page is linked to the wrong ChemSpider record.
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Chemicals/Chembox_validation
Okay, that makes more sense. I was wondering how someone could put such an obviously wrong InChI on an entry (although I guess somewhere down the line it must have happened).
Nice post – thanks Daniel.