We recently got back from the BioCreative V meeting, where we participated in two tracks: one was to extract chemicals, genes and proteins from patent abstracts, while the other was to identify diseases and the causal relationships between chemicals and diseases.
As the latter task required normalizing disease mentions to concepts (in this case MeSH IDs), we used Wikipedia to significantly improve our coverage of disease terms and of how they link to MeSH. Because redirects in Wikipedia are intentionally designed to help people find the appropriate page, they are a great source of common names for diseases, as well as of terms that imply a disease state, e.g. diabetic.
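To make the redirect idea concrete, here is a minimal sketch of how redirect titles could be folded into a term-to-MeSH dictionary. It is not our production code: the redirect pairs, the page-to-MeSH mapping and the function names `build_dictionary` and `normalize` are all illustrative assumptions, and a real system would read redirects from a Wikipedia dump and filter out the noisy ones.

```python
# Illustrative (redirect title, target page) pairs; a real system would
# extract these from a Wikipedia dump rather than hard-code them.
REDIRECTS = [
    ("Heart attack", "Myocardial infarction"),
    ("Diabetic", "Diabetes mellitus"),
    ("Sugar diabetes", "Diabetes mellitus"),
]

# Illustrative mapping from Wikipedia page titles to MeSH IDs (assumed to be
# available from some other source, e.g. a curated cross-reference).
PAGE_TO_MESH = {
    "Myocardial infarction": "D009203",
    "Diabetes mellitus": "D003920",
}


def build_dictionary(redirects, page_to_mesh):
    """Map lower-cased page titles and redirect titles to MeSH IDs."""
    term_to_mesh = {}
    # The page titles themselves are terms.
    for page, mesh_id in page_to_mesh.items():
        term_to_mesh[page.lower()] = mesh_id
    # Each redirect inherits the MeSH ID of the page it points to.
    for alias, target in redirects:
        mesh_id = page_to_mesh.get(target)
        if mesh_id is not None:
            # setdefault keeps an existing entry if the alias is already known.
            term_to_mesh.setdefault(alias.lower(), mesh_id)
    return term_to_mesh


def normalize(mention, term_to_mesh):
    """Return the MeSH ID for a mention, or None if the term is unknown."""
    return term_to_mesh.get(mention.strip().lower())


DICTIONARY = build_dictionary(REDIRECTS, PAGE_TO_MESH)
```

With this toy dictionary, `normalize("diabetic", DICTIONARY)` resolves to the same MeSH ID as the Diabetes mellitus page, which is the kind of coverage gain redirects provide.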
Sixteen teams participated in the task, and our use of Wikipedia allowed us to achieve the highest recall (86.2%). Unfortunately this recall came at some expense to precision, partly due to genuine mistakes in our Wikipedia dictionary and partly because terms that are not directly in MeSH are more likely to be left unannotated, or attributed to different MeSH IDs, in the gold standard. On F1-score our solution was ranked 2nd, marginally (0.34%) behind the winning entrant due to the lower precision, which we are already working to improve.
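For readers less familiar with these metrics, the trade-off can be sketched as follows. This is a generic illustration rather than the official BioCreative evaluation script: the function name `precision_recall_f1` and the representation of annotations as (document ID, MeSH ID) pairs are assumptions made for the example.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 over two collections of annotations.

    Annotations are assumed to be hashable, e.g. (document ID, MeSH ID) pairs.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A larger dictionary tends to add both correct and incorrect predictions: the correct ones raise recall, while the incorrect ones enlarge the predicted set and so lower precision, and F1 rewards a balance of the two.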
We also spent a couple of weeks writing a simple pattern-based system for identifying chemical-induced disease relationships. This performed surprisingly well compared to the numerous machine learning solutions (18 teams participated), with only two solutions producing better results. On closer inspection, while our system relied solely on the text given to it, both of those solutions used knowledge bases of known chemical-disease relationships as features in their models!
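To give a flavour of what a pattern-based approach looks like, here is a deliberately tiny sketch. The two patterns and the function name `find_cid_relations` are assumptions chosen for illustration and are not the patterns our system uses; the sketch also assumes the chemical and disease mentions in the sentence have already been recognized.

```python
import re


def find_cid_relations(sentence, chemicals, diseases):
    """Return (chemical, disease) pairs linked by a simple textual pattern.

    `chemicals` and `diseases` are the mentions already found in the sentence.
    """
    relations = set()
    for chem in chemicals:
        for dis in diseases:
            c, d = re.escape(chem), re.escape(dis)
            patterns = [
                # e.g. "cisplatin-induced nephrotoxicity"
                rf"\b{c}[- ]induced\s+{d}\b",
                # e.g. "hepatitis caused by isoniazid"
                rf"\b{d}\s+(?:induced|caused)\s+by\s+{c}\b",
            ]
            if any(re.search(p, sentence, flags=re.IGNORECASE) for p in patterns):
                relations.add((chem, dis))
    return relations
```

Note that a sentence which merely mentions a chemical and a disease together, without one of the patterns, yields no relation; this is what distinguishes the approach from simple co-occurrence.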
You can find out more in the presentation we gave at BioCreative V (bottom of this post) and our two workshop proceedings papers (here and here). Because of the orders-of-magnitude speed difference between our solution and many of the others, we started our presentation by discussing speed before getting to the science of how we use Wikipedia terms.