Patents, such as those freely available from the US patent office, are a rich source of bioactivity data. One argument for favoring these data over data extracted from the academic literature is timeliness: a recent publication by Stefan Senger suggests that compound-target interaction pairs appear in the patent literature on average four years before they appear in the academic literature.
However, another argument is simply the quantity of data. Daniel has been working on the general problem of extracting data from tables in patents, a certain proportion of which contain bioactivity data. The following graph shows the amount of bioactivity data per (publication) year in ChEMBL versus that extracted by LeadMine from US patents. Note that, for the purposes of this comparison, the ChEMBL data excludes data extracted from patents by BindingDB.
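For readers who want to reproduce the ChEMBL side of such a comparison, here is a minimal sketch. It assumes a local ChEMBL SQLite dump (the chembl.db path is a placeholder) and that BindingDB-deposited records can be identified by matching on src_description in the SOURCE table; both details are assumptions on my part rather than anything stated in the post.

```python
# Sketch: tally ChEMBL bioactivity measurements per publication year,
# excluding sources whose description mentions BindingDB.
import sqlite3

conn = sqlite3.connect("chembl.db")  # placeholder path to a ChEMBL SQLite dump

query = """
    SELECT d.year, COUNT(*) AS n_activities
    FROM activities a
    JOIN docs d   ON a.doc_id = d.doc_id
    JOIN source s ON a.src_id = s.src_id
    WHERE s.src_description NOT LIKE '%BindingDB%'
      AND d.year IS NOT NULL
    GROUP BY d.year
    ORDER BY d.year
"""
for year, n_activities in conn.execute(query):
    print(year, n_activities)
conn.close()
```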
The rise in the amount of patent data is due to an increase both in the size of patents and in their number. If the trend continues, patents will become an increasingly important source of bioactivity data.
Daniel presented the details of the text-mining procedure at the recent ACS meeting in San Francisco. The talk below also includes a comparison between the data extracted by LeadMine and that extracted manually by BindingDB. If you’re interested in seeing a poster on the topic, Daniel will be presenting one at UK-QSAR this Wednesday.
It would be interesting to see a comparison of quantity versus quality for both data sources.
Great post and slide deck.
Chris,
Agreed. Although no source will be perfect, I would guess that the biological data is actually better in patent applications than in papers, primarily because most of the data in patents comes from industry, which does a much better job of quality control than academia. In addition, most companies extensively check all data that goes into patents to ensure protection in case of a future legal action.
Great post