Chemistry Enabling Chinese, Japanese and Korean Patents

Last week I presented a poster at the EPO’s East Meets West conference. This conference focuses on the current state of the patent systems in Asia and what can be achieved in the future.

The poster covers improvements in our chemical name translation software, which now supports Korean in addition to Chinese and Japanese. For Korean patents we show that large amounts of chemical structure information can be extracted, a significant proportion of which is either absent from US patents or appears earlier in the Korean publication.

Take a look here!

If this is an area of interest to you, feel free to get in touch with us.

Big Data, Big Bad Data

Chemical data is increasing exponentially and soon we will be engulfed in an unmanageable torrent of molecules that threatens to overwhelm our computers and our very sanity. Or maybe not.

At yesterday’s RSC CICAG meeting in London, “From Big Data to Chemical Information”, I presented a talk entitled:
100 million compounds, 100K protein structures, 2 million reactions, 4 million journal articles, 20 million patents and 15 billion substructures: Is 20TB really Big Data?

According to Wikipedia, “Big Data” refers to data sets so large or complex that traditional data processing applications are inadequate. Turned on its head, this definition implies that without sufficiently efficient algorithms and tools, essentially any dataset could be Big Data. This statement is not entirely facetious: many cheminformatics algorithms work fine for tens of thousands of molecular structures but cannot handle a PubChem-sized dataset in a reasonable length of time.
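To make the scaling point concrete, here is a back-of-envelope sketch in Python. The per-molecule and per-pair timings are purely hypothetical round numbers rather than measurements of any particular tool, and the function names are made up for illustration.

```python
# Back-of-envelope scaling estimate. All timings used here are hypothetical.
SECONDS_PER_DAY = 86_400


def linear_days(n_molecules: int, seconds_per_molecule: float) -> float:
    """Days needed for one pass over the dataset (e.g. a single query)."""
    return n_molecules * seconds_per_molecule / SECONDS_PER_DAY


def all_pairs_days(n_molecules: int, seconds_per_pair: float) -> float:
    """Days needed to compare every molecule with every other one."""
    pairs = n_molecules * (n_molecules - 1) // 2  # n choose 2
    return pairs * seconds_per_pair / SECONDS_PER_DAY


if __name__ == "__main__":
    # Tens of thousands of molecules versus a PubChem-sized collection.
    for n in (50_000, 100_000_000):
        print(f"{n:>11,} molecules: "
              f"one pass {linear_days(n, 1e-3):.3f} days, "
              f"all pairs {all_pairs_days(n, 1e-6):.3f} days")
```

With these made-up timings, a single pass over 100 million molecules at one millisecond each takes a little over a day. An all-pairs comparison at one microsecond per pair goes from about 20 minutes for 50,000 molecules to more than 150 years for 100 million, which is why the algorithm matters more than the raw size of the data.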

I discussed the approach used by NextMove Software to develop high-performance tools to tackle a variety of cheminformatics problems on large datasets. These problems include calculating the maximum common subgraph, substructure searching, identifying matched series in activity datasets, canonicalising protein structures, and extracting and naming reactions from patents and the literature. In particular, I described for the first time improvements in substructure searching (200 times faster than the state-of-the-art) and canonicalisation (up to 200 times faster than the state-of-the-art) developed in the last few months by Roger and John. More details to follow…

Upcoming webinar on matched series

[Update 23/04/2015 – A recording of this webinar as well as the slides are now available online.]

Next Tuesday (14 April) I’ll be presenting a webinar entitled “Beyond Matched Pairs: Applying Matsy to predict new optimisation strategies”. The webinar will be hosted by our collaborators at Optibrium who have incorporated the Matsy algorithm into StarDrop (see earlier post). Here’s the abstract:

Join our webinar with Noel O’Boyle of NextMove Software to learn how matched series analysis can predict new chemical substitutions that are most likely to improve target activity for your projects.

The Matsy™ algorithm for matched molecular series analysis grew out of a collaboration with computational chemists at AstraZeneca with the goal of supporting lead optimisation projects. Specifically, it was designed to answer the question, “What compound should I make next?”.

Matsy has been developed to generate and search in-house or public domain databases of matched molecular series to identify chemical substitutions that are most likely to improve target activity (J. Med. Chem., 2014, 57(6), pp 2704–2713). This goes beyond conventional ‘matched molecular pair analysis’ by using data from longer series of matched compounds (and not just pairs) to make more relevant predictions for a particular chemical series of interest. In addition, all predictions are backed by experimental results which can be viewed and assessed by the medicinal chemist when considering the predictions.

Matsy is applied in StarDrop’s Nova™ module, which automatically generates new compound structures to stimulate the search for optimisation strategies related to initial hit or lead compounds. StarDrop’s unique capabilities for multi-parameter optimisation and predictive modelling will enable efficient prioritisation of the resulting ideas to identify high quality compounds with the best chance of success.

I’ll be focusing on the science behind the algorithm itself rather than the specifics of its integration into StarDrop, so even if you’re not currently an Optibrium customer you may find this of interest.
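For readers new to the idea, here is a toy Python sketch of the general principle behind matched series: look for other series in which the same R groups show the same activity order, then count how often some further R group did better. It is not the Matsy algorithm or its API. The scaffold names, R groups, activity values and the `suggest_next` function are all invented for illustration.

```python
from collections import Counter

# Hypothetical activity data (e.g. pIC50): scaffold -> {R group: activity}.
# Every name and value below is invented for illustration.
DATABASE = {
    "scaffold_A": {"H": 5.0, "F": 5.5, "Cl": 6.1, "Br": 6.8},
    "scaffold_B": {"H": 4.2, "F": 4.9, "Cl": 5.3, "OMe": 5.0},
    "scaffold_C": {"H": 6.0, "F": 5.1, "Cl": 4.8, "Br": 4.0},
    "scaffold_D": {"H": 5.1, "F": 5.6, "Cl": 6.0, "Br": 6.4, "OMe": 5.9},
}


def suggest_next(query_order, database):
    """Return {R group: (times improved, times observed)}.

    query_order lists the query's R groups from least to most active.
    Only series containing all of them in that same order are counted.
    """
    improved = Counter()  # R group beat the most active query R group
    observed = Counter()  # R group appeared in a matching series
    for activities in database.values():
        if not all(r in activities for r in query_order):
            continue  # series lacks one of the query R groups
        values = [activities[r] for r in query_order]
        if any(a >= b for a, b in zip(values, values[1:])):
            continue  # activity order differs from the query
        best = values[-1]
        for r_group, activity in activities.items():
            if r_group in query_order:
                continue
            observed[r_group] += 1
            if activity > best:
                improved[r_group] += 1
    return {r: (improved[r], observed[r]) for r in observed}


if __name__ == "__main__":
    # Query series in which activity increases H < F < Cl.
    print(suggest_next(["H", "F", "Cl"], DATABASE))
```

In this invented dataset, three series contain H, F and Cl in the query's order. Br improves on Cl in both of the series where it was tried, while OMe improves in neither, so Br would be the more promising suggestion. Because every count traces back to a measured series, the supporting data can always be inspected.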

To register, click on the image above or just here.