Big Data, Big Bad Data

Chemical data is increasing exponentially and soon we will be engulfed in an unmanageable torrent of molecules that threatens to overwhelm our computers and our very sanity. Or maybe not.

At yesterday’s RSC CICAG meeting in London, “From Big Data to Chemical Information”, I presented a talk entitled:
100 million compounds, 100K protein structures, 2 million reactions, 4 million journal articles, 20 million patents and 15 billion substructures: Is 20TB really Big Data?

According to Wikipedia, the term “Big Data” refers to data sets so large or complex that traditional data processing applications are inadequate. Turning this on its head, it implies that without sufficiently efficient algorithms and tools, essentially any dataset could be Big Data. This is not entirely facetious: many cheminformatics algorithms work fine for tens of thousands of molecular structures but cannot handle a PubChem-sized dataset in a reasonable length of time.
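To make the scale problem concrete, here is a minimal sketch (not from the talk) of the naive approach: a linear scan over a SMILES file with RDKit, parsing and matching every molecule in turn. The file name and query are hypothetical. This works at the tens-of-thousands scale, but at PubChem scale the per-molecule parse-and-match cost adds up to hours per query.

```python
# Minimal sketch of naive substructure search: parse and match every record.
from rdkit import Chem

query = Chem.MolFromSmarts("c1ccc2ncccc2c1")  # example query: a quinoline core

def naive_substructure_search(smiles_file, query):
    """Yield SMILES that contain the query, scanning the whole file linearly."""
    with open(smiles_file) as fh:
        for line in fh:
            fields = line.split()
            if not fields:
                continue
            mol = Chem.MolFromSmiles(fields[0])
            if mol is not None and mol.HasSubstructMatch(query):
                yield fields[0]

# hits = list(naive_substructure_search("pubchem.smi", query))  # hypothetical file
```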

I discussed the approach used by NextMove Software to develop highly performant tools for a variety of cheminformatics problems on large datasets. These problems include calculating the maximum common subgraph, substructure searching, identifying matched series in activity datasets, canonicalising protein structures, and extracting and naming reactions from patents and the literature. In particular, I described for the first time improvements in substructure searching (200 times faster than the state-of-the-art) and canonicalisation (up to 200 times faster than the state-of-the-art) developed in the last few months by Roger and John. More details to follow…
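The improvements by Roger and John are not described in the post itself. As a generic illustration of why substructure search can be made to scale, the sketch below shows the standard fingerprint-prescreen idea using RDKit pattern fingerprints (an assumption for illustration, not the NextMove implementation): a target can only match if every bit set in the query’s fingerprint is also set in the target’s, so most candidates are rejected without an atom-by-atom match.

```python
# Sketch of a conventional fingerprint prescreen for substructure search.
from rdkit import Chem

def pattern_fp(mol, n_bits=2048):
    """Substructure-screening fingerprint for a molecule or query."""
    return Chem.PatternFingerprint(mol, fpSize=n_bits)

def may_match(query_fp, target_fp):
    """Cheap containment test: every on-bit of the query must be set in the target."""
    return all(target_fp.GetBit(b) for b in query_fp.GetOnBits())

def prescreened_search(mols, query):
    qfp = pattern_fp(query)
    fps = [(m, pattern_fp(m)) for m in mols]   # in practice built once and reused
    for mol, fp in fps:
        if may_match(qfp, fp) and mol.HasSubstructMatch(query):
            yield mol
```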


5 thoughts on “Big Data, Big Bad Data”

  1. Live-searching is definitely useful. If fast enough, the search results help to guide the user and, as with the example you describe, allow the user to quickly experiment with different possibilities and see what happens.

  2. In your benchmark of substructure search implementations you mention 3323 queries from the BindingDB substructure set. That substructure set appears to list 3488 queries on Andrew Dalke’s website.

    Is this an older version of the query set, or a subset that is compatible with all the software used?

  3. Hi Mike,

    I’m giving a talk on the benchmark at the next Cambridge Cheminformatics Network Meeting – there are actually a lot more tools on our plots. The benchmark set is a subset of those 3488. We will provide links to the queries, targets (eMol ID), and our results after the CCNM talk.

    Thanks,
    John
