Chemical data is increasing exponentially, and soon we will be engulfed in an unmanageable torrent of molecules that threatens to overwhelm our computers and our very sanity. Or maybe not.
At yesterday’s RSC CICAG meeting “From Big Data to Chemical Information” in London, I presented a talk entitled:
100 million compounds, 100K protein structures, 2 million reactions, 4 million journal articles, 20 million patents and 15 billion substructures: Is 20TB really Big Data?
According to Wikipedia, the term “Big Data” is defined as data sets so large or complex that traditional data processing applications are inadequate. Turned on its head, this implies that without sufficiently efficient algorithms and tools, essentially any dataset could be Big Data. This statement is not entirely facetious: many cheminformatics algorithms work fine for tens of thousands of molecular structures but cannot handle a PubChem-sized dataset in a reasonable length of time.
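To make this concrete, consider a naive linear scan for a substructure query. The sketch below is purely illustrative (it assumes RDKit and a hypothetical SMILES file with one structure per line): at roughly a millisecond per molecule it finishes in seconds for tens of thousands of structures, but takes on the order of a day for 100 million.

```python
from rdkit import Chem

def naive_substructure_search(smiles_path, smarts):
    """Linear scan: parse every molecule and test it against the SMARTS query."""
    query = Chem.MolFromSmarts(smarts)
    hits = []
    with open(smiles_path) as f:
        for line in f:
            smiles = line.strip()
            mol = Chem.MolFromSmiles(smiles)
            if mol is not None and mol.HasSubstructMatch(query):
                hits.append(smiles)
    return hits

# Hypothetical usage: scan a PubChem-sized SMILES dump for an amide-substituted benzene.
# hits = naive_substructure_search("pubchem.smi", "c1ccccc1C(=O)N")
```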
I discussed the approach used by NextMove Software to develop high-performance tools for a variety of cheminformatics problems on large datasets. These problems include calculating the maximum common subgraph, substructure searching, identifying matched series in activity datasets, canonicalising protein structures, and extracting and naming reactions from patents and the literature. In particular, I described for the first time improvements in substructure searching (200 times faster than the state-of-the-art) and canonicalisation (up to 200 times faster than the state-of-the-art) developed in the last few months by Roger and John. More details to follow…
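As background on what a substructure search involves (and emphatically not the specific improvements mentioned above, whose details are still to follow), a textbook trick is to screen with a cheap bit-vector fingerprint before paying for the full atom-by-atom match. A minimal sketch, again assuming RDKit:

```python
from rdkit import Chem

def screened_substructure_search(mols, query):
    """Textbook two-stage search: fingerprint screen first, graph match only on survivors."""
    qfp = Chem.PatternFingerprint(query)
    hits = []
    for mol in mols:
        mfp = Chem.PatternFingerprint(mol)
        # A molecule can only contain the query if it sets every bit the query sets.
        if (qfp & mfp).GetNumOnBits() == qfp.GetNumOnBits() and mol.HasSubstructMatch(query):
            hits.append(mol)
    return hits
```

In practice the molecule fingerprints would of course be precomputed and indexed rather than generated per query; the point here is only the screen-then-match structure that fast implementations build on.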