CaffeineFix
Version 2.0 [201211]
Chemical Nomenclature Spelling Correction
The complex technical nomenclature used by chemists to describe molecular structures presents unique challenges to general-purpose natural language processing (NLP) tools. Software designed for English text often cannot cope with the non-standard use of whitespace, hyphenation, punctuation, Greek characters, italics and even superscripts found in chemical names. Likewise, the unusual letter combinations that occur in IUPAC, Chemical Abstracts, Beilstein and traditional names can trip up the trigram analysis frequently used in spell-checking software.
CaffeineFix overcomes the limitations of existing solutions by using novel algorithms designed specifically for IUPAC-like organic chemistry nomenclature. Unlike dictionary-based approaches, CaffeineFix's "push-down automaton" technology allows it to check and correct against an infinite number of words and chemical names. "Levenshtein distance" can be used to identify candidate corrections and to automatically correct unambiguous errors. User-parameterizable substitution matrices and insertion-deletion (indel) penalties can be used to customize suggestion scores for a particular end-user application. For example, when working with OCR-scanned text the cost of substituting "rn" with "m", or "l" with "1", can be reduced, and when processing paginated text the relative cost of deleting hyphens can be tweaked.
12-dichlorobenzne? Did you mean 1,2-dichlorobenzene?
didec-2-ene? Did you mean dodec-2-ene?
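The effect of tunable substitution and indel costs can be illustrated with a small weighted edit distance. This is only a sketch of the general technique, not CaffeineFix's implementation or API: the names weighted_distance, suggest and OCR_RULES, the cost values, and the use of a finite candidate list (standing in for the grammar-driven matching described above) are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical rule table: (text as scanned, intended text) -> reduced cost.
Rules = Dict[Tuple[str, str], float]
OCR_RULES: Rules = {("rn", "m"): 0.2, ("l", "1"): 0.1}


def weighted_distance(source: str, target: str,
                      rules: Optional[Rules] = None,
                      delete_costs: Optional[Dict[str, float]] = None,
                      insert_costs: Optional[Dict[str, float]] = None,
                      default_cost: float = 1.0) -> float:
    """Cost of the cheapest edit script turning `source` into `target`."""
    rules = rules or {}
    delete_costs = delete_costs or {}
    insert_costs = insert_costs or {}
    n, m = len(source), len(target)
    inf = float("inf")
    # best[i][j] = cheapest cost of turning source[:i] into target[:j]
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0

    def relax(i: int, j: int, cost: float) -> None:
        if cost < best[i][j]:
            best[i][j] = cost

    for i in range(n + 1):
        for j in range(m + 1):
            here = best[i][j]
            if i < n:  # delete one character of the source
                relax(i + 1, j, here + delete_costs.get(source[i], default_cost))
            if j < m:  # insert one character of the target
                relax(i, j + 1, here + insert_costs.get(target[j], default_cost))
            if i < n and j < m:  # match (free) or plain substitution
                step = 0.0 if source[i] == target[j] else default_cost
                relax(i + 1, j + 1, here + step)
            # Custom rules may rewrite several characters at once ("rn" -> "m").
            for (old, new), cost in rules.items():
                if old and new and source.startswith(old, i) and target.startswith(new, j):
                    relax(i + len(old), j + len(new), here + cost)
    return best[n][m]


def suggest(word: str, candidates: Iterable[str], **costs) -> str:
    """Return the candidate with the lowest weighted distance to `word`."""
    return min(candidates, key=lambda c: weighted_distance(word, c, **costs))
```

With the default unit costs this reduces to ordinary Levenshtein distance, so "12-dichlorobenzne" is two edits from "1,2-dichlorobenzene". Passing rules=OCR_RULES makes the OCR misreading "brornide" much closer to "bromide" than it is under unit costs, and passing delete_costs={"-": 0.1} makes a stray line-break hyphen cheap to remove.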
- Correct typos in documents within word processors, spreadsheets, presentation and other office software.
- Enhance text-based chemical database searches of registration systems or chemical supplier catalogues, with Google-like "did you mean ...?" functionality.
- Fix OCR errors in scanned documents.
- Improve the performance of name-to-structure conversion in text-mining and chemical entity extraction applications.