Tabular data in patents is a useful source of experimental data and chemical structures. USPTO patents are available back to 1976 in formats where tables are explicitly annotated. For more recent patents these are XML tables, similar in structure to what would be expected in HTML. Unfortunately the format used from 1976 to 2000 is not so straightforward to interpret. Naive interpretations produce output that does not at all resemble the actual table, often with fragments of chemical names scattered across cells.
The format for these tables is briefly documented by the USPTO but the description raises as many questions as answers:
- Columns are delimited by one or more spaces… but a cell may contain spaces!
- An overly long cell may be split over multiple lines due to the format being limited to 80 characters per line
- Where in the printed patent a cell spanned multiple rows it spans multiple lines in the format.
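The first point is easy to see with a small Python sketch. The row below is invented for illustration (it is not taken from a real patent), and splitting on two or more spaces is just one obvious heuristic, not anything the USPTO documentation prescribes:

```python
import re

# Hypothetical fixed-width table row, invented for illustration.
row = "22   4-tert-butyl phenyl   1.5"

# Naive: every run of whitespace is treated as a column boundary.
naive = row.split()

# Slightly better: only two or more spaces count as a boundary.
wider = re.split(r" {2,}", row.strip())

print(naive)  # ['22', '4-tert-butyl', 'phenyl', '1.5']
print(wider)  # ['22', '4-tert-butyl phenyl', '1.5']
```

The second heuristic keeps the chemical name together here, but it still fails whenever two columns really are separated by a single space, which the format allows.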
As the format is based on how the tables were printed, perfect reproduction of the semantics of these tables appears impossible, but a good approximation can be achieved.
Much better 🙂
(the colouring of Example 22 is due to “tertbutyl” being recognised as a misspelling of “tert-butyl”)
The method broadly works by:
- Identifying the header, body and footer
- Producing a putative table layout
- Splitting cells where a single space is determined to be a split point between two columns
- Merging cells that are determined to be a continuation of a previous cell
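To give a rough feel for two of these steps, here is a minimal Python sketch of producing a putative layout and merging continuation lines. It is not the actual implementation: the function names, the sample rows and both heuristics (a column boundary is any character position that is blank on every line; a line with an empty first cell continues the previous row) are assumptions made for illustration. It does not attempt header/footer detection or splitting at single spaces.

```python
def column_spans(lines):
    """Return (start, end) spans of positions that are non-blank on at least one line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    used = [any(line[i] != " " for line in padded) for i in range(width)]
    spans, start = [], None
    for i, flag in enumerate(used + [False]):  # sentinel closes the last span
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start, i))
            start = None
    return spans


def split_rows(lines, spans):
    """Slice every line at the putative column spans."""
    return [[line[a:b].strip() for a, b in spans] for line in lines]


def merge_continuations(rows):
    """Fold a row with an empty first cell into the row above it (assumed heuristic)."""
    merged = []
    for row in rows:
        if merged and row[0] == "":
            merged[-1] = [
                (old + " " + new).strip() if new else old
                for old, new in zip(merged[-1], row)
            ]
        else:
            merged.append(list(row))
    return merged


# Hypothetical table body, invented for illustration; built with fixed-width
# formatting so the column positions are explicit.
lines = [
    f"{'22':<5}{'4-tert-butyl phenyl':<22}{'1.5'}",
    f"{'':<5}{'methyl ether'}",
    f"{'23':<5}{'benzyl chloride':<22}{'0.7'}",
]

spans = column_spans(lines)
table = merge_continuations(split_rows(lines, spans))
for row in table:
    print(row)
```

On this toy input the single space inside "4-tert-butyl phenyl" is not treated as a boundary because another line has text at that position, and the second line is folded into Example 22. Real tables are much messier, which is why the extra splitting and merging decisions listed above are needed.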