The term patent family is generally used to describe a set of patents that cover the same invention but which are filed with different patent authorities. Here instead we look at finding groups of patents within a single authority (the USPTO) where the patents are linked by chemical structures.
It turns out that it is not unusual for essentially the same chemical information to appear in multiple patent applications within the USPTO, often with the same or similar title. I’m not sure of the reason for this – perhaps corrections, rewrites, or separate applications for different targets. In any case, it is useful to identify such cases for the purposes of linking or collation, or indeed to discard if looking for truly novel chemistry.
Here’s an approach that appears to work reasonably well: we regard as “chemically-related” two patents that share at least N key (but rare) molecules in common. All that remains is to define “N”, “key”, and “rare”:
- A key compound is one associated with a compound number (which may be in the text or a ChemDraw file) or associated with an experimental property (taken from a table, and possibly described in terms of R groups that need to be attached to a scaffold).
- A rare molecule is one that appears in 30 or fewer patents.
- N was defined as 8.
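The definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the analysis: the function names (`shared_rare_counts`, `families`), the input layout (a dict mapping each patent to the set of canonical SMILES of its key compounds), and the toy data are all assumptions made for the example. Grouping related patents into families by taking connected components is likewise one plausible reading of "family", not something the post spells out.

```python
from collections import defaultdict
from itertools import combinations

MAX_PATENTS = 30  # "rare": the molecule appears in 30 or fewer patents
MIN_SHARED = 8    # N: minimum number of shared key (but rare) molecules


def shared_rare_counts(key_molecules, max_patents=MAX_PATENTS):
    """Count rare key molecules shared by each pair of patents.

    key_molecules maps a patent id to the set of canonical SMILES of its
    key compounds (hypothetical input format).
    """
    # Invert the mapping: molecule -> patents that contain it.
    patents_by_molecule = defaultdict(set)
    for patent, molecules in key_molecules.items():
        for smiles in molecules:
            patents_by_molecule[smiles].add(patent)

    counts = defaultdict(int)
    for patents in patents_by_molecule.values():
        # Skip molecules seen in one patent only, or in too many to be rare.
        if 2 <= len(patents) <= max_patents:
            for a, b in combinations(sorted(patents), 2):
                counts[(a, b)] += 1
    return dict(counts)


def families(counts, min_shared=MIN_SHARED):
    """Group patents into connected components of the 'related' graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Link two patents only when they share enough rare key molecules.
    for (a, b), n in counts.items():
        if n >= min_shared:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for patent in list(parent):
        groups[find(patent)].add(patent)
    return sorted(sorted(group) for group in groups.values())


# Toy data with a lowered threshold so the effect is visible.
toy = {
    "P1": {"CCO", "CCN", "c1ccccc1"},
    "P2": {"CCO", "CCN", "CCC"},
    "P3": {"CCC", "CCCl"},
}
counts = shared_rare_counts(toy)
print(counts)                          # {('P1', 'P2'): 2, ('P2', 'P3'): 1}
print(families(counts, min_shared=2))  # [['P1', 'P2']]
```

Inverting the mapping first means that only molecules which actually co-occur generate pairs, which avoids comparing every patent against every other patent.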
Naturally, these cutoffs could benefit from some tweaking with a test set (e.g. patents with the same title and assignee), but for the purposes of this blog post they seem to work well. Here is a typical example of a highly-connected chemical patent family, where the labels are the number of key (but rare) molecules in common:
US Patent Application | Title |
---|---|
US20030032623A1 | Tnf-alpha production inhibitors |
US20050014800A1 | Angiogenesis inhibitor |
US20060229342A1 | TNF-a production inhibitors |
US20060241155A1 | TNF-alpha production inhibitors |
US20080161270A1 | Angiogenesis inhibitors |
US20080182881A1 | TNF-alpha production inhibitors |
US20100016380A1 | TNF-alpha production inhibitors |
These patents all appear to be from Santen Pharmaceutical Co., though the company name is not listed as assignee on some of them. Equally interesting are related families whose members are less highly connected. Here’s an example from GSK, along with representative examples of the patent titles:
US Patent Application | Title |
---|---|
US20130012491A1 | PYRIMIDINE DERIVATIVES FOR USE AS SPHINGOSINE 1-PHOSPHATE 1 (S1P1) RECEPTOR AGONISTS |
US20120094979A1 | THIAZOLE OR THIADIAZOLE DERIVATIVES FOR USE AS SPHINGOSINE 1-PHOSPHATE 1 (S1P1) RECEPTOR AGONISTS |
US20100273771A1 | OXADIAZOLE DERIVATIVES ACTIVE ON SPHINGOSINE-1-PHOSPHATE (SIP) |
US20100174065A1 | COMPOUNDS |
US20120101083A1 | S1P1 AGONISTS COMPRISING A BICYCLIC N-CONTAINING RING |
As ever, if this sparks some ideas and you’re interested in collaborating, drop us a line.
Very cool! It would be interesting to do something along the lines of comparing all patents based on the chemical similarity of the sets of novel compounds that appear in them. This has been done in the target space, as pioneered by methods such as SEA.
Andrew Dalke has a nice blog post about implementing it in ChemFP (http://www.dalkescientific.com/writings/diary/archive/2017/03/27/chembl_target_sets_association_network.html)
Exactly. The analysis above is an exact identity search (based on canonical SMILES as extracted), but a similarity search would clearly be a next step. A good question to ask is whether it would be based on the whole molecule or a Murcko scaffold. Of course, we also textmine the targets described, and so an even more powerful search would be to combine both sets of information.
In what might be a related issue, we notified ChEMBL and BindingDB about a case of triplicated assay repeats from https://pubchem.ncbi.nlm.nih.gov/compound/53361128
These came from nominally different patents (but all from Vitae). Such instances seem rare from patents (but AWK are common for papers in ChEMBL). Vitae had a horrendous patent family related to these filings – can you take a look at the chemical clustering?
Aha, an unseen test case! Indeed, without any changes to the code, we find the related family. The family contains 14 members, and includes the 5 applications listed on PubChem. See:
https://nextmovesoftware.com/blog/wp-content/uploads/2017/07/ReplyToSouthan_PatentFamilies.png
Note that my analysis only included patent applications whereas PubChem also lists grants.
Hello, Interesting project – one reason you may see the same compounds (other than different uses, etc.) is that all the listed records retrieved in the example above are published patent *applications*, rather than granted patents. This is also why the assignee may not be listed, as the USPTO only recently added assignees/probable assignees to published applications.
Certainly. And thanks for the point about assignees.
For the benefit of a future reader wondering why we are looking at applications rather than grants, applications are of particular interest due to their timeliness, and so anyone interested in keeping up with the Joneses (“current awareness”) needs to keep an eye on applications.
You might want to see about combining this approach with the citation network approach taken by citationgecko.com