Following on from the work described in the last post, I have put together a dataset of text-mined peptides which I’ve uploaded to PubChem. This involved extending our biopolymer grammar (which I use to textmine with LeadMine) and improving the ability of Sugar&Splice to interpret the text-mined entities.
The dataset consists of 5350 unique peptides extracted from PubMed abstracts (up to 2017), comprising 8699 peptide line notation/PubMed ID (PMID) pairs. Although the grammar identifies many more peptides, I’ve excluded them as they are either very short, appear to be fragments of a larger peptide, or are not currently interpreted by Sugar&Splice.
The most popular peptide is “Cbz-Val-Ala-Asp-CH2F”, which I found in 296 abstracts in a variety of forms, e.g. “N-benzyloxycarbonyl-Val-Ala-Asp-fluoromethylketone” (PMID 8628997), “Z-Val-Ala-Asp-fluoromethyl-ketone” (PMID 10554885). This is an irreversible pan-caspase inhibitor, and appears to be used in studies of apoptosis. Not surprisingly, this was already in PubChem (CID 5497171). Its Asp(OMe) form also ranks highly as the fourth most popular peptide line notation (88 abstracts).
At the other extreme is the peptide that appears as “Pro-Gly-Phe-Ser-Pro-Phe-Arg” in PMID 5120. This was not found in any other abstract, and is a new entry in PubChem despite the 42 years since its publication in Biochemistry. It can now be found in PubChem at CID 134824750. If you go there and scroll down to “Depositor Provided PubMed Citations”, you will find the link to the PubMed abstract. This describes the action of a peptidase on various peptides. On the right-hand side under “Related Information”, if you click on “PubChem Compound” you will see listed two peptides mentioned in the text that were extracted by LeadMine, as well as an entry for bradykinin.
To see the complete NextMove Software biologics dataset, click on “8699 Live Substances” on the source page. For more information on using LeadMine to mine PubMed abstracts, see this post. Obviously, there is nothing here specific to PubMed Abstracts – other interesting datasets would be the patent literature or internal company documents.