<h1>Textmining PubMed Abstracts with LeadMine</h1>
<p><em>9 April 2018</em></p>
<p>The US National Library of Medicine <a href="https://www.nlm.nih.gov/databases/download/pubmed_medline.html">provides</a> an annual baseline of PubMed abstracts freely available for download, along with daily updates throughout the year. If you have wget available, you can download the baseline with a command such as the following:</p>
<pre>wget --mirror --accept "*.xml.gz" ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
</pre>
<p>Once downloaded, you will end up with close to 100 gzipped XML files, each containing a large number of abstracts along with bibliographic data.</p>
<p>While it is possible to use the LeadMine library and an XML reader to textmine these any way one wants, it is convenient to use LeadMine&#8217;s command-line application where possible: it has built-in reporting of results, it makes use of multiple processors, and you don&#8217;t need to write any Java to use it. However, without transforming the data first, the command-line application will not work very well because (a) it processes each entire XML file in one go, resulting in large memory usage and a lack of correspondence between a particular PMID and its results, (b) using more than one processor simply exacerbates the memory problem, and (c) while LeadMine handles XML without problems, it will inevitably end up textmining information that is not in the abstract (e.g.
part of a journal name).</p>
<p>Fortunately, it is not difficult to transform the data into a more easily digestible form. For each gzipped XML file, the Python script below generates a zip archive containing a large number of small XML files, each corresponding to a single PubMed abstract. Bibliographic information is stored in XML attributes, so the only text that gets textmined is that of the title and abstract (indicated by T and A in the results below). Once the script has been run, LeadMine can be used to textmine as shown below. Here I focus on diseases and ClinicalTrials.gov (NCT) numbers; note that while PubMed provides manually curated entries for both of these in the original XML, they are missing from the most recent abstracts (presumably due to a time lag).</p>
<pre style="white-space: pre;">java -jar leadmine-3.12.jar -c diseases_trials.cfg -tsv -t 12 -R D:\PubMedAbstracts\zipfiles &gt; diseases_trials.txt
</pre>
<p>Fifteen minutes later (on my machine), I get an output file that includes the following, where the PMID (and version) appears in the first column:</p>
<pre style="white-space: pre;">DocName	BegIndex	EndIndex	SectionType	EntityType	PossiblyCorrectedText	EntityText	CorrectionDistance	ResolvedForm
29170069.1	1272	1279	A	Disease	phobias	phobias	0	D010698
29170069.1	1290	1307	A	Disease	anxiety disorders	anxiety disorders	0	D001008
29170072.1	325	358	A	Disease	Exocrine pancreatic insufficiency	Exocrine pancreatic insufficiency	0	D010188
29170073.1	1856	1860	A	Disease	pain	pain	0	D010146
29170073.1	2087	2098	A	Trial	NCT02683707	NCT02683707	0	
29170074.1	334	349	A	Disease	cystic fibrosis	cystic fibrosis	0	D003550
29170075.1	127	146	T	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	419	438	A	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	476	495	A	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	1579	1598	A	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	2191	2202	A	Trial	NCT02860741	NCT02860741	0	
29170076.1	198	221	T	Disease	end-stage renal disease	end-stage renal disease	0	D007676
29170076.1	240	262	A	Disease	Cardiovascular disease	Cardiovascular disease	0	D002318
29170076.1	320	342	A	Disease	chronic kidney disease	chronic kidney disease	0	D051436
29170076.1	485	507	A	Disease	vascular calcification	vascular calcification	0	D061205
29170076.1	583	588	A	Disease	tumor	tumor	0	D009369
29170076.1	589	597	A	Disease	necrosis	necrosis	0	D009336
29170076.1	765	788	A	Disease	end-stage renal disease	end-stage renal disease	0	D007676
</pre>
<p><b>Python script</b></p>
<pre>import os
import sys
import glob
import gzip
import zipfile
import multiprocessing as mp
import xml.etree.ElementTree as ET

class Details:
    def __init__(self, title, abstract, year, volume, journal, page):
        self.title = title
        self.abstract = abstract
        self.year = year
        self.volume = volume
        self.journal = journal
        self.page = page
    def __repr__(self):
        return "%s _%s_ *%s* _%s_ %s\n\nAbstract: %s" % (self.title, self.journal, self.year, self.volume, self.page, self.abstract)

def getelements(filename_or_file, tag):
    """Yield *tag* elements from *filename_or_file* xml incrementally."""
    context = iter(ET.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context) # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear() # free memory

def getText(node):
    if node is None:
        return ""
    t = node.text
    return "" if t is None else t

def extract(medline):
    article = medline.find("Article")
    title = "".join(article.find("ArticleTitle").itertext())
    abstractNode = article.find("Abstract")
    abstract = ""
    if abstractNode is not None:
        abstract = []
        for abstractText in abstractNode.findall("AbstractText"):
            abstract.append("".join(abstractText.itertext()))
        abstract = " ".join(abstract)
    page = getText(article.find("Pagination/MedlinePgn"))
    journal = article.find("Journal")
    journalissue = journal.find("JournalIssue")
    volume = getText(journalissue.find("Volume"))
    year = getText(journalissue.find("PubDate/Year"))
    journaltitle = getText(journal.find("Title"))
    return Details(title, abstract, year, volume, journaltitle, page)

class PubMed:
    def __init__(self, fname):
        self.iter = self.getArticles(gzip.open(fname))

    def getArticles(self, mfile):
        for elem in getelements(mfile, "PubmedArticle"):
            medline = elem.find("MedlineCitation")
            pmidnode = medline.find("PMID")
            pmid = pmidnode.text
            version = pmidnode.get('Version')
            yield pmid, version, medline

    def getAll(self):
        for pmid, version, medline in self.iter:
            yield pmid, version, extract(medline)

    def getArticleDetails(self, mpmid):
        for pmid, _, medline in self.iter:
            if mpmid and mpmid != pmid: continue
            return extract(medline)

def handleonefile(inpname):
    pm = PubMed(inpname)
    basename = os.path.basename(inpname).split(".")[0]
    outname = os.path.join("reformatted", basename+".zip")
    if os.path.isfile(outname):
        print("SKIPPING: " + outname)
        return
    print("REFORMATTING: " + outname)
    idxfile = os.path.join("reformatted", basename+".idx")
    with zipfile.ZipFile(outname, mode="w", compression=zipfile.ZIP_DEFLATED) as out:
        with open(idxfile, "w") as outidx:
            for pmid, version, article in pm.getAll():
                article_elem = ET.Element("article", {
                    "pmid": pmid,
                    "version": version,
                    "journal": article.journal,
                    "year": article.year,
                    "volume": article.volume,
                    "page": article.page
                    })
                title = ET.SubElement(article_elem, "title")
                title.text = article.title
                abstract = ET.SubElement(article_elem, "abstract")
                abstract.text = article.abstract

                xmlfile = f"{pmid}.{version}.xml"

                xmldeclaration = b'<?xml version="1.0" encoding="utf-8"?>\n'
                try:
                    xmltext = xmldeclaration + ET.tostring(article_elem)
                except (TypeError, ValueError):
                    print("FAILED to serialize: " + repr(article)) # skip this record
                    continue
                out.writestr(xmlfile, xmltext)
                outidx.write(xmlfile[:-4] + "\n")

if __name__ == "__main__":
    POOLSIZE = 36 # number of CPUs
    pool = mp.Pool(POOLSIZE)
    if not os.path.isdir("reformatted"):
        os.mkdir("reformatted")
    fnames = glob.glob(os.path.join("abstracts", "*.xml.gz"))
    fnames.extend(glob.glob(os.path.join("dailyupdates", "*.xml.gz")))
    # Note that the filenames continue in numbering from one directory
    # to the other (but do not overlap)

    for x in pool.imap_unordered(handleonefile, fnames, 1):
        pass
</pre>
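<p>The TSV produced above is straightforward to post-process. As a minimal sketch (the column names come from the header row shown earlier; <code>count_diseases</code> is a hypothetical helper, not part of LeadMine), the following collects the distinct resolved MeSH IDs of the Disease hits per abstract:</p>

```python
import csv
from collections import defaultdict

def count_diseases(tsv_path):
    """Map each DocName (pmid.version) to the set of resolved MeSH IDs
    of its Disease hits in a LeadMine -tsv output file."""
    hits = defaultdict(set)
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # The output is tab-separated with a header row, so DictReader
        # lets us refer to columns by name.
        for row in csv.DictReader(f, delimiter="\t"):
            if row["EntityType"] == "Disease" and row["ResolvedForm"]:
                hits[row["DocName"]].add(row["ResolvedForm"])
    return dict(hits)
```

<p>For the sample rows above, 29170076.1 would map to six distinct MeSH IDs (the repeated D007676 is counted once), while Trial rows, which have an empty ResolvedForm, are skipped.</p>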