[PYTHON] Let's create RDF using PubMed paper information -Acquiring paper information-

Many of you have probably wished for a quick overview of a paper (ideally linked to what you already know) when you are too busy to read it properly or when you are starting out in a new field.

This time, partly as a learning exercise, I will try to build something for that situation using RDF.

About RDF

RDF stands for Resource Description Framework. Data is expressed as a directed graph built from triples of three values: S (Subject), P (Predicate), and O (Object). Because the data is linked together, you can retrieve the information you want by querying it.
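As a rough illustration of the subject-predicate-object idea, here is a minimal sketch that stores triples as plain Python tuples and matches them against a pattern. It is not a real RDF library; the identifier "paper:28420040" and the predicate names "hasAuthor" and "publishedIn" are made up for this example.

```python
# Each statement is a (subject, predicate, object) triple.
# The identifiers and predicate names are hypothetical, for illustration only.
triples = [
    ("paper:28420040", "hasAuthor", "Duan W"),
    ("paper:28420040", "hasAuthor", "Liu K"),
    ("paper:28420040", "publishedIn", "J Surg Oncol"),
]

def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# "Who are the authors of this paper?" expressed as a pattern
authors = [o for _, _, o in match(triples, s="paper:28420040", p="hasAuthor")]
print(authors)
```

A real RDF store answers the same kind of pattern query, usually through a query language, over much larger graphs.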

Reference articles: Miscellaneous explanation about RDF - Qiita; Intuition RDF!! Part 2: Create an easy-to-use RDF and search - Qiita (http://qiita.com/maoringo/items/0d48a3d967a35581cc24)

Preparing the paper data

For paper information, PubMed (provided by NCBI) makes its data available through an API, so I will use that.

Four of the APIs provided are covered here.

  1. ESearch: Returns the article IDs (PubMed IDs) for a search word
  2. ESummary: Returns the title, author names, and other summary items for an article ID
  3. EFetch: Returns all information for an article ID
  4. ESpell: Checks the spelling of search words

First, you get a list of paper IDs with ESearch, and then you fetch the details for each paper ID. ESpell is not needed this time.

Reference article: Summary of PubMed API

Let's get the paper ID

Use ESearch to get the paper IDs. The base URL is

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=

If you put a search keyword after "term=", the matching IDs are returned.

For example, try the search keyword "cancer".

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=cancer

If you enter the above URL in your browser, the result comes back as XML containing the list of paper IDs.

However, doing this by hand every time is tedious, and I want to strip out everything except the paper IDs. So I will write it in Python.

Environment

  • Windows 10

get_id.py


# coding: utf-8
import urllib.request

keyword = "cancer"
baseURL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="

def get_id(url):
	# Send the ESearch request and return the HTTP response object
	result = urllib.request.urlopen(url)
	return result

def main():
	url = baseURL + keyword
	result = get_id(url)
	# read() returns bytes, so decode them before printing
	print(result.read().decode("utf-8"))

if __name__ == "__main__":
	main()

Running this gives:

% python get_id.py
<?xml version="1.0" encoding="UTF-8" ?><!DOCTYPE eSearchResult 
PUBLIC "-//NLM//DTD esearch20060628//EN""https://eutils.ncbi.n
lm.nih.gov/eutils/dtd/20060628/esearch.dtd"><eSearchResult><Cou
nt>3465235</Count><RetMax>20</RetMax><RetStart>0</RetStart><IdL
ist><Id>28420040</Id><Id>28420039</Id><Id>28420037</Id>
....

It is hard to read without line breaks, but this is the same information that the browser showed earlier.
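The script above appends the keyword to the URL as it is, which works for a single word such as "cancer". For keywords that contain spaces or other special characters, one option is to let urllib.parse.urlencode build the query string. The following is only a sketch: the function name build_esearch_url is made up for this example, and retmax is the ESearch parameter that corresponds to the RetMax value in the response (20 in the output above).

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(keyword, retmax=20):
    """Build an ESearch URL; urlencode escapes spaces and special characters."""
    # build_esearch_url is a hypothetical helper, not part of the article's script
    params = {"db": "pubmed", "term": keyword, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)

print(build_esearch_url("breast cancer"))
```

The returned string can be passed to get_id() in place of baseURL + keyword.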

Extract only the paper ID

XML basically has a structure like

<element>Contents</element>
<element attribute="attribute value">Contents</element>

For example, a paper ID looks like this in the response obtained above:

<Id>Paper ID</Id>

The next step is to strip the unneeded parts, such as the tags, and extract only the paper IDs.

Python's standard library includes ElementTree for handling XML, so I will use it.

get_id.py


from xml.etree.ElementTree import *

After importing, rewrite main as follows.

get_id.py


def main():
	url = baseURL + keyword
	result = get_id(url)
	element = fromstring(result.read())
	print(element.findtext(".//Id"))

First, fromstring() creates an Element object from the XML text. element.findtext() then returns the text of the first element that matches the given path. I want "Id" here, and writing it as ".//Id" matches Id elements at any depth below the root.

When you do this

% python get_id.py
28420040

This extracts only the first paper ID. To extract every match rather than just the first, use element.findall() as follows.

get_id.py


def main():
	url = baseURL + keyword
	result = get_id(url)
	element = fromstring(result.read())
	for e in element.findall(".//Id"):
		print(e.text)

When you run

% python get_id.py
28420040
28420039
28420037
28420035
...

All of the paper IDs were extracted successfully.

For later processing, save the acquired ID list to a file named "idlist_<search word>.txt".
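The difference between findtext() and findall() can also be checked without calling the API. The sketch below parses a shortened, hand-written sample in the same shape as the ESearch response shown earlier; the variable names are chosen for this example only.

```python
from xml.etree.ElementTree import fromstring

# Shortened, hand-written sample in the shape of the ESearch response
sample = (
    "<eSearchResult><Count>3465235</Count><RetMax>20</RetMax>"
    "<IdList><Id>28420040</Id><Id>28420039</Id><Id>28420037</Id></IdList>"
    "</eSearchResult>"
)

root = fromstring(sample)

# findtext() returns the text of the first match only
first_id = root.findtext(".//Id")

# findall() returns every matching element
all_ids = [e.text for e in root.findall(".//Id")]

# Count is a direct child of the root, so no ".//" prefix is needed
total = int(root.findtext("Count"))

print(first_id, all_ids, total)
```

Note that Count is the total number of hits, while only RetMax IDs are actually included in one response.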

get_id.py


def main():
	url = baseURL + keyword
	result = get_id(url)
	element = fromstring(result.read())
	filename = "idlist_"+keyword+".txt"
	# Write one paper ID per line; the with block closes the file automatically
	with open(filename, "w") as f:
		for e in element.findall(".//Id"):
			f.write(e.text + "\n")

Reference article: How to process XML with ElementTree in Python - hikm's blog

Get the paper summary from the paper ID

Next, let's get the summary using the paper IDs obtained above. The base URL of ESummary is

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=

As with the paper IDs, put the ID of the paper you want after "id=". For example, use the first paper ID obtained earlier, "28420040".

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=28420040

Entering this URL in a browser returns the publication date, author names, article title, and so on.

Written in Python, the steps so far look like this.

get_summary.py


# coding: utf-8
import urllib.request
from xml.etree.ElementTree import *

keyword = "cancer"
idfile = "idlist_"+keyword+".txt"
baseURL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id="

def get_xml(url):
	# Send the ESummary request and return the HTTP response object
	result = urllib.request.urlopen(url)
	return result

def main():
	# Read the paper IDs saved by get_id.py, one per line
	with open(idfile, "r") as f:
		idlist = [line.strip() for line in f]
	url = baseURL + idlist[0]
	result = get_xml(url)
	# read() returns bytes, so decode them before printing
	print(result.read().decode("utf-8"))

if __name__ == "__main__":
    main()

The paper IDs are read from the file saved earlier. ElementTree is not used yet, but it is imported here because it will be needed shortly. When you run the script, you should see the same content as in the browser, without line breaks, just as with the paper IDs.
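The script above only requests the first ID in the list. The E-utilities id parameter also accepts several IDs separated by commas, so one possible way to cover the whole list is to request the IDs in chunks. This is a sketch under assumptions: the helper names chunked, build_summary_urls, and fetch_all are made up here, and the chunk size and the delay between requests are arbitrary values that should be checked against NCBI's usage guidelines.

```python
import time
import urllib.request

ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id="

def chunked(ids, size):
    """Yield successive lists of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def build_summary_urls(ids, size=20):
    """Build one ESummary URL per chunk, joining the IDs with commas."""
    return [ESUMMARY_URL + ",".join(chunk) for chunk in chunked(ids, size)]

def fetch_all(urls, delay=0.5):
    """Fetch each URL in turn, pausing between requests (not called below)."""
    for url in urls:
        with urllib.request.urlopen(url) as response:
            yield response.read()
        time.sleep(delay)  # assumed pause to keep the request rate low

urls = build_summary_urls(["28420040", "28420039", "28420037"], size=2)
print(urls)
```

Each response then contains one summary per requested ID, so the parsing code has to loop over them instead of looking only at the first one.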

Extract only the content you want

After that, extract only the parts you want, as with the paper IDs. Unlike the paper ID, however, Author and Title are not element names: they are values of the Name attribute on Item elements. So if you search for Item elements in the same way as for the paper ID,

for e in element.findall(".//Item"):
	print(e.text)

the text of every Item is extracted, like this.

% python get_summary.py
28420040
2017 Apr 18
2017 Apr 18
J Surg Oncol
None
Duan W
Liu K
Fu X
Shen X
...

This is usable as it is, but it is also worth knowing how to extract only specific items such as Author and Title.

An Element object behaves like a list of its child elements, so each element can be reached by index. Here are some examples.

print(element[0][3].text)
print(element[0][4][2].text)
print(element[0][6].text)

Execution result
2017 Apr 18
Fu X
Semi-end-to-end esophagojejunostomy after laparoscopy-assisted total gastrectomy better reduces stricture and leakage than the conventional end-to-side procedure: A retrospective study.

If you want to extract the list of authors, it looks like this.

for author in element[0][4]:
	print(author.text)

Execution result
Duan W
Liu K
Fu X
Shen X
...

You can also get the element name (tag) and the attributes (keys) as follows.

print(element[0][4].tag)
print(element[0][4].attrib)
print(element[0][4].keys())

Execution result
Item
{'Name': 'AuthorList', 'Type': 'List'}
['Name', 'Type']

This is how you get them. They are handy to remember.
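Positional indexes such as element[0][4] depend on the order of the items in the response. As an alternative sketch, ElementTree's path syntax can select Item elements by their Name attribute with a predicate like [@Name='Author']. The XML below is a shortened, hand-written sample modeled on the output shown above, not a full ESummary response.

```python
from xml.etree.ElementTree import fromstring

# Shortened, hand-written sample modeled on the ESummary output above
sample = """
<eSummaryResult>
  <DocSum>
    <Id>28420040</Id>
    <Item Name="PubDate" Type="Date">2017 Apr 18</Item>
    <Item Name="Source" Type="String">J Surg Oncol</Item>
    <Item Name="AuthorList" Type="List">
      <Item Name="Author" Type="String">Duan W</Item>
      <Item Name="Author" Type="String">Liu K</Item>
      <Item Name="Author" Type="String">Fu X</Item>
    </Item>
  </DocSum>
</eSummaryResult>
""".strip()

root = fromstring(sample)

# Select by attribute value instead of by position
pub_date = root.findtext(".//Item[@Name='PubDate']")
source = root.findtext(".//Item[@Name='Source']")
authors = [e.text for e in root.findall(".//Item[@Name='Author']")]

print(pub_date, source, authors)
```

The same pattern with Name='Title' would pick out the title item from a real response.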

Get the abstract of the paper

I now have the paper information, but the title and author names alone are not enough. Keywords for each paper would have been ideal, but the summary does not include them. So I will use EFetch to get the abstract of the paper.

First, the base URL of EFetch is

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=

When I entered this URL with a paper ID in the browser, the result came back in a very unwieldy format. It turns out that a parameter called retmode lets you request XML.

Reference article: The E-utilities In-Depth: Parameters, Syntax and More - Entrez Programming Utilities Help - NCBI Bookshelf

If you add retmode=xml to the URL as follows

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=28420040&retmode=xml

the result is now in XML format!

Up to this point, the code is almost the same as in the earlier scripts.

get_abstract.py


# coding: utf-8
import urllib.request
from xml.etree.ElementTree import *

keyword = "cancer"
idfile = "idlist_"+keyword+".txt"
baseURL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id="

def get_xml(url):
	# Send the EFetch request and return the HTTP response object
	result = urllib.request.urlopen(url)
	return result

def main():
	# Read the paper IDs saved by get_id.py, one per line
	with open(idfile, "r") as f:
		idlist = [line.strip() for line in f]
	url = baseURL + idlist[0] + "&retmode=xml"
	result = get_xml(url)
	# Display the result as it is; read() returns bytes, so decode first
	print(result.read().decode("utf-8"))

if __name__ == "__main__":
    main()

After that, since the abstract is in AbstractText elements, it can be extracted in the same way as the paper IDs.

	element = fromstring(result.read())
	for e in element.findall(".//AbstractText"):
		print(e.text)  # Print the abstract

When I tried it, only the abstract of the paper was extracted, as expected.

% python get_abstract.py
Laparoscopy-assisted total gastrectomy (LATG) has not 
gained popularity due to the technical difficulty of e
sophagojejunostomy (EJ) and the high incidence of EJ-r
elated complications. Herein, we compared two types of
 EJ for Roux-en-Y reconstruction to determine whether 
...
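One caveat with e.text is that it stops at the first child tag, and some PubMed abstracts are split into several AbstractText elements with a Label attribute or contain inline markup. The following sketch handles both cases using itertext(); the sample XML is hand-written and heavily shortened, and extract_abstract is a name made up for this example.

```python
from xml.etree.ElementTree import fromstring

# Hand-written, heavily shortened sample in the shape of an EFetch response
sample = (
    "<PubmedArticleSet><PubmedArticle><MedlineCitation><Article><Abstract>"
    '<AbstractText Label="BACKGROUND">First part with <i>inline</i> markup.</AbstractText>'
    '<AbstractText Label="RESULTS">Second part.</AbstractText>'
    "</Abstract></Article></MedlineCitation></PubmedArticle></PubmedArticleSet>"
)

def extract_abstract(root):
    """Join all AbstractText sections, keeping text inside nested tags."""
    parts = []
    for e in root.findall(".//AbstractText"):
        # itertext() walks the element and its children, unlike e.text
        text = "".join(e.itertext()).strip()
        label = e.get("Label")  # None when the abstract is unstructured
        parts.append(f"{label}: {text}" if label else text)
    return "\n".join(parts)

root = fromstring(sample)
abstract = extract_abstract(root)
print(abstract)
```

With plain e.text, the first section of this sample would be cut off at "First part with ".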

If the abstracts obtained this way can be processed and the knowledge in them converted into RDF, it should be possible to make something interesting. Getting the paper information took longer than expected, so I will continue in the next article.

Supplement

To process an XML file that you saved earlier, replace

element = fromstring(result.read())

with

tree = parse("efetch_result.xml")
element = tree.getroot()
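As a self-contained sketch of that file-based workflow, the example below writes a small hand-written sample to a temporary directory and then reads it back with parse(). In a real run, the saved content would be the bytes returned by result.read(); the sample text and the temporary path are assumptions for this example.

```python
import os
import tempfile
from xml.etree.ElementTree import parse

# Hand-written stand-in for a saved EFetch response
sample = (
    "<PubmedArticleSet><PubmedArticle>"
    "<AbstractText>Sample abstract.</AbstractText>"
    "</PubmedArticle></PubmedArticleSet>"
)

# Save the XML once so that later runs do not need to call the API again
path = os.path.join(tempfile.mkdtemp(), "efetch_result.xml")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# parse() reads from a file; getroot() gives the same kind of Element
# object that fromstring() returns
root = parse(path).getroot()
abstract = root.findtext(".//AbstractText")
print(root.tag, abstract)
```

Working from a saved file is convenient while experimenting with the extraction code, since the API is called only once.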
